git@vger.kernel.org list mirror (unofficial, one of many)
* [RFC PATCH 0/3] grep: honor sparse checkout and add option to ignore it
@ 2020-03-24  6:04 Matheus Tavares
  2020-03-24  6:11 ` [RFC PATCH 1/3] doc: grep: unify info on configuration variables Matheus Tavares
                   ` (3 more replies)
  0 siblings, 4 replies; 120+ messages in thread
From: Matheus Tavares @ 2020-03-24  6:04 UTC (permalink / raw)
  To: git; +Cc: dstolee, newren

This series is based on the discussions we had some months ago[1], about
git-grep not currently honoring the sparsity patterns. To summarize, the
idea is that, since a sparse checkout is used to limit the set of files
in which users are interested, git-grep should, by default, only search
within this boundary.  But it would be good to also have an
'--ignore-sparsity' option, to restore the old behavior when needed, as
there are also valid use cases for it. The following patches seek to
address these suggestions. The first patch is not really related; it is
a cleanup used by the third one.

[1]: https://lore.kernel.org/git/CAHd-oW7e5qCuxZLBeVDq+Th3E+E4+P8=WzJfK8WcG2yz=n_nag@mail.gmail.com/t/#u

Matheus Tavares (3):
  doc: grep: unify info on configuration variables
  grep: honor sparse checkout patterns
  grep: add option to ignore sparsity patterns

 Documentation/config/grep.txt    |  10 ++-
 Documentation/git-grep.txt       |  40 +++-------
 builtin/grep.c                   |  36 ++++++++-
 t/t7011-skip-worktree-reading.sh |   9 ---
 t/t7817-grep-sparse-checkout.sh  | 130 +++++++++++++++++++++++++++++++
 5 files changed, 180 insertions(+), 45 deletions(-)
 create mode 100755 t/t7817-grep-sparse-checkout.sh

-- 
2.25.1



* [RFC PATCH 1/3] doc: grep: unify info on configuration variables
  2020-03-24  6:04 [RFC PATCH 0/3] grep: honor sparse checkout and add option to ignore it Matheus Tavares
@ 2020-03-24  6:11 ` Matheus Tavares
  2020-03-24  7:57   ` Elijah Newren
  2020-03-24  6:12 ` [RFC PATCH 2/3] grep: honor sparse checkout patterns Matheus Tavares
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 120+ messages in thread
From: Matheus Tavares @ 2020-03-24  6:11 UTC (permalink / raw)
  To: git; +Cc: dstolee, newren, sandals

Explanations about the configuration variables for git-grep are
duplicated in "Documentation/git-grep.txt" and
"Documentation/config/grep.txt". Let's unify the information in the
second file and include it in the first.

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---
 Documentation/config/grep.txt |  7 +++++--
 Documentation/git-grep.txt    | 35 +++++------------------------------
 2 files changed, 10 insertions(+), 32 deletions(-)

diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
index 44abe45a7c..76689771aa 100644
--- a/Documentation/config/grep.txt
+++ b/Documentation/config/grep.txt
@@ -16,8 +16,11 @@ grep.extendedRegexp::
 	other than 'default'.
 
 grep.threads::
-	Number of grep worker threads to use.
-	See `grep.threads` in linkgit:git-grep[1] for more information.
+	Number of grep worker threads to use. See `--threads` in
+	linkgit:git-grep[1] for more information.
+
+grep.fullName::
+	If set to true, enable `--full-name` option by default.
 
 grep.fallbackToNoIndex::
 	If set to true, fall back to git grep --no-index if git grep
diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
index ddb6acc025..97e25d7b1b 100644
--- a/Documentation/git-grep.txt
+++ b/Documentation/git-grep.txt
@@ -41,34 +41,7 @@ characters.  An empty string as search expression matches all lines.
 CONFIGURATION
 -------------
 
-grep.lineNumber::
-	If set to true, enable `-n` option by default.
-
-grep.column::
-	If set to true, enable the `--column` option by default.
-
-grep.patternType::
-	Set the default matching behavior. Using a value of 'basic', 'extended',
-	'fixed', or 'perl' will enable the `--basic-regexp`, `--extended-regexp`,
-	`--fixed-strings`, or `--perl-regexp` option accordingly, while the
-	value 'default' will return to the default matching behavior.
-
-grep.extendedRegexp::
-	If set to true, enable `--extended-regexp` option by default. This
-	option is ignored when the `grep.patternType` option is set to a value
-	other than 'default'.
-
-grep.threads::
-	Number of grep worker threads to use. If unset (or set to 0), Git will
-	use as many threads as the number of logical cores available.
-
-grep.fullName::
-	If set to true, enable `--full-name` option by default.
-
-grep.fallbackToNoIndex::
-	If set to true, fall back to git grep --no-index if git grep
-	is executed outside of a git repository.  Defaults to false.
-
+include::config/grep.txt[]
 
 OPTIONS
 -------
@@ -267,8 +240,10 @@ providing this option will cause it to die.
 	found.
 
 --threads <num>::
-	Number of grep worker threads to use.
-	See `grep.threads` in 'CONFIGURATION' for more information.
+	Number of grep worker threads to use. If not provided (or set to
+	0), Git will use as many worker threads as the number of logical
+	cores available. The default value can also be set with the
+	`grep.threads` configuration (see linkgit:git-config[1]).
 
 -f <file>::
 	Read patterns from <file>, one per line.
-- 
2.25.1



* [RFC PATCH 2/3] grep: honor sparse checkout patterns
  2020-03-24  6:04 [RFC PATCH 0/3] grep: honor sparse checkout and add option to ignore it Matheus Tavares
  2020-03-24  6:11 ` [RFC PATCH 1/3] doc: grep: unify info on configuration variables Matheus Tavares
@ 2020-03-24  6:12 ` Matheus Tavares
  2020-03-24  7:15   ` Elijah Newren
  2020-03-24  6:13 ` [RFC PATCH 3/3] grep: add option to ignore sparsity patterns Matheus Tavares
  2020-05-10  0:41 ` [RFC PATCH v2 0/4] grep: honor sparse checkout and add option to ignore it Matheus Tavares
  3 siblings, 1 reply; 120+ messages in thread
From: Matheus Tavares @ 2020-03-24  6:12 UTC (permalink / raw)
  To: git; +Cc: dstolee, newren, sandals, stefanbeller

One of the main uses for a sparse checkout is to allow users to focus on
the subset of files in a repository in which they are interested. But
git-grep currently ignores the sparsity patterns and reports all matches
found outside this subset, which kind of goes in the opposite direction.
Let's fix that, making it honor the sparsity boundaries for every
grepping case:

- git grep in worktree
- git grep --cached
- git grep $REVISION
- git grep --untracked and git grep --no-index (which already respect
  sparse checkout boundaries)

This is also what some users reported[1] they would want as the default
behavior.

Note: for `git grep $REVISION`, we will choose to honor the sparsity
patterns only when $REVISION is a commit-ish object. The reason is that,
for a tree, we don't know whether it represents the root of a
repository or a subtree. So we wouldn't be able to correctly match it
against the sparsity patterns. E.g. suppose we have a repository with
these two sparsity rules: "/*" and "!/a"; and the following structure:

/
| - a (file)
| - d (dir)
    | - a (file)

If `git grep $REVISION` were to honor the sparsity patterns for every
object type, when grepping the /d tree, we would wrongly ignore the /d/a
file. This happens because we wouldn't know it resides in /d and
therefore it would wrongly match the pattern "!/a". Furthermore, for a
search in a blob object, we wouldn't even have a path to check the
patterns against. So, let's ignore the sparsity patterns when grepping
non-commit-ish objects (tags to commits should be fine).
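
To make the ambiguity concrete, here is a minimal, self-contained sketch.
It is a toy matcher and not Git's pattern code: toy_in_sparse_checkout()
is a hypothetical helper that only understands root-anchored patterns
(the catch-all pattern, "/name", and their "!" negations) with
last-match-wins semantics. Under those assumptions, it shows that the
same file is classified differently depending on whether its path is
taken relative to the repository root ("d/a") or relative to the subtree
being grepped ("a").

```c
#include <string.h>

/*
 * Toy model (hypothetical, NOT Git's implementation): decide whether
 * `path` is inside the sparse checkout. Only root-anchored patterns are
 * understood: a slash followed by a star matches every path, and "/name"
 * matches the top-level entry "name" and anything below it. A leading
 * "!" negates a pattern, and the last matching pattern wins.
 */
static int toy_in_sparse_checkout(const char *path,
				  const char **patterns, int nr)
{
	int included = 0;
	int i;

	for (i = 0; i < nr; i++) {
		const char *pat = patterns[i];
		int negated = (*pat == '!');
		size_t len;
		int match;

		if (negated)
			pat++;
		if (*pat != '/')
			continue; /* unanchored patterns are out of scope here */
		pat++; /* drop the leading slash; paths carry none */

		len = strlen(pat);
		if (!strcmp(pat, "*"))
			match = 1;
		else
			match = !strncmp(path, pat, len) &&
				(path[len] == '\0' || path[len] == '/');

		if (match)
			included = !negated;
	}
	return included;
}
```

With the two rules from the example, the root-relative path "d/a" is
kept, while the subtree-relative path "a" (which is all a walk starting
at the /d tree would see) is excluded by "!/a".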

Finally, the old behavior is still desirable for some use cases. So the
next patch will add an option to allow restoring it when needed.

[1]: https://lore.kernel.org/git/CABPp-BGuFhDwWZBRaD3nA8ui46wor-4=Ha1G1oApsfF8KNpfGQ@mail.gmail.com/

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---

Something I'm not entirely sure about in this patch is how we implement the
mechanism to honor sparsity for the `git grep <commit-ish>` case (which
is treated in the grep_tree() function). Currently, the patch looks for
an index entry that matches the path, and then checks its skip_worktree
bit. But this operation is performed in O(log(N)); N being the number of
index entries. If there are many entries (and not so many sparsity
patterns), maybe a better approach would be to try matching the path
directly against the sparsity patterns. This would be O(M) in the number
of patterns, and it could be done, in builtin/grep.c, with a function
like the following:

static struct pattern_list sparsity_patterns;
static int sparsity_patterns_initialized = 0;
static enum pattern_match_result path_matches_sparsity_patterns(
					const char *path, int pathlen,
					const char *basename,
					struct repository *repo)
{
	int dtype = DT_UNKNOWN;

	if (!sparsity_patterns_initialized) {
		char *sparse_file = git_pathdup("info/sparse-checkout");
		int ret;

		memset(&sparsity_patterns, 0, sizeof(sparsity_patterns));
		sparsity_patterns.use_cone_patterns = core_sparse_checkout_cone;
		ret = add_patterns_from_file_to_list(sparse_file, "", 0,
						     &sparsity_patterns, NULL);
		free(sparse_file);

		if (ret < 0)
			die(_("failed to load sparse-checkout patterns"));
		sparsity_patterns_initialized = 1;
	}

	return path_matches_pattern_list(path, pathlen, basename, &dtype,
					 &sparsity_patterns, repo->index);
}

Also, if I understand correctly, the index doesn't hold paths to dirs,
right? So even if a complete dir is excluded from sparse checkout, we
still have to check all its subentries, only to discover that they
should all be skipped from the search. However, if we were to check
against the sparsity patterns directly (e.g. with the function above),
we could skip such directories together with all their entries.

Oh, and there is also the case of a commit whose tree paths are not in
the index (maybe manually created objects?). For such commits, with the
index lookup approach, we would have to fall back on ignoring the
sparsity rules. I'm not sure if that would be OK, though.

Any thoughts on these two approaches (looking up the skip_worktree bit
in the index or directly matching against sparsity patterns) will be
highly appreciated. (Note that it only concerns the `git grep
<commit-ish>` case. The other cases already iterate through the index, so
there is no O(log(N)) extra complexity.)
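
For comparison with the pattern-matching function above, the index
lookup approach can be modeled with a small standalone sketch. The
struct and helper names below (toy_entry, toy_should_skip) are
hypothetical stand-ins, not Git's cache_entry or index_name_pos(); the
sketch only assumes a name-sorted array with a skip_worktree flag per
entry, searched with the standard bsearch().

```c
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for an index entry (hypothetical, not Git's cache_entry). */
struct toy_entry {
	const char *name;
	int skip_worktree;
};

/* bsearch() comparator: the key is a path, the element is a toy_entry. */
static int toy_entry_cmp(const void *key, const void *elem)
{
	return strcmp((const char *)key,
		      ((const struct toy_entry *)elem)->name);
}

/*
 * Return 1 if `path` should be skipped by the search. `entries` must be
 * sorted by name, so each lookup costs O(log(N)) comparisons. A path with
 * no entry (a directory, or a file that exists only in the commit being
 * grepped) is never skipped, which mirrors the fallback discussed above.
 */
static int toy_should_skip(const struct toy_entry *entries, size_t nr,
			   const char *path)
{
	const struct toy_entry *e = bsearch(path, entries, nr,
					    sizeof(*entries), toy_entry_cmp);

	return e && e->skip_worktree;
}
```

Note that a lookup for "dir" finds nothing even when every entry below
it is marked skip_worktree, so a tree walk built on this check still has
to visit each file under an excluded directory.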

 builtin/grep.c                   | 29 ++++++++---
 t/t7011-skip-worktree-reading.sh |  9 ----
 t/t7817-grep-sparse-checkout.sh  | 88 ++++++++++++++++++++++++++++++++
 3 files changed, 111 insertions(+), 15 deletions(-)
 create mode 100755 t/t7817-grep-sparse-checkout.sh

diff --git a/builtin/grep.c b/builtin/grep.c
index 99e2685090..52ec72a036 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -388,7 +388,7 @@ static int grep_cache(struct grep_opt *opt,
 		      const struct pathspec *pathspec, int cached);
 static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 		     struct tree_desc *tree, struct strbuf *base, int tn_len,
-		     int check_attr);
+		     int from_commit);
 
 static int grep_submodule(struct grep_opt *opt,
 			  const struct pathspec *pathspec,
@@ -486,6 +486,10 @@ static int grep_cache(struct grep_opt *opt,
 
 	for (nr = 0; nr < repo->index->cache_nr; nr++) {
 		const struct cache_entry *ce = repo->index->cache[nr];
+
+		if (ce_skip_worktree(ce))
+			continue;
+
 		strbuf_setlen(&name, name_base_len);
 		strbuf_addstr(&name, ce->name);
 
@@ -498,8 +502,7 @@ static int grep_cache(struct grep_opt *opt,
 			 * cache entry are identical, even if worktree file has
 			 * been modified, so use cache version instead
 			 */
-			if (cached || (ce->ce_flags & CE_VALID) ||
-			    ce_skip_worktree(ce)) {
+			if (cached || (ce->ce_flags & CE_VALID)) {
 				if (ce_stage(ce) || ce_intent_to_add(ce))
 					continue;
 				hit |= grep_oid(opt, &ce->oid, name.buf,
@@ -532,7 +535,7 @@ static int grep_cache(struct grep_opt *opt,
 
 static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 		     struct tree_desc *tree, struct strbuf *base, int tn_len,
-		     int check_attr)
+		     int from_commit)
 {
 	struct repository *repo = opt->repo;
 	int hit = 0;
@@ -546,6 +549,9 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 		name_base_len = name.len;
 	}
 
+	if (from_commit && repo_read_index(repo) < 0)
+		die(_("index file corrupt"));
+
 	while (tree_entry(tree, &entry)) {
 		int te_len = tree_entry_len(&entry);
 
@@ -564,9 +570,20 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 
 		strbuf_add(base, entry.path, te_len);
 
+		if (from_commit) {
+			int pos = index_name_pos(repo->index,
+						 base->buf + tn_len,
+						 base->len - tn_len);
+			if (pos >= 0 &&
+			    ce_skip_worktree(repo->index->cache[pos])) {
+				strbuf_setlen(base, old_baselen);
+				continue;
+			}
+		}
+
 		if (S_ISREG(entry.mode)) {
 			hit |= grep_oid(opt, &entry.oid, base->buf, tn_len,
-					 check_attr ? base->buf + tn_len : NULL);
+					from_commit ? base->buf + tn_len : NULL);
 		} else if (S_ISDIR(entry.mode)) {
 			enum object_type type;
 			struct tree_desc sub;
@@ -581,7 +598,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 			strbuf_addch(base, '/');
 			init_tree_desc(&sub, data, size);
 			hit |= grep_tree(opt, pathspec, &sub, base, tn_len,
-					 check_attr);
+					 from_commit);
 			free(data);
 		} else if (recurse_submodules && S_ISGITLINK(entry.mode)) {
 			hit |= grep_submodule(opt, pathspec, &entry.oid,
diff --git a/t/t7011-skip-worktree-reading.sh b/t/t7011-skip-worktree-reading.sh
index 37525cae3a..26852586ac 100755
--- a/t/t7011-skip-worktree-reading.sh
+++ b/t/t7011-skip-worktree-reading.sh
@@ -109,15 +109,6 @@ test_expect_success 'ls-files --modified' '
 	test -z "$(git ls-files -m)"
 '
 
-test_expect_success 'grep with skip-worktree file' '
-	git update-index --no-skip-worktree 1 &&
-	echo test > 1 &&
-	git update-index 1 &&
-	git update-index --skip-worktree 1 &&
-	rm 1 &&
-	test "$(git grep --no-ext-grep test)" = "1:test"
-'
-
 echo ":000000 100644 $ZERO_OID $EMPTY_BLOB A	1" > expected
 test_expect_success 'diff-index does not examine skip-worktree absent entries' '
 	setup_absent &&
diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
new file mode 100755
index 0000000000..fccf44e829
--- /dev/null
+++ b/t/t7817-grep-sparse-checkout.sh
@@ -0,0 +1,88 @@
+#!/bin/sh
+
+test_description='grep in sparse checkout
+
+This test creates the following dir structure:
+.
+| - a
+| - b
+| - dir
+    | - c
+
+Only "a" should be present due to the sparse checkout patterns:
+"/*", "!/b" and "!/dir".
+'
+
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+	echo "text" >a &&
+	echo "text" >b &&
+	mkdir dir &&
+	echo "text" >dir/c &&
+	git add a b dir &&
+	git commit -m "initial commit" &&
+	git tag -am t-commit t-commit HEAD &&
+	tree=$(git rev-parse HEAD^{tree}) &&
+	git tag -am t-tree t-tree $tree &&
+	cat >.git/info/sparse-checkout <<-EOF &&
+	/*
+	!/b
+	!/dir
+	EOF
+	git sparse-checkout init &&
+	test_path_is_missing b &&
+	test_path_is_missing dir &&
+	test_path_is_file a
+'
+
+test_expect_success 'grep in working tree should honor sparse checkout' '
+	cat >expect <<-EOF &&
+	a:text
+	EOF
+	git grep "text" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep --cached should honor sparse checkout' '
+	cat >expect <<-EOF &&
+	a:text
+	EOF
+	git grep --cached "text" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep <commit-ish> should honor sparse checkout' '
+	commit=$(git rev-parse HEAD) &&
+	cat >expect_commit <<-EOF &&
+	$commit:a:text
+	EOF
+	cat >expect_t-commit <<-EOF &&
+	t-commit:a:text
+	EOF
+	git grep "text" $commit >actual_commit &&
+	test_cmp expect_commit actual_commit &&
+	git grep "text" t-commit >actual_t-commit &&
+	test_cmp expect_t-commit actual_t-commit
+'
+
+test_expect_success 'grep <tree-ish> should search outside sparse checkout' '
+	commit=$(git rev-parse HEAD) &&
+	tree=$(git rev-parse HEAD^{tree}) &&
+	cat >expect_tree <<-EOF &&
+	$tree:a:text
+	$tree:b:text
+	$tree:dir/c:text
+	EOF
+	cat >expect_t-tree <<-EOF &&
+	t-tree:a:text
+	t-tree:b:text
+	t-tree:dir/c:text
+	EOF
+	git grep "text" $tree >actual_tree &&
+	test_cmp expect_tree actual_tree &&
+	git grep "text" t-tree >actual_t-tree &&
+	test_cmp expect_t-tree actual_t-tree
+'
+
+test_done
-- 
2.25.1



* [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-24  6:04 [RFC PATCH 0/3] grep: honor sparse checkout and add option to ignore it Matheus Tavares
  2020-03-24  6:11 ` [RFC PATCH 1/3] doc: grep: unify info on configuration variables Matheus Tavares
  2020-03-24  6:12 ` [RFC PATCH 2/3] grep: honor sparse checkout patterns Matheus Tavares
@ 2020-03-24  6:13 ` Matheus Tavares
  2020-03-24  7:54   ` Elijah Newren
  2020-05-10  0:41 ` [RFC PATCH v2 0/4] grep: honor sparse checkout and add option to ignore it Matheus Tavares
  3 siblings, 1 reply; 120+ messages in thread
From: Matheus Tavares @ 2020-03-24  6:13 UTC (permalink / raw)
  To: git; +Cc: dstolee, newren, pclouds

In the last commit, git-grep learned to honor sparsity patterns. For
some use cases, however, it may be desirable to search outside the
sparse checkout. So add the '--ignore-sparsity' option, which restores
the old behavior. Also add the grep.ignoreSparsity configuration, to
allow setting this behavior by default.

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---

Note: I still have to make --ignore-sparsity be able to work together
with --untracked. Unfortunately, this won't be as simple because the
code path taken by --untracked goes to grep_directory() which just
iterates the working tree, without looking at the index entries. So I will
have to either: make --untracked use grep_cache(), and grep the
untracked files later; or try matching the working tree paths against
the sparsity patterns, without looking for the skip_worktree bit in
the index (as I mentioned in the previous patch's comments). Any
preferences regarding these two approaches? (or other suggestions?)

 Documentation/config/grep.txt   |  3 +++
 Documentation/git-grep.txt      |  5 ++++
 builtin/grep.c                  | 19 +++++++++++----
 t/t7817-grep-sparse-checkout.sh | 42 +++++++++++++++++++++++++++++++++
 4 files changed, 65 insertions(+), 4 deletions(-)

diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
index 76689771aa..c1d49484c8 100644
--- a/Documentation/config/grep.txt
+++ b/Documentation/config/grep.txt
@@ -25,3 +25,6 @@ grep.fullName::
 grep.fallbackToNoIndex::
 	If set to true, fall back to git grep --no-index if git grep
 	is executed outside of a git repository.  Defaults to false.
+
+grep.ignoreSparsity::
+	If set to true, enable `--ignore-sparsity` by default.
diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
index 97e25d7b1b..5c5c66c056 100644
--- a/Documentation/git-grep.txt
+++ b/Documentation/git-grep.txt
@@ -65,6 +65,11 @@ OPTIONS
 	mechanism.  Only useful when searching files in the current directory
 	with `--no-index`.
 
+--ignore-sparsity::
+	In a sparse checked out repository (see linkgit:git-sparse-checkout[1]),
+	also search in files that are outside the sparse checkout. This option
+	cannot be used with --no-index or --untracked.
+
 --recurse-submodules::
 	Recursively search in each submodule that has been initialized and
 	checked out in the repository.  When used in combination with the
diff --git a/builtin/grep.c b/builtin/grep.c
index 52ec72a036..17eae3edd6 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -33,6 +33,8 @@ static char const * const grep_usage[] = {
 
 static int recurse_submodules;
 
+static int ignore_sparsity = 0;
+
 static int num_threads;
 
 static pthread_t *threads;
@@ -292,6 +294,9 @@ static int grep_cmd_config(const char *var, const char *value, void *cb)
 	if (!strcmp(var, "submodule.recurse"))
 		recurse_submodules = git_config_bool(var, value);
 
+	if (!strcmp(var, "grep.ignoresparsity"))
+		ignore_sparsity = git_config_bool(var, value);
+
 	return st;
 }
 
@@ -487,7 +492,7 @@ static int grep_cache(struct grep_opt *opt,
 	for (nr = 0; nr < repo->index->cache_nr; nr++) {
 		const struct cache_entry *ce = repo->index->cache[nr];
 
-		if (ce_skip_worktree(ce))
+		if (!ignore_sparsity && ce_skip_worktree(ce))
 			continue;
 
 		strbuf_setlen(&name, name_base_len);
@@ -502,7 +507,8 @@ static int grep_cache(struct grep_opt *opt,
 			 * cache entry are identical, even if worktree file has
 			 * been modified, so use cache version instead
 			 */
-			if (cached || (ce->ce_flags & CE_VALID)) {
+			if (cached || (ce->ce_flags & CE_VALID) ||
+			    ce_skip_worktree(ce)) {
 				if (ce_stage(ce) || ce_intent_to_add(ce))
 					continue;
 				hit |= grep_oid(opt, &ce->oid, name.buf,
@@ -549,7 +555,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 		name_base_len = name.len;
 	}
 
-	if (from_commit && repo_read_index(repo) < 0)
+	if (!ignore_sparsity && from_commit && repo_read_index(repo) < 0)
 		die(_("index file corrupt"));
 
 	while (tree_entry(tree, &entry)) {
@@ -570,7 +576,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 
 		strbuf_add(base, entry.path, te_len);
 
-		if (from_commit) {
+		if (!ignore_sparsity && from_commit) {
 			int pos = index_name_pos(repo->index,
 						 base->buf + tn_len,
 						 base->len - tn_len);
@@ -932,6 +938,8 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
 		OPT_BOOL_F(0, "ext-grep", &external_grep_allowed__ignored,
 			   N_("allow calling of grep(1) (ignored by this build)"),
 			   PARSE_OPT_NOCOMPLETE),
+		OPT_BOOL(0, "ignore-sparsity", &ignore_sparsity,
+			 N_("also search in files outside the sparse checkout")),
 		OPT_END()
 	};
 
@@ -1073,6 +1081,9 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
 	if (recurse_submodules && untracked)
 		die(_("--untracked not supported with --recurse-submodules"));
 
+	if (ignore_sparsity && (!use_index || untracked))
+		die(_("--no-index or --untracked cannot be used with --ignore-sparsity"));
+
 	if (show_in_pager) {
 		if (num_threads > 1)
 			warning(_("invalid option combination, ignoring --threads"));
diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
index fccf44e829..1891ddea57 100755
--- a/t/t7817-grep-sparse-checkout.sh
+++ b/t/t7817-grep-sparse-checkout.sh
@@ -85,4 +85,46 @@ test_expect_success 'grep <tree-ish> should search outside sparse checkout' '
 	test_cmp expect_t-tree actual_t-tree
 '
 
+for cmd in 'git grep --ignore-sparsity' 'git -c grep.ignoreSparsity grep' \
+	   'git -c grep.ignoreSparsity=false grep --ignore-sparsity'
+do
+	test_expect_success "$cmd should search outside sparse checkout" '
+		cat >expect <<-EOF &&
+		a:text
+		b:text
+		dir/c:text
+		EOF
+		$cmd "text" >actual &&
+		test_cmp expect actual
+	'
+
+	test_expect_success "$cmd --cached should search outside sparse checkout" '
+		cat >expect <<-EOF &&
+		a:text
+		b:text
+		dir/c:text
+		EOF
+		$cmd --cached "text" >actual &&
+		test_cmp expect actual
+	'
+
+	test_expect_success "$cmd <commit-ish> should search outside sparse checkout" '
+		commit=$(git rev-parse HEAD) &&
+		cat >expect_commit <<-EOF &&
+		$commit:a:text
+		$commit:b:text
+		$commit:dir/c:text
+		EOF
+		cat >expect_t-commit <<-EOF &&
+		t-commit:a:text
+		t-commit:b:text
+		t-commit:dir/c:text
+		EOF
+		$cmd "text" $commit >actual_commit &&
+		test_cmp expect_commit actual_commit &&
+		$cmd "text" t-commit >actual_t-commit &&
+		test_cmp expect_t-commit actual_t-commit
+	'
+done
+
 test_done
-- 
2.25.1



* Re: [RFC PATCH 2/3] grep: honor sparse checkout patterns
  2020-03-24  6:12 ` [RFC PATCH 2/3] grep: honor sparse checkout patterns Matheus Tavares
@ 2020-03-24  7:15   ` Elijah Newren
  2020-03-24 15:12     ` Derrick Stolee
  2020-03-24 22:55     ` Matheus Tavares Bernardino
  0 siblings, 2 replies; 120+ messages in thread
From: Elijah Newren @ 2020-03-24  7:15 UTC (permalink / raw)
  To: Matheus Tavares
  Cc: Git Mailing List, Derrick Stolee, brian m. carlson, Stefan Beller

Hi Matheus,

On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
<matheus.bernardino@usp.br> wrote:
>
> One of the main uses for a sparse checkout is to allow users to focus on
> the subset of files in a repository in which they are interested. But
> git-grep currently ignores the sparsity patterns and report all matches
> found outside this subset, which kind of goes in the oposity direction.
> Let's fix that, making it honor the sparsity boundaries for every
> grepping case:
>
> - git grep in worktree
> - git grep --cached
> - git grep $REVISION

Wahoo!  This is great.

> - git grep --untracked and git grep --no-index (which already respect
>   sparse checkout boundaries)
>
> This is also what some users reported[1] they would want as the default
> behavior.
>
> Note: for `git grep $REVISION`, we will choose to honor the sparsity
> patterns only when $REVISION is a commit-ish object. The reason is that,

Makes sense.

> for a tree, we don't know whether it represents the root of a
> repository or a subtree. So we wouldn't be able to correctly match it
> against the sparsity patterns. E.g. suppose we have a repository with
> these two sparsity rules: "/*" and "!/a"; and the following structure:
>
> /
> | - a (file)
> | - d (dir)
>     | - a (file)
>
> If `git grep $REVISION` were to honor the sparsity patterns for every
> object type, when grepping the /d tree, we would wrongly ignore the /d/a
> file. This happens because we wouldn't know it resides in /d and
> therefore it would wrongly match the pattern "!/a". Furthermore, for a
> search in a blob object, we wouldn't even have a path to check the
> patterns against. So, let's ignore the sparsity patterns when grepping
> non-commit-ish objects (tags to commits should be fine).
>
> Finally, the old behavior is still desirable for some use cases. So the
> next patch will add an option to allow restoring it when needed.
>
> [1]: https://lore.kernel.org/git/CABPp-BGuFhDwWZBRaD3nA8ui46wor-4=Ha1G1oApsfF8KNpfGQ@mail.gmail.com/
>
> Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
> ---
>
> Something I'm not entirely sure in this patch is how we implement the
> mechanism to honor sparsity for the `git grep <commit-ish>` case (which
> is treated in the grep_tree() function). Currently, the patch looks for
> an index entry that matches the path, and then checks its skip_worktree

As you discuss below, checking the index is both wrong _and_ costly.
You should use the sparsity patterns; Stolee did a lot of work to make
those correspond to simple hashes you could check to determine whether
to even walk into a subdirectory.  So, O(1).  Yeah, that's "only" cone
mode but the non-cone sparsity patterns were a performance nightmare
waiting to rear its ugly head.  We should just try to encourage
everyone to move to cone mode, or accept the slowness they get without
it.
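
As a rough illustration of the hash-lookup idea, here is a standalone
sketch. Everything in it is hypothetical (the toy_dirset type, the
FNV-1a hash, the two sets, and toy_classify_dir()); it is not the
sparse-checkout code, only a model that assumes one set of directories
that are included recursively and one set of their parent directories.
Under that assumption, deciding whether to walk into a subdirectory is
one or two expected-O(1) lookups.

```c
#include <stddef.h>
#include <string.h>

#define TOY_SLOTS 64 /* power of two; callers must keep the set sparse */

/* Tiny open-addressing set of directory names (hypothetical helper). */
struct toy_dirset {
	const char *slot[TOY_SLOTS];
};

static unsigned int toy_hash(const char *s)
{
	unsigned int h = 2166136261u; /* FNV-1a offset basis */

	while (*s) {
		h ^= (unsigned char)*s++;
		h *= 16777619u; /* FNV-1a prime */
	}
	return h;
}

static void toy_dirset_add(struct toy_dirset *set, const char *dir)
{
	unsigned int i = toy_hash(dir) & (TOY_SLOTS - 1);

	/* Linear probing; assumes fewer than TOY_SLOTS distinct entries. */
	while (set->slot[i] && strcmp(set->slot[i], dir))
		i = (i + 1) & (TOY_SLOTS - 1);
	set->slot[i] = dir;
}

static int toy_dirset_has(const struct toy_dirset *set, const char *dir)
{
	unsigned int i = toy_hash(dir) & (TOY_SLOTS - 1);

	while (set->slot[i]) {
		if (!strcmp(set->slot[i], dir))
			return 1;
		i = (i + 1) & (TOY_SLOTS - 1);
	}
	return 0;
}

enum toy_walk { TOY_SKIP, TOY_PARENT_ONLY, TOY_RECURSIVE };

/*
 * Classify `dir` during a top-down tree walk. `parent_state` is the
 * result already computed for the containing directory, so nothing
 * below a recursively included directory needs another lookup.
 */
static enum toy_walk toy_classify_dir(const struct toy_dirset *recursive,
				      const struct toy_dirset *parents,
				      enum toy_walk parent_state,
				      const char *dir)
{
	if (parent_state == TOY_RECURSIVE || toy_dirset_has(recursive, dir))
		return TOY_RECURSIVE;
	if (toy_dirset_has(parents, dir))
		return TOY_PARENT_ONLY;
	return TOY_SKIP;
}
```

A walk would descend only into directories classified as TOY_RECURSIVE
or TOY_PARENT_ONLY, and could prune a TOY_SKIP directory without
inspecting any of its entries.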

> bit. But this operation is perfomed in O(log(N)); N being the number of
> index entries. If there are many entries (and no so many sparsity
> patterns), maybe a better approach would be to try matching the path
> directly against the sparsity patterns. This would be O(M) in the number
> of patterns, and it could be done, in builtin/grep.c, with a function
> like the following:
>
> static struct pattern_list sparsity_patterns;
> static int sparsity_patterns_initialized = 0;
> static enum pattern_match_result path_matches_sparsity_patterns(
>                                         const char *path, int pathlen,
>                                         const char *basename,
>                                         struct repository *repo)
> {
>         int dtype = DT_UNKNOWN;
>
>         if (!sparsity_patterns_initialized) {
>                 char *sparse_file = git_pathdup("info/sparse-checkout");
>                 int ret;
>
>                 memset(&sparsity_patterns, 0, sizeof(sparsity_patterns));
>                 sparsity_patterns.use_cone_patterns = core_sparse_checkout_cone;
>                 ret = add_patterns_from_file_to_list(sparse_file, "", 0,
>                                                      &sparsity_patterns, NULL);
>                 free(sparse_file);
>
>                 if (ret < 0)
>                         die(_("failed to load sparse-checkout patterns"));
>                 sparsity_patterns_initialized = 1;
>         }
>
>         return path_matches_pattern_list(path, pathlen, basename, &dtype,
>                                          &sparsity_patterns, repo->index);
> }
>
> Also, if I understand correctly, the index doesn't hold paths to dirs,
> right? So even if a complete dir is excluded from sparse checkout, we
> still have to check all its subentries, only to discover that they
> should all be skipped from the search. However, if we were to check
> against the sparsity patterns directly (e.g. with the function above),
> we could skip such directories together with all their entries.
>
> Oh, and there is also the case of a commit whose tree paths are not in
> the index (maybe manually created objects?). For such commits, with the
> index lookup approach, we would have to fall back on ignoring the
> sparsity rules. I'm not sure if that would be OK, though.
>
> Any thoughts on these two approaches (looking up the skip_worktree bit
> in the index or directly matching against sparsity patterns), will be
> highly appreciated. (Note that it only concerns the `git grep
> <commit-ish>` case. The other cases already iterate thought the index, so
> there is no O(log(N)) extra complexity).
>
>  builtin/grep.c                   | 29 ++++++++---
>  t/t7011-skip-worktree-reading.sh |  9 ----
>  t/t7817-grep-sparse-checkout.sh  | 88 ++++++++++++++++++++++++++++++++
>  3 files changed, 111 insertions(+), 15 deletions(-)
>  create mode 100755 t/t7817-grep-sparse-checkout.sh
>
> diff --git a/builtin/grep.c b/builtin/grep.c
> index 99e2685090..52ec72a036 100644
> --- a/builtin/grep.c
> +++ b/builtin/grep.c
> @@ -388,7 +388,7 @@ static int grep_cache(struct grep_opt *opt,
>                       const struct pathspec *pathspec, int cached);
>  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>                      struct tree_desc *tree, struct strbuf *base, int tn_len,
> -                    int check_attr);
> +                    int from_commit);

I'm not familiar with grep.c and have to admit I don't know what
"check_attr" means.  Slightly surprised to see you replace it, but
maybe reading the rest will explain...

>
>  static int grep_submodule(struct grep_opt *opt,
>                           const struct pathspec *pathspec,
> @@ -486,6 +486,10 @@ static int grep_cache(struct grep_opt *opt,
>
>         for (nr = 0; nr < repo->index->cache_nr; nr++) {
>                 const struct cache_entry *ce = repo->index->cache[nr];
> +
> +               if (ce_skip_worktree(ce))
> +                       continue;
> +

Looks good for the case where we are grepping through what's cached.

>                 strbuf_setlen(&name, name_base_len);
>                 strbuf_addstr(&name, ce->name);
>
> @@ -498,8 +502,7 @@ static int grep_cache(struct grep_opt *opt,
>                          * cache entry are identical, even if worktree file has
>                          * been modified, so use cache version instead
>                          */
> -                       if (cached || (ce->ce_flags & CE_VALID) ||
> -                           ce_skip_worktree(ce)) {
> +                       if (cached || (ce->ce_flags & CE_VALID)) {

I had the same change when I was trying to hack something like this
patch into place, but I had only handled the worktree case before
realizing it was a somewhat bigger job.

>                                 if (ce_stage(ce) || ce_intent_to_add(ce))
>                                         continue;
>                                 hit |= grep_oid(opt, &ce->oid, name.buf,
> @@ -532,7 +535,7 @@ static int grep_cache(struct grep_opt *opt,
>
>  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>                      struct tree_desc *tree, struct strbuf *base, int tn_len,
> -                    int check_attr)
> +                    int from_commit)
>  {
>         struct repository *repo = opt->repo;
>         int hit = 0;
> @@ -546,6 +549,9 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>                 name_base_len = name.len;
>         }
>
> +       if (from_commit && repo_read_index(repo) < 0)
> +               die(_("index file corrupt"));
> +

As above, I don't think we should need to read the index.  We should
compare to sparsity patterns, which in the important case (cone mode)
simplifies to a hash lookup as we walk directories.

>         while (tree_entry(tree, &entry)) {
>                 int te_len = tree_entry_len(&entry);
>
> @@ -564,9 +570,20 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>
>                 strbuf_add(base, entry.path, te_len);
>
> +               if (from_commit) {
> +                       int pos = index_name_pos(repo->index,
> +                                                base->buf + tn_len,
> +                                                base->len - tn_len);
> +                       if (pos >= 0 &&
> +                           ce_skip_worktree(repo->index->cache[pos])) {
> +                               strbuf_setlen(base, old_baselen);
> +                               continue;
> +                       }
> +               }
> +
>                 if (S_ISREG(entry.mode)) {
>                         hit |= grep_oid(opt, &entry.oid, base->buf, tn_len,
> -                                        check_attr ? base->buf + tn_len : NULL);
> +                                       from_commit ? base->buf + tn_len : NULL);

Sadly, this doesn't help me understand check_attr or from_commit.
Could you clue me in a bit?

>                 } else if (S_ISDIR(entry.mode)) {
>                         enum object_type type;
>                         struct tree_desc sub;
> @@ -581,7 +598,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>                         strbuf_addch(base, '/');
>                         init_tree_desc(&sub, data, size);
>                         hit |= grep_tree(opt, pathspec, &sub, base, tn_len,
> -                                        check_attr);
> +                                        from_commit);

Same.

>                         free(data);
>                 } else if (recurse_submodules && S_ISGITLINK(entry.mode)) {
>                         hit |= grep_submodule(opt, pathspec, &entry.oid,
> diff --git a/t/t7011-skip-worktree-reading.sh b/t/t7011-skip-worktree-reading.sh
> index 37525cae3a..26852586ac 100755
> --- a/t/t7011-skip-worktree-reading.sh
> +++ b/t/t7011-skip-worktree-reading.sh
> @@ -109,15 +109,6 @@ test_expect_success 'ls-files --modified' '
>         test -z "$(git ls-files -m)"
>  '
>
> -test_expect_success 'grep with skip-worktree file' '
> -       git update-index --no-skip-worktree 1 &&
> -       echo test > 1 &&
> -       git update-index 1 &&
> -       git update-index --skip-worktree 1 &&
> -       rm 1 &&
> -       test "$(git grep --no-ext-grep test)" = "1:test"
> -'
> -
>  echo ":000000 100644 $ZERO_OID $EMPTY_BLOB A   1" > expected
>  test_expect_success 'diff-index does not examine skip-worktree absent entries' '
>         setup_absent &&
> diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> new file mode 100755
> index 0000000000..fccf44e829
> --- /dev/null
> +++ b/t/t7817-grep-sparse-checkout.sh
> @@ -0,0 +1,88 @@
> +#!/bin/sh
> +
> +test_description='grep in sparse checkout
> +
> +This test creates the following dir structure:
> +.
> +| - a
> +| - b
> +| - dir
> +    | - c
> +
> +Only "a" should be present due to the sparse checkout patterns:
> +"/*", "!/b" and "!/dir".
> +'
> +
> +. ./test-lib.sh
> +
> +test_expect_success 'setup' '
> +       echo "text" >a &&
> +       echo "text" >b &&
> +       mkdir dir &&
> +       echo "text" >dir/c &&
> +       git add a b dir &&
> +       git commit -m "initial commit" &&
> +       git tag -am t-commit t-commit HEAD &&
> +       tree=$(git rev-parse HEAD^{tree}) &&
> +       git tag -am t-tree t-tree $tree &&
> +       cat >.git/info/sparse-checkout <<-EOF &&
> +       /*
> +       !/b
> +       !/dir
> +       EOF
> +       git sparse-checkout init &&

Using `git sparse-checkout init` but then manually writing to
.git/info/sparse-checkout?  Seems like it'd make more sense to use
`git sparse-checkout set` than writing the patterns directly yourself.
Also, would prefer to have the examples use cone mode (even if you
have to add subdirectories), as it makes the testcase a bit easier to
read and more performant, though neither is a big deal.

> +       test_path_is_missing b &&
> +       test_path_is_missing dir &&
> +       test_path_is_file a
> +'
> +
> +test_expect_success 'grep in working tree should honor sparse checkout' '
> +       cat >expect <<-EOF &&
> +       a:text
> +       EOF
> +       git grep "text" >actual &&
> +       test_cmp expect actual
> +'
> +
> +test_expect_success 'grep --cached should honor sparse checkout' '
> +       cat >expect <<-EOF &&
> +       a:text
> +       EOF
> +       git grep --cached "text" >actual &&
> +       test_cmp expect actual
> +'
> +
> +test_expect_success 'grep <commit-ish> should honor sparse checkout' '
> +       commit=$(git rev-parse HEAD) &&
> +       cat >expect_commit <<-EOF &&
> +       $commit:a:text
> +       EOF
> +       cat >expect_t-commit <<-EOF &&
> +       t-commit:a:text
> +       EOF
> +       git grep "text" $commit >actual_commit &&
> +       test_cmp expect_commit actual_commit &&
> +       git grep "text" t-commit >actual_t-commit &&
> +       test_cmp expect_t-commit actual_t-commit
> +'
> +
> +test_expect_success 'grep <tree-ish> should search outside sparse checkout' '

I think the test is fine but the title seems misleading.  "outside"
and "inside" aren't defined because <tree-ish> isn't known to be
rooted, meaning we have no way to apply the sparsity patterns.  So
perhaps just 'grep <tree-ish> should ignore sparsity patterns'?

> +       commit=$(git rev-parse HEAD) &&
> +       tree=$(git rev-parse HEAD^{tree}) &&
> +       cat >expect_tree <<-EOF &&
> +       $tree:a:text
> +       $tree:b:text
> +       $tree:dir/c:text
> +       EOF
> +       cat >expect_t-tree <<-EOF &&
> +       t-tree:a:text
> +       t-tree:b:text
> +       t-tree:dir/c:text
> +       EOF
> +       git grep "text" $tree >actual_tree &&
> +       test_cmp expect_tree actual_tree &&
> +       git grep "text" t-tree >actual_t-tree &&
> +       test_cmp expect_t-tree actual_t-tree
> +'
> +
> +test_done
> --
> 2.25.1

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-24  6:13 ` [RFC PATCH 3/3] grep: add option to ignore sparsity patterns Matheus Tavares
@ 2020-03-24  7:54   ` Elijah Newren
  2020-03-24 18:30     ` Junio C Hamano
  2020-03-25 23:15     ` Matheus Tavares Bernardino
  0 siblings, 2 replies; 120+ messages in thread
From: Elijah Newren @ 2020-03-24  7:54 UTC (permalink / raw)
  To: Matheus Tavares
  Cc: Git Mailing List, Derrick Stolee, Nguyễn Thái Ngọc

On Mon, Mar 23, 2020 at 11:13 PM Matheus Tavares
<matheus.bernardino@usp.br> wrote:
>
> In the last commit, git-grep learned to honor sparsity patterns. For
> some use cases, however, it may be desirable to search outside the
> sparse checkout. So add the '--ignore-sparsity' option, which restores
> the old behavior. Also add the grep.ignoreSparsity configuration, to
> allow setting this behavior by default.

Should `--ignore-sparsity` be a global git option rather than a
grep-specific one?  Also, should grep.ignoreSparsity rather be
core.ignoreSparsity or core.searchOutsideSparsePaths or something?  In
particular, I want a world where:

* Someone can do a "sparse" clone that is NOT just about
sparse-checkout but also about partial clone.  In particular, it makes
use of partial clones to download only the history for the sparsity
paths, and does a sparse-checkout --cone to get those checked out.
(Or, perhaps, defaults to just downloading history for the toplevel
dir, much like `sparse-checkout init --cone`, and then when the user
runs `sparse-checkout set $dir1 $dir2 ...` then it downloads the extra
bits).
* grep, diff, log, shortlog, blame, bisect (and maybe others) all by
default make use of the sparsity patterns to limit their output (but
can all use whatever flag(s) are added here to search outside the
sparsity pattern cones).  This helps users feel they are in a smaller
repo and searching just their area of interest, and it avoids partial
clones downloading blobs unnecessarily.  Nice for the user, and nice
for the system.
* worktrees behave nicer; when creating a new one it inherits the
sparsity patterns of the parent (again to avoid partial clones having
to download everything, and let users continue working on their area
of interest, though they can disable sparse checkouts at any time, of
course).  Still would like Junio's feedback on this one.
* rebase, merge, cherry-pick, etc. (all via the merge machinery) have
smarter tree-merging logic such that when trees are unchanged on one
or both sides of history, we take advantage of the subset of those
cases where we can avoid traversing into subtrees but can resolve the
merge at the tree level.  This is a performance optimization even when
you have all trees and blobs available, but an even more important one
if you don't want partial clones to suddenly have to download
unnecessary objects.  I have ideas and am working on this as part of
merge-ort.

> Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
> ---
>
> Note: I still have to make --ignore-sparsity be able to work together
> with --untracked. Unfortunatelly, this won't be as simple because the
> codeflow taken by --untracked goes to grep_directory() which just
> iterates the working tree, without looking the index entries. So I will
> have to either: make --untracked use grep_cache(), and grep the
> untracked files later; or try matching the working tree paths against
> the sparsity patterns, without looking for the skip_worktree bit in
> the index (as I mentioned in the previous patch's comments). Any
> preferences regarding these two approaches? (or other suggestions?)

Hmm.  So, 'tracked' in git is the idea that we are keeping information
about specific files.  'sparse-checkout' is the idea that we have a
subset of those that we can work with without materializing all the
other tracked files; it's clearly a subset of the realm of 'tracked'.
'untracked' is about getting everything outside the set of 'tracked'
files, which to me means it is clearly outside the set of sparsity
paths too (and thus you could take --untracked as implying
--ignore-sparsity, though whether you do might not matter in practice
because of the items I'll discuss next).  Of course, I am also
assuming `--untracked` is incompatible with --cached or specifying
revisions or trees (based on its definition of "In addition to
searching in the tracked files in the *working tree*, search also in
untracked files." -- emphasis added.)  If the incompatibility of
--untracked and --cached/REVISIONS/TREES is not enforced, we may want
to look into erroring out if they are given together.  Once we do, we
don't have to worry about grep_cache() at all in the case of
--untracked and shouldn't.  Files with the skip_worktree bit won't
exist in the working directory, and thus won't be searched (this is
what makes --untracked imply --ignore-sparsity not really matter).

In short: With --untracked you are grepping ALL (non-ignored) files in
the working directory -- either because they are both tracked and in
the sparsity paths (anything tracked that isn't in the sparsity paths
has the skip_worktree bit and thus isn't present), or because it is an
untracked file.  [And this may be what grep_directory() already does.]

Does that make sense?

>  Documentation/config/grep.txt   |  3 +++
>  Documentation/git-grep.txt      |  5 ++++
>  builtin/grep.c                  | 19 +++++++++++----
>  t/t7817-grep-sparse-checkout.sh | 42 +++++++++++++++++++++++++++++++++
>  4 files changed, 65 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
> index 76689771aa..c1d49484c8 100644
> --- a/Documentation/config/grep.txt
> +++ b/Documentation/config/grep.txt
> @@ -25,3 +25,6 @@ grep.fullName::
>  grep.fallbackToNoIndex::
>         If set to true, fall back to git grep --no-index if git grep
>         is executed outside of a git repository.  Defaults to false.
> +
> +grep.ignoreSparsity::
> +       If set to true, enable `--ignore-sparsity` by default.
> diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
> index 97e25d7b1b..5c5c66c056 100644
> --- a/Documentation/git-grep.txt
> +++ b/Documentation/git-grep.txt
> @@ -65,6 +65,11 @@ OPTIONS
>         mechanism.  Only useful when searching files in the current directory
>         with `--no-index`.
>
> +--ignore-sparsity::
> +       In a sparse checked out repository (see linkgit:git-sparse-checkout[1]),
> +       also search in files that are outside the sparse checkout. This option
> +       cannot be used with --no-index or --untracked.

If they are outside the sparse checkout, then they are not present on
disk -- so what is this outside stuff that is being searched?  Perhaps
clarify that this is only useful in combination with
--cached/REVISION/TREE, where there do exist paths outside the
sparsity patterns that become relevant?

>  --recurse-submodules::
>         Recursively search in each submodule that has been initialized and
>         checked out in the repository.  When used in combination with the
> diff --git a/builtin/grep.c b/builtin/grep.c
> index 52ec72a036..17eae3edd6 100644
> --- a/builtin/grep.c
> +++ b/builtin/grep.c
> @@ -33,6 +33,8 @@ static char const * const grep_usage[] = {
>
>  static int recurse_submodules;
>
> +static int ignore_sparsity = 0;
> +
>  static int num_threads;
>
>  static pthread_t *threads;
> @@ -292,6 +294,9 @@ static int grep_cmd_config(const char *var, const char *value, void *cb)
>         if (!strcmp(var, "submodule.recurse"))
>                 recurse_submodules = git_config_bool(var, value);
>
> +       if (!strcmp(var, "grep.ignoresparsity"))
> +               ignore_sparsity = git_config_bool(var, value);
> +
>         return st;
>  }
>
> @@ -487,7 +492,7 @@ static int grep_cache(struct grep_opt *opt,
>         for (nr = 0; nr < repo->index->cache_nr; nr++) {
>                 const struct cache_entry *ce = repo->index->cache[nr];
>
> -               if (ce_skip_worktree(ce))
> +               if (!ignore_sparsity && ce_skip_worktree(ce))

Oh boy on the double negatives...maybe we want to rename this flag somehow?
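
One possible way to avoid the double negative, sketched under the
assumption that the flag is inverted and given a positive name.
`restrict_to_sparse_paths` and `should_skip_entry` are hypothetical
names, not anything in builtin/grep.c, and the boolean parameter stands
in for ce_skip_worktree(ce):

```c
#include <stdbool.h>

/*
 * Hypothetical positive-sense flag: true means "limit the search to
 * paths inside the sparse checkout" (the new default in this series).
 */
static bool restrict_to_sparse_paths = true;

/*
 * Decide whether an index entry should be skipped.
 * `skip_worktree` stands in for ce_skip_worktree(ce).
 */
static bool should_skip_entry(bool skip_worktree)
{
	return restrict_to_sparse_paths && skip_worktree;
}
```

With a name like this, the condition reads as "skip when restricting
and the entry is outside", with no negation at the call site.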

>                         continue;
>
>                 strbuf_setlen(&name, name_base_len);
> @@ -502,7 +507,8 @@ static int grep_cache(struct grep_opt *opt,
>                          * cache entry are identical, even if worktree file has
>                          * been modified, so use cache version instead
>                          */
> -                       if (cached || (ce->ce_flags & CE_VALID)) {
> +                       if (cached || (ce->ce_flags & CE_VALID) ||
> +                           ce_skip_worktree(ce)) {
>                                 if (ce_stage(ce) || ce_intent_to_add(ce))
>                                         continue;
>                                 hit |= grep_oid(opt, &ce->oid, name.buf,
> @@ -549,7 +555,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>                 name_base_len = name.len;
>         }
>
> -       if (from_commit && repo_read_index(repo) < 0)
> +       if (!ignore_sparsity && from_commit && repo_read_index(repo) < 0)
>                 die(_("index file corrupt"));
>
>         while (tree_entry(tree, &entry)) {
> @@ -570,7 +576,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>
>                 strbuf_add(base, entry.path, te_len);
>
> -               if (from_commit) {
> +               if (!ignore_sparsity && from_commit) {
>                         int pos = index_name_pos(repo->index,
>                                                  base->buf + tn_len,
>                                                  base->len - tn_len);
> @@ -932,6 +938,8 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
>                 OPT_BOOL_F(0, "ext-grep", &external_grep_allowed__ignored,
>                            N_("allow calling of grep(1) (ignored by this build)"),
>                            PARSE_OPT_NOCOMPLETE),
> +               OPT_BOOL(0, "ignore-sparsity", &ignore_sparsity,
> +                        N_("also search in files outside the sparse checkout")),
>                 OPT_END()
>         };
>
> @@ -1073,6 +1081,9 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
>         if (recurse_submodules && untracked)
>                 die(_("--untracked not supported with --recurse-submodules"));
>
> +       if (ignore_sparsity && (!use_index || untracked))
> +               die(_("--no-index or --untracked cannot be used with --ignore-sparsity"));
> +
>         if (show_in_pager) {
>                 if (num_threads > 1)
>                         warning(_("invalid option combination, ignoring --threads"));
> diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> index fccf44e829..1891ddea57 100755
> --- a/t/t7817-grep-sparse-checkout.sh
> +++ b/t/t7817-grep-sparse-checkout.sh
> @@ -85,4 +85,46 @@ test_expect_success 'grep <tree-ish> should search outside sparse checkout' '
>         test_cmp expect_t-tree actual_t-tree
>  '
>
> +for cmd in 'git grep --ignore-sparsity' 'git -c grep.ignoreSparsity grep' \
> +          'git -c grep.ignoreSparsity=false grep --ignore-sparsity'
> +do
> +       test_expect_success "$cmd should search outside sparse checkout" '
> +               cat >expect <<-EOF &&
> +               a:text
> +               b:text
> +               dir/c:text
> +               EOF
> +               $cmd "text" >actual &&
> +               test_cmp expect actual
> +       '
> +
> +       test_expect_success "$cmd --cached should search outside sparse checkout" '
> +               cat >expect <<-EOF &&
> +               a:text
> +               b:text
> +               dir/c:text
> +               EOF
> +               $cmd --cached "text" >actual &&
> +               test_cmp expect actual
> +       '
> +
> +       test_expect_success "$cmd <commit-ish> should search outside sparse checkout" '
> +               commit=$(git rev-parse HEAD) &&
> +               cat >expect_commit <<-EOF &&
> +               $commit:a:text
> +               $commit:b:text
> +               $commit:dir/c:text
> +               EOF
> +               cat >expect_t-commit <<-EOF &&
> +               t-commit:a:text
> +               t-commit:b:text
> +               t-commit:dir/c:text
> +               EOF
> +               $cmd "text" $commit >actual_commit &&
> +               test_cmp expect_commit actual_commit &&
> +               $cmd "text" t-commit >actual_t-commit &&
> +               test_cmp expect_t-commit actual_t-commit
> +       '
> +done
> +
>  test_done
> --
> 2.25.1

I think there are several things that we need to straighten out first
and will affect a lot of this patch quite a bit:
* The feedback from the previous patch that the revision handling
should use sparsity patterns rather than ce_skip_worktree() is going
to affect this patch a fair amount.
* I think the fact that --ignore-sparsity is meaningless without
--cached or a REVISION or TREE may also affect things.
* The decision about how to globally name and set the
"ignore-sparsity" bit without requiring users to set it for each and
every subcommand will change this patch a bit too.


I'm super excited to see work in this area.  I hope I'm not
discouraging you by attempting to provide what I think is the bigger
picture I'd like us to work towards.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH 1/3] doc: grep: unify info on configuration variables
  2020-03-24  6:11 ` [RFC PATCH 1/3] doc: grep: unify info on configuration variables Matheus Tavares
@ 2020-03-24  7:57   ` Elijah Newren
  2020-03-24 21:26     ` Junio C Hamano
  0 siblings, 1 reply; 120+ messages in thread
From: Elijah Newren @ 2020-03-24  7:57 UTC (permalink / raw)
  To: Matheus Tavares; +Cc: Git Mailing List, Derrick Stolee, brian m. carlson

On Mon, Mar 23, 2020 at 11:11 PM Matheus Tavares
<matheus.bernardino@usp.br> wrote:
>
> Explanations about the configuration variables for git-grep are
> duplicated in "Documentation/git-grep.txt" and
> "Documentation/config/grep.txt". Let's unify the information in the
> second file and include it in the first.
>
> Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
> ---
>  Documentation/config/grep.txt |  7 +++++--
>  Documentation/git-grep.txt    | 35 +++++------------------------------
>  2 files changed, 10 insertions(+), 32 deletions(-)
>
> diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
> index 44abe45a7c..76689771aa 100644
> --- a/Documentation/config/grep.txt
> +++ b/Documentation/config/grep.txt
> @@ -16,8 +16,11 @@ grep.extendedRegexp::
>         other than 'default'.
>
>  grep.threads::
> -       Number of grep worker threads to use.
> -       See `grep.threads` in linkgit:git-grep[1] for more information.
> +       Number of grep worker threads to use. See `--threads` in
> +       linkgit:git-grep[1] for more information.
> +
> +grep.fullName::
> +       If set to true, enable `--full-name` option by default.
>
>  grep.fallbackToNoIndex::
>         If set to true, fall back to git grep --no-index if git grep
> diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
> index ddb6acc025..97e25d7b1b 100644
> --- a/Documentation/git-grep.txt
> +++ b/Documentation/git-grep.txt
> @@ -41,34 +41,7 @@ characters.  An empty string as search expression matches all lines.
>  CONFIGURATION
>  -------------
>
> -grep.lineNumber::
> -       If set to true, enable `-n` option by default.
> -
> -grep.column::
> -       If set to true, enable the `--column` option by default.
> -
> -grep.patternType::
> -       Set the default matching behavior. Using a value of 'basic', 'extended',
> -       'fixed', or 'perl' will enable the `--basic-regexp`, `--extended-regexp`,
> -       `--fixed-strings`, or `--perl-regexp` option accordingly, while the
> -       value 'default' will return to the default matching behavior.
> -
> -grep.extendedRegexp::
> -       If set to true, enable `--extended-regexp` option by default. This
> -       option is ignored when the `grep.patternType` option is set to a value
> -       other than 'default'.
> -
> -grep.threads::
> -       Number of grep worker threads to use. If unset (or set to 0), Git will
> -       use as many threads as the number of logical cores available.
> -
> -grep.fullName::
> -       If set to true, enable `--full-name` option by default.
> -
> -grep.fallbackToNoIndex::
> -       If set to true, fall back to git grep --no-index if git grep
> -       is executed outside of a git repository.  Defaults to false.
> -
> +include::config/grep.txt[]
>
>  OPTIONS
>  -------
> @@ -267,8 +240,10 @@ providing this option will cause it to die.
>         found.
>
>  --threads <num>::
> -       Number of grep worker threads to use.
> -       See `grep.threads` in 'CONFIGURATION' for more information.
> +       Number of grep worker threads to use. If not provided (or set to
> +       0), Git will use as many worker threads as the number of logical
> +       cores available. The default value can also be set with the
> +       `grep.threads` configuration (see linkgit:git-config[1]).

I'm possibly showing my ignorance here, but doesn't the
"include::config/grep.txt[]" you added above mean that the user
doesn't have to see an external manpage but can see the definition
earlier within this same manpage?

>
>  -f <file>::
>         Read patterns from <file>, one per line.
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH 2/3] grep: honor sparse checkout patterns
  2020-03-24  7:15   ` Elijah Newren
@ 2020-03-24 15:12     ` Derrick Stolee
  2020-03-24 16:16       ` Elijah Newren
  2020-03-24 23:01       ` Matheus Tavares Bernardino
  2020-03-24 22:55     ` Matheus Tavares Bernardino
  1 sibling, 2 replies; 120+ messages in thread
From: Derrick Stolee @ 2020-03-24 15:12 UTC (permalink / raw)
  To: Elijah Newren, Matheus Tavares
  Cc: Git Mailing List, Derrick Stolee, brian m. carlson, Stefan Beller

On 3/24/2020 3:15 AM, Elijah Newren wrote:
> Hi Matheus,
> 
> On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
> <matheus.bernardino@usp.br> wrote:
>>
>> One of the main uses for a sparse checkout is to allow users to focus on
>> the subset of files in a repository in which they are interested. But
>> git-grep currently ignores the sparsity patterns and report all matches
>> found outside this subset, which kind of goes in the oposity direction.
>> Let's fix that, making it honor the sparsity boundaries for every
>> grepping case:
>>
>> - git grep in worktree
>> - git grep --cached
>> - git grep $REVISION
> 
> Wahoo!  This is great.

I am also excited. Also thrilled to see the option to get the old
behavior in the next patch.

>> Something I'm not entirely sure in this patch is how we implement the
>> mechanism to honor sparsity for the `git grep <commit-ish>` case (which
>> is treated in the grep_tree() function). Currently, the patch looks for
>> an index entry that matches the path, and then checks its skip_worktree
> 
> As you discuss below, checking the index is both wrong _and_ costly.

I'm not sure why checking the index is _wrong_, but I agree about the
performance cost.

> You should use the sparsity patterns; Stolee did a lot of work to make
> those correspond to simple hashes you could check to determine whether
> to even walk into a subdirectory.  So, O(1).  Yeah, that's "only" cone
> mode but the non-cone sparsity patterns were a performance nightmare
> waiting to rear its ugly head.  We should just try to encourage
> everyone to move to cone mode, or accept the slowness they get without
> it.
> 
>> bit. But this operation is perfomed in O(log(N)); N being the number of
>> index entries. If there are many entries (and no so many sparsity
>> patterns), maybe a better approach would be to try matching the path
>> directly against the sparsity patterns. This would be O(M) in the number
>> of patterns, and it could be done, in builtin/grep.c, with a function
>> like the following:
>>
>> static struct pattern_list sparsity_patterns;
>> static int sparsity_patterns_initialized = 0;
>> static enum pattern_match_result path_matches_sparsity_patterns(
>>                                         const char *path, int pathlen,
>>                                         const char *basename,
>>                                         struct repository *repo)
>> {
>>         int dtype = DT_UNKNOWN;
>>
>>         if (!sparsity_patterns_initialized) {
>>                 char *sparse_file = git_pathdup("info/sparse-checkout");
>>                 int ret;
>>
>>                 memset(&sparsity_patterns, 0, sizeof(sparsity_patterns));
>>                 sparsity_patterns.use_cone_patterns = core_sparse_checkout_cone;
>>                 ret = add_patterns_from_file_to_list(sparse_file, "", 0,
>>                                                      &sparsity_patterns, NULL);
>>                 free(sparse_file);
>>
>>                 if (ret < 0)
>>                         die(_("failed to load sparse-checkout patterns"));
>>                 sparsity_patterns_initialized = 1;
>>         }
>>
>>         return path_matches_pattern_list(path, pathlen, basename, &dtype,
>>                                          &sparsity_patterns, repo->index);
>> }
>>
>> Also, if I understand correctly, the index doesn't hold paths to dirs,
>> right? So even if a complete dir is excluded from sparse checkout, we
>> still have to check all its subentries, only to discover that they
>> should all be skipped from the search. However, if we were to check
>> against the sparsity patterns directly (e.g. with the function above),
>> we could skip such directories together with all their entries.

When in cone mode, we can check if a directory is one of these three
modes:

1. Completely contained in the cone (recursive match)
2. Completely outside the cone
3. Neither. Keep matching subdirectories. (parent match)

The clear_ce_flags() code in dir.c includes the matching algorithms
for this. Hopefully you can re-use a lot of it. You may need to extract
some methods to use them from the grep code.
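
A rough, self-contained sketch of that three-way classification, for
illustration only. The names (`cone_match`, `classify_dir`,
`recursive_dirs`) and the example directories are hypothetical, and a
linear scan over a fixed array stands in for the hashed lookups that
the real cone-mode code uses:

```c
#include <string.h>

/* Hypothetical result type mirroring the three cases listed above. */
enum cone_match {
	CONE_OUTSIDE,   /* 2. completely outside the cone */
	CONE_PARENT,    /* 3. neither; keep matching subdirectories */
	CONE_RECURSIVE  /* 1. completely contained in the cone */
};

/* Example directories, as if passed to `git sparse-checkout set`. */
static const char *recursive_dirs[] = { "src/lib", "docs" };
#define NR_RECURSIVE (sizeof(recursive_dirs) / sizeof(recursive_dirs[0]))

/* Return 1 if `path` equals `dir` or is located somewhere beneath it. */
static int is_inside(const char *path, const char *dir)
{
	size_t len = strlen(dir);

	return !strncmp(path, dir, len) &&
	       (path[len] == '\0' || path[len] == '/');
}

/* Classify a directory path (no trailing slash) against the cone. */
static enum cone_match classify_dir(const char *dir)
{
	size_t i;

	/* Inside a recursively included directory: take everything. */
	for (i = 0; i < NR_RECURSIVE; i++)
		if (is_inside(dir, recursive_dirs[i]))
			return CONE_RECURSIVE;

	/* An ancestor of an included directory: keep descending. */
	for (i = 0; i < NR_RECURSIVE; i++)
		if (is_inside(recursive_dirs[i], dir))
			return CONE_PARENT;

	return CONE_OUTSIDE;
}
```

With these example patterns, classify_dir("src") reports a parent
match, so a tree walk would keep descending, while
classify_dir("test") reports a directory that can be skipped
wholesale. The sketch only classifies directories; in real cone mode,
files that sit directly in a parent directory are also part of the
checkout.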

>> Oh, and there is also the case of a commit whose tree paths are not in
>> the index (maybe manually created objects?). For such commits, with the
>> index lookup approach, we would have to fall back on ignoring the
>> sparsity rules. I'm not sure if that would be OK, though.
>>
>> Any thoughts on these two approaches (looking up the skip_worktree bit
>> in the index or directly matching against sparsity patterns), will be
>> highly appreciated. (Note that it only concerns the `git grep
>> <commit-ish>` case. The other cases already iterate thought the index, so
>> there is no O(log(N)) extra complexity).
>>
>>  builtin/grep.c                   | 29 ++++++++---
>>  t/t7011-skip-worktree-reading.sh |  9 ----
>>  t/t7817-grep-sparse-checkout.sh  | 88 ++++++++++++++++++++++++++++++++
>>  3 files changed, 111 insertions(+), 15 deletions(-)
>>  create mode 100755 t/t7817-grep-sparse-checkout.sh
>>
>> diff --git a/builtin/grep.c b/builtin/grep.c
>> index 99e2685090..52ec72a036 100644
>> --- a/builtin/grep.c
>> +++ b/builtin/grep.c
>> @@ -388,7 +388,7 @@ static int grep_cache(struct grep_opt *opt,
>>                       const struct pathspec *pathspec, int cached);
>>  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>>                      struct tree_desc *tree, struct strbuf *base, int tn_len,
>> -                    int check_attr);
>> +                    int from_commit);
> 
> I'm not familiar with grep.c and have to admit I don't know what
> "check_attr" means.  Slightly surprised to see you replace it, but
> maybe reading the rest will explain...
> 
>>
>>  static int grep_submodule(struct grep_opt *opt,
>>                           const struct pathspec *pathspec,
>> @@ -486,6 +486,10 @@ static int grep_cache(struct grep_opt *opt,
>>
>>         for (nr = 0; nr < repo->index->cache_nr; nr++) {
>>                 const struct cache_entry *ce = repo->index->cache[nr];
>> +
>> +               if (ce_skip_worktree(ce))
>> +                       continue;
>> +
> 
> Looks good for the case where we are grepping through what's cached.
> 
>>                 strbuf_setlen(&name, name_base_len);
>>                 strbuf_addstr(&name, ce->name);
>>
>> @@ -498,8 +502,7 @@ static int grep_cache(struct grep_opt *opt,
>>                          * cache entry are identical, even if worktree file has
>>                          * been modified, so use cache version instead
>>                          */
>> -                       if (cached || (ce->ce_flags & CE_VALID) ||
>> -                           ce_skip_worktree(ce)) {
>> +                       if (cached || (ce->ce_flags & CE_VALID)) {
> 
> I had the same change when I was trying to hack something like this
> patch into place, but I only handled the worktree case before I realized
> it was a somewhat bigger job.
> 
>>                                 if (ce_stage(ce) || ce_intent_to_add(ce))
>>                                         continue;
>>                                 hit |= grep_oid(opt, &ce->oid, name.buf,
>> @@ -532,7 +535,7 @@ static int grep_cache(struct grep_opt *opt,
>>
>>  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>>                      struct tree_desc *tree, struct strbuf *base, int tn_len,
>> -                    int check_attr)
>> +                    int from_commit)
>>  {
>>         struct repository *repo = opt->repo;
>>         int hit = 0;
>> @@ -546,6 +549,9 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>>                 name_base_len = name.len;
>>         }
>>
>> +       if (from_commit && repo_read_index(repo) < 0)
>> +               die(_("index file corrupt"));
>> +
> 
> As above, I don't think we should need to read the index.  We should
> compare to sparsity patterns, which in the important case (cone mode)
> simplifies to a hash lookup as we walk directories.
> 
>>         while (tree_entry(tree, &entry)) {
>>                 int te_len = tree_entry_len(&entry);
>>
>> @@ -564,9 +570,20 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>>
>>                 strbuf_add(base, entry.path, te_len);
>>
>> +               if (from_commit) {
>> +                       int pos = index_name_pos(repo->index,
>> +                                                base->buf + tn_len,
>> +                                                base->len - tn_len);
>> +                       if (pos >= 0 &&
>> +                           ce_skip_worktree(repo->index->cache[pos])) {
>> +                               strbuf_setlen(base, old_baselen);
>> +                               continue;
>> +                       }
>> +               }
>> +
>>                 if (S_ISREG(entry.mode)) {
>>                         hit |= grep_oid(opt, &entry.oid, base->buf, tn_len,
>> -                                        check_attr ? base->buf + tn_len : NULL);
>> +                                       from_commit ? base->buf + tn_len : NULL);
> 
> Sadly, this doesn't help me understand check_attr or from_commit.
> Could you clue me in a bit?

Yeah, Elijah and I know the sparse-checkout code quite well, but are
unfamiliar with grep. Let's all expand our knowledge!

>>                 } else if (S_ISDIR(entry.mode)) {
>>                         enum object_type type;
>>                         struct tree_desc sub;
>> @@ -581,7 +598,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>>                         strbuf_addch(base, '/');
>>                         init_tree_desc(&sub, data, size);
>>                         hit |= grep_tree(opt, pathspec, &sub, base, tn_len,
>> -                                        check_attr);
>> +                                        from_commit);
> 
> Same.
> 
>>                         free(data);
>>                 } else if (recurse_submodules && S_ISGITLINK(entry.mode)) {
>>                         hit |= grep_submodule(opt, pathspec, &entry.oid,
>> diff --git a/t/t7011-skip-worktree-reading.sh b/t/t7011-skip-worktree-reading.sh
>> index 37525cae3a..26852586ac 100755
>> --- a/t/t7011-skip-worktree-reading.sh
>> +++ b/t/t7011-skip-worktree-reading.sh
>> @@ -109,15 +109,6 @@ test_expect_success 'ls-files --modified' '
>>         test -z "$(git ls-files -m)"
>>  '
>>
>> -test_expect_success 'grep with skip-worktree file' '
>> -       git update-index --no-skip-worktree 1 &&
>> -       echo test > 1 &&
>> -       git update-index 1 &&
>> -       git update-index --skip-worktree 1 &&
>> -       rm 1 &&
>> -       test "$(git grep --no-ext-grep test)" = "1:test"
>> -'
>> -
>>  echo ":000000 100644 $ZERO_OID $EMPTY_BLOB A   1" > expected
>>  test_expect_success 'diff-index does not examine skip-worktree absent entries' '
>>         setup_absent &&
>> diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
>> new file mode 100755
>> index 0000000000..fccf44e829
>> --- /dev/null
>> +++ b/t/t7817-grep-sparse-checkout.sh
>> @@ -0,0 +1,88 @@
>> +#!/bin/sh
>> +
>> +test_description='grep in sparse checkout
>> +
>> +This test creates the following dir structure:
>> +.
>> +| - a
>> +| - b
>> +| - dir
>> +    | - c
>> +
>> +Only "a" should be present due to the sparse checkout patterns:
>> +"/*", "!/b" and "!/dir".
>> +'
>> +
>> +. ./test-lib.sh
>> +
>> +test_expect_success 'setup' '
>> +       echo "text" >a &&
>> +       echo "text" >b &&
>> +       mkdir dir &&
>> +       echo "text" >dir/c &&
>> +       git add a b dir &&
>> +       git commit -m "initial commit" &&
>> +       git tag -am t-commit t-commit HEAD &&
>> +       tree=$(git rev-parse HEAD^{tree}) &&
>> +       git tag -am t-tree t-tree $tree &&
>> +       cat >.git/info/sparse-checkout <<-EOF &&
>> +       /*
>> +       !/b
>> +       !/dir
>> +       EOF
>> +       git sparse-checkout init &&
> 
> Using `git sparse-checkout init` but then manually writing to
> .git/info/sparse-checkout?  Seems like it'd make more sense to use
> `git sparse-checkout set` than writing the patterns directly yourself.
> Also, would prefer to have the examples use cone mode (even if you
> have to add subdirectories), as it makes the testcase a bit easier to
> read and more performant, though neither is a big deal.

I agree that we should use the builtin so your test script is less
brittle to potential back-end changes to sparse-checkout (none planned).

I do recommend having at least one test with non-cone mode patterns,
especially if you are checking the pattern-matching yourself instead of
relying on the index.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH 2/3] grep: honor sparse checkout patterns
  2020-03-24 15:12     ` Derrick Stolee
@ 2020-03-24 16:16       ` Elijah Newren
  2020-03-24 17:02         ` Derrick Stolee
  2020-03-24 23:01       ` Matheus Tavares Bernardino
  1 sibling, 1 reply; 120+ messages in thread
From: Elijah Newren @ 2020-03-24 16:16 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Matheus Tavares, Git Mailing List, Derrick Stolee,
	brian m. carlson, Stefan Beller

On Tue, Mar 24, 2020 at 8:12 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 3/24/2020 3:15 AM, Elijah Newren wrote:
> > Hi Matheus,
> >
> > On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
...
> >> Something I'm not entirely sure in this patch is how we implement the
> >> mechanism to honor sparsity for the `git grep <commit-ish>` case (which
> >> is treated in the grep_tree() function). Currently, the patch looks for
> >> an index entry that matches the path, and then checks its skip_worktree
> >
> > As you discuss below, checking the index is both wrong _and_ costly.
>
> I'm not sure why checking the index is _wrong_, but I agree about the
> performance cost.

Let's say there are two directories, dir1 and dir2.  Over time, there
have existed a total of six files:
   dir1/{a,b,c}
   dir2/{d,e,f}
At the current time, there are only four files in the index:
   dir1/{a,b}
   dir2/{d,e}
And the user has done a `git sparse-checkout set dir2` and then at
some point later run `git grep foobar OTHERCOMMIT`.  What happens?

Well, since we're in a sparse checkout, we should only search the
relevant paths within OTHERCOMMIT for "foobar".  Let's say we attempt
to figure out the "relevant paths" using the index.  We can tell that
dir1/a and dir1/b are marked as SKIP_WORKTREE so we don't search them.
dir1/c is untracked -- what do we do with it?  Include it?  Exclude
it?  Carrying on with the other files, dir2/d and dir2/e are tracked
and !SKIP_WORKTREE so we search them.  dir2/f is untracked -- what do
we do with it?  Include it?  Exclude it?

We're left without the necessary information to tell whether we should
search OTHERCOMMIT's dir1/c and dir2/f if we consult the index.  Any
decision we make is going to be wrong for one of the two paths.

If we instead do not attempt to consult the index (which corresponds
to a version close to HEAD) in order to ask questions about the
completely different OTHERCOMMIT, but instead use the sparsity
patterns to query whether those files/directories are interesting,
then we get the right answer.  The index can only be consulted for the
right answer in the case of --cached; in all other cases (including
OTHERCOMMIT == HEAD), we should use the sparsity patterns.  In fact,
we could also use the sparsity patterns in the case of --cached, it's
just that for that one particular case consulting the index will also
give the right answer.
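
To make the contrast concrete, here is a minimal sketch of a
pattern-based decision for the example above. It is only an
illustration under simplifying assumptions: `path_in_sparse_dir` is a
made-up helper (not a git API), and it handles a single cone directory
by plain prefix matching, whereas real cone-mode matching also covers
files at the root and in parent directories. The point is that the
answer depends only on the path and the pattern, so it is well defined
for dir1/c and dir2/f even though neither is in the index.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper, not a git API: returns true if `path` lies inside
 * the directory `cone_dir` (given without a trailing slash).
 */
static bool path_in_sparse_dir(const char *path, const char *cone_dir)
{
	size_t len = strlen(cone_dir);

	/* The path must start with "<cone_dir>/" to be inside the directory. */
	return strncmp(path, cone_dir, len) == 0 && path[len] == '/';
}
```

With `git sparse-checkout set dir2`, this would search OTHERCOMMIT's
dir2/f and skip its dir1/c, without ever consulting the index.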

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH 2/3] grep: honor sparse checkout patterns
  2020-03-24 16:16       ` Elijah Newren
@ 2020-03-24 17:02         ` Derrick Stolee
  0 siblings, 0 replies; 120+ messages in thread
From: Derrick Stolee @ 2020-03-24 17:02 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Matheus Tavares, Git Mailing List, Derrick Stolee,
	brian m. carlson, Stefan Beller

On 3/24/2020 12:16 PM, Elijah Newren wrote:
> On Tue, Mar 24, 2020 at 8:12 AM Derrick Stolee <stolee@gmail.com> wrote:
>>
>> On 3/24/2020 3:15 AM, Elijah Newren wrote:
>>> Hi Matheus,
>>>
>>> On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
> ...
>>>> Something I'm not entirely sure in this patch is how we implement the
>>>> mechanism to honor sparsity for the `git grep <commit-ish>` case (which
>>>> is treated in the grep_tree() function). Currently, the patch looks for
>>>> an index entry that matches the path, and then checks its skip_worktree
>>>
>>> As you discuss below, checking the index is both wrong _and_ costly.
>>
>> I'm not sure why checking the index is _wrong_, but I agree about the
>> performance cost.
> 
> Let's say there are two directories, dir1 and dir2.  Over time, there
> have existed a total of six files:
>    dir1/{a,b,c}
>    dir2/{d,e,f}
> At the current time, there are only four files in the index:
>    dir1/{a,b}
>    dir2/{d,e}
> And the user has done a `git sparse-checkout set dir2` and then at
> some point later run `git grep foobar OTHERCOMMIT`.  What happens?
> 
> Well, since we're in a sparse checkout, we should only search the
> relevant paths within OTHERCOMMIT for "foobar".  Let's say we attempt
> to figure out the "relevant paths" using the index.  We can tell that
> dir1/a and dir1/b are marked as SKIP_WORKTREE so we don't search them.
> dir1/c is untracked -- what do we do with it?  Include it?  Exclude
> it?  Carrying on with the other files, dir2/d and dir2/e are tracked
> and !SKIP_WORKTREE so we search them.  dir2/f is untracked -- what do
> we do with it?  Include it?  Exclude it?
> 
> We're left without the necessary information to tell whether we should
> search OTHERCOMMIT's dir1/c and dir2/f if we consult the index.  Any
> decision we make is going to be wrong for one of the two paths.
> 
> If we instead do not attempt to consult the index (which corresponds
> to a version close to HEAD) in order to ask questions about the
> completely different OTHERCOMMIT, but instead use the sparsity
> patterns to query whether those files/directories are interesting,
> then we get the right answer.  The index can only be consulted for the
> right answer in the case of --cached; in all other cases (including
> OTHERCOMMIT == HEAD), we should use the sparsity patterns.  In fact,
> we could also use the sparsity patterns in the case of --cached, it's
> just that for that one particular case consulting the index will also
> give the right answer.

Thanks! This helps a lot.

-Stolee


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-24  7:54   ` Elijah Newren
@ 2020-03-24 18:30     ` Junio C Hamano
  2020-03-24 19:07       ` Elijah Newren
  2020-03-30  3:23       ` Matheus Tavares Bernardino
  2020-03-25 23:15     ` Matheus Tavares Bernardino
  1 sibling, 2 replies; 120+ messages in thread
From: Junio C Hamano @ 2020-03-24 18:30 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Matheus Tavares, Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc

Elijah Newren <newren@gmail.com> writes:

> On Mon, Mar 23, 2020 at 11:13 PM Matheus Tavares
> <matheus.bernardino@usp.br> wrote:
>>
>> In the last commit, git-grep learned to honor sparsity patterns. For
>> some use cases, however, it may be desirable to search outside the
>> sparse checkout. So add the '--ignore-sparsity' option, which restores
>> the old behavior. Also add the grep.ignoreSparsity configuration, to
>> allow setting this behavior by default.
>
> Should `--ignore-sparsity` be a global git option rather than a
> grep-specific one?  Also, should grep.ignoreSparsity rather be
> core.ignoreSparsity or core.searchOutsideSparsePaths or something?

Great question.  I think "git diff" with various options would also
want to optionally be confined within the sparse cone, or to check the
entire world by lazily fetching outside the sparsity.

> * grep, diff, log, shortlog, blame, bisect (and maybe others) all by
> default make use of the sparsity patterns to limit their output (but
> can all use whatever flag(s) are added here to search outside the
> sparsity pattern cones).  This helps users feel they are in a smaller
> repo and searching just their area of interest, and it avoids partial
> clones downloading blobs unnecessarily.  Nice for the user, and nice
> for the system.

I am not sure which one should be the default.  From historical
point of view that sparse stuff was done as an optimization to omit
initial work and lazily give the whole world, I may have slight
preference to the "we pretend that you have everything, just some
parts may be slower to come to you" world view to be the default,
with an option to limit the view to whatever sparsity you initially
set up.  Regardless of the choice of the default, it would be a good
idea to make the subcommands consistently offer the same default and
allow the non-default views with the same UI.



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-24 18:30     ` Junio C Hamano
@ 2020-03-24 19:07       ` Elijah Newren
  2020-03-25 20:18         ` Junio C Hamano
  2020-03-30  3:23       ` Matheus Tavares Bernardino
  1 sibling, 1 reply; 120+ messages in thread
From: Elijah Newren @ 2020-03-24 19:07 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Matheus Tavares, Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc

On Tue, Mar 24, 2020 at 11:30 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> > On Mon, Mar 23, 2020 at 11:13 PM Matheus Tavares
> > <matheus.bernardino@usp.br> wrote:
> >>
> >> In the last commit, git-grep learned to honor sparsity patterns. For
> >> some use cases, however, it may be desirable to search outside the
> >> sparse checkout. So add the '--ignore-sparsity' option, which restores
> >> the old behavior. Also add the grep.ignoreSparsity configuration, to
> >> allow setting this behavior by default.
> >
> > Should `--ignore-sparsity` be a global git option rather than a
> > grep-specific one?  Also, should grep.ignoreSparsity rather be
> > core.ignoreSparsity or core.searchOutsideSparsePaths or something?
>
> Great question.  I think "git diff" with various options would also
> want to optionally be able to be confined within the sparse cone, or
> checking the entire world by lazily fetching outside the sparsity.
>
> > * grep, diff, log, shortlog, blame, bisect (and maybe others) all by
> > default make use of the sparsity patterns to limit their output (but
> > can all use whatever flag(s) are added here to search outside the
> > sparsity pattern cones).  This helps users feel they are in a smaller
> > repo and searching just their area of interest, and it avoids partial
> > clones downloading blobs unnecessarily.  Nice for the user, and nice
> > for the system.
>
> I am not sure which one should be the default.  From historical
> point of view that sparse stuff was done as an optimization to omit
> initial work and lazily give the whole world, I may have slight
> preference to the "we pretend that you have everything, just some
> parts may be slower to come to you" world view to be the default,
> with an option to limit the view to whatever sparsity you initially
> set up.

It sounds like you are describing partial clone rather than sparse
checkout?  Or perhaps you're trying to blur the distinction,
suggesting the two should be used together, with the partial clone
machinery learning to download history within the specified sparse
cones?

>  Regardless of the choice of the default, it would be a good
> idea to make the subcommands consistently offer the same default and
> allow the non-default views with the same UI.

Agreed.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH 1/3] doc: grep: unify info on configuration variables
  2020-03-24  7:57   ` Elijah Newren
@ 2020-03-24 21:26     ` Junio C Hamano
  2020-03-24 23:38       ` Matheus Tavares
  0 siblings, 1 reply; 120+ messages in thread
From: Junio C Hamano @ 2020-03-24 21:26 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Matheus Tavares, Git Mailing List, Derrick Stolee, brian m. carlson

Elijah Newren <newren@gmail.com> writes:

>> diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
>> index 44abe45a7c..76689771aa 100644
>> --- a/Documentation/config/grep.txt
>> +++ b/Documentation/config/grep.txt
>> @@ -16,8 +16,11 @@ grep.extendedRegexp::
>> ...
>> +       Number of grep worker threads to use. See `--threads` in
>> +       linkgit:git-grep[1] for more information.
>> ...
>> diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
>> index ddb6acc025..97e25d7b1b 100644
>> --- a/Documentation/git-grep.txt
>> +++ b/Documentation/git-grep.txt
>> @@ -41,34 +41,7 @@ characters.  An empty string as search expression matches all lines.
>> ...
>> +include::config/grep.txt[]
>> ...
>>  --threads <num>::
>> -       Number of grep worker threads to use.
>> -       See `grep.threads` in 'CONFIGURATION' for more information.
>> +       Number of grep worker threads to use. If not provided (or set to
>> +       0), Git will use as many worker threads as the number of logical
>> +       cores available. The default value can also be set with the
>> +       `grep.threads` configuration (see linkgit:git-config[1]).
>
> I'm possibly showing my ignorance here, but doesn't the
> "include::config/grep.txt[]" you added above mean that the user
> doesn't have to see an external manpage but can see the definition
> earlier within this same manpage?

I think so.  Also, the new reference "See `--threads` in git-grep"
added to grep.threads to config/grep.txt would become somewhat
redundant in the context of "git grep --help" (only "See --threads"
is relevant when it appears in this same manual page).

Readers who find the reference in "git config --help" still need
to see that --threads is an option to git-grep, though.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH 2/3] grep: honor sparse checkout patterns
  2020-03-24  7:15   ` Elijah Newren
  2020-03-24 15:12     ` Derrick Stolee
@ 2020-03-24 22:55     ` Matheus Tavares Bernardino
  2020-04-21  2:10       ` Matheus Tavares Bernardino
  1 sibling, 1 reply; 120+ messages in thread
From: Matheus Tavares Bernardino @ 2020-03-24 22:55 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git Mailing List, Derrick Stolee, brian m. carlson, Stefan Beller

On Tue, Mar 24, 2020 at 4:15 AM Elijah Newren <newren@gmail.com> wrote:
>
> On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
> <matheus.bernardino@usp.br> wrote:
> >
> > Something I'm not entirely sure in this patch is how we implement the
> > mechanism to honor sparsity for the `git grep <commit-ish>` case (which
> > is treated in the grep_tree() function). Currently, the patch looks for
> > an index entry that matches the path, and then checks its skip_worktree
>
> As you discuss below, checking the index is both wrong _and_ costly.
> You should use the sparsity patterns; Stolee did a lot of work to make
> those correspond to simple hashes you could check to determine whether
> to even walk into a subdirectory.  So, O(1).  Yeah, that's "only" cone
> mode but the non-cone sparsity patterns were a performance nightmare
> waiting to rear its ugly head.  We should just try to encourage
> everyone to move to cone mode, or accept the slowness they get without
> it.

OK, makes sense. And your reply to Stolee, later in this thread, made
it clearer for me why checking the index is not only costly but also
wrong. Thanks for the great explanation! I will use the sparsity
patterns directly, in the next iteration.

> > diff --git a/builtin/grep.c b/builtin/grep.c
> > index 99e2685090..52ec72a036 100644
> > --- a/builtin/grep.c
> > +++ b/builtin/grep.c
> > @@ -388,7 +388,7 @@ static int grep_cache(struct grep_opt *opt,
> >                       const struct pathspec *pathspec, int cached);
> >  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> >                      struct tree_desc *tree, struct strbuf *base, int tn_len,
> > -                    int check_attr);
> > +                    int from_commit);
>
> I'm not familiar with grep.c and have to admit I don't know what
> "check_attr" means. Slightly surprised to see you replace it, but
> maybe reading the rest will explain...
...
>>                 if (S_ISREG(entry.mode)) {
>>                         hit |= grep_oid(opt, &entry.oid, base->buf, tn_len,
>> -                                        check_attr ? base->buf + tn_len : NULL);
>> +                                       from_commit ? base->buf + tn_len : NULL);
>
> Sadly, this doesn't help me understand check_attr or from_commit.
> Could you clue me in a bit?

Sure! The grep machinery can optionally look at the .gitattributes file,
to see if a given path has a "diff" attribute assigned to it. This
attribute points to a diff driver in .gitconfig, which can specify
many things, such as whether the path should be treated as a binary or
not. The "check_attr" flag passed to grep_tree() tells the grep
machinery if it should perform this attribute lookup for the paths in
the given tree.

I decided to replace it with "from_commit" because the only time we
want an attribute lookup when grepping a tree is when the tree comes
from a commit, i.e., when the tree is the root. (The reasoning goes
along the same lines as for why we only check sparsity patterns in
git-grep for commit-ish objects: we cannot check pattern matching for
trees which we are not sure to be rooted.) Since "knowing if the tree
is a root or not" is useful in grep_tree() for both sparsity checks and
attribute checks, I thought we could use a single "from_commit" variable
instead of "check_attr" and "check_sparsity", which would always have
matching values. But on second thought, I could maybe rename the
variable to something like "is_root_tree" or add a comment explaining
the usage of "from_commit".

(I'm not a big fan of "is_root_tree", though, because we could give a
root tree to grep_tree() but not really know it.)
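
To illustrate what the flag controls, here is a small standalone sketch
of the expression used in the grep_oid() call in the diff. The helper
name `attr_lookup_path` is made up for this example, and it assumes
that the base buffer holds something like "HEAD:dir/c", with tn_len
being the length of the "HEAD:" prefix. When the flag is set, the path
without the tree-name prefix is handed over for the attribute lookup;
otherwise NULL is passed and no lookup is done.

```c
#include <stddef.h>

/*
 * Hypothetical helper mirroring the expression in the patch. `base` is
 * assumed to hold "<tree-name>:<path>", and `tn_len` is the length of
 * the "<tree-name>:" prefix.
 */
static const char *attr_lookup_path(const char *base, int tn_len,
				    int from_commit)
{
	/* Only a tree known to be rooted gets a path for attribute lookup. */
	return from_commit ? base + tn_len : NULL;
}
```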

> > diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> > new file mode 100755
> > index 0000000000..fccf44e829
> > --- /dev/null
> > +++ b/t/t7817-grep-sparse-checkout.sh
...
> > +test_expect_success 'setup' '
> > +       echo "text" >a &&
> > +       echo "text" >b &&
> > +       mkdir dir &&
> > +       echo "text" >dir/c &&
> > +       git add a b dir &&
> > +       git commit -m "initial commit" &&
> > +       git tag -am t-commit t-commit HEAD &&
> > +       tree=$(git rev-parse HEAD^{tree}) &&
> > +       git tag -am t-tree t-tree $tree &&
> > +       cat >.git/info/sparse-checkout <<-EOF &&
> > +       /*
> > +       !/b
> > +       !/dir
> > +       EOF
> > +       git sparse-checkout init &&
>
> Using `git sparse-checkout init` but then manually writing to
> .git/info/sparse-checkout?  Seems like it'd make more sense to use
> `git sparse-checkout set` than writing the patterns directly yourself.
> Also, would prefer to have the examples use cone mode (even if you
> have to add subdirectories), as it makes the testcase a bit easier to
> read and more performant, though neither is a big deal.

OK, I will make use of the builtin here. I will also use the cone mode
(and leave one test without it, as Stolee suggested later in this
thread).

> > +test_expect_success 'grep <tree-ish> should search outside sparse checkout' '
>
> I think the test is fine but the title seems misleading.  "outside"
> and "inside" aren't defined because <tree-ish> isn't known to be
> rooted, meaning we have no way to apply the sparsity patterns.  So
> perhaps just 'grep <tree-ish> should ignore sparsity patterns'?

Right! "should ignore sparsity patterns" is a much better name, thanks.

Thanks a lot for the thoughtful review and comments!

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH 2/3] grep: honor sparse checkout patterns
  2020-03-24 15:12     ` Derrick Stolee
  2020-03-24 16:16       ` Elijah Newren
@ 2020-03-24 23:01       ` Matheus Tavares Bernardino
  1 sibling, 0 replies; 120+ messages in thread
From: Matheus Tavares Bernardino @ 2020-03-24 23:01 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Elijah Newren, Git Mailing List, Derrick Stolee,
	brian m. carlson, Stefan Beller

On Tue, Mar 24, 2020 at 12:12 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 3/24/2020 3:15 AM, Elijah Newren wrote:
> >
> > On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
> > <matheus.bernardino@usp.br> wrote:
> >>
> >> Also, if I understand correctly, the index doesn't hold paths to dirs,
> >> right? So even if a complete dir is excluded from sparse checkout, we
> >> still have to check all its subentries, only to discover that they
> >> should all be skipped from the search. However, if we were to check
> >> against the sparsity patterns directly (e.g. with the function above),
> >> we could skip such directories together with all their entries.
>
> When in cone mode, we can check if a directory is one of these three
> modes:
>
> 1. Completely contained in the cone (recursive match)
> 2. Completely outside the cone
> 3. Neither. Keep matching subdirectories. (parent match)
>
> The clear_ce_flags() code in dir.c includes the matching algorithms
> for this. Hopefully you can re-use a lot of it. You may need to extract
> some methods to use them from the grep code.

Thanks for the pointer! I will take a look at the code in dir.c.
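
For my own reference, the three cases could be sketched roughly as
below. This is only a toy model and not the code in dir.c: it assumes a
single recursive cone directory and uses plain string comparison, while
the real matching works with hash lookups over the pattern list. The
names `cone_match`, `is_same_or_below` and `classify_dir` are made up
for the illustration.

```c
#include <string.h>

enum cone_match {
	CONE_RECURSIVE,	/* 1. completely contained in the cone */
	CONE_OUTSIDE,	/* 2. completely outside the cone */
	CONE_PARENT	/* 3. neither: keep matching subdirectories */
};

/* Returns 1 if `dir` equals `prefix` or lies below it ("a/b" is below "a"). */
static int is_same_or_below(const char *dir, const char *prefix)
{
	size_t len = strlen(prefix);

	return !strncmp(dir, prefix, len) &&
	       (dir[len] == '\0' || dir[len] == '/');
}

/*
 * Classify directory `dir` against a single recursive cone directory
 * `cone`. Both are paths relative to the repository root, without
 * trailing slashes.
 */
static enum cone_match classify_dir(const char *dir, const char *cone)
{
	if (is_same_or_below(dir, cone))
		return CONE_RECURSIVE;
	if (is_same_or_below(cone, dir))
		return CONE_PARENT;
	return CONE_OUTSIDE;
}
```

With a classification like this, grep_tree() could skip a whole
directory in the second case instead of visiting each of its entries.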

> >> diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> >> new file mode 100755
> >> index 0000000000..fccf44e829
...
> >> +       cat >.git/info/sparse-checkout <<-EOF &&
> >> +       /*
> >> +       !/b
> >> +       !/dir
> >> +       EOF
> >> +       git sparse-checkout init &&
> >
> > Using `git sparse-checkout init` but then manually writing to
> > .git/info/sparse-checkout?  Seems like it'd make more sense to use
> > `git sparse-checkout set` than writing the patterns directly yourself.
> > Also, would prefer to have the examples use cone mode (even if you
> > have to add subdirectories), as it makes the testcase a bit easier to
> > read and more performant, though neither is a big deal.
>
> I agree that we should use the builtin so your test script is less
> brittle to potential back-end changes to sparse-checkout (none planned).

Makes sense!

> I do recommend having at least one test with non-cone mode patterns,
> especially if you are checking the pattern-matching yourself instead of
> relying on the index.

OK, I will leave at least one test with non-cone patterns then. Thanks
for the comments!

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH 1/3] doc: grep: unify info on configuration variables
  2020-03-24 21:26     ` Junio C Hamano
@ 2020-03-24 23:38       ` Matheus Tavares
  0 siblings, 0 replies; 120+ messages in thread
From: Matheus Tavares @ 2020-03-24 23:38 UTC (permalink / raw)
  To: gitster; +Cc: dstolee, git, matheus.bernardino, newren, sandals

On Tue, Mar 24, 2020 at 6:26 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> >> diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
> >> index 44abe45a7c..76689771aa 100644
> >> --- a/Documentation/config/grep.txt
> >> +++ b/Documentation/config/grep.txt
> >> @@ -16,8 +16,11 @@ grep.extendedRegexp::
> >> ...
> >> +       Number of grep worker threads to use. See `--threads` in
> >> +       linkgit:git-grep[1] for more information.
> >> ...
> >> diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
> >> index ddb6acc025..97e25d7b1b 100644
> >> --- a/Documentation/git-grep.txt
> >> +++ b/Documentation/git-grep.txt
> >> @@ -41,34 +41,7 @@ characters.  An empty string as search expression matches all lines.
> >> ...
> >> +include::config/grep.txt[]
> >> ...
> >>  --threads <num>::
> >> -       Number of grep worker threads to use.
> >> -       See `grep.threads` in 'CONFIGURATION' for more information.
> >> +       Number of grep worker threads to use. If not provided (or set to
> >> +       0), Git will use as many worker threads as the number of logical
> >> +       cores available. The default value can also be set with the
> >> +       `grep.threads` configuration (see linkgit:git-config[1]).
> >
> > I'm possibly showing my ignorance here, but doesn't the
> > "include::config/grep.txt[]" you added above mean that the user
> > doesn't have to see an external manpage but can see the definition
> > earlier within this same manpage?

You are right. I added the "(see linkgit:git-config[1])" here more as a
reference to the config system itself (for a user that is possibly not familiar
with git-config). But if this is not necessary, we can remove the reference.

> I think so.  Also, the new reference "See `--threads` in git-grep"
> added to grep.threads to config/grep.txt would become somewhat
> redundant in the context of "git grep --help" (only "See --threads"
> is relevant when it appears in this same manual page).

Thanks for pointing that out. I think we can solve this issue with the
following:

diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
index c1d49484c8..ac06db4206 100644
--- a/Documentation/config/grep.txt
+++ b/Documentation/config/grep.txt
@@ -16,8 +16,11 @@ grep.extendedRegexp::
 	other than 'default'.

 grep.threads::
-	Number of grep worker threads to use. See `--threads` in
-	linkgit:git-grep[1] for more information.
+	Number of grep worker threads to use. See `--threads`
+ifndef::git-grep[]
+	in linkgit:git-grep[1]
+endif::git-grep[]
+	for more information.

 grep.fullName::
 	If set to true, enable `--full-name` option by default.
diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
index 5c5c66c056..192aab4cba 100644
--- a/Documentation/git-grep.txt
+++ b/Documentation/git-grep.txt
@@ -41,6 +41,7 @@ characters.  An empty string as search expression matches all lines.
 CONFIGURATION
 -------------

+:git-grep: 1
 include::config/grep.txt[]

 OPTIONS

I will add these changes in v2.


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-24 19:07       ` Elijah Newren
@ 2020-03-25 20:18         ` Junio C Hamano
  0 siblings, 0 replies; 120+ messages in thread
From: Junio C Hamano @ 2020-03-25 20:18 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Matheus Tavares, Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc

Elijah Newren <newren@gmail.com> writes:

> It sounds like you are describing partial clone rather than sparse
> checkout?  Or perhaps you're trying to blur the distinction,
> suggesting the two should be used together, with the partial clone
> machinery learning to download history within the specified sparse
> cones?

Yeah, I guess it is a little bit of both ;-)

>>  Regardless of the choice of the default, it would be a good
>> idea to make the subcommands consistently offer the same default and
>> allow the non-default views with the same UI.
>
> Agreed.

Yup, thanks.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-24  7:54   ` Elijah Newren
  2020-03-24 18:30     ` Junio C Hamano
@ 2020-03-25 23:15     ` Matheus Tavares Bernardino
  2020-03-26  6:02       ` Elijah Newren
  1 sibling, 1 reply; 120+ messages in thread
From: Matheus Tavares Bernardino @ 2020-03-25 23:15 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc, Junio C Hamano

On Tue, Mar 24, 2020 at 4:55 AM Elijah Newren <newren@gmail.com> wrote:
>
> On Mon, Mar 23, 2020 at 11:13 PM Matheus Tavares
> <matheus.bernardino@usp.br> wrote:
>
> > Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
> > ---
> >
> > Note: I still have to make --ignore-sparsity be able to work together
> > with --untracked. Unfortunately, this won't be as simple because the
> > codeflow taken by --untracked goes to grep_directory() which just
> > iterates the working tree, without looking at the index entries. So I will
> > have to either: make --untracked use grep_cache(), and grep the
> > untracked files later; or try matching the working tree paths against
> > the sparsity patterns, without looking for the skip_worktree bit in
> > the index (as I mentioned in the previous patch's comments). Any
> > preferences regarding these two approaches? (or other suggestions?)
>
> Hmm.  So, 'tracked' in git is the idea that we are keeping information
> about specific files.  'sparse-checkout' is the idea that we have a
> subset of those that we can work with without materializing all the
> other tracked files; it's clearly a subset of the realm of 'tracked'.
> 'untracked' is about getting everything outside the set of 'tracked'
> files, which to me means it is clearly outside the set of sparsity
> paths too (and thus you could take --untracked as implying
> --ignore-sparsity, though whether you do might not matter in practice
> because of the items I'll discuss next). Of course, I am also
> assuming `--untracked` is incompatible with --cached or specifying
> revisions or trees (based on its definition of "In addition to
> searching in the tracked files in the *working tree*, search also in
> untracked files." -- emphasis added.)

Hm, I see the point now, but I'm still a little confused: The "in the
working tree" section of the definition would exclude non checked out
files, right? However, git-grep's description says "Look for specified
patterns in the tracked files *in the work tree*", and it still
searches non checked out files (loading them from the cache, even when
--cached is not given). I know that's exactly what we are trying to
change with this patchset, but we will still give the
--ignore-sparsity option to allow the old behavior when needed (unless
we prohibit using --ignore-sparsity without --cached or $REV). I guess
my doubt is whether the problem is in the implementation of the
working tree grep, which considers non checked out files, or in the
docs, which say "tracked files *in the work tree*".

I tend to go with the latter, since using `git grep --ignore-sparsity`
in a sparse checked out working tree, to grep not present files as
well, kind of makes sense to me. And if the problem is indeed in the
docs, then I think we should also allow --ignore-sparsity when
grepping with --untracked, since it's an analogous case.

> If the incompatibility of
> --untracked and --cached/REVISIONS/TREES is not enforced, we may want
> to look into erroring out if they are given together.  Once we do, we
> don't have to worry about grep_cache() at all in the case of
> --untracked and shouldn't.  Files with the skip_worktree bit won't
> exist in the working directory, and thus won't be searched (this is
> what makes --untracked imply --ignore-sparsity not really matter).
>
> In short: With --untracked you are grepping ALL (non-ignored) files in
> the working directory -- either because they are both tracked and in
> the sparsity paths (anything tracked that isn't in the sparsity paths
> has the skip_worktree bit and thus isn't present), or because it is an
> untracked file.  [And this may be what grep_directory() already does.]
>
> Does that make sense?

It does, and thanks for a very detailed explanation. But as I
mentioned before, I'm a little uncertain about --untracked implying
--ignore-sparsity. The commit that added --untracked (0a93fb8) says:

"grep --untracked" would find the specified patterns from files in
untracked files in addition to its usual behaviour of finding them in
the tracked files

So, in my mind, it feels like --untracked wasn't meant to limit the
search to "all non-ignored files in the working directory", but to add
untracked files to the search (which could also contain tracked but
non checked out files). Wouldn't the "all non-ignored files in the
working directory" case be the use of --no-index?

> > diff --git a/builtin/grep.c b/builtin/grep.c
> > index 52ec72a036..17eae3edd6 100644
> > --- a/builtin/grep.c
> > +++ b/builtin/grep.c
...
> >
> > @@ -487,7 +492,7 @@ static int grep_cache(struct grep_opt *opt,
> >         for (nr = 0; nr < repo->index->cache_nr; nr++) {
> >                 const struct cache_entry *ce = repo->index->cache[nr];
> >
> > -               if (ce_skip_worktree(ce))
> > +               if (!ignore_sparsity && ce_skip_worktree(ce))
>
> Oh boy on the double negatives...maybe we want to rename this flag somehow?

Yeah, I also thought about that, but couldn't come up with a better
name myself... My alternatives were all too verbose.

...
> I'm super excited to see work in this area.  I hope I'm not
> discouraging you by attempting to provide what I think is the bigger
> picture I'd like us to work towards.

Not at all! :) Thanks a lot for the bigger picture and other
explanations. They help me understand the long-term goals and make
better decisions now.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-25 23:15     ` Matheus Tavares Bernardino
@ 2020-03-26  6:02       ` Elijah Newren
  2020-03-27 15:51         ` Junio C Hamano
  2020-03-30  1:12         ` Matheus Tavares Bernardino
  0 siblings, 2 replies; 120+ messages in thread
From: Elijah Newren @ 2020-03-26  6:02 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc, Junio C Hamano

Hi Matheus!

On Wed, Mar 25, 2020 at 4:15 PM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> On Tue, Mar 24, 2020 at 4:55 AM Elijah Newren <newren@gmail.com> wrote:
> >
> > On Mon, Mar 23, 2020 at 11:13 PM Matheus Tavares
> > <matheus.bernardino@usp.br> wrote:
> >
> > > Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
> > > ---
> > >
> > > Note: I still have to make --ignore-sparsity be able to work together
> > > with --untracked. Unfortunately, this won't be as simple because the
> > > codeflow taken by --untracked goes to grep_directory() which just
> > > iterates the working tree, without looking at the index entries. So I will
> > > have to either: make --untracked use grep_cache(), and grep the
> > > untracked files later; or try matching the working tree paths against
> > > the sparsity patterns, without looking for the skip_worktree bit in
> > > the index (as I mentioned in the previous patch's comments). Any
> > > preferences regarding these two approaches? (or other suggestions?)
> >
> > Hmm.  So, 'tracked' in git is the idea that we are keeping information
> > about specific files.  'sparse-checkout' is the idea that we have a
> > subset of those that we can work with without materializing all the
> > other tracked files; it's clearly a subset of the realm of 'tracked'.
> > 'untracked' is about getting everything outside the set of 'tracked'
> > files, which to me means it is clearly outside the set of sparsity
> > paths too (and thus you could take --untracked as implying
> > --ignore-sparsity, though whether you do might not matter in practice
> > because of the items I'll discuss next). Of course, I am also
> > assuming `--untracked` is incompatible with --cached or specifying
> > revisions or trees (based on its definition of "In addition to
> > searching in the tracked files in the *working tree*, search also in
> > untracked files." -- emphasis added.)
>
> Hm, I see the point now, but I'm still a little confused: The "in the
> working tree" section of the definition would exclude non checked out
> files, right? However, git-grep's description says "Look for specified
> patterns in the tracked files *in the work tree*", and it still
> searches non checked out files (loading them from the cache, even when
> --cached is not given). I know that's exactly what we are trying to

I really respect Duy and he does some amazing work and I wish he were
still active in git, but the SKIP_WORKTREE stuff wasn't his best work
and even he downplayed it: "In my defense it was one of my first
contribution when I was naiver...I'd love to hear how sparse checkout
could be improved, or even replaced."[0]

I've seen enough egregiously confusing cases and enough
difficult-to-recover-from cases with the implementation of the
SKIP_WORKTREE handling that I think it is dangerous to assume behavior
you see with it is intended design.  A year and a half ago, I read all
available docs to figure out how to sparsify and de-sparsify, and read
them several times but was still confused.  If I could only figure it
out with great difficulty, a lot of google searching, and even trying
to look at the code, what chance did "normal" users stand?  To add
more flavor to that argument, let me cite [1] (the three paragraphs
starting with "Playing with sparse-checkout, it feels to me like a
half-baked feature"), [2], as well as good chunks of [3], [4], and
[5].

[0] https://lore.kernel.org/git/CACsJy8ArUXD0cF2vQAVnzM_AGto2k2yQTFuTO7PhP4ffHM8dVQ@mail.gmail.com/
[1] https://lore.kernel.org/git/CABPp-BFKf2N6TYzCCneRwWUektMzRMnHLZ8JT64q=MGj5WQZkA@mail.gmail.com/
[2] https://lore.kernel.org/git/CABPp-BGE-m_UFfUt_moXG-YR=ZW8hMzMwraD7fkFV-+sEHw36w@mail.gmail.com/
[3] https://lore.kernel.org/git/pull.316.git.gitgitgadget@gmail.com/
[4] https://lore.kernel.org/git/pull.513.git.1579029962.gitgitgadget@gmail.com/
[5] https://lore.kernel.org/git/a46439c8536f912ad4a1e1751852cf477d3d7dc7.1584813609.git.gitgitgadget@gmail.com/

But let me try to explain it all below from first principles in a way
that will hopefully make sense why falling back to loading from the
cache when --cached is not given is just flat wrong.  The explanation
from first principles should also help explain --untracked a bit
better, and when there are decisions about whether to use sparsity
patterns.

> change with this patchset, but we will still give the
> --ignore-sparsity option to allow the old behavior when needed (unless
> we prohibit using --ignore-sparsity without --cached or $REV). I guess
> my doubt is whether the problem is in the implementation of the
> working tree grep, which considers non checked out files, or in the
> docs, which say "tracked files *in the work tree*".
>
> I tend to go with the latter, since using `git grep --ignore-sparsity`
> in a sparse checked out working tree, to grep not present files as
> well, kind of makes sense to me. And if the problem is indeed in the
> docs, then I think we should also allow --ignore-sparsity when
> grepping with --untracked, since it's an analogous case.

It's probably not a surprise to you given what I've already said above
to hear me say that the docs are correct in this case.  But not only
are the docs correct, I'll go even further and claim that falling back
to the cache when --cached is not passed is indefensible and leads to
surprises and contradictions.  But instead of just claiming that, let
me try to spell out a bit better why I believe that from first
principles, though:

There were previously three types of files for git:
  * tracked
  * ignored
  * untracked
where:
  * tracked was defined as "recorded in index"
  * ignored was defined as "a file which is not tracked and which
matches an ignore rule (.gitignore, .git/info/exclude, etc.)"
  * untracked was defined as "all other files present in the working directory".
With the SKIP_WORKTREE bit and sparse-checkouts, we actually have four
types because we split the "tracked" category into two:
  * tracked and matches the sparsity patterns (implies it will be
present in the working directory, as the SKIP_WORKTREE bit is not set)
  * tracked and does not match the sparsity patterns (implies it will
be missing from the working directory, as the SKIP_WORKTREE bit is
set)
But let's ignore the splitting of the tracked type for a minute as
well as everything else related to sparseness.  Let's just look at how
grep was designed.
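
To make the categories above concrete, here is a rough sketch in
Python (purely illustrative, not git's code; the classify() function
and its boolean parameters are made up for this example).  It only
encodes the definitions given above, plus the fact that a tracked
path with the SKIP_WORKTREE bit set is the one missing from the
working directory.

```python
def classify(in_index, skip_worktree=False, matches_ignore_rule=False):
    """Return the category of a path, following the definitions above.

    All three parameters are hypothetical inputs for this sketch; git
    itself derives them from the index and the ignore files.
    """
    if in_index:
        # Tracked paths are split in two by the SKIP_WORKTREE bit.
        if skip_worktree:
            return "tracked, SKIP_WORKTREE set (not in working directory)"
        return "tracked, SKIP_WORKTREE not set (in working directory)"
    if matches_ignore_rule:
        # "ignored" only applies to paths that are not tracked.
        return "ignored"
    # Everything else present in the working directory.
    return "untracked"


for args in [(True, False, False), (True, True, False),
             (False, False, True), (False, False, False)]:
    print(args, "->", classify(*args))
```

Note that a tracked path stays "tracked" even if it also matches an
ignore rule, which is what the "not tracked and" part of the ignored
definition says.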

git grep has traditionally been about searching "tracked files in the
work tree" as you highlighted (and note that sparsity bits came four
years later in 2009, so cannot undercut that claim).  If the user has
made edits to files and hasn't staged them, grep would search those
working tree files with their edits, not old cached versions of those
files.  People were told that git grep was a great way to just search
relevant stuff (instead of normal grep which would look through build
results and random big files in your working directory that you
weren't even tracking).  Then in 2011 grep gained options like
--untracked to extend the search in the working tree to also include
untracked files, and added --no-exclude-standard (which is "only
useful with --untracked") so that people had a way to search *all*
files in the working tree (tracked, untracked, and ignored files).
(Note: no mechanism was provided for searching tracked and ignored
files without untracked as far as I can tell, though I don't see why
that would make sense.)  git-grep also gained options like --no-index
so that it could be used in a directory that wasn't tracked by git at
all -- it turns out people liked git-grep better than normal grep (I
think it got colorization first?), even for things that weren't being
tracked by git.  But again, all these cases were about searching files
that existed in the working tree.

Of course, people sometimes wanted to search a version other than what
existed in the working tree.  And thus options like --cached or
specifying a REVISION were added early on.

Sometimes, code that wasn't meant to be used together accidentally is
used together or the docs suggest they can be used together.  In 2010,
someone had to clarify that --cached was incompatible with <tree>; not
sure why someone would attempt to use them together, but that's the
type of accident that is easy to have in the implementation or docs
because it doesn't even occur to people who understand the design and
the data structures why anyone would attempt that.  Inevitably,
someone comes along who doesn't understand the underlying data
structures or design or terminology and tries incompatible options
together...and then gets surprised.  (Side note: I think this kind of
issue occurs fairly frequently, so I'm unlikely to assume options
were meant to be supported together based solely on a lack of logic
that would throw an error when both are specified.  We could probably
add a bunch of useful microprojects around checking for flags that
should be incompatible and making sure git throws errors when both are
specified.  We had lots of cases in rebase, for example, where if
users happened to specify two flags then one would just be silently
ignored.)

REVISION and --cached are not just incompatible with each other; each
is incompatible with all three of --untracked, --no-index, and
--no-exclude-standard.  This is because REVISION and --cached are
about picking some version other than what exists in the working tree
to search through, while those other options are all intended for when
we are searching through files in the working tree (and in particular,
exist to extend how many files in the working tree we look through).
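
As a rough illustration of the rule in the previous paragraph, here
is a small Python sketch of an option checker.  It is not what
builtin/grep.c does; check_options() and WORKTREE_ONLY are
hypothetical names, and the function simply encodes "REVISION and
--cached conflict with each other and with every working-tree-only
option".

```python
# Options that only make sense when searching the working tree.
WORKTREE_ONLY = {"--untracked", "--no-index", "--no-exclude-standard"}


def check_options(flags, revisions=()):
    """Return a list of (a, b) option pairs that conflict under the rule.

    flags: iterable of option strings; revisions: any REVISION arguments.
    Hypothetical helper for illustration only.
    """
    flags = set(flags)
    version_selectors = []
    if "--cached" in flags:
        version_selectors.append("--cached")
    if revisions:
        version_selectors.append("REVISION")

    conflicts = []
    if len(version_selectors) == 2:
        conflicts.append(("--cached", "REVISION"))
    for selector in version_selectors:
        for flag in sorted(flags & WORKTREE_ONLY):
            conflicts.append((selector, flag))
    return conflicts


print(check_options(["--untracked"]))               # no conflicts
print(check_options(["--cached", "--untracked"]))   # one conflict
print(check_options(["--cached"], ["master"]))      # one conflict
```

An empty list means the combination is allowed under this rule; a
real implementation would turn each pair into an error message.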

One more useful case to consider before we start adding SKIP_WORKTREE
into the mix.  Let's say that you have three files:
   fileA
   fileB
   fileC
and all of them are tracked.  You have made edits to fileA and fileB,
and ran 'rm fileC' (NOT 'git rm fileC', i.e. the deletion is not
staged).  Now, you run 'git grep mystring'.  Quick question: Which
files are searched for 'mystring'?  Well...
  * REVISION and --cached were left out of the git grep command, so
working tree files should be searched, not staged versions or versions
from other commits
  * No flags like --untracked or --no-exclude-standard were included,
so only tracked files in the working tree should be searched
  * There are two files in the working tree, both tracked: fileA and fileB.
So, this searches fileA and fileB.  In particular: NO VERSION of fileC
is searched.  fileC may be tracked/cached, but we don't search any
version of that file, because this particular command line is about
searching the working directory and fileC is not in the working
directory.  To the best of my knowledge, git grep has always behaved
that way.
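
The fileA/fileB/fileC case can be modeled in a few lines of Python.
This is only a sketch of the reasoning above, not git's
implementation; files_searched() is a made-up helper, ignored files
are left out of the model, and notes.txt is a hypothetical untracked
file added to show what --untracked would change.

```python
def files_searched(index, worktree, untracked=False):
    """Paths a working-tree grep reads in this simplified model.

    index:    set of tracked paths
    worktree: set of paths that actually exist on disk
    """
    if untracked:
        # Tracked files that are present, plus untracked files.
        return sorted(worktree)
    # Only tracked files that are present in the working tree.
    return sorted(index & worktree)


index = {"fileA", "fileB", "fileC"}
# fileC was removed with plain 'rm'; notes.txt was never added.
worktree = {"fileA", "fileB", "notes.txt"}

print(files_searched(index, worktree))
# ['fileA', 'fileB']
print(files_searched(index, worktree, untracked=True))
# ['fileA', 'fileB', 'notes.txt']
```

In neither case does fileC show up: being in the index is not enough
when the search is about the working tree.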


Users understand the idea of searching the working copy vs. the index
vs. "old" (or different) versions of the repository.  They also
understand that when searching the working copy, by default a subset
of the files are searched.  Tell me: given all this information here,
what possible explanation is there for SKIP_WORKTREE entries to be
translated into searches of the cache when --cached is not specified?
Please square that away with the fact that 'rm fileC' results in fileC
NOT being searched.

It's just completely, utterly wrong.

Also, hopefully this helps answer your question about --untracked and
skip_worktree.  --untracked is only useful when searching through the
working tree, and is entirely about adding the "untracked" category to
the things we search.  The skip_worktree bit is about adding more
granularity to the "tracked" category.  The two are thus entirely
orthogonal and --untracked shouldn't change behavior at all in the
face of sparse checkouts.

And I also think it explains more when the sparsity patterns and
the --ignore-sparsity flag even matter.  The division of working
tree files which were tracked into two subsets (those that match
sparsity patterns and those that don't) didn't matter because only one
of those two sets existed and could be searched.  So the question is,
when can the sparsity pattern divide a set of files into two subsets
where both are non-empty?  And the answer is when --cached or REVISION
is specified.  This is the case Junio recently brought up and said
that there are good reasons users might want to limit to just the
paths that match the sparsity patterns, and other reasons when users
might want to search everything[6].  So, both cases need to be
supported fairly easily, and this will be true for several commands
besides just grep.

[6] https://lore.kernel.org/git/xmqq7dz938sc.fsf@gitster.c.googlers.com/
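
Here is a small Python sketch of the behavior argued for above (not
what git does today, and not the patch under review).  grep_paths()
and the dict-based index model are assumptions made for the example;
the model also assumes every entry without the skip_worktree bit is
present on disk.

```python
def grep_paths(entries, cached=False, ignore_sparsity=False):
    """Return the paths that would be searched.

    entries: dict mapping path -> skip_worktree bit (a toy index).
    """
    if not cached:
        # Working tree search: skip_worktree paths are not on disk,
        # so ignore_sparsity has nothing to add here.
        return sorted(p for p, skip in entries.items() if not skip)
    if ignore_sparsity:
        # Index search over everything, inside or outside the patterns.
        return sorted(entries)
    # Index search limited to the paths matching the sparsity patterns.
    return sorted(p for p, skip in entries.items() if not skip)


entries = {"src/a.c": False, "docs/b.txt": True}
print(grep_paths(entries))
print(grep_paths(entries, ignore_sparsity=True))
print(grep_paths(entries, cached=True))
print(grep_paths(entries, cached=True, ignore_sparsity=True))
```

Only the last call returns both paths, which is the point: the
sparsity choice only changes the result once --cached (or a
REVISION) is in play.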

> > If the incompatibility of
> > --untracked and --cached/REVISIONS/TREES is not enforced, we may want
> > to look into erroring out if they are given together.  Once we do, we
> > don't have to worry about grep_cache() at all in the case of
> > --untracked and shouldn't.  Files with the skip_worktree bit won't
> > exist in the working directory, and thus won't be searched (this is
> > what makes --untracked imply --ignore-sparsity not really matter).
> >
> > In short: With --untracked you are grepping ALL (non-ignored) files in
> > the working directory -- either because they are both tracked and in
> > the sparsity paths (anything tracked that isn't in the sparsity paths
> > has the skip_worktree bit and thus isn't present), or because it is an
> > untracked file.  [And this may be what grep_directory() already does.]
> >
> > Does that make sense?
>
> It does, and thanks for a very detailed explanation. But as I
> mentioned before, I'm a little uncertain about --untracked implying
> --ignore-sparsity. The commit that added --untracked (0a93fb8) says:
>
> "grep --untracked" would find the specified patterns from files in
> untracked files in addition to its usual behaviour of finding them in
> the tracked files
>
> So, in my mind, it feels like --untracked wasn't meant to limit the
> search to "all non-ignored files in the working directory", but to add
> untracked files to the search (which could also contain tracked but
> non checked out files). Wouldn't the "all non-ignored files in the
> working directory" case be the use of --no-index?

--no-index is specifically designed for when the directory isn't
tracked by git at all.  It would be equivalent, though, to saying we
wanted to search all files in the working copy regardless of whether
they are tracked, untracked, or ignored, i.e. equivalent to specifying
both --untracked and --no-exclude-standard.

And you were right to be uncertain about --untracked implying
--ignore-sparsity; --untracked is completely orthogonal to sparsity.
(However, it wouldn't much matter if it did imply that option or if it
implied its opposite: --untracked implies we are only looking at the
working directory files, and thus we aren't even going to check the
sparsity patterns, we'll just check which files exist in the working
directory.  `git sparse-checkout reapply` will care about the sparsity
patterns and possibly add files to the working copy or remove some,
but grep certainly shouldn't be having a side effect like that; it
should just search the directory as it exists.)

> > > diff --git a/builtin/grep.c b/builtin/grep.c
> > > index 52ec72a036..17eae3edd6 100644
> > > --- a/builtin/grep.c
> > > +++ b/builtin/grep.c
> ...
> > >
> > > @@ -487,7 +492,7 @@ static int grep_cache(struct grep_opt *opt,
> > >         for (nr = 0; nr < repo->index->cache_nr; nr++) {
> > >                 const struct cache_entry *ce = repo->index->cache[nr];
> > >
> > > -               if (ce_skip_worktree(ce))
> > > +               if (!ignore_sparsity && ce_skip_worktree(ce))
> >
> > Oh boy on the double negatives...maybe we want to rename this flag somehow?
>
> Yeah, I also thought about that, but couldn't come up with a better
> name myself... My alternatives were all too verbose.
>
> ...
> > I'm super excited to see work in this area.  I hope I'm not
> > discouraging you by attempting to provide what I think is the bigger
> > picture I'd like us to work towards.
>
> Not at all! :) Thanks a lot for the bigger picture and other
> explanations. They help me understand the long-term goals and make
> better decisions now.

Hope this email helps too.  I've composed it over about 4 different
sessions with various interruptions, so there's a good chance all my
edits and loss of train of thought might have made something murky.
Let me know which part(s) are confusing and I'll try to clarify.

Elijah

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-26  6:02       ` Elijah Newren
@ 2020-03-27 15:51         ` Junio C Hamano
  2020-03-27 19:01           ` Elijah Newren
  2020-03-30  1:12         ` Matheus Tavares Bernardino
  1 sibling, 1 reply; 120+ messages in thread
From: Junio C Hamano @ 2020-03-27 15:51 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Matheus Tavares Bernardino, Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc

Elijah Newren <newren@gmail.com> writes:

> Sometimes, code that wasn't meant to be used together accidentally is
> used together or the docs suggest they can be used together.  ...
> ... but that's the
> type of accident that is easy to have in the implementation or docs
> because it doesn't even occur to people who understand the design and
> the data structures why anyone would attempt that.

The above is not limited to "git grep", but you said so clearly what
I have felt, without being able to express myself in a satisfactory
manner, for the last 10 years.

> ... (Side note: I think this kind of
> issue occurs fairly frequently, so I'm unlikely to assume options
> were meant to be supported together based solely on a lack of logic
> that would throw an error when both are specified.

Amen to that.

By the way, and I am so sorry for making the main issue of the
discussion into a mere "by the way" point, but if I understand your
message correctly, the primary conclusion in there is that a file
that is not in the working tree, if the sparsity pattern tells us
that it should not be checked out to the working tree, should not be
sought in the index instead.  I think I agree with that conclusion.

I however have some disagreement on a minor point, though.

"git grep -e '<pattern>' master" looks for the pattern in the commit
at the tip of the master branch.  "git grep -e '<pattern>' master
pu" does so in these two commits.  I do not think it is conceptually
wrong to allow "git grep -e '<pattern>' --cached master pu" to look
for three "commits", i.e. those two commits that already exist, plus
the one you would be creating if you were to "git commit" right now.
Similarly, I do not see a reason why we should forbid looking for
the same pattern in the tracked files in the working tree at the
same time we check tree object(s) and/or the index.

At least in principle.

There are two practical issues that make these combinations
problematic, but I do not think they are insurmountable.

 - Once you give an object on the command line, there is no syntax
   to let you say "oh, by the way, I want the working tree as well".
   If you are looking in the index, the working tree, and optionally
   in some objects, "--index" instead of "--cached" would be the
   standard way to tell the command "I want to affect both the index
   and the working tree", but there is no way to say "I want only
   tracked files in the working tree and these objects searched".
   We'd need a new syntax to express it if we wanted to allow the
   combination.

 - The lines found in the working tree and in the index are prefixed
   by the filename, while those found in a tree object are prefixed
   by the tree's name and a colon.  When output for the working tree
   and the index are combined, we cannot tell where each hit came
   from.  We need to change the output to allow us to tell them
   apart, by e.g. prefixing "<worktree>:" and "<index>:" in a way
   similar to how we use "<revision>:".

Thanks.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-27 15:51         ` Junio C Hamano
@ 2020-03-27 19:01           ` Elijah Newren
  0 siblings, 0 replies; 120+ messages in thread
From: Elijah Newren @ 2020-03-27 19:01 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Matheus Tavares Bernardino, Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc

On Fri, Mar 27, 2020 at 8:51 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> > Sometimes, code that wasn't meant to be used together accidentally is
> > used together or the docs suggest they can be used together.  ...
> > ... but that's the
> > type of accident that is easy to have in the implementation or docs
> > because it doesn't even occur to people who understand the design and
> > the data structures why anyone would attempt that.
>
> The above is not limited to "git grep", but you said so clearly what
> I have felt, without being able to express myself in a satisfactory
> manner, for the last 10 years.
>
> > ... (Side note: I think this kind of
> > issue occurs fairly frequently, so I'm unlikely to assume options
> > were meant to be supported together based solely on a lack of logic
> > that would throw an error when both are specified.
>
> Amen to that.
>
> By the way, and I am so sorry for making the main issue of the
> discussion into a mere "by the way" point, but if I understand your
> message correctly, the primary conclusion in there is that a file
> that is not in the working tree, if the sparsity pattern tells us
> that it should not be checked out to the working tree, should not be
> sought in the index instead.  I think I agree with that conclusion.

Cool.

> I however have some disagreement on a minor point, though.
>
> "git grep -e '<pattern>' master" looks for the pattern in the commit
> at the tip of the master branch.  "git grep -e '<pattern>' master
> pu" does so in these two commits.  I do not think it is conceptually
> wrong to allow "git grep -e '<pattern>' --cached master pu" to look
> for three "commits", i.e. those two commits that already exist, plus
> the one you would be creating if you were to "git commit" right now.
> Similarly, I do not see a reason why we should forbid looking for
> the same pattern in the tracked files in the working tree at the
> same time we check tree object(s) and/or the index.
>
> At least in principle.
>
> There are two practical issues that make these combinations
> problematic, but I do not think they are insurmountable.
>
>  - Once you give an object on the command line, there is no syntax
>    to let you say "oh, by the way, I want the working tree as well".
>    If you are looking in the index, the working tree, and optionally
>    in some objects, "--index" instead of "--cached" would be the
>    standard way to tell the command "I want to affect both the index
>    and the working tree", but there is no way to say "I want only
>    tracked files in the working tree and these objects searched".
>    We'd need a new syntax to express it if we wanted to allow the
>    combination.
>
>  - The lines found in the working tree and in the index are prefixed
>    by the filename, while those found in a tree object are prefixed
>    by the tree's name and a colon.  When output for the working tree
>    and the index are combined, we cannot tell where each hit came
>    from.  We need to change the output to allow us to tell them
>    apart, by e.g. prefixing "<worktree>:" and "<index>:" in a way
>    similar to how we use "<revision>:".
>
> Thanks.

Ah, so you're saying that even though --cached and REVISION are
incompatible today, that's not fundamental and we could conceivably
let them or even more options be used together in the future and you
even highlight how it could be made to sensibly work.  I agree with
what you say here: _if_ there is a way for users to explicitly specify
that they want to search multiple versions (whether that is revisions
or the index or the working tree), _and_ we have a way to distinguish
which version we found the results from, then (and only then) it'd
make sense to search the complete set of files from each of those
versions and show the results for the matches we found.

That differs in multiple important ways from the SKIP_WORKTREE
behavior I was railing against, and I think what you propose as a
possibility in contrast would make sense.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-26  6:02       ` Elijah Newren
  2020-03-27 15:51         ` Junio C Hamano
@ 2020-03-30  1:12         ` Matheus Tavares Bernardino
  2020-03-31 16:48           ` Elijah Newren
  1 sibling, 1 reply; 120+ messages in thread
From: Matheus Tavares Bernardino @ 2020-03-30  1:12 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc, Junio C Hamano

On Thu, Mar 26, 2020 at 3:02 AM Elijah Newren <newren@gmail.com> wrote:
>
> Hi Matheus!

Hi, Elijah.

First of all, thanks for taking the time to go over these topics in
great detail. I must say it's much clearer for me now.

> On Wed, Mar 25, 2020 at 4:15 PM Matheus Tavares Bernardino
> <matheus.bernardino@usp.br> wrote:
> >
[...]
> One more useful case to consider before we start adding SKIP_WORKTREE
> into the mix.  Let's say that you have three files:
>    fileA
>    fileB
>    fileC
> and all of them are tracked.  You have made edits to fileA and fileB,
> and ran 'rm fileC' (NOT 'git rm fileC', i.e. the deletion is not
> staged).  Now, you run 'git grep mystring'.  Quick question: Which
> files are searched for 'mystring'?  Well...
>   * REVISION and --cached were left out of the git grep command, so
> working tree files should be searched, not staged versions or versions
> from other commits
>   * No flags like --untracked or --no-exclude-standard were included,
> so only tracked files in the working tree should be searched
>   * There are two files in the working tree, both tracked: fileA and fileB.
> So, this searches fileA and fileB.  In particular: NO VERSION of fileC
> is searched.  fileC may be tracked/cached, but we don't search any
> version of that file, because this particular command line is about
> searching the working directory and fileC is not in the working
> directory.  To the best of my knowledge, git grep has always behaved
> that way.
>
> Users understand the idea of searching the working copy vs. the index
> vs. "old" (or different) versions of the repository.  They also
> understand that when searching the working copy, by default a subset
> of the files are searched.  Tell me: given all this information here,
> what possible explanation is there for SKIP_WORKTREE entries to be
> translated into searches of the cache when --cached is not specified?
> Please square that away with the fact that 'rm fileC' results in fileC
> NOT being searched.
>
> It's just completely, utterly wrong.

Makes sense, thanks. I agree that we shouldn't fall back to the cache
when searching the working tree.
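
To make the fileA/fileB/fileC example concrete, here is a minimal
sketch in Python. It is a toy model, not git's implementation: the
function name worktree_search_set and the set-based representation
are assumptions made for illustration. It models a plain
'git grep <pattern>' (no REVISION, --cached or --untracked), where a
path is read only if it is both tracked and present on disk.

```python
def worktree_search_set(tracked, on_disk):
    """Toy model of the paths a plain working-tree grep would read.

    tracked: set of paths recorded in the index
    on_disk: set of paths that currently exist in the working tree
    """
    # A tracked path that is missing from disk is simply not searched;
    # there is no fallback to the version stored in the index.
    return tracked & on_disk


tracked = {"fileA", "fileB", "fileC", "fileD"}
# fileC was removed with plain 'rm'; fileD stands in for a
# SKIP_WORKTREE entry that was never written to the working tree.
on_disk = {"fileA", "fileB"}

print(sorted(worktree_search_set(tracked, on_disk)))  # ['fileA', 'fileB']
```

Under this model fileC and the SKIP_WORKTREE entry are treated the
same way, which is the consistency argued for above.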

> Also, hopefully this helps answer your question about --untracked and
> skip_worktree.  --untracked is only useful when searching through the
> working tree, and is entirely about adding the "untracked" category to
> the things we search.  The skip_worktree bit is about adding more
> granularity to the "tracked" category.  The two are thus entirely
> orthogonal and --untracked shouldn't change behavior at all in the
> face of sparse checkouts.

Thanks, your explanation clarified the issue I had. I see now why
--untracked and --ignore-sparsity don't make sense together.

It also made me think about the combination of --cached and
--untracked which, IIUC, should be prohibited. I will add a patch in
v2, making git-grep error out in this case.

> And I also think it explains more when the sparsity patterns and
> --ignore-sparsity-patterns flags even matter.  The division of working
> tree files which were tracked into two subsets (those that match
> sparsity patterns and those that don't) didn't matter because only one
> of those two sets existed and could be searched.  So the question is,
> when can the sparsity pattern divide a set of files into two subsets
> where both are non-empty?  And the answer is when --cached or REVISION
> is specified.

Makes sense. I will add in --ignore-sparsity's description that it is
only relevant with --cached or REVISION, as you previously suggested.
When it is used outside of these cases, though, I think we could just
warn that --ignore-sparsity will be discarded (to avoid erroring out
when users have grep.ignoreSparsity enabled).
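
A minimal sketch of that rule in Python, assuming hypothetical names
(effective_ignore_sparsity and its parameters are made up here, and
the warning text is invented, not git's): the setting only takes
effect with --cached or a REVISION, and otherwise it is dropped with
a warning instead of an error.

```python
import warnings


def effective_ignore_sparsity(requested, cached=False, revisions=()):
    """Decide whether an 'ignore sparsity' request should take effect.

    requested: True if the flag or the config asked to ignore sparsity
    cached:    models --cached
    revisions: models the REVISION arguments given to grep
    """
    if not requested:
        return False
    if cached or revisions:
        # Index and tree searches can reach paths outside the sparse
        # checkout, so the setting is meaningful here.
        return True
    # Working-tree search: files outside the sparsity patterns are not
    # on disk, so there is nothing extra to search.  Warn, do not fail.
    warnings.warn("ignore-sparsity has no effect without --cached or a revision")
    return False


print(effective_ignore_sparsity(True, cached=True))
print(effective_ignore_sparsity(True, revisions=["HEAD~1"]))
print(effective_ignore_sparsity(True))
```

Warning rather than erroring keeps a plain working-tree grep usable
for someone who has the config enabled globally.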

* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-24 18:30     ` Junio C Hamano
  2020-03-24 19:07       ` Elijah Newren
@ 2020-03-30  3:23       ` Matheus Tavares Bernardino
  2020-03-31 19:12         ` Elijah Newren
  1 sibling, 1 reply; 120+ messages in thread
From: Matheus Tavares Bernardino @ 2020-03-30  3:23 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Elijah Newren, Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc

On Tue, Mar 24, 2020 at 3:30 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> > On Mon, Mar 23, 2020 at 11:13 PM Matheus Tavares
> > <matheus.bernardino@usp.br> wrote:
> >>
> >> In the last commit, git-grep learned to honor sparsity patterns. For
> >> some use cases, however, it may be desirable to search outside the
> >> sparse checkout. So add the '--ignore-sparsity' option, which restores
> >> the old behavior. Also add the grep.ignoreSparsity configuration, to
> >> allow setting this behavior by default.
> >
> > Should `--ignore-sparsity` be a global git option rather than a
> > grep-specific one?  Also, should grep.ignoreSparsity rather be
> > core.ignoreSparsity or core.searchOutsideSparsePaths or something?
>
> Great question.  I think "git diff" with various options would also
> want to optionally be able to be confined within the sparse cone, or
> checking the entire world by lazily fetching outside the sparsity.
[...]
> Regardless of the choice of the default, it would be a good
> idea to make the subcommands consistently offer the same default and
> allow the non-default views with the same UI.

Yeah, it seems like a sensible path. Regarding implementation, there
is the question that Elijah raised, of whether to use a global git
option or separate but consistent options for each subcommand. I don't
have much experience with sparse checkout to argue for one or
another, so I would like to hear what others have to say about it.

A question that comes to my mind regarding the global git option is:
will --ignore-sparsity (or whichever name we choose for it [1]) be
sufficient for all subcommands? Or may some of them require additional
options for command-specific behaviors concerning sparsity patterns?
Also, would it be OK if we just ignored the option in commands that do
not operate differently in sparse checkouts (maybe, fetch, branch and
send-email, for example)? And would it make sense to allow
constructions such as `git --ignore-sparsity checkout` or even `git
--ignore-sparsity sparse-checkout ...`?

[1]: Does anyone have suggestions for the option/config name? The best
I could come up with so far (without being too verbose) is
--no-sparsity-constraints. But I fear this might sound generic. As
Elijah already mentioned, --ignore-sparsity is not good either, as it
introduces double negatives in code...

* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-30  1:12         ` Matheus Tavares Bernardino
@ 2020-03-31 16:48           ` Elijah Newren
  0 siblings, 0 replies; 120+ messages in thread
From: Elijah Newren @ 2020-03-31 16:48 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc, Junio C Hamano

On Sun, Mar 29, 2020 at 6:13 PM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> On Thu, Mar 26, 2020 at 3:02 AM Elijah Newren <newren@gmail.com> wrote:
> >
> > Hi Matheus!
>
> Hi, Elijah.
>
> First of all, thanks for taking the time to go over these topics in
> great detail. I must say it's much clearer for me now.
>
> > On Wed, Mar 25, 2020 at 4:15 PM Matheus Tavares Bernardino
> > <matheus.bernardino@usp.br> wrote:
> > >
> [...]
> > One more useful case to consider before we start adding SKIP_WORKTREE
> > into the mix.  Let's say that you have three files:
> >    fileA
> >    fileB
> >    fileC
> > and all of them are tracked.  You have made edits to fileA and fileB,
> > and ran 'rm fileC' (NOT 'git rm fileC', i.e. the deletion is not
> > staged).  Now, you run 'git grep mystring'.  Quick question: Which
> > files are searched for 'mystring'?  Well...
> >   * REVISION and --cached were left out of the git grep command, so
> > working tree files should be searched, not staged versions or versions
> > from other commits
> >   * No flags like --untracked or --no-exclude-standard were included,
> > so only tracked files in the working tree should be searched
> >   * There are two files in the working tree, both tracked: fileA and fileB.
> > So, this searches fileA and fileB.  In particular: NO VERSION of fileC
> > is searched.  fileC may be tracked/cached, but we don't search any
> > version of that file, because this particular command line is about
> > searching the working directory and fileC is not in the working
> > directory.  To the best of my knowledge, git grep has always behaved
> > that way.
> >
> > Users understand the idea of searching the working copy vs. the index
> > vs. "old" (or different) versions of the repository.  They also
> > understand that when searching the working copy, by default a subset
> > of the files are searched.  Tell me: given all this information here,
> > what possible explanation is there for SKIP_WORKTREE entries to be
> > translated into searches of the cache when --cached is not specified?
> > Please square that away with the fact that 'rm fileC' results in fileC
> > NOT being searched.
> >
> > It's just completely, utterly wrong.
>
> Makes sense, thanks. I agree that we shouldn't fall back to the cache
> when searching the working tree.
>
> > Also, hopefully this helps answer your question about --untracked and
> > skip_worktree.  --untracked is only useful when searching through the
> > working tree, and is entirely about adding the "untracked" category to
> > the things we search.  The skip_worktree bit is about adding more
> > granularity to the "tracked" category.  The two are thus entirely
> > orthogonal and --untracked shouldn't change behavior at all in the
> > face of sparse checkouts.
>
> Thanks, your explanation clarified the issue I had. I see now why
> --untracked and --ignore-sparsity don't make sense together.
>
> It also made me think about the combination of --cached and
> --untracked which, IIUC, should be prohibited. I will add a patch in
> v2, making git-grep error out in this case.
>
> > And I also think it explains more when the sparsity patterns and
> > --ignore-sparsity-patterns flags even matter.  The division of working
> > tree files which were tracked into two subsets (those that match
> > sparsity patterns and those that don't) didn't matter because only one
> > of those two sets existed and could be searched.  So the question is,
> > when can the sparsity pattern divide a set of files into two subsets
> > where both are non-empty?  And the answer is when --cached or REVISION
> > is specified.
>
> Makes sense. I will add in --ignore-sparsity's description that it is
> only relevant with --cached or REVISION, as you previously suggested.
> When it is used outside of these cases, though, I think we could just
> warn that --ignore-sparsity will be discarded (to avoid erroring out
> when users have grep.ignoreSparsity enabled).

Not grep.ignoreSparsity but core.ignoreSparsity or core.$WHATEVER  ;-)

* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-30  3:23       ` Matheus Tavares Bernardino
@ 2020-03-31 19:12         ` Elijah Newren
  2020-03-31 20:02           ` Derrick Stolee
  0 siblings, 1 reply; 120+ messages in thread
From: Elijah Newren @ 2020-03-31 19:12 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: Junio C Hamano, Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc, Jonathan Tan

// adding Jonathan Tan to cc based on the fact that we keep bringing
up partial clones and how it relates...

On Sun, Mar 29, 2020 at 8:23 PM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> On Tue, Mar 24, 2020 at 3:30 PM Junio C Hamano <gitster@pobox.com> wrote:
> >
> > Elijah Newren <newren@gmail.com> writes:
> >
> > > On Mon, Mar 23, 2020 at 11:13 PM Matheus Tavares
> > > <matheus.bernardino@usp.br> wrote:
> > >>
> > >> In the last commit, git-grep learned to honor sparsity patterns. For
> > >> some use cases, however, it may be desirable to search outside the
> > >> sparse checkout. So add the '--ignore-sparsity' option, which restores
> > >> the old behavior. Also add the grep.ignoreSparsity configuration, to
> > >> allow setting this behavior by default.
> > >
> > > Should `--ignore-sparsity` be a global git option rather than a
> > > grep-specific one?  Also, should grep.ignoreSparsity rather be
> > > core.ignoreSparsity or core.searchOutsideSparsePaths or something?
> >
> > Great question.  I think "git diff" with various options would also
> > want to optionally be able to be confined within the sparse cone, or
> > checking the entire world by lazily fetching outside the sparsity.
> [...]
> > Regardless of the choice of the default, it would be a good
> > idea to make the subcommands consistently offer the same default and
> > allow the non-default views with the same UI.
>
> Yeah, it seems like a sensible path. Regarding implementation, there
> is the question that Elijah raised, of whether to use a global git
> option or separate but consistent options for each subcommand. I don't
> have much experience with sparse checkout to argue for one or
> another, so I would like to hear what others have to say about it.
>
> A question that comes to my mind regarding the global git option is:
> will --ignore-sparsity (or whichever name we choose for it [1]) be
> sufficient for all subcommands? Or may some of them require additional
> options for command-specific behaviors concerning sparsity patterns?
> Also, would it be OK if we just ignored the option in commands that do
> not operate differently in sparse checkouts (maybe, fetch, branch and
> send-email, for example)? And would it make sense to allow
> constructions such as `git --ignore-sparsity checkout` or even `git
> --ignore-sparsity sparse-checkout ...`?

I think the same option would probably be sufficient for all
subcommands, though I have a minor question about the merge machinery
(below).  And generally, I think it would be unusual for people to
pass the command line flag; I suspect most would set a config option
for most cases and then only occasionally override it on the command
line.  Since that config option would always be set, I'd expect
commands that are unaffected to just ignore it (much like both "git -c
merge.detectRenames=true fetch" and "git --work-tree=othertree fetch"
will both ignore the irrelevant options rather than trying to detect
that they were specified and error out).

> [1]: Does anyone have suggestions for the option/config name? The best
> I could come up with so far (without being too verbose) is
> --no-sparsity-constraints. But I fear this might sound generic. As
> Elijah already mentioned, --ignore-sparsity is not good either, as it
> introduces double negatives in code...

Does verbosity matter that much?  I think people would set it in
config, and tab completion would make it pretty easy to complete in
any event.

Anyway, maybe it will help if I provide a very rough first draft of
what changes we could introduce to Documentation/config/core.txt, and
then ask a bunch of my own questions about it below:

"""
core.restrictToSparsePaths::
        Only meaningful in conjunction with core.sparseCheckoutCone.
        This option extends sparse checkouts (which limit which paths
        are written to the worktree), so that output and operations
        are also limited to the sparsity paths where possible and
        implemented.  The purpose of this option is to (1) focus
        output for the user on the portion of the repository that is
        of interest to them, and (2) enable potentially dramatic
        performance improvements, especially in conjunction with
        partial clones.
+
When this option is true, git commands such as log, diff, and grep may
limit their output to the directories specified by the sparse cone, or
to the intersection of those paths and any paths (like `*.c`) that the
user might also specify on the command line.  (Note that this limit for
diff and grep only becomes relevant with --cached or when specifying a
REVISION, since a search of the working tree will automatically be
limited to the sparse paths that are present.)  Also, commands like
bisect may only select commits which modify paths within the sparsity
cone.  The merge machinery may use the sparse paths as a heuristic to
avoid trying to detect renames from within the sparsity cone to
outside the sparsity cone when at least one side of history only
touches paths within the sparsity cone (this can make the merge
machinery faster, but may risk modify/delete conflicts since upstream
can rename a file within the sparsity paths to a location outside
them).  Commands which export, integrity check, or create history will
always operate on full trees (e.g. fast-export, format-patch, fsck,
commit, etc.), unaffected by any sparsity patterns.
"""
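
The "intersection" wording in the draft above can be sketched as
follows. This is a simplified Python model, not git code: the names
in_cone and restrict_paths are hypothetical, fnmatchcase stands in
for git's pathspec matching, and cone membership is approximated as
"top-level files, files directly inside an ancestor of a cone
directory, and everything below a cone directory".

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def in_cone(path, cone_dirs):
    """Approximate cone-mode membership test for a file path."""
    parent = PurePosixPath(path).parent
    if parent == PurePosixPath("."):
        return True  # files at the top level are always included
    for cone in map(PurePosixPath, cone_dirs):
        if parent == cone or cone in parent.parents:
            return True  # somewhere below a cone directory
        if parent in cone.parents:
            return True  # directly inside an ancestor of a cone directory
    return False


def restrict_paths(paths, cone_dirs, pathspec=None):
    """Keep paths inside the cone that also match the optional pathspec."""
    return [
        p for p in paths
        if in_cone(p, cone_dirs) and (pathspec is None or fnmatchcase(p, pathspec))
    ]


paths = ["Makefile", "src/main.c", "src/lib/a.c", "src/lib/a.h", "docs/guide.c"]
print(restrict_paths(paths, ["src/lib"], "*.c"))  # ['src/main.c', 'src/lib/a.c']
```

Here docs/guide.c matches the pathspec but falls outside the cone, and
src/lib/a.h is inside the cone but does not match the pathspec, so
neither is returned.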

Several questions here, of course:

  * do people like or hate the name?  indifferent?  have alternate ideas?
  * should we restrict this to core.sparseCheckoutCone as I suggested
above or also allow people to do it with core.sparseCheckout without
the cone mode?  I think attempting to weld partial clones together
with core.sparseCheckout is crazy, so I'm tempted to just make it be
specific to cone mode and to push people to use it.  But I'm
interested in thoughts on the matter.
  * should worktrees be affected?  (I've been an advocate of new
worktrees inheriting the sparse patterns of the worktree in use at the
time the new worktree was created.  Junio once suggested he didn't
like that and that worktrees should start out dense.  That seems
problematic to me in big repos with partial clones and sparse checkouts
in use.  Perhaps dense new worktrees is the behavior you get when
core.restrictToSparsePaths is false?)
  * does my idea for the merge machinery make folks uncomfortable?
Should that be a different option?  Being able to do trivial *tree*
merges for the huge portion of the tree outside the sparsity paths
would be a huge win, especially with partial clones, but it certainly
is different.  Then again, Microsoft has disabled rename detection
entirely based on it being too expensive, so perhaps the idea of
rename-detection-within-your-cone-if-you-really-didn't-modify-anything-outside-the-cone-on-your-side-of-history
is a reasonable middle ground between off and on for rename detection.
  * what should the default be?  Junio suggested elsewhere[1] that
sparse-checkouts and partial clones should probably be welded together
(with partial clones downloading just history in the sparsity paths by
default), in which case having this option be true would be useful.
But it may also be slightly weird because it'll probably take us a
while to implement this; while the big warning in
git-sparse-checkout.txt certainly allows this:
        THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR, AND THE BEHAVIOR OF OTHER
        COMMANDS IN THE PRESENCE OF SPARSE-CHECKOUTS, WILL LIKELY CHANGE IN
        THE FUTURE.
It may still be slightly weird that the default behavior of commands
in the presence of sparse-checkouts changes release to release until
we get it all implemented.

[1] https://lore.kernel.org/git/xmqqh7ycw5lc.fsf@gitster.c.googlers.com/

* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-31 19:12         ` Elijah Newren
@ 2020-03-31 20:02           ` Derrick Stolee
  2020-04-27 17:15             ` Matheus Tavares Bernardino
  2020-04-29 17:21             ` Elijah Newren
  0 siblings, 2 replies; 120+ messages in thread
From: Derrick Stolee @ 2020-03-31 20:02 UTC (permalink / raw)
  To: Elijah Newren, Matheus Tavares Bernardino
  Cc: Junio C Hamano, Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc, Jonathan Tan

On 3/31/2020 3:12 PM, Elijah Newren wrote:
> // adding Jonathan Tan to cc based on the fact that we keep bringing
> up partial clones and how it relates...
> 
> On Sun, Mar 29, 2020 at 8:23 PM Matheus Tavares Bernardino
> <matheus.bernardino@usp.br> wrote:
>>
>> On Tue, Mar 24, 2020 at 3:30 PM Junio C Hamano <gitster@pobox.com> wrote:
>>>
>>> Elijah Newren <newren@gmail.com> writes:
>>>
>>>> On Mon, Mar 23, 2020 at 11:13 PM Matheus Tavares
>>>> <matheus.bernardino@usp.br> wrote:
>>>>>
>>>>> In the last commit, git-grep learned to honor sparsity patterns. For
>>>>> some use cases, however, it may be desirable to search outside the
>>>>> sparse checkout. So add the '--ignore-sparsity' option, which restores
>>>>> the old behavior. Also add the grep.ignoreSparsity configuration, to
>>>>> allow setting this behavior by default.
>>>>
>>>> Should `--ignore-sparsity` be a global git option rather than a
>>>> grep-specific one?  Also, should grep.ignoreSparsity rather be
>>>> core.ignoreSparsity or core.searchOutsideSparsePaths or something?
>>>
>>> Great question.  I think "git diff" with various options would also
>>> want to optionally be able to be confined within the sparse cone, or
>>> checking the entire world by lazily fetching outside the sparsity.
>> [...]
>>> Regardless of the choice of the default, it would be a good
>>> idea to make the subcommands consistently offer the same default and
>>> allow the non-default views with the same UI.
>>
>> Yeah, it seems like a sensible path. Regarding implementation, there
>> is the question that Elijah raised, of whether to use a global git
>> option or separate but consistent options for each subcommand. I don't
>> have much experience with sparse checkout to argue for one or
>> another, so I would like to hear what others have to say about it.
>>
>> A question that comes to my mind regarding the global git option is:
>> will --ignore-sparsity (or whichever name we choose for it [1]) be
>> sufficient for all subcommands? Or may some of them require additional
>> options for command-specific behaviors concerning sparsity patterns?
>> Also, would it be OK if we just ignored the option in commands that do
>> not operate differently in sparse checkouts (maybe, fetch, branch and
>> send-email, for example)? And would it make sense to allow
>> constructions such as `git --ignore-sparsity checkout` or even `git
>> --ignore-sparsity sparse-checkout ...`?
> 
> I think the same option would probably be sufficient for all
> subcommands, though I have a minor question about the merge machinery
> (below).  And generally, I think it would be unusual for people to
> pass the command line flag; I suspect most would set a config option
> for most cases and then only occasionally override it on the command
> line.  Since that config option would always be set, I'd expect
> commands that are unaffected to just ignore it (much like both "git -c
> merge.detectRenames=true fetch" and "git --work-tree=othertree fetch"
> will both ignore the irrelevant options rather than trying to detect
> that they were specified and error out).
> 
>> [1]: Does anyone have suggestions for the option/config name? The best
>> I could come up with so far (without being too verbose) is
>> --no-sparsity-constraints. But I fear this might sound generic. As
>> Elijah already mentioned, --ignore-sparsity is not good either, as it
>> introduces double negatives in code...
> 
> Does verbosity matter that much?  I think people would set it in
> config, and tab completion would make it pretty easy to complete in
> any event.
> 
> Anyway, maybe it will help if I provide a very rough first draft of
> what changes we could introduce to Documentation/config/core.txt, and
> then ask a bunch of my own questions about it below:
> 
> """
> core.restrictToSparsePaths::
>         Only meaningful in conjunction with core.sparseCheckoutCone.
>         This option extends sparse checkouts (which limit which paths
>         are written to the worktree), so that output and operations
>         are also limited to the sparsity paths where possible and
>         implemented.  The purpose of this option is to (1) focus
>         output for the user on the portion of the repository that is
>         of interest to them, and (2) enable potentially dramatic
>         performance improvements, especially in conjunction with
>         partial clones.
> +
> When this option is true, git commands such as log, diff, and grep may
> limit their output to the directories specified by the sparse cone, or
> to the intersection of those paths and any paths (like `*.c`) that the
> user might also specify on the command line.  (Note that this limit for
> diff and grep only becomes relevant with --cached or when specifying a
> REVISION, since a search of the working tree will automatically be
> limited to the sparse paths that are present.)  Also, commands like
> bisect may only select commits which modify paths within the sparsity
> cone.  The merge machinery may use the sparse paths as a heuristic to
> avoid trying to detect renames from within the sparsity cone to
> outside the sparsity cone when at least one side of history only
> touches paths within the sparsity cone (this can make the merge
> machinery faster, but may risk modify/delete conflicts since upstream
> can rename a file within the sparsity paths to a location outside
> them).  Commands which export, integrity check, or create history will
> always operate on full trees (e.g. fast-export, format-patch, fsck,
> commit, etc.), unaffected by any sparsity patterns.
> """
> 
> Several questions here, of course:
> 
>   * do people like or hate the name?  indifferent?  have alternate ideas?

It's probably time to create a 'sparse-checkout' config space. That
would allow

	sparse-checkout.restrictGrep = true

as an option. Or a more general

	sparse-checkout.restrictCommands = true

to make it clear that it affects multiple commands.

>   * should we restrict this to core.sparseCheckoutCone as I suggested
> above or also allow people to do it with core.sparseCheckout without
> the cone mode?  I think attempting to weld partial clones together
> with core.sparseCheckout is crazy, so I'm tempted to just make it be
> specific to cone mode and to push people to use it.  But I'm
> interested in thoughts on the matter.

Personally, I prefer cone mode and think it covers 99% of cases.
However, there are some who are using a big directory full of large
binaries and relying on file-prefix matches to get only the big
binaries they need. Until they restructure their repositories to
take advantage of cone mode, we should be considerate of the full
sparse-checkout specification when possible.

>   * should worktrees be affected?  (I've been an advocate of new
> worktrees inheriting the sparse patterns of the worktree in use at the
> time the new worktree was created.  Junio once suggested he didn't
> like that and that worktrees should start out dense.  That seems
> problematic to me in big repos with partial clones and sparse checkouts
> in use.  Perhaps dense new worktrees is the behavior you get when
> core.restrictToSparsePaths is false?)

We should probably consider a `--sparse` option for `git worktree add`
so we can allow interested users to add worktrees that initialize to
a sparse-checkout. Optionally create a config option that would copy
the sparse-checkout file from the current repo to the worktree.

>   * does my idea for the merge machinery make folks uncomfortable?
> Should that be a different option?  Being able to do trivial *tree*
> merges for the huge portion of the tree outside the sparsity paths
> would be a huge win, especially with partial clones, but it certainly
> is different.  Then again, Microsoft has disabled rename detection
> entirely based on it being too expensive, so perhaps the idea of
> rename-detection-within-your-cone-if-you-really-didn't-modify-anything-outside-the-cone-on-your-side-of-history
> is a reasonable middle ground between off and on for rename detection.

The part where you say "when at least one side of history only
touches paths within the sparsity cone" makes me want to entertain
the idea if it can be done cleanly.

I'm more concerned about the "git bisect" logic being restricted to
the cone, since that is such an open-ended command for what is
considered "good" or "bad".

>   * what should the default be?  Junio suggested elsewhere[1] that
> sparse-checkouts and partial clones should probably be welded together
> (with partial clones downloading just history in the sparsity paths by
> default), in which case having this option be true would be useful.

My opinion on this is as follows: filtering blobs based on sparse-
checkout patterns does not filter enough, and filtering trees based
on sparse-checkout patterns filters too much. The costs are just
flipped: having extra trees is not a huge problem but recovering from
a "tree miss" is problematic. Having extra blobs is painful, but
recovering from a "blob miss" is not a big deal.

> But it may also be slightly weird because it'll probably take us a
> while to implement this; while the big warning in
> git-sparse-checkout.txt certainly allows this:
>         THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR, AND THE BEHAVIOR OF OTHER
>         COMMANDS IN THE PRESENCE OF SPARSE-CHECKOUTS, WILL LIKELY CHANGE IN
>         THE FUTURE.
> It may still be slightly weird that the default behavior of commands
> in the presence of sparse-checkouts changes release to release until
> we get it all implemented.

I appreciate that we put that warning at the top. We will be
able to do more experimental things with the feature because
of it. The idea I'm toying with is to have "git clone --sparse"
set core.sparseCheckoutCone = true.

Also, if we are creating the "sparse-checkout.*" config space,
we should "rename" core.sparseCheckoutCone to sparse-checkout.coneMode
or something. We would need to support both for a while, for sure.

Thanks,
-Stolee


* Re: [RFC PATCH 2/3] grep: honor sparse checkout patterns
  2020-03-24 22:55     ` Matheus Tavares Bernardino
@ 2020-04-21  2:10       ` Matheus Tavares Bernardino
  2020-04-21  3:08         ` Elijah Newren
  0 siblings, 1 reply; 120+ messages in thread
From: Matheus Tavares Bernardino @ 2020-04-21  2:10 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git Mailing List, Derrick Stolee, brian m. carlson, Stefan Beller

Hi, Elijah, Stolee and others

On Tue, Mar 24, 2020 at 7:55 PM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> On Tue, Mar 24, 2020 at 4:15 AM Elijah Newren <newren@gmail.com> wrote:
> >
> > On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
> > <matheus.bernardino@usp.br> wrote:
> > >
> > > Something I'm not entirely sure in this patch is how we implement the
> > > mechanism to honor sparsity for the `git grep <commit-ish>` case (which
> > > is treated in the grep_tree() function). Currently, the patch looks for
> > > an index entry that matches the path, and then checks its skip_worktree
> >
> > As you discuss below, checking the index is both wrong _and_ costly.
> > You should use the sparsity patterns; Stolee did a lot of work to make
> > those correspond to simple hashes you could check to determine whether
> > to even walk into a subdirectory.
[...]
> OK, makes sense.

I've been working on the file skipping mechanism using the sparsity
patterns directly. But I'm uncertain about some implementation
details. So I wanted to share my current plan with you, to get some
feedback before going deeper.

The first idea was to load the sparsity patterns a priori and pass
them to grep_tree(), which recursively greps the entries of a given
tree object. If --recurse-submodules is given, however, we would also
need to load each subrepo's sparse-checkout file on the fly (as the
subrepos are lazily initialized in grep_tree()'s call chain). That's
not a problem on its own. But in the most naive implementation, this
means unnecessarily re-loading the sparse-checkout files of the
submodules for each tree given to git-grep (as grep_tree() is called
separately for each one of them).

So my next idea was to implement a cache, mapping 'struct repository's
to 'struct pattern_list'. Well, not 'struct repository' itself, but
repo->gitdir. This way we could load each file once, store the pattern
list, and quickly retrieve the one that affects the repository
currently being grepped, whether it is a submodule or not. But, is
gitdir unique per repository? If not, could we use
repo_git_path(repo, "info/sparse-checkout") as the key?

I already have a prototype implementation of the last idea (using
repo_git_path()). But I wanted to make sure, does this seem like a
good path? Or should we avoid the work of having this hashmap here and
do something else, such as adding a 'struct pattern_list' to 'struct
repository' directly?
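
To make the cache idea above concrete, here is a minimal, self-contained
sketch. It is only an illustration under stated assumptions:
'struct sketch_pattern_list' is a stand-in for the real 'struct
pattern_list', the names pl_cache, pl_cache_get and pl_cache_put are
hypothetical, and a linear scan takes the place of the hashmap to keep
the example short. The key is assumed to be a string such as the path
returned by repo_git_path(repo, "info/sparse-checkout").

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for Git's 'struct pattern_list'. */
struct sketch_pattern_list {
	int nr_patterns;
};

#define PL_CACHE_MAX 64 /* arbitrary capacity for this sketch */

struct pl_cache_entry {
	const char *key; /* assumed: path of the sparse-checkout file */
	struct sketch_pattern_list *pl;
};

struct pl_cache {
	struct pl_cache_entry entries[PL_CACHE_MAX];
	size_t nr;
};

/* Return the cached pattern list for 'key', or NULL if it was never loaded. */
struct sketch_pattern_list *pl_cache_get(const struct pl_cache *c,
					 const char *key)
{
	size_t i;

	for (i = 0; i < c->nr; i++)
		if (!strcmp(c->entries[i].key, key))
			return c->entries[i].pl;
	return NULL;
}

/*
 * Remember 'pl' under 'key'. The key is not copied, so the caller must
 * keep it alive. Returns 0 on success, -1 if the cache is full.
 */
int pl_cache_put(struct pl_cache *c, const char *key,
		 struct sketch_pattern_list *pl)
{
	if (c->nr == PL_CACHE_MAX)
		return -1;
	c->entries[c->nr].key = key;
	c->entries[c->nr].pl = pl;
	c->nr++;
	return 0;
}
```

With something like this, each tree passed to git-grep would first call
pl_cache_get() and only parse the file (followed by pl_cache_put()) on a
miss.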

Thanks,
Matheus

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH 2/3] grep: honor sparse checkout patterns
  2020-04-21  2:10       ` Matheus Tavares Bernardino
@ 2020-04-21  3:08         ` Elijah Newren
  2020-04-22 12:08           ` Derrick Stolee
  2020-04-23  6:09           ` Matheus Tavares Bernardino
  0 siblings, 2 replies; 120+ messages in thread
From: Elijah Newren @ 2020-04-21  3:08 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: Git Mailing List, Derrick Stolee, brian m. carlson,
	Stefan Beller, Jonathan Nieder

On Mon, Apr 20, 2020 at 7:11 PM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> Hi, Elijah, Stolee and others
>
> On Tue, Mar 24, 2020 at 7:55 PM Matheus Tavares Bernardino
> <matheus.bernardino@usp.br> wrote:
> >
> > On Tue, Mar 24, 2020 at 4:15 AM Elijah Newren <newren@gmail.com> wrote:
> > >
> > > On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
> > > <matheus.bernardino@usp.br> wrote:
> > > >
> > > > > Something I'm not entirely sure about in this patch is how we implement the
> > > > mechanism to honor sparsity for the `git grep <commit-ish>` case (which
> > > > is treated in the grep_tree() function). Currently, the patch looks for
> > > > an index entry that matches the path, and then checks its skip_worktree
> > >
> > > As you discuss below, checking the index is both wrong _and_ costly.
> > > You should use the sparsity patterns; Stolee did a lot of work to make
> > > those correspond to simple hashes you could check to determine whether
> > > to even walk into a subdirectory.
> [...]
> > OK, makes sense.
>
> I've been working on the file skipping mechanism using the sparsity
> patterns directly. But I'm uncertain about some implementation
> details. So I wanted to share my current plan with you, to get some
> feedback before going deeper.
>
> The first idea was to load the sparsity patterns a priori and pass
> them to grep_tree(), which recursively greps the entries of a given
> tree object. If --recurse-submodules is given, however, we would also
> need to load each subrepo's sparse-checkout file on the fly (as the
> subrepos are lazily initialized in grep_tree()'s call chain). That's
> not a problem on its own. But in the most naive implementation, this
> means unnecessarily re-loading the sparse-checkout files of the
> submodules for each tree given to git-grep (as grep_tree() is called
> separately for each one of them).

Wouldn't loading the sparse-checkout files be fast compared to
grepping a submodule for matching strings?  And not just fast, but
essentially in the noise and hard to even measure?  I have a hard time
fathoming parsing the sparse-checkout file for a submodule somehow
appreciably affecting the cost of grepping through that submodule.  If
the submodule has a huge number of sparse-checkout patterns, that'll
be because it has a ginormous number of files and grepping through
them all would be way, way longer.  If the submodule only has a few
files, then the sparse-checkout file is only going to be a few lines
at most.

Also, from another angle: I think the original intent of submodules
was an alternate form of sparse-checkout/partial-clone, letting people
deal with just their piece of the repo.  As such, do we really even
expect people to use sparse-checkouts and submodules together, let
alone use them very heavily together?  Sure, someone will use them,
but I have a hard time imagining the scale of use of both features
heavily enough for this to matter, especially since it also requires
specifying multiple trees to grep (which is slightly unusual) in
addition to the combination of these other features before your
optimization here could kick in and be worthwhile.

I'd be very tempted to just implement the most naive implementation
and maybe leave a TODO note in the code for some future person to come
along and optimize if it really matters, but I'd like to see numbers
before we spend the development and maintenance effort on it because
I'm having a hard time imagining any scale where it could matter.

> So my next idea was to implement a cache, mapping 'struct repository's
> to 'struct pattern_list'. Well, not 'struct repository' itself, but
> repo->gitdir. This way we could load each file once, store the pattern
> list, and quickly retrieve the one that affects the repository
> currently being grepped, whether it is a submodule or not. But, is
> gitdir unique per repository? If not, could we use
> repo_git_path(repo, "info/sparse-checkout") as the key?
>
> I already have a prototype implementation of the last idea (using
> repo_git_path()). But I wanted to make sure, does this seem like a
> good path? Or should we avoid the work of having this hashmap here and
> do something else, such as adding a 'struct pattern_list' to 'struct
> repository' directly?

Honestly, it sounds a bit like premature optimization to me.  Sorry if
that's disappointing since you've apparently already put some effort
into this, and it sounds like you're on a good track for optimizing
this if it were necessary, but I'm just having a hard time figuring
out whether it'd really help and be worth the code complexity.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH 2/3] grep: honor sparse checkout patterns
  2020-04-21  3:08         ` Elijah Newren
@ 2020-04-22 12:08           ` Derrick Stolee
  2020-04-23  6:09           ` Matheus Tavares Bernardino
  1 sibling, 0 replies; 120+ messages in thread
From: Derrick Stolee @ 2020-04-22 12:08 UTC (permalink / raw)
  To: Elijah Newren, Matheus Tavares Bernardino
  Cc: Git Mailing List, Derrick Stolee, brian m. carlson,
	Stefan Beller, Jonathan Nieder

On 4/20/2020 11:08 PM, Elijah Newren wrote:
> On Mon, Apr 20, 2020 at 7:11 PM Matheus Tavares Bernardino
> <matheus.bernardino@usp.br> wrote:
>>
>> Hi, Elijah, Stolee and others
>>
>> On Tue, Mar 24, 2020 at 7:55 PM Matheus Tavares Bernardino
>> <matheus.bernardino@usp.br> wrote:
>>>
>>> On Tue, Mar 24, 2020 at 4:15 AM Elijah Newren <newren@gmail.com> wrote:
>>>>
>>>> On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
>>>> <matheus.bernardino@usp.br> wrote:
>>>>>
>>>>> Something I'm not entirely sure about in this patch is how we implement the
>>>>> mechanism to honor sparsity for the `git grep <commit-ish>` case (which
>>>>> is treated in the grep_tree() function). Currently, the patch looks for
>>>>> an index entry that matches the path, and then checks its skip_worktree
>>>>
>>>> As you discuss below, checking the index is both wrong _and_ costly.
>>>> You should use the sparsity patterns; Stolee did a lot of work to make
>>>> those correspond to simple hashes you could check to determine whether
>>>> to even walk into a subdirectory.
>> [...]
>>> OK, makes sense.
>>
>> I've been working on the file skipping mechanism using the sparsity
>> patterns directly. But I'm uncertain about some implementation
>> details. So I wanted to share my current plan with you, to get some
>> feedback before going deeper.
>>
>> The first idea was to load the sparsity patterns a priori and pass
>> them to grep_tree(), which recursively greps the entries of a given
>> tree object. If --recurse-submodules is given, however, we would also
>> need to load each subrepo's sparse-checkout file on the fly (as the
>> subrepos are lazily initialized in grep_tree()'s call chain). That's
>> not a problem on its own. But in the most naive implementation, this
>> means unnecessarily re-loading the sparse-checkout files of the
>> submodules for each tree given to git-grep (as grep_tree() is called
>> separately for each one of them).
> 
> Wouldn't loading the sparse-checkout files be fast compared to
> grepping a submodule for matching strings?  And not just fast, but
> essentially in the noise and hard to even measure?  I have a hard time
> fathoming parsing the sparse-checkout file for a submodule somehow
> appreciably affecting the cost of grepping through that submodule.  If
> the submodule has a huge number of sparse-checkout patterns, that'll
> be because it has a ginormous number of files and grepping through
> them all would be way, way longer.  If the submodule only has a few
> files, then the sparse-checkout file is only going to be a few lines
> at most.
> 
> Also, from another angle: I think the original intent of submodules
> was an alternate form of sparse-checkout/partial-clone, letting people
> deal with just their piece of the repo.  As such, do we really even
> expect people to use sparse-checkouts and submodules together, let
> alone use them very heavily together?  Sure, someone will use them,
> but I have a hard time imagining the scale of use of both features
> heavily enough for this to matter, especially since it also requires
> specifying multiple trees to grep (which is slightly unusual) in
> addition to the combination of these other features before your
> optimization here could kick in and be worthwhile.
> 
> I'd be very tempted to just implement the most naive implementation
> and maybe leave a TODO note in the code for some future person to come
> along and optimize if it really matters, but I'd like to see numbers
> before we spend the development and maintenance effort on it because
> I'm having a hard time imagining any scale where it could matter.
> 
>> So my next idea was to implement a cache, mapping 'struct repository's
>> to 'struct pattern_list'. Well, not 'struct repository' itself, but
>> repo->gitdir. This way we could load each file once, store the pattern
>> list, and quickly retrieve the one that affects the repository
>> currently being grepped, whether it is a submodule or not. But, is
>> gitdir unique per repository? If not, could we use
>> repo_git_path(repo, "info/sparse-checkout") as the key?
>>
>> I already have a prototype implementation of the last idea (using
>> repo_git_path()). But I wanted to make sure, does this seem like a
>> good path? Or should we avoid the work of having this hashmap here and
>> do something else, such as adding a 'struct pattern_list' to 'struct
>> repository' directly?
> 
> Honestly, it sounds a bit like premature optimization to me.  Sorry if
> that's disappointing since you've apparently already put some effort
> into this, and it sounds like you're on a good track for optimizing
> this if it were necessary, but I'm just having a hard time figuring
> out whether it'd really help and be worth the code complexity.

My initial thought was to use a stack or queue. It depends on how
git-grep treats submodules. Imagine directories A, B, C where B is a
submodule.

If results from 'B' are output between results from 'A' and 'C', then
use a stack to "push" the latest sparse-checkout patterns as you
deepen into a submodule, then "pop" the patterns as you leave a
submodule.

If results from 'B' are output after results from 'C', then you could
possibly use a queue instead. I find this unlikely, and it would
behave strangely for nested submodules.

Since "struct pattern_list" has most of the information you require,
then it should not be challenging to create a list of them.
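
The push/pop scheme described above could look roughly like the
following. This is a sketch only: 'struct sketch_pattern_list' is a
simplified stand-in for 'struct pattern_list', and pl_stack,
pl_stack_push, pl_stack_top and pl_stack_pop are hypothetical names,
not existing Git API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for Git's 'struct pattern_list'. */
struct sketch_pattern_list {
	const char *src; /* e.g. which repository the patterns came from */
};

#define PL_STACK_MAX 32 /* arbitrary nesting limit for this sketch */

struct pl_stack {
	const struct sketch_pattern_list *items[PL_STACK_MAX];
	size_t nr;
};

/*
 * Push the patterns of the submodule we are about to enter.
 * Returns 0 on success, -1 if the nesting limit is reached.
 */
int pl_stack_push(struct pl_stack *s, const struct sketch_pattern_list *pl)
{
	if (s->nr == PL_STACK_MAX)
		return -1;
	s->items[s->nr++] = pl;
	return 0;
}

/* Patterns of the repository currently being grepped, or NULL if none. */
const struct sketch_pattern_list *pl_stack_top(const struct pl_stack *s)
{
	return s->nr ? s->items[s->nr - 1] : NULL;
}

/* Drop the innermost patterns when leaving a submodule. */
void pl_stack_pop(struct pl_stack *s)
{
	assert(s->nr > 0);
	s->nr--;
}
```

For the A, B, C example: push the superproject's patterns, push B's
patterns on entering B, pop on leaving it, and the superproject's
patterns are on top again when C is visited. Nested submodules work the
same way, one level deeper.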

Hopefully that provides some ideas.

Thanks,
-Stolee



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH 2/3] grep: honor sparse checkout patterns
  2020-04-21  3:08         ` Elijah Newren
  2020-04-22 12:08           ` Derrick Stolee
@ 2020-04-23  6:09           ` Matheus Tavares Bernardino
  1 sibling, 0 replies; 120+ messages in thread
From: Matheus Tavares Bernardino @ 2020-04-23  6:09 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git Mailing List, Derrick Stolee, brian m. carlson,
	Stefan Beller, Jonathan Nieder

On Tue, Apr 21, 2020 at 12:08 AM Elijah Newren <newren@gmail.com> wrote:
>
> On Mon, Apr 20, 2020 at 7:11 PM Matheus Tavares Bernardino
> <matheus.bernardino@usp.br> wrote:
> >
> > I've been working on the file skipping mechanism using the sparsity
> > patterns directly. But I'm uncertain about some implementation
> > details. So I wanted to share my current plan with you, to get some
> > feedback before going deeper.
> >
> > The first idea was to load the sparsity patterns a priori and pass
> > them to grep_tree(), which recursively greps the entries of a given
> > tree object. If --recurse-submodules is given, however, we would also
> > need to load each subrepo's sparse-checkout file on the fly (as the
> > subrepos are lazily initialized in grep_tree()'s call chain). That's
> > not a problem on its own. But in the most naive implementation, this
> > means unnecessarily re-loading the sparse-checkout files of the
> > submodules for each tree given to git-grep (as grep_tree() is called
> > separately for each one of them).
>
> Wouldn't loading the sparse-checkout files be fast compared to
> grepping a submodule for matching strings?  And not just fast, but
> essentially in the noise and hard to even measure?  I have a hard time
> fathoming parsing the sparse-checkout file for a submodule somehow
> appreciably affecting the cost of grepping through that submodule.  If
> the submodule has a huge number of sparse-checkout patterns, that'll
> be because it has a ginormous number of files and grepping through
> them all would be way, way longer.  If the submodule only has a few
> files, then the sparse-checkout file is only going to be a few lines
> at most.

Yeah, makes sense.

> Also, from another angle: I think the original intent of submodules
> was an alternate form of sparse-checkout/partial-clone, letting people
> deal with just their piece of the repo.  As such, do we really even
> expect people to use sparse-checkouts and submodules together, let
> alone use them very heavily together?  Sure, someone will use them,
> but I have a hard time imagining the scale of use of both features
> heavily enough for this to matter, especially since it also requires
> specifying multiple trees to grep (which is slightly unusual) in
> addition to the combination of these other features before your
> optimization here could kick in and be worthwhile.
>
> I'd be very tempted to just implement the most naive implementation
> and maybe leave a TODO note in the code for some future person to come
> along and optimize if it really matters, but I'd like to see numbers
> before we spend the development and maintenance effort on it because
> I'm having a hard time imagining any scale where it could matter.

You're right. I guess I got a little too excited about the
optimization possibilities and neglected the fact that they might not
even be needed here.

Just to take a look at some numbers, I prototyped the naive
implementation and downloaded a testing repository[1] containing 8
submodules (or 14 counting the nested ones). For each of the
non-nested submodules, I added its .gitignore rules to the
sparse-checkout file (of course this doesn't make any sense for a
real-world usage, but I just wanted to populate the file with a large
quantity of valid rules, to test the parsing time). I also added the
rule '/*'. Then I ran:

git-grep --threads=1 --recurse-submodules -E "config_[a-z]+\(" $(cat /tmp/trees)

Where /tmp/trees contained about 120 trees in the said repository
(again, a probably unreal case, for testing purposes only). Then,
measuring the time spent only inside the function I created to load a
sparse-checkout file for a given 'struct repository', I got to the
following numbers:

Number of calls: 1531 (makes sense: ~120 trees and 14 submodules)
Percentage over the total time: 0.015%
Number of matches: 300897

And using 8 threads, I got the same numbers except for the percentage,
which was a little higher: 0.05%.

So, indeed, the overhead of re-loading the files is insignificant.
And my cache idea was a premature and unnecessary optimization.

> > So my next idea was to implement a cache, mapping 'struct repository's
> > to 'struct pattern_list'. Well, not 'struct repository' itself, but
> > repo->gitdir. This way we could load each file once, store the pattern
> > list, and quickly retrieve the one that affects the repository
> > currently being grepped, whether it is a submodule or not. But, is
> > gitdir unique per repository? If not, could we use
> > repo_git_path(repo, "info/sparse-checkout") as the key?
> >
> > I already have a prototype implementation of the last idea (using
> > repo_git_path()). But I wanted to make sure, does this seem like a
> > good path? Or should we avoid the work of having this hashmap here and
> > do something else, such as adding a 'struct pattern_list' to 'struct
> > repository' directly?
>
> Honestly, it sounds a bit like premature optimization to me.  Sorry if
> that's disappointing since you've apparently already put some effort
> into this, and it sounds like you're on a good track for optimizing
> this if it were necessary, but I'm just having a hard time figuring
> out whether it'd really help and be worth the code complexity.

No problem! I'm glad to have this feedback now, while I'm still
working on v2  :) Now I can focus on what's really relevant. So thanks
again!

[1]: https://github.com/surevine/Metre

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-31 20:02           ` Derrick Stolee
@ 2020-04-27 17:15             ` Matheus Tavares Bernardino
  2020-04-29 16:46               ` Elijah Newren
  2020-04-29 17:21             ` Elijah Newren
  1 sibling, 1 reply; 120+ messages in thread
From: Matheus Tavares Bernardino @ 2020-04-27 17:15 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Elijah Newren, Junio C Hamano, Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc, Jonathan Tan

Hi, Stolee and Elijah

I think I just finished addressing the comments on patch 2/3 [1]. And
I'm now looking at the ones in 3/3 (this one). Below are some
questions, just to make sure I'm going in the right direction with
this one.

On Tue, Mar 31, 2020 at 5:02 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 3/31/2020 3:12 PM, Elijah Newren wrote:
> >
> > Anyway, maybe it will help if I provide a very rough first draft of
> > what changes we could introduce to Documentation/config/core.txt, and
> > then ask a bunch of my own questions about it below:
> >
> > """
> > core.restrictToSparsePaths::
> >         Only meaningful in conjunction with core.sparseCheckoutCone.
> >         This option extends sparse checkouts (which limit which paths
> >         are written to the worktree), so that output and operations
> >         are also limited to the sparsity paths where possible and
> >         implemented.  The purpose of this option is to (1) focus
> >         output for the user on the portion of the repository that is
> >         of interest to them, and (2) enable potentially dramatic
> >         performance improvements, especially in conjunction with
> >         partial clones.
...
> > """
> >
> > Several questions here, of course:
> >
> >   * do people like or hate the name?  indifferent?  have alternate ideas?
>
> It's probably time to create a 'sparse-checkout' config space. That
> would allow
>
>         sparse-checkout.restrictGrep = true
>
> as an option. Or a more general
>
>         sparse-checkout.restrictCommands = true
>
> to make it clear that it affects multiple commands.

If we are creating the new namespace, 'core.sparseCheckout' should
also be renamed to something like 'sparse-checkout.enabled', right?
And maybe we could use 'sparsecheckout.*', instead? That seems to be
the convention for settings on hyphenated commands (as in sendemail.*,
uploadpack.* and gitgui.*).

As for compatibility, when running `git sparse-checkout init`, if the
config file already has the core.sparseCheckout setting, should we
remove it? Or just add the new sparsecheckout.enabled config, which
will always be read first?
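
To illustrate what "read first" could mean, here is a small sketch of
one possible precedence rule. It is not a decided design: the setting
name sparsecheckout.enabled is only the proposal from this message, and
the enum and function names are hypothetical. Each setting is modeled
as unset, false or true.

```c
/* Hypothetical tri-state for a boolean config value that may be absent. */
enum sketch_tristate {
	SKETCH_UNSET = -1,
	SKETCH_FALSE = 0,
	SKETCH_TRUE = 1
};

/*
 * new_val: value of the proposed sparsecheckout.enabled
 * old_val: value of the existing core.sparseCheckout
 *
 * The new name wins when it is set; otherwise fall back to the old
 * one. With neither set, sparse checkout is treated as disabled.
 */
int sketch_sparse_enabled(enum sketch_tristate new_val,
			  enum sketch_tristate old_val)
{
	if (new_val != SKETCH_UNSET)
		return new_val == SKETCH_TRUE;
	if (old_val != SKETCH_UNSET)
		return old_val == SKETCH_TRUE;
	return 0;
}

/* Nonzero when both names are set but disagree with each other. */
int sketch_sparse_out_of_sync(enum sketch_tristate new_val,
			      enum sketch_tristate old_val)
{
	return new_val != SKETCH_UNSET && old_val != SKETCH_UNSET &&
	       new_val != old_val;
}
```

The second helper only detects a disagreement between the two names.
Whether that should produce a warning, an error, or nothing at all is
left open here.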

Also, should we emit a warning about the former being deprecated? The
good thing about deprecation warnings, IMO, is that users will know
the name change faster. But, at least for `git grep <tree>`, where we
read core.sparseCheckout and core.sparseCheckoutCone for each
submodule and each tree, there would be too much pollution in the
output...

Finally, about restrictCommands, the idea is to have both
sparsecheckout.restrictCommands and `git --restrict-to-sparse-paths`,
right? For now, the option/setting would only affect grep, but support
would be added gradually to other commands in the future. I noticed
git-read-tree already has a --no-sparse-checkout option. Should we
remove this option in favor of the global
--[no-]restrict-to-sparse-paths?

Sorry for too many questions. I just wanted to make sure that I
understood the plan before diving into the implementation, to avoid
going in the wrong direction.

[1]: Here is a sneak peek for v2 of patch 2/3, in case you might want
to take a look:
https://github.com/matheustavares/git/commit/970ef529f1e8f719c4427bd9fea8205ada69d913

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-04-27 17:15             ` Matheus Tavares Bernardino
@ 2020-04-29 16:46               ` Elijah Newren
  0 siblings, 0 replies; 120+ messages in thread
From: Elijah Newren @ 2020-04-29 16:46 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: Derrick Stolee, Junio C Hamano, Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc, Jonathan Tan

On Mon, Apr 27, 2020 at 10:15 AM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> Hi, Stolee and Elijah
>
> I think I just finished addressing the comments on patch 2/3 [1]. And
> I'm now looking at the ones in 3/3 (this one). Below are some
> questions, just to make sure I'm going in the right direction with
> this one.
>
> On Tue, Mar 31, 2020 at 5:02 PM Derrick Stolee <stolee@gmail.com> wrote:
> >
> > On 3/31/2020 3:12 PM, Elijah Newren wrote:
> > >
> > > Anyway, maybe it will help if I provide a very rough first draft of
> > > what changes we could introduce to Documentation/config/core.txt, and
> > > then ask a bunch of my own questions about it below:
> > >
> > > """
> > > core.restrictToSparsePaths::
> > >         Only meaningful in conjunction with core.sparseCheckoutCone.
> > >         This option extends sparse checkouts (which limit which paths
> > >         are written to the worktree), so that output and operations
> > >         are also limited to the sparsity paths where possible and
> > >         implemented.  The purpose of this option is to (1) focus
> > >         output for the user on the portion of the repository that is
> > >         of interest to them, and (2) enable potentially dramatic
> > >         performance improvements, especially in conjunction with
> > >         partial clones.
> ...
> > > """
> > >
> > > Several questions here, of course:
> > >
> > >   * do people like or hate the name?  indifferent?  have alternate ideas?
> >
> > It's probably time to create a 'sparse-checkout' config space. That
> > would allow
> >
> >         sparse-checkout.restrictGrep = true
> >
> > as an option. Or a more general
> >
> >         sparse-checkout.restrictCommands = true
> >
> > to make it clear that it affects multiple commands.
>
> If we are creating the new namespace, 'core.sparseCheckout' should
> also be renamed to something like 'sparse-checkout.enabled', right?
> And maybe we could use 'sparsecheckout.*', instead? That seems to be
> the convention for settings on hyphenated commands (as in sendemail.*,
> uploadpack.* and gitgui.*).

Or maybe just call the namespace 'sparse.*' if we're going that route?

> As for compatibility, when running `git sparse-checkout init`, if the
> config file already has the core.sparseCheckout setting, should we
> remove it? Or just add the new sparsecheckout.enabled config, which
> will always be read first?

We seem to have two competing issues:

  * If you remove the core.sparseCheckout setting in favor of
sparse.enabled, then people can't use the repo with an older version
of git.  (This may be acceptable, but we've generally been somewhat
careful with index extensions and such to avoid such a state, with
slow transitions with index and pack versions and such.)
  * If you leave the core.sparseCheckout setting around as well as
having sparse.enabled, then we have two different settings that we can
keep in sync with newer git but which older git will only update one
of.  What do we do if we detect they are out of sync?  Throw an error?
Pretend that one overrules?  If the older one overrules, what do we
accomplish with the new name?  If the newer name overrules, doesn't
that also potentially break using an older git version?

I'm not sure what to do here.  Maybe people who have worked on index
version and pack version transitions have some good suggestions for
us?

> Also, should we emit a warning about the former being deprecated? The
> good thing about deprecation warnings, IMO, is that users will know
> the name change faster. But, at least for `git grep <tree>`, where we
> read core.sparseCheckout and core.sparseCheckoutCone for each
> submodule and each tree, there would be too much pollution in the
> output...

We've already started to steer away from users setting these values
and just have them get set/updated/unset by sparse-checkout init and
sparse-checkout disable.  Since users won't be setting these directly,
I don't think deprecation warnings make sense.

> Finally, about restrictCommands, the idea is to have both
> sparsecheckout.restrictCommands and `git --restrict-to-sparse-paths`,
> right? For now, the option/setting would only affect grep, but support
> would be added gradually to other commands in the future. I noticed

There should be both a config option and a global command line flag,
yes.  We might need the flag to default to
not-restricting-to-sparse-paths for now because that's consistent with
the only thing the current implementation of these commands can do.
But I'm really worried that this will remain the default and we'll
force users in the future to jump through a bunch of hoops to do a
simple thing:

$ git clone --sparse-paths $WANTED_DIRECTORIES user@server.name:path/to/repo.git
$ cd repo
<Enjoy their small view of the repo without every command suddenly
requiring a network connection and downloading huge reams of data they
don't even care about.>

> git-read-tree already has a --no-sparse-checkout option. Should we
> remove this option in favor of the global
> --[no-]restrict-to-sparse-paths?

read-tree is plumbing; we can't break backward compatibility.  We'll
have to leave that option there and just document that the two options
do the same thing.

> Sorry for too many questions. I just wanted to make sure that I
> understood the plan before diving into the implementation, to avoid
> going in the wrong direction.

Nah, these are all good questions.  Sorry for the delay in getting back to you.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-31 20:02           ` Derrick Stolee
  2020-04-27 17:15             ` Matheus Tavares Bernardino
@ 2020-04-29 17:21             ` Elijah Newren
  1 sibling, 0 replies; 120+ messages in thread
From: Elijah Newren @ 2020-04-29 17:21 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Matheus Tavares Bernardino, Junio C Hamano, Git Mailing List,
	Derrick Stolee, Nguyễn Thái Ngọc, Jonathan Tan

Sorry for the super late reply...

On Tue, Mar 31, 2020 at 1:02 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 3/31/2020 3:12 PM, Elijah Newren wrote:
> > // adding Jonathan Tan to cc based on the fact that we keep bringing
> > up partial clones and how it relates...
> >
> > On Sun, Mar 29, 2020 at 8:23 PM Matheus Tavares Bernardino
> > <matheus.bernardino@usp.br> wrote:
> >>
> >> On Tue, Mar 24, 2020 at 3:30 PM Junio C Hamano <gitster@pobox.com> wrote:
> >>>
> >>> Elijah Newren <newren@gmail.com> writes:
> >>>
> >>>> On Mon, Mar 23, 2020 at 11:13 PM Matheus Tavares
> >>>> <matheus.bernardino@usp.br> wrote:
> >>>>>
> >>>>> In the last commit, git-grep learned to honor sparsity patterns. For
> >>>>> some use cases, however, it may be desirable to search outside the
> >>>>> sparse checkout. So add the '--ignore-sparsity' option, which restores
> >>>>> the old behavior. Also add the grep.ignoreSparsity configuration, to
> >>>>> allow setting this behavior by default.
> >>>>
> >>>> Should `--ignore-sparsity` be a global git option rather than a
> >>>> grep-specific one?  Also, should grep.ignoreSparsity rather be
> >>>> core.ignoreSparsity or core.searchOutsideSparsePaths or something?
> >>>
> >>> Great question.  I think "git diff" with various options would also
> >>> want to optionally be able to be confined within the sparse cone, or
> >>> checking the entire world by lazily fetching outside the sparsity.
> >> [...]
> >>> Regardless of the choice of the default, it would be a good
> >>> idea to make the subcommands consistently offer the same default and
> >>> allow the non-default views with the same UI.
> >>
> >> Yeah, it seems like a sensible path. Regarding implementation, there
> >> is the question that Elijah raised, of whether to use a global git
> >> option or separate but consistent options for each subcommand. I don't
> >> have much experience with sparse checkout to argue for one or
> >> another, so I would like to hear what others have to say about it.
> >>
> >> A question that comes to my mind regarding the global git option is:
> >> will --ignore-sparsity (or whichever name we choose for it [1]) be
> >> sufficient for all subcommands? Or may some of them require additional
> >> options for command-specific behaviors concerning sparsity patterns?
> >> Also, would it be OK if we just ignored the option in commands that do
> >> not operate differently in sparse checkouts (maybe, fetch, branch and
> >> send-email, for example)? And would it make sense to allow
> >> constructions such as `git --ignore-sparsity checkout` or even `git
> >> --ignore-sparsity sparse-checkout ...`?
> >
> > I think the same option would probably be sufficient for all
> > subcommands, though I have a minor question about the merge machinery
> > (below).  And generally, I think it would be unusual for people to
> > pass the command line flag; I suspect most would set a config option
> > for most cases and then only occasionally override it on the command
> > line.  Since that config option would always be set, I'd expect
> > commands that are unaffected to just ignore it (much like both "git -c
> > merge.detectRenames=true fetch" and "git --work-tree=othertree fetch"
> > will both ignore the irrelevant options rather than trying to detect
> > that they were specified and error out).
> >
> >> [1]: Does anyone have suggestions for the option/config name? The best
> >> I could come up with so far (without being too verbose) is
> >> --no-sparsity-constraints. But I fear this might sound generic. As
> >> Elijah already mentioned, --ignore-sparsity is not good either, as it
> >> introduces double negatives in code...
> >
> > Does verbosity matter that much?  I think people would set it in
> > config, and tab completion would make it pretty easy to complete in
> > any event.
> >
> > Anyway, maybe it will help if I provide a very rough first draft of
> > what changes we could introduce to Documentation/config/core.txt, and
> > then ask a bunch of my own questions about it below:
> >
> > """
> > core.restrictToSparsePaths::
> >         Only meaningful in conjunction with core.sparseCheckoutCone.
> >         This option extends sparse checkouts (which limit which paths
> >         are written to the worktree), so that output and operations
> >         are also limited to the sparsity paths where possible and
> >         implemented.  The purpose of this option is to (1) focus
> >         output for the user on the portion of the repository that is
> >         of interest to them, and (2) enable potentially dramatic
> >         performance improvements, especially in conjunction with
> >         partial clones.
> > +
> > When this option is true, git commands such as log, diff, and grep may
> > limit their output to the directories specified by the sparse cone, or
> > to the intersection of those paths and any pathspecs (like `*.c`) that the user
> > might also specify on the command line.  (Note that this limit for
> > diff and grep only becomes relevant with --cached or when specifying a
> > REVISION, since a search of the working tree will automatically be
> > limited to the sparse paths that are present.)  Also, commands like
> > bisect may only select commits which modify paths within the sparsity
> > cone.  The merge machinery may use the sparse paths as a heuristic to
> > avoid trying to detect renames from within the sparsity cone to
> > outside the sparsity cone when at least one side of history only
> > touches paths within the sparsity cone (this can make the merge
> > machinery faster, but may risk modify/delete conflicts since upstream
> > can rename a file within the sparsity paths to a location outside
> > them).  Commands which export, integrity check, or create history will
> > always operate on full trees (e.g. fast-export, format-patch, fsck,
> > commit, etc.), unaffected by any sparsity patterns.
> > """
> >
> > Several questions here, of course:
> >
> >   * do people like or hate the name?  indifferent?  have alternate ideas?
>
> It's probably time to create a 'sparse-checkout' config space. That
> would allow
>
>         sparse-checkout.restrictGrep = true
>
> as an option. Or a more general
>
>         sparse-checkout.restrictCommands = true
>
> to make it clear that it affects multiple commands.

As I mentioned to Matheus, would a "sparse" config space be nicer?

> >   * should we restrict this to core.sparseCheckoutCone as I suggested
> > above or also allow people to do it with core.sparseCheckout without
> > the cone mode?  I think attempting to weld partial clones together
> > with core.sparseCheckout is crazy, so I'm tempted to just make it be
> > specific to cone mode and to push people to use it.  But I'm
> > interested in thoughts on the matter.
>
> Personally, I prefer cone mode and think it covers 99% of cases.
> However, there are some who are using a big directory full of large
> binaries and relying on file-prefix matches to get only the big
> binaries they need. Until they restructure their repositories to
> take advantage of cone mode, we should be considerate of the full
> sparse-checkout specification when possible.

I agree with everything you say here except the last word; if you
replaced "possible" with "practical" then I'd agree.  In particular, I
like the idea of a partial clone that defaults to grabbing all the
blobs in the sparse path specification; I think it'd be reasonable to
transfer the sparseCone specification to the server and have it use
that to walk history and make a packfile.  Transferring a sparse
specification that does not match the cone mode requirements to a
server and making it use that as it walks over all of history sounds
like a good way to overload the server.
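
The cost difference can be illustrated with a rough sketch. It is not
Git's implementation; it only assumes the documented cone-mode behaviour
(top-level files are always included, listed directories are included
recursively, and files sitting directly in an ancestor of a listed
directory are included too). The name path_in_cone() and the flat array of
directories are made up for this example. Under those assumptions,
membership boils down to a few prefix comparisons per path, whereas
arbitrary patterns need general pattern matching for every path visited:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of Git: returns 1 if `path` would be
 * included by a cone-mode sparse checkout whose recursive directories
 * are listed in `cone_dirs` (no leading or trailing slashes).
 */
int path_in_cone(const char *path, const char *const *cone_dirs, size_t nr)
{
	const char *slash;
	size_t parent_len, i;

	assert(path && (cone_dirs || !nr));

	slash = strrchr(path, '/');
	if (!slash)
		return 1;	/* top-level files are always included */
	parent_len = (size_t)(slash - path);

	for (i = 0; i < nr; i++) {
		const char *dir = cone_dirs[i];
		size_t len = strlen(dir);

		/* Recursive match: path lives somewhere below dir/. */
		if (!strncmp(path, dir, len) && path[len] == '/')
			return 1;

		/* Parent match: path is a file directly inside an ancestor of dir. */
		if (parent_len < len && !strncmp(dir, path, parent_len) &&
		    dir[parent_len] == '/')
			return 1;
	}
	return 0;
}
```

With a single cone directory "src/lib", this accepts "README",
"src/Makefile" and "src/lib/deep/b.c", and rejects "src/other/x.c".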

> >   * should worktrees be affected?  (I've been an advocate of new
> > worktrees inheriting the sparse patterns of the worktree in use at the
> > time the new worktree was created.  Junio once suggested he didn't
> > like that and that worktrees should start out dense.  That seems
> > problematic to me in big repos with partial clones and sparse checkouts
> > in use.  Perhaps dense new worktrees is the behavior you get when
> > core.restrictToSparsePaths is false?)
>
> We should probably consider a `--sparse` option for `git worktree add`
> so we can allow interested users to add worktrees that initialize to
> a sparse-checkout. Optionally create a config option that would copy
> the sparse-checkout file from the current repo to the worktree.

Okay, but if someone runs a future

$ git clone --sparse $RELEVANT_DIRECTORIES user@server.name:path/to/repo.git
$ cd repo
<Blissfully work in their smaller repo without commands suddenly
downloading reams of unwanted data>

should the clone command automatically set this option for the user?
I don't like the idea of users having to remember to set this option
(and the restrictToSparsePaths option, and whatever other options are
needed to work in their smaller environment).  I'd really like there
to be a single flag, in the form of some clone option, that sets all
of this up.

> >   * does my idea for the merge machinery make folks uncomfortable?
> > Should that be a different option?  Being able to do trivial *tree*
> > merges for the huge portion of the tree outside the sparsity paths
> > would be a huge win, especially with partial clones, but it certainly
> > is different.  Then again, Microsoft has disabled rename detection
> > entirely based on it being too expensive, so perhaps the idea of
> > rename-detection-within-your-cone-if-you-really-didn't-modify-anything-outside-the-cone-on-your-side-of-history
> > is a reasonable middle ground between off and on for rename detection.
>
> The part where you say "when at least one side of history only
> touches paths within the sparsity cone" makes me want to entertain
> the idea if it can be done cleanly.

Yeah, I still have to dig in and verify that this really works.

> I'm more concerned about the "git bisect" logic being restricted to
> the cone, since that is such an open-ended command for what is
> considered "good" or "bad".

If the sparse checkout has sufficient information for them to build
and test whatever predicate they are interested in, then surely
bisecting in a way that restricts to the cone would be a nice
optimization, right?  And if the cone doesn't have enough information
for them to build and test commits, then they would need to leave the
sparse checkout in order to bisect anyway.
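
As a minimal sketch of that optimization (similar in spirit to what
passing pathspecs to `git bisect start` already does), the candidate set
can be narrowed to commits that change something below the cone. The
struct and function names here are hypothetical, and a single cone
directory is assumed to keep the example short:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a commit: an id plus the paths it changed. */
struct commit_sketch {
	const char *id;
	const char *const *changed;
	size_t nr_changed;
};

/* Returns 1 if `path` is located below the directory `dir`. */
int path_under_dir(const char *path, const char *dir)
{
	size_t len = strlen(dir);

	return !strncmp(path, dir, len) && path[len] == '/';
}

/*
 * Stores in `out` pointers to the commits that change at least one path
 * below `cone_dir`, and returns how many were kept. `out` must have room
 * for `nr` entries.
 */
size_t bisect_candidates(const struct commit_sketch *commits, size_t nr,
			 const char *cone_dir,
			 const struct commit_sketch **out)
{
	size_t i, j, kept = 0;

	assert(commits && cone_dir && out);

	for (i = 0; i < nr; i++) {
		for (j = 0; j < commits[i].nr_changed; j++) {
			if (path_under_dir(commits[i].changed[j], cone_dir)) {
				out[kept++] = &commits[i];
				break;
			}
		}
	}
	return kept;
}
```

Commits that only touch paths outside the cone never become test
candidates, which is where the savings would come from.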

> >   * what should the default be?  Junio suggested elsewhere[1] that
> > sparse-checkouts and partial clones should probably be welded together
> > (with partial clones downloading just history in the sparsity paths by
> > default), in which case having this option be true would be useful.
>
> My opinion on this is as follows: filtering blobs based on sparse-
> checkout patterns does not filter enough, and filtering trees based
> on sparse-checkout patterns filters too much. The costs are just
> flipped: having extra trees is not a huge problem but recovering from
> a "tree miss" is problematic. Having extra blobs is painful, but
> recovering from a "blob miss" is not a big deal.

Sounds like --filter=blob:none already solves your issues.  It doesn't
make me happy; I really want the history within the sparse cone to be
downloaded as part of the initial clone.  (I can see various ways that
downloading all trees would be easier, so if we end up downloading all
commits and all trees and just the blobs within the sparse cone, that
sounds fine to me.)

> > But it may also be slightly weird because it'll probably take us a
> > while to implement this; while the big warning in
> > git-sparse-checkout.txt certainly allows this:
> >         THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR, AND THE BEHAVIOR OF OTHER
> >         COMMANDS IN THE PRESENCE OF SPARSE-CHECKOUTS, WILL LIKELY CHANGE IN
> >         THE FUTURE.
> > It may still be slightly weird that the default behavior of commands
> > in the presence of sparse-checkouts changes release to release until
> > we get it all implemented.
>
> I appreciate that we put that warning at the top. We will be
> able to do more experimental things with the feature because
> of it. The idea I'm toying with is to have "git clone --sparse"
> set core.sparseCheckoutCone = true.

Sounds good to me.  We might also want to set worktrees.copySparsity
and sparse.restrictToCone (or whatever these end up being named) as
well.

> Also, if we are creating the "sparse-checkout.*" config space,
> we should "rename" core.sparseCheckoutCone to sparse-checkout.coneMode
> or something. We would need to support both for a while, for sure.

And, if we automatically migrate the setting and delete the old one,
do we prevent someone from successfully using an older git version
with the repo?  Or, if we don't automatically unset the old one, do we
risk the two values getting out of sync if they do switch to an older
git version?
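
One possible answer, sketched below purely as an assumption (none of this
exists in Git), is to keep both keys, let the new one win, and report a
disagreement instead of hiding it. The enum and the function
resolve_cone_mode() are invented for illustration; each pointer argument
is NULL when the corresponding key is absent from the config:

```c
#include <assert.h>
#include <stddef.h>

enum cone_source {
	CONE_UNSET,	/* neither key is present */
	CONE_FROM_OLD,	/* only the old key (e.g. core.sparseCheckoutCone) */
	CONE_FROM_NEW,	/* the new key is present and nothing contradicts it */
	CONE_CONFLICT	/* both keys are present and they disagree */
};

/*
 * Hypothetical resolution rule for a renamed boolean setting. The
 * effective value is written to *value; the return code says where it
 * came from so the caller can warn on CONE_CONFLICT.
 */
enum cone_source resolve_cone_mode(const int *old_key, const int *new_key,
				   int *value)
{
	assert(value);

	if (!old_key && !new_key) {
		*value = 0;
		return CONE_UNSET;
	}
	if (!new_key) {
		*value = !!*old_key;
		return CONE_FROM_OLD;
	}
	*value = !!*new_key;
	if (old_key && !!*old_key != !!*new_key)
		return CONE_CONFLICT;
	return CONE_FROM_NEW;
}
```

Leaving the old key in place keeps older Git versions working, at the
price of the out-of-sync case that CONE_CONFLICT makes visible.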


* [RFC PATCH v2 0/4] grep: honor sparse checkout and add option to ignore it
  2020-03-24  6:04 [RFC PATCH 0/3] grep: honor sparse checkout and add option to ignore it Matheus Tavares
                   ` (2 preceding siblings ...)
  2020-03-24  6:13 ` [RFC PATCH 3/3] grep: add option to ignore sparsity patterns Matheus Tavares
@ 2020-05-10  0:41 ` Matheus Tavares
  2020-05-10  0:41   ` [RFC PATCH v2 1/4] doc: grep: unify info on configuration variables Matheus Tavares
                     ` (4 more replies)
  3 siblings, 5 replies; 120+ messages in thread
From: Matheus Tavares @ 2020-05-10  0:41 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, newren, jonathantanmy

This series is based on the discussions in [1]. The idea is to make
git-grep (and other commands, in the future) be able to restrict their
output to the sparsity patterns, when requested by the user.

Main changes since v1:

In patch 1:
- Remove two unnecessary references in git-grep.txt, as they are in the
  same document.

Added patch 2.

In patch 3:
- Match paths directly against the sparsity patterns, when grepping a
  given tree, instead of checking the index.
- Better handle searches with --recurse-submodules when the superproject
  and/or the submodule have sparse checkout enabled. And add tests for
  cases like these.
- In tests, use the builtin git-sparse-checkout instead of manually
  writing to the sparse-checkout file.
- Add tests for grepping in cone mode.
- Rename the previous 'from_commit' parameter and a test name, to be
  more meaningful.

  Note: it was suggested to change some of the tests in this patch to
  use cone mode. I ended up using both cone mode and full patterns, so
  that we could check that grep behaves correctly when submodules have
  different pattern rules than the superproject. I tried to leave the
  testing repo's structure simple, though, so that the tests remain easy
  to read.
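
  The per-entry decision made while walking a tree in patch 3 can be
  modelled roughly as below. This is a simplified sketch of the logic in
  in_sparse_checkout(), not the patch itself: the result of matching the
  entry's path against the patterns is passed in as `own` instead of being
  computed, and the enum and function names are local to the example.

```c
#include <assert.h>

/* Same values as the patch uses, redefined so the sketch stands alone. */
enum match_sketch {
	SK_UNDECIDED = -1,
	SK_NOT_MATCHED = 0,
	SK_MATCHED = 1,
	SK_MATCHED_RECURSIVE = 2
};

enum entry_kind { ENTRY_FILE, ENTRY_DIR, ENTRY_GITLINK };

/*
 * `own` is what matching the entry's path against the sparsity patterns
 * produced; `parent` is the result recorded for the enclosing directory.
 * Returns 1 if the entry should be visited, 0 if it can be skipped.
 * *result is what the entry's children should inherit.
 */
int entry_in_sparse_checkout(enum entry_kind kind, int cone_mode,
			     enum match_sketch own, enum match_sketch parent,
			     enum match_sketch *result)
{
	assert(result);

	*result = parent;
	if (kind == ENTRY_GITLINK)
		return 1;	/* gitlinks are never filtered at this level */
	if (parent == SK_MATCHED_RECURSIVE)
		return 1;	/* everything below a recursive match is in */

	/* An undecided entry inherits whatever its parent got. */
	if (own != SK_UNDECIDED)
		*result = own;

	/*
	 * Files that do not match are skipped. Directories are only pruned
	 * in cone mode; with full patterns, an unmatched directory is still
	 * descended into so the entries below it get matched individually.
	 */
	if (*result == SK_NOT_MATCHED &&
	    (kind == ENTRY_FILE || (kind == ENTRY_DIR && cone_mode)))
		return 0;
	return 1;
}
```

  For example, an unmatched directory is pruned when cone_mode is 1 but
  still visited when it is 0.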

In patch 4:
- Move the configuration that restricts cmds' behavior based on the
  sparse checkout from the 'grep' namespace to 'sparse', as the idea is
  to have the same setting affecting multiple cmds.
- Add the --[no-]restrict-to-sparse-paths global option.
- Add more tests for the setting and CLI option in grep.
- Add tests to ensure the submodules' values for the setting are
  respected when running grep with --recurse-submodules.

  Note: in this patch, I used the 'sparse' namespace, instead of 'core',
  following the idea we discussed in [2], to have the sparse checkout
  settings in their own namespace. We also talked about moving
  core.sparseCheckout and core.sparseCheckoutCone to the new
  namespace.  I tried implementing this change in this same patchset
  (although, on second thought, it is probably better to do it in
  another one), but I still haven't managed to come up with a rename
  implementation that keeps good compatibility. The problems are the
  ones Elijah listed in [3]. So, for now, sparse.restrictCmds is the
  only setting in the 'sparse' namespace. But it won't be the only one
  for too long, as Stolee is already implementing other ones [4].

[1]: https://lore.kernel.org/git/CAHd-oW7e5qCuxZLBeVDq+Th3E+E4+P8=WzJfK8WcG2yz=n_nag@mail.gmail.com/t/#u
[2]: https://lore.kernel.org/git/49c1e9a5-b234-1696-03cc-95bf95f4663c@gmail.com/
[3]: https://lore.kernel.org/git/CABPp-BGytfCugK0S99nLPH4_VXmcYPHWdVyLO59BZc4__4CT9w@mail.gmail.com/
[4]: https://lore.kernel.org/git/2188577cd848d7cee77f06f1ad2b181864e5e36d.1588857462.git.gitgitgadget@gmail.com/

Matheus Tavares (4):
  doc: grep: unify info on configuration variables
  config: load the correct config.worktree file
  grep: honor sparse checkout patterns
  config: add setting to ignore sparsity patterns in some cmds

 Documentation/config.txt               |   2 +
 Documentation/config/grep.txt          |  10 +-
 Documentation/config/sparse.txt        |  22 +++
 Documentation/git-grep.txt             |  37 +----
 Documentation/git.txt                  |   4 +
 Makefile                               |   1 +
 builtin/grep.c                         | 137 +++++++++++++++-
 config.c                               |   5 +-
 contrib/completion/git-completion.bash |   2 +
 git.c                                  |   6 +
 sparse-checkout.c                      |  16 ++
 sparse-checkout.h                      |  11 ++
 t/t7011-skip-worktree-reading.sh       |   9 --
 t/t7817-grep-sparse-checkout.sh        | 216 +++++++++++++++++++++++++
 t/t9902-completion.sh                  |   4 +-
 15 files changed, 431 insertions(+), 51 deletions(-)
 create mode 100644 Documentation/config/sparse.txt
 create mode 100644 sparse-checkout.c
 create mode 100644 sparse-checkout.h
 create mode 100755 t/t7817-grep-sparse-checkout.sh

Range-diff against v1:
1:  7ba5caf10d ! 1:  c344d22313 doc: grep: unify info on configuration variables
    @@ Commit message
     
         Explanations about the configuration variables for git-grep are
         duplicated in "Documentation/git-grep.txt" and
    -    "Documentation/config/grep.txt". Let's unify the information in the
    -    second file and include it in the first.
    +    "Documentation/config/grep.txt", which can make maintenance difficult.
    +    The first also contains a definition not present in the latter
    +    (grep.fullName). To avoid problems like this, let's unify the
    +    information in the second file and include it in the first.
     
         Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
     
    @@ Documentation/config/grep.txt: grep.extendedRegexp::
      grep.threads::
     -	Number of grep worker threads to use.
     -	See `grep.threads` in linkgit:git-grep[1] for more information.
    -+	Number of grep worker threads to use. See `--threads` in
    -+	linkgit:git-grep[1] for more information.
    ++	Number of grep worker threads to use. See `--threads`
    ++ifndef::git-grep[]
    ++	in linkgit:git-grep[1]
    ++endif::git-grep[]
    ++	for more information.
     +
     +grep.fullName::
     +	If set to true, enable `--full-name` option by default.
    @@ Documentation/git-grep.txt: characters.  An empty string as search expression ma
     -	If set to true, fall back to git grep --no-index if git grep
     -	is executed outside of a git repository.  Defaults to false.
     -
    ++:git-grep: 1
     +include::config/grep.txt[]
      
      OPTIONS
    @@ Documentation/git-grep.txt: providing this option will cause it to die.
     +	Number of grep worker threads to use. If not provided (or set to
     +	0), Git will use as many worker threads as the number of logical
     +	cores available. The default value can also be set with the
    -+	`grep.threads` configuration (see linkgit:git-config[1]).
    ++	`grep.threads` configuration.
      
      -f <file>::
      	Read patterns from <file>, one per line.
-:  ---------- > 2:  882310b69f config: load the correct config.worktree file
2:  0b9b4c4b41 ! 3:  e00674c727 grep: honor sparse checkout patterns
    @@ Commit message
         One of the main uses for a sparse checkout is to allow users to focus on
         the subset of files in a repository in which they are interested. But
         git-grep currently ignores the sparsity patterns and report all matches
    -    found outside this subset, which kind of goes in the oposity direction.
    +    found outside this subset, which kind of goes in the opposite direction.
         Let's fix that, making it honor the sparsity boundaries for every
         grepping case:
     
    @@ builtin/grep.c: static int grep_cache(struct grep_opt *opt,
      static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
      		     struct tree_desc *tree, struct strbuf *base, int tn_len,
     -		     int check_attr);
    -+		     int from_commit);
    ++		     int is_root_tree);
      
      static int grep_submodule(struct grep_opt *opt,
      			  const struct pathspec *pathspec,
    @@ builtin/grep.c: static int grep_cache(struct grep_opt *opt,
      	for (nr = 0; nr < repo->index->cache_nr; nr++) {
      		const struct cache_entry *ce = repo->index->cache[nr];
     +
    -+		if (ce_skip_worktree(ce))
    ++		if (ce_skip_worktree(ce) && !S_ISGITLINK(ce->ce_mode))
     +			continue;
     +
      		strbuf_setlen(&name, name_base_len);
    @@ builtin/grep.c: static int grep_cache(struct grep_opt *opt,
      					continue;
      				hit |= grep_oid(opt, &ce->oid, name.buf,
     @@ builtin/grep.c: static int grep_cache(struct grep_opt *opt,
    + 	return hit;
    + }
      
    - static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
    - 		     struct tree_desc *tree, struct strbuf *base, int tn_len,
    +-static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
    +-		     struct tree_desc *tree, struct strbuf *base, int tn_len,
     -		     int check_attr)
    -+		     int from_commit)
    ++static struct pattern_list *get_sparsity_patterns(struct repository *repo)
    ++{
    ++	struct pattern_list *patterns;
    ++	char *sparse_file;
    ++	int sparse_config, cone_config;
    ++
    ++	if (repo_config_get_bool(repo, "core.sparsecheckout", &sparse_config) ||
    ++	    !sparse_config) {
    ++		return NULL;
    ++	}
    ++
    ++	sparse_file = repo_git_path(repo, "info/sparse-checkout");
    ++	patterns = xcalloc(1, sizeof(*patterns));
    ++
    ++	if (repo_config_get_bool(repo, "core.sparsecheckoutcone", &cone_config))
    ++		cone_config = 0;
    ++	patterns->use_cone_patterns = cone_config;
    ++
    ++	if (add_patterns_from_file_to_list(sparse_file, "", 0, patterns, NULL)) {
    ++		if (file_exists(sparse_file)) {
    ++			warning(_("failed to load sparse-checkout file: '%s'"),
    ++				sparse_file);
    ++		}
    ++		free(sparse_file);
    ++		free(patterns);
    ++		return NULL;
    ++	}
    ++
    ++	free(sparse_file);
    ++	return patterns;
    ++}
    ++
    ++static int in_sparse_checkout(struct strbuf *path, int prefix_len,
    ++			      unsigned int entry_mode,
    ++			      struct index_state *istate,
    ++			      struct pattern_list *sparsity,
    ++			      enum pattern_match_result parent_match,
    ++			      enum pattern_match_result *match)
    ++{
    ++	int dtype = DT_UNKNOWN;
    ++
    ++	if (S_ISGITLINK(entry_mode))
    ++		return 1;
    ++
    ++	if (parent_match == MATCHED_RECURSIVE) {
    ++		*match = parent_match;
    ++		return 1;
    ++	}
    ++
    ++	if (S_ISDIR(entry_mode) && !is_dir_sep(path->buf[path->len - 1]))
    ++		strbuf_addch(path, '/');
    ++
    ++	*match = path_matches_pattern_list(path->buf, path->len,
    ++					   path->buf + prefix_len, &dtype,
    ++					   sparsity, istate);
    ++	if (*match == UNDECIDED)
    ++		*match = parent_match;
    ++
    ++	if (S_ISDIR(entry_mode))
    ++		strbuf_trim_trailing_dir_sep(path);
    ++
    ++	if (*match == NOT_MATCHED && (S_ISREG(entry_mode) ||
    ++	    (S_ISDIR(entry_mode) && sparsity->use_cone_patterns)))
    ++		return 0;
    ++
    ++	return 1;
    ++}
    ++
    ++static int do_grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
    ++			struct tree_desc *tree, struct strbuf *base, int tn_len,
    ++			int check_attr, struct pattern_list *sparsity,
    ++			enum pattern_match_result default_sparsity_match)
      {
      	struct repository *repo = opt->repo;
      	int hit = 0;
     @@ builtin/grep.c: static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
    - 		name_base_len = name.len;
    - 	}
      
    -+	if (from_commit && repo_read_index(repo) < 0)
    -+		die(_("index file corrupt"));
    -+
      	while (tree_entry(tree, &entry)) {
      		int te_len = tree_entry_len(&entry);
    ++		enum pattern_match_result sparsity_match = 0;
      
    + 		if (match != all_entries_interesting) {
    + 			strbuf_addstr(&name, base->buf + tn_len);
     @@ builtin/grep.c: static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
      
      		strbuf_add(base, entry.path, te_len);
      
    -+		if (from_commit) {
    -+			int pos = index_name_pos(repo->index,
    -+						 base->buf + tn_len,
    -+						 base->len - tn_len);
    -+			if (pos >= 0 &&
    -+			    ce_skip_worktree(repo->index->cache[pos])) {
    ++		if (sparsity) {
    ++			struct strbuf path = STRBUF_INIT;
    ++			strbuf_addstr(&path, base->buf + tn_len);
    ++
    ++			if (!in_sparse_checkout(&path, old_baselen - tn_len,
    ++						entry.mode, repo->index,
    ++						sparsity, default_sparsity_match,
    ++						&sparsity_match)) {
     +				strbuf_setlen(base, old_baselen);
     +				continue;
     +			}
    @@ builtin/grep.c: static int grep_tree(struct grep_opt *opt, const struct pathspec
     +
      		if (S_ISREG(entry.mode)) {
      			hit |= grep_oid(opt, &entry.oid, base->buf, tn_len,
    --					 check_attr ? base->buf + tn_len : NULL);
    -+					from_commit ? base->buf + tn_len : NULL);
    - 		} else if (S_ISDIR(entry.mode)) {
    - 			enum object_type type;
    - 			struct tree_desc sub;
    + 					 check_attr ? base->buf + tn_len : NULL);
     @@ builtin/grep.c: static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
    + 
      			strbuf_addch(base, '/');
      			init_tree_desc(&sub, data, size);
    - 			hit |= grep_tree(opt, pathspec, &sub, base, tn_len,
    +-			hit |= grep_tree(opt, pathspec, &sub, base, tn_len,
     -					 check_attr);
    -+					 from_commit);
    ++			hit |= do_grep_tree(opt, pathspec, &sub, base, tn_len,
    ++					    check_attr, sparsity, sparsity_match);
      			free(data);
      		} else if (recurse_submodules && S_ISGITLINK(entry.mode)) {
      			hit |= grep_submodule(opt, pathspec, &entry.oid,
    +@@ builtin/grep.c: static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
    + 	return hit;
    + }
    + 
    ++/*
    ++ * Note: sparsity patterns and paths' attributes will only be considered if
    ++ * is_root_tree has true value. (Otherwise, we cannot properly perform pattern
    ++ * matching on paths.)
    ++ */
    ++static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
    ++		     struct tree_desc *tree, struct strbuf *base, int tn_len,
    ++		     int is_root_tree)
    ++{
    ++	struct pattern_list *patterns = NULL;
    ++	int ret;
    ++
    ++	if (is_root_tree)
    ++		patterns = get_sparsity_patterns(opt->repo);
    ++
    ++	ret = do_grep_tree(opt, pathspec, tree, base, tn_len, is_root_tree,
    ++			   patterns, 0);
    ++
    ++	if (patterns) {
    ++		clear_pattern_list(patterns);
    ++		free(patterns);
    ++	}
    ++	return ret;
    ++}
    ++
    + static int grep_object(struct grep_opt *opt, const struct pathspec *pathspec,
    + 		       struct object *obj, const char *name, const char *path)
    + {
     
      ## t/t7011-skip-worktree-reading.sh ##
     @@ t/t7011-skip-worktree-reading.sh: test_expect_success 'ls-files --modified' '
    @@ t/t7817-grep-sparse-checkout.sh (new)
     +
     +test_description='grep in sparse checkout
     +
    -+This test creates the following dir structure:
    ++This test creates a repo with the following structure:
    ++
     +.
    -+| - a
    -+| - b
    -+| - dir
    -+    | - c
    ++|-- a
    ++|-- b
    ++|-- dir
    ++|   `-- c
    ++`-- sub
    ++    |-- A
    ++    |   `-- a
    ++    `-- B
    ++	`-- b
     +
    -+Only "a" should be present due to the sparse checkout patterns:
    -+"/*", "!/b" and "!/dir".
    ++Where . has non-cone mode sparsity patterns and sub is a submodule with cone
    ++mode sparsity patterns. The resulting sparse-checkout should leave the following
    ++structure:
    ++
    ++.
    ++|-- a
    ++`-- sub
    ++    `-- B
    ++	`-- b
     +'
     +
     +. ./test-lib.sh
    @@ t/t7817-grep-sparse-checkout.sh (new)
     +	echo "text" >b &&
     +	mkdir dir &&
     +	echo "text" >dir/c &&
    ++
    ++	git init sub &&
    ++	(
    ++		cd sub &&
    ++		mkdir A B &&
    ++		echo "text" >A/a &&
    ++		echo "text" >B/b &&
    ++		git add A B &&
    ++		git commit -m sub &&
    ++		git sparse-checkout init --cone &&
    ++		git sparse-checkout set B
    ++	) &&
    ++
    ++	git submodule add ./sub &&
     +	git add a b dir &&
    -+	git commit -m "initial commit" &&
    ++	git commit -m super &&
    ++	git sparse-checkout init --no-cone &&
    ++	git sparse-checkout set "/*" "!b" "!/*/" &&
    ++
     +	git tag -am t-commit t-commit HEAD &&
     +	tree=$(git rev-parse HEAD^{tree}) &&
     +	git tag -am t-tree t-tree $tree &&
    -+	cat >.git/info/sparse-checkout <<-EOF &&
    -+	/*
    -+	!/b
    -+	!/dir
    -+	EOF
    -+	git sparse-checkout init &&
    ++
     +	test_path_is_missing b &&
     +	test_path_is_missing dir &&
    -+	test_path_is_file a
    ++	test_path_is_missing sub/A &&
    ++	test_path_is_file a &&
    ++	test_path_is_file sub/B/b
     +'
     +
     +test_expect_success 'grep in working tree should honor sparse checkout' '
    @@ t/t7817-grep-sparse-checkout.sh (new)
     +	test_cmp expect_t-commit actual_t-commit
     +'
     +
    -+test_expect_success 'grep <tree-ish> should search outside sparse checkout' '
    ++test_expect_success 'grep <tree-ish> should ignore sparsity patterns' '
     +	commit=$(git rev-parse HEAD) &&
     +	tree=$(git rev-parse HEAD^{tree}) &&
     +	cat >expect_tree <<-EOF &&
    @@ t/t7817-grep-sparse-checkout.sh (new)
     +	test_cmp expect_t-tree actual_t-tree
     +'
     +
    ++test_expect_success 'grep --recurse-submodules --cached should honor sparse checkout in submodule' '
    ++	cat >expect <<-EOF &&
    ++	a:text
    ++	sub/B/b:text
    ++	EOF
    ++	git grep --recurse-submodules --cached "text" >actual &&
    ++	test_cmp expect actual
    ++'
    ++
    ++test_expect_success 'grep --recurse-submodules <commit-ish> should honor sparse checkout in submodule' '
    ++	commit=$(git rev-parse HEAD) &&
    ++	cat >expect_commit <<-EOF &&
    ++	$commit:a:text
    ++	$commit:sub/B/b:text
    ++	EOF
    ++	cat >expect_t-commit <<-EOF &&
    ++	t-commit:a:text
    ++	t-commit:sub/B/b:text
    ++	EOF
    ++	git grep --recurse-submodules "text" $commit >actual_commit &&
    ++	test_cmp expect_commit actual_commit &&
    ++	git grep --recurse-submodules "text" t-commit >actual_t-commit &&
    ++	test_cmp expect_t-commit actual_t-commit
    ++'
    ++
     +test_done
3:  a76242ecfa < -:  ---------- grep: add option to ignore sparsity patterns
-:  ---------- > 4:  3e9e906249 config: add setting to ignore sparsity patterns in some cmds
-- 
2.26.2



* [RFC PATCH v2 1/4] doc: grep: unify info on configuration variables
  2020-05-10  0:41 ` [RFC PATCH v2 0/4] grep: honor sparse checkout and add option to ignore it Matheus Tavares
@ 2020-05-10  0:41   ` Matheus Tavares
  2020-05-10  0:41   ` [RFC PATCH v2 2/4] config: load the correct config.worktree file Matheus Tavares
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 120+ messages in thread
From: Matheus Tavares @ 2020-05-10  0:41 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, newren, jonathantanmy

Explanations about the configuration variables for git-grep are
duplicated in "Documentation/git-grep.txt" and
"Documentation/config/grep.txt", which can make maintenance difficult.
The first also contains a definition not present in the latter
(grep.fullName). To avoid problems like this, let's unify the
information in the second file and include it in the first.

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---
 Documentation/config/grep.txt | 10 ++++++++--
 Documentation/git-grep.txt    | 36 ++++++-----------------------------
 2 files changed, 14 insertions(+), 32 deletions(-)

diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
index 44abe45a7c..dd51db38e1 100644
--- a/Documentation/config/grep.txt
+++ b/Documentation/config/grep.txt
@@ -16,8 +16,14 @@ grep.extendedRegexp::
 	other than 'default'.
 
 grep.threads::
-	Number of grep worker threads to use.
-	See `grep.threads` in linkgit:git-grep[1] for more information.
+	Number of grep worker threads to use. See `--threads`
+ifndef::git-grep[]
+	in linkgit:git-grep[1]
+endif::git-grep[]
+	for more information.
+
+grep.fullName::
+	If set to true, enable `--full-name` option by default.
 
 grep.fallbackToNoIndex::
 	If set to true, fall back to git grep --no-index if git grep
diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
index a7f9bc99ea..9bdf807584 100644
--- a/Documentation/git-grep.txt
+++ b/Documentation/git-grep.txt
@@ -41,34 +41,8 @@ characters.  An empty string as search expression matches all lines.
 CONFIGURATION
 -------------
 
-grep.lineNumber::
-	If set to true, enable `-n` option by default.
-
-grep.column::
-	If set to true, enable the `--column` option by default.
-
-grep.patternType::
-	Set the default matching behavior. Using a value of 'basic', 'extended',
-	'fixed', or 'perl' will enable the `--basic-regexp`, `--extended-regexp`,
-	`--fixed-strings`, or `--perl-regexp` option accordingly, while the
-	value 'default' will return to the default matching behavior.
-
-grep.extendedRegexp::
-	If set to true, enable `--extended-regexp` option by default. This
-	option is ignored when the `grep.patternType` option is set to a value
-	other than 'default'.
-
-grep.threads::
-	Number of grep worker threads to use. If unset (or set to 0), Git will
-	use as many threads as the number of logical cores available.
-
-grep.fullName::
-	If set to true, enable `--full-name` option by default.
-
-grep.fallbackToNoIndex::
-	If set to true, fall back to git grep --no-index if git grep
-	is executed outside of a git repository.  Defaults to false.
-
+:git-grep: 1
+include::config/grep.txt[]
 
 OPTIONS
 -------
@@ -269,8 +243,10 @@ providing this option will cause it to die.
 	found.
 
 --threads <num>::
-	Number of grep worker threads to use.
-	See `grep.threads` in 'CONFIGURATION' for more information.
+	Number of grep worker threads to use. If not provided (or set to
+	0), Git will use as many worker threads as the number of logical
+	cores available. The default value can also be set with the
+	`grep.threads` configuration.
 
 -f <file>::
 	Read patterns from <file>, one per line.
-- 
2.26.2



* [RFC PATCH v2 2/4] config: load the correct config.worktree file
  2020-05-10  0:41 ` [RFC PATCH v2 0/4] grep: honor sparse checkout and add option to ignore it Matheus Tavares
  2020-05-10  0:41   ` [RFC PATCH v2 1/4] doc: grep: unify info on configuration variables Matheus Tavares
@ 2020-05-10  0:41   ` Matheus Tavares
  2020-05-11 19:10     ` Junio C Hamano
  2020-05-10  0:41   ` [RFC PATCH v2 3/4] grep: honor sparse checkout patterns Matheus Tavares
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 120+ messages in thread
From: Matheus Tavares @ 2020-05-10  0:41 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, newren, jonathantanmy

One of the steps in do_git_config_sequence() is to load the
worktree-specific config file. Although the function receives a git_dir
string, it relies on git_pathdup(), which uses the_repository->git_dir,
to make the path to the file. Thus, when a submodule has a worktree
setting, a command executed in the superproject that recurses into the
submodule won't find the said setting. Such a scenario might not be
needed now, but it will be in the following patch. git-grep will learn
to honor sparse checkouts and, when running with --recurse-submodules,
the submodule's sparse checkout settings must be loaded. As these
settings are stored in the config.worktree file, they would be ignored
without this patch.

The fix is simple: replace git_pathdup() with mkpathdup(), to format
the path with the given git_dir. This is the same idea used to make the
config.worktree path in setup.c:check_repository_format_gently().

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---
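
To illustrate the path problem described above, here is a minimal,
self-contained sketch. It is not the config.c code: global_git_dir,
worktree_config_path(), path_before() and path_after() are hypothetical
names, and the ".git/modules/sub" location is only an illustrative
submodule git dir.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the_repository->git_dir (the superproject). */
static const char *global_git_dir = ".git";

/* Build "<dir>/config.worktree" in freshly allocated memory. */
static char *worktree_config_path(const char *dir)
{
	const char *name = "/config.worktree";
	size_t len = strlen(dir) + strlen(name) + 1;
	char *path = malloc(len);

	if (!path)
		abort();
	snprintf(path, len, "%s%s", dir, name);
	return path;
}

/* Before the patch (sketch): the git_dir argument is effectively ignored. */
static char *path_before(const char *git_dir)
{
	(void)git_dir;
	return worktree_config_path(global_git_dir);
}

/* After the patch (sketch): the caller's git_dir decides; NULL means skip. */
static char *path_after(const char *git_dir)
{
	return git_dir ? worktree_config_path(git_dir) : NULL;
}

/* Usage: a superproject process reading a submodule's configuration. */
static void check_examples(void)
{
	char *before = path_before(".git/modules/sub");
	char *after = path_after(".git/modules/sub");

	assert(!strcmp(before, ".git/config.worktree"));
	assert(!strcmp(after, ".git/modules/sub/config.worktree"));
	free(before);
	free(after);
}
```

In the sketch, only path_after() points at the submodule's own file,
which is the behavior the commit message asks for.
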
 config.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/config.c b/config.c
index 8db9c77098..a3d0a0d266 100644
--- a/config.c
+++ b/config.c
@@ -1747,8 +1747,9 @@ static int do_git_config_sequence(const struct config_options *opts,
 		ret += git_config_from_file(fn, repo_config, data);
 
 	current_parsing_scope = CONFIG_SCOPE_WORKTREE;
-	if (!opts->ignore_worktree && repository_format_worktree_config) {
-		char *path = git_pathdup("config.worktree");
+	if (!opts->ignore_worktree && repository_format_worktree_config &&
+	    opts->git_dir) {
+		char *path = mkpathdup("%s/config.worktree", opts->git_dir);
 		if (!access_or_die(path, R_OK, 0))
 			ret += git_config_from_file(fn, path, data);
 		free(path);
-- 
2.26.2



* [RFC PATCH v2 3/4] grep: honor sparse checkout patterns
  2020-05-10  0:41 ` [RFC PATCH v2 0/4] grep: honor sparse checkout and add option to ignore it Matheus Tavares
  2020-05-10  0:41   ` [RFC PATCH v2 1/4] doc: grep: unify info on configuration variables Matheus Tavares
  2020-05-10  0:41   ` [RFC PATCH v2 2/4] config: load the correct config.worktree file Matheus Tavares
@ 2020-05-10  0:41   ` Matheus Tavares
  2020-05-11 19:35     ` Junio C Hamano
  2020-05-10  0:41   ` [RFC PATCH v2 4/4] config: add setting to ignore sparsity patterns in some cmds Matheus Tavares
  2020-05-28  1:12   ` [PATCH v3 0/5] grep: honor sparse checkout and add option to ignore it Matheus Tavares
  4 siblings, 1 reply; 120+ messages in thread
From: Matheus Tavares @ 2020-05-10  0:41 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, newren, jonathantanmy

One of the main uses for a sparse checkout is to allow users to focus on
the subset of files in a repository in which they are interested. But
git-grep currently ignores the sparsity patterns and reports all matches
found outside this subset, which kind of goes in the opposite direction.
Let's fix that, making it honor the sparsity boundaries for every
grepping case:

- git grep in worktree
- git grep --cached
- git grep $REVISION
- git grep --untracked and git grep --no-index (which already respect
  sparse checkout boundaries)

This is also what some users reported[1] they would want as the default
behavior.

Note: for `git grep $REVISION`, we will choose to honor the sparsity
patterns only when $REVISION is a commit-ish object. The reason is that,
for a tree, we don't know whether it represents the root of a
repository or a subtree. So we wouldn't be able to correctly match it
against the sparsity patterns. E.g. suppose we have a repository with
these two sparsity rules: "/*" and "!/a"; and the following structure:

/
| - a (file)
| - d (dir)
    | - a (file)

If `git grep $REVISION` were to honor the sparsity patterns for every
object type, when grepping the /d tree, we would wrongly ignore the /d/a
file. This happens because we wouldn't know it resides in /d and
therefore it would wrongly match the pattern "!/a". Furthermore, for a
search in a blob object, we wouldn't even have a path to check the
patterns against. So, let's ignore the sparsity patterns when grepping
non-commit-ish objects (tags to commits should be fine).

Finally, the old behavior may still be desirable for some use cases. So
the next patch will add an option to allow restoring it when needed.

[1]: https://lore.kernel.org/git/CABPp-BGuFhDwWZBRaD3nA8ui46wor-4=Ha1G1oApsfF8KNpfGQ@mail.gmail.com/

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---

Note: as I mentioned in the cover letter, the tests in this patch now
contain both cone mode and full pattern sparse checkouts. This was done
for two reasons: To test grep's behavior when searching with
--recurse-submodules and having submodules with different pattern sets
than the superproject (which was incorrect in my first implementation).
And to test the direct pattern matching in grep_tree(), using both
modes.
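
The root-tree caveat from the commit message ("/*" and "!/a" with the
files /a and /d/a) can be reproduced with a toy matcher. This is only a
sketch under strong assumptions: it understands just the two anchored
pattern forms used in that example, lets the last matching rule win, and
falls back to the parent directory when nothing matches. All toy_* names
are hypothetical and none of this is the dir.c implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_rule {
	const char *text;	/* top-level wildcard or "/<name>", anchored */
	int negated;		/* 1 if the rule was written with "!" */
};

/* The two rules from the commit message, in file order. */
static const struct toy_rule rules[] = {
	{ "/*", 0 },
	{ "/a", 1 },
};

/* Match one rule against the first len bytes of a root-relative path. */
static int toy_match(const char *pat, const char *path, size_t len)
{
	if (!strcmp(pat, "/*"))
		return !memchr(path, '/', len);	/* any top-level entry */
	return strlen(pat + 1) == len && !strncmp(pat + 1, path, len);
}

/* Return 1 if the path is inside the sparse checkout, 0 otherwise. */
static int toy_in_sparse_checkout(const char *path)
{
	size_t len = strlen(path);
	size_t nr = sizeof(rules) / sizeof(rules[0]);

	while (len > 0) {
		size_t i;

		/* Later rules override earlier ones, so scan backwards. */
		for (i = nr; i-- > 0; )
			if (toy_match(rules[i].text, path, len))
				return !rules[i].negated;

		/* Undecided: drop the last component and ask the parent. */
		while (len > 0 && path[len - 1] != '/')
			len--;
		if (len > 0)
			len--;
	}
	return 0;
}

static void check_examples(void)
{
	/* Full path from the root: d/a inherits the match of "d". */
	assert(toy_in_sparse_checkout("d/a") == 1);
	/* Same file seen from inside the d tree: wrongly hits "!/a". */
	assert(toy_in_sparse_checkout("a") == 0);
}
```

The second assertion shows the mismatch: without knowing that the tree
being searched is /d, its entry "a" is indistinguishable from the
excluded top-level file.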

 builtin/grep.c                   | 127 ++++++++++++++++++++++++++--
 t/t7011-skip-worktree-reading.sh |   9 --
 t/t7817-grep-sparse-checkout.sh  | 140 +++++++++++++++++++++++++++++++
 3 files changed, 259 insertions(+), 17 deletions(-)
 create mode 100755 t/t7817-grep-sparse-checkout.sh

diff --git a/builtin/grep.c b/builtin/grep.c
index a5056f395a..91ee0b2734 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -410,7 +410,7 @@ static int grep_cache(struct grep_opt *opt,
 		      const struct pathspec *pathspec, int cached);
 static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 		     struct tree_desc *tree, struct strbuf *base, int tn_len,
-		     int check_attr);
+		     int is_root_tree);
 
 static int grep_submodule(struct grep_opt *opt,
 			  const struct pathspec *pathspec,
@@ -508,6 +508,10 @@ static int grep_cache(struct grep_opt *opt,
 
 	for (nr = 0; nr < repo->index->cache_nr; nr++) {
 		const struct cache_entry *ce = repo->index->cache[nr];
+
+		if (ce_skip_worktree(ce) && !S_ISGITLINK(ce->ce_mode))
+			continue;
+
 		strbuf_setlen(&name, name_base_len);
 		strbuf_addstr(&name, ce->name);
 
@@ -520,8 +524,7 @@ static int grep_cache(struct grep_opt *opt,
 			 * cache entry are identical, even if worktree file has
 			 * been modified, so use cache version instead
 			 */
-			if (cached || (ce->ce_flags & CE_VALID) ||
-			    ce_skip_worktree(ce)) {
+			if (cached || (ce->ce_flags & CE_VALID)) {
 				if (ce_stage(ce) || ce_intent_to_add(ce))
 					continue;
 				hit |= grep_oid(opt, &ce->oid, name.buf,
@@ -552,9 +555,78 @@ static int grep_cache(struct grep_opt *opt,
 	return hit;
 }
 
-static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
-		     struct tree_desc *tree, struct strbuf *base, int tn_len,
-		     int check_attr)
+static struct pattern_list *get_sparsity_patterns(struct repository *repo)
+{
+	struct pattern_list *patterns;
+	char *sparse_file;
+	int sparse_config, cone_config;
+
+	if (repo_config_get_bool(repo, "core.sparsecheckout", &sparse_config) ||
+	    !sparse_config) {
+		return NULL;
+	}
+
+	sparse_file = repo_git_path(repo, "info/sparse-checkout");
+	patterns = xcalloc(1, sizeof(*patterns));
+
+	if (repo_config_get_bool(repo, "core.sparsecheckoutcone", &cone_config))
+		cone_config = 0;
+	patterns->use_cone_patterns = cone_config;
+
+	if (add_patterns_from_file_to_list(sparse_file, "", 0, patterns, NULL)) {
+		if (file_exists(sparse_file)) {
+			warning(_("failed to load sparse-checkout file: '%s'"),
+				sparse_file);
+		}
+		free(sparse_file);
+		free(patterns);
+		return NULL;
+	}
+
+	free(sparse_file);
+	return patterns;
+}
+
+static int in_sparse_checkout(struct strbuf *path, int prefix_len,
+			      unsigned int entry_mode,
+			      struct index_state *istate,
+			      struct pattern_list *sparsity,
+			      enum pattern_match_result parent_match,
+			      enum pattern_match_result *match)
+{
+	int dtype = DT_UNKNOWN;
+
+	if (S_ISGITLINK(entry_mode))
+		return 1;
+
+	if (parent_match == MATCHED_RECURSIVE) {
+		*match = parent_match;
+		return 1;
+	}
+
+	if (S_ISDIR(entry_mode) && !is_dir_sep(path->buf[path->len - 1]))
+		strbuf_addch(path, '/');
+
+	*match = path_matches_pattern_list(path->buf, path->len,
+					   path->buf + prefix_len, &dtype,
+					   sparsity, istate);
+	if (*match == UNDECIDED)
+		*match = parent_match;
+
+	if (S_ISDIR(entry_mode))
+		strbuf_trim_trailing_dir_sep(path);
+
+	if (*match == NOT_MATCHED && (S_ISREG(entry_mode) ||
+	    (S_ISDIR(entry_mode) && sparsity->use_cone_patterns)))
+		return 0;
+
+	return 1;
+}
+
+static int do_grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
+			struct tree_desc *tree, struct strbuf *base, int tn_len,
+			int check_attr, struct pattern_list *sparsity,
+			enum pattern_match_result default_sparsity_match)
 {
 	struct repository *repo = opt->repo;
 	int hit = 0;
@@ -570,6 +642,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 
 	while (tree_entry(tree, &entry)) {
 		int te_len = tree_entry_len(&entry);
+		enum pattern_match_result sparsity_match = 0;
 
 		if (match != all_entries_interesting) {
 			strbuf_addstr(&name, base->buf + tn_len);
@@ -586,6 +659,19 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 
 		strbuf_add(base, entry.path, te_len);
 
+		if (sparsity) {
+			struct strbuf path = STRBUF_INIT;
+			strbuf_addstr(&path, base->buf + tn_len);
+
+			if (!in_sparse_checkout(&path, old_baselen - tn_len,
+						entry.mode, repo->index,
+						sparsity, default_sparsity_match,
+						&sparsity_match)) {
+				strbuf_setlen(base, old_baselen);
+				continue;
+			}
+		}
+
 		if (S_ISREG(entry.mode)) {
 			hit |= grep_oid(opt, &entry.oid, base->buf, tn_len,
 					 check_attr ? base->buf + tn_len : NULL);
@@ -602,8 +688,8 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 
 			strbuf_addch(base, '/');
 			init_tree_desc(&sub, data, size);
-			hit |= grep_tree(opt, pathspec, &sub, base, tn_len,
-					 check_attr);
+			hit |= do_grep_tree(opt, pathspec, &sub, base, tn_len,
+					    check_attr, sparsity, sparsity_match);
 			free(data);
 		} else if (recurse_submodules && S_ISGITLINK(entry.mode)) {
 			hit |= grep_submodule(opt, pathspec, &entry.oid,
@@ -621,6 +707,31 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 	return hit;
 }
 
+/*
+ * Note: sparsity patterns and paths' attributes will only be considered if
+ * is_root_tree has true value. (Otherwise, we cannot properly perform pattern
+ * matching on paths.)
+ */
+static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
+		     struct tree_desc *tree, struct strbuf *base, int tn_len,
+		     int is_root_tree)
+{
+	struct pattern_list *patterns = NULL;
+	int ret;
+
+	if (is_root_tree)
+		patterns = get_sparsity_patterns(opt->repo);
+
+	ret = do_grep_tree(opt, pathspec, tree, base, tn_len, is_root_tree,
+			   patterns, 0);
+
+	if (patterns) {
+		clear_pattern_list(patterns);
+		free(patterns);
+	}
+	return ret;
+}
+
 static int grep_object(struct grep_opt *opt, const struct pathspec *pathspec,
 		       struct object *obj, const char *name, const char *path)
 {
diff --git a/t/t7011-skip-worktree-reading.sh b/t/t7011-skip-worktree-reading.sh
index 37525cae3a..26852586ac 100755
--- a/t/t7011-skip-worktree-reading.sh
+++ b/t/t7011-skip-worktree-reading.sh
@@ -109,15 +109,6 @@ test_expect_success 'ls-files --modified' '
 	test -z "$(git ls-files -m)"
 '
 
-test_expect_success 'grep with skip-worktree file' '
-	git update-index --no-skip-worktree 1 &&
-	echo test > 1 &&
-	git update-index 1 &&
-	git update-index --skip-worktree 1 &&
-	rm 1 &&
-	test "$(git grep --no-ext-grep test)" = "1:test"
-'
-
 echo ":000000 100644 $ZERO_OID $EMPTY_BLOB A	1" > expected
 test_expect_success 'diff-index does not examine skip-worktree absent entries' '
 	setup_absent &&
diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
new file mode 100755
index 0000000000..3bd67082eb
--- /dev/null
+++ b/t/t7817-grep-sparse-checkout.sh
@@ -0,0 +1,140 @@
+#!/bin/sh
+
+test_description='grep in sparse checkout
+
+This test creates a repo with the following structure:
+
+.
+|-- a
+|-- b
+|-- dir
+|   `-- c
+`-- sub
+    |-- A
+    |   `-- a
+    `-- B
+	`-- b
+
+Where . has non-cone mode sparsity patterns and sub is a submodule with cone
+mode sparsity patterns. The resulting sparse-checkout should leave the following
+structure:
+
+.
+|-- a
+`-- sub
+    `-- B
+	`-- b
+'
+
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+	echo "text" >a &&
+	echo "text" >b &&
+	mkdir dir &&
+	echo "text" >dir/c &&
+
+	git init sub &&
+	(
+		cd sub &&
+		mkdir A B &&
+		echo "text" >A/a &&
+		echo "text" >B/b &&
+		git add A B &&
+		git commit -m sub &&
+		git sparse-checkout init --cone &&
+		git sparse-checkout set B
+	) &&
+
+	git submodule add ./sub &&
+	git add a b dir &&
+	git commit -m super &&
+	git sparse-checkout init --no-cone &&
+	git sparse-checkout set "/*" "!b" "!/*/" &&
+
+	git tag -am t-commit t-commit HEAD &&
+	tree=$(git rev-parse HEAD^{tree}) &&
+	git tag -am t-tree t-tree $tree &&
+
+	test_path_is_missing b &&
+	test_path_is_missing dir &&
+	test_path_is_missing sub/A &&
+	test_path_is_file a &&
+	test_path_is_file sub/B/b
+'
+
+test_expect_success 'grep in working tree should honor sparse checkout' '
+	cat >expect <<-EOF &&
+	a:text
+	EOF
+	git grep "text" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep --cached should honor sparse checkout' '
+	cat >expect <<-EOF &&
+	a:text
+	EOF
+	git grep --cached "text" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep <commit-ish> should honor sparse checkout' '
+	commit=$(git rev-parse HEAD) &&
+	cat >expect_commit <<-EOF &&
+	$commit:a:text
+	EOF
+	cat >expect_t-commit <<-EOF &&
+	t-commit:a:text
+	EOF
+	git grep "text" $commit >actual_commit &&
+	test_cmp expect_commit actual_commit &&
+	git grep "text" t-commit >actual_t-commit &&
+	test_cmp expect_t-commit actual_t-commit
+'
+
+test_expect_success 'grep <tree-ish> should ignore sparsity patterns' '
+	commit=$(git rev-parse HEAD) &&
+	tree=$(git rev-parse HEAD^{tree}) &&
+	cat >expect_tree <<-EOF &&
+	$tree:a:text
+	$tree:b:text
+	$tree:dir/c:text
+	EOF
+	cat >expect_t-tree <<-EOF &&
+	t-tree:a:text
+	t-tree:b:text
+	t-tree:dir/c:text
+	EOF
+	git grep "text" $tree >actual_tree &&
+	test_cmp expect_tree actual_tree &&
+	git grep "text" t-tree >actual_t-tree &&
+	test_cmp expect_t-tree actual_t-tree
+'
+
+test_expect_success 'grep --recurse-submodules --cached should honor sparse checkout in submodule' '
+	cat >expect <<-EOF &&
+	a:text
+	sub/B/b:text
+	EOF
+	git grep --recurse-submodules --cached "text" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep --recurse-submodules <commit-ish> should honor sparse checkout in submodule' '
+	commit=$(git rev-parse HEAD) &&
+	cat >expect_commit <<-EOF &&
+	$commit:a:text
+	$commit:sub/B/b:text
+	EOF
+	cat >expect_t-commit <<-EOF &&
+	t-commit:a:text
+	t-commit:sub/B/b:text
+	EOF
+	git grep --recurse-submodules "text" $commit >actual_commit &&
+	test_cmp expect_commit actual_commit &&
+	git grep --recurse-submodules "text" t-commit >actual_t-commit &&
+	test_cmp expect_t-commit actual_t-commit
+'
+
+test_done
-- 
2.26.2



* [RFC PATCH v2 4/4] config: add setting to ignore sparsity patterns in some cmds
  2020-05-10  0:41 ` [RFC PATCH v2 0/4] grep: honor sparse checkout and add option to ignore it Matheus Tavares
                     ` (2 preceding siblings ...)
  2020-05-10  0:41   ` [RFC PATCH v2 3/4] grep: honor sparse checkout patterns Matheus Tavares
@ 2020-05-10  0:41   ` Matheus Tavares
  2020-05-10  4:23     ` Matheus Tavares Bernardino
  2020-05-21  7:09     ` Elijah Newren
  2020-05-28  1:12   ` [PATCH v3 0/5] grep: honor sparse checkout and add option to ignore it Matheus Tavares
  4 siblings, 2 replies; 120+ messages in thread
From: Matheus Tavares @ 2020-05-10  0:41 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, newren, jonathantanmy

When sparse checkout is enabled, some users expect the output of certain
commands (such as grep, diff, and log) to be also restricted within the
sparsity patterns. This would allow them to effectively work only on the
subset of files in which they are interested; and allow some commands to
possibly perform better, by not considering uninteresting paths. For
this reason, we taught grep to honor the sparsity patterns, in the
previous commit. But, on the other hand, allowing grep and the other
commands mentioned to optionally ignore the patterns also makes for some
interesting use cases. E.g. using grep to search for a function
definition that resides outside the sparse checkout.

In any case, there is no current way for users to configure the behavior
they want for these commands. Aiming to provide this flexibility, let's
introduce the sparse.restrictCmds setting (and the analogous
--[no-]restrict-to-sparse-paths global option). The default value is
true. For now, grep is the only one affected by this setting, but the
goal is to have support for more commands, in the future.

Helped-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---

Some notes/questions about this one:

- I guess having the additional sparse-checkout.o only for the
  restrict_to_sparse_paths() function is not very justifiable.
  Especially since builtin/grep.c is currently its only caller. But
  since Stolee is already moving some code out of the sparse-checkout
  builtin and into sparse-checkout.o [1], I thought it would be better
  to place this function here from the start, as it will likely be
  needed by other cmds when they start honoring sparse.restrictCmds.
  (Side note: I think I will also be able to use the
  populate_sparse_checkout_patterns() function added by Stolee in the
  same patchset [2], to avoid code duplication in the
  get_sparsity_patterns() function added in the previous patch).

[1]: https://lore.kernel.org/git/0181a134bfb6986dc0e54ae624c478446a1324a9.1588857462.git.gitgitgadget@gmail.com/
[2]: https://lore.kernel.org/git/444a6b5f894f28e96f713e5caccba18e1ea3b3eb.1588857462.git.gitgitgadget@gmail.com/

- With that said, the only reason we need restrict_to_sparse_paths() to
  begin with, is so that commands which recurse into submodules may
  respect the value set in each submodule for the sparse.restrictCmds
  config. This is already being done for grep, in this patch. But,
  should we do it this way, or should we use the value set in the
  superproject, for all submodules as well, when recursing (ignoring the
  value set on them)?

- It's possible to also make read-tree respect the new setting/option,
  using --no-restrict-to-sparse-paths as a synonym for its
  --no-sparse-checkout option (with lower precedence). However, as this
  command can change the sparse checked out paths, I thought it kind
  of falls under a different category. Also, `git read-tree -mu
  --sparse-checkout` doesn't have the effect of *restricting* the
  command's behavior to the sparsity patterns, but of applying them to
  the working tree, right? So maybe it could be confusing to make this
  command honor the new setting. Does that make sense, or should we do
  it?

- Finally, if we decide to make read-tree be affected by
  sparse.restrictCmds, there is also the case of whether the config
  should be honored for submodules or just propagate the superproject's
  value. I think the latter would be as simple as adding this line,
  before calling parse_options() in builtin/read-tree.c:

  opts.skip_sparse_checkout = !restrict_to_sparse_paths(the_repository);

  As for the former, I'm not very familiar with the code in
  unpack_trees(), so I'm not sure how complicated that would be.
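
For reference, the precedence discussed in these notes (global option
first, then each repository's own sparse.restrictCmds, then the default
of true) can be sketched as below. The struct and function names are
hypothetical and simplified; the real lookup goes through the config
machinery.

```c
#include <assert.h>

/* Hypothetical per-repository view of sparse.restrictCmds. */
struct toy_repo_config {
	int is_set;	/* 1 if the repository defines the setting */
	int value;	/* its boolean value, when is_set is 1 */
};

/*
 * cli_opt is -1 when neither --restrict-to-sparse-paths nor
 * --no-restrict-to-sparse-paths was given, otherwise 1 or 0.
 */
static int toy_restrict_to_sparse_paths(int cli_opt,
					const struct toy_repo_config *cfg)
{
	if (cli_opt >= 0)
		return cli_opt;		/* the command line always wins */
	if (cfg->is_set)
		return cfg->value;	/* then the repository's config */
	return 1;			/* default: restrict */
}

static void check_examples(void)
{
	struct toy_repo_config super = { 0, 0 };	/* unset */
	struct toy_repo_config sub = { 1, 0 };		/* set to false */

	/* No option given: each repository uses its own value. */
	assert(toy_restrict_to_sparse_paths(-1, &super) == 1);
	assert(toy_restrict_to_sparse_paths(-1, &sub) == 0);

	/* An explicit option is propagated to every repository. */
	assert(toy_restrict_to_sparse_paths(0, &super) == 0);
	assert(toy_restrict_to_sparse_paths(1, &sub) == 1);
}
```

Evaluating the function once per repository is what lets a submodule
keep its own value while still yielding to the command line.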


 Documentation/config.txt               |  2 +
 Documentation/config/sparse.txt        | 22 ++++++++
 Documentation/git-grep.txt             |  3 +
 Documentation/git.txt                  |  4 ++
 Makefile                               |  1 +
 builtin/grep.c                         | 14 ++++-
 contrib/completion/git-completion.bash |  2 +
 git.c                                  |  6 ++
 sparse-checkout.c                      | 16 ++++++
 sparse-checkout.h                      | 11 ++++
 t/t7817-grep-sparse-checkout.sh        | 78 +++++++++++++++++++++++++-
 t/t9902-completion.sh                  |  4 +-
 12 files changed, 159 insertions(+), 4 deletions(-)
 create mode 100644 Documentation/config/sparse.txt
 create mode 100644 sparse-checkout.c
 create mode 100644 sparse-checkout.h

diff --git a/Documentation/config.txt b/Documentation/config.txt
index ef0768b91a..fd74b80302 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -436,6 +436,8 @@ include::config/sequencer.txt[]
 
 include::config/showbranch.txt[]
 
+include::config/sparse.txt[]
+
 include::config/splitindex.txt[]
 
 include::config/ssh.txt[]
diff --git a/Documentation/config/sparse.txt b/Documentation/config/sparse.txt
new file mode 100644
index 0000000000..83a4e0018f
--- /dev/null
+++ b/Documentation/config/sparse.txt
@@ -0,0 +1,22 @@
+sparse.restrictCmds::
+	Only meaningful in conjunction with core.sparseCheckout. This option
+	extends sparse checkouts (which limit which paths are written to the
+	working tree), so that output and operations are also limited to the
+	sparsity paths where possible and implemented. The purpose of this
+	option is to (1) focus output for the user on the portion of the
+	repository that is of interest to them, and (2) enable potentially
+	dramatic performance improvements, especially in conjunction with
+	partial clones.
++
+When this option is true (default), some git commands may limit their behavior
+to the paths specified by the sparsity patterns, or to the intersection of
+those paths and any pathspecs (like `*.c`) that the user might also specify on the command
+line. When false, the affected commands will work on full trees, ignoring the
+sparsity patterns. For now, only git-grep honors this setting. In this command,
+the restriction becomes relevant in one of these three cases: with --cached;
+when a commit-ish is given; when searching a working tree that contains paths
+previously excluded by the sparsity patterns.
++
+Note: commands which export, integrity check, or create history will always
+operate on full trees (e.g. fast-export, format-patch, fsck, commit, etc.),
+unaffected by any sparsity patterns.
diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
index 9bdf807584..abbf100109 100644
--- a/Documentation/git-grep.txt
+++ b/Documentation/git-grep.txt
@@ -41,6 +41,9 @@ characters.  An empty string as search expression matches all lines.
 CONFIGURATION
 -------------
 
+git-grep honors the sparse.restrictCmds setting. See its definition in
+linkgit:git-config[1].
+
 :git-grep: 1
 include::config/grep.txt[]
 
diff --git a/Documentation/git.txt b/Documentation/git.txt
index 9d6769e95a..5e107c6246 100644
--- a/Documentation/git.txt
+++ b/Documentation/git.txt
@@ -180,6 +180,10 @@ If you just want to run git as if it was started in `<path>` then use
 	Do not perform optional operations that require locks. This is
 	equivalent to setting the `GIT_OPTIONAL_LOCKS` to `0`.
 
+--[no-]restrict-to-sparse-paths::
+	Overrides the sparse.restrictCmds configuration (see
+	linkgit:git-config[1]) for this execution.
+
 --list-cmds=group[,group...]::
 	List commands by group. This is an internal/experimental
 	option and may change or be removed in the future. Supported
diff --git a/Makefile b/Makefile
index 3d3a39fc19..67580c691b 100644
--- a/Makefile
+++ b/Makefile
@@ -986,6 +986,7 @@ LIB_OBJS += sha1-name.o
 LIB_OBJS += shallow.o
 LIB_OBJS += sideband.o
 LIB_OBJS += sigchain.o
+LIB_OBJS += sparse-checkout.o
 LIB_OBJS += split-index.o
 LIB_OBJS += stable-qsort.o
 LIB_OBJS += strbuf.o
diff --git a/builtin/grep.c b/builtin/grep.c
index 91ee0b2734..3f92e7fd6c 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -25,6 +25,7 @@
 #include "submodule-config.h"
 #include "object-store.h"
 #include "packfile.h"
+#include "sparse-checkout.h"
 
 static char const * const grep_usage[] = {
 	N_("git grep [<options>] [-e] <pattern> [<rev>...] [[--] <path>...]"),
@@ -498,6 +499,7 @@ static int grep_cache(struct grep_opt *opt,
 	int nr;
 	struct strbuf name = STRBUF_INIT;
 	int name_base_len = 0;
+	int sparse_paths_only =	restrict_to_sparse_paths(repo);
 	if (repo->submodule_prefix) {
 		name_base_len = strlen(repo->submodule_prefix);
 		strbuf_addstr(&name, repo->submodule_prefix);
@@ -509,7 +511,8 @@ static int grep_cache(struct grep_opt *opt,
 	for (nr = 0; nr < repo->index->cache_nr; nr++) {
 		const struct cache_entry *ce = repo->index->cache[nr];
 
-		if (ce_skip_worktree(ce) && !S_ISGITLINK(ce->ce_mode))
+		if (sparse_paths_only && ce_skip_worktree(ce) &&
+		    !S_ISGITLINK(ce->ce_mode))
 			continue;
 
 		strbuf_setlen(&name, name_base_len);
@@ -717,9 +720,10 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 		     int is_root_tree)
 {
 	struct pattern_list *patterns = NULL;
+	int sparse_paths_only = restrict_to_sparse_paths(opt->repo);
 	int ret;
 
-	if (is_root_tree)
+	if (is_root_tree && sparse_paths_only)
 		patterns = get_sparsity_patterns(opt->repo);
 
 	ret = do_grep_tree(opt, pathspec, tree, base, tn_len, is_root_tree,
@@ -1259,6 +1263,12 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
 
 	if (!use_index || untracked) {
 		int use_exclude = (opt_exclude < 0) ? use_index : !!opt_exclude;
+
+		if (opt_restrict_to_sparse_paths >= 0) {
+			warning(_("--[no-]restrict-to-sparse-paths is ignored"
+				  " with --no-index or --untracked"));
+		}
+
 		hit = grep_directory(&opt, &pathspec, use_exclude, use_index);
 	} else if (0 <= opt_exclude) {
 		die(_("--[no-]exclude-standard cannot be used for tracked contents"));
diff --git a/contrib/completion/git-completion.bash b/contrib/completion/git-completion.bash
index b1d6e5ebed..cba0f9166c 100644
--- a/contrib/completion/git-completion.bash
+++ b/contrib/completion/git-completion.bash
@@ -3207,6 +3207,8 @@ __git_main ()
 			--namespace=
 			--no-replace-objects
 			--help
+			--restrict-to-sparse-paths
+			--no-restrict-to-sparse-paths
 			"
 			;;
 		*)
diff --git a/git.c b/git.c
index 2e4efb4ff0..f967c75d9c 100644
--- a/git.c
+++ b/git.c
@@ -37,6 +37,7 @@ const char git_more_info_string[] =
 	   "See 'git help git' for an overview of the system.");
 
 static int use_pager = -1;
+int opt_restrict_to_sparse_paths = -1;
 
 static void list_builtins(struct string_list *list, unsigned int exclude_option);
 
@@ -310,6 +311,10 @@ static int handle_options(const char ***argv, int *argc, int *envchanged)
 			} else {
 				exit(list_cmds(cmd));
 			}
+		} else if (!strcmp(cmd, "--restrict-to-sparse-paths")) {
+			opt_restrict_to_sparse_paths = 1;
+		} else if (!strcmp(cmd, "--no-restrict-to-sparse-paths")) {
+			opt_restrict_to_sparse_paths = 0;
 		} else {
 			fprintf(stderr, _("unknown option: %s\n"), cmd);
 			usage(git_usage_string);
@@ -318,6 +323,7 @@ static int handle_options(const char ***argv, int *argc, int *envchanged)
 		(*argv)++;
 		(*argc)--;
 	}
+
 	return (*argv) - orig_argv;
 }
 
diff --git a/sparse-checkout.c b/sparse-checkout.c
new file mode 100644
index 0000000000..9a9e50fd29
--- /dev/null
+++ b/sparse-checkout.c
@@ -0,0 +1,16 @@
+#include "cache.h"
+#include "config.h"
+#include "sparse-checkout.h"
+
+int restrict_to_sparse_paths(struct repository *repo)
+{
+	int ret;
+
+	if (opt_restrict_to_sparse_paths >= 0)
+		return opt_restrict_to_sparse_paths;
+
+	if (repo_config_get_bool(repo, "sparse.restrictcmds", &ret))
+		ret = 1;
+
+	return ret;
+}
diff --git a/sparse-checkout.h b/sparse-checkout.h
new file mode 100644
index 0000000000..1de3b588d8
--- /dev/null
+++ b/sparse-checkout.h
@@ -0,0 +1,11 @@
+#ifndef SPARSE_CHECKOUT_H
+#define SPARSE_CHECKOUT_H
+
+struct repository;
+
+extern int opt_restrict_to_sparse_paths; /* from git.c */
+
+/* Whether or not cmds should restrict behavior on sparse paths, in this repo */
+int restrict_to_sparse_paths(struct repository *repo);
+
+#endif /* SPARSE_CHECKOUT_H */
diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
index 3bd67082eb..8509694bf1 100755
--- a/t/t7817-grep-sparse-checkout.sh
+++ b/t/t7817-grep-sparse-checkout.sh
@@ -63,12 +63,28 @@ test_expect_success 'setup' '
 	test_path_is_file sub/B/b
 '
 
+# The two tests below check a special case: the sparsity patterns exclude '/b'
+# and sparse checkout is enabled, but the path exists on the working tree (e.g.
+# manually created after `git sparse-checkout init`). In this case, grep should
+# honor --restrict-to-sparse-paths.
 test_expect_success 'grep in working tree should honor sparse checkout' '
 	cat >expect <<-EOF &&
 	a:text
 	EOF
+	echo newtext >b &&
 	git grep "text" >actual &&
-	test_cmp expect actual
+	test_cmp expect actual &&
+	rm b
+'
+test_expect_success 'grep w/ --no-restrict-to-sparse-paths for sparsely excluded but present paths' '
+	cat >expect <<-EOF &&
+	a:text
+	b:newtext
+	EOF
+	echo newtext >b &&
+	git --no-restrict-to-sparse-paths grep "text" >actual &&
+	test_cmp expect actual &&
+	rm b
 '
 
 test_expect_success 'grep --cached should honor sparse checkout' '
@@ -137,4 +153,64 @@ test_expect_success 'grep --recurse-submodules <commit-ish> should honor sparse
 	test_cmp expect_t-commit actual_t-commit
 '
 
+for cmd in 'git --no-restrict-to-sparse-paths grep' \
+	   'git -c sparse.restrictCmds=false grep' \
+	   'git -c sparse.restrictCmds=true --no-restrict-to-sparse-paths grep'
+do
+
+	test_expect_success "$cmd --cached should ignore sparsity patterns" '
+		cat >expect <<-EOF &&
+		a:text
+		b:text
+		dir/c:text
+		EOF
+		$cmd --cached "text" >actual &&
+		test_cmp expect actual
+	'
+
+	test_expect_success "$cmd <commit-ish> should ignore sparsity patterns" '
+		commit=$(git rev-parse HEAD) &&
+		cat >expect_commit <<-EOF &&
+		$commit:a:text
+		$commit:b:text
+		$commit:dir/c:text
+		EOF
+		cat >expect_t-commit <<-EOF &&
+		t-commit:a:text
+		t-commit:b:text
+		t-commit:dir/c:text
+		EOF
+		$cmd "text" $commit >actual_commit &&
+		test_cmp expect_commit actual_commit &&
+		$cmd "text" t-commit >actual_t-commit &&
+		test_cmp expect_t-commit actual_t-commit
+	'
+done
+
+test_expect_success 'should respect the sparse.restrictCmds values from submodules' '
+	cat >expect <<-EOF &&
+	a:text
+	sub/A/a:text
+	sub/B/b:text
+	EOF
+	git -C sub config sparse.restrictCmds false &&
+	git grep --cached --recurse-submodules "text" >actual &&
+	test_cmp expect actual &&
+	git -C sub config --unset sparse.restrictCmds
+'
+
+test_expect_success 'should propagate --[no-]restrict-to-sparse-paths to submodules' '
+	cat >expect <<-EOF &&
+	a:text
+	b:text
+	dir/c:text
+	sub/A/a:text
+	sub/B/b:text
+	EOF
+	git -C sub config sparse.restrictCmds true &&
+	git --no-restrict-to-sparse-paths grep --cached --recurse-submodules "text" >actual &&
+	test_cmp expect actual &&
+	git -C sub config --unset sparse.restrictCmds
+'
+
 test_done
diff --git a/t/t9902-completion.sh b/t/t9902-completion.sh
index 3c44af6940..a4a7767e06 100755
--- a/t/t9902-completion.sh
+++ b/t/t9902-completion.sh
@@ -1473,6 +1473,8 @@ test_expect_success 'double dash "git" itself' '
 	--namespace=
 	--no-replace-objects Z
 	--help Z
+	--restrict-to-sparse-paths Z
+	--no-restrict-to-sparse-paths Z
 	EOF
 '
 
@@ -1515,7 +1517,7 @@ test_expect_success 'general options' '
 	test_completion "git --nam" "--namespace=" &&
 	test_completion "git --bar" "--bare " &&
 	test_completion "git --inf" "--info-path " &&
-	test_completion "git --no-r" "--no-replace-objects "
+	test_completion "git --no-rep" "--no-replace-objects "
 '
 
 test_expect_success 'general options plus command' '
-- 
2.26.2


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH v2 4/4] config: add setting to ignore sparsity patterns in some cmds
  2020-05-10  0:41   ` [RFC PATCH v2 4/4] config: add setting to ignore sparsity patterns in some cmds Matheus Tavares
@ 2020-05-10  4:23     ` Matheus Tavares Bernardino
  2020-05-21 17:18       ` Elijah Newren
  2020-05-21  7:09     ` Elijah Newren
  1 sibling, 1 reply; 120+ messages in thread
From: Matheus Tavares Bernardino @ 2020-05-10  4:23 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Derrick Stolee, Elijah Newren, Jonathan Tan

On Sat, May 9, 2020 at 9:42 PM Matheus Tavares
<matheus.bernardino@usp.br> wrote:
>
> diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> index 3bd67082eb..8509694bf1 100755
> --- a/t/t7817-grep-sparse-checkout.sh
> +++ b/t/t7817-grep-sparse-checkout.sh
> @@ -63,12 +63,28 @@ test_expect_success 'setup' '
>         test_path_is_file sub/B/b
>  '
>
> +# The two tests below check a special case: the sparsity patterns exclude '/b'
> +# and sparse checkout is enabled, but the path exists on the working tree (e.g.
> +# manually created after `git sparse-checkout init`). In this case, grep should
> +# honor --restrict-to-sparse-paths.

I just want to highlight a small thing that I forgot to comment on:
Elijah and I had already discussed that --restrict-to-sparse-paths is
relevant in grep only with --cached or when a commit-ish is given. But
the special case mentioned above had not occurred to me before, i.e.
searching in the working tree when a path that should be excluded by
the sparsity patterns is present. In this patch, I let
--restrict-to-sparse-paths control the behavior of grep in this case
too. But please let me know if
that doesn't seem like a good idea.
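
To make the intended behavior concrete, here is a rough standalone
model of the decision. It is only a sketch under stated assumptions,
not the code from the patch: the enum and function names are
hypothetical, and it assumes (consistent with the tests in this series)
that the command-line option overrides sparse.restrictCmds, which
defaults to true.

```c
#include <stdbool.h>

/* Hypothetical tri-state for the global command-line option. */
enum cli_restrict {
	CLI_UNSET = -1,      /* neither option was given */
	CLI_NO_RESTRICT = 0, /* --no-restrict-to-sparse-paths */
	CLI_RESTRICT = 1     /* --restrict-to-sparse-paths */
};

/*
 * Resolve the effective setting. config_value is -1 when
 * sparse.restrictCmds is unset, otherwise 0 or 1.
 * Assumption: command line beats config, and the default is true.
 */
bool effective_restrict(enum cli_restrict cli, int config_value)
{
	if (cli != CLI_UNSET)
		return cli == CLI_RESTRICT;
	if (config_value >= 0)
		return config_value != 0;
	return true;
}

/*
 * Working tree grep: a path excluded by the sparsity patterns may
 * still exist on disk. In this model it is searched only when the
 * restriction is turned off.
 */
bool grep_worktree_path(bool excluded_by_sparsity, bool restrict_on)
{
	return !(excluded_by_sparsity && restrict_on);
}
```

With the default setting, a manually created 'b' would be skipped; with
--no-restrict-to-sparse-paths it would be considered again.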

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH v2 2/4] config: load the correct config.worktree file
  2020-05-10  0:41   ` [RFC PATCH v2 2/4] config: load the correct config.worktree file Matheus Tavares
@ 2020-05-11 19:10     ` Junio C Hamano
  2020-05-12 22:55       ` Matheus Tavares Bernardino
  0 siblings, 1 reply; 120+ messages in thread
From: Junio C Hamano @ 2020-05-11 19:10 UTC (permalink / raw)
  To: Matheus Tavares; +Cc: git, stolee, newren, jonathantanmy

Matheus Tavares <matheus.bernardino@usp.br> writes:

> One of the steps in do_git_config_sequence() is to load the
> worktree-specific config file. Although the function receives a git_dir
> string, it relies on git_pathdup(), which uses the_repository->git_dir,
> to make the path to the file. Thus, when a submodule has a worktree
> setting, a command executed in the superproject that recurses into the
> submodule won't find the said setting.

This has far wider ramifications than just "git grep" and it may be
an important fix.  Anything that wants to read from a per-worktree
configuration is not working as expected when run from a secondary
worktree, right?

Can we add a test or two to protect this fix from future breakages?


>  	current_parsing_scope = CONFIG_SCOPE_WORKTREE;
> -	if (!opts->ignore_worktree && repository_format_worktree_config) {
> -		char *path = git_pathdup("config.worktree");
> +	if (!opts->ignore_worktree && repository_format_worktree_config &&
> +	    opts->git_dir) {
> +		char *path = mkpathdup("%s/config.worktree", opts->git_dir);
>  		if (!access_or_die(path, R_OK, 0))
>  			ret += git_config_from_file(fn, path, data);
>  		free(path);

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH v2 3/4] grep: honor sparse checkout patterns
  2020-05-10  0:41   ` [RFC PATCH v2 3/4] grep: honor sparse checkout patterns Matheus Tavares
@ 2020-05-11 19:35     ` Junio C Hamano
  2020-05-13  0:05       ` Matheus Tavares Bernardino
  2020-05-21  7:36       ` Elijah Newren
  0 siblings, 2 replies; 120+ messages in thread
From: Junio C Hamano @ 2020-05-11 19:35 UTC (permalink / raw)
  To: Matheus Tavares; +Cc: git, stolee, newren, jonathantanmy

Matheus Tavares <matheus.bernardino@usp.br> writes:

> One of the main uses for a sparse checkout is to allow users to focus on
> the subset of files in a repository in which they are interested. But
> git-grep currently ignores the sparsity patterns and reports all matches
> found outside this subset, which kind of goes in the opposite direction.
> Let's fix that, making it honor the sparsity boundaries for every
> grepping case:
>
> - git grep in worktree
> - git grep --cached
> - git grep $REVISION

It makes sense for these to be limited within the "sparse" area.

> - git grep --untracked and git grep --no-index (which already respect
>   sparse checkout boundaries)

I can understand the former; those untracked files are what _could_
be brought into attention by "git add", so limiting to the same
"sparse" area may make sense.

I am not sure about the latter, though, as "--no-index" is an
explicit request to pretend that we are dealing with a random
collection of files, not managed in a git repository.  But perhaps
there is a similar justification like how "--untracked" is
justifiable.  I dunno.

> diff --git a/builtin/grep.c b/builtin/grep.c
> index a5056f395a..91ee0b2734 100644
> --- a/builtin/grep.c
> +++ b/builtin/grep.c
> @@ -410,7 +410,7 @@ static int grep_cache(struct grep_opt *opt,
>  		      const struct pathspec *pathspec, int cached);
>  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>  		     struct tree_desc *tree, struct strbuf *base, int tn_len,
> -		     int check_attr);
> +		     int is_root_tree);
>  
>  static int grep_submodule(struct grep_opt *opt,
>  			  const struct pathspec *pathspec,
> @@ -508,6 +508,10 @@ static int grep_cache(struct grep_opt *opt,
>  
>  	for (nr = 0; nr < repo->index->cache_nr; nr++) {
>  		const struct cache_entry *ce = repo->index->cache[nr];
> +
> +		if (ce_skip_worktree(ce) && !S_ISGITLINK(ce->ce_mode))
> +			continue;

Hmph.  Why exclude gitlink from this rule?  If a submodule sits at a
path that is excluded by the sparse pattern, should we still recurse
into it?

>  		strbuf_setlen(&name, name_base_len);
>  		strbuf_addstr(&name, ce->name);
>  
> @@ -520,8 +524,7 @@ static int grep_cache(struct grep_opt *opt,
>  			 * cache entry are identical, even if worktree file has
>  			 * been modified, so use cache version instead
>  			 */
> -			if (cached || (ce->ce_flags & CE_VALID) ||
> -			    ce_skip_worktree(ce)) {
> +			if (cached || (ce->ce_flags & CE_VALID)) {
>  				if (ce_stage(ce) || ce_intent_to_add(ce))
>  					continue;
>  				hit |= grep_oid(opt, &ce->oid, name.buf,
> @@ -552,9 +555,78 @@ static int grep_cache(struct grep_opt *opt,
>  	return hit;
>  }
>  
> -static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> -		     struct tree_desc *tree, struct strbuf *base, int tn_len,
> -		     int check_attr)
> +static struct pattern_list *get_sparsity_patterns(struct repository *repo)
> +{
> +	struct pattern_list *patterns;
> +	char *sparse_file;
> +	int sparse_config, cone_config;
> +
> +	if (repo_config_get_bool(repo, "core.sparsecheckout", &sparse_config) ||
> +	    !sparse_config) {
> +		return NULL;
> +	}
> +
> +	sparse_file = repo_git_path(repo, "info/sparse-checkout");
> +	patterns = xcalloc(1, sizeof(*patterns));
> +
> +	if (repo_config_get_bool(repo, "core.sparsecheckoutcone", &cone_config))
> +		cone_config = 0;
> +	patterns->use_cone_patterns = cone_config;
> +
> +	if (add_patterns_from_file_to_list(sparse_file, "", 0, patterns, NULL)) {
> +		if (file_exists(sparse_file)) {
> +			warning(_("failed to load sparse-checkout file: '%s'"),
> +				sparse_file);
> +		}
> +		free(sparse_file);
> +		free(patterns);
> +		return NULL;
> +	}
> +
> +	free(sparse_file);
> +	return patterns;
> +}
> +
> +static int in_sparse_checkout(struct strbuf *path, int prefix_len,
> +			      unsigned int entry_mode,
> +			      struct index_state *istate,
> +			      struct pattern_list *sparsity,
> +			      enum pattern_match_result parent_match,
> +			      enum pattern_match_result *match)
> +{
> +	int dtype = DT_UNKNOWN;
> +
> +	if (S_ISGITLINK(entry_mode))
> +		return 1;

This is consistent with the "we do not care where a gitlink
appears---submodules are always descended into, regardless of the
sparse definition" decision we saw earlier, I think.  I am not sure
if that is a good design in the first place, though.

> +	if (parent_match == MATCHED_RECURSIVE) {
> +		*match = parent_match;
> +		return 1;
> +	}
> +
> +	if (S_ISDIR(entry_mode) && !is_dir_sep(path->buf[path->len - 1]))
> +		strbuf_addch(path, '/');
> +
> +	*match = path_matches_pattern_list(path->buf, path->len,
> +					   path->buf + prefix_len, &dtype,
> +					   sparsity, istate);
> +	if (*match == UNDECIDED)
> +		*match = parent_match;
> +
> +	if (S_ISDIR(entry_mode))
> +		strbuf_trim_trailing_dir_sep(path);
> +
> +	if (*match == NOT_MATCHED && (S_ISREG(entry_mode) ||
> +	    (S_ISDIR(entry_mode) && sparsity->use_cone_patterns)))
> +		return 0;
> +
> +	return 1;
> +}
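
For readers following along, the matching rules in the function quoted
above can be restated as a small standalone model. This is a sketch
only: the enum is a local stand-in for the pattern_match_result values
named in the patch (the numeric values are illustrative), and both
function names are hypothetical.

```c
#include <stdbool.h>

/* Local stand-ins for the result names used in the quoted patch. */
enum match_result {
	M_UNDECIDED = -1,
	M_NOT_MATCHED = 0,
	M_MATCHED = 1,
	M_MATCHED_RECURSIVE = 2
};

/*
 * own:    result of matching this entry against the patterns
 * parent: result already computed for the containing directory
 */
enum match_result effective_match(enum match_result own,
				  enum match_result parent)
{
	if (parent == M_MATCHED_RECURSIVE)
		return parent; /* the whole subtree is already included */
	if (own == M_UNDECIDED)
		return parent; /* no pattern decided: inherit from parent */
	return own;
}

/*
 * Mirrors the final check in the quoted function: unmatched files are
 * always dropped, unmatched directories only in cone mode. In non-cone
 * mode the directory is still descended into, since its entries are
 * matched individually.
 */
bool entry_in_sparse_checkout(enum match_result m, bool is_file,
			      bool is_dir, bool cone_mode)
{
	if (m == M_NOT_MATCHED && (is_file || (is_dir && cone_mode)))
		return false;
	return true;
}
```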



> +static int do_grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> +			struct tree_desc *tree, struct strbuf *base, int tn_len,
> +			int check_attr, struct pattern_list *sparsity,
> +			enum pattern_match_result default_sparsity_match)
>  {
>  	struct repository *repo = opt->repo;
>  	int hit = 0;
> @@ -570,6 +642,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>  
>  	while (tree_entry(tree, &entry)) {
>  		int te_len = tree_entry_len(&entry);
> +		enum pattern_match_result sparsity_match = 0;
>  
>  		if (match != all_entries_interesting) {
>  			strbuf_addstr(&name, base->buf + tn_len);
> @@ -586,6 +659,19 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>  
>  		strbuf_add(base, entry.path, te_len);
>  
> +		if (sparsity) {
> +			struct strbuf path = STRBUF_INIT;
> +			strbuf_addstr(&path, base->buf + tn_len);
> +
> +			if (!in_sparse_checkout(&path, old_baselen - tn_len,
> +						entry.mode, repo->index,
> +						sparsity, default_sparsity_match,
> +						&sparsity_match)) {
> +				strbuf_setlen(base, old_baselen);
> +				continue;
> +			}
> +		}

OK.

>  		if (S_ISREG(entry.mode)) {
>  			hit |= grep_oid(opt, &entry.oid, base->buf, tn_len,
>  					 check_attr ? base->buf + tn_len : NULL);
> @@ -602,8 +688,8 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>  
>  			strbuf_addch(base, '/');
>  			init_tree_desc(&sub, data, size);
> -			hit |= grep_tree(opt, pathspec, &sub, base, tn_len,
> -					 check_attr);
> +			hit |= do_grep_tree(opt, pathspec, &sub, base, tn_len,
> +					    check_attr, sparsity, sparsity_match);
>  			free(data);
>  		} else if (recurse_submodules && S_ISGITLINK(entry.mode)) {
>  			hit |= grep_submodule(opt, pathspec, &entry.oid,
> @@ -621,6 +707,31 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>  	return hit;
>  }
>  
> +/*
> + * Note: sparsity patterns and paths' attributes will only be considered if
> + * is_root_tree has true value. (Otherwise, we cannot properly perform pattern
> + * matching on paths.)
> + */
> +static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> +		     struct tree_desc *tree, struct strbuf *base, int tn_len,
> +		     int is_root_tree)
> +{
> +	struct pattern_list *patterns = NULL;
> +	int ret;
> +
> +	if (is_root_tree)
> +		patterns = get_sparsity_patterns(opt->repo);
> +
> +	ret = do_grep_tree(opt, pathspec, tree, base, tn_len, is_root_tree,
> +			   patterns, 0);
> +
> +	if (patterns) {
> +		clear_pattern_list(patterns);
> +		free(patterns);
> +	}

OK, it is not like this codepath is driven by "git log" to grep from
top-level tree objects of many commits, so it is OK to grab the
sparsity patterns once before do_grep_tree() and discard it when we
are done.

> +	return ret;
> +}
> +

>  static int grep_object(struct grep_opt *opt, const struct pathspec *pathspec,
>  		       struct object *obj, const char *name, const char *path)
>  {
> diff --git a/t/t7011-skip-worktree-reading.sh b/t/t7011-skip-worktree-reading.sh
> index 37525cae3a..26852586ac 100755
> --- a/t/t7011-skip-worktree-reading.sh
> +++ b/t/t7011-skip-worktree-reading.sh
> @@ -109,15 +109,6 @@ test_expect_success 'ls-files --modified' '
>  	test -z "$(git ls-files -m)"
>  '
>  
> -test_expect_success 'grep with skip-worktree file' '
> -	git update-index --no-skip-worktree 1 &&
> -	echo test > 1 &&
> -	git update-index 1 &&
> -	git update-index --skip-worktree 1 &&
> -	rm 1 &&
> -	test "$(git grep --no-ext-grep test)" = "1:test"
> -'
> -
>  echo ":000000 100644 $ZERO_OID $EMPTY_BLOB A	1" > expected
>  test_expect_success 'diff-index does not examine skip-worktree absent entries' '
>  	setup_absent &&
> diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> new file mode 100755
> index 0000000000..3bd67082eb
> --- /dev/null
> +++ b/t/t7817-grep-sparse-checkout.sh
> @@ -0,0 +1,140 @@
> +#!/bin/sh
> +
> +test_description='grep in sparse checkout
> +
> +This test creates a repo with the following structure:
> +
> +.
> +|-- a
> +|-- b
> +|-- dir
> +|   `-- c
> +`-- sub
> +    |-- A
> +    |   `-- a
> +    `-- B
> +        `-- b
> +
> +Where . has non-cone mode sparsity patterns and sub is a submodule with cone
> +mode sparsity patterns. The resulting sparse-checkout should leave the following
> +structure:
> +
> +.
> +|-- a
> +`-- sub
> +    `-- B
> +        `-- b
> +'
> +
> +. ./test-lib.sh
> +
> +test_expect_success 'setup' '
> +	echo "text" >a &&
> +	echo "text" >b &&
> +	mkdir dir &&
> +	echo "text" >dir/c &&
> +
> +	git init sub &&
> +	(
> +		cd sub &&
> +		mkdir A B &&
> +		echo "text" >A/a &&
> +		echo "text" >B/b &&
> +		git add A B &&
> +		git commit -m sub &&
> +		git sparse-checkout init --cone &&
> +		git sparse-checkout set B
> +	) &&
> +
> +	git submodule add ./sub &&
> +	git add a b dir &&
> +	git commit -m super &&
> +	git sparse-checkout init --no-cone &&
> +	git sparse-checkout set "/*" "!b" "!/*/" &&
> +
> +	git tag -am t-commit t-commit HEAD &&
> +	tree=$(git rev-parse HEAD^{tree}) &&
> +	git tag -am t-tree t-tree $tree &&
> +
> +	test_path_is_missing b &&
> +	test_path_is_missing dir &&
> +	test_path_is_missing sub/A &&
> +	test_path_is_file a &&
> +	test_path_is_file sub/B/b
> +'
> +
> +test_expect_success 'grep in working tree should honor sparse checkout' '
> +	cat >expect <<-EOF &&
> +	a:text
> +	EOF
> +	git grep "text" >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'grep --cached should honor sparse checkout' '
> +	cat >expect <<-EOF &&
> +	a:text
> +	EOF
> +	git grep --cached "text" >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'grep <commit-ish> should honor sparse checkout' '
> +	commit=$(git rev-parse HEAD) &&
> +	cat >expect_commit <<-EOF &&
> +	$commit:a:text
> +	EOF
> +	cat >expect_t-commit <<-EOF &&
> +	t-commit:a:text
> +	EOF
> +	git grep "text" $commit >actual_commit &&
> +	test_cmp expect_commit actual_commit &&
> +	git grep "text" t-commit >actual_t-commit &&
> +	test_cmp expect_t-commit actual_t-commit
> +'
> +
> +test_expect_success 'grep <tree-ish> should ignore sparsity patterns' '
> +	commit=$(git rev-parse HEAD) &&
> +	tree=$(git rev-parse HEAD^{tree}) &&
> +	cat >expect_tree <<-EOF &&
> +	$tree:a:text
> +	$tree:b:text
> +	$tree:dir/c:text
> +	EOF
> +	cat >expect_t-tree <<-EOF &&
> +	t-tree:a:text
> +	t-tree:b:text
> +	t-tree:dir/c:text
> +	EOF
> +	git grep "text" $tree >actual_tree &&
> +	test_cmp expect_tree actual_tree &&
> +	git grep "text" t-tree >actual_t-tree &&
> +	test_cmp expect_t-tree actual_t-tree
> +'
> +
> +test_expect_success 'grep --recurse-submodules --cached should honor sparse checkout in submodule' '
> +	cat >expect <<-EOF &&
> +	a:text
> +	sub/B/b:text
> +	EOF
> +	git grep --recurse-submodules --cached "text" >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'grep --recurse-submodules <commit-ish> should honor sparse checkout in submodule' '
> +	commit=$(git rev-parse HEAD) &&
> +	cat >expect_commit <<-EOF &&
> +	$commit:a:text
> +	$commit:sub/B/b:text
> +	EOF
> +	cat >expect_t-commit <<-EOF &&
> +	t-commit:a:text
> +	t-commit:sub/B/b:text
> +	EOF
> +	git grep --recurse-submodules "text" $commit >actual_commit &&
> +	test_cmp expect_commit actual_commit &&
> +	git grep --recurse-submodules "text" t-commit >actual_t-commit &&
> +	test_cmp expect_t-commit actual_t-commit
> +'
> +
> +test_done

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH v2 2/4] config: load the correct config.worktree file
  2020-05-11 19:10     ` Junio C Hamano
@ 2020-05-12 22:55       ` Matheus Tavares Bernardino
  2020-05-12 23:22         ` Junio C Hamano
  0 siblings, 1 reply; 120+ messages in thread
From: Matheus Tavares Bernardino @ 2020-05-12 22:55 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Derrick Stolee, Elijah Newren, Jonathan Tan

On Mon, May 11, 2020 at 4:10 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Matheus Tavares <matheus.bernardino@usp.br> writes:
>
> > One of the steps in do_git_config_sequence() is to load the
> > worktree-specific config file. Although the function receives a git_dir
> > string, it relies on git_pathdup(), which uses the_repository->git_dir,
> > to make the path to the file. Thus, when a submodule has a worktree
> > setting, a command executed in the superproject that recurses into the
> > submodule won't find the said setting.
>
> This has far wider ramifications than just "git grep" and it may be
> an important fix.  Anything that wants to read from a per-worktree
> configuration is not working as expected when run from a secondary
> worktree, right?

Hmm, I think the code should be able to retrieve the per-worktree
configuration, in this case, as the_repository->gitdir will be
pointing to the secondary worktree's gitdir. But when we want to read
a per-worktree configuration from a repo other than the_repository,
then the code doesn't find the setting (even if it is in the main
worktree of the subrepo).
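
As a rough illustration of the difference (standalone C, not git's
code; the helper name worktree_config_path() is made up for this
example), building the path from an explicitly passed git_dir instead
of from process-wide state could look like this:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Build "<git_dir>/config.worktree" from an explicitly passed git_dir,
 * so the result does not depend on which repository is "current".
 * Returns a malloc'ed string that the caller must free, or NULL when
 * git_dir is NULL or the allocation fails.
 */
char *worktree_config_path(const char *git_dir)
{
	static const char suffix[] = "/config.worktree";
	size_t len;
	char *path;

	if (!git_dir)
		return NULL;

	/* sizeof(suffix) already counts the terminating NUL. */
	len = strlen(git_dir) + sizeof(suffix);
	path = malloc(len);
	if (!path)
		return NULL;
	snprintf(path, len, "%s%s", git_dir, suffix);
	return path;
}
```

Called with a submodule's gitdir, this yields the submodule's own
config.worktree path no matter which repository the command was
started in.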

> Can we add a test or two to protect this fix from future breakages?

Sure! There are already a couple tests, in the following patch, that
check this behavior *indirectly*. As we recurse into submodules, in
grep, we try to retrieve the core.sparseCheckout setting for each
submodule (which is stored in the subrepo's config.worktree file). The
said tests make sure we can get this setting, and they indeed fail
without this patch. But would it be better to also add a more direct
test, in this patch? I think we could do so by adding a new test
helper that prints submodules' configs, from the superproject, and
then testing the presence of per-worktree configs in the output.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH v2 2/4] config: load the correct config.worktree file
  2020-05-12 22:55       ` Matheus Tavares Bernardino
@ 2020-05-12 23:22         ` Junio C Hamano
  0 siblings, 0 replies; 120+ messages in thread
From: Junio C Hamano @ 2020-05-12 23:22 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: git, Derrick Stolee, Elijah Newren, Jonathan Tan

Matheus Tavares Bernardino <matheus.bernardino@usp.br> writes:

>> Can we add a test or two to protect this fix from future breakages?
>
> Sure! There are already a couple tests, in the following patch, that
> check this behavior *indirectly*. As we recurse into submodules, in
> grep, we try to retrieve the core.sparseCheckout setting for each
> submodule (which is stored in the subrepo's config.worktree file). The
> said tests make sure we can get this setting, and they indeed fail
> without this patch. But would it be better to also add a more direct
> test, in this patch? I think we could do so by adding a new test
> helper that prints submodules' configs, from the superproject, and
> then testing the presence of per-worktree configs in the output.

Sounds like a plan.  Yes, checking by observing how a grep that
recurses into submodules behaves is doable but indirect, and if any
other subcommand that recurses would hit the same issue that this
patch fixes, it's better to ensure in a more direct way that the fix
applies to any subcommand.

Thanks.


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH v2 3/4] grep: honor sparse checkout patterns
  2020-05-11 19:35     ` Junio C Hamano
@ 2020-05-13  0:05       ` Matheus Tavares Bernardino
  2020-05-13  0:17         ` Junio C Hamano
  2020-05-21  7:36       ` Elijah Newren
  1 sibling, 1 reply; 120+ messages in thread
From: Matheus Tavares Bernardino @ 2020-05-13  0:05 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Derrick Stolee, Elijah Newren, Jonathan Tan

On Mon, May 11, 2020 at 4:35 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Matheus Tavares <matheus.bernardino@usp.br> writes:
>
> > One of the main uses for a sparse checkout is to allow users to focus on
> > the subset of files in a repository in which they are interested. But
> > git-grep currently ignores the sparsity patterns and reports all matches
> > found outside this subset, which kind of goes in the opposite direction.
> > Let's fix that, making it honor the sparsity boundaries for every
> > grepping case:
> >
> > - git grep in worktree
> > - git grep --cached
> > - git grep $REVISION
>
> It makes sense for these to be limited within the "sparse" area.
>
> > - git grep --untracked and git grep --no-index (which already respect
> >   sparse checkout boundaries)
>
> I can understand the former; those untracked files are what _could_
> be brought into attention by "git add", so limiting to the same
> "sparse" area may make sense.
>
> I am not sure about the latter, though, as "--no-index" is an
> explicit request to pretend that we are dealing with a random
> collection of files, not managed in a git repository.  But perhaps
> there is a similar justification like how "--untracked" is
> justifiable.  I dunno.

Yeah, I think there was no need to mention those two cases here. My
intention was to say that, in these cases, we should stick to the
files that are present in the working tree (which should match the
sparsity patterns + untracked {and ignored, in --no-index}), as
opposed to how the worktree grep used to behave until now, falling
back to the cache on files excluded by the sparse checkout.

> > diff --git a/builtin/grep.c b/builtin/grep.c
> > index a5056f395a..91ee0b2734 100644
> > --- a/builtin/grep.c
> > +++ b/builtin/grep.c
> > @@ -410,7 +410,7 @@ static int grep_cache(struct grep_opt *opt,
> >                     const struct pathspec *pathspec, int cached);
> >  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> >                    struct tree_desc *tree, struct strbuf *base, int tn_len,
> > -                  int check_attr);
> > +                  int is_root_tree);
> >
> >  static int grep_submodule(struct grep_opt *opt,
> >                         const struct pathspec *pathspec,
> > @@ -508,6 +508,10 @@ static int grep_cache(struct grep_opt *opt,
> >
> >       for (nr = 0; nr < repo->index->cache_nr; nr++) {
> >               const struct cache_entry *ce = repo->index->cache[nr];
> > +
> > +             if (ce_skip_worktree(ce) && !S_ISGITLINK(ce->ce_mode))
> > +                     continue;
>
> Hmph.  Why exclude gitlink from this rule?  If a submodule sits at a
> path that is excluded by the sparse pattern, should we still recurse
> into it?

The idea behind not skipping gitlinks here was to be compliant with
what we have in the working tree. In 4fd683b ("sparse-checkout:
document interactions with submodules"), we decided that, if the
sparse-checkout patterns exclude a submodule, the submodule would
still appear in the working tree. The purpose was to keep these
features (submodules and sparse-checkout) independent. Along the same
lines, I think we should always recurse into initialized submodules in
grep, and then load their own sparsity patterns, to decide what should
be grepped within.
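
In other words, the rule for index entries boils down to the following
standalone sketch. The struct and the mode constants are defined
locally for illustration (they are stand-ins, not git's own headers),
and skip_entry() is a hypothetical name:

```c
#include <stdbool.h>

/*
 * File-type bits as stored in index entry modes, defined locally so
 * the sketch does not depend on git's headers.
 */
#define MODE_TYPE_MASK 0170000u
#define MODE_GITLINK   0160000u
#define MODE_REGULAR   0100000u

/* Hypothetical stand-in for the relevant parts of a cache entry. */
struct entry {
	unsigned int mode;
	bool skip_worktree;
};

static bool is_gitlink(unsigned int mode)
{
	return (mode & MODE_TYPE_MASK) == MODE_GITLINK;
}

/*
 * True when grep should pass over this index entry: it is outside the
 * sparse checkout and it is not a submodule. Submodules are always
 * visited, and their own sparsity patterns are applied inside them.
 */
bool skip_entry(const struct entry *e)
{
	return e->skip_worktree && !is_gitlink(e->mode);
}
```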

[...]
> > +static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> > +                  struct tree_desc *tree, struct strbuf *base, int tn_len,
> > +                  int is_root_tree)
> > +{
> > +     struct pattern_list *patterns = NULL;
> > +     int ret;
> > +
> > +     if (is_root_tree)
> > +             patterns = get_sparsity_patterns(opt->repo);
> > +
> > +     ret = do_grep_tree(opt, pathspec, tree, base, tn_len, is_root_tree,
> > +                        patterns, 0);
> > +
> > +     if (patterns) {
> > +             clear_pattern_list(patterns);
> > +             free(patterns);
> > +     }
>
> OK, it is not like this codepath is driven by "git log" to grep from
> top-level tree objects of many commits, so it is OK to grab the
> sparsity patterns once before do_grep_tree() and discard it when we
> are done.

Yeah. A possible performance problem here would be when users pass
many trees to git-grep (since we are reloading the pattern lists, from
both the_repository and submodules, for each tree). But, as Elijah
pointed out [1], the cases where this overhead might be somewhat
noticeable should be very rare.

[1]: https://lore.kernel.org/git/CABPp-BGUf-4exGW23xka1twf2D=nFOz1CkD_f-rDX_AGdVEeDA@mail.gmail.com/

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH v2 3/4] grep: honor sparse checkout patterns
  2020-05-13  0:05       ` Matheus Tavares Bernardino
@ 2020-05-13  0:17         ` Junio C Hamano
  2020-05-21  7:26           ` Elijah Newren
  0 siblings, 1 reply; 120+ messages in thread
From: Junio C Hamano @ 2020-05-13  0:17 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: git, Derrick Stolee, Elijah Newren, Jonathan Tan

Matheus Tavares Bernardino <matheus.bernardino@usp.br> writes:

> The idea behind not skipping gitlinks here was to be compliant with
> what we have in the working tree. In 4fd683b ("sparse-checkout:
> document interactions with submodules"), we decided that, if the
> sparse-checkout patterns exclude a submodule, the submodule would
> still appear in the working tree. The purpose was to keep these
> features (submodules and sparse-checkout) independent. Along the same
> lines, I think we should always recurse into initialized submodules in
> grep, and then load their own sparsity patterns, to decide what should
> be grepped within.

OK.  

I do not necessarily agree with the justification described in
4fd683b (e.g. an unsubstantiated "would easily cause problems." is
merely an opinion), but I do agree with you that
the new code in "git grep" we are discussing here does behave in
line with that design.

Thanks.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH v2 4/4] config: add setting to ignore sparsity patterns in some cmds
  2020-05-10  0:41   ` [RFC PATCH v2 4/4] config: add setting to ignore sparsity patterns in some cmds Matheus Tavares
  2020-05-10  4:23     ` Matheus Tavares Bernardino
@ 2020-05-21  7:09     ` Elijah Newren
  1 sibling, 0 replies; 120+ messages in thread
From: Elijah Newren @ 2020-05-21  7:09 UTC (permalink / raw)
  To: Matheus Tavares
  Cc: Git Mailing List, Junio C Hamano, Derrick Stolee, Jonathan Tan

Sorry for the late reply...and for responding in backwards order.

Great to see these newer patches!

On Sat, May 9, 2020 at 5:42 PM Matheus Tavares
<matheus.bernardino@usp.br> wrote:
>
> When sparse checkout is enabled, some users expect the output of certain
> commands (such as grep, diff, and log) to be also restricted within the
> sparsity patterns. This would allow them to effectively work only on the
> subset of files in which they are interested; and allow some commands to
> possibly perform better, by not considering uninteresting paths. For
> this reason, we taught grep to honor the sparsity patterns, in the
> previous commit. But, on the other hand, allowing grep and the other
> commands mentioned to optionally ignore the patterns also makes for some
> interesting use cases. E.g. using grep to search for a function
> definition that resides outside the sparse checkout.
>
> In any case, there is no current way for users to configure the behavior
> they want for these commands. Aiming to provide this flexibility, let's
> introduce the sparse.restrictCmds setting (and the analogous
> --[no-]restrict-to-sparse-paths global option). The default value is
> true. For now, grep is the only one affected by this setting, but the
> goal is to have support for more commands, in the future.
>
> Helped-by: Elijah Newren <newren@gmail.com>
> Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
> ---
>
> Some notes/questions about this one:
>
> - I guess having the additional sparse-checkout.o only for the
>   restrict_to_sparse_paths() function is not very justifiable.
>   Especially since builtin/grep.c is currently its only caller. But
>   since Stolee is already moving some code out of the sparse-checkout
>   builtin and into sparse-checkout.o [1], I thought it would be better
>   to place this function here from the start, as it will likely be
>   needed by other cmds when they start honoring sparse.restrictCmds.
>   (Side note: I think I will also be able to use the
>   populate_sparse_checkout_patterns() function added by Stolee in the
>   same patchset [2], to avoid code duplication in the
>   get_sparsity_patterns() function added in this patch).
>
> [1]: https://lore.kernel.org/git/0181a134bfb6986dc0e54ae624c478446a1324a9.1588857462.git.gitgitgadget@gmail.com/
> [2]: https://lore.kernel.org/git/444a6b5f894f28e96f713e5caccba18e1ea3b3eb.1588857462.git.gitgitgadget@gmail.com/

Seems reasonable to me.

> - With that said, the only reason we need restrict_to_sparse_paths() to
>   begin with, is so that commands which recurse into submodules may
>   respect the value set in each submodule for the sparse.restrictCmds
>   config. This is already being done for grep, in this patch. But,
>   should we do like this or should we use the value set at the
>   superproject, for all submodules as well, when recursing (ignoring the
>   value set on them)?

We have a few different types of files in git: tracked, untracked, and
ignored (though it's sometimes not clear if people are using untracked
to mean everything that isn't tracked, or if they are using it to mean
everything that is both not tracked and not ignored; it seems to
depend on the context).

The point of the sparsity patterns is to break the "tracked" category
into two subsets: those tracked files matching the sparsity patterns
and the tracked files that don't.  The reason for this subsetting is
it allows us to work with a smaller subset of a much larger
repository.

The thing about submodules is that the parent repository doesn't know
what the submodule tracks, it only has a commit id.  The submodule
itself knows which individual files it tracks in its own index.  If
the parent module doesn't even know which files the submodule tracks,
how is it supposed to be responsible for defining a subset of the
submodules' tracked files?  It seems like a layering violation to me.

So, I think you are right with grep to not override the submodules'
sparse.restrictCmds config.  For other commands that recurse into
submodules, if there are any relevant ones, I think they'd want to do
the same as you did for grep here.  But what other commands recurse
into submodules?  I can't think of any right now.  log doesn't, diff
doesn't, status doesn't.  The only ones I can think of right now are
clone and pull.  In the case of clone, the submodule doesn't exist yet
so can't have any setting yet.  In the case of pull, what would it do
with the setting anyway?  Do a partial fetch that ignores blobs
outside the sparse cone?  I think that'd be great...but wouldn't that
behavior of fetch be controlled by whether the user was in a partial
clone rather than any sparse-checkout setting?  (I have to admit I'm
not familiar with how partial clones work yet.)  [Later edit:] Also,
pull seems like more of a write operation, so see below.

> - It's possible to also make read-tree respect the new setting/option,
>   using --no-restrict-to-sparse-paths as a synonym for its
>   --no-sparse-checkout option (with lower precedence). However, as this
>   command can change the sparse checked out paths, I thought it kind
>   of falls under a different category. Also, `git read-tree -mu
>   --sparse-checkout` doesn't have the effect of *restricting* the
>   command's behavior to the sparsity patterns, but of applying them to
>   the working tree, right? So maybe it could be confusing to make this
>   command honor the new setting. Does that make sense, or should we do
>   it?

That's a good question; I hadn't considered read-tree before.  My gut
reaction is that these flags only affect read operations, not write
ones.  (And they don't affect all read operations; e.g. fsck is about
integrity checking, so fsck by default would check everything that was
downloaded and would only be limited in e.g. a partial clone -- but
that's a different kind of limit.)

For example, if we said these flags affected write operations, then as
soon as someone sets sparse.restrictCmds=false and then runs 'git
checkout $branch', then we would be forced to interpret
sparse.restrictCmds=false to mean we shouldn't pay attention to
sparsity patterns and thus should check out ALL files.  The user would
end up with a non-sparse tree really fast and would have to constantly
re-sparsify.  I think that's pretty clearly not the intention.  As
such, I think these flags are for controlling read operations like
grep/diff/log, and that neither read-tree nor checkout should be
affected by these flags.

> - Finally, if we decide to make read-tree be affected by
>   sparse.restrictCmds, there is also the case of whether the config
>   should be honored for submodules or just propagate the superproject's
>   value. I think the latter would be as simple as adding this line,
>   before calling parse_options() in builtin/read-tree.c:
>
>   opts.skip_sparse_checkout = !restrict_to_sparse_paths(the_repository);
>
>   As for the former, I'm not very familiar with the code in
>   unpack_trees(), so I'm not sure how complicated that would be.

As before, I don't think propagating the superproject's value makes
any sense.  However, I don't think making read-tree be affected by
sparse.restrictCmds makes sense either so it shouldn't matter.

>  Documentation/config.txt               |  2 +
>  Documentation/config/sparse.txt        | 22 ++++++++
>  Documentation/git-grep.txt             |  3 +
>  Documentation/git.txt                  |  4 ++
>  Makefile                               |  1 +
>  builtin/grep.c                         | 14 ++++-
>  contrib/completion/git-completion.bash |  2 +
>  git.c                                  |  6 ++
>  sparse-checkout.c                      | 16 ++++++
>  sparse-checkout.h                      | 11 ++++
>  t/t7817-grep-sparse-checkout.sh        | 78 +++++++++++++++++++++++++-
>  t/t9902-completion.sh                  |  4 +-
>  12 files changed, 159 insertions(+), 4 deletions(-)
>  create mode 100644 Documentation/config/sparse.txt
>  create mode 100644 sparse-checkout.c
>  create mode 100644 sparse-checkout.h
>
> diff --git a/Documentation/config.txt b/Documentation/config.txt
> index ef0768b91a..fd74b80302 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -436,6 +436,8 @@ include::config/sequencer.txt[]
>
>  include::config/showbranch.txt[]
>
> +include::config/sparse.txt[]
> +
>  include::config/splitindex.txt[]
>
>  include::config/ssh.txt[]
> diff --git a/Documentation/config/sparse.txt b/Documentation/config/sparse.txt
> new file mode 100644
> index 0000000000..83a4e0018f
> --- /dev/null
> +++ b/Documentation/config/sparse.txt
> @@ -0,0 +1,22 @@
> +sparse.restrictCmds::
> +       Only meaningful in conjunction with core.sparseCheckout. This option
> +       extends sparse checkouts (which limit which paths are written to the
> +       working tree), so that output and operations are also limited to the
> +       sparsity paths where possible and implemented. The purpose of this
> +       option is to (1) focus output for the user on the portion of the
> +       repository that is of interest to them, and (2) enable potentially
> +       dramatic performance improvements, especially in conjunction with
> +       partial clones.
> ++
> +When this option is true (default), some git commands may limit their behavior
> +to the paths specified by the sparsity patterns, or to the intersection of
> +those paths and any (like `*.c`) that the user might also specify on the command
> +line. When false, the affected commands will work on full trees, ignoring the
> +sparsity patterns. For now, only git-grep honors this setting. In this command,
> +the restriction becomes relevant in one of these three cases: with --cached;
> +when a commit-ish is given; when searching a working tree that contains paths
> +previously excluded by the sparsity patterns.
> ++
> +Note: commands which export, integrity check, or create history will always
> +operate on full trees (e.g. fast-export, format-patch, fsck, commit, etc.),
> +unaffected by any sparsity patterns.
> diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
> index 9bdf807584..abbf100109 100644
> --- a/Documentation/git-grep.txt
> +++ b/Documentation/git-grep.txt
> @@ -41,6 +41,9 @@ characters.  An empty string as search expression matches all lines.
>  CONFIGURATION
>  -------------
>
> +git-grep honors the sparse.restrictCmds setting. See its definition in
> +linkgit:git-config[1].
> +
>  :git-grep: 1
>  include::config/grep.txt[]
>
> diff --git a/Documentation/git.txt b/Documentation/git.txt
> index 9d6769e95a..5e107c6246 100644
> --- a/Documentation/git.txt
> +++ b/Documentation/git.txt
> @@ -180,6 +180,10 @@ If you just want to run git as if it was started in `<path>` then use
>         Do not perform optional operations that require locks. This is
>         equivalent to setting the `GIT_OPTIONAL_LOCKS` to `0`.
>
> +--[no-]restrict-to-sparse-paths::
> +       Overrides the sparse.restrictCmds configuration (see
> +       linkgit:git-config[1]) for this execution.
> +
>  --list-cmds=group[,group...]::
>         List commands by group. This is an internal/experimental
>         option and may change or be removed in the future. Supported
> diff --git a/Makefile b/Makefile
> index 3d3a39fc19..67580c691b 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -986,6 +986,7 @@ LIB_OBJS += sha1-name.o
>  LIB_OBJS += shallow.o
>  LIB_OBJS += sideband.o
>  LIB_OBJS += sigchain.o
> +LIB_OBJS += sparse-checkout.o
>  LIB_OBJS += split-index.o
>  LIB_OBJS += stable-qsort.o
>  LIB_OBJS += strbuf.o
> diff --git a/builtin/grep.c b/builtin/grep.c
> index 91ee0b2734..3f92e7fd6c 100644
> --- a/builtin/grep.c
> +++ b/builtin/grep.c
> @@ -25,6 +25,7 @@
>  #include "submodule-config.h"
>  #include "object-store.h"
>  #include "packfile.h"
> +#include "sparse-checkout.h"
>
>  static char const * const grep_usage[] = {
>         N_("git grep [<options>] [-e] <pattern> [<rev>...] [[--] <path>...]"),
> @@ -498,6 +499,7 @@ static int grep_cache(struct grep_opt *opt,
>         int nr;
>         struct strbuf name = STRBUF_INIT;
>         int name_base_len = 0;
> +       int sparse_paths_only = restrict_to_sparse_paths(repo);
>         if (repo->submodule_prefix) {
>                 name_base_len = strlen(repo->submodule_prefix);
>                 strbuf_addstr(&name, repo->submodule_prefix);
> @@ -509,7 +511,8 @@ static int grep_cache(struct grep_opt *opt,
>         for (nr = 0; nr < repo->index->cache_nr; nr++) {
>                 const struct cache_entry *ce = repo->index->cache[nr];
>
> -               if (ce_skip_worktree(ce) && !S_ISGITLINK(ce->ce_mode))
> +               if (sparse_paths_only && ce_skip_worktree(ce) &&
> +                   !S_ISGITLINK(ce->ce_mode))
>                         continue;
>
>                 strbuf_setlen(&name, name_base_len);
> @@ -717,9 +720,10 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>                      int is_root_tree)
>  {
>         struct pattern_list *patterns = NULL;
> +       int sparse_paths_only = restrict_to_sparse_paths(opt->repo);
>         int ret;
>
> -       if (is_root_tree)
> +       if (is_root_tree && sparse_paths_only)
>                 patterns = get_sparsity_patterns(opt->repo);
>
>         ret = do_grep_tree(opt, pathspec, tree, base, tn_len, is_root_tree,
> @@ -1259,6 +1263,12 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
>
>         if (!use_index || untracked) {
>                 int use_exclude = (opt_exclude < 0) ? use_index : !!opt_exclude;
> +
> +               if (opt_restrict_to_sparse_paths >= 0) {
> +                       warning(_("--[no-]restrict-to-sparse-paths is ignored"
> +                                 " with --no-index or --untracked"));

I think this should instead be
    die(_("--[no-]restrict-to-sparse-paths is incompatible with"
          " --no-index and --untracked"));

Restricting to sparse paths (or not) is about working with subsets of
tracked files (or all tracked files).  --no-index and --untracked are
about working with files that aren't tracked.  They just don't make
sense to combine.
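
A minimal sketch of that check, written as a standalone function so it
can be exercised outside of Git.  `toy_check_sparse_option()` and its
return codes are hypothetical; in builtin/grep.c the failing branch
would call die() rather than return.

```c
#include <assert.h>
#include <stdio.h>

enum toy_check { TOY_OK = 0, TOY_INCOMPATIBLE = -1 };

/*
 * opt_restrict: -1 if the option was not given, otherwise 0 or 1.
 * use_index:    0 when --no-index was given.
 * untracked:    nonzero when --untracked was given.
 */
static enum toy_check toy_check_sparse_option(int opt_restrict, int use_index,
					      int untracked, FILE *err)
{
	assert(err != NULL);
	if (opt_restrict >= 0 && (!use_index || untracked)) {
		/* Reject the combination instead of silently ignoring the option. */
		fprintf(err, "fatal: --[no-]restrict-to-sparse-paths is "
			"incompatible with --no-index and --untracked\n");
		return TOY_INCOMPATIBLE;
	}
	return TOY_OK;
}
```

The point of failing hard is that the option only ever selects among
tracked files, so a request that combines it with an untracked-files
search is more likely a mistake than something to warn about and carry
on.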

> +               }
> +
>                 hit = grep_directory(&opt, &pathspec, use_exclude, use_index);
>         } else if (0 <= opt_exclude) {
>                 die(_("--[no-]exclude-standard cannot be used for tracked contents"));
> diff --git a/contrib/completion/git-completion.bash b/contrib/completion/git-completion.bash
> index b1d6e5ebed..cba0f9166c 100644
> --- a/contrib/completion/git-completion.bash
> +++ b/contrib/completion/git-completion.bash
> @@ -3207,6 +3207,8 @@ __git_main ()
>                         --namespace=
>                         --no-replace-objects
>                         --help
> +                       --restrict-to-sparse-paths
> +                       --no-restrict-to-sparse-paths
>                         "
>                         ;;
>                 *)
> diff --git a/git.c b/git.c
> index 2e4efb4ff0..f967c75d9c 100644
> --- a/git.c
> +++ b/git.c
> @@ -37,6 +37,7 @@ const char git_more_info_string[] =
>            "See 'git help git' for an overview of the system.");
>
>  static int use_pager = -1;
> +int opt_restrict_to_sparse_paths = -1;
>
>  static void list_builtins(struct string_list *list, unsigned int exclude_option);
>
> @@ -310,6 +311,10 @@ static int handle_options(const char ***argv, int *argc, int *envchanged)
>                         } else {
>                                 exit(list_cmds(cmd));
>                         }
> +               } else if (!strcmp(cmd, "--restrict-to-sparse-paths")) {
> +                       opt_restrict_to_sparse_paths = 1;
> +               } else if (!strcmp(cmd, "--no-restrict-to-sparse-paths")) {
> +                       opt_restrict_to_sparse_paths = 0;
>                 } else {
>                         fprintf(stderr, _("unknown option: %s\n"), cmd);
>                         usage(git_usage_string);
> @@ -318,6 +323,7 @@ static int handle_options(const char ***argv, int *argc, int *envchanged)
>                 (*argv)++;
>                 (*argc)--;
>         }
> +
>         return (*argv) - orig_argv;
>  }
>
> diff --git a/sparse-checkout.c b/sparse-checkout.c
> new file mode 100644
> index 0000000000..9a9e50fd29
> --- /dev/null
> +++ b/sparse-checkout.c
> @@ -0,0 +1,16 @@
> +#include "cache.h"
> +#include "config.h"
> +#include "sparse-checkout.h"
> +
> +int restrict_to_sparse_paths(struct repository *repo)
> +{
> +       int ret;
> +
> +       if (opt_restrict_to_sparse_paths >= 0)
> +               return opt_restrict_to_sparse_paths;
> +
> +       if (repo_config_get_bool(repo, "sparse.restrictcmds", &ret))
> +               ret = 1;
> +
> +       return ret;
> +}
> diff --git a/sparse-checkout.h b/sparse-checkout.h
> new file mode 100644
> index 0000000000..1de3b588d8
> --- /dev/null
> +++ b/sparse-checkout.h
> @@ -0,0 +1,11 @@
> +#ifndef SPARSE_CHECKOUT_H
> +#define SPARSE_CHECKOUT_H
> +
> +struct repository;
> +
> +extern int opt_restrict_to_sparse_paths; /* from git.c */
> +
> +/* Whether or not cmds should restrict behavior on sparse paths, in this repo */
> +int restrict_to_sparse_paths(struct repository *repo);
> +
> +#endif /* SPARSE_CHECKOUT_H */
> diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> index 3bd67082eb..8509694bf1 100755
> --- a/t/t7817-grep-sparse-checkout.sh
> +++ b/t/t7817-grep-sparse-checkout.sh
> @@ -63,12 +63,28 @@ test_expect_success 'setup' '
>         test_path_is_file sub/B/b
>  '
>
> +# The two tests below check a special case: the sparsity patterns exclude '/b'
> +# and sparse checkout is enabled, but the path exists on the working tree (e.g.
> +# manually created after `git sparse-checkout init`). In this case, grep should
> +# honor --restrict-to-sparse-paths.
>  test_expect_success 'grep in working tree should honor sparse checkout' '
>         cat >expect <<-EOF &&
>         a:text
>         EOF
> +       echo newtext >b &&
>         git grep "text" >actual &&
> -       test_cmp expect actual
> +       test_cmp expect actual &&
> +       rm b
> +'
> +test_expect_success 'grep w/ --no-restrict-to-sparse-paths for sparsely excluded but present paths' '
> +       cat >expect <<-EOF &&
> +       a:text
> +       b:newtext
> +       EOF
> +       echo newtext >b &&
> +       git --no-restrict-to-sparse-paths grep "text" >actual &&
> +       test_cmp expect actual &&
> +       rm b
>  '
>
>  test_expect_success 'grep --cached should honor sparse checkout' '
> @@ -137,4 +153,64 @@ test_expect_success 'grep --recurse-submodules <commit-ish> should honor sparse
>         test_cmp expect_t-commit actual_t-commit
>  '
>
> +for cmd in 'git --no-restrict-to-sparse-paths grep' \
> +          'git -c sparse.restrictCmds=false grep' \
> +          'git -c sparse.restrictCmds=true --no-restrict-to-sparse-paths grep'
> +do
> +
> +       test_expect_success "$cmd --cached should ignore sparsity patterns" '
> +               cat >expect <<-EOF &&
> +               a:text
> +               b:text
> +               dir/c:text
> +               EOF
> +               $cmd --cached "text" >actual &&
> +               test_cmp expect actual
> +       '
> +
> +       test_expect_success "$cmd <commit-ish> should ignore sparsity patterns" '
> +               commit=$(git rev-parse HEAD) &&
> +               cat >expect_commit <<-EOF &&
> +               $commit:a:text
> +               $commit:b:text
> +               $commit:dir/c:text
> +               EOF
> +               cat >expect_t-commit <<-EOF &&
> +               t-commit:a:text
> +               t-commit:b:text
> +               t-commit:dir/c:text
> +               EOF
> +               $cmd "text" $commit >actual_commit &&
> +               test_cmp expect_commit actual_commit &&
> +               $cmd "text" t-commit >actual_t-commit &&
> +               test_cmp expect_t-commit actual_t-commit
> +       '
> +done
> +
> +test_expect_success 'should respect the sparse.restrictCmds values from submodules' '
> +       cat >expect <<-EOF &&
> +       a:text
> +       sub/A/a:text
> +       sub/B/b:text
> +       EOF
> +       git -C sub config sparse.restrictCmds false &&
> +       git grep --cached --recurse-submodules "text" >actual &&
> +       test_cmp expect actual &&
> +       git -C sub config --unset sparse.restrictCmds
> +'
> +
> +test_expect_success 'should propagate --[no-]restrict-to-sparse-paths to submodules' '
> +       cat >expect <<-EOF &&
> +       a:text
> +       b:text
> +       dir/c:text
> +       sub/A/a:text
> +       sub/B/b:text
> +       EOF
> +       git -C sub config sparse.restrictCmds true &&
> +       git --no-restrict-to-sparse-paths grep --cached --recurse-submodules "text" >actual &&
> +       test_cmp expect actual &&
> +       git -C sub config --unset sparse.restrictCmds
> +'
> +
>  test_done
> diff --git a/t/t9902-completion.sh b/t/t9902-completion.sh
> index 3c44af6940..a4a7767e06 100755
> --- a/t/t9902-completion.sh
> +++ b/t/t9902-completion.sh
> @@ -1473,6 +1473,8 @@ test_expect_success 'double dash "git" itself' '
>         --namespace=
>         --no-replace-objects Z
>         --help Z
> +       --restrict-to-sparse-paths Z
> +       --no-restrict-to-sparse-paths Z
>         EOF
>  '
>
> @@ -1515,7 +1517,7 @@ test_expect_success 'general options' '
>         test_completion "git --nam" "--namespace=" &&
>         test_completion "git --bar" "--bare " &&
>         test_completion "git --inf" "--info-path " &&
> -       test_completion "git --no-r" "--no-replace-objects "
> +       test_completion "git --no-rep" "--no-replace-objects "
>  '
>
>  test_expect_success 'general options plus command' '
> --
> 2.26.2

* Re: [RFC PATCH v2 3/4] grep: honor sparse checkout patterns
  2020-05-13  0:17         ` Junio C Hamano
@ 2020-05-21  7:26           ` Elijah Newren
  2020-05-21 17:35             ` Matheus Tavares Bernardino
  0 siblings, 1 reply; 120+ messages in thread
From: Elijah Newren @ 2020-05-21  7:26 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Matheus Tavares Bernardino, git, Derrick Stolee, Jonathan Tan

On Tue, May 12, 2020 at 5:17 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Matheus Tavares Bernardino <matheus.bernardino@usp.br> writes:
>
> > The idea behind not skipping gitlinks here was to be compliant with
> > what we have in the working tree. In 4fd683b ("sparse-checkout:
> > document interactions with submodules"), we decided that, if the
> > sparse-checkout patterns exclude a submodule, the submodule would
> > still appear in the working tree. The purpose was to keep these
> > features (submodules and sparse-checkout) independent. Along the same
> > lines, I think we should always recurse into initialized submodules in

Sorry if I missed it in the code, but do you check whether the
submodule is initialized before descending into it, or do you descend
into it based on it just being a submodule?

> > grep, and then load their own sparsity patterns, to decide what should
> > be grepped within.
>
> OK.
>
> I do not necessarily agree with the justification described in
> 4fd683b (e.g. "would easily cause problems." that is not
> substantiated is merely an opinion), but I do agree with you that
> the new code in "git grep" we are discussing here does behave in
> line with that design.
>
> Thanks.

I'm also a little worried by 4fd683b; are we headed towards a circular
reasoning of some sort?  In particular, sparse-checkout was written
assuming submodules might already be checked out.  I can see how
un-checking-out an existing submodule could raise fears of losing
untracked or ignored files within it, or stuff stored on other
branches, etc.  But that's not the only relevant case.  What if
someone runs:
   git clone --recurse-submodules --sparse=moduleA git.hosting.site:my/repo.git
In such a case, we don't have already checked out submodules.
Obviously, we should clone submodules that are within our sparsity
paths.  But should we automatically clone the submodules outside our
sparsity paths?  The logic presented in 4fd683b makes this
completely ambiguous.  ("It will appear if it's initialized."  Okay,
but do we initialize it?)

You may say that clone doesn't have a --sparse= flag right now.  So
let me change the example slightly.  What if someone runs
   git checkout --recurse-submodules $otherBranch
and $otherBranch adds a new submodule somewhere deep under a directory
excluded by the sparsity patterns (i.e. deep within a directory we
aren't interested in and don't have checked out).  Should the
submodule be checked out, i.e. should it be initialized?  Commit
4fd683b only says it will appear if it's initialized, but my whole
question is should we initialize it?

* Re: [RFC PATCH v2 3/4] grep: honor sparse checkout patterns
  2020-05-11 19:35     ` Junio C Hamano
  2020-05-13  0:05       ` Matheus Tavares Bernardino
@ 2020-05-21  7:36       ` Elijah Newren
  1 sibling, 0 replies; 120+ messages in thread
From: Elijah Newren @ 2020-05-21  7:36 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Matheus Tavares, Git Mailing List, Derrick Stolee, Jonathan Tan

On Mon, May 11, 2020 at 12:35 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Matheus Tavares <matheus.bernardino@usp.br> writes:
>
> > One of the main uses for a sparse checkout is to allow users to focus on
> > the subset of files in a repository in which they are interested. But
> > git-grep currently ignores the sparsity patterns and reports all matches
> > found outside this subset, which kind of goes in the opposite direction.
> > Let's fix that, making it honor the sparsity boundaries for every
> > grepping case:
> >
> > - git grep in worktree
> > - git grep --cached
> > - git grep $REVISION
>
> It makes sense for these to be limited within the "sparse" area.
>
> > - git grep --untracked and git grep --no-index (which already respect
> >   sparse checkout boundaries)
>
> I can understand the former; those untracked files are what _could_
> be brought into attention by "git add", so limiting to the same
> "sparse" area may make sense.
>
> I am not sure about the latter, though, as "--no-index" is an
> explicit request to pretend that we are dealing with a random
> collection of files, not managed in a git repository.  But perhaps
> there is a similar justification like how "--untracked" is
> unjustifiable.  I dunno.

I don't think it makes sense for sparsity patterns to affect either.
sparsity patterns are a way of splitting "tracked" files into two
subsets (those matching the sparsity paths and those that don't).
Therefore, flags that are about searching things that aren't tracked,
clearly don't have anything to do with sparsity patterns.

However, I think this was just a wording issue; in the subsequent
commit Matheus made it clear that he's not modifying the behavior of
grep --untracked or grep --no-index based on the presence or absence
of sparsity patterns.

> > diff --git a/builtin/grep.c b/builtin/grep.c
> > index a5056f395a..91ee0b2734 100644
> > --- a/builtin/grep.c
> > +++ b/builtin/grep.c
> > @@ -410,7 +410,7 @@ static int grep_cache(struct grep_opt *opt,
> >                     const struct pathspec *pathspec, int cached);
> >  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> >                    struct tree_desc *tree, struct strbuf *base, int tn_len,
> > -                  int check_attr);
> > +                  int is_root_tree);
> >
> >  static int grep_submodule(struct grep_opt *opt,
> >                         const struct pathspec *pathspec,
> > @@ -508,6 +508,10 @@ static int grep_cache(struct grep_opt *opt,
> >
> >       for (nr = 0; nr < repo->index->cache_nr; nr++) {
> >               const struct cache_entry *ce = repo->index->cache[nr];
> > +
> > +             if (ce_skip_worktree(ce) && !S_ISGITLINK(ce->ce_mode))
> > +                     continue;
>
> Hmph.  Why exclude gitlink from this rule?  If a submodule sits at a
> path that is excluded by the sparse pattern, should we still recurse
> into it?

That bothers me too.

> >               strbuf_setlen(&name, name_base_len);
> >               strbuf_addstr(&name, ce->name);
> >
> > @@ -520,8 +524,7 @@ static int grep_cache(struct grep_opt *opt,
> >                        * cache entry are identical, even if worktree file has
> >                        * been modified, so use cache version instead
> >                        */
> > -                     if (cached || (ce->ce_flags & CE_VALID) ||
> > -                         ce_skip_worktree(ce)) {
> > +                     if (cached || (ce->ce_flags & CE_VALID)) {
> >                               if (ce_stage(ce) || ce_intent_to_add(ce))
> >                                       continue;
> >                               hit |= grep_oid(opt, &ce->oid, name.buf,
> > @@ -552,9 +555,78 @@ static int grep_cache(struct grep_opt *opt,
> >       return hit;
> >  }
> >
> > -static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> > -                  struct tree_desc *tree, struct strbuf *base, int tn_len,
> > -                  int check_attr)
> > +static struct pattern_list *get_sparsity_patterns(struct repository *repo)
> > +{
> > +     struct pattern_list *patterns;
> > +     char *sparse_file;
> > +     int sparse_config, cone_config;
> > +
> > +     if (repo_config_get_bool(repo, "core.sparsecheckout", &sparse_config) ||
> > +         !sparse_config) {
> > +             return NULL;
> > +     }
> > +
> > +     sparse_file = repo_git_path(repo, "info/sparse-checkout");
> > +     patterns = xcalloc(1, sizeof(*patterns));
> > +
> > +     if (repo_config_get_bool(repo, "core.sparsecheckoutcone", &cone_config))
> > +             cone_config = 0;
> > +     patterns->use_cone_patterns = cone_config;
> > +
> > +     if (add_patterns_from_file_to_list(sparse_file, "", 0, patterns, NULL)) {
> > +             if (file_exists(sparse_file)) {
> > +                     warning(_("failed to load sparse-checkout file: '%s'"),
> > +                             sparse_file);
> > +             }
> > +             free(sparse_file);
> > +             free(patterns);
> > +             return NULL;
> > +     }
> > +
> > +     free(sparse_file);
> > +     return patterns;
> > +}
> > +
> > +static int in_sparse_checkout(struct strbuf *path, int prefix_len,
> > +                           unsigned int entry_mode,
> > +                           struct index_state *istate,
> > +                           struct pattern_list *sparsity,
> > +                           enum pattern_match_result parent_match,
> > +                           enum pattern_match_result *match)
> > +{
> > +     int dtype = DT_UNKNOWN;
> > +
> > +     if (S_ISGITLINK(entry_mode))
> > +             return 1;
>
> This is consistent with the "we do not care where a gitlink
> appears---submodules are always descended into, regardless of the
> sparse definition" decision we saw earlier, I think.  I am not sure
> if that is a good design in the first place, though.
>
> > +     if (parent_match == MATCHED_RECURSIVE) {
> > +             *match = parent_match;
> > +             return 1;
> > +     }
> > +
> > +     if (S_ISDIR(entry_mode) && !is_dir_sep(path->buf[path->len - 1]))
> > +             strbuf_addch(path, '/');
> > +
> > +     *match = path_matches_pattern_list(path->buf, path->len,
> > +                                        path->buf + prefix_len, &dtype,
> > +                                        sparsity, istate);
> > +     if (*match == UNDECIDED)
> > +             *match = parent_match;
> > +
> > +     if (S_ISDIR(entry_mode))
> > +             strbuf_trim_trailing_dir_sep(path);
> > +
> > +     if (*match == NOT_MATCHED && (S_ISREG(entry_mode) ||
> > +         (S_ISDIR(entry_mode) && sparsity->use_cone_patterns)))
> > +             return 0;
> > +
> > +     return 1;
> > +}
>
>
>
> > +static int do_grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> > +                     struct tree_desc *tree, struct strbuf *base, int tn_len,
> > +                     int check_attr, struct pattern_list *sparsity,
> > +                     enum pattern_match_result default_sparsity_match)
> >  {
> >       struct repository *repo = opt->repo;
> >       int hit = 0;
> > @@ -570,6 +642,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> >
> >       while (tree_entry(tree, &entry)) {
> >               int te_len = tree_entry_len(&entry);
> > +             enum pattern_match_result sparsity_match = 0;
> >
> >               if (match != all_entries_interesting) {
> >                       strbuf_addstr(&name, base->buf + tn_len);
> > @@ -586,6 +659,19 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> >
> >               strbuf_add(base, entry.path, te_len);
> >
> > +             if (sparsity) {
> > +                     struct strbuf path = STRBUF_INIT;
> > +                     strbuf_addstr(&path, base->buf + tn_len);
> > +
> > +                     if (!in_sparse_checkout(&path, old_baselen - tn_len,
> > +                                             entry.mode, repo->index,
> > +                                             sparsity, default_sparsity_match,
> > +                                             &sparsity_match)) {
> > +                             strbuf_setlen(base, old_baselen);
> > +                             continue;
> > +                     }
> > +             }
>
> OK.
>
> >               if (S_ISREG(entry.mode)) {
> >                       hit |= grep_oid(opt, &entry.oid, base->buf, tn_len,
> >                                        check_attr ? base->buf + tn_len : NULL);
> > @@ -602,8 +688,8 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> >
> >                       strbuf_addch(base, '/');
> >                       init_tree_desc(&sub, data, size);
> > -                     hit |= grep_tree(opt, pathspec, &sub, base, tn_len,
> > -                                      check_attr);
> > +                     hit |= do_grep_tree(opt, pathspec, &sub, base, tn_len,
> > +                                         check_attr, sparsity, sparsity_match);
> >                       free(data);
> >               } else if (recurse_submodules && S_ISGITLINK(entry.mode)) {
> >                       hit |= grep_submodule(opt, pathspec, &entry.oid,
> > @@ -621,6 +707,31 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> >       return hit;
> >  }
> >
> > +/*
> > + * Note: sparsity patterns and paths' attributes will only be considered if
> > + * is_root_tree has true value. (Otherwise, we cannot properly perform pattern
> > + * matching on paths.)
> > + */
> > +static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> > +                  struct tree_desc *tree, struct strbuf *base, int tn_len,
> > +                  int is_root_tree)
> > +{
> > +     struct pattern_list *patterns = NULL;
> > +     int ret;
> > +
> > +     if (is_root_tree)
> > +             patterns = get_sparsity_patterns(opt->repo);
> > +
> > +     ret = do_grep_tree(opt, pathspec, tree, base, tn_len, is_root_tree,
> > +                        patterns, 0);
> > +
> > +     if (patterns) {
> > +             clear_pattern_list(patterns);
> > +             free(patterns);
> > +     }
>
> OK, it is not like this codepath is driven by "git log" to grep from
> top-level tree objects of many commits, so it is OK to grab the
> sparsity patterns once before do_grep_tree() and discard it when we
> are done.
>
> > +     return ret;
> > +}
> > +
>
> >  static int grep_object(struct grep_opt *opt, const struct pathspec *pathspec,
> >                      struct object *obj, const char *name, const char *path)
> >  {
> > diff --git a/t/t7011-skip-worktree-reading.sh b/t/t7011-skip-worktree-reading.sh
> > index 37525cae3a..26852586ac 100755
> > --- a/t/t7011-skip-worktree-reading.sh
> > +++ b/t/t7011-skip-worktree-reading.sh
> > @@ -109,15 +109,6 @@ test_expect_success 'ls-files --modified' '
> >       test -z "$(git ls-files -m)"
> >  '
> >
> > -test_expect_success 'grep with skip-worktree file' '
> > -     git update-index --no-skip-worktree 1 &&
> > -     echo test > 1 &&
> > -     git update-index 1 &&
> > -     git update-index --skip-worktree 1 &&
> > -     rm 1 &&
> > -     test "$(git grep --no-ext-grep test)" = "1:test"
> > -'
> > -
> >  echo ":000000 100644 $ZERO_OID $EMPTY_BLOB A 1" > expected
> >  test_expect_success 'diff-index does not examine skip-worktree absent entries' '
> >       setup_absent &&
> > diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> > new file mode 100755
> > index 0000000000..3bd67082eb
> > --- /dev/null
> > +++ b/t/t7817-grep-sparse-checkout.sh
> > @@ -0,0 +1,140 @@
> > +#!/bin/sh
> > +
> > +test_description='grep in sparse checkout
> > +
> > +This test creates a repo with the following structure:
> > +
> > +.
> > +|-- a
> > +|-- b
> > +|-- dir
> > +|   `-- c
> > +`-- sub
> > +    |-- A
> > +    |   `-- a
> > +    `-- B
> > +     `-- b
> > +
> > +Where . has non-cone mode sparsity patterns and sub is a submodule with cone
> > +mode sparsity patterns. The resulting sparse-checkout should leave the following
> > +structure:
> > +
> > +.
> > +|-- a
> > +`-- sub
> > +    `-- B
> > +     `-- b
> > +'
> > +
> > +. ./test-lib.sh
> > +
> > +test_expect_success 'setup' '
> > +     echo "text" >a &&
> > +     echo "text" >b &&
> > +     mkdir dir &&
> > +     echo "text" >dir/c &&
> > +
> > +     git init sub &&
> > +     (
> > +             cd sub &&
> > +             mkdir A B &&
> > +             echo "text" >A/a &&
> > +             echo "text" >B/b &&
> > +             git add A B &&
> > +             git commit -m sub &&
> > +             git sparse-checkout init --cone &&
> > +             git sparse-checkout set B
> > +     ) &&
> > +
> > +     git submodule add ./sub &&
> > +     git add a b dir &&
> > +     git commit -m super &&
> > +     git sparse-checkout init --no-cone &&
> > +     git sparse-checkout set "/*" "!b" "!/*/" &&
> > +
> > +     git tag -am t-commit t-commit HEAD &&
> > +     tree=$(git rev-parse HEAD^{tree}) &&
> > +     git tag -am t-tree t-tree $tree &&
> > +
> > +     test_path_is_missing b &&
> > +     test_path_is_missing dir &&
> > +     test_path_is_missing sub/A &&
> > +     test_path_is_file a &&
> > +     test_path_is_file sub/B/b
> > +'
> > +
> > +test_expect_success 'grep in working tree should honor sparse checkout' '
> > +     cat >expect <<-EOF &&
> > +     a:text
> > +     EOF
> > +     git grep "text" >actual &&
> > +     test_cmp expect actual
> > +'
> > +
> > +test_expect_success 'grep --cached should honor sparse checkout' '
> > +     cat >expect <<-EOF &&
> > +     a:text
> > +     EOF
> > +     git grep --cached "text" >actual &&
> > +     test_cmp expect actual
> > +'
> > +
> > +test_expect_success 'grep <commit-ish> should honor sparse checkout' '
> > +     commit=$(git rev-parse HEAD) &&
> > +     cat >expect_commit <<-EOF &&
> > +     $commit:a:text
> > +     EOF
> > +     cat >expect_t-commit <<-EOF &&
> > +     t-commit:a:text
> > +     EOF
> > +     git grep "text" $commit >actual_commit &&
> > +     test_cmp expect_commit actual_commit &&
> > +     git grep "text" t-commit >actual_t-commit &&
> > +     test_cmp expect_t-commit actual_t-commit
> > +'
> > +
> > +test_expect_success 'grep <tree-ish> should ignore sparsity patterns' '
> > +     commit=$(git rev-parse HEAD) &&
> > +     tree=$(git rev-parse HEAD^{tree}) &&
> > +     cat >expect_tree <<-EOF &&
> > +     $tree:a:text
> > +     $tree:b:text
> > +     $tree:dir/c:text
> > +     EOF
> > +     cat >expect_t-tree <<-EOF &&
> > +     t-tree:a:text
> > +     t-tree:b:text
> > +     t-tree:dir/c:text
> > +     EOF
> > +     git grep "text" $tree >actual_tree &&
> > +     test_cmp expect_tree actual_tree &&
> > +     git grep "text" t-tree >actual_t-tree &&
> > +     test_cmp expect_t-tree actual_t-tree
> > +'
> > +
> > +test_expect_success 'grep --recurse-submodules --cached should honor sparse checkout in submodule' '
> > +     cat >expect <<-EOF &&
> > +     a:text
> > +     sub/B/b:text
> > +     EOF
> > +     git grep --recurse-submodules --cached "text" >actual &&
> > +     test_cmp expect actual
> > +'
> > +
> > +test_expect_success 'grep --recurse-submodules <commit-ish> should honor sparse checkout in submodule' '
> > +     commit=$(git rev-parse HEAD) &&
> > +     cat >expect_commit <<-EOF &&
> > +     $commit:a:text
> > +     $commit:sub/B/b:text
> > +     EOF
> > +     cat >expect_t-commit <<-EOF &&
> > +     t-commit:a:text
> > +     t-commit:sub/B/b:text
> > +     EOF
> > +     git grep --recurse-submodules "text" $commit >actual_commit &&
> > +     test_cmp expect_commit actual_commit &&
> > +     git grep --recurse-submodules "text" t-commit >actual_t-commit &&
> > +     test_cmp expect_t-commit actual_t-commit
> > +'
> > +
> > +test_done

* Re: [RFC PATCH v2 4/4] config: add setting to ignore sparsity patterns in some cmds
  2020-05-10  4:23     ` Matheus Tavares Bernardino
@ 2020-05-21 17:18       ` Elijah Newren
  0 siblings, 0 replies; 120+ messages in thread
From: Elijah Newren @ 2020-05-21 17:18 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: git, Junio C Hamano, Derrick Stolee, Jonathan Tan

On Sat, May 9, 2020 at 9:23 PM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> On Sat, May 9, 2020 at 9:42 PM Matheus Tavares
> <matheus.bernardino@usp.br> wrote:
> >
> > diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> > index 3bd67082eb..8509694bf1 100755
> > --- a/t/t7817-grep-sparse-checkout.sh
> > +++ b/t/t7817-grep-sparse-checkout.sh
> > @@ -63,12 +63,28 @@ test_expect_success 'setup' '
> >         test_path_is_file sub/B/b
> >  '
> >
> > +# The two tests below check a special case: the sparsity patterns exclude '/b'
> > +# and sparse checkout is enabled, but the path exists in the working tree (e.g.
> > +# manually created after `git sparse-checkout init`). In this case, grep should
> > +# honor --restrict-to-sparse-paths.
>
> I just want to highlight a small thing that I forgot to comment on:
> Elijah and I had already discussed --restrict-to-sparse-paths
> being relevant in grep only with --cached or when a commit-ish is
> given. But it had not occurred to me, before, the possibility of the
> special case mentioned above. I.e. when searching in the working tree
> and a path that should be excluded by the sparsity patterns is
> present. In this patch, I let --restrict-to-sparse-paths control the
> desired behavior for grep in this case too. But please, let me know if
> that doesn't seem like a good idea.

Wow, that is an interesting edge case.  But it can come up during a
merge or rebase or checkout -m, could be manually changed by various
plumbing commands, and might just not be enforced well in various
areas of the system (see e.g. [1]).  Perhaps the most interesting
case, given recent discussion, is submodules -- those might be left in
the working tree despite not matching sparsity paths.  So, should `git
-c sparse.restrictCmds=true grep PATTERN` look at these paths or not?
Currently, you've chosen contradictory answers -- yes to submodules,
and no to other entries.  I'm not certain here, but I've given it a
little thought and think there are a few things to take into
consideration:

Users are used to the fact that
    grep -r PATTERN *
searches existing files for PATTERN.  If you delete a file, then a
subsequent grep isn't going to search through it.  Similarly, git grep
is billed as a grep which limits searches to tracked files, thus they
expect
    git grep PATTERN
to search the files in their working copy, limited to those which are
tracked.  From this angle, I think users would be surprised
if `git grep` searched through deleted files, and they would also be
surprised if it ignored tracked and present files.

That is a basic answer, but let's go a bit further.  Since git grep
also has history at its disposal, it has more options.  For example:
    git grep PATTERN REVISION
means to search through all tracked files (those are the only kinds
that are recorded in revisions anyway) as of REVISION for the given
PATTERN, without checking it out.  Users probably expect this to
behave the same as:
    git checkout REVISION
    git grep PATTERN
and since checkout pays attention to sparsity rules, this is why we'd
want to have both "git grep PATTERN" and "git grep PATTERN REVISION"
pay attention to sparsity rules.

When we think in terms of "git grep PATTERN REVISION" as an optimized
version of "git checkout REVISION && git grep PATTERN" it puts us in
the frame of mind of asking the following question:
   For each path, would it be marked as SKIP_WORKTREE if we were to
check it out right now?  If so, we should skip it for the grepping.
Usually, the SKIP_WORKTREE bit is set for files if and only if they
don't match the sparsity patterns.  Also, we can't use the
SKIP_WORKTREE bit of the current index to decide whether to grep
through an old REVISION, because there are paths that exist in the
old revision that don't exist in the current index.  The sparsity
rules are the only things that can tell us whether such a path would
be marked as SKIP_WORKTREE if we were to check it out.  So it makes
sense to use the sparsity patterns when looking at REVISIONS.  When
dealing with the current worktree, we can check SKIP_WORKTREE
directly.  Usually that'll give the same answer as asking the sparsity
rules but as per [1] the two aren't always identical.  Rather than
asking "Would we mark this as SKIP_WORKTREE if we were to check out
this version right now?", perhaps we should ask "Since we have this
version checked out right now, let's just check the path directly.  Is
it marked as SKIP_WORKTREE?".
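
To make that distinction concrete, here is a minimal sketch in C.  It
is not code from the patch or from git: `struct toy_entry`,
`toy_matches_sparsity()` and `should_grep()` are hypothetical names,
and the matcher is a deliberately crude stand-in (prefix exclusion)
for real sparsity pattern matching.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an index entry; not git's struct cache_entry. */
struct toy_entry {
	const char *path;
	bool skip_worktree;	/* the SKIP_WORKTREE bit recorded in the index */
};

enum toy_source {
	SRC_INDEX,	/* worktree or --cached search: an index entry exists */
	SRC_REVISION	/* search of an old revision: only a tree is available */
};

/*
 * Toy replacement for sparsity pattern matching: a path "matches" unless
 * it starts with one of the excluded prefixes.
 */
bool toy_matches_sparsity(const char *path,
			  const char *const *excluded, size_t nr)
{
	size_t i;

	assert(path);
	for (i = 0; i < nr; i++)
		if (!strncmp(path, excluded[i], strlen(excluded[i])))
			return false;
	return true;
}

/*
 * Index-backed searches trust the bit that is already recorded.  Revision
 * searches have no index entry to consult, so they ask the patterns whether
 * the path would be marked SKIP_WORKTREE after a checkout.
 */
bool should_grep(enum toy_source src, const struct toy_entry *ce,
		 const char *path,
		 const char *const *excluded, size_t nr)
{
	if (src == SRC_INDEX)
		return !ce->skip_worktree;
	return toy_matches_sparsity(path, excluded, nr);
}
```

With "dir/" excluded, an index entry for dir/c whose bit happens to be
unset would still be searched, while the same path in an old revision
would be skipped because only the patterns can be consulted there.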

Does that sound reasonable?

[1] https://lore.kernel.org/git/xmqqbmb1a7ga.fsf@gitster-ct.c.googlers.com/

* Re: [RFC PATCH v2 3/4] grep: honor sparse checkout patterns
  2020-05-21  7:26           ` Elijah Newren
@ 2020-05-21 17:35             ` Matheus Tavares Bernardino
  2020-05-21 17:52               ` Elijah Newren
  0 siblings, 1 reply; 120+ messages in thread
From: Matheus Tavares Bernardino @ 2020-05-21 17:35 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Junio C Hamano, git, Derrick Stolee, Jonathan Tan

On Thu, May 21, 2020 at 4:26 AM Elijah Newren <newren@gmail.com> wrote:
>
> On Tue, May 12, 2020 at 5:17 PM Junio C Hamano <gitster@pobox.com> wrote:
> >
> > Matheus Tavares Bernardino <matheus.bernardino@usp.br> writes:
> >
> > > The idea behind not skipping gitlinks here was to be compliant with
> > > what we have in the working tree. In 4fd683b ("sparse-checkout:
> > > document interactions with submodules"), we decided that, if the
> > > sparse-checkout patterns exclude a submodule, the submodule would
> > > still appear in the working tree. The purpose was to keep these
> > > features (submodules and sparse-checkout) independent. Along the same
> > > lines, I think we should always recurse into initialized submodules in
>
> Sorry if I missed it in the code, but do you check whether the
> submodule is initialized before descending into it, or do you descend
> into it based on it just being a submodule?

We only descend if the submodule is initialized. The new code in this
patch doesn't do this check, but it is already implemented in
grep_submodule() (which is called by grep_tree() and grep_cache() when
a submodule is found).

* Re: [RFC PATCH v2 3/4] grep: honor sparse checkout patterns
  2020-05-21 17:35             ` Matheus Tavares Bernardino
@ 2020-05-21 17:52               ` Elijah Newren
  2020-05-22  5:49                 ` Matheus Tavares Bernardino
  0 siblings, 1 reply; 120+ messages in thread
From: Elijah Newren @ 2020-05-21 17:52 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: Junio C Hamano, git, Derrick Stolee, Jonathan Tan

On Thu, May 21, 2020 at 10:36 AM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> On Thu, May 21, 2020 at 4:26 AM Elijah Newren <newren@gmail.com> wrote:
> >
> > On Tue, May 12, 2020 at 5:17 PM Junio C Hamano <gitster@pobox.com> wrote:
> > >
> > > Matheus Tavares Bernardino <matheus.bernardino@usp.br> writes:
> > >
> > > > The idea behind not skipping gitlinks here was to be compliant with
> > > > what we have in the working tree. In 4fd683b ("sparse-checkout:
> > > > document interactions with submodules"), we decided that, if the
> > > > sparse-checkout patterns exclude a submodule, the submodule would
> > > > still appear in the working tree. The purpose was to keep these
> > > > features (submodules and sparse-checkout) independent. Along the same
> > > > lines, I think we should always recurse into initialized submodules in
> >
> > Sorry if I missed it in the code, but do you check whether the
> > submodule is initialized before descending into it, or do you descend
> > into it based on it just being a submodule?
>
> We only descend if the submodule is initialized. The new code in this
> patch doesn't do this check, but it is already implemented in
> grep_submodule() (which is called by grep_tree() and grep_cache() when
> a submodule is found).

Good to know.  To up the ante a bit: What if another branch has a
directory that doesn't exist in HEAD or the current checkout, and
within that directory is a submodule?  Would it be recursed into?
What if it matched the sparsity paths?  (Is it even possible to
recurse into it?)

* Re: [RFC PATCH v2 3/4] grep: honor sparse checkout patterns
  2020-05-21 17:52               ` Elijah Newren
@ 2020-05-22  5:49                 ` Matheus Tavares Bernardino
  2020-05-22 14:26                   ` Elijah Newren
  0 siblings, 1 reply; 120+ messages in thread
From: Matheus Tavares Bernardino @ 2020-05-22  5:49 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Junio C Hamano, git, Derrick Stolee, Jonathan Tan

On Thu, May 21, 2020 at 2:52 PM Elijah Newren <newren@gmail.com> wrote:
>
> On Thu, May 21, 2020 at 10:36 AM Matheus Tavares Bernardino
> <matheus.bernardino@usp.br> wrote:
> >
> > On Thu, May 21, 2020 at 4:26 AM Elijah Newren <newren@gmail.com> wrote:
> > >
> > > On Tue, May 12, 2020 at 5:17 PM Junio C Hamano <gitster@pobox.com> wrote:
> > > >
> > > > Matheus Tavares Bernardino <matheus.bernardino@usp.br> writes:
> > > >
> > > > > The idea behind not skipping gitlinks here was to be compliant with
> > > > > what we have in the working tree. In 4fd683b ("sparse-checkout:
> > > > > document interactions with submodules"), we decided that, if the
> > > > > sparse-checkout patterns exclude a submodule, the submodule would
> > > > > still appear in the working tree. The purpose was to keep these
> > > > > features (submodules and sparse-checkout) independent. Along the same
> > > > > lines, I think we should always recurse into initialized submodules in
> > >
> > > Sorry if I missed it in the code, but do you check whether the
> > > submodule is initialized before descending into it, or do you descend
> > > into it based on it just being a submodule?
> >
> > We only descend if the submodule is initialized. The new code in this
> > patch doesn't do this check, but it is already implemented in
> > grep_submodule() (which is called by grep_tree() and grep_cache() when
> > a submodule is found).
>
> Good to know.  To up the ante a bit: What if another branch has a
> directory that doesn't exist in HEAD or the current checkout, and
> within that directory is a submodule?  Would it be recursed into?

In this case, `git grep --recurse-submodules <pattern> $branch` will
recurse into the submodule, but only if it has already been
initialized, i.e. if we have checked out $branch, run `git
submodule init`, and then switched back.

> What if it matched the sparsity paths?  (Is it even possible to
> recurse into it?)

That's a great question. The idea that I tried to implement is to
always recurse into _initialized_ submodules (even the ones excluded
by the superproject's sparsity patterns) and, then, follow their own
sparsity patterns inside. I'm not necessarily in favor (or against)
this behavior, but this seemed to be the most compatible way with the
design we describe in our docs:

"If your sparse-checkout patterns exclude an initialized submodule,
then that submodule will still appear in your working directory." (in
git-sparse-checkout.txt)

So, back to the original question, if you run `git grep
--recurse-submodules <pattern> $branch` and $branch contains a
submodule which was previously initialized, git-grep _would_ recurse
into it, even if it (or its parent dir) was excluded. However, your
question helped me notice an inconsistency in my patch: the behavior I
just described is working for the full pattern set, but not in cone
mode. That's because, in cone mode, we can mark the whole submodule's
parent dir as excluded. Then, path_matches_pattern_list() will return
NOT_MATCHED for the parent dir and we won't recurse into it, so we
won't even get to the submodule's path to discover that it refers to a
gitlink.

Therefore, if we decide to keep the behavior of always recursing into
submodules, we will need some extra work for the cone mode. I.e.
grep_tree() will have to check if NOT_MATCHED directories contain
submodules before discarding them, and recurse only into the
submodules if so. As for the implementation, the first idea that came
to my mind was to list the submodules' pathnames and do prefix
matching for each submodule and NOT_MATCHED dir. But the places I've
seen such submodule listings in the code base so far [1] seem to work
only in the current branch. My second idea was to continue the tree
walk when we hit NOT_MATCHED dir entries, but not doing any work, just
looking for possible gitlinks to recurse into. I'm not sure if that
could negatively affect the execution time, though.
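
To illustrate that second idea, here is a rough sketch in C.  It is
not taken from the patch: `struct toy_node`, `struct toy_counts` and
`toy_walk()` are hypothetical, the `excluded` flag stands in for a
NOT_MATCHED result from the sparsity patterns, and every gitlink is
assumed to be an initialized submodule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_kind { TOY_BLOB, TOY_TREE, TOY_GITLINK };

/* Hypothetical in-memory tree entry; not git's struct tree_desc. */
struct toy_node {
	const char *name;
	enum toy_kind kind;
	bool excluded;	/* pretend the sparsity patterns said NOT_MATCHED */
	const struct toy_node *children;
	size_t nr_children;
};

struct toy_counts {
	size_t blobs_searched;
	size_t submodules_entered;
};

/*
 * Walk the tree.  Once inside an excluded directory ("pruned" is true),
 * blobs are no longer searched, but the walk keeps going so that gitlinks
 * below the excluded directory are still discovered.
 */
void toy_walk(const struct toy_node *node, bool pruned,
	      struct toy_counts *out)
{
	size_t i;

	assert(node && out);
	pruned = pruned || node->excluded;

	switch (node->kind) {
	case TOY_BLOB:
		if (!pruned)
			out->blobs_searched++;
		break;
	case TOY_GITLINK:
		/* Submodules are entered even below excluded directories. */
		out->submodules_entered++;
		break;
	case TOY_TREE:
		for (i = 0; i < node->nr_children; i++)
			toy_walk(&node->children[i], pruned, out);
		break;
	}
}
```

The cost concern is visible in the TOY_TREE case: an excluded
directory is still walked in full, only without searching its blobs.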

Does this seem like a good approach? Or is there another solution that
I have not considered? Or even further, should we choose to skip the
submodules in excluded paths? My only concern in this case is that it
would be contrary to the design in git-sparse-checkout.txt. And the
working tree grep and cached grep would differ even on a clean working
tree.

[1]: builtin/submodule--helper.c:module_list_compute() and
submodule-config.c:config_from_gitmodules()

* Re: [RFC PATCH v2 3/4] grep: honor sparse checkout patterns
  2020-05-22  5:49                 ` Matheus Tavares Bernardino
@ 2020-05-22 14:26                   ` Elijah Newren
  2020-05-22 15:36                     ` Elijah Newren
  2020-06-10 11:40                     ` Derrick Stolee
  0 siblings, 2 replies; 120+ messages in thread
From: Elijah Newren @ 2020-05-22 14:26 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: git, Junio C Hamano, Derrick Stolee, Jonathan Tan, Elijah Newren

Hi Matheus,

On Thu, May 21, 2020 at 10:49 PM Matheus Tavares Bernardino <matheus.bernardino@usp.br> wrote:
>
> On Thu, May 21, 2020 at 2:52 PM Elijah Newren <newren@gmail.com> wrote:
> >
<snip>
> > Good to know.  To up the ante a bit: What if another branch has a
> > directory that doesn't exist in HEAD or the current checkout, and
> > within that directory is a submodule?  Would it be recursed into?
>
> In this case, `git grep --recurse-submodules <pattern> $branch` will
> recurse into the submodule, but only if it has already been
> initialized, i.e. if we have checked out $branch, run `git
> submodule init`, and then switched back.
>
> > What if it matched the sparsity paths?  (Is it even possible to
> > recurse into it?)
>
> That's a great question. The idea that I tried to implement is to
> always recurse into _initialized_ submodules (even the ones excluded
> by the superproject's sparsity patterns) and, then, follow their own
> sparsity patterns inside. I'm not necessarily in favor (or against)
> this behavior, but this seemed to be the most compatible way with the
> design we describe in our docs:
>
> "If your sparse-checkout patterns exclude an initialized submodule,
> then that submodule will still appear in your working directory." (in
> git-sparse-checkout.txt)
>
> So, back to the original question, if you run `git grep
> --recurse-submodules <pattern> $branch` and $branch contains a
> submodule which was previously initialized, git-grep _would_ recurse
> into it, even if it (or its parent dir) was excluded. However, your
> question helped me notice an inconsistency in my patch: the behavior I
> just described is working for the full pattern set, but not in cone
> mode. That's because, in cone mode, we can mark the whole submodule's
> parent dir as excluded. Then, path_matches_pattern_list() will return
> NOT_MATCHED for the parent dir and we won't recurse into it, so we
> won't even get to the submodule's path to discover that it refers to a
> gitlink.
>
> Therefore, if we decide to keep the behavior of always recursing into
> submodules, we will need some extra work for the cone mode. I.e.
> grep_tree() will have to check if NOT_MATCHED directories contain
> submodules before discarding them, and recurse only into the
> submodules if so. As for the implementation, the first idea that came
> to my mind was to list the submodules' pathnames and do prefix
> matching for each submodule and NOT_MATCHED dir. But the places I've
> seen such submodule listings in the code base so far [1] seem to work
> only in the current branch. My second idea was to continue the tree
> walk when we hit NOT_MATCHED dir entries, but not doing any work, just
> looking for possible gitlinks to recurse into. I'm not sure if that
> could negatively affect the execution time, though.
>
> Does this seem like a good approach? Or is there another solution that
> I have not considered? Or even further, should we choose to skip the
> submodules in excluded paths? My only concern in this case is that it
> would be contrary to the design in git-sparse-checkout.txt. And the
> working tree grep and cached grep would differ even on a clean working
> tree.

To be honest, I think it sounds insane.  What you propose does make
sense if you take what was written in git-sparse-checkout.txt very
literally, as though it were a core design principle meant to cover
all cases, but I do not think it merits such standing at all.  I
think it should be treated as a first-draft attempt to explain
interactions, written solely with the 'checkout' case in mind,
especially since it was written at approximately the same time as this
warning earlier in the same file:

    THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR, AND THE BEHAVIOR OF
    OTHER COMMANDS IN THE PRESENCE OF SPARSE-CHECKOUTS, WILL LIKELY
    CHANGE IN THE FUTURE.

Anyway, the wording in that file seems to be really important, so
let's fix it.

-- >8 --
Subject: [PATCH] git-sparse-checkout: clarify interactions with submodules

Ignoring the sparse-checkout feature momentarily, if one has a submodule and
creates local branches within it with unpushed changes and maybe adds some
untracked files to it, then we would want to avoid accidentally removing such
a submodule.  So, for example with git.git, if you run
   git checkout v2.13.0
then the sha1collisiondetection/ submodule is NOT removed even though it
did not exist as a submodule until v2.14.0.  Similarly, if you only had
v2.13.0 checked out previously and ran
   git checkout v2.14.0
the sha1collisiondetection/ submodule would NOT be automatically
initialized despite being part of v2.14.0.  In both cases, git requires
submodules to be initialized or deinitialized separately.  Further, we
also have special handling for submodules in other commands such as
clean, which requires two --force flags to delete untracked submodules,
and some commands have a --recurse-submodules flag.

sparse-checkout is very similar to checkout, as evidenced by the similar
name -- it adds and removes files from the working copy.  However, for
the same avoid-data-loss reasons we do not want to remove a submodule
from the working copy with checkout, we do not want to do it with
sparse-checkout either.  So submodules need to be separately initialized
or deinitialized; changing sparse-checkout rules should not
automatically trigger the removal or vivification of submodules.

I believe the previous wording in git-sparse-checkout.txt about
submodules was only about this particular issue.  Unfortunately, the
previous wording could be interpreted to imply that submodules should be
considered active regardless of sparsity patterns.  Update the wording
to avoid making such an implication.  It may be helpful to consider two
example situations where the differences in wording become important:

In the future, we want users to be able to run commands like
   git clone --sparse=moduleA --recurse-submodules $REPO_URL
and have sparsity paths automatically set up and have submodules *within
the sparsity paths* be automatically initialized.  We do not want all
submodules in any path to be automatically initialized with that
command.

Similarly, we want to be able to do things like
   git -c sparse.restrictCmds grep --recurse-submodules $PATTERN $REV
and search through $REV for $PATTERN within the recorded sparsity
patterns.  We want it to recurse into submodules within those sparsity
patterns, but do not want to recurse into directories that do not match
the sparsity patterns in search of a possible submodule.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/git-sparse-checkout.txt | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/Documentation/git-sparse-checkout.txt b/Documentation/git-sparse-checkout.txt
index c0342e5393..7dde2d330c 100644
--- a/Documentation/git-sparse-checkout.txt
+++ b/Documentation/git-sparse-checkout.txt
@@ -190,10 +190,23 @@ directory.
 SUBMODULES
 ----------
 
-If your repository contains one or more submodules, then those submodules will
-appear based on which you initialized with the `git submodule` command. If
-your sparse-checkout patterns exclude an initialized submodule, then that
-submodule will still appear in your working directory.
+If your repository contains one or more submodules, then those submodules
+will appear based on which you initialized with the `git submodule`
+command.  Submodules may have additional untracked files or code stored on
+other branches, so to avoid data loss, changing sparse inclusion/exclusion
+rules will not cause an already checked out submodule to be removed from
+the working copy.  Said another way, just as `checkout` will not cause
+submodules to be automatically removed or initialized even when switching
+between branches that remove or add submodules, using `sparse-checkout` to
+reduce or expand the scope of "interesting" files will not cause submodules
+to be automatically deinitialized or initialized either.  Adding or
+removing them must be done as a separate step with `git submodule init` or
+`git submodule deinit`.
+
+This may mean that even if your sparsity patterns include or exclude
+submodules, until you manually initialize or deinitialize them, commands
+like grep that work on tracked files in the working copy will ignore "not
+yet initialized" submodules and pay attention to "left behind" ones.
 
 
 SEE ALSO
-- 
2.26.1.250.g8bb771e84c


* Re: [RFC PATCH v2 3/4] grep: honor sparse checkout patterns
  2020-05-22 14:26                   ` Elijah Newren
@ 2020-05-22 15:36                     ` Elijah Newren
  2020-05-22 20:54                       ` Matheus Tavares Bernardino
  2020-06-10 11:40                     ` Derrick Stolee
  1 sibling, 1 reply; 120+ messages in thread
From: Elijah Newren @ 2020-05-22 15:36 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: Git Mailing List, Junio C Hamano, Derrick Stolee, Jonathan Tan

On Fri, May 22, 2020 at 7:26 AM Elijah Newren <newren@gmail.com> wrote:
>
> Hi Matheus,
>
> On Thu, May 21, 2020 at 10:49 PM Matheus Tavares Bernardino <matheus.bernardino@usp.br> wrote:
> >
> > On Thu, May 21, 2020 at 2:52 PM Elijah Newren <newren@gmail.com> wrote:
> > >
<snip>
> > Does this seem like a good approach? Or is there another solution that
> > I have not considered? Or even further, should we choose to skip the
> > submodules in excluded paths? My only concern in this case is that it
> > would be contrary to the design in git-sparse-checkout.txt. And the
> > working tree grep and cached grep would differ even on a clean working
> > tree.
>
<snip>
> Anyway, the wording in that file seems to be really important, so
> let's fix it.
>

Let me also try to give a concrete proposal for grep behavior for the
edge cases we've discussed:

git -c sparse.restrictCmds=true grep --recurse-submodules $PATTERN

This goes through all the files in the index (i.e. all tracked files)
which do not have the SKIP_WORKTREE bit set.  For each of these: If
the file is a symlink, ignore it (like grep currently does).  If the
file is a regular file and is present in the working copy, search it.
If the file is a submodule and it is initialized, recurse into it.

git -c sparse.restrictCmds=true grep --recurse-submodules --cached $PATTERN

This goes through all the files in the index (i.e. all tracked files)
which do not have the SKIP_WORKTREE bit set.  For each of these: Skip
symlinks.  Search regular files.  Recurse into submodules if they are
initialized.

git -c sparse.restrictCmds=true grep --recurse-submodules $PATTERN $REVISION

This goes through all the files in the given revision (i.e. all
tracked files) which match the sparsity patterns (i.e. that would not
have the SKIP_WORKTREE bit set if we were to check out that commit).
For each of these: Skip symlinks.  Search regular files.  Recurse into
submodules if they are initialized.


Further, for any of these, when recursing into submodules, make sure
to load each submodule's core.sparseCheckout setting (and related
settings) and its sparsity patterns, if any.
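
As a compact restatement of the three cases, here is a minimal sketch
in C.  It is only an illustration of the rules as worded above, not an
implementation: `struct toy_candidate` and `toy_action_for()` are
hypothetical names, and `in_sparse_set` is assumed to have been
computed already (SKIP_WORKTREE bit unset for index-backed searches,
sparsity patterns matched for a revision).

```c
#include <assert.h>
#include <stdbool.h>

enum toy_type { TOY_SYMLINK, TOY_REGULAR, TOY_SUBMODULE };
enum toy_action { TOY_SKIP, TOY_SEARCH, TOY_RECURSE };

/* Hypothetical summary of one tracked path; not a git data structure. */
struct toy_candidate {
	enum toy_type type;
	bool in_sparse_set;	/* bit unset (index) or patterns match (revision) */
	bool present;		/* regular file exists in the working copy */
	bool initialized;	/* submodule has been initialized */
};

/*
 * Decide what to do with one path.  "worktree_search" is true only for a
 * plain working tree search; --cached and revision searches read blobs and
 * therefore do not care whether the file is present on disk.
 */
enum toy_action toy_action_for(const struct toy_candidate *c,
			       bool worktree_search)
{
	assert(c);
	if (!c->in_sparse_set)
		return TOY_SKIP;

	switch (c->type) {
	case TOY_SYMLINK:
		return TOY_SKIP;
	case TOY_REGULAR:
		if (worktree_search && !c->present)
			return TOY_SKIP;
		return TOY_SEARCH;
	case TOY_SUBMODULE:
		return c->initialized ? TOY_RECURSE : TOY_SKIP;
	}
	return TOY_SKIP;
}
```

The only difference between the three invocations is how
`in_sparse_set` is derived and whether presence on disk matters; the
per-type handling is the same in all of them.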

Sound good?

I think this addresses the edge cases we've discussed so far:
interaction between submodules and sparsity patterns, and handling of
files that are still present despite not matching the sparsity
patterns. (Also note that files which are present-despite-the-rules
are prone to be removed by the next `git sparse-checkout reapply` or
anything that triggers a call to unpack_trees(); there are already
multiple things that do, and Stolee's proposed patches would add more.)
If I've missed edge cases, let me know.


Elijah

* Re: [RFC PATCH v2 3/4] grep: honor sparse checkout patterns
  2020-05-22 15:36                     ` Elijah Newren
@ 2020-05-22 20:54                       ` Matheus Tavares Bernardino
  2020-05-22 21:06                         ` Elijah Newren
  0 siblings, 1 reply; 120+ messages in thread
From: Matheus Tavares Bernardino @ 2020-05-22 20:54 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git Mailing List, Junio C Hamano, Derrick Stolee, Jonathan Tan

Hi, Elijah

On Fri, May 22, 2020 at 12:36 PM Elijah Newren <newren@gmail.com> wrote:
>
> On Fri, May 22, 2020 at 7:26 AM Elijah Newren <newren@gmail.com> wrote:
> >
> > Hi Matheus,
> >
> > On Thu, May 21, 2020 at 10:49 PM Matheus Tavares Bernardino <matheus.bernardino@usp.br> wrote:
> > >
> > > On Thu, May 21, 2020 at 2:52 PM Elijah Newren <newren@gmail.com> wrote:
> > > >
> <snip>
> > > Does this seem like a good approach? Or is there another solution that
> > > I have not considered? Or even further, should we choose to skip the
> > > submodules in excluded paths? My only concern in this case is that it
> > > would be contrary to the design in git-sparse-checkout.txt. And the
> > > working tree grep and cached grep would differ even on a clean working
> > > tree.
> >
> <snip>
> > Anyway, the wording in that file seems to be really important, so
> > let's fix it.
> >
>
> Let me also try to give a concrete proposal for grep behavior for the
> edge cases we've discussed:

Thank you for this proposal and for the previous comments as well.

> git -c sparse.restrictCmds=true grep --recurse-submodules $PATTERN
>
> This goes through all the files in the index (i.e. all tracked files)
> which do not have the SKIP_WORKTREE bit set.  For each of these: If
> the file is a symlink, ignore it (like grep currently does).  If the
> file is a regular file and is present in the working copy, search it.
> If the file is a submodule and it is initialized, recurse into it.

Sounds good. And when sparse.restrictCmds=false, we also search the
present files and present initialized submodules that have the
SKIP_WORKTREE set, right?

> git -c sparse.restrictCmds=true grep --recurse-submodules --cached $PATTERN
>
> This goes through all the files in the index (i.e. all tracked files)
> which do not have the SKIP_WORKTREE bit set.  For each of these: Skip
> symlinks.  Search regular files.  Recurse into submodules if they are
> initialized.

OK.

> git -c sparse.restrictCmds=true grep --recurse-submodules $REVISION $PATTERN
>
> This goes through all the files in the given revision (i.e. all
> tracked files) which match the sparsity patterns (i.e. that would not
> have the SKIP_WORKTREE bit set if were we to checkout that commit).
> For each of these: Skip symlinks.  Search regular files.  Recurse into
> submodules if they are initialized.

OK.

> Further, for any of these, when recursing into submodules, make sure
> to load that submodules' core.sparseCheckout setting (and related
> settings) and the submodules' sparsity patterns, if any.
>
> Sound good?
>
> I think this addresses the edge cases we've discussed so far:
> interaction between submodules and sparsity patterns, and handling of
> files that are still present despite not matching the sparsity
> patterns. (Also note that files which are present-despite-the-rules
> are prone to be removed by the next `git sparse-checkout reapply` or
> anything that triggers a call to unpack_trees(); there's already
> multiple things that do and Stolee's proposed patches would add more).
> If I've missed edge cases, let me know.

Sounds great. This addresses all the edge cases we've mentioned
before. Thanks again for the detailed proposal, and for considering
case by case.

* Re: [RFC PATCH v2 3/4] grep: honor sparse checkout patterns
  2020-05-22 20:54                       ` Matheus Tavares Bernardino
@ 2020-05-22 21:06                         ` Elijah Newren
  0 siblings, 0 replies; 120+ messages in thread
From: Elijah Newren @ 2020-05-22 21:06 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: Git Mailing List, Junio C Hamano, Derrick Stolee, Jonathan Tan

On Fri, May 22, 2020 at 1:54 PM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> Hi, Elijah
>
> On Fri, May 22, 2020 at 12:36 PM Elijah Newren <newren@gmail.com> wrote:
> >
> > On Fri, May 22, 2020 at 7:26 AM Elijah Newren <newren@gmail.com> wrote:
> > >
> > > Hi Matheus,
> > >
> > > On Thu, May 21, 2020 at 10:49 PM Matheus Tavares Bernardino <matheus.bernardino@usp.br> wrote:
> > > >
> > > > On Thu, May 21, 2020 at 2:52 PM Elijah Newren <newren@gmail.com> wrote:
> > > > >
> > <snip>
> > > > Does this seem like a good approach? Or is there another solution that
> > > > I have not considered? Or even further, should we choose to skip the
> > > > submodules in excluded paths? My only concern in this case is that it
> > > > would be contrary to the design in git-sparse-checkout.txt. And the
> > > > working tree grep and cached grep would differ even on a clean working
> > > > tree.
> > >
> > <snip>
> > > Anyway, the wording in that file seems to be really important, so
> > > let's fix it.
> > >
> >
> > Let me also try to give a concrete proposal for grep behavior for the
> > edge cases we've discussed:
>
> Thank you for this proposal and for the previous comments as well.
>
> > git -c sparse.restrictCmds=true grep --recurse-submodules $PATTERN
> >
> > This goes through all the files in the index (i.e. all tracked files)
> > which do not have the SKIP_WORKTREE bit set.  For each of these: If
> > the file is a symlink, ignore it (like grep currently does).  If the
> > file is a regular file and is present in the working copy, search it.
> > If the file is a submodule and it is initialized, recurse into it.
>
> Sounds good. And when sparse.restrictCmds=false, we also search the
> present files and present initialized submodules that have the
> SKIP_WORKTREE set, right?

You're really pushing those corner cases, I love it.  :-)
SKIP_WORKTREE is supposed to mean we have removed it from the working
tree, i.e. it shouldn't be present (if we decide we're not going to
remove it from the working tree, e.g. because the file is unmerged or
something, then we don't mark it as SKIP_WORKTREE even if it doesn't
match sparsity patterns).  Therefore, the set of files that satisfy
this condition you have given should generally be empty.

But presuming we hit this corner case, I'd say you are right.
sparse.restrictCmds=false means we ignore the SKIP_WORKTREE bit
entirely (and in the case of grepping a $REVISION, we ignore the
sparsity patterns entirely).

> > git -c sparse.restrictCmds=true grep --recurse-submodules --cached $PATTERN
> >
> > This goes through all the files in the index (i.e. all tracked files)
> > which do not have the SKIP_WORKTREE bit set.  For each of these: Skip
> > symlinks.  Search regular files.  Recurse into submodules if they are
> > initialized.
>
> OK.
>
> > git -c sparse.restrictCmds=true grep --recurse-submodules $REVISION $PATTERN
> >
> > This goes through all the files in the given revision (i.e. all
> > tracked files) which match the sparsity patterns (i.e. that would not
> > have the SKIP_WORKTREE bit set if were we to checkout that commit).
> > For each of these: Skip symlinks.  Search regular files.  Recurse into
> > submodules if they are initialized.
>
> OK.
>
> > Further, for any of these, when recursing into submodules, make sure
> > to load that submodules' core.sparseCheckout setting (and related
> > settings) and the submodules' sparsity patterns, if any.
> >
> > Sound good?
> >
> > I think this addresses the edge cases we've discussed so far:
> > interaction between submodules and sparsity patterns, and handling of
> > files that are still present despite not matching the sparsity
> > patterns. (Also note that files which are present-despite-the-rules
> > are prone to be removed by the next `git sparse-checkout reapply` or
> > anything that triggers a call to unpack_trees(); there's already
> > multiple things that do and Stolee's proposed patches would add more).
> > If I've missed edge cases, let me know.
>
> Sounds great. This addresses all the edge cases we've mentioned
> before. Thanks again for the detailed proposal, and for considering
> case by case.

And thank you for working on this.  :-)

* [PATCH v3 0/5] grep: honor sparse checkout and add option to ignore it
  2020-05-10  0:41 ` [RFC PATCH v2 0/4] grep: honor sparse checkout and add option to ignore it Matheus Tavares
                     ` (3 preceding siblings ...)
  2020-05-10  0:41   ` [RFC PATCH v2 4/4] config: add setting to ignore sparsity patterns in some cmds Matheus Tavares
@ 2020-05-28  1:12   ` Matheus Tavares
  2020-05-28  1:12     ` [PATCH v3 1/5] doc: grep: unify info on configuration variables Matheus Tavares
                       ` (5 more replies)
  4 siblings, 6 replies; 120+ messages in thread
From: Matheus Tavares @ 2020-05-28  1:12 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, newren, jonathantanmy

This series is based on the discussions in [1]. The idea is to let
git-grep (and, in the future, other commands) restrict its output to the
paths that match the sparsity patterns, when requested by the user.

[1]: https://lore.kernel.org/git/CAHd-oW7e5qCuxZLBeVDq+Th3E+E4+P8=WzJfK8WcG2yz=n_nag@mail.gmail.com/t/#u


Note on tests:

In the previous iteration, the setup test in t7817 (patch 4) used the
following command to set up the sparse checkout:

    git sparse-checkout set "/*" "!b" "!/*/" "/sub"

In this iteration, though, I had to change "/sub" to "sub" (which is not
the same, but should produce the same results in t7817). Using the
previous format, the test failed on Windows because git grep
--recurse-submodules did not recurse into "/sub" (a submodule). I used
[2] to investigate and noticed that sub indeed had the SKIP_WORKTREE bit
set on the repo created by the test on Windows. And the sparse-checkout
file contained the following:

/*
!b
!/*/
C:/Users/<path_to_the_Git_for_Windows_SDK_installation>/sub

I wasn't expecting the conversion from "/sub" to the path above. But I'm
not very familiar with the Git for Windows SDK, so is this conversion
expected?

Furthermore, `pwd` would output:

    /usr/src/git/t/trash directory.t7817-grep-sparse-checkout

So I think that would explain why the converted path for the "/sub" rule
didn't match sub. Could this be a bug in `git sparse-checkout set`? Or
am I missing something?

[2]: https://github.com/sbp/gin
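
As a side note on why "/sub" and "sub" are not the same pattern: in
gitignore-style matching, a leading slash anchors the pattern to the top
level, while a pattern without any slash may match at any depth. The
sketch below illustrates only that difference, under heavy
simplification (literal names, no wildcards, no negation, no
trailing-slash handling). `literal_pattern_matches()` is a hypothetical
helper, not a function from git.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Simplified illustration: returns true if the literal pattern matches
 * the path or one of its leading directories.
 */
static bool literal_pattern_matches(const char *pattern, const char *path)
{
	size_t len;
	const char *p;

	assert(pattern && path);

	if (pattern[0] == '/') {
		/* Anchored: must match starting at the top level. */
		pattern++;
		len = strlen(pattern);
		return !strncmp(path, pattern, len) &&
		       (path[len] == '\0' || path[len] == '/');
	}

	/* Unanchored: may match a path component at any depth. */
	len = strlen(pattern);
	p = path;
	while (p) {
		if (!strncmp(p, pattern, len) &&
		    (p[len] == '\0' || p[len] == '/'))
			return true;
		p = strchr(p, '/');
		if (p)
			p++; /* step past the separator to the next component */
	}
	return false;
}
```

In the t7817 layout there is only one path named sub, at the top level,
so both forms select the same entry there.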


Main changes since v2:

Added patch 2.

Patch 3:
- Fix reading of extensions.worktreeConfig value in do_git_config_sequence(), to
  get the one from the given git_dir, not the_repository.
- Add --submodule option to test-config helper and regression test for the fixes
  in do_git_config_sequence().

Patch 4:
- Reword commit message to remove snippet about --untracked and --no-index
  respecting the sparsity patterns.
- Don't grep submodules that are excluded by the sparsity patterns.
- Add tests to ensure that submodules (and other paths) that are excluded by the
  sparsity patterns, but present in the working tree, are not grepped.
- Some minor variable renames in tests, for better readability.

Patch 5:
- Mention in sparse config docs that --[no-]restrict-to-sparse-paths won't
  affect writing commands.
- die() in grep when --[no-]restrict-to-sparse-paths is used with --no-index
  or --untracked, and add test for this behavior.
- Use test_when_finished and test_config, when possible, to avoid breaking next
  test cases on a test error.
- Adjust the behavior of --[no-]restrict-to-sparse-paths to follow the ideas
  proposed by Elijah in [3] and [4]. Also add more tests for the different
  cases where this option is relevant and improve docs at
  Documentation/config/sparse.txt.

[3]: https://lore.kernel.org/git/CABPp-BE6M9ATDYuQh8f_r3S00dM2Cv9vM3T5j5W_odbVzhC-5A@mail.gmail.com/
[4]: https://lore.kernel.org/git/CABPp-BGEPU49yRN2FRtwhYn6Uh+scGKEFYP4G2GH6=uBTN1SCw@mail.gmail.com/
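
The interaction between --[no-]restrict-to-sparse-paths, the
sparse.restrictCmds setting and --no-index/--untracked described for
patch 5 can be summarized with the sketch below. It is only a model of
the intended behavior, not the code in this series: `enum grep_scope`
and `resolve_scope()` are hypothetical names, and the
SCOPE_NOT_APPLICABLE case (no option given together with --no-index or
--untracked) is my assumption that the sparsity setting simply plays no
role there.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome of combining the option, the config and the mode. */
enum grep_scope {
	SCOPE_SPARSE_ONLY,    /* restrict the search to the sparse paths */
	SCOPE_FULL,           /* ignore the sparsity patterns */
	SCOPE_NOT_APPLICABLE, /* assumed: setting is irrelevant in this mode */
	SCOPE_USAGE_ERROR     /* corresponds to the die() in grep */
};

/*
 * cli_opt: -1 if --[no-]restrict-to-sparse-paths was not given,
 *           0 for --no-restrict-to-sparse-paths,
 *           1 for --restrict-to-sparse-paths.
 * config_restrict: value of sparse.restrictCmds (documented default: true).
 * no_index_or_untracked: true if --no-index or --untracked was requested.
 */
static enum grep_scope resolve_scope(int cli_opt, bool config_restrict,
				     bool no_index_or_untracked)
{
	assert(cli_opt >= -1 && cli_opt <= 1);

	if (no_index_or_untracked)
		return cli_opt >= 0 ? SCOPE_USAGE_ERROR : SCOPE_NOT_APPLICABLE;

	/* An explicit command-line option takes precedence over the config. */
	if (cli_opt >= 0)
		return cli_opt ? SCOPE_SPARSE_ONLY : SCOPE_FULL;

	return config_restrict ? SCOPE_SPARSE_ONLY : SCOPE_FULL;
}
```

The tri-state cli_opt mirrors the `opt_restrict_to_sparse_paths >= 0`
check visible in the range-diff below.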

CI: https://github.com/matheustavares/git/actions/runs/117388742

Matheus Tavares (5):
  doc: grep: unify info on configuration variables
  t/helper/test-config: return exit codes consistently
  config: correctly read worktree configs in submodules
  grep: honor sparse checkout patterns
  config: add setting to ignore sparsity patterns in some cmds

 Documentation/config.txt               |   2 +
 Documentation/config/grep.txt          |  10 +-
 Documentation/config/sparse.txt        |  24 ++
 Documentation/git-grep.txt             |  37 +--
 Documentation/git.txt                  |   4 +
 Makefile                               |   1 +
 builtin/grep.c                         | 134 ++++++++++-
 config.c                               |  21 +-
 contrib/completion/git-completion.bash |   2 +
 git.c                                  |   6 +
 sparse-checkout.c                      |  16 ++
 sparse-checkout.h                      |  11 +
 t/helper/test-config.c                 | 183 +++++++++------
 t/t2404-worktree-config.sh             |  16 ++
 t/t7011-skip-worktree-reading.sh       |   9 -
 t/t7817-grep-sparse-checkout.sh        | 300 +++++++++++++++++++++++++
 t/t9902-completion.sh                  |   4 +-
 17 files changed, 663 insertions(+), 117 deletions(-)
 create mode 100644 Documentation/config/sparse.txt
 create mode 100644 sparse-checkout.c
 create mode 100644 sparse-checkout.h
 create mode 100755 t/t7817-grep-sparse-checkout.sh

Range-diff against v2:
1:  c344d22313 = 1:  63c195d737 doc: grep: unify info on configuration variables
2:  882310b69f < -:  ---------- config: load the correct config.worktree file
-:  ---------- > 2:  43402007ad t/helper/test-config: return exit codes consistently
-:  ---------- > 3:  448e0efffd config: correctly read worktree configs in submodules
3:  e00674c727 ! 4:  5ddac81818 grep: honor sparse checkout patterns
    @@ Commit message
         git-grep currently ignores the sparsity patterns and report all matches
         found outside this subset, which kind of goes in the opposite direction.
         Let's fix that, making it honor the sparsity boundaries for every
    -    grepping case:
    +    grepping case where this is relevant:
     
         - git grep in worktree
         - git grep --cached
         - git grep $REVISION
    -    - git grep --untracked and git grep --no-index (which already respect
    -      sparse checkout boundaries)
     
    -    This is also what some users reported[1] they would want as the default
    -    behavior.
    +    For the worktree case, we will not grep paths that have the
    +    SKIP_WORKTREE bit set, even if they are present for some reason (e.g.
    +    manually created after `git sparse-checkout init`). But the next patch
    +    will add an option to do so. (See 'Note' below.)
     
    -    Note: for `git grep $REVISION`, we will choose to honor the sparsity
    -    patterns only when $REVISION is a commit-ish object. The reason is that,
    -    for a tree, we don't know whether it represents the root of a
    -    repository or a subtree. So we wouldn't be able to correctly match it
    -    against the sparsity patterns. E.g. suppose we have a repository with
    -    these two sparsity rules: "/*" and "!/a"; and the following structure:
    +    For `git grep $REVISION`, we will choose to honor the sparsity patterns
    +    only when $REVISION is a commit-ish object. The reason is that, for a
    +    tree, we don't know whether it represents the root of a repository or a
    +    subtree. So we wouldn't be able to correctly match it against the
    +    sparsity patterns. E.g. suppose we have a repository with these two
    +    sparsity rules: "/*" and "!/a"; and the following structure:
     
         /
         | - a (file)
    @@ Commit message
         therefore it would wrongly match the pattern "!/a". Furthermore, for a
         search in a blob object, we wouldn't even have a path to check the
         patterns against. So, let's ignore the sparsity patterns when grepping
    -    non-commit-ish objects (tags to commits should be fine).
    +    non-commit-ish objects.
     
    -    Finally, the old behavior may still be desirable for some use cases. So
    -    the next patch will add an option to allow restoring it when needed.
    +    Note: The behavior introduced in this patch is what some users have
    +    reported[1] that they would like by default. But the old behavior is
    +    still desirable for some use cases. Therefore, the next patch will add
    +    an option to allow restoring it when needed.
     
         [1]: https://lore.kernel.org/git/CABPp-BGuFhDwWZBRaD3nA8ui46wor-4=Ha1G1oApsfF8KNpfGQ@mail.gmail.com/
     
    @@ builtin/grep.c: static int grep_cache(struct grep_opt *opt,
      	for (nr = 0; nr < repo->index->cache_nr; nr++) {
      		const struct cache_entry *ce = repo->index->cache[nr];
     +
    -+		if (ce_skip_worktree(ce) && !S_ISGITLINK(ce->ce_mode))
    ++		if (ce_skip_worktree(ce))
     +			continue;
     +
      		strbuf_setlen(&name, name_base_len);
    @@ builtin/grep.c: static int grep_cache(struct grep_opt *opt,
     +			      enum pattern_match_result *match)
     +{
     +	int dtype = DT_UNKNOWN;
    -+
    -+	if (S_ISGITLINK(entry_mode))
    -+		return 1;
    ++	int is_dir = S_ISDIR(entry_mode);
     +
     +	if (parent_match == MATCHED_RECURSIVE) {
     +		*match = parent_match;
     +		return 1;
     +	}
     +
    -+	if (S_ISDIR(entry_mode) && !is_dir_sep(path->buf[path->len - 1]))
    ++	if (is_dir && !is_dir_sep(path->buf[path->len - 1]))
     +		strbuf_addch(path, '/');
     +
     +	*match = path_matches_pattern_list(path->buf, path->len,
    @@ builtin/grep.c: static int grep_cache(struct grep_opt *opt,
     +	if (*match == UNDECIDED)
     +		*match = parent_match;
     +
    -+	if (S_ISDIR(entry_mode))
    ++	if (is_dir)
     +		strbuf_trim_trailing_dir_sep(path);
     +
    -+	if (*match == NOT_MATCHED && (S_ISREG(entry_mode) ||
    -+	    (S_ISDIR(entry_mode) && sparsity->use_cone_patterns)))
    -+		return 0;
    ++	if (*match == NOT_MATCHED &&
    ++		(!is_dir || (is_dir && sparsity->use_cone_patterns)))
    ++	     return 0;
     +
     +	return 1;
     +}
    @@ t/t7817-grep-sparse-checkout.sh (new)
     +|-- b
     +|-- dir
     +|   `-- c
    -+`-- sub
    -+    |-- A
    -+    |   `-- a
    -+    `-- B
    -+	`-- b
    -+
    -+Where . has non-cone mode sparsity patterns and sub is a submodule with cone
    -+mode sparsity patterns. The resulting sparse-checkout should leave the following
    -+structure:
    ++|-- sub
    ++|   |-- A
    ++|   |   `-- a
    ++|   `-- B
    ++|       `-- b
    ++`-- sub2
    ++    `-- a
    ++
    ++Where . has non-cone mode sparsity patterns, sub is a submodule with cone mode
    ++sparsity patterns and sub2 is a submodule that is excluded by the superproject
    ++sparsity patterns. The resulting sparse checkout should leave the following
    ++structure on the working tree:
     +
     +.
     +|-- a
    -+`-- sub
    -+    `-- B
    -+	`-- b
    ++|-- sub
    ++|   `-- B
    ++|       `-- b
    ++`-- sub2
    ++    `-- a
    ++
    ++But note that sub2 should have the SKIP_WORKTREE bit set.
     +'
     +
     +. ./test-lib.sh
    @@ t/t7817-grep-sparse-checkout.sh (new)
     +		git sparse-checkout set B
     +	) &&
     +
    ++	git init sub2 &&
    ++	(
    ++		cd sub2 &&
    ++		echo "text" >a &&
    ++		git add a &&
    ++		git commit -m sub2
    ++	) &&
    ++
     +	git submodule add ./sub &&
    ++	git submodule add ./sub2 &&
     +	git add a b dir &&
     +	git commit -m super &&
     +	git sparse-checkout init --no-cone &&
    -+	git sparse-checkout set "/*" "!b" "!/*/" &&
    ++	git sparse-checkout set "/*" "!b" "!/*/" "sub" &&
     +
    -+	git tag -am t-commit t-commit HEAD &&
    ++	git tag -am tag-to-commit tag-to-commit HEAD &&
     +	tree=$(git rev-parse HEAD^{tree}) &&
    -+	git tag -am t-tree t-tree $tree &&
    ++	git tag -am tag-to-tree tag-to-tree $tree &&
     +
     +	test_path_is_missing b &&
     +	test_path_is_missing dir &&
     +	test_path_is_missing sub/A &&
     +	test_path_is_file a &&
    -+	test_path_is_file sub/B/b
    ++	test_path_is_file sub/B/b &&
    ++	test_path_is_file sub2/a
     +'
     +
    ++# The test bellow checks a special case: the sparsity patterns exclude '/b'
    ++# and sparse checkout is enable, but the path exists on the working tree (e.g.
    ++# manually created after `git sparse-checkout init`). In this case, grep should
    ++# skip it.
     +test_expect_success 'grep in working tree should honor sparse checkout' '
     +	cat >expect <<-EOF &&
     +	a:text
     +	EOF
    ++	echo "new-text" >b &&
    ++	test_when_finished "rm b" &&
     +	git grep "text" >actual &&
     +	test_cmp expect actual
     +'
    @@ t/t7817-grep-sparse-checkout.sh (new)
     +	cat >expect_commit <<-EOF &&
     +	$commit:a:text
     +	EOF
    -+	cat >expect_t-commit <<-EOF &&
    -+	t-commit:a:text
    ++	cat >expect_tag-to-commit <<-EOF &&
    ++	tag-to-commit:a:text
     +	EOF
     +	git grep "text" $commit >actual_commit &&
     +	test_cmp expect_commit actual_commit &&
    -+	git grep "text" t-commit >actual_t-commit &&
    -+	test_cmp expect_t-commit actual_t-commit
    ++	git grep "text" tag-to-commit >actual_tag-to-commit &&
    ++	test_cmp expect_tag-to-commit actual_tag-to-commit
     +'
     +
     +test_expect_success 'grep <tree-ish> should ignore sparsity patterns' '
    @@ t/t7817-grep-sparse-checkout.sh (new)
     +	$tree:b:text
     +	$tree:dir/c:text
     +	EOF
    -+	cat >expect_t-tree <<-EOF &&
    -+	t-tree:a:text
    -+	t-tree:b:text
    -+	t-tree:dir/c:text
    ++	cat >expect_tag-to-tree <<-EOF &&
    ++	tag-to-tree:a:text
    ++	tag-to-tree:b:text
    ++	tag-to-tree:dir/c:text
     +	EOF
     +	git grep "text" $tree >actual_tree &&
     +	test_cmp expect_tree actual_tree &&
    -+	git grep "text" t-tree >actual_t-tree &&
    -+	test_cmp expect_t-tree actual_t-tree
    ++	git grep "text" tag-to-tree >actual_tag-to-tree &&
    ++	test_cmp expect_tag-to-tree actual_tag-to-tree
    ++'
    ++
    ++# Note that sub2/ is present in the worktree but it is excluded by the sparsity
    ++# patterns, so grep should not recurse into it.
    ++test_expect_success 'grep --recurse-submodules should honor sparse checkout in submodule' '
    ++	cat >expect <<-EOF &&
    ++	a:text
    ++	sub/B/b:text
    ++	EOF
    ++	git grep --recurse-submodules "text" >actual &&
    ++	test_cmp expect actual
     +'
     +
     +test_expect_success 'grep --recurse-submodules --cached should honor sparse checkout in submodule' '
    @@ t/t7817-grep-sparse-checkout.sh (new)
     +	$commit:a:text
     +	$commit:sub/B/b:text
     +	EOF
    -+	cat >expect_t-commit <<-EOF &&
    -+	t-commit:a:text
    -+	t-commit:sub/B/b:text
    ++	cat >expect_tag-to-commit <<-EOF &&
    ++	tag-to-commit:a:text
    ++	tag-to-commit:sub/B/b:text
     +	EOF
     +	git grep --recurse-submodules "text" $commit >actual_commit &&
     +	test_cmp expect_commit actual_commit &&
    -+	git grep --recurse-submodules "text" t-commit >actual_t-commit &&
    -+	test_cmp expect_t-commit actual_t-commit
    ++	git grep --recurse-submodules "text" tag-to-commit >actual_tag-to-commit &&
    ++	test_cmp expect_tag-to-commit actual_tag-to-commit
     +'
     +
     +test_done
4:  3e9e906249 ! 5:  748b1e955c config: add setting to ignore sparsity patterns in some cmds
    @@ Commit message
         subset of files in which they are interested; and allow some commands to
         possibly perform better, by not considering uninteresting paths. For
         this reason, we taught grep to honor the sparsity patterns, in the
    -    previous commit. But, on the other hand, allowing grep and the other
    +    previous patch. But, on the other hand, allowing grep and the other
         commands mentioned to optionally ignore the patterns also make for some
         interesting use cases. E.g. using grep to search for a function
    -    definition that resides outside the sparse checkout.
    +    documentation that resides outside the sparse checkout.
     
         In any case, there is no current way for users to configure the behavior
         they want for these commands. Aiming to provide this flexibility, let's
    @@ Documentation/config/sparse.txt (new)
     ++
     +When this option is true (default), some git commands may limit their behavior
     +to the paths specified by the sparsity patterns, or to the intersection of
    -+those paths and any (like `*.c) that the user might also specify on the command
    -+line. When false, the affected commands will work on full trees, ignoring the
    -+sparsity patterns. For now, only git-grep honors this setting. In this command,
    -+the restriction becomes relevant in one of these three cases: with --cached;
    -+when a commit-ish is given; when searching a working tree that contains paths
    -+previously excluded by the sparsity patterns.
    ++those paths and any (like `*.c`) that the user might also specify on the
    ++command line. When false, the affected commands will work on full trees,
    ++ignoring the sparsity patterns. For now, only git-grep honors this setting. In
    ++this command, the restriction takes effect in three cases: with --cached; when
    ++a commit-ish is given; when searching a working tree where some paths excluded
    ++by the sparsity patterns are present (e.g. manually created paths or not
    ++removed submodules).
     ++
     +Note: commands which export, integrity check, or create history will always
     +operate on full trees (e.g. fast-export, format-patch, fsck, commit, etc.),
    -+unaffected by any sparsity patterns.
    ++unaffected by any sparsity patterns. Also, writting commands such as
    ++sparse-checkout and read-tree will not be affected by this configuration.
     
      ## Documentation/git-grep.txt ##
     @@ Documentation/git-grep.txt: characters.  An empty string as search expression matches all lines.
    @@ builtin/grep.c: static int grep_cache(struct grep_opt *opt,
      	for (nr = 0; nr < repo->index->cache_nr; nr++) {
      		const struct cache_entry *ce = repo->index->cache[nr];
      
    --		if (ce_skip_worktree(ce) && !S_ISGITLINK(ce->ce_mode))
    -+		if (sparse_paths_only && ce_skip_worktree(ce) &&
    -+		    !S_ISGITLINK(ce->ce_mode))
    +-		if (ce_skip_worktree(ce))
    ++		if (sparse_paths_only && ce_skip_worktree(ce))
      			continue;
      
      		strbuf_setlen(&name, name_base_len);
    @@ builtin/grep.c: int cmd_grep(int argc, const char **argv, const char *prefix)
      		int use_exclude = (opt_exclude < 0) ? use_index : !!opt_exclude;
     +
     +		if (opt_restrict_to_sparse_paths >= 0) {
    -+			warning(_("--[no-]restrict-to-sparse-paths is ignored"
    -+				  " with --no-index or --untracked"));
    ++			die(_("--[no-]restrict-to-sparse-paths is incompatible"
    ++				  " with --no-index and --untracked"));
     +		}
     +
      		hit = grep_directory(&opt, &pathspec, use_exclude, use_index);
    @@ sparse-checkout.h (new)
     
      ## t/t7817-grep-sparse-checkout.sh ##
     @@ t/t7817-grep-sparse-checkout.sh: test_expect_success 'setup' '
    - 	test_path_is_file sub/B/b
    + 	test_path_is_file sub2/a
      '
      
    +-# The test bellow checks a special case: the sparsity patterns exclude '/b'
     +# The two tests bellow check a special case: the sparsity patterns exclude '/b'
    -+# and sparse checkout is enable, but the path exists on the working tree (e.g.
    -+# manually created after `git sparse-checkout init`). In this case, grep should
    -+# honor --restrict-to-sparse-paths.
    + # and sparse checkout is enable, but the path exists on the working tree (e.g.
    + # manually created after `git sparse-checkout init`). In this case, grep should
    +-# skip it.
    ++# skip the file by default, but not with --no-restrict-to-sparse-paths.
      test_expect_success 'grep in working tree should honor sparse checkout' '
      	cat >expect <<-EOF &&
      	a:text
    - 	EOF
    -+	echo newtext >b &&
    +@@ t/t7817-grep-sparse-checkout.sh: test_expect_success 'grep in working tree should honor sparse checkout' '
      	git grep "text" >actual &&
    --	test_cmp expect actual
    -+	test_cmp expect actual &&
    -+	rm b
    -+'
    + 	test_cmp expect actual
    + '
     +test_expect_success 'grep w/ --no-restrict-to-sparse-paths for sparsely excluded but present paths' '
     +	cat >expect <<-EOF &&
     +	a:text
    -+	b:newtext
    ++	b:new-text
     +	EOF
    -+	echo newtext >b &&
    ++	echo "new-text" >b &&
    ++	test_when_finished "rm b" &&
     +	git --no-restrict-to-sparse-paths grep "text" >actual &&
    -+	test_cmp expect actual &&
    -+	rm b
    - '
    ++	test_cmp expect actual
    ++'
      
      test_expect_success 'grep --cached should honor sparse checkout' '
    + 	cat >expect <<-EOF &&
    +@@ t/t7817-grep-sparse-checkout.sh: test_expect_success 'grep <tree-ish> should ignore sparsity patterns' '
    + '
    + 
    + # Note that sub2/ is present in the worktree but it is excluded by the sparsity
    +-# patterns, so grep should not recurse into it.
    ++# patterns, so grep should only recurse into it with --no-restrict-to-sparse-paths.
    + test_expect_success 'grep --recurse-submodules should honor sparse checkout in submodule' '
    + 	cat >expect <<-EOF &&
    + 	a:text
    +@@ t/t7817-grep-sparse-checkout.sh: test_expect_success 'grep --recurse-submodules should honor sparse checkout in s
    + 	git grep --recurse-submodules "text" >actual &&
    + 	test_cmp expect actual
    + '
    ++test_expect_success 'grep --recurse-submodules should search in excluded submodules w/ --no-restrict-to-sparse-paths' '
    ++	cat >expect <<-EOF &&
    ++	a:text
    ++	sub/B/b:text
    ++	sub2/a:text
    ++	EOF
    ++	git --no-restrict-to-sparse-paths grep --recurse-submodules "text" >actual &&
    ++	test_cmp expect actual
    ++'
    + 
    + test_expect_success 'grep --recurse-submodules --cached should honor sparse checkout in submodule' '
    + 	cat >expect <<-EOF &&
     @@ t/t7817-grep-sparse-checkout.sh: test_expect_success 'grep --recurse-submodules <commit-ish> should honor sparse
    - 	test_cmp expect_t-commit actual_t-commit
    + 	test_cmp expect_tag-to-commit actual_tag-to-commit
      '
      
     +for cmd in 'git --no-restrict-to-sparse-paths grep' \
    @@ t/t7817-grep-sparse-checkout.sh: test_expect_success 'grep --recurse-submodules
     +		$commit:b:text
     +		$commit:dir/c:text
     +		EOF
    -+		cat >expect_t-commit <<-EOF &&
    -+		t-commit:a:text
    -+		t-commit:b:text
    -+		t-commit:dir/c:text
    ++		cat >expect_tag-to-commit <<-EOF &&
    ++		tag-to-commit:a:text
    ++		tag-to-commit:b:text
    ++		tag-to-commit:dir/c:text
     +		EOF
     +		$cmd "text" $commit >actual_commit &&
     +		test_cmp expect_commit actual_commit &&
    -+		$cmd "text" t-commit >actual_t-commit &&
    -+		test_cmp expect_t-commit actual_t-commit
    ++		$cmd "text" tag-to-commit >actual_tag-to-commit &&
    ++		test_cmp expect_tag-to-commit actual_tag-to-commit
     +	'
     +done
     +
    ++test_expect_success 'grep --recurse-submodules --cached \w --no-restrict-to-sparse-paths' '
    ++	cat >expect <<-EOF &&
    ++	a:text
    ++	b:text
    ++	dir/c:text
    ++	sub/A/a:text
    ++	sub/B/b:text
    ++	sub2/a:text
    ++	EOF
    ++	git --no-restrict-to-sparse-paths grep --recurse-submodules --cached \
    ++		"text" >actual &&
    ++	test_cmp expect actual
    ++'
    ++
    ++test_expect_success 'grep --recurse-submodules <commit-ish> \w --no-restrict-to-sparse-paths' '
    ++	commit=$(git rev-parse HEAD) &&
    ++	cat >expect_commit <<-EOF &&
    ++	$commit:a:text
    ++	$commit:b:text
    ++	$commit:dir/c:text
    ++	$commit:sub/A/a:text
    ++	$commit:sub/B/b:text
    ++	$commit:sub2/a:text
    ++	EOF
    ++	cat >expect_tag-to-commit <<-EOF &&
    ++	tag-to-commit:a:text
    ++	tag-to-commit:b:text
    ++	tag-to-commit:dir/c:text
    ++	tag-to-commit:sub/A/a:text
    ++	tag-to-commit:sub/B/b:text
    ++	tag-to-commit:sub2/a:text
    ++	EOF
    ++	git --no-restrict-to-sparse-paths grep --recurse-submodules "text" \
    ++		$commit >actual_commit &&
    ++	test_cmp expect_commit actual_commit &&
    ++	git --no-restrict-to-sparse-paths grep --recurse-submodules "text" \
    ++		tag-to-commit >actual_tag-to-commit &&
    ++	test_cmp expect_tag-to-commit actual_tag-to-commit
    ++'
    ++
     +test_expect_success 'should respect the sparse.restrictCmds values from submodules' '
     +	cat >expect <<-EOF &&
     +	a:text
     +	sub/A/a:text
     +	sub/B/b:text
     +	EOF
    -+	git -C sub config sparse.restrictCmds false &&
    ++	test_config -C sub sparse.restrictCmds false &&
     +	git grep --cached --recurse-submodules "text" >actual &&
    -+	test_cmp expect actual &&
    -+	git -C sub config --unset sparse.restrictCmds
    ++	test_cmp expect actual
     +'
     +
     +test_expect_success 'should propagate --[no]-restrict-to-sparse-paths to submodules' '
    @@ t/t7817-grep-sparse-checkout.sh: test_expect_success 'grep --recurse-submodules
     +	dir/c:text
     +	sub/A/a:text
     +	sub/B/b:text
    ++	sub2/a:text
     +	EOF
    -+	git -C sub config sparse.restrictCmds true &&
    ++	test_config -C sub sparse.restrictCmds true &&
     +	git --no-restrict-to-sparse-paths grep --cached --recurse-submodules "text" >actual &&
    -+	test_cmp expect actual &&
    -+	git -C sub config --unset sparse.restrictCmds
    ++	test_cmp expect actual
     +'
    ++
    ++for opt in '--untracked' '--no-index'
    ++do
    ++	test_expect_success "--[no]-restrict-to-sparse-paths and $opt are incompatible" "
    ++		test_must_fail git --restrict-to-sparse-paths grep $opt . 2>actual &&
    ++		test_i18ngrep 'restrict-to-sparse-paths is incompatible with' actual
    ++	"
    ++done
     +
      test_done
     
-- 
2.26.2



* [PATCH v3 1/5] doc: grep: unify info on configuration variables
  2020-05-28  1:12   ` [PATCH v3 0/5] grep: honor sparse checkout and add option to ignore it Matheus Tavares
@ 2020-05-28  1:12     ` Matheus Tavares
  2020-05-28  1:13     ` [PATCH v3 2/5] t/helper/test-config: return exit codes consistently Matheus Tavares
                       ` (4 subsequent siblings)
  5 siblings, 0 replies; 120+ messages in thread
From: Matheus Tavares @ 2020-05-28  1:12 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, newren, jonathantanmy

Explanations about the configuration variables for git-grep are
duplicated in "Documentation/git-grep.txt" and
"Documentation/config/grep.txt", which can make maintenance difficult.
The first also contains a definition not present in the latter
(grep.fullName). To avoid problems like this, let's unify the
information in the second file and include it in the first.

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---
 Documentation/config/grep.txt | 10 ++++++++--
 Documentation/git-grep.txt    | 36 ++++++-----------------------------
 2 files changed, 14 insertions(+), 32 deletions(-)

diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
index 44abe45a7c..dd51db38e1 100644
--- a/Documentation/config/grep.txt
+++ b/Documentation/config/grep.txt
@@ -16,8 +16,14 @@ grep.extendedRegexp::
 	other than 'default'.
 
 grep.threads::
-	Number of grep worker threads to use.
-	See `grep.threads` in linkgit:git-grep[1] for more information.
+	Number of grep worker threads to use. See `--threads`
+ifndef::git-grep[]
+	in linkgit:git-grep[1]
+endif::git-grep[]
+	for more information.
+
+grep.fullName::
+	If set to true, enable `--full-name` option by default.
 
 grep.fallbackToNoIndex::
 	If set to true, fall back to git grep --no-index if git grep
diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
index a7f9bc99ea..9bdf807584 100644
--- a/Documentation/git-grep.txt
+++ b/Documentation/git-grep.txt
@@ -41,34 +41,8 @@ characters.  An empty string as search expression matches all lines.
 CONFIGURATION
 -------------
 
-grep.lineNumber::
-	If set to true, enable `-n` option by default.
-
-grep.column::
-	If set to true, enable the `--column` option by default.
-
-grep.patternType::
-	Set the default matching behavior. Using a value of 'basic', 'extended',
-	'fixed', or 'perl' will enable the `--basic-regexp`, `--extended-regexp`,
-	`--fixed-strings`, or `--perl-regexp` option accordingly, while the
-	value 'default' will return to the default matching behavior.
-
-grep.extendedRegexp::
-	If set to true, enable `--extended-regexp` option by default. This
-	option is ignored when the `grep.patternType` option is set to a value
-	other than 'default'.
-
-grep.threads::
-	Number of grep worker threads to use. If unset (or set to 0), Git will
-	use as many threads as the number of logical cores available.
-
-grep.fullName::
-	If set to true, enable `--full-name` option by default.
-
-grep.fallbackToNoIndex::
-	If set to true, fall back to git grep --no-index if git grep
-	is executed outside of a git repository.  Defaults to false.
-
+:git-grep: 1
+include::config/grep.txt[]
 
 OPTIONS
 -------
@@ -269,8 +243,10 @@ providing this option will cause it to die.
 	found.
 
 --threads <num>::
-	Number of grep worker threads to use.
-	See `grep.threads` in 'CONFIGURATION' for more information.
+	Number of grep worker threads to use. If not provided (or set to
+	0), Git will use as many worker threads as the number of logical
+	cores available. The default value can also be set with the
+	`grep.threads` configuration.
 
 -f <file>::
 	Read patterns from <file>, one per line.
-- 
2.26.2



* [PATCH v3 2/5] t/helper/test-config: return exit codes consistently
  2020-05-28  1:12   ` [PATCH v3 0/5] grep: honor sparse checkout and add option to ignore it Matheus Tavares
  2020-05-28  1:12     ` [PATCH v3 1/5] doc: grep: unify info on configuration variables Matheus Tavares
@ 2020-05-28  1:13     ` Matheus Tavares
  2020-05-30 14:29       ` Elijah Newren
  2020-05-28  1:13     ` [PATCH v3 3/5] config: correctly read worktree configs in submodules Matheus Tavares
                       ` (3 subsequent siblings)
  5 siblings, 1 reply; 120+ messages in thread
From: Matheus Tavares @ 2020-05-28  1:13 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, newren, jonathantanmy

The test-config helper may exit with at least four different codes to
reflect the status of the requested operation. These codes are sometimes
checked in the tests, but the helper does not return them consistently:
1 usually means "value not found", but usage errors can return either 1
or 128. The latter is also expected on errors within the configset
functions. These inconsistencies can lead to false positives in the
tests. Although every test that currently checks the helper's exit code
on errors also checks the output, it is still better to standardize the
exit codes and avoid problems in new tests. While we are here, let's
also check that we have the expected argc for configset_get_value and
configset_get_value_multi before trying to use argv.

Note: this change is implemented by unifying the exit labels. This might
seem unnecessary for now, but it will benefit the next patch, which will
grow the cleanup section.

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---
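The single-exit pattern described in the commit message can be sketched
in isolation as follows. This is only an illustration, not the helper's
code: toy_lookup(), toy_config() and the "known.key" entry are
hypothetical names, and the malloc'ed buffer merely stands in for the
config_set that the real helper has to clear.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Exit codes mirroring the ones documented in the patch. */
enum tc_exit_code {
	TC_SUCCESS = 0,
	TC_VALUE_NOT_FOUND = 1,
	TC_CONFIG_FILE_ERROR = 2,
	TC_USAGE_ERROR = 129,
};

/* Hypothetical stand-in for a config lookup: only "known.key" exists. */
static const char *toy_lookup(const char *key)
{
	return strcmp(key, "known.key") ? NULL : "value";
}

/*
 * Hypothetical dispatcher: every branch only sets `ret` and then reaches
 * the single `out` label, so the resource is released exactly once.
 */
static enum tc_exit_code toy_config(int argc, const char **argv)
{
	enum tc_exit_code ret = TC_SUCCESS;
	char *scratch = malloc(64); /* stands in for the config_set */

	assert(scratch != NULL);

	if (argc < 1) {
		ret = TC_USAGE_ERROR;
		goto out;
	}

	if (argc == 2 && !strcmp(argv[0], "get_value")) {
		const char *v = toy_lookup(argv[1]);
		if (v)
			printf("%s\n", v);
		else
			ret = TC_VALUE_NOT_FOUND;
	} else {
		fprintf(stderr, "usage: <cmd> [args]\n");
		ret = TC_USAGE_ERROR;
	}

out:
	free(scratch);
	return ret;
}
```

With a single cleanup label, adding more teardown later (as the next
patch does) means touching one place instead of three.
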
 t/helper/test-config.c | 76 ++++++++++++++++++++++--------------------
 1 file changed, 40 insertions(+), 36 deletions(-)

diff --git a/t/helper/test-config.c b/t/helper/test-config.c
index 234c722b48..1c8e965840 100644
--- a/t/helper/test-config.c
+++ b/t/helper/test-config.c
@@ -30,6 +30,14 @@
  * iterate -> iterate over all values using git_config(), and print some
  *            data for each
  *
+ * Exit codes:
+ *     0:   success
+ *     1:   value not found for the given config key
+ *     2:   config file path given as argument is inaccessible or doesn't exist
+ *     129: test-config usage error
+ *
+ * Note: tests may also expect 128 for die() calls in the config machinery.
+ *
  * Examples:
  *
  * To print the value with highest priority for key "foo.bAr Baz.rock":
@@ -64,35 +72,42 @@ static int early_config_cb(const char *var, const char *value, void *vdata)
 	return 0;
 }
 
+enum test_config_exit_code {
+	TC_SUCCESS = 0,
+	TC_VALUE_NOT_FOUND = 1,
+	TC_CONFIG_FILE_ERROR = 2,
+	TC_USAGE_ERROR = 129,
+};
+
 int cmd__config(int argc, const char **argv)
 {
 	int i, val;
 	const char *v;
 	const struct string_list *strptr;
 	struct config_set cs;
+	enum test_config_exit_code ret = TC_SUCCESS;
 
 	if (argc == 3 && !strcmp(argv[1], "read_early_config")) {
 		read_early_config(early_config_cb, (void *)argv[2]);
-		return 0;
+		return TC_SUCCESS;
 	}
 
 	setup_git_directory();
 
 	git_configset_init(&cs);
 
-	if (argc < 2) {
-		fprintf(stderr, "Please, provide a command name on the command-line\n");
-		goto exit1;
-	} else if (argc == 3 && !strcmp(argv[1], "get_value")) {
+	if (argc < 2)
+		goto print_usage_error;
+
+	if (argc == 3 && !strcmp(argv[1], "get_value")) {
 		if (!git_config_get_value(argv[2], &v)) {
 			if (!v)
 				printf("(NULL)\n");
 			else
 				printf("%s\n", v);
-			goto exit0;
 		} else {
 			printf("Value not found for \"%s\"\n", argv[2]);
-			goto exit1;
+			ret = TC_VALUE_NOT_FOUND;
 		}
 	} else if (argc == 3 && !strcmp(argv[1], "get_value_multi")) {
 		strptr = git_config_get_value_multi(argv[2]);
@@ -104,41 +119,38 @@ int cmd__config(int argc, const char **argv)
 				else
 					printf("%s\n", v);
 			}
-			goto exit0;
 		} else {
 			printf("Value not found for \"%s\"\n", argv[2]);
-			goto exit1;
+			ret = TC_VALUE_NOT_FOUND;
 		}
 	} else if (argc == 3 && !strcmp(argv[1], "get_int")) {
 		if (!git_config_get_int(argv[2], &val)) {
 			printf("%d\n", val);
-			goto exit0;
 		} else {
 			printf("Value not found for \"%s\"\n", argv[2]);
-			goto exit1;
+			ret = TC_VALUE_NOT_FOUND;
 		}
 	} else if (argc == 3 && !strcmp(argv[1], "get_bool")) {
 		if (!git_config_get_bool(argv[2], &val)) {
 			printf("%d\n", val);
-			goto exit0;
 		} else {
 			printf("Value not found for \"%s\"\n", argv[2]);
-			goto exit1;
+			ret = TC_VALUE_NOT_FOUND;
 		}
 	} else if (argc == 3 && !strcmp(argv[1], "get_string")) {
 		if (!git_config_get_string_const(argv[2], &v)) {
 			printf("%s\n", v);
-			goto exit0;
 		} else {
 			printf("Value not found for \"%s\"\n", argv[2]);
-			goto exit1;
+			ret = TC_VALUE_NOT_FOUND;
 		}
-	} else if (!strcmp(argv[1], "configset_get_value")) {
+	} else if (argc >= 3 && !strcmp(argv[1], "configset_get_value")) {
 		for (i = 3; i < argc; i++) {
 			int err;
 			if ((err = git_configset_add_file(&cs, argv[i]))) {
 				fprintf(stderr, "Error (%d) reading configuration file %s.\n", err, argv[i]);
-				goto exit2;
+				ret = TC_CONFIG_FILE_ERROR;
+				goto out;
 			}
 		}
 		if (!git_configset_get_value(&cs, argv[2], &v)) {
@@ -146,17 +158,17 @@ int cmd__config(int argc, const char **argv)
 				printf("(NULL)\n");
 			else
 				printf("%s\n", v);
-			goto exit0;
 		} else {
 			printf("Value not found for \"%s\"\n", argv[2]);
-			goto exit1;
+			ret = TC_VALUE_NOT_FOUND;
 		}
-	} else if (!strcmp(argv[1], "configset_get_value_multi")) {
+	} else if (argc >= 3 && !strcmp(argv[1], "configset_get_value_multi")) {
 		for (i = 3; i < argc; i++) {
 			int err;
 			if ((err = git_configset_add_file(&cs, argv[i]))) {
 				fprintf(stderr, "Error (%d) reading configuration file %s.\n", err, argv[i]);
-				goto exit2;
+				ret = TC_CONFIG_FILE_ERROR;
+				goto out;
 			}
 		}
 		strptr = git_configset_get_value_multi(&cs, argv[2]);
@@ -168,27 +180,19 @@ int cmd__config(int argc, const char **argv)
 				else
 					printf("%s\n", v);
 			}
-			goto exit0;
 		} else {
 			printf("Value not found for \"%s\"\n", argv[2]);
-			goto exit1;
+			ret = TC_VALUE_NOT_FOUND;
 		}
 	} else if (!strcmp(argv[1], "iterate")) {
 		git_config(iterate_cb, NULL);
-		goto exit0;
+	} else {
+print_usage_error:
+		fprintf(stderr, "Invalid syntax. Usage: test-tool config <cmd> [args]\n");
+		ret = TC_USAGE_ERROR;
 	}
 
-	die("%s: Please check the syntax and the function name", argv[0]);
-
-exit0:
-	git_configset_clear(&cs);
-	return 0;
-
-exit1:
-	git_configset_clear(&cs);
-	return 1;
-
-exit2:
+out:
 	git_configset_clear(&cs);
-	return 2;
+	return ret;
 }
-- 
2.26.2



* [PATCH v3 3/5] config: correctly read worktree configs in submodules
  2020-05-28  1:12   ` [PATCH v3 0/5] grep: honor sparse checkout and add option to ignore it Matheus Tavares
  2020-05-28  1:12     ` [PATCH v3 1/5] doc: grep: unify info on configuration variables Matheus Tavares
  2020-05-28  1:13     ` [PATCH v3 2/5] t/helper/test-config: return exit codes consistently Matheus Tavares
@ 2020-05-28  1:13     ` Matheus Tavares
  2020-05-30 14:49       ` Elijah Newren
  2020-05-28  1:13     ` [PATCH v3 4/5] grep: honor sparse checkout patterns Matheus Tavares
                       ` (2 subsequent siblings)
  5 siblings, 1 reply; 120+ messages in thread
From: Matheus Tavares @ 2020-05-28  1:13 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, newren, jonathantanmy

One of the steps in do_git_config_sequence() is to load the
worktree-specific config file. Although the function receives a git_dir
string, it relies on git_pathdup(), which uses the_repository->git_dir,
to build the path to the file. Furthermore, it checks whether
extensions.worktreeConfig is set through the
repository_format_worktree_config variable, which refers to
the_repository only. Thus, when a submodule has worktree settings, a
command executed in the superproject that recurses into the submodule
won't find those settings.

Such a scenario might not be needed now, but it will be in the following
patch. git-grep will learn to honor sparse checkouts and, when running
with --recurse-submodules, the submodule's sparse checkout settings must
be loaded. As these settings are stored in the config.worktree file,
they would be ignored without this patch. So let's fix this by reading
the right config.worktree file and extensions.worktreeConfig setting,
based on the git_dir and commondir paths given to
do_git_config_sequence(). Also add a test to avoid any regressions.

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---
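The core of the fix is to derive the path from the git_dir that was
passed in rather than from a process-wide default. A minimal standalone
sketch of that idea follows; worktree_config_path() and
should_read_worktree_config() are hypothetical helpers written with
plain libc, not functions from config.c, and the ".git/modules/sub"
directory in the usage note is only an assumed example location.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: build "<git_dir>/config.worktree" from the
 * directory given by the caller. Returns a malloc'ed string (caller
 * frees) or NULL on allocation failure.
 */
static char *worktree_config_path(const char *git_dir)
{
	const char *suffix = "/config.worktree";
	size_t len;
	char *path;

	assert(git_dir != NULL);

	len = strlen(git_dir) + strlen(suffix) + 1;
	path = malloc(len);
	if (!path)
		return NULL;
	snprintf(path, len, "%s%s", git_dir, suffix);
	return path;
}

/*
 * Hypothetical gate: the file is only read when worktree configs are
 * not being ignored, the repository in question enabled the extension,
 * and a git_dir is actually known.
 */
static int should_read_worktree_config(int ignore_worktree,
				       int extension_enabled,
				       const char *git_dir)
{
	return !ignore_worktree && extension_enabled && git_dir != NULL;
}
```

Called with a submodule's directory such as ".git/modules/sub", the
helper yields that submodule's own config.worktree path regardless of
which repository the process started in.
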
 config.c                   |  21 +++++--
 t/helper/test-config.c     | 119 +++++++++++++++++++++++++++----------
 t/t2404-worktree-config.sh |  16 +++++
 3 files changed, 118 insertions(+), 38 deletions(-)

diff --git a/config.c b/config.c
index 8db9c77098..c2d56309dc 100644
--- a/config.c
+++ b/config.c
@@ -1747,11 +1747,22 @@ static int do_git_config_sequence(const struct config_options *opts,
 		ret += git_config_from_file(fn, repo_config, data);
 
 	current_parsing_scope = CONFIG_SCOPE_WORKTREE;
-	if (!opts->ignore_worktree && repository_format_worktree_config) {
-		char *path = git_pathdup("config.worktree");
-		if (!access_or_die(path, R_OK, 0))
-			ret += git_config_from_file(fn, path, data);
-		free(path);
+	if (!opts->ignore_worktree && repo_config && opts->git_dir) {
+		struct repository_format repo_fmt = REPOSITORY_FORMAT_INIT;
+		struct strbuf buf = STRBUF_INIT;
+
+		read_repository_format(&repo_fmt, repo_config);
+
+		if (!verify_repository_format(&repo_fmt, &buf) &&
+		    repo_fmt.worktree_config) {
+			char *path = mkpathdup("%s/config.worktree", opts->git_dir);
+			if (!access_or_die(path, R_OK, 0))
+				ret += git_config_from_file(fn, path, data);
+			free(path);
+		}
+
+		strbuf_release(&buf);
+		clear_repository_format(&repo_fmt);
 	}
 
 	current_parsing_scope = CONFIG_SCOPE_COMMAND;
diff --git a/t/helper/test-config.c b/t/helper/test-config.c
index 1c8e965840..284f83a921 100644
--- a/t/helper/test-config.c
+++ b/t/helper/test-config.c
@@ -2,12 +2,19 @@
 #include "cache.h"
 #include "config.h"
 #include "string-list.h"
+#include "submodule-config.h"
 
 /*
  * This program exposes the C API of the configuration mechanism
  * as a set of simple commands in order to facilitate testing.
  *
- * Reads stdin and prints result of command to stdout:
+ * Usage: test-tool config [--submodule=<path>] <cmd> [<args>]
+ *
+ * If --submodule=<path> is given, <cmd> will operate on the submodule at the
+ * given <path>. This option is not valid for the commands: read_early_config,
+ * configset_get_value and configset_get_value_multi.
+ *
+ * Possible cmds are:
  *
  * get_value -> prints the value with highest priority for the entered key
  *
@@ -84,33 +91,63 @@ int cmd__config(int argc, const char **argv)
 	int i, val;
 	const char *v;
 	const struct string_list *strptr;
-	struct config_set cs;
+	struct config_set cs = { .hash_initialized = 0 };
 	enum test_config_exit_code ret = TC_SUCCESS;
+	struct repository *repo = the_repository;
+	const char *subrepo_path = NULL;
+
+	argc--; /* skip over "config" */
+	argv++;
+
+	if (argc == 0)
+		goto print_usage_error;
+
+	if (skip_prefix(*argv, "--submodule=", &subrepo_path)) {
+		argc--;
+		argv++;
+		if (argc == 0)
+			goto print_usage_error;
+	}
 
-	if (argc == 3 && !strcmp(argv[1], "read_early_config")) {
-		read_early_config(early_config_cb, (void *)argv[2]);
+	if (argc == 2 && !strcmp(argv[0], "read_early_config")) {
+		if (subrepo_path) {
+			fprintf(stderr, "Cannot use --submodule with read_early_config\n");
+			return TC_USAGE_ERROR;
+		}
+		read_early_config(early_config_cb, (void *)argv[1]);
 		return TC_SUCCESS;
 	}
 
 	setup_git_directory();
-
 	git_configset_init(&cs);
 
-	if (argc < 2)
-		goto print_usage_error;
+	if (subrepo_path) {
+		const struct submodule *sub;
+		struct repository *subrepo = xcalloc(1, sizeof(*repo));
+
+		sub = submodule_from_path(the_repository, &null_oid, subrepo_path);
+		if (!sub || repo_submodule_init(subrepo, the_repository, sub)) {
+			fprintf(stderr, "Invalid argument to --submodule: '%s'\n",
+				subrepo_path);
+			free(subrepo);
+			ret = TC_USAGE_ERROR;
+			goto out;
+		}
+		repo = subrepo;
+	}
 
-	if (argc == 3 && !strcmp(argv[1], "get_value")) {
-		if (!git_config_get_value(argv[2], &v)) {
+	if (argc == 2 && !strcmp(argv[0], "get_value")) {
+		if (!repo_config_get_value(repo, argv[1], &v)) {
 			if (!v)
 				printf("(NULL)\n");
 			else
 				printf("%s\n", v);
 		} else {
-			printf("Value not found for \"%s\"\n", argv[2]);
+			printf("Value not found for \"%s\"\n", argv[1]);
 			ret = TC_VALUE_NOT_FOUND;
 		}
-	} else if (argc == 3 && !strcmp(argv[1], "get_value_multi")) {
-		strptr = git_config_get_value_multi(argv[2]);
+	} else if (argc == 2 && !strcmp(argv[0], "get_value_multi")) {
+		strptr = repo_config_get_value_multi(repo, argv[1]);
 		if (strptr) {
 			for (i = 0; i < strptr->nr; i++) {
 				v = strptr->items[i].string;
@@ -120,32 +157,38 @@ int cmd__config(int argc, const char **argv)
 					printf("%s\n", v);
 			}
 		} else {
-			printf("Value not found for \"%s\"\n", argv[2]);
+			printf("Value not found for \"%s\"\n", argv[1]);
 			ret = TC_VALUE_NOT_FOUND;
 		}
-	} else if (argc == 3 && !strcmp(argv[1], "get_int")) {
-		if (!git_config_get_int(argv[2], &val)) {
+	} else if (argc == 2 && !strcmp(argv[0], "get_int")) {
+		if (!repo_config_get_int(repo, argv[1], &val)) {
 			printf("%d\n", val);
 		} else {
-			printf("Value not found for \"%s\"\n", argv[2]);
+			printf("Value not found for \"%s\"\n", argv[1]);
 			ret = TC_VALUE_NOT_FOUND;
 		}
-	} else if (argc == 3 && !strcmp(argv[1], "get_bool")) {
-		if (!git_config_get_bool(argv[2], &val)) {
+	} else if (argc == 2 && !strcmp(argv[0], "get_bool")) {
+		if (!repo_config_get_bool(repo, argv[1], &val)) {
 			printf("%d\n", val);
 		} else {
-			printf("Value not found for \"%s\"\n", argv[2]);
+
+			printf("Value not found for \"%s\"\n", argv[1]);
 			ret = TC_VALUE_NOT_FOUND;
 		}
-	} else if (argc == 3 && !strcmp(argv[1], "get_string")) {
-		if (!git_config_get_string_const(argv[2], &v)) {
+	} else if (argc == 2 && !strcmp(argv[0], "get_string")) {
+		if (!repo_config_get_string_const(repo, argv[1], &v)) {
 			printf("%s\n", v);
 		} else {
-			printf("Value not found for \"%s\"\n", argv[2]);
+			printf("Value not found for \"%s\"\n", argv[1]);
 			ret = TC_VALUE_NOT_FOUND;
 		}
-	} else if (argc >= 3 && !strcmp(argv[1], "configset_get_value")) {
-		for (i = 3; i < argc; i++) {
+	} else if (argc >= 2 && !strcmp(argv[0], "configset_get_value")) {
+		if (subrepo_path) {
+			fprintf(stderr, "Cannot use --submodule with configset_get_value\n");
+			ret = TC_USAGE_ERROR;
+			goto out;
+		}
+		for (i = 2; i < argc; i++) {
 			int err;
 			if ((err = git_configset_add_file(&cs, argv[i]))) {
 				fprintf(stderr, "Error (%d) reading configuration file %s.\n", err, argv[i]);
@@ -153,17 +196,22 @@ int cmd__config(int argc, const char **argv)
 				goto out;
 			}
 		}
-		if (!git_configset_get_value(&cs, argv[2], &v)) {
+		if (!git_configset_get_value(&cs, argv[1], &v)) {
 			if (!v)
 				printf("(NULL)\n");
 			else
 				printf("%s\n", v);
 		} else {
-			printf("Value not found for \"%s\"\n", argv[2]);
+			printf("Value not found for \"%s\"\n", argv[1]);
 			ret = TC_VALUE_NOT_FOUND;
 		}
-	} else if (argc >= 3 && !strcmp(argv[1], "configset_get_value_multi")) {
-		for (i = 3; i < argc; i++) {
+	} else if (argc >= 2 && !strcmp(argv[0], "configset_get_value_multi")) {
+		if (subrepo_path) {
+			fprintf(stderr, "Cannot use --submodule with configset_get_value_multi\n");
+			ret = TC_USAGE_ERROR;
+			goto out;
+		}
+		for (i = 2; i < argc; i++) {
 			int err;
 			if ((err = git_configset_add_file(&cs, argv[i]))) {
 				fprintf(stderr, "Error (%d) reading configuration file %s.\n", err, argv[i]);
@@ -171,7 +219,7 @@ int cmd__config(int argc, const char **argv)
 				goto out;
 			}
 		}
-		strptr = git_configset_get_value_multi(&cs, argv[2]);
+		strptr = git_configset_get_value_multi(&cs, argv[1]);
 		if (strptr) {
 			for (i = 0; i < strptr->nr; i++) {
 				v = strptr->items[i].string;
@@ -181,18 +229,23 @@ int cmd__config(int argc, const char **argv)
 					printf("%s\n", v);
 			}
 		} else {
-			printf("Value not found for \"%s\"\n", argv[2]);
+			printf("Value not found for \"%s\"\n", argv[1]);
 			ret = TC_VALUE_NOT_FOUND;
 		}
-	} else if (!strcmp(argv[1], "iterate")) {
-		git_config(iterate_cb, NULL);
+	} else if (!strcmp(argv[0], "iterate")) {
+		repo_config(repo, iterate_cb, NULL);
 	} else {
 print_usage_error:
-		fprintf(stderr, "Invalid syntax. Usage: test-tool config <cmd> [args]\n");
+		fprintf(stderr, "Invalid syntax. Usage: test-tool config"
+				" [--submodule=<path>] <cmd> [args]\n");
 		ret = TC_USAGE_ERROR;
 	}
 
 out:
 	git_configset_clear(&cs);
+	if (repo != the_repository) {
+		repo_clear(repo);
+		free(repo);
+	}
 	return ret;
 }
diff --git a/t/t2404-worktree-config.sh b/t/t2404-worktree-config.sh
index 286121d8de..b6ab793203 100755
--- a/t/t2404-worktree-config.sh
+++ b/t/t2404-worktree-config.sh
@@ -76,4 +76,20 @@ test_expect_success 'config.worktree no longer read without extension' '
 	test_cmp_config -C wt2 shared this.is
 '
 
+test_expect_success 'correctly read config.worktree from submodules' '
+	test_unconfig extensions.worktreeConfig &&
+	git init sub &&
+	(
+		cd sub &&
+		test_commit A &&
+		git config extensions.worktreeConfig true &&
+		git config --worktree wtconfig.sub test-value
+	) &&
+	git submodule add ./sub &&
+	git commit -m "add sub" &&
+	echo test-value >expect &&
+	test-tool config --submodule=sub get_value wtconfig.sub >actual &&
+	test_cmp expect actual
+'
+
 test_done
-- 
2.26.2



* [PATCH v3 4/5] grep: honor sparse checkout patterns
  2020-05-28  1:12   ` [PATCH v3 0/5] grep: honor sparse checkout and add option to ignore it Matheus Tavares
                       ` (2 preceding siblings ...)
  2020-05-28  1:13     ` [PATCH v3 3/5] config: correctly read worktree configs in submodules Matheus Tavares
@ 2020-05-28  1:13     ` Matheus Tavares
  2020-05-30 15:48       ` Elijah Newren
  2020-05-28  1:13     ` [PATCH v3 5/5] config: add setting to ignore sparsity patterns in some cmds Matheus Tavares
  2020-06-12 15:44     ` [PATCH v4 0/6] grep: honor sparse checkout and add option to ignore it Matheus Tavares
  5 siblings, 1 reply; 120+ messages in thread
From: Matheus Tavares @ 2020-05-28  1:13 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, newren, jonathantanmy

One of the main uses for a sparse checkout is to allow users to focus on
the subset of files in a repository in which they are interested. But
git-grep currently ignores the sparsity patterns and also reports
matches found outside this subset, which defeats that purpose. Let's fix
that, making it honor the sparsity boundaries for every grepping case
where this is relevant:

- git grep in worktree
- git grep --cached
- git grep $REVISION

For the worktree case, we will not grep paths that have the
SKIP_WORKTREE bit set, even if they are present for some reason (e.g.
manually created after `git sparse-checkout init`). But the next patch
will add an option to do so. (See 'Note' below.)

For `git grep $REVISION`, we will choose to honor the sparsity patterns
only when $REVISION is a commit-ish object. The reason is that, for a
tree, we don't know whether it represents the root of a repository or a
subtree. So we wouldn't be able to correctly match it against the
sparsity patterns. E.g. suppose we have a repository with these two
sparsity rules: "/*" and "!/a"; and the following structure:

/
| - a (file)
| - d (dir)
    | - a (file)

If `git grep $REVISION` were to honor the sparsity patterns for every
object type, when grepping the /d tree, we would wrongly ignore the /d/a
file. This happens because we wouldn't know it resides in /d and
therefore it would wrongly match the pattern "!/a". Furthermore, for a
search in a blob object, we wouldn't even have a path to check the
patterns against. So, let's ignore the sparsity patterns when grepping
non-commit-ish objects.

Note: The behavior introduced in this patch is what some users have
reported[1] that they would like by default. But the old behavior is
still desirable for some use cases. Therefore, the next patch will add
an option to allow restoring it when needed.

[1]: https://lore.kernel.org/git/CABPp-BGuFhDwWZBRaD3nA8ui46wor-4=Ha1G1oApsfF8KNpfGQ@mail.gmail.com/

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---
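The "/*" and "!/a" example from the commit message can be reproduced
with a toy matcher. The sketch below is not Git's pattern-matching code:
anchored_match() and excluded_by_not_slash_a() are hypothetical helpers
that only understand a pattern anchored with a leading '/', which is
enough to show why a path must be relative to the repository root.

```c
#include <assert.h>
#include <string.h>

/*
 * Toy matcher (not Git's real implementation): a pattern with a leading
 * '/' is anchored to the repository root and only matches that exact
 * path.
 */
static int anchored_match(const char *pattern, const char *path)
{
	assert(pattern != NULL && path != NULL);

	if (pattern[0] != '/')
		return 0; /* unanchored patterns are out of scope here */
	return !strcmp(pattern + 1, path);
}

/* Returns 1 when the negated rule "!/a" would exclude `path`. */
static int excluded_by_not_slash_a(const char *path)
{
	return anchored_match("/a", path);
}
```

When the walk starts at the root, the file under d is seen as "d/a" and
is kept. When only the /d tree is given, the same file is seen as "a",
matches "/a", and would be wrongly excluded.
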
 builtin/grep.c                   | 125 ++++++++++++++++++++--
 t/t7011-skip-worktree-reading.sh |   9 --
 t/t7817-grep-sparse-checkout.sh  | 174 +++++++++++++++++++++++++++++++
 3 files changed, 291 insertions(+), 17 deletions(-)
 create mode 100755 t/t7817-grep-sparse-checkout.sh

diff --git a/builtin/grep.c b/builtin/grep.c
index a5056f395a..11e33b8aee 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -410,7 +410,7 @@ static int grep_cache(struct grep_opt *opt,
 		      const struct pathspec *pathspec, int cached);
 static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 		     struct tree_desc *tree, struct strbuf *base, int tn_len,
-		     int check_attr);
+		     int is_root_tree);
 
 static int grep_submodule(struct grep_opt *opt,
 			  const struct pathspec *pathspec,
@@ -508,6 +508,10 @@ static int grep_cache(struct grep_opt *opt,
 
 	for (nr = 0; nr < repo->index->cache_nr; nr++) {
 		const struct cache_entry *ce = repo->index->cache[nr];
+
+		if (ce_skip_worktree(ce))
+			continue;
+
 		strbuf_setlen(&name, name_base_len);
 		strbuf_addstr(&name, ce->name);
 
@@ -520,8 +524,7 @@ static int grep_cache(struct grep_opt *opt,
 			 * cache entry are identical, even if worktree file has
 			 * been modified, so use cache version instead
 			 */
-			if (cached || (ce->ce_flags & CE_VALID) ||
-			    ce_skip_worktree(ce)) {
+			if (cached || (ce->ce_flags & CE_VALID)) {
 				if (ce_stage(ce) || ce_intent_to_add(ce))
 					continue;
 				hit |= grep_oid(opt, &ce->oid, name.buf,
@@ -552,9 +555,76 @@ static int grep_cache(struct grep_opt *opt,
 	return hit;
 }
 
-static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
-		     struct tree_desc *tree, struct strbuf *base, int tn_len,
-		     int check_attr)
+static struct pattern_list *get_sparsity_patterns(struct repository *repo)
+{
+	struct pattern_list *patterns;
+	char *sparse_file;
+	int sparse_config, cone_config;
+
+	if (repo_config_get_bool(repo, "core.sparsecheckout", &sparse_config) ||
+	    !sparse_config) {
+		return NULL;
+	}
+
+	sparse_file = repo_git_path(repo, "info/sparse-checkout");
+	patterns = xcalloc(1, sizeof(*patterns));
+
+	if (repo_config_get_bool(repo, "core.sparsecheckoutcone", &cone_config))
+		cone_config = 0;
+	patterns->use_cone_patterns = cone_config;
+
+	if (add_patterns_from_file_to_list(sparse_file, "", 0, patterns, NULL)) {
+		if (file_exists(sparse_file)) {
+			warning(_("failed to load sparse-checkout file: '%s'"),
+				sparse_file);
+		}
+		free(sparse_file);
+		free(patterns);
+		return NULL;
+	}
+
+	free(sparse_file);
+	return patterns;
+}
+
+static int in_sparse_checkout(struct strbuf *path, int prefix_len,
+			      unsigned int entry_mode,
+			      struct index_state *istate,
+			      struct pattern_list *sparsity,
+			      enum pattern_match_result parent_match,
+			      enum pattern_match_result *match)
+{
+	int dtype = DT_UNKNOWN;
+	int is_dir = S_ISDIR(entry_mode);
+
+	if (parent_match == MATCHED_RECURSIVE) {
+		*match = parent_match;
+		return 1;
+	}
+
+	if (is_dir && !is_dir_sep(path->buf[path->len - 1]))
+		strbuf_addch(path, '/');
+
+	*match = path_matches_pattern_list(path->buf, path->len,
+					   path->buf + prefix_len, &dtype,
+					   sparsity, istate);
+	if (*match == UNDECIDED)
+		*match = parent_match;
+
+	if (is_dir)
+		strbuf_trim_trailing_dir_sep(path);
+
+	if (*match == NOT_MATCHED &&
+		(!is_dir || (is_dir && sparsity->use_cone_patterns)))
+	     return 0;
+
+	return 1;
+}
+
+static int do_grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
+			struct tree_desc *tree, struct strbuf *base, int tn_len,
+			int check_attr, struct pattern_list *sparsity,
+			enum pattern_match_result default_sparsity_match)
 {
 	struct repository *repo = opt->repo;
 	int hit = 0;
@@ -570,6 +640,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 
 	while (tree_entry(tree, &entry)) {
 		int te_len = tree_entry_len(&entry);
+		enum pattern_match_result sparsity_match = 0;
 
 		if (match != all_entries_interesting) {
 			strbuf_addstr(&name, base->buf + tn_len);
@@ -586,6 +657,19 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 
 		strbuf_add(base, entry.path, te_len);
 
+		if (sparsity) {
+			struct strbuf path = STRBUF_INIT;
+			strbuf_addstr(&path, base->buf + tn_len);
+
+			if (!in_sparse_checkout(&path, old_baselen - tn_len,
+						entry.mode, repo->index,
+						sparsity, default_sparsity_match,
+						&sparsity_match)) {
+				strbuf_setlen(base, old_baselen);
+				continue;
+			}
+		}
+
 		if (S_ISREG(entry.mode)) {
 			hit |= grep_oid(opt, &entry.oid, base->buf, tn_len,
 					 check_attr ? base->buf + tn_len : NULL);
@@ -602,8 +686,8 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 
 			strbuf_addch(base, '/');
 			init_tree_desc(&sub, data, size);
-			hit |= grep_tree(opt, pathspec, &sub, base, tn_len,
-					 check_attr);
+			hit |= do_grep_tree(opt, pathspec, &sub, base, tn_len,
+					    check_attr, sparsity, sparsity_match);
 			free(data);
 		} else if (recurse_submodules && S_ISGITLINK(entry.mode)) {
 			hit |= grep_submodule(opt, pathspec, &entry.oid,
@@ -621,6 +705,31 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 	return hit;
 }
 
+/*
+ * Note: sparsity patterns and paths' attributes will only be considered if
+ * is_root_tree has true value. (Otherwise, we cannot properly perform pattern
+ * matching on paths.)
+ */
+static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
+		     struct tree_desc *tree, struct strbuf *base, int tn_len,
+		     int is_root_tree)
+{
+	struct pattern_list *patterns = NULL;
+	int ret;
+
+	if (is_root_tree)
+		patterns = get_sparsity_patterns(opt->repo);
+
+	ret = do_grep_tree(opt, pathspec, tree, base, tn_len, is_root_tree,
+			   patterns, 0);
+
+	if (patterns) {
+		clear_pattern_list(patterns);
+		free(patterns);
+	}
+	return ret;
+}
+
 static int grep_object(struct grep_opt *opt, const struct pathspec *pathspec,
 		       struct object *obj, const char *name, const char *path)
 {
diff --git a/t/t7011-skip-worktree-reading.sh b/t/t7011-skip-worktree-reading.sh
index 37525cae3a..26852586ac 100755
--- a/t/t7011-skip-worktree-reading.sh
+++ b/t/t7011-skip-worktree-reading.sh
@@ -109,15 +109,6 @@ test_expect_success 'ls-files --modified' '
 	test -z "$(git ls-files -m)"
 '
 
-test_expect_success 'grep with skip-worktree file' '
-	git update-index --no-skip-worktree 1 &&
-	echo test > 1 &&
-	git update-index 1 &&
-	git update-index --skip-worktree 1 &&
-	rm 1 &&
-	test "$(git grep --no-ext-grep test)" = "1:test"
-'
-
 echo ":000000 100644 $ZERO_OID $EMPTY_BLOB A	1" > expected
 test_expect_success 'diff-index does not examine skip-worktree absent entries' '
 	setup_absent &&
diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
new file mode 100755
index 0000000000..ce080cf572
--- /dev/null
+++ b/t/t7817-grep-sparse-checkout.sh
@@ -0,0 +1,174 @@
+#!/bin/sh
+
+test_description='grep in sparse checkout
+
+This test creates a repo with the following structure:
+
+.
+|-- a
+|-- b
+|-- dir
+|   `-- c
+|-- sub
+|   |-- A
+|   |   `-- a
+|   `-- B
+|       `-- b
+`-- sub2
+    `-- a
+
+Where . has non-cone mode sparsity patterns, sub is a submodule with cone mode
+sparsity patterns and sub2 is a submodule that is excluded by the superproject
+sparsity patterns. The resulting sparse checkout should leave the following
+structure on the working tree:
+
+.
+|-- a
+|-- sub
+|   `-- B
+|       `-- b
+`-- sub2
+    `-- a
+
+But note that sub2 should have the SKIP_WORKTREE bit set.
+'
+
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+	echo "text" >a &&
+	echo "text" >b &&
+	mkdir dir &&
+	echo "text" >dir/c &&
+
+	git init sub &&
+	(
+		cd sub &&
+		mkdir A B &&
+		echo "text" >A/a &&
+		echo "text" >B/b &&
+		git add A B &&
+		git commit -m sub &&
+		git sparse-checkout init --cone &&
+		git sparse-checkout set B
+	) &&
+
+	git init sub2 &&
+	(
+		cd sub2 &&
+		echo "text" >a &&
+		git add a &&
+		git commit -m sub2
+	) &&
+
+	git submodule add ./sub &&
+	git submodule add ./sub2 &&
+	git add a b dir &&
+	git commit -m super &&
+	git sparse-checkout init --no-cone &&
+	git sparse-checkout set "/*" "!b" "!/*/" "sub" &&
+
+	git tag -am tag-to-commit tag-to-commit HEAD &&
+	tree=$(git rev-parse HEAD^{tree}) &&
+	git tag -am tag-to-tree tag-to-tree $tree &&
+
+	test_path_is_missing b &&
+	test_path_is_missing dir &&
+	test_path_is_missing sub/A &&
+	test_path_is_file a &&
+	test_path_is_file sub/B/b &&
+	test_path_is_file sub2/a
+'
+
+# The test bellow checks a special case: the sparsity patterns exclude '/b'
+# and sparse checkout is enable, but the path exists on the working tree (e.g.
+# manually created after `git sparse-checkout init`). In this case, grep should
+# skip it.
+test_expect_success 'grep in working tree should honor sparse checkout' '
+	cat >expect <<-EOF &&
+	a:text
+	EOF
+	echo "new-text" >b &&
+	test_when_finished "rm b" &&
+	git grep "text" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep --cached should honor sparse checkout' '
+	cat >expect <<-EOF &&
+	a:text
+	EOF
+	git grep --cached "text" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep <commit-ish> should honor sparse checkout' '
+	commit=$(git rev-parse HEAD) &&
+	cat >expect_commit <<-EOF &&
+	$commit:a:text
+	EOF
+	cat >expect_tag-to-commit <<-EOF &&
+	tag-to-commit:a:text
+	EOF
+	git grep "text" $commit >actual_commit &&
+	test_cmp expect_commit actual_commit &&
+	git grep "text" tag-to-commit >actual_tag-to-commit &&
+	test_cmp expect_tag-to-commit actual_tag-to-commit
+'
+
+test_expect_success 'grep <tree-ish> should ignore sparsity patterns' '
+	commit=$(git rev-parse HEAD) &&
+	tree=$(git rev-parse HEAD^{tree}) &&
+	cat >expect_tree <<-EOF &&
+	$tree:a:text
+	$tree:b:text
+	$tree:dir/c:text
+	EOF
+	cat >expect_tag-to-tree <<-EOF &&
+	tag-to-tree:a:text
+	tag-to-tree:b:text
+	tag-to-tree:dir/c:text
+	EOF
+	git grep "text" $tree >actual_tree &&
+	test_cmp expect_tree actual_tree &&
+	git grep "text" tag-to-tree >actual_tag-to-tree &&
+	test_cmp expect_tag-to-tree actual_tag-to-tree
+'
+
+# Note that sub2/ is present in the worktree but it is excluded by the sparsity
+# patterns, so grep should not recurse into it.
+test_expect_success 'grep --recurse-submodules should honor sparse checkout in submodule' '
+	cat >expect <<-EOF &&
+	a:text
+	sub/B/b:text
+	EOF
+	git grep --recurse-submodules "text" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep --recurse-submodules --cached should honor sparse checkout in submodule' '
+	cat >expect <<-EOF &&
+	a:text
+	sub/B/b:text
+	EOF
+	git grep --recurse-submodules --cached "text" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep --recurse-submodules <commit-ish> should honor sparse checkout in submodule' '
+	commit=$(git rev-parse HEAD) &&
+	cat >expect_commit <<-EOF &&
+	$commit:a:text
+	$commit:sub/B/b:text
+	EOF
+	cat >expect_tag-to-commit <<-EOF &&
+	tag-to-commit:a:text
+	tag-to-commit:sub/B/b:text
+	EOF
+	git grep --recurse-submodules "text" $commit >actual_commit &&
+	test_cmp expect_commit actual_commit &&
+	git grep --recurse-submodules "text" tag-to-commit >actual_tag-to-commit &&
+	test_cmp expect_tag-to-commit actual_tag-to-commit
+'
+
+test_done
-- 
2.26.2


^ permalink raw reply	[flat|nested] 120+ messages in thread

* [PATCH v3 5/5] config: add setting to ignore sparsity patterns in some cmds
  2020-05-28  1:12   ` [PATCH v3 0/5] grep: honor sparse checkout and add option to ignore it Matheus Tavares
                       ` (3 preceding siblings ...)
  2020-05-28  1:13     ` [PATCH v3 4/5] grep: honor sparse checkout patterns Matheus Tavares
@ 2020-05-28  1:13     ` Matheus Tavares
  2020-05-30 16:18       ` Elijah Newren
  2020-06-12 15:44     ` [PATCH v4 0/6] grep: honor sparse checkout and add option to ignore it Matheus Tavares
  5 siblings, 1 reply; 120+ messages in thread
From: Matheus Tavares @ 2020-05-28  1:13 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, newren, jonathantanmy

When sparse checkout is enabled, some users expect the output of certain
commands (such as grep, diff, and log) to also be restricted to the
sparsity patterns. This would allow them to effectively work only on the
subset of files in which they are interested, and allow some commands to
possibly perform better, by not considering uninteresting paths. For
this reason, we taught grep to honor the sparsity patterns in the
previous patch. On the other hand, allowing grep and the other
commands mentioned to optionally ignore the patterns also makes for some
interesting use cases, e.g. using grep to search for a function's
documentation that resides outside the sparse checkout.

In any case, there is currently no way for users to configure the
behavior they want for these commands. Aiming to provide this
flexibility, let's introduce the sparse.restrictCmds setting (and the
analogous --[no-]restrict-to-sparse-paths global option). The default
value is true. For now, grep is the only command affected by this
setting, but the goal is to support more commands in the future.

Helped-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---
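Note on precedence: the command-line option, when given, wins over the
config setting, and the setting defaults to true when unset. A minimal
standalone sketch of that rule follows. It is not the code from this
patch: resolve_restrict() and its parameters are made up for
illustration, with config_found and config_value standing in for the
result of the config lookup.

```c
#include <assert.h>

/*
 * Hypothetical helper, for illustration only.
 *
 * cli_opt:      -1 if neither option was given on the command line,
 *                0 for --no-restrict-to-sparse-paths,
 *                1 for --restrict-to-sparse-paths
 * config_found: nonzero if sparse.restrictCmds is set in the config
 * config_value: the boolean value read from the config, if any
 */
int resolve_restrict(int cli_opt, int config_found, int config_value)
{
	assert(cli_opt >= -1 && cli_opt <= 1);

	/* An explicit command-line option always wins. */
	if (cli_opt >= 0)
		return cli_opt;

	/* Otherwise fall back to the config setting, if present. */
	if (config_found)
		return !!config_value;

	/* Nothing was specified anywhere: restrict by default. */
	return 1;
}
```

For instance, resolve_restrict(0, 1, 1) yields 0, which mirrors the
'git -c sparse.restrictCmds=true --no-restrict-to-sparse-paths grep'
case exercised in the tests.
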
 Documentation/config.txt               |   2 +
 Documentation/config/sparse.txt        |  24 +++++
 Documentation/git-grep.txt             |   3 +
 Documentation/git.txt                  |   4 +
 Makefile                               |   1 +
 builtin/grep.c                         |  13 ++-
 contrib/completion/git-completion.bash |   2 +
 git.c                                  |   6 ++
 sparse-checkout.c                      |  16 +++
 sparse-checkout.h                      |  11 +++
 t/t7817-grep-sparse-checkout.sh        | 132 ++++++++++++++++++++++++-
 t/t9902-completion.sh                  |   4 +-
 12 files changed, 212 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/config/sparse.txt
 create mode 100644 sparse-checkout.c
 create mode 100644 sparse-checkout.h

diff --git a/Documentation/config.txt b/Documentation/config.txt
index ef0768b91a..fd74b80302 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -436,6 +436,8 @@ include::config/sequencer.txt[]
 
 include::config/showbranch.txt[]
 
+include::config/sparse.txt[]
+
 include::config/splitindex.txt[]
 
 include::config/ssh.txt[]
diff --git a/Documentation/config/sparse.txt b/Documentation/config/sparse.txt
new file mode 100644
index 0000000000..2a25b4b8ef
--- /dev/null
+++ b/Documentation/config/sparse.txt
@@ -0,0 +1,24 @@
+sparse.restrictCmds::
+	Only meaningful in conjunction with core.sparseCheckout. This option
+	extends sparse checkouts (which limit which paths are written to the
+	working tree), so that output and operations are also limited to the
+	sparsity paths where possible and implemented. The purpose of this
+	option is to (1) focus output for the user on the portion of the
+	repository that is of interest to them, and (2) enable potentially
+	dramatic performance improvements, especially in conjunction with
+	partial clones.
++
+When this option is true (default), some git commands may limit their behavior
+to the paths specified by the sparsity patterns, or to the intersection of
+those paths and any pathspecs (like `*.c`) that the user might also specify on the
+command line. When false, the affected commands will work on full trees,
+ignoring the sparsity patterns. For now, only git-grep honors this setting. In
+this command, the restriction takes effect in three cases: with --cached; when
+a commit-ish is given; when searching a working tree where some paths excluded
+by the sparsity patterns are present (e.g. manually created paths or not
+removed submodules).
++
+Note: commands which export, integrity check, or create history will always
+operate on full trees (e.g. fast-export, format-patch, fsck, commit, etc.),
+unaffected by any sparsity patterns. Also, writing commands such as
+sparse-checkout and read-tree will not be affected by this configuration.
diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
index 9bdf807584..abbf100109 100644
--- a/Documentation/git-grep.txt
+++ b/Documentation/git-grep.txt
@@ -41,6 +41,9 @@ characters.  An empty string as search expression matches all lines.
 CONFIGURATION
 -------------
 
+git-grep honors the sparse.restrictCmds setting. See its definition in
+linkgit:git-config[1].
+
 :git-grep: 1
 include::config/grep.txt[]
 
diff --git a/Documentation/git.txt b/Documentation/git.txt
index 9d6769e95a..5e107c6246 100644
--- a/Documentation/git.txt
+++ b/Documentation/git.txt
@@ -180,6 +180,10 @@ If you just want to run git as if it was started in `<path>` then use
 	Do not perform optional operations that require locks. This is
 	equivalent to setting the `GIT_OPTIONAL_LOCKS` to `0`.
 
+--[no-]restrict-to-sparse-paths::
+	Overrides the sparse.restrictCmds configuration (see
+	linkgit:git-config[1]) for this execution.
+
 --list-cmds=group[,group...]::
 	List commands by group. This is an internal/experimental
 	option and may change or be removed in the future. Supported
diff --git a/Makefile b/Makefile
index 90aa329eb7..0c0013b32c 100644
--- a/Makefile
+++ b/Makefile
@@ -983,6 +983,7 @@ LIB_OBJS += sha1-name.o
 LIB_OBJS += shallow.o
 LIB_OBJS += sideband.o
 LIB_OBJS += sigchain.o
+LIB_OBJS += sparse-checkout.o
 LIB_OBJS += split-index.o
 LIB_OBJS += stable-qsort.o
 LIB_OBJS += strbuf.o
diff --git a/builtin/grep.c b/builtin/grep.c
index 11e33b8aee..cc696dab4a 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -25,6 +25,7 @@
 #include "submodule-config.h"
 #include "object-store.h"
 #include "packfile.h"
+#include "sparse-checkout.h"
 
 static char const * const grep_usage[] = {
 	N_("git grep [<options>] [-e] <pattern> [<rev>...] [[--] <path>...]"),
@@ -498,6 +499,7 @@ static int grep_cache(struct grep_opt *opt,
 	int nr;
 	struct strbuf name = STRBUF_INIT;
 	int name_base_len = 0;
+	int sparse_paths_only =	restrict_to_sparse_paths(repo);
 	if (repo->submodule_prefix) {
 		name_base_len = strlen(repo->submodule_prefix);
 		strbuf_addstr(&name, repo->submodule_prefix);
@@ -509,7 +511,7 @@ static int grep_cache(struct grep_opt *opt,
 	for (nr = 0; nr < repo->index->cache_nr; nr++) {
 		const struct cache_entry *ce = repo->index->cache[nr];
 
-		if (ce_skip_worktree(ce))
+		if (sparse_paths_only && ce_skip_worktree(ce))
 			continue;
 
 		strbuf_setlen(&name, name_base_len);
@@ -715,9 +717,10 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 		     int is_root_tree)
 {
 	struct pattern_list *patterns = NULL;
+	int sparse_paths_only = restrict_to_sparse_paths(opt->repo);
 	int ret;
 
-	if (is_root_tree)
+	if (is_root_tree && sparse_paths_only)
 		patterns = get_sparsity_patterns(opt->repo);
 
 	ret = do_grep_tree(opt, pathspec, tree, base, tn_len, is_root_tree,
@@ -1257,6 +1260,12 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
 
 	if (!use_index || untracked) {
 		int use_exclude = (opt_exclude < 0) ? use_index : !!opt_exclude;
+
+		if (opt_restrict_to_sparse_paths >= 0) {
+			die(_("--[no-]restrict-to-sparse-paths is incompatible"
+				  " with --no-index and --untracked"));
+		}
+
 		hit = grep_directory(&opt, &pathspec, use_exclude, use_index);
 	} else if (0 <= opt_exclude) {
 		die(_("--[no-]exclude-standard cannot be used for tracked contents"));
diff --git a/contrib/completion/git-completion.bash b/contrib/completion/git-completion.bash
index 70ad04e1b2..71956f7313 100644
--- a/contrib/completion/git-completion.bash
+++ b/contrib/completion/git-completion.bash
@@ -3208,6 +3208,8 @@ __git_main ()
 			--namespace=
 			--no-replace-objects
 			--help
+			--restrict-to-sparse-paths
+			--no-restrict-to-sparse-paths
 			"
 			;;
 		*)
diff --git a/git.c b/git.c
index a2d337eed7..6db1382ae4 100644
--- a/git.c
+++ b/git.c
@@ -38,6 +38,7 @@ const char git_more_info_string[] =
 	   "See 'git help git' for an overview of the system.");
 
 static int use_pager = -1;
+int opt_restrict_to_sparse_paths = -1;
 
 static void list_builtins(struct string_list *list, unsigned int exclude_option);
 
@@ -311,6 +312,10 @@ static int handle_options(const char ***argv, int *argc, int *envchanged)
 			} else {
 				exit(list_cmds(cmd));
 			}
+		} else if (!strcmp(cmd, "--restrict-to-sparse-paths")) {
+			opt_restrict_to_sparse_paths = 1;
+		} else if (!strcmp(cmd, "--no-restrict-to-sparse-paths")) {
+			opt_restrict_to_sparse_paths = 0;
 		} else {
 			fprintf(stderr, _("unknown option: %s\n"), cmd);
 			usage(git_usage_string);
@@ -319,6 +324,7 @@ static int handle_options(const char ***argv, int *argc, int *envchanged)
 		(*argv)++;
 		(*argc)--;
 	}
+
 	return (*argv) - orig_argv;
 }
 
diff --git a/sparse-checkout.c b/sparse-checkout.c
new file mode 100644
index 0000000000..9a9e50fd29
--- /dev/null
+++ b/sparse-checkout.c
@@ -0,0 +1,16 @@
+#include "cache.h"
+#include "config.h"
+#include "sparse-checkout.h"
+
+int restrict_to_sparse_paths(struct repository *repo)
+{
+	int ret;
+
+	if (opt_restrict_to_sparse_paths >= 0)
+		return opt_restrict_to_sparse_paths;
+
+	if (repo_config_get_bool(repo, "sparse.restrictcmds", &ret))
+		ret = 1;
+
+	return ret;
+}
diff --git a/sparse-checkout.h b/sparse-checkout.h
new file mode 100644
index 0000000000..1de3b588d8
--- /dev/null
+++ b/sparse-checkout.h
@@ -0,0 +1,11 @@
+#ifndef SPARSE_CHECKOUT_H
+#define SPARSE_CHECKOUT_H
+
+struct repository;
+
+extern int opt_restrict_to_sparse_paths; /* from git.c */
+
+/* Whether or not cmds should restrict behavior on sparse paths, in this repo */
+int restrict_to_sparse_paths(struct repository *repo);
+
+#endif /* SPARSE_CHECKOUT_H */
diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
index ce080cf572..1aef084186 100755
--- a/t/t7817-grep-sparse-checkout.sh
+++ b/t/t7817-grep-sparse-checkout.sh
@@ -80,10 +80,10 @@ test_expect_success 'setup' '
 	test_path_is_file sub2/a
 '
 
-# The test bellow checks a special case: the sparsity patterns exclude '/b'
+# The two tests below check a special case: the sparsity patterns exclude '/b'
 # and sparse checkout is enable, but the path exists on the working tree (e.g.
 # manually created after `git sparse-checkout init`). In this case, grep should
-# skip it.
+# skip the file by default, but not with --no-restrict-to-sparse-paths.
 test_expect_success 'grep in working tree should honor sparse checkout' '
 	cat >expect <<-EOF &&
 	a:text
@@ -93,6 +93,16 @@ test_expect_success 'grep in working tree should honor sparse checkout' '
 	git grep "text" >actual &&
 	test_cmp expect actual
 '
+test_expect_success 'grep w/ --no-restrict-to-sparse-paths for sparsely excluded but present paths' '
+	cat >expect <<-EOF &&
+	a:text
+	b:new-text
+	EOF
+	echo "new-text" >b &&
+	test_when_finished "rm b" &&
+	git --no-restrict-to-sparse-paths grep "text" >actual &&
+	test_cmp expect actual
+'
 
 test_expect_success 'grep --cached should honor sparse checkout' '
 	cat >expect <<-EOF &&
@@ -136,7 +146,7 @@ test_expect_success 'grep <tree-ish> should ignore sparsity patterns' '
 '
 
 # Note that sub2/ is present in the worktree but it is excluded by the sparsity
-# patterns, so grep should not recurse into it.
+# patterns, so grep should only recurse into it with --no-restrict-to-sparse-paths.
 test_expect_success 'grep --recurse-submodules should honor sparse checkout in submodule' '
 	cat >expect <<-EOF &&
 	a:text
@@ -145,6 +155,15 @@ test_expect_success 'grep --recurse-submodules should honor sparse checkout in s
 	git grep --recurse-submodules "text" >actual &&
 	test_cmp expect actual
 '
+test_expect_success 'grep --recurse-submodules should search in excluded submodules w/ --no-restrict-to-sparse-paths' '
+	cat >expect <<-EOF &&
+	a:text
+	sub/B/b:text
+	sub2/a:text
+	EOF
+	git --no-restrict-to-sparse-paths grep --recurse-submodules "text" >actual &&
+	test_cmp expect actual
+'
 
 test_expect_success 'grep --recurse-submodules --cached should honor sparse checkout in submodule' '
 	cat >expect <<-EOF &&
@@ -171,4 +190,111 @@ test_expect_success 'grep --recurse-submodules <commit-ish> should honor sparse
 	test_cmp expect_tag-to-commit actual_tag-to-commit
 '
 
+for cmd in 'git --no-restrict-to-sparse-paths grep' \
+	   'git -c sparse.restrictCmds=false grep' \
+	   'git -c sparse.restrictCmds=true --no-restrict-to-sparse-paths grep'
+do
+
+	test_expect_success "$cmd --cached should ignore sparsity patterns" '
+		cat >expect <<-EOF &&
+		a:text
+		b:text
+		dir/c:text
+		EOF
+		$cmd --cached "text" >actual &&
+		test_cmp expect actual
+	'
+
+	test_expect_success "$cmd <commit-ish> should ignore sparsity patterns" '
+		commit=$(git rev-parse HEAD) &&
+		cat >expect_commit <<-EOF &&
+		$commit:a:text
+		$commit:b:text
+		$commit:dir/c:text
+		EOF
+		cat >expect_tag-to-commit <<-EOF &&
+		tag-to-commit:a:text
+		tag-to-commit:b:text
+		tag-to-commit:dir/c:text
+		EOF
+		$cmd "text" $commit >actual_commit &&
+		test_cmp expect_commit actual_commit &&
+		$cmd "text" tag-to-commit >actual_tag-to-commit &&
+		test_cmp expect_tag-to-commit actual_tag-to-commit
+	'
+done
+
+test_expect_success 'grep --recurse-submodules --cached w/ --no-restrict-to-sparse-paths' '
+	cat >expect <<-EOF &&
+	a:text
+	b:text
+	dir/c:text
+	sub/A/a:text
+	sub/B/b:text
+	sub2/a:text
+	EOF
+	git --no-restrict-to-sparse-paths grep --recurse-submodules --cached \
+		"text" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep --recurse-submodules <commit-ish> w/ --no-restrict-to-sparse-paths' '
+	commit=$(git rev-parse HEAD) &&
+	cat >expect_commit <<-EOF &&
+	$commit:a:text
+	$commit:b:text
+	$commit:dir/c:text
+	$commit:sub/A/a:text
+	$commit:sub/B/b:text
+	$commit:sub2/a:text
+	EOF
+	cat >expect_tag-to-commit <<-EOF &&
+	tag-to-commit:a:text
+	tag-to-commit:b:text
+	tag-to-commit:dir/c:text
+	tag-to-commit:sub/A/a:text
+	tag-to-commit:sub/B/b:text
+	tag-to-commit:sub2/a:text
+	EOF
+	git --no-restrict-to-sparse-paths grep --recurse-submodules "text" \
+		$commit >actual_commit &&
+	test_cmp expect_commit actual_commit &&
+	git --no-restrict-to-sparse-paths grep --recurse-submodules "text" \
+		tag-to-commit >actual_tag-to-commit &&
+	test_cmp expect_tag-to-commit actual_tag-to-commit
+'
+
+test_expect_success 'should respect the sparse.restrictCmds values from submodules' '
+	cat >expect <<-EOF &&
+	a:text
+	sub/A/a:text
+	sub/B/b:text
+	EOF
+	test_config -C sub sparse.restrictCmds false &&
+	git grep --cached --recurse-submodules "text" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'should propagate --[no-]restrict-to-sparse-paths to submodules' '
+	cat >expect <<-EOF &&
+	a:text
+	b:text
+	dir/c:text
+	sub/A/a:text
+	sub/B/b:text
+	sub2/a:text
+	EOF
+	test_config -C sub sparse.restrictCmds true &&
+	git --no-restrict-to-sparse-paths grep --cached --recurse-submodules "text" >actual &&
+	test_cmp expect actual
+'
+
+for opt in '--untracked' '--no-index'
+do
+	test_expect_success "--[no-]restrict-to-sparse-paths and $opt are incompatible" "
+		test_must_fail git --restrict-to-sparse-paths grep $opt . 2>actual &&
+		test_i18ngrep 'restrict-to-sparse-paths is incompatible with' actual
+	"
+done
+
 test_done
diff --git a/t/t9902-completion.sh b/t/t9902-completion.sh
index 3c44af6940..a4a7767e06 100755
--- a/t/t9902-completion.sh
+++ b/t/t9902-completion.sh
@@ -1473,6 +1473,8 @@ test_expect_success 'double dash "git" itself' '
 	--namespace=
 	--no-replace-objects Z
 	--help Z
+	--restrict-to-sparse-paths Z
+	--no-restrict-to-sparse-paths Z
 	EOF
 '
 
@@ -1515,7 +1517,7 @@ test_expect_success 'general options' '
 	test_completion "git --nam" "--namespace=" &&
 	test_completion "git --bar" "--bare " &&
 	test_completion "git --inf" "--info-path " &&
-	test_completion "git --no-r" "--no-replace-objects "
+	test_completion "git --no-rep" "--no-replace-objects "
 '
 
 test_expect_success 'general options plus command' '
-- 
2.26.2


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 2/5] t/helper/test-config: return exit codes consistently
  2020-05-28  1:13     ` [PATCH v3 2/5] t/helper/test-config: return exit codes consistently Matheus Tavares
@ 2020-05-30 14:29       ` Elijah Newren
  2020-06-01  4:36         ` Matheus Tavares Bernardino
  0 siblings, 1 reply; 120+ messages in thread
From: Elijah Newren @ 2020-05-30 14:29 UTC (permalink / raw)
  To: Matheus Tavares
  Cc: Git Mailing List, Junio C Hamano, Derrick Stolee, Jonathan Tan

On Wed, May 27, 2020 at 6:13 PM Matheus Tavares
<matheus.bernardino@usp.br> wrote:
>
> The test-config helper may exit with a variety of at least four
> different codes, to reflect the status of the requested operations.
> These codes are sometimes checked in the tests, but not all of the codes
> are returned consistently by the helper: 1 will usually refer to a
> "value not found", but usage errors can also return 1 or 128. The latter

I'm not sure what "The latter" refers to here.

> is also expected on errors within the configset functions. These
> inconsistent uses of the exit codes can lead to false positives in the
> tests. Although all tests that currently check the helper's exit code,
> on errors, do also check the output, it's still better to standardize
> the exit codes and avoid future problems in new tests. While we are

That last sentence was slightly hard for me to parse.  Maybe something like:

...Although all tests which expect errors and check the helper's exit
code currently also check the output, it's still better...


> here, let's also check that we have the expected argc for
> configset_get_value and configset_get_value_multi, before trying to use
> argv.
>
> Note: this change is implemented with the unification of the exit
> labels. This might seem unnecessary, for now, but it will benefit the
> next patch, which will increase the cleanup section.
>
> Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
> ---
>  t/helper/test-config.c | 76 ++++++++++++++++++++++--------------------
>  1 file changed, 40 insertions(+), 36 deletions(-)
>
> diff --git a/t/helper/test-config.c b/t/helper/test-config.c
> index 234c722b48..1c8e965840 100644
> --- a/t/helper/test-config.c
> +++ b/t/helper/test-config.c
> @@ -30,6 +30,14 @@
>   * iterate -> iterate over all values using git_config(), and print some
>   *            data for each
>   *
> + * Exit codes:
> + *     0:   success
> + *     1:   value not found for the given config key
> + *     2:   config file path given as argument is inaccessible or doesn't exist
> + *     129: test-config usage error
> + *
> + * Note: tests may also expect 128 for die() calls in the config machinery.
> + *
>   * Examples:
>   *
>   * To print the value with highest priority for key "foo.bAr Baz.rock":
> @@ -64,35 +72,42 @@ static int early_config_cb(const char *var, const char *value, void *vdata)
>         return 0;
>  }
>
> +enum test_config_exit_code {
> +       TC_SUCCESS = 0,
> +       TC_VALUE_NOT_FOUND = 1,
> +       TC_CONFIG_FILE_ERROR = 2,
> +       TC_USAGE_ERROR = 129,
> +};
> +
>  int cmd__config(int argc, const char **argv)
>  {
>         int i, val;
>         const char *v;
>         const struct string_list *strptr;
>         struct config_set cs;
> +       enum test_config_exit_code ret = TC_SUCCESS;
>
>         if (argc == 3 && !strcmp(argv[1], "read_early_config")) {
>                 read_early_config(early_config_cb, (void *)argv[2]);
> -               return 0;
> +               return TC_SUCCESS;
>         }
>
>         setup_git_directory();
>
>         git_configset_init(&cs);
>
> -       if (argc < 2) {
> -               fprintf(stderr, "Please, provide a command name on the command-line\n");
> -               goto exit1;
> -       } else if (argc == 3 && !strcmp(argv[1], "get_value")) {
> +       if (argc < 2)
> +               goto print_usage_error;
> +
> +       if (argc == 3 && !strcmp(argv[1], "get_value")) {
>                 if (!git_config_get_value(argv[2], &v)) {
>                         if (!v)
>                                 printf("(NULL)\n");
>                         else
>                                 printf("%s\n", v);
> -                       goto exit0;
>                 } else {
>                         printf("Value not found for \"%s\"\n", argv[2]);
> -                       goto exit1;
> +                       ret = TC_VALUE_NOT_FOUND;
>                 }
>         } else if (argc == 3 && !strcmp(argv[1], "get_value_multi")) {
>                 strptr = git_config_get_value_multi(argv[2]);
> @@ -104,41 +119,38 @@ int cmd__config(int argc, const char **argv)
>                                 else
>                                         printf("%s\n", v);
>                         }
> -                       goto exit0;
>                 } else {
>                         printf("Value not found for \"%s\"\n", argv[2]);
> -                       goto exit1;
> +                       ret = TC_VALUE_NOT_FOUND;
>                 }
>         } else if (argc == 3 && !strcmp(argv[1], "get_int")) {
>                 if (!git_config_get_int(argv[2], &val)) {
>                         printf("%d\n", val);
> -                       goto exit0;
>                 } else {
>                         printf("Value not found for \"%s\"\n", argv[2]);
> -                       goto exit1;
> +                       ret = TC_VALUE_NOT_FOUND;
>                 }
>         } else if (argc == 3 && !strcmp(argv[1], "get_bool")) {
>                 if (!git_config_get_bool(argv[2], &val)) {
>                         printf("%d\n", val);
> -                       goto exit0;
>                 } else {
>                         printf("Value not found for \"%s\"\n", argv[2]);
> -                       goto exit1;
> +                       ret = TC_VALUE_NOT_FOUND;
>                 }
>         } else if (argc == 3 && !strcmp(argv[1], "get_string")) {
>                 if (!git_config_get_string_const(argv[2], &v)) {
>                         printf("%s\n", v);
> -                       goto exit0;
>                 } else {
>                         printf("Value not found for \"%s\"\n", argv[2]);
> -                       goto exit1;
> +                       ret = TC_VALUE_NOT_FOUND;
>                 }
> -       } else if (!strcmp(argv[1], "configset_get_value")) {
> +       } else if (argc >= 3 && !strcmp(argv[1], "configset_get_value")) {
>                 for (i = 3; i < argc; i++) {
>                         int err;
>                         if ((err = git_configset_add_file(&cs, argv[i]))) {
>                                 fprintf(stderr, "Error (%d) reading configuration file %s.\n", err, argv[i]);
> -                               goto exit2;
> +                               ret = TC_CONFIG_FILE_ERROR;
> +                               goto out;
>                         }
>                 }
>                 if (!git_configset_get_value(&cs, argv[2], &v)) {
> @@ -146,17 +158,17 @@ int cmd__config(int argc, const char **argv)
>                                 printf("(NULL)\n");
>                         else
>                                 printf("%s\n", v);
> -                       goto exit0;
>                 } else {
>                         printf("Value not found for \"%s\"\n", argv[2]);
> -                       goto exit1;
> +                       ret = TC_VALUE_NOT_FOUND;
>                 }
> -       } else if (!strcmp(argv[1], "configset_get_value_multi")) {
> +       } else if (argc >= 3 && !strcmp(argv[1], "configset_get_value_multi")) {
>                 for (i = 3; i < argc; i++) {
>                         int err;
>                         if ((err = git_configset_add_file(&cs, argv[i]))) {
>                                 fprintf(stderr, "Error (%d) reading configuration file %s.\n", err, argv[i]);
> -                               goto exit2;
> +                               ret = TC_CONFIG_FILE_ERROR;
> +                               goto out;
>                         }
>                 }
>                 strptr = git_configset_get_value_multi(&cs, argv[2]);
> @@ -168,27 +180,19 @@ int cmd__config(int argc, const char **argv)
>                                 else
>                                         printf("%s\n", v);
>                         }
> -                       goto exit0;
>                 } else {
>                         printf("Value not found for \"%s\"\n", argv[2]);
> -                       goto exit1;
> +                       ret = TC_VALUE_NOT_FOUND;
>                 }
>         } else if (!strcmp(argv[1], "iterate")) {
>                 git_config(iterate_cb, NULL);
> -               goto exit0;
> +       } else {
> +print_usage_error:
> +               fprintf(stderr, "Invalid syntax. Usage: test-tool config <cmd> [args]\n");
> +               ret = TC_USAGE_ERROR;
>         }
>
> -       die("%s: Please check the syntax and the function name", argv[0]);
> -
> -exit0:
> -       git_configset_clear(&cs);
> -       return 0;
> -
> -exit1:
> -       git_configset_clear(&cs);
> -       return 1;
> -
> -exit2:
> +out:
>         git_configset_clear(&cs);
> -       return 2;
> +       return ret;
>  }
> --
> 2.26.2

So, the primary purpose of the commit is making the return status
clearer, but most of the code changes actually center around reducing
the gotos and unifying the exit labels.  It might have been slightly
easier to read if those two issues had been split, but the patch is
small enough that it's not a big deal.  Makes sense.
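
For reference, the single-exit shape the patch moves to can be sketched
in isolation as below.  This is a toy, not the helper itself: run_cmd(),
release_resources(), the "get" command and "known.key" are all made up
for illustration, and only the exit code values are taken from the
patch.

```c
#include <assert.h>
#include <string.h>

/* Exit code values as listed in the patch; the names are hypothetical. */
enum exit_code {
	EC_SUCCESS = 0,
	EC_VALUE_NOT_FOUND = 1,
	EC_USAGE_ERROR = 129,
};

/* Counts cleanups so a caller can check that every path releases once. */
int cleanup_calls;

/* Stands in for git_configset_clear() in the real helper. */
static void release_resources(void)
{
	cleanup_calls++;
}

int run_cmd(int argc, const char **argv)
{
	enum exit_code ret = EC_SUCCESS;

	assert(argc >= 1 && argv);

	if (argc < 2) {
		ret = EC_USAGE_ERROR;
		goto out;
	}

	if (argc == 3 && !strcmp(argv[1], "get")) {
		/* Pretend that only "known.key" has a value. */
		if (strcmp(argv[2], "known.key"))
			ret = EC_VALUE_NOT_FOUND;
	} else {
		ret = EC_USAGE_ERROR;
	}

out:
	/* Single exit point: the cleanup runs once, whatever the status. */
	release_resources();
	return ret;
}
```

Each branch only records a status in ret, so adding another cleanup
step later means touching the "out" label alone rather than every exit
label.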

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 3/5] config: correctly read worktree configs in submodules
  2020-05-28  1:13     ` [PATCH v3 3/5] config: correctly read worktree configs in submodules Matheus Tavares
@ 2020-05-30 14:49       ` Elijah Newren
  2020-06-01  4:38         ` Matheus Tavares Bernardino
  0 siblings, 1 reply; 120+ messages in thread
From: Elijah Newren @ 2020-05-30 14:49 UTC (permalink / raw)
  To: Matheus Tavares
  Cc: Git Mailing List, Junio C Hamano, Derrick Stolee, Jonathan Tan

On Wed, May 27, 2020 at 6:13 PM Matheus Tavares
<matheus.bernardino@usp.br> wrote:
>
> One of the steps in do_git_config_sequence() is to load the
> worktree-specific config file. Although the function receives a git_dir
> string, it relies on git_pathdup(), which uses the_repository->git_dir,
> to make the path to the file. Furthermore, it also checks that
> extensions.worktreeConfig is set through the
> repository_format_worktree_config variable, which refers to
> the_repository only. Thus, when a submodule has worktree settings, a
> command executed in the superproject that recurses into the submodule
> won't find the said settings.
>
> Such a scenario might not be needed now, but it will be in the following

It's not needed?  Are there not other config values that affect grep's
behavior, such as smudge filters of the submodule that might be
important if doing a 'git grep --recurse-submodules $REVISION'?

Also, is there a similar issue here for .gitattributes?  (e.g. if the
submodule declares certain files to be binary?)

(I don't actually know if these are issues but I'm just surprised to
hear that this would be the first case that would need to look at
submodule-specific configuration.  If the current code handles these
other scenarios I bring up, then you just need to correct your commit
message.  If these aren't issues, then I'd appreciate a quick
explanation of why I'm off base.  If these are current issues and the
current code isn't handling them, I'm not saying you need to address
them in this patch series, but you might need to reword the commit
message to mention that was already an issue that has previously been
overlooked and we're starting by fixing one case.)

> patch. git-grep will learn to honor sparse checkouts and, when running
> with --recurse-submodules, the submodule's sparse checkout settings must
> be loaded. As these settings are stored in the config.worktree file,
> they would be ignored without this patch. So let's fix this by reading
> the right config.worktree file and extensions.worktreeConfig setting,
> based on the git_dir and commondir paths given to
> do_git_config_sequence(). Also add a test to avoid any regressions.
>
> Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
> ---
>  config.c                   |  21 +++++--
>  t/helper/test-config.c     | 119 +++++++++++++++++++++++++++----------
>  t/t2404-worktree-config.sh |  16 +++++
>  3 files changed, 118 insertions(+), 38 deletions(-)
>
> diff --git a/config.c b/config.c
> index 8db9c77098..c2d56309dc 100644
> --- a/config.c
> +++ b/config.c
> @@ -1747,11 +1747,22 @@ static int do_git_config_sequence(const struct config_options *opts,
>                 ret += git_config_from_file(fn, repo_config, data);
>
>         current_parsing_scope = CONFIG_SCOPE_WORKTREE;
> -       if (!opts->ignore_worktree && repository_format_worktree_config) {
> -               char *path = git_pathdup("config.worktree");
> -               if (!access_or_die(path, R_OK, 0))
> -                       ret += git_config_from_file(fn, path, data);
> -               free(path);
> +       if (!opts->ignore_worktree && repo_config && opts->git_dir) {
> +               struct repository_format repo_fmt = REPOSITORY_FORMAT_INIT;
> +               struct strbuf buf = STRBUF_INIT;
> +
> +               read_repository_format(&repo_fmt, repo_config);
> +
> +               if (!verify_repository_format(&repo_fmt, &buf) &&
> +                   repo_fmt.worktree_config) {
> +                       char *path = mkpathdup("%s/config.worktree", opts->git_dir);
> +                       if (!access_or_die(path, R_OK, 0))
> +                               ret += git_config_from_file(fn, path, data);
> +                       free(path);
> +               }
> +
> +               strbuf_release(&buf);
> +               clear_repository_format(&repo_fmt);
>         }
>
>         current_parsing_scope = CONFIG_SCOPE_COMMAND;
> diff --git a/t/helper/test-config.c b/t/helper/test-config.c
> index 1c8e965840..284f83a921 100644
> --- a/t/helper/test-config.c
> +++ b/t/helper/test-config.c
> @@ -2,12 +2,19 @@
>  #include "cache.h"
>  #include "config.h"
>  #include "string-list.h"
> +#include "submodule-config.h"
>
>  /*
>   * This program exposes the C API of the configuration mechanism
>   * as a set of simple commands in order to facilitate testing.
>   *
> - * Reads stdin and prints result of command to stdout:
> + * Usage: test-tool config [--submodule=<path>] <cmd> [<args>]
> + *
> + * If --submodule=<path> is given, <cmd> will operate on the submodule at the
> + * given <path>. This option is not valid for the commands: read_early_config,
> + * configset_get_value and configset_get_value_multi.
> + *
> + * Possible cmds are:
>   *
>   * get_value -> prints the value with highest priority for the entered key
>   *
> @@ -84,33 +91,63 @@ int cmd__config(int argc, const char **argv)
>         int i, val;
>         const char *v;
>         const struct string_list *strptr;
> -       struct config_set cs;
> +       struct config_set cs = { .hash_initialized = 0 };
>         enum test_config_exit_code ret = TC_SUCCESS;
> +       struct repository *repo = the_repository;
> +       const char *subrepo_path = NULL;
> +
> +       argc--; /* skip over "config" */

This line alone is responsible for a fairly big set of changes
throughout this file, just decrementing indices everywhere.  It might
be nice for review purposes if this and the other changes it caused
were pulled out into a separate step, so we can more easily
concentrate on the primary additions and changes you are making to
this file.  In particular, being so unfamiliar with submodules I'd
really like to try to find someone who knows them a bit better to
review all the subrepo_path related portions of this change to this
file plus the config.c changes, but I think that'd be easier if the
change were more focused.
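
To make the index shift concrete, here is a minimal standalone sketch
of the "consume the leading arguments, then index from zero" approach.
It is not the code from the patch: demo_skip_prefix() is a hypothetical
stand-in for git's skip_prefix(), and struct demo_args and demo_parse()
are names invented for this illustration.

```c
#include <string.h>

/* Hypothetical stand-in for git's skip_prefix() helper. */
static int demo_skip_prefix(const char *str, const char *prefix,
			    const char **out)
{
	size_t len = strlen(prefix);

	if (strncmp(str, prefix, len))
		return 0;
	*out = str + len;
	return 1;
}

struct demo_args {
	const char *submodule;	/* NULL if --submodule= was not given */
	const char *cmd;	/* NULL signals a usage error */
	int nr_cmd_args;	/* number of arguments following cmd */
};

static struct demo_args demo_parse(int argc, const char **argv)
{
	struct demo_args res = { NULL, NULL, 0 };

	argc--;	/* skip over the subcommand name, e.g. "config" */
	argv++;
	if (argc == 0)
		return res;

	if (demo_skip_prefix(*argv, "--submodule=", &res.submodule)) {
		argc--;
		argv++;
		if (argc == 0)
			return res;
	}

	/* From here on, argv[0] is the command and argv[1..] its args. */
	res.cmd = argv[0];
	res.nr_cmd_args = argc - 1;
	return res;
}
```

Once the leading elements are consumed, every later check can index
from argv[0], which is why so many argv[2] references turn into argv[1]
in the diff.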

> +       argv++;
> +
> +       if (argc == 0)
> +               goto print_usage_error;
> +
> +       if (skip_prefix(*argv, "--submodule=", &subrepo_path)) {
> +               argc--;
> +               argv++;
> +               if (argc == 0)
> +                       goto print_usage_error;
> +       }
>
> -       if (argc == 3 && !strcmp(argv[1], "read_early_config")) {
> -               read_early_config(early_config_cb, (void *)argv[2]);
> +       if (argc == 2 && !strcmp(argv[0], "read_early_config")) {
> +               if (subrepo_path) {
> +                       fprintf(stderr, "Cannot use --submodule with read_early_config\n");
> +                       return TC_USAGE_ERROR;
> +               }
> +               read_early_config(early_config_cb, (void *)argv[1]);
>                 return TC_SUCCESS;
>         }
>
>         setup_git_directory();
> -
>         git_configset_init(&cs);
>
> -       if (argc < 2)
> -               goto print_usage_error;
> +       if (subrepo_path) {
> +               const struct submodule *sub;
> +               struct repository *subrepo = xcalloc(1, sizeof(*repo));
> +
> +               sub = submodule_from_path(the_repository, &null_oid, subrepo_path);
> +               if (!sub || repo_submodule_init(subrepo, the_repository, sub)) {
> +                       fprintf(stderr, "Invalid argument to --submodule: '%s'\n",
> +                               subrepo_path);
> +                       free(subrepo);
> +                       ret = TC_USAGE_ERROR;
> +                       goto out;
> +               }
> +               repo = subrepo;
> +       }
>
> -       if (argc == 3 && !strcmp(argv[1], "get_value")) {
> -               if (!git_config_get_value(argv[2], &v)) {
> +       if (argc == 2 && !strcmp(argv[0], "get_value")) {
> +               if (!repo_config_get_value(repo, argv[1], &v)) {
>                         if (!v)
>                                 printf("(NULL)\n");
>                         else
>                                 printf("%s\n", v);
>                 } else {
> -                       printf("Value not found for \"%s\"\n", argv[2]);
> +                       printf("Value not found for \"%s\"\n", argv[1]);
>                         ret = TC_VALUE_NOT_FOUND;
>                 }
> -       } else if (argc == 3 && !strcmp(argv[1], "get_value_multi")) {
> -               strptr = git_config_get_value_multi(argv[2]);
> +       } else if (argc == 2 && !strcmp(argv[0], "get_value_multi")) {
> +               strptr = repo_config_get_value_multi(repo, argv[1]);
>                 if (strptr) {
>                         for (i = 0; i < strptr->nr; i++) {
>                                 v = strptr->items[i].string;
> @@ -120,32 +157,38 @@ int cmd__config(int argc, const char **argv)
>                                         printf("%s\n", v);
>                         }
>                 } else {
> -                       printf("Value not found for \"%s\"\n", argv[2]);
> +                       printf("Value not found for \"%s\"\n", argv[1]);
>                         ret = TC_VALUE_NOT_FOUND;
>                 }
> -       } else if (argc == 3 && !strcmp(argv[1], "get_int")) {
> -               if (!git_config_get_int(argv[2], &val)) {
> +       } else if (argc == 2 && !strcmp(argv[0], "get_int")) {
> +               if (!repo_config_get_int(repo, argv[1], &val)) {
>                         printf("%d\n", val);
>                 } else {
> -                       printf("Value not found for \"%s\"\n", argv[2]);
> +                       printf("Value not found for \"%s\"\n", argv[1]);
>                         ret = TC_VALUE_NOT_FOUND;
>                 }
> -       } else if (argc == 3 && !strcmp(argv[1], "get_bool")) {
> -               if (!git_config_get_bool(argv[2], &val)) {
> +       } else if (argc == 2 && !strcmp(argv[0], "get_bool")) {
> +               if (!repo_config_get_bool(repo, argv[1], &val)) {
>                         printf("%d\n", val);
>                 } else {
> -                       printf("Value not found for \"%s\"\n", argv[2]);
> +
> +                       printf("Value not found for \"%s\"\n", argv[1]);
>                         ret = TC_VALUE_NOT_FOUND;
>                 }
> -       } else if (argc == 3 && !strcmp(argv[1], "get_string")) {
> -               if (!git_config_get_string_const(argv[2], &v)) {
> +       } else if (argc == 2 && !strcmp(argv[0], "get_string")) {
> +               if (!repo_config_get_string_const(repo, argv[1], &v)) {
>                         printf("%s\n", v);
>                 } else {
> -                       printf("Value not found for \"%s\"\n", argv[2]);
> +                       printf("Value not found for \"%s\"\n", argv[1]);
>                         ret = TC_VALUE_NOT_FOUND;
>                 }
> -       } else if (argc >= 3 && !strcmp(argv[1], "configset_get_value")) {
> -               for (i = 3; i < argc; i++) {
> +       } else if (argc >= 2 && !strcmp(argv[0], "configset_get_value")) {
> +               if (subrepo_path) {
> +                       fprintf(stderr, "Cannot use --submodule with configset_get_value\n");
> +                       ret = TC_USAGE_ERROR;
> +                       goto out;
> +               }
> +               for (i = 2; i < argc; i++) {
>                         int err;
>                         if ((err = git_configset_add_file(&cs, argv[i]))) {
>                                 fprintf(stderr, "Error (%d) reading configuration file %s.\n", err, argv[i]);
> @@ -153,17 +196,22 @@ int cmd__config(int argc, const char **argv)
>                                 goto out;
>                         }
>                 }
> -               if (!git_configset_get_value(&cs, argv[2], &v)) {
> +               if (!git_configset_get_value(&cs, argv[1], &v)) {
>                         if (!v)
>                                 printf("(NULL)\n");
>                         else
>                                 printf("%s\n", v);
>                 } else {
> -                       printf("Value not found for \"%s\"\n", argv[2]);
> +                       printf("Value not found for \"%s\"\n", argv[1]);
>                         ret = TC_VALUE_NOT_FOUND;
>                 }
> -       } else if (argc >= 3 && !strcmp(argv[1], "configset_get_value_multi")) {
> -               for (i = 3; i < argc; i++) {
> +       } else if (argc >= 2 && !strcmp(argv[0], "configset_get_value_multi")) {
> +               if (subrepo_path) {
> +                       fprintf(stderr, "Cannot use --submodule with configset_get_value_multi\n");
> +                       ret = TC_USAGE_ERROR;
> +                       goto out;
> +               }
> +               for (i = 2; i < argc; i++) {
>                         int err;
>                         if ((err = git_configset_add_file(&cs, argv[i]))) {
>                                 fprintf(stderr, "Error (%d) reading configuration file %s.\n", err, argv[i]);
> @@ -171,7 +219,7 @@ int cmd__config(int argc, const char **argv)
>                                 goto out;
>                         }
>                 }
> -               strptr = git_configset_get_value_multi(&cs, argv[2]);
> +               strptr = git_configset_get_value_multi(&cs, argv[1]);
>                 if (strptr) {
>                         for (i = 0; i < strptr->nr; i++) {
>                                 v = strptr->items[i].string;
> @@ -181,18 +229,23 @@ int cmd__config(int argc, const char **argv)
>                                         printf("%s\n", v);
>                         }
>                 } else {
> -                       printf("Value not found for \"%s\"\n", argv[2]);
> +                       printf("Value not found for \"%s\"\n", argv[1]);
>                         ret = TC_VALUE_NOT_FOUND;
>                 }
> -       } else if (!strcmp(argv[1], "iterate")) {
> -               git_config(iterate_cb, NULL);
> +       } else if (!strcmp(argv[0], "iterate")) {
> +               repo_config(repo, iterate_cb, NULL);
>         } else {
>  print_usage_error:
> -               fprintf(stderr, "Invalid syntax. Usage: test-tool config <cmd> [args]\n");
> +               fprintf(stderr, "Invalid syntax. Usage: test-tool config"
> +                               " [--submodule=<path>] <cmd> [args]\n");
>                 ret = TC_USAGE_ERROR;
>         }
>
>  out:
>         git_configset_clear(&cs);
> +       if (repo != the_repository) {
> +               repo_clear(repo);
> +               free(repo);
> +       }
>         return ret;
>  }
> diff --git a/t/t2404-worktree-config.sh b/t/t2404-worktree-config.sh
> index 286121d8de..b6ab793203 100755
> --- a/t/t2404-worktree-config.sh
> +++ b/t/t2404-worktree-config.sh
> @@ -76,4 +76,20 @@ test_expect_success 'config.worktree no longer read without extension' '
>         test_cmp_config -C wt2 shared this.is
>  '
>
> +test_expect_success 'correctly read config.worktree from submodules' '
> +       test_unconfig extensions.worktreeConfig &&
> +       git init sub &&
> +       (
> +               cd sub &&
> +               test_commit A &&
> +               git config extensions.worktreeConfig true &&
> +               git config --worktree wtconfig.sub test-value
> +       ) &&
> +       git submodule add ./sub &&
> +       git commit -m "add sub" &&
> +       echo test-value >expect &&
> +       test-tool config --submodule=sub get_value wtconfig.sub >actual &&
> +       test_cmp expect actual
> +'
> +
>  test_done
> --
> 2.26.2

The index updates seem fine, and I like the test, and I tried to look
at the submodule and config bits but I'm quite unfamiliar with that
area of the code and I'd like to see if we can find someone who knows
submodules and/or config a bit better to review those pieces.  A split
of this patch into two in your next roll of this series would be nice
so we can ask someone to look at just the relevant bits.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 4/5] grep: honor sparse checkout patterns
  2020-05-28  1:13     ` [PATCH v3 4/5] grep: honor sparse checkout patterns Matheus Tavares
@ 2020-05-30 15:48       ` Elijah Newren
  2020-06-01  4:44         ` Matheus Tavares Bernardino
  0 siblings, 1 reply; 120+ messages in thread
From: Elijah Newren @ 2020-05-30 15:48 UTC (permalink / raw)
  To: Matheus Tavares
  Cc: Git Mailing List, Junio C Hamano, Derrick Stolee, Jonathan Tan

On Wed, May 27, 2020 at 6:13 PM Matheus Tavares
<matheus.bernardino@usp.br> wrote:
>
> One of the main uses for a sparse checkout is to allow users to focus on
> the subset of files in a repository in which they are interested. But
> git-grep currently ignores the sparsity patterns and report all matches
> found outside this subset, which kind of goes in the opposite direction.
> Let's fix that, making it honor the sparsity boundaries for every
> grepping case where this is relevant:
>
> - git grep in worktree
> - git grep --cached
> - git grep $REVISION
>
> For the worktree case, we will not grep paths that have the
> SKIP_WORKTREE bit set, even if they are present for some reason (e.g.
> manually created after `git sparse-checkout init`).

This seems worded to raise alarm bells and make users suspect
implementation difficulties or regrets rather than desired behavior.
It would be much better to word this simply as something like:

    For the worktree and cached cases, we iterate over paths without
the SKIP_WORKTREE bit set, and limit our searches to these paths.

> But the next patch
> will add an option to do so. (See 'Note' below.)

Because this was in the same paragraph as the previous sentence, it
made it sound like you were going to provide a special worktree-only
option to search outside the SKIP_WORKTREE bits.  Very confusing.  I
think I'd combine this sentence into the very first paragraph of the
commit message and massage the wording a little.  Perhaps something
like:  ...goes in the opposite direction.  There are some use cases for
ignoring the sparsity patterns and the next commit will add an option
to obtain this behavior, but here we start by making grep honor the
sparsity boundaries for every...

> For `git grep $REVISION`, we will choose to honor the sparsity patterns
> only when $REVISION is a commit-ish object. The reason is that, for a
> tree, we don't know whether it represents the root of a repository or a
> subtree. So we wouldn't be able to correctly match it against the
> sparsity patterns. E.g. suppose we have a repository with these two
> sparsity rules: "/*" and "!/a"; and the following structure:
>
> /
> | - a (file)
> | - d (dir)
>     | - a (file)
>
> If `git grep $REVISION` were to honor the sparsity patterns for every
> object type, when grepping the /d tree, we would wrongly ignore the /d/a
> file. This happens because we wouldn't know it resides in /d and
> therefore it would wrongly match the pattern "!/a". Furthermore, for a
> search in a blob object, we wouldn't even have a path to check the
> patterns against. So, let's ignore the sparsity patterns when grepping
> non-commit-ish objects.

This doesn't actually make it clear how you handle $REVISION which is
a commit object; you focus so much on when $REVISION is just a tree
and contrasting that case that you omit the behavior for the case of
interest.  Also, $REVISION to my mind implies "commit"; if you want to
imply that a commit or tree could be used, you'd use $TREE or
$TREE_ISH or something else.  I think it'd make sense to cover all
three relevant cases into a single paragraph (thus combining with the
previous paragraph), and then add a second paragraph about the $TREE
case that streamlines the last two paragraphs above.  So, perhaps we
can replace your paragraphs from "For the worktree case, we will not grep
paths..." all the way to "So, let's ignore the sparsity patterns when
grepping non-commit-ish objects" (after first moving the comment about
adding an option in the next commit to some other area of the commit
message, as discussed above) with something like the following:


    For the worktree and cached cases, we iterate over paths without
the SKIP_WORKTREE bit set, and limit our searches to these paths.  For
the $REVISION case, we limit the paths we search to those that match
the sparsity patterns.  (We do not check the SKIP_WORKTREE bit for the
$REVISION case, because $REVISION may contain paths that do not exist
in HEAD and thus for which we have no SKIP_WORKTREE bit to consult.
The sparsity patterns tell us how the SKIP_WORKTREE bit would be set
if we were to check out $REVISION, so we consult those.  Also, we
don't use the sparsity patterns with the worktree or cached cases, both
because we have a bit we can check directly and more efficiently, and
because unmerged entries from a merge or a rebase could cause more
files to temporarily be present than the sparsity patterns would
normally select.)

    Note that there is a special case here: `git grep $TREE`.  In this
case we cannot know whether $TREE corresponds to the root of the
repository or some sub-tree, and thus there is no way for us to know
which sparsity patterns, if any, apply.  So the $TREE case will not
use sparsity patterns or any SKIP_WORKTREE bits and will instead
always search all files within the $TREE.
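
The $TREE ambiguity can be shown with a deliberately tiny sketch.  This
is a toy, not how dir.c matches patterns: demo_anchored_match() is a
hypothetical helper that only understands literal, root-anchored
patterns such as "/a", whereas the real path_matches_pattern_list()
also handles wildcards, negation and directory-only patterns.

```c
#include <string.h>

/*
 * Toy check: does the root-anchored pattern (e.g. "/a") name exactly
 * `path`?  `path` is whatever path the caller believes is relative to
 * the repository root.
 */
static int demo_anchored_match(const char *pattern, const char *path)
{
	if (pattern[0] != '/')
		return 0; /* only anchored patterns are handled in this toy */
	return !strcmp(pattern + 1, path);
}
```

With the layout from the commit message, walking from the root yields
the path "d/a", and demo_anchored_match("/a", "d/a") is 0, so the file
is kept.  Walking the d tree on its own yields just "a", and
demo_anchored_match("/a", "a") is 1, so the same file would be wrongly
excluded.  The matcher has no way to tell the two situations apart,
which is why skipping the patterns for a bare tree seems the safer
choice.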

>
> Note: The behavior introduced in this patch is what some users have
> reported[1] that they would like by default. But the old behavior is
> still desirable for some use cases. Therefore, the next patch will add
> an option to allow restoring it when needed.

This paragraph duplicates information you already stated previously.
It's much clearer than what you stated before, but if you just reword
the previous comments and combine them into the first paragraph, then
we can drop this final note.


> [1]: https://lore.kernel.org/git/CABPp-BGuFhDwWZBRaD3nA8ui46wor-4=Ha1G1oApsfF8KNpfGQ@mail.gmail.com/
>
> Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
> ---
>  builtin/grep.c                   | 125 ++++++++++++++++++++--
>  t/t7011-skip-worktree-reading.sh |   9 --
>  t/t7817-grep-sparse-checkout.sh  | 174 +++++++++++++++++++++++++++++++
>  3 files changed, 291 insertions(+), 17 deletions(-)
>  create mode 100755 t/t7817-grep-sparse-checkout.sh
>
> diff --git a/builtin/grep.c b/builtin/grep.c
> index a5056f395a..11e33b8aee 100644
> --- a/builtin/grep.c
> +++ b/builtin/grep.c
> @@ -410,7 +410,7 @@ static int grep_cache(struct grep_opt *opt,
>                       const struct pathspec *pathspec, int cached);
>  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>                      struct tree_desc *tree, struct strbuf *base, int tn_len,
> -                    int check_attr);
> +                    int is_root_tree);

So you modified the forward declaration of grep_tree()...

>
>  static int grep_submodule(struct grep_opt *opt,
>                           const struct pathspec *pathspec,
> @@ -508,6 +508,10 @@ static int grep_cache(struct grep_opt *opt,
>
>         for (nr = 0; nr < repo->index->cache_nr; nr++) {
>                 const struct cache_entry *ce = repo->index->cache[nr];
> +
> +               if (ce_skip_worktree(ce))
> +                       continue;
> +
>                 strbuf_setlen(&name, name_base_len);
>                 strbuf_addstr(&name, ce->name);
>
> @@ -520,8 +524,7 @@ static int grep_cache(struct grep_opt *opt,
>                          * cache entry are identical, even if worktree file has
>                          * been modified, so use cache version instead
>                          */
> -                       if (cached || (ce->ce_flags & CE_VALID) ||
> -                           ce_skip_worktree(ce)) {
> +                       if (cached || (ce->ce_flags & CE_VALID)) {
>                                 if (ce_stage(ce) || ce_intent_to_add(ce))
>                                         continue;
>                                 hit |= grep_oid(opt, &ce->oid, name.buf,
> @@ -552,9 +555,76 @@ static int grep_cache(struct grep_opt *opt,
>         return hit;
>  }
>
> -static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> -                    struct tree_desc *tree, struct strbuf *base, int tn_len,
> -                    int check_attr)

Here the patch splits your handling of grep_tree()...

> +static struct pattern_list *get_sparsity_patterns(struct repository *repo)
> +{
> +       struct pattern_list *patterns;
> +       char *sparse_file;
> +       int sparse_config, cone_config;
> +
> +       if (repo_config_get_bool(repo, "core.sparsecheckout", &sparse_config) ||
> +           !sparse_config) {
> +               return NULL;
> +       }

Is core_apply_sparse_checkout not initialized for some reason?

> +
> +       sparse_file = repo_git_path(repo, "info/sparse-checkout");
> +       patterns = xcalloc(1, sizeof(*patterns));
> +
> +       if (repo_config_get_bool(repo, "core.sparsecheckoutcone", &cone_config))
> +               cone_config = 0;
> +       patterns->use_cone_patterns = cone_config;

Similarly, is core_sparse_checkout_cone not initialized?

> +
> +       if (add_patterns_from_file_to_list(sparse_file, "", 0, patterns, NULL)) {
> +               if (file_exists(sparse_file)) {
> +                       warning(_("failed to load sparse-checkout file: '%s'"),
> +                               sparse_file);
> +               }
> +               free(sparse_file);
> +               free(patterns);
> +               return NULL;
> +       }
> +
> +       free(sparse_file);
> +       return patterns;
> +}
> +
> +static int in_sparse_checkout(struct strbuf *path, int prefix_len,

This function name in_sparse_checkout() makes me think "Does the
working tree represent a sparse checkout?"  Perhaps we could rename it
to path_matches_sparsity_patterns() ?

Also, is there a reason we can't use dir.c's
path_matches_pattern_list() here?  How does this new function differ
in behavior from that function?

> +                             unsigned int entry_mode,
> +                             struct index_state *istate,
> +                             struct pattern_list *sparsity,
> +                             enum pattern_match_result parent_match,
> +                             enum pattern_match_result *match)
> +{
> +       int dtype = DT_UNKNOWN;
> +       int is_dir = S_ISDIR(entry_mode);
> +
> +       if (parent_match == MATCHED_RECURSIVE) {
> +               *match = parent_match;
> +               return 1;
> +       }
> +
> +       if (is_dir && !is_dir_sep(path->buf[path->len - 1]))
> +               strbuf_addch(path, '/');
> +
> +       *match = path_matches_pattern_list(path->buf, path->len,
> +                                          path->buf + prefix_len, &dtype,
> +                                          sparsity, istate);
> +       if (*match == UNDECIDED)
> +               *match = parent_match;
> +
> +       if (is_dir)
> +               strbuf_trim_trailing_dir_sep(path);
> +
> +       if (*match == NOT_MATCHED &&
> +               (!is_dir || (is_dir && sparsity->use_cone_patterns)))
> +            return 0;
> +
> +       return 1;
> +}
> +
> +static int do_grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,

I thought this meant you were renaming grep_tree() to do_grep_tree()
but it's a new function that happens to have most of the logic from
the old grep_tree() and which the new grep_tree() will call to do most
of its work.

> +                       struct tree_desc *tree, struct strbuf *base, int tn_len,
> +                       int check_attr, struct pattern_list *sparsity,
> +                       enum pattern_match_result default_sparsity_match)
>  {
>         struct repository *repo = opt->repo;
>         int hit = 0;
> @@ -570,6 +640,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>
>         while (tree_entry(tree, &entry)) {
>                 int te_len = tree_entry_len(&entry);
> +               enum pattern_match_result sparsity_match = 0;
>
>                 if (match != all_entries_interesting) {
>                         strbuf_addstr(&name, base->buf + tn_len);
> @@ -586,6 +657,19 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>
>                 strbuf_add(base, entry.path, te_len);
>
> +               if (sparsity) {
> +                       struct strbuf path = STRBUF_INIT;
> +                       strbuf_addstr(&path, base->buf + tn_len);
> +
> +                       if (!in_sparse_checkout(&path, old_baselen - tn_len,
> +                                               entry.mode, repo->index,
> +                                               sparsity, default_sparsity_match,
> +                                               &sparsity_match)) {
> +                               strbuf_setlen(base, old_baselen);
> +                               continue;
> +                       }
> +               }
> +
>                 if (S_ISREG(entry.mode)) {
>                         hit |= grep_oid(opt, &entry.oid, base->buf, tn_len,
>                                          check_attr ? base->buf + tn_len : NULL);
> @@ -602,8 +686,8 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>
>                         strbuf_addch(base, '/');
>                         init_tree_desc(&sub, data, size);
> -                       hit |= grep_tree(opt, pathspec, &sub, base, tn_len,
> -                                        check_attr);
> +                       hit |= do_grep_tree(opt, pathspec, &sub, base, tn_len,
> +                                           check_attr, sparsity, sparsity_match);
>                         free(data);
>                 } else if (recurse_submodules && S_ISGITLINK(entry.mode)) {
>                         hit |= grep_submodule(opt, pathspec, &entry.oid,
> @@ -621,6 +705,31 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>         return hit;
>  }
>
> +/*
> + * Note: sparsity patterns and paths' attributes will only be considered if
> + * is_root_tree has true value. (Otherwise, we cannot properly perform pattern
> + * matching on paths.)
> + */
> +static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> +                    struct tree_desc *tree, struct strbuf *base, int tn_len,
> +                    int is_root_tree)
> +{
> +       struct pattern_list *patterns = NULL;
> +       int ret;
> +
> +       if (is_root_tree)
> +               patterns = get_sparsity_patterns(opt->repo);
> +
> +       ret = do_grep_tree(opt, pathspec, tree, base, tn_len, is_root_tree,
> +                          patterns, 0);
> +
> +       if (patterns) {
> +               clear_pattern_list(patterns);
> +               free(patterns);
> +       }
> +       return ret;
> +}

Once I figured out grep_tree() was just becoming a wrapper around
do_grep_tree(), the patch made more sense; I should have scrolled to
the end quicker.  ;-)

> +
>  static int grep_object(struct grep_opt *opt, const struct pathspec *pathspec,
>                        struct object *obj, const char *name, const char *path)
>  {
> diff --git a/t/t7011-skip-worktree-reading.sh b/t/t7011-skip-worktree-reading.sh
> index 37525cae3a..26852586ac 100755
> --- a/t/t7011-skip-worktree-reading.sh
> +++ b/t/t7011-skip-worktree-reading.sh
> @@ -109,15 +109,6 @@ test_expect_success 'ls-files --modified' '
>         test -z "$(git ls-files -m)"
>  '
>
> -test_expect_success 'grep with skip-worktree file' '
> -       git update-index --no-skip-worktree 1 &&
> -       echo test > 1 &&
> -       git update-index 1 &&
> -       git update-index --skip-worktree 1 &&
> -       rm 1 &&
> -       test "$(git grep --no-ext-grep test)" = "1:test"
> -'

Yaay!

> -
>  echo ":000000 100644 $ZERO_OID $EMPTY_BLOB A   1" > expected
>  test_expect_success 'diff-index does not examine skip-worktree absent entries' '
>         setup_absent &&
> diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> new file mode 100755
> index 0000000000..ce080cf572
> --- /dev/null
> +++ b/t/t7817-grep-sparse-checkout.sh
> @@ -0,0 +1,174 @@
> +#!/bin/sh
> +
> +test_description='grep in sparse checkout
> +
> +This test creates a repo with the following structure:
> +
> +.
> +|-- a
> +|-- b
> +|-- dir
> +|   `-- c
> +|-- sub
> +|   |-- A
> +|   |   `-- a
> +|   `-- B
> +|       `-- b
> +`-- sub2
> +    `-- a
> +
> +Where . has non-cone mode sparsity patterns, sub is a submodule with cone mode

Maybe "Where the outer repository has non-cone mode..."?  The use of
'.' threw me for a bit.

> +sparsity patterns and sub2 is a submodule that is excluded by the superproject
> +sparsity patterns. The resulting sparse checkout should leave the following
> +structure on the working tree:

s/on the/in the/?

> +
> +.
> +|-- a
> +|-- sub
> +|   `-- B
> +|       `-- b
> +`-- sub2
> +    `-- a
> +
> +But note that sub2 should have the SKIP_WORKTREE bit set.
> +'
> +
> +. ./test-lib.sh
> +
> +test_expect_success 'setup' '
> +       echo "text" >a &&
> +       echo "text" >b &&
> +       mkdir dir &&
> +       echo "text" >dir/c &&
> +
> +       git init sub &&
> +       (
> +               cd sub &&
> +               mkdir A B &&
> +               echo "text" >A/a &&
> +               echo "text" >B/b &&
> +               git add A B &&
> +               git commit -m sub &&
> +               git sparse-checkout init --cone &&
> +               git sparse-checkout set B
> +       ) &&
> +
> +       git init sub2 &&
> +       (
> +               cd sub2 &&
> +               echo "text" >a &&
> +               git add a &&
> +               git commit -m sub2
> +       ) &&
> +
> +       git submodule add ./sub &&
> +       git submodule add ./sub2 &&
> +       git add a b dir &&
> +       git commit -m super &&
> +       git sparse-checkout init --no-cone &&
> +       git sparse-checkout set "/*" "!b" "!/*/" "sub" &&
> +
> +       git tag -am tag-to-commit tag-to-commit HEAD &&
> +       tree=$(git rev-parse HEAD^{tree}) &&
> +       git tag -am tag-to-tree tag-to-tree $tree &&
> +
> +       test_path_is_missing b &&
> +       test_path_is_missing dir &&
> +       test_path_is_missing sub/A &&
> +       test_path_is_file a &&
> +       test_path_is_file sub/B/b &&
> +       test_path_is_file sub2/a
> +'
> +
> +# The test bellow checks a special case: the sparsity patterns exclude '/b'

s/bellow/below/

> +# and sparse checkout is enable, but the path exists on the working tree (e.g.

s/enable/enabled/, s/on/in/

> +# manually created after `git sparse-checkout init`). In this case, grep should
> +# skip it.
> +test_expect_success 'grep in working tree should honor sparse checkout' '
> +       cat >expect <<-EOF &&
> +       a:text
> +       EOF
> +       echo "new-text" >b &&
> +       test_when_finished "rm b" &&
> +       git grep "text" >actual &&
> +       test_cmp expect actual
> +'
> +
> +test_expect_success 'grep --cached should honor sparse checkout' '
> +       cat >expect <<-EOF &&
> +       a:text
> +       EOF
> +       git grep --cached "text" >actual &&
> +       test_cmp expect actual
> +'
> +
> +test_expect_success 'grep <commit-ish> should honor sparse checkout' '
> +       commit=$(git rev-parse HEAD) &&
> +       cat >expect_commit <<-EOF &&
> +       $commit:a:text
> +       EOF
> +       cat >expect_tag-to-commit <<-EOF &&
> +       tag-to-commit:a:text
> +       EOF
> +       git grep "text" $commit >actual_commit &&
> +       test_cmp expect_commit actual_commit &&
> +       git grep "text" tag-to-commit >actual_tag-to-commit &&
> +       test_cmp expect_tag-to-commit actual_tag-to-commit
> +'
> +
> +test_expect_success 'grep <tree-ish> should ignore sparsity patterns' '
> +       commit=$(git rev-parse HEAD) &&
> +       tree=$(git rev-parse HEAD^{tree}) &&
> +       cat >expect_tree <<-EOF &&
> +       $tree:a:text
> +       $tree:b:text
> +       $tree:dir/c:text
> +       EOF
> +       cat >expect_tag-to-tree <<-EOF &&
> +       tag-to-tree:a:text
> +       tag-to-tree:b:text
> +       tag-to-tree:dir/c:text
> +       EOF
> +       git grep "text" $tree >actual_tree &&
> +       test_cmp expect_tree actual_tree &&
> +       git grep "text" tag-to-tree >actual_tag-to-tree &&
> +       test_cmp expect_tag-to-tree actual_tag-to-tree
> +'
> +
> +# Note that sub2/ is present in the worktree but it is excluded by the sparsity
> +# patterns, so grep should not recurse into it.
> +test_expect_success 'grep --recurse-submodules should honor sparse checkout in submodule' '
> +       cat >expect <<-EOF &&
> +       a:text
> +       sub/B/b:text
> +       EOF
> +       git grep --recurse-submodules "text" >actual &&
> +       test_cmp expect actual
> +'
> +
> +test_expect_success 'grep --recurse-submodules --cached should honor sparse checkout in submodule' '
> +       cat >expect <<-EOF &&
> +       a:text
> +       sub/B/b:text
> +       EOF
> +       git grep --recurse-submodules --cached "text" >actual &&
> +       test_cmp expect actual
> +'
> +
> +test_expect_success 'grep --recurse-submodules <commit-ish> should honor sparse checkout in submodule' '
> +       commit=$(git rev-parse HEAD) &&
> +       cat >expect_commit <<-EOF &&
> +       $commit:a:text
> +       $commit:sub/B/b:text
> +       EOF
> +       cat >expect_tag-to-commit <<-EOF &&
> +       tag-to-commit:a:text
> +       tag-to-commit:sub/B/b:text
> +       EOF
> +       git grep --recurse-submodules "text" $commit >actual_commit &&
> +       test_cmp expect_commit actual_commit &&
> +       git grep --recurse-submodules "text" tag-to-commit >actual_tag-to-commit &&
> +       test_cmp expect_tag-to-commit actual_tag-to-commit
> +'
> +
> +test_done
> --
> 2.26.2

Looks good.  Do we want to add a testcase where a file is unmerged and
present in the working copy despite not matching the sparsity patterns
(i.e. to emulate being in the middle of a merge/rebase/cherry-pick)?


* Re: [PATCH v3 5/5] config: add setting to ignore sparsity patterns in some cmds
  2020-05-28  1:13     ` [PATCH v3 5/5] config: add setting to ignore sparsity patterns in some cmds Matheus Tavares
@ 2020-05-30 16:18       ` Elijah Newren
  2020-06-01  4:45         ` Matheus Tavares Bernardino
  0 siblings, 1 reply; 120+ messages in thread
From: Elijah Newren @ 2020-05-30 16:18 UTC (permalink / raw)
  To: Matheus Tavares
  Cc: Git Mailing List, Junio C Hamano, Derrick Stolee, Jonathan Tan

On Wed, May 27, 2020 at 6:14 PM Matheus Tavares
<matheus.bernardino@usp.br> wrote:
>
> When sparse checkout is enabled, some users expect the output of certain
> commands (such as grep, diff, and log) to be also restricted within the
> sparsity patterns. This would allow them to effectively work only on the
> subset of files in which they are interested; and allow some commands to
> possibly perform better, by not considering uninteresting paths. For
> this reason, we taught grep to honor the sparsity patterns, in the
> previous patch. But, on the other hand, allowing grep and the other
> commands mentioned to optionally ignore the patterns also make for some
> interesting use cases. E.g. using grep to search for a function
> documentation that resides outside the sparse checkout.
>
> In any case, there is no current way for users to configure the behavior
> they want for these commands. Aiming to provide this flexibility, let's
> introduce the sparse.restrictCmds setting (and the analogous
> --[no]-restrict-to-sparse-paths global option). The default value is
> true. For now, grep is the only one affected by this setting, but the
> goal is to have support for more commands, in the future.
>
> Helped-by: Elijah Newren <newren@gmail.com>
> Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
> ---
>  Documentation/config.txt               |   2 +
>  Documentation/config/sparse.txt        |  24 +++++
>  Documentation/git-grep.txt             |   3 +
>  Documentation/git.txt                  |   4 +
>  Makefile                               |   1 +
>  builtin/grep.c                         |  13 ++-
>  contrib/completion/git-completion.bash |   2 +
>  git.c                                  |   6 ++
>  sparse-checkout.c                      |  16 +++
>  sparse-checkout.h                      |  11 +++
>  t/t7817-grep-sparse-checkout.sh        | 132 ++++++++++++++++++++++++-
>  t/t9902-completion.sh                  |   4 +-
>  12 files changed, 212 insertions(+), 6 deletions(-)
>  create mode 100644 Documentation/config/sparse.txt
>  create mode 100644 sparse-checkout.c
>  create mode 100644 sparse-checkout.h
>
> diff --git a/Documentation/config.txt b/Documentation/config.txt
> index ef0768b91a..fd74b80302 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -436,6 +436,8 @@ include::config/sequencer.txt[]
>
>  include::config/showbranch.txt[]
>
> +include::config/sparse.txt[]
> +
>  include::config/splitindex.txt[]
>
>  include::config/ssh.txt[]
> diff --git a/Documentation/config/sparse.txt b/Documentation/config/sparse.txt
> new file mode 100644
> index 0000000000..2a25b4b8ef
> --- /dev/null
> +++ b/Documentation/config/sparse.txt
> @@ -0,0 +1,24 @@
> +sparse.restrictCmds::
> +       Only meaningful in conjunction with core.sparseCheckout. This option
> +       extends sparse checkouts (which limit which paths are written to the
> +       working tree), so that output and operations are also limited to the
> +       sparsity paths where possible and implemented. The purpose of this
> +       option is to (1) focus output for the user on the portion of the
> +       repository that is of interest to them, and (2) enable potentially
> +       dramatic performance improvements, especially in conjunction with
> +       partial clones.
> ++
> +When this option is true (default), some git commands may limit their behavior
> +to the paths specified by the sparsity patterns, or to the intersection of
> +those paths and any (like `*.c`) that the user might also specify on the
> +command line. When false, the affected commands will work on full trees,
> +ignoring the sparsity patterns. For now, only git-grep honors this setting. In
> +this command, the restriction takes effect in three cases: with --cached; when
> +a commit-ish is given; when searching a working tree where some paths excluded
> +by the sparsity patterns are present (e.g. manually created paths or not
> +removed submodules).

I think "In this command, the restriction takes effect..." to the end
of the paragraph should be removed.  I don't want every subcommand's
behavior to be specified here; it'll grow unreadably long and be more
likely to eventually go stale.

> ++
> +Note: commands which export, integrity check, or create history will always
> +operate on full trees (e.g. fast-export, format-patch, fsck, commit, etc.),
> +unaffected by any sparsity patterns. Also, writting commands such as
> +sparse-checkout and read-tree will not be affected by this configuration.

s/writting/writing/

> diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
> index 9bdf807584..abbf100109 100644
> --- a/Documentation/git-grep.txt
> +++ b/Documentation/git-grep.txt
> @@ -41,6 +41,9 @@ characters.  An empty string as search expression matches all lines.
>  CONFIGURATION
>  -------------
>
> +git-grep honors the sparse.restrictCmds setting. See its definition in
> +linkgit:git-config[1].
> +
>  :git-grep: 1
>  include::config/grep.txt[]
>
> diff --git a/Documentation/git.txt b/Documentation/git.txt
> index 9d6769e95a..5e107c6246 100644
> --- a/Documentation/git.txt
> +++ b/Documentation/git.txt
> @@ -180,6 +180,10 @@ If you just want to run git as if it was started in `<path>` then use
>         Do not perform optional operations that require locks. This is
>         equivalent to setting the `GIT_OPTIONAL_LOCKS` to `0`.
>
> +--[no-]restrict-to-sparse-paths::
> +       Overrides the sparse.restrictCmds configuration (see
> +       linkgit:git-config[1]) for this execution.
> +
>  --list-cmds=group[,group...]::
>         List commands by group. This is an internal/experimental
>         option and may change or be removed in the future. Supported
> diff --git a/Makefile b/Makefile
> index 90aa329eb7..0c0013b32c 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -983,6 +983,7 @@ LIB_OBJS += sha1-name.o
>  LIB_OBJS += shallow.o
>  LIB_OBJS += sideband.o
>  LIB_OBJS += sigchain.o
> +LIB_OBJS += sparse-checkout.o
>  LIB_OBJS += split-index.o
>  LIB_OBJS += stable-qsort.o
>  LIB_OBJS += strbuf.o
> diff --git a/builtin/grep.c b/builtin/grep.c
> index 11e33b8aee..cc696dab4a 100644
> --- a/builtin/grep.c
> +++ b/builtin/grep.c
> @@ -25,6 +25,7 @@
>  #include "submodule-config.h"
>  #include "object-store.h"
>  #include "packfile.h"
> +#include "sparse-checkout.h"
>
>  static char const * const grep_usage[] = {
>         N_("git grep [<options>] [-e] <pattern> [<rev>...] [[--] <path>...]"),
> @@ -498,6 +499,7 @@ static int grep_cache(struct grep_opt *opt,
>         int nr;
>         struct strbuf name = STRBUF_INIT;
>         int name_base_len = 0;
> +       int sparse_paths_only = restrict_to_sparse_paths(repo);
>         if (repo->submodule_prefix) {
>                 name_base_len = strlen(repo->submodule_prefix);
>                 strbuf_addstr(&name, repo->submodule_prefix);
> @@ -509,7 +511,7 @@ static int grep_cache(struct grep_opt *opt,
>         for (nr = 0; nr < repo->index->cache_nr; nr++) {
>                 const struct cache_entry *ce = repo->index->cache[nr];
>
> -               if (ce_skip_worktree(ce))
> +               if (sparse_paths_only && ce_skip_worktree(ce))
>                         continue;
>
>                 strbuf_setlen(&name, name_base_len);
> @@ -715,9 +717,10 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>                      int is_root_tree)
>  {
>         struct pattern_list *patterns = NULL;
> +       int sparse_paths_only = restrict_to_sparse_paths(opt->repo);
>         int ret;
>
> -       if (is_root_tree)
> +       if (is_root_tree && sparse_paths_only)
>                 patterns = get_sparsity_patterns(opt->repo);
>
>         ret = do_grep_tree(opt, pathspec, tree, base, tn_len, is_root_tree,

It's kinda nice how clean and easy it is to insert this new option
after the previous patch.
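
For readers following along, the effect of the grep_cache() hunk above
can be sketched in isolation. This is only an illustration, not git
code: `struct entry_sketch` and `count_searched()` are hypothetical
names, and the `skip_worktree` field stands in for what
ce_skip_worktree() reports on a real cache entry.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a cache entry. */
struct entry_sketch {
	const char *path;
	int skip_worktree;	/* nonzero if excluded by the sparsity patterns */
};

/*
 * Count how many entries would be searched. When sparse_paths_only is
 * nonzero, entries marked skip_worktree are passed over, mirroring the
 * "sparse_paths_only && ce_skip_worktree(ce)" condition in the patch.
 */
size_t count_searched(const struct entry_sketch *entries, size_t nr,
		      int sparse_paths_only)
{
	size_t i, searched = 0;

	for (i = 0; i < nr; i++) {
		if (sparse_paths_only && entries[i].skip_worktree)
			continue;
		searched++;
	}
	return searched;
}
```

With the three top-level files from the test setup (a kept; b and dir/c
excluded), restricting leaves one entry to search and not restricting
leaves all three, which lines up with the expected output of the
--cached tests.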

> @@ -1257,6 +1260,12 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
>
>         if (!use_index || untracked) {
>                 int use_exclude = (opt_exclude < 0) ? use_index : !!opt_exclude;
> +
> +               if (opt_restrict_to_sparse_paths >= 0) {
> +                       die(_("--[no-]restrict-to-sparse-paths is incompatible"
> +                                 " with --no-index and --untracked"));
> +               }
> +
>                 hit = grep_directory(&opt, &pathspec, use_exclude, use_index);
>         } else if (0 <= opt_exclude) {
>                 die(_("--[no-]exclude-standard cannot be used for tracked contents"));
> diff --git a/contrib/completion/git-completion.bash b/contrib/completion/git-completion.bash
> index 70ad04e1b2..71956f7313 100644
> --- a/contrib/completion/git-completion.bash
> +++ b/contrib/completion/git-completion.bash
> @@ -3208,6 +3208,8 @@ __git_main ()
>                         --namespace=
>                         --no-replace-objects
>                         --help
> +                       --restrict-to-sparse-paths
> +                       --no-restrict-to-sparse-paths
>                         "
>                         ;;
>                 *)
> diff --git a/git.c b/git.c
> index a2d337eed7..6db1382ae4 100644
> --- a/git.c
> +++ b/git.c
> @@ -38,6 +38,7 @@ const char git_more_info_string[] =
>            "See 'git help git' for an overview of the system.");
>
>  static int use_pager = -1;
> +int opt_restrict_to_sparse_paths = -1;
>
>  static void list_builtins(struct string_list *list, unsigned int exclude_option);
>
> @@ -311,6 +312,10 @@ static int handle_options(const char ***argv, int *argc, int *envchanged)
>                         } else {
>                                 exit(list_cmds(cmd));
>                         }
> +               } else if (!strcmp(cmd, "--restrict-to-sparse-paths")) {
> +                       opt_restrict_to_sparse_paths = 1;
> +               } else if (!strcmp(cmd, "--no-restrict-to-sparse-paths")) {
> +                       opt_restrict_to_sparse_paths = 0;
>                 } else {
>                         fprintf(stderr, _("unknown option: %s\n"), cmd);
>                         usage(git_usage_string);
> @@ -319,6 +324,7 @@ static int handle_options(const char ***argv, int *argc, int *envchanged)
>                 (*argv)++;
>                 (*argc)--;
>         }
> +
>         return (*argv) - orig_argv;
>  }
>

Why the stray whitespace change?

> diff --git a/sparse-checkout.c b/sparse-checkout.c
> new file mode 100644
> index 0000000000..9a9e50fd29
> --- /dev/null
> +++ b/sparse-checkout.c
> @@ -0,0 +1,16 @@
> +#include "cache.h"
> +#include "config.h"
> +#include "sparse-checkout.h"
> +
> +int restrict_to_sparse_paths(struct repository *repo)
> +{
> +       int ret;
> +
> +       if (opt_restrict_to_sparse_paths >= 0)
> +               return opt_restrict_to_sparse_paths;
> +
> +       if (repo_config_get_bool(repo, "sparse.restrictcmds", &ret))
> +               ret = 1;
> +
> +       return ret;
> +}

Do we want to consider renaming this file to sparse.c, since it's
for sparse grep, sparse diff, etc., not just for the checkout
piece?  It would also go along well with our toplevel related config
being in the "sparse" namespace.
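
As an aside, the precedence that restrict_to_sparse_paths() implements
(command-line option first, then config, then a default of true) can be
sketched as a standalone snippet. This is a sketch under assumptions,
not the patch itself: `struct fake_config`, `fake_config_get_bool()` and
`restrict_to_sparse_paths_sketch()` are hypothetical stand-ins, with the
fake lookup only imitating the "returns nonzero when the key is unset"
convention that the patch relies on from repo_config_get_bool().

```c
/* Tri-state override: -1 = option not given, 0 = --no-..., 1 = --... */
int opt_restrict_to_sparse_paths = -1;

/* Hypothetical stand-in for the repository's configuration. */
struct fake_config {
	int is_set;	/* whether sparse.restrictCmds is configured */
	int value;	/* its boolean value, if configured */
};

/* Store the value in *dest and return 0 if set; return 1 if unset. */
int fake_config_get_bool(const struct fake_config *cfg, int *dest)
{
	if (!cfg->is_set)
		return 1;
	*dest = cfg->value;
	return 0;
}

/* Same decision order as the function in the patch. */
int restrict_to_sparse_paths_sketch(const struct fake_config *cfg)
{
	int ret;

	/* An explicit command-line option always wins. */
	if (opt_restrict_to_sparse_paths >= 0)
		return opt_restrict_to_sparse_paths;

	/* Otherwise use the config value, defaulting to true if unset. */
	if (fake_config_get_bool(cfg, &ret))
		ret = 1;

	return ret;
}
```

The -1 sentinel is also what lets cmd_grep() tell "option given" apart
from "option absent" when it rejects the combination with --no-index
and --untracked.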

> diff --git a/sparse-checkout.h b/sparse-checkout.h
> new file mode 100644
> index 0000000000..1de3b588d8
> --- /dev/null
> +++ b/sparse-checkout.h
> @@ -0,0 +1,11 @@
> +#ifndef SPARSE_CHECKOUT_H
> +#define SPARSE_CHECKOUT_H
> +
> +struct repository;
> +
> +extern int opt_restrict_to_sparse_paths; /* from git.c */
> +
> +/* Whether or not cmds should restrict behavior on sparse paths, in this repo */
> +int restrict_to_sparse_paths(struct repository *repo);
> +
> +#endif /* SPARSE_CHECKOUT_H */
> diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> index ce080cf572..1aef084186 100755
> --- a/t/t7817-grep-sparse-checkout.sh
> +++ b/t/t7817-grep-sparse-checkout.sh
> @@ -80,10 +80,10 @@ test_expect_success 'setup' '
>         test_path_is_file sub2/a
>  '
>
> -# The test bellow checks a special case: the sparsity patterns exclude '/b'
> +# The two tests bellow check a special case: the sparsity patterns exclude '/b'
>  # and sparse checkout is enable, but the path exists on the working tree (e.g.
>  # manually created after `git sparse-checkout init`). In this case, grep should
> -# skip it.
> +# skip the file by default, but not with --no-restrict-to-sparse-paths.
>  test_expect_success 'grep in working tree should honor sparse checkout' '
>         cat >expect <<-EOF &&
>         a:text
> @@ -93,6 +93,16 @@ test_expect_success 'grep in working tree should honor sparse checkout' '
>         git grep "text" >actual &&
>         test_cmp expect actual
>  '
> +test_expect_success 'grep w/ --no-restrict-to-sparse-paths for sparsely excluded but present paths' '
> +       cat >expect <<-EOF &&
> +       a:text
> +       b:new-text
> +       EOF
> +       echo "new-text" >b &&
> +       test_when_finished "rm b" &&
> +       git --no-restrict-to-sparse-paths grep "text" >actual &&
> +       test_cmp expect actual
> +'
>
>  test_expect_success 'grep --cached should honor sparse checkout' '
>         cat >expect <<-EOF &&
> @@ -136,7 +146,7 @@ test_expect_success 'grep <tree-ish> should ignore sparsity patterns' '
>  '
>
>  # Note that sub2/ is present in the worktree but it is excluded by the sparsity
> -# patterns, so grep should not recurse into it.
> +# patterns, so grep should only recurse into it with --no-restrict-to-sparse-paths.
>  test_expect_success 'grep --recurse-submodules should honor sparse checkout in submodule' '
>         cat >expect <<-EOF &&
>         a:text
> @@ -145,6 +155,15 @@ test_expect_success 'grep --recurse-submodules should honor sparse checkout in s
>         git grep --recurse-submodules "text" >actual &&
>         test_cmp expect actual
>  '
> +test_expect_success 'grep --recurse-submodules should search in excluded submodules w/ --no-restrict-to-sparse-paths' '
> +       cat >expect <<-EOF &&
> +       a:text
> +       sub/B/b:text
> +       sub2/a:text
> +       EOF
> +       git --no-restrict-to-sparse-paths grep --recurse-submodules "text" >actual &&
> +       test_cmp expect actual
> +'
>
>  test_expect_success 'grep --recurse-submodules --cached should honor sparse checkout in submodule' '
>         cat >expect <<-EOF &&
> @@ -171,4 +190,111 @@ test_expect_success 'grep --recurse-submodules <commit-ish> should honor sparse
>         test_cmp expect_tag-to-commit actual_tag-to-commit
>  '
>
> +for cmd in 'git --no-restrict-to-sparse-paths grep' \
> +          'git -c sparse.restrictCmds=false grep' \
> +          'git -c sparse.restrictCmds=true --no-restrict-to-sparse-paths grep'
> +do
> +
> +       test_expect_success "$cmd --cached should ignore sparsity patterns" '
> +               cat >expect <<-EOF &&
> +               a:text
> +               b:text
> +               dir/c:text
> +               EOF
> +               $cmd --cached "text" >actual &&
> +               test_cmp expect actual
> +       '
> +
> +       test_expect_success "$cmd <commit-ish> should ignore sparsity patterns" '
> +               commit=$(git rev-parse HEAD) &&
> +               cat >expect_commit <<-EOF &&
> +               $commit:a:text
> +               $commit:b:text
> +               $commit:dir/c:text
> +               EOF
> +               cat >expect_tag-to-commit <<-EOF &&
> +               tag-to-commit:a:text
> +               tag-to-commit:b:text
> +               tag-to-commit:dir/c:text
> +               EOF
> +               $cmd "text" $commit >actual_commit &&
> +               test_cmp expect_commit actual_commit &&
> +               $cmd "text" tag-to-commit >actual_tag-to-commit &&
> +               test_cmp expect_tag-to-commit actual_tag-to-commit
> +       '
> +done
> +
> +test_expect_success 'grep --recurse-submodules --cached \w --no-restrict-to-sparse-paths' '

s%\w%w/%, or s%\w%with%?  Same issue below too.

> +       cat >expect <<-EOF &&
> +       a:text
> +       b:text
> +       dir/c:text
> +       sub/A/a:text
> +       sub/B/b:text
> +       sub2/a:text
> +       EOF
> +       git --no-restrict-to-sparse-paths grep --recurse-submodules --cached \
> +               "text" >actual &&
> +       test_cmp expect actual
> +'
> +
> +test_expect_success 'grep --recurse-submodules <commit-ish> \w --no-restrict-to-sparse-paths' '
> +       commit=$(git rev-parse HEAD) &&
> +       cat >expect_commit <<-EOF &&
> +       $commit:a:text
> +       $commit:b:text
> +       $commit:dir/c:text
> +       $commit:sub/A/a:text
> +       $commit:sub/B/b:text
> +       $commit:sub2/a:text
> +       EOF
> +       cat >expect_tag-to-commit <<-EOF &&
> +       tag-to-commit:a:text
> +       tag-to-commit:b:text
> +       tag-to-commit:dir/c:text
> +       tag-to-commit:sub/A/a:text
> +       tag-to-commit:sub/B/b:text
> +       tag-to-commit:sub2/a:text
> +       EOF
> +       git --no-restrict-to-sparse-paths grep --recurse-submodules "text" \
> +               $commit >actual_commit &&
> +       test_cmp expect_commit actual_commit &&
> +       git --no-restrict-to-sparse-paths grep --recurse-submodules "text" \
> +               tag-to-commit >actual_tag-to-commit &&
> +       test_cmp expect_tag-to-commit actual_tag-to-commit
> +'
> +
> +test_expect_success 'should respect the sparse.restrictCmds values from submodules' '
> +       cat >expect <<-EOF &&
> +       a:text
> +       sub/A/a:text
> +       sub/B/b:text
> +       EOF
> +       test_config -C sub sparse.restrictCmds false &&
> +       git grep --cached --recurse-submodules "text" >actual &&
> +       test_cmp expect actual
> +'
> +
> +test_expect_success 'should propagate --[no]-restrict-to-sparse-paths to submodules' '
> +       cat >expect <<-EOF &&
> +       a:text
> +       b:text
> +       dir/c:text
> +       sub/A/a:text
> +       sub/B/b:text
> +       sub2/a:text
> +       EOF
> +       test_config -C sub sparse.restrictCmds true &&
> +       git --no-restrict-to-sparse-paths grep --cached --recurse-submodules "text" >actual &&
> +       test_cmp expect actual
> +'
> +
> +for opt in '--untracked' '--no-index'
> +do
> +       test_expect_success "--[no]-restrict-to-sparse-paths and $opt are incompatible" "
> +               test_must_fail git --restrict-to-sparse-paths grep $opt . 2>actual &&
> +               test_i18ngrep 'restrict-to-sparse-paths is incompatible with' actual
> +       "
> +done
> +
>  test_done
> diff --git a/t/t9902-completion.sh b/t/t9902-completion.sh
> index 3c44af6940..a4a7767e06 100755
> --- a/t/t9902-completion.sh
> +++ b/t/t9902-completion.sh
> @@ -1473,6 +1473,8 @@ test_expect_success 'double dash "git" itself' '
>         --namespace=
>         --no-replace-objects Z
>         --help Z
> +       --restrict-to-sparse-paths Z
> +       --no-restrict-to-sparse-paths Z
>         EOF
>  '
>
> @@ -1515,7 +1517,7 @@ test_expect_success 'general options' '
>         test_completion "git --nam" "--namespace=" &&
>         test_completion "git --bar" "--bare " &&
>         test_completion "git --inf" "--info-path " &&
> -       test_completion "git --no-r" "--no-replace-objects "
> +       test_completion "git --no-rep" "--no-replace-objects "
>  '

All these testcases look great (modulo the small typo I pointed out
earlier); I kept thinking "but what about case <x>?" and then I kept
reading and saw you covered it.  You even added some I wasn't thinking
about and might have overlooked but seem important.


* Re: [PATCH v3 2/5] t/helper/test-config: return exit codes consistently
  2020-05-30 14:29       ` Elijah Newren
@ 2020-06-01  4:36         ` Matheus Tavares Bernardino
  0 siblings, 0 replies; 120+ messages in thread
From: Matheus Tavares Bernardino @ 2020-06-01  4:36 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git Mailing List, Junio C Hamano, Derrick Stolee, Jonathan Tan

On Sat, May 30, 2020 at 11:29 AM Elijah Newren <newren@gmail.com> wrote:
>
> On Wed, May 27, 2020 at 6:13 PM Matheus Tavares
> <matheus.bernardino@usp.br> wrote:
> >
> > The test-config helper may exit with a variety of at least four
> > different codes, to reflect the status of the requested operations.
> > These codes are sometimes checked in the tests, but not all of the codes
> > are returned consistently by the helper: 1 will usually refer to a
> > "value not found", but usage errors can also return 1 or 128. The latter
>
> I'm not sure what "The latter" refers to here.

It would be the 128 exit code. I'll try to reword that for clarity.

> > is also expected on errors within the configset functions. These
> > inconsistent uses of the exit codes can lead to false positives in the
> > tests. Although all tests that currently check the helper's exit code,
> > on errors, do also check the output, it's still better to standardize
> > the exit codes and avoid future problems in new tests. While we are
>
> That last sentence was slightly hard for me to parse.  Maybe something like:
>
> ...Although all tests which expect errors and check the helper's exit
> code currently also check the output, it's still better...

Sounds better, I will use that for the next version. Thanks.


* Re: [PATCH v3 3/5] config: correctly read worktree configs in submodules
  2020-05-30 14:49       ` Elijah Newren
@ 2020-06-01  4:38         ` Matheus Tavares Bernardino
  0 siblings, 0 replies; 120+ messages in thread
From: Matheus Tavares Bernardino @ 2020-06-01  4:38 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git Mailing List, Junio C Hamano, Derrick Stolee, Jonathan Tan

On Sat, May 30, 2020 at 11:49 AM Elijah Newren <newren@gmail.com> wrote:
>
> On Wed, May 27, 2020 at 6:13 PM Matheus Tavares
> <matheus.bernardino@usp.br> wrote:
> >
> > One of the steps in do_git_config_sequence() is to load the
> > worktree-specific config file. Although the function receives a git_dir
> > string, it relies on git_pathdup(), which uses the_repository->git_dir,
> > to make the path to the file. Furthermore, it also checks that
> > extensions.worktreeConfig is set through the
> > repository_format_worktree_config variable, which refers to
> > the_repository only. Thus, when a submodule has worktree settings, a
> > command executed in the superproject that recurses into the submodule
> > won't find the said settings.
> >
> > Such a scenario might not be needed now, but it will be in the following
>
> It's not needed?  Are there not other config values that affect grep's
> behavior, such as smudge filters of the submodule that might be
> important if doing a 'git grep --recurse-submodules $REVISION'?

Hmm, I haven't used smudge filters before, but it seems to me that
`git grep $REVISION` does not honor them.

> Also, is there a similar issue here for .gitattributes?  (e.g. if the
> submodule declares certain files to be binary?)

Declaring files as binary in the submodule works fine. But I noticed
that textconv filter specifications in the submodule's config are
currently ignored. To be honest, I wasn't aware of this issue before.

> I don't actually know if these are issues but I'm just surprised to
> hear that this would be the first case that would need to look at
> submodule-specific configuration.

Hmm, not to submodule-specific configuration but to worktree-specific
configuration of a submodule, right? I.e. a config.worktree file from
within a submodule. Reconsidering this now, we could indeed have a
diff.<driver>.textconv or core.quotePath setting specified in the
worktree scope of a submodule. And we should honor them when recursing
in grep. I guess I thought the "most natural" place for these
settings, in a submodule, would be in the standard .git/config file
(as opposed to the sparse-checkout ones, which are normally at
config.worktree). That's probably why I wrote "Such a scenario might not
be needed now". But we should indeed support reading
diff.<driver>.textconv from config.worktree as well (although grep
currently ignores this setting in submodules, both in the local and
worktree scopes). So the said sentence doesn't make much sense,
indeed. I will remove it. Thanks!

> > diff --git a/t/helper/test-config.c b/t/helper/test-config.c
> > index 1c8e965840..284f83a921 100644
> > --- a/t/helper/test-config.c
> > +++ b/t/helper/test-config.c
> > @@ -84,33 +91,63 @@ int cmd__config(int argc, const char **argv)
> >         int i, val;
> >         const char *v;
> >         const struct string_list *strptr;
> > -       struct config_set cs;
> > +       struct config_set cs = { .hash_initialized = 0 };
> >         enum test_config_exit_code ret = TC_SUCCESS;
> > +       struct repository *repo = the_repository;
> > +       const char *subrepo_path = NULL;
> > +
> > +       argc--; /* skip over "config" */
>
> This line alone is responsible for a fairly big set of changes
> throughout this file, just decrementing indices everywhere.  It might
> be nice for review purposes if this and the other changes it caused
> were pulled out into a separate step, so we can more easily
> concentrate on the primary additions and changes you are making to
> this file.

OK, will do.


* Re: [PATCH v3 4/5] grep: honor sparse checkout patterns
  2020-05-30 15:48       ` Elijah Newren
@ 2020-06-01  4:44         ` Matheus Tavares Bernardino
  2020-06-03  2:38           ` Elijah Newren
  0 siblings, 1 reply; 120+ messages in thread
From: Matheus Tavares Bernardino @ 2020-06-01  4:44 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git Mailing List, Junio C Hamano, Derrick Stolee, Jonathan Tan

On Sat, May 30, 2020 at 12:48 PM Elijah Newren <newren@gmail.com> wrote:
>
> On Wed, May 27, 2020 at 6:13 PM Matheus Tavares
> <matheus.bernardino@usp.br> wrote:
> >
> > One of the main uses for a sparse checkout is to allow users to focus on
> > the subset of files in a repository in which they are interested. But
> > git-grep currently ignores the sparsity patterns and reports all matches
> > found outside this subset, which kind of goes in the opposite direction.
> > Let's fix that, making it honor the sparsity boundaries for every
> > grepping case where this is relevant:
> >
> > - git grep in worktree
> > - git grep --cached
> > - git grep $REVISION
> >
> > For the worktree case, we will not grep paths that have the
> > SKIP_WORKTREE bit set, even if they are present for some reason (e.g.
> > manually created after `git sparse-checkout init`).
>
> This seems worded to raise alarm bells and make users suspect
> implementation difficulties or regrets rather than desired behavior.
> It would be much better to word this simply as something like:
>
>     For the worktree and cached cases, we iterate over paths without
> the SKIP_WORKTREE bit set, and limit our searches to these paths.
>
> > But the next patch
> > will add an option to do so. (See 'Note' below.)
>
> Because this was in the same paragraph as the previous sentence, it
> made it sound like you were going to provide a special worktree-only
> option to search outside the SKIP_WORKTREE bits.  Very confusing.  I
> think I'd combine this sentence into the very first paragraph of the
> commit message and massage the wording a little.  Perhaps something
> like:  ...goes in the opposite direction.  There are some usecases for
> ignoring the sparsity patterns and the next commit will add an option
> to obtain this behavior, but here we start by making grep honor the
> sparsity boundaries for every...
>
> > For `git grep $REVISION`, we will choose to honor the sparsity patterns
> > only when $REVISION is a commit-ish object. The reason is that, for a
> > tree, we don't know whether it represents the root of a repository or a
> > subtree. So we wouldn't be able to correctly match it against the
> > sparsity patterns. E.g. suppose we have a repository with these two
> > sparsity rules: "/*" and "!/a"; and the following structure:
> >
> > /
> > | - a (file)
> > | - d (dir)
> >     | - a (file)
> >
> > If `git grep $REVISION` were to honor the sparsity patterns for every
> > object type, when grepping the /d tree, we would wrongly ignore the /d/a
> > file. This happens because we wouldn't know it resides in /d and
> > therefore it would wrongly match the pattern "!/a". Furthermore, for a
> > search in a blob object, we wouldn't even have a path to check the
> > patterns against. So, let's ignore the sparsity patterns when grepping
> > non-commit-ish objects.
>
> This doesn't actually make it clear how you handle $REVISION which is
> a commit object; you focus so much on when $REVISION is just a tree
> and contrasting that case that you omit the behavior for the case of
> interest.  Also, $REVISION to my mind implies "commit"; if you want to
> imply that a commit or tree could be used, you'd use $TREE or
> $TREE_ISH or something else.  I think it'd make sense to cover all
> three relevant cases into a single paragraph (thus combining with the
> previous paragraph), and then add a second paragraph about the $TREE
> case that streamlines the last two paragraphs above.  So, perhaps we
> can replace your paragraphs from "For the worktree case, we will not grep
> paths..." all the way to "So, let's ignore the sparsity patterns when
> grepping non-commit-ish objects" (after first moving the comment about
> adding an option in the next commit to some other area of the commit
> message, as discussed above) with something like the following:
>
>     For the worktree and cached cases, we iterate over paths without
> the SKIP_WORKTREE bit set, and limit our searches to these paths.  For
> the $REVISION case, we limit the paths we search to those that match
> the sparsity patterns.  (We do not check the SKIP_WORKTREE bit for the
> $REVISION case, because $REVISION may contain paths that do not exist
> in HEAD and thus for which we have no SKIP_WORKTREE bit to consult.
> The sparsity patterns tell us how the SKIP_WORKTREE bit would be set
> if we were to check out $REVISION, so we consult those.  Also, we
> don't use the sparsity paths with the worktree or cached cases, both
> because we have a bit we can check directly and more efficiently, and
> because unmerged entries from a merge or a rebase could cause more
> files to temporarily be present than the sparsity patterns would
> normally select.)
>
>     Note that there is a special case here: `git grep $TREE`.  In this
> case we cannot know whether $TREE corresponds to the root of the
> repository or some sub-tree, and thus there is no way for us to know
> which sparsity patterns, if any, apply.  So the $TREE case will not
> use sparsity patterns or any SKIP_WORKTREE bits and will instead
> always search all files within the $TREE.
>
> >
> > Note: The behavior introduced in this patch is what some users have
> > reported[1] that they would like by default. But the old behavior is
> > still desirable for some use cases. Therefore, the next patch will add
> > an option to allow restoring it when needed.
>
> This paragraph duplicates information you already stated previously.
> It's much clearer than what you stated before, but if you just reword
> the previous comments and combine them into the first paragraph, then
> we can drop this final note.

All great suggestions! I will amend the commit message using your
proposed paragraphs. Thanks!
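
To make the $TREE ambiguity from the quoted commit message concrete, here
is a minimal C sketch. It assumes a toy matcher, `toy_selected()` (a
hypothetical name, not git's pattern code), that hard-codes the two
example rules "/*" and "!/a" and expects a path relative to the
repository root:

```c
#include <string.h>

/*
 * Toy stand-in for a sparsity matcher (hypothetical; not git's code).
 * It hard-codes the two example rules: "include every root-level entry"
 * followed by "exclude the root-level entry named a". As with
 * gitignore-style patterns, the last matching rule wins.
 *
 * `path` must be relative to the repository root. Returns 1 if selected.
 */
int toy_selected(const char *path)
{
	int selected = 0;

	if (*path != '\0')          /* rule 1: everything under the root */
		selected = 1;
	if (strcmp(path, "a") == 0) /* rule 2: but not the root-level "a" */
		selected = 0;
	return selected;
}
```

With the full path, toy_selected("d/a") selects the file. When grepping
the tree object for d directly, the only path available is "a", and
toy_selected("a") rejects it, which is the wrong answer for /d/a. That is
why the $TREE case cannot apply the patterns reliably.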

> >
> > Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
> > ---
> >  builtin/grep.c                   | 125 ++++++++++++++++++++--
> >  t/t7011-skip-worktree-reading.sh |   9 --
> >  t/t7817-grep-sparse-checkout.sh  | 174 +++++++++++++++++++++++++++++++
[...]
> > +static struct pattern_list *get_sparsity_patterns(struct repository *repo)
> > +{
> > +       struct pattern_list *patterns;
> > +       char *sparse_file;
> > +       int sparse_config, cone_config;
> > +
> > +       if (repo_config_get_bool(repo, "core.sparsecheckout", &sparse_config) ||
> > +           !sparse_config) {
> > +               return NULL;
> > +       }
>
> Is core_apply_sparse_checkout not initialized for some reason?

It should be already initialized, yes. But we cannot rely on that as
`repo` might be a submodule, and core_apply_sparse_checkout holds the
configuration's value for `the_repository`.

> > +static int in_sparse_checkout(struct strbuf *path, int prefix_len,
>
> This function name in_sparse_checkout() makes me think "Does the
> working tree represent a sparse checkout?"  Perhaps we could rename it
> to path_matches_sparsity_patterns() ?
>
> Also, is there a reason we can't use dir.c's
> path_matches_pattern_list() here?

Oh, we do use path_matches_pattern_list() inside:

> > +       *match = path_matches_pattern_list(path->buf, path->len,
> > +                                          path->buf + prefix_len, &dtype,
> > +                                          sparsity, istate);
> > +       if (*match == UNDECIDED)
> > +               *match = parent_match;

> How does this new function differ
> in behavior from that function?

The idea of in_sparse_checkout() is to implement a logic closer to
what we have in clear_ce_flags_1(). Here, it is effectively a wrapper
to path_matches_pattern_list() but with some extra logic to decide
whether grep should search in a given entry, based on its mode, the
match result against the sparsity patterns, and the result from the
parent dir.
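
The fallback to the parent's result can be sketched in isolation. This is
only an illustration under assumptions: the enum, its values, and the
`toy_resolve_match()` name are hypothetical stand-ins, not the code from
the patch:

```c
/* Hypothetical three-state result; values are chosen for illustration. */
enum toy_match {
	TOY_UNDECIDED = -1,
	TOY_NOT_MATCHED = 0,
	TOY_MATCHED = 1
};

/*
 * If the patterns say nothing about this entry, inherit the decision
 * already made for its parent directory; otherwise keep its own result.
 */
enum toy_match toy_resolve_match(enum toy_match own, enum toy_match parent)
{
	return own == TOY_UNDECIDED ? parent : own;
}
```

An entry that no pattern mentions therefore follows its parent directory,
while an explicit match or non-match on the entry itself always wins.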

> > diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> > new file mode 100755
> > index 0000000000..ce080cf572
> > --- /dev/null
> > +++ b/t/t7817-grep-sparse-checkout.sh
> > @@ -0,0 +1,174 @@
> > +#!/bin/sh
> > +
> > +test_description='grep in sparse checkout
> > +
> > +This test creates a repo with the following structure:
> > +
> > +.
> > +|-- a
> > +|-- b
> > +|-- dir
> > +|   `-- c
> > +|-- sub
> > +|   |-- A
> > +|   |   `-- a
> > +|   `-- B
> > +|       `-- b
> > +`-- sub2
> > +    `-- a
> > +
> > +Where . has non-cone mode sparsity patterns, sub is a submodule with cone mode
>
> Maybe "Where the outer repository has non-cone mode..."?  The use of
> '.' threw me for a bit.

Sure!

> > +test_done
> > --
> > 2.26.2
>
> Looks good.  Do we want to add a testcase where a file is unmerged and
> present in the working copy despite not matching the sparsity patterns
> (i.e. to emulate being in the middle of a merge/rebase/cherry-pick)?

Sure, I can add that. But after a quick test here, it seems that the
unmerged path doesn't have the SKIP_WORKTREE bit set. Is this how it
should be?


* Re: [PATCH v3 5/5] config: add setting to ignore sparsity patterns in some cmds
  2020-05-30 16:18       ` Elijah Newren
@ 2020-06-01  4:45         ` Matheus Tavares Bernardino
  2020-06-03  2:39           ` Elijah Newren
  0 siblings, 1 reply; 120+ messages in thread
From: Matheus Tavares Bernardino @ 2020-06-01  4:45 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git Mailing List, Junio C Hamano, Derrick Stolee, Jonathan Tan

On Sat, May 30, 2020 at 1:18 PM Elijah Newren <newren@gmail.com> wrote:
>
> On Wed, May 27, 2020 at 6:14 PM Matheus Tavares
> <matheus.bernardino@usp.br> wrote:
> > diff --git a/Documentation/config/sparse.txt b/Documentation/config/sparse.txt
> > new file mode 100644
> > index 0000000000..2a25b4b8ef
> > --- /dev/null
> > +++ b/Documentation/config/sparse.txt
> > @@ -0,0 +1,24 @@
> > +sparse.restrictCmds::
> > +       Only meaningful in conjunction with core.sparseCheckout. This option
> > +       extends sparse checkouts (which limit which paths are written to the
> > +       working tree), so that output and operations are also limited to the
> > +       sparsity paths where possible and implemented. The purpose of this
> > +       option is to (1) focus output for the user on the portion of the
> > +       repository that is of interest to them, and (2) enable potentially
> > +       dramatic performance improvements, especially in conjunction with
> > +       partial clones.
> > ++
> > +When this option is true (default), some git commands may limit their behavior
> > +to the paths specified by the sparsity patterns, or to the intersection of
> > +those paths and any (like `*.c`) that the user might also specify on the
> > +command line. When false, the affected commands will work on full trees,
> > +ignoring the sparsity patterns. For now, only git-grep honors this setting. In
> > +this command, the restriction takes effect in three cases: with --cached; when
> > +a commit-ish is given; when searching a working tree where some paths excluded
> > +by the sparsity patterns are present (e.g. manually created paths or not
> > +removed submodules).
>
> I think "In this command, the restriction takes effect..." to the end
> of the paragraph should be removed.  I don't want every subcommand's
> behavior to be specified here; it'll grow unreadably long and be more
> likely to eventually go stale.

Yeah, I was also concerned about that. But wouldn't it be important to
inform the users how the setting takes effect in grep (especially with
the corner cases)? And maybe others, in the future?

What if we move the information that is only relevant to a single
command into its own man page? I.e. git-grep.txt would have something
like:

sparse.restrictCmds::
See complete definition in linkgit:git-config[1]. In grep, the
restriction takes effect in three cases: with --cached; when a
commit-ish is given; when searching a working tree where some paths
excluded by the sparsity patterns are present (e.g. manually created
paths or not removed submodules).

The only problem then is that the information would be a little
scattered... But I think it shouldn't be a big deal, as a person
interested in knowing how foo behaves with sparse.restrictCmds would
only need to look into foo's man page, anyway.

> > diff --git a/git.c b/git.c
> > index a2d337eed7..6db1382ae4 100644
> > --- a/git.c
> > +++ b/git.c
> > @@ -319,6 +324,7 @@ static int handle_options(const char ***argv, int *argc, int *envchanged)
> >                 (*argv)++;
> >                 (*argc)--;
> >         }
> > +
> >         return (*argv) - orig_argv;
> >  }
> >
>
> Why the stray whitespace change?

Oops, that shouldn't be there. Thanks!

>
> > diff --git a/sparse-checkout.c b/sparse-checkout.c
> > new file mode 100644
> > index 0000000000..9a9e50fd29
> > --- /dev/null
> > +++ b/sparse-checkout.c
> > @@ -0,0 +1,16 @@
> > +#include "cache.h"
> > +#include "config.h"
> > +#include "sparse-checkout.h"
> > +
> > +int restrict_to_sparse_paths(struct repository *repo)
> > +{
> > +       int ret;
> > +
> > +       if (opt_restrict_to_sparse_paths >= 0)
> > +               return opt_restrict_to_sparse_paths;
> > +
> > +       if (repo_config_get_bool(repo, "sparse.restrictcmds", &ret))
> > +               ret = 1;
> > +
> > +       return ret;
> > +}
>
> Do we want to consider renaming this file to sparse.c, since it's
> for sparse grep, sparse diff, etc., not just for the checkout
> piece?  It would also go along well with our toplevel related config
> being in the "sparse" namespace.

Makes sense. But since Stolee is already working on
"sparse-checkout.c" [1], if we use "sparse.c" in this series we will
end up with two extra files. And as "sparse.c" is quite small, I think
we could unify into the "sparse-checkout.c".

[1]: https://lore.kernel.org/git/0181a134bfb6986dc0e54ae624c478446a1324a9.1588857462.git.gitgitgadget@gmail.com/
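
As an aside on the restrict_to_sparse_paths() helper quoted earlier in
this message, its precedence rules can be sketched without any git
internals. The function name and parameters below are hypothetical:
`cli_value` stands in for the command-line option (negative when not
given), and `config_found` / `config_value` stand in for the result of
the config lookup:

```c
/*
 * Sketch of the precedence only (hypothetical names, no git APIs):
 *   1. an explicit command-line choice wins;
 *   2. otherwise a configured value is used;
 *   3. otherwise default to restricting (1).
 */
int toy_restrict_to_sparse_paths(int cli_value, int config_found,
				 int config_value)
{
	if (cli_value >= 0)
		return cli_value;
	if (!config_found)
		return 1;
	return config_value;
}
```

So a repository with the setting configured to false can still be
restricted for a single invocation, and an unconfigured repository
restricts by default.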

> > diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> > index ce080cf572..1aef084186 100755
> > --- a/t/t7817-grep-sparse-checkout.sh
> > +++ b/t/t7817-grep-sparse-checkout.sh
>
> All these testcases look great (modulo the small typo I pointed out
> earlier); I kept thinking "but what about case <x>?" and then I kept
> reading and saw you covered it.  You even added some I wasn't thinking
> about and might have overlooked but seem important.

Thanks :)


* Re: [PATCH v3 4/5] grep: honor sparse checkout patterns
  2020-06-01  4:44         ` Matheus Tavares Bernardino
@ 2020-06-03  2:38           ` Elijah Newren
  2020-06-10 17:08             ` Matheus Tavares Bernardino
  0 siblings, 1 reply; 120+ messages in thread
From: Elijah Newren @ 2020-06-03  2:38 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: Git Mailing List, Junio C Hamano, Derrick Stolee, Jonathan Tan

On Sun, May 31, 2020 at 9:44 PM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> On Sat, May 30, 2020 at 12:48 PM Elijah Newren <newren@gmail.com> wrote:
> >
> > On Wed, May 27, 2020 at 6:13 PM Matheus Tavares
> > <matheus.bernardino@usp.br> wrote:
> > >
[...]
> > > +static struct pattern_list *get_sparsity_patterns(struct repository *repo)
> > > +{
> > > +       struct pattern_list *patterns;
> > > +       char *sparse_file;
> > > +       int sparse_config, cone_config;
> > > +
> > > +       if (repo_config_get_bool(repo, "core.sparsecheckout", &sparse_config) ||
> > > +           !sparse_config) {
> > > +               return NULL;
> > > +       }
> >
> > Is core_apply_sparse_checkout not initialized for some reason?
>
> It should be already initialized, yes. But we cannot rely on that as
> `repo` might be a submodule, and core_apply_sparse_checkout holds the
> configuration's value for `the_repository`.

Ah, gotcha.  Thanks for straightening me out.

> > > +static int in_sparse_checkout(struct strbuf *path, int prefix_len,
> >
> > This function name in_sparse_checkout() makes me think "Does the
> > working tree represent a sparse checkout?"  Perhaps we could rename it
> > to path_matches_sparsity_patterns() ?
> >
> > Also, is there a reason we can't use dir.c's
> > path_matches_pattern_list() here?
>
> Oh, we do use path_matches_pattern_list() inside:
>
> > > +       *match = path_matches_pattern_list(path->buf, path->len,
> > > +                                          path->buf + prefix_len, &dtype,
> > > +                                          sparsity, istate);
> > > +       if (*match == UNDECIDED)
> > > +               *match = parent_match;
>
> > How does this new function differ
> > in behavior from that function?
>
> The idea of in_sparse_checkout() is to implement a logic closer to
> what we have in clear_ce_flags_1(). Here, it is effectively a wrapper
> to path_matches_pattern_list() but with some extra logic to decide
> whether grep should search in a given entry, based on its mode, the
> match result against the sparsity patterns, and the result from the
> parent dir.

I've had this response and one to 5/5 sitting in my draft folder for
over a day because I was hoping to go read clear_ce_flags_1() and find
out what it is.  I have no idea, so your answer doesn't answer my
question... ;-)  I'll try to find some time and maybe respond further
after I do.

>
> > > diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> > > new file mode 100755
> > > index 0000000000..ce080cf572
> > > --- /dev/null
> > > +++ b/t/t7817-grep-sparse-checkout.sh
> > > @@ -0,0 +1,174 @@
> > > +#!/bin/sh
> > > +
> > > +test_description='grep in sparse checkout
> > > +
> > > +This test creates a repo with the following structure:
> > > +
> > > +.
> > > +|-- a
> > > +|-- b
> > > +|-- dir
> > > +|   `-- c
> > > +|-- sub
> > > +|   |-- A
> > > +|   |   `-- a
> > > +|   `-- B
> > > +|       `-- b
> > > +`-- sub2
> > > +    `-- a
> > > +
> > > +Where . has non-cone mode sparsity patterns, sub is a submodule with cone mode
> >
> > Maybe "Where the outer repository has non-cone mode..."?  The use of
> > '.' threw me for a bit.
>
> Sure!
>
> > > +test_done
> > > --
> > > 2.26.2
> >
> > Looks good.  Do we want to add a testcase where a file is unmerged and
> > present in the working copy despite not matching the sparsity patterns
> > (i.e. to emulate being in the middle of a merge/rebase/cherry-pick)?
>
> Sure, I can add that. But after a quick test here, it seems that the
> unmerged path doesn't have the SKIP_WORKTREE bit set. Is this how it
> should be?

Right, the merge machinery will clear the SKIP_WORKTREE bit when it
writes out conflicted files.  Also, any future 'git sparse-checkout'
commands will see the unmerged entry and avoid marking it as
SKIP_WORKTREE even though it doesn't match the sparsity patterns.
Thus, grep doesn't have to do any special checking for whether the
files are merged or not, and in your current implementation this
probably doesn't look like a special case at all -- you just check the
SKIP_WORKTREE bit.

However, I think the test still has value because the test enforces
that other areas of the code (merge, sparse-checkout) don't break the
invariants that grep is relying on.  (I could see someone making a
merge change that keeps the SKIP_WORKTREE bit accidentally set even
though it writes the file out to the working tree, for example.)
Sure, merge has some tests around that, so it might be viewed as
slightly duplicative, but I see it as an interesting edge case that
exercises whether the SKIP_WORKTREE bit should really be set and since
grep expects a certain invariant about how that is handled, the
testcase will help make sure our expectations aren't violated.
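
The invariant can be sketched as follows. This is a toy model under
assumptions: `struct toy_entry`, `TOY_SKIP_WORKTREE`, and
`toy_count_searchable()` are hypothetical and do not reflect git's index
structures or flag values:

```c
#include <stddef.h>

#define TOY_SKIP_WORKTREE (1u << 0) /* hypothetical flag bit */

struct toy_entry {
	const char *path;
	unsigned int flags;
};

/*
 * Count the entries a sparsity-aware grep would search: every entry
 * whose skip bit is clear. An unmerged entry is searched simply because
 * the merge machinery is expected to have cleared its bit already.
 */
size_t toy_count_searchable(const struct toy_entry *entries, size_t nr)
{
	size_t i, count = 0;

	for (i = 0; i < nr; i++)
		if (!(entries[i].flags & TOY_SKIP_WORKTREE))
			count++;
	return count;
}
```

If some other command wrote a conflicted file to the working tree but
left the bit set, a filter like this would silently skip the file, which
is the breakage the proposed test would catch.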


* Re: [PATCH v3 5/5] config: add setting to ignore sparsity patterns in some cmds
  2020-06-01  4:45         ` Matheus Tavares Bernardino
@ 2020-06-03  2:39           ` Elijah Newren
  2020-06-10 21:15             ` Matheus Tavares Bernardino
  0 siblings, 1 reply; 120+ messages in thread
From: Elijah Newren @ 2020-06-03  2:39 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: Git Mailing List, Junio C Hamano, Derrick Stolee, Jonathan Tan

On Sun, May 31, 2020 at 9:46 PM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> On Sat, May 30, 2020 at 1:18 PM Elijah Newren <newren@gmail.com> wrote:
> >
> > On Wed, May 27, 2020 at 6:14 PM Matheus Tavares
> > <matheus.bernardino@usp.br> wrote:
> > > diff --git a/Documentation/config/sparse.txt b/Documentation/config/sparse.txt
> > > new file mode 100644
> > > index 0000000000..2a25b4b8ef
> > > --- /dev/null
> > > +++ b/Documentation/config/sparse.txt
> > > @@ -0,0 +1,24 @@
> > > +sparse.restrictCmds::
> > > +       Only meaningful in conjunction with core.sparseCheckout. This option
> > > +       extends sparse checkouts (which limit which paths are written to the
> > > +       working tree), so that output and operations are also limited to the
> > > +       sparsity paths where possible and implemented. The purpose of this
> > > +       option is to (1) focus output for the user on the portion of the
> > > +       repository that is of interest to them, and (2) enable potentially
> > > +       dramatic performance improvements, especially in conjunction with
> > > +       partial clones.
> > > ++
> > > +When this option is true (default), some git commands may limit their behavior
> > > +to the paths specified by the sparsity patterns, or to the intersection of
> > > +those paths and any (like `*.c`) that the user might also specify on the
> > > +command line. When false, the affected commands will work on full trees,
> > > +ignoring the sparsity patterns. For now, only git-grep honors this setting. In
> > > +this command, the restriction takes effect in three cases: with --cached; when
> > > +a commit-ish is given; when searching a working tree where some paths excluded
> > > +by the sparsity patterns are present (e.g. manually created paths or not
> > > +removed submodules).
> >
> > I think "In this command, the restriction takes effect..." to the end
> > of the paragraph should be removed.  I don't want every subcommand's
> > behavior to be specified here; it'll grow unreadably long and be more
> > likely to eventually go stale.
>
> Yeah, I was also concerned about that. But wouldn't it be important to
> inform the users how the setting takes effect in grep (especially with
> the corner cases)? And maybe others, in the future?
>
> What if we move the information that is only relevant to a single
> command into its own man page? I.e. git-grep.txt would have something
> like:

Moving it to grep's manpage seems ideal to me.  grep's behavior should
be defined in grep's manual.

> sparse.restrictCmds::
> See complete definition in linkgit:git-config[1]. In grep, the
> restriction takes effect in three cases: with --cached; when a
> commit-ish is given; when searching a working tree where some paths
> excluded by the sparsity patterns are present (e.g. manually created
> paths or not removed submodules).

That looks more than a little confusing.  Could this definition be
something more like "See base definition in linkgit:git-config[1].
grep honors sparse.restrictCmds by limiting searches to the sparsity
paths in three cases: when searching the working tree, when searching
the index with --cached, or when searching a specified commit"

> The only problem then is that the information would be a little
> scattered... But I think it shouldn't be a big deal, as a person
> interested in knowing how foo behaves with sparse.restrictCmds would
> only need to look into foo's man page, anyway.
>
> > > diff --git a/git.c b/git.c
> > > index a2d337eed7..6db1382ae4 100644
> > > --- a/git.c
> > > +++ b/git.c
> > > @@ -319,6 +324,7 @@ static int handle_options(const char ***argv, int *argc, int *envchanged)
> > >                 (*argv)++;
> > >                 (*argc)--;
> > >         }
> > > +
> > >         return (*argv) - orig_argv;
> > >  }
> > >
> >
> > Why the stray whitespace change?
>
> Oops, that shouldn't be there. Thanks!
>
> >
> > > diff --git a/sparse-checkout.c b/sparse-checkout.c
> > > new file mode 100644
> > > index 0000000000..9a9e50fd29
> > > --- /dev/null
> > > +++ b/sparse-checkout.c
> > > @@ -0,0 +1,16 @@
> > > +#include "cache.h"
> > > +#include "config.h"
> > > +#include "sparse-checkout.h"
> > > +
> > > +int restrict_to_sparse_paths(struct repository *repo)
> > > +{
> > > +       int ret;
> > > +
> > > +       if (opt_restrict_to_sparse_paths >= 0)
> > > +               return opt_restrict_to_sparse_paths;
> > > +
> > > +       if (repo_config_get_bool(repo, "sparse.restrictcmds", &ret))
> > > +               ret = 1;
> > > +
> > > +       return ret;
> > > +}
> >
> > Do we want to consider renaming this file to sparse.c, since it's
> > for sparse grep, sparse diff, etc., not just for the checkout
> > piece?  It would also go along well with our toplevel related config
> > being in the "sparse" namespace.
>
> Makes sense. But since Stolee is already working on
> "sparse-checkout.c" [1], if we use "sparse.c" in this series we will
> end up with two extra files. And as "sparse.c" is quite small, I think
> we could unify into the "sparse-checkout.c".
>
> [1]: https://lore.kernel.org/git/0181a134bfb6986dc0e54ae624c478446a1324a9.1588857462.git.gitgitgadget@gmail.com/

Or we could just suggest he use sparse.c too.  :-)

Stolee?


> > > diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> > > index ce080cf572..1aef084186 100755
> > > --- a/t/t7817-grep-sparse-checkout.sh
> > > +++ b/t/t7817-grep-sparse-checkout.sh
> >
> > All these testcases look great (modulo the small typo I pointed out
> > earlier); I kept thinking "but what about case <x>?" and then I kept
> > reading and saw you covered it.  You even added some I wasn't thinking
> > about and might have overlooked but seem important.
>
> Thanks :)


* Re: [RFC PATCH v2 3/4] grep: honor sparse checkout patterns
  2020-05-22 14:26                   ` Elijah Newren
  2020-05-22 15:36                     ` Elijah Newren
@ 2020-06-10 11:40                     ` Derrick Stolee
  2020-06-10 16:22                       ` Matheus Tavares Bernardino
  2020-06-10 19:58                       ` Elijah Newren
  1 sibling, 2 replies; 120+ messages in thread
From: Derrick Stolee @ 2020-06-10 11:40 UTC (permalink / raw)
  To: Elijah Newren, Matheus Tavares Bernardino
  Cc: git, Junio C Hamano, Jonathan Tan

On 5/22/2020 10:26 AM, Elijah Newren wrote:

Sorry I missed this patch. I was searching all over for patches with
"sparse" or "submodule" in the _subject_. Thanks for calling out the
need for review, Junio!

> Subject: [PATCH] git-sparse-checkout: clarify interactions with submodules
> 
> Ignoring the sparse-checkout feature momentarily, if one has a submodule and
> creates local branches within it with unpushed changes and maybe adds some
> untracked files to it, then we would want to avoid accidentally removing such
> a submodule.  So, for example with git.git, if you run
>    git checkout v2.13.0
> then the sha1collisiondetection/ submodule is NOT removed even though it
> did not exist as a submodule until v2.14.0.  Similarly, if you only had
> v2.13.0 checked out previously and ran
>    git checkout v2.14.0
> the sha1collisiondetection/ submodule would NOT be automatically
> initialized despite being part of v2.14.0.  In both cases, git requires
> submodules to be initialized or deinitialized separately.  Further, we
> also have special handling for submodules in other commands such as
> clean, which requires two --force flags to delete untracked submodules,
> and some commands have a --recurse-submodules flag.
> 
> sparse-checkout is very similar to checkout, as evidenced by the similar
> name -- it adds and removes files from the working copy.  However, for
> the same avoid-data-loss reasons we do not want to remove a submodule
> from the working copy with checkout, we do not want to do it with
> sparse-checkout either.  So submodules need to be separately initialized
> or deinitialized; changing sparse-checkout rules should not
> automatically trigger the removal or vivification of submodules.

This is a good summary of how submodules decide to be present or not.

> I believe the previous wording in git-sparse-checkout.txt about
> submodules was only about this particular issue.  Unfortunately, the
> previous wording could be interpreted to imply that submodules should be
> considered active regardless of sparsity patterns.  Update the wording
> to avoid making such an implication.  It may be helpful to consider two
> example situations where the differences in wording become important:

You are correct, the wording was unclear. Worth fixing.

> In the future, we want users to be able to run commands like
>    git clone --sparse=moduleA --recurse-submodules $REPO_URL
> and have sparsity paths automatically set up and have submodules *within
> the sparsity paths* be automatically initialized.  We do not want all
> submodules in any path to be automatically initialized with that
> command.

INTERESTING. You are correct that it would be nice to have one
feature that describes "what should be present or not". The in-tree
sparse-checkout feature (still in infancy) would benefit from a
redesign with that in mind.

I am interested as well in the idea that combining "--sparse[=X]"
with "--recurse-submodules" might want to imply that the submodules
themselves are initialized with sparse-checkout patterns.

These ramblings are of course off-topic for the current patch.

> Similarly, we want to be able to do things like
>    git -c sparse.restrictCmds grep --recurse-submodules $REV $PATTERN
> and search through $REV for $PATTERN within the recorded sparsity
> patterns.  We want it to recurse into submodules within those sparsity
> patterns, but do not want to recurse into directories that do not match
> the sparsity patterns in search of a possible submodule.

(snipping away the old paragraph and focusing on the new text)

> +If your repository contains one or more submodules, then those submodules
> +will appear based on which you initialized with the `git submodule`
> +command.

This sentence is awkward. Here is a potential replacement:

  If your repository contains one or more submodules, then submodules are
  populated based on interactions with the `git submodule` command.
  Specifically, `git submodule init -- <path>` will ensure the submodule at
  `<path>` is present while `git submodule deinit -- <path>` will remove the
  files for the submodule at `<path>`. Similar to sparse-checkout, the
  deinitialized submodules still exist in the index, but are not present in
  the working directory.

That got a lot longer as I was working on it. Perhaps add a paragraph break
before the next bit.

>  Submodules may have additional untracked files or code stored on

To emphasize the importance of the following "to avoid data loss" statement,
you could mention that when a submodule is removed from the working directory,
then so is all of its Git data such as objects and branches. If that data was
not pushed to another repository, then deinitializing a submodule can result
in loss of important data. (Also: maybe I'm wrong about that?)

> +other branches, so to avoid data loss, changing sparse inclusion/exclusion

Edit: other branches. To avoid data loss, ...

> +rules will not cause an already checked out submodule to be removed from
> +the working copy.  Said another way, just as `checkout` will not cause
> +submodules to be automatically removed or initialized even when switching
> +between branches that remove or add submodules, using `sparse-checkout` to
> +reduce or expand the scope of "interesting" files will not cause submodules
> +to be automatically deinitialized or initialized either.  Adding or
> +removing them must be done as a separate step with `git submodule init` or
> +`git submodule deinit`.

This final sentence may be redundant if you include reference to init/deinit
earlier in the section.

> +This may mean that even if your sparsity patterns include or exclude
> +submodules, until you manually initialize or deinitialize them, commands
> +like grep that work on tracked files in the working copy will ignore "not
> +yet initialized" submodules and pay attention to "left behind" ones.

I don't think that "left behind" is a good phrase here. It feels like
they've been _dropped_ instead of _persisted despite sparse-checkout
changes_.

Perhaps:

  commands like `git grep` that work on tracked files in the working copy
  will pay attention only to initialized submodules, regardless of the
  sparse-checkout definition.

Thanks for pointing out how complicated this scenario is! It certainly
demands a careful update like this one.

-Stolee


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH v2 3/4] grep: honor sparse checkout patterns
  2020-06-10 11:40                     ` Derrick Stolee
@ 2020-06-10 16:22                       ` Matheus Tavares Bernardino
  2020-06-10 17:42                         ` Derrick Stolee
  2020-06-10 20:12                         ` Elijah Newren
  2020-06-10 19:58                       ` Elijah Newren
  1 sibling, 2 replies; 120+ messages in thread
From: Matheus Tavares Bernardino @ 2020-06-10 16:22 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Elijah Newren, git, Junio C Hamano, Jonathan Tan

On Wed, Jun 10, 2020 at 8:41 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 5/22/2020 10:26 AM, Elijah Newren wrote:
> > +This may mean that even if your sparsity patterns include or exclude
> > +submodules, until you manually initialize or deinitialize them, commands
> > +like grep that work on tracked files in the working copy will ignore "not
> > +yet initialized" submodules and pay attention to "left behind" ones.
>
> I don't think that "left behind" is a good phrase here. It feels like
> they've been _dropped_ instead of _persisted despite sparse-checkout
> changes_.
>
> Perhaps:
>
>   commands like `git grep` that work on tracked files in the working copy
>   will pay attention only to initialized submodules, regardless of the
>   sparse-checkout definition.

Hmm, I'm a little confused by the "regardless of the sparse-checkout
definition". The plan we discussed for grep was to not recurse into
submodules if they have the SKIP_WORKTREE bit set [1], wasn't it?

[1]: https://lore.kernel.org/git/CABPp-BE6M9ATDYuQh8f_r3S00dM2Cv9vM3T5j5W_odbVzhC-5A@mail.gmail.com/

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 4/5] grep: honor sparse checkout patterns
  2020-06-03  2:38           ` Elijah Newren
@ 2020-06-10 17:08             ` Matheus Tavares Bernardino
  0 siblings, 0 replies; 120+ messages in thread
From: Matheus Tavares Bernardino @ 2020-06-10 17:08 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git Mailing List, Junio C Hamano, Derrick Stolee, Jonathan Tan

On Tue, Jun 2, 2020 at 11:38 PM Elijah Newren <newren@gmail.com> wrote:
>
> On Sun, May 31, 2020 at 9:44 PM Matheus Tavares Bernardino
> <matheus.bernardino@usp.br> wrote:
> >
> > On Sat, May 30, 2020 at 12:48 PM Elijah Newren <newren@gmail.com> wrote:
> > >
> > > On Wed, May 27, 2020 at 6:13 PM Matheus Tavares
> > > <matheus.bernardino@usp.br> wrote:
> > > >
> > > > +static int in_sparse_checkout(struct strbuf *path, int prefix_len,
> > >
> > > This function name in_sparse_checkout() makes me think "Does the
> > > working tree represent a sparse checkout?"  Perhaps we could rename it
> > > to path_matches_sparsity_patterns() ?
> > >
> > > Also, is there a reason we can't use dir.c's
> > > path_matches_pattern_list() here?
> >
> > Oh, we do use path_matches_pattern_list() inside:
> >
> > > > +       *match = path_matches_pattern_list(path->buf, path->len,
> > > > +                                          path->buf + prefix_len, &dtype,
> > > > +                                          sparsity, istate);
> > > > +       if (*match == UNDECIDED)
> > > > +               *match = parent_match;
> >
> > > How does this new function differ
> > > in behavior from that function?
> >
> > The idea of in_sparse_checkout() is to implement a logic closer to
> > what we have in clear_ce_flags_1(). Here, it is effectively a wrapper
> > to path_matches_pattern_list() but with some extra logic to decide
> > whether grep should search in a given entry, based on its mode, the
> > match result against the sparsity patterns, and the result from the
> > parent dir.
>
> I've had this response and one to 5/5 sitting in my draft folder for
> over a day because I was hoping to go read clear_ce_flags_1() and find
> out what it is.  I have no idea, so your answer doesn't answer my
> question... ;-)  I'll try to find some time and maybe respond further
> after I do.

Oops, sorry for the incomplete answer. clear_ce_flags() recursively
traverses the index entries, unsetting the bits specified in a given
mask when the entry matches a given pattern list. (It is used in
unpack-trees.c:mark_new_skip_worktree() to clear the
CE_NEW_SKIP_WORKTREE bit for the matched entries.) clear_ce_flags()
does use path_matches_pattern_list() but it also has to check some
additional rules for cone mode (as there might be recursive
matches/non-matches). These rules are implemented in
clear_ce_flags_dir().

in_sparse_checkout() is a small wrapper around
path_matches_pattern_list() with (1) the additional checks for cone
mode, similar to what clear_ce_flags_dir() implements, and (2) the
usage of the parent dir's match_result when undecided about the
current path. We could just implement this directly in grep_tree(),
but I thought that isolating this logic into its own static function
would make grep_tree() more readable.
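
To make point (2) concrete, here is a minimal sketch of the fallback to
the parent's result. It is not the code from the patch: the enum is a
local stand-in for the pattern match result, resolve_match() and
should_search() are hypothetical names, and the extra cone mode checks
from point (1) are left out.

```c
/* Local stand-in for the pattern match result; values are illustrative. */
enum match_result {
	UNDECIDED = -1,
	NOT_MATCHED = 0,
	MATCHED = 1,
	MATCHED_RECURSIVE = 2
};

/*
 * Hypothetical helper: resolve the result for one path. `own` is what
 * the pattern list reported for this path, `parent` is the already
 * resolved result of its parent directory.
 */
enum match_result resolve_match(enum match_result own,
				enum match_result parent)
{
	/* No pattern decided this path: inherit the parent's result. */
	if (own == UNDECIDED)
		return parent;
	return own;
}

/* Hypothetical helper: should grep look at a path with this result? */
int should_search(enum match_result resolved)
{
	return resolved == MATCHED || resolved == MATCHED_RECURSIVE;
}
```

With this shape, a directory that matched recursively passes its result
down to any entry that the pattern list leaves undecided.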

> > > > diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> > > > new file mode 100755
> > > > index 0000000000..ce080cf572
> > > > --- /dev/null
> > > > +++ b/t/t7817-grep-sparse-checkout.sh
> > >
> > > Looks good.  Do we want to add a testcase where a file is unmerged and
> > > present in the working copy despite not matching the sparsity patterns
> > > (i.e. to emulate being in the middle of a merge/rebase/cherry-pick)?
> >
> > Sure, I can add that. But after a quick test here, it seems that the
> > unmerged path doesn't have the SKIP_WORKTREE bit set. Is this how it
> > should be?
>
> Right, the merge machinery will clear the SKIP_WORKTREE bit when it
> writes out conflicted files.  Also, any future 'git sparse-checkout'
> commands will see the unmerged entry and avoid marking it as
> SKIP_WORKTREE even though it doesn't match the sparsity patterns.
> Thus, grep doesn't have to do any special checking for whether the
> files are merged or not, and from your current implementation probably
> doesn't look like a special case at all -- you just check the
> SKIP_WORKTREE bit.
>
> However, I think the test still has value because the test enforces
> that other areas of the code (merge, sparse-checkout) don't break the
> invariants that grep is relying on.  (I could see someone making a
> merge change that keeps the SKIP_WORKTREE bit accidentally set even
> though it writes the file out to the working tree, for example.)
> Sure, merge has some tests around that, so it might be viewed as
> slightly duplicative, but I see it as an interesting edge case that
> exercises whether the SKIP_WORKTREE bit should really be set and since
> grep expects a certain invariant about how that is handled, the
> testcase will help make sure our expectations aren't violated.

OK. I will add this test for the next version.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH v2 3/4] grep: honor sparse checkout patterns
  2020-06-10 16:22                       ` Matheus Tavares Bernardino
@ 2020-06-10 17:42                         ` Derrick Stolee
  2020-06-10 18:14                           ` Matheus Tavares Bernardino
  2020-06-10 20:12                         ` Elijah Newren
  1 sibling, 1 reply; 120+ messages in thread
From: Derrick Stolee @ 2020-06-10 17:42 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: Elijah Newren, git, Junio C Hamano, Jonathan Tan

On 6/10/2020 12:22 PM, Matheus Tavares Bernardino wrote:
> On Wed, Jun 10, 2020 at 8:41 AM Derrick Stolee <stolee@gmail.com> wrote:
>>
>> On 5/22/2020 10:26 AM, Elijah Newren wrote:
>>> +This may mean that even if your sparsity patterns include or exclude
>>> +submodules, until you manually initialize or deinitialize them, commands
>>> +like grep that work on tracked files in the working copy will ignore "not
>>> +yet initialized" submodules and pay attention to "left behind" ones.
>>
>> I don't think that "left behind" is a good phrase here. It feels like
>> they've been _dropped_ instead of _persisted despite sparse-checkout
>> changes_.
>>
>> Perhaps:
>>
>>   commands like `git grep` that work on tracked files in the working copy
>>   will pay attention only to initialized submodules, regardless of the
>>   sparse-checkout definition.
> 
> Hmm, I'm a little confused by the "regardless of the sparse-checkout
> definition". The plan we discussed for grep was to not recurse into
> submodules if they have the SKIP_WORKTREE bit set [1], wasn't it?
> 
> [1]: https://lore.kernel.org/git/CABPp-BE6M9ATDYuQh8f_r3S00dM2Cv9vM3T5j5W_odbVzhC-5A@mail.gmail.com/

Thanks for correcting my misunderstanding. By introducing
`git grep` into this documentation, I have also made it
co-dependent on your series. Elijah probably chose "grep"
over "git grep" on purpose.

If we revert that part of my change to use `grep` instead
of `git grep`, then is my statement correct?

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH v2 3/4] grep: honor sparse checkout patterns
  2020-06-10 17:42                         ` Derrick Stolee
@ 2020-06-10 18:14                           ` Matheus Tavares Bernardino
  0 siblings, 0 replies; 120+ messages in thread
From: Matheus Tavares Bernardino @ 2020-06-10 18:14 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Elijah Newren, git, Junio C Hamano, Jonathan Tan

On Wed, Jun 10, 2020 at 2:42 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 6/10/2020 12:22 PM, Matheus Tavares Bernardino wrote:
> > On Wed, Jun 10, 2020 at 8:41 AM Derrick Stolee <stolee@gmail.com> wrote:
> >>
> >> On 5/22/2020 10:26 AM, Elijah Newren wrote:
> >>> +This may mean that even if your sparsity patterns include or exclude
> >>> +submodules, until you manually initialize or deinitialize them, commands
> >>> +like grep that work on tracked files in the working copy will ignore "not
> >>> +yet initialized" submodules and pay attention to "left behind" ones.
> >>
> >> I don't think that "left behind" is a good phrase here. It feels like
> >> they've been _dropped_ instead of _persisted despite sparse-checkout
> >> changes_.
> >>
> >> Perhaps:
> >>
> >>   commands like `git grep` that work on tracked files in the working copy
> >>   will pay attention only to initialized submodules, regardless of the
> >>   sparse-checkout definition.
> >
> > Hmm, I'm a little confused by the "regardless of the sparse-checkout
> > definition". The plan we discussed for grep was to not recurse into
> > submodules if they have the SKIP_WORKTREE bit set [1], wasn't it?
> >
> > [1]: https://lore.kernel.org/git/CABPp-BE6M9ATDYuQh8f_r3S00dM2Cv9vM3T5j5W_odbVzhC-5A@mail.gmail.com/
>
> Thanks for correcting my misunderstanding. By introducing
> `git grep` into this documentation, I have also made it
> co-dependent on your series. Instead, Elijah was probably
> purposeful in his use of "grep" over "git grep".

I think he used "grep" to refer to git-grep, since he mentioned "tracked
files in the working copy". Maybe he wanted to describe the current
state of git-grep, which does recurse into initialized submodules even
when they don't match the sparsity patterns. Was that it, Elijah?

If so, since this behavior is changed in mt/grep-sparse-checkout, I
think I should also change this doc section within my series. Or we
change the doc in this patch and make it dependent on the series.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH v2 3/4] grep: honor sparse checkout patterns
  2020-06-10 11:40                     ` Derrick Stolee
  2020-06-10 16:22                       ` Matheus Tavares Bernardino
@ 2020-06-10 19:58                       ` Elijah Newren
  1 sibling, 0 replies; 120+ messages in thread
From: Elijah Newren @ 2020-06-10 19:58 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Matheus Tavares Bernardino, Git Mailing List, Junio C Hamano,
	Jonathan Tan

On Wed, Jun 10, 2020 at 4:41 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 5/22/2020 10:26 AM, Elijah Newren wrote:
>
> Sorry I missed this patch. I was searching all over for patches with
> "sparse" or "submodule" in the _subject_. Thanks for calling out the
> need for review, Junio!
>
> > Subject: [PATCH] git-sparse-checkout: clarify interactions with submodules
> >
> > Ignoring the sparse-checkout feature momentarily, if one has a submodule and
> > creates local branches within it with unpushed changes and maybe adds some
> > untracked files to it, then we would want to avoid accidentally removing such
> > a submodule.  So, for example with git.git, if you run
> >    git checkout v2.13.0
> > then the sha1collisiondetection/ submodule is NOT removed even though it
> > did not exist as a submodule until v2.14.0.  Similarly, if you only had
> > v2.13.0 checked out previously and ran
> >    git checkout v2.14.0
> > the sha1collisiondetection/ submodule would NOT be automatically
> > initialized despite being part of v2.14.0.  In both cases, git requires
> > submodules to be initialized or deinitialized separately.  Further, we
> > also have special handling for submodules in other commands such as
> > clean, which requires two --force flags to delete untracked submodules,
> > and some commands have a --recurse-submodules flag.
> >
> > sparse-checkout is very similar to checkout, as evidenced by the similar
> > name -- it adds and removes files from the working copy.  However, for
> > the same avoid-data-loss reasons we do not want to remove a submodule
> > from the working copy with checkout, we do not want to do it with
> > sparse-checkout either.  So submodules need to be separately initialized
> > or deinitialized; changing sparse-checkout rules should not
> > automatically trigger the removal or vivification of submodules.
>
> This is a good summary of how submodules decide to be present or not.
>
> > I believe the previous wording in git-sparse-checkout.txt about
> > submodules was only about this particular issue.  Unfortunately, the
> > previous wording could be interpreted to imply that submodules should be
> > considered active regardless of sparsity patterns.  Update the wording
> > to avoid making such an implication.  It may be helpful to consider two
> > example situations where the differences in wording become important:
>
> You are correct, the wording was unclear. Worth fixing.
>
> > In the future, we want users to be able to run commands like
> >    git clone --sparse=moduleA --recurse-submodules $REPO_URL
> > and have sparsity paths automatically set up and have submodules *within
> > the sparsity paths* be automatically initialized.  We do not want all
> > submodules in any path to be automatically initialized with that
> > command.
>
> INTERESTING. You are correct that it would be nice to have one
> feature that describes "what should be present or not". The in-tree
> sparse-checkout feature (still in infancy) would benefit from a
> redesign with that in mind.
>
> I am interested as well in the idea that combining "--sparse[=X]"
> with "--recurse-submodules" might want to imply that the submodules
> themselves are initialized with sparse-checkout patterns.
>
> These ramblings are of course off-topic for the current patch.

Yeah, it might get complicated too; we'd almost certainly want to
limit to cone mode (globs could get super hairy).  It's also the case
we might want some submodules to have sparse-checkouts and others have
full checkouts, depending on whether the --sparse=X specification
listed some path that traversed from the toplevel outer repo down into
a submodule.  (But if --sparse is given with no specification, do all
submodules become sparse or do all remain full?)  Anyway, lots of
complications there and we should start a different thread to discuss
that when we feel it's time to tackle it.

> > Similarly, we want to be able to do things like
> >    git -c sparse.restrictCmds grep --recurse-submodules $REV $PATTERN
> > and search through $REV for $PATTERN within the recorded sparsity
> > patterns.  We want it to recurse into submodules within those sparsity
> > patterns, but do not want to recurse into directories that do not match
> > the sparsity patterns in search of a possible submodule.
>
> (snipping away the old paragraph and focusing on the new text)
>
> > +If your repository contains one or more submodules, then those submodules
> > +will appear based on which you initialized with the `git submodule`
> > +command.
>
> This sentence is awkward. Here is a potential replacement:
>
>   If your repository contains one or more submodules, then submodules are
>   populated based on interactions with the `git submodule` command.
>   Specifically, `git submodule init -- <path>` will ensure the submodule at
>   `<path>` is present while `git submodule deinit -- <path>` will remove the
>   files for the submodule at `<path>`. Similar to sparse-checkout, the
>   deinitialized submodules still exist in the index, but are not present in
>   the working directory.
>
> That got a lot longer as I was working on it. Perhaps add a paragraph break
> before the next bit.

Sounds good, thanks.

> >  Submodules may have additional untracked files or code stored on
>
> To emphasize the importance of the following "to avoid data loss" statement,
> you could mention that when a submodule is removed from the working directory,
> then so is all of its Git data such as objects and branches. If that data was
> not pushed to another repository, then deinitializing a submodule can result
> in loss of important data. (Also: maybe I'm wrong about that?)
>
> > +other branches, so to avoid data loss, changing sparse inclusion/exclusion

I thought that was what I covered with the "code stored on other
branches" but I guess that wasn't clear enough.  So yeah, I can try
extending it a bit.

> Edit: other branches. To avoid data loss, ...

Sounds good.

> > +rules will not cause an already checked out submodule to be removed from
> > +the working copy.  Said another way, just as `checkout` will not cause
> > +submodules to be automatically removed or initialized even when switching
> > +between branches that remove or add submodules, using `sparse-checkout` to
> > +reduce or expand the scope of "interesting" files will not cause submodules
> > +to be automatically deinitialized or initialized either.  Adding or
> > +removing them must be done as a separate step with `git submodule init` or
> > +`git submodule deinit`.
>
> This final sentence may be redundant if you include reference to init/deinit
> earlier in the section.

Yep, I'll strike it.

> > +This may mean that even if your sparsity patterns include or exclude
> > +submodules, until you manually initialize or deinitialize them, commands
> > +like grep that work on tracked files in the working copy will ignore "not
> > +yet initialized" submodules and pay attention to "left behind" ones.
>
> I don't think that "left behind" is a good phrase here. It feels like
> they've been _dropped_ instead of _persisted despite sparse-checkout
> changes_.

I think in addition to the "left behind" wording being bad, my
paragraph left another funny gray area and might be inconsistent with
what Matheus and I wrote elsewhere:

If sparsity patterns would exclude a submodule that is initialized,
sparse-checkout clearly can't remove the submodule.  However, should
it set the SKIP_WORKTREE bit for that submodule if it's not going to
remove it?

I'm not sure of the answer, yet.  I think Matheus had the right idea
for how to make grep handle an initialized submodule in the different
sparse.restrictCmds settings, and if we do go ahead and clear the
SKIP_WORKTREE bit, then I think the wording of this paragraph needs to
change.  So, let's discuss your alternative:

> Perhaps:
>
>   commands like `git grep` that work on tracked files in the working copy
>   will pay attention only to initialized submodules, regardless of the
>   sparse-checkout definition.

I think this is easy to misconstrue in an entirely new way: if there
are initialized submodules (and maybe a sparse checkout), then your
wording implies normal files would be ignored by grep (even files that
aren't removed by the sparse checkout)!  While that sounds like crazy
behavior, this whole thread started because of suggested behaviors
being proposed to carefully follow what was already written in this
document even though the end user result seemed somewhat crazy to me.
So, we might want to avoid a repeat.  :-)

Also, your suggested wording is different than the behavior we came up
with before, and is also inconsistent with how we'd work with normal
files.  For example, what if a user:

* uses sparse-checkout to remove a bunch of files/directories they
don't care about
* creates a new file that happens to have the same name as an
(unfortunately) generically worded filename that exists in the index
(but is marked SKIP_WORKTREE and had previously been removed)

Is this new file related to the tracked file?  Is the new file
considered tracked?  Should the new file be considered part of the
sparse cone (i.e. should it be considered part of the set of tracked
working tree files relevant to the user for commands that operate on
that subset)?  It's a bit of a thorny case.


Here's the behavior Matheus and I came to previously:

git -c sparse.restrictCmds=true grep --recurse-submodules <pattern>:
This goes through all the files in the index (i.e. all tracked files)
which do NOT have the SKIP_WORKTREE bit set.  For each of these: If
the file is a symlink, ignore it (like git-grep currently does).  If
the file is a regular file and is present in the working copy, search
it.  If the file is a submodule and it is initialized, recurse into
it.

git -c sparse.restrictCmds=false grep --recurse-submodules <pattern>:
This goes through all the files in the index (i.e. all tracked files)
regardless of SKIP_WORKTREE bit setting.  For each of these: If the
file is a symlink, ignore it (like git-grep currently does).  If the
file is a regular file and is present in the working copy, search it.
If the file is a submodule and it is initialized, recurse into it.

The only difference between these two sparse.restrictCmds settings is
the handling of the SKIP_WORKTREE bit.  I think that makes them nice
and orthogonal.  They also generalize nicely to the cases of searching
--cached or $REVISION with a few obvious changes (check if data is
available in git object store rather than if file is present in
working tree, and for $REVISION check sparsity patterns rather than
SKIP_WORKTREE bit).
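
The two working tree definitions above can be sketched as a single
decision function. This is only an illustration under simplifying
assumptions: struct entry, its fields, and grep_visits() are
hypothetical names, not git's cache_entry or anything in builtin/grep.c,
and the --cached and $REVISION variants are not modeled.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of an index entry. */
enum entry_kind { ENTRY_REGULAR, ENTRY_SYMLINK, ENTRY_SUBMODULE };

struct entry {
	enum entry_kind kind;
	bool skip_worktree;   /* SKIP_WORKTREE bit set in the index */
	bool in_worktree;     /* regular file exists in the working copy */
	bool initialized;     /* submodule has been initialized */
};

/*
 * Returns true if a working tree grep with --recurse-submodules should
 * search (or recurse into) the entry. `restrict_cmds` mirrors the
 * proposed sparse.restrictCmds setting.
 */
bool grep_visits(const struct entry *e, bool restrict_cmds)
{
	/* The only place where the two settings differ. */
	if (restrict_cmds && e->skip_worktree)
		return false;

	switch (e->kind) {
	case ENTRY_SYMLINK:
		return false;           /* ignored in both settings */
	case ENTRY_REGULAR:
		return e->in_worktree;  /* search only if present */
	case ENTRY_SUBMODULE:
		return e->initialized;  /* recurse only if initialized */
	}
	return false;
}
```

Keeping the SKIP_WORKTREE test as the first and only setting-dependent
check is what makes the two modes orthogonal in this sketch.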

If we start ignoring the SKIP_WORKTREE bit for some types of files
even when sparse.restrictCmds=true, I think we start getting a number
of inconsistencies and user surprises.  So, these formal definitions
seem like a good high-level design.  I think our attempts to summarize
behavior in short sentences for users sometimes ignore some cases due
to the desire to summarize.  However, taking the summary literally can
suggest behaviors that'd be inconsistent if not downright crazy for
some of the ignored cases.  I'll see if I can clean it up somehow.

> Thanks for pointing out how complicated this scenario is! It certainly
> demands a careful update like this one.

Thanks for the thoughtful review!

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC PATCH v2 3/4] grep: honor sparse checkout patterns
  2020-06-10 16:22                       ` Matheus Tavares Bernardino
  2020-06-10 17:42                         ` Derrick Stolee
@ 2020-06-10 20:12                         ` Elijah Newren
  1 sibling, 0 replies; 120+ messages in thread
From: Elijah Newren @ 2020-06-10 20:12 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: Derrick Stolee, git, Junio C Hamano, Jonathan Tan

On Wed, Jun 10, 2020 at 9:23 AM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> On Wed, Jun 10, 2020 at 8:41 AM Derrick Stolee <stolee@gmail.com> wrote:
> >
> > On 5/22/2020 10:26 AM, Elijah Newren wrote:
> > > +This may mean that even if your sparsity patterns include or exclude
> > > +submodules, until you manually initialize or deinitialize them, commands
> > > +like grep that work on tracked files in the working copy will ignore "not
> > > +yet initialized" submodules and pay attention to "left behind" ones.
> >
> > I don't think that "left behind" is a good phrase here. It feels like
> > they've been _dropped_ instead of _persisted despite sparse-checkout
> > changes_.
> >
> > Perhaps:
> >
> >   commands like `git grep` that work on tracked files in the working copy
> >   will pay attention only to initialized submodules, regardless of the
> >   sparse-checkout definition.
>
> Hmm, I'm a little confused by the "regardless of the sparse-checkout
> definition". The plan we discussed for grep was to not recurse into
> submodules if they have the SKIP_WORKTREE bit set [1], wasn't it?
>
> [1]: https://lore.kernel.org/git/CABPp-BE6M9ATDYuQh8f_r3S00dM2Cv9vM3T5j5W_odbVzhC-5A@mail.gmail.com/

I flagged some issues with that sentence...and an additional issue in
my original sentence besides the one Stolee flagged.  It seems to be
easy to mess up a simple summary here.  :-)

But I do want a simple summary of some sort; I want
Documentation/git-sparse-checkout.txt to be an end-user guide and not
an implementation spec.  Perhaps I can bring up a simpler example that
will make it easier to see my distinction between the two -- let's
consider the case of unmerged files.  I think all of the following
statements are true, but some are meant strictly as implementation
details of relevant subcommands, while others are deduced overall
behavior observed by end-users:

* If you just ran merge or rebase and have some files with conflicts,
'git grep searchstring' will search the conflicted files for the
searchstring
* When searching the working tree, git grep should not do any special
checking for whether files are in a conflicted state
* sparse-checkout will never set the SKIP_WORKTREE bit on an unmerged
file (despite sparsity patterns)
* sparse-checkout will delete all (regular and symlink) files from the
working tree when it sets the SKIP_WORKTREE bit for them
* sparse-checkout will not delete files from the working copy if it
doesn't set the SKIP_WORKTREE bit on them
* When merging, if the merge machinery notices a conflict, it must
clear the SKIP_WORKTREE bit and write the (conflicted version of the)
file out to the working tree.  (It is also allowed to clear the
SKIP_WORKTREE bit for files that are not conflicted, though we'd
rather it didn't do that so much.)

These statements above are not incompatible, because some deal with
the implementation of git grep (the second item), others deal with
implementation details of other commands or machinery (all items after
the second), and the first item deals with the combination of
behaviors between sparse-checkout + merge machinery + grep.

So, even though the first bullet point says "git grep...will search
the conflicted files" that does NOT mean git grep should check for
whether files are conflicted.  My proposed update in v2 that I'll send
out (once I come up with one) might use similar broad brushes.
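
As a rough illustration of that division of responsibilities, here is a
toy model of the bullet points above. Everything in it is an
assumption made for the example: struct entry and the three functions
are hypothetical names, the model tracks a single file, and the case
where a path matches the sparsity patterns is simply left untouched.

```c
#include <stdbool.h>

/* Hypothetical, simplified index entry; field names are illustrative. */
struct entry {
	bool unmerged;        /* entry has conflict stages in the index */
	bool skip_worktree;   /* SKIP_WORKTREE bit */
	bool in_worktree;     /* file is present in the working tree */
};

/* sparse-checkout: never marks an unmerged entry; removes what it marks. */
void sparse_checkout_apply(struct entry *e, bool matches_patterns)
{
	if (matches_patterns || e->unmerged)
		return;                 /* leave the entry and its file alone */
	e->skip_worktree = true;
	e->in_worktree = false;         /* file deleted from the working tree */
}

/* merge machinery: a conflict clears the bit and writes the file out. */
void merge_record_conflict(struct entry *e)
{
	e->unmerged = true;
	e->skip_worktree = false;
	e->in_worktree = true;
}

/* grep on the working tree: checks the bit, never looks at `unmerged`. */
bool grep_searches(const struct entry *e)
{
	return !e->skip_worktree && e->in_worktree;
}
```

In this model grep_searches() ends up searching conflicted files only
because the other two functions maintain the invariant, which is the
distinction between end-user behavior and implementation detail.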


Hope that helps,
Elijah

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 5/5] config: add setting to ignore sparsity patterns in some cmds
  2020-06-03  2:39           ` Elijah Newren
@ 2020-06-10 21:15             ` Matheus Tavares Bernardino
  2020-06-11  0:35               ` Elijah Newren
  0 siblings, 1 reply; 120+ messages in thread
From: Matheus Tavares Bernardino @ 2020-06-10 21:15 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git Mailing List, Junio C Hamano, Derrick Stolee, Jonathan Tan

On Tue, Jun 2, 2020 at 11:40 PM Elijah Newren <newren@gmail.com> wrote:
>
> On Sun, May 31, 2020 at 9:46 PM Matheus Tavares Bernardino
> <matheus.bernardino@usp.br> wrote:
> >
>
> Moving it to grep's manpage seems ideal to me.  grep's behavior should
> be defined in grep's manual.
>
> > sparse.restrictCmds::
> > See complete definition in linkgit:git-config[1]. In grep, the
> > restriction takes effect in three cases: with --cached; when a
> > commit-ish is given; when searching a working tree where some paths
> > excluded by the sparsity patterns are present (e.g. manually created
> > paths or not removed submodules).
>
> That looks more than a little confusing.  Could this definition be
> something more like "See base definition in linkgit:git-config[1].
> grep honors sparse.restrictCmds by limiting searches to the sparsity
> paths in three cases: when searching the working tree, when searching
> the index with --cached, or when searching a specified commit"

Yes, this looks better, thanks. I would only add a brief explanation
on what we mean by limiting the search in the working tree case. Since
the working tree should already contain only the sparse paths (in most
cases), I think this sentence may sound a little confusing without
some explanation. Even further, some users might expect that `git -c
sparse.restrictCmds=false grep $pattern` would restore the previous
behavior of falling back to the cache for non-present entries, which
is not true.

In particular, I would like to emphasize that, in the working tree
case, `sparse.restrictCmds=false` is meant for situations like the one
you described in [1]:

* uses sparse-checkout to remove a bunch of files/directories they
don't care about
* creates a new file that happens to have the same name as an
(unfortunately) generically worded filename that exists in the index
(but is marked SKIP_WORKTREE and had previously been removed)

In this situation, grep would ignore the said file by default, but
search it with `sparse.restrictCmds=false`.

So what do you think of the following:

sparse.restrictCmds::
See base definition in linkgit:git-config[1]. grep honors
sparse.restrictCmds by limiting searches to the sparsity paths in
three cases: when searching the working tree, when searching the index
with --cached, and when searching a specified commit. Note: when this
option is set to true (default), the working tree search will ignore
paths that are present despite not matching the sparsity patterns.
This can happen, for example, if you create a new file in a path that
was previously removed by git-sparse-checkout, or if you don't
deinitialize a submodule that is excluded by the sparsity patterns
(so it remains in the working copy anyway).

[1]: https://lore.kernel.org/git/CABPp-BE+BL3Nq=Co=-kNB_wr=6gqX8zcGwa0ega_pGBpk6xYsg@mail.gmail.com/

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v3 5/5] config: add setting to ignore sparsity patterns in some cmds
  2020-06-10 21:15             ` Matheus Tavares Bernardino
@ 2020-06-11  0:35               ` Elijah Newren
  0 siblings, 0 replies; 120+ messages in thread
From: Elijah Newren @ 2020-06-11  0:35 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: Git Mailing List, Junio C Hamano, Derrick Stolee, Jonathan Tan

Hi Matheus,

On Wed, Jun 10, 2020 at 2:15 PM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> On Tue, Jun 2, 2020 at 11:40 PM Elijah Newren <newren@gmail.com> wrote:
> >
> > On Sun, May 31, 2020 at 9:46 PM Matheus Tavares Bernardino
> > <matheus.bernardino@usp.br> wrote:
> > >
> >
> > Moving it to grep's manpage seems ideal to me.  grep's behavior should
> > be defined in grep's manual.
> >
> > > sparse.restrictCmds::
> > > See complete definition in linkgit:git-config[1]. In grep, the
> > > restriction takes effect in three cases: with --cached; when a
> > > commit-ish is given; when searching a working tree where some paths
> > > excluded by the sparsity patterns are present (e.g. manually created
> > > paths or not removed submodules).
> >
> > That looks more than a little confusing.  Could this definition be
> > something more like "See base definition in linkgit:git-config[1].
> > grep honors sparse.restrictCmds by limiting searches to the sparsity
> > paths in three cases: when searching the working tree, when searching
> > the index with --cached, or when searching a specified commit"
>
> Yes, this looks better, thanks. I would only add a brief explanation
> on what we mean by limiting the search in the working tree case.

Possibly, but I think it would be easy to go overboard here.

> Since
> the working tree should already contain only the sparse paths (in most
> cases), I think this sentence may sound a little confusing without
> some explanation.

That's an interesting flag.  I'm curious, though, would they be
confused by it, or would it just seem immediately obvious and almost
not worth mentioning?  In other words, would they think "Well, if you
use sparse-checkout to get just a subset of files checked out, it
totally makes sense that grep would be limited in that case.  Why do
they even need to mention it -- just for completeness, I guess?"

And even if not all users think that way, would a large percentage of
users around them think that way and point out the obviousness of the
docs?

If not, maybe we just add a "(obviously)" comment right after "working tree"?

> Even further, some users might expect that `git -c
> sparse.restrictCmds=false grep $pattern` would restore the previous
> behavior of falling back to the cache for non-present entries, which
> is not true.

10 years from now, I don't want our docs to consist of a long
explanation of all the bugs that existed in various ancient versions
of git and how modern behavior differs from each previous iteration.
There are times when it's worth calling out bugs in prior versions to
bring it to the attention of our users, but I don't see how this is
one of them.  The previous behavior was just outright buggy and
inconsistent, and from my viewpoint, was also a regression.  I think
it should have been reverted regardless of your series, though
skip_worktree stuff was dormant and went unused for a really long
time.

Also, this is a special area of git where focusing too much on
backward compatibility might actually be detrimental.  Backward
compatibility is a really good goal to keep in mind in general, but
the SKIP_WORKTREE usability was traditionally really, really bad -- so
much so that outright replacing it was contemplated by its author[A], and
we placed a HUGE ALL CAPS DISCLAIMER in the documentation of
sparse-checkout about how users should expect the behavior of commands
to change[B].  So, unlike other areas of git, we should focus on
getting sparse-checkout behavior right more than on bug compatibility
with previous code and long migration stories.  Given the context of
such disclaimers and changes, the idea of trying to document those
changes makes me think that in the not too distant future we would
have the equivalent of the following humorous driving directions from
the era before smartphones: "To get to Joe's place, you turn right on
the first road after where Billy's Barn burned down 5 years ago..."
(when the burned Barn was cleared out 4 years ago and there's no
indication of where it once was)

[A] https://lore.kernel.org/git/CABPp-BGE-m_UFfUt_moXG-YR=ZW8hMzMwraD7fkFV-+sEHw36w@mail.gmail.com/
[B] https://git-scm.com/docs/git-sparse-checkout#_description

> In particular, I would like to emphasize that the use for
> `sparse.restrictCmds=false` in the working tree case, is for
> situations like the one you described in [1]:
>
> * uses sparse-checkout to remove a bunch of files/directories they
> don't care about
> * creates a new file that happens to have the same name as an
> (unfortunately) generically worded filename that exists in the index
> (but is marked SKIP_WORKTREE and had previously been removed)
>
> In this situation, grep would ignore the said file by default, but
> search it with `sparse.restrictCmds=false`.

I think this is such a weird and unusual case that I'm not sure it
merits mentioning in the docs.

But if others disagree and think this case is worth mentioning in the
docs, then it shouldn't just be mentioned in "git grep".  All affected
manpages should be updated to discuss how they handle this obscure
corner case.  For example, `git diff` and `git status` just ignore
these files and do not print out any information about them.  So it's
kind of like these files are ignored...but even `git status --ignored`
won't show anything about such files.

Anyway, I think this is a pretty obscure case whose discussion would
dilute the value of the manual in teaching people the basics of
commands.

> So what do you think of the following:
>
> sparse.restrictCmds::
> See base definition in linkgit:git-config[1]. grep honors
> sparse.restrictCmds by limiting searches to the sparsity paths in
> three cases: when searching the working tree, when searching the index
> with --cached, and when searching a specified commit.

Good up to here.  I think I'd like to use just this text as-is (or
maybe with the "(obviously)" addition) and then see if we get feedback
that we need clarifications, because I'm worried our attempts at
clarifying might backfire.  For example...

> Note: when this
> option is set to true (default), the working tree search will ignore
> paths that are present despite not matching the sparsity patterns.

You've run into the same problem Stolee and I did by trying to provide
details about one case, but overlooking others.  ;-)  This "Note:"
statement is not correct; there are a couple of cases it gets wrong:

merge/rebase/cherry-pick can unset the SKIP_WORKTREE bit even for
paths that do not match the sparsity patterns in order to be able to
materialize a file and show conflicts.  In fact, they are allowed to
unset the bit for other files and materialize them too (see
https://lore.kernel.org/git/xmqqbmb1a7ga.fsf@gitster-ct.c.googlers.com/).
Such paths, despite not matching the sparsity patterns, will not have
the SKIP_WORKTREE bit set.  And it is the SKIP_WORKTREE bit, rather
than the sparsity patterns, that git-grep uses for deciding which
files in the working tree to search.
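
As a rough illustration of the distinction above, the bit can be inspected
with `git ls-files -t`, which tags skip-worktree entries with "S" and other
cached entries with "H". The sketch below assumes a git new enough to have
`git sparse-checkout` with cone mode. The repository layout (keep/a, drop/b)
and the identity passed to `git commit` are made up for the example.

```shell
# Build a throwaway repository with one directory we keep and one we drop.
# The paths keep/a and drop/b are hypothetical, chosen only for this sketch.
repo=$(mktemp -d)
git init -q "$repo"
mkdir "$repo/keep" "$repo/drop"
echo text >"$repo/keep/a"
echo text >"$repo/drop/b"
git -C "$repo" add .
git -C "$repo" -c user.name=Example -c user.email=example@example.com \
	commit -q -m initial

# Restrict the checkout to keep/ (cone mode); drop/b is removed from the
# working tree and its index entry gets the SKIP_WORKTREE bit.
git -C "$repo" sparse-checkout init --cone
git -C "$repo" sparse-checkout set keep

# "S" marks entries with SKIP_WORKTREE set; "H" marks ordinary cached entries.
git -C "$repo" ls-files -t
```

Entries that a merge, rebase, or cherry-pick had to materialize would not
carry the "S" tag in this listing even though their paths do not match the
sparsity patterns, which is how the bit and the patterns can disagree.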

Also, if someone runs sparse-checkout init/set, and sparse-checkout
would normally remove some file but notices that the file has local
modifications, then sparse-checkout will avoid removing the file AND
will avoid setting the SKIP_WORKTREE bit on that file.  See commit
681c637b4a ("unpack-trees: failure to set SKIP_WORKTREE bits always
just a warning", 2020-03-27).

> This can happen, for example, if you create a new file in a path that
> was previously removed by git-sparse-checkout.

This is that obscure corner case discussed above.

> Or if you don't
> deinitialize a submodule that is excluded by the sparsity patterns
> (thus remaining in the working copy, anyway).

This case requires more thought.  If a submodule doesn't match the
sparsity patterns, we already said elsewhere that sparse-checkout
should not remove the submodule (since doing so would risk data loss).
But do we set the SKIP_WORKTREE bit for it?  Generally,
sparse-checkout avoids removing files with modifications, and if it
doesn't remove them it also doesn't set the SKIP_WORKTREE bit.  For
consistency, should sparse-checkout not set SKIP_WORKTREE for
initialized submodules?

If we don't set the SKIP_WORKTREE bit for initialized submodules, then
we don't actually have a second different case to mention here.

Granted, that's more an issue for `sparse-checkout` than `grep`.


Hope that all helps.  Let me know if it doesn't, if you disagree with
any parts, or some parts aren't clear.
Elijah

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [PATCH v4 0/6] grep: honor sparse checkout and add option to ignore it
  2020-05-28  1:12   ` [PATCH v3 0/5] grep: honor sparse checkout and add option to ignore it Matheus Tavares
                       ` (4 preceding siblings ...)
  2020-05-28  1:13     ` [PATCH v3 5/5] config: add setting to ignore sparsity patterns in some cmds Matheus Tavares
@ 2020-06-12 15:44     ` Matheus Tavares
  2020-06-12 15:44       ` [PATCH v4 1/6] doc: grep: unify info on configuration variables Matheus Tavares
                         ` (7 more replies)
  5 siblings, 8 replies; 120+ messages in thread
From: Matheus Tavares @ 2020-06-12 15:44 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, newren, jonathantanmy

This series makes git-grep restrict its output to the current sparsity
patterns. A new global option is added to toggle this behavior in grep
and hopefully more commands in the future.

Main changes since v3:

Patch 2:
- Reworded commit message for clarity.

Patch 3 and 4:
- Split into two patches. The first one contains the changes to easily
  accommodate new options in t/helper/test-config; the second adds
  --submodule=path.

Patch 4:
- Removed the section "Such scenario might not be needed now..." from
  the commit message. This was not true as we already have other
  submodule configs, even in git-grep itself, which should be considered
  when recursing into submodules.  And this should happen for both the
  local scope and worktree scope of the submodule configs.

Patch 5:
- Reworded commit message as suggested by Elijah [1].
- Fixed spelling errors in t7817 (as also pointed out in [1]).
- Added test to ensure grep searches unmerged files despite not matching
  the sparsity patterns.
- Renamed builtin/grep.c:in_sparse_checkout() to
  path_in_sparse_checkout() for clarity.

Patch 6:
- Fixed typos and spelling errors.
- Removed unnecessary new line in git.c.
- Included "sparse-checkout.h" in git.c to avoid a Sparse error, as Ramsay
  Jones pointed out. Also moved opt_restrict_to_sparse_paths to
  sparse-checkout.c.
- Moved information about how grep honors sparse.restrictCmds to grep's
  man page.
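
On the worktree-scoped settings mentioned under Patch 4: the sparse checkout
settings are stored in the config.worktree file, which is only consulted when
extensions.worktreeConfig is set. A minimal sketch of that layout, using only
`git config` in a throwaway repository (the repository is made up for the
example, and a single-worktree setup is assumed):

```shell
repo=$(mktemp -d)
git init -q "$repo"

# Enable per-worktree configuration, then write a worktree-scoped value.
git -C "$repo" config extensions.worktreeConfig true
git -C "$repo" config --worktree core.sparseCheckout true

# The value lands in config.worktree rather than in the shared config file.
cat "$repo/.git/config.worktree"
git -C "$repo" config --show-origin --get core.sparseCheckout
```

A command that recurses into a submodule has to go through the same lookup
with the submodule's own git_dir and commondir, which is the gap Patch 4
addresses.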

[1]: https://lore.kernel.org/git/CABPp-BFsCPPNOZ92JQRJeGyNd0e-TCW-LcLyr0i_+VSQJP+GCg@mail.gmail.com/

CI: https://github.com/matheustavares/git/actions/runs/133459296

Matheus Tavares (6):
  doc: grep: unify info on configuration variables
  t/helper/test-config: return exit codes consistently
  t/helper/test-config: facilitate addition of new cli options
  config: correctly read worktree configs in submodules
  grep: honor sparse checkout patterns
  config: add setting to ignore sparsity patterns in some cmds

 Documentation/config.txt               |   2 +
 Documentation/config/grep.txt          |  18 +-
 Documentation/config/sparse.txt        |  20 ++
 Documentation/git-grep.txt             |  36 +--
 Documentation/git.txt                  |   4 +
 Makefile                               |   1 +
 builtin/grep.c                         | 134 ++++++++++-
 config.c                               |  21 +-
 contrib/completion/git-completion.bash |   2 +
 git.c                                  |   5 +
 sparse-checkout.c                      |  18 ++
 sparse-checkout.h                      |  11 +
 t/helper/test-config.c                 | 183 +++++++++-----
 t/t2404-worktree-config.sh             |  16 ++
 t/t7011-skip-worktree-reading.sh       |   9 -
 t/t7817-grep-sparse-checkout.sh        | 321 +++++++++++++++++++++++++
 t/t9902-completion.sh                  |   4 +-
 17 files changed, 687 insertions(+), 118 deletions(-)
 create mode 100644 Documentation/config/sparse.txt
 create mode 100644 sparse-checkout.c
 create mode 100644 sparse-checkout.h
 create mode 100755 t/t7817-grep-sparse-checkout.sh

Range-diff against v3:
1:  86602034c1 = 1:  99cf2124f3 doc: grep: unify info on configuration variables
2:  e5b689aaad ! 2:  85c429ac69 t/helper/test-config: return exit codes consistently
    @@ Commit message
         different codes, to reflect the status of the requested operations.
         These codes are sometimes checked in the tests, but not all of the codes
         are returned consistently by the helper: 1 will usually refer to a
    -    "value not found", but usage errors can also return 1 or 128. The latter
    -    is also expected on errors within the configset functions. These
    +    "value not found", but usage errors can also return 1 or 128. Moreover,
    +    128 is also expected on errors within the configset functions. These
         inconsistent uses of the exit codes can lead to false positives in the
    -    tests. Although all tests that currently check the helper's exit code,
    -    on errors, do also check the output, it's still better to standardize
    -    the exit codes and avoid future problems in new tests. While we are
    -    here, let's also check that we have the expected argc for
    +    tests. Although all tests which expect errors and check the helper's
    +    exit code currently also check the output, it's still better to
    +    standardize the exit codes and avoid future problems in new tests.
    +    While we are here, let's also check that we have the expected argc for
         configset_get_value and configset_get_value_multi, before trying to use
         argv.
     
-:  ---------- > 3:  e9eaaecccc t/helper/test-config: facilitate addition of new cli options
3:  0d2fd01305 ! 4:  6402c96807 config: correctly read worktree configs in submodules
    @@ Commit message
         to make the path to the file. Furthermore, it also checks that
         extensions.worktreeConfig is set through the
         repository_format_worktree_config variable, which refers to
    -    the_repository only. Thus, when a submodule has worktree settings, a
    -    command executed in the superproject that recurses into the submodule
    -    won't find the said settings.
    +    the_repository only. Thus, when a submodule has worktree-specific
    +    settings, a command executed in the superproject that recurses into the
    +    submodule won't find the said settings.
     
    -    Such a scenario might not be needed now, but it will be in the following
    -    patch. git-grep will learn to honor sparse checkouts and, when running
    -    with --recurse-submodules, the submodule's sparse checkout settings must
    -    be loaded. As these settings are stored in the config.worktree file,
    -    they would be ignored without this patch. So let's fix this by reading
    -    the right config.worktree file and extensions.worktreeConfig setting,
    -    based on the git_dir and commondir paths given to
    -    do_git_config_sequence(). Also add a test to avoid any regressions.
    +    This will be especially important in the next patch: git-grep will learn
    +    to honor sparse checkouts and, when running with --recurse-submodules,
    +    the submodule's sparse checkout settings must be loaded. As these
    +    settings are stored in the config.worktree file, they would be ignored
    +    without this patch. So let's fix this by reading the right
    +    config.worktree file and extensions.worktreeConfig setting, based on the
    +    git_dir and commondir paths given to do_git_config_sequence(). Also
    +    add a test to avoid any regressions.
     
         Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
     
    @@ t/helper/test-config.c
       * get_value -> prints the value with highest priority for the entered key
       *
     @@ t/helper/test-config.c: int cmd__config(int argc, const char **argv)
    - 	int i, val;
    - 	const char *v;
      	const struct string_list *strptr;
    --	struct config_set cs;
    -+	struct config_set cs = { .hash_initialized = 0 };
    + 	struct config_set cs = { .hash_initialized = 0 };
      	enum test_config_exit_code ret = TC_SUCCESS;
     +	struct repository *repo = the_repository;
     +	const char *subrepo_path = NULL;
    -+
    -+	argc--; /* skip over "config" */
    -+	argv++;
    -+
    -+	if (argc == 0)
    -+		goto print_usage_error;
    -+
    + 
    + 	argc--; /* skip over "config" */
    + 	argv++;
    +@@ t/helper/test-config.c: int cmd__config(int argc, const char **argv)
    + 	if (argc == 0)
    + 		goto print_usage_error;
    + 
     +	if (skip_prefix(*argv, "--submodule=", &subrepo_path)) {
     +		argc--;
     +		argv++;
     +		if (argc == 0)
     +			goto print_usage_error;
     +	}
    - 
    --	if (argc == 3 && !strcmp(argv[1], "read_early_config")) {
    --		read_early_config(early_config_cb, (void *)argv[2]);
    -+	if (argc == 2 && !strcmp(argv[0], "read_early_config")) {
    ++
    + 	if (argc == 2 && !strcmp(argv[0], "read_early_config")) {
     +		if (subrepo_path) {
     +			fprintf(stderr, "Cannot use --submodule with read_early_config\n");
     +			return TC_USAGE_ERROR;
     +		}
    -+		read_early_config(early_config_cb, (void *)argv[1]);
    + 		read_early_config(early_config_cb, (void *)argv[1]);
      		return TC_SUCCESS;
      	}
    - 
    +@@ t/helper/test-config.c: int cmd__config(int argc, const char **argv)
      	setup_git_directory();
    --
      	git_configset_init(&cs);
      
    --	if (argc < 2)
    --		goto print_usage_error;
     +	if (subrepo_path) {
     +		const struct submodule *sub;
     +		struct repository *subrepo = xcalloc(1, sizeof(*repo));
    @@ t/helper/test-config.c: int cmd__config(int argc, const char **argv)
     +		}
     +		repo = subrepo;
     +	}
    - 
    --	if (argc == 3 && !strcmp(argv[1], "get_value")) {
    --		if (!git_config_get_value(argv[2], &v)) {
    -+	if (argc == 2 && !strcmp(argv[0], "get_value")) {
    ++
    + 	if (argc == 2 && !strcmp(argv[0], "get_value")) {
    +-		if (!git_config_get_value(argv[1], &v)) {
     +		if (!repo_config_get_value(repo, argv[1], &v)) {
      			if (!v)
      				printf("(NULL)\n");
      			else
    - 				printf("%s\n", v);
    - 		} else {
    --			printf("Value not found for \"%s\"\n", argv[2]);
    -+			printf("Value not found for \"%s\"\n", argv[1]);
    +@@ t/helper/test-config.c: int cmd__config(int argc, const char **argv)
      			ret = TC_VALUE_NOT_FOUND;
      		}
    --	} else if (argc == 3 && !strcmp(argv[1], "get_value_multi")) {
    --		strptr = git_config_get_value_multi(argv[2]);
    -+	} else if (argc == 2 && !strcmp(argv[0], "get_value_multi")) {
    + 	} else if (argc == 2 && !strcmp(argv[0], "get_value_multi")) {
    +-		strptr = git_config_get_value_multi(argv[1]);
     +		strptr = repo_config_get_value_multi(repo, argv[1]);
      		if (strptr) {
      			for (i = 0; i < strptr->nr; i++) {
      				v = strptr->items[i].string;
     @@ t/helper/test-config.c: int cmd__config(int argc, const char **argv)
    - 					printf("%s\n", v);
    - 			}
    - 		} else {
    --			printf("Value not found for \"%s\"\n", argv[2]);
    -+			printf("Value not found for \"%s\"\n", argv[1]);
      			ret = TC_VALUE_NOT_FOUND;
      		}
    --	} else if (argc == 3 && !strcmp(argv[1], "get_int")) {
    --		if (!git_config_get_int(argv[2], &val)) {
    -+	} else if (argc == 2 && !strcmp(argv[0], "get_int")) {
    + 	} else if (argc == 2 && !strcmp(argv[0], "get_int")) {
    +-		if (!git_config_get_int(argv[1], &val)) {
     +		if (!repo_config_get_int(repo, argv[1], &val)) {
      			printf("%d\n", val);
      		} else {
    --			printf("Value not found for \"%s\"\n", argv[2]);
    -+			printf("Value not found for \"%s\"\n", argv[1]);
    + 			printf("Value not found for \"%s\"\n", argv[1]);
      			ret = TC_VALUE_NOT_FOUND;
      		}
    --	} else if (argc == 3 && !strcmp(argv[1], "get_bool")) {
    --		if (!git_config_get_bool(argv[2], &val)) {
    -+	} else if (argc == 2 && !strcmp(argv[0], "get_bool")) {
    + 	} else if (argc == 2 && !strcmp(argv[0], "get_bool")) {
    +-		if (!git_config_get_bool(argv[1], &val)) {
     +		if (!repo_config_get_bool(repo, argv[1], &val)) {
      			printf("%d\n", val);
      		} else {
    --			printf("Value not found for \"%s\"\n", argv[2]);
     +
    -+			printf("Value not found for \"%s\"\n", argv[1]);
    + 			printf("Value not found for \"%s\"\n", argv[1]);
      			ret = TC_VALUE_NOT_FOUND;
      		}
    --	} else if (argc == 3 && !strcmp(argv[1], "get_string")) {
    --		if (!git_config_get_string_const(argv[2], &v)) {
    -+	} else if (argc == 2 && !strcmp(argv[0], "get_string")) {
    + 	} else if (argc == 2 && !strcmp(argv[0], "get_string")) {
    +-		if (!git_config_get_string_const(argv[1], &v)) {
     +		if (!repo_config_get_string_const(repo, argv[1], &v)) {
      			printf("%s\n", v);
      		} else {
    --			printf("Value not found for \"%s\"\n", argv[2]);
    -+			printf("Value not found for \"%s\"\n", argv[1]);
    + 			printf("Value not found for \"%s\"\n", argv[1]);
      			ret = TC_VALUE_NOT_FOUND;
      		}
    --	} else if (argc >= 3 && !strcmp(argv[1], "configset_get_value")) {
    --		for (i = 3; i < argc; i++) {
    -+	} else if (argc >= 2 && !strcmp(argv[0], "configset_get_value")) {
    + 	} else if (argc >= 2 && !strcmp(argv[0], "configset_get_value")) {
     +		if (subrepo_path) {
     +			fprintf(stderr, "Cannot use --submodule with configset_get_value\n");
     +			ret = TC_USAGE_ERROR;
     +			goto out;
     +		}
    -+		for (i = 2; i < argc; i++) {
    + 		for (i = 2; i < argc; i++) {
      			int err;
      			if ((err = git_configset_add_file(&cs, argv[i]))) {
    - 				fprintf(stderr, "Error (%d) reading configuration file %s.\n", err, argv[i]);
     @@ t/helper/test-config.c: int cmd__config(int argc, const char **argv)
    - 				goto out;
    - 			}
    - 		}
    --		if (!git_configset_get_value(&cs, argv[2], &v)) {
    -+		if (!git_configset_get_value(&cs, argv[1], &v)) {
    - 			if (!v)
    - 				printf("(NULL)\n");
    - 			else
    - 				printf("%s\n", v);
    - 		} else {
    --			printf("Value not found for \"%s\"\n", argv[2]);
    -+			printf("Value not found for \"%s\"\n", argv[1]);
      			ret = TC_VALUE_NOT_FOUND;
      		}
    --	} else if (argc >= 3 && !strcmp(argv[1], "configset_get_value_multi")) {
    --		for (i = 3; i < argc; i++) {
    -+	} else if (argc >= 2 && !strcmp(argv[0], "configset_get_value_multi")) {
    + 	} else if (argc >= 2 && !strcmp(argv[0], "configset_get_value_multi")) {
     +		if (subrepo_path) {
     +			fprintf(stderr, "Cannot use --submodule with configset_get_value_multi\n");
     +			ret = TC_USAGE_ERROR;
     +			goto out;
     +		}
    -+		for (i = 2; i < argc; i++) {
    + 		for (i = 2; i < argc; i++) {
      			int err;
      			if ((err = git_configset_add_file(&cs, argv[i]))) {
    - 				fprintf(stderr, "Error (%d) reading configuration file %s.\n", err, argv[i]);
     @@ t/helper/test-config.c: int cmd__config(int argc, const char **argv)
    - 				goto out;
    - 			}
    - 		}
    --		strptr = git_configset_get_value_multi(&cs, argv[2]);
    -+		strptr = git_configset_get_value_multi(&cs, argv[1]);
    - 		if (strptr) {
    - 			for (i = 0; i < strptr->nr; i++) {
    - 				v = strptr->items[i].string;
    -@@ t/helper/test-config.c: int cmd__config(int argc, const char **argv)
    - 					printf("%s\n", v);
    - 			}
    - 		} else {
    --			printf("Value not found for \"%s\"\n", argv[2]);
    -+			printf("Value not found for \"%s\"\n", argv[1]);
      			ret = TC_VALUE_NOT_FOUND;
      		}
    --	} else if (!strcmp(argv[1], "iterate")) {
    + 	} else if (!strcmp(argv[0], "iterate")) {
     -		git_config(iterate_cb, NULL);
    -+	} else if (!strcmp(argv[0], "iterate")) {
     +		repo_config(repo, iterate_cb, NULL);
      	} else {
      print_usage_error:
4:  3b819a8d52 ! 5:  4d2916eb99 grep: honor sparse checkout patterns
    @@ Commit message
     
         One of the main uses for a sparse checkout is to allow users to focus on
         the subset of files in a repository in which they are interested. But
    -    git-grep currently ignores the sparsity patterns and report all matches
    +    git-grep currently ignores the sparsity patterns and reports all matches
         found outside this subset, which kind of goes in the opposite direction.
    -    Let's fix that, making it honor the sparsity boundaries for every
    -    grepping case where this is relevant:
    +    There are some use cases for ignoring the sparsity patterns and the next
    +    commit will add an option to obtain this behavior, but here we start by
    +    making grep honor the sparsity boundaries in every case where this is
    +    relevant:
     
         - git grep in worktree
         - git grep --cached
         - git grep $REVISION
     
    -    For the worktree case, we will not grep paths that have the
    -    SKIP_WORKTREE bit set, even if they are present for some reason (e.g.
    -    manually created after `git sparse-checkout init`). But the next patch
    -    will add an option to do so. (See 'Note' below.)
    +    For the worktree and cached cases, we iterate over paths without the
    +    SKIP_WORKTREE bit set, and limit our searches to these paths. For the
    +    $REVISION case, we limit the paths we search to those that match the
    +    sparsity patterns. (We do not check the SKIP_WORKTREE bit for the
    +    $REVISION case, because $REVISION may contain paths that do not exist in
    +    HEAD and thus for which we have no SKIP_WORKTREE bit to consult. The
    +    sparsity patterns tell us how the SKIP_WORKTREE bit would be set if we
    +    were to check out $REVISION, so we consult those. Also, we don't use the
    +    sparsity patterns with the worktree or cached cases, both because we
    +    have a bit we can check directly and more efficiently, and because
    +    unmerged entries from a merge or a rebase could cause more files to
    +    temporarily be present than the sparsity patterns would normally
    +    select.)
     
    -    For `git grep $REVISION`, we will choose to honor the sparsity patterns
    -    only when $REVISION is a commit-ish object. The reason is that, for a
    -    tree, we don't know whether it represents the root of a repository or a
    -    subtree. So we wouldn't be able to correctly match it against the
    -    sparsity patterns. E.g. suppose we have a repository with these two
    -    sparsity rules: "/*" and "!/a"; and the following structure:
    -
    -    /
    -    | - a (file)
    -    | - d (dir)
    -        | - a (file)
    -
    -    If `git grep $REVISION` were to honor the sparsity patterns for every
    -    object type, when grepping the /d tree, we would wrongly ignore the /d/a
    -    file. This happens because we wouldn't know it resides in /d and
    -    therefore it would wrongly match the pattern "!/a". Furthermore, for a
    -    search in a blob object, we wouldn't even have a path to check the
    -    patterns against. So, let's ignore the sparsity patterns when grepping
    -    non-commit-ish objects.
    -
    -    Note: The behavior introduced in this patch is what some users have
    -    reported[1] that they would like by default. But the old behavior is
    -    still desirable for some use cases. Therefore, the next patch will add
    -    an option to allow restoring it when needed.
    -
    -    [1]: https://lore.kernel.org/git/CABPp-BGuFhDwWZBRaD3nA8ui46wor-4=Ha1G1oApsfF8KNpfGQ@mail.gmail.com/
    +    Note that there is a special case here: `git grep $TREE`. In this case,
    +    we cannot know whether $TREE corresponds to the root of the repository
    +    or some sub-tree, and thus there is no way for us to know which sparsity
    +    patterns, if any, apply. So the $TREE case will not use sparsity
    +    patterns or any SKIP_WORKTREE bits and will instead always search all
    +    files within the $TREE.
     
         Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
     
    @@ builtin/grep.c: static int grep_cache(struct grep_opt *opt,
     +	return patterns;
     +}
     +
    -+static int in_sparse_checkout(struct strbuf *path, int prefix_len,
    -+			      unsigned int entry_mode,
    -+			      struct index_state *istate,
    -+			      struct pattern_list *sparsity,
    -+			      enum pattern_match_result parent_match,
    -+			      enum pattern_match_result *match)
    ++static int path_in_sparse_checkout(struct strbuf *path, int prefix_len,
    ++				   unsigned int entry_mode,
    ++				   struct index_state *istate,
    ++				   struct pattern_list *sparsity,
    ++				   enum pattern_match_result parent_match,
    ++				   enum pattern_match_result *match)
     +{
     +	int dtype = DT_UNKNOWN;
     +	int is_dir = S_ISDIR(entry_mode);
    @@ builtin/grep.c: static int grep_tree(struct grep_opt *opt, const struct pathspec
     +			struct strbuf path = STRBUF_INIT;
     +			strbuf_addstr(&path, base->buf + tn_len);
     +
    -+			if (!in_sparse_checkout(&path, old_baselen - tn_len,
    -+						entry.mode, repo->index,
    -+						sparsity, default_sparsity_match,
    -+						&sparsity_match)) {
    ++			if (!path_in_sparse_checkout(&path, old_baselen - tn_len,
    ++						     entry.mode, repo->index,
    ++						     sparsity, default_sparsity_match,
    ++						     &sparsity_match)) {
     +				strbuf_setlen(base, old_baselen);
     +				continue;
     +			}
    @@ t/t7817-grep-sparse-checkout.sh (new)
     +`-- sub2
     +    `-- a
     +
    -+Where . has non-cone mode sparsity patterns, sub is a submodule with cone mode
    -+sparsity patterns and sub2 is a submodule that is excluded by the superproject
    -+sparsity patterns. The resulting sparse checkout should leave the following
    -+structure on the working tree:
    ++Where the outer repository has non-cone mode sparsity patterns, sub is a
    ++submodule with cone mode sparsity patterns and sub2 is a submodule that is
    ++excluded by the superproject sparsity patterns. The resulting sparse checkout
    ++should leave the following structure in the working tree:
     +
     +.
     +|-- a
    @@ t/t7817-grep-sparse-checkout.sh (new)
     +	test_path_is_file sub2/a
     +'
     +
    -+# The test bellow checks a special case: the sparsity patterns exclude '/b'
    -+# and sparse checkout is enable, but the path exists on the working tree (e.g.
    ++# The test below checks a special case: the sparsity patterns exclude '/b'
    ++# and sparse checkout is enabled, but the path exists in the working tree (e.g.
     +# manually created after `git sparse-checkout init`). In this case, grep should
     +# skip it.
     +test_expect_success 'grep in working tree should honor sparse checkout' '
    @@ t/t7817-grep-sparse-checkout.sh (new)
     +	test_cmp expect actual
     +'
     +
    ++test_expect_success 'grep unmerged file despite not matching sparsity patterns' '
    ++	cat >expect <<-EOF &&
    ++	b:modified-b-in-branchX
    ++	b:modified-b-in-branchY
    ++	EOF
    ++	test_when_finished "test_might_fail git merge --abort && \
    ++			    git checkout master" &&
    ++
    ++	git sparse-checkout disable &&
    ++	git checkout -b branchY master &&
    ++	test_commit modified-b-in-branchY b &&
    ++	git checkout -b branchX master &&
    ++	test_commit modified-b-in-branchX b &&
    ++
    ++	git sparse-checkout init &&
    ++	test_path_is_missing b &&
    ++	test_must_fail git merge branchY &&
    ++	git grep "modified-b" >actual &&
    ++	test_cmp expect actual
    ++'
    ++
     +test_expect_success 'grep --cached should honor sparse checkout' '
     +	cat >expect <<-EOF &&
     +	a:text
5:  02990a6fa1 ! 6:  4547718b60 config: add setting to ignore sparsity patterns in some cmds
    @@ Documentation/config.txt: include::config/sequencer.txt[]
      
      include::config/ssh.txt[]
     
    + ## Documentation/config/grep.txt ##
    +@@ Documentation/config/grep.txt: grep.fullName::
    + grep.fallbackToNoIndex::
    + 	If set to true, fall back to git grep --no-index if git grep
    + 	is executed outside of a git repository.  Defaults to false.
    ++
    ++ifdef::git-grep[]
    ++sparse.restrictCmds::
    ++	See base definition in linkgit:git-config[1]. grep honors
    ++	sparse.restrictCmds by limiting searches to the sparsity paths in three
    ++	cases: when searching the working tree, when searching the index with
    ++	--cached, and when searching a specified commit.
    ++endif::git-grep[]
    +
      ## Documentation/config/sparse.txt (new) ##
     @@
     +sparse.restrictCmds::
    @@ Documentation/config/sparse.txt (new)
     +to the paths specified by the sparsity patterns, or to the intersection of
     +those paths and any (like `*.c`) that the user might also specify on the
     +command line. When false, the affected commands will work on full trees,
    -+ignoring the sparsity patterns. For now, only git-grep honors this setting. In
    -+this command, the restriction takes effect in three cases: with --cached; when
    -+a commit-ish is given; when searching a working tree where some paths excluded
    -+by the sparsity patterns are present (e.g. manually created paths or not
    -+removed submodules).
    ++ignoring the sparsity patterns. For now, only git-grep honors this setting.
     ++
     +Note: commands which export, integrity check, or create history will always
     +operate on full trees (e.g. fast-export, format-patch, fsck, commit, etc.),
    -+unaffected by any sparsity patterns. Also, writting commands such as
    ++unaffected by any sparsity patterns. Also, writing commands such as
     +sparse-checkout and read-tree will not be affected by this configuration.
     
    - ## Documentation/git-grep.txt ##
    -@@ Documentation/git-grep.txt: characters.  An empty string as search expression matches all lines.
    - CONFIGURATION
    - -------------
    - 
    -+git-grep honors the sparse.restrictCmds setting. See its definition in
    -+linkgit:git-config[1].
    -+
    - :git-grep: 1
    - include::config/grep.txt[]
    - 
    -
      ## Documentation/git.txt ##
     @@ Documentation/git.txt: If you just want to run git as if it was started in `<path>` then use
      	Do not perform optional operations that require locks. This is
    @@ contrib/completion/git-completion.bash: __git_main ()
      		*)
     
      ## git.c ##
    -@@ git.c: const char git_more_info_string[] =
    - 	   "See 'git help git' for an overview of the system.");
    - 
    - static int use_pager = -1;
    -+int opt_restrict_to_sparse_paths = -1;
    - 
    - static void list_builtins(struct string_list *list, unsigned int exclude_option);
    +@@
    + #include "run-command.h"
    + #include "alias.h"
    + #include "shallow.h"
    ++#include "sparse-checkout.h"
      
    + #define RUN_SETUP		(1<<0)
    + #define RUN_SETUP_GENTLY	(1<<1)
     @@ git.c: static int handle_options(const char ***argv, int *argc, int *envchanged)
      			} else {
      				exit(list_cmds(cmd));
    @@ git.c: static int handle_options(const char ***argv, int *argc, int *envchanged)
      		} else {
      			fprintf(stderr, _("unknown option: %s\n"), cmd);
      			usage(git_usage_string);
    -@@ git.c: static int handle_options(const char ***argv, int *argc, int *envchanged)
    - 		(*argv)++;
    - 		(*argc)--;
    - 	}
    -+
    - 	return (*argv) - orig_argv;
    - }
    - 
     
      ## sparse-checkout.c (new) ##
     @@
    @@ sparse-checkout.c (new)
     +#include "config.h"
     +#include "sparse-checkout.h"
     +
    ++int opt_restrict_to_sparse_paths = -1;
    ++
     +int restrict_to_sparse_paths(struct repository *repo)
     +{
     +	int ret;
    @@ sparse-checkout.h (new)
     +
     +struct repository;
     +
    -+extern int opt_restrict_to_sparse_paths; /* from git.c */
    ++extern int opt_restrict_to_sparse_paths;
     +
     +/* Whether or not cmds should restrict behavior on sparse paths, in this repo */
     +int restrict_to_sparse_paths(struct repository *repo);
    @@ t/t7817-grep-sparse-checkout.sh: test_expect_success 'setup' '
      	test_path_is_file sub2/a
      '
      
    --# The test bellow checks a special case: the sparsity patterns exclude '/b'
    -+# The two tests bellow check a special case: the sparsity patterns exclude '/b'
    - # and sparse checkout is enable, but the path exists on the working tree (e.g.
    +-# The test below checks a special case: the sparsity patterns exclude '/b'
    ++# The two tests below check a special case: the sparsity patterns exclude '/b'
    + # and sparse checkout is enabled, but the path exists in the working tree (e.g.
      # manually created after `git sparse-checkout init`). In this case, grep should
     -# skip it.
     +# skip the file by default, but not with --no-restrict-to-sparse-paths.
    @@ t/t7817-grep-sparse-checkout.sh: test_expect_success 'grep in working tree shoul
     +	test_cmp expect actual
     +'
      
    - test_expect_success 'grep --cached should honor sparse checkout' '
    + test_expect_success 'grep unmerged file despite not matching sparsity patterns' '
      	cat >expect <<-EOF &&
     @@ t/t7817-grep-sparse-checkout.sh: test_expect_success 'grep <tree-ish> should ignore sparsity patterns' '
      '
    @@ t/t7817-grep-sparse-checkout.sh: test_expect_success 'grep --recurse-submodules
     +	'
     +done
     +
    -+test_expect_success 'grep --recurse-submodules --cached \w --no-restrict-to-sparse-paths' '
    ++test_expect_success 'grep --recurse-submodules --cached w/ --no-restrict-to-sparse-paths' '
     +	cat >expect <<-EOF &&
     +	a:text
     +	b:text
    @@ t/t7817-grep-sparse-checkout.sh: test_expect_success 'grep --recurse-submodules
     +	test_cmp expect actual
     +'
     +
    -+test_expect_success 'grep --recurse-submodules <commit-ish> \w --no-restrict-to-sparse-paths' '
    ++test_expect_success 'grep --recurse-submodules <commit-ish> w/ --no-restrict-to-sparse-paths' '
     +	commit=$(git rev-parse HEAD) &&
     +	cat >expect_commit <<-EOF &&
     +	$commit:a:text
-- 
2.26.2



* [PATCH v4 1/6] doc: grep: unify info on configuration variables
  2020-06-12 15:44     ` [PATCH v4 0/6] grep: honor sparse checkout and add option to ignore it Matheus Tavares
@ 2020-06-12 15:44       ` Matheus Tavares
  2020-06-12 15:45       ` [PATCH v4 2/6] t/helper/test-config: return exit codes consistently Matheus Tavares
                         ` (6 subsequent siblings)
  7 siblings, 0 replies; 120+ messages in thread
From: Matheus Tavares @ 2020-06-12 15:44 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, newren, jonathantanmy

Explanations about the configuration variables for git-grep are
duplicated in "Documentation/git-grep.txt" and
"Documentation/config/grep.txt", which can make maintenance difficult.
The first also contains a definition not present in the latter
(grep.fullName). To avoid problems like this, let's unify the
information in the second file and include it in the first.

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---
 Documentation/config/grep.txt | 10 ++++++++--
 Documentation/git-grep.txt    | 36 ++++++-----------------------------
 2 files changed, 14 insertions(+), 32 deletions(-)

diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
index 44abe45a7c..dd51db38e1 100644
--- a/Documentation/config/grep.txt
+++ b/Documentation/config/grep.txt
@@ -16,8 +16,14 @@ grep.extendedRegexp::
 	other than 'default'.
 
 grep.threads::
-	Number of grep worker threads to use.
-	See `grep.threads` in linkgit:git-grep[1] for more information.
+	Number of grep worker threads to use. See `--threads`
+ifndef::git-grep[]
+	in linkgit:git-grep[1]
+endif::git-grep[]
+	for more information.
+
+grep.fullName::
+	If set to true, enable `--full-name` option by default.
 
 grep.fallbackToNoIndex::
 	If set to true, fall back to git grep --no-index if git grep
diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
index a7f9bc99ea..9bdf807584 100644
--- a/Documentation/git-grep.txt
+++ b/Documentation/git-grep.txt
@@ -41,34 +41,8 @@ characters.  An empty string as search expression matches all lines.
 CONFIGURATION
 -------------
 
-grep.lineNumber::
-	If set to true, enable `-n` option by default.
-
-grep.column::
-	If set to true, enable the `--column` option by default.
-
-grep.patternType::
-	Set the default matching behavior. Using a value of 'basic', 'extended',
-	'fixed', or 'perl' will enable the `--basic-regexp`, `--extended-regexp`,
-	`--fixed-strings`, or `--perl-regexp` option accordingly, while the
-	value 'default' will return to the default matching behavior.
-
-grep.extendedRegexp::
-	If set to true, enable `--extended-regexp` option by default. This
-	option is ignored when the `grep.patternType` option is set to a value
-	other than 'default'.
-
-grep.threads::
-	Number of grep worker threads to use. If unset (or set to 0), Git will
-	use as many threads as the number of logical cores available.
-
-grep.fullName::
-	If set to true, enable `--full-name` option by default.
-
-grep.fallbackToNoIndex::
-	If set to true, fall back to git grep --no-index if git grep
-	is executed outside of a git repository.  Defaults to false.
-
+:git-grep: 1
+include::config/grep.txt[]
 
 OPTIONS
 -------
@@ -269,8 +243,10 @@ providing this option will cause it to die.
 	found.
 
 --threads <num>::
-	Number of grep worker threads to use.
-	See `grep.threads` in 'CONFIGURATION' for more information.
+	Number of grep worker threads to use. If not provided (or set to
+	0), Git will use as many worker threads as the number of logical
+	cores available. The default value can also be set with the
+	`grep.threads` configuration.
 
 -f <file>::
 	Read patterns from <file>, one per line.
-- 
2.26.2



* [PATCH v4 2/6] t/helper/test-config: return exit codes consistently
  2020-06-12 15:44     ` [PATCH v4 0/6] grep: honor sparse checkout and add option to ignore it Matheus Tavares
  2020-06-12 15:44       ` [PATCH v4 1/6] doc: grep: unify info on configuration variables Matheus Tavares
@ 2020-06-12 15:45       ` Matheus Tavares
  2020-06-12 15:45       ` [PATCH v4 3/6] t/helper/test-config: facilitate addition of new cli options Matheus Tavares
                         ` (5 subsequent siblings)
  7 siblings, 0 replies; 120+ messages in thread
From: Matheus Tavares @ 2020-06-12 15:45 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, newren, jonathantanmy

The test-config helper may exit with at least four different codes to
reflect the status of the requested operation. Tests sometimes check
these codes, but the helper does not return them consistently: 1
usually means "value not found", but usage errors can also return 1 or
128. Moreover, 128 is also expected on errors within the configset
functions. This inconsistency can lead to false positives in the
tests. Although all tests which expect errors and check the helper's
exit code currently also check the output, it's still better to
standardize the exit codes and avoid future problems in new tests.
While we are here, let's also check that we have the expected argc for
configset_get_value and configset_get_value_multi, before trying to use
argv.

Note: this change is implemented by unifying the exit labels. This
might seem unnecessary for now, but it will benefit the next patch,
which will extend the cleanup section.
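
As a rough illustration only (not part of the patch), the idea of one
named exit code per outcome, returned from a single exit point, can be
sketched as below. All demo_* names and the "foo.bar" key are
hypothetical stand-ins, not the helper's real interface.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical exit codes: one named value per outcome. */
enum demo_exit_code {
	DEMO_SUCCESS = 0,
	DEMO_VALUE_NOT_FOUND = 1,
	DEMO_USAGE_ERROR = 129
};

/* Hypothetical lookup: only the key "foo.bar" has a value. */
const char *demo_lookup(const char *key)
{
	return strcmp(key, "foo.bar") ? NULL : "baz";
}

int demo_cmd(int argc, const char **argv)
{
	enum demo_exit_code ret = DEMO_SUCCESS;
	const char *v;

	/* argc is checked first, so argv[1] is never read when absent. */
	if (argc == 3 && !strcmp(argv[1], "get_value")) {
		v = demo_lookup(argv[2]);
		if (v) {
			printf("%s\n", v);
		} else {
			printf("Value not found for \"%s\"\n", argv[2]);
			ret = DEMO_VALUE_NOT_FOUND;
		}
	} else {
		fprintf(stderr, "usage: demo <cmd> [args]\n");
		ret = DEMO_USAGE_ERROR;
	}

	/* Single exit point: any cleanup would be done here, once. */
	return ret;
}
```

Each arm only sets ret, so a new failure mode needs one enum entry
and one assignment rather than a new exit label.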

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---
 t/helper/test-config.c | 76 ++++++++++++++++++++++--------------------
 1 file changed, 40 insertions(+), 36 deletions(-)

diff --git a/t/helper/test-config.c b/t/helper/test-config.c
index 234c722b48..1c8e965840 100644
--- a/t/helper/test-config.c
+++ b/t/helper/test-config.c
@@ -30,6 +30,14 @@
  * iterate -> iterate over all values using git_config(), and print some
  *            data for each
  *
+ * Exit codes:
+ *     0:   success
+ *     1:   value not found for the given config key
+ *     2:   config file path given as argument is inaccessible or doesn't exist
+ *     129: test-config usage error
+ *
+ * Note: tests may also expect 128 for die() calls in the config machinery.
+ *
  * Examples:
  *
  * To print the value with highest priority for key "foo.bAr Baz.rock":
@@ -64,35 +72,42 @@ static int early_config_cb(const char *var, const char *value, void *vdata)
 	return 0;
 }
 
+enum test_config_exit_code {
+	TC_SUCCESS = 0,
+	TC_VALUE_NOT_FOUND = 1,
+	TC_CONFIG_FILE_ERROR = 2,
+	TC_USAGE_ERROR = 129,
+};
+
 int cmd__config(int argc, const char **argv)
 {
 	int i, val;
 	const char *v;
 	const struct string_list *strptr;
 	struct config_set cs;
+	enum test_config_exit_code ret = TC_SUCCESS;
 
 	if (argc == 3 && !strcmp(argv[1], "read_early_config")) {
 		read_early_config(early_config_cb, (void *)argv[2]);
-		return 0;
+		return TC_SUCCESS;
 	}
 
 	setup_git_directory();
 
 	git_configset_init(&cs);
 
-	if (argc < 2) {
-		fprintf(stderr, "Please, provide a command name on the command-line\n");
-		goto exit1;
-	} else if (argc == 3 && !strcmp(argv[1], "get_value")) {
+	if (argc < 2)
+		goto print_usage_error;
+
+	if (argc == 3 && !strcmp(argv[1], "get_value")) {
 		if (!git_config_get_value(argv[2], &v)) {
 			if (!v)
 				printf("(NULL)\n");
 			else
 				printf("%s\n", v);
-			goto exit0;
 		} else {
 			printf("Value not found for \"%s\"\n", argv[2]);
-			goto exit1;
+			ret = TC_VALUE_NOT_FOUND;
 		}
 	} else if (argc == 3 && !strcmp(argv[1], "get_value_multi")) {
 		strptr = git_config_get_value_multi(argv[2]);
@@ -104,41 +119,38 @@ int cmd__config(int argc, const char **argv)
 				else
 					printf("%s\n", v);
 			}
-			goto exit0;
 		} else {
 			printf("Value not found for \"%s\"\n", argv[2]);
-			goto exit1;
+			ret = TC_VALUE_NOT_FOUND;
 		}
 	} else if (argc == 3 && !strcmp(argv[1], "get_int")) {
 		if (!git_config_get_int(argv[2], &val)) {
 			printf("%d\n", val);
-			goto exit0;
 		} else {
 			printf("Value not found for \"%s\"\n", argv[2]);
-			goto exit1;
+			ret = TC_VALUE_NOT_FOUND;
 		}
 	} else if (argc == 3 && !strcmp(argv[1], "get_bool")) {
 		if (!git_config_get_bool(argv[2], &val)) {
 			printf("%d\n", val);
-			goto exit0;
 		} else {
 			printf("Value not found for \"%s\"\n", argv[2]);
-			goto exit1;
+			ret = TC_VALUE_NOT_FOUND;
 		}
 	} else if (argc == 3 && !strcmp(argv[1], "get_string")) {
 		if (!git_config_get_string_const(argv[2], &v)) {
 			printf("%s\n", v);
-			goto exit0;
 		} else {
 			printf("Value not found for \"%s\"\n", argv[2]);
-			goto exit1;
+			ret = TC_VALUE_NOT_FOUND;
 		}
-	} else if (!strcmp(argv[1], "configset_get_value")) {
+	} else if (argc >= 3 && !strcmp(argv[1], "configset_get_value")) {
 		for (i = 3; i < argc; i++) {
 			int err;
 			if ((err = git_configset_add_file(&cs, argv[i]))) {
 				fprintf(stderr, "Error (%d) reading configuration file %s.\n", err, argv[i]);
-				goto exit2;
+				ret = TC_CONFIG_FILE_ERROR;
+				goto out;
 			}
 		}
 		if (!git_configset_get_value(&cs, argv[2], &v)) {
@@ -146,17 +158,17 @@ int cmd__config(int argc, const char **argv)
 				printf("(NULL)\n");
 			else
 				printf("%s\n", v);
-			goto exit0;
 		} else {
 			printf("Value not found for \"%s\"\n", argv[2]);
-			goto exit1;
+			ret = TC_VALUE_NOT_FOUND;
 		}
-	} else if (!strcmp(argv[1], "configset_get_value_multi")) {
+	} else if (argc >= 3 && !strcmp(argv[1], "configset_get_value_multi")) {
 		for (i = 3; i < argc; i++) {
 			int err;
 			if ((err = git_configset_add_file(&cs, argv[i]))) {
 				fprintf(stderr, "Error (%d) reading configuration file %s.\n", err, argv[i]);
-				goto exit2;
+				ret = TC_CONFIG_FILE_ERROR;
+				goto out;
 			}
 		}
 		strptr = git_configset_get_value_multi(&cs, argv[2]);
@@ -168,27 +180,19 @@ int cmd__config(int argc, const char **argv)
 				else
 					printf("%s\n", v);
 			}
-			goto exit0;
 		} else {
 			printf("Value not found for \"%s\"\n", argv[2]);
-			goto exit1;
+			ret = TC_VALUE_NOT_FOUND;
 		}
 	} else if (!strcmp(argv[1], "iterate")) {
 		git_config(iterate_cb, NULL);
-		goto exit0;
+	} else {
+print_usage_error:
+		fprintf(stderr, "Invalid syntax. Usage: test-tool config <cmd> [args]\n");
+		ret = TC_USAGE_ERROR;
 	}
 
-	die("%s: Please check the syntax and the function name", argv[0]);
-
-exit0:
-	git_configset_clear(&cs);
-	return 0;
-
-exit1:
-	git_configset_clear(&cs);
-	return 1;
-
-exit2:
+out:
 	git_configset_clear(&cs);
-	return 2;
+	return ret;
 }
-- 
2.26.2



* [PATCH v4 3/6] t/helper/test-config: facilitate addition of new cli options
  2020-06-12 15:44     ` [PATCH v4 0/6] grep: honor sparse checkout and add option to ignore it Matheus Tavares
  2020-06-12 15:44       ` [PATCH v4 1/6] doc: grep: unify info on configuration variables Matheus Tavares
  2020-06-12 15:45       ` [PATCH v4 2/6] t/helper/test-config: return exit codes consistently Matheus Tavares
@ 2020-06-12 15:45       ` Matheus Tavares
  2020-06-12 15:45       ` [PATCH v4 4/6] config: correctly read worktree configs in submodules Matheus Tavares
                         ` (4 subsequent siblings)
  7 siblings, 0 replies; 120+ messages in thread
From: Matheus Tavares @ 2020-06-12 15:45 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, newren, jonathantanmy

test-config parses its arguments in an if-else chain, with one arm for
each available subcommand. Every arm expects (and checks) that argv
corresponds to something like "config <subcommand> [<subcommand args>]".
This means that whenever we want to change the syntax to accommodate a
new argument before <subcommand> (as we will do in the next patch), we
also need to increment the indexes accessing argv everywhere in the
if-else chain. This makes patches adding new options much noisier than
they need to be, besides being error-prone. So let's skip the "config"
argument in argv and argc to take the extra complexity out of such
patches (like the following one).
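
For illustration only (not part of the patch), the effect of skipping
the leading word can be sketched as follows. demo_dispatch and its
return values are hypothetical; the point is that the subcommand is
always at argv[0] after the shift, so a later option parsed before
<subcommand> only needs another shift, not new indexes in every arm.

```c
#include <string.h>

/*
 * Hypothetical dispatcher: returns 0 for a recognized subcommand with
 * the expected number of arguments, 129 otherwise.
 */
int demo_dispatch(int argc, const char **argv)
{
	argc--; /* skip over the leading "config" word */
	argv++;

	if (argc == 0)
		return 129;

	/* From here on, argv[0] is the subcommand and argv[1] its argument. */
	if (argc == 2 && !strcmp(argv[0], "get_value"))
		return 0;
	if (argc == 1 && !strcmp(argv[0], "iterate"))
		return 0;
	return 129;
}
```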

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---
 t/helper/test-config.c | 64 ++++++++++++++++++++++--------------------
 1 file changed, 33 insertions(+), 31 deletions(-)

diff --git a/t/helper/test-config.c b/t/helper/test-config.c
index 1c8e965840..61da2574c5 100644
--- a/t/helper/test-config.c
+++ b/t/helper/test-config.c
@@ -84,33 +84,35 @@ int cmd__config(int argc, const char **argv)
 	int i, val;
 	const char *v;
 	const struct string_list *strptr;
-	struct config_set cs;
+	struct config_set cs = { .hash_initialized = 0 };
 	enum test_config_exit_code ret = TC_SUCCESS;
 
-	if (argc == 3 && !strcmp(argv[1], "read_early_config")) {
-		read_early_config(early_config_cb, (void *)argv[2]);
+	argc--; /* skip over "config" */
+	argv++;
+
+	if (argc == 0)
+		goto print_usage_error;
+
+	if (argc == 2 && !strcmp(argv[0], "read_early_config")) {
+		read_early_config(early_config_cb, (void *)argv[1]);
 		return TC_SUCCESS;
 	}
 
 	setup_git_directory();
-
 	git_configset_init(&cs);
 
-	if (argc < 2)
-		goto print_usage_error;
-
-	if (argc == 3 && !strcmp(argv[1], "get_value")) {
-		if (!git_config_get_value(argv[2], &v)) {
+	if (argc == 2 && !strcmp(argv[0], "get_value")) {
+		if (!git_config_get_value(argv[1], &v)) {
 			if (!v)
 				printf("(NULL)\n");
 			else
 				printf("%s\n", v);
 		} else {
-			printf("Value not found for \"%s\"\n", argv[2]);
+			printf("Value not found for \"%s\"\n", argv[1]);
 			ret = TC_VALUE_NOT_FOUND;
 		}
-	} else if (argc == 3 && !strcmp(argv[1], "get_value_multi")) {
-		strptr = git_config_get_value_multi(argv[2]);
+	} else if (argc == 2 && !strcmp(argv[0], "get_value_multi")) {
+		strptr = git_config_get_value_multi(argv[1]);
 		if (strptr) {
 			for (i = 0; i < strptr->nr; i++) {
 				v = strptr->items[i].string;
@@ -120,32 +122,32 @@ int cmd__config(int argc, const char **argv)
 					printf("%s\n", v);
 			}
 		} else {
-			printf("Value not found for \"%s\"\n", argv[2]);
+			printf("Value not found for \"%s\"\n", argv[1]);
 			ret = TC_VALUE_NOT_FOUND;
 		}
-	} else if (argc == 3 && !strcmp(argv[1], "get_int")) {
-		if (!git_config_get_int(argv[2], &val)) {
+	} else if (argc == 2 && !strcmp(argv[0], "get_int")) {
+		if (!git_config_get_int(argv[1], &val)) {
 			printf("%d\n", val);
 		} else {
-			printf("Value not found for \"%s\"\n", argv[2]);
+			printf("Value not found for \"%s\"\n", argv[1]);
 			ret = TC_VALUE_NOT_FOUND;
 		}
-	} else if (argc == 3 && !strcmp(argv[1], "get_bool")) {
-		if (!git_config_get_bool(argv[2], &val)) {
+	} else if (argc == 2 && !strcmp(argv[0], "get_bool")) {
+		if (!git_config_get_bool(argv[1], &val)) {
 			printf("%d\n", val);
 		} else {
-			printf("Value not found for \"%s\"\n", argv[2]);
+			printf("Value not found for \"%s\"\n", argv[1]);
 			ret = TC_VALUE_NOT_FOUND;
 		}
-	} else if (argc == 3 && !strcmp(argv[1], "get_string")) {
-		if (!git_config_get_string_const(argv[2], &v)) {
+	} else if (argc == 2 && !strcmp(argv[0], "get_string")) {
+		if (!git_config_get_string_const(argv[1], &v)) {
 			printf("%s\n", v);
 		} else {
-			printf("Value not found for \"%s\"\n", argv[2]);
+			printf("Value not found for \"%s\"\n", argv[1]);
 			ret = TC_VALUE_NOT_FOUND;
 		}
-	} else if (argc >= 3 && !strcmp(argv[1], "configset_get_value")) {
-		for (i = 3; i < argc; i++) {
+	} else if (argc >= 2 && !strcmp(argv[0], "configset_get_value")) {
+		for (i = 2; i < argc; i++) {
 			int err;
 			if ((err = git_configset_add_file(&cs, argv[i]))) {
 				fprintf(stderr, "Error (%d) reading configuration file %s.\n", err, argv[i]);
@@ -153,17 +155,17 @@ int cmd__config(int argc, const char **argv)
 				goto out;
 			}
 		}
-		if (!git_configset_get_value(&cs, argv[2], &v)) {
+		if (!git_configset_get_value(&cs, argv[1], &v)) {
 			if (!v)
 				printf("(NULL)\n");
 			else
 				printf("%s\n", v);
 		} else {
-			printf("Value not found for \"%s\"\n", argv[2]);
+			printf("Value not found for \"%s\"\n", argv[1]);
 			ret = TC_VALUE_NOT_FOUND;
 		}
-	} else if (argc >= 3 && !strcmp(argv[1], "configset_get_value_multi")) {
-		for (i = 3; i < argc; i++) {
+	} else if (argc >= 2 && !strcmp(argv[0], "configset_get_value_multi")) {
+		for (i = 2; i < argc; i++) {
 			int err;
 			if ((err = git_configset_add_file(&cs, argv[i]))) {
 				fprintf(stderr, "Error (%d) reading configuration file %s.\n", err, argv[i]);
@@ -171,7 +173,7 @@ int cmd__config(int argc, const char **argv)
 				goto out;
 			}
 		}
-		strptr = git_configset_get_value_multi(&cs, argv[2]);
+		strptr = git_configset_get_value_multi(&cs, argv[1]);
 		if (strptr) {
 			for (i = 0; i < strptr->nr; i++) {
 				v = strptr->items[i].string;
@@ -181,10 +183,10 @@ int cmd__config(int argc, const char **argv)
 					printf("%s\n", v);
 			}
 		} else {
-			printf("Value not found for \"%s\"\n", argv[2]);
+			printf("Value not found for \"%s\"\n", argv[1]);
 			ret = TC_VALUE_NOT_FOUND;
 		}
-	} else if (!strcmp(argv[1], "iterate")) {
+	} else if (!strcmp(argv[0], "iterate")) {
 		git_config(iterate_cb, NULL);
 	} else {
 print_usage_error:
-- 
2.26.2



* [PATCH v4 4/6] config: correctly read worktree configs in submodules
  2020-06-12 15:44     ` [PATCH v4 0/6] grep: honor sparse checkout and add option to ignore it Matheus Tavares
                         ` (2 preceding siblings ...)
  2020-06-12 15:45       ` [PATCH v4 3/6] t/helper/test-config: facilitate addition of new cli options Matheus Tavares
@ 2020-06-12 15:45       ` Matheus Tavares
  2020-06-16 19:13         ` Elijah Newren
  2020-09-01  2:41         ` Jonathan Nieder
  2020-06-12 15:45       ` [PATCH v4 5/6] grep: honor sparse checkout patterns Matheus Tavares
                         ` (3 subsequent siblings)
  7 siblings, 2 replies; 120+ messages in thread
From: Matheus Tavares @ 2020-06-12 15:45 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, newren, jonathantanmy

One of the steps in do_git_config_sequence() is to load the
worktree-specific config file. Although the function receives a git_dir
string, it relies on git_pathdup(), which uses the_repository->git_dir,
to make the path to the file. Furthermore, it also checks that
extensions.worktreeConfig is set through the
repository_format_worktree_config variable, which refers to
the_repository only. Thus, when a submodule has worktree-specific
settings, a command executed in the superproject that recurses into the
submodule won't find those settings.

This will be especially important in the next patch: git-grep will learn
to honor sparse checkouts and, when running with --recurse-submodules,
the submodule's sparse checkout settings must be loaded. As these
settings are stored in the config.worktree file, they would be ignored
without this patch. So let's fix this by reading the right
config.worktree file and extensions.worktreeConfig setting, based on the
git_dir and commondir paths given to do_git_config_sequence(). Also
add a test to avoid any regressions.
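
As a simplified sketch (not the code in this patch), the core of the
fix is to build the config.worktree path from the git_dir that was
passed in, and only when that repository's own format enables worktree
configs. The helper name and its parameters below are hypothetical;
the real code uses mkpathdup() and read_repository_format().

```c
#include <stdio.h>

/*
 * Hypothetical helper: writes "<git_dir>/config.worktree" into out when
 * worktree configs are enabled for the repository at git_dir.
 * Returns 0 on success, -1 if the file should be skipped or if out is
 * too small to hold the path.
 */
int demo_worktree_config_path(const char *git_dir, int worktree_config,
			      char *out, size_t outlen)
{
	int n;

	if (!git_dir || !worktree_config)
		return -1;

	n = snprintf(out, outlen, "%s/config.worktree", git_dir);
	if (n < 0 || (size_t)n >= outlen)
		return -1;
	return 0;
}
```

Because the path depends only on the arguments, calling it with a
submodule's git_dir yields the submodule's file, not the
superproject's.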

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---
 config.c                   | 21 +++++++++---
 t/helper/test-config.c     | 67 +++++++++++++++++++++++++++++++++-----
 t/t2404-worktree-config.sh | 16 +++++++++
 3 files changed, 91 insertions(+), 13 deletions(-)

diff --git a/config.c b/config.c
index 8db9c77098..c2d56309dc 100644
--- a/config.c
+++ b/config.c
@@ -1747,11 +1747,22 @@ static int do_git_config_sequence(const struct config_options *opts,
 		ret += git_config_from_file(fn, repo_config, data);
 
 	current_parsing_scope = CONFIG_SCOPE_WORKTREE;
-	if (!opts->ignore_worktree && repository_format_worktree_config) {
-		char *path = git_pathdup("config.worktree");
-		if (!access_or_die(path, R_OK, 0))
-			ret += git_config_from_file(fn, path, data);
-		free(path);
+	if (!opts->ignore_worktree && repo_config && opts->git_dir) {
+		struct repository_format repo_fmt = REPOSITORY_FORMAT_INIT;
+		struct strbuf buf = STRBUF_INIT;
+
+		read_repository_format(&repo_fmt, repo_config);
+
+		if (!verify_repository_format(&repo_fmt, &buf) &&
+		    repo_fmt.worktree_config) {
+			char *path = mkpathdup("%s/config.worktree", opts->git_dir);
+			if (!access_or_die(path, R_OK, 0))
+				ret += git_config_from_file(fn, path, data);
+			free(path);
+		}
+
+		strbuf_release(&buf);
+		clear_repository_format(&repo_fmt);
 	}
 
 	current_parsing_scope = CONFIG_SCOPE_COMMAND;
diff --git a/t/helper/test-config.c b/t/helper/test-config.c
index 61da2574c5..284f83a921 100644
--- a/t/helper/test-config.c
+++ b/t/helper/test-config.c
@@ -2,12 +2,19 @@
 #include "cache.h"
 #include "config.h"
 #include "string-list.h"
+#include "submodule-config.h"
 
 /*
  * This program exposes the C API of the configuration mechanism
  * as a set of simple commands in order to facilitate testing.
  *
- * Reads stdin and prints result of command to stdout:
+ * Usage: test-tool config [--submodule=<path>] <cmd> [<args>]
+ *
+ * If --submodule=<path> is given, <cmd> will operate on the submodule at the
+ * given <path>. This option is not valid for the commands: read_early_config,
+ * configset_get_value and configset_get_value_multi.
+ *
+ * Possible cmds are:
  *
  * get_value -> prints the value with highest priority for the entered key
  *
@@ -86,6 +93,8 @@ int cmd__config(int argc, const char **argv)
 	const struct string_list *strptr;
 	struct config_set cs = { .hash_initialized = 0 };
 	enum test_config_exit_code ret = TC_SUCCESS;
+	struct repository *repo = the_repository;
+	const char *subrepo_path = NULL;
 
 	argc--; /* skip over "config" */
 	argv++;
@@ -93,7 +102,18 @@ int cmd__config(int argc, const char **argv)
 	if (argc == 0)
 		goto print_usage_error;
 
+	if (skip_prefix(*argv, "--submodule=", &subrepo_path)) {
+		argc--;
+		argv++;
+		if (argc == 0)
+			goto print_usage_error;
+	}
+
 	if (argc == 2 && !strcmp(argv[0], "read_early_config")) {
+		if (subrepo_path) {
+			fprintf(stderr, "Cannot use --submodule with read_early_config\n");
+			return TC_USAGE_ERROR;
+		}
 		read_early_config(early_config_cb, (void *)argv[1]);
 		return TC_SUCCESS;
 	}
@@ -101,8 +121,23 @@ int cmd__config(int argc, const char **argv)
 	setup_git_directory();
 	git_configset_init(&cs);
 
+	if (subrepo_path) {
+		const struct submodule *sub;
+		struct repository *subrepo = xcalloc(1, sizeof(*repo));
+
+		sub = submodule_from_path(the_repository, &null_oid, subrepo_path);
+		if (!sub || repo_submodule_init(subrepo, the_repository, sub)) {
+			fprintf(stderr, "Invalid argument to --submodule: '%s'\n",
+				subrepo_path);
+			free(subrepo);
+			ret = TC_USAGE_ERROR;
+			goto out;
+		}
+		repo = subrepo;
+	}
+
 	if (argc == 2 && !strcmp(argv[0], "get_value")) {
-		if (!git_config_get_value(argv[1], &v)) {
+		if (!repo_config_get_value(repo, argv[1], &v)) {
 			if (!v)
 				printf("(NULL)\n");
 			else
@@ -112,7 +147,7 @@ int cmd__config(int argc, const char **argv)
 			ret = TC_VALUE_NOT_FOUND;
 		}
 	} else if (argc == 2 && !strcmp(argv[0], "get_value_multi")) {
-		strptr = git_config_get_value_multi(argv[1]);
+		strptr = repo_config_get_value_multi(repo, argv[1]);
 		if (strptr) {
 			for (i = 0; i < strptr->nr; i++) {
 				v = strptr->items[i].string;
@@ -126,27 +161,33 @@ int cmd__config(int argc, const char **argv)
 			ret = TC_VALUE_NOT_FOUND;
 		}
 	} else if (argc == 2 && !strcmp(argv[0], "get_int")) {
-		if (!git_config_get_int(argv[1], &val)) {
+		if (!repo_config_get_int(repo, argv[1], &val)) {
 			printf("%d\n", val);
 		} else {
 			printf("Value not found for \"%s\"\n", argv[1]);
 			ret = TC_VALUE_NOT_FOUND;
 		}
 	} else if (argc == 2 && !strcmp(argv[0], "get_bool")) {
-		if (!git_config_get_bool(argv[1], &val)) {
+		if (!repo_config_get_bool(repo, argv[1], &val)) {
 			printf("%d\n", val);
 		} else {
+
 			printf("Value not found for \"%s\"\n", argv[1]);
 			ret = TC_VALUE_NOT_FOUND;
 		}
 	} else if (argc == 2 && !strcmp(argv[0], "get_string")) {
-		if (!git_config_get_string_const(argv[1], &v)) {
+		if (!repo_config_get_string_const(repo, argv[1], &v)) {
 			printf("%s\n", v);
 		} else {
 			printf("Value not found for \"%s\"\n", argv[1]);
 			ret = TC_VALUE_NOT_FOUND;
 		}
 	} else if (argc >= 2 && !strcmp(argv[0], "configset_get_value")) {
+		if (subrepo_path) {
+			fprintf(stderr, "Cannot use --submodule with configset_get_value\n");
+			ret = TC_USAGE_ERROR;
+			goto out;
+		}
 		for (i = 2; i < argc; i++) {
 			int err;
 			if ((err = git_configset_add_file(&cs, argv[i]))) {
@@ -165,6 +206,11 @@ int cmd__config(int argc, const char **argv)
 			ret = TC_VALUE_NOT_FOUND;
 		}
 	} else if (argc >= 2 && !strcmp(argv[0], "configset_get_value_multi")) {
+		if (subrepo_path) {
+			fprintf(stderr, "Cannot use --submodule with configset_get_value_multi\n");
+			ret = TC_USAGE_ERROR;
+			goto out;
+		}
 		for (i = 2; i < argc; i++) {
 			int err;
 			if ((err = git_configset_add_file(&cs, argv[i]))) {
@@ -187,14 +233,19 @@ int cmd__config(int argc, const char **argv)
 			ret = TC_VALUE_NOT_FOUND;
 		}
 	} else if (!strcmp(argv[0], "iterate")) {
-		git_config(iterate_cb, NULL);
+		repo_config(repo, iterate_cb, NULL);
 	} else {
 print_usage_error:
-		fprintf(stderr, "Invalid syntax. Usage: test-tool config <cmd> [args]\n");
+		fprintf(stderr, "Invalid syntax. Usage: test-tool config"
+				" [--submodule=<path>] <cmd> [args]\n");
 		ret = TC_USAGE_ERROR;
 	}
 
 out:
 	git_configset_clear(&cs);
+	if (repo != the_repository) {
+		repo_clear(repo);
+		free(repo);
+	}
 	return ret;
 }
diff --git a/t/t2404-worktree-config.sh b/t/t2404-worktree-config.sh
index 286121d8de..b6ab793203 100755
--- a/t/t2404-worktree-config.sh
+++ b/t/t2404-worktree-config.sh
@@ -76,4 +76,20 @@ test_expect_success 'config.worktree no longer read without extension' '
 	test_cmp_config -C wt2 shared this.is
 '
 
+test_expect_success 'correctly read config.worktree from submodules' '
+	test_unconfig extensions.worktreeConfig &&
+	git init sub &&
+	(
+		cd sub &&
+		test_commit A &&
+		git config extensions.worktreeConfig true &&
+		git config --worktree wtconfig.sub test-value
+	) &&
+	git submodule add ./sub &&
+	git commit -m "add sub" &&
+	echo test-value >expect &&
+	test-tool config --submodule=sub get_value wtconfig.sub >actual &&
+	test_cmp expect actual
+'
+
 test_done
-- 
2.26.2



* [PATCH v4 5/6] grep: honor sparse checkout patterns
  2020-06-12 15:44     ` [PATCH v4 0/6] grep: honor sparse checkout and add option to ignore it Matheus Tavares
                         ` (3 preceding siblings ...)
  2020-06-12 15:45       ` [PATCH v4 4/6] config: correctly read worktree configs in submodules Matheus Tavares
@ 2020-06-12 15:45       ` Matheus Tavares
  2020-06-12 15:45       ` [PATCH v4 6/6] config: add setting to ignore sparsity patterns in some cmds Matheus Tavares
                         ` (2 subsequent siblings)
  7 siblings, 0 replies; 120+ messages in thread
From: Matheus Tavares @ 2020-06-12 15:45 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, newren, jonathantanmy

One of the main uses for a sparse checkout is to allow users to focus on
the subset of files in a repository in which they are interested. But
git-grep currently ignores the sparsity patterns and also reports matches
found outside this subset, which defeats that purpose.
There are some use cases for ignoring the sparsity patterns and the next
commit will add an option to obtain this behavior, but here we start by
making grep honor the sparsity boundaries in every case where this is
relevant:

- git grep in worktree
- git grep --cached
- git grep $REVISION

For the worktree and cached cases, we iterate over paths without the
SKIP_WORKTREE bit set, and limit our searches to these paths. For the
$REVISION case, we limit the paths we search to those that match the
sparsity patterns. (We do not check the SKIP_WORKTREE bit for the
$REVISION case, because $REVISION may contain paths that do not exist in
HEAD and thus for which we have no SKIP_WORKTREE bit to consult. The
sparsity patterns tell us how the SKIP_WORKTREE bit would be set if we
were to check out $REVISION, so we consult those. Also, we don't use the
sparsity patterns with the worktree or cached cases, both because we
have a bit we can check directly and more efficiently, and because
unmerged entries from a merge or a rebase could cause more files to
temporarily be present than the sparsity patterns would normally
select.)

Note that there is a special case here: `git grep $TREE`. In this case,
we cannot know whether $TREE corresponds to the root of the repository
or some sub-tree, and thus there is no way for us to know which sparsity
patterns, if any, apply. So the $TREE case will not use sparsity
patterns or any SKIP_WORKTREE bits and will instead always search all
files within the $TREE.
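
As a rough illustration of the rules above (this is not the code in this
patch), the choice of filter per search target can be modeled as below. The
names grep_target and should_search are hypothetical and exist only for this
sketch; the two flag arguments stand in for ce_skip_worktree() and for a
match against the sparsity patterns.

```c
#include <assert.h>

/* Hypothetical model of the search targets discussed above. */
enum grep_target {
	TARGET_WORKTREE,	/* git grep */
	TARGET_CACHED,		/* git grep --cached */
	TARGET_COMMIT,		/* git grep $REVISION */
	TARGET_TREE		/* git grep $TREE */
};

/*
 * Returns 1 if a path should be searched, 0 if it should be skipped.
 * skip_worktree_bit: the index entry's SKIP_WORKTREE bit (index cases only).
 * matches_patterns: whether the path matches the sparsity patterns.
 */
int should_search(enum grep_target target, int skip_worktree_bit,
		  int matches_patterns)
{
	assert(target <= TARGET_TREE);

	switch (target) {
	case TARGET_WORKTREE:
	case TARGET_CACHED:
		/* The bit is authoritative here; patterns are not consulted. */
		return !skip_worktree_bit;
	case TARGET_COMMIT:
		/* No index entry to consult, so fall back to the patterns. */
		return matches_patterns;
	case TARGET_TREE:
		/* Position within the repository is unknown: search everything. */
		return 1;
	}
	return 1;
}
```

Under this model, an unmerged file that the patterns exclude is still
searched in the worktree case, because its SKIP_WORKTREE bit is unset:
should_search(TARGET_WORKTREE, 0, 0) yields 1.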

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---
 builtin/grep.c                   | 125 ++++++++++++++++++--
 t/t7011-skip-worktree-reading.sh |   9 --
 t/t7817-grep-sparse-checkout.sh  | 195 +++++++++++++++++++++++++++++++
 3 files changed, 312 insertions(+), 17 deletions(-)
 create mode 100755 t/t7817-grep-sparse-checkout.sh

diff --git a/builtin/grep.c b/builtin/grep.c
index a5056f395a..bee0681393 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -410,7 +410,7 @@ static int grep_cache(struct grep_opt *opt,
 		      const struct pathspec *pathspec, int cached);
 static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 		     struct tree_desc *tree, struct strbuf *base, int tn_len,
-		     int check_attr);
+		     int is_root_tree);
 
 static int grep_submodule(struct grep_opt *opt,
 			  const struct pathspec *pathspec,
@@ -508,6 +508,10 @@ static int grep_cache(struct grep_opt *opt,
 
 	for (nr = 0; nr < repo->index->cache_nr; nr++) {
 		const struct cache_entry *ce = repo->index->cache[nr];
+
+		if (ce_skip_worktree(ce))
+			continue;
+
 		strbuf_setlen(&name, name_base_len);
 		strbuf_addstr(&name, ce->name);
 
@@ -520,8 +524,7 @@ static int grep_cache(struct grep_opt *opt,
 			 * cache entry are identical, even if worktree file has
 			 * been modified, so use cache version instead
 			 */
-			if (cached || (ce->ce_flags & CE_VALID) ||
-			    ce_skip_worktree(ce)) {
+			if (cached || (ce->ce_flags & CE_VALID)) {
 				if (ce_stage(ce) || ce_intent_to_add(ce))
 					continue;
 				hit |= grep_oid(opt, &ce->oid, name.buf,
@@ -552,9 +555,76 @@ static int grep_cache(struct grep_opt *opt,
 	return hit;
 }
 
-static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
-		     struct tree_desc *tree, struct strbuf *base, int tn_len,
-		     int check_attr)
+static struct pattern_list *get_sparsity_patterns(struct repository *repo)
+{
+	struct pattern_list *patterns;
+	char *sparse_file;
+	int sparse_config, cone_config;
+
+	if (repo_config_get_bool(repo, "core.sparsecheckout", &sparse_config) ||
+	    !sparse_config) {
+		return NULL;
+	}
+
+	sparse_file = repo_git_path(repo, "info/sparse-checkout");
+	patterns = xcalloc(1, sizeof(*patterns));
+
+	if (repo_config_get_bool(repo, "core.sparsecheckoutcone", &cone_config))
+		cone_config = 0;
+	patterns->use_cone_patterns = cone_config;
+
+	if (add_patterns_from_file_to_list(sparse_file, "", 0, patterns, NULL)) {
+		if (file_exists(sparse_file)) {
+			warning(_("failed to load sparse-checkout file: '%s'"),
+				sparse_file);
+		}
+		free(sparse_file);
+		free(patterns);
+		return NULL;
+	}
+
+	free(sparse_file);
+	return patterns;
+}
+
+static int path_in_sparse_checkout(struct strbuf *path, int prefix_len,
+				   unsigned int entry_mode,
+				   struct index_state *istate,
+				   struct pattern_list *sparsity,
+				   enum pattern_match_result parent_match,
+				   enum pattern_match_result *match)
+{
+	int dtype = DT_UNKNOWN;
+	int is_dir = S_ISDIR(entry_mode);
+
+	if (parent_match == MATCHED_RECURSIVE) {
+		*match = parent_match;
+		return 1;
+	}
+
+	if (is_dir && !is_dir_sep(path->buf[path->len - 1]))
+		strbuf_addch(path, '/');
+
+	*match = path_matches_pattern_list(path->buf, path->len,
+					   path->buf + prefix_len, &dtype,
+					   sparsity, istate);
+	if (*match == UNDECIDED)
+		*match = parent_match;
+
+	if (is_dir)
+		strbuf_trim_trailing_dir_sep(path);
+
+	if (*match == NOT_MATCHED &&
+	    (!is_dir || sparsity->use_cone_patterns))
+		return 0;
+
+	return 1;
+}
+
+static int do_grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
+			struct tree_desc *tree, struct strbuf *base, int tn_len,
+			int check_attr, struct pattern_list *sparsity,
+			enum pattern_match_result default_sparsity_match)
 {
 	struct repository *repo = opt->repo;
 	int hit = 0;
@@ -570,6 +640,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 
 	while (tree_entry(tree, &entry)) {
 		int te_len = tree_entry_len(&entry);
+		enum pattern_match_result sparsity_match = 0;
 
 		if (match != all_entries_interesting) {
 			strbuf_addstr(&name, base->buf + tn_len);
@@ -586,6 +657,19 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 
 		strbuf_add(base, entry.path, te_len);
 
+		if (sparsity) {
+			struct strbuf path = STRBUF_INIT;
+			strbuf_addstr(&path, base->buf + tn_len);
+
+			if (!path_in_sparse_checkout(&path, old_baselen - tn_len,
+						     entry.mode, repo->index,
+						     sparsity, default_sparsity_match,
+						     &sparsity_match)) {
+				strbuf_setlen(base, old_baselen);
+				continue;
+			}
+		}
+
 		if (S_ISREG(entry.mode)) {
 			hit |= grep_oid(opt, &entry.oid, base->buf, tn_len,
 					 check_attr ? base->buf + tn_len : NULL);
@@ -602,8 +686,8 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 
 			strbuf_addch(base, '/');
 			init_tree_desc(&sub, data, size);
-			hit |= grep_tree(opt, pathspec, &sub, base, tn_len,
-					 check_attr);
+			hit |= do_grep_tree(opt, pathspec, &sub, base, tn_len,
+					    check_attr, sparsity, sparsity_match);
 			free(data);
 		} else if (recurse_submodules && S_ISGITLINK(entry.mode)) {
 			hit |= grep_submodule(opt, pathspec, &entry.oid,
@@ -621,6 +705,31 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 	return hit;
 }
 
+/*
+ * Note: sparsity patterns and paths' attributes will only be considered if
+ * is_root_tree is true. (Otherwise, we cannot properly perform pattern
+ * matching on paths.)
+ */
+static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
+		     struct tree_desc *tree, struct strbuf *base, int tn_len,
+		     int is_root_tree)
+{
+	struct pattern_list *patterns = NULL;
+	int ret;
+
+	if (is_root_tree)
+		patterns = get_sparsity_patterns(opt->repo);
+
+	ret = do_grep_tree(opt, pathspec, tree, base, tn_len, is_root_tree,
+			   patterns, 0);
+
+	if (patterns) {
+		clear_pattern_list(patterns);
+		free(patterns);
+	}
+	return ret;
+}
+
 static int grep_object(struct grep_opt *opt, const struct pathspec *pathspec,
 		       struct object *obj, const char *name, const char *path)
 {
diff --git a/t/t7011-skip-worktree-reading.sh b/t/t7011-skip-worktree-reading.sh
index 37525cae3a..26852586ac 100755
--- a/t/t7011-skip-worktree-reading.sh
+++ b/t/t7011-skip-worktree-reading.sh
@@ -109,15 +109,6 @@ test_expect_success 'ls-files --modified' '
 	test -z "$(git ls-files -m)"
 '
 
-test_expect_success 'grep with skip-worktree file' '
-	git update-index --no-skip-worktree 1 &&
-	echo test > 1 &&
-	git update-index 1 &&
-	git update-index --skip-worktree 1 &&
-	rm 1 &&
-	test "$(git grep --no-ext-grep test)" = "1:test"
-'
-
 echo ":000000 100644 $ZERO_OID $EMPTY_BLOB A	1" > expected
 test_expect_success 'diff-index does not examine skip-worktree absent entries' '
 	setup_absent &&
diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
new file mode 100755
index 0000000000..b3109e3479
--- /dev/null
+++ b/t/t7817-grep-sparse-checkout.sh
@@ -0,0 +1,195 @@
+#!/bin/sh
+
+test_description='grep in sparse checkout
+
+This test creates a repo with the following structure:
+
+.
+|-- a
+|-- b
+|-- dir
+|   `-- c
+|-- sub
+|   |-- A
+|   |   `-- a
+|   `-- B
+|       `-- b
+`-- sub2
+    `-- a
+
+Where the outer repository has non-cone mode sparsity patterns, sub is a
+submodule with cone mode sparsity patterns and sub2 is a submodule that is
+excluded by the superproject sparsity patterns. The resulting sparse checkout
+should leave the following structure in the working tree:
+
+.
+|-- a
+|-- sub
+|   `-- B
+|       `-- b
+`-- sub2
+    `-- a
+
+But note that sub2 should have the SKIP_WORKTREE bit set.
+'
+
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+	echo "text" >a &&
+	echo "text" >b &&
+	mkdir dir &&
+	echo "text" >dir/c &&
+
+	git init sub &&
+	(
+		cd sub &&
+		mkdir A B &&
+		echo "text" >A/a &&
+		echo "text" >B/b &&
+		git add A B &&
+		git commit -m sub &&
+		git sparse-checkout init --cone &&
+		git sparse-checkout set B
+	) &&
+
+	git init sub2 &&
+	(
+		cd sub2 &&
+		echo "text" >a &&
+		git add a &&
+		git commit -m sub2
+	) &&
+
+	git submodule add ./sub &&
+	git submodule add ./sub2 &&
+	git add a b dir &&
+	git commit -m super &&
+	git sparse-checkout init --no-cone &&
+	git sparse-checkout set "/*" "!b" "!/*/" "sub" &&
+
+	git tag -am tag-to-commit tag-to-commit HEAD &&
+	tree=$(git rev-parse HEAD^{tree}) &&
+	git tag -am tag-to-tree tag-to-tree $tree &&
+
+	test_path_is_missing b &&
+	test_path_is_missing dir &&
+	test_path_is_missing sub/A &&
+	test_path_is_file a &&
+	test_path_is_file sub/B/b &&
+	test_path_is_file sub2/a
+'
+
+# The test below checks a special case: the sparsity patterns exclude '/b'
+# and sparse checkout is enabled, but the path exists in the working tree (e.g.
+# manually created after `git sparse-checkout init`). In this case, grep should
+# skip it.
+test_expect_success 'grep in working tree should honor sparse checkout' '
+	cat >expect <<-EOF &&
+	a:text
+	EOF
+	echo "new-text" >b &&
+	test_when_finished "rm b" &&
+	git grep "text" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep unmerged file despite not matching sparsity patterns' '
+	cat >expect <<-EOF &&
+	b:modified-b-in-branchX
+	b:modified-b-in-branchY
+	EOF
+	test_when_finished "test_might_fail git merge --abort && \
+			    git checkout master" &&
+
+	git sparse-checkout disable &&
+	git checkout -b branchY master &&
+	test_commit modified-b-in-branchY b &&
+	git checkout -b branchX master &&
+	test_commit modified-b-in-branchX b &&
+
+	git sparse-checkout init &&
+	test_path_is_missing b &&
+	test_must_fail git merge branchY &&
+	git grep "modified-b" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep --cached should honor sparse checkout' '
+	cat >expect <<-EOF &&
+	a:text
+	EOF
+	git grep --cached "text" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep <commit-ish> should honor sparse checkout' '
+	commit=$(git rev-parse HEAD) &&
+	cat >expect_commit <<-EOF &&
+	$commit:a:text
+	EOF
+	cat >expect_tag-to-commit <<-EOF &&
+	tag-to-commit:a:text
+	EOF
+	git grep "text" $commit >actual_commit &&
+	test_cmp expect_commit actual_commit &&
+	git grep "text" tag-to-commit >actual_tag-to-commit &&
+	test_cmp expect_tag-to-commit actual_tag-to-commit
+'
+
+test_expect_success 'grep <tree-ish> should ignore sparsity patterns' '
+	commit=$(git rev-parse HEAD) &&
+	tree=$(git rev-parse HEAD^{tree}) &&
+	cat >expect_tree <<-EOF &&
+	$tree:a:text
+	$tree:b:text
+	$tree:dir/c:text
+	EOF
+	cat >expect_tag-to-tree <<-EOF &&
+	tag-to-tree:a:text
+	tag-to-tree:b:text
+	tag-to-tree:dir/c:text
+	EOF
+	git grep "text" $tree >actual_tree &&
+	test_cmp expect_tree actual_tree &&
+	git grep "text" tag-to-tree >actual_tag-to-tree &&
+	test_cmp expect_tag-to-tree actual_tag-to-tree
+'
+
+# Note that sub2/ is present in the worktree but it is excluded by the sparsity
+# patterns, so grep should not recurse into it.
+test_expect_success 'grep --recurse-submodules should honor sparse checkout in submodule' '
+	cat >expect <<-EOF &&
+	a:text
+	sub/B/b:text
+	EOF
+	git grep --recurse-submodules "text" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep --recurse-submodules --cached should honor sparse checkout in submodule' '
+	cat >expect <<-EOF &&
+	a:text
+	sub/B/b:text
+	EOF
+	git grep --recurse-submodules --cached "text" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep --recurse-submodules <commit-ish> should honor sparse checkout in submodule' '
+	commit=$(git rev-parse HEAD) &&
+	cat >expect_commit <<-EOF &&
+	$commit:a:text
+	$commit:sub/B/b:text
+	EOF
+	cat >expect_tag-to-commit <<-EOF &&
+	tag-to-commit:a:text
+	tag-to-commit:sub/B/b:text
+	EOF
+	git grep --recurse-submodules "text" $commit >actual_commit &&
+	test_cmp expect_commit actual_commit &&
+	git grep --recurse-submodules "text" tag-to-commit >actual_tag-to-commit &&
+	test_cmp expect_tag-to-commit actual_tag-to-commit
+'
+
+test_done
-- 
2.26.2



* [PATCH v4 6/6] config: add setting to ignore sparsity patterns in some cmds
  2020-06-12 15:44     ` [PATCH v4 0/6] grep: honor sparse checkout and add option to ignore it Matheus Tavares
                         ` (4 preceding siblings ...)
  2020-06-12 15:45       ` [PATCH v4 5/6] grep: honor sparse checkout patterns Matheus Tavares
@ 2020-06-12 15:45       ` Matheus Tavares
  2020-06-16 22:31       ` [PATCH v4 0/6] grep: honor sparse checkout and add option to ignore it Elijah Newren
  2020-09-02  6:17       ` [PATCH v5 0/8] " Matheus Tavares
  7 siblings, 0 replies; 120+ messages in thread
From: Matheus Tavares @ 2020-06-12 15:45 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, newren, jonathantanmy

When sparse checkout is enabled, some users expect the output of certain
commands (such as grep, diff, and log) to also be restricted to the
sparsity patterns. This would allow them to effectively work only on the
subset of files in which they are interested, and allow some commands to
possibly perform better by not considering uninteresting paths. For
this reason, we taught grep to honor the sparsity patterns in the
previous patch. On the other hand, allowing grep and the other
commands mentioned to optionally ignore the patterns also makes for some
interesting use cases, e.g. using grep to search for a function's
documentation that resides outside the sparse checkout.

In any case, there is currently no way for users to configure the behavior
they want for these commands. Aiming to provide this flexibility, let's
introduce the sparse.restrictCmds setting (and the analogous
--[no-]restrict-to-sparse-paths global option). The default value is
true. For now, grep is the only command affected by this setting, but the
goal is to support more commands in the future.
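
The precedence between the global option and the setting can be sketched as
follows. This is a simplified stand-alone model, not the function added by
this patch: effective_restrict and its parameters are hypothetical names, and
the config lookup is replaced by two plain integers.

```c
#include <assert.h>

/*
 * Hypothetical model of the precedence rules:
 *   cli_opt:       -1 if neither flag was given, 0 for
 *                  --no-restrict-to-sparse-paths, 1 for
 *                  --restrict-to-sparse-paths
 *   config_is_set: non-zero if sparse.restrictCmds is configured
 *   config_value:  the configured boolean (ignored when not set)
 */
int effective_restrict(int cli_opt, int config_is_set, int config_value)
{
	assert(cli_opt >= -1 && cli_opt <= 1);

	/* A command-line flag always overrides the configuration. */
	if (cli_opt >= 0)
		return cli_opt;

	/* Without a configured value, fall back to the default (true). */
	if (!config_is_set)
		return 1;

	return config_value ? 1 : 0;
}
```

With these rules, `git -c sparse.restrictCmds=true
--no-restrict-to-sparse-paths grep` corresponds to
effective_restrict(0, 1, 1), which yields 0: the sparsity patterns are
ignored for that invocation.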

Helped-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---
 Documentation/config.txt               |   2 +
 Documentation/config/grep.txt          |   8 ++
 Documentation/config/sparse.txt        |  20 ++++
 Documentation/git.txt                  |   4 +
 Makefile                               |   1 +
 builtin/grep.c                         |  13 ++-
 contrib/completion/git-completion.bash |   2 +
 git.c                                  |   5 +
 sparse-checkout.c                      |  18 ++++
 sparse-checkout.h                      |  11 +++
 t/t7817-grep-sparse-checkout.sh        | 132 ++++++++++++++++++++++++-
 t/t9902-completion.sh                  |   4 +-
 12 files changed, 214 insertions(+), 6 deletions(-)
 create mode 100644 Documentation/config/sparse.txt
 create mode 100644 sparse-checkout.c
 create mode 100644 sparse-checkout.h

diff --git a/Documentation/config.txt b/Documentation/config.txt
index ef0768b91a..fd74b80302 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -436,6 +436,8 @@ include::config/sequencer.txt[]
 
 include::config/showbranch.txt[]
 
+include::config/sparse.txt[]
+
 include::config/splitindex.txt[]
 
 include::config/ssh.txt[]
diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
index dd51db38e1..a3275ab4b7 100644
--- a/Documentation/config/grep.txt
+++ b/Documentation/config/grep.txt
@@ -28,3 +28,11 @@ grep.fullName::
 grep.fallbackToNoIndex::
 	If set to true, fall back to git grep --no-index if git grep
 	is executed outside of a git repository.  Defaults to false.
+
+ifdef::git-grep[]
+sparse.restrictCmds::
+	See base definition in linkgit:git-config[1]. grep honors
+	sparse.restrictCmds by limiting searches to the sparsity paths in three
+	cases: when searching the working tree, when searching the index with
+	--cached, and when searching a specified commit.
+endif::git-grep[]
diff --git a/Documentation/config/sparse.txt b/Documentation/config/sparse.txt
new file mode 100644
index 0000000000..494761526e
--- /dev/null
+++ b/Documentation/config/sparse.txt
@@ -0,0 +1,20 @@
+sparse.restrictCmds::
+	Only meaningful in conjunction with core.sparseCheckout. This option
+	extends sparse checkouts (which limit which paths are written to the
+	working tree), so that output and operations are also limited to the
+	sparsity paths where possible and implemented. The purpose of this
+	option is to (1) focus output for the user on the portion of the
+	repository that is of interest to them, and (2) enable potentially
+	dramatic performance improvements, especially in conjunction with
+	partial clones.
++
+When this option is true (default), some git commands may limit their behavior
+to the paths specified by the sparsity patterns, or to the intersection of
+those paths and any pathspecs (like `*.c`) that the user also specifies on the
+command line. When false, the affected commands will work on full trees,
+ignoring the sparsity patterns. For now, only git-grep honors this setting.
++
+Note: commands which export, integrity check, or create history will always
+operate on full trees (e.g. fast-export, format-patch, fsck, commit, etc.),
+unaffected by any sparsity patterns. Also, writing commands such as
+sparse-checkout and read-tree will not be affected by this configuration.
diff --git a/Documentation/git.txt b/Documentation/git.txt
index 40bd32f590..89604b6648 100644
--- a/Documentation/git.txt
+++ b/Documentation/git.txt
@@ -180,6 +180,10 @@ If you just want to run git as if it was started in `<path>` then use
 	Do not perform optional operations that require locks. This is
 	equivalent to setting the `GIT_OPTIONAL_LOCKS` to `0`.
 
+--[no-]restrict-to-sparse-paths::
+	Overrides the sparse.restrictCmds configuration (see
+	linkgit:git-config[1]) for this execution.
+
 --list-cmds=group[,group...]::
 	List commands by group. This is an internal/experimental
 	option and may change or be removed in the future. Supported
diff --git a/Makefile b/Makefile
index 372139f1f2..9c8a6f19cd 100644
--- a/Makefile
+++ b/Makefile
@@ -983,6 +983,7 @@ LIB_OBJS += sha1-name.o
 LIB_OBJS += shallow.o
 LIB_OBJS += sideband.o
 LIB_OBJS += sigchain.o
+LIB_OBJS += sparse-checkout.o
 LIB_OBJS += split-index.o
 LIB_OBJS += stable-qsort.o
 LIB_OBJS += strbuf.o
diff --git a/builtin/grep.c b/builtin/grep.c
index bee0681393..7f485ea732 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -25,6 +25,7 @@
 #include "submodule-config.h"
 #include "object-store.h"
 #include "packfile.h"
+#include "sparse-checkout.h"
 
 static char const * const grep_usage[] = {
 	N_("git grep [<options>] [-e] <pattern> [<rev>...] [[--] <path>...]"),
@@ -498,6 +499,7 @@ static int grep_cache(struct grep_opt *opt,
 	int nr;
 	struct strbuf name = STRBUF_INIT;
 	int name_base_len = 0;
+	int sparse_paths_only =	restrict_to_sparse_paths(repo);
 	if (repo->submodule_prefix) {
 		name_base_len = strlen(repo->submodule_prefix);
 		strbuf_addstr(&name, repo->submodule_prefix);
@@ -509,7 +511,7 @@ static int grep_cache(struct grep_opt *opt,
 	for (nr = 0; nr < repo->index->cache_nr; nr++) {
 		const struct cache_entry *ce = repo->index->cache[nr];
 
-		if (ce_skip_worktree(ce))
+		if (sparse_paths_only && ce_skip_worktree(ce))
 			continue;
 
 		strbuf_setlen(&name, name_base_len);
@@ -715,9 +717,10 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 		     int is_root_tree)
 {
 	struct pattern_list *patterns = NULL;
+	int sparse_paths_only = restrict_to_sparse_paths(opt->repo);
 	int ret;
 
-	if (is_root_tree)
+	if (is_root_tree && sparse_paths_only)
 		patterns = get_sparsity_patterns(opt->repo);
 
 	ret = do_grep_tree(opt, pathspec, tree, base, tn_len, is_root_tree,
@@ -1257,6 +1260,12 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
 
 	if (!use_index || untracked) {
 		int use_exclude = (opt_exclude < 0) ? use_index : !!opt_exclude;
+
+		if (opt_restrict_to_sparse_paths >= 0) {
+			die(_("--[no-]restrict-to-sparse-paths is incompatible"
+				  " with --no-index and --untracked"));
+		}
+
 		hit = grep_directory(&opt, &pathspec, use_exclude, use_index);
 	} else if (0 <= opt_exclude) {
 		die(_("--[no-]exclude-standard cannot be used for tracked contents"));
diff --git a/contrib/completion/git-completion.bash b/contrib/completion/git-completion.bash
index 4b59004847..3f15fd5275 100644
--- a/contrib/completion/git-completion.bash
+++ b/contrib/completion/git-completion.bash
@@ -3208,6 +3208,8 @@ __git_main ()
 			--namespace=
 			--no-replace-objects
 			--help
+			--restrict-to-sparse-paths
+			--no-restrict-to-sparse-paths
 			"
 			;;
 		*)
diff --git a/git.c b/git.c
index a2d337eed7..99bac1d26d 100644
--- a/git.c
+++ b/git.c
@@ -5,6 +5,7 @@
 #include "run-command.h"
 #include "alias.h"
 #include "shallow.h"
+#include "sparse-checkout.h"
 
 #define RUN_SETUP		(1<<0)
 #define RUN_SETUP_GENTLY	(1<<1)
@@ -311,6 +312,10 @@ static int handle_options(const char ***argv, int *argc, int *envchanged)
 			} else {
 				exit(list_cmds(cmd));
 			}
+		} else if (!strcmp(cmd, "--restrict-to-sparse-paths")) {
+			opt_restrict_to_sparse_paths = 1;
+		} else if (!strcmp(cmd, "--no-restrict-to-sparse-paths")) {
+			opt_restrict_to_sparse_paths = 0;
 		} else {
 			fprintf(stderr, _("unknown option: %s\n"), cmd);
 			usage(git_usage_string);
diff --git a/sparse-checkout.c b/sparse-checkout.c
new file mode 100644
index 0000000000..96c5ed5446
--- /dev/null
+++ b/sparse-checkout.c
@@ -0,0 +1,18 @@
+#include "cache.h"
+#include "config.h"
+#include "sparse-checkout.h"
+
+int opt_restrict_to_sparse_paths = -1;
+
+int restrict_to_sparse_paths(struct repository *repo)
+{
+	int ret;
+
+	if (opt_restrict_to_sparse_paths >= 0)
+		return opt_restrict_to_sparse_paths;
+
+	if (repo_config_get_bool(repo, "sparse.restrictcmds", &ret))
+		ret = 1;
+
+	return ret;
+}
diff --git a/sparse-checkout.h b/sparse-checkout.h
new file mode 100644
index 0000000000..a4805e443a
--- /dev/null
+++ b/sparse-checkout.h
@@ -0,0 +1,11 @@
+#ifndef SPARSE_CHECKOUT_H
+#define SPARSE_CHECKOUT_H
+
+struct repository;
+
+extern int opt_restrict_to_sparse_paths;
+
+/* Whether or not cmds should restrict behavior to sparse paths, in this repo */
+int restrict_to_sparse_paths(struct repository *repo);
+
+#endif /* SPARSE_CHECKOUT_H */
diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
index b3109e3479..f93a4f71d1 100755
--- a/t/t7817-grep-sparse-checkout.sh
+++ b/t/t7817-grep-sparse-checkout.sh
@@ -80,10 +80,10 @@ test_expect_success 'setup' '
 	test_path_is_file sub2/a
 '
 
-# The test below checks a special case: the sparsity patterns exclude '/b'
+# The two tests below check a special case: the sparsity patterns exclude '/b'
 # and sparse checkout is enabled, but the path exists in the working tree (e.g.
 # manually created after `git sparse-checkout init`). In this case, grep should
-# skip it.
+# skip the file by default, but not with --no-restrict-to-sparse-paths.
 test_expect_success 'grep in working tree should honor sparse checkout' '
 	cat >expect <<-EOF &&
 	a:text
@@ -93,6 +93,16 @@ test_expect_success 'grep in working tree should honor sparse checkout' '
 	git grep "text" >actual &&
 	test_cmp expect actual
 '
+test_expect_success 'grep w/ --no-restrict-to-sparse-paths for sparsely excluded but present paths' '
+	cat >expect <<-EOF &&
+	a:text
+	b:new-text
+	EOF
+	echo "new-text" >b &&
+	test_when_finished "rm b" &&
+	git --no-restrict-to-sparse-paths grep "text" >actual &&
+	test_cmp expect actual
+'
 
 test_expect_success 'grep unmerged file despite not matching sparsity patterns' '
 	cat >expect <<-EOF &&
@@ -157,7 +167,7 @@ test_expect_success 'grep <tree-ish> should ignore sparsity patterns' '
 '
 
 # Note that sub2/ is present in the worktree but it is excluded by the sparsity
-# patterns, so grep should not recurse into it.
+# patterns, so grep should only recurse into it with --no-restrict-to-sparse-paths.
 test_expect_success 'grep --recurse-submodules should honor sparse checkout in submodule' '
 	cat >expect <<-EOF &&
 	a:text
@@ -166,6 +176,15 @@ test_expect_success 'grep --recurse-submodules should honor sparse checkout in s
 	git grep --recurse-submodules "text" >actual &&
 	test_cmp expect actual
 '
+test_expect_success 'grep --recurse-submodules should search in excluded submodules w/ --no-restrict-to-sparse-paths' '
+	cat >expect <<-EOF &&
+	a:text
+	sub/B/b:text
+	sub2/a:text
+	EOF
+	git --no-restrict-to-sparse-paths grep --recurse-submodules "text" >actual &&
+	test_cmp expect actual
+'
 
 test_expect_success 'grep --recurse-submodules --cached should honor sparse checkout in submodule' '
 	cat >expect <<-EOF &&
@@ -192,4 +211,111 @@ test_expect_success 'grep --recurse-submodules <commit-ish> should honor sparse
 	test_cmp expect_tag-to-commit actual_tag-to-commit
 '
 
+for cmd in 'git --no-restrict-to-sparse-paths grep' \
+	   'git -c sparse.restrictCmds=false grep' \
+	   'git -c sparse.restrictCmds=true --no-restrict-to-sparse-paths grep'
+do
+
+	test_expect_success "$cmd --cached should ignore sparsity patterns" '
+		cat >expect <<-EOF &&
+		a:text
+		b:text
+		dir/c:text
+		EOF
+		$cmd --cached "text" >actual &&
+		test_cmp expect actual
+	'
+
+	test_expect_success "$cmd <commit-ish> should ignore sparsity patterns" '
+		commit=$(git rev-parse HEAD) &&
+		cat >expect_commit <<-EOF &&
+		$commit:a:text
+		$commit:b:text
+		$commit:dir/c:text
+		EOF
+		cat >expect_tag-to-commit <<-EOF &&
+		tag-to-commit:a:text
+		tag-to-commit:b:text
+		tag-to-commit:dir/c:text
+		EOF
+		$cmd "text" $commit >actual_commit &&
+		test_cmp expect_commit actual_commit &&
+		$cmd "text" tag-to-commit >actual_tag-to-commit &&
+		test_cmp expect_tag-to-commit actual_tag-to-commit
+	'
+done
+
+test_expect_success 'grep --recurse-submodules --cached w/ --no-restrict-to-sparse-paths' '
+	cat >expect <<-EOF &&
+	a:text
+	b:text
+	dir/c:text
+	sub/A/a:text
+	sub/B/b:text
+	sub2/a:text
+	EOF
+	git --no-restrict-to-sparse-paths grep --recurse-submodules --cached \
+		"text" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep --recurse-submodules <commit-ish> w/ --no-restrict-to-sparse-paths' '
+	commit=$(git rev-parse HEAD) &&
+	cat >expect_commit <<-EOF &&
+	$commit:a:text
+	$commit:b:text
+	$commit:dir/c:text
+	$commit:sub/A/a:text
+	$commit:sub/B/b:text
+	$commit:sub2/a:text
+	EOF
+	cat >expect_tag-to-commit <<-EOF &&
+	tag-to-commit:a:text
+	tag-to-commit:b:text
+	tag-to-commit:dir/c:text
+	tag-to-commit:sub/A/a:text
+	tag-to-commit:sub/B/b:text
+	tag-to-commit:sub2/a:text
+	EOF
+	git --no-restrict-to-sparse-paths grep --recurse-submodules "text" \
+		$commit >actual_commit &&
+	test_cmp expect_commit actual_commit &&
+	git --no-restrict-to-sparse-paths grep --recurse-submodules "text" \
+		tag-to-commit >actual_tag-to-commit &&
+	test_cmp expect_tag-to-commit actual_tag-to-commit
+'
+
+test_expect_success 'should respect the sparse.restrictCmds values from submodules' '
+	cat >expect <<-EOF &&
+	a:text
+	sub/A/a:text
+	sub/B/b:text
+	EOF
+	test_config -C sub sparse.restrictCmds false &&
+	git grep --cached --recurse-submodules "text" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'should propagate --[no-]restrict-to-sparse-paths to submodules' '
+	cat >expect <<-EOF &&
+	a:text
+	b:text
+	dir/c:text
+	sub/A/a:text
+	sub/B/b:text
+	sub2/a:text
+	EOF
+	test_config -C sub sparse.restrictCmds true &&
+	git --no-restrict-to-sparse-paths grep --cached --recurse-submodules "text" >actual &&
+	test_cmp expect actual
+'
+
+for opt in '--untracked' '--no-index'
+do
+	test_expect_success "--[no-]restrict-to-sparse-paths and $opt are incompatible" "
+		test_must_fail git --restrict-to-sparse-paths grep $opt . 2>actual &&
+		test_i18ngrep 'restrict-to-sparse-paths is incompatible with' actual
+	"
+done
+
 test_done
diff --git a/t/t9902-completion.sh b/t/t9902-completion.sh
index 3c44af6940..a4a7767e06 100755
--- a/t/t9902-completion.sh
+++ b/t/t9902-completion.sh
@@ -1473,6 +1473,8 @@ test_expect_success 'double dash "git" itself' '
 	--namespace=
 	--no-replace-objects Z
 	--help Z
+	--restrict-to-sparse-paths Z
+	--no-restrict-to-sparse-paths Z
 	EOF
 '
 
@@ -1515,7 +1517,7 @@ test_expect_success 'general options' '
 	test_completion "git --nam" "--namespace=" &&
 	test_completion "git --bar" "--bare " &&
 	test_completion "git --inf" "--info-path " &&
-	test_completion "git --no-r" "--no-replace-objects "
+	test_completion "git --no-rep" "--no-replace-objects "
 '
 
 test_expect_success 'general options plus command' '
-- 
2.26.2



* Re: [PATCH v4 4/6] config: correctly read worktree configs in submodules
  2020-06-12 15:45       ` [PATCH v4 4/6] config: correctly read worktree configs in submodules Matheus Tavares
@ 2020-06-16 19:13         ` Elijah Newren
  2020-06-21 16:05           ` Matheus Tavares Bernardino
  2020-09-01  2:41         ` Jonathan Nieder
  1 sibling, 1 reply; 120+ messages in thread
From: Elijah Newren @ 2020-06-16 19:13 UTC (permalink / raw)
  To: Matheus Tavares
  Cc: Git Mailing List, Junio C Hamano, Derrick Stolee, Jonathan Tan,
	Jeff King, Jonathan Nieder

Hi,

On Fri, Jun 12, 2020 at 8:45 AM Matheus Tavares
<matheus.bernardino@usp.br> wrote:
>
> One of the steps in do_git_config_sequence() is to load the
> worktree-specific config file. Although the function receives a git_dir
> string, it relies on git_pathdup(), which uses the_repository->git_dir,
> to make the path to the file. Furthermore, it also checks that
> extensions.worktreeConfig is set through the
> repository_format_worktree_config variable, which refers to
> the_repository only. Thus, when a submodule has worktree-specific
> settings, a command executed in the superproject that recurses into the
> submodule won't find the said settings.
>
> This will be especially important in the next patch: git-grep will learn
> to honor sparse checkouts and, when running with --recurse-submodules,
> the submodule's sparse checkout settings must be loaded. As these
> settings are stored in the config.worktree file, they would be ignored
> without this patch. So let's fix this by reading the right
> config.worktree file and extensions.worktreeConfig setting, based on the
> git_dir and commondir paths given to do_git_config_sequence(). Also
> add a test to avoid any regressions.

Thanks for splitting this part of the change out from the previous patch.

>
> Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
> ---
>  config.c                   | 21 +++++++++---
>  t/helper/test-config.c     | 67 +++++++++++++++++++++++++++++++++-----
>  t/t2404-worktree-config.sh | 16 +++++++++
>  3 files changed, 91 insertions(+), 13 deletions(-)
>
> diff --git a/config.c b/config.c
> index 8db9c77098..c2d56309dc 100644
> --- a/config.c
> +++ b/config.c
> @@ -1747,11 +1747,22 @@ static int do_git_config_sequence(const struct config_options *opts,
>                 ret += git_config_from_file(fn, repo_config, data);
>
>         current_parsing_scope = CONFIG_SCOPE_WORKTREE;
> -       if (!opts->ignore_worktree && repository_format_worktree_config) {
> -               char *path = git_pathdup("config.worktree");
> -               if (!access_or_die(path, R_OK, 0))
> -                       ret += git_config_from_file(fn, path, data);
> -               free(path);
> +       if (!opts->ignore_worktree && repo_config && opts->git_dir) {

What happens when opts->git_dir is NULL?  (Does that ever even
happen?)  Should it fall back to the old code path in that case?

> +               struct repository_format repo_fmt = REPOSITORY_FORMAT_INIT;
> +               struct strbuf buf = STRBUF_INIT;
> +
> +               read_repository_format(&repo_fmt, repo_config);
> +
> +               if (!verify_repository_format(&repo_fmt, &buf) &&
> +                   repo_fmt.worktree_config) {
> +                       char *path = mkpathdup("%s/config.worktree", opts->git_dir);
> +                       if (!access_or_die(path, R_OK, 0))
> +                               ret += git_config_from_file(fn, path, data);
> +                       free(path);
> +               }
> +
> +               strbuf_release(&buf);
> +               clear_repository_format(&repo_fmt);

I've tried to poke around a little at this block, but as with the
previous series, I still feel like it should be reviewed by someone
who knows submodules and/or config handling better.  That's easier now
that the patch has been split up.  Unfortunately, trying to figure out
who the submodule expert(s) are (looking at authors of
submodule*.[ch]) seems to lead me to a who's who of people who are no
longer active in the project.  :-(  Maybe Peff or jrnieder would have
suggestions; cc'ing them.

>         }
>
>         current_parsing_scope = CONFIG_SCOPE_COMMAND;
> diff --git a/t/helper/test-config.c b/t/helper/test-config.c
> index 61da2574c5..284f83a921 100644
> --- a/t/helper/test-config.c
> +++ b/t/helper/test-config.c
> @@ -2,12 +2,19 @@
>  #include "cache.h"
>  #include "config.h"
>  #include "string-list.h"
> +#include "submodule-config.h"
>
>  /*
>   * This program exposes the C API of the configuration mechanism
>   * as a set of simple commands in order to facilitate testing.
>   *
> - * Reads stdin and prints result of command to stdout:
> + * Usage: test-tool config [--submodule=<path>] <cmd> [<args>]
> + *
> + * If --submodule=<path> is given, <cmd> will operate on the submodule at the
> + * given <path>. This option is not valid for the commands: read_early_config,
> + * configset_get_value and configset_get_value_multi.
> + *
> + * Possible cmds are:
>   *
>   * get_value -> prints the value with highest priority for the entered key
>   *
> @@ -86,6 +93,8 @@ int cmd__config(int argc, const char **argv)
>         const struct string_list *strptr;
>         struct config_set cs = { .hash_initialized = 0 };
>         enum test_config_exit_code ret = TC_SUCCESS;
> +       struct repository *repo = the_repository;
> +       const char *subrepo_path = NULL;
>
>         argc--; /* skip over "config" */
>         argv++;
> @@ -93,7 +102,18 @@ int cmd__config(int argc, const char **argv)
>         if (argc == 0)
>                 goto print_usage_error;
>
> +       if (skip_prefix(*argv, "--submodule=", &subrepo_path)) {
> +               argc--;
> +               argv++;
> +               if (argc == 0)
> +                       goto print_usage_error;
> +       }
> +
>         if (argc == 2 && !strcmp(argv[0], "read_early_config")) {
> +               if (subrepo_path) {
> +                       fprintf(stderr, "Cannot use --submodule with read_early_config\n");
> +                       return TC_USAGE_ERROR;
> +               }
>                 read_early_config(early_config_cb, (void *)argv[1]);
>                 return TC_SUCCESS;
>         }
> @@ -101,8 +121,23 @@ int cmd__config(int argc, const char **argv)
>         setup_git_directory();
>         git_configset_init(&cs);
>
> +       if (subrepo_path) {
> +               const struct submodule *sub;
> +               struct repository *subrepo = xcalloc(1, sizeof(*repo));
> +
> +               sub = submodule_from_path(the_repository, &null_oid, subrepo_path);
> +               if (!sub || repo_submodule_init(subrepo, the_repository, sub)) {
> +                       fprintf(stderr, "Invalid argument to --submodule: '%s'\n",
> +                               subrepo_path);
> +                       free(subrepo);
> +                       ret = TC_USAGE_ERROR;
> +                       goto out;
> +               }
> +               repo = subrepo;
> +       }
> +
>         if (argc == 2 && !strcmp(argv[0], "get_value")) {
> -               if (!git_config_get_value(argv[1], &v)) {
> +               if (!repo_config_get_value(repo, argv[1], &v)) {
>                         if (!v)
>                                 printf("(NULL)\n");
>                         else
> @@ -112,7 +147,7 @@ int cmd__config(int argc, const char **argv)
>                         ret = TC_VALUE_NOT_FOUND;
>                 }
>         } else if (argc == 2 && !strcmp(argv[0], "get_value_multi")) {
> -               strptr = git_config_get_value_multi(argv[1]);
> +               strptr = repo_config_get_value_multi(repo, argv[1]);
>                 if (strptr) {
>                         for (i = 0; i < strptr->nr; i++) {
>                                 v = strptr->items[i].string;
> @@ -126,27 +161,33 @@ int cmd__config(int argc, const char **argv)
>                         ret = TC_VALUE_NOT_FOUND;
>                 }
>         } else if (argc == 2 && !strcmp(argv[0], "get_int")) {
> -               if (!git_config_get_int(argv[1], &val)) {
> +               if (!repo_config_get_int(repo, argv[1], &val)) {
>                         printf("%d\n", val);
>                 } else {
>                         printf("Value not found for \"%s\"\n", argv[1]);
>                         ret = TC_VALUE_NOT_FOUND;
>                 }
>         } else if (argc == 2 && !strcmp(argv[0], "get_bool")) {
> -               if (!git_config_get_bool(argv[1], &val)) {
> +               if (!repo_config_get_bool(repo, argv[1], &val)) {
>                         printf("%d\n", val);
>                 } else {
> +
>                         printf("Value not found for \"%s\"\n", argv[1]);
>                         ret = TC_VALUE_NOT_FOUND;
>                 }
>         } else if (argc == 2 && !strcmp(argv[0], "get_string")) {
> -               if (!git_config_get_string_const(argv[1], &v)) {
> +               if (!repo_config_get_string_const(repo, argv[1], &v)) {
>                         printf("%s\n", v);
>                 } else {
>                         printf("Value not found for \"%s\"\n", argv[1]);
>                         ret = TC_VALUE_NOT_FOUND;
>                 }
>         } else if (argc >= 2 && !strcmp(argv[0], "configset_get_value")) {
> +               if (subrepo_path) {
> +                       fprintf(stderr, "Cannot use --submodule with configset_get_value\n");
> +                       ret = TC_USAGE_ERROR;
> +                       goto out;
> +               }
>                 for (i = 2; i < argc; i++) {
>                         int err;
>                         if ((err = git_configset_add_file(&cs, argv[i]))) {
> @@ -165,6 +206,11 @@ int cmd__config(int argc, const char **argv)
>                         ret = TC_VALUE_NOT_FOUND;
>                 }
>         } else if (argc >= 2 && !strcmp(argv[0], "configset_get_value_multi")) {
> +               if (subrepo_path) {
> +                       fprintf(stderr, "Cannot use --submodule with configset_get_value_multi\n");
> +                       ret = TC_USAGE_ERROR;
> +                       goto out;
> +               }
>                 for (i = 2; i < argc; i++) {
>                         int err;
>                         if ((err = git_configset_add_file(&cs, argv[i]))) {
> @@ -187,14 +233,19 @@ int cmd__config(int argc, const char **argv)
>                         ret = TC_VALUE_NOT_FOUND;
>                 }
>         } else if (!strcmp(argv[0], "iterate")) {
> -               git_config(iterate_cb, NULL);
> +               repo_config(repo, iterate_cb, NULL);
>         } else {
>  print_usage_error:
> -               fprintf(stderr, "Invalid syntax. Usage: test-tool config <cmd> [args]\n");
> +               fprintf(stderr, "Invalid syntax. Usage: test-tool config"
> +                               " [--submodule=<path>] <cmd> [args]\n");
>                 ret = TC_USAGE_ERROR;
>         }
>
>  out:
>         git_configset_clear(&cs);
> +       if (repo != the_repository) {
> +               repo_clear(repo);
> +               free(repo);
> +       }
>         return ret;
>  }
> diff --git a/t/t2404-worktree-config.sh b/t/t2404-worktree-config.sh
> index 286121d8de..b6ab793203 100755
> --- a/t/t2404-worktree-config.sh
> +++ b/t/t2404-worktree-config.sh
> @@ -76,4 +76,20 @@ test_expect_success 'config.worktree no longer read without extension' '
>         test_cmp_config -C wt2 shared this.is
>  '
>
> +test_expect_success 'correctly read config.worktree from submodules' '
> +       test_unconfig extensions.worktreeConfig &&
> +       git init sub &&
> +       (
> +               cd sub &&
> +               test_commit A &&
> +               git config extensions.worktreeConfig true &&
> +               git config --worktree wtconfig.sub test-value
> +       ) &&
> +       git submodule add ./sub &&
> +       git commit -m "add sub" &&
> +       echo test-value >expect &&
> +       test-tool config --submodule=sub get_value wtconfig.sub >actual &&
> +       test_cmp expect actual
> +'
> +
>  test_done
> --
> 2.26.2

* Re: [PATCH v4 0/6] grep: honor sparse checkout and add option to ignore it
  2020-06-12 15:44     ` [PATCH v4 0/6] grep: honor sparse checkout and add option to ignore it Matheus Tavares
                         ` (5 preceding siblings ...)
  2020-06-12 15:45       ` [PATCH v4 6/6] config: add setting to ignore sparsity patterns in some cmds Matheus Tavares
@ 2020-06-16 22:31       ` Elijah Newren
  2020-09-02  6:17       ` [PATCH v5 0/8] " Matheus Tavares
  7 siblings, 0 replies; 120+ messages in thread
From: Elijah Newren @ 2020-06-16 22:31 UTC (permalink / raw)
  To: Matheus Tavares
  Cc: Git Mailing List, Junio C Hamano, Derrick Stolee, Jonathan Tan

On Fri, Jun 12, 2020 at 8:45 AM Matheus Tavares
<matheus.bernardino@usp.br> wrote:
>
> This series makes git-grep restrict its output to the present sparsity
> patterns. A new global option is added to toggle this behavior in grep
> and hopefully more commands in the future.

You've cleaned up all the issues (or corrected my understanding) from
my comments in the previous iterations of this series; I didn't spot
any additional issues in reading over this latest version of the
series.

However, I would like someone more familiar with submodules and/or
config to take a look at the changes to do_git_config_sequence() in
patch 4, as I commented on there, if we can find someone to do so.

Thanks for working on this; nice work!

Elijah

* Re: [PATCH v4 4/6] config: correctly read worktree configs in submodules
  2020-06-16 19:13         ` Elijah Newren
@ 2020-06-21 16:05           ` Matheus Tavares Bernardino
  0 siblings, 0 replies; 120+ messages in thread
From: Matheus Tavares Bernardino @ 2020-06-21 16:05 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git Mailing List, Junio C Hamano, Derrick Stolee, Jonathan Tan,
	Jeff King, Jonathan Nieder

On Tue, Jun 16, 2020 at 4:13 PM Elijah Newren <newren@gmail.com> wrote:
>
> On Fri, Jun 12, 2020 at 8:45 AM Matheus Tavares
> <matheus.bernardino@usp.br> wrote:
> >
> >  config.c                   | 21 +++++++++---
> >  t/helper/test-config.c     | 67 +++++++++++++++++++++++++++++++++-----
> >  t/t2404-worktree-config.sh | 16 +++++++++
> >  3 files changed, 91 insertions(+), 13 deletions(-)
> >
> > diff --git a/config.c b/config.c
> > index 8db9c77098..c2d56309dc 100644
> > --- a/config.c
> > +++ b/config.c
> > @@ -1747,11 +1747,22 @@ static int do_git_config_sequence(const struct config_options *opts,
> >                 ret += git_config_from_file(fn, repo_config, data);
> >
> >         current_parsing_scope = CONFIG_SCOPE_WORKTREE;
> > -       if (!opts->ignore_worktree && repository_format_worktree_config) {
> > -               char *path = git_pathdup("config.worktree");
> > -               if (!access_or_die(path, R_OK, 0))
> > -                       ret += git_config_from_file(fn, path, data);
> > -               free(path);
> > +       if (!opts->ignore_worktree && repo_config && opts->git_dir) {
>
> What happens when opts->git_dir is NULL?  (Does that ever even
> happen?)  Should it fall back to the old code path in that case?

Sorry for not replying earlier.

Yes, opts->git_dir might be NULL in some cases. I did a quick grep
search, though, and it seems that this only happens in two
circumstances: (1) in builtin/config.c when
startup_info->have_repository is false; and (2) in
read_early_config(), if have_git_dir() returns false and
discover_git_directory() fails.

For (2), I think it is right to ignore the worktree config file when
opts->git_dir is NULL because we indeed don't have a repo to read the
file from. I'm tempted to say the same for (1), but I'm not very
familiar with setup.c. By the definition of have_git_dir() it seems
possible to have the_repository->gitdir set up even when
startup_info->have_repository == false:

int have_git_dir(void)
{
        return startup_info->have_repository
                || the_repository->gitdir;
}

Nevertheless, the current calls to config_with_options() either set
both opts->git_dir and opts->commondir or none. So if we were to fall
back to the_repository->gitdir, for the worktree config, when
startup_info->have_repository == false, the local config file would
still be ignored during the config sequence in such case. I think it
wouldn't make much sense to ignore the local config file but try to
load the worktree-specific one, which is also dependent on having a
repo, and even more specific. So I think we shouldn't fall back to the
old code path. But I would appreciate hearing from others more
familiar with this code.
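
To make the case analysis above concrete, here is a minimal standalone
sketch of the guard. It is not the actual config.c code: `struct
sketch_opts` and `worktree_config_path()` are hypothetical stand-ins for
`struct config_options` and the mkpathdup() call, and `ext_enabled`
stands in for repo_fmt.worktree_config. It also assumes that a NULL
commondir means there is no repo_config to read the extension from.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the config_options fields used here. */
struct sketch_opts {
	int ignore_worktree;
	const char *commondir;
	const char *git_dir;
};

/*
 * Returns a malloc()'ed path to config.worktree, or NULL when the
 * worktree config must be skipped. The caller frees the result.
 * There is deliberately no fallback to a global repository when
 * git_dir is NULL.
 */
char *worktree_config_path(const struct sketch_opts *opts, int ext_enabled)
{
	static const char suffix[] = "/config.worktree";
	size_t len;
	char *path;

	if (opts->ignore_worktree || !opts->commondir || !opts->git_dir)
		return NULL;
	if (!ext_enabled)
		return NULL;

	len = strlen(opts->git_dir) + sizeof(suffix);
	path = malloc(len);
	if (!path)
		return NULL;
	snprintf(path, len, "%s%s", opts->git_dir, suffix);
	return path;
}
```

With both paths unset the function returns NULL, which mirrors the
"no repo, so no worktree config" behavior argued for above.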

* Re: [PATCH v4 4/6] config: correctly read worktree configs in submodules
  2020-06-12 15:45       ` [PATCH v4 4/6] config: correctly read worktree configs in submodules Matheus Tavares
  2020-06-16 19:13         ` Elijah Newren
@ 2020-09-01  2:41         ` Jonathan Nieder
  2020-09-01 21:44           ` Matheus Tavares Bernardino
  1 sibling, 1 reply; 120+ messages in thread
From: Jonathan Nieder @ 2020-09-01  2:41 UTC (permalink / raw)
  To: Matheus Tavares; +Cc: git, gitster, stolee, newren, jonathantanmy

Hi,

Matheus Tavares wrote:

> One of the steps in do_git_config_sequence() is to load the
> worktree-specific config file. Although the function receives a git_dir
> string, it relies on git_pathdup(), which uses the_repository->git_dir,
> to make the path to the file. Furthermore, it also checks that
> extensions.worktreeConfig is set through the
> repository_format_worktree_config variable, which refers to
> the_repository only. Thus, when a submodule has worktree-specific
> settings, a command executed in the superproject that recurses into the
> submodule won't find the said settings.

I think the above goes out of order: it states the "how" before the
"what".  Instead, a commit message should lead with the problem the
change aims to solve.

Is the idea here that until this patch, we're only able to read
worktree config from a repository when extensions.worktreeConfig is
set in the_repository, meaning that

- when examining submodule config in a process where the_repository
  represents the superproject, we do not read the submodule's worktree
  config even if extensions.worktreeConfig is set in the submodule,
  unless the superproject has extensions.worktreeConfig set, and

- when examining submodule config in a process where the_repository
  represents the superproject, we *do* read the submodule's worktree
  config even if extensions.worktreeConfig is not set in the submodule,
  if the superproject has extensions.worktreeConfig set, and

?

That sounds like a serious problem indeed.  Thanks for fixing it.

> This will be especially important in the next patch: git-grep will learn
> to honor sparse checkouts and, when running with --recurse-submodules,
> the submodule's sparse checkout settings must be loaded. As these
> settings are stored in the config.worktree file, they would be ignored
> without this patch. So let's fix this by reading the right
> config.worktree file and extensions.worktreeConfig setting, based on the
> git_dir and commondir paths given to do_git_config_sequence(). Also
> add a test to avoid any regressions.

I see.  I'm not sure that's more important than other cases, but I
can understand if the problem was noticed in this circumstance. :)

> Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
> ---
>  config.c                   | 21 +++++++++---
>  t/helper/test-config.c     | 67 +++++++++++++++++++++++++++++++++-----
>  t/t2404-worktree-config.sh | 16 +++++++++
>  3 files changed, 91 insertions(+), 13 deletions(-)
> 
> diff --git a/config.c b/config.c
> index 8db9c77098..c2d56309dc 100644
> --- a/config.c
> +++ b/config.c
> @@ -1747,11 +1747,22 @@ static int do_git_config_sequence(const struct config_options *opts,
>  		ret += git_config_from_file(fn, repo_config, data);
>  
>  	current_parsing_scope = CONFIG_SCOPE_WORKTREE;
> -	if (!opts->ignore_worktree && repository_format_worktree_config) {
> +	if (!opts->ignore_worktree && repo_config && opts->git_dir) {

Can we eliminate the repository_format_worktree_config global to save
the next caller from the same problem?

> +		struct repository_format repo_fmt = REPOSITORY_FORMAT_INIT;
> +		struct strbuf buf = STRBUF_INIT;
> +
> +		read_repository_format(&repo_fmt, repo_config);
> +
> +		if (!verify_repository_format(&repo_fmt, &buf) &&
> +		    repo_fmt.worktree_config) {

This undoes the caching the repository_format_worktree_config means to
do.  Can we cache the value in "struct repository" instead?  That way,
in the common case where we're reading the_repository, we wouldn't
experience a slowdown.

> -		char *path = git_pathdup("config.worktree");
> +			char *path = mkpathdup("%s/config.worktree", opts->git_dir);

Can this use a helper like repo_git_path or strbuf_repo_git_path
(preferably one using strbuf like the latter)?

[...]
> +		strbuf_release(&buf);
> +		clear_repository_format(&repo_fmt);
>  	}
>  
>  	current_parsing_scope = CONFIG_SCOPE_COMMAND;
> diff --git a/t/helper/test-config.c b/t/helper/test-config.c
> index 61da2574c5..284f83a921 100644
> --- a/t/helper/test-config.c
> +++ b/t/helper/test-config.c
> @@ -2,12 +2,19 @@
>  #include "cache.h"
>  #include "config.h"
>  #include "string-list.h"
> +#include "submodule-config.h"
>  
>  /*
>   * This program exposes the C API of the configuration mechanism
>   * as a set of simple commands in order to facilitate testing.
>   *
> - * Reads stdin and prints result of command to stdout:
> + * Usage: test-tool config [--submodule=<path>] <cmd> [<args>]
> + *
> + * If --submodule=<path> is given, <cmd> will operate on the submodule at the
> + * given <path>. This option is not valid for the commands: read_early_config,
> + * configset_get_value and configset_get_value_multi.

Nice!

[...]
> @@ -93,7 +102,18 @@ int cmd__config(int argc, const char **argv)
>  	if (argc == 0)
>  		goto print_usage_error;
>  
> +	if (skip_prefix(*argv, "--submodule=", &subrepo_path)) {
> +		argc--;
> +		argv++;
> +		if (argc == 0)
> +			goto print_usage_error;
> +	}

Can this use the parse_options API?

> +
>  	if (argc == 2 && !strcmp(argv[0], "read_early_config")) {
> +		if (subrepo_path) {
> +			fprintf(stderr, "Cannot use --submodule with read_early_config\n");
> +			return TC_USAGE_ERROR;

Should this use die() or BUG()?

> +		}
>  		read_early_config(early_config_cb, (void *)argv[1]);
>  		return TC_SUCCESS;
>  	}
> @@ -101,8 +121,23 @@ int cmd__config(int argc, const char **argv)
>  	setup_git_directory();
>  	git_configset_init(&cs);
>  
> +	if (subrepo_path) {
> +		const struct submodule *sub;
> +		struct repository *subrepo = xcalloc(1, sizeof(*repo));

nit: this could be scoped to cmd__config:

	struct repository subrepo = {0};

> +
> +		sub = submodule_from_path(the_repository, &null_oid, subrepo_path);
> +		if (!sub || repo_submodule_init(subrepo, the_repository, sub)) {
> +			fprintf(stderr, "Invalid argument to --submodule: '%s'\n",
> +				subrepo_path);
> +			free(subrepo);
> +			ret = TC_USAGE_ERROR;

Likewise: I think we may want to use die() or BUG() (and likewise for other
USAGE_ERROR cases).

Thanks and hope that helps,
Jonathan

* Re: [PATCH v4 4/6] config: correctly read worktree configs in submodules
  2020-09-01  2:41         ` Jonathan Nieder
@ 2020-09-01 21:44           ` Matheus Tavares Bernardino
  0 siblings, 0 replies; 120+ messages in thread
From: Matheus Tavares Bernardino @ 2020-09-01 21:44 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: git, Junio C Hamano, Derrick Stolee, Elijah Newren, Jonathan Tan

Hi, Jonathan

On Mon, Aug 31, 2020 at 11:41 PM Jonathan Nieder <jrnieder@gmail.com> wrote:
>
> Hi,
>
> Matheus Tavares wrote:
>
> > One of the steps in do_git_config_sequence() is to load the
> > worktree-specific config file. Although the function receives a git_dir
> > string, it relies on git_pathdup(), which uses the_repository->git_dir,
> > to make the path to the file. Furthermore, it also checks that
> > extensions.worktreeConfig is set through the
> > repository_format_worktree_config variable, which refers to
> > the_repository only. Thus, when a submodule has worktree-specific
> > settings, a command executed in the superproject that recurses into the
> > submodule won't find the said settings.
>
> I think the above goes out of order: it states the "how" before the
> "what".  Instead, a commit message should lead with the problem the
> change aims to solve.

Thanks. I will reorder these two sections in the commit message.

> Is the idea here that until this patch, we're only able to read
> worktree config from a repository when extensions.worktreeConfig is
> set in the_repository, meaning that
>
> - when examining submodule config in a process where the_repository
>   represents the superproject, we do not read the submodule's worktree
>   config even if extensions.worktreeConfig is set in the submodule,
>   unless the superproject has extensions.worktreeConfig set, and

Right.

> - when examining submodule config in a process where the_repository
>   represents the superproject, we *do* read the submodule's worktree
>   config even if extensions.worktreeConfig is not set in the submodule,
>   if the superproject has extensions.worktreeConfig set, and
>
> ?

Right, but with one change: if extensions.worktreeConfig is not set in
the submodule and it is set in the superproject, the *superproject's*
worktree config is read (independently of which git_dir was given as
argument).
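
The two cases, including the correction above, can be modeled with a
small toy. This is only a sketch: `struct model_repo` and both
functions are hypothetical names made up for the illustration, and each
function simply answers "whose config.worktree ends up being read?"
when a process whose the_repository is `the_repo` reads the config of
`target`.

```c
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical, reduced view of a repository for this sketch. */
struct model_repo {
	const char *name;
	bool worktree_config_ext; /* extensions.worktreeConfig */
};

/*
 * Before the patch: both the extension check and the path come from
 * the_repository, so the target's own setting is never consulted.
 */
const struct model_repo *
config_worktree_owner_old(const struct model_repo *the_repo,
			  const struct model_repo *target)
{
	(void)target;
	return the_repo->worktree_config_ext ? the_repo : NULL;
}

/*
 * After the patch: both come from the repository whose git_dir was
 * given, so only the target's setting matters.
 */
const struct model_repo *
config_worktree_owner_new(const struct model_repo *the_repo,
			  const struct model_repo *target)
{
	(void)the_repo;
	return target->worktree_config_ext ? target : NULL;
}
```

With the extension set only in the superproject, the old function
returns the superproject and the new one returns NULL; with it set only
in the submodule, the old one returns NULL and the new one returns the
submodule.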

> That sounds like a serious problem indeed.  Thanks for fixing it.
>
> > This will be especially important in the next patch: git-grep will learn
> > to honor sparse checkouts and, when running with --recurse-submodules,
> > the submodule's sparse checkout settings must be loaded. As these
> > settings are stored in the config.worktree file, they would be ignored
> > without this patch. So let's fix this by reading the right
> > config.worktree file and extensions.worktreeConfig setting, based on the
> > git_dir and commondir paths given to do_git_config_sequence(). Also
> > add a test to avoid any regressions.
>
> I see.  I'm not sure that's more important than other cases, but I
> can understand if the problem was noticed in this circumstance. :)
>
> > Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
> > ---
> >  config.c                   | 21 +++++++++---
> >  t/helper/test-config.c     | 67 +++++++++++++++++++++++++++++++++-----
> >  t/t2404-worktree-config.sh | 16 +++++++++
> >  3 files changed, 91 insertions(+), 13 deletions(-)
> >
> > diff --git a/config.c b/config.c
> > index 8db9c77098..c2d56309dc 100644
> > --- a/config.c
> > +++ b/config.c
> > @@ -1747,11 +1747,22 @@ static int do_git_config_sequence(const struct config_options *opts,
> >               ret += git_config_from_file(fn, repo_config, data);
> >
> >       current_parsing_scope = CONFIG_SCOPE_WORKTREE;
> > -     if (!opts->ignore_worktree && repository_format_worktree_config) {
> > +     if (!opts->ignore_worktree && repo_config && opts->git_dir) {
>
> Can we eliminate the repository_format_worktree_config global to save
> the next caller from the same problem?

Hmm, I think it's possible, I will investigate it further.

> > +             struct repository_format repo_fmt = REPOSITORY_FORMAT_INIT;
> > +             struct strbuf buf = STRBUF_INIT;
> > +
> > +             read_repository_format(&repo_fmt, repo_config);
> > +
> > +             if (!verify_repository_format(&repo_fmt, &buf) &&
> > +                 repo_fmt.worktree_config) {
>
> This undoes the caching the repository_format_worktree_config means to
> do.  Can we cache the value in "struct repository" instead?  That way,
> in the common case where we're reading the_repository, we wouldn't
> experience a slowdown.

Yeah, that would be the best solution. But, unfortunately,
do_git_config_sequence() doesn't receive a complete repository struct,
just the 'commondir' and 'git_dir' strings.
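
For reference, the kind of per-repository caching being discussed could
look roughly like the sketch below if a repository handle were
available at that point. Everything in it is hypothetical:
`struct cached_repo` is not git's `struct repository`, and
`read_worktree_ext()` merely stands in for the
read_repository_format()/verify_repository_format() pair.

```c
/* Hypothetical repository handle carrying a lazily filled cache. */
struct cached_repo {
	int on_disk_ext; /* pretend disk state: what the config file says */
	int cached_ext;  /* -1: not read yet, otherwise the cached answer */
};

#define CACHED_REPO_INIT(on_disk) { (on_disk), -1 }

/* Counts how often the (pretend) expensive format read happens. */
int format_reads;

/* Stand-in for parsing the repository format from the config file. */
int read_worktree_ext(const struct cached_repo *repo)
{
	format_reads++;
	return repo->on_disk_ext;
}

/* Reads the format at most once per repository, then reuses the result. */
int cached_repo_worktree_config(struct cached_repo *repo)
{
	if (repo->cached_ext < 0)
		repo->cached_ext = read_worktree_ext(repo);
	return repo->cached_ext;
}
```

Repeated queries on the same handle would then cost one format read,
while each repository still gets its own answer.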

> > -             char *path = git_pathdup("config.worktree");
> > +                     char *path = mkpathdup("%s/config.worktree", opts->git_dir);
>
> Can this use a helper like repo_git_path or strbuf_repo_git_path
> (preferably one using strbuf like the latter)?

Hmm, here we would have the same problem of not having a 'struct
repository' to pass to those functions :(

> [...]
> > +             strbuf_release(&buf);
> > +             clear_repository_format(&repo_fmt);
> >       }
> >
> >       current_parsing_scope = CONFIG_SCOPE_COMMAND;
> > diff --git a/t/helper/test-config.c b/t/helper/test-config.c
> > index 61da2574c5..284f83a921 100644
> > --- a/t/helper/test-config.c
> > +++ b/t/helper/test-config.c
> > @@ -2,12 +2,19 @@
> >  #include "cache.h"
> >  #include "config.h"
> >  #include "string-list.h"
> > +#include "submodule-config.h"
> >
> >  /*
> >   * This program exposes the C API of the configuration mechanism
> >   * as a set of simple commands in order to facilitate testing.
> >   *
> > - * Reads stdin and prints result of command to stdout:
> > + * Usage: test-tool config [--submodule=<path>] <cmd> [<args>]
> > + *
> > + * If --submodule=<path> is given, <cmd> will operate on the submodule at the
> > + * given <path>. This option is not valid for the commands: read_early_config,
> > + * configset_get_value and configset_get_value_multi.
>
> Nice!
>
> [...]
> > @@ -93,7 +102,18 @@ int cmd__config(int argc, const char **argv)
> >       if (argc == 0)
> >               goto print_usage_error;
> >
> > +     if (skip_prefix(*argv, "--submodule=", &subrepo_path)) {
> > +             argc--;
> > +             argv++;
> > +             if (argc == 0)
> > +                     goto print_usage_error;
> > +     }
>
> Can this use the parse_options API?

Right, it would make it easier to add more options in the future.
There is only one consideration, though, about parse_options()'s exit
codes on error, but more on that below...

> > +
> >       if (argc == 2 && !strcmp(argv[0], "read_early_config")) {
> > +             if (subrepo_path) {
> > +                     fprintf(stderr, "Cannot use --submodule with read_early_config\n");
> > +                     return TC_USAGE_ERROR;
>
> Should this use die() or BUG()?

The idea of using TC_USAGE_ERROR (129) here and not die() (128), was
that some users of the test-config helper want to detect die() errors
from the config machinery itself. So by using a different exit code,
we can avoid false positives in these tests. Of course they should
also be checking stderr/stdout, but there is at least one test which
only checks the exit code. Rethinking about that now, instead of using
different exit codes in test-config.c, should we adjust the tests to
use `test_must_fail` and only check stderr/stdout? Then we could use
die() (or BUG()) here, as you suggested, as well as the parse_options
API in the snippet above. Does that sound reasonable?
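
As an illustration of the distinction described above, here is a toy
classifier for the helper's exit status. Only the two numbers come from
this discussion (128 for die(), 129 for the helper's usage error); the
enum and function names are made up for the example and do not exist in
test-config.c.

```c
/*
 * Exit codes named in the discussion: die() exits with 128 and the
 * helper's usage error with 129. Everything else is illustrative.
 */
#define SKETCH_EXIT_DIE 128
#define SKETCH_EXIT_USAGE 129

enum failure_origin {
	ORIGIN_NONE,         /* exit code 0 */
	ORIGIN_CONFIG_DIE,   /* die() from the config machinery */
	ORIGIN_HELPER_USAGE, /* bad invocation of the helper itself */
	ORIGIN_OTHER         /* any other non-zero code */
};

/* Maps an exit code to where the failure most likely came from. */
enum failure_origin classify_exit_code(int code)
{
	switch (code) {
	case 0:
		return ORIGIN_NONE;
	case SKETCH_EXIT_DIE:
		return ORIGIN_CONFIG_DIE;
	case SKETCH_EXIT_USAGE:
		return ORIGIN_HELPER_USAGE;
	default:
		return ORIGIN_OTHER;
	}
}
```

If the usage errors switched to die(), the second and third cases would
collapse into one, which is why the tests would then need to check
stderr/stdout rather than the exit code alone.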

> > +             }
> >               read_early_config(early_config_cb, (void *)argv[1]);
> >               return TC_SUCCESS;
> >       }
> > @@ -101,8 +121,23 @@ int cmd__config(int argc, const char **argv)
> >       setup_git_directory();
> >       git_configset_init(&cs);
> >
> > +     if (subrepo_path) {
> > +             const struct submodule *sub;
> > +             struct repository *subrepo = xcalloc(1, sizeof(*repo));
>
> nit: this could be scoped to cmd__config:
>
>         struct repository subrepo = {0};

OK, will do. Thanks

> > +
> > +             sub = submodule_from_path(the_repository, &null_oid, subrepo_path);
> > +             if (!sub || repo_submodule_init(subrepo, the_repository, sub)) {
> > +                     fprintf(stderr, "Invalid argument to --submodule: '%s'\n",
> > +                             subrepo_path);
> > +                     free(subrepo);
> > +                     ret = TC_USAGE_ERROR;
>
> Likewise: I think we may want to use die() or BUG() (and likewise for other
> USAGE_ERROR cases).
>
> Thanks and hope that helps,
> Jonathan

It did :) Thanks a lot for the comments!

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [PATCH v5 0/8] grep: honor sparse checkout and add option to ignore it
  2020-06-12 15:44     ` [PATCH v4 0/6] grep: honor sparse checkout and add option to ignore it Matheus Tavares
                         ` (6 preceding siblings ...)
  2020-06-16 22:31       ` [PATCH v4 0/6] grep: honor sparse checkout and add option to ignore it Elijah Newren
@ 2020-09-02  6:17       ` Matheus Tavares
  2020-09-02  6:17         ` [PATCH v5 1/8] doc: grep: unify info on configuration variables Matheus Tavares
                           ` (8 more replies)
  7 siblings, 9 replies; 120+ messages in thread
From: Matheus Tavares @ 2020-09-02  6:17 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, newren, jonathantanmy, jrnieder

This series makes git-grep restrict its output to the sparsity patterns when
requested by the user. A new global option is added to control this behavior
in grep and hopefully more commands in the future. There are also a
couple of fixes in t/helper/test-config and in a test that uses it.

Changes since v4:

- Rebased on top of master to use repo_config_get_string_tmp(), added in
  jk/leakfix, in t/helper/test-config (patch 6).

- Added patch 2, to make sure a test that relies on test-config checks its
  output in addition to the exit code, to avoid false positives.

- Split patch "t/helper/test-config: return exit codes consistently" into
  three separate ones, as these are in fact three unrelated changes:
	"t/helper/test-config: unify exit labels"
	"t/helper/test-config: check argc before accessing argv"
	"t/helper/test-config: be consistent with exit codes"

- Removed TC_USAGE_ERROR in favor of calling die(). Also removed the
  test_config_exit_code enum.

- On "config: correctly read worktree configs in submodules":
  * Improved commit message to focus on the problem instead of the
    implementation and remove section about the grep example.
  * Made use of the parse_options API.
  * Allocated the subrepo struct on the stack instead of malloc()'ing it.
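
To illustrate the last item, here is a minimal sketch of the stack-allocation
pattern. It is not the code from patch 6: `struct repo_like`,
`repo_like_init()` and `open_subrepo()` are hypothetical stand-ins for
`struct repository` and its initializer, used only to show the shape.

```c
#include <string.h>

/* Hypothetical stand-in for git's struct repository; not the real type. */
struct repo_like {
	const char *gitdir;
	int initialized;
};

/* Hypothetical initializer: returns 0 on success, -1 on a bad path. */
static int repo_like_init(struct repo_like *r, const char *gitdir)
{
	memset(r, 0, sizeof(*r));
	if (!gitdir || !*gitdir)
		return -1;
	r->gitdir = gitdir;
	r->initialized = 1;
	return 0;
}

/*
 * The struct lives in this function's frame, so the error path needs no
 * free() and an early return cannot leak the allocation.
 */
static int open_subrepo(const char *path)
{
	struct repo_like subrepo, *repo;

	if (repo_like_init(&subrepo, path))
		return -1;
	repo = &subrepo;
	return repo->initialized ? 0 : -1;
}
```

With this shape, the cleanup code only has to release whatever the struct
owns; there is no separate pointer to free.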

Matheus Tavares (8):
  doc: grep: unify info on configuration variables
  t1308-config-set: avoid false positives when using test-config
  t/helper/test-config: be consistent with exit codes
  t/helper/test-config: check argc before accessing argv
  t/helper/test-config: unify exit labels
  config: correctly read worktree configs in submodules
  grep: honor sparse checkout patterns
  config: add setting to ignore sparsity patterns in some cmds

 Documentation/config.txt               |   2 +
 Documentation/config/grep.txt          |  18 +-
 Documentation/config/sparse.txt        |  20 ++
 Documentation/git-grep.txt             |  36 +--
 Documentation/git.txt                  |   4 +
 Makefile                               |   1 +
 builtin/grep.c                         | 134 ++++++++++-
 config.c                               |  21 +-
 contrib/completion/git-completion.bash |   2 +
 git.c                                  |   5 +
 sparse-checkout.c                      |  18 ++
 sparse-checkout.h                      |  11 +
 t/helper/test-config.c                 | 126 ++++++----
 t/t1308-config-set.sh                  |   8 +-
 t/t2404-worktree-config.sh             |  16 ++
 t/t7011-skip-worktree-reading.sh       |   9 -
 t/t7817-grep-sparse-checkout.sh        | 321 +++++++++++++++++++++++++
 t/t9902-completion.sh                  |   4 +-
 18 files changed, 652 insertions(+), 104 deletions(-)
 create mode 100644 Documentation/config/sparse.txt
 create mode 100644 sparse-checkout.c
 create mode 100644 sparse-checkout.h
 create mode 100755 t/t7817-grep-sparse-checkout.sh

Range-diff against v4:
1:  fc47a96bfa = 1:  70c9a4e741 doc: grep: unify info on configuration variables
-:  ---------- > 2:  f53782f14c t1308-config-set: avoid false positives when using test-config
-:  ---------- > 3:  85e1588d6c t/helper/test-config: be consistent with exit codes
-:  ---------- > 4:  0750191342 t/helper/test-config: check argc before accessing argv
2:  874aab36dd ! 5:  56535b0e36 t/helper/test-config: return exit codes consistently
    @@ Metadata
     Author: Matheus Tavares <matheus.bernardino@usp.br>
     
      ## Commit message ##
    -    t/helper/test-config: return exit codes consistently
    +    t/helper/test-config: unify exit labels
     
    -    The test-config helper may exit with a variety of at least four
    -    different codes, to reflect the status of the requested operations.
    -    These codes are sometimes checked in the tests, but not all of the codes
    -    are returned consistently by the helper: 1 will usually refer to a
    -    "value not found", but usage errors can also return 1 or 128. Moreover,
    -    128 is also expected on errors within the configset functions. These
    -    inconsistent uses of the exit codes can lead to false positives in the
    -    tests. Although all tests which expect errors and check the helper's
    -    exit code currently also check the output, it's still better to
    -    standardize the exit codes and avoid future problems in new tests.
    -    While we are here, let's also check that we have the expected argc for
    -    configset_get_value and configset_get_value_multi, before trying to use
    -    argv.
    -
    -    Note: this change is implemented with the unification of the exit
    -    labels. This might seem unnecessary, for now, but it will benefit the
    -    next patch, which will increase the cleanup section.
    +    test-config's main function has three different exit labels, all of
    +    which have to perform the same cleanup code before returning. Unify the
    +    labels in preparation for a future patch which will increase the cleanup
    +    section.
     
         Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
     
      ## t/helper/test-config.c ##
    -@@
    -  * iterate -> iterate over all values using git_config(), and print some
    -  *            data for each
    -  *
    -+ * Exit codes:
    -+ *     0:   success
    -+ *     1:   value not found for the given config key
    -+ *     2:   config file path given as argument is inaccessible or doesn't exist
    -+ *     129: test-config usage error
    -+ *
    -+ * Note: tests may also expect 128 for die() calls in the config machinery.
    -+ *
    -  * Examples:
    -  *
    -  * To print the value with highest priority for key "foo.bAr Baz.rock":
     @@ t/helper/test-config.c: static int early_config_cb(const char *var, const char *value, void *vdata)
      	return 0;
      }
      
    -+enum test_config_exit_code {
    -+	TC_SUCCESS = 0,
    -+	TC_VALUE_NOT_FOUND = 1,
    -+	TC_CONFIG_FILE_ERROR = 2,
    -+	TC_USAGE_ERROR = 129,
    -+};
    ++#define TC_VALUE_NOT_FOUND 1
    ++#define TC_CONFIG_FILE_ERROR 2
     +
      int cmd__config(int argc, const char **argv)
      {
    - 	int i, val;
    +-	int i, val;
    ++	int i, val, ret = 0;
      	const char *v;
      	const struct string_list *strptr;
      	struct config_set cs;
    -+	enum test_config_exit_code ret = TC_SUCCESS;
      
      	if (argc == 3 && !strcmp(argv[1], "read_early_config")) {
      		read_early_config(early_config_cb, (void *)argv[2]);
     -		return 0;
    -+		return TC_SUCCESS;
    ++		return ret;
      	}
      
      	setup_git_directory();
    - 
    - 	git_configset_init(&cs);
    - 
    --	if (argc < 2) {
    --		fprintf(stderr, "Please, provide a command name on the command-line\n");
    --		goto exit1;
    --	} else if (argc == 3 && !strcmp(argv[1], "get_value")) {
    -+	if (argc < 2)
    -+		goto print_usage_error;
    -+
    -+	if (argc == 3 && !strcmp(argv[1], "get_value")) {
    - 		if (!git_config_get_value(argv[2], &v)) {
    - 			if (!v)
    +@@ t/helper/test-config.c: int cmd__config(int argc, const char **argv)
      				printf("(NULL)\n");
      			else
      				printf("%s\n", v);
    @@ t/helper/test-config.c: int cmd__config(int argc, const char **argv)
     -			goto exit1;
     +			ret = TC_VALUE_NOT_FOUND;
      		}
    --	} else if (!strcmp(argv[1], "configset_get_value")) {
    -+	} else if (argc >= 3 && !strcmp(argv[1], "configset_get_value")) {
    + 	} else if (argc >= 3 && !strcmp(argv[1], "configset_get_value")) {
      		for (i = 3; i < argc; i++) {
      			int err;
      			if ((err = git_configset_add_file(&cs, argv[i]))) {
    @@ t/helper/test-config.c: int cmd__config(int argc, const char **argv)
     -			goto exit1;
     +			ret = TC_VALUE_NOT_FOUND;
      		}
    --	} else if (!strcmp(argv[1], "configset_get_value_multi")) {
    -+	} else if (argc >= 3 && !strcmp(argv[1], "configset_get_value_multi")) {
    + 	} else if (argc >= 3 && !strcmp(argv[1], "configset_get_value_multi")) {
      		for (i = 3; i < argc; i++) {
      			int err;
      			if ((err = git_configset_add_file(&cs, argv[i]))) {
    @@ t/helper/test-config.c: int cmd__config(int argc, const char **argv)
      		git_config(iterate_cb, NULL);
     -		goto exit0;
     +	} else {
    -+print_usage_error:
    -+		fprintf(stderr, "Invalid syntax. Usage: test-tool config <cmd> [args]\n");
    -+		ret = TC_USAGE_ERROR;
    ++		die("%s: Please check the syntax and the function name", argv[0]);
      	}
      
     -	die("%s: Please check the syntax and the function name", argv[0]);
3:  c5093099f3 < -:  ---------- t/helper/test-config: facilitate addition of new cli options
4:  b510de0de0 ! 6:  3e02e1bd24 config: correctly read worktree configs in submodules
    @@ Metadata
      ## Commit message ##
         config: correctly read worktree configs in submodules
     
    -    One of the steps in do_git_config_sequence() is to load the
    -    worktree-specific config file. Although the function receives a git_dir
    -    string, it relies on git_pathdup(), which uses the_repository->git_dir,
    -    to make the path to the file. Furthermore, it also checks that
    -    extensions.worktreeConfig is set through the
    -    repository_format_worktree_config variable, which refers to
    -    the_repository only. Thus, when a submodule has worktree-specific
    -    settings, a command executed in the superproject that recurses into the
    -    submodule won't find the said settings.
    +    The config machinery is not able to read worktree configs from a
    +    submodule in a process where the_repository represents the superproject.
    +    Furthermore, when extensions.worktreeConfig is set on the superproject,
    +    querying for a worktree config in a submodule will, instead, return
    +    the value set at the superproject.
     
    -    This will be especially important in the next patch: git-grep will learn
    -    to honor sparse checkouts and, when running with --recurse-submodules,
    -    the submodule's sparse checkout settings must be loaded. As these
    -    settings are stored in the config.worktree file, they would be ignored
    -    without this patch. So let's fix this by reading the right
    -    config.worktree file and extensions.worktreeConfig setting, based on the
    -    git_dir and commondir paths given to do_git_config_sequence(). Also
    -    add a test to avoid any regressions.
    +    The problem resides in do_git_config_sequence(). Although the function
    +    receives a git_dir string, it uses the_repository->git_dir when making
    +    the path to the worktree config file. And when checking if
    +    extensions.worktreeConfig is set, it uses the global
    +    repository_format_worktree_config variable, which refers to
    +    the_repository only. So let's fix this by using the git_dir given to the
    +    function and reading the extension value from the right place. Also add
    +    a test to avoid any regressions.
     
         Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
     
    @@ t/helper/test-config.c
      #include "config.h"
      #include "string-list.h"
     +#include "submodule-config.h"
    ++#include "parse-options.h"
      
      /*
       * This program exposes the C API of the configuration mechanism
    @@ t/helper/test-config.c
       *
       * get_value -> prints the value with highest priority for the entered key
       *
    -@@ t/helper/test-config.c: int cmd__config(int argc, const char **argv)
    +@@ t/helper/test-config.c: static int early_config_cb(const char *var, const char *value, void *vdata)
    + #define TC_VALUE_NOT_FOUND 1
    + #define TC_CONFIG_FILE_ERROR 2
    + 
    ++static const char *test_config_usage[] = {
    ++	"test-tool config [--submodule=<path>] <cmd> [<args>]",
    ++	NULL
    ++};
    ++
    + int cmd__config(int argc, const char **argv)
    + {
    + 	int i, val, ret = 0;
    + 	const char *v;
      	const struct string_list *strptr;
    - 	struct config_set cs = { .hash_initialized = 0 };
    - 	enum test_config_exit_code ret = TC_SUCCESS;
    -+	struct repository *repo = the_repository;
    + 	struct config_set cs;
    ++	struct repository subrepo, *repo = the_repository;
     +	const char *subrepo_path = NULL;
    - 
    - 	argc--; /* skip over "config" */
    - 	argv++;
    -@@ t/helper/test-config.c: int cmd__config(int argc, const char **argv)
    - 	if (argc == 0)
    - 		goto print_usage_error;
    - 
    -+	if (skip_prefix(*argv, "--submodule=", &subrepo_path)) {
    -+		argc--;
    -+		argv++;
    -+		if (argc == 0)
    -+			goto print_usage_error;
    -+	}
     +
    - 	if (argc == 2 && !strcmp(argv[0], "read_early_config")) {
    -+		if (subrepo_path) {
    -+			fprintf(stderr, "Cannot use --submodule with read_early_config\n");
    -+			return TC_USAGE_ERROR;
    -+		}
    - 		read_early_config(early_config_cb, (void *)argv[1]);
    - 		return TC_SUCCESS;
    ++	struct option options[] = {
    ++		OPT_STRING(0, "submodule", &subrepo_path, "path",
    ++			   "run <cmd> on the submodule at <path>"),
    ++		OPT_END()
    ++	};
    ++
    ++	argc = parse_options(argc, argv, NULL, options, test_config_usage,
    ++			     PARSE_OPT_KEEP_ARGV0 | PARSE_OPT_STOP_AT_NON_OPTION);
    ++	if (argc < 2)
    ++		die("Please, provide a command name on the command-line");
    + 
    + 	if (argc == 3 && !strcmp(argv[1], "read_early_config")) {
    ++		if (subrepo_path)
    ++			die("cannot use --submodule with read_early_config");
    + 		read_early_config(early_config_cb, (void *)argv[2]);
    + 		return ret;
      	}
     @@ t/helper/test-config.c: int cmd__config(int argc, const char **argv)
    - 	setup_git_directory();
    + 
      	git_configset_init(&cs);
      
    +-	if (argc < 2)
    +-		die("Please, provide a command name on the command-line");
     +	if (subrepo_path) {
     +		const struct submodule *sub;
    -+		struct repository *subrepo = xcalloc(1, sizeof(*repo));
     +
     +		sub = submodule_from_path(the_repository, &null_oid, subrepo_path);
    -+		if (!sub || repo_submodule_init(subrepo, the_repository, sub)) {
    -+			fprintf(stderr, "Invalid argument to --submodule: '%s'\n",
    -+				subrepo_path);
    -+			free(subrepo);
    -+			ret = TC_USAGE_ERROR;
    -+			goto out;
    -+		}
    -+		repo = subrepo;
    -+	}
    ++		if (!sub || repo_submodule_init(&subrepo, the_repository, sub))
    ++			die("invalid argument to --submodule: '%s'", subrepo_path);
     +
    - 	if (argc == 2 && !strcmp(argv[0], "get_value")) {
    --		if (!git_config_get_value(argv[1], &v)) {
    -+		if (!repo_config_get_value(repo, argv[1], &v)) {
    ++		repo = &subrepo;
    ++	}
    + 
    + 	if (argc == 3 && !strcmp(argv[1], "get_value")) {
    +-		if (!git_config_get_value(argv[2], &v)) {
    ++		if (!repo_config_get_value(repo, argv[2], &v)) {
      			if (!v)
      				printf("(NULL)\n");
      			else
     @@ t/helper/test-config.c: int cmd__config(int argc, const char **argv)
      			ret = TC_VALUE_NOT_FOUND;
      		}
    - 	} else if (argc == 2 && !strcmp(argv[0], "get_value_multi")) {
    --		strptr = git_config_get_value_multi(argv[1]);
    -+		strptr = repo_config_get_value_multi(repo, argv[1]);
    + 	} else if (argc == 3 && !strcmp(argv[1], "get_value_multi")) {
    +-		strptr = git_config_get_value_multi(argv[2]);
    ++		strptr = repo_config_get_value_multi(repo, argv[2]);
      		if (strptr) {
      			for (i = 0; i < strptr->nr; i++) {
      				v = strptr->items[i].string;
     @@ t/helper/test-config.c: int cmd__config(int argc, const char **argv)
      			ret = TC_VALUE_NOT_FOUND;
      		}
    - 	} else if (argc == 2 && !strcmp(argv[0], "get_int")) {
    --		if (!git_config_get_int(argv[1], &val)) {
    -+		if (!repo_config_get_int(repo, argv[1], &val)) {
    + 	} else if (argc == 3 && !strcmp(argv[1], "get_int")) {
    +-		if (!git_config_get_int(argv[2], &val)) {
    ++		if (!repo_config_get_int(repo, argv[2], &val)) {
      			printf("%d\n", val);
      		} else {
    - 			printf("Value not found for \"%s\"\n", argv[1]);
    + 			printf("Value not found for \"%s\"\n", argv[2]);
      			ret = TC_VALUE_NOT_FOUND;
      		}
    - 	} else if (argc == 2 && !strcmp(argv[0], "get_bool")) {
    --		if (!git_config_get_bool(argv[1], &val)) {
    -+		if (!repo_config_get_bool(repo, argv[1], &val)) {
    + 	} else if (argc == 3 && !strcmp(argv[1], "get_bool")) {
    +-		if (!git_config_get_bool(argv[2], &val)) {
    ++		if (!repo_config_get_bool(repo, argv[2], &val)) {
      			printf("%d\n", val);
      		} else {
     +
    - 			printf("Value not found for \"%s\"\n", argv[1]);
    + 			printf("Value not found for \"%s\"\n", argv[2]);
      			ret = TC_VALUE_NOT_FOUND;
      		}
    - 	} else if (argc == 2 && !strcmp(argv[0], "get_string")) {
    --		if (!git_config_get_string_tmp(argv[1], &v)) {
    -+		if (!repo_config_get_string_tmp(repo, argv[1], &v)) {
    + 	} else if (argc == 3 && !strcmp(argv[1], "get_string")) {
    +-		if (!git_config_get_string_tmp(argv[2], &v)) {
    ++		if (!repo_config_get_string_tmp(repo, argv[2], &v)) {
      			printf("%s\n", v);
      		} else {
    - 			printf("Value not found for \"%s\"\n", argv[1]);
    + 			printf("Value not found for \"%s\"\n", argv[2]);
      			ret = TC_VALUE_NOT_FOUND;
      		}
    - 	} else if (argc >= 2 && !strcmp(argv[0], "configset_get_value")) {
    -+		if (subrepo_path) {
    -+			fprintf(stderr, "Cannot use --submodule with configset_get_value\n");
    -+			ret = TC_USAGE_ERROR;
    -+			goto out;
    -+		}
    - 		for (i = 2; i < argc; i++) {
    + 	} else if (argc >= 3 && !strcmp(argv[1], "configset_get_value")) {
    ++		if (subrepo_path)
    ++			die("cannot use --submodule with configset_get_value");
    ++
    + 		for (i = 3; i < argc; i++) {
      			int err;
      			if ((err = git_configset_add_file(&cs, argv[i]))) {
     @@ t/helper/test-config.c: int cmd__config(int argc, const char **argv)
      			ret = TC_VALUE_NOT_FOUND;
      		}
    - 	} else if (argc >= 2 && !strcmp(argv[0], "configset_get_value_multi")) {
    -+		if (subrepo_path) {
    -+			fprintf(stderr, "Cannot use --submodule with configset_get_value_multi\n");
    -+			ret = TC_USAGE_ERROR;
    -+			goto out;
    -+		}
    - 		for (i = 2; i < argc; i++) {
    + 	} else if (argc >= 3 && !strcmp(argv[1], "configset_get_value_multi")) {
    ++		if (subrepo_path)
    ++			die("cannot use --submodule with configset_get_value_multi");
    ++
    + 		for (i = 3; i < argc; i++) {
      			int err;
      			if ((err = git_configset_add_file(&cs, argv[i]))) {
     @@ t/helper/test-config.c: int cmd__config(int argc, const char **argv)
      			ret = TC_VALUE_NOT_FOUND;
      		}
    - 	} else if (!strcmp(argv[0], "iterate")) {
    + 	} else if (!strcmp(argv[1], "iterate")) {
     -		git_config(iterate_cb, NULL);
     +		repo_config(repo, iterate_cb, NULL);
      	} else {
    - print_usage_error:
    --		fprintf(stderr, "Invalid syntax. Usage: test-tool config <cmd> [args]\n");
    -+		fprintf(stderr, "Invalid syntax. Usage: test-tool config"
    -+				" [--submodule=<path>] <cmd> [args]\n");
    - 		ret = TC_USAGE_ERROR;
    + 		die("%s: Please check the syntax and the function name", argv[0]);
      	}
      
      out:
      	git_configset_clear(&cs);
    -+	if (repo != the_repository) {
    ++	if (repo != the_repository)
     +		repo_clear(repo);
    -+		free(repo);
    -+	}
      	return ret;
      }
     
5:  6d9720abf5 = 7:  902556a7b6 grep: honor sparse checkout patterns
6:  affb931d35 = 8:  70e7d7b90c config: add setting to ignore sparsity patterns in some cmds
-- 
2.28.0


^ permalink raw reply	[flat|nested] 120+ messages in thread

* [PATCH v5 1/8] doc: grep: unify info on configuration variables
  2020-09-02  6:17       ` [PATCH v5 0/8] " Matheus Tavares
@ 2020-09-02  6:17         ` Matheus Tavares
  2020-09-02  6:17         ` [PATCH v5 2/8] t1308-config-set: avoid false positives when using test-config Matheus Tavares
                           ` (7 subsequent siblings)
  8 siblings, 0 replies; 120+ messages in thread
From: Matheus Tavares @ 2020-09-02  6:17 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, newren, jonathantanmy, jrnieder

Explanations about the configuration variables for git-grep are
duplicated in "Documentation/git-grep.txt" and
"Documentation/config/grep.txt", which can make maintenance difficult.
The first also contains a definition not present in the latter
(grep.fullName). To avoid problems like this, let's unify the
information in the second file and include it in the first.

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---
 Documentation/config/grep.txt | 10 ++++++++--
 Documentation/git-grep.txt    | 36 ++++++-----------------------------
 2 files changed, 14 insertions(+), 32 deletions(-)

diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
index 44abe45a7c..dd51db38e1 100644
--- a/Documentation/config/grep.txt
+++ b/Documentation/config/grep.txt
@@ -16,8 +16,14 @@ grep.extendedRegexp::
 	other than 'default'.
 
 grep.threads::
-	Number of grep worker threads to use.
-	See `grep.threads` in linkgit:git-grep[1] for more information.
+	Number of grep worker threads to use. See `--threads`
+ifndef::git-grep[]
+	in linkgit:git-grep[1]
+endif::git-grep[]
+	for more information.
+
+grep.fullName::
+	If set to true, enable `--full-name` option by default.
 
 grep.fallbackToNoIndex::
 	If set to true, fall back to git grep --no-index if git grep
diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
index a7f9bc99ea..9bdf807584 100644
--- a/Documentation/git-grep.txt
+++ b/Documentation/git-grep.txt
@@ -41,34 +41,8 @@ characters.  An empty string as search expression matches all lines.
 CONFIGURATION
 -------------
 
-grep.lineNumber::
-	If set to true, enable `-n` option by default.
-
-grep.column::
-	If set to true, enable the `--column` option by default.
-
-grep.patternType::
-	Set the default matching behavior. Using a value of 'basic', 'extended',
-	'fixed', or 'perl' will enable the `--basic-regexp`, `--extended-regexp`,
-	`--fixed-strings`, or `--perl-regexp` option accordingly, while the
-	value 'default' will return to the default matching behavior.
-
-grep.extendedRegexp::
-	If set to true, enable `--extended-regexp` option by default. This
-	option is ignored when the `grep.patternType` option is set to a value
-	other than 'default'.
-
-grep.threads::
-	Number of grep worker threads to use. If unset (or set to 0), Git will
-	use as many threads as the number of logical cores available.
-
-grep.fullName::
-	If set to true, enable `--full-name` option by default.
-
-grep.fallbackToNoIndex::
-	If set to true, fall back to git grep --no-index if git grep
-	is executed outside of a git repository.  Defaults to false.
-
+:git-grep: 1
+include::config/grep.txt[]
 
 OPTIONS
 -------
@@ -269,8 +243,10 @@ providing this option will cause it to die.
 	found.
 
 --threads <num>::
-	Number of grep worker threads to use.
-	See `grep.threads` in 'CONFIGURATION' for more information.
+	Number of grep worker threads to use. If not provided (or set to
+	0), Git will use as many worker threads as the number of logical
+	cores available. The default value can also be set with the
+	`grep.threads` configuration.
 
 -f <file>::
 	Read patterns from <file>, one per line.
-- 
2.28.0


^ permalink raw reply	[flat|nested] 120+ messages in thread

* [PATCH v5 2/8] t1308-config-set: avoid false positives when using test-config
  2020-09-02  6:17       ` [PATCH v5 0/8] " Matheus Tavares
  2020-09-02  6:17         ` [PATCH v5 1/8] doc: grep: unify info on configuration variables Matheus Tavares
@ 2020-09-02  6:17         ` Matheus Tavares
  2020-09-02  6:57           ` Eric Sunshine
  2020-09-02  6:17         ` [PATCH v5 3/8] t/helper/test-config: be consistent with exit codes Matheus Tavares
                           ` (6 subsequent siblings)
  8 siblings, 1 reply; 120+ messages in thread
From: Matheus Tavares @ 2020-09-02  6:17 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, newren, jonathantanmy, jrnieder

One test in t1308 expects test-config to fail with exit code 128 due to
a parsing error in the config machinery. But test-config might also exit
with 128 for any other reason that leads it to call die(). Therefore the
test can potentially succeed for the wrong reason. To avoid false
positives, let's check test-config's output, in addition to the exit
code, and make sure that the cause of the error is the one we expect in
this test.

Moreover, the test was using the auxiliary function check_config which
optionally takes a string to compare the test-config stdout against.
Because this string is optional, there is a risk that future callers may
also check only the exit code and not the output. To avoid that, make
the string parameter of this function mandatory.
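
The reasoning above can be sketched outside of the shell test framework. The
C helper below is only an illustration with a hypothetical name
(`failed_as_expected()`); it shows why matching the exit code alone is not
enough when several unrelated failures share code 128.

```c
#include <string.h>

/*
 * Hypothetical check: a failure counts as "the expected one" only when
 * both the exit code and the error text match. Comparing the code alone
 * would also accept any unrelated failure that exits with the same code.
 */
static int failed_as_expected(int code, const char *err_output,
			      int want_code, const char *want_msg)
{
	return code == want_code && strstr(err_output, want_msg) != NULL;
}
```

A command that dies with "fatal: not a git repository" also exits with 128,
but it would be rejected here because the message does not match.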

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---
 t/t1308-config-set.sh | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/t/t1308-config-set.sh b/t/t1308-config-set.sh
index 3a527e3a84..cff17120dc 100755
--- a/t/t1308-config-set.sh
+++ b/t/t1308-config-set.sh
@@ -14,10 +14,7 @@ check_config () {
 		expect_code=0
 	fi &&
 	op=$1 key=$2 && shift && shift &&
-	if test $# != 0
-	then
-		printf "%s\n" "$@"
-	fi >expect &&
+	printf "%s\n" "$@" >expect &&
 	test_expect_code $expect_code test-tool config "$op" "$key" >actual &&
 	test_cmp expect actual
 }
@@ -130,7 +127,8 @@ test_expect_success 'check line error when NULL string is queried' '
 '
 
 test_expect_success 'find integer if value is non parse-able' '
-	check_config expect_code 128 get_int lamb.head
+	test_expect_code 128 test-tool config get_int lamb.head 2>result &&
+	test_i18ngrep "fatal: bad numeric config value '\'none\'' for '\'lamb.head\''" result
 '
 
 test_expect_success 'find bool value for the entered key' '
-- 
2.28.0


^ permalink raw reply	[flat|nested] 120+ messages in thread

* [PATCH v5 3/8] t/helper/test-config: be consistent with exit codes
  2020-09-02  6:17       ` [PATCH v5 0/8] " Matheus Tavares
  2020-09-02  6:17         ` [PATCH v5 1/8] doc: grep: unify info on configuration variables Matheus Tavares
  2020-09-02  6:17         ` [PATCH v5 2/8] t1308-config-set: avoid false positives when using test-config Matheus Tavares
@ 2020-09-02  6:17         ` Matheus Tavares
  2020-09-02  6:17         ` [PATCH v5 4/8] t/helper/test-config: check argc before accessing argv Matheus Tavares
                           ` (5 subsequent siblings)
  8 siblings, 0 replies; 120+ messages in thread
From: Matheus Tavares @ 2020-09-02  6:17 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, newren, jonathantanmy, jrnieder

The test-config helper can return at least three different exit codes to
reflect the status of the requested operation. And these codes are
checked in some of the tests. But there is one inconsistent place in the
helper where a usage error returns the same code as a "value not found"
error. Let's fix that and, while we are here, document the meaning of
each exit code in the file's header.
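
As a rough illustration of the convention (not the helper's actual code), the
documented codes could be modeled as below. The enum and `exit_code_for()`
are hypothetical names; the values 0, 1 and 2 are the ones this patch
documents, and 128 is the code die() exits with.

```c
/* Hypothetical names; the values mirror the ones documented in the patch. */
enum lookup_status {
	LOOKUP_OK = 0,
	LOOKUP_VALUE_NOT_FOUND = 1,
	LOOKUP_CONFIG_FILE_ERROR = 2,
	LOOKUP_FATAL = 128 /* what die() exits with; reserved for fatal errors */
};

/*
 * Map the outcome of a lookup to an exit code. Usage errors are not
 * handled here on purpose: they go through die() and therefore never
 * share a code with "value not found".
 */
static int exit_code_for(int file_ok, int value_found)
{
	if (!file_ok)
		return LOOKUP_CONFIG_FILE_ERROR;
	if (!value_found)
		return LOOKUP_VALUE_NOT_FOUND;
	return LOOKUP_OK;
}
```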

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---
 t/helper/test-config.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/t/helper/test-config.c b/t/helper/test-config.c
index a6e936721f..9e9d50099a 100644
--- a/t/helper/test-config.c
+++ b/t/helper/test-config.c
@@ -30,6 +30,11 @@
  * iterate -> iterate over all values using git_config(), and print some
  *            data for each
  *
+ * Exit codes:
+ *     0:   success
+ *     1:   value not found for the given config key
+ *     2:   config file path given as argument is inaccessible or doesn't exist
+ *
  * Examples:
  *
  * To print the value with highest priority for key "foo.bAr Baz.rock":
@@ -80,10 +85,10 @@ int cmd__config(int argc, const char **argv)
 
 	git_configset_init(&cs);
 
-	if (argc < 2) {
-		fprintf(stderr, "Please, provide a command name on the command-line\n");
-		goto exit1;
-	} else if (argc == 3 && !strcmp(argv[1], "get_value")) {
+	if (argc < 2)
+		die("Please, provide a command name on the command-line");
+
+	if (argc == 3 && !strcmp(argv[1], "get_value")) {
 		if (!git_config_get_value(argv[2], &v)) {
 			if (!v)
 				printf("(NULL)\n");
-- 
2.28.0


^ permalink raw reply	[flat|nested] 120+ messages in thread

* [PATCH v5 4/8] t/helper/test-config: check argc before accessing argv
  2020-09-02  6:17       ` [PATCH v5 0/8] " Matheus Tavares
                           ` (2 preceding siblings ...)
  2020-09-02  6:17         ` [PATCH v5 3/8] t/helper/test-config: be consistent with exit codes Matheus Tavares
@ 2020-09-02  6:17         ` Matheus Tavares
  2020-09-02  7:18           ` Eric Sunshine
  2020-09-02  6:17         ` [PATCH v5 5/8] t/helper/test-config: unify exit labels Matheus Tavares
                           ` (4 subsequent siblings)
  8 siblings, 1 reply; 120+ messages in thread
From: Matheus Tavares @ 2020-09-02  6:17 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, newren, jonathantanmy, jrnieder

Check that we have the expected argc in 'configset_get_value' and
'configset_get_value_multi' before trying to access argv elements.
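
A minimal sketch of the guard, assuming a hypothetical helper
`is_subcommand()` that does not exist in test-config.c: the argc comparison
comes first, so the short-circuit `&&` guarantees the caller may read
argv[2] once the check passes.

```c
#include <string.h>

/*
 * Hypothetical helper: returns 1 when argv[1] names the subcommand and
 * argv holds at least min_argc entries. min_argc must be at least 2 so
 * that argv[1] itself is valid; passing 3 also makes argv[2] safe to
 * read afterwards.
 */
static int is_subcommand(int argc, const char **argv,
			 const char *name, int min_argc)
{
	return argc >= min_argc && !strcmp(argv[1], name);
}
```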

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---
 t/helper/test-config.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/t/helper/test-config.c b/t/helper/test-config.c
index 9e9d50099a..26d9c2ac4c 100644
--- a/t/helper/test-config.c
+++ b/t/helper/test-config.c
@@ -138,7 +138,7 @@ int cmd__config(int argc, const char **argv)
 			printf("Value not found for \"%s\"\n", argv[2]);
 			goto exit1;
 		}
-	} else if (!strcmp(argv[1], "configset_get_value")) {
+	} else if (argc >= 3 && !strcmp(argv[1], "configset_get_value")) {
 		for (i = 3; i < argc; i++) {
 			int err;
 			if ((err = git_configset_add_file(&cs, argv[i]))) {
@@ -156,7 +156,7 @@ int cmd__config(int argc, const char **argv)
 			printf("Value not found for \"%s\"\n", argv[2]);
 			goto exit1;
 		}
-	} else if (!strcmp(argv[1], "configset_get_value_multi")) {
+	} else if (argc >= 3 && !strcmp(argv[1], "configset_get_value_multi")) {
 		for (i = 3; i < argc; i++) {
 			int err;
 			if ((err = git_configset_add_file(&cs, argv[i]))) {
-- 
2.28.0


^ permalink raw reply	[flat|nested] 120+ messages in thread

* [PATCH v5 5/8] t/helper/test-config: unify exit labels
  2020-09-02  6:17       ` [PATCH v5 0/8] " Matheus Tavares
                           ` (3 preceding siblings ...)
  2020-09-02  6:17         ` [PATCH v5 4/8] t/helper/test-config: check argc before accessing argv Matheus Tavares
@ 2020-09-02  6:17         ` Matheus Tavares
  2020-09-02  7:30           ` Eric Sunshine
  2020-09-02  6:17         ` [PATCH v5 6/8] config: correctly read worktree configs in submodules Matheus Tavares
                           ` (3 subsequent siblings)
  8 siblings, 1 reply; 120+ messages in thread
From: Matheus Tavares @ 2020-09-02  6:17 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, newren, jonathantanmy, jrnieder

test-config's main function has three different exit labels, all of
which run the same cleanup code before returning. Unify the labels in
preparation for the next patch, which will extend the cleanup section.
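
As a rough illustration of the single-exit pattern described above, here
is a minimal standalone sketch. It is not the test-config code: the
demo_lookup() function and the DEMO_* codes are hypothetical names used
only to show how one "out" label can serve every return value.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_VALUE_NOT_FOUND 1
#define DEMO_FILE_ERROR 2

/*
 * Hypothetical example: look for a line starting with `key` in the
 * file at `path`. Every path after the allocation leaves through the
 * single "out" label, so the cleanup is written only once.
 */
static int demo_lookup(const char *path, const char *key)
{
	int ret = 0;
	char *buf = malloc(64);	/* resource that each exit path must free */
	FILE *fp = NULL;

	if (!buf)
		return DEMO_FILE_ERROR;

	fp = fopen(path, "r");
	if (!fp) {
		ret = DEMO_FILE_ERROR;
		goto out;
	}

	/* Assume "not found" until a matching line shows up. */
	ret = DEMO_VALUE_NOT_FOUND;
	while (fgets(buf, 64, fp)) {
		if (!strncmp(buf, key, strlen(key))) {
			ret = 0;
			break;
		}
	}

out:
	if (fp)
		fclose(fp);
	free(buf);
	return ret;
}
```

Growing the cleanup later then only means touching the code after the
"out" label, rather than three separate exit blocks.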

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---
 t/helper/test-config.c | 51 +++++++++++++++++-------------------------
 1 file changed, 20 insertions(+), 31 deletions(-)

diff --git a/t/helper/test-config.c b/t/helper/test-config.c
index 26d9c2ac4c..8fe43e9775 100644
--- a/t/helper/test-config.c
+++ b/t/helper/test-config.c
@@ -69,16 +69,19 @@ static int early_config_cb(const char *var, const char *value, void *vdata)
 	return 0;
 }
 
+#define TC_VALUE_NOT_FOUND 1
+#define TC_CONFIG_FILE_ERROR 2
+
 int cmd__config(int argc, const char **argv)
 {
-	int i, val;
+	int i, val, ret = 0;
 	const char *v;
 	const struct string_list *strptr;
 	struct config_set cs;
 
 	if (argc == 3 && !strcmp(argv[1], "read_early_config")) {
 		read_early_config(early_config_cb, (void *)argv[2]);
-		return 0;
+		return ret;
 	}
 
 	setup_git_directory();
@@ -94,10 +97,9 @@ int cmd__config(int argc, const char **argv)
 				printf("(NULL)\n");
 			else
 				printf("%s\n", v);
-			goto exit0;
 		} else {
 			printf("Value not found for \"%s\"\n", argv[2]);
-			goto exit1;
+			ret = TC_VALUE_NOT_FOUND;
 		}
 	} else if (argc == 3 && !strcmp(argv[1], "get_value_multi")) {
 		strptr = git_config_get_value_multi(argv[2]);
@@ -109,41 +111,38 @@ int cmd__config(int argc, const char **argv)
 				else
 					printf("%s\n", v);
 			}
-			goto exit0;
 		} else {
 			printf("Value not found for \"%s\"\n", argv[2]);
-			goto exit1;
+			ret = TC_VALUE_NOT_FOUND;
 		}
 	} else if (argc == 3 && !strcmp(argv[1], "get_int")) {
 		if (!git_config_get_int(argv[2], &val)) {
 			printf("%d\n", val);
-			goto exit0;
 		} else {
 			printf("Value not found for \"%s\"\n", argv[2]);
-			goto exit1;
+			ret = TC_VALUE_NOT_FOUND;
 		}
 	} else if (argc == 3 && !strcmp(argv[1], "get_bool")) {
 		if (!git_config_get_bool(argv[2], &val)) {
 			printf("%d\n", val);
-			goto exit0;
 		} else {
 			printf("Value not found for \"%s\"\n", argv[2]);
-			goto exit1;
+			ret = TC_VALUE_NOT_FOUND;
 		}
 	} else if (argc == 3 && !strcmp(argv[1], "get_string")) {
 		if (!git_config_get_string_tmp(argv[2], &v)) {
 			printf("%s\n", v);
-			goto exit0;
 		} else {
 			printf("Value not found for \"%s\"\n", argv[2]);
-			goto exit1;
+			ret = TC_VALUE_NOT_FOUND;
 		}
 	} else if (argc >= 3 && !strcmp(argv[1], "configset_get_value")) {
 		for (i = 3; i < argc; i++) {
 			int err;
 			if ((err = git_configset_add_file(&cs, argv[i]))) {
 				fprintf(stderr, "Error (%d) reading configuration file %s.\n", err, argv[i]);
-				goto exit2;
+				ret = TC_CONFIG_FILE_ERROR;
+				goto out;
 			}
 		}
 		if (!git_configset_get_value(&cs, argv[2], &v)) {
@@ -151,17 +150,17 @@ int cmd__config(int argc, const char **argv)
 				printf("(NULL)\n");
 			else
 				printf("%s\n", v);
-			goto exit0;
 		} else {
 			printf("Value not found for \"%s\"\n", argv[2]);
-			goto exit1;
+			ret = TC_VALUE_NOT_FOUND;
 		}
 	} else if (argc >= 3 && !strcmp(argv[1], "configset_get_value_multi")) {
 		for (i = 3; i < argc; i++) {
 			int err;
 			if ((err = git_configset_add_file(&cs, argv[i]))) {
 				fprintf(stderr, "Error (%d) reading configuration file %s.\n", err, argv[i]);
-				goto exit2;
+				ret = TC_CONFIG_FILE_ERROR;
+				goto out;
 			}
 		}
 		strptr = git_configset_get_value_multi(&cs, argv[2]);
@@ -173,27 +172,17 @@ int cmd__config(int argc, const char **argv)
 				else
 					printf("%s\n", v);
 			}
-			goto exit0;
 		} else {
 			printf("Value not found for \"%s\"\n", argv[2]);
-			goto exit1;
+			ret = TC_VALUE_NOT_FOUND;
 		}
 	} else if (!strcmp(argv[1], "iterate")) {
 		git_config(iterate_cb, NULL);
-		goto exit0;
+	} else {
+		die("%s: Please check the syntax and the function name", argv[0]);
 	}
 
-	die("%s: Please check the syntax and the function name", argv[0]);
-
-exit0:
-	git_configset_clear(&cs);
-	return 0;
-
-exit1:
-	git_configset_clear(&cs);
-	return 1;
-
-exit2:
+out:
 	git_configset_clear(&cs);
-	return 2;
+	return ret;
 }
-- 
2.28.0


^ permalink raw reply	[flat|nested] 120+ messages in thread

* [PATCH v5 6/8] config: correctly read worktree configs in submodules
  2020-09-02  6:17       ` [PATCH v5 0/8] " Matheus Tavares
                           ` (4 preceding siblings ...)
  2020-09-02  6:17         ` [PATCH v5 5/8] t/helper/test-config: unify exit labels Matheus Tavares
@ 2020-09-02  6:17         ` Matheus Tavares
  2020-09-02 20:15           ` Jonathan Nieder
  2020-09-02  6:17         ` [PATCH v5 7/8] grep: honor sparse checkout patterns Matheus Tavares
                           ` (2 subsequent siblings)
  8 siblings, 1 reply; 120+ messages in thread
From: Matheus Tavares @ 2020-09-02  6:17 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, newren, jonathantanmy, jrnieder

The config machinery is not able to read worktree configs from a
submodule in a process where the_repository represents the superproject.
Furthermore, when extensions.worktreeConfig is set on the superproject,
querying for a worktree config in a submodule will, instead, return
the value set at the superproject.

The problem resides in do_git_config_sequence(). Although the function
receives a git_dir string, it uses the_repository->git_dir when making
the path to the worktree config file. And when checking if
extensions.worktreeConfig is set, it uses the global
repository_format_worktree_config variable, which refers to
the_repository only. So let's fix this by using the git_dir given to the
function and reading the extension value from the right place. Also add
a test to avoid any regressions.
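
The core of the fix, deriving the path from the caller's git_dir rather
than from process-wide state, can be sketched in isolation as follows.
This is only an illustration under stated assumptions:
demo_worktree_config_path() is a hypothetical helper and not part of
Git's API (the patch itself uses mkpathdup()).

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not Git's API): build "<git_dir>/config.worktree"
 * from the directory handed in by the caller instead of a global.
 * Returns a heap-allocated string the caller must free, or NULL if the
 * allocation fails.
 */
static char *demo_worktree_config_path(const char *git_dir)
{
	static const char suffix[] = "/config.worktree";
	size_t len;
	char *path;

	assert(git_dir != NULL);
	len = strlen(git_dir) + sizeof(suffix);	/* sizeof counts the NUL */
	path = malloc(len);
	if (!path)
		return NULL;
	snprintf(path, len, "%s%s", git_dir, suffix);
	return path;
}
```

With this shape, a caller working on a submodule passes the submodule's
git directory and gets the submodule's file, regardless of which
repository the process started in.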

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---
 config.c                   | 21 ++++++++++---
 t/helper/test-config.c     | 62 ++++++++++++++++++++++++++++++++------
 t/t2404-worktree-config.sh | 16 ++++++++++
 3 files changed, 85 insertions(+), 14 deletions(-)

diff --git a/config.c b/config.c
index 2bdff4457b..e1e7fab6dc 100644
--- a/config.c
+++ b/config.c
@@ -1747,11 +1747,22 @@ static int do_git_config_sequence(const struct config_options *opts,
 		ret += git_config_from_file(fn, repo_config, data);
 
 	current_parsing_scope = CONFIG_SCOPE_WORKTREE;
-	if (!opts->ignore_worktree && repository_format_worktree_config) {
-		char *path = git_pathdup("config.worktree");
-		if (!access_or_die(path, R_OK, 0))
-			ret += git_config_from_file(fn, path, data);
-		free(path);
+	if (!opts->ignore_worktree && repo_config && opts->git_dir) {
+		struct repository_format repo_fmt = REPOSITORY_FORMAT_INIT;
+		struct strbuf buf = STRBUF_INIT;
+
+		read_repository_format(&repo_fmt, repo_config);
+
+		if (!verify_repository_format(&repo_fmt, &buf) &&
+		    repo_fmt.worktree_config) {
+			char *path = mkpathdup("%s/config.worktree", opts->git_dir);
+			if (!access_or_die(path, R_OK, 0))
+				ret += git_config_from_file(fn, path, data);
+			free(path);
+		}
+
+		strbuf_release(&buf);
+		clear_repository_format(&repo_fmt);
 	}
 
 	current_parsing_scope = CONFIG_SCOPE_COMMAND;
diff --git a/t/helper/test-config.c b/t/helper/test-config.c
index 8fe43e9775..2924c09c21 100644
--- a/t/helper/test-config.c
+++ b/t/helper/test-config.c
@@ -2,12 +2,20 @@
 #include "cache.h"
 #include "config.h"
 #include "string-list.h"
+#include "submodule-config.h"
+#include "parse-options.h"
 
 /*
  * This program exposes the C API of the configuration mechanism
  * as a set of simple commands in order to facilitate testing.
  *
- * Reads stdin and prints result of command to stdout:
+ * Usage: test-tool config [--submodule=<path>] <cmd> [<args>]
+ *
+ * If --submodule=<path> is given, <cmd> will operate on the submodule at the
+ * given <path>. This option is not valid for the commands: read_early_config,
+ * configset_get_value and configset_get_value_multi.
+ *
+ * Possible cmds are:
  *
  * get_value -> prints the value with highest priority for the entered key
  *
@@ -72,14 +80,34 @@ static int early_config_cb(const char *var, const char *value, void *vdata)
 #define TC_VALUE_NOT_FOUND 1
 #define TC_CONFIG_FILE_ERROR 2
 
+static const char *test_config_usage[] = {
+	"test-tool config [--submodule=<path>] <cmd> [<args>]",
+	NULL
+};
+
 int cmd__config(int argc, const char **argv)
 {
 	int i, val, ret = 0;
 	const char *v;
 	const struct string_list *strptr;
 	struct config_set cs;
+	struct repository subrepo, *repo = the_repository;
+	const char *subrepo_path = NULL;
+
+	struct option options[] = {
+		OPT_STRING(0, "submodule", &subrepo_path, "path",
+			   "run <cmd> on the submodule at <path>"),
+		OPT_END()
+	};
+
+	argc = parse_options(argc, argv, NULL, options, test_config_usage,
+			     PARSE_OPT_KEEP_ARGV0 | PARSE_OPT_STOP_AT_NON_OPTION);
+	if (argc < 2)
+		die("Please, provide a command name on the command-line");
 
 	if (argc == 3 && !strcmp(argv[1], "read_early_config")) {
+		if (subrepo_path)
+			die("cannot use --submodule with read_early_config");
 		read_early_config(early_config_cb, (void *)argv[2]);
 		return ret;
 	}
@@ -88,11 +116,18 @@ int cmd__config(int argc, const char **argv)
 
 	git_configset_init(&cs);
 
-	if (argc < 2)
-		die("Please, provide a command name on the command-line");
+	if (subrepo_path) {
+		const struct submodule *sub;
+
+		sub = submodule_from_path(the_repository, &null_oid, subrepo_path);
+		if (!sub || repo_submodule_init(&subrepo, the_repository, sub))
+			die("invalid argument to --submodule: '%s'", subrepo_path);
+
+		repo = &subrepo;
+	}
 
 	if (argc == 3 && !strcmp(argv[1], "get_value")) {
-		if (!git_config_get_value(argv[2], &v)) {
+		if (!repo_config_get_value(repo, argv[2], &v)) {
 			if (!v)
 				printf("(NULL)\n");
 			else
@@ -102,7 +137,7 @@ int cmd__config(int argc, const char **argv)
 			ret = TC_VALUE_NOT_FOUND;
 		}
 	} else if (argc == 3 && !strcmp(argv[1], "get_value_multi")) {
-		strptr = git_config_get_value_multi(argv[2]);
+		strptr = repo_config_get_value_multi(repo, argv[2]);
 		if (strptr) {
 			for (i = 0; i < strptr->nr; i++) {
 				v = strptr->items[i].string;
@@ -116,27 +151,31 @@ int cmd__config(int argc, const char **argv)
 			ret = TC_VALUE_NOT_FOUND;
 		}
 	} else if (argc == 3 && !strcmp(argv[1], "get_int")) {
-		if (!git_config_get_int(argv[2], &val)) {
+		if (!repo_config_get_int(repo, argv[2], &val)) {
 			printf("%d\n", val);
 		} else {
 			printf("Value not found for \"%s\"\n", argv[2]);
 			ret = TC_VALUE_NOT_FOUND;
 		}
 	} else if (argc == 3 && !strcmp(argv[1], "get_bool")) {
-		if (!git_config_get_bool(argv[2], &val)) {
+		if (!repo_config_get_bool(repo, argv[2], &val)) {
 			printf("%d\n", val);
 		} else {
+
 			printf("Value not found for \"%s\"\n", argv[2]);
 			ret = TC_VALUE_NOT_FOUND;
 		}
 	} else if (argc == 3 && !strcmp(argv[1], "get_string")) {
-		if (!git_config_get_string_tmp(argv[2], &v)) {
+		if (!repo_config_get_string_tmp(repo, argv[2], &v)) {
 			printf("%s\n", v);
 		} else {
 			printf("Value not found for \"%s\"\n", argv[2]);
 			ret = TC_VALUE_NOT_FOUND;
 		}
 	} else if (argc >= 3 && !strcmp(argv[1], "configset_get_value")) {
+		if (subrepo_path)
+			die("cannot use --submodule with configset_get_value");
+
 		for (i = 3; i < argc; i++) {
 			int err;
 			if ((err = git_configset_add_file(&cs, argv[i]))) {
@@ -155,6 +194,9 @@ int cmd__config(int argc, const char **argv)
 			ret = TC_VALUE_NOT_FOUND;
 		}
 	} else if (argc >= 3 && !strcmp(argv[1], "configset_get_value_multi")) {
+		if (subrepo_path)
+			die("cannot use --submodule with configset_get_value_multi");
+
 		for (i = 3; i < argc; i++) {
 			int err;
 			if ((err = git_configset_add_file(&cs, argv[i]))) {
@@ -177,12 +219,14 @@ int cmd__config(int argc, const char **argv)
 			ret = TC_VALUE_NOT_FOUND;
 		}
 	} else if (!strcmp(argv[1], "iterate")) {
-		git_config(iterate_cb, NULL);
+		repo_config(repo, iterate_cb, NULL);
 	} else {
 		die("%s: Please check the syntax and the function name", argv[0]);
 	}
 
 out:
 	git_configset_clear(&cs);
+	if (repo != the_repository)
+		repo_clear(repo);
 	return ret;
 }
diff --git a/t/t2404-worktree-config.sh b/t/t2404-worktree-config.sh
index 9536d10919..1e32c93735 100755
--- a/t/t2404-worktree-config.sh
+++ b/t/t2404-worktree-config.sh
@@ -78,4 +78,20 @@ test_expect_success 'config.worktree no longer read without extension' '
 	test_cmp_config -C wt2 shared this.is
 '
 
+test_expect_success 'correctly read config.worktree from submodules' '
+	test_unconfig extensions.worktreeConfig &&
+	git init sub &&
+	(
+		cd sub &&
+		test_commit A &&
+		git config extensions.worktreeConfig true &&
+		git config --worktree wtconfig.sub test-value
+	) &&
+	git submodule add ./sub &&
+	git commit -m "add sub" &&
+	echo test-value >expect &&
+	test-tool config --submodule=sub get_value wtconfig.sub >actual &&
+	test_cmp expect actual
+'
+
 test_done
-- 
2.28.0


^ permalink raw reply	[flat|nested] 120+ messages in thread

* [PATCH v5 7/8] grep: honor sparse checkout patterns
  2020-09-02  6:17       ` [PATCH v5 0/8] " Matheus Tavares
                           ` (5 preceding siblings ...)
  2020-09-02  6:17         ` [PATCH v5 6/8] config: correctly read worktree configs in submodules Matheus Tavares
@ 2020-09-02  6:17         ` Matheus Tavares
  2020-09-02  6:17         ` [PATCH v5 8/8] config: add setting to ignore sparsity patterns in some cmds Matheus Tavares
  2020-09-10 17:21         ` [PATCH v6 0/9] grep: honor sparse checkout and add option to ignore it Matheus Tavares
  8 siblings, 0 replies; 120+ messages in thread
From: Matheus Tavares @ 2020-09-02  6:17 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, newren, jonathantanmy, jrnieder

One of the main uses for a sparse checkout is to allow users to focus on
the subset of files in a repository in which they are interested. But
git-grep currently ignores the sparsity patterns and also reports
matches found outside this subset, which defeats that purpose.
There are some use cases for ignoring the sparsity patterns and the next
commit will add an option to obtain this behavior, but here we start by
making grep honor the sparsity boundaries in every case where this is
relevant:

- git grep in worktree
- git grep --cached
- git grep $REVISION

For the worktree and cached cases, we iterate over paths without the
SKIP_WORKTREE bit set, and limit our searches to these paths. For the
$REVISION case, we limit the paths we search to those that match the
sparsity patterns. (We do not check the SKIP_WORKTREE bit for the
$REVISION case, because $REVISION may contain paths that do not exist in
HEAD and thus for which we have no SKIP_WORKTREE bit to consult. The
sparsity patterns tell us how the SKIP_WORKTREE bit would be set if we
were to check out $REVISION, so we consult those. Also, we don't use the
sparsity patterns with the worktree or cached cases, both because we
have a bit we can check directly and more efficiently, and because
unmerged entries from a merge or a rebase could cause more files to
temporarily be present than the sparsity patterns would normally
select.)

Note that there is a special case here: `git grep $TREE`. In this case,
we cannot know whether $TREE corresponds to the root of the repository
or some sub-tree, and thus there is no way for us to know which sparsity
patterns, if any, apply. So the $TREE case will not use sparsity
patterns or any SKIP_WORKTREE bits and will instead always search all
files within the $TREE.
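
To summarize the rules above, the mapping from grep target to the
sparsity check that applies can be written as a small decision
function. This is a sketch for illustration only: the enum names and
demo_sparsity_check_for() are hypothetical and do not appear in
builtin/grep.c.

```c
#include <stddef.h>

/* Hypothetical names, used only to restate the rules in code. */
enum demo_grep_target {
	DEMO_WORKTREE,	/* git grep */
	DEMO_CACHED,	/* git grep --cached */
	DEMO_REVISION,	/* git grep $REVISION */
	DEMO_TREE	/* git grep $TREE */
};

enum demo_sparsity_check {
	DEMO_CHECK_NONE,		/* search every file */
	DEMO_CHECK_SKIP_WORKTREE_BIT,	/* consult the index entry's bit */
	DEMO_CHECK_PATTERNS		/* match paths against the patterns */
};

static enum demo_sparsity_check
demo_sparsity_check_for(enum demo_grep_target target)
{
	switch (target) {
	case DEMO_WORKTREE:
	case DEMO_CACHED:
		/* The index already records the answer per path. */
		return DEMO_CHECK_SKIP_WORKTREE_BIT;
	case DEMO_REVISION:
		/* Paths may be absent from the index, so use the patterns. */
		return DEMO_CHECK_PATTERNS;
	case DEMO_TREE:
		/* The tree may not be the root, so no pattern can apply. */
		return DEMO_CHECK_NONE;
	}
	return DEMO_CHECK_NONE;
}
```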

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---
 builtin/grep.c                   | 125 ++++++++++++++++++--
 t/t7011-skip-worktree-reading.sh |   9 --
 t/t7817-grep-sparse-checkout.sh  | 195 +++++++++++++++++++++++++++++++
 3 files changed, 312 insertions(+), 17 deletions(-)
 create mode 100755 t/t7817-grep-sparse-checkout.sh

diff --git a/builtin/grep.c b/builtin/grep.c
index f58979bc3f..a32815de0a 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -410,7 +410,7 @@ static int grep_cache(struct grep_opt *opt,
 		      const struct pathspec *pathspec, int cached);
 static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 		     struct tree_desc *tree, struct strbuf *base, int tn_len,
-		     int check_attr);
+		     int is_root_tree);
 
 static int grep_submodule(struct grep_opt *opt,
 			  const struct pathspec *pathspec,
@@ -508,6 +508,10 @@ static int grep_cache(struct grep_opt *opt,
 
 	for (nr = 0; nr < repo->index->cache_nr; nr++) {
 		const struct cache_entry *ce = repo->index->cache[nr];
+
+		if (ce_skip_worktree(ce))
+			continue;
+
 		strbuf_setlen(&name, name_base_len);
 		strbuf_addstr(&name, ce->name);
 
@@ -520,8 +524,7 @@ static int grep_cache(struct grep_opt *opt,
 			 * cache entry are identical, even if worktree file has
 			 * been modified, so use cache version instead
 			 */
-			if (cached || (ce->ce_flags & CE_VALID) ||
-			    ce_skip_worktree(ce)) {
+			if (cached || (ce->ce_flags & CE_VALID)) {
 				if (ce_stage(ce) || ce_intent_to_add(ce))
 					continue;
 				hit |= grep_oid(opt, &ce->oid, name.buf,
@@ -552,9 +555,76 @@ static int grep_cache(struct grep_opt *opt,
 	return hit;
 }
 
-static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
-		     struct tree_desc *tree, struct strbuf *base, int tn_len,
-		     int check_attr)
+static struct pattern_list *get_sparsity_patterns(struct repository *repo)
+{
+	struct pattern_list *patterns;
+	char *sparse_file;
+	int sparse_config, cone_config;
+
+	if (repo_config_get_bool(repo, "core.sparsecheckout", &sparse_config) ||
+	    !sparse_config) {
+		return NULL;
+	}
+
+	sparse_file = repo_git_path(repo, "info/sparse-checkout");
+	patterns = xcalloc(1, sizeof(*patterns));
+
+	if (repo_config_get_bool(repo, "core.sparsecheckoutcone", &cone_config))
+		cone_config = 0;
+	patterns->use_cone_patterns = cone_config;
+
+	if (add_patterns_from_file_to_list(sparse_file, "", 0, patterns, NULL)) {
+		if (file_exists(sparse_file)) {
+			warning(_("failed to load sparse-checkout file: '%s'"),
+				sparse_file);
+		}
+		free(sparse_file);
+		free(patterns);
+		return NULL;
+	}
+
+	free(sparse_file);
+	return patterns;
+}
+
+static int path_in_sparse_checkout(struct strbuf *path, int prefix_len,
+				   unsigned int entry_mode,
+				   struct index_state *istate,
+				   struct pattern_list *sparsity,
+				   enum pattern_match_result parent_match,
+				   enum pattern_match_result *match)
+{
+	int dtype = DT_UNKNOWN;
+	int is_dir = S_ISDIR(entry_mode);
+
+	if (parent_match == MATCHED_RECURSIVE) {
+		*match = parent_match;
+		return 1;
+	}
+
+	if (is_dir && !is_dir_sep(path->buf[path->len - 1]))
+		strbuf_addch(path, '/');
+
+	*match = path_matches_pattern_list(path->buf, path->len,
+					   path->buf + prefix_len, &dtype,
+					   sparsity, istate);
+	if (*match == UNDECIDED)
+		*match = parent_match;
+
+	if (is_dir)
+		strbuf_trim_trailing_dir_sep(path);
+
+	if (*match == NOT_MATCHED &&
+		(!is_dir || (is_dir && sparsity->use_cone_patterns)))
+	     return 0;
+
+	return 1;
+}
+
+static int do_grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
+			struct tree_desc *tree, struct strbuf *base, int tn_len,
+			int check_attr, struct pattern_list *sparsity,
+			enum pattern_match_result default_sparsity_match)
 {
 	struct repository *repo = opt->repo;
 	int hit = 0;
@@ -570,6 +640,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 
 	while (tree_entry(tree, &entry)) {
 		int te_len = tree_entry_len(&entry);
+		enum pattern_match_result sparsity_match = 0;
 
 		if (match != all_entries_interesting) {
 			strbuf_addstr(&name, base->buf + tn_len);
@@ -586,6 +657,19 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 
 		strbuf_add(base, entry.path, te_len);
 
+		if (sparsity) {
+			struct strbuf path = STRBUF_INIT;
+			strbuf_addstr(&path, base->buf + tn_len);
+
+			if (!path_in_sparse_checkout(&path, old_baselen - tn_len,
+						     entry.mode, repo->index,
+						     sparsity, default_sparsity_match,
+						     &sparsity_match)) {
+				strbuf_setlen(base, old_baselen);
+				continue;
+			}
+		}
+
 		if (S_ISREG(entry.mode)) {
 			hit |= grep_oid(opt, &entry.oid, base->buf, tn_len,
 					 check_attr ? base->buf + tn_len : NULL);
@@ -602,8 +686,8 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 
 			strbuf_addch(base, '/');
 			init_tree_desc(&sub, data, size);
-			hit |= grep_tree(opt, pathspec, &sub, base, tn_len,
-					 check_attr);
+			hit |= do_grep_tree(opt, pathspec, &sub, base, tn_len,
+					    check_attr, sparsity, sparsity_match);
 			free(data);
 		} else if (recurse_submodules && S_ISGITLINK(entry.mode)) {
 			hit |= grep_submodule(opt, pathspec, &entry.oid,
@@ -621,6 +705,31 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 	return hit;
 }
 
+/*
+ * Note: sparsity patterns and paths' attributes will only be considered if
+ * is_root_tree has true value. (Otherwise, we cannot properly perform pattern
+ * matching on paths.)
+ */
+static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
+		     struct tree_desc *tree, struct strbuf *base, int tn_len,
+		     int is_root_tree)
+{
+	struct pattern_list *patterns = NULL;
+	int ret;
+
+	if (is_root_tree)
+		patterns = get_sparsity_patterns(opt->repo);
+
+	ret = do_grep_tree(opt, pathspec, tree, base, tn_len, is_root_tree,
+			   patterns, 0);
+
+	if (patterns) {
+		clear_pattern_list(patterns);
+		free(patterns);
+	}
+	return ret;
+}
+
 static int grep_object(struct grep_opt *opt, const struct pathspec *pathspec,
 		       struct object *obj, const char *name, const char *path)
 {
diff --git a/t/t7011-skip-worktree-reading.sh b/t/t7011-skip-worktree-reading.sh
index 37525cae3a..26852586ac 100755
--- a/t/t7011-skip-worktree-reading.sh
+++ b/t/t7011-skip-worktree-reading.sh
@@ -109,15 +109,6 @@ test_expect_success 'ls-files --modified' '
 	test -z "$(git ls-files -m)"
 '
 
-test_expect_success 'grep with skip-worktree file' '
-	git update-index --no-skip-worktree 1 &&
-	echo test > 1 &&
-	git update-index 1 &&
-	git update-index --skip-worktree 1 &&
-	rm 1 &&
-	test "$(git grep --no-ext-grep test)" = "1:test"
-'
-
 echo ":000000 100644 $ZERO_OID $EMPTY_BLOB A	1" > expected
 test_expect_success 'diff-index does not examine skip-worktree absent entries' '
 	setup_absent &&
diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
new file mode 100755
index 0000000000..b3109e3479
--- /dev/null
+++ b/t/t7817-grep-sparse-checkout.sh
@@ -0,0 +1,195 @@
+#!/bin/sh
+
+test_description='grep in sparse checkout
+
+This test creates a repo with the following structure:
+
+.
+|-- a
+|-- b
+|-- dir
+|   `-- c
+|-- sub
+|   |-- A
+|   |   `-- a
+|   `-- B
+|       `-- b
+`-- sub2
+    `-- a
+
+Where the outer repository has non-cone mode sparsity patterns, sub is a
+submodule with cone mode sparsity patterns and sub2 is a submodule that is
+excluded by the superproject sparsity patterns. The resulting sparse checkout
+should leave the following structure in the working tree:
+
+.
+|-- a
+|-- sub
+|   `-- B
+|       `-- b
+`-- sub2
+    `-- a
+
+But note that sub2 should have the SKIP_WORKTREE bit set.
+'
+
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+	echo "text" >a &&
+	echo "text" >b &&
+	mkdir dir &&
+	echo "text" >dir/c &&
+
+	git init sub &&
+	(
+		cd sub &&
+		mkdir A B &&
+		echo "text" >A/a &&
+		echo "text" >B/b &&
+		git add A B &&
+		git commit -m sub &&
+		git sparse-checkout init --cone &&
+		git sparse-checkout set B
+	) &&
+
+	git init sub2 &&
+	(
+		cd sub2 &&
+		echo "text" >a &&
+		git add a &&
+		git commit -m sub2
+	) &&
+
+	git submodule add ./sub &&
+	git submodule add ./sub2 &&
+	git add a b dir &&
+	git commit -m super &&
+	git sparse-checkout init --no-cone &&
+	git sparse-checkout set "/*" "!b" "!/*/" "sub" &&
+
+	git tag -am tag-to-commit tag-to-commit HEAD &&
+	tree=$(git rev-parse HEAD^{tree}) &&
+	git tag -am tag-to-tree tag-to-tree $tree &&
+
+	test_path_is_missing b &&
+	test_path_is_missing dir &&
+	test_path_is_missing sub/A &&
+	test_path_is_file a &&
+	test_path_is_file sub/B/b &&
+	test_path_is_file sub2/a
+'
+
+# The test below checks a special case: the sparsity patterns exclude '/b'
+# and sparse checkout is enabled, but the path exists in the working tree (e.g.
+# manually created after `git sparse-checkout init`). In this case, grep should
+# skip it.
+test_expect_success 'grep in working tree should honor sparse checkout' '
+	cat >expect <<-EOF &&
+	a:text
+	EOF
+	echo "new-text" >b &&
+	test_when_finished "rm b" &&
+	git grep "text" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep unmerged file despite not matching sparsity patterns' '
+	cat >expect <<-EOF &&
+	b:modified-b-in-branchX
+	b:modified-b-in-branchY
+	EOF
+	test_when_finished "test_might_fail git merge --abort && \
+			    git checkout master" &&
+
+	git sparse-checkout disable &&
+	git checkout -b branchY master &&
+	test_commit modified-b-in-branchY b &&
+	git checkout -b branchX master &&
+	test_commit modified-b-in-branchX b &&
+
+	git sparse-checkout init &&
+	test_path_is_missing b &&
+	test_must_fail git merge branchY &&
+	git grep "modified-b" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep --cached should honor sparse checkout' '
+	cat >expect <<-EOF &&
+	a:text
+	EOF
+	git grep --cached "text" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep <commit-ish> should honor sparse checkout' '
+	commit=$(git rev-parse HEAD) &&
+	cat >expect_commit <<-EOF &&
+	$commit:a:text
+	EOF
+	cat >expect_tag-to-commit <<-EOF &&
+	tag-to-commit:a:text
+	EOF
+	git grep "text" $commit >actual_commit &&
+	test_cmp expect_commit actual_commit &&
+	git grep "text" tag-to-commit >actual_tag-to-commit &&
+	test_cmp expect_tag-to-commit actual_tag-to-commit
+'
+
+test_expect_success 'grep <tree-ish> should ignore sparsity patterns' '
+	commit=$(git rev-parse HEAD) &&
+	tree=$(git rev-parse HEAD^{tree}) &&
+	cat >expect_tree <<-EOF &&
+	$tree:a:text
+	$tree:b:text
+	$tree:dir/c:text
+	EOF
+	cat >expect_tag-to-tree <<-EOF &&
+	tag-to-tree:a:text
+	tag-to-tree:b:text
+	tag-to-tree:dir/c:text
+	EOF
+	git grep "text" $tree >actual_tree &&
+	test_cmp expect_tree actual_tree &&
+	git grep "text" tag-to-tree >actual_tag-to-tree &&
+	test_cmp expect_tag-to-tree actual_tag-to-tree
+'
+
+# Note that sub2/ is present in the worktree but it is excluded by the sparsity
+# patterns, so grep should not recurse into it.
+test_expect_success 'grep --recurse-submodules should honor sparse checkout in submodule' '
+	cat >expect <<-EOF &&
+	a:text
+	sub/B/b:text
+	EOF
+	git grep --recurse-submodules "text" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep --recurse-submodules --cached should honor sparse checkout in submodule' '
+	cat >expect <<-EOF &&
+	a:text
+	sub/B/b:text
+	EOF
+	git grep --recurse-submodules --cached "text" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep --recurse-submodules <commit-ish> should honor sparse checkout in submodule' '
+	commit=$(git rev-parse HEAD) &&
+	cat >expect_commit <<-EOF &&
+	$commit:a:text
+	$commit:sub/B/b:text
+	EOF
+	cat >expect_tag-to-commit <<-EOF &&
+	tag-to-commit:a:text
+	tag-to-commit:sub/B/b:text
+	EOF
+	git grep --recurse-submodules "text" $commit >actual_commit &&
+	test_cmp expect_commit actual_commit &&
+	git grep --recurse-submodules "text" tag-to-commit >actual_tag-to-commit &&
+	test_cmp expect_tag-to-commit actual_tag-to-commit
+'
+
+test_done
-- 
2.28.0


^ permalink raw reply	[flat|nested] 120+ messages in thread

* [PATCH v5 8/8] config: add setting to ignore sparsity patterns in some cmds
  2020-09-02  6:17       ` [PATCH v5 0/8] " Matheus Tavares
                           ` (6 preceding siblings ...)
  2020-09-02  6:17         ` [PATCH v5 7/8] grep: honor sparse checkout patterns Matheus Tavares
@ 2020-09-02  6:17         ` Matheus Tavares
  2020-09-10 17:21         ` [PATCH v6 0/9] grep: honor sparse checkout and add option to ignore it Matheus Tavares
  8 siblings, 0 replies; 120+ messages in thread
From: Matheus Tavares @ 2020-09-02  6:17 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, newren, jonathantanmy, jrnieder

When sparse checkout is enabled, some users expect the output of certain
commands (such as grep, diff, and log) to also be restricted to the
sparsity patterns. This would allow them to effectively work only on the
subset of files in which they are interested; and allow some commands to
possibly perform better, by not considering uninteresting paths. For
this reason, we taught grep to honor the sparsity patterns, in the
previous patch. But, on the other hand, allowing grep and the other
commands mentioned to optionally ignore the patterns also makes for some
interesting use cases. E.g. using grep to search for a function's
documentation that resides outside the sparse checkout.

In any case, there is currently no way for users to configure the
behavior they want for these commands. Aiming to provide this
flexibility, let's introduce the sparse.restrictCmds setting (and the
analogous --[no-]restrict-to-sparse-paths global option). The default
value is true. For now, grep is the only command affected by this
setting, but the goal is to have support for more commands, in the
future.
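
One way to picture how the setting and the global option could combine
is the sketch below. It assumes that the command-line option, when
given, takes precedence over the config value; that precedence is an
assumption made for illustration, and demo_restrict_to_sparse_paths()
is a hypothetical function rather than code from this series.

```c
/*
 * Hypothetical sketch. Each input is -1 when unset, 0 for false and
 * 1 for true. Assumed precedence: command line, then config, then the
 * documented default of true.
 */
static int demo_restrict_to_sparse_paths(int cmdline_opt, int config_val)
{
	if (cmdline_opt >= 0)
		return cmdline_opt;
	if (config_val >= 0)
		return config_val;
	return 1;	/* default: restrict to the sparse paths */
}
```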

Helped-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---