git@vger.kernel.org list mirror (unofficial, one of many)
* [RFC PATCH 0/3] grep: honor sparse checkout and add option to ignore it
@ 2020-03-24  6:04 Matheus Tavares
  2020-03-24  6:11 ` [RFC PATCH 1/3] doc: grep: unify info on configuration variables Matheus Tavares
                   ` (3 more replies)
  0 siblings, 4 replies; 57+ messages in thread
From: Matheus Tavares @ 2020-03-24  6:04 UTC (permalink / raw)
  To: git; +Cc: dstolee, newren

This series is based on the discussions we had some months ago[1], about
git-grep not currently honoring the sparsity patterns. To summarize, the
idea is that, since a sparse checkout is used to limit the set of files
in which users are interested, git-grep should, by default, only search
within this boundary.  But it would be good to also have an
'--ignore-sparsity' option, to restore the old behavior when needed, as
there are also valid use cases for it. The following patches seek to
address these suggestions. The first patch is not really related, it is
a cleanup, used by the third one.

[1]: https://lore.kernel.org/git/CAHd-oW7e5qCuxZLBeVDq+Th3E+E4+P8=WzJfK8WcG2yz=n_nag@mail.gmail.com/t/#u

Matheus Tavares (3):
  doc: grep: unify info on configuration variables
  grep: honor sparse checkout patterns
  grep: add option to ignore sparsity patterns

 Documentation/config/grep.txt    |  10 ++-
 Documentation/git-grep.txt       |  40 +++-------
 builtin/grep.c                   |  36 ++++++++-
 t/t7011-skip-worktree-reading.sh |   9 ---
 t/t7817-grep-sparse-checkout.sh  | 130 +++++++++++++++++++++++++++++++
 5 files changed, 180 insertions(+), 45 deletions(-)
 create mode 100755 t/t7817-grep-sparse-checkout.sh

-- 
2.25.1



* [RFC PATCH 1/3] doc: grep: unify info on configuration variables
  2020-03-24  6:04 [RFC PATCH 0/3] grep: honor sparse checkout and add option to ignore it Matheus Tavares
@ 2020-03-24  6:11 ` Matheus Tavares
  2020-03-24  7:57   ` Elijah Newren
  2020-03-24  6:12 ` [RFC PATCH 2/3] grep: honor sparse checkout patterns Matheus Tavares
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 57+ messages in thread
From: Matheus Tavares @ 2020-03-24  6:11 UTC (permalink / raw)
  To: git; +Cc: dstolee, newren, sandals

Explanations about the configuration variables for git-grep are
duplicated in "Documentation/git-grep.txt" and
"Documentation/config/grep.txt". Let's unify the information in the
second file and include it in the first.

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---
 Documentation/config/grep.txt |  7 +++++--
 Documentation/git-grep.txt    | 35 +++++------------------------------
 2 files changed, 10 insertions(+), 32 deletions(-)

diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
index 44abe45a7c..76689771aa 100644
--- a/Documentation/config/grep.txt
+++ b/Documentation/config/grep.txt
@@ -16,8 +16,11 @@ grep.extendedRegexp::
 	other than 'default'.
 
 grep.threads::
-	Number of grep worker threads to use.
-	See `grep.threads` in linkgit:git-grep[1] for more information.
+	Number of grep worker threads to use. See `--threads` in
+	linkgit:git-grep[1] for more information.
+
+grep.fullName::
+	If set to true, enable `--full-name` option by default.
 
 grep.fallbackToNoIndex::
 	If set to true, fall back to git grep --no-index if git grep
diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
index ddb6acc025..97e25d7b1b 100644
--- a/Documentation/git-grep.txt
+++ b/Documentation/git-grep.txt
@@ -41,34 +41,7 @@ characters.  An empty string as search expression matches all lines.
 CONFIGURATION
 -------------
 
-grep.lineNumber::
-	If set to true, enable `-n` option by default.
-
-grep.column::
-	If set to true, enable the `--column` option by default.
-
-grep.patternType::
-	Set the default matching behavior. Using a value of 'basic', 'extended',
-	'fixed', or 'perl' will enable the `--basic-regexp`, `--extended-regexp`,
-	`--fixed-strings`, or `--perl-regexp` option accordingly, while the
-	value 'default' will return to the default matching behavior.
-
-grep.extendedRegexp::
-	If set to true, enable `--extended-regexp` option by default. This
-	option is ignored when the `grep.patternType` option is set to a value
-	other than 'default'.
-
-grep.threads::
-	Number of grep worker threads to use. If unset (or set to 0), Git will
-	use as many threads as the number of logical cores available.
-
-grep.fullName::
-	If set to true, enable `--full-name` option by default.
-
-grep.fallbackToNoIndex::
-	If set to true, fall back to git grep --no-index if git grep
-	is executed outside of a git repository.  Defaults to false.
-
+include::config/grep.txt[]
 
 OPTIONS
 -------
@@ -267,8 +240,10 @@ providing this option will cause it to die.
 	found.
 
 --threads <num>::
-	Number of grep worker threads to use.
-	See `grep.threads` in 'CONFIGURATION' for more information.
+	Number of grep worker threads to use. If not provided (or set to
+	0), Git will use as many worker threads as the number of logical
+	cores available. The default value can also be set with the
+	`grep.threads` configuration (see linkgit:git-config[1]).
 
 -f <file>::
 	Read patterns from <file>, one per line.
-- 
2.25.1



* [RFC PATCH 2/3] grep: honor sparse checkout patterns
  2020-03-24  6:04 [RFC PATCH 0/3] grep: honor sparse checkout and add option to ignore it Matheus Tavares
  2020-03-24  6:11 ` [RFC PATCH 1/3] doc: grep: unify info on configuration variables Matheus Tavares
@ 2020-03-24  6:12 ` Matheus Tavares
  2020-03-24  7:15   ` Elijah Newren
  2020-03-24  6:13 ` [RFC PATCH 3/3] grep: add option to ignore sparsity patterns Matheus Tavares
  2020-05-10  0:41 ` [RFC PATCH v2 0/4] grep: honor sparse checkout and add option to ignore it Matheus Tavares
  3 siblings, 1 reply; 57+ messages in thread
From: Matheus Tavares @ 2020-03-24  6:12 UTC (permalink / raw)
  To: git; +Cc: dstolee, newren, sandals, stefanbeller

One of the main uses for a sparse checkout is to allow users to focus on
the subset of files in a repository in which they are interested. But
git-grep currently ignores the sparsity patterns and reports all matches
found outside this subset, which kind of goes in the opposite direction.
Let's fix that, making it honor the sparsity boundaries for every
grepping case:

- git grep in worktree
- git grep --cached
- git grep $REVISION
- git grep --untracked and git grep --no-index (which already respect
  sparse checkout boundaries)

This is also what some users reported[1] they would want as the default
behavior.

Note: for `git grep $REVISION`, we will choose to honor the sparsity
patterns only when $REVISION is a commit-ish object. The reason is that,
for a tree, we don't know whether it represents the root of a
repository or a subtree. So we wouldn't be able to correctly match it
against the sparsity patterns. E.g. suppose we have a repository with
these two sparsity rules: "/*" and "!/a"; and the following structure:

/
| - a (file)
| - d (dir)
    | - a (file)

If `git grep $REVISION` were to honor the sparsity patterns for every
object type, when grepping the /d tree, we would wrongly ignore the /d/a
file. This happens because we wouldn't know it resides in /d and
therefore it would wrongly match the pattern "!/a". Furthermore, for a
search in a blob object, we wouldn't even have a path to check the
patterns against. So, let's ignore the sparsity patterns when grepping
non-commit-ish objects (tags to commits should be fine).
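
To make the root-versus-subtree ambiguity concrete, here is a minimal C
sketch. It is not Git's pattern matcher: toy_excluded_by_rooted_pattern()
is a hypothetical helper, introduced only for illustration, that
understands nothing but negative rooted patterns of the form "!/<name>".
It assumes paths are given relative to the tree being walked.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper, not part of Git: returns 1 when `path` (given
 * relative to the tree being walked, without a leading slash) is hit by
 * a negative rooted pattern of the form "!/<name>", 0 otherwise.
 */
static int toy_excluded_by_rooted_pattern(const char *pattern,
					  const char *path)
{
	const char *name;
	size_t len;

	assert(pattern && path);
	if (strncmp(pattern, "!/", 2) != 0)
		return 0; /* this toy only understands "!/<name>" */
	name = pattern + 2;
	len = strlen(name);
	if (strncmp(path, name, len) != 0)
		return 0;
	/* Match the whole first component, not just a prefix of it. */
	return path[len] == '\0' || path[len] == '/';
}
```

Walking from a commit, the two files are seen as "a" and "d/a", and only
the first is hit by "!/a". Walking the /d tree on its own, the second
file is seen as plain "a", so the same call would report it as excluded.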

Finally, the old behavior is still desirable for some use cases. So the
next patch will add an option to allow restoring it when needed.

[1]: https://lore.kernel.org/git/CABPp-BGuFhDwWZBRaD3nA8ui46wor-4=Ha1G1oApsfF8KNpfGQ@mail.gmail.com/

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---

Something I'm not entirely sure about in this patch is how we implement
the mechanism to honor sparsity for the `git grep <commit-ish>` case
(which is handled in the grep_tree() function). Currently, the patch
looks for an index entry that matches the path, and then checks its
skip_worktree bit. But this operation is performed in O(log(N)), N being
the number of index entries. If there are many entries (and not so many
sparsity patterns), maybe a better approach would be to try matching the
path directly against the sparsity patterns. This would be O(M) in the
number of patterns, and it could be done, in builtin/grep.c, with a
function like the following:

static struct pattern_list sparsity_patterns;
static int sparsity_patterns_initialized = 0;
static enum pattern_match_result path_matches_sparsity_patterns(
					const char *path, int pathlen,
					const char *basename,
					struct repository *repo)
{
	int dtype = DT_UNKNOWN;

	if (!sparsity_patterns_initialized) {
		char *sparse_file = git_pathdup("info/sparse-checkout");
		int ret;

		memset(&sparsity_patterns, 0, sizeof(sparsity_patterns));
		sparsity_patterns.use_cone_patterns = core_sparse_checkout_cone;
		ret = add_patterns_from_file_to_list(sparse_file, "", 0,
						     &sparsity_patterns, NULL);
		free(sparse_file);

		if (ret < 0)
			die(_("failed to load sparse-checkout patterns"));
		sparsity_patterns_initialized = 1;
	}

	return path_matches_pattern_list(path, pathlen, basename, &dtype,
					 &sparsity_patterns, repo->index);
}

Also, if I understand correctly, the index doesn't hold paths to dirs,
right? So even if a complete dir is excluded from sparse checkout, we
still have to check all its subentries, only to discover that they
should all be skipped from the search. However, if we were to check
against the sparsity patterns directly (e.g. with the function above),
we could skip such directories together with all their entries.

Oh, and there is also the case of a commit whose tree paths are not in
the index (maybe manually created objects?). For such commits, with the
index lookup approach, we would have to fall back on ignoring the
sparsity rules. I'm not sure if that would be OK, though.

Any thoughts on these two approaches (looking up the skip_worktree bit
in the index or directly matching against sparsity patterns) will be
highly appreciated. (Note that it only concerns the `git grep
<commit-ish>` case. The other cases already iterate through the index, so
there is no O(log(N)) extra complexity.)
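
For reference, the index lookup approach can be modeled with the
following self-contained C sketch. It is a toy, not Git code: struct
toy_entry, toy_name_pos() and toy_should_skip() are hypothetical names
standing in for the index entries, index_name_pos() and the
ce_skip_worktree() check. It assumes the entries are sorted by name, and
it simplifies the "not found" result to -1.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for an index entry (hypothetical, not Git's struct). */
struct toy_entry {
	const char *name;
	int skip_worktree;
};

/*
 * Binary search over entries sorted by name: O(log(N)) comparisons.
 * Returns the position of `name`, or -1 when it is absent.
 */
static int toy_name_pos(const struct toy_entry *entries, size_t nr,
			const char *name)
{
	size_t lo = 0, hi = nr;

	assert(entries && name);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		int cmp = strcmp(name, entries[mid].name);

		if (!cmp)
			return (int)mid;
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return -1;
}

/*
 * A path is skipped only when it is found and marked skip_worktree.
 * Paths missing from the index fall back to "do not skip", which is
 * the limitation discussed above for trees unrelated to the index.
 */
static int toy_should_skip(const struct toy_entry *entries, size_t nr,
			   const char *path)
{
	int pos = toy_name_pos(entries, nr, path);

	return pos >= 0 && entries[pos].skip_worktree;
}
```

With entries "a" (not skipped), "b" and "dir/c" (both skipped), only "b"
and "dir/c" are filtered out. A path that exists in the commit but not
in the index is always searched.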

 builtin/grep.c                   | 29 ++++++++---
 t/t7011-skip-worktree-reading.sh |  9 ----
 t/t7817-grep-sparse-checkout.sh  | 88 ++++++++++++++++++++++++++++++++
 3 files changed, 111 insertions(+), 15 deletions(-)
 create mode 100755 t/t7817-grep-sparse-checkout.sh

diff --git a/builtin/grep.c b/builtin/grep.c
index 99e2685090..52ec72a036 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -388,7 +388,7 @@ static int grep_cache(struct grep_opt *opt,
 		      const struct pathspec *pathspec, int cached);
 static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 		     struct tree_desc *tree, struct strbuf *base, int tn_len,
-		     int check_attr);
+		     int from_commit);
 
 static int grep_submodule(struct grep_opt *opt,
 			  const struct pathspec *pathspec,
@@ -486,6 +486,10 @@ static int grep_cache(struct grep_opt *opt,
 
 	for (nr = 0; nr < repo->index->cache_nr; nr++) {
 		const struct cache_entry *ce = repo->index->cache[nr];
+
+		if (ce_skip_worktree(ce))
+			continue;
+
 		strbuf_setlen(&name, name_base_len);
 		strbuf_addstr(&name, ce->name);
 
@@ -498,8 +502,7 @@ static int grep_cache(struct grep_opt *opt,
 			 * cache entry are identical, even if worktree file has
 			 * been modified, so use cache version instead
 			 */
-			if (cached || (ce->ce_flags & CE_VALID) ||
-			    ce_skip_worktree(ce)) {
+			if (cached || (ce->ce_flags & CE_VALID)) {
 				if (ce_stage(ce) || ce_intent_to_add(ce))
 					continue;
 				hit |= grep_oid(opt, &ce->oid, name.buf,
@@ -532,7 +535,7 @@ static int grep_cache(struct grep_opt *opt,
 
 static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 		     struct tree_desc *tree, struct strbuf *base, int tn_len,
-		     int check_attr)
+		     int from_commit)
 {
 	struct repository *repo = opt->repo;
 	int hit = 0;
@@ -546,6 +549,9 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 		name_base_len = name.len;
 	}
 
+	if (from_commit && repo_read_index(repo) < 0)
+		die(_("index file corrupt"));
+
 	while (tree_entry(tree, &entry)) {
 		int te_len = tree_entry_len(&entry);
 
@@ -564,9 +570,20 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 
 		strbuf_add(base, entry.path, te_len);
 
+		if (from_commit) {
+			int pos = index_name_pos(repo->index,
+						 base->buf + tn_len,
+						 base->len - tn_len);
+			if (pos >= 0 &&
+			    ce_skip_worktree(repo->index->cache[pos])) {
+				strbuf_setlen(base, old_baselen);
+				continue;
+			}
+		}
+
 		if (S_ISREG(entry.mode)) {
 			hit |= grep_oid(opt, &entry.oid, base->buf, tn_len,
-					 check_attr ? base->buf + tn_len : NULL);
+					from_commit ? base->buf + tn_len : NULL);
 		} else if (S_ISDIR(entry.mode)) {
 			enum object_type type;
 			struct tree_desc sub;
@@ -581,7 +598,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 			strbuf_addch(base, '/');
 			init_tree_desc(&sub, data, size);
 			hit |= grep_tree(opt, pathspec, &sub, base, tn_len,
-					 check_attr);
+					 from_commit);
 			free(data);
 		} else if (recurse_submodules && S_ISGITLINK(entry.mode)) {
 			hit |= grep_submodule(opt, pathspec, &entry.oid,
diff --git a/t/t7011-skip-worktree-reading.sh b/t/t7011-skip-worktree-reading.sh
index 37525cae3a..26852586ac 100755
--- a/t/t7011-skip-worktree-reading.sh
+++ b/t/t7011-skip-worktree-reading.sh
@@ -109,15 +109,6 @@ test_expect_success 'ls-files --modified' '
 	test -z "$(git ls-files -m)"
 '
 
-test_expect_success 'grep with skip-worktree file' '
-	git update-index --no-skip-worktree 1 &&
-	echo test > 1 &&
-	git update-index 1 &&
-	git update-index --skip-worktree 1 &&
-	rm 1 &&
-	test "$(git grep --no-ext-grep test)" = "1:test"
-'
-
 echo ":000000 100644 $ZERO_OID $EMPTY_BLOB A	1" > expected
 test_expect_success 'diff-index does not examine skip-worktree absent entries' '
 	setup_absent &&
diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
new file mode 100755
index 0000000000..fccf44e829
--- /dev/null
+++ b/t/t7817-grep-sparse-checkout.sh
@@ -0,0 +1,88 @@
+#!/bin/sh
+
+test_description='grep in sparse checkout
+
+This test creates the following dir structure:
+.
+| - a
+| - b
+| - dir
+    | - c
+
+Only "a" should be present due to the sparse checkout patterns:
+"/*", "!/b" and "!/dir".
+'
+
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+	echo "text" >a &&
+	echo "text" >b &&
+	mkdir dir &&
+	echo "text" >dir/c &&
+	git add a b dir &&
+	git commit -m "initial commit" &&
+	git tag -am t-commit t-commit HEAD &&
+	tree=$(git rev-parse HEAD^{tree}) &&
+	git tag -am t-tree t-tree $tree &&
+	cat >.git/info/sparse-checkout <<-EOF &&
+	/*
+	!/b
+	!/dir
+	EOF
+	git sparse-checkout init &&
+	test_path_is_missing b &&
+	test_path_is_missing dir &&
+	test_path_is_file a
+'
+
+test_expect_success 'grep in working tree should honor sparse checkout' '
+	cat >expect <<-EOF &&
+	a:text
+	EOF
+	git grep "text" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep --cached should honor sparse checkout' '
+	cat >expect <<-EOF &&
+	a:text
+	EOF
+	git grep --cached "text" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep <commit-ish> should honor sparse checkout' '
+	commit=$(git rev-parse HEAD) &&
+	cat >expect_commit <<-EOF &&
+	$commit:a:text
+	EOF
+	cat >expect_t-commit <<-EOF &&
+	t-commit:a:text
+	EOF
+	git grep "text" $commit >actual_commit &&
+	test_cmp expect_commit actual_commit &&
+	git grep "text" t-commit >actual_t-commit &&
+	test_cmp expect_t-commit actual_t-commit
+'
+
+test_expect_success 'grep <tree-ish> should search outside sparse checkout' '
+	commit=$(git rev-parse HEAD) &&
+	tree=$(git rev-parse HEAD^{tree}) &&
+	cat >expect_tree <<-EOF &&
+	$tree:a:text
+	$tree:b:text
+	$tree:dir/c:text
+	EOF
+	cat >expect_t-tree <<-EOF &&
+	t-tree:a:text
+	t-tree:b:text
+	t-tree:dir/c:text
+	EOF
+	git grep "text" $tree >actual_tree &&
+	test_cmp expect_tree actual_tree &&
+	git grep "text" t-tree >actual_t-tree &&
+	test_cmp expect_t-tree actual_t-tree
+'
+
+test_done
-- 
2.25.1



* [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-24  6:04 [RFC PATCH 0/3] grep: honor sparse checkout and add option to ignore it Matheus Tavares
  2020-03-24  6:11 ` [RFC PATCH 1/3] doc: grep: unify info on configuration variables Matheus Tavares
  2020-03-24  6:12 ` [RFC PATCH 2/3] grep: honor sparse checkout patterns Matheus Tavares
@ 2020-03-24  6:13 ` Matheus Tavares
  2020-03-24  7:54   ` Elijah Newren
  2020-05-10  0:41 ` [RFC PATCH v2 0/4] grep: honor sparse checkout and add option to ignore it Matheus Tavares
  3 siblings, 1 reply; 57+ messages in thread
From: Matheus Tavares @ 2020-03-24  6:13 UTC (permalink / raw)
  To: git; +Cc: dstolee, newren, pclouds

In the last commit, git-grep learned to honor sparsity patterns. For
some use cases, however, it may be desirable to search outside the
sparse checkout. So add the '--ignore-sparsity' option, which restores
the old behavior. Also add the grep.ignoreSparsity configuration, to
allow setting this behavior by default.

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---

Note: I still have to make --ignore-sparsity work together with
--untracked. Unfortunately, this won't be as simple because the code
path taken by --untracked goes to grep_directory(), which just iterates
the working tree without looking at the index entries. So I will have
to either: make --untracked use grep_cache(), and grep the
untracked files later; or try matching the working tree paths against
the sparsity patterns, without looking for the skip_worktree bit in
the index (as I mentioned in the previous patch's comments). Any
preferences regarding these two approaches? (or other suggestions?)

 Documentation/config/grep.txt   |  3 +++
 Documentation/git-grep.txt      |  5 ++++
 builtin/grep.c                  | 19 +++++++++++----
 t/t7817-grep-sparse-checkout.sh | 42 +++++++++++++++++++++++++++++++++
 4 files changed, 65 insertions(+), 4 deletions(-)

diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
index 76689771aa..c1d49484c8 100644
--- a/Documentation/config/grep.txt
+++ b/Documentation/config/grep.txt
@@ -25,3 +25,6 @@ grep.fullName::
 grep.fallbackToNoIndex::
 	If set to true, fall back to git grep --no-index if git grep
 	is executed outside of a git repository.  Defaults to false.
+
+grep.ignoreSparsity::
+	If set to true, enable `--ignore-sparsity` by default.
diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
index 97e25d7b1b..5c5c66c056 100644
--- a/Documentation/git-grep.txt
+++ b/Documentation/git-grep.txt
@@ -65,6 +65,11 @@ OPTIONS
 	mechanism.  Only useful when searching files in the current directory
 	with `--no-index`.
 
+--ignore-sparsity::
+	In a sparse checked out repository (see linkgit:git-sparse-checkout[1]),
+	also search in files that are outside the sparse checkout. This option
+	cannot be used with --no-index or --untracked.
+
 --recurse-submodules::
 	Recursively search in each submodule that has been initialized and
 	checked out in the repository.  When used in combination with the
diff --git a/builtin/grep.c b/builtin/grep.c
index 52ec72a036..17eae3edd6 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -33,6 +33,8 @@ static char const * const grep_usage[] = {
 
 static int recurse_submodules;
 
+static int ignore_sparsity = 0;
+
 static int num_threads;
 
 static pthread_t *threads;
@@ -292,6 +294,9 @@ static int grep_cmd_config(const char *var, const char *value, void *cb)
 	if (!strcmp(var, "submodule.recurse"))
 		recurse_submodules = git_config_bool(var, value);
 
+	if (!strcmp(var, "grep.ignoresparsity"))
+		ignore_sparsity = git_config_bool(var, value);
+
 	return st;
 }
 
@@ -487,7 +492,7 @@ static int grep_cache(struct grep_opt *opt,
 	for (nr = 0; nr < repo->index->cache_nr; nr++) {
 		const struct cache_entry *ce = repo->index->cache[nr];
 
-		if (ce_skip_worktree(ce))
+		if (!ignore_sparsity && ce_skip_worktree(ce))
 			continue;
 
 		strbuf_setlen(&name, name_base_len);
@@ -502,7 +507,8 @@ static int grep_cache(struct grep_opt *opt,
 			 * cache entry are identical, even if worktree file has
 			 * been modified, so use cache version instead
 			 */
-			if (cached || (ce->ce_flags & CE_VALID)) {
+			if (cached || (ce->ce_flags & CE_VALID) ||
+			    ce_skip_worktree(ce)) {
 				if (ce_stage(ce) || ce_intent_to_add(ce))
 					continue;
 				hit |= grep_oid(opt, &ce->oid, name.buf,
@@ -549,7 +555,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 		name_base_len = name.len;
 	}
 
-	if (from_commit && repo_read_index(repo) < 0)
+	if (!ignore_sparsity && from_commit && repo_read_index(repo) < 0)
 		die(_("index file corrupt"));
 
 	while (tree_entry(tree, &entry)) {
@@ -570,7 +576,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 
 		strbuf_add(base, entry.path, te_len);
 
-		if (from_commit) {
+		if (!ignore_sparsity && from_commit) {
 			int pos = index_name_pos(repo->index,
 						 base->buf + tn_len,
 						 base->len - tn_len);
@@ -932,6 +938,8 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
 		OPT_BOOL_F(0, "ext-grep", &external_grep_allowed__ignored,
 			   N_("allow calling of grep(1) (ignored by this build)"),
 			   PARSE_OPT_NOCOMPLETE),
+		OPT_BOOL(0, "ignore-sparsity", &ignore_sparsity,
+			 N_("also search in files outside the sparse checkout")),
 		OPT_END()
 	};
 
@@ -1073,6 +1081,9 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
 	if (recurse_submodules && untracked)
 		die(_("--untracked not supported with --recurse-submodules"));
 
+	if (ignore_sparsity && (!use_index || untracked))
+		die(_("--no-index or --untracked cannot be used with --ignore-sparsity"));
+
 	if (show_in_pager) {
 		if (num_threads > 1)
 			warning(_("invalid option combination, ignoring --threads"));
diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
index fccf44e829..1891ddea57 100755
--- a/t/t7817-grep-sparse-checkout.sh
+++ b/t/t7817-grep-sparse-checkout.sh
@@ -85,4 +85,46 @@ test_expect_success 'grep <tree-ish> should search outside sparse checkout' '
 	test_cmp expect_t-tree actual_t-tree
 '
 
+for cmd in 'git grep --ignore-sparsity' 'git -c grep.ignoreSparsity grep' \
+	   'git -c grep.ignoreSparsity=false grep --ignore-sparsity'
+do
+	test_expect_success "$cmd should search outside sparse checkout" '
+		cat >expect <<-EOF &&
+		a:text
+		b:text
+		dir/c:text
+		EOF
+		$cmd "text" >actual &&
+		test_cmp expect actual
+	'
+
+	test_expect_success "$cmd --cached should search outside sparse checkout" '
+		cat >expect <<-EOF &&
+		a:text
+		b:text
+		dir/c:text
+		EOF
+		$cmd --cached "text" >actual &&
+		test_cmp expect actual
+	'
+
+	test_expect_success "$cmd <commit-ish> should search outside sparse checkout" '
+		commit=$(git rev-parse HEAD) &&
+		cat >expect_commit <<-EOF &&
+		$commit:a:text
+		$commit:b:text
+		$commit:dir/c:text
+		EOF
+		cat >expect_t-commit <<-EOF &&
+		t-commit:a:text
+		t-commit:b:text
+		t-commit:dir/c:text
+		EOF
+		$cmd "text" $commit >actual_commit &&
+		test_cmp expect_commit actual_commit &&
+		$cmd "text" t-commit >actual_t-commit &&
+		test_cmp expect_t-commit actual_t-commit
+	'
+done
+
 test_done
-- 
2.25.1



* Re: [RFC PATCH 2/3] grep: honor sparse checkout patterns
  2020-03-24  6:12 ` [RFC PATCH 2/3] grep: honor sparse checkout patterns Matheus Tavares
@ 2020-03-24  7:15   ` Elijah Newren
  2020-03-24 15:12     ` Derrick Stolee
  2020-03-24 22:55     ` Matheus Tavares Bernardino
  0 siblings, 2 replies; 57+ messages in thread
From: Elijah Newren @ 2020-03-24  7:15 UTC (permalink / raw)
  To: Matheus Tavares
  Cc: Git Mailing List, Derrick Stolee, brian m. carlson, Stefan Beller

Hi Matheus,

On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
<matheus.bernardino@usp.br> wrote:
>
> One of the main uses for a sparse checkout is to allow users to focus on
> the subset of files in a repository in which they are interested. But
> git-grep currently ignores the sparsity patterns and reports all matches
> found outside this subset, which kind of goes in the opposite direction.
> Let's fix that, making it honor the sparsity boundaries for every
> grepping case:
>
> - git grep in worktree
> - git grep --cached
> - git grep $REVISION

Wahoo!  This is great.

> - git grep --untracked and git grep --no-index (which already respect
>   sparse checkout boundaries)
>
> This is also what some users reported[1] they would want as the default
> behavior.
>
> Note: for `git grep $REVISION`, we will choose to honor the sparsity
> patterns only when $REVISION is a commit-ish object. The reason is that,

Makes sense.

> for a tree, we don't know whether it represents the root of a
> repository or a subtree. So we wouldn't be able to correctly match it
> against the sparsity patterns. E.g. suppose we have a repository with
> these two sparsity rules: "/*" and "!/a"; and the following structure:
>
> /
> | - a (file)
> | - d (dir)
>     | - a (file)
>
> If `git grep $REVISION` were to honor the sparsity patterns for every
> object type, when grepping the /d tree, we would wrongly ignore the /d/a
> file. This happens because we wouldn't know it resides in /d and
> therefore it would wrongly match the pattern "!/a". Furthermore, for a
> search in a blob object, we wouldn't even have a path to check the
> patterns against. So, let's ignore the sparsity patterns when grepping
> non-commit-ish objects (tags to commits should be fine).
>
> Finally, the old behavior is still desirable for some use cases. So the
> next patch will add an option to allow restoring it when needed.
>
> [1]: https://lore.kernel.org/git/CABPp-BGuFhDwWZBRaD3nA8ui46wor-4=Ha1G1oApsfF8KNpfGQ@mail.gmail.com/
>
> Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
> ---
>
> Something I'm not entirely sure in this patch is how we implement the
> mechanism to honor sparsity for the `git grep <commit-ish>` case (which
> is treated in the grep_tree() function). Currently, the patch looks for
> an index entry that matches the path, and then checks its skip_worktree

As you discuss below, checking the index is both wrong _and_ costly.
You should use the sparsity patterns; Stolee did a lot of work to make
those correspond to simple hashes you could check to determine whether
to even walk into a subdirectory.  So, O(1).  Yeah, that's "only" cone
mode but the non-cone sparsity patterns were a performance nightmare
waiting to rear its ugly head.  We should just try to encourage
everyone to move to cone mode, or accept the slowness they get without
it.
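
A rough illustration of the idea, as a self-contained C sketch. This is
not Git's cone-mode code: the toy_* names, the fixed-size table and the
FNV-1a hash are all assumptions made for the example. It models two
sets of directories: "recursive" ones, whose whole contents are
included, and "parents", the ancestors that must be entered to reach
them. Deciding whether to walk into a directory then costs a few hash
lookups (one per leading path component) rather than a scan over all
patterns.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_SLOTS 64 /* power of two, far larger than the demo sets */

/* Toy open-addressing set of directory names (hypothetical). */
struct toy_dirset {
	const char *slot[TOY_SLOTS];
};

/* FNV-1a over the first `len` bytes of `s`. */
static uint32_t toy_hash(const char *s, size_t len)
{
	uint32_t h = 2166136261u;
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= (unsigned char)s[i];
		h *= 16777619u;
	}
	return h;
}

/* Insert `dir`; the caller must keep the table from filling up. */
static void toy_dirset_add(struct toy_dirset *set, const char *dir)
{
	uint32_t i = toy_hash(dir, strlen(dir)) & (TOY_SLOTS - 1);

	while (set->slot[i])
		i = (i + 1) & (TOY_SLOTS - 1);
	set->slot[i] = dir;
}

/* Is the first `len` bytes of `dir` a member of the set? */
static int toy_dirset_has(const struct toy_dirset *set, const char *dir,
			  size_t len)
{
	uint32_t i = toy_hash(dir, len) & (TOY_SLOTS - 1);

	while (set->slot[i]) {
		if (strlen(set->slot[i]) == len &&
		    !memcmp(set->slot[i], dir, len))
			return 1;
		i = (i + 1) & (TOY_SLOTS - 1);
	}
	return 0;
}

/* Should a tree walk descend into directory `dir`? */
static int toy_walk_into(const struct toy_dirset *recursive,
			 const struct toy_dirset *parents, const char *dir)
{
	size_t len = strlen(dir), i;

	assert(recursive && parents);
	if (toy_dirset_has(parents, dir, len))
		return 1;
	/* Is `dir` itself, or any leading directory of it, recursive? */
	for (i = 1; i <= len; i++)
		if ((dir[i] == '/' || dir[i] == '\0') &&
		    toy_dirset_has(recursive, dir, i))
			return 1;
	return 0;
}
```

With "src/lib" registered as recursive and "src" as its parent, the walk
enters "src", "src/lib" and "src/lib/util", but skips "src/other" and
"docs" without looking at any of their entries.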

> bit. But this operation is performed in O(log(N)); N being the number of
> index entries. If there are many entries (and not so many sparsity
> patterns), maybe a better approach would be to try matching the path
> directly against the sparsity patterns. This would be O(M) in the number
> of patterns, and it could be done, in builtin/grep.c, with a function
> like the following:
>
> static struct pattern_list sparsity_patterns;
> static int sparsity_patterns_initialized = 0;
> static enum pattern_match_result path_matches_sparsity_patterns(
>                                         const char *path, int pathlen,
>                                         const char *basename,
>                                         struct repository *repo)
> {
>         int dtype = DT_UNKNOWN;
>
>         if (!sparsity_patterns_initialized) {
>                 char *sparse_file = git_pathdup("info/sparse-checkout");
>                 int ret;
>
>                 memset(&sparsity_patterns, 0, sizeof(sparsity_patterns));
>                 sparsity_patterns.use_cone_patterns = core_sparse_checkout_cone;
>                 ret = add_patterns_from_file_to_list(sparse_file, "", 0,
>                                                      &sparsity_patterns, NULL);
>                 free(sparse_file);
>
>                 if (ret < 0)
>                         die(_("failed to load sparse-checkout patterns"));
>                 sparsity_patterns_initialized = 1;
>         }
>
>         return path_matches_pattern_list(path, pathlen, basename, &dtype,
>                                          &sparsity_patterns, repo->index);
> }
>
> Also, if I understand correctly, the index doesn't hold paths to dirs,
> right? So even if a complete dir is excluded from sparse checkout, we
> still have to check all its subentries, only to discover that they
> should all be skipped from the search. However, if we were to check
> against the sparsity patterns directly (e.g. with the function above),
> we could skip such directories together with all their entries.
>
> Oh, and there is also the case of a commit whose tree paths are not in
> the index (maybe manually created objects?). For such commits, with the
> index lookup approach, we would have to fall back on ignoring the
> sparsity rules. I'm not sure if that would be OK, though.
>
> Any thoughts on these two approaches (looking up the skip_worktree bit
> in the index or directly matching against sparsity patterns) will be
> highly appreciated. (Note that it only concerns the `git grep
> <commit-ish>` case. The other cases already iterate through the index, so
> there is no O(log(N)) extra complexity.)
>
>  builtin/grep.c                   | 29 ++++++++---
>  t/t7011-skip-worktree-reading.sh |  9 ----
>  t/t7817-grep-sparse-checkout.sh  | 88 ++++++++++++++++++++++++++++++++
>  3 files changed, 111 insertions(+), 15 deletions(-)
>  create mode 100755 t/t7817-grep-sparse-checkout.sh
>
> diff --git a/builtin/grep.c b/builtin/grep.c
> index 99e2685090..52ec72a036 100644
> --- a/builtin/grep.c
> +++ b/builtin/grep.c
> @@ -388,7 +388,7 @@ static int grep_cache(struct grep_opt *opt,
>                       const struct pathspec *pathspec, int cached);
>  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>                      struct tree_desc *tree, struct strbuf *base, int tn_len,
> -                    int check_attr);
> +                    int from_commit);

I'm not familiar with grep.c and have to admit I don't know what
"check_attr" means.  Slightly surprised to see you replace it, but
maybe reading the rest will explain...

>
>  static int grep_submodule(struct grep_opt *opt,
>                           const struct pathspec *pathspec,
> @@ -486,6 +486,10 @@ static int grep_cache(struct grep_opt *opt,
>
>         for (nr = 0; nr < repo->index->cache_nr; nr++) {
>                 const struct cache_entry *ce = repo->index->cache[nr];
> +
> +               if (ce_skip_worktree(ce))
> +                       continue;
> +

Looks good for the case where we are grepping through what's cached.

>                 strbuf_setlen(&name, name_base_len);
>                 strbuf_addstr(&name, ce->name);
>
> @@ -498,8 +502,7 @@ static int grep_cache(struct grep_opt *opt,
>                          * cache entry are identical, even if worktree file has
>                          * been modified, so use cache version instead
>                          */
> -                       if (cached || (ce->ce_flags & CE_VALID) ||
> -                           ce_skip_worktree(ce)) {
> +                       if (cached || (ce->ce_flags & CE_VALID)) {

I had the same change when I was trying to hack something like this
patch into place, but I only handled the worktree case before I
realized it was a somewhat bigger job.

>                                 if (ce_stage(ce) || ce_intent_to_add(ce))
>                                         continue;
>                                 hit |= grep_oid(opt, &ce->oid, name.buf,
> @@ -532,7 +535,7 @@ static int grep_cache(struct grep_opt *opt,
>
>  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>                      struct tree_desc *tree, struct strbuf *base, int tn_len,
> -                    int check_attr)
> +                    int from_commit)
>  {
>         struct repository *repo = opt->repo;
>         int hit = 0;
> @@ -546,6 +549,9 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>                 name_base_len = name.len;
>         }
>
> +       if (from_commit && repo_read_index(repo) < 0)
> +               die(_("index file corrupt"));
> +

As above, I don't think we should need to read the index.  We should
compare to sparsity patterns, which in the important case (cone mode)
simplifies to a hash lookup as we walk directories.

>         while (tree_entry(tree, &entry)) {
>                 int te_len = tree_entry_len(&entry);
>
> @@ -564,9 +570,20 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>
>                 strbuf_add(base, entry.path, te_len);
>
> +               if (from_commit) {
> +                       int pos = index_name_pos(repo->index,
> +                                                base->buf + tn_len,
> +                                                base->len - tn_len);
> +                       if (pos >= 0 &&
> +                           ce_skip_worktree(repo->index->cache[pos])) {
> +                               strbuf_setlen(base, old_baselen);
> +                               continue;
> +                       }
> +               }
> +
>                 if (S_ISREG(entry.mode)) {
>                         hit |= grep_oid(opt, &entry.oid, base->buf, tn_len,
> -                                        check_attr ? base->buf + tn_len : NULL);
> +                                       from_commit ? base->buf + tn_len : NULL);

Sadly, this doesn't help me understand check_attr or from_commit.
Could you clue me in a bit?

>                 } else if (S_ISDIR(entry.mode)) {
>                         enum object_type type;
>                         struct tree_desc sub;
> @@ -581,7 +598,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>                         strbuf_addch(base, '/');
>                         init_tree_desc(&sub, data, size);
>                         hit |= grep_tree(opt, pathspec, &sub, base, tn_len,
> -                                        check_attr);
> +                                        from_commit);

Same.

>                         free(data);
>                 } else if (recurse_submodules && S_ISGITLINK(entry.mode)) {
>                         hit |= grep_submodule(opt, pathspec, &entry.oid,
> diff --git a/t/t7011-skip-worktree-reading.sh b/t/t7011-skip-worktree-reading.sh
> index 37525cae3a..26852586ac 100755
> --- a/t/t7011-skip-worktree-reading.sh
> +++ b/t/t7011-skip-worktree-reading.sh
> @@ -109,15 +109,6 @@ test_expect_success 'ls-files --modified' '
>         test -z "$(git ls-files -m)"
>  '
>
> -test_expect_success 'grep with skip-worktree file' '
> -       git update-index --no-skip-worktree 1 &&
> -       echo test > 1 &&
> -       git update-index 1 &&
> -       git update-index --skip-worktree 1 &&
> -       rm 1 &&
> -       test "$(git grep --no-ext-grep test)" = "1:test"
> -'
> -
>  echo ":000000 100644 $ZERO_OID $EMPTY_BLOB A   1" > expected
>  test_expect_success 'diff-index does not examine skip-worktree absent entries' '
>         setup_absent &&
> diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> new file mode 100755
> index 0000000000..fccf44e829
> --- /dev/null
> +++ b/t/t7817-grep-sparse-checkout.sh
> @@ -0,0 +1,88 @@
> +#!/bin/sh
> +
> +test_description='grep in sparse checkout
> +
> +This test creates the following dir structure:
> +.
> +| - a
> +| - b
> +| - dir
> +    | - c
> +
> +Only "a" should be present due to the sparse checkout patterns:
> +"/*", "!/b" and "!/dir".
> +'
> +
> +. ./test-lib.sh
> +
> +test_expect_success 'setup' '
> +       echo "text" >a &&
> +       echo "text" >b &&
> +       mkdir dir &&
> +       echo "text" >dir/c &&
> +       git add a b dir &&
> +       git commit -m "initial commit" &&
> +       git tag -am t-commit t-commit HEAD &&
> +       tree=$(git rev-parse HEAD^{tree}) &&
> +       git tag -am t-tree t-tree $tree &&
> +       cat >.git/info/sparse-checkout <<-EOF &&
> +       /*
> +       !/b
> +       !/dir
> +       EOF
> +       git sparse-checkout init &&

Using `git sparse-checkout init` but then manually writing to
.git/info/sparse-checkout?  Seems like it'd make more sense to use
`git sparse-checkout set` than writing the patterns directly yourself.
Also, would prefer to have the examples use cone mode (even if you
have to add subdirectories), as it makes the testcase a bit easier to
read and more performant, though neither is a big deal.

> +       test_path_is_missing b &&
> +       test_path_is_missing dir &&
> +       test_path_is_file a
> +'
> +
> +test_expect_success 'grep in working tree should honor sparse checkout' '
> +       cat >expect <<-EOF &&
> +       a:text
> +       EOF
> +       git grep "text" >actual &&
> +       test_cmp expect actual
> +'
> +
> +test_expect_success 'grep --cached should honor sparse checkout' '
> +       cat >expect <<-EOF &&
> +       a:text
> +       EOF
> +       git grep --cached "text" >actual &&
> +       test_cmp expect actual
> +'
> +
> +test_expect_success 'grep <commit-ish> should honor sparse checkout' '
> +       commit=$(git rev-parse HEAD) &&
> +       cat >expect_commit <<-EOF &&
> +       $commit:a:text
> +       EOF
> +       cat >expect_t-commit <<-EOF &&
> +       t-commit:a:text
> +       EOF
> +       git grep "text" $commit >actual_commit &&
> +       test_cmp expect_commit actual_commit &&
> +       git grep "text" t-commit >actual_t-commit &&
> +       test_cmp expect_t-commit actual_t-commit
> +'
> +
> +test_expect_success 'grep <tree-ish> should search outside sparse checkout' '

I think the test is fine but the title seems misleading.  "outside"
and "inside" aren't defined because <tree-ish> isn't known to be
rooted, meaning we have no way to apply the sparsity patterns.  So
perhaps just 'grep <tree-ish> should ignore sparsity patterns'?

> +       commit=$(git rev-parse HEAD) &&
> +       tree=$(git rev-parse HEAD^{tree}) &&
> +       cat >expect_tree <<-EOF &&
> +       $tree:a:text
> +       $tree:b:text
> +       $tree:dir/c:text
> +       EOF
> +       cat >expect_t-tree <<-EOF &&
> +       t-tree:a:text
> +       t-tree:b:text
> +       t-tree:dir/c:text
> +       EOF
> +       git grep "text" $tree >actual_tree &&
> +       test_cmp expect_tree actual_tree &&
> +       git grep "text" t-tree >actual_t-tree &&
> +       test_cmp expect_t-tree actual_t-tree
> +'
> +
> +test_done
> --
> 2.25.1

* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-24  6:13 ` [RFC PATCH 3/3] grep: add option to ignore sparsity patterns Matheus Tavares
@ 2020-03-24  7:54   ` Elijah Newren
  2020-03-24 18:30     ` Junio C Hamano
  2020-03-25 23:15     ` Matheus Tavares Bernardino
  0 siblings, 2 replies; 57+ messages in thread
From: Elijah Newren @ 2020-03-24  7:54 UTC (permalink / raw)
  To: Matheus Tavares
  Cc: Git Mailing List, Derrick Stolee, Nguyễn Thái Ngọc

On Mon, Mar 23, 2020 at 11:13 PM Matheus Tavares
<matheus.bernardino@usp.br> wrote:
>
> In the last commit, git-grep learned to honor sparsity patterns. For
> some use cases, however, it may be desirable to search outside the
> sparse checkout. So add the '--ignore-sparsity' option, which restores
> the old behavior. Also add the grep.ignoreSparsity configuration, to
> allow setting this behavior by default.

Should `--ignore-sparsity` be a global git option rather than a
grep-specific one?  Also, should grep.ignoreSparsity rather be
core.ignoreSparsity or core.searchOutsideSparsePaths or something?  In
particular, I want a world where:

* Someone can do a "sparse" clone that is NOT just about
sparse-checkout but also about partial clone.  In particular, it makes
use of partial clones to download only the history for the sparsity
paths, and does a sparse-checkout --cone to get those checked out.
(Or, perhaps, defaults to just downloading history for the toplevel
dir, much like `sparse-checkout init --cone`, and then when the user
runs `sparse-checkout set $dir1 $dir2 ...` then it downloads the extra
bits).
* grep, diff, log, shortlog, blame, bisect (and maybe others) all by
default make use of the sparsity patterns to limit their output (but
can all use whatever flag(s) are added here to search outside the
sparsity pattern cones).  This helps users feel they are in a smaller
repo and searching just their area of interest, and it avoids partial
clones downloading blobs unnecessarily.  Nice for the user, and nice
for the system.
* worktrees behave nicer; when creating a new one it inherits the
sparsity patterns of the parent (again to avoid partial clones having
to download everything, and let users continue working on their area
of interest, though they can disable sparse checkouts at any time, of
course).  Still would like Junio's feedback on this one.
* rebase, merge, cherry-pick, etc. (all via the merge machinery) have
smarter tree-merging logic such that when trees are unchanged on one
or both sides of history, we take advantage of the subset of those
cases where we can avoid traversing into subtrees but can resolve the
merge at the tree level.  This is a performance optimization even when
you have all trees and blobs available, but an even more important one
if you don't want partial clones to suddenly have to download
unnecessary objects.  I have ideas and am working on this as part of
merge-ort.
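
The tree-level shortcut in that last item boils down to a three-way
comparison of tree object IDs.  A minimal sketch, assuming hex strings
stand in for object IDs; the enum and function names are made up for
illustration and are not merge-ort code.  It deliberately ignores
complications such as rename detection, which may still require
walking into a subtree:

```c
#include <string.h>

enum tree_merge {
	TAKE_OURS,      /* resolve at the tree level with our tree */
	TAKE_THEIRS,    /* resolve at the tree level with their tree */
	MUST_RECURSE    /* both sides changed differently: walk the subtree */
};

/*
 * Hypothetical helper: decide whether a subtree can be resolved
 * without traversing it.  Arguments are object IDs as hex strings.
 */
static enum tree_merge trivial_tree_merge(const char *base,
					  const char *ours,
					  const char *theirs)
{
	if (!strcmp(ours, theirs))
		return TAKE_OURS;       /* both sides agree */
	if (!strcmp(base, ours))
		return TAKE_THEIRS;     /* only their side changed */
	if (!strcmp(base, theirs))
		return TAKE_OURS;       /* only our side changed */
	return MUST_RECURSE;
}
```

Only the MUST_RECURSE case needs the subtree's objects at all, which
is the part that matters for a partial clone.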

> Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
> ---
>
> Note: I still have to make --ignore-sparsity be able to work together
> with --untracked. Unfortunately, this won't be as simple because the
> codeflow taken by --untracked goes to grep_directory() which just
> iterates the working tree, without looking at the index entries. So I will
> have to either: make --untracked use grep_cache(), and grep the
> untracked files later; or try matching the working tree paths against
> the sparsity patterns, without looking for the skip_worktree bit in
> the index (as I mentioned in the previous patch's comments). Any
> preferences regarding these two approaches? (or other suggestions?)

Hmm.  So, 'tracked' in git is the idea that we are keeping information
about specific files.  'sparse-checkout' is the idea that we have a
subset of those that we can work with without materializing all the
other tracked files; it's clearly a subset of the realm of 'tracked'.
'untracked' is about getting everything outside the set of 'tracked'
files, which to me means it is clearly outside the set of sparsity
paths too (and thus you could take --untracked as implying
--ignore-sparsity, though whether you do might not matter in practice
because of the items I'll discuss next).  Of course, I am also
assuming `--untracked` is incompatible with --cached or specifying
revisions or trees (based on its definition of "In addition to
searching in the tracked files in the *working tree*, search also in
untracked files." -- emphasis added.)  If the incompatibility of
--untracked and --cached/REVISIONS/TREES is not enforced, we may want
to look into erroring out if they are given together.  Once we do, we
don't have to worry about grep_cache() at all in the case of
--untracked and shouldn't.  Files with the skip_worktree bit won't
exist in the working directory, and thus won't be searched (this is
what makes --untracked imply --ignore-sparsity not really matter).

In short: With --untracked you are grepping ALL (non-ignored) files in
the working directory -- either because they are both tracked and in
the sparsity paths (anything tracked that isn't in the sparsity paths
has the skip_worktree bit and thus isn't present), or because it is an
untracked file.  [And this may be what grep_directory() already does.]
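
To spell that rule out, here is a minimal sketch in C.  struct
path_state and both helpers are hypothetical, not Git code, and the
sketch assumes that skip_worktree files are never present on disk and
that the standard ignore rules are in effect:

```c
#include <stdbool.h>

/* Hypothetical per-path facts; not a Git data structure. */
struct path_state {
	bool tracked;        /* the path has an index entry */
	bool skip_worktree;  /* the entry is outside the sparsity paths */
	bool ignored;        /* the path matches an ignore rule */
};

/* Assumption: only tracked skip_worktree paths are absent from disk. */
static bool on_disk(const struct path_state *p)
{
	return !(p->tracked && p->skip_worktree);
}

/* Would a working tree grep read this path? */
static bool searched_in_worktree(const struct path_state *p, bool untracked)
{
	if (!on_disk(p))
		return false;   /* nothing to read, with or without --untracked */
	if (p->tracked)
		return true;    /* tracked and inside the sparsity paths */
	return untracked && !p->ignored;
}
```

Note that the skip_worktree case returns false before the untracked
flag is even consulted, which is why it should not matter whether
--untracked implies --ignore-sparsity.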

Does that make sense?

>  Documentation/config/grep.txt   |  3 +++
>  Documentation/git-grep.txt      |  5 ++++
>  builtin/grep.c                  | 19 +++++++++++----
>  t/t7817-grep-sparse-checkout.sh | 42 +++++++++++++++++++++++++++++++++
>  4 files changed, 65 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
> index 76689771aa..c1d49484c8 100644
> --- a/Documentation/config/grep.txt
> +++ b/Documentation/config/grep.txt
> @@ -25,3 +25,6 @@ grep.fullName::
>  grep.fallbackToNoIndex::
>         If set to true, fall back to git grep --no-index if git grep
>         is executed outside of a git repository.  Defaults to false.
> +
> +grep.ignoreSparsity::
> +       If set to true, enable `--ignore-sparsity` by default.
> diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
> index 97e25d7b1b..5c5c66c056 100644
> --- a/Documentation/git-grep.txt
> +++ b/Documentation/git-grep.txt
> @@ -65,6 +65,11 @@ OPTIONS
>         mechanism.  Only useful when searching files in the current directory
>         with `--no-index`.
>
> +--ignore-sparsity::
> +       In a sparse checked out repository (see linkgit:git-sparse-checkout[1]),
> +       also search in files that are outside the sparse checkout. This option
> +       cannot be used with --no-index or --untracked.

If they are outside the sparse checkout, then they are not present on
disk -- so what is this outside stuff that is being searched?  Perhaps
clarify that this is only useful in combination with
--cached/REVISION/TREE, where there do exist paths outside the
sparsity patterns that become relevant?

>  --recurse-submodules::
>         Recursively search in each submodule that has been initialized and
>         checked out in the repository.  When used in combination with the
> diff --git a/builtin/grep.c b/builtin/grep.c
> index 52ec72a036..17eae3edd6 100644
> --- a/builtin/grep.c
> +++ b/builtin/grep.c
> @@ -33,6 +33,8 @@ static char const * const grep_usage[] = {
>
>  static int recurse_submodules;
>
> +static int ignore_sparsity = 0;
> +
>  static int num_threads;
>
>  static pthread_t *threads;
> @@ -292,6 +294,9 @@ static int grep_cmd_config(const char *var, const char *value, void *cb)
>         if (!strcmp(var, "submodule.recurse"))
>                 recurse_submodules = git_config_bool(var, value);
>
> +       if (!strcmp(var, "grep.ignoresparsity"))
> +               ignore_sparsity = git_config_bool(var, value);
> +
>         return st;
>  }
>
> @@ -487,7 +492,7 @@ static int grep_cache(struct grep_opt *opt,
>         for (nr = 0; nr < repo->index->cache_nr; nr++) {
>                 const struct cache_entry *ce = repo->index->cache[nr];
>
> -               if (ce_skip_worktree(ce))
> +               if (!ignore_sparsity && ce_skip_worktree(ce))

Oh boy on the double negatives...maybe we want to rename this flag somehow?
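
One possible shape, purely as an illustration: invert the flag so the
condition reads positively.  The name restrict_to_sparse_paths and the
stand-in struct below are hypothetical and not part of this patch:

```c
/* Hypothetical stand-in for a cache entry; only the bit we need. */
struct entry {
	int skip_worktree;
};

/* Hypothetical positive flag: 1 (the default) honors the sparsity patterns. */
static int restrict_to_sparse_paths = 1;

/* Skip an entry only when we are restricting and it is outside the patterns. */
static int should_skip(const struct entry *e)
{
	return restrict_to_sparse_paths && e->skip_worktree;
}
```

The option and config parsing would then clear the flag instead of
setting one, and no call site needs a negation.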

>                         continue;
>
>                 strbuf_setlen(&name, name_base_len);
> @@ -502,7 +507,8 @@ static int grep_cache(struct grep_opt *opt,
>                          * cache entry are identical, even if worktree file has
>                          * been modified, so use cache version instead
>                          */
> -                       if (cached || (ce->ce_flags & CE_VALID)) {
> +                       if (cached || (ce->ce_flags & CE_VALID) ||
> +                           ce_skip_worktree(ce)) {
>                                 if (ce_stage(ce) || ce_intent_to_add(ce))
>                                         continue;
>                                 hit |= grep_oid(opt, &ce->oid, name.buf,
> @@ -549,7 +555,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>                 name_base_len = name.len;
>         }
>
> -       if (from_commit && repo_read_index(repo) < 0)
> +       if (!ignore_sparsity && from_commit && repo_read_index(repo) < 0)
>                 die(_("index file corrupt"));
>
>         while (tree_entry(tree, &entry)) {
> @@ -570,7 +576,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>
>                 strbuf_add(base, entry.path, te_len);
>
> -               if (from_commit) {
> +               if (!ignore_sparsity && from_commit) {
>                         int pos = index_name_pos(repo->index,
>                                                  base->buf + tn_len,
>                                                  base->len - tn_len);
> @@ -932,6 +938,8 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
>                 OPT_BOOL_F(0, "ext-grep", &external_grep_allowed__ignored,
>                            N_("allow calling of grep(1) (ignored by this build)"),
>                            PARSE_OPT_NOCOMPLETE),
> +               OPT_BOOL(0, "ignore-sparsity", &ignore_sparsity,
> +                        N_("also search in files outside the sparse checkout")),
>                 OPT_END()
>         };
>
> @@ -1073,6 +1081,9 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
>         if (recurse_submodules && untracked)
>                 die(_("--untracked not supported with --recurse-submodules"));
>
> +       if (ignore_sparsity && (!use_index || untracked))
> +               die(_("--no-index or --untracked cannot be used with --ignore-sparsity"));
> +
>         if (show_in_pager) {
>                 if (num_threads > 1)
>                         warning(_("invalid option combination, ignoring --threads"));
> diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> index fccf44e829..1891ddea57 100755
> --- a/t/t7817-grep-sparse-checkout.sh
> +++ b/t/t7817-grep-sparse-checkout.sh
> @@ -85,4 +85,46 @@ test_expect_success 'grep <tree-ish> should search outside sparse checkout' '
>         test_cmp expect_t-tree actual_t-tree
>  '
>
> +for cmd in 'git grep --ignore-sparsity' 'git -c grep.ignoreSparsity grep' \
> +          'git -c grep.ignoreSparsity=false grep --ignore-sparsity'
> +do
> +       test_expect_success "$cmd should search outside sparse checkout" '
> +               cat >expect <<-EOF &&
> +               a:text
> +               b:text
> +               dir/c:text
> +               EOF
> +               $cmd "text" >actual &&
> +               test_cmp expect actual
> +       '
> +
> +       test_expect_success "$cmd --cached should search outside sparse checkout" '
> +               cat >expect <<-EOF &&
> +               a:text
> +               b:text
> +               dir/c:text
> +               EOF
> +               $cmd --cached "text" >actual &&
> +               test_cmp expect actual
> +       '
> +
> +       test_expect_success "$cmd <commit-ish> should search outside sparse checkout" '
> +               commit=$(git rev-parse HEAD) &&
> +               cat >expect_commit <<-EOF &&
> +               $commit:a:text
> +               $commit:b:text
> +               $commit:dir/c:text
> +               EOF
> +               cat >expect_t-commit <<-EOF &&
> +               t-commit:a:text
> +               t-commit:b:text
> +               t-commit:dir/c:text
> +               EOF
> +               $cmd "text" $commit >actual_commit &&
> +               test_cmp expect_commit actual_commit &&
> +               $cmd "text" t-commit >actual_t-commit &&
> +               test_cmp expect_t-commit actual_t-commit
> +       '
> +done
> +
>  test_done
> --
> 2.25.1

I think there are several things that we need to straighten out first,
and they will affect this patch quite a bit:
* The feedback from the previous patch that the revision handling
should use sparsity patterns rather than ce_skip_worktree() is going
to affect this patch a fair amount.
* I think the fact that --ignore-sparsity is meaningless without
--cached or a REVISION or TREE may also affect things.
* The decision about how to globally name and set the
"ignore-sparsity" bit without requiring users to set it for each and
every subcommand will change this patch a bit too.


I'm super excited to see work in this area.  I hope I'm not
discouraging you by attempting to provide what I think is the bigger
picture I'd like us to work towards.

* Re: [RFC PATCH 1/3] doc: grep: unify info on configuration variables
  2020-03-24  6:11 ` [RFC PATCH 1/3] doc: grep: unify info on configuration variables Matheus Tavares
@ 2020-03-24  7:57   ` Elijah Newren
  2020-03-24 21:26     ` Junio C Hamano
  0 siblings, 1 reply; 57+ messages in thread
From: Elijah Newren @ 2020-03-24  7:57 UTC (permalink / raw)
  To: Matheus Tavares; +Cc: Git Mailing List, Derrick Stolee, brian m. carlson

On Mon, Mar 23, 2020 at 11:11 PM Matheus Tavares
<matheus.bernardino@usp.br> wrote:
>
> Explanations about the configuration variables for git-grep are
> duplicated in "Documentation/git-grep.txt" and
> "Documentation/config/grep.txt". Let's unify the information in the
> second file and include it in the first.
>
> Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
> ---
>  Documentation/config/grep.txt |  7 +++++--
>  Documentation/git-grep.txt    | 35 +++++------------------------------
>  2 files changed, 10 insertions(+), 32 deletions(-)
>
> diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
> index 44abe45a7c..76689771aa 100644
> --- a/Documentation/config/grep.txt
> +++ b/Documentation/config/grep.txt
> @@ -16,8 +16,11 @@ grep.extendedRegexp::
>         other than 'default'.
>
>  grep.threads::
> -       Number of grep worker threads to use.
> -       See `grep.threads` in linkgit:git-grep[1] for more information.
> +       Number of grep worker threads to use. See `--threads` in
> +       linkgit:git-grep[1] for more information.
> +
> +grep.fullName::
> +       If set to true, enable `--full-name` option by default.
>
>  grep.fallbackToNoIndex::
>         If set to true, fall back to git grep --no-index if git grep
> diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
> index ddb6acc025..97e25d7b1b 100644
> --- a/Documentation/git-grep.txt
> +++ b/Documentation/git-grep.txt
> @@ -41,34 +41,7 @@ characters.  An empty string as search expression matches all lines.
>  CONFIGURATION
>  -------------
>
> -grep.lineNumber::
> -       If set to true, enable `-n` option by default.
> -
> -grep.column::
> -       If set to true, enable the `--column` option by default.
> -
> -grep.patternType::
> -       Set the default matching behavior. Using a value of 'basic', 'extended',
> -       'fixed', or 'perl' will enable the `--basic-regexp`, `--extended-regexp`,
> -       `--fixed-strings`, or `--perl-regexp` option accordingly, while the
> -       value 'default' will return to the default matching behavior.
> -
> -grep.extendedRegexp::
> -       If set to true, enable `--extended-regexp` option by default. This
> -       option is ignored when the `grep.patternType` option is set to a value
> -       other than 'default'.
> -
> -grep.threads::
> -       Number of grep worker threads to use. If unset (or set to 0), Git will
> -       use as many threads as the number of logical cores available.
> -
> -grep.fullName::
> -       If set to true, enable `--full-name` option by default.
> -
> -grep.fallbackToNoIndex::
> -       If set to true, fall back to git grep --no-index if git grep
> -       is executed outside of a git repository.  Defaults to false.
> -
> +include::config/grep.txt[]
>
>  OPTIONS
>  -------
> @@ -267,8 +240,10 @@ providing this option will cause it to die.
>         found.
>
>  --threads <num>::
> -       Number of grep worker threads to use.
> -       See `grep.threads` in 'CONFIGURATION' for more information.
> +       Number of grep worker threads to use. If not provided (or set to
> +       0), Git will use as many worker threads as the number of logical
> +       cores available. The default value can also be set with the
> +       `grep.threads` configuration (see linkgit:git-config[1]).

I'm possibly showing my ignorance here, but doesn't the
"include::config/grep.txt[]" you added above mean that the user
doesn't have to see an external manpage but can see the definition
earlier within this same manpage?

>
>  -f <file>::
>         Read patterns from <file>, one per line.
> --
> 2.25.1
>

* Re: [RFC PATCH 2/3] grep: honor sparse checkout patterns
  2020-03-24  7:15   ` Elijah Newren
@ 2020-03-24 15:12     ` Derrick Stolee
  2020-03-24 16:16       ` Elijah Newren
  2020-03-24 23:01       ` Matheus Tavares Bernardino
  2020-03-24 22:55     ` Matheus Tavares Bernardino
  1 sibling, 2 replies; 57+ messages in thread
From: Derrick Stolee @ 2020-03-24 15:12 UTC (permalink / raw)
  To: Elijah Newren, Matheus Tavares
  Cc: Git Mailing List, Derrick Stolee, brian m. carlson, Stefan Beller

On 3/24/2020 3:15 AM, Elijah Newren wrote:
> Hi Matheus,
> 
> On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
> <matheus.bernardino@usp.br> wrote:
>>
>> One of the main uses for a sparse checkout is to allow users to focus on
>> the subset of files in a repository in which they are interested. But
>> git-grep currently ignores the sparsity patterns and reports all matches
>> found outside this subset, which kind of goes in the opposite direction.
>> Let's fix that, making it honor the sparsity boundaries for every
>> grepping case:
>>
>> - git grep in worktree
>> - git grep --cached
>> - git grep $REVISION
> 
> Wahoo!  This is great.

I am also excited. Also thrilled to see the option to get the old
behavior in the next patch.

>> Something I'm not entirely sure in this patch is how we implement the
>> mechanism to honor sparsity for the `git grep <commit-ish>` case (which
>> is treated in the grep_tree() function). Currently, the patch looks for
>> an index entry that matches the path, and then checks its skip_worktree
> 
> As you discuss below, checking the index is both wrong _and_ costly.

I'm not sure why checking the index is _wrong_, but I agree about the
performance cost.

> You should use the sparsity patterns; Stolee did a lot of work to make
> those correspond to simple hashes you could check to determine whether
> to even walk into a subdirectory.  So, O(1).  Yeah, that's "only" cone
> mode but the non-cone sparsity patterns were a performance nightmare
> waiting to rear its ugly head.  We should just try to encourage
> everyone to move to cone mode, or accept the slowness they get without
> it.
> 
>> bit. But this operation is performed in O(log(N)); N being the number of
>> index entries. If there are many entries (and not so many sparsity
>> patterns), maybe a better approach would be to try matching the path
>> directly against the sparsity patterns. This would be O(M) in the number
>> of patterns, and it could be done, in builtin/grep.c, with a function
>> like the following:
>>
>> static struct pattern_list sparsity_patterns;
>> static int sparsity_patterns_initialized = 0;
>> static enum pattern_match_result path_matches_sparsity_patterns(
>>                                         const char *path, int pathlen,
>>                                         const char *basename,
>>                                         struct repository *repo)
>> {
>>         int dtype = DT_UNKNOWN;
>>
>>         if (!sparsity_patterns_initialized) {
>>                 char *sparse_file = git_pathdup("info/sparse-checkout");
>>                 int ret;
>>
>>                 memset(&sparsity_patterns, 0, sizeof(sparsity_patterns));
>>                 sparsity_patterns.use_cone_patterns = core_sparse_checkout_cone;
>>                 ret = add_patterns_from_file_to_list(sparse_file, "", 0,
>>                                                      &sparsity_patterns, NULL);
>>                 free(sparse_file);
>>
>>                 if (ret < 0)
>>                         die(_("failed to load sparse-checkout patterns"));
>>                 sparsity_patterns_initialized = 1;
>>         }
>>
>>         return path_matches_pattern_list(path, pathlen, basename, &dtype,
>>                                          &sparsity_patterns, repo->index);
>> }
>>
>> Also, if I understand correctly, the index doesn't hold paths to dirs,
>> right? So even if a complete dir is excluded from sparse checkout, we
>> still have to check all its subentries, only to discover that they
>> should all be skipped from the search. However, if we were to check
>> against the sparsity patterns directly (e.g. with the function above),
>> we could skip such directories together with all their entries.

When in cone mode, we can check if a directory is one of these three
modes:

1. Completely contained in the cone (recursive match)
2. Completely outside the cone
3. Neither. Keep matching subdirectories. (parent match)

The clear_ce_flags() code in dir.c includes the matching algorithms
for this. Hopefully you can re-use a lot of it. You may need to extract
some methods to use them from the grep code.
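
As a rough illustration of that three-way split, here is a minimal
sketch.  It is not the dir.c code: enum cone_match, is_same_or_under()
and classify_dir() are made-up names, the cone is assumed to be a
plain array of directory paths, and a linear scan stands in for the
hash lookups:

```c
#include <stddef.h>
#include <string.h>

enum cone_match {
	CONE_OUTSIDE,    /* completely outside the cone: skip the subtree */
	CONE_PARENT,     /* ancestor of a cone directory: keep matching */
	CONE_RECURSIVE   /* inside a cone directory: everything matches */
};

/* Return 1 if `dir` equals `prefix` or lies underneath it. */
static int is_same_or_under(const char *dir, const char *prefix)
{
	size_t len = strlen(prefix);

	return !strncmp(dir, prefix, len) &&
	       (dir[len] == '\0' || dir[len] == '/');
}

/*
 * Classify `dir` (no leading or trailing slash, "" for the root)
 * against the `nr` directories listed in `cone`.
 */
static enum cone_match classify_dir(const char *dir,
				    const char *const *cone, size_t nr)
{
	size_t i;
	int parent = 0;

	for (i = 0; i < nr; i++) {
		if (is_same_or_under(dir, cone[i]))
			return CONE_RECURSIVE;
		if (*dir == '\0' || is_same_or_under(cone[i], dir))
			parent = 1;
	}
	return parent ? CONE_PARENT : CONE_OUTSIDE;
}
```

With a cone of "a/b" and "c", the directory "a/b/x" is a recursive
match, "a" is a parent match, and "a/bc" is outside, so a tree walk
could stop descending at "a/bc" without looking at any of its entries.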

>> Oh, and there is also the case of a commit whose tree paths are not in
>> the index (maybe manually created objects?). For such commits, with the
>> index lookup approach, we would have to fall back on ignoring the
>> sparsity rules. I'm not sure if that would be OK, though.
>>
>> Any thoughts on these two approaches (looking up the skip_worktree bit
>> in the index or directly matching against sparsity patterns), will be
>> highly appreciated. (Note that it only concerns the `git grep
>> <commit-ish>` case. The other cases already iterate through the index, so
>> there is no O(log(N)) extra complexity).
>>
>>  builtin/grep.c                   | 29 ++++++++---
>>  t/t7011-skip-worktree-reading.sh |  9 ----
>>  t/t7817-grep-sparse-checkout.sh  | 88 ++++++++++++++++++++++++++++++++
>>  3 files changed, 111 insertions(+), 15 deletions(-)
>>  create mode 100755 t/t7817-grep-sparse-checkout.sh
>>
>> diff --git a/builtin/grep.c b/builtin/grep.c
>> index 99e2685090..52ec72a036 100644
>> --- a/builtin/grep.c
>> +++ b/builtin/grep.c
>> @@ -388,7 +388,7 @@ static int grep_cache(struct grep_opt *opt,
>>                       const struct pathspec *pathspec, int cached);
>>  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>>                      struct tree_desc *tree, struct strbuf *base, int tn_len,
>> -                    int check_attr);
>> +                    int from_commit);
> 
> I'm not familiar with grep.c and have to admit I don't know what
> "check_attr" means.  Slightly surprised to see you replace it, but
> maybe reading the rest will explain...
> 
>>
>>  static int grep_submodule(struct grep_opt *opt,
>>                           const struct pathspec *pathspec,
>> @@ -486,6 +486,10 @@ static int grep_cache(struct grep_opt *opt,
>>
>>         for (nr = 0; nr < repo->index->cache_nr; nr++) {
>>                 const struct cache_entry *ce = repo->index->cache[nr];
>> +
>> +               if (ce_skip_worktree(ce))
>> +                       continue;
>> +
> 
> Looks good for the case where we are grepping through what's cached.
> 
>>                 strbuf_setlen(&name, name_base_len);
>>                 strbuf_addstr(&name, ce->name);
>>
>> @@ -498,8 +502,7 @@ static int grep_cache(struct grep_opt *opt,
>>                          * cache entry are identical, even if worktree file has
>>                          * been modified, so use cache version instead
>>                          */
>> -                       if (cached || (ce->ce_flags & CE_VALID) ||
>> -                           ce_skip_worktree(ce)) {
>> +                       if (cached || (ce->ce_flags & CE_VALID)) {
> 
> I had the same change when I was trying to hack something like this
> patch into place but only handled the worktree case before I realized it
> was a bit bigger job.
> 
>>                                 if (ce_stage(ce) || ce_intent_to_add(ce))
>>                                         continue;
>>                                 hit |= grep_oid(opt, &ce->oid, name.buf,
>> @@ -532,7 +535,7 @@ static int grep_cache(struct grep_opt *opt,
>>
>>  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>>                      struct tree_desc *tree, struct strbuf *base, int tn_len,
>> -                    int check_attr)
>> +                    int from_commit)
>>  {
>>         struct repository *repo = opt->repo;
>>         int hit = 0;
>> @@ -546,6 +549,9 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>>                 name_base_len = name.len;
>>         }
>>
>> +       if (from_commit && repo_read_index(repo) < 0)
>> +               die(_("index file corrupt"));
>> +
> 
> As above, I don't think we should need to read the index.  We should
> compare to sparsity patterns, which in the important case (cone mode)
> simplifies to a hash lookup as we walk directories.
> 
>>         while (tree_entry(tree, &entry)) {
>>                 int te_len = tree_entry_len(&entry);
>>
>> @@ -564,9 +570,20 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>>
>>                 strbuf_add(base, entry.path, te_len);
>>
>> +               if (from_commit) {
>> +                       int pos = index_name_pos(repo->index,
>> +                                                base->buf + tn_len,
>> +                                                base->len - tn_len);
>> +                       if (pos >= 0 &&
>> +                           ce_skip_worktree(repo->index->cache[pos])) {
>> +                               strbuf_setlen(base, old_baselen);
>> +                               continue;
>> +                       }
>> +               }
>> +
>>                 if (S_ISREG(entry.mode)) {
>>                         hit |= grep_oid(opt, &entry.oid, base->buf, tn_len,
>> -                                        check_attr ? base->buf + tn_len : NULL);
>> +                                       from_commit ? base->buf + tn_len : NULL);
> 
> Sadly, this doesn't help me understand check_attr or from_commit.
> Could you clue me in a bit?

Yeah, Elijah and I know the sparse-checkout code quite well, but are
unfamiliar with grep. Let's all expand our knowledge!

>>                 } else if (S_ISDIR(entry.mode)) {
>>                         enum object_type type;
>>                         struct tree_desc sub;
>> @@ -581,7 +598,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>>                         strbuf_addch(base, '/');
>>                         init_tree_desc(&sub, data, size);
>>                         hit |= grep_tree(opt, pathspec, &sub, base, tn_len,
>> -                                        check_attr);
>> +                                        from_commit);
> 
> Same.
> 
>>                         free(data);
>>                 } else if (recurse_submodules && S_ISGITLINK(entry.mode)) {
>>                         hit |= grep_submodule(opt, pathspec, &entry.oid,
>> diff --git a/t/t7011-skip-worktree-reading.sh b/t/t7011-skip-worktree-reading.sh
>> index 37525cae3a..26852586ac 100755
>> --- a/t/t7011-skip-worktree-reading.sh
>> +++ b/t/t7011-skip-worktree-reading.sh
>> @@ -109,15 +109,6 @@ test_expect_success 'ls-files --modified' '
>>         test -z "$(git ls-files -m)"
>>  '
>>
>> -test_expect_success 'grep with skip-worktree file' '
>> -       git update-index --no-skip-worktree 1 &&
>> -       echo test > 1 &&
>> -       git update-index 1 &&
>> -       git update-index --skip-worktree 1 &&
>> -       rm 1 &&
>> -       test "$(git grep --no-ext-grep test)" = "1:test"
>> -'
>> -
>>  echo ":000000 100644 $ZERO_OID $EMPTY_BLOB A   1" > expected
>>  test_expect_success 'diff-index does not examine skip-worktree absent entries' '
>>         setup_absent &&
>> diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
>> new file mode 100755
>> index 0000000000..fccf44e829
>> --- /dev/null
>> +++ b/t/t7817-grep-sparse-checkout.sh
>> @@ -0,0 +1,88 @@
>> +#!/bin/sh
>> +
>> +test_description='grep in sparse checkout
>> +
>> +This test creates the following dir structure:
>> +.
>> +| - a
>> +| - b
>> +| - dir
>> +    | - c
>> +
>> +Only "a" should be present due to the sparse checkout patterns:
>> +"/*", "!/b" and "!/dir".
>> +'
>> +
>> +. ./test-lib.sh
>> +
>> +test_expect_success 'setup' '
>> +       echo "text" >a &&
>> +       echo "text" >b &&
>> +       mkdir dir &&
>> +       echo "text" >dir/c &&
>> +       git add a b dir &&
>> +       git commit -m "initial commit" &&
>> +       git tag -am t-commit t-commit HEAD &&
>> +       tree=$(git rev-parse HEAD^{tree}) &&
>> +       git tag -am t-tree t-tree $tree &&
>> +       cat >.git/info/sparse-checkout <<-EOF &&
>> +       /*
>> +       !/b
>> +       !/dir
>> +       EOF
>> +       git sparse-checkout init &&
> 
> Using `git sparse-checkout init` but then manually writing to
> .git/info/sparse-checkout?  Seems like it'd make more sense to use
> `git sparse-checkout set` than writing the patterns directly yourself.
> Also, would prefer to have the examples use cone mode (even if you
> have to add subdirectories), as it makes the testcase a bit easier to
> read and more performant, though neither is a big deal.

I agree that we should use the builtin so your test script is less
brittle to potential back-end changes to sparse-checkout (none planned).

I do recommend having at least one test with non-cone mode patterns,
especially if you are checking the pattern-matching yourself instead of
relying on the index.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 2/3] grep: honor sparse checkout patterns
  2020-03-24 15:12     ` Derrick Stolee
@ 2020-03-24 16:16       ` Elijah Newren
  2020-03-24 17:02         ` Derrick Stolee
  2020-03-24 23:01       ` Matheus Tavares Bernardino
  1 sibling, 1 reply; 57+ messages in thread
From: Elijah Newren @ 2020-03-24 16:16 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Matheus Tavares, Git Mailing List, Derrick Stolee,
	brian m. carlson, Stefan Beller

On Tue, Mar 24, 2020 at 8:12 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 3/24/2020 3:15 AM, Elijah Newren wrote:
> > Hi Matheus,
> >
> > On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
...
> >> Something I'm not entirely sure about in this patch is how we implement the
> >> mechanism to honor sparsity for the `git grep <commit-ish>` case (which
> >> is treated in the grep_tree() function). Currently, the patch looks for
> >> an index entry that matches the path, and then checks its skip_worktree
> >
> > As you discuss below, checking the index is both wrong _and_ costly.
>
> I'm not sure why checking the index is _wrong_, but I agree about the
> performance cost.

Let's say there are two directories, dir1 and dir2.  Over time, there
have existed a total of six files:
   dir1/{a,b,c}
   dir2/{d,e,f}
At the current time, there are only four files in the index:
   dir1/{a,b}
   dir2/{d,e}
And the user has done a `git sparse-checkout set dir2` and then at
some point later run `git grep foobar OTHERCOMMIT`.  What happens?

Well, since we're in a sparse checkout, we should only search the
relevant paths within OTHERCOMMIT for "foobar".  Let's say we attempt
to figure out the "relevant paths" using the index.  We can tell that
dir1/a and dir1/b are marked as SKIP_WORKTREE so we don't search them.
dir1/c is untracked -- what do we do with it?  Include it?  Exclude
it?  Carrying on with the other files, dir2/d and dir2/e are tracked
and !SKIP_WORKTREE so we search them.  dir2/f is untracked -- what do
we do with it?  Include it?  Exclude it?

We're left without the necessary information to tell whether we should
search OTHERCOMMIT's dir1/c and dir2/f if we consult the index.  Any
decision we make is going to be wrong for one of the two paths.

If we instead do not attempt to consult the index (which corresponds
to a version close to HEAD) in order to ask questions about the
completely different OTHERCOMMIT, but instead use the sparsity
patterns to query whether those files/directories are interesting,
then we get the right answer.  The index can only be consulted for the
right answer in the case of --cached; in all other cases (including
OTHERCOMMIT == HEAD), we should use the sparsity patterns.  In fact,
we could also use the sparsity patterns in the case of --cached, it's
just that for that one particular case consulting the index will also
give the right answer.
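
To make the dir1/dir2 scenario above concrete, here is a toy model in C.
It is only a sketch under stated assumptions: `toy_entry`, `ask_index()`
and `ask_patterns()` are hypothetical names, not git's data structures or
API, and the pattern check assumes a single cone-mode directory ("dir2"),
with top-level files always included.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for an index entry; NOT git's struct cache_entry. */
struct toy_entry {
	const char *path;
	bool skip_worktree;
};

/* The index from the example: dir1/{a,b} are skipped, dir2/{d,e} are not. */
static const struct toy_entry toy_index[] = {
	{ "dir1/a", true },
	{ "dir1/b", true },
	{ "dir2/d", false },
	{ "dir2/e", false },
};

enum verdict { SEARCH, SKIP, UNKNOWN };

/* Index-based answer: only paths present in the index can be decided. */
static enum verdict ask_index(const char *path)
{
	size_t i;

	assert(path);
	for (i = 0; i < sizeof(toy_index) / sizeof(toy_index[0]); i++)
		if (!strcmp(toy_index[i].path, path))
			return toy_index[i].skip_worktree ? SKIP : SEARCH;
	return UNKNOWN; /* e.g. a path that only exists in OTHERCOMMIT */
}

/* Pattern-based answer for one cone directory (hypothetical helper). */
static enum verdict ask_patterns(const char *cone_dir, const char *path)
{
	size_t len = strlen(cone_dir);

	assert(path);
	if (!strchr(path, '/'))
		return SEARCH; /* assumption: top-level files are in the cone */
	if (!strncmp(path, cone_dir, len) && path[len] == '/')
		return SEARCH;
	return SKIP;
}
```

With this model, `ask_index("dir1/c")` and `ask_index("dir2/f")` both come
back as UNKNOWN, while `ask_patterns("dir2", ...)` gives SKIP for dir1/c
and SEARCH for dir2/f, which is the distinction the example is after.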

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 2/3] grep: honor sparse checkout patterns
  2020-03-24 16:16       ` Elijah Newren
@ 2020-03-24 17:02         ` Derrick Stolee
  0 siblings, 0 replies; 57+ messages in thread
From: Derrick Stolee @ 2020-03-24 17:02 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Matheus Tavares, Git Mailing List, Derrick Stolee,
	brian m. carlson, Stefan Beller

On 3/24/2020 12:16 PM, Elijah Newren wrote:
> On Tue, Mar 24, 2020 at 8:12 AM Derrick Stolee <stolee@gmail.com> wrote:
>>
>> On 3/24/2020 3:15 AM, Elijah Newren wrote:
>>> Hi Matheus,
>>>
>>> On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
> ...
>>>> Something I'm not entirely sure about in this patch is how we implement the
>>>> mechanism to honor sparsity for the `git grep <commit-ish>` case (which
>>>> is treated in the grep_tree() function). Currently, the patch looks for
>>>> an index entry that matches the path, and then checks its skip_worktree
>>>
>>> As you discuss below, checking the index is both wrong _and_ costly.
>>
>> I'm not sure why checking the index is _wrong_, but I agree about the
>> performance cost.
> 
> Let's say there are two directories, dir1 and dir2.  Over time, there
> have existed a total of six files:
>    dir1/{a,b,c}
>    dir2/{d,e,f}
> At the current time, there are only four files in the index:
>    dir1/{a,b}
>    dir2/{d,e}
> And the user has done a `git sparse-checkout set dir2` and then at
> some point later run `git grep foobar OTHERCOMMIT`.  What happens?
> 
> Well, since we're in a sparse checkout, we should only search the
> relevant paths within OTHERCOMMIT for "foobar".  Let's say we attempt
> to figure out the "relevant paths" using the index.  We can tell that
> dir1/a and dir1/b are marked as SKIP_WORKTREE so we don't search them.
> dir1/c is untracked -- what do we do with it?  Include it?  Exclude
> it?  Carrying on with the other files, dir2/d and dir2/e are tracked
> and !SKIP_WORKTREE so we search them.  dir2/f is untracked -- what do
> we do with it?  Include it?  Exclude it?
> 
> We're left without the necessary information to tell whether we should
> search OTHERCOMMIT's dir1/c and dir2/f if we consult the index.  Any
> decision we make is going to be wrong for one of the two paths.
> 
> If we instead do not attempt to consult the index (which corresponds
> to a version close to HEAD) in order to ask questions about the
> completely different OTHERCOMMIT, but instead use the sparsity
> patterns to query whether those files/directories are interesting,
> then we get the right answer.  The index can only be consulted for the
> right answer in the case of --cached; in all other cases (including
> OTHERCOMMIT == HEAD), we should use the sparsity patterns.  In fact,
> we could also use the sparsity patterns in the case of --cached, it's
> just that for that one particular case consulting the index will also
> give the right answer.

Thanks! This helps a lot.

-Stolee


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-24  7:54   ` Elijah Newren
@ 2020-03-24 18:30     ` Junio C Hamano
  2020-03-24 19:07       ` Elijah Newren
  2020-03-30  3:23       ` Matheus Tavares Bernardino
  2020-03-25 23:15     ` Matheus Tavares Bernardino
  1 sibling, 2 replies; 57+ messages in thread
From: Junio C Hamano @ 2020-03-24 18:30 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Matheus Tavares, Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc

Elijah Newren <newren@gmail.com> writes:

> On Mon, Mar 23, 2020 at 11:13 PM Matheus Tavares
> <matheus.bernardino@usp.br> wrote:
>>
>> In the last commit, git-grep learned to honor sparsity patterns. For
>> some use cases, however, it may be desirable to search outside the
>> sparse checkout. So add the '--ignore-sparsity' option, which restores
>> the old behavior. Also add the grep.ignoreSparsity configuration, to
>> allow setting this behavior by default.
>
> Should `--ignore-sparsity` be a global git option rather than a
> grep-specific one?  Also, should grep.ignoreSparsity rather be
> core.ignoreSparsity or core.searchOutsideSparsePaths or something?

Great question.  I think "git diff" with various options would also
want to optionally be able to be confined within the sparse cone, or
checking the entire world by lazily fetching outside the sparsity.

> * grep, diff, log, shortlog, blame, bisect (and maybe others) all by
> default make use of the sparsity patterns to limit their output (but
> can all use whatever flag(s) are added here to search outside the
> sparsity pattern cones).  This helps users feel they are in a smaller
> repo and searching just their area of interest, and it avoids partial
> clones downloading blobs unnecessarily.  Nice for the user, and nice
> for the system.

I am not sure which one should be the default.  From historical
point of view that sparse stuff was done as an optimization to omit
initial work and lazily give the whole world, I may have slight
preference to the "we pretend that you have everything, just some
parts may be slower to come to you" world view to be the default,
with an option to limit the view to whatever sparsity you initially
set up.  Regardless of the choice of the default, it would be a good
idea to make the subcommands consistently offer the same default and
allow the non-default views with the same UI.



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-24 18:30     ` Junio C Hamano
@ 2020-03-24 19:07       ` Elijah Newren
  2020-03-25 20:18         ` Junio C Hamano
  2020-03-30  3:23       ` Matheus Tavares Bernardino
  1 sibling, 1 reply; 57+ messages in thread
From: Elijah Newren @ 2020-03-24 19:07 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Matheus Tavares, Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc

On Tue, Mar 24, 2020 at 11:30 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> > On Mon, Mar 23, 2020 at 11:13 PM Matheus Tavares
> > <matheus.bernardino@usp.br> wrote:
> >>
> >> In the last commit, git-grep learned to honor sparsity patterns. For
> >> some use cases, however, it may be desirable to search outside the
> >> sparse checkout. So add the '--ignore-sparsity' option, which restores
> >> the old behavior. Also add the grep.ignoreSparsity configuration, to
> >> allow setting this behavior by default.
> >
> > Should `--ignore-sparsity` be a global git option rather than a
> > grep-specific one?  Also, should grep.ignoreSparsity rather be
> > core.ignoreSparsity or core.searchOutsideSparsePaths or something?
>
> Great question.  I think "git diff" with various options would also
> want to optionally be able to be confined within the sparse cone, or
> checking the entire world by lazily fetching outside the sparsity.
>
> > * grep, diff, log, shortlog, blame, bisect (and maybe others) all by
> > default make use of the sparsity patterns to limit their output (but
> > can all use whatever flag(s) are added here to search outside the
> > sparsity pattern cones).  This helps users feel they are in a smaller
> > repo and searching just their area of interest, and it avoids partial
> > clones downloading blobs unnecessarily.  Nice for the user, and nice
> > for the system.
>
> I am not sure which one should be the default.  From historical
> point of view that sparse stuff was done as an optimization to omit
> initial work and lazily give the whole world, I may have slight
> preference to the "we pretend that you have everything, just some
> parts may be slower to come to you" world view to be the default,
> with an option to limit the view to whatever sparsity you initially
> set up.

It sounds like you are describing partial clone rather than sparse
checkout?  Or perhaps you're trying to blur the distinction,
suggesting the two should be used together, with the partial clone
machinery learning to download history within the specified sparse
cones?

>  Regardless of the choice of the default, it would be a good
> idea to make the subcommands consistently offer the same default and
> allow the non-default views with the same UI.

Agreed.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 1/3] doc: grep: unify info on configuration variables
  2020-03-24  7:57   ` Elijah Newren
@ 2020-03-24 21:26     ` Junio C Hamano
  2020-03-24 23:38       ` Matheus Tavares
  0 siblings, 1 reply; 57+ messages in thread
From: Junio C Hamano @ 2020-03-24 21:26 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Matheus Tavares, Git Mailing List, Derrick Stolee, brian m. carlson

Elijah Newren <newren@gmail.com> writes:

>> diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
>> index 44abe45a7c..76689771aa 100644
>> --- a/Documentation/config/grep.txt
>> +++ b/Documentation/config/grep.txt
>> @@ -16,8 +16,11 @@ grep.extendedRegexp::
>> ...
>> +       Number of grep worker threads to use. See `--threads` in
>> +       linkgit:git-grep[1] for more information.
>> ...
>> diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
>> index ddb6acc025..97e25d7b1b 100644
>> --- a/Documentation/git-grep.txt
>> +++ b/Documentation/git-grep.txt
>> @@ -41,34 +41,7 @@ characters.  An empty string as search expression matches all lines.
>> ...
>> +include::config/grep.txt[]
>> ...
>>  --threads <num>::
>> -       Number of grep worker threads to use.
>> -       See `grep.threads` in 'CONFIGURATION' for more information.
>> +       Number of grep worker threads to use. If not provided (or set to
>> +       0), Git will use as many worker threads as the number of logical
>> +       cores available. The default value can also be set with the
>> +       `grep.threads` configuration (see linkgit:git-config[1]).
>
> I'm possibly showing my ignorance here, but doesn't the
> "include::config/grep.txt[]" you added above mean that the user
> doesn't have to see an external manpage but can see the definition
> earlier within this same manpage?

I think so.  Also, the new reference "See `--threads` in git-grep"
added to grep.threads to config/grep.txt would become somewhat
redundant in the context of "git grep --help" (only "See --threads"
is relevant when it appears in this same manual page).

Readers who find the reference in "git config --help" still need
to see that --threads is an option to git-grep, though.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 2/3] grep: honor sparse checkout patterns
  2020-03-24  7:15   ` Elijah Newren
  2020-03-24 15:12     ` Derrick Stolee
@ 2020-03-24 22:55     ` Matheus Tavares Bernardino
  2020-04-21  2:10       ` Matheus Tavares Bernardino
  1 sibling, 1 reply; 57+ messages in thread
From: Matheus Tavares Bernardino @ 2020-03-24 22:55 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git Mailing List, Derrick Stolee, brian m. carlson, Stefan Beller

On Tue, Mar 24, 2020 at 4:15 AM Elijah Newren <newren@gmail.com> wrote:
>
> On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
> <matheus.bernardino@usp.br> wrote:
> >
> > Something I'm not entirely sure about in this patch is how we implement the
> > mechanism to honor sparsity for the `git grep <commit-ish>` case (which
> > is treated in the grep_tree() function). Currently, the patch looks for
> > an index entry that matches the path, and then checks its skip_worktree
>
> As you discuss below, checking the index is both wrong _and_ costly.
> You should use the sparsity patterns; Stolee did a lot of work to make
> those correspond to simple hashes you could check to determine whether
> to even walk into a subdirectory.  So, O(1).  Yeah, that's "only" cone
> mode but the non-cone sparsity patterns were a performance nightmare
> waiting to rear its ugly head.  We should just try to encourage
> everyone to move to cone mode, or accept the slowness they get without
> it.

OK, makes sense. And your reply to Stolee, later in this thread, made
it clearer for me why checking the index is not only costly but also
wrong. Thanks for the great explanation! I will use the sparsity
patterns directly, in the next iteration.

> > diff --git a/builtin/grep.c b/builtin/grep.c
> > index 99e2685090..52ec72a036 100644
> > --- a/builtin/grep.c
> > +++ b/builtin/grep.c
> > @@ -388,7 +388,7 @@ static int grep_cache(struct grep_opt *opt,
> >                       const struct pathspec *pathspec, int cached);
> >  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> >                      struct tree_desc *tree, struct strbuf *base, int tn_len,
> > -                    int check_attr);
> > +                    int from_commit);
>
> I'm not familiar with grep.c and have to admit I don't know what
> "check_attr" means. Slightly surprised to see you replace it, but
> maybe reading the rest will explain...
...
>>                 if (S_ISREG(entry.mode)) {
>>                         hit |= grep_oid(opt, &entry.oid, base->buf, tn_len,
>> -                                        check_attr ? base->buf + tn_len : NULL);
>> +                                       from_commit ? base->buf + tn_len : NULL);
>
> Sadly, this doesn't help me understand check_attr or from_commit.
> Could you clue me in a bit?

Sure! The grep machinery can optionally look at the .gitattributes file,
to see if a given path has a "diff" attribute assigned to it. This
attribute points to a diff driver in .gitconfig, which can specify
many things, such as whether the path should be treated as a binary or
not. The "check_attr" flag passed to grep_tree() tells the grep
machinery if it should perform this attribute lookup for the paths in
the given tree.

I decided to replace it with "from_commit" because the only time we
want an attribute lookup when grepping a tree is when it comes from a
commit, i.e., when the tree is the root. (The reasoning goes along the
same lines as for why we only check sparsity patterns in git-grep for
commit-ish objects: we cannot check pattern matching for trees which
we are not sure to be rooted.) Since "knowing if the tree is a root or
not" is useful in grep_tree() for both sparsity checks and attribute
checks, I thought we could use a single "from_commit" variable instead
of "check_attr" and "check_sparsity", which would always have matching
values. But on second thought, I could maybe rename the variable to
something like "is_root_tree" or add a comment explaining the usage of
"from_commit".

(I'm not a big fan of "is_root_tree", though, because we could give a
root tree to grep_tree() but not really know it.)
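
As a small illustration of the `from_commit ? base->buf + tn_len : NULL`
expression in the quoted hunk, here is a sketch. It assumes that the
buffer holds "<tree name>:<path>" and that tn_len is the length of the
"<tree name>:" prefix; that is a reading of the hunk, not something the
hunk itself spells out. `attr_path()` is a hypothetical helper, not a
function in builtin/grep.c.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the path to use for the attribute lookup, or NULL to skip it.
 * Assumption: `base` looks like "HEAD:dir/c" and `tn_len` covers the
 * leading "HEAD:" part, so base + tn_len is the path inside the tree.
 */
static const char *attr_path(const char *base, size_t tn_len, int from_commit)
{
	assert(base);
	return from_commit ? base + tn_len : NULL;
}
```

Under those assumptions, `attr_path("HEAD:dir/c", 5, 1)` yields "dir/c",
and passing 0 for from_commit yields NULL, meaning no attribute lookup is
done for that path.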

> > diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> > new file mode 100755
> > index 0000000000..fccf44e829
> > --- /dev/null
> > +++ b/t/t7817-grep-sparse-checkout.sh
...
> > +test_expect_success 'setup' '
> > +       echo "text" >a &&
> > +       echo "text" >b &&
> > +       mkdir dir &&
> > +       echo "text" >dir/c &&
> > +       git add a b dir &&
> > +       git commit -m "initial commit" &&
> > +       git tag -am t-commit t-commit HEAD &&
> > +       tree=$(git rev-parse HEAD^{tree}) &&
> > +       git tag -am t-tree t-tree $tree &&
> > +       cat >.git/info/sparse-checkout <<-EOF &&
> > +       /*
> > +       !/b
> > +       !/dir
> > +       EOF
> > +       git sparse-checkout init &&
>
> Using `git sparse-checkout init` but then manually writing to
> .git/info/sparse-checkout?  Seems like it'd make more sense to use
> `git sparse-checkout set` than writing the patterns directly yourself.
> Also, would prefer to have the examples use cone mode (even if you
> have to add subdirectories), as it makes the testcase a bit easier to
> read and more performant, though neither is a big deal.

OK, I will make use of the builtin here. I will also use the cone mode
(and leave one test without it, as Stolee suggested later in this
thread).

> > +test_expect_success 'grep <tree-ish> should search outside sparse checkout' '
>
> I think the test is fine but the title seems misleading.  "outside"
> and "inside" aren't defined because <tree-ish> isn't known to be
> rooted, meaning we have no way to apply the sparsity patterns.  So
> perhaps just 'grep <tree-ish> should ignore sparsity patterns'?

Right! "should ignore sparsity patterns" is a much better name, thanks.

Thanks a lot for the thoughtful review and comments!

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 2/3] grep: honor sparse checkout patterns
  2020-03-24 15:12     ` Derrick Stolee
  2020-03-24 16:16       ` Elijah Newren
@ 2020-03-24 23:01       ` Matheus Tavares Bernardino
  1 sibling, 0 replies; 57+ messages in thread
From: Matheus Tavares Bernardino @ 2020-03-24 23:01 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Elijah Newren, Git Mailing List, Derrick Stolee,
	brian m. carlson, Stefan Beller

On Tue, Mar 24, 2020 at 12:12 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 3/24/2020 3:15 AM, Elijah Newren wrote:
> >
> > On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
> > <matheus.bernardino@usp.br> wrote:
> >>
> >> Also, if I understand correctly, the index doesn't hold paths to dirs,
> >> right? So even if a complete dir is excluded from sparse checkout, we
> >> still have to check all its subentries, only to discover that they
> >> should all be skipped from the search. However, if we were to check
> >> against the sparsity patterns directly (e.g. with the function above),
> >> we could skip such directories together with all their entries.
>
> When in cone mode, we can check if a directory is one of these three
> modes:
>
> 1. Completely contained in the cone (recursive match)
> 2. Completely outside the cone
> 3. Neither. Keep matching subdirectories. (parent match)
>
> The clear_ce_flags() code in dir.c includes the matching algorithms
> for this. Hopefully you can re-use a lot of it. You may need to extract
> some methods to use them from the grep code.

Thanks for the pointer! I will take a look at the code in dir.c.
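
The three cone-mode cases listed above can be sketched roughly as
follows. This is not the clear_ce_flags() code from dir.c; `cone_match`,
`is_same_or_under()` and `classify_dir()` are hypothetical names, and the
linear scan stands in for the hash lookups mentioned earlier in the
thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum cone_match {
	CONE_RECURSIVE, /* 1. completely contained in the cone */
	CONE_OUTSIDE,   /* 2. completely outside the cone */
	CONE_PARENT     /* 3. neither: keep matching subdirectories */
};

/* Return nonzero if `dir` equals `prefix` or lies underneath it. */
static int is_same_or_under(const char *dir, const char *prefix)
{
	size_t len = strlen(prefix);

	return !strncmp(dir, prefix, len) &&
	       (dir[len] == '\0' || dir[len] == '/');
}

/*
 * Classify directory `dir` (no trailing slash) against `nr` cone
 * directories. A linear scan keeps the sketch short.
 */
static enum cone_match classify_dir(const char *dir,
				    const char *const *cones, size_t nr)
{
	size_t i;
	enum cone_match result = CONE_OUTSIDE;

	assert(dir && cones);
	for (i = 0; i < nr; i++) {
		if (is_same_or_under(dir, cones[i]))
			return CONE_RECURSIVE; /* inside a cone directory */
		if (is_same_or_under(cones[i], dir))
			result = CONE_PARENT; /* a cone lives below `dir` */
	}
	return result;
}
```

A tree walk could then skip a CONE_OUTSIDE directory together with all of
its entries, search everything below a CONE_RECURSIVE one, and only keep
checking subdirectories for CONE_PARENT.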

> >> diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> >> new file mode 100755
> >> index 0000000000..fccf44e829
...
> >> +       cat >.git/info/sparse-checkout <<-EOF &&
> >> +       /*
> >> +       !/b
> >> +       !/dir
> >> +       EOF
> >> +       git sparse-checkout init &&
> >
> > Using `git sparse-checkout init` but then manually writing to
> > .git/info/sparse-checkout?  Seems like it'd make more sense to use
> > `git sparse-checkout set` than writing the patterns directly yourself.
> > Also, would prefer to have the examples use cone mode (even if you
> > have to add subdirectories), as it makes the testcase a bit easier to
> > read and more performant, though neither is a big deal.
>
> I agree that we should use the builtin so your test script is less
> brittle to potential back-end changes to sparse-checkout (none planned).

Makes sense!

> I do recommend having at least one test with non-cone mode patterns,
> especially if you are checking the pattern-matching yourself instead of
> relying on the index.

OK, I will leave at least one test with non-cone patterns then. Thanks
for the comments!

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 1/3] doc: grep: unify info on configuration variables
  2020-03-24 21:26     ` Junio C Hamano
@ 2020-03-24 23:38       ` Matheus Tavares
  0 siblings, 0 replies; 57+ messages in thread
From: Matheus Tavares @ 2020-03-24 23:38 UTC (permalink / raw)
  To: gitster; +Cc: dstolee, git, matheus.bernardino, newren, sandals

On Tue, Mar 24, 2020 at 6:26 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> >> diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
> >> index 44abe45a7c..76689771aa 100644
> >> --- a/Documentation/config/grep.txt
> >> +++ b/Documentation/config/grep.txt
> >> @@ -16,8 +16,11 @@ grep.extendedRegexp::
> >> ...
> >> +       Number of grep worker threads to use. See `--threads` in
> >> +       linkgit:git-grep[1] for more information.
> >> ...
> >> diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
> >> index ddb6acc025..97e25d7b1b 100644
> >> --- a/Documentation/git-grep.txt
> >> +++ b/Documentation/git-grep.txt
> >> @@ -41,34 +41,7 @@ characters.  An empty string as search expression matches all lines.
> >> ...
> >> +include::config/grep.txt[]
> >> ...
> >>  --threads <num>::
> >> -       Number of grep worker threads to use.
> >> -       See `grep.threads` in 'CONFIGURATION' for more information.
> >> +       Number of grep worker threads to use. If not provided (or set to
> >> +       0), Git will use as many worker threads as the number of logical
> >> +       cores available. The default value can also be set with the
> >> +       `grep.threads` configuration (see linkgit:git-config[1]).
> >
> > I'm possibly showing my ignorance here, but doesn't the
> > "include::config/grep.txt[]" you added above mean that the user
> > doesn't have to see an external manpage but can see the definition
> > earlier within this same manpage?

You are right. I added the "(see linkgit:git-config[1])" here more as a
reference to the config system itself (for a user that is possibly not familiar
with git-config). But if this is not necessary, we can remove the reference.

> I think so.  Also, the new reference "See `--threads` in git-grep"
> added to grep.threads to config/grep.txt would become somewhat
> redundant in the context of "git grep --help" (only "See --threads"
> is relevant when it appears in this same manual page).

Thanks for pointing that out. I think we can solve this issue with the
following:

diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
index c1d49484c8..ac06db4206 100644
--- a/Documentation/config/grep.txt
+++ b/Documentation/config/grep.txt
@@ -16,8 +16,11 @@ grep.extendedRegexp::
 	other than 'default'.

 grep.threads::
-	Number of grep worker threads to use. See `--threads` in
-	linkgit:git-grep[1] for more information.
+	Number of grep worker threads to use. See `--threads`
+ifndef::git-grep[]
+	in linkgit:git-grep[1]
+endif::git-grep[]
+	for more information.

 grep.fullName::
 	If set to true, enable `--full-name` option by default.
diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
index 5c5c66c056..192aab4cba 100644
--- a/Documentation/git-grep.txt
+++ b/Documentation/git-grep.txt
@@ -41,6 +41,7 @@ characters.  An empty string as search expression matches all lines.
 CONFIGURATION
 -------------

+:git-grep: 1
 include::config/grep.txt[]

 OPTIONS

I will add these changes in v2.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-24 19:07       ` Elijah Newren
@ 2020-03-25 20:18         ` Junio C Hamano
  0 siblings, 0 replies; 57+ messages in thread
From: Junio C Hamano @ 2020-03-25 20:18 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Matheus Tavares, Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc

Elijah Newren <newren@gmail.com> writes:

> It sounds like you are describing partial clone rather than sparse
> checkout?  Or perhaps you're trying to blur the distinction,
> suggesting the two should be used together, with the partial clone
> machinery learning to download history within the specified sparse
> cones?

Yeah, I guess it is a little bit of both ;-)

>>  Regardless of the choice of the default, it would be a good
>> idea to make the subcommands consistently offer the same default and
>> allow the non-default views with the same UI.
>
> Agreed.

Yup, thanks.


* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-24  7:54   ` Elijah Newren
  2020-03-24 18:30     ` Junio C Hamano
@ 2020-03-25 23:15     ` Matheus Tavares Bernardino
  2020-03-26  6:02       ` Elijah Newren
  1 sibling, 1 reply; 57+ messages in thread
From: Matheus Tavares Bernardino @ 2020-03-25 23:15 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc, Junio C Hamano

On Tue, Mar 24, 2020 at 4:55 AM Elijah Newren <newren@gmail.com> wrote:
>
> On Mon, Mar 23, 2020 at 11:13 PM Matheus Tavares
> <matheus.bernardino@usp.br> wrote:
>
> > Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
> > ---
> >
> > Note: I still have to make --ignore-sparsity be able to work together
> > with --untracked. Unfortunately, this won't be as simple because the
> > code path taken by --untracked goes to grep_directory(), which just
> > iterates the working tree without looking at the index entries. So I will
> > have to either: make --untracked use grep_cache(), and grep the
> > untracked files later; or try matching the working tree paths against
> > the sparsity patterns, without looking for the skip_worktree bit in
> > the index (as I mentioned in the previous patch's comments). Any
> > preferences regarding these two approaches? (or other suggestions?)
>
> Hmm.  So, 'tracked' in git is the idea that we are keeping information
> about specific files.  'sparse-checkout' is the idea that we have a
> subset of those that we can work with without materializing all the
> other tracked files; it's clearly a subset of the realm of 'tracked'.
> 'untracked' is about getting everything outside the set of 'tracked'
> files, which to me means it is clearly outside the set of sparsity
> paths too (and thus you could take --untracked as implying
> --ignore-sparsity, though whether you do might not matter in practice
> because of the items I'll discuss next). Of course, I am also
> assuming `--untracked` is incompatible with --cached or specifying
> revisions or trees (based on its definition of "In addition to
> searching in the tracked files in the *working tree*, search also in
> untracked files." -- emphasis added.)

Hm, I see the point now, but I'm still a little confused: The "in the
working tree" section of the definition would exclude non checked out
files, right? However, git-grep's description says "Look for specified
patterns in the tracked files *in the work tree*", and it still
searches non checked out files (loading them from the cache, even when
--cached is not given). I know that's exactly what we are trying to
change with this patchset, but we will still give the
--ignore-sparsity option to allow the old behavior when needed (unless
we prohibit using --ignore-sparsity without --cached or $REV). I guess
my doubt is whether the problem is in the implementation of the
working tree grep, which considers non checked out files, or in the
docs, which say "tracked files *in the work tree*".

I tend to go with the latter, since using `git grep --ignore-sparsity`
in a sparse checked out working tree, to grep not present files as
well, kind of makes sense to me. And if the problem is indeed in the
docs, then I think we should also allow --ignore-sparsity when
grepping with --untracked, since it's an analogous case.

> If the incompatibility of
> --untracked and --cached/REVISIONS/TREES is not enforced, we may want
> to look into erroring out if they are given together.  Once we do, we
> don't have to worry about grep_cache() at all in the case of
> --untracked and shouldn't.  Files with the skip_worktree bit won't
> exist in the working directory, and thus won't be searched (this is
> what makes --untracked imply --ignore-sparsity not really matter).
>
> In short: With --untracked you are grepping ALL (non-ignored) files in
> the working directory -- either because they are both tracked and in
> the sparsity paths (anything tracked that isn't in the sparsity paths
> has the skip_worktree bit and thus isn't present), or because it is an
> untracked file.  [And this may be what grep_directory() already does.]
>
> Does that make sense?

It does, and thanks for a very detailed explanation. But as I
mentioned before, I'm a little uncertain about --untracked implying
--ignore-sparsity. The commit that added --untracked (0a93fb8) says:

"grep --untracked" would find the specified patterns from files in
untracked files in addition to its usual behaviour of finding them in
the tracked files

So, in my mind, it feels like --untracked wasn't meant to limit the
search to "all non-ignored files in the working directory", but to add
untracked files to the search (which could also contain tracked but
non checked out files). Wouldn't the "all non-ignored files in the
working directory" case be the use of --no-index?

> > diff --git a/builtin/grep.c b/builtin/grep.c
> > index 52ec72a036..17eae3edd6 100644
> > --- a/builtin/grep.c
> > +++ b/builtin/grep.c
...
> >
> > @@ -487,7 +492,7 @@ static int grep_cache(struct grep_opt *opt,
> >         for (nr = 0; nr < repo->index->cache_nr; nr++) {
> >                 const struct cache_entry *ce = repo->index->cache[nr];
> >
> > -               if (ce_skip_worktree(ce))
> > +               if (!ignore_sparsity && ce_skip_worktree(ce))
>
> Oh boy on the double negatives...maybe we want to rename this flag somehow?

Yeah, I also thought about that, but couldn't come up with a better
name myself... My alternatives were all too verbose.
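
One way to sidestep the double negative is to flip the flag's polarity so that the condition reads positively. The Python sketch below only models the loop in grep_cache(); the name `restrict_to_sparse_paths` is a hypothetical illustration and not something proposed in this patch, and the index is reduced to a plain dict mapping each path to its skip-worktree bit.

```python
def paths_to_grep(index, restrict_to_sparse_paths=True):
    """Model of the grep_cache() loop with a positively named flag.

    `index` maps path -> skip-worktree bit (hypothetical stand-in for
    the real index entries).
    """
    selected = []
    for path, skip_worktree in index.items():
        # Reads as "if we are restricting and this entry is skipped",
        # instead of "if not ignoring and this entry is skipped".
        if restrict_to_sparse_paths and skip_worktree:
            continue
        selected.append(path)
    return selected


index = {"src/main.c": False, "docs/big.txt": True}
print(paths_to_grep(index))
print(paths_to_grep(index, restrict_to_sparse_paths=False))
```

The behaviour is the same as with `!ignore_sparsity`; only the reading of the condition changes.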

...
> I'm super excited to see work in this area.  I hope I'm not
> discouraging you by attempting to provide what I think is the bigger
> picture I'd like us to work towards.

Not at all! :) Thanks a lot for the bigger picture and other
explanations. They help me understand the long-term goals and make
better decisions now.


* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-25 23:15     ` Matheus Tavares Bernardino
@ 2020-03-26  6:02       ` Elijah Newren
  2020-03-27 15:51         ` Junio C Hamano
  2020-03-30  1:12         ` Matheus Tavares Bernardino
  0 siblings, 2 replies; 57+ messages in thread
From: Elijah Newren @ 2020-03-26  6:02 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc, Junio C Hamano

Hi Matheus!

On Wed, Mar 25, 2020 at 4:15 PM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> On Tue, Mar 24, 2020 at 4:55 AM Elijah Newren <newren@gmail.com> wrote:
> >
> > On Mon, Mar 23, 2020 at 11:13 PM Matheus Tavares
> > <matheus.bernardino@usp.br> wrote:
> >
> > > Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
> > > ---
> > >
> > > Note: I still have to make --ignore-sparsity be able to work together
> > > with --untracked. Unfortunately, this won't be as simple because the
> > > code path taken by --untracked goes to grep_directory(), which just
> > > iterates the working tree without looking at the index entries. So I will
> > > have to either: make --untracked use grep_cache(), and grep the
> > > untracked files later; or try matching the working tree paths against
> > > the sparsity patterns, without looking for the skip_worktree bit in
> > > the index (as I mentioned in the previous patch's comments). Any
> > > preferences regarding these two approaches? (or other suggestions?)
> >
> > Hmm.  So, 'tracked' in git is the idea that we are keeping information
> > about specific files.  'sparse-checkout' is the idea that we have a
> > subset of those that we can work with without materializing all the
> > other tracked files; it's clearly a subset of the realm of 'tracked'.
> > 'untracked' is about getting everything outside the set of 'tracked'
> > files, which to me means it is clearly outside the set of sparsity
> > paths too (and thus you could take --untracked as implying
> > --ignore-sparsity, though whether you do might not matter in practice
> > because of the items I'll discuss next). Of course, I am also
> > assuming `--untracked` is incompatible with --cached or specifying
> > revisions or trees (based on its definition of "In addition to
> > searching in the tracked files in the *working tree*, search also in
> > untracked files." -- emphasis added.)
>
> Hm, I see the point now, but I'm still a little confused: The "in the
> working tree" section of the definition would exclude non checked out
> files, right? However, git-grep's description says "Look for specified
> patterns in the tracked files *in the work tree*", and it still
> searches non checked out files (loading them from the cache, even when
> --cached is not given). I know that's exactly what we are trying to

I really respect Duy and he does some amazing work and I wish he were
still active in git, but the SKIP_WORKTREE stuff wasn't his best work
and even he downplayed it: "In my defense it was one of my first
contribution when I was naiver...I'd love to hear how sparse checkout
could be improved, or even replaced."[0]

I've seen enough egregiously confusing cases and enough
difficult-to-recover-from cases with the implementation of the
SKIP_WORKTREE handling that I think it is dangerous to assume behavior
you see with it is intended design.  A year and a half ago, I read all
available docs to figure out how to sparsify and de-sparsify, and read
them several times but was still confused.  If I could only figure it
out with great difficulty, a lot of google searching, and even trying
to look at the code, what chance did "normal" users stand?  To add
more flavor to that argument, let me cite [1] (the three paragraphs
starting with "Playing with sparse-checkout, it feels to me like a
half-baked feature"), [2], as well as good chunks of [3], [4], and
[5].

[0] https://lore.kernel.org/git/CACsJy8ArUXD0cF2vQAVnzM_AGto2k2yQTFuTO7PhP4ffHM8dVQ@mail.gmail.com/
[1] https://lore.kernel.org/git/CABPp-BFKf2N6TYzCCneRwWUektMzRMnHLZ8JT64q=MGj5WQZkA@mail.gmail.com/
[2] https://lore.kernel.org/git/CABPp-BGE-m_UFfUt_moXG-YR=ZW8hMzMwraD7fkFV-+sEHw36w@mail.gmail.com/
[3] https://lore.kernel.org/git/pull.316.git.gitgitgadget@gmail.com/
[4] https://lore.kernel.org/git/pull.513.git.1579029962.gitgitgadget@gmail.com/
[5] https://lore.kernel.org/git/a46439c8536f912ad4a1e1751852cf477d3d7dc7.1584813609.git.gitgitgadget@gmail.com/

But let me try to explain it all below from first principles, in a way
that will hopefully make clear why falling back to loading from the
cache when --cached is not given is just flat wrong.  The explanation
from first principles should also help explain --untracked a bit
better, and when there are decisions about whether to use sparsity
patterns.

> change with this patchset, but we will still give the
> --ignore-sparsity option to allow the old behavior when needed (unless
> we prohibit using --ignore-sparsity without --cached or $REV). I guess
> my doubt is whether the problem is in the implementation of the
> working tree grep, which considers non checked out files, or in the
> docs, which say "tracked files *in the work tree*".
>
> I tend to go with the latter, since using `git grep --ignore-sparsity`
> in a sparse checked out working tree, to grep not present files as
> well, kind of makes sense to me. And if the problem is indeed in the
> docs, then I think we should also allow --ignore-sparsity when
> grepping with --untracked, since it's an analogous case.

It's probably not a surprise to you given what I've already said above
to hear me say that the docs are correct in this case.  But not only
are the docs correct, I'll go even further and claim that falling back
to the cache when --cached is not passed is indefensible and leads to
surprises and contradictions.  Instead of just claiming that, let
me try to spell out from first principles why I believe it:

There were previously three types of files for git:
  * tracked
  * ignored
  * untracked
where:
  * tracked was defined as "recorded in index"
  * ignored was defined as "a file which is not tracked and which
matches an ignore rule (.gitignore, .git/info/exclude, etc.)"
  * untracked was defined as "all other files present in the working directory".
With the SKIP_WORKTREE bit and sparse-checkouts, we actually have four
types because we split the "tracked" category into two:
  * tracked and matches the sparsity patterns (implies it will be
present in the working directory, as the SKIP_WORKTREE bit is not
set)
  * tracked and does not match the sparsity patterns (implies it will
be missing from the working directory, as the SKIP_WORKTREE bit is
set)
But let's ignore the splitting of the tracked type for a minute as
well as everything else related to sparseness.  Let's just look at how
grep was designed.
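
To make the categories concrete, here is a minimal Python sketch of the classification. It is a toy model, not git code: the index is a dict, ignore rules are reduced to a literal set of paths, and it assumes the SKIP_WORKTREE bit is set exactly on tracked paths that fall outside the sparsity patterns. All names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    TRACKED_PRESENT = "tracked, inside the sparsity patterns"
    TRACKED_SKIPPED = "tracked, SKIP_WORKTREE set"
    IGNORED = "ignored"
    UNTRACKED = "untracked"


@dataclass(frozen=True)
class IndexEntry:
    path: str
    skip_worktree: bool = False


def classify(path, index, ignored_paths):
    """Return the category of `path` in this toy model."""
    entry = index.get(path)
    if entry is not None:
        # "tracked" means "recorded in the index"; the bit splits it in two.
        if entry.skip_worktree:
            return Category.TRACKED_SKIPPED
        return Category.TRACKED_PRESENT
    # Ignore rules only apply to files that are not tracked.
    if path in ignored_paths:
        return Category.IGNORED
    return Category.UNTRACKED


index = {
    "src/main.c": IndexEntry("src/main.c"),
    "docs/big.txt": IndexEntry("docs/big.txt", skip_worktree=True),
}
ignored = {"build/main.o"}

for p in ["src/main.c", "docs/big.txt", "build/main.o", "notes.txt"]:
    print(p, "->", classify(p, index, ignored).value)
```

Note that being tracked takes precedence over matching an ignore rule, which mirrors the definition of "ignored" given above.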

git grep has traditionally been about searching "tracked files in the
work tree" as you highlighted (and note that sparsity bits came four
years later in 2009, so cannot undercut that claim).  If the user has
made edits to files and hasn't staged them, grep would search those
working tree files with their edits, not old cached versions of those
files.  People were told that git grep was a great way to just search
relevant stuff (instead of normal grep which would look through build
results and random big files in your working directory that you
weren't even tracking).  Then in 2011 grep gained options like
--untracked to extend the search in the working tree to also include
untracked files, and added --no-exclude-standard (which is "only
useful with --untracked") so that people had a way to search *all*
files in the working tree (tracked, untracked, and ignored files).
(Note: no mechanism was provided for searching tracked and ignored
files without untracked as far as I can tell, though I don't see why
that would make sense.)  git-grep also gained options like --no-index
so that it could be used in a directory that wasn't tracked by git at
all -- it turns out people liked git-grep better than normal grep (I
think it got colorization first?), even for things that weren't being
tracked by git.  But again, all these cases were about searching files
that existed in the working tree.

Of course, people sometimes wanted to search a version other than what
existed in the working tree.  And thus options like --cached or
specifying a REVISION were added early on.

Sometimes, code that wasn't meant to be used together accidentally is
used together or the docs suggest they can be used together.  In 2010,
someone had to clarify that --cached was incompatible with <tree>; not
sure why someone would attempt to use them together, but that's the
type of accident that is easy to have in the implementation or docs
because it doesn't even occur to people who understand the design and
the data structures why anyone would attempt that.  Inevitably,
someone comes along who doesn't understand the underlying data
structures or design or terminology and tries incompatible options
together...and then gets surprised.  (Side note: I think this kind of
issue occurs fairly frequently, so I'm unlikely to assume options
were meant to be supported together based solely on a lack of logic
that would throw an error when both are specified.  We could probably
add a bunch of useful microprojects around checking for flags that
should be incompatible and making sure git throws errors when both are
specified.  We had lots of cases in rebase, for example, where if
users happened to specify two flags then one would just be silently
ignored.)

REVISION and --cached are not just incompatible with each other; each
is incompatible with all three of --untracked, --no-index, and
--no-exclude-standard.  This is because REVISION and --cached are
about picking some version other than what exists in the working tree
to search through, while those other options are all intended for when
we are searching through files in the working tree (and in particular,
exist to extend how many files in the working tree we look through).
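
The incompatibility claim can be written down as a small check. This Python sketch only encodes the rule as stated in the paragraph above; it is not how git parses its options, and the token "REVISION" stands in for any revision or tree argument.

```python
def incompatible_options(opts):
    """Return the pairs of given options that the rule above forbids."""
    version_selectors = {"--cached", "REVISION"}
    worktree_extenders = {"--untracked", "--no-index", "--no-exclude-standard"}

    given = set(opts)
    selectors = sorted(given & version_selectors)
    extenders = sorted(given & worktree_extenders)

    conflicts = []
    # The two version selectors conflict with each other...
    if len(selectors) == 2:
        conflicts.append((selectors[0], selectors[1]))
    # ...and each of them conflicts with every working-tree extender.
    for sel in selectors:
        for ext in extenders:
            conflicts.append((sel, ext))
    return conflicts


print(incompatible_options(["--cached", "--untracked"]))
print(incompatible_options(["--untracked", "--no-exclude-standard"]))
```

Options that merely extend the working-tree search never conflict with each other in this model, only with the options that select a different version to search.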

One more useful case to consider before we start adding SKIP_WORKTREE
into the mix.  Let's say that you have three files:
   fileA
   fileB
   fileC
and all of them are tracked.  You have made edits to fileA and fileB,
and ran 'rm fileC' (NOT 'git rm fileC', i.e. the deletion is not
staged).  Now, you run 'git grep mystring'.  Quick question: Which
files are searched for 'mystring'?  Well...
  * REVISION and --cached were left out of the git grep command, so
working tree files should be searched, not staged versions or versions
from other commits
  * No flags like --untracked or --no-exclude-standard were included,
so only tracked files in the working tree should be searched
  * There are two files in the working tree, both tracked: fileA and fileB.
So, this searches fileA and fileB.  In particular: NO VERSION of fileC
is searched.  fileC may be tracked/cached, but we don't search any
version of that file, because this particular command line is about
searching the working directory and fileC is not in the working
directory.  To the best of my knowledge, git grep has always behaved
that way.
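
The fileA/fileB/fileC example boils down to a set intersection. A minimal Python sketch, assuming a plain `git grep <pattern>` with no other flags and using made-up names:

```python
def worktree_grep_targets(tracked, on_disk):
    """Paths searched in this model: tracked AND present in the working tree."""
    return sorted(set(tracked) & set(on_disk))


tracked = ["fileA", "fileB", "fileC"]
# fileC was removed with plain `rm`; notes.txt exists but is untracked.
on_disk = ["fileA", "fileB", "notes.txt"]

print(worktree_grep_targets(tracked, on_disk))
```

fileC drops out because it is not on disk, and notes.txt drops out because it is not tracked; no cached version of either is consulted.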


Users understand the idea of searching the working copy vs. the index
vs. "old" (or different) versions of the repository.  They also
understand that when searching the working copy, by default a subset
of the files are searched.  Tell me: given all this information here,
what possible explanation is there for SKIP_WORKTREE entries to be
translated into searches of the cache when --cached is not specified?
Please square that away with the fact that 'rm fileC' results in fileC
NOT being searched.

It's just completely, utterly wrong.

Also, hopefully this helps answer your question about --untracked and
skip_worktree.  --untracked is only useful when searching through the
working tree, and is entirely about adding the "untracked" category to
the things we search.  The skip_worktree bit is about adding more
granularity to the "tracked" category.  The two are thus entirely
orthogonal and --untracked shouldn't change behavior at all in the
face of sparse checkouts.

And I also think it better explains when the sparsity patterns and the
--ignore-sparsity flag even matter.  The division of working
tree files which were tracked into two subsets (those that match
sparsity patterns and those that don't) didn't matter because only one
of those two sets existed and could be searched.  So the question is,
when can the sparsity pattern divide a set of files into two subsets
where both are non-empty?  And the answer is when --cached or REVISION
is specified.  This is the case Junio recently brought up and said
that there are good reasons users might want to limit to just the
paths that match the sparsity patterns, and other reasons when users
might want to search everything[6].  So, both cases need to be
supported fairly easily, and this will be true for several commands
besides just grep.

[6] https://lore.kernel.org/git/xmqq7dz938sc.fsf@gitster.c.googlers.com/
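
The argument about when sparsity can make a difference is sketched below in Python. This models the behaviour argued for in this message, not what git did at the time; the index is a dict mapping each path to its skip-worktree bit, and `honor_sparsity` is a hypothetical name for "limit the search to paths matching the sparsity patterns".

```python
def search_set(index, on_disk, cached, honor_sparsity):
    """Return the paths searched under the model described above.

    index:   dict path -> skip-worktree bit
    on_disk: set of paths present in the working tree
    """
    if cached:
        # Both subsets exist in the index, so the flag can split them.
        return sorted(p for p, skip in index.items()
                      if not (honor_sparsity and skip))
    # Working-tree search: skip-worktree entries are simply absent,
    # so the sparsity setting cannot change the result.
    return sorted(p for p in index if p in on_disk)


index = {"a.c": False, "b.c": True}
on_disk = {"a.c"}

for cached in (False, True):
    for honor in (True, False):
        print(cached, honor, search_set(index, on_disk, cached, honor))
```

Only the `cached=True` rows differ between the two settings, which is the point being made: the choice matters for --cached or REVISION searches, not for working-tree searches.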

> > If the incompatibility of
> > --untracked and --cached/REVISIONS/TREES is not enforced, we may want
> > to look into erroring out if they are given together.  Once we do, we
> > don't have to worry about grep_cache() at all in the case of
> > --untracked and shouldn't.  Files with the skip_worktree bit won't
> > exist in the working directory, and thus won't be searched (this is
> > what makes --untracked imply --ignore-sparsity not really matter).
> >
> > In short: With --untracked you are grepping ALL (non-ignored) files in
> > the working directory -- either because they are both tracked and in
> > the sparsity paths (anything tracked that isn't in the sparsity paths
> > has the skip_worktree bit and thus isn't present), or because it is an
> > untracked file.  [And this may be what grep_directory() already does.]
> >
> > Does that make sense?
>
> It does, and thanks for a very detailed explanation. But as I
> mentioned before, I'm a little uncertain about --untracked implying
> --ignore-sparsity. The commit that added --untracked (0a93fb8) says:
>
> "grep --untracked" would find the specified patterns from files in
> untracked files in addition to its usual behaviour of finding them in
> the tracked files
>
> So, in my mind, it feels like --untracked wasn't meant to limit the
> search to "all non-ignored files in the working directory", but to add
> untracked files to the search (which could also contain tracked but
> non checked out files). Wouldn't the "all non-ignored files in the
> working directory" case be the use of --no-index?

--no-index is specifically designed for when the directory isn't
tracked by git at all.  It would be equivalent, though, to saying we
wanted to search all files in the working copy regardless of whether
they are tracked, untracked, or ignored, i.e. equivalent to specifying
both --untracked and --no-exclude-standard.
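
How the working-tree flags widen the search can be sketched as follows. This Python model assumes every path in `files` is present on disk and already labelled with its category; the function and parameter names are hypothetical.

```python
def worktree_search_set(files, untracked=False, no_exclude_standard=False):
    """files maps path -> 'tracked' | 'untracked' | 'ignored' (all on disk)."""
    wanted = {"tracked"}
    if untracked:
        wanted.add("untracked")
        # --no-exclude-standard is only useful together with --untracked.
        if no_exclude_standard:
            wanted.add("ignored")
    return sorted(p for p, kind in files.items() if kind in wanted)


files = {"main.c": "tracked", "notes.txt": "untracked", "main.o": "ignored"}

print(worktree_search_set(files))
print(worktree_search_set(files, untracked=True))
print(worktree_search_set(files, untracked=True, no_exclude_standard=True))
```

The last call returns every file on disk, which is the sense in which the two flags together are equivalent to searching the directory without regard to tracking.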

And you were right to be uncertain about --untracked implying
--ignore-sparsity; --untracked is completely orthogonal to sparsity.
(However, it wouldn't much matter if it did imply that option or if it
implied its opposite: --untracked implies we are only looking at the
working directory files, and thus we aren't even going to check the
sparsity patterns, we'll just check which files exist in the working
directory.  `git sparse-checkout reapply` will care about the sparsity
patterns and possibly add files to the working copy or remove some,
but grep certainly shouldn't be having a side effect like that; it
should just search the directory as it exists.)

> > > diff --git a/builtin/grep.c b/builtin/grep.c
> > > index 52ec72a036..17eae3edd6 100644
> > > --- a/builtin/grep.c
> > > +++ b/builtin/grep.c
> ...
> > >
> > > @@ -487,7 +492,7 @@ static int grep_cache(struct grep_opt *opt,
> > >         for (nr = 0; nr < repo->index->cache_nr; nr++) {
> > >                 const struct cache_entry *ce = repo->index->cache[nr];
> > >
> > > -               if (ce_skip_worktree(ce))
> > > +               if (!ignore_sparsity && ce_skip_worktree(ce))
> >
> > Oh boy on the double negatives...maybe we want to rename this flag somehow?
>
> Yeah, I also thought about that, but couldn't come up with a better
> name myself... My alternatives were all too verbose.
>
> ...
> > I'm super excited to see work in this area.  I hope I'm not
> > discouraging you by attempting to provide what I think is the bigger
> > picture I'd like us to work towards.
>
> Not at all! :) Thanks a lot for the bigger picture and other
> explanations. They help me understand the long-term goals and make
> better decisions now.

Hope this email helps too.  I've composed it over about 4 different
sessions with various interruptions, so there's a good chance all my
edits and loss of train of thought might have made something murky.
Let me know which part(s) are confusing and I'll try to clarify.

Elijah


* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-26  6:02       ` Elijah Newren
@ 2020-03-27 15:51         ` Junio C Hamano
  2020-03-27 19:01           ` Elijah Newren
  2020-03-30  1:12         ` Matheus Tavares Bernardino
  1 sibling, 1 reply; 57+ messages in thread
From: Junio C Hamano @ 2020-03-27 15:51 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Matheus Tavares Bernardino, Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc

Elijah Newren <newren@gmail.com> writes:

> Sometimes, code that wasn't meant to be used together accidentally is
> used together or the docs suggest they can be used together.  ...
> ... but that's the
> type of accident that is easy to have in the implementation or docs
> because it doesn't even occur to people who understand the design and
> the data structures why anyone would attempt that.

The above is not limited to "git grep", but you said so clearly what
I have felt, without being able to express myself in a satisfactory
manner, for the last 10 years.

> ... (Side note: I think this kind of
> issue occurs fairly frequently, so I'm unlikely to assume options
> were meant to be supported together based solely on a lack of logic
> that would throw an error when both are specified.

Amen to that.

By the way, and I am so sorry for making the main issue of the
discussion into a mere "by the way" point, but if I understand your
message correctly, the primary conclusion in there is that a file
that is not in the working tree, if the sparsity pattern tells us
that it should not be checked out to the working tree, should not be
sought in the index instead.  I think I agree with that conclusion.

I do, however, have some disagreement on a minor point.

"git grep -e '<pattern>' master" looks for the pattern in the commit
at the tip of the master branch.  "git grep -e '<pattern>' master
pu" does so in these two commits.  I do not think it is conceptually
wrong to allow "git grep -e '<pattern>' --cached master pu" to look
for three "commits", i.e. those two commits that already exist, plus
the one you would be creating if you were to "git commit" right now.
Similarly, I do not see a reason why we should forbid looking for
the same pattern in the tracked files in the working tree at the
same time we check tree object(s) and/or the index.

At least in principle.

There are two practical issues that make these combinations
problematic, but I do not think they are insurmountable.

 - Once you give an object on the command line, there is no syntax
   to let you say "oh, by the way, I want the working tree as well".
   If you are looking in the index, the working tree, and optionally
   in some objects, "--index" instead of "--cached" would be the
   standard way to tell the command "I want to affect both the index
   and the working tree", but there is no way to say "I want only
   tracked files in the working tree and these objects searched".
   We'd need a new syntax to express it if we wanted to allow the
   combination.

 - The lines found in the working tree and in the index are prefixed
   by the filename, while those found in a tree object are prefixed
   by the tree's name and a colon.  When output for the working tree
   and the index are combined, we cannot tell where each hit came
   from.  We need to change the output to allow us to tell them
   apart, e.g. by prefixing "<worktree>:" and "<index>:" in a way
   similar to how we use "<revision>:".
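
A rough Python sketch of the second point: tagging every hit with its source so that combined output stays unambiguous. The "<worktree>" and "<index>" prefixes are the ones floated above; the exact output layout and the function name are assumptions made for illustration, not an existing git format.

```python
def format_hit(source, path, line):
    """Prefix a hit with where it came from, like "<revision>:" today."""
    return f"{source}:{path}:{line}"


hits = [
    ("<worktree>", "grep.c", "int edited;"),
    ("<index>", "grep.c", "int staged;"),
    ("master", "grep.c", "int committed;"),
]

for source, path, line in hits:
    print(format_hit(source, path, line))
```

With a prefix on every line, the three versions of grep.c can appear in one listing and each hit can still be traced to the working tree, the index, or a named revision.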

Thanks.


* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-27 15:51         ` Junio C Hamano
@ 2020-03-27 19:01           ` Elijah Newren
  0 siblings, 0 replies; 57+ messages in thread
From: Elijah Newren @ 2020-03-27 19:01 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Matheus Tavares Bernardino, Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc

On Fri, Mar 27, 2020 at 8:51 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> > Sometimes, code that wasn't meant to be used together accidentally is
> > used together or the docs suggest they can be used together.  ...
> > ... but that's the
> > type of accident that is easy to have in the implementation or docs
> > because it doesn't even occur to people who understand the design and
> > the data structures why anyone would attempt that.
>
> The above is not limited to "git grep", but you said so clearly what
> I have felt, without being able to express myself in a satisfactory
> manner, for the last 10 years.
>
> > ... (Side note: I think this kind of
> > issue occurs fairly frequently, so I'm unlikely to assume options
> > were meant to be supported together based solely on a lack of logic
> > that would throw an error when both are specified.
>
> Amen to that.
>
> By the way, and I am so sorry for making the main issue of the
> discussion into a mere "by the way" point, but if I understand your
> message correctly, the primary conclusion in there is that a file
> that is not in the working tree, if the sparsity pattern tells us
> that it should not be checked out to the working tree, should not be
> sought in the index instead.  I think I agree with that conclusion.

Cool.

> I do, however, have some disagreement on a minor point.
>
> "git grep -e '<pattern>' master" looks for the pattern in the commit
> at the tip of the master branch.  "git grep -e '<pattern>' master
> pu" does so in these two commits.  I do not think it is conceptually
> wrong to allow "git grep -e '<pattern>' --cached master pu" to look
> for three "commits", i.e. those two commits that already exist, plus
> the one you would be creating if you were to "git commit" right now.
> Similarly, I do not see a reason why we should forbid looking for
> the same pattern in the tracked files in the working tree at the
> same time we check tree object(s) and/or the index.
>
> At least in principle.
>
> There are two practical issues that make these combinations
> problematic, but I do not think they are insurmountable.
>
>  - Once you give an object on the command line, there is no syntax
>    to let you say "oh, by the way, I want the working tree as well".
>    If you are looking in the index, the working tree, and optionally
>    in some objects, "--index" instead of "--cached" would be the
>    standard way to tell the command "I want to affect both the index
>    and the working tree", but there is no way to say "I want only
>    tracked files in the working tree and these objects searched".
>    We'd need a new syntax to express it if we wanted to allow the
>    combination.
>
>  - The lines found in the working tree and in the index are prefixed
>    by the filename, while those found in a tree object are prefixed
>    by the tree's name and a colon.  When output for the working tree
>    and the index are combined, we cannot tell where each hit came
>    from.  We need to change the output to allow us to tell them
>    apart, e.g. by prefixing "<worktree>:" and "<index>:" in a way
>    similar to how we use "<revision>:".
>
> Thanks.

Ah, so you're saying that even though --cached and REVISION are
incompatible today, that's not fundamental and we could conceivably
let them or even more options be used together in the future and you
even highlight how it could be made to sensibly work.  I agree with
what you say here: _if_ there is a way for users to explicitly specify
that they want to search multiple versions (whether that is revisions
or the index or the working tree), _and_ we have a way to distinguish
which version we found the results from, then (and only then) it'd
make sense to search the complete set of files from each of those
versions and show the results for the matches we found.

That differs in multiple important ways from the SKIP_WORKTREE
behavior I was railing against, and I think what you propose as a
possibility in contrast would make sense.


* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-26  6:02       ` Elijah Newren
  2020-03-27 15:51         ` Junio C Hamano
@ 2020-03-30  1:12         ` Matheus Tavares Bernardino
  2020-03-31 16:48           ` Elijah Newren
  1 sibling, 1 reply; 57+ messages in thread
From: Matheus Tavares Bernardino @ 2020-03-30  1:12 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc, Junio C Hamano

On Thu, Mar 26, 2020 at 3:02 AM Elijah Newren <newren@gmail.com> wrote:
>
> Hi Matheus!

Hi, Elijah.

First of all, thanks for taking the time to go over these topics in
great detail. I must say it's much clearer for me now.

> On Wed, Mar 25, 2020 at 4:15 PM Matheus Tavares Bernardino
> <matheus.bernardino@usp.br> wrote:
> >
[...]
> One more useful case to consider before we start adding SKIP_WORKTREE
> into the mix.  Let's say that you have three files:
>    fileA
>    fileB
>    fileC
> and all of them are tracked.  You have made edits to fileA and fileB,
> and ran 'rm fileC' (NOT 'git rm fileC', i.e. the deletion is not
> staged).  Now, you run 'git grep mystring'.  Quick question: Which
> files are searched for 'mystring'?  Well...
>   * REVISION and --cached were left out of the git grep command, so
> working tree files should be searched, not staged versions or versions
> from other commits
>   * No flags like --untracked or --no-exclude-standard were included,
> so only tracked files in the working tree should be searched
>   * There are two files in the working tree, both tracked: fileA and fileB.
> So, this searches fileA and fileB.  In particular: NO VERSION of fileC
> is searched.  fileC may be tracked/cached, but we don't search any
> version of that file, because this particular command line is about
> searching the working directory and fileC is not in the working
> directory.  To the best of my knowledge, git grep has always behaved
> that way.
>
> Users understand the idea of searching the working copy vs. the index
> vs. "old" (or different) versions of the repository.  They also
> understand that when searching the working copy, by default a subset
> of the files are searched.  Tell me: given all this information here,
> what possible explanation is there for SKIP_WORKTREE entries to be
> translated into searches of the cache when --cached is not specified?
> Please square that away with the fact that 'rm fileC' results in fileC
> NOT being searched.
>
> It's just completely, utterly wrong.

Makes sense, thanks. I agree that we shouldn't fall back to the cache
when searching the working tree.
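
To make the fileA/fileB/fileC example concrete, here is a toy model in C.
It is only a sketch of the rule described above, not git's actual code:
`struct toy_path`, `searched_by_plain_grep()` and `count_searched()` are
names invented for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a path; not git's `struct cache_entry`. */
struct toy_path {
	const char *name;
	bool tracked;      /* the path has an index entry */
	bool in_worktree;  /* the file currently exists on disk */
};

/*
 * Plain `git grep <pattern>`: no --cached, no REVISION, no --untracked.
 * Only tracked files that are present in the working tree are searched;
 * there is no fallback to the cached version of a missing file.
 */
static bool searched_by_plain_grep(const struct toy_path *p)
{
	return p->tracked && p->in_worktree;
}

/* Count how many of the given paths a plain `git grep` would search. */
static size_t count_searched(const struct toy_path *paths, size_t n)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if (searched_by_plain_grep(&paths[i]))
			count++;
	return count;
}
```

With fileA and fileB present and fileC removed by 'rm fileC', the model
reports two searched files. A SKIP_WORKTREE entry looks the same as fileC
here: tracked, but not in the working tree.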

> Also, hopefully this helps answer your question about --untracked and
> skip_worktree.  --untracked is only useful when searching through the
> working tree, and is entirely about adding the "untracked" category to
> the things we search.  The skip_worktree bit is about adding more
> granularity to the "tracked" category.  The two are thus entirely
> orthogonal and --untracked shouldn't change behavior at all in the
> face of sparse checkouts.

Thanks, your explanation clarified the issue I had. I see now why
--untracked and --ignore-sparsity don't make sense together.

It also made me think about the combination of --cached and
--untracked which, IIUC, should be prohibited. I will add a patch in
v2, making git-grep error out in this case.

> And I also think it explains more when the sparsity patterns and
> --ignore-sparsity-patterns flags even matter.  The division of working
> tree files which were tracked into two subsets (those that match
> sparsity patterns and those that don't) didn't matter because only one
> of those two sets existed and could be searched.  So the question is,
> when can the sparsity pattern divide a set of files into two subsets
> where both are non-empty?  And the answer is when --cached or REVISION
> is specified.

Makes sense. I will add in --ignore-sparsity's description that it is
only relevant with --cached or REVISION, as you previously suggested.
When it is used outside of these cases, though, I think we could just
warn that --ignore-sparsity will be discarded (to avoid erroring out
when users have grep.ignoreSparsity enabled).
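
A minimal sketch of that "warn, don't error" idea, assuming the option
keeps the --ignore-sparsity name. The enum, the function names and the
warning text are all made up for illustration and are not part of
builtin/grep.c:

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical classification of what a `git grep` invocation searches. */
enum toy_grep_source {
	TOY_GREP_WORKTREE,
	TOY_GREP_CACHED,
	TOY_GREP_REVISION
};

/* Sparsity patterns can only split the file set with --cached or REVISION. */
static bool toy_sparsity_matters(enum toy_grep_source src)
{
	return src == TOY_GREP_CACHED || src == TOY_GREP_REVISION;
}

/*
 * Return the effective setting. When the option was requested but cannot
 * matter, warn and discard it instead of dying, so that a config-enabled
 * default does not break plain working tree searches.
 */
static bool toy_effective_ignore_sparsity(enum toy_grep_source src,
					  bool requested)
{
	if (requested && !toy_sparsity_matters(src)) {
		fprintf(stderr, "warning: --ignore-sparsity has no effect "
				"without --cached or a revision\n");
		return false;
	}
	return requested;
}
```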


* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-24 18:30     ` Junio C Hamano
  2020-03-24 19:07       ` Elijah Newren
@ 2020-03-30  3:23       ` Matheus Tavares Bernardino
  2020-03-31 19:12         ` Elijah Newren
  1 sibling, 1 reply; 57+ messages in thread
From: Matheus Tavares Bernardino @ 2020-03-30  3:23 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Elijah Newren, Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc

On Tue, Mar 24, 2020 at 3:30 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> > On Mon, Mar 23, 2020 at 11:13 PM Matheus Tavares
> > <matheus.bernardino@usp.br> wrote:
> >>
> >> In the last commit, git-grep learned to honor sparsity patterns. For
> >> some use cases, however, it may be desirable to search outside the
> >> sparse checkout. So add the '--ignore-sparsity' option, which restores
> >> the old behavior. Also add the grep.ignoreSparsity configuration, to
> >> allow setting this behavior by default.
> >
> > Should `--ignore-sparsity` be a global git option rather than a
> > grep-specific one?  Also, should grep.ignoreSparsity rather be
> > core.ignoreSparsity or core.searchOutsideSparsePaths or something?
>
> Great question.  I think "git diff" with various options would also
> want to optionally be able to be confined within the sparse cone, or
> checking the entire world by lazily fetching outside the sparsity.
[...]
> Regardless of the choice of the default, it would be a good
> idea to make the subcommands consistently offer the same default and
> allow the non-default views with the same UI.

Yeah, it seems like a sensible path. Regarding implementation, there
is the question that Elijah raised, of whether to use a global git
option or separate but consistent options for each subcommand. I don't
have enough experience with sparse checkout to argue for one or the
other, so I would like to hear what others have to say about it.

A question that comes to my mind regarding the global git option is:
will --ignore-sparsity (or whichever name we choose for it [1]) be
sufficient for all subcommands? Or may some of them require additional
options for command-specific behaviors concerning sparsity patterns?
Also, would it be OK if we just ignored the option in commands that do
not operate differently in sparse checkouts (maybe, fetch, branch and
send-email, for example)? And would it make sense to allow
constructions such as `git --ignore-sparsity checkout` or even `git
--ignore-sparsity sparse-checkout ...`?

[1]: Does anyone have suggestions for the option/config name? The best
I could come up with so far (without being too verbose) is
--no-sparsity-constraints. But I fear this might sound generic. As
Elijah already mentioned, --ignore-sparsity is not good either, as it
introduces double negatives in code...


* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-30  1:12         ` Matheus Tavares Bernardino
@ 2020-03-31 16:48           ` Elijah Newren
  0 siblings, 0 replies; 57+ messages in thread
From: Elijah Newren @ 2020-03-31 16:48 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc, Junio C Hamano

On Sun, Mar 29, 2020 at 6:13 PM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> On Thu, Mar 26, 2020 at 3:02 AM Elijah Newren <newren@gmail.com> wrote:
> >
> > Hi Matheus!
>
> Hi, Elijah.
>
> First of all, thanks for taking the time to go over these topics in
> great detail. I must say it's much clearer for me now.
>
> > On Wed, Mar 25, 2020 at 4:15 PM Matheus Tavares Bernardino
> > <matheus.bernardino@usp.br> wrote:
> > >
> [...]
> > One more useful case to consider before we start adding SKIP_WORKTREE
> > into the mix.  Let's say that you have three files:
> >    fileA
> >    fileB
> >    fileC
> > and all of them are tracked.  You have made edits to fileA and fileB,
> > and ran 'rm fileC' (NOT 'git rm fileC', i.e. the deletion is not
> > staged).  Now, you run 'git grep mystring'.  Quick question: Which
> > files are searched for 'mystring'?  Well...
> >   * REVISION and --cached were left out of the git grep command, so
> > working tree files should be searched, not staged versions or versions
> > from other commits
> >   * No flags like --untracked or --no-exclude-standard were included,
> > so only tracked files in the working tree should be searched
> >   * There are two files in the working tree, both tracked: fileA and fileB.
> > So, this searches fileA and fileB.  In particular: NO VERSION of fileC
> > is searched.  fileC may be tracked/cached, but we don't search any
> > version of that file, because this particular command line is about
> > searching the working directory and fileC is not in the working
> > directory.  To the best of my knowledge, git grep has always behaved
> > that way.
> >
> > Users understand the idea of searching the working copy vs. the index
> > vs. "old" (or different) versions of the repository.  They also
> > understand that when searching the working copy, by default a subset
> > of the files are searched.  Tell me: given all this information here,
> > what possible explanation is there for SKIP_WORKTREE entries to be
> > translated into searches of the cache when --cached is not specified?
> > Please square that away with the fact that 'rm fileC' results in fileC
> > NOT being searched.
> >
> > It's just completely, utterly wrong.
>
> Makes sense, thanks. I agree that we shouldn't fall back to the cache
> when searching the working tree.
>
> > Also, hopefully this helps answer your question about --untracked and
> > skip_worktree.  --untracked is only useful when searching through the
> > working tree, and is entirely about adding the "untracked" category to
> > the things we search.  The skip_worktree bit is about adding more
> > granularity to the "tracked" category.  The two are thus entirely
> > orthogonal and --untracked shouldn't change behavior at all in the
> > face of sparse checkouts.
>
> Thanks, your explanation clarified the issue I had. I see now why
> --untracked and --ignore-sparsity don't make sense together.
>
> It also made me think about the combination of --cached and
> --untracked which, IIUC, should be prohibited. I will add a patch in
> v2, making git-grep error out in this case.
>
> > And I also think it explains more when the sparsity patterns and
> > --ignore-sparsity-patterns flags even matter.  The division of working
> > tree files which were tracked into two subsets (those that match
> > sparsity patterns and those that don't) didn't matter because only one
> > of those two sets existed and could be searched.  So the question is,
> > when can the sparsity pattern divide a set of files into two subsets
> > where both are non-empty?  And the answer is when --cached or REVISION
> > is specified.
>
> Makes sense. I will add in --ignore-sparsity's description that it is
> only relevant with --cached or REVISION, as you previously suggested.
> When it is used outside of these cases, though, I think we could just
> warn that --ignore-sparsity will be discarded (to avoid erroring out
> when users have grep.ignoreSparsity enabled).

Not grep.ignoreSparsity but core.ignoreSparsity or core.$WHATEVER  ;-)


* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-30  3:23       ` Matheus Tavares Bernardino
@ 2020-03-31 19:12         ` Elijah Newren
  2020-03-31 20:02           ` Derrick Stolee
  0 siblings, 1 reply; 57+ messages in thread
From: Elijah Newren @ 2020-03-31 19:12 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: Junio C Hamano, Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc, Jonathan Tan

// adding Jonathan Tan to cc, since we keep bringing up partial clones
and how they relate...

On Sun, Mar 29, 2020 at 8:23 PM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> On Tue, Mar 24, 2020 at 3:30 PM Junio C Hamano <gitster@pobox.com> wrote:
> >
> > Elijah Newren <newren@gmail.com> writes:
> >
> > > On Mon, Mar 23, 2020 at 11:13 PM Matheus Tavares
> > > <matheus.bernardino@usp.br> wrote:
> > >>
> > >> In the last commit, git-grep learned to honor sparsity patterns. For
> > >> some use cases, however, it may be desirable to search outside the
> > >> sparse checkout. So add the '--ignore-sparsity' option, which restores
> > >> the old behavior. Also add the grep.ignoreSparsity configuration, to
> > >> allow setting this behavior by default.
> > >
> > > Should `--ignore-sparsity` be a global git option rather than a
> > > grep-specific one?  Also, should grep.ignoreSparsity rather be
> > > core.ignoreSparsity or core.searchOutsideSparsePaths or something?
> >
> > Great question.  I think "git diff" with various options would also
> > want to optionally be able to be confined within the sparse cone, or
> > checking the entire world by lazily fetching outside the sparsity.
> [...]
> > Regardless of the choice of the default, it would be a good
> > idea to make the subcommands consistently offer the same default and
> > allow the non-default views with the same UI.
>
> Yeah, it seems like a sensible path. Regarding implementation, there
> is the question that Elijah raised, of whether to use a global git
> option or separate but consistent options for each subcommand. I don't
> have enough experience with sparse checkout to argue for one or the
> other, so I would like to hear what others have to say about it.
>
> A question that comes to my mind regarding the global git option is:
> will --ignore-sparsity (or whichever name we choose for it [1]) be
> sufficient for all subcommands? Or may some of them require additional
> options for command-specific behaviors concerning sparsity patterns?
> Also, would it be OK if we just ignored the option in commands that do
> not operate differently in sparse checkouts (maybe, fetch, branch and
> send-email, for example)? And would it make sense to allow
> constructions such as `git --ignore-sparsity checkout` or even `git
> --ignore-sparsity sparse-checkout ...`?

I think the same option would probably be sufficient for all
subcommands, though I have a minor question about the merge machinery
(below).  And generally, I think it would be unusual for people to
pass the command line flag; I suspect most would set a config option
for most cases and then only occasionally override it on the command
line.  Since that config option would always be set, I'd expect
commands that are unaffected to just ignore it (much like both "git -c
merge.detectRenames=true fetch" and "git --work-tree=othertree fetch"
will both ignore the irrelevant options rather than trying to detect
that they were specified and error out).

> [1]: Does anyone have suggestions for the option/config name? The best
> I could come up with so far (without being too verbose) is
> --no-sparsity-constraints. But I fear this might sound generic. As
> Elijah already mentioned, --ignore-sparsity is not good either, as it
> introduces double negatives in code...

Does verbosity matter that much?  I think people would set it in
config, and tab completion would make it pretty easy to complete in
any event.

Anyway, maybe it will help if I provide a very rough first draft of
what changes we could introduce to Documentation/config/core.txt, and
then ask a bunch of my own questions about it below:

"""
core.restrictToSparsePaths::
        Only meaningful in conjunction with core.sparseCheckoutCone.
        This option extends sparse checkouts (which limit which paths
        are written to the worktree), so that output and operations
        are also limited to the sparsity paths where possible and
        implemented.  The purpose of this option is to (1) focus
        output for the user on the portion of the repository that is
        of interest to them, and (2) enable potentially dramatic
        performance improvements, especially in conjunction with
        partial clones.
+
When this option is true, git commands such as log, diff, and grep may
limit their output to the directories specified by the sparse cone, or
to the intersection of those paths and any pathspecs (like `*.c`) that the user
might also specify on the command line.  (Note that this limit for
diff and grep only becomes relevant with --cached or when specifying a
REVISION, since a search of the working tree will automatically be
limited to the sparse paths that are present.)  Also, commands like
bisect may only select commits which modify paths within the sparsity
cone.  The merge machinery may use the sparse paths as a heuristic to
avoid trying to detect renames from within the sparsity cone to
outside the sparsity cone when at least one side of history only
touches paths within the sparsity cone (this can make the merge
machinery faster, but may risk modify/delete conflicts since upstream
can rename a file within the sparsity paths to a location outside
them).  Commands which export, integrity check, or create history will
always operate on full trees (e.g. fast-export, format-patch, fsck,
commit, etc.), unaffected by any sparsity patterns.
"""

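To illustrate the "intersection" wording in the draft, here is a rough C
sketch. It assumes a much simplified cone: a list of directories plus
top-level files, ignoring that real cone mode also keeps files sitting
directly inside the ancestors of each cone directory. A suffix test stands
in for a pathspec such as `*.c`. None of these helpers exist in git; all
names are hypothetical.

```c
#include <stdbool.h>
#include <string.h>

/* True if `path` lies under directory `dir` (given without trailing '/'). */
static bool path_under_dir(const char *path, const char *dir)
{
	size_t len = strlen(dir);

	return strncmp(path, dir, len) == 0 && path[len] == '/';
}

/* Stand-in for a pathspec like `*.c`: a plain suffix comparison. */
static bool has_suffix(const char *path, const char *suffix)
{
	size_t plen = strlen(path), slen = strlen(suffix);

	return plen >= slen && strcmp(path + plen - slen, suffix) == 0;
}

/*
 * Decide whether output for `path` is shown. `suffix` may be NULL when
 * the user gave no pathspec. With the restriction enabled, the result is
 * the intersection of the pathspec match and the (simplified) cone.
 */
static bool toy_should_show(const char *path,
			    const char *const *cone_dirs, size_t n_dirs,
			    const char *suffix, bool restrict_to_sparse_paths)
{
	size_t i;

	if (suffix && !has_suffix(path, suffix))
		return false;
	if (!restrict_to_sparse_paths)
		return true;
	if (!strchr(path, '/'))
		return true; /* top-level files are always in the cone */
	for (i = 0; i < n_dirs; i++)
		if (path_under_dir(path, cone_dirs[i]))
			return true;
	return false;
}
```

With a cone of "src" and "docs/api" and the `*.c` stand-in, "src/grep.c"
is shown, while "src/grep.h" (pathspec mismatch) and "lib/util.c" (outside
the cone) are not, unless the restriction is turned off.
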
Several questions here, of course:

  * do people like or hate the name?  indifferent?  have alternate ideas?
  * should we restrict this to core.sparseCheckoutCone as I suggested
above or also allow people to do it with core.sparseCheckout without
the cone mode?  I think attempting to weld partial clones together
with core.sparseCheckout is crazy, so I'm tempted to just make it be
specific to cone mode and to push people to use it.  But I'm
interested in thoughts on the matter.
  * should worktrees be affected?  (I've been an advocate of new
worktrees inheriting the sparse patterns of the worktree in use at the
time the new worktree was created.  Junio once suggested he didn't
like that and that worktrees should start out dense.  That seems
problematic to me in big repos with partial clones and sparse checkouts
in use.  Perhaps dense new worktrees is the behavior you get when
core.restrictToSparsePaths is false?)
  * does my idea for the merge machinery make folks uncomfortable?
Should that be a different option?  Being able to do trivial *tree*
merges for the huge portion of the tree outside the sparsity paths
would be a huge win, especially with partial clones, but it certainly
is different.  Then again, Microsoft has disabled rename detection
entirely based on it being too expensive, so perhaps the idea of
rename-detection-within-your-cone-if-you-really-didn't-modify-anything-outside-the-cone-on-your-side-of-history
is a reasonable middle ground between off and on for rename detection.
  * what should the default be?  Junio suggested elsewhere[1] that
sparse-checkouts and partial clones should probably be welded together
(with partial clones downloading just history in the sparsity paths by
default), in which case having this option be true would be useful.
But it may also be slightly weird because it'll probably take us a
while to implement this; while the big warning in
git-sparse-checkout.txt certainly allows this:
        THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR, AND THE BEHAVIOR OF OTHER
        COMMANDS IN THE PRESENCE OF SPARSE-CHECKOUTS, WILL LIKELY CHANGE IN
        THE FUTURE.
It may still be slightly weird that the default behavior of commands
in the presence of sparse-checkouts changes release to release until
we get it all implemented.

[1] https://lore.kernel.org/git/xmqqh7ycw5lc.fsf@gitster.c.googlers.com/


* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-31 19:12         ` Elijah Newren
@ 2020-03-31 20:02           ` Derrick Stolee
  2020-04-27 17:15             ` Matheus Tavares Bernardino
  2020-04-29 17:21             ` Elijah Newren
  0 siblings, 2 replies; 57+ messages in thread
From: Derrick Stolee @ 2020-03-31 20:02 UTC (permalink / raw)
  To: Elijah Newren, Matheus Tavares Bernardino
  Cc: Junio C Hamano, Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc, Jonathan Tan

On 3/31/2020 3:12 PM, Elijah Newren wrote:
> // adding Jonathan Tan to cc, since we keep bringing up partial clones
> and how they relate...
> 
> On Sun, Mar 29, 2020 at 8:23 PM Matheus Tavares Bernardino
> <matheus.bernardino@usp.br> wrote:
>>
>> On Tue, Mar 24, 2020 at 3:30 PM Junio C Hamano <gitster@pobox.com> wrote:
>>>
>>> Elijah Newren <newren@gmail.com> writes:
>>>
>>>> On Mon, Mar 23, 2020 at 11:13 PM Matheus Tavares
>>>> <matheus.bernardino@usp.br> wrote:
>>>>>
>>>>> In the last commit, git-grep learned to honor sparsity patterns. For
>>>>> some use cases, however, it may be desirable to search outside the
>>>>> sparse checkout. So add the '--ignore-sparsity' option, which restores
>>>>> the old behavior. Also add the grep.ignoreSparsity configuration, to
>>>>> allow setting this behavior by default.
>>>>
>>>> Should `--ignore-sparsity` be a global git option rather than a
>>>> grep-specific one?  Also, should grep.ignoreSparsity rather be
>>>> core.ignoreSparsity or core.searchOutsideSparsePaths or something?
>>>
>>> Great question.  I think "git diff" with various options would also
>>> want to optionally be able to be confined within the sparse cone, or
>>> checking the entire world by lazily fetching outside the sparsity.
>> [...]
>>> Regardless of the choice of the default, it would be a good
>>> idea to make the subcommands consistently offer the same default and
>>> allow the non-default views with the same UI.
>>
>> Yeah, it seems like a sensible path. Regarding implementation, there
>> is the question that Elijah raised, of whether to use a global git
>> option or separate but consistent options for each subcommand. I don't
>> have enough experience with sparse checkout to argue for one or the
>> other, so I would like to hear what others have to say about it.
>>
>> A question that comes to my mind regarding the global git option is:
>> will --ignore-sparsity (or whichever name we choose for it [1]) be
>> sufficient for all subcommands? Or may some of them require additional
>> options for command-specific behaviors concerning sparsity patterns?
>> Also, would it be OK if we just ignored the option in commands that do
>> not operate differently in sparse checkouts (maybe, fetch, branch and
>> send-email, for example)? And would it make sense to allow
>> constructions such as `git --ignore-sparsity checkout` or even `git
>> --ignore-sparsity sparse-checkout ...`?
> 
> I think the same option would probably be sufficient for all
> subcommands, though I have a minor question about the merge machinery
> (below).  And generally, I think it would be unusual for people to
> pass the command line flag; I suspect most would set a config option
> for most cases and then only occasionally override it on the command
> line.  Since that config option would always be set, I'd expect
> commands that are unaffected to just ignore it (much like both "git -c
> merge.detectRenames=true fetch" and "git --work-tree=othertree fetch"
> will both ignore the irrelevant options rather than trying to detect
> that they were specified and error out).
> 
>> [1]: Does anyone have suggestions for the option/config name? The best
>> I could come up with so far (without being too verbose) is
>> --no-sparsity-constraints. But I fear this might sound generic. As
>> Elijah already mentioned, --ignore-sparsity is not good either, as it
>> introduces double negatives in code...
> 
> Does verbosity matter that much?  I think people would set it in
> config, and tab completion would make it pretty easy to complete in
> any event.
> 
> Anyway, maybe it will help if I provide a very rough first draft of
> what changes we could introduce to Documentation/config/core.txt, and
> then ask a bunch of my own questions about it below:
> 
> """
> core.restrictToSparsePaths::
>         Only meaningful in conjunction with core.sparseCheckoutCone.
>         This option extends sparse checkouts (which limit which paths
>         are written to the worktree), so that output and operations
>         are also limited to the sparsity paths where possible and
>         implemented.  The purpose of this option is to (1) focus
>         output for the user on the portion of the repository that is
>         of interest to them, and (2) enable potentially dramatic
>         performance improvements, especially in conjunction with
>         partial clones.
> +
> When this option is true, git commands such as log, diff, and grep may
> limit their output to the directories specified by the sparse cone, or
> to the intersection of those paths and any pathspecs (like `*.c`) that the user
> might also specify on the command line.  (Note that this limit for
> diff and grep only becomes relevant with --cached or when specifying a
> REVISION, since a search of the working tree will automatically be
> limited to the sparse paths that are present.)  Also, commands like
> bisect may only select commits which modify paths within the sparsity
> cone.  The merge machinery may use the sparse paths as a heuristic to
> avoid trying to detect renames from within the sparsity cone to
> outside the sparsity cone when at least one side of history only
> touches paths within the sparsity cone (this can make the merge
> machinery faster, but may risk modify/delete conflicts since upstream
> can rename a file within the sparsity paths to a location outside
> them).  Commands which export, integrity check, or create history will
> always operate on full trees (e.g. fast-export, format-patch, fsck,
> commit, etc.), unaffected by any sparsity patterns.
> """
> 
> Several questions here, of course:
> 
>   * do people like or hate the name?  indifferent?  have alternate ideas?

It's probably time to create a 'sparse-checkout' config space. That
would allow

	sparse-checkout.restrictGrep = true

as an option. Or a more general

	sparse-checkout.restrictCommands = true

to make it clear that it affects multiple commands.

>   * should we restrict this to core.sparseCheckoutCone as I suggested
> above or also allow people to do it with core.sparseCheckout without
> the cone mode?  I think attempting to weld partial clones together
> with core.sparseCheckout is crazy, so I'm tempted to just make it be
> specific to cone mode and to push people to use it.  But I'm
> interested in thoughts on the matter.

Personally, I prefer cone mode and think it covers 99% of cases.
However, there are some who are using a big directory full of large
binaries and relying on file-prefix matches to get only the big
binaries they need. Until they restructure their repositories to
take advantage of cone mode, we should be considerate of the full
sparse-checkout specification when possible.

>   * should worktrees be affected?  (I've been an advocate of new
> worktrees inheriting the sparse patterns of the worktree in use at the
> time the new worktree was created.  Junio once suggested he didn't
> like that and that worktrees should start out dense.  That seems
> problematic to me in big repos with partial clones and sparse checkouts
> in use.  Perhaps dense new worktrees is the behavior you get when
> core.restrictToSparsePaths is false?)

We should probably consider a `--sparse` option for `git worktree add`
so we can allow interested users to add worktrees that initialize to
a sparse-checkout. Optionally create a config option that would copy
the sparse-checkout file from the current repo to the worktree.

>   * does my idea for the merge machinery make folks uncomfortable?
> Should that be a different option?  Being able to do trivial *tree*
> merges for the huge portion of the tree outside the sparsity paths
> would be a huge win, especially with partial clones, but it certainly
> is different.  Then again, Microsoft has disabled rename detection
> entirely based on it being too expensive, so perhaps the idea of
> rename-detection-within-your-cone-if-you-really-didn't-modify-anything-outside-the-cone-on-your-side-of-history
> is a reasonable middle ground between off and on for rename detection.

The part where you say "when at least one side of history only
touches paths within the sparsity cone" makes me want to entertain
the idea if it can be done cleanly.

I'm more concerned about the "git bisect" logic being restricted to
the cone, since that is such an open-ended command for what is
considered "good" or "bad".

>   * what should the default be?  Junio suggested elsewhere[1] that
> sparse-checkouts and partial clones should probably be welded together
> (with partial clones downloading just history in the sparsity paths by
> default), in which case having this option be true would be useful.

My opinion on this is as follows: filtering blobs based on sparse-
checkout patterns does not filter enough, and filtering trees based
on sparse-checkout patterns filters too much. The costs are just
flipped: having extra trees is not a huge problem but recovering from
a "tree miss" is problematic. Having extra blobs is painful, but
recovering from a "blob miss" is not a big deal.

> But it may also be slightly weird because it'll probably take us a
> while to implement this; while the big warning in
> git-sparse-checkout.txt certainly allows this:
>         THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR, AND THE BEHAVIOR OF OTHER
>         COMMANDS IN THE PRESENCE OF SPARSE-CHECKOUTS, WILL LIKELY CHANGE IN
>         THE FUTURE.
> It may still be slightly weird that the default behavior of commands
> in the presence of sparse-checkouts changes release to release until
> we get it all implemented.

I appreciate that we put that warning at the top. We will be
able to do more experimental things with the feature because
of it. The idea I'm toying with is to have "git clone --sparse"
set core.sparseCheckoutCone = true.

Also, if we are creating the "sparse-checkout.*" config space,
we should "rename" core.sparseCheckoutCone to sparse-checkout.coneMode
or something. We would need to support both for a while, for sure.

Thanks,
-Stolee



* Re: [RFC PATCH 2/3] grep: honor sparse checkout patterns
  2020-03-24 22:55     ` Matheus Tavares Bernardino
@ 2020-04-21  2:10       ` Matheus Tavares Bernardino
  2020-04-21  3:08         ` Elijah Newren
  0 siblings, 1 reply; 57+ messages in thread
From: Matheus Tavares Bernardino @ 2020-04-21  2:10 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git Mailing List, Derrick Stolee, brian m. carlson, Stefan Beller

Hi, Elijah, Stolee and others

On Tue, Mar 24, 2020 at 7:55 PM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> On Tue, Mar 24, 2020 at 4:15 AM Elijah Newren <newren@gmail.com> wrote:
> >
> > On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
> > <matheus.bernardino@usp.br> wrote:
> > >
> > > Something I'm not entirely sure in this patch is how we implement the
> > > mechanism to honor sparsity for the `git grep <commit-ish>` case (which
> > > is treated in the grep_tree() function). Currently, the patch looks for
> > > an index entry that matches the path, and then checks its skip_worktree
> >
> > As you discuss below, checking the index is both wrong _and_ costly.
> > You should use the sparsity patterns; Stolee did a lot of work to make
> > those correspond to simple hashes you could check to determine whether
> > to even walk into a subdirectory.
[...]
> OK, makes sense.

I've been working on the file skipping mechanism using the sparsity
patterns directly. But I'm uncertain about some implementation
details. So I wanted to share my current plan with you, to get some
feedback before going deeper.

The first idea was to load the sparsity patterns a priori and pass
them to grep_tree(), which recursively greps the entries of a given
tree object. If --recurse-submodules is given, however, we would also
need to load each subrepo's sparse-checkout file on the fly (as the
subrepos are lazily initialized in grep_tree()'s call chain). That's
not a problem on its own. But in the most naive implementation, this
means unnecessarily re-loading the sparse-checkout files of the
submodules for each tree given to git-grep (as grep_tree() is called
separately for each one of them).

So my next idea was to implement a cache, mapping 'struct repository's
to 'struct pattern_list'. Well, not 'struct repository' itself, but
repo->gitdir. This way we could load each file once, store the pattern
list, and quickly retrieve the one that affects the repository
currently being grepped, whether it is a submodule or not. But, is
gitdir unique per repository? If not, could we use
repo_git_path(repo, "info/sparse-checkout") as the key?

I already have a prototype implementation of the last idea (using
repo_git_path()). But I wanted to make sure, does this seem like a
good path? Or should we avoid the work of having this hashmap here and
do something else, such as adding a 'struct pattern_list' to 'struct
repository', directly?
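
A minimal sketch of the cache idea described above, written in Python
rather than git's C purely for illustration. The class name
PatternCache, the loader callback and the "loads" counter are
hypothetical; only the key (the path of each repository's
info/sparse-checkout file) comes from the message itself. It shows the
intended behavior: each file is loaded once, no matter how many trees
are grepped.

```python
import os
from typing import Callable, Dict, List


class PatternCache:
    """Load each repository's sparse-checkout file at most once.

    Hypothetical model: entries are keyed by the path of the
    info/sparse-checkout file inside a given gitdir.
    """

    def __init__(self, loader: Callable[[str], List[str]]):
        self._loader = loader              # reads and parses one file
        self._cache: Dict[str, List[str]] = {}
        self.loads = 0                     # how many real loads happened

    def get(self, gitdir: str) -> List[str]:
        key = os.path.join(gitdir, "info", "sparse-checkout")
        if key not in self._cache:
            self._cache[key] = self._loader(key)
            self.loads += 1
        return self._cache[key]


def read_pattern_lines(path: str) -> List[str]:
    """Simplified loader: return the non-blank lines of the file."""
    with open(path, encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]
```

A caller would create one PatternCache(read_pattern_lines) per git-grep
run and call get() with the gitdir of whichever repository is currently
being searched.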

Thanks,
Matheus

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 2/3] grep: honor sparse checkout patterns
  2020-04-21  2:10       ` Matheus Tavares Bernardino
@ 2020-04-21  3:08         ` Elijah Newren
  2020-04-22 12:08           ` Derrick Stolee
  2020-04-23  6:09           ` Matheus Tavares Bernardino
  0 siblings, 2 replies; 57+ messages in thread
From: Elijah Newren @ 2020-04-21  3:08 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: Git Mailing List, Derrick Stolee, brian m. carlson,
	Stefan Beller, Jonathan Nieder

On Mon, Apr 20, 2020 at 7:11 PM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> Hi, Elijah, Stolee and others
>
> On Tue, Mar 24, 2020 at 7:55 PM Matheus Tavares Bernardino
> <matheus.bernardino@usp.br> wrote:
> >
> > On Tue, Mar 24, 2020 at 4:15 AM Elijah Newren <newren@gmail.com> wrote:
> > >
> > > On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
> > > <matheus.bernardino@usp.br> wrote:
> > > >
> > > > Something I'm not entirely sure about in this patch is how we implement the
> > > > mechanism to honor sparsity for the `git grep <commit-ish>` case (which
> > > > is treated in the grep_tree() function). Currently, the patch looks for
> > > > an index entry that matches the path, and then checks its skip_worktree
> > >
> > > As you discuss below, checking the index is both wrong _and_ costly.
> > > You should use the sparsity patterns; Stolee did a lot of work to make
> > > those correspond to simple hashes you could check to determine whether
> > > to even walk into a subdirectory.
> [...]
> > OK, makes sense.
>
> I've been working on the file skipping mechanism using the sparsity
> patterns directly. But I'm uncertain about some implementation
> details. So I wanted to share my current plan with you, to get some
> feedback before going deeper.
>
> The first idea was to load the sparsity patterns a priori and pass
> them to grep_tree(), which recursively greps the entries of a given
> tree object. If --recurse-submodules is given, however, we would also
> need to load each subrepo's sparse-checkout file on the fly (as the
> subrepos are lazily initialized in grep_tree()'s call chain). That's
> not a problem on its own. But in the most naive implementation, this
> means unnecessarily re-loading the sparse-checkout files of the
> submodules for each tree given to git-grep (as grep_tree() is called
> separately for each one of them).

Wouldn't loading the sparse-checkout files be fast compared to
grepping a submodule for matching strings?  And not just fast, but
essentially in the noise and hard to even measure?  I have a hard time
fathoming parsing the sparse-checkout file for a submodule somehow
appreciably affecting the cost of grepping through that submodule.  If
the submodule has a huge number of sparse-checkout patterns, that'll
be because it has a ginormous number of files and grepping through
them all would be way, way longer.  If the submodule only has a few
files, then the sparse-checkout file is only going to be a few lines
at most.

Also, from another angle: I think the original intent of submodules
was an alternate form of sparse-checkout/partial-clone, letting people
deal with just their piece of the repo.  As such, do we really even
expect people to use sparse-checkouts and submodules together, let
alone use them very heavily together?  Sure, someone will use them,
but I have a hard time imagining the scale of use of both features
heavily enough for this to matter, especially since it also requires
specifying multiple trees to grep (which is slightly unusual) in
addition to the combination of these other features before your
optimization here could kick in and be worthwhile.

I'd be very tempted to just implement the most naive implementation
and maybe leave a TODO note in the code for some future person to come
along and optimize if it really matters, but I'd like to see numbers
before we spend the development and maintenance effort on it because
I'm having a hard time imagining any scale where it could matter.

> So my next idea was to implement a cache, mapping 'struct repository's
> to 'struct pattern_list'. Well, not 'struct repository' itself, but
> repo->gitdir. This way we could load each file once, store the pattern
> list, and quickly retrieve the one that affects the repository
> currently being grepped, whether it is a submodule or not. But, is
> gitdir unique per repository? If not, could we use
> repo_git_path(repo, "info/sparse-checkout") as the key?
>
> I already have a prototype implementation of the last idea (using
> repo_git_path()). But I wanted to make sure, does this seem like a
> good path? Or should we avoid the work of having this hashmap here and
> do something else, such as adding a 'struct pattern_list' to 'struct
> repository', directly?

Honestly, it sounds a bit like premature optimization to me.  Sorry if
that's disappointing since you've apparently already put some effort
into this, and it sounds like you're on a good track for optimizing
this if it were necessary, but I'm just having a hard time figuring
out whether it'd really help and be worth the code complexity.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 2/3] grep: honor sparse checkout patterns
  2020-04-21  3:08         ` Elijah Newren
@ 2020-04-22 12:08           ` Derrick Stolee
  2020-04-23  6:09           ` Matheus Tavares Bernardino
  1 sibling, 0 replies; 57+ messages in thread
From: Derrick Stolee @ 2020-04-22 12:08 UTC (permalink / raw)
  To: Elijah Newren, Matheus Tavares Bernardino
  Cc: Git Mailing List, Derrick Stolee, brian m. carlson,
	Stefan Beller, Jonathan Nieder

On 4/20/2020 11:08 PM, Elijah Newren wrote:
> On Mon, Apr 20, 2020 at 7:11 PM Matheus Tavares Bernardino
> <matheus.bernardino@usp.br> wrote:
>>
>> Hi, Elijah, Stolee and others
>>
>> On Tue, Mar 24, 2020 at 7:55 PM Matheus Tavares Bernardino
>> <matheus.bernardino@usp.br> wrote:
>>>
>>> On Tue, Mar 24, 2020 at 4:15 AM Elijah Newren <newren@gmail.com> wrote:
>>>>
>>>> On Mon, Mar 23, 2020 at 11:12 PM Matheus Tavares
>>>> <matheus.bernardino@usp.br> wrote:
>>>>>
>>>>> Something I'm not entirely sure about in this patch is how we implement the
>>>>> mechanism to honor sparsity for the `git grep <commit-ish>` case (which
>>>>> is treated in the grep_tree() function). Currently, the patch looks for
>>>>> an index entry that matches the path, and then checks its skip_worktree
>>>>
>>>> As you discuss below, checking the index is both wrong _and_ costly.
>>>> You should use the sparsity patterns; Stolee did a lot of work to make
>>>> those correspond to simple hashes you could check to determine whether
>>>> to even walk into a subdirectory.
>> [...]
>>> OK, makes sense.
>>
>> I've been working on the file skipping mechanism using the sparsity
>> patterns directly. But I'm uncertain about some implementation
>> details. So I wanted to share my current plan with you, to get some
>> feedback before going deeper.
>>
>> The first idea was to load the sparsity patterns a priori and pass
>> them to grep_tree(), which recursively greps the entries of a given
>> tree object. If --recurse-submodules is given, however, we would also
>> need to load each subrepo's sparse-checkout file on the fly (as the
>> subrepos are lazily initialized in grep_tree()'s call chain). That's
>> not a problem on its own. But in the most naive implementation, this
>> means unnecessarily re-loading the sparse-checkout files of the
>> submodules for each tree given to git-grep (as grep_tree() is called
>> separately for each one of them).
> 
> Wouldn't loading the sparse-checkout files be fast compared to
> grepping a submodule for matching strings?  And not just fast, but
> essentially in the noise and hard to even measure?  I have a hard time
> fathoming parsing the sparse-checkout file for a submodule somehow
> appreciably affecting the cost of grepping through that submodule.  If
> the submodule has a huge number of sparse-checkout patterns, that'll
> be because it has a ginormous number of files and grepping through
> them all would be way, way longer.  If the submodule only has a few
> files, then the sparse-checkout file is only going to be a few lines
> at most.
> 
> Also, from another angle: I think the original intent of submodules
> was an alternate form of sparse-checkout/partial-clone, letting people
> deal with just their piece of the repo.  As such, do we really even
> expect people to use sparse-checkouts and submodules together, let
> alone use them very heavily together?  Sure, someone will use them,
> but I have a hard time imagining the scale of use of both features
> heavily enough for this to matter, especially since it also requires
> specifying multiple trees to grep (which is slightly unusual) in
> addition to the combination of these other features before your
> optimization here could kick in and be worthwhile.
> 
> I'd be very tempted to just implement the most naive implementation
> and maybe leave a TODO note in the code for some future person to come
> along and optimize if it really matters, but I'd like to see numbers
> before we spend the development and maintenance effort on it because
> I'm having a hard time imagining any scale where it could matter.
> 
>> So my next idea was to implement a cache, mapping 'struct repository's
>> to 'struct pattern_list'. Well, not 'struct repository' itself, but
>> repo->gitdir. This way we could load each file once, store the pattern
>> list, and quickly retrieve the one that affects the repository
>> currently being grepped, whether it is a submodule or not. But, is
>> gitdir unique per repository? If not, could we use
>> repo_git_path(repo, "info/sparse-checkout") as the key?
>>
>> I already have a prototype implementation of the last idea (using
>> repo_git_path()). But I wanted to make sure, does this seem like a
>> good path? Or should we avoid the work of having this hashmap here and
>> do something else, such as adding a 'struct pattern_list' to 'struct
>> repository', directly?
> 
> Honestly, it sounds a bit like premature optimization to me.  Sorry if
> that's disappointing since you've apparently already put some effort
> into this, and it sounds like you're on a good track for optimizing
> this if it were necessary, but I'm just having a hard time figuring
> out whether it'd really help and be worth the code complexity.

My initial thought was to use a stack or queue. It depends on how
git-grep treats submodules. Imagine directories A, B, C where B is a
submodule.

If results from 'B' are output between results from 'A' and 'C', then
use a stack to "push" the latest sparse-checkout patterns as you
deepen into a submodule, then "pop" the patterns as you leave a
submodule.

If results from 'B' are output after results from 'C', then you could
possibly use a queue instead. I find this unlikely, and it would
behave strangely for nested submodules.

Since "struct pattern_list" has most of the information you require,
it should not be challenging to create a list of them.
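
The stack variant can be sketched as follows. This is a toy model in
Python, not git's C code: the Submodule class, the nested-dict tree
layout and the prefix-based is_included() check are all assumptions
made for illustration (real sparsity patterns are richer than a list
of directory prefixes). It only shows the push-on-enter, pop-on-leave
discipline, with results from 'B' landing between those of 'A' and 'C'.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Submodule:
    """Hypothetical stand-in for a submodule with its own sparsity rules."""
    patterns: List[str]   # directory prefixes kept by this submodule
    tree: Dict            # name -> file content (str), subtree (dict) or Submodule


def is_included(path: str, prefixes: List[str]) -> bool:
    """Toy matcher: top-level files are kept, others need a listed prefix."""
    if "/" not in path:
        return True
    return any(path.startswith(p + "/") for p in prefixes)


def grep_tree(tree, needle, stack, rel="", shown="", hits=None):
    """Walk a tree in name order; stack[-1] holds the active patterns."""
    if hits is None:
        hits = []
    for name in sorted(tree):
        entry = tree[name]
        if isinstance(entry, Submodule):
            stack.append(entry.patterns)   # push when entering the submodule
            grep_tree(entry.tree, needle, stack, "", shown + name + "/", hits)
            stack.pop()                    # pop when leaving it
        elif isinstance(entry, dict):
            grep_tree(entry, needle, stack,
                      rel + name + "/", shown + name + "/", hits)
        elif is_included(rel + name, stack[-1]) and needle in entry:
            hits.append(shown + name)
    return hits
```

Paths are matched relative to the repository that owns the patterns
(hence the reset of "rel" on entering a submodule), while "shown" keeps
the full path for output.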

Hopefully that provides some ideas.

Thanks,
-Stolee



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 2/3] grep: honor sparse checkout patterns
  2020-04-21  3:08         ` Elijah Newren
  2020-04-22 12:08           ` Derrick Stolee
@ 2020-04-23  6:09           ` Matheus Tavares Bernardino
  1 sibling, 0 replies; 57+ messages in thread
From: Matheus Tavares Bernardino @ 2020-04-23  6:09 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git Mailing List, Derrick Stolee, brian m. carlson,
	Stefan Beller, Jonathan Nieder

On Tue, Apr 21, 2020 at 12:08 AM Elijah Newren <newren@gmail.com> wrote:
>
> On Mon, Apr 20, 2020 at 7:11 PM Matheus Tavares Bernardino
> <matheus.bernardino@usp.br> wrote:
> >
> > I've been working on the file skipping mechanism using the sparsity
> > patterns directly. But I'm uncertain about some implementation
> > details. So I wanted to share my current plan with you, to get some
> > feedback before going deeper.
> >
> > The first idea was to load the sparsity patterns a priori and pass
> > them to grep_tree(), which recursively greps the entries of a given
> > tree object. If --recurse-submodules is given, however, we would also
> > need to load each subrepo's sparse-checkout file on the fly (as the
> > subrepos are lazily initialized in grep_tree()'s call chain). That's
> > not a problem on its own. But in the most naive implementation, this
> > means unnecessarily re-loading the sparse-checkout files of the
> > submodules for each tree given to git-grep (as grep_tree() is called
> > separately for each one of them).
>
> Wouldn't loading the sparse-checkout files be fast compared to
> grepping a submodule for matching strings?  And not just fast, but
> essentially in the noise and hard to even measure?  I have a hard time
> fathoming parsing the sparse-checkout file for a submodule somehow
> appreciably affecting the cost of grepping through that submodule.  If
> the submodule has a huge number of sparse-checkout patterns, that'll
> be because it has a ginormous number of files and grepping through
> them all would be way, way longer.  If the submodule only has a few
> files, then the sparse-checkout file is only going to be a few lines
> at most.

Yeah, makes sense.

> Also, from another angle: I think the original intent of submodules
> was an alternate form of sparse-checkout/partial-clone, letting people
> deal with just their piece of the repo.  As such, do we really even
> expect people to use sparse-checkouts and submodules together, let
> alone use them very heavily together?  Sure, someone will use them,
> but I have a hard time imagining the scale of use of both features
> heavily enough for this to matter, especially since it also requires
> specifying multiple trees to grep (which is slightly unusual) in
> addition to the combination of these other features before your
> optimization here could kick in and be worthwhile.
>
> I'd be very tempted to just implement the most naive implementation
> and maybe leave a TODO note in the code for some future person to come
> along and optimize if it really matters, but I'd like to see numbers
> before we spend the development and maintenance effort on it because
> I'm having a hard time imagining any scale where it could matter.

You're right. I guess I got a little too excited about the
optimization possibilities and neglected the fact that they might not
even be needed here.

Just to take a look at some numbers, I prototyped the naive
implementation and downloaded a testing repository[1] containing 8
submodules (or 14 counting the nested ones). For each of the
non-nested submodules, I added its .gitignore rules to the
sparse-checkout file (of course this doesn't make any sense for a
real-world usage, but I just wanted to populate the file with a large
quantity of valid rules, to test the parsing time). I also added the
rule '/*'. Then I ran:

git-grep --threads=1 --recurse-submodules -E "config_[a-z]+\(" $(cat /tmp/trees)

Where /tmp/trees contained about 120 trees in the said repository
(again, a probably unreal case, for testing purposes only). Then,
measuring the time spent only inside the function I created to load a
sparse-checkout file for a given 'struct repository', I got to the
following numbers:

Number of calls: 1531 (makes sense: ~120 trees and 14 submodules)
Percentage over the total time: 0.015%
Number of matches: 300897

And using 8 threads, I got the same numbers except for the percentage,
which was a little higher: 0.05%.

So, indeed, the overhead of re-loading the files is insignificant.
And my cache idea was a premature and unnecessary optimization.
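
For reference, the kind of measurement described above (call count plus
share of total run time spent inside one function) can be sketched as
below. This is a generic Python illustration, not the instrumentation
actually used for the numbers in this message; SectionTimer and its
method names are hypothetical.

```python
import time


class SectionTimer:
    """Accumulate call count and wall-clock time spent inside one function."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.calls = 0
        self.elapsed = 0.0

    def wrap(self, func):
        """Return a version of func whose running time is recorded."""
        def timed(*args, **kwargs):
            start = self._clock()
            try:
                return func(*args, **kwargs)
            finally:
                self.elapsed += self._clock() - start
                self.calls += 1
        return timed

    def percentage_of(self, total_seconds):
        """Share of total_seconds spent inside the wrapped function."""
        return 100.0 * self.elapsed / total_seconds
```

Wrapping the pattern-loading function with timer.wrap() and dividing by
the total run time gives the same two quantities reported above: the
number of calls and the percentage over the total time.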

> > So my next idea was to implement a cache, mapping 'struct repository's
> > to 'struct pattern_list'. Well, not 'struct repository' itself, but
> > repo->gitdir. This way we could load each file once, store the pattern
> > list, and quickly retrieve the one that affects the repository
> > currently being grepped, whether it is a submodule or not. But, is
> > gitdir unique per repository? If not, could we use
> > repo_git_path(repo, "info/sparse-checkout") as the key?
> >
> > I already have a prototype implementation of the last idea (using
> > repo_git_path()). But I wanted to make sure, does this seem like a
> > good path? Or should we avoid the work of having this hashmap here and
> > do something else, such as adding a 'struct pattern_list' to 'struct
> > repository', directly?
>
> Honestly, it sounds a bit like premature optimization to me.  Sorry if
> that's disappointing since you've apparently already put some effort
> into this, and it sounds like you're on a good track for optimizing
> this if it were necessary, but I'm just having a hard time figuring
> out whether it'd really help and be worth the code complexity.

No problem! I'm glad to have this feedback now, while I'm still
working on v2  :) Now I can focus on what's really relevant. So thanks
again!

[1]: https://github.com/surevine/Metre

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-31 20:02           ` Derrick Stolee
@ 2020-04-27 17:15             ` Matheus Tavares Bernardino
  2020-04-29 16:46               ` Elijah Newren
  2020-04-29 17:21             ` Elijah Newren
  1 sibling, 1 reply; 57+ messages in thread
From: Matheus Tavares Bernardino @ 2020-04-27 17:15 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Elijah Newren, Junio C Hamano, Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc, Jonathan Tan

Hi, Stolee and Elijah

I think I just finished addressing the comments on patch 2/3 [1]. And
I'm now looking at the ones in 3/3 (this one). Below are some
questions, just to make sure I'm going in the right direction with
this one.

On Tue, Mar 31, 2020 at 5:02 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 3/31/2020 3:12 PM, Elijah Newren wrote:
> >
> > Anyway, maybe it will help if I provide a very rough first draft of
> > what changes we could introduce to Documentation/config/core.txt, and
> > then ask a bunch of my own questions about it below:
> >
> > """
> > core.restrictToSparsePaths::
> >         Only meaningful in conjunction with core.sparseCheckoutCone.
> >         This option extends sparse checkouts (which limit which paths
> >         are written to the worktree), so that output and operations
> >         are also limited to the sparsity paths where possible and
> >         implemented.  The purpose of this option is to (1) focus
> >         output for the user on the portion of the repository that is
> >         of interest to them, and (2) enable potentially dramatic
> >         performance improvements, especially in conjunction with
> >         partial clones.
...
> > """
> >
> > Several questions here, of course:
> >
> >   * do people like or hate the name?  indifferent?  have alternate ideas?
>
> It's probably time to create a 'sparse-checkout' config space. That
> would allow
>
>         sparse-checkout.restrictGrep = true
>
> as an option. Or a more general
>
>         sparse-checkout.restrictCommands = true
>
> to make it clear that it affects multiple commands.

If we are creating the new namespace, 'core.sparseCheckout' should
also be renamed to something like 'sparse-checkout.enabled', right?
And maybe we could use 'sparsecheckout.*', instead? That seems to be
the convention for settings on hyphenated commands (as in sendemail.*,
uploadpack.* and gitgui.*).

As for compatibility, when running `git sparse-checkout init`, if the
config file already has the core.sparseCheckout setting, should we
remove it? Or just add the new sparsecheckout.enabled config, which
will always be read first?

Also, should we emit a warning about the former being deprecated? The
good thing about deprecation warnings, IMO, is that users will know
the name change faster. But, at least for `git grep <tree>`, where we
read core.sparseCheckout and core.sparseCheckoutCone for each
submodule and each tree, there would be too much pollution in the
output...

Finally, about restrictCommands, the idea is to have both
sparsecheckout.restrictCommands and `git --restrict-to-sparse-paths`,
right? For now, the option/setting would only affect grep, but support
would be added gradually to other commands in the future. I noticed
git-read-tree already has a --no-sparse-checkout option. Should we
remove this option in favor of the global
--[no-]restrict-to-sparse-paths?

Sorry for so many questions. I just wanted to make sure that I
understood the plan before diving into the implementation, to avoid
going in the wrong direction.

[1]: Here is a sneak peek for v2 of patch 2/3, in case you might want
to take a look:
https://github.com/matheustavares/git/commit/970ef529f1e8f719c4427bd9fea8205ada69d913

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-04-27 17:15             ` Matheus Tavares Bernardino
@ 2020-04-29 16:46               ` Elijah Newren
  0 siblings, 0 replies; 57+ messages in thread
From: Elijah Newren @ 2020-04-29 16:46 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: Derrick Stolee, Junio C Hamano, Git Mailing List, Derrick Stolee,
	Nguyễn Thái Ngọc, Jonathan Tan

On Mon, Apr 27, 2020 at 10:15 AM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> Hi, Stolee and Elijah
>
> I think I just finished addressing the comments on patch 2/3 [1]. And
> I'm now looking at the ones in 3/3 (this one). Below are some
> questions, just to make sure I'm going in the right direction with
> this one.
>
> On Tue, Mar 31, 2020 at 5:02 PM Derrick Stolee <stolee@gmail.com> wrote:
> >
> > On 3/31/2020 3:12 PM, Elijah Newren wrote:
> > >
> > > Anyway, maybe it will help if I provide a very rough first draft of
> > > what changes we could introduce to Documentation/config/core.txt, and
> > > then ask a bunch of my own questions about it below:
> > >
> > > """
> > > core.restrictToSparsePaths::
> > >         Only meaningful in conjunction with core.sparseCheckoutCone.
> > >         This option extends sparse checkouts (which limit which paths
> > >         are written to the worktree), so that output and operations
> > >         are also limited to the sparsity paths where possible and
> > >         implemented.  The purpose of this option is to (1) focus
> > >         output for the user on the portion of the repository that is
> > >         of interest to them, and (2) enable potentially dramatic
> > >         performance improvements, especially in conjunction with
> > >         partial clones.
> ...
> > > """
> > >
> > > Several questions here, of course:
> > >
> > >   * do people like or hate the name?  indifferent?  have alternate ideas?
> >
> > It's probably time to create a 'sparse-checkout' config space. That
> > would allow
> >
> >         sparse-checkout.restrictGrep = true
> >
> > as an option. Or a more general
> >
> >         sparse-checkout.restrictCommands = true
> >
> > to make it clear that it affects multiple commands.
>
> If we are creating the new namespace, 'core.sparseCheckout' should
> also be renamed to something like 'sparse-checkout.enabled', right?
> And maybe we could use 'sparsecheckout.*', instead? That seems to be
> the convention for settings on hyphenated commands (as in sendemail.*,
> uploadpack.* and gitgui.*).

Or maybe just call the namespace 'sparse.*' if we're going that route?

> As for compatibility, when running `git sparse-checkout init`, if the
> config file already has the core.sparseCheckout setting, should we
> remove it? Or just add the new sparsecheckout.enabled config, which
> will always be read first?

We seem to have two competing issues:

  * If you remove the core.sparseCheckout setting in favor of
sparse.enabled, then people can't use the repo with an older version
of git.  (This may be acceptable, but we've generally been somewhat
careful with index extensions and such to avoid such a state, with
slow transitions with index and pack versions and such.)
  * If you leave the core.sparseCheckout setting around as well as
having sparse.enabled, then we have two different settings that we can
keep in sync with newer git but which older git will only update one
of.  What do we do if we detect they are out of sync?  Throw an error?
Pretend that one overrules?  If the older one overrules, what do we
accomplish with the new name?  If the newer name overrules, doesn't
that also potentially break using an older git version?

I'm not sure what to do here.  Maybe people who have worked on index
version and pack version transitions have some good suggestions for
us?
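
To make the cases concrete, here is a small Python sketch that only
classifies the possible states of the two settings; it deliberately
does not pick a policy for the out-of-sync case, since that is the open
question. The names SyncState and classify() are hypothetical, and
"sparse.enabled" is just the name floated in this thread, not an
existing git setting. None stands for "not set in the config".

```python
from enum import Enum
from typing import Optional


class SyncState(Enum):
    UNSET = "neither setting present"
    OLD_ONLY = "only core.sparseCheckout present"
    NEW_ONLY = "only the proposed sparse.enabled present"
    IN_SYNC = "both present and equal"
    CONFLICT = "both present but different"


def classify(old: Optional[bool], new: Optional[bool]) -> SyncState:
    """Classify the pair (core.sparseCheckout, proposed sparse.enabled)."""
    if old is None and new is None:
        return SyncState.UNSET
    if new is None:
        return SyncState.OLD_ONLY
    if old is None:
        return SyncState.NEW_ONLY
    return SyncState.IN_SYNC if old == new else SyncState.CONFLICT
```

OLD_ONLY is what a repository last touched by an older git would look
like, and CONFLICT is the state an older git can produce by updating
only the setting it knows about.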

> Also, should we emit a warning about the former being deprecated? The
> good thing about deprecation warnings, IMO, is that users will know
> the name change faster. But, at least for `git grep <tree>`, where we
> read core.sparseCheckout and core.sparseCheckoutCone for each
> submodule and each tree, there would be too much pollution in the
> output...

We've already started to steer away from users setting these values
and just have them get set/updated/unset by sparse-checkout init and
sparse-checkout disable.  Since users won't be setting these directly,
I don't think deprecation warnings make sense.

> Finally, about restrictCommands, the idea is to have both
> sparsecheckout.restrictCommands and `git --restrict-to-sparse-paths`,
> right? For now, the option/setting would only affect grep, but support
> would be added gradually to other commands in the future. I noticed

There should be both a config option and a global command line flag,
yes.  We might need the flag to default to
not-restricting-to-sparse-paths for now because that's consistent with
the only thing the current implementation of these commands can do.
But I'm really worried that this will remain the default and we'll
force users in the future to jump through a bunch of hoops to do a
simple thing:

$ git clone --sparse-paths $WANTED_DIRECTORIES user@server.name:path/to/repo.git
$ cd repo
<Enjoy their small view of the repo without every command suddenly
requiring a network connection and downloading huge reams of data they
don't even care about.>

> git-read-tree already has a --no-sparse-checkout option. Should we
> remove this option in favor of the global
> --[no-]restrict-to-sparse-paths?

read-tree is plumbing; we can't break backward compatibility.  We'll
have to leave that option there and just document that the two options
do the same thing.

> Sorry for so many questions. I just wanted to make sure that I
> understood the plan before diving into the implementation, to avoid
> going in the wrong direction.

Nah, these are all good questions.  Sorry for the delay in getting back to you.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH 3/3] grep: add option to ignore sparsity patterns
  2020-03-31 20:02           ` Derrick Stolee
  2020-04-27 17:15             ` Matheus Tavares Bernardino
@ 2020-04-29 17:21             ` Elijah Newren
  1 sibling, 0 replies; 57+ messages in thread
From: Elijah Newren @ 2020-04-29 17:21 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Matheus Tavares Bernardino, Junio C Hamano, Git Mailing List,
	Derrick Stolee, Nguyễn Thái Ngọc, Jonathan Tan

Sorry for the super late reply...

On Tue, Mar 31, 2020 at 1:02 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 3/31/2020 3:12 PM, Elijah Newren wrote:
> > // adding Jonathan Tan to cc based on the fact that we keep bringing
> > up partial clones and how it relates...
> >
> > On Sun, Mar 29, 2020 at 8:23 PM Matheus Tavares Bernardino
> > <matheus.bernardino@usp.br> wrote:
> >>
> >> On Tue, Mar 24, 2020 at 3:30 PM Junio C Hamano <gitster@pobox.com> wrote:
> >>>
> >>> Elijah Newren <newren@gmail.com> writes:
> >>>
> >>>> On Mon, Mar 23, 2020 at 11:13 PM Matheus Tavares
> >>>> <matheus.bernardino@usp.br> wrote:
> >>>>>
> >>>>> In the last commit, git-grep learned to honor sparsity patterns. For
> >>>>> some use cases, however, it may be desirable to search outside the
> >>>>> sparse checkout. So add the '--ignore-sparsity' option, which restores
> >>>>> the old behavior. Also add the grep.ignoreSparsity configuration, to
> >>>>> allow setting this behavior by default.
> >>>>
> >>>> Should `--ignore-sparsity` be a global git option rather than a
> >>>> grep-specific one?  Also, should grep.ignoreSparsity rather be
> >>>> core.ignoreSparsity or core.searchOutsideSparsePaths or something?
> >>>
> >>> Great question.  I think "git diff" with various options would also
> >>> want to optionally be able to be confined within the sparse cone, or
> >>> checking the entire world by lazily fetching outside the sparsity.
> >> [...]
> >>> Regardless of the choice of the default, it would be a good
> >>> idea to make the subcommands consistently offer the same default and
> >>> allow the non-default views with the same UI.
> >>
> >> Yeah, it seems like a sensible path. Regarding implementation, there
> >> is the question that Elijah raised, of whether to use a global git
> >> option or separate but consistent options for each subcommand. I don't
> >> have much experience with sparse checkout to argue for one or
> >> another, so I would like to hear what others have to say about it.
> >>
> >> A question that comes to my mind regarding the global git option is:
> >> will --ignore-sparsity (or whichever name we choose for it [1]) be
> >> sufficient for all subcommands? Or may some of them require additional
> >> options for command-specific behaviors concerning sparsity patterns?
> >> Also, would it be OK if we just ignored the option in commands that do
> >> not operate differently in sparse checkouts (maybe, fetch, branch and
> >> send-email, for example)? And would it make sense to allow
> >> constructions such as `git --ignore-sparsity checkout` or even `git
> >> --ignore-sparsity sparse-checkout ...`?
> >
> > I think the same option would probably be sufficient for all
> > subcommands, though I have a minor question about the merge machinery
> > (below).  And generally, I think it would be unusual for people to
> > pass the command line flag; I suspect most would set a config option
> > for most cases and then only occasionally override it on the command
> > line.  Since that config option would always be set, I'd expect
> > commands that are unaffected to just ignore it (much like both "git -c
> > merge.detectRenames=true fetch" and "git --work-tree=othertree fetch"
> > will both ignore the irrelevant options rather than trying to detect
> > that they were specified and error out).
> >
> >> [1]: Does anyone have suggestions for the option/config name? The best
> >> I could come up with so far (without being too verbose) is
> >> --no-sparsity-constraints. But I fear this might sound generic. As
> >> Elijah already mentioned, --ignore-sparsity is not good either, as it
> >> introduces double negatives in code...
> >
> > Does verbosity matter that much?  I think people would set it in
> > config, and tab completion would make it pretty easy to complete in
> > any event.
> >
> > Anyway, maybe it will help if I provide a very rough first draft of
> > what changes we could introduce to Documentation/config/core.txt, and
> > then ask a bunch of my own questions about it below:
> >
> > """
> > core.restrictToSparsePaths::
> >         Only meaningful in conjunction with core.sparseCheckoutCone.
> >         This option extends sparse checkouts (which limit which paths
> >         are written to the worktree), so that output and operations
> >         are also limited to the sparsity paths where possible and
> >         implemented.  The purpose of this option is to (1) focus
> >         output for the user on the portion of the repository that is
> >         of interest to them, and (2) enable potentially dramatic
> >         performance improvements, especially in conjunction with
> >         partial clones.
> > +
> > When this option is true, git commands such as log, diff, and grep may
> > limit their output to the directories specified by the sparse cone, or
> > to the intersection of those paths and any pathspecs (like `*.c`) that the user
> > might also specify on the command line.  (Note that this limit for
> > diff and grep only becomes relevant with --cached or when specifying a
> > REVISION, since a search of the working tree will automatically be
> > limited to the sparse paths that are present.)  Also, commands like
> > bisect may only select commits which modify paths within the sparsity
> > cone.  The merge machinery may use the sparse paths as a heuristic to
> > avoid trying to detect renames from within the sparsity cone to
> > outside the sparsity cone when at least one side of history only
> > touches paths within the sparsity cone (this can make the merge
> > machinery faster, but may risk modify/delete conflicts since upstream
> > can rename a file within the sparsity paths to a location outside
> > them).  Commands which export, integrity check, or create history will
> > always operate on full trees (e.g. fast-export, format-patch, fsck,
> > commit, etc.), unaffected by any sparsity patterns.
> > """
> >
> > Several questions here, of course:
> >
> >   * do people like or hate the name?  indifferent?  have alternate ideas?
>
> It's probably time to create a 'sparse-checkout' config space. That
> would allow
>
>         sparse-checkout.restrictGrep = true
>
> as an option. Or a more general
>
>         sparse-checkout.restrictCommands = true
>
> to make it clear that it affects multiple commands.

As I mentioned to Matheus, would a "sparse" config space be nicer?

> >   * should we restrict this to core.sparseCheckoutCone as I suggested
> > above or also allow people to do it with core.sparseCheckout without
> > the cone mode?  I think attempting to weld partial clones together
> > with core.sparseCheckout is crazy, so I'm tempted to just make it be
> > specific to cone mode and to push people to use it.  But I'm
> > interested in thoughts on the matter.
>
> Personally, I prefer cone mode and think it covers 99% of cases.
> However, there are some who are using a big directory full of large
> binaries and relying on file-prefix matches to get only the big
> binaries they need. Until they restructure their repositories to
> take advantage of cone mode, we should be considerate of the full
> sparse-checkout specification when possible.

I agree with everything you say here except the last word; if you
replaced "possible" with "practical" then I'd agree.  In particular, I
like the idea of a partial clone that defaults to grabbing all the
blobs in the sparse path specification; I think it'd be reasonable to
transfer the sparseCone specification to the server and have it use
that to walk history and make a packfile.  Transferring a sparse
specification that does not match the cone mode requirements to a
server and making it use that as it walks over all of history sounds
like a good way to overload the server.
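
To make the cone mode requirements concrete, here is a minimal sketch
using the existing `git sparse-checkout` builtin. The throwaway
repository, its directory names (A, B) and the `demo` variable are made
up for illustration. Only whole directories are selected, which is what
keeps the generated pattern list small and regular:

```shell
#!/bin/sh
set -e

# Throwaway repository; all names are illustrative.
demo=$(mktemp -d)
git init -q "$demo/repo"
cd "$demo/repo"
mkdir A B
echo text >A/a
echo text >B/b
echo text >top
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m init

# Cone mode selects whole directories only.
git sparse-checkout init --cone
git sparse-checkout set B

# Show the patterns git generated for the cone.
cat .git/info/sparse-checkout
```

A non-cone specification may instead hold arbitrary gitignore-style
patterns, which is presumably what makes evaluating it over all of
history on a server so much more expensive.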

> >   * should worktrees be affected?  (I've been an advocate of new
> > worktrees inheriting the sparse patterns of the worktree in use at the
> > time the new worktree was created.  Junio once suggested he didn't
> > like that and that worktrees should start out dense.  That seems
> > problematic to me in big repos with partial clones and sparse checkouts
> > in use.  Perhaps dense new worktrees is the behavior you get when
> > core.restrictToSparsePaths is false?)
>
> We should probably consider a `--sparse` option for `git worktree add`
> so we can allow interested users to add worktrees that initialize to
> a sparse-checkout. Optionally create a config option that would copy
> the sparse-checkout file from the current repo to the worktree.

Okay, but if someone runs a future

$ git clone --sparse $RELEVANT_DIRECTORIES user@server.name:path/to/repo.git
$ cd repo
<Blissfully work in their smaller repo without commands suddenly
downloading reams of unwanted data>

should the clone command automatically set this option for the user?
I don't like the idea of users having to remember to set this option
(and the restrictToSparsePaths option, and whatever other options are
needed to work in their smaller environment).  I'd really like there
to be a single flag, in the form of some clone option, that sets all
of this up.

> >   * does my idea for the merge machinery make folks uncomfortable?
> > Should that be a different option?  Being able to do trivial *tree*
> > merges for the huge portion of the tree outside the sparsity paths
> > would be a huge win, especially with partial clones, but it certainly
> > is different.  Then again, Microsoft has disabled rename detection
> > entirely based on it being too expensive, so perhaps the idea of
> > rename-detection-within-your-cone-if-you-really-didn't-modify-anything-outside-the-cone-on-your-side-of-history
> > is a reasonable middle ground between off and on for rename detection.
>
> The part where you say "when at least one side of history only
> touches paths within the sparsity cone" makes me want to entertain
> the idea if it can be done cleanly.

Yeah, I still have to dig in and verify that this really works.

> I'm more concerned about the "git bisect" logic being restricted to
> the cone, since that is such an open-ended command for what is
> considered "good" or "bad".

If the sparse checkout has sufficient information for them to build
and test whatever predicate they are interested in, then surely
bisecting in a way that restricts to the cone would be a nice
optimization, right?  And if the cone doesn't have enough information
for them to build and test commits, then they would need to leave the
sparse checkout in order to bisect anyway.

> >   * what should the default be?  Junio suggested elsewhere[1] that
> > sparse-checkouts and partial clones should probably be welded together
> > (with partial clones downloading just history in the sparsity paths by
> > default), in which case having this option be true would be useful.
>
> My opinion on this is as follows: filtering blobs based on sparse-
> checkout patterns does not filter enough, and filtering trees based
> on sparse-checkout patterns filters too much. The costs are just
> flipped: having extra trees is not a huge problem but recovering from
> a "tree miss" is problematic. Having extra blobs is painful, but
> recovering from a "blob miss" is not a big deal.

Sounds like --filter=blob:none already solves your issues.  It doesn't
make me happy; I really want the history within the sparse cone to be
downloaded as part of the initial clone.  (I can see various ways that
downloading all trees would be easier, so if we end up downloading all
commits and all trees and just the blobs within the sparse cone, that
sounds fine to me.)
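
For reference, the combination that existing commands already allow
can be sketched as follows: a blobless partial clone
(--filter=blob:none) plus a cone-mode sparse checkout. The local
"server" repository stands in for a real remote, and every path name
here is hypothetical. This fetches all commits and trees up front and
blobs only on demand, so it is not the "history within the cone at
clone time" behavior wished for above:

```shell
#!/bin/sh
set -e

demo=$(mktemp -d)

# Hypothetical stand-in for a server-side repository.
git init -q "$demo/server"
(
	cd "$demo/server" &&
	mkdir src docs &&
	echo code >src/main.c &&
	echo notes >docs/readme &&
	echo top >README &&
	git add . &&
	git -c user.name=demo -c user.email=demo@example.com commit -q -m init &&
	# Let this "server" honor --filter requests.
	git config uploadpack.allowFilter true
)

# Blobless partial clone that starts out sparse.
git clone -q --filter=blob:none --sparse "file://$demo/server" "$demo/client"
cd "$demo/client"

# Select one directory in cone mode; the blobs it needs are fetched on demand.
git sparse-checkout init --cone
git sparse-checkout set src
ls
```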

> > But it may also be slightly weird because it'll probably take us a
> > while to implement this; while the big warning in
> > git-sparse-checkout.txt certainly allows this:
> >         THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR, AND THE BEHAVIOR OF OTHER
> >         COMMANDS IN THE PRESENCE OF SPARSE-CHECKOUTS, WILL LIKELY CHANGE IN
> >         THE FUTURE.
> > It may still be slightly weird that the default behavior of commands
> > in the presence of sparse-checkouts changes release to release until
> > we get it all implemented.
>
> I appreciate that we put that warning at the top. We will be
> able to do more experimental things with the feature because
> of it. The idea I'm toying with is to have "git clone --sparse"
> set core.sparseCheckoutCone = true.

Sounds good to me.  We might also want to set worktrees.copySparsity
and sparse.restrictToCone (or whatever these end up being named) as
well.

> Also, if we are creating the "sparse-checkout.*" config space,
> we should "rename" core.sparseCheckoutCone to sparse-checkout.coneMode
> or something. We would need to support both for a while, for sure.

And, if we automatically migrate the setting and delete the old one,
do we prevent someone from successfully using an older git version
with the repo?  Or, if we don't automatically unset the old one, do we
risk the two values getting out of sync if they do switch to an older
git version?

^ permalink raw reply	[flat|nested] 57+ messages in thread

* [RFC PATCH v2 0/4] grep: honor sparse checkout and add option to ignore it
  2020-03-24  6:04 [RFC PATCH 0/3] grep: honor sparse checkout and add option to ignore it Matheus Tavares
                   ` (2 preceding siblings ...)
  2020-03-24  6:13 ` [RFC PATCH 3/3] grep: add option to ignore sparsity patterns Matheus Tavares
@ 2020-05-10  0:41 ` Matheus Tavares
  2020-05-10  0:41   ` [RFC PATCH v2 1/4] doc: grep: unify info on configuration variables Matheus Tavares
                     ` (3 more replies)
  3 siblings, 4 replies; 57+ messages in thread
From: Matheus Tavares @ 2020-05-10  0:41 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, newren, jonathantanmy

This series is based on the discussions in [1]. The idea is to make
git-grep (and other commands, in the future) be able to restrict their
output to the sparsity patterns, when requested by the user.

Main changes since v1:

In patch 1:
- Remove two unnecessary references in git-grep.txt, as they are in the
  same document.

Added patch 2.

In patch 3:
- Match paths directly against the sparsity patterns, when grepping a
  given tree, instead of checking the index.
- Better handle searches with --recurse-submodules when the superproject
  and/or the submodule have sparse checkout enabled. And add tests for
  cases like these.
- In tests, use the builtin git-sparse-checkout instead of manually
  writing to the sparse-checkout file.
- Add tests for grepping in cone mode.
- Rename the previous 'from_commit' parameter and a test name, to be
  more meaningful.

  Note: it was suggested to change some of the tests in this patch to
  use cone mode. I ended up using both cone mode and full patterns, so
  that we could check that grep behaves correctly when submodules have
  different pattern rules than the superproject. I tried to leave the
  testing repo's structure simple, though, so that the tests remain
  readable.

In patch 4:
- Move the configuration that restricts cmds' behavior based on the
  sparse checkout from the 'grep' namespace to 'sparse', as the idea is
  to have the same setting affecting multiple cmds.
- Add the --[no-]restrict-to-sparse-paths global option.
- Add more tests for the setting and CLI option in grep.
- Add tests to ensure the submodules' values for the setting are
  respected when running grep with --recurse-submodules.

  Note: in this patch, I used the 'sparse' namespace, instead of 'core',
  following the idea we discussed in [2], to have the sparse checkout
  settings in their own namespace. We also talked about moving
  core.sparseCheckout and core.sparseCheckoutCone to the new
  namespace.  I tried implementing this change in this same patchset
  (although, on second thought, it is probably better to do it in
  another one), but I still haven't managed to come up with a rename
  implementation that keeps good compatibility. The problems are the
  ones Elijah listed in [3]. So, for now, sparse.restrictCmds is the
  only setting in the 'sparse' namespace. But it won't be the only one
  for too long, as Stolee is already implementing other ones [4].
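
  As a usage sketch of the interface described for patch 4 (an
  assumption: it only takes effect with this series applied; stock git
  does not know sparse.restrictCmds and silently ignores the unknown
  key), the setting can be toggled per invocation with -c. The
  repository layout and names below are illustrative:

```shell
#!/bin/sh
set -e

# Throwaway repository; all names are illustrative.
demo=$(mktemp -d)
git init -q "$demo/repo"
cd "$demo/repo"
mkdir inside outside
echo text >inside/a
echo text >outside/b
git add .
git -c user.name=demo -c user.email=demo@example.com commit -q -m init
git sparse-checkout init --cone
git sparse-checkout set inside

# With the series applied: restrict the search to the sparsity patterns.
git -c sparse.restrictCmds=true grep --cached text

# With the series applied: search all index entries, as before the series.
git -c sparse.restrictCmds=false grep --cached text

# Global option proposed in patch 4; commented out because it is not
# available without the series:
# git --no-restrict-to-sparse-paths grep --cached text
```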

[1]: https://lore.kernel.org/git/CAHd-oW7e5qCuxZLBeVDq+Th3E+E4+P8=WzJfK8WcG2yz=n_nag@mail.gmail.com/t/#u
[2]: https://lore.kernel.org/git/49c1e9a5-b234-1696-03cc-95bf95f4663c@gmail.com/
[3]: https://lore.kernel.org/git/CABPp-BGytfCugK0S99nLPH4_VXmcYPHWdVyLO59BZc4__4CT9w@mail.gmail.com/
[4]: https://lore.kernel.org/git/2188577cd848d7cee77f06f1ad2b181864e5e36d.1588857462.git.gitgitgadget@gmail.com/

Matheus Tavares (4):
  doc: grep: unify info on configuration variables
  config: load the correct config.worktree file
  grep: honor sparse checkout patterns
  config: add setting to ignore sparsity patterns in some cmds

 Documentation/config.txt               |   2 +
 Documentation/config/grep.txt          |  10 +-
 Documentation/config/sparse.txt        |  22 +++
 Documentation/git-grep.txt             |  37 +----
 Documentation/git.txt                  |   4 +
 Makefile                               |   1 +
 builtin/grep.c                         | 137 +++++++++++++++-
 config.c                               |   5 +-
 contrib/completion/git-completion.bash |   2 +
 git.c                                  |   6 +
 sparse-checkout.c                      |  16 ++
 sparse-checkout.h                      |  11 ++
 t/t7011-skip-worktree-reading.sh       |   9 --
 t/t7817-grep-sparse-checkout.sh        | 216 +++++++++++++++++++++++++
 t/t9902-completion.sh                  |   4 +-
 15 files changed, 431 insertions(+), 51 deletions(-)
 create mode 100644 Documentation/config/sparse.txt
 create mode 100644 sparse-checkout.c
 create mode 100644 sparse-checkout.h
 create mode 100755 t/t7817-grep-sparse-checkout.sh

Range-diff against v1:
1:  7ba5caf10d ! 1:  c344d22313 doc: grep: unify info on configuration variables
    @@ Commit message
     
         Explanations about the configuration variables for git-grep are
         duplicated in "Documentation/git-grep.txt" and
    -    "Documentation/config/grep.txt". Let's unify the information in the
    -    second file and include it in the first.
    +    "Documentation/config/grep.txt", which can make maintenance difficult.
    +    The first also contains a definition not present in the latter
    +    (grep.fullName). To avoid problems like this, let's unify the
    +    information in the second file and include it in the first.
     
         Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
     
    @@ Documentation/config/grep.txt: grep.extendedRegexp::
      grep.threads::
     -	Number of grep worker threads to use.
     -	See `grep.threads` in linkgit:git-grep[1] for more information.
    -+	Number of grep worker threads to use. See `--threads` in
    -+	linkgit:git-grep[1] for more information.
    ++	Number of grep worker threads to use. See `--threads`
    ++ifndef::git-grep[]
    ++	in linkgit:git-grep[1]
    ++endif::git-grep[]
    ++	for more information.
     +
     +grep.fullName::
     +	If set to true, enable `--full-name` option by default.
    @@ Documentation/git-grep.txt: characters.  An empty string as search expression ma
     -	If set to true, fall back to git grep --no-index if git grep
     -	is executed outside of a git repository.  Defaults to false.
     -
    ++:git-grep: 1
     +include::config/grep.txt[]
      
      OPTIONS
    @@ Documentation/git-grep.txt: providing this option will cause it to die.
     +	Number of grep worker threads to use. If not provided (or set to
     +	0), Git will use as many worker threads as the number of logical
     +	cores available. The default value can also be set with the
    -+	`grep.threads` configuration (see linkgit:git-config[1]).
    ++	`grep.threads` configuration.
      
      -f <file>::
      	Read patterns from <file>, one per line.
-:  ---------- > 2:  882310b69f config: load the correct config.worktree file
2:  0b9b4c4b41 ! 3:  e00674c727 grep: honor sparse checkout patterns
    @@ Commit message
         One of the main uses for a sparse checkout is to allow users to focus on
         the subset of files in a repository in which they are interested. But
         git-grep currently ignores the sparsity patterns and report all matches
    -    found outside this subset, which kind of goes in the oposity direction.
    +    found outside this subset, which kind of goes in the opposite direction.
         Let's fix that, making it honor the sparsity boundaries for every
         grepping case:
     
    @@ builtin/grep.c: static int grep_cache(struct grep_opt *opt,
      static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
      		     struct tree_desc *tree, struct strbuf *base, int tn_len,
     -		     int check_attr);
    -+		     int from_commit);
    ++		     int is_root_tree);
      
      static int grep_submodule(struct grep_opt *opt,
      			  const struct pathspec *pathspec,
    @@ builtin/grep.c: static int grep_cache(struct grep_opt *opt,
      	for (nr = 0; nr < repo->index->cache_nr; nr++) {
      		const struct cache_entry *ce = repo->index->cache[nr];
     +
    -+		if (ce_skip_worktree(ce))
    ++		if (ce_skip_worktree(ce) && !S_ISGITLINK(ce->ce_mode))
     +			continue;
     +
      		strbuf_setlen(&name, name_base_len);
    @@ builtin/grep.c: static int grep_cache(struct grep_opt *opt,
      					continue;
      				hit |= grep_oid(opt, &ce->oid, name.buf,
     @@ builtin/grep.c: static int grep_cache(struct grep_opt *opt,
    + 	return hit;
    + }
      
    - static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
    - 		     struct tree_desc *tree, struct strbuf *base, int tn_len,
    +-static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
    +-		     struct tree_desc *tree, struct strbuf *base, int tn_len,
     -		     int check_attr)
    -+		     int from_commit)
    ++static struct pattern_list *get_sparsity_patterns(struct repository *repo)
    ++{
    ++	struct pattern_list *patterns;
    ++	char *sparse_file;
    ++	int sparse_config, cone_config;
    ++
    ++	if (repo_config_get_bool(repo, "core.sparsecheckout", &sparse_config) ||
    ++	    !sparse_config) {
    ++		return NULL;
    ++	}
    ++
    ++	sparse_file = repo_git_path(repo, "info/sparse-checkout");
    ++	patterns = xcalloc(1, sizeof(*patterns));
    ++
    ++	if (repo_config_get_bool(repo, "core.sparsecheckoutcone", &cone_config))
    ++		cone_config = 0;
    ++	patterns->use_cone_patterns = cone_config;
    ++
    ++	if (add_patterns_from_file_to_list(sparse_file, "", 0, patterns, NULL)) {
    ++		if (file_exists(sparse_file)) {
    ++			warning(_("failed to load sparse-checkout file: '%s'"),
    ++				sparse_file);
    ++		}
    ++		free(sparse_file);
    ++		free(patterns);
    ++		return NULL;
    ++	}
    ++
    ++	free(sparse_file);
    ++	return patterns;
    ++}
    ++
    ++static int in_sparse_checkout(struct strbuf *path, int prefix_len,
    ++			      unsigned int entry_mode,
    ++			      struct index_state *istate,
    ++			      struct pattern_list *sparsity,
    ++			      enum pattern_match_result parent_match,
    ++			      enum pattern_match_result *match)
    ++{
    ++	int dtype = DT_UNKNOWN;
    ++
    ++	if (S_ISGITLINK(entry_mode))
    ++		return 1;
    ++
    ++	if (parent_match == MATCHED_RECURSIVE) {
    ++		*match = parent_match;
    ++		return 1;
    ++	}
    ++
    ++	if (S_ISDIR(entry_mode) && !is_dir_sep(path->buf[path->len - 1]))
    ++		strbuf_addch(path, '/');
    ++
    ++	*match = path_matches_pattern_list(path->buf, path->len,
    ++					   path->buf + prefix_len, &dtype,
    ++					   sparsity, istate);
    ++	if (*match == UNDECIDED)
    ++		*match = parent_match;
    ++
    ++	if (S_ISDIR(entry_mode))
    ++		strbuf_trim_trailing_dir_sep(path);
    ++
    ++	if (*match == NOT_MATCHED && (S_ISREG(entry_mode) ||
    ++	    (S_ISDIR(entry_mode) && sparsity->use_cone_patterns)))
    ++		return 0;
    ++
    ++	return 1;
    ++}
    ++
    ++static int do_grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
    ++			struct tree_desc *tree, struct strbuf *base, int tn_len,
    ++			int check_attr, struct pattern_list *sparsity,
    ++			enum pattern_match_result default_sparsity_match)
      {
      	struct repository *repo = opt->repo;
      	int hit = 0;
     @@ builtin/grep.c: static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
    - 		name_base_len = name.len;
    - 	}
      
    -+	if (from_commit && repo_read_index(repo) < 0)
    -+		die(_("index file corrupt"));
    -+
      	while (tree_entry(tree, &entry)) {
      		int te_len = tree_entry_len(&entry);
    ++		enum pattern_match_result sparsity_match = 0;
      
    + 		if (match != all_entries_interesting) {
    + 			strbuf_addstr(&name, base->buf + tn_len);
     @@ builtin/grep.c: static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
      
      		strbuf_add(base, entry.path, te_len);
      
    -+		if (from_commit) {
    -+			int pos = index_name_pos(repo->index,
    -+						 base->buf + tn_len,
    -+						 base->len - tn_len);
    -+			if (pos >= 0 &&
    -+			    ce_skip_worktree(repo->index->cache[pos])) {
    ++		if (sparsity) {
    ++			struct strbuf path = STRBUF_INIT;
    ++			strbuf_addstr(&path, base->buf + tn_len);
    ++
    ++			if (!in_sparse_checkout(&path, old_baselen - tn_len,
    ++						entry.mode, repo->index,
    ++						sparsity, default_sparsity_match,
    ++						&sparsity_match)) {
     +				strbuf_setlen(base, old_baselen);
     +				continue;
     +			}
    @@ builtin/grep.c: static int grep_tree(struct grep_opt *opt, const struct pathspec
     +
      		if (S_ISREG(entry.mode)) {
      			hit |= grep_oid(opt, &entry.oid, base->buf, tn_len,
    --					 check_attr ? base->buf + tn_len : NULL);
    -+					from_commit ? base->buf + tn_len : NULL);
    - 		} else if (S_ISDIR(entry.mode)) {
    - 			enum object_type type;
    - 			struct tree_desc sub;
    + 					 check_attr ? base->buf + tn_len : NULL);
     @@ builtin/grep.c: static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
    + 
      			strbuf_addch(base, '/');
      			init_tree_desc(&sub, data, size);
    - 			hit |= grep_tree(opt, pathspec, &sub, base, tn_len,
    +-			hit |= grep_tree(opt, pathspec, &sub, base, tn_len,
     -					 check_attr);
    -+					 from_commit);
    ++			hit |= do_grep_tree(opt, pathspec, &sub, base, tn_len,
    ++					    check_attr, sparsity, sparsity_match);
      			free(data);
      		} else if (recurse_submodules && S_ISGITLINK(entry.mode)) {
      			hit |= grep_submodule(opt, pathspec, &entry.oid,
    +@@ builtin/grep.c: static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
    + 	return hit;
    + }
    + 
    ++/*
    ++ * Note: sparsity patterns and paths' attributes will only be considered if
    ++ * is_root_tree has true value. (Otherwise, we cannot properly perform pattern
    ++ * matching on paths.)
    ++ */
    ++static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
    ++		     struct tree_desc *tree, struct strbuf *base, int tn_len,
    ++		     int is_root_tree)
    ++{
    ++	struct pattern_list *patterns = NULL;
    ++	int ret;
    ++
    ++	if (is_root_tree)
    ++		patterns = get_sparsity_patterns(opt->repo);
    ++
    ++	ret = do_grep_tree(opt, pathspec, tree, base, tn_len, is_root_tree,
    ++			   patterns, 0);
    ++
    ++	if (patterns) {
    ++		clear_pattern_list(patterns);
    ++		free(patterns);
    ++	}
    ++	return ret;
    ++}
    ++
    + static int grep_object(struct grep_opt *opt, const struct pathspec *pathspec,
    + 		       struct object *obj, const char *name, const char *path)
    + {
     
      ## t/t7011-skip-worktree-reading.sh ##
     @@ t/t7011-skip-worktree-reading.sh: test_expect_success 'ls-files --modified' '
    @@ t/t7817-grep-sparse-checkout.sh (new)
     +
     +test_description='grep in sparse checkout
     +
    -+This test creates the following dir structure:
    ++This test creates a repo with the following structure:
    ++
     +.
    -+| - a
    -+| - b
    -+| - dir
    -+    | - c
    ++|-- a
    ++|-- b
    ++|-- dir
    ++|   `-- c
    ++`-- sub
    ++    |-- A
    ++    |   `-- a
    ++    `-- B
    ++	`-- b
     +
    -+Only "a" should be present due to the sparse checkout patterns:
    -+"/*", "!/b" and "!/dir".
    ++Where . has non-cone mode sparsity patterns and sub is a submodule with cone
    ++mode sparsity patterns. The resulting sparse-checkout should leave the following
    ++structure:
    ++
    ++.
    ++|-- a
    ++`-- sub
    ++    `-- B
    ++	`-- b
     +'
     +
     +. ./test-lib.sh
    @@ t/t7817-grep-sparse-checkout.sh (new)
     +	echo "text" >b &&
     +	mkdir dir &&
     +	echo "text" >dir/c &&
    ++
    ++	git init sub &&
    ++	(
    ++		cd sub &&
    ++		mkdir A B &&
    ++		echo "text" >A/a &&
    ++		echo "text" >B/b &&
    ++		git add A B &&
    ++		git commit -m sub &&
    ++		git sparse-checkout init --cone &&
    ++		git sparse-checkout set B
    ++	) &&
    ++
    ++	git submodule add ./sub &&
     +	git add a b dir &&
    -+	git commit -m "initial commit" &&
    ++	git commit -m super &&
    ++	git sparse-checkout init --no-cone &&
    ++	git sparse-checkout set "/*" "!b" "!/*/" &&
    ++
     +	git tag -am t-commit t-commit HEAD &&
     +	tree=$(git rev-parse HEAD^{tree}) &&
     +	git tag -am t-tree t-tree $tree &&
    -+	cat >.git/info/sparse-checkout <<-EOF &&
    -+	/*
    -+	!/b
    -+	!/dir
    -+	EOF
    -+	git sparse-checkout init &&
    ++
     +	test_path_is_missing b &&
     +	test_path_is_missing dir &&
    -+	test_path_is_file a
    ++	test_path_is_missing sub/A &&
    ++	test_path_is_file a &&
    ++	test_path_is_file sub/B/b
     +'
     +
     +test_expect_success 'grep in working tree should honor sparse checkout' '
    @@ t/t7817-grep-sparse-checkout.sh (new)
     +	test_cmp expect_t-commit actual_t-commit
     +'
     +
    -+test_expect_success 'grep <tree-ish> should search outside sparse checkout' '
    ++test_expect_success 'grep <tree-ish> should ignore sparsity patterns' '
     +	commit=$(git rev-parse HEAD) &&
     +	tree=$(git rev-parse HEAD^{tree}) &&
     +	cat >expect_tree <<-EOF &&
    @@ t/t7817-grep-sparse-checkout.sh (new)
     +	test_cmp expect_t-tree actual_t-tree
     +'
     +
    ++test_expect_success 'grep --recurse-submodules --cached should honor sparse checkout in submodule' '
    ++	cat >expect <<-EOF &&
    ++	a:text
    ++	sub/B/b:text
    ++	EOF
    ++	git grep --recurse-submodules --cached "text" >actual &&
    ++	test_cmp expect actual
    ++'
    ++
    ++test_expect_success 'grep --recurse-submodules <commit-ish> should honor sparse checkout in submodule' '
    ++	commit=$(git rev-parse HEAD) &&
    ++	cat >expect_commit <<-EOF &&
    ++	$commit:a:text
    ++	$commit:sub/B/b:text
    ++	EOF
    ++	cat >expect_t-commit <<-EOF &&
    ++	t-commit:a:text
    ++	t-commit:sub/B/b:text
    ++	EOF
    ++	git grep --recurse-submodules "text" $commit >actual_commit &&
    ++	test_cmp expect_commit actual_commit &&
    ++	git grep --recurse-submodules "text" t-commit >actual_t-commit &&
    ++	test_cmp expect_t-commit actual_t-commit
    ++'
    ++
     +test_done
3:  a76242ecfa < -:  ---------- grep: add option to ignore sparsity patterns
-:  ---------- > 4:  3e9e906249 config: add setting to ignore sparsity patterns in some cmds
-- 
2.26.2


^ permalink raw reply	[flat|nested] 57+ messages in thread

* [RFC PATCH v2 1/4] doc: grep: unify info on configuration variables
  2020-05-10  0:41 ` [RFC PATCH v2 0/4] grep: honor sparse checkout and add option to ignore it Matheus Tavares
@ 2020-05-10  0:41   ` Matheus Tavares
  2020-05-10  0:41   ` [RFC PATCH v2 2/4] config: load the correct config.worktree file Matheus Tavares
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 57+ messages in thread
From: Matheus Tavares @ 2020-05-10  0:41 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, newren, jonathantanmy

Explanations about the configuration variables for git-grep are
duplicated in "Documentation/git-grep.txt" and
"Documentation/config/grep.txt", which can make maintenance difficult.
The first also contains a definition not present in the latter
(grep.fullName). To avoid problems like this, let's unify the
information in the second file and include it in the first.

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---
 Documentation/config/grep.txt | 10 ++++++++--
 Documentation/git-grep.txt    | 36 ++++++-----------------------------
 2 files changed, 14 insertions(+), 32 deletions(-)

diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
index 44abe45a7c..dd51db38e1 100644
--- a/Documentation/config/grep.txt
+++ b/Documentation/config/grep.txt
@@ -16,8 +16,14 @@ grep.extendedRegexp::
 	other than 'default'.
 
 grep.threads::
-	Number of grep worker threads to use.
-	See `grep.threads` in linkgit:git-grep[1] for more information.
+	Number of grep worker threads to use. See `--threads`
+ifndef::git-grep[]
+	in linkgit:git-grep[1]
+endif::git-grep[]
+	for more information.
+
+grep.fullName::
+	If set to true, enable `--full-name` option by default.
 
 grep.fallbackToNoIndex::
 	If set to true, fall back to git grep --no-index if git grep
diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
index a7f9bc99ea..9bdf807584 100644
--- a/Documentation/git-grep.txt
+++ b/Documentation/git-grep.txt
@@ -41,34 +41,8 @@ characters.  An empty string as search expression matches all lines.
 CONFIGURATION
 -------------
 
-grep.lineNumber::
-	If set to true, enable `-n` option by default.
-
-grep.column::
-	If set to true, enable the `--column` option by default.
-
-grep.patternType::
-	Set the default matching behavior. Using a value of 'basic', 'extended',
-	'fixed', or 'perl' will enable the `--basic-regexp`, `--extended-regexp`,
-	`--fixed-strings`, or `--perl-regexp` option accordingly, while the
-	value 'default' will return to the default matching behavior.
-
-grep.extendedRegexp::
-	If set to true, enable `--extended-regexp` option by default. This
-	option is ignored when the `grep.patternType` option is set to a value
-	other than 'default'.
-
-grep.threads::
-	Number of grep worker threads to use. If unset (or set to 0), Git will
-	use as many threads as the number of logical cores available.
-
-grep.fullName::
-	If set to true, enable `--full-name` option by default.
-
-grep.fallbackToNoIndex::
-	If set to true, fall back to git grep --no-index if git grep
-	is executed outside of a git repository.  Defaults to false.
-
+:git-grep: 1
+include::config/grep.txt[]
 
 OPTIONS
 -------
@@ -269,8 +243,10 @@ providing this option will cause it to die.
 	found.
 
 --threads <num>::
-	Number of grep worker threads to use.
-	See `grep.threads` in 'CONFIGURATION' for more information.
+	Number of grep worker threads to use. If not provided (or set to
+	0), Git will use as many worker threads as the number of logical
+	cores available. The default value can also be set with the
+	`grep.threads` configuration.
 
 -f <file>::
 	Read patterns from <file>, one per line.
-- 
2.26.2


^ permalink raw reply	[flat|nested] 57+ messages in thread

* [RFC PATCH v2 2/4] config: load the correct config.worktree file
  2020-05-10  0:41 ` [RFC PATCH v2 0/4] grep: honor sparse checkout and add option to ignore it Matheus Tavares
  2020-05-10  0:41   ` [RFC PATCH v2 1/4] doc: grep: unify info on configuration variables Matheus Tavares
@ 2020-05-10  0:41   ` Matheus Tavares
  2020-05-11 19:10     ` Junio C Hamano
  2020-05-10  0:41   ` [RFC PATCH v2 3/4] grep: honor sparse checkout patterns Matheus Tavares
  2020-05-10  0:41   ` [RFC PATCH v2 4/4] config: add setting to ignore sparsity patterns in some cmds Matheus Tavares
  3 siblings, 1 reply; 57+ messages in thread
From: Matheus Tavares @ 2020-05-10  0:41 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, newren, jonathantanmy

One of the steps in do_git_config_sequence() is to load the
worktree-specific config file. Although the function receives a git_dir
string, it relies on git_pathdup(), which uses the_repository->git_dir,
to make the path to the file. Thus, when a submodule has a worktree
setting, a command executed in the superproject that recurses into the
submodule won't find the said setting. Such a scenario might not be
needed now, but it will be in the following patch. git-grep will learn
to honor sparse checkouts and, when running with --recurse-submodules,
the submodule's sparse checkout settings must be loaded. As these
settings are stored in the config.worktree file, they would be ignored
without this patch.

The fix is simple: we replace git_pathdup() with mkpathdup(), to format
the path with the given git_dir. This is the same idea used to make the
config.worktree path in setup.c:check_repository_format_gently().
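
To illustrate the idea outside of git's codebase, here is a minimal,
self-contained sketch. The helper name worktree_config_path() is
hypothetical and only stands in for the mkpathdup() call in the diff
below; the point is that the path is derived from the git_dir handed in
by the caller, not from a process-wide "current repository" directory.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for mkpathdup("%s/config.worktree", git_dir).
 * Builds the path from the caller-supplied git_dir, so a submodule's
 * git directory yields the submodule's own config.worktree.
 * Returns a heap-allocated string the caller must free(), or NULL if
 * the allocation fails.
 */
char *worktree_config_path(const char *git_dir)
{
	const char *suffix = "/config.worktree";
	size_t len;
	char *path;

	assert(git_dir != NULL);

	/* +1 for the terminating NUL byte */
	len = strlen(git_dir) + strlen(suffix) + 1;
	path = malloc(len);
	if (!path)
		return NULL;

	snprintf(path, len, "%s%s", git_dir, suffix);
	return path;
}
```

Called with a submodule's git directory (say, ".git/modules/sub"), the
helper returns a path under that directory, which is the behavior the
patch needs when recursing from the superproject.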

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---
 config.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/config.c b/config.c
index 8db9c77098..a3d0a0d266 100644
--- a/config.c
+++ b/config.c
@@ -1747,8 +1747,9 @@ static int do_git_config_sequence(const struct config_options *opts,
 		ret += git_config_from_file(fn, repo_config, data);
 
 	current_parsing_scope = CONFIG_SCOPE_WORKTREE;
-	if (!opts->ignore_worktree && repository_format_worktree_config) {
-		char *path = git_pathdup("config.worktree");
+	if (!opts->ignore_worktree && repository_format_worktree_config &&
+	    opts->git_dir) {
+		char *path = mkpathdup("%s/config.worktree", opts->git_dir);
 		if (!access_or_die(path, R_OK, 0))
 			ret += git_config_from_file(fn, path, data);
 		free(path);
-- 
2.26.2



* [RFC PATCH v2 3/4] grep: honor sparse checkout patterns
  2020-05-10  0:41 ` [RFC PATCH v2 0/4] grep: honor sparse checkout and add option to ignore it Matheus Tavares
  2020-05-10  0:41   ` [RFC PATCH v2 1/4] doc: grep: unify info on configuration variables Matheus Tavares
  2020-05-10  0:41   ` [RFC PATCH v2 2/4] config: load the correct config.worktree file Matheus Tavares
@ 2020-05-10  0:41   ` Matheus Tavares
  2020-05-11 19:35     ` Junio C Hamano
  2020-05-10  0:41   ` [RFC PATCH v2 4/4] config: add setting to ignore sparsity patterns in some cmds Matheus Tavares
  3 siblings, 1 reply; 57+ messages in thread
From: Matheus Tavares @ 2020-05-10  0:41 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, newren, jonathantanmy

One of the main uses for a sparse checkout is to allow users to focus on
the subset of files in a repository in which they are interested. But
git-grep currently ignores the sparsity patterns and reports all matches
found outside this subset, which kind of goes in the opposite direction.
Let's fix that, making it honor the sparsity boundaries for every
grepping case:

- git grep in worktree
- git grep --cached
- git grep $REVISION
- git grep --untracked and git grep --no-index (which already respect
  sparse checkout boundaries)

This is also what some users reported[1] they would want as the default
behavior.

Note: for `git grep $REVISION`, we will choose to honor the sparsity
patterns only when $REVISION is a commit-ish object. The reason is that,
for a tree, we don't know whether it represents the root of a
repository or a subtree. So we wouldn't be able to correctly match it
against the sparsity patterns. E.g. suppose we have a repository with
these two sparsity rules: "/*" and "!/a"; and the following structure:

/
| - a (file)
| - d (dir)
    | - a (file)

If `git grep $REVISION` were to honor the sparsity patterns for every
object type, when grepping the /d tree, we would wrongly ignore the /d/a
file. This happens because we wouldn't know it resides in /d and
therefore it would wrongly match the pattern "!/a". Furthermore, for a
search in a blob object, we wouldn't even have a path to check the
patterns against. So, let's ignore the sparsity patterns when grepping
non-commit-ish objects (tags to commits should be fine).
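
The problem can be reproduced with a toy matcher. This is only a sketch:
matches_anchored() and excluded_by_not_a() are hypothetical helpers that
handle nothing but a literal, root-anchored pattern, which is far less
than what real sparsity patterns support. It is enough, though, to show
why the path must be known relative to the repository root.

```c
#include <assert.h>
#include <string.h>

/*
 * Toy matcher for a root-anchored pattern such as "/a": it matches only
 * when the path, taken relative to the repository root, is exactly the
 * pattern without its leading slash. (Assumption: no wildcards and no
 * directory patterns are handled here.)
 */
int matches_anchored(const char *pattern, const char *path_from_root)
{
	assert(pattern != NULL && pattern[0] == '/');
	return strcmp(pattern + 1, path_from_root) == 0;
}

/* Returns 1 if the path is excluded by the negative rule "!/a". */
int excluded_by_not_a(const char *path_from_root)
{
	return matches_anchored("/a", path_from_root);
}
```

With the full path "d/a" the file is kept, as it should be. If the /d
tree is grepped on its own, the only name available for the entry is
"a", and the same check excludes it by mistake.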

Finally, the old behavior may still be desirable for some use cases. So
the next patch will add an option to allow restoring it when needed.

[1]: https://lore.kernel.org/git/CABPp-BGuFhDwWZBRaD3nA8ui46wor-4=Ha1G1oApsfF8KNpfGQ@mail.gmail.com/

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---

Note: as I mentioned in the cover letter, the tests in this patch now
contain both cone mode and full pattern sparse checkouts. This was done
for two reasons: to test grep's behavior when searching with
--recurse-submodules and having submodules with different pattern sets
than the superproject (which was incorrect in my first implementation),
and to test the direct pattern matching in grep_tree(), using both
modes.

 builtin/grep.c                   | 127 ++++++++++++++++++++++++++--
 t/t7011-skip-worktree-reading.sh |   9 --
 t/t7817-grep-sparse-checkout.sh  | 140 +++++++++++++++++++++++++++++++
 3 files changed, 259 insertions(+), 17 deletions(-)
 create mode 100755 t/t7817-grep-sparse-checkout.sh

diff --git a/builtin/grep.c b/builtin/grep.c
index a5056f395a..91ee0b2734 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -410,7 +410,7 @@ static int grep_cache(struct grep_opt *opt,
 		      const struct pathspec *pathspec, int cached);
 static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 		     struct tree_desc *tree, struct strbuf *base, int tn_len,
-		     int check_attr);
+		     int is_root_tree);
 
 static int grep_submodule(struct grep_opt *opt,
 			  const struct pathspec *pathspec,
@@ -508,6 +508,10 @@ static int grep_cache(struct grep_opt *opt,
 
 	for (nr = 0; nr < repo->index->cache_nr; nr++) {
 		const struct cache_entry *ce = repo->index->cache[nr];
+
+		if (ce_skip_worktree(ce) && !S_ISGITLINK(ce->ce_mode))
+			continue;
+
 		strbuf_setlen(&name, name_base_len);
 		strbuf_addstr(&name, ce->name);
 
@@ -520,8 +524,7 @@ static int grep_cache(struct grep_opt *opt,
 			 * cache entry are identical, even if worktree file has
 			 * been modified, so use cache version instead
 			 */
-			if (cached || (ce->ce_flags & CE_VALID) ||
-			    ce_skip_worktree(ce)) {
+			if (cached || (ce->ce_flags & CE_VALID)) {
 				if (ce_stage(ce) || ce_intent_to_add(ce))
 					continue;
 				hit |= grep_oid(opt, &ce->oid, name.buf,
@@ -552,9 +555,78 @@ static int grep_cache(struct grep_opt *opt,
 	return hit;
 }
 
-static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
-		     struct tree_desc *tree, struct strbuf *base, int tn_len,
-		     int check_attr)
+static struct pattern_list *get_sparsity_patterns(struct repository *repo)
+{
+	struct pattern_list *patterns;
+	char *sparse_file;
+	int sparse_config, cone_config;
+
+	if (repo_config_get_bool(repo, "core.sparsecheckout", &sparse_config) ||
+	    !sparse_config) {
+		return NULL;
+	}
+
+	sparse_file = repo_git_path(repo, "info/sparse-checkout");
+	patterns = xcalloc(1, sizeof(*patterns));
+
+	if (repo_config_get_bool(repo, "core.sparsecheckoutcone", &cone_config))
+		cone_config = 0;
+	patterns->use_cone_patterns = cone_config;
+
+	if (add_patterns_from_file_to_list(sparse_file, "", 0, patterns, NULL)) {
+		if (file_exists(sparse_file)) {
+			warning(_("failed to load sparse-checkout file: '%s'"),
+				sparse_file);
+		}
+		free(sparse_file);
+		free(patterns);
+		return NULL;
+	}
+
+	free(sparse_file);
+	return patterns;
+}
+
+static int in_sparse_checkout(struct strbuf *path, int prefix_len,
+			      unsigned int entry_mode,
+			      struct index_state *istate,
+			      struct pattern_list *sparsity,
+			      enum pattern_match_result parent_match,
+			      enum pattern_match_result *match)
+{
+	int dtype = DT_UNKNOWN;
+
+	if (S_ISGITLINK(entry_mode))
+		return 1;
+
+	if (parent_match == MATCHED_RECURSIVE) {
+		*match = parent_match;
+		return 1;
+	}
+
+	if (S_ISDIR(entry_mode) && !is_dir_sep(path->buf[path->len - 1]))
+		strbuf_addch(path, '/');
+
+	*match = path_matches_pattern_list(path->buf, path->len,
+					   path->buf + prefix_len, &dtype,
+					   sparsity, istate);
+	if (*match == UNDECIDED)
+		*match = parent_match;
+
+	if (S_ISDIR(entry_mode))
+		strbuf_trim_trailing_dir_sep(path);
+
+	if (*match == NOT_MATCHED && (S_ISREG(entry_mode) ||
+	    (S_ISDIR(entry_mode) && sparsity->use_cone_patterns)))
+		return 0;
+
+	return 1;
+}
+
+static int do_grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
+			struct tree_desc *tree, struct strbuf *base, int tn_len,
+			int check_attr, struct pattern_list *sparsity,
+			enum pattern_match_result default_sparsity_match)
 {
 	struct repository *repo = opt->repo;
 	int hit = 0;
@@ -570,6 +642,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 
 	while (tree_entry(tree, &entry)) {
 		int te_len = tree_entry_len(&entry);
+		enum pattern_match_result sparsity_match = 0;
 
 		if (match != all_entries_interesting) {
 			strbuf_addstr(&name, base->buf + tn_len);
@@ -586,6 +659,19 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 
 		strbuf_add(base, entry.path, te_len);
 
+		if (sparsity) {
+			struct strbuf path = STRBUF_INIT;
+			strbuf_addstr(&path, base->buf + tn_len);
+
+			if (!in_sparse_checkout(&path, old_baselen - tn_len,
+						entry.mode, repo->index,
+						sparsity, default_sparsity_match,
+						&sparsity_match)) {
+				strbuf_setlen(base, old_baselen);
+				continue;
+			}
+		}
+
 		if (S_ISREG(entry.mode)) {
 			hit |= grep_oid(opt, &entry.oid, base->buf, tn_len,
 					 check_attr ? base->buf + tn_len : NULL);
@@ -602,8 +688,8 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 
 			strbuf_addch(base, '/');
 			init_tree_desc(&sub, data, size);
-			hit |= grep_tree(opt, pathspec, &sub, base, tn_len,
-					 check_attr);
+			hit |= do_grep_tree(opt, pathspec, &sub, base, tn_len,
+					    check_attr, sparsity, sparsity_match);
 			free(data);
 		} else if (recurse_submodules && S_ISGITLINK(entry.mode)) {
 			hit |= grep_submodule(opt, pathspec, &entry.oid,
@@ -621,6 +707,31 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 	return hit;
 }
 
+/*
+ * Note: sparsity patterns and paths' attributes will only be considered if
+ * is_root_tree has true value. (Otherwise, we cannot properly perform pattern
+ * matching on paths.)
+ */
+static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
+		     struct tree_desc *tree, struct strbuf *base, int tn_len,
+		     int is_root_tree)
+{
+	struct pattern_list *patterns = NULL;
+	int ret;
+
+	if (is_root_tree)
+		patterns = get_sparsity_patterns(opt->repo);
+
+	ret = do_grep_tree(opt, pathspec, tree, base, tn_len, is_root_tree,
+			   patterns, 0);
+
+	if (patterns) {
+		clear_pattern_list(patterns);
+		free(patterns);
+	}
+	return ret;
+}
+
 static int grep_object(struct grep_opt *opt, const struct pathspec *pathspec,
 		       struct object *obj, const char *name, const char *path)
 {
diff --git a/t/t7011-skip-worktree-reading.sh b/t/t7011-skip-worktree-reading.sh
index 37525cae3a..26852586ac 100755
--- a/t/t7011-skip-worktree-reading.sh
+++ b/t/t7011-skip-worktree-reading.sh
@@ -109,15 +109,6 @@ test_expect_success 'ls-files --modified' '
 	test -z "$(git ls-files -m)"
 '
 
-test_expect_success 'grep with skip-worktree file' '
-	git update-index --no-skip-worktree 1 &&
-	echo test > 1 &&
-	git update-index 1 &&
-	git update-index --skip-worktree 1 &&
-	rm 1 &&
-	test "$(git grep --no-ext-grep test)" = "1:test"
-'
-
 echo ":000000 100644 $ZERO_OID $EMPTY_BLOB A	1" > expected
 test_expect_success 'diff-index does not examine skip-worktree absent entries' '
 	setup_absent &&
diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
new file mode 100755
index 0000000000..3bd67082eb
--- /dev/null
+++ b/t/t7817-grep-sparse-checkout.sh
@@ -0,0 +1,140 @@
+#!/bin/sh
+
+test_description='grep in sparse checkout
+
+This test creates a repo with the following structure:
+
+.
+|-- a
+|-- b
+|-- dir
+|   `-- c
+`-- sub
+    |-- A
+    |   `-- a
+    `-- B
+	`-- b
+
+Where . has non-cone mode sparsity patterns and sub is a submodule with cone
+mode sparsity patterns. The resulting sparse-checkout should leave the following
+structure:
+
+.
+|-- a
+`-- sub
+    `-- B
+	`-- b
+'
+
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+	echo "text" >a &&
+	echo "text" >b &&
+	mkdir dir &&
+	echo "text" >dir/c &&
+
+	git init sub &&
+	(
+		cd sub &&
+		mkdir A B &&
+		echo "text" >A/a &&
+		echo "text" >B/b &&
+		git add A B &&
+		git commit -m sub &&
+		git sparse-checkout init --cone &&
+		git sparse-checkout set B
+	) &&
+
+	git submodule add ./sub &&
+	git add a b dir &&
+	git commit -m super &&
+	git sparse-checkout init --no-cone &&
+	git sparse-checkout set "/*" "!b" "!/*/" &&
+
+	git tag -am t-commit t-commit HEAD &&
+	tree=$(git rev-parse HEAD^{tree}) &&
+	git tag -am t-tree t-tree $tree &&
+
+	test_path_is_missing b &&
+	test_path_is_missing dir &&
+	test_path_is_missing sub/A &&
+	test_path_is_file a &&
+	test_path_is_file sub/B/b
+'
+
+test_expect_success 'grep in working tree should honor sparse checkout' '
+	cat >expect <<-EOF &&
+	a:text
+	EOF
+	git grep "text" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep --cached should honor sparse checkout' '
+	cat >expect <<-EOF &&
+	a:text
+	EOF
+	git grep --cached "text" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep <commit-ish> should honor sparse checkout' '
+	commit=$(git rev-parse HEAD) &&
+	cat >expect_commit <<-EOF &&
+	$commit:a:text
+	EOF
+	cat >expect_t-commit <<-EOF &&
+	t-commit:a:text
+	EOF
+	git grep "text" $commit >actual_commit &&
+	test_cmp expect_commit actual_commit &&
+	git grep "text" t-commit >actual_t-commit &&
+	test_cmp expect_t-commit actual_t-commit
+'
+
+test_expect_success 'grep <tree-ish> should ignore sparsity patterns' '
+	commit=$(git rev-parse HEAD) &&
+	tree=$(git rev-parse HEAD^{tree}) &&
+	cat >expect_tree <<-EOF &&
+	$tree:a:text
+	$tree:b:text
+	$tree:dir/c:text
+	EOF
+	cat >expect_t-tree <<-EOF &&
+	t-tree:a:text
+	t-tree:b:text
+	t-tree:dir/c:text
+	EOF
+	git grep "text" $tree >actual_tree &&
+	test_cmp expect_tree actual_tree &&
+	git grep "text" t-tree >actual_t-tree &&
+	test_cmp expect_t-tree actual_t-tree
+'
+
+test_expect_success 'grep --recurse-submodules --cached should honor sparse checkout in submodule' '
+	cat >expect <<-EOF &&
+	a:text
+	sub/B/b:text
+	EOF
+	git grep --recurse-submodules --cached "text" >actual &&
+	test_cmp expect actual
+'
+
+test_expect_success 'grep --recurse-submodules <commit-ish> should honor sparse checkout in submodule' '
+	commit=$(git rev-parse HEAD) &&
+	cat >expect_commit <<-EOF &&
+	$commit:a:text
+	$commit:sub/B/b:text
+	EOF
+	cat >expect_t-commit <<-EOF &&
+	t-commit:a:text
+	t-commit:sub/B/b:text
+	EOF
+	git grep --recurse-submodules "text" $commit >actual_commit &&
+	test_cmp expect_commit actual_commit &&
+	git grep --recurse-submodules "text" t-commit >actual_t-commit &&
+	test_cmp expect_t-commit actual_t-commit
+'
+
+test_done
-- 
2.26.2



* [RFC PATCH v2 4/4] config: add setting to ignore sparsity patterns in some cmds
  2020-05-10  0:41 ` [RFC PATCH v2 0/4] grep: honor sparse checkout and add option to ignore it Matheus Tavares
                     ` (2 preceding siblings ...)
  2020-05-10  0:41   ` [RFC PATCH v2 3/4] grep: honor sparse checkout patterns Matheus Tavares
@ 2020-05-10  0:41   ` Matheus Tavares
  2020-05-10  4:23     ` Matheus Tavares Bernardino
  2020-05-21  7:09     ` Elijah Newren
  3 siblings, 2 replies; 57+ messages in thread
From: Matheus Tavares @ 2020-05-10  0:41 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, newren, jonathantanmy

When sparse checkout is enabled, some users expect the output of certain
commands (such as grep, diff, and log) to be also restricted within the
sparsity patterns. This would allow them to effectively work only on the
subset of files in which they are interested; and allow some commands to
possibly perform better, by not considering uninteresting paths. For
this reason, we taught grep to honor the sparsity patterns, in the
previous commit. But, on the other hand, allowing grep and the other
commands mentioned to optionally ignore the patterns also makes for some
interesting use cases. E.g. using grep to search for a function
definition that resides outside the sparse checkout.

In any case, there is no current way for users to configure the behavior
they want for these commands. Aiming to provide this flexibility, let's
introduce the sparse.restrictCmds setting (and the analogous
--[no-]restrict-to-sparse-paths global option). The default value is
true. For now, grep is the only one affected by this setting, but the
goal is to have support for more commands, in the future.
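
The intended precedence (command-line option first, then the config,
then the default of true) can be sketched in isolation as follows. The
function resolve_restrict() and its parameters are hypothetical; in the
patch itself the config lookup goes through repo_config_get_bool(),
which is modeled here by the config_set/config_value pair.

```c
#include <assert.h>

/*
 * Sketch of the precedence implemented by restrict_to_sparse_paths():
 *
 *   cli_opt      -1 if neither --restrict-to-sparse-paths nor its
 *                negation was given, otherwise 0 or 1
 *   config_set   non-zero if sparse.restrictCmds is set for the repo
 *   config_value the boolean value of that setting, if set
 *
 * The command-line option wins; otherwise the config is used; if the
 * config is unset, the default is to restrict (1).
 */
int resolve_restrict(int cli_opt, int config_set, int config_value)
{
	assert(cli_opt >= -1 && cli_opt <= 1);

	if (cli_opt >= 0)
		return cli_opt;
	if (!config_set)
		return 1;
	return config_value != 0;
}
```

Because the command-line value is global to the process while the
config is looked up per repository, an explicit option applies to
submodules as well, whereas each submodule's own sparse.restrictCmds is
consulted only when no option was given.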

Helped-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---

Some notes/questions about this one:

- I guess having the additional sparse-checkout.o only for the
  restrict_to_sparse_paths() function is not very justifiable.
  Especially since builtin/grep.c is currently its only caller. But
  since Stolee is already moving some code out of the sparse-checkout
  builtin and into sparse-checkout.o [1], I thought it would be better
  to place this function here from the start, as it will likely be
  needed by other cmds when they start honoring sparse.restrictCmds.
  (Side note: I think I will also be able to use the
  populate_sparse_checkout_patterns() function added by Stolee in the
  same patchset [2], to avoid code duplication in the
  get_sparsity_patterns() function added in this patch).

[1]: https://lore.kernel.org/git/0181a134bfb6986dc0e54ae624c478446a1324a9.1588857462.git.gitgitgadget@gmail.com/
[2]: https://lore.kernel.org/git/444a6b5f894f28e96f713e5caccba18e1ea3b3eb.1588857462.git.gitgitgadget@gmail.com/

- With that said, the only reason we need restrict_to_sparse_paths() to
  begin with, is so that commands which recurse into submodules may
  respect the value set in each submodule for the sparse.restrictCmds
  config. This is already being done for grep, in this patch. But,
  should we do it like this or should we use the value set at the
  superproject, for all submodules as well, when recursing (ignoring the
  value set on them)?

- It's possible to also make read-tree respect the new setting/option,
  using --no-restrict-to-sparse-paths as a synonym for its
  --no-sparse-checkout option (with lower precedence). However, as this
  command can change the sparse checked out paths, I thought it kind
  of falls under a different category. Also, `git read-tree -mu
  --sparse-checkout` doesn't have the effect of *restricting* the
  command's behavior to the sparsity patterns, but of applying them to
  the working tree, right? So maybe it could be confusing to make this
  command honor the new setting. Does that make sense, or should we do
  it?

- Finally, if we decide to make read-tree be affected by
  sparse.restrictCmds, there is also the case of whether the config
  should be honored for submodules or just propagate the superproject's
  value. I think the latter would be as simple as adding this line,
  before calling parse_options() in builtin/read-tree.c:

  opts.skip_sparse_checkout = !restrict_to_sparse_paths(the_repository);

  As for the former, I'm not very familiar with the code in
  unpack_trees(), so I'm not sure how complicated that would be.


 Documentation/config.txt               |  2 +
 Documentation/config/sparse.txt        | 22 ++++++++
 Documentation/git-grep.txt             |  3 +
 Documentation/git.txt                  |  4 ++
 Makefile                               |  1 +
 builtin/grep.c                         | 14 ++++-
 contrib/completion/git-completion.bash |  2 +
 git.c                                  |  6 ++
 sparse-checkout.c                      | 16 ++++++
 sparse-checkout.h                      | 11 ++++
 t/t7817-grep-sparse-checkout.sh        | 78 +++++++++++++++++++++++++-
 t/t9902-completion.sh                  |  4 +-
 12 files changed, 159 insertions(+), 4 deletions(-)
 create mode 100644 Documentation/config/sparse.txt
 create mode 100644 sparse-checkout.c
 create mode 100644 sparse-checkout.h

diff --git a/Documentation/config.txt b/Documentation/config.txt
index ef0768b91a..fd74b80302 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -436,6 +436,8 @@ include::config/sequencer.txt[]
 
 include::config/showbranch.txt[]
 
+include::config/sparse.txt[]
+
 include::config/splitindex.txt[]
 
 include::config/ssh.txt[]
diff --git a/Documentation/config/sparse.txt b/Documentation/config/sparse.txt
new file mode 100644
index 0000000000..83a4e0018f
--- /dev/null
+++ b/Documentation/config/sparse.txt
@@ -0,0 +1,22 @@
+sparse.restrictCmds::
+	Only meaningful in conjunction with core.sparseCheckout. This option
+	extends sparse checkouts (which limit which paths are written to the
+	working tree), so that output and operations are also limited to the
+	sparsity paths where possible and implemented. The purpose of this
+	option is to (1) focus output for the user on the portion of the
+	repository that is of interest to them, and (2) enable potentially
+	dramatic performance improvements, especially in conjunction with
+	partial clones.
++
+When this option is true (default), some git commands may limit their behavior
+to the paths specified by the sparsity patterns, or to the intersection of
+those paths and any pathspecs (like `*.c`) that the user might also specify on the command
+line. When false, the affected commands will work on full trees, ignoring the
+sparsity patterns. For now, only git-grep honors this setting. In this command,
+the restriction becomes relevant in one of these three cases: with --cached;
+when a commit-ish is given; when searching a working tree that contains paths
+previously excluded by the sparsity patterns.
++
+Note: commands which export, integrity check, or create history will always
+operate on full trees (e.g. fast-export, format-patch, fsck, commit, etc.),
+unaffected by any sparsity patterns.
diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
index 9bdf807584..abbf100109 100644
--- a/Documentation/git-grep.txt
+++ b/Documentation/git-grep.txt
@@ -41,6 +41,9 @@ characters.  An empty string as search expression matches all lines.
 CONFIGURATION
 -------------
 
+git-grep honors the sparse.restrictCmds setting. See its definition in
+linkgit:git-config[1].
+
 :git-grep: 1
 include::config/grep.txt[]
 
diff --git a/Documentation/git.txt b/Documentation/git.txt
index 9d6769e95a..5e107c6246 100644
--- a/Documentation/git.txt
+++ b/Documentation/git.txt
@@ -180,6 +180,10 @@ If you just want to run git as if it was started in `<path>` then use
 	Do not perform optional operations that require locks. This is
 	equivalent to setting the `GIT_OPTIONAL_LOCKS` to `0`.
 
+--[no-]restrict-to-sparse-paths::
+	Overrides the sparse.restrictCmds configuration (see
+	linkgit:git-config[1]) for this execution.
+
 --list-cmds=group[,group...]::
 	List commands by group. This is an internal/experimental
 	option and may change or be removed in the future. Supported
diff --git a/Makefile b/Makefile
index 3d3a39fc19..67580c691b 100644
--- a/Makefile
+++ b/Makefile
@@ -986,6 +986,7 @@ LIB_OBJS += sha1-name.o
 LIB_OBJS += shallow.o
 LIB_OBJS += sideband.o
 LIB_OBJS += sigchain.o
+LIB_OBJS += sparse-checkout.o
 LIB_OBJS += split-index.o
 LIB_OBJS += stable-qsort.o
 LIB_OBJS += strbuf.o
diff --git a/builtin/grep.c b/builtin/grep.c
index 91ee0b2734..3f92e7fd6c 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -25,6 +25,7 @@
 #include "submodule-config.h"
 #include "object-store.h"
 #include "packfile.h"
+#include "sparse-checkout.h"
 
 static char const * const grep_usage[] = {
 	N_("git grep [<options>] [-e] <pattern> [<rev>...] [[--] <path>...]"),
@@ -498,6 +499,7 @@ static int grep_cache(struct grep_opt *opt,
 	int nr;
 	struct strbuf name = STRBUF_INIT;
 	int name_base_len = 0;
+	int sparse_paths_only = restrict_to_sparse_paths(repo);
 	if (repo->submodule_prefix) {
 		name_base_len = strlen(repo->submodule_prefix);
 		strbuf_addstr(&name, repo->submodule_prefix);
@@ -509,7 +511,8 @@ static int grep_cache(struct grep_opt *opt,
 	for (nr = 0; nr < repo->index->cache_nr; nr++) {
 		const struct cache_entry *ce = repo->index->cache[nr];
 
-		if (ce_skip_worktree(ce) && !S_ISGITLINK(ce->ce_mode))
+		if (sparse_paths_only && ce_skip_worktree(ce) &&
+		    !S_ISGITLINK(ce->ce_mode))
 			continue;
 
 		strbuf_setlen(&name, name_base_len);
@@ -717,9 +720,10 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 		     int is_root_tree)
 {
 	struct pattern_list *patterns = NULL;
+	int sparse_paths_only = restrict_to_sparse_paths(opt->repo);
 	int ret;
 
-	if (is_root_tree)
+	if (is_root_tree && sparse_paths_only)
 		patterns = get_sparsity_patterns(opt->repo);
 
 	ret = do_grep_tree(opt, pathspec, tree, base, tn_len, is_root_tree,
@@ -1259,6 +1263,12 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
 
 	if (!use_index || untracked) {
 		int use_exclude = (opt_exclude < 0) ? use_index : !!opt_exclude;
+
+		if (opt_restrict_to_sparse_paths >= 0) {
+			warning(_("--[no-]restrict-to-sparse-paths is ignored"
+				  " with --no-index or --untracked"));
+		}
+
 		hit = grep_directory(&opt, &pathspec, use_exclude, use_index);
 	} else if (0 <= opt_exclude) {
 		die(_("--[no-]exclude-standard cannot be used for tracked contents"));
diff --git a/contrib/completion/git-completion.bash b/contrib/completion/git-completion.bash
index b1d6e5ebed..cba0f9166c 100644
--- a/contrib/completion/git-completion.bash
+++ b/contrib/completion/git-completion.bash
@@ -3207,6 +3207,8 @@ __git_main ()
 			--namespace=
 			--no-replace-objects
 			--help
+			--restrict-to-sparse-paths
+			--no-restrict-to-sparse-paths
 			"
 			;;
 		*)
diff --git a/git.c b/git.c
index 2e4efb4ff0..f967c75d9c 100644
--- a/git.c
+++ b/git.c
@@ -37,6 +37,7 @@ const char git_more_info_string[] =
 	   "See 'git help git' for an overview of the system.");
 
 static int use_pager = -1;
+int opt_restrict_to_sparse_paths = -1;
 
 static void list_builtins(struct string_list *list, unsigned int exclude_option);
 
@@ -310,6 +311,10 @@ static int handle_options(const char ***argv, int *argc, int *envchanged)
 			} else {
 				exit(list_cmds(cmd));
 			}
+		} else if (!strcmp(cmd, "--restrict-to-sparse-paths")) {
+			opt_restrict_to_sparse_paths = 1;
+		} else if (!strcmp(cmd, "--no-restrict-to-sparse-paths")) {
+			opt_restrict_to_sparse_paths = 0;
 		} else {
 			fprintf(stderr, _("unknown option: %s\n"), cmd);
 			usage(git_usage_string);
@@ -318,6 +323,7 @@ static int handle_options(const char ***argv, int *argc, int *envchanged)
 		(*argv)++;
 		(*argc)--;
 	}
+
 	return (*argv) - orig_argv;
 }
 
diff --git a/sparse-checkout.c b/sparse-checkout.c
new file mode 100644
index 0000000000..9a9e50fd29
--- /dev/null
+++ b/sparse-checkout.c
@@ -0,0 +1,16 @@
+#include "cache.h"
+#include "config.h"
+#include "sparse-checkout.h"
+
+int restrict_to_sparse_paths(struct repository *repo)
+{
+	int ret;
+
+	if (opt_restrict_to_sparse_paths >= 0)
+		return opt_restrict_to_sparse_paths;
+
+	if (repo_config_get_bool(repo, "sparse.restrictcmds", &ret))
+		ret = 1;
+
+	return ret;
+}
diff --git a/sparse-checkout.h b/sparse-checkout.h
new file mode 100644
index 0000000000..1de3b588d8
--- /dev/null
+++ b/sparse-checkout.h
@@ -0,0 +1,11 @@
+#ifndef SPARSE_CHECKOUT_H
+#define SPARSE_CHECKOUT_H
+
+struct repository;
+
+extern int opt_restrict_to_sparse_paths; /* from git.c */
+
+/* Whether or not cmds should restrict behavior on sparse paths, in this repo */
+int restrict_to_sparse_paths(struct repository *repo);
+
+#endif /* SPARSE_CHECKOUT_H */
diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
index 3bd67082eb..8509694bf1 100755
--- a/t/t7817-grep-sparse-checkout.sh
+++ b/t/t7817-grep-sparse-checkout.sh
@@ -63,12 +63,28 @@ test_expect_success 'setup' '
 	test_path_is_file sub/B/b
 '
 
+# The two tests below check a special case: the sparsity patterns exclude '/b'
+# and sparse checkout is enabled, but the path exists on the working tree (e.g.
+# manually created after `git sparse-checkout init`). In this case, grep should
+# honor --restrict-to-sparse-paths.
 test_expect_success 'grep in working tree should honor sparse checkout' '
 	cat >expect <<-EOF &&
 	a:text
 	EOF
+	echo newtext >b &&
 	git grep "text" >actual &&
-	test_cmp expect actual
+	test_cmp expect actual &&
+	rm b
+'
+test_expect_success 'grep w/ --no-restrict-to-sparse-paths for sparsely excluded but present paths' '
+	cat >expect <<-EOF &&
+	a:text
+	b:newtext
+	EOF
+	echo newtext >b &&
+	git --no-restrict-to-sparse-paths grep "text" >actual &&
+	test_cmp expect actual &&
+	rm b
 '
 
 test_expect_success 'grep --cached should honor sparse checkout' '
@@ -137,4 +153,64 @@ test_expect_success 'grep --recurse-submodules <commit-ish> should honor sparse
 	test_cmp expect_t-commit actual_t-commit
 '
 
+for cmd in 'git --no-restrict-to-sparse-paths grep' \
+	   'git -c sparse.restrictCmds=false grep' \
+	   'git -c sparse.restrictCmds=true --no-restrict-to-sparse-paths grep'
+do
+
+	test_expect_success "$cmd --cached should ignore sparsity patterns" '
+		cat >expect <<-EOF &&
+		a:text
+		b:text
+		dir/c:text
+		EOF
+		$cmd --cached "text" >actual &&
+		test_cmp expect actual
+	'
+
+	test_expect_success "$cmd <commit-ish> should ignore sparsity patterns" '
+		commit=$(git rev-parse HEAD) &&
+		cat >expect_commit <<-EOF &&
+		$commit:a:text
+		$commit:b:text
+		$commit:dir/c:text
+		EOF
+		cat >expect_t-commit <<-EOF &&
+		t-commit:a:text
+		t-commit:b:text
+		t-commit:dir/c:text
+		EOF
+		$cmd "text" $commit >actual_commit &&
+		test_cmp expect_commit actual_commit &&
+		$cmd "text" t-commit >actual_t-commit &&
+		test_cmp expect_t-commit actual_t-commit
+	'
+done
+
+test_expect_success 'should respect the sparse.restrictCmds values from submodules' '
+	cat >expect <<-EOF &&
+	a:text
+	sub/A/a:text
+	sub/B/b:text
+	EOF
+	git -C sub config sparse.restrictCmds false &&
+	git grep --cached --recurse-submodules "text" >actual &&
+	test_cmp expect actual &&
+	git -C sub config --unset sparse.restrictCmds
+'
+
+test_expect_success 'should propagate --[no-]restrict-to-sparse-paths to submodules' '
+	cat >expect <<-EOF &&
+	a:text
+	b:text
+	dir/c:text
+	sub/A/a:text
+	sub/B/b:text
+	EOF
+	git -C sub config sparse.restrictCmds true &&
+	git --no-restrict-to-sparse-paths grep --cached --recurse-submodules "text" >actual &&
+	test_cmp expect actual &&
+	git -C sub config --unset sparse.restrictCmds
+'
+
 test_done
diff --git a/t/t9902-completion.sh b/t/t9902-completion.sh
index 3c44af6940..a4a7767e06 100755
--- a/t/t9902-completion.sh
+++ b/t/t9902-completion.sh
@@ -1473,6 +1473,8 @@ test_expect_success 'double dash "git" itself' '
 	--namespace=
 	--no-replace-objects Z
 	--help Z
+	--restrict-to-sparse-paths Z
+	--no-restrict-to-sparse-paths Z
 	EOF
 '
 
@@ -1515,7 +1517,7 @@ test_expect_success 'general options' '
 	test_completion "git --nam" "--namespace=" &&
 	test_completion "git --bar" "--bare " &&
 	test_completion "git --inf" "--info-path " &&
-	test_completion "git --no-r" "--no-replace-objects "
+	test_completion "git --no-rep" "--no-replace-objects "
 '
 
 test_expect_success 'general options plus command' '
-- 
2.26.2


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v2 4/4] config: add setting to ignore sparsity patterns in some cmds
  2020-05-10  0:41   ` [RFC PATCH v2 4/4] config: add setting to ignore sparsity patterns in some cmds Matheus Tavares
@ 2020-05-10  4:23     ` Matheus Tavares Bernardino
  2020-05-21 17:18       ` Elijah Newren
  2020-05-21  7:09     ` Elijah Newren
  1 sibling, 1 reply; 57+ messages in thread
From: Matheus Tavares Bernardino @ 2020-05-10  4:23 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Derrick Stolee, Elijah Newren, Jonathan Tan

On Sat, May 9, 2020 at 9:42 PM Matheus Tavares
<matheus.bernardino@usp.br> wrote:
>
> diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> index 3bd67082eb..8509694bf1 100755
> --- a/t/t7817-grep-sparse-checkout.sh
> +++ b/t/t7817-grep-sparse-checkout.sh
> @@ -63,12 +63,28 @@ test_expect_success 'setup' '
>         test_path_is_file sub/B/b
>  '
>
> +# The two tests below check a special case: the sparsity patterns exclude '/b'
> +# and sparse checkout is enabled, but the path exists on the working tree (e.g.
> +# manually created after `git sparse-checkout init`). In this case, grep should
> +# honor --restrict-to-sparse-paths.

I just want to highlight a small thing that I forgot to comment on:
Elijah and I had already discussed --restrict-to-sparse-paths being
relevant in grep only with --cached or when a commit-ish is given. But
the special case mentioned above had not occurred to me before, i.e.
searching in the working tree when a path that should be excluded by
the sparsity patterns is present. In this patch, I let
--restrict-to-sparse-paths control the behavior of grep in this case
too. But please let me know if that doesn't seem like a good idea.
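
The special case can be reproduced with stock commands. The following is
only a sketch: it uses a throwaway repository in a temporary directory,
the file names and the committer identity are made up for illustration,
and it deliberately stops short of running grep, since the result of that
step is exactly what --restrict-to-sparse-paths is proposed to control.

```shell
# Sketch: recreate a path that the sparsity patterns exclude.
repo=$(mktemp -d) &&
cd "$repo" &&
git init -q . &&
echo text >a &&
echo text >b &&
git add a b &&
git -c user.name=Example -c user.email=example@example.com \
	commit -q -m initial &&

# Non-cone patterns: keep every top-level entry except /b.
git sparse-checkout init --no-cone &&
git sparse-checkout set "/*" "!/b" &&
test ! -e b &&	# b was removed from the working tree

# Manually bring the excluded path back.
echo text >b &&
patterns=$(git sparse-checkout list)	# the patterns still exclude /b
```

At this point "b" exists on disk while the sparsity patterns still say it
should not be there, which is the situation the two new tests cover.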

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v2 2/4] config: load the correct config.worktree file
  2020-05-10  0:41   ` [RFC PATCH v2 2/4] config: load the correct config.worktree file Matheus Tavares
@ 2020-05-11 19:10     ` Junio C Hamano
  2020-05-12 22:55       ` Matheus Tavares Bernardino
  0 siblings, 1 reply; 57+ messages in thread
From: Junio C Hamano @ 2020-05-11 19:10 UTC (permalink / raw)
  To: Matheus Tavares; +Cc: git, stolee, newren, jonathantanmy

Matheus Tavares <matheus.bernardino@usp.br> writes:

> One of the steps in do_git_config_sequence() is to load the
> worktree-specific config file. Although the function receives a git_dir
> string, it relies on git_pathdup(), which uses the_repository->git_dir,
> to make the path to the file. Thus, when a submodule has a worktree
> setting, a command executed in the superproject that recurses into the
> submodule won't find the said setting.

This has far wider ramifications than just "git grep" and it may be
an important fix.  Anything that wants to read from a per-worktree
configuration is not working as expected when run from a secondary
worktree, right?

Can we add a test or two to protect this fix from future breakages?


>  	current_parsing_scope = CONFIG_SCOPE_WORKTREE;
> -	if (!opts->ignore_worktree && repository_format_worktree_config) {
> -		char *path = git_pathdup("config.worktree");
> +	if (!opts->ignore_worktree && repository_format_worktree_config &&
> +	    opts->git_dir) {
> +		char *path = mkpathdup("%s/config.worktree", opts->git_dir);
>  		if (!access_or_die(path, R_OK, 0))
>  			ret += git_config_from_file(fn, path, data);
>  		free(path);

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v2 3/4] grep: honor sparse checkout patterns
  2020-05-10  0:41   ` [RFC PATCH v2 3/4] grep: honor sparse checkout patterns Matheus Tavares
@ 2020-05-11 19:35     ` Junio C Hamano
  2020-05-13  0:05       ` Matheus Tavares Bernardino
  2020-05-21  7:36       ` Elijah Newren
  0 siblings, 2 replies; 57+ messages in thread
From: Junio C Hamano @ 2020-05-11 19:35 UTC (permalink / raw)
  To: Matheus Tavares; +Cc: git, stolee, newren, jonathantanmy

Matheus Tavares <matheus.bernardino@usp.br> writes:

> One of the main uses for a sparse checkout is to allow users to focus on
> the subset of files in a repository in which they are interested. But
> git-grep currently ignores the sparsity patterns and reports all matches
> found outside this subset, which kind of goes in the opposite direction.
> Let's fix that, making it honor the sparsity boundaries for every
> grepping case:
>
> - git grep in worktree
> - git grep --cached
> - git grep $REVISION

It makes sense for these to be limited within the "sparse" area.

> - git grep --untracked and git grep --no-index (which already respect
>   sparse checkout boundaries)

I can understand the former; those untracked files are what _could_
be brought into attention by "git add", so limiting to the same
"sparse" area may make sense.

I am not sure about the latter, though, as "--no-index" is an
explicit request to pretend that we are dealing with a random
collection of files, not managed in a git repository.  But perhaps
there is a similar justification, like how "--untracked" is
justifiable.  I dunno.

> diff --git a/builtin/grep.c b/builtin/grep.c
> index a5056f395a..91ee0b2734 100644
> --- a/builtin/grep.c
> +++ b/builtin/grep.c
> @@ -410,7 +410,7 @@ static int grep_cache(struct grep_opt *opt,
>  		      const struct pathspec *pathspec, int cached);
>  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>  		     struct tree_desc *tree, struct strbuf *base, int tn_len,
> -		     int check_attr);
> +		     int is_root_tree);
>  
>  static int grep_submodule(struct grep_opt *opt,
>  			  const struct pathspec *pathspec,
> @@ -508,6 +508,10 @@ static int grep_cache(struct grep_opt *opt,
>  
>  	for (nr = 0; nr < repo->index->cache_nr; nr++) {
>  		const struct cache_entry *ce = repo->index->cache[nr];
> +
> +		if (ce_skip_worktree(ce) && !S_ISGITLINK(ce->ce_mode))
> +			continue;

Hmph.  Why exclude gitlink from this rule?  If a submodule sits at a
path that is excluded by the sparse pattern, should we still recurse
into it?

>  		strbuf_setlen(&name, name_base_len);
>  		strbuf_addstr(&name, ce->name);
>  
> @@ -520,8 +524,7 @@ static int grep_cache(struct grep_opt *opt,
>  			 * cache entry are identical, even if worktree file has
>  			 * been modified, so use cache version instead
>  			 */
> -			if (cached || (ce->ce_flags & CE_VALID) ||
> -			    ce_skip_worktree(ce)) {
> +			if (cached || (ce->ce_flags & CE_VALID)) {
>  				if (ce_stage(ce) || ce_intent_to_add(ce))
>  					continue;
>  				hit |= grep_oid(opt, &ce->oid, name.buf,
> @@ -552,9 +555,78 @@ static int grep_cache(struct grep_opt *opt,
>  	return hit;
>  }
>  
> -static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> -		     struct tree_desc *tree, struct strbuf *base, int tn_len,
> -		     int check_attr)
> +static struct pattern_list *get_sparsity_patterns(struct repository *repo)
> +{
> +	struct pattern_list *patterns;
> +	char *sparse_file;
> +	int sparse_config, cone_config;
> +
> +	if (repo_config_get_bool(repo, "core.sparsecheckout", &sparse_config) ||
> +	    !sparse_config) {
> +		return NULL;
> +	}
> +
> +	sparse_file = repo_git_path(repo, "info/sparse-checkout");
> +	patterns = xcalloc(1, sizeof(*patterns));
> +
> +	if (repo_config_get_bool(repo, "core.sparsecheckoutcone", &cone_config))
> +		cone_config = 0;
> +	patterns->use_cone_patterns = cone_config;
> +
> +	if (add_patterns_from_file_to_list(sparse_file, "", 0, patterns, NULL)) {
> +		if (file_exists(sparse_file)) {
> +			warning(_("failed to load sparse-checkout file: '%s'"),
> +				sparse_file);
> +		}
> +		free(sparse_file);
> +		free(patterns);
> +		return NULL;
> +	}
> +
> +	free(sparse_file);
> +	return patterns;
> +}
> +
> +static int in_sparse_checkout(struct strbuf *path, int prefix_len,
> +			      unsigned int entry_mode,
> +			      struct index_state *istate,
> +			      struct pattern_list *sparsity,
> +			      enum pattern_match_result parent_match,
> +			      enum pattern_match_result *match)
> +{
> +	int dtype = DT_UNKNOWN;
> +
> +	if (S_ISGITLINK(entry_mode))
> +		return 1;

This is consistent with the "we do not care where a gitlink
appears---submodules are always descended into, regardless of the
sparse definition" decision we saw earlier, I think.  I am not sure
if that is a good design in the first place, though.

> +	if (parent_match == MATCHED_RECURSIVE) {
> +		*match = parent_match;
> +		return 1;
> +	}
> +
> +	if (S_ISDIR(entry_mode) && !is_dir_sep(path->buf[path->len - 1]))
> +		strbuf_addch(path, '/');
> +
> +	*match = path_matches_pattern_list(path->buf, path->len,
> +					   path->buf + prefix_len, &dtype,
> +					   sparsity, istate);
> +	if (*match == UNDECIDED)
> +		*match = parent_match;
> +
> +	if (S_ISDIR(entry_mode))
> +		strbuf_trim_trailing_dir_sep(path);
> +
> +	if (*match == NOT_MATCHED && (S_ISREG(entry_mode) ||
> +	    (S_ISDIR(entry_mode) && sparsity->use_cone_patterns)))
> +		return 0;
> +
> +	return 1;
> +}



> +static int do_grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> +			struct tree_desc *tree, struct strbuf *base, int tn_len,
> +			int check_attr, struct pattern_list *sparsity,
> +			enum pattern_match_result default_sparsity_match)
>  {
>  	struct repository *repo = opt->repo;
>  	int hit = 0;
> @@ -570,6 +642,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>  
>  	while (tree_entry(tree, &entry)) {
>  		int te_len = tree_entry_len(&entry);
> +		enum pattern_match_result sparsity_match = 0;
>  
>  		if (match != all_entries_interesting) {
>  			strbuf_addstr(&name, base->buf + tn_len);
> @@ -586,6 +659,19 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>  
>  		strbuf_add(base, entry.path, te_len);
>  
> +		if (sparsity) {
> +			struct strbuf path = STRBUF_INIT;
> +			strbuf_addstr(&path, base->buf + tn_len);
> +
> +			if (!in_sparse_checkout(&path, old_baselen - tn_len,
> +						entry.mode, repo->index,
> +						sparsity, default_sparsity_match,
> +						&sparsity_match)) {
> +				strbuf_setlen(base, old_baselen);
> +				continue;
> +			}
> +		}

OK.

>  		if (S_ISREG(entry.mode)) {
>  			hit |= grep_oid(opt, &entry.oid, base->buf, tn_len,
>  					 check_attr ? base->buf + tn_len : NULL);
> @@ -602,8 +688,8 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>  
>  			strbuf_addch(base, '/');
>  			init_tree_desc(&sub, data, size);
> -			hit |= grep_tree(opt, pathspec, &sub, base, tn_len,
> -					 check_attr);
> +			hit |= do_grep_tree(opt, pathspec, &sub, base, tn_len,
> +					    check_attr, sparsity, sparsity_match);
>  			free(data);
>  		} else if (recurse_submodules && S_ISGITLINK(entry.mode)) {
>  			hit |= grep_submodule(opt, pathspec, &entry.oid,
> @@ -621,6 +707,31 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>  	return hit;
>  }
>  
> +/*
> + * Note: sparsity patterns and paths' attributes will only be considered if
> + * is_root_tree has true value. (Otherwise, we cannot properly perform pattern
> + * matching on paths.)
> + */
> +static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> +		     struct tree_desc *tree, struct strbuf *base, int tn_len,
> +		     int is_root_tree)
> +{
> +	struct pattern_list *patterns = NULL;
> +	int ret;
> +
> +	if (is_root_tree)
> +		patterns = get_sparsity_patterns(opt->repo);
> +
> +	ret = do_grep_tree(opt, pathspec, tree, base, tn_len, is_root_tree,
> +			   patterns, 0);
> +
> +	if (patterns) {
> +		clear_pattern_list(patterns);
> +		free(patterns);
> +	}

OK, it is not like this codepath is driven by "git log" to grep from
top-level tree objects of many commits, so it is OK to grab the
sparsity patterns once before do_grep_tree() and discard it when we
are done.

> +	return ret;
> +}
> +

>  static int grep_object(struct grep_opt *opt, const struct pathspec *pathspec,
>  		       struct object *obj, const char *name, const char *path)
>  {
> diff --git a/t/t7011-skip-worktree-reading.sh b/t/t7011-skip-worktree-reading.sh
> index 37525cae3a..26852586ac 100755
> --- a/t/t7011-skip-worktree-reading.sh
> +++ b/t/t7011-skip-worktree-reading.sh
> @@ -109,15 +109,6 @@ test_expect_success 'ls-files --modified' '
>  	test -z "$(git ls-files -m)"
>  '
>  
> -test_expect_success 'grep with skip-worktree file' '
> -	git update-index --no-skip-worktree 1 &&
> -	echo test > 1 &&
> -	git update-index 1 &&
> -	git update-index --skip-worktree 1 &&
> -	rm 1 &&
> -	test "$(git grep --no-ext-grep test)" = "1:test"
> -'
> -
>  echo ":000000 100644 $ZERO_OID $EMPTY_BLOB A	1" > expected
>  test_expect_success 'diff-index does not examine skip-worktree absent entries' '
>  	setup_absent &&
> diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> new file mode 100755
> index 0000000000..3bd67082eb
> --- /dev/null
> +++ b/t/t7817-grep-sparse-checkout.sh
> @@ -0,0 +1,140 @@
> +#!/bin/sh
> +
> +test_description='grep in sparse checkout
> +
> +This test creates a repo with the following structure:
> +
> +.
> +|-- a
> +|-- b
> +|-- dir
> +|   `-- c
> +`-- sub
> +    |-- A
> +    |   `-- a
> +    `-- B
> +	`-- b
> +
> +Where . has non-cone mode sparsity patterns and sub is a submodule with cone
> +mode sparsity patterns. The resulting sparse-checkout should leave the following
> +structure:
> +
> +.
> +|-- a
> +`-- sub
> +    `-- B
> +	`-- b
> +'
> +
> +. ./test-lib.sh
> +
> +test_expect_success 'setup' '
> +	echo "text" >a &&
> +	echo "text" >b &&
> +	mkdir dir &&
> +	echo "text" >dir/c &&
> +
> +	git init sub &&
> +	(
> +		cd sub &&
> +		mkdir A B &&
> +		echo "text" >A/a &&
> +		echo "text" >B/b &&
> +		git add A B &&
> +		git commit -m sub &&
> +		git sparse-checkout init --cone &&
> +		git sparse-checkout set B
> +	) &&
> +
> +	git submodule add ./sub &&
> +	git add a b dir &&
> +	git commit -m super &&
> +	git sparse-checkout init --no-cone &&
> +	git sparse-checkout set "/*" "!b" "!/*/" &&
> +
> +	git tag -am t-commit t-commit HEAD &&
> +	tree=$(git rev-parse HEAD^{tree}) &&
> +	git tag -am t-tree t-tree $tree &&
> +
> +	test_path_is_missing b &&
> +	test_path_is_missing dir &&
> +	test_path_is_missing sub/A &&
> +	test_path_is_file a &&
> +	test_path_is_file sub/B/b
> +'
> +
> +test_expect_success 'grep in working tree should honor sparse checkout' '
> +	cat >expect <<-EOF &&
> +	a:text
> +	EOF
> +	git grep "text" >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'grep --cached should honor sparse checkout' '
> +	cat >expect <<-EOF &&
> +	a:text
> +	EOF
> +	git grep --cached "text" >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'grep <commit-ish> should honor sparse checkout' '
> +	commit=$(git rev-parse HEAD) &&
> +	cat >expect_commit <<-EOF &&
> +	$commit:a:text
> +	EOF
> +	cat >expect_t-commit <<-EOF &&
> +	t-commit:a:text
> +	EOF
> +	git grep "text" $commit >actual_commit &&
> +	test_cmp expect_commit actual_commit &&
> +	git grep "text" t-commit >actual_t-commit &&
> +	test_cmp expect_t-commit actual_t-commit
> +'
> +
> +test_expect_success 'grep <tree-ish> should ignore sparsity patterns' '
> +	commit=$(git rev-parse HEAD) &&
> +	tree=$(git rev-parse HEAD^{tree}) &&
> +	cat >expect_tree <<-EOF &&
> +	$tree:a:text
> +	$tree:b:text
> +	$tree:dir/c:text
> +	EOF
> +	cat >expect_t-tree <<-EOF &&
> +	t-tree:a:text
> +	t-tree:b:text
> +	t-tree:dir/c:text
> +	EOF
> +	git grep "text" $tree >actual_tree &&
> +	test_cmp expect_tree actual_tree &&
> +	git grep "text" t-tree >actual_t-tree &&
> +	test_cmp expect_t-tree actual_t-tree
> +'
> +
> +test_expect_success 'grep --recurse-submodules --cached should honor sparse checkout in submodule' '
> +	cat >expect <<-EOF &&
> +	a:text
> +	sub/B/b:text
> +	EOF
> +	git grep --recurse-submodules --cached "text" >actual &&
> +	test_cmp expect actual
> +'
> +
> +test_expect_success 'grep --recurse-submodules <commit-ish> should honor sparse checkout in submodule' '
> +	commit=$(git rev-parse HEAD) &&
> +	cat >expect_commit <<-EOF &&
> +	$commit:a:text
> +	$commit:sub/B/b:text
> +	EOF
> +	cat >expect_t-commit <<-EOF &&
> +	t-commit:a:text
> +	t-commit:sub/B/b:text
> +	EOF
> +	git grep --recurse-submodules "text" $commit >actual_commit &&
> +	test_cmp expect_commit actual_commit &&
> +	git grep --recurse-submodules "text" t-commit >actual_t-commit &&
> +	test_cmp expect_t-commit actual_t-commit
> +'
> +
> +test_done

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v2 2/4] config: load the correct config.worktree file
  2020-05-11 19:10     ` Junio C Hamano
@ 2020-05-12 22:55       ` Matheus Tavares Bernardino
  2020-05-12 23:22         ` Junio C Hamano
  0 siblings, 1 reply; 57+ messages in thread
From: Matheus Tavares Bernardino @ 2020-05-12 22:55 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Derrick Stolee, Elijah Newren, Jonathan Tan

On Mon, May 11, 2020 at 4:10 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Matheus Tavares <matheus.bernardino@usp.br> writes:
>
> > One of the steps in do_git_config_sequence() is to load the
> > worktree-specific config file. Although the function receives a git_dir
> > string, it relies on git_pathdup(), which uses the_repository->git_dir,
> > to make the path to the file. Thus, when a submodule has a worktree
> > setting, a command executed in the superproject that recurses into the
> > submodule won't find the said setting.
>
> This has far wider ramifications than just "git grep" and it may be
> an important fix.  Anything that wants to read from a per-worktree
> configuration is not working as expected when run from a secondary
> worktree, right?

Hmm, I think the code should be able to retrieve the per-worktree
configuration, in this case, as the_repository->gitdir will be
pointing to the secondary worktree's gitdir. But when we want to read
a per-worktree configuration from a repo other than the_repository,
then the code doesn't find the setting (even if it is in the main
worktree of the subrepo).

> Can we add a test or two to protect this fix from future breakages?

Sure! There are already a couple tests, in the following patch, that
check this behavior *indirectly*. As we recurse into submodules, in
grep, we try to retrieve the core.sparseCheckout setting for each
submodule (which is stored in the subrepo's config.worktree file). The
said tests make sure we can get this setting, and they indeed fail
without this patch. But would it be better to also add a more direct
test, in this patch? I think we could do so by adding a new test
helper that prints submodules' configs, from the superproject, and
then testing the presence of per-worktree configs in the output.
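
For reference, where the setting lives can be checked directly with stock
commands. This is only a sketch with a made-up repository name ("sub") in
a temporary directory; it shows the file the value is written to, not the
proposed test helper, which does not exist yet.

```shell
# Sketch: where a per-worktree setting of a (would-be) submodule lives.
top=$(mktemp -d) &&
cd "$top" &&
git init -q sub &&

# Per-worktree config must be enabled before --worktree writes to its own file.
git -C sub config extensions.worktreeConfig true &&
git -C sub config --worktree core.sparseCheckout true &&

# The value is written to config.worktree, not to the shared config file.
stored=$(git config --file sub/.git/config.worktree core.sparseCheckout) &&
shared=$(git config --file sub/.git/config core.sparseCheckout || echo unset)
```

A reader that only looks at the superproject's git dir when building the
config.worktree path would never open sub/.git/config.worktree, which is
the problem this patch addresses.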

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v2 2/4] config: load the correct config.worktree file
  2020-05-12 22:55       ` Matheus Tavares Bernardino
@ 2020-05-12 23:22         ` Junio C Hamano
  0 siblings, 0 replies; 57+ messages in thread
From: Junio C Hamano @ 2020-05-12 23:22 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: git, Derrick Stolee, Elijah Newren, Jonathan Tan

Matheus Tavares Bernardino <matheus.bernardino@usp.br> writes:

>> Can we add a test or two to protect this fix from future breakages?
>
> Sure! There are already a couple tests, in the following patch, that
> check this behavior *indirectly*. As we recurse into submodules, in
> grep, we try to retrieve the core.sparseCheckout setting for each
> submodule (which is stored in the subrepo's config.worktree file). The
> said tests make sure we can get this setting, and they indeed fail
> without this patch. But would it be better to also add a more direct
> test, in this patch? I think we could do so by adding a new test
> helper that prints submodules' configs, from the superproject, and
> then testing the presence of per-worktree configs in the output.

Sounds like a plan.  Yes, checking by observing how a grep that
recurses into submodules behaves is doable but indirect, and since
any other subcommand that wants to do the recursion would have the
same issue that gets fixed by this patch, it's better to ensure that
the fix applies to any subcommand in a more direct way.

Thanks.


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v2 3/4] grep: honor sparse checkout patterns
  2020-05-11 19:35     ` Junio C Hamano
@ 2020-05-13  0:05       ` Matheus Tavares Bernardino
  2020-05-13  0:17         ` Junio C Hamano
  2020-05-21  7:36       ` Elijah Newren
  1 sibling, 1 reply; 57+ messages in thread
From: Matheus Tavares Bernardino @ 2020-05-13  0:05 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Derrick Stolee, Elijah Newren, Jonathan Tan

On Mon, May 11, 2020 at 4:35 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Matheus Tavares <matheus.bernardino@usp.br> writes:
>
> > One of the main uses for a sparse checkout is to allow users to focus on
> > the subset of files in a repository in which they are interested. But
> > git-grep currently ignores the sparsity patterns and reports all matches
> > found outside this subset, which kind of goes in the opposite direction.
> > Let's fix that, making it honor the sparsity boundaries for every
> > grepping case:
> >
> > - git grep in worktree
> > - git grep --cached
> > - git grep $REVISION
>
> It makes sense for these to be limited within the "sparse" area.
>
> > - git grep --untracked and git grep --no-index (which already respect
> >   sparse checkout boundaries)
>
> I can understand the former; those untracked files are what _could_
> be brought into attention by "git add", so limiting to the same
> "sparse" area may make sense.
>
> I am not sure about the latter, though, as "--no-index" is an
> explicit request to pretend that we are dealing with a random
> collection of files, not managed in a git repository.  But perhaps
> there is a similar justification, like how "--untracked" is
> justifiable.  I dunno.

Yeah, I think there was no need to mention those two cases here. My
intention was to say that, in these cases, we should stick to the
files that are present in the working tree (which should match the
sparsity patterns + untracked {and ignored, in --no-index}), as
opposed to how the worktree grep used to behave until now, falling
back to the cache on files excluded by the sparse checkout.

> > diff --git a/builtin/grep.c b/builtin/grep.c
> > index a5056f395a..91ee0b2734 100644
> > --- a/builtin/grep.c
> > +++ b/builtin/grep.c
> > @@ -410,7 +410,7 @@ static int grep_cache(struct grep_opt *opt,
> >                     const struct pathspec *pathspec, int cached);
> >  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> >                    struct tree_desc *tree, struct strbuf *base, int tn_len,
> > -                  int check_attr);
> > +                  int is_root_tree);
> >
> >  static int grep_submodule(struct grep_opt *opt,
> >                         const struct pathspec *pathspec,
> > @@ -508,6 +508,10 @@ static int grep_cache(struct grep_opt *opt,
> >
> >       for (nr = 0; nr < repo->index->cache_nr; nr++) {
> >               const struct cache_entry *ce = repo->index->cache[nr];
> > +
> > +             if (ce_skip_worktree(ce) && !S_ISGITLINK(ce->ce_mode))
> > +                     continue;
>
> Hmph.  Why exclude gitlink from this rule?  If a submodule sits at a
> path that is excluded by the sparse pattern, should we still recurse
> into it?

The idea behind not skipping gitlinks here was to be compliant with
what we have in the working tree. In 4fd683b ("sparse-checkout:
document interactions with submodules"), we decided that, if the
sparse-checkout patterns exclude a submodule, the submodule would
still appear in the working tree. The purpose was to keep these
features (submodules and sparse-checkout) independent. Along the same
lines, I think we should always recurse into initialized submodules in
grep, and then load their own sparsity patterns, to decide what should
be grepped within.

[...]
> > +static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> > +                  struct tree_desc *tree, struct strbuf *base, int tn_len,
> > +                  int is_root_tree)
> > +{
> > +     struct pattern_list *patterns = NULL;
> > +     int ret;
> > +
> > +     if (is_root_tree)
> > +             patterns = get_sparsity_patterns(opt->repo);
> > +
> > +     ret = do_grep_tree(opt, pathspec, tree, base, tn_len, is_root_tree,
> > +                        patterns, 0);
> > +
> > +     if (patterns) {
> > +             clear_pattern_list(patterns);
> > +             free(patterns);
> > +     }
>
> OK, it is not like this codepath is driven by "git log" to grep from
> top-level tree objects of many commits, so it is OK to grab the
> sparsity patterns once before do_grep_tree() and discard it when we
> are done.

Yeah. A possible performance problem here would be when users pass
many trees to git-grep (since we are reloading the pattern lists, from
both the_repository and submodules, for each tree). But, as Elijah
pointed out [1], the cases where this overhead might be somewhat
noticeable should be very rare.

[1]: https://lore.kernel.org/git/CABPp-BGUf-4exGW23xka1twf2D=nFOz1CkD_f-rDX_AGdVEeDA@mail.gmail.com/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v2 3/4] grep: honor sparse checkout patterns
  2020-05-13  0:05       ` Matheus Tavares Bernardino
@ 2020-05-13  0:17         ` Junio C Hamano
  2020-05-21  7:26           ` Elijah Newren
  0 siblings, 1 reply; 57+ messages in thread
From: Junio C Hamano @ 2020-05-13  0:17 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: git, Derrick Stolee, Elijah Newren, Jonathan Tan

Matheus Tavares Bernardino <matheus.bernardino@usp.br> writes:

> The idea behind not skipping gitlinks here was to be compliant with
> what we have in the working tree. In 4fd683b ("sparse-checkout:
> document interactions with submodules"), we decided that, if the
> sparse-checkout patterns exclude a submodule, the submodule would
> still appear in the working tree. The purpose was to keep these
> features (submodules and sparse-checkout) independent. Along the same
> lines, I think we should always recurse into initialized submodules in
> grep, and then load their own sparsity patterns, to decide what should
> be grepped within.

OK.  

I do not necessarily agree with the justification described in
4fd683b (e.g. "would easily cause problems." that is not
substantiated is merely an opinion), but I do agree with you that
the new code in "git grep" we are discussing here does behave in
line with that design.

Thanks.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v2 4/4] config: add setting to ignore sparsity patterns in some cmds
  2020-05-10  0:41   ` [RFC PATCH v2 4/4] config: add setting to ignore sparsity patterns in some cmds Matheus Tavares
  2020-05-10  4:23     ` Matheus Tavares Bernardino
@ 2020-05-21  7:09     ` Elijah Newren
  1 sibling, 0 replies; 57+ messages in thread
From: Elijah Newren @ 2020-05-21  7:09 UTC (permalink / raw)
  To: Matheus Tavares
  Cc: Git Mailing List, Junio C Hamano, Derrick Stolee, Jonathan Tan

Sorry for the late reply...and for responding in backwards order.

Great to see these newer patches!

On Sat, May 9, 2020 at 5:42 PM Matheus Tavares
<matheus.bernardino@usp.br> wrote:
>
> When sparse checkout is enabled, some users expect the output of certain
> commands (such as grep, diff, and log) to be also restricted within the
> sparsity patterns. This would allow them to effectively work only on the
> subset of files in which they are interested; and allow some commands to
> possibly perform better, by not considering uninteresting paths. For
> this reason, we taught grep to honor the sparsity patterns, in the
> previous commit. But, on the other hand, allowing grep and the other
> commands mentioned to optionally ignore the patterns also makes for some
> interesting use cases. E.g. using grep to search for a function
> definition that resides outside the sparse checkout.
>
> In any case, there is no current way for users to configure the behavior
> they want for these commands. Aiming to provide this flexibility, let's
> introduce the sparse.restrictCmds setting (and the analogous
> --[no-]restrict-to-sparse-paths global option). The default value is
> true. For now, grep is the only one affected by this setting, but the
> goal is to have support for more commands, in the future.
>
> Helped-by: Elijah Newren <newren@gmail.com>
> Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
> ---
>
> Some notes/questions about this one:
>
> - I guess having the additional sparse-checkout.o only for the
>   restrict_to_sparse_paths() function is not very justifiable.
>   Especially since builtin/grep.c is currently its only caller. But
>   since Stolee is already moving some code out of the sparse-checkout
>   builtin and into sparse-checkout.o [1], I thought it would be better
>   to place this function here from the start, as it will likely be
>   needed by other cmds when they start honoring sparse.restrictCmds.
>   (Side note: I think I will also be able to use the
>   populate_sparse_checkout_patterns() function added by Stolee in the
>   same patchset [2], to avoid code duplication in the
>   get_sparsity_patterns() function added in this patch).
>
> [1]: https://lore.kernel.org/git/0181a134bfb6986dc0e54ae624c478446a1324a9.1588857462.git.gitgitgadget@gmail.com/
> [2]: https://lore.kernel.org/git/444a6b5f894f28e96f713e5caccba18e1ea3b3eb.1588857462.git.gitgitgadget@gmail.com/

Seems reasonable to me.

> - With that said, the only reason we need restrict_to_sparse_paths() to
>   begin with, is so that commands which recurse into submodules may
>   respect the value set in each submodule for the sparse.restrictCmds
>   config. This is already being done for grep, in this patch. But,
>   should we do like this or should we use the value set at the
>   superproject, for all submodules as well, when recursing (ignoring the
>   value set on them)?

We have a few different types of files in git: tracked, untracked, and
ignored (though it's sometimes not clear if people are using untracked
to mean everything that isn't tracked, or if they are using it to mean
everything that is both not tracked and not ignored; it seems to
depend on the context).

The point of the sparsity patterns is to break the "tracked" category
into two subsets: those tracked files matching the sparsity patterns
and the tracked files that don't.  The reason for this subsetting is
it allows us to work with a smaller subset of a much larger
repository.
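
This split can be observed with "git ls-files -t", which tags
skip-worktree entries with "S" and other cached entries with "H". A small
sketch, assuming a throwaway repository in a temporary directory with the
made-up files "a" and "b" and a placeholder committer identity:

```shell
# Sketch: sparsity patterns split the tracked files into two subsets.
repo=$(mktemp -d) &&
cd "$repo" &&
git init -q . &&
echo text >a &&
echo text >b &&
git add a b &&
git -c user.name=Example -c user.email=example@example.com \
	commit -q -m initial &&
git sparse-checkout init --no-cone &&
git sparse-checkout set "/*" "!/b" &&

# Both files are still tracked; only "a" is inside the sparse checkout.
tags=$(git ls-files -t) &&
printf '%s\n' "$tags"
```

Both paths stay in the index; the tag only records which side of the
sparsity boundary each tracked file falls on.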

The thing about submodules is that the parent repository doesn't know
what the submodule tracks, it only has a commit id.  The submodule
itself knows which individual files it tracks in its own index.  If
the parent module doesn't even know which files the submodule tracks,
how is it supposed to be responsible for defining a subset of the
submodules' tracked files?  It seems like a layering violation to me.

So, I think you are right with grep to not override the submodules'
sparse.restrictCmds config.  For other commands that recurse into
submodules, if there are any relevant ones, I think they'd want to do
the same as you did for grep here.  But what other commands recurse
into submodules?  I can't think of any right now.  log doesn't, diff
doesn't, status doesn't.  The only ones I can think of right now are
clone and pull.  In the case of clone, the submodule doesn't exist yet
so can't have any setting yet.  In the case of pull, what would it do
with the setting anyway?  Do a partial fetch that ignores blobs
outside the sparse cone?  I think that'd be great...but wouldn't that
behavior of fetch be controlled by whether the user was in a partial
clone rather than any sparse-checkout setting?  (I have to admit I'm
not familiar with how partial clones work yet.)  [Later edit:] Also,
pull seems like more of a write operation, so see below.
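
To make the precedence I have in mind concrete, here is a rough sketch
(not the patch's code).  struct repo_cfg and resolve_restrict() are
made-up names for illustration; the real helper in the patch is
restrict_to_sparse_paths(), which reads the config through
repo_config_get_bool().  The assumption is that a command-line option
wins everywhere, and otherwise each repository (superproject or
submodule) answers for itself:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for "struct repository": it carries only the
 * one setting this sketch cares about.
 */
struct repo_cfg {
	const char *name;
	int restrict_cmds;	/* -1 = unset, 0 = false, 1 = true */
};

/*
 * cmdline_opt is -1 when neither --restrict-to-sparse-paths nor
 * --no-restrict-to-sparse-paths was given, else 0 or 1.
 *
 * The command-line option overrides everything.  Without it, the
 * repository being searched is consulted, so a submodule never
 * inherits the superproject's config value.
 */
int resolve_restrict(int cmdline_opt, const struct repo_cfg *repo)
{
	assert(repo != NULL);

	if (cmdline_opt >= 0)
		return cmdline_opt;
	if (repo->restrict_cmds >= 0)
		return repo->restrict_cmds;
	return 1;	/* sparse.restrictCmds defaults to true */
}
```

With no option on the command line, a superproject with the setting
unset and a submodule with it set to false resolve to 1 and 0
respectively, which matches the two submodule tests at the end of
t7817.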

> - It's possible to also make read-tree respect the new setting/option,
>   using --no-restrict-to-sparse-paths as a synonym for its
>   --no-sparse-checkout option (with lower precedence). However, as this
>   command can change the sparse checked out paths, I thought it kind
>   of falls under a different category. Also, `git read-tree -mu
>   --sparse-checkout` doesn't have the effect of *restricting* the
>   command's behavior to the sparsity patterns, but of applying them to
>   the working tree, right? So maybe it could be confusing to make this
>   command honor the new setting. Does that make sense, or should we do
>   it?

That's a good question; I hadn't considered read-tree before.  My gut
reaction is that these flags only affect read operations, not write
ones.  (And they don't affect all read operations; e.g. fsck is about
integrity checking, so fsck by default would check everything that was
downloaded and would only be limited in e.g. a partial clone -- but
that's a different kind of limit.)

For example, if we said these flags affected write operations, then as
soon as someone sets sparse.restrictCmds=false and then runs 'git
checkout $branch', then we would be forced to interpret
sparse.restrictCmds=false to mean we shouldn't pay attention to
sparsity patterns and thus should check out ALL files.  The user would
end up with a non-sparse tree really fast and would have to constantly
re-sparsify.  I think that's pretty clearly not the intention.  As
such, I think these flags are for controlling read operations like
grep/diff/log, and that neither read-tree nor checkout should be
affected by these flags.

> - Finally, if we decide to make read-tree be affected by
>   sparse.restrictCmds, there is also the case of whether the config
>   should be honored for submodules or just propagate the superproject's
>   value. I think the latter would be as simple as adding this line,
>   before calling parse_options() in builtin/read-tree.c:
>
>   opts.skip_sparse_checkout = !restrict_to_sparse_paths(the_repository);
>
>   As for the former, I'm not very familiar with the code in
>   unpack_trees(), so I'm not sure how complicated that would be.

As before, I don't think propagating the superproject's value makes
any sense.  However, I don't think making read-tree be affected by
sparse.restrictCmds makes sense either so it shouldn't matter.

>  Documentation/config.txt               |  2 +
>  Documentation/config/sparse.txt        | 22 ++++++++
>  Documentation/git-grep.txt             |  3 +
>  Documentation/git.txt                  |  4 ++
>  Makefile                               |  1 +
>  builtin/grep.c                         | 14 ++++-
>  contrib/completion/git-completion.bash |  2 +
>  git.c                                  |  6 ++
>  sparse-checkout.c                      | 16 ++++++
>  sparse-checkout.h                      | 11 ++++
>  t/t7817-grep-sparse-checkout.sh        | 78 +++++++++++++++++++++++++-
>  t/t9902-completion.sh                  |  4 +-
>  12 files changed, 159 insertions(+), 4 deletions(-)
>  create mode 100644 Documentation/config/sparse.txt
>  create mode 100644 sparse-checkout.c
>  create mode 100644 sparse-checkout.h
>
> diff --git a/Documentation/config.txt b/Documentation/config.txt
> index ef0768b91a..fd74b80302 100644
> --- a/Documentation/config.txt
> +++ b/Documentation/config.txt
> @@ -436,6 +436,8 @@ include::config/sequencer.txt[]
>
>  include::config/showbranch.txt[]
>
> +include::config/sparse.txt[]
> +
>  include::config/splitindex.txt[]
>
>  include::config/ssh.txt[]
> diff --git a/Documentation/config/sparse.txt b/Documentation/config/sparse.txt
> new file mode 100644
> index 0000000000..83a4e0018f
> --- /dev/null
> +++ b/Documentation/config/sparse.txt
> @@ -0,0 +1,22 @@
> +sparse.restrictCmds::
> +       Only meaningful in conjunction with core.sparseCheckout. This option
> +       extends sparse checkouts (which limit which paths are written to the
> +       working tree), so that output and operations are also limited to the
> +       sparsity paths where possible and implemented. The purpose of this
> +       option is to (1) focus output for the user on the portion of the
> +       repository that is of interest to them, and (2) enable potentially
> +       dramatic performance improvements, especially in conjunction with
> +       partial clones.
> ++
> +When this option is true (default), some git commands may limit their behavior
> +to the paths specified by the sparsity patterns, or to the intersection of
> +those paths and any pathspecs (like `*.c`) that the user might also specify on the command
> +line. When false, the affected commands will work on full trees, ignoring the
> +sparsity patterns. For now, only git-grep honors this setting. In this command,
> +the restriction becomes relevant in one of these three cases: with --cached;
> +when a commit-ish is given; when searching a working tree that contains paths
> +previously excluded by the sparsity patterns.
> ++
> +Note: commands which export, integrity check, or create history will always
> +operate on full trees (e.g. fast-export, format-patch, fsck, commit, etc.),
> +unaffected by any sparsity patterns.
> diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
> index 9bdf807584..abbf100109 100644
> --- a/Documentation/git-grep.txt
> +++ b/Documentation/git-grep.txt
> @@ -41,6 +41,9 @@ characters.  An empty string as search expression matches all lines.
>  CONFIGURATION
>  -------------
>
> +git-grep honors the sparse.restrictCmds setting. See its definition in
> +linkgit:git-config[1].
> +
>  :git-grep: 1
>  include::config/grep.txt[]
>
> diff --git a/Documentation/git.txt b/Documentation/git.txt
> index 9d6769e95a..5e107c6246 100644
> --- a/Documentation/git.txt
> +++ b/Documentation/git.txt
> @@ -180,6 +180,10 @@ If you just want to run git as if it was started in `<path>` then use
>         Do not perform optional operations that require locks. This is
>         equivalent to setting the `GIT_OPTIONAL_LOCKS` to `0`.
>
> +--[no-]restrict-to-sparse-paths::
> +       Overrides the sparse.restrictCmds configuration (see
> +       linkgit:git-config[1]) for this execution.
> +
>  --list-cmds=group[,group...]::
>         List commands by group. This is an internal/experimental
>         option and may change or be removed in the future. Supported
> diff --git a/Makefile b/Makefile
> index 3d3a39fc19..67580c691b 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -986,6 +986,7 @@ LIB_OBJS += sha1-name.o
>  LIB_OBJS += shallow.o
>  LIB_OBJS += sideband.o
>  LIB_OBJS += sigchain.o
> +LIB_OBJS += sparse-checkout.o
>  LIB_OBJS += split-index.o
>  LIB_OBJS += stable-qsort.o
>  LIB_OBJS += strbuf.o
> diff --git a/builtin/grep.c b/builtin/grep.c
> index 91ee0b2734..3f92e7fd6c 100644
> --- a/builtin/grep.c
> +++ b/builtin/grep.c
> @@ -25,6 +25,7 @@
>  #include "submodule-config.h"
>  #include "object-store.h"
>  #include "packfile.h"
> +#include "sparse-checkout.h"
>
>  static char const * const grep_usage[] = {
>         N_("git grep [<options>] [-e] <pattern> [<rev>...] [[--] <path>...]"),
> @@ -498,6 +499,7 @@ static int grep_cache(struct grep_opt *opt,
>         int nr;
>         struct strbuf name = STRBUF_INIT;
>         int name_base_len = 0;
> +       int sparse_paths_only = restrict_to_sparse_paths(repo);
>         if (repo->submodule_prefix) {
>                 name_base_len = strlen(repo->submodule_prefix);
>                 strbuf_addstr(&name, repo->submodule_prefix);
> @@ -509,7 +511,8 @@ static int grep_cache(struct grep_opt *opt,
>         for (nr = 0; nr < repo->index->cache_nr; nr++) {
>                 const struct cache_entry *ce = repo->index->cache[nr];
>
> -               if (ce_skip_worktree(ce) && !S_ISGITLINK(ce->ce_mode))
> +               if (sparse_paths_only && ce_skip_worktree(ce) &&
> +                   !S_ISGITLINK(ce->ce_mode))
>                         continue;
>
>                 strbuf_setlen(&name, name_base_len);
> @@ -717,9 +720,10 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>                      int is_root_tree)
>  {
>         struct pattern_list *patterns = NULL;
> +       int sparse_paths_only = restrict_to_sparse_paths(opt->repo);
>         int ret;
>
> -       if (is_root_tree)
> +       if (is_root_tree && sparse_paths_only)
>                 patterns = get_sparsity_patterns(opt->repo);
>
>         ret = do_grep_tree(opt, pathspec, tree, base, tn_len, is_root_tree,
> @@ -1259,6 +1263,12 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
>
>         if (!use_index || untracked) {
>                 int use_exclude = (opt_exclude < 0) ? use_index : !!opt_exclude;
> +
> +               if (opt_restrict_to_sparse_paths >= 0) {
> +                       warning(_("--[no-]restrict-to-sparse-paths is ignored"
> +                                 " with --no-index or --untracked"));

I think this should instead be
    die(_("--[no-]restrict-to-sparse-paths is incompatible with"
          " --no-index and --untracked"));

Restricting to sparse paths (or not) is about working with subsets of
tracked files (or all tracked files).  --no-index and --untracked are
about working with files that aren't tracked.  They just don't make
sense to combine.
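
Roughly, the check I am picturing looks like the sketch below.  This is
a standalone illustration, not code for builtin/grep.c:
check_sparse_opt() and the enum are hypothetical names, and the
function returns an error code instead of calling die() so that it can
be compiled on its own.  The tri-state convention (-1 for "not given")
is the one the patch already uses for opt_restrict_to_sparse_paths.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical result codes for this sketch. */
enum sparse_opt_check {
	SPARSE_OPT_OK = 0,
	SPARSE_OPT_INCOMPATIBLE = -1
};

/*
 * opt_restrict: -1 = option not given, 0 = --no-restrict-to-sparse-paths,
 *               1 = --restrict-to-sparse-paths.
 * use_index:    0 when --no-index was given.
 * untracked:    non-zero when --untracked was given.
 *
 * Sparsity only subsets tracked files, so an explicit request either
 * way is rejected when the search targets untracked content.
 */
int check_sparse_opt(int opt_restrict, int use_index, int untracked)
{
	assert(opt_restrict >= -1 && opt_restrict <= 1);

	if (opt_restrict >= 0 && (!use_index || untracked)) {
		fprintf(stderr, "fatal: --[no-]restrict-to-sparse-paths is"
				" incompatible with --no-index and --untracked\n");
		return SPARSE_OPT_INCOMPATIBLE;
	}
	return SPARSE_OPT_OK;
}
```

Note that leaving the option unset stays valid with --no-index or
--untracked; only an explicit --restrict-to-sparse-paths or its
negation is refused.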

> +               }
> +
>                 hit = grep_directory(&opt, &pathspec, use_exclude, use_index);
>         } else if (0 <= opt_exclude) {
>                 die(_("--[no-]exclude-standard cannot be used for tracked contents"));
> diff --git a/contrib/completion/git-completion.bash b/contrib/completion/git-completion.bash
> index b1d6e5ebed..cba0f9166c 100644
> --- a/contrib/completion/git-completion.bash
> +++ b/contrib/completion/git-completion.bash
> @@ -3207,6 +3207,8 @@ __git_main ()
>                         --namespace=
>                         --no-replace-objects
>                         --help
> +                       --restrict-to-sparse-paths
> +                       --no-restrict-to-sparse-paths
>                         "
>                         ;;
>                 *)
> diff --git a/git.c b/git.c
> index 2e4efb4ff0..f967c75d9c 100644
> --- a/git.c
> +++ b/git.c
> @@ -37,6 +37,7 @@ const char git_more_info_string[] =
>            "See 'git help git' for an overview of the system.");
>
>  static int use_pager = -1;
> +int opt_restrict_to_sparse_paths = -1;
>
>  static void list_builtins(struct string_list *list, unsigned int exclude_option);
>
> @@ -310,6 +311,10 @@ static int handle_options(const char ***argv, int *argc, int *envchanged)
>                         } else {
>                                 exit(list_cmds(cmd));
>                         }
> +               } else if (!strcmp(cmd, "--restrict-to-sparse-paths")) {
> +                       opt_restrict_to_sparse_paths = 1;
> +               } else if (!strcmp(cmd, "--no-restrict-to-sparse-paths")) {
> +                       opt_restrict_to_sparse_paths = 0;
>                 } else {
>                         fprintf(stderr, _("unknown option: %s\n"), cmd);
>                         usage(git_usage_string);
> @@ -318,6 +323,7 @@ static int handle_options(const char ***argv, int *argc, int *envchanged)
>                 (*argv)++;
>                 (*argc)--;
>         }
> +
>         return (*argv) - orig_argv;
>  }
>
> diff --git a/sparse-checkout.c b/sparse-checkout.c
> new file mode 100644
> index 0000000000..9a9e50fd29
> --- /dev/null
> +++ b/sparse-checkout.c
> @@ -0,0 +1,16 @@
> +#include "cache.h"
> +#include "config.h"
> +#include "sparse-checkout.h"
> +
> +int restrict_to_sparse_paths(struct repository *repo)
> +{
> +       int ret;
> +
> +       if (opt_restrict_to_sparse_paths >= 0)
> +               return opt_restrict_to_sparse_paths;
> +
> +       if (repo_config_get_bool(repo, "sparse.restrictcmds", &ret))
> +               ret = 1;
> +
> +       return ret;
> +}
> diff --git a/sparse-checkout.h b/sparse-checkout.h
> new file mode 100644
> index 0000000000..1de3b588d8
> --- /dev/null
> +++ b/sparse-checkout.h
> @@ -0,0 +1,11 @@
> +#ifndef SPARSE_CHECKOUT_H
> +#define SPARSE_CHECKOUT_H
> +
> +struct repository;
> +
> +extern int opt_restrict_to_sparse_paths; /* from git.c */
> +
> +/* Whether or not cmds should restrict behavior on sparse paths, in this repo */
> +int restrict_to_sparse_paths(struct repository *repo);
> +
> +#endif /* SPARSE_CHECKOUT_H */
> diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> index 3bd67082eb..8509694bf1 100755
> --- a/t/t7817-grep-sparse-checkout.sh
> +++ b/t/t7817-grep-sparse-checkout.sh
> @@ -63,12 +63,28 @@ test_expect_success 'setup' '
>         test_path_is_file sub/B/b
>  '
>
> +# The two tests below check a special case: the sparsity patterns exclude '/b'
> +# and sparse checkout is enabled, but the path exists on the working tree (e.g.
> +# manually created after `git sparse-checkout init`). In this case, grep should
> +# honor --restrict-to-sparse-paths.
>  test_expect_success 'grep in working tree should honor sparse checkout' '
>         cat >expect <<-EOF &&
>         a:text
>         EOF
> +       echo newtext >b &&
>         git grep "text" >actual &&
> -       test_cmp expect actual
> +       test_cmp expect actual &&
> +       rm b
> +'
> +test_expect_success 'grep w/ --no-restrict-to-sparse-paths for sparsely excluded but present paths' '
> +       cat >expect <<-EOF &&
> +       a:text
> +       b:newtext
> +       EOF
> +       echo newtext >b &&
> +       git --no-restrict-to-sparse-paths grep "text" >actual &&
> +       test_cmp expect actual &&
> +       rm b
>  '
>
>  test_expect_success 'grep --cached should honor sparse checkout' '
> @@ -137,4 +153,64 @@ test_expect_success 'grep --recurse-submodules <commit-ish> should honor sparse
>         test_cmp expect_t-commit actual_t-commit
>  '
>
> +for cmd in 'git --no-restrict-to-sparse-paths grep' \
> +          'git -c sparse.restrictCmds=false grep' \
> +          'git -c sparse.restrictCmds=true --no-restrict-to-sparse-paths grep'
> +do
> +
> +       test_expect_success "$cmd --cached should ignore sparsity patterns" '
> +               cat >expect <<-EOF &&
> +               a:text
> +               b:text
> +               dir/c:text
> +               EOF
> +               $cmd --cached "text" >actual &&
> +               test_cmp expect actual
> +       '
> +
> +       test_expect_success "$cmd <commit-ish> should ignore sparsity patterns" '
> +               commit=$(git rev-parse HEAD) &&
> +               cat >expect_commit <<-EOF &&
> +               $commit:a:text
> +               $commit:b:text
> +               $commit:dir/c:text
> +               EOF
> +               cat >expect_t-commit <<-EOF &&
> +               t-commit:a:text
> +               t-commit:b:text
> +               t-commit:dir/c:text
> +               EOF
> +               $cmd "text" $commit >actual_commit &&
> +               test_cmp expect_commit actual_commit &&
> +               $cmd "text" t-commit >actual_t-commit &&
> +               test_cmp expect_t-commit actual_t-commit
> +       '
> +done
> +
> +test_expect_success 'should respect the sparse.restrictCmds values from submodules' '
> +       cat >expect <<-EOF &&
> +       a:text
> +       sub/A/a:text
> +       sub/B/b:text
> +       EOF
> +       git -C sub config sparse.restrictCmds false &&
> +       git grep --cached --recurse-submodules "text" >actual &&
> +       test_cmp expect actual &&
> +       git -C sub config --unset sparse.restrictCmds
> +'
> +
> +test_expect_success 'should propagate --[no]-restrict-to-sparse-paths to submodules' '
> +       cat >expect <<-EOF &&
> +       a:text
> +       b:text
> +       dir/c:text
> +       sub/A/a:text
> +       sub/B/b:text
> +       EOF
> +       git -C sub config sparse.restrictCmds true &&
> +       git --no-restrict-to-sparse-paths grep --cached --recurse-submodules "text" >actual &&
> +       test_cmp expect actual &&
> +       git -C sub config --unset sparse.restrictCmds
> +'
> +
>  test_done
> diff --git a/t/t9902-completion.sh b/t/t9902-completion.sh
> index 3c44af6940..a4a7767e06 100755
> --- a/t/t9902-completion.sh
> +++ b/t/t9902-completion.sh
> @@ -1473,6 +1473,8 @@ test_expect_success 'double dash "git" itself' '
>         --namespace=
>         --no-replace-objects Z
>         --help Z
> +       --restrict-to-sparse-paths Z
> +       --no-restrict-to-sparse-paths Z
>         EOF
>  '
>
> @@ -1515,7 +1517,7 @@ test_expect_success 'general options' '
>         test_completion "git --nam" "--namespace=" &&
>         test_completion "git --bar" "--bare " &&
>         test_completion "git --inf" "--info-path " &&
> -       test_completion "git --no-r" "--no-replace-objects "
> +       test_completion "git --no-rep" "--no-replace-objects "
>  '
>
>  test_expect_success 'general options plus command' '
> --
> 2.26.2

* Re: [RFC PATCH v2 3/4] grep: honor sparse checkout patterns
  2020-05-13  0:17         ` Junio C Hamano
@ 2020-05-21  7:26           ` Elijah Newren
  2020-05-21 17:35             ` Matheus Tavares Bernardino
  0 siblings, 1 reply; 57+ messages in thread
From: Elijah Newren @ 2020-05-21  7:26 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Matheus Tavares Bernardino, git, Derrick Stolee, Jonathan Tan

On Tue, May 12, 2020 at 5:17 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Matheus Tavares Bernardino <matheus.bernardino@usp.br> writes:
>
> > The idea behind not skipping gitlinks here was to be compliant with
> > what we have in the working tree. In 4fd683b ("sparse-checkout:
> > document interactions with submodules"), we decided that, if the
> > sparse-checkout patterns exclude a submodule, the submodule would
> > still appear in the working tree. The purpose was to keep these
> > features (submodules and sparse-checkout) independent. Along the same
> > lines, I think we should always recurse into initialized submodules in

Sorry if I missed it in the code, but do you check whether the
submodule is initialized before descending into it, or do you descend
into it based on it just being a submodule?

> > grep, and then load their own sparsity patterns, to decide what should
> > be grepped within.
>
> OK.
>
> I do not necessarily agree with the justification described in
> 4fd683b (e.g. "would easily cause problems." that is not
> substantiated is merely an opinion), but I do agree with you that
> the new code in "git grep" we are discussing here does behave in
> line with that design.
>
> Thanks.

I'm also a little worried by 4fd683b; are we headed towards a circular
reasoning of some sort?  In particular, sparse-checkout was written
assuming submodules might already be checked out.  I can see how
un-checking-out an existing submodule could raise fears of losing
untracked or ignored files within it, or stuff stored on other
branches, etc.  But that's not the only relevant case.  What if
someone runs:
   git clone --recurse-submodules --sparse=moduleA git.hosting.site:my/repo.git
In such a case, we don't have already checked out submodules.
Obviously, we should clone submodules that are within our sparsity
paths.  But should we automatically clone the submodules outside our
sparsity paths?  The logic presented in 4fd683b makes this
completely ambiguous.  ("It will appear if it's initialized."  Okay,
but do we initialize it?)

You may say that clone doesn't have a --sparse= flag right now.  So
let me change the example slightly.  What if someone runs
   git checkout --recurse-submodules $otherBranch
and $otherBranch adds a new submodule somewhere deep under a directory
excluded by the sparsity patterns (i.e. deep within a directory we
aren't interested in and don't have checked out).  Should the
submodule be checked out, i.e. should it be initialized?  Commit
4fd683b only says it will appear if it's initialized, but my whole
question is should we initialize it?

* Re: [RFC PATCH v2 3/4] grep: honor sparse checkout patterns
  2020-05-11 19:35     ` Junio C Hamano
  2020-05-13  0:05       ` Matheus Tavares Bernardino
@ 2020-05-21  7:36       ` Elijah Newren
  1 sibling, 0 replies; 57+ messages in thread
From: Elijah Newren @ 2020-05-21  7:36 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Matheus Tavares, Git Mailing List, Derrick Stolee, Jonathan Tan

On Mon, May 11, 2020 at 12:35 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Matheus Tavares <matheus.bernardino@usp.br> writes:
>
> > One of the main uses for a sparse checkout is to allow users to focus on
> > the subset of files in a repository in which they are interested. But
> > git-grep currently ignores the sparsity patterns and report all matches
> > found outside this subset, which kind of goes in the opposite direction.
> > Let's fix that, making it honor the sparsity boundaries for every
> > grepping case:
> >
> > - git grep in worktree
> > - git grep --cached
> > - git grep $REVISION
>
> It makes sense for these to be limited within the "sparse" area.
>
> > - git grep --untracked and git grep --no-index (which already respect
> >   sparse checkout boundaries)
>
> I can understand the former; those untracked files are what _could_
> be brought into attention by "git add", so limiting to the same
> "sparse" area may make sense.
>
> I am not sure about the latter, though, as "--no-index" is an
> explicit request to pretend that we are dealing with a random
> collection of files, not managed in a git repository.  But perhaps
> there is a similar justification like how "--untracked" is
> unjustifiable.  I dunno.

I don't think it makes sense for sparsity patterns to affect either.
Sparsity patterns are a way of splitting "tracked" files into two
subsets (those matching the sparsity paths and those that don't).
Therefore, flags that are about searching things that aren't tracked
clearly don't have anything to do with sparsity patterns.

However, I think this was just a wording issue; in the subsequent
commit Matheus made it clear that he's not modifying the behavior of
grep --untracked or grep --no-index based on the presence or absence
of sparsity patterns.

> > diff --git a/builtin/grep.c b/builtin/grep.c
> > index a5056f395a..91ee0b2734 100644
> > --- a/builtin/grep.c
> > +++ b/builtin/grep.c
> > @@ -410,7 +410,7 @@ static int grep_cache(struct grep_opt *opt,
> >                     const struct pathspec *pathspec, int cached);
> >  static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> >                    struct tree_desc *tree, struct strbuf *base, int tn_len,
> > -                  int check_attr);
> > +                  int is_root_tree);
> >
> >  static int grep_submodule(struct grep_opt *opt,
> >                         const struct pathspec *pathspec,
> > @@ -508,6 +508,10 @@ static int grep_cache(struct grep_opt *opt,
> >
> >       for (nr = 0; nr < repo->index->cache_nr; nr++) {
> >               const struct cache_entry *ce = repo->index->cache[nr];
> > +
> > +             if (ce_skip_worktree(ce) && !S_ISGITLINK(ce->ce_mode))
> > +                     continue;
>
> Hmph.  Why exclude gitlink from this rule?  If a submodule sits at a
> path that is excluded by the sparse pattern, should we still recurse
> into it?

That bothers me too.

> >               strbuf_setlen(&name, name_base_len);
> >               strbuf_addstr(&name, ce->name);
> >
> > @@ -520,8 +524,7 @@ static int grep_cache(struct grep_opt *opt,
> >                        * cache entry are identical, even if worktree file has
> >                        * been modified, so use cache version instead
> >                        */
> > -                     if (cached || (ce->ce_flags & CE_VALID) ||
> > -                         ce_skip_worktree(ce)) {
> > +                     if (cached || (ce->ce_flags & CE_VALID)) {
> >                               if (ce_stage(ce) || ce_intent_to_add(ce))
> >                                       continue;
> >                               hit |= grep_oid(opt, &ce->oid, name.buf,
> > @@ -552,9 +555,78 @@ static int grep_cache(struct grep_opt *opt,
> >       return hit;
> >  }
> >
> > -static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> > -                  struct tree_desc *tree, struct strbuf *base, int tn_len,
> > -                  int check_attr)
> > +static struct pattern_list *get_sparsity_patterns(struct repository *repo)
> > +{
> > +     struct pattern_list *patterns;
> > +     char *sparse_file;
> > +     int sparse_config, cone_config;
> > +
> > +     if (repo_config_get_bool(repo, "core.sparsecheckout", &sparse_config) ||
> > +         !sparse_config) {
> > +             return NULL;
> > +     }
> > +
> > +     sparse_file = repo_git_path(repo, "info/sparse-checkout");
> > +     patterns = xcalloc(1, sizeof(*patterns));
> > +
> > +     if (repo_config_get_bool(repo, "core.sparsecheckoutcone", &cone_config))
> > +             cone_config = 0;
> > +     patterns->use_cone_patterns = cone_config;
> > +
> > +     if (add_patterns_from_file_to_list(sparse_file, "", 0, patterns, NULL)) {
> > +             if (file_exists(sparse_file)) {
> > +                     warning(_("failed to load sparse-checkout file: '%s'"),
> > +                             sparse_file);
> > +             }
> > +             free(sparse_file);
> > +             free(patterns);
> > +             return NULL;
> > +     }
> > +
> > +     free(sparse_file);
> > +     return patterns;
> > +}
> > +
> > +static int in_sparse_checkout(struct strbuf *path, int prefix_len,
> > +                           unsigned int entry_mode,
> > +                           struct index_state *istate,
> > +                           struct pattern_list *sparsity,
> > +                           enum pattern_match_result parent_match,
> > +                           enum pattern_match_result *match)
> > +{
> > +     int dtype = DT_UNKNOWN;
> > +
> > +     if (S_ISGITLINK(entry_mode))
> > +             return 1;
>
> This is consistent with the "we do not care where a gitlink
> appears---submodules are always descended into, regardless of the
> sparse definition" decision we saw earlier, I think.  I am not sure
> if that is a good design in the first place, though.
>
> > +     if (parent_match == MATCHED_RECURSIVE) {
> > +             *match = parent_match;
> > +             return 1;
> > +     }
> > +
> > +     if (S_ISDIR(entry_mode) && !is_dir_sep(path->buf[path->len - 1]))
> > +             strbuf_addch(path, '/');
> > +
> > +     *match = path_matches_pattern_list(path->buf, path->len,
> > +                                        path->buf + prefix_len, &dtype,
> > +                                        sparsity, istate);
> > +     if (*match == UNDECIDED)
> > +             *match = parent_match;
> > +
> > +     if (S_ISDIR(entry_mode))
> > +             strbuf_trim_trailing_dir_sep(path);
> > +
> > +     if (*match == NOT_MATCHED && (S_ISREG(entry_mode) ||
> > +         (S_ISDIR(entry_mode) && sparsity->use_cone_patterns)))
> > +             return 0;
> > +
> > +     return 1;
> > +}
>
>
>
> > +static int do_grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> > +                     struct tree_desc *tree, struct strbuf *base, int tn_len,
> > +                     int check_attr, struct pattern_list *sparsity,
> > +                     enum pattern_match_result default_sparsity_match)
> >  {
> >       struct repository *repo = opt->repo;
> >       int hit = 0;
> > @@ -570,6 +642,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> >
> >       while (tree_entry(tree, &entry)) {
> >               int te_len = tree_entry_len(&entry);
> > +             enum pattern_match_result sparsity_match = 0;
> >
> >               if (match != all_entries_interesting) {
> >                       strbuf_addstr(&name, base->buf + tn_len);
> > @@ -586,6 +659,19 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> >
> >               strbuf_add(base, entry.path, te_len);
> >
> > +             if (sparsity) {
> > +                     struct strbuf path = STRBUF_INIT;
> > +                     strbuf_addstr(&path, base->buf + tn_len);
> > +
> > +                     if (!in_sparse_checkout(&path, old_baselen - tn_len,
> > +                                             entry.mode, repo->index,
> > +                                             sparsity, default_sparsity_match,
> > +                                             &sparsity_match)) {
> > +                             strbuf_setlen(base, old_baselen);
> > +                             continue;
> > +                     }
> > +             }
>
> OK.
>
> >               if (S_ISREG(entry.mode)) {
> >                       hit |= grep_oid(opt, &entry.oid, base->buf, tn_len,
> >                                        check_attr ? base->buf + tn_len : NULL);
> > @@ -602,8 +688,8 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> >
> >                       strbuf_addch(base, '/');
> >                       init_tree_desc(&sub, data, size);
> > -                     hit |= grep_tree(opt, pathspec, &sub, base, tn_len,
> > -                                      check_attr);
> > +                     hit |= do_grep_tree(opt, pathspec, &sub, base, tn_len,
> > +                                         check_attr, sparsity, sparsity_match);
> >                       free(data);
> >               } else if (recurse_submodules && S_ISGITLINK(entry.mode)) {
> >                       hit |= grep_submodule(opt, pathspec, &entry.oid,
> > @@ -621,6 +707,31 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> >       return hit;
> >  }
> >
> > +/*
> > + * Note: sparsity patterns and paths' attributes will only be considered if
> > + * is_root_tree has true value. (Otherwise, we cannot properly perform pattern
> > + * matching on paths.)
> > + */
> > +static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> > +                  struct tree_desc *tree, struct strbuf *base, int tn_len,
> > +                  int is_root_tree)
> > +{
> > +     struct pattern_list *patterns = NULL;
> > +     int ret;
> > +
> > +     if (is_root_tree)
> > +             patterns = get_sparsity_patterns(opt->repo);
> > +
> > +     ret = do_grep_tree(opt, pathspec, tree, base, tn_len, is_root_tree,
> > +                        patterns, 0);
> > +
> > +     if (patterns) {
> > +             clear_pattern_list(patterns);
> > +             free(patterns);
> > +     }
>
> OK, it is not like this codepath is driven by "git log" to grep from
> top-level tree objects of many commits, so it is OK to grab the
> sparsity patterns once before do_grep_tree() and discard it when we
> are done.
>
> > +     return ret;
> > +}
> > +
>
> >  static int grep_object(struct grep_opt *opt, const struct pathspec *pathspec,
> >                      struct object *obj, const char *name, const char *path)
> >  {
> > diff --git a/t/t7011-skip-worktree-reading.sh b/t/t7011-skip-worktree-reading.sh
> > index 37525cae3a..26852586ac 100755
> > --- a/t/t7011-skip-worktree-reading.sh
> > +++ b/t/t7011-skip-worktree-reading.sh
> > @@ -109,15 +109,6 @@ test_expect_success 'ls-files --modified' '
> >       test -z "$(git ls-files -m)"
> >  '
> >
> > -test_expect_success 'grep with skip-worktree file' '
> > -     git update-index --no-skip-worktree 1 &&
> > -     echo test > 1 &&
> > -     git update-index 1 &&
> > -     git update-index --skip-worktree 1 &&
> > -     rm 1 &&
> > -     test "$(git grep --no-ext-grep test)" = "1:test"
> > -'
> > -
> >  echo ":000000 100644 $ZERO_OID $EMPTY_BLOB A 1" > expected
> >  test_expect_success 'diff-index does not examine skip-worktree absent entries' '
> >       setup_absent &&
> > diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> > new file mode 100755
> > index 0000000000..3bd67082eb
> > --- /dev/null
> > +++ b/t/t7817-grep-sparse-checkout.sh
> > @@ -0,0 +1,140 @@
> > +#!/bin/sh
> > +
> > +test_description='grep in sparse checkout
> > +
> > +This test creates a repo with the following structure:
> > +
> > +.
> > +|-- a
> > +|-- b
> > +|-- dir
> > +|   `-- c
> > +`-- sub
> > +    |-- A
> > +    |   `-- a
> > +    `-- B
> > +     `-- b
> > +
> > +Where . has non-cone mode sparsity patterns and sub is a submodule with cone
> > +mode sparsity patterns. The resulting sparse-checkout should leave the following
> > +structure:
> > +
> > +.
> > +|-- a
> > +`-- sub
> > +    `-- B
> > +     `-- b
> > +'
> > +
> > +. ./test-lib.sh
> > +
> > +test_expect_success 'setup' '
> > +     echo "text" >a &&
> > +     echo "text" >b &&
> > +     mkdir dir &&
> > +     echo "text" >dir/c &&
> > +
> > +     git init sub &&
> > +     (
> > +             cd sub &&
> > +             mkdir A B &&
> > +             echo "text" >A/a &&
> > +             echo "text" >B/b &&
> > +             git add A B &&
> > +             git commit -m sub &&
> > +             git sparse-checkout init --cone &&
> > +             git sparse-checkout set B
> > +     ) &&
> > +
> > +     git submodule add ./sub &&
> > +     git add a b dir &&
> > +     git commit -m super &&
> > +     git sparse-checkout init --no-cone &&
> > +     git sparse-checkout set "/*" "!b" "!/*/" &&
> > +
> > +     git tag -am t-commit t-commit HEAD &&
> > +     tree=$(git rev-parse HEAD^{tree}) &&
> > +     git tag -am t-tree t-tree $tree &&
> > +
> > +     test_path_is_missing b &&
> > +     test_path_is_missing dir &&
> > +     test_path_is_missing sub/A &&
> > +     test_path_is_file a &&
> > +     test_path_is_file sub/B/b
> > +'
> > +
> > +test_expect_success 'grep in working tree should honor sparse checkout' '
> > +     cat >expect <<-EOF &&
> > +     a:text
> > +     EOF
> > +     git grep "text" >actual &&
> > +     test_cmp expect actual
> > +'
> > +
> > +test_expect_success 'grep --cached should honor sparse checkout' '
> > +     cat >expect <<-EOF &&
> > +     a:text
> > +     EOF
> > +     git grep --cached "text" >actual &&
> > +     test_cmp expect actual
> > +'
> > +
> > +test_expect_success 'grep <commit-ish> should honor sparse checkout' '
> > +     commit=$(git rev-parse HEAD) &&
> > +     cat >expect_commit <<-EOF &&
> > +     $commit:a:text
> > +     EOF
> > +     cat >expect_t-commit <<-EOF &&
> > +     t-commit:a:text
> > +     EOF
> > +     git grep "text" $commit >actual_commit &&
> > +     test_cmp expect_commit actual_commit &&
> > +     git grep "text" t-commit >actual_t-commit &&
> > +     test_cmp expect_t-commit actual_t-commit
> > +'
> > +
> > +test_expect_success 'grep <tree-ish> should ignore sparsity patterns' '
> > +     commit=$(git rev-parse HEAD) &&
> > +     tree=$(git rev-parse HEAD^{tree}) &&
> > +     cat >expect_tree <<-EOF &&
> > +     $tree:a:text
> > +     $tree:b:text
> > +     $tree:dir/c:text
> > +     EOF
> > +     cat >expect_t-tree <<-EOF &&
> > +     t-tree:a:text
> > +     t-tree:b:text
> > +     t-tree:dir/c:text
> > +     EOF
> > +     git grep "text" $tree >actual_tree &&
> > +     test_cmp expect_tree actual_tree &&
> > +     git grep "text" t-tree >actual_t-tree &&
> > +     test_cmp expect_t-tree actual_t-tree
> > +'
> > +
> > +test_expect_success 'grep --recurse-submodules --cached should honor sparse checkout in submodule' '
> > +     cat >expect <<-EOF &&
> > +     a:text
> > +     sub/B/b:text
> > +     EOF
> > +     git grep --recurse-submodules --cached "text" >actual &&
> > +     test_cmp expect actual
> > +'
> > +
> > +test_expect_success 'grep --recurse-submodules <commit-ish> should honor sparse checkout in submodule' '
> > +     commit=$(git rev-parse HEAD) &&
> > +     cat >expect_commit <<-EOF &&
> > +     $commit:a:text
> > +     $commit:sub/B/b:text
> > +     EOF
> > +     cat >expect_t-commit <<-EOF &&
> > +     t-commit:a:text
> > +     t-commit:sub/B/b:text
> > +     EOF
> > +     git grep --recurse-submodules "text" $commit >actual_commit &&
> > +     test_cmp expect_commit actual_commit &&
> > +     git grep --recurse-submodules "text" t-commit >actual_t-commit &&
> > +     test_cmp expect_t-commit actual_t-commit
> > +'
> > +
> > +test_done

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v2 4/4] config: add setting to ignore sparsity patterns in some cmds
  2020-05-10  4:23     ` Matheus Tavares Bernardino
@ 2020-05-21 17:18       ` Elijah Newren
  0 siblings, 0 replies; 57+ messages in thread
From: Elijah Newren @ 2020-05-21 17:18 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: git, Junio C Hamano, Derrick Stolee, Jonathan Tan

On Sat, May 9, 2020 at 9:23 PM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> On Sat, May 9, 2020 at 9:42 PM Matheus Tavares
> <matheus.bernardino@usp.br> wrote:
> >
> > diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> > index 3bd67082eb..8509694bf1 100755
> > --- a/t/t7817-grep-sparse-checkout.sh
> > +++ b/t/t7817-grep-sparse-checkout.sh
> > @@ -63,12 +63,28 @@ test_expect_success 'setup' '
> >         test_path_is_file sub/B/b
> >  '
> >
> > > +# The two tests below check a special case: the sparsity patterns exclude '/b'
> > > +# and sparse checkout is enabled, but the path exists on the working tree (e.g.
> > > +# manually created after `git sparse-checkout init`). In this case, grep should
> > > +# honor --restrict-to-sparse-paths.
>
> I just want to highlight a small thing that I forgot to comment on:
> Elijah and I had already discussed --restrict-to-sparse-paths
> being relevant in grep only with --cached or when a commit-ish is
> given. But the special case mentioned above had not occurred to me
> before, i.e. searching in the working tree when a path that should be
> excluded by the sparsity patterns is present. In this patch, I let
> --restrict-to-sparse-paths control the desired behavior for grep in
> this case too. But please, let me know if that doesn't seem like a
> good idea.

Wow, that is an interesting edge case.  But it can come up during a
merge or rebase or checkout -m, could be manually changed by various
plumbing commands, and might just not be enforced well in various
areas of the system (see e.g. [1]).  Perhaps the most interesting
case, given recent discussion, is submodules -- those might be left in
the working tree despite not matching sparsity paths.  So, should `git
-c sparse.restrictCmds=true grep PATTERN` look at these paths or not?
Currently, you've chosen contradictory answers -- yes to submodules,
and no to other entries.  I'm not certain here, but I've given it a
little thought and think there are a few things to take into
consideration:

Users are used to the fact that
    grep -r PATTERN *
searches existing files for PATTERN.  If you delete a file, then a
subsequent grep isn't going to search through it.  Similarly, git grep
is billed as a grep that limits searches to tracked files, so users
expect
    git grep PATTERN
to search the files in their working copy, limited to those that are
tracked.  From this angle, I think users would be surprised
if `git grep` searched through deleted files, and they would also be
surprised if it ignored tracked and present files.

That is a basic answer, but let's go a bit further.  Since git grep
also has history at its disposal, it has more options.  For example:
    git grep PATTERN REVISION
means to search through all tracked files (those are the only kinds
that are recorded in revisions anyway) as of REVISION for the given
PATTERN, without checking it out.  Users probably expect this to
behave the same as:
    git checkout REVISION
    git grep PATTERN
and since checkout pays attention to sparsity rules, this is why we'd
want to have both "git grep PATTERN" and "git grep PATTERN REVISION"
pay attention to sparsity rules.

When we think in terms of "git grep PATTERN REVISION" as an optimized
version of "git checkout REVISION && git grep PATTERN" it puts us in
the frame of mind of asking the following question:
   For each path, would it be marked as SKIP_WORKTREE if we were to
check it out right now?  If so, we should skip it for the grepping.
Usually, the SKIP_WORKTREE bit is set for files if and only if they
don't match the sparsity patterns.  Also, we can't use the
SKIP_WORKTREE bit of the current index to decide whether to grep
through an old REVISION, because there are paths that exist in the
old revision that don't exist in the current index.  The sparsity
rules are the only things that can tell us whether such a path would
be marked as SKIP_WORKTREE if we were to check it out.  So it makes
sense to use the sparsity patterns when looking at REVISIONS.  When
dealing with the current worktree, we can check SKIP_WORKTREE
directly.  Usually that'll give the same answer as asking the sparsity
rules but as per [1] the two aren't always identical.  Rather than
asking "Would we mark this as SKIP_WORKTREE if we were to checkout
this version right now?", perhaps we should ask "Since we have this
version checked out right now, let's just check the path directly.  Is
it marked as SKIP_WORKTREE?".
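
To make the distinction concrete, here is a toy shell sketch of that
rule.  It is not git code: should_search and its three arguments are
names made up for illustration, and the caller is assumed to already
know the entry's SKIP_WORKTREE bit and whether the path matches the
sparsity patterns.

```shell
#!/bin/sh
# Toy model only; should_search is a hypothetical helper, not a git API.
#   $1: where we search: "worktree", "cached" or "revision"
#   $2: 1 if the current index entry has SKIP_WORKTREE set, else 0
#       (meaningless for "revision": the path may not be in the index)
#   $3: 1 if the path matches the sparsity patterns, else 0
should_search () {
	case "$1" in
	worktree|cached)
		# The version is checked out: ask the index entry directly.
		test "$2" -eq 0
		;;
	revision)
		# No index entry to consult: ask the sparsity patterns.
		test "$3" -eq 1
		;;
	*)
		return 2
		;;
	esac
}

# A file left in the working tree despite not matching the patterns
# (bit unset, pattern mismatch) is searched in the worktree case but
# skipped when the same path is looked up in a revision.
should_search worktree 0 0 && echo "worktree: search"
should_search revision 0 0 || echo "revision: skip"
```

The only point of the sketch is that the two sources of truth can
disagree, and that each kind of search consults the one that actually
exists for it.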

Does that sound reasonable?

[1] https://lore.kernel.org/git/xmqqbmb1a7ga.fsf@gitster-ct.c.googlers.com/

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v2 3/4] grep: honor sparse checkout patterns
  2020-05-21  7:26           ` Elijah Newren
@ 2020-05-21 17:35             ` Matheus Tavares Bernardino
  2020-05-21 17:52               ` Elijah Newren
  0 siblings, 1 reply; 57+ messages in thread
From: Matheus Tavares Bernardino @ 2020-05-21 17:35 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Junio C Hamano, git, Derrick Stolee, Jonathan Tan

On Thu, May 21, 2020 at 4:26 AM Elijah Newren <newren@gmail.com> wrote:
>
> On Tue, May 12, 2020 at 5:17 PM Junio C Hamano <gitster@pobox.com> wrote:
> >
> > Matheus Tavares Bernardino <matheus.bernardino@usp.br> writes:
> >
> > > The idea behind not skipping gitlinks here was to be compliant with
> > > what we have in the working tree. In 4fd683b ("sparse-checkout:
> > > document interactions with submodules"), we decided that, if the
> > > sparse-checkout patterns exclude a submodule, the submodule would
> > > still appear in the working tree. The purpose was to keep these
> > > features (submodules and sparse-checkout) independent. Along the same
> > > lines, I think we should always recurse into initialized submodules in
>
> Sorry if I missed it in the code, but do you check whether the
> submodule is initialized before descending into it, or do you descend
> into it based on it just being a submodule?

We only descend if the submodule is initialized. The new code in this
patch doesn't do this check, but it is already implemented in
grep_submodule() (which is called by grep_tree() and grep_cache() when
a submodule is found).

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v2 3/4] grep: honor sparse checkout patterns
  2020-05-21 17:35             ` Matheus Tavares Bernardino
@ 2020-05-21 17:52               ` Elijah Newren
  2020-05-22  5:49                 ` Matheus Tavares Bernardino
  0 siblings, 1 reply; 57+ messages in thread
From: Elijah Newren @ 2020-05-21 17:52 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: Junio C Hamano, git, Derrick Stolee, Jonathan Tan

On Thu, May 21, 2020 at 10:36 AM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> On Thu, May 21, 2020 at 4:26 AM Elijah Newren <newren@gmail.com> wrote:
> >
> > On Tue, May 12, 2020 at 5:17 PM Junio C Hamano <gitster@pobox.com> wrote:
> > >
> > > Matheus Tavares Bernardino <matheus.bernardino@usp.br> writes:
> > >
> > > > The idea behind not skipping gitlinks here was to be compliant with
> > > > what we have in the working tree. In 4fd683b ("sparse-checkout:
> > > > document interactions with submodules"), we decided that, if the
> > > > sparse-checkout patterns exclude a submodule, the submodule would
> > > > still appear in the working tree. The purpose was to keep these
> > > > features (submodules and sparse-checkout) independent. Along the same
> > > > lines, I think we should always recurse into initialized submodules in
> >
> > Sorry if I missed it in the code, but do you check whether the
> > submodule is initialized before descending into it, or do you descend
> > into it based on it just being a submodule?
>
> We only descend if the submodule is initialized. The new code in this
> patch doesn't do this check, but it is already implemented in
> grep_submodule() (which is called by grep_tree() and grep_cache() when
> a submodule is found).

Good to know.  To up the ante a bit: What if another branch has a
directory that doesn't exist in HEAD or the current checkout, and
within that directory is a submodule?  Would it be recursed into?
What if it matched the sparsity paths?  (Is it even possible to
recurse into it?)

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v2 3/4] grep: honor sparse checkout patterns
  2020-05-21 17:52               ` Elijah Newren
@ 2020-05-22  5:49                 ` Matheus Tavares Bernardino
  2020-05-22 14:26                   ` Elijah Newren
  0 siblings, 1 reply; 57+ messages in thread
From: Matheus Tavares Bernardino @ 2020-05-22  5:49 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Junio C Hamano, git, Derrick Stolee, Jonathan Tan

On Thu, May 21, 2020 at 2:52 PM Elijah Newren <newren@gmail.com> wrote:
>
> On Thu, May 21, 2020 at 10:36 AM Matheus Tavares Bernardino
> <matheus.bernardino@usp.br> wrote:
> >
> > On Thu, May 21, 2020 at 4:26 AM Elijah Newren <newren@gmail.com> wrote:
> > >
> > > On Tue, May 12, 2020 at 5:17 PM Junio C Hamano <gitster@pobox.com> wrote:
> > > >
> > > > Matheus Tavares Bernardino <matheus.bernardino@usp.br> writes:
> > > >
> > > > > The idea behind not skipping gitlinks here was to be compliant with
> > > > > what we have in the working tree. In 4fd683b ("sparse-checkout:
> > > > > document interactions with submodules"), we decided that, if the
> > > > > sparse-checkout patterns exclude a submodule, the submodule would
> > > > > still appear in the working tree. The purpose was to keep these
> > > > > features (submodules and sparse-checkout) independent. Along the same
> > > > > lines, I think we should always recurse into initialized submodules in
> > >
> > > Sorry if I missed it in the code, but do you check whether the
> > > submodule is initialized before descending into it, or do you descend
> > > into it based on it just being a submodule?
> >
> > We only descend if the submodule is initialized. The new code in this
> > patch doesn't do this check, but it is already implemented in
> > grep_submodule() (which is called by grep_tree() and grep_cache() when
> > a submodule is found).
>
> Good to know.  To up the ante a bit: What if another branch has a
> directory that doesn't exist in HEAD or the current checkout, and
> within that directory is a submodule?  Would it be recursed into?

In this case, `git grep --recurse-submodules <pattern> $branch` will
recurse into the submodule, but only if it has already been
initialized, i.e. if we have checked out $branch, run `git submodule
init`, and then switched back.

> What if it matched the sparsity paths?  (Is it even possible to
> recurse into it?)

That's a great question. The idea that I tried to implement is to
always recurse into _initialized_ submodules (even the ones excluded
by the superproject's sparsity patterns) and, then, follow their own
sparsity patterns inside. I'm not necessarily in favor (or against)
this behavior, but this seemed to be the most compatible way with the
design we describe in our docs:

"If your sparse-checkout patterns exclude an initialized submodule,
then that submodule will still appear in your working directory." (in
git-sparse-checkout.txt)

So, back to the original question, if you run `git grep
--recurse-submodules <pattern> $branch` and $branch contains a
submodule which was previously initialized, git-grep _would_ recurse
into it, even if it (or its parent dir) was excluded. However, your
question helped me notice an inconsistency in my patch: the behavior I
just described is working for the full pattern set, but not in cone
mode. That's because, in cone mode, we can mark the whole submodule's
parent dir as excluded. Then, path_matches_pattern_list() will return
NOT_MATCHED for the parent dir and we won't recurse into it, so we
won't even get to the submodule's path to discover that it refers to a
gitlink.

Therefore, if we decide to keep the behavior of always recursing into
submodules, we will need some extra work for the cone mode. I.e.
grep_tree() will have to check if NOT_MATCHED directories contain
submodules before discarding them, and recurse only into the
submodules if so. As for the implementation, the first idea that came
to my mind was to list the submodules' pathnames and do prefix
matching for each submodule and NOT_MATCHED dir. But the places I've
seen such submodule listings in the code base so far [1] seem to work
only in the current branch. My second idea was to continue the tree
walk when we hit NOT_MATCHED dir entries, but not doing any work, just
looking for possible gitlinks to recurse into. I'm not sure if that
could negatively affect the execution time, though.
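
To illustrate the second idea, here is a toy shell model (not git
code).  Entries are "<mode> <path>" lines, and 160000 is the mode git
uses for gitlinks.  is_included and plan_walk are hypothetical names,
and the hard-coded cone (top-level files plus everything under B/) is
only an assumption for the example.

```shell
#!/bin/sh
# Toy model only: is_included stands in for the sparsity check.
# Here the "cone" holds top-level files and everything under B/.
is_included () {
	case "$1" in
	B/*) return 0 ;;
	*/*) return 1 ;;
	*) return 0 ;;
	esac
}

# plan_walk reads "<mode> <path>" entries on stdin and prints one
# action per line.
plan_walk () {
	while read -r mode path
	do
		if test "$mode" = 160000
		then
			# Gitlinks are reported even inside excluded directories.
			echo "recurse $path"
		elif is_included "$path"
		then
			echo "grep $path"
		fi
		# Blobs in excluded directories are read but otherwise ignored.
	done
}

plan=$(printf '%s\n' \
	'100644 a' \
	'100644 A/file' \
	'160000 A/sub' \
	'100644 B/b' | plan_walk)
echo "$plan"
```

Note that every entry under the excluded A/ is still visited, which is
exactly the cost I am unsure about.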

Does this seem like a good approach? Or is there another solution that
I have not considered? Or even further, should we choose to skip the
submodules in excluded paths? My only concern in this case is that it
would be contrary to the design in git-sparse-checkout.txt. And the
working tree grep and cached grep would differ even on a clean working
tree.

[1]: builtin/submodule--helper.c:module_list_compute() and
submodule-config.c:config_from_gitmodules()

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v2 3/4] grep: honor sparse checkout patterns
  2020-05-22  5:49                 ` Matheus Tavares Bernardino
@ 2020-05-22 14:26                   ` Elijah Newren
  2020-05-22 15:36                     ` Elijah Newren
  0 siblings, 1 reply; 57+ messages in thread
From: Elijah Newren @ 2020-05-22 14:26 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: git, Junio C Hamano, Derrick Stolee, Jonathan Tan, Elijah Newren

Hi Matheus,

On Thu, May 21, 2020 at 10:49 PM Matheus Tavares Bernardino <matheus.bernardino@usp.br> wrote:
>
> On Thu, May 21, 2020 at 2:52 PM Elijah Newren <newren@gmail.com> wrote:
> >
<snip>
> > Good to know.  To up the ante a bit: What if another branch has a
> > directory that doesn't exist in HEAD or the current checkout, and
> > within that directory is a submodule?  Would it be recursed into?
>
> In this case, `git grep --recurse-submodules <pattern> $branch` will
> recurse into the submodule, but only if it has already been
> initialized. I.e. if we have checked out to $branch, ran `git
> submodule init` and then checked out back.
>
> > What if it matched the sparsity paths?  (Is it even possible to
> > recurse into it?)
>
> That's a great question. The idea that I tried to implement is to
> always recurse into _initialized_ submodules (even the ones excluded
> by the superproject's sparsity patterns) and, then, follow their own
> sparsity patterns inside. I'm not necessarily in favor (or against)
> this behavior, but this seemed to be the most compatible way with the
> design we describe in our docs:
>
> "If your sparse-checkout patterns exclude an initialized submodule,
> then that submodule will still appear in your working directory." (in
> git-sparse-checkout.txt)
>
> So, back to the original question, if you run `git grep
> --recurse-submodules <pattern> $branch` and $branch contains a
> submodule which was previously initialized, git-grep _would_ recurse
> into it, even if it (or its parent dir) was excluded. However, your
> question helped me notice an inconsistency in my patch: the behavior I
> just described is working for the full pattern set, but not in cone
> mode. That's because, in cone mode, we can mark the whole submodule's
> parent dir as excluded. Then, path_matches_pattern_list() will return
> NOT_MATCHED for the parent dir and we won't recurse into it, so we
> won't even get to the submodule's path to discover that it refers to a
> gitlink.
>
> Therefore, if we decide to keep the behavior of always recursing into
> submodules, we will need some extra work for the cone mode. I.e.
> grep_tree() will have to check if NOT_MATCHED directories contain
> submodules before discarding them, and recurse only into the
> submodules if so. As for the implementation, the first idea that came
> to my mind was to list the submodules' pathnames and do prefix
> matching for each submodule and NOT_MATCHED dir. But the places I've
> seen such submodule listings in the code base so far [1] seem to work
> only in the current branch. My second idea was to continue the tree
> walk when we hit NOT_MATCHED dir entries, but not doing any work, just
> looking for possible gitlinks to recurse into. I'm not sure if that
> could negatively affect the execution time, though.
>
> Does this seem like a good approach? Or is there another solution that
> I have not considered? Or even further, should we choose to skip the
> submodules in excluded paths? My only concern in this case is that it
> would be contrary to the design in git-sparse-checkout.txt. And the
> working tree grep and cached grep would differ even on a clean working
> tree.

To be honest, I think it sounds insane.  What you propose does make
sense if you take what was written in git-sparse-checkout.txt very
literally, as though it were a core design principle meant to cover
all cases, but I do not think it merits such standing at all.  I
think it should be treated as a first-draft attempt to explain
interactions, written solely with the 'checkout' case in mind,
especially since it was written at approximately the same time as
this warning earlier in the same file:

    THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR, AND THE BEHAVIOR OF
    OTHER COMMANDS IN THE PRESENCE OF SPARSE-CHECKOUTS, WILL LIKELY
    CHANGE IN THE FUTURE.

Anyway, the wording in that file seems to be really important, so
let's fix it.

-- >8 --
Subject: [PATCH] git-sparse-checkout: clarify interactions with submodules

Ignoring the sparse-checkout feature momentarily, if one has a submodule and
creates local branches within it with unpushed changes and maybe adds some
untracked files to it, then we would want to avoid accidentally removing such
a submodule.  So, for example with git.git, if you run
   git checkout v2.13.0
then the sha1collisiondetection/ submodule is NOT removed even though it
did not exist as a submodule until v2.14.0.  Similarly, if you only had
v2.13.0 checked out previously and ran
   git checkout v2.14.0
the sha1collisiondetection/ submodule would NOT be automatically
initialized despite being part of v2.14.0.  In both cases, git requires
submodules to be initialized or deinitialized separately.  Further, we
also have special handling for submodules in other commands such as
clean, which requires two --force flags to delete untracked submodules,
and some commands have a --recurse-submodules flag.

sparse-checkout is very similar to checkout, as evidenced by the similar
name -- it adds and removes files from the working copy.  However, for
the same avoid-data-loss reasons we do not want to remove a submodule
from the working copy with checkout, we do not want to do it with
sparse-checkout either.  So submodules need to be separately initialized
or deinitialized; changing sparse-checkout rules should not
automatically trigger the removal or vivification of submodules.

I believe the previous wording in git-sparse-checkout.txt about
submodules was only about this particular issue.  Unfortunately, the
previous wording could be interpreted to imply that submodules should be
considered active regardless of sparsity patterns.  Update the wording
to avoid making such an implication.  It may be helpful to consider two
example situations where the differences in wording become important:

In the future, we want users to be able to run commands like
   git clone --sparse=moduleA --recurse-submodules $REPO_URL
and have sparsity paths automatically set up and have submodules *within
the sparsity paths* be automatically initialized.  We do not want all
submodules in any path to be automatically initialized with that
command.

Similarly, we want to be able to do things like
   git -c sparse.restrictCmds grep --recurse-submodules $PATTERN $REV
and search through $REV for $PATTERN within the recorded sparsity
patterns.  We want it to recurse into submodules within those sparsity
patterns, but do not want to recurse into directories that do not match
the sparsity patterns in search of a possible submodule.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/git-sparse-checkout.txt | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/Documentation/git-sparse-checkout.txt b/Documentation/git-sparse-checkout.txt
index c0342e5393..7dde2d330c 100644
--- a/Documentation/git-sparse-checkout.txt
+++ b/Documentation/git-sparse-checkout.txt
@@ -190,10 +190,23 @@ directory.
 SUBMODULES
 ----------
 
-If your repository contains one or more submodules, then those submodules will
-appear based on which you initialized with the `git submodule` command. If
-your sparse-checkout patterns exclude an initialized submodule, then that
-submodule will still appear in your working directory.
+If your repository contains one or more submodules, then those submodules
+will appear based on which you initialized with the `git submodule`
+command.  Submodules may have additional untracked files or code stored on
+other branches, so to avoid data loss, changing sparse inclusion/exclusion
+rules will not cause an already checked out submodule to be removed from
+the working copy.  Said another way, just as `checkout` will not cause
+submodules to be automatically removed or initialized even when switching
+between branches that remove or add submodules, using `sparse-checkout` to
+reduce or expand the scope of "interesting" files will not cause submodules
+to be automatically deinitialized or initialized either.  Adding or
+removing them must be done as a separate step with `git submodule init` or
+`git submodule deinit`.
+
+This may mean that even if your sparsity patterns include or exclude
+submodules, until you manually initialize or deinitialize them, commands
+like grep that work on tracked files in the working copy will ignore "not
+yet initialized" submodules and pay attention to "left behind" ones.
 
 
 SEE ALSO
-- 
2.26.1.250.g8bb771e84c


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v2 3/4] grep: honor sparse checkout patterns
  2020-05-22 14:26                   ` Elijah Newren
@ 2020-05-22 15:36                     ` Elijah Newren
  2020-05-22 20:54                       ` Matheus Tavares Bernardino
  0 siblings, 1 reply; 57+ messages in thread
From: Elijah Newren @ 2020-05-22 15:36 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: Git Mailing List, Junio C Hamano, Derrick Stolee, Jonathan Tan

On Fri, May 22, 2020 at 7:26 AM Elijah Newren <newren@gmail.com> wrote:
>
> Hi Matheus,
>
> On Thu, May 21, 2020 at 10:49 PM Matheus Tavares Bernardino <matheus.bernardino@usp.br> wrote:
> >
> > On Thu, May 21, 2020 at 2:52 PM Elijah Newren <newren@gmail.com> wrote:
> > >
<snip>
> > Does this seem like a good approach? Or is there another solution that
> > I have not considered? Or even further, should we choose to skip the
> > submodules in excluded paths? My only concern in this case is that it
> > would be contrary to the design in git-sparse-checkout.txt. And the
> > working tree grep and cached grep would differ even on a clean working
> > tree.
>
<snip>
> Anyway, the wording in that file seems to be really important, so
> let's fix it.
>

Let me also try to give a concrete proposal for grep behavior for the
edge cases we've discussed:

git -c sparse.restrictCmds=true grep --recurse-submodules $PATTERN

This goes through all the files in the index (i.e. all tracked files)
which do not have the SKIP_WORKTREE bit set.  For each of these: If
the file is a symlink, ignore it (like grep currently does).  If the
file is a regular file and is present in the working copy, search it.
If the file is a submodule and it is initialized, recurse into it.

git -c sparse.restrictCmds=true grep --recurse-submodules --cached $PATTERN

This goes through all the files in the index (i.e. all tracked files)
which do not have the SKIP_WORKTREE bit set.  For each of these: Skip
symlinks.  Search regular files.  Recurse into submodules if they are
initialized.

git -c sparse.restrictCmds=true grep --recurse-submodules $PATTERN $REVISION

This goes through all the files in the given revision (i.e. all
tracked files) which match the sparsity patterns (i.e. that would not
have the SKIP_WORKTREE bit set if we were to check out that commit).
For each of these: Skip symlinks.  Search regular files.  Recurse into
submodules if they are initialized.


Further, for any of these, when recursing into submodules, make sure
to load that submodule's core.sparseCheckout setting (and related
settings) and the submodule's sparsity patterns, if any.
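
Assuming an entry has already passed the sparsity filter described for
its case, the per-entry handling above can be summarized by the
following toy shell sketch.  entry_action and its argument names are
made up for illustration; this is not how builtin/grep.c is
structured.

```shell
#!/bin/sh
# Toy model only; entry_action is a hypothetical helper.
#   $1: "worktree", "cached" or "revision"
#   $2: entry type: "symlink", "file" or "submodule"
#   $3: for a file: "present" or "absent" in the working copy;
#       for a submodule: "initialized" or "uninitialized"
entry_action () {
	case "$2" in
	symlink)
		echo skip
		;;
	file)
		# Presence only matters when searching the working copy.
		if test "$1" = worktree && test "$3" != present
		then
			echo skip
		else
			echo search
		fi
		;;
	submodule)
		if test "$3" = initialized
		then
			echo recurse
		else
			echo skip
		fi
		;;
	*)
		return 1
		;;
	esac
}

entry_action worktree file absent
entry_action cached file absent
entry_action revision submodule initialized
```

The three calls at the end cover a deleted file in the working copy,
the same file searched via the index, and an initialized submodule in
a revision.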

Sound good?

I think this addresses the edge cases we've discussed so far:
interaction between submodules and sparsity patterns, and handling of
files that are still present despite not matching the sparsity
patterns. (Also note that files which are present-despite-the-rules
are prone to be removed by the next `git sparse-checkout reapply` or
anything that triggers a call to unpack_trees(); there are already
multiple things that do, and Stolee's proposed patches would add more.)
If I've missed edge cases, let me know.


Elijah

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v2 3/4] grep: honor sparse checkout patterns
  2020-05-22 15:36                     ` Elijah Newren
@ 2020-05-22 20:54                       ` Matheus Tavares Bernardino
  2020-05-22 21:06                         ` Elijah Newren
  0 siblings, 1 reply; 57+ messages in thread
From: Matheus Tavares Bernardino @ 2020-05-22 20:54 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git Mailing List, Junio C Hamano, Derrick Stolee, Jonathan Tan

Hi, Elijah

On Fri, May 22, 2020 at 12:36 PM Elijah Newren <newren@gmail.com> wrote:
>
> On Fri, May 22, 2020 at 7:26 AM Elijah Newren <newren@gmail.com> wrote:
> >
> > Hi Matheus,
> >
> > On Thu, May 21, 2020 at 10:49 PM Matheus Tavares Bernardino <matheus.bernardino@usp.br> wrote:
> > >
> > > On Thu, May 21, 2020 at 2:52 PM Elijah Newren <newren@gmail.com> wrote:
> > > >
> <snip>
> > > Does this seem like a good approach? Or is there another solution that
> > > I have not considered? Or even further, should we choose to skip the
> > > submodules in excluded paths? My only concern in this case is that it
> > > would be contrary to the design in git-sparse-checkout.txt. And the
> > > working tree grep and cached grep would differ even on a clean working
> > > tree.
> >
> <snip>
> > Anyway, the wording in that file seems to be really important, so
> > let's fix it.
> >
>
> Let me also try to give a concrete proposal for grep behavior for the
> edge cases we've discussed:

Thank you for this proposal and for the previous comments as well.

> git -c sparse.restrictCmds=true grep --recurse-submodules $PATTERN
>
> This goes through all the files in the index (i.e. all tracked files)
> which do not have the SKIP_WORKTREE bit set.  For each of these: If
> the file is a symlink, ignore it (like grep currently does).  If the
> file is a regular file and is present in the working copy, search it.
> If the file is a submodule and it is initialized, recurse into it.

Sounds good. And when sparse.restrictCmds=false, we also search the
present files and present initialized submodules that have the
SKIP_WORKTREE bit set, right?

> git -c sparse.restrictCmds=true grep --recurse-submodules --cached $PATTERN
>
> This goes through all the files in the index (i.e. all tracked files)
> which do not have the SKIP_WORKTREE bit set.  For each of these: Skip
> symlinks.  Search regular files.  Recurse into submodules if they are
> initialized.

OK.

> git -c sparse.restrictCmds=true grep --recurse-submodules $REVISION $PATTERN
>
> This goes through all the files in the given revision (i.e. all
> tracked files) which match the sparsity patterns (i.e. that would not
> have the SKIP_WORKTREE bit set if we were to check out that commit).
> For each of these: Skip symlinks.  Search regular files.  Recurse into
> submodules if they are initialized.

OK.

> Further, for any of these, when recursing into submodules, make sure
> to load that submodule's core.sparseCheckout setting (and related
> settings) and the submodule's sparsity patterns, if any.
>
> Sound good?
>
> I think this addresses the edge cases we've discussed so far:
> interaction between submodules and sparsity patterns, and handling of
> files that are still present despite not matching the sparsity
> patterns. (Also note that files which are present-despite-the-rules
> are prone to be removed by the next `git sparse-checkout reapply` or
> anything that triggers a call to unpack_trees(); there are already
> multiple things that do and Stolee's proposed patches would add more).
> If I've missed edge cases, let me know.

Sounds great. This addresses all the edge cases we've mentioned
before. Thanks again for the detailed proposal, and for considering
them case by case.


* Re: [RFC PATCH v2 3/4] grep: honor sparse checkout patterns
  2020-05-22 20:54                       ` Matheus Tavares Bernardino
@ 2020-05-22 21:06                         ` Elijah Newren
  0 siblings, 0 replies; 57+ messages in thread
From: Elijah Newren @ 2020-05-22 21:06 UTC (permalink / raw)
  To: Matheus Tavares Bernardino
  Cc: Git Mailing List, Junio C Hamano, Derrick Stolee, Jonathan Tan

On Fri, May 22, 2020 at 1:54 PM Matheus Tavares Bernardino
<matheus.bernardino@usp.br> wrote:
>
> Hi, Elijah
>
> On Fri, May 22, 2020 at 12:36 PM Elijah Newren <newren@gmail.com> wrote:
> >
> > On Fri, May 22, 2020 at 7:26 AM Elijah Newren <newren@gmail.com> wrote:
> > >
> > > Hi Matheus,
> > >
> > > On Thu, May 21, 2020 at 10:49 PM Matheus Tavares Bernardino <matheus.bernardino@usp.br> wrote:
> > > >
> > > > On Thu, May 21, 2020 at 2:52 PM Elijah Newren <newren@gmail.com> wrote:
> > > > >
> > <snip>
> > > > Does this seem like a good approach? Or is there another solution that
> > > > I have not considered? Or even further, should we choose to skip the
> > > > submodules in excluded paths? My only concern in this case is that it
> > > > would be contrary to the design in git-sparse-checkout.txt. And the
> > > > working tree grep and cached grep would differ even on a clean working
> > > > tree.
> > >
> > <snip>
> > > Anyway, the wording in that file seems to be really important, so
> > > let's fix it.
> > >
> >
> > Let me also try to give a concrete proposal for grep behavior for the
> > edge cases we've discussed:
>
> Thank you for this proposal and for the previous comments as well.
>
> > git -c sparse.restrictCmds=true grep --recurse-submodules $PATTERN
> >
> > This goes through all the files in the index (i.e. all tracked files)
> > which do not have the SKIP_WORKTREE bit set.  For each of these: If
> > the file is a symlink, ignore it (like grep currently does).  If the
> > file is a regular file and is present in the working copy, search it.
> > If the file is a submodule and it is initialized, recurse into it.
>
> Sounds good. And when sparse.restrictCmds=false, we also search the
> present files and present initialized submodules that have the
> SKIP_WORKTREE bit set, right?

You're really pushing those corner cases, I love it.  :-)
SKIP_WORKTREE is supposed to mean we have removed it from the working
tree, i.e. it shouldn't be present (if we decide we're not going to
remove it from the working tree, e.g. because the file is unmerged or
something, then we don't mark it as SKIP_WORKTREE even if it doesn't
match sparsity patterns).  Therefore, the set of files that satisfy
this condition you have given should generally be empty.

But presuming we hit this corner case, I'd say you are right.
sparse.restrictCmds=false means we ignore the SKIP_WORKTREE bit
entirely (and in the case of grepping a $REVISION, we ignore the
sparsity patterns entirely).
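
For illustration, the rules discussed in this thread can be modelled as a
small decision function. This is only a sketch under stated assumptions,
not code from builtin/grep.c: the names `Kind`, `Entry` and `action`, and
the `restrict_cmds` / `worktree` parameters, are hypothetical and exist
only for this example. `worktree=True` stands for a plain `git grep`;
`worktree=False` stands for `--cached` or a `$REVISION` grep, where
`skip_worktree` should be read as "does not match the sparsity patterns".

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    REGULAR = "regular"
    SYMLINK = "symlink"
    SUBMODULE = "submodule"


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for one path seen by grep."""
    kind: Kind
    skip_worktree: bool = False  # SKIP_WORKTREE bit (or "outside the patterns")
    present: bool = True         # exists in the working tree
    initialized: bool = True     # only meaningful for submodules


def action(entry: Entry, *, restrict_cmds: bool, worktree: bool) -> str:
    """Return 'skip', 'search' or 'recurse' for a single entry."""
    # With sparse.restrictCmds=true, entries outside the sparse set are ignored.
    # With sparse.restrictCmds=false, the bit is ignored entirely.
    if restrict_cmds and entry.skip_worktree:
        return "skip"
    # Symlinks are never searched.
    if entry.kind is Kind.SYMLINK:
        return "skip"
    # Submodules are recursed into only when initialized.
    if entry.kind is Kind.SUBMODULE:
        return "recurse" if entry.initialized else "skip"
    # A working tree grep can only search regular files that are present.
    if worktree and not entry.present:
        return "skip"
    return "search"
```

The corner case raised above (a file that is present although it has the
bit set) corresponds to
`action(Entry(Kind.REGULAR, skip_worktree=True), restrict_cmds=False, worktree=True)`,
which this model answers with "search". The model does not cover loading
each submodule's own sparsity settings when recursing.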

> > git -c sparse.restrictCmds=true grep --recurse-submodules --cached $PATTERN
> >
> > This goes through all the files in the index (i.e. all tracked files)
> > which do not have the SKIP_WORKTREE bit set.  For each of these: Skip
> > symlinks.  Search regular files.  Recurse into submodules if they are
> > initialized.
>
> OK.
>
> > git -c sparse.restrictCmds=true grep --recurse-submodules $REVISION $PATTERN
> >
> > This goes through all the files in the given revision (i.e. all
> > tracked files) which match the sparsity patterns (i.e. that would not
> > have the SKIP_WORKTREE bit set if we were to check out that commit).
> > For each of these: Skip symlinks.  Search regular files.  Recurse into
> > submodules if they are initialized.
>
> OK.
>
> > Further, for any of these, when recursing into submodules, make sure
> > to load that submodule's core.sparseCheckout setting (and related
> > settings) and the submodule's sparsity patterns, if any.
> >
> > Sound good?
> >
> > I think this addresses the edge cases we've discussed so far:
> > interaction between submodules and sparsity patterns, and handling of
> > files that are still present despite not matching the sparsity
> > patterns. (Also note that files which are present-despite-the-rules
> > are prone to be removed by the next `git sparse-checkout reapply` or
> > anything that triggers a call to unpack_trees(); there are already
> > multiple things that do and Stolee's proposed patches would add more).
> > If I've missed edge cases, let me know.
>
> Sounds great. This addresses all the edge cases we've mentioned
> before. Thanks again for the detailed proposal, and for considering
> them case by case.

And thank you for working on this.  :-)


* [PATCH v3 1/5] doc: grep: unify info on configuration variables
  2020-05-28  1:12 [PATCH v3 0/5] grep: honor sparse checkout and add option to ignore it Matheus Tavares
@ 2020-05-28  1:12 ` Matheus Tavares
  0 siblings, 0 replies; 57+ messages in thread
From: Matheus Tavares @ 2020-05-28  1:12 UTC (permalink / raw)
  To: git; +Cc: gitster, stolee, newren, jonathantanmy

Explanations about the configuration variables for git-grep are
duplicated in "Documentation/git-grep.txt" and
"Documentation/config/grep.txt", which can make maintenance difficult.
The first also contains a definition not present in the latter
(grep.fullName). To avoid problems like this, let's unify the
information in the second file and include it in the first.

Signed-off-by: Matheus Tavares <matheus.bernardino@usp.br>
---
 Documentation/config/grep.txt | 10 ++++++++--
 Documentation/git-grep.txt    | 36 ++++++-----------------------------
 2 files changed, 14 insertions(+), 32 deletions(-)

diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
index 44abe45a7c..dd51db38e1 100644
--- a/Documentation/config/grep.txt
+++ b/Documentation/config/grep.txt
@@ -16,8 +16,14 @@ grep.extendedRegexp::
 	other than 'default'.
 
 grep.threads::
-	Number of grep worker threads to use.
-	See `grep.threads` in linkgit:git-grep[1] for more information.
+	Number of grep worker threads to use. See `--threads`
+ifndef::git-grep[]
+	in linkgit:git-grep[1]
+endif::git-grep[]
+	for more information.
+
+grep.fullName::
+	If set to true, enable `--full-name` option by default.
 
 grep.fallbackToNoIndex::
 	If set to true, fall back to git grep --no-index if git grep
diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
index a7f9bc99ea..9bdf807584 100644
--- a/Documentation/git-grep.txt
+++ b/Documentation/git-grep.txt
@@ -41,34 +41,8 @@ characters.  An empty string as search expression matches all lines.
 CONFIGURATION
 -------------
 
-grep.lineNumber::
-	If set to true, enable `-n` option by default.
-
-grep.column::
-	If set to true, enable the `--column` option by default.
-
-grep.patternType::
-	Set the default matching behavior. Using a value of 'basic', 'extended',
-	'fixed', or 'perl' will enable the `--basic-regexp`, `--extended-regexp`,
-	`--fixed-strings`, or `--perl-regexp` option accordingly, while the
-	value 'default' will return to the default matching behavior.
-
-grep.extendedRegexp::
-	If set to true, enable `--extended-regexp` option by default. This
-	option is ignored when the `grep.patternType` option is set to a value
-	other than 'default'.
-
-grep.threads::
-	Number of grep worker threads to use. If unset (or set to 0), Git will
-	use as many threads as the number of logical cores available.
-
-grep.fullName::
-	If set to true, enable `--full-name` option by default.
-
-grep.fallbackToNoIndex::
-	If set to true, fall back to git grep --no-index if git grep
-	is executed outside of a git repository.  Defaults to false.
-
+:git-grep: 1
+include::config/grep.txt[]
 
 OPTIONS
 -------
@@ -269,8 +243,10 @@ providing this option will cause it to die.
 	found.
 
 --threads <num>::
-	Number of grep worker threads to use.
-	See `grep.threads` in 'CONFIGURATION' for more information.
+	Number of grep worker threads to use. If not provided (or set to
+	0), Git will use as many worker threads as the number of logical
+	cores available. The default value can also be set with the
+	`grep.threads` configuration.
 
 -f <file>::
 	Read patterns from <file>, one per line.
-- 
2.26.2



Thread overview: 57+ messages
2020-03-24  6:04 [RFC PATCH 0/3] grep: honor sparse checkout and add option to ignore it Matheus Tavares
2020-03-24  6:11 ` [RFC PATCH 1/3] doc: grep: unify info on configuration variables Matheus Tavares
2020-03-24  7:57   ` Elijah Newren
2020-03-24 21:26     ` Junio C Hamano
2020-03-24 23:38       ` Matheus Tavares
2020-03-24  6:12 ` [RFC PATCH 2/3] grep: honor sparse checkout patterns Matheus Tavares
2020-03-24  7:15   ` Elijah Newren
2020-03-24 15:12     ` Derrick Stolee
2020-03-24 16:16       ` Elijah Newren
2020-03-24 17:02         ` Derrick Stolee
2020-03-24 23:01       ` Matheus Tavares Bernardino
2020-03-24 22:55     ` Matheus Tavares Bernardino
2020-04-21  2:10       ` Matheus Tavares Bernardino
2020-04-21  3:08         ` Elijah Newren
2020-04-22 12:08           ` Derrick Stolee
2020-04-23  6:09           ` Matheus Tavares Bernardino
2020-03-24  6:13 ` [RFC PATCH 3/3] grep: add option to ignore sparsity patterns Matheus Tavares
2020-03-24  7:54   ` Elijah Newren
2020-03-24 18:30     ` Junio C Hamano
2020-03-24 19:07       ` Elijah Newren
2020-03-25 20:18         ` Junio C Hamano
2020-03-30  3:23       ` Matheus Tavares Bernardino
2020-03-31 19:12         ` Elijah Newren
2020-03-31 20:02           ` Derrick Stolee
2020-04-27 17:15             ` Matheus Tavares Bernardino
2020-04-29 16:46               ` Elijah Newren
2020-04-29 17:21             ` Elijah Newren
2020-03-25 23:15     ` Matheus Tavares Bernardino
2020-03-26  6:02       ` Elijah Newren
2020-03-27 15:51         ` Junio C Hamano
2020-03-27 19:01           ` Elijah Newren
2020-03-30  1:12         ` Matheus Tavares Bernardino
2020-03-31 16:48           ` Elijah Newren
2020-05-10  0:41 ` [RFC PATCH v2 0/4] grep: honor sparse checkout and add option to ignore it Matheus Tavares
2020-05-10  0:41   ` [RFC PATCH v2 1/4] doc: grep: unify info on configuration variables Matheus Tavares
2020-05-10  0:41   ` [RFC PATCH v2 2/4] config: load the correct config.worktree file Matheus Tavares
2020-05-11 19:10     ` Junio C Hamano
2020-05-12 22:55       ` Matheus Tavares Bernardino
2020-05-12 23:22         ` Junio C Hamano
2020-05-10  0:41   ` [RFC PATCH v2 3/4] grep: honor sparse checkout patterns Matheus Tavares
2020-05-11 19:35     ` Junio C Hamano
2020-05-13  0:05       ` Matheus Tavares Bernardino
2020-05-13  0:17         ` Junio C Hamano
2020-05-21  7:26           ` Elijah Newren
2020-05-21 17:35             ` Matheus Tavares Bernardino
2020-05-21 17:52               ` Elijah Newren
2020-05-22  5:49                 ` Matheus Tavares Bernardino
2020-05-22 14:26                   ` Elijah Newren
2020-05-22 15:36                     ` Elijah Newren
2020-05-22 20:54                       ` Matheus Tavares Bernardino
2020-05-22 21:06                         ` Elijah Newren
2020-05-21  7:36       ` Elijah Newren
2020-05-10  0:41   ` [RFC PATCH v2 4/4] config: add setting to ignore sparsity patterns in some cmds Matheus Tavares
2020-05-10  4:23     ` Matheus Tavares Bernardino
2020-05-21 17:18       ` Elijah Newren
2020-05-21  7:09     ` Elijah Newren
2020-05-28  1:12 [PATCH v3 0/5] grep: honor sparse checkout and add option to ignore it Matheus Tavares
2020-05-28  1:12 ` [PATCH v3 1/5] doc: grep: unify info on configuration variables Matheus Tavares
