git@vger.kernel.org mailing list mirror (one of many)
* [PATCH 00/27] [RFC] Sparse Index
@ 2021-01-25 17:41 Derrick Stolee via GitGitGadget
  2021-01-25 17:41 ` [PATCH 01/27] sparse-index: add guard to ensure full index Derrick Stolee via GitGitGadget
                   ` (28 more replies)
  0 siblings, 29 replies; 61+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-25 17:41 UTC (permalink / raw)
  To: git; +Cc: gitster, newren, peff, jrnieder, sunshine, pclouds,
	Derrick Stolee

This is based on ds/more-index-cleanups also available as GitGitGadget PR
#839.

The sparse checkout feature allows users to specify a "populated set" that
is smaller than the full list of files at HEAD. Files outside the sparse
checkout definition are not present in the working directory, but are still
present in the index (and marked with the CE_SKIP_WORKTREE bit).

This means that the working directory has size O(populated), and commands
like "git status" or "git checkout" operate using an O(populated) number of
filesystem operations. However, parsing the index still operates on the
scale of O(HEAD).

This can be particularly jarring if you are merging a small repository with
a large monorepo for the purpose of simplifying dependency management. Even
if users have nothing more in their working directory than they had before,
they suddenly see a significant increase in their "git status" or "git add"
times. In these cases, simply parsing the index can be a huge portion of the
command time.

This RFC proposes an update to the index formats to allow "sparse directory
entries". These entries correspond to directories that are completely
excluded from the sparse checkout definition. We can detect that a directory
is excluded when using "cone mode" patterns.
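
To make the size reduction concrete, here is a toy model in C. It is only a
sketch and not Git's implementation: it assumes that only top-level
directories can be collapsed (the real format can collapse nested
directories too), and the names in_cone() and toy_sparse_entry_count() are
hypothetical. It relies on the cone-mode rule that files at the root are
always populated.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if the directory name dir[0..dirlen) is listed in the cone. */
static int in_cone(const char *dir, size_t dirlen,
		   const char *const *cone, size_t cone_nr)
{
	for (size_t i = 0; i < cone_nr; i++)
		if (strlen(cone[i]) == dirlen && !strncmp(cone[i], dir, dirlen))
			return 1;
	return 0;
}

/*
 * Toy model: count the entries a sparse index would hold if every
 * top-level directory outside the cone collapsed into a single entry.
 * 'paths' must be sorted, as index entries are.
 */
size_t toy_sparse_entry_count(const char *const *paths, size_t nr,
			      const char *const *cone, size_t cone_nr)
{
	size_t count = 0;
	const char *prev_dir = NULL;
	size_t prev_len = 0;

	assert(paths || !nr);

	for (size_t i = 0; i < nr; i++) {
		const char *slash = strchr(paths[i], '/');
		size_t dirlen;

		if (!slash) {
			count++;	/* root-level file: always populated */
			continue;
		}
		dirlen = (size_t)(slash - paths[i]);
		if (in_cone(paths[i], dirlen, cone, cone_nr)) {
			count++;	/* populated file keeps its own entry */
			continue;
		}
		/* excluded directory: one entry for the whole subtree */
		if (!prev_dir || prev_len != dirlen ||
		    strncmp(prev_dir, paths[i], dirlen)) {
			count++;
			prev_dir = paths[i];
			prev_len = dirlen;
		}
	}
	return count;
}
```

With six paths at HEAD and a cone containing only "deep", two excluded
directories each collapse to one entry, so the count tracks the populated
set plus the number of sparse directories rather than the size of HEAD.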

Since having directory entries is a radical departure from the existing
index format, a new extension "extensions.sparseIndex" is added. Using a
sparse index should cause incompatible tools to fail because they do not
understand this extension.

The index is a critical data structure, so making such a drastic change must
be handled carefully. This RFC makes only enough adjustments to demonstrate
performance improvements for "git status" and "git add". Other commands
should operate identically to before, since the other commands will expand a
sparse index into a full index by parsing trees.

WARNING: I'm getting a failure on the FreeBSD build with my sparse-checkout
tests. I'm not sure what is causing these failures, but I will explore while
we discuss the possibility of the feature as a whole.

Here is an outline for this RFC:

 * Patches 1-14: create and test the sparse index format. This is just
   enough to start writing the format, but all Git commands become slower
   for using it. This is because everything is guarded to expand to a full
   index before actually operating on the cache entries.

 * Patch 15: This massive patch is actually a bunch of patches squashed
   together. I have a branch that adds "ensure_full_index()" guards in each
   necessary file along with some commentary about how the index is being
   used. This patch is presented here as one big dump because that
   commentary isn't particularly interesting if the RFC leads to a very
   different approach.

 * Patches 16-27: These changes make enough code "sparse aware" such that
   "git status" and "git add" start operating in time O(populated) instead
   of O(HEAD).

Performance numbers are given in patch 27, but repeated somewhat here. The
test environment I use has ~2.1 million paths at HEAD, but only 68,000
populated paths given the sparse-checkout I'm using. The sparse index has
about 2,000 sparse directory entries.

 1. Use the full index. The index size is ~186 MB.
 2. Use the sparse index. The index size is ~5.5 MB.
 3. Use a commit where HEAD matches the populated set. The full index size
    is ~5.3 MB.

The third benchmark is included as a theoretical optimum for a repository of
the same object database.

First, a clean 'git status' improves from 3.1s to 240ms.

Benchmark #1: full index (git status)
  Time (mean ± σ):      3.167 s ±  0.036 s    [User: 2.006 s, System: 1.078 s]
  Range (min … max):    3.100 s …  3.208 s    10 runs

Benchmark #2: sparse index (git status)
  Time (mean ± σ):      239.5 ms ±   8.1 ms    [User: 189.4 ms, System: 226.8 ms]
  Range (min … max):    226.0 ms … 251.9 ms    13 runs

Benchmark #3: small tree (git status)
  Time (mean ± σ):      195.3 ms ±   4.5 ms    [User: 116.5 ms, System: 84.4 ms]
  Range (min … max):    188.8 ms … 202.8 ms    15 runs

The performance numbers for 'git add .' are much closer to optimal:

Benchmark #1: full index (git add .)
  Time (mean ± σ):      3.076 s ±  0.022 s    [User: 2.065 s, System: 0.943 s]
  Range (min … max):    3.044 s …  3.116 s    10 runs

Benchmark #2: sparse index (git add .)
  Time (mean ± σ):      218.0 ms ±   6.6 ms    [User: 195.7 ms, System: 206.6 ms]
  Range (min … max):    209.8 ms … 228.2 ms    13 runs

Benchmark #3: small tree (git add .)
  Time (mean ± σ):      217.6 ms ±   5.4 ms    [User: 131.9 ms, System: 86.7 ms]
  Range (min … max):    212.1 ms … 228.4 ms    14 runs

I expect that making a sparse index work optimally through the most common
Git commands will take a year of effort. During this process, I expect to
add a lot of testing infrastructure around the sparse-checkout feature,
especially in corner cases. (This RFC focuses on the happy paths of
operating only within the sparse cone, but that will change in the future.)

If this general approach is acceptable, then I would follow it with a
sequence of patch submissions that follow this approach:

 1. Basics of the format. (Patches 1-14)
 2. Add additional guards around index interactions (Patch 15, but split
    appropriately.)
 3. Speed up "git status" and "git add" (Patches 16-27)

After those three items, which are all represented in this RFC, the work
starts to parallelize a bit. My plan for moving forward from that point is
roughly:

 * Add new index API abstractions where appropriate, make them sparse-aware.
 * Add new tests around sparse-checkout corner cases. Ensure the sparse
   index works correctly.
 * For a given builtin, add extra testing for sparse-checkouts, then make it
   sparse-aware.

Here are some specific questions I'm hoping to answer in this RFC period:

 1. Are these sparse directory entries an appropriate way to extend the
    index format?
 2. Is extensions.sparseIndex a good way to signal that these entries can
    exist?
 3. Is git sparse-checkout init --cone --sparse-index an appropriate way to
    toggle the format?
 4. Are there specific areas that I should target to harden the index API
    before I submit this work?
 5. Does anyone have a good idea how to test a large portion of the test
    suite with sparse-index enabled? The problem I see is that most tests
    don't use sparse-checkout, so the sparse index is identical to the full
    index. Would it be interesting to enable the test setup to add such
    "extra" directories during the test setup?

Thanks,
-Stolee

Derrick Stolee (27):
  sparse-index: add guard to ensure full index
  sparse-index: implement ensure_full_index()
  t1092: compare sparse-checkout to sparse-index
  test-read-cache: print cache entries with --table
  test-tool: read-cache --table --no-stat
  test-tool: don't force full index
  unpack-trees: ensure full index
  sparse-checkout: hold pattern list in index
  sparse-index: convert from full to sparse
  submodule: sparse-index should not collapse links
  unpack-trees: allow sparse directories
  sparse-index: check index conversion happens
  sparse-index: create extension for compatibility
  sparse-checkout: toggle sparse index from builtin
  [RFC-VERSION] *: ensure full index
  unpack-trees: make sparse aware
  dir.c: accept a directory as part of cone-mode patterns
  status: use sparse-index throughout
  status: skip sparse-checkout percentage with sparse-index
  sparse-index: expand_to_path() trivial implementation
  sparse-index: expand_to_path no-op if path exists
  add: allow operating on a sparse-only index
  submodule: die_path_inside_submodule is sparse aware
  dir: use expand_to_path in add_patterns()
  fsmonitor: disable if index is sparse
  pathspec: stop calling ensure_full_index
  cache-tree: integrate with sparse directory entries

 Documentation/config/extensions.txt      |   7 +
 Documentation/git-sparse-checkout.txt    |  14 +
 Makefile                                 |   1 +
 apply.c                                  |  10 +-
 blame.c                                  |   7 +-
 builtin/add.c                            |   3 +
 builtin/checkout-index.c                 |   5 +-
 builtin/commit.c                         |   3 +
 builtin/grep.c                           |   2 +
 builtin/ls-files.c                       |   9 +-
 builtin/merge-index.c                    |   2 +
 builtin/mv.c                             |   2 +
 builtin/rm.c                             |   2 +
 builtin/sparse-checkout.c                |  35 ++-
 builtin/update-index.c                   |   2 +
 cache-tree.c                             |  21 ++
 cache.h                                  |  15 +-
 diff.c                                   |   2 +
 dir.c                                    |  19 +-
 dir.h                                    |   2 +-
 entry.c                                  |   2 +
 fsmonitor.c                              |  18 +-
 merge-recursive.c                        |  22 +-
 name-hash.c                              |  10 +
 pathspec.c                               |   4 +-
 pathspec.h                               |   4 +-
 preload-index.c                          |   2 +
 read-cache.c                             |  51 +++-
 repo-settings.c                          |  15 +
 repository.c                             |  11 +-
 repository.h                             |   3 +
 rerere.c                                 |   2 +
 resolve-undo.c                           |   6 +
 setup.c                                  |   3 +
 sha1-name.c                              |   3 +
 sparse-index.c                           | 360 +++++++++++++++++++++++
 sparse-index.h                           |  23 ++
 split-index.c                            |   2 +
 submodule.c                              |  22 +-
 submodule.h                              |   6 +-
 t/helper/test-read-cache.c               |  77 ++++-
 t/t1092-sparse-checkout-compatibility.sh | 130 +++++++-
 tree.c                                   |   2 +
 unpack-trees.c                           |  40 ++-
 wt-status.c                              |  21 +-
 wt-status.h                              |   1 +
 46 files changed, 942 insertions(+), 61 deletions(-)
 create mode 100644 sparse-index.c
 create mode 100644 sparse-index.h


base-commit: 2271fe7848aa11b30e5313d95d9caebc2937fce5
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-847%2Fderrickstolee%2Fsparse-index%2Frfc-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-847/derrickstolee/sparse-index/rfc-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/847
-- 
gitgitgadget


* [PATCH 01/27] sparse-index: add guard to ensure full index
  2021-01-25 17:41 [PATCH 00/27] [RFC] Sparse Index Derrick Stolee via GitGitGadget
@ 2021-01-25 17:41 ` Derrick Stolee via GitGitGadget
  2021-01-25 17:41 ` [PATCH 02/27] sparse-index: implement ensure_full_index() Derrick Stolee via GitGitGadget
                   ` (27 subsequent siblings)
  28 siblings, 0 replies; 61+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-25 17:41 UTC (permalink / raw)
  To: git
  Cc: gitster, newren, peff, jrnieder, sunshine, pclouds,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Upcoming changes will introduce modifications to the index format that
allow sparse directories. It will be useful to have a mechanism for
converting those sparse index files into full indexes by walking the
tree at those sparse directories. Name this method ensure_full_index()
as it will guarantee that the index is fully expanded.

This method is not implemented yet, and instead we focus on the
scaffolding to declare it and call it at the appropriate time.

Add a 'command_requires_full_index' member to struct repo_settings. This
will be an indicator that we need the index in full mode to do certain
index operations. This starts as being true for every command, then we
will set it to false as some commands integrate with sparse indexes.

If 'command_requires_full_index' is true, then we will immediately
expand a sparse index to a full one upon reading from disk. This
suffices for now, but we will want to add more callers to
ensure_full_index() later.
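
As an illustration of the guard described above, here is a minimal,
self-contained sketch in C. It is not the code from this patch: the toy_*
types and functions are hypothetical stand-ins for struct index_state,
struct repo_settings and repo_read_index(), and the on_disk_is_sparse
parameter replaces actually reading a file.

```c
#include <assert.h>

struct toy_index {
	unsigned sparse:1;	/* has collapsed directory entries */
	int expansions;		/* number of expansions, for illustration */
};

struct toy_settings {
	unsigned command_requires_full_index:1;
};

static void toy_ensure_full_index(struct toy_index *idx)
{
	if (!idx->sparse)
		return;		/* already full: nothing to do */
	idx->sparse = 0;
	idx->expansions++;
}

/* Stand-in for reading the index from disk, then applying the guard. */
void toy_read_index(struct toy_index *idx, const struct toy_settings *s,
		    int on_disk_is_sparse)
{
	assert(idx && s);
	idx->sparse = on_disk_is_sparse ? 1 : 0;
	if (s->command_requires_full_index)
		toy_ensure_full_index(idx);
}
```

A command that has not been taught about sparse directories leaves the
setting at 1 and always sees a full index. A command that has been
integrated sets it to 0 and receives the index as it was stored.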

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Makefile        |  1 +
 repo-settings.c |  8 ++++++++
 repository.c    | 11 ++++++++++-
 repository.h    |  2 ++
 sparse-index.c  |  8 ++++++++
 sparse-index.h  |  7 +++++++
 6 files changed, 36 insertions(+), 1 deletion(-)
 create mode 100644 sparse-index.c
 create mode 100644 sparse-index.h

diff --git a/Makefile b/Makefile
index 7b64106930a..77564ae3b78 100644
--- a/Makefile
+++ b/Makefile
@@ -999,6 +999,7 @@ LIB_OBJS += sha1-name.o
 LIB_OBJS += shallow.o
 LIB_OBJS += sideband.o
 LIB_OBJS += sigchain.o
+LIB_OBJS += sparse-index.o
 LIB_OBJS += split-index.o
 LIB_OBJS += stable-qsort.o
 LIB_OBJS += strbuf.o
diff --git a/repo-settings.c b/repo-settings.c
index f7fff0f5ab8..d63569e4041 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -77,4 +77,12 @@ void prepare_repo_settings(struct repository *r)
 		UPDATE_DEFAULT_BOOL(r->settings.core_untracked_cache, UNTRACKED_CACHE_KEEP);
 
 	UPDATE_DEFAULT_BOOL(r->settings.fetch_negotiation_algorithm, FETCH_NEGOTIATION_DEFAULT);
+
+	/*
+	 * This setting guards all index reads to require a full index
+	 * over a sparse index. After suitable guards are placed in the
+	 * codebase around uses of the index, this setting will be
+	 * removed.
+	 */
+	r->settings.command_requires_full_index = 1;
 }
diff --git a/repository.c b/repository.c
index c98298acd01..a8acae002f7 100644
--- a/repository.c
+++ b/repository.c
@@ -10,6 +10,7 @@
 #include "object.h"
 #include "lockfile.h"
 #include "submodule-config.h"
+#include "sparse-index.h"
 
 /* The main repository */
 static struct repository the_repo;
@@ -261,6 +262,8 @@ void repo_clear(struct repository *repo)
 
 int repo_read_index(struct repository *repo)
 {
+	int res;
+
 	if (!repo->index)
 		repo->index = xcalloc(1, sizeof(*repo->index));
 
@@ -270,7 +273,13 @@ int repo_read_index(struct repository *repo)
 	else if (repo->index->repo != repo)
 		BUG("repo's index should point back at itself");
 
-	return read_index_from(repo->index, repo->index_file, repo->gitdir);
+	res = read_index_from(repo->index, repo->index_file, repo->gitdir);
+
+	prepare_repo_settings(repo);
+	if (repo->settings.command_requires_full_index)
+		ensure_full_index(repo->index);
+
+	return res;
 }
 
 int repo_hold_locked_index(struct repository *repo,
diff --git a/repository.h b/repository.h
index b385ca3c94b..e06a2301569 100644
--- a/repository.h
+++ b/repository.h
@@ -41,6 +41,8 @@ struct repo_settings {
 	enum fetch_negotiation_setting fetch_negotiation_algorithm;
 
 	int core_multi_pack_index;
+
+	unsigned command_requires_full_index:1;
 };
 
 struct repository {
diff --git a/sparse-index.c b/sparse-index.c
new file mode 100644
index 00000000000..82183ead563
--- /dev/null
+++ b/sparse-index.c
@@ -0,0 +1,8 @@
+#include "cache.h"
+#include "repository.h"
+#include "sparse-index.h"
+
+void ensure_full_index(struct index_state *istate)
+{
+	/* intentionally left blank */
+}
diff --git a/sparse-index.h b/sparse-index.h
new file mode 100644
index 00000000000..8dda92032e2
--- /dev/null
+++ b/sparse-index.h
@@ -0,0 +1,7 @@
+#ifndef SPARSE_INDEX_H__
+#define SPARSE_INDEX_H__
+
+struct index_state;
+void ensure_full_index(struct index_state *istate);
+
+#endif
\ No newline at end of file
-- 
gitgitgadget



* [PATCH 02/27] sparse-index: implement ensure_full_index()
  2021-01-25 17:41 [PATCH 00/27] [RFC] Sparse Index Derrick Stolee via GitGitGadget
  2021-01-25 17:41 ` [PATCH 01/27] sparse-index: add guard to ensure full index Derrick Stolee via GitGitGadget
@ 2021-01-25 17:41 ` Derrick Stolee via GitGitGadget
  2021-01-27  3:05   ` Elijah Newren
  2021-01-25 17:41 ` [PATCH 03/27] t1092: compare sparse-checkout to sparse-index Derrick Stolee via GitGitGadget
                   ` (26 subsequent siblings)
  28 siblings, 1 reply; 61+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-25 17:41 UTC (permalink / raw)
  To: git
  Cc: gitster, newren, peff, jrnieder, sunshine, pclouds,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

We will mark an in-memory index_state as having sparse directory entries
with the sparse_index bit. These currently cannot exist, but we will add
a mechanism for collapsing a full index to a sparse one in a later
change. That will happen at write time, so we must first allow parsing
the format before writing it.

Commands or methods that require a full index in order to operate can
call ensure_full_index() to expand that index in-memory. This requires
parsing trees using that index's repository.
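
The expansion step can be pictured with a toy tree walk in C. This is only
a sketch under simplifying assumptions: struct toy_tree_item and
toy_expand() are hypothetical, trees live in memory instead of the object
database, and paths are limited to 63 bytes. It shows the idea of replacing
one directory entry with one entry per file beneath it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct toy_tree_item {
	const char *name;
	const struct toy_tree_item *children;	/* NULL for a file */
	size_t child_nr;
};

/*
 * Append the path of every file below 'items' to out[], each prefixed
 * with 'base' (which ends in '/'). Returns the new number of paths.
 */
size_t toy_expand(const char *base,
		  const struct toy_tree_item *items, size_t nr,
		  char out[][64], size_t out_nr, size_t out_cap)
{
	for (size_t i = 0; i < nr; i++) {
		if (items[i].children) {
			char path[64];

			/* directory: recurse with a longer prefix */
			snprintf(path, sizeof(path), "%s%s/",
				 base, items[i].name);
			out_nr = toy_expand(path, items[i].children,
					    items[i].child_nr,
					    out, out_nr, out_cap);
		} else {
			/* file: becomes one full index entry */
			assert(out_nr < out_cap);
			snprintf(out[out_nr], sizeof(out[out_nr]), "%s%s",
				 base, items[i].name);
			out_nr++;
		}
	}
	return out_nr;
}
```

Entries are emitted in tree order, so a sorted tree yields paths that can be
appended to the expanded index without re-sorting.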

Sparse directory entries have a specific 'ce_mode' value. The macro
S_ISSPARSEDIR(ce) can check if a cache_entry 'ce' has this type. This
ce_mode is not possible with the existing index formats, so the macro
checks only the mode rather than every property of a sparse-directory
entry. Those properties are:

 1. ce->ce_mode == 01000755
 2. ce->ce_flags & CE_SKIP_WORKTREE is true
 3. ce->name[ce->ce_namelen - 1] == '/' (ends in dir separator)
 4. ce->oid references a tree object.

These are only partially enforced in ensure_full_index(). Any deviation
will cause a warning at minimum or a failure in the worst case.
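
For reference, the four properties listed above could be checked together
as in the following sketch. It is not part of this patch: struct toy_entry,
the TOY_* constants and toy_check_sparse_dir() are hypothetical, the bit
position chosen for the skip-worktree flag is arbitrary, and the object
type is stored directly instead of being looked up from the object id.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_SPARSE_DIR_MODE 01000755u
#define TOY_SKIP_WORKTREE   (1u << 0)	/* arbitrary bit for this sketch */

enum toy_obj_type { TOY_BLOB, TOY_TREE };

struct toy_entry {
	unsigned int mode;
	unsigned int flags;
	const char *name;
	enum toy_obj_type oid_type;
};

/*
 * Return 0 if all four properties hold, otherwise the number (1-4) of
 * the first property that fails.
 */
int toy_check_sparse_dir(const struct toy_entry *e)
{
	size_t len;

	assert(e && e->name);
	len = strlen(e->name);

	if (e->mode != TOY_SPARSE_DIR_MODE)
		return 1;
	if (!(e->flags & TOY_SKIP_WORKTREE))
		return 2;
	if (!len || e->name[len - 1] != '/')
		return 3;
	if (e->oid_type != TOY_TREE)
		return 4;
	return 0;
}
```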

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 cache.h        | 11 +++++-
 read-cache.c   |  9 +++++
 sparse-index.c | 94 +++++++++++++++++++++++++++++++++++++++++++++++++-
 sparse-index.h |  1 +
 4 files changed, 113 insertions(+), 2 deletions(-)

diff --git a/cache.h b/cache.h
index f9c7a603841..884046ca5b8 100644
--- a/cache.h
+++ b/cache.h
@@ -204,6 +204,10 @@ struct cache_entry {
 #error "CE_EXTENDED_FLAGS out of range"
 #endif
 
+#define CE_MODE_SPARSE_DIRECTORY 01000755
+#define SPARSE_DIR_MODE 0100
+#define S_ISSPARSEDIR(m) ((m)->ce_mode == CE_MODE_SPARSE_DIRECTORY)
+
 /* Forward structure decls */
 struct pathspec;
 struct child_process;
@@ -249,6 +253,8 @@ static inline unsigned int create_ce_mode(unsigned int mode)
 {
 	if (S_ISLNK(mode))
 		return S_IFLNK;
+	if (mode == SPARSE_DIR_MODE)
+		return CE_MODE_SPARSE_DIRECTORY;
 	if (S_ISDIR(mode) || S_ISGITLINK(mode))
 		return S_IFGITLINK;
 	return S_IFREG | ce_permissions(mode);
@@ -319,7 +325,8 @@ struct index_state {
 		 drop_cache_tree : 1,
 		 updated_workdir : 1,
 		 updated_skipworktree : 1,
-		 fsmonitor_has_run_once : 1;
+		 fsmonitor_has_run_once : 1,
+		 sparse_index : 1;
 	struct hashmap name_hash;
 	struct hashmap dir_hash;
 	struct object_id oid;
@@ -721,6 +728,8 @@ int read_index_from(struct index_state *, const char *path,
 		    const char *gitdir);
 int is_index_unborn(struct index_state *);
 
+void ensure_full_index(struct index_state *istate);
+
 /* For use with `write_locked_index()`. */
 #define COMMIT_LOCK		(1 << 0)
 #define SKIP_IF_UNCHANGED	(1 << 1)
diff --git a/read-cache.c b/read-cache.c
index ecf6f689940..1097ecbf132 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -101,6 +101,9 @@ static const char *alternate_index_output;
 
 static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
 {
+	if (S_ISSPARSEDIR(ce))
+		istate->sparse_index = 1;
+
 	istate->cache[nr] = ce;
 	add_name_hash(istate, ce);
 }
@@ -2255,6 +2258,12 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist)
 	trace2_data_intmax("index", the_repository, "read/cache_nr",
 			   istate->cache_nr);
 
+	if (!istate->repo)
+		istate->repo = the_repository;
+	prepare_repo_settings(istate->repo);
+	if (istate->repo->settings.command_requires_full_index)
+		ensure_full_index(istate);
+
 	return istate->cache_nr;
 
 unmap:
diff --git a/sparse-index.c b/sparse-index.c
index 82183ead563..1e70244dc13 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -1,8 +1,100 @@
 #include "cache.h"
 #include "repository.h"
 #include "sparse-index.h"
+#include "tree.h"
+#include "pathspec.h"
+#include "trace2.h"
+
+static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
+{
+	ALLOC_GROW(istate->cache, nr + 1, istate->cache_alloc);
+
+	istate->cache[nr] = ce;
+	add_name_hash(istate, ce);
+}
+
+static int add_path_to_index(const struct object_id *oid,
+				struct strbuf *base, const char *path,
+				unsigned int mode, int stage, void *context)
+{
+	struct index_state *istate = (struct index_state *)context;
+	struct cache_entry *ce;
+	size_t len = base->len;
+
+	if (S_ISDIR(mode))
+		return READ_TREE_RECURSIVE;
+
+	strbuf_addstr(base, path);
+
+	ce = make_cache_entry(istate, mode, oid, base->buf, 0, 0);
+	ce->ce_flags |= CE_SKIP_WORKTREE;
+	set_index_entry(istate, istate->cache_nr++, ce);
+
+	strbuf_setlen(base, len);
+	return 0;
+}
 
 void ensure_full_index(struct index_state *istate)
 {
-	/* intentionally left blank */
+	int i;
+	struct index_state *full;
+
+	if (!istate || !istate->sparse_index)
+		return;
+
+	if (!istate->repo)
+		istate->repo = the_repository;
+
+	trace2_region_enter("index", "ensure_full_index", istate->repo);
+
+	/* initialize basics of new index */
+	full = xcalloc(1, sizeof(struct index_state));
+	memcpy(full, istate, sizeof(struct index_state));
+
+	/* then change the necessary things */
+	full->sparse_index = 0;
+	full->cache_alloc = (3 * istate->cache_alloc) / 2;
+	full->cache_nr = 0;
+	ALLOC_ARRAY(full->cache, full->cache_alloc);
+
+	for (i = 0; i < istate->cache_nr; i++) {
+		struct cache_entry *ce = istate->cache[i];
+		struct tree *tree;
+		struct pathspec ps;
+
+		if (!S_ISSPARSEDIR(ce)) {
+			set_index_entry(full, full->cache_nr++, ce);
+			continue;
+		}
+		if (!(ce->ce_flags & CE_SKIP_WORKTREE))
+			warning(_("index entry is a directory, but not sparse (%08x)"),
+				ce->ce_flags);
+
+		/* recursively walk into ce->name */
+		tree = lookup_tree(istate->repo, &ce->oid);
+
+		memset(&ps, 0, sizeof(ps));
+		ps.recursive = 1;
+		ps.has_wildcard = 1;
+		ps.max_depth = -1;
+
+		read_tree_recursive(istate->repo, tree,
+				    ce->name, strlen(ce->name),
+				    0, &ps,
+				    add_path_to_index, full);
+
+		/* free directory entries. full entries are re-used */
+		discard_cache_entry(ce);
+	}
+
+	/* Copy back into original index. */
+	memcpy(&istate->name_hash, &full->name_hash, sizeof(full->name_hash));
+	istate->sparse_index = 0;
+	istate->cache = full->cache;
+	istate->cache_nr = full->cache_nr;
+	istate->cache_alloc = full->cache_alloc;
+
+	free(full);
+
+	trace2_region_leave("index", "ensure_full_index", istate->repo);
 }
diff --git a/sparse-index.h b/sparse-index.h
index 8dda92032e2..a2777dcac59 100644
--- a/sparse-index.h
+++ b/sparse-index.h
@@ -3,5 +3,6 @@
 
 struct index_state;
 void ensure_full_index(struct index_state *istate);
+int convert_to_sparse(struct index_state *istate);
 
 #endif
\ No newline at end of file
-- 
gitgitgadget



* [PATCH 03/27] t1092: compare sparse-checkout to sparse-index
  2021-01-25 17:41 [PATCH 00/27] [RFC] Sparse Index Derrick Stolee via GitGitGadget
  2021-01-25 17:41 ` [PATCH 01/27] sparse-index: add guard to ensure full index Derrick Stolee via GitGitGadget
  2021-01-25 17:41 ` [PATCH 02/27] sparse-index: implement ensure_full_index() Derrick Stolee via GitGitGadget
@ 2021-01-25 17:41 ` Derrick Stolee via GitGitGadget
  2021-01-27  3:08   ` Elijah Newren
  2021-01-25 17:41 ` [PATCH 04/27] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
                   ` (25 subsequent siblings)
  28 siblings, 1 reply; 61+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-25 17:41 UTC (permalink / raw)
  To: git
  Cc: gitster, newren, peff, jrnieder, sunshine, pclouds,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Add a new 'sparse-index' repo alongside the 'full-checkout' and
'sparse-checkout' repos in t1092-sparse-checkout-compatibility.sh. Also
add run_on_sparse and test_sparse_match helpers. These helpers will be
used when the sparse index is implemented.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t1092-sparse-checkout-compatibility.sh | 29 ++++++++++++++++++++----
 1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 8cd3e5a8d22..8876eae0fe3 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -7,6 +7,7 @@ test_description='compare full workdir to sparse workdir'
 test_expect_success 'setup' '
 	git init initial-repo &&
 	(
+		(GIT_TEST_SPARSE_INDEX=0 && export GIT_TEST_SPARSE_INDEX) &&
 		cd initial-repo &&
 		echo a >a &&
 		echo "after deep" >e &&
@@ -87,23 +88,32 @@ init_repos () {
 
 	cp -r initial-repo sparse-checkout &&
 	git -C sparse-checkout reset --hard &&
-	git -C sparse-checkout sparse-checkout init --cone &&
+
+	cp -r initial-repo sparse-index &&
+	git -C sparse-index reset --hard &&
 
 	# initialize sparse-checkout definitions
-	git -C sparse-checkout sparse-checkout set deep
+	git -C sparse-checkout sparse-checkout init --cone &&
+	git -C sparse-checkout sparse-checkout set deep &&
+	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout init --cone &&
+	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep
 }
 
 run_on_sparse () {
 	(
 		cd sparse-checkout &&
-		$* >../sparse-checkout-out 2>../sparse-checkout-err
+		GIT_TEST_SPARSE_INDEX=0 $* >../sparse-checkout-out 2>../sparse-checkout-err
+	) &&
+	(
+		cd sparse-index &&
+		$* >../sparse-index-out 2>../sparse-index-err
 	)
 }
 
 run_on_all () {
 	(
 		cd full-checkout &&
-		$* >../full-checkout-out 2>../full-checkout-err
+		GIT_TEST_SPARSE_INDEX=0 $* >../full-checkout-out 2>../full-checkout-err
 	) &&
 	run_on_sparse $*
 }
@@ -114,6 +124,17 @@ test_all_match () {
 	test_cmp full-checkout-err sparse-checkout-err
 }
 
+test_sparse_match () {
+	run_on_sparse $* &&
+	test_cmp sparse-checkout-out sparse-index-out &&
+	test_cmp sparse-checkout-err sparse-index-err
+}
+
+test_expect_success 'expanded in-memory index matches full index' '
+	init_repos &&
+	test_sparse_match test-tool read-cache --expand --table-no-stat
+'
+
 test_expect_success 'status with options' '
 	init_repos &&
 	test_all_match git status --porcelain=v2 &&
-- 
gitgitgadget



* [PATCH 04/27] test-read-cache: print cache entries with --table
  2021-01-25 17:41 [PATCH 00/27] [RFC] Sparse Index Derrick Stolee via GitGitGadget
                   ` (2 preceding siblings ...)
  2021-01-25 17:41 ` [PATCH 03/27] t1092: compare sparse-checkout to sparse-index Derrick Stolee via GitGitGadget
@ 2021-01-25 17:41 ` Derrick Stolee via GitGitGadget
  2021-01-27  3:25   ` Elijah Newren
  2021-01-25 17:41 ` [PATCH 05/27] test-tool: read-cache --table --no-stat Derrick Stolee via GitGitGadget
                   ` (24 subsequent siblings)
  28 siblings, 1 reply; 61+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-25 17:41 UTC (permalink / raw)
  To: git
  Cc: gitster, newren, peff, jrnieder, sunshine, pclouds,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

This table is helpful for discovering data in the index to ensure it is
being written correctly, especially as we build and test the
sparse-index.

To make the option parsing slightly more robust, wrap the string
comparisons in a loop adapted from test-dir-iterator.c.

Care must be taken with the final check for the 'cnt' variable. We
continue the expectation that the numerical value is the final argument.
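
The parsing rule described here (leading "--" options in any order, then an
optional trailing count) can be sketched on its own as below. This is a
standalone illustration, not the helper's code: struct toy_opts and
toy_parse() are hypothetical, and the loop tests argc instead of relying on
a NULL-terminated argv.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct toy_opts {
	int table;
	const char *name;	/* value of --print-and-refresh=, or NULL */
	long cnt;		/* trailing count; defaults to 1 */
};

void toy_parse(int argc, const char **argv, struct toy_opts *o)
{
	static const char refresh[] = "--print-and-refresh=";
	const size_t refresh_len = sizeof(refresh) - 1;

	assert(argc >= 1 && argv && o);
	o->table = 0;
	o->name = NULL;
	o->cnt = 1;

	/* skip argv[0], then consume every leading "--" option */
	for (++argv, --argc; argc > 0 && !strncmp(*argv, "--", 2);
	     ++argv, --argc) {
		if (!strncmp(*argv, refresh, refresh_len)) {
			o->name = *argv + refresh_len;
			continue;
		}
		if (!strcmp(*argv, "--table"))
			o->table = 1;
	}

	/* exactly one argument left over: it is the count */
	if (argc == 1)
		o->cnt = strtol(argv[0], NULL, 0);
}
```

Because argc and argv are advanced together, the final check compares
against 1 and reads argv[0], where the code before this patch compared
against 2 and read argv[1].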

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/helper/test-read-cache.c | 49 ++++++++++++++++++++++++++++++++++----
 1 file changed, 44 insertions(+), 5 deletions(-)

diff --git a/t/helper/test-read-cache.c b/t/helper/test-read-cache.c
index 244977a29bd..cd7d106a675 100644
--- a/t/helper/test-read-cache.c
+++ b/t/helper/test-read-cache.c
@@ -2,18 +2,55 @@
 #include "cache.h"
 #include "config.h"
 
+static void print_cache_entry(struct cache_entry *ce)
+{
+	/* stat info */
+	printf("%08x %08x %08x %08x %08x %08x ",
+	       ce->ce_stat_data.sd_ctime.sec,
+	       ce->ce_stat_data.sd_ctime.nsec,
+	       ce->ce_stat_data.sd_mtime.sec,
+	       ce->ce_stat_data.sd_mtime.nsec,
+	       ce->ce_stat_data.sd_dev,
+	       ce->ce_stat_data.sd_ino);
+
+	/* mode in binary */
+	printf("0b%d%d%d%d ",
+		(ce->ce_mode >> 15) & 1,
+		(ce->ce_mode >> 14) & 1,
+		(ce->ce_mode >> 13) & 1,
+		(ce->ce_mode >> 12) & 1);
+
+	/* output permissions? */
+	printf("%04o ", ce->ce_mode & 01777);
+
+	printf("%s ", oid_to_hex(&ce->oid));
+
+	printf("%s\n", ce->name);
+}
+
+static void print_cache(struct index_state *cache)
+{
+	int i;
+	for (i = 0; i < the_index.cache_nr; i++)
+		print_cache_entry(the_index.cache[i]);
+}
+
 int cmd__read_cache(int argc, const char **argv)
 {
 	int i, cnt = 1;
 	const char *name = NULL;
+	int table = 0;
 
-	if (argc > 1 && skip_prefix(argv[1], "--print-and-refresh=", &name)) {
-		argc--;
-		argv++;
+	for (++argv, --argc; *argv && starts_with(*argv, "--"); ++argv, --argc) {
+		if (skip_prefix(*argv, "--print-and-refresh=", &name))
+			continue;
+		if (!strcmp(*argv, "--table")) {
+			table = 1;
+		}
 	}
 
-	if (argc == 2)
-		cnt = strtol(argv[1], NULL, 0);
+	if (argc == 1)
+		cnt = strtol(argv[0], NULL, 0);
 	setup_git_directory();
 	git_config(git_default_config, NULL);
 	for (i = 0; i < cnt; i++) {
@@ -30,6 +67,8 @@ int cmd__read_cache(int argc, const char **argv)
 			       ce_uptodate(the_index.cache[pos]) ? "" : " not");
 			write_file(name, "%d\n", i);
 		}
+		if (table)
+			print_cache(&the_index);
 		discard_cache();
 	}
 	return 0;
-- 
gitgitgadget



* [PATCH 05/27] test-tool: read-cache --table --no-stat
  2021-01-25 17:41 [PATCH 00/27] [RFC] Sparse Index Derrick Stolee via GitGitGadget
                   ` (3 preceding siblings ...)
  2021-01-25 17:41 ` [PATCH 04/27] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
@ 2021-01-25 17:41 ` Derrick Stolee via GitGitGadget
  2021-01-25 17:41 ` [PATCH 06/27] test-tool: don't force full index Derrick Stolee via GitGitGadget
                   ` (23 subsequent siblings)
  28 siblings, 0 replies; 61+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-25 17:41 UTC (permalink / raw)
  To: git
  Cc: gitster, newren, peff, jrnieder, sunshine, pclouds,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The 'test-tool read-cache --table' output is helpful to understand the
full contents of the index entries on-disk. This is particularly helpful
when trying to diagnose issues with a real repository example.

However, for test cases we might want to compare the index contents of
two repositories that were updated in similar ways, but will not
actually share the same stat data. Add the '--no-stat' option to remove
the timestamps and other stat data from the output. This allows us to
compare index contents directly.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/helper/test-read-cache.c               | 44 ++++++++++++++----------
 t/t1092-sparse-checkout-compatibility.sh |  2 +-
 2 files changed, 26 insertions(+), 20 deletions(-)

diff --git a/t/helper/test-read-cache.c b/t/helper/test-read-cache.c
index cd7d106a675..f858d0d0a0c 100644
--- a/t/helper/test-read-cache.c
+++ b/t/helper/test-read-cache.c
@@ -2,16 +2,18 @@
 #include "cache.h"
 #include "config.h"
 
-static void print_cache_entry(struct cache_entry *ce)
+static void print_cache_entry(struct cache_entry *ce, unsigned stat)
 {
-	/* stat info */
-	printf("%08x %08x %08x %08x %08x %08x ",
-	       ce->ce_stat_data.sd_ctime.sec,
-	       ce->ce_stat_data.sd_ctime.nsec,
-	       ce->ce_stat_data.sd_mtime.sec,
-	       ce->ce_stat_data.sd_mtime.nsec,
-	       ce->ce_stat_data.sd_dev,
-	       ce->ce_stat_data.sd_ino);
+	if (stat) {
+		/* stat info */
+		printf("%08x %08x %08x %08x %08x %08x ",
+		       ce->ce_stat_data.sd_ctime.sec,
+		       ce->ce_stat_data.sd_ctime.nsec,
+		       ce->ce_stat_data.sd_mtime.sec,
+		       ce->ce_stat_data.sd_mtime.nsec,
+		       ce->ce_stat_data.sd_dev,
+		       ce->ce_stat_data.sd_ino);
+	}
 
 	/* mode in binary */
 	printf("0b%d%d%d%d ",
@@ -28,48 +30,52 @@ static void print_cache_entry(struct cache_entry *ce)
 	printf("%s\n", ce->name);
 }
 
-static void print_cache(struct index_state *cache)
+static void print_cache(struct index_state *cache, unsigned stat)
 {
 	int i;
 	for (i = 0; i < the_index.cache_nr; i++)
-		print_cache_entry(the_index.cache[i]);
+		print_cache_entry(the_index.cache[i], stat);
 }
 
 int cmd__read_cache(int argc, const char **argv)
 {
+	struct repository *r = the_repository;
 	int i, cnt = 1;
 	const char *name = NULL;
 	int table = 0;
+	int stat = 1;
 
 	for (++argv, --argc; *argv && starts_with(*argv, "--"); ++argv, --argc) {
 		if (skip_prefix(*argv, "--print-and-refresh=", &name))
 			continue;
-		if (!strcmp(*argv, "--table")) {
+		if (!strcmp(*argv, "--table"))
 			table = 1;
-		}
+		else if (!strcmp(*argv, "--no-stat"))
+			stat = 0;
 	}
 
 	if (argc == 1)
 		cnt = strtol(argv[0], NULL, 0);
 	setup_git_directory();
 	git_config(git_default_config, NULL);
+
 	for (i = 0; i < cnt; i++) {
-		read_cache();
+		repo_read_index(r);
 		if (name) {
 			int pos;
 
-			refresh_index(&the_index, REFRESH_QUIET,
+			refresh_index(r->index, REFRESH_QUIET,
 				      NULL, NULL, NULL);
-			pos = index_name_pos(&the_index, name, strlen(name));
+			pos = index_name_pos(r->index, name, strlen(name));
 			if (pos < 0)
 				die("%s not in index", name);
 			printf("%s is%s up to date\n", name,
-			       ce_uptodate(the_index.cache[pos]) ? "" : " not");
+			       ce_uptodate(r->index->cache[pos]) ? "" : " not");
 			write_file(name, "%d\n", i);
 		}
 		if (table)
-			print_cache(&the_index);
-		discard_cache();
+			print_cache(r->index, stat);
+		discard_index(r->index);
 	}
 	return 0;
 }
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 8876eae0fe3..3aa9b0d21b4 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -132,7 +132,7 @@ test_sparse_match () {
 
 test_expect_success 'expanded in-memory index matches full index' '
 	init_repos &&
-	test_sparse_match test-tool read-cache --expand --table-no-stat
+	test_sparse_match test-tool read-cache --expand --table --no-stat
 '
 
 test_expect_success 'status with options' '
-- 
gitgitgadget


* [PATCH 06/27] test-tool: don't force full index
  2021-01-25 17:41 [PATCH 00/27] [RFC] Sparse Index Derrick Stolee via GitGitGadget
                   ` (4 preceding siblings ...)
  2021-01-25 17:41 ` [PATCH 05/27] test-tool: read-cache --table --no-stat Derrick Stolee via GitGitGadget
@ 2021-01-25 17:41 ` Derrick Stolee via GitGitGadget
  2021-01-25 17:41 ` [PATCH 07/27] unpack-trees: ensure " Derrick Stolee via GitGitGadget
                   ` (22 subsequent siblings)
  28 siblings, 0 replies; 61+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-25 17:41 UTC (permalink / raw)
  To: git
  Cc: gitster, newren, peff, jrnieder, sunshine, pclouds,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

We will use 'test-tool read-cache --table' to check that a sparse
index is written as part of init_repos. Since we will no longer always
expand a sparse index into a full index, add an '--expand' parameter
that adds a call to ensure_full_index() so we can compare a sparse index
directly against a full index, or at least what the in-memory index
looks like when expanded in this way.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/helper/test-read-cache.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/t/helper/test-read-cache.c b/t/helper/test-read-cache.c
index f858d0d0a0c..3c45dfeb3cb 100644
--- a/t/helper/test-read-cache.c
+++ b/t/helper/test-read-cache.c
@@ -1,6 +1,7 @@
 #include "test-tool.h"
 #include "cache.h"
 #include "config.h"
+#include "sparse-index.h"
 
 static void print_cache_entry(struct cache_entry *ce, unsigned stat)
 {
@@ -44,6 +45,11 @@ int cmd__read_cache(int argc, const char **argv)
 	const char *name = NULL;
 	int table = 0;
 	int stat = 1;
+	int expand = 0;
+
+	initialize_the_repository();
+	prepare_repo_settings(r);
+	r->settings.command_requires_full_index = 0;
 
 	for (++argv, --argc; *argv && starts_with(*argv, "--"); ++argv, --argc) {
 		if (skip_prefix(*argv, "--print-and-refresh=", &name))
@@ -52,6 +58,8 @@ int cmd__read_cache(int argc, const char **argv)
 			table = 1;
 		else if (!strcmp(*argv, "--no-stat"))
 			stat = 0;
+		else if (!strcmp(*argv, "--expand"))
+			expand = 1;
 	}
 
 	if (argc == 1)
@@ -61,6 +69,10 @@ int cmd__read_cache(int argc, const char **argv)
 
 	for (i = 0; i < cnt; i++) {
 		repo_read_index(r);
+
+		if (expand)
+			ensure_full_index(r->index);
+
 		if (name) {
 			int pos;
 
-- 
gitgitgadget


* [PATCH 07/27] unpack-trees: ensure full index
  2021-01-25 17:41 [PATCH 00/27] [RFC] Sparse Index Derrick Stolee via GitGitGadget
                   ` (5 preceding siblings ...)
  2021-01-25 17:41 ` [PATCH 06/27] test-tool: don't force full index Derrick Stolee via GitGitGadget
@ 2021-01-25 17:41 ` Derrick Stolee via GitGitGadget
  2021-01-27  4:43   ` Elijah Newren
  2021-01-25 17:41 ` [PATCH 08/27] sparse-checkout: hold pattern list in index Derrick Stolee via GitGitGadget
                   ` (21 subsequent siblings)
  28 siblings, 1 reply; 61+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-25 17:41 UTC (permalink / raw)
  To: git
  Cc: gitster, newren, peff, jrnieder, sunshine, pclouds,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The next change will translate full indexes into sparse indexes at write
time. The existing logic provides a way for every sparse index to be
expanded to a full index at read time. However, there are cases where an
index is written and then continues to be used in-memory to perform
further updates.

unpack_trees() is frequently called after such a write. In particular,
commands like 'git reset' do this double-update of the index.

Ensure that we have a full index when entering unpack_trees(), but only
when command_requires_full_index is true. This is always true at the
moment, but we will later relax that after unpack_trees() is updated to
handle sparse directory entries.
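
As an illustration only, the guard can be modeled with toy types. This
is a sketch, not the code in this patch: 'toy_index', 'toy_settings'
and the 'toy_*' functions are hypothetical stand-ins for
'struct index_state', the repo settings and ensure_full_index(), and
'expand_calls' exists only to make the effect observable. It assumes
that expanding an already-full index is a no-op.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for Git's structures; not the real definitions. */
struct toy_index {
	int sparse_index;  /* nonzero while sparse directory entries exist */
	int expand_calls;  /* counts real expansions, for illustration only */
};

struct toy_settings {
	int command_requires_full_index;
};

/* Assumed to be idempotent: a full (or missing) index is left alone. */
static void toy_ensure_full_index(struct toy_index *istate)
{
	if (!istate || !istate->sparse_index)
		return;
	istate->sparse_index = 0;
	istate->expand_calls++;
}

/* Models the guard added at the top of unpack_trees(). */
static void toy_unpack_trees_prologue(const struct toy_settings *s,
				      struct toy_index *src,
				      struct toy_index *dst)
{
	assert(s != NULL);
	if (s->command_requires_full_index) {
		toy_ensure_full_index(src);
		toy_ensure_full_index(dst);
	}
}
```

With the setting at 0 the indexes are left untouched; with it at 1 a
sparse index is expanded once and later calls do nothing.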

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 unpack-trees.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/unpack-trees.c b/unpack-trees.c
index f5f668f532d..4dd99219073 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -1567,6 +1567,7 @@ static int verify_absent(const struct cache_entry *,
  */
 int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options *o)
 {
+	struct repository *repo = the_repository;
 	int i, ret;
 	static struct cache_entry *dfc;
 	struct pattern_list pl;
@@ -1578,6 +1579,12 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
 	trace_performance_enter();
 	trace2_region_enter("unpack_trees", "unpack_trees", the_repository);
 
+	prepare_repo_settings(repo);
+	if (repo->settings.command_requires_full_index) {
+		ensure_full_index(o->src_index);
+		ensure_full_index(o->dst_index);
+	}
+
 	if (!core_apply_sparse_checkout || !o->update)
 		o->skip_sparse_checkout = 1;
 	if (!o->skip_sparse_checkout && !o->pl) {
-- 
gitgitgadget


* [PATCH 08/27] sparse-checkout: hold pattern list in index
  2021-01-25 17:41 [PATCH 00/27] [RFC] Sparse Index Derrick Stolee via GitGitGadget
                   ` (6 preceding siblings ...)
  2021-01-25 17:41 ` [PATCH 07/27] unpack-trees: ensure " Derrick Stolee via GitGitGadget
@ 2021-01-25 17:41 ` Derrick Stolee via GitGitGadget
  2021-01-27 17:00   ` Elijah Newren
  2021-01-25 17:41 ` [PATCH 09/27] sparse-index: convert from full to sparse Derrick Stolee via GitGitGadget
                   ` (20 subsequent siblings)
  28 siblings, 1 reply; 61+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-25 17:41 UTC (permalink / raw)
  To: git
  Cc: gitster, newren, peff, jrnieder, sunshine, pclouds,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

As we modify the sparse-checkout definition, we perform index operations
on a pattern_list that only exists in-memory. This allows easy backing
out in case the index update fails.

However, if the index write itself cares about the sparse-checkout
pattern set, we need access to that in-memory copy. Place a pointer to
a 'struct pattern_list' in the index so we can access this on-demand.
This will be used in the next change which uses the sparse-checkout
definition to filter out directories that are outside the sparse cone.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/sparse-checkout.c | 17 ++++++++++-------
 cache.h                   |  2 ++
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
index 2306a9ad98e..e00b82af727 100644
--- a/builtin/sparse-checkout.c
+++ b/builtin/sparse-checkout.c
@@ -110,6 +110,8 @@ static int update_working_directory(struct pattern_list *pl)
 	if (is_index_unborn(r->index))
 		return UPDATE_SPARSITY_SUCCESS;
 
+	r->index->sparse_checkout_patterns = pl;
+
 	memset(&o, 0, sizeof(o));
 	o.verbose_update = isatty(2);
 	o.update = 1;
@@ -138,6 +140,7 @@ static int update_working_directory(struct pattern_list *pl)
 	else
 		rollback_lock_file(&lock_file);
 
+	r->index->sparse_checkout_patterns = NULL;
 	return result;
 }
 
@@ -517,19 +520,18 @@ static int modify_pattern_list(int argc, const char **argv, enum modify_type m)
 {
 	int result;
 	int changed_config = 0;
-	struct pattern_list pl;
-	memset(&pl, 0, sizeof(pl));
+	struct pattern_list *pl = xcalloc(1, sizeof(*pl));
 
 	switch (m) {
 	case ADD:
 		if (core_sparse_checkout_cone)
-			add_patterns_cone_mode(argc, argv, &pl);
+			add_patterns_cone_mode(argc, argv, pl);
 		else
-			add_patterns_literal(argc, argv, &pl);
+			add_patterns_literal(argc, argv, pl);
 		break;
 
 	case REPLACE:
-		add_patterns_from_input(&pl, argc, argv);
+		add_patterns_from_input(pl, argc, argv);
 		break;
 	}
 
@@ -539,12 +541,13 @@ static int modify_pattern_list(int argc, const char **argv, enum modify_type m)
 		changed_config = 1;
 	}
 
-	result = write_patterns_and_update(&pl);
+	result = write_patterns_and_update(pl);
 
 	if (result && changed_config)
 		set_config(MODE_NO_PATTERNS);
 
-	clear_pattern_list(&pl);
+	clear_pattern_list(pl);
+	free(pl);
 	return result;
 }
 
diff --git a/cache.h b/cache.h
index 884046ca5b8..b05341cc687 100644
--- a/cache.h
+++ b/cache.h
@@ -311,6 +311,7 @@ static inline unsigned int canon_mode(unsigned int mode)
 struct split_index;
 struct untracked_cache;
 struct progress;
+struct pattern_list;
 
 struct index_state {
 	struct cache_entry **cache;
@@ -336,6 +337,7 @@ struct index_state {
 	struct mem_pool *ce_mem_pool;
 	struct progress *progress;
 	struct repository *repo;
+	struct pattern_list *sparse_checkout_patterns;
 };
 
 /* Name hashing */
-- 
gitgitgadget


* [PATCH 09/27] sparse-index: convert from full to sparse
  2021-01-25 17:41 [PATCH 00/27] [RFC] Sparse Index Derrick Stolee via GitGitGadget
                   ` (7 preceding siblings ...)
  2021-01-25 17:41 ` [PATCH 08/27] sparse-checkout: hold pattern list in index Derrick Stolee via GitGitGadget
@ 2021-01-25 17:41 ` Derrick Stolee via GitGitGadget
  2021-01-27 17:30   ` Elijah Newren
  2021-01-25 17:41 ` [PATCH 10/27] submodule: sparse-index should not collapse links Derrick Stolee via GitGitGadget
                   ` (19 subsequent siblings)
  28 siblings, 1 reply; 61+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-25 17:41 UTC (permalink / raw)
  To: git
  Cc: gitster, newren, peff, jrnieder, sunshine, pclouds,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

If we have a full index, then we can convert it to a sparse index by
replacing directories outside of the sparse cone with sparse directory
entries. The convert_to_sparse() method does this, when the situation is
appropriate.

For now, we avoid converting the index to a sparse index if:

 1. the index is split.
 2. the index is already sparse.
 3. sparse-checkout is disabled.
 4. sparse-checkout does not use cone mode.

Finally, we currently limit the conversion to when the
GIT_TEST_SPARSE_INDEX environment variable is enabled. A mode using Git
config will be added in a later change.

The trickiest thing about this conversion is that we might not be able
to mark a directory as a sparse directory just because it is outside the
sparse cone. There might be unmerged files within that directory, so we
need to look for those. Also, if there is some strange reason why a file
is not marked with CE_SKIP_WORKTREE, then we should give up on
converting that directory. There is still hope that some of its
subdirectories might be able to convert to sparse, so we keep looking
deeper.
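
The per-directory decision described above might be sketched as
follows. This is a simplified model and makes several assumptions: it
finds subdirectory spans by comparing path prefixes instead of using
the cache-tree, it omits the pattern-list check and relies on the
entry flags alone, and it never collapses the root. All 'toy_*' names
are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOY_PATH_MAX 256

struct toy_entry {
	const char *name;   /* path, sorted as in the index */
	int skip_worktree;  /* stands in for CE_SKIP_WORKTREE */
	int stage;          /* nonzero means unmerged */
};

struct toy_out {
	char name[TOY_PATH_MAX];
	int is_sparse_dir;
};

/*
 * Process in[start..end), all of which live under the prefix 'dir'
 * of length 'dirlen' (0 for the root, otherwise ending in '/').
 * Appends to out[] at n_out and returns the new number of outputs.
 */
static int toy_collapse(const struct toy_entry *in, int start, int end,
			const char *dir, size_t dirlen,
			struct toy_out *out, int n_out, int cap)
{
	int i, can_convert = dirlen > 0; /* the root is never collapsed */

	/* Give up on this directory if anything is unmerged or populated. */
	for (i = start; can_convert && i < end; i++)
		if (in[i].stage || !in[i].skip_worktree)
			can_convert = 0;

	if (can_convert) {
		assert(n_out < cap);
		snprintf(out[n_out].name, TOY_PATH_MAX, "%.*s",
			 (int)dirlen, dir);
		out[n_out].is_sparse_dir = 1;
		return n_out + 1;
	}

	/* Otherwise keep looking deeper, one child directory at a time. */
	for (i = start; i < end; ) {
		const char *slash = strchr(in[i].name + dirlen, '/');
		size_t childlen;
		int j;

		if (!slash) {
			/* A file directly inside this directory is kept. */
			assert(n_out < cap);
			snprintf(out[n_out].name, TOY_PATH_MAX, "%s",
				 in[i].name);
			out[n_out].is_sparse_dir = 0;
			n_out++;
			i++;
			continue;
		}

		/* Find the span of entries sharing the child prefix. */
		childlen = (size_t)(slash - in[i].name) + 1;
		for (j = i + 1; j < end; j++)
			if (strncmp(in[j].name, in[i].name, childlen))
				break;

		n_out = toy_collapse(in, i, j, in[i].name, childlen,
				     out, n_out, cap);
		i = j;
	}
	return n_out;
}
```

For example, if 'deep/a' and 'deep/deeper1/a' are populated while
'deep/deeper2/a' is skip-worktree, then 'deep/' cannot collapse but
'deep/deeper2/' still does. A directory containing an unmerged entry
keeps all of its files.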

The conversion process is assisted by the cache-tree extension. This is
calculated from the full index if it does not already exist. We then
abandon the cache-tree as it no longer applies to the newly-sparse
index. Thus, this cache-tree will be recalculated in every
sparse-full-sparse round-trip until we integrate the cache-tree
extension with the sparse index.

We can compare the behavior of the sparse-index in
t1092-sparse-checkout-compatibility.sh by using GIT_TEST_SPARSE_INDEX=1
when operating on the 'sparse-index' repo. We can also compare the two
sparse repos directly, such as comparing their indexes (when expanded to
full in the case of the 'sparse-index' repo). We also verify that the
index is actually populated with sparse directory entries.

The 'checkout and reset (mixed)' test is marked for failure when
comparing a sparse repo to a full repo, but we can compare the two
sparse-checkout cases directly to ensure that we are not changing the
behavior when using a sparse index.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 cache-tree.c                             |   3 +
 read-cache.c                             |  18 ++-
 sparse-index.c                           | 139 +++++++++++++++++++++++
 t/t1092-sparse-checkout-compatibility.sh |  63 +++++++++-
 4 files changed, 218 insertions(+), 5 deletions(-)

diff --git a/cache-tree.c b/cache-tree.c
index 2fb483d3c08..5f07a39e501 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -6,6 +6,7 @@
 #include "object-store.h"
 #include "replace-object.h"
 #include "promisor-remote.h"
+#include "sparse-index.h"
 
 #ifndef DEBUG_CACHE_TREE
 #define DEBUG_CACHE_TREE 0
@@ -442,6 +443,8 @@ int cache_tree_update(struct index_state *istate, int flags)
 	if (i)
 		return i;
 
+	ensure_full_index(istate);
+
 	if (!istate->cache_tree)
 		istate->cache_tree = cache_tree();
 
diff --git a/read-cache.c b/read-cache.c
index 1097ecbf132..0522260416e 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -25,6 +25,7 @@
 #include "fsmonitor.h"
 #include "thread-utils.h"
 #include "progress.h"
+#include "sparse-index.h"
 
 /* Mask for the name length in ce_flags in the on-disk index */
 
@@ -1002,8 +1003,15 @@ int verify_path(const char *path, unsigned mode)
 
 			c = *path++;
 			if ((c == '.' && !verify_dotfile(path, mode)) ||
-			    is_dir_sep(c) || c == '\0')
+			    is_dir_sep(c))
 				return 0;
+			/*
+			 * allow terminating directory separators for
+			 * sparse directory entries.
+			 */
+			if (c == '\0')
+				return mode == CE_MODE_SPARSE_DIRECTORY ||
+				       mode == SPARSE_DIR_MODE;
 		} else if (c == '\\' && protect_ntfs) {
 			if (is_ntfs_dotgit(path))
 				return 0;
@@ -3062,6 +3070,13 @@ static int do_write_locked_index(struct index_state *istate, struct lock_file *l
 {
 	int ret;
 
+	ret = convert_to_sparse(istate);
+
+	if (ret) {
+		warning(_("failed to convert to a sparse-index"));
+		return ret;
+	}
+
 	/*
 	 * TODO trace2: replace "the_repository" with the actual repo instance
 	 * that is associated with the given "istate".
@@ -3165,6 +3180,7 @@ static int write_shared_index(struct index_state *istate,
 	int ret;
 
 	move_cache_to_base_index(istate);
+	convert_to_sparse(istate);
 
 	trace2_region_enter_printf("index", "shared/do_write_index",
 				   the_repository, "%s", (*temp)->filename.buf);
diff --git a/sparse-index.c b/sparse-index.c
index 1e70244dc13..d8f1a5a13d7 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -4,6 +4,145 @@
 #include "tree.h"
 #include "pathspec.h"
 #include "trace2.h"
+#include "cache-tree.h"
+#include "config.h"
+#include "dir.h"
+#include "fsmonitor.h"
+
+static struct cache_entry *construct_sparse_dir_entry(
+				struct index_state *istate,
+				const char *sparse_dir,
+				struct cache_tree *tree)
+{
+	struct cache_entry *de;
+
+	de = make_cache_entry(istate, SPARSE_DIR_MODE, &tree->oid, sparse_dir, 0, 0);
+
+	de->ce_flags |= CE_SKIP_WORKTREE;
+	return de;
+}
+
+/*
+ * Returns the number of entries "inserted" into the index.
+ */
+static int convert_to_sparse_rec(struct index_state *istate,
+				 int num_converted,
+				 int start, int end,
+				 const char *ct_path, size_t ct_pathlen,
+				 struct cache_tree *ct)
+{
+	int i, can_convert = 1;
+	int start_converted = num_converted;
+	enum pattern_match_result match;
+	int dtype;
+	struct strbuf child_path = STRBUF_INIT;
+	struct pattern_list *pl = istate->sparse_checkout_patterns;
+
+	/*
+	 * Is the current path outside of the sparse cone?
+	 * Then check if the region can be replaced by a sparse
+	 * directory entry (everything is sparse and merged).
+	 */
+	match = path_matches_pattern_list(ct_path, ct_pathlen,
+					  NULL, &dtype, pl, istate);
+	if (match != NOT_MATCHED)
+		can_convert = 0;
+
+	for (i = start; can_convert && i < end; i++) {
+		struct cache_entry *ce = istate->cache[i];
+
+		if (ce_stage(ce) ||
+		    !(ce->ce_flags & CE_SKIP_WORKTREE))
+			can_convert = 0;
+	}
+
+	if (can_convert) {
+		struct cache_entry *se;
+		se = construct_sparse_dir_entry(istate, ct_path, ct);
+
+		istate->cache[num_converted++] = se;
+		return 1;
+	}
+
+	for (i = start; i < end; ) {
+		int count, span, pos = -1;
+		const char *base, *slash;
+		struct cache_entry *ce = istate->cache[i];
+
+		/*
+		 * Detect if this is a normal entry outside of any subtree
+		 * entry.
+		 */
+		base = ce->name + ct_pathlen;
+		slash = strchr(base, '/');
+
+		if (slash)
+			pos = cache_tree_subtree_pos(ct, base, slash - base);
+
+		if (pos < 0) {
+			istate->cache[num_converted++] = ce;
+			i++;
+			continue;
+		}
+
+		strbuf_setlen(&child_path, 0);
+		strbuf_add(&child_path, ce->name, slash - ce->name + 1);
+
+		span = ct->down[pos]->cache_tree->entry_count;
+		count = convert_to_sparse_rec(istate,
+					      num_converted, i, i + span,
+					      child_path.buf, child_path.len,
+					      ct->down[pos]->cache_tree);
+		num_converted += count;
+		i += span;
+	}
+
+	strbuf_release(&child_path);
+	return num_converted - start_converted;
+}
+
+int convert_to_sparse(struct index_state *istate)
+{
+	if (istate->split_index || istate->sparse_index ||
+	    !core_apply_sparse_checkout || !core_sparse_checkout_cone)
+		return 0;
+
+	/*
+	 * For now, only create a sparse index with the
+	 * GIT_TEST_SPARSE_INDEX environment variable. We will relax
+	 * this once we have a proper way to opt-in (and later still,
+	 * opt-out).
+	 */
+	if (!git_env_bool("GIT_TEST_SPARSE_INDEX", 0))
+		return 0;
+
+	if (!istate->sparse_checkout_patterns) {
+		istate->sparse_checkout_patterns = xcalloc(1, sizeof(struct pattern_list));
+		if (get_sparse_checkout_patterns(istate->sparse_checkout_patterns) < 0)
+			return 0;
+	}
+
+	if (!istate->sparse_checkout_patterns->use_cone_patterns) {
+		warning(_("attempting to use sparse-index without cone mode"));
+		return -1;
+	}
+
+	if (cache_tree_update(istate, 0)) {
+		warning(_("unable to update cache-tree, staying full"));
+		return -1;
+	}
+
+	remove_fsmonitor(istate);
+
+	trace2_region_enter("index", "convert_to_sparse", istate->repo);
+	istate->cache_nr = convert_to_sparse_rec(istate,
+						 0, 0, istate->cache_nr,
+						 "", 0, istate->cache_tree);
+	istate->drop_cache_tree = 1;
+	istate->sparse_index = 1;
+	trace2_region_leave("index", "convert_to_sparse", istate->repo);
+	return 0;
+}
 
 static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
 {
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 3aa9b0d21b4..22becbaca2e 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -2,6 +2,9 @@
 
 test_description='compare full workdir to sparse workdir'
 
+GIT_TEST_CHECK_CACHE_TREE=0
+GIT_TEST_SPLIT_INDEX=0
+
 . ./test-lib.sh
 
 test_expect_success 'setup' '
@@ -106,7 +109,7 @@ run_on_sparse () {
 	) &&
 	(
 		cd sparse-index &&
-		$* >../sparse-index-out 2>../sparse-index-err
+		GIT_TEST_SPARSE_INDEX=1 $* >../sparse-index-out 2>../sparse-index-err
 	)
 }
 
@@ -121,7 +124,9 @@ run_on_all () {
 test_all_match () {
 	run_on_all $* &&
 	test_cmp full-checkout-out sparse-checkout-out &&
-	test_cmp full-checkout-err sparse-checkout-err
+	test_cmp full-checkout-out sparse-index-out &&
+	test_cmp full-checkout-err sparse-checkout-err &&
+	test_cmp full-checkout-err sparse-index-err
 }
 
 test_sparse_match () {
@@ -130,6 +135,38 @@ test_sparse_match () {
 	test_cmp sparse-checkout-err sparse-index-err
 }
 
+test_expect_success 'sparse-index contents' '
+	init_repos &&
+
+	test-tool -C sparse-index read-cache --table --no-stat >cache &&
+	for dir in folder1 folder2 x
+	do
+		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
+		grep "0b0000 0755 $TREE $dir/" cache \
+			|| return 1
+	done &&
+
+	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set folder1 &&
+
+	test-tool -C sparse-index read-cache --table --no-stat >cache &&
+	for dir in deep folder2 x
+	do
+		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
+		grep "0b0000 0755 $TREE $dir/" cache \
+			|| return 1
+	done &&
+
+	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep/deeper1 &&
+
+	test-tool -C sparse-index read-cache --table --no-stat >cache &&
+	for dir in deep/deeper2 folder1 folder2 x
+	do
+		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
+		grep "0b0000 0755 $TREE $dir/" cache \
+			|| return 1
+	done
+'
+
 test_expect_success 'expanded in-memory index matches full index' '
 	init_repos &&
 	test_sparse_match test-tool read-cache --expand --table --no-stat
@@ -137,6 +174,7 @@ test_expect_success 'expanded in-memory index matches full index' '
 
 test_expect_success 'status with options' '
 	init_repos &&
+	test_sparse_match ls &&
 	test_all_match git status --porcelain=v2 &&
 	test_all_match git status --porcelain=v2 -z -u &&
 	test_all_match git status --porcelain=v2 -uno &&
@@ -169,7 +207,7 @@ test_expect_success 'add, commit, checkout' '
 
 	test_all_match git add -A &&
 	test_all_match git status --porcelain=v2 &&
-	test_all_match git commit -m "Extend README.md" &&
+	test_all_match git commit -m "Extend-README.md" &&
 
 	test_all_match git checkout HEAD~1 &&
 	test_all_match git checkout - &&
@@ -273,6 +311,17 @@ test_expect_failure 'checkout and reset (mixed)' '
 	test_all_match git reset update-folder2
 '
 
+# Ensure that sparse-index behaves identically to
+# sparse-checkout with a full index.
+test_expect_success 'checkout and reset (mixed) [sparse]' '
+	init_repos &&
+
+	test_sparse_match git checkout -b reset-test update-deep &&
+	test_sparse_match git reset deepest &&
+	test_sparse_match git reset update-folder1 &&
+	test_sparse_match git reset update-folder2
+'
+
 test_expect_success 'merge' '
 	init_repos &&
 
@@ -309,14 +358,20 @@ test_expect_success 'clean' '
 	test_all_match git status --porcelain=v2 &&
 	test_all_match git clean -f &&
 	test_all_match git status --porcelain=v2 &&
+	test_sparse_match ls &&
+	test_sparse_match ls folder1 &&
 
 	test_all_match git clean -xf &&
 	test_all_match git status --porcelain=v2 &&
+	test_sparse_match ls &&
+	test_sparse_match ls folder1 &&
 
 	test_all_match git clean -xdf &&
 	test_all_match git status --porcelain=v2 &&
+	test_sparse_match ls &&
+	test_sparse_match ls folder1 &&
 
-	test_path_is_dir sparse-checkout/folder1
+	test_sparse_match test_path_is_dir folder1
 '
 
 test_done
-- 
gitgitgadget


* [PATCH 10/27] submodule: sparse-index should not collapse links
  2021-01-25 17:41 [PATCH 00/27] [RFC] Sparse Index Derrick Stolee via GitGitGadget
                   ` (8 preceding siblings ...)
  2021-01-25 17:41 ` [PATCH 09/27] sparse-index: convert from full to sparse Derrick Stolee via GitGitGadget
@ 2021-01-25 17:41 ` Derrick Stolee via GitGitGadget
  2021-01-25 17:41 ` [PATCH 11/27] unpack-trees: allow sparse directories Derrick Stolee via GitGitGadget
                   ` (18 subsequent siblings)
  28 siblings, 0 replies; 61+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-25 17:41 UTC (permalink / raw)
  To: git
  Cc: gitster, newren, peff, jrnieder, sunshine, pclouds,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

A submodule is stored as a "Git link" that actually points to a commit
within a submodule. Submodules are populated or not depending on
submodule configuration, not sparse-checkout. To ensure that the
sparse-index feature integrates correctly with submodules, we should not
collapse a directory if there is a Git link within its range.
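
A minimal sketch of the resulting range check, using toy types: the
'toy_*' names are hypothetical, and the two mode constants are copies
of the values Git uses for S_IFMT and S_IFGITLINK.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_S_IFMT      0170000
#define TOY_S_IFGITLINK 0160000

/* Hypothetical stand-in for the fields of 'struct cache_entry' used here. */
struct toy_ce {
	unsigned mode;
	int stage;          /* nonzero means unmerged */
	int skip_worktree;  /* stands in for CE_SKIP_WORKTREE */
};

static int toy_is_gitlink(unsigned mode)
{
	return (mode & TOY_S_IFMT) == TOY_S_IFGITLINK;
}

/*
 * A range of entries may be replaced by one sparse directory entry
 * only if none of them is unmerged, a Git link, or populated.
 */
static int toy_can_collapse(const struct toy_ce *e, size_t nr)
{
	size_t i;

	assert(e || !nr);
	for (i = 0; i < nr; i++)
		if (e[i].stage ||
		    toy_is_gitlink(e[i].mode) ||
		    !e[i].skip_worktree)
			return 0;
	return 1;
}
```

A single entry with mode 0160000 anywhere in the range is enough to
keep the whole directory expanded.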

This allows us to remove ensure_full_index() from
die_path_inside_submodule() because a sparse-index will not remove the
entries for Git links. The loop already 'continue's if the cache entry
is not a Git link.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 sparse-index.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/sparse-index.c b/sparse-index.c
index d8f1a5a13d7..5dd0b835b9d 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -52,6 +52,7 @@ static int convert_to_sparse_rec(struct index_state *istate,
 		struct cache_entry *ce = istate->cache[i];
 
 		if (ce_stage(ce) ||
+		    S_ISGITLINK(ce->ce_mode) ||
 		    !(ce->ce_flags & CE_SKIP_WORKTREE))
 			can_convert = 0;
 	}
-- 
gitgitgadget


* [PATCH 11/27] unpack-trees: allow sparse directories
  2021-01-25 17:41 [PATCH 00/27] [RFC] Sparse Index Derrick Stolee via GitGitGadget
                   ` (9 preceding siblings ...)
  2021-01-25 17:41 ` [PATCH 10/27] submodule: sparse-index should not collapse links Derrick Stolee via GitGitGadget
@ 2021-01-25 17:41 ` Derrick Stolee via GitGitGadget
  2021-01-27 17:36   ` Elijah Newren
  2021-01-25 17:41 ` [PATCH 12/27] sparse-index: check index conversion happens Derrick Stolee via GitGitGadget
                   ` (17 subsequent siblings)
  28 siblings, 1 reply; 61+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-25 17:41 UTC (permalink / raw)
  To: git
  Cc: gitster, newren, peff, jrnieder, sunshine, pclouds,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The index_pos_by_traverse_info() currently throws a BUG() when a
directory entry exists exactly in the index. We need to consider that it
is possible to have a directory in a sparse index as long as that entry
is itself marked with the skip-worktree bit.

The 'pos' variable must be negated only when index_name_pos() returns a
negative value, meaning there was no exact match. When the index is
full, this behaves exactly as before.

The starts_with() condition matches because our name.buf terminates with
a directory separator, just like our sparse directory entries.
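
To illustrate the return convention, here is a toy model. It is a
sketch only: 'toy_name_pos' imitates how index_name_pos() reports a
missing name as a negative value encoding the insertion point, but
compares plain strings with strcmp() and ignores stages, and
'toy_first_pos_under' is a hypothetical helper that keeps only the
prefix check.

```c
#include <assert.h>
#include <string.h>

/*
 * Returns the position of 'name' if present, otherwise
 * -(insertion point) - 1. 'names' must be sorted by strcmp().
 */
static int toy_name_pos(const char **names, int nr, const char *name)
{
	int lo = 0, hi = nr;

	assert(names && name);
	while (lo < hi) {
		int mid = lo + (hi - lo) / 2;
		int cmp = strcmp(name, names[mid]);

		if (!cmp)
			return mid;
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return -lo - 1;
}

/*
 * Position of the first entry at or under 'dir' (which ends in '/'),
 * or -1 if there is none. An exact match, as with a sparse directory
 * entry, is used as-is; only the "not found" encoding is decoded.
 */
static int toy_first_pos_under(const char **names, int nr, const char *dir)
{
	int pos = toy_name_pos(names, nr, dir);

	if (pos < 0)
		pos = -pos - 1;
	if (pos >= nr || strncmp(names[pos], dir, strlen(dir)))
		return -1;
	return pos;
}
```

With a full index holding 'dir/x' and 'dir/y', looking up 'dir/' gives
a negative value that decodes to the position of 'dir/x'. With a sparse
index holding the single entry 'dir/', the lookup is an exact match and
must not be negated.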

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 unpack-trees.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/unpack-trees.c b/unpack-trees.c
index 4dd99219073..b324eec2a5d 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -746,9 +746,12 @@ static int index_pos_by_traverse_info(struct name_entry *names,
 	strbuf_make_traverse_path(&name, info, names->path, names->pathlen);
 	strbuf_addch(&name, '/');
 	pos = index_name_pos(o->src_index, name.buf, name.len);
-	if (pos >= 0)
-		BUG("This is a directory and should not exist in index");
-	pos = -pos - 1;
+	if (pos >= 0) {
+		if (!o->src_index->sparse_index ||
+		    !(o->src_index->cache[pos]->ce_flags & CE_SKIP_WORKTREE))
+			BUG("This is a directory and should not exist in index");
+	} else
+		pos = -pos - 1;
 	if (pos >= o->src_index->cache_nr ||
 	    !starts_with(o->src_index->cache[pos]->name, name.buf) ||
 	    (pos > 0 && starts_with(o->src_index->cache[pos-1]->name, name.buf)))
-- 
gitgitgadget


* [PATCH 12/27] sparse-index: check index conversion happens
  2021-01-25 17:41 [PATCH 00/27] [RFC] Sparse Index Derrick Stolee via GitGitGadget
                   ` (10 preceding siblings ...)
  2021-01-25 17:41 ` [PATCH 11/27] unpack-trees: allow sparse directories Derrick Stolee via GitGitGadget
@ 2021-01-25 17:41 ` Derrick Stolee via GitGitGadget
  2021-01-27 17:46   ` Elijah Newren
  2021-01-25 17:41 ` [PATCH 13/27] sparse-index: create extension for compatibility Derrick Stolee via GitGitGadget
                   ` (16 subsequent siblings)
  28 siblings, 1 reply; 61+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-25 17:41 UTC (permalink / raw)
  To: git
  Cc: gitster, newren, peff, jrnieder, sunshine, pclouds,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Add a test case that uses test_region to ensure that we are truly
expanding a sparse index to a full one, then converting back to sparse
when writing the index. As we integrate more Git commands with the
sparse index, we will convert these commands to check that we do _not_
convert the sparse index to a full index and instead stay sparse the
entire time.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t1092-sparse-checkout-compatibility.sh | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 22becbaca2e..a22def89e37 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -374,4 +374,21 @@ test_expect_success 'clean' '
 	test_sparse_match test_path_is_dir folder1
 '
 
+test_expect_success 'sparse-index is expanded and converted back' '
+	init_repos &&
+
+	(
+		GIT_TEST_SPARSE_INDEX=1 && export GIT_TEST_SPARSE_INDEX &&
+		GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
+			git -C sparse-index -c core.fsmonitor="" reset --hard &&
+		test_region index convert_to_sparse trace2.txt &&
+		test_region index ensure_full_index trace2.txt &&
+
+		rm trace2.txt &&
+		GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
+			git -C sparse-index -c core.fsmonitor="" status -uno &&
+		test_region index ensure_full_index trace2.txt
+	)
+'
+
 test_done
-- 
gitgitgadget


* [PATCH 13/27] sparse-index: create extension for compatibility
  2021-01-25 17:41 [PATCH 00/27] [RFC] Sparse Index Derrick Stolee via GitGitGadget
                   ` (11 preceding siblings ...)
  2021-01-25 17:41 ` [PATCH 12/27] sparse-index: check index conversion happens Derrick Stolee via GitGitGadget
@ 2021-01-25 17:41 ` Derrick Stolee via GitGitGadget
  2021-01-27 18:03   ` Elijah Newren
  2021-01-25 17:42 ` [PATCH 14/27] sparse-checkout: toggle sparse index from builtin Derrick Stolee via GitGitGadget
                   ` (15 subsequent siblings)
  28 siblings, 1 reply; 61+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-25 17:41 UTC (permalink / raw)
  To: git
  Cc: gitster, newren, peff, jrnieder, sunshine, pclouds,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Previously, the sparse index format was enabled only by setting
GIT_TEST_SPARSE_INDEX=1. That is not a feasible way for users to select
this mode. Further, sparse directory entries are not understood by the
index formats as advertised.

We _could_ add a new index version that explicitly adds these
capabilities, but there are nuances to index formats 2, 3, and 4 that
are still valuable to select as options. For now, create a repo
extension, "extensions.sparseIndex", that specifies that the tool
reading this repository must understand sparse directory entries.

This change only encodes the extension and enables it when
GIT_TEST_SPARSE_INDEX=1. Later, we will add a more user-friendly CLI
mechanism.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
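The resulting gate in convert_to_sparse() can be summarized with a
small standalone model. This is a sketch only: struct toy_repo and
toy_should_convert() are hypothetical names standing in for the
index_state fields, the repo settings, and the GIT_TEST_SPARSE_INDEX
check. Nothing is written to a config file here.

```c
#include <assert.h>

/* Hypothetical flattened view of the state convert_to_sparse() reads. */
struct toy_repo {
	int split_index;      /* istate->split_index */
	int index_is_sparse;  /* istate->sparse_index */
	int sparse_checkout;  /* core.sparseCheckout */
	int cone_mode;        /* core.sparseCheckoutCone */
	int ext_sparse_index; /* extensions.sparseIndex */
};

/*
 * Return 1 if the index would be converted to sparse, 0 otherwise.
 * `test_env_on` models GIT_TEST_SPARSE_INDEX=1.
 */
int toy_should_convert(struct toy_repo *r, int test_env_on)
{
	assert(r);

	/* Early exit: the environment variable is not consulted here. */
	if (r->split_index || r->index_is_sparse ||
	    !r->sparse_checkout || !r->cone_mode)
		return 0;

	/* Stands in for enable_sparse_index(): the setting persists. */
	if (test_env_on)
		r->ext_sparse_index = 1;

	/* Only convert if the extension is set. */
	return r->ext_sparse_index;
}
```

Two properties are worth noting in this model: the test variable has
no effect outside cone-mode sparse-checkout, and once it has enabled
the extension, later calls without the variable still convert.
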
 Documentation/config/extensions.txt |  7 ++++++
 cache.h                             |  1 +
 repo-settings.c                     |  7 ++++++
 repository.h                        |  3 ++-
 setup.c                             |  3 +++
 sparse-index.c                      | 38 +++++++++++++++++++++++++----
 6 files changed, 53 insertions(+), 6 deletions(-)

diff --git a/Documentation/config/extensions.txt b/Documentation/config/extensions.txt
index 4e23d73cdca..5c86b364873 100644
--- a/Documentation/config/extensions.txt
+++ b/Documentation/config/extensions.txt
@@ -6,3 +6,10 @@ extensions.objectFormat::
 Note that this setting should only be set by linkgit:git-init[1] or
 linkgit:git-clone[1].  Trying to change it after initialization will not
 work and will produce hard-to-diagnose issues.
+
+extensions.sparseIndex::
+	When combined with `core.sparseCheckout=true` and
+	`core.sparseCheckoutCone=true`, the index may contain entries
+	corresponding to directories outside of the sparse-checkout
+	definition. Versions of Git that do not understand this extension
+	do not expect directory entries in the index.
diff --git a/cache.h b/cache.h
index b05341cc687..dcf089b7006 100644
--- a/cache.h
+++ b/cache.h
@@ -1054,6 +1054,7 @@ struct repository_format {
 	int worktree_config;
 	int is_bare;
 	int hash_algo;
+	int sparse_index;
 	char *work_tree;
 	struct string_list unknown_extensions;
 	struct string_list v1_only_extensions;
diff --git a/repo-settings.c b/repo-settings.c
index d63569e4041..9677d50f923 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -85,4 +85,11 @@ void prepare_repo_settings(struct repository *r)
 	 * removed.
 	 */
 	r->settings.command_requires_full_index = 1;
+
+	/*
+	 * Initialize this as off.
+	 */
+	r->settings.sparse_index = 0;
+	if (!repo_config_get_bool(r, "extensions.sparseindex", &value) && value)
+		r->settings.sparse_index = 1;
 }
diff --git a/repository.h b/repository.h
index e06a2301569..a45f7520fd9 100644
--- a/repository.h
+++ b/repository.h
@@ -42,7 +42,8 @@ struct repo_settings {
 
 	int core_multi_pack_index;
 
-	unsigned command_requires_full_index:1;
+	unsigned command_requires_full_index:1,
+		 sparse_index:1;
 };
 
 struct repository {
diff --git a/setup.c b/setup.c
index c04cd25a30d..cd839456461 100644
--- a/setup.c
+++ b/setup.c
@@ -500,6 +500,9 @@ static enum extension_result handle_extension(const char *var,
 			return error("invalid value for 'extensions.objectformat'");
 		data->hash_algo = format;
 		return EXTENSION_OK;
+	} else if (!strcmp(ext, "sparseindex")) {
+		data->sparse_index = 1;
+		return EXTENSION_OK;
 	}
 	return EXTENSION_UNKNOWN;
 }
diff --git a/sparse-index.c b/sparse-index.c
index 5dd0b835b9d..71544095267 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -102,19 +102,47 @@ static int convert_to_sparse_rec(struct index_state *istate,
 	return num_converted - start_converted;
 }
 
+static int enable_sparse_index(struct repository *repo)
+{
+	const char *config_path = repo_git_path(repo, "config.worktree");
+
+	if (upgrade_repository_format(1) < 0) {
+		warning(_("unable to upgrade repository format to enable sparse-index"));
+		return -1;
+	}
+	git_config_set_in_file_gently(config_path,
+				      "extensions.sparseIndex",
+				      "true");
+
+	prepare_repo_settings(repo);
+	repo->settings.sparse_index = 1;
+	return 0;
+}
+
 int convert_to_sparse(struct index_state *istate)
 {
 	if (istate->split_index || istate->sparse_index ||
 	    !core_apply_sparse_checkout || !core_sparse_checkout_cone)
 		return 0;
 
+	if (!istate->repo)
+		istate->repo = the_repository;
+
+	/*
+	 * The GIT_TEST_SPARSE_INDEX environment variable triggers the
+	 * extensions.sparseIndex config variable to be on.
+	 */
+	if (git_env_bool("GIT_TEST_SPARSE_INDEX", 0)) {
+		int err = enable_sparse_index(istate->repo);
+		if (err < 0)
+			return err;
+	}
+
 	/*
-	 * For now, only create a sparse index with the
-	 * GIT_TEST_SPARSE_INDEX environment variable. We will relax
-	 * this once we have a proper way to opt-in (and later still,
-	 * opt-out).
+	 * Only convert to sparse if extensions.sparseIndex is set.
 	 */
-	if (!git_env_bool("GIT_TEST_SPARSE_INDEX", 0))
+	prepare_repo_settings(istate->repo);
+	if (!istate->repo->settings.sparse_index)
 		return 0;
 
 	if (!istate->sparse_checkout_patterns) {
-- 
gitgitgadget


* [PATCH 14/27] sparse-checkout: toggle sparse index from builtin
  2021-01-25 17:41 [PATCH 00/27] [RFC] Sparse Index Derrick Stolee via GitGitGadget
                   ` (12 preceding siblings ...)
  2021-01-25 17:41 ` [PATCH 13/27] sparse-index: create extension for compatibility Derrick Stolee via GitGitGadget
@ 2021-01-25 17:42 ` Derrick Stolee via GitGitGadget
  2021-01-27 18:18   ` Elijah Newren
  2021-01-25 17:42 ` [PATCH 15/27] [RFC-VERSION] *: ensure full index Derrick Stolee via GitGitGadget
                   ` (14 subsequent siblings)
  28 siblings, 1 reply; 61+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-25 17:42 UTC (permalink / raw)
  To: git
  Cc: gitster, newren, peff, jrnieder, sunshine, pclouds,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The sparse index extension is used to signal that index writes should be
in sparse mode. Until now, it could only be toggled using
GIT_TEST_SPARSE_INDEX=1.

Add a '--[no-]sparse-index' option to 'git sparse-checkout init' that
specifies if the sparse index should be used. It also updates the index
to use the correct format, either way. Add a warning in the
documentation that the use of a repository extension might reduce
compatibility with third-party tools. 'git sparse-checkout init' already
sets extensions.worktreeConfig, which places most sparse-checkout users
outside of the scope of most third-party tools.

Update t1092-sparse-checkout-compatibility.sh to use this CLI instead of
GIT_TEST_SPARSE_INDEX=1.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
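Both the new option and the test variable are treated as tri-state in
this patch: -1 means "not specified", and only an explicit 0 or 1
changes the setting. The C sketch below models that behavior. The
names toy_env_bool() and toy_apply() are hypothetical, and the boolean
parsing is deliberately simplified compared to what git_env_bool()
accepts.

```c
#include <assert.h>
#include <string.h>

enum { TOY_UNSET = -1, TOY_OFF = 0, TOY_ON = 1 };

/*
 * Simplified stand-in for git_env_bool(name, -1).
 * A NULL `value` models an unset environment variable.
 */
int toy_env_bool(const char *value)
{
	if (!value)
		return TOY_UNSET;
	if (!strcmp(value, "1") || !strcmp(value, "true"))
		return TOY_ON;
	if (!strcmp(value, "0") || !strcmp(value, "false"))
		return TOY_OFF;
	/* Assumption: this sketch ignores any other spelling. */
	return TOY_UNSET;
}

/* Apply a tri-state request to the current setting; return the result. */
int toy_apply(int current, int requested)
{
	assert(requested >= TOY_UNSET && requested <= TOY_ON);

	/* Only an explicit 0 or 1 overrides what is already configured. */
	if (requested >= 0)
		return requested;
	return current;
}
```

For example, toy_apply(1, toy_env_bool(NULL)) keeps the setting at 1,
while toy_apply(1, toy_env_bool("0")) turns it off. The same pattern
appears in sparse_checkout_init(), where init_opts.sparse_index starts
at -1 and the config is touched only when it is >= 0.
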
 Documentation/git-sparse-checkout.txt    | 14 +++++++++
 builtin/sparse-checkout.c                | 17 ++++++++++-
 sparse-index.c                           | 38 ++++++++++++++++--------
 sparse-index.h                           |  3 ++
 t/t1092-sparse-checkout-compatibility.sh | 33 ++++++++++----------
 5 files changed, 75 insertions(+), 30 deletions(-)

diff --git a/Documentation/git-sparse-checkout.txt b/Documentation/git-sparse-checkout.txt
index a0eeaeb02ee..b51b8450cfd 100644
--- a/Documentation/git-sparse-checkout.txt
+++ b/Documentation/git-sparse-checkout.txt
@@ -45,6 +45,20 @@ To avoid interfering with other worktrees, it first enables the
 When `--cone` is provided, the `core.sparseCheckoutCone` setting is
 also set, allowing for better performance with a limited set of
 patterns (see 'CONE PATTERN SET' below).
++
+Use the `--[no-]sparse-index` option to toggle the use of the sparse
+index format. This reduces the size of the index to be more closely
+aligned with your sparse-checkout definition. This can have significant
+performance advantages for commands such as `git status` or `git add`.
+This feature is still experimental. Some commands might be slower with
+a sparse index until they are properly integrated with the feature.
++
+**WARNING:** Using a sparse index requires modifying the index in a way
+that is not completely understood by other tools. Enabling sparse index
+enables the `extensions.sparseIndex` config value, which might cause
+other tools to stop working with your repository. If you have trouble with
+this compatibility, then run `git sparse-checkout init --no-sparse-index` to
+remove this config and rewrite your index to not be sparse.
 
 'set'::
 	Write a set of patterns to the sparse-checkout file, as given as
diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
index e00b82af727..ca63e2c64e9 100644
--- a/builtin/sparse-checkout.c
+++ b/builtin/sparse-checkout.c
@@ -14,6 +14,7 @@
 #include "unpack-trees.h"
 #include "wt-status.h"
 #include "quote.h"
+#include "sparse-index.h"
 
 static const char *empty_base = "";
 
@@ -283,12 +284,13 @@ static int set_config(enum sparse_checkout_mode mode)
 }
 
 static char const * const builtin_sparse_checkout_init_usage[] = {
-	N_("git sparse-checkout init [--cone]"),
+	N_("git sparse-checkout init [--cone] [--[no-]sparse-index]"),
 	NULL
 };
 
 static struct sparse_checkout_init_opts {
 	int cone_mode;
+	int sparse_index;
 } init_opts;
 
 static int sparse_checkout_init(int argc, const char **argv)
@@ -303,11 +305,15 @@ static int sparse_checkout_init(int argc, const char **argv)
 	static struct option builtin_sparse_checkout_init_options[] = {
 		OPT_BOOL(0, "cone", &init_opts.cone_mode,
 			 N_("initialize the sparse-checkout in cone mode")),
+		OPT_BOOL(0, "sparse-index", &init_opts.sparse_index,
+			 N_("toggle the use of a sparse index")),
 		OPT_END(),
 	};
 
 	repo_read_index(the_repository);
 
+	init_opts.sparse_index = -1;
+
 	argc = parse_options(argc, argv, NULL,
 			     builtin_sparse_checkout_init_options,
 			     builtin_sparse_checkout_init_usage, 0);
@@ -326,6 +332,15 @@ static int sparse_checkout_init(int argc, const char **argv)
 	sparse_filename = get_sparse_checkout_filename();
 	res = add_patterns_from_file_to_list(sparse_filename, "", 0, &pl, NULL);
 
+	if (init_opts.sparse_index >= 0) {
+		if (set_sparse_index_config(the_repository, init_opts.sparse_index) < 0)
+			die(_("failed to modify sparse-index config"));
+
+		/* force an index rewrite */
+		repo_read_index(the_repository);
+		the_repository->index->updated_workdir = 1;
+	}
+
 	/* If we already have a sparse-checkout file, use it. */
 	if (res >= 0) {
 		free(sparse_filename);
diff --git a/sparse-index.c b/sparse-index.c
index 71544095267..3552f88fb03 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -104,23 +104,38 @@ static int convert_to_sparse_rec(struct index_state *istate,
 
 static int enable_sparse_index(struct repository *repo)
 {
-	const char *config_path = repo_git_path(repo, "config.worktree");
+	int res;
 
 	if (upgrade_repository_format(1) < 0) {
 		warning(_("unable to upgrade repository format to enable sparse-index"));
 		return -1;
 	}
-	git_config_set_in_file_gently(config_path,
-				      "extensions.sparseIndex",
-				      "true");
+	res = git_config_set_gently("extensions.sparseindex", "true");
 
 	prepare_repo_settings(repo);
 	repo->settings.sparse_index = 1;
-	return 0;
+	return res;
+}
+
+int set_sparse_index_config(struct repository *repo, int enable)
+{
+	int res;
+
+	if (enable)
+		return enable_sparse_index(repo);
+
+	/* Don't downgrade repository format, just remove the extension. */
+	res = git_config_set_multivar_gently("extensions.sparseindex", NULL, "",
+					     CONFIG_FLAGS_MULTI_REPLACE);
+
+	prepare_repo_settings(repo);
+	repo->settings.sparse_index = 0;
+	return res;
 }
 
 int convert_to_sparse(struct index_state *istate)
 {
+	int test_env;
 	if (istate->split_index || istate->sparse_index ||
 	    !core_apply_sparse_checkout || !core_sparse_checkout_cone)
 		return 0;
@@ -129,14 +144,13 @@ int convert_to_sparse(struct index_state *istate)
 		istate->repo = the_repository;
 
 	/*
-	 * The GIT_TEST_SPARSE_INDEX environment variable triggers the
-	 * extensions.sparseIndex config variable to be on.
+	 * If GIT_TEST_SPARSE_INDEX=1, then trigger extensions.sparseIndex
+	 * to be fully enabled. If GIT_TEST_SPARSE_INDEX=0 (set explicitly),
+	 * then purposefully disable the setting.
 	 */
-	if (git_env_bool("GIT_TEST_SPARSE_INDEX", 0)) {
-		int err = enable_sparse_index(istate->repo);
-		if (err < 0)
-			return err;
-	}
+	test_env = git_env_bool("GIT_TEST_SPARSE_INDEX", -1);
+	if (test_env >= 0)
+		set_sparse_index_config(istate->repo, test_env);
 
 	/*
 	 * Only convert to sparse if extensions.sparseIndex is set.
diff --git a/sparse-index.h b/sparse-index.h
index a2777dcac59..ca936e95d11 100644
--- a/sparse-index.h
+++ b/sparse-index.h
@@ -5,4 +5,7 @@ struct index_state;
 void ensure_full_index(struct index_state *istate);
 int convert_to_sparse(struct index_state *istate);
 
+struct repository;
+int set_sparse_index_config(struct repository *repo, int enable);
+
 #endif
\ No newline at end of file
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index a22def89e37..c6b7e8b8891 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -4,6 +4,7 @@ test_description='compare full workdir to sparse workdir'
 
 GIT_TEST_CHECK_CACHE_TREE=0
 GIT_TEST_SPLIT_INDEX=0
+GIT_TEST_SPARSE_INDEX=
 
 . ./test-lib.sh
 
@@ -98,8 +99,9 @@ init_repos () {
 	# initialize sparse-checkout definitions
 	git -C sparse-checkout sparse-checkout init --cone &&
 	git -C sparse-checkout sparse-checkout set deep &&
-	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout init --cone &&
-	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep
+	git -C sparse-index sparse-checkout init --cone --sparse-index &&
+	test_cmp_config -C sparse-index true extensions.sparseindex &&
+	git -C sparse-index sparse-checkout set deep
 }
 
 run_on_sparse () {
@@ -109,7 +111,7 @@ run_on_sparse () {
 	) &&
 	(
 		cd sparse-index &&
-		GIT_TEST_SPARSE_INDEX=1 $* >../sparse-index-out 2>../sparse-index-err
+		$* >../sparse-index-out 2>../sparse-index-err
 	)
 }
 
@@ -146,7 +148,7 @@ test_expect_success 'sparse-index contents' '
 			|| return 1
 	done &&
 
-	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set folder1 &&
+	git -C sparse-index sparse-checkout set folder1 &&
 
 	test-tool -C sparse-index read-cache --table --no-stat >cache &&
 	for dir in deep folder2 x
@@ -156,7 +158,7 @@ test_expect_success 'sparse-index contents' '
 			|| return 1
 	done &&
 
-	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep/deeper1 &&
+	git -C sparse-index sparse-checkout set deep/deeper1 &&
 
 	test-tool -C sparse-index read-cache --table --no-stat >cache &&
 	for dir in deep/deeper2 folder1 folder2 x
@@ -377,18 +379,15 @@ test_expect_success 'clean' '
 test_expect_success 'sparse-index is expanded and converted back' '
 	init_repos &&
 
-	(
-		(GIT_TEST_SPARSE_INDEX=1 && export GIT_TEST_SPARSE_INDEX) &&
-		GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
-			git -C sparse-index -c core.fsmonitor="" reset --hard &&
-		test_region index convert_to_sparse trace2.txt &&
-		test_region index ensure_full_index trace2.txt &&
-
-		rm trace2.txt &&
-		GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
-			git -C sparse-index -c core.fsmonitor="" status -uno &&
-		test_region index ensure_full_index trace2.txt
-	)
+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
+		git -C sparse-index -c core.fsmonitor="" reset --hard &&
+	test_region index convert_to_sparse trace2.txt &&
+	test_region index ensure_full_index trace2.txt &&
+
+	rm trace2.txt &&
+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
+		git -C sparse-index -c core.fsmonitor="" status -uno &&
+	test_region index ensure_full_index trace2.txt
 '
 
 test_done
-- 
gitgitgadget


* [PATCH 15/27] [RFC-VERSION] *: ensure full index
  2021-01-25 17:41 [PATCH 00/27] [RFC] Sparse Index Derrick Stolee via GitGitGadget
                   ` (13 preceding siblings ...)
  2021-01-25 17:42 ` [PATCH 14/27] sparse-checkout: toggle sparse index from builtin Derrick Stolee via GitGitGadget
@ 2021-01-25 17:42 ` Derrick Stolee via GitGitGadget
  2021-02-01 20:22   ` Elijah Newren
  2021-01-25 17:42 ` [PATCH 16/27] unpack-trees: make sparse aware Derrick Stolee via GitGitGadget
                   ` (13 subsequent siblings)
  28 siblings, 1 reply; 61+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-25 17:42 UTC (permalink / raw)
  To: git
  Cc: gitster, newren, peff, jrnieder, sunshine, pclouds,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

This giant patch is not intended for actual review. I have a branch that
has these changes split out in a sane way with some commentary in each
file that is modified.

The idea here is to guard certain portions of the codebase that do not
know how to handle sparse indexes by ensuring that the index is expanded
to a full index before proceeding with the logic.

This also provides a good mechanism for testing which code needs
updating to enable the sparse index in a Git builtin. The builtin can
set the_repository->settings.command_requires_full_index to zero and
then we can debug the command with a breakpoint on ensure_full_index().
That identifies the portion of code that needs adjusting before enabling
sparse indexes for that command.

Some index operations must be changed to operate on a non-const pointer,
since ensuring a full index will modify the index itself.

There are likely some gaps in these protections, which is why it will be
important to carefully test each scenario as we relax the requirements.
I expect that to be a long effort.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
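To illustrate why name lookups are guarded: in a sparse index, the
files under an excluded directory are collapsed into one directory
entry, so a by-name search for such a file misses until the index is
expanded. The C sketch below shows this on plain sorted string arrays.
toy_name_pos() is a hypothetical helper that only mimics the
"-pos - 1 when missing" convention of index_name_pos(). The
trailing-slash spelling of directory entries and the use of strcmp()
ordering are assumptions of the sketch.

```c
#include <assert.h>
#include <string.h>

/*
 * Binary search over a sorted list of paths. Returns the position if
 * found, otherwise -(insertion position) - 1.
 */
int toy_name_pos(const char *const *names, int nr, const char *name)
{
	int lo = 0, hi = nr;

	assert(names && name);
	while (lo < hi) {
		int mid = lo + (hi - lo) / 2;
		int cmp = strcmp(name, names[mid]);

		if (!cmp)
			return mid;
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return -lo - 1;
}
```

With a sparse list { "a", "deep/a", "folder1/", "folder2/" }, looking
up "folder1/a" gives a negative result even though the file exists at
HEAD. With the expanded list { "a", "deep/a", "folder1/a",
"folder2/a" }, the same lookup succeeds. Calling ensure_full_index()
before index_name_pos() avoids that false miss.
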
 apply.c                   | 10 +++++++++-
 blame.c                   |  7 ++++++-
 builtin/checkout-index.c  |  5 ++++-
 builtin/grep.c            |  2 ++
 builtin/ls-files.c        |  9 ++++++++-
 builtin/merge-index.c     |  2 ++
 builtin/mv.c              |  2 ++
 builtin/rm.c              |  2 ++
 builtin/sparse-checkout.c |  1 +
 builtin/update-index.c    |  2 ++
 cache.h                   |  1 +
 diff-lib.c                |  2 ++
 diff.c                    |  2 ++
 dir.c                     | 14 +++++++++++++-
 entry.c                   |  2 ++
 fsmonitor.c               | 11 ++++++++++-
 merge-recursive.c         | 22 +++++++++++++++++++---
 name-hash.c               |  6 ++++++
 pathspec.c                |  5 +++--
 pathspec.h                |  4 ++--
 read-cache.c              | 19 +++++++++++++++++--
 rerere.c                  |  2 ++
 resolve-undo.c            |  6 ++++++
 sha1-name.c               |  3 +++
 split-index.c             |  2 ++
 submodule.c               | 24 +++++++++++++++++++-----
 submodule.h               |  6 +++---
 tree.c                    |  2 ++
 wt-status.c               |  7 +++++++
 29 files changed, 159 insertions(+), 23 deletions(-)

diff --git a/apply.c b/apply.c
index 668b16e9893..5bfbd928b38 100644
--- a/apply.c
+++ b/apply.c
@@ -3523,6 +3523,8 @@ static int load_current(struct apply_state *state,
 	if (!patch->is_new)
 		BUG("patch to %s is not a creation", patch->old_name);
 
+	ensure_full_index(state->repo->index);
+
 	pos = index_name_pos(state->repo->index, name, strlen(name));
 	if (pos < 0)
 		return error(_("%s: does not exist in index"), name);
@@ -3692,7 +3694,11 @@ static int check_preimage(struct apply_state *state,
 	}
 
 	if (state->check_index && !previous) {
-		int pos = index_name_pos(state->repo->index, old_name,
+		int pos;
+
+		ensure_full_index(state->repo->index);
+
+		pos = index_name_pos(state->repo->index, old_name,
 					 strlen(old_name));
 		if (pos < 0) {
 			if (patch->is_new < 0)
@@ -3751,6 +3757,8 @@ static int check_to_create(struct apply_state *state,
 	if (state->check_index && (!ok_if_exists || !state->cached)) {
 		int pos;
 
+		ensure_full_index(state->repo->index);
+
 		pos = index_name_pos(state->repo->index, new_name, strlen(new_name));
 		if (pos >= 0) {
 			struct cache_entry *ce = state->repo->index->cache[pos];
diff --git a/blame.c b/blame.c
index a5044fcfaa6..0aa368a35cf 100644
--- a/blame.c
+++ b/blame.c
@@ -108,6 +108,7 @@ static void verify_working_tree_path(struct repository *r,
 			return;
 	}
 
+	ensure_full_index(r->index);
 	pos = index_name_pos(r->index, path, strlen(path));
 	if (pos >= 0)
 		; /* path is in the index */
@@ -277,7 +278,11 @@ static struct commit *fake_working_tree_commit(struct repository *r,
 
 	len = strlen(path);
 	if (!mode) {
-		int pos = index_name_pos(r->index, path, len);
+		int pos;
+
+		ensure_full_index(r->index);
+
+		pos = index_name_pos(r->index, path, len);
 		if (0 <= pos)
 			mode = r->index->cache[pos]->ce_mode;
 		else
diff --git a/builtin/checkout-index.c b/builtin/checkout-index.c
index 4bbfc92dce5..24c85b1c125 100644
--- a/builtin/checkout-index.c
+++ b/builtin/checkout-index.c
@@ -48,11 +48,14 @@ static void write_tempfile_record(const char *name, const char *prefix)
 static int checkout_file(const char *name, const char *prefix)
 {
 	int namelen = strlen(name);
-	int pos = cache_name_pos(name, namelen);
+	int pos;
 	int has_same_name = 0;
 	int did_checkout = 0;
 	int errs = 0;
 
+	ensure_full_index(the_repository->index);
+	pos = index_name_pos(the_repository->index, name, namelen);
+
 	if (pos < 0)
 		pos = -pos - 1;
 
diff --git a/builtin/grep.c b/builtin/grep.c
index ca259af4416..e53cf817204 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -506,6 +506,8 @@ static int grep_cache(struct grep_opt *opt,
 	if (repo_read_index(repo) < 0)
 		die(_("index file corrupt"));
 
+	ensure_full_index(repo->index);
+
 	for (nr = 0; nr < repo->index->cache_nr; nr++) {
 		const struct cache_entry *ce = repo->index->cache[nr];
 		strbuf_setlen(&name, name_base_len);
diff --git a/builtin/ls-files.c b/builtin/ls-files.c
index c8eae899b82..933e259cdbe 100644
--- a/builtin/ls-files.c
+++ b/builtin/ls-files.c
@@ -150,7 +150,7 @@ static void show_other_files(const struct index_state *istate,
 	}
 }
 
-static void show_killed_files(const struct index_state *istate,
+static void show_killed_files(struct index_state *istate,
 			      const struct dir_struct *dir)
 {
 	int i;
@@ -159,6 +159,8 @@ static void show_killed_files(const struct index_state *istate,
 		char *cp, *sp;
 		int pos, len, killed = 0;
 
+		ensure_full_index(istate);
+
 		for (cp = ent->name; cp - ent->name < ent->len; cp = sp + 1) {
 			sp = strchr(cp, '/');
 			if (!sp) {
@@ -313,6 +315,7 @@ static void show_files(struct repository *repo, struct dir_struct *dir)
 			show_killed_files(repo->index, dir);
 	}
 	if (show_cached || show_stage) {
+		ensure_full_index(repo->index);
 		for (i = 0; i < repo->index->cache_nr; i++) {
 			const struct cache_entry *ce = repo->index->cache[i];
 
@@ -332,6 +335,7 @@ static void show_files(struct repository *repo, struct dir_struct *dir)
 		}
 	}
 	if (show_deleted || show_modified) {
+		ensure_full_index(repo->index);
 		for (i = 0; i < repo->index->cache_nr; i++) {
 			const struct cache_entry *ce = repo->index->cache[i];
 			struct stat st;
@@ -368,6 +372,7 @@ static void prune_index(struct index_state *istate,
 
 	if (!prefix || !istate->cache_nr)
 		return;
+	ensure_full_index(istate);
 	pos = index_name_pos(istate, prefix, prefixlen);
 	if (pos < 0)
 		pos = -pos-1;
@@ -428,6 +433,8 @@ void overlay_tree_on_index(struct index_state *istate,
 	if (!tree)
 		die("bad tree-ish %s", tree_name);
 
+	ensure_full_index(istate);
+
 	/* Hoist the unmerged entries up to stage #3 to make room */
 	for (i = 0; i < istate->cache_nr; i++) {
 		struct cache_entry *ce = istate->cache[i];
diff --git a/builtin/merge-index.c b/builtin/merge-index.c
index 38ea6ad6ca2..3e1ddabd650 100644
--- a/builtin/merge-index.c
+++ b/builtin/merge-index.c
@@ -80,6 +80,8 @@ int cmd_merge_index(int argc, const char **argv, const char *prefix)
 
 	read_cache();
 
+	ensure_full_index(&the_index);
+
 	i = 1;
 	if (!strcmp(argv[i], "-o")) {
 		one_shot = 1;
diff --git a/builtin/mv.c b/builtin/mv.c
index 7dac714af90..2ab6416fce9 100644
--- a/builtin/mv.c
+++ b/builtin/mv.c
@@ -145,6 +145,8 @@ int cmd_mv(int argc, const char **argv, const char *prefix)
 	if (read_cache() < 0)
 		die(_("index file corrupt"));
 
+	ensure_full_index(&the_index);
+
 	source = internal_prefix_pathspec(prefix, argv, argc, 0);
 	modes = xcalloc(argc, sizeof(enum update_mode));
 	/*
diff --git a/builtin/rm.c b/builtin/rm.c
index 4858631e0f0..2db4fcd22d9 100644
--- a/builtin/rm.c
+++ b/builtin/rm.c
@@ -291,6 +291,8 @@ int cmd_rm(int argc, const char **argv, const char *prefix)
 
 	refresh_index(&the_index, REFRESH_QUIET|REFRESH_UNMERGED, &pathspec, NULL, NULL);
 
+	ensure_full_index(&the_index);
+
 	seen = xcalloc(pathspec.nr, 1);
 
 	for (i = 0; i < active_nr; i++) {
diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
index ca63e2c64e9..14022b5e182 100644
--- a/builtin/sparse-checkout.c
+++ b/builtin/sparse-checkout.c
@@ -123,6 +123,7 @@ static int update_working_directory(struct pattern_list *pl)
 	o.pl = pl;
 
 	setup_work_tree();
+	ensure_full_index(r->index);
 
 	repo_hold_locked_index(r, &lock_file, LOCK_DIE_ON_ERROR);
 
diff --git a/builtin/update-index.c b/builtin/update-index.c
index 79087bccea4..521a6c23c75 100644
--- a/builtin/update-index.c
+++ b/builtin/update-index.c
@@ -1088,6 +1088,8 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
 
 	the_index.updated_skipworktree = 1;
 
+	ensure_full_index(&the_index);
+
 	/*
 	 * Custom copy of parse_options() because we want to handle
 	 * filename arguments as they come.
diff --git a/cache.h b/cache.h
index dcf089b7006..306eab444b9 100644
--- a/cache.h
+++ b/cache.h
@@ -346,6 +346,7 @@ void add_name_hash(struct index_state *istate, struct cache_entry *ce);
 void remove_name_hash(struct index_state *istate, struct cache_entry *ce);
 void free_name_hash(struct index_state *istate);
 
+void ensure_full_index(struct index_state *istate);
 
 /* Cache entry creation and cleanup */
 
diff --git a/diff-lib.c b/diff-lib.c
index b73cc1859a4..3743e4463b4 100644
--- a/diff-lib.c
+++ b/diff-lib.c
@@ -96,6 +96,8 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
 	uint64_t start = getnanotime();
 	struct index_state *istate = revs->diffopt.repo->index;
 
+	ensure_full_index(istate);
+
 	diff_set_mnemonic_prefix(&revs->diffopt, "i/", "w/");
 
 	refresh_fsmonitor(istate);
diff --git a/diff.c b/diff.c
index 2253ec88029..02fafee8587 100644
--- a/diff.c
+++ b/diff.c
@@ -3901,6 +3901,8 @@ static int reuse_worktree_file(struct index_state *istate,
 	if (!want_file && would_convert_to_git(istate, name))
 		return 0;
 
+	ensure_full_index(istate);
+
 	len = strlen(name);
 	pos = index_name_pos(istate, name, len);
 	if (pos < 0)
diff --git a/dir.c b/dir.c
index d153a63bbd1..ad6eb033cb1 100644
--- a/dir.c
+++ b/dir.c
@@ -892,13 +892,15 @@ void add_pattern(const char *string, const char *base,
 	add_pattern_to_hashsets(pl, pattern);
 }
 
-static int read_skip_worktree_file_from_index(const struct index_state *istate,
+static int read_skip_worktree_file_from_index(struct index_state *istate,
 					      const char *path,
 					      size_t *size_out, char **data_out,
 					      struct oid_stat *oid_stat)
 {
 	int pos, len;
 
+	ensure_full_index(istate);
+
 	len = strlen(path);
 	pos = index_name_pos(istate, path, len);
 	if (pos < 0)
@@ -1088,6 +1090,10 @@ static int add_patterns(const char *fname, const char *base, int baselen,
 		close(fd);
 		if (oid_stat) {
 			int pos;
+
+			if (istate)
+				ensure_full_index(istate);
+
 			if (oid_stat->valid &&
 			    !match_stat_data_racy(istate, &oid_stat->stat, &st))
 				; /* no content change, oid_stat->oid still good */
@@ -1696,6 +1702,8 @@ static enum exist_status directory_exists_in_index(struct index_state *istate,
 	if (ignore_case)
 		return directory_exists_in_index_icase(istate, dirname, len);
 
+	ensure_full_index(istate);
+
 	pos = index_name_pos(istate, dirname, len);
 	if (pos < 0)
 		pos = -pos-1;
@@ -2050,6 +2058,8 @@ static int get_index_dtype(struct index_state *istate,
 	int pos;
 	const struct cache_entry *ce;
 
+	ensure_full_index(istate);
+
 	ce = index_file_exists(istate, path, len, 0);
 	if (ce) {
 		if (!ce_uptodate(ce))
@@ -3536,6 +3546,8 @@ static void connect_wt_gitdir_in_nested(const char *sub_worktree,
 	if (repo_read_index(&subrepo) < 0)
 		die(_("index file corrupt in repo %s"), subrepo.gitdir);
 
+	ensure_full_index(subrepo.index);
+
 	for (i = 0; i < subrepo.index->cache_nr; i++) {
 		const struct cache_entry *ce = subrepo.index->cache[i];
 
diff --git a/entry.c b/entry.c
index a0532f1f000..d505e6f2c6e 100644
--- a/entry.c
+++ b/entry.c
@@ -412,6 +412,8 @@ static void mark_colliding_entries(const struct checkout *state,
 
 	ce->ce_flags |= CE_MATCHED;
 
+	ensure_full_index(state->istate);
+
 	for (i = 0; i < state->istate->cache_nr; i++) {
 		struct cache_entry *dup = state->istate->cache[i];
 
diff --git a/fsmonitor.c b/fsmonitor.c
index fe9e9d7baf4..7b8cd3975b9 100644
--- a/fsmonitor.c
+++ b/fsmonitor.c
@@ -97,6 +97,9 @@ int read_fsmonitor_extension(struct index_state *istate, const void *data,
 void fill_fsmonitor_bitmap(struct index_state *istate)
 {
 	unsigned int i, skipped = 0;
+
+	ensure_full_index(istate);
+
 	istate->fsmonitor_dirty = ewah_new();
 	for (i = 0; i < istate->cache_nr; i++) {
 		if (istate->cache[i]->ce_flags & CE_REMOVE)
@@ -158,7 +161,11 @@ static int query_fsmonitor(int version, const char *last_update, struct strbuf *
 
 static void fsmonitor_refresh_callback(struct index_state *istate, const char *name)
 {
-	int pos = index_name_pos(istate, name, strlen(name));
+	int pos;
+
+	ensure_full_index(istate);
+
+	pos = index_name_pos(istate, name, strlen(name));
 
 	if (pos >= 0) {
 		struct cache_entry *ce = istate->cache[pos];
@@ -330,6 +337,8 @@ void tweak_fsmonitor(struct index_state *istate)
 
 	if (istate->fsmonitor_dirty) {
 		if (fsmonitor_enabled) {
+			ensure_full_index(istate);
+
 			/* Mark all entries valid */
 			for (i = 0; i < istate->cache_nr; i++) {
 				istate->cache[i]->ce_flags |= CE_FSMONITOR_VALID;
diff --git a/merge-recursive.c b/merge-recursive.c
index f736a0f6323..12109f37723 100644
--- a/merge-recursive.c
+++ b/merge-recursive.c
@@ -522,6 +522,8 @@ static struct string_list *get_unmerged(struct index_state *istate)
 
 	unmerged->strdup_strings = 1;
 
+	ensure_full_index(istate);
+
 	for (i = 0; i < istate->cache_nr; i++) {
 		struct string_list_item *item;
 		struct stage_data *e;
@@ -762,6 +764,8 @@ static int dir_in_way(struct index_state *istate, const char *path,
 	strbuf_addstr(&dirpath, path);
 	strbuf_addch(&dirpath, '/');
 
+	ensure_full_index(istate);
+
 	pos = index_name_pos(istate, dirpath.buf, dirpath.len);
 
 	if (pos < 0)
@@ -785,9 +789,13 @@ static int dir_in_way(struct index_state *istate, const char *path,
 static int was_tracked_and_matches(struct merge_options *opt, const char *path,
 				   const struct diff_filespec *blob)
 {
-	int pos = index_name_pos(&opt->priv->orig_index, path, strlen(path));
+	int pos;
 	struct cache_entry *ce;
 
+	ensure_full_index(&opt->priv->orig_index);
+
+	pos = index_name_pos(&opt->priv->orig_index, path, strlen(path));
+
 	if (0 > pos)
 		/* we were not tracking this path before the merge */
 		return 0;
@@ -802,7 +810,11 @@ static int was_tracked_and_matches(struct merge_options *opt, const char *path,
  */
 static int was_tracked(struct merge_options *opt, const char *path)
 {
-	int pos = index_name_pos(&opt->priv->orig_index, path, strlen(path));
+	int pos;
+
+	ensure_full_index(&opt->priv->orig_index);
+
+	pos = index_name_pos(&opt->priv->orig_index, path, strlen(path));
 
 	if (0 <= pos)
 		/* we were tracking this path before the merge */
@@ -814,6 +826,9 @@ static int was_tracked(struct merge_options *opt, const char *path)
 static int would_lose_untracked(struct merge_options *opt, const char *path)
 {
 	struct index_state *istate = opt->repo->index;
+	int pos;
+
+	ensure_full_index(istate);
 
 	/*
 	 * This may look like it can be simplified to:
@@ -832,7 +847,7 @@ static int would_lose_untracked(struct merge_options *opt, const char *path)
 	 * update_file()/would_lose_untracked(); see every comment in this
 	 * file which mentions "update_stages".
 	 */
-	int pos = index_name_pos(istate, path, strlen(path));
+	pos = index_name_pos(istate, path, strlen(path));
 
 	if (pos < 0)
 		pos = -1 - pos;
@@ -3086,6 +3101,7 @@ static int handle_content_merge(struct merge_file_info *mfi,
 		 * flag to avoid making the file appear as if it were
 		 * deleted by the user.
 		 */
+		ensure_full_index(&opt->priv->orig_index);
 		pos = index_name_pos(&opt->priv->orig_index, path, strlen(path));
 		ce = opt->priv->orig_index.cache[pos];
 		if (ce_skip_worktree(ce)) {
diff --git a/name-hash.c b/name-hash.c
index 4e03fac9bb1..0f6d4fcca5a 100644
--- a/name-hash.c
+++ b/name-hash.c
@@ -679,6 +679,8 @@ int index_dir_exists(struct index_state *istate, const char *name, int namelen)
 {
 	struct dir_entry *dir;
 
+	ensure_full_index(istate);
+
 	lazy_init_name_hash(istate);
 	dir = find_dir_entry(istate, name, namelen);
 	return dir && dir->nr;
@@ -689,6 +691,8 @@ void adjust_dirname_case(struct index_state *istate, char *name)
 	const char *startPtr = name;
 	const char *ptr = startPtr;
 
+	ensure_full_index( istate);
+
 	lazy_init_name_hash(istate);
 	while (*ptr) {
 		while (*ptr && *ptr != '/')
@@ -712,6 +716,8 @@ struct cache_entry *index_file_exists(struct index_state *istate, const char *na
 	struct cache_entry *ce;
 	unsigned int hash = memihash(name, namelen);
 
+	ensure_full_index(istate);
+
 	lazy_init_name_hash(istate);
 
 	ce = hashmap_get_entry_from_hash(&istate->name_hash, hash, NULL,
diff --git a/pathspec.c b/pathspec.c
index 7a229d8d22f..9b105855483 100644
--- a/pathspec.c
+++ b/pathspec.c
@@ -20,7 +20,7 @@
  * to use find_pathspecs_matching_against_index() instead.
  */
 void add_pathspec_matches_against_index(const struct pathspec *pathspec,
-					const struct index_state *istate,
+					struct index_state *istate,
 					char *seen)
 {
 	int num_unmatched = 0, i;
@@ -36,6 +36,7 @@ void add_pathspec_matches_against_index(const struct pathspec *pathspec,
 			num_unmatched++;
 	if (!num_unmatched)
 		return;
+	ensure_full_index(istate);
 	for (i = 0; i < istate->cache_nr; i++) {
 		const struct cache_entry *ce = istate->cache[i];
 		ce_path_match(istate, ce, pathspec, seen);
@@ -51,7 +52,7 @@ void add_pathspec_matches_against_index(const struct pathspec *pathspec,
  * given pathspecs achieves against all items in the index.
  */
 char *find_pathspecs_matching_against_index(const struct pathspec *pathspec,
-					    const struct index_state *istate)
+					    struct index_state *istate)
 {
 	char *seen = xcalloc(pathspec->nr, 1);
 	add_pathspec_matches_against_index(pathspec, istate, seen);
diff --git a/pathspec.h b/pathspec.h
index 454ce364fac..f19c5dcf022 100644
--- a/pathspec.h
+++ b/pathspec.h
@@ -150,10 +150,10 @@ static inline int ps_strcmp(const struct pathspec_item *item,
 }
 
 void add_pathspec_matches_against_index(const struct pathspec *pathspec,
-					const struct index_state *istate,
+					struct index_state *istate,
 					char *seen);
 char *find_pathspecs_matching_against_index(const struct pathspec *pathspec,
-					    const struct index_state *istate);
+					    struct index_state *istate);
 int match_pathspec_attrs(const struct index_state *istate,
 			 const char *name, int namelen,
 			 const struct pathspec_item *item);
diff --git a/read-cache.c b/read-cache.c
index 0522260416e..65679d70d7c 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -622,7 +622,11 @@ void remove_marked_cache_entries(struct index_state *istate, int invalidate)
 
 int remove_file_from_index(struct index_state *istate, const char *path)
 {
-	int pos = index_name_pos(istate, path, strlen(path));
+	int pos;
+
+	ensure_full_index(istate);
+
+	pos = index_name_pos(istate, path, strlen(path));
 	if (pos < 0)
 		pos = -pos-1;
 	cache_tree_invalidate_path(istate, path);
@@ -640,9 +644,12 @@ static int compare_name(struct cache_entry *ce, const char *path, int namelen)
 static int index_name_pos_also_unmerged(struct index_state *istate,
 	const char *path, int namelen)
 {
-	int pos = index_name_pos(istate, path, namelen);
+	int pos;
 	struct cache_entry *ce;
 
+	ensure_full_index(istate);
+
+	pos = index_name_pos(istate, path, namelen);
 	if (pos >= 0)
 		return pos;
 
@@ -717,6 +724,8 @@ int add_to_index(struct index_state *istate, const char *path, struct stat *st,
 	int hash_flags = HASH_WRITE_OBJECT;
 	struct object_id oid;
 
+	ensure_full_index(istate);
+
 	if (flags & ADD_CACHE_RENORMALIZE)
 		hash_flags |= HASH_RENORMALIZE;
 
@@ -1095,6 +1104,8 @@ static int has_dir_name(struct index_state *istate,
 	size_t len_eq_last;
 	int cmp_last = 0;
 
+	ensure_full_index(istate);
+
 	/*
 	 * We are frequently called during an iteration on a sorted
 	 * list of pathnames and while building a new index.  Therefore,
@@ -1338,6 +1349,8 @@ int add_index_entry(struct index_state *istate, struct cache_entry *ce, int opti
 {
 	int pos;
 
+	ensure_full_index(istate);
+
 	if (option & ADD_CACHE_JUST_APPEND)
 		pos = istate->cache_nr;
 	else {
@@ -1547,6 +1560,8 @@ int refresh_index(struct index_state *istate, unsigned int flags,
 	 * we only have to do the special cases that are left.
 	 */
 	preload_index(istate, pathspec, 0);
+
+	ensure_full_index(istate);
 	for (i = 0; i < istate->cache_nr; i++) {
 		struct cache_entry *ce, *new_entry;
 		int cache_errno = 0;
diff --git a/rerere.c b/rerere.c
index 9281131a9f1..1836a6cfbcf 100644
--- a/rerere.c
+++ b/rerere.c
@@ -962,6 +962,8 @@ static int handle_cache(struct index_state *istate,
 	struct rerere_io_mem io;
 	int marker_size = ll_merge_marker_size(istate, path);
 
+	ensure_full_index(istate);
+
 	/*
 	 * Reproduce the conflicted merge in-core
 	 */
diff --git a/resolve-undo.c b/resolve-undo.c
index 236320f179c..a4265834977 100644
--- a/resolve-undo.c
+++ b/resolve-undo.c
@@ -125,6 +125,8 @@ int unmerge_index_entry_at(struct index_state *istate, int pos)
 	if (!istate->resolve_undo)
 		return pos;
 
+	ensure_full_index(istate);
+
 	ce = istate->cache[pos];
 	if (ce_stage(ce)) {
 		/* already unmerged */
@@ -172,6 +174,8 @@ void unmerge_marked_index(struct index_state *istate)
 	if (!istate->resolve_undo)
 		return;
 
+	ensure_full_index(istate);
+
 	for (i = 0; i < istate->cache_nr; i++) {
 		const struct cache_entry *ce = istate->cache[i];
 		if (ce->ce_flags & CE_MATCHED)
@@ -186,6 +190,8 @@ void unmerge_index(struct index_state *istate, const struct pathspec *pathspec)
 	if (!istate->resolve_undo)
 		return;
 
+	ensure_full_index(istate);
+
 	for (i = 0; i < istate->cache_nr; i++) {
 		const struct cache_entry *ce = istate->cache[i];
 		if (!ce_path_match(istate, ce, pathspec, NULL))
diff --git a/sha1-name.c b/sha1-name.c
index 0b23b86ceb4..c2f17e526ab 100644
--- a/sha1-name.c
+++ b/sha1-name.c
@@ -1734,6 +1734,8 @@ static void diagnose_invalid_index_path(struct repository *r,
 	if (!prefix)
 		prefix = "";
 
+	ensure_full_index(r->index);
+
 	/* Wrong stage number? */
 	pos = index_name_pos(istate, filename, namelen);
 	if (pos < 0)
@@ -1854,6 +1856,7 @@ static enum get_oid_result get_oid_with_context_1(struct repository *repo,
 
 		if (!repo->index || !repo->index->cache)
 			repo_read_index(repo);
+		ensure_full_index(repo->index);
 		pos = index_name_pos(repo->index, cp, namelen);
 		if (pos < 0)
 			pos = -pos - 1;
diff --git a/split-index.c b/split-index.c
index c0e8ad670d0..3150fa6476a 100644
--- a/split-index.c
+++ b/split-index.c
@@ -4,6 +4,8 @@
 
 struct split_index *init_split_index(struct index_state *istate)
 {
+	ensure_full_index(istate);
+
 	if (!istate->split_index) {
 		istate->split_index = xcalloc(1, sizeof(*istate->split_index));
 		istate->split_index->refcount = 1;
diff --git a/submodule.c b/submodule.c
index b3bb59f0664..f80cfddbd52 100644
--- a/submodule.c
+++ b/submodule.c
@@ -33,9 +33,13 @@ static struct oid_array ref_tips_after_fetch;
  * will be disabled because we can't guess what might be configured in
  * .gitmodules unless the user resolves the conflict.
  */
-int is_gitmodules_unmerged(const struct index_state *istate)
+int is_gitmodules_unmerged(struct index_state *istate)
 {
-	int pos = index_name_pos(istate, GITMODULES_FILE, strlen(GITMODULES_FILE));
+	int pos;
+
+	ensure_full_index(istate);
+
+	pos = index_name_pos(istate, GITMODULES_FILE, strlen(GITMODULES_FILE));
 	if (pos < 0) { /* .gitmodules not found or isn't merged */
 		pos = -1 - pos;
 		if (istate->cache_nr > pos) {  /* there is a .gitmodules */
@@ -77,7 +81,11 @@ int is_writing_gitmodules_ok(void)
  */
 int is_staging_gitmodules_ok(struct index_state *istate)
 {
-	int pos = index_name_pos(istate, GITMODULES_FILE, strlen(GITMODULES_FILE));
+	int pos;
+
+	ensure_full_index(istate);
+
+	pos = index_name_pos(istate, GITMODULES_FILE, strlen(GITMODULES_FILE));
 
 	if ((pos >= 0) && (pos < istate->cache_nr)) {
 		struct stat st;
@@ -301,7 +309,7 @@ int is_submodule_populated_gently(const char *path, int *return_error_code)
 /*
  * Dies if the provided 'prefix' corresponds to an unpopulated submodule
  */
-void die_in_unpopulated_submodule(const struct index_state *istate,
+void die_in_unpopulated_submodule(struct index_state *istate,
 				  const char *prefix)
 {
 	int i, prefixlen;
@@ -311,6 +319,8 @@ void die_in_unpopulated_submodule(const struct index_state *istate,
 
 	prefixlen = strlen(prefix);
 
+	ensure_full_index(istate);
+
 	for (i = 0; i < istate->cache_nr; i++) {
 		struct cache_entry *ce = istate->cache[i];
 		int ce_len = ce_namelen(ce);
@@ -331,11 +341,13 @@ void die_in_unpopulated_submodule(const struct index_state *istate,
 /*
  * Dies if any paths in the provided pathspec descends into a submodule
  */
-void die_path_inside_submodule(const struct index_state *istate,
+void die_path_inside_submodule(struct index_state *istate,
 			       const struct pathspec *ps)
 {
 	int i, j;
 
+	ensure_full_index(istate);
+
 	for (i = 0; i < istate->cache_nr; i++) {
 		struct cache_entry *ce = istate->cache[i];
 		int ce_len = ce_namelen(ce);
@@ -1420,6 +1432,8 @@ static int get_next_submodule(struct child_process *cp,
 {
 	struct submodule_parallel_fetch *spf = data;
 
+	ensure_full_index(spf->r->index);
+
 	for (; spf->count < spf->r->index->cache_nr; spf->count++) {
 		const struct cache_entry *ce = spf->r->index->cache[spf->count];
 		const char *default_argv;
diff --git a/submodule.h b/submodule.h
index 4ac6e31cf1f..84640c49c11 100644
--- a/submodule.h
+++ b/submodule.h
@@ -39,7 +39,7 @@ struct submodule_update_strategy {
 };
 #define SUBMODULE_UPDATE_STRATEGY_INIT {SM_UPDATE_UNSPECIFIED, NULL}
 
-int is_gitmodules_unmerged(const struct index_state *istate);
+int is_gitmodules_unmerged(struct index_state *istate);
 int is_writing_gitmodules_ok(void);
 int is_staging_gitmodules_ok(struct index_state *istate);
 int update_path_in_gitmodules(const char *oldpath, const char *newpath);
@@ -60,9 +60,9 @@ int is_submodule_active(struct repository *repo, const char *path);
  * Otherwise the return error code is the same as of resolve_gitdir_gently.
  */
 int is_submodule_populated_gently(const char *path, int *return_error_code);
-void die_in_unpopulated_submodule(const struct index_state *istate,
+void die_in_unpopulated_submodule(struct index_state *istate,
 				  const char *prefix);
-void die_path_inside_submodule(const struct index_state *istate,
+void die_path_inside_submodule(struct index_state *istate,
 			       const struct pathspec *ps);
 enum submodule_update_type parse_submodule_update_type(const char *value);
 int parse_submodule_update_strategy(const char *value,
diff --git a/tree.c b/tree.c
index e76517f6b18..60f575440c8 100644
--- a/tree.c
+++ b/tree.c
@@ -170,6 +170,8 @@ int read_tree(struct repository *r, struct tree *tree, int stage,
 	 * to matter.
 	 */
 
+	ensure_full_index(istate);
+
 	/*
 	 * See if we have cache entry at the stage.  If so,
 	 * do it the original slow way, otherwise, append and then
diff --git a/wt-status.c b/wt-status.c
index 7074bbdd53c..5366d336938 100644
--- a/wt-status.c
+++ b/wt-status.c
@@ -509,6 +509,8 @@ static int unmerged_mask(struct index_state *istate, const char *path)
 	int pos, mask;
 	const struct cache_entry *ce;
 
+	ensure_full_index(istate);
+
 	pos = index_name_pos(istate, path, strlen(path));
 	if (0 <= pos)
 		return 0;
@@ -657,6 +659,8 @@ static void wt_status_collect_changes_initial(struct wt_status *s)
 	struct index_state *istate = s->repo->index;
 	int i;
 
+	ensure_full_index(istate);
+
 	for (i = 0; i < istate->cache_nr; i++) {
 		struct string_list_item *it;
 		struct wt_status_change_data *d;
@@ -2295,6 +2299,9 @@ static void wt_porcelain_v2_print_unmerged_entry(
 	 */
 	memset(stages, 0, sizeof(stages));
 	sum = 0;
+
+	ensure_full_index(istate);
+
 	pos = index_name_pos(istate, it->string, strlen(it->string));
 	assert(pos < 0);
 	pos = -pos-1;
-- 
gitgitgadget



* [PATCH 16/27] unpack-trees: make sparse aware
  2021-01-25 17:41 [PATCH 00/27] [RFC] Sparse Index Derrick Stolee via GitGitGadget
                   ` (14 preceding siblings ...)
  2021-01-25 17:42 ` [PATCH 15/27] [RFC-VERSION] *: ensure full index Derrick Stolee via GitGitGadget
@ 2021-01-25 17:42 ` Derrick Stolee via GitGitGadget
  2021-02-01 20:50   ` Elijah Newren
  2021-01-25 17:42 ` [PATCH 17/27] dir.c: accept a directory as part of cone-mode patterns Derrick Stolee via GitGitGadget
                   ` (12 subsequent siblings)
  28 siblings, 1 reply; 61+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-25 17:42 UTC (permalink / raw)
  To: git
  Cc: gitster, newren, peff, jrnieder, sunshine, pclouds,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

As a first step to integrate 'git status' and 'git add' with the sparse
index, we must start integrating unpack_trees() with sparse directory
entries. These changes are currently impossible to trigger because
unpack_trees() calls ensure_full_index() if command_requires_full_index
is true. This is the case for all commands at the moment. As we expand
more commands to be sparse-aware, we might find that more changes are
required to unpack_trees(). The current changes will suffice for
'status' and 'add'.

unpack_trees() calls the traverse_trees() API using unpack_callback()
to decide if we should recurse into a subtree. We must add new abilities
to skip a subtree if it corresponds to a sparse directory entry.

It is important to be careful about the trailing directory separator
that exists in the sparse directory entries but not in the subtree
paths.
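
As a rough illustration of the separator issue (a standalone sketch, not
code from this patch; sparse_dir_matches_subtree() is a hypothetical
helper), the comparison has to drop the trailing '/' from the sparse
directory entry's name before matching it against a subtree path:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of Git: returns nonzero when a sparse
 * directory entry name such as "deep/dir/" names the same directory as
 * a subtree path such as "deep/dir".
 */
static int sparse_dir_matches_subtree(const char *ce_name, size_t ce_len,
				      const char *tree_path, size_t tree_len)
{
	/* Assumption: a sparse directory entry name always ends with '/'. */
	assert(ce_len > 0 && ce_name[ce_len - 1] == '/');

	/* Drop the trailing directory separator before comparing. */
	ce_len--;

	return ce_len == tree_len && !memcmp(ce_name, tree_path, tree_len);
}
```

The patch itself performs the equivalent adjustment inside
do_compare_entry() by decrementing ce_len for sparse directory entries.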

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 dir.h           |  2 +-
 preload-index.c |  2 ++
 read-cache.c    |  3 +++
 unpack-trees.c  | 24 ++++++++++++++++++++++--
 4 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/dir.h b/dir.h
index facfae47402..300305ec335 100644
--- a/dir.h
+++ b/dir.h
@@ -503,7 +503,7 @@ static inline int ce_path_match(const struct index_state *istate,
 				char *seen)
 {
 	return match_pathspec(istate, pathspec, ce->name, ce_namelen(ce), 0, seen,
-			      S_ISDIR(ce->ce_mode) || S_ISGITLINK(ce->ce_mode));
+			      S_ISSPARSEDIR(ce) || S_ISDIR(ce->ce_mode) || S_ISGITLINK(ce->ce_mode));
 }
 
 static inline int dir_path_match(const struct index_state *istate,
diff --git a/preload-index.c b/preload-index.c
index ed6eaa47388..323fc8c5100 100644
--- a/preload-index.c
+++ b/preload-index.c
@@ -54,6 +54,8 @@ static void *preload_thread(void *_data)
 			continue;
 		if (S_ISGITLINK(ce->ce_mode))
 			continue;
+		if (S_ISSPARSEDIR(ce))
+			continue;
 		if (ce_uptodate(ce))
 			continue;
 		if (ce_skip_worktree(ce))
diff --git a/read-cache.c b/read-cache.c
index 65679d70d7c..ab0c2b86ec0 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1572,6 +1572,9 @@ int refresh_index(struct index_state *istate, unsigned int flags,
 		if (ignore_submodules && S_ISGITLINK(ce->ce_mode))
 			continue;
 
+		if (istate->sparse_index && S_ISSPARSEDIR(ce))
+			continue;
+
 		if (pathspec && !ce_path_match(istate, ce, pathspec, seen))
 			filtered = 1;
 
diff --git a/unpack-trees.c b/unpack-trees.c
index b324eec2a5d..90644856a80 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -583,6 +583,13 @@ static void mark_ce_used(struct cache_entry *ce, struct unpack_trees_options *o)
 {
 	ce->ce_flags |= CE_UNPACKED;
 
+	/*
+	 * If this is a sparse directory, don't advance cache_bottom.
+	 * That will be advanced later using the cache-tree data.
+	 */
+	if (S_ISSPARSEDIR(ce))
+		return;
+
 	if (o->cache_bottom < o->src_index->cache_nr &&
 	    o->src_index->cache[o->cache_bottom] == ce) {
 		int bottom = o->cache_bottom;
@@ -980,6 +987,9 @@ static int do_compare_entry(const struct cache_entry *ce,
 	ce_len -= pathlen;
 	ce_name = ce->name + pathlen;
 
+	/* remove directory separator if a sparse directory entry */
+	if (S_ISSPARSEDIR(ce))
+		ce_len--;
 	return df_name_compare(ce_name, ce_len, S_IFREG, name, namelen, mode);
 }
 
@@ -989,6 +999,10 @@ static int compare_entry(const struct cache_entry *ce, const struct traverse_inf
 	if (cmp)
 		return cmp;
 
+	/* If ce is a sparse directory, then allow equality here. */
+	if (S_ISSPARSEDIR(ce))
+		return 0;
+
 	/*
 	 * Even if the beginning compared identically, the ce should
 	 * compare as bigger than a directory leading up to it!
@@ -1239,6 +1253,7 @@ static int unpack_callback(int n, unsigned long mask, unsigned long dirmask, str
 	struct cache_entry *src[MAX_UNPACK_TREES + 1] = { NULL, };
 	struct unpack_trees_options *o = info->data;
 	const struct name_entry *p = names;
+	unsigned recurse = 1;
 
 	/* Find first entry with a real name (we could use "mask" too) */
 	while (!p->mode)
@@ -1280,12 +1295,16 @@ static int unpack_callback(int n, unsigned long mask, unsigned long dirmask, str
 					}
 				}
 				src[0] = ce;
+
+				if (S_ISSPARSEDIR(ce))
+					recurse = 0;
 			}
 			break;
 		}
 	}
 
-	if (unpack_nondirectories(n, mask, dirmask, src, names, info) < 0)
+	if (recurse &&
+	    unpack_nondirectories(n, mask, dirmask, src, names, info) < 0)
 		return -1;
 
 	if (o->merge && src[0]) {
@@ -1315,7 +1334,8 @@ static int unpack_callback(int n, unsigned long mask, unsigned long dirmask, str
 			}
 		}
 
-		if (traverse_trees_recursive(n, dirmask, mask & ~dirmask,
+		if (recurse &&
+		    traverse_trees_recursive(n, dirmask, mask & ~dirmask,
 					     names, info) < 0)
 			return -1;
 		return mask;
-- 
gitgitgadget



* [PATCH 17/27] dir.c: accept a directory as part of cone-mode patterns
  2021-01-25 17:41 [PATCH 00/27] [RFC] Sparse Index Derrick Stolee via GitGitGadget
                   ` (15 preceding siblings ...)
  2021-01-25 17:42 ` [PATCH 16/27] unpack-trees: make sparse aware Derrick Stolee via GitGitGadget
@ 2021-01-25 17:42 ` Derrick Stolee via GitGitGadget
  2021-02-01 22:12   ` Elijah Newren
  2021-01-25 17:42 ` [PATCH 18/27] status: use sparse-index throughout Derrick Stolee via GitGitGadget
                   ` (11 subsequent siblings)
  28 siblings, 1 reply; 61+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-25 17:42 UTC (permalink / raw)
  To: git
  Cc: gitster, newren, peff, jrnieder, sunshine, pclouds,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

When we have sparse directory entries in the index, we want to compare
those directories against the sparse-checkout patterns. The pattern
matching algorithms expect a file path, not a directory path. This
matters most for "cone mode" patterns, which match files directly
inside the "parent directories" as well as everything under the
recursively matched directories.

If path_matches_pattern_list() is given a directory, we can add a bogus
filename ("-") to the directory and get the same results as before,
assuming we are in cone mode. Since sparse index requires cone mode
patterns, this is an acceptable assumption.
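
A minimal standalone sketch of the idea follows. It does not use Git's
strbuf API; make_lookup_path() and its fixed-size buffer are assumptions
made only for illustration. The lookup key is the path with a leading
'/', and a directory path (one that ends in '/') gets a placeholder file
name appended:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of Git: writes "/" followed by pathname
 * into out. If pathname ends with '/', a bogus file name "-" is
 * appended so the result looks like a file inside that directory.
 * Returns 0 on success, -1 if out is too small.
 */
static int make_lookup_path(const char *pathname, char *out, size_t outlen)
{
	size_t len;
	int is_dir, n;

	assert(pathname && out && outlen > 0);

	len = strlen(pathname);
	is_dir = len > 0 && pathname[len - 1] == '/';

	n = snprintf(out, outlen, "/%s%s", pathname, is_dir ? "-" : "");
	return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

With this sketch, "folder1/" becomes "/folder1/-" while "folder1/a"
stays "/folder1/a", so both can go through the same file-oriented
matching logic.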

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 dir.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/dir.c b/dir.c
index ad6eb033cb1..c786fa98d0e 100644
--- a/dir.c
+++ b/dir.c
@@ -1384,6 +1384,11 @@ enum pattern_match_result path_matches_pattern_list(
 	strbuf_addch(&parent_pathname, '/');
 	strbuf_add(&parent_pathname, pathname, pathlen);
 
+	/* Directory requests should be added as if they are a file */
+	if (parent_pathname.len > 1 &&
+	    parent_pathname.buf[parent_pathname.len - 1] == '/')
+		strbuf_add(&parent_pathname, "-", 1);
+
 	if (hashmap_contains_path(&pl->recursive_hashmap,
 				  &parent_pathname)) {
 		result = MATCHED_RECURSIVE;
-- 
gitgitgadget



* [PATCH 18/27] status: use sparse-index throughout
  2021-01-25 17:41 [PATCH 00/27] [RFC] Sparse Index Derrick Stolee via GitGitGadget
                   ` (16 preceding siblings ...)
  2021-01-25 17:42 ` [PATCH 17/27] dir.c: accept a directory as part of cone-mode patterns Derrick Stolee via GitGitGadget
@ 2021-01-25 17:42 ` Derrick Stolee via GitGitGadget
  2021-01-25 17:42 ` [PATCH 19/27] status: skip sparse-checkout percentage with sparse-index Derrick Stolee via GitGitGadget
                   ` (10 subsequent siblings)
  28 siblings, 0 replies; 61+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-25 17:42 UTC (permalink / raw)
  To: git
  Cc: gitster, newren, peff, jrnieder, sunshine, pclouds,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

By testing 'git -c core.fsmonitor= status -uno', we can check for the
simplest index operations that can be made sparse-aware. The necessary
implementation details are already integrated with sparse-checkout, so
modify command_requires_full_index to be zero for cmd_status().

By running the debugger for 'git status -uno' after that change, we find
two instances of ensure_full_index() that were added for extra safety,
but can be removed without issue.

In refresh_index(), we loop through the index entries. The
refresh_cache_ent() method copies the sparse directories into the
refreshed index without issue.

The loop within run_diff_files() skips entries that are in stage 0 and
have skip-worktree enabled, so it seems safe to remove the
ensure_full_index() call here.

While this change avoids calling ensure_full_index(), it actually slows
'git status' because we do not have the cache-tree extension to help us.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/commit.c                         |  3 +++
 diff-lib.c                               |  2 --
 read-cache.c                             |  1 -
 t/t1092-sparse-checkout-compatibility.sh | 10 +++++++---
 4 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/builtin/commit.c b/builtin/commit.c
index 505fe60956d..543aa0caeae 100644
--- a/builtin/commit.c
+++ b/builtin/commit.c
@@ -1400,6 +1400,9 @@ int cmd_status(int argc, const char **argv, const char *prefix)
 	if (argc == 2 && !strcmp(argv[1], "-h"))
 		usage_with_options(builtin_status_usage, builtin_status_options);
 
+	prepare_repo_settings(the_repository);
+	the_repository->settings.command_requires_full_index = 0;
+
 	status_init_config(&s, git_status_config);
 	argc = parse_options(argc, argv, prefix,
 			     builtin_status_options,
diff --git a/diff-lib.c b/diff-lib.c
index 3743e4463b4..b73cc1859a4 100644
--- a/diff-lib.c
+++ b/diff-lib.c
@@ -96,8 +96,6 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
 	uint64_t start = getnanotime();
 	struct index_state *istate = revs->diffopt.repo->index;
 
-	ensure_full_index(istate);
-
 	diff_set_mnemonic_prefix(&revs->diffopt, "i/", "w/");
 
 	refresh_fsmonitor(istate);
diff --git a/read-cache.c b/read-cache.c
index ab0c2b86ec0..78910d8f1b7 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1561,7 +1561,6 @@ int refresh_index(struct index_state *istate, unsigned int flags,
 	 */
 	preload_index(istate, pathspec, 0);
 
-	ensure_full_index(istate);
 	for (i = 0; i < istate->cache_nr; i++) {
 		struct cache_entry *ce, *new_entry;
 		int cache_errno = 0;
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index c6b7e8b8891..a3521cdc310 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -382,12 +382,16 @@ test_expect_success 'sparse-index is expanded and converted back' '
 	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
 		git -C sparse-index -c core.fsmonitor="" reset --hard &&
 	test_region index convert_to_sparse trace2.txt &&
-	test_region index ensure_full_index trace2.txt &&
+	test_region index ensure_full_index trace2.txt
+'
 
-	rm trace2.txt &&
+test_expect_success 'sparse-index is not expanded' '
+	init_repos &&
+
+	rm -f trace2.txt &&
 	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
 		git -C sparse-index -c core.fsmonitor="" status -uno &&
-	test_region index ensure_full_index trace2.txt
+	test_region ! index ensure_full_index trace2.txt
 '
 
 test_done
-- 
gitgitgadget



* [PATCH 19/27] status: skip sparse-checkout percentage with sparse-index
  2021-01-25 17:41 [PATCH 00/27] [RFC] Sparse Index Derrick Stolee via GitGitGadget
                   ` (17 preceding siblings ...)
  2021-01-25 17:42 ` [PATCH 18/27] status: use sparse-index throughout Derrick Stolee via GitGitGadget
@ 2021-01-25 17:42 ` Derrick Stolee via GitGitGadget
  2021-01-25 17:42 ` [PATCH 20/27] sparse-index: expand_to_path() trivial implementation Derrick Stolee via GitGitGadget
                   ` (9 subsequent siblings)
  28 siblings, 0 replies; 61+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-25 17:42 UTC (permalink / raw)
  To: git
  Cc: gitster, newren, peff, jrnieder, sunshine, pclouds,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

'git status' began reporting a percentage of populated paths when
sparse-checkout is enabled in 051df3cf (wt-status: show sparse
checkout status as well, 2020-07-18). This percentage is incorrect when
the index has sparse directories. It would also be expensive to
calculate as we would need to parse trees to count the total number of
possible paths.

Avoid the expensive computation by simplifying the output to only report
that a sparse checkout exists, without the percentage.
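
The sentinel-based dispatch can be sketched in isolation as below. The
two macro names and message strings mirror the patch, but
format_sparse_status() is a hypothetical stand-in for the real
status_printf_ln() calls and is not part of Git:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SPARSE_CHECKOUT_DISABLED -1
#define SPARSE_CHECKOUT_SPARSE_INDEX -2

/*
 * Hypothetical formatter, not part of Git: writes the status line for
 * the given percentage (or sentinel) into out and returns the number
 * of characters written, or 0 when nothing should be printed.
 */
static int format_sparse_status(int percentage, char *out, size_t outlen)
{
	assert(out && outlen > 0);

	if (percentage == SPARSE_CHECKOUT_DISABLED) {
		out[0] = '\0';
		return 0;
	}

	/* With a sparse index the percentage is unknown, so omit it. */
	if (percentage == SPARSE_CHECKOUT_SPARSE_INDEX)
		return snprintf(out, outlen, "You are in a sparse checkout.");

	return snprintf(out, outlen,
			"You are in a sparse checkout with %d%% of tracked files present.",
			percentage);
}
```

Negative sentinels work here because a real percentage is never
negative, so no extra field is needed in the state struct.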

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t1092-sparse-checkout-compatibility.sh |  8 ++++++++
 wt-status.c                              | 14 +++++++++++---
 wt-status.h                              |  1 +
 3 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index a3521cdc310..09650f0755c 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -190,6 +190,14 @@ test_expect_success 'status with options' '
 	test_all_match git status --porcelain=v2 -uno
 '
 
+test_expect_success 'status reports sparse-checkout' '
+	init_repos &&
+	git -C sparse-checkout status >full &&
+	git -C sparse-index status >sparse &&
+	test_i18ngrep "You are in a sparse checkout with " full &&
+	test_i18ngrep "You are in a sparse checkout." sparse
+'
+
 test_expect_success 'add, commit, checkout' '
 	init_repos &&
 
diff --git a/wt-status.c b/wt-status.c
index 5366d336938..46c9d71068e 100644
--- a/wt-status.c
+++ b/wt-status.c
@@ -1492,9 +1492,12 @@ static void show_sparse_checkout_in_use(struct wt_status *s,
 	if (s->state.sparse_checkout_percentage == SPARSE_CHECKOUT_DISABLED)
 		return;
 
-	status_printf_ln(s, color,
-			 _("You are in a sparse checkout with %d%% of tracked files present."),
-			 s->state.sparse_checkout_percentage);
+	if (s->state.sparse_checkout_percentage == SPARSE_CHECKOUT_SPARSE_INDEX)
+		status_printf_ln(s, color, _("You are in a sparse checkout."));
+	else
+		status_printf_ln(s, color,
+				_("You are in a sparse checkout with %d%% of tracked files present."),
+				s->state.sparse_checkout_percentage);
 	wt_longstatus_print_trailer(s);
 }
 
@@ -1652,6 +1655,11 @@ static void wt_status_check_sparse_checkout(struct repository *r,
 		return;
 	}
 
+	if (r->index->sparse_index) {
+		state->sparse_checkout_percentage = SPARSE_CHECKOUT_SPARSE_INDEX;
+		return;
+	}
+
 	for (i = 0; i < r->index->cache_nr; i++) {
 		struct cache_entry *ce = r->index->cache[i];
 		if (ce_skip_worktree(ce))
diff --git a/wt-status.h b/wt-status.h
index 35b44c388ed..3cb0c200244 100644
--- a/wt-status.h
+++ b/wt-status.h
@@ -80,6 +80,7 @@ enum wt_status_format {
 #define HEAD_DETACHED_AT _("HEAD detached at ")
 #define HEAD_DETACHED_FROM _("HEAD detached from ")
 #define SPARSE_CHECKOUT_DISABLED -1
+#define SPARSE_CHECKOUT_SPARSE_INDEX -2
 
 struct wt_status_state {
 	int merge_in_progress;
-- 
gitgitgadget



* [PATCH 20/27] sparse-index: expand_to_path() trivial implementation
  2021-01-25 17:41 [PATCH 00/27] [RFC] Sparse Index Derrick Stolee via GitGitGadget
                   ` (18 preceding siblings ...)
  2021-01-25 17:42 ` [PATCH 19/27] status: skip sparse-checkout percentage with sparse-index Derrick Stolee via GitGitGadget
@ 2021-01-25 17:42 ` Derrick Stolee via GitGitGadget
  2021-01-25 17:42 ` [PATCH 21/27] sparse-index: expand_to_path no-op if path exists Derrick Stolee via GitGitGadget
                   ` (8 subsequent siblings)
  28 siblings, 0 replies; 61+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-25 17:42 UTC (permalink / raw)
  To: git
  Cc: gitster, newren, peff, jrnieder, sunshine, pclouds,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Before we check if a specific file or directory exists in the index, we
should check whether a leading directory is a sparse directory entry. If
so, we will want to expand the index _just enough_ to be sure that the
paths we are interested in are in the index.

The actually-interesting implementation will follow in a later change.
For now, simply call ensure_full_index() to expand every directory
simultaneously.

Calls like index_dir_exists(), adjust_dirname_case(), and
index_file_exists() in name-hash.c can trust the name hash if the index
was properly expanded for the requested names. These methods can
transition from ensure_full_index() to expand_to_path().
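
The shape of this transition can be shown with a toy model. This is a
sketch only: struct toy_index and both toy_* functions are hypothetical,
and the early return for an already-full index is an assumption about
how ensure_full_index() behaves:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct index_state; not Git's real structure. */
struct toy_index {
	int sparse_index;	/* nonzero while sparse directories remain */
	unsigned expansions;	/* how many full expansions were performed */
};

/* Expand everything; assumed to be a no-op if the index is already full. */
static void toy_ensure_full_index(struct toy_index *istate)
{
	assert(istate != NULL);

	if (!istate->sparse_index)
		return;
	istate->sparse_index = 0;
	istate->expansions++;
}

/*
 * Trivial expand_to_path(): callers state which path they care about,
 * but for now the whole index is expanded regardless of the path.
 */
static void toy_expand_to_path(struct toy_index *istate,
			       const char *path, size_t pathlen, int icase)
{
	(void)path;
	(void)pathlen;
	(void)icase;
	toy_ensure_full_index(istate);
}
```

Callers already pass the path they need, so a later, smarter
implementation can expand less without touching the call sites.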

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 name-hash.c    | 10 ++++------
 sparse-index.c |  7 +++++++
 sparse-index.h | 12 ++++++++++++
 3 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/name-hash.c b/name-hash.c
index 0f6d4fcca5a..641f6900a7c 100644
--- a/name-hash.c
+++ b/name-hash.c
@@ -8,6 +8,7 @@
 #include "cache.h"
 #include "thread-utils.h"
 #include "trace2.h"
+#include "sparse-index.h"
 
 struct dir_entry {
 	struct hashmap_entry ent;
@@ -679,9 +680,8 @@ int index_dir_exists(struct index_state *istate, const char *name, int namelen)
 {
 	struct dir_entry *dir;
 
-	ensure_full_index(istate);
-
 	lazy_init_name_hash(istate);
+	expand_to_path(istate, name, namelen, 0);
 	dir = find_dir_entry(istate, name, namelen);
 	return dir && dir->nr;
 }
@@ -691,9 +691,8 @@ void adjust_dirname_case(struct index_state *istate, char *name)
 	const char *startPtr = name;
 	const char *ptr = startPtr;
 
-	ensure_full_index(istate);
-
 	lazy_init_name_hash(istate);
+	expand_to_path(istate, name, strlen(name), 0);
 	while (*ptr) {
 		while (*ptr && *ptr != '/')
 			ptr++;
@@ -716,9 +715,8 @@ struct cache_entry *index_file_exists(struct index_state *istate, const char *na
 	struct cache_entry *ce;
 	unsigned int hash = memihash(name, namelen);
 
-	ensure_full_index(istate);
-
 	lazy_init_name_hash(istate);
+	expand_to_path(istate, name, namelen, icase);
 
 	ce = hashmap_get_entry_from_hash(&istate->name_hash, hash, NULL,
 					 struct cache_entry, ent);
diff --git a/sparse-index.c b/sparse-index.c
index 3552f88fb03..dd1a06dfdd3 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -280,3 +280,10 @@ void ensure_full_index(struct index_state *istate)
 
 	trace2_region_leave("index", "ensure_full_index", istate->repo);
 }
+
+void expand_to_path(struct index_state *istate,
+		    const char *path, size_t pathlen, int icase)
+{
+	/* for now, do the obviously-correct, slow thing */
+	ensure_full_index(istate);
+}
diff --git a/sparse-index.h b/sparse-index.h
index ca936e95d11..549e4171f1a 100644
--- a/sparse-index.h
+++ b/sparse-index.h
@@ -4,6 +4,18 @@
 struct index_state;
 void ensure_full_index(struct index_state *istate);
 int convert_to_sparse(struct index_state *istate);
+/*
+ * Some places in the codebase expect to search for a specific path.
+ * This path might be outside of the sparse-checkout definition, in
+ * which case a sparse-index may not contain an entry for that path.
+ *
+ * Given an index and a path, check to see if a leading directory for
+ * 'path' exists in the index as a sparse directory. In that case,
+ * expand that sparse directory to a full range of cache entries and
+ * populate the index accordingly.
+ */
+void expand_to_path(struct index_state *istate,
+		    const char *path, size_t pathlen, int icase);
 
 struct repository;
 int set_sparse_index_config(struct repository *repo, int enable);
-- 
gitgitgadget



* [PATCH 21/27] sparse-index: expand_to_path no-op if path exists
  2021-01-25 17:41 [PATCH 00/27] [RFC] Sparse Index Derrick Stolee via GitGitGadget
                   ` (19 preceding siblings ...)
  2021-01-25 17:42 ` [PATCH 20/27] sparse-index: expand_to_path() trivial implementation Derrick Stolee via GitGitGadget
@ 2021-01-25 17:42 ` Derrick Stolee via GitGitGadget
  2021-02-01 22:34   ` Elijah Newren
  2021-01-25 17:42 ` [PATCH 22/27] add: allow operating on a sparse-only index Derrick Stolee via GitGitGadget
                   ` (7 subsequent siblings)
  28 siblings, 1 reply; 61+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-25 17:42 UTC (permalink / raw)
  To: git
  Cc: gitster, newren, peff, jrnieder, sunshine, pclouds,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

We need to check the file hashmap first, then look to see if the
directory signals a non-sparse directory entry. In such a case, we can
rely on the contents of the sparse-index.

We still use ensure_full_index() in the case that we hit a path that is
within a sparse-directory entry.
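
The lookup in this patch relies on the convention that a failed position
search returns a negative value, from which the insertion position is
recovered as -pos - 1. A minimal sketch of that convention follows. It
uses a hypothetical toy_name_pos() over a sorted array of NUL-terminated
names and plain strcmp() ordering, which is an assumption of the example
and not the real index_name_pos():

```c
#include <assert.h>
#include <string.h>

/*
 * Binary search over 'names' (sorted by strcmp order).
 * Returns the position of 'name' if present, otherwise -(insertion) - 1.
 */
static int toy_name_pos(const char *const *names, int nr, const char *name)
{
	int lo = 0, hi = nr;

	assert(nr >= 0);
	while (lo < hi) {
		int mid = lo + (hi - lo) / 2;
		int cmp = strcmp(name, names[mid]);

		if (!cmp)
			return mid;
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return -lo - 1;
}

/* Recover where 'name' would be inserted from a toy_name_pos() result. */
static int toy_insertion_pos(int pos)
{
	return pos < 0 ? -pos - 1 : pos;
}
```

Note that the insertion position can equal the number of entries when the
name sorts after everything else, so in this toy model a caller has to
bounds-check before dereferencing the entry at that position.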

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 name-hash.c    |  6 ++++++
 sparse-index.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 59 insertions(+)

diff --git a/name-hash.c b/name-hash.c
index 641f6900a7c..cb0f316f652 100644
--- a/name-hash.c
+++ b/name-hash.c
@@ -110,6 +110,12 @@ static void hash_index_entry(struct index_state *istate, struct cache_entry *ce)
 	if (ce->ce_flags & CE_HASHED)
 		return;
 	ce->ce_flags |= CE_HASHED;
+
+	if (ce->ce_mode == CE_MODE_SPARSE_DIRECTORY) {
+		add_dir_entry(istate, ce);
+		return;
+	}
+
 	hashmap_entry_init(&ce->ent, memihash(ce->name, ce_namelen(ce)));
 	hashmap_add(&istate->name_hash, &ce->ent);
 
diff --git a/sparse-index.c b/sparse-index.c
index dd1a06dfdd3..bf8dce9a09b 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -281,9 +281,62 @@ void ensure_full_index(struct index_state *istate)
 	trace2_region_leave("index", "ensure_full_index", istate->repo);
 }
 
+static int in_expand_to_path = 0;
+
 void expand_to_path(struct index_state *istate,
 		    const char *path, size_t pathlen, int icase)
 {
+	struct strbuf path_as_dir = STRBUF_INIT;
+	int pos;
+
+	/* prevent extra recursion */
+	if (in_expand_to_path)
+		return;
+
+	if (!istate || !istate->sparse_index)
+		return;
+
+	if (!istate->repo)
+		istate->repo = the_repository;
+
+	in_expand_to_path = 1;
+
+	/*
+	 * We only need to actually expand a region if the
+	 * following are both true:
+	 *
+	 * 1. 'path' is not already in the index.
+	 * 2. Some parent directory of 'path' is a sparse directory.
+	 */
+
+	strbuf_add(&path_as_dir, path, pathlen);
+	strbuf_addch(&path_as_dir, '/');
+
+	/* in_expand_to_path prevents infinite recursion here */
+	if (index_file_exists(istate, path, pathlen, icase))
+		goto cleanup;
+
+	pos = index_name_pos(istate, path_as_dir.buf, path_as_dir.len);
+
+	if (pos < 0)
+		pos = -pos - 1;
+
+	/*
+	 * Even if the path doesn't exist, if the value isn't exactly a
+	 * sparse-directory entry, then there is no need to expand the
+	 * index.
+	 */
+	if (istate->cache[pos]->ce_mode != CE_MODE_SPARSE_DIRECTORY)
+		goto cleanup;
+
+	trace2_region_enter("index", "expand_to_path", istate->repo);
+
 	/* for now, do the obviously-correct, slow thing */
 	ensure_full_index(istate);
+
+	trace2_region_leave("index", "expand_to_path", istate->repo);
+
+cleanup:
+	strbuf_release(&path_as_dir);
+	in_expand_to_path = 0;
 }
-- 
gitgitgadget



* [PATCH 22/27] add: allow operating on a sparse-only index
  2021-01-25 17:41 [PATCH 00/27] [RFC] Sparse Index Derrick Stolee via GitGitGadget
                   ` (20 preceding siblings ...)
  2021-01-25 17:42 ` [PATCH 21/27] sparse-index: expand_to_path no-op if path exists Derrick Stolee via GitGitGadget
@ 2021-01-25 17:42 ` Derrick Stolee via GitGitGadget
  2021-02-01 23:08   ` Elijah Newren
  2021-01-25 17:42 ` [PATCH 23/27] submodule: die_path_inside_submodule is sparse aware Derrick Stolee via GitGitGadget
                   ` (6 subsequent siblings)
  28 siblings, 1 reply; 61+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-25 17:42 UTC (permalink / raw)
  To: git
  Cc: gitster, newren, peff, jrnieder, sunshine, pclouds,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Replace enough calls to ensure_full_index() with calls to
expand_to_path() to reduce how often 'git add' expands a sparse index in
memory (before writing a sparse index again).

One non-obvious case is index_name_pos_also_unmerged() which is only hit
on the Windows platform (in my tests). Use expand_to_path() instead of
ensure_full_index().

Add a test to check that 'git add -A' and 'git add <file>' do not
expand the index at all, as long as <file> is not within a sparse
directory.
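
The expand_to_path() change in this patch only expands when the sparse
directory entry is a prefix of the path being looked up. A minimal sketch
of that prefix test, written as a standalone helper on NUL-terminated
strings, is below. The name toy_dir_is_prefix() is hypothetical; the
condition mirrors the ce_namelen and strncmp() check in the sparse-index.c
hunk:

```c
#include <assert.h>
#include <string.h>

/*
 * 'dir' is a directory name with a trailing slash, such as "folder1/".
 * 'path_as_dir' is the path of interest with a slash appended.
 * Returns 1 only if 'dir' is a strict prefix of 'path_as_dir'.
 */
static int toy_dir_is_prefix(const char *dir, const char *path_as_dir)
{
	size_t dirlen, pathlen;

	assert(dir && path_as_dir);
	dirlen = strlen(dir);
	pathlen = strlen(path_as_dir);

	/* An equal or longer name cannot contain the path. */
	if (dirlen >= pathlen)
		return 0;
	return !strncmp(dir, path_as_dir, dirlen);
}
```

Because the directory name carries its trailing slash, "folder1/" does not
match a path under "folder10/".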

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/add.c                            |  3 +++
 dir.c                                    |  8 ++++----
 read-cache.c                             | 10 +++++-----
 sparse-index.c                           | 18 ++++++++++++++----
 t/t1092-sparse-checkout-compatibility.sh | 14 ++++++++++++++
 5 files changed, 40 insertions(+), 13 deletions(-)

diff --git a/builtin/add.c b/builtin/add.c
index a825887c503..b73f8d51de6 100644
--- a/builtin/add.c
+++ b/builtin/add.c
@@ -491,6 +491,9 @@ int cmd_add(int argc, const char **argv, const char *prefix)
 	add_new_files = !take_worktree_changes && !refresh_only && !add_renormalize;
 	require_pathspec = !(take_worktree_changes || (0 < addremove_explicit));
 
+	prepare_repo_settings(the_repository);
+	the_repository->settings.command_requires_full_index = 0;
+
 	hold_locked_index(&lock_file, LOCK_DIE_ON_ERROR);
 
 	/*
diff --git a/dir.c b/dir.c
index c786fa98d0e..21998c7c4b7 100644
--- a/dir.c
+++ b/dir.c
@@ -18,6 +18,7 @@
 #include "ewah/ewok.h"
 #include "fsmonitor.h"
 #include "submodule-config.h"
+#include "sparse-index.h"
 
 /*
  * Tells read_directory_recursive how a file or directory should be treated.
@@ -899,9 +900,9 @@ static int read_skip_worktree_file_from_index(struct index_state *istate,
 {
 	int pos, len;
 
-	ensure_full_index(istate);
-
 	len = strlen(path);
+
+	expand_to_path(istate, path, len, 0);
 	pos = index_name_pos(istate, path, len);
 	if (pos < 0)
 		return -1;
@@ -1707,8 +1708,7 @@ static enum exist_status directory_exists_in_index(struct index_state *istate,
 	if (ignore_case)
 		return directory_exists_in_index_icase(istate, dirname, len);
 
-	ensure_full_index(istate);
-
+	expand_to_path(istate, dirname, len, 0);
 	pos = index_name_pos(istate, dirname, len);
 	if (pos < 0)
 		pos = -pos-1;
diff --git a/read-cache.c b/read-cache.c
index 78910d8f1b7..8c974829497 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -647,7 +647,7 @@ static int index_name_pos_also_unmerged(struct index_state *istate,
 	int pos;
 	struct cache_entry *ce;
 
-	ensure_full_index(istate);
+	expand_to_path(istate, path, namelen, 0);
 
 	pos = index_name_pos(istate, path, namelen);
 	if (pos >= 0)
@@ -724,8 +724,6 @@ int add_to_index(struct index_state *istate, const char *path, struct stat *st,
 	int hash_flags = HASH_WRITE_OBJECT;
 	struct object_id oid;
 
-	ensure_full_index(istate);
-
 	if (flags & ADD_CACHE_RENORMALIZE)
 		hash_flags |= HASH_RENORMALIZE;
 
@@ -733,6 +731,8 @@ int add_to_index(struct index_state *istate, const char *path, struct stat *st,
 		return error(_("%s: can only add regular files, symbolic links or git-directories"), path);
 
 	namelen = strlen(path);
+	expand_to_path(istate, path, namelen, 0);
+
 	if (S_ISDIR(st_mode)) {
 		if (resolve_gitlink_ref(path, "HEAD", &oid) < 0)
 			return error(_("'%s' does not have a commit checked out"), path);
@@ -1104,7 +1104,7 @@ static int has_dir_name(struct index_state *istate,
 	size_t len_eq_last;
 	int cmp_last = 0;
 
-	ensure_full_index(istate);
+	expand_to_path(istate, ce->name, ce->ce_namelen, 0);
 
 	/*
 	 * We are frequently called during an iteration on a sorted
@@ -1349,7 +1349,7 @@ int add_index_entry(struct index_state *istate, struct cache_entry *ce, int opti
 {
 	int pos;
 
-	ensure_full_index(istate);
+	expand_to_path(istate, ce->name, ce->ce_namelen, 0);
 
 	if (option & ADD_CACHE_JUST_APPEND)
 		pos = istate->cache_nr;
diff --git a/sparse-index.c b/sparse-index.c
index bf8dce9a09b..a201f3b905c 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -286,6 +286,7 @@ static int in_expand_to_path = 0;
 void expand_to_path(struct index_state *istate,
 		    const char *path, size_t pathlen, int icase)
 {
+	struct cache_entry *ce = NULL;
 	struct strbuf path_as_dir = STRBUF_INIT;
 	int pos;
 
@@ -320,13 +321,22 @@ void expand_to_path(struct index_state *istate,
 
 	if (pos < 0)
 		pos = -pos - 1;
+	if (pos < istate->cache_nr)
+		ce = istate->cache[pos];
 
 	/*
-	 * Even if the path doesn't exist, if the value isn't exactly a
-	 * sparse-directory entry, then there is no need to expand the
-	 * index.
+	 * If we didn't land on a sparse directory, then there is
+	 * nothing to expand.
 	 */
-	if (istate->cache[pos]->ce_mode != CE_MODE_SPARSE_DIRECTORY)
+	if (ce && !S_ISSPARSEDIR(ce))
+		goto cleanup;
+	/*
+	 * If that sparse directory is not a prefix of the path we
+	 * are looking for, then we don't need to expand.
+	 */
+	if (ce &&
+	    (ce->ce_namelen >= path_as_dir.len ||
+	     strncmp(ce->name, path_as_dir.buf, ce->ce_namelen)))
 		goto cleanup;
 
 	trace2_region_enter("index", "expand_to_path", istate->repo);
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 09650f0755c..ae594ab880c 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -390,6 +390,20 @@ test_expect_success 'sparse-index is expanded and converted back' '
 	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
 		git -C sparse-index -c core.fsmonitor="" reset --hard &&
 	test_region index convert_to_sparse trace2.txt &&
+	test_region index ensure_full_index trace2.txt &&
+
+	rm trace2.txt &&
+	echo >>sparse-index/README.md &&
+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
+		git -C sparse-index -c core.fsmonitor="" add -A &&
+	test_region index convert_to_sparse trace2.txt &&
+	test_region index ensure_full_index trace2.txt &&
+
+	rm trace2.txt &&
+	echo >>sparse-index/extra.txt &&
+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
+		git -C sparse-index -c core.fsmonitor="" add extra.txt &&
+	test_region index convert_to_sparse trace2.txt &&
 	test_region index ensure_full_index trace2.txt
 '
 
-- 
gitgitgadget



* [PATCH 23/27] submodule: die_path_inside_submodule is sparse aware
  2021-01-25 17:41 [PATCH 00/27] [RFC] Sparse Index Derrick Stolee via GitGitGadget
                   ` (21 preceding siblings ...)
  2021-01-25 17:42 ` [PATCH 22/27] add: allow operating on a sparse-only index Derrick Stolee via GitGitGadget
@ 2021-01-25 17:42 ` Derrick Stolee via GitGitGadget
  2021-01-25 17:42 ` [PATCH 24/27] dir: use expand_to_path in add_patterns() Derrick Stolee via GitGitGadget
                   ` (5 subsequent siblings)
  28 siblings, 0 replies; 61+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-25 17:42 UTC (permalink / raw)
  To: git
  Cc: gitster, newren, peff, jrnieder, sunshine, pclouds,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Since we already do not collapse a sparse directory if it contains a
submodule, we don't need to expand to a full index in
die_path_inside_submodule(). A simple scan of the index entries is
sufficient.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 submodule.c                              |  2 --
 t/t1092-sparse-checkout-compatibility.sh | 24 +++++++++++-------------
 2 files changed, 11 insertions(+), 15 deletions(-)

diff --git a/submodule.c b/submodule.c
index f80cfddbd52..487d083e4ef 100644
--- a/submodule.c
+++ b/submodule.c
@@ -346,8 +346,6 @@ void die_path_inside_submodule(struct index_state *istate,
 {
 	int i, j;
 
-	ensure_full_index(istate);
-
 	for (i = 0; i < istate->cache_nr; i++) {
 		struct cache_entry *ce = istate->cache[i];
 		int ce_len = ce_namelen(ce);
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index ae594ab880c..2e8efe6ab37 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -390,29 +390,27 @@ test_expect_success 'sparse-index is expanded and converted back' '
 	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
 		git -C sparse-index -c core.fsmonitor="" reset --hard &&
 	test_region index convert_to_sparse trace2.txt &&
-	test_region index ensure_full_index trace2.txt &&
+	test_region index ensure_full_index trace2.txt
+'
+
+test_expect_success 'sparse-index is not expanded' '
+	init_repos &&
+
+	rm -f trace2.txt &&
+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
+		git -C sparse-index -c core.fsmonitor="" status -uno &&
+	test_region ! index ensure_full_index trace2.txt &&
 
 	rm trace2.txt &&
 	echo >>sparse-index/README.md &&
 	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
 		git -C sparse-index -c core.fsmonitor="" add -A &&
-	test_region index convert_to_sparse trace2.txt &&
-	test_region index ensure_full_index trace2.txt &&
+	test_region ! index ensure_full_index trace2.txt &&
 
 	rm trace2.txt &&
 	echo >>sparse-index/extra.txt &&
 	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
 		git -C sparse-index -c core.fsmonitor="" add extra.txt &&
-	test_region index convert_to_sparse trace2.txt &&
-	test_region index ensure_full_index trace2.txt
-'
-
-test_expect_success 'sparse-index is not expanded' '
-	init_repos &&
-
-	rm -f trace2.txt &&
-	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
-		git -C sparse-index -c core.fsmonitor="" status -uno &&
 	test_region ! index ensure_full_index trace2.txt
 '
 
-- 
gitgitgadget



* [PATCH 24/27] dir: use expand_to_path in add_patterns()
  2021-01-25 17:41 [PATCH 00/27] [RFC] Sparse Index Derrick Stolee via GitGitGadget
                   ` (22 preceding siblings ...)
  2021-01-25 17:42 ` [PATCH 23/27] submodule: die_path_inside_submodule is sparse aware Derrick Stolee via GitGitGadget
@ 2021-01-25 17:42 ` Derrick Stolee via GitGitGadget
  2021-02-01 23:21   ` Elijah Newren
  2021-01-25 17:42 ` [PATCH 25/27] fsmonitor: disable if index is sparse Derrick Stolee via GitGitGadget
                   ` (4 subsequent siblings)
  28 siblings, 1 reply; 61+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-25 17:42 UTC (permalink / raw)
  To: git
  Cc: gitster, newren, peff, jrnieder, sunshine, pclouds,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The add_patterns() method has a way to extract a pattern file from the
index. If this pattern file is sparse and within a sparse directory
entry, then we need to expand the index before looking for that entry as
a file path.

For now, convert ensure_full_index() into expand_to_path() to only
expand this way when absolutely necessary.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 dir.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/dir.c b/dir.c
index 21998c7c4b7..7df8d3b1da0 100644
--- a/dir.c
+++ b/dir.c
@@ -1093,7 +1093,7 @@ static int add_patterns(const char *fname, const char *base, int baselen,
 			int pos;
 
 			if (istate)
-				ensure_full_index(istate);
+				expand_to_path(istate, fname, strlen(fname), 0);
 
 			if (oid_stat->valid &&
 			    !match_stat_data_racy(istate, &oid_stat->stat, &st))
-- 
gitgitgadget



* [PATCH 25/27] fsmonitor: disable if index is sparse
  2021-01-25 17:41 [PATCH 00/27] [RFC] Sparse Index Derrick Stolee via GitGitGadget
                   ` (23 preceding siblings ...)
  2021-01-25 17:42 ` [PATCH 24/27] dir: use expand_to_path in add_patterns() Derrick Stolee via GitGitGadget
@ 2021-01-25 17:42 ` Derrick Stolee via GitGitGadget
  2021-01-25 17:42 ` [PATCH 26/27] pathspec: stop calling ensure_full_index Derrick Stolee via GitGitGadget
                   ` (3 subsequent siblings)
  28 siblings, 0 replies; 61+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-25 17:42 UTC (permalink / raw)
  To: git
  Cc: gitster, newren, peff, jrnieder, sunshine, pclouds,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The FS Monitor feature uses a bitmap over the index entries. This
currently interacts poorly with a sparse index. We will revisit this
interaction in the future, but for now protect the index by refusing to
use the FS Monitor extension at all if the index is sparse.
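
One plausible reason for the poor interaction, offered here as an
assumption and not something this patch states, is that a bitmap over
index entries is positional: bit i refers to whatever entry sits at
position i. Collapsing a directory into a single sparse directory entry
shifts every later position. The toy sketch below, with a hypothetical
toy_entry_for_bit() helper, shows a bit recorded against a full index
pointing at nothing once the same index is sparse:

```c
#include <assert.h>
#include <stddef.h>

/* Return the entry name that bit position 'bit' refers to, or NULL. */
static const char *toy_entry_for_bit(const char *const *names, size_t nr,
				     size_t bit)
{
	assert(names);
	return bit < nr ? names[bit] : NULL;
}
```

With a full index of { "a.txt", "dir/x", "dir/y", "z.txt" }, bit 3 names
"z.txt". After collapsing to { "a.txt", "dir/", "z.txt" }, the same file
is at position 2 and bit 3 is out of range.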

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 fsmonitor.c                              | 21 ++++++++++++++-------
 read-cache.c                             |  3 ++-
 t/t1092-sparse-checkout-compatibility.sh |  8 ++++----
 3 files changed, 20 insertions(+), 12 deletions(-)

diff --git a/fsmonitor.c b/fsmonitor.c
index 7b8cd3975b9..99b26576baa 100644
--- a/fsmonitor.c
+++ b/fsmonitor.c
@@ -58,6 +58,9 @@ int read_fsmonitor_extension(struct index_state *istate, const void *data,
 	uint64_t timestamp;
 	struct strbuf last_update = STRBUF_INIT;
 
+	if (istate->sparse_index)
+		return 0;
+
 	if (sz < sizeof(uint32_t) + 1 + sizeof(uint32_t))
 		return error("corrupt fsmonitor extension (too short)");
 
@@ -98,7 +101,8 @@ void fill_fsmonitor_bitmap(struct index_state *istate)
 {
 	unsigned int i, skipped = 0;
 
-	ensure_full_index(istate);
+	if (istate->sparse_index)
+		return;
 
 	istate->fsmonitor_dirty = ewah_new();
 	for (i = 0; i < istate->cache_nr; i++) {
@@ -161,11 +165,7 @@ static int query_fsmonitor(int version, const char *last_update, struct strbuf *
 
 static void fsmonitor_refresh_callback(struct index_state *istate, const char *name)
 {
-	int pos;
-
-	ensure_full_index(istate);
-
-	pos = index_name_pos(istate, name, strlen(name));
+	int pos = index_name_pos(istate, name, strlen(name));
 
 	if (pos >= 0) {
 		struct cache_entry *ce = istate->cache[pos];
@@ -190,7 +190,8 @@ void refresh_fsmonitor(struct index_state *istate)
 	char *buf;
 	unsigned int i;
 
-	if (!core_fsmonitor || istate->fsmonitor_has_run_once)
+	if (!core_fsmonitor || istate->fsmonitor_has_run_once ||
+	    istate->sparse_index)
 		return;
 
 	hook_version = fsmonitor_hook_version();
@@ -300,6 +301,9 @@ void add_fsmonitor(struct index_state *istate)
 	unsigned int i;
 	struct strbuf last_update = STRBUF_INIT;
 
+	if (istate->sparse_index)
+		return;
+
 	if (!istate->fsmonitor_last_update) {
 		trace_printf_key(&trace_fsmonitor, "add fsmonitor");
 		istate->cache_changed |= FSMONITOR_CHANGED;
@@ -335,6 +339,9 @@ void tweak_fsmonitor(struct index_state *istate)
 	unsigned int i;
 	int fsmonitor_enabled = git_config_get_fsmonitor();
 
+	if (istate->sparse_index)
+		fsmonitor_enabled = 0;
+
 	if (istate->fsmonitor_dirty) {
 		if (fsmonitor_enabled) {
 			ensure_full_index(istate);
diff --git a/read-cache.c b/read-cache.c
index 8c974829497..96d9b95128a 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -3017,7 +3017,8 @@ static int do_write_index(struct index_state *istate, struct tempfile *tempfile,
 		if (err)
 			return -1;
 	}
-	if (!strip_extensions && istate->fsmonitor_last_update) {
+	if (!strip_extensions && istate->fsmonitor_last_update &&
+	    !istate->sparse_index) {
 		struct strbuf sb = STRBUF_INIT;
 
 		write_fsmonitor_extension(&sb, istate);
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 2e8efe6ab37..1cdf33a4025 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -388,7 +388,7 @@ test_expect_success 'sparse-index is expanded and converted back' '
 	init_repos &&
 
 	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
-		git -C sparse-index -c core.fsmonitor="" reset --hard &&
+		git -C sparse-index reset --hard &&
 	test_region index convert_to_sparse trace2.txt &&
 	test_region index ensure_full_index trace2.txt
 '
@@ -398,19 +398,19 @@ test_expect_success 'sparse-index is not expanded' '
 
 	rm -f trace2.txt &&
 	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
-		git -C sparse-index -c core.fsmonitor="" status -uno &&
+		git -C sparse-index status -uno &&
 	test_region ! index ensure_full_index trace2.txt &&
 
 	rm trace2.txt &&
 	echo >>sparse-index/README.md &&
 	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
-		git -C sparse-index -c core.fsmonitor="" add -A &&
+		git -C sparse-index add -A &&
 	test_region ! index ensure_full_index trace2.txt &&
 
 	rm trace2.txt &&
 	echo >>sparse-index/extra.txt &&
 	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
-		git -C sparse-index -c core.fsmonitor="" add extra.txt &&
+		git -C sparse-index add extra.txt &&
 	test_region ! index ensure_full_index trace2.txt
 '
 
-- 
gitgitgadget



* [PATCH 26/27] pathspec: stop calling ensure_full_index
  2021-01-25 17:41 [PATCH 00/27] [RFC] Sparse Index Derrick Stolee via GitGitGadget
                   ` (24 preceding siblings ...)
  2021-01-25 17:42 ` [PATCH 25/27] fsmonitor: disable if index is sparse Derrick Stolee via GitGitGadget
@ 2021-01-25 17:42 ` Derrick Stolee via GitGitGadget
  2021-02-01 23:24   ` Elijah Newren
  2021-01-25 17:42 ` [PATCH 27/27] cache-tree: integrate with sparse directory entries Derrick Stolee via GitGitGadget
                   ` (2 subsequent siblings)
  28 siblings, 1 reply; 61+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-25 17:42 UTC (permalink / raw)
  To: git
  Cc: gitster, newren, peff, jrnieder, sunshine, pclouds,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The add_pathspec_matches_against_index() method focuses on matching a
pathspec to file entries in the index. It is possible that this already
works correctly for its only use: checking if untracked files exist in
the index.

It is likely that this causes a behavior issue when adding a directory
that exists at HEAD but is outside the sparse cone. I'm marking this as
a place to pursue with future tests.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 pathspec.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/pathspec.c b/pathspec.c
index 9b105855483..61dc771aa02 100644
--- a/pathspec.c
+++ b/pathspec.c
@@ -36,7 +36,6 @@ void add_pathspec_matches_against_index(const struct pathspec *pathspec,
 			num_unmatched++;
 	if (!num_unmatched)
 		return;
-	ensure_full_index(istate);
 	for (i = 0; i < istate->cache_nr; i++) {
 		const struct cache_entry *ce = istate->cache[i];
 		ce_path_match(istate, ce, pathspec, seen);
-- 
gitgitgadget



* [PATCH 27/27] cache-tree: integrate with sparse directory entries
  2021-01-25 17:41 [PATCH 00/27] [RFC] Sparse Index Derrick Stolee via GitGitGadget
                   ` (25 preceding siblings ...)
  2021-01-25 17:42 ` [PATCH 26/27] pathspec: stop calling ensure_full_index Derrick Stolee via GitGitGadget
@ 2021-01-25 17:42 ` Derrick Stolee via GitGitGadget
  2021-02-01 23:54   ` Elijah Newren
  2021-01-25 20:10 ` [PATCH 00/27] [RFC] Sparse Index Junio C Hamano
  2021-02-02  3:11 ` Elijah Newren
  28 siblings, 1 reply; 61+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-25 17:42 UTC (permalink / raw)
  To: git
  Cc: gitster, newren, peff, jrnieder, sunshine, pclouds,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The cache-tree extension was previously disabled with sparse indexes.
However, the cache-tree is an important performance feature for commands
like 'git status' and 'git add'. Integrate it with sparse directory
entries.

When writing a sparse index, completely clear and recalculate the cache
tree. By starting from scratch, the only integration necessary is to
check if we hit a sparse directory entry and create a leaf of the
cache-tree that has an entry_count of one and no subtrees.
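
As a rough sketch of that leaf rule, using simplified stand-in types
rather than the real cache_entry and cache_tree structures (every toy_*
name below is hypothetical, and object IDs are modeled as plain strings):

```c
#include <assert.h>
#include <string.h>

/* Simplified stand-in for a cache entry. */
struct toy_entry {
	const char *name;    /* sparse directories end with '/' */
	int is_sparse_dir;
	const char *oid_hex; /* tree or blob ID, as text */
};

/* Simplified stand-in for a cache-tree node. */
struct toy_tree {
	int entry_count;
	int subtree_nr;
	const char *oid_hex;
};

/*
 * If the first entry of the region is a sparse directory whose name is
 * exactly 'base', make 'it' a leaf that covers one entry and reuses the
 * entry's tree ID. Returns 1 if the leaf rule applied, 0 otherwise.
 */
static int toy_try_sparse_leaf(struct toy_tree *it,
			       const struct toy_entry *entries, int nr,
			       const char *base)
{
	size_t baselen;

	assert(it && base);
	baselen = strlen(base);
	if (nr <= 0)
		return 0;
	if (!entries[0].is_sparse_dir ||
	    strlen(entries[0].name) != baselen ||
	    strncmp(entries[0].name, base, baselen))
		return 0;

	it->entry_count = 1;
	it->subtree_nr = 0;
	it->oid_hex = entries[0].oid_hex;
	return 1;
}
```

When the rule does not apply, the sketch leaves the node untouched so the
usual recursive computation can proceed.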

Once the cache-tree exists within a sparse index, we finally get
improved performance. I test the sparse index performance using a
private monorepo with over 2.1 million files at HEAD, but with a
sparse-checkout definition that has only 68,000 paths in the populated
cone. The sparse index has about 2,000 sparse directory entries. I
compare three scenarios:

 1. Use the full index. The index size is ~186 MB.
 2. Use the sparse index. The index size is ~5.5 MB.
 3. Use a commit where HEAD matches the populated set. The full index
    size is ~5.3MB.

The third benchmark is included as a theoretical optimum for a
repository of the same object database.

First, a clean 'git status' improves from 3.1s to 240ms.

Benchmark #1: full index (git status)
  Time (mean ± σ):      3.167 s ±  0.036 s    [User: 2.006 s, System: 1.078 s]
  Range (min … max):    3.100 s …  3.208 s    10 runs

Benchmark #2: sparse index (git status)
  Time (mean ± σ):     239.5 ms ±   8.1 ms    [User: 189.4 ms, System: 226.8 ms]
  Range (min … max):   226.0 ms … 251.9 ms    13 runs

Benchmark #3: small tree (git status)
  Time (mean ± σ):     195.3 ms ±   4.5 ms    [User: 116.5 ms, System: 84.4 ms]
  Range (min … max):   188.8 ms … 202.8 ms    15 runs

The optimum is still 45ms faster. This is due in part to the 2,000+
sparse directory entries, but there might be other optimizations to make
in the sparse-index case. In particular, I find that this performance
difference disappears when I disable FS Monitor, which is somewhat
disabled in the sparse-index case, but might still be adding overhead.

The performance numbers for 'git add .' are much closer to optimal:

Benchmark #1: full index (git add .)
  Time (mean ± σ):      3.076 s ±  0.022 s    [User: 2.065 s, System: 0.943 s]
  Range (min … max):    3.044 s …  3.116 s    10 runs

Benchmark #2: sparse index (git add .)
  Time (mean ± σ):     218.0 ms ±   6.6 ms    [User: 195.7 ms, System: 206.6 ms]
  Range (min … max):   209.8 ms … 228.2 ms    13 runs

Benchmark #3: small tree (git add .)
  Time (mean ± σ):     217.6 ms ±   5.4 ms    [User: 131.9 ms, System: 86.7 ms]
  Range (min … max):   212.1 ms … 228.4 ms    14 runs

In this test, I also used "echo >>README.md" to append a line to the
README.md file, so the 'git add .' command is doing _something_ other
than a no-op. Without this edit (and FS Monitor enabled) the small
tree case again gains about 30ms on the sparse index case.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 cache-tree.c   | 18 ++++++++++++++++++
 sparse-index.c | 10 +++++++++-
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/cache-tree.c b/cache-tree.c
index 5f07a39e501..9da6a4394e0 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -256,6 +256,24 @@ static int update_one(struct cache_tree *it,
 
 	*skip_count = 0;
 
+	/*
+	 * If the first entry of this region is a sparse directory
+	 * entry corresponding exactly to 'base', then this cache_tree
+	 * struct is a "leaf" in the data structure, pointing to the
+	 * tree OID specified in the entry.
+	 */
+	if (entries > 0) {
+		const struct cache_entry *ce = cache[0];
+
+		if (S_ISSPARSEDIR(ce) &&
+		    ce->ce_namelen == baselen &&
+		    !strncmp(ce->name, base, baselen)) {
+			it->entry_count = 1;
+			oidcpy(&it->oid, &ce->oid);
+			return 1;
+		}
+	}
+
 	if (0 <= it->entry_count && has_object_file(&it->oid))
 		return it->entry_count;
 
diff --git a/sparse-index.c b/sparse-index.c
index a201f3b905c..9ea3b321400 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -181,7 +181,11 @@ int convert_to_sparse(struct index_state *istate)
 	istate->cache_nr = convert_to_sparse_rec(istate,
 						 0, 0, istate->cache_nr,
 						 "", 0, istate->cache_tree);
-	istate->drop_cache_tree = 1;
+
+	/* Clear and recompute the cache-tree */
+	cache_tree_free(&istate->cache_tree);
+	cache_tree_update(istate, 0);
+
 	istate->sparse_index = 1;
 	trace2_region_leave("index", "convert_to_sparse", istate->repo);
 	return 0;
@@ -278,6 +282,10 @@ void ensure_full_index(struct index_state *istate)
 
 	free(full);
 
+	/* Clear and recompute the cache-tree */
+	cache_tree_free(&istate->cache_tree);
+	cache_tree_update(istate, 0);
+
 	trace2_region_leave("index", "ensure_full_index", istate->repo);
 }
 
-- 
gitgitgadget


* Re: [PATCH 00/27] [RFC] Sparse Index
  2021-01-25 17:41 [PATCH 00/27] [RFC] Sparse Index Derrick Stolee via GitGitGadget
                   ` (26 preceding siblings ...)
  2021-01-25 17:42 ` [PATCH 27/27] cache-tree: integrate with sparse directory entries Derrick Stolee via GitGitGadget
@ 2021-01-25 20:10 ` Junio C Hamano
  2021-01-25 21:18   ` Derrick Stolee
  2021-02-02  3:11 ` Elijah Newren
  28 siblings, 1 reply; 61+ messages in thread
From: Junio C Hamano @ 2021-01-25 20:10 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, newren, peff, jrnieder, sunshine, pclouds, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> This RFC proposes an update to the index formats to allow "sparse directory
> entries". These entries correspond to directories that are completely
> excluded from the sparse checkout definition. We can detect that a directory
> is excluded when using "cone mode" patterns.

Yay.

> Since having directory entries is a radical departure from the existing
> index format, a new extension "extensions.sparseIndex" is added. Using a
> sparse index should cause incompatible tools to fail because they do not
> understand this extension.

Safety is good, but because the index is purely a local matter, we
do not have to be so careful as updating the network protocols or
pack/object formats.

I think the use of "extensions.*" mechanism to render the repository
that uses the new feature unusable by older Git is safe enough, but
it may be too draconian.  For example, when things go wrong, don't
you want to "fetch"/"clone" from it into another repository to first
save the objects and refs?  You do not need a version of the index
file you understand in order to do that.

The index format has a mechanism to make older versions of Git bail
when they encounter a file that uses a newer feature that they do not
understand.  Perhaps using it is sufficient instead?
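
The message does not say which index-format mechanism is meant. One candidate is the extension-signature rule in Git's index-format documentation: an extension whose 4-byte signature starts with 'A'..'Z' is optional and may be skipped, while any other signature must be understood by the reader. A minimal sketch of that rule follows; the function name is made up for illustration and is not Git's actual code.

```c
#include <stdbool.h>

/*
 * Sketch of the documented rule: an index extension whose signature
 * begins with an uppercase ASCII letter is optional, so a reader that
 * does not know it may skip it.  Any other first byte marks the
 * extension as mandatory, and a reader that does not understand it is
 * expected to refuse the index file.
 *
 * index_extension_is_optional() is a hypothetical helper name.
 */
static bool index_extension_is_optional(const char *signature)
{
	return signature[0] >= 'A' && signature[0] <= 'Z';
}
```

Under this rule "TREE" (the cache-tree extension) is optional, while "link" (the split-index extension) is mandatory. The other candidate is the version number in the index header, which older readers also reject when it is out of their supported range.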


* Re: [PATCH 00/27] [RFC] Sparse Index
  2021-01-25 20:10 ` [PATCH 00/27] [RFC] Sparse Index Junio C Hamano
@ 2021-01-25 21:18   ` Derrick Stolee
  0 siblings, 0 replies; 61+ messages in thread
From: Derrick Stolee @ 2021-01-25 21:18 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, newren, peff, jrnieder, sunshine, pclouds, Derrick Stolee



On 1/25/21 3:10 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> This RFC proposes an update to the index formats to allow "sparse directory
>> entries". These entries correspond to directories that are completely
>> excluded from the sparse checkout definition. We can detect that a directory
>> is excluded when using "cone mode" patterns.
> 
> Yay.
> 
>> Since having directory entries is a radical departure from the existing
>> index format, a new extension "extensions.sparseIndex" is added. Using a
>> sparse index should cause incompatible tools to fail because they do not
>> understand this extension.
> 
> Safety is good, but because the index is purely a local matter, we
> do not have to be so careful as updating the network protocols or
> pack/object formats.
> 
> I think the use of "extensions.*" mechanism to render the repository
> that uses the new feature unusable by older Git is safe enough, but
> it may be too draconian.  For example, when things go wrong, don't
> you want to "fetch"/"clone" from it into another repository to first
> save the objects and refs?  You do not need a version of the index
> file you understand in order to do that.
> 
> The index format has a mechanism to make older versions of Git bail
> when they encounter a file that uses a newer feature that they do not
> understand.  Perhaps using it is sufficient instead?

There are interesting subtleties with the differences between index
formats 2, 3, and 4 that are worth keeping around. Perhaps the
extension could be a mechanism for allowing sparse directories in
those versions, but then a future "index version 5" includes sparse
directories without the extension.

I could spend time working on such an index v5 in parallel with
the updates to Git commands to make them sparse aware. The logic
from patches 1-14 in this series will be required before that could
begin.

Thanks,
-Stolee


* Re: [PATCH 02/27] sparse-index: implement ensure_full_index()
  2021-01-25 17:41 ` [PATCH 02/27] sparse-index: implement ensure_full_index() Derrick Stolee via GitGitGadget
@ 2021-01-27  3:05   ` Elijah Newren
  2021-01-27 13:43     ` Derrick Stolee
  2021-01-28  5:25     ` Junio C Hamano
  0 siblings, 2 replies; 61+ messages in thread
From: Elijah Newren @ 2021-01-27  3:05 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano, Jeff King, Jonathan Nieder,
	Eric Sunshine, Nguyễn Thái Ngọc, Derrick Stolee,
	Derrick Stolee

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> We will mark an in-memory index_state as having sparse directory entries
> with the sparse_index bit. These currently cannot exist, but we will add
> a mechanism for collapsing a full index to a sparse one in a later
> change. That will happen at write time, so we must first allow parsing
> the format before writing it.
>
> Commands or methods that require a full index in order to operate can
> call ensure_full_index() to expand that index in-memory. This requires
> parsing trees using that index's repository.
>
> Sparse directory entries have a specific 'ce_mode' value. The macro
> S_ISSPARSEDIR(ce) can check if a cache_entry 'ce' has this type. This
> ce_mode is not possible with the existing index formats, so we don't
> also verify all properties of a sparse-directory entry, which are:
>
>  1. ce->ce_mode == 01000755

This is a weird number.  What's the reason for choosing it?  It looks
deceptively close to 0100755, normal executable files, but has the
extra 0, meaning that ce->ce_mode & S_IFMT is 0, suggesting it has no
file type.

Since it's a directory, why not use S_IFDIR (040000)?

(GITLINK does use the weird 0160000 value, but it happens to be
S_IFLNK | S_IFDIR == 0120000 | 040000, which conveys "it's both a
directory and a symlink")
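
The bit arithmetic behind these observations can be checked with a small sketch. The octal constants are written out by hand rather than taken from <sys/stat.h>, and the SKETCH_ names are made up for illustration:

```c
/*
 * File-type masks as octal literals, spelled out so the sketch does not
 * depend on platform headers.  SKETCH_SPARSE_DIR is the value the patch
 * assigns to CE_MODE_SPARSE_DIRECTORY.
 */
#define SKETCH_S_IFMT      0170000u
#define SKETCH_S_IFDIR     0040000u
#define SKETCH_S_IFLNK     0120000u
#define SKETCH_S_IFGITLINK 0160000u
#define SKETCH_SPARSE_DIR  01000755u

/* Return only the file-type bits (bits 12..15) of a mode. */
static unsigned int sketch_file_type_bits(unsigned int mode)
{
	return mode & SKETCH_S_IFMT;
}
```

With these definitions, sketch_file_type_bits(SKETCH_SPARSE_DIR) is 0 because the extra digit lands on bit 18, outside the S_IFMT mask, whereas 0100755 yields 0100000 (a regular file). S_IFLNK | S_IFDIR equals the gitlink value 0160000.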

>  2. ce->flags & CE_SKIP_WORKTREE is true

Makes sense.

>  3. ce->name[ce->namelen - 1] == '/' (ends in dir separator)

Is there a particular reason for this?  I'm used to seeing names
without the trailing slash, both in the index and in tree objects.  I
don't know enough to be for or against this idea; just curious at this
point.

>  4. ce->oid references a tree object.

Makes sense...but doesn't that suggest we'd want to use ce->ce_mode = 040000?


Also, as a bit of a side comment: There have been other requests in
the past to support directory objects in the index.  The only use I
remember for them requested from others was to allow tracking empty
directories.  However, I've long wanted to introduce a new "blobtree"
object to git, so that a user can "git add" some big binary file, but
internally git splits the binary file up and stores it as multiple
blobs within git plus a new "blobtree" object that references all the
individual blobs with enough information about how to stitch all the
blobs together to get the original file.  I had a few different forms
of "blobtree" things that I was interested in.  I think brian once
suggested some similar idea.

> These are all semi-enforced in ensure_full_index() to some extent. Any
> deviation will cause a warning at minimum or a failure in the worst
> case.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  cache.h        | 11 +++++-
>  read-cache.c   |  9 +++++
>  sparse-index.c | 94 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  sparse-index.h |  1 +
>  4 files changed, 113 insertions(+), 2 deletions(-)
>
> diff --git a/cache.h b/cache.h
> index f9c7a603841..884046ca5b8 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -204,6 +204,10 @@ struct cache_entry {
>  #error "CE_EXTENDED_FLAGS out of range"
>  #endif
>
> +#define CE_MODE_SPARSE_DIRECTORY 01000755
> +#define SPARSE_DIR_MODE 0100

Another magic value.  Feels like the commit message should reference
this one and why it was picked.  Seems odd to me, and possibly
problematic to re-use file permission bits that might collide with
files recorded by really old versions of git.  Maybe that's not a
concern, though.

> +#define S_ISSPARSEDIR(m) ((m)->ce_mode == CE_MODE_SPARSE_DIRECTORY)

Should the special sauce apply to ce_flags rather than ce_mode?  Thus,
instead of an S_ISSPARSEDIR, perhaps have a ce_sparse_dir macro
(similar to ce_skip_worktree) based on a CE_SPARSE_DIR value (similar
to CE_SKIP_WORKTREE)?

Or, alternatively, do we need a single special state here?  Could we
check for a combination of ce_mode == 040000 && ce_skip_worktree(ce)?
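
A minimal sketch of that second alternative, assuming a cut-down stand-in for struct cache_entry. The struct, the helper name, and the flag value are assumptions for illustration; this is not what the patch implements:

```c
#include <stdbool.h>

#define SKETCH_S_IFDIR          0040000u
#define SKETCH_CE_SKIP_WORKTREE (1u << 30) /* stand-in for CE_SKIP_WORKTREE */

/* Only the two fields the check needs; the real struct has many more. */
struct sketch_cache_entry {
	unsigned int ce_mode;
	unsigned int ce_flags;
};

/*
 * Treat an entry as a sparse directory when it has the plain directory
 * mode and also carries the skip-worktree flag, instead of relying on a
 * dedicated magic ce_mode value.
 */
static bool sketch_is_sparse_dir(const struct sketch_cache_entry *ce)
{
	return ce->ce_mode == SKETCH_S_IFDIR &&
	       (ce->ce_flags & SKETCH_CE_SKIP_WORKTREE) != 0;
}
```

The trade-off is two comparisons per entry instead of one, in exchange for reusing an existing mode value rather than introducing a new one into the file format.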

> +
>  /* Forward structure decls */
>  struct pathspec;
>  struct child_process;
> @@ -249,6 +253,8 @@ static inline unsigned int create_ce_mode(unsigned int mode)
>  {
>         if (S_ISLNK(mode))
>                 return S_IFLNK;
> +       if (mode == SPARSE_DIR_MODE)
> +               return CE_MODE_SPARSE_DIRECTORY;
>         if (S_ISDIR(mode) || S_ISGITLINK(mode))
>                 return S_IFGITLINK;
>         return S_IFREG | ce_permissions(mode);
> @@ -319,7 +325,8 @@ struct index_state {
>                  drop_cache_tree : 1,
>                  updated_workdir : 1,
>                  updated_skipworktree : 1,
> -                fsmonitor_has_run_once : 1;
> +                fsmonitor_has_run_once : 1,
> +                sparse_index : 1;
>         struct hashmap name_hash;
>         struct hashmap dir_hash;
>         struct object_id oid;
> @@ -721,6 +728,8 @@ int read_index_from(struct index_state *, const char *path,
>                     const char *gitdir);
>  int is_index_unborn(struct index_state *);
>
> +void ensure_full_index(struct index_state *istate);
> +
>  /* For use with `write_locked_index()`. */
>  #define COMMIT_LOCK            (1 << 0)
>  #define SKIP_IF_UNCHANGED      (1 << 1)
> diff --git a/read-cache.c b/read-cache.c
> index ecf6f689940..1097ecbf132 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -101,6 +101,9 @@ static const char *alternate_index_output;
>
>  static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
>  {
> +       if (S_ISSPARSEDIR(ce))
> +               istate->sparse_index = 1;
> +
>         istate->cache[nr] = ce;
>         add_name_hash(istate, ce);
>  }
> @@ -2255,6 +2258,12 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist)
>         trace2_data_intmax("index", the_repository, "read/cache_nr",
>                            istate->cache_nr);
>
> +       if (!istate->repo)
> +               istate->repo = the_repository;
> +       prepare_repo_settings(istate->repo);
> +       if (istate->repo->settings.command_requires_full_index)
> +               ensure_full_index(istate);
> +
>         return istate->cache_nr;
>
>  unmap:
> diff --git a/sparse-index.c b/sparse-index.c
> index 82183ead563..1e70244dc13 100644
> --- a/sparse-index.c
> +++ b/sparse-index.c
> @@ -1,8 +1,100 @@
>  #include "cache.h"
>  #include "repository.h"
>  #include "sparse-index.h"
> +#include "tree.h"
> +#include "pathspec.h"
> +#include "trace2.h"
> +
> +static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
> +{
> +       ALLOC_GROW(istate->cache, nr + 1, istate->cache_alloc);
> +
> +       istate->cache[nr] = ce;
> +       add_name_hash(istate, ce);
> +}
> +
> +static int add_path_to_index(const struct object_id *oid,
> +                               struct strbuf *base, const char *path,
> +                               unsigned int mode, int stage, void *context)
> +{
> +       struct index_state *istate = (struct index_state *)context;
> +       struct cache_entry *ce;
> +       size_t len = base->len;
> +
> +       if (S_ISDIR(mode))
> +               return READ_TREE_RECURSIVE;
> +
> +       strbuf_addstr(base, path);
> +
> +       ce = make_cache_entry(istate, mode, oid, base->buf, 0, 0);
> +       ce->ce_flags |= CE_SKIP_WORKTREE;
> +       set_index_entry(istate, istate->cache_nr++, ce);
> +
> +       strbuf_setlen(base, len);
> +       return 0;
> +}
>
>  void ensure_full_index(struct index_state *istate)
>  {
> -       /* intentionally left blank */
> +       int i;
> +       struct index_state *full;
> +
> +       if (!istate || !istate->sparse_index)
> +               return;
> +
> +       if (!istate->repo)
> +               istate->repo = the_repository;
> +
> +       trace2_region_enter("index", "ensure_full_index", istate->repo);
> +
> +       /* initialize basics of new index */
> +       full = xcalloc(1, sizeof(struct index_state));
> +       memcpy(full, istate, sizeof(struct index_state));
> +
> +       /* then change the necessary things */
> +       full->sparse_index = 0;
> +       full->cache_alloc = (3 * istate->cache_alloc) / 2;
> +       full->cache_nr = 0;
> +       ALLOC_ARRAY(full->cache, full->cache_alloc);
> +
> +       for (i = 0; i < istate->cache_nr; i++) {
> +               struct cache_entry *ce = istate->cache[i];
> +               struct tree *tree;
> +               struct pathspec ps;
> +
> +               if (!S_ISSPARSEDIR(ce)) {
> +                       set_index_entry(full, full->cache_nr++, ce);
> +                       continue;
> +               }
> +               if (!(ce->ce_flags & CE_SKIP_WORKTREE))
> +                       warning(_("index entry is a directory, but not sparse (%08x)"),
> +                               ce->ce_flags);
> +
> +               /* recursively walk into ce->name */
> +               tree = lookup_tree(istate->repo, &ce->oid);
> +
> +               memset(&ps, 0, sizeof(ps));
> +               ps.recursive = 1;
> +               ps.has_wildcard = 1;
> +               ps.max_depth = -1;
> +
> +               read_tree_recursive(istate->repo, tree,
> +                                   ce->name, strlen(ce->name),
> +                                   0, &ps,
> +                                   add_path_to_index, full);
> +
> +               /* free directory entries. full entries are re-used */
> +               discard_cache_entry(ce);
> +       }
> +
> +       /* Copy back into original index. */
> +       memcpy(&istate->name_hash, &full->name_hash, sizeof(full->name_hash));
> +       istate->sparse_index = 0;
> +       istate->cache = full->cache;

Haven't you leaked the original istate->cache here?
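
One way to avoid such a leak, sketched with a cut-down stand-in for struct index_state: release the old pointer array before adopting the new one. The struct and function names are hypothetical, and the thread does not say how the eventual fix was written:

```c
#include <stdlib.h>

/* Only the fields involved in the hand-over; names mirror the patch. */
struct sketch_index {
	void **cache;
	unsigned int cache_nr;
	unsigned int cache_alloc;
};

/*
 * Move the entry array from 'src' into 'dst'.  The old array in 'dst'
 * is freed first; only the array of pointers is released, because the
 * entries it pointed to were either re-used in the new array or already
 * discarded by the caller.
 */
static void sketch_adopt_cache(struct sketch_index *dst,
			       struct sketch_index *src)
{
	free(dst->cache);
	dst->cache = src->cache;
	dst->cache_nr = src->cache_nr;
	dst->cache_alloc = src->cache_alloc;

	/* Leave 'src' empty so a later free of it cannot double-free. */
	src->cache = NULL;
	src->cache_nr = 0;
	src->cache_alloc = 0;
}
```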

> +       istate->cache_nr = full->cache_nr;
> +       istate->cache_alloc = full->cache_alloc;
> +
> +       free(full);
> +
> +       trace2_region_leave("index", "ensure_full_index", istate->repo);
>  }
> diff --git a/sparse-index.h b/sparse-index.h
> index 8dda92032e2..a2777dcac59 100644
> --- a/sparse-index.h
> +++ b/sparse-index.h
> @@ -3,5 +3,6 @@
>
>  struct index_state;
>  void ensure_full_index(struct index_state *istate);
> +int convert_to_sparse(struct index_state *istate);

Seems logical that you'd add one, but was this meant to be included
in a later patch?

>
>  #endif
> \ No newline at end of file
> --
> gitgitgadget
>


* Re: [PATCH 03/27] t1092: compare sparse-checkout to sparse-index
  2021-01-25 17:41 ` [PATCH 03/27] t1092: compare sparse-checkout to sparse-index Derrick Stolee via GitGitGadget
@ 2021-01-27  3:08   ` Elijah Newren
  2021-01-27 13:30     ` Derrick Stolee
  0 siblings, 1 reply; 61+ messages in thread
From: Elijah Newren @ 2021-01-27  3:08 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano, Jeff King, Jonathan Nieder,
	Eric Sunshine, Nguyễn Thái Ngọc, Derrick Stolee,
	Derrick Stolee

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> Add a new 'sparse-index' repo alongside the 'full-checkout' and
> 'sparse-checkout' repos in t1092-sparse-checkout-compatibility.sh. Also
> add run_on_sparse and test_sparse_match helpers. These helpers will be
> used when the sparse index is implemented.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  t/t1092-sparse-checkout-compatibility.sh | 29 ++++++++++++++++++++----
>  1 file changed, 25 insertions(+), 4 deletions(-)
>
> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
> index 8cd3e5a8d22..8876eae0fe3 100755
> --- a/t/t1092-sparse-checkout-compatibility.sh
> +++ b/t/t1092-sparse-checkout-compatibility.sh
> @@ -7,6 +7,7 @@ test_description='compare full workdir to sparse workdir'
>  test_expect_success 'setup' '
>         git init initial-repo &&
>         (
> +               (GIT_TEST_SPARSE_INDEX=0 && export GIT_TEST_SPARSE_INDEX) &&

I thought parentheses started a subshell; once the subshell ends,
wouldn't the setting of GIT_TEST_SPARSE_INDEX be thrown away?
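
POSIX specifies that changes made in a subshell environment do not affect the parent shell. A rough C analogy, assuming a subshell behaves like a forked child process; the helper name is made up for the example, and the caller supplies a variable name that is safe to unset:

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Set an environment variable inside a forked child and report whether
 * the parent can see it afterwards.  Returns 1 if the parent's
 * environment is unchanged, 0 if the variable leaked into the parent,
 * and -1 on any system-call failure.
 */
static int sketch_child_env_stays_in_child(const char *name)
{
	pid_t pid;
	int status;

	if (unsetenv(name) < 0)
		return -1;

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* Child: the equivalent of "VAR=0 && export VAR". */
		if (setenv(name, "0", 1) < 0)
			_exit(1);
		_exit(getenv(name) ? 0 : 1);
	}

	if (waitpid(pid, &status, 0) < 0)
		return -1;
	if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
		return -1;

	/* Parent: the variable was set only in the child's copy. */
	return getenv(name) == NULL;
}
```

The child sees the variable, but the parent's environment is untouched once the child exits, which matches the concern raised about the parenthesized export.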

>                 cd initial-repo &&
>                 echo a >a &&
>                 echo "after deep" >e &&
> @@ -87,23 +88,32 @@ init_repos () {
>
>         cp -r initial-repo sparse-checkout &&
>         git -C sparse-checkout reset --hard &&
> -       git -C sparse-checkout sparse-checkout init --cone &&
> +
> +       cp -r initial-repo sparse-index &&
> +       git -C sparse-index reset --hard &&
>
>         # initialize sparse-checkout definitions
> -       git -C sparse-checkout sparse-checkout set deep
> +       git -C sparse-checkout sparse-checkout init --cone &&
> +       git -C sparse-checkout sparse-checkout set deep &&
> +       GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout init --cone &&
> +       GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep
>  }
>
>  run_on_sparse () {
>         (
>                 cd sparse-checkout &&
> -               $* >../sparse-checkout-out 2>../sparse-checkout-err
> +               GIT_TEST_SPARSE_INDEX=0 $* >../sparse-checkout-out 2>../sparse-checkout-err
> +       ) &&
> +       (
> +               cd sparse-index &&
> +               $* >../sparse-index-out 2>../sparse-index-err
>         )
>  }
>
>  run_on_all () {
>         (
>                 cd full-checkout &&
> -               $* >../full-checkout-out 2>../full-checkout-err
> +               GIT_TEST_SPARSE_INDEX=0 $* >../full-checkout-out 2>../full-checkout-err
>         ) &&
>         run_on_sparse $*
>  }
> @@ -114,6 +124,17 @@ test_all_match () {
>         test_cmp full-checkout-err sparse-checkout-err
>  }
>
> +test_sparse_match () {
> +       run_on_sparse $* &&
> +       test_cmp sparse-checkout-out sparse-index-out &&
> +       test_cmp sparse-checkout-err sparse-index-err
> +}
> +
> +test_expect_success 'expanded in-memory index matches full index' '
> +       init_repos &&
> +       test_sparse_match test-tool read-cache --expand --table-no-stat
> +'
> +
>  test_expect_success 'status with options' '
>         init_repos &&
>         test_all_match git status --porcelain=v2 &&
> --
> gitgitgadget
>


* Re: [PATCH 04/27] test-read-cache: print cache entries with --table
  2021-01-25 17:41 ` [PATCH 04/27] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
@ 2021-01-27  3:25   ` Elijah Newren
  0 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren @ 2021-01-27  3:25 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano, Jeff King, Jonathan Nieder,
	Eric Sunshine, Nguyễn Thái Ngọc, Derrick Stolee,
	Derrick Stolee

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> This table is helpful for discovering data in the index to ensure it is
> being written correctly, especially as we build and test the
> sparse-index.
>
> To make the option parsing slightly more robust, wrap the string
> comparisons in a loop adapted from test-dir-iterator.c.
>
> Care must be taken with the final check for the 'cnt' variable. We
> continue the expectation that the numerical value is the final argument.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  t/helper/test-read-cache.c | 49 ++++++++++++++++++++++++++++++++++----
>  1 file changed, 44 insertions(+), 5 deletions(-)
>
> diff --git a/t/helper/test-read-cache.c b/t/helper/test-read-cache.c
> index 244977a29bd..cd7d106a675 100644
> --- a/t/helper/test-read-cache.c
> +++ b/t/helper/test-read-cache.c
> @@ -2,18 +2,55 @@
>  #include "cache.h"
>  #include "config.h"
>
> +static void print_cache_entry(struct cache_entry *ce)
> +{
> +       /* stat info */
> +       printf("%08x %08x %08x %08x %08x %08x ",
> +              ce->ce_stat_data.sd_ctime.sec,
> +              ce->ce_stat_data.sd_ctime.nsec,
> +              ce->ce_stat_data.sd_mtime.sec,
> +              ce->ce_stat_data.sd_mtime.nsec,
> +              ce->ce_stat_data.sd_dev,
> +              ce->ce_stat_data.sd_ino);

Printing sec & nsec in hexadecimal?  Why?

Also, if they'll be displayed in hex, do you want to format them as
0x%08x, similar to what you do with binary below?

> +
> +       /* mode in binary */

This comment feels misleading; I think this is the "S_IFMT portion of
mode in binary" not "mode in binary".

> +       printf("0b%d%d%d%d ",
> +               (ce->ce_mode >> 15) & 1,
> +               (ce->ce_mode >> 14) & 1,
> +               (ce->ce_mode >> 13) & 1,
> +               (ce->ce_mode >> 12) & 1);

Why binary?  Also, since you defined a special magic constant of
01000755 which utilizes bit 18; how come you aren't including any bits
higher than 15?

> +       /* output permissions? */
> +       printf("%04o ", ce->ce_mode & 01777);

01777 instead of 07777 just because we don't have anything using the
setuid or setgid bits?  But if it's based on non-use, then we don't
use the sticky bit (01000) either, so this could be just 0777.

Also, if you're using 0b for binary to distinguish and you're clearly
using multiple bases in this code, perhaps use a print format of
0o%04o (or 0o%03o if you only use a mask of 0777).
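
A small sketch of the prefixed formatting suggested above, using the 0777 mask and the 0o%03o variant. The helper name is hypothetical, and it formats into a buffer rather than printing so the result is easy to inspect:

```c
#include <stdio.h>

/*
 * Format the four S_IFMT bits (bits 15..12) in binary with a "0b"
 * prefix, followed by the permission bits in octal with a "0o" prefix.
 * Returns the snprintf() result.
 */
static int sketch_format_mode(unsigned int mode, char *buf, size_t len)
{
	return snprintf(buf, len, "0b%u%u%u%u 0o%03o",
			(mode >> 15) & 1u,
			(mode >> 14) & 1u,
			(mode >> 13) & 1u,
			(mode >> 12) & 1u,
			mode & 0777u);
}
```

For 0100755 this yields "0b1000 0o755". For the proposed 01000755 it yields "0b0000 0o755", which illustrates the point above: the distinguishing bit 18 never shows up when only bits 12 through 15 are printed.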

> +       printf("%s ", oid_to_hex(&ce->oid));
> +
> +       printf("%s\n", ce->name);
> +}
> +
> +static void print_cache(struct index_state *cache)
> +{
> +       int i;
> +       for (i = 0; i < the_index.cache_nr; i++)
> +               print_cache_entry(the_index.cache[i]);
> +}
> +
>  int cmd__read_cache(int argc, const char **argv)
>  {
>         int i, cnt = 1;
>         const char *name = NULL;
> +       int table = 0;
>
> -       if (argc > 1 && skip_prefix(argv[1], "--print-and-refresh=", &name)) {
> -               argc--;
> -               argv++;
> +       for (++argv, --argc; *argv && starts_with(*argv, "--"); ++argv, --argc) {
> +               if (skip_prefix(*argv, "--print-and-refresh=", &name))
> +                       continue;
> +               if (!strcmp(*argv, "--table")) {
> +                       table = 1;
> +               }
>         }
>
> -       if (argc == 2)
> -               cnt = strtol(argv[1], NULL, 0);
> +       if (argc == 1)
> +               cnt = strtol(argv[0], NULL, 0);
>         setup_git_directory();
>         git_config(git_default_config, NULL);
>         for (i = 0; i < cnt; i++) {
> @@ -30,6 +67,8 @@ int cmd__read_cache(int argc, const char **argv)
>                                ce_uptodate(the_index.cache[pos]) ? "" : " not");
>                         write_file(name, "%d\n", i);
>                 }
> +               if (table)
> +                       print_cache(&the_index);
>                 discard_cache();
>         }
>         return 0;
> --
> gitgitgadget
>


* Re: [PATCH 07/27] unpack-trees: ensure full index
  2021-01-25 17:41 ` [PATCH 07/27] unpack-trees: ensure " Derrick Stolee via GitGitGadget
@ 2021-01-27  4:43   ` Elijah Newren
  0 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren @ 2021-01-27  4:43 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano, Jeff King, Jonathan Nieder,
	Eric Sunshine, Nguyễn Thái Ngọc, Derrick Stolee,
	Derrick Stolee

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The next change will translate full indexes into sparse indexes at write
> time. The existing logic provides a way for every sparse index to be
> expanded to a full index at read time. However, there are cases where an
> index is written and then continues to be used in-memory to perform
> further updates.
>
> unpack_trees() is frequently called after such a write. In particular,
> commands like 'git reset' do this double-update of the index.
>
> Ensure that we have a full index when entering unpack_trees(), but only
> when command_requires_full_index is true. This is always true at the
> moment, but we will later relax that after unpack_trees() is updated to
> handle sparse directory entries.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  unpack-trees.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/unpack-trees.c b/unpack-trees.c
> index f5f668f532d..4dd99219073 100644
> --- a/unpack-trees.c
> +++ b/unpack-trees.c
> @@ -1567,6 +1567,7 @@ static int verify_absent(const struct cache_entry *,
>   */
>  int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options *o)
>  {
> +       struct repository *repo = the_repository;
>         int i, ret;
>         static struct cache_entry *dfc;
>         struct pattern_list pl;
> @@ -1578,6 +1579,12 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
>         trace_performance_enter();
>         trace2_region_enter("unpack_trees", "unpack_trees", the_repository);
>
> +       prepare_repo_settings(repo);
> +       if (repo->settings.command_requires_full_index) {
> +               ensure_full_index(o->src_index);
> +               ensure_full_index(o->dst_index);

I was worried about o->result as well, since there is a
    memset(&o->result, 0, sizeof(o->result));
followed by manually initializing the relevant fields of the
index_state.  However, the relevant field here is your new
sparse_index bit, and you want that to be 0, i.e. full.

I also checked ensure_full_index() since it is often the case that
o->src_index == o->dst_index, but it'll be safe to be called twice on
the same index state -- at least as currently written.

So, this patch seems good.

> +       }
> +
>         if (!core_apply_sparse_checkout || !o->update)
>                 o->skip_sparse_checkout = 1;
>         if (!o->skip_sparse_checkout && !o->pl) {
> --
> gitgitgadget


* Re: [PATCH 03/27] t1092: compare sparse-checkout to sparse-index
  2021-01-27  3:08   ` Elijah Newren
@ 2021-01-27 13:30     ` Derrick Stolee
  2021-01-27 16:54       ` Elijah Newren
  0 siblings, 1 reply; 61+ messages in thread
From: Derrick Stolee @ 2021-01-27 13:30 UTC (permalink / raw)
  To: Elijah Newren, Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano, Jeff King, Jonathan Nieder,
	Eric Sunshine, Nguyễn Thái Ngọc, Derrick Stolee,
	Derrick Stolee

On 1/26/2021 10:08 PM, Elijah Newren wrote:
> On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> Add a new 'sparse-index' repo alongside the 'full-checkout' and
>> 'sparse-checkout' repos in t1092-sparse-checkout-compatibility.sh. Also
>> add run_on_sparse and test_sparse_match helpers. These helpers will be
>> used when the sparse index is implemented.
>>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>  t/t1092-sparse-checkout-compatibility.sh | 29 ++++++++++++++++++++----
>>  1 file changed, 25 insertions(+), 4 deletions(-)
>>
>> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
>> index 8cd3e5a8d22..8876eae0fe3 100755
>> --- a/t/t1092-sparse-checkout-compatibility.sh
>> +++ b/t/t1092-sparse-checkout-compatibility.sh
>> @@ -7,6 +7,7 @@ test_description='compare full workdir to sparse workdir'
>>  test_expect_success 'setup' '
>>         git init initial-repo &&
>>         (
>> +               (GIT_TEST_SPARSE_INDEX=0 && export GIT_TEST_SPARSE_INDEX) &&
> 
> I thought parentheses started a subshell; once the subshell ends,
> wouldn't the setting of GIT_TEST_SPARSE_INDEX be thrown away?

I think the "export" specifically pushes the setting out of the
first level of subshell. This is the recommendation that comes up
if one runs 

	export GIT_TEST_SPARSE_INDEX=1 &&

inside a test on macOS, since this isn't completely portable.

Thanks,
-Stolee



* Re: [PATCH 02/27] sparse-index: implement ensure_full_index()
  2021-01-27  3:05   ` Elijah Newren
@ 2021-01-27 13:43     ` Derrick Stolee
  2021-01-27 16:38       ` Elijah Newren
  2021-01-28  5:25     ` Junio C Hamano
  1 sibling, 1 reply; 61+ messages in thread
From: Derrick Stolee @ 2021-01-27 13:43 UTC (permalink / raw)
  To: Elijah Newren, Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano, Jeff King, Jonathan Nieder,
	Eric Sunshine, Nguyễn Thái Ngọc, Derrick Stolee,
	Derrick Stolee

On 1/26/2021 10:05 PM, Elijah Newren wrote:
> On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
...
>> Sparse directory entries have a specific 'ce_mode' value. The macro
>> S_ISSPARSEDIR(ce) can check if a cache_entry 'ce' has this type. This
>> ce_mode is not possible with the existing index formats, so we don't
>> also verify all properties of a sparse-directory entry, which are:
>>
>>  1. ce->ce_mode == 01000755
> 
> This is a weird number.  What's the reason for choosing it?  It looks
> deceptively close to 0100755, normal executable files, but has the
> extra 0, meaning that ce->ce_mode & S_IFMT is 0, suggesting it has no
> file type.
> 
> Since it's a directory, why not use S_IFDIR (040000)?
> 
> (GITLINK does use the weird 0160000 value, but it happens to be
> S_IFLNK | S_IFDIR == 0120000 | 040000, which conveys "it's both a
> directory and a symlink")

I forget how exactly I came up with these magic constants, but then
completely forgot to think of them critically because I haven't had
to look at them in a while. They _are_ important, especially because
these values affect the file format itself.

I'll think harder on this before submitting a series intended for
merging.

>>  2. ce->flags & CE_SKIP_WORKTREE is true
> 
> Makes sense.
> 
>>  3. ce->name[ce->namelen - 1] == '/' (ends in dir separator)
> 
> Is there a particular reason for this?  I'm used to seeing names
> without the trailing slash, both in the index and in tree objects.  I
> don't know enough to be for or against this idea; just curious at this
> point.

It's yet another way to distinguish directories from files, but
there are cases where we do string searches up to a prefix, and
having these directory separators did help, IIRC.

>>  4. ce->oid references a tree object.
> 
> Makes sense...but doesn't that suggest we'd want to use ce->ce_mode = 040000?

...

>> +#define CE_MODE_SPARSE_DIRECTORY 01000755
>> +#define SPARSE_DIR_MODE 0100
> 
> Another magic value.  Feels like the commit message should reference
> this one and why it was picked.  Seems odd to me, and possibly
> problematic to re-use file permission bits that might collide with
> files recorded by really old versions of git.  Maybe that's not a
> concern, though.
> 
>> +#define S_ISSPARSEDIR(m) ((m)->ce_mode == CE_MODE_SPARSE_DIRECTORY)
> 
> Should the special sauce apply to ce_flags rather than ce_mode?  Thus,
> instead of an S_ISSPARSEDIR, perhaps have a ce_sparse_dir macro
> (similar to ce_skip_worktree) based on a CE_SPARSE_DIR value (similar
> to CE_SKIP_WORKTREE)?
>
> Or, alternatively, do we need a single special state here?  Could we
> check for a combination of ce_mode == 040000 && ce_skip_worktree(ce)?

The intention was that ce_mode be a unique value that could only
be assigned to a directory entry, which would then by necessity be
sparse. Checking both ce_mode and ce_flags seemed wasteful with the
given assumptions.

...

>> +       /* Copy back into original index. */
>> +       memcpy(&istate->name_hash, &full->name_hash, sizeof(full->name_hash));
>> +       istate->sparse_index = 0;
>> +       istate->cache = full->cache;
> 
> Haven't you leaked the original istate->cache here?

Yes, seems so. Will fix.
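
A minimal sketch of the shape such a fix could take, assuming the array
itself was heap-allocated. struct mini_index and adopt_cache() are
made-up names for this example, not git's API, and the real fix may also
need to discard the sparse directory entries depending on how they were
allocated:

```c
#include <stdlib.h>

/* Hypothetical, stripped-down stand-in for struct index_state. */
struct mini_index {
	void **cache;          /* array of entry pointers */
	unsigned int cache_nr; /* number of entries in 'cache' */
};

/*
 * Move the entry array from 'full' into 'istate'. The old array in
 * 'istate' is freed first so it is not leaked, and 'full' gives up
 * ownership so the array is not freed twice later.
 */
void adopt_cache(struct mini_index *istate, struct mini_index *full)
{
	free(istate->cache);
	istate->cache = full->cache;
	istate->cache_nr = full->cache_nr;
	full->cache = NULL;
	full->cache_nr = 0;
}
```

After adopt_cache(&a, &b), 'a' owns the array that used to belong to
'b', and 'b' is left empty.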

Thanks,
-Stolee


* Re: [PATCH 02/27] sparse-index: implement ensure_full_index()
  2021-01-27 13:43     ` Derrick Stolee
@ 2021-01-27 16:38       ` Elijah Newren
  0 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren @ 2021-01-27 16:38 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, Git Mailing List, Junio C Hamano,
	Jeff King, Jonathan Nieder, Eric Sunshine,
	Nguyễn Thái Ngọc, Derrick Stolee, Derrick Stolee

On Wed, Jan 27, 2021 at 5:43 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 1/26/2021 10:05 PM, Elijah Newren wrote:
> > On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
> > <gitgitgadget@gmail.com> wrote:
> ...
> >> Sparse directory entries have a specific 'ce_mode' value. The macro
> >> S_ISSPARSEDIR(ce) can check if a cache_entry 'ce' has this type. This
> >> ce_mode is not possible with the existing index formats, so we don't
> >> also verify all properties of a sparse-directory entry, which are:
> >>
> >>  1. ce->ce_mode == 01000755
> >
> > This is a weird number.  What's the reason for choosing it?  It looks
> > deceptively close to 0100755, normal executable files, but has the
> > extra 0, meaning that ce->ce_mode & S_IFMT is 0, suggesting it has no
> > file type.
> >
> > Since it's a directory, why not use S_IFDIR (040000)?
> >
> > (GITLINK does use the weird 0160000 value, but it happens to be
> > S_IFLNK | S_IFDIR == 0120000 | 040000, which conveys "it's both a
> > directory and a symlink")
>
> I forget how exactly I came up with these magic constants, but then
> completely forgot to think of them critically because I haven't had
> to look at them in a while. They _are_ important, especially because
> these values affect the file format itself.
>
> I'll think harder on this before submitting a series intended for
> merging.
>
> >>  2. ce->flags & CE_SKIP_WORKTREE is true
> >
> > Makes sense.
> >
> >>  3. ce->name[ce->namelen - 1] == '/' (ends in dir separator)
> >
> > Is there a particular reason for this?  I'm used to seeing names
> > without the trailing slash, both in the index and in tree objects.  I
> > don't know enough to be for or against this idea; just curious at this
> > point.
>
> It's yet another way to distinguish directories from files, but
> there are cases where we do string searches up to a prefix, and
> having these directory separators did help, IIRC.
>
> >>  4. ce->oid references a tree object.
> >
> > Makes sense...but doesn't that suggest we'd want to use ce->ce_mode = 040000?
>
> ...
>
> >> +#define CE_MODE_SPARSE_DIRECTORY 01000755
> >> +#define SPARSE_DIR_MODE 0100
> >
> > Another magic value.  Feels like the commit message should reference
> > this one and why it was picked.  Seems odd to me, and possibly
> > problematic to re-use file permission bits that might collide with
> > files recorded by really old versions of git.  Maybe that's not a
> > concern, though.
> >
> >> +#define S_ISSPARSEDIR(m) ((m)->ce_mode == CE_MODE_SPARSE_DIRECTORY)
> >
> > Should the special sauce apply to ce_flags rather than ce_mode?  Thus,
> > instead of an S_ISSPARSEDIR, perhaps have a ce_sparse_dir macro
> > (similar to ce_skip_worktree) based on a CE_SPARSE_DIR value (similar
> > to CE_SKIP_WORKTREE)?
> >
> > Or, alternatively, do we need a single special state here?  Could we
> > check for a combination of ce_mode == 040000 && ce_skip_worktree(ce)?
>
> The intention was that ce_mode be a unique value that could only
> be assigned to a directory entry, which would then by necessity be
> sparse. Checking both ce_mode and ce_flags seemed wasteful with the
> given assumptions

040000 is a unique value that could only be assigned to a directory
entry.  Since we have no other uses of directories within the index,
you are right, we wouldn't need to check ce_skip_worktree(ce) as well;
just a check for the 040000 mode would be enough.
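
The arithmetic behind these mode values can be checked with a small
standalone sketch. The MODE_* names are stand-ins made up for this
example so that it does not depend on the platform's <sys/stat.h> or on
cache.h; only the octal values come from this thread and from the usual
S_IF* definitions:

```c
/* Hypothetical stand-ins for the usual file-type bits. */
#define MODE_FMT_MASK   0170000u  /* S_IFMT: mask for the file-type bits */
#define MODE_DIR        0040000u  /* S_IFDIR */
#define MODE_REG        0100000u  /* S_IFREG */
#define MODE_LNK        0120000u  /* S_IFLNK */
#define MODE_GITLINK    0160000u  /* mode used for gitlinks */
#define MODE_RFC_SPARSE 01000755u /* CE_MODE_SPARSE_DIRECTORY in this RFC */

/* Return only the file-type bits of a mode. */
unsigned int mode_type(unsigned int mode)
{
	return mode & MODE_FMT_MASK;
}
```

Here mode_type(MODE_RFC_SPARSE) is 0 (no file type at all), whereas
mode_type(0100755) is MODE_REG and mode_type(MODE_DIR) is MODE_DIR.
Likewise, MODE_LNK | MODE_DIR equals MODE_GITLINK.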

> ...
>
> >> +       /* Copy back into original index. */
> >> +       memcpy(&istate->name_hash, &full->name_hash, sizeof(full->name_hash));
> >> +       istate->sparse_index = 0;
> >> +       istate->cache = full->cache;
> >
> > Haven't you leaked the original istate->cache here?
>
> Yes, seems so. Will fix.
>
> Thanks,
> -Stolee


* Re: [PATCH 03/27] t1092: compare sparse-checkout to sparse-index
  2021-01-27 13:30     ` Derrick Stolee
@ 2021-01-27 16:54       ` Elijah Newren
  0 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren @ 2021-01-27 16:54 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, Git Mailing List, Junio C Hamano,
	Jeff King, Jonathan Nieder, Eric Sunshine,
	Nguyễn Thái Ngọc, Derrick Stolee, Derrick Stolee

On Wed, Jan 27, 2021 at 5:30 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 1/26/2021 10:08 PM, Elijah Newren wrote:
> > On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
> > <gitgitgadget@gmail.com> wrote:
> >>
> >> From: Derrick Stolee <dstolee@microsoft.com>
> >>
> >> Add a new 'sparse-index' repo alongside the 'full-checkout' and
> >> 'sparse-checkout' repos in t1092-sparse-checkout-compatibility.sh. Also
> >> add run_on_sparse and test_sparse_match helpers. These helpers will be
> >> used when the sparse index is implemented.
> >>
> >> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> >> ---
> >>  t/t1092-sparse-checkout-compatibility.sh | 29 ++++++++++++++++++++----
> >>  1 file changed, 25 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
> >> index 8cd3e5a8d22..8876eae0fe3 100755
> >> --- a/t/t1092-sparse-checkout-compatibility.sh
> >> +++ b/t/t1092-sparse-checkout-compatibility.sh
> >> @@ -7,6 +7,7 @@ test_description='compare full workdir to sparse workdir'
> >>  test_expect_success 'setup' '
> >>         git init initial-repo &&
> >>         (
> >> +               (GIT_TEST_SPARSE_INDEX=0 && export GIT_TEST_SPARSE_INDEX) &&
> >
> > I thought parentheses started a subshell; once the subshell ends,
> > wouldn't the setting of GIT_TEST_SPARSE_INDEX be thrown away?
>
> I think the "export" specifically pushes the setting out of the
> first level of subshell. This is the recommendation that comes up

You're having a child process change the environment variables of a
parent process? ...without some kind of gdb or other debugger
wizardry?

> if one runs
>
>         export GIT_TEST_SPARSE_INDEX=1 &&
>
> inside a test on macOS, since this isn't completely portable.

Um, I think you meant to run
      GIT_TEST_SPARSE_INDEX=0 &&
      export GIT_TEST_SPARSE_INDEX &&
in order to avoid the unportable
      export GIT_TEST_SPARSE_INDEX=0 &&
because
      (GIT_TEST_SPARSE_INDEX=0 &&
       export GIT_TEST_SPARSE_INDEX) &&
looks like a useless no-op.  At least it would be in normal bash; is
the test harness doing some special magic with it?  In normal bash,
the value definitely does NOT survive the subshell (export just means
that subprocesses of the subshell where the environment variable is
set will see the value):

$ echo Before: $GIT_TEST_SPARSE_INDEX && (GIT_TEST_SPARSE_INDEX=0 &&
export GIT_TEST_SPARSE_INDEX) && echo After: $GIT_TEST_SPARSE_INDEX
Before:
After:

But in contrast, without the parentheses:
$ echo Before: $GIT_TEST_SPARSE_INDEX && GIT_TEST_SPARSE_INDEX=0 &&
export GIT_TEST_SPARSE_INDEX && echo After: $GIT_TEST_SPARSE_INDEX
Before:
After: 0


* Re: [PATCH 08/27] sparse-checkout: hold pattern list in index
  2021-01-25 17:41 ` [PATCH 08/27] sparse-checkout: hold pattern list in index Derrick Stolee via GitGitGadget
@ 2021-01-27 17:00   ` Elijah Newren
  2021-01-28 13:12     ` Derrick Stolee
  0 siblings, 1 reply; 61+ messages in thread
From: Elijah Newren @ 2021-01-27 17:00 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano, Jeff King, Jonathan Nieder,
	Eric Sunshine, Nguyễn Thái Ngọc, Derrick Stolee,
	Derrick Stolee

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> As we modify the sparse-checkout definition, we perform index operations
> on a pattern_list that only exists in-memory. This allows easy backing
> out in case the index update fails.
>
> However, if the index write itself cares about the sparse-checkout
> pattern set, we need access to that in-memory copy. Place a pointer to
> a 'struct pattern_list' in the index so we can access this on-demand.
> This will be used in the next change which uses the sparse-checkout
> definition to filter out directories that are outsie the sparse cone.

s/outsie/outside/

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  builtin/sparse-checkout.c | 17 ++++++++++-------
>  cache.h                   |  2 ++
>  2 files changed, 12 insertions(+), 7 deletions(-)
>
> diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
> index 2306a9ad98e..e00b82af727 100644
> --- a/builtin/sparse-checkout.c
> +++ b/builtin/sparse-checkout.c
> @@ -110,6 +110,8 @@ static int update_working_directory(struct pattern_list *pl)
>         if (is_index_unborn(r->index))
>                 return UPDATE_SPARSITY_SUCCESS;
>
> +       r->index->sparse_checkout_patterns = pl;
> +
>         memset(&o, 0, sizeof(o));
>         o.verbose_update = isatty(2);
>         o.update = 1;
> @@ -138,6 +140,7 @@ static int update_working_directory(struct pattern_list *pl)
>         else
>                 rollback_lock_file(&lock_file);
>
> +       r->index->sparse_checkout_patterns = NULL;
>         return result;
>  }
>
> @@ -517,19 +520,18 @@ static int modify_pattern_list(int argc, const char **argv, enum modify_type m)
>  {
>         int result;
>         int changed_config = 0;
> -       struct pattern_list pl;
> -       memset(&pl, 0, sizeof(pl));
> +       struct pattern_list *pl = xcalloc(1, sizeof(*pl));
>
>         switch (m) {
>         case ADD:
>                 if (core_sparse_checkout_cone)
> -                       add_patterns_cone_mode(argc, argv, &pl);
> +                       add_patterns_cone_mode(argc, argv, pl);
>                 else
> -                       add_patterns_literal(argc, argv, &pl);
> +                       add_patterns_literal(argc, argv, pl);
>                 break;
>
>         case REPLACE:
> -               add_patterns_from_input(&pl, argc, argv);
> +               add_patterns_from_input(pl, argc, argv);
>                 break;
>         }
>
> @@ -539,12 +541,13 @@ static int modify_pattern_list(int argc, const char **argv, enum modify_type m)
>                 changed_config = 1;
>         }
>
> -       result = write_patterns_and_update(&pl);
> +       result = write_patterns_and_update(pl);
>
>         if (result && changed_config)
>                 set_config(MODE_NO_PATTERNS);
>
> -       clear_pattern_list(&pl);
> +       clear_pattern_list(pl);
> +       free(pl);
>         return result;
>  }
>
> diff --git a/cache.h b/cache.h
> index 884046ca5b8..b05341cc687 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -311,6 +311,7 @@ static inline unsigned int canon_mode(unsigned int mode)
>  struct split_index;
>  struct untracked_cache;
>  struct progress;
> +struct pattern_list;
>
>  struct index_state {
>         struct cache_entry **cache;
> @@ -336,6 +337,7 @@ struct index_state {
>         struct mem_pool *ce_mem_pool;
>         struct progress *progress;
>         struct repository *repo;
> +       struct pattern_list *sparse_checkout_patterns;
>  };
>
>  /* Name hashing */
> --
> gitgitgadget

Isn't this the same patch you put in your index cleanup series, or am
I getting confused?  It looks very familiar.


* Re: [PATCH 09/27] sparse-index: convert from full to sparse
  2021-01-25 17:41 ` [PATCH 09/27] sparse-index: convert from full to sparse Derrick Stolee via GitGitGadget
@ 2021-01-27 17:30   ` Elijah Newren
  0 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren @ 2021-01-27 17:30 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano, Jeff King, Jonathan Nieder,
	Eric Sunshine, Nguyễn Thái Ngọc, Derrick Stolee,
	Derrick Stolee

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> If we have a full index, then we can convert it to a sparse index by
> replacing directories outside of the sparse cone with sparse directory
> entries. The convert_to_sparse() method does this, when the situation is
> appropriate.
>
> For now, we avoid converting the index to a sparse index if:
>
>  1. the index is split.
>  2. the index is already sparse.
>  3. sparse-checkout is disabled.
>  4. sparse-checkout does not use cone mode.
>
> Finally, we currently limit the conversion to when the
> GIT_TEST_SPARSE_INDEX environment variable is enabled. A mode using Git
> config will be added in a later change.
>
> The trickiest thing about this conversion is that we might not be able
> to mark a directory as a sparse directory just because it is outside the
> sparse cone. There might be unmerged files within that directory, so we
> need to look for those. Also, if there is some strange reason why a file
> is not marked with CE_SKIP_WORKTREE, then we should give up on
> converting that directory. There is still hope that some of its
> subdirectories might be able to convert to sparse, so we keep looking
> deeper.

Oh good, you check for *both* unmerged entries and !CE_SKIP_WORKTREE
ones.  Very nice.

>
> The conversion process is assisted by the cache-tree extension. This is
> calculated from the full index if it does not already exist. We then
> abandon the cache-tree as it no longer applies to the newly-sparse
> index. Thus, this cache-tree will be recalculated in every
> sparse-full-sparse round-trip until we integrate the cache-tree
> extension with the sparse index.

When going from full to sparse, won't the parts of the cache-tree for
paths outside of sparsified directories still be valid?  Can't we use
those?

Also, when going from sparse to full, can't we just populate the
cache-tree as well since we have to read trees to get the individual
file entries and we will get all the directory (tree) values at the
same time?

> We can compare the behavior of the sparse-index in
> t1092-sparse-checkout-compatibility.sh by using GIT_TEST_SPARSE_INDEX=1
> when operating on the 'sparse-index' repo. We can also compare the two
> sparse repos directly, such as comparing their indexes (when expanded to
> full in the case of the 'sparse-index' repo). We also verify that the
> index is actually populated with sparse directory entries.
>
> The 'checkout and reset (mixed)' test is marked for failure when
> comparing a sparse repo to a full repo, but we can compare the two
> sparse-checkout cases directly to ensure that we are not changing the
> behavior when using a sparse index.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  cache-tree.c                             |   3 +
>  read-cache.c                             |  18 ++-
>  sparse-index.c                           | 139 +++++++++++++++++++++++
>  t/t1092-sparse-checkout-compatibility.sh |  63 +++++++++-
>  4 files changed, 218 insertions(+), 5 deletions(-)
>
> diff --git a/cache-tree.c b/cache-tree.c
> index 2fb483d3c08..5f07a39e501 100644
> --- a/cache-tree.c
> +++ b/cache-tree.c
> @@ -6,6 +6,7 @@
>  #include "object-store.h"
>  #include "replace-object.h"
>  #include "promisor-remote.h"
> +#include "sparse-index.h"
>
>  #ifndef DEBUG_CACHE_TREE
>  #define DEBUG_CACHE_TREE 0
> @@ -442,6 +443,8 @@ int cache_tree_update(struct index_state *istate, int flags)
>         if (i)
>                 return i;
>
> +       ensure_full_index(istate);
> +
>         if (!istate->cache_tree)
>                 istate->cache_tree = cache_tree();
>
> diff --git a/read-cache.c b/read-cache.c
> index 1097ecbf132..0522260416e 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -25,6 +25,7 @@
>  #include "fsmonitor.h"
>  #include "thread-utils.h"
>  #include "progress.h"
> +#include "sparse-index.h"
>
>  /* Mask for the name length in ce_flags in the on-disk index */
>
> @@ -1002,8 +1003,15 @@ int verify_path(const char *path, unsigned mode)
>
>                         c = *path++;
>                         if ((c == '.' && !verify_dotfile(path, mode)) ||
> -                           is_dir_sep(c) || c == '\0')
> +                           is_dir_sep(c))
>                                 return 0;
> +                       /*
> +                        * allow terminating directory separators for
> +                        * sparse directory enries.

s/enries/entries/

> +                        */
> +                       if (c == '\0')
> +                               return mode == CE_MODE_SPARSE_DIRECTORY ||
> +                                      mode == SPARSE_DIR_MODE;

Why two values here?  I get confused why you have both (which isn't
new to this patch; I'm just still confused from when I saw
SPARSE_DIR_MODE).

>                 } else if (c == '\\' && protect_ntfs) {
>                         if (is_ntfs_dotgit(path))
>                                 return 0;
> @@ -3062,6 +3070,13 @@ static int do_write_locked_index(struct index_state *istate, struct lock_file *l
>  {
>         int ret;
>
> +       ret = convert_to_sparse(istate);
> +
> +       if (ret) {
> +               warning(_("failed to convert to a sparse-index"));
> +               return ret;
> +       }
> +
>         /*
>          * TODO trace2: replace "the_repository" with the actual repo instance
>          * that is associated with the given "istate".
> @@ -3165,6 +3180,7 @@ static int write_shared_index(struct index_state *istate,
>         int ret;
>
>         move_cache_to_base_index(istate);
> +       convert_to_sparse(istate);
>
>         trace2_region_enter_printf("index", "shared/do_write_index",
>                                    the_repository, "%s", (*temp)->filename.buf);
> diff --git a/sparse-index.c b/sparse-index.c
> index 1e70244dc13..d8f1a5a13d7 100644
> --- a/sparse-index.c
> +++ b/sparse-index.c
> @@ -4,6 +4,145 @@
>  #include "tree.h"
>  #include "pathspec.h"
>  #include "trace2.h"
> +#include "cache-tree.h"
> +#include "config.h"
> +#include "dir.h"
> +#include "fsmonitor.h"
> +
> +static struct cache_entry *construct_sparse_dir_entry(
> +                               struct index_state *istate,
> +                               const char *sparse_dir,
> +                               struct cache_tree *tree)
> +{
> +       struct cache_entry *de;
> +
> +       de = make_cache_entry(istate, SPARSE_DIR_MODE, &tree->oid, sparse_dir, 0, 0);
> +
> +       de->ce_flags |= CE_SKIP_WORKTREE;
> +       return de;
> +}
> +
> +/*
> + * Returns the number of entries "inserted" into the index.
> + */
> +static int convert_to_sparse_rec(struct index_state *istate,
> +                                int num_converted,
> +                                int start, int end,
> +                                const char *ct_path, size_t ct_pathlen,
> +                                struct cache_tree *ct)
> +{
> +       int i, can_convert = 1;
> +       int start_converted = num_converted;
> +       enum pattern_match_result match;
> +       int dtype;
> +       struct strbuf child_path = STRBUF_INIT;
> +       struct pattern_list *pl = istate->sparse_checkout_patterns;
> +
> +       /*
> +        * Is the current path outside of the sparse cone?
> +        * Then check if the region can be replaced by a sparse
> +        * directory entry (everything is sparse and merged).
> +        */
> +       match = path_matches_pattern_list(ct_path, ct_pathlen,
> +                                         NULL, &dtype, pl, istate);
> +       if (match != NOT_MATCHED)
> +               can_convert = 0;

I know some people hate gotos, but this seems like one of the cases
where a goto jumping after the following for & if would be clearer
than setting can_convert to 0.

> +
> +       for (i = start; can_convert && i < end; i++) {

Instead of checking can_convert here...

> +               struct cache_entry *ce = istate->cache[i];
> +
> +               if (ce_stage(ce) ||
> +                   !(ce->ce_flags & CE_SKIP_WORKTREE))
> +                       can_convert = 0;

...could you just insert a break here?
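
Roughly what I have in mind, as a simplified sketch rather than a
drop-in replacement. struct entry and range_can_convert() are
hypothetical names invented for this example, and the early returns play
the role of the goto and break suggested above:

```c
/* Hypothetical, stripped-down stand-in for struct cache_entry. */
struct entry {
	int stage;         /* nonzero means the entry is unmerged */
	int skip_worktree; /* nonzero means CE_SKIP_WORKTREE is set */
};

/*
 * Return 1 if entries[start..end) can be collapsed into a single sparse
 * directory entry, 0 otherwise. The scan stops at the first entry that
 * rules out the conversion instead of carrying a flag through the loop.
 */
int range_can_convert(const struct entry *entries, int start, int end,
		      int outside_cone)
{
	int i;

	if (!outside_cone)
		return 0;

	for (i = start; i < end; i++) {
		if (entries[i].stage || !entries[i].skip_worktree)
			return 0;
	}
	return 1;
}
```

The caller would then only build the sparse directory entry when this
returns 1, and fall through to the per-subtree loop otherwise.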

> +       }
> +
> +       if (can_convert) {
> +               struct cache_entry *se;
> +               se = construct_sparse_dir_entry(istate, ct_path, ct);
> +
> +               istate->cache[num_converted++] = se;
> +               return 1;
> +       }
> +
> +       for (i = start; i < end; ) {
> +               int count, span, pos = -1;
> +               const char *base, *slash;
> +               struct cache_entry *ce = istate->cache[i];
> +
> +               /*
> +                * Detect if this is a normal entry oustide of any subtree

s/oustide/outside/

> +                * entry.
> +                */
> +               base = ce->name + ct_pathlen;
> +               slash = strchr(base, '/');
> +
> +               if (slash)
> +                       pos = cache_tree_subtree_pos(ct, base, slash - base);
> +
> +               if (pos < 0) {
> +                       istate->cache[num_converted++] = ce;
> +                       i++;
> +                       continue;
> +               }
> +
> +               strbuf_setlen(&child_path, 0);
> +               strbuf_add(&child_path, ce->name, slash - ce->name + 1);
> +
> +               span = ct->down[pos]->cache_tree->entry_count;
> +               count = convert_to_sparse_rec(istate,
> +                                             num_converted, i, i + span,
> +                                             child_path.buf, child_path.len,
> +                                             ct->down[pos]->cache_tree);
> +               num_converted += count;
> +               i += span;

And there's no i++ in the loop itself, so this is good.

> +       }
> +
> +       strbuf_release(&child_path);
> +       return num_converted - start_converted;
> +}
> +
> +int convert_to_sparse(struct index_state *istate)
> +{
> +       if (istate->split_index || istate->sparse_index ||
> +           !core_apply_sparse_checkout || !core_sparse_checkout_cone)
> +               return 0;
> +
> +       /*
> +        * For now, only create a sparse index with the
> +        * GIT_TEST_SPARSE_INDEX environment variable. We will relax
> +        * this once we have a proper way to opt-in (and later still,
> +        * opt-out).
> +        */
> +       if (!git_env_bool("GIT_TEST_SPARSE_INDEX", 0))
> +               return 0;
> +
> +       if (!istate->sparse_checkout_patterns) {
> +               istate->sparse_checkout_patterns = xcalloc(1, sizeof(struct pattern_list));
> +               if (get_sparse_checkout_patterns(istate->sparse_checkout_patterns) < 0)
> +                       return 0;
> +       }
> +
> +       if (!istate->sparse_checkout_patterns->use_cone_patterns) {
> +               warning(_("attempting to use sparse-index without cone mode"));
> +               return -1;
> +       }
> +
> +       if (cache_tree_update(istate, 0)) {
> +               warning(_("unable to update cache-tree, staying full"));
> +               return -1;
> +       }
> +
> +       remove_fsmonitor(istate);
> +
> +       trace2_region_enter("index", "convert_to_sparse", istate->repo);
> +       istate->cache_nr = convert_to_sparse_rec(istate,
> +                                                0, 0, istate->cache_nr,
> +                                                "", 0, istate->cache_tree);
> +       istate->drop_cache_tree = 1;
> +       istate->sparse_index = 1;
> +       trace2_region_leave("index", "convert_to_sparse", istate->repo);
> +       return 0;
> +}
>
>  static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
>  {
> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
> index 3aa9b0d21b4..22becbaca2e 100755
> --- a/t/t1092-sparse-checkout-compatibility.sh
> +++ b/t/t1092-sparse-checkout-compatibility.sh
> @@ -2,6 +2,9 @@
>
>  test_description='compare full workdir to sparse workdir'
>
> +GIT_TEST_CHECK_CACHE_TREE=0

Why do you need to set this?  I vaguely remember needing to mess with
this when working with sparse checkouts because it did weird stuff but
I don't remember details.  But since your patch touches cache_trees, it
seems weird for this to show up without explanation.

> +GIT_TEST_SPLIT_INDEX=0
> +
>  . ./test-lib.sh
>
>  test_expect_success 'setup' '
> @@ -106,7 +109,7 @@ run_on_sparse () {
>         ) &&
>         (
>                 cd sparse-index &&
> -               $* >../sparse-index-out 2>../sparse-index-err
> +               GIT_TEST_SPARSE_INDEX=1 $* >../sparse-index-out 2>../sparse-index-err
>         )
>  }
>
> @@ -121,7 +124,9 @@ run_on_all () {
>  test_all_match () {
>         run_on_all $* &&
>         test_cmp full-checkout-out sparse-checkout-out &&
> -       test_cmp full-checkout-err sparse-checkout-err
> +       test_cmp full-checkout-out sparse-index-out &&
> +       test_cmp full-checkout-err sparse-checkout-err &&
> +       test_cmp full-checkout-err sparse-index-err
>  }
>
>  test_sparse_match () {
> @@ -130,6 +135,38 @@ test_sparse_match () {
>         test_cmp sparse-checkout-err sparse-index-err
>  }
>
> +test_expect_success 'sparse-index contents' '
> +       init_repos &&
> +
> +       test-tool -C sparse-index read-cache --table --no-stat >cache &&
> +       for dir in folder1 folder2 x
> +       do
> +               TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
> +               grep "0b0000 0755 $TREE $dir/" cache \
> +                       || return 1
> +       done &&
> +
> +       GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set folder1 &&
> +
> +       test-tool -C sparse-index read-cache --table --no-stat >cache &&
> +       for dir in deep folder2 x
> +       do
> +               TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
> +               grep "0b0000 0755 $TREE $dir/" cache \

It would seem clearer to me if this output were to better match `git
ls-tree -rt ...` output (or at least the ' tree ' lines from such
output).

> +                       || return 1
> +       done &&
> +
> +       GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep/deeper1 &&
> +
> +       test-tool -C sparse-index read-cache --table --no-stat >cache &&
> +       for dir in deep/deeper2 folder1 folder2 x
> +       do
> +               TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
> +               grep "0b0000 0755 $TREE $dir/" cache \
> +                       || return 1
> +       done
> +'
> +
>  test_expect_success 'expanded in-memory index matches full index' '
>         init_repos &&
>         test_sparse_match test-tool read-cache --expand --table --no-stat
> @@ -137,6 +174,7 @@ test_expect_success 'expanded in-memory index matches full index' '
>
>  test_expect_success 'status with options' '
>         init_repos &&
> +       test_sparse_match ls &&
>         test_all_match git status --porcelain=v2 &&
>         test_all_match git status --porcelain=v2 -z -u &&
>         test_all_match git status --porcelain=v2 -uno &&
> @@ -169,7 +207,7 @@ test_expect_success 'add, commit, checkout' '
>
>         test_all_match git add -A &&
>         test_all_match git status --porcelain=v2 &&
> -       test_all_match git commit -m "Extend README.md" &&
> +       test_all_match git commit -m "Extend-README.md" &&

Why this change?

>
>         test_all_match git checkout HEAD~1 &&
>         test_all_match git checkout - &&
> @@ -273,6 +311,17 @@ test_expect_failure 'checkout and reset (mixed)' '
>         test_all_match git reset update-folder2
>  '
>
> +# Ensure that sparse-index behaves identically to
> +# sparse-checkout with a full index.
> +test_expect_success 'checkout and reset (mixed) [sparse]' '
> +       init_repos &&
> +
> +       test_sparse_match git checkout -b reset-test update-deep &&
> +       test_sparse_match git reset deepest &&
> +       test_sparse_match git reset update-folder1 &&
> +       test_sparse_match git reset update-folder2
> +'
> +
>  test_expect_success 'merge' '
>         init_repos &&
>
> @@ -309,14 +358,20 @@ test_expect_success 'clean' '
>         test_all_match git status --porcelain=v2 &&
>         test_all_match git clean -f &&
>         test_all_match git status --porcelain=v2 &&
> +       test_sparse_match ls &&
> +       test_sparse_match ls folder1 &&
>
>         test_all_match git clean -xf &&
>         test_all_match git status --porcelain=v2 &&
> +       test_sparse_match ls &&
> +       test_sparse_match ls folder1 &&
>
>         test_all_match git clean -xdf &&
>         test_all_match git status --porcelain=v2 &&
> +       test_sparse_match ls &&
> +       test_sparse_match ls folder1 &&
>
> -       test_path_is_dir sparse-checkout/folder1
> +       test_sparse_match test_path_is_dir folder1
>  '
>
>  test_done
> --
> gitgitgadget


* Re: [PATCH 11/27] unpack-trees: allow sparse directories
  2021-01-25 17:41 ` [PATCH 11/27] unpack-trees: allow sparse directories Derrick Stolee via GitGitGadget
@ 2021-01-27 17:36   ` Elijah Newren
  0 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren @ 2021-01-27 17:36 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano, Jeff King, Jonathan Nieder,
	Eric Sunshine, Nguyễn Thái Ngọc, Derrick Stolee,
	Derrick Stolee

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The index_pos_by_traverse_info() currently throws a BUG() when a
> directory entry exists exactly in the index. We need to consider that it
> is possible to have a directory in a sparse index as long as that entry
> is itself marked with the skip-worktree bit.
>
> The negation of the 'pos' variable must be conditioned to only when it
> starts as negative. This is identical behavior as before when the index
> is full.

The first sentence of this paragraph was really hard for me to parse.
Reading the code and then reading the sentence I could make sense of
it, but I struggled to do so the other way around.  Is there some way
to reword this?  Or is this just a reading comprehension issue on my
part?  (It might be...)

> The starts_with() condition matches because our name.buf terminates with
> a directory separator, just like our sparse directory entries.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  unpack-trees.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/unpack-trees.c b/unpack-trees.c
> index 4dd99219073..b324eec2a5d 100644
> --- a/unpack-trees.c
> +++ b/unpack-trees.c
> @@ -746,9 +746,12 @@ static int index_pos_by_traverse_info(struct name_entry *names,
>         strbuf_make_traverse_path(&name, info, names->path, names->pathlen);
>         strbuf_addch(&name, '/');
>         pos = index_name_pos(o->src_index, name.buf, name.len);
> -       if (pos >= 0)
> -               BUG("This is a directory and should not exist in index");
> -       pos = -pos - 1;
> +       if (pos >= 0) {
> +               if (!o->src_index->sparse_index ||
> +                   !(o->src_index->cache[pos]->ce_flags & CE_SKIP_WORKTREE))
> +                       BUG("This is a directory and should not exist in index");
> +       } else
> +               pos = -pos - 1;
>         if (pos >= o->src_index->cache_nr ||
>             !starts_with(o->src_index->cache[pos]->name, name.buf) ||
>             (pos > 0 && starts_with(o->src_index->cache[pos-1]->name, name.buf)))
> --
> gitgitgadget

The patch looks pretty straightforward to me.

* Re: [PATCH 12/27] sparse-index: check index conversion happens
  2021-01-25 17:41 ` [PATCH 12/27] sparse-index: check index conversion happens Derrick Stolee via GitGitGadget
@ 2021-01-27 17:46   ` Elijah Newren
  0 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren @ 2021-01-27 17:46 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano, Jeff King, Jonathan Nieder,
	Eric Sunshine, Nguyễn Thái Ngọc, Derrick Stolee,
	Derrick Stolee

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> Add a test case that uses test_region to ensure that we are truly
> expanding a sparse index to a full one, then converting back to sparse
> when writing the index. As we integrate more Git commands with the
> sparse index, we will convert these commands to check that we do _not_
> convert the sparse index to a full index and instead stay sparse the
> entire time.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  t/t1092-sparse-checkout-compatibility.sh | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
>
> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
> index 22becbaca2e..a22def89e37 100755
> --- a/t/t1092-sparse-checkout-compatibility.sh
> +++ b/t/t1092-sparse-checkout-compatibility.sh
> @@ -374,4 +374,21 @@ test_expect_success 'clean' '
>         test_sparse_match test_path_is_dir folder1
>  '
>
> +test_expect_success 'sparse-index is expanded and converted back' '
> +       init_repos &&
> +
> +       (
> +               (GIT_TEST_SPARSE_INDEX=1 && export GIT_TEST_SPARSE_INDEX) &&

Drop the parentheses.

What system are you running on that this test passed for you with
those parentheses there?  I checked out this particular commit and ran
the test -- and it fails for me.  Removing the parentheses makes the
test pass.

Is there some shell where parentheses only function as grouping,
similar to bash's {...}, rather than as a subshell, the way bash
handles (...) ?
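
As far as I know, POSIX shells always run (...) in a subshell, so an
assignment and export made inside it are gone once it exits.  A rough C
analogue of what I think is happening, assuming a POSIX system;
visible_after_child() is a made-up helper and the variable name passed
to it is up to the caller:

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Set a variable in a child process, the way "(VAR=1 && export VAR)"
 * does in a subshell, then report whether the parent can see it.
 * Returns 1 if visible, 0 if not, -1 on error.
 */
static int visible_after_child(const char *var)
{
	int status;
	pid_t pid;

	assert(var && *var);
	pid = fork();
	if (pid < 0)
		return -1;
	if (!pid) {
		/* child: the change only affects this process */
		setenv(var, "1", 1);
		_exit(0);
	}
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return getenv(var) != NULL;
}
```

The parent only sees the variable if it was already set before the
child ran, which would match the test passing only when
GIT_TEST_SPARSE_INDEX is exported some other way.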

> +               GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
> +                       git -C sparse-index -c core.fsmonitor="" reset --hard &&
> +               test_region index convert_to_sparse trace2.txt &&
> +               test_region index ensure_full_index trace2.txt &&
> +
> +               rm trace2.txt &&
> +               GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
> +                       git -C sparse-index -c core.fsmonitor="" status -uno &&
> +               test_region index ensure_full_index trace2.txt
> +       )
> +'
> +
>  test_done
> --
> gitgitgadget

Otherwise, I like the test and this commit.

* Re: [PATCH 13/27] sparse-index: create extension for compatibility
  2021-01-25 17:41 ` [PATCH 13/27] sparse-index: create extension for compatibility Derrick Stolee via GitGitGadget
@ 2021-01-27 18:03   ` Elijah Newren
  0 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren @ 2021-01-27 18:03 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano, Jeff King, Jonathan Nieder,
	Eric Sunshine, Nguyễn Thái Ngọc, Derrick Stolee,
	Derrick Stolee

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> Previously, we enabled the sparse index format only using
> GIT_TEST_SPARSE_INDEX=1. This is not a feasible direction for users to
> actually select this mode. Further, sparse directory entries are not
> understood by the index formats as advertised.
>
> We _could_ add a new index version that explicitly adds these
> capabilities, but there are nuances to index formats 2, 3, and 4 that
> are still valuable to select as options. For now, create a repo
> extension, "extensions.sparseIndex", that specifies that the tool
> reading this repository must understand sparse directory entries.
>
> This change only encodes the extension and enables it when
> GIT_TEST_SPARSE_INDEX=1. Later, we will add a more user-friendly CLI
> mechanism.

One other interesting thing to note is that last I checked, jgit
doesn't support index format v4, which makes us unable to use it.
Making a v5 would force jgit to support all previous index formats in
order to support your new feature.

However, the jgit thing is going to make it hard for me to find other
users willing to test out this feature at $DAYJOB.  But I don't think
there's any way around that; you need to change the index format.  And
you at least have jrnieder cc'ed.  :-)

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/config/extensions.txt |  7 ++++++
>  cache.h                             |  1 +
>  repo-settings.c                     |  7 ++++++
>  repository.h                        |  3 ++-
>  setup.c                             |  3 +++
>  sparse-index.c                      | 38 +++++++++++++++++++++++++----
>  6 files changed, 53 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/config/extensions.txt b/Documentation/config/extensions.txt
> index 4e23d73cdca..5c86b364873 100644
> --- a/Documentation/config/extensions.txt
> +++ b/Documentation/config/extensions.txt
> @@ -6,3 +6,10 @@ extensions.objectFormat::
>  Note that this setting should only be set by linkgit:git-init[1] or
>  linkgit:git-clone[1].  Trying to change it after initialization will not
>  work and will produce hard-to-diagnose issues.
> +
> +extensions.sparseIndex::
> +       When combined with `core.sparseCheckout=true` and
> +       `core.sparseCheckoutCone=true`, the index may contain entries
> +       corresponding to directories outside of the sparse-checkout
> +       definition. Versions of Git that do not understand this extension
> +       do not expect directory entries in the index.

Perhaps to make this slightly more explicit ("corresponding to" can be
fuzzy and be read to assume you are talking about file entries
underneath a directory rather than directory entries, so add an extra
phrase to rule that out):

...the index may contain entries corresponding to directories outside
of the sparse-checkout definition in lieu of containing each path
under such directories...

> diff --git a/cache.h b/cache.h
> index b05341cc687..dcf089b7006 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -1054,6 +1054,7 @@ struct repository_format {
>         int worktree_config;
>         int is_bare;
>         int hash_algo;
> +       int sparse_index;
>         char *work_tree;
>         struct string_list unknown_extensions;
>         struct string_list v1_only_extensions;
> diff --git a/repo-settings.c b/repo-settings.c
> index d63569e4041..9677d50f923 100644
> --- a/repo-settings.c
> +++ b/repo-settings.c
> @@ -85,4 +85,11 @@ void prepare_repo_settings(struct repository *r)
>          * removed.
>          */
>         r->settings.command_requires_full_index = 1;
> +
> +       /*
> +        * Initialize this as off.
> +        */
> +       r->settings.sparse_index = 0;
> +       if (!repo_config_get_bool(r, "extensions.sparseindex", &value) && value)
> +               r->settings.sparse_index = 1;
>  }
> diff --git a/repository.h b/repository.h
> index e06a2301569..a45f7520fd9 100644
> --- a/repository.h
> +++ b/repository.h
> @@ -42,7 +42,8 @@ struct repo_settings {
>
>         int core_multi_pack_index;
>
> -       unsigned command_requires_full_index:1;
> +       unsigned command_requires_full_index:1,
> +                sparse_index:1;
>  };
>
>  struct repository {
> diff --git a/setup.c b/setup.c
> index c04cd25a30d..cd839456461 100644
> --- a/setup.c
> +++ b/setup.c
> @@ -500,6 +500,9 @@ static enum extension_result handle_extension(const char *var,
>                         return error("invalid value for 'extensions.objectformat'");
>                 data->hash_algo = format;
>                 return EXTENSION_OK;
> +       } else if (!strcmp(ext, "sparseindex")) {
> +               data->sparse_index = 1;
> +               return EXTENSION_OK;
>         }
>         return EXTENSION_UNKNOWN;
>  }
> diff --git a/sparse-index.c b/sparse-index.c
> index 5dd0b835b9d..71544095267 100644
> --- a/sparse-index.c
> +++ b/sparse-index.c
> @@ -102,19 +102,47 @@ static int convert_to_sparse_rec(struct index_state *istate,
>         return num_converted - start_converted;
>  }
>
> +static int enable_sparse_index(struct repository *repo)
> +{
> +       const char *config_path = repo_git_path(repo, "config.worktree");
> +
> +       if (upgrade_repository_format(1) < 0) {
> +               warning(_("unable to upgrade repository format to enable sparse-index"));
> +               return -1;
> +       }
> +       git_config_set_in_file_gently(config_path,
> +                                     "extensions.sparseIndex",
> +                                     "true");
> +
> +       prepare_repo_settings(repo);
> +       repo->settings.sparse_index = 1;
> +       return 0;
> +}
> +
>  int convert_to_sparse(struct index_state *istate)
>  {
>         if (istate->split_index || istate->sparse_index ||
>             !core_apply_sparse_checkout || !core_sparse_checkout_cone)
>                 return 0;
>
> +       if (!istate->repo)
> +               istate->repo = the_repository;
> +
> +       /*
> +        * The GIT_TEST_SPARSE_INDEX environment variable triggers the
> +        * extensions.sparseIndex config variable to be on.
> +        */
> +       if (git_env_bool("GIT_TEST_SPARSE_INDEX", 0)) {
> +               int err = enable_sparse_index(istate->repo);
> +               if (err < 0)
> +                       return err;
> +       }
> +
>         /*
> -        * For now, only create a sparse index with the
> -        * GIT_TEST_SPARSE_INDEX environment variable. We will relax
> -        * this once we have a proper way to opt-in (and later still,
> -        * opt-out).
> +        * Only convert to sparse if extensions.sparseIndex is set.
>          */
> -       if (!git_env_bool("GIT_TEST_SPARSE_INDEX", 0))
> +       prepare_repo_settings(istate->repo);
> +       if (!istate->repo->settings.sparse_index)
>                 return 0;
>
>         if (!istate->sparse_checkout_patterns) {
> --
> gitgitgadget

* Re: [PATCH 14/27] sparse-checkout: toggle sparse index from builtin
  2021-01-25 17:42 ` [PATCH 14/27] sparse-checkout: toggle sparse index from builtin Derrick Stolee via GitGitGadget
@ 2021-01-27 18:18   ` Elijah Newren
  2021-01-28 15:26     ` Derrick Stolee
  0 siblings, 1 reply; 61+ messages in thread
From: Elijah Newren @ 2021-01-27 18:18 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano, Jeff King, Jonathan Nieder,
	Eric Sunshine, Nguyễn Thái Ngọc, Derrick Stolee,
	Derrick Stolee

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The sparse index extension is used to signal that index writes should be
> in sparse mode. This was only updated using GIT_TEST_SPARSE_INDEX=1.
>
> Add a '--[no-]sparse-index' option to 'git sparse-checkout init' that
> specifies if the sparse index should be used. It also updates the index
> to use the correct format, either way. Add a warning in the
> documentation that the use of a repository extension might reduce
> compatibility with third-party tools. 'git sparse-checkout init' already
> sets extension.worktreeConfig, which places most sparse-checkout users
> outside of the scope of most third-party tools.

Heh, looks like you're addressing my comments on the last patch about
jgit.  If I had just read on...

One side question, though -- I thought I remembered seeing that we
record index versions or extension information directly in the index,
so that third party tools have a way of noting that the index has
something they won't understand, rather than just reading values that
appear to be corrupt to them.  Perhaps I missed it, but have you done
anything like that with this series?

>
> Update t1092-sparse-checkout-compatibility.sh to use this CLI instead of
> GIT_TEST_SPARSE_INDEX=1.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/git-sparse-checkout.txt    | 14 +++++++++
>  builtin/sparse-checkout.c                | 17 ++++++++++-
>  sparse-index.c                           | 38 ++++++++++++++++--------
>  sparse-index.h                           |  3 ++
>  t/t1092-sparse-checkout-compatibility.sh | 33 ++++++++++----------
>  5 files changed, 75 insertions(+), 30 deletions(-)
>
> diff --git a/Documentation/git-sparse-checkout.txt b/Documentation/git-sparse-checkout.txt
> index a0eeaeb02ee..b51b8450cfd 100644
> --- a/Documentation/git-sparse-checkout.txt
> +++ b/Documentation/git-sparse-checkout.txt
> @@ -45,6 +45,20 @@ To avoid interfering with other worktrees, it first enables the
>  When `--cone` is provided, the `core.sparseCheckoutCone` setting is
>  also set, allowing for better performance with a limited set of
>  patterns (see 'CONE PATTERN SET' below).
> ++
> +Use the `--[no-]sparse-index` option to toggle the use of the sparse
> +index format. This reduces the size of the index to be more closely
> +aligned with your sparse-checkout definition. This can have significant
> +performance advantages for commands such as `git status` or `git add`.
> +This feature is still experimental. Some commands might be slower with
> +a sparse index until they are properly integrated with the feature.
> ++
> +**WARNING:** Using a sparse index requires modifying the index in a way
> +that is not completely understood by other tools. Enabling sparse index
> +enables the `extensions.spareseIndex` config value, which might cause

extensions.sparseIndex; you have an extra 'e' in there.

> +other tools to stop working with your repository. If you have trouble with
> +this compatibility, then run `git sparse-checkout sparse-index disable` to
> +remove this config and rewrite your index to not be sparse.
>
>  'set'::
>         Write a set of patterns to the sparse-checkout file, as given as
> diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
> index e00b82af727..ca63e2c64e9 100644
> --- a/builtin/sparse-checkout.c
> +++ b/builtin/sparse-checkout.c
> @@ -14,6 +14,7 @@
>  #include "unpack-trees.h"
>  #include "wt-status.h"
>  #include "quote.h"
> +#include "sparse-index.h"
>
>  static const char *empty_base = "";
>
> @@ -283,12 +284,13 @@ static int set_config(enum sparse_checkout_mode mode)
>  }
>
>  static char const * const builtin_sparse_checkout_init_usage[] = {
> -       N_("git sparse-checkout init [--cone]"),
> +       N_("git sparse-checkout init [--cone] [--[no-]sparse-index]"),

This all makes sense, but between partial clones, sparse-checkouts and
sparse-indexes, I wonder if we're overloading users with terms and
conditions.  Perhaps that's inevitable in the short-term due to the
various caveats that exist, but I'd just like to put out a fuzzy
high-level goal of allowing users in the future to just specify "I
want a sparse clone of this stuff" with as few special knobs and flags
as possible.  I don't want them to have to specify all of the
individual things that means, such as they want (a) the history to be
sparse (i.e. partial clone), (b) the checkout to be sparse, (c) the
index to be sparse, (d) several commands to operate in a sparse
manner, limiting their output based on the sparsity paths (hopefully
they aren't required to list each one), and (e) several other commands
shouldn't be limited by the sparsity paths.  I guess it might be nice
to _allow_ them to specify all the things it means for users who want
control, but it'd be nice to avoid requiring it of all users.

>         NULL
>  };
>
>  static struct sparse_checkout_init_opts {
>         int cone_mode;
> +       int sparse_index;
>  } init_opts;
>
>  static int sparse_checkout_init(int argc, const char **argv)
> @@ -303,11 +305,15 @@ static int sparse_checkout_init(int argc, const char **argv)
>         static struct option builtin_sparse_checkout_init_options[] = {
>                 OPT_BOOL(0, "cone", &init_opts.cone_mode,
>                          N_("initialize the sparse-checkout in cone mode")),
> +               OPT_BOOL(0, "sparse-index", &init_opts.sparse_index,
> +                        N_("toggle the use of a sparse index")),
>                 OPT_END(),
>         };
>
>         repo_read_index(the_repository);
>
> +       init_opts.sparse_index = -1;
> +
>         argc = parse_options(argc, argv, NULL,
>                              builtin_sparse_checkout_init_options,
>                              builtin_sparse_checkout_init_usage, 0);
> @@ -326,6 +332,15 @@ static int sparse_checkout_init(int argc, const char **argv)
>         sparse_filename = get_sparse_checkout_filename();
>         res = add_patterns_from_file_to_list(sparse_filename, "", 0, &pl, NULL);
>
> +       if (init_opts.sparse_index >= 0) {
> +               if (set_sparse_index_config(the_repository, init_opts.sparse_index) < 0)
> +                       die(_("failed to modify sparse-index config"));
> +
> +               /* force an index rewrite */
> +               repo_read_index(the_repository);
> +               the_repository->index->updated_workdir = 1;
> +       }
> +
>         /* If we already have a sparse-checkout file, use it. */
>         if (res >= 0) {
>                 free(sparse_filename);
> diff --git a/sparse-index.c b/sparse-index.c
> index 71544095267..3552f88fb03 100644
> --- a/sparse-index.c
> +++ b/sparse-index.c
> @@ -104,23 +104,38 @@ static int convert_to_sparse_rec(struct index_state *istate,
>
>  static int enable_sparse_index(struct repository *repo)
>  {
> -       const char *config_path = repo_git_path(repo, "config.worktree");
> +       int res;
>
>         if (upgrade_repository_format(1) < 0) {
>                 warning(_("unable to upgrade repository format to enable sparse-index"));
>                 return -1;
>         }
> -       git_config_set_in_file_gently(config_path,
> -                                     "extensions.sparseIndex",
> -                                     "true");
> +       res = git_config_set_gently("extensions.sparseindex", "true");
>
>         prepare_repo_settings(repo);
>         repo->settings.sparse_index = 1;
> -       return 0;
> +       return res;
> +}
> +
> +int set_sparse_index_config(struct repository *repo, int enable)
> +{
> +       int res;
> +
> +       if (enable)
> +               return enable_sparse_index(repo);
> +
> +       /* Don't downgrade repository format, just remove the extension. */
> +       res = git_config_set_multivar_gently("extensions.sparseindex", NULL, "",
> +                                            CONFIG_FLAGS_MULTI_REPLACE);
> +
> +       prepare_repo_settings(repo);
> +       repo->settings.sparse_index = 0;
> +       return res;
>  }
>
>  int convert_to_sparse(struct index_state *istate)
>  {
> +       int test_env;
>         if (istate->split_index || istate->sparse_index ||
>             !core_apply_sparse_checkout || !core_sparse_checkout_cone)
>                 return 0;
> @@ -129,14 +144,13 @@ int convert_to_sparse(struct index_state *istate)
>                 istate->repo = the_repository;
>
>         /*
> -        * The GIT_TEST_SPARSE_INDEX environment variable triggers the
> -        * extensions.sparseIndex config variable to be on.
> +        * If GIT_TEST_SPARSE_INDEX=1, then trigger extensions.sparseIndex
> +        * to be fully enabled. If GIT_TEST_SPARSE_INDEX=0 (set explicitly),
> +        * then purposefully disable the setting.
>          */
> -       if (git_env_bool("GIT_TEST_SPARSE_INDEX", 0)) {
> -               int err = enable_sparse_index(istate->repo);
> -               if (err < 0)
> -                       return err;
> -       }
> +       test_env = git_env_bool("GIT_TEST_SPARSE_INDEX", -1);
> +       if (test_env >= 0)
> +               set_sparse_index_config(istate->repo, test_env);
>
>         /*
>          * Only convert to sparse if extensions.sparseIndex is set.
> diff --git a/sparse-index.h b/sparse-index.h
> index a2777dcac59..ca936e95d11 100644
> --- a/sparse-index.h
> +++ b/sparse-index.h
> @@ -5,4 +5,7 @@ struct index_state;
>  void ensure_full_index(struct index_state *istate);
>  int convert_to_sparse(struct index_state *istate);
>
> +struct repository;
> +int set_sparse_index_config(struct repository *repo, int enable);
> +
>  #endif
> \ No newline at end of file
> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
> index a22def89e37..c6b7e8b8891 100755
> --- a/t/t1092-sparse-checkout-compatibility.sh
> +++ b/t/t1092-sparse-checkout-compatibility.sh
> @@ -4,6 +4,7 @@ test_description='compare full workdir to sparse workdir'
>
>  GIT_TEST_CHECK_CACHE_TREE=0
>  GIT_TEST_SPLIT_INDEX=0
> +GIT_TEST_SPARSE_INDEX=
>
>  . ./test-lib.sh
>
> @@ -98,8 +99,9 @@ init_repos () {
>         # initialize sparse-checkout definitions
>         git -C sparse-checkout sparse-checkout init --cone &&
>         git -C sparse-checkout sparse-checkout set deep &&
> -       GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout init --cone &&
> -       GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep
> +       git -C sparse-index sparse-checkout init --cone --sparse-index &&
> +       test_cmp_config -C sparse-index true extensions.sparseindex &&
> +       git -C sparse-index sparse-checkout set deep
>  }
>
>  run_on_sparse () {
> @@ -109,7 +111,7 @@ run_on_sparse () {
>         ) &&
>         (
>                 cd sparse-index &&
> -               GIT_TEST_SPARSE_INDEX=1 $* >../sparse-index-out 2>../sparse-index-err
> +               $* >../sparse-index-out 2>../sparse-index-err
>         )
>  }
>
> @@ -146,7 +148,7 @@ test_expect_success 'sparse-index contents' '
>                         || return 1
>         done &&
>
> -       GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set folder1 &&
> +       git -C sparse-index sparse-checkout set folder1 &&
>
>         test-tool -C sparse-index read-cache --table --no-stat >cache &&
>         for dir in deep folder2 x
> @@ -156,7 +158,7 @@ test_expect_success 'sparse-index contents' '
>                         || return 1
>         done &&
>
> -       GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep/deeper1 &&
> +       git -C sparse-index sparse-checkout set deep/deeper1 &&
>
>         test-tool -C sparse-index read-cache --table --no-stat >cache &&
>         for dir in deep/deeper2 folder1 folder2 x
> @@ -377,18 +379,15 @@ test_expect_success 'clean' '
>  test_expect_success 'sparse-index is expanded and converted back' '
>         init_repos &&
>
> -       (
> -               (GIT_TEST_SPARSE_INDEX=1 && export GIT_TEST_SPARSE_INDEX) &&
> -               GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
> -                       git -C sparse-index -c core.fsmonitor="" reset --hard &&
> -               test_region index convert_to_sparse trace2.txt &&
> -               test_region index ensure_full_index trace2.txt &&
> -
> -               rm trace2.txt &&
> -               GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
> -                       git -C sparse-index -c core.fsmonitor="" status -uno &&
> -               test_region index ensure_full_index trace2.txt
> -       )
> +       GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
> +               git -C sparse-index -c core.fsmonitor="" reset --hard &&
> +       test_region index convert_to_sparse trace2.txt &&
> +       test_region index ensure_full_index trace2.txt &&
> +
> +       rm trace2.txt &&
> +       GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
> +               git -C sparse-index -c core.fsmonitor="" status -uno &&
> +       test_region index ensure_full_index trace2.txt
>  '
>
>  test_done
> --
> gitgitgadget

I need to take a break from reviewing again at this point and work on
some other tasks.  I'll resume reviewing the series later, perhaps
tomorrow afternoon.

* Re: [PATCH 02/27] sparse-index: implement ensure_full_index()
  2021-01-27  3:05   ` Elijah Newren
  2021-01-27 13:43     ` Derrick Stolee
@ 2021-01-28  5:25     ` Junio C Hamano
  1 sibling, 0 replies; 61+ messages in thread
From: Junio C Hamano @ 2021-01-28  5:25 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Derrick Stolee via GitGitGadget, Git Mailing List, Jeff King,
	Jonathan Nieder, Eric Sunshine, Nguyễn Thái Ngọc,
	Derrick Stolee, Derrick Stolee

Elijah Newren <newren@gmail.com> writes:

>>  1. ce->ce_mode == 01000755
>
> This is a weird number.  What's the reason for choosing it?  It looks
> deceptively close to 0100755, normal executable files, but has the
> extra 0, meaning that ce->ce_mode & S_IFMT is 0, suggesting it has no
> file type.
>
> Since it's a directory, why not use S_IFDIR (040000)?
>
> (GITLINK does use the weird 0160000 value, but it happens to be
> S_IFLNK | S_IFDIR == 0120000 | 040000, which conveys "it's both a
> directory and a symlink")

Yes, that combination of IFLNK/IFDIR was the reason why we use the
value.  I tend to think IFDIR is the best thing to use here.
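
A quick check of the arithmetic quoted above, as a sketch only.  The
octal constants are the conventional values written out by hand so the
snippet does not depend on <sys/stat.h>; the TOY_ names, has_file_type()
and is_dir_mode() are made up for illustration:

```c
#include <assert.h>

#define TOY_S_IFMT  0170000u	/* mask for the file type bits */
#define TOY_S_IFDIR 0040000u	/* directory */
#define TOY_S_IFREG 0100000u	/* regular file */
#define TOY_S_IFLNK 0120000u	/* symbolic link */
#define TOY_GITLINK 0160000u	/* mode used for gitlinks */

/* The gitlink mode is exactly the symlink and directory bits combined. */
static_assert((TOY_S_IFLNK | TOY_S_IFDIR) == TOY_GITLINK,
	      "gitlink is S_IFLNK | S_IFDIR");

/* Return nonzero if the mode carries any file type bits at all. */
static int has_file_type(unsigned int mode)
{
	return (mode & TOY_S_IFMT) != 0;
}

/* Return nonzero if the type bits of the mode say "directory". */
static int is_dir_mode(unsigned int mode)
{
	return (mode & TOY_S_IFMT) == TOY_S_IFDIR;
}
```

With these, 01000755 has no type bits at all, while 040000 reads as a
plain directory and 0160000 does not.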

* Re: [PATCH 08/27] sparse-checkout: hold pattern list in index
  2021-01-27 17:00   ` Elijah Newren
@ 2021-01-28 13:12     ` Derrick Stolee
  0 siblings, 0 replies; 61+ messages in thread
From: Derrick Stolee @ 2021-01-28 13:12 UTC (permalink / raw)
  To: Elijah Newren, Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano, Jeff King, Jonathan Nieder,
	Eric Sunshine, Nguyễn Thái Ngọc, Derrick Stolee,
	Derrick Stolee

On 1/27/2021 12:00 PM, Elijah Newren wrote:
> On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> As we modify the sparse-checkout definition, we perform index operations
>> on a pattern_list that only exists in-memory. This allows easy backing
>> out in case the index update fails.
>>
>> However, if the index write itself cares about the sparse-checkout
>> pattern set, we need access to that in-memory copy. Place a pointer to
>> a 'struct pattern_list' in the index so we can access this on-demand.
>> This will be used in the next change which uses the sparse-checkout
>> definition to filter out directories that are outsie the sparse cone.
> 
> s/outsie/outside/

Thanks! 

> Isn't this the same patch you put in your index cleanup series, or am
> I getting confused?  It looks very familiar.
 
I removed it from v2 of that series because it didn't do anything of
value until we start using the sparse_checkout_patterns member in the
next patch of _this_ series.

Thanks,
-Stolee

* Re: [PATCH 14/27] sparse-checkout: toggle sparse index from builtin
  2021-01-27 18:18   ` Elijah Newren
@ 2021-01-28 15:26     ` Derrick Stolee
  0 siblings, 0 replies; 61+ messages in thread
From: Derrick Stolee @ 2021-01-28 15:26 UTC (permalink / raw)
  To: Elijah Newren, Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano, Jeff King, Jonathan Nieder,
	Eric Sunshine, Nguyễn Thái Ngọc, Derrick Stolee,
	Derrick Stolee

On 1/27/2021 1:18 PM, Elijah Newren wrote:
> On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
> 
> I need to take a break from reviewing again at this point and work on
> some other tasks.  I'll resume reviewing the series later, perhaps
> tomorrow afternoon.

I appreciate your efforts here! I'm delaying detailed responses until
enough time has passed for interested parties to comment. If it helps,
then focusing on the "big things" is most important for now. I'll be
more careful about typos and things when I submit patches for full
review.

For my part, I'm taking time to look around at other things that need
my attention after being heads-down on this prototype. The sparse
index will need proper care to implement and I'm not expecting it to
move very quickly.

Thanks,
-Stolee


* Re: [PATCH 15/27] [RFC-VERSION] *: ensure full index
  2021-01-25 17:42 ` [PATCH 15/27] [RFC-VERSION] *: ensure full index Derrick Stolee via GitGitGadget
@ 2021-02-01 20:22   ` Elijah Newren
  2021-02-01 21:10     ` Derrick Stolee
  0 siblings, 1 reply; 61+ messages in thread
From: Elijah Newren @ 2021-02-01 20:22 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano, Jeff King, Jonathan Nieder,
	Eric Sunshine, Nguyễn Thái Ngọc, Derrick Stolee,
	Derrick Stolee

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> This giant patch is not intended for actual review. I have a branch that
> has these changes split out in a sane way with some commentary in each
> file that is modified.
>
> The idea here is to guard certain portions of the codebase that do not
> know how to handle sparse indexes by ensuring that the index is expanded
> to a full index before proceeding with the logic.
>
> This also provides a good mechanism for testing which code needs
> updating to enable the sparse index in a Git builtin. The builtin can
> set the_repository->settings.command_requires_full_index to zero and
> then we can debug the command with a breakpoint on ensure_full_index().
> That identifies the portion of code that needs adjusting before enabling
> sparse indexes for that command.
>
> Some index operations must be changed to operate on a non-const pointer,
> since ensuring a full index will modify the index itself.
>
> There are likely some gaps to these protections, which is why it will be
> important to carefully test each scenario as we relax the requirements.
> I expect that to be a long effort.

I think the idea makes sense; it provides a way for us to
incrementally build support for this new feature.

I skimmed over the code and noticed various interesting places that
have the ensure_full_index() call (e.g.
read_skip_worktree_file_from_index() -- whose existence comes from
sparsity; what irony...).  Better breakouts would be great, so I'll
defer most comments until then.  But, just to verify my
understanding: the primary defence is the command_requires_full_index
setting, and you have added several ensure_full_index() calls
throughout the code in places you believe will need to be fixed up
once someone turns off command_requires_full_index for a command.  Is
that correct?  And your comment on the gaps just means that there may
be other places still missing that secondary protection, as opposed to
my first reading of that paragraph: that we aren't sure we have enough
protections yet and need to add more before this moves out of RFC.  Is
that right?
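
To make that reading concrete, here is a toy C model of the two
layers as I understand them.  It is only a sketch: the toy_* names and
struct fields are invented for illustration and are not Git's actual
structures, and the real ensure_full_index() certainly does more than
flip a flag.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins (hypothetical); these do not match Git's real structs. */
struct toy_index {
	int sparse;          /* nonzero while directory entries are collapsed */
	size_t nr;           /* entries currently held in memory */
	size_t full_nr;      /* entries once every directory is expanded */
	unsigned expansions; /* how many times an expansion actually happened */
};

struct toy_settings {
	int command_requires_full_index;
};

/*
 * Takes a non-const pointer because expanding rewrites the index in
 * place; that is why callers can no longer pass a const index.
 */
static void toy_ensure_full_index(struct toy_index *istate)
{
	if (!istate->sparse)
		return; /* already full: nothing to do */
	istate->nr = istate->full_nr;
	istate->sparse = 0;
	istate->expansions++;
}

/* Primary defence: expand at read time unless the command opted out. */
static void toy_read_index(struct toy_index *istate,
			   const struct toy_settings *settings)
{
	if (settings->command_requires_full_index)
		toy_ensure_full_index(istate);
}

/* Secondary protection: code that walks every entry guards itself. */
static size_t toy_count_entries(struct toy_index *istate)
{
	toy_ensure_full_index(istate);
	assert(!istate->sparse);
	return istate->nr;
}
```

With the setting left at 1, the index is expanded once at read time
and the later guard is a no-op.  With it set to 0, the guard inside
toy_count_entries() is what triggers the expansion, which is the spot
a breakpoint on the guard would catch.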

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  apply.c                   | 10 +++++++++-
>  blame.c                   |  7 ++++++-
>  builtin/checkout-index.c  |  5 ++++-
>  builtin/grep.c            |  2 ++
>  builtin/ls-files.c        |  9 ++++++++-
>  builtin/merge-index.c     |  2 ++
>  builtin/mv.c              |  2 ++
>  builtin/rm.c              |  2 ++
>  builtin/sparse-checkout.c |  1 +
>  builtin/update-index.c    |  2 ++
>  cache.h                   |  1 +
>  diff-lib.c                |  2 ++
>  diff.c                    |  2 ++
>  dir.c                     | 14 +++++++++++++-
>  entry.c                   |  2 ++
>  fsmonitor.c               | 11 ++++++++++-
>  merge-recursive.c         | 22 +++++++++++++++++++---
>  name-hash.c               |  6 ++++++
>  pathspec.c                |  5 +++--
>  pathspec.h                |  4 ++--
>  read-cache.c              | 19 +++++++++++++++++--
>  rerere.c                  |  2 ++
>  resolve-undo.c            |  6 ++++++
>  sha1-name.c               |  3 +++
>  split-index.c             |  2 ++
>  submodule.c               | 24 +++++++++++++++++++-----
>  submodule.h               |  6 +++---
>  tree.c                    |  2 ++
>  wt-status.c               |  7 +++++++
>  29 files changed, 159 insertions(+), 23 deletions(-)
>
> diff --git a/apply.c b/apply.c
> index 668b16e9893..5bfbd928b38 100644
> --- a/apply.c
> +++ b/apply.c
> @@ -3523,6 +3523,8 @@ static int load_current(struct apply_state *state,
>         if (!patch->is_new)
>                 BUG("patch to %s is not a creation", patch->old_name);
>
> +       ensure_full_index(state->repo->index);
> +
>         pos = index_name_pos(state->repo->index, name, strlen(name));
>         if (pos < 0)
>                 return error(_("%s: does not exist in index"), name);
> @@ -3692,7 +3694,11 @@ static int check_preimage(struct apply_state *state,
>         }
>
>         if (state->check_index && !previous) {
> -               int pos = index_name_pos(state->repo->index, old_name,
> +               int pos;
> +
> +               ensure_full_index(state->repo->index);
> +
> +               pos = index_name_pos(state->repo->index, old_name,
>                                          strlen(old_name));
>                 if (pos < 0) {
>                         if (patch->is_new < 0)
> @@ -3751,6 +3757,8 @@ static int check_to_create(struct apply_state *state,
>         if (state->check_index && (!ok_if_exists || !state->cached)) {
>                 int pos;
>
> +               ensure_full_index(state->repo->index);
> +
>                 pos = index_name_pos(state->repo->index, new_name, strlen(new_name));
>                 if (pos >= 0) {
>                         struct cache_entry *ce = state->repo->index->cache[pos];
> diff --git a/blame.c b/blame.c
> index a5044fcfaa6..0aa368a35cf 100644
> --- a/blame.c
> +++ b/blame.c
> @@ -108,6 +108,7 @@ static void verify_working_tree_path(struct repository *r,
>                         return;
>         }
>
> +       ensure_full_index(r->index);
>         pos = index_name_pos(r->index, path, strlen(path));
>         if (pos >= 0)
>                 ; /* path is in the index */
> @@ -277,7 +278,11 @@ static struct commit *fake_working_tree_commit(struct repository *r,
>
>         len = strlen(path);
>         if (!mode) {
> -               int pos = index_name_pos(r->index, path, len);
> +               int pos;
> +
> +               ensure_full_index(r->index);
> +
> +               pos = index_name_pos(r->index, path, len);
>                 if (0 <= pos)
>                         mode = r->index->cache[pos]->ce_mode;
>                 else
> diff --git a/builtin/checkout-index.c b/builtin/checkout-index.c
> index 4bbfc92dce5..24c85b1c125 100644
> --- a/builtin/checkout-index.c
> +++ b/builtin/checkout-index.c
> @@ -48,11 +48,14 @@ static void write_tempfile_record(const char *name, const char *prefix)
>  static int checkout_file(const char *name, const char *prefix)
>  {
>         int namelen = strlen(name);
> -       int pos = cache_name_pos(name, namelen);
> +       int pos;
>         int has_same_name = 0;
>         int did_checkout = 0;
>         int errs = 0;
>
> +       ensure_full_index(the_repository->index);
> +       pos = index_name_pos(the_repository->index, name, namelen);
> +
>         if (pos < 0)
>                 pos = -pos - 1;
>
> diff --git a/builtin/grep.c b/builtin/grep.c
> index ca259af4416..e53cf817204 100644
> --- a/builtin/grep.c
> +++ b/builtin/grep.c
> @@ -506,6 +506,8 @@ static int grep_cache(struct grep_opt *opt,
>         if (repo_read_index(repo) < 0)
>                 die(_("index file corrupt"));
>
> +       ensure_full_index(repo->index);
> +
>         for (nr = 0; nr < repo->index->cache_nr; nr++) {
>                 const struct cache_entry *ce = repo->index->cache[nr];
>                 strbuf_setlen(&name, name_base_len);
> diff --git a/builtin/ls-files.c b/builtin/ls-files.c
> index c8eae899b82..933e259cdbe 100644
> --- a/builtin/ls-files.c
> +++ b/builtin/ls-files.c
> @@ -150,7 +150,7 @@ static void show_other_files(const struct index_state *istate,
>         }
>  }
>
> -static void show_killed_files(const struct index_state *istate,
> +static void show_killed_files(struct index_state *istate,
>                               const struct dir_struct *dir)
>  {
>         int i;
> @@ -159,6 +159,8 @@ static void show_killed_files(const struct index_state *istate,
>                 char *cp, *sp;
>                 int pos, len, killed = 0;
>
> +               ensure_full_index(istate);
> +
>                 for (cp = ent->name; cp - ent->name < ent->len; cp = sp + 1) {
>                         sp = strchr(cp, '/');
>                         if (!sp) {
> @@ -313,6 +315,7 @@ static void show_files(struct repository *repo, struct dir_struct *dir)
>                         show_killed_files(repo->index, dir);
>         }
>         if (show_cached || show_stage) {
> +               ensure_full_index(repo->index);
>                 for (i = 0; i < repo->index->cache_nr; i++) {
>                         const struct cache_entry *ce = repo->index->cache[i];
>
> @@ -332,6 +335,7 @@ static void show_files(struct repository *repo, struct dir_struct *dir)
>                 }
>         }
>         if (show_deleted || show_modified) {
> +               ensure_full_index(repo->index);
>                 for (i = 0; i < repo->index->cache_nr; i++) {
>                         const struct cache_entry *ce = repo->index->cache[i];
>                         struct stat st;
> @@ -368,6 +372,7 @@ static void prune_index(struct index_state *istate,
>
>         if (!prefix || !istate->cache_nr)
>                 return;
> +       ensure_full_index(istate);
>         pos = index_name_pos(istate, prefix, prefixlen);
>         if (pos < 0)
>                 pos = -pos-1;
> @@ -428,6 +433,8 @@ void overlay_tree_on_index(struct index_state *istate,
>         if (!tree)
>                 die("bad tree-ish %s", tree_name);
>
> +       ensure_full_index(istate);
> +
>         /* Hoist the unmerged entries up to stage #3 to make room */
>         for (i = 0; i < istate->cache_nr; i++) {
>                 struct cache_entry *ce = istate->cache[i];
> diff --git a/builtin/merge-index.c b/builtin/merge-index.c
> index 38ea6ad6ca2..3e1ddabd650 100644
> --- a/builtin/merge-index.c
> +++ b/builtin/merge-index.c
> @@ -80,6 +80,8 @@ int cmd_merge_index(int argc, const char **argv, const char *prefix)
>
>         read_cache();
>
> +       ensure_full_index(&the_index);
> +
>         i = 1;
>         if (!strcmp(argv[i], "-o")) {
>                 one_shot = 1;
> diff --git a/builtin/mv.c b/builtin/mv.c
> index 7dac714af90..2ab6416fce9 100644
> --- a/builtin/mv.c
> +++ b/builtin/mv.c
> @@ -145,6 +145,8 @@ int cmd_mv(int argc, const char **argv, const char *prefix)
>         if (read_cache() < 0)
>                 die(_("index file corrupt"));
>
> +       ensure_full_index(&the_index);
> +
>         source = internal_prefix_pathspec(prefix, argv, argc, 0);
>         modes = xcalloc(argc, sizeof(enum update_mode));
>         /*
> diff --git a/builtin/rm.c b/builtin/rm.c
> index 4858631e0f0..2db4fcd22d9 100644
> --- a/builtin/rm.c
> +++ b/builtin/rm.c
> @@ -291,6 +291,8 @@ int cmd_rm(int argc, const char **argv, const char *prefix)
>
>         refresh_index(&the_index, REFRESH_QUIET|REFRESH_UNMERGED, &pathspec, NULL, NULL);
>
> +       ensure_full_index(&the_index);
> +
>         seen = xcalloc(pathspec.nr, 1);
>
>         for (i = 0; i < active_nr; i++) {
> diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
> index ca63e2c64e9..14022b5e182 100644
> --- a/builtin/sparse-checkout.c
> +++ b/builtin/sparse-checkout.c
> @@ -123,6 +123,7 @@ static int update_working_directory(struct pattern_list *pl)
>         o.pl = pl;
>
>         setup_work_tree();
> +       ensure_full_index(r->index);
>
>         repo_hold_locked_index(r, &lock_file, LOCK_DIE_ON_ERROR);
>
> diff --git a/builtin/update-index.c b/builtin/update-index.c
> index 79087bccea4..521a6c23c75 100644
> --- a/builtin/update-index.c
> +++ b/builtin/update-index.c
> @@ -1088,6 +1088,8 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
>
>         the_index.updated_skipworktree = 1;
>
> +       ensure_full_index(&the_index);
> +
>         /*
>          * Custom copy of parse_options() because we want to handle
>          * filename arguments as they come.
> diff --git a/cache.h b/cache.h
> index dcf089b7006..306eab444b9 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -346,6 +346,7 @@ void add_name_hash(struct index_state *istate, struct cache_entry *ce);
>  void remove_name_hash(struct index_state *istate, struct cache_entry *ce);
>  void free_name_hash(struct index_state *istate);
>
> +void ensure_full_index(struct index_state *istate);
>
>  /* Cache entry creation and cleanup */
>
> diff --git a/diff-lib.c b/diff-lib.c
> index b73cc1859a4..3743e4463b4 100644
> --- a/diff-lib.c
> +++ b/diff-lib.c
> @@ -96,6 +96,8 @@ int run_diff_files(struct rev_info *revs, unsigned int option)
>         uint64_t start = getnanotime();
>         struct index_state *istate = revs->diffopt.repo->index;
>
> +       ensure_full_index(istate);
> +
>         diff_set_mnemonic_prefix(&revs->diffopt, "i/", "w/");
>
>         refresh_fsmonitor(istate);
> diff --git a/diff.c b/diff.c
> index 2253ec88029..02fafee8587 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -3901,6 +3901,8 @@ static int reuse_worktree_file(struct index_state *istate,
>         if (!want_file && would_convert_to_git(istate, name))
>                 return 0;
>
> +       ensure_full_index(istate);
> +
>         len = strlen(name);
>         pos = index_name_pos(istate, name, len);
>         if (pos < 0)
> diff --git a/dir.c b/dir.c
> index d153a63bbd1..ad6eb033cb1 100644
> --- a/dir.c
> +++ b/dir.c
> @@ -892,13 +892,15 @@ void add_pattern(const char *string, const char *base,
>         add_pattern_to_hashsets(pl, pattern);
>  }
>
> -static int read_skip_worktree_file_from_index(const struct index_state *istate,
> +static int read_skip_worktree_file_from_index(struct index_state *istate,
>                                               const char *path,
>                                               size_t *size_out, char **data_out,
>                                               struct oid_stat *oid_stat)
>  {
>         int pos, len;
>
> +       ensure_full_index(istate);
> +
>         len = strlen(path);
>         pos = index_name_pos(istate, path, len);
>         if (pos < 0)
> @@ -1088,6 +1090,10 @@ static int add_patterns(const char *fname, const char *base, int baselen,
>                 close(fd);
>                 if (oid_stat) {
>                         int pos;
> +
> +                       if (istate)
> +                               ensure_full_index(istate);
> +
>                         if (oid_stat->valid &&
>                             !match_stat_data_racy(istate, &oid_stat->stat, &st))
>                                 ; /* no content change, oid_stat->oid still good */
> @@ -1696,6 +1702,8 @@ static enum exist_status directory_exists_in_index(struct index_state *istate,
>         if (ignore_case)
>                 return directory_exists_in_index_icase(istate, dirname, len);
>
> +       ensure_full_index(istate);
> +
>         pos = index_name_pos(istate, dirname, len);
>         if (pos < 0)
>                 pos = -pos-1;
> @@ -2050,6 +2058,8 @@ static int get_index_dtype(struct index_state *istate,
>         int pos;
>         const struct cache_entry *ce;
>
> +       ensure_full_index(istate);
> +
>         ce = index_file_exists(istate, path, len, 0);
>         if (ce) {
>                 if (!ce_uptodate(ce))
> @@ -3536,6 +3546,8 @@ static void connect_wt_gitdir_in_nested(const char *sub_worktree,
>         if (repo_read_index(&subrepo) < 0)
>                 die(_("index file corrupt in repo %s"), subrepo.gitdir);
>
> +       ensure_full_index(subrepo.index);
> +
>         for (i = 0; i < subrepo.index->cache_nr; i++) {
>                 const struct cache_entry *ce = subrepo.index->cache[i];
>
> diff --git a/entry.c b/entry.c
> index a0532f1f000..d505e6f2c6e 100644
> --- a/entry.c
> +++ b/entry.c
> @@ -412,6 +412,8 @@ static void mark_colliding_entries(const struct checkout *state,
>
>         ce->ce_flags |= CE_MATCHED;
>
> +       ensure_full_index(state->istate);
> +
>         for (i = 0; i < state->istate->cache_nr; i++) {
>                 struct cache_entry *dup = state->istate->cache[i];
>
> diff --git a/fsmonitor.c b/fsmonitor.c
> index fe9e9d7baf4..7b8cd3975b9 100644
> --- a/fsmonitor.c
> +++ b/fsmonitor.c
> @@ -97,6 +97,9 @@ int read_fsmonitor_extension(struct index_state *istate, const void *data,
>  void fill_fsmonitor_bitmap(struct index_state *istate)
>  {
>         unsigned int i, skipped = 0;
> +
> +       ensure_full_index(istate);
> +
>         istate->fsmonitor_dirty = ewah_new();
>         for (i = 0; i < istate->cache_nr; i++) {
>                 if (istate->cache[i]->ce_flags & CE_REMOVE)
> @@ -158,7 +161,11 @@ static int query_fsmonitor(int version, const char *last_update, struct strbuf *
>
>  static void fsmonitor_refresh_callback(struct index_state *istate, const char *name)
>  {
> -       int pos = index_name_pos(istate, name, strlen(name));
> +       int pos;
> +
> +       ensure_full_index(istate);
> +
> +       pos = index_name_pos(istate, name, strlen(name));
>
>         if (pos >= 0) {
>                 struct cache_entry *ce = istate->cache[pos];
> @@ -330,6 +337,8 @@ void tweak_fsmonitor(struct index_state *istate)
>
>         if (istate->fsmonitor_dirty) {
>                 if (fsmonitor_enabled) {
> +                       ensure_full_index(istate);
> +
>                         /* Mark all entries valid */
>                         for (i = 0; i < istate->cache_nr; i++) {
>                                 istate->cache[i]->ce_flags |= CE_FSMONITOR_VALID;
> diff --git a/merge-recursive.c b/merge-recursive.c
> index f736a0f6323..12109f37723 100644
> --- a/merge-recursive.c
> +++ b/merge-recursive.c
> @@ -522,6 +522,8 @@ static struct string_list *get_unmerged(struct index_state *istate)
>
>         unmerged->strdup_strings = 1;
>
> +       ensure_full_index(istate);
> +
>         for (i = 0; i < istate->cache_nr; i++) {
>                 struct string_list_item *item;
>                 struct stage_data *e;
> @@ -762,6 +764,8 @@ static int dir_in_way(struct index_state *istate, const char *path,
>         strbuf_addstr(&dirpath, path);
>         strbuf_addch(&dirpath, '/');
>
> +       ensure_full_index(istate);
> +
>         pos = index_name_pos(istate, dirpath.buf, dirpath.len);
>
>         if (pos < 0)
> @@ -785,9 +789,13 @@ static int dir_in_way(struct index_state *istate, const char *path,
>  static int was_tracked_and_matches(struct merge_options *opt, const char *path,
>                                    const struct diff_filespec *blob)
>  {
> -       int pos = index_name_pos(&opt->priv->orig_index, path, strlen(path));
> +       int pos;
>         struct cache_entry *ce;
>
> +       ensure_full_index(&opt->priv->orig_index);
> +
> +       pos = index_name_pos(&opt->priv->orig_index, path, strlen(path));
> +
>         if (0 > pos)
>                 /* we were not tracking this path before the merge */
>                 return 0;
> @@ -802,7 +810,11 @@ static int was_tracked_and_matches(struct merge_options *opt, const char *path,
>   */
>  static int was_tracked(struct merge_options *opt, const char *path)
>  {
> -       int pos = index_name_pos(&opt->priv->orig_index, path, strlen(path));
> +       int pos;
> +
> +       ensure_full_index(&opt->priv->orig_index);
> +
> +       pos = index_name_pos(&opt->priv->orig_index, path, strlen(path));
>
>         if (0 <= pos)
>                 /* we were tracking this path before the merge */
> @@ -814,6 +826,9 @@ static int was_tracked(struct merge_options *opt, const char *path)
>  static int would_lose_untracked(struct merge_options *opt, const char *path)
>  {
>         struct index_state *istate = opt->repo->index;
> +       int pos;
> +
> +       ensure_full_index(istate);
>
>         /*
>          * This may look like it can be simplified to:
> @@ -832,7 +847,7 @@ static int would_lose_untracked(struct merge_options *opt, const char *path)
>          * update_file()/would_lose_untracked(); see every comment in this
>          * file which mentions "update_stages".
>          */
> -       int pos = index_name_pos(istate, path, strlen(path));
> +       pos = index_name_pos(istate, path, strlen(path));
>
>         if (pos < 0)
>                 pos = -1 - pos;
> @@ -3086,6 +3101,7 @@ static int handle_content_merge(struct merge_file_info *mfi,
>                  * flag to avoid making the file appear as if it were
>                  * deleted by the user.
>                  */
> +               ensure_full_index(&opt->priv->orig_index);
>                 pos = index_name_pos(&opt->priv->orig_index, path, strlen(path));
>                 ce = opt->priv->orig_index.cache[pos];
>                 if (ce_skip_worktree(ce)) {
> diff --git a/name-hash.c b/name-hash.c
> index 4e03fac9bb1..0f6d4fcca5a 100644
> --- a/name-hash.c
> +++ b/name-hash.c
> @@ -679,6 +679,8 @@ int index_dir_exists(struct index_state *istate, const char *name, int namelen)
>  {
>         struct dir_entry *dir;
>
> +       ensure_full_index(istate);
> +
>         lazy_init_name_hash(istate);
>         dir = find_dir_entry(istate, name, namelen);
>         return dir && dir->nr;
> @@ -689,6 +691,8 @@ void adjust_dirname_case(struct index_state *istate, char *name)
>         const char *startPtr = name;
>         const char *ptr = startPtr;
>
> +       ensure_full_index(istate);
> +
>         lazy_init_name_hash(istate);
>         while (*ptr) {
>                 while (*ptr && *ptr != '/')
> @@ -712,6 +716,8 @@ struct cache_entry *index_file_exists(struct index_state *istate, const char *na
>         struct cache_entry *ce;
>         unsigned int hash = memihash(name, namelen);
>
> +       ensure_full_index(istate);
> +
>         lazy_init_name_hash(istate);
>
>         ce = hashmap_get_entry_from_hash(&istate->name_hash, hash, NULL,
> diff --git a/pathspec.c b/pathspec.c
> index 7a229d8d22f..9b105855483 100644
> --- a/pathspec.c
> +++ b/pathspec.c
> @@ -20,7 +20,7 @@
>   * to use find_pathspecs_matching_against_index() instead.
>   */
>  void add_pathspec_matches_against_index(const struct pathspec *pathspec,
> -                                       const struct index_state *istate,
> +                                       struct index_state *istate,
>                                         char *seen)
>  {
>         int num_unmatched = 0, i;
> @@ -36,6 +36,7 @@ void add_pathspec_matches_against_index(const struct pathspec *pathspec,
>                         num_unmatched++;
>         if (!num_unmatched)
>                 return;
> +       ensure_full_index(istate);
>         for (i = 0; i < istate->cache_nr; i++) {
>                 const struct cache_entry *ce = istate->cache[i];
>                 ce_path_match(istate, ce, pathspec, seen);
> @@ -51,7 +52,7 @@ void add_pathspec_matches_against_index(const struct pathspec *pathspec,
>   * given pathspecs achieves against all items in the index.
>   */
>  char *find_pathspecs_matching_against_index(const struct pathspec *pathspec,
> -                                           const struct index_state *istate)
> +                                           struct index_state *istate)
>  {
>         char *seen = xcalloc(pathspec->nr, 1);
>         add_pathspec_matches_against_index(pathspec, istate, seen);
> diff --git a/pathspec.h b/pathspec.h
> index 454ce364fac..f19c5dcf022 100644
> --- a/pathspec.h
> +++ b/pathspec.h
> @@ -150,10 +150,10 @@ static inline int ps_strcmp(const struct pathspec_item *item,
>  }
>
>  void add_pathspec_matches_against_index(const struct pathspec *pathspec,
> -                                       const struct index_state *istate,
> +                                       struct index_state *istate,
>                                         char *seen);
>  char *find_pathspecs_matching_against_index(const struct pathspec *pathspec,
> -                                           const struct index_state *istate);
> +                                           struct index_state *istate);
>  int match_pathspec_attrs(const struct index_state *istate,
>                          const char *name, int namelen,
>                          const struct pathspec_item *item);
> diff --git a/read-cache.c b/read-cache.c
> index 0522260416e..65679d70d7c 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -622,7 +622,11 @@ void remove_marked_cache_entries(struct index_state *istate, int invalidate)
>
>  int remove_file_from_index(struct index_state *istate, const char *path)
>  {
> -       int pos = index_name_pos(istate, path, strlen(path));
> +       int pos;
> +
> +       ensure_full_index(istate);
> +
> +       pos = index_name_pos(istate, path, strlen(path));
>         if (pos < 0)
>                 pos = -pos-1;
>         cache_tree_invalidate_path(istate, path);
> @@ -640,9 +644,12 @@ static int compare_name(struct cache_entry *ce, const char *path, int namelen)
>  static int index_name_pos_also_unmerged(struct index_state *istate,
>         const char *path, int namelen)
>  {
> -       int pos = index_name_pos(istate, path, namelen);
> +       int pos;
>         struct cache_entry *ce;
>
> +       ensure_full_index(istate);
> +
> +       pos = index_name_pos(istate, path, namelen);
>         if (pos >= 0)
>                 return pos;
>
> @@ -717,6 +724,8 @@ int add_to_index(struct index_state *istate, const char *path, struct stat *st,
>         int hash_flags = HASH_WRITE_OBJECT;
>         struct object_id oid;
>
> +       ensure_full_index(istate);
> +
>         if (flags & ADD_CACHE_RENORMALIZE)
>                 hash_flags |= HASH_RENORMALIZE;
>
> @@ -1095,6 +1104,8 @@ static int has_dir_name(struct index_state *istate,
>         size_t len_eq_last;
>         int cmp_last = 0;
>
> +       ensure_full_index(istate);
> +
>         /*
>          * We are frequently called during an iteration on a sorted
>          * list of pathnames and while building a new index.  Therefore,
> @@ -1338,6 +1349,8 @@ int add_index_entry(struct index_state *istate, struct cache_entry *ce, int opti
>  {
>         int pos;
>
> +       ensure_full_index(istate);
> +
>         if (option & ADD_CACHE_JUST_APPEND)
>                 pos = istate->cache_nr;
>         else {
> @@ -1547,6 +1560,8 @@ int refresh_index(struct index_state *istate, unsigned int flags,
>          * we only have to do the special cases that are left.
>          */
>         preload_index(istate, pathspec, 0);
> +
> +       ensure_full_index(istate);
>         for (i = 0; i < istate->cache_nr; i++) {
>                 struct cache_entry *ce, *new_entry;
>                 int cache_errno = 0;
> diff --git a/rerere.c b/rerere.c
> index 9281131a9f1..1836a6cfbcf 100644
> --- a/rerere.c
> +++ b/rerere.c
> @@ -962,6 +962,8 @@ static int handle_cache(struct index_state *istate,
>         struct rerere_io_mem io;
>         int marker_size = ll_merge_marker_size(istate, path);
>
> +       ensure_full_index(istate);
> +
>         /*
>          * Reproduce the conflicted merge in-core
>          */
> diff --git a/resolve-undo.c b/resolve-undo.c
> index 236320f179c..a4265834977 100644
> --- a/resolve-undo.c
> +++ b/resolve-undo.c
> @@ -125,6 +125,8 @@ int unmerge_index_entry_at(struct index_state *istate, int pos)
>         if (!istate->resolve_undo)
>                 return pos;
>
> +       ensure_full_index(istate);
> +
>         ce = istate->cache[pos];
>         if (ce_stage(ce)) {
>                 /* already unmerged */
> @@ -172,6 +174,8 @@ void unmerge_marked_index(struct index_state *istate)
>         if (!istate->resolve_undo)
>                 return;
>
> +       ensure_full_index(istate);
> +
>         for (i = 0; i < istate->cache_nr; i++) {
>                 const struct cache_entry *ce = istate->cache[i];
>                 if (ce->ce_flags & CE_MATCHED)
> @@ -186,6 +190,8 @@ void unmerge_index(struct index_state *istate, const struct pathspec *pathspec)
>         if (!istate->resolve_undo)
>                 return;
>
> +       ensure_full_index(istate);
> +
>         for (i = 0; i < istate->cache_nr; i++) {
>                 const struct cache_entry *ce = istate->cache[i];
>                 if (!ce_path_match(istate, ce, pathspec, NULL))
> diff --git a/sha1-name.c b/sha1-name.c
> index 0b23b86ceb4..c2f17e526ab 100644
> --- a/sha1-name.c
> +++ b/sha1-name.c
> @@ -1734,6 +1734,8 @@ static void diagnose_invalid_index_path(struct repository *r,
>         if (!prefix)
>                 prefix = "";
>
> +       ensure_full_index(r->index);
> +
>         /* Wrong stage number? */
>         pos = index_name_pos(istate, filename, namelen);
>         if (pos < 0)
> @@ -1854,6 +1856,7 @@ static enum get_oid_result get_oid_with_context_1(struct repository *repo,
>
>                 if (!repo->index || !repo->index->cache)
>                         repo_read_index(repo);
> +               ensure_full_index(repo->index);
>                 pos = index_name_pos(repo->index, cp, namelen);
>                 if (pos < 0)
>                         pos = -pos - 1;
> diff --git a/split-index.c b/split-index.c
> index c0e8ad670d0..3150fa6476a 100644
> --- a/split-index.c
> +++ b/split-index.c
> @@ -4,6 +4,8 @@
>
>  struct split_index *init_split_index(struct index_state *istate)
>  {
> +       ensure_full_index(istate);
> +
>         if (!istate->split_index) {
>                 istate->split_index = xcalloc(1, sizeof(*istate->split_index));
>                 istate->split_index->refcount = 1;
> diff --git a/submodule.c b/submodule.c
> index b3bb59f0664..f80cfddbd52 100644
> --- a/submodule.c
> +++ b/submodule.c
> @@ -33,9 +33,13 @@ static struct oid_array ref_tips_after_fetch;
>   * will be disabled because we can't guess what might be configured in
>   * .gitmodules unless the user resolves the conflict.
>   */
> -int is_gitmodules_unmerged(const struct index_state *istate)
> +int is_gitmodules_unmerged(struct index_state *istate)
>  {
> -       int pos = index_name_pos(istate, GITMODULES_FILE, strlen(GITMODULES_FILE));
> +       int pos;
> +
> +       ensure_full_index(istate);
> +
> +       pos = index_name_pos(istate, GITMODULES_FILE, strlen(GITMODULES_FILE));
>         if (pos < 0) { /* .gitmodules not found or isn't merged */
>                 pos = -1 - pos;
>                 if (istate->cache_nr > pos) {  /* there is a .gitmodules */
> @@ -77,7 +81,11 @@ int is_writing_gitmodules_ok(void)
>   */
>  int is_staging_gitmodules_ok(struct index_state *istate)
>  {
> -       int pos = index_name_pos(istate, GITMODULES_FILE, strlen(GITMODULES_FILE));
> +       int pos;
> +
> +       ensure_full_index(istate);
> +
> +       pos = index_name_pos(istate, GITMODULES_FILE, strlen(GITMODULES_FILE));
>
>         if ((pos >= 0) && (pos < istate->cache_nr)) {
>                 struct stat st;
> @@ -301,7 +309,7 @@ int is_submodule_populated_gently(const char *path, int *return_error_code)
>  /*
>   * Dies if the provided 'prefix' corresponds to an unpopulated submodule
>   */
> -void die_in_unpopulated_submodule(const struct index_state *istate,
> +void die_in_unpopulated_submodule(struct index_state *istate,
>                                   const char *prefix)
>  {
>         int i, prefixlen;
> @@ -311,6 +319,8 @@ void die_in_unpopulated_submodule(const struct index_state *istate,
>
>         prefixlen = strlen(prefix);
>
> +       ensure_full_index(istate);
> +
>         for (i = 0; i < istate->cache_nr; i++) {
>                 struct cache_entry *ce = istate->cache[i];
>                 int ce_len = ce_namelen(ce);
> @@ -331,11 +341,13 @@ void die_in_unpopulated_submodule(const struct index_state *istate,
>  /*
>   * Dies if any paths in the provided pathspec descends into a submodule
>   */
> -void die_path_inside_submodule(const struct index_state *istate,
> +void die_path_inside_submodule(struct index_state *istate,
>                                const struct pathspec *ps)
>  {
>         int i, j;
>
> +       ensure_full_index(istate);
> +
>         for (i = 0; i < istate->cache_nr; i++) {
>                 struct cache_entry *ce = istate->cache[i];
>                 int ce_len = ce_namelen(ce);
> @@ -1420,6 +1432,8 @@ static int get_next_submodule(struct child_process *cp,
>  {
>         struct submodule_parallel_fetch *spf = data;
>
> +       ensure_full_index(spf->r->index);
> +
>         for (; spf->count < spf->r->index->cache_nr; spf->count++) {
>                 const struct cache_entry *ce = spf->r->index->cache[spf->count];
>                 const char *default_argv;
> diff --git a/submodule.h b/submodule.h
> index 4ac6e31cf1f..84640c49c11 100644
> --- a/submodule.h
> +++ b/submodule.h
> @@ -39,7 +39,7 @@ struct submodule_update_strategy {
>  };
>  #define SUBMODULE_UPDATE_STRATEGY_INIT {SM_UPDATE_UNSPECIFIED, NULL}
>
> -int is_gitmodules_unmerged(const struct index_state *istate);
> +int is_gitmodules_unmerged(struct index_state *istate);
>  int is_writing_gitmodules_ok(void);
>  int is_staging_gitmodules_ok(struct index_state *istate);
>  int update_path_in_gitmodules(const char *oldpath, const char *newpath);
> @@ -60,9 +60,9 @@ int is_submodule_active(struct repository *repo, const char *path);
>   * Otherwise the return error code is the same as of resolve_gitdir_gently.
>   */
>  int is_submodule_populated_gently(const char *path, int *return_error_code);
> -void die_in_unpopulated_submodule(const struct index_state *istate,
> +void die_in_unpopulated_submodule(struct index_state *istate,
>                                   const char *prefix);
> -void die_path_inside_submodule(const struct index_state *istate,
> +void die_path_inside_submodule(struct index_state *istate,
>                                const struct pathspec *ps);
>  enum submodule_update_type parse_submodule_update_type(const char *value);
>  int parse_submodule_update_strategy(const char *value,
> diff --git a/tree.c b/tree.c
> index e76517f6b18..60f575440c8 100644
> --- a/tree.c
> +++ b/tree.c
> @@ -170,6 +170,8 @@ int read_tree(struct repository *r, struct tree *tree, int stage,
>          * to matter.
>          */
>
> +       ensure_full_index(istate);
> +
>         /*
>          * See if we have cache entry at the stage.  If so,
>          * do it the original slow way, otherwise, append and then
> diff --git a/wt-status.c b/wt-status.c
> index 7074bbdd53c..5366d336938 100644
> --- a/wt-status.c
> +++ b/wt-status.c
> @@ -509,6 +509,8 @@ static int unmerged_mask(struct index_state *istate, const char *path)
>         int pos, mask;
>         const struct cache_entry *ce;
>
> +       ensure_full_index(istate);
> +
>         pos = index_name_pos(istate, path, strlen(path));
>         if (0 <= pos)
>                 return 0;
> @@ -657,6 +659,8 @@ static void wt_status_collect_changes_initial(struct wt_status *s)
>         struct index_state *istate = s->repo->index;
>         int i;
>
> +       ensure_full_index(istate);
> +
>         for (i = 0; i < istate->cache_nr; i++) {
>                 struct string_list_item *it;
>                 struct wt_status_change_data *d;
> @@ -2295,6 +2299,9 @@ static void wt_porcelain_v2_print_unmerged_entry(
>          */
>         memset(stages, 0, sizeof(stages));
>         sum = 0;
> +
> +       ensure_full_index(istate);
> +
>         pos = index_name_pos(istate, it->string, strlen(it->string));
>         assert(pos < 0);
>         pos = -pos-1;
> --
> gitgitgadget
>

* Re: [PATCH 16/27] unpack-trees: make sparse aware
  2021-01-25 17:42 ` [PATCH 16/27] unpack-trees: make sparse aware Derrick Stolee via GitGitGadget
@ 2021-02-01 20:50   ` Elijah Newren
  2021-02-09 17:23     ` Derrick Stolee
  0 siblings, 1 reply; 61+ messages in thread
From: Elijah Newren @ 2021-02-01 20:50 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano, Jeff King, Jonathan Nieder,
	Eric Sunshine, Nguyễn Thái Ngọc, Derrick Stolee,
	Derrick Stolee

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> As a first step to integrate 'git status' and 'git add' with the sparse
> index, we must start integrating unpack_trees() with sparse directory
> entries. These changes are currently impossible to trigger because
> unpack_trees() calls ensure_full_index() if command_requires_full_index
> is true. This is the case for all commands at the moment. As we expand
> more commands to be sparse-aware, we might find that more changes are
> required to unpack_trees(). The current changes will suffice for
> 'status' and 'add'.
>
> unpack_trees() calls the traverse_trees() API using unpack_callback()
> to decide if we should recurse into a subtree. We must add new abilities
> to skip a subtree if it corresponds to a sparse directory entry.

Makes sense.

> It is important to be careful about the trailing directory separator
> that exists in the sparse directory entries but not in the subtree
> paths.

The comment makes me wonder if leaving the trailing directory
separator out would be better, as it'd allow direct comparisons.  Of
course, you have a better idea of what is easier or harder based on
this decision.  Is there any chance you have a quick list of the
places that the code was simplified by this decision and a list of
places like this one that were made slightly harder?

> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  dir.h           |  2 +-
>  preload-index.c |  2 ++
>  read-cache.c    |  3 +++
>  unpack-trees.c  | 24 ++++++++++++++++++++++--
>  4 files changed, 28 insertions(+), 3 deletions(-)
>
> diff --git a/dir.h b/dir.h
> index facfae47402..300305ec335 100644
> --- a/dir.h
> +++ b/dir.h
> @@ -503,7 +503,7 @@ static inline int ce_path_match(const struct index_state *istate,
>                                 char *seen)
>  {
>         return match_pathspec(istate, pathspec, ce->name, ce_namelen(ce), 0, seen,
> -                             S_ISDIR(ce->ce_mode) || S_ISGITLINK(ce->ce_mode));
> +                             S_ISSPARSEDIR(ce) || S_ISDIR(ce->ce_mode) || S_ISGITLINK(ce->ce_mode));

I think this hunk becomes unnecessary if you use ce_mode = 040000 for
sparse directory entries.

>  }
>
>  static inline int dir_path_match(const struct index_state *istate,
> diff --git a/preload-index.c b/preload-index.c
> index ed6eaa47388..323fc8c5100 100644
> --- a/preload-index.c
> +++ b/preload-index.c
> @@ -54,6 +54,8 @@ static void *preload_thread(void *_data)
>                         continue;
>                 if (S_ISGITLINK(ce->ce_mode))
>                         continue;
> +               if (S_ISSPARSEDIR(ce))
> +                       continue;
>                 if (ce_uptodate(ce))
>                         continue;
>                 if (ce_skip_worktree(ce))
> diff --git a/read-cache.c b/read-cache.c
> index 65679d70d7c..ab0c2b86ec0 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -1572,6 +1572,9 @@ int refresh_index(struct index_state *istate, unsigned int flags,
>                 if (ignore_submodules && S_ISGITLINK(ce->ce_mode))
>                         continue;
>
> +               if (istate->sparse_index && S_ISSPARSEDIR(ce))
> +                       continue;
> +
>                 if (pathspec && !ce_path_match(istate, ce, pathspec, seen))
>                         filtered = 1;
>
> diff --git a/unpack-trees.c b/unpack-trees.c
> index b324eec2a5d..90644856a80 100644
> --- a/unpack-trees.c
> +++ b/unpack-trees.c
> @@ -583,6 +583,13 @@ static void mark_ce_used(struct cache_entry *ce, struct unpack_trees_options *o)
>  {
>         ce->ce_flags |= CE_UNPACKED;
>
> +       /*
> +        * If this is a sparse directory, don't advance cache_bottom.
> +        * That will be advanced later using the cache-tree data.
> +        */
> +       if (S_ISSPARSEDIR(ce))
> +               return;

I don't grok the cache_bottom stuff -- in general, nothing specific
about your patch.  But since I don't grok that stuff, it means I don't
understand how your comment here relates; you may want to ping another
reviewer about this portion of the patch.

> +
>         if (o->cache_bottom < o->src_index->cache_nr &&
>             o->src_index->cache[o->cache_bottom] == ce) {
>                 int bottom = o->cache_bottom;
> @@ -980,6 +987,9 @@ static int do_compare_entry(const struct cache_entry *ce,
>         ce_len -= pathlen;
>         ce_name = ce->name + pathlen;
>
> +       /* remove directory separator if a sparse directory entry */
> +       if (S_ISSPARSEDIR(ce))
> +               ce_len--;

Here's where your comment about trailing separator comes in; makes sense.
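
For illustration only (this is not code from the patch): a minimal sketch of
the mismatch being handled, assuming a sparse directory entry is stored as
"dir/" while the tree walk supplies "dir". The helper name and its
is_sparse_dir parameter are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Git): compare an index entry name
 * against a tree path, ignoring one trailing '/' on the index side
 * when the entry is a sparse directory.
 * Returns 1 if both name the same path, 0 otherwise.
 */
static int same_path_ignoring_dir_sep(const char *ce_name, size_t ce_len,
				      int is_sparse_dir,
				      const char *name, size_t namelen)
{
	/* Sparse directory entries look like "dir/"; tree paths like "dir". */
	if (is_sparse_dir && ce_len > 0 && ce_name[ce_len - 1] == '/')
		ce_len--;

	return ce_len == namelen && !memcmp(ce_name, name, namelen);
}
```

With this, ("deep/", sparse) matches "deep", while the same bytes without the
sparse flag do not, which is the asymmetry the ce_len-- adjustment addresses.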

>         return df_name_compare(ce_name, ce_len, S_IFREG, name, namelen, mode);
>  }
>
> @@ -989,6 +999,10 @@ static int compare_entry(const struct cache_entry *ce, const struct traverse_inf
>         if (cmp)
>                 return cmp;
>
> +       /* If ce is a sparse directory, then allow equality here. */
> +       if (S_ISSPARSEDIR(ce))
> +               return 0;
> +

This seems surprising to me.  Is there a chance you are comparing
sparse directory A with sparse directory B and you return with
equality?  Or sparse directory A with regular file B?  Do the callers
still do the right thing?  If your code change here is right, it seems
like it deserves an extra comment either in the code or the commit
message.

>         /*
>          * Even if the beginning compared identically, the ce should
>          * compare as bigger than a directory leading up to it!
> @@ -1239,6 +1253,7 @@ static int unpack_callback(int n, unsigned long mask, unsigned long dirmask, str
>         struct cache_entry *src[MAX_UNPACK_TREES + 1] = { NULL, };
>         struct unpack_trees_options *o = info->data;
>         const struct name_entry *p = names;
> +       unsigned recurse = 1;
>
>         /* Find first entry with a real name (we could use "mask" too) */
>         while (!p->mode)
> @@ -1280,12 +1295,16 @@ static int unpack_callback(int n, unsigned long mask, unsigned long dirmask, str
>                                         }
>                                 }
>                                 src[0] = ce;
> +
> +                               if (S_ISSPARSEDIR(ce))
> +                                       recurse = 0;
>                         }
>                         break;
>                 }
>         }
>
> -       if (unpack_nondirectories(n, mask, dirmask, src, names, info) < 0)
> +       if (recurse &&
> +           unpack_nondirectories(n, mask, dirmask, src, names, info) < 0)
>                 return -1;
>
>         if (o->merge && src[0]) {
> @@ -1315,7 +1334,8 @@ static int unpack_callback(int n, unsigned long mask, unsigned long dirmask, str
>                         }
>                 }
>
> -               if (traverse_trees_recursive(n, dirmask, mask & ~dirmask,
> +               if (recurse &&
> +                   traverse_trees_recursive(n, dirmask, mask & ~dirmask,
>                                              names, info) < 0)
>                         return -1;
>                 return mask;

The unpack_callback() code has some comparison to a cache-tree, but
I'd assume that you'd need to update cache-tree.c somewhat to take
advantage of these sparse directory entries.  Am I wrong, and you just
get cache-tree.c working with sparse directory entries for free?  Or
is this something coming in a later patch?

* Re: [PATCH 15/27] [RFC-VERSION] *: ensure full index
  2021-02-01 20:22   ` Elijah Newren
@ 2021-02-01 21:10     ` Derrick Stolee
  0 siblings, 0 replies; 61+ messages in thread
From: Derrick Stolee @ 2021-02-01 21:10 UTC (permalink / raw)
  To: Elijah Newren, Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano, Jeff King, Jonathan Nieder,
	Eric Sunshine, Nguyễn Thái Ngọc, Derrick Stolee,
	Derrick Stolee

On 2/1/2021 3:22 PM, Elijah Newren wrote:
> On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> This giant patch is not intended for actual review. I have a branch that
>> has these changes split out in a sane way with some commentary in each
>> file that is modified.
>>
>> The idea here is to guard certain portions of the codebase that do not
>> know how to handle sparse indexes by ensuring that the index is expanded
>> to a full index before proceeding with the logic.
>>
>> This also provides a good mechanism for testing which code needs
>> updating to enable the sparse index in a Git builtin. The builtin can
>> set the_repository->settings.command_requires_full_index to zero and
>> then we can debug the command with a breakpoint on ensure_full_index().
>> That identifies the portion of code that needs adjusting before enabling
>> sparse indexes for that command.
>>
>> Some index operations must be changed to operate on a non-const pointer,
>> since ensuring a full index will modify the index itself.
>>
>> There are likely some gaps to these protections, which is why it will be
>> important to carefully test each scenario as we relax the requirements.
>> I expect that to be a long effort.
> 
> I think the idea makes sense; it provides a way for us to
> incrementally build support for this new feature.
> 
> I skimmed over the code and noticed various interesting places that
> had the ensure_full_index() call (e.g.
> read_skip_worktree_file_from_index() -- whose existence comes from
> sparsity; what irony...).  Better breakouts would be great, so I'll
> defer commenting much until then.  But, just to verify I'm
> understanding: the primary defence is the command_requires_full_index
> setting, and you have added several ensure_full_index() calls
> throughout the code in places you believe would need to be fixed up in
> case someone switches the command_requires_full_index setting.  Is
> that correct?  And your comment on the gaps is just that there may be
> other places that are missing the secondary protection (as opposed to
> my first reading of that paragraph as suggesting we aren't sure if we
> have enough protections yet and need to add more before this moves out
> of RFC); is that right?

Yes, the idea is that we can incrementally disable
command_requires_full_index for some builtins and be confident that
corner cases will be protected by ensure_full_index(). Further, we
can test whether ensure_full_index() was called using test_region
in test scripts to demonstrate whether a command is truly "sparse aware"
or is converting to full and back to sparse.

There is also the case that when we write the index into a sparse
format, the in-memory structure is modified. If the index is re-used
afterwards, then we must expand to full again for these code paths.

unpack_trees() already has one of these calls because it was necessary
for the sparse-index write to work.

The ensure_full_index() pattern also works when updating a builtin to
work with the sparse-index because of the breakpoint trick.
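
For illustration only: a toy model of the two layers of protection discussed
above. The structs and function names are hypothetical stand-ins, not Git's
real types, and where the primary check runs is an assumption of the sketch.

```c
#include <stddef.h>

/* Toy stand-ins for illustration only; these are not Git's real structs. */
struct toy_index {
	int sparse;        /* 1 if sparse directory entries are present */
	int expand_count;  /* how many times we expanded to a full index */
};

struct toy_settings {
	int command_requires_full_index;
};

/* Expand only when the index is actually sparse; otherwise a no-op. */
static void toy_ensure_full_index(struct toy_index *istate)
{
	if (!istate || !istate->sparse)
		return;
	istate->sparse = 0;
	istate->expand_count++;
}

/* Primary guard: a command that has not opted in expands up front. */
static void toy_command_entry(struct toy_index *istate,
			      const struct toy_settings *settings)
{
	if (settings->command_requires_full_index)
		toy_ensure_full_index(istate);
}

/* Secondary guard: a code path not yet taught about sparse entries. */
static void toy_legacy_code_path(struct toy_index *istate)
{
	toy_ensure_full_index(istate);
	/* ... iterate over entries, assuming every one is a file ... */
}
```

A command that sets command_requires_full_index to 0 passes the primary guard
with the index still sparse; the first legacy path it reaches then expands
the index once, and later calls are no-ops.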

When I submit this as a full series, this patch will become its own
patch series, with careful comments about why each of these calls is
added, on a file-by-file basis.

Thanks,
-Stolee

* Re: [PATCH 17/27] dir.c: accept a directory as part of cone-mode patterns
  2021-01-25 17:42 ` [PATCH 17/27] dir.c: accept a directory as part of cone-mode patterns Derrick Stolee via GitGitGadget
@ 2021-02-01 22:12   ` Elijah Newren
  0 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren @ 2021-02-01 22:12 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano, Jeff King, Jonathan Nieder,
	Eric Sunshine, Nguyễn Thái Ngọc, Derrick Stolee,
	Derrick Stolee

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> When we have sparse directory entries in the index, we want to compare
> that directory against sparse-checkout patterns. Those pattern matching
> algorithms are built expecting a file path, not a directory path. This
> is especially important in the "cone mode" patterns which will match
> files that exist within the "parent directories" as well as the
> recursive directory matches.
>
> If path_matches_pattern_list() is given a directory, we can add a bogus
> filename ("-") to the directory and get the same results as before,
> assuming we are in cone mode. Since sparse index requires cone mode
> patterns, this is an acceptable assumption.

Why is "-" a bogus filename?  Is that only on certain operating
systems, or are you just not expecting a user to name their file with
such a bad name?  What if there is a file with that name in that
directory in the repository; do you need the pathname to be bogus?

What do you mean by "get the same results as before"?  The first
paragraph suggests the code wouldn't handle a directory path, and that
not handling it was problematic, so it seems unlikely you want the
same results as that.  But it's not clear what the "before" refers to
here.

> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  dir.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/dir.c b/dir.c
> index ad6eb033cb1..c786fa98d0e 100644
> --- a/dir.c
> +++ b/dir.c
> @@ -1384,6 +1384,11 @@ enum pattern_match_result path_matches_pattern_list(
>         strbuf_addch(&parent_pathname, '/');
>         strbuf_add(&parent_pathname, pathname, pathlen);
>
> +       /* Directory requests should be added as if they are a file */
> +       if (parent_pathname.len > 1 &&
> +           parent_pathname.buf[parent_pathname.len - 1] == '/')

Ah, this looks like a case where the trailing slash is helpful;
without it, you might have to feed extra data in through the call
hierarchy to signify that this is a directory entry.

> +               strbuf_add(&parent_pathname, "-", 1);
> +
>         if (hashmap_contains_path(&pl->recursive_hashmap,
>                                   &parent_pathname)) {
>                 result = MATCHED_RECURSIVE;

hashmap_contains_path?  Don't we already know (modulo special cases of
our bogus value not quite being bogus enough) that this is false since
we were adding a bogus path?  How could the hashmap have a bogus value
in it?  Won't this particular call fail with or without our adding "-"
to the end of the path?

After this hashmap_contains_path() call, the subsequent code looks for
the parent of the path by stripping off everything after the last
'/'...which seems like the relevant code anyway.  Is the problem that
the hashmap_contains_path() call was returning true when we didn't add
"-" to the end?  If so, can we use an if or a goto instead to make
the code skip this first check and move on to where we want it to go?

Or am I misunderstanding something about this code?
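
For illustration only: a sketch of the string handling in question, using a
plain char buffer instead of a strbuf. The helper name is hypothetical; it
mirrors the hunk's "append '-' if the request ends in '/'" step and then the
"strip everything after the last '/'" step described above.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of Git): build "/<pathname>", append "-"
 * when the request ends in '/', then truncate at the last '/' to obtain
 * the key for the containing directory.
 */
static void parent_key(const char *pathname, char *out, size_t outlen)
{
	size_t len;
	char *slash;

	snprintf(out, outlen, "/%s", pathname);
	len = strlen(out);

	/* Treat a directory request as if it were a file inside it. */
	if (len > 1 && out[len - 1] == '/' && len + 1 < outlen) {
		out[len] = '-';
		out[len + 1] = '\0';
	}

	/* Strip everything from the last '/' onward. */
	slash = strrchr(out, '/');
	if (slash)
		*slash = '\0';
}
```

Under these assumptions, "folder1/" and "folder1/a.txt" both reduce to the
key "/folder1", so a directory request is evaluated like a file directly
inside that directory.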

* Re: [PATCH 21/27] sparse-index: expand_to_path no-op if path exists
  2021-01-25 17:42 ` [PATCH 21/27] sparse-index: expand_to_path no-op if path exists Derrick Stolee via GitGitGadget
@ 2021-02-01 22:34   ` Elijah Newren
  0 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren @ 2021-02-01 22:34 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano, Jeff King, Jonathan Nieder,
	Eric Sunshine, Nguyễn Thái Ngọc, Derrick Stolee,
	Derrick Stolee

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> We need to check the file hashmap first, then look to see if the
> directory signals a non-sparse directory entry. In such a case, we can
> rely on the contents of the sparse-index.
>
> We still use ensure_full_index() in the case that we hit a path that is
> within a sparse-directory entry.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  name-hash.c    |  6 ++++++
>  sparse-index.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 59 insertions(+)
>
> diff --git a/name-hash.c b/name-hash.c
> index 641f6900a7c..cb0f316f652 100644
> --- a/name-hash.c
> +++ b/name-hash.c
> @@ -110,6 +110,12 @@ static void hash_index_entry(struct index_state *istate, struct cache_entry *ce)
>         if (ce->ce_flags & CE_HASHED)
>                 return;
>         ce->ce_flags |= CE_HASHED;
> +
> +       if (ce->ce_mode == CE_MODE_SPARSE_DIRECTORY) {
> +               add_dir_entry(istate, ce);
> +               return;
> +       }
> +
>         hashmap_entry_init(&ce->ent, memihash(ce->name, ce_namelen(ce)));
>         hashmap_add(&istate->name_hash, &ce->ent);
>
> diff --git a/sparse-index.c b/sparse-index.c
> index dd1a06dfdd3..bf8dce9a09b 100644
> --- a/sparse-index.c
> +++ b/sparse-index.c
> @@ -281,9 +281,62 @@ void ensure_full_index(struct index_state *istate)
>         trace2_region_leave("index", "ensure_full_index", istate->repo);
>  }
>
> +static int in_expand_to_path = 0;
> +
>  void expand_to_path(struct index_state *istate,
>                     const char *path, size_t pathlen, int icase)
>  {
> +       struct strbuf path_as_dir = STRBUF_INIT;
> +       int pos;
> +
> +       /* prevent extra recursion */
> +       if (in_expand_to_path)
> +               return;

Maybe "prevent extra expand_to_path() <-> index_file_exists()
recursion", just to be extra explicit?
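
For illustration only: a toy version of the mutual recursion the static flag
breaks. All names are hypothetical; the sketch assumes, as the suggested
comment implies, that the lookup function calls back into the expand
function.

```c
/* Toy stand-ins; these names are hypothetical, not Git's functions. */
static int in_expand;      /* set while toy_expand() is on the stack */
static int expand_bodies;  /* how many times the guarded body ran */

static int toy_file_exists(const char *path);

static void toy_expand(const char *path)
{
	if (in_expand)
		return;            /* re-entered via toy_file_exists() */
	in_expand = 1;
	expand_bodies++;
	toy_file_exists(path);     /* calls back into toy_expand() */
	in_expand = 0;
}

static int toy_file_exists(const char *path)
{
	toy_expand(path);          /* make sure the path is expanded first */
	return path[0] != '\0';
}
```

One outer call runs the guarded body exactly once, and the flag is cleared
on exit so that the next top-level call works normally.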

> +
> +       if (!istate || !istate->sparse_index)
> +               return;
> +
> +       if (!istate->repo)
> +               istate->repo = the_repository;

So, we assume the_repository if istate->repo isn't set.  I guess given
the number of the_repository assumptions we have in the code, this
isn't a big deal.  And instead of a
USE_THE_REPOSITORY_COMPATIBILITY_MACROS we have a
NO_THE_REPOSITORY_COMPATIBILITY_MACROS, so there's nothing to mark
this either.

> +
> +       in_expand_to_path = 1;
> +
> +       /*
> +        * We only need to actually expand a region if the
> +        * following are both true:
> +        *
> +        * 1. 'path' is not already in the index.
> +        * 2. Some parent directory of 'path' is a sparse directory.
> +        */
> +
> +       strbuf_add(&path_as_dir, path, pathlen);
> +       strbuf_addch(&path_as_dir, '/');
> +
> +       /* in_expand_to_path prevents infinite recursion here */
> +       if (index_file_exists(istate, path, pathlen, icase))
> +               goto cleanup;

Shouldn't the editing of path_as_dir be done after the
index_file_exists() call?  In the case that the entry already exists,
writing to path_as_dir is wasted work.

> +       pos = index_name_pos(istate, path_as_dir.buf, path_as_dir.len);
> +
> +       if (pos < 0)
> +               pos = -pos - 1;
> +
> +       /*
> +        * Even if the path doesn't exist, if the value isn't exactly a
> +        * sparse-directory entry, then there is no need to expand the
> +        * index.
> +        */
> +       if (istate->cache[pos]->ce_mode != CE_MODE_SPARSE_DIRECTORY)
> +               goto cleanup;

This looked wrong to me until I tried to come up with a
counter-example.  Here you are relying on the fact that before the
comment, pos is going to be the index of a sparse directory entry --
either for path_as_dir or some ancestor directory.  It would be nice
if the comment mentioned that.
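
For illustration only: a toy binary search with the same return convention
as index_name_pos(), where a miss is reported as -1 minus the insertion
position. It uses strcmp() on a sorted array of names, which is a
simplification of how the real index compares entries.

```c
#include <string.h>

/*
 * Toy lookup (not Git's implementation): returns the position of 'name'
 * in the sorted array, or -1 - (insertion position) when it is absent.
 */
static int toy_name_pos(const char **names, int nr, const char *name)
{
	int first = 0, last = nr;

	while (last > first) {
		int next = first + ((last - first) >> 1);
		int cmp = strcmp(name, names[next]);

		if (!cmp)
			return next;
		if (cmp < 0)
			last = next;
		else
			first = next + 1;
	}
	return -first - 1;
}
```

Applying pos = -pos - 1 to a negative result recovers the insertion
position. That position can equal the number of entries when the name sorts
after everything else, so it is not always a valid array subscript.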

> +
> +       trace2_region_enter("index", "expand_to_path", istate->repo);
> +
>         /* for now, do the obviously-correct, slow thing */
>         ensure_full_index(istate);
> +
> +       trace2_region_leave("index", "expand_to_path", istate->repo);
> +
> +cleanup:
> +       strbuf_release(&path_as_dir);
> +       in_expand_to_path = 0;
>  }
> --
> gitgitgadget

Looks good otherwise.

* Re: [PATCH 22/27] add: allow operating on a sparse-only index
  2021-01-25 17:42 ` [PATCH 22/27] add: allow operating on a sparse-only index Derrick Stolee via GitGitGadget
@ 2021-02-01 23:08   ` Elijah Newren
  0 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren @ 2021-02-01 23:08 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano, Jeff King, Jonathan Nieder,
	Eric Sunshine, Nguyễn Thái Ngọc, Derrick Stolee,
	Derrick Stolee

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> Replace enough callers to ensure_full_index() to instead call
> expand_to_path() to reduce how often 'git add' expands a sparse index in
> memory (before writing a sparse index again).
>
> One non-obvious case is index_name_pos_also_unmerged() which is only hit
> on the Windows platform (in my tests). Use expand_to_path() instead of
> ensure_full_index().

I read this paragraph as saying that the conversion of
index_name_pos_also_unmerged() was tricky, whereas after reading the
patch it looks like the conversion was trivial and you were perhaps
meaning to say that it is easy to miss that this function also needs a
conversion.  Also, since the second sentence is true of all the
conversions, I'm not sure how much it helps to highlight it when just
talking about this one function.
Both of these are minor quibbles, but if there's a clever way to
reword here that reduces potential confusion, that'd be great.

> Add a test to check that 'git add -A' and 'git add <file>' does not
> expand the index at all, as long as <file> is not within a sparse
> directory.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  builtin/add.c                            |  3 +++
>  dir.c                                    |  8 ++++----
>  read-cache.c                             | 10 +++++-----
>  sparse-index.c                           | 18 ++++++++++++++----
>  t/t1092-sparse-checkout-compatibility.sh | 14 ++++++++++++++
>  5 files changed, 40 insertions(+), 13 deletions(-)
>
> diff --git a/builtin/add.c b/builtin/add.c
> index a825887c503..b73f8d51de6 100644
> --- a/builtin/add.c
> +++ b/builtin/add.c
> @@ -491,6 +491,9 @@ int cmd_add(int argc, const char **argv, const char *prefix)
>         add_new_files = !take_worktree_changes && !refresh_only && !add_renormalize;
>         require_pathspec = !(take_worktree_changes || (0 < addremove_explicit));
>
> +       prepare_repo_settings(the_repository);
> +       the_repository->settings.command_requires_full_index = 0;
> +
>         hold_locked_index(&lock_file, LOCK_DIE_ON_ERROR);
>
>         /*
> diff --git a/dir.c b/dir.c
> index c786fa98d0e..21998c7c4b7 100644
> --- a/dir.c
> +++ b/dir.c
> @@ -18,6 +18,7 @@
>  #include "ewah/ewok.h"
>  #include "fsmonitor.h"
>  #include "submodule-config.h"
> +#include "sparse-index.h"
>
>  /*
>   * Tells read_directory_recursive how a file or directory should be treated.
> @@ -899,9 +900,9 @@ static int read_skip_worktree_file_from_index(struct index_state *istate,
>  {
>         int pos, len;
>
> -       ensure_full_index(istate);
> -
>         len = strlen(path);
> +
> +       expand_to_path(istate, path, len, 0);
>         pos = index_name_pos(istate, path, len);
>         if (pos < 0)
>                 return -1;
> @@ -1707,8 +1708,7 @@ static enum exist_status directory_exists_in_index(struct index_state *istate,
>         if (ignore_case)
>                 return directory_exists_in_index_icase(istate, dirname, len);
>
> -       ensure_full_index(istate);
> -
> +       expand_to_path(istate, dirname, len, 0);
>         pos = index_name_pos(istate, dirname, len);
>         if (pos < 0)
>                 pos = -pos-1;
> diff --git a/read-cache.c b/read-cache.c
> index 78910d8f1b7..8c974829497 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -647,7 +647,7 @@ static int index_name_pos_also_unmerged(struct index_state *istate,
>         int pos;
>         struct cache_entry *ce;
>
> -       ensure_full_index(istate);
> +       expand_to_path(istate, path, namelen, 0);
>
>         pos = index_name_pos(istate, path, namelen);
>         if (pos >= 0)
> @@ -724,8 +724,6 @@ int add_to_index(struct index_state *istate, const char *path, struct stat *st,
>         int hash_flags = HASH_WRITE_OBJECT;
>         struct object_id oid;
>
> -       ensure_full_index(istate);
> -
>         if (flags & ADD_CACHE_RENORMALIZE)
>                 hash_flags |= HASH_RENORMALIZE;
>
> @@ -733,6 +731,8 @@ int add_to_index(struct index_state *istate, const char *path, struct stat *st,
>                 return error(_("%s: can only add regular files, symbolic links or git-directories"), path);
>
>         namelen = strlen(path);
> +       expand_to_path(istate, path, namelen, 0);
> +
>         if (S_ISDIR(st_mode)) {
>                 if (resolve_gitlink_ref(path, "HEAD", &oid) < 0)
>                         return error(_("'%s' does not have a commit checked out"), path);
> @@ -1104,7 +1104,7 @@ static int has_dir_name(struct index_state *istate,
>         size_t len_eq_last;
>         int cmp_last = 0;
>
> -       ensure_full_index(istate);
> +       expand_to_path(istate, ce->name, ce->ce_namelen, 0);
>
>         /*
>          * We are frequently called during an iteration on a sorted
> @@ -1349,7 +1349,7 @@ int add_index_entry(struct index_state *istate, struct cache_entry *ce, int opti
>  {
>         int pos;
>
> -       ensure_full_index(istate);
> +       expand_to_path(istate, ce->name, ce->ce_namelen, 0);
>
>         if (option & ADD_CACHE_JUST_APPEND)
>                 pos = istate->cache_nr;
> diff --git a/sparse-index.c b/sparse-index.c
> index bf8dce9a09b..a201f3b905c 100644
> --- a/sparse-index.c
> +++ b/sparse-index.c
> @@ -286,6 +286,7 @@ static int in_expand_to_path = 0;
>  void expand_to_path(struct index_state *istate,
>                     const char *path, size_t pathlen, int icase)
>  {
> +       struct cache_entry *ce = NULL;
>         struct strbuf path_as_dir = STRBUF_INIT;
>         int pos;
>
> @@ -320,13 +321,22 @@ void expand_to_path(struct index_state *istate,
>
>         if (pos < 0)
>                 pos = -pos - 1;
> +       if (pos < istate->cache_nr)
> +               ce = istate->cache[pos];
>
>         /*
> -        * Even if the path doesn't exist, if the value isn't exactly a
> -        * sparse-directory entry, then there is no need to expand the
> -        * index.
> +        * If we didn't land on a sparse directory, then there is
> +        * nothing to expand.
>          */
> -       if (istate->cache[pos]->ce_mode != CE_MODE_SPARSE_DIRECTORY)
> +       if (ce && !S_ISSPARSEDIR(ce))
> +               goto cleanup;

Seems like these changes to expand_to_path() could and maybe should be
squashed into the commit that introduces expand_to_path()?

> +       /*
> +        * If that sparse directory is not a prefix of the path we
> +        * are looking for, then we don't need to expand.
> +        */
> +       if (ce &&
> +           (ce->ce_namelen >= path_as_dir.len ||
> +            strncmp(ce->name, path_as_dir.buf, ce->ce_namelen)))

Should this also be squashed into the commit that introduces
expand_to_path()?  Why is this check added here?
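
For illustration only: the condition in the hunk above restated as a
standalone predicate. The helper name is hypothetical; the hunk jumps to
cleanup (skipping expansion) exactly when this predicate is false.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical restatement of the check: returns 1 when dir_name
 * (e.g. "deep/") is a strict prefix of path_as_dir (e.g. "deep/deeper/"),
 * 0 otherwise.
 */
static int is_strict_dir_prefix(const char *dir_name, size_t dir_len,
				const char *path_as_dir, size_t path_len)
{
	if (dir_len >= path_len)
		return 0;
	return !strncmp(dir_name, path_as_dir, dir_len);
}
```

Because the sparse directory name ends in '/', "deep/" is a prefix of
"deep/deeper/" but not of "deeper/", so a sibling with a similar name does
not trigger expansion.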

>                 goto cleanup;
>
>         trace2_region_enter("index", "expand_to_path", istate->repo);
> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
> index 09650f0755c..ae594ab880c 100755
> --- a/t/t1092-sparse-checkout-compatibility.sh
> +++ b/t/t1092-sparse-checkout-compatibility.sh
> @@ -390,6 +390,20 @@ test_expect_success 'sparse-index is expanded and converted back' '
>         GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
>                 git -C sparse-index -c core.fsmonitor="" reset --hard &&
>         test_region index convert_to_sparse trace2.txt &&
> +       test_region index ensure_full_index trace2.txt &&
> +
> +       rm trace2.txt &&
> +       echo >>sparse-index/README.md &&
> +       GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
> +               git -C sparse-index -c core.fsmonitor="" add -A &&
> +       test_region index convert_to_sparse trace2.txt &&
> +       test_region index ensure_full_index trace2.txt &&
> +
> +       rm trace2.txt &&
> +       echo >>sparse-index/extra.txt &&
> +       GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
> +               git -C sparse-index -c core.fsmonitor="" add extra.txt &&
> +       test_region index convert_to_sparse trace2.txt &&
>         test_region index ensure_full_index trace2.txt
>  '
>
> --
> gitgitgadget
>

* Re: [PATCH 24/27] dir: use expand_to_path in add_patterns()
  2021-01-25 17:42 ` [PATCH 24/27] dir: use expand_to_path in add_patterns() Derrick Stolee via GitGitGadget
@ 2021-02-01 23:21   ` Elijah Newren
  0 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren @ 2021-02-01 23:21 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano, Jeff King, Jonathan Nieder,
	Eric Sunshine, Nguyễn Thái Ngọc, Derrick Stolee,
	Derrick Stolee

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The add_patterns() method has a way to extract a pattern file from the
> index. If this pattern file is sparse and within a sparse directory
> entry, then we need to expand the index before looking for that entry as
> a file path.

Why?

Correct me if I'm wrong, but I thought the point of add_patterns() was
to read .gitignore entries, so that we can know whether to e.g. have
status report untracked files within some directory or have clean
delete files within a directory.  But if we have a sparse directory
entry in the index, we probably have no such directory in the working
directory.  And if we have no such working directory, getting
.gitignore entries for those directories is a big waste of time.

> For now, convert ensure_full_index() into expand_to_path() to only
> expand this way when absolutely necessary.

Not only should we probably not need to read these files at all,
expand_to_path() still expands a lot more than necessary, right?  If
two directories are sparse -- moduleA and moduleB, and we need
something from under moduleA/, then expand_to_path() will call
ensure_full_index() and fill out every entry under both modules, even
if moduleB is way bigger than moduleA.  Unless I've misunderstood
something, there are multiple ways we're falling short of "only...when
absolutely necessary".
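
To make the concern concrete, here is a toy sketch in C.  It is not Git
code: the struct and both helpers are hypothetical names invented for
illustration, and it ignores nested sparse directories.  It only
contrasts how many entries get filled in when every sparse directory is
expanded versus when only the one containing the requested path is.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical toy model (not a Git data structure): one sparse
 * directory entry and the number of file entries it stands for.
 */
struct toy_sparse_dir {
	const char *name;	/* e.g. "moduleA/", with trailing slash */
	size_t nr_files;	/* files below that directory at HEAD */
};

/* All-or-nothing expansion: every sparse directory gets filled in. */
static size_t toy_expand_all(const struct toy_sparse_dir *dirs, size_t nr)
{
	size_t total = 0;

	for (size_t i = 0; i < nr; i++)
		total += dirs[i].nr_files;
	return total;
}

/* Targeted expansion: only the sparse directory containing 'path'. */
static size_t toy_expand_for_path(const struct toy_sparse_dir *dirs,
				  size_t nr, const char *path)
{
	assert(path);
	for (size_t i = 0; i < nr; i++) {
		size_t len = strlen(dirs[i].name);

		/* 'path' is inside this directory if the name is a prefix */
		if (!strncmp(path, dirs[i].name, len))
			return dirs[i].nr_files;
	}
	return 0;	/* path is not below any sparse directory */
}
```

With moduleA/ at 10 files and moduleB/ at 100,000, the first helper
fills in 100,010 entries while the second fills in 10 for a path under
moduleA/.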


Perhaps both of these things are future work you already had planned;
if so, some tweaks to the commit message may help keep this reader
oriented.  :-)

> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  dir.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/dir.c b/dir.c
> index 21998c7c4b7..7df8d3b1da0 100644
> --- a/dir.c
> +++ b/dir.c
> @@ -1093,7 +1093,7 @@ static int add_patterns(const char *fname, const char *base, int baselen,
>                         int pos;
>
>                         if (istate)
> -                               ensure_full_index(istate);
> +                               expand_to_path(istate, fname, strlen(fname), 0);
>
>                         if (oid_stat->valid &&
>                             !match_stat_data_racy(istate, &oid_stat->stat, &st))
> --
> gitgitgadget

There's also a read_skip_worktree_file_from_index() call earlier in
the same function, which in your big RFC patch you protected with the
ensure_full_index() call already.  Perhaps it should have an
expand_to_path() conversion as well?  But, in the big picture, it
seems like checking if we can avoid reading in that pattern file
(whenever the directory doesn't exist within the working copy) would
be a better first step.
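
A rough sketch of that kind of check, assuming a hypothetical helper
(toy_worktree_dir_exists() is a made-up name, not an existing Git
function) that simply asks the filesystem via POSIX stat():

```c
#include <assert.h>
#include <sys/stat.h>

/*
 * Hypothetical helper, not part of Git: returns 1 if 'path' names an
 * existing directory, 0 otherwise (missing, or not a directory).
 */
static int toy_worktree_dir_exists(const char *path)
{
	struct stat st;

	assert(path);
	return !stat(path, &st) && S_ISDIR(st.st_mode);
}
```

A caller could skip loading the pattern file, and so skip any index
expansion, whenever this returns 0 for the directory that would contain
it.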


* Re: [PATCH 26/27] pathspec: stop calling ensure_full_index
  2021-01-25 17:42 ` [PATCH 26/27] pathspec: stop calling ensure_full_index Derrick Stolee via GitGitGadget
@ 2021-02-01 23:24   ` Elijah Newren
  2021-02-02  2:39     ` Derrick Stolee
  0 siblings, 1 reply; 61+ messages in thread
From: Elijah Newren @ 2021-02-01 23:24 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano, Jeff King, Jonathan Nieder,
	Eric Sunshine, Nguyễn Thái Ngọc, Derrick Stolee,
	Derrick Stolee

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The add_pathspec_matches_against_index() focuses on matching a pathspec
> to file entries in the index. It is possible that this already works
> correctly for its only use: checking if untracked files exist in the
> index.
>
> It is likely that this causes a behavior issue when adding a directory
> that exists at HEAD but is outside the sparse cone. I'm marking this as
> a place to pursue with future tests.

Sounds like you're unsure if this patch is good.  Should it be marked
RFC or something?

> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  pathspec.c | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/pathspec.c b/pathspec.c
> index 9b105855483..61dc771aa02 100644
> --- a/pathspec.c
> +++ b/pathspec.c
> @@ -36,7 +36,6 @@ void add_pathspec_matches_against_index(const struct pathspec *pathspec,
>                         num_unmatched++;
>         if (!num_unmatched)
>                 return;
> -       ensure_full_index(istate);
>         for (i = 0; i < istate->cache_nr; i++) {
>                 const struct cache_entry *ce = istate->cache[i];
>                 ce_path_match(istate, ce, pathspec, seen);
> --
> gitgitgadget
>


* Re: [PATCH 27/27] cache-tree: integrate with sparse directory entries
  2021-01-25 17:42 ` [PATCH 27/27] cache-tree: integrate with sparse directory entries Derrick Stolee via GitGitGadget
@ 2021-02-01 23:54   ` Elijah Newren
  2021-02-02  2:41     ` Derrick Stolee
  0 siblings, 1 reply; 61+ messages in thread
From: Elijah Newren @ 2021-02-01 23:54 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano, Jeff King, Jonathan Nieder,
	Eric Sunshine, Nguyễn Thái Ngọc, Derrick Stolee,
	Derrick Stolee

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The cache-tree extension was previously disabled with sparse indexes.
> However, the cache-tree is an important performance feature for commands
> like 'git status' and 'git add'. Integrate it with sparse directory
> entries.
>
> When writing a sparse index, completely clear and recalculate the cache
> tree. By starting from scratch, the only integration necessary is to
> check if we hit a sparse directory entry and create a leaf of the
> cache-tree that has an entry_count of one and no subtrees.
>
> Once the cache-tree exists within a sparse index, we finally get
> improved performance. I test the sparse index performance using a
> private monorepo with over 2.1 million files at HEAD, but with a
> sparse-checkout definition that has only 68,000 paths in the populated
> cone. The sparse index has about 2,000 sparse directory entries. I
> compare three scenarios:

How many .gitignore entries does this monorepo have?  What percentage
of those are populated for #2 and #3?

Can a testcase be devised that others can repeat?  For example, I
created https://github.com/newren/gvfs-like-git-bomb once upon a time
to create a very small repository with a very large index and a lot of
skip-worktree entries, mostly to test some stuff that someone (Ben
Peart?) mentioned as being slow and verify for myself that it wasn't
Windows specific.

>  1. Use the full index. The index size is ~186 MB.
>  2. Use the sparse index. The index size is ~5.5 MB.
>  3. Use a commit where HEAD matches the populated set. The full index
>     size is ~5.3MB.

I'm not sure I'm understanding the difference between #2 and #3, other
than #3 is smaller.  How did you form #2?  Also, what do you mean by
"full index size" for #3, when it's the smallest?  Isn't that index
the most sparse (or least full)?  Or is it an index for a different
commit entirely that has far fewer files in it?

> The third benchmark is included as a theoretical optimium for a

s/optimium/optimum/

> repository of the same object database.

This I'm also not understanding, but maybe this goes back to not
understanding the difference in how #2 and #3 are constructed.

> First, a clean 'git status' improves from 3.1s to 240ms.
>
> Benchmark #1: full index (git status)
>   Time (mean ± σ):      3.167 s ±  0.036 s    [User: 2.006 s, System: 1.078 s]
>   Range (min … max):    3.100 s …  3.208 s    10 runs
>
> Benchmark #2: sparse index (git status)
>   Time (mean ± σ):     239.5 ms ±   8.1 ms    [User: 189.4 ms, System: 226.8 ms]
>   Range (min … max):   226.0 ms … 251.9 ms    13 runs
>
> Benchmark #3: small tree (git status)
>   Time (mean ± σ):     195.3 ms ±   4.5 ms    [User: 116.5 ms, System: 84.4 ms]
>   Range (min … max):   188.8 ms … 202.8 ms    15 runs

Always nice to see a speedup factor greater than 10.  :-)

>
> The optimimum is still 45ms faster. This is due in part to the 2,000+

s/optimimum/optimum/

> sparse directory entries, but there might be other optimizations to make
> in the sparse-index case. In particular, I find that this performance
> difference disappears when I disable FS Monitor, which is somewhat
> disabled in the sparse-index case, but might still be adding overhead.

The FS monitor wording is unclear to me; it feels like multiple negatives.

> The performance numbers for 'git add .' are much closer to optimal:
>
> Benchmark #1: full index (git add .)
>   Time (mean ± σ):      3.076 s ±  0.022 s    [User: 2.065 s, System: 0.943 s]
>   Range (min … max):    3.044 s …  3.116 s    10 runs
>
> Benchmark #2: sparse index (git add .)
>   Time (mean ± σ):     218.0 ms ±   6.6 ms    [User: 195.7 ms, System: 206.6 ms]
>   Range (min … max):   209.8 ms … 228.2 ms    13 runs
>
> Benchmark #3: small tree (git add .)
>   Time (mean ± σ):     217.6 ms ±   5.4 ms    [User: 131.9 ms, System: 86.7 ms]
>   Range (min … max):   212.1 ms … 228.4 ms    14 runs
>
> In this test, I also used "echo >>README.md" to append a line to the
> README.md file, so the 'git add .' command is doing _something_ other
> than a no-op. Without this edit (and FS Monitor enabled) the small
> tree case again gains about 30ms on the sparse index case.

Meaning the small tree is 30 ms faster than reported here, or 30 ms
slower, or that both sparse index and small tree are faster but the
small tree decreases its time more than the sparse index one does?

Sorry, I don't mean to be dense, I'm just struggling with
understanding words today it seems.  (Also, it seems like there's a
joke in there about me being "dense" in a review of a "sparse"
feature...but I'm not quite coming up with it.)

> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  cache-tree.c   | 18 ++++++++++++++++++
>  sparse-index.c | 10 +++++++++-
>  2 files changed, 27 insertions(+), 1 deletion(-)
>
> diff --git a/cache-tree.c b/cache-tree.c
> index 5f07a39e501..9da6a4394e0 100644
> --- a/cache-tree.c
> +++ b/cache-tree.c
> @@ -256,6 +256,24 @@ static int update_one(struct cache_tree *it,
>
>         *skip_count = 0;
>
> +       /*
> +        * If the first entry of this region is a sparse directory
> +        * entry corresponding exactly to 'base', then this cache_tree
> +        * struct is a "leaf" in the data structure, pointing to the
> +        * tree OID specified in the entry.
> +        */
> +       if (entries > 0) {
> +               const struct cache_entry *ce = cache[0];
> +
> +               if (S_ISSPARSEDIR(ce) &&
> +                   ce->ce_namelen == baselen &&
> +                   !strncmp(ce->name, base, baselen)) {
> +                       it->entry_count = 1;
> +                       oidcpy(&it->oid, &ce->oid);
> +                       return 1;
> +               }
> +       }
> +
>         if (0 <= it->entry_count && has_object_file(&it->oid))
>                 return it->entry_count;
>
> diff --git a/sparse-index.c b/sparse-index.c
> index a201f3b905c..9ea3b321400 100644
> --- a/sparse-index.c
> +++ b/sparse-index.c
> @@ -181,7 +181,11 @@ int convert_to_sparse(struct index_state *istate)
>         istate->cache_nr = convert_to_sparse_rec(istate,
>                                                  0, 0, istate->cache_nr,
>                                                  "", 0, istate->cache_tree);
> -       istate->drop_cache_tree = 1;
> +
> +       /* Clear and recompute the cache-tree */
> +       cache_tree_free(&istate->cache_tree);
> +       cache_tree_update(istate, 0);
> +
>         istate->sparse_index = 1;
>         trace2_region_leave("index", "convert_to_sparse", istate->repo);
>         return 0;
> @@ -278,6 +282,10 @@ void ensure_full_index(struct index_state *istate)
>
>         free(full);
>
> +       /* Clear and recompute the cache-tree */
> +       cache_tree_free(&istate->cache_tree);
> +       cache_tree_update(istate, 0);
> +
>         trace2_region_leave("index", "ensure_full_index", istate->repo);
>  }
>
> --
> gitgitgadget

This is very exciting work!!


* Re: [PATCH 26/27] pathspec: stop calling ensure_full_index
  2021-02-01 23:24   ` Elijah Newren
@ 2021-02-02  2:39     ` Derrick Stolee
  0 siblings, 0 replies; 61+ messages in thread
From: Derrick Stolee @ 2021-02-02  2:39 UTC (permalink / raw)
  To: Elijah Newren, Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano, Jeff King, Jonathan Nieder,
	Eric Sunshine, Nguyễn Thái Ngọc, Derrick Stolee,
	Derrick Stolee

On 2/1/2021 6:24 PM, Elijah Newren wrote:
> On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> The add_pathspec_matches_against_index() focuses on matching a pathspec
>> to file entries in the index. It is possible that this already works
>> correctly for its only use: checking if untracked files exist in the
>> index.
>>
>> It is likely that this causes a behavior issue when adding a directory
>> that exists at HEAD but is outside the sparse cone. I'm marking this as
>> a place to pursue with future tests.
> 
> Sounds like you're unsure if this patch is good.  Should it be marked
> RFC or something?

...isn't the whole series marked as RFC? I only specifically marked the
ensure_full_index() one because I purposefully squashed it.

But in general, everything I'm touching in these areas seems like a
potentially problematic change. So many things are used and re-used
that I'm not sure what is safe or not. More testing is required for
commands to ensure their behavior.

I can enable things like GIT_TEST_SPARSE_INDEX to get other tests
using sparse-checkout to work (but only if they use cone mode). Hence,
I'm relying on what tests I can write to cover the behavior instead of
a robust history of valuable tests.

I hope to gather more confidence as this goes forward. I definitely
work to be confident that I am not making any errors that cause
problems for users who do not enable the sparse-index, but I expect
there to be a long tail of adjustments required as more corner
cases are discovered during the development and testing of the
feature.

The good news is that I have some engaged users who are willing to
test the feature and provide feedback if they hit any snags once
there is a minimal set of functionality.

Thanks,
-Stolee


* Re: [PATCH 27/27] cache-tree: integrate with sparse directory entries
  2021-02-01 23:54   ` Elijah Newren
@ 2021-02-02  2:41     ` Derrick Stolee
  2021-02-02  3:05       ` Elijah Newren
  0 siblings, 1 reply; 61+ messages in thread
From: Derrick Stolee @ 2021-02-02  2:41 UTC (permalink / raw)
  To: Elijah Newren, Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano, Jeff King, Jonathan Nieder,
	Eric Sunshine, Nguyễn Thái Ngọc, Derrick Stolee,
	Derrick Stolee

On 2/1/2021 6:54 PM, Elijah Newren wrote:
> On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>> In this test, I also used "echo >>README.md" to append a line to the
>> README.md file, so the 'git add .' command is doing _something_ other
>> than a no-op. Without this edit (and FS Monitor enabled) the small
>> tree case again gains about 30ms on the sparse index case.
> 
> Meaning the small tree is 30 ms faster than reported here, or 30 ms
> slower, or that both sparse index and small tree are faster but the
> small tree decreases its time more than the sparse index one does?
> 
> Sorry, I don't mean to be dense, I'm just struggling with
> understanding words today it seems.  (Also, it seems like there's a
> joke in there about me being "dense" in a review of a "sparse"
> feature...but I'm not quite coming up with it.)

I don't blame you! This is a lot to digest, and I appreciate you
pushing through to the end of it.

Clearly, I was getting a bit inexact near the end. My excitement
to share this RFC clearly overshadowed my attention to grammatical
detail. I'll go through your feedback more carefully soon and
hopefully clarify these and many other questions.

Thanks,
-Stolee


* Re: [PATCH 27/27] cache-tree: integrate with sparse directory entries
  2021-02-02  2:41     ` Derrick Stolee
@ 2021-02-02  3:05       ` Elijah Newren
  0 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren @ 2021-02-02  3:05 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, Git Mailing List, Junio C Hamano,
	Jeff King, Jonathan Nieder, Eric Sunshine,
	Nguyễn Thái Ngọc, Derrick Stolee, Derrick Stolee

On Mon, Feb 1, 2021 at 6:41 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 2/1/2021 6:54 PM, Elijah Newren wrote:
> > On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
> > <gitgitgadget@gmail.com> wrote:
> >> In this test, I also used "echo >>README.md" to append a line to the
> >> README.md file, so the 'git add .' command is doing _something_ other
> >> than a no-op. Without this edit (and FS Monitor enabled) the small
> >> tree case again gains about 30ms on the sparse index case.
> >
> > Meaning the small tree is 30 ms faster than reported here, or 30 ms
> > slower, or that both sparse index and small tree are faster but the
> > small tree decreases its time more than the sparse index one does?
> >
> > Sorry, I don't mean to be dense, I'm just struggling with
> > understanding words today it seems.  (Also, it seems like there's a
> > joke in there about me being "dense" in a review of a "sparse"
> > feature...but I'm not quite coming up with it.)
>
> I don't blame you! This is a lot to digest, and I appreciate you
> pushing through to the end of it.
>
> Clearly, I was getting a bit inexact near the end. My excitement
> to share this RFC clearly overshadowed my attention to grammatical
> detail. I'll go through your feedback more carefully soon and
> hopefully clarify these and many other questions.

I can't blame you for being excited; this series is awesome.  I've
thought we should do something along this direction for years[1].
Sure I found lots of little nitpicks here and there in your series,
but that's just attempting to help find any issues so it can be made
even better; overall I'm super excited about it.

[1] See "crazy idea" paragraph at
https://lore.kernel.org/git/CABPp-BGir_5xyqEfwytDog0rZDydPHXjuqXCpNKk67dVPXjUjA@mail.gmail.com/


* Re: [PATCH 00/27] [RFC] Sparse Index
  2021-01-25 17:41 [PATCH 00/27] [RFC] Sparse Index Derrick Stolee via GitGitGadget
                   ` (27 preceding siblings ...)
  2021-01-25 20:10 ` [PATCH 00/27] [RFC] Sparse Index Junio C Hamano
@ 2021-02-02  3:11 ` Elijah Newren
  28 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren @ 2021-02-02  3:11 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano, Jeff King, Jonathan Nieder,
	Eric Sunshine, Nguyễn Thái Ngọc, Derrick Stolee

On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> This is based on ds/more-index-cleanups also available as GitGitGadget PR
> #839.
>
> The sparse checkout feature allows users to specify a "populated set" that
> is smaller than the full list of files at HEAD. Files outside the sparse
> checkout definition are not present in the working directory, but are still
> present in the index (and marked with the CE_SKIP_WORKTREE bit).
>
> This means that the working directory has size O(populated), and commands
> like "git status" or "git checkout" operate using an O(populated) number of
> filesystem operations. However, parsing the index still operates on the
> scale of O(HEAD).
>
> This can be particularly jarring if you are merging a small repository with
> a large monorepo for the purpose of simplifying dependency management. Even
> if users have nothing more in their working directory than they had before,
> they suddenly see a significant increase in their "git status" or "git add"
> times. In these cases, simply parsing the index can be a huge portion of the
> command time.
>
> This RFC proposes an update to the index formats to allow "sparse directory
> entries". These entries correspond to directories that are completely
> excluded from the sparse checkout definition. We can detect that a directory
> is excluded when using "cone mode" patterns.
>
> Since having directory entries is a radical departure from the existing
> index format, a new extension "extensions.sparseIndex" is added. Using a
> sparse index should cause incompatible tools to fail because they do not
> understand this extension.
>
> The index is a critical data structure, so making such a drastic change must
> be handled carefully. This RFC does only enough adjustments to demonstrate
> performance improvements for "git status" and "git add." Other commands
> should operate identically to before, since the other commands will expand a
> sparse index into a full index by parsing trees.
>
> WARNING: I'm getting a failure on the FreeBSD build with my sparse-checkout
> tests. I'm not sure what is causing these failures, but I will explore while
> we discuss the possibility of the feature as a whole.
>
> Here is an outline for this RFC:
>
>  * Patches 1-14: create and test the sparse index format. This is just
>    enough to start writing the format, but all Git commands become slower
>    for using it. This is because everything is guarded to expand to a full
>    index before actually operating on the cache entries.
>
>  * Patch 15: This massive patch is actually a bunch of patches squashed
>    together. I have a branch that adds "ensure_full_index()" guards in each
>    necessary file along with some commentary about how the index is being
>    used. This patch is presented here as one big dump because that
>    commentary isn't particularly interesting if the RFC leads to a very
>    different approach.
>
>  * Patches 16-27: These changes make enough code "sparse aware" such that
>    "git status" and "git add" start operating in time O(populated) instead
>    of O(HEAD).
>
> Performance numbers are given in patch 27, but repeated somewhat here. The
> test environment I use has ~2.1 million paths at HEAD, but only 68,000
> populated paths given the sparse-checkout I'm using. The sparse index has
> about 2,000 sparse directory entries.
>
>  1. Use the full index. The index size is ~186 MB.
>  2. Use the sparse index. The index size is ~5.5 MB.
>  3. Use a commit where HEAD matches the populated set. The full index size
>     is ~5.3MB.
>
> The third benchmark is included as a theoretical optimum for a repository of
> the same object database.
>
> First, a clean 'git status' improves from 3.1s to 240ms.
>
> Benchmark #1: full index (git status)
>   Time (mean ± σ):      3.167 s ±  0.036 s    [User: 2.006 s, System: 1.078 s]
>   Range (min … max):    3.100 s …  3.208 s    10 runs
>
> Benchmark #2: sparse index (git status)
>   Time (mean ± σ):     239.5 ms ±   8.1 ms    [User: 189.4 ms, System: 226.8 ms]
>   Range (min … max):   226.0 ms … 251.9 ms    13 runs
>
> Benchmark #3: small tree (git status)
>   Time (mean ± σ):     195.3 ms ±   4.5 ms    [User: 116.5 ms, System: 84.4 ms]
>   Range (min … max):   188.8 ms … 202.8 ms    15 runs
>
> The performance numbers for 'git add .' are much closer to optimal:
>
> Benchmark #1: full index (git add .)
>   Time (mean ± σ):      3.076 s ±  0.022 s    [User: 2.065 s, System: 0.943 s]
>   Range (min … max):    3.044 s …  3.116 s    10 runs
>
> Benchmark #2: sparse index (git add .)
>   Time (mean ± σ):     218.0 ms ±   6.6 ms    [User: 195.7 ms, System: 206.6 ms]
>   Range (min … max):   209.8 ms … 228.2 ms    13 runs
>
> Benchmark #3: small tree (git add .)
>   Time (mean ± σ):     217.6 ms ±   5.4 ms    [User: 131.9 ms, System: 86.7 ms]
>   Range (min … max):   212.1 ms … 228.4 ms    14 runs
>
> I expect that making a sparse index work optimally through the most common
> Git commands will take a year of effort. During this process, I expect to
> add a lot of testing infrastructure around the sparse-checkout feature,
> especially in corner cases. (This RFC focuses on the happy paths of
> operating only within the sparse cone, but that will change in the future.)
>
> If this general approach is acceptable, then I would follow it with a
> sequence of patch submissions that follow this approach:
>
>  1. Basics of the format. (Patches 1-14)
>  2. Add additional guards around index interactions (Patch 15, but split
>     appropriately.)
>  3. Speed up "git status" and "git add" (Patches 16-27)
>
> After those three items that are represented in this RFC, the work starts to
> parallelize a bit. My basic ideas for moving forward from this point are to
> do these basic steps:
>
>  * Add new index API abstractions where appropriate, make them sparse-aware.
>  * Add new tests around sparse-checkout corner cases. Ensure the sparse
>    index works correctly.
>  * For a given builtin, add extra testing for sparse-checkouts then make it
>    sparse-aware.
>
> Here are some specific questions I'm hoping to answer in this RFC period:
>
>  1. Are these sparse directory entries an appropriate way to extend the
>     index format?
>  2. Is extensions.sparseIndex a good way to signal that these entries can
>     exist?
>  3. Is git sparse-checkout init --cone --sparse-index an appropriate way to
>     toggle the format?
>  4. Are there specific areas that I should target to harden the index API
>     before I submit this work?
>  5. Does anyone have a good idea how to test a large portion of the test
>     suite with sparse-index enabled? The problem I see is that most tests
>     don't use sparse-checkout, so the sparse index is identical to the full
>     index. Would it be interesting to enable the test setup to add such
>     "extra" directories during the test setup?
>
> Thanks, -Stolee

Thanks for working on this.  It's very exciting seeing this idea come
alive.  I had lots of little nitpicks and questions and whatnot here
and there, but that almost feels like a diversion.  Overall, I think
you divided up the series in a very logical and easy to follow
fashion, and actually achieved quite a bit already.

I suspect I have partially answered some of your questions above among
all my comments, and left others unanswered or worse, just re-asked
the same question(s) myself.  Feel free to ping again with the next
round and I'll see if I can dodge your questions again...er, I mean,
try to think of something helpful to say.  :-)

>
> Derrick Stolee (27):
>   sparse-index: add guard to ensure full index
>   sparse-index: implement ensure_full_index()
>   t1092: compare sparse-checkout to sparse-index
>   test-read-cache: print cache entries with --table
>   test-tool: read-cache --table --no-stat
>   test-tool: don't force full index
>   unpack-trees: ensure full index
>   sparse-checkout: hold pattern list in index
>   sparse-index: convert from full to sparse
>   submodule: sparse-index should not collapse links
>   unpack-trees: allow sparse directories
>   sparse-index: check index conversion happens
>   sparse-index: create extension for compatibility
>   sparse-checkout: toggle sparse index from builtin
>   [RFC-VERSION] *: ensure full index
>   unpack-trees: make sparse aware
>   dir.c: accept a directory as part of cone-mode patterns
>   status: use sparse-index throughout
>   status: skip sparse-checkout percentage with sparse-index
>   sparse-index: expand_to_path() trivial implementation
>   sparse-index: expand_to_path no-op if path exists
>   add: allow operating on a sparse-only index
>   submodule: die_path_inside_submodule is sparse aware
>   dir: use expand_to_path in add_patterns()
>   fsmonitor: disable if index is sparse
>   pathspec: stop calling ensure_full_index
>   cache-tree: integrate with sparse directory entries
>
>  Documentation/config/extensions.txt      |   7 +
>  Documentation/git-sparse-checkout.txt    |  14 +
>  Makefile                                 |   1 +
>  apply.c                                  |  10 +-
>  blame.c                                  |   7 +-
>  builtin/add.c                            |   3 +
>  builtin/checkout-index.c                 |   5 +-
>  builtin/commit.c                         |   3 +
>  builtin/grep.c                           |   2 +
>  builtin/ls-files.c                       |   9 +-
>  builtin/merge-index.c                    |   2 +
>  builtin/mv.c                             |   2 +
>  builtin/rm.c                             |   2 +
>  builtin/sparse-checkout.c                |  35 ++-
>  builtin/update-index.c                   |   2 +
>  cache-tree.c                             |  21 ++
>  cache.h                                  |  15 +-
>  diff.c                                   |   2 +
>  dir.c                                    |  19 +-
>  dir.h                                    |   2 +-
>  entry.c                                  |   2 +
>  fsmonitor.c                              |  18 +-
>  merge-recursive.c                        |  22 +-
>  name-hash.c                              |  10 +
>  pathspec.c                               |   4 +-
>  pathspec.h                               |   4 +-
>  preload-index.c                          |   2 +
>  read-cache.c                             |  51 +++-
>  repo-settings.c                          |  15 +
>  repository.c                             |  11 +-
>  repository.h                             |   3 +
>  rerere.c                                 |   2 +
>  resolve-undo.c                           |   6 +
>  setup.c                                  |   3 +
>  sha1-name.c                              |   3 +
>  sparse-index.c                           | 360 +++++++++++++++++++++++
>  sparse-index.h                           |  23 ++
>  split-index.c                            |   2 +
>  submodule.c                              |  22 +-
>  submodule.h                              |   6 +-
>  t/helper/test-read-cache.c               |  77 ++++-
>  t/t1092-sparse-checkout-compatibility.sh | 130 +++++++-
>  tree.c                                   |   2 +
>  unpack-trees.c                           |  40 ++-
>  wt-status.c                              |  21 +-
>  wt-status.h                              |   1 +
>  46 files changed, 942 insertions(+), 61 deletions(-)
>  create mode 100644 sparse-index.c
>  create mode 100644 sparse-index.h
>
>
> base-commit: 2271fe7848aa11b30e5313d95d9caebc2937fce5
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-847%2Fderrickstolee%2Fsparse-index%2Frfc-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-847/derrickstolee/sparse-index/rfc-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/847
> --
> gitgitgadget


* Re: [PATCH 16/27] unpack-trees: make sparse aware
  2021-02-01 20:50   ` Elijah Newren
@ 2021-02-09 17:23     ` Derrick Stolee
  0 siblings, 0 replies; 61+ messages in thread
From: Derrick Stolee @ 2021-02-09 17:23 UTC (permalink / raw)
  To: Elijah Newren, Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano, Jeff King, Jonathan Nieder,
	Eric Sunshine, Nguyễn Thái Ngọc, Derrick Stolee,
	Derrick Stolee

On 2/1/2021 3:50 PM, Elijah Newren wrote:
> On Mon, Jan 25, 2021 at 9:42 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>> It is important to be careful about the trailing directory separator
>> that exists in the sparse directory entries but not in the subtree
>> paths.
> 
> The comment makes me wonder if leaving the trailing directory
> separator out would be better, as it'd allow direct comparisons.  Of
> course, you have a better idea of what is easier or harder based on
> this decision.  Is there any chance you have a quick list of the
> places that the code was simplified by this decision and a list of
> places like this one that were made slightly harder?

I'm going through all of your comments and making notes about areas
to fix and clean up before starting a new series for full review.

This question of the trailing slash is important, and I will take
particular care about answering it as I rework the series. However,
the questions in this patch poke at the right places...

>> +       /* remove directory separator if a sparse directory entry */
>> +       if (S_ISSPARSEDIR(ce))
>> +               ce_len--;
> 
> Here's where your comment about trailing separator comes in; makes sense.
> 
>>         return df_name_compare(ce_name, ce_len, S_IFREG, name, namelen, mode);
>>  }
>>
>> @@ -989,6 +999,10 @@ static int compare_entry(const struct cache_entry *ce, const struct traverse_inf
>>         if (cmp)
>>                 return cmp;
>>
>> +       /* If ce is a sparse directory, then allow equality here. */
>> +       if (S_ISSPARSEDIR(ce))
>> +               return 0;
>> +
> 
> This seems surprising to me.  Is there a chance you are comparing
> sparse directory A with sparse directory B and you return with
> equality?  Or sparse_directory A with regular file B?  Do the callers
> still do the right thing?  If your code change here is right, it seems
> like it deserves an extra comment either in the code or the commit
> message.

Sometimes a caller is asking for the first index entry corresponding
to a directory. In these cases, the input could be "A/B/C/". We want
to ensure that a sparse directory entry corresponding exactly to that
directory is correctly matched. If we instead stored "A/B/C" (without
the trailing slash) in the index, this search would become more
complicated (I think; I will justify this more after thinking about it).
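
To make the trailing-slash argument concrete, here is a toy sketch. It
is not git's lookup code: the index is modeled as a sorted array of
path strings, the scan is linear for brevity, and the helper name
first_entry_in_dir() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Toy model: return the position of the first path that starts with
 * the directory prefix `dir` (which must end in '/'), or -1 if no
 * path does.
 */
static int first_entry_in_dir(const char **paths, size_t nr,
			      const char *dir)
{
	size_t dir_len = strlen(dir);
	size_t i;

	assert(dir_len > 0 && dir[dir_len - 1] == '/');

	for (i = 0; i < nr; i++)
		if (!strncmp(paths[i], dir, dir_len))
			return (int)i;
	return -1;
}
```

With a full index, a lookup for "A/B/C/" finds "A/B/C/x". With a
sparse directory entry stored as "A/B/C/", the same prefix test finds
it unchanged. If the entry were stored as "A/B/C", the prefix test
would miss it and the lookup would need a special case.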

At this point, we are just saying "We found the entry with an equal
path value!" rather than falling through to the final check in the
method:

	/*
	 * Even if the beginning compared identically, the ce should
	 * compare as bigger than a directory leading up to it!
	 */
	return ce_namelen(ce) > traverse_path_len(info, tree_entry_len(n));
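
As a rough illustration of the two hunks together, the sketch below
models the comparison with plain strings. It is a simplification under
stated assumptions: struct toy_entry and toy_compare_entry() are
hypothetical, the real code goes through df_name_compare() and
traverse_path_len(), and this does not settle the question of whether
unrelated entries can compare equal.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct cache_entry; not git's real type. */
struct toy_entry {
	const char *name;
	int is_sparse_dir;	/* models S_ISSPARSEDIR(ce) */
};

/*
 * Compare an index entry against a directory path that has no
 * trailing slash (subtree paths do not carry one).
 */
static int toy_compare_entry(const struct toy_entry *ce,
			     const char *path, size_t path_len)
{
	size_t full_len = strlen(ce->name);
	size_t ce_len = full_len;
	size_t min_len;
	int cmp;

	/* Drop the directory separator of a sparse directory entry. */
	if (ce->is_sparse_dir) {
		assert(ce_len > 0 && ce->name[ce_len - 1] == '/');
		ce_len--;
	}

	min_len = ce_len < path_len ? ce_len : path_len;
	cmp = memcmp(ce->name, path, min_len);
	if (cmp)
		return cmp;

	/* A sparse directory is allowed to compare equal here. */
	if (ce->is_sparse_dir)
		return 0;

	/* Otherwise a longer entry sorts after the directory. */
	return full_len > path_len;
}
```

In this model the sparse directory entry "A/B/C/" compares equal to
the path "A/B/C", while a regular entry such as "A/B/C/x" still
compares as bigger.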

>> -               if (traverse_trees_recursive(n, dirmask, mask & ~dirmask,
>> +               if (recurse &&
>> +                   traverse_trees_recursive(n, dirmask, mask & ~dirmask,
>>                                              names, info) < 0)
>>                         return -1;
>>                 return mask;
> 
> The unpack_callback() code has some comparison to a cache-tree, but
> I'd assume that you'd need to update cache-tree.c somewhat to take
> advantage of these sparse directory entries.  Am I wrong, and you just
> get cache-tree.c working with sparse directory entries for free?  Or
> is this something coming in a later patch?

In the RFC, I integrate the cache-tree with the sparse-index at the
very end. I will move that integration to be much earlier in the next
submission, so it becomes part of the format discussion.

Thanks,
-Stolee


Thread overview: 61+ messages
2021-01-25 17:41 [PATCH 00/27] [RFC] Sparse Index Derrick Stolee via GitGitGadget
2021-01-25 17:41 ` [PATCH 01/27] sparse-index: add guard to ensure full index Derrick Stolee via GitGitGadget
2021-01-25 17:41 ` [PATCH 02/27] sparse-index: implement ensure_full_index() Derrick Stolee via GitGitGadget
2021-01-27  3:05   ` Elijah Newren
2021-01-27 13:43     ` Derrick Stolee
2021-01-27 16:38       ` Elijah Newren
2021-01-28  5:25     ` Junio C Hamano
2021-01-25 17:41 ` [PATCH 03/27] t1092: compare sparse-checkout to sparse-index Derrick Stolee via GitGitGadget
2021-01-27  3:08   ` Elijah Newren
2021-01-27 13:30     ` Derrick Stolee
2021-01-27 16:54       ` Elijah Newren
2021-01-25 17:41 ` [PATCH 04/27] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
2021-01-27  3:25   ` Elijah Newren
2021-01-25 17:41 ` [PATCH 05/27] test-tool: read-cache --table --no-stat Derrick Stolee via GitGitGadget
2021-01-25 17:41 ` [PATCH 06/27] test-tool: don't force full index Derrick Stolee via GitGitGadget
2021-01-25 17:41 ` [PATCH 07/27] unpack-trees: ensure " Derrick Stolee via GitGitGadget
2021-01-27  4:43   ` Elijah Newren
2021-01-25 17:41 ` [PATCH 08/27] sparse-checkout: hold pattern list in index Derrick Stolee via GitGitGadget
2021-01-27 17:00   ` Elijah Newren
2021-01-28 13:12     ` Derrick Stolee
2021-01-25 17:41 ` [PATCH 09/27] sparse-index: convert from full to sparse Derrick Stolee via GitGitGadget
2021-01-27 17:30   ` Elijah Newren
2021-01-25 17:41 ` [PATCH 10/27] submodule: sparse-index should not collapse links Derrick Stolee via GitGitGadget
2021-01-25 17:41 ` [PATCH 11/27] unpack-trees: allow sparse directories Derrick Stolee via GitGitGadget
2021-01-27 17:36   ` Elijah Newren
2021-01-25 17:41 ` [PATCH 12/27] sparse-index: check index conversion happens Derrick Stolee via GitGitGadget
2021-01-27 17:46   ` Elijah Newren
2021-01-25 17:41 ` [PATCH 13/27] sparse-index: create extension for compatibility Derrick Stolee via GitGitGadget
2021-01-27 18:03   ` Elijah Newren
2021-01-25 17:42 ` [PATCH 14/27] sparse-checkout: toggle sparse index from builtin Derrick Stolee via GitGitGadget
2021-01-27 18:18   ` Elijah Newren
2021-01-28 15:26     ` Derrick Stolee
2021-01-25 17:42 ` [PATCH 15/27] [RFC-VERSION] *: ensure full index Derrick Stolee via GitGitGadget
2021-02-01 20:22   ` Elijah Newren
2021-02-01 21:10     ` Derrick Stolee
2021-01-25 17:42 ` [PATCH 16/27] unpack-trees: make sparse aware Derrick Stolee via GitGitGadget
2021-02-01 20:50   ` Elijah Newren
2021-02-09 17:23     ` Derrick Stolee
2021-01-25 17:42 ` [PATCH 17/27] dir.c: accept a directory as part of cone-mode patterns Derrick Stolee via GitGitGadget
2021-02-01 22:12   ` Elijah Newren
2021-01-25 17:42 ` [PATCH 18/27] status: use sparse-index throughout Derrick Stolee via GitGitGadget
2021-01-25 17:42 ` [PATCH 19/27] status: skip sparse-checkout percentage with sparse-index Derrick Stolee via GitGitGadget
2021-01-25 17:42 ` [PATCH 20/27] sparse-index: expand_to_path() trivial implementation Derrick Stolee via GitGitGadget
2021-01-25 17:42 ` [PATCH 21/27] sparse-index: expand_to_path no-op if path exists Derrick Stolee via GitGitGadget
2021-02-01 22:34   ` Elijah Newren
2021-01-25 17:42 ` [PATCH 22/27] add: allow operating on a sparse-only index Derrick Stolee via GitGitGadget
2021-02-01 23:08   ` Elijah Newren
2021-01-25 17:42 ` [PATCH 23/27] submodule: die_path_inside_submodule is sparse aware Derrick Stolee via GitGitGadget
2021-01-25 17:42 ` [PATCH 24/27] dir: use expand_to_path in add_patterns() Derrick Stolee via GitGitGadget
2021-02-01 23:21   ` Elijah Newren
2021-01-25 17:42 ` [PATCH 25/27] fsmonitor: disable if index is sparse Derrick Stolee via GitGitGadget
2021-01-25 17:42 ` [PATCH 26/27] pathspec: stop calling ensure_full_index Derrick Stolee via GitGitGadget
2021-02-01 23:24   ` Elijah Newren
2021-02-02  2:39     ` Derrick Stolee
2021-01-25 17:42 ` [PATCH 27/27] cache-tree: integrate with sparse directory entries Derrick Stolee via GitGitGadget
2021-02-01 23:54   ` Elijah Newren
2021-02-02  2:41     ` Derrick Stolee
2021-02-02  3:05       ` Elijah Newren
2021-01-25 20:10 ` [PATCH 00/27] [RFC] Sparse Index Junio C Hamano
2021-01-25 21:18   ` Derrick Stolee
2021-02-02  3:11 ` Elijah Newren
