git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / mirror / code / Atom feed
* [PATCH 00/20] Sparse Index: Design, Format, Tests
@ 2021-02-23 20:14 Derrick Stolee via GitGitGadget
  2021-02-23 20:14 ` [PATCH 01/20] sparse-index: design doc and format update Derrick Stolee via GitGitGadget
                   ` (21 more replies)
  0 siblings, 22 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee

Here is the first full patch series submission coming out of the
sparse-index RFC [1].

[1]
https://lore.kernel.org/git/pull.847.git.1611596533.gitgitgadget@gmail.com/

I won't waste too much space here, because PATCH 1 includes a sizeable
design document that describes the feature, the reasoning behind it, and my
plan for getting this implemented widely throughout the codebase.

There are some new things here that were not in the RFC:

 * Design doc and format updates. (Patch 1)
 * Performance test script. (Patches 2 and 20)

Notably missing in this series from the RFC:

 * The mega-patch inserting ensure_full_index() throughout the codebase.
   That will be a follow-up series to this one.
 * The integrations with git status and git add to demonstrate the improved
   performance. Those will also appear in their own series later.

I plan to keep my latest work in this area in my 'sparse-index/wip' branch
[2]. It includes all of the work from the RFC right now, updated with the
work from this series.

[2] https://github.com/derrickstolee/git/tree/sparse-index/wip

Thanks, -Stolee

Derrick Stolee (20):
  sparse-index: design doc and format update
  t/perf: add performance test for sparse operations
  t1092: clean up script quoting
  sparse-index: add guard to ensure full index
  sparse-index: implement ensure_full_index()
  t1092: compare sparse-checkout to sparse-index
  test-read-cache: print cache entries with --table
  test-tool: don't force full index
  unpack-trees: ensure full index
  sparse-checkout: hold pattern list in index
  sparse-index: convert from full to sparse
  submodule: sparse-index should not collapse links
  unpack-trees: allow sparse directories
  sparse-index: check index conversion happens
  sparse-index: create extension for compatibility
  sparse-checkout: toggle sparse index from builtin
  sparse-checkout: disable sparse-index
  cache-tree: integrate with sparse directory entries
  sparse-index: loose integration with cache_tree_verify()
  p2000: add sparse-index repos

 Documentation/config/extensions.txt      |   7 +
 Documentation/git-sparse-checkout.txt    |  14 ++
 Documentation/technical/index-format.txt |   7 +
 Documentation/technical/sparse-index.txt | 167 +++++++++++++
 Makefile                                 |   1 +
 builtin/sparse-checkout.c                |  44 +++-
 cache-tree.c                             |  40 ++++
 cache.h                                  |  12 +-
 read-cache.c                             |  35 ++-
 repo-settings.c                          |  15 ++
 repository.c                             |  11 +-
 repository.h                             |   3 +
 setup.c                                  |   3 +
 sparse-index.c                           | 290 +++++++++++++++++++++++
 sparse-index.h                           |  11 +
 t/README                                 |   3 +
 t/helper/test-read-cache.c               |  61 ++++-
 t/perf/p2000-sparse-operations.sh        | 104 ++++++++
 t/t1091-sparse-checkout-builtin.sh       |  13 +
 t/t1092-sparse-checkout-compatibility.sh | 136 +++++++++--
 unpack-trees.c                           |  16 +-
 21 files changed, 953 insertions(+), 40 deletions(-)
 create mode 100644 Documentation/technical/sparse-index.txt
 create mode 100644 sparse-index.c
 create mode 100644 sparse-index.h
 create mode 100755 t/perf/p2000-sparse-operations.sh


base-commit: 966e671106b2fd38301e7c344c754fd118d0bb07
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-883%2Fderrickstolee%2Fsparse-index%2Fformat-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-883/derrickstolee/sparse-index/format-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/883
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH 01/20] sparse-index: design doc and format update
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-24  1:13   ` Elijah Newren
  2021-02-23 20:14 ` [PATCH 02/20] t/perf: add performance test for sparse operations Derrick Stolee via GitGitGadget
                   ` (20 subsequent siblings)
  21 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

This begins a long effort to update the index format to allow sparse
directory entries. This should result in a significant improvement to
Git commands when HEAD contains millions of files, but the user has
selected many fewer files to keep in their sparse-checkout definition.

Currently, the index format is only updated in the presence of
extensions.sparseIndex instead of increasing a file format version
number. This is temporary, and index v5 is part of the plan for future
work in this area.

The design document details many of the reasons for embarking on this
work, and also the plan for completing it safely.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/index-format.txt |   7 +
 Documentation/technical/sparse-index.txt | 167 +++++++++++++++++++++++
 2 files changed, 174 insertions(+)
 create mode 100644 Documentation/technical/sparse-index.txt

diff --git a/Documentation/technical/index-format.txt b/Documentation/technical/index-format.txt
index b633482b1bdf..387126582556 100644
--- a/Documentation/technical/index-format.txt
+++ b/Documentation/technical/index-format.txt
@@ -44,6 +44,13 @@ Git index format
   localization, no special casing of directory separator '/'). Entries
   with the same name are sorted by their stage field.
 
+  An index entry typically represents a file. However, if sparse-checkout
+  is enabled in cone mode (`core.sparseCheckoutCone` is enabled) and the
+  `extensions.sparseIndex` extension is enabled, then the index may
+  contain entries for directories outside of the sparse-checkout definition.
+  These entries have mode `0040000`, include the `SKIP_WORKTREE` bit, and
+  the path ends in a directory separator.
+
   32-bit ctime seconds, the last time a file's metadata changed
     this is stat(2) data
 
diff --git a/Documentation/technical/sparse-index.txt b/Documentation/technical/sparse-index.txt
new file mode 100644
index 000000000000..9070836f0655
--- /dev/null
+++ b/Documentation/technical/sparse-index.txt
@@ -0,0 +1,167 @@
+Git Sparse-Index Design Document
+================================
+
+The sparse-checkout feature allows users to focus a working directory on
+a subset of the files at HEAD. The cone mode patterns, enabled by
+`core.sparseCheckoutCone`, allow for very fast pattern matching to
+discover which files at HEAD belong in the sparse-checkout cone.
+
+Three important scale dimensions for a Git worktree are:
+
+* `HEAD`: How many files are present at `HEAD`?
+
+* Populated: How many files are within the sparse-checkout cone.
+
+* Modified: How many files has the user modified in the working directory?
+
+We will use big-O notation -- O(X) -- to denote how expensive certain
+operations are in terms of these dimensions.
+
+These dimensions are ordered by their magnitude: users (typically) modify
+fewer files than are populated, and we can only populate files at `HEAD`.
+These dimensions are also ordered by how expensive they are per item: it
+is expensive to detect a modified file than it is to write one that we
+know must be populated; changing `HEAD` only really requires updating the
+index.
+
+Problems occur if there is an extreme imbalance in these dimensions. For
+example, if `HEAD` contains millions of paths but the populated set has
+only tens of thousands, then commands like `git status` and `git add` can
+be dominated by operations that require O(`HEAD`) operations instead of
+O(Populated). Primarily, the cost is in parsing and rewriting the index,
+which is filled primarily with files at `HEAD` that are marked with the
+`SKIP_WORKTREE` bit.
+
+The sparse-index intends to take these commands that read and modify the
+index from O(`HEAD`) to O(Populated). To do this, we need to modify the
+index format in a significant way: add "sparse directory" entries.
+
+With cone mode patterns, it is possible to detect when an entire
+directory will have its contents outside of the sparse-checkout definition.
+Instead of listing all of the files it contains as individual entries, a
+sparse-index contains an entry with the directory name, referencing the
+object ID of the tree at `HEAD` and marked with the `SKIP_WORKTREE` bit.
+If we need to discover the details for paths within that directory, we
+can parse trees to find that list.
+
+This addition of sparse-directory entries violates expectations about the
+index format and its in-memory data structure. There are many consumers in
+the codebase that expect to iterate through all of the index entries and
+see only files. In addition, they expect to see all files at `HEAD`. One
+way to handle this is to parse trees to replace a sparse-directory entry
+with all of the files within that tree as the index is loaded. However,
+parsing trees is slower than parsing the index format, so that is a slower
+operation than if we left the index alone.
+
+The implementation plan below follows four phases to slowly integrate with
+the sparse-index. The intention is to incrementally update Git commands to
+interact safely with the sparse-index without significant slowdowns. This
+may not always be possible, but the hope is that the primary commands that
+users need in their daily work are dramatically improved.
+
+Phase I: Format and initial speedups
+------------------------------------
+
+During this phase, Git learns to enable the sparse-index and safely parse
+one. Protections are put in place so that every consumer of the in-memory
+data structure can operate with its current assumption of every file at
+`HEAD`.
+
+At first, every index parse will expand the sparse-directory entries into
+the full list of paths at `HEAD`. This will be slower in all cases. The
+only noticable change in behavior will be that the serialized index file
+contains sparse-directory entries.
+
+To start, we use a new repository extension, `extensions.sparseIndex`, to
+allow inserting sparse-directory entries into indexes with file format
+versions 2, 3, and 4. This prevents Git versions that do not understand
+the sparse-index from operating on one, but it also prevents other
+operations that do not use the index at all. A new format, index v5, will
+be introduced that includes sparse-directory entries by default. It might
+also introduce other features that have been considered for improving the
+index, as well.
+
+Next, consumers of the index will be guarded against operating on a
+sparse-index by inserting calls to `ensure_full_index()` or
+`expand_index_to_path()`. After these guards are in place, we can begin
+leaving sparse-directory entries in the in-memory index structure.
+
+Even after inserting these guards, we will keep expanding sparse-indexes
+for most Git commands using the `command_requires_full_index` repository
+setting. This setting will be on by default and disabled one builtin at a
+time until we have sufficient confidence that all of the index operations
+are properly guarded.
+
+To complete this phase, the commands `git status` and `git add` will be
+integrated with the sparse-index so that they operate with O(Populated)
+performance. They will be carefully tested for operations within and
+outside the sparse-checkout definition.
+
+Phase II: Careful integrations
+------------------------------
+
+This phase focuses on ensuring that all index extensions and APIs work
+well with a sparse-index. This requires significant increases to our test
+coverage, especially for operations that interact with the working
+directory outside of the sparse-checkout definition. Some of these
+behaviors may not be the desirable ones, such as some tests already
+marked for failure in `t1092-sparse-checkout-compatibility.sh`.
+
+The index extensions that may require special integrations are:
+
+* FS Monitor
+* Untracked cache
+
+While integrating with these features, we should look for patterns that
+might lead to better APIs for interacting with the index. Coalescing
+common usage patterns into an API call can reduce the number of places
+where sparse-directories need to be handled carefully.
+
+Phase III: Important command speedups
+-------------------------------------
+
+At this point, the patterns for testing and implementing sparse-directory
+logic should be relatively stable. This phase focuses on updating some of
+the most common builtins that use the index to operate as O(Populated).
+Here is a potential list of commands that could be valuable to integrate
+at this point:
+
+* `git commit`
+* `git checkout`
+* `git merge`
+* `git rebase`
+
+Along with `git status` and `git add`, these commands cover the majority
+of users' interactions with the working directory. In addition, we can
+integrate with these commands:
+
+* `git grep`
+* `git rm`
+
+These have been proposed as some whose behavior could change when in a
+repo with a sparse-checkout definition. It would be good to include this
+behavior automatically when using a sparse-index. Some clarity is needed
+to make the behavior switch clear to the user.
+
+This phase is the first where parallel work might be possible without too
+much conflicts between topics.
+
+Phase IV: The long tail
+-----------------------
+
+This last phase is less a "phase" and more "the new normal" after all of
+the previous work.
+
+To start, the `command_requires_full_index` option could be removed in
+favor of expanding only when hitting an API guard.
+
+There are many Git commands that could use special attention to operate as
+O(Populated), while some might be so rare that it is acceptable to leave
+them with additional overhead when a sparse-index is present.
+
+Here are some commands that might be useful to update:
+
+* `git sparse-checkout set`
+* `git am`
+* `git clean`
+* `git stash`
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH 02/20] t/perf: add performance test for sparse operations
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
  2021-02-23 20:14 ` [PATCH 01/20] sparse-index: design doc and format update Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-24  2:30   ` Elijah Newren
  2021-02-23 20:14 ` [PATCH 03/20] t1092: clean up script quoting Derrick Stolee via GitGitGadget
                   ` (19 subsequent siblings)
  21 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Create a test script that takes the default performance test (the Git
codebase) and multiplies it by 256 using four layers of duplicated
trees of width four. This results in nearly one million blob entries in
the index. Then, we can clone this repository with sparse-checkout
patterns that demonstrate four copies of the initial repository. Each
clone will use a different index format or mode so peformance can be
tested across the different options.

Note that the initial repo is stripped of submodules before doing the
copies. This preserves the expected data shape of the sparse index,
because directories containing submodules are not collapsed to a sparse
directory entry.

Run a few Git commands on these clones, especially those that use the
index (status, add, commit).

Here are the results on my Linux machine:

Test
--------------------------------------------------------------
2000.2: git status (full-index-v3)             0.37(0.30+0.09)
2000.3: git status (full-index-v4)             0.39(0.32+0.10)
2000.4: git add -A (full-index-v3)             1.42(1.06+0.20)
2000.5: git add -A (full-index-v4)             1.26(0.98+0.16)
2000.6: git add . (full-index-v3)              1.40(1.04+0.18)
2000.7: git add . (full-index-v4)              1.26(0.98+0.17)
2000.8: git commit -a -m A (full-index-v3)     1.42(1.11+0.16)
2000.9: git commit -a -m A (full-index-v4)     1.33(1.08+0.16)

It is perhaps noteworthy that there is an improvement when using index
version 4. This is because the v3 index uses 108 MiB while the v4
index uses 80 MiB. Since the repeated portions of the directories are
very short (f3/f1/f2, for example) this ratio is less pronounced than in
similarly-sized real repositories.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/perf/p2000-sparse-operations.sh | 87 +++++++++++++++++++++++++++++++
 1 file changed, 87 insertions(+)
 create mode 100755 t/perf/p2000-sparse-operations.sh

diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
new file mode 100755
index 000000000000..52597683376e
--- /dev/null
+++ b/t/perf/p2000-sparse-operations.sh
@@ -0,0 +1,87 @@
+#!/bin/sh
+
+test_description="test performance of Git operations using the index"
+
+. ./perf-lib.sh
+
+test_perf_default_repo
+
+SPARSE_CONE=f2/f4/f1
+
+test_expect_success 'setup repo and indexes' '
+	git reset --hard HEAD &&
+	# Remove submodules from the example repo, because our
+	# duplication of the entire repo creates an unlikly data shape.
+	git config --file .gitmodules --get-regexp "submodule.*.path" >modules &&
+	rm -f .gitmodules &&
+	git add .gitmodules &&
+	for module in $(awk "{print \$2}" modules)
+	do
+		git rm $module || return 1
+	done &&
+	git add . &&
+	git commit -m "remove submodules" &&
+
+	echo bogus >a &&
+	cp a b &&
+	git add a b &&
+	git commit -m "level 0" &&
+	BLOB=$(git rev-parse HEAD:a) &&
+	OLD_COMMIT=$(git rev-parse HEAD) &&
+	OLD_TREE=$(git rev-parse HEAD^{tree}) &&
+
+	for i in $(test_seq 1 4)
+	do
+		cat >in <<-EOF &&
+			100755 blob $BLOB	a
+			040000 tree $OLD_TREE	f1
+			040000 tree $OLD_TREE	f2
+			040000 tree $OLD_TREE	f3
+			040000 tree $OLD_TREE	f4
+		EOF
+		NEW_TREE=$(git mktree <in) &&
+		NEW_COMMIT=$(git commit-tree $NEW_TREE -p $OLD_COMMIT -m "level $i") &&
+		OLD_TREE=$NEW_TREE &&
+		OLD_COMMIT=$NEW_COMMIT || return 1
+	done &&
+
+	git sparse-checkout init --cone &&
+	git branch -f wide $OLD_COMMIT &&
+	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . full-index-v3 &&
+	(
+		cd full-index-v3 &&
+		git sparse-checkout init --cone &&
+		git sparse-checkout set $SPARSE_CONE &&
+		git config index.version 3 &&
+		git update-index --index-version=3
+	) &&
+	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . full-index-v4 &&
+	(
+		cd full-index-v4 &&
+		git sparse-checkout init --cone &&
+		git sparse-checkout set $SPARSE_CONE &&
+		git config index.version 4 &&
+		git update-index --index-version=4
+	)
+'
+
+test_perf_on_all () {
+	command="$@"
+	for repo in full-index-v3 full-index-v4
+	do
+		test_perf "$command ($repo)" "
+			(
+				cd $repo &&
+				echo >>$SPARSE_CONE/a &&
+				$command
+			)
+		"
+	done
+}
+
+test_perf_on_all git status
+test_perf_on_all git add -A
+test_perf_on_all git add .
+test_perf_on_all git commit -a -m A
+
+test_done
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH 03/20] t1092: clean up script quoting
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
  2021-02-23 20:14 ` [PATCH 01/20] sparse-index: design doc and format update Derrick Stolee via GitGitGadget
  2021-02-23 20:14 ` [PATCH 02/20] t/perf: add performance test for sparse operations Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-23 20:14 ` [PATCH 04/20] sparse-index: add guard to ensure full index Derrick Stolee via GitGitGadget
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

This test was introduced in 19a0acc83e4 (t1092: test interesting
sparse-checkout scenarios, 2021-01-23), but these issues with quoting
were not noticed until starting this follow-up series. The old mechanism
would drop quoting such as in

   test_all_match git commit -m "touch README.md"

The above happened to work because README.md is a file in the
repository, so 'git commit -m touch REAMDE.md' would succeed by
accident.

Other cases included quoting for no good reason, so clean that up now.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t1092-sparse-checkout-compatibility.sh | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 8cd3e5a8d227..3725d3997e70 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -96,20 +96,20 @@ init_repos () {
 run_on_sparse () {
 	(
 		cd sparse-checkout &&
-		$* >../sparse-checkout-out 2>../sparse-checkout-err
+		"$@" >../sparse-checkout-out 2>../sparse-checkout-err
 	)
 }
 
 run_on_all () {
 	(
 		cd full-checkout &&
-		$* >../full-checkout-out 2>../full-checkout-err
+		"$@" >../full-checkout-out 2>../full-checkout-err
 	) &&
-	run_on_sparse $*
+	run_on_sparse "$@"
 }
 
 test_all_match () {
-	run_on_all $* &&
+	run_on_all "$@" &&
 	test_cmp full-checkout-out sparse-checkout-out &&
 	test_cmp full-checkout-err sparse-checkout-err
 }
@@ -119,7 +119,7 @@ test_expect_success 'status with options' '
 	test_all_match git status --porcelain=v2 &&
 	test_all_match git status --porcelain=v2 -z -u &&
 	test_all_match git status --porcelain=v2 -uno &&
-	run_on_all "touch README.md" &&
+	run_on_all touch README.md &&
 	test_all_match git status --porcelain=v2 &&
 	test_all_match git status --porcelain=v2 -z -u &&
 	test_all_match git status --porcelain=v2 -uno &&
@@ -135,7 +135,7 @@ test_expect_success 'add, commit, checkout' '
 	write_script edit-contents <<-\EOF &&
 	echo text >>$1
 	EOF
-	run_on_all "../edit-contents README.md" &&
+	run_on_all ../edit-contents README.md &&
 
 	test_all_match git add README.md &&
 	test_all_match git status --porcelain=v2 &&
@@ -144,7 +144,7 @@ test_expect_success 'add, commit, checkout' '
 	test_all_match git checkout HEAD~1 &&
 	test_all_match git checkout - &&
 
-	run_on_all "../edit-contents README.md" &&
+	run_on_all ../edit-contents README.md &&
 
 	test_all_match git add -A &&
 	test_all_match git status --porcelain=v2 &&
@@ -153,7 +153,7 @@ test_expect_success 'add, commit, checkout' '
 	test_all_match git checkout HEAD~1 &&
 	test_all_match git checkout - &&
 
-	run_on_all "../edit-contents deep/newfile" &&
+	run_on_all ../edit-contents deep/newfile &&
 
 	test_all_match git status --porcelain=v2 -uno &&
 	test_all_match git status --porcelain=v2 &&
@@ -186,7 +186,7 @@ test_expect_success 'diff --staged' '
 	write_script edit-contents <<-\EOF &&
 	echo text >>README.md
 	EOF
-	run_on_all "../edit-contents" &&
+	run_on_all ../edit-contents &&
 
 	test_all_match git diff &&
 	test_all_match git diff --staged &&
@@ -280,7 +280,7 @@ test_expect_success 'clean' '
 	echo bogus >>.gitignore &&
 	run_on_all cp ../.gitignore . &&
 	test_all_match git add .gitignore &&
-	test_all_match git commit -m ignore-bogus-files &&
+	test_all_match git commit -m "ignore bogus files" &&
 
 	run_on_sparse mkdir folder1 &&
 	run_on_all touch folder1/bogus &&
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH 04/20] sparse-index: add guard to ensure full index
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
                   ` (2 preceding siblings ...)
  2021-02-23 20:14 ` [PATCH 03/20] t1092: clean up script quoting Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-24  2:44   ` Elijah Newren
  2021-02-23 20:14 ` [PATCH 05/20] sparse-index: implement ensure_full_index() Derrick Stolee via GitGitGadget
                   ` (17 subsequent siblings)
  21 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Upcoming changes will introduce modifications to the index format that
allow sparse directories. It will be useful to have a mechanism for
converting those sparse index files into full indexes by walking the
tree at those sparse directories. Name this method ensure_full_index()
as it will guarantee that the index is fully expanded.

This method is not implemented yet, and instead we focus on the
scaffolding to declare it and call it at the appropriate time.

Add a 'command_requires_full_index' member to struct repo_settings. This
will be an indicator that we need the index in full mode to do certain
index operations. This starts as being true for every command, then we
will set it to false as some commands integrate with sparse indexes.

If 'command_requires_full_index' is true, then we will immediately
expand a sparse index to a full one upon reading from disk. This
suffices for now, but we will want to add more callers to
ensure_full_index() later.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Makefile        |  1 +
 repo-settings.c |  8 ++++++++
 repository.c    | 11 ++++++++++-
 repository.h    |  2 ++
 sparse-index.c  |  8 ++++++++
 sparse-index.h  |  7 +++++++
 6 files changed, 36 insertions(+), 1 deletion(-)
 create mode 100644 sparse-index.c
 create mode 100644 sparse-index.h

diff --git a/Makefile b/Makefile
index 5a239cac20e3..3bf61699238d 100644
--- a/Makefile
+++ b/Makefile
@@ -980,6 +980,7 @@ LIB_OBJS += setup.o
 LIB_OBJS += shallow.o
 LIB_OBJS += sideband.o
 LIB_OBJS += sigchain.o
+LIB_OBJS += sparse-index.o
 LIB_OBJS += split-index.o
 LIB_OBJS += stable-qsort.o
 LIB_OBJS += strbuf.o
diff --git a/repo-settings.c b/repo-settings.c
index f7fff0f5ab83..d63569e4041e 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -77,4 +77,12 @@ void prepare_repo_settings(struct repository *r)
 		UPDATE_DEFAULT_BOOL(r->settings.core_untracked_cache, UNTRACKED_CACHE_KEEP);
 
 	UPDATE_DEFAULT_BOOL(r->settings.fetch_negotiation_algorithm, FETCH_NEGOTIATION_DEFAULT);
+
+	/*
+	 * This setting guards all index reads to require a full index
+	 * over a sparse index. After suitable guards are placed in the
+	 * codebase around uses of the index, this setting will be
+	 * removed.
+	 */
+	r->settings.command_requires_full_index = 1;
 }
diff --git a/repository.c b/repository.c
index c98298acd017..a8acae002f71 100644
--- a/repository.c
+++ b/repository.c
@@ -10,6 +10,7 @@
 #include "object.h"
 #include "lockfile.h"
 #include "submodule-config.h"
+#include "sparse-index.h"
 
 /* The main repository */
 static struct repository the_repo;
@@ -261,6 +262,8 @@ void repo_clear(struct repository *repo)
 
 int repo_read_index(struct repository *repo)
 {
+	int res;
+
 	if (!repo->index)
 		repo->index = xcalloc(1, sizeof(*repo->index));
 
@@ -270,7 +273,13 @@ int repo_read_index(struct repository *repo)
 	else if (repo->index->repo != repo)
 		BUG("repo's index should point back at itself");
 
-	return read_index_from(repo->index, repo->index_file, repo->gitdir);
+	res = read_index_from(repo->index, repo->index_file, repo->gitdir);
+
+	prepare_repo_settings(repo);
+	if (repo->settings.command_requires_full_index)
+		ensure_full_index(repo->index);
+
+	return res;
 }
 
 int repo_hold_locked_index(struct repository *repo,
diff --git a/repository.h b/repository.h
index b385ca3c94b6..e06a23015697 100644
--- a/repository.h
+++ b/repository.h
@@ -41,6 +41,8 @@ struct repo_settings {
 	enum fetch_negotiation_setting fetch_negotiation_algorithm;
 
 	int core_multi_pack_index;
+
+	unsigned command_requires_full_index:1;
 };
 
 struct repository {
diff --git a/sparse-index.c b/sparse-index.c
new file mode 100644
index 000000000000..82183ead563b
--- /dev/null
+++ b/sparse-index.c
@@ -0,0 +1,8 @@
+#include "cache.h"
+#include "repository.h"
+#include "sparse-index.h"
+
+void ensure_full_index(struct index_state *istate)
+{
+	/* intentionally left blank */
+}
diff --git a/sparse-index.h b/sparse-index.h
new file mode 100644
index 000000000000..09a20d036c46
--- /dev/null
+++ b/sparse-index.h
@@ -0,0 +1,7 @@
+#ifndef SPARSE_INDEX_H__
+#define SPARSE_INDEX_H__
+
+struct index_state;
+void ensure_full_index(struct index_state *istate);
+
+#endif
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH 05/20] sparse-index: implement ensure_full_index()
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
                   ` (3 preceding siblings ...)
  2021-02-23 20:14 ` [PATCH 04/20] sparse-index: add guard to ensure full index Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-24  3:20   ` Elijah Newren
  2021-02-23 20:14 ` [PATCH 06/20] t1092: compare sparse-checkout to sparse-index Derrick Stolee via GitGitGadget
                   ` (16 subsequent siblings)
  21 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

We will mark an in-memory index_state as having sparse directory entries
with the sparse_index bit. These currently cannot exist, but we will add
a mechanism for collapsing a full index to a sparse one in a later
change. That will happen at write time, so we must first allow parsing
the format before writing it.

Commands or methods that require a full index in order to operate can
call ensure_full_index() to expand that index in-memory. This requires
parsing trees using that index's repository.

Sparse directory entries have a specific 'ce_mode' value. The macro
S_ISSPARSEDIR(ce->ce_mode) can check if a cache_entry 'ce' has this type.
This ce_mode is not possible with the existing index formats, so we don't
also verify all properties of a sparse-directory entry, which are:

 1. ce->ce_mode == 0040000
 2. ce->flags & CE_SKIP_WORKTREE is true
 3. ce->name[ce->namelen - 1] == '/' (ends in dir separator)
 4. ce->oid references a tree object.

These are all semi-enforced in ensure_full_index() to some extent. Any
deviation will cause a warning at minimum or a failure in the worst
case.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 cache.h        |  7 +++-
 read-cache.c   |  9 +++++
 sparse-index.c | 95 +++++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 109 insertions(+), 2 deletions(-)

diff --git a/cache.h b/cache.h
index d92814961405..1336c8d7435e 100644
--- a/cache.h
+++ b/cache.h
@@ -204,6 +204,8 @@ struct cache_entry {
 #error "CE_EXTENDED_FLAGS out of range"
 #endif
 
+#define S_ISSPARSEDIR(m) ((m) == S_IFDIR)
+
 /* Forward structure decls */
 struct pathspec;
 struct child_process;
@@ -319,7 +321,8 @@ struct index_state {
 		 drop_cache_tree : 1,
 		 updated_workdir : 1,
 		 updated_skipworktree : 1,
-		 fsmonitor_has_run_once : 1;
+		 fsmonitor_has_run_once : 1,
+		 sparse_index : 1;
 	struct hashmap name_hash;
 	struct hashmap dir_hash;
 	struct object_id oid;
@@ -722,6 +725,8 @@ int read_index_from(struct index_state *, const char *path,
 		    const char *gitdir);
 int is_index_unborn(struct index_state *);
 
+void ensure_full_index(struct index_state *istate);
+
 /* For use with `write_locked_index()`. */
 #define COMMIT_LOCK		(1 << 0)
 #define SKIP_IF_UNCHANGED	(1 << 1)
diff --git a/read-cache.c b/read-cache.c
index 29144cf879e7..97dbf2434f30 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -101,6 +101,9 @@ static const char *alternate_index_output;
 
 static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
 {
+	if (S_ISSPARSEDIR(ce->ce_mode))
+		istate->sparse_index = 1;
+
 	istate->cache[nr] = ce;
 	add_name_hash(istate, ce);
 }
@@ -2255,6 +2258,12 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist)
 	trace2_data_intmax("index", the_repository, "read/cache_nr",
 			   istate->cache_nr);
 
+	if (!istate->repo)
+		istate->repo = the_repository;
+	prepare_repo_settings(istate->repo);
+	if (istate->repo->settings.command_requires_full_index)
+		ensure_full_index(istate);
+
 	return istate->cache_nr;
 
 unmap:
diff --git a/sparse-index.c b/sparse-index.c
index 82183ead563b..316cb949b74b 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -1,8 +1,101 @@
 #include "cache.h"
 #include "repository.h"
 #include "sparse-index.h"
+#include "tree.h"
+#include "pathspec.h"
+#include "trace2.h"
+
+static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
+{
+	ALLOC_GROW(istate->cache, nr + 1, istate->cache_alloc);
+
+	istate->cache[nr] = ce;
+	add_name_hash(istate, ce);
+}
+
+static int add_path_to_index(const struct object_id *oid,
+				struct strbuf *base, const char *path,
+				unsigned int mode, int stage, void *context)
+{
+	struct index_state *istate = (struct index_state *)context;
+	struct cache_entry *ce;
+	size_t len = base->len;
+
+	if (S_ISDIR(mode))
+		return READ_TREE_RECURSIVE;
+
+	strbuf_addstr(base, path);
+
+	ce = make_cache_entry(istate, mode, oid, base->buf, 0, 0);
+	ce->ce_flags |= CE_SKIP_WORKTREE;
+	set_index_entry(istate, istate->cache_nr++, ce);
+
+	strbuf_setlen(base, len);
+	return 0;
+}
 
 void ensure_full_index(struct index_state *istate)
 {
-	/* intentionally left blank */
+	int i;
+	struct index_state *full;
+
+	if (!istate || !istate->sparse_index)
+		return;
+
+	if (!istate->repo)
+		istate->repo = the_repository;
+
+	trace2_region_enter("index", "ensure_full_index", istate->repo);
+
+	/* initialize basics of new index */
+	full = xcalloc(1, sizeof(struct index_state));
+	memcpy(full, istate, sizeof(struct index_state));
+
+	/* then change the necessary things */
+	full->sparse_index = 0;
+	full->cache_alloc = (3 * istate->cache_alloc) / 2;
+	full->cache_nr = 0;
+	ALLOC_ARRAY(full->cache, full->cache_alloc);
+
+	for (i = 0; i < istate->cache_nr; i++) {
+		struct cache_entry *ce = istate->cache[i];
+		struct tree *tree;
+		struct pathspec ps;
+
+		if (!S_ISSPARSEDIR(ce->ce_mode)) {
+			set_index_entry(full, full->cache_nr++, ce);
+			continue;
+		}
+		if (!(ce->ce_flags & CE_SKIP_WORKTREE))
+			warning(_("index entry is a directory, but not sparse (%08x)"),
+				ce->ce_flags);
+
+		/* recursively walk into cd->name */
+		tree = lookup_tree(istate->repo, &ce->oid);
+
+		memset(&ps, 0, sizeof(ps));
+		ps.recursive = 1;
+		ps.has_wildcard = 1;
+		ps.max_depth = -1;
+
+		read_tree_recursive(istate->repo, tree,
+				    ce->name, strlen(ce->name),
+				    0, &ps,
+				    add_path_to_index, full);
+
+		/* free directory entries. full entries are re-used */
+		discard_cache_entry(ce);
+	}
+
+	/* Copy back into original index. */
+	memcpy(&istate->name_hash, &full->name_hash, sizeof(full->name_hash));
+	istate->sparse_index = 0;
+	free(istate->cache);
+	istate->cache = full->cache;
+	istate->cache_nr = full->cache_nr;
+	istate->cache_alloc = full->cache_alloc;
+
+	free(full);
+
+	trace2_region_leave("index", "ensure_full_index", istate->repo);
 }
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH 06/20] t1092: compare sparse-checkout to sparse-index
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
                   ` (4 preceding siblings ...)
  2021-02-23 20:14 ` [PATCH 05/20] sparse-index: implement ensure_full_index() Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-25  6:37   ` Elijah Newren
  2021-02-23 20:14 ` [PATCH 07/20] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
                   ` (15 subsequent siblings)
  21 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Add a new 'sparse-index' repo alongside the 'full-checkout' and
'sparse-checkout' repos in t1092-sparse-checkout-compatibility.sh. Also
add run_on_sparse and test_sparse_match helpers. These helpers will be
used when the sparse index is implemented.

Add GIT_TEST_SPARSE_INDEX environment variable to enable the
sparse-index by default. This will be intended to use across the entire
test suite, except that it will only affect cases where the
sparse-checkout feature is enabled.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/README                                 |  3 +++
 t/t1092-sparse-checkout-compatibility.sh | 24 ++++++++++++++++++++----
 2 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/t/README b/t/README
index 593d4a4e270c..b98bc563aab5 100644
--- a/t/README
+++ b/t/README
@@ -439,6 +439,9 @@ and "sha256".
 GIT_TEST_WRITE_REV_INDEX=<boolean>, when true enables the
 'pack.writeReverseIndex' setting.
 
+GIT_TEST_SPARSE_INDEX=<boolean>, when true enables index writes to use the
+sparse-index format by default.
+
 Naming Tests
 ------------
 
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 3725d3997e70..71d6f9e4c014 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -7,6 +7,7 @@ test_description='compare full workdir to sparse workdir'
 test_expect_success 'setup' '
 	git init initial-repo &&
 	(
+		GIT_TEST_SPARSE_INDEX=0 &&
 		cd initial-repo &&
 		echo a >a &&
 		echo "after deep" >e &&
@@ -87,23 +88,32 @@ init_repos () {
 
 	cp -r initial-repo sparse-checkout &&
 	git -C sparse-checkout reset --hard &&
-	git -C sparse-checkout sparse-checkout init --cone &&
+
+	cp -r initial-repo sparse-index &&
+	git -C sparse-index reset --hard &&
 
 	# initialize sparse-checkout definitions
-	git -C sparse-checkout sparse-checkout set deep
+	git -C sparse-checkout sparse-checkout init --cone &&
+	git -C sparse-checkout sparse-checkout set deep &&
+	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout init --cone &&
+	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep
 }
 
 run_on_sparse () {
 	(
 		cd sparse-checkout &&
-		"$@" >../sparse-checkout-out 2>../sparse-checkout-err
+		GIT_TEST_SPARSE_INDEX=0 "$@" >../sparse-checkout-out 2>../sparse-checkout-err
+	) &&
+	(
+		cd sparse-index &&
+		GIT_TEST_SPARSE_INDEX=1 "$@" >../sparse-index-out 2>../sparse-index-err
 	)
 }
 
 run_on_all () {
 	(
 		cd full-checkout &&
-		"$@" >../full-checkout-out 2>../full-checkout-err
+		GIT_TEST_SPARSE_INDEX=0 "$@" >../full-checkout-out 2>../full-checkout-err
 	) &&
 	run_on_sparse "$@"
 }
@@ -114,6 +124,12 @@ test_all_match () {
 	test_cmp full-checkout-err sparse-checkout-err
 }
 
+test_sparse_match () {
+	run_on_sparse $* &&
+	test_cmp sparse-checkout-out sparse-index-out &&
+	test_cmp sparse-checkout-err sparse-index-err
+}
+
 test_expect_success 'status with options' '
 	init_repos &&
 	test_all_match git status --porcelain=v2 &&
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH 07/20] test-read-cache: print cache entries with --table
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
                   ` (5 preceding siblings ...)
  2021-02-23 20:14 ` [PATCH 06/20] t1092: compare sparse-checkout to sparse-index Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-25  7:02   ` Elijah Newren
  2021-02-23 20:14 ` [PATCH 08/20] test-tool: don't force full index Derrick Stolee via GitGitGadget
                   ` (14 subsequent siblings)
  21 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

This table is helpful for discovering data in the index to ensure it is
being written correctly, especially as we build and test the
sparse-index. This table includes an output format similar to 'git
ls-tree', but should not be compared to that directly. The biggest
reasons are that 'git ls-tree' includes a tree entry for every
subdirectory, even those that would not appear as a sparse directory in
a sparse-index. Further, 'git ls-tree' does not use a trailing directory
separator for its tree rows.

This does not print the stat() information for the blobs. That could be
added in a future change with another option. The tests that are added
in the next few changes care only about the object types and IDs.

To make the option parsing slightly more robust, wrap the string
comparisons in a loop adapted from test-dir-iterator.c.

Care must be taken with the final check for the 'cnt' variable. We
continue the expectation that the numerical value is the final argument.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/helper/test-read-cache.c | 50 ++++++++++++++++++++++++++++++--------
 1 file changed, 40 insertions(+), 10 deletions(-)

diff --git a/t/helper/test-read-cache.c b/t/helper/test-read-cache.c
index 244977a29bdf..e4c3492f7d3e 100644
--- a/t/helper/test-read-cache.c
+++ b/t/helper/test-read-cache.c
@@ -2,35 +2,65 @@
 #include "cache.h"
 #include "config.h"
 
+static void print_cache_entry(struct cache_entry *ce)
+{
+	printf("%06o ", ce->ce_mode & 0777777);
+
+	if (S_ISSPARSEDIR(ce->ce_mode))
+		printf("tree ");
+	else if (S_ISGITLINK(ce->ce_mode))
+		printf("commit ");
+	else
+		printf("blob ");
+
+	printf("%s\t%s\n",
+	       oid_to_hex(&ce->oid),
+	       ce->name);
+}
+
+static void print_cache(struct index_state *cache)
+{
+	int i;
+	for (i = 0; i < the_index.cache_nr; i++)
+		print_cache_entry(the_index.cache[i]);
+}
+
 int cmd__read_cache(int argc, const char **argv)
 {
+	struct repository *r = the_repository;
 	int i, cnt = 1;
 	const char *name = NULL;
+	int table = 0;
 
-	if (argc > 1 && skip_prefix(argv[1], "--print-and-refresh=", &name)) {
-		argc--;
-		argv++;
+	for (++argv, --argc; *argv && starts_with(*argv, "--"); ++argv, --argc) {
+		if (skip_prefix(*argv, "--print-and-refresh=", &name))
+			continue;
+		if (!strcmp(*argv, "--table"))
+			table = 1;
 	}
 
-	if (argc == 2)
-		cnt = strtol(argv[1], NULL, 0);
+	if (argc == 1)
+		cnt = strtol(argv[0], NULL, 0);
 	setup_git_directory();
 	git_config(git_default_config, NULL);
+
 	for (i = 0; i < cnt; i++) {
-		read_cache();
+		repo_read_index(r);
 		if (name) {
 			int pos;
 
-			refresh_index(&the_index, REFRESH_QUIET,
+			refresh_index(r->index, REFRESH_QUIET,
 				      NULL, NULL, NULL);
-			pos = index_name_pos(&the_index, name, strlen(name));
+			pos = index_name_pos(r->index, name, strlen(name));
 			if (pos < 0)
 				die("%s not in index", name);
 			printf("%s is%s up to date\n", name,
-			       ce_uptodate(the_index.cache[pos]) ? "" : " not");
+			       ce_uptodate(r->index->cache[pos]) ? "" : " not");
 			write_file(name, "%d\n", i);
 		}
-		discard_cache();
+		if (table)
+			print_cache(r->index);
+		discard_index(r->index);
 	}
 	return 0;
 }
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH 08/20] test-tool: don't force full index
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
                   ` (6 preceding siblings ...)
  2021-02-23 20:14 ` [PATCH 07/20] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-23 20:14 ` [PATCH 09/20] unpack-trees: ensure " Derrick Stolee via GitGitGadget
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

We will use 'test-tool read-cache --table' to check that a sparse
index is written as part of init_repos. Since we will no longer always
expand a sparse index into a full index, add an '--expand' parameter
that adds a call to ensure_full_index() so we can compare a sparse index
directly against a full index, or at least what the in-memory index
looks like when expanded in this way.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/helper/test-read-cache.c               | 13 ++++++++++++-
 t/t1092-sparse-checkout-compatibility.sh |  5 +++++
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/t/helper/test-read-cache.c b/t/helper/test-read-cache.c
index e4c3492f7d3e..4780429dca6b 100644
--- a/t/helper/test-read-cache.c
+++ b/t/helper/test-read-cache.c
@@ -1,6 +1,7 @@
 #include "test-tool.h"
 #include "cache.h"
 #include "config.h"
+#include "sparse-index.h"
 
 static void print_cache_entry(struct cache_entry *ce)
 {
@@ -30,13 +31,19 @@ int cmd__read_cache(int argc, const char **argv)
 	struct repository *r = the_repository;
 	int i, cnt = 1;
 	const char *name = NULL;
-	int table = 0;
+	int table = 0, expand = 0;
+
+	initialize_the_repository();
+	prepare_repo_settings(r);
+	r->settings.command_requires_full_index = 0;
 
 	for (++argv, --argc; *argv && starts_with(*argv, "--"); ++argv, --argc) {
 		if (skip_prefix(*argv, "--print-and-refresh=", &name))
 			continue;
 		if (!strcmp(*argv, "--table"))
 			table = 1;
+		else if (!strcmp(*argv, "--expand"))
+			expand = 1;
 	}
 
 	if (argc == 1)
@@ -46,6 +53,10 @@ int cmd__read_cache(int argc, const char **argv)
 
 	for (i = 0; i < cnt; i++) {
 		repo_read_index(r);
+
+		if (expand)
+			ensure_full_index(r->index);
+
 		if (name) {
 			int pos;
 
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 71d6f9e4c014..4d789fe86b9d 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -130,6 +130,11 @@ test_sparse_match () {
 	test_cmp sparse-checkout-err sparse-index-err
 }
 
+test_expect_success 'expanded in-memory index matches full index' '
+	init_repos &&
+	test_sparse_match test-tool read-cache --expand --table
+'
+
 test_expect_success 'status with options' '
 	init_repos &&
 	test_all_match git status --porcelain=v2 &&
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH 09/20] unpack-trees: ensure full index
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
                   ` (7 preceding siblings ...)
  2021-02-23 20:14 ` [PATCH 08/20] test-tool: don't force full index Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-23 20:14 ` [PATCH 10/20] sparse-checkout: hold pattern list in index Derrick Stolee via GitGitGadget
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The next change will translate full indexes into sparse indexes at write
time. The existing logic provides a way for every sparse index to be
expanded to a full index at read time. However, there are cases where an
index is written and then continues to be used in-memory to perform
further updates.

unpack_trees() is frequently called after such a write. In particular,
commands like 'git reset' do this double-update of the index.

Ensure that we have a full index when entering unpack_trees(), but only
when command_requires_full_index is true. This is always true at the
moment, but we will later relax that after unpack_trees() is updated to
handle sparse directory entries.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 unpack-trees.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/unpack-trees.c b/unpack-trees.c
index f5f668f532d8..4dd99219073a 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -1567,6 +1567,7 @@ static int verify_absent(const struct cache_entry *,
  */
 int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options *o)
 {
+	struct repository *repo = the_repository;
 	int i, ret;
 	static struct cache_entry *dfc;
 	struct pattern_list pl;
@@ -1578,6 +1579,12 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
 	trace_performance_enter();
 	trace2_region_enter("unpack_trees", "unpack_trees", the_repository);
 
+	prepare_repo_settings(repo);
+	if (repo->settings.command_requires_full_index) {
+		ensure_full_index(o->src_index);
+		ensure_full_index(o->dst_index);
+	}
+
 	if (!core_apply_sparse_checkout || !o->update)
 		o->skip_sparse_checkout = 1;
 	if (!o->skip_sparse_checkout && !o->pl) {
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH 10/20] sparse-checkout: hold pattern list in index
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
                   ` (8 preceding siblings ...)
  2021-02-23 20:14 ` [PATCH 09/20] unpack-trees: ensure " Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-25  7:14   ` Elijah Newren
  2021-02-23 20:14 ` [PATCH 11/20] sparse-index: convert from full to sparse Derrick Stolee via GitGitGadget
                   ` (11 subsequent siblings)
  21 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

As we modify the sparse-checkout definition, we perform index operations
on a pattern_list that only exists in-memory. This allows easy backing
out in case the index update fails.

However, if the index write itself cares about the sparse-checkout
pattern set, we need access to that in-memory copy. Place a pointer to
a 'struct pattern_list' in the index so we can access this on-demand.
This will be used in the next change which uses the sparse-checkout
definition to filter out directories that are outsie the sparse cone.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/sparse-checkout.c | 17 ++++++++++-------
 cache.h                   |  2 ++
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
index 2306a9ad98e0..e00b82af727b 100644
--- a/builtin/sparse-checkout.c
+++ b/builtin/sparse-checkout.c
@@ -110,6 +110,8 @@ static int update_working_directory(struct pattern_list *pl)
 	if (is_index_unborn(r->index))
 		return UPDATE_SPARSITY_SUCCESS;
 
+	r->index->sparse_checkout_patterns = pl;
+
 	memset(&o, 0, sizeof(o));
 	o.verbose_update = isatty(2);
 	o.update = 1;
@@ -138,6 +140,7 @@ static int update_working_directory(struct pattern_list *pl)
 	else
 		rollback_lock_file(&lock_file);
 
+	r->index->sparse_checkout_patterns = NULL;
 	return result;
 }
 
@@ -517,19 +520,18 @@ static int modify_pattern_list(int argc, const char **argv, enum modify_type m)
 {
 	int result;
 	int changed_config = 0;
-	struct pattern_list pl;
-	memset(&pl, 0, sizeof(pl));
+	struct pattern_list *pl = xcalloc(1, sizeof(*pl));
 
 	switch (m) {
 	case ADD:
 		if (core_sparse_checkout_cone)
-			add_patterns_cone_mode(argc, argv, &pl);
+			add_patterns_cone_mode(argc, argv, pl);
 		else
-			add_patterns_literal(argc, argv, &pl);
+			add_patterns_literal(argc, argv, pl);
 		break;
 
 	case REPLACE:
-		add_patterns_from_input(&pl, argc, argv);
+		add_patterns_from_input(pl, argc, argv);
 		break;
 	}
 
@@ -539,12 +541,13 @@ static int modify_pattern_list(int argc, const char **argv, enum modify_type m)
 		changed_config = 1;
 	}
 
-	result = write_patterns_and_update(&pl);
+	result = write_patterns_and_update(pl);
 
 	if (result && changed_config)
 		set_config(MODE_NO_PATTERNS);
 
-	clear_pattern_list(&pl);
+	clear_pattern_list(pl);
+	free(pl);
 	return result;
 }
 
diff --git a/cache.h b/cache.h
index 1336c8d7435e..d75b352f38d3 100644
--- a/cache.h
+++ b/cache.h
@@ -307,6 +307,7 @@ static inline unsigned int canon_mode(unsigned int mode)
 struct split_index;
 struct untracked_cache;
 struct progress;
+struct pattern_list;
 
 struct index_state {
 	struct cache_entry **cache;
@@ -332,6 +333,7 @@ struct index_state {
 	struct mem_pool *ce_mem_pool;
 	struct progress *progress;
 	struct repository *repo;
+	struct pattern_list *sparse_checkout_patterns;
 };
 
 /* Name hashing */
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH 11/20] sparse-index: convert from full to sparse
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
                   ` (9 preceding siblings ...)
  2021-02-23 20:14 ` [PATCH 10/20] sparse-checkout: hold pattern list in index Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-25  7:33   ` Elijah Newren
  2021-02-23 20:14 ` [PATCH 12/20] submodule: sparse-index should not collapse links Derrick Stolee via GitGitGadget
                   ` (10 subsequent siblings)
  21 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

If we have a full index, then we can convert it to a sparse index by
replacing directories outside of the sparse cone with sparse directory
entries. The convert_to_sparse() method does this, when the situation is
appropriate.

For now, we avoid converting the index to a sparse index if:

 1. the index is split.
 2. the index is already sparse.
 3. sparse-checkout is disabled.
 4. sparse-checkout does not use cone mode.

Finally, we currently limit the conversion to when the
GIT_TEST_SPARSE_INDEX environment variable is enabled. A mode using Git
config will be added in a later change.

The trickiest thing about this conversion is that we might not be able
to mark a directory as a sparse directory just because it is outside the
sparse cone. There might be unmerged files within that directory, so we
need to look for those. Also, if there is some strange reason why a file
is not marked with CE_SKIP_WORKTREE, then we should give up on
converting that directory. There is still hope that some of its
subdirectories might be able to convert to sparse, so we keep looking
deeper.

The conversion process is assisted by the cache-tree extension. This is
calculated from the full index if it does not already exist. We then
abandon the cache-tree as it no longer applies to the newly-sparse
index. Thus, this cache-tree will be recalculated in every
sparse-full-sparse round-trip until we integrate the cache-tree
extension with the sparse index.

Some Git commands use the index after writing it. For example, 'git add'
will update the index, then write it to disk, then read its entries to
report information. To keep the in-memory index in a full state after
writing, we re-expand it to a full one after the write. This is wasteful
for commands that only write the index and do not read from it again,
but that is only the case until we make those commands "sparse aware."

We can compare the behavior of the sparse-index in
t1092-sparse-checkout-compability.sh by using GIT_TEST_SPARSE_INDEX=1
when operating on the 'sparse-index' repo. We can also compare the two
sparse repos directly, such as comparing their indexes (when expanded to
full in the case of the 'sparse-index' repo). We also verify that the
index is actually populated with sparse directory entries.

The 'checkout and reset (mixed)' test is marked for failure when
comparing a sparse repo to a full repo, but we can compare the two
sparse-checkout cases directly to ensure that we are not changing the
behavior when using a sparse index.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 cache-tree.c                             |   3 +
 cache.h                                  |   2 +
 read-cache.c                             |  26 ++++-
 sparse-index.c                           | 139 +++++++++++++++++++++++
 sparse-index.h                           |   1 +
 t/t1092-sparse-checkout-compatibility.sh |  61 +++++++++-
 6 files changed, 227 insertions(+), 5 deletions(-)

diff --git a/cache-tree.c b/cache-tree.c
index 2fb483d3c083..5f07a39e501e 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -6,6 +6,7 @@
 #include "object-store.h"
 #include "replace-object.h"
 #include "promisor-remote.h"
+#include "sparse-index.h"
 
 #ifndef DEBUG_CACHE_TREE
 #define DEBUG_CACHE_TREE 0
@@ -442,6 +443,8 @@ int cache_tree_update(struct index_state *istate, int flags)
 	if (i)
 		return i;
 
+	ensure_full_index(istate);
+
 	if (!istate->cache_tree)
 		istate->cache_tree = cache_tree();
 
diff --git a/cache.h b/cache.h
index d75b352f38d3..e8b7d3b4fb33 100644
--- a/cache.h
+++ b/cache.h
@@ -251,6 +251,8 @@ static inline unsigned int create_ce_mode(unsigned int mode)
 {
 	if (S_ISLNK(mode))
 		return S_IFLNK;
+	if (mode == S_IFDIR)
+		return S_IFDIR;
 	if (S_ISDIR(mode) || S_ISGITLINK(mode))
 		return S_IFGITLINK;
 	return S_IFREG | ce_permissions(mode);
diff --git a/read-cache.c b/read-cache.c
index 97dbf2434f30..67acbf202f4e 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -25,6 +25,7 @@
 #include "fsmonitor.h"
 #include "thread-utils.h"
 #include "progress.h"
+#include "sparse-index.h"
 
 /* Mask for the name length in ce_flags in the on-disk index */
 
@@ -1002,8 +1003,14 @@ int verify_path(const char *path, unsigned mode)
 
 			c = *path++;
 			if ((c == '.' && !verify_dotfile(path, mode)) ||
-			    is_dir_sep(c) || c == '\0')
+			    is_dir_sep(c))
 				return 0;
+			/*
+			 * allow terminating directory separators for
+			 * sparse directory enries.
+			 */
+			if (c == '\0')
+				return S_ISDIR(mode);
 		} else if (c == '\\' && protect_ntfs) {
 			if (is_ntfs_dotgit(path))
 				return 0;
@@ -3061,6 +3068,14 @@ static int do_write_locked_index(struct index_state *istate, struct lock_file *l
 				 unsigned flags)
 {
 	int ret;
+	int was_full = !istate->sparse_index;
+
+	ret = convert_to_sparse(istate);
+
+	if (ret) {
+		warning(_("failed to convert to a sparse-index"));
+		return ret;
+	}
 
 	/*
 	 * TODO trace2: replace "the_repository" with the actual repo instance
@@ -3072,6 +3087,9 @@ static int do_write_locked_index(struct index_state *istate, struct lock_file *l
 	trace2_region_leave_printf("index", "do_write_index", the_repository,
 				   "%s", get_lock_file_path(lock));
 
+	if (was_full)
+		ensure_full_index(istate);
+
 	if (ret)
 		return ret;
 	if (flags & COMMIT_LOCK)
@@ -3162,9 +3180,10 @@ static int write_shared_index(struct index_state *istate,
 			      struct tempfile **temp)
 {
 	struct split_index *si = istate->split_index;
-	int ret;
+	int ret, was_full = !istate->sparse_index;
 
 	move_cache_to_base_index(istate);
+	convert_to_sparse(istate);
 
 	trace2_region_enter_printf("index", "shared/do_write_index",
 				   the_repository, "%s", get_tempfile_path(*temp));
@@ -3172,6 +3191,9 @@ static int write_shared_index(struct index_state *istate,
 	trace2_region_leave_printf("index", "shared/do_write_index",
 				   the_repository, "%s", get_tempfile_path(*temp));
 
+	if (was_full)
+		ensure_full_index(istate);
+
 	if (ret)
 		return ret;
 	ret = adjust_shared_perm(get_tempfile_path(*temp));
diff --git a/sparse-index.c b/sparse-index.c
index 316cb949b74b..cb1f85635fbc 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -4,6 +4,145 @@
 #include "tree.h"
 #include "pathspec.h"
 #include "trace2.h"
+#include "cache-tree.h"
+#include "config.h"
+#include "dir.h"
+#include "fsmonitor.h"
+
+static struct cache_entry *construct_sparse_dir_entry(
+				struct index_state *istate,
+				const char *sparse_dir,
+				struct cache_tree *tree)
+{
+	struct cache_entry *de;
+
+	de = make_cache_entry(istate, S_IFDIR, &tree->oid, sparse_dir, 0, 0);
+
+	de->ce_flags |= CE_SKIP_WORKTREE;
+	return de;
+}
+
+/*
+ * Returns the number of entries "inserted" into the index.
+ */
+static int convert_to_sparse_rec(struct index_state *istate,
+				 int num_converted,
+				 int start, int end,
+				 const char *ct_path, size_t ct_pathlen,
+				 struct cache_tree *ct)
+{
+	int i, can_convert = 1;
+	int start_converted = num_converted;
+	enum pattern_match_result match;
+	int dtype;
+	struct strbuf child_path = STRBUF_INIT;
+	struct pattern_list *pl = istate->sparse_checkout_patterns;
+
+	/*
+	 * Is the current path outside of the sparse cone?
+	 * Then check if the region can be replaced by a sparse
+	 * directory entry (everything is sparse and merged).
+	 */
+	match = path_matches_pattern_list(ct_path, ct_pathlen,
+					  NULL, &dtype, pl, istate);
+	if (match != NOT_MATCHED)
+		can_convert = 0;
+
+	for (i = start; can_convert && i < end; i++) {
+		struct cache_entry *ce = istate->cache[i];
+
+		if (ce_stage(ce) ||
+		    !(ce->ce_flags & CE_SKIP_WORKTREE))
+			can_convert = 0;
+	}
+
+	if (can_convert) {
+		struct cache_entry *se;
+		se = construct_sparse_dir_entry(istate, ct_path, ct);
+
+		istate->cache[num_converted++] = se;
+		return 1;
+	}
+
+	for (i = start; i < end; ) {
+		int count, span, pos = -1;
+		const char *base, *slash;
+		struct cache_entry *ce = istate->cache[i];
+
+		/*
+		 * Detect if this is a normal entry oustide of any subtree
+		 * entry.
+		 */
+		base = ce->name + ct_pathlen;
+		slash = strchr(base, '/');
+
+		if (slash)
+			pos = cache_tree_subtree_pos(ct, base, slash - base);
+
+		if (pos < 0) {
+			istate->cache[num_converted++] = ce;
+			i++;
+			continue;
+		}
+
+		strbuf_setlen(&child_path, 0);
+		strbuf_add(&child_path, ce->name, slash - ce->name + 1);
+
+		span = ct->down[pos]->cache_tree->entry_count;
+		count = convert_to_sparse_rec(istate,
+					      num_converted, i, i + span,
+					      child_path.buf, child_path.len,
+					      ct->down[pos]->cache_tree);
+		num_converted += count;
+		i += span;
+	}
+
+	strbuf_release(&child_path);
+	return num_converted - start_converted;
+}
+
+int convert_to_sparse(struct index_state *istate)
+{
+	if (istate->split_index || istate->sparse_index ||
+	    !core_apply_sparse_checkout || !core_sparse_checkout_cone)
+		return 0;
+
+	/*
+	 * For now, only create a sparse index with the
+	 * GIT_TEST_SPARSE_INDEX environment variable. We will relax
+	 * this once we have a proper way to opt-in (and later still,
+	 * opt-out).
+	 */
+	if (!git_env_bool("GIT_TEST_SPARSE_INDEX", 0))
+		return 0;
+
+	if (!istate->sparse_checkout_patterns) {
+		istate->sparse_checkout_patterns = xcalloc(1, sizeof(struct pattern_list));
+		if (get_sparse_checkout_patterns(istate->sparse_checkout_patterns) < 0)
+			return 0;
+	}
+
+	if (!istate->sparse_checkout_patterns->use_cone_patterns) {
+		warning(_("attempting to use sparse-index without cone mode"));
+		return -1;
+	}
+
+	if (cache_tree_update(istate, 0)) {
+		warning(_("unable to update cache-tree, staying full"));
+		return -1;
+	}
+
+	remove_fsmonitor(istate);
+
+	trace2_region_enter("index", "convert_to_sparse", istate->repo);
+	istate->cache_nr = convert_to_sparse_rec(istate,
+						 0, 0, istate->cache_nr,
+						 "", 0, istate->cache_tree);
+	istate->drop_cache_tree = 1;
+	istate->sparse_index = 1;
+	trace2_region_leave("index", "convert_to_sparse", istate->repo);
+	return 0;
+}
 
 static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
 {
diff --git a/sparse-index.h b/sparse-index.h
index 09a20d036c46..64380e121d80 100644
--- a/sparse-index.h
+++ b/sparse-index.h
@@ -3,5 +3,6 @@
 
 struct index_state;
 void ensure_full_index(struct index_state *istate);
+int convert_to_sparse(struct index_state *istate);
 
 #endif
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 4d789fe86b9d..ca87033d30b0 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -2,6 +2,9 @@
 
 test_description='compare full workdir to sparse workdir'
 
+GIT_TEST_CHECK_CACHE_TREE=0
+GIT_TEST_SPLIT_INDEX=0
+
 . ./test-lib.sh
 
 test_expect_success 'setup' '
@@ -121,15 +124,49 @@ run_on_all () {
 test_all_match () {
 	run_on_all "$@" &&
 	test_cmp full-checkout-out sparse-checkout-out &&
-	test_cmp full-checkout-err sparse-checkout-err
+	test_cmp full-checkout-out sparse-index-out &&
+	test_cmp full-checkout-err sparse-checkout-err &&
+	test_cmp full-checkout-err sparse-index-err
 }
 
 test_sparse_match () {
-	run_on_sparse $* &&
+	run_on_sparse "$@" &&
 	test_cmp sparse-checkout-out sparse-index-out &&
 	test_cmp sparse-checkout-err sparse-index-err
 }
 
+test_expect_success 'sparse-index contents' '
+	init_repos &&
+
+	test-tool -C sparse-index read-cache --table >cache &&
+	for dir in folder1 folder2 x
+	do
+		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
+		grep "040000 tree $TREE	$dir/" cache \
+			|| return 1
+	done &&
+
+	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set folder1 &&
+
+	test-tool -C sparse-index read-cache --table >cache &&
+	for dir in deep folder2 x
+	do
+		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
+		grep "040000 tree $TREE	$dir/" cache \
+			|| return 1
+	done &&
+
+	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep/deeper1 &&
+
+	test-tool -C sparse-index read-cache --table >cache &&
+	for dir in deep/deeper2 folder1 folder2 x
+	do
+		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
+		grep "040000 tree $TREE	$dir/" cache \
+			|| return 1
+	done
+'
+
 test_expect_success 'expanded in-memory index matches full index' '
 	init_repos &&
 	test_sparse_match test-tool read-cache --expand --table
@@ -137,6 +174,7 @@ test_expect_success 'expanded in-memory index matches full index' '
 
 test_expect_success 'status with options' '
 	init_repos &&
+	test_sparse_match ls &&
 	test_all_match git status --porcelain=v2 &&
 	test_all_match git status --porcelain=v2 -z -u &&
 	test_all_match git status --porcelain=v2 -uno &&
@@ -273,6 +311,17 @@ test_expect_failure 'checkout and reset (mixed)' '
 	test_all_match git reset update-folder2
 '
 
+# Ensure that sparse-index behaves identically to
+# sparse-checkout with a full index.
+test_expect_success 'checkout and reset (mixed) [sparse]' '
+	init_repos &&
+
+	test_sparse_match git checkout -b reset-test update-deep &&
+	test_sparse_match git reset deepest &&
+	test_sparse_match git reset update-folder1 &&
+	test_sparse_match git reset update-folder2
+'
+
 test_expect_success 'merge' '
 	init_repos &&
 
@@ -309,14 +358,20 @@ test_expect_success 'clean' '
 	test_all_match git status --porcelain=v2 &&
 	test_all_match git clean -f &&
 	test_all_match git status --porcelain=v2 &&
+	test_sparse_match ls &&
+	test_sparse_match ls folder1 &&
 
 	test_all_match git clean -xf &&
 	test_all_match git status --porcelain=v2 &&
+	test_sparse_match ls &&
+	test_sparse_match ls folder1 &&
 
 	test_all_match git clean -xdf &&
 	test_all_match git status --porcelain=v2 &&
+	test_sparse_match ls &&
+	test_sparse_match ls folder1 &&
 
-	test_path_is_dir sparse-checkout/folder1
+	test_sparse_match test_path_is_dir folder1
 '
 
 test_done
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH 12/20] submodule: sparse-index should not collapse links
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
                   ` (10 preceding siblings ...)
  2021-02-23 20:14 ` [PATCH 11/20] sparse-index: convert from full to sparse Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-23 20:14 ` [PATCH 13/20] unpack-trees: allow sparse directories Derrick Stolee via GitGitGadget
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

A submodule is stored as a "Git link" that actually points to a commit
within a submodule. Submodules are populated or not depending on
submodule configuration, not sparse-checkout. To ensure that the
sparse-index feature integrates correctly with submodules, we should not
collapse a directory if there is a Git link within its range.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 sparse-index.c                           |  1 +
 t/t1092-sparse-checkout-compatibility.sh | 17 +++++++++++++++++
 2 files changed, 18 insertions(+)

diff --git a/sparse-index.c b/sparse-index.c
index cb1f85635fbc..14029fafc750 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -52,6 +52,7 @@ static int convert_to_sparse_rec(struct index_state *istate,
 		struct cache_entry *ce = istate->cache[i];
 
 		if (ce_stage(ce) ||
+		    S_ISGITLINK(ce->ce_mode) ||
 		    !(ce->ce_flags & CE_SKIP_WORKTREE))
 			can_convert = 0;
 	}
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index ca87033d30b0..b38fab6455d9 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -374,4 +374,21 @@ test_expect_success 'clean' '
 	test_sparse_match test_path_is_dir folder1
 '
 
+test_expect_success 'submodule handling' '
+	init_repos &&
+
+	test_all_match mkdir modules &&
+	test_all_match touch modules/a &&
+	test_all_match git add modules &&
+	test_all_match git commit -m "add modules directory" &&
+
+	run_on_all git submodule add "$(pwd)/initial-repo" modules/sub &&
+	test_all_match git commit -m "add submodule" &&
+
+	# having a submodule prevents "modules" from collapse
+	test-tool -C sparse-index read-cache --table >cache &&
+	grep "100644 blob .*	modules/a" cache &&
+	grep "160000 commit $(git -C initial-repo rev-parse HEAD)	modules/sub" cache
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH 13/20] unpack-trees: allow sparse directories
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
                   ` (11 preceding siblings ...)
  2021-02-23 20:14 ` [PATCH 12/20] submodule: sparse-index should not collapse links Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-25  7:40   ` Elijah Newren
  2021-02-23 20:14 ` [PATCH 14/20] sparse-index: check index conversion happens Derrick Stolee via GitGitGadget
                   ` (8 subsequent siblings)
  21 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The index_pos_by_traverse_info() currently throws a BUG() when a
directory entry exists exactly in the index. We need to consider that it
is possible to have a directory in a sparse index as long as that entry
is itself marked with the skip-worktree bit.

The negation of the 'pos' variable must be conditioned to only when it
starts as negative. This is identical behavior as before when the index
is full.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 unpack-trees.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/unpack-trees.c b/unpack-trees.c
index 4dd99219073a..b324eec2a5d1 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -746,9 +746,12 @@ static int index_pos_by_traverse_info(struct name_entry *names,
 	strbuf_make_traverse_path(&name, info, names->path, names->pathlen);
 	strbuf_addch(&name, '/');
 	pos = index_name_pos(o->src_index, name.buf, name.len);
-	if (pos >= 0)
-		BUG("This is a directory and should not exist in index");
-	pos = -pos - 1;
+	if (pos >= 0) {
+		if (!o->src_index->sparse_index ||
+		    !(o->src_index->cache[pos]->ce_flags & CE_SKIP_WORKTREE))
+			BUG("This is a directory and should not exist in index");
+	} else
+		pos = -pos - 1;
 	if (pos >= o->src_index->cache_nr ||
 	    !starts_with(o->src_index->cache[pos]->name, name.buf) ||
 	    (pos > 0 && starts_with(o->src_index->cache[pos-1]->name, name.buf)))
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH 14/20] sparse-index: check index conversion happens
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
                   ` (12 preceding siblings ...)
  2021-02-23 20:14 ` [PATCH 13/20] unpack-trees: allow sparse directories Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-23 20:14 ` [PATCH 15/20] sparse-index: create extension for compatibility Derrick Stolee via GitGitGadget
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Add a test case that uses test_region to ensure that we are truly
expanding a sparse index to a full one, then converting back to sparse
when writing the index. As we integrate more Git commands with the
sparse index, we will convert these commands to check that we do _not_
convert the sparse index to a full index and instead stay sparse the
entire time.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t1092-sparse-checkout-compatibility.sh | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index b38fab6455d9..bfc9e28ef0e1 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -391,4 +391,22 @@ test_expect_success 'submodule handling' '
 	grep "160000 commit $(git -C initial-repo rev-parse HEAD)	modules/sub" cache
 '
 
+test_expect_success 'sparse-index is expanded and converted back' '
+	init_repos &&
+
+	(
+		GIT_TEST_SPARSE_INDEX=1 &&
+		export GIT_TEST_SPARSE_INDEX &&
+		GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
+			git -C sparse-index -c core.fsmonitor="" reset --hard &&
+		test_region index convert_to_sparse trace2.txt &&
+		test_region index ensure_full_index trace2.txt &&
+
+		rm trace2.txt &&
+		GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
+			git -C sparse-index -c core.fsmonitor="" status -uno &&
+		test_region index ensure_full_index trace2.txt
+	)
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH 15/20] sparse-index: create extension for compatibility
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
                   ` (13 preceding siblings ...)
  2021-02-23 20:14 ` [PATCH 14/20] sparse-index: check index conversion happens Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-25  7:45   ` Elijah Newren
  2021-02-23 20:14 ` [PATCH 16/20] sparse-checkout: toggle sparse index from builtin Derrick Stolee via GitGitGadget
                   ` (6 subsequent siblings)
  21 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Previously, we enabled the sparse index format only using
GIT_TEST_SPARSE_INDEX=1. This is not a feasible direction for users to
actually select this mode. Further, sparse directory entries are not
understood by the index formats as advertised.

We _could_ add a new index version that explicitly adds these
capabilities, but there are nuances to index formats 2, 3, and 4 that
are still valuable to select as options. For now, create a repo
extension, "extensions.sparseIndex", that specifies that the tool
reading this repository must understand sparse directory entries.

This change only encodes the extension and enables it when
GIT_TEST_SPARSE_INDEX=1. Later, we will add a more user-friendly CLI
mechanism.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/config/extensions.txt |  7 ++++++
 cache.h                             |  1 +
 repo-settings.c                     |  7 ++++++
 repository.h                        |  3 ++-
 setup.c                             |  3 +++
 sparse-index.c                      | 38 +++++++++++++++++++++++++----
 6 files changed, 53 insertions(+), 6 deletions(-)

diff --git a/Documentation/config/extensions.txt b/Documentation/config/extensions.txt
index 4e23d73cdcad..5c86b3648732 100644
--- a/Documentation/config/extensions.txt
+++ b/Documentation/config/extensions.txt
@@ -6,3 +6,10 @@ extensions.objectFormat::
 Note that this setting should only be set by linkgit:git-init[1] or
 linkgit:git-clone[1].  Trying to change it after initialization will not
 work and will produce hard-to-diagnose issues.
+
+extensions.sparseIndex::
+	When combined with `core.sparseCheckout=true` and
+	`core.sparseCheckoutCone=true`, the index may contain entries
+	corresponding to directories outside of the sparse-checkout
+	definition. Versions of Git that do not understand this extension
+	do not expect directory entries in the index.
diff --git a/cache.h b/cache.h
index e8b7d3b4fb33..eea61fba7568 100644
--- a/cache.h
+++ b/cache.h
@@ -1053,6 +1053,7 @@ struct repository_format {
 	int worktree_config;
 	int is_bare;
 	int hash_algo;
+	int sparse_index;
 	char *work_tree;
 	struct string_list unknown_extensions;
 	struct string_list v1_only_extensions;
diff --git a/repo-settings.c b/repo-settings.c
index d63569e4041e..9677d50f9238 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -85,4 +85,11 @@ void prepare_repo_settings(struct repository *r)
 	 * removed.
 	 */
 	r->settings.command_requires_full_index = 1;
+
+	/*
+	 * Initialize this as off.
+	 */
+	r->settings.sparse_index = 0;
+	if (!repo_config_get_bool(r, "extensions.sparseindex", &value) && value)
+		r->settings.sparse_index = 1;
 }
diff --git a/repository.h b/repository.h
index e06a23015697..a45f7520fd9e 100644
--- a/repository.h
+++ b/repository.h
@@ -42,7 +42,8 @@ struct repo_settings {
 
 	int core_multi_pack_index;
 
-	unsigned command_requires_full_index:1;
+	unsigned command_requires_full_index:1,
+		 sparse_index:1;
 };
 
 struct repository {
diff --git a/setup.c b/setup.c
index c04cd25a30df..cd8394564613 100644
--- a/setup.c
+++ b/setup.c
@@ -500,6 +500,9 @@ static enum extension_result handle_extension(const char *var,
 			return error("invalid value for 'extensions.objectformat'");
 		data->hash_algo = format;
 		return EXTENSION_OK;
+	} else if (!strcmp(ext, "sparseindex")) {
+		data->sparse_index = 1;
+		return EXTENSION_OK;
 	}
 	return EXTENSION_UNKNOWN;
 }
diff --git a/sparse-index.c b/sparse-index.c
index 14029fafc750..97b0d0c57857 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -102,19 +102,47 @@ static int convert_to_sparse_rec(struct index_state *istate,
 	return num_converted - start_converted;
 }
 
+static int enable_sparse_index(struct repository *repo)
+{
+	const char *config_path = repo_git_path(repo, "config.worktree");
+
+	if (upgrade_repository_format(1) < 0) {
+		warning(_("unable to upgrade repository format to enable sparse-index"));
+		return -1;
+	}
+	git_config_set_in_file_gently(config_path,
+				      "extensions.sparseIndex",
+				      "true");
+
+	prepare_repo_settings(repo);
+	repo->settings.sparse_index = 1;
+	return 0;
+}
+
 int convert_to_sparse(struct index_state *istate)
 {
 	if (istate->split_index || istate->sparse_index ||
 	    !core_apply_sparse_checkout || !core_sparse_checkout_cone)
 		return 0;
 
+	if (!istate->repo)
+		istate->repo = the_repository;
+
+	/*
+	 * The GIT_TEST_SPARSE_INDEX environment variable triggers the
+	 * extensions.sparseIndex config variable to be on.
+	 */
+	if (git_env_bool("GIT_TEST_SPARSE_INDEX", 0)) {
+		int err = enable_sparse_index(istate->repo);
+		if (err < 0)
+			return err;
+	}
+
 	/*
-	 * For now, only create a sparse index with the
-	 * GIT_TEST_SPARSE_INDEX environment variable. We will relax
-	 * this once we have a proper way to opt-in (and later still,
-	 * opt-out).
+	 * Only convert to sparse if extensions.sparseIndex is set.
 	 */
-	if (!git_env_bool("GIT_TEST_SPARSE_INDEX", 0))
+	prepare_repo_settings(istate->repo);
+	if (!istate->repo->settings.sparse_index)
 		return 0;
 
 	if (!istate->sparse_checkout_patterns) {
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH 16/20] sparse-checkout: toggle sparse index from builtin
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
                   ` (14 preceding siblings ...)
  2021-02-23 20:14 ` [PATCH 15/20] sparse-index: create extension for compatibility Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-24 19:11   ` Martin Ågren
  2021-02-23 20:14 ` [PATCH 17/20] sparse-checkout: disable sparse-index Derrick Stolee via GitGitGadget
                   ` (5 subsequent siblings)
  21 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The sparse index extension is used to signal that index writes should be
in sparse mode. This was only updated using GIT_TEST_SPARSE_INDEX=1.

Add a '--[no-]sparse-index' option to 'git sparse-checkout init' that
specifies if the sparse index should be used. It also updates the index
to use the correct format, either way. Add a warning in the
documentation that the use of a repository extension might reduce
compatibility with third-party tools. 'git sparse-checkout init' already
sets extension.worktreeConfig, which places most sparse-checkout users
outside of the scope of most third-party tools.

Update t1092-sparse-checkout-compatibility.sh to use this CLI instead of
GIT_TEST_SPARSE_INDEX=1.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-sparse-checkout.txt    | 14 +++++++++
 builtin/sparse-checkout.c                | 17 ++++++++++-
 sparse-index.c                           | 37 +++++++++++++++--------
 sparse-index.h                           |  3 ++
 t/t1092-sparse-checkout-compatibility.sh | 38 +++++++++++-------------
 5 files changed, 76 insertions(+), 33 deletions(-)

diff --git a/Documentation/git-sparse-checkout.txt b/Documentation/git-sparse-checkout.txt
index a0eeaeb02ee3..b51b8450cfd9 100644
--- a/Documentation/git-sparse-checkout.txt
+++ b/Documentation/git-sparse-checkout.txt
@@ -45,6 +45,20 @@ To avoid interfering with other worktrees, it first enables the
 When `--cone` is provided, the `core.sparseCheckoutCone` setting is
 also set, allowing for better performance with a limited set of
 patterns (see 'CONE PATTERN SET' below).
++
+Use the `--[no-]sparse-index` option to toggle the use of the sparse
+index format. This reduces the size of the index to be more closely
+aligned with your sparse-checkout definition. This can have significant
+performance advantages for commands such as `git status` or `git add`.
+This feature is still experimental. Some commands might be slower with
+a sparse index until they are properly integrated with the feature.
++
+**WARNING:** Using a sparse index requires modifying the index in a way
+that is not completely understood by other tools. Enabling sparse index
+enables the `extensions.spareseIndex` config value, which might cause
+other tools to stop working with your repository. If you have trouble with
+this compatibility, then run `git sparse-checkout sparse-index disable` to
+remove this config and rewrite your index to not be sparse.
 
 'set'::
 	Write a set of patterns to the sparse-checkout file, as given as
diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
index e00b82af727b..ca63e2c64e95 100644
--- a/builtin/sparse-checkout.c
+++ b/builtin/sparse-checkout.c
@@ -14,6 +14,7 @@
 #include "unpack-trees.h"
 #include "wt-status.h"
 #include "quote.h"
+#include "sparse-index.h"
 
 static const char *empty_base = "";
 
@@ -283,12 +284,13 @@ static int set_config(enum sparse_checkout_mode mode)
 }
 
 static char const * const builtin_sparse_checkout_init_usage[] = {
-	N_("git sparse-checkout init [--cone]"),
+	N_("git sparse-checkout init [--cone] [--[no-]sparse-index]"),
 	NULL
 };
 
 static struct sparse_checkout_init_opts {
 	int cone_mode;
+	int sparse_index;
 } init_opts;
 
 static int sparse_checkout_init(int argc, const char **argv)
@@ -303,11 +305,15 @@ static int sparse_checkout_init(int argc, const char **argv)
 	static struct option builtin_sparse_checkout_init_options[] = {
 		OPT_BOOL(0, "cone", &init_opts.cone_mode,
 			 N_("initialize the sparse-checkout in cone mode")),
+		OPT_BOOL(0, "sparse-index", &init_opts.sparse_index,
+			 N_("toggle the use of a sparse index")),
 		OPT_END(),
 	};
 
 	repo_read_index(the_repository);
 
+	init_opts.sparse_index = -1;
+
 	argc = parse_options(argc, argv, NULL,
 			     builtin_sparse_checkout_init_options,
 			     builtin_sparse_checkout_init_usage, 0);
@@ -326,6 +332,15 @@ static int sparse_checkout_init(int argc, const char **argv)
 	sparse_filename = get_sparse_checkout_filename();
 	res = add_patterns_from_file_to_list(sparse_filename, "", 0, &pl, NULL);
 
+	if (init_opts.sparse_index >= 0) {
+		if (set_sparse_index_config(the_repository, init_opts.sparse_index) < 0)
+			die(_("failed to modify sparse-index config"));
+
+		/* force an index rewrite */
+		repo_read_index(the_repository);
+		the_repository->index->updated_workdir = 1;
+	}
+
 	/* If we already have a sparse-checkout file, use it. */
 	if (res >= 0) {
 		free(sparse_filename);
diff --git a/sparse-index.c b/sparse-index.c
index 97b0d0c57857..a991c5331e9e 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -104,23 +104,37 @@ static int convert_to_sparse_rec(struct index_state *istate,
 
 static int enable_sparse_index(struct repository *repo)
 {
-	const char *config_path = repo_git_path(repo, "config.worktree");
+	int res;
 
 	if (upgrade_repository_format(1) < 0) {
 		warning(_("unable to upgrade repository format to enable sparse-index"));
 		return -1;
 	}
-	git_config_set_in_file_gently(config_path,
-				      "extensions.sparseIndex",
-				      "true");
+	res = git_config_set_gently("extensions.sparseindex", "true");
 
 	prepare_repo_settings(repo);
 	repo->settings.sparse_index = 1;
-	return 0;
+	return res;
+}
+
+int set_sparse_index_config(struct repository *repo, int enable)
+{
+	int res;
+
+	if (enable)
+		return enable_sparse_index(repo);
+
+	/* Don't downgrade repository format, just remove the extension. */
+	res = git_config_set_gently("extensions.sparseindex", NULL);
+
+	prepare_repo_settings(repo);
+	repo->settings.sparse_index = 0;
+	return res;
 }
 
 int convert_to_sparse(struct index_state *istate)
 {
+	int test_env;
 	if (istate->split_index || istate->sparse_index ||
 	    !core_apply_sparse_checkout || !core_sparse_checkout_cone)
 		return 0;
@@ -129,14 +143,13 @@ int convert_to_sparse(struct index_state *istate)
 		istate->repo = the_repository;
 
 	/*
-	 * The GIT_TEST_SPARSE_INDEX environment variable triggers the
-	 * extensions.sparseIndex config variable to be on.
+	 * If GIT_TEST_SPARSE_INDEX=1, then trigger extensions.sparseIndex
+	 * to be fully enabled. If GIT_TEST_SPARSE_INDEX=0 (set explicitly),
+	 * then purposefully disable the setting.
 	 */
-	if (git_env_bool("GIT_TEST_SPARSE_INDEX", 0)) {
-		int err = enable_sparse_index(istate->repo);
-		if (err < 0)
-			return err;
-	}
+	test_env = git_env_bool("GIT_TEST_SPARSE_INDEX", -1);
+	if (test_env >= 0)
+		set_sparse_index_config(istate->repo, test_env);
 
 	/*
 	 * Only convert to sparse if extensions.sparseIndex is set.
diff --git a/sparse-index.h b/sparse-index.h
index 64380e121d80..39dcc859735e 100644
--- a/sparse-index.h
+++ b/sparse-index.h
@@ -5,4 +5,7 @@ struct index_state;
 void ensure_full_index(struct index_state *istate);
 int convert_to_sparse(struct index_state *istate);
 
+struct repository;
+int set_sparse_index_config(struct repository *repo, int enable);
+
 #endif
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index bfc9e28ef0e1..9c2bc4d25f66 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -4,6 +4,7 @@ test_description='compare full workdir to sparse workdir'
 
 GIT_TEST_CHECK_CACHE_TREE=0
 GIT_TEST_SPLIT_INDEX=0
+GIT_TEST_SPARSE_INDEX=
 
 . ./test-lib.sh
 
@@ -98,25 +99,26 @@ init_repos () {
 	# initialize sparse-checkout definitions
 	git -C sparse-checkout sparse-checkout init --cone &&
 	git -C sparse-checkout sparse-checkout set deep &&
-	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout init --cone &&
-	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep
+	git -C sparse-index sparse-checkout init --cone --sparse-index &&
+	test_cmp_config -C sparse-index true extensions.sparseindex &&
+	git -C sparse-index sparse-checkout set deep
 }
 
 run_on_sparse () {
 	(
 		cd sparse-checkout &&
-		GIT_TEST_SPARSE_INDEX=0 "$@" >../sparse-checkout-out 2>../sparse-checkout-err
+		"$@" >../sparse-checkout-out 2>../sparse-checkout-err
 	) &&
 	(
 		cd sparse-index &&
-		GIT_TEST_SPARSE_INDEX=1 "$@" >../sparse-index-out 2>../sparse-index-err
+		"$@" >../sparse-index-out 2>../sparse-index-err
 	)
 }
 
 run_on_all () {
 	(
 		cd full-checkout &&
-		GIT_TEST_SPARSE_INDEX=0 "$@" >../full-checkout-out 2>../full-checkout-err
+		"$@" >../full-checkout-out 2>../full-checkout-err
 	) &&
 	run_on_sparse "$@"
 }
@@ -146,7 +148,7 @@ test_expect_success 'sparse-index contents' '
 			|| return 1
 	done &&
 
-	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set folder1 &&
+	git -C sparse-index sparse-checkout set folder1 &&
 
 	test-tool -C sparse-index read-cache --table >cache &&
 	for dir in deep folder2 x
@@ -156,7 +158,7 @@ test_expect_success 'sparse-index contents' '
 			|| return 1
 	done &&
 
-	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep/deeper1 &&
+	git -C sparse-index sparse-checkout set deep/deeper1 &&
 
 	test-tool -C sparse-index read-cache --table >cache &&
 	for dir in deep/deeper2 folder1 folder2 x
@@ -394,19 +396,15 @@ test_expect_success 'submodule handling' '
 test_expect_success 'sparse-index is expanded and converted back' '
 	init_repos &&
 
-	(
-		GIT_TEST_SPARSE_INDEX=1 &&
-		export GIT_TEST_SPARSE_INDEX &&
-		GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
-			git -C sparse-index -c core.fsmonitor="" reset --hard &&
-		test_region index convert_to_sparse trace2.txt &&
-		test_region index ensure_full_index trace2.txt &&
-
-		rm trace2.txt &&
-		GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
-			git -C sparse-index -c core.fsmonitor="" status -uno &&
-		test_region index ensure_full_index trace2.txt
-	)
+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
+		git -C sparse-index -c core.fsmonitor="" reset --hard &&
+	test_region index convert_to_sparse trace2.txt &&
+	test_region index ensure_full_index trace2.txt &&
+
+	rm trace2.txt &&
+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
+		git -C sparse-index -c core.fsmonitor="" status -uno &&
+	test_region index ensure_full_index trace2.txt
 '
 
 test_done
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH 17/20] sparse-checkout: disable sparse-index
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
                   ` (15 preceding siblings ...)
  2021-02-23 20:14 ` [PATCH 16/20] sparse-checkout: toggle sparse index from builtin Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-27 12:32   ` SZEDER Gábor
  2021-02-23 20:14 ` [PATCH 18/20] cache-tree: integrate with sparse directory entries Derrick Stolee via GitGitGadget
                   ` (4 subsequent siblings)
  21 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

We use 'git sparse-checkout init --cone --sparse-index' to toggle the
sparse-index feature. It makes sense to also disable it when running
'git sparse-checkout disable'. This is particularly important because it
removes the extensions.sparseIndex config option, allowing other tools
to use this Git repository again.

This does mean that 'git sparse-checkout init' will not re-enable the
sparse-index feature, even if it was previously enabled.

While testing this feature, I noticed that the sparse-index was not
being written on the first run, but by a second. This was caught by the
call to 'test-tool read-cache --table'. This requires adjusting some
assignments to core_apply_sparse_checkout and pl.use_cone_patterns in
the sparse_checkout_init() logic.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/sparse-checkout.c          | 10 +++++++++-
 t/t1091-sparse-checkout-builtin.sh | 13 +++++++++++++
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
index ca63e2c64e95..585343fa1972 100644
--- a/builtin/sparse-checkout.c
+++ b/builtin/sparse-checkout.c
@@ -280,6 +280,9 @@ static int set_config(enum sparse_checkout_mode mode)
 				      "core.sparseCheckoutCone",
 				      mode == MODE_CONE_PATTERNS ? "true" : NULL);
 
+	if (mode == MODE_NO_PATTERNS)
+		set_sparse_index_config(the_repository, 0);
+
 	return 0;
 }
 
@@ -341,10 +344,11 @@ static int sparse_checkout_init(int argc, const char **argv)
 		the_repository->index->updated_workdir = 1;
 	}
 
+	core_apply_sparse_checkout = 1;
+
 	/* If we already have a sparse-checkout file, use it. */
 	if (res >= 0) {
 		free(sparse_filename);
-		core_apply_sparse_checkout = 1;
 		return update_working_directory(NULL);
 	}
 
@@ -366,6 +370,7 @@ static int sparse_checkout_init(int argc, const char **argv)
 	add_pattern(strbuf_detach(&pattern, NULL), empty_base, 0, &pl, 0);
 	strbuf_addstr(&pattern, "!/*/");
 	add_pattern(strbuf_detach(&pattern, NULL), empty_base, 0, &pl, 0);
+	pl.use_cone_patterns = init_opts.cone_mode;
 
 	return write_patterns_and_update(&pl);
 }
@@ -632,6 +637,9 @@ static int sparse_checkout_disable(int argc, const char **argv)
 	strbuf_addstr(&match_all, "/*");
 	add_pattern(strbuf_detach(&match_all, NULL), empty_base, 0, &pl, 0);
 
+	prepare_repo_settings(the_repository);
+	the_repository->settings.sparse_index = 0;
+
 	if (update_working_directory(&pl))
 		die(_("error while refreshing working directory"));
 
diff --git a/t/t1091-sparse-checkout-builtin.sh b/t/t1091-sparse-checkout-builtin.sh
index fc64e9ed99f4..ff1ad570a255 100755
--- a/t/t1091-sparse-checkout-builtin.sh
+++ b/t/t1091-sparse-checkout-builtin.sh
@@ -205,6 +205,19 @@ test_expect_success 'sparse-checkout disable' '
 	check_files repo a deep folder1 folder2
 '
 
+test_expect_success 'sparse-index enabled and disabled' '
+	git -C repo sparse-checkout init --cone --sparse-index &&
+	test_cmp_config -C repo true extensions.sparseIndex &&
+	test-tool -C repo read-cache --table >cache &&
+	grep " tree " cache &&
+
+	git -C repo sparse-checkout disable &&
+	test-tool -C repo read-cache --table >cache &&
+	! grep " tree " cache &&
+	git -C repo config --list >config &&
+	! grep extensions.sparseindex config
+'
+
 test_expect_success 'cone mode: init and set' '
 	git -C repo sparse-checkout init --cone &&
 	git -C repo config --list >config &&
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH 18/20] cache-tree: integrate with sparse directory entries
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
                   ` (16 preceding siblings ...)
  2021-02-23 20:14 ` [PATCH 17/20] sparse-checkout: disable sparse-index Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-23 20:14 ` [PATCH 19/20] sparse-index: loose integration with cache_tree_verify() Derrick Stolee via GitGitGadget
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The cache-tree extension was previously disabled with sparse indexes.
However, the cache-tree is an important performance feature for commands
like 'git status' and 'git add'. Integrate it with sparse directory
entries.

When writing a sparse index, completely clear and recalculate the cache
tree. By starting from scratch, the only integration necessary is to
check if we hit a sparse directory entry and create a leaf of the
cache-tree that has an entry_count of one and no subtrees.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 cache-tree.c   | 18 ++++++++++++++++++
 sparse-index.c | 10 +++++++++-
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/cache-tree.c b/cache-tree.c
index 5f07a39e501e..950a9615db8f 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -256,6 +256,24 @@ static int update_one(struct cache_tree *it,
 
 	*skip_count = 0;
 
+	/*
+	 * If the first entry of this region is a sparse directory
+	 * entry corresponding exactly to 'base', then this cache_tree
+	 * struct is a "leaf" in the data structure, pointing to the
+	 * tree OID specified in the entry.
+	 */
+	if (entries > 0) {
+		const struct cache_entry *ce = cache[0];
+
+		if (S_ISSPARSEDIR(ce->ce_mode) &&
+		    ce->ce_namelen == baselen &&
+		    !strncmp(ce->name, base, baselen)) {
+			it->entry_count = 1;
+			oidcpy(&it->oid, &ce->oid);
+			return 1;
+		}
+	}
+
 	if (0 <= it->entry_count && has_object_file(&it->oid))
 		return it->entry_count;
 
diff --git a/sparse-index.c b/sparse-index.c
index a991c5331e9e..e541f251b37a 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -180,7 +180,11 @@ int convert_to_sparse(struct index_state *istate)
 	istate->cache_nr = convert_to_sparse_rec(istate,
 						 0, 0, istate->cache_nr,
 						 "", 0, istate->cache_tree);
-	istate->drop_cache_tree = 1;
+
+	/* Clear and recompute the cache-tree */
+	cache_tree_free(&istate->cache_tree);
+	cache_tree_update(istate, 0);
+
 	istate->sparse_index = 1;
 	trace2_region_leave("index", "convert_to_sparse", istate->repo);
 	return 0;
@@ -278,5 +282,9 @@ void ensure_full_index(struct index_state *istate)
 
 	free(full);
 
+	/* Clear and recompute the cache-tree */
+	cache_tree_free(&istate->cache_tree);
+	cache_tree_update(istate, 0);
+
 	trace2_region_leave("index", "ensure_full_index", istate->repo);
 }
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH 19/20] sparse-index: loose integration with cache_tree_verify()
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
                   ` (17 preceding siblings ...)
  2021-02-23 20:14 ` [PATCH 18/20] cache-tree: integrate with sparse directory entries Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-23 20:14 ` [PATCH 20/20] p2000: add sparse-index repos Derrick Stolee via GitGitGadget
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The cache_tree_verify() method is run when GIT_TEST_CHECK_CACHE_TREE
is enabled, which it is by default in the test suite. The logic must
be adjusted for the presence of these directory entries.

For now, leave the test as a simple check for whether the directory
entry is sparse. Do not go any further until needed.

This allows us to re-enable GIT_TEST_CHECK_CACHE_TREE in
t1092-sparse-checkout-compatibility.sh. Further,
p2000-sparse-operations.sh uses the test suite and hence this is enabled
for all tests. We need to integrate with it before we run our
performance tests with a sparse-index.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 cache-tree.c                             | 19 +++++++++++++++++++
 t/t1092-sparse-checkout-compatibility.sh |  1 -
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/cache-tree.c b/cache-tree.c
index 950a9615db8f..11bf1fcae6e1 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -808,6 +808,19 @@ int cache_tree_matches_traversal(struct cache_tree *root,
 	return 0;
 }
 
+static void verify_one_sparse(struct repository *r,
+			      struct index_state *istate,
+			      struct cache_tree *it,
+			      struct strbuf *path,
+			      int pos)
+{
+	struct cache_entry *ce = istate->cache[pos];
+
+	if (!S_ISSPARSEDIR(ce->ce_mode))
+		BUG("directory '%s' is present in index, but not sparse",
+		    path->buf);
+}
+
 static void verify_one(struct repository *r,
 		       struct index_state *istate,
 		       struct cache_tree *it,
@@ -830,6 +843,12 @@ static void verify_one(struct repository *r,
 
 	if (path->len) {
 		pos = index_name_pos(istate, path->buf, path->len);
+
+		if (pos >= 0) {
+			verify_one_sparse(r, istate, it, path, pos);
+			return;
+		}
+
 		pos = -pos - 1;
 	} else {
 		pos = 0;
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 9c2bc4d25f66..c2624176c2e0 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -2,7 +2,6 @@
 
 test_description='compare full workdir to sparse workdir'
 
-GIT_TEST_CHECK_CACHE_TREE=0
 GIT_TEST_SPLIT_INDEX=0
 GIT_TEST_SPARSE_INDEX=
 
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH 20/20] p2000: add sparse-index repos
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
                   ` (18 preceding siblings ...)
  2021-02-23 20:14 ` [PATCH 19/20] sparse-index: loose integration with cache_tree_verify() Derrick Stolee via GitGitGadget
@ 2021-02-23 20:14 ` Derrick Stolee via GitGitGadget
  2021-02-23 23:49 ` [PATCH 00/20] Sparse Index: Design, Format, Tests Elijah Newren
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-23 20:14 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

p2000-sparse-operations.sh compares different Git commands in
repositories with many files at HEAD but using sparse-checkout to focus
on a small portion of those files.

Add extra copies of the repository that use the sparse-index format so
we can track how that affects the performance of different commands.

At this point in time, the sparse-index is 100% overhead from the CPU
front, and this is measurable in these tests:

Test
---------------------------------------------------------------
2000.2: git status (full-index-v3)              0.59(0.51+0.12)
2000.3: git status (full-index-v4)              0.59(0.52+0.11)
2000.4: git status (sparse-index-v3)            1.40(1.32+0.12)
2000.5: git status (sparse-index-v4)            1.41(1.36+0.08)
2000.6: git add -A (full-index-v3)              2.32(1.97+0.19)
2000.7: git add -A (full-index-v4)              2.17(1.92+0.14)
2000.8: git add -A (sparse-index-v3)            2.31(2.21+0.15)
2000.9: git add -A (sparse-index-v4)            2.30(2.20+0.13)
2000.10: git add . (full-index-v3)              2.39(2.02+0.20)
2000.11: git add . (full-index-v4)              2.20(1.94+0.16)
2000.12: git add . (sparse-index-v3)            2.36(2.27+0.12)
2000.13: git add . (sparse-index-v4)            2.33(2.21+0.16)
2000.14: git commit -a -m A (full-index-v3)     2.47(2.12+0.20)
2000.15: git commit -a -m A (full-index-v4)     2.26(2.00+0.17)
2000.16: git commit -a -m A (sparse-index-v3)   3.01(2.92+0.16)
2000.17: git commit -a -m A (sparse-index-v4)   3.01(2.94+0.15)

Note that there is very little difference between the v3 and v4 index
formats when the sparse-index is enabled. This is primarily due to the
fact that the relative file sizes are the same, and the command time is
mostly taken up by parsing tree objects to expand the sparse index into
a full one.

With the current file layout, the index file sizes are given by this
table:

       |  full index | sparse index |
       +-------------+--------------+
    v3 |     108 MiB |      1.6 MiB |
    v4 |      80 MiB |      1.2 MiB |

Future updates will improve the performance of Git commands when the
index is sparse.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/perf/p2000-sparse-operations.sh | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
index 52597683376e..f9c7f3c6e27e 100755
--- a/t/perf/p2000-sparse-operations.sh
+++ b/t/perf/p2000-sparse-operations.sh
@@ -62,12 +62,29 @@ test_expect_success 'setup repo and indexes' '
 		git sparse-checkout set $SPARSE_CONE &&
 		git config index.version 4 &&
 		git update-index --index-version=4
+	) &&
+	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . sparse-index-v3 &&
+	(
+		cd sparse-index-v3 &&
+		git sparse-checkout init --cone --sparse-index &&
+		git sparse-checkout set $SPARSE_CONE &&
+		git config index.version 3 &&
+		git update-index --index-version=3
+	) &&
+	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . sparse-index-v4 &&
+	(
+		cd sparse-index-v4 &&
+		git sparse-checkout init --cone --sparse-index &&
+		git sparse-checkout set $SPARSE_CONE &&
+		git config index.version 4 &&
+		git update-index --index-version=4
 	)
 '
 
 test_perf_on_all () {
 	command="$@"
-	for repo in full-index-v3 full-index-v4
+	for repo in full-index-v3 full-index-v4 \
+		    sparse-index-v3 sparse-index-v4
 	do
 		test_perf "$command ($repo)" "
 			(
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 00/20] Sparse Index: Design, Format, Tests
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
                   ` (19 preceding siblings ...)
  2021-02-23 20:14 ` [PATCH 20/20] p2000: add sparse-index repos Derrick Stolee via GitGitGadget
@ 2021-02-23 23:49 ` Elijah Newren
  2021-02-26 21:28   ` Elijah Newren
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  21 siblings, 1 reply; 203+ messages in thread
From: Elijah Newren @ 2021-02-23 23:49 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Derrick Stolee

On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> Here is the first full patch series submission coming out of the
> sparse-index RFC [1].

Wahoo!  I'll be reading these over the next few days.

> [1]
> https://lore.kernel.org/git/pull.847.git.1611596533.gitgitgadget@gmail.com/
>
> I won't waste too much space here, because PATCH 1 includes a sizeable
> design document that describes the feature, the reasoning behind it, and my
> plan for getting this implemented widely throughout the codebase.
>
> There are some new things here that were not in the RFC:
>
>  * Design doc and format updates. (Patch 1)
>  * Performance test script. (Patches 2 and 20)
>
> Notably missing in this series from the RFC:
>
>  * The mega-patch inserting ensure_full_index() throughout the codebase.
>    That will be a follow-up series to this one.
>  * The integrations with git status and git add to demonstrate the improved
>    performance. Those will also appear in their own series later.
>
> I plan to keep my latest work in this area in my 'sparse-index/wip' branch
> [2]. It includes all of the work from the RFC right now, updated with the
> work from this series.
>
> [2] https://github.com/derrickstolee/git/tree/sparse-index/wip
>
> Thanks, -Stolee
>
> Derrick Stolee (20):
>   sparse-index: design doc and format update
>   t/perf: add performance test for sparse operations
>   t1092: clean up script quoting
>   sparse-index: add guard to ensure full index
>   sparse-index: implement ensure_full_index()
>   t1092: compare sparse-checkout to sparse-index
>   test-read-cache: print cache entries with --table
>   test-tool: don't force full index
>   unpack-trees: ensure full index
>   sparse-checkout: hold pattern list in index
>   sparse-index: convert from full to sparse
>   submodule: sparse-index should not collapse links
>   unpack-trees: allow sparse directories
>   sparse-index: check index conversion happens
>   sparse-index: create extension for compatibility
>   sparse-checkout: toggle sparse index from builtin
>   sparse-checkout: disable sparse-index
>   cache-tree: integrate with sparse directory entries
>   sparse-index: loose integration with cache_tree_verify()
>   p2000: add sparse-index repos
>
>  Documentation/config/extensions.txt      |   7 +
>  Documentation/git-sparse-checkout.txt    |  14 ++
>  Documentation/technical/index-format.txt |   7 +
>  Documentation/technical/sparse-index.txt | 167 +++++++++++++
>  Makefile                                 |   1 +
>  builtin/sparse-checkout.c                |  44 +++-
>  cache-tree.c                             |  40 ++++
>  cache.h                                  |  12 +-
>  read-cache.c                             |  35 ++-
>  repo-settings.c                          |  15 ++
>  repository.c                             |  11 +-
>  repository.h                             |   3 +
>  setup.c                                  |   3 +
>  sparse-index.c                           | 290 +++++++++++++++++++++++
>  sparse-index.h                           |  11 +
>  t/README                                 |   3 +
>  t/helper/test-read-cache.c               |  61 ++++-
>  t/perf/p2000-sparse-operations.sh        | 104 ++++++++
>  t/t1091-sparse-checkout-builtin.sh       |  13 +
>  t/t1092-sparse-checkout-compatibility.sh | 136 +++++++++--
>  unpack-trees.c                           |  16 +-
>  21 files changed, 953 insertions(+), 40 deletions(-)
>  create mode 100644 Documentation/technical/sparse-index.txt
>  create mode 100644 sparse-index.c
>  create mode 100644 sparse-index.h
>  create mode 100755 t/perf/p2000-sparse-operations.sh
>
>
> base-commit: 966e671106b2fd38301e7c344c754fd118d0bb07
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-883%2Fderrickstolee%2Fsparse-index%2Fformat-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-883/derrickstolee/sparse-index/format-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/883
> --
> gitgitgadget

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 01/20] sparse-index: design doc and format update
  2021-02-23 20:14 ` [PATCH 01/20] sparse-index: design doc and format update Derrick Stolee via GitGitGadget
@ 2021-02-24  1:13   ` Elijah Newren
  2021-02-25 15:29     ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Elijah Newren @ 2021-02-24  1:13 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee, Matheus Tavares Bernardino

On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> This begins a long effort to update the index format to allow sparse
> directory entries. This should result in a significant improvement to
> Git commands when HEAD contains millions of files, but the user has
> selected many fewer files to keep in their sparse-checkout definition.
>
> Currently, the index format is only updated in the presence of
> extensions.sparseIndex instead of increasing a file format version
> number. This is temporary, and index v5 is part of the plan for future
> work in this area.
>
> The design document details many of the reasons for embarking on this
> work, and also the plan for completing it safely.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/technical/index-format.txt |   7 +
>  Documentation/technical/sparse-index.txt | 167 +++++++++++++++++++++++
>  2 files changed, 174 insertions(+)
>  create mode 100644 Documentation/technical/sparse-index.txt
>
> diff --git a/Documentation/technical/index-format.txt b/Documentation/technical/index-format.txt
> index b633482b1bdf..387126582556 100644
> --- a/Documentation/technical/index-format.txt
> +++ b/Documentation/technical/index-format.txt
> @@ -44,6 +44,13 @@ Git index format
>    localization, no special casing of directory separator '/'). Entries
>    with the same name are sorted by their stage field.
>
> +  An index entry typically represents a file. However, if sparse-checkout
> +  is enabled in cone mode (`core.sparseCheckoutCone` is enabled) and the
> +  `extensions.sparseIndex` extension is enabled, then the index may
> +  contain entries for directories outside of the sparse-checkout definition.
> +  These entries have mode `0040000`, include the `SKIP_WORKTREE` bit, and
> +  the path ends in a directory separator.
> +
>    32-bit ctime seconds, the last time a file's metadata changed
>      this is stat(2) data
>
> diff --git a/Documentation/technical/sparse-index.txt b/Documentation/technical/sparse-index.txt
> new file mode 100644
> index 000000000000..9070836f0655
> --- /dev/null
> +++ b/Documentation/technical/sparse-index.txt
> @@ -0,0 +1,167 @@
> +Git Sparse-Index Design Document
> +================================
> +
> +The sparse-checkout feature allows users to focus a working directory on
> +a subset of the files at HEAD. The cone mode patterns, enabled by
> +`core.sparseCheckoutCone`, allow for very fast pattern matching to
> +discover which files at HEAD belong in the sparse-checkout cone.
> +
> +Three important scale dimensions for a Git worktree are:
> +
> +* `HEAD`: How many files are present at `HEAD`?
> +
> +* Populated: How many files are within the sparse-checkout cone.
> +
> +* Modified: How many files has the user modified in the working directory?
> +
> +We will use big-O notation -- O(X) -- to denote how expensive certain
> +operations are in terms of these dimensions.
> +
> +These dimensions are ordered by their magnitude: users (typically) modify
> +fewer files than are populated, and we can only populate files at `HEAD`.
> +These dimensions are also ordered by how expensive they are per item: it
> +is expensive to detect a modified file than it is to write one that we
> +know must be populated; changing `HEAD` only really requires updating the
> +index.
> +
> +Problems occur if there is an extreme imbalance in these dimensions. For
> +example, if `HEAD` contains millions of paths but the populated set has
> +only tens of thousands, then commands like `git status` and `git add` can
> +be dominated by operations that require O(`HEAD`) operations instead of
> +O(Populated). Primarily, the cost is in parsing and rewriting the index,
> +which is filled primarily with files at `HEAD` that are marked with the
> +`SKIP_WORKTREE` bit.
> +
> +The sparse-index intends to take these commands that read and modify the
> +index from O(`HEAD`) to O(Populated). To do this, we need to modify the
> +index format in a significant way: add "sparse directory" entries.
> +
> +With cone mode patterns, it is possible to detect when an entire
> +directory will have its contents outside of the sparse-checkout definition.
> +Instead of listing all of the files it contains as individual entries, a
> +sparse-index contains an entry with the directory name, referencing the
> +object ID of the tree at `HEAD` and marked with the `SKIP_WORKTREE` bit.
> +If we need to discover the details for paths within that directory, we
> +can parse trees to find that list.
> +
> +This addition of sparse-directory entries violates expectations about the

Violates current expectations, yes.  Documentation tends to live a
long time, and I suspect that 2-3 years from now reading this sentence
might be jarring since we'll have modified the code to have an updated
set of expectations.  Maybe a simple "As of time of writing, ..." at
the beginning of the sentence here?  Or maybe I'm just being overly
worried...

> +index format and its in-memory data structure. There are many consumers in
> +the codebase that expect to iterate through all of the index entries and
> +see only files. In addition, they expect to see all files at `HEAD`. One
> +way to handle this is to parse trees to replace a sparse-directory entry
> +with all of the files within that tree as the index is loaded. However,
> +parsing trees is slower than parsing the index format, so that is a slower
> +operation than if we left the index alone.
> +
> +The implementation plan below follows four phases to slowly integrate with
> +the sparse-index. The intention is to incrementally update Git commands to
> +interact safely with the sparse-index without significant slowdowns. This
> +may not always be possible, but the hope is that the primary commands that
> +users need in their daily work are dramatically improved.
> +
> +Phase I: Format and initial speedups
> +------------------------------------
> +
> +During this phase, Git learns to enable the sparse-index and safely parse
> +one. Protections are put in place so that every consumer of the in-memory
> +data structure can operate with its current assumption of every file at
> +`HEAD`.
> +
> +At first, every index parse will expand the sparse-directory entries into
> +the full list of paths at `HEAD`. This will be slower in all cases. The
> +only noticable change in behavior will be that the serialized index file

noticeable

> +contains sparse-directory entries.
> +
> +To start, we use a new repository extension, `extensions.sparseIndex`, to
> +allow inserting sparse-directory entries into indexes with file format
> +versions 2, 3, and 4. This prevents Git versions that do not understand
> +the sparse-index from operating on one, but it also prevents other
> +operations that do not use the index at all. A new format, index v5, will
> +be introduced that includes sparse-directory entries by default. It might
> +also introduce other features that have been considered for improving the
> +index, as well.
> +
> +Next, consumers of the index will be guarded against operating on a
> +sparse-index by inserting calls to `ensure_full_index()` or
> +`expand_index_to_path()`. After these guards are in place, we can begin
> +leaving sparse-directory entries in the in-memory index structure.
> +
> +Even after inserting these guards, we will keep expanding sparse-indexes
> +for most Git commands using the `command_requires_full_index` repository
> +setting. This setting will be on by default and disabled one builtin at a
> +time until we have sufficient confidence that all of the index operations
> +are properly guarded.
> +
> +To complete this phase, the commands `git status` and `git add` will be
> +integrated with the sparse-index so that they operate with O(Populated)
> +performance. They will be carefully tested for operations within and
> +outside the sparse-checkout definition.

Good plan so far, but there's something else bugging me a little here.
One thing we noticed with our usage of `sparse-checkout` is that
although unimportant _tracked_ files go away, leftover build files and
other untracked files stick around.  So, although 'git status'
shouldn't have to check the tracked files anymore, it is still going
to have to look at each of the *untracked* files and compare to
.gitignore files in order to correctly classify each file as ignored
or just plain untracked.  Our `sparsify` tool has for a long time
tried to warn about such files when changing the sparsity
patterns/modules and had an --remove-old-ignores option for clearing
out ignored files that are found within directories that are sparse
(Meaning the directories where all files under them are marked
SKIP_WORKTREE.). I was never sure whether a warning was enough, or if
pushing that option more made sense, but about a month ago my
colleagues made the tool just auto-invoke that option from other
sparsify invocations.  To my knowledge, there have been no complaints
about that being automatically turned on; but there were
complaints/confusion before about the directories being left around.
(Of course, non-ignored files are still left around by that option.)

I'm worried that since sparse-checkout doesn't do anything to help
with all these untracked/ignored files, we might not get all the
performance improvements we want from a `git status` with sparse
directories.  We'll be dropping from walking O(2*HEAD) files (1 source
+ 1 object file) down to O(HEAD) files (just the object files) rather
than actually getting down to O(Populated).

> +
> +Phase II: Careful integrations
> +------------------------------
> +
> +This phase focuses on ensuring that all index extensions and APIs work
> +well with a sparse-index. This requires significant increases to our test
> +coverage, especially for operations that interact with the working
> +directory outside of the sparse-checkout definition. Some of these
> +behaviors may not be the desirable ones, such as some tests already
> +marked for failure in `t1092-sparse-checkout-compatibility.sh`.
> +
> +The index extensions that may require special integrations are:
> +
> +* FS Monitor
> +* Untracked cache
> +
> +While integrating with these features, we should look for patterns that
> +might lead to better APIs for interacting with the index. Coalescing
> +common usage patterns into an API call can reduce the number of places
> +where sparse-directories need to be handled carefully.

Makes sense.

> +Phase III: Important command speedups
> +-------------------------------------
> +
> +At this point, the patterns for testing and implementing sparse-directory
> +logic should be relatively stable. This phase focuses on updating some of
> +the most common builtins that use the index to operate as O(Populated).
> +Here is a potential list of commands that could be valuable to integrate
> +at this point:
> +
> +* `git commit`
> +* `git checkout`
> +* `git merge`
> +* `git rebase`
> +
> +Along with `git status` and `git add`, these commands cover the majority
> +of users' interactions with the working directory.

Sounds like a good plan as well.

I hope we get to make this specific to the merge-ort backend.  It
localizes the index-related code to (a) a call to unpack_trees()
called from checkout-like code (which would probably automatically be
handled by your updates to git checkout), and (b) a single function
named record_conflicted_index_entries().  I feel it should be pretty
easy to update.

In contrast, the idea of attempting to update merge-recursive with
this kind of change sounds overwhelming.

>  In addition, we can
> +integrate with these commands:
> +
> +* `git grep`
> +* `git rm`
> +
> +These have been proposed as some whose behavior could change when in a
> +repo with a sparse-checkout definition. It would be good to include this
> +behavior automatically when using a sparse-index. Some clarity is needed
> +to make the behavior switch clear to the user.

Is this leftover from before recent events?  I think this portion of
the document should just be stricken.

I argued about how these were buggy as-is due SKIP_WORKTREE always
having been an incomplete implementation of an idea at [1], but didn't
hear a further response from you.  I'm curious if you disagreed with
my reasoning, or it just slipped through the cracks in a busy schedule
and this portion of the document was leftover from before.  In my
opinion, both commands are just buggy and should be fixed for general
sparse-checkout usage cases, not just for sparse-index.

As for git grep, it has options for searching the working tree
(default) OR searching the index (--cached) OR searching an old commit
(passing a REVISION).  But never some combination or more than one of
these.  The fact that it combined some in the cases of SKIP_WORKTREE
entries looks entirely like a bug to me.  For the same reasons I
argued that --untracked and --cached are incompatible[2], we shouldn't
be combining results from searching the working tree and searching the
index.  Luckily, this fix has already been submitted[3] and picked up
in mt/grep-sparse-checkout and is marked in the cooking emails as
"Will merge to next".

As for git rm, I'll quote from my email to Matheus:

"""As far as the longer term discussion about making git rm configurable...
_If_ it comes up again in the future, I will argue that if git rm
should have configuration to delete paths outside the sparsity
specification, then git add should have configuration to add paths
outside the sparsity specification that happen to be present despite
being SKIP_WORKTREE, that git diff with no revision arguments (nor
--cached) should have configuration to diff against paths that are
SKIP_WORKTREE but happen to be present, that git status should have
configuration to report on changes to paths that are SKIP_WORKTREE but
happen to be present, that git checkout should have configuration to
write files to the working tree despite matching sparsity paths, etc.
And I'll argue that you do ALL of those or you're being inconsistent.
I hope that people see these are actually all the same request and
that it is horribly inconsistent to do some of these and not others,
and that at least by the time I get to mentioning checkout that they
realize it's a crazy request.  We should just tell users to extend
their sparsity if they want the working copy (and commands that
interact with the working copy) to handle the additional paths.  Maybe
I'm just really biased, but I don't see how this makes sense.  I would
argue more about it, but no one has responded.  My plan was to just
fix the default behavior, and then see if anyone ever actually cared
enough to come back and ask for more configurability."""

Also, for rm, Matheus has already submitted the fix[4], though at
Junio's request he separated out some fixes for git-add as a separate
preliminary series[5] and then will resubmit the other `add` and `rm`
fixes.

[1] https://lore.kernel.org/git/CABPp-BHwNoVnooqDFPAsZxBT9aR5Dwk5D9sDRCvYSb8akxAJgA@mail.gmail.com/
[2] https://lore.kernel.org/git/xmqqtuql0yfp.fsf@gitster.c.googlers.com/
[3] https://lore.kernel.org/git/5f3f7ac77039d41d1692ceae4b0c5df3bb45b74a.1612901326.git.matheus.bernardino@usp.br/
[4] https://lore.kernel.org/git/61a77cd5f45ba02c7dff4b7932abdebb17c1667f.1613593946.git.matheus.bernardino@usp.br/
[5] https://lore.kernel.org/git/cover.1614037664.git.matheus.bernardino@usp.br/

Anyway, that's a long way of saying I think this section of your
document is already obsolete.  (Which is a good thing -- less work to
do to get sparse-index working.  Thanks, Matheus!).

> +This phase is the first where parallel work might be possible without too
> +much conflicts between topics.
> +
> +Phase IV: The long tail
> +-----------------------
> +
> +This last phase is less a "phase" and more "the new normal" after all of
> +the previous work.
> +
> +To start, the `command_requires_full_index` option could be removed in
> +favor of expanding only when hitting an API guard.
> +
> +There are many Git commands that could use special attention to operate as
> +O(Populated), while some might be so rare that it is acceptable to leave
> +them with additional overhead when a sparse-index is present.
> +
> +Here are some commands that might be useful to update:
> +
> +* `git sparse-checkout set`
> +* `git am`
> +* `git clean`
> +* `git stash`

Oh, man, git stash is definitely in need of work.  It's still a
minimalistic transliteration of shell to C, complete with lots of
process forking and piping output between various low-level commands.
It might be interesting to rewrite this in terms of the merge
machinery, though its separate stashing of staged stuff, unstaged
stuff, and possibly untracked stuff means that there is a sequence of
two or three merges needed and interesting failure handling to do if
those merges fail, especially if the user uses --index.  But I
digress...


Anyway, overall, very nicely written and planned out.  Thanks for
taking the time to write this all up.

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 02/20] t/perf: add performance test for sparse operations
  2021-02-23 20:14 ` [PATCH 02/20] t/perf: add performance test for sparse operations Derrick Stolee via GitGitGadget
@ 2021-02-24  2:30   ` Elijah Newren
  2021-03-09 20:03     ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Elijah Newren @ 2021-02-24  2:30 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee

On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> Create a test script that takes the default performance test (the Git
> codebase) and multiplies it by 256 using four layers of duplicated
> trees of width four. This results in nearly one million blob entries in
> the index. Then, we can clone this repository with sparse-checkout
> patterns that demonstrate four copies of the initial repository. Each
> clone will use a different index format or mode so peformance can be
> tested across the different options.
>
> Note that the initial repo is stripped of submodules before doing the
> copies. This preserves the expected data shape of the sparse index,
> because directories containing submodules are not collapsed to a sparse
> directory entry.
>
> Run a few Git commands on these clones, especially those that use the
> index (status, add, commit).
>
> Here are the results on my Linux machine:
>
> Test
> --------------------------------------------------------------
> 2000.2: git status (full-index-v3)             0.37(0.30+0.09)
> 2000.3: git status (full-index-v4)             0.39(0.32+0.10)
> 2000.4: git add -A (full-index-v3)             1.42(1.06+0.20)
> 2000.5: git add -A (full-index-v4)             1.26(0.98+0.16)
> 2000.6: git add . (full-index-v3)              1.40(1.04+0.18)
> 2000.7: git add . (full-index-v4)              1.26(0.98+0.17)
> 2000.8: git commit -a -m A (full-index-v3)     1.42(1.11+0.16)
> 2000.9: git commit -a -m A (full-index-v4)     1.33(1.08+0.16)
>
> It is perhaps noteworthy that there is an improvement when using index
> version 4. This is because the v3 index uses 108 MiB while the v4
> index uses 80 MiB. Since the repeated portions of the directories are
> very short (f3/f1/f2, for example) this ratio is less pronounced than in
> similarly-sized real repositories.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  t/perf/p2000-sparse-operations.sh | 87 +++++++++++++++++++++++++++++++
>  1 file changed, 87 insertions(+)
>  create mode 100755 t/perf/p2000-sparse-operations.sh
>
> diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
> new file mode 100755
> index 000000000000..52597683376e
> --- /dev/null
> +++ b/t/perf/p2000-sparse-operations.sh
> @@ -0,0 +1,87 @@
> +#!/bin/sh
> +
> +test_description="test performance of Git operations using the index"
> +
> +. ./perf-lib.sh
> +
> +test_perf_default_repo
> +
> +SPARSE_CONE=f2/f4/f1
> +
> +test_expect_success 'setup repo and indexes' '
> +       git reset --hard HEAD &&
> +       # Remove submodules from the example repo, because our
> +       # duplication of the entire repo creates an unlikly data shape.
> +       git config --file .gitmodules --get-regexp "submodule.*.path" >modules &&
> +       rm -f .gitmodules &&
> +       git add .gitmodules &&

Why not `git rm [-f] .gitmodules` instead of these two commands?  Is
there something special about .gitmodules that requires this special
handling?

> +       for module in $(awk "{print \$2}" modules)
> +       do
> +               git rm $module || return 1
> +       done &&
> +       git add . &&

What does the `git add .` do?  I don't see any changes there weren't
already git-add'ed or git-rm'ed.

> +       git commit -m "remove submodules" &&
> +
> +       echo bogus >a &&
> +       cp a b &&
> +       git add a b &&
> +       git commit -m "level 0" &&
> +       BLOB=$(git rev-parse HEAD:a) &&
> +       OLD_COMMIT=$(git rev-parse HEAD) &&
> +       OLD_TREE=$(git rev-parse HEAD^{tree}) &&
> +
> +       for i in $(test_seq 1 4)
> +       do
> +               cat >in <<-EOF &&
> +                       100755 blob $BLOB       a
> +                       040000 tree $OLD_TREE   f1
> +                       040000 tree $OLD_TREE   f2
> +                       040000 tree $OLD_TREE   f3
> +                       040000 tree $OLD_TREE   f4
> +               EOF
> +               NEW_TREE=$(git mktree <in) &&
> +               NEW_COMMIT=$(git commit-tree $NEW_TREE -p $OLD_COMMIT -m "level $i") &&
> +               OLD_TREE=$NEW_TREE &&
> +               OLD_COMMIT=$NEW_COMMIT || return 1
> +       done &&
> +
> +       git sparse-checkout init --cone &&
> +       git branch -f wide $OLD_COMMIT &&
> +       git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . full-index-v3 &&
> +       (
> +               cd full-index-v3 &&
> +               git sparse-checkout init --cone &&
> +               git sparse-checkout set $SPARSE_CONE &&
> +               git config index.version 3 &&
> +               git update-index --index-version=3
> +       ) &&
> +       git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . full-index-v4 &&
> +       (
> +               cd full-index-v4 &&
> +               git sparse-checkout init --cone &&
> +               git sparse-checkout set $SPARSE_CONE &&
> +               git config index.version 4 &&
> +               git update-index --index-version=4
> +       )
> +'
> +
> +test_perf_on_all () {
> +       command="$@"
> +       for repo in full-index-v3 full-index-v4
> +       do
> +               test_perf "$command ($repo)" "
> +                       (
> +                               cd $repo &&
> +                               echo >>$SPARSE_CONE/a &&
> +                               $command
> +                       )
> +               "
> +       done
> +}
> +
> +test_perf_on_all git status
> +test_perf_on_all git add -A
> +test_perf_on_all git add .
> +test_perf_on_all git commit -a -m A
> +
> +test_done
> --
> gitgitgadget

Other than the two minor questions, the rest looks good to me.

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 04/20] sparse-index: add guard to ensure full index
  2021-02-23 20:14 ` [PATCH 04/20] sparse-index: add guard to ensure full index Derrick Stolee via GitGitGadget
@ 2021-02-24  2:44   ` Elijah Newren
  0 siblings, 0 replies; 203+ messages in thread
From: Elijah Newren @ 2021-02-24  2:44 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee

On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> Upcoming changes will introduce modifications to the index format that
> allow sparse directories. It will be useful to have a mechanism for
> converting those sparse index files into full indexes by walking the
> tree at those sparse directories. Name this method ensure_full_index()
> as it will guarantee that the index is fully expanded.
>
> This method is not implemented yet, and instead we focus on the
> scaffolding to declare it and call it at the appropriate time.
>
> Add a 'command_requires_full_index' member to struct repo_settings. This
> will be an indicator that we need the index in full mode to do certain
> index operations. This starts as being true for every command, then we
> will set it to false as some commands integrate with sparse indexes.
>
> If 'command_requires_full_index' is true, then we will immediately
> expand a sparse index to a full one upon reading from disk. This
> suffices for now, but we will want to add more callers to
> ensure_full_index() later.

Same as 01/27 of your RFC series; looks good.

> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Makefile        |  1 +
>  repo-settings.c |  8 ++++++++
>  repository.c    | 11 ++++++++++-
>  repository.h    |  2 ++
>  sparse-index.c  |  8 ++++++++
>  sparse-index.h  |  7 +++++++
>  6 files changed, 36 insertions(+), 1 deletion(-)
>  create mode 100644 sparse-index.c
>  create mode 100644 sparse-index.h
>
> diff --git a/Makefile b/Makefile
> index 5a239cac20e3..3bf61699238d 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -980,6 +980,7 @@ LIB_OBJS += setup.o
>  LIB_OBJS += shallow.o
>  LIB_OBJS += sideband.o
>  LIB_OBJS += sigchain.o
> +LIB_OBJS += sparse-index.o
>  LIB_OBJS += split-index.o
>  LIB_OBJS += stable-qsort.o
>  LIB_OBJS += strbuf.o
> diff --git a/repo-settings.c b/repo-settings.c
> index f7fff0f5ab83..d63569e4041e 100644
> --- a/repo-settings.c
> +++ b/repo-settings.c
> @@ -77,4 +77,12 @@ void prepare_repo_settings(struct repository *r)
>                 UPDATE_DEFAULT_BOOL(r->settings.core_untracked_cache, UNTRACKED_CACHE_KEEP);
>
>         UPDATE_DEFAULT_BOOL(r->settings.fetch_negotiation_algorithm, FETCH_NEGOTIATION_DEFAULT);
> +
> +       /*
> +        * This setting guards all index reads to require a full index
> +        * over a sparse index. After suitable guards are placed in the
> +        * codebase around uses of the index, this setting will be
> +        * removed.
> +        */
> +       r->settings.command_requires_full_index = 1;
>  }
> diff --git a/repository.c b/repository.c
> index c98298acd017..a8acae002f71 100644
> --- a/repository.c
> +++ b/repository.c
> @@ -10,6 +10,7 @@
>  #include "object.h"
>  #include "lockfile.h"
>  #include "submodule-config.h"
> +#include "sparse-index.h"
>
>  /* The main repository */
>  static struct repository the_repo;
> @@ -261,6 +262,8 @@ void repo_clear(struct repository *repo)
>
>  int repo_read_index(struct repository *repo)
>  {
> +       int res;
> +
>         if (!repo->index)
>                 repo->index = xcalloc(1, sizeof(*repo->index));
>
> @@ -270,7 +273,13 @@ int repo_read_index(struct repository *repo)
>         else if (repo->index->repo != repo)
>                 BUG("repo's index should point back at itself");
>
> -       return read_index_from(repo->index, repo->index_file, repo->gitdir);
> +       res = read_index_from(repo->index, repo->index_file, repo->gitdir);
> +
> +       prepare_repo_settings(repo);
> +       if (repo->settings.command_requires_full_index)
> +               ensure_full_index(repo->index);
> +
> +       return res;
>  }
>
>  int repo_hold_locked_index(struct repository *repo,
> diff --git a/repository.h b/repository.h
> index b385ca3c94b6..e06a23015697 100644
> --- a/repository.h
> +++ b/repository.h
> @@ -41,6 +41,8 @@ struct repo_settings {
>         enum fetch_negotiation_setting fetch_negotiation_algorithm;
>
>         int core_multi_pack_index;
> +
> +       unsigned command_requires_full_index:1;
>  };
>
>  struct repository {
> diff --git a/sparse-index.c b/sparse-index.c
> new file mode 100644
> index 000000000000..82183ead563b
> --- /dev/null
> +++ b/sparse-index.c
> @@ -0,0 +1,8 @@
> +#include "cache.h"
> +#include "repository.h"
> +#include "sparse-index.h"
> +
> +void ensure_full_index(struct index_state *istate)
> +{
> +       /* intentionally left blank */
> +}
> diff --git a/sparse-index.h b/sparse-index.h
> new file mode 100644
> index 000000000000..09a20d036c46
> --- /dev/null
> +++ b/sparse-index.h
> @@ -0,0 +1,7 @@
> +#ifndef SPARSE_INDEX_H__
> +#define SPARSE_INDEX_H__
> +
> +struct index_state;
> +void ensure_full_index(struct index_state *istate);
> +
> +#endif
> --
> gitgitgadget
>

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 05/20] sparse-index: implement ensure_full_index()
  2021-02-23 20:14 ` [PATCH 05/20] sparse-index: implement ensure_full_index() Derrick Stolee via GitGitGadget
@ 2021-02-24  3:20   ` Elijah Newren
  0 siblings, 0 replies; 203+ messages in thread
From: Elijah Newren @ 2021-02-24  3:20 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee

On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> We will mark an in-memory index_state as having sparse directory entries
> with the sparse_index bit. These currently cannot exist, but we will add
> a mechanism for collapsing a full index to a sparse one in a later
> change. That will happen at write time, so we must first allow parsing
> the format before writing it.
>
> Commands or methods that require a full index in order to operate can
> call ensure_full_index() to expand that index in-memory. This requires
> parsing trees using that index's repository.
>
> Sparse directory entries have a specific 'ce_mode' value. The macro
> S_ISSPARSEDIR(ce->ce_mode) can check if a cache_entry 'ce' has this type.
> This ce_mode is not possible with the existing index formats, so we don't
> also verify all properties of a sparse-directory entry, which are:
>
>  1. ce->ce_mode == 0040000
>  2. ce->flags & CE_SKIP_WORKTREE is true
>  3. ce->name[ce->namelen - 1] == '/' (ends in dir separator)
>  4. ce->oid references a tree object.
>
> These are all semi-enforced in ensure_full_index() to some extent. Any
> deviation will cause a warning at minimum or a failure in the worst
> case.

Thanks for spelling these all out; looks good.

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  cache.h        |  7 +++-
>  read-cache.c   |  9 +++++
>  sparse-index.c | 95 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 109 insertions(+), 2 deletions(-)
>
> diff --git a/cache.h b/cache.h
> index d92814961405..1336c8d7435e 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -204,6 +204,8 @@ struct cache_entry {
>  #error "CE_EXTENDED_FLAGS out of range"
>  #endif
>
> +#define S_ISSPARSEDIR(m) ((m) == S_IFDIR)

Much nicer, thanks.  :-)

> +
>  /* Forward structure decls */
>  struct pathspec;
>  struct child_process;
> @@ -319,7 +321,8 @@ struct index_state {
>                  drop_cache_tree : 1,
>                  updated_workdir : 1,
>                  updated_skipworktree : 1,
> -                fsmonitor_has_run_once : 1;
> +                fsmonitor_has_run_once : 1,
> +                sparse_index : 1;
>         struct hashmap name_hash;
>         struct hashmap dir_hash;
>         struct object_id oid;
> @@ -722,6 +725,8 @@ int read_index_from(struct index_state *, const char *path,
>                     const char *gitdir);
>  int is_index_unborn(struct index_state *);
>
> +void ensure_full_index(struct index_state *istate);
> +
>  /* For use with `write_locked_index()`. */
>  #define COMMIT_LOCK            (1 << 0)
>  #define SKIP_IF_UNCHANGED      (1 << 1)
> diff --git a/read-cache.c b/read-cache.c
> index 29144cf879e7..97dbf2434f30 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -101,6 +101,9 @@ static const char *alternate_index_output;
>
>  static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
>  {
> +       if (S_ISSPARSEDIR(ce->ce_mode))
> +               istate->sparse_index = 1;

A very minor question -- someone who sees "sparse_index" could
probably easily think either "index with missing entries, due to
having a SKIP_WORKTREE directory instead" or perhaps "index when using
the sparse-checkout feature, i.e. it has some SKIP_WORKTREE entries in
it".  From the code here, clearly the former is your intent.  I wonder
if it'd help to have a small comment near the declaration of
sparse_index to mention its intent.

> +
>         istate->cache[nr] = ce;
>         add_name_hash(istate, ce);
>  }
> @@ -2255,6 +2258,12 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist)
>         trace2_data_intmax("index", the_repository, "read/cache_nr",
>                            istate->cache_nr);
>
> +       if (!istate->repo)
> +               istate->repo = the_repository;
> +       prepare_repo_settings(istate->repo);
> +       if (istate->repo->settings.command_requires_full_index)
> +               ensure_full_index(istate);
> +
>         return istate->cache_nr;
>
>  unmap:
> diff --git a/sparse-index.c b/sparse-index.c
> index 82183ead563b..316cb949b74b 100644
> --- a/sparse-index.c
> +++ b/sparse-index.c
> @@ -1,8 +1,101 @@
>  #include "cache.h"
>  #include "repository.h"
>  #include "sparse-index.h"
> +#include "tree.h"
> +#include "pathspec.h"
> +#include "trace2.h"
> +
> +static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
> +{
> +       ALLOC_GROW(istate->cache, nr + 1, istate->cache_alloc);
> +
> +       istate->cache[nr] = ce;
> +       add_name_hash(istate, ce);
> +}
> +
> +static int add_path_to_index(const struct object_id *oid,
> +                               struct strbuf *base, const char *path,
> +                               unsigned int mode, int stage, void *context)
> +{
> +       struct index_state *istate = (struct index_state *)context;
> +       struct cache_entry *ce;
> +       size_t len = base->len;
> +
> +       if (S_ISDIR(mode))
> +               return READ_TREE_RECURSIVE;
> +
> +       strbuf_addstr(base, path);
> +
> +       ce = make_cache_entry(istate, mode, oid, base->buf, 0, 0);
> +       ce->ce_flags |= CE_SKIP_WORKTREE;
> +       set_index_entry(istate, istate->cache_nr++, ce);
> +
> +       strbuf_setlen(base, len);
> +       return 0;
> +}
>
>  void ensure_full_index(struct index_state *istate)
>  {
> -       /* intentionally left blank */
> +       int i;
> +       struct index_state *full;
> +
> +       if (!istate || !istate->sparse_index)
> +               return;
> +
> +       if (!istate->repo)
> +               istate->repo = the_repository;
> +
> +       trace2_region_enter("index", "ensure_full_index", istate->repo);
> +
> +       /* initialize basics of new index */
> +       full = xcalloc(1, sizeof(struct index_state));
> +       memcpy(full, istate, sizeof(struct index_state));
> +
> +       /* then change the necessary things */
> +       full->sparse_index = 0;
> +       full->cache_alloc = (3 * istate->cache_alloc) / 2;
> +       full->cache_nr = 0;
> +       ALLOC_ARRAY(full->cache, full->cache_alloc);
> +
> +       for (i = 0; i < istate->cache_nr; i++) {
> +               struct cache_entry *ce = istate->cache[i];
> +               struct tree *tree;
> +               struct pathspec ps;
> +
> +               if (!S_ISSPARSEDIR(ce->ce_mode)) {
> +                       set_index_entry(full, full->cache_nr++, ce);
> +                       continue;
> +               }
> +               if (!(ce->ce_flags & CE_SKIP_WORKTREE))
> +                       warning(_("index entry is a directory, but not sparse (%08x)"),
> +                               ce->ce_flags);
> +
> +               /* recursively walk into cd->name */
> +               tree = lookup_tree(istate->repo, &ce->oid);
> +
> +               memset(&ps, 0, sizeof(ps));
> +               ps.recursive = 1;
> +               ps.has_wildcard = 1;
> +               ps.max_depth = -1;
> +
> +               read_tree_recursive(istate->repo, tree,
> +                                   ce->name, strlen(ce->name),
> +                                   0, &ps,
> +                                   add_path_to_index, full);
> +
> +               /* free directory entries. full entries are re-used */
> +               discard_cache_entry(ce);
> +       }
> +
> +       /* Copy back into original index. */
> +       memcpy(&istate->name_hash, &full->name_hash, sizeof(full->name_hash));
> +       istate->sparse_index = 0;
> +       free(istate->cache);

Thanks for fixing that leak that from the RFC series.

> +       istate->cache = full->cache;
> +       istate->cache_nr = full->cache_nr;
> +       istate->cache_alloc = full->cache_alloc;
> +
> +       free(full);
> +
> +       trace2_region_leave("index", "ensure_full_index", istate->repo);
>  }
> --
> gitgitgadget

Looks good to me.

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 16/20] sparse-checkout: toggle sparse index from builtin
  2021-02-23 20:14 ` [PATCH 16/20] sparse-checkout: toggle sparse index from builtin Derrick Stolee via GitGitGadget
@ 2021-02-24 19:11   ` Martin Ågren
  2021-03-09 20:52     ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Martin Ågren @ 2021-02-24 19:11 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Elijah Newren, Junio C Hamano,
	Nguyễn Thái Ngọc Duy, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee

On Wed, 24 Feb 2021 at 00:57, Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
> +that is not completely understood by other tools. Enabling sparse index
> +enables the `extensions.spareseIndex` config value, which might cause

s/sparese/sparse

> +other tools to stop working with your repository. If you have trouble with
> +this compatibility, then run `git sparse-checkout sparse-index disable` to
> +remove this config and rewrite your index to not be sparse.

While I'm commenting on this..:

There are several "layers" here, for lack of a better term. "Enabling foo
enables bar which may cause baz. If you fail due to baz, try dropping
bar by dropping foo." If I remove any mention of the config variable from
your text, I get the following.

 Enabling sparse index might cause other tools to stop working with your
 repository. If you have trouble with this compatibility, then run `git
 sparse-checkout sparse-index disable` to rewrite your index to not be
 sparse.

I'm tempted to suggest such a rewrite to relieve readers of knowing of
the middle step, which you could say is more of an implementation
detail. But if we think that the symptoms / error messages might involve
"extensions.sparseIndex" or, as would be the case with an older Git
installation,

  fatal: unknown repository extensions found:
          sparseindex

maybe there is some value in mentioning the config item by name. Just
thinking out loud, really, and I don't have any strong opinion. I only
came here to point out the typo in the docs.

Martin

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 06/20] t1092: compare sparse-checkout to sparse-index
  2021-02-23 20:14 ` [PATCH 06/20] t1092: compare sparse-checkout to sparse-index Derrick Stolee via GitGitGadget
@ 2021-02-25  6:37   ` Elijah Newren
  0 siblings, 0 replies; 203+ messages in thread
From: Elijah Newren @ 2021-02-25  6:37 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee

On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> Add a new 'sparse-index' repo alongside the 'full-checkout' and
> 'sparse-checkout' repos in t1092-sparse-checkout-compatibility.sh. Also
> add run_on_sparse and test_sparse_match helpers. These helpers will be
> used when the sparse index is implemented.
>
> Add GIT_TEST_SPARSE_INDEX environment variable to enable the
> sparse-index by default. This will be intended to use across the entire
> test suite, except that it will only affect cases where the
> sparse-checkout feature is enabled.

This last sentence was a bit awkward to read.  "will be intended to
use" -> "is intended to be used"?

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  t/README                                 |  3 +++
>  t/t1092-sparse-checkout-compatibility.sh | 24 ++++++++++++++++++++----
>  2 files changed, 23 insertions(+), 4 deletions(-)
>
> diff --git a/t/README b/t/README
> index 593d4a4e270c..b98bc563aab5 100644
> --- a/t/README
> +++ b/t/README
> @@ -439,6 +439,9 @@ and "sha256".
>  GIT_TEST_WRITE_REV_INDEX=<boolean>, when true enables the
>  'pack.writeReverseIndex' setting.
>
> +GIT_TEST_SPARSE_INDEX=<boolean>, when true enables index writes to use the
> +sparse-index format by default.
> +
>  Naming Tests
>  ------------
>
> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
> index 3725d3997e70..71d6f9e4c014 100755
> --- a/t/t1092-sparse-checkout-compatibility.sh
> +++ b/t/t1092-sparse-checkout-compatibility.sh
> @@ -7,6 +7,7 @@ test_description='compare full workdir to sparse workdir'
>  test_expect_success 'setup' '
>         git init initial-repo &&
>         (
> +               GIT_TEST_SPARSE_INDEX=0 &&
>                 cd initial-repo &&
>                 echo a >a &&
>                 echo "after deep" >e &&
> @@ -87,23 +88,32 @@ init_repos () {
>
>         cp -r initial-repo sparse-checkout &&
>         git -C sparse-checkout reset --hard &&
> -       git -C sparse-checkout sparse-checkout init --cone &&
> +
> +       cp -r initial-repo sparse-index &&
> +       git -C sparse-index reset --hard &&
>
>         # initialize sparse-checkout definitions
> -       git -C sparse-checkout sparse-checkout set deep
> +       git -C sparse-checkout sparse-checkout init --cone &&
> +       git -C sparse-checkout sparse-checkout set deep &&
> +       GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout init --cone &&
> +       GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep
>  }
>
>  run_on_sparse () {
>         (
>                 cd sparse-checkout &&
> -               "$@" >../sparse-checkout-out 2>../sparse-checkout-err
> +               GIT_TEST_SPARSE_INDEX=0 "$@" >../sparse-checkout-out 2>../sparse-checkout-err
> +       ) &&
> +       (
> +               cd sparse-index &&
> +               GIT_TEST_SPARSE_INDEX=1 "$@" >../sparse-index-out 2>../sparse-index-err
>         )
>  }
>
>  run_on_all () {
>         (
>                 cd full-checkout &&
> -               "$@" >../full-checkout-out 2>../full-checkout-err
> +               GIT_TEST_SPARSE_INDEX=0 "$@" >../full-checkout-out 2>../full-checkout-err
>         ) &&
>         run_on_sparse "$@"
>  }
> @@ -114,6 +124,12 @@ test_all_match () {
>         test_cmp full-checkout-err sparse-checkout-err
>  }
>
> +test_sparse_match () {
> +       run_on_sparse $* &&

Should this be
   run_on_sparse "$@"
in order to allow arguments with spaces?

> +       test_cmp sparse-checkout-out sparse-index-out &&
> +       test_cmp sparse-checkout-err sparse-index-err
> +}
> +
>  test_expect_success 'status with options' '
>         init_repos &&
>         test_all_match git status --porcelain=v2 &&
> --
> gitgitgadget

Other than those minor comments, looks good to me.

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 07/20] test-read-cache: print cache entries with --table
  2021-02-23 20:14 ` [PATCH 07/20] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
@ 2021-02-25  7:02   ` Elijah Newren
  2021-03-09 21:00     ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Elijah Newren @ 2021-02-25  7:02 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee

On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> This table is helpful for discovering data in the index to ensure it is
> being written correctly, especially as we build and test the
> sparse-index. This table includes an output format similar to 'git
> ls-tree', but should not be compared to that directly. The biggest
> reasons are that 'git ls-tree' includes a tree entry for every
> subdirectory, even those that would not appear as a sparse directory in
> a sparse-index. Further, 'git ls-tree' does not use a trailing directory
> separator for its tree rows.
>
> This does not print the stat() information for the blobs. That could be
> added in a future change with another option. The tests that are added
> in the next few changes care only about the object types and IDs.
>
> To make the option parsing slightly more robust, wrap the string
> comparisons in a loop adapted from test-dir-iterator.c.
>
> Care must be taken with the final check for the 'cnt' variable. We
> continue the expectation that the numerical value is the final argument.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  t/helper/test-read-cache.c | 50 ++++++++++++++++++++++++++++++--------
>  1 file changed, 40 insertions(+), 10 deletions(-)
>
> diff --git a/t/helper/test-read-cache.c b/t/helper/test-read-cache.c
> index 244977a29bdf..e4c3492f7d3e 100644
> --- a/t/helper/test-read-cache.c
> +++ b/t/helper/test-read-cache.c
> @@ -2,35 +2,65 @@
>  #include "cache.h"
>  #include "config.h"
>
> +static void print_cache_entry(struct cache_entry *ce)
> +{
> +       printf("%06o ", ce->ce_mode & 0777777);

This constant is curious.  I think it's because you want to strip off
the special in-memory bits of the ce_mode where git stores extra data,
which would be everything beyond the first 16 bits (as noted in a
comment near the beginning of cache.h).  But here you keep the first
18 bits.  Is CE_UPDATE and CE_REMOVE just 0 in the cases you've viewed
so this works (but you really should use 0177777 or 0xFFFF), or am I
off in my guess of what you're trying to do and you do want to see
those two flags?

It also seems surprising to me that this constant isn't already
defined somewhere in cache.h or as some variant of S_IFMT, though I'm
not finding it.

> +
> +       if (S_ISSPARSEDIR(ce->ce_mode))
> +               printf("tree ");
> +       else if (S_ISGITLINK(ce->ce_mode))
> +               printf("commit ");
> +       else
> +               printf("blob ");

Perhaps make use of the tree_type, commit_type, and blob_type global constants?

> +
> +       printf("%s\t%s\n",
> +              oid_to_hex(&ce->oid),
> +              ce->name);
> +}
> +
> +static void print_cache(struct index_state *cache)
> +{
> +       int i;
> +       for (i = 0; i < the_index.cache_nr; i++)
> +               print_cache_entry(the_index.cache[i]);

Why are you passing cache as a parameter, then ignoring it and using the_index?

> +}
> +
>  int cmd__read_cache(int argc, const char **argv)
>  {
> +       struct repository *r = the_repository;
>         int i, cnt = 1;
>         const char *name = NULL;
> +       int table = 0;
>
> -       if (argc > 1 && skip_prefix(argv[1], "--print-and-refresh=", &name)) {
> -               argc--;
> -               argv++;
> +       for (++argv, --argc; *argv && starts_with(*argv, "--"); ++argv, --argc) {
> +               if (skip_prefix(*argv, "--print-and-refresh=", &name))
> +                       continue;
> +               if (!strcmp(*argv, "--table"))
> +                       table = 1;
>         }
>
> -       if (argc == 2)
> -               cnt = strtol(argv[1], NULL, 0);
> +       if (argc == 1)
> +               cnt = strtol(argv[0], NULL, 0);
>         setup_git_directory();
>         git_config(git_default_config, NULL);
> +
>         for (i = 0; i < cnt; i++) {
> -               read_cache();
> +               repo_read_index(r);
>                 if (name) {
>                         int pos;
>
> -                       refresh_index(&the_index, REFRESH_QUIET,
> +                       refresh_index(r->index, REFRESH_QUIET,
>                                       NULL, NULL, NULL);
> -                       pos = index_name_pos(&the_index, name, strlen(name));
> +                       pos = index_name_pos(r->index, name, strlen(name));
>                         if (pos < 0)
>                                 die("%s not in index", name);
>                         printf("%s is%s up to date\n", name,
> -                              ce_uptodate(the_index.cache[pos]) ? "" : " not");
> +                              ce_uptodate(r->index->cache[pos]) ? "" : " not");
>                         write_file(name, "%d\n", i);
>                 }
> -               discard_cache();
> +               if (table)
> +                       print_cache(r->index);
> +               discard_index(r->index);
>         }
>         return 0;
>  }
> --
> gitgitgadget

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 10/20] sparse-checkout: hold pattern list in index
  2021-02-23 20:14 ` [PATCH 10/20] sparse-checkout: hold pattern list in index Derrick Stolee via GitGitGadget
@ 2021-02-25  7:14   ` Elijah Newren
  0 siblings, 0 replies; 203+ messages in thread
From: Elijah Newren @ 2021-02-25  7:14 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee

On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> As we modify the sparse-checkout definition, we perform index operations
> on a pattern_list that only exists in-memory. This allows easy backing
> out in case the index update fails.
>
> However, if the index write itself cares about the sparse-checkout
> pattern set, we need access to that in-memory copy. Place a pointer to
> a 'struct pattern_list' in the index so we can access this on-demand.
> This will be used in the next change which uses the sparse-checkout
> definition to filter out directories that are outsie the sparse cone.

Looks like you still have the "outsie" typo.  ;-)

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  builtin/sparse-checkout.c | 17 ++++++++++-------
>  cache.h                   |  2 ++
>  2 files changed, 12 insertions(+), 7 deletions(-)
>
> diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
> index 2306a9ad98e0..e00b82af727b 100644
> --- a/builtin/sparse-checkout.c
> +++ b/builtin/sparse-checkout.c
> @@ -110,6 +110,8 @@ static int update_working_directory(struct pattern_list *pl)
>         if (is_index_unborn(r->index))
>                 return UPDATE_SPARSITY_SUCCESS;
>
> +       r->index->sparse_checkout_patterns = pl;
> +
>         memset(&o, 0, sizeof(o));
>         o.verbose_update = isatty(2);
>         o.update = 1;
> @@ -138,6 +140,7 @@ static int update_working_directory(struct pattern_list *pl)
>         else
>                 rollback_lock_file(&lock_file);
>
> +       r->index->sparse_checkout_patterns = NULL;
>         return result;
>  }
>
> @@ -517,19 +520,18 @@ static int modify_pattern_list(int argc, const char **argv, enum modify_type m)
>  {
>         int result;
>         int changed_config = 0;
> -       struct pattern_list pl;
> -       memset(&pl, 0, sizeof(pl));
> +       struct pattern_list *pl = xcalloc(1, sizeof(*pl));
>
>         switch (m) {
>         case ADD:
>                 if (core_sparse_checkout_cone)
> -                       add_patterns_cone_mode(argc, argv, &pl);
> +                       add_patterns_cone_mode(argc, argv, pl);
>                 else
> -                       add_patterns_literal(argc, argv, &pl);
> +                       add_patterns_literal(argc, argv, pl);
>                 break;
>
>         case REPLACE:
> -               add_patterns_from_input(&pl, argc, argv);
> +               add_patterns_from_input(pl, argc, argv);
>                 break;
>         }
>
> @@ -539,12 +541,13 @@ static int modify_pattern_list(int argc, const char **argv, enum modify_type m)
>                 changed_config = 1;
>         }
>
> -       result = write_patterns_and_update(&pl);
> +       result = write_patterns_and_update(pl);
>
>         if (result && changed_config)
>                 set_config(MODE_NO_PATTERNS);
>
> -       clear_pattern_list(&pl);
> +       clear_pattern_list(pl);
> +       free(pl);
>         return result;
>  }
>
> diff --git a/cache.h b/cache.h
> index 1336c8d7435e..d75b352f38d3 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -307,6 +307,7 @@ static inline unsigned int canon_mode(unsigned int mode)
>  struct split_index;
>  struct untracked_cache;
>  struct progress;
> +struct pattern_list;
>
>  struct index_state {
>         struct cache_entry **cache;
> @@ -332,6 +333,7 @@ struct index_state {
>         struct mem_pool *ce_mem_pool;
>         struct progress *progress;
>         struct repository *repo;
> +       struct pattern_list *sparse_checkout_patterns;
>  };
>
>  /* Name hashing */
> --
> gitgitgadget
>

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 11/20] sparse-index: convert from full to sparse
  2021-02-23 20:14 ` [PATCH 11/20] sparse-index: convert from full to sparse Derrick Stolee via GitGitGadget
@ 2021-02-25  7:33   ` Elijah Newren
  2021-03-09 21:13     ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Elijah Newren @ 2021-02-25  7:33 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee

On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> If we have a full index, then we can convert it to a sparse index by
> replacing directories outside of the sparse cone with sparse directory
> entries. The convert_to_sparse() method does this, when the situation is
> appropriate.
>
> For now, we avoid converting the index to a sparse index if:
>
>  1. the index is split.
>  2. the index is already sparse.
>  3. sparse-checkout is disabled.
>  4. sparse-checkout does not use cone mode.
>
> Finally, we currently limit the conversion to when the
> GIT_TEST_SPARSE_INDEX environment variable is enabled. A mode using Git
> config will be added in a later change.
>
> The trickiest thing about this conversion is that we might not be able
> to mark a directory as a sparse directory just because it is outside the
> sparse cone. There might be unmerged files within that directory, so we
> need to look for those. Also, if there is some strange reason why a file
> is not marked with CE_SKIP_WORKTREE, then we should give up on
> converting that directory. There is still hope that some of its
> subdirectories might be able to convert to sparse, so we keep looking
> deeper.
>
> The conversion process is assisted by the cache-tree extension. This is
> calculated from the full index if it does not already exist. We then
> abandon the cache-tree as it no longer applies to the newly-sparse
> index. Thus, this cache-tree will be recalculated in every
> sparse-full-sparse round-trip until we integrate the cache-tree
> extension with the sparse index.
>
> Some Git commands use the index after writing it. For example, 'git add'
> will update the index, then write it to disk, then read its entries to
> report information. To keep the in-memory index in a full state after
> writing, we re-expand it to a full one after the write. This is wasteful
> for commands that only write the index and do not read from it again,
> but that is only the case until we make those commands "sparse aware."
>
> We can compare the behavior of the sparse-index in
> t1092-sparse-checkout-compability.sh by using GIT_TEST_SPARSE_INDEX=1
> when operating on the 'sparse-index' repo. We can also compare the two
> sparse repos directly, such as comparing their indexes (when expanded to
> full in the case of the 'sparse-index' repo). We also verify that the
> index is actually populated with sparse directory entries.
>
> The 'checkout and reset (mixed)' test is marked for failure when
> comparing a sparse repo to a full repo, but we can compare the two
> sparse-checkout cases directly to ensure that we are not changing the
> behavior when using a sparse index.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  cache-tree.c                             |   3 +
>  cache.h                                  |   2 +
>  read-cache.c                             |  26 ++++-
>  sparse-index.c                           | 139 +++++++++++++++++++++++
>  sparse-index.h                           |   1 +
>  t/t1092-sparse-checkout-compatibility.sh |  61 +++++++++-
>  6 files changed, 227 insertions(+), 5 deletions(-)
>
> diff --git a/cache-tree.c b/cache-tree.c
> index 2fb483d3c083..5f07a39e501e 100644
> --- a/cache-tree.c
> +++ b/cache-tree.c
> @@ -6,6 +6,7 @@
>  #include "object-store.h"
>  #include "replace-object.h"
>  #include "promisor-remote.h"
> +#include "sparse-index.h"
>
>  #ifndef DEBUG_CACHE_TREE
>  #define DEBUG_CACHE_TREE 0
> @@ -442,6 +443,8 @@ int cache_tree_update(struct index_state *istate, int flags)
>         if (i)
>                 return i;
>
> +       ensure_full_index(istate);
> +
>         if (!istate->cache_tree)
>                 istate->cache_tree = cache_tree();
>
> diff --git a/cache.h b/cache.h
> index d75b352f38d3..e8b7d3b4fb33 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -251,6 +251,8 @@ static inline unsigned int create_ce_mode(unsigned int mode)
>  {
>         if (S_ISLNK(mode))
>                 return S_IFLNK;
> +       if (mode == S_IFDIR)
> +               return S_IFDIR;
>         if (S_ISDIR(mode) || S_ISGITLINK(mode))
>                 return S_IFGITLINK;
>         return S_IFREG | ce_permissions(mode);
> diff --git a/read-cache.c b/read-cache.c
> index 97dbf2434f30..67acbf202f4e 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -25,6 +25,7 @@
>  #include "fsmonitor.h"
>  #include "thread-utils.h"
>  #include "progress.h"
> +#include "sparse-index.h"
>
>  /* Mask for the name length in ce_flags in the on-disk index */
>
> @@ -1002,8 +1003,14 @@ int verify_path(const char *path, unsigned mode)
>
>                         c = *path++;
>                         if ((c == '.' && !verify_dotfile(path, mode)) ||
> -                           is_dir_sep(c) || c == '\0')
> +                           is_dir_sep(c))
>                                 return 0;
> +                       /*
> +                        * allow terminating directory separators for
> +                        * sparse directory enries.

enries -> entries

> +                        */
> +                       if (c == '\0')
> +                               return S_ISDIR(mode);

Yaay, much simpler (than the RFC version).

>                 } else if (c == '\\' && protect_ntfs) {
>                         if (is_ntfs_dotgit(path))
>                                 return 0;
> @@ -3061,6 +3068,14 @@ static int do_write_locked_index(struct index_state *istate, struct lock_file *l
>                                  unsigned flags)
>  {
>         int ret;
> +       int was_full = !istate->sparse_index;
> +
> +       ret = convert_to_sparse(istate);
> +
> +       if (ret) {
> +               warning(_("failed to convert to a sparse-index"));
> +               return ret;
> +       }
>
>         /*
>          * TODO trace2: replace "the_repository" with the actual repo instance
> @@ -3072,6 +3087,9 @@ static int do_write_locked_index(struct index_state *istate, struct lock_file *l
>         trace2_region_leave_printf("index", "do_write_index", the_repository,
>                                    "%s", get_lock_file_path(lock));
>
> +       if (was_full)
> +               ensure_full_index(istate);
> +
>         if (ret)
>                 return ret;
>         if (flags & COMMIT_LOCK)
> @@ -3162,9 +3180,10 @@ static int write_shared_index(struct index_state *istate,
>                               struct tempfile **temp)
>  {
>         struct split_index *si = istate->split_index;
> -       int ret;
> +       int ret, was_full = !istate->sparse_index;
>
>         move_cache_to_base_index(istate);
> +       convert_to_sparse(istate);
>
>         trace2_region_enter_printf("index", "shared/do_write_index",
>                                    the_repository, "%s", get_tempfile_path(*temp));
> @@ -3172,6 +3191,9 @@ static int write_shared_index(struct index_state *istate,
>         trace2_region_leave_printf("index", "shared/do_write_index",
>                                    the_repository, "%s", get_tempfile_path(*temp));
>
> +       if (was_full)
> +               ensure_full_index(istate);
> +
>         if (ret)
>                 return ret;
>         ret = adjust_shared_perm(get_tempfile_path(*temp));
> diff --git a/sparse-index.c b/sparse-index.c
> index 316cb949b74b..cb1f85635fbc 100644
> --- a/sparse-index.c
> +++ b/sparse-index.c
> @@ -4,6 +4,145 @@
>  #include "tree.h"
>  #include "pathspec.h"
>  #include "trace2.h"
> +#include "cache-tree.h"
> +#include "config.h"
> +#include "dir.h"
> +#include "fsmonitor.h"
> +
> +static struct cache_entry *construct_sparse_dir_entry(
> +                               struct index_state *istate,
> +                               const char *sparse_dir,
> +                               struct cache_tree *tree)
> +{
> +       struct cache_entry *de;
> +
> +       de = make_cache_entry(istate, S_IFDIR, &tree->oid, sparse_dir, 0, 0);
> +
> +       de->ce_flags |= CE_SKIP_WORKTREE;
> +       return de;
> +}
> +
> +/*
> + * Returns the number of entries "inserted" into the index.
> + */
> +static int convert_to_sparse_rec(struct index_state *istate,
> +                                int num_converted,
> +                                int start, int end,
> +                                const char *ct_path, size_t ct_pathlen,
> +                                struct cache_tree *ct)
> +{
> +       int i, can_convert = 1;
> +       int start_converted = num_converted;
> +       enum pattern_match_result match;
> +       int dtype;
> +       struct strbuf child_path = STRBUF_INIT;
> +       struct pattern_list *pl = istate->sparse_checkout_patterns;
> +
> +       /*
> +        * Is the current path outside of the sparse cone?
> +        * Then check if the region can be replaced by a sparse
> +        * directory entry (everything is sparse and merged).
> +        */
> +       match = path_matches_pattern_list(ct_path, ct_pathlen,
> +                                         NULL, &dtype, pl, istate);
> +       if (match != NOT_MATCHED)
> +               can_convert = 0;

Not sure if you saw my comments on the flow control at
https://lore.kernel.org/git/CABPp-BE9wPwmC0=pA4p1_QSRDHrO8RzqfJQdE2NxYZsYL_Rcig@mail.gmail.com/
(the typos elsewhere seem to still be present).  If you saw it and
decided against it, that's fine, just wanted the idea to at least be
floated.

> +
> +       for (i = start; can_convert && i < end; i++) {
> +               struct cache_entry *ce = istate->cache[i];
> +
> +               if (ce_stage(ce) ||
> +                   !(ce->ce_flags & CE_SKIP_WORKTREE))
> +                       can_convert = 0;
> +       }
> +
> +       if (can_convert) {
> +               struct cache_entry *se;
> +               se = construct_sparse_dir_entry(istate, ct_path, ct);
> +
> +               istate->cache[num_converted++] = se;
> +               return 1;
> +       }
> +
> +       for (i = start; i < end; ) {
> +               int count, span, pos = -1;
> +               const char *base, *slash;
> +               struct cache_entry *ce = istate->cache[i];
> +
> +               /*
> +                * Detect if this is a normal entry oustide of any subtree

s/oustide/outside/

> +                * entry.
> +                */
> +               base = ce->name + ct_pathlen;
> +               slash = strchr(base, '/');
> +
> +               if (slash)
> +                       pos = cache_tree_subtree_pos(ct, base, slash - base);
> +
> +               if (pos < 0) {
> +                       istate->cache[num_converted++] = ce;
> +                       i++;
> +                       continue;
> +               }
> +
> +               strbuf_setlen(&child_path, 0);
> +               strbuf_add(&child_path, ce->name, slash - ce->name + 1);
> +
> +               span = ct->down[pos]->cache_tree->entry_count;
> +               count = convert_to_sparse_rec(istate,
> +                                             num_converted, i, i + span,
> +                                             child_path.buf, child_path.len,
> +                                             ct->down[pos]->cache_tree);
> +               num_converted += count;
> +               i += span;
> +       }
> +
> +       strbuf_release(&child_path);
> +       return num_converted - start_converted;
> +}
> +
> +int convert_to_sparse(struct index_state *istate)
> +{
> +       if (istate->split_index || istate->sparse_index ||
> +           !core_apply_sparse_checkout || !core_sparse_checkout_cone)
> +               return 0;
> +
> +       /*
> +        * For now, only create a sparse index with the
> +        * GIT_TEST_SPARSE_INDEX environment variable. We will relax
> +        * this once we have a proper way to opt-in (and later still,
> +        * opt-out).
> +        */
> +       if (!git_env_bool("GIT_TEST_SPARSE_INDEX", 0))
> +               return 0;
> +
> +       if (!istate->sparse_checkout_patterns) {
> +               istate->sparse_checkout_patterns = xcalloc(1, sizeof(struct pattern_list));
> +               if (get_sparse_checkout_patterns(istate->sparse_checkout_patterns) < 0)
> +                       return 0;
> +       }
> +
> +       if (!istate->sparse_checkout_patterns->use_cone_patterns) {
> +               warning(_("attempting to use sparse-index without cone mode"));
> +               return -1;
> +       }
> +
> +       if (cache_tree_update(istate, 0)) {
> +               warning(_("unable to update cache-tree, staying full"));
> +               return -1;
> +       }
> +
> +       remove_fsmonitor(istate);
> +
> +       trace2_region_enter("index", "convert_to_sparse", istate->repo);
> +       istate->cache_nr = convert_to_sparse_rec(istate,
> +                                                0, 0, istate->cache_nr,
> +                                                "", 0, istate->cache_tree);
> +       istate->drop_cache_tree = 1;
> +       istate->sparse_index = 1;
> +       trace2_region_leave("index", "convert_to_sparse", istate->repo);
> +       return 0;
> +}
>
>  static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
>  {
> diff --git a/sparse-index.h b/sparse-index.h
> index 09a20d036c46..64380e121d80 100644
> --- a/sparse-index.h
> +++ b/sparse-index.h
> @@ -3,5 +3,6 @@
>
>  struct index_state;
>  void ensure_full_index(struct index_state *istate);
> +int convert_to_sparse(struct index_state *istate);
>
>  #endif
> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
> index 4d789fe86b9d..ca87033d30b0 100755
> --- a/t/t1092-sparse-checkout-compatibility.sh
> +++ b/t/t1092-sparse-checkout-compatibility.sh
> @@ -2,6 +2,9 @@
>
>  test_description='compare full workdir to sparse workdir'
>
> +GIT_TEST_CHECK_CACHE_TREE=0

Same question as I posted for the RFC series:

Why do you need to set this?  I vaguely remember needing to mess with
this when working with sparse checkouts because it did weird stuff but
I don't remember details.  But since your patch touches cache_trees, it
seems weird to show up without explanation.

> +GIT_TEST_SPLIT_INDEX=0
> +
>  . ./test-lib.sh
>
>  test_expect_success 'setup' '
> @@ -121,15 +124,49 @@ run_on_all () {
>  test_all_match () {
>         run_on_all "$@" &&
>         test_cmp full-checkout-out sparse-checkout-out &&
> -       test_cmp full-checkout-err sparse-checkout-err
> +       test_cmp full-checkout-out sparse-index-out &&
> +       test_cmp full-checkout-err sparse-checkout-err &&
> +       test_cmp full-checkout-err sparse-index-err
>  }
>
>  test_sparse_match () {
> -       run_on_sparse $* &&
> +       run_on_sparse "$@" &&
>         test_cmp sparse-checkout-out sparse-index-out &&
>         test_cmp sparse-checkout-err sparse-index-err
>  }
>
> +test_expect_success 'sparse-index contents' '
> +       init_repos &&
> +
> +       test-tool -C sparse-index read-cache --table >cache &&
> +       for dir in folder1 folder2 x
> +       do
> +               TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
> +               grep "040000 tree $TREE $dir/" cache \
> +                       || return 1
> +       done &&

Thanks for making the output look more like ls-tree output; it's
easier to parse that way, at least for me.

> +
> +       GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set folder1 &&
> +
> +       test-tool -C sparse-index read-cache --table >cache &&
> +       for dir in deep folder2 x
> +       do
> +               TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
> +               grep "040000 tree $TREE $dir/" cache \
> +                       || return 1
> +       done &&
> +
> +       GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep/deeper1 &&
> +
> +       test-tool -C sparse-index read-cache --table >cache &&
> +       for dir in deep/deeper2 folder1 folder2 x
> +       do
> +               TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
> +               grep "040000 tree $TREE $dir/" cache \
> +                       || return 1
> +       done
> +'
> +
>  test_expect_success 'expanded in-memory index matches full index' '
>         init_repos &&
>         test_sparse_match test-tool read-cache --expand --table
> @@ -137,6 +174,7 @@ test_expect_success 'expanded in-memory index matches full index' '
>
>  test_expect_success 'status with options' '
>         init_repos &&
> +       test_sparse_match ls &&
>         test_all_match git status --porcelain=v2 &&
>         test_all_match git status --porcelain=v2 -z -u &&
>         test_all_match git status --porcelain=v2 -uno &&
> @@ -273,6 +311,17 @@ test_expect_failure 'checkout and reset (mixed)' '
>         test_all_match git reset update-folder2
>  '
>
> +# Ensure that sparse-index behaves identically to
> +# sparse-checkout with a full index.
> +test_expect_success 'checkout and reset (mixed) [sparse]' '
> +       init_repos &&
> +
> +       test_sparse_match git checkout -b reset-test update-deep &&
> +       test_sparse_match git reset deepest &&
> +       test_sparse_match git reset update-folder1 &&
> +       test_sparse_match git reset update-folder2
> +'
> +
>  test_expect_success 'merge' '
>         init_repos &&
>
> @@ -309,14 +358,20 @@ test_expect_success 'clean' '
>         test_all_match git status --porcelain=v2 &&
>         test_all_match git clean -f &&
>         test_all_match git status --porcelain=v2 &&
> +       test_sparse_match ls &&
> +       test_sparse_match ls folder1 &&
>
>         test_all_match git clean -xf &&
>         test_all_match git status --porcelain=v2 &&
> +       test_sparse_match ls &&
> +       test_sparse_match ls folder1 &&
>
>         test_all_match git clean -xdf &&
>         test_all_match git status --porcelain=v2 &&
> +       test_sparse_match ls &&
> +       test_sparse_match ls folder1 &&
>
> -       test_path_is_dir sparse-checkout/folder1
> +       test_sparse_match test_path_is_dir folder1
>  '
>
>  test_done
> --
> gitgitgadget

I mostly read over the range-diff since it was much shorter.  You've
addressed a number of questions/comments I had on the RFC version, but
there's still some I didn't see a response to so I reposted them here.

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 13/20] unpack-trees: allow sparse directories
  2021-02-23 20:14 ` [PATCH 13/20] unpack-trees: allow sparse directories Derrick Stolee via GitGitGadget
@ 2021-02-25  7:40   ` Elijah Newren
  2021-03-09 21:35     ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Elijah Newren @ 2021-02-25  7:40 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee

On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> The index_pos_by_traverse_info() currently throws a BUG() when a
> directory entry exists exactly in the index. We need to consider that it
> is possible to have a directory in a sparse index as long as that entry
> is itself marked with the skip-worktree bit.
>
> The negation of the 'pos' variable must be conditioned to only when it
> starts as negative. This is identical behavior as before when the index
> is full.

Same comment on the second paragraph as I made in the RFC series --
https://lore.kernel.org/git/CABPp-BGPJgA4guWHVm3AVS=hM0fTixUpRvJe5i9NnHT-3QJMfw@mail.gmail.com/.
I apologize if I'm repeating stuff you chose to not change, but I
didn't see a response and given the three typos left in previous
patches, I'm unsure whether it was unaddressed on purpose or on
accident.

> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  unpack-trees.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/unpack-trees.c b/unpack-trees.c
> index 4dd99219073a..b324eec2a5d1 100644
> --- a/unpack-trees.c
> +++ b/unpack-trees.c
> @@ -746,9 +746,12 @@ static int index_pos_by_traverse_info(struct name_entry *names,
>         strbuf_make_traverse_path(&name, info, names->path, names->pathlen);
>         strbuf_addch(&name, '/');
>         pos = index_name_pos(o->src_index, name.buf, name.len);
> -       if (pos >= 0)
> -               BUG("This is a directory and should not exist in index");
> -       pos = -pos - 1;
> +       if (pos >= 0) {
> +               if (!o->src_index->sparse_index ||
> +                   !(o->src_index->cache[pos]->ce_flags & CE_SKIP_WORKTREE))
> +                       BUG("This is a directory and should not exist in index");
> +       } else
> +               pos = -pos - 1;
>         if (pos >= o->src_index->cache_nr ||
>             !starts_with(o->src_index->cache[pos]->name, name.buf) ||
>             (pos > 0 && starts_with(o->src_index->cache[pos-1]->name, name.buf)))
> --
> gitgitgadget
>

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 15/20] sparse-index: create extension for compatibility
  2021-02-23 20:14 ` [PATCH 15/20] sparse-index: create extension for compatibility Derrick Stolee via GitGitGadget
@ 2021-02-25  7:45   ` Elijah Newren
  2021-03-09 21:45     ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Elijah Newren @ 2021-02-25  7:45 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee

On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> Previously, we enabled the sparse index format only using
> GIT_TEST_SPARSE_INDEX=1. This is not a feasible direction for users to
> actually select this mode. Further, sparse directory entries are not
> understood by the index formats as advertised.
>
> We _could_ add a new index version that explicitly adds these
> capabilities, but there are nuances to index formats 2, 3, and 4 that
> are still valuable to select as options. For now, create a repo
> extension, "extensions.sparseIndex", that specifies that the tool
> reading this repository must understand sparse directory entries.

This commit is unchanged from the RFC series, but given your comments
in the design document about how you do intend to create an index
format v5 now, do you want to reference that here?

>
> This change only encodes the extension and enables it when
> GIT_TEST_SPARSE_INDEX=1. Later, we will add a more user-friendly CLI
> mechanism.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/config/extensions.txt |  7 ++++++
>  cache.h                             |  1 +
>  repo-settings.c                     |  7 ++++++
>  repository.h                        |  3 ++-
>  setup.c                             |  3 +++
>  sparse-index.c                      | 38 +++++++++++++++++++++++++----
>  6 files changed, 53 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/config/extensions.txt b/Documentation/config/extensions.txt
> index 4e23d73cdcad..5c86b3648732 100644
> --- a/Documentation/config/extensions.txt
> +++ b/Documentation/config/extensions.txt
> @@ -6,3 +6,10 @@ extensions.objectFormat::
>  Note that this setting should only be set by linkgit:git-init[1] or
>  linkgit:git-clone[1].  Trying to change it after initialization will not
>  work and will produce hard-to-diagnose issues.
> +
> +extensions.sparseIndex::
> +       When combined with `core.sparseCheckout=true` and
> +       `core.sparseCheckoutCone=true`, the index may contain entries
> +       corresponding to directories outside of the sparse-checkout
> +       definition. Versions of Git that do not understand this extension
> +       do not expect directory entries in the index.

I had a wording suggestion for this paragraph in the RFC series --
https://lore.kernel.org/git/CABPp-BFEJE82k4VgkR=Jf7V7sZxZzo2pHMfAGshhi9_vV6iK0w@mail.gmail.com/.
Let me know if you just decided to leave it out so I don't bug you
about stuff you already considered.

> diff --git a/cache.h b/cache.h
> index e8b7d3b4fb33..eea61fba7568 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -1053,6 +1053,7 @@ struct repository_format {
>         int worktree_config;
>         int is_bare;
>         int hash_algo;
> +       int sparse_index;
>         char *work_tree;
>         struct string_list unknown_extensions;
>         struct string_list v1_only_extensions;
> diff --git a/repo-settings.c b/repo-settings.c
> index d63569e4041e..9677d50f9238 100644
> --- a/repo-settings.c
> +++ b/repo-settings.c
> @@ -85,4 +85,11 @@ void prepare_repo_settings(struct repository *r)
>          * removed.
>          */
>         r->settings.command_requires_full_index = 1;
> +
> +       /*
> +        * Initialize this as off.
> +        */
> +       r->settings.sparse_index = 0;
> +       if (!repo_config_get_bool(r, "extensions.sparseindex", &value) && value)
> +               r->settings.sparse_index = 1;
>  }
> diff --git a/repository.h b/repository.h
> index e06a23015697..a45f7520fd9e 100644
> --- a/repository.h
> +++ b/repository.h
> @@ -42,7 +42,8 @@ struct repo_settings {
>
>         int core_multi_pack_index;
>
> -       unsigned command_requires_full_index:1;
> +       unsigned command_requires_full_index:1,
> +                sparse_index:1;
>  };
>
>  struct repository {
> diff --git a/setup.c b/setup.c
> index c04cd25a30df..cd8394564613 100644
> --- a/setup.c
> +++ b/setup.c
> @@ -500,6 +500,9 @@ static enum extension_result handle_extension(const char *var,
>                         return error("invalid value for 'extensions.objectformat'");
>                 data->hash_algo = format;
>                 return EXTENSION_OK;
> +       } else if (!strcmp(ext, "sparseindex")) {
> +               data->sparse_index = 1;
> +               return EXTENSION_OK;
>         }
>         return EXTENSION_UNKNOWN;
>  }
> diff --git a/sparse-index.c b/sparse-index.c
> index 14029fafc750..97b0d0c57857 100644
> --- a/sparse-index.c
> +++ b/sparse-index.c
> @@ -102,19 +102,47 @@ static int convert_to_sparse_rec(struct index_state *istate,
>         return num_converted - start_converted;
>  }
>
> +static int enable_sparse_index(struct repository *repo)
> +{
> +       const char *config_path = repo_git_path(repo, "config.worktree");
> +
> +       if (upgrade_repository_format(1) < 0) {
> +               warning(_("unable to upgrade repository format to enable sparse-index"));
> +               return -1;
> +       }
> +       git_config_set_in_file_gently(config_path,
> +                                     "extensions.sparseIndex",
> +                                     "true");
> +
> +       prepare_repo_settings(repo);
> +       repo->settings.sparse_index = 1;
> +       return 0;
> +}
> +
>  int convert_to_sparse(struct index_state *istate)
>  {
>         if (istate->split_index || istate->sparse_index ||
>             !core_apply_sparse_checkout || !core_sparse_checkout_cone)
>                 return 0;
>
> +       if (!istate->repo)
> +               istate->repo = the_repository;
> +
> +       /*
> +        * The GIT_TEST_SPARSE_INDEX environment variable triggers the
> +        * extensions.sparseIndex config variable to be on.
> +        */
> +       if (git_env_bool("GIT_TEST_SPARSE_INDEX", 0)) {
> +               int err = enable_sparse_index(istate->repo);
> +               if (err < 0)
> +                       return err;
> +       }
> +
>         /*
> -        * For now, only create a sparse index with the
> -        * GIT_TEST_SPARSE_INDEX environment variable. We will relax
> -        * this once we have a proper way to opt-in (and later still,
> -        * opt-out).
> +        * Only convert to sparse if extensions.sparseIndex is set.
>          */
> -       if (!git_env_bool("GIT_TEST_SPARSE_INDEX", 0))
> +       prepare_repo_settings(istate->repo);
> +       if (!istate->repo->settings.sparse_index)
>                 return 0;
>
>         if (!istate->sparse_checkout_patterns) {
> --
> gitgitgadget

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 01/20] sparse-index: design doc and format update
  2021-02-24  1:13   ` Elijah Newren
@ 2021-02-25 15:29     ` Derrick Stolee
  2021-02-25 20:14       ` Elijah Newren
  0 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee @ 2021-02-25 15:29 UTC (permalink / raw)
  To: Elijah Newren, Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee, Matheus Tavares Bernardino

On 2/23/2021 8:13 PM, Elijah Newren wrote:
> On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:>> +This addition of sparse-directory entries violates expectations about the
> 
> Violates current expectations, yes.  Documentation tends to live a
> long time, and I suspect that 2-3 years from now reading this sentence
> might be jarring since we'll have modified the code to have an updated
> set of expectations.  Maybe a simple "As of time of writing, ..." at
> the beginning of the sentence here?  Or maybe I'm just being overly
> worried...

I was hoping that the phrase "this addition of" places this statement in
a moment of time where sparse-directory entries didn't exist, but now they
will. I'm open to alternatives and will give this some thought.

>> +To complete this phase, the commands `git status` and `git add` will be
>> +integrated with the sparse-index so that they operate with O(Populated)
>> +performance. They will be carefully tested for operations within and
>> +outside the sparse-checkout definition.
> 
> Good plan so far, but there's something else bugging me a little here.
> One thing we noticed with our usage of `sparse-checkout` is that
> although unimportant _tracked_ files go away, leftover build files and
> other untracked files stick around.  So, although 'git status'
> shouldn't have to check the tracked files anymore, it is still going
> to have to look at each of the *untracked* files and compare to
> .gitignore files in order to correctly classify each file as ignored
> or just plain untracked.  Our `sparsify` tool has for a long time
> tried to warn about such files when changing the sparsity
> patterns/modules and had an --remove-old-ignores option for clearing
> out ignored files that are found within directories that are sparse
> (Meaning the directories where all files under them are marked
> SKIP_WORKTREE.). I was never sure whether a warning was enough, or if
> pushing that option more made sense, but about a month ago my
> colleagues made the tool just auto-invoke that option from other
> sparsify invocations.  To my knowledge, there have been no complaints
> about that being automatically turned on; but there were
> complaints/confusion before about the directories being left around.
> (Of course, non-ignored files are still left around by that option.)
> 
> I'm worried that since sparse-checkout doesn't do anything to help
> with all these untracked/ignored files, we might not get all the
> performance improvements we want from a `git status` with sparse
> directories.  We'll be dropping from walking O(2*HEAD) files (1 source
> + 1 object file) down to O(HEAD) files (just the object files) rather
> than actually getting down to O(Populated).

This definitely seems like a valuable _enhancement_ to sparse-checkout
that could occur in parallel.

For a workaround in the moment: is "git clean -xdf" sufficient to help
these users?

>> +Phase III: Important command speedups
>> +-------------------------------------
>> +
>> +At this point, the patterns for testing and implementing sparse-directory
>> +logic should be relatively stable. This phase focuses on updating some of
>> +the most common builtins that use the index to operate as O(Populated).
>> +Here is a potential list of commands that could be valuable to integrate
>> +at this point:
>> +
>> +* `git commit`
>> +* `git checkout`
>> +* `git merge`
>> +* `git rebase`
>> +
>> +Along with `git status` and `git add`, these commands cover the majority
>> +of users' interactions with the working directory.
> 
> Sounds like a good plan as well.
> 
> I hope we get to make this specific to the merge-ort backend.  It
> localizes the index-related code to (a) a call to unpack_trees()
> called from checkout-like code (which would probably automatically be
> handled by your updates to git checkout), and (b) a single function
> named record_conflicted_index_entries().  I feel it should be pretty
> easy to update.
> 
> In contrast, the idea of attempting to update merge-recursive with
> this kind of change sounds overwhelming.

Yes, I'm hoping to eventually say "if you are in a sparse-checkout, then
you should use ORT by default." Then, if someone opts to do merge-recursive
instead, then they pay the index expansion cost.

While this is very clear in my head, it might be worth mentioning that
explicitly here.

>>  In addition, we can
>> +integrate with these commands:
>> +
>> +* `git grep`
>> +* `git rm`
>> +
>> +These have been proposed as some whose behavior could change when in a
>> +repo with a sparse-checkout definition. It would be good to include this
>> +behavior automatically when using a sparse-index. Some clarity is needed
>> +to make the behavior switch clear to the user.
> 
> Is this leftover from before recent events?  I think this portion of
> the document should just be stricken.
> 
> I argued about how these were buggy as-is due SKIP_WORKTREE always
> having been an incomplete implementation of an idea at [1], but didn't
> hear a further response from you.  I'm curious if you disagreed with
> my reasoning, or it just slipped through the cracks in a busy schedule
> and this portion of the document was leftover from before.  In my
> opinion, both commands are just buggy and should be fixed for general
> sparse-checkout usage cases, not just for sparse-index.

This is definitely a case of "I've been too busy to read those topics
in detail." I figured that there was something going on that was relevant
to the sparse-checkout and would affect the sparse-index, but I shelved
it in my mind until I had space to think about it directly.

> Anyway, that's a long way of saying I think this section of your
> document is already obsolete.  (Which is a good thing -- less work to
> do to get sparse-index working.  Thanks, Matheus!).

Thank you for your summary, which helps a lot. Thanks, Matheus, too!
If those fixes already include coverage for the behavior, then I'll see
if they also translate to sparse-index tests easily.

I feel like a lot of these later contributions will be more about adding
tests than actually writing a lot of code.

>> +This phase is the first where parallel work might be possible without too
>> +much conflicts between topics.
>> +
>> +Phase IV: The long tail
>> +-----------------------
>> +
>> +This last phase is less a "phase" and more "the new normal" after all of
>> +the previous work.
>> +
>> +To start, the `command_requires_full_index` option could be removed in
>> +favor of expanding only when hitting an API guard.
>> +
>> +There are many Git commands that could use special attention to operate as
>> +O(Populated), while some might be so rare that it is acceptable to leave
>> +them with additional overhead when a sparse-index is present.
>> +
>> +Here are some commands that might be useful to update:
>> +
>> +* `git sparse-checkout set`
>> +* `git am`
>> +* `git clean`
>> +* `git stash`
> 
> Oh, man, git stash is definitely in need of work.  It's still a
> minimalistic transliteration of shell to C, complete with lots of
> process forking and piping output between various low-level commands.
> It might be interesting to rewrite this in terms of the merge
> machinery, though its separate stashing of staged stuff, unstaged
> stuff, and possibly untracked stuff means that there is a sequence of
> two or three merges needed and interesting failure handling to do if
> those merges fail, especially if the user uses --index.  But I
> digress...

I would prefer to leave 'git stash' alone, but it's used by enough
people that I need to care about it eventually.

> Anyway, overall, very nicely written and planned out.  Thanks for
> taking the time to write this all up.

Thanks for your detailed comments!
-Stolee
 


^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 01/20] sparse-index: design doc and format update
  2021-02-25 15:29     ` Derrick Stolee
@ 2021-02-25 20:14       ` Elijah Newren
  0 siblings, 0 replies; 203+ messages in thread
From: Elijah Newren @ 2021-02-25 20:14 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, Git Mailing List,
	Junio C Hamano, Nguyễn Thái Ngọc,
	Jonathan Nieder, Derrick Stolee, Derrick Stolee,
	Matheus Tavares Bernardino

On Thu, Feb 25, 2021 at 7:29 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 2/23/2021 8:13 PM, Elijah Newren wrote:
> > On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
> > <gitgitgadget@gmail.com> wrote:>> +This addition of sparse-directory entries violates expectations about the
> >
> > Violates current expectations, yes.  Documentation tends to live a
> > long time, and I suspect that 2-3 years from now reading this sentence
> > might be jarring since we'll have modified the code to have an updated
> > set of expectations.  Maybe a simple "As of time of writing, ..." at
> > the beginning of the sentence here?  Or maybe I'm just being overly
> > worried...
>
> I was hoping that the phrase "this addition of" places this statement in
> a moment of time where sparse-directory entries didn't exist, but now they
> will. I'm open to alternatives and will give this some thought.

I already listed my only suggestion -- adding a "As of time of
writing," at the beginning.  I'm totally open to other
proposals/suggestions, and it's admittedly a minor point so you can
feel free to just ignore it if we can't come up with wording everyone
likes.

>
> >> +To complete this phase, the commands `git status` and `git add` will be
> >> +integrated with the sparse-index so that they operate with O(Populated)
> >> +performance. They will be carefully tested for operations within and
> >> +outside the sparse-checkout definition.
> >
> > Good plan so far, but there's something else bugging me a little here.
> > One thing we noticed with our usage of `sparse-checkout` is that
> > although unimportant _tracked_ files go away, leftover build files and
> > other untracked files stick around.  So, although 'git status'
> > shouldn't have to check the tracked files anymore, it is still going
> > to have to look at each of the *untracked* files and compare to
> > .gitignore files in order to correctly classify each file as ignored
> > or just plain untracked.  Our `sparsify` tool has for a long time
> > tried to warn about such files when changing the sparsity
> > patterns/modules and had an --remove-old-ignores option for clearing
> > out ignored files that are found within directories that are sparse
> > (Meaning the directories where all files under them are marked
> > SKIP_WORKTREE.). I was never sure whether a warning was enough, or if
> > pushing that option more made sense, but about a month ago my
> > colleagues made the tool just auto-invoke that option from other
> > sparsify invocations.  To my knowledge, there have been no complaints
> > about that being automatically turned on; but there were
> > complaints/confusion before about the directories being left around.
> > (Of course, non-ignored files are still left around by that option.)
> >
> > I'm worried that since sparse-checkout doesn't do anything to help
> > with all these untracked/ignored files, we might not get all the
> > performance improvements we want from a `git status` with sparse
> > directories.  We'll be dropping from walking O(2*HEAD) files (1 source
> > + 1 object file) down to O(HEAD) files (just the object files) rather
> > than actually getting down to O(Populated).
>
> This definitely seems like a valuable _enhancement_ to sparse-checkout
> that could occur in parallel.

Yes, indeed.  Your discussion of performance just reminded me of it,
and since this idea might be important in order to drive the costs
down to O(populated) in practice, I thought I'd mention it.

> For a workaround in the moment: is "git clean -xdf" sufficient to help
> these users?

Not really; that wouldn't remove the ignored stuff (build files) under
sparsified directories which is the point.  (Builds build everything
over here; once you sparsify you have leftover build files from
projects you now don't care about.)  If you convert it to "git clean
-Xdf" then you're closer, but that wouldn't just remove builds info
from sparse projects, it'd force users to rebuild all the stuff
they're interested in.

It's close though; what's wanted is basically a special flag that runs
"git clean -Xf <long list of sparsified directories>", without the
user having to specify 300 directories.

However, for now, since I've got a 'sparsify' script anyway (needed
for determining inter-module dependencies and certain directories that
always need to be present, etc.), it just has a flag for running "git
clean -Xf <long list of sparsified directories>" since it has logic to
compute what all those directories are anyway.

> >> +Phase III: Important command speedups
> >> +-------------------------------------
> >> +
> >> +At this point, the patterns for testing and implementing sparse-directory
> >> +logic should be relatively stable. This phase focuses on updating some of
> >> +the most common builtins that use the index to operate as O(Populated).
> >> +Here is a potential list of commands that could be valuable to integrate
> >> +at this point:
> >> +
> >> +* `git commit`
> >> +* `git checkout`
> >> +* `git merge`
> >> +* `git rebase`
> >> +
> >> +Along with `git status` and `git add`, these commands cover the majority
> >> +of users' interactions with the working directory.
> >
> > Sounds like a good plan as well.
> >
> > I hope we get to make this specific to the merge-ort backend.  It
> > localizes the index-related code to (a) a call to unpack_trees()
> > called from checkout-like code (which would probably automatically be
> > handled by your updates to git checkout), and (b) a single function
> > named record_conflicted_index_entries().  I feel it should be pretty
> > easy to update.
> >
> > In contrast, the idea of attempting to update merge-recursive with
> > this kind of change sounds overwhelming.
>
> Yes, I'm hoping to eventually say "if you are in a sparse-checkout, then
> you should use ORT by default." Then, if someone opts to do merge-recursive
> instead, then they pay the index expansion cost.
>
> While this is very clear in my head, it might be worth mentioning that
> explicitly here.

:+1:

> >>  In addition, we can
> >> +integrate with these commands:
> >> +
> >> +* `git grep`
> >> +* `git rm`
> >> +
> >> +These have been proposed as some whose behavior could change when in a
> >> +repo with a sparse-checkout definition. It would be good to include this
> >> +behavior automatically when using a sparse-index. Some clarity is needed
> >> +to make the behavior switch clear to the user.
> >
> > Is this leftover from before recent events?  I think this portion of
> > the document should just be stricken.
> >
> > I argued about how these were buggy as-is due SKIP_WORKTREE always
> > having been an incomplete implementation of an idea at [1], but didn't
> > hear a further response from you.  I'm curious if you disagreed with
> > my reasoning, or it just slipped through the cracks in a busy schedule
> > and this portion of the document was leftover from before.  In my
> > opinion, both commands are just buggy and should be fixed for general
> > sparse-checkout usage cases, not just for sparse-index.
>
> This is definitely a case of "I've been too busy to read those topics
> in detail." I figured that there was something going on that was relevant
> to the sparse-checkout and would affect the sparse-index, but I shelved
> it in my mind until I had space to think about it directly.
>
> > Anyway, that's a long way of saying I think this section of your
> > document is already obsolete.  (Which is a good thing -- less work to
> > do to get sparse-index working.  Thanks, Matheus!).
>
> Thank you for your summary, which helps a lot. Thanks, Matheus, too!
> If those fixes already include coverage for the behavior, then I'll see
> if they also translate to sparse-index tests easily.
>
> I feel like a lot of these later contributions will be more about adding
> tests than actually writing a lot of code.
>
> >> +This phase is the first where parallel work might be possible without too
> >> +much conflicts between topics.
> >> +
> >> +Phase IV: The long tail
> >> +-----------------------
> >> +
> >> +This last phase is less a "phase" and more "the new normal" after all of
> >> +the previous work.
> >> +
> >> +To start, the `command_requires_full_index` option could be removed in
> >> +favor of expanding only when hitting an API guard.
> >> +
> >> +There are many Git commands that could use special attention to operate as
> >> +O(Populated), while some might be so rare that it is acceptable to leave
> >> +them with additional overhead when a sparse-index is present.
> >> +
> >> +Here are some commands that might be useful to update:
> >> +
> >> +* `git sparse-checkout set`
> >> +* `git am`
> >> +* `git clean`
> >> +* `git stash`
> >
> > Oh, man, git stash is definitely in need of work.  It's still a
> > minimalistic transliteration of shell to C, complete with lots of
> > process forking and piping output between various low-level commands.
> > It might be interesting to rewrite this in terms of the merge
> > machinery, though its separate stashing of staged stuff, unstaged
> > stuff, and possibly untracked stuff means that there is a sequence of
> > two or three merges needed and interesting failure handling to do if
> > those merges fail, especially if the user uses --index.  But I
> > digress...
>
> I would prefer to leave 'git stash' alone, but it's used by enough
> people that I need to care about it eventually.

Oh, it can definitely come later.  And I agree about the desirability
of touching that code; I was avoiding it for a long time, but there
was one important sparse-checkout-related bug recently[1] so I've
already been forced to touch it once.  That might mean I'm
(eventually) on the hook to make it sparse-index friendly, especially
since it might involve using merge-ort to do so...

[1] https://lore.kernel.org/git/pull.919.git.git.1605891222.gitgitgadget@gmail.com/

> > Anyway, overall, very nicely written and planned out.  Thanks for
> > taking the time to write this all up.
>
> Thanks for your detailed comments!
> -Stolee

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 00/20] Sparse Index: Design, Format, Tests
  2021-02-23 23:49 ` [PATCH 00/20] Sparse Index: Design, Format, Tests Elijah Newren
@ 2021-02-26 21:28   ` Elijah Newren
  0 siblings, 0 replies; 203+ messages in thread
From: Elijah Newren @ 2021-02-26 21:28 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Derrick Stolee

On Tue, Feb 23, 2021 at 3:49 PM Elijah Newren <newren@gmail.com> wrote:
>
> On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
> >
> > Here is the first full patch series submission coming out of the
> > sparse-index RFC [1].
>
> Wahoo!  I'll be reading these over the next few days.

I finally finished the last five patches today, and didn't spot
anything on those to comment on.

Overall, I find the series well constructed, motivated, and explained.
I've left various comments on individual patches, but they're mostly
all minor things that should be easy to cleanup and/or just comment
on.

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 17/20] sparse-checkout: disable sparse-index
  2021-02-23 20:14 ` [PATCH 17/20] sparse-checkout: disable sparse-index Derrick Stolee via GitGitGadget
@ 2021-02-27 12:32   ` SZEDER Gábor
  2021-03-09 20:20     ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: SZEDER Gábor @ 2021-02-27 12:32 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

On Tue, Feb 23, 2021 at 08:14:26PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> 
> We use 'git sparse-checkout init --cone --sparse-index' to toggle the
> sparse-index feature. It makes sense to also disable it when running
> 'git sparse-checkout disable'. This is particularly important because it
> removes the extensions.sparseIndex config option, allowing other tools
> to use this Git repository again.
> 
> This does mean that 'git sparse-checkout init' will not re-enable the
> sparse-index feature, even if it was previously enabled.
> 
> While testing this feature, I noticed that the sparse-index was not
> being written on the first run, but by a second. This was caught by the
> call to 'test-tool read-cache --table'. This requires adjusting some
> assignments to core_apply_sparse_checkout and pl.use_cone_patterns in
> the sparse_checkout_init() logic.
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  builtin/sparse-checkout.c          | 10 +++++++++-
>  t/t1091-sparse-checkout-builtin.sh | 13 +++++++++++++
>  2 files changed, 22 insertions(+), 1 deletion(-)
> 
> diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
> index ca63e2c64e95..585343fa1972 100644
> --- a/builtin/sparse-checkout.c
> +++ b/builtin/sparse-checkout.c
> @@ -280,6 +280,9 @@ static int set_config(enum sparse_checkout_mode mode)
>  				      "core.sparseCheckoutCone",
>  				      mode == MODE_CONE_PATTERNS ? "true" : NULL);
>  
> +	if (mode == MODE_NO_PATTERNS)
> +		set_sparse_index_config(the_repository, 0);
> +
>  	return 0;
>  }
>  
> @@ -341,10 +344,11 @@ static int sparse_checkout_init(int argc, const char **argv)
>  		the_repository->index->updated_workdir = 1;
>  	}
>  
> +	core_apply_sparse_checkout = 1;
> +
>  	/* If we already have a sparse-checkout file, use it. */
>  	if (res >= 0) {
>  		free(sparse_filename);
> -		core_apply_sparse_checkout = 1;
>  		return update_working_directory(NULL);
>  	}
>  
> @@ -366,6 +370,7 @@ static int sparse_checkout_init(int argc, const char **argv)
>  	add_pattern(strbuf_detach(&pattern, NULL), empty_base, 0, &pl, 0);
>  	strbuf_addstr(&pattern, "!/*/");
>  	add_pattern(strbuf_detach(&pattern, NULL), empty_base, 0, &pl, 0);
> +	pl.use_cone_patterns = init_opts.cone_mode;
>  
>  	return write_patterns_and_update(&pl);
>  }
> @@ -632,6 +637,9 @@ static int sparse_checkout_disable(int argc, const char **argv)
>  	strbuf_addstr(&match_all, "/*");
>  	add_pattern(strbuf_detach(&match_all, NULL), empty_base, 0, &pl, 0);
>  
> +	prepare_repo_settings(the_repository);
> +	the_repository->settings.sparse_index = 0;
> +
>  	if (update_working_directory(&pl))
>  		die(_("error while refreshing working directory"));
>  
> diff --git a/t/t1091-sparse-checkout-builtin.sh b/t/t1091-sparse-checkout-builtin.sh
> index fc64e9ed99f4..ff1ad570a255 100755
> --- a/t/t1091-sparse-checkout-builtin.sh
> +++ b/t/t1091-sparse-checkout-builtin.sh
> @@ -205,6 +205,19 @@ test_expect_success 'sparse-checkout disable' '
>  	check_files repo a deep folder1 folder2
>  '
>  
> +test_expect_success 'sparse-index enabled and disabled' '
> +	git -C repo sparse-checkout init --cone --sparse-index &&
> +	test_cmp_config -C repo true extensions.sparseIndex &&
> +	test-tool -C repo read-cache --table >cache &&
> +	grep " tree " cache &&
> +
> +	git -C repo sparse-checkout disable &&
> +	test-tool -C repo read-cache --table >cache &&
> +	! grep " tree " cache &&
> +	git -C repo config --list >config &&
> +	! grep extensions.sparseindex config
> +'

This test passes with GIT_TEST_SPLIT_INDEX=1 at the moment, because,
unfortunately, GIT_TEST_SPLIT_INDEX has been broken for the past two
years.  However, if I run it with my WIP fixes for that issue [1],
then it will fail:

  +git -C repo sparse-checkout init --cone --sparse-index
  +test_cmp_config -C repo true extensions.sparseIndex
  +test-tool -C repo read-cache --table
  +grep  tree  cache
  error: last command exited with $?=1
  not ok 16 - sparse-index enabled and disabled

https://travis-ci.com/github/szeder/git-cooking-topics-for-travis-ci/jobs/486702444#L2594

[1] Try to run it with:

      https://github.com/szeder/git split-index-fixes

    The code is, I believe, close to final, the commit messages,
    however, are far from being finished.


> +
>  test_expect_success 'cone mode: init and set' '
>  	git -C repo sparse-checkout init --cone &&
>  	git -C repo config --list >config &&
> -- 
> gitgitgadget
> 

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 02/20] t/perf: add performance test for sparse operations
  2021-02-24  2:30   ` Elijah Newren
@ 2021-03-09 20:03     ` Derrick Stolee
  0 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee @ 2021-03-09 20:03 UTC (permalink / raw)
  To: Elijah Newren, Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee

On 2/23/2021 9:30 PM, Elijah Newren wrote:
> On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
> +test_expect_success 'setup repo and indexes' '
> +       git reset --hard HEAD &&
> +       # Remove submodules from the example repo, because our
> +       # duplication of the entire repo creates an unlikly data shape.
> +       git config --file .gitmodules --get-regexp "submodule.*.path" >modules &&
> +       rm -f .gitmodules &&
> +       git add .gitmodules &&
> Why not `git rm [-f] .gitmodules` instead of these two commands?  Is
> there something special about .gitmodules that requires this special
> handling?

No, I'm just being sloppy. Will clean up.

>> +       for module in $(awk "{print \$2}" modules)
>> +       do
>> +               git rm $module || return 1
>> +       done &&
>> +       git add . &&
> What does the `git add .` do?  I don't see any changes there weren't
> already git-add'ed or git-rm'ed.

Same here. Thanks.

-Stolee


^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 17/20] sparse-checkout: disable sparse-index
  2021-02-27 12:32   ` SZEDER Gábor
@ 2021-03-09 20:20     ` Derrick Stolee
  2021-03-10 18:20       ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee @ 2021-03-09 20:20 UTC (permalink / raw)
  To: SZEDER Gábor, Derrick Stolee via GitGitGadget
  Cc: git, newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

On 2/27/2021 7:32 AM, SZEDER Gábor wrote:
> On Tue, Feb 23, 2021 at 08:14:26PM +0000, Derrick Stolee via GitGitGadget wrote:
>> +test_expect_success 'sparse-index enabled and disabled' '
>> +	git -C repo sparse-checkout init --cone --sparse-index &&
>> +	test_cmp_config -C repo true extensions.sparseIndex &&
>> +	test-tool -C repo read-cache --table >cache &&
>> +	grep " tree " cache &&
>> +
>> +	git -C repo sparse-checkout disable &&
>> +	test-tool -C repo read-cache --table >cache &&
>> +	! grep " tree " cache &&
>> +	git -C repo config --list >config &&
>> +	! grep extensions.sparseindex config
>> +'
> 
> This test passes with GIT_TEST_SPLIT_INDEX=1 at the moment, because,
> unfortunately, GIT_TEST_SPLIT_INDEX has been broken for the past two
> years.  However, if I run it with my WIP fixes for that issue [1],
> then it will fail:
> 
>   +git -C repo sparse-checkout init --cone --sparse-index
>   +test_cmp_config -C repo true extensions.sparseIndex
>   +test-tool -C repo read-cache --table
>   +grep  tree  cache
>   error: last command exited with $?=1
>   not ok 16 - sparse-index enabled and disabled
> 
> https://travis-ci.com/github/szeder/git-cooking-topics-for-travis-ci/jobs/486702444#L2594
> 
> [1] Try to run it with:
> 
>       https://github.com/szeder/git split-index-fixes
> 
>     The code is, I believe, close to final, the commit messages,
>     however, are far from being finished.

I'll keep that in mind. I should have added a variable
that disables GIT_TEST_SPLIT_INDEX for this test script,
since the sparse-index is (currently) incompatible with
the split-index. I bet that the test is failing because
it isn't actually writing the sparse-directory entry due
to that short-circuit check.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 16/20] sparse-checkout: toggle sparse index from builtin
  2021-02-24 19:11   ` Martin Ågren
@ 2021-03-09 20:52     ` Derrick Stolee
  2021-03-09 21:03       ` Elijah Newren
  2021-03-14 20:08       ` Martin Ågren
  0 siblings, 2 replies; 203+ messages in thread
From: Derrick Stolee @ 2021-03-09 20:52 UTC (permalink / raw)
  To: Martin Ågren, Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Elijah Newren, Junio C Hamano,
	Nguyễn Thái Ngọc Duy, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee

On 2/24/2021 2:11 PM, Martin Ågren wrote:
> On Wed, 24 Feb 2021 at 00:57, Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>> +that is not completely understood by other tools. Enabling sparse index
>> +enables the `extensions.spareseIndex` config value, which might cause
> 
> s/sparese/sparse

Thanks!

 
>> +other tools to stop working with your repository. If you have trouble with
>> +this compatibility, then run `git sparse-checkout sparse-index disable` to
>> +remove this config and rewrite your index to not be sparse.
> 
> While I'm commenting on this..:
> 
> There are several "layers" here, for lack of a better term. "Enabling foo
> enables bar which may cause baz. If you fail due to baz, try dropping
> bar by dropping foo." If I remove any mention of the config variable from
> your text, I get the following.
> 
>  Enabling sparse index might cause other tools to stop working with your
>  repository. If you have trouble with this compatibility, then run `git
>  sparse-checkout sparse-index disable` to rewrite your index to not be
>  sparse.
> 
> I'm tempted to suggest such a rewrite to relieve readers of knowing of
> the middle step, which you could say is more of an implementation
> detail. But if we think that the symptoms / error messages might involve
> "extensions.sparseIndex" or, as would be the case with an older Git
> installation,
> 
>   fatal: unknown repository extensions found:
>           sparseindex
> 
> maybe there is some value in mentioning the config item by name. Just
> thinking out loud, really, and I don't have any strong opinion. I only
> came here to point out the typo in the docs.
 
I agree that the layers are confusing. We could rearrange and have
a similar flow to what you recommend by mentioning the extension at
the end:

**WARNING:** Using a sparse index requires modifying the index in a way
that is not completely understood by other tools. If you have trouble with
this compatibility, then run `git sparse-checkout sparse-index disable` to
rewrite your index to not be sparse. Older versions of Git will not
understand the `sparseIndex` repository extension and may fail to interact
with your repository until it is disabled.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 07/20] test-read-cache: print cache entries with --table
  2021-02-25  7:02   ` Elijah Newren
@ 2021-03-09 21:00     ` Derrick Stolee
  0 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee @ 2021-03-09 21:00 UTC (permalink / raw)
  To: Elijah Newren, Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee

On 2/25/2021 2:02 AM, Elijah Newren wrote:
> On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> This table is helpful for discovering data in the index to ensure it is
>> being written correctly, especially as we build and test the
>> sparse-index. This table includes an output format similar to 'git
>> ls-tree', but should not be compared to that directly. The biggest
>> reasons are that 'git ls-tree' includes a tree entry for every
>> subdirectory, even those that would not appear as a sparse directory in
>> a sparse-index. Further, 'git ls-tree' does not use a trailing directory
>> separator for its tree rows.
>>
>> This does not print the stat() information for the blobs. That could be
>> added in a future change with another option. The tests that are added
>> in the next few changes care only about the object types and IDs.
>>
>> To make the option parsing slightly more robust, wrap the string
>> comparisons in a loop adapted from test-dir-iterator.c.
>>
>> Care must be taken with the final check for the 'cnt' variable. We
>> continue the expectation that the numerical value is the final argument.
>>
>> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>> ---
>>  t/helper/test-read-cache.c | 50 ++++++++++++++++++++++++++++++--------
>>  1 file changed, 40 insertions(+), 10 deletions(-)
>>
>> diff --git a/t/helper/test-read-cache.c b/t/helper/test-read-cache.c
>> index 244977a29bdf..e4c3492f7d3e 100644
>> --- a/t/helper/test-read-cache.c
>> +++ b/t/helper/test-read-cache.c
>> @@ -2,35 +2,65 @@
>>  #include "cache.h"
>>  #include "config.h"
>>
>> +static void print_cache_entry(struct cache_entry *ce)
>> +{
>> +       printf("%06o ", ce->ce_mode & 0777777);
> 
> This constant is curious.  I think it's because you want to strip off
> the special in-memory bits of the ce_mode where git stores extra data,
> which would be everything beyond the first 16 bits (as noted in a
> comment near the beginning of cache.h).  But here you keep the first
> 18 bits.  Is CE_UPDATE and CE_REMOVE just 0 in the cases you've viewed
> so this works (but you really should use 0177777 or 0xFFFF), or am I
> off in my guess of what you're trying to do and you do want to see
> those two flags?

You're right, 0177777 is what I want. I'm focusing only on the
file permissions bits that are reported by ls-tree.

> It also seems surprising to me that this constant isn't already
> defined somewhere in cache.h or as some variant of S_IFMT, though I'm
> not finding it.

I'm not, either.

>> +
>> +       if (S_ISSPARSEDIR(ce->ce_mode))
>> +               printf("tree ");
>> +       else if (S_ISGITLINK(ce->ce_mode))
>> +               printf("commit ");
>> +       else
>> +               printf("blob ");
> 
> Perhaps make use of the tree_type, commit_type, and blob_type global constants?

Today I Learned...

>> +
>> +       printf("%s\t%s\n",
>> +              oid_to_hex(&ce->oid),
>> +              ce->name);
>> +}
>> +
>> +static void print_cache(struct index_state *cache)
>> +{
>> +       int i;
>> +       for (i = 0; i < the_index.cache_nr; i++)
>> +               print_cache_entry(the_index.cache[i]);
> 
> Why are you passing cache as a parameter, then ignoring it and using the_index?

Good catch!

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 16/20] sparse-checkout: toggle sparse index from builtin
  2021-03-09 20:52     ` Derrick Stolee
@ 2021-03-09 21:03       ` Elijah Newren
  2021-03-09 21:10         ` Derrick Stolee
  2021-03-14 20:08       ` Martin Ågren
  1 sibling, 1 reply; 203+ messages in thread
From: Elijah Newren @ 2021-03-09 21:03 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Martin Ågren, Derrick Stolee via GitGitGadget,
	Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc Duy, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee

On Tue, Mar 9, 2021 at 12:52 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 2/24/2021 2:11 PM, Martin Ågren wrote:
> > On Wed, 24 Feb 2021 at 00:57, Derrick Stolee via GitGitGadget
> > <gitgitgadget@gmail.com> wrote:
> >> +that is not completely understood by other tools. Enabling sparse index
> >> +enables the `extensions.spareseIndex` config value, which might cause
> >
> > s/sparese/sparse
>
> Thanks!
>
>
> >> +other tools to stop working with your repository. If you have trouble with
> >> +this compatibility, then run `git sparse-checkout sparse-index disable` to
> >> +remove this config and rewrite your index to not be sparse.
> >
> > While I'm commenting on this..:
> >
> > There are several "layers" here, for lack of a better term. "Enabling foo
> > enables bar which may cause baz. If you fail due to baz, try dropping
> > bar by dropping foo." If I remove any mention of the config variable from
> > your text, I get the following.
> >
> >  Enabling sparse index might cause other tools to stop working with your
> >  repository. If you have trouble with this compatibility, then run `git
> >  sparse-checkout sparse-index disable` to rewrite your index to not be
> >  sparse.
> >
> > I'm tempted to suggest such a rewrite to relieve readers of knowing of
> > the middle step, which you could say is more of an implementation
> > detail. But if we think that the symptoms / error messages might involve
> > "extensions.sparseIndex" or, as would be the case with an older Git
> > installation,
> >
> >   fatal: unknown repository extensions found:
> >           sparseindex
> >
> > maybe there is some value in mentioning the config item by name. Just
> > thinking out loud, really, and I don't have any strong opinion. I only
> > came here to point out the typo in the docs.
>
> I agree that the layers are confusing. We could rearrange and have
> a similar flow to what you recommend by mentioning the extension at
> the end:
>
> **WARNING:** Using a sparse index requires modifying the index in a way
> that is not completely understood by other tools. If you have trouble with
> this compatibility, then run `git sparse-checkout sparse-index disable` to
> rewrite your index to not be sparse. Older versions of Git will not
> understand the `sparseIndex` repository extension and may fail to interact
> with your repository until it is disabled.
>
> Thanks,
> -Stolee

This looks pretty good to me, but could we change the first sentence
to read "...modifying the index in a way that may not yet be
understood by external tools." ?  I'm worried "other tools" might make
people worry about different builtin commands (e.g. fast-export, log).
I also prefer "may" and "yet" because I suspect most external tools
(e.g. git filter-repo just to name a personal example) won't need to
read an index format and will thus be unaffected, and any tools that
do read the index format will probably eventually learn how to work
with the new format.

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 16/20] sparse-checkout: toggle sparse index from builtin
  2021-03-09 21:03       ` Elijah Newren
@ 2021-03-09 21:10         ` Derrick Stolee
  2021-03-09 21:38           ` Elijah Newren
  0 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee @ 2021-03-09 21:10 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Martin Ågren, Derrick Stolee via GitGitGadget,
	Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc Duy, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee

On 3/9/2021 4:03 PM, Elijah Newren wrote:
> On Tue, Mar 9, 2021 at 12:52 PM Derrick Stolee <stolee@gmail.com> wrote:
>>
>> On 2/24/2021 2:11 PM, Martin Ågren wrote:
>>> There are several "layers" here, for lack of a better term. "Enabling foo
>>> enables bar which may cause baz. If you fail due to baz, try dropping
>>> bar by dropping foo." If I remove any mention of the config variable from
>>> your text, I get the following.
>>>
>>>  Enabling sparse index might cause other tools to stop working with your
>>>  repository. If you have trouble with this compatibility, then run `git
>>>  sparse-checkout sparse-index disable` to rewrite your index to not be
>>>  sparse.
>>>
>>> I'm tempted to suggest such a rewrite to relieve readers of knowing of
>>> the middle step, which you could say is more of an implementation
>>> detail. But if we think that the symptoms / error messages might involve
>>> "extensions.sparseIndex" or, as would be the case with an older Git
>>> installation,
>>>
>>>   fatal: unknown repository extensions found:
>>>           sparseindex
>>>
>>> maybe there is some value in mentioning the config item by name. Just
>>> thinking out loud, really, and I don't have any strong opinion. I only
>>> came here to point out the typo in the docs.
>>
>> I agree that the layers are confusing. We could rearrange and have
>> a similar flow to what you recommend by mentioning the extension at
>> the end:
>>
>> **WARNING:** Using a sparse index requires modifying the index in a way
>> that is not completely understood by other tools. If you have trouble with
>> this compatibility, then run `git sparse-checkout sparse-index disable` to
>> rewrite your index to not be sparse. Older versions of Git will not
>> understand the `sparseIndex` repository extension and may fail to interact
>> with your repository until it is disabled.
>>
>> Thanks,
>> -Stolee
> 
> This looks pretty good to me, but could we change the first sentence
> to read "...modifying the index in a way that may not yet be
> understood by external tools." ?  I'm worried "other tools" might make
> people worry about different builtin commands (e.g. fast-export, log).
> I also prefer "may" and "yet" because I suspect most external tools
> (e.g. git filter-repo just to name a personal example) won't need to
> read an index format and will thus be unaffected, and any tools that
> do read the index format will probably eventually learn how to work
> with the new format.

I can make the change, but I do want to point out that the current
use of a repository extension _does_ mean that tools that (correctly)
interact with a Git repository should fail even if they don't try to
access the index file. This is only something to make this work until
we introduce a new index file format version and then can drop the
extension.

"git filter-repo" _should_ be safe because it's really just shelling
to Git, right? I'm more concerned about tools like libgit2.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 11/20] sparse-index: convert from full to sparse
  2021-02-25  7:33   ` Elijah Newren
@ 2021-03-09 21:13     ` Derrick Stolee
  0 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee @ 2021-03-09 21:13 UTC (permalink / raw)
  To: Elijah Newren, Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee

On 2/25/2021 2:33 AM, Elijah Newren wrote:
> On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:

>> +                       /*
>> +                        * allow terminating directory separators for
>> +                        * sparse directory enries.
> 
> enries -> entries

Thanks.

>> +                        */
>> +                       if (c == '\0')
>> +                               return S_ISDIR(mode);
> 
> Yaay, much simpler (than the RFC version).

>> +       /*
>> +        * Is the current path outside of the sparse cone?
>> +        * Then check if the region can be replaced by a sparse
>> +        * directory entry (everything is sparse and merged).
>> +        */
>> +       match = path_matches_pattern_list(ct_path, ct_pathlen,
>> +                                         NULL, &dtype, pl, istate);
>> +       if (match != NOT_MATCHED)
>> +               can_convert = 0;
> 
> Not sure if you saw my comments on the flow control at
> https://lore.kernel.org/git/CABPp-BE9wPwmC0=pA4p1_QSRDHrO8RzqfJQdE2NxYZsYL_Rcig@mail.gmail.com/
> (the typos elsewhere seem to still be present).  If you saw it and
> decided against it, that's fine, just wanted the idea to at least be
> floated.

Sorry for dropping this one. I _did_ decide against it, and
primarily because the "if (can_convert)" condition contains
a return statement. I like to use 'gotos' for blocks that
will eventually be entered by all paths through the code,
such as "goto cleanup;" but here I find the "can_convert"
check to be clearer.

>> +               /*
>> +                * Detect if this is a normal entry oustide of any subtree
> 
> s/oustide/outside/

Got it.

>> +test_expect_success 'sparse-index contents' '
>> +       init_repos &&
>> +
>> +       test-tool -C sparse-index read-cache --table >cache &&
>> +       for dir in folder1 folder2 x
>> +       do
>> +               TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
>> +               grep "040000 tree $TREE $dir/" cache \
>> +                       || return 1
>> +       done &&
> 
> Thanks for making the output look more like ls-tree output; it's
> easier to parse that way, at least for me.

Excellent.
 
> I mostly read over the range-diff since it was much shorter.  You've
> addressed a number of questions/comments I had on the RFC version, but
> there's still some I didn't see a response to so I reposted them here.
 
Thanks for being diligent!
-Stolee

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 13/20] unpack-trees: allow sparse directories
  2021-02-25  7:40   ` Elijah Newren
@ 2021-03-09 21:35     ` Derrick Stolee
  2021-03-09 21:39       ` Elijah Newren
  0 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee @ 2021-03-09 21:35 UTC (permalink / raw)
  To: Elijah Newren, Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee

On 2/25/2021 2:40 AM, Elijah Newren wrote:
> On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> The index_pos_by_traverse_info() currently throws a BUG() when a
>> directory entry exists exactly in the index. We need to consider that it
>> is possible to have a directory in a sparse index as long as that entry
>> is itself marked with the skip-worktree bit.
>>
>> The negation of the 'pos' variable must be conditioned to only when it
>> starts as negative. This is identical behavior as before when the index
>> is full.
> 
> Same comment on the second paragraph as I made in the RFC series --
> https://lore.kernel.org/git/CABPp-BGPJgA4guWHVm3AVS=hM0fTixUpRvJe5i9NnHT-3QJMfw@mail.gmail.com/.
> I apologize if I'm repeating stuff you chose to not change, but I
> didn't see a response and given the three typos left in previous
> patches, I'm unsure whether it was unaddressed on purpose or on
> accident.

Yes, I dropped this one. How about this?

    The 'pos' variable is assigned a negative value if an exact match is not
    found. Since a directory name can be an exact match, it is no longer an
    error to have a nonnegative 'pos' value.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 16/20] sparse-checkout: toggle sparse index from builtin
  2021-03-09 21:10         ` Derrick Stolee
@ 2021-03-09 21:38           ` Elijah Newren
  0 siblings, 0 replies; 203+ messages in thread
From: Elijah Newren @ 2021-03-09 21:38 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Martin Ågren, Derrick Stolee via GitGitGadget,
	Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc Duy, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee

On Tue, Mar 9, 2021 at 1:10 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 3/9/2021 4:03 PM, Elijah Newren wrote:
> > On Tue, Mar 9, 2021 at 12:52 PM Derrick Stolee <stolee@gmail.com> wrote:
> >>
> >> On 2/24/2021 2:11 PM, Martin Ågren wrote:
> >>> There are several "layers" here, for lack of a better term. "Enabling foo
> >>> enables bar which may cause baz. If you fail due to baz, try dropping
> >>> bar by dropping foo." If I remove any mention of the config variable from
> >>> your text, I get the following.
> >>>
> >>>  Enabling sparse index might cause other tools to stop working with your
> >>>  repository. If you have trouble with this compatibility, then run `git
> >>>  sparse-checkout sparse-index disable` to rewrite your index to not be
> >>>  sparse.
> >>>
> >>> I'm tempted to suggest such a rewrite to relieve readers of knowing of
> >>> the middle step, which you could say is more of an implementation
> >>> detail. But if we think that the symptoms / error messages might involve
> >>> "extensions.sparseIndex" or, as would be the case with an older Git
> >>> installation,
> >>>
> >>>   fatal: unknown repository extensions found:
> >>>           sparseindex
> >>>
> >>> maybe there is some value in mentioning the config item by name. Just
> >>> thinking out loud, really, and I don't have any strong opinion. I only
> >>> came here to point out the typo in the docs.
> >>
> >> I agree that the layers are confusing. We could rearrange and have
> >> a similar flow to what you recommend by mentioning the extension at
> >> the end:
> >>
> >> **WARNING:** Using a sparse index requires modifying the index in a way
> >> that is not completely understood by other tools. If you have trouble with
> >> this compatibility, then run `git sparse-checkout sparse-index disable` to
> >> rewrite your index to not be sparse. Older versions of Git will not
> >> understand the `sparseIndex` repository extension and may fail to interact
> >> with your repository until it is disabled.
> >>
> >> Thanks,
> >> -Stolee
> >
> > This looks pretty good to me, but could we change the first sentence
> > to read "...modifying the index in a way that may not yet be
> > understood by external tools." ?  I'm worried "other tools" might make
> > people worry about different builtin commands (e.g. fast-export, log).
> > I also prefer "may" and "yet" because I suspect most external tools
> > (e.g. git filter-repo just to name a personal example) won't need to
> > read an index format and will thus be unaffected, and any tools that
> > do read the index format will probably eventually learn how to work
> > with the new format.
>
> I can make the change, but I do want to point out that the current
> use of a repository extension _does_ mean that tools that (correctly)
> interact with a Git repository should fail even if they don't try to
> access the index file. This is only something to make this work until
> we introduce a new index file format version and then can drop the
> extension.

Good point, though...

> "git filter-repo" _should_ be safe because it's really just shelling
> to Git, right? I'm more concerned about tools like libgit2.

Yes, libgit2 and jgit and similar tools are clearly going to be
affected and deeply.  Those are of concern, but I suspect most users
when they see "external tools" will be thinking of the large multitude
of scripts out there that just shell out to git under the hood to
provide some higher level wrapper of some sort.  And anything that
operates that way won't be affected directly by the repository
extension.

I'm not sure I'd even mark things that shell out to git as _should_ be
safe.  In general, scripts can make all kinds of assumptions on
interpreting output, and I suspect some of those may become
invalidated by this new feature.  We have a recent guidepost that's
very close to home on that too -- git stash had *3* different bugs in
it once sparse-checkouts were introduced, based on the fact that it
was designed as a just-shell-out-to-low-level-git-commands script and
it made assumptions on how those commands worked together.  See
https://lore.kernel.org/git/ccfedc7140dbf63ba26a15f93bd3885180b26517.1606861519.git.gitgitgadget@gmail.com/.
Sure git-stash is a builtin (supposedly, anyway), but external tools
can make similar logical jumps.

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 13/20] unpack-trees: allow sparse directories
  2021-03-09 21:35     ` Derrick Stolee
@ 2021-03-09 21:39       ` Elijah Newren
  0 siblings, 0 replies; 203+ messages in thread
From: Elijah Newren @ 2021-03-09 21:39 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, Git Mailing List,
	Junio C Hamano, Nguyễn Thái Ngọc,
	Jonathan Nieder, Derrick Stolee, Derrick Stolee

On Tue, Mar 9, 2021 at 1:35 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 2/25/2021 2:40 AM, Elijah Newren wrote:
> > On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
> > <gitgitgadget@gmail.com> wrote:
> >>
> >> From: Derrick Stolee <dstolee@microsoft.com>
> >>
> >> The index_pos_by_traverse_info() currently throws a BUG() when a
> >> directory entry exists exactly in the index. We need to consider that it
> >> is possible to have a directory in a sparse index as long as that entry
> >> is itself marked with the skip-worktree bit.
> >>
> >> The negation of the 'pos' variable must be conditioned to only when it
> >> starts as negative. This is identical behavior as before when the index
> >> is full.
> >
> > Same comment on the second paragraph as I made in the RFC series --
> > https://lore.kernel.org/git/CABPp-BGPJgA4guWHVm3AVS=hM0fTixUpRvJe5i9NnHT-3QJMfw@mail.gmail.com/.
> > I apologize if I'm repeating stuff you chose to not change, but I
> > didn't see a response and given the three typos left in previous
> > patches, I'm unsure whether it was unaddressed on purpose or on
> > accident.
>
> Yes, I dropped this one. How about this?
>
>     The 'pos' variable is assigned a negative value if an exact match is not
>     found. Since a directory name can be an exact match, it is no longer an
>     error to have a nonnegative 'pos' value.

I like it!

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 15/20] sparse-index: create extension for compatibility
  2021-02-25  7:45   ` Elijah Newren
@ 2021-03-09 21:45     ` Derrick Stolee
  0 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee @ 2021-03-09 21:45 UTC (permalink / raw)
  To: Elijah Newren, Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Derrick Stolee, Derrick Stolee

On 2/25/2021 2:45 AM, Elijah Newren wrote:
> On Tue, Feb 23, 2021 at 12:14 PM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>> Previously, we enabled the sparse index format only using
>> GIT_TEST_SPARSE_INDEX=1. This is not a feasible direction for users to
>> actually select this mode. Further, sparse directory entries are not
>> understood by the index formats as advertised.
>>
>> We _could_ add a new index version that explicitly adds these
>> capabilities, but there are nuances to index formats 2, 3, and 4 that
>> are still valuable to select as options. For now, create a repo
>> extension, "extensions.sparseIndex", that specifies that the tool
>> reading this repository must understand sparse directory entries.
> 
> This commit is unchanged from the RFC series, but given your comments
> in the design document about how you do intend to create an index
> format v5 now, do you want to reference that here?

I'll insert detail about v5.
 
>> +extensions.sparseIndex::
>> +       When combined with `core.sparseCheckout=true` and
>> +       `core.sparseCheckoutCone=true`, the index may contain entries
>> +       corresponding to directories outside of the sparse-checkout
>> +       definition. Versions of Git that do not understand this extension
>> +       do not expect directory entries in the index.
> 
> I had a wording suggestion for this paragraph in the RFC series --
> https://lore.kernel.org/git/CABPp-BFEJE82k4VgkR=Jf7V7sZxZzo2pHMfAGshhi9_vV6iK0w@mail.gmail.com/.
> Let me know if you just decided to leave it out so I don't bug you
> about stuff you already considered.

I'll take your suggestion, thanks.

-Stolee

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 17/20] sparse-checkout: disable sparse-index
  2021-03-09 20:20     ` Derrick Stolee
@ 2021-03-10 18:20       ` Derrick Stolee
  0 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee @ 2021-03-10 18:20 UTC (permalink / raw)
  To: SZEDER Gábor, Derrick Stolee via GitGitGadget
  Cc: git, newren, gitster, pclouds, jrnieder, Derrick Stolee, Derrick Stolee

On 3/9/2021 3:20 PM, Derrick Stolee wrote:
> On 2/27/2021 7:32 AM, SZEDER Gábor wrote:
>> On Tue, Feb 23, 2021 at 08:14:26PM +0000, Derrick Stolee via GitGitGadget wrote:
>>> +test_expect_success 'sparse-index enabled and disabled' '
>>> +	git -C repo sparse-checkout init --cone --sparse-index &&
>>> +	test_cmp_config -C repo true extensions.sparseIndex &&
>>> +	test-tool -C repo read-cache --table >cache &&
>>> +	grep " tree " cache &&
>>> +
>>> +	git -C repo sparse-checkout disable &&
>>> +	test-tool -C repo read-cache --table >cache &&
>>> +	! grep " tree " cache &&
>>> +	git -C repo config --list >config &&
>>> +	! grep extensions.sparseindex config
>>> +'
>>
>> This test passes with GIT_TEST_SPLIT_INDEX=1 at the moment, because,
>> unfortunately, GIT_TEST_SPLIT_INDEX has been broken for the past two
>> years.  However, if I run it with my WIP fixes for that issue [1],
>> then it will fail:
>>
>>   +git -C repo sparse-checkout init --cone --sparse-index
>>   +test_cmp_config -C repo true extensions.sparseIndex
>>   +test-tool -C repo read-cache --table
>>   +grep  tree  cache
>>   error: last command exited with $?=1
>>   not ok 16 - sparse-index enabled and disabled
>>
>> https://travis-ci.com/github/szeder/git-cooking-topics-for-travis-ci/jobs/486702444#L2594
>>
>> [1] Try to run it with:
>>
>>       https://github.com/szeder/git split-index-fixes
>>
>>     The code is, I believe, close to final, the commit messages,
>>     however, are far from being finished.
> 
> I'll keep that in mind. I should have added a variable
> that disables GIT_TEST_SPLIT_INDEX for this test script,
> since the sparse-index is (currently) incompatible with
> the split-index. I bet that the test is failing because
> it isn't actually writing the sparse-directory entry due
> to that short-circuit check.

The next version will include GIT_TEST_SPLIT_INDEX=0 at
the start and that will make it work with your branch.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v2 00/20] Sparse Index: Design, Format, Tests
  2021-02-23 20:14 [PATCH 00/20] Sparse Index: Design, Format, Tests Derrick Stolee via GitGitGadget
                   ` (20 preceding siblings ...)
  2021-02-23 23:49 ` [PATCH 00/20] Sparse Index: Design, Format, Tests Elijah Newren
@ 2021-03-10 19:30 ` Derrick Stolee via GitGitGadget
  2021-03-10 19:30   ` [PATCH v2 01/20] sparse-index: design doc and format update Derrick Stolee via GitGitGadget
                     ` (21 more replies)
  21 siblings, 22 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:30 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee

Here is the first full patch series submission coming out of the
sparse-index RFC [1].

[1]
https://lore.kernel.org/git/pull.847.git.1611596533.gitgitgadget@gmail.com/

I won't waste too much space here, because PATCH 1 includes a sizeable
design document that describes the feature, the reasoning behind it, and my
plan for getting this implemented widely throughout the codebase.

There are some new things here that were not in the RFC:

 * Design doc and format updates. (Patch 1)
 * Performance test script. (Patches 2 and 20)

Notably missing in this series from the RFC:

 * The mega-patch inserting ensure_full_index() throughout the codebase.
   That will be a follow-up series to this one.
 * The integrations with git status and git add to demonstrate the improved
   performance. Those will also appear in their own series later.

I plan to keep my latest work in this area in my 'sparse-index/wip' branch
[2]. It includes all of the work from the RFC right now, updated with the
work from this series.

[2] https://github.com/derrickstolee/git/tree/sparse-index/wip


Updates in V2
=============

 * Various typos and awkward grammar is fixed.
 * Cleaned up unnecessary commands in p2000-sparse-operations.sh
 * Added a comment to the sparse_index member of struct index_state.
 * Used tree_type, commit_type, and blob_type in test-read-cache.c.

Thanks, -Stolee

Derrick Stolee (20):
  sparse-index: design doc and format update
  t/perf: add performance test for sparse operations
  t1092: clean up script quoting
  sparse-index: add guard to ensure full index
  sparse-index: implement ensure_full_index()
  t1092: compare sparse-checkout to sparse-index
  test-read-cache: print cache entries with --table
  test-tool: don't force full index
  unpack-trees: ensure full index
  sparse-checkout: hold pattern list in index
  sparse-index: convert from full to sparse
  submodule: sparse-index should not collapse links
  unpack-trees: allow sparse directories
  sparse-index: check index conversion happens
  sparse-index: create extension for compatibility
  sparse-checkout: toggle sparse index from builtin
  sparse-checkout: disable sparse-index
  cache-tree: integrate with sparse directory entries
  sparse-index: loose integration with cache_tree_verify()
  p2000: add sparse-index repos

 Documentation/config/extensions.txt      |   8 +
 Documentation/git-sparse-checkout.txt    |  14 ++
 Documentation/technical/index-format.txt |   7 +
 Documentation/technical/sparse-index.txt | 173 ++++++++++++++
 Makefile                                 |   1 +
 builtin/sparse-checkout.c                |  44 +++-
 cache-tree.c                             |  40 ++++
 cache.h                                  |  18 +-
 read-cache.c                             |  35 ++-
 repo-settings.c                          |  15 ++
 repository.c                             |  11 +-
 repository.h                             |   3 +
 setup.c                                  |   3 +
 sparse-index.c                           | 290 +++++++++++++++++++++++
 sparse-index.h                           |  11 +
 t/README                                 |   3 +
 t/helper/test-read-cache.c               |  66 +++++-
 t/perf/p2000-sparse-operations.sh        | 102 ++++++++
 t/t1091-sparse-checkout-builtin.sh       |  13 +
 t/t1092-sparse-checkout-compatibility.sh | 136 +++++++++--
 unpack-trees.c                           |  16 +-
 21 files changed, 969 insertions(+), 40 deletions(-)
 create mode 100644 Documentation/technical/sparse-index.txt
 create mode 100644 sparse-index.c
 create mode 100644 sparse-index.h
 create mode 100755 t/perf/p2000-sparse-operations.sh


base-commit: 966e671106b2fd38301e7c344c754fd118d0bb07
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-883%2Fderrickstolee%2Fsparse-index%2Fformat-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-883/derrickstolee/sparse-index/format-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/883

Range-diff vs v1:

  1:  daa9a6bcefbc !  1:  2fe413fdac80 sparse-index: design doc and format update
     @@ Documentation/technical/sparse-index.txt (new)
      +If we need to discover the details for paths within that directory, we
      +can parse trees to find that list.
      +
     -+This addition of sparse-directory entries violates expectations about the
     ++At time of writing, sparse-directory entries violate expectations about the
      +index format and its in-memory data structure. There are many consumers in
      +the codebase that expect to iterate through all of the index entries and
      +see only files. In addition, they expect to see all files at `HEAD`. One
     @@ Documentation/technical/sparse-index.txt (new)
      +* `git merge`
      +* `git rebase`
      +
     ++Hopefully, commands such as `git merge` and `git rebase` can benefit
     ++instead from merge algorithms that do not use the index as a data
     ++structure, such as the merge-ORT strategy. As these topics mature, we
     ++may enalbe the ORT strategy by default for repositories using the
     ++sparse-index feature.
     ++
      +Along with `git status` and `git add`, these commands cover the majority
      +of users' interactions with the working directory. In addition, we can
      +integrate with these commands:
  2:  a8c6322a3dbe !  2:  540ab5495065 t/perf: add performance test for sparse operations
     @@ t/perf/p2000-sparse-operations.sh (new)
      +	# Remove submodules from the example repo, because our
      +	# duplication of the entire repo creates an unlikly data shape.
      +	git config --file .gitmodules --get-regexp "submodule.*.path" >modules &&
     -+	rm -f .gitmodules &&
     -+	git add .gitmodules &&
     ++	git rm -f .gitmodules &&
      +	for module in $(awk "{print \$2}" modules)
      +	do
      +		git rm $module || return 1
      +	done &&
     -+	git add . &&
      +	git commit -m "remove submodules" &&
      +
      +	echo bogus >a &&
  3:  6e783c88821e =  3:  5cbedb377b37 t1092: clean up script quoting
  4:  01da4c48a1fa =  4:  6e21f776e883 sparse-index: add guard to ensure full index
  5:  2b83989fbcd3 !  5:  399ddb0bad56 sparse-index: implement ensure_full_index()
     @@ cache.h: struct index_state {
       		 updated_skipworktree : 1,
      -		 fsmonitor_has_run_once : 1;
      +		 fsmonitor_has_run_once : 1,
     ++
     ++		 /*
     ++		  * sparse_index == 1 when sparse-directory
     ++		  * entries exist. Requires sparse-checkout
     ++		  * in cone mode.
     ++		  */
      +		 sparse_index : 1;
       	struct hashmap name_hash;
       	struct hashmap dir_hash;
  6:  c9910a37579c =  6:  eac2db5efc22 t1092: compare sparse-checkout to sparse-index
  7:  3d92df7a0cf9 !  7:  e9c82d2eda82 test-read-cache: print cache entries with --table
     @@ Commit message
      
       ## t/helper/test-read-cache.c ##
      @@
     + #include "test-tool.h"
       #include "cache.h"
       #include "config.h"
     - 
     ++#include "blob.h"
     ++#include "commit.h"
     ++#include "tree.h"
     ++
      +static void print_cache_entry(struct cache_entry *ce)
      +{
     -+	printf("%06o ", ce->ce_mode & 0777777);
     ++	const char *type;
     ++	printf("%06o ", ce->ce_mode & 0177777);
      +
      +	if (S_ISSPARSEDIR(ce->ce_mode))
     -+		printf("tree ");
     ++		type = tree_type;
      +	else if (S_ISGITLINK(ce->ce_mode))
     -+		printf("commit ");
     ++		type = commit_type;
      +	else
     -+		printf("blob ");
     ++		type = blob_type;
      +
     -+	printf("%s\t%s\n",
     ++	printf("%s %s\t%s\n",
     ++	       type,
      +	       oid_to_hex(&ce->oid),
      +	       ce->name);
      +}
      +
     -+static void print_cache(struct index_state *cache)
     ++static void print_cache(struct index_state *istate)
      +{
      +	int i;
     -+	for (i = 0; i < the_index.cache_nr; i++)
     -+		print_cache_entry(the_index.cache[i]);
     ++	for (i = 0; i < istate->cache_nr; i++)
     ++		print_cache_entry(istate->cache[i]);
      +}
     -+
     + 
       int cmd__read_cache(int argc, const char **argv)
       {
      +	struct repository *r = the_repository;
  8:  94373e2bfbbc !  8:  243541fc5820 test-tool: don't force full index
     @@ Commit message
      
       ## t/helper/test-read-cache.c ##
      @@
     - #include "test-tool.h"
     - #include "cache.h"
     - #include "config.h"
     + #include "blob.h"
     + #include "commit.h"
     + #include "tree.h"
      +#include "sparse-index.h"
       
       static void print_cache_entry(struct cache_entry *ce)
  9:  e71f033c2871 =  9:  48f65093b3da unpack-trees: ensure full index
 10:  f86d3dc154d1 ! 10:  83aac8b7a1ec sparse-checkout: hold pattern list in index
     @@ Commit message
          pattern set, we need access to that in-memory copy. Place a pointer to
          a 'struct pattern_list' in the index so we can access this on-demand.
          This will be used in the next change which uses the sparse-checkout
     -    definition to filter out directories that are outsie the sparse cone.
     +    definition to filter out directories that are outside the sparse cone.
      
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
 11:  a2d77c23a0cb ! 11:  f6db0c27a285 sparse-index: convert from full to sparse
     @@ read-cache.c: int verify_path(const char *path, unsigned mode)
       				return 0;
      +			/*
      +			 * allow terminating directory separators for
     -+			 * sparse directory enries.
     ++			 * sparse directory entries.
      +			 */
      +			if (c == '\0')
      +				return S_ISDIR(mode);
     @@ sparse-index.c
      +		struct cache_entry *ce = istate->cache[i];
      +
      +		/*
     -+		 * Detect if this is a normal entry oustide of any subtree
     ++		 * Detect if this is a normal entry outside of any subtree
      +		 * entry.
      +		 */
      +		base = ce->name + ct_pathlen;
 12:  4405a9115c3b = 12:  f2a3e7298798 submodule: sparse-index should not collapse links
 13:  fda23f07e6a2 ! 13:  6f1ebe6ccc08 unpack-trees: allow sparse directories
     @@ Commit message
          is possible to have a directory in a sparse index as long as that entry
          is itself marked with the skip-worktree bit.
      
     -    The negation of the 'pos' variable must be conditioned to only when it
     -    starts as negative. This is identical behavior as before when the index
     -    is full.
     +    The 'pos' variable is assigned a negative value if an exact match is not
     +    found. Since a directory name can be an exact match, it is no longer an
     +    error to have a nonnegative 'pos' value.
      
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
 14:  7d4627574bb8 = 14:  3fa684b315fb sparse-index: check index conversion happens
 15:  564503f78784 ! 15:  d74576d677f6 sparse-index: create extension for compatibility
     @@ Commit message
      
          We _could_ add a new index version that explicitly adds these
          capabilities, but there are nuances to index formats 2, 3, and 4 that
     -    are still valuable to select as options. For now, create a repo
     -    extension, "extensions.sparseIndex", that specifies that the tool
     -    reading this repository must understand sparse directory entries.
     +    are still valuable to select as options. Until we add index format
     +    version 5, create a repo extension, "extensions.sparseIndex", that
     +    specifies that the tool reading this repository must understand sparse
     +    directory entries.
      
          This change only encodes the extension and enables it when
          GIT_TEST_SPARSE_INDEX=1. Later, we will add a more user-friendly CLI
     @@ Documentation/config/extensions.txt: extensions.objectFormat::
      +	When combined with `core.sparseCheckout=true` and
      +	`core.sparseCheckoutCone=true`, the index may contain entries
      +	corresponding to directories outside of the sparse-checkout
     -+	definition. Versions of Git that do not understand this extension
     -+	do not expect directory entries in the index.
     ++	definition in lieu of containing each path under such directories.
     ++	Versions of Git that do not understand this extension do not
     ++	expect directory entries in the index.
      
       ## cache.h ##
      @@ cache.h: struct repository_format {
 16:  6d6b230e3318 ! 16:  e530ca5f668d sparse-checkout: toggle sparse index from builtin
     @@ Documentation/git-sparse-checkout.txt: To avoid interfering with other worktrees
      +a sparse index until they are properly integrated with the feature.
      ++
      +**WARNING:** Using a sparse index requires modifying the index in a way
     -+that is not completely understood by other tools. Enabling sparse index
     -+enables the `extensions.spareseIndex` config value, which might cause
     -+other tools to stop working with your repository. If you have trouble with
     -+this compatibility, then run `git sparse-checkout sparse-index disable` to
     -+remove this config and rewrite your index to not be sparse.
     ++that is not completely understood by external tools. If you have trouble
     ++with this compatibility, then run `git sparse-checkout sparse-index disable`
     ++to rewrite your index to not be sparse. Older versions of Git will not
     ++understand the `sparseIndex` repository extension and may fail to interact
     ++with your repository until it is disabled.
       
       'set'::
       	Write a set of patterns to the sparse-checkout file, as given as
 17:  bcf960ef2362 = 17:  42d0da9c5def sparse-checkout: disable sparse-index
 18:  e6afec58674e = 18:  6bb0976a6295 cache-tree: integrate with sparse directory entries
 19:  2be4981fe698 = 19:  07f34e80609a sparse-index: loose integration with cache_tree_verify()
 20:  a738b0ba8ab4 = 20:  41e3b56b9c17 p2000: add sparse-index repos

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v2 01/20] sparse-index: design doc and format update
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
@ 2021-03-10 19:30   ` Derrick Stolee via GitGitGadget
  2021-03-10 22:19     ` Elijah Newren
  2021-03-10 19:30   ` [PATCH v2 02/20] t/perf: add performance test for sparse operations Derrick Stolee via GitGitGadget
                     ` (20 subsequent siblings)
  21 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:30 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

This begins a long effort to update the index format to allow sparse
directory entries. This should result in a significant improvement to
Git commands when HEAD contains millions of files, but the user has
selected many fewer files to keep in their sparse-checkout definition.

Currently, the index format is only updated in the presence of
extensions.sparseIndex instead of increasing a file format version
number. This is temporary, and index v5 is part of the plan for future
work in this area.

The design document details many of the reasons for embarking on this
work, and also the plan for completing it safely.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/index-format.txt |   7 +
 Documentation/technical/sparse-index.txt | 173 +++++++++++++++++++++++
 2 files changed, 180 insertions(+)
 create mode 100644 Documentation/technical/sparse-index.txt

diff --git a/Documentation/technical/index-format.txt b/Documentation/technical/index-format.txt
index b633482b1bdf..387126582556 100644
--- a/Documentation/technical/index-format.txt
+++ b/Documentation/technical/index-format.txt
@@ -44,6 +44,13 @@ Git index format
   localization, no special casing of directory separator '/'). Entries
   with the same name are sorted by their stage field.
 
+  An index entry typically represents a file. However, if sparse-checkout
+  is enabled in cone mode (`core.sparseCheckoutCone` is enabled) and the
+  `extensions.sparseIndex` extension is enabled, then the index may
+  contain entries for directories outside of the sparse-checkout definition.
+  These entries have mode `0040000`, include the `SKIP_WORKTREE` bit, and
+  the path ends in a directory separator.
+
   32-bit ctime seconds, the last time a file's metadata changed
     this is stat(2) data
 
diff --git a/Documentation/technical/sparse-index.txt b/Documentation/technical/sparse-index.txt
new file mode 100644
index 000000000000..787a2a0b3b81
--- /dev/null
+++ b/Documentation/technical/sparse-index.txt
@@ -0,0 +1,173 @@
+Git Sparse-Index Design Document
+================================
+
+The sparse-checkout feature allows users to focus a working directory on
+a subset of the files at HEAD. The cone mode patterns, enabled by
+`core.sparseCheckoutCone`, allow for very fast pattern matching to
+discover which files at HEAD belong in the sparse-checkout cone.
+
+Three important scale dimensions for a Git worktree are:
+
+* `HEAD`: How many files are present at `HEAD`?
+
+* Populated: How many files are within the sparse-checkout cone.
+
+* Modified: How many files has the user modified in the working directory?
+
+We will use big-O notation -- O(X) -- to denote how expensive certain
+operations are in terms of these dimensions.
+
+These dimensions are ordered by their magnitude: users (typically) modify
+fewer files than are populated, and we can only populate files at `HEAD`.
+These dimensions are also ordered by how expensive they are per item: it
+is expensive to detect a modified file than it is to write one that we
+know must be populated; changing `HEAD` only really requires updating the
+index.
+
+Problems occur if there is an extreme imbalance in these dimensions. For
+example, if `HEAD` contains millions of paths but the populated set has
+only tens of thousands, then commands like `git status` and `git add` can
+be dominated by operations that require O(`HEAD`) operations instead of
+O(Populated). Primarily, the cost is in parsing and rewriting the index,
+which is filled primarily with files at `HEAD` that are marked with the
+`SKIP_WORKTREE` bit.
+
+The sparse-index intends to take these commands that read and modify the
+index from O(`HEAD`) to O(Populated). To do this, we need to modify the
+index format in a significant way: add "sparse directory" entries.
+
+With cone mode patterns, it is possible to detect when an entire
+directory will have its contents outside of the sparse-checkout definition.
+Instead of listing all of the files it contains as individual entries, a
+sparse-index contains an entry with the directory name, referencing the
+object ID of the tree at `HEAD` and marked with the `SKIP_WORKTREE` bit.
+If we need to discover the details for paths within that directory, we
+can parse trees to find that list.
+
+At time of writing, sparse-directory entries violate expectations about the
+index format and its in-memory data structure. There are many consumers in
+the codebase that expect to iterate through all of the index entries and
+see only files. In addition, they expect to see all files at `HEAD`. One
+way to handle this is to parse trees to replace a sparse-directory entry
+with all of the files within that tree as the index is loaded. However,
+parsing trees is slower than parsing the index format, so that is a slower
+operation than if we left the index alone.
+
+The implementation plan below follows four phases to slowly integrate with
+the sparse-index. The intention is to incrementally update Git commands to
+interact safely with the sparse-index without significant slowdowns. This
+may not always be possible, but the hope is that the primary commands that
+users need in their daily work are dramatically improved.
+
+Phase I: Format and initial speedups
+------------------------------------
+
+During this phase, Git learns to enable the sparse-index and safely parse
+one. Protections are put in place so that every consumer of the in-memory
+data structure can operate with its current assumption of every file at
+`HEAD`.
+
+At first, every index parse will expand the sparse-directory entries into
+the full list of paths at `HEAD`. This will be slower in all cases. The
+only noticable change in behavior will be that the serialized index file
+contains sparse-directory entries.
+
+To start, we use a new repository extension, `extensions.sparseIndex`, to
+allow inserting sparse-directory entries into indexes with file format
+versions 2, 3, and 4. This prevents Git versions that do not understand
+the sparse-index from operating on one, but it also prevents other
+operations that do not use the index at all. A new format, index v5, will
+be introduced that includes sparse-directory entries by default. It might
+also introduce other features that have been considered for improving the
+index, as well.
+
+Next, consumers of the index will be guarded against operating on a
+sparse-index by inserting calls to `ensure_full_index()` or
+`expand_index_to_path()`. After these guards are in place, we can begin
+leaving sparse-directory entries in the in-memory index structure.
+
+Even after inserting these guards, we will keep expanding sparse-indexes
+for most Git commands using the `command_requires_full_index` repository
+setting. This setting will be on by default and disabled one builtin at a
+time until we have sufficient confidence that all of the index operations
+are properly guarded.
+
+To complete this phase, the commands `git status` and `git add` will be
+integrated with the sparse-index so that they operate with O(Populated)
+performance. They will be carefully tested for operations within and
+outside the sparse-checkout definition.
+
+Phase II: Careful integrations
+------------------------------
+
+This phase focuses on ensuring that all index extensions and APIs work
+well with a sparse-index. This requires significant increases to our test
+coverage, especially for operations that interact with the working
+directory outside of the sparse-checkout definition. Some of these
+behaviors may not be the desirable ones, such as some tests already
+marked for failure in `t1092-sparse-checkout-compatibility.sh`.
+
+The index extensions that may require special integrations are:
+
+* FS Monitor
+* Untracked cache
+
+While integrating with these features, we should look for patterns that
+might lead to better APIs for interacting with the index. Coalescing
+common usage patterns into an API call can reduce the number of places
+where sparse-directories need to be handled carefully.
+
+Phase III: Important command speedups
+-------------------------------------
+
+At this point, the patterns for testing and implementing sparse-directory
+logic should be relatively stable. This phase focuses on updating some of
+the most common builtins that use the index to operate as O(Populated).
+Here is a potential list of commands that could be valuable to integrate
+at this point:
+
+* `git commit`
+* `git checkout`
+* `git merge`
+* `git rebase`
+
+Hopefully, commands such as `git merge` and `git rebase` can benefit
+instead from merge algorithms that do not use the index as a data
+structure, such as the merge-ORT strategy. As these topics mature, we
+may enalbe the ORT strategy by default for repositories using the
+sparse-index feature.
+
+Along with `git status` and `git add`, these commands cover the majority
+of users' interactions with the working directory. In addition, we can
+integrate with these commands:
+
+* `git grep`
+* `git rm`
+
+These have been proposed as some whose behavior could change when in a
+repo with a sparse-checkout definition. It would be good to include this
+behavior automatically when using a sparse-index. Some clarity is needed
+to make the behavior switch clear to the user.
+
+This phase is the first where parallel work might be possible without too
+much conflicts between topics.
+
+Phase IV: The long tail
+-----------------------
+
+This last phase is less a "phase" and more "the new normal" after all of
+the previous work.
+
+To start, the `command_requires_full_index` option could be removed in
+favor of expanding only when hitting an API guard.
+
+There are many Git commands that could use special attention to operate as
+O(Populated), while some might be so rare that it is acceptable to leave
+them with additional overhead when a sparse-index is present.
+
+Here are some commands that might be useful to update:
+
+* `git sparse-checkout set`
+* `git am`
+* `git clean`
+* `git stash`
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v2 02/20] t/perf: add performance test for sparse operations
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  2021-03-10 19:30   ` [PATCH v2 01/20] sparse-index: design doc and format update Derrick Stolee via GitGitGadget
@ 2021-03-10 19:30   ` Derrick Stolee via GitGitGadget
  2021-03-10 19:30   ` [PATCH v2 03/20] t1092: clean up script quoting Derrick Stolee via GitGitGadget
                     ` (19 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:30 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Create a test script that takes the default performance test (the Git
codebase) and multiplies it by 256 using four layers of duplicated
trees of width four. This results in nearly one million blob entries in
the index. Then, we can clone this repository with sparse-checkout
patterns that demonstrate four copies of the initial repository. Each
clone will use a different index format or mode so peformance can be
tested across the different options.

Note that the initial repo is stripped of submodules before doing the
copies. This preserves the expected data shape of the sparse index,
because directories containing submodules are not collapsed to a sparse
directory entry.

Run a few Git commands on these clones, especially those that use the
index (status, add, commit).

Here are the results on my Linux machine:

Test
--------------------------------------------------------------
2000.2: git status (full-index-v3)             0.37(0.30+0.09)
2000.3: git status (full-index-v4)             0.39(0.32+0.10)
2000.4: git add -A (full-index-v3)             1.42(1.06+0.20)
2000.5: git add -A (full-index-v4)             1.26(0.98+0.16)
2000.6: git add . (full-index-v3)              1.40(1.04+0.18)
2000.7: git add . (full-index-v4)              1.26(0.98+0.17)
2000.8: git commit -a -m A (full-index-v3)     1.42(1.11+0.16)
2000.9: git commit -a -m A (full-index-v4)     1.33(1.08+0.16)

It is perhaps noteworthy that there is an improvement when using index
version 4. This is because the v3 index uses 108 MiB while the v4
index uses 80 MiB. Since the repeated portions of the directories are
very short (f3/f1/f2, for example) this ratio is less pronounced than in
similarly-sized real repositories.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/perf/p2000-sparse-operations.sh | 85 +++++++++++++++++++++++++++++++
 1 file changed, 85 insertions(+)
 create mode 100755 t/perf/p2000-sparse-operations.sh

diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
new file mode 100755
index 000000000000..2fbc81b22119
--- /dev/null
+++ b/t/perf/p2000-sparse-operations.sh
@@ -0,0 +1,85 @@
+#!/bin/sh
+
+test_description="test performance of Git operations using the index"
+
+. ./perf-lib.sh
+
+test_perf_default_repo
+
+SPARSE_CONE=f2/f4/f1
+
+test_expect_success 'setup repo and indexes' '
+	git reset --hard HEAD &&
+	# Remove submodules from the example repo, because our
+	# duplication of the entire repo creates an unlikly data shape.
+	git config --file .gitmodules --get-regexp "submodule.*.path" >modules &&
+	git rm -f .gitmodules &&
+	for module in $(awk "{print \$2}" modules)
+	do
+		git rm $module || return 1
+	done &&
+	git commit -m "remove submodules" &&
+
+	echo bogus >a &&
+	cp a b &&
+	git add a b &&
+	git commit -m "level 0" &&
+	BLOB=$(git rev-parse HEAD:a) &&
+	OLD_COMMIT=$(git rev-parse HEAD) &&
+	OLD_TREE=$(git rev-parse HEAD^{tree}) &&
+
+	for i in $(test_seq 1 4)
+	do
+		cat >in <<-EOF &&
+			100755 blob $BLOB	a
+			040000 tree $OLD_TREE	f1
+			040000 tree $OLD_TREE	f2
+			040000 tree $OLD_TREE	f3
+			040000 tree $OLD_TREE	f4
+		EOF
+		NEW_TREE=$(git mktree <in) &&
+		NEW_COMMIT=$(git commit-tree $NEW_TREE -p $OLD_COMMIT -m "level $i") &&
+		OLD_TREE=$NEW_TREE &&
+		OLD_COMMIT=$NEW_COMMIT || return 1
+	done &&
+
+	git sparse-checkout init --cone &&
+	git branch -f wide $OLD_COMMIT &&
+	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . full-index-v3 &&
+	(
+		cd full-index-v3 &&
+		git sparse-checkout init --cone &&
+		git sparse-checkout set $SPARSE_CONE &&
+		git config index.version 3 &&
+		git update-index --index-version=3
+	) &&
+	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . full-index-v4 &&
+	(
+		cd full-index-v4 &&
+		git sparse-checkout init --cone &&
+		git sparse-checkout set $SPARSE_CONE &&
+		git config index.version 4 &&
+		git update-index --index-version=4
+	)
+'
+
+test_perf_on_all () {
+	command="$@"
+	for repo in full-index-v3 full-index-v4
+	do
+		test_perf "$command ($repo)" "
+			(
+				cd $repo &&
+				echo >>$SPARSE_CONE/a &&
+				$command
+			)
+		"
+	done
+}
+
+test_perf_on_all git status
+test_perf_on_all git add -A
+test_perf_on_all git add .
+test_perf_on_all git commit -a -m A
+
+test_done
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v2 03/20] t1092: clean up script quoting
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  2021-03-10 19:30   ` [PATCH v2 01/20] sparse-index: design doc and format update Derrick Stolee via GitGitGadget
  2021-03-10 19:30   ` [PATCH v2 02/20] t/perf: add performance test for sparse operations Derrick Stolee via GitGitGadget
@ 2021-03-10 19:30   ` Derrick Stolee via GitGitGadget
  2021-03-10 19:30   ` [PATCH v2 04/20] sparse-index: add guard to ensure full index Derrick Stolee via GitGitGadget
                     ` (18 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:30 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

This test was introduced in 19a0acc83e4 (t1092: test interesting
sparse-checkout scenarios, 2021-01-23), but these issues with quoting
were not noticed until starting this follow-up series. The old mechanism
would drop quoting such as in

   test_all_match git commit -m "touch README.md"

The above happened to work because README.md is a file in the
repository, so 'git commit -m touch REAMDE.md' would succeed by
accident.

Other cases included quoting for no good reason, so clean that up now.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t1092-sparse-checkout-compatibility.sh | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 8cd3e5a8d227..3725d3997e70 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -96,20 +96,20 @@ init_repos () {
 run_on_sparse () {
 	(
 		cd sparse-checkout &&
-		$* >../sparse-checkout-out 2>../sparse-checkout-err
+		"$@" >../sparse-checkout-out 2>../sparse-checkout-err
 	)
 }
 
 run_on_all () {
 	(
 		cd full-checkout &&
-		$* >../full-checkout-out 2>../full-checkout-err
+		"$@" >../full-checkout-out 2>../full-checkout-err
 	) &&
-	run_on_sparse $*
+	run_on_sparse "$@"
 }
 
 test_all_match () {
-	run_on_all $* &&
+	run_on_all "$@" &&
 	test_cmp full-checkout-out sparse-checkout-out &&
 	test_cmp full-checkout-err sparse-checkout-err
 }
@@ -119,7 +119,7 @@ test_expect_success 'status with options' '
 	test_all_match git status --porcelain=v2 &&
 	test_all_match git status --porcelain=v2 -z -u &&
 	test_all_match git status --porcelain=v2 -uno &&
-	run_on_all "touch README.md" &&
+	run_on_all touch README.md &&
 	test_all_match git status --porcelain=v2 &&
 	test_all_match git status --porcelain=v2 -z -u &&
 	test_all_match git status --porcelain=v2 -uno &&
@@ -135,7 +135,7 @@ test_expect_success 'add, commit, checkout' '
 	write_script edit-contents <<-\EOF &&
 	echo text >>$1
 	EOF
-	run_on_all "../edit-contents README.md" &&
+	run_on_all ../edit-contents README.md &&
 
 	test_all_match git add README.md &&
 	test_all_match git status --porcelain=v2 &&
@@ -144,7 +144,7 @@ test_expect_success 'add, commit, checkout' '
 	test_all_match git checkout HEAD~1 &&
 	test_all_match git checkout - &&
 
-	run_on_all "../edit-contents README.md" &&
+	run_on_all ../edit-contents README.md &&
 
 	test_all_match git add -A &&
 	test_all_match git status --porcelain=v2 &&
@@ -153,7 +153,7 @@ test_expect_success 'add, commit, checkout' '
 	test_all_match git checkout HEAD~1 &&
 	test_all_match git checkout - &&
 
-	run_on_all "../edit-contents deep/newfile" &&
+	run_on_all ../edit-contents deep/newfile &&
 
 	test_all_match git status --porcelain=v2 -uno &&
 	test_all_match git status --porcelain=v2 &&
@@ -186,7 +186,7 @@ test_expect_success 'diff --staged' '
 	write_script edit-contents <<-\EOF &&
 	echo text >>README.md
 	EOF
-	run_on_all "../edit-contents" &&
+	run_on_all ../edit-contents &&
 
 	test_all_match git diff &&
 	test_all_match git diff --staged &&
@@ -280,7 +280,7 @@ test_expect_success 'clean' '
 	echo bogus >>.gitignore &&
 	run_on_all cp ../.gitignore . &&
 	test_all_match git add .gitignore &&
-	test_all_match git commit -m ignore-bogus-files &&
+	test_all_match git commit -m "ignore bogus files" &&
 
 	run_on_sparse mkdir folder1 &&
 	run_on_all touch folder1/bogus &&
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v2 04/20] sparse-index: add guard to ensure full index
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (2 preceding siblings ...)
  2021-03-10 19:30   ` [PATCH v2 03/20] t1092: clean up script quoting Derrick Stolee via GitGitGadget
@ 2021-03-10 19:30   ` Derrick Stolee via GitGitGadget
  2021-03-10 19:30   ` [PATCH v2 05/20] sparse-index: implement ensure_full_index() Derrick Stolee via GitGitGadget
                     ` (17 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:30 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Upcoming changes will introduce modifications to the index format that
allow sparse directories. It will be useful to have a mechanism for
converting those sparse index files into full indexes by walking the
tree at those sparse directories. Name this method ensure_full_index()
as it will guarantee that the index is fully expanded.

This method is not implemented yet, and instead we focus on the
scaffolding to declare it and call it at the appropriate time.

Add a 'command_requires_full_index' member to struct repo_settings. This
will be an indicator that we need the index in full mode to do certain
index operations. This starts as being true for every command, then we
will set it to false as some commands integrate with sparse indexes.

If 'command_requires_full_index' is true, then we will immediately
expand a sparse index to a full one upon reading from disk. This
suffices for now, but we will want to add more callers to
ensure_full_index() later.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Makefile        |  1 +
 repo-settings.c |  8 ++++++++
 repository.c    | 11 ++++++++++-
 repository.h    |  2 ++
 sparse-index.c  |  8 ++++++++
 sparse-index.h  |  7 +++++++
 6 files changed, 36 insertions(+), 1 deletion(-)
 create mode 100644 sparse-index.c
 create mode 100644 sparse-index.h

diff --git a/Makefile b/Makefile
index 5a239cac20e3..3bf61699238d 100644
--- a/Makefile
+++ b/Makefile
@@ -980,6 +980,7 @@ LIB_OBJS += setup.o
 LIB_OBJS += shallow.o
 LIB_OBJS += sideband.o
 LIB_OBJS += sigchain.o
+LIB_OBJS += sparse-index.o
 LIB_OBJS += split-index.o
 LIB_OBJS += stable-qsort.o
 LIB_OBJS += strbuf.o
diff --git a/repo-settings.c b/repo-settings.c
index f7fff0f5ab83..d63569e4041e 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -77,4 +77,12 @@ void prepare_repo_settings(struct repository *r)
 		UPDATE_DEFAULT_BOOL(r->settings.core_untracked_cache, UNTRACKED_CACHE_KEEP);
 
 	UPDATE_DEFAULT_BOOL(r->settings.fetch_negotiation_algorithm, FETCH_NEGOTIATION_DEFAULT);
+
+	/*
+	 * This setting guards all index reads to require a full index
+	 * over a sparse index. After suitable guards are placed in the
+	 * codebase around uses of the index, this setting will be
+	 * removed.
+	 */
+	r->settings.command_requires_full_index = 1;
 }
diff --git a/repository.c b/repository.c
index c98298acd017..a8acae002f71 100644
--- a/repository.c
+++ b/repository.c
@@ -10,6 +10,7 @@
 #include "object.h"
 #include "lockfile.h"
 #include "submodule-config.h"
+#include "sparse-index.h"
 
 /* The main repository */
 static struct repository the_repo;
@@ -261,6 +262,8 @@ void repo_clear(struct repository *repo)
 
 int repo_read_index(struct repository *repo)
 {
+	int res;
+
 	if (!repo->index)
 		repo->index = xcalloc(1, sizeof(*repo->index));
 
@@ -270,7 +273,13 @@ int repo_read_index(struct repository *repo)
 	else if (repo->index->repo != repo)
 		BUG("repo's index should point back at itself");
 
-	return read_index_from(repo->index, repo->index_file, repo->gitdir);
+	res = read_index_from(repo->index, repo->index_file, repo->gitdir);
+
+	prepare_repo_settings(repo);
+	if (repo->settings.command_requires_full_index)
+		ensure_full_index(repo->index);
+
+	return res;
 }
 
 int repo_hold_locked_index(struct repository *repo,
diff --git a/repository.h b/repository.h
index b385ca3c94b6..e06a23015697 100644
--- a/repository.h
+++ b/repository.h
@@ -41,6 +41,8 @@ struct repo_settings {
 	enum fetch_negotiation_setting fetch_negotiation_algorithm;
 
 	int core_multi_pack_index;
+
+	unsigned command_requires_full_index:1;
 };
 
 struct repository {
diff --git a/sparse-index.c b/sparse-index.c
new file mode 100644
index 000000000000..82183ead563b
--- /dev/null
+++ b/sparse-index.c
@@ -0,0 +1,8 @@
+#include "cache.h"
+#include "repository.h"
+#include "sparse-index.h"
+
+void ensure_full_index(struct index_state *istate)
+{
+	/* intentionally left blank */
+}
diff --git a/sparse-index.h b/sparse-index.h
new file mode 100644
index 000000000000..09a20d036c46
--- /dev/null
+++ b/sparse-index.h
@@ -0,0 +1,7 @@
+#ifndef SPARSE_INDEX_H__
+#define SPARSE_INDEX_H__
+
+struct index_state;
+void ensure_full_index(struct index_state *istate);
+
+#endif
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v2 05/20] sparse-index: implement ensure_full_index()
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (3 preceding siblings ...)
  2021-03-10 19:30   ` [PATCH v2 04/20] sparse-index: add guard to ensure full index Derrick Stolee via GitGitGadget
@ 2021-03-10 19:30   ` Derrick Stolee via GitGitGadget
  2021-03-12  6:50     ` Junio C Hamano
  2021-03-10 19:30   ` [PATCH v2 06/20] t1092: compare sparse-checkout to sparse-index Derrick Stolee via GitGitGadget
                     ` (16 subsequent siblings)
  21 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:30 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

We will mark an in-memory index_state as having sparse directory entries
with the sparse_index bit. These currently cannot exist, but we will add
a mechanism for collapsing a full index to a sparse one in a later
change. That will happen at write time, so we must first allow parsing
the format before writing it.

Commands or methods that require a full index in order to operate can
call ensure_full_index() to expand that index in-memory. This requires
parsing trees using that index's repository.

Sparse directory entries have a specific 'ce_mode' value. The macro
S_ISSPARSEDIR(ce->ce_mode) can check if a cache_entry 'ce' has this type.
This ce_mode is not possible with the existing index formats, so we don't
also verify all properties of a sparse-directory entry, which are:

 1. ce->ce_mode == 0040000
 2. ce->flags & CE_SKIP_WORKTREE is true
 3. ce->name[ce->namelen - 1] == '/' (ends in dir separator)
 4. ce->oid references a tree object.

These are all semi-enforced in ensure_full_index() to some extent. Any
deviation will cause a warning at minimum or a failure in the worst
case.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 cache.h        | 13 ++++++-
 read-cache.c   |  9 +++++
 sparse-index.c | 95 +++++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 115 insertions(+), 2 deletions(-)

diff --git a/cache.h b/cache.h
index d92814961405..1f0b42264606 100644
--- a/cache.h
+++ b/cache.h
@@ -204,6 +204,8 @@ struct cache_entry {
 #error "CE_EXTENDED_FLAGS out of range"
 #endif
 
+#define S_ISSPARSEDIR(m) ((m) == S_IFDIR)
+
 /* Forward structure decls */
 struct pathspec;
 struct child_process;
@@ -319,7 +321,14 @@ struct index_state {
 		 drop_cache_tree : 1,
 		 updated_workdir : 1,
 		 updated_skipworktree : 1,
-		 fsmonitor_has_run_once : 1;
+		 fsmonitor_has_run_once : 1,
+
+		 /*
+		  * sparse_index == 1 when sparse-directory
+		  * entries exist. Requires sparse-checkout
+		  * in cone mode.
+		  */
+		 sparse_index : 1;
 	struct hashmap name_hash;
 	struct hashmap dir_hash;
 	struct object_id oid;
@@ -722,6 +731,8 @@ int read_index_from(struct index_state *, const char *path,
 		    const char *gitdir);
 int is_index_unborn(struct index_state *);
 
+void ensure_full_index(struct index_state *istate);
+
 /* For use with `write_locked_index()`. */
 #define COMMIT_LOCK		(1 << 0)
 #define SKIP_IF_UNCHANGED	(1 << 1)
diff --git a/read-cache.c b/read-cache.c
index 29144cf879e7..97dbf2434f30 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -101,6 +101,9 @@ static const char *alternate_index_output;
 
 static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
 {
+	if (S_ISSPARSEDIR(ce->ce_mode))
+		istate->sparse_index = 1;
+
 	istate->cache[nr] = ce;
 	add_name_hash(istate, ce);
 }
@@ -2255,6 +2258,12 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist)
 	trace2_data_intmax("index", the_repository, "read/cache_nr",
 			   istate->cache_nr);
 
+	if (!istate->repo)
+		istate->repo = the_repository;
+	prepare_repo_settings(istate->repo);
+	if (istate->repo->settings.command_requires_full_index)
+		ensure_full_index(istate);
+
 	return istate->cache_nr;
 
 unmap:
diff --git a/sparse-index.c b/sparse-index.c
index 82183ead563b..316cb949b74b 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -1,8 +1,101 @@
 #include "cache.h"
 #include "repository.h"
 #include "sparse-index.h"
+#include "tree.h"
+#include "pathspec.h"
+#include "trace2.h"
+
+static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
+{
+	ALLOC_GROW(istate->cache, nr + 1, istate->cache_alloc);
+
+	istate->cache[nr] = ce;
+	add_name_hash(istate, ce);
+}
+
+static int add_path_to_index(const struct object_id *oid,
+				struct strbuf *base, const char *path,
+				unsigned int mode, int stage, void *context)
+{
+	struct index_state *istate = (struct index_state *)context;
+	struct cache_entry *ce;
+	size_t len = base->len;
+
+	if (S_ISDIR(mode))
+		return READ_TREE_RECURSIVE;
+
+	strbuf_addstr(base, path);
+
+	ce = make_cache_entry(istate, mode, oid, base->buf, 0, 0);
+	ce->ce_flags |= CE_SKIP_WORKTREE;
+	set_index_entry(istate, istate->cache_nr++, ce);
+
+	strbuf_setlen(base, len);
+	return 0;
+}
 
 void ensure_full_index(struct index_state *istate)
 {
-	/* intentionally left blank */
+	int i;
+	struct index_state *full;
+
+	if (!istate || !istate->sparse_index)
+		return;
+
+	if (!istate->repo)
+		istate->repo = the_repository;
+
+	trace2_region_enter("index", "ensure_full_index", istate->repo);
+
+	/* initialize basics of new index */
+	full = xcalloc(1, sizeof(struct index_state));
+	memcpy(full, istate, sizeof(struct index_state));
+
+	/* then change the necessary things */
+	full->sparse_index = 0;
+	full->cache_alloc = (3 * istate->cache_alloc) / 2;
+	full->cache_nr = 0;
+	ALLOC_ARRAY(full->cache, full->cache_alloc);
+
+	for (i = 0; i < istate->cache_nr; i++) {
+		struct cache_entry *ce = istate->cache[i];
+		struct tree *tree;
+		struct pathspec ps;
+
+		if (!S_ISSPARSEDIR(ce->ce_mode)) {
+			set_index_entry(full, full->cache_nr++, ce);
+			continue;
+		}
+		if (!(ce->ce_flags & CE_SKIP_WORKTREE))
+			warning(_("index entry is a directory, but not sparse (%08x)"),
+				ce->ce_flags);
+
+		/* recursively walk into cd->name */
+		tree = lookup_tree(istate->repo, &ce->oid);
+
+		memset(&ps, 0, sizeof(ps));
+		ps.recursive = 1;
+		ps.has_wildcard = 1;
+		ps.max_depth = -1;
+
+		read_tree_recursive(istate->repo, tree,
+				    ce->name, strlen(ce->name),
+				    0, &ps,
+				    add_path_to_index, full);
+
+		/* free directory entries. full entries are re-used */
+		discard_cache_entry(ce);
+	}
+
+	/* Copy back into original index. */
+	memcpy(&istate->name_hash, &full->name_hash, sizeof(full->name_hash));
+	istate->sparse_index = 0;
+	free(istate->cache);
+	istate->cache = full->cache;
+	istate->cache_nr = full->cache_nr;
+	istate->cache_alloc = full->cache_alloc;
+
+	free(full);
+
+	trace2_region_leave("index", "ensure_full_index", istate->repo);
 }
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v2 06/20] t1092: compare sparse-checkout to sparse-index
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (4 preceding siblings ...)
  2021-03-10 19:30   ` [PATCH v2 05/20] sparse-index: implement ensure_full_index() Derrick Stolee via GitGitGadget
@ 2021-03-10 19:30   ` Derrick Stolee via GitGitGadget
  2021-03-10 23:04     ` Elijah Newren
  2021-03-10 19:30   ` [PATCH v2 07/20] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
                     ` (15 subsequent siblings)
  21 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:30 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Add a new 'sparse-index' repo alongside the 'full-checkout' and
'sparse-checkout' repos in t1092-sparse-checkout-compatibility.sh. Also
add run_on_sparse and test_sparse_match helpers. These helpers will be
used when the sparse index is implemented.

Add GIT_TEST_SPARSE_INDEX environment variable to enable the
sparse-index by default. This will be intended to use across the entire
test suite, except that it will only affect cases where the
sparse-checkout feature is enabled.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/README                                 |  3 +++
 t/t1092-sparse-checkout-compatibility.sh | 24 ++++++++++++++++++++----
 2 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/t/README b/t/README
index 593d4a4e270c..b98bc563aab5 100644
--- a/t/README
+++ b/t/README
@@ -439,6 +439,9 @@ and "sha256".
 GIT_TEST_WRITE_REV_INDEX=<boolean>, when true enables the
 'pack.writeReverseIndex' setting.
 
+GIT_TEST_SPARSE_INDEX=<boolean>, when true enables index writes to use the
+sparse-index format by default.
+
 Naming Tests
 ------------
 
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 3725d3997e70..71d6f9e4c014 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -7,6 +7,7 @@ test_description='compare full workdir to sparse workdir'
 test_expect_success 'setup' '
 	git init initial-repo &&
 	(
+		GIT_TEST_SPARSE_INDEX=0 &&
 		cd initial-repo &&
 		echo a >a &&
 		echo "after deep" >e &&
@@ -87,23 +88,32 @@ init_repos () {
 
 	cp -r initial-repo sparse-checkout &&
 	git -C sparse-checkout reset --hard &&
-	git -C sparse-checkout sparse-checkout init --cone &&
+
+	cp -r initial-repo sparse-index &&
+	git -C sparse-index reset --hard &&
 
 	# initialize sparse-checkout definitions
-	git -C sparse-checkout sparse-checkout set deep
+	git -C sparse-checkout sparse-checkout init --cone &&
+	git -C sparse-checkout sparse-checkout set deep &&
+	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout init --cone &&
+	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep
 }
 
 run_on_sparse () {
 	(
 		cd sparse-checkout &&
-		"$@" >../sparse-checkout-out 2>../sparse-checkout-err
+		GIT_TEST_SPARSE_INDEX=0 "$@" >../sparse-checkout-out 2>../sparse-checkout-err
+	) &&
+	(
+		cd sparse-index &&
+		GIT_TEST_SPARSE_INDEX=1 "$@" >../sparse-index-out 2>../sparse-index-err
 	)
 }
 
 run_on_all () {
 	(
 		cd full-checkout &&
-		"$@" >../full-checkout-out 2>../full-checkout-err
+		GIT_TEST_SPARSE_INDEX=0 "$@" >../full-checkout-out 2>../full-checkout-err
 	) &&
 	run_on_sparse "$@"
 }
@@ -114,6 +124,12 @@ test_all_match () {
 	test_cmp full-checkout-err sparse-checkout-err
 }
 
+test_sparse_match () {
+	run_on_sparse $* &&
+	test_cmp sparse-checkout-out sparse-index-out &&
+	test_cmp sparse-checkout-err sparse-index-err
+}
+
 test_expect_success 'status with options' '
 	init_repos &&
 	test_all_match git status --porcelain=v2 &&
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v2 07/20] test-read-cache: print cache entries with --table
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (5 preceding siblings ...)
  2021-03-10 19:30   ` [PATCH v2 06/20] t1092: compare sparse-checkout to sparse-index Derrick Stolee via GitGitGadget
@ 2021-03-10 19:30   ` Derrick Stolee via GitGitGadget
  2021-03-10 19:30   ` [PATCH v2 08/20] test-tool: don't force full index Derrick Stolee via GitGitGadget
                     ` (14 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:30 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

This table is helpful for discovering data in the index to ensure it is
being written correctly, especially as we build and test the
sparse-index. This table includes an output format similar to 'git
ls-tree', but should not be compared to that directly. The biggest
reasons are that 'git ls-tree' includes a tree entry for every
subdirectory, even those that would not appear as a sparse directory in
a sparse-index. Further, 'git ls-tree' does not use a trailing directory
separator for its tree rows.

This does not print the stat() information for the blobs. That could be
added in a future change with another option. The tests that are added
in the next few changes care only about the object types and IDs.

To make the option parsing slightly more robust, wrap the string
comparisons in a loop adapted from test-dir-iterator.c.

Care must be taken with the final check for the 'cnt' variable. We
continue the expectation that the numerical value is the final argument.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/helper/test-read-cache.c | 55 +++++++++++++++++++++++++++++++-------
 1 file changed, 45 insertions(+), 10 deletions(-)

diff --git a/t/helper/test-read-cache.c b/t/helper/test-read-cache.c
index 244977a29bdf..6cfd8f2de71c 100644
--- a/t/helper/test-read-cache.c
+++ b/t/helper/test-read-cache.c
@@ -1,36 +1,71 @@
 #include "test-tool.h"
 #include "cache.h"
 #include "config.h"
+#include "blob.h"
+#include "commit.h"
+#include "tree.h"
+
+static void print_cache_entry(struct cache_entry *ce)
+{
+	const char *type;
+	printf("%06o ", ce->ce_mode & 0177777);
+
+	if (S_ISSPARSEDIR(ce->ce_mode))
+		type = tree_type;
+	else if (S_ISGITLINK(ce->ce_mode))
+		type = commit_type;
+	else
+		type = blob_type;
+
+	printf("%s %s\t%s\n",
+	       type,
+	       oid_to_hex(&ce->oid),
+	       ce->name);
+}
+
+static void print_cache(struct index_state *istate)
+{
+	int i;
+	for (i = 0; i < istate->cache_nr; i++)
+		print_cache_entry(istate->cache[i]);
+}
 
 int cmd__read_cache(int argc, const char **argv)
 {
+	struct repository *r = the_repository;
 	int i, cnt = 1;
 	const char *name = NULL;
+	int table = 0;
 
-	if (argc > 1 && skip_prefix(argv[1], "--print-and-refresh=", &name)) {
-		argc--;
-		argv++;
+	for (++argv, --argc; *argv && starts_with(*argv, "--"); ++argv, --argc) {
+		if (skip_prefix(*argv, "--print-and-refresh=", &name))
+			continue;
+		if (!strcmp(*argv, "--table"))
+			table = 1;
 	}
 
-	if (argc == 2)
-		cnt = strtol(argv[1], NULL, 0);
+	if (argc == 1)
+		cnt = strtol(argv[0], NULL, 0);
 	setup_git_directory();
 	git_config(git_default_config, NULL);
+
 	for (i = 0; i < cnt; i++) {
-		read_cache();
+		repo_read_index(r);
 		if (name) {
 			int pos;
 
-			refresh_index(&the_index, REFRESH_QUIET,
+			refresh_index(r->index, REFRESH_QUIET,
 				      NULL, NULL, NULL);
-			pos = index_name_pos(&the_index, name, strlen(name));
+			pos = index_name_pos(r->index, name, strlen(name));
 			if (pos < 0)
 				die("%s not in index", name);
 			printf("%s is%s up to date\n", name,
-			       ce_uptodate(the_index.cache[pos]) ? "" : " not");
+			       ce_uptodate(r->index->cache[pos]) ? "" : " not");
 			write_file(name, "%d\n", i);
 		}
-		discard_cache();
+		if (table)
+			print_cache(r->index);
+		discard_index(r->index);
 	}
 	return 0;
 }
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v2 08/20] test-tool: don't force full index
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (6 preceding siblings ...)
  2021-03-10 19:30   ` [PATCH v2 07/20] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
@ 2021-03-10 19:30   ` Derrick Stolee via GitGitGadget
  2021-03-10 19:30   ` [PATCH v2 09/20] unpack-trees: ensure " Derrick Stolee via GitGitGadget
                     ` (13 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:30 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

We will use 'test-tool read-cache --table' to check that a sparse
index is written as part of init_repos. Since we will no longer always
expand a sparse index into a full index, add an '--expand' parameter
that adds a call to ensure_full_index() so we can compare a sparse index
directly against a full index, or at least what the in-memory index
looks like when expanded in this way.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/helper/test-read-cache.c               | 13 ++++++++++++-
 t/t1092-sparse-checkout-compatibility.sh |  5 +++++
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/t/helper/test-read-cache.c b/t/helper/test-read-cache.c
index 6cfd8f2de71c..b52c174acc7a 100644
--- a/t/helper/test-read-cache.c
+++ b/t/helper/test-read-cache.c
@@ -4,6 +4,7 @@
 #include "blob.h"
 #include "commit.h"
 #include "tree.h"
+#include "sparse-index.h"
 
 static void print_cache_entry(struct cache_entry *ce)
 {
@@ -35,13 +36,19 @@ int cmd__read_cache(int argc, const char **argv)
 	struct repository *r = the_repository;
 	int i, cnt = 1;
 	const char *name = NULL;
-	int table = 0;
+	int table = 0, expand = 0;
+
+	initialize_the_repository();
+	prepare_repo_settings(r);
+	r->settings.command_requires_full_index = 0;
 
 	for (++argv, --argc; *argv && starts_with(*argv, "--"); ++argv, --argc) {
 		if (skip_prefix(*argv, "--print-and-refresh=", &name))
 			continue;
 		if (!strcmp(*argv, "--table"))
 			table = 1;
+		else if (!strcmp(*argv, "--expand"))
+			expand = 1;
 	}
 
 	if (argc == 1)
@@ -51,6 +58,10 @@ int cmd__read_cache(int argc, const char **argv)
 
 	for (i = 0; i < cnt; i++) {
 		repo_read_index(r);
+
+		if (expand)
+			ensure_full_index(r->index);
+
 		if (name) {
 			int pos;
 
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 71d6f9e4c014..4d789fe86b9d 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -130,6 +130,11 @@ test_sparse_match () {
 	test_cmp sparse-checkout-err sparse-index-err
 }
 
+test_expect_success 'expanded in-memory index matches full index' '
+	init_repos &&
+	test_sparse_match test-tool read-cache --expand --table
+'
+
 test_expect_success 'status with options' '
 	init_repos &&
 	test_all_match git status --porcelain=v2 &&
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v2 09/20] unpack-trees: ensure full index
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (7 preceding siblings ...)
  2021-03-10 19:30   ` [PATCH v2 08/20] test-tool: don't force full index Derrick Stolee via GitGitGadget
@ 2021-03-10 19:30   ` Derrick Stolee via GitGitGadget
  2021-03-10 19:30   ` [PATCH v2 10/20] sparse-checkout: hold pattern list in index Derrick Stolee via GitGitGadget
                     ` (12 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:30 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The next change will translate full indexes into sparse indexes at write
time. The existing logic provides a way for every sparse index to be
expanded to a full index at read time. However, there are cases where an
index is written and then continues to be used in-memory to perform
further updates.

unpack_trees() is frequently called after such a write. In particular,
commands like 'git reset' do this double-update of the index.

Ensure that we have a full index when entering unpack_trees(), but only
when command_requires_full_index is true. This is always true at the
moment, but we will later relax that after unpack_trees() is updated to
handle sparse directory entries.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 unpack-trees.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/unpack-trees.c b/unpack-trees.c
index f5f668f532d8..4dd99219073a 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -1567,6 +1567,7 @@ static int verify_absent(const struct cache_entry *,
  */
 int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options *o)
 {
+	struct repository *repo = the_repository;
 	int i, ret;
 	static struct cache_entry *dfc;
 	struct pattern_list pl;
@@ -1578,6 +1579,12 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
 	trace_performance_enter();
 	trace2_region_enter("unpack_trees", "unpack_trees", the_repository);
 
+	prepare_repo_settings(repo);
+	if (repo->settings.command_requires_full_index) {
+		ensure_full_index(o->src_index);
+		ensure_full_index(o->dst_index);
+	}
+
 	if (!core_apply_sparse_checkout || !o->update)
 		o->skip_sparse_checkout = 1;
 	if (!o->skip_sparse_checkout && !o->pl) {
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v2 10/20] sparse-checkout: hold pattern list in index
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (8 preceding siblings ...)
  2021-03-10 19:30   ` [PATCH v2 09/20] unpack-trees: ensure " Derrick Stolee via GitGitGadget
@ 2021-03-10 19:30   ` Derrick Stolee via GitGitGadget
  2021-03-10 19:30   ` [PATCH v2 11/20] sparse-index: convert from full to sparse Derrick Stolee via GitGitGadget
                     ` (11 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:30 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

As we modify the sparse-checkout definition, we perform index operations
on a pattern_list that only exists in-memory. This allows easy backing
out in case the index update fails.

However, if the index write itself cares about the sparse-checkout
pattern set, we need access to that in-memory copy. Place a pointer to
a 'struct pattern_list' in the index so we can access this on-demand.
This will be used in the next change which uses the sparse-checkout
definition to filter out directories that are outside the sparse cone.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/sparse-checkout.c | 17 ++++++++++-------
 cache.h                   |  2 ++
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
index 2306a9ad98e0..e00b82af727b 100644
--- a/builtin/sparse-checkout.c
+++ b/builtin/sparse-checkout.c
@@ -110,6 +110,8 @@ static int update_working_directory(struct pattern_list *pl)
 	if (is_index_unborn(r->index))
 		return UPDATE_SPARSITY_SUCCESS;
 
+	r->index->sparse_checkout_patterns = pl;
+
 	memset(&o, 0, sizeof(o));
 	o.verbose_update = isatty(2);
 	o.update = 1;
@@ -138,6 +140,7 @@ static int update_working_directory(struct pattern_list *pl)
 	else
 		rollback_lock_file(&lock_file);
 
+	r->index->sparse_checkout_patterns = NULL;
 	return result;
 }
 
@@ -517,19 +520,18 @@ static int modify_pattern_list(int argc, const char **argv, enum modify_type m)
 {
 	int result;
 	int changed_config = 0;
-	struct pattern_list pl;
-	memset(&pl, 0, sizeof(pl));
+	struct pattern_list *pl = xcalloc(1, sizeof(*pl));
 
 	switch (m) {
 	case ADD:
 		if (core_sparse_checkout_cone)
-			add_patterns_cone_mode(argc, argv, &pl);
+			add_patterns_cone_mode(argc, argv, pl);
 		else
-			add_patterns_literal(argc, argv, &pl);
+			add_patterns_literal(argc, argv, pl);
 		break;
 
 	case REPLACE:
-		add_patterns_from_input(&pl, argc, argv);
+		add_patterns_from_input(pl, argc, argv);
 		break;
 	}
 
@@ -539,12 +541,13 @@ static int modify_pattern_list(int argc, const char **argv, enum modify_type m)
 		changed_config = 1;
 	}
 
-	result = write_patterns_and_update(&pl);
+	result = write_patterns_and_update(pl);
 
 	if (result && changed_config)
 		set_config(MODE_NO_PATTERNS);
 
-	clear_pattern_list(&pl);
+	clear_pattern_list(pl);
+	free(pl);
 	return result;
 }
 
diff --git a/cache.h b/cache.h
index 1f0b42264606..303411726e10 100644
--- a/cache.h
+++ b/cache.h
@@ -307,6 +307,7 @@ static inline unsigned int canon_mode(unsigned int mode)
 struct split_index;
 struct untracked_cache;
 struct progress;
+struct pattern_list;
 
 struct index_state {
 	struct cache_entry **cache;
@@ -338,6 +339,7 @@ struct index_state {
 	struct mem_pool *ce_mem_pool;
 	struct progress *progress;
 	struct repository *repo;
+	struct pattern_list *sparse_checkout_patterns;
 };
 
 /* Name hashing */
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v2 11/20] sparse-index: convert from full to sparse
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (9 preceding siblings ...)
  2021-03-10 19:30   ` [PATCH v2 10/20] sparse-checkout: hold pattern list in index Derrick Stolee via GitGitGadget
@ 2021-03-10 19:30   ` Derrick Stolee via GitGitGadget
  2021-03-10 23:44     ` Elijah Newren
  2021-03-10 19:30   ` [PATCH v2 12/20] submodule: sparse-index should not collapse links Derrick Stolee via GitGitGadget
                     ` (10 subsequent siblings)
  21 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:30 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

If we have a full index, then we can convert it to a sparse index by
replacing directories outside of the sparse cone with sparse directory
entries. The convert_to_sparse() method does this, when the situation is
appropriate.

For now, we avoid converting the index to a sparse index if:

 1. the index is split.
 2. the index is already sparse.
 3. sparse-checkout is disabled.
 4. sparse-checkout does not use cone mode.

Finally, we currently limit the conversion to when the
GIT_TEST_SPARSE_INDEX environment variable is enabled. A mode using Git
config will be added in a later change.

The trickiest thing about this conversion is that we might not be able
to mark a directory as a sparse directory just because it is outside the
sparse cone. There might be unmerged files within that directory, so we
need to look for those. Also, if there is some strange reason why a file
is not marked with CE_SKIP_WORKTREE, then we should give up on
converting that directory. There is still hope that some of its
subdirectories might be able to convert to sparse, so we keep looking
deeper.

The conversion process is assisted by the cache-tree extension. This is
calculated from the full index if it does not already exist. We then
abandon the cache-tree as it no longer applies to the newly-sparse
index. Thus, this cache-tree will be recalculated in every
sparse-full-sparse round-trip until we integrate the cache-tree
extension with the sparse index.

Some Git commands use the index after writing it. For example, 'git add'
will update the index, then write it to disk, then read its entries to
report information. To keep the in-memory index in a full state after
writing, we re-expand it to a full one after the write. This is wasteful
for commands that only write the index and do not read from it again,
but that is only the case until we make those commands "sparse aware."

We can compare the behavior of the sparse-index in
t1092-sparse-checkout-compability.sh by using GIT_TEST_SPARSE_INDEX=1
when operating on the 'sparse-index' repo. We can also compare the two
sparse repos directly, such as comparing their indexes (when expanded to
full in the case of the 'sparse-index' repo). We also verify that the
index is actually populated with sparse directory entries.

The 'checkout and reset (mixed)' test is marked for failure when
comparing a sparse repo to a full repo, but we can compare the two
sparse-checkout cases directly to ensure that we are not changing the
behavior when using a sparse index.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 cache-tree.c                             |   3 +
 cache.h                                  |   2 +
 read-cache.c                             |  26 ++++-
 sparse-index.c                           | 139 +++++++++++++++++++++++
 sparse-index.h                           |   1 +
 t/t1092-sparse-checkout-compatibility.sh |  61 +++++++++-
 6 files changed, 227 insertions(+), 5 deletions(-)

diff --git a/cache-tree.c b/cache-tree.c
index 2fb483d3c083..5f07a39e501e 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -6,6 +6,7 @@
 #include "object-store.h"
 #include "replace-object.h"
 #include "promisor-remote.h"
+#include "sparse-index.h"
 
 #ifndef DEBUG_CACHE_TREE
 #define DEBUG_CACHE_TREE 0
@@ -442,6 +443,8 @@ int cache_tree_update(struct index_state *istate, int flags)
 	if (i)
 		return i;
 
+	ensure_full_index(istate);
+
 	if (!istate->cache_tree)
 		istate->cache_tree = cache_tree();
 
diff --git a/cache.h b/cache.h
index 303411726e10..9217d405b9b8 100644
--- a/cache.h
+++ b/cache.h
@@ -251,6 +251,8 @@ static inline unsigned int create_ce_mode(unsigned int mode)
 {
 	if (S_ISLNK(mode))
 		return S_IFLNK;
+	if (mode == S_IFDIR)
+		return S_IFDIR;
 	if (S_ISDIR(mode) || S_ISGITLINK(mode))
 		return S_IFGITLINK;
 	return S_IFREG | ce_permissions(mode);
diff --git a/read-cache.c b/read-cache.c
index 97dbf2434f30..92126b9d23c9 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -25,6 +25,7 @@
 #include "fsmonitor.h"
 #include "thread-utils.h"
 #include "progress.h"
+#include "sparse-index.h"
 
 /* Mask for the name length in ce_flags in the on-disk index */
 
@@ -1002,8 +1003,14 @@ int verify_path(const char *path, unsigned mode)
 
 			c = *path++;
 			if ((c == '.' && !verify_dotfile(path, mode)) ||
-			    is_dir_sep(c) || c == '\0')
+			    is_dir_sep(c))
 				return 0;
+			/*
+			 * allow terminating directory separators for
+			 * sparse directory entries.
+			 */
+			if (c == '\0')
+				return S_ISDIR(mode);
 		} else if (c == '\\' && protect_ntfs) {
 			if (is_ntfs_dotgit(path))
 				return 0;
@@ -3061,6 +3068,14 @@ static int do_write_locked_index(struct index_state *istate, struct lock_file *l
 				 unsigned flags)
 {
 	int ret;
+	int was_full = !istate->sparse_index;
+
+	ret = convert_to_sparse(istate);
+
+	if (ret) {
+		warning(_("failed to convert to a sparse-index"));
+		return ret;
+	}
 
 	/*
 	 * TODO trace2: replace "the_repository" with the actual repo instance
@@ -3072,6 +3087,9 @@ static int do_write_locked_index(struct index_state *istate, struct lock_file *l
 	trace2_region_leave_printf("index", "do_write_index", the_repository,
 				   "%s", get_lock_file_path(lock));
 
+	if (was_full)
+		ensure_full_index(istate);
+
 	if (ret)
 		return ret;
 	if (flags & COMMIT_LOCK)
@@ -3162,9 +3180,10 @@ static int write_shared_index(struct index_state *istate,
 			      struct tempfile **temp)
 {
 	struct split_index *si = istate->split_index;
-	int ret;
+	int ret, was_full = !istate->sparse_index;
 
 	move_cache_to_base_index(istate);
+	convert_to_sparse(istate);
 
 	trace2_region_enter_printf("index", "shared/do_write_index",
 				   the_repository, "%s", get_tempfile_path(*temp));
@@ -3172,6 +3191,9 @@ static int write_shared_index(struct index_state *istate,
 	trace2_region_leave_printf("index", "shared/do_write_index",
 				   the_repository, "%s", get_tempfile_path(*temp));
 
+	if (was_full)
+		ensure_full_index(istate);
+
 	if (ret)
 		return ret;
 	ret = adjust_shared_perm(get_tempfile_path(*temp));
diff --git a/sparse-index.c b/sparse-index.c
index 316cb949b74b..5eb561259bb1 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -4,6 +4,145 @@
 #include "tree.h"
 #include "pathspec.h"
 #include "trace2.h"
+#include "cache-tree.h"
+#include "config.h"
+#include "dir.h"
+#include "fsmonitor.h"
+
+static struct cache_entry *construct_sparse_dir_entry(
+				struct index_state *istate,
+				const char *sparse_dir,
+				struct cache_tree *tree)
+{
+	struct cache_entry *de;
+
+	de = make_cache_entry(istate, S_IFDIR, &tree->oid, sparse_dir, 0, 0);
+
+	de->ce_flags |= CE_SKIP_WORKTREE;
+	return de;
+}
+
+/*
+ * Returns the number of entries "inserted" into the index.
+ */
+static int convert_to_sparse_rec(struct index_state *istate,
+				 int num_converted,
+				 int start, int end,
+				 const char *ct_path, size_t ct_pathlen,
+				 struct cache_tree *ct)
+{
+	int i, can_convert = 1;
+	int start_converted = num_converted;
+	enum pattern_match_result match;
+	int dtype;
+	struct strbuf child_path = STRBUF_INIT;
+	struct pattern_list *pl = istate->sparse_checkout_patterns;
+
+	/*
+	 * Is the current path outside of the sparse cone?
+	 * Then check if the region can be replaced by a sparse
+	 * directory entry (everything is sparse and merged).
+	 */
+	match = path_matches_pattern_list(ct_path, ct_pathlen,
+					  NULL, &dtype, pl, istate);
+	if (match != NOT_MATCHED)
+		can_convert = 0;
+
+	for (i = start; can_convert && i < end; i++) {
+		struct cache_entry *ce = istate->cache[i];
+
+		if (ce_stage(ce) ||
+		    !(ce->ce_flags & CE_SKIP_WORKTREE))
+			can_convert = 0;
+	}
+
+	if (can_convert) {
+		struct cache_entry *se;
+		se = construct_sparse_dir_entry(istate, ct_path, ct);
+
+		istate->cache[num_converted++] = se;
+		return 1;
+	}
+
+	for (i = start; i < end; ) {
+		int count, span, pos = -1;
+		const char *base, *slash;
+		struct cache_entry *ce = istate->cache[i];
+
+		/*
+		 * Detect if this is a normal entry outside of any subtree
+		 * entry.
+		 */
+		base = ce->name + ct_pathlen;
+		slash = strchr(base, '/');
+
+		if (slash)
+			pos = cache_tree_subtree_pos(ct, base, slash - base);
+
+		if (pos < 0) {
+			istate->cache[num_converted++] = ce;
+			i++;
+			continue;
+		}
+
+		strbuf_setlen(&child_path, 0);
+		strbuf_add(&child_path, ce->name, slash - ce->name + 1);
+
+		span = ct->down[pos]->cache_tree->entry_count;
+		count = convert_to_sparse_rec(istate,
+					      num_converted, i, i + span,
+					      child_path.buf, child_path.len,
+					      ct->down[pos]->cache_tree);
+		num_converted += count;
+		i += span;
+	}
+
+	strbuf_release(&child_path);
+	return num_converted - start_converted;
+}
+
+int convert_to_sparse(struct index_state *istate)
+{
+	if (istate->split_index || istate->sparse_index ||
+	    !core_apply_sparse_checkout || !core_sparse_checkout_cone)
+		return 0;
+
+	/*
+	 * For now, only create a sparse index with the
+	 * GIT_TEST_SPARSE_INDEX environment variable. We will relax
+	 * this once we have a proper way to opt-in (and later still,
+	 * opt-out).
+	 */
+	if (!git_env_bool("GIT_TEST_SPARSE_INDEX", 0))
+		return 0;
+
+	if (!istate->sparse_checkout_patterns) {
+		istate->sparse_checkout_patterns = xcalloc(1, sizeof(struct pattern_list));
+		if (get_sparse_checkout_patterns(istate->sparse_checkout_patterns) < 0)
+			return 0;
+	}
+
+	if (!istate->sparse_checkout_patterns->use_cone_patterns) {
+		warning(_("attempting to use sparse-index without cone mode"));
+		return -1;
+	}
+
+	if (cache_tree_update(istate, 0)) {
+		warning(_("unable to update cache-tree, staying full"));
+		return -1;
+	}
+
+	remove_fsmonitor(istate);
+
+	trace2_region_enter("index", "convert_to_sparse", istate->repo);
+	istate->cache_nr = convert_to_sparse_rec(istate,
+						 0, 0, istate->cache_nr,
+						 "", 0, istate->cache_tree);
+	istate->drop_cache_tree = 1;
+	istate->sparse_index = 1;
+	trace2_region_leave("index", "convert_to_sparse", istate->repo);
+	return 0;
+}
 
 static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
 {
diff --git a/sparse-index.h b/sparse-index.h
index 09a20d036c46..64380e121d80 100644
--- a/sparse-index.h
+++ b/sparse-index.h
@@ -3,5 +3,6 @@
 
 struct index_state;
 void ensure_full_index(struct index_state *istate);
+int convert_to_sparse(struct index_state *istate);
 
 #endif
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 4d789fe86b9d..ca87033d30b0 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -2,6 +2,9 @@
 
 test_description='compare full workdir to sparse workdir'
 
+GIT_TEST_CHECK_CACHE_TREE=0
+GIT_TEST_SPLIT_INDEX=0
+
 . ./test-lib.sh
 
 test_expect_success 'setup' '
@@ -121,15 +124,49 @@ run_on_all () {
 test_all_match () {
 	run_on_all "$@" &&
 	test_cmp full-checkout-out sparse-checkout-out &&
-	test_cmp full-checkout-err sparse-checkout-err
+	test_cmp full-checkout-out sparse-index-out &&
+	test_cmp full-checkout-err sparse-checkout-err &&
+	test_cmp full-checkout-err sparse-index-err
 }
 
 test_sparse_match () {
-	run_on_sparse $* &&
+	run_on_sparse "$@" &&
 	test_cmp sparse-checkout-out sparse-index-out &&
 	test_cmp sparse-checkout-err sparse-index-err
 }
 
+test_expect_success 'sparse-index contents' '
+	init_repos &&
+
+	test-tool -C sparse-index read-cache --table >cache &&
+	for dir in folder1 folder2 x
+	do
+		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
+		grep "040000 tree $TREE	$dir/" cache \
+			|| return 1
+	done &&
+
+	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set folder1 &&
+
+	test-tool -C sparse-index read-cache --table >cache &&
+	for dir in deep folder2 x
+	do
+		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
+		grep "040000 tree $TREE	$dir/" cache \
+			|| return 1
+	done &&
+
+	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep/deeper1 &&
+
+	test-tool -C sparse-index read-cache --table >cache &&
+	for dir in deep/deeper2 folder1 folder2 x
+	do
+		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
+		grep "040000 tree $TREE	$dir/" cache \
+			|| return 1
+	done
+'
+
 test_expect_success 'expanded in-memory index matches full index' '
 	init_repos &&
 	test_sparse_match test-tool read-cache --expand --table
@@ -137,6 +174,7 @@ test_expect_success 'expanded in-memory index matches full index' '
 
 test_expect_success 'status with options' '
 	init_repos &&
+	test_sparse_match ls &&
 	test_all_match git status --porcelain=v2 &&
 	test_all_match git status --porcelain=v2 -z -u &&
 	test_all_match git status --porcelain=v2 -uno &&
@@ -273,6 +311,17 @@ test_expect_failure 'checkout and reset (mixed)' '
 	test_all_match git reset update-folder2
 '
 
+# Ensure that sparse-index behaves identically to
+# sparse-checkout with a full index.
+test_expect_success 'checkout and reset (mixed) [sparse]' '
+	init_repos &&
+
+	test_sparse_match git checkout -b reset-test update-deep &&
+	test_sparse_match git reset deepest &&
+	test_sparse_match git reset update-folder1 &&
+	test_sparse_match git reset update-folder2
+'
+
 test_expect_success 'merge' '
 	init_repos &&
 
@@ -309,14 +358,20 @@ test_expect_success 'clean' '
 	test_all_match git status --porcelain=v2 &&
 	test_all_match git clean -f &&
 	test_all_match git status --porcelain=v2 &&
+	test_sparse_match ls &&
+	test_sparse_match ls folder1 &&
 
 	test_all_match git clean -xf &&
 	test_all_match git status --porcelain=v2 &&
+	test_sparse_match ls &&
+	test_sparse_match ls folder1 &&
 
 	test_all_match git clean -xdf &&
 	test_all_match git status --porcelain=v2 &&
+	test_sparse_match ls &&
+	test_sparse_match ls folder1 &&
 
-	test_path_is_dir sparse-checkout/folder1
+	test_sparse_match test_path_is_dir folder1
 '
 
 test_done
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v2 12/20] submodule: sparse-index should not collapse links
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (10 preceding siblings ...)
  2021-03-10 19:30   ` [PATCH v2 11/20] sparse-index: convert from full to sparse Derrick Stolee via GitGitGadget
@ 2021-03-10 19:30   ` Derrick Stolee via GitGitGadget
  2021-03-10 19:30   ` [PATCH v2 13/20] unpack-trees: allow sparse directories Derrick Stolee via GitGitGadget
                     ` (9 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:30 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

A submodule is stored as a "Git link" that actually points to a commit
within a submodule. Submodules are populated or not depending on
submodule configuration, not sparse-checkout. To ensure that the
sparse-index feature integrates correctly with submodules, we should not
collapse a directory if there is a Git link within its range.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 sparse-index.c                           |  1 +
 t/t1092-sparse-checkout-compatibility.sh | 17 +++++++++++++++++
 2 files changed, 18 insertions(+)

diff --git a/sparse-index.c b/sparse-index.c
index 5eb561259bb1..36b4dde7eeda 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -52,6 +52,7 @@ static int convert_to_sparse_rec(struct index_state *istate,
 		struct cache_entry *ce = istate->cache[i];
 
 		if (ce_stage(ce) ||
+		    S_ISGITLINK(ce->ce_mode) ||
 		    !(ce->ce_flags & CE_SKIP_WORKTREE))
 			can_convert = 0;
 	}
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index ca87033d30b0..b38fab6455d9 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -374,4 +374,21 @@ test_expect_success 'clean' '
 	test_sparse_match test_path_is_dir folder1
 '
 
+test_expect_success 'submodule handling' '
+	init_repos &&
+
+	test_all_match mkdir modules &&
+	test_all_match touch modules/a &&
+	test_all_match git add modules &&
+	test_all_match git commit -m "add modules directory" &&
+
+	run_on_all git submodule add "$(pwd)/initial-repo" modules/sub &&
+	test_all_match git commit -m "add submodule" &&
+
+	# having a submodule prevents "modules" from collapse
+	test-tool -C sparse-index read-cache --table >cache &&
+	grep "100644 blob .*	modules/a" cache &&
+	grep "160000 commit $(git -C initial-repo rev-parse HEAD)	modules/sub" cache
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v2 13/20] unpack-trees: allow sparse directories
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (11 preceding siblings ...)
  2021-03-10 19:30   ` [PATCH v2 12/20] submodule: sparse-index should not collapse links Derrick Stolee via GitGitGadget
@ 2021-03-10 19:30   ` Derrick Stolee via GitGitGadget
  2021-03-10 19:30   ` [PATCH v2 14/20] sparse-index: check index conversion happens Derrick Stolee via GitGitGadget
                     ` (8 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:30 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The index_pos_by_traverse_info() currently throws a BUG() when a
directory entry exists exactly in the index. We need to consider that it
is possible to have a directory in a sparse index as long as that entry
is itself marked with the skip-worktree bit.

The 'pos' variable is assigned a negative value if an exact match is not
found. Since a directory name can be an exact match, it is no longer an
error to have a nonnegative 'pos' value.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 unpack-trees.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/unpack-trees.c b/unpack-trees.c
index 4dd99219073a..b324eec2a5d1 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -746,9 +746,12 @@ static int index_pos_by_traverse_info(struct name_entry *names,
 	strbuf_make_traverse_path(&name, info, names->path, names->pathlen);
 	strbuf_addch(&name, '/');
 	pos = index_name_pos(o->src_index, name.buf, name.len);
-	if (pos >= 0)
-		BUG("This is a directory and should not exist in index");
-	pos = -pos - 1;
+	if (pos >= 0) {
+		if (!o->src_index->sparse_index ||
+		    !(o->src_index->cache[pos]->ce_flags & CE_SKIP_WORKTREE))
+			BUG("This is a directory and should not exist in index");
+	} else
+		pos = -pos - 1;
 	if (pos >= o->src_index->cache_nr ||
 	    !starts_with(o->src_index->cache[pos]->name, name.buf) ||
 	    (pos > 0 && starts_with(o->src_index->cache[pos-1]->name, name.buf)))
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v2 14/20] sparse-index: check index conversion happens
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (12 preceding siblings ...)
  2021-03-10 19:30   ` [PATCH v2 13/20] unpack-trees: allow sparse directories Derrick Stolee via GitGitGadget
@ 2021-03-10 19:30   ` Derrick Stolee via GitGitGadget
  2021-03-10 19:30   ` [PATCH v2 15/20] sparse-index: create extension for compatibility Derrick Stolee via GitGitGadget
                     ` (7 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:30 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Add a test case that uses test_region to ensure that we are truly
expanding a sparse index to a full one, then converting back to sparse
when writing the index. As we integrate more Git commands with the
sparse index, we will convert these commands to check that we do _not_
convert the sparse index to a full index and instead stay sparse the
entire time.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t1092-sparse-checkout-compatibility.sh | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index b38fab6455d9..bfc9e28ef0e1 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -391,4 +391,22 @@ test_expect_success 'submodule handling' '
 	grep "160000 commit $(git -C initial-repo rev-parse HEAD)	modules/sub" cache
 '
 
+test_expect_success 'sparse-index is expanded and converted back' '
+	init_repos &&
+
+	(
+		GIT_TEST_SPARSE_INDEX=1 &&
+		export GIT_TEST_SPARSE_INDEX &&
+		GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
+			git -C sparse-index -c core.fsmonitor="" reset --hard &&
+		test_region index convert_to_sparse trace2.txt &&
+		test_region index ensure_full_index trace2.txt &&
+
+		rm trace2.txt &&
+		GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
+			git -C sparse-index -c core.fsmonitor="" status -uno &&
+		test_region index ensure_full_index trace2.txt
+	)
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v2 15/20] sparse-index: create extension for compatibility
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (13 preceding siblings ...)
  2021-03-10 19:30   ` [PATCH v2 14/20] sparse-index: check index conversion happens Derrick Stolee via GitGitGadget
@ 2021-03-10 19:30   ` Derrick Stolee via GitGitGadget
  2021-03-10 19:30   ` [PATCH v2 16/20] sparse-checkout: toggle sparse index from builtin Derrick Stolee via GitGitGadget
                     ` (6 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:30 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Previously, we enabled the sparse index format only using
GIT_TEST_SPARSE_INDEX=1. This is not a feasible direction for users to
actually select this mode. Further, sparse directory entries are not
understood by the index formats as advertised.

We _could_ add a new index version that explicitly adds these
capabilities, but there are nuances to index formats 2, 3, and 4 that
are still valuable to select as options. Until we add index format
version 5, create a repo extension, "extensions.sparseIndex", that
specifies that the tool reading this repository must understand sparse
directory entries.

This change only encodes the extension and enables it when
GIT_TEST_SPARSE_INDEX=1. Later, we will add a more user-friendly CLI
mechanism.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/config/extensions.txt |  8 ++++++
 cache.h                             |  1 +
 repo-settings.c                     |  7 ++++++
 repository.h                        |  3 ++-
 setup.c                             |  3 +++
 sparse-index.c                      | 38 +++++++++++++++++++++++++----
 6 files changed, 54 insertions(+), 6 deletions(-)

diff --git a/Documentation/config/extensions.txt b/Documentation/config/extensions.txt
index 4e23d73cdcad..c02e09af0046 100644
--- a/Documentation/config/extensions.txt
+++ b/Documentation/config/extensions.txt
@@ -6,3 +6,11 @@ extensions.objectFormat::
 Note that this setting should only be set by linkgit:git-init[1] or
 linkgit:git-clone[1].  Trying to change it after initialization will not
 work and will produce hard-to-diagnose issues.
+
+extensions.sparseIndex::
+	When combined with `core.sparseCheckout=true` and
+	`core.sparseCheckoutCone=true`, the index may contain entries
+	corresponding to directories outside of the sparse-checkout
+	definition in lieu of containing each path under such directories.
+	Versions of Git that do not understand this extension do not
+	expect directory entries in the index.
diff --git a/cache.h b/cache.h
index 9217d405b9b8..03f931c5f34d 100644
--- a/cache.h
+++ b/cache.h
@@ -1059,6 +1059,7 @@ struct repository_format {
 	int worktree_config;
 	int is_bare;
 	int hash_algo;
+	int sparse_index;
 	char *work_tree;
 	struct string_list unknown_extensions;
 	struct string_list v1_only_extensions;
diff --git a/repo-settings.c b/repo-settings.c
index d63569e4041e..9677d50f9238 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -85,4 +85,11 @@ void prepare_repo_settings(struct repository *r)
 	 * removed.
 	 */
 	r->settings.command_requires_full_index = 1;
+
+	/*
+	 * Initialize this as off.
+	 */
+	r->settings.sparse_index = 0;
+	if (!repo_config_get_bool(r, "extensions.sparseindex", &value) && value)
+		r->settings.sparse_index = 1;
 }
diff --git a/repository.h b/repository.h
index e06a23015697..a45f7520fd9e 100644
--- a/repository.h
+++ b/repository.h
@@ -42,7 +42,8 @@ struct repo_settings {
 
 	int core_multi_pack_index;
 
-	unsigned command_requires_full_index:1;
+	unsigned command_requires_full_index:1,
+		 sparse_index:1;
 };
 
 struct repository {
diff --git a/setup.c b/setup.c
index c04cd25a30df..cd8394564613 100644
--- a/setup.c
+++ b/setup.c
@@ -500,6 +500,9 @@ static enum extension_result handle_extension(const char *var,
 			return error("invalid value for 'extensions.objectformat'");
 		data->hash_algo = format;
 		return EXTENSION_OK;
+	} else if (!strcmp(ext, "sparseindex")) {
+		data->sparse_index = 1;
+		return EXTENSION_OK;
 	}
 	return EXTENSION_UNKNOWN;
 }
diff --git a/sparse-index.c b/sparse-index.c
index 36b4dde7eeda..b9c14ef7ab50 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -102,19 +102,47 @@ static int convert_to_sparse_rec(struct index_state *istate,
 	return num_converted - start_converted;
 }
 
+static int enable_sparse_index(struct repository *repo)
+{
+	const char *config_path = repo_git_path(repo, "config.worktree");
+
+	if (upgrade_repository_format(1) < 0) {
+		warning(_("unable to upgrade repository format to enable sparse-index"));
+		return -1;
+	}
+	git_config_set_in_file_gently(config_path,
+				      "extensions.sparseIndex",
+				      "true");
+
+	prepare_repo_settings(repo);
+	repo->settings.sparse_index = 1;
+	return 0;
+}
+
 int convert_to_sparse(struct index_state *istate)
 {
 	if (istate->split_index || istate->sparse_index ||
 	    !core_apply_sparse_checkout || !core_sparse_checkout_cone)
 		return 0;
 
+	if (!istate->repo)
+		istate->repo = the_repository;
+
+	/*
+	 * The GIT_TEST_SPARSE_INDEX environment variable triggers the
+	 * extensions.sparseIndex config variable to be on.
+	 */
+	if (git_env_bool("GIT_TEST_SPARSE_INDEX", 0)) {
+		int err = enable_sparse_index(istate->repo);
+		if (err < 0)
+			return err;
+	}
+
 	/*
-	 * For now, only create a sparse index with the
-	 * GIT_TEST_SPARSE_INDEX environment variable. We will relax
-	 * this once we have a proper way to opt-in (and later still,
-	 * opt-out).
+	 * Only convert to sparse if extensions.sparseIndex is set.
 	 */
-	if (!git_env_bool("GIT_TEST_SPARSE_INDEX", 0))
+	prepare_repo_settings(istate->repo);
+	if (!istate->repo->settings.sparse_index)
 		return 0;
 
 	if (!istate->sparse_checkout_patterns) {
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v2 16/20] sparse-checkout: toggle sparse index from builtin
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (14 preceding siblings ...)
  2021-03-10 19:30   ` [PATCH v2 15/20] sparse-index: create extension for compatibility Derrick Stolee via GitGitGadget
@ 2021-03-10 19:30   ` Derrick Stolee via GitGitGadget
  2021-03-10 19:31   ` [PATCH v2 17/20] sparse-checkout: disable sparse-index Derrick Stolee via GitGitGadget
                     ` (5 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:30 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The sparse index extension is used to signal that index writes should be
in sparse mode. This was only updated using GIT_TEST_SPARSE_INDEX=1.

Add a '--[no-]sparse-index' option to 'git sparse-checkout init' that
specifies if the sparse index should be used. It also updates the index
to use the correct format, either way. Add a warning in the
documentation that the use of a repository extension might reduce
compatibility with third-party tools. 'git sparse-checkout init' already
sets extension.worktreeConfig, which places most sparse-checkout users
outside of the scope of most third-party tools.

Update t1092-sparse-checkout-compatibility.sh to use this CLI instead of
GIT_TEST_SPARSE_INDEX=1.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-sparse-checkout.txt    | 14 +++++++++
 builtin/sparse-checkout.c                | 17 ++++++++++-
 sparse-index.c                           | 37 +++++++++++++++--------
 sparse-index.h                           |  3 ++
 t/t1092-sparse-checkout-compatibility.sh | 38 +++++++++++-------------
 5 files changed, 76 insertions(+), 33 deletions(-)

diff --git a/Documentation/git-sparse-checkout.txt b/Documentation/git-sparse-checkout.txt
index a0eeaeb02ee3..4a8343cf7fa4 100644
--- a/Documentation/git-sparse-checkout.txt
+++ b/Documentation/git-sparse-checkout.txt
@@ -45,6 +45,20 @@ To avoid interfering with other worktrees, it first enables the
 When `--cone` is provided, the `core.sparseCheckoutCone` setting is
 also set, allowing for better performance with a limited set of
 patterns (see 'CONE PATTERN SET' below).
++
+Use the `--[no-]sparse-index` option to toggle the use of the sparse
+index format. This reduces the size of the index to be more closely
+aligned with your sparse-checkout definition. This can have significant
+performance advantages for commands such as `git status` or `git add`.
+This feature is still experimental. Some commands might be slower with
+a sparse index until they are properly integrated with the feature.
++
+**WARNING:** Using a sparse index requires modifying the index in a way
+that is not completely understood by external tools. If you have trouble
+with this compatibility, then run `git sparse-checkout sparse-index disable`
+to rewrite your index to not be sparse. Older versions of Git will not
+understand the `sparseIndex` repository extension and may fail to interact
+with your repository until it is disabled.
 
 'set'::
 	Write a set of patterns to the sparse-checkout file, as given as
diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
index e00b82af727b..ca63e2c64e95 100644
--- a/builtin/sparse-checkout.c
+++ b/builtin/sparse-checkout.c
@@ -14,6 +14,7 @@
 #include "unpack-trees.h"
 #include "wt-status.h"
 #include "quote.h"
+#include "sparse-index.h"
 
 static const char *empty_base = "";
 
@@ -283,12 +284,13 @@ static int set_config(enum sparse_checkout_mode mode)
 }
 
 static char const * const builtin_sparse_checkout_init_usage[] = {
-	N_("git sparse-checkout init [--cone]"),
+	N_("git sparse-checkout init [--cone] [--[no-]sparse-index]"),
 	NULL
 };
 
 static struct sparse_checkout_init_opts {
 	int cone_mode;
+	int sparse_index;
 } init_opts;
 
 static int sparse_checkout_init(int argc, const char **argv)
@@ -303,11 +305,15 @@ static int sparse_checkout_init(int argc, const char **argv)
 	static struct option builtin_sparse_checkout_init_options[] = {
 		OPT_BOOL(0, "cone", &init_opts.cone_mode,
 			 N_("initialize the sparse-checkout in cone mode")),
+		OPT_BOOL(0, "sparse-index", &init_opts.sparse_index,
+			 N_("toggle the use of a sparse index")),
 		OPT_END(),
 	};
 
 	repo_read_index(the_repository);
 
+	init_opts.sparse_index = -1;
+
 	argc = parse_options(argc, argv, NULL,
 			     builtin_sparse_checkout_init_options,
 			     builtin_sparse_checkout_init_usage, 0);
@@ -326,6 +332,15 @@ static int sparse_checkout_init(int argc, const char **argv)
 	sparse_filename = get_sparse_checkout_filename();
 	res = add_patterns_from_file_to_list(sparse_filename, "", 0, &pl, NULL);
 
+	if (init_opts.sparse_index >= 0) {
+		if (set_sparse_index_config(the_repository, init_opts.sparse_index) < 0)
+			die(_("failed to modify sparse-index config"));
+
+		/* force an index rewrite */
+		repo_read_index(the_repository);
+		the_repository->index->updated_workdir = 1;
+	}
+
 	/* If we already have a sparse-checkout file, use it. */
 	if (res >= 0) {
 		free(sparse_filename);
diff --git a/sparse-index.c b/sparse-index.c
index b9c14ef7ab50..1c84cac255bf 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -104,23 +104,37 @@ static int convert_to_sparse_rec(struct index_state *istate,
 
 static int enable_sparse_index(struct repository *repo)
 {
-	const char *config_path = repo_git_path(repo, "config.worktree");
+	int res;
 
 	if (upgrade_repository_format(1) < 0) {
 		warning(_("unable to upgrade repository format to enable sparse-index"));
 		return -1;
 	}
-	git_config_set_in_file_gently(config_path,
-				      "extensions.sparseIndex",
-				      "true");
+	res = git_config_set_gently("extensions.sparseindex", "true");
 
 	prepare_repo_settings(repo);
 	repo->settings.sparse_index = 1;
-	return 0;
+	return res;
+}
+
+int set_sparse_index_config(struct repository *repo, int enable)
+{
+	int res;
+
+	if (enable)
+		return enable_sparse_index(repo);
+
+	/* Don't downgrade repository format, just remove the extension. */
+	res = git_config_set_gently("extensions.sparseindex", NULL);
+
+	prepare_repo_settings(repo);
+	repo->settings.sparse_index = 0;
+	return res;
 }
 
 int convert_to_sparse(struct index_state *istate)
 {
+	int test_env;
 	if (istate->split_index || istate->sparse_index ||
 	    !core_apply_sparse_checkout || !core_sparse_checkout_cone)
 		return 0;
@@ -129,14 +143,13 @@ int convert_to_sparse(struct index_state *istate)
 		istate->repo = the_repository;
 
 	/*
-	 * The GIT_TEST_SPARSE_INDEX environment variable triggers the
-	 * extensions.sparseIndex config variable to be on.
+	 * If GIT_TEST_SPARSE_INDEX=1, then trigger extensions.sparseIndex
+	 * to be fully enabled. If GIT_TEST_SPARSE_INDEX=0 (set explicitly),
+	 * then purposefully disable the setting.
 	 */
-	if (git_env_bool("GIT_TEST_SPARSE_INDEX", 0)) {
-		int err = enable_sparse_index(istate->repo);
-		if (err < 0)
-			return err;
-	}
+	test_env = git_env_bool("GIT_TEST_SPARSE_INDEX", -1);
+	if (test_env >= 0)
+		set_sparse_index_config(istate->repo, test_env);
 
 	/*
 	 * Only convert to sparse if extensions.sparseIndex is set.
diff --git a/sparse-index.h b/sparse-index.h
index 64380e121d80..39dcc859735e 100644
--- a/sparse-index.h
+++ b/sparse-index.h
@@ -5,4 +5,7 @@ struct index_state;
 void ensure_full_index(struct index_state *istate);
 int convert_to_sparse(struct index_state *istate);
 
+struct repository;
+int set_sparse_index_config(struct repository *repo, int enable);
+
 #endif
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index bfc9e28ef0e1..9c2bc4d25f66 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -4,6 +4,7 @@ test_description='compare full workdir to sparse workdir'
 
 GIT_TEST_CHECK_CACHE_TREE=0
 GIT_TEST_SPLIT_INDEX=0
+GIT_TEST_SPARSE_INDEX=
 
 . ./test-lib.sh
 
@@ -98,25 +99,26 @@ init_repos () {
 	# initialize sparse-checkout definitions
 	git -C sparse-checkout sparse-checkout init --cone &&
 	git -C sparse-checkout sparse-checkout set deep &&
-	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout init --cone &&
-	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep
+	git -C sparse-index sparse-checkout init --cone --sparse-index &&
+	test_cmp_config -C sparse-index true extensions.sparseindex &&
+	git -C sparse-index sparse-checkout set deep
 }
 
 run_on_sparse () {
 	(
 		cd sparse-checkout &&
-		GIT_TEST_SPARSE_INDEX=0 "$@" >../sparse-checkout-out 2>../sparse-checkout-err
+		"$@" >../sparse-checkout-out 2>../sparse-checkout-err
 	) &&
 	(
 		cd sparse-index &&
-		GIT_TEST_SPARSE_INDEX=1 "$@" >../sparse-index-out 2>../sparse-index-err
+		"$@" >../sparse-index-out 2>../sparse-index-err
 	)
 }
 
 run_on_all () {
 	(
 		cd full-checkout &&
-		GIT_TEST_SPARSE_INDEX=0 "$@" >../full-checkout-out 2>../full-checkout-err
+		"$@" >../full-checkout-out 2>../full-checkout-err
 	) &&
 	run_on_sparse "$@"
 }
@@ -146,7 +148,7 @@ test_expect_success 'sparse-index contents' '
 			|| return 1
 	done &&
 
-	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set folder1 &&
+	git -C sparse-index sparse-checkout set folder1 &&
 
 	test-tool -C sparse-index read-cache --table >cache &&
 	for dir in deep folder2 x
@@ -156,7 +158,7 @@ test_expect_success 'sparse-index contents' '
 			|| return 1
 	done &&
 
-	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep/deeper1 &&
+	git -C sparse-index sparse-checkout set deep/deeper1 &&
 
 	test-tool -C sparse-index read-cache --table >cache &&
 	for dir in deep/deeper2 folder1 folder2 x
@@ -394,19 +396,15 @@ test_expect_success 'submodule handling' '
 test_expect_success 'sparse-index is expanded and converted back' '
 	init_repos &&
 
-	(
-		GIT_TEST_SPARSE_INDEX=1 &&
-		export GIT_TEST_SPARSE_INDEX &&
-		GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
-			git -C sparse-index -c core.fsmonitor="" reset --hard &&
-		test_region index convert_to_sparse trace2.txt &&
-		test_region index ensure_full_index trace2.txt &&
-
-		rm trace2.txt &&
-		GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
-			git -C sparse-index -c core.fsmonitor="" status -uno &&
-		test_region index ensure_full_index trace2.txt
-	)
+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
+		git -C sparse-index -c core.fsmonitor="" reset --hard &&
+	test_region index convert_to_sparse trace2.txt &&
+	test_region index ensure_full_index trace2.txt &&
+
+	rm trace2.txt &&
+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
+		git -C sparse-index -c core.fsmonitor="" status -uno &&
+	test_region index ensure_full_index trace2.txt
 '
 
 test_done
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v2 17/20] sparse-checkout: disable sparse-index
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (15 preceding siblings ...)
  2021-03-10 19:30   ` [PATCH v2 16/20] sparse-checkout: toggle sparse index from builtin Derrick Stolee via GitGitGadget
@ 2021-03-10 19:31   ` Derrick Stolee via GitGitGadget
  2021-03-10 19:31   ` [PATCH v2 18/20] cache-tree: integrate with sparse directory entries Derrick Stolee via GitGitGadget
                     ` (4 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:31 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

We use 'git sparse-checkout init --cone --sparse-index' to toggle the
sparse-index feature. It makes sense to also disable it when running
'git sparse-checkout disable'. This is particularly important because it
removes the extensions.sparseIndex config option, allowing other tools
to use this Git repository again.

This does mean that 'git sparse-checkout init' will not re-enable the
sparse-index feature, even if it was previously enabled.

While testing this feature, I noticed that the sparse-index was not
being written on the first run, but by a second. This was caught by the
call to 'test-tool read-cache --table'. This requires adjusting some
assignments to core_apply_sparse_checkout and pl.use_cone_patterns in
the sparse_checkout_init() logic.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/sparse-checkout.c          | 10 +++++++++-
 t/t1091-sparse-checkout-builtin.sh | 13 +++++++++++++
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
index ca63e2c64e95..585343fa1972 100644
--- a/builtin/sparse-checkout.c
+++ b/builtin/sparse-checkout.c
@@ -280,6 +280,9 @@ static int set_config(enum sparse_checkout_mode mode)
 				      "core.sparseCheckoutCone",
 				      mode == MODE_CONE_PATTERNS ? "true" : NULL);
 
+	if (mode == MODE_NO_PATTERNS)
+		set_sparse_index_config(the_repository, 0);
+
 	return 0;
 }
 
@@ -341,10 +344,11 @@ static int sparse_checkout_init(int argc, const char **argv)
 		the_repository->index->updated_workdir = 1;
 	}
 
+	core_apply_sparse_checkout = 1;
+
 	/* If we already have a sparse-checkout file, use it. */
 	if (res >= 0) {
 		free(sparse_filename);
-		core_apply_sparse_checkout = 1;
 		return update_working_directory(NULL);
 	}
 
@@ -366,6 +370,7 @@ static int sparse_checkout_init(int argc, const char **argv)
 	add_pattern(strbuf_detach(&pattern, NULL), empty_base, 0, &pl, 0);
 	strbuf_addstr(&pattern, "!/*/");
 	add_pattern(strbuf_detach(&pattern, NULL), empty_base, 0, &pl, 0);
+	pl.use_cone_patterns = init_opts.cone_mode;
 
 	return write_patterns_and_update(&pl);
 }
@@ -632,6 +637,9 @@ static int sparse_checkout_disable(int argc, const char **argv)
 	strbuf_addstr(&match_all, "/*");
 	add_pattern(strbuf_detach(&match_all, NULL), empty_base, 0, &pl, 0);
 
+	prepare_repo_settings(the_repository);
+	the_repository->settings.sparse_index = 0;
+
 	if (update_working_directory(&pl))
 		die(_("error while refreshing working directory"));
 
diff --git a/t/t1091-sparse-checkout-builtin.sh b/t/t1091-sparse-checkout-builtin.sh
index fc64e9ed99f4..ff1ad570a255 100755
--- a/t/t1091-sparse-checkout-builtin.sh
+++ b/t/t1091-sparse-checkout-builtin.sh
@@ -205,6 +205,19 @@ test_expect_success 'sparse-checkout disable' '
 	check_files repo a deep folder1 folder2
 '
 
+test_expect_success 'sparse-index enabled and disabled' '
+	git -C repo sparse-checkout init --cone --sparse-index &&
+	test_cmp_config -C repo true extensions.sparseIndex &&
+	test-tool -C repo read-cache --table >cache &&
+	grep " tree " cache &&
+
+	git -C repo sparse-checkout disable &&
+	test-tool -C repo read-cache --table >cache &&
+	! grep " tree " cache &&
+	git -C repo config --list >config &&
+	! grep extensions.sparseindex config
+'
+
 test_expect_success 'cone mode: init and set' '
 	git -C repo sparse-checkout init --cone &&
 	git -C repo config --list >config &&
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v2 18/20] cache-tree: integrate with sparse directory entries
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (16 preceding siblings ...)
  2021-03-10 19:31   ` [PATCH v2 17/20] sparse-checkout: disable sparse-index Derrick Stolee via GitGitGadget
@ 2021-03-10 19:31   ` Derrick Stolee via GitGitGadget
  2021-03-10 19:31   ` [PATCH v2 19/20] sparse-index: loose integration with cache_tree_verify() Derrick Stolee via GitGitGadget
                     ` (3 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:31 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The cache-tree extension was previously disabled with sparse indexes.
However, the cache-tree is an important performance feature for commands
like 'git status' and 'git add'. Integrate it with sparse directory
entries.

When writing a sparse index, completely clear and recalculate the cache
tree. By starting from scratch, the only integration necessary is to
check if we hit a sparse directory entry and create a leaf of the
cache-tree that has an entry_count of one and no subtrees.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 cache-tree.c   | 18 ++++++++++++++++++
 sparse-index.c | 10 +++++++++-
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/cache-tree.c b/cache-tree.c
index 5f07a39e501e..950a9615db8f 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -256,6 +256,24 @@ static int update_one(struct cache_tree *it,
 
 	*skip_count = 0;
 
+	/*
+	 * If the first entry of this region is a sparse directory
+	 * entry corresponding exactly to 'base', then this cache_tree
+	 * struct is a "leaf" in the data structure, pointing to the
+	 * tree OID specified in the entry.
+	 */
+	if (entries > 0) {
+		const struct cache_entry *ce = cache[0];
+
+		if (S_ISSPARSEDIR(ce->ce_mode) &&
+		    ce->ce_namelen == baselen &&
+		    !strncmp(ce->name, base, baselen)) {
+			it->entry_count = 1;
+			oidcpy(&it->oid, &ce->oid);
+			return 1;
+		}
+	}
+
 	if (0 <= it->entry_count && has_object_file(&it->oid))
 		return it->entry_count;
 
diff --git a/sparse-index.c b/sparse-index.c
index 1c84cac255bf..ea603201a323 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -180,7 +180,11 @@ int convert_to_sparse(struct index_state *istate)
 	istate->cache_nr = convert_to_sparse_rec(istate,
 						 0, 0, istate->cache_nr,
 						 "", 0, istate->cache_tree);
-	istate->drop_cache_tree = 1;
+
+	/* Clear and recompute the cache-tree */
+	cache_tree_free(&istate->cache_tree);
+	cache_tree_update(istate, 0);
+
 	istate->sparse_index = 1;
 	trace2_region_leave("index", "convert_to_sparse", istate->repo);
 	return 0;
@@ -278,5 +282,9 @@ void ensure_full_index(struct index_state *istate)
 
 	free(full);
 
+	/* Clear and recompute the cache-tree */
+	cache_tree_free(&istate->cache_tree);
+	cache_tree_update(istate, 0);
+
 	trace2_region_leave("index", "ensure_full_index", istate->repo);
 }
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v2 19/20] sparse-index: loose integration with cache_tree_verify()
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (17 preceding siblings ...)
  2021-03-10 19:31   ` [PATCH v2 18/20] cache-tree: integrate with sparse directory entries Derrick Stolee via GitGitGadget
@ 2021-03-10 19:31   ` Derrick Stolee via GitGitGadget
  2021-03-10 19:31   ` [PATCH v2 20/20] p2000: add sparse-index repos Derrick Stolee via GitGitGadget
                     ` (2 subsequent siblings)
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:31 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The cache_tree_verify() method is run when GIT_TEST_CHECK_CACHE_TREE
is enabled, which it is by default in the test suite. The logic must
be adjusted for the presence of these directory entries.

For now, leave the test as a simple check for whether the directory
entry is sparse. Do not go any further until needed.

This allows us to re-enable GIT_TEST_CHECK_CACHE_TREE in
t1092-sparse-checkout-compatibility.sh. Further,
p2000-sparse-operations.sh uses the test suite and hence this is enabled
for all tests. We need to integrate with it before we run our
performance tests with a sparse-index.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 cache-tree.c                             | 19 +++++++++++++++++++
 t/t1092-sparse-checkout-compatibility.sh |  1 -
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/cache-tree.c b/cache-tree.c
index 950a9615db8f..11bf1fcae6e1 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -808,6 +808,19 @@ int cache_tree_matches_traversal(struct cache_tree *root,
 	return 0;
 }
 
+static void verify_one_sparse(struct repository *r,
+			      struct index_state *istate,
+			      struct cache_tree *it,
+			      struct strbuf *path,
+			      int pos)
+{
+	struct cache_entry *ce = istate->cache[pos];
+
+	if (!S_ISSPARSEDIR(ce->ce_mode))
+		BUG("directory '%s' is present in index, but not sparse",
+		    path->buf);
+}
+
 static void verify_one(struct repository *r,
 		       struct index_state *istate,
 		       struct cache_tree *it,
@@ -830,6 +843,12 @@ static void verify_one(struct repository *r,
 
 	if (path->len) {
 		pos = index_name_pos(istate, path->buf, path->len);
+
+		if (pos >= 0) {
+			verify_one_sparse(r, istate, it, path, pos);
+			return;
+		}
+
 		pos = -pos - 1;
 	} else {
 		pos = 0;
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 9c2bc4d25f66..c2624176c2e0 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -2,7 +2,6 @@
 
 test_description='compare full workdir to sparse workdir'
 
-GIT_TEST_CHECK_CACHE_TREE=0
 GIT_TEST_SPLIT_INDEX=0
 GIT_TEST_SPARSE_INDEX=
 
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v2 20/20] p2000: add sparse-index repos
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (18 preceding siblings ...)
  2021-03-10 19:31   ` [PATCH v2 19/20] sparse-index: loose integration with cache_tree_verify() Derrick Stolee via GitGitGadget
@ 2021-03-10 19:31   ` Derrick Stolee via GitGitGadget
  2021-03-11  0:07   ` [PATCH v2 00/20] Sparse Index: Design, Format, Tests Elijah Newren
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
  21 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-10 19:31 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

p2000-sparse-operations.sh compares different Git commands in
repositories with many files at HEAD but using sparse-checkout to focus
on a small portion of those files.

Add extra copies of the repository that use the sparse-index format so
we can track how that affects the performance of different commands.

At this point in time, the sparse-index is 100% overhead from the CPU
front, and this is measurable in these tests:

Test
---------------------------------------------------------------
2000.2: git status (full-index-v3)              0.59(0.51+0.12)
2000.3: git status (full-index-v4)              0.59(0.52+0.11)
2000.4: git status (sparse-index-v3)            1.40(1.32+0.12)
2000.5: git status (sparse-index-v4)            1.41(1.36+0.08)
2000.6: git add -A (full-index-v3)              2.32(1.97+0.19)
2000.7: git add -A (full-index-v4)              2.17(1.92+0.14)
2000.8: git add -A (sparse-index-v3)            2.31(2.21+0.15)
2000.9: git add -A (sparse-index-v4)            2.30(2.20+0.13)
2000.10: git add . (full-index-v3)              2.39(2.02+0.20)
2000.11: git add . (full-index-v4)              2.20(1.94+0.16)
2000.12: git add . (sparse-index-v3)            2.36(2.27+0.12)
2000.13: git add . (sparse-index-v4)            2.33(2.21+0.16)
2000.14: git commit -a -m A (full-index-v3)     2.47(2.12+0.20)
2000.15: git commit -a -m A (full-index-v4)     2.26(2.00+0.17)
2000.16: git commit -a -m A (sparse-index-v3)   3.01(2.92+0.16)
2000.17: git commit -a -m A (sparse-index-v4)   3.01(2.94+0.15)

Note that there is very little difference between the v3 and v4 index
formats when the sparse-index is enabled. This is primarily due to the
fact that the relative file sizes are the same, and the command time is
mostly taken up by parsing tree objects to expand the sparse index into
a full one.

With the current file layout, the index file sizes are given by this
table:

       |  full index | sparse index |
       +-------------+--------------+
    v3 |     108 MiB |      1.6 MiB |
    v4 |      80 MiB |      1.2 MiB |

Future updates will improve the performance of Git commands when the
index is sparse.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/perf/p2000-sparse-operations.sh | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
index 2fbc81b22119..e527316e66d6 100755
--- a/t/perf/p2000-sparse-operations.sh
+++ b/t/perf/p2000-sparse-operations.sh
@@ -60,12 +60,29 @@ test_expect_success 'setup repo and indexes' '
 		git sparse-checkout set $SPARSE_CONE &&
 		git config index.version 4 &&
 		git update-index --index-version=4
+	) &&
+	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . sparse-index-v3 &&
+	(
+		cd sparse-index-v3 &&
+		git sparse-checkout init --cone --sparse-index &&
+		git sparse-checkout set $SPARSE_CONE &&
+		git config index.version 3 &&
+		git update-index --index-version=3
+	) &&
+	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . sparse-index-v4 &&
+	(
+		cd sparse-index-v4 &&
+		git sparse-checkout init --cone --sparse-index &&
+		git sparse-checkout set $SPARSE_CONE &&
+		git config index.version 4 &&
+		git update-index --index-version=4
 	)
 '
 
 test_perf_on_all () {
 	command="$@"
-	for repo in full-index-v3 full-index-v4
+	for repo in full-index-v3 full-index-v4 \
+		    sparse-index-v3 sparse-index-v4
 	do
 		test_perf "$command ($repo)" "
 			(
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v2 01/20] sparse-index: design doc and format update
  2021-03-10 19:30   ` [PATCH v2 01/20] sparse-index: design doc and format update Derrick Stolee via GitGitGadget
@ 2021-03-10 22:19     ` Elijah Newren
  0 siblings, 0 replies; 203+ messages in thread
From: Elijah Newren @ 2021-03-10 22:19 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Martin Ågren, Derrick Stolee, SZEDER Gábor,
	Derrick Stolee, Derrick Stolee

On Wed, Mar 10, 2021 at 11:31 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> This begins a long effort to update the index format to allow sparse
> directory entries. This should result in a significant improvement to
> Git commands when HEAD contains millions of files, but the user has
> selected many fewer files to keep in their sparse-checkout definition.
>
> Currently, the index format is only updated in the presence of
> extensions.sparseIndex instead of increasing a file format version
> number. This is temporary, and index v5 is part of the plan for future
> work in this area.
>
> The design document details many of the reasons for embarking on this
> work, and also the plan for completing it safely.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/technical/index-format.txt |   7 +
>  Documentation/technical/sparse-index.txt | 173 +++++++++++++++++++++++
>  2 files changed, 180 insertions(+)
>  create mode 100644 Documentation/technical/sparse-index.txt
>
> diff --git a/Documentation/technical/index-format.txt b/Documentation/technical/index-format.txt
> index b633482b1bdf..387126582556 100644
> --- a/Documentation/technical/index-format.txt
> +++ b/Documentation/technical/index-format.txt
> @@ -44,6 +44,13 @@ Git index format
>    localization, no special casing of directory separator '/'). Entries
>    with the same name are sorted by their stage field.
>
> +  An index entry typically represents a file. However, if sparse-checkout
> +  is enabled in cone mode (`core.sparseCheckoutCone` is enabled) and the
> +  `extensions.sparseIndex` extension is enabled, then the index may
> +  contain entries for directories outside of the sparse-checkout definition.
> +  These entries have mode `0040000`, include the `SKIP_WORKTREE` bit, and
> +  the path ends in a directory separator.
> +
>    32-bit ctime seconds, the last time a file's metadata changed
>      this is stat(2) data
>
> diff --git a/Documentation/technical/sparse-index.txt b/Documentation/technical/sparse-index.txt
> new file mode 100644
> index 000000000000..787a2a0b3b81
> --- /dev/null
> +++ b/Documentation/technical/sparse-index.txt
> @@ -0,0 +1,173 @@
> +Git Sparse-Index Design Document
> +================================
> +
> +The sparse-checkout feature allows users to focus a working directory on
> +a subset of the files at HEAD. The cone mode patterns, enabled by
> +`core.sparseCheckoutCone`, allow for very fast pattern matching to
> +discover which files at HEAD belong in the sparse-checkout cone.
> +
> +Three important scale dimensions for a Git worktree are:
> +
> +* `HEAD`: How many files are present at `HEAD`?
> +
> +* Populated: How many files are within the sparse-checkout cone.
> +
> +* Modified: How many files has the user modified in the working directory?
> +
> +We will use big-O notation -- O(X) -- to denote how expensive certain
> +operations are in terms of these dimensions.
> +
> +These dimensions are ordered by their magnitude: users (typically) modify
> +fewer files than are populated, and we can only populate files at `HEAD`.
> +These dimensions are also ordered by how expensive they are per item: it
> +is expensive to detect a modified file than it is to write one that we
> +know must be populated; changing `HEAD` only really requires updating the
> +index.
> +
> +Problems occur if there is an extreme imbalance in these dimensions. For
> +example, if `HEAD` contains millions of paths but the populated set has
> +only tens of thousands, then commands like `git status` and `git add` can
> +be dominated by operations that require O(`HEAD`) operations instead of
> +O(Populated). Primarily, the cost is in parsing and rewriting the index,
> +which is filled primarily with files at `HEAD` that are marked with the
> +`SKIP_WORKTREE` bit.
> +
> +The sparse-index intends to take these commands that read and modify the
> +index from O(`HEAD`) to O(Populated). To do this, we need to modify the
> +index format in a significant way: add "sparse directory" entries.
> +
> +With cone mode patterns, it is possible to detect when an entire
> +directory will have its contents outside of the sparse-checkout definition.
> +Instead of listing all of the files it contains as individual entries, a
> +sparse-index contains an entry with the directory name, referencing the
> +object ID of the tree at `HEAD` and marked with the `SKIP_WORKTREE` bit.
> +If we need to discover the details for paths within that directory, we
> +can parse trees to find that list.
> +
> +At time of writing, sparse-directory entries violate expectations about the
> +index format and its in-memory data structure. There are many consumers in
> +the codebase that expect to iterate through all of the index entries and
> +see only files. In addition, they expect to see all files at `HEAD`. One
> +way to handle this is to parse trees to replace a sparse-directory entry
> +with all of the files within that tree as the index is loaded. However,
> +parsing trees is slower than parsing the index format, so that is a slower
> +operation than if we left the index alone.
> +
> +The implementation plan below follows four phases to slowly integrate with
> +the sparse-index. The intention is to incrementally update Git commands to
> +interact safely with the sparse-index without significant slowdowns. This
> +may not always be possible, but the hope is that the primary commands that
> +users need in their daily work are dramatically improved.
> +
> +Phase I: Format and initial speedups
> +------------------------------------
> +
> +During this phase, Git learns to enable the sparse-index and safely parse
> +one. Protections are put in place so that every consumer of the in-memory
> +data structure can operate with its current assumption of every file at
> +`HEAD`.
> +
> +At first, every index parse will expand the sparse-directory entries into
> +the full list of paths at `HEAD`. This will be slower in all cases. The
> +only noticable change in behavior will be that the serialized index file
> +contains sparse-directory entries.
> +
> +To start, we use a new repository extension, `extensions.sparseIndex`, to
> +allow inserting sparse-directory entries into indexes with file format
> +versions 2, 3, and 4. This prevents Git versions that do not understand
> +the sparse-index from operating on one, but it also prevents other
> +operations that do not use the index at all. A new format, index v5, will
> +be introduced that includes sparse-directory entries by default. It might
> +also introduce other features that have been considered for improving the
> +index, as well.
> +
> +Next, consumers of the index will be guarded against operating on a
> +sparse-index by inserting calls to `ensure_full_index()` or
> +`expand_index_to_path()`. After these guards are in place, we can begin
> +leaving sparse-directory entries in the in-memory index structure.
> +
> +Even after inserting these guards, we will keep expanding sparse-indexes
> +for most Git commands using the `command_requires_full_index` repository
> +setting. This setting will be on by default and disabled one builtin at a
> +time until we have sufficient confidence that all of the index operations
> +are properly guarded.
> +
> +To complete this phase, the commands `git status` and `git add` will be
> +integrated with the sparse-index so that they operate with O(Populated)
> +performance. They will be carefully tested for operations within and
> +outside the sparse-checkout definition.
> +
> +Phase II: Careful integrations
> +------------------------------
> +
> +This phase focuses on ensuring that all index extensions and APIs work
> +well with a sparse-index. This requires significant increases to our test
> +coverage, especially for operations that interact with the working
> +directory outside of the sparse-checkout definition. Some of these
> +behaviors may not be the desirable ones, such as some tests already
> +marked for failure in `t1092-sparse-checkout-compatibility.sh`.
> +
> +The index extensions that may require special integrations are:
> +
> +* FS Monitor
> +* Untracked cache
> +
> +While integrating with these features, we should look for patterns that
> +might lead to better APIs for interacting with the index. Coalescing
> +common usage patterns into an API call can reduce the number of places
> +where sparse-directories need to be handled carefully.
> +
> +Phase III: Important command speedups
> +-------------------------------------
> +
> +At this point, the patterns for testing and implementing sparse-directory
> +logic should be relatively stable. This phase focuses on updating some of
> +the most common builtins that use the index to operate as O(Populated).
> +Here is a potential list of commands that could be valuable to integrate
> +at this point:
> +
> +* `git commit`
> +* `git checkout`
> +* `git merge`
> +* `git rebase`
> +
> +Hopefully, commands such as `git merge` and `git rebase` can benefit
> +instead from merge algorithms that do not use the index as a data
> +structure, such as the merge-ORT strategy. As these topics mature, we
> +may enalbe the ORT strategy by default for repositories using the

s/enalbe/enable/

> +sparse-index feature.
> +
> +Along with `git status` and `git add`, these commands cover the majority
> +of users' interactions with the working directory. In addition, we can
> +integrate with these commands:
> +
> +* `git grep`
> +* `git rm`
> +
> +These have been proposed as some whose behavior could change when in a
> +repo with a sparse-checkout definition. It would be good to include this
> +behavior automatically when using a sparse-index. Some clarity is needed
> +to make the behavior switch clear to the user.
> +
> +This phase is the first where parallel work might be possible without too
> +much conflicts between topics.
> +
> +Phase IV: The long tail
> +-----------------------
> +
> +This last phase is less a "phase" and more "the new normal" after all of
> +the previous work.
> +
> +To start, the `command_requires_full_index` option could be removed in
> +favor of expanding only when hitting an API guard.
> +
> +There are many Git commands that could use special attention to operate as
> +O(Populated), while some might be so rare that it is acceptable to leave
> +them with additional overhead when a sparse-index is present.
> +
> +Here are some commands that might be useful to update:
> +
> +* `git sparse-checkout set`
> +* `git am`
> +* `git clean`
> +* `git stash`
> --
> gitgitgadget
>

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v2 06/20] t1092: compare sparse-checkout to sparse-index
  2021-03-10 19:30   ` [PATCH v2 06/20] t1092: compare sparse-checkout to sparse-index Derrick Stolee via GitGitGadget
@ 2021-03-10 23:04     ` Elijah Newren
  2021-03-11 14:17       ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Elijah Newren @ 2021-03-10 23:04 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Martin Ågren, Derrick Stolee, SZEDER Gábor,
	Derrick Stolee, Derrick Stolee

On Wed, Mar 10, 2021 at 11:31 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> Add a new 'sparse-index' repo alongside the 'full-checkout' and
> 'sparse-checkout' repos in t1092-sparse-checkout-compatibility.sh. Also
> add run_on_sparse and test_sparse_match helpers. These helpers will be
> used when the sparse index is implemented.
>
> Add GIT_TEST_SPARSE_INDEX environment variable to enable the
> sparse-index by default. This will be intended to use across the entire
> test suite, except that it will only affect cases where the
> sparse-checkout feature is enabled.

This last sentence was a bit awkward to read.  "will be intended to
use" -> "is intended to be used"?

>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  t/README                                 |  3 +++
>  t/t1092-sparse-checkout-compatibility.sh | 24 ++++++++++++++++++++----
>  2 files changed, 23 insertions(+), 4 deletions(-)
>
> diff --git a/t/README b/t/README
> index 593d4a4e270c..b98bc563aab5 100644
> --- a/t/README
> +++ b/t/README
> @@ -439,6 +439,9 @@ and "sha256".
>  GIT_TEST_WRITE_REV_INDEX=<boolean>, when true enables the
>  'pack.writeReverseIndex' setting.
>
> +GIT_TEST_SPARSE_INDEX=<boolean>, when true enables index writes to use the
> +sparse-index format by default.
> +
>  Naming Tests
>  ------------
>
> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
> index 3725d3997e70..71d6f9e4c014 100755
> --- a/t/t1092-sparse-checkout-compatibility.sh
> +++ b/t/t1092-sparse-checkout-compatibility.sh
> @@ -7,6 +7,7 @@ test_description='compare full workdir to sparse workdir'
>  test_expect_success 'setup' '
>         git init initial-repo &&
>         (
> +               GIT_TEST_SPARSE_INDEX=0 &&
>                 cd initial-repo &&
>                 echo a >a &&
>                 echo "after deep" >e &&
> @@ -87,23 +88,32 @@ init_repos () {
>
>         cp -r initial-repo sparse-checkout &&
>         git -C sparse-checkout reset --hard &&
> -       git -C sparse-checkout sparse-checkout init --cone &&
> +
> +       cp -r initial-repo sparse-index &&
> +       git -C sparse-index reset --hard &&
>
>         # initialize sparse-checkout definitions
> -       git -C sparse-checkout sparse-checkout set deep
> +       git -C sparse-checkout sparse-checkout init --cone &&
> +       git -C sparse-checkout sparse-checkout set deep &&
> +       GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout init --cone &&
> +       GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep
>  }
>
>  run_on_sparse () {
>         (
>                 cd sparse-checkout &&
> -               "$@" >../sparse-checkout-out 2>../sparse-checkout-err
> +               GIT_TEST_SPARSE_INDEX=0 "$@" >../sparse-checkout-out 2>../sparse-checkout-err
> +       ) &&
> +       (
> +               cd sparse-index &&
> +               GIT_TEST_SPARSE_INDEX=1 "$@" >../sparse-index-out 2>../sparse-index-err
>         )
>  }
>
>  run_on_all () {
>         (
>                 cd full-checkout &&
> -               "$@" >../full-checkout-out 2>../full-checkout-err
> +               GIT_TEST_SPARSE_INDEX=0 "$@" >../full-checkout-out 2>../full-checkout-err
>         ) &&
>         run_on_sparse "$@"
>  }
> @@ -114,6 +124,12 @@ test_all_match () {
>         test_cmp full-checkout-err sparse-checkout-err
>  }
>
> +test_sparse_match () {
> +       run_on_sparse $* &&

Should this be
   run_on_sparse "$@"
in order to allow arguments with spaces?

> +       test_cmp sparse-checkout-out sparse-index-out &&
> +       test_cmp sparse-checkout-err sparse-index-err
> +}
> +
>  test_expect_success 'status with options' '
>         init_repos &&
>         test_all_match git status --porcelain=v2 &&
> --
> gitgitgadget
>

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v2 11/20] sparse-index: convert from full to sparse
  2021-03-10 19:30   ` [PATCH v2 11/20] sparse-index: convert from full to sparse Derrick Stolee via GitGitGadget
@ 2021-03-10 23:44     ` Elijah Newren
  2021-03-11 14:13       ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Elijah Newren @ 2021-03-10 23:44 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Martin Ågren, Derrick Stolee, SZEDER Gábor,
	Derrick Stolee, Derrick Stolee

On Wed, Mar 10, 2021 at 11:31 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <dstolee@microsoft.com>
>
> If we have a full index, then we can convert it to a sparse index by
> replacing directories outside of the sparse cone with sparse directory
> entries. The convert_to_sparse() method does this, when the situation is
> appropriate.
>
> For now, we avoid converting the index to a sparse index if:
>
>  1. the index is split.
>  2. the index is already sparse.
>  3. sparse-checkout is disabled.
>  4. sparse-checkout does not use cone mode.
>
> Finally, we currently limit the conversion to when the
> GIT_TEST_SPARSE_INDEX environment variable is enabled. A mode using Git
> config will be added in a later change.
>
> The trickiest thing about this conversion is that we might not be able
> to mark a directory as a sparse directory just because it is outside the
> sparse cone. There might be unmerged files within that directory, so we
> need to look for those. Also, if there is some strange reason why a file
> is not marked with CE_SKIP_WORKTREE, then we should give up on
> converting that directory. There is still hope that some of its
> subdirectories might be able to convert to sparse, so we keep looking
> deeper.
>
> The conversion process is assisted by the cache-tree extension. This is
> calculated from the full index if it does not already exist. We then
> abandon the cache-tree as it no longer applies to the newly-sparse
> index. Thus, this cache-tree will be recalculated in every
> sparse-full-sparse round-trip until we integrate the cache-tree
> extension with the sparse index.
>
> Some Git commands use the index after writing it. For example, 'git add'
> will update the index, then write it to disk, then read its entries to
> report information. To keep the in-memory index in a full state after
> writing, we re-expand it to a full one after the write. This is wasteful
> for commands that only write the index and do not read from it again,
> but that is only the case until we make those commands "sparse aware."
>
> We can compare the behavior of the sparse-index in
> t1092-sparse-checkout-compability.sh by using GIT_TEST_SPARSE_INDEX=1
> when operating on the 'sparse-index' repo. We can also compare the two
> sparse repos directly, such as comparing their indexes (when expanded to
> full in the case of the 'sparse-index' repo). We also verify that the
> index is actually populated with sparse directory entries.
>
> The 'checkout and reset (mixed)' test is marked for failure when
> comparing a sparse repo to a full repo, but we can compare the two
> sparse-checkout cases directly to ensure that we are not changing the
> behavior when using a sparse index.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  cache-tree.c                             |   3 +
>  cache.h                                  |   2 +
>  read-cache.c                             |  26 ++++-
>  sparse-index.c                           | 139 +++++++++++++++++++++++
>  sparse-index.h                           |   1 +
>  t/t1092-sparse-checkout-compatibility.sh |  61 +++++++++-
>  6 files changed, 227 insertions(+), 5 deletions(-)
>
> diff --git a/cache-tree.c b/cache-tree.c
> index 2fb483d3c083..5f07a39e501e 100644
> --- a/cache-tree.c
> +++ b/cache-tree.c
> @@ -6,6 +6,7 @@
>  #include "object-store.h"
>  #include "replace-object.h"
>  #include "promisor-remote.h"
> +#include "sparse-index.h"
>
>  #ifndef DEBUG_CACHE_TREE
>  #define DEBUG_CACHE_TREE 0
> @@ -442,6 +443,8 @@ int cache_tree_update(struct index_state *istate, int flags)
>         if (i)
>                 return i;
>
> +       ensure_full_index(istate);
> +
>         if (!istate->cache_tree)
>                 istate->cache_tree = cache_tree();
>
> diff --git a/cache.h b/cache.h
> index 303411726e10..9217d405b9b8 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -251,6 +251,8 @@ static inline unsigned int create_ce_mode(unsigned int mode)
>  {
>         if (S_ISLNK(mode))
>                 return S_IFLNK;
> +       if (mode == S_IFDIR)
> +               return S_IFDIR;
>         if (S_ISDIR(mode) || S_ISGITLINK(mode))
>                 return S_IFGITLINK;
>         return S_IFREG | ce_permissions(mode);
> diff --git a/read-cache.c b/read-cache.c
> index 97dbf2434f30..92126b9d23c9 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -25,6 +25,7 @@
>  #include "fsmonitor.h"
>  #include "thread-utils.h"
>  #include "progress.h"
> +#include "sparse-index.h"
>
>  /* Mask for the name length in ce_flags in the on-disk index */
>
> @@ -1002,8 +1003,14 @@ int verify_path(const char *path, unsigned mode)
>
>                         c = *path++;
>                         if ((c == '.' && !verify_dotfile(path, mode)) ||
> -                           is_dir_sep(c) || c == '\0')
> +                           is_dir_sep(c))
>                                 return 0;
> +                       /*
> +                        * allow terminating directory separators for
> +                        * sparse directory entries.
> +                        */
> +                       if (c == '\0')
> +                               return S_ISDIR(mode);
>                 } else if (c == '\\' && protect_ntfs) {
>                         if (is_ntfs_dotgit(path))
>                                 return 0;
> @@ -3061,6 +3068,14 @@ static int do_write_locked_index(struct index_state *istate, struct lock_file *l
>                                  unsigned flags)
>  {
>         int ret;
> +       int was_full = !istate->sparse_index;
> +
> +       ret = convert_to_sparse(istate);
> +
> +       if (ret) {
> +               warning(_("failed to convert to a sparse-index"));
> +               return ret;
> +       }
>
>         /*
>          * TODO trace2: replace "the_repository" with the actual repo instance
> @@ -3072,6 +3087,9 @@ static int do_write_locked_index(struct index_state *istate, struct lock_file *l
>         trace2_region_leave_printf("index", "do_write_index", the_repository,
>                                    "%s", get_lock_file_path(lock));
>
> +       if (was_full)
> +               ensure_full_index(istate);
> +
>         if (ret)
>                 return ret;
>         if (flags & COMMIT_LOCK)
> @@ -3162,9 +3180,10 @@ static int write_shared_index(struct index_state *istate,
>                               struct tempfile **temp)
>  {
>         struct split_index *si = istate->split_index;
> -       int ret;
> +       int ret, was_full = !istate->sparse_index;
>
>         move_cache_to_base_index(istate);
> +       convert_to_sparse(istate);
>
>         trace2_region_enter_printf("index", "shared/do_write_index",
>                                    the_repository, "%s", get_tempfile_path(*temp));
> @@ -3172,6 +3191,9 @@ static int write_shared_index(struct index_state *istate,
>         trace2_region_leave_printf("index", "shared/do_write_index",
>                                    the_repository, "%s", get_tempfile_path(*temp));
>
> +       if (was_full)
> +               ensure_full_index(istate);
> +
>         if (ret)
>                 return ret;
>         ret = adjust_shared_perm(get_tempfile_path(*temp));
> diff --git a/sparse-index.c b/sparse-index.c
> index 316cb949b74b..5eb561259bb1 100644
> --- a/sparse-index.c
> +++ b/sparse-index.c
> @@ -4,6 +4,145 @@
>  #include "tree.h"
>  #include "pathspec.h"
>  #include "trace2.h"
> +#include "cache-tree.h"
> +#include "config.h"
> +#include "dir.h"
> +#include "fsmonitor.h"
> +
> +static struct cache_entry *construct_sparse_dir_entry(
> +                               struct index_state *istate,
> +                               const char *sparse_dir,
> +                               struct cache_tree *tree)
> +{
> +       struct cache_entry *de;
> +
> +       de = make_cache_entry(istate, S_IFDIR, &tree->oid, sparse_dir, 0, 0);
> +
> +       de->ce_flags |= CE_SKIP_WORKTREE;
> +       return de;
> +}
> +
> +/*
> + * Returns the number of entries "inserted" into the index.
> + */
> +static int convert_to_sparse_rec(struct index_state *istate,
> +                                int num_converted,
> +                                int start, int end,
> +                                const char *ct_path, size_t ct_pathlen,
> +                                struct cache_tree *ct)
> +{
> +       int i, can_convert = 1;
> +       int start_converted = num_converted;
> +       enum pattern_match_result match;
> +       int dtype;
> +       struct strbuf child_path = STRBUF_INIT;
> +       struct pattern_list *pl = istate->sparse_checkout_patterns;
> +
> +       /*
> +        * Is the current path outside of the sparse cone?
> +        * Then check if the region can be replaced by a sparse
> +        * directory entry (everything is sparse and merged).
> +        */
> +       match = path_matches_pattern_list(ct_path, ct_pathlen,
> +                                         NULL, &dtype, pl, istate);
> +       if (match != NOT_MATCHED)
> +               can_convert = 0;
> +
> +       for (i = start; can_convert && i < end; i++) {
> +               struct cache_entry *ce = istate->cache[i];
> +
> +               if (ce_stage(ce) ||
> +                   !(ce->ce_flags & CE_SKIP_WORKTREE))
> +                       can_convert = 0;
> +       }
> +
> +       if (can_convert) {
> +               struct cache_entry *se;
> +               se = construct_sparse_dir_entry(istate, ct_path, ct);
> +
> +               istate->cache[num_converted++] = se;
> +               return 1;
> +       }
> +
> +       for (i = start; i < end; ) {
> +               int count, span, pos = -1;
> +               const char *base, *slash;
> +               struct cache_entry *ce = istate->cache[i];
> +
> +               /*
> +                * Detect if this is a normal entry outside of any subtree
> +                * entry.
> +                */
> +               base = ce->name + ct_pathlen;
> +               slash = strchr(base, '/');
> +
> +               if (slash)
> +                       pos = cache_tree_subtree_pos(ct, base, slash - base);
> +
> +               if (pos < 0) {
> +                       istate->cache[num_converted++] = ce;
> +                       i++;
> +                       continue;
> +               }
> +
> +               strbuf_setlen(&child_path, 0);
> +               strbuf_add(&child_path, ce->name, slash - ce->name + 1);
> +
> +               span = ct->down[pos]->cache_tree->entry_count;
> +               count = convert_to_sparse_rec(istate,
> +                                             num_converted, i, i + span,
> +                                             child_path.buf, child_path.len,
> +                                             ct->down[pos]->cache_tree);
> +               num_converted += count;
> +               i += span;
> +       }
> +
> +       strbuf_release(&child_path);
> +       return num_converted - start_converted;
> +}
> +
> +int convert_to_sparse(struct index_state *istate)
> +{
> +       if (istate->split_index || istate->sparse_index ||
> +           !core_apply_sparse_checkout || !core_sparse_checkout_cone)
> +               return 0;
> +
> +       /*
> +        * For now, only create a sparse index with the
> +        * GIT_TEST_SPARSE_INDEX environment variable. We will relax
> +        * this once we have a proper way to opt-in (and later still,
> +        * opt-out).
> +        */
> +       if (!git_env_bool("GIT_TEST_SPARSE_INDEX", 0))
> +               return 0;
> +
> +       if (!istate->sparse_checkout_patterns) {
> +               istate->sparse_checkout_patterns = xcalloc(1, sizeof(struct pattern_list));
> +               if (get_sparse_checkout_patterns(istate->sparse_checkout_patterns) < 0)
> +                       return 0;
> +       }
> +
> +       if (!istate->sparse_checkout_patterns->use_cone_patterns) {
> +               warning(_("attempting to use sparse-index without cone mode"));
> +               return -1;
> +       }
> +
> +       if (cache_tree_update(istate, 0)) {
> +               warning(_("unable to update cache-tree, staying full"));
> +               return -1;
> +       }
> +
> +       remove_fsmonitor(istate);
> +
> +       trace2_region_enter("index", "convert_to_sparse", istate->repo);
> +       istate->cache_nr = convert_to_sparse_rec(istate,
> +                                                0, 0, istate->cache_nr,
> +                                                "", 0, istate->cache_tree);
> +       istate->drop_cache_tree = 1;
> +       istate->sparse_index = 1;
> +       trace2_region_leave("index", "convert_to_sparse", istate->repo);
> +       return 0;
> +}
>
>  static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
>  {
> diff --git a/sparse-index.h b/sparse-index.h
> index 09a20d036c46..64380e121d80 100644
> --- a/sparse-index.h
> +++ b/sparse-index.h
> @@ -3,5 +3,6 @@
>
>  struct index_state;
>  void ensure_full_index(struct index_state *istate);
> +int convert_to_sparse(struct index_state *istate);
>
>  #endif
> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
> index 4d789fe86b9d..ca87033d30b0 100755
> --- a/t/t1092-sparse-checkout-compatibility.sh
> +++ b/t/t1092-sparse-checkout-compatibility.sh
> @@ -2,6 +2,9 @@
>
>  test_description='compare full workdir to sparse workdir'
>
> +GIT_TEST_CHECK_CACHE_TREE=0

I still think it'd be nice to get a comment, either in the code or the
commit message, explaining why your series needs to set
GIT_TEST_CHECK_CACHE_TREE to 0.  I feel like I should almost know the
answer (was this just a preliminary step and it'll later be turned on?
did the cache-tree checking do stuff that assumes no sparse directory
entries? is it really slow?), but I don't.

> +GIT_TEST_SPLIT_INDEX=0
> +
>  . ./test-lib.sh
>
>  test_expect_success 'setup' '
> @@ -121,15 +124,49 @@ run_on_all () {
>  test_all_match () {
>         run_on_all "$@" &&
>         test_cmp full-checkout-out sparse-checkout-out &&
> -       test_cmp full-checkout-err sparse-checkout-err
> +       test_cmp full-checkout-out sparse-index-out &&
> +       test_cmp full-checkout-err sparse-checkout-err &&
> +       test_cmp full-checkout-err sparse-index-err
>  }
>
>  test_sparse_match () {
> -       run_on_sparse $* &&
> +       run_on_sparse "$@" &&
>         test_cmp sparse-checkout-out sparse-index-out &&
>         test_cmp sparse-checkout-err sparse-index-err
>  }
>
> +test_expect_success 'sparse-index contents' '
> +       init_repos &&
> +
> +       test-tool -C sparse-index read-cache --table >cache &&
> +       for dir in folder1 folder2 x
> +       do
> +               TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
> +               grep "040000 tree $TREE $dir/" cache \
> +                       || return 1
> +       done &&
> +
> +       GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set folder1 &&
> +
> +       test-tool -C sparse-index read-cache --table >cache &&
> +       for dir in deep folder2 x
> +       do
> +               TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
> +               grep "040000 tree $TREE $dir/" cache \
> +                       || return 1
> +       done &&
> +
> +       GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep/deeper1 &&
> +
> +       test-tool -C sparse-index read-cache --table >cache &&
> +       for dir in deep/deeper2 folder1 folder2 x
> +       do
> +               TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
> +               grep "040000 tree $TREE $dir/" cache \
> +                       || return 1
> +       done
> +'
> +
>  test_expect_success 'expanded in-memory index matches full index' '
>         init_repos &&
>         test_sparse_match test-tool read-cache --expand --table
> @@ -137,6 +174,7 @@ test_expect_success 'expanded in-memory index matches full index' '
>
>  test_expect_success 'status with options' '
>         init_repos &&
> +       test_sparse_match ls &&
>         test_all_match git status --porcelain=v2 &&
>         test_all_match git status --porcelain=v2 -z -u &&
>         test_all_match git status --porcelain=v2 -uno &&
> @@ -273,6 +311,17 @@ test_expect_failure 'checkout and reset (mixed)' '
>         test_all_match git reset update-folder2
>  '
>
> +# Ensure that sparse-index behaves identically to
> +# sparse-checkout with a full index.
> +test_expect_success 'checkout and reset (mixed) [sparse]' '
> +       init_repos &&
> +
> +       test_sparse_match git checkout -b reset-test update-deep &&
> +       test_sparse_match git reset deepest &&
> +       test_sparse_match git reset update-folder1 &&
> +       test_sparse_match git reset update-folder2
> +'
> +
>  test_expect_success 'merge' '
>         init_repos &&
>
> @@ -309,14 +358,20 @@ test_expect_success 'clean' '
>         test_all_match git status --porcelain=v2 &&
>         test_all_match git clean -f &&
>         test_all_match git status --porcelain=v2 &&
> +       test_sparse_match ls &&
> +       test_sparse_match ls folder1 &&
>
>         test_all_match git clean -xf &&
>         test_all_match git status --porcelain=v2 &&
> +       test_sparse_match ls &&
> +       test_sparse_match ls folder1 &&
>
>         test_all_match git clean -xdf &&
>         test_all_match git status --porcelain=v2 &&
> +       test_sparse_match ls &&
> +       test_sparse_match ls folder1 &&
>
> -       test_path_is_dir sparse-checkout/folder1
> +       test_sparse_match test_path_is_dir folder1
>  '
>
>  test_done
> --
> gitgitgadget
>

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v2 00/20] Sparse Index: Design, Format, Tests
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (19 preceding siblings ...)
  2021-03-10 19:31   ` [PATCH v2 20/20] p2000: add sparse-index repos Derrick Stolee via GitGitGadget
@ 2021-03-11  0:07   ` Elijah Newren
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
  21 siblings, 0 replies; 203+ messages in thread
From: Elijah Newren @ 2021-03-11  0:07 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Martin Ågren, Derrick Stolee, SZEDER Gábor,
	Derrick Stolee

On Wed, Mar 10, 2021 at 11:31 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> Here is the first full patch series submission coming out of the
> sparse-index RFC [1].
>
> [1]
> https://lore.kernel.org/git/pull.847.git.1611596533.gitgitgadget@gmail.com/
>
> I won't waste too much space here, because PATCH 1 includes a sizeable
> design document that describes the feature, the reasoning behind it, and my
> plan for getting this implemented widely throughout the codebase.
>
> There are some new things here that were not in the RFC:
>
>  * Design doc and format updates. (Patch 1)
>  * Performance test script. (Patches 2 and 20)
>
> Notably missing in this series from the RFC:
>
>  * The mega-patch inserting ensure_full_index() throughout the codebase.
>    That will be a follow-up series to this one.
>  * The integrations with git status and git add to demonstrate the improved
>    performance. Those will also appear in their own series later.
>
> I plan to keep my latest work in this area in my 'sparse-index/wip' branch
> [2]. It includes all of the work from the RFC right now, updated with the
> work from this series.
>
> [2] https://github.com/derrickstolee/git/tree/sparse-index/wip
>
>
> Updates in V2
> =============
>
>  * Various typos and awkward grammar is fixed.
>  * Cleaned up unnecessary commands in p2000-sparse-operations.sh
>  * Added a comment to the sparse_index member of struct index_state.
>  * Used tree_type, commit_type, and blob_type in test-read-cache.c.

I read through the range-diff and comments from the previous series.
There's only a few things left (as I noted in comments), but they're
all pretty trivial so this one is:

Reviewed-by: Elijah Newren <newren@gmail.com>

>
> Thanks, -Stolee
>
> Derrick Stolee (20):
>   sparse-index: design doc and format update
>   t/perf: add performance test for sparse operations
>   t1092: clean up script quoting
>   sparse-index: add guard to ensure full index
>   sparse-index: implement ensure_full_index()
>   t1092: compare sparse-checkout to sparse-index
>   test-read-cache: print cache entries with --table
>   test-tool: don't force full index
>   unpack-trees: ensure full index
>   sparse-checkout: hold pattern list in index
>   sparse-index: convert from full to sparse
>   submodule: sparse-index should not collapse links
>   unpack-trees: allow sparse directories
>   sparse-index: check index conversion happens
>   sparse-index: create extension for compatibility
>   sparse-checkout: toggle sparse index from builtin
>   sparse-checkout: disable sparse-index
>   cache-tree: integrate with sparse directory entries
>   sparse-index: loose integration with cache_tree_verify()
>   p2000: add sparse-index repos
>
>  Documentation/config/extensions.txt      |   8 +
>  Documentation/git-sparse-checkout.txt    |  14 ++
>  Documentation/technical/index-format.txt |   7 +
>  Documentation/technical/sparse-index.txt | 173 ++++++++++++++
>  Makefile                                 |   1 +
>  builtin/sparse-checkout.c                |  44 +++-
>  cache-tree.c                             |  40 ++++
>  cache.h                                  |  18 +-
>  read-cache.c                             |  35 ++-
>  repo-settings.c                          |  15 ++
>  repository.c                             |  11 +-
>  repository.h                             |   3 +
>  setup.c                                  |   3 +
>  sparse-index.c                           | 290 +++++++++++++++++++++++
>  sparse-index.h                           |  11 +
>  t/README                                 |   3 +
>  t/helper/test-read-cache.c               |  66 +++++-
>  t/perf/p2000-sparse-operations.sh        | 102 ++++++++
>  t/t1091-sparse-checkout-builtin.sh       |  13 +
>  t/t1092-sparse-checkout-compatibility.sh | 136 +++++++++--
>  unpack-trees.c                           |  16 +-
>  21 files changed, 969 insertions(+), 40 deletions(-)
>  create mode 100644 Documentation/technical/sparse-index.txt
>  create mode 100644 sparse-index.c
>  create mode 100644 sparse-index.h
>  create mode 100755 t/perf/p2000-sparse-operations.sh
>
>
> base-commit: 966e671106b2fd38301e7c344c754fd118d0bb07
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-883%2Fderrickstolee%2Fsparse-index%2Fformat-v2
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-883/derrickstolee/sparse-index/format-v2
> Pull-Request: https://github.com/gitgitgadget/git/pull/883
>
> Range-diff vs v1:
>
>   1:  daa9a6bcefbc !  1:  2fe413fdac80 sparse-index: design doc and format update
>      @@ Documentation/technical/sparse-index.txt (new)
>       +If we need to discover the details for paths within that directory, we
>       +can parse trees to find that list.
>       +
>      -+This addition of sparse-directory entries violates expectations about the
>      ++At time of writing, sparse-directory entries violate expectations about the
>       +index format and its in-memory data structure. There are many consumers in
>       +the codebase that expect to iterate through all of the index entries and
>       +see only files. In addition, they expect to see all files at `HEAD`. One
>      @@ Documentation/technical/sparse-index.txt (new)
>       +* `git merge`
>       +* `git rebase`
>       +
>      ++Hopefully, commands such as `git merge` and `git rebase` can benefit
>      ++instead from merge algorithms that do not use the index as a data
>      ++structure, such as the merge-ORT strategy. As these topics mature, we
>      ++may enalbe the ORT strategy by default for repositories using the
>      ++sparse-index feature.
>      ++
>       +Along with `git status` and `git add`, these commands cover the majority
>       +of users' interactions with the working directory. In addition, we can
>       +integrate with these commands:
>   2:  a8c6322a3dbe !  2:  540ab5495065 t/perf: add performance test for sparse operations
>      @@ t/perf/p2000-sparse-operations.sh (new)
>       + # Remove submodules from the example repo, because our
>       + # duplication of the entire repo creates an unlikly data shape.
>       + git config --file .gitmodules --get-regexp "submodule.*.path" >modules &&
>      -+ rm -f .gitmodules &&
>      -+ git add .gitmodules &&
>      ++ git rm -f .gitmodules &&
>       + for module in $(awk "{print \$2}" modules)
>       + do
>       +         git rm $module || return 1
>       + done &&
>      -+ git add . &&
>       + git commit -m "remove submodules" &&
>       +
>       + echo bogus >a &&
>   3:  6e783c88821e =  3:  5cbedb377b37 t1092: clean up script quoting
>   4:  01da4c48a1fa =  4:  6e21f776e883 sparse-index: add guard to ensure full index
>   5:  2b83989fbcd3 !  5:  399ddb0bad56 sparse-index: implement ensure_full_index()
>      @@ cache.h: struct index_state {
>                  updated_skipworktree : 1,
>       -          fsmonitor_has_run_once : 1;
>       +          fsmonitor_has_run_once : 1,
>      ++
>      ++          /*
>      ++           * sparse_index == 1 when sparse-directory
>      ++           * entries exist. Requires sparse-checkout
>      ++           * in cone mode.
>      ++           */
>       +          sparse_index : 1;
>         struct hashmap name_hash;
>         struct hashmap dir_hash;
>   6:  c9910a37579c =  6:  eac2db5efc22 t1092: compare sparse-checkout to sparse-index
>   7:  3d92df7a0cf9 !  7:  e9c82d2eda82 test-read-cache: print cache entries with --table
>      @@ Commit message
>
>        ## t/helper/test-read-cache.c ##
>       @@
>      + #include "test-tool.h"
>        #include "cache.h"
>        #include "config.h"
>      -
>      ++#include "blob.h"
>      ++#include "commit.h"
>      ++#include "tree.h"
>      ++
>       +static void print_cache_entry(struct cache_entry *ce)
>       +{
>      -+ printf("%06o ", ce->ce_mode & 0777777);
>      ++ const char *type;
>      ++ printf("%06o ", ce->ce_mode & 0177777);
>       +
>       + if (S_ISSPARSEDIR(ce->ce_mode))
>      -+         printf("tree ");
>      ++         type = tree_type;
>       + else if (S_ISGITLINK(ce->ce_mode))
>      -+         printf("commit ");
>      ++         type = commit_type;
>       + else
>      -+         printf("blob ");
>      ++         type = blob_type;
>       +
>      -+ printf("%s\t%s\n",
>      ++ printf("%s %s\t%s\n",
>      ++        type,
>       +        oid_to_hex(&ce->oid),
>       +        ce->name);
>       +}
>       +
>      -+static void print_cache(struct index_state *cache)
>      ++static void print_cache(struct index_state *istate)
>       +{
>       + int i;
>      -+ for (i = 0; i < the_index.cache_nr; i++)
>      -+         print_cache_entry(the_index.cache[i]);
>      ++ for (i = 0; i < istate->cache_nr; i++)
>      ++         print_cache_entry(istate->cache[i]);
>       +}
>      -+
>      +
>        int cmd__read_cache(int argc, const char **argv)
>        {
>       + struct repository *r = the_repository;
>   8:  94373e2bfbbc !  8:  243541fc5820 test-tool: don't force full index
>      @@ Commit message
>
>        ## t/helper/test-read-cache.c ##
>       @@
>      - #include "test-tool.h"
>      - #include "cache.h"
>      - #include "config.h"
>      + #include "blob.h"
>      + #include "commit.h"
>      + #include "tree.h"
>       +#include "sparse-index.h"
>
>        static void print_cache_entry(struct cache_entry *ce)
>   9:  e71f033c2871 =  9:  48f65093b3da unpack-trees: ensure full index
>  10:  f86d3dc154d1 ! 10:  83aac8b7a1ec sparse-checkout: hold pattern list in index
>      @@ Commit message
>           pattern set, we need access to that in-memory copy. Place a pointer to
>           a 'struct pattern_list' in the index so we can access this on-demand.
>           This will be used in the next change which uses the sparse-checkout
>      -    definition to filter out directories that are outsie the sparse cone.
>      +    definition to filter out directories that are outside the sparse cone.
>
>           Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>
>  11:  a2d77c23a0cb ! 11:  f6db0c27a285 sparse-index: convert from full to sparse
>      @@ read-cache.c: int verify_path(const char *path, unsigned mode)
>                                 return 0;
>       +                 /*
>       +                  * allow terminating directory separators for
>      -+                  * sparse directory enries.
>      ++                  * sparse directory entries.
>       +                  */
>       +                 if (c == '\0')
>       +                         return S_ISDIR(mode);
>      @@ sparse-index.c
>       +         struct cache_entry *ce = istate->cache[i];
>       +
>       +         /*
>      -+          * Detect if this is a normal entry oustide of any subtree
>      ++          * Detect if this is a normal entry outside of any subtree
>       +          * entry.
>       +          */
>       +         base = ce->name + ct_pathlen;
>  12:  4405a9115c3b = 12:  f2a3e7298798 submodule: sparse-index should not collapse links
>  13:  fda23f07e6a2 ! 13:  6f1ebe6ccc08 unpack-trees: allow sparse directories
>      @@ Commit message
>           is possible to have a directory in a sparse index as long as that entry
>           is itself marked with the skip-worktree bit.
>
>      -    The negation of the 'pos' variable must be conditioned to only when it
>      -    starts as negative. This is identical behavior as before when the index
>      -    is full.
>      +    The 'pos' variable is assigned a negative value if an exact match is not
>      +    found. Since a directory name can be an exact match, it is no longer an
>      +    error to have a nonnegative 'pos' value.
>
>           Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
>
>  14:  7d4627574bb8 = 14:  3fa684b315fb sparse-index: check index conversion happens
>  15:  564503f78784 ! 15:  d74576d677f6 sparse-index: create extension for compatibility
>      @@ Commit message
>
>           We _could_ add a new index version that explicitly adds these
>           capabilities, but there are nuances to index formats 2, 3, and 4 that
>      -    are still valuable to select as options. For now, create a repo
>      -    extension, "extensions.sparseIndex", that specifies that the tool
>      -    reading this repository must understand sparse directory entries.
>      +    are still valuable to select as options. Until we add index format
>      +    version 5, create a repo extension, "extensions.sparseIndex", that
>      +    specifies that the tool reading this repository must understand sparse
>      +    directory entries.
>
>           This change only encodes the extension and enables it when
>           GIT_TEST_SPARSE_INDEX=1. Later, we will add a more user-friendly CLI
>      @@ Documentation/config/extensions.txt: extensions.objectFormat::
>       + When combined with `core.sparseCheckout=true` and
>       + `core.sparseCheckoutCone=true`, the index may contain entries
>       + corresponding to directories outside of the sparse-checkout
>      -+ definition. Versions of Git that do not understand this extension
>      -+ do not expect directory entries in the index.
>      ++ definition in lieu of containing each path under such directories.
>      ++ Versions of Git that do not understand this extension do not
>      ++ expect directory entries in the index.
>
>        ## cache.h ##
>       @@ cache.h: struct repository_format {
>  16:  6d6b230e3318 ! 16:  e530ca5f668d sparse-checkout: toggle sparse index from builtin
>      @@ Documentation/git-sparse-checkout.txt: To avoid interfering with other worktrees
>       +a sparse index until they are properly integrated with the feature.
>       ++
>       +**WARNING:** Using a sparse index requires modifying the index in a way
>      -+that is not completely understood by other tools. Enabling sparse index
>      -+enables the `extensions.spareseIndex` config value, which might cause
>      -+other tools to stop working with your repository. If you have trouble with
>      -+this compatibility, then run `git sparse-checkout sparse-index disable` to
>      -+remove this config and rewrite your index to not be sparse.
>      ++that is not completely understood by external tools. If you have trouble
>      ++with this compatibility, then run `git sparse-checkout sparse-index disable`
>      ++to rewrite your index to not be sparse. Older versions of Git will not
>      ++understand the `sparseIndex` repository extension and may fail to interact
>      ++with your repository until it is disabled.
>
>        'set'::
>         Write a set of patterns to the sparse-checkout file, as given as
>  17:  bcf960ef2362 = 17:  42d0da9c5def sparse-checkout: disable sparse-index
>  18:  e6afec58674e = 18:  6bb0976a6295 cache-tree: integrate with sparse directory entries
>  19:  2be4981fe698 = 19:  07f34e80609a sparse-index: loose integration with cache_tree_verify()
>  20:  a738b0ba8ab4 = 20:  41e3b56b9c17 p2000: add sparse-index repos
>
> --
> gitgitgadget

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v2 11/20] sparse-index: convert from full to sparse
  2021-03-10 23:44     ` Elijah Newren
@ 2021-03-11 14:13       ` Derrick Stolee
  0 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee @ 2021-03-11 14:13 UTC (permalink / raw)
  To: Elijah Newren, Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Martin Ågren, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

On 3/10/2021 6:44 PM, Elijah Newren wrote:
> On Wed, Mar 10, 2021 at 11:31 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>> +GIT_TEST_CHECK_CACHE_TREE=0
> 
> I still think it'd be nice to get a comment, either in the code or the
> commit message, explaining why your series needs to set
> GIT_TEST_CHECK_CACHE_TREE to 0.  I feel like I should almost know the
> answer (was this just a preliminary step and it'll later be turned on?
> did the cache-tree checking do stuff that assumes no sparse directory
> entries? is it really slow?), but I don't.

Sorry I missed commenting on this earlier.

The GIT_TEST_CHECK_CACHE_TREE environment is enabled by the test suite
by default and it does extra validation to see that the cache-tree
extension exists and matches the index contents. Since at this point
we don't have the cache-tree extension enabled with sparse-index, we
would start getting failures by those tests.

This is re-enabled in "sparse-index: loose integration with
cache_tree_verify()" so everything is being verified at the end of the
series.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v2 06/20] t1092: compare sparse-checkout to sparse-index
  2021-03-10 23:04     ` Elijah Newren
@ 2021-03-11 14:17       ` Derrick Stolee
  0 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee @ 2021-03-11 14:17 UTC (permalink / raw)
  To: Elijah Newren, Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Martin Ågren, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

On 3/10/2021 6:04 PM, Elijah Newren wrote:
> On Wed, Mar 10, 2021 at 11:31 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>> Add GIT_TEST_SPARSE_INDEX environment variable to enable the
>> sparse-index by default. This will be intended to use across the entire
>> test suite, except that it will only affect cases where the
>> sparse-checkout feature is enabled.
> 
> This last sentence was a bit awkward to read.  "will be intended to
> use" -> "is intended to be used"?

Fixed locally to:

    Add the GIT_TEST_SPARSE_INDEX environment variable to enable the
    sparse-index by default. This can be enabled across all tests, but that
    will only affect cases where the sparse-checkout feature is enabled.
 
>> +test_sparse_match () {
>> +       run_on_sparse $* &&
> 
> Should this be
>    run_on_sparse "$@"
> in order to allow arguments with spaces?

Sorry I missed this one. It was fixed to the right use in
"sparse-index: convert from full to sparse" so I thought I
had already covered this one when looking at the tip of my
branch.
 
Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v2 05/20] sparse-index: implement ensure_full_index()
  2021-03-10 19:30   ` [PATCH v2 05/20] sparse-index: implement ensure_full_index() Derrick Stolee via GitGitGadget
@ 2021-03-12  6:50     ` Junio C Hamano
  2021-03-12 13:56       ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Junio C Hamano @ 2021-03-12  6:50 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, newren, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee, Ævar Arnfjörð Bjarmason

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

>  void ensure_full_index(struct index_state *istate)
>  {
> ...
> +	int i;
> +		tree = lookup_tree(istate->repo, &ce->oid);
> +
> +		memset(&ps, 0, sizeof(ps));
> +		ps.recursive = 1;
> +		ps.has_wildcard = 1;
> +		ps.max_depth = -1;
> +
> +		read_tree_recursive(istate->repo, tree,
> +				    ce->name, strlen(ce->name),
> +				    0, &ps,
> +				    add_path_to_index, full);

Ævar, the assumption that led to your e68237bb (tree.h API: remove
support for starting at prefix != "", 2021-03-08) closes the door
for this code rather badly.  Please work with Derrick to figure out
what the best course of action would be.

Thanks.

> +		/* free directory entries. full entries are re-used */
> +		discard_cache_entry(ce);
> +	}
> +
> +	/* Copy back into original index. */
> +	memcpy(&istate->name_hash, &full->name_hash, sizeof(full->name_hash));
> +	istate->sparse_index = 0;
> +	free(istate->cache);
> +	istate->cache = full->cache;
> +	istate->cache_nr = full->cache_nr;
> +	istate->cache_alloc = full->cache_alloc;
> +
> +	free(full);
> +
> +	trace2_region_leave("index", "ensure_full_index", istate->repo);
>  }

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v2 05/20] sparse-index: implement ensure_full_index()
  2021-03-12  6:50     ` Junio C Hamano
@ 2021-03-12 13:56       ` Derrick Stolee
  2021-03-12 20:08         ` Junio C Hamano
  0 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee @ 2021-03-12 13:56 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, newren, pclouds, jrnieder, Martin Ågren,
	SZEDER Gábor, Derrick Stolee, Derrick Stolee,
	Ævar Arnfjörð Bjarmason

On 3/12/2021 1:50 AM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>>  void ensure_full_index(struct index_state *istate)
>>  {
>> ...
>> +	int i;
>> +		tree = lookup_tree(istate->repo, &ce->oid);
>> +
>> +		memset(&ps, 0, sizeof(ps));
>> +		ps.recursive = 1;
>> +		ps.has_wildcard = 1;
>> +		ps.max_depth = -1;
>> +
>> +		read_tree_recursive(istate->repo, tree,
>> +				    ce->name, strlen(ce->name),
>> +				    0, &ps,
>> +				    add_path_to_index, full);
> 
> Ævar, the assumption that led to your e68237bb (tree.h API: remove
> support for starting at prefix != "", 2021-03-08) closes the door
> for this code rather badly.  Please work with Derrick to figure out
> what the best course of action would be.

Thanks for pointing this out, Junio.

My preference would be to drop "tree.h API: remove support for
starting at prefix != """, but it should be OK to keep "tree.h API:
remove "stage" parameter from read_tree_recursive()" (currently
b3a078863f6), even though it introduces a semantic conflict here.

Since I haven't seen my sparse-index topic get picked up by a
tracking branch, I'd be happy to rebase on top of Ævar's topic if
I can still set a non-root prefix.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v2 05/20] sparse-index: implement ensure_full_index()
  2021-03-12 13:56       ` Derrick Stolee
@ 2021-03-12 20:08         ` Junio C Hamano
  2021-03-12 20:11           ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Junio C Hamano @ 2021-03-12 20:08 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, newren, pclouds, jrnieder,
	Martin Ågren, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee, Ævar Arnfjörð Bjarmason

Derrick Stolee <stolee@gmail.com> writes:

>> Ævar, the assumption that led to your e68237bb (tree.h API: remove
>> support for starting at prefix != "", 2021-03-08) closes the door
>> for this code rather badly.  Please work with Derrick to figure out
>> what the best course of action would be.
>
> Thanks for pointing this out, Junio.
>
> My preference would be to drop "tree.h API: remove support for
> starting at prefix != """, but it should be OK to keep "tree.h API:
> remove "stage" parameter from read_tree_recursive()" (currently
> b3a078863f6), even though it introduces a semantic conflict here.
>
> Since I haven't seen my sparse-index topic get picked up by a
> tracking branch, I'd be happy to rebase on top of Ævar's topic if
> I can still set a non-root prefix.

I did try to have both in 'seen' (after all, that is the primary way
I find out these conflicts early---no one can keep all the details
of all the topics in flight in one's head), and saw that we now have
a need for non-empty prefix that we thought we no longer have in the
other topic --- I think we should probably keep support of non-empty
prefix (as the primary reason why that patch exists is because we
saw no in-tree users---now if your 05/20 proves to be a good use of
the feature, there is one fewer reasons to remove the support) in
some form, so discarding e68237bb certainly is an option.


If we were to base the sparse-index topic on top of ab/read-tree, we
may be able to gain further simplification and clean-up of the API.

I think all the clean-up value e68237bb has are on the calling side
(they no longer have to pass constant ("", 0) to the function), and
we could rewrite e68237bb by

 - renaming "read_tree_recursive()" to "read_tree_at()", with the
   non-empty prefix support.

 - creating a new function "read_tree()", which lacks the support
   for prefix, as a thin-wrapper around "read_tree_at()".

 - modifying the callers of "read_tree_recursive()" changed by
   e68237bb to instead call "read_tree()" (without prefix).

to simplify majority of calling sites without losing functionality.

Then your [05/20] can use the read_tree_at() to read with a prefix.


But that kind of details, I'd want to see you two figure out
yourselves.

Thanks.


^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v2 05/20] sparse-index: implement ensure_full_index()
  2021-03-12 20:08         ` Junio C Hamano
@ 2021-03-12 20:11           ` Derrick Stolee
  2021-03-15 23:52             ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee @ 2021-03-12 20:11 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee via GitGitGadget, git, newren, pclouds, jrnieder,
	Martin Ågren, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee, Ævar Arnfjörð Bjarmason

On 3/12/2021 3:08 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
> 
>>> Ævar, the assumption that led to your e68237bb (tree.h API: remove
>>> support for starting at prefix != "", 2021-03-08) closes the door
>>> for this code rather badly.  Please work with Derrick to figure out
>>> what the best course of action would be.
>>
>> Thanks for pointing this out, Junio.
>>
>> My preference would be to drop "tree.h API: remove support for
>> starting at prefix != """, but it should be OK to keep "tree.h API:
>> remove "stage" parameter from read_tree_recursive()" (currently
>> b3a078863f6), even though it introduces a semantic conflict here.
>>
>> Since I haven't seen my sparse-index topic get picked up by a
>> tracking branch, I'd be happy to rebase on top of Ævar's topic if
>> I can still set a non-root prefix.
> I think all the clean-up value e68237bb has are on the calling side
> (they no longer have to pass constant ("", 0) to the function), and
> we could rewrite e68237bb by
> 
>  - renaming "read_tree_recursive()" to "read_tree_at()", with the
>    non-empty prefix support.
> 
>  - creating a new function "read_tree()", which lacks the support
>    for prefix, as a thin-wrapper around "read_tree_at()".
> 
>  - modifying the callers of "read_tree_recursive()" changed by
>    e68237bb to instead call "read_tree()" (without prefix).
> 
> to simplify majority of calling sites without losing functionality.
> 
> Then your [05/20] can use the read_tree_at() to read with a prefix.
> 
> 
> But that kind of details, I'd want to see you two figure out
> yourselves.

You've given us a great proposal. I'll wait for Ævar to chime in
(and probably update his topic) before I submit a new version.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 16/20] sparse-checkout: toggle sparse index from builtin
  2021-03-09 20:52     ` Derrick Stolee
  2021-03-09 21:03       ` Elijah Newren
@ 2021-03-14 20:08       ` Martin Ågren
  2021-03-15 13:36         ` Derrick Stolee
  1 sibling, 1 reply; 203+ messages in thread
From: Martin Ågren @ 2021-03-14 20:08 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, Git Mailing List, Elijah Newren,
	Junio C Hamano, Nguyễn Thái Ngọc Duy,
	Jonathan Nieder, Derrick Stolee, Derrick Stolee

On Tue, 9 Mar 2021 at 21:52, Derrick Stolee <stolee@gmail.com> wrote:
>
> I agree that the layers are confusing. We could rearrange and have
> a similar flow to what you recommend by mentioning the extension at
> the end:
>
> **WARNING:** Using a sparse index requires modifying the index in a way
> that is not completely understood by other tools. If you have trouble with
> this compatibility, then run `git sparse-checkout sparse-index disable` to
> rewrite your index to not be sparse. Older versions of Git will not
> understand the `sparseIndex` repository extension and may fail to interact
> with your repository until it is disabled.

I like it. I find this easier to read than the previous version. That
said, is `git sparse-index sparse-checkout disable` really the way to do
this? I don't see a "sparse-index" subcommand of git-sparse-checkout.
... Hmm, no, after building and installing your patches, I get

  $ git sparse-checkout sparse-index disable
  usage: git sparse-checkout (init|list|set|add|reapply|disable) <options>

Should that be `git sparse-checkout init --no-sparse-index`? I just
tried that on a fresh, empty repo. It seems to work in the sense that it
drops the config item. I'm guessing re-initing a sparse checkout is a
safe and sane thing to do?

I don't find any tests for this. If re-initing should be ok and in
particular if it should allow toggling the use of sparse index, it might
be good having a test. At a minimum to see that the command passes and
that the config item goes away? And check that the actual index is
rewritten back to the "old" format? (Sorry if you have that already and
I'm just bad at finding it.)

Martin

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH 16/20] sparse-checkout: toggle sparse index from builtin
  2021-03-14 20:08       ` Martin Ågren
@ 2021-03-15 13:36         ` Derrick Stolee
  0 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee @ 2021-03-15 13:36 UTC (permalink / raw)
  To: Martin Ågren
  Cc: Derrick Stolee via GitGitGadget, Git Mailing List, Elijah Newren,
	Junio C Hamano, Nguyễn Thái Ngọc Duy,
	Jonathan Nieder, Derrick Stolee, Derrick Stolee

On 3/14/2021 4:08 PM, Martin Ågren wrote:
> On Tue, 9 Mar 2021 at 21:52, Derrick Stolee <stolee@gmail.com> wrote:
>>
>> I agree that the layers are confusing. We could rearrange and have
>> a similar flow to what you recommend by mentioning the extension at
>> the end:
>>
>> **WARNING:** Using a sparse index requires modifying the index in a way
>> that is not completely understood by other tools. If you have trouble with
>> this compatibility, then run `git sparse-checkout sparse-index disable` to
>> rewrite your index to not be sparse. Older versions of Git will not
>> understand the `sparseIndex` repository extension and may fail to interact
>> with your repository until it is disabled.
> 
> I like it. I find this easier to read than the previous version. That
> said, is `git sparse-index sparse-checkout disable` really the way to do
> this? I don't see a "sparse-index" subcommand of git-sparse-checkout.
> ... Hmm, no, after building and installing your patches, I get
> 
>   $ git sparse-checkout sparse-index disable
>   usage: git sparse-checkout (init|list|set|add|reapply|disable) <options>
> 
> Should that be `git sparse-checkout init --no-sparse-index`? I just
> tried that on a fresh, empty repo. It seems to work in the sense that it
> drops the config item. I'm guessing re-initing a sparse checkout is a
> safe and sane thing to do?

Yes! Sorry I missed updating this instance when changing the
design. Your suggestion is indeed the proper way to disable the
sparse-index.
 
> I don't find any tests for this. If re-initing should be ok and in
> particular if it should allow toggling the use of sparse index, it might
> be good having a test. At a minimum to see that the command passes and
> that the config item goes away? And check that the actual index is
> rewritten back to the "old" format? (Sorry if you have that already and
> I'm just bad at finding it.)

We have tests already that 'git sparse-checkout init' will preserve
existing sparse-checkout patterns.

I should definitely have a test to ensure that '--no-sparse-index'
rewrites the index to be a full one. Thanks!

-Stolee

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v2 05/20] sparse-index: implement ensure_full_index()
  2021-03-12 20:11           ` Derrick Stolee
@ 2021-03-15 23:52             ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 203+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-15 23:52 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Junio C Hamano, Derrick Stolee via GitGitGadget, git, newren,
	pclouds, jrnieder, Martin Ågren, SZEDER Gábor,
	Derrick Stolee, Derrick Stolee


On Fri, Mar 12 2021, Derrick Stolee wrote:

> On 3/12/2021 3:08 PM, Junio C Hamano wrote:
>> Derrick Stolee <stolee@gmail.com> writes:
>> 
>>>> Ævar, the assumption that led to your e68237bb (tree.h API: remove
>>>> support for starting at prefix != "", 2021-03-08) closes the door
>>>> for this code rather badly.  Please work with Derrick to figure out
>>>> what the best course of action would be.
>>>
>>> Thanks for pointing this out, Junio.
>>>
>>> My preference would be to drop "tree.h API: remove support for
>>> starting at prefix != """, but it should be OK to keep "tree.h API:
>>> remove "stage" parameter from read_tree_recursive()" (currently
>>> b3a078863f6), even though it introduces a semantic conflict here.
>>>
>>> Since I haven't seen my sparse-index topic get picked up by a
>>> tracking branch, I'd be happy to rebase on top of Ævar's topic if
>>> I can still set a non-root prefix.
>> I think all the clean-up value e68237bb has are on the calling side
>> (they no longer have to pass constant ("", 0) to the function), and
>> we could rewrite e68237bb by
>> 
>>  - renaming "read_tree_recursive()" to "read_tree_at()", with the
>>    non-empty prefix support.
>> 
>>  - creating a new function "read_tree()", which lacks the support
>>    for prefix, as a thin-wrapper around "read_tree_at()".
>> 
>>  - modifying the callers of "read_tree_recursive()" changed by
>>    e68237bb to instead call "read_tree()" (without prefix).
>> 
>> to simplify majority of calling sites without losing functionality.
>> 
>> Then your [05/20] can use the read_tree_at() to read with a prefix.
>> 
>> 
>> But that kind of details, I'd want to see you two figure out
>> yourselves.
>
> You've given us a great proposal. I'll wait for Ævar to chime in
> (and probably update his topic) before I submit a new version.

I've re-rolled my series just now at
https://lore.kernel.org/git/20210315234344.28427-1-avarab@gmail.com/
sorry for the delay.

You should be able to rebase easily on top of it, although note that the
new read_tree_at() uses a strbuf, but is otherwise the same as the old
read_tree_recursive().

Note that the pathspec can also be used to get to where
read_tree_recursive() would have brought you. I haven't looked at
whether there's reasons to convert in-tree (or this) code to pathspec
use, or vice-versa convert some things that use pathspecs now
(e.g. ls-tree with a path) to providing a prefix via the strbuf.

^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v3 00/20] Sparse Index: Design, Format, Tests
  2021-03-10 19:30 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (20 preceding siblings ...)
  2021-03-11  0:07   ` [PATCH v2 00/20] Sparse Index: Design, Format, Tests Elijah Newren
@ 2021-03-16 16:42   ` Derrick Stolee via GitGitGadget
  2021-03-16 16:42     ` [PATCH v3 01/20] sparse-index: design doc and format update Derrick Stolee via GitGitGadget
                       ` (23 more replies)
  21 siblings, 24 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:42 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee

Here is the first full patch series submission coming out of the
sparse-index RFC [1].

[1]
https://lore.kernel.org/git/pull.847.git.1611596533.gitgitgadget@gmail.com/

I won't waste too much space here, because PATCH 1 includes a sizeable
design document that describes the feature, the reasoning behind it, and my
plan for getting this implemented widely throughout the codebase.

There are some new things here that were not in the RFC:

 * Design doc and format updates. (Patch 1)
 * Performance test script. (Patches 2 and 20)

Notably missing in this series from the RFC:

 * The mega-patch inserting ensure_full_index() throughout the codebase.
   That will be a follow-up series to this one.
 * The integrations with git status and git add to demonstrate the improved
   performance. Those will also appear in their own series later.

I plan to keep my latest work in this area in my 'sparse-index/wip' branch
[2]. It includes all of the work from the RFC right now, updated with the
work from this series.

[2] https://github.com/derrickstolee/git/tree/sparse-index/wip


Updates in V3
=============

For this version, I took Ævar's latest patches and applied them to v2.31.0
and rebased this series on top. It uses his new "read_tree_at()" helper and
the associated changes to the function pointer type.

 * Fixed more typos. Thanks Martin and Elijah!
 * Updated the test_sparse_match() macro to use "$@" instead of $*
 * Added a test that git sparse-checkout init --no-sparse-index rewrites the
   index to be full.


Updates in V2
=============

 * Various typos and awkward grammar is fixed.
 * Cleaned up unnecessary commands in p2000-sparse-operations.sh
 * Added a comment to the sparse_index member of struct index_state.
 * Used tree_type, commit_type, and blob_type in test-read-cache.c.

Thanks, -Stolee

Derrick Stolee (20):
  sparse-index: design doc and format update
  t/perf: add performance test for sparse operations
  t1092: clean up script quoting
  sparse-index: add guard to ensure full index
  sparse-index: implement ensure_full_index()
  t1092: compare sparse-checkout to sparse-index
  test-read-cache: print cache entries with --table
  test-tool: don't force full index
  unpack-trees: ensure full index
  sparse-checkout: hold pattern list in index
  sparse-index: convert from full to sparse
  submodule: sparse-index should not collapse links
  unpack-trees: allow sparse directories
  sparse-index: check index conversion happens
  sparse-index: create extension for compatibility
  sparse-checkout: toggle sparse index from builtin
  sparse-checkout: disable sparse-index
  cache-tree: integrate with sparse directory entries
  sparse-index: loose integration with cache_tree_verify()
  p2000: add sparse-index repos

 Documentation/config/extensions.txt      |   8 +
 Documentation/git-sparse-checkout.txt    |  14 ++
 Documentation/technical/index-format.txt |   7 +
 Documentation/technical/sparse-index.txt | 173 +++++++++++++
 Makefile                                 |   1 +
 builtin/sparse-checkout.c                |  44 +++-
 cache-tree.c                             |  40 ++++
 cache.h                                  |  18 +-
 read-cache.c                             |  35 ++-
 repo-settings.c                          |  15 ++
 repository.c                             |  11 +-
 repository.h                             |   3 +
 setup.c                                  |   3 +
 sparse-index.c                           | 293 +++++++++++++++++++++++
 sparse-index.h                           |  11 +
 t/README                                 |   3 +
 t/helper/test-read-cache.c               |  66 ++++-
 t/perf/p2000-sparse-operations.sh        | 102 ++++++++
 t/t1091-sparse-checkout-builtin.sh       |  13 +
 t/t1092-sparse-checkout-compatibility.sh | 143 +++++++++--
 unpack-trees.c                           |  16 +-
 21 files changed, 979 insertions(+), 40 deletions(-)
 create mode 100644 Documentation/technical/sparse-index.txt
 create mode 100644 sparse-index.c
 create mode 100644 sparse-index.h
 create mode 100755 t/perf/p2000-sparse-operations.sh


base-commit: 9c34e7ffd7b544199d889e2f3f7d9ba663c4357d
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-883%2Fderrickstolee%2Fsparse-index%2Fformat-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-883/derrickstolee/sparse-index/format-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/883

Range-diff vs v2:

  1:  2fe413fdac80 !  1:  62ac13945bec sparse-index: design doc and format update
     @@ Documentation/technical/sparse-index.txt (new)
      +Hopefully, commands such as `git merge` and `git rebase` can benefit
      +instead from merge algorithms that do not use the index as a data
      +structure, such as the merge-ORT strategy. As these topics mature, we
     -+may enalbe the ORT strategy by default for repositories using the
     ++may enable the ORT strategy by default for repositories using the
      +sparse-index feature.
      +
      +Along with `git status` and `git add`, these commands cover the majority
  2:  540ab5495065 =  2:  d2197e895e4d t/perf: add performance test for sparse operations
  3:  5cbedb377b37 =  3:  d3cfd34b8418 t1092: clean up script quoting
  4:  6e21f776e883 =  4:  4472118cf903 sparse-index: add guard to ensure full index
  5:  399ddb0bad56 !  5:  99292cdbaae4 sparse-index: implement ensure_full_index()
     @@ sparse-index.c
      +}
      +
      +static int add_path_to_index(const struct object_id *oid,
     -+				struct strbuf *base, const char *path,
     -+				unsigned int mode, int stage, void *context)
     ++			     struct strbuf *base, const char *path,
     ++			     unsigned int mode, void *context)
      +{
      +	struct index_state *istate = (struct index_state *)context;
      +	struct cache_entry *ce;
     @@ sparse-index.c
      -	/* intentionally left blank */
      +	int i;
      +	struct index_state *full;
     ++	struct strbuf base = STRBUF_INIT;
      +
      +	if (!istate || !istate->sparse_index)
      +		return;
     @@ sparse-index.c
      +		ps.has_wildcard = 1;
      +		ps.max_depth = -1;
      +
     -+		read_tree_recursive(istate->repo, tree,
     -+				    ce->name, strlen(ce->name),
     -+				    0, &ps,
     -+				    add_path_to_index, full);
     ++		strbuf_setlen(&base, 0);
     ++		strbuf_add(&base, ce->name, strlen(ce->name));
     ++
     ++		read_tree_at(istate->repo, tree, &base, &ps,
     ++			     add_path_to_index, full);
      +
      +		/* free directory entries. full entries are re-used */
      +		discard_cache_entry(ce);
     @@ sparse-index.c
      +	istate->cache_nr = full->cache_nr;
      +	istate->cache_alloc = full->cache_alloc;
      +
     ++	strbuf_release(&base);
      +	free(full);
      +
      +	trace2_region_leave("index", "ensure_full_index", istate->repo);
  6:  eac2db5efc22 !  6:  fae5663a17bb t1092: compare sparse-checkout to sparse-index
     @@ Commit message
          add run_on_sparse and test_sparse_match helpers. These helpers will be
          used when the sparse index is implemented.
      
     -    Add GIT_TEST_SPARSE_INDEX environment variable to enable the
     -    sparse-index by default. This will be intended to use across the entire
     -    test suite, except that it will only affect cases where the
     -    sparse-checkout feature is enabled.
     +    Add the GIT_TEST_SPARSE_INDEX environment variable to enable the
     +    sparse-index by default. This can be enabled across all tests, but that
     +    will only affect cases where the sparse-checkout feature is enabled.
      
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
     @@ t/t1092-sparse-checkout-compatibility.sh: test_all_match () {
       }
       
      +test_sparse_match () {
     -+	run_on_sparse $* &&
     ++	run_on_sparse "$@" &&
      +	test_cmp sparse-checkout-out sparse-index-out &&
      +	test_cmp sparse-checkout-err sparse-index-err
      +}
  7:  e9c82d2eda82 =  7:  dffe8821fde2 test-read-cache: print cache entries with --table
  8:  243541fc5820 =  8:  f4ad081f25bb test-tool: don't force full index
  9:  48f65093b3da =  9:  4780076a50df unpack-trees: ensure full index
 10:  83aac8b7a1ec = 10:  33fdba2b8cfd sparse-checkout: hold pattern list in index
 11:  f6db0c27a285 ! 11:  e41b14e03ebb sparse-index: convert from full to sparse
     @@ t/t1092-sparse-checkout-compatibility.sh
       
       test_description='compare full workdir to sparse workdir'
       
     ++# The verify_cache_tree() check is not sparse-aware (yet).
     ++# So, disable the check until that integration is complete.
      +GIT_TEST_CHECK_CACHE_TREE=0
      +GIT_TEST_SPLIT_INDEX=0
      +
     @@ t/t1092-sparse-checkout-compatibility.sh: run_on_all () {
       }
       
       test_sparse_match () {
     --	run_on_sparse $* &&
     -+	run_on_sparse "$@" &&
     - 	test_cmp sparse-checkout-out sparse-index-out &&
     +@@ t/t1092-sparse-checkout-compatibility.sh: test_sparse_match () {
       	test_cmp sparse-checkout-err sparse-index-err
       }
       
 12:  f2a3e7298798 = 12:  b77cd6b02265 submodule: sparse-index should not collapse links
 13:  6f1ebe6ccc08 = 13:  4000c5cdd4cf unpack-trees: allow sparse directories
 14:  3fa684b315fb = 14:  1a2be38b2ca7 sparse-index: check index conversion happens
 15:  d74576d677f6 = 15:  f89891b0ae4e sparse-index: create extension for compatibility
 16:  e530ca5f668d ! 16:  bd703c76c859 sparse-checkout: toggle sparse index from builtin
     @@ Documentation/git-sparse-checkout.txt: To avoid interfering with other worktrees
      ++
      +**WARNING:** Using a sparse index requires modifying the index in a way
      +that is not completely understood by external tools. If you have trouble
     -+with this compatibility, then run `git sparse-checkout sparse-index disable`
     ++with this compatibility, then run `git sparse-checkout init --no-sparse-index`
      +to rewrite your index to not be sparse. Older versions of Git will not
      +understand the `sparseIndex` repository extension and may fail to interact
      +with your repository until it is disabled.
     @@ sparse-index.h: struct index_state;
      
       ## t/t1092-sparse-checkout-compatibility.sh ##
      @@ t/t1092-sparse-checkout-compatibility.sh: test_description='compare full workdir to sparse workdir'
     - 
     + # So, disable the check until that integration is complete.
       GIT_TEST_CHECK_CACHE_TREE=0
       GIT_TEST_SPLIT_INDEX=0
      +GIT_TEST_SPARSE_INDEX=
     @@ t/t1092-sparse-checkout-compatibility.sh: test_expect_success 'sparse-index cont
       
       	test-tool -C sparse-index read-cache --table >cache &&
       	for dir in deep/deeper2 folder1 folder2 x
     +@@ t/t1092-sparse-checkout-compatibility.sh: test_expect_success 'sparse-index contents' '
     + 		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
     + 		grep "040000 tree $TREE	$dir/" cache \
     + 			|| return 1
     +-	done
     ++	done &&
     ++
     ++	# Disabling the sparse-index removes tree entries with full ones
     ++	git -C sparse-index sparse-checkout init --no-sparse-index &&
     ++
     ++	test-tool -C sparse-index read-cache --table >cache &&
     ++	! grep "040000 tree" cache &&
     ++	test_sparse_match test-tool read-cache --table
     + '
     + 
     + test_expect_success 'expanded in-memory index matches full index' '
      @@ t/t1092-sparse-checkout-compatibility.sh: test_expect_success 'submodule handling' '
       test_expect_success 'sparse-index is expanded and converted back' '
       	init_repos &&
 17:  42d0da9c5def = 17:  598557f90a2a sparse-checkout: disable sparse-index
 18:  6bb0976a6295 ! 18:  c2d0c17db31a cache-tree: integrate with sparse directory entries
     @@ sparse-index.c: int convert_to_sparse(struct index_state *istate)
       	trace2_region_leave("index", "convert_to_sparse", istate->repo);
       	return 0;
      @@ sparse-index.c: void ensure_full_index(struct index_state *istate)
     - 
     + 	strbuf_release(&base);
       	free(full);
       
      +	/* Clear and recompute the cache-tree */
 19:  07f34e80609a ! 19:  6fdd9323c14e sparse-index: loose integration with cache_tree_verify()
     @@ t/t1092-sparse-checkout-compatibility.sh
       
       test_description='compare full workdir to sparse workdir'
       
     +-# The verify_cache_tree() check is not sparse-aware (yet).
     +-# So, disable the check until that integration is complete.
      -GIT_TEST_CHECK_CACHE_TREE=0
       GIT_TEST_SPLIT_INDEX=0
       GIT_TEST_SPARSE_INDEX=
 20:  41e3b56b9c17 = 20:  3db06ac46dd5 p2000: add sparse-index repos

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v3 01/20] sparse-index: design doc and format update
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
@ 2021-03-16 16:42     ` Derrick Stolee via GitGitGadget
  2021-03-19 23:43       ` Junio C Hamano
  2021-03-16 16:42     ` [PATCH v3 02/20] t/perf: add performance test for sparse operations Derrick Stolee via GitGitGadget
                       ` (22 subsequent siblings)
  23 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:42 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

This begins a long effort to update the index format to allow sparse
directory entries. This should result in a significant improvement to
Git commands when HEAD contains millions of files, but the user has
selected many fewer files to keep in their sparse-checkout definition.

Currently, the index format is only updated in the presence of
extensions.sparseIndex instead of increasing a file format version
number. This is temporary, and index v5 is part of the plan for future
work in this area.

The design document details many of the reasons for embarking on this
work, and also the plan for completing it safely.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/technical/index-format.txt |   7 +
 Documentation/technical/sparse-index.txt | 173 +++++++++++++++++++++++
 2 files changed, 180 insertions(+)
 create mode 100644 Documentation/technical/sparse-index.txt

diff --git a/Documentation/technical/index-format.txt b/Documentation/technical/index-format.txt
index d363a71c37ec..cc548eaa0e97 100644
--- a/Documentation/technical/index-format.txt
+++ b/Documentation/technical/index-format.txt
@@ -44,6 +44,13 @@ Git index format
   localization, no special casing of directory separator '/'). Entries
   with the same name are sorted by their stage field.
 
+  An index entry typically represents a file. However, if sparse-checkout
+  is enabled in cone mode (`core.sparseCheckoutCone` is enabled) and the
+  `extensions.sparseIndex` extension is enabled, then the index may
+  contain entries for directories outside of the sparse-checkout definition.
+  These entries have mode `0040000`, include the `SKIP_WORKTREE` bit, and
+  the path ends in a directory separator.
+
   32-bit ctime seconds, the last time a file's metadata changed
     this is stat(2) data
 
diff --git a/Documentation/technical/sparse-index.txt b/Documentation/technical/sparse-index.txt
new file mode 100644
index 000000000000..aa116406a016
--- /dev/null
+++ b/Documentation/technical/sparse-index.txt
@@ -0,0 +1,173 @@
+Git Sparse-Index Design Document
+================================
+
+The sparse-checkout feature allows users to focus a working directory on
+a subset of the files at HEAD. The cone mode patterns, enabled by
+`core.sparseCheckoutCone`, allow for very fast pattern matching to
+discover which files at HEAD belong in the sparse-checkout cone.
+
+Three important scale dimensions for a Git worktree are:
+
+* `HEAD`: How many files are present at `HEAD`?
+
+* Populated: How many files are within the sparse-checkout cone.
+
+* Modified: How many files has the user modified in the working directory?
+
+We will use big-O notation -- O(X) -- to denote how expensive certain
+operations are in terms of these dimensions.
+
+These dimensions are ordered by their magnitude: users (typically) modify
+fewer files than are populated, and we can only populate files at `HEAD`.
+These dimensions are also ordered by how expensive they are per item: it
+is expensive to detect a modified file than it is to write one that we
+know must be populated; changing `HEAD` only really requires updating the
+index.
+
+Problems occur if there is an extreme imbalance in these dimensions. For
+example, if `HEAD` contains millions of paths but the populated set has
+only tens of thousands, then commands like `git status` and `git add` can
+be dominated by operations that require O(`HEAD`) operations instead of
+O(Populated). Primarily, the cost is in parsing and rewriting the index,
+which is filled primarily with files at `HEAD` that are marked with the
+`SKIP_WORKTREE` bit.
+
+The sparse-index intends to take these commands that read and modify the
+index from O(`HEAD`) to O(Populated). To do this, we need to modify the
+index format in a significant way: add "sparse directory" entries.
+
+With cone mode patterns, it is possible to detect when an entire
+directory will have its contents outside of the sparse-checkout definition.
+Instead of listing all of the files it contains as individual entries, a
+sparse-index contains an entry with the directory name, referencing the
+object ID of the tree at `HEAD` and marked with the `SKIP_WORKTREE` bit.
+If we need to discover the details for paths within that directory, we
+can parse trees to find that list.
+
+At time of writing, sparse-directory entries violate expectations about the
+index format and its in-memory data structure. There are many consumers in
+the codebase that expect to iterate through all of the index entries and
+see only files. In addition, they expect to see all files at `HEAD`. One
+way to handle this is to parse trees to replace a sparse-directory entry
+with all of the files within that tree as the index is loaded. However,
+parsing trees is slower than parsing the index format, so that is a slower
+operation than if we left the index alone.
+
+The implementation plan below follows four phases to slowly integrate with
+the sparse-index. The intention is to incrementally update Git commands to
+interact safely with the sparse-index without significant slowdowns. This
+may not always be possible, but the hope is that the primary commands that
+users need in their daily work are dramatically improved.
+
+Phase I: Format and initial speedups
+------------------------------------
+
+During this phase, Git learns to enable the sparse-index and safely parse
+one. Protections are put in place so that every consumer of the in-memory
+data structure can operate with its current assumption of every file at
+`HEAD`.
+
+At first, every index parse will expand the sparse-directory entries into
+the full list of paths at `HEAD`. This will be slower in all cases. The
+only noticable change in behavior will be that the serialized index file
+contains sparse-directory entries.
+
+To start, we use a new repository extension, `extensions.sparseIndex`, to
+allow inserting sparse-directory entries into indexes with file format
+versions 2, 3, and 4. This prevents Git versions that do not understand
+the sparse-index from operating on one, but it also prevents other
+operations that do not use the index at all. A new format, index v5, will
+be introduced that includes sparse-directory entries by default. It might
+also introduce other features that have been considered for improving the
+index, as well.
+
+Next, consumers of the index will be guarded against operating on a
+sparse-index by inserting calls to `ensure_full_index()` or
+`expand_index_to_path()`. After these guards are in place, we can begin
+leaving sparse-directory entries in the in-memory index structure.
+
+Even after inserting these guards, we will keep expanding sparse-indexes
+for most Git commands using the `command_requires_full_index` repository
+setting. This setting will be on by default and disabled one builtin at a
+time until we have sufficient confidence that all of the index operations
+are properly guarded.
+
+To complete this phase, the commands `git status` and `git add` will be
+integrated with the sparse-index so that they operate with O(Populated)
+performance. They will be carefully tested for operations within and
+outside the sparse-checkout definition.
+
+Phase II: Careful integrations
+------------------------------
+
+This phase focuses on ensuring that all index extensions and APIs work
+well with a sparse-index. This requires significant increases to our test
+coverage, especially for operations that interact with the working
+directory outside of the sparse-checkout definition. Some of these
+behaviors may not be the desirable ones, such as some tests already
+marked for failure in `t1092-sparse-checkout-compatibility.sh`.
+
+The index extensions that may require special integrations are:
+
+* FS Monitor
+* Untracked cache
+
+While integrating with these features, we should look for patterns that
+might lead to better APIs for interacting with the index. Coalescing
+common usage patterns into an API call can reduce the number of places
+where sparse-directories need to be handled carefully.
+
+Phase III: Important command speedups
+-------------------------------------
+
+At this point, the patterns for testing and implementing sparse-directory
+logic should be relatively stable. This phase focuses on updating some of
+the most common builtins that use the index to operate as O(Populated).
+Here is a potential list of commands that could be valuable to integrate
+at this point:
+
+* `git commit`
+* `git checkout`
+* `git merge`
+* `git rebase`
+
+Hopefully, commands such as `git merge` and `git rebase` can benefit
+instead from merge algorithms that do not use the index as a data
+structure, such as the merge-ORT strategy. As these topics mature, we
+may enable the ORT strategy by default for repositories using the
+sparse-index feature.
+
+Along with `git status` and `git add`, these commands cover the majority
+of users' interactions with the working directory. In addition, we can
+integrate with these commands:
+
+* `git grep`
+* `git rm`
+
+These have been proposed as some whose behavior could change when in a
+repo with a sparse-checkout definition. It would be good to include this
+behavior automatically when using a sparse-index. Some clarity is needed
+to make the behavior switch clear to the user.
+
+This phase is the first where parallel work might be possible without too
+much conflicts between topics.
+
+Phase IV: The long tail
+-----------------------
+
+This last phase is less a "phase" and more "the new normal" after all of
+the previous work.
+
+To start, the `command_requires_full_index` option could be removed in
+favor of expanding only when hitting an API guard.
+
+There are many Git commands that could use special attention to operate as
+O(Populated), while some might be so rare that it is acceptable to leave
+them with additional overhead when a sparse-index is present.
+
+Here are some commands that might be useful to update:
+
+* `git sparse-checkout set`
+* `git am`
+* `git clean`
+* `git stash`
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v3 02/20] t/perf: add performance test for sparse operations
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
  2021-03-16 16:42     ` [PATCH v3 01/20] sparse-index: design doc and format update Derrick Stolee via GitGitGadget
@ 2021-03-16 16:42     ` Derrick Stolee via GitGitGadget
  2021-03-17  8:41       ` Ævar Arnfjörð Bjarmason
  2021-03-16 16:42     ` [PATCH v3 03/20] t1092: clean up script quoting Derrick Stolee via GitGitGadget
                       ` (21 subsequent siblings)
  23 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:42 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Create a test script that takes the default performance test (the Git
codebase) and multiplies it by 256 using four layers of duplicated
trees of width four. This results in nearly one million blob entries in
the index. Then, we can clone this repository with sparse-checkout
patterns that demonstrate four copies of the initial repository. Each
clone will use a different index format or mode so peformance can be
tested across the different options.

Note that the initial repo is stripped of submodules before doing the
copies. This preserves the expected data shape of the sparse index,
because directories containing submodules are not collapsed to a sparse
directory entry.

Run a few Git commands on these clones, especially those that use the
index (status, add, commit).

Here are the results on my Linux machine:

Test
--------------------------------------------------------------
2000.2: git status (full-index-v3)             0.37(0.30+0.09)
2000.3: git status (full-index-v4)             0.39(0.32+0.10)
2000.4: git add -A (full-index-v3)             1.42(1.06+0.20)
2000.5: git add -A (full-index-v4)             1.26(0.98+0.16)
2000.6: git add . (full-index-v3)              1.40(1.04+0.18)
2000.7: git add . (full-index-v4)              1.26(0.98+0.17)
2000.8: git commit -a -m A (full-index-v3)     1.42(1.11+0.16)
2000.9: git commit -a -m A (full-index-v4)     1.33(1.08+0.16)

It is perhaps noteworthy that there is an improvement when using index
version 4. This is because the v3 index uses 108 MiB while the v4
index uses 80 MiB. Since the repeated portions of the directories are
very short (f3/f1/f2, for example) this ratio is less pronounced than in
similarly-sized real repositories.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/perf/p2000-sparse-operations.sh | 85 +++++++++++++++++++++++++++++++
 1 file changed, 85 insertions(+)
 create mode 100755 t/perf/p2000-sparse-operations.sh

diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
new file mode 100755
index 000000000000..2fbc81b22119
--- /dev/null
+++ b/t/perf/p2000-sparse-operations.sh
@@ -0,0 +1,85 @@
+#!/bin/sh
+
+test_description="test performance of Git operations using the index"
+
+. ./perf-lib.sh
+
+test_perf_default_repo
+
+SPARSE_CONE=f2/f4/f1
+
+test_expect_success 'setup repo and indexes' '
+	git reset --hard HEAD &&
+	# Remove submodules from the example repo, because our
+	# duplication of the entire repo creates an unlikly data shape.
+	git config --file .gitmodules --get-regexp "submodule.*.path" >modules &&
+	git rm -f .gitmodules &&
+	for module in $(awk "{print \$2}" modules)
+	do
+		git rm $module || return 1
+	done &&
+	git commit -m "remove submodules" &&
+
+	echo bogus >a &&
+	cp a b &&
+	git add a b &&
+	git commit -m "level 0" &&
+	BLOB=$(git rev-parse HEAD:a) &&
+	OLD_COMMIT=$(git rev-parse HEAD) &&
+	OLD_TREE=$(git rev-parse HEAD^{tree}) &&
+
+	for i in $(test_seq 1 4)
+	do
+		cat >in <<-EOF &&
+			100755 blob $BLOB	a
+			040000 tree $OLD_TREE	f1
+			040000 tree $OLD_TREE	f2
+			040000 tree $OLD_TREE	f3
+			040000 tree $OLD_TREE	f4
+		EOF
+		NEW_TREE=$(git mktree <in) &&
+		NEW_COMMIT=$(git commit-tree $NEW_TREE -p $OLD_COMMIT -m "level $i") &&
+		OLD_TREE=$NEW_TREE &&
+		OLD_COMMIT=$NEW_COMMIT || return 1
+	done &&
+
+	git sparse-checkout init --cone &&
+	git branch -f wide $OLD_COMMIT &&
+	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . full-index-v3 &&
+	(
+		cd full-index-v3 &&
+		git sparse-checkout init --cone &&
+		git sparse-checkout set $SPARSE_CONE &&
+		git config index.version 3 &&
+		git update-index --index-version=3
+	) &&
+	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . full-index-v4 &&
+	(
+		cd full-index-v4 &&
+		git sparse-checkout init --cone &&
+		git sparse-checkout set $SPARSE_CONE &&
+		git config index.version 4 &&
+		git update-index --index-version=4
+	)
+'
+
+test_perf_on_all () {
+	command="$@"
+	for repo in full-index-v3 full-index-v4
+	do
+		test_perf "$command ($repo)" "
+			(
+				cd $repo &&
+				echo >>$SPARSE_CONE/a &&
+				$command
+			)
+		"
+	done
+}
+
+test_perf_on_all git status
+test_perf_on_all git add -A
+test_perf_on_all git add .
+test_perf_on_all git commit -a -m A
+
+test_done
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v3 03/20] t1092: clean up script quoting
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
  2021-03-16 16:42     ` [PATCH v3 01/20] sparse-index: design doc and format update Derrick Stolee via GitGitGadget
  2021-03-16 16:42     ` [PATCH v3 02/20] t/perf: add performance test for sparse operations Derrick Stolee via GitGitGadget
@ 2021-03-16 16:42     ` Derrick Stolee via GitGitGadget
  2021-03-17  8:47       ` Ævar Arnfjörð Bjarmason
  2021-03-16 16:42     ` [PATCH v3 04/20] sparse-index: add guard to ensure full index Derrick Stolee via GitGitGadget
                       ` (20 subsequent siblings)
  23 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:42 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

This test was introduced in 19a0acc83e4 (t1092: test interesting
sparse-checkout scenarios, 2021-01-23), but these issues with quoting
were not noticed until starting this follow-up series. The old mechanism
would drop quoting such as in

   test_all_match git commit -m "touch README.md"

The above happened to work because README.md is a file in the
repository, so 'git commit -m touch REAMDE.md' would succeed by
accident.

Other cases included quoting for no good reason, so clean that up now.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t1092-sparse-checkout-compatibility.sh | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 8cd3e5a8d227..3725d3997e70 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -96,20 +96,20 @@ init_repos () {
 run_on_sparse () {
 	(
 		cd sparse-checkout &&
-		$* >../sparse-checkout-out 2>../sparse-checkout-err
+		"$@" >../sparse-checkout-out 2>../sparse-checkout-err
 	)
 }
 
 run_on_all () {
 	(
 		cd full-checkout &&
-		$* >../full-checkout-out 2>../full-checkout-err
+		"$@" >../full-checkout-out 2>../full-checkout-err
 	) &&
-	run_on_sparse $*
+	run_on_sparse "$@"
 }
 
 test_all_match () {
-	run_on_all $* &&
+	run_on_all "$@" &&
 	test_cmp full-checkout-out sparse-checkout-out &&
 	test_cmp full-checkout-err sparse-checkout-err
 }
@@ -119,7 +119,7 @@ test_expect_success 'status with options' '
 	test_all_match git status --porcelain=v2 &&
 	test_all_match git status --porcelain=v2 -z -u &&
 	test_all_match git status --porcelain=v2 -uno &&
-	run_on_all "touch README.md" &&
+	run_on_all touch README.md &&
 	test_all_match git status --porcelain=v2 &&
 	test_all_match git status --porcelain=v2 -z -u &&
 	test_all_match git status --porcelain=v2 -uno &&
@@ -135,7 +135,7 @@ test_expect_success 'add, commit, checkout' '
 	write_script edit-contents <<-\EOF &&
 	echo text >>$1
 	EOF
-	run_on_all "../edit-contents README.md" &&
+	run_on_all ../edit-contents README.md &&
 
 	test_all_match git add README.md &&
 	test_all_match git status --porcelain=v2 &&
@@ -144,7 +144,7 @@ test_expect_success 'add, commit, checkout' '
 	test_all_match git checkout HEAD~1 &&
 	test_all_match git checkout - &&
 
-	run_on_all "../edit-contents README.md" &&
+	run_on_all ../edit-contents README.md &&
 
 	test_all_match git add -A &&
 	test_all_match git status --porcelain=v2 &&
@@ -153,7 +153,7 @@ test_expect_success 'add, commit, checkout' '
 	test_all_match git checkout HEAD~1 &&
 	test_all_match git checkout - &&
 
-	run_on_all "../edit-contents deep/newfile" &&
+	run_on_all ../edit-contents deep/newfile &&
 
 	test_all_match git status --porcelain=v2 -uno &&
 	test_all_match git status --porcelain=v2 &&
@@ -186,7 +186,7 @@ test_expect_success 'diff --staged' '
 	write_script edit-contents <<-\EOF &&
 	echo text >>README.md
 	EOF
-	run_on_all "../edit-contents" &&
+	run_on_all ../edit-contents &&
 
 	test_all_match git diff &&
 	test_all_match git diff --staged &&
@@ -280,7 +280,7 @@ test_expect_success 'clean' '
 	echo bogus >>.gitignore &&
 	run_on_all cp ../.gitignore . &&
 	test_all_match git add .gitignore &&
-	test_all_match git commit -m ignore-bogus-files &&
+	test_all_match git commit -m "ignore bogus files" &&
 
 	run_on_sparse mkdir folder1 &&
 	run_on_all touch folder1/bogus &&
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v3 04/20] sparse-index: add guard to ensure full index
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (2 preceding siblings ...)
  2021-03-16 16:42     ` [PATCH v3 03/20] t1092: clean up script quoting Derrick Stolee via GitGitGadget
@ 2021-03-16 16:42     ` Derrick Stolee via GitGitGadget
  2021-03-16 16:42     ` [PATCH v3 05/20] sparse-index: implement ensure_full_index() Derrick Stolee via GitGitGadget
                       ` (19 subsequent siblings)
  23 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:42 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Upcoming changes will introduce modifications to the index format that
allow sparse directories. It will be useful to have a mechanism for
converting those sparse index files into full indexes by walking the
tree at those sparse directories. Name this method ensure_full_index()
as it will guarantee that the index is fully expanded.

This method is not implemented yet, and instead we focus on the
scaffolding to declare it and call it at the appropriate time.

Add a 'command_requires_full_index' member to struct repo_settings. This
will be an indicator that we need the index in full mode to do certain
index operations. This starts as being true for every command, then we
will set it to false as some commands integrate with sparse indexes.

If 'command_requires_full_index' is true, then we will immediately
expand a sparse index to a full one upon reading from disk. This
suffices for now, but we will want to add more callers to
ensure_full_index() later.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Makefile        |  1 +
 repo-settings.c |  8 ++++++++
 repository.c    | 11 ++++++++++-
 repository.h    |  2 ++
 sparse-index.c  |  8 ++++++++
 sparse-index.h  |  7 +++++++
 6 files changed, 36 insertions(+), 1 deletion(-)
 create mode 100644 sparse-index.c
 create mode 100644 sparse-index.h

diff --git a/Makefile b/Makefile
index dfb0f1000fa3..89b1d5374107 100644
--- a/Makefile
+++ b/Makefile
@@ -985,6 +985,7 @@ LIB_OBJS += setup.o
 LIB_OBJS += shallow.o
 LIB_OBJS += sideband.o
 LIB_OBJS += sigchain.o
+LIB_OBJS += sparse-index.o
 LIB_OBJS += split-index.o
 LIB_OBJS += stable-qsort.o
 LIB_OBJS += strbuf.o
diff --git a/repo-settings.c b/repo-settings.c
index f7fff0f5ab83..d63569e4041e 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -77,4 +77,12 @@ void prepare_repo_settings(struct repository *r)
 		UPDATE_DEFAULT_BOOL(r->settings.core_untracked_cache, UNTRACKED_CACHE_KEEP);
 
 	UPDATE_DEFAULT_BOOL(r->settings.fetch_negotiation_algorithm, FETCH_NEGOTIATION_DEFAULT);
+
+	/*
+	 * This setting guards all index reads to require a full index
+	 * over a sparse index. After suitable guards are placed in the
+	 * codebase around uses of the index, this setting will be
+	 * removed.
+	 */
+	r->settings.command_requires_full_index = 1;
 }
diff --git a/repository.c b/repository.c
index c98298acd017..a8acae002f71 100644
--- a/repository.c
+++ b/repository.c
@@ -10,6 +10,7 @@
 #include "object.h"
 #include "lockfile.h"
 #include "submodule-config.h"
+#include "sparse-index.h"
 
 /* The main repository */
 static struct repository the_repo;
@@ -261,6 +262,8 @@ void repo_clear(struct repository *repo)
 
 int repo_read_index(struct repository *repo)
 {
+	int res;
+
 	if (!repo->index)
 		repo->index = xcalloc(1, sizeof(*repo->index));
 
@@ -270,7 +273,13 @@ int repo_read_index(struct repository *repo)
 	else if (repo->index->repo != repo)
 		BUG("repo's index should point back at itself");
 
-	return read_index_from(repo->index, repo->index_file, repo->gitdir);
+	res = read_index_from(repo->index, repo->index_file, repo->gitdir);
+
+	prepare_repo_settings(repo);
+	if (repo->settings.command_requires_full_index)
+		ensure_full_index(repo->index);
+
+	return res;
 }
 
 int repo_hold_locked_index(struct repository *repo,
diff --git a/repository.h b/repository.h
index b385ca3c94b6..e06a23015697 100644
--- a/repository.h
+++ b/repository.h
@@ -41,6 +41,8 @@ struct repo_settings {
 	enum fetch_negotiation_setting fetch_negotiation_algorithm;
 
 	int core_multi_pack_index;
+
+	unsigned command_requires_full_index:1;
 };
 
 struct repository {
diff --git a/sparse-index.c b/sparse-index.c
new file mode 100644
index 000000000000..82183ead563b
--- /dev/null
+++ b/sparse-index.c
@@ -0,0 +1,8 @@
+#include "cache.h"
+#include "repository.h"
+#include "sparse-index.h"
+
+void ensure_full_index(struct index_state *istate)
+{
+	/* intentionally left blank */
+}
diff --git a/sparse-index.h b/sparse-index.h
new file mode 100644
index 000000000000..09a20d036c46
--- /dev/null
+++ b/sparse-index.h
@@ -0,0 +1,7 @@
+#ifndef SPARSE_INDEX_H__
+#define SPARSE_INDEX_H__
+
+struct index_state;
+void ensure_full_index(struct index_state *istate);
+
+#endif
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v3 05/20] sparse-index: implement ensure_full_index()
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (3 preceding siblings ...)
  2021-03-16 16:42     ` [PATCH v3 04/20] sparse-index: add guard to ensure full index Derrick Stolee via GitGitGadget
@ 2021-03-16 16:42     ` Derrick Stolee via GitGitGadget
  2021-03-17 13:03       ` Ævar Arnfjörð Bjarmason
  2021-03-16 16:42     ` [PATCH v3 06/20] t1092: compare sparse-checkout to sparse-index Derrick Stolee via GitGitGadget
                       ` (18 subsequent siblings)
  23 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:42 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

We will mark an in-memory index_state as having sparse directory entries
with the sparse_index bit. These currently cannot exist, but we will add
a mechanism for collapsing a full index to a sparse one in a later
change. That will happen at write time, so we must first allow parsing
the format before writing it.

Commands or methods that require a full index in order to operate can
call ensure_full_index() to expand that index in-memory. This requires
parsing trees using that index's repository.

Sparse directory entries have a specific 'ce_mode' value. The macro
S_ISSPARSEDIR(ce->ce_mode) can check if a cache_entry 'ce' has this type.
This ce_mode is not possible with the existing index formats, so we don't
also verify all properties of a sparse-directory entry, which are:

 1. ce->ce_mode == 0040000
 2. ce->flags & CE_SKIP_WORKTREE is true
 3. ce->name[ce->namelen - 1] == '/' (ends in dir separator)
 4. ce->oid references a tree object.

These are all semi-enforced in ensure_full_index() to some extent. Any
deviation will cause a warning at minimum or a failure in the worst
case.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 cache.h        | 13 ++++++-
 read-cache.c   |  9 +++++
 sparse-index.c | 98 +++++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 118 insertions(+), 2 deletions(-)

diff --git a/cache.h b/cache.h
index c2f8a8eadf67..abb00a068e5d 100644
--- a/cache.h
+++ b/cache.h
@@ -204,6 +204,8 @@ struct cache_entry {
 #error "CE_EXTENDED_FLAGS out of range"
 #endif
 
+#define S_ISSPARSEDIR(m) ((m) == S_IFDIR)
+
 /* Forward structure decls */
 struct pathspec;
 struct child_process;
@@ -319,7 +321,14 @@ struct index_state {
 		 drop_cache_tree : 1,
 		 updated_workdir : 1,
 		 updated_skipworktree : 1,
-		 fsmonitor_has_run_once : 1;
+		 fsmonitor_has_run_once : 1,
+
+		 /*
+		  * sparse_index == 1 when sparse-directory
+		  * entries exist. Requires sparse-checkout
+		  * in cone mode.
+		  */
+		 sparse_index : 1;
 	struct hashmap name_hash;
 	struct hashmap dir_hash;
 	struct object_id oid;
@@ -722,6 +731,8 @@ int read_index_from(struct index_state *, const char *path,
 		    const char *gitdir);
 int is_index_unborn(struct index_state *);
 
+void ensure_full_index(struct index_state *istate);
+
 /* For use with `write_locked_index()`. */
 #define COMMIT_LOCK		(1 << 0)
 #define SKIP_IF_UNCHANGED	(1 << 1)
diff --git a/read-cache.c b/read-cache.c
index 1e9a50c6c734..dd3980c12b53 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -101,6 +101,9 @@ static const char *alternate_index_output;
 
 static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
 {
+	if (S_ISSPARSEDIR(ce->ce_mode))
+		istate->sparse_index = 1;
+
 	istate->cache[nr] = ce;
 	add_name_hash(istate, ce);
 }
@@ -2273,6 +2276,12 @@ int do_read_index(struct index_state *istate, const char *path, int must_exist)
 	trace2_data_intmax("index", the_repository, "read/cache_nr",
 			   istate->cache_nr);
 
+	if (!istate->repo)
+		istate->repo = the_repository;
+	prepare_repo_settings(istate->repo);
+	if (istate->repo->settings.command_requires_full_index)
+		ensure_full_index(istate);
+
 	return istate->cache_nr;
 
 unmap:
diff --git a/sparse-index.c b/sparse-index.c
index 82183ead563b..7095378a1b28 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -1,8 +1,104 @@
 #include "cache.h"
 #include "repository.h"
 #include "sparse-index.h"
+#include "tree.h"
+#include "pathspec.h"
+#include "trace2.h"
+
+static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
+{
+	ALLOC_GROW(istate->cache, nr + 1, istate->cache_alloc);
+
+	istate->cache[nr] = ce;
+	add_name_hash(istate, ce);
+}
+
+static int add_path_to_index(const struct object_id *oid,
+			     struct strbuf *base, const char *path,
+			     unsigned int mode, void *context)
+{
+	struct index_state *istate = (struct index_state *)context;
+	struct cache_entry *ce;
+	size_t len = base->len;
+
+	if (S_ISDIR(mode))
+		return READ_TREE_RECURSIVE;
+
+	strbuf_addstr(base, path);
+
+	ce = make_cache_entry(istate, mode, oid, base->buf, 0, 0);
+	ce->ce_flags |= CE_SKIP_WORKTREE;
+	set_index_entry(istate, istate->cache_nr++, ce);
+
+	strbuf_setlen(base, len);
+	return 0;
+}
 
 void ensure_full_index(struct index_state *istate)
 {
-	/* intentionally left blank */
+	int i;
+	struct index_state *full;
+	struct strbuf base = STRBUF_INIT;
+
+	if (!istate || !istate->sparse_index)
+		return;
+
+	if (!istate->repo)
+		istate->repo = the_repository;
+
+	trace2_region_enter("index", "ensure_full_index", istate->repo);
+
+	/* initialize basics of new index */
+	full = xcalloc(1, sizeof(struct index_state));
+	memcpy(full, istate, sizeof(struct index_state));
+
+	/* then change the necessary things */
+	full->sparse_index = 0;
+	full->cache_alloc = (3 * istate->cache_alloc) / 2;
+	full->cache_nr = 0;
+	ALLOC_ARRAY(full->cache, full->cache_alloc);
+
+	for (i = 0; i < istate->cache_nr; i++) {
+		struct cache_entry *ce = istate->cache[i];
+		struct tree *tree;
+		struct pathspec ps;
+
+		if (!S_ISSPARSEDIR(ce->ce_mode)) {
+			set_index_entry(full, full->cache_nr++, ce);
+			continue;
+		}
+		if (!(ce->ce_flags & CE_SKIP_WORKTREE))
+			warning(_("index entry is a directory, but not sparse (%08x)"),
+				ce->ce_flags);
+
+		/* recursively walk into cd->name */
+		tree = lookup_tree(istate->repo, &ce->oid);
+
+		memset(&ps, 0, sizeof(ps));
+		ps.recursive = 1;
+		ps.has_wildcard = 1;
+		ps.max_depth = -1;
+
+		strbuf_setlen(&base, 0);
+		strbuf_add(&base, ce->name, strlen(ce->name));
+
+		read_tree_at(istate->repo, tree, &base, &ps,
+			     add_path_to_index, full);
+
+		/* free directory entries. full entries are re-used */
+		discard_cache_entry(ce);
+	}
+
+	/* Copy back into original index. */
+	memcpy(&istate->name_hash, &full->name_hash, sizeof(full->name_hash));
+	istate->sparse_index = 0;
+	free(istate->cache);
+	istate->cache = full->cache;
+	istate->cache_nr = full->cache_nr;
+	istate->cache_alloc = full->cache_alloc;
+
+	strbuf_release(&base);
+	free(full);
+
+	trace2_region_leave("index", "ensure_full_index", istate->repo);
 }
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v3 06/20] t1092: compare sparse-checkout to sparse-index
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (4 preceding siblings ...)
  2021-03-16 16:42     ` [PATCH v3 05/20] sparse-index: implement ensure_full_index() Derrick Stolee via GitGitGadget
@ 2021-03-16 16:42     ` Derrick Stolee via GitGitGadget
  2021-03-16 16:42     ` [PATCH v3 07/20] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
                       ` (17 subsequent siblings)
  23 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:42 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Add a new 'sparse-index' repo alongside the 'full-checkout' and
'sparse-checkout' repos in t1092-sparse-checkout-compatibility.sh. Also
add run_on_sparse and test_sparse_match helpers. These helpers will be
used when the sparse index is implemented.

Add the GIT_TEST_SPARSE_INDEX environment variable to enable the
sparse-index by default. This can be enabled across all tests, but that
will only affect cases where the sparse-checkout feature is enabled.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/README                                 |  3 +++
 t/t1092-sparse-checkout-compatibility.sh | 24 ++++++++++++++++++++----
 2 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/t/README b/t/README
index 593d4a4e270c..b98bc563aab5 100644
--- a/t/README
+++ b/t/README
@@ -439,6 +439,9 @@ and "sha256".
 GIT_TEST_WRITE_REV_INDEX=<boolean>, when true enables the
 'pack.writeReverseIndex' setting.
 
+GIT_TEST_SPARSE_INDEX=<boolean>, when true enables index writes to use the
+sparse-index format by default.
+
 Naming Tests
 ------------
 
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 3725d3997e70..de5d8461c993 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -7,6 +7,7 @@ test_description='compare full workdir to sparse workdir'
 test_expect_success 'setup' '
 	git init initial-repo &&
 	(
+		GIT_TEST_SPARSE_INDEX=0 &&
 		cd initial-repo &&
 		echo a >a &&
 		echo "after deep" >e &&
@@ -87,23 +88,32 @@ init_repos () {
 
 	cp -r initial-repo sparse-checkout &&
 	git -C sparse-checkout reset --hard &&
-	git -C sparse-checkout sparse-checkout init --cone &&
+
+	cp -r initial-repo sparse-index &&
+	git -C sparse-index reset --hard &&
 
 	# initialize sparse-checkout definitions
-	git -C sparse-checkout sparse-checkout set deep
+	git -C sparse-checkout sparse-checkout init --cone &&
+	git -C sparse-checkout sparse-checkout set deep &&
+	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout init --cone &&
+	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep
 }
 
 run_on_sparse () {
 	(
 		cd sparse-checkout &&
-		"$@" >../sparse-checkout-out 2>../sparse-checkout-err
+		GIT_TEST_SPARSE_INDEX=0 "$@" >../sparse-checkout-out 2>../sparse-checkout-err
+	) &&
+	(
+		cd sparse-index &&
+		GIT_TEST_SPARSE_INDEX=1 "$@" >../sparse-index-out 2>../sparse-index-err
 	)
 }
 
 run_on_all () {
 	(
 		cd full-checkout &&
-		"$@" >../full-checkout-out 2>../full-checkout-err
+		GIT_TEST_SPARSE_INDEX=0 "$@" >../full-checkout-out 2>../full-checkout-err
 	) &&
 	run_on_sparse "$@"
 }
@@ -114,6 +124,12 @@ test_all_match () {
 	test_cmp full-checkout-err sparse-checkout-err
 }
 
+test_sparse_match () {
+	run_on_sparse "$@" &&
+	test_cmp sparse-checkout-out sparse-index-out &&
+	test_cmp sparse-checkout-err sparse-index-err
+}
+
 test_expect_success 'status with options' '
 	init_repos &&
 	test_all_match git status --porcelain=v2 &&
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v3 07/20] test-read-cache: print cache entries with --table
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (5 preceding siblings ...)
  2021-03-16 16:42     ` [PATCH v3 06/20] t1092: compare sparse-checkout to sparse-index Derrick Stolee via GitGitGadget
@ 2021-03-16 16:42     ` Derrick Stolee via GitGitGadget
  2021-03-17 13:28       ` [RFC/PATCH 0/5] " Ævar Arnfjörð Bjarmason
                         ` (5 more replies)
  2021-03-16 16:42     ` [PATCH v3 08/20] test-tool: don't force full index Derrick Stolee via GitGitGadget
                       ` (16 subsequent siblings)
  23 siblings, 6 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:42 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

This table is helpful for discovering data in the index to ensure it is
being written correctly, especially as we build and test the
sparse-index. This table includes an output format similar to 'git
ls-tree', but should not be compared to that directly. The biggest
reasons are that 'git ls-tree' includes a tree entry for every
subdirectory, even those that would not appear as a sparse directory in
a sparse-index. Further, 'git ls-tree' does not use a trailing directory
separator for its tree rows.

This does not print the stat() information for the blobs. That could be
added in a future change with another option. The tests that are added
in the next few changes care only about the object types and IDs.

To make the option parsing slightly more robust, wrap the string
comparisons in a loop adapted from test-dir-iterator.c.

Care must be taken with the final check for the 'cnt' variable. We
continue the expectation that the numerical value is the final argument.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/helper/test-read-cache.c | 55 +++++++++++++++++++++++++++++++-------
 1 file changed, 45 insertions(+), 10 deletions(-)

diff --git a/t/helper/test-read-cache.c b/t/helper/test-read-cache.c
index 244977a29bdf..6cfd8f2de71c 100644
--- a/t/helper/test-read-cache.c
+++ b/t/helper/test-read-cache.c
@@ -1,36 +1,71 @@
 #include "test-tool.h"
 #include "cache.h"
 #include "config.h"
+#include "blob.h"
+#include "commit.h"
+#include "tree.h"
+
+static void print_cache_entry(struct cache_entry *ce)
+{
+	const char *type;
+	printf("%06o ", ce->ce_mode & 0177777);
+
+	if (S_ISSPARSEDIR(ce->ce_mode))
+		type = tree_type;
+	else if (S_ISGITLINK(ce->ce_mode))
+		type = commit_type;
+	else
+		type = blob_type;
+
+	printf("%s %s\t%s\n",
+	       type,
+	       oid_to_hex(&ce->oid),
+	       ce->name);
+}
+
+static void print_cache(struct index_state *istate)
+{
+	int i;
+	for (i = 0; i < istate->cache_nr; i++)
+		print_cache_entry(istate->cache[i]);
+}
 
 int cmd__read_cache(int argc, const char **argv)
 {
+	struct repository *r = the_repository;
 	int i, cnt = 1;
 	const char *name = NULL;
+	int table = 0;
 
-	if (argc > 1 && skip_prefix(argv[1], "--print-and-refresh=", &name)) {
-		argc--;
-		argv++;
+	for (++argv, --argc; *argv && starts_with(*argv, "--"); ++argv, --argc) {
+		if (skip_prefix(*argv, "--print-and-refresh=", &name))
+			continue;
+		if (!strcmp(*argv, "--table"))
+			table = 1;
 	}
 
-	if (argc == 2)
-		cnt = strtol(argv[1], NULL, 0);
+	if (argc == 1)
+		cnt = strtol(argv[0], NULL, 0);
 	setup_git_directory();
 	git_config(git_default_config, NULL);
+
 	for (i = 0; i < cnt; i++) {
-		read_cache();
+		repo_read_index(r);
 		if (name) {
 			int pos;
 
-			refresh_index(&the_index, REFRESH_QUIET,
+			refresh_index(r->index, REFRESH_QUIET,
 				      NULL, NULL, NULL);
-			pos = index_name_pos(&the_index, name, strlen(name));
+			pos = index_name_pos(r->index, name, strlen(name));
 			if (pos < 0)
 				die("%s not in index", name);
 			printf("%s is%s up to date\n", name,
-			       ce_uptodate(the_index.cache[pos]) ? "" : " not");
+			       ce_uptodate(r->index->cache[pos]) ? "" : " not");
 			write_file(name, "%d\n", i);
 		}
-		discard_cache();
+		if (table)
+			print_cache(r->index);
+		discard_index(r->index);
 	}
 	return 0;
 }
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v3 08/20] test-tool: don't force full index
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (6 preceding siblings ...)
  2021-03-16 16:42     ` [PATCH v3 07/20] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
@ 2021-03-16 16:42     ` Derrick Stolee via GitGitGadget
  2021-03-16 16:42     ` [PATCH v3 09/20] unpack-trees: ensure " Derrick Stolee via GitGitGadget
                       ` (15 subsequent siblings)
  23 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:42 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

We will use 'test-tool read-cache --table' to check that a sparse
index is written as part of init_repos. Since we will no longer always
expand a sparse index into a full index, add an '--expand' parameter
that adds a call to ensure_full_index() so we can compare a sparse index
directly against a full index, or at least what the in-memory index
looks like when expanded in this way.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/helper/test-read-cache.c               | 13 ++++++++++++-
 t/t1092-sparse-checkout-compatibility.sh |  5 +++++
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/t/helper/test-read-cache.c b/t/helper/test-read-cache.c
index 6cfd8f2de71c..b52c174acc7a 100644
--- a/t/helper/test-read-cache.c
+++ b/t/helper/test-read-cache.c
@@ -4,6 +4,7 @@
 #include "blob.h"
 #include "commit.h"
 #include "tree.h"
+#include "sparse-index.h"
 
 static void print_cache_entry(struct cache_entry *ce)
 {
@@ -35,13 +36,19 @@ int cmd__read_cache(int argc, const char **argv)
 	struct repository *r = the_repository;
 	int i, cnt = 1;
 	const char *name = NULL;
-	int table = 0;
+	int table = 0, expand = 0;
+
+	initialize_the_repository();
+	prepare_repo_settings(r);
+	r->settings.command_requires_full_index = 0;
 
 	for (++argv, --argc; *argv && starts_with(*argv, "--"); ++argv, --argc) {
 		if (skip_prefix(*argv, "--print-and-refresh=", &name))
 			continue;
 		if (!strcmp(*argv, "--table"))
 			table = 1;
+		else if (!strcmp(*argv, "--expand"))
+			expand = 1;
 	}
 
 	if (argc == 1)
@@ -51,6 +58,10 @@ int cmd__read_cache(int argc, const char **argv)
 
 	for (i = 0; i < cnt; i++) {
 		repo_read_index(r);
+
+		if (expand)
+			ensure_full_index(r->index);
+
 		if (name) {
 			int pos;
 
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index de5d8461c993..a1aea141c62c 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -130,6 +130,11 @@ test_sparse_match () {
 	test_cmp sparse-checkout-err sparse-index-err
 }
 
+test_expect_success 'expanded in-memory index matches full index' '
+	init_repos &&
+	test_sparse_match test-tool read-cache --expand --table
+'
+
 test_expect_success 'status with options' '
 	init_repos &&
 	test_all_match git status --porcelain=v2 &&
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v3 09/20] unpack-trees: ensure full index
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (7 preceding siblings ...)
  2021-03-16 16:42     ` [PATCH v3 08/20] test-tool: don't force full index Derrick Stolee via GitGitGadget
@ 2021-03-16 16:42     ` Derrick Stolee via GitGitGadget
  2021-03-16 16:42     ` [PATCH v3 10/20] sparse-checkout: hold pattern list in index Derrick Stolee via GitGitGadget
                       ` (14 subsequent siblings)
  23 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:42 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The next change will translate full indexes into sparse indexes at write
time. The existing logic provides a way for every sparse index to be
expanded to a full index at read time. However, there are cases where an
index is written and then continues to be used in-memory to perform
further updates.

unpack_trees() is frequently called after such a write. In particular,
commands like 'git reset' do this double-update of the index.

Ensure that we have a full index when entering unpack_trees(), but only
when command_requires_full_index is true. This is always true at the
moment, but we will later relax that after unpack_trees() is updated to
handle sparse directory entries.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 unpack-trees.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/unpack-trees.c b/unpack-trees.c
index eb8fcda31ba7..2da3e5ec77a1 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -1570,6 +1570,7 @@ static int verify_absent(const struct cache_entry *,
  */
 int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options *o)
 {
+	struct repository *repo = the_repository;
 	int i, ret;
 	static struct cache_entry *dfc;
 	struct pattern_list pl;
@@ -1581,6 +1582,12 @@ int unpack_trees(unsigned len, struct tree_desc *t, struct unpack_trees_options
 	trace_performance_enter();
 	trace2_region_enter("unpack_trees", "unpack_trees", the_repository);
 
+	prepare_repo_settings(repo);
+	if (repo->settings.command_requires_full_index) {
+		ensure_full_index(o->src_index);
+		ensure_full_index(o->dst_index);
+	}
+
 	if (!core_apply_sparse_checkout || !o->update)
 		o->skip_sparse_checkout = 1;
 	if (!o->skip_sparse_checkout && !o->pl) {
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v3 10/20] sparse-checkout: hold pattern list in index
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (8 preceding siblings ...)
  2021-03-16 16:42     ` [PATCH v3 09/20] unpack-trees: ensure " Derrick Stolee via GitGitGadget
@ 2021-03-16 16:42     ` Derrick Stolee via GitGitGadget
  2021-03-16 16:42     ` [PATCH v3 11/20] sparse-index: convert from full to sparse Derrick Stolee via GitGitGadget
                       ` (13 subsequent siblings)
  23 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:42 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

As we modify the sparse-checkout definition, we perform index operations
on a pattern_list that only exists in-memory. This allows easy backing
out in case the index update fails.

However, if the index write itself cares about the sparse-checkout
pattern set, we need access to that in-memory copy. Place a pointer to
a 'struct pattern_list' in the index so we can access this on-demand.
This will be used in the next change which uses the sparse-checkout
definition to filter out directories that are outside the sparse cone.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/sparse-checkout.c | 17 ++++++++++-------
 cache.h                   |  2 ++
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
index 2306a9ad98e0..e00b82af727b 100644
--- a/builtin/sparse-checkout.c
+++ b/builtin/sparse-checkout.c
@@ -110,6 +110,8 @@ static int update_working_directory(struct pattern_list *pl)
 	if (is_index_unborn(r->index))
 		return UPDATE_SPARSITY_SUCCESS;
 
+	r->index->sparse_checkout_patterns = pl;
+
 	memset(&o, 0, sizeof(o));
 	o.verbose_update = isatty(2);
 	o.update = 1;
@@ -138,6 +140,7 @@ static int update_working_directory(struct pattern_list *pl)
 	else
 		rollback_lock_file(&lock_file);
 
+	r->index->sparse_checkout_patterns = NULL;
 	return result;
 }
 
@@ -517,19 +520,18 @@ static int modify_pattern_list(int argc, const char **argv, enum modify_type m)
 {
 	int result;
 	int changed_config = 0;
-	struct pattern_list pl;
-	memset(&pl, 0, sizeof(pl));
+	struct pattern_list *pl = xcalloc(1, sizeof(*pl));
 
 	switch (m) {
 	case ADD:
 		if (core_sparse_checkout_cone)
-			add_patterns_cone_mode(argc, argv, &pl);
+			add_patterns_cone_mode(argc, argv, pl);
 		else
-			add_patterns_literal(argc, argv, &pl);
+			add_patterns_literal(argc, argv, pl);
 		break;
 
 	case REPLACE:
-		add_patterns_from_input(&pl, argc, argv);
+		add_patterns_from_input(pl, argc, argv);
 		break;
 	}
 
@@ -539,12 +541,13 @@ static int modify_pattern_list(int argc, const char **argv, enum modify_type m)
 		changed_config = 1;
 	}
 
-	result = write_patterns_and_update(&pl);
+	result = write_patterns_and_update(pl);
 
 	if (result && changed_config)
 		set_config(MODE_NO_PATTERNS);
 
-	clear_pattern_list(&pl);
+	clear_pattern_list(pl);
+	free(pl);
 	return result;
 }
 
diff --git a/cache.h b/cache.h
index abb00a068e5d..759ca92e2ecc 100644
--- a/cache.h
+++ b/cache.h
@@ -307,6 +307,7 @@ static inline unsigned int canon_mode(unsigned int mode)
 struct split_index;
 struct untracked_cache;
 struct progress;
+struct pattern_list;
 
 struct index_state {
 	struct cache_entry **cache;
@@ -338,6 +339,7 @@ struct index_state {
 	struct mem_pool *ce_mem_pool;
 	struct progress *progress;
 	struct repository *repo;
+	struct pattern_list *sparse_checkout_patterns;
 };
 
 /* Name hashing */
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v3 11/20] sparse-index: convert from full to sparse
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (9 preceding siblings ...)
  2021-03-16 16:42     ` [PATCH v3 10/20] sparse-checkout: hold pattern list in index Derrick Stolee via GitGitGadget
@ 2021-03-16 16:42     ` Derrick Stolee via GitGitGadget
  2021-03-17 13:43       ` Ævar Arnfjörð Bjarmason
  2021-03-16 16:42     ` [PATCH v3 12/20] submodule: sparse-index should not collapse links Derrick Stolee via GitGitGadget
                       ` (12 subsequent siblings)
  23 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:42 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

If we have a full index, then we can convert it to a sparse index by
replacing directories outside of the sparse cone with sparse directory
entries. The convert_to_sparse() method does this, when the situation is
appropriate.

For now, we avoid converting the index to a sparse index if:

 1. the index is split.
 2. the index is already sparse.
 3. sparse-checkout is disabled.
 4. sparse-checkout does not use cone mode.

Finally, we currently limit the conversion to when the
GIT_TEST_SPARSE_INDEX environment variable is enabled. A mode using Git
config will be added in a later change.

The trickiest thing about this conversion is that we might not be able
to mark a directory as a sparse directory just because it is outside the
sparse cone. There might be unmerged files within that directory, so we
need to look for those. Also, if there is some strange reason why a file
is not marked with CE_SKIP_WORKTREE, then we should give up on
converting that directory. There is still hope that some of its
subdirectories might be able to convert to sparse, so we keep looking
deeper.

The conversion process is assisted by the cache-tree extension. This is
calculated from the full index if it does not already exist. We then
abandon the cache-tree as it no longer applies to the newly-sparse
index. Thus, this cache-tree will be recalculated in every
sparse-full-sparse round-trip until we integrate the cache-tree
extension with the sparse index.

Some Git commands use the index after writing it. For example, 'git add'
will update the index, then write it to disk, then read its entries to
report information. To keep the in-memory index in a full state after
writing, we re-expand it to a full one after the write. This is wasteful
for commands that only write the index and do not read from it again,
but that is only the case until we make those commands "sparse aware."

We can compare the behavior of the sparse-index in
t1092-sparse-checkout-compability.sh by using GIT_TEST_SPARSE_INDEX=1
when operating on the 'sparse-index' repo. We can also compare the two
sparse repos directly, such as comparing their indexes (when expanded to
full in the case of the 'sparse-index' repo). We also verify that the
index is actually populated with sparse directory entries.

The 'checkout and reset (mixed)' test is marked for failure when
comparing a sparse repo to a full repo, but we can compare the two
sparse-checkout cases directly to ensure that we are not changing the
behavior when using a sparse index.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 cache-tree.c                             |   3 +
 cache.h                                  |   2 +
 read-cache.c                             |  26 ++++-
 sparse-index.c                           | 139 +++++++++++++++++++++++
 sparse-index.h                           |   1 +
 t/t1092-sparse-checkout-compatibility.sh |  61 +++++++++-
 6 files changed, 228 insertions(+), 4 deletions(-)

diff --git a/cache-tree.c b/cache-tree.c
index 2fb483d3c083..5f07a39e501e 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -6,6 +6,7 @@
 #include "object-store.h"
 #include "replace-object.h"
 #include "promisor-remote.h"
+#include "sparse-index.h"
 
 #ifndef DEBUG_CACHE_TREE
 #define DEBUG_CACHE_TREE 0
@@ -442,6 +443,8 @@ int cache_tree_update(struct index_state *istate, int flags)
 	if (i)
 		return i;
 
+	ensure_full_index(istate);
+
 	if (!istate->cache_tree)
 		istate->cache_tree = cache_tree();
 
diff --git a/cache.h b/cache.h
index 759ca92e2ecc..69a32146cd77 100644
--- a/cache.h
+++ b/cache.h
@@ -251,6 +251,8 @@ static inline unsigned int create_ce_mode(unsigned int mode)
 {
 	if (S_ISLNK(mode))
 		return S_IFLNK;
+	if (mode == S_IFDIR)
+		return S_IFDIR;
 	if (S_ISDIR(mode) || S_ISGITLINK(mode))
 		return S_IFGITLINK;
 	return S_IFREG | ce_permissions(mode);
diff --git a/read-cache.c b/read-cache.c
index dd3980c12b53..b9c08773466c 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -25,6 +25,7 @@
 #include "fsmonitor.h"
 #include "thread-utils.h"
 #include "progress.h"
+#include "sparse-index.h"
 
 /* Mask for the name length in ce_flags in the on-disk index */
 
@@ -1002,8 +1003,14 @@ int verify_path(const char *path, unsigned mode)
 
 			c = *path++;
 			if ((c == '.' && !verify_dotfile(path, mode)) ||
-			    is_dir_sep(c) || c == '\0')
+			    is_dir_sep(c))
 				return 0;
+			/*
+			 * allow terminating directory separators for
+			 * sparse directory entries.
+			 */
+			if (c == '\0')
+				return S_ISDIR(mode);
 		} else if (c == '\\' && protect_ntfs) {
 			if (is_ntfs_dotgit(path))
 				return 0;
@@ -3079,6 +3086,14 @@ static int do_write_locked_index(struct index_state *istate, struct lock_file *l
 				 unsigned flags)
 {
 	int ret;
+	int was_full = !istate->sparse_index;
+
+	ret = convert_to_sparse(istate);
+
+	if (ret) {
+		warning(_("failed to convert to a sparse-index"));
+		return ret;
+	}
 
 	/*
 	 * TODO trace2: replace "the_repository" with the actual repo instance
@@ -3090,6 +3105,9 @@ static int do_write_locked_index(struct index_state *istate, struct lock_file *l
 	trace2_region_leave_printf("index", "do_write_index", the_repository,
 				   "%s", get_lock_file_path(lock));
 
+	if (was_full)
+		ensure_full_index(istate);
+
 	if (ret)
 		return ret;
 	if (flags & COMMIT_LOCK)
@@ -3180,9 +3198,10 @@ static int write_shared_index(struct index_state *istate,
 			      struct tempfile **temp)
 {
 	struct split_index *si = istate->split_index;
-	int ret;
+	int ret, was_full = !istate->sparse_index;
 
 	move_cache_to_base_index(istate);
+	convert_to_sparse(istate);
 
 	trace2_region_enter_printf("index", "shared/do_write_index",
 				   the_repository, "%s", get_tempfile_path(*temp));
@@ -3190,6 +3209,9 @@ static int write_shared_index(struct index_state *istate,
 	trace2_region_leave_printf("index", "shared/do_write_index",
 				   the_repository, "%s", get_tempfile_path(*temp));
 
+	if (was_full)
+		ensure_full_index(istate);
+
 	if (ret)
 		return ret;
 	ret = adjust_shared_perm(get_tempfile_path(*temp));
diff --git a/sparse-index.c b/sparse-index.c
index 7095378a1b28..619ff7c2e217 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -4,6 +4,145 @@
 #include "tree.h"
 #include "pathspec.h"
 #include "trace2.h"
+#include "cache-tree.h"
+#include "config.h"
+#include "dir.h"
+#include "fsmonitor.h"
+
+static struct cache_entry *construct_sparse_dir_entry(
+				struct index_state *istate,
+				const char *sparse_dir,
+				struct cache_tree *tree)
+{
+	struct cache_entry *de;
+
+	de = make_cache_entry(istate, S_IFDIR, &tree->oid, sparse_dir, 0, 0);
+
+	de->ce_flags |= CE_SKIP_WORKTREE;
+	return de;
+}
+
+/*
+ * Returns the number of entries "inserted" into the index.
+ */
+static int convert_to_sparse_rec(struct index_state *istate,
+				 int num_converted,
+				 int start, int end,
+				 const char *ct_path, size_t ct_pathlen,
+				 struct cache_tree *ct)
+{
+	int i, can_convert = 1;
+	int start_converted = num_converted;
+	enum pattern_match_result match;
+	int dtype;
+	struct strbuf child_path = STRBUF_INIT;
+	struct pattern_list *pl = istate->sparse_checkout_patterns;
+
+	/*
+	 * Is the current path outside of the sparse cone?
+	 * Then check if the region can be replaced by a sparse
+	 * directory entry (everything is sparse and merged).
+	 */
+	match = path_matches_pattern_list(ct_path, ct_pathlen,
+					  NULL, &dtype, pl, istate);
+	if (match != NOT_MATCHED)
+		can_convert = 0;
+
+	for (i = start; can_convert && i < end; i++) {
+		struct cache_entry *ce = istate->cache[i];
+
+		if (ce_stage(ce) ||
+		    !(ce->ce_flags & CE_SKIP_WORKTREE))
+			can_convert = 0;
+	}
+
+	if (can_convert) {
+		struct cache_entry *se;
+		se = construct_sparse_dir_entry(istate, ct_path, ct);
+
+		istate->cache[num_converted++] = se;
+		return 1;
+	}
+
+	for (i = start; i < end; ) {
+		int count, span, pos = -1;
+		const char *base, *slash;
+		struct cache_entry *ce = istate->cache[i];
+
+		/*
+		 * Detect if this is a normal entry outside of any subtree
+		 * entry.
+		 */
+		base = ce->name + ct_pathlen;
+		slash = strchr(base, '/');
+
+		if (slash)
+			pos = cache_tree_subtree_pos(ct, base, slash - base);
+
+		if (pos < 0) {
+			istate->cache[num_converted++] = ce;
+			i++;
+			continue;
+		}
+
+		strbuf_setlen(&child_path, 0);
+		strbuf_add(&child_path, ce->name, slash - ce->name + 1);
+
+		span = ct->down[pos]->cache_tree->entry_count;
+		count = convert_to_sparse_rec(istate,
+					      num_converted, i, i + span,
+					      child_path.buf, child_path.len,
+					      ct->down[pos]->cache_tree);
+		num_converted += count;
+		i += span;
+	}
+
+	strbuf_release(&child_path);
+	return num_converted - start_converted;
+}
+
+int convert_to_sparse(struct index_state *istate)
+{
+	if (istate->split_index || istate->sparse_index ||
+	    !core_apply_sparse_checkout || !core_sparse_checkout_cone)
+		return 0;
+
+	/*
+	 * For now, only create a sparse index with the
+	 * GIT_TEST_SPARSE_INDEX environment variable. We will relax
+	 * this once we have a proper way to opt-in (and later still,
+	 * opt-out).
+	 */
+	if (!git_env_bool("GIT_TEST_SPARSE_INDEX", 0))
+		return 0;
+
+	if (!istate->sparse_checkout_patterns) {
+		istate->sparse_checkout_patterns = xcalloc(1, sizeof(struct pattern_list));
+		if (get_sparse_checkout_patterns(istate->sparse_checkout_patterns) < 0)
+			return 0;
+	}
+
+	if (!istate->sparse_checkout_patterns->use_cone_patterns) {
+		warning(_("attempting to use sparse-index without cone mode"));
+		return -1;
+	}
+
+	if (cache_tree_update(istate, 0)) {
+		warning(_("unable to update cache-tree, staying full"));
+		return -1;
+	}
+
+	remove_fsmonitor(istate);
+
+	trace2_region_enter("index", "convert_to_sparse", istate->repo);
+	istate->cache_nr = convert_to_sparse_rec(istate,
+						 0, 0, istate->cache_nr,
+						 "", 0, istate->cache_tree);
+	istate->drop_cache_tree = 1;
+	istate->sparse_index = 1;
+	trace2_region_leave("index", "convert_to_sparse", istate->repo);
+	return 0;
+}
 
 static void set_index_entry(struct index_state *istate, int nr, struct cache_entry *ce)
 {
diff --git a/sparse-index.h b/sparse-index.h
index 09a20d036c46..64380e121d80 100644
--- a/sparse-index.h
+++ b/sparse-index.h
@@ -3,5 +3,6 @@
 
 struct index_state;
 void ensure_full_index(struct index_state *istate);
+int convert_to_sparse(struct index_state *istate);
 
 #endif
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index a1aea141c62c..1e888d195122 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -2,6 +2,11 @@
 
 test_description='compare full workdir to sparse workdir'
 
+# The verify_cache_tree() check is not sparse-aware (yet).
+# So, disable the check until that integration is complete.
+GIT_TEST_CHECK_CACHE_TREE=0
+GIT_TEST_SPLIT_INDEX=0
+
 . ./test-lib.sh
 
 test_expect_success 'setup' '
@@ -121,7 +126,9 @@ run_on_all () {
 test_all_match () {
 	run_on_all "$@" &&
 	test_cmp full-checkout-out sparse-checkout-out &&
-	test_cmp full-checkout-err sparse-checkout-err
+	test_cmp full-checkout-out sparse-index-out &&
+	test_cmp full-checkout-err sparse-checkout-err &&
+	test_cmp full-checkout-err sparse-index-err
 }
 
 test_sparse_match () {
@@ -130,6 +137,38 @@ test_sparse_match () {
 	test_cmp sparse-checkout-err sparse-index-err
 }
 
+test_expect_success 'sparse-index contents' '
+	init_repos &&
+
+	test-tool -C sparse-index read-cache --table >cache &&
+	for dir in folder1 folder2 x
+	do
+		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
+		grep "040000 tree $TREE	$dir/" cache \
+			|| return 1
+	done &&
+
+	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set folder1 &&
+
+	test-tool -C sparse-index read-cache --table >cache &&
+	for dir in deep folder2 x
+	do
+		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
+		grep "040000 tree $TREE	$dir/" cache \
+			|| return 1
+	done &&
+
+	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep/deeper1 &&
+
+	test-tool -C sparse-index read-cache --table >cache &&
+	for dir in deep/deeper2 folder1 folder2 x
+	do
+		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
+		grep "040000 tree $TREE	$dir/" cache \
+			|| return 1
+	done
+'
+
 test_expect_success 'expanded in-memory index matches full index' '
 	init_repos &&
 	test_sparse_match test-tool read-cache --expand --table
@@ -137,6 +176,7 @@ test_expect_success 'expanded in-memory index matches full index' '
 
 test_expect_success 'status with options' '
 	init_repos &&
+	test_sparse_match ls &&
 	test_all_match git status --porcelain=v2 &&
 	test_all_match git status --porcelain=v2 -z -u &&
 	test_all_match git status --porcelain=v2 -uno &&
@@ -273,6 +313,17 @@ test_expect_failure 'checkout and reset (mixed)' '
 	test_all_match git reset update-folder2
 '
 
+# Ensure that sparse-index behaves identically to
+# sparse-checkout with a full index.
+test_expect_success 'checkout and reset (mixed) [sparse]' '
+	init_repos &&
+
+	test_sparse_match git checkout -b reset-test update-deep &&
+	test_sparse_match git reset deepest &&
+	test_sparse_match git reset update-folder1 &&
+	test_sparse_match git reset update-folder2
+'
+
 test_expect_success 'merge' '
 	init_repos &&
 
@@ -309,14 +360,20 @@ test_expect_success 'clean' '
 	test_all_match git status --porcelain=v2 &&
 	test_all_match git clean -f &&
 	test_all_match git status --porcelain=v2 &&
+	test_sparse_match ls &&
+	test_sparse_match ls folder1 &&
 
 	test_all_match git clean -xf &&
 	test_all_match git status --porcelain=v2 &&
+	test_sparse_match ls &&
+	test_sparse_match ls folder1 &&
 
 	test_all_match git clean -xdf &&
 	test_all_match git status --porcelain=v2 &&
+	test_sparse_match ls &&
+	test_sparse_match ls folder1 &&
 
-	test_path_is_dir sparse-checkout/folder1
+	test_sparse_match test_path_is_dir folder1
 '
 
 test_done
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v3 12/20] submodule: sparse-index should not collapse links
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (10 preceding siblings ...)
  2021-03-16 16:42     ` [PATCH v3 11/20] sparse-index: convert from full to sparse Derrick Stolee via GitGitGadget
@ 2021-03-16 16:42     ` Derrick Stolee via GitGitGadget
  2021-03-16 16:42     ` [PATCH v3 13/20] unpack-trees: allow sparse directories Derrick Stolee via GitGitGadget
                       ` (11 subsequent siblings)
  23 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:42 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

A submodule is stored as a "Git link" that actually points to a commit
within a submodule. Submodules are populated or not depending on
submodule configuration, not sparse-checkout. To ensure that the
sparse-index feature integrates correctly with submodules, we should not
collapse a directory if there is a Git link within its range.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 sparse-index.c                           |  1 +
 t/t1092-sparse-checkout-compatibility.sh | 17 +++++++++++++++++
 2 files changed, 18 insertions(+)

diff --git a/sparse-index.c b/sparse-index.c
index 619ff7c2e217..7631f7bd00b7 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -52,6 +52,7 @@ static int convert_to_sparse_rec(struct index_state *istate,
 		struct cache_entry *ce = istate->cache[i];
 
 		if (ce_stage(ce) ||
+		    S_ISGITLINK(ce->ce_mode) ||
 		    !(ce->ce_flags & CE_SKIP_WORKTREE))
 			can_convert = 0;
 	}
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 1e888d195122..cba5f89b1e96 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -376,4 +376,21 @@ test_expect_success 'clean' '
 	test_sparse_match test_path_is_dir folder1
 '
 
+test_expect_success 'submodule handling' '
+	init_repos &&
+
+	test_all_match mkdir modules &&
+	test_all_match touch modules/a &&
+	test_all_match git add modules &&
+	test_all_match git commit -m "add modules directory" &&
+
+	run_on_all git submodule add "$(pwd)/initial-repo" modules/sub &&
+	test_all_match git commit -m "add submodule" &&
+
+	# having a submodule prevents "modules" from collapse
+	test-tool -C sparse-index read-cache --table >cache &&
+	grep "100644 blob .*	modules/a" cache &&
+	grep "160000 commit $(git -C initial-repo rev-parse HEAD)	modules/sub" cache
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v3 13/20] unpack-trees: allow sparse directories
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (11 preceding siblings ...)
  2021-03-16 16:42     ` [PATCH v3 12/20] submodule: sparse-index should not collapse links Derrick Stolee via GitGitGadget
@ 2021-03-16 16:42     ` Derrick Stolee via GitGitGadget
  2021-03-17 13:35       ` Ævar Arnfjörð Bjarmason
  2021-03-16 16:42     ` [PATCH v3 14/20] sparse-index: check index conversion happens Derrick Stolee via GitGitGadget
                       ` (10 subsequent siblings)
  23 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:42 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The index_pos_by_traverse_info() currently throws a BUG() when a
directory entry exists exactly in the index. We need to consider that it
is possible to have a directory in a sparse index as long as that entry
is itself marked with the skip-worktree bit.

The 'pos' variable is assigned a negative value if an exact match is not
found. Since a directory name can be an exact match, it is no longer an
error to have a nonnegative 'pos' value.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 unpack-trees.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/unpack-trees.c b/unpack-trees.c
index 2da3e5ec77a1..e81d82d72d89 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -749,9 +749,12 @@ static int index_pos_by_traverse_info(struct name_entry *names,
 	strbuf_make_traverse_path(&name, info, names->path, names->pathlen);
 	strbuf_addch(&name, '/');
 	pos = index_name_pos(o->src_index, name.buf, name.len);
-	if (pos >= 0)
-		BUG("This is a directory and should not exist in index");
-	pos = -pos - 1;
+	if (pos >= 0) {
+		if (!o->src_index->sparse_index ||
+		    !(o->src_index->cache[pos]->ce_flags & CE_SKIP_WORKTREE))
+			BUG("This is a directory and should not exist in index");
+	} else
+		pos = -pos - 1;
 	if (pos >= o->src_index->cache_nr ||
 	    !starts_with(o->src_index->cache[pos]->name, name.buf) ||
 	    (pos > 0 && starts_with(o->src_index->cache[pos-1]->name, name.buf)))
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v3 14/20] sparse-index: check index conversion happens
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (12 preceding siblings ...)
  2021-03-16 16:42     ` [PATCH v3 13/20] unpack-trees: allow sparse directories Derrick Stolee via GitGitGadget
@ 2021-03-16 16:42     ` Derrick Stolee via GitGitGadget
  2021-03-16 16:42     ` [PATCH v3 15/20] sparse-index: create extension for compatibility Derrick Stolee via GitGitGadget
                       ` (9 subsequent siblings)
  23 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:42 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Add a test case that uses test_region to ensure that we are truly
expanding a sparse index to a full one, then converting back to sparse
when writing the index. As we integrate more Git commands with the
sparse index, we will convert these commands to check that we do _not_
convert the sparse index to a full index and instead stay sparse the
entire time.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t1092-sparse-checkout-compatibility.sh | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index cba5f89b1e96..47f983217852 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -393,4 +393,22 @@ test_expect_success 'submodule handling' '
 	grep "160000 commit $(git -C initial-repo rev-parse HEAD)	modules/sub" cache
 '
 
+test_expect_success 'sparse-index is expanded and converted back' '
+	init_repos &&
+
+	(
+		GIT_TEST_SPARSE_INDEX=1 &&
+		export GIT_TEST_SPARSE_INDEX &&
+		GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
+			git -C sparse-index -c core.fsmonitor="" reset --hard &&
+		test_region index convert_to_sparse trace2.txt &&
+		test_region index ensure_full_index trace2.txt &&
+
+		rm trace2.txt &&
+		GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
+			git -C sparse-index -c core.fsmonitor="" status -uno &&
+		test_region index ensure_full_index trace2.txt
+	)
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v3 15/20] sparse-index: create extension for compatibility
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (13 preceding siblings ...)
  2021-03-16 16:42     ` [PATCH v3 14/20] sparse-index: check index conversion happens Derrick Stolee via GitGitGadget
@ 2021-03-16 16:42     ` Derrick Stolee via GitGitGadget
  2021-03-16 16:42     ` [PATCH v3 16/20] sparse-checkout: toggle sparse index from builtin Derrick Stolee via GitGitGadget
                       ` (8 subsequent siblings)
  23 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:42 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Previously, we enabled the sparse index format only using
GIT_TEST_SPARSE_INDEX=1. This is not a feasible direction for users to
actually select this mode. Further, sparse directory entries are not
understood by the index formats as advertised.

We _could_ add a new index version that explicitly adds these
capabilities, but there are nuances to index formats 2, 3, and 4 that
are still valuable to select as options. Until we add index format
version 5, create a repo extension, "extensions.sparseIndex", that
specifies that the tool reading this repository must understand sparse
directory entries.

This change only encodes the extension and enables it when
GIT_TEST_SPARSE_INDEX=1. Later, we will add a more user-friendly CLI
mechanism.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/config/extensions.txt |  8 ++++++
 cache.h                             |  1 +
 repo-settings.c                     |  7 ++++++
 repository.h                        |  3 ++-
 setup.c                             |  3 +++
 sparse-index.c                      | 38 +++++++++++++++++++++++++----
 6 files changed, 54 insertions(+), 6 deletions(-)

diff --git a/Documentation/config/extensions.txt b/Documentation/config/extensions.txt
index 4e23d73cdcad..c02e09af0046 100644
--- a/Documentation/config/extensions.txt
+++ b/Documentation/config/extensions.txt
@@ -6,3 +6,11 @@ extensions.objectFormat::
 Note that this setting should only be set by linkgit:git-init[1] or
 linkgit:git-clone[1].  Trying to change it after initialization will not
 work and will produce hard-to-diagnose issues.
+
+extensions.sparseIndex::
+	When combined with `core.sparseCheckout=true` and
+	`core.sparseCheckoutCone=true`, the index may contain entries
+	corresponding to directories outside of the sparse-checkout
+	definition in lieu of containing each path under such directories.
+	Versions of Git that do not understand this extension do not
+	expect directory entries in the index.
diff --git a/cache.h b/cache.h
index 69a32146cd77..4ca6cd7f782c 100644
--- a/cache.h
+++ b/cache.h
@@ -1059,6 +1059,7 @@ struct repository_format {
 	int worktree_config;
 	int is_bare;
 	int hash_algo;
+	int sparse_index;
 	char *work_tree;
 	struct string_list unknown_extensions;
 	struct string_list v1_only_extensions;
diff --git a/repo-settings.c b/repo-settings.c
index d63569e4041e..9677d50f9238 100644
--- a/repo-settings.c
+++ b/repo-settings.c
@@ -85,4 +85,11 @@ void prepare_repo_settings(struct repository *r)
 	 * removed.
 	 */
 	r->settings.command_requires_full_index = 1;
+
+	/*
+	 * Initialize this as off.
+	 */
+	r->settings.sparse_index = 0;
+	if (!repo_config_get_bool(r, "extensions.sparseindex", &value) && value)
+		r->settings.sparse_index = 1;
 }
diff --git a/repository.h b/repository.h
index e06a23015697..a45f7520fd9e 100644
--- a/repository.h
+++ b/repository.h
@@ -42,7 +42,8 @@ struct repo_settings {
 
 	int core_multi_pack_index;
 
-	unsigned command_requires_full_index:1;
+	unsigned command_requires_full_index:1,
+		 sparse_index:1;
 };
 
 struct repository {
diff --git a/setup.c b/setup.c
index c04cd25a30df..cd8394564613 100644
--- a/setup.c
+++ b/setup.c
@@ -500,6 +500,9 @@ static enum extension_result handle_extension(const char *var,
 			return error("invalid value for 'extensions.objectformat'");
 		data->hash_algo = format;
 		return EXTENSION_OK;
+	} else if (!strcmp(ext, "sparseindex")) {
+		data->sparse_index = 1;
+		return EXTENSION_OK;
 	}
 	return EXTENSION_UNKNOWN;
 }
diff --git a/sparse-index.c b/sparse-index.c
index 7631f7bd00b7..3a6df66faeab 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -102,19 +102,47 @@ static int convert_to_sparse_rec(struct index_state *istate,
 	return num_converted - start_converted;
 }
 
+static int enable_sparse_index(struct repository *repo)
+{
+	const char *config_path = repo_git_path(repo, "config.worktree");
+
+	if (upgrade_repository_format(1) < 0) {
+		warning(_("unable to upgrade repository format to enable sparse-index"));
+		return -1;
+	}
+	git_config_set_in_file_gently(config_path,
+				      "extensions.sparseIndex",
+				      "true");
+
+	prepare_repo_settings(repo);
+	repo->settings.sparse_index = 1;
+	return 0;
+}
+
 int convert_to_sparse(struct index_state *istate)
 {
 	if (istate->split_index || istate->sparse_index ||
 	    !core_apply_sparse_checkout || !core_sparse_checkout_cone)
 		return 0;
 
+	if (!istate->repo)
+		istate->repo = the_repository;
+
+	/*
+	 * The GIT_TEST_SPARSE_INDEX environment variable triggers the
+	 * extensions.sparseIndex config variable to be on.
+	 */
+	if (git_env_bool("GIT_TEST_SPARSE_INDEX", 0)) {
+		int err = enable_sparse_index(istate->repo);
+		if (err < 0)
+			return err;
+	}
+
 	/*
-	 * For now, only create a sparse index with the
-	 * GIT_TEST_SPARSE_INDEX environment variable. We will relax
-	 * this once we have a proper way to opt-in (and later still,
-	 * opt-out).
+	 * Only convert to sparse if extensions.sparseIndex is set.
 	 */
-	if (!git_env_bool("GIT_TEST_SPARSE_INDEX", 0))
+	prepare_repo_settings(istate->repo);
+	if (!istate->repo->settings.sparse_index)
 		return 0;
 
 	if (!istate->sparse_checkout_patterns) {
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v3 16/20] sparse-checkout: toggle sparse index from builtin
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (14 preceding siblings ...)
  2021-03-16 16:42     ` [PATCH v3 15/20] sparse-index: create extension for compatibility Derrick Stolee via GitGitGadget
@ 2021-03-16 16:42     ` Derrick Stolee via GitGitGadget
  2021-03-16 16:43     ` [PATCH v3 17/20] sparse-checkout: disable sparse-index Derrick Stolee via GitGitGadget
                       ` (7 subsequent siblings)
  23 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:42 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The sparse index extension is used to signal that index writes should be
in sparse mode. This was only updated using GIT_TEST_SPARSE_INDEX=1.

Add a '--[no-]sparse-index' option to 'git sparse-checkout init' that
specifies if the sparse index should be used. It also updates the index
to use the correct format, either way. Add a warning in the
documentation that the use of a repository extension might reduce
compatibility with third-party tools. 'git sparse-checkout init' already
sets extension.worktreeConfig, which places most sparse-checkout users
outside of the scope of most third-party tools.

Update t1092-sparse-checkout-compatibility.sh to use this CLI instead of
GIT_TEST_SPARSE_INDEX=1.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-sparse-checkout.txt    | 14 +++++++
 builtin/sparse-checkout.c                | 17 ++++++++-
 sparse-index.c                           | 37 +++++++++++++------
 sparse-index.h                           |  3 ++
 t/t1092-sparse-checkout-compatibility.sh | 47 +++++++++++++-----------
 5 files changed, 84 insertions(+), 34 deletions(-)

diff --git a/Documentation/git-sparse-checkout.txt b/Documentation/git-sparse-checkout.txt
index a0eeaeb02ee3..2ff66c5a4e41 100644
--- a/Documentation/git-sparse-checkout.txt
+++ b/Documentation/git-sparse-checkout.txt
@@ -45,6 +45,20 @@ To avoid interfering with other worktrees, it first enables the
 When `--cone` is provided, the `core.sparseCheckoutCone` setting is
 also set, allowing for better performance with a limited set of
 patterns (see 'CONE PATTERN SET' below).
++
+Use the `--[no-]sparse-index` option to toggle the use of the sparse
+index format. This reduces the size of the index to be more closely
+aligned with your sparse-checkout definition. This can have significant
+performance advantages for commands such as `git status` or `git add`.
+This feature is still experimental. Some commands might be slower with
+a sparse index until they are properly integrated with the feature.
++
+**WARNING:** Using a sparse index requires modifying the index in a way
+that is not completely understood by external tools. If you have trouble
+with this compatibility, then run `git sparse-checkout init --no-sparse-index`
+to rewrite your index to not be sparse. Older versions of Git will not
+understand the `sparseIndex` repository extension and may fail to interact
+with your repository until it is disabled.
 
 'set'::
 	Write a set of patterns to the sparse-checkout file, as given as
diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
index e00b82af727b..ca63e2c64e95 100644
--- a/builtin/sparse-checkout.c
+++ b/builtin/sparse-checkout.c
@@ -14,6 +14,7 @@
 #include "unpack-trees.h"
 #include "wt-status.h"
 #include "quote.h"
+#include "sparse-index.h"
 
 static const char *empty_base = "";
 
@@ -283,12 +284,13 @@ static int set_config(enum sparse_checkout_mode mode)
 }
 
 static char const * const builtin_sparse_checkout_init_usage[] = {
-	N_("git sparse-checkout init [--cone]"),
+	N_("git sparse-checkout init [--cone] [--[no-]sparse-index]"),
 	NULL
 };
 
 static struct sparse_checkout_init_opts {
 	int cone_mode;
+	int sparse_index;
 } init_opts;
 
 static int sparse_checkout_init(int argc, const char **argv)
@@ -303,11 +305,15 @@ static int sparse_checkout_init(int argc, const char **argv)
 	static struct option builtin_sparse_checkout_init_options[] = {
 		OPT_BOOL(0, "cone", &init_opts.cone_mode,
 			 N_("initialize the sparse-checkout in cone mode")),
+		OPT_BOOL(0, "sparse-index", &init_opts.sparse_index,
+			 N_("toggle the use of a sparse index")),
 		OPT_END(),
 	};
 
 	repo_read_index(the_repository);
 
+	init_opts.sparse_index = -1;
+
 	argc = parse_options(argc, argv, NULL,
 			     builtin_sparse_checkout_init_options,
 			     builtin_sparse_checkout_init_usage, 0);
@@ -326,6 +332,15 @@ static int sparse_checkout_init(int argc, const char **argv)
 	sparse_filename = get_sparse_checkout_filename();
 	res = add_patterns_from_file_to_list(sparse_filename, "", 0, &pl, NULL);
 
+	if (init_opts.sparse_index >= 0) {
+		if (set_sparse_index_config(the_repository, init_opts.sparse_index) < 0)
+			die(_("failed to modify sparse-index config"));
+
+		/* force an index rewrite */
+		repo_read_index(the_repository);
+		the_repository->index->updated_workdir = 1;
+	}
+
 	/* If we already have a sparse-checkout file, use it. */
 	if (res >= 0) {
 		free(sparse_filename);
diff --git a/sparse-index.c b/sparse-index.c
index 3a6df66faeab..30c1a11fd62d 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -104,23 +104,37 @@ static int convert_to_sparse_rec(struct index_state *istate,
 
 static int enable_sparse_index(struct repository *repo)
 {
-	const char *config_path = repo_git_path(repo, "config.worktree");
+	int res;
 
 	if (upgrade_repository_format(1) < 0) {
 		warning(_("unable to upgrade repository format to enable sparse-index"));
 		return -1;
 	}
-	git_config_set_in_file_gently(config_path,
-				      "extensions.sparseIndex",
-				      "true");
+	res = git_config_set_gently("extensions.sparseindex", "true");
 
 	prepare_repo_settings(repo);
 	repo->settings.sparse_index = 1;
-	return 0;
+	return res;
+}
+
+int set_sparse_index_config(struct repository *repo, int enable)
+{
+	int res;
+
+	if (enable)
+		return enable_sparse_index(repo);
+
+	/* Don't downgrade repository format, just remove the extension. */
+	res = git_config_set_gently("extensions.sparseindex", NULL);
+
+	prepare_repo_settings(repo);
+	repo->settings.sparse_index = 0;
+	return res;
 }
 
 int convert_to_sparse(struct index_state *istate)
 {
+	int test_env;
 	if (istate->split_index || istate->sparse_index ||
 	    !core_apply_sparse_checkout || !core_sparse_checkout_cone)
 		return 0;
@@ -129,14 +143,13 @@ int convert_to_sparse(struct index_state *istate)
 		istate->repo = the_repository;
 
 	/*
-	 * The GIT_TEST_SPARSE_INDEX environment variable triggers the
-	 * extensions.sparseIndex config variable to be on.
+	 * If GIT_TEST_SPARSE_INDEX=1, then trigger extensions.sparseIndex
+	 * to be fully enabled. If GIT_TEST_SPARSE_INDEX=0 (set explicitly),
+	 * then purposefully disable the setting.
 	 */
-	if (git_env_bool("GIT_TEST_SPARSE_INDEX", 0)) {
-		int err = enable_sparse_index(istate->repo);
-		if (err < 0)
-			return err;
-	}
+	test_env = git_env_bool("GIT_TEST_SPARSE_INDEX", -1);
+	if (test_env >= 0)
+		set_sparse_index_config(istate->repo, test_env);
 
 	/*
 	 * Only convert to sparse if extensions.sparseIndex is set.
diff --git a/sparse-index.h b/sparse-index.h
index 64380e121d80..39dcc859735e 100644
--- a/sparse-index.h
+++ b/sparse-index.h
@@ -5,4 +5,7 @@ struct index_state;
 void ensure_full_index(struct index_state *istate);
 int convert_to_sparse(struct index_state *istate);
 
+struct repository;
+int set_sparse_index_config(struct repository *repo, int enable);
+
 #endif
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 47f983217852..f14dc48924d2 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -6,6 +6,7 @@ test_description='compare full workdir to sparse workdir'
 # So, disable the check until that integration is complete.
 GIT_TEST_CHECK_CACHE_TREE=0
 GIT_TEST_SPLIT_INDEX=0
+GIT_TEST_SPARSE_INDEX=
 
 . ./test-lib.sh
 
@@ -100,25 +101,26 @@ init_repos () {
 	# initialize sparse-checkout definitions
 	git -C sparse-checkout sparse-checkout init --cone &&
 	git -C sparse-checkout sparse-checkout set deep &&
-	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout init --cone &&
-	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep
+	git -C sparse-index sparse-checkout init --cone --sparse-index &&
+	test_cmp_config -C sparse-index true extensions.sparseindex &&
+	git -C sparse-index sparse-checkout set deep
 }
 
 run_on_sparse () {
 	(
 		cd sparse-checkout &&
-		GIT_TEST_SPARSE_INDEX=0 "$@" >../sparse-checkout-out 2>../sparse-checkout-err
+		"$@" >../sparse-checkout-out 2>../sparse-checkout-err
 	) &&
 	(
 		cd sparse-index &&
-		GIT_TEST_SPARSE_INDEX=1 "$@" >../sparse-index-out 2>../sparse-index-err
+		"$@" >../sparse-index-out 2>../sparse-index-err
 	)
 }
 
 run_on_all () {
 	(
 		cd full-checkout &&
-		GIT_TEST_SPARSE_INDEX=0 "$@" >../full-checkout-out 2>../full-checkout-err
+		"$@" >../full-checkout-out 2>../full-checkout-err
 	) &&
 	run_on_sparse "$@"
 }
@@ -148,7 +150,7 @@ test_expect_success 'sparse-index contents' '
 			|| return 1
 	done &&
 
-	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set folder1 &&
+	git -C sparse-index sparse-checkout set folder1 &&
 
 	test-tool -C sparse-index read-cache --table >cache &&
 	for dir in deep folder2 x
@@ -158,7 +160,7 @@ test_expect_success 'sparse-index contents' '
 			|| return 1
 	done &&
 
-	GIT_TEST_SPARSE_INDEX=1 git -C sparse-index sparse-checkout set deep/deeper1 &&
+	git -C sparse-index sparse-checkout set deep/deeper1 &&
 
 	test-tool -C sparse-index read-cache --table >cache &&
 	for dir in deep/deeper2 folder1 folder2 x
@@ -166,7 +168,14 @@ test_expect_success 'sparse-index contents' '
 		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
 		grep "040000 tree $TREE	$dir/" cache \
 			|| return 1
-	done
+	done &&
+
+	# Disabling the sparse-index removes tree entries with full ones
+	git -C sparse-index sparse-checkout init --no-sparse-index &&
+
+	test-tool -C sparse-index read-cache --table >cache &&
+	! grep "040000 tree" cache &&
+	test_sparse_match test-tool read-cache --table
 '
 
 test_expect_success 'expanded in-memory index matches full index' '
@@ -396,19 +405,15 @@ test_expect_success 'submodule handling' '
 test_expect_success 'sparse-index is expanded and converted back' '
 	init_repos &&
 
-	(
-		GIT_TEST_SPARSE_INDEX=1 &&
-		export GIT_TEST_SPARSE_INDEX &&
-		GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
-			git -C sparse-index -c core.fsmonitor="" reset --hard &&
-		test_region index convert_to_sparse trace2.txt &&
-		test_region index ensure_full_index trace2.txt &&
-
-		rm trace2.txt &&
-		GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
-			git -C sparse-index -c core.fsmonitor="" status -uno &&
-		test_region index ensure_full_index trace2.txt
-	)
+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
+		git -C sparse-index -c core.fsmonitor="" reset --hard &&
+	test_region index convert_to_sparse trace2.txt &&
+	test_region index ensure_full_index trace2.txt &&
+
+	rm trace2.txt &&
+	GIT_TRACE2_EVENT="$(pwd)/trace2.txt" GIT_TRACE2_EVENT_NESTING=10 \
+		git -C sparse-index -c core.fsmonitor="" status -uno &&
+	test_region index ensure_full_index trace2.txt
 '
 
 test_done
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v3 17/20] sparse-checkout: disable sparse-index
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (15 preceding siblings ...)
  2021-03-16 16:42     ` [PATCH v3 16/20] sparse-checkout: toggle sparse index from builtin Derrick Stolee via GitGitGadget
@ 2021-03-16 16:43     ` Derrick Stolee via GitGitGadget
  2021-03-16 16:43     ` [PATCH v3 18/20] cache-tree: integrate with sparse directory entries Derrick Stolee via GitGitGadget
                       ` (6 subsequent siblings)
  23 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:43 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

We use 'git sparse-checkout init --cone --sparse-index' to toggle the
sparse-index feature. It makes sense to also disable it when running
'git sparse-checkout disable'. This is particularly important because it
removes the extensions.sparseIndex config option, allowing other tools
to use this Git repository again.

This does mean that 'git sparse-checkout init' will not re-enable the
sparse-index feature, even if it was previously enabled.

While testing this feature, I noticed that the sparse-index was not
being written on the first run, but by a second. This was caught by the
call to 'test-tool read-cache --table'. This requires adjusting some
assignments to core_apply_sparse_checkout and pl.use_cone_patterns in
the sparse_checkout_init() logic.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 builtin/sparse-checkout.c          | 10 +++++++++-
 t/t1091-sparse-checkout-builtin.sh | 13 +++++++++++++
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/builtin/sparse-checkout.c b/builtin/sparse-checkout.c
index ca63e2c64e95..585343fa1972 100644
--- a/builtin/sparse-checkout.c
+++ b/builtin/sparse-checkout.c
@@ -280,6 +280,9 @@ static int set_config(enum sparse_checkout_mode mode)
 				      "core.sparseCheckoutCone",
 				      mode == MODE_CONE_PATTERNS ? "true" : NULL);
 
+	if (mode == MODE_NO_PATTERNS)
+		set_sparse_index_config(the_repository, 0);
+
 	return 0;
 }
 
@@ -341,10 +344,11 @@ static int sparse_checkout_init(int argc, const char **argv)
 		the_repository->index->updated_workdir = 1;
 	}
 
+	core_apply_sparse_checkout = 1;
+
 	/* If we already have a sparse-checkout file, use it. */
 	if (res >= 0) {
 		free(sparse_filename);
-		core_apply_sparse_checkout = 1;
 		return update_working_directory(NULL);
 	}
 
@@ -366,6 +370,7 @@ static int sparse_checkout_init(int argc, const char **argv)
 	add_pattern(strbuf_detach(&pattern, NULL), empty_base, 0, &pl, 0);
 	strbuf_addstr(&pattern, "!/*/");
 	add_pattern(strbuf_detach(&pattern, NULL), empty_base, 0, &pl, 0);
+	pl.use_cone_patterns = init_opts.cone_mode;
 
 	return write_patterns_and_update(&pl);
 }
@@ -632,6 +637,9 @@ static int sparse_checkout_disable(int argc, const char **argv)
 	strbuf_addstr(&match_all, "/*");
 	add_pattern(strbuf_detach(&match_all, NULL), empty_base, 0, &pl, 0);
 
+	prepare_repo_settings(the_repository);
+	the_repository->settings.sparse_index = 0;
+
 	if (update_working_directory(&pl))
 		die(_("error while refreshing working directory"));
 
diff --git a/t/t1091-sparse-checkout-builtin.sh b/t/t1091-sparse-checkout-builtin.sh
index fc64e9ed99f4..ff1ad570a255 100755
--- a/t/t1091-sparse-checkout-builtin.sh
+++ b/t/t1091-sparse-checkout-builtin.sh
@@ -205,6 +205,19 @@ test_expect_success 'sparse-checkout disable' '
 	check_files repo a deep folder1 folder2
 '
 
+test_expect_success 'sparse-index enabled and disabled' '
+	git -C repo sparse-checkout init --cone --sparse-index &&
+	test_cmp_config -C repo true extensions.sparseIndex &&
+	test-tool -C repo read-cache --table >cache &&
+	grep " tree " cache &&
+
+	git -C repo sparse-checkout disable &&
+	test-tool -C repo read-cache --table >cache &&
+	! grep " tree " cache &&
+	git -C repo config --list >config &&
+	! grep extensions.sparseindex config
+'
+
 test_expect_success 'cone mode: init and set' '
 	git -C repo sparse-checkout init --cone &&
 	git -C repo config --list >config &&
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v3 18/20] cache-tree: integrate with sparse directory entries
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (16 preceding siblings ...)
  2021-03-16 16:43     ` [PATCH v3 17/20] sparse-checkout: disable sparse-index Derrick Stolee via GitGitGadget
@ 2021-03-16 16:43     ` Derrick Stolee via GitGitGadget
  2021-03-16 16:43     ` [PATCH v3 19/20] sparse-index: loose integration with cache_tree_verify() Derrick Stolee via GitGitGadget
                       ` (5 subsequent siblings)
  23 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:43 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The cache-tree extension was previously disabled with sparse indexes.
However, the cache-tree is an important performance feature for commands
like 'git status' and 'git add'. Integrate it with sparse directory
entries.

When writing a sparse index, completely clear and recalculate the cache
tree. By starting from scratch, the only integration necessary is to
check if we hit a sparse directory entry and create a leaf of the
cache-tree that has an entry_count of one and no subtrees.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 cache-tree.c   | 18 ++++++++++++++++++
 sparse-index.c | 10 +++++++++-
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/cache-tree.c b/cache-tree.c
index 5f07a39e501e..950a9615db8f 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -256,6 +256,24 @@ static int update_one(struct cache_tree *it,
 
 	*skip_count = 0;
 
+	/*
+	 * If the first entry of this region is a sparse directory
+	 * entry corresponding exactly to 'base', then this cache_tree
+	 * struct is a "leaf" in the data structure, pointing to the
+	 * tree OID specified in the entry.
+	 */
+	if (entries > 0) {
+		const struct cache_entry *ce = cache[0];
+
+		if (S_ISSPARSEDIR(ce->ce_mode) &&
+		    ce->ce_namelen == baselen &&
+		    !strncmp(ce->name, base, baselen)) {
+			it->entry_count = 1;
+			oidcpy(&it->oid, &ce->oid);
+			return 1;
+		}
+	}
+
 	if (0 <= it->entry_count && has_object_file(&it->oid))
 		return it->entry_count;
 
diff --git a/sparse-index.c b/sparse-index.c
index 30c1a11fd62d..56313e805d9d 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -180,7 +180,11 @@ int convert_to_sparse(struct index_state *istate)
 	istate->cache_nr = convert_to_sparse_rec(istate,
 						 0, 0, istate->cache_nr,
 						 "", 0, istate->cache_tree);
-	istate->drop_cache_tree = 1;
+
+	/* Clear and recompute the cache-tree */
+	cache_tree_free(&istate->cache_tree);
+	cache_tree_update(istate, 0);
+
 	istate->sparse_index = 1;
 	trace2_region_leave("index", "convert_to_sparse", istate->repo);
 	return 0;
@@ -281,5 +285,9 @@ void ensure_full_index(struct index_state *istate)
 	strbuf_release(&base);
 	free(full);
 
+	/* Clear and recompute the cache-tree */
+	cache_tree_free(&istate->cache_tree);
+	cache_tree_update(istate, 0);
+
 	trace2_region_leave("index", "ensure_full_index", istate->repo);
 }
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v3 19/20] sparse-index: loose integration with cache_tree_verify()
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (17 preceding siblings ...)
  2021-03-16 16:43     ` [PATCH v3 18/20] cache-tree: integrate with sparse directory entries Derrick Stolee via GitGitGadget
@ 2021-03-16 16:43     ` Derrick Stolee via GitGitGadget
  2021-03-16 16:43     ` [PATCH v3 20/20] p2000: add sparse-index repos Derrick Stolee via GitGitGadget
                       ` (4 subsequent siblings)
  23 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:43 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The cache_tree_verify() method is run when GIT_TEST_CHECK_CACHE_TREE
is enabled, which it is by default in the test suite. The logic must
be adjusted for the presence of these directory entries.

For now, leave the test as a simple check for whether the directory
entry is sparse. Do not go any further until needed.

This allows us to re-enable GIT_TEST_CHECK_CACHE_TREE in
t1092-sparse-checkout-compatibility.sh. Further,
p2000-sparse-operations.sh uses the test suite and hence this is enabled
for all tests. We need to integrate with it before we run our
performance tests with a sparse-index.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 cache-tree.c                             | 19 +++++++++++++++++++
 t/t1092-sparse-checkout-compatibility.sh |  3 ---
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/cache-tree.c b/cache-tree.c
index 950a9615db8f..11bf1fcae6e1 100644
--- a/cache-tree.c
+++ b/cache-tree.c
@@ -808,6 +808,19 @@ int cache_tree_matches_traversal(struct cache_tree *root,
 	return 0;
 }
 
+static void verify_one_sparse(struct repository *r,
+			      struct index_state *istate,
+			      struct cache_tree *it,
+			      struct strbuf *path,
+			      int pos)
+{
+	struct cache_entry *ce = istate->cache[pos];
+
+	if (!S_ISSPARSEDIR(ce->ce_mode))
+		BUG("directory '%s' is present in index, but not sparse",
+		    path->buf);
+}
+
 static void verify_one(struct repository *r,
 		       struct index_state *istate,
 		       struct cache_tree *it,
@@ -830,6 +843,12 @@ static void verify_one(struct repository *r,
 
 	if (path->len) {
 		pos = index_name_pos(istate, path->buf, path->len);
+
+		if (pos >= 0) {
+			verify_one_sparse(r, istate, it, path, pos);
+			return;
+		}
+
 		pos = -pos - 1;
 	} else {
 		pos = 0;
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index f14dc48924d2..d97bf9b64527 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -2,9 +2,6 @@
 
 test_description='compare full workdir to sparse workdir'
 
-# The verify_cache_tree() check is not sparse-aware (yet).
-# So, disable the check until that integration is complete.
-GIT_TEST_CHECK_CACHE_TREE=0
 GIT_TEST_SPLIT_INDEX=0
 GIT_TEST_SPARSE_INDEX=
 
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [PATCH v3 20/20] p2000: add sparse-index repos
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (18 preceding siblings ...)
  2021-03-16 16:43     ` [PATCH v3 19/20] sparse-index: loose integration with cache_tree_verify() Derrick Stolee via GitGitGadget
@ 2021-03-16 16:43     ` Derrick Stolee via GitGitGadget
  2021-03-16 16:59     ` [PATCH v3 00/20] Sparse Index: Design, Format, Tests Derrick Stolee
                       ` (3 subsequent siblings)
  23 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-03-16 16:43 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

p2000-sparse-operations.sh compares different Git commands in
repositories with many files at HEAD but using sparse-checkout to focus
on a small portion of those files.

Add extra copies of the repository that use the sparse-index format so
we can track how that affects the performance of different commands.

At this point in time, the sparse-index is 100% overhead from the CPU
front, and this is measurable in these tests:

Test
---------------------------------------------------------------
2000.2: git status (full-index-v3)              0.59(0.51+0.12)
2000.3: git status (full-index-v4)              0.59(0.52+0.11)
2000.4: git status (sparse-index-v3)            1.40(1.32+0.12)
2000.5: git status (sparse-index-v4)            1.41(1.36+0.08)
2000.6: git add -A (full-index-v3)              2.32(1.97+0.19)
2000.7: git add -A (full-index-v4)              2.17(1.92+0.14)
2000.8: git add -A (sparse-index-v3)            2.31(2.21+0.15)
2000.9: git add -A (sparse-index-v4)            2.30(2.20+0.13)
2000.10: git add . (full-index-v3)              2.39(2.02+0.20)
2000.11: git add . (full-index-v4)              2.20(1.94+0.16)
2000.12: git add . (sparse-index-v3)            2.36(2.27+0.12)
2000.13: git add . (sparse-index-v4)            2.33(2.21+0.16)
2000.14: git commit -a -m A (full-index-v3)     2.47(2.12+0.20)
2000.15: git commit -a -m A (full-index-v4)     2.26(2.00+0.17)
2000.16: git commit -a -m A (sparse-index-v3)   3.01(2.92+0.16)
2000.17: git commit -a -m A (sparse-index-v4)   3.01(2.94+0.15)

Note that there is very little difference between the v3 and v4 index
formats when the sparse-index is enabled. This is primarily due to the
fact that the relative file sizes are the same, and the command time is
mostly taken up by parsing tree objects to expand the sparse index into
a full one.

With the current file layout, the index file sizes are given by this
table:

       |  full index | sparse index |
       +-------------+--------------+
    v3 |     108 MiB |      1.6 MiB |
    v4 |      80 MiB |      1.2 MiB |

Future updates will improve the performance of Git commands when the
index is sparse.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/perf/p2000-sparse-operations.sh | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
index 2fbc81b22119..e527316e66d6 100755
--- a/t/perf/p2000-sparse-operations.sh
+++ b/t/perf/p2000-sparse-operations.sh
@@ -60,12 +60,29 @@ test_expect_success 'setup repo and indexes' '
 		git sparse-checkout set $SPARSE_CONE &&
 		git config index.version 4 &&
 		git update-index --index-version=4
+	) &&
+	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . sparse-index-v3 &&
+	(
+		cd sparse-index-v3 &&
+		git sparse-checkout init --cone --sparse-index &&
+		git sparse-checkout set $SPARSE_CONE &&
+		git config index.version 3 &&
+		git update-index --index-version=3
+	) &&
+	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . sparse-index-v4 &&
+	(
+		cd sparse-index-v4 &&
+		git sparse-checkout init --cone --sparse-index &&
+		git sparse-checkout set $SPARSE_CONE &&
+		git config index.version 4 &&
+		git update-index --index-version=4
 	)
 '
 
 test_perf_on_all () {
 	command="$@"
-	for repo in full-index-v3 full-index-v4
+	for repo in full-index-v3 full-index-v4 \
+		    sparse-index-v3 sparse-index-v4
 	do
 		test_perf "$command ($repo)" "
 			(
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v3 00/20] Sparse Index: Design, Format, Tests
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (19 preceding siblings ...)
  2021-03-16 16:43     ` [PATCH v3 20/20] p2000: add sparse-index repos Derrick Stolee via GitGitGadget
@ 2021-03-16 16:59     ` Derrick Stolee
  2021-03-16 21:18     ` Elijah Newren
                       ` (2 subsequent siblings)
  23 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee @ 2021-03-16 16:59 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget, git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	SZEDER Gábor, Ævar Arnfjörð Bjarmason,
	Derrick Stolee

On 3/16/2021 12:42 PM, Derrick Stolee via GitGitGadget wrote:> Updates in V3
> =============
> 
> For this version, I took Ævar's latest patches and applied them to v2.31.0
> and rebased this series on top. It uses his new "read_tree_at()" helper and
> the associated changes to the function pointer type.

Junio, I wanted to call your attention to this change in base.

Here is the relevant part of the range-diff:

>   5:  399ddb0bad56 !  5:  99292cdbaae4 sparse-index: implement ensure_full_index()
>      @@ sparse-index.c
>       +}
>       +
>       +static int add_path_to_index(const struct object_id *oid,
>      -+				struct strbuf *base, const char *path,
>      -+				unsigned int mode, int stage, void *context)
>      ++			     struct strbuf *base, const char *path,
>      ++			     unsigned int mode, void *context)
>       +{
>       +	struct index_state *istate = (struct index_state *)context;
>       +	struct cache_entry *ce;
>      @@ sparse-index.c
>       -	/* intentionally left blank */
>       +	int i;
>       +	struct index_state *full;
>      ++	struct strbuf base = STRBUF_INIT;
>       +
>       +	if (!istate || !istate->sparse_index)
>       +		return;
>      @@ sparse-index.c
>       +		ps.has_wildcard = 1;
>       +		ps.max_depth = -1;
>       +
>      -+		read_tree_recursive(istate->repo, tree,
>      -+				    ce->name, strlen(ce->name),
>      -+				    0, &ps,
>      -+				    add_path_to_index, full);
>      ++		strbuf_setlen(&base, 0);
>      ++		strbuf_add(&base, ce->name, strlen(ce->name));
>      ++
>      ++		read_tree_at(istate->repo, tree, &base, &ps,
>      ++			     add_path_to_index, full);
>       +
>       +		/* free directory entries. full entries are re-used */
>       +		discard_cache_entry(ce);
>      @@ sparse-index.c
>       +	istate->cache_nr = full->cache_nr;
>       +	istate->cache_alloc = full->cache_alloc;
>       +
>      ++	strbuf_release(&base);
>       +	free(full);
>       +
>       +	trace2_region_leave("index", "ensure_full_index", istate->repo);

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v3 00/20] Sparse Index: Design, Format, Tests
  2021-03-16 16:42   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (20 preceding siblings ...)
  2021-03-16 16:59     ` [PATCH v3 00/20] Sparse Index: Design, Format, Tests Derrick Stolee
@ 2021-03-16 21:18     ` Elijah Newren
  2021-03-18 21:50     ` Junio C Hamano
  2021-03-23 13:44     ` [PATCH v4 " Derrick Stolee via GitGitGadget
  23 siblings, 0 replies; 203+ messages in thread
From: Elijah Newren @ 2021-03-16 21:18 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Martin Ågren, Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee

On Tue, Mar 16, 2021 at 9:43 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> Here is the first full patch series submission coming out of the
> sparse-index RFC [1].
>
> [1]
> https://lore.kernel.org/git/pull.847.git.1611596533.gitgitgadget@gmail.com/
>
> I won't waste too much space here, because PATCH 1 includes a sizeable
> design document that describes the feature, the reasoning behind it, and my
> plan for getting this implemented widely throughout the codebase.
>
> There are some new things here that were not in the RFC:
>
>  * Design doc and format updates. (Patch 1)
>  * Performance test script. (Patches 2 and 20)
>
> Notably missing in this series from the RFC:
>
>  * The mega-patch inserting ensure_full_index() throughout the codebase.
>    That will be a follow-up series to this one.
>  * The integrations with git status and git add to demonstrate the improved
>    performance. Those will also appear in their own series later.
>
> I plan to keep my latest work in this area in my 'sparse-index/wip' branch
> [2]. It includes all of the work from the RFC right now, updated with the
> work from this series.
>
> [2] https://github.com/derrickstolee/git/tree/sparse-index/wip
>
>
> Updates in V3
> =============
>
> For this version, I took Ævar's latest patches and applied them to v2.31.0
> and rebased this series on top. It uses his new "read_tree_at()" helper and
> the associated changes to the function pointer type.
>
>  * Fixed more typos. Thanks Martin and Elijah!
>  * Updated the test_sparse_match() macro to use "$@" instead of $*
>  * Added a test that git sparse-checkout init --no-sparse-index rewrites the
>    index to be full.

I've read through the range-diff.  Sorry for not spotting the conflict
with Ævar's series (that I also reviewed).  Anyway, my Reviewed-by
from the last series still holds.

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v3 02/20] t/perf: add performance test for sparse operations
  2021-03-16 16:42     ` [PATCH v3 02/20] t/perf: add performance test for sparse operations Derrick Stolee via GitGitGadget
@ 2021-03-17  8:41       ` Ævar Arnfjörð Bjarmason
  2021-03-17 13:05         ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-17  8:41 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee


On Tue, Mar 16 2021, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> Create a test script that takes the default performance test (the Git
> codebase) and multiplies it by 256 using four layers of duplicated
> trees of width four. This results in nearly one million blob entries in
> the index. Then, we can clone this repository with sparse-checkout
> patterns that demonstrate four copies of the initial repository. Each
> clone will use a different index format or mode so peformance can be
> tested across the different options.
>
> Note that the initial repo is stripped of submodules before doing the
> copies. This preserves the expected data shape of the sparse index,
> because directories containing submodules are not collapsed to a sparse
> directory entry.
>
> Run a few Git commands on these clones, especially those that use the
> index (status, add, commit).
>
> Here are the results on my Linux machine:
>
> Test
> --------------------------------------------------------------
> 2000.2: git status (full-index-v3)             0.37(0.30+0.09)
> 2000.3: git status (full-index-v4)             0.39(0.32+0.10)
> 2000.4: git add -A (full-index-v3)             1.42(1.06+0.20)
> 2000.5: git add -A (full-index-v4)             1.26(0.98+0.16)
> 2000.6: git add . (full-index-v3)              1.40(1.04+0.18)
> 2000.7: git add . (full-index-v4)              1.26(0.98+0.17)
> 2000.8: git commit -a -m A (full-index-v3)     1.42(1.11+0.16)
> 2000.9: git commit -a -m A (full-index-v4)     1.33(1.08+0.16)
>
> It is perhaps noteworthy that there is an improvement when using index
> version 4. This is because the v3 index uses 108 MiB while the v4
> index uses 80 MiB. Since the repeated portions of the directories are
> very short (f3/f1/f2, for example) this ratio is less pronounced than in
> similarly-sized real repositories.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  t/perf/p2000-sparse-operations.sh | 85 +++++++++++++++++++++++++++++++
>  1 file changed, 85 insertions(+)
>  create mode 100755 t/perf/p2000-sparse-operations.sh
>
> diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
> new file mode 100755
> index 000000000000..2fbc81b22119
> --- /dev/null
> +++ b/t/perf/p2000-sparse-operations.sh
> @@ -0,0 +1,85 @@
> +#!/bin/sh
> +
> +test_description="test performance of Git operations using the index"
> +
> +. ./perf-lib.sh
> +
> +test_perf_default_repo
> +
> +SPARSE_CONE=f2/f4/f1
> +
> +test_expect_success 'setup repo and indexes' '
> +	git reset --hard HEAD &&
> +	# Remove submodules from the example repo, because our
> +	# duplication of the entire repo creates an unlikly data shape.
> +	git config --file .gitmodules --get-regexp "submodule.*.path" >modules &&
> +	git rm -f .gitmodules &&
> +	for module in $(awk "{print \$2}" modules)
> +	do
> +		git rm $module || return 1
> +	done &&
> +	git commit -m "remove submodules" &&

Paradoxically with this you can no longer use a repo that's not git.git
or another repo that has submodules, since we'll die in trying to remove
them.

Also you don't have to "git rm .gitmodules", the "git rm" command
removes submodule entries.

Perhaps just:

    for module in $(git ls-files --stage | grep ^160000 | awk -F '\t' '{ print $2 }')
    do
        git rm "$module"
    done

Or another way of guarding against rm getting the empty list && commit?

But it seems odd to be doing this at all, the point of the perf
framework is that you can point it at any repo, and some repos you want
to test will have submodules.

Seems like something like the WIP patch at the end on top would be
better.

> +	echo bogus >a &&
> +	cp a b &&
> +	git add a b &&
> +	git commit -m "level 0" &&
> +	BLOB=$(git rev-parse HEAD:a) &&

Isn't the way we're getting this $BLOB equivalent to just 'echo bogus |
git hash-object --stdin -w' why commit it?

> +	OLD_COMMIT=$(git rev-parse HEAD) &&
> +	OLD_TREE=$(git rev-parse HEAD^{tree}) &&
> +
> +	for i in $(test_seq 1 4)
> +	do
> +		cat >in <<-EOF &&
> +			100755 blob $BLOB	a
> +			040000 tree $OLD_TREE	f1
> +			040000 tree $OLD_TREE	f2
> +			040000 tree $OLD_TREE	f3
> +			040000 tree $OLD_TREE	f4
> +		EOF
> +		NEW_TREE=$(git mktree <in) &&
> +		NEW_COMMIT=$(git commit-tree $NEW_TREE -p $OLD_COMMIT -m "level $i") &&
> +		OLD_TREE=$NEW_TREE &&
> +		OLD_COMMIT=$NEW_COMMIT || return 1
> +	done &&
> +
> +	git sparse-checkout init --cone &&
> +	git branch -f wide $OLD_COMMIT &&
> +	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . full-index-v3 &&
> +	(
> +		cd full-index-v3 &&
> +		git sparse-checkout init --cone &&
> +		git sparse-checkout set $SPARSE_CONE &&
> +		git config index.version 3 &&
> +		git update-index --index-version=3
> +	) &&
> +	git -c core.sparseCheckoutCone=true clone --branch=wide --sparse . full-index-v4 &&
> +	(
> +		cd full-index-v4 &&
> +		git sparse-checkout init --cone &&
> +		git sparse-checkout set $SPARSE_CONE &&
> +		git config index.version 4 &&
> +		git update-index --index-version=4
> +	)
> +'

This whole thing makes me think you just wanted a test_perf_fresh_repo
all along, but I think this would be much more useful if you took the
default repo and multiplied the size in its tree by some multiple.

E.g. take the files we have in git.git, write a copy at prefix-1/,
prefix-2/ etc.

The whole point of test_perf_{default,large}_repo is being able to point
them at a local repo you're testing for performance and get numbers
representative of that repo.

So maybe that's not what's wanted here at all, but that brings us back
to test_perf_fresh_repo...

> +test_perf_on_all () {
> +	command="$@"
> +	for repo in full-index-v3 full-index-v4
> +	do
> +		test_perf "$command ($repo)" "
> +			(
> +				cd $repo &&
> +				echo >>$SPARSE_CONE/a &&
> +				$command
> +			)
> +		"
> +	done
> +}
> +
> +test_perf_on_all git status
> +test_perf_on_all git add -A
> +test_perf_on_all git add .
> +test_perf_on_all git commit -a -m A
> +
> +test_done

diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
index e527316e66..2c07b04159 100755
--- a/t/perf/p2000-sparse-operations.sh
+++ b/t/perf/p2000-sparse-operations.sh
@@ -4,22 +4,11 @@ test_description="test performance of Git operations using the index"
 
 . ./perf-lib.sh
 
-test_perf_default_repo
+test_perf_nosubodules_repo
 
 SPARSE_CONE=f2/f4/f1
 
 test_expect_success 'setup repo and indexes' '
-	git reset --hard HEAD &&
-	# Remove submodules from the example repo, because our
-	# duplication of the entire repo creates an unlikly data shape.
-	git config --file .gitmodules --get-regexp "submodule.*.path" >modules &&
-	git rm -f .gitmodules &&
-	for module in $(awk "{print \$2}" modules)
-	do
-		git rm $module || return 1
-	done &&
-	git commit -m "remove submodules" &&
-
 	echo bogus >a &&
 	cp a b &&
 	git add a b &&
diff --git a/t/perf/perf-lib.sh b/t/perf/perf-lib.sh
index e385c6896f..86b716ce8f 100644
--- a/t/perf/perf-lib.sh
+++ b/t/perf/perf-lib.sh
@@ -128,6 +128,15 @@ test_perf_large_repo () {
 	fi
 	test_perf_create_repo_from "${1:-$TRASH_DIRECTORY}" "$GIT_PERF_LARGE_REPO"
 }
+test_perf_nosubodules_repo () {
+	if test "$GIT_PERF_NOSUBMODULES_REPO" = "$GIT_BUILD_DIR"; then
+		echo "warning: \$GIT_PERF_NOSUBMODULES_REPO is \$GIT_BUILD_DIR." >&2
+		echo "warning: This will probably work, but it has a submodule!" >&2
+		echo "warning: point to another repo for representative measurements." >&2
+		# git rm dance here? optionally?
+	fi
+	test_perf_create_repo_from "${1:-$TRASH_DIRECTORY}" "$GIT_PERF_NOSUBMODULES_REPO"
+}
 test_checkout_worktree () {
 	git checkout-index -u -a ||
 	error "git checkout-index failed"
@@ -196,7 +205,7 @@ test_perf_ () {
 	else
 		echo "perf $test_count - $1:"
 	fi
-	for i in $(test_seq 1 $GIT_PERF_REPEAT_COUNT); do
+	for i in $(test_seq 1 $GIT_PERF_REP
 		say >&3 "running: $2"
 		if test_run_perf_ "$2"
 		then

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v3 03/20] t1092: clean up script quoting
  2021-03-16 16:42     ` [PATCH v3 03/20] t1092: clean up script quoting Derrick Stolee via GitGitGadget
@ 2021-03-17  8:47       ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 203+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-17  8:47 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee


On Tue, Mar 16 2021, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> This test was introduced in 19a0acc83e4 (t1092: test interesting
> sparse-checkout scenarios, 2021-01-23), but these issues with quoting
> were not noticed until starting this follow-up series. The old mechanism
> would drop quoting such as in

the "but these issues" follows a partial sentence where we haven't
introduces "what issues?".

Perhaps leading with some summary about $@ v.s. $*:

    Fix a bug in the sparse checkout tests of "$@" being conflated with
    "$*". The bug was introduced in 19a0acc83e4 ([...]), but had no
    effect until now because XYZ ...


>    test_all_match git commit -m "touch README.md"
>
> The above happened to work because README.md is a file in the
> repository, so 'git commit -m touch REAMDE.md' would succeed by
> accident.
>
> Other cases included quoting for no good reason, so clean that up now.

Maybe just my taste, per your comment on another series of mine we might
not have the same sense of splitting up commits, but...

I think in this case it's clearer to have these be two commits. We have
3 hunks fixing the bug, and 6 on an unrelated cleanup. It's a lot easier
for eyeballing a fix to be able to glance just at the 3, especially with
something like $@ v.s. $*.

> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  t/t1092-sparse-checkout-compatibility.sh | 20 ++++++++++----------
>  1 file changed, 10 insertions(+), 10 deletions(-)
>
> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
> index 8cd3e5a8d227..3725d3997e70 100755
> --- a/t/t1092-sparse-checkout-compatibility.sh
> +++ b/t/t1092-sparse-checkout-compatibility.sh
> @@ -96,20 +96,20 @@ init_repos () {
>  run_on_sparse () {
>  	(
>  		cd sparse-checkout &&
> -		$* >../sparse-checkout-out 2>../sparse-checkout-err
> +		"$@" >../sparse-checkout-out 2>../sparse-checkout-err
>  	)
>  }
>  
>  run_on_all () {
>  	(
>  		cd full-checkout &&
> -		$* >../full-checkout-out 2>../full-checkout-err
> +		"$@" >../full-checkout-out 2>../full-checkout-err
>  	) &&
> -	run_on_sparse $*
> +	run_on_sparse "$@"
>  }
>  
>  test_all_match () {
> -	run_on_all $* &&
> +	run_on_all "$@" &&
>  	test_cmp full-checkout-out sparse-checkout-out &&
>  	test_cmp full-checkout-err sparse-checkout-err
>  }
> @@ -119,7 +119,7 @@ test_expect_success 'status with options' '
>  	test_all_match git status --porcelain=v2 &&
>  	test_all_match git status --porcelain=v2 -z -u &&
>  	test_all_match git status --porcelain=v2 -uno &&
> -	run_on_all "touch README.md" &&
> +	run_on_all touch README.md &&
>  	test_all_match git status --porcelain=v2 &&
>  	test_all_match git status --porcelain=v2 -z -u &&
>  	test_all_match git status --porcelain=v2 -uno &&
> @@ -135,7 +135,7 @@ test_expect_success 'add, commit, checkout' '
>  	write_script edit-contents <<-\EOF &&
>  	echo text >>$1
>  	EOF
> -	run_on_all "../edit-contents README.md" &&
> +	run_on_all ../edit-contents README.md &&
>  
>  	test_all_match git add README.md &&
>  	test_all_match git status --porcelain=v2 &&
> @@ -144,7 +144,7 @@ test_expect_success 'add, commit, checkout' '
>  	test_all_match git checkout HEAD~1 &&
>  	test_all_match git checkout - &&
>  
> -	run_on_all "../edit-contents README.md" &&
> +	run_on_all ../edit-contents README.md &&
>  
>  	test_all_match git add -A &&
>  	test_all_match git status --porcelain=v2 &&
> @@ -153,7 +153,7 @@ test_expect_success 'add, commit, checkout' '
>  	test_all_match git checkout HEAD~1 &&
>  	test_all_match git checkout - &&
>  
> -	run_on_all "../edit-contents deep/newfile" &&
> +	run_on_all ../edit-contents deep/newfile &&
>  
>  	test_all_match git status --porcelain=v2 -uno &&
>  	test_all_match git status --porcelain=v2 &&
> @@ -186,7 +186,7 @@ test_expect_success 'diff --staged' '
>  	write_script edit-contents <<-\EOF &&
>  	echo text >>README.md
>  	EOF
> -	run_on_all "../edit-contents" &&
> +	run_on_all ../edit-contents &&
>  
>  	test_all_match git diff &&
>  	test_all_match git diff --staged &&
> @@ -280,7 +280,7 @@ test_expect_success 'clean' '
>  	echo bogus >>.gitignore &&
>  	run_on_all cp ../.gitignore . &&
>  	test_all_match git add .gitignore &&
> -	test_all_match git commit -m ignore-bogus-files &&
> +	test_all_match git commit -m "ignore bogus files" &&
>  
>  	run_on_sparse mkdir folder1 &&
>  	run_on_all touch folder1/bogus &&


^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v3 05/20] sparse-index: implement ensure_full_index()
  2021-03-16 16:42     ` [PATCH v3 05/20] sparse-index: implement ensure_full_index() Derrick Stolee via GitGitGadget
@ 2021-03-17 13:03       ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 203+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-17 13:03 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee


On Tue, Mar 16 2021, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <dstolee@microsoft.com>
> [...]
> +static int add_path_to_index(const struct object_id *oid,
> +			     struct strbuf *base, const char *path,
> +			     unsigned int mode, void *context)
> +{
> +	struct index_state *istate = (struct index_state *)context;
> +	struct cache_entry *ce;
> +	size_t len = base->len;
> +
> +	if (S_ISDIR(mode))
> +		return READ_TREE_RECURSIVE;
> +
> +	strbuf_addstr(base, path);
> +
> +	ce = make_cache_entry(istate, mode, oid, base->buf, 0, 0);
> +	ce->ce_flags |= CE_SKIP_WORKTREE;
> +	set_index_entry(istate, istate->cache_nr++, ce);
> +
> +	strbuf_setlen(base, len);
> +	return 0;
> +}
>  
>  void ensure_full_index(struct index_state *istate)
>  {
> -	/* intentionally left blank */
> +	int i;
> +	struct index_state *full;
> +	struct strbuf base = STRBUF_INIT;
> +
> +	if (!istate || !istate->sparse_index)
> +		return;
> +
> +	if (!istate->repo)
> +		istate->repo = the_repository;
> +
> +	trace2_region_enter("index", "ensure_full_index", istate->repo);
> +
> +	/* initialize basics of new index */
> +	full = xcalloc(1, sizeof(struct index_state));
> +	memcpy(full, istate, sizeof(struct index_state));
> +
> +	/* then change the necessary things */
> +	full->sparse_index = 0;
> +	full->cache_alloc = (3 * istate->cache_alloc) / 2;
> +	full->cache_nr = 0;
> +	ALLOC_ARRAY(full->cache, full->cache_alloc);
> +
> +	for (i = 0; i < istate->cache_nr; i++) {
> +		struct cache_entry *ce = istate->cache[i];
> +		struct tree *tree;
> +		struct pathspec ps;
> +
> +		if (!S_ISSPARSEDIR(ce->ce_mode)) {
> +			set_index_entry(full, full->cache_nr++, ce);
> +			continue;
> +		}
> +		if (!(ce->ce_flags & CE_SKIP_WORKTREE))
> +			warning(_("index entry is a directory, but not sparse (%08x)"),
> +				ce->ce_flags);
> +
> +		/* recursively walk into cd->name */
> +		tree = lookup_tree(istate->repo, &ce->oid);
> +
> +		memset(&ps, 0, sizeof(ps));
> +		ps.recursive = 1;
> +		ps.has_wildcard = 1;
> +		ps.max_depth = -1;
> +
> +		strbuf_setlen(&base, 0);
> +		strbuf_add(&base, ce->name, strlen(ce->name));
> +
> +		read_tree_at(istate->repo, tree, &base, &ps,
> +			     add_path_to_index, full);
> +
> +		/* free directory entries. full entries are re-used */
> +		discard_cache_entry(ce);
> +	}
> +
> +	/* Copy back into original index. */
> +	memcpy(&istate->name_hash, &full->name_hash, sizeof(full->name_hash));
> +	istate->sparse_index = 0;
> +	free(istate->cache);
> +	istate->cache = full->cache;
> +	istate->cache_nr = full->cache_nr;
> +	istate->cache_alloc = full->cache_alloc;
> +
> +	strbuf_release(&base);
> +	free(full);
> +
> +	trace2_region_leave("index", "ensure_full_index", istate->repo);
>  }

Not that I mind having added the read_tree_at() again, but just thinking
aloud here.

So we need this loop here because there's nothing like a read_tree_at()
that knows how to start at the non-tree root of the index, and then for
each directory there we're going to perform the equivalent of a
read_tree() there, but we need to set the base for add_path_to_index()
since we started at subdirs, not the root.

That's fine, but grepping around a bit I wonder if we shouldn't
eventually have some slightly fancier API that just works like
read_tree() but takes an optional "start at the index's root" instead.

Well, things that want that usually care about the index-specific bits,
whereas this "I just care about the tree for these" is more of a special
case I guess.

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v3 02/20] t/perf: add performance test for sparse operations
  2021-03-17  8:41       ` Ævar Arnfjörð Bjarmason
@ 2021-03-17 13:05         ` Derrick Stolee
  2021-03-17 13:21           ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee @ 2021-03-17 13:05 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, Derrick Stolee via GitGitGadget
  Cc: git, newren, gitster, pclouds, jrnieder, Martin Ågren,
	SZEDER Gábor, Derrick Stolee, Derrick Stolee

On 3/17/2021 4:41 AM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Tue, Mar 16 2021, Derrick Stolee via GitGitGadget wrote:
>> +test_expect_success 'setup repo and indexes' '
>> +	git reset --hard HEAD &&
>> +	# Remove submodules from the example repo, because our
>> +	# duplication of the entire repo creates an unlikly data shape.
>> +	git config --file .gitmodules --get-regexp "submodule.*.path" >modules &&
>> +	git rm -f .gitmodules &&
>> +	for module in $(awk "{print \$2}" modules)
>> +	do
>> +		git rm $module || return 1
>> +	done &&
>> +	git commit -m "remove submodules" &&
> 
> Paradoxically with this you can no longer use a repo that's not git.git
> or another repo that has submodules, since we'll die in trying to remove
> them.

Good point.

> Also you don't have to "git rm .gitmodules", the "git rm" command
> removes submodule entries.

Sure.

> Perhaps just:
> 
>     for module in $(git ls-files --stage | grep ^160000 | awk -F '\t' '{ print $2 }')
>     do
>         git rm "$module"
>     done
> 
> Or another way of guarding against rm getting the empty list && commit?
> 
> But it seems odd to be doing this at all, the point of the perf
> framework is that you can point it at any repo, and some repos you want
> to test will have submodules.

You're right that it should handle all repos. However, the point of
the test is to have many copies of the repo, but most of them are
excluded by sparse-directory entries. We don't collapse sparse-directory
entries if there is a submodule inside, so the data shape is wrong after
making all the copies.

So, I disagree with your approach in your suggested diff, and instead
offer this one. I've tested this with git.git and another local repo
without submodules and checked that everything works as expected.

diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
index e527316e66d..5c0d78eeeea 100755
--- a/t/perf/p2000-sparse-operations.sh
+++ b/t/perf/p2000-sparse-operations.sh
@@ -10,15 +10,17 @@ SPARSE_CONE=f2/f4/f1
 
 test_expect_success 'setup repo and indexes' '
 	git reset --hard HEAD &&
+
 	# Remove submodules from the example repo, because our
-	# duplication of the entire repo creates an unlikly data shape.
-	git config --file .gitmodules --get-regexp "submodule.*.path" >modules &&
-	git rm -f .gitmodules &&
-	for module in $(awk "{print \$2}" modules)
-	do
-		git rm $module || return 1
-	done &&
-	git commit -m "remove submodules" &&
+	# duplication of the entire repo creates an unlikely data shape.
+	if (git config --file .gitmodules --get-regexp "submodule.*.path" >modules)
+	then
+		for module in $(awk "{print \$2}" modules)
+		do
+			git rm $module || return 1
+		done &&
+		git commit -m "remove submodules" || return 1
+	fi &&
 
 	echo bogus >a &&
 	cp a b &&

> Seems like something like the WIP patch at the end on top would be
> better.
> 
>> +	echo bogus >a &&
>> +	cp a b &&
>> +	git add a b &&
>> +	git commit -m "level 0" &&
>> +	BLOB=$(git rev-parse HEAD:a) &&
> 
> Isn't the way we're getting this $BLOB equivalent to just 'echo bogus |
> git hash-object --stdin -w' why commit it?

We are committing it so we can add commits that deepen the copies,
but within those copies we have these known file paths.

> This whole thing makes me think you just wanted a test_perf_fresh_repo
> all along, but I think this would be much more useful if you took the
> default repo and multiplied the size in its tree by some multiple.
> 
> E.g. take the files we have in git.git, write a copy at prefix-1/,
> prefix-2/ etc.

That is essentially what is happening here, but using multiple levels
of directories. Using these multiple levels presents extra tree
lookups and parsing in the event of expanding a sparse index to a
full one.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v3 02/20] t/perf: add performance test for sparse operations
  2021-03-17 13:05         ` Derrick Stolee
@ 2021-03-17 13:21           ` Ævar Arnfjörð Bjarmason
  2021-03-17 18:02             ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-17 13:21 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, newren, gitster, pclouds,
	jrnieder, Martin Ågren, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee


On Wed, Mar 17 2021, Derrick Stolee wrote:

> On 3/17/2021 4:41 AM, Ævar Arnfjörð Bjarmason wrote:
>> 
>> On Tue, Mar 16 2021, Derrick Stolee via GitGitGadget wrote:
>>> +test_expect_success 'setup repo and indexes' '
>>> +	git reset --hard HEAD &&
>>> +	# Remove submodules from the example repo, because our
>>> +	# duplication of the entire repo creates an unlikly data shape.
>>> +	git config --file .gitmodules --get-regexp "submodule.*.path" >modules &&
>>> +	git rm -f .gitmodules &&
>>> +	for module in $(awk "{print \$2}" modules)
>>> +	do
>>> +		git rm $module || return 1
>>> +	done &&
>>> +	git commit -m "remove submodules" &&
>> 
>> Paradoxically with this you can no longer use a repo that's not git.git
>> or another repo that has submodules, since we'll die in trying to remove
>> them.
>
> Good point.
>
>> Also you don't have to "git rm .gitmodules", the "git rm" command
>> removes submodule entries.
>
> Sure.
>
>> Perhaps just:
>> 
>>     for module in $(git ls-files --stage | grep ^160000 | awk -F '\t' '{ print $2 }')
>>     do
>>         git rm "$module"
>>     done
>> 
>> Or another way of guarding against rm getting the empty list && commit?
>> 
>> But it seems odd to be doing this at all, the point of the perf
>> framework is that you can point it at any repo, and some repos you want
>> to test will have submodules.
>
> You're right that it should handle all repos. However, the point of
> the test is to have many copies of the repo, but most of them are
> excluded by sparse-directory entries. We don't collapse sparse-directory
> entries if there is a submodule inside, so the data shape is wrong after
> making all the copies.
>
> So, I disagree with your approach in your suggested diff, and instead
> offer this one. I've tested this with git.git and another local repo
> without submodules and checked that everything works as expected.

What's got me confused here is that there's two uses for the perf
framework in this context.

It's to use an empty/git.git as a test repo to demonstrate something,
but then also that you can run it in your arbitrary repo, and e.g. see
how much a given feature might benefit you.

Hence suggesting that maybe test_perf_fresh_repois better here, because
by using test_perf_default_repo you're creating the expectation that you
can run the perf test, observe an %X difference, and that'll be
give-or-take what you'll get for that use case if you enable the feature.

Except it won't because the repo has submodules, which we deleted for
the perf test...

> diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
> index e527316e66d..5c0d78eeeea 100755
> --- a/t/perf/p2000-sparse-operations.sh
> +++ b/t/perf/p2000-sparse-operations.sh
> @@ -10,15 +10,17 @@ SPARSE_CONE=f2/f4/f1
>  
>  test_expect_success 'setup repo and indexes' '
>  	git reset --hard HEAD &&
> +
>  	# Remove submodules from the example repo, because our
> -	# duplication of the entire repo creates an unlikly data shape.
> -	git config --file .gitmodules --get-regexp "submodule.*.path" >modules &&
> -	git rm -f .gitmodules &&
> -	for module in $(awk "{print \$2}" modules)
> -	do
> -		git rm $module || return 1
> -	done &&
> -	git commit -m "remove submodules" &&
> +	# duplication of the entire repo creates an unlikely data shape.
> +	if (git config --file .gitmodules --get-regexp "submodule.*.path" >modules)

A subshell isn't needed here.

FWIW the reason I got this out of ls-files is because you can have
submodules without .gitmodules entries, rare and broken, but seemed more
direct to grep the mode bits.

> +	then
> +		for module in $(awk "{print \$2}" modules)
> +		do
> +			git rm $module || return 1
> +		done &&

Once we know we have submodules we can just do this without the loop.

    git rm $(awk "{print \$2}" modules)



> +		git commit -m "remove submodules" || return 1
> +	fi &&
>  
>  	echo bogus >a &&
>  	cp a b &&
>
>> Seems like something like the WIP patch at the end on top would be
>> better.
>> 
>>> +	echo bogus >a &&
>>> +	cp a b &&
>>> +	git add a b &&
>>> +	git commit -m "level 0" &&
>>> +	BLOB=$(git rev-parse HEAD:a) &&
>> 
>> Isn't the way we're getting this $BLOB equivalent to just 'echo bogus |
>> git hash-object --stdin -w' why commit it?
>
> We are committing it so we can add commits that deepen the copies,
> but within those copies we have these known file paths.
>
>> This whole thing makes me think you just wanted a test_perf_fresh_repo
>> all along, but I think this would be much more useful if you took the
>> default repo and multiplied the size in its tree by some multiple.
>> 
>> E.g. take the files we have in git.git, write a copy at prefix-1/,
>> prefix-2/ etc.
>
> That is essentially what is happening here, but using multiple levels
> of directories. Using these multiple levels presents extra tree
> lookups and parsing in the event of expanding a sparse index to a
> full one.

*nod*

Anyway, this thread's a bit of a bikeshed on my part, I was just
wondering if & what part of the test relied on the existing repo if it
was mostly setting up its own test data.

^ permalink raw reply	[flat|nested] 203+ messages in thread

* [RFC/PATCH 0/5] Re: [PATCH v3 07/20] test-read-cache: print cache entries with --table
  2021-03-16 16:42     ` [PATCH v3 07/20] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
@ 2021-03-17 13:28       ` Ævar Arnfjörð Bjarmason
  2021-03-17 18:28         ` Elijah Newren
  2021-03-17 13:28       ` [RFC/PATCH 1/5] ls-files: defer read_index() after parse_options() etc Ævar Arnfjörð Bjarmason
                         ` (4 subsequent siblings)
  5 siblings, 1 reply; 203+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-17 13:28 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee, dstolee

> From: Derrick Stolee <dstolee@microsoft.com>
>
> This table is helpful for discovering data in the index to ensure it is
> being written correctly, especially as we build and test the
> sparse-index. This table includes an output format similar to 'git
> ls-tree', but should not be compared to that directly. The biggest
> reasons are that 'git ls-tree' includes a tree entry for every
> subdirectory, even those that would not appear as a sparse directory in
> a sparse-index. Further, 'git ls-tree' does not use a trailing directory
> separator for its tree rows.
>
> This does not print the stat() information for the blobs. That could be
> added in a future change with another option. The tests that are added
> in the next few changes care only about the object types and IDs.
>
> To make the option parsing slightly more robust, wrap the string
> comparisons in a loop adapted from test-dir-iterator.c.
>
> Care must be taken with the final check for the 'cnt' variable. We
> continue the expectation that the numerical value is the final argument.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  t/helper/test-read-cache.c | 55 +++++++++++++++++++++++++++++++-------
>  1 file changed, 45 insertions(+), 10 deletions(-)
>
> diff --git a/t/helper/test-read-cache.c b/t/helper/test-read-cache.c
> index 244977a29bdf..6cfd8f2de71c 100644
> --- a/t/helper/test-read-cache.c
> +++ b/t/helper/test-read-cache.c
> @@ -1,36 +1,71 @@
>  #include "test-tool.h"
>  #include "cache.h"
>  #include "config.h"
> +#include "blob.h"
> +#include "commit.h"
> +#include "tree.h"
> +
> +static void print_cache_entry(struct cache_entry *ce)
> +{
> +	const char *type;
> +	printf("%06o ", ce->ce_mode & 0177777);
> +
> +	if (S_ISSPARSEDIR(ce->ce_mode))
> +		type = tree_type;
> +	else if (S_ISGITLINK(ce->ce_mode))
> +		type = commit_type;
> +	else
> +		type = blob_type;
> +
> +	printf("%s %s\t%s\n",
> +	       type,
> +	       oid_to_hex(&ce->oid),
> +	       ce->name);
> +}
> +

So we have a test tool that's mostly ls-files but mocks the output
ls-tree would emit, won't these tests eventually care about what stage
things are in?

What follows is an RFC series on top that's the result of me wondering
why if we're adding new index constructs we aren't updating our
plumbing to emit that data, can we just add this to ls-files and drop
this test helper?

Turns out: Yes we can.

Ævar Arnfjörð Bjarmason (5):
  ls-files: defer read_index() after parse_options() etc.
  ls-files: make "mode" in show_ce() loop a variable
  ls-files: add and use a new --sparse option
  test-tool read-cache: --table is redundant to ls-files
  test-tool: split up test-tool read-cache

 Documentation/git-ls-files.txt           |  4 ++
 Makefile                                 |  3 +-
 builtin/ls-files.c                       | 29 +++++++--
 t/helper/test-read-cache-again.c         | 31 +++++++++
 t/helper/test-read-cache-perf.c          | 21 ++++++
 t/helper/test-read-cache.c               | 82 ------------------------
 t/helper/test-tool.c                     |  3 +-
 t/helper/test-tool.h                     |  3 +-
 t/perf/p0002-read-cache.sh               |  2 +-
 t/t1091-sparse-checkout-builtin.sh       |  9 +--
 t/t1092-sparse-checkout-compatibility.sh | 57 ++++++++++------
 t/t7519-status-fsmonitor.sh              |  2 +-
 12 files changed, 131 insertions(+), 115 deletions(-)
 create mode 100644 t/helper/test-read-cache-again.c
 create mode 100644 t/helper/test-read-cache-perf.c
 delete mode 100644 t/helper/test-read-cache.c

-- 
2.31.0.260.g719c683c1d


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [RFC/PATCH 1/5] ls-files: defer read_index() after parse_options() etc.
  2021-03-16 16:42     ` [PATCH v3 07/20] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
  2021-03-17 13:28       ` [RFC/PATCH 0/5] " Ævar Arnfjörð Bjarmason
@ 2021-03-17 13:28       ` Ævar Arnfjörð Bjarmason
  2021-03-17 13:28       ` [RFC/PATCH 2/5] ls-files: make "mode" in show_ce() loop a variable Ævar Arnfjörð Bjarmason
                         ` (3 subsequent siblings)
  5 siblings, 0 replies; 203+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-17 13:28 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee, dstolee

Move the reading of the index below the parsing of options. We'll need
to setup some index options in the next commit after option parsing,
but in any case it makes sense to give parse_options() handling a
chance to die early before we perform the more expensive operation of
reading the index.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 builtin/ls-files.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/builtin/ls-files.c b/builtin/ls-files.c
index 13bcc2d847..eb72d16493 100644
--- a/builtin/ls-files.c
+++ b/builtin/ls-files.c
@@ -681,9 +681,6 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix)
 		prefix_len = strlen(prefix);
 	git_config(git_default_config, NULL);
 
-	if (repo_read_index(the_repository) < 0)
-		die("index file corrupt");
-
 	argc = parse_options(argc, argv, prefix, builtin_ls_files_options,
 			ls_files_usage, 0);
 	pl = add_pattern_list(&dir, EXC_CMDL, "--exclude option");
@@ -743,6 +740,12 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix)
 		max_prefix = common_prefix(&pathspec);
 	max_prefix_len = get_common_prefix_len(max_prefix);
 
+	/*
+	 * Read the index after parse options etc. have had a chance
+	 * to die early.
+	 */
+	if (repo_read_index(the_repository) < 0)
+		die("index file corrupt");
 	prune_index(the_repository->index, max_prefix, max_prefix_len);
 
 	/* Treat unmatching pathspec elements as errors */
-- 
2.31.0.260.g719c683c1d


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [RFC/PATCH 2/5] ls-files: make "mode" in show_ce() loop a variable
  2021-03-16 16:42     ` [PATCH v3 07/20] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
  2021-03-17 13:28       ` [RFC/PATCH 0/5] " Ævar Arnfjörð Bjarmason
  2021-03-17 13:28       ` [RFC/PATCH 1/5] ls-files: defer read_index() after parse_options() etc Ævar Arnfjörð Bjarmason
@ 2021-03-17 13:28       ` Ævar Arnfjörð Bjarmason
  2021-03-17 18:11         ` Elijah Newren
  2021-03-17 13:28       ` [RFC/PATCH 3/5] ls-files: add and use a new --sparse option Ævar Arnfjörð Bjarmason
                         ` (2 subsequent siblings)
  5 siblings, 1 reply; 203+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-17 13:28 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee, dstolee

In a subsequent commit I'll optionally change the mode in a new sparse
mode, let's do this first to make that change smaller.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 builtin/ls-files.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/builtin/ls-files.c b/builtin/ls-files.c
index eb72d16493..4db75351f2 100644
--- a/builtin/ls-files.c
+++ b/builtin/ls-files.c
@@ -242,9 +242,17 @@ static void show_ce(struct repository *repo, struct dir_struct *dir,
 		if (!show_stage) {
 			fputs(tag, stdout);
 		} else {
+			unsigned int mode = ce->ce_mode;
+			if (show_sparse && S_ISSPARSEDIR(mode))
+				/*
+				 * We could just do & 0177777 all the
+				 * time, just make it clear this is
+				 * for --stage-sparse.
+				 */
+				mode &= 0177777;
 			printf("%s%06o %s %d\t",
 			       tag,
-			       ce->ce_mode,
+			       mode,
 			       find_unique_abbrev(&ce->oid, abbrev),
 			       ce_stage(ce));
 		}
-- 
2.31.0.260.g719c683c1d


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [RFC/PATCH 3/5] ls-files: add and use a new --sparse option
  2021-03-16 16:42     ` [PATCH v3 07/20] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
                         ` (2 preceding siblings ...)
  2021-03-17 13:28       ` [RFC/PATCH 2/5] ls-files: make "mode" in show_ce() loop a variable Ævar Arnfjörð Bjarmason
@ 2021-03-17 13:28       ` Ævar Arnfjörð Bjarmason
  2021-03-17 18:19         ` Elijah Newren
  2021-03-17 20:43         ` Derrick Stolee
  2021-03-17 13:28       ` [RFC/PATCH 4/5] test-tool read-cache: --table is redundant to ls-files Ævar Arnfjörð Bjarmason
  2021-03-17 13:28       ` [RFC/PATCH 5/5] test-tool: split up test-tool read-cache Ævar Arnfjörð Bjarmason
  5 siblings, 2 replies; 203+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-17 13:28 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee, dstolee

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Documentation/git-ls-files.txt           |  4 ++
 builtin/ls-files.c                       | 10 ++++-
 t/t1091-sparse-checkout-builtin.sh       |  9 ++--
 t/t1092-sparse-checkout-compatibility.sh | 57 ++++++++++++++++--------
 4 files changed, 56 insertions(+), 24 deletions(-)

diff --git a/Documentation/git-ls-files.txt b/Documentation/git-ls-files.txt
index 6d11ab506b..1145e960a4 100644
--- a/Documentation/git-ls-files.txt
+++ b/Documentation/git-ls-files.txt
@@ -71,6 +71,10 @@ OPTIONS
 --unmerged::
 	Show unmerged files in the output (forces --stage)
 
+--sparse::
+	Show sparse directories in the output instead of expanding
+	them (forces --stage)
+
 -k::
 --killed::
 	Show files on the filesystem that need to be removed due
diff --git a/builtin/ls-files.c b/builtin/ls-files.c
index 4db75351f2..1ebbb63c10 100644
--- a/builtin/ls-files.c
+++ b/builtin/ls-files.c
@@ -26,6 +26,7 @@ static int show_deleted;
 static int show_cached;
 static int show_others;
 static int show_stage;
+static int show_sparse;
 static int show_unmerged;
 static int show_resolve_undo;
 static int show_modified;
@@ -639,6 +640,8 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix)
 			DIR_SHOW_IGNORED),
 		OPT_BOOL('s', "stage", &show_stage,
 			N_("show staged contents' object name in the output")),
+		OPT_BOOL(0, "sparse", &show_sparse,
+			N_("show unexpanded sparse directories in the output")),
 		OPT_BOOL('k', "killed", &show_killed,
 			N_("show files on the filesystem that need to be removed")),
 		OPT_BIT(0, "directory", &dir.flags,
@@ -705,12 +708,17 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix)
 		tag_skip_worktree = "S ";
 		tag_resolve_undo = "U ";
 	}
+	if (show_sparse) {
+		prepare_repo_settings(the_repository);
+		the_repository->settings.command_requires_full_index = 0;
+	}
 	if (show_modified || show_others || show_deleted || (dir.flags & DIR_SHOW_IGNORED) || show_killed)
 		require_work_tree = 1;
-	if (show_unmerged)
+	if (show_unmerged || show_sparse)
 		/*
 		 * There's no point in showing unmerged unless
 		 * you also show the stage information.
+		 * The same goes for the --sparse option.
 		 */
 		show_stage = 1;
 	if (show_tag || show_stage)
diff --git a/t/t1091-sparse-checkout-builtin.sh b/t/t1091-sparse-checkout-builtin.sh
index ff1ad570a2..c823df423c 100755
--- a/t/t1091-sparse-checkout-builtin.sh
+++ b/t/t1091-sparse-checkout-builtin.sh
@@ -208,12 +208,13 @@ test_expect_success 'sparse-checkout disable' '
 test_expect_success 'sparse-index enabled and disabled' '
 	git -C repo sparse-checkout init --cone --sparse-index &&
 	test_cmp_config -C repo true extensions.sparseIndex &&
-	test-tool -C repo read-cache --table >cache &&
-	grep " tree " cache &&
+	git -C repo ls-files --sparse >cache &&
+	grep "^040000 " cache >lines &&
+	test_line_count = 3 lines &&
 
 	git -C repo sparse-checkout disable &&
-	test-tool -C repo read-cache --table >cache &&
-	! grep " tree " cache &&
+	git -C repo ls-files --sparse >cache &&
+	! grep "^040000 " cache &&
 	git -C repo config --list >config &&
 	! grep extensions.sparseindex config
 '
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index d97bf9b645..48d3920490 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -136,48 +136,67 @@ test_sparse_match () {
 	test_cmp sparse-checkout-err sparse-index-err
 }
 
+test_index_entry_like () {
+	dir=$1
+	shift
+	fmt=$1
+	shift
+	rev=$1
+	shift
+	entry=$1
+	shift
+	file=$1
+	shift
+	hash=$(git -C "$dir" rev-parse "$rev") &&
+	printf "$fmt\n" "$hash" "$entry" >expected &&
+	if grep "$entry" "$file" >line
+	then
+		test_cmp expected line
+	else
+		cat cache &&
+		false
+	fi
+}
+
 test_expect_success 'sparse-index contents' '
 	init_repos &&
 
-	test-tool -C sparse-index read-cache --table >cache &&
+	git -C sparse-index ls-files --sparse >cache &&
 	for dir in folder1 folder2 x
 	do
-		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
-		grep "040000 tree $TREE	$dir/" cache \
-			|| return 1
+		test_index_entry_like sparse-index "040000 %s 0\t%s" "HEAD:$dir" "$dir/" cache || return 1
 	done &&
 
 	git -C sparse-index sparse-checkout set folder1 &&
 
-	test-tool -C sparse-index read-cache --table >cache &&
+	git -C sparse-index ls-files --sparse >cache &&
 	for dir in deep folder2 x
 	do
-		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
-		grep "040000 tree $TREE	$dir/" cache \
-			|| return 1
+		test_index_entry_like sparse-index "040000 %s 0\t%s" "HEAD:$dir" "$dir/" cache || return 1
 	done &&
 
 	git -C sparse-index sparse-checkout set deep/deeper1 &&
 
-	test-tool -C sparse-index read-cache --table >cache &&
+	git -C sparse-index ls-files --sparse >cache &&
 	for dir in deep/deeper2 folder1 folder2 x
 	do
-		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
-		grep "040000 tree $TREE	$dir/" cache \
-			|| return 1
+		test_index_entry_like sparse-index "040000 %s 0\t%s" "HEAD:$dir" "$dir/" cache || return 1
 	done &&
 
+	grep 040000 cache >lines &&
+	test_line_count = 4 lines &&
+
 	# Disabling the sparse-index removes tree entries with full ones
 	git -C sparse-index sparse-checkout init --no-sparse-index &&
 
-	test-tool -C sparse-index read-cache --table >cache &&
-	! grep "040000 tree" cache &&
-	test_sparse_match test-tool read-cache --table
+	git -C sparse-index ls-files --sparse >cache &&
+	! grep "^040000 " cache >lines &&
+	test_sparse_match git ls-tree -r HEAD
 '
 
 test_expect_success 'expanded in-memory index matches full index' '
 	init_repos &&
-	test_sparse_match test-tool read-cache --expand --table
+	test_sparse_match git ls-tree -r HEAD
 '
 
 test_expect_success 'status with options' '
@@ -394,9 +413,9 @@ test_expect_success 'submodule handling' '
 	test_all_match git commit -m "add submodule" &&
 
 	# having a submodule prevents "modules" from collapse
-	test-tool -C sparse-index read-cache --table >cache &&
-	grep "100644 blob .*	modules/a" cache &&
-	grep "160000 commit $(git -C initial-repo rev-parse HEAD)	modules/sub" cache
+	git -C sparse-index ls-files --sparse >cache &&
+	test_index_entry_like sparse-index "100644 %s 0\t%s" "HEAD:modules/a" "modules/a" cache &&
+	test_index_entry_like sparse-index "160000 %s 0\t%s" "HEAD:modules/sub" "modules/sub" cache
 '
 
 test_expect_success 'sparse-index is expanded and converted back' '
-- 
2.31.0.260.g719c683c1d


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [RFC/PATCH 4/5] test-tool read-cache: --table is redundant to ls-files
  2021-03-16 16:42     ` [PATCH v3 07/20] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
                         ` (3 preceding siblings ...)
  2021-03-17 13:28       ` [RFC/PATCH 3/5] ls-files: add and use a new --sparse option Ævar Arnfjörð Bjarmason
@ 2021-03-17 13:28       ` Ævar Arnfjörð Bjarmason
  2021-03-17 13:28       ` [RFC/PATCH 5/5] test-tool: split up test-tool read-cache Ævar Arnfjörð Bjarmason
  5 siblings, 0 replies; 203+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-17 13:28 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee, dstolee

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 t/helper/test-read-cache.c | 43 --------------------------------------
 1 file changed, 43 deletions(-)

diff --git a/t/helper/test-read-cache.c b/t/helper/test-read-cache.c
index b52c174acc..2499999af3 100644
--- a/t/helper/test-read-cache.c
+++ b/t/helper/test-read-cache.c
@@ -1,54 +1,16 @@
 #include "test-tool.h"
 #include "cache.h"
 #include "config.h"
-#include "blob.h"
-#include "commit.h"
-#include "tree.h"
-#include "sparse-index.h"
-
-static void print_cache_entry(struct cache_entry *ce)
-{
-	const char *type;
-	printf("%06o ", ce->ce_mode & 0177777);
-
-	if (S_ISSPARSEDIR(ce->ce_mode))
-		type = tree_type;
-	else if (S_ISGITLINK(ce->ce_mode))
-		type = commit_type;
-	else
-		type = blob_type;
-
-	printf("%s %s\t%s\n",
-	       type,
-	       oid_to_hex(&ce->oid),
-	       ce->name);
-}
-
-static void print_cache(struct index_state *istate)
-{
-	int i;
-	for (i = 0; i < istate->cache_nr; i++)
-		print_cache_entry(istate->cache[i]);
-}
 
 int cmd__read_cache(int argc, const char **argv)
 {
 	struct repository *r = the_repository;
 	int i, cnt = 1;
 	const char *name = NULL;
-	int table = 0, expand = 0;
-
-	initialize_the_repository();
-	prepare_repo_settings(r);
-	r->settings.command_requires_full_index = 0;
 
 	for (++argv, --argc; *argv && starts_with(*argv, "--"); ++argv, --argc) {
 		if (skip_prefix(*argv, "--print-and-refresh=", &name))
 			continue;
-		if (!strcmp(*argv, "--table"))
-			table = 1;
-		else if (!strcmp(*argv, "--expand"))
-			expand = 1;
 	}
 
 	if (argc == 1)
@@ -59,9 +21,6 @@ int cmd__read_cache(int argc, const char **argv)
 	for (i = 0; i < cnt; i++) {
 		repo_read_index(r);
 
-		if (expand)
-			ensure_full_index(r->index);
-
 		if (name) {
 			int pos;
 
@@ -74,8 +33,6 @@ int cmd__read_cache(int argc, const char **argv)
 			       ce_uptodate(r->index->cache[pos]) ? "" : " not");
 			write_file(name, "%d\n", i);
 		}
-		if (table)
-			print_cache(r->index);
 		discard_index(r->index);
 	}
 	return 0;
-- 
2.31.0.260.g719c683c1d


^ permalink raw reply	[flat|nested] 203+ messages in thread

* [RFC/PATCH 5/5] test-tool: split up test-tool read-cache
  2021-03-16 16:42     ` [PATCH v3 07/20] test-read-cache: print cache entries with --table Derrick Stolee via GitGitGadget
                         ` (4 preceding siblings ...)
  2021-03-17 13:28       ` [RFC/PATCH 4/5] test-tool read-cache: --table is redundant to ls-files Ævar Arnfjörð Bjarmason
@ 2021-03-17 13:28       ` Ævar Arnfjörð Bjarmason
  2021-03-17 13:32         ` Ævar Arnfjörð Bjarmason
  5 siblings, 1 reply; 203+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-17 13:28 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee, dstolee

Since the "test-tool read-cache" was originally added back in
1ecb5ff141 (read-cache: add simple performance test, 2013-06-09) it's
been growing all sorts of bells and whistles that aren't very
conducive to performance testing the index, e.g. it learned how to
read config.

Let's split what remains of the "test-tool read-cache" into the two
narrow use-cases it's used for.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Makefile                         |  3 ++-
 t/helper/test-read-cache-again.c | 31 +++++++++++++++++++++++++
 t/helper/test-read-cache-perf.c  | 21 +++++++++++++++++
 t/helper/test-read-cache.c       | 39 --------------------------------
 t/helper/test-tool.c             |  3 ++-
 t/helper/test-tool.h             |  3 ++-
 t/perf/p0002-read-cache.sh       |  2 +-
 t/t7519-status-fsmonitor.sh      |  2 +-
 8 files changed, 60 insertions(+), 44 deletions(-)
 create mode 100644 t/helper/test-read-cache-again.c
 create mode 100644 t/helper/test-read-cache-perf.c
 delete mode 100644 t/helper/test-read-cache.c

diff --git a/Makefile b/Makefile
index 89b1d53741..a1bbb818d9 100644
--- a/Makefile
+++ b/Makefile
@@ -724,7 +724,8 @@ TEST_BUILTINS_OBJS += test-prio-queue.o
 TEST_BUILTINS_OBJS += test-proc-receive.o
 TEST_BUILTINS_OBJS += test-progress.o
 TEST_BUILTINS_OBJS += test-reach.o
-TEST_BUILTINS_OBJS += test-read-cache.o
+TEST_BUILTINS_OBJS += test-read-cache-again.o
+TEST_BUILTINS_OBJS += test-read-cache-perf.o
 TEST_BUILTINS_OBJS += test-read-graph.o
 TEST_BUILTINS_OBJS += test-read-midx.o
 TEST_BUILTINS_OBJS += test-ref-store.o
diff --git a/t/helper/test-read-cache-again.c b/t/helper/test-read-cache-again.c
new file mode 100644
index 0000000000..5e20ca1c8f
--- /dev/null
+++ b/t/helper/test-read-cache-again.c
@@ -0,0 +1,31 @@
+#include "test-tool.h"
+#include "cache.h"
+
+int cmd__read_cache_again(int argc, const char **argv)
+{
+	struct repository *r = the_repository;
+	int cnt;
+	const char *name;
+
+	if (argc != 2)
+		die("usage: test-tool read-cache-again <count> <file>");
+
+	cnt = strtol(argv[0], NULL, 0);
+	name = argv[2];
+
+	setup_git_directory();
+	while (cnt--) {
+		int pos;
+		repo_read_index(r);
+		refresh_index(r->index, REFRESH_QUIET,
+			      NULL, NULL, NULL);
+		pos = index_name_pos(r->index, name, strlen(name));
+		if (pos < 0)
+			die("%s not in index", name);
+		printf("%s is%s up to date\n", name,
+		       ce_uptodate(r->index->cache[pos]) ? "" : " not");
+		write_file(name, "%d\n", cnt);
+		discard_index(r->index);
+	}
+	return 0;
+}
diff --git a/t/helper/test-read-cache-perf.c b/t/helper/test-read-cache-perf.c
new file mode 100644
index 0000000000..ac9c297efa
--- /dev/null
+++ b/t/helper/test-read-cache-perf.c
@@ -0,0 +1,21 @@
+#include "test-tool.h"
+#include "cache.h"
+
+int cmd__read_cache_perf(int argc, const char **argv)
+{
+	struct repository *r = the_repository;
+	int cnt = 1000;
+
+	if (argc == 1)
+		cnt = strtol(argv[0], NULL, 0);
+	else if (argc)
+		die("usage: test-tool read-cache-perf [<count>]");
+
+	setup_git_directory();
+	while (cnt--) {
+		repo_read_index(r);
+		discard_index(r->index);
+	}
+
+	return 0;
+}
diff --git a/t/helper/test-read-cache.c b/t/helper/test-read-cache.c
deleted file mode 100644
index 2499999af3..0000000000
--- a/t/helper/test-read-cache.c
+++ /dev/null
@@ -1,39 +0,0 @@
-#include "test-tool.h"
-#include "cache.h"
-#include "config.h"
-
-int cmd__read_cache(int argc, const char **argv)
-{
-	struct repository *r = the_repository;
-	int i, cnt = 1;
-	const char *name = NULL;
-
-	for (++argv, --argc; *argv && starts_with(*argv, "--"); ++argv, --argc) {
-		if (skip_prefix(*argv, "--print-and-refresh=", &name))
-			continue;
-	}
-
-	if (argc == 1)
-		cnt = strtol(argv[0], NULL, 0);
-	setup_git_directory();
-	git_config(git_default_config, NULL);
-
-	for (i = 0; i < cnt; i++) {
-		repo_read_index(r);
-
-		if (name) {
-			int pos;
-
-			refresh_index(r->index, REFRESH_QUIET,
-				      NULL, NULL, NULL);
-			pos = index_name_pos(r->index, name, strlen(name));
-			if (pos < 0)
-				die("%s not in index", name);
-			printf("%s is%s up to date\n", name,
-			       ce_uptodate(r->index->cache[pos]) ? "" : " not");
-			write_file(name, "%d\n", i);
-		}
-		discard_index(r->index);
-	}
-	return 0;
-}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index f97cd9f48a..1334fa25ba 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -52,7 +52,8 @@ static struct test_cmd cmds[] = {
 	{ "proc-receive", cmd__proc_receive},
 	{ "progress", cmd__progress },
 	{ "reach", cmd__reach },
-	{ "read-cache", cmd__read_cache },
+	{ "read-cache-again", cmd__read_cache_again },
+	{ "read-cache-perf", cmd__read_cache_perf },
 	{ "read-graph", cmd__read_graph },
 	{ "read-midx", cmd__read_midx },
 	{ "ref-store", cmd__ref_store },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index 28072c0ad5..d70cde8574 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -41,7 +41,8 @@ int cmd__prio_queue(int argc, const char **argv);
 int cmd__proc_receive(int argc, const char **argv);
 int cmd__progress(int argc, const char **argv);
 int cmd__reach(int argc, const char **argv);
-int cmd__read_cache(int argc, const char **argv);
+int cmd__read_cache_again(int argc, const char **argv);
+int cmd__read_cache_perf(int argc, const char **argv);
 int cmd__read_graph(int argc, const char **argv);
 int cmd__read_midx(int argc, const char **argv);
 int cmd__ref_store(int argc, const char **argv);
diff --git a/t/perf/p0002-read-cache.sh b/t/perf/p0002-read-cache.sh
index cdd105a594..d0ba5173fb 100755
--- a/t/perf/p0002-read-cache.sh
+++ b/t/perf/p0002-read-cache.sh
@@ -8,7 +8,7 @@ test_perf_default_repo
 
 count=1000
 test_perf "read_cache/discard_cache $count times" "
-	test-tool read-cache $count
+	test-tool read-cache-perf $count
 "
 
 test_done
diff --git a/t/t7519-status-fsmonitor.sh b/t/t7519-status-fsmonitor.sh
index 45d025f960..3761a8781d 100755
--- a/t/t7519-status-fsmonitor.sh
+++ b/t/t7519-status-fsmonitor.sh
@@ -359,7 +359,7 @@ test_expect_success UNTRACKED_CACHE 'ignore .git changes when invalidating UNTR'
 test_expect_success 'discard_index() also discards fsmonitor info' '
 	test_config core.fsmonitor "$TEST_DIRECTORY/t7519/fsmonitor-all" &&
 	test_might_fail git update-index --refresh &&
-	test-tool read-cache --print-and-refresh=tracked 2 >actual &&
+	test-tool read-cache-again 2 tracked >actual &&
 	printf "tracked is%s up to date\n" "" " not" >expect &&
 	test_cmp expect actual
 '
-- 
2.31.0.260.g719c683c1d


^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [RFC/PATCH 5/5] test-tool: split up test-tool read-cache
  2021-03-17 13:28       ` [RFC/PATCH 5/5] test-tool: split up test-tool read-cache Ævar Arnfjörð Bjarmason
@ 2021-03-17 13:32         ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 203+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-17 13:32 UTC (permalink / raw)
  To: git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor,
	Ævar Arnfjörð Bjarmason, Derrick Stolee, dstolee


On Wed, Mar 17 2021, Ævar Arnfjörð Bjarmason wrote:

> +	if (argc != 2)
> +		die("usage: test-tool read-cache-again <count> <file>");
> +
> +	cnt = strtol(argv[0], NULL, 0);
> +	name = argv[2];

This is needed on top, the perils of sending out ad-hoc RFC patches from
the working tree..:

diff --git a/t/helper/test-read-cache-again.c b/t/helper/test-read-cache-again.c
index 5e20ca1c8f..aa97b3aaf3 100644
--- a/t/helper/test-read-cache-again.c
+++ b/t/helper/test-read-cache-again.c
@@ -7,10 +7,9 @@ int cmd__read_cache_again(int argc, const char **argv)
 	int cnt;
 	const char *name;
 
-	if (argc != 2)
+	if (argc != 3)
 		die("usage: test-tool read-cache-again <count> <file>");
-
-	cnt = strtol(argv[0], NULL, 0);
+	cnt = strtol(argv[1], NULL, 0);
 	name = argv[2];
 
 	setup_git_directory();

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v3 13/20] unpack-trees: allow sparse directories
  2021-03-16 16:42     ` [PATCH v3 13/20] unpack-trees: allow sparse directories Derrick Stolee via GitGitGadget
@ 2021-03-17 13:35       ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 203+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-17 13:35 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee


On Tue, Mar 16 2021, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> The index_pos_by_traverse_info() currently throws a BUG() when a
> directory entry exists exactly in the index. We need to consider that it
> is possible to have a directory in a sparse index as long as that entry
> is itself marked with the skip-worktree bit.
>
> The 'pos' variable is assigned a negative value if an exact match is not
> found. Since a directory name can be an exact match, it is no longer an
> error to have a nonnegative 'pos' value.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  unpack-trees.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/unpack-trees.c b/unpack-trees.c
> index 2da3e5ec77a1..e81d82d72d89 100644
> --- a/unpack-trees.c
> +++ b/unpack-trees.c
> @@ -749,9 +749,12 @@ static int index_pos_by_traverse_info(struct name_entry *names,
>  	strbuf_make_traverse_path(&name, info, names->path, names->pathlen);
>  	strbuf_addch(&name, '/');
>  	pos = index_name_pos(o->src_index, name.buf, name.len);
> -	if (pos >= 0)
> -		BUG("This is a directory and should not exist in index");
> -	pos = -pos - 1;
> +	if (pos >= 0) {
> +		if (!o->src_index->sparse_index ||
> +		    !(o->src_index->cache[pos]->ce_flags & CE_SKIP_WORKTREE))
> +			BUG("This is a directory and should not exist in index");
> +	} else
> +		pos = -pos - 1;

Style nit: add {}'s to the "else" once the "if" gets one.

>  	if (pos >= o->src_index->cache_nr ||
>  	    !starts_with(o->src_index->cache[pos]->name, name.buf) ||
>  	    (pos > 0 && starts_with(o->src_index->cache[pos-1]->name, name.buf)))


^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v3 11/20] sparse-index: convert from full to sparse
  2021-03-16 16:42     ` [PATCH v3 11/20] sparse-index: convert from full to sparse Derrick Stolee via GitGitGadget
@ 2021-03-17 13:43       ` Ævar Arnfjörð Bjarmason
  2021-03-17 19:55         ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-17 13:43 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, newren, gitster, pclouds, jrnieder, Martin Ågren,
	Derrick Stolee, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee


On Tue, Mar 16 2021, Derrick Stolee via GitGitGadget wrote:

> diff --git a/cache-tree.c b/cache-tree.c
> index 2fb483d3c083..5f07a39e501e 100644
> --- a/cache-tree.c
> +++ b/cache-tree.c
> @@ -6,6 +6,7 @@
>  #include "object-store.h"
>  #include "replace-object.h"
>  #include "promisor-remote.h"
> +#include "sparse-index.h"
>  
>  #ifndef DEBUG_CACHE_TREE
>  #define DEBUG_CACHE_TREE 0
> @@ -442,6 +443,8 @@ int cache_tree_update(struct index_state *istate, int flags)
>  	if (i)
>  		return i;
>  
> +	ensure_full_index(istate);
> +
>  	if (!istate->cache_tree)
>  		istate->cache_tree = cache_tree();
>  
> diff --git a/cache.h b/cache.h
> index 759ca92e2ecc..69a32146cd77 100644
> --- a/cache.h
> +++ b/cache.h
> @@ -251,6 +251,8 @@ static inline unsigned int create_ce_mode(unsigned int mode)
>  {
>  	if (S_ISLNK(mode))
>  		return S_IFLNK;
> +	if (mode == S_IFDIR)
> +		return S_IFDIR;

Does this actually need to be mode == S_IFDIR v.s. S_ISDIR(mode)? Those
aren't the same thing...

>  	if (S_ISDIR(mode) || S_ISGITLINK(mode))
>  		return S_IFGITLINK;

...and if it can be S_ISDIR(mode) then this becomes just
S_ISGITLINK(mode), but losing the "if" there makes me suspect that some
dir == submodule heuristic is being broken somewhere..


^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v3 02/20] t/perf: add performance test for sparse operations
  2021-03-17 13:21           ` Ævar Arnfjörð Bjarmason
@ 2021-03-17 18:02             ` Derrick Stolee
  0 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee @ 2021-03-17 18:02 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Derrick Stolee via GitGitGadget, git, newren, gitster, pclouds,
	jrnieder, Martin Ågren, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

On 3/17/2021 9:21 AM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Wed, Mar 17 2021, Derrick Stolee wrote:
> 
>> On 3/17/2021 4:41 AM, Ævar Arnfjörð Bjarmason wrote:
>>> But it seems odd to be doing this at all, the point of the perf
>>> framework is that you can point it at any repo, and some repos you want
>>> to test will have submodules.
>>
>> You're right that it should handle all repos. However, the point of
>> the test is to have many copies of the repo, but most of them are
>> excluded by sparse-directory entries. We don't collapse sparse-directory
>> entries if there is a submodule inside, so the data shape is wrong after
>> making all the copies.
>>
>> So, I disagree with your approach in your suggested diff, and instead
>> offer this one. I've tested this with git.git and another local repo
>> without submodules and checked that everything works as expected.
> 
> What's got me confused here is that there's two uses for the perf
> framework in this context.
> 
> It's to use an empty/git.git as a test repo to demonstrate something,
> but then also that you can run it in your arbitrary repo, and e.g. see
> how much a given feature might benefit you.
> 
> Hence suggesting that maybe test_perf_fresh_repois better here, because
> by using test_perf_default_repo you're creating the expectation that you
> can run the perf test, observe an %X difference, and that'll be
> give-or-take what you'll get for that use case if you enable the feature.
> 
> Except it won't because the repo has submodules, which we deleted for
> the perf test...

I'm also dramatically changing the repository shape to expose index
reads and writes as a bottleneck. The benefit of using other repos
(like git.git or optionally choosing the Linux kernel repo) is to
change how much of the time is spent crawling the populated set.

>> diff --git a/t/perf/p2000-sparse-operations.sh b/t/perf/p2000-sparse-operations.sh
>> index e527316e66d..5c0d78eeeea 100755
>> --- a/t/perf/p2000-sparse-operations.sh
>> +++ b/t/perf/p2000-sparse-operations.sh
>> @@ -10,15 +10,17 @@ SPARSE_CONE=f2/f4/f1
>>  
>>  test_expect_success 'setup repo and indexes' '
>>  	git reset --hard HEAD &&
>> +
>>  	# Remove submodules from the example repo, because our
>> -	# duplication of the entire repo creates an unlikly data shape.
>> -	git config --file .gitmodules --get-regexp "submodule.*.path" >modules &&
>> -	git rm -f .gitmodules &&
>> -	for module in $(awk "{print \$2}" modules)
>> -	do
>> -		git rm $module || return 1
>> -	done &&
>> -	git commit -m "remove submodules" &&
>> +	# duplication of the entire repo creates an unlikely data shape.
>> +	if (git config --file .gitmodules --get-regexp "submodule.*.path" >modules)
> 
> A subshell isn't needed here.
> 
> FWIW the reason I got this out of ls-files is because you can have
> submodules without .gitmodules entries, rare and broken, but seemed more
> direct to grep the mode bits.

I'd prefer to do something (textually) simpler, expecting the input
repos to have correct data.

>> +	then
>> +		for module in $(awk "{print \$2}" modules)
>> +		do
>> +			git rm $module || return 1
>> +		done &&
> 
> Once we know we have submodules we can just do this without the loop.
> 
>     git rm $(awk "{print \$2}" modules)

Ok. That works for me.
>>> Seems like something like the WIP patch at the end on top would be
>>> better.
>>>
>>>> +	echo bogus >a &&
>>>> +	cp a b &&
>>>> +	git add a b &&
>>>> +	git commit -m "level 0" &&
>>>> +	BLOB=$(git rev-parse HEAD:a) &&
>>>
>>> Isn't the way we're getting this $BLOB equivalent to just 'echo bogus |
>>> git hash-object --stdin -w' why commit it?
>>
>> We are committing it so we can add commits that deepen the copies,
>> but within those copies we have these known file paths.
>>
>>> This whole thing makes me think you just wanted a test_perf_fresh_repo
>>> all along, but I think this would be much more useful if you took the
>>> default repo and multiplied the size in its tree by some multiple.
>>>
>>> E.g. take the files we have in git.git, write a copy at prefix-1/,
>>> prefix-2/ etc.
>>
>> That is essentially what is happening here, but using multiple levels
>> of directories. Using these multiple levels presents extra tree
>> lookups and parsing in the event of expanding a sparse index to a
>> full one.
> 
> *nod*
> 
> Anyway, this thread's a bit of a bikeshed on my part, I was just
> wondering if & what part of the test relied on the existing repo if it
> was mostly setting up its own test data.

Again, the benefit is to depend on the repo shape in some aspects,
while exaggerating the data shape to make the non-populated set
extremely large.

This presents different aspects that are worth examining, such as
git.git is much smaller than linux.git, and that is noticable with
these different performance numbers (taken at the end of this
series):

git.git
Test                                            this tree      
---------------------------------------------------------------
2000.2: git status (full-index-v3)              0.39(0.35+0.08)
2000.3: git status (full-index-v4)              0.39(0.34+0.09)
2000.4: git status (sparse-index-v3)            2.46(2.33+0.16)
2000.5: git status (sparse-index-v4)            2.42(2.31+0.15)
2000.6: git add -A (full-index-v3)              1.35(0.98+0.20)
2000.7: git add -A (full-index-v4)              1.25(0.96+0.18)
2000.8: git add -A (sparse-index-v3)            2.39(2.26+0.17)
2000.9: git add -A (sparse-index-v4)            2.35(2.29+0.11)
2000.10: git add . (full-index-v3)              1.39(1.01+0.19)
2000.11: git add . (full-index-v4)              1.31(1.00+0.19)
2000.12: git add . (sparse-index-v3)            2.41(2.28+0.16)
2000.13: git add . (sparse-index-v4)            2.45(2.32+0.16)
2000.14: git commit -a -m A (full-index-v3)     1.44(1.08+0.21)
2000.15: git commit -a -m A (full-index-v4)     1.31(1.04+0.19)
2000.16: git commit -a -m A (sparse-index-v3)   2.44(2.35+0.16)
2000.17: git commit -a -m A (sparse-index-v4)   2.44(2.36+0.16)

linux.git
Test                                            this tree        
-----------------------------------------------------------------
2000.2: git status (full-index-v3)              7.14(6.06+1.79)  
2000.3: git status (full-index-v4)              7.01(6.16+1.60)  
2000.4: git status (sparse-index-v3)            58.50(56.86+2.34)
2000.5: git status (sparse-index-v4)            57.52(55.80+2.45)
2000.6: git add -A (full-index-v3)              25.52(18.70+3.18)
2000.7: git add -A (full-index-v4)              22.26(17.52+2.72)
2000.8: git add -A (sparse-index-v3)            56.65(55.00+2.35)
2000.9: git add -A (sparse-index-v4)            56.56(54.98+2.29)
2000.10: git add . (full-index-v3)              25.87(19.12+3.15)
2000.11: git add . (full-index-v4)              22.56(17.85+2.71)
2000.12: git add . (sparse-index-v3)            57.01(55.28+2.42)
2000.13: git add . (sparse-index-v4)            56.84(55.38+2.19)
2000.14: git commit -a -m A (full-index-v3)     26.83(20.69+3.24)
2000.15: git commit -a -m A (full-index-v4)     24.04(19.86+2.65)
2000.16: git commit -a -m A (sparse-index-v3)   60.23(58.99+2.44)
2000.17: git commit -a -m A (sparse-index-v4)   60.52(59.09+2.74)

The intention is to make these numbers improve in the future
so that the sparse-index is a better approach.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [RFC/PATCH 2/5] ls-files: make "mode" in show_ce() loop a variable
  2021-03-17 13:28       ` [RFC/PATCH 2/5] ls-files: make "mode" in show_ce() loop a variable Ævar Arnfjörð Bjarmason
@ 2021-03-17 18:11         ` Elijah Newren
  2021-03-24  0:46           ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 203+ messages in thread
From: Elijah Newren @ 2021-03-17 18:11 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Martin Ågren, Derrick Stolee, SZEDER Gábor,
	Derrick Stolee, Derrick Stolee

On Wed, Mar 17, 2021 at 6:28 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> In a subsequent commit I'll optionally change the mode in a new sparse
> mode, let's do this first to make that change smaller.
>
> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
> ---
>  builtin/ls-files.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/builtin/ls-files.c b/builtin/ls-files.c
> index eb72d16493..4db75351f2 100644
> --- a/builtin/ls-files.c
> +++ b/builtin/ls-files.c
> @@ -242,9 +242,17 @@ static void show_ce(struct repository *repo, struct dir_struct *dir,
>                 if (!show_stage) {
>                         fputs(tag, stdout);
>                 } else {
> +                       unsigned int mode = ce->ce_mode;
> +                       if (show_sparse && S_ISSPARSEDIR(mode))
> +                               /*
> +                                * We could just do & 0177777 all the
> +                                * time, just make it clear this is
> +                                * for --stage-sparse.
> +                                */
> +                               mode &= 0177777;

I could kind of see referencing the magic constant 0177777 in a test-*
source file, but it really needs an explanation when showing up in
actual git source code.  At least reference something about how
cache.h mentions these are the mode bits, or better yet #define this
constant somewhere in cache.h with an explanation.

Also, what is --stage-sparse?

>                         printf("%s%06o %s %d\t",
>                                tag,
> -                              ce->ce_mode,
> +                              mode,
>                                find_unique_abbrev(&ce->oid, abbrev),
>                                ce_stage(ce));
>                 }
> --
> 2.31.0.260.g719c683c1d

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [RFC/PATCH 3/5] ls-files: add and use a new --sparse option
  2021-03-17 13:28       ` [RFC/PATCH 3/5] ls-files: add and use a new --sparse option Ævar Arnfjörð Bjarmason
@ 2021-03-17 18:19         ` Elijah Newren
  2021-03-17 18:27           ` Ævar Arnfjörð Bjarmason
  2021-03-17 20:43         ` Derrick Stolee
  1 sibling, 1 reply; 203+ messages in thread
From: Elijah Newren @ 2021-03-17 18:19 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Martin Ågren, Derrick Stolee, SZEDER Gábor,
	Derrick Stolee, Derrick Stolee

On Wed, Mar 17, 2021 at 6:28 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
> ---
>  Documentation/git-ls-files.txt           |  4 ++
>  builtin/ls-files.c                       | 10 ++++-
>  t/t1091-sparse-checkout-builtin.sh       |  9 ++--
>  t/t1092-sparse-checkout-compatibility.sh | 57 ++++++++++++++++--------
>  4 files changed, 56 insertions(+), 24 deletions(-)
>
> diff --git a/Documentation/git-ls-files.txt b/Documentation/git-ls-files.txt
> index 6d11ab506b..1145e960a4 100644
> --- a/Documentation/git-ls-files.txt
> +++ b/Documentation/git-ls-files.txt
> @@ -71,6 +71,10 @@ OPTIONS
>  --unmerged::
>         Show unmerged files in the output (forces --stage)
>
> +--sparse::
> +       Show sparse directories in the output instead of expanding
> +       them (forces --stage)
> +
>  -k::
>  --killed::
>         Show files on the filesystem that need to be removed due
> diff --git a/builtin/ls-files.c b/builtin/ls-files.c
> index 4db75351f2..1ebbb63c10 100644
> --- a/builtin/ls-files.c
> +++ b/builtin/ls-files.c
> @@ -26,6 +26,7 @@ static int show_deleted;
>  static int show_cached;
>  static int show_others;
>  static int show_stage;
> +static int show_sparse;
>  static int show_unmerged;
>  static int show_resolve_undo;
>  static int show_modified;
> @@ -639,6 +640,8 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix)
>                         DIR_SHOW_IGNORED),
>                 OPT_BOOL('s', "stage", &show_stage,
>                         N_("show staged contents' object name in the output")),
> +               OPT_BOOL(0, "sparse", &show_sparse,
> +                       N_("show unexpanded sparse directories in the output")),
>                 OPT_BOOL('k', "killed", &show_killed,
>                         N_("show files on the filesystem that need to be removed")),
>                 OPT_BIT(0, "directory", &dir.flags,
> @@ -705,12 +708,17 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix)
>                 tag_skip_worktree = "S ";
>                 tag_resolve_undo = "U ";
>         }
> +       if (show_sparse) {
> +               prepare_repo_settings(the_repository);
> +               the_repository->settings.command_requires_full_index = 0;
> +       }
>         if (show_modified || show_others || show_deleted || (dir.flags & DIR_SHOW_IGNORED) || show_killed)
>                 require_work_tree = 1;
> -       if (show_unmerged)
> +       if (show_unmerged || show_sparse)
>                 /*
>                  * There's no point in showing unmerged unless
>                  * you also show the stage information.
> +                * The same goes for the --sparse option.

Yuck, haven't you just made --sparse an alias for --stage?  Why does
it need an alias?

Was the goal just to get a quick way to make the command run under
repo->settings.command_requires_full_index = 0 without auditing the
codepaths?  It seems to rely on them having been audited anyway, since
it just falls back to the code used for --stage, so I don't see how it
helps.  It also suggests the command might do unexpected or weird
things if run without the --sparse option?  If people manually
configure a sparse-checkout and cone mode AND a sparse-index (it's
annoying how they have to specify all three instead of having to just
pass one flag somewhere), then now we also need to force them to
remember to pass extra flags to random various commands for them to
operate in a sane manner in their environment??

I think this is a bad path to go down.

However, if you want to write the necessary tests to make it so that
ls-files can operate with command_requires_full_index = 0, then I
think that's useful.  If you want to add a special flag so that folks
in a sparse-checkout-with-cone-mode-with-sparse-index setup want to
operate densely (i.e. to show what files would be in the index if it
were fully populated), then I think that's useful.  But having
sparse-yes-with-cone-yes-very-sparse folks need to specify an extra
flag to commands to get sparse behavior just seems wrong to me.

>                  */
>                 show_stage = 1;
>         if (show_tag || show_stage)
> diff --git a/t/t1091-sparse-checkout-builtin.sh b/t/t1091-sparse-checkout-builtin.sh
> index ff1ad570a2..c823df423c 100755
> --- a/t/t1091-sparse-checkout-builtin.sh
> +++ b/t/t1091-sparse-checkout-builtin.sh
> @@ -208,12 +208,13 @@ test_expect_success 'sparse-checkout disable' '
>  test_expect_success 'sparse-index enabled and disabled' '
>         git -C repo sparse-checkout init --cone --sparse-index &&
>         test_cmp_config -C repo true extensions.sparseIndex &&
> -       test-tool -C repo read-cache --table >cache &&
> -       grep " tree " cache &&
> +       git -C repo ls-files --sparse >cache &&
> +       grep "^040000 " cache >lines &&
> +       test_line_count = 3 lines &&
>
>         git -C repo sparse-checkout disable &&
> -       test-tool -C repo read-cache --table >cache &&
> -       ! grep " tree " cache &&
> +       git -C repo ls-files --sparse >cache &&
> +       ! grep "^040000 " cache &&
>         git -C repo config --list >config &&
>         ! grep extensions.sparseindex config
>  '
> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
> index d97bf9b645..48d3920490 100755
> --- a/t/t1092-sparse-checkout-compatibility.sh
> +++ b/t/t1092-sparse-checkout-compatibility.sh
> @@ -136,48 +136,67 @@ test_sparse_match () {
>         test_cmp sparse-checkout-err sparse-index-err
>  }
>
> +test_index_entry_like () {
> +       dir=$1
> +       shift
> +       fmt=$1
> +       shift
> +       rev=$1
> +       shift
> +       entry=$1
> +       shift
> +       file=$1
> +       shift
> +       hash=$(git -C "$dir" rev-parse "$rev") &&
> +       printf "$fmt\n" "$hash" "$entry" >expected &&
> +       if grep "$entry" "$file" >line
> +       then
> +               test_cmp expected line
> +       else
> +               cat cache &&
> +               false
> +       fi
> +}
> +
>  test_expect_success 'sparse-index contents' '
>         init_repos &&
>
> -       test-tool -C sparse-index read-cache --table >cache &&
> +       git -C sparse-index ls-files --sparse >cache &&
>         for dir in folder1 folder2 x
>         do
> -               TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
> -               grep "040000 tree $TREE $dir/" cache \
> -                       || return 1
> +               test_index_entry_like sparse-index "040000 %s 0\t%s" "HEAD:$dir" "$dir/" cache || return 1
>         done &&
>
>         git -C sparse-index sparse-checkout set folder1 &&
>
> -       test-tool -C sparse-index read-cache --table >cache &&
> +       git -C sparse-index ls-files --sparse >cache &&
>         for dir in deep folder2 x
>         do
> -               TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
> -               grep "040000 tree $TREE $dir/" cache \
> -                       || return 1
> +               test_index_entry_like sparse-index "040000 %s 0\t%s" "HEAD:$dir" "$dir/" cache || return 1
>         done &&
>
>         git -C sparse-index sparse-checkout set deep/deeper1 &&
>
> -       test-tool -C sparse-index read-cache --table >cache &&
> +       git -C sparse-index ls-files --sparse >cache &&
>         for dir in deep/deeper2 folder1 folder2 x
>         do
> -               TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
> -               grep "040000 tree $TREE $dir/" cache \
> -                       || return 1
> +               test_index_entry_like sparse-index "040000 %s 0\t%s" "HEAD:$dir" "$dir/" cache || return 1
>         done &&
>
> +       grep 040000 cache >lines &&
> +       test_line_count = 4 lines &&
> +
>         # Disabling the sparse-index removes tree entries with full ones
>         git -C sparse-index sparse-checkout init --no-sparse-index &&
>
> -       test-tool -C sparse-index read-cache --table >cache &&
> -       ! grep "040000 tree" cache &&
> -       test_sparse_match test-tool read-cache --table
> +       git -C sparse-index ls-files --sparse >cache &&
> +       ! grep "^040000 " cache >lines &&
> +       test_sparse_match git ls-tree -r HEAD
>  '
>
>  test_expect_success 'expanded in-memory index matches full index' '
>         init_repos &&
> -       test_sparse_match test-tool read-cache --expand --table
> +       test_sparse_match git ls-tree -r HEAD
>  '
>
>  test_expect_success 'status with options' '
> @@ -394,9 +413,9 @@ test_expect_success 'submodule handling' '
>         test_all_match git commit -m "add submodule" &&
>
>         # having a submodule prevents "modules" from collapse
> -       test-tool -C sparse-index read-cache --table >cache &&
> -       grep "100644 blob .*    modules/a" cache &&
> -       grep "160000 commit $(git -C initial-repo rev-parse HEAD)       modules/sub" cache
> +       git -C sparse-index ls-files --sparse >cache &&
> +       test_index_entry_like sparse-index "100644 %s 0\t%s" "HEAD:modules/a" "modules/a" cache &&
> +       test_index_entry_like sparse-index "160000 %s 0\t%s" "HEAD:modules/sub" "modules/sub" cache
>  '
>
>  test_expect_success 'sparse-index is expanded and converted back' '
> --
> 2.31.0.260.g719c683c1d

I do like the tests and your idea that we can use ls-files to list
whatever entries are in the index, I just think the tests should use
--stage to do that.

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [RFC/PATCH 3/5] ls-files: add and use a new --sparse option
  2021-03-17 18:19         ` Elijah Newren
@ 2021-03-17 18:27           ` Ævar Arnfjörð Bjarmason
  2021-03-17 18:44             ` Elijah Newren
  0 siblings, 1 reply; 203+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-03-17 18:27 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Martin Ågren, Derrick Stolee, SZEDER Gábor,
	Derrick Stolee, Derrick Stolee


On Wed, Mar 17 2021, Elijah Newren wrote:

> On Wed, Mar 17, 2021 at 6:28 AM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>>
>> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
>> ---
>>  Documentation/git-ls-files.txt           |  4 ++
>>  builtin/ls-files.c                       | 10 ++++-
>>  t/t1091-sparse-checkout-builtin.sh       |  9 ++--
>>  t/t1092-sparse-checkout-compatibility.sh | 57 ++++++++++++++++--------
>>  4 files changed, 56 insertions(+), 24 deletions(-)
>>
>> diff --git a/Documentation/git-ls-files.txt b/Documentation/git-ls-files.txt
>> index 6d11ab506b..1145e960a4 100644
>> --- a/Documentation/git-ls-files.txt
>> +++ b/Documentation/git-ls-files.txt
>> @@ -71,6 +71,10 @@ OPTIONS
>>  --unmerged::
>>         Show unmerged files in the output (forces --stage)
>>
>> +--sparse::
>> +       Show sparse directories in the output instead of expanding
>> +       them (forces --stage)
>> +
>>  -k::
>>  --killed::
>>         Show files on the filesystem that need to be removed due
>> diff --git a/builtin/ls-files.c b/builtin/ls-files.c
>> index 4db75351f2..1ebbb63c10 100644
>> --- a/builtin/ls-files.c
>> +++ b/builtin/ls-files.c
>> @@ -26,6 +26,7 @@ static int show_deleted;
>>  static int show_cached;
>>  static int show_others;
>>  static int show_stage;
>> +static int show_sparse;
>>  static int show_unmerged;
>>  static int show_resolve_undo;
>>  static int show_modified;
>> @@ -639,6 +640,8 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix)
>>                         DIR_SHOW_IGNORED),
>>                 OPT_BOOL('s', "stage", &show_stage,
>>                         N_("show staged contents' object name in the output")),
>> +               OPT_BOOL(0, "sparse", &show_sparse,
>> +                       N_("show unexpanded sparse directories in the output")),
>>                 OPT_BOOL('k', "killed", &show_killed,
>>                         N_("show files on the filesystem that need to be removed")),
>>                 OPT_BIT(0, "directory", &dir.flags,
>> @@ -705,12 +708,17 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix)
>>                 tag_skip_worktree = "S ";
>>                 tag_resolve_undo = "U ";
>>         }
>> +       if (show_sparse) {
>> +               prepare_repo_settings(the_repository);
>> +               the_repository->settings.command_requires_full_index = 0;
>> +       }
>>         if (show_modified || show_others || show_deleted || (dir.flags & DIR_SHOW_IGNORED) || show_killed)
>>                 require_work_tree = 1;
>> -       if (show_unmerged)
>> +       if (show_unmerged || show_sparse)
>>                 /*
>>                  * There's no point in showing unmerged unless
>>                  * you also show the stage information.
>> +                * The same goes for the --sparse option.
>
> Yuck, haven't you just made --sparse an alias for --stage?  Why does
> it need an alias?

It doesn't, but --unmerged, the one other option which purely modifies
--stage output implies --stage.

So it's in line with existing UI convention in the command, it's
probably better to keep following that than have new options behave
differently.

But yeah, we could spell out --stage --sparse in the tests.

> Was the goal just to get a quick way to make the command run under
> repo->settings.command_requires_full_index = 0 without auditing the
> codepaths?  It seems to rely on them having been audited anyway, since
> it just falls back to the code used for --stage, so I don't see how it
> helps.  It also suggests the command might do unexpected or weird
> things if run without the --sparse option?  If people manually
> configure a sparse-checkout and cone mode AND a sparse-index (it's
> annoying how they have to specify all three instead of having to just
> pass one flag somewhere), then now we also need to force them to
> remember to pass extra flags to random various commands for them to
> operate in a sane manner in their environment??
>
> I think this is a bad path to go down.

Those are probably good points, I don't have enough overview of the
whole sparse thing yet to say.

I just thought it didn't make sense to have a series changing the nature
of the index without corresponding tooling changes to interrogate the
state of the index.

> However, if you want to write the necessary tests to make it so that
> ls-files can operate with command_requires_full_index = 0, then I
> think that's useful.  If you want to add a special flag so that folks
> in a sparse-checkout-with-cone-mode-with-sparse-index setup want to
> operate densely (i.e. to show what files would be in the index if it
> were fully populated), then I think that's useful.  But having
> sparse-yes-with-cone-yes-very-sparse folks need to specify an extra
> flag to commands to get sparse behavior just seems wrong to me.

Maybe, but what else do you suggest for getting this information out of
the index?

>>                  */
>>                 show_stage = 1;
>>         if (show_tag || show_stage)
>> diff --git a/t/t1091-sparse-checkout-builtin.sh b/t/t1091-sparse-checkout-builtin.sh
>> index ff1ad570a2..c823df423c 100755
>> --- a/t/t1091-sparse-checkout-builtin.sh
>> +++ b/t/t1091-sparse-checkout-builtin.sh
>> @@ -208,12 +208,13 @@ test_expect_success 'sparse-checkout disable' '
>>  test_expect_success 'sparse-index enabled and disabled' '
>>         git -C repo sparse-checkout init --cone --sparse-index &&
>>         test_cmp_config -C repo true extensions.sparseIndex &&
>> -       test-tool -C repo read-cache --table >cache &&
>> -       grep " tree " cache &&
>> +       git -C repo ls-files --sparse >cache &&
>> +       grep "^040000 " cache >lines &&
>> +       test_line_count = 3 lines &&
>>
>>         git -C repo sparse-checkout disable &&
>> -       test-tool -C repo read-cache --table >cache &&
>> -       ! grep " tree " cache &&
>> +       git -C repo ls-files --sparse >cache &&
>> +       ! grep "^040000 " cache &&
>>         git -C repo config --list >config &&
>>         ! grep extensions.sparseindex config
>>  '
>> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
>> index d97bf9b645..48d3920490 100755
>> --- a/t/t1092-sparse-checkout-compatibility.sh
>> +++ b/t/t1092-sparse-checkout-compatibility.sh
>> @@ -136,48 +136,67 @@ test_sparse_match () {
>>         test_cmp sparse-checkout-err sparse-index-err
>>  }
>>
>> +test_index_entry_like () {
>> +       dir=$1
>> +       shift
>> +       fmt=$1
>> +       shift
>> +       rev=$1
>> +       shift
>> +       entry=$1
>> +       shift
>> +       file=$1
>> +       shift
>> +       hash=$(git -C "$dir" rev-parse "$rev") &&
>> +       printf "$fmt\n" "$hash" "$entry" >expected &&
>> +       if grep "$entry" "$file" >line
>> +       then
>> +               test_cmp expected line
>> +       else
>> +               cat cache &&
>> +               false
>> +       fi
>> +}
>> +
>>  test_expect_success 'sparse-index contents' '
>>         init_repos &&
>>
>> -       test-tool -C sparse-index read-cache --table >cache &&
>> +       git -C sparse-index ls-files --sparse >cache &&
>>         for dir in folder1 folder2 x
>>         do
>> -               TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
>> -               grep "040000 tree $TREE $dir/" cache \
>> -                       || return 1
>> +               test_index_entry_like sparse-index "040000 %s 0\t%s" "HEAD:$dir" "$dir/" cache || return 1
>>         done &&
>>
>>         git -C sparse-index sparse-checkout set folder1 &&
>>
>> -       test-tool -C sparse-index read-cache --table >cache &&
>> +       git -C sparse-index ls-files --sparse >cache &&
>>         for dir in deep folder2 x
>>         do
>> -               TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
>> -               grep "040000 tree $TREE $dir/" cache \
>> -                       || return 1
>> +               test_index_entry_like sparse-index "040000 %s 0\t%s" "HEAD:$dir" "$dir/" cache || return 1
>>         done &&
>>
>>         git -C sparse-index sparse-checkout set deep/deeper1 &&
>>
>> -       test-tool -C sparse-index read-cache --table >cache &&
>> +       git -C sparse-index ls-files --sparse >cache &&
>>         for dir in deep/deeper2 folder1 folder2 x
>>         do
>> -               TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
>> -               grep "040000 tree $TREE $dir/" cache \
>> -                       || return 1
>> +               test_index_entry_like sparse-index "040000 %s 0\t%s" "HEAD:$dir" "$dir/" cache || return 1
>>         done &&
>>
>> +       grep 040000 cache >lines &&
>> +       test_line_count = 4 lines &&
>> +
>>         # Disabling the sparse-index removes tree entries with full ones
>>         git -C sparse-index sparse-checkout init --no-sparse-index &&
>>
>> -       test-tool -C sparse-index read-cache --table >cache &&
>> -       ! grep "040000 tree" cache &&
>> -       test_sparse_match test-tool read-cache --table
>> +       git -C sparse-index ls-files --sparse >cache &&
>> +       ! grep "^040000 " cache >lines &&
>> +       test_sparse_match git ls-tree -r HEAD
>>  '
>>
>>  test_expect_success 'expanded in-memory index matches full index' '
>>         init_repos &&
>> -       test_sparse_match test-tool read-cache --expand --table
>> +       test_sparse_match git ls-tree -r HEAD
>>  '
>>
>>  test_expect_success 'status with options' '
>> @@ -394,9 +413,9 @@ test_expect_success 'submodule handling' '
>>         test_all_match git commit -m "add submodule" &&
>>
>>         # having a submodule prevents "modules" from collapse
>> -       test-tool -C sparse-index read-cache --table >cache &&
>> -       grep "100644 blob .*    modules/a" cache &&
>> -       grep "160000 commit $(git -C initial-repo rev-parse HEAD)       modules/sub" cache
>> +       git -C sparse-index ls-files --sparse >cache &&
>> +       test_index_entry_like sparse-index "100644 %s 0\t%s" "HEAD:modules/a" "modules/a" cache &&
>> +       test_index_entry_like sparse-index "160000 %s 0\t%s" "HEAD:modules/sub" "modules/sub" cache
>>  '
>>
>>  test_expect_success 'sparse-index is expanded and converted back' '
>> --
>> 2.31.0.260.g719c683c1d
>
> I do like the tests and your idea that we can use ls-files to list
> whatever entries are in the index, I just think the tests should use
> --stage to do that.


^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [RFC/PATCH 0/5] Re: [PATCH v3 07/20] test-read-cache: print cache entries with --table
  2021-03-17 13:28       ` [RFC/PATCH 0/5] " Ævar Arnfjörð Bjarmason
@ 2021-03-17 18:28         ` Elijah Newren
  2021-03-17 19:46           ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Elijah Newren @ 2021-03-17 18:28 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Martin Ågren, Derrick Stolee, SZEDER Gábor,
	Derrick Stolee, Derrick Stolee

On Wed, Mar 17, 2021 at 6:28 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> > From: Derrick Stolee <dstolee@microsoft.com>
> >
> > This table is helpful for discovering data in the index to ensure it is
> > being written correctly, especially as we build and test the
> > sparse-index. This table includes an output format similar to 'git
> > ls-tree', but should not be compared to that directly. The biggest
> > reasons are that 'git ls-tree' includes a tree entry for every
> > subdirectory, even those that would not appear as a sparse directory in
> > a sparse-index. Further, 'git ls-tree' does not use a trailing directory
> > separator for its tree rows.
> >
> > This does not print the stat() information for the blobs. That could be
> > added in a future change with another option. The tests that are added
> > in the next few changes care only about the object types and IDs.
> >
> > To make the option parsing slightly more robust, wrap the string
> > comparisons in a loop adapted from test-dir-iterator.c.
> >
> > Care must be taken with the final check for the 'cnt' variable. We
> > continue the expectation that the numerical value is the final argument.
> >
> > Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> > ---
> >  t/helper/test-read-cache.c | 55 +++++++++++++++++++++++++++++++-------
> >  1 file changed, 45 insertions(+), 10 deletions(-)
> >
> > diff --git a/t/helper/test-read-cache.c b/t/helper/test-read-cache.c
> > index 244977a29bdf..6cfd8f2de71c 100644
> > --- a/t/helper/test-read-cache.c
> > +++ b/t/helper/test-read-cache.c
> > @@ -1,36 +1,71 @@
> >  #include "test-tool.h"
> >  #include "cache.h"
> >  #include "config.h"
> > +#include "blob.h"
> > +#include "commit.h"
> > +#include "tree.h"
> > +
> > +static void print_cache_entry(struct cache_entry *ce)
> > +{
> > +     const char *type;
> > +     printf("%06o ", ce->ce_mode & 0177777);
> > +
> > +     if (S_ISSPARSEDIR(ce->ce_mode))
> > +             type = tree_type;
> > +     else if (S_ISGITLINK(ce->ce_mode))
> > +             type = commit_type;
> > +     else
> > +             type = blob_type;
> > +
> > +     printf("%s %s\t%s\n",
> > +            type,
> > +            oid_to_hex(&ce->oid),
> > +            ce->name);
> > +}
> > +
>
> So we have a test tool that's mostly ls-files but mocks the output
> ls-tree would emit, won't these tests eventually care about what stage
> things are in?
>
> What follows is an RFC series on top that's the result of me wondering
> why if we're adding new index constructs we aren't updating our
> plumbing to emit that data, can we just add this to ls-files and drop
> this test helper?
>
> Turns out: Yes we can.

I like the idea of having ls-files be usable to show the entries that
are in the index; that seems great to me.  I very much dislike the
--sparse flag to ls-files, as noted on that commit.

Also, as a minor point, the first two patches seemed a bit confusing
to me.  The first commit said that it was there solely to make "the
next commit" easier, and the second was worded as just making the next
patch easier, which made me wonder if the wording in the first commit
message was referring to 3/5 when it said "the next commit".  Both of
the first two commits were so tiny that if they are both prep for 3/5,
maybe it makes sense to combine them (together or both to 3/5)?  If
not, maybe the commit messages could be cleaned up or clarified a bit?

> Ævar Arnfjörð Bjarmason (5):
>   ls-files: defer read_index() after parse_options() etc.
>   ls-files: make "mode" in show_ce() loop a variable
>   ls-files: add and use a new --sparse option
>   test-tool read-cache: --table is redundant to ls-files
>   test-tool: split up test-tool read-cache
>
>  Documentation/git-ls-files.txt           |  4 ++
>  Makefile                                 |  3 +-
>  builtin/ls-files.c                       | 29 +++++++--
>  t/helper/test-read-cache-again.c         | 31 +++++++++
>  t/helper/test-read-cache-perf.c          | 21 ++++++
>  t/helper/test-read-cache.c               | 82 ------------------------
>  t/helper/test-tool.c                     |  3 +-
>  t/helper/test-tool.h                     |  3 +-
>  t/perf/p0002-read-cache.sh               |  2 +-
>  t/t1091-sparse-checkout-builtin.sh       |  9 +--
>  t/t1092-sparse-checkout-compatibility.sh | 57 ++++++++++------
>  t/t7519-status-fsmonitor.sh              |  2 +-
>  12 files changed, 131 insertions(+), 115 deletions(-)
>  create mode 100644 t/helper/test-read-cache-again.c
>  create mode 100644 t/helper/test-read-cache-perf.c
>  delete mode 100644 t/helper/test-read-cache.c
>
> --
> 2.31.0.260.g719c683c1d

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [RFC/PATCH 3/5] ls-files: add and use a new --sparse option
  2021-03-17 18:27           ` Ævar Arnfjörð Bjarmason
@ 2021-03-17 18:44             ` Elijah Newren
  0 siblings, 0 replies; 203+ messages in thread
From: Elijah Newren @ 2021-03-17 18:44 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Martin Ågren, Derrick Stolee, SZEDER Gábor,
	Derrick Stolee, Derrick Stolee

On Wed, Mar 17, 2021 at 11:27 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> On Wed, Mar 17 2021, Elijah Newren wrote:
>
> > On Wed, Mar 17, 2021 at 6:28 AM Ævar Arnfjörð Bjarmason
> > <avarab@gmail.com> wrote:
> >>
> >> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
> >> ---
> >>  Documentation/git-ls-files.txt           |  4 ++
> >>  builtin/ls-files.c                       | 10 ++++-
> >>  t/t1091-sparse-checkout-builtin.sh       |  9 ++--
> >>  t/t1092-sparse-checkout-compatibility.sh | 57 ++++++++++++++++--------
> >>  4 files changed, 56 insertions(+), 24 deletions(-)
> >>
> >> diff --git a/Documentation/git-ls-files.txt b/Documentation/git-ls-files.txt
> >> index 6d11ab506b..1145e960a4 100644
> >> --- a/Documentation/git-ls-files.txt
> >> +++ b/Documentation/git-ls-files.txt
> >> @@ -71,6 +71,10 @@ OPTIONS
> >>  --unmerged::
> >>         Show unmerged files in the output (forces --stage)
> >>
> >> +--sparse::
> >> +       Show sparse directories in the output instead of expanding
> >> +       them (forces --stage)
> >> +
> >>  -k::
> >>  --killed::
> >>         Show files on the filesystem that need to be removed due
> >> diff --git a/builtin/ls-files.c b/builtin/ls-files.c
> >> index 4db75351f2..1ebbb63c10 100644
> >> --- a/builtin/ls-files.c
> >> +++ b/builtin/ls-files.c
> >> @@ -26,6 +26,7 @@ static int show_deleted;
> >>  static int show_cached;
> >>  static int show_others;
> >>  static int show_stage;
> >> +static int show_sparse;
> >>  static int show_unmerged;
> >>  static int show_resolve_undo;
> >>  static int show_modified;
> >> @@ -639,6 +640,8 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix)
> >>                         DIR_SHOW_IGNORED),
> >>                 OPT_BOOL('s', "stage", &show_stage,
> >>                         N_("show staged contents' object name in the output")),
> >> +               OPT_BOOL(0, "sparse", &show_sparse,
> >> +                       N_("show unexpanded sparse directories in the output")),
> >>                 OPT_BOOL('k', "killed", &show_killed,
> >>                         N_("show files on the filesystem that need to be removed")),
> >>                 OPT_BIT(0, "directory", &dir.flags,
> >> @@ -705,12 +708,17 @@ int cmd_ls_files(int argc, const char **argv, const char *cmd_prefix)
> >>                 tag_skip_worktree = "S ";
> >>                 tag_resolve_undo = "U ";
> >>         }
> >> +       if (show_sparse) {
> >> +               prepare_repo_settings(the_repository);
> >> +               the_repository->settings.command_requires_full_index = 0;
> >> +       }
> >>         if (show_modified || show_others || show_deleted || (dir.flags & DIR_SHOW_IGNORED) || show_killed)
> >>                 require_work_tree = 1;
> >> -       if (show_unmerged)
> >> +       if (show_unmerged || show_sparse)
> >>                 /*
> >>                  * There's no point in showing unmerged unless
> >>                  * you also show the stage information.
> >> +                * The same goes for the --sparse option.
> >
> > Yuck, haven't you just made --sparse an alias for --stage?  Why does
> > it need an alias?
>
> It doesn't, but --unmerged, the one other option which purely modifies
> --stage output implies --stage.

--unmerged modifies --stage output.  --sparse won't.  (Maybe it does
_now_ because the command doesn't yet support sparse-indexes, but
that's a temporary artifact.  Long term, there should be no difference
in the output.)

> So it's in line with existing UI convention in the command, it's
> probably better to keep following that than have new options behave
> differently.
>
> But yeah, we could spell out --stage --sparse in the tests.

There should not be a --sparse option.  The index is _already_ sparse
and users had to take multiple steps to make it so; users shouldn't
have to repeat themselves with each and every command they ever type
when they've created a sparse index that they want sparse behavior.

You should just spell it "--stage".

> > Was the goal just to get a quick way to make the command run under
> > repo->settings.command_requires_full_index = 0 without auditing the
> > codepaths?  It seems to rely on them having been audited anyway, since
> > it just falls back to the code used for --stage, so I don't see how it
> > helps.  It also suggests the command might do unexpected or weird
> > things if run without the --sparse option?  If people manually
> > configure a sparse-checkout and cone mode AND a sparse-index (it's
> > annoying how they have to specify all three instead of having to just
> > pass one flag somewhere), then now we also need to force them to
> > remember to pass extra flags to random various commands for them to
> > operate in a sane manner in their environment??
> >
> > I think this is a bad path to go down.
>
> Those are probably good points, I don't have enough overview of the
> whole sparse thing yet to say.
>
> I just thought it didn't make sense to have a series changing the nature
> of the index without corresponding tooling changes to interrogate the
> state of the index.

That makes sense to me; I agree with you on that point.

> > However, if you want to write the necessary tests to make it so that
> > ls-files can operate with command_requires_full_index = 0, then I
> > think that's useful.  If you want to add a special flag so that folks
> > in a sparse-checkout-with-cone-mode-with-sparse-index setup want to
> > operate densely (i.e. to show what files would be in the index if it
> > were fully populated), then I think that's useful.  But having
> > sparse-yes-with-cone-yes-very-sparse folks need to specify an extra
> > flag to commands to get sparse behavior just seems wrong to me.
>
> Maybe, but what else do you suggest for getting this information out of
> the index?

Use git ls-files without new options...as I stated here:

...
> > I do like the tests and your idea that we can use ls-files to list
> > whatever entries are in the index, I just think the tests should use
> > --stage to do that.

In other words, I think making "git ls-files" the first, or at least
one of the first, commands to be modified to behave properly in a
sparse-index world is what you should be aiming for, not some
new-option-shortcut that'll make no sense long term and persist
indefinitely.

List the entries in the index: `git ls-files`
List the entries in the index with their hash, mode, and stage: `git
ls-files --stage`

List all the entries that would be in the index if it weren't sparse:
`git ls-files --$SOME_NEW_OPTION_NAME`

You don't need to implement the --$SOME_NEW_OPTION_NAME yet, of
course.  We can just note that it's the plan to add it later.

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [RFC/PATCH 0/5] Re: [PATCH v3 07/20] test-read-cache: print cache entries with --table
  2021-03-17 18:28         ` Elijah Newren
@ 2021-03-17 19:46           ` Derrick Stolee
  2021-03-17 20:26             ` Elijah Newren
  0 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee @ 2021-03-17 19:46 UTC (permalink / raw)
  To: Elijah Newren, Ævar Arnfjörð Bjarmason
  Cc: Git Mailing List, Junio C Hamano,
	Nguyễn Thái Ngọc, Jonathan Nieder,
	Martin Ågren, SZEDER Gábor, Derrick Stolee,
	Derrick Stolee

On 3/17/2021 2:28 PM, Elijah Newren wrote:
> On Wed, Mar 17, 2021 at 6:28 AM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>>
>>> From: Derrick Stolee <dstolee@microsoft.com>

>>
>> So we have a test tool that's mostly ls-files but mocks the output
>> ls-tree would emit, won't these tests eventually care about what stage
>> things are in?
>>
>> What follows is an RFC series on top that's the result of me wondering
>> why if we're adding new index constructs we aren't updating our
>> plumbing to emit that data, can we just add this to ls-files and drop
>> this test helper?
>>
>> Turns out: Yes we can.
> 
> I like the idea of having ls-files be usable to show the entries that
> are in the index; that seems great to me.  I very much dislike the
> --sparse flag to ls-files, as noted on that commit.

I don't like this idea. I don't think exposing internal structures
like this is something we want to do so quickly. Further, I intend
to use this test tool in the future to _also_ show the stored stat()
data, which would be inappropriate here in ls-files.

I would prefer to continue using the test helper here and leave
functional changes to ls-files be considered independently.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [PATCH v3 11/20] sparse-index: convert from full to sparse
  2021-03-17 13:43       ` Ævar Arnfjörð Bjarmason
@ 2021-03-17 19:55         ` Derrick Stolee
  2021-03-18 13:38           ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Derrick Stolee @ 2021-03-17 19:55 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, Derrick Stolee via GitGitGadget
  Cc: git, newren, gitster, pclouds, jrnieder, Martin Ågren,
	SZEDER Gábor, Derrick Stolee, Derrick Stolee

On 3/17/2021 9:43 AM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Tue, Mar 16 2021, Derrick Stolee via GitGitGadget wrote:
>> @@ -251,6 +251,8 @@ static inline unsigned int create_ce_mode(unsigned int mode)
>>  {
>>  	if (S_ISLNK(mode))
>>  		return S_IFLNK;
>> +	if (mode == S_IFDIR)
>> +		return S_IFDIR;
> 
> Does this actually need to be mode == S_IFDIR v.s. S_ISDIR(mode)? Those
> aren't the same thing...
> 
>>  	if (S_ISDIR(mode) || S_ISGITLINK(mode))
>>  		return S_IFGITLINK;
> 
> ...and if it can be S_ISDIR(mode) then this becomes just
> S_ISGITLINK(mode), but losing the "if" there makes me suspect that some
> dir == submodule heuristic is being broken somewhere..
 
I have a vague recollection that I did that at one point, and
it didn't work. However, using the simpler

	if (S_ISDIR(mode))
		return S_IFDIR;
	if (S_ISGITLINK(mode))
		return S_IFGITLINK;

passes all of my tests.

Looking at the history of create_ce_mode(), this "||"
condition was created in this commit:

commit 9eec4795d44439cd170fb52c73827c728252648d
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Mon Apr 9 21:14:58 2007 -0700

    Add "S_IFDIRLNK" file mode infrastructure for git links
    
    This just adds the basic helper functions to recognize and work with git
    tree entries that are links to other git repositories ("subprojects").
    They still aren't actually connected up to any of the code-paths, but
    now all the infrastructure is in place.
    
    The next commit will start actually adding actual subproject support.
    
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Junio C Hamano <junkio@cox.net>

There isn't any justification of why S_ISDIR() is there. Perhaps
it was defensive programming? If that is the case, then this simpler
logic will work.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [RFC/PATCH 0/5] Re: [PATCH v3 07/20] test-read-cache: print cache entries with --table
  2021-03-17 19:46           ` Derrick Stolee
@ 2021-03-17 20:26             ` Elijah Newren
  2021-03-17 20:34               ` Derrick Stolee
  0 siblings, 1 reply; 203+ messages in thread
From: Elijah Newren @ 2021-03-17 20:26 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Ævar Arnfjörð Bjarmason, Git Mailing List,
	Junio C Hamano, Nguyễn Thái Ngọc,
	Jonathan Nieder, Martin Ågren, SZEDER Gábor,
	Derrick Stolee, Derrick Stolee

On Wed, Mar 17, 2021 at 12:46 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 3/17/2021 2:28 PM, Elijah Newren wrote:
> > On Wed, Mar 17, 2021 at 6:28 AM Ævar Arnfjörð Bjarmason
> > <avarab@gmail.com> wrote:
> >>
> >>> From: Derrick Stolee <dstolee@microsoft.com>
>
> >>
> >> So we have a test tool that's mostly ls-files but mocks the output
> >> ls-tree would emit, won't these tests eventually care about what stage
> >> things are in?
> >>
> >> What follows is an RFC series on top that's the result of me wondering
> >> why if we're adding new index constructs we aren't updating our
> >> plumbing to emit that data, can we just add this to ls-files and drop
> >> this test helper?
> >>
> >> Turns out: Yes we can.
> >
> > I like the idea of having ls-files be usable to show the entries that
> > are in the index; that seems great to me.  I very much dislike the
> > --sparse flag to ls-files, as noted on that commit.
>
> I don't like this idea. I don't think exposing internal structures
> like this is something we want to do so quickly.

Not sure I follow; ls-files was already about exposing three bits of
internal structures for index entries: mode, hash, and stage number.
These are quantities that are well-defined for sparse directories too.
It would not be exposing any new or different internal structures, nor
changing the output format.  (Ævar changed the tests to not look for
"tree" but to look for the "040000" mode number.)

>  Further, I intend
> to use this test tool in the future to _also_ show the stored stat()
> data, which would be inappropriate here in ls-files.
>
> I would prefer to continue using the test helper here and leave
> functional changes to ls-files be considered independently.

Well, I was okay with it being in a test helper regardless of whether
it could be done with ls-files, and then just circling back and fixing
up ls-files later.  But perhaps it's worth calling out in the commit
message about your plans to add stat() data and how that future piece
can't be done in ls-files (without functional changes of some sort)
just to make it clearer why we're using a test helper instead of
front-loading the port of ls-files over to sparse-indexes?

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [RFC/PATCH 0/5] Re: [PATCH v3 07/20] test-read-cache: print cache entries with --table
  2021-03-17 20:26             ` Elijah Newren
@ 2021-03-17 20:34               ` Derrick Stolee
  0 siblings, 0 replies; 203+ messages in thread
From: Derrick Stolee @ 2021-03-17 20:34 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Ævar Arnfjörð Bjarmason, Git Mailing List,
	Junio C Hamano, Nguyễn Thái Ngọc,
	Jonathan Nieder, Martin Ågren, SZEDER Gábor,
	Derrick Stolee, Derrick Stolee

On 3/17/2021 4:26 PM, Elijah Newren wrote:
> On Wed, Mar 17, 2021 at 12:46 PM Derrick Stolee <stolee@gmail.com> wrote:
>>
>> On 3/17/2021 2:28 PM, Elijah Newren wrote:
>>> On Wed, Mar 17, 2021 at 6:28 AM Ævar Arnfjörð Bjarmason
>>> <avarab@gmail.com> wrote:
>>>>
>>>>> From: Derrick Stolee <dstolee@microsoft.com>
>>
>>>>
>>>> So we have a test tool that's mostly ls-files but mocks the output
>>>> ls-tree would emit, won't these tests eventually care about what stage
>>>> things are in?
>>>>
>>>> What follows is an RFC series on top that's the result of me wondering
>>>> why if we're adding new index constructs we aren't updating our
>>>> plumbing to emit that data, can we just add this to ls-files and drop
>>>> this test helper?
>>>>
>>>> Turns out: Yes we can.
>>>
>>> I like the idea of having ls-files be usable to show the entries that
>>> are in the index; that seems great to me.  I very much dislike the
>>> --sparse flag to ls-files, as noted on that commit.
>>
>> I don't like this idea. I don't think exposing internal structures
>> like this is something we want to do so quickly.
> 
> Not sure I follow; ls-files was already about exposing three bits of
> internal structures for index entries: mode, hash, and stage number.
> These are quantities that are well-defined for sparse directories too.
> It would not be exposing any new or different internal structures, nor
> changing the output format.  (Ævar changed the tests to not look for
> "tree" but to look for the "040000" mode number.)

True, that is some internal information already.

>>  Further, I intend
>> to use this test tool in the future to _also_ show the stored stat()
>> data, which would be inappropriate here in ls-files.
>>
>> I would prefer to continue using the test helper here and leave
>> functional changes to ls-files be considered independently.
> 
> Well, I was okay with it being in a test helper regardless of whether
> it could be done with ls-files, and then just circling back and fixing
> up ls-files later.  But perhaps it's worth calling out in the commit
> message about your plans to add stat() data and how that future piece
> can't be done in ls-files (without functional changes of some sort)
> just to make it clearer why we're using a test helper instead of
> front-loading the port of ls-files over to sparse-indexes?

Adding this justification to the commit message would definitely be
helpful, so I will do that.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 203+ messages in thread

* Re: [RFC/PATCH 3/5] ls-files: add and use a new --sparse option
  2021-03-17 13:28       ` [RFC/PATCH 3/5] ls-files: add and use a new --sparse option Ævar Arnfjörð Bjarmason
  2021-03-17 18:19         ` Elijah Newren
@ 2021-03-17 20:43         ` Derrick Stolee
  2021-03-24  0:52           ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 203+ messages in thread
From: Derrick Stolee @ 2021-03-17 20:43 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, git
  Cc: newren, gitster, pclouds, jrnieder, Martin Ågren,
	SZEDER Gábor, Derrick Stolee, dstolee

On 3/17/2021 9:28 AM, Ævar Arnfjörð Bjarmason wrote:
> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh

I want to learn from your suggested changes to the test, here,
so forgive my questions here:
  
> +test_index_entry_like () {
> +	dir=$1
> +	shift
> +	fmt=$1
> +	shift
> +	rev=$1
> +	shift
> +	entry=$1
> +	shift
> +	file=$1
> +	shift

Why all the shifts? Why not just use $1, $2, $3,...? My
guess is that you want to be able to insert a new parameter
in the middle in the future without changing the later
numbers, but that seems unlikely, and we could just add
the parameter at the end.

> +	hash=$(git -C "$dir" rev-parse "$rev") &&
> +	printf "$fmt\n" "$hash" "$entry" >expected &&
> +	if grep "$entry" "$file" >line
> +	then
> +		test_cmp expected line
> +	else
> +		cat cache &&
> +		false
> +	fi
> +}
> +
>  test_expect_success 'sparse-index contents' '
>  	init_repos &&
>  
> -	test-tool -C sparse-index read-cache --table >cache &&
> +	git -C sparse-index ls-files --sparse >cache &&
>  	for dir in folder1 folder2 x
>  	do
> -		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
> -		grep "040000 tree $TREE	$dir/" cache \
> -			|| return 1
> +		test_index_entry_like sparse-index "040000 %s 0\t%s" "HEAD:$dir" "$dir/" cache || return 1

I see how this uses only one line, but it seems like the
test_index_entry_like is too generic to make it not a
complicated mess of format strings that need to copy
over and over again.

Perhaps instead it could be a "test_entry_is_tree"
and it only passes "$dir" and "cache"? Then we could drop the loop and
just have

	test_entry_is_tree cache folder1 &&
	test_entry_is_tree cache folder2 &&
	test_entry_is_tree cache x &&

or we could still use the loop, especially when we test for four trees.

> -	test-tool -C sparse-index read-cache --table >cache &&
> +	git -C sparse-index ls-files --sparse >cache &&
>  	for dir in deep/deeper2 folder1 folder2 x
>  	do
> -		TREE=$(git -C sparse-index rev-parse HEAD:$dir) &&
> -		grep "040000 tree $TREE	$dir/" cache \
> -			|| return 1
> +		test_index_entry_like sparse-index "040000 %s 0\t%s" "HEAD:$dir" "$dir/" cache || return 1
>  	done &&
>  
> +	grep 040000 cache >lines &&
> +	test_line_count = 4 lines &&
> +

The point here is to check that no other entries are trees? We know
that this number will be _at least_ 4 based on the loop above.

Thanks,
-Stolee

^ permalink