git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / mirror / code / Atom feed
* [RFC PATCH 0/5] Remove the present-despite-SKIP_WORKTREE class of bugs
@ 2022-01-09  4:57 Elijah Newren
  2022-01-09  4:57 ` [RFC PATCH 1/5] t1011: add testcase demonstrating accidental loss of user modifications Elijah Newren
                   ` (4 more replies)
  0 siblings, 5 replies; 11+ messages in thread
From: Elijah Newren @ 2022-01-09  4:57 UTC (permalink / raw)
  To: git; +Cc: Victoria Dye, Derrick Stolee, Lessley Dennington, Elijah Newren

(Maintainer note: This series builds on vd/sparse-clean-etc, because it
tweaks one of the testcases added there.)

Files in the present-despite-SKIP_WORKTREE state have caused no ends of
discussions and bugs[1,2,3,4,5,6,...and lots of others].  Trying to
address the big issue of discovering & recovering from this state has
befuddled me for over a year because I was worried we'd need additional
code at every skip_worktree-checking path in the code (and they are all
over the place), and that we'd make the code significantly slower unless
we plumbed a bunch of additional information all over the place to allow
some reasonable optimizations.

This series tries to solve the problem a bit differently by automatic
early discovery and recovery; as it result, it greatly simplifies the
landscape, reduces our testing matrix burden, and fixes a large swath of
bugs.  It does have a cost, though.  See the commit message of patch 3
(the crux of this series) for the details.

Quick overview:

  * Patches 1 & 2 add a test to demonstrate accidental deletion of
    possibly-modified files, and then fix the bug.
  * Patch 3 is the crux of this series; a small amount of code with a
    huge commit message
  * Patch 4 updates the documentation
  * Patch 5 adds some optimizations to reduce the performance impact of
    patch 3

[1] https://lore.kernel.org/git/CABPp-BFnFpzwGC11TLoLs8YK5yiisA5D5-fFjXnJsbESVDwZsA@mail.gmail.com/
[2] https://lore.kernel.org/git/CABPp-BGJ_Nvi5TmgriD9Bh6eNXE2EDq2f8e8QKXAeYG3BxZafA@mail.gmail.com/
[3] https://lore.kernel.org/git/pull.809.git.git.1592356884310.gitgitgadget@gmail.com/
[4] commit 66b209b86a ("merge-ort: implement CE_SKIP_WORKTREE handling with conflicted entries", 2021-03-20)
[5] commit ba359fd507 ("stash: fix stash application in sparse-checkouts", 2020-12-01)
[6] https://lore.kernel.org/git/CABPp-BH9tju7WVm=QZDOvaMDdZbpNXrVWQdN-jmfN8wC6YVhmw@mail.gmail.com/

[Final note: This series and test results can also be seen at
https://github.com/gitgitgadget/git/pull/1113; gitgitgadget wanted me to
"rewrap" the lines with >80 character URLS in commit messages, and I
refuse to remove the URLs or break them by inserting newlines, so I'm
sending it manually instead.]

Elijah Newren (5):
  t1011: add testcase demonstrating accidental loss of user
    modifications
  unpack-trees: fix accidental loss of user changes
  repo_read_index: ensure SKIP_WORKTREE means skip worktree
  Update documentation related to sparsity and the skip-worktree bit
  Accelerate ensure_skip_worktree_means_skip_worktree by caching

 Documentation/git-read-tree.txt          |  12 ++-
 Documentation/git-sparse-checkout.txt    |  76 ++++++++------
 Documentation/git-update-index.txt       |  55 +++++++---
 repository.c                             |   7 ++
 sparse-index.c                           | 123 +++++++++++++++++++++++
 sparse-index.h                           |   1 +
 t/t1011-read-tree-sparse-checkout.sh     |  23 ++++-
 t/t1092-sparse-checkout-compatibility.sh |  16 +--
 t/t3705-add-sparse-checkout.sh           |   2 +
 t/t6428-merge-conflicts-sparse.sh        |  23 +----
 t/t7012-skip-worktree-writing.sh         |  44 ++------
 t/t7817-grep-sparse-checkout.sh          |  11 +-
 unpack-trees.c                           |   4 +-
 13 files changed, 284 insertions(+), 113 deletions(-)

-- 
2.34.1.442.ge63c19bdd2.dirty


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC PATCH 1/5] t1011: add testcase demonstrating accidental loss of user modifications
  2022-01-09  4:57 [RFC PATCH 0/5] Remove the present-despite-SKIP_WORKTREE class of bugs Elijah Newren
@ 2022-01-09  4:57 ` Elijah Newren
  2022-01-09  4:57 ` [RFC PATCH 2/5] unpack-trees: fix accidental loss of user changes Elijah Newren
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 11+ messages in thread
From: Elijah Newren @ 2022-01-09  4:57 UTC (permalink / raw)
  To: git; +Cc: Victoria Dye, Derrick Stolee, Lessley Dennington, Elijah Newren

If a user has a file with local modifications that is not marked as
SKIP_WORKTREE, but the sparsity patterns are such that it should be
marked that way, and the user then invokes a command like

   * git checkout -q HEAD^

or

   * git read-tree -mu HEAD^

Then the file will be deleted along with all the users' modifications.
Add a testcase demonstrating this problem.

Note: This bug only triggers if something other than 'HEAD' is given;
if the commands above had specified 'HEAD', then the users' file would
be left alone.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 t/t1011-read-tree-sparse-checkout.sh | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/t/t1011-read-tree-sparse-checkout.sh b/t/t1011-read-tree-sparse-checkout.sh
index 24092c09a9..1b2395b8a8 100755
--- a/t/t1011-read-tree-sparse-checkout.sh
+++ b/t/t1011-read-tree-sparse-checkout.sh
@@ -187,6 +187,27 @@ test_expect_success 'read-tree updates worktree, absent case' '
 	test ! -f init.t
 '
 
+test_expect_success 'read-tree will not throw away dirty changes, non-sparse' '
+	echo "/*" >.git/info/sparse-checkout &&
+	read_tree_u_must_succeed -m -u HEAD &&
+
+	echo dirty >init.t &&
+	read_tree_u_must_fail -m -u HEAD^ &&
+	test_path_is_file init.t &&
+	grep -q dirty init.t
+'
+
+test_expect_failure 'read-tree will not throw away dirty changes, sparse' '
+	echo "/*" >.git/info/sparse-checkout &&
+	read_tree_u_must_succeed -m -u HEAD &&
+
+	echo dirty >init.t &&
+	echo sub/added >.git/info/sparse-checkout &&
+	read_tree_u_must_fail -m -u HEAD^ &&
+	test_path_is_file init.t &&
+	grep -q dirty init.t
+'
+
 test_expect_success 'read-tree updates worktree, dirty case' '
 	echo sub/added >.git/info/sparse-checkout &&
 	git checkout -f top &&
-- 
2.34.1.442.ge63c19bdd2.dirty


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC PATCH 2/5] unpack-trees: fix accidental loss of user changes
  2022-01-09  4:57 [RFC PATCH 0/5] Remove the present-despite-SKIP_WORKTREE class of bugs Elijah Newren
  2022-01-09  4:57 ` [RFC PATCH 1/5] t1011: add testcase demonstrating accidental loss of user modifications Elijah Newren
@ 2022-01-09  4:57 ` Elijah Newren
  2022-01-09  4:57 ` [RFC PATCH 3/5] repo_read_index: ensure SKIP_WORKTREE means skip worktree Elijah Newren
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 11+ messages in thread
From: Elijah Newren @ 2022-01-09  4:57 UTC (permalink / raw)
  To: git; +Cc: Victoria Dye, Derrick Stolee, Lessley Dennington, Elijah Newren

For sparse-checkouts, we don't want unpack-trees to error out on files
that are missing from the worktree, so there has traditionally been
logic to make it skip the verify_uptodate() check for these.
Unfortunately, it was skipping the verify_uptodate() check for files
that were expected to *become* SKIP_WORKTREE.  For files that were not
already SKIP_WORKTREE, that can cause us to later delete the file in
apply_sparse_checkout().  Only skip the check for files that were
already SKIP_WORKTREE as well to avoid lightly discarding important
changes users may have made to files.

Note 1: unpack-trees.c is already a bit complex, and the logic around
CE_SKIP_WORKTREE and CE_NEW_SKIP_WORKTREE in that file are no exception.
I also tried just replacing CE_NEW_SKIP_WORKTREE with CE_SKIP_WORKTREE
in the verify_uptodate() check instead of checking for both flags, and
found that it also fixed this bug and passed all the tests.  I also
attempted to devise a few testcases that might trip either variant of my
fix and was unable to find any problems.  It may be that just checking
CE_SKIP_WORKTREE is a better fix, but I'm not sure.  I thought it
was a bit safer to strictly reduce the number of cases where we skip the
up-to-date check rather than just toggling which kind of cases skip it,
and thus went with the current variant of the fix.

Note 2: I also wondered if verify_absent() might have a similar bug, but
despite my attempts to try to devise a testcase that would trigger such
a thing, I couldn't find any problematic testcases.  Thus, this patch
makes no attempt to apply similar changes to verify_absent() and
verify_absent_if_directory().

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 t/t1011-read-tree-sparse-checkout.sh | 2 +-
 unpack-trees.c                       | 4 +++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/t/t1011-read-tree-sparse-checkout.sh b/t/t1011-read-tree-sparse-checkout.sh
index 1b2395b8a8..4ed0885bf2 100755
--- a/t/t1011-read-tree-sparse-checkout.sh
+++ b/t/t1011-read-tree-sparse-checkout.sh
@@ -197,7 +197,7 @@ test_expect_success 'read-tree will not throw away dirty changes, non-sparse' '
 	grep -q dirty init.t
 '
 
-test_expect_failure 'read-tree will not throw away dirty changes, sparse' '
+test_expect_success 'read-tree will not throw away dirty changes, sparse' '
 	echo "/*" >.git/info/sparse-checkout &&
 	read_tree_u_must_succeed -m -u HEAD &&
 
diff --git a/unpack-trees.c b/unpack-trees.c
index 98e2f2e0e6..6d9d89c662 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -2059,7 +2059,9 @@ static int verify_uptodate_1(const struct cache_entry *ce,
 int verify_uptodate(const struct cache_entry *ce,
 		    struct unpack_trees_options *o)
 {
-	if (!o->skip_sparse_checkout && (ce->ce_flags & CE_NEW_SKIP_WORKTREE))
+	if (!o->skip_sparse_checkout &&
+	    (ce->ce_flags & CE_SKIP_WORKTREE) &&
+	    (ce->ce_flags & CE_NEW_SKIP_WORKTREE))
 		return 0;
 	return verify_uptodate_1(ce, o, ERROR_NOT_UPTODATE_FILE);
 }
-- 
2.34.1.442.ge63c19bdd2.dirty


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC PATCH 3/5] repo_read_index: ensure SKIP_WORKTREE means skip worktree
  2022-01-09  4:57 [RFC PATCH 0/5] Remove the present-despite-SKIP_WORKTREE class of bugs Elijah Newren
  2022-01-09  4:57 ` [RFC PATCH 1/5] t1011: add testcase demonstrating accidental loss of user modifications Elijah Newren
  2022-01-09  4:57 ` [RFC PATCH 2/5] unpack-trees: fix accidental loss of user changes Elijah Newren
@ 2022-01-09  4:57 ` Elijah Newren
  2022-01-10 20:38   ` Victoria Dye
  2022-01-09  4:57 ` [RFC PATCH 4/5] Update documentation related to sparsity and the skip-worktree bit Elijah Newren
  2022-01-09  4:57 ` [RFC PATCH 5/5] Accelerate ensure_skip_worktree_means_skip_worktree by caching Elijah Newren
  4 siblings, 1 reply; 11+ messages in thread
From: Elijah Newren @ 2022-01-09  4:57 UTC (permalink / raw)
  To: git; +Cc: Victoria Dye, Derrick Stolee, Lessley Dennington, Elijah Newren

The fix is short (~30 lines), but the description is not.  Sorry.

There is a set of problems caused by files in what I'll refer to as the
"present-despite-SKIP_WORKTREE" state.  This commit aims to not just fix
these problems, but remove the entire class as a possibility -- for
those using sparse checkouts.  But first, we need to understand the
problems this class presents.  A quick outline:

   * Problems
     * User facing issues
     * Problem space complexity
     * Maintenance and code correctness challenges
   * SKIP_WORKTREE expectations in Git
   * Suggested solution
   * Pros/Cons of suggested solution
   * Notes on testcase modifications

=== User facing issues ===

There are various ways for users to get files to be present in the
working copy despite having the SKIP_WORKTREE bit set for that file in
the index.  This may come from:
  * various git commands not really supporting the SKIP_WORKTREE bit[1,2]
  * users grabbing files from elsewhere and writing them to the worktree
    (perhaps even cached in their editor)
  * users attempting to "abort" a sparse-checkout operation with a
    not-so-early Ctrl+C (updating $GIT_DIR/info/sparse-checkout and the
    working tree is not atomic)[3].

Once users have present-despite-SKIP_WORKTREE files, any modifications
users make to these files will be ignored, possibly to users' confusion.

Further:
  * these files will not be updated by by standard commands
    (switch/checkout/pull/merge/rebase will leave them alone unless
    conflicts happen -- and even then, the conflicted file may be
    written somewhere else to avoid overwriting the SKIP_WORKTREE file
    that is present and in the way)
  * there is nothing in Git that users can use to discover such
    files (status, diff, grep, etc. all ignore it)
  * there is no reasonable mechanism to "recover" from such a condition
    (neither `git sparse-checkout reapply` nor `git reset --hard` will
    correct it).

So, not only are users modifications ignored, but the files get
progressively more stale over time.  At some point in the future, they
may change their sparseness specification or disable sparse-checkouts.
At that time, all present-despite-SKIP_WORKTREE files will show up as
having lots of modifications because they represent a version from a
different branch or commit.  These might include user-made local changes
from days before, but the only way to tell is to have users look through
them all closely.

If these users come to others for help, there will be no logs that
explain the issue; it's just a mysterious list of changes.  Users might
adamantly claim (correctly, as it turns out) that they didn't modify
these files, while others presume they did.

[1] https://lore.kernel.org/git/xmqqbmb1a7ga.fsf@gitster-ct.c.googlers.com/
[2] https://lore.kernel.org/git/CABPp-BH9tju7WVm=QZDOvaMDdZbpNXrVWQdN-jmfN8wC6YVhmw@mail.gmail.com/
[3] https://lore.kernel.org/git/CABPp-BFnFpzwGC11TLoLs8YK5yiisA5D5-fFjXnJsbESVDwZsA@mail.gmail.com/

=== Problem space complexity ===

SKIP_WORKTREE has been part of Git for over a decade.  Duy did lots of
work on it initially, and several others have since come along and put
lots of work into it.  Stolee spent most of 2021 on the sparse-index,
with lots of bugfixes along the way including to non-sparse-index cases
as we are still trying to get sparse checkouts to behave reasonably.
Basically every codepath throughout the treat needs to be aware of an
additional type of file: tracked-but-not-present.  The extra type
results in lots of extra testcases and lots of extra code everywhere.

But, the sad thing is that we actually have more than one extra type.
We have tracked, tracked-but-not-present (SKIP_WORKTREE), and
tracked-but-promised-to-not-be-present-but-is-present-anyway
(present-despite-SKIP_WORKTREE).  Two types is a monumental amount of
effort to support, and adding a third feels a bit like insanity[4].

[4] Some examples of which can be seen at
    https://lore.kernel.org/git/CABPp-BGJ_Nvi5TmgriD9Bh6eNXE2EDq2f8e8QKXAeYG3BxZafA@mail.gmail.com/

=== Maintenance and code correctness challenges ===

Matheus' patches to grep stalled for nearly a year, in part because of
complications of how to handle sparse-checkouts appropriately in all
cases[5][6] (with trying to sanely figure out how to sanely handle
present-despite-SKIP_WORKTREE files being one of the complications).
His rm/add follow-ups also took months because of those kinds of
issues[7].  The corner cases with things like submodules and
SKIP_WORKTREE with the addition of present-despite-SKIP_WORKTREE start
becoming really complex[8].

We've had to add ugly logic to merge-ort to attempt to handle
present-despite-SKIP_WORKTREE files[9], and basically just been forced
to give up in merge-recursive knowing full well that we'll sometimes
silently discard user modifications.  Despite stash essentially being a
merge, it needed extra code (beyond what was in merge-ort and
merge-recursive) to manually tweak SKIP_WORKTREE bits in order to avoid
a few different bugs that'd result in an early abort with a partial
stash application[10].

[5] See https://lore.kernel.org/git/5f3f7ac77039d41d1692ceae4b0c5df3bb45b74a.1612901326.git.matheus.bernardino@usp.br/#t
    and the dates on the thread; also Matheus and I had several
    conversations off-list trying to resolve the issues over that time
[6] ...it finally kind of got unstuck after
    https://lore.kernel.org/git/CABPp-BGJ_Nvi5TmgriD9Bh6eNXE2EDq2f8e8QKXAeYG3BxZafA@mail.gmail.com/
[7] See for example
    https://lore.kernel.org/git/CABPp-BHwNoVnooqDFPAsZxBT9aR5Dwk5D9sDRCvYSb8akxAJgA@mail.gmail.com/#t
    and quotes like "The core functionality of sparse-checkout has always
    been only partially implemented", a statement I still believe is true
    today.
[8] https://lore.kernel.org/git/pull.809.git.git.1592356884310.gitgitgadget@gmail.com/
[9] See commit 66b209b86a ("merge-ort: implement CE_SKIP_WORKTREE
    handling with conflicted entries", 2021-03-20)
[10] See commit ba359fd507 ("stash: fix stash application in
     sparse-checkouts", 2020-12-01)

=== SKIP_WORKTREE expectations in Git ===

A couple quotes:

From [11] (before the "sparse-checkout" command existed):
   If it needs too many special cases, hacks, and conditionals, then it
   is not worth the complexity---if it is easier to write a correct code
   by allowing Git to populate working tree files, it is perfectly fine
   to do so.

   In a sense, the sparse checkout "feature" itself is a hack by itself,
   and that is why I think this part should be "best effort" as well.

From the git-sparse-checkout manual (still present today):

   THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR, AND THE BEHAVIOR OF OTHER
   COMMANDS IN THE PRESENCE OF SPARSE-CHECKOUTS, WILL LIKELY CHANGE IN
   THE FUTURE.

[11] https://lore.kernel.org/git/xmqqbmb1a7ga.fsf@gitster-ct.c.googlers.com/

=== Suggested solution ===

SKIP_WORKTREE was written to allow sparse-checkouts, in particular, as
the name of the option implies, to allow the file to NOT be in the
worktree but consider it to be unchanged rather than deleted.

The suggests a simple solution: present-despite-SKIP_WORKTREE files
should not exist, for those using sparse-checkouts.

Enforce this at index loading time by checking if core.sparseCheckout is
true; if so, check files in the index with the SKIP_WORKTREE bit set to
verify that they are absent from the working tree.  If they are present,
unset the bit (in memory, though any commands that write to the index
will record the update).

Users can, of course, can get the SKIP_WORKTREE bit back such as by
running `git sparse-checkout reapply` (if they have ensured the file is
unmodified and doesn't match the specified sparsity patterns).

=== Pros/Cons of suggested solution ===

Pros:

  * Solves the user visible problems reported above, which I've been
    complaining about for nearly a year but couldn't find a solution to.
  * Much easier behavior in sparse-checkouts for users to reason about
  * Very simple, ~30 lines of code.
  * Significantly simplifies some ugly testcases, and obviates the need
    to test an entire class of potential issues.
  * Reduces code complexity, reasoning, and maintenance.  Avoids
    disagreements about weird corner cases[12].
  * It has been reported that some users might be (ab)using
    SKIP_WORKTREE as a let-me-modify-but-keep-the-file-in-the-worktree
    mechanism[13, and a few other similar references].  These users know
    of multiple caveats and shortcomings in doing so; perhaps not
    surprising given the "SKIP_WORKTREE expecations" section above.
    However, these users use `git update-index --skip-worktree`, and not
    `git sparse-checkout` or core.sparseCheckout=true.  As such, these
    users would be unaffected by this change and can continue abusing
    the system as before.

[12] https://lore.kernel.org/git/CABPp-BH9tju7WVm=QZDOvaMDdZbpNXrVWQdN-jmfN8wC6YVhmw@mail.gmail.com/
[13] https://stackoverflow.com/questions/13630849/git-difference-between-assume-unchanged-and-skip-worktree

Cons:

  * When core.sparseCheckout is enabled, this adds a performance cost to
    reading the index.  I'll defer discussion of this cost to a subsequent
    patch, since I have some optimizations to add.

=== Notes on testcase modifications ===

The good:
  * t1011: Compare to two cases above it ('read-tree will not throw away
    dirty changes, non-sparse'); since the file is present, it should
    match the non-sparse case now
  * t1092: sparse-index & sparse-checkout now match full-worktree
    behavior in more cases!  Yaay for consistency!
  * t6428, t7012: look at how much simpler the tests become!  Merge and
    stash can just fail early telling the user there's a file in the
    way, instead of not noticing until it's about to write a file and
    then have to implement sudden crash avoidance.  Hurray for sanity!
  * t7817: sparse behavior better matches full tree behavior.  Hurray
    for sanity!

The confusing:
  * t3705: These changes were ONLY needed on Windows, but they don't
    hurt other platforms.  Let's discuss each individually:

    * core.sparseCheckout should be false by default.  Nothing in this
      testcase toggles that until many, many tests later.  However,
      early tests (#5 in particular) were testing `update-index
      --skip-worktree` behavior in a non-sparse-checkout, but the
      Windows tests in CI were behaving as if core.sparseCheckout=true
      had been specified somewhere.  I do not have access to a Windows
      machine.  But I just manually did what should have been a no-op
      and turned the config off.  And it fixed the test.
    * I have no idea why the leftover .gitattributes file from this
      test was causing failures for test #18 on Windows, but only with
      these changes of mine.  Test #18 was checking for empty stderr,
      and specifically wanted to know that some error completely
      unrelated to file endings did not appear.  The leftover
      .gitattributes file thus caused some spurious stderr unrelated to
      the thing being checked.  Since other tests did not intend to
      test normalization, just proactively remove the .gitattributes
      file.  I'm certain this is cleaner and better, I'm just unsure
      why/how this didn't trigger problems before.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 repository.c                             |  7 ++++
 sparse-index.c                           | 22 ++++++++++++
 sparse-index.h                           |  1 +
 t/t1011-read-tree-sparse-checkout.sh     |  2 +-
 t/t1092-sparse-checkout-compatibility.sh | 16 ++++-----
 t/t3705-add-sparse-checkout.sh           |  2 ++
 t/t6428-merge-conflicts-sparse.sh        | 23 +++----------
 t/t7012-skip-worktree-writing.sh         | 44 ++++--------------------
 t/t7817-grep-sparse-checkout.sh          | 11 ++++--
 9 files changed, 62 insertions(+), 66 deletions(-)

diff --git a/repository.c b/repository.c
index 34610c5a33..dfd1911902 100644
--- a/repository.c
+++ b/repository.c
@@ -301,6 +301,13 @@ int repo_read_index(struct repository *repo)
 	if (repo->settings.command_requires_full_index)
 		ensure_full_index(repo->index);
 
+	/*
+	 * If sparse checkouts are in use, check whether paths with the
+	 * SKIP_WORKTREE attribute are missing from the worktree; if not,
+	 * clear that attribute for that path.
+	 */
+	ensure_skip_worktree_means_skip_worktree(repo->index);
+
 	return res;
 }
 
diff --git a/sparse-index.c b/sparse-index.c
index a1d505d50e..79d50e444c 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -341,6 +341,28 @@ void ensure_correct_sparsity(struct index_state *istate)
 		ensure_full_index(istate);
 }
 
+void ensure_skip_worktree_means_skip_worktree(struct index_state *istate)
+{
+	int i;
+	if (!core_apply_sparse_checkout)
+		return;
+
+restart:
+	for (i = 0; i < istate->cache_nr; i++) {
+		struct cache_entry *ce = istate->cache[i];
+		struct stat st;
+
+		if (ce_skip_worktree(ce) && !lstat(ce->name, &st)) {
+			if (S_ISSPARSEDIR(ce->ce_mode)) {
+				ensure_full_index(istate);
+				goto restart;
+			}
+			ce->ce_flags &= ~CE_SKIP_WORKTREE;
+		}
+	}
+}
+
+
 /*
  * This static global helps avoid infinite recursion between
  * expand_to_path() and index_file_exists().
diff --git a/sparse-index.h b/sparse-index.h
index 656bd835b2..1007859ed4 100644
--- a/sparse-index.h
+++ b/sparse-index.h
@@ -5,6 +5,7 @@ struct index_state;
 #define SPARSE_INDEX_MEMORY_ONLY (1 << 0)
 int convert_to_sparse(struct index_state *istate, int flags);
 void ensure_correct_sparsity(struct index_state *istate);
+void ensure_skip_worktree_means_skip_worktree(struct index_state *istate);
 
 /*
  * Some places in the codebase expect to search for a specific path.
diff --git a/t/t1011-read-tree-sparse-checkout.sh b/t/t1011-read-tree-sparse-checkout.sh
index 4ed0885bf2..dd957be1b7 100755
--- a/t/t1011-read-tree-sparse-checkout.sh
+++ b/t/t1011-read-tree-sparse-checkout.sh
@@ -212,7 +212,7 @@ test_expect_success 'read-tree updates worktree, dirty case' '
 	echo sub/added >.git/info/sparse-checkout &&
 	git checkout -f top &&
 	echo dirty >init.t &&
-	read_tree_u_must_succeed -m -u HEAD^ &&
+	read_tree_u_must_fail -m -u HEAD^ &&
 	grep -q dirty init.t &&
 	rm init.t
 '
diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
index 0863c9747c..6f8538bb4c 100755
--- a/t/t1092-sparse-checkout-compatibility.sh
+++ b/t/t1092-sparse-checkout-compatibility.sh
@@ -370,7 +370,7 @@ test_expect_success 'status/add: outside sparse cone' '
 	write_script edit-contents <<-\EOF &&
 	echo text >>$1
 	EOF
-	run_on_sparse ../edit-contents folder1/a &&
+	run_on_all ../edit-contents folder1/a &&
 	run_on_all ../edit-contents folder1/new &&
 
 	test_sparse_match git status --porcelain=v2 &&
@@ -379,8 +379,8 @@ test_expect_success 'status/add: outside sparse cone' '
 	test_sparse_match test_must_fail git add folder1/a &&
 	grep "Disable or modify the sparsity rules" sparse-checkout-err &&
 	test_sparse_unstaged folder1/a &&
-	test_sparse_match test_must_fail git add --refresh folder1/a &&
-	grep "Disable or modify the sparsity rules" sparse-checkout-err &&
+	test_all_match git add --refresh folder1/a &&
+	test_must_be_empty sparse-checkout-err &&
 	test_sparse_unstaged folder1/a &&
 	test_sparse_match test_must_fail git add folder1/new &&
 	grep "Disable or modify the sparsity rules" sparse-checkout-err &&
@@ -642,11 +642,11 @@ test_expect_success 'update-index modify outside sparse definition' '
 	run_on_sparse cp ../initial-repo/folder1/a folder1/a &&
 	run_on_all ../edit-contents folder1/a &&
 
-	# If file has skip-worktree enabled, update-index does not modify the
-	# index entry
-	test_sparse_match git update-index folder1/a &&
-	test_sparse_match git status --porcelain=v2 &&
-	test_must_be_empty sparse-checkout-out &&
+	# If file has skip-worktree enabled, but the file is present, it is
+	# treated the same as if skip-worktree is disabled
+	test_all_match git status --porcelain=v2 &&
+	test_all_match git update-index folder1/a &&
+	test_all_match git status --porcelain=v2 &&
 
 	# When skip-worktree is disabled (even on files outside sparse cone), file
 	# is updated in the index
diff --git a/t/t3705-add-sparse-checkout.sh b/t/t3705-add-sparse-checkout.sh
index f3143c9290..61506c1d7c 100755
--- a/t/t3705-add-sparse-checkout.sh
+++ b/t/t3705-add-sparse-checkout.sh
@@ -19,6 +19,7 @@ setup_sparse_entry () {
 	fi &&
 	git add sparse_entry &&
 	git update-index --skip-worktree sparse_entry &&
+	git config core.sparseCheckout false &&
 	git commit --allow-empty -m "ensure sparse_entry exists at HEAD" &&
 	SPARSE_ENTRY_BLOB=$(git rev-parse :sparse_entry)
 }
@@ -126,6 +127,7 @@ test_expect_success 'git add --chmod does not update sparse entries' '
 '
 
 test_expect_success 'git add --renormalize does not update sparse entries' '
+	test_when_finished rm .gitattributes &&
 	test_config core.autocrlf false &&
 	setup_sparse_entry "LINEONE\r\nLINETWO\r\n" &&
 	echo "sparse_entry text=auto" >.gitattributes &&
diff --git a/t/t6428-merge-conflicts-sparse.sh b/t/t6428-merge-conflicts-sparse.sh
index 7e8bf497f8..142c9aaabc 100755
--- a/t/t6428-merge-conflicts-sparse.sh
+++ b/t/t6428-merge-conflicts-sparse.sh
@@ -112,7 +112,7 @@ test_expect_success 'conflicting entries written to worktree even if sparse' '
 	)
 '
 
-test_expect_merge_algorithm failure success 'present-despite-SKIP_WORKTREE handled reasonably' '
+test_expect_success 'present-despite-SKIP_WORKTREE handled reasonably' '
 	test_setup_numerals in_the_way &&
 	(
 		cd numerals_in_the_way &&
@@ -132,26 +132,13 @@ test_expect_merge_algorithm failure success 'present-despite-SKIP_WORKTREE handl
 
 		test_must_fail git merge -s recursive B^0 &&
 
-		git ls-files -t >index_files &&
-		test_cmp expected-index index_files &&
+		test_path_is_missing .git/MERGE_HEAD &&
 
-		test_path_is_file README &&
 		test_path_is_file numerals &&
 
-		test_cmp expected-merge numerals &&
-
-		# There should still be a file with "foobar" in it
-		grep foobar * &&
-
-		# 5 other files:
-		#   * expected-merge
-		#   * expected-index
-		#   * index_files
-		#   * others
-		#   * whatever name was given to the numerals file that had
-		#     "foobar" in it
-		git ls-files -o >others &&
-		test_line_count = 5 others
+		# numerals should still have "foobar" in it
+		echo foobar >expect &&
+		test_cmp expect numerals
 	)
 '
 
diff --git a/t/t7012-skip-worktree-writing.sh b/t/t7012-skip-worktree-writing.sh
index a1080b94e3..cb9f1a6981 100755
--- a/t/t7012-skip-worktree-writing.sh
+++ b/t/t7012-skip-worktree-writing.sh
@@ -171,50 +171,20 @@ test_expect_success 'stash restore in sparse checkout' '
 
 		# Put a file in the working directory in the way
 		echo in the way >modified &&
-		git stash apply &&
+		test_must_fail git stash apply 2>error&&
 
-		# Ensure stash vivifies modifies paths...
-		cat >expect <<-EOF &&
-		H addme
-		H modified
-		H removeme
-		H subdir/A
-		S untouched
-		EOF
-		git ls-files -t >actual &&
-		test_cmp expect actual &&
+		grep "changes.*would be overwritten by merge" error &&
 
-		# ...and that the paths show up in status as changed...
-		cat >expect <<-EOF &&
-		A  addme
-		 M modified
-		 D removeme
-		 M subdir/A
-		?? actual
-		?? expect
-		?? modified.stash.XXXXXX
-		EOF
-		git status --porcelain | \
-			sed -e s/stash......./stash.XXXXXX/ >actual &&
-		test_cmp expect actual &&
+		echo in the way >expect &&
+		test_cmp expect modified &&
+		git diff --quiet HEAD ":!modified" &&
 
 		# ...and that working directory reflects the files correctly
-		test_path_is_file    addme &&
+		test_path_is_missing addme &&
 		test_path_is_file    modified &&
 		test_path_is_missing removeme &&
 		test_path_is_file    subdir/A &&
-		test_path_is_missing untouched &&
-
-		# ...including that we have the expected "modified" file...
-		cat >expect <<-EOF &&
-		modified
-		tweaked
-		EOF
-		test_cmp expect modified &&
-
-		# ...and that the other "modified" file is still present...
-		echo in the way >expect &&
-		test_cmp expect modified.stash.*
+		test_path_is_missing untouched
 	)
 '
 
diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
index 590b99bbb6..eb59564565 100755
--- a/t/t7817-grep-sparse-checkout.sh
+++ b/t/t7817-grep-sparse-checkout.sh
@@ -83,10 +83,13 @@ test_expect_success 'setup' '
 
 # The test below covers a special case: the sparsity patterns exclude '/b' and
 # sparse checkout is enabled, but the path exists in the working tree (e.g.
-# manually created after `git sparse-checkout init`). git grep should skip it.
+# manually created after `git sparse-checkout init`).  Although b is marked
+# as SKIP_WORKTREE, git grep should notice it IS present in the worktree and
+# report it.
 test_expect_success 'working tree grep honors sparse checkout' '
 	cat >expect <<-EOF &&
 	a:text
+	b:new-text
 	EOF
 	test_when_finished "rm -f b" &&
 	echo "new-text" >b &&
@@ -126,12 +129,16 @@ test_expect_success 'grep --cached searches entries with the SKIP_WORKTREE bit'
 '
 
 # Note that sub2/ is present in the worktree but it is excluded by the sparsity
-# patterns, so grep should not recurse into it.
+# patterns.  We also explicitly mark it as SKIP_WORKTREE in case it got cleared
+# by previous git commands.  Thus sub2 starts as SKIP_WORKTREE but since it is
+# present in the working tree, grep should recurse into it.
 test_expect_success 'grep --recurse-submodules honors sparse checkout in submodule' '
 	cat >expect <<-EOF &&
 	a:text
 	sub/B/b:text
+	sub2/a:text
 	EOF
+	git update-index --skip-worktree sub2 &&
 	git grep --recurse-submodules "text" >actual &&
 	test_cmp expect actual
 '
-- 
2.34.1.442.ge63c19bdd2.dirty


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC PATCH 4/5] Update documentation related to sparsity and the skip-worktree bit
  2022-01-09  4:57 [RFC PATCH 0/5] Remove the present-despite-SKIP_WORKTREE class of bugs Elijah Newren
                   ` (2 preceding siblings ...)
  2022-01-09  4:57 ` [RFC PATCH 3/5] repo_read_index: ensure SKIP_WORKTREE means skip worktree Elijah Newren
@ 2022-01-09  4:57 ` Elijah Newren
  2022-01-09  4:57 ` [RFC PATCH 5/5] Accelerate ensure_skip_worktree_means_skip_worktree by caching Elijah Newren
  4 siblings, 0 replies; 11+ messages in thread
From: Elijah Newren @ 2022-01-09  4:57 UTC (permalink / raw)
  To: git; +Cc: Victoria Dye, Derrick Stolee, Lessley Dennington, Elijah Newren

Make several small updates, to address a few documentation issues
I spotted:
  * sparse-checkout focused on "patterns" even though the inputs (and
    outputs in the case of `list`) are directories in cone-mode
  * The description section of the sparse-checkout documentation
    was a bit sparse (no pun intended), and focused more on internal
    mechanics rather than end user usage.  This made sense in the
    early days when the command was even more experimental, but let's
    adjust a bit to try to make it more approachable to end users who
    may want to consider using it.  Keep the scary backward
    compatibility warning, though; we're still hard at work trying to
    fix up commands to behave reasonably in sparse checkouts.
  * both read-tree and update-index tried to describe how to use the
    skip-worktree bit, but both predated the sparse-checkout command.
    The sparse-checkout command is a far easier mechanism to use and
    for users trying to reduce the size of their working tree, we
    should recommend users to look at it instead.
  * The update-index documentation pointed out that assume-unchanged
    and skip-worktree sounded similar but had different purposes.
    However, it made no attempt to explain the differences, only to
    point out that they were different.  Explain the differences.
  * The update-index documentation focused much more on (internal?)
    implementation details than on end-user usage.  Try to explain
    its purpose better for users of update-index, rather than
    fellow developers trying to work with the SKIP_WORKTREE bit.
  * Clarify that when core.sparseCheckout=true, we treat a file's
    presence in the working tree as being an override to the
    SKIP_WORKTREE bit (i.e. in sparse checkouts when the file is
    present we ignore the SKIP_WORKTREE bit).

Note that this commit, like many touching documentation, is best viewed
with the `--color-words` option to diff/log.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/git-read-tree.txt       | 12 +++--
 Documentation/git-sparse-checkout.txt | 76 ++++++++++++++++-----------
 Documentation/git-update-index.txt    | 55 ++++++++++++++-----
 3 files changed, 97 insertions(+), 46 deletions(-)

diff --git a/Documentation/git-read-tree.txt b/Documentation/git-read-tree.txt
index 8c3aceb832..99bb387134 100644
--- a/Documentation/git-read-tree.txt
+++ b/Documentation/git-read-tree.txt
@@ -375,9 +375,14 @@ have finished your work-in-progress), attempt the merge again.
 SPARSE CHECKOUT
 ---------------
 
+Note: The `update-index` and `read-tree` primitives for supporting the
+skip-worktree bit predated the introduction of
+linkgit:git-sparse-checkout[1].  Users are encouraged to use
+`sparse-checkout` in preference to these low-level primitives.
+
 "Sparse checkout" allows populating the working directory sparsely.
-It uses the skip-worktree bit (see linkgit:git-update-index[1]) to tell
-Git whether a file in the working directory is worth looking at.
+It uses the skip-worktree bit (see linkgit:git-update-index[1]) to
+tell Git whether a file in the working directory is worth looking at.
 
 'git read-tree' and other merge-based commands ('git merge', 'git
 checkout'...) can help maintaining the skip-worktree bitmap and working
@@ -385,7 +390,8 @@ directory update. `$GIT_DIR/info/sparse-checkout` is used to
 define the skip-worktree reference bitmap. When 'git read-tree' needs
 to update the working directory, it resets the skip-worktree bit in the index
 based on this file, which uses the same syntax as .gitignore files.
-If an entry matches a pattern in this file, skip-worktree will not be
+If an entry matches a pattern in this file, or the entry corresponds to
+a file present in the working tree, then skip-worktree will not be
 set on that entry. Otherwise, skip-worktree will be set.
 
 Then it compares the new skip-worktree value with the previous one. If
diff --git a/Documentation/git-sparse-checkout.txt b/Documentation/git-sparse-checkout.txt
index b81dbe0654..3da3d5a100 100644
--- a/Documentation/git-sparse-checkout.txt
+++ b/Documentation/git-sparse-checkout.txt
@@ -3,9 +3,7 @@ git-sparse-checkout(1)
 
 NAME
 ----
-git-sparse-checkout - Initialize and modify the sparse-checkout
-configuration, which reduces the checkout to a set of paths
-given by a list of patterns.
+git-sparse-checkout - Reduce your working tree to a subset of tracked files
 
 
 SYNOPSIS
@@ -17,8 +15,20 @@ SYNOPSIS
 DESCRIPTION
 -----------
 
-Initialize and modify the sparse-checkout configuration, which reduces
-the checkout to a set of paths given by a list of patterns.
+This command is used to create sparse checkouts, which means that it
+changes the working tree from having all tracked files present, to only
+have a subset of them.  It can also switch which subset of files are
+present, or undo and go back to having all tracked files present in the
+working copy.
+
+The subset of files is chosen by providing a list of directories in
+cone mode (which is recommended), or by providing a list of patterns
+in non-cone mode.
+
+When in a sparse-checkout, other Git commands behave a bit differently.
+For example, switching branches will not update paths outside the
+sparse-checkout directories/patterns, and `git commit -a` will not record
+paths outside the sparse-checkout directories/patterns as deleted.
 
 THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR, AND THE BEHAVIOR OF OTHER
 COMMANDS IN THE PRESENCE OF SPARSE-CHECKOUTS, WILL LIKELY CHANGE IN
@@ -28,7 +38,7 @@ THE FUTURE.
 COMMANDS
 --------
 'list'::
-	Describe the patterns in the sparse-checkout file.
+	Describe the directories or patterns in the sparse-checkout file.
 
 'set'::
 	Enable the necessary config settings
@@ -38,20 +48,26 @@ COMMANDS
 	list of arguments following the 'set' subcommand. Update the
 	working directory to match the new patterns.
 +
-When the `--stdin` option is provided, the patterns are read from
-standard in as a newline-delimited list instead of from the arguments.
+When the `--stdin` option is provided, the directories or patterns are
+read from standard in as a newline-delimited list instead of from the
+arguments.
 +
 When `--cone` is passed or `core.sparseCheckoutCone` is enabled, the
-input list is considered a list of directories instead of
-sparse-checkout patterns.  This allows for better performance with a
-limited set of patterns (see 'CONE PATTERN SET' below).  Note that the
-set command will write patterns to the sparse-checkout file to include
-all files contained in those directories (recursively) as well as
-files that are siblings of ancestor directories. The input format
-matches the output of `git ls-tree --name-only`.  This includes
-interpreting pathnames that begin with a double quote (") as C-style
-quoted strings.  This may become the default in the future; --no-cone
-can be passed to request non-cone mode.
+input list is considered a list of directories.  This allows for
+better performance with a limited set of patterns (see 'CONE PATTERN
+SET' below).  The input format matches the output of `git ls-tree
+--name-only`.  This includes interpreting pathnames that begin with a
+double quote (") as C-style quoted strings.  Note that the set command
+will write patterns to the sparse-checkout file to include all files
+contained in those directories (recursively) as well as files that are
+siblings of ancestor directories. This may become the default in the
+future; --no-cone can be passed to request non-cone mode.
++
+When `--no-cone` is passed or `core.sparseCheckoutCone` is not enabled,
+the input list is considered a list of patterns.  This mode is harder
+to use and less performant, and is thus not recommended.  See the
+"Sparse Checkout" section of linkgit:git-read-tree[1] and the "Pattern
+Set" sections below for more details.
 +
 Use the `--[no-]sparse-index` option to use a sparse index (the
 default is to not use it).  A sparse index reduces the size of the
@@ -69,11 +85,10 @@ understand the sparse directory entries index extension and may fail to
 interact with your repository until it is disabled.
 
 'add'::
-	Update the sparse-checkout file to include additional patterns.
-	By default, these patterns are read from the command-line arguments,
-	but they can be read from stdin using the `--stdin` option. When
-	`core.sparseCheckoutCone` is enabled, the given patterns are interpreted
-	as directory names as in the 'set' subcommand.
+	Update the sparse-checkout file to include additional directories
+	(in cone mode) or patterns (in non-cone mode).  By default, these
+	directories or patterns are read from the command-line arguments,
+	but they can be read from stdin using the `--stdin` option.
 
 'reapply'::
 	Reapply the sparsity pattern rules to paths in the working tree.
@@ -117,13 +132,14 @@ decreased in utility.
 SPARSE CHECKOUT
 ---------------
 
-"Sparse checkout" allows populating the working directory sparsely.
-It uses the skip-worktree bit (see linkgit:git-update-index[1]) to tell
-Git whether a file in the working directory is worth looking at. If
-the skip-worktree bit is set, then the file is ignored in the working
-directory. Git will avoid populating the contents of those files, which
-makes a sparse checkout helpful when working in a repository with many
-files, but only a few are important to the current user.
+"Sparse checkout" allows populating the working directory sparsely.  It
+uses the skip-worktree bit (see linkgit:git-update-index[1]) to tell Git
+whether a file in the working directory is worth looking at. If the
+skip-worktree bit is set, and the file is not present in the working tree,
+then its absence is ignored. Git will avoid populating the contents of
+those files, which makes a sparse checkout helpful when working in a
+repository with many files, but only a few are important to the current
+user.
 
 The `$GIT_DIR/info/sparse-checkout` file is used to define the
 skip-worktree reference bitmap. When Git updates the working
diff --git a/Documentation/git-update-index.txt b/Documentation/git-update-index.txt
index 2853f168d9..568dbfe76b 100644
--- a/Documentation/git-update-index.txt
+++ b/Documentation/git-update-index.txt
@@ -351,6 +351,10 @@ unchanged".  Note that "assume unchanged" bit is *not* set if
 the index (use `git update-index --really-refresh` if you want
 to mark them as "assume unchanged").
 
+Sometimes users confuse the assume-unchanged bit with the
+skip-worktree bit.  See the final paragraph in the "Skip-worktree bit"
+section below for an explanation of the differences.
+
 
 EXAMPLES
 --------
@@ -392,22 +396,47 @@ M foo.c
 SKIP-WORKTREE BIT
 -----------------
 
-Skip-worktree bit can be defined in one (long) sentence: When reading
-an entry, if it is marked as skip-worktree, then Git pretends its
-working directory version is up to date and read the index version
-instead.
+Skip-worktree bit can be defined in one (long) sentence: Tell git to
+avoid writing the file to the working directory when reasonably
+possible, and treat the file as unchanged when it is not
+present in the working directory.
 
-To elaborate, "reading" means checking for file existence, reading
-file attributes or file content. The working directory version may be
-present or absent. If present, its content may match against the index
-version or not. Writing is not affected by this bit, content safety
-is still first priority. Note that Git _can_ update working directory
-file, that is marked skip-worktree, if it is safe to do so (i.e.
-working directory version matches index version)
+Note that not all git commands will pay attention to this bit, and
+some only partially support it.
+
+The update-index flags and the read-tree capabilities relating to the
+skip-worktree bit predated the introduction of the
+linkgit:git-sparse-checkout[1] command, which provides a much easier
+way to configure and handle the skip-worktree bits.  If you want to
+reduce your working tree to only deal with a subset of the files in
+the repository, we strongly encourage the use of
+linkgit:git-sparse-checkout[1] in preference to the low-level
+update-index and read-tree primitives.
+
+The primary purpose of the skip-worktree bit is to enable sparse
+checkouts, i.e. to have working directories with only a subset of
+paths present.  When the skip-worktree bit is set, Git commands (such
+as `switch`, `pull`, `merge`) will avoid writing these files.
+However, these commands will sometimes write these files anyway in
+important cases such as conflicts during a merge or rebase.  Git
+commands will also avoid treating the lack of such files as an
+intentional deletion; for example `git add -u` will not not stage a
+deletion for these files and `git commit -a` will not make a commit
+deleting them either.
 
 Although this bit looks similar to assume-unchanged bit, its goal is
-different from assume-unchanged bit's. Skip-worktree also takes
-precedence over assume-unchanged bit when both are set.
+different.  The assume-unchanged bit is for leaving the file in the
+working tree but having Git omit checking it for changes and presuming
+that the file has not been changed (though if it can determine without
+stat'ing the file that it has changed, it is free to record the
+changes).  skip-worktree tells Git to ignore the absence of the file,
+avoid updating it when possible with commands that normally update
+much of the working directory (e.g. `checkout`, `switch`, `pull`,
+etc.), and not have its absence be recorded in commits.  Note that in
+sparse checkouts (setup by `git sparse-checkout` or by configuring
+core.sparseCheckout to true), if a file is marked as skip-worktree in
+the index but is found in the working tree, Git will clear the
+skip-worktree bit for that file.
 
 SPLIT INDEX
 -----------
-- 
2.34.1.442.ge63c19bdd2.dirty


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC PATCH 5/5] Accelerate ensure_skip_worktree_means_skip_worktree by caching
  2022-01-09  4:57 [RFC PATCH 0/5] Remove the present-despite-SKIP_WORKTREE class of bugs Elijah Newren
                   ` (3 preceding siblings ...)
  2022-01-09  4:57 ` [RFC PATCH 4/5] Update documentation related to sparsity and the skip-worktree bit Elijah Newren
@ 2022-01-09  4:57 ` Elijah Newren
  2022-01-11 18:30   ` Victoria Dye
  4 siblings, 1 reply; 11+ messages in thread
From: Elijah Newren @ 2022-01-09  4:57 UTC (permalink / raw)
  To: git; +Cc: Victoria Dye, Derrick Stolee, Lessley Dennington, Elijah Newren

Rather than lstat()'ing every SKIP_WORKTREE path, take advantage of the
fact that entire directories will often be missing, especially for cone
mode and even more so ever since commit 55dfcf9591 ("sparse-checkout:
clear tracked sparse dirs", 2021-09-08).  If we have already determined
that the parent directory of a file (or any other previous ancestor)
does not exist, then we already know the file cannot exist and do not
need to lstat() it separately.

Granted, the cost of ensure_skip_worktree_means_skip_worktree() might
be considered a bit high for non-cone mode since it might now lstat()
every SKIP_WORKTREE path when the index is loaded (an O(N) cost, with
N the number of SKIP_WORKTREE paths), but non-cone mode users already
have to deal with the O(N*M) cost (with N=the number of tracked files
and M=the number of sparsity patterns), so this should be reasonable.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 sparse-index.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 103 insertions(+), 2 deletions(-)

diff --git a/sparse-index.c b/sparse-index.c
index 79d50e444c..608782e255 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -341,18 +341,117 @@ void ensure_correct_sparsity(struct index_state *istate)
 		ensure_full_index(istate);
 }
 
+struct path_cache_entry {
+	struct hashmap_entry ent;
+	const char *path;
+	int path_length;
+	int is_present;
+};
+
+static int path_cache_cmp(const void *unused,
+			  const struct hashmap_entry *entry1,
+			  const struct hashmap_entry *entry2,
+			  const void *also_unused)
+{
+	const struct path_cache_entry *e1, *e2;
+
+	e1 = container_of(entry1, const struct path_cache_entry, ent);
+	e2 = container_of(entry2, const struct path_cache_entry, ent);
+	if (e1->path_length != e2->path_length)
+		return e1->path_length - e2->path_length;
+	return memcmp(e1->path, e2->path, e1->path_length);
+}
+
+static struct path_cache_entry *find_path_cache_entry(struct hashmap *map,
+						      const char *str,
+						      int str_length)
+{
+	struct path_cache_entry entry;
+	hashmap_entry_init(&entry.ent, memhash(str, str_length));
+	entry.path = str;
+	entry.path_length = str_length;
+	return hashmap_get_entry(map, &entry, ent, NULL);
+}
+
+static void record(struct hashmap *path_cache,
+		   struct mem_pool *pool,
+		   const char *path,
+		   int path_length,
+		   int found)
+{
+	struct path_cache_entry *entry;
+
+	entry = mem_pool_alloc(pool, sizeof(*entry));
+	hashmap_entry_init(&entry->ent, memhash(path, path_length));
+	entry->path = path;
+	entry->path_length = path_length;
+	entry->is_present = found;
+	hashmap_add(path_cache, &entry->ent);
+}
+
+static int path_found(struct hashmap *path_cache, struct mem_pool *pool,
+		      const char *path, int path_length)
+{
+	struct stat st;
+	int found;
+	const char *dirsep = path + path_length - 1;
+	const char *tmp;
+
+	/* Find directory separator; memrchr is sadly glibc-specific */
+	while (dirsep > path && *dirsep != '/')
+		dirsep--;
+
+	/* If parent of path doesn't exist, no point lstat'ing path... */
+	if (dirsep > path) {
+		struct path_cache_entry *entry;
+		int new_length, parent_found;
+
+		/* First, check if path's parent's existence was cached */
+		new_length = dirsep - path;
+		entry = find_path_cache_entry(path_cache, path, new_length);
+		if (entry)
+			parent_found = entry->is_present;
+		else
+			parent_found = path_found(path_cache, pool,
+						  path, new_length);
+
+		if (!parent_found) {
+			/* path can't exist if parent dir doesn't */
+			record(path_cache, pool, path, path_length, 0);
+			return 0;
+		} /* else parent was found so must check path itself too... */
+	}
+
+	/* Okay, parent dir exists, so we have to check original path */
+
+	/* Make sure we have a NUL-terminated string to pass to lstat */
+	tmp = path;
+	if (path[path_length])
+		tmp = mem_pool_strndup(pool, path, path_length);
+	/* Determine if path exists */
+	found = !lstat(tmp, &st);
+
+	record(path_cache, pool, path, path_length, found);
+	return found;
+}
+
 void ensure_skip_worktree_means_skip_worktree(struct index_state *istate)
 {
+	struct hashmap path_cache = HASHMAP_INIT(path_cache_cmp, NULL);
+	struct mem_pool pool;
+
 	int i;
+
 	if (!core_apply_sparse_checkout)
 		return;
 
+	mem_pool_init(&pool, 32*1024);
 restart:
 	for (i = 0; i < istate->cache_nr; i++) {
 		struct cache_entry *ce = istate->cache[i];
-		struct stat st;
 
-		if (ce_skip_worktree(ce) && !lstat(ce->name, &st)) {
+		if (ce_skip_worktree(ce) &&
+		    path_found(&path_cache, &pool, ce->name, strlen(ce->name))) {
 			if (S_ISSPARSEDIR(ce->ce_mode)) {
 				ensure_full_index(istate);
 				goto restart;
@@ -360,6 +459,8 @@ void ensure_skip_worktree_means_skip_worktree(struct index_state *istate)
 			ce->ce_flags &= ~CE_SKIP_WORKTREE;
 		}
 	}
+	hashmap_clear(&path_cache);
+	mem_pool_discard(&pool, 0);
 }
 
 
-- 
2.34.1.442.ge63c19bdd2.dirty


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 3/5] repo_read_index: ensure SKIP_WORKTREE means skip worktree
  2022-01-09  4:57 ` [RFC PATCH 3/5] repo_read_index: ensure SKIP_WORKTREE means skip worktree Elijah Newren
@ 2022-01-10 20:38   ` Victoria Dye
  2022-01-11 19:27     ` Elijah Newren
  0 siblings, 1 reply; 11+ messages in thread
From: Victoria Dye @ 2022-01-10 20:38 UTC (permalink / raw)
  To: Elijah Newren, git; +Cc: Derrick Stolee, Lessley Dennington

Elijah Newren wrote:
> The fix is short (~30 lines), but the description is not.  Sorry.
> 
> There is a set of problems caused by files in what I'll refer to as the
> "present-despite-SKIP_WORKTREE" state.  This commit aims to not just fix
> these problems, but remove the entire class as a possibility -- for
> those using sparse checkouts.  But first, we need to understand the
> problems this class presents.  A quick outline:
> 
>    * Problems
>      * User facing issues
>      * Problem space complexity
>      * Maintenance and code correctness challenges
>    * SKIP_WORKTREE expectations in Git
>    * Suggested solution
>    * Pros/Cons of suggested solution
>    * Notes on testcase modifications
> 
> === User facing issues ===
> 
> There are various ways for users to get files to be present in the
> working copy despite having the SKIP_WORKTREE bit set for that file in
> the index.  This may come from:
>   * various git commands not really supporting the SKIP_WORKTREE bit[1,2]
>   * users grabbing files from elsewhere and writing them to the worktree
>     (perhaps even cached in their editor)
>   * users attempting to "abort" a sparse-checkout operation with a
>     not-so-early Ctrl+C (updating $GIT_DIR/info/sparse-checkout and the
>     working tree is not atomic)[3].
> 
> Once users have present-despite-SKIP_WORKTREE files, any modifications
> users make to these files will be ignored, possibly to users' confusion.
> 
> Further:
>   * these files will not be updated by by standard commands
>     (switch/checkout/pull/merge/rebase will leave them alone unless
>     conflicts happen -- and even then, the conflicted file may be
>     written somewhere else to avoid overwriting the SKIP_WORKTREE file
>     that is present and in the way)
>   * there is nothing in Git that users can use to discover such
>     files (status, diff, grep, etc. all ignore it)
>   * there is no reasonable mechanism to "recover" from such a condition
>     (neither `git sparse-checkout reapply` nor `git reset --hard` will
>     correct it).
> 

Just to add to this, files like these always force sparse index expansion in
`git status` (and probably some other commands?), ruining a lot of the
performance gains of using sparse index in the first place.

> So, not only are users modifications ignored, but the files get
> progressively more stale over time.  At some point in the future, they
> may change their sparseness specification or disable sparse-checkouts.
> At that time, all present-despite-SKIP_WORKTREE files will show up as
> having lots of modifications because they represent a version from a
> different branch or commit.  These might include user-made local changes
> from days before, but the only way to tell is to have users look through
> them all closely.
> 
> If these users come to others for help, there will be no logs that
> explain the issue; it's just a mysterious list of changes.  Users might
> adamantly claim (correctly, as it turns out) that they didn't modify
> these files, while others presume they did.
> 
> [1] https://lore.kernel.org/git/xmqqbmb1a7ga.fsf@gitster-ct.c.googlers.com/
> [2] https://lore.kernel.org/git/CABPp-BH9tju7WVm=QZDOvaMDdZbpNXrVWQdN-jmfN8wC6YVhmw@mail.gmail.com/
> [3] https://lore.kernel.org/git/CABPp-BFnFpzwGC11TLoLs8YK5yiisA5D5-fFjXnJsbESVDwZsA@mail.gmail.com/
> 
> === Problem space complexity ===
> 
> SKIP_WORKTREE has been part of Git for over a decade.  Duy did lots of
> work on it initially, and several others have since come along and put
> lots of work into it.  Stolee spent most of 2021 on the sparse-index,
> with lots of bugfixes along the way including to non-sparse-index cases
> as we are still trying to get sparse checkouts to behave reasonably.
> Basically every codepath throughout the treat needs to be aware of an
> additional type of file: tracked-but-not-present.  The extra type
> results in lots of extra testcases and lots of extra code everywhere.
> 
> But, the sad thing is that we actually have more than one extra type.
> We have tracked, tracked-but-not-present (SKIP_WORKTREE), and
> tracked-but-promised-to-not-be-present-but-is-present-anyway
> (present-despite-SKIP_WORKTREE).  Two types is a monumental amount of
> effort to support, and adding a third feels a bit like insanity[4].
> 
> [4] Some examples of which can be seen at
>     https://lore.kernel.org/git/CABPp-BGJ_Nvi5TmgriD9Bh6eNXE2EDq2f8e8QKXAeYG3BxZafA@mail.gmail.com/
> 
> === Maintenance and code correctness challenges ===
> 
> Matheus' patches to grep stalled for nearly a year, in part because of
> complications of how to handle sparse-checkouts appropriately in all
> cases[5][6] (with trying to sanely figure out how to sanely handle
> present-despite-SKIP_WORKTREE files being one of the complications).
> His rm/add follow-ups also took months because of those kinds of
> issues[7].  The corner cases with things like submodules and
> SKIP_WORKTREE with the addition of present-despite-SKIP_WORKTREE start
> becoming really complex[8].
> 
> We've had to add ugly logic to merge-ort to attempt to handle
> present-despite-SKIP_WORKTREE files[9], and basically just been forced
> to give up in merge-recursive knowing full well that we'll sometimes
> silently discard user modifications.  Despite stash essentially being a
> merge, it needed extra code (beyond what was in merge-ort and
> merge-recursive) to manually tweak SKIP_WORKTREE bits in order to avoid
> a few different bugs that'd result in an early abort with a partial
> stash application[10].
> 
> [5] See https://lore.kernel.org/git/5f3f7ac77039d41d1692ceae4b0c5df3bb45b74a.1612901326.git.matheus.bernardino@usp.br/#t
>     and the dates on the thread; also Matheus and I had several
>     conversations off-list trying to resolve the issues over that time
> [6] ...it finally kind of got unstuck after
>     https://lore.kernel.org/git/CABPp-BGJ_Nvi5TmgriD9Bh6eNXE2EDq2f8e8QKXAeYG3BxZafA@mail.gmail.com/
> [7] See for example
>     https://lore.kernel.org/git/CABPp-BHwNoVnooqDFPAsZxBT9aR5Dwk5D9sDRCvYSb8akxAJgA@mail.gmail.com/#t
>     and quotes like "The core functionality of sparse-checkout has always
>     been only partially implemented", a statement I still believe is true
>     today.
> [8] https://lore.kernel.org/git/pull.809.git.git.1592356884310.gitgitgadget@gmail.com/
> [9] See commit 66b209b86a ("merge-ort: implement CE_SKIP_WORKTREE
>     handling with conflicted entries", 2021-03-20)
> [10] See commit ba359fd507 ("stash: fix stash application in
>      sparse-checkouts", 2020-12-01)
> 
> === SKIP_WORKTREE expectations in Git ===
> 
> A couple quotes:
> 
> From [11] (before the "sparse-checkout" command existed):
>    If it needs too many special cases, hacks, and conditionals, then it
>    is not worth the complexity---if it is easier to write a correct code
>    by allowing Git to populate working tree files, it is perfectly fine
>    to do so.
> 
>    In a sense, the sparse checkout "feature" itself is a hack by itself,
>    and that is why I think this part should be "best effort" as well.
> 
> From the git-sparse-checkout manual (still present today):
> 
>    THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR, AND THE BEHAVIOR OF OTHER
>    COMMANDS IN THE PRESENCE OF SPARSE-CHECKOUTS, WILL LIKELY CHANGE IN
>    THE FUTURE.
> 
> [11] https://lore.kernel.org/git/xmqqbmb1a7ga.fsf@gitster-ct.c.googlers.com/
> 
> === Suggested solution ===
> 
> SKIP_WORKTREE was written to allow sparse-checkouts, in particular, as
> the name of the option implies, to allow the file to NOT be in the
> worktree but consider it to be unchanged rather than deleted.
> 
> The suggests a simple solution: present-despite-SKIP_WORKTREE files
> should not exist, for those using sparse-checkouts.
> 
> Enforce this at index loading time by checking if core.sparseCheckout is
> true; if so, check files in the index with the SKIP_WORKTREE bit set to
> verify that they are absent from the working tree.  If they are present,
> unset the bit (in memory, though any commands that write to the index
> will record the update).
> 

Since this solution is specific to a sparse-checkout, should this automatic
unsetting only be done if the file is outside the sparse checkout
definition? Otherwise, the `sparse-checkout reapply` cleanup suggested below
doesn't return the original `skip-worktree` state. 

Admittedly, I imagine it's unlikely that someone is simultaneously using a
sparse checkout and manually SKIP_WORKTREE-ing files *inside* the sparse
checkout definition. But, given that you're not unsetting the flag for
non-sparse-checkout SKIP_WORKTREE files, it seems like an additional
constraint based on sparse checkout patterns would be consistent with
other parts of this patch.

> Users can, of course, can get the SKIP_WORKTREE bit back such as by
> running `git sparse-checkout reapply` (if they have ensured the file is
> unmodified and doesn't match the specified sparsity patterns).
> 

There are some performance implications of this solution in a sparse
index-enabled checkout. Any time an out-of-cone file is no longer
SKIP_WORKTREE, its parent directory lineage will be added to the sparse index,
and performance would progressively (silently) degrade as more out-of-cone
files were added.

That said, a lot of my concern would be alleviated with some kind of warning
indicating that a file just had SKIP_WORKTREE removed, including a mention
of fixing it with `git sparse-checkout reapply`. 

It would be *extra* nice if `git status` could tell a user that they have
non-SKIP_WORKTREE files outside the sparse definition, but I think that's
less critical and probably better suited as a separate series.

> === Pros/Cons of suggested solution ===
> 
> Pros:
> 
>   * Solves the user visible problems reported above, which I've been
>     complaining about for nearly a year but couldn't find a solution to.
>   * Much easier behavior in sparse-checkouts for users to reason about
>   * Very simple, ~30 lines of code.
>   * Significantly simplifies some ugly testcases, and obviates the need
>     to test an entire class of potential issues.
>   * Reduces code complexity, reasoning, and maintenance.  Avoids
>     disagreements about weird corner cases[12].
>   * It has been reported that some users might be (ab)using
>     SKIP_WORKTREE as a let-me-modify-but-keep-the-file-in-the-worktree
>     mechanism[13, and a few other similar references].  These users know
>     of multiple caveats and shortcomings in doing so; perhaps not
>     surprising given the "SKIP_WORKTREE expecations" section above.
>     However, these users use `git update-index --skip-worktree`, and not
>     `git sparse-checkout` or core.sparseCheckout=true.  As such, these
>     users would be unaffected by this change and can continue abusing
>     the system as before.
> 
> [12] https://lore.kernel.org/git/CABPp-BH9tju7WVm=QZDOvaMDdZbpNXrVWQdN-jmfN8wC6YVhmw@mail.gmail.com/
> [13] https://stackoverflow.com/questions/13630849/git-difference-between-assume-unchanged-and-skip-worktree
> 
> Cons:
> 
>   * When core.sparseCheckout is enabled, this adds a performance cost to
>     reading the index.  I'll defer discussion of this cost to a subsequent
>     patch, since I have some optimizations to add.
> 
> === Notes on testcase modifications ===
> 
> The good:
>   * t1011: Compare to two cases above it ('read-tree will not throw away
>     dirty changes, non-sparse'); since the file is present, it should
>     match the non-sparse case now
>   * t1092: sparse-index & sparse-checkout now match full-worktree
>     behavior in more cases!  Yaay for consistency!
>   * t6428, t7012: look at how much simpler the tests become!  Merge and
>     stash can just fail early telling the user there's a file in the
>     way, instead of not noticing until it's about to write a file and
>     then have to implement sudden crash avoidance.  Hurray for sanity!
>   * t7817: sparse behavior better matches full tree behavior.  Hurray
>     for sanity!
> 
> The confusing:
>   * t3705: These changes were ONLY needed on Windows, but they don't
>     hurt other platforms.  Let's discuss each individually:
> 
>     * core.sparseCheckout should be false by default.  Nothing in this
>       testcase toggles that until many, many tests later.  However,
>       early tests (#5 in particular) were testing `update-index
>       --skip-worktree` behavior in a non-sparse-checkout, but the
>       Windows tests in CI were behaving as if core.sparseCheckout=true
>       had been specified somewhere.  I do not have access to a Windows
>       machine.  But I just manually did what should have been a no-op
>       and turned the config off.  And it fixed the test.
>     * I have no idea why the leftover .gitattributes file from this
>       test was causing failures for test #18 on Windows, but only with
>       these changes of mine.  Test #18 was checking for empty stderr,
>       and specifically wanted to know that some error completely
>       unrelated to file endings did not appear.  The leftover
>       .gitattributes file thus caused some spurious stderr unrelated to
>       the thing being checked.  Since other tests did not intend to
>       test normalization, just proactively remove the .gitattributes
>       file.  I'm certain this is cleaner and better, I'm just unsure
>       why/how this didn't trigger problems before.
> 
> Signed-off-by: Elijah Newren <newren@gmail.com>
> ---
>  repository.c                             |  7 ++++
>  sparse-index.c                           | 22 ++++++++++++
>  sparse-index.h                           |  1 +
>  t/t1011-read-tree-sparse-checkout.sh     |  2 +-
>  t/t1092-sparse-checkout-compatibility.sh | 16 ++++-----
>  t/t3705-add-sparse-checkout.sh           |  2 ++
>  t/t6428-merge-conflicts-sparse.sh        | 23 +++----------
>  t/t7012-skip-worktree-writing.sh         | 44 ++++--------------------
>  t/t7817-grep-sparse-checkout.sh          | 11 ++++--
>  9 files changed, 62 insertions(+), 66 deletions(-)
> 
> diff --git a/repository.c b/repository.c
> index 34610c5a33..dfd1911902 100644
> --- a/repository.c
> +++ b/repository.c
> @@ -301,6 +301,13 @@ int repo_read_index(struct repository *repo)
>  	if (repo->settings.command_requires_full_index)
>  		ensure_full_index(repo->index);
>  
> +	/*
> +	 * If sparse checkouts are in use, check whether paths with the
> +	 * SKIP_WORKTREE attribute are missing from the worktree; if not,
> +	 * clear that attribute for that path.
> +	 */
> +	ensure_skip_worktree_means_skip_worktree(repo->index);
> +
>  	return res;
>  }
>  
> diff --git a/sparse-index.c b/sparse-index.c
> index a1d505d50e..79d50e444c 100644
> --- a/sparse-index.c
> +++ b/sparse-index.c
> @@ -341,6 +341,28 @@ void ensure_correct_sparsity(struct index_state *istate)
>  		ensure_full_index(istate);
>  }
>  
> +void ensure_skip_worktree_means_skip_worktree(struct index_state *istate)

I can feel the frustration behind this name. :) 

However, a more descriptive one would make the code easier to follow, e.g.
'clear_skip_worktree_from_present_files' (or something else indicating what
it does to the index).

> +{
> +	int i;
> +	if (!core_apply_sparse_checkout)
> +		return;
> +
> +restart:
> +	for (i = 0; i < istate->cache_nr; i++) {
> +		struct cache_entry *ce = istate->cache[i];
> +		struct stat st;
> +
> +		if (ce_skip_worktree(ce) && !lstat(ce->name, &st)) {
> +			if (S_ISSPARSEDIR(ce->ce_mode)) {
> +				ensure_full_index(istate);
> +				goto restart;
> +			}
> +			ce->ce_flags &= ~CE_SKIP_WORKTREE;
> +		}
> +	}
> +}
> +
> +
>  /*
>   * This static global helps avoid infinite recursion between
>   * expand_to_path() and index_file_exists().
> diff --git a/sparse-index.h b/sparse-index.h
> index 656bd835b2..1007859ed4 100644
> --- a/sparse-index.h
> +++ b/sparse-index.h
> @@ -5,6 +5,7 @@ struct index_state;
>  #define SPARSE_INDEX_MEMORY_ONLY (1 << 0)
>  int convert_to_sparse(struct index_state *istate, int flags);
>  void ensure_correct_sparsity(struct index_state *istate);
> +void ensure_skip_worktree_means_skip_worktree(struct index_state *istate);
>  
>  /*
>   * Some places in the codebase expect to search for a specific path.
> diff --git a/t/t1011-read-tree-sparse-checkout.sh b/t/t1011-read-tree-sparse-checkout.sh
> index 4ed0885bf2..dd957be1b7 100755
> --- a/t/t1011-read-tree-sparse-checkout.sh
> +++ b/t/t1011-read-tree-sparse-checkout.sh
> @@ -212,7 +212,7 @@ test_expect_success 'read-tree updates worktree, dirty case' '
>  	echo sub/added >.git/info/sparse-checkout &&
>  	git checkout -f top &&
>  	echo dirty >init.t &&
> -	read_tree_u_must_succeed -m -u HEAD^ &&
> +	read_tree_u_must_fail -m -u HEAD^ &&
>  	grep -q dirty init.t &&
>  	rm init.t
>  '
> diff --git a/t/t1092-sparse-checkout-compatibility.sh b/t/t1092-sparse-checkout-compatibility.sh
> index 0863c9747c..6f8538bb4c 100755
> --- a/t/t1092-sparse-checkout-compatibility.sh
> +++ b/t/t1092-sparse-checkout-compatibility.sh
> @@ -370,7 +370,7 @@ test_expect_success 'status/add: outside sparse cone' '
>  	write_script edit-contents <<-\EOF &&
>  	echo text >>$1
>  	EOF
> -	run_on_sparse ../edit-contents folder1/a &&
> +	run_on_all ../edit-contents folder1/a &&
>  	run_on_all ../edit-contents folder1/new &&
>  
>  	test_sparse_match git status --porcelain=v2 &&
> @@ -379,8 +379,8 @@ test_expect_success 'status/add: outside sparse cone' '
>  	test_sparse_match test_must_fail git add folder1/a &&
>  	grep "Disable or modify the sparsity rules" sparse-checkout-err &&
>  	test_sparse_unstaged folder1/a &&
> -	test_sparse_match test_must_fail git add --refresh folder1/a &&
> -	grep "Disable or modify the sparsity rules" sparse-checkout-err &&
> +	test_all_match git add --refresh folder1/a &&
> +	test_must_be_empty sparse-checkout-err &&
>  	test_sparse_unstaged folder1/a &&
>  	test_sparse_match test_must_fail git add folder1/new &&
>  	grep "Disable or modify the sparsity rules" sparse-checkout-err &&
> @@ -642,11 +642,11 @@ test_expect_success 'update-index modify outside sparse definition' '
>  	run_on_sparse cp ../initial-repo/folder1/a folder1/a &&
>  	run_on_all ../edit-contents folder1/a &&
>  
> -	# If file has skip-worktree enabled, update-index does not modify the
> -	# index entry
> -	test_sparse_match git update-index folder1/a &&
> -	test_sparse_match git status --porcelain=v2 &&
> -	test_must_be_empty sparse-checkout-out &&
> +	# If file has skip-worktree enabled, but the file is present, it is
> +	# treated the same as if skip-worktree is disabled
> +	test_all_match git status --porcelain=v2 &&
> +	test_all_match git update-index folder1/a &&
> +	test_all_match git status --porcelain=v2 &&
>  
>  	# When skip-worktree is disabled (even on files outside sparse cone), file
>  	# is updated in the index
> diff --git a/t/t3705-add-sparse-checkout.sh b/t/t3705-add-sparse-checkout.sh
> index f3143c9290..61506c1d7c 100755
> --- a/t/t3705-add-sparse-checkout.sh
> +++ b/t/t3705-add-sparse-checkout.sh
> @@ -19,6 +19,7 @@ setup_sparse_entry () {
>  	fi &&
>  	git add sparse_entry &&
>  	git update-index --skip-worktree sparse_entry &&
> +	git config core.sparseCheckout false &&
>  	git commit --allow-empty -m "ensure sparse_entry exists at HEAD" &&
>  	SPARSE_ENTRY_BLOB=$(git rev-parse :sparse_entry)
>  }
> @@ -126,6 +127,7 @@ test_expect_success 'git add --chmod does not update sparse entries' '
>  '
>  
>  test_expect_success 'git add --renormalize does not update sparse entries' '
> +	test_when_finished rm .gitattributes &&
>  	test_config core.autocrlf false &&
>  	setup_sparse_entry "LINEONE\r\nLINETWO\r\n" &&
>  	echo "sparse_entry text=auto" >.gitattributes &&
> diff --git a/t/t6428-merge-conflicts-sparse.sh b/t/t6428-merge-conflicts-sparse.sh
> index 7e8bf497f8..142c9aaabc 100755
> --- a/t/t6428-merge-conflicts-sparse.sh
> +++ b/t/t6428-merge-conflicts-sparse.sh
> @@ -112,7 +112,7 @@ test_expect_success 'conflicting entries written to worktree even if sparse' '
>  	)
>  '
>  
> -test_expect_merge_algorithm failure success 'present-despite-SKIP_WORKTREE handled reasonably' '
> +test_expect_success 'present-despite-SKIP_WORKTREE handled reasonably' '
>  	test_setup_numerals in_the_way &&
>  	(
>  		cd numerals_in_the_way &&
> @@ -132,26 +132,13 @@ test_expect_merge_algorithm failure success 'present-despite-SKIP_WORKTREE handl
>  
>  		test_must_fail git merge -s recursive B^0 &&
>  
> -		git ls-files -t >index_files &&
> -		test_cmp expected-index index_files &&
> +		test_path_is_missing .git/MERGE_HEAD &&
>  
> -		test_path_is_file README &&
>  		test_path_is_file numerals &&
>  
> -		test_cmp expected-merge numerals &&
> -
> -		# There should still be a file with "foobar" in it
> -		grep foobar * &&
> -
> -		# 5 other files:
> -		#   * expected-merge
> -		#   * expected-index
> -		#   * index_files
> -		#   * others
> -		#   * whatever name was given to the numerals file that had
> -		#     "foobar" in it
> -		git ls-files -o >others &&
> -		test_line_count = 5 others
> +		# numerals should still have "foobar" in it
> +		echo foobar >expect &&
> +		test_cmp expect numerals
>  	)
>  '
>  
> diff --git a/t/t7012-skip-worktree-writing.sh b/t/t7012-skip-worktree-writing.sh
> index a1080b94e3..cb9f1a6981 100755
> --- a/t/t7012-skip-worktree-writing.sh
> +++ b/t/t7012-skip-worktree-writing.sh
> @@ -171,50 +171,20 @@ test_expect_success 'stash restore in sparse checkout' '
>  
>  		# Put a file in the working directory in the way
>  		echo in the way >modified &&
> -		git stash apply &&
> +		test_must_fail git stash apply 2>error&&
>  
> -		# Ensure stash vivifies modifies paths...
> -		cat >expect <<-EOF &&
> -		H addme
> -		H modified
> -		H removeme
> -		H subdir/A
> -		S untouched
> -		EOF
> -		git ls-files -t >actual &&
> -		test_cmp expect actual &&
> +		grep "changes.*would be overwritten by merge" error &&
>  
> -		# ...and that the paths show up in status as changed...
> -		cat >expect <<-EOF &&
> -		A  addme
> -		 M modified
> -		 D removeme
> -		 M subdir/A
> -		?? actual
> -		?? expect
> -		?? modified.stash.XXXXXX
> -		EOF
> -		git status --porcelain | \
> -			sed -e s/stash......./stash.XXXXXX/ >actual &&
> -		test_cmp expect actual &&
> +		echo in the way >expect &&
> +		test_cmp expect modified &&
> +		git diff --quiet HEAD ":!modified" &&
>  
>  		# ...and that working directory reflects the files correctly
> -		test_path_is_file    addme &&
> +		test_path_is_missing addme &&
>  		test_path_is_file    modified &&
>  		test_path_is_missing removeme &&
>  		test_path_is_file    subdir/A &&
> -		test_path_is_missing untouched &&
> -
> -		# ...including that we have the expected "modified" file...
> -		cat >expect <<-EOF &&
> -		modified
> -		tweaked
> -		EOF
> -		test_cmp expect modified &&
> -
> -		# ...and that the other "modified" file is still present...
> -		echo in the way >expect &&
> -		test_cmp expect modified.stash.*
> +		test_path_is_missing untouched
>  	)
>  '
>  
> diff --git a/t/t7817-grep-sparse-checkout.sh b/t/t7817-grep-sparse-checkout.sh
> index 590b99bbb6..eb59564565 100755
> --- a/t/t7817-grep-sparse-checkout.sh
> +++ b/t/t7817-grep-sparse-checkout.sh
> @@ -83,10 +83,13 @@ test_expect_success 'setup' '
>  
>  # The test below covers a special case: the sparsity patterns exclude '/b' and
>  # sparse checkout is enabled, but the path exists in the working tree (e.g.
> -# manually created after `git sparse-checkout init`). git grep should skip it.
> +# manually created after `git sparse-checkout init`).  Although b is marked
> +# as SKIP_WORKTREE, git grep should notice it IS present in the worktree and
> +# report it.
>  test_expect_success 'working tree grep honors sparse checkout' '
>  	cat >expect <<-EOF &&
>  	a:text
> +	b:new-text
>  	EOF
>  	test_when_finished "rm -f b" &&
>  	echo "new-text" >b &&
> @@ -126,12 +129,16 @@ test_expect_success 'grep --cached searches entries with the SKIP_WORKTREE bit'
>  '
>  
>  # Note that sub2/ is present in the worktree but it is excluded by the sparsity
> -# patterns, so grep should not recurse into it.
> +# patterns.  We also explicitly mark it as SKIP_WORKTREE in case it got cleared
> +# by previous git commands.  Thus sub2 starts as SKIP_WORKTREE but since it is
> +# present in the working tree, grep should recurse into it.
>  test_expect_success 'grep --recurse-submodules honors sparse checkout in submodule' '
>  	cat >expect <<-EOF &&
>  	a:text
>  	sub/B/b:text
> +	sub2/a:text
>  	EOF
> +	git update-index --skip-worktree sub2 &&
>  	git grep --recurse-submodules "text" >actual &&
>  	test_cmp expect actual
>  '


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 5/5] Accelerate ensure_skip_worktree_means_skip_worktree by caching
  2022-01-09  4:57 ` [RFC PATCH 5/5] Accelerate ensure_skip_worktree_means_skip_worktree by caching Elijah Newren
@ 2022-01-11 18:30   ` Victoria Dye
  2022-01-11 22:04     ` Elijah Newren
  0 siblings, 1 reply; 11+ messages in thread
From: Victoria Dye @ 2022-01-11 18:30 UTC (permalink / raw)
  To: Elijah Newren, git; +Cc: Derrick Stolee, Lessley Dennington

Elijah Newren wrote:
> Rather than lstat()'ing every SKIP_WORKTREE path, take advantage of the
> fact that entire directories will often be missing, especially for cone
> mode and even more so ever since commit 55dfcf9591 ("sparse-checkout:
> clear tracked sparse dirs", 2021-09-08).  If we have already determined
> that the parent directory of a file (or any other previous ancestor)
> does not exist, then we already know the file cannot exist and do not
> need to lstat() it separately.
> 
> Granted, the cost of ensure_skip_worktree_means_skip_worktree() might
> be considered a bit high for non-cone mode since it might now lstat()
> every SKIP_WORKTREE path when the index is loaded (an O(N) cost, with
> N the number of SKIP_WORKTREE paths), but non-cone mode users already
> have to deal with the O(N*M) cost (with N=the number of tracked files
> and M=the number of sparsity patterns), so this should be reasonable.
> 

Did you write/run any performance tests to see how this optimization changed
the execution time? If not, running the `p2000` performance tests against
the patch series base, [3/5], and [5/5] would provide some really helpful
insight into the cost of `ensure_skip_worktree_means_skip_worktree`, then
how much this optimization improves it.

> Signed-off-by: Elijah Newren <newren@gmail.com>
> ---
>  sparse-index.c | 105 ++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 103 insertions(+), 2 deletions(-)
> 
> diff --git a/sparse-index.c b/sparse-index.c
> index 79d50e444c..608782e255 100644
> --- a/sparse-index.c
> +++ b/sparse-index.c
> @@ -341,18 +341,117 @@ void ensure_correct_sparsity(struct index_state *istate)
>  		ensure_full_index(istate);
>  }
>  
> +struct path_cache_entry {
> +	struct hashmap_entry ent;
> +	const char *path;
> +	int path_length;
> +	int is_present;
> +};
> +
> +static int path_cache_cmp(const void *unused,
> +			  const struct hashmap_entry *entry1,
> +			  const struct hashmap_entry *entry2,
> +			  const void *also_unused)
> +{
> +	const struct path_cache_entry *e1, *e2;
> +
> +	e1 = container_of(entry1, const struct path_cache_entry, ent);
> +	e2 = container_of(entry2, const struct path_cache_entry, ent);
> +	if (e1->path_length != e2->path_length)
> +		return e1->path_length - e2->path_length;
> +	return memcmp(e1->path, e2->path, e1->path_length);
> +}
> +
> +static struct path_cache_entry *find_path_cache_entry(struct hashmap *map,
> +						      const char *str,
> +						      int str_length)
> +{
> +	struct path_cache_entry entry;
> +	hashmap_entry_init(&entry.ent, memhash(str, str_length));
> +	entry.path = str;
> +	entry.path_length = str_length;
> +	return hashmap_get_entry(map, &entry, ent, NULL);
> +}
> +
> +static void record(struct hashmap *path_cache,
> +		   struct mem_pool *pool,
> +		   const char *path,
> +		   int path_length,
> +		   int found)
> +{
> +	struct path_cache_entry *entry;
> +
> +	entry = mem_pool_alloc(pool, sizeof(*entry));
> +	hashmap_entry_init(&entry->ent, memhash(path, path_length));
> +	entry->path = path;
> +	entry->path_length = path_length;
> +	entry->is_present = found;
> +	hashmap_add(path_cache, &entry->ent);
> +}
> +
> +static int path_found(struct hashmap *path_cache, struct mem_pool *pool,
> +		      const char *path, int path_length)
> +{
> +	struct stat st;
> +	int found;
> +	const char *dirsep = path + path_length - 1;
> +	const char *tmp;
> +
> +	/* Find directory separator; memrchr is sadly glibc-specific */
> +	while (dirsep > path && *dirsep != '/')
> +		dirsep--;
> +
> +	/* If parent of path doesn't exist, no point lstat'ing path... */
> +	if (dirsep > path) {
> +		struct path_cache_entry *entry;
> +		int new_length, parent_found;
> +
> +		/* First, check if path's parent's existence was cached */
> +		new_length = dirsep - path;
> +		entry = find_path_cache_entry(path_cache, path, new_length);
> +		if (entry)
> +			parent_found = entry->is_present;
> +		else
> +			parent_found = path_found(path_cache, pool,
> +						  path, new_length);
> +
> +		if (!parent_found) {
> +			/* path can't exist if parent dir doesn't */
> +			record(path_cache, pool, path, path_length, 0);
> +			return 0;
> +		} /* else parent was found so must check path itself too... */
> +	}
> +
> +	/* Okay, parent dir exists, so we have to check original path */
> +
> +	/* Make sure we have a NUL-terminated string to pass to lstat */
> +	tmp = path;
> +	if (path[path_length])
> +		tmp = mem_pool_strndup(pool, path, path_length);
> +	/* Determine if path exists */
> +	found = !lstat(tmp, &st);
> +
> +	record(path_cache, pool, path, path_length, found);
> +	return found;
> +}
> +
>  void ensure_skip_worktree_means_skip_worktree(struct index_state *istate)
>  {
> +	struct hashmap path_cache = HASHMAP_INIT(path_cache_cmp, NULL);
> +	struct mem_pool pool;
> +
>  	int i;
> +
>  	if (!core_apply_sparse_checkout)
>  		return;
>  
> +	mem_pool_init(&pool, 32*1024);
>  restart:
>  	for (i = 0; i < istate->cache_nr; i++) {
>  		struct cache_entry *ce = istate->cache[i];
> -		struct stat st;
>  
> -		if (ce_skip_worktree(ce) && !lstat(ce->name, &st)) {
> +		if (ce_skip_worktree(ce) &&
> +		    path_found(&path_cache, &pool, ce->name, strlen(ce->name))) {
>  			if (S_ISSPARSEDIR(ce->ce_mode)) {
>  				ensure_full_index(istate);
>  				goto restart;
> @@ -360,6 +459,8 @@ void ensure_skip_worktree_means_skip_worktree(struct index_state *istate)
>  			ce->ce_flags &= ~CE_SKIP_WORKTREE;
>  		}
>  	}
> +	hashmap_clear(&path_cache);
> +	mem_pool_discard(&pool, 0);
>  }
>  
>  


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 3/5] repo_read_index: ensure SKIP_WORKTREE means skip worktree
  2022-01-10 20:38   ` Victoria Dye
@ 2022-01-11 19:27     ` Elijah Newren
  2022-01-11 23:09       ` Victoria Dye
  0 siblings, 1 reply; 11+ messages in thread
From: Elijah Newren @ 2022-01-11 19:27 UTC (permalink / raw)
  To: Victoria Dye; +Cc: Git Mailing List, Derrick Stolee, Lessley Dennington

On Mon, Jan 10, 2022 at 12:38 PM Victoria Dye <vdye@github.com> wrote:
>
> Elijah Newren wrote:
...
> > === User facing issues ===
...
> > Further:
> >   * these files will not be updated by by standard commands
> >     (switch/checkout/pull/merge/rebase will leave them alone unless
> >     conflicts happen -- and even then, the conflicted file may be
> >     written somewhere else to avoid overwriting the SKIP_WORKTREE file
> >     that is present and in the way)
> >   * there is nothing in Git that users can use to discover such
> >     files (status, diff, grep, etc. all ignore it)
> >   * there is no reasonable mechanism to "recover" from such a condition
> >     (neither `git sparse-checkout reapply` nor `git reset --hard` will
> >     correct it).
> >
>
> Just to add to this, files like these always force sparse index expansion in
> `git status` (and probably some other commands?), ruining a lot of the
> performance gains of using sparse index in the first place.

Oh, good point.  Another reason this state is just bad.

...
> > === Suggested solution ===
> >
> > SKIP_WORKTREE was written to allow sparse-checkouts, in particular, as
> > the name of the option implies, to allow the file to NOT be in the
> > worktree but consider it to be unchanged rather than deleted.
> >
> > The suggests a simple solution: present-despite-SKIP_WORKTREE files
> > should not exist, for those using sparse-checkouts.
> >
> > Enforce this at index loading time by checking if core.sparseCheckout is
> > true; if so, check files in the index with the SKIP_WORKTREE bit set to
> > verify that they are absent from the working tree.  If they are present,
> > unset the bit (in memory, though any commands that write to the index
> > will record the update).
> >
>
> Since this solution is specific to a sparse-checkout, should this automatic
> unsetting only be done if the file is outside the sparse checkout
> definition? Otherwise, the `sparse-checkout reapply` cleanup suggested below
> doesn't return the original `skip-worktree` state.

That's an interesting distinction, but I think there are multiple
problems with it:
  1) I think you might be assuming this state would only be entered by
a deliberate user choice, but that's far from the case
  2) Allowing present-despite-SKIP_WORKTREE only for in-cone paths
still allows the state to exist, which has multiple ramifications:
    2A) while it solves user facing problems for out-of-cone
present-despite-SKIP_WORKTREE files, it leaves user facing problems
for in-cone ones
    2B) it doesn't solve any of the complexity or need for special
testing outlined in the commit message since the bad state is still
possible for some paths
    2C) Related to (2B), your series would need to do more work to
make checkout-index sane
  3) It's basically impossible to keep this kind of skip-worktree
state in a sparse-checkout anyway.

Let's look at (1), (2A), and (3) in more detail, since those may not be obvious:

=== (1) present-despite-SKIP_WORKTREE on in-cone can be triggered
accidentally ===

Users can directly edit $GIT_DIR/info/sparse-checkout.  We documented
how for years.  People wrote tools that did so.  `git sparse-checkout`
came much later.  So, they could get into this state by having a
sparse-checkout already, and then editing
$GIT_DIR/info/sparse-checkout such that the files which used to be
out-of-cone are now in-cone even though the working directory doesn't
match.  People could also get into this state without knowing about
the $GIT_DIR/info/sparse-checkout file, with an in-opportune Ctrl-C
during the middle of a `git sparse-checkout ...` command of some sort.
I actually think it's more likely that people accidentally get into
this state than deliberately.  However, for sake of argument, let's
presume people could only get into this state intentionally.  I think
we'd still want to make my changes for this class of files because of
the other reasons, so let's move on to those...

=== (2A) present-despite-SKIP_WORKTREE on in-cone has user-facing
problems too ===

So, we have a user who has some file(s) that are marked SKIP_WORKTREE
despite being in-cone.  Let's assume that the in-cone file has some
dirty change(s).  First, note that the in-cone file is already
SKIP_WORKTREE:

   $ git ls-files -t
   S in-cone/foo.c
   S out-of-cone/tracked

Now this user tries to switch to another branch which does differ in
in-cone/foo.c (or try to do a rebase that involves in-cone/foo.c,
or...):

   $ git checkout other-branch-modifying-in-cone
   error: Your local changes to the following files would be
overwritten by checkout:
   in-cone/foo.c
   Please commit your changes or stash them before you switch branches.
   Aborting

Oh, thinks the user, maybe I should check status before switching branches:

   $ git status --porcelain
   $

The user perhaps finds it odd that the file reported as having "local
changes" doesn't show up in status, but decides to do what the error
message reported anyway:

   $ git stash save foobar
   No local changes to save

If neither status nor stash think there are any changes, why does
checkout?  Why can't I switch branches??  If they attempt to switch
branches again anyway, they'll get the same local-changes error.
Perhaps at this point they notice there were two possible solutions in
the error message, so they decide to try committing the changes:

   $ git commit -am 'changes'
   On branch main
   You are in a sparse checkout with 0% of tracked files present.

   nothing to commit, working tree clean

Three commands tell me my working tree is clean, but checkout says
otherwise?!?!?  Why can't I switch branches???  Stupid Git!  (And if
someone points them at `git add $FILENAME`, then they get another
error message and are told to use `git add --sparse $FILENAME`...)

In contrast, with my patch, none of this craziness happens and a
locally modified file is detected and reported as locally modified by
every command within a sparse checkout.  It's so much saner for users.

But, for sake of argument, let's ignore all of the reasons so far and
instead pretend there are users who deliberately create skip-worktree
files in-cone, and thus expect all these weirdnesses.  Let's focus on
the experience they might expect, and turn to item (3):

=== (3) present-despite-SKIP_WORKTREE on in-cone files will be
aggressively cleared _anyway_, due to the fundamental design of
sparse-checkout ===

Let's say I have a simple sparse-checkout with just a couple files:

   $ git ls-files -t
   H in-cone/foo.c
   S out-of-cone/tracked

Now, I decide to make in-cone/foo.c skip-worktree, but keep it in the
working tree and dirty its contents:

   $ git update-index --skip-worktree in-cone/foo.c
   $ echo dirty >in-cone/foo.c
   $ git ls-files -t
   S in-cone/foo.c
   S out-of-cone/tracked

Now, status won't report our dirty modifications, commit won't include
them, etc.  Perhaps that's what the user wanted by marking it as
skip-worktree.  But as soon as they invoke any command that calls
unpack_trees() in a way that might update the working copy (but which
wouldn't need to touch the dirty files marked skip-worktree), the
skip-worktree status is going to be dropped:

   $ git checkout -q HEAD^0
   warning: The following paths were already present and thus not
updated despite sparse patterns:
   in-cone/foo.c

   After fixing the above paths, you may want to run `git
sparse-checkout reapply`.
   Switched to branch 'other-mod'
   $ git ls-files -t
   H in-cone/foo.c
   S out-of-cone/tracked

This is not special to detaching HEAD; I could have switched to
another branch and seen the same thing.  Or rebased or cherry-picked
or merge commits that didn't modify the skip-worktree file, and they'd
all unset the skip-worktree bit for this file.  This isn't a bug,
either; it's by design.  sparsity patterns have to be reapplied when
switching branches in general (or rebasing or...), because that
operation might bring in new files and so we need to know whether
those new files should be SKIP_WORKTREE.  You might say we could
attempt to limit the SKIP_WORKTREE bit flipping to "just" new files,
but due to the possibility of conflicts in earlier merges/rebases
causing files to lose the SKIP_WORKTREE bit, we wanted future
operations that were busy updating the bits anyway to update those
files and clear the bit (and this pre-dated the work Stolee and I did
on sparse-checkouts, btw).

So, this means that checkout, merge, rebase, cherry-pick, etc. are
going to be clearing SKIP_WORKTREE bits for in-cone files all the
time.

So, even if users did try to deliberately get into such a state,
there's no point attempting to preserve the bit since so many commands
are going to be aggressively clearing it anyway.

> Admittedly, I imagine it's unlikely that someone is simultaneously using a
> sparse checkout and manually SKIP_WORKTREE-ing files *inside* the sparse
> checkout definition. But, given that you're not unsetting the flag for
> non-sparse-checkout SKIP_WORKTREE files, it seems like an additional
> constraint based on sparse checkout patterns would be consistent with
> other parts of this patch.
>
> > Users can, of course, can get the SKIP_WORKTREE bit back such as by
> > running `git sparse-checkout reapply` (if they have ensured the file is
> > unmodified and doesn't match the specified sparsity patterns).
> >
>
> There are some performance implications of this solution in a sparse
> index-enabled checkout. Any time an out-of-cone file is no longer
> SKIP_WORKTREE, its parent directory lineage will be added to the sparse index,
> and performance would progressively (silently) degrade as more out-of-cone
> files were added.

Can I restate this a bit?

"""
The point of sparse-index is to allow operations to be faster due to
needing to track fewer items in the index.  Its performance is nearly
linearly correlated with the number of paths it has to track, so fewer
paths is better.  If a user needs to work with more paths than they
previously did, then their sparse-index will be correspondingly
slower.

And users who start modifying files, are obviously working with more
paths than they previously did.
"""

Based on that restating, I think the only thing that could be
considered a problem is your implication that the performance
degradation would silently persist.  If users don't undo or commit
those local modifications, then they really are working with more
files and I think it's fine for the sparse-index to remain expanded to
include those files.  In contrast, if users undo those local changes
(or commit them so they are no longer dirty), then future
checkout/rebase/merge/etc. operations will automatically (re-)set the
SKIP_WORKTREE bit on those files, and the sparse-index can shrink
again.  These dirty files would be very similar to files that have
conflicts during a merge: both have their SKIP_WORKTREE bit cleared
despite being outside the sparsity paths.  Continuing the comparison,
after a user resolves a conflict and commits the changes, the
previously conflicted file will automatically have its SKIP_WORKTREE
bit set again by a future checkout/rebase/merge/etc.  Or users can
manually get it to be set earlier with a `git sparse-checkout
reapply`.  So the performance degradation from the additional paths
the user is working with is only temporary, and only lasts roughly as
long as the user is still working with these additional paths.  To me,
this feels like expected and wanted behavior.

> That said, a lot of my concern would be alleviated with some kind of warning
> indicating that a file just had SKIP_WORKTREE removed, including a mention
> of fixing it with `git sparse-checkout reapply`.

That might be interesting, but there's two problems here:
  * It'd affect more types of files and situations than intended.  In
particular, given [11], it could be quite noisy due to triggering on
files other than those manually tweaked by users and even include
files never involved in conflicts either.
  * It'd repeatedly trigger on the exact same files, possibly muddying
other output unexpectedly.  Given that we clear the bit in memory but
not necessarily on disk, users would repeatedly see the same warning
for the same files during sequences of read-only operations (like `git
diff` or `git log`) and they warnings would only go away once the user
ran a command that re-wrote the index (such as `git status`).

I think the point of your suggestion was to help users recover from
potentially persisting performance degradation; is that fair?  If so,
then as I highlighted above, I don't think we have any such persisting
problem.

[11] https://lore.kernel.org/git/xmqqbmb1a7ga.fsf@gitster-ct.c.googlers.com/

> It would be *extra* nice if `git status` could tell a user that they have
> non-SKIP_WORKTREE files outside the sparse definition, but I think that's
> less critical and probably better suited as a separate series.

Oh, interesting idea.  But I agree, it'd be better suited to a separate series.

> > diff --git a/sparse-index.c b/sparse-index.c
> > index a1d505d50e..79d50e444c 100644
> > --- a/sparse-index.c
> > +++ b/sparse-index.c
> > @@ -341,6 +341,28 @@ void ensure_correct_sparsity(struct index_state *istate)
> >               ensure_full_index(istate);
> >  }
> >
> > +void ensure_skip_worktree_means_skip_worktree(struct index_state *istate)
>
> I can feel the frustration behind this name. :)

:-)

> However, a more descriptive one would make the code easier to follow, e.g.
> 'clear_skip_worktree_from_present_files' (or something else indicating what
> it does to the index).

Yeah, I did spend some time trying to come up with a better name (I
realized it wasn't the best), but failed.  Your suggestion seems
obvious in hindsight.  I like it; I'll make that change.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 5/5] Accelerate ensure_skip_worktree_means_skip_worktree by caching
  2022-01-11 18:30   ` Victoria Dye
@ 2022-01-11 22:04     ` Elijah Newren
  0 siblings, 0 replies; 11+ messages in thread
From: Elijah Newren @ 2022-01-11 22:04 UTC (permalink / raw)
  To: Victoria Dye; +Cc: Git Mailing List, Derrick Stolee, Lessley Dennington

On Tue, Jan 11, 2022 at 10:30 AM Victoria Dye <vdye@github.com> wrote:
>
> Elijah Newren wrote:
> > Rather than lstat()'ing every SKIP_WORKTREE path, take advantage of the
> > fact that entire directories will often be missing, especially for cone
> > mode and even more so ever since commit 55dfcf9591 ("sparse-checkout:
> > clear tracked sparse dirs", 2021-09-08).  If we have already determined
> > that the parent directory of a file (or any other previous ancestor)
> > does not exist, then we already know the file cannot exist and do not
> > need to lstat() it separately.
> >
> > Granted, the cost of ensure_skip_worktree_means_skip_worktree() might
> > be considered a bit high for non-cone mode since it might now lstat()
> > every SKIP_WORKTREE path when the index is loaded (an O(N) cost, with
> > N the number of SKIP_WORKTREE paths), but non-cone mode users already
> > have to deal with the O(N*M) cost (with N=the number of tracked files
> > and M=the number of sparsity patterns), so this should be reasonable.
> >
>
> Did you write/run any performance tests to see how this optimization changed
> the execution time? If not, running the `p2000` performance tests against
> the patch series base, [3/5], and [5/5] would provide some really helpful
> insight into the cost of `ensure_skip_worktree_means_skip_worktree`, then
> how much this optimization improves it.

I haven't[1].  You bring up a very good point; I'll add it for the next round.

[1] Long, probably irrelevant story about why: My original patches
were actually going to go further and just remove the
present-despite-SKIP_WORKTREE files in _all_ cases, sparse-checkout or
not. It had not occurred to me while writing the patches to make it
specific to sparse-checkouts.  Because of that, I figured it was
better to get feedback on if the idea was acceptable and spent a lot
more time concentrating on making the case.  Then I realized near the
end that folks who don't use sparse-checkout or SKIP_WORKTREE might be
annoyed at the overhead also being added for them, for a feature
they'll never even use.  I decided to back off a bit, and make it
sparse-checkout specific.  Then I realized that backing off might just
keep all users happy anyway (the folks who intentionally use
present-despite-SKIP_WORKTREE paths, despite their many warts, can
keep doings so) and edited a lot of my commit messages, docs, and
cover letter.  And by then it was late Saturday night and I had
promised to send out my series on Friday.  Since I had already marked
my cover letter as RFC anyway, I just decided to temporarily punt on
getting performance numbers...

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 3/5] repo_read_index: ensure SKIP_WORKTREE means skip worktree
  2022-01-11 19:27     ` Elijah Newren
@ 2022-01-11 23:09       ` Victoria Dye
  0 siblings, 0 replies; 11+ messages in thread
From: Victoria Dye @ 2022-01-11 23:09 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Git Mailing List, Derrick Stolee, Lessley Dennington

Elijah Newren wrote:
> On Mon, Jan 10, 2022 at 12:38 PM Victoria Dye <vdye@github.com> wrote:
>>
>> Elijah Newren wrote:
> ...
>>> === User facing issues ===
> ...
>>> Further:
>>>   * these files will not be updated by by standard commands
>>>     (switch/checkout/pull/merge/rebase will leave them alone unless
>>>     conflicts happen -- and even then, the conflicted file may be
>>>     written somewhere else to avoid overwriting the SKIP_WORKTREE file
>>>     that is present and in the way)
>>>   * there is nothing in Git that users can use to discover such
>>>     files (status, diff, grep, etc. all ignore it)
>>>   * there is no reasonable mechanism to "recover" from such a condition
>>>     (neither `git sparse-checkout reapply` nor `git reset --hard` will
>>>     correct it).
>>>
>>
>> Just to add to this, files like these always force sparse index expansion in
>> `git status` (and probably some other commands?), ruining a lot of the
>> performance gains of using sparse index in the first place.
> 
> Oh, good point.  Another reason this state is just bad.
> 
> ...
>>> === Suggested solution ===
>>>
>>> SKIP_WORKTREE was written to allow sparse-checkouts, in particular, as
>>> the name of the option implies, to allow the file to NOT be in the
>>> worktree but consider it to be unchanged rather than deleted.
>>>
>>> The suggests a simple solution: present-despite-SKIP_WORKTREE files
>>> should not exist, for those using sparse-checkouts.
>>>
>>> Enforce this at index loading time by checking if core.sparseCheckout is
>>> true; if so, check files in the index with the SKIP_WORKTREE bit set to
>>> verify that they are absent from the working tree.  If they are present,
>>> unset the bit (in memory, though any commands that write to the index
>>> will record the update).
>>>
>>
>> Since this solution is specific to a sparse-checkout, should this automatic
>> unsetting only be done if the file is outside the sparse checkout
>> definition? Otherwise, the `sparse-checkout reapply` cleanup suggested below
>> doesn't return the original `skip-worktree` state.
> 
> That's an interesting distinction, but I think there are multiple
> problems with it:
>   1) I think you might be assuming this state would only be entered by
> a deliberate user choice, but that's far from the case
>   2) Allowing present-despite-SKIP_WORKTREE only for in-cone paths
> still allows the state to exist, which has multiple ramifications:
>     2A) while it solves user facing problems for out-of-cone
> present-despite-SKIP_WORKTREE files, it leaves user facing problems
> for in-cone ones
>     2B) it doesn't solve any of the complexity or need for special
> testing outlined in the commit message since the bad state is still
> possible for some paths
>     2C) Related to (2B), your series would need to do more work to
> make checkout-index sane
>   3) It's basically impossible to keep this kind of skip-worktree
> state in a sparse-checkout anyway.
> 
> Let's look at (1), (2A), and (3) in more detail, since those may not be obvious:
> 
> === (1) present-despite-SKIP_WORKTREE on in-cone can be triggered
> accidentally ===
> 
> Users can directly edit $GIT_DIR/info/sparse-checkout.  We documented
> how for years.  People wrote tools that did so.  `git sparse-checkout`
> came much later.  So, they could get into this state by having a
> sparse-checkout already, and then editing
> $GIT_DIR/info/sparse-checkout such that the files which used to be
> out-of-cone are now in-cone even though the working directory doesn't
> match.  People could also get into this state without knowing about
> the $GIT_DIR/info/sparse-checkout file, with an in-opportune Ctrl-C
> during the middle of a `git sparse-checkout ...` command of some sort.
> I actually think it's more likely that people accidentally get into
> this state than deliberately.  However, for sake of argument, let's
> presume people could only get into this state intentionally.  I think
> we'd still want to make my changes for this class of files because of
> the other reasons, so let's move on to those...
> 
> === (2A) present-despite-SKIP_WORKTREE on in-cone has user-facing
> problems too ===
> 
> So, we have a user who has some file(s) that are marked SKIP_WORKTREE
> despite being in-cone.  Let's assume that the in-cone file has some
> dirty change(s).  First, note that the in-cone file is already
> SKIP_WORKTREE:
> 
>    $ git ls-files -t
>    S in-cone/foo.c
>    S out-of-cone/tracked
> 
> Now this user tries to switch to another branch which does differ in
> in-cone/foo.c (or try to do a rebase that involves in-cone/foo.c,
> or...):
> 
>    $ git checkout other-branch-modifying-in-cone
>    error: Your local changes to the following files would be
> overwritten by checkout:
>    in-cone/foo.c
>    Please commit your changes or stash them before you switch branches.
>    Aborting
> 
> Oh, thinks the user, maybe I should check status before switching branches:
> 
>    $ git status --porcelain
>    $
> 
> The user perhaps finds it odd that the file reported as having "local
> changes" doesn't show up in status, but decides to do what the error
> message reported anyway:
> 
>    $ git stash save foobar
>    No local changes to save
> 
> If neither status nor stash think there are any changes, why does
> checkout?  Why can't I switch branches??  If they attempt to switch
> branches again anyway, they'll get the same local-changes error.
> Perhaps at this point they notice there were two possible solutions in
> the error message, so they decide to try committing the changes:
> 
>    $ git commit -am 'changes'
>    On branch main
>    You are in a sparse checkout with 0% of tracked files present.
> 
>    nothing to commit, working tree clean
> 
> Three commands tell me my working tree is clean, but checkout says
> otherwise?!?!?  Why can't I switch branches???  Stupid Git!  (And if
> someone points them at `git add $FILENAME`, then they get another
> error message and are told to use `git add --sparse $FILENAME`...)
> 
> In contrast, with my patch, none of this craziness happens and a
> locally modified file is detected and reported as locally modified by
> every command within a sparse checkout.  It's so much saner for users.
> 

This particular issue also happens in non-sparse-checkout usage of
SKIP_WORKTREE, but point (1) (+ a general desire to make sparse-checkout
less "experimental") is a good reason to ensure the UX in a sparse-checkout
isn't this frustrating.

> But, for sake of argument, let's ignore all of the reasons so far and
> instead pretend there are users who deliberately create skip-worktree
> files in-cone, and thus expect all these weirdnesses.  Let's focus on
> the experience they might expect, and turn to item (3):
> 
> === (3) present-despite-SKIP_WORKTREE on in-cone files will be
> aggressively cleared _anyway_, due to the fundamental design of
> sparse-checkout ===
> 
> Let's say I have a simple sparse-checkout with just a couple files:
> 
>    $ git ls-files -t
>    H in-cone/foo.c
>    S out-of-cone/tracked
> 
> Now, I decide to make in-cone/foo.c skip-worktree, but keep it in the
> working tree and dirty its contents:
> 
>    $ git update-index --skip-worktree in-cone/foo.c
>    $ echo dirty >in-cone/foo.c
>    $ git ls-files -t
>    S in-cone/foo.c
>    S out-of-cone/tracked
> 
> Now, status won't report our dirty modifications, commit won't include
> them, etc.  Perhaps that's what the user wanted by marking it as
> skip-worktree.  But as soon as they invoke any command that calls
> unpack_trees() in a way that might update the working copy (but which
> wouldn't need to touch the dirty files marked skip-worktree), the
> skip-worktree status is going to be dropped:
> 
>    $ git checkout -q HEAD^0
>    warning: The following paths were already present and thus not
> updated despite sparse patterns:
>    in-cone/foo.c
> 
>    After fixing the above paths, you may want to run `git
> sparse-checkout reapply`.
>    Switched to branch 'other-mod'
>    $ git ls-files -t
>    H in-cone/foo.c
>    S out-of-cone/tracked
> 
> This is not special to detaching HEAD; I could have switched to
> another branch and seen the same thing.  Or rebased or cherry-picked
> or merge commits that didn't modify the skip-worktree file, and they'd
> all unset the skip-worktree bit for this file.  This isn't a bug,
> either; it's by design.  sparsity patterns have to be reapplied when
> switching branches in general (or rebasing or...), because that
> operation might bring in new files and so we need to know whether
> those new files should be SKIP_WORKTREE.  You might say we could
> attempt to limit the SKIP_WORKTREE bit flipping to "just" new files,
> but due to the possibility of conflicts in earlier merges/rebases
> causing files to lose the SKIP_WORKTREE bit, we wanted future
> operations that were busy updating the bits anyway to update those
> files and clear the bit (and this pre-dated the work Stolee and I did
> on sparse-checkouts, btw).
> 
> So, this means that checkout, merge, rebase, cherry-pick, etc. are
> going to be clearing SKIP_WORKTREE bits for in-cone files all the
> time.
> 
> So, even if users did try to deliberately get into such a state,
> there's no point attempting to preserve the bit since so many commands
> are going to be aggressively clearing it anyway.
> 

Ahh, I should have remembered that - it's also really helpful context for
understanding why you can/should aggressively unset SKIP_WORKTREE (because
git will aggressively re-set it if a file is up-to-date).

>> Admittedly, I imagine it's unlikely that someone is simultaneously using a
>> sparse checkout and manually SKIP_WORKTREE-ing files *inside* the sparse
>> checkout definition. But, given that you're not unsetting the flag for
>> non-sparse-checkout SKIP_WORKTREE files, it seems like an additional
>> constraint based on sparse checkout patterns would be consistent with
>> other parts of this patch.
>>
>>> Users can, of course, can get the SKIP_WORKTREE bit back such as by
>>> running `git sparse-checkout reapply` (if they have ensured the file is
>>> unmodified and doesn't match the specified sparsity patterns).
>>>
>>
>> There are some performance implications of this solution in a sparse
>> index-enabled checkout. Any time an out-of-cone file is no longer
>> SKIP_WORKTREE, its parent directory lineage will be added to the sparse index,
>> and performance would progressively (silently) degrade as more out-of-cone
>> files were added.
> 
> Can I restate this a bit?
> 
> """
> The point of sparse-index is to allow operations to be faster due to
> needing to track fewer items in the index.  Its performance is nearly
> linearly correlated with the number of paths it has to track, so fewer
> paths is better.  If a user needs to work with more paths than they
> previously did, then their sparse-index will be correspondingly
> slower.
> 
> And users who start modifying files, are obviously working with more
> paths than they previously did.
> """
> 
> Based on that restating, I think the only thing that could be
> considered a problem is your implication that the performance
> degradation would silently persist.  If users don't undo or commit
> those local modifications, then they really are working with more
> files and I think it's fine for the sparse-index to remain expanded to
> include those files.  In contrast, if users undo those local changes
> (or commit them so they are no longer dirty), then future
> checkout/rebase/merge/etc. operations will automatically (re-)set the
> SKIP_WORKTREE bit on those files, and the sparse-index can shrink
> again.  These dirty files would be very similar to files that have
> conflicts during a merge: both have their SKIP_WORKTREE bit cleared
> despite being outside the sparsity paths.  Continuing the comparison,
> after a user resolves a conflict and commits the changes, the
> previously conflicted file will automatically have its SKIP_WORKTREE
> bit set again by a future checkout/rebase/merge/etc.  Or users can
> manually get it to be set earlier with a `git sparse-checkout
> reapply`.  So the performance degradation from the additional paths
> the user is working with is only temporary, and only lasts roughly as
> long as the user is still working with these additional paths.  To me,
> this feels like expected and wanted behavior.
> 
>> That said, a lot of my concern would be alleviated with some kind of warning
>> indicating that a file just had SKIP_WORKTREE removed, including a mention
>> of fixing it with `git sparse-checkout reapply`.
> 
> That might be interesting, but there's two problems here:
>   * It'd affect more types of files and situations than intended.  In
> particular, given [11], it could be quite noisy due to triggering on
> files other than those manually tweaked by users and even include
> files never involved in conflicts either.
>   * It'd repeatedly trigger on the exact same files, possibly muddying
> other output unexpectedly.  Given that we clear the bit in memory but
> not necessarily on disk, users would repeatedly see the same warning
> for the same files during sequences of read-only operations (like `git
> diff` or `git log`) and they warnings would only go away once the user
> ran a command that re-wrote the index (such as `git status`).
> 
> I think the point of your suggestion was to help users recover from
> potentially persisting performance degradation; is that fair?  If so,
> then as I highlighted above, I don't think we have any such persisting
> problem.
> 

Yes, the performance degradation was my main concern, but I agree that it
wouldn't be a persistent problem based on your explanation. Thanks for the
clarification!

> [11] https://lore.kernel.org/git/xmqqbmb1a7ga.fsf@gitster-ct.c.googlers.com/
> 
>> It would be *extra* nice if `git status` could tell a user that they have
>> non-SKIP_WORKTREE files outside the sparse definition, but I think that's
>> less critical and probably better suited as a separate series.
> 
> Oh, interesting idea.  But I agree, it'd be better suited to a separate series.
> 
>>> diff --git a/sparse-index.c b/sparse-index.c
>>> index a1d505d50e..79d50e444c 100644
>>> --- a/sparse-index.c
>>> +++ b/sparse-index.c
>>> @@ -341,6 +341,28 @@ void ensure_correct_sparsity(struct index_state *istate)
>>>               ensure_full_index(istate);
>>>  }
>>>
>>> +void ensure_skip_worktree_means_skip_worktree(struct index_state *istate)
>>
>> I can feel the frustration behind this name. :)
> 
> :-)
> 
>> However, a more descriptive one would make the code easier to follow, e.g.
>> 'clear_skip_worktree_from_present_files' (or something else indicating what
>> it does to the index).
> 
> Yeah, I did spend some time trying to come up with a better name (I
> realized it wasn't the best), but failed.  Your suggestion seems
> obvious in hindsight.  I like it; I'll make that change.


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2022-01-11 23:10 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-09  4:57 [RFC PATCH 0/5] Remove the present-despite-SKIP_WORKTREE class of bugs Elijah Newren
2022-01-09  4:57 ` [RFC PATCH 1/5] t1011: add testcase demonstrating accidental loss of user modifications Elijah Newren
2022-01-09  4:57 ` [RFC PATCH 2/5] unpack-trees: fix accidental loss of user changes Elijah Newren
2022-01-09  4:57 ` [RFC PATCH 3/5] repo_read_index: ensure SKIP_WORKTREE means skip worktree Elijah Newren
2022-01-10 20:38   ` Victoria Dye
2022-01-11 19:27     ` Elijah Newren
2022-01-11 23:09       ` Victoria Dye
2022-01-09  4:57 ` [RFC PATCH 4/5] Update documentation related to sparsity and the skip-worktree bit Elijah Newren
2022-01-09  4:57 ` [RFC PATCH 5/5] Accelerate ensure_skip_worktree_means_skip_worktree by caching Elijah Newren
2022-01-11 18:30   ` Victoria Dye
2022-01-11 22:04     ` Elijah Newren

Code repositories for project(s) associated with this inbox:

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).