git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / mirror / code / Atom feed
* [PATCH] sparse-checkout.txt: new document with sparse-checkout directions
@ 2022-09-25  0:09 Elijah Newren via GitGitGadget
  2022-09-26 17:20 ` Junio C Hamano
                   ` (5 more replies)
  0 siblings, 6 replies; 42+ messages in thread
From: Elijah Newren via GitGitGadget @ 2022-09-25  0:09 UTC (permalink / raw)
  To: git
  Cc: Victoria Dye, Derrick Stolee, Shaoxuan Yuan, Matheus Tavares,
	ZheNing Hu, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

Once upon a time, Matheus wrote some patches to make
   git grep [--cached | <REVISION>] ...
restrict its output to the sparsity specification when working in a
sparse checkout[1].  That effort got derailed by two things:

  (1) The --sparse-index work just beginning which we wanted to avoid
      creating conflicts for
  (2) Never deciding on flag and config names and planned high level
      behavior for all commands.

More recently, Shaoxuan implemented a more limited form of Matheus'
patches that only affected --cached, using a different flag name,
but also changing the default behavior in line with what Matheus did.
This again highlighted the fact that we never decided on command line
flag names, config option names, and the big picture path forward.

The --sparse-index work has been mostly complete (or at least released
into production even if some small edges remain) for quite some time
now.  We have also had several discussions on flag and config names,
though we never came to solid conclusions.  Stolee once upon a time
suggested putting all these into some document in
Documentation/technical[3], which Victoria recently also requested[4].
I'm behind the times, but here's a patch attempting to finally do that.

Note that the "Implementation Questions" section is pretty large,
reflecting the fact that this is perhaps more RFC than proposal.

[1] https://lore.kernel.org/git/5f3f7ac77039d41d1692ceae4b0c5df3bb45b74a.1612901326.git.matheus.bernardino@usp.br/
    (See his second link in that email in particular)
[2] https://lore.kernel.org/git/20220908001854.206789-2-shaoxuan.yuan02@gmail.com/
[3] https://lore.kernel.org/git/CABPp-BHwNoVnooqDFPAsZxBT9aR5Dwk5D9sDRCvYSb8akxAJgA@mail.gmail.com/
    (Scroll to the very end for the final few paragraphs)
[4] https://lore.kernel.org/git/cafcedba-96a2-cb85-d593-ef47c8c8397c@github.com/

Signed-off-by: Elijah Newren <newren@gmail.com>
---
    [RFC] sparse-checkout.txt: new document with sparse-checkout directions
    
    As noted in the title and commit message, while I have some goals &
    plans proposed here, I have a lot more in the questions category.
    Thoughts and opinions very much welcome.

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1367%2Fnewren%2Fsparse-checkout-directions-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1367/newren/sparse-checkout-directions-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1367

 Documentation/technical/sparse-checkout.txt | 670 ++++++++++++++++++++
 1 file changed, 670 insertions(+)
 create mode 100644 Documentation/technical/sparse-checkout.txt

diff --git a/Documentation/technical/sparse-checkout.txt b/Documentation/technical/sparse-checkout.txt
new file mode 100644
index 00000000000..b213b2b3f35
--- /dev/null
+++ b/Documentation/technical/sparse-checkout.txt
@@ -0,0 +1,670 @@
+Table of contents:
+
+  * Purpose of sparse-checkouts
+  * Desired behavior
+  * Subcommand-dependent defaults
+  * Implementation Questions
+  * Implementation Goals/Plans
+  * Known bugs
+  * Reference Emails
+
+
+=== Purpose of sparse-checkouts ===
+
+sparse-checkouts exist to allow users to work with a subset of their
+files.
+
+The idea is simple enough, but there are two different high-level
+usecases which affect how some Git subcommands should behave.  Further,
+even if we only considered one of those usecases, sparse-checkouts
+modify different subcommands in over a half dozen different ways.  Let's
+start by considering the high level usecases in this section:
+
+  A) Users are _only_ interested in the sparse portion of the repo
+
+  B) Users want a sparse working tree, but are working in a larger whole
+
+It may be worth explaining both of these in a bit more detail:
+
+  (Behavior A) Users are _only_ interested in the sparse portion of the repo
+
+These folks might know there are other things in the repository, but
+don't care.  They are uninterested in other parts of the repository, and
+only want to know about changes within their area of interest.  Showing
+them other results from history (e.g. from diff/log/grep/etc.) is a
+usability annoyance, potentially a huge one since other changes in
+history may dwarf the changes they are interested in.
+
+Some of these users also arrive at this usecase from wanting to use
+partial clones together with sparse checkouts and do disconnected
+development.  Not only do these users generally not care about other
+parts of the repository, but consider it a blocker for Git commands to
+try to operate on those.  If commands attempt to access paths in history
+outside the sparsity specification, then the partial clone will attempt
+to download additional blobs on demand, fail, and then fail the user's
+command.  (This may be unavoidable in some cases, e.g. when `git merge`
+has non-trivial changes to reconcile outside the sparsity path, but we
+should limit how often users are forced to connect to the network.)
+
+Also, even for users using partial clones that do not mind being
+always connected to the network, the need to download blobs as
+side-effects of various other commands (such as the printed diffstat
+after a merge or pull) can lead to worries about local repository size
+growing unnecessarily[10].
+
+  (Behavior B) Users want a sparse working tree, but are working in a larger whole
+
+Stolee described this usecase this way[11]:
+
+"I'm also focused on users that know that they are a part of a larger
+whole. They know they are operating on a large repository but focus on
+what they need to contribute their part. I expect multiple "roles" to
+use very different, almost disjoint parts of the codebase. Some other
+"architect" users operate across the entire tree or hop between different
+sections of the codebase as necessary. In this situation, I'm wary of
+scoping too many features to the sparse-checkout definition, especially
+"git log," as it can be too confusing to have their view of the codebase
+depend on your "point of view."
+
+People might also end up wanting behavior B due to complex inter-project
+dependencies.  The initial attempts to use sparse-checkouts usually
+involve the directories you are directly interested in plus what those
+directories depend upon within your repository.  But there's a monkey
+wrench here: if you have integration tests, they invert the hierarchy:
+to run integration tests, you need not only what you are interested in
+and its dependencies, you also need everything that depends upon what
+you are interested in or that depends upon one of your
+dependencies...AND you need all the dependencies of that expanded group.
+That can easily change your sparse-checkout into a nearly dense one.
+Naturally, that tends to kill the benefits of sparse-checkouts.  There
+are a couple solutions to this conundrum: either avoid grabbing
+dependencies (maybe have built versions of your dependencies pulled from
+a CI cache somewhere), or say that users shouldn't run integration tests
+directly and instead do it on the CI server when they submit a code
+review.  Or do both.  Regardless of whether you stub out your
+dependencies or stub out the things that depend upon you, there is
+certainly a reason to want to query and be aware of those other
+stubbed-out parts of the repository, particularly when the dependencies
+are complex or change relatively frequently.  Thus, for such uses,
+sparse-checkouts can be used to limit what you directly build and
+modify, but these users do not necessarily want their sparse checkout
+paths to limit their queries of history.
+
+Some people may also be interested in behavior B simply as a performance
+workaround: if they are using non-cone mode, then they have to deal with
+its inherent quadratic performance problems.  In that mode, every
+operation that checks whether paths match the sparsity specification can
+be expensive.  As such, these users may only be willing to pay for those
+expensive checks when interacting with the working copy, and may prefer
+getting "unrelated" results from their history queries over having slow
+commands.
+
+
+=== Desired behavior ===
+
+As noted in the previous section, despite the simple idea of just
+working with a subset of files, there are a range of different
+behavioral changes that need to be made to different subcommands to work
+well with such a feature.  See [1,2,3,4,5,6,7,8,9,10] for various
+examples.  In particular, at [2], we saw that mere composition of other
+commands that individually worked correctly in a sparse-checkout context
+did not imply that the higher level command would work correctly; it
+sometimes requires further tweaks.  So, understanding these differences
+can be beneficial.
+
+* Commands behaving the same regardless of high-level use-case
+
+  * commands that only look at files within the sparsity specification
+
+      * status
+      * diff (without --cached or REVISION arguments)
+      * grep (without --cached or REVISION arguments)
+
+  * commands that restore files to the working tree that match sparsity patterns, and
+    remove unmodified files that don't match those patterns:
+
+      * switch
+      * checkout (the switch-like half)
+      * read-tree
+      * reset --hard
+
+      * `restore` & the restore-like half of `checkout` SHOULD be in this above
+	category, but are buggy (see the "Known bugs" section below)
+
+  * commands that write conflicted files to the working tree, but otherwise will
+    omit writing files that do not match the sparsity patterns:
+
+      * merge
+      * rebase
+      * cherry-pick
+      * revert
+
+    Note that this somewhat depends upon the merge strategy being used:
+      * `ort` behaves as described above
+      * `recursive` tries to not vivify files unnecessarily, but does sometimes
+	vivify files without conflicts.
+      * `octopus` and `resolve` will always vivify any file changed in the merge
+	relative to the first parent, which is rather suboptimal.
+
+  * commands that always ignore sparsity since commits must be full-tree
+
+      * archive
+      * bundle
+      * commit
+      * format-patch
+      * fast-export
+      * fast-import
+      * commit-tree
+
+  * commands that write any modified file to the working tree (conflicted or not,
+    and whether those paths match sparsity patterns or not):
+
+      * stash
+
+      * am/apply probably should be in the above category, but need to be fixed to
+	auto-vivify instead of failing
+
+* Commands that differ for behavior A vs. behavior B:
+
+  * commands that make modifications:
+      * add
+      * rm
+      * mv
+
+  * commands that query history
+      * diff (with --cached or REVISION arguments)
+      * grep (with --cached or REVISION arguments)
+      * show (when given commit arguments)
+      * bisect
+      * blame
+	* and annotate
+      * log
+	* and variants: shortlog, gitk, show-branch, whatchanged
+
+* Comands I don't know how to classify
+
+  * ls-files
+
+    Shows all tracked files by default, and with an option can show
+    sparse directory entries instead of expanding them.  Should there be
+    a way to restrict to just the non SKIP_WORKTREE files?
+
+    Note that `git ls-files -t` is often used to see what is sparse and
+    what is not, which only works with a non-restricted assumption.
+
+  * checkout-index
+
+    should it be like `checkout` and pay attention to sparsity paths, or
+    be considered special and write to working tree anyway?  The
+    interaction with --prefix, and the use of specifically named files
+    (rather than globs) makes me wonder.
+
+  * update-index
+
+    The --[no-]ignore-skip-worktree-entries default is totally bogus,
+    but otherwise this command seems okay?  Not sure what category it
+    would go under, though.
+
+  * range-diff
+
+    Is this like `log` or `format-patch`?
+
+  * cherry
+
+    See range-diff
+
+  * plumbing -- diff-files, diff-index, diff-tree, ls-tree, rev-list
+
+    should these be tweaked or always operate full-tree?
+
+* Commands unaffected by sparse-checkouts
+
+  * branch
+  * clean (works on untracked files, whereas SKIP_WORKTREE files are still tracked)
+  * describe
+  * fetch
+  * gc
+  * init
+  * maintenance
+  * notes
+  * pull (merge & rebase have the necessary changes)
+  * push
+  * submodule
+  * tag
+
+  * config
+  * filter-branch (works in separate checkout without sparse-checkout setup)
+  * pack-refs
+  * prune
+  * remote
+  * repack
+  * replace
+
+  * bugreport
+  * count-objects
+  * fsck
+  * gitweb
+  * help
+  * instaweb
+  * merge-tree (doesn't touch worktree or index, and merges always compute full-tree)
+  * rerere
+  * verify-commit
+  * verify-tag
+
+  * commit-graph
+  * hash-object
+  * index-pack
+  * mktag
+  * mktree
+  * multi-pack-index
+  * pack-objects
+  * prune-packed
+  * symbolic-ref
+  * unpack-objects
+  * update-ref
+  * write-tree (operates on index, possibly optimized to use sparse dir entries)
+
+  * for-each-ref
+  * get-tar-commit-id
+  * ls-remote
+  * merge-base (merges are computed full tree, so merge base should be too)
+  * name-rev
+  * pack-redundant
+  * rev-parse
+  * show-index
+  * show-ref
+  * unpack-file
+  * var
+  * verify-pack
+
+  * <Everything under 'Interacting with Others' in 'git help --all'>
+  * <Everything under 'Low-level...Syncing' in 'git help --all'>
+  * <Everything under 'Low-level...Internal Helpers' in 'git help --all'>
+  * <Everything under 'External commands' in 'git help --all'>
+
+* Commands that might be affected, but who cares?
+
+  * merge-file
+  * merge-index
+
+
+=== Subcommand-dependent defaults ===
+
+Note that we have different defaults (for the desired behavior, not just
+the current implementation) depending on the command:
+
+  * Commands defaulting to --restrict:
+    * status
+    * diff (without --cached or REVISION arguments)
+    * grep (without --cached or REVISION arguments)
+    * switch
+    * checkout (the switch-like half)
+    * read-tree
+    * reset (--hard)
+    * restore/checkout
+    * checkout-index
+
+    This behavior makes sense; these interact with the working tree.
+
+  * Commands defaulting to --restrict-unless-conflicts
+    * merge
+    * rebase
+    * cherry-pick
+    * revert
+
+    These also interact with the working tree, but require slightly different
+    behavior so that conflicts can be resolved.
+
+  * Commands defaulting to --no-restrict
+    * archive
+    * bundle
+    * commit
+    * format-patch
+    * fast-export
+    * fast-import
+    * commit-tree
+
+    * ls-files
+    * stash
+    * am
+    * apply
+
+    These have completely different defaults and perhaps deserve the most detailed
+    explanation:
+
+    In the case of commands in the first group (format-patch,
+    fast-export, bundle, archive, etc.), these are commands for
+    communicating history, which will be broken if they restrict to a
+    subset of the repository.  As such, they operate on full paths and
+    have no `--restrict` option for overriding.  Some of these commands may
+    take paths for manually restricting what is exported, but it needs to
+    be very explicit.
+
+    In the case of stash, it needs to vivify files to avoid losing the
+    user's changes.
+
+    In the case of am and apply, those commands only operate on the
+    working tree, so they are kind of in the same boat as stash.
+    Perhaps `git am` could run `git sparse-checkout reapply`
+    automatically afterward and move into a category more similar to
+    merge/rebase/cherry-pick, but it'd still be weird because it'd
+    vivify files besides just conflicted ones when there are conflicts.
+
+    In the case of ls-files, `git ls-files -t` is often used to see what
+    is sparse and not, in which case restricting would not make sense.
+    Also, ls-files has traditionally been used to get a list of "all
+    tracked files", which would suggest not restricting.  But it's
+    slightly funny, because sparse-checkouts essentially split tracked
+    files into two categories -- those in the sparse specification and
+    those outside -- and how does the user specify which of those two
+    types of tracked files they want?
+
+  * Commands defaulting to --restrict-but-warn (although Behavior A vs. Behavior B
+    may affect how verbose the warnings are):
+    * add
+    * rm
+    * mv
+
+    The defaults here perhaps make sense since they are nearly --restrict, but
+    actually using --restrict could cause user confusion if users specify a
+    specific filename, so they warn by default.  That logic may sound like
+    --no-restrict should be the default, but that's prone to even bigger confusion:
+      * `git add <somefile>` if honored and outside the sparse cone, can result in
+	the file randomly disappearing later when some subsequent command is run
+	(since various commands automatically clean up unmodified files outside
+	the sparsity specification).
+      * `git rm '*.jpg'` could very negatively surprise users if it deletes files
+	outside the range of the user's interest.  Much better to operate on the
+	sparsity specification and give the user warnings if other files could have
+	matched.
+      * `git mv` has similar surprises when moving into or out of the cone, so
+	best to restrict and throw warnings if restriction might affect the result.
+
+    There may be a difference in here between behavior A and behavior B.
+    For behavior A, we probably only want to warn if there were no
+    suitable matches for files in the sparsity specification, whereas
+    for behavior B, we may want to warn even if there are valid files to
+    operate on if the result would have been different under
+    `--no-restrict`.
+
+  * Commands whose default for --restrict vs. --no-restrict should vary depending
+    on Behavior A or Behavior B
+    * diff (with --cached or REVISION arguments)
+    * grep (with --cached or REVISION arguments)
+    * show (when given commit arguments)
+    * bisect
+    * blame
+      * and annotate
+    * log
+      * and variants: shortlog, gitk, show-branch, whatchanged
+
+    For now, we default to behavior B for these, which want a default of
+    --no-restrict.
+
+    Note that two of these commands -- diff and grep -- also appeared in
+    a different list with a default of --restrict, but only when limited
+    to searching the working tree.  The working tree vs. history
+    distinction is fundamental in how behavior B operates, so this is
+    expected.
+
+    --restrict may make more sense as the long term default for
+    these[12], but that's a fair amount of work to implement, and it'd
+    be very problematic for behavior B users.  Making it the default
+    now, and then slowly implementing that default in various
+    subcommands over multiple releases would mean that behavior B users
+    would need to learn to slowly add additional flags to their
+    commands, depending on git version, to get the behavior they want.
+    That gradual switchover would be painful, so we should avoid it at
+    least until it's fully implemented.
+
+
+=== Implementation Questions ===
+
+  * Does the name --[no-]restrict sound good to others?  Are there better options?
+    * Names in use, or appearing in patches, or previously suggested:
+      * --sparse/--dense
+      * --ignore-skip-worktree-bits
+      * --ignore-skip-worktree-entries
+      * --ignore-sparsity
+      * --[no-]restrict-to-sparse-paths
+      * --full-tree/--sparse-tree
+      * --[no-]restrict
+    * Rationale making me lean slightly towards --[no-]restrict:
+      * We want a name that works for many commands, so we need a name that
+	does not conflict
+      * --[no-]restrict isn't overly long and seems relatively explanatory
+      * `--sparse`, as used in add/rm/mv, is totally backwards for
+	grep/log/etc.  Changing the meaning of `--sparse` for these
+	commands would fix the backwardness, but possibly break existing
+	scripts.  Using a new name pairing would allow us to treat
+	`--sparse` in these commands as a deprecated alias.
+      * There is a different `--sparse`/`--dense` pair for commands using
+	revision machinery, so using that naming might cause confusion
+      * There is also a `--sparse` in both pack-objects and show-branch, which
+	don't conflict but do suggest that `--sparse` is overloaded
+      * The name --ignore-skip-worktree-bits is a double negative, is
+	quite a mouthful, refers to an implementation detail that many
+	users may not be familiar with, and we'd need a negation for it
+	which would probably be even more ridiculously long.  (But we
+	can make --ignore-skip-worktree-bits a deprecated alias for
+	--no-restrict.)
+
+  * Should --[no-]restrict be a git global option, or added as options to each
+    relevant command?  (Does that make sense given the multitude of different
+    default behaviors we have for different options?)
+
+  * If a config option is added (core.restrictToSparsity?) what should
+    the values and description be?  There's a risk of confusion, because
+    we only want this config option to affect the history-querying
+    commands (log/diff/grep) and maybe the path-modifying worktree
+    commands (add/rm/mv), but certainly not most the others.  Previous config
+    suggestion here: [13]
+
+  * Should --sparse in ls-files be made an alias for --restrict?
+    `--restrict` is certainly a near synonym in cone-mode, but even then
+    it's not quite the same.  In non-cone mode, ls-files' `--sparse`
+    option has no effect, and in cone-mode it still shows the sparse
+    directory entries which are technically outside the sparsity
+    specification.
+
+  * Should --ignore-skip-worktree-bits in checkout-index, checkout, and
+    restore be made deprecated aliases for --no-restrict?  (They have the
+    same meaning.)
+
+  * Should --ignore-skip-worktree-entries in update-index be made a
+    deprecated alias for --no-restrict?  (Or, better yet, should the
+    option just be nuked from orbit after flipping the default, since
+    the reverse option is never wanted and the sole purpose of this
+    option was to turn off a bug?)
+
+  * sparse-checkout: once behavior A is fully implemented, should we
+    take an interim measure to easy people into switching the default?
+    Namely, if folks are not already in a sparse checkout, then require
+    `sparse-checkout init/set` to take a `--[no-]restrict` flag (which
+    would set core.restrictToSparse according to the setting given), and
+    throw an error if the flag is not provided?  That error would be a
+    great place to warn folks that the default may change in the future,
+    and get them used to specifying what they want so that the eventual
+    default switch is seamless for them.
+
+  * clone: should we provide some mechanism for tying partial clones and
+    sparse checkouts together better.  Maybe an option
+	--sparse=dir1,dir2,...,dirN
+    which:
+       * Does initial fetch with `--filter=blob:none`
+       * Does the `sparse-checkout set --cone dir1 dir2 ... dirN` thing
+       * Runs a `git rev-list --objects --all -- dir1 dir2 ... dirN` to
+	 fault in the missing blobs within the sparse
+	 specification...except that rev-list needs some kind of options
+	 to also get files from leading directories too.
+       * Sets --restrict mode to allow focusing on the cone of interest
+	 (and to permit disconnected development)
+
+
+=== Implementation Goals/Plans ===
+
+ * Figure out answers to the 'Implementation Questions' sections (above)
+
+ * Fix bugs in the 'Known bugs' section (below)
+
+ * update-index: flip the default to --no-ignore-skip-worktree-entries, possibly
+   nuke this stupid "Oh, there's a bug?  Let me add a flag to let users request
+   that they not trigger this bug." flag
+
+  * Flags & Config
+    * Make `--sparse` in add/rm/mv a deprecated alias for `--no-restrict`
+    * Make `--ignore-skip-worktree-bits` in checkout-index/checkout/restore
+      a deprecated aliases for `--no-restrict`
+    * Create config option (core.restrictToSparsity?), note how it only
+      affects two classes of commands
+
+ * Behavioral plans:
+     add, rm, mv:
+	Behavior B: throw error if would have affected paths outside of sparsity.
+	Behavior A: throw error if would have only affected paths outside of sparsity.
+     grep (on history), diff (on history), log, etc:
+	Behavior B: act on all paths (already implemented)
+	Behavior A: act on limited paths, maybe show stderr warning ("results limited")
+		    if selected via config rather than explicitly
+     other diff machinery:
+	make sure diff machinery changes don't mess with format-patch, fast-export, etc.
+
+  * Fix performance issues, such as
+    https://lore.kernel.org/git/CABPp-BEkJQoKZsQGCYioyga_uoDQ6iBeW+FKr8JhyuuTMK1RDw@mail.gmail.com/
+
+
+=== Known bugs ===
+
+This list used to be a lot longer (see e.g. [1,2,3,4,5,6,7,8,9]), but we've
+been working on it.
+
+0. Behavior A is not well supported in Git.  (Behavior B didn't used to be either,
+   but was the easier of the two to implement.)
+
+1. am and apply:
+
+   am and apply rely on files being present in the working copy, and
+   also write to them unconditionally.  They should probably first check
+   for the files' presence, and if found to be SKIP_WORKTREE, then clear
+   the bit and vivify the paths, then do its work.
+
+2. reset --hard:
+
+   reset --hard provides confusing error message (works correctly, but
+   misleads the user into believing it didn't):
+
+    $ touch addme
+    $ git add addme
+    $ git ls-files -t
+    H addme
+    H tracked
+    S tracked-but-maybe-skipped
+    $ git reset --hard                           # usually works great
+    error: Path 'addme' not uptodate; will not remove from working tree.
+    HEAD is now at bdbbb6f third
+    $ git ls-files -t
+    H tracked
+    S tracked-but-maybe-skipped
+    $ ls -1
+    tracked
+
+    `git reset --hard` DID remove addme from the index and the working tree, contrary
+    to the error message, but in line with how reset --hard should behave.
+
+3. Checkout, restore:
+
+   These command do not handle path & revision arguments appropriately:
+
+    $ ls
+    tracked
+    $ git ls-files -t
+    H tracked
+    S tracked-but-maybe-skipped
+    $ git status --porcelain
+    $ git checkout -- '*skipped'
+    error: pathspec '*skipped' did not match any file(s) known to git
+    $ git ls-files -- '*skipped'
+    tracked-but-maybe-skipped
+    $ git checkout HEAD -- '*skipped'
+    error: pathspec '*skipped' did not match any file(s) known to git
+    $ git ls-tree HEAD | grep skipped
+    100644 blob 276f5a64354b791b13840f02047738c77ad0584f	tracked-but-maybe-skipped
+    $ git status --porcelain
+    $ git checkout HEAD~1 -- '*skipped'
+    $ git ls-files -t
+    H tracked
+    H tracked-but-maybe-skipped
+    $ git status --porcelain
+    M  tracked-but-maybe-skipped
+    $ git checkout HEAD -- '*skipped'
+    $ git status --porcelain
+    $
+
+    Note that checkout without a revision (or restore --staged) fails to
+    find a file to restore from the index, even though ls-files shows
+    such a file certainly exists.
+
+    Similar issues occur with HEAD (--source=HEAD in restore's case),
+    but suddenly works when HEAD~1 is specified.  And then after that it
+    will work with HEAD specified, even though it didn't before.
+
+    Directories are also an issue:
+
+    $ git sparse-checkout set nomatches
+    $ git status
+    On branch main
+    You are in a sparse checkout with 0% of tracked files present.
+
+    nothing to commit, working tree clean
+    $ git checkout .
+    error: pathspec '.' did not match any file(s) known to git
+    $ git checkout HEAD~1 .
+    Updated 1 path from 58916d9
+    $ git ls-files -t
+    S tracked
+    H tracked-but-maybe-skipped
+
+
+=== Reference Emails ===
+
+Emails that detail various bugs we've had in sparse-checkout:
+
+[1] (Original descriptions of behavior A & behavior B)
+    https://lore.kernel.org/git/CABPp-BGJ_Nvi5TmgriD9Bh6eNXE2EDq2f8e8QKXAeYG3BxZafA@mail.gmail.com/
+[2] (Fix stash applications in sparse checkouts; bugs from behavioral differences)
+    https://lore.kernel.org/git/ccfedc7140dbf63ba26a15f93bd3885180b26517.1606861519.git.gitgitgadget@gmail.com/
+[3] (Present-despite-skipped entries)
+    https://lore.kernel.org/git/11d46a399d26c913787b704d2b7169cafc28d639.1642175983.git.gitgitgadget@gmail.com/
+[4] (Clone --no-checkout interaction)
+    https://lore.kernel.org/git/pull.801.v2.git.git.1591324899170.gitgitgadget@gmail.com/ (clone --no-checkout)
+[5] (The need for update_sparsity() and avoiding `read-tree -mu HEAD`)
+    https://lore.kernel.org/git/3a1f084641eb47515b5a41ed4409a36128913309.1585270142.git.gitgitgadget@gmail.com/
+[6] (SKIP_WORKTREE is advisory, not mandatory)
+    https://lore.kernel.org/git/844306c3e86ef67591cc086decb2b760e7d710a3.1585270142.git.gitgitgadget@gmail.com/
+[7] (`worktree add` should copy sparsity settings from current worktree)
+    https://lore.kernel.org/git/c51cb3714e7b1d2f8c9370fe87eca9984ff4859f.1644269584.git.gitgitgadget@gmail.com/
+[8] (Avoid negative surprises in add, rm, and mv)
+    https://lore.kernel.org/git/cover.1617914011.git.matheus.bernardino@usp.br/
+    https://lore.kernel.org/git/pull.1018.v4.git.1632497954.gitgitgadget@gmail.com/
+[9] (Move from out-of-cone to in-cone)
+    https://lore.kernel.org/git/20220630023737.473690-6-shaoxuan.yuan02@gmail.com/
+    https://lore.kernel.org/git/20220630023737.473690-4-shaoxuan.yuan02@gmail.com/
+[10] (Unnecessarily downloading objects outside sparsity specification)
+     https://lore.kernel.org/git/CAOLTT8QfwOi9yx_qZZgyGa8iL8kHWutEED7ok_jxwTcYT_hf9Q@mail.gmail.com/
+
+[11] (Stolee's comments on high-level usecases)
+     https://lore.kernel.org/git/1a1e33f6-3514-9afc-0a28-5a6b85bd8014@gmail.com/
+
+[12] Others commenting on eventually switching default to behavior A:
+  * https://lore.kernel.org/git/xmqqh719pcoo.fsf@gitster.g/
+  * https://lore.kernel.org/git/xmqqzgeqw0sy.fsf@gitster.g/
+  * https://lore.kernel.org/git/a86af661-cf58-a4e5-0214-a67d3a794d7e@github.com/
+
+[13] Previous config name suggestion and description
+  * https://lore.kernel.org/git/CABPp-BE6zW0nJSStcVU=_DoDBnPgLqOR8pkTXK3dW11=T01OhA@mail.gmail.com/
+
+[14] Tangential issue: switch to cone mode as default sparsity specification mechanism:
+  https://lore.kernel.org/git/a1b68fd6126eb341ef3637bb93fedad4309b36d0.1650594746.git.gitgitgadget@gmail.com/
+
+[15] Lengthy email on grep behavior, covering what should be searched:
+  * https://lore.kernel.org/git/CABPp-BGVO3QdbfE84uF_3QDF0-y2iHHh6G5FAFzNRfeRitkuHw@mail.gmail.com/

base-commit: 1b3d6e17fe83eb6f79ffbac2f2c61bbf1eaef5f8
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH] sparse-checkout.txt: new document with sparse-checkout directions
  2022-09-25  0:09 [PATCH] sparse-checkout.txt: new document with sparse-checkout directions Elijah Newren via GitGitGadget
@ 2022-09-26 17:20 ` Junio C Hamano
  2022-09-26 17:38 ` Junio C Hamano
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 42+ messages in thread
From: Junio C Hamano @ 2022-09-26 17:20 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: git, Victoria Dye, Derrick Stolee, Shaoxuan Yuan,
	Matheus Tavares, ZheNing Hu, Elijah Newren

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Elijah Newren <newren@gmail.com>
>
> Once upon a time, Matheus wrote some patches to make
>    git grep [--cached | <REVISION>] ...
> restrict its output to the sparsity specification when working in a
> sparse checkout[1].  That effort got derailed by two things:
>
>   (1) The --sparse-index work just beginning which we wanted to avoid
>       creating conflicts for
>   (2) Never deciding on flag and config names and planned high level
>       behavior for all commands.
>
> More recently, Shaoxuan implemented a more limited form of Matheus'
> patches that only affected --cached, using a different flag name,
> but also changing the default behavior in line with what Matheus did.
> This again highlighted the fact that we never decided on command line
> flag names, config option names, and the big picture path forward.
>
> The --sparse-index work has been mostly complete (or at least released
> into production even if some small edges remain) for quite some time
> now.  We have also had several discussions on flag and config names,
> though we never came to solid conclusions.  Stolee once upon a time
> suggested putting all these into some document in
> Documentation/technical[3], which Victoria recently also requested[4].
> I'm behind the times, but here's a patch attempting to finally do that.
>
> Note that the "Implementation Questions" section is pretty large,
> reflecting the fact that this is perhaps more RFC than proposal.

Thanks for starting this.  The document even in the current
iteration with a large set of "questions" helped me refresh my
memory on where we are in the bigger picture, and will offer us a
good frame of reference.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] sparse-checkout.txt: new document with sparse-checkout directions
  2022-09-25  0:09 [PATCH] sparse-checkout.txt: new document with sparse-checkout directions Elijah Newren via GitGitGadget
  2022-09-26 17:20 ` Junio C Hamano
@ 2022-09-26 17:38 ` Junio C Hamano
  2022-09-27  3:05   ` Elijah Newren
  2022-09-26 20:08 ` Victoria Dye
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 42+ messages in thread
From: Junio C Hamano @ 2022-09-26 17:38 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: git, Victoria Dye, Derrick Stolee, Shaoxuan Yuan,
	Matheus Tavares, ZheNing Hu, Elijah Newren

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +    In the case of am and apply, those commands only operate on the
> +    working tree, so they are kind of in the same boat as stash.

"apply" does not touch the HEAD but it can touch the index; when it
operates with the "--cached" or the "--index" option, it should not
be considered as a working-tree-only command.

"am" is about recording what is in the patch as a commit.

> +    Perhaps `git am` could run `git sparse-checkout reapply`
> +    automatically afterward and move into a category more similar to
> +    merge/rebase/cherry-pick, but it'd still be weird because it'd
> +    vivify files besides just conflicted ones when there are conflicts.

I do not particularly think it is so bad.

How would we handle the case where the user modifies paths outside
the sparse specification and makes a commit out of the result,
without using "am"?  We should be consistent with that use case, i.e.

    $ edit path/outside/sparse/specification
    $ git add path/outside/sparse/specification
    $ git commit

Do we require some "Yes, I am aware that I need to widen my sparse
specification to do this, because I am now stepping out of it, and I
understand that my sparse specification becomes wider after doing
this operation" confirmation with "add" or "commit"?  If not, then I
think "am" should silently widen just like these commands.  If they
do, then "am" should also require such an option.  Perhaps call it
"--widen-sparse" or whatever.

By the way, I like the term "sparse specification" very much, as
we should worry about non-cone mode as well.  Please use it
consistently in this document after getting a concensus that it
is a good phrase to use from others---I saw some other words
used after "sparse" elsewhere in this patch.

> +    In the case of ls-files, `git ls-files -t` is often used to see what
> +    is sparse and not, in which case restricting would not make sense.

I suspect that leaving it tree-wide would allow scripters come up
with Porcelains that restricts to the sparse specification more
easily.

Thanks.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] sparse-checkout.txt: new document with sparse-checkout directions
  2022-09-25  0:09 [PATCH] sparse-checkout.txt: new document with sparse-checkout directions Elijah Newren via GitGitGadget
  2022-09-26 17:20 ` Junio C Hamano
  2022-09-26 17:38 ` Junio C Hamano
@ 2022-09-26 20:08 ` Victoria Dye
  2022-09-26 22:36   ` Junio C Hamano
                     ` (2 more replies)
  2022-09-27 15:43 ` Junio C Hamano
                   ` (2 subsequent siblings)
  5 siblings, 3 replies; 42+ messages in thread
From: Victoria Dye @ 2022-09-26 20:08 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git
  Cc: Derrick Stolee, Shaoxuan Yuan, Matheus Tavares, ZheNing Hu,
	Elijah Newren

Elijah Newren via GitGitGadget wrote:
> From: Elijah Newren <newren@gmail.com>
> 
> Once upon a time, Matheus wrote some patches to make
>    git grep [--cached | <REVISION>] ...
> restrict its output to the sparsity specification when working in a
> sparse checkout[1].  That effort got derailed by two things:
> 
>   (1) The --sparse-index work just beginning which we wanted to avoid
>       creating conflicts for
>   (2) Never deciding on flag and config names and planned high level
>       behavior for all commands.
> 
> More recently, Shaoxuan implemented a more limited form of Matheus'
> patches that only affected --cached, using a different flag name,
> but also changing the default behavior in line with what Matheus did.
> This again highlighted the fact that we never decided on command line
> flag names, config option names, and the big picture path forward.
> 
> The --sparse-index work has been mostly complete (or at least released
> into production even if some small edges remain) for quite some time
> now.  We have also had several discussions on flag and config names,
> though we never came to solid conclusions.  Stolee once upon a time
> suggested putting all these into some document in
> Documentation/technical[3], which Victoria recently also requested[4].
> I'm behind the times, but here's a patch attempting to finally do that.

Thank you so much for writing this!

> diff --git a/Documentation/technical/sparse-checkout.txt b/Documentation/technical/sparse-checkout.txt
> new file mode 100644
> index 00000000000..b213b2b3f35
> --- /dev/null
> +++ b/Documentation/technical/sparse-checkout.txt
> @@ -0,0 +1,670 @@
> +Table of contents:
> +
> +  * Purpose of sparse-checkouts
> +  * Desired behavior
> +  * Subcommand-dependent defaults
> +  * Implementation Questions
> +  * Implementation Goals/Plans
> +  * Known bugs
> +  * Reference Emails
> +
> +
> +=== Purpose of sparse-checkouts ===
> +
> +sparse-checkouts exist to allow users to work with a subset of their
> +files.
> +
> +The idea is simple enough, but there are two different high-level
> +usecases which affect how some Git subcommands should behave.  Further,
> +even if we only considered one of those usecases, sparse-checkouts
> +modify different subcommands in over a half dozen different ways.  Let's
> +start by considering the high level usecases in this section:
> +
> +  A) Users are _only_ interested in the sparse portion of the repo
> +
> +  B) Users want a sparse working tree, but are working in a larger whole

Both of these use cases make sense to me! Two thoughts/comments:

1. This could be a "me" problem, but I regularly struggle with "sparse"
   having different meanings in similar contexts. For example, a "sparse
   directory" is one *with* 'SKIP_WORKTREE' applied vs. "the sparse portion
   of the repo"  here refers to the files *without* 'SKIP_WORKTREE' applied.
   A quick note/section outlining some standard terminology would be
   immensely helpful.
2. One detail I'd like this document to clarify is the similarity/difference
   between "in the sparse portion of the repo" and "does not have
   'SKIP_WORKTREE' applied." In a well-behaved sparse-checkout, these are
   one in the same. However, if a user removes 'SKIP_WORKTREE' from a file
   (either with 'update-index' or by checking it out on disk), commands
   *sometimes* treat it as inside the sparse checkout (e.g., 'git status'),
   and some treat it as outside (e.g., 'git add'). Technically, I think it
   comes down to whether a command uses sparse patterns + 'SKIP_WORKTREE' to
   determine sparsity vs. just 'SKIP_WORKTREE', but the varying behavior
   feels inconsistent as an end user. 

> +
> +=== Desired behavior ===
> +
> +As noted in the previous section, despite the simple idea of just
> +working with a subset of files, there are a range of different
> +behavioral changes that need to be made to different subcommands to work
> +well with such a feature.  See [1,2,3,4,5,6,7,8,9,10] for various
> +examples.  In particular, at [2], we saw that mere composition of other
> +commands that individually worked correctly in a sparse-checkout context
> +did not imply that the higher level command would work correctly; it
> +sometimes requires further tweaks.  So, understanding these differences
> +can be beneficial.
> +
> +* Commands behaving the same regardless of high-level use-case
> +
> +  * commands that only look at files within the sparsity specification
> +
> +      * status
> +      * diff (without --cached or REVISION arguments)
> +      * grep (without --cached or REVISION arguments)

'status' and 'diff' currently show information about untracked files outside
the working tree (since, not being in the index, they don't have a
'SKIP_WORKTREE' to use). Should that change with the proposed '--restrict'
option?

> +
> +  * commands that restore files to the working tree that match sparsity patterns, and
> +    remove unmodified files that don't match those patterns:
> +
> +      * switch
> +      * checkout (the switch-like half)
> +      * read-tree
> +      * reset --hard
> +
> +      * `restore` & the restore-like half of `checkout` SHOULD be in this above
> +	category, but are buggy (see the "Known bugs" section below)

These commands do behave differently if there are *modified* files outside
the sparsity patterns:

- 'switch', 'checkout' (switch-like), and 'read-tree -m' block the operation
  & advise on how to clean up the modified files to re-align with the
  sparsity patterns.
- 'reset --hard' silently drops the modified file and resets the
  'SKIP_WORKTREE' bit on the corresponding index entry.

With the exception of 'reset --hard' (aggressively and unconditionally
cleaning the worktree & index is an important aspect of the command, IMO),
I'd personally like to see commands in this category align with the behavior
of 'switch' where they don't already. Regardless of what we decide, though,
I think it's probably worth documenting the "modified outside of sparsity
patterns" case.

Also, 'read-tree' (no args) doesn't apply the 'SKIP_WORKTREE' bit to *any*
of the entries it reads into the index. Having all of your files suddenly
appear "deleted" probably isn't desired behavior, so it might be a good
candidate for the "Known bugs" section. 

> +
> +  * commands that write conflicted files to the working tree, but otherwise will
> +    omit writing files that do not match the sparsity patterns:
> +
> +      * merge
> +      * rebase
> +      * cherry-pick
> +      * revert
> +
> +    Note that this somewhat depends upon the merge strategy being used:
> +      * `ort` behaves as described above
> +      * `recursive` tries to not vivify files unnecessarily, but does sometimes
> +	vivify files without conflicts.
> +      * `octopus` and `resolve` will always vivify any file changed in the merge
> +	relative to the first parent, which is rather suboptimal.
> +
> +  * commands that always ignore sparsity since commits must be full-tree
> +
> +      * archive
> +      * bundle
> +      * commit
> +      * format-patch
> +      * fast-export
> +      * fast-import
> +      * commit-tree
> +
> +  * commands that write any modified file to the working tree (conflicted or not,
> +    and whether those paths match sparsity patterns or not):
> +
> +      * stash
> +
> +      * am/apply probably should be in the above category, but need to be fixed to
> +	auto-vivify instead of failing
> +
> +* Commands that differ for behavior A vs. behavior B:
> +
> +  * commands that make modifications:

nit: "make modifications" -> "make modifications to the index"? 

> +      * add
> +      * rm
> +      * mv
> +
> +  * commands that query history
> +      * diff (with --cached or REVISION arguments)
> +      * grep (with --cached or REVISION arguments)
> +      * show (when given commit arguments)
> +      * bisect
> +      * blame
> +	* and annotate
> +      * log
> +	* and variants: shortlog, gitk, show-branch, whatchanged
> +
> +* Comands I don't know how to classify
> +
> +  * ls-files
> +
> +    Shows all tracked files by default, and with an option can show
> +    sparse directory entries instead of expanding them.  Should there be
> +    a way to restrict to just the non SKIP_WORKTREE files?

Yes, I think "restricting to just non SKIP_WORKTREE files" would be what a
'--restrict' option would do. The existing '--sparse' flag really is
independent of the sparse patterns altogether - it just toggles whether
sparse directories are shown as-is or expanded. Given your analysis so far,
'--sparse' should probably be renamed to something that reflects its unique
behavior ('--no-expand-sparse-directories'? I'm sure someone more creative
than me could come up with a better name ;) ).

So, disregarding the special sparse index behavior, I think 'ls-files' fits
neatly in the "commands that query history" section.

> +
> +    Note that `git ls-files -t` is often used to see what is sparse and
> +    what is not, which only works with a non-restricted assumption.
> +
> +  * checkout-index
> +
> +    should it be like `checkout` and pay attention to sparsity paths, or
> +    be considered special and write to working tree anyway?  The
> +    interaction with --prefix, and the use of specifically named files
> +    (rather than globs) makes me wonder.

IMO, it should still pay attention to sparsity paths, even with '--prefix'.
My interpretation would be that '--restrict' tells it how to *read* the
index when determining what to write to disk - even with '--prefix', then,
it'd only write files matching the sparsity patterns. In that case, it seems
to fit alongside 'switch', 'restore', etc. in "commands that restore files
to the working tree that match sparsity patterns." 

> +
> +  * update-index
> +
> +    The --[no-]ignore-skip-worktree-entries default is totally bogus,
> +    but otherwise this command seems okay?  Not sure what category it
> +    would go under, though.

I'd probably call this a "makes modifications" command (like 'git add', 'git
rm', etc.), since it adds/removes/modifies items in the index (either their
content or their flags).

> +
> +  * range-diff
> +
> +    Is this like `log` or `format-patch`?
> +
> +  * cherry
> +
> +    See range-diff
> +
> +  * plumbing -- diff-files, diff-index, diff-tree, ls-tree, rev-list
> +
> +    should these be tweaked or always operate full-tree?

For these (and the other plumbing/plumbing-ish commands you have listed:
'checkout-index', 'update-index', 'read-tree'), I'd lean towards making them
respect the sparsity patterns consistently with the porcelain layer. Part of
that is because the line between "plumbing" and "porcelain" is sometimes
fuzzy (like with 'read-tree'?), so having _very_ different behavior around
that boundary would probably be confusing. The other part is that I think
plumbing-based scripts would still fit one of your "A" or "B" user
archetypes, so full-tree behavior might not be desired anyway.

> +=== Subcommand-dependent defaults ===
> +
> +Note that we have different defaults (for the desired behavior, not just
> +the current implementation) depending on the command:
> +
> +  * Commands defaulting to --restrict:
> +    * status
> +    * diff (without --cached or REVISION arguments)
> +    * grep (without --cached or REVISION arguments)
> +    * switch
> +    * checkout (the switch-like half)
> +    * read-tree
> +    * reset (--hard)
> +    * restore/checkout
> +    * checkout-index
> +
> +    This behavior makes sense; these interact with the working tree.
> +
> +  * Commands defaulting to --restrict-unless-conflicts
> +    * merge
> +    * rebase
> +    * cherry-pick
> +    * revert
> +
> +    These also interact with the working tree, but require slightly different
> +    behavior so that conflicts can be resolved.
> +
> +  * Commands defaulting to --no-restrict
> +    * archive
> +    * bundle
> +    * commit
> +    * format-patch
> +    * fast-export
> +    * fast-import
> +    * commit-tree
> +
> +    * ls-files

In line with what I wrote earlier, I think 'ls-files' would belong wherever
other "commands that query history" go (looks like "Commands whose default
for --restrict vs. --no-restrict should vary").

> +    * stash
> +    * am
> +    * apply
> +
> +    These have completely different defaults and perhaps deserve the most detailed
> +    explanation:
> +
> +    In the case of commands in the first group (format-patch,
> +    fast-export, bundle, archive, etc.), these are commands for
> +    communicating history, which will be broken if they restrict to a
> +    subset of the repository.  As such, they operate on full paths and
> +    have no `--restrict` option for overriding.  Some of these commands may
> +    take paths for manually restricting what is exported, but it needs to
> +    be very explicit.
> +
> +    In the case of stash, it needs to vivify files to avoid losing the
> +    user's changes.
> +
> +    In the case of am and apply, those commands only operate on the
> +    working tree, so they are kind of in the same boat as stash.
> +    Perhaps `git am` could run `git sparse-checkout reapply`
> +    automatically afterward and move into a category more similar to
> +    merge/rebase/cherry-pick, but it'd still be weird because it'd
> +    vivify files besides just conflicted ones when there are conflicts.
> +
> +    In the case of ls-files, `git ls-files -t` is often used to see what
> +    is sparse and not, in which case restricting would not make sense.
> +    Also, ls-files has traditionally been used to get a list of "all
> +    tracked files", which would suggest not restricting.  But it's
> +    slightly funny, because sparse-checkouts essentially split tracked
> +    files into two categories -- those in the sparse specification and
> +    those outside -- and how does the user specify which of those two
> +    types of tracked files they want?
> +
> +  * Commands defaulting to --restrict-but-warn (although Behavior A vs. Behavior B
> +    may affect how verbose the warnings are):
> +    * add
> +    * rm
> +    * mv

I was going to say that, if you consider 'update-index' part of the same
category as 'git add', it would belong here. However, the "but warn" part
seems a little weird with a mostly-plumbing command like 'update-index'. 

> +
> +    The defaults here perhaps make sense since they are nearly --restrict, but
> +    actually using --restrict could cause user confusion if users specify a
> +    specific filename, so they warn by default.  That logic may sound like
> +    --no-restrict should be the default, but that's prone to even bigger confusion:
> +      * `git add <somefile>` if honored and outside the sparse cone, can result in
> +	the file randomly disappearing later when some subsequent command is run
> +	(since various commands automatically clean up unmodified files outside
> +	the sparsity specification).
> +      * `git rm '*.jpg'` could very negatively surprise users if it deletes files
> +	outside the range of the user's interest.  Much better to operate on the
> +	sparsity specification and give the user warnings if other files could have
> +	matched.
> +      * `git mv` has similar surprises when moving into or out of the cone, so
> +	best to restrict and throw warnings if restriction might affect the result.
> +
> +    There may be a difference in here between behavior A and behavior B.
> +    For behavior A, we probably only want to warn if there were no
> +    suitable matches for files in the sparsity specification, whereas
> +    for behavior B, we may want to warn even if there are valid files to
> +    operate on if the result would have been different under
> +    `--no-restrict`.

I'm a bit confused why '--restrict-but-warn' needs to be separate from
'--restrict'. Couldn't the '--restrict' behavior for 'add'/'rm'/'mv' just be
what you described above, since behavior is set on a per-command (or
per-category) basis?

Also, I might be mistaken, but isn't the current behavior more like
'--restrict', in that it returns an error code & advisory message if it
tries to add files outside the sparse patterns? If this is already okay to
users, what's the benefit of relaxing the error to a warning?

Otherwise, I'm on board with the difference between behaviors A & B (i.e.,
"some files must be in the sparse-checkout to avoid a warning/error" vs.
"all files must be in the sparse-checkout to avoid a warning/error").

> +
> +  * Commands whose default for --restrict vs. --no-restrict should vary depending
> +    on Behavior A or Behavior B
> +    * diff (with --cached or REVISION arguments)
> +    * grep (with --cached or REVISION arguments)
> +    * show (when given commit arguments)
> +    * bisect
> +    * blame
> +      * and annotate
> +    * log
> +      * and variants: shortlog, gitk, show-branch, whatchanged
> +
> +    For now, we default to behavior B for these, which want a default of
> +    --no-restrict.
> +
> +    Note that two of these commands -- diff and grep -- also appeared in
> +    a different list with a default of --restrict, but only when limited
> +    to searching the working tree.  The working tree vs. history
> +    distinction is fundamental in how behavior B operates, so this is
> +    expected.
> +
> +    --restrict may make more sense as the long term default for
> +    these[12], but that's a fair amount of work to implement, and it'd
> +    be very problematic for behavior B users.  Making it the default
> +    now, and then slowly implementing that default in various
> +    subcommands over multiple releases would mean that behavior B users
> +    would need to learn to slowly add additional flags to their
> +    commands, depending on git version, to get the behavior they want.
> +    That gradual switchover would be painful, so we should avoid it at
> +    least until it's fully implemented.

I think transitioning to '--restrict' by default is a good plan - as far as
I can tell, user A types seem more common than user B types, and
'--restrict' creates a more consistent experience. 

Maybe '--restrict' could be made the default earlier in 'scalar' (which
already sets up a cone-mode sparse-checkout by default)? We'd still
gradually move towards making the option a global default, but 'scalar'
might get it some early exposure with users that'd benefit the most from it.

> +
> +
> +=== Implementation Questions ===
> +
> +  * Does the name --[no-]restrict sound good to others?  Are there better options?
> +    * Names in use, or appearing in patches, or previously suggested:
> +      * --sparse/--dense
> +      * --ignore-skip-worktree-bits
> +      * --ignore-skip-worktree-entries
> +      * --ignore-sparsity
> +      * --[no-]restrict-to-sparse-paths
> +      * --full-tree/--sparse-tree
> +      * --[no-]restrict
> +    * Rationale making me lean slightly towards --[no-]restrict:
> +      * We want a name that works for many commands, so we need a name that
> +	does not conflict
> +      * --[no-]restrict isn't overly long and seems relatively explanatory
> +      * `--sparse`, as used in add/rm/mv, is totally backwards for
> +	grep/log/etc.  Changing the meaning of `--sparse` for these
> +	commands would fix the backwardness, but possibly break existing
> +	scripts.  Using a new name pairing would allow us to treat
> +	`--sparse` in these commands as a deprecated alias.
> +      * There is a different `--sparse`/`--dense` pair for commands using
> +	revision machinery, so using that naming might cause confusion
> +      * There is also a `--sparse` in both pack-objects and show-branch, which
> +	don't conflict but do suggest that `--sparse` is overloaded
> +      * The name --ignore-skip-worktree-bits is a double negative, is
> +	quite a mouthful, refers to an implementation detail that many
> +	users may not be familiar with, and we'd need a negation for it
> +	which would probably be even more ridiculously long.  (But we
> +	can make --ignore-skip-worktree-bits a deprecated alias for
> +	--no-restrict.)

I think '--[no-]restrict' is a good choice - it doesn't have the ambiguity
of '--sparse' or the so-verbose-it's-confusing nature of
'--ignore-skip-worktree-(bits|entries)'. My only concern would be with the
fact that '--[no-]restrict' doesn't clearly indicate its relationship to
sparse-checkout, but a longer name (like
'--[no-]restrict-to-sparse-checkout') would be cumbersome, not worth it for
the little bit of extra info a user would get.

> +
> +  * Should --[no-]restrict be a git global option, or added as options to each
> +    relevant command?  (Does that make sense given the multitude of different
> +    default behaviors we have for different options?)

That's an interesting idea! I'd be fine either way, there are pros and cons
to each. E.g., it feels a little weird putting the option before the command
('git --no-restrict add' vs. 'git add --no-restrict'), but the option does
apply to nearly every command (and it's easier to describe/document from a
Git-wide perspective than a per-command perspective).

> +
> +  * If a config option is added (core.restrictToSparsity?) what should
> +    the values and description be?  There's a risk of confusion, because
> +    we only want this config option to affect the history-querying
> +    commands (log/diff/grep) and maybe the path-modifying worktree
> +    commands (add/rm/mv), but certainly not most the others.  Previous config
> +    suggestion here: [13]

For values, maybe 'strict' (for behavior A/'--restrict' across the board),
'loose' (for behavior B), 'off'/'none' (for '--no-restrict' across the
board)? For the description, it could outline each of the use cases and
highlight notable command behavior differences? Kind of like what you
already have in [13].

> +
> +  * Should --sparse in ls-files be made an alias for --restrict?
> +    `--restrict` is certainly a near synonym in cone-mode, but even then
> +    it's not quite the same.  In non-cone mode, ls-files' `--sparse`
> +    option has no effect, and in cone-mode it still shows the sparse
> +    directory entries which are technically outside the sparsity
> +    specification.

I don't think so (for the reasons I mentioned earlier - tl;dr --sparse and
--restrict are conceptually quite different, and functionally independent).
I do think '--sparse' should be renamed as part of the "Implementation
Goals/Plans", though.

> +
> +  * Should --ignore-skip-worktree-bits in checkout-index, checkout, and
> +    restore be made deprecated aliases for --no-restrict?  (They have the
> +    same meaning.)
> +
> +  * Should --ignore-skip-worktree-entries in update-index be made a
> +    deprecated alias for --no-restrict?  (Or, better yet, should the
> +    option just be nuked from orbit after flipping the default, since
> +    the reverse option is never wanted and the sole purpose of this
> +    option was to turn off a bug?)

That's an interesting bit of history! I tend to think of 'update-index' as
"plumbing add/rm", so I think there's still a benefit to having a
'--restrict' mode.

In any case, if I'm reading this correctly, these two options are subtly
different than what's proposed for '--restrict', since IIRC they don't take
into account the sparse patterns at all (only operating based on
'SKIP_WORKTREE'). If '--restrict' will involve also using the sparse
patterns, the behavior would change. I'm happy with doing that (I think the
change would be beneficial), but it should probably be explicitly noted
either here or whenever those commands are updated.

> +
> +  * sparse-checkout: once behavior A is fully implemented, should we
> +    take an interim measure to easy people into switching the default?

nit: s/easy/ease/

> +    Namely, if folks are not already in a sparse checkout, then require
> +    `sparse-checkout init/set` to take a `--[no-]restrict` flag (which
> +    would set core.restrictToSparse according to the setting given), and
> +    throw an error if the flag is not provided?  That error would be a
> +    great place to warn folks that the default may change in the future,
> +    and get them used to specifying what they want so that the eventual
> +    default switch is seamless for them.

Sounds like a good approach to me! It avoids needing to constantly
re-specify '--[no-]restrict' on every 'sparse-checkout set' (because it sets
the config), and also provides visibility to users. 

> +
> +  * clone: should we provide some mechanism for tying partial clones and
> +    sparse checkouts together better.  Maybe an option
> +	--sparse=dir1,dir2,...,dirN
> +    which:
> +       * Does initial fetch with `--filter=blob:none`
> +       * Does the `sparse-checkout set --cone dir1 dir2 ... dirN` thing
> +       * Runs a `git rev-list --objects --all -- dir1 dir2 ... dirN` to
> +	 fault in the missing blobs within the sparse
> +	 specification...except that rev-list needs some kind of options
> +	 to also get files from leading directories too.
> +       * Sets --restrict mode to allow focusing on the cone of interest
> +	 (and to permit disconnected development)

Similar to the '--restrict' default, this could also be a good fit for
'scalar clone'.

> +
> +
> +=== Implementation Goals/Plans ===

The rest of this (+the "Known bugs" section) all look good to me.

Thanks again for writing this document, I really appreciate the time &
effort you put into it! It'll serve as a clear reference for work on
sparse-checkout going forward, and ultimately make sparse-checkout usage a
much better experience for users.

> base-commit: 1b3d6e17fe83eb6f79ffbac2f2c61bbf1eaef5f8


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] sparse-checkout.txt: new document with sparse-checkout directions
  2022-09-26 20:08 ` Victoria Dye
@ 2022-09-26 22:36   ` Junio C Hamano
  2022-09-27  7:30     ` Elijah Newren
  2022-09-27  6:09   ` Elijah Newren
  2022-09-27 16:42   ` Derrick Stolee
  2 siblings, 1 reply; 42+ messages in thread
From: Junio C Hamano @ 2022-09-26 22:36 UTC (permalink / raw)
  To: Victoria Dye
  Cc: Elijah Newren via GitGitGadget, git, Derrick Stolee,
	Shaoxuan Yuan, Matheus Tavares, ZheNing Hu, Elijah Newren

Victoria Dye <vdye@github.com> writes:

>> +* Commands behaving the same regardless of high-level use-case
>> +
>> +  * commands that only look at files within the sparsity specification
>> +
>> +      * status
>> +      * diff (without --cached or REVISION arguments)
>> +      * grep (without --cached or REVISION arguments)
>
> 'status' and 'diff' currently show information about untracked files outside
> the working tree (since, not being in the index, they don't have a
> 'SKIP_WORKTREE' to use). Should that change with the proposed '--restrict'
> option?

Most likely not.  When sparsity specification is in effect, as you
said elsewhere in your response, no files, whether tracked or
untrcked, should exist that are outside your area of interest.
Their presence should be reported as anomalies by "git status".

Unless the command is being run with the "-uno" option, that is.

> - 'switch', 'checkout' (switch-like), and 'read-tree -m' block the operation
>   & advise on how to clean up the modified files to re-align with the
>   sparsity patterns.
> - 'reset --hard' silently drops the modified file and resets the
>   'SKIP_WORKTREE' bit on the corresponding index entry.
>
> With the exception of 'reset --hard' (aggressively and unconditionally
> cleaning the worktree & index is an important aspect of the command, IMO),
> I'd personally like to see commands in this category align with the behavior
> of 'switch' where they don't already. Regardless of what we decide, though,
> I think it's probably worth documenting the "modified outside of sparsity
> patterns" case.

True.  I agree on both counts.

> Also, 'read-tree' (no args) doesn't apply the 'SKIP_WORKTREE' bit to *any*
> of the entries it reads into the index. Having all of your files suddenly
> appear "deleted" probably isn't desired behavior, so it might be a good
> candidate for the "Known bugs" section. 

I would imagine that it actually is OK to say that it is the
responsibility of whoever invokes read-tree the plumbing command
to reapply the skip-worktree bits and/or collapse the index entries
outside the area of interest into trees afterwards.

>> +* Commands that differ for behavior A vs. behavior B:
>> +
>> +  * commands that make modifications:
>
> nit: "make modifications" -> "make modifications to the index"? 

That clarification actually raises an interesting question.  Do we
want three level distinction, i.e. different behaviour between
commands that touch and do not touch the working tree, between those
that touch and do not touch the index, and between those that touch
and do not touch the commit?

As the index is merely a way to express what the user did to
eventually create the next tree to be recorded in the commit, my gut
feeling is that it may be easier to understand if we treated the
working tree and the index at the same level, actually.  I.e. if
grepping in the working tree of a sparse checkout does not find a
match outside the cones of interest, it may make sense to do the
same at least by default in "grep --cached" mode.

If I understand Stolee's write-up on the use case of those in the
camp B, they are more aware of the larger whole and expect to see
hits outside the area they have checkout when running "grep HEAD".
But in their use case, they do not touch (only look) the area
outside their cone of interest, so if we limit the operation to
their cone of interest by default for working tree, the same default
probably should apply equally for an operation that inspect the
index.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] sparse-checkout.txt: new document with sparse-checkout directions
  2022-09-26 17:38 ` Junio C Hamano
@ 2022-09-27  3:05   ` Elijah Newren
  2022-09-27  4:30     ` Junio C Hamano
  0 siblings, 1 reply; 42+ messages in thread
From: Elijah Newren @ 2022-09-27  3:05 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Victoria Dye,
	Derrick Stolee, Shaoxuan Yuan, Matheus Tavares, ZheNing Hu

On Mon, Sep 26, 2022 at 10:38 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > +    In the case of am and apply, those commands only operate on the
> > +    working tree, so they are kind of in the same boat as stash.
>
> "apply" does not touch the HEAD but it can touch the index; when it
> operates with the "--cached" or the "--index" option, it should not
> be considered as a working-tree-only command.

Ah, right, good flag.  This helps resolve part of my question, but
gives me a new question as well.

Without --cached or --index, I think we'd need to make `apply` behave
like `stash` and just auto-vivify any files being tweaked.  If we
don't, we'll lose changes from the patch.

"apply --cached" could possibly just update the index.  However, it
appears to have another bug I need to add to the known bugs section.
`apply --cached` updates the index, but the new index entry fails to
carry over the "SKIP_WORKTREE" bit, making it appear there is an
unstaged deletion of the file.  (Users can run `git sparse-checkout
reapply` afterwards as a workaround.).  This is slightly weird for
files with conflicts (created when running `git apply -3 --cached`)
since those files with content conflicts will not be present in the
working tree, but that's in line with the fact that `git apply -3
--cached` refuses to touch the working tree in general.

In line with `--cached`, we could have "apply --index" do updates to
both the index and the working copy, while ensuring any
"SKIP_WORKTREE" bits are preserved for non-conflicted files.  However,
would preserving "SKIP_WORKTREE" bits be weird for users?  On one
hand, `git apply` without `--index` auto-vivifies files and `--index`
says to "also apply changes to the index" -- but preserving
SKIP_WORKTREE bits would make the `--index` flag also affect how the
working tree is treated, which might seem odd.  On the other hand,
merge/cherry-pick/rebase will update files in the index while leaving
the file missing from the working tree when not conflicted, so there
is some precedent for such behavior.  The question might just be
whether `git apply --index` should be more like mergy behavior, or
more like `git apply`/`git stash` behavior.

> "am" is about recording what is in the patch as a commit.

Does that mean it should behave like "apply --index"?  Or more like
cherry-pick?  (This question might be moot depending on what we choose
for "apply --index", in particular, it won't matter if we preserve
SKIP_WORKTREE bits on non-conflicted files.)

> > +    Perhaps `git am` could run `git sparse-checkout reapply`
> > +    automatically afterward and move into a category more similar to
> > +    merge/rebase/cherry-pick, but it'd still be weird because it'd
> > +    vivify files besides just conflicted ones when there are conflicts.
>
> I do not particularly think it is so bad.

For some reason I was thinking of running `git sparse-checkout
reapply` only if the `am` operation succeeded, which would give us a
special one-off command treatment.  If we instead view it as always
running `git sparse-checkout reapply` whether or not we hit conflicts,
or equivalently, if we view `git am` preserving SKIP_WORKTREE bits on
non-conflicted files, then I agree it's not weird anymore and can be
classified in the same group as merge/rebase/cherry-pick.

But something else you said confuses me...

> How would we handle the case where the user modifies paths outside
> the sparse specification and makes a commit out of the result,
> without using "am"?  We should be consistent with that use case, i.e.
>
>     $ edit path/outside/sparse/specification
>     $ git add path/outside/sparse/specification
>     $ git commit
>
> Do we require some "Yes, I am aware that I need to widen my sparse
> specification to do this, because I am now stepping out of it, and I
> understand that my sparse specification becomes wider after doing
> this operation" confirmation with "add" or "commit"?  If not, then I
> think "am" should silently widen just like these commands.  If they
> do, then "am" should also require such an option.  Perhaps call it
> "--widen-sparse" or whatever.

The command
    $ edit path/outside/sparse/specification
doesn't make sense to me; the file (and perhaps also its leading
directories) are missing.  Most editors will probably tell you that
you are editing a new file, but then it's more of a "rewrite from
scratch" than an "edit".

Typically, we'd expect users who want to edit such files to do so by
first running the `add` or `set` subcommands of sparse-checkout to
change their sparse specification so that the file becomes present.
But then it's no longer outside the sparse specification.  So, I'm not
sure how this angle could help guide our direction.

> By the way, I like the term "sparse specification" very much, as
> we should worry about non-cone mode as well.  Please use it
> consistently in this document after getting a concensus that it
> is a good phrase to use from others---I saw some other words
> used after "sparse" elsewhere in this patch.

:-)

> > +    In the case of ls-files, `git ls-files -t` is often used to see what
> > +    is sparse and not, in which case restricting would not make sense.
>
> I suspect that leaving it tree-wide would allow scripters come up
> with Porcelains that restricts to the sparse specification more
> easily.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] sparse-checkout.txt: new document with sparse-checkout directions
  2022-09-27  3:05   ` Elijah Newren
@ 2022-09-27  4:30     ` Junio C Hamano
  0 siblings, 0 replies; 42+ messages in thread
From: Junio C Hamano @ 2022-09-27  4:30 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Victoria Dye,
	Derrick Stolee, Shaoxuan Yuan, Matheus Tavares, ZheNing Hu

>> "am" is about recording what is in the patch as a commit.
>
> Does that mean it should behave like "apply --index"?  Or more like
> cherry-pick?

It should behave like a manual edit (after widening the area of
interest by adjusting sparsity specification, if needed) followed by
"git add" followed by "git commit".

> The command
>     $ edit path/outside/sparse/specification
> doesn't make sense to me; the file (and perhaps also its leading
> directories) are missing.  Most editors will probably tell you that
> you are editing a new file, but then it's more of a "rewrite from
> scratch" than an "edit".

If it is a new file, read it with "mkdir -p $(dirname $that_file)"
prefixed.  If it is an existing file, then "checkout $that_file"
instead.  And then adjust your sparsity specification so that the
path is now within your area of interest.

> Typically, we'd expect users who want to edit such files to do so by
> first running the `add` or `set` subcommands of sparse-checkout to
> change their sparse specification so that the file becomes present.
> But then it's no longer outside the sparse specification.  So, I'm not
> sure how this angle could help guide our direction.

The fact that you accept and attempt to apply and make it into a
commit already indicates your intention that the paths touched by
the patch are now in your area of interest, just like whichever
paths you decide to manually edit and record the changes you made,
so it would be the most user friendly to automatically adjust the
sparsity specification to allow them do exactly that, I would think.

That is how I look at the "am" command, anyway.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] sparse-checkout.txt: new document with sparse-checkout directions
  2022-09-26 20:08 ` Victoria Dye
  2022-09-26 22:36   ` Junio C Hamano
@ 2022-09-27  6:09   ` Elijah Newren
  2022-09-27 16:42   ` Derrick Stolee
  2 siblings, 0 replies; 42+ messages in thread
From: Elijah Newren @ 2022-09-27  6:09 UTC (permalink / raw)
  To: Victoria Dye
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee,
	Shaoxuan Yuan, Matheus Tavares, ZheNing Hu

On Mon, Sep 26, 2022 at 1:09 PM Victoria Dye <vdye@github.com> wrote:
>
> Elijah Newren via GitGitGadget wrote:
> > From: Elijah Newren <newren@gmail.com>
> >
> > Once upon a time, Matheus wrote some patches to make
> >    git grep [--cached | <REVISION>] ...
> > restrict its output to the sparsity specification when working in a
> > sparse checkout[1].  That effort got derailed by two things:
> >
> >   (1) The --sparse-index work just beginning which we wanted to avoid
> >       creating conflicts for
> >   (2) Never deciding on flag and config names and planned high level
> >       behavior for all commands.
> >
> > More recently, Shaoxuan implemented a more limited form of Matheus'
> > patches that only affected --cached, using a different flag name,
> > but also changing the default behavior in line with what Matheus did.
> > This again highlighted the fact that we never decided on command line
> > flag names, config option names, and the big picture path forward.
> >
> > The --sparse-index work has been mostly complete (or at least released
> > into production even if some small edges remain) for quite some time
> > now.  We have also had several discussions on flag and config names,
> > though we never came to solid conclusions.  Stolee once upon a time
> > suggested putting all these into some document in
> > Documentation/technical[3], which Victoria recently also requested[4].
> > I'm behind the times, but here's a patch attempting to finally do that.
>
> Thank you so much for writing this!
>
> > diff --git a/Documentation/technical/sparse-checkout.txt b/Documentation/technical/sparse-checkout.txt
> > new file mode 100644
> > index 00000000000..b213b2b3f35
> > --- /dev/null
> > +++ b/Documentation/technical/sparse-checkout.txt
> > @@ -0,0 +1,670 @@
> > +Table of contents:
> > +
> > +  * Purpose of sparse-checkouts
> > +  * Desired behavior
> > +  * Subcommand-dependent defaults
> > +  * Implementation Questions
> > +  * Implementation Goals/Plans
> > +  * Known bugs
> > +  * Reference Emails
> > +
> > +
> > +=== Purpose of sparse-checkouts ===
> > +
> > +sparse-checkouts exist to allow users to work with a subset of their
> > +files.
> > +
> > +The idea is simple enough, but there are two different high-level
> > +usecases which affect how some Git subcommands should behave.  Further,
> > +even if we only considered one of those usecases, sparse-checkouts
> > +modify different subcommands in over a half dozen different ways.  Let's
> > +start by considering the high level usecases in this section:
> > +
> > +  A) Users are _only_ interested in the sparse portion of the repo
> > +
> > +  B) Users want a sparse working tree, but are working in a larger whole
>
> Both of these use cases make sense to me! Two thoughts/comments:
>
> 1. This could be a "me" problem, but I regularly struggle with "sparse"
>    having different meanings in similar contexts. For example, a "sparse
>    directory" is one *with* 'SKIP_WORKTREE' applied vs. "the sparse portion
>    of the repo"  here refers to the files *without* 'SKIP_WORKTREE' applied.
>    A quick note/section outlining some standard terminology would be
>    immensely helpful.

Yeah, that's a good point.  I think we maybe misnamed the sparse
directory entries, and that led to other naming problems.

I like your idea of adding a terminology section; I'll add one.

> 2. One detail I'd like this document to clarify is the similarity/difference
>    between "in the sparse portion of the repo" and "does not have
>    'SKIP_WORKTREE' applied." In a well-behaved sparse-checkout, these are
>    one in the same. However, if a user removes 'SKIP_WORKTREE' from a file
>    (either with 'update-index' or by checking it out on disk), commands
>    *sometimes* treat it as inside the sparse checkout (e.g., 'git status'),
>    and some treat it as outside (e.g., 'git add'). Technically, I think it
>    comes down to whether a command uses sparse patterns + 'SKIP_WORKTREE' to
>    determine sparsity vs. just 'SKIP_WORKTREE', but the varying behavior
>    feels inconsistent as an end user.

Yeah, that's a good point, I should address this.  There are
additional ways to get more files too -- resolving conflicts, or
commands like `stash` that auto-vivify intentionally, or commands that
accidentally auto-vivify (various merge backends), etc.  Anyway,
here's my current mental model, in case it helps:

* In a well-behaved situation, the sparse specification is given
directly by the $GIT_DIR/info/sparse-checkout file.
* The working tree can transiently have an expanded sparse
specification, due to a variety of reasons like resolving conflicts or
running various commands that might add or restore files to the
working tree.
   * Such transient differences can and will be automatically removed
as a side-effect of commands which call unpack_trees() (checkout,
merge, reset, etc.).
   * Users can also request such transient differences be corrected
via running `git sparse-checkout reapply`
   * Additional commands are also welcome to implicitly fix these differences.
   * Because of the above three items, users should make no assumption
that files in a transiently expanded (or restricted) sparse
specification will persist unless they manually explicitly request an
expansion or restriction (via e.g. the `set` or `add` subcommands of
sparse-checkout.)
   * (Yes, we avoid removing files when there are unstaged changes or
conflicts, since we don't want to lose user data.  I don't think that
undermines the general point of the last few bullets).
* The behavior wanted when doing something like "git grep expression
REVISION" is roughly what the users would expect from "git checkout
REVISION && git grep expression" (I know, we add "REVISION:" prefixes,
so it's not exactly the same, but it captures the high level idea).
This has a couple ramifications:
   * REVISION may have paths not in the current index, so there is no
path we can consult for a SKIP_WORKTREE setting for those paths.
   * Since a checkout tries to remove transient differences in the
sparse specification, it makes sense to use the corrected sparse
specification (i.e. $GIT_DIR/info/sparse-checkout) rather than
attempting to consult SKIP_WORKTREE anyway.
   * Therefore, a transiently expanded (or restricted) sparse
specification *only* applies to the working tree and perhaps index.
It does not apply for history queries.

We kind of discussed this previously for why SKIP_WORKTREE not
matching the normal sparse specification should only apply to the
worktree and not to history, in the context of grep[*]:

"""
For the worktree and cached cases, we iterate over paths without
the SKIP_WORKTREE bit set, and limit our searches to these paths.  For
the $REVISION case, we limit the paths we search to those that match
the sparsity patterns.  (We do not check the SKIP_WORKTREE bit for the
$REVISION case, because $REVISION may contain paths that do not exist
in HEAD and thus for which we have no SKIP_WORKTREE bit to consult.
The sparsity patterns tell us how the SKIP_WORKTREE bit would be set
if we were to check out $REVISION, so we consult those.  Also, we
don't use the sparsity paths with the worktree or cached cases, both
because we have a bit we can check directly and more efficiently, and
because unmerged entries from a merge or a rebase could cause more
files to temporarily be present than the sparsity patterns would
normally select.)
"""

(That email also discussed the weird case of being given a TREE
instead of a REVISION, which mucks things up a bit.)

[*] https://lore.kernel.org/git/CABPp-BFsCPPNOZ92JQRJeGyNd0e-TCW-LcLyr0i_+VSQJP+GCg@mail.gmail.com/

> > +
> > +=== Desired behavior ===
> > +
> > +As noted in the previous section, despite the simple idea of just
> > +working with a subset of files, there are a range of different
> > +behavioral changes that need to be made to different subcommands to work
> > +well with such a feature.  See [1,2,3,4,5,6,7,8,9,10] for various
> > +examples.  In particular, at [2], we saw that mere composition of other
> > +commands that individually worked correctly in a sparse-checkout context
> > +did not imply that the higher level command would work correctly; it
> > +sometimes requires further tweaks.  So, understanding these differences
> > +can be beneficial.
> > +
> > +* Commands behaving the same regardless of high-level use-case
> > +
> > +  * commands that only look at files within the sparsity specification
> > +
> > +      * status
> > +      * diff (without --cached or REVISION arguments)
> > +      * grep (without --cached or REVISION arguments)
>
> 'status' and 'diff' currently show information about untracked files outside
> the working tree (since, not being in the index, they don't have a
> 'SKIP_WORKTREE' to use).

'status' does, yes, but...I thought 'diff' only applied to tracked
files.  How do you get 'diff' to show information about untracked
files?

(Are you by chance referring to either (1) --no-index which requires
paths to be explicitly specified and thus --[no-]restrict is
irrelevant, or (2) --ignore-submodules, in which case I think
--[no-]restrict is also irrelevant since --[no-]restrict would apply
to the supermodule and the untracked files would just be ones found
within the submodule?)

> Should that change with the proposed '--restrict' option?

Here's how I look at it:

One way to view the purpose of sparse-checkouts is that it subdivides
"tracked" files into two categories -- a sparse subset, and all the
rest.  We mark "all the rest" with SKIP_WORKTREE.  The SKIP_WORKTREE
files are still tracked, just not present in the working copy.
`--restrict` is a modifier that only works to differentiate between
those two groups of tracked files.  In particular, `--restrict` exists
to allow us to specify that operations that normally operate on
tracked files should instead operate on that subset (and likewise,
`--no-restrict` exists to allow us to specify that operations that
default to working on a subset of tracked files should instead operate
on all tracked files).

untracked files are not tracked.  As such `--[no-]restrict` should not
affect how untracked files are treated...except when dealing with the
tracked/untracked boundary and moving files across that boundary (e.g.
with add/rm/mv).  In fact, I think that's why those three commands
have their own special category.

> > +
> > +  * commands that restore files to the working tree that match sparsity patterns, and
> > +    remove unmodified files that don't match those patterns:
> > +
> > +      * switch
> > +      * checkout (the switch-like half)
> > +      * read-tree
> > +      * reset --hard
> > +
> > +      * `restore` & the restore-like half of `checkout` SHOULD be in this above
> > +     category, but are buggy (see the "Known bugs" section below)
>
> These commands do behave differently if there are *modified* files outside
> the sparsity patterns:

I don't understand this claim; using checkout/switch:

$ git sparse-checkout disable
$ git status --porcelain
 M tracked-but-maybe-skipped
$ git checkout main~1
error: Your local changes to the following files would be overwritten
by checkout:
tracked-but-maybe-skipped
Please commit your changes or stash them before you switch branches.
Aborting
$ git sparse-checkout set --no-cone /tracked 2>/dev/null
$ git ls-files -t  # Note: tracked-but-maybe-skipped is outside
sparsity patterns, but modified
H tracked
H tracked-but-maybe-skipped
$ git checkout main~1
error: Your local changes to the following files would be overwritten
by checkout:
tracked-but-maybe-skipped
Please commit your changes or stash them before you switch branches.
Aborting

Exact same error in both sparse and non-sparse checkouts, even when
the sparse-checkout has a modified file outside the sparsity patterns.

> - 'switch', 'checkout' (switch-like), and 'read-tree -m' block the operation
>   & advise on how to clean up the modified files to re-align with the
>   sparsity patterns.

Perhaps you have a different case in mind than I do?  I'm not aware of
anywhere that switch/checkout does this.  (If I modified the above
testcase to have the changes be staged, I still get the same error
both with or without a sparse-checkout, and that error doesn't mention
sparsity patterns in any way.)  I tried grepping around the source
code, but maybe I'm missing something?

> - 'reset --hard' silently drops the modified file and resets the
>   'SKIP_WORKTREE' bit on the corresponding index entry.
>
> With the exception of 'reset --hard' (aggressively and unconditionally
> cleaning the worktree & index is an important aspect of the command, IMO),
> I'd personally like to see commands in this category align with the behavior
> of 'switch' where they don't already.

Oh, are you thinking that `reset --hard` has a different kind of
modification made to it in sparse-checkouts than the other commands in
this category?

I still don't see it, even if that's what you're referring to.  Each
of these commands, in a sparse-checkout, performs its operation within
the sparsity specification, and then attempts to aggressively cull
differences between the sparsity specification and the sparsity
patterns (by marking unmodified files outside the sparsity patterns as
SKIP_WORKTREE and removing them, and marking files matching the
sparsity patterns which were previously SKIP_WORKTREE as
!SKIP_WORKTREE and restoring them to the working tree).  Perhaps some
examples would help:

Having switch/checkout restore paths matching sparsity patterns:
  $ rm tracked
  $ git status --porcelain
   D tracked
  $ git update-index --skip-worktree tracked
  $ git status --porcelain
  $ git ls-files -t
  S tracked
  $

  $ git checkout HEAD~1
  $ git status --porcelain
  $ git ls-files -t
  H tracked

Having switch/checkout remove paths that do not match sparsity patterns:
  $ git ls-files -t
  S tracked-but-maybe-skipped
  $ git show HEAD:tracked-but-maybe-skipped >tracked-but-maybe-skipped
  $ git ls-files -t
  H tracked-but-maybe-skipped

  $ git checkout HEAD~1
  $ git ls-files -t
  S tracked-but-maybe-skipped

So, switch & checkout are doing the same culling that `reset --hard`
is doing.  It's just that all the commands avoid culling when there
are modifications to the file after its normal operation, and by
design, you'll see `reset --hard` have more opportunities to cull
files since it squashes those modifications.

> Regardless of what we decide, though,
> I think it's probably worth documenting the "modified outside of sparsity
> patterns" case.

I'm happy to document if I understand it better; right now I'm just
not following.

> Also, 'read-tree' (no args) doesn't apply the 'SKIP_WORKTREE' bit to *any*
> of the entries it reads into the index. Having all of your files suddenly
> appear "deleted" probably isn't desired behavior, so it might be a good
> candidate for the "Known bugs" section.

Ooh, good catch.  Yeah, I'll add it.

> > +
> > +  * commands that write conflicted files to the working tree, but otherwise will
> > +    omit writing files that do not match the sparsity patterns:
> > +
> > +      * merge
> > +      * rebase
> > +      * cherry-pick
> > +      * revert
> > +
> > +    Note that this somewhat depends upon the merge strategy being used:
> > +      * `ort` behaves as described above
> > +      * `recursive` tries to not vivify files unnecessarily, but does sometimes
> > +     vivify files without conflicts.
> > +      * `octopus` and `resolve` will always vivify any file changed in the merge
> > +     relative to the first parent, which is rather suboptimal.
> > +
> > +  * commands that always ignore sparsity since commits must be full-tree
> > +
> > +      * archive
> > +      * bundle
> > +      * commit
> > +      * format-patch
> > +      * fast-export
> > +      * fast-import
> > +      * commit-tree
> > +
> > +  * commands that write any modified file to the working tree (conflicted or not,
> > +    and whether those paths match sparsity patterns or not):
> > +
> > +      * stash
> > +
> > +      * am/apply probably should be in the above category, but need to be fixed to
> > +     auto-vivify instead of failing
> > +
> > +* Commands that differ for behavior A vs. behavior B:
> > +
> > +  * commands that make modifications:
>
> nit: "make modifications" -> "make modifications to the index"?

More specifically, "make modifications to which files are tracked".
In a sense, these commands determine whether "--[no-]restrict" apply
to _untracked_ files (because those untracked files are about to
become tracked), which is something no other command has to worry
about, and they deserve special treatment because of that.

> > +      * add
> > +      * rm
> > +      * mv
> > +
> > +  * commands that query history
> > +      * diff (with --cached or REVISION arguments)
> > +      * grep (with --cached or REVISION arguments)
> > +      * show (when given commit arguments)
> > +      * bisect
> > +      * blame
> > +     * and annotate
> > +      * log
> > +     * and variants: shortlog, gitk, show-branch, whatchanged
> > +
> > +* Comands I don't know how to classify
> > +
> > +  * ls-files
> > +
> > +    Shows all tracked files by default, and with an option can show
> > +    sparse directory entries instead of expanding them.  Should there be
> > +    a way to restrict to just the non SKIP_WORKTREE files?
>
> Yes, I think "restricting to just non SKIP_WORKTREE files" would be what a
> '--restrict' option would do.

Hmm...yeah, that makes sense...especially if as you say:

> The existing '--sparse' flag really is
> independent of the sparse patterns altogether - it just toggles whether
> sparse directories are shown as-is or expanded. Given your analysis so far,
> '--sparse' should probably be renamed to something that reflects its unique
> behavior ('--no-expand-sparse-directories'? I'm sure someone more creative
> than me could come up with a better name ;) ).

Maybe just `--no-expand`?  I'm also open to further alternatives.

> So, disregarding the special sparse index behavior, I think 'ls-files' fits
> neatly in the "commands that query history" section.

If it fits neatly in the "commands that query history" section, that
implies that `--restrict` should be the default for the behavior A
camp of people.  That may be fine, but...

Junio suggested that leaving ls-files as full-tree by default "would
allow scripters [to] come up with Porcelains that restricts to the
sparse specification more easily."  I know we've certainly used
`ls-files -t` a lot internally.  I guess it's a question of whether we
train such folks to always use `--no-restrict` together with `git
ls-files -t`, whether we actually treat ls-files as a special category
that defaults to full-tree even for the behavior A camp, or whether we
find some kind of middle ground by defaulting to `--restrict` but
making the `-t` option imply `--no-restrict`.  Thoughts?

> > +
> > +    Note that `git ls-files -t` is often used to see what is sparse and
> > +    what is not, which only works with a non-restricted assumption.
> > +
> > +  * checkout-index
> > +
> > +    should it be like `checkout` and pay attention to sparsity paths, or
> > +    be considered special and write to working tree anyway?  The
> > +    interaction with --prefix, and the use of specifically named files
> > +    (rather than globs) makes me wonder.
>
> IMO, it should still pay attention to sparsity paths, even with '--prefix'.
> My interpretation would be that '--restrict' tells it how to *read* the
> index when determining what to write to disk - even with '--prefix', then,
> it'd only write files matching the sparsity patterns. In that case, it seems
> to fit alongside 'switch', 'restore', etc. in "commands that restore files
> to the working tree that match sparsity patterns."

Sounds fair; I like that.

> > +
> > +  * update-index
> > +
> > +    The --[no-]ignore-skip-worktree-entries default is totally bogus,
> > +    but otherwise this command seems okay?  Not sure what category it
> > +    would go under, though.
>
> I'd probably call this a "makes modifications" command (like 'git add', 'git
> rm', etc.), since it adds/removes/modifies items in the index (either their
> content or their flags).

That group has a restrict-or-error behavior.  Do we want update-index
to require a --no-restrict to operate on files outside the sparse
specification?  Maybe we do, for the same reasons we do with
add/rm/mv.  And that certainly would have helped us avoid the
--[no-]ignore-skip-worktree-entries bug.

If we go this route, should some flags imply --no-restrict (such as
--[no-]skip-worktree)?

> > +
> > +  * range-diff
> > +
> > +    Is this like `log` or `format-patch`?
> > +
> > +  * cherry
> > +
> > +    See range-diff

I'm presuming you didn't mean the answers below to apply to the above two.

> > +  * plumbing -- diff-files, diff-index, diff-tree, ls-tree, rev-list
> > +
> > +    should these be tweaked or always operate full-tree?
>
> For these (and the other plumbing/plumbing-ish commands you have listed:
> 'checkout-index', 'update-index', 'read-tree'), I'd lean towards making them
> respect the sparsity patterns consistently with the porcelain layer. Part of
> that is because the line between "plumbing" and "porcelain" is sometimes
> fuzzy (like with 'read-tree'?), so having _very_ different behavior around
> that boundary would probably be confusing. The other part is that I think
> plumbing-based scripts would still fit one of your "A" or "B" user
> archetypes, so full-tree behavior might not be desired anyway.

That sounds compelling to me, generally.

However, if we are given a tree rather than a revision, we have no way
of knowing where in the directory hierarchy that the tree is found, so
we may not be able to provide `--restrict` behavior (unless we want to
just blindly assume the tree given is a toplevel tree; not sure Junio
would like that based on looking at the commit message of d4789c60aa
("ls-tree: add --full-tree option", 2008-12-25) where such an
assumption was made before).  Thus, things like `git grep $TREE`, `git
diff-tree $TREE1 $TREE2`, or `git ls-tree $TREE` may have to default
to `--no-restrict` when those arguments truly are trees rather than
commits.


> > +=== Subcommand-dependent defaults ===
> > +
> > +Note that we have different defaults (for the desired behavior, not just
> > +the current implementation) depending on the command:
> > +
> > +  * Commands defaulting to --restrict:
> > +    * status
> > +    * diff (without --cached or REVISION arguments)
> > +    * grep (without --cached or REVISION arguments)
> > +    * switch
> > +    * checkout (the switch-like half)
> > +    * read-tree
> > +    * reset (--hard)
> > +    * restore/checkout
> > +    * checkout-index
> > +
> > +    This behavior makes sense; these interact with the working tree.
> > +
> > +  * Commands defaulting to --restrict-unless-conflicts
> > +    * merge
> > +    * rebase
> > +    * cherry-pick
> > +    * revert
> > +
> > +    These also interact with the working tree, but require slightly different
> > +    behavior so that conflicts can be resolved.
> > +
> > +  * Commands defaulting to --no-restrict
> > +    * archive
> > +    * bundle
> > +    * commit
> > +    * format-patch
> > +    * fast-export
> > +    * fast-import
> > +    * commit-tree
> > +
> > +    * ls-files
>
> In line with what I wrote earlier, I think 'ls-files' would belong wherever
> other "commands that query history" go (looks like "Commands whose default
> for --restrict vs. --no-restrict should vary").
>
> > +    * stash
> > +    * am
> > +    * apply
> > +
> > +    These have completely different defaults and perhaps deserve the most detailed
> > +    explanation:
> > +
> > +    In the case of commands in the first group (format-patch,
> > +    fast-export, bundle, archive, etc.), these are commands for
> > +    communicating history, which will be broken if they restrict to a
> > +    subset of the repository.  As such, they operate on full paths and
> > +    have no `--restrict` option for overriding.  Some of these commands may
> > +    take paths for manually restricting what is exported, but it needs to
> > +    be very explicit.
> > +
> > +    In the case of stash, it needs to vivify files to avoid losing the
> > +    user's changes.
> > +
> > +    In the case of am and apply, those commands only operate on the
> > +    working tree, so they are kind of in the same boat as stash.
> > +    Perhaps `git am` could run `git sparse-checkout reapply`
> > +    automatically afterward and move into a category more similar to
> > +    merge/rebase/cherry-pick, but it'd still be weird because it'd
> > +    vivify files besides just conflicted ones when there are conflicts.
> > +
> > +    In the case of ls-files, `git ls-files -t` is often used to see what
> > +    is sparse and not, in which case restricting would not make sense.
> > +    Also, ls-files has traditionally been used to get a list of "all
> > +    tracked files", which would suggest not restricting.  But it's
> > +    slightly funny, because sparse-checkouts essentially split tracked
> > +    files into two categories -- those in the sparse specification and
> > +    those outside -- and how does the user specify which of those two
> > +    types of tracked files they want?
> > +
> > +  * Commands defaulting to --restrict-but-warn (although Behavior A vs. Behavior B
> > +    may affect how verbose the warnings are):
> > +    * add
> > +    * rm
> > +    * mv
>
> I was going to say that, if you consider 'update-index' part of the same
> category as 'git add', it would belong here. However, the "but warn" part
> seems a little weird with a mostly-plumbing command like 'update-index'.

Is it more or less weird with "but error" rather than "but warn"?

> > +
> > +    The defaults here perhaps make sense since they are nearly --restrict, but
> > +    actually using --restrict could cause user confusion if users specify a
> > +    specific filename, so they warn by default.  That logic may sound like
> > +    --no-restrict should be the default, but that's prone to even bigger confusion:
> > +      * `git add <somefile>` if honored and outside the sparse cone, can result in
> > +     the file randomly disappearing later when some subsequent command is run
> > +     (since various commands automatically clean up unmodified files outside
> > +     the sparsity specification).
> > +      * `git rm '*.jpg'` could very negatively surprise users if it deletes files
> > +     outside the range of the user's interest.  Much better to operate on the
> > +     sparsity specification and give the user warnings if other files could have
> > +     matched.
> > +      * `git mv` has similar surprises when moving into or out of the cone, so
> > +     best to restrict and throw warnings if restriction might affect the result.
> > +
> > +    There may be a difference in here between behavior A and behavior B.
> > +    For behavior A, we probably only want to warn if there were no
> > +    suitable matches for files in the sparsity specification, whereas
> > +    for behavior B, we may want to warn even if there are valid files to
> > +    operate on if the result would have been different under
> > +    `--no-restrict`.
>
> I'm a bit confused why '--restrict-but-warn' needs to be separate from
> '--restrict'. Couldn't the '--restrict' behavior for 'add'/'rm'/'mv' just be
> what you described above, since behavior is set on a per-command (or
> per-category) basis?
>
> Also, I might be mistaken, but isn't the current behavior more like
> '--restrict', in that it returns an error code & advisory message if it
> tries to add files outside the sparse patterns? If this is already okay to
> users, what's the benefit of relaxing the error to a warning?
>
> Otherwise, I'm on board with the difference between behaviors A & B (i.e.,
> "some files must be in the sparse-checkout to avoid a warning/error" vs.
> "all files must be in the sparse-checkout to avoid a warning/error").

Sorry, I should have written "error" rather than "warning".  I wanted
these in a separate category, because initially these had
`--no-restrict` behavior and we had really big usability problems.  We
tried to fix this by implementing "--restrict" behavior and just
silently ignoring any paths users gave us outside the sparse
specification.  That reduced complaints and made problems much
smaller, but we still got complaints.  Providing an error message in
some cases due to the restriction (hence --restrict-but-error) is kind
of important to getting the user experience right on these commands.

> > +
> > +  * Commands whose default for --restrict vs. --no-restrict should vary depending
> > +    on Behavior A or Behavior B
> > +    * diff (with --cached or REVISION arguments)
> > +    * grep (with --cached or REVISION arguments)
> > +    * show (when given commit arguments)
> > +    * bisect
> > +    * blame
> > +      * and annotate
> > +    * log
> > +      * and variants: shortlog, gitk, show-branch, whatchanged
> > +
> > +    For now, we default to behavior B for these, which want a default of
> > +    --no-restrict.
> > +
> > +    Note that two of these commands -- diff and grep -- also appeared in
> > +    a different list with a default of --restrict, but only when limited
> > +    to searching the working tree.  The working tree vs. history
> > +    distinction is fundamental in how behavior B operates, so this is
> > +    expected.
> > +
> > +    --restrict may make more sense as the long term default for
> > +    these[12], but that's a fair amount of work to implement, and it'd
> > +    be very problematic for behavior B users.  Making it the default
> > +    now, and then slowly implementing that default in various
> > +    subcommands over multiple releases would mean that behavior B users
> > +    would need to learn to slowly add additional flags to their
> > +    commands, depending on git version, to get the behavior they want.
> > +    That gradual switchover would be painful, so we should avoid it at
> > +    least until it's fully implemented.
>
> I think transitioning to '--restrict' by default is a good plan - as far as
> I can tell, user A types seem more common than user B types, and
> '--restrict' creates a more consistent experience.
>
> Maybe '--restrict' could be made the default earlier in 'scalar' (which
> already sets up a cone-mode sparse-checkout by default)? We'd still
> gradually move towards making the option a global default, but 'scalar'
> might get it some early exposure with users that'd benefit the most from it.

I'm glad others support this idea.  A couple years ago, I thought it
was going to be hard to get buy-in to even support it as a config
option.

> > +=== Implementation Questions ===
> > +
> > +  * Does the name --[no-]restrict sound good to others?  Are there better options?
> > +    * Names in use, or appearing in patches, or previously suggested:
> > +      * --sparse/--dense
> > +      * --ignore-skip-worktree-bits
> > +      * --ignore-skip-worktree-entries
> > +      * --ignore-sparsity
> > +      * --[no-]restrict-to-sparse-paths
> > +      * --full-tree/--sparse-tree
> > +      * --[no-]restrict
> > +    * Rationale making me lean slightly towards --[no-]restrict:
> > +      * We want a name that works for many commands, so we need a name that
> > +     does not conflict
> > +      * --[no-]restrict isn't overly long and seems relatively explanatory
> > +      * `--sparse`, as used in add/rm/mv, is totally backwards for
> > +     grep/log/etc.  Changing the meaning of `--sparse` for these
> > +     commands would fix the backwardness, but possibly break existing
> > +     scripts.  Using a new name pairing would allow us to treat
> > +     `--sparse` in these commands as a deprecated alias.
> > +      * There is a different `--sparse`/`--dense` pair for commands using
> > +     revision machinery, so using that naming might cause confusion
> > +      * There is also a `--sparse` in both pack-objects and show-branch, which
> > +     don't conflict but do suggest that `--sparse` is overloaded
> > +      * The name --ignore-skip-worktree-bits is a double negative, is
> > +     quite a mouthful, refers to an implementation detail that many
> > +     users may not be familiar with, and we'd need a negation for it
> > +     which would probably be even more ridiculously long.  (But we
> > +     can make --ignore-skip-worktree-bits a deprecated alias for
> > +     --no-restrict.)
>
> I think '--[no-]restrict' is a good choice - it doesn't have the ambiguity
> of '--sparse' or the so-verbose-it's-confusing nature of
> '--ignore-skip-worktree-(bits|entries)'. My only concern would be with the
> fact that '--[no-]restrict' doesn't clearly indicate its relationship to
> sparse-checkout, but a longer name (like
> '--[no-]restrict-to-sparse-checkout') would be cumbersome, not worth it for
> the little bit of extra info a user would get.

Yeah, that lack of relationship is annoying, but perhaps we can create
one by adding a --[no-]restrict flag to `sparse checkout (init|set)`?

> > +
> > +  * Should --[no-]restrict be a git global option, or added as options to each
> > +    relevant command?  (Does that make sense given the multitude of different
> > +    default behaviors we have for different options?)
>
> That's an interesting idea! I'd be fine either way, there are pros and cons
> to each. E.g., it feels a little weird putting the option before the command
> ('git --no-restrict add' vs. 'git add --no-restrict'), but the option does
> apply to nearly every command (and it's easier to describe/document from a
> Git-wide perspective than a per-command perspective).

One difficulty with global is that both --restrict and --no-restrict
will be added.  So:
  * What if --restrict is passed with a command that only uses
no-restrict behavior?  For example: stash? apply? commit?  etc.
  * What if --restrict is passed with a command that defaults to
something not-quite-restrict?  Such as add?  Or merge?  Should it
attempt harder to ignore paths outside the sparse specfication?
  * What if --restrict is passed to a command that doesn't understand
or use paths at all?  Such as update-ref?  Or branch?  Or repack?

Do we just ignore in the first and third case, and map it to the
almost-restrict in the second case?

> > +
> > +  * If a config option is added (core.restrictToSparsity?) what should
> > +    the values and description be?  There's a risk of confusion, because
> > +    we only want this config option to affect the history-querying
> > +    commands (log/diff/grep) and maybe the path-modifying worktree
> > +    commands (add/rm/mv), but certainly not most the others.  Previous config
> > +    suggestion here: [13]
>
> For values, maybe 'strict' (for behavior A/'--restrict' across the board),
> 'loose' (for behavior B), 'off'/'none' (for '--no-restrict' across the
> board)? For the description, it could outline each of the use cases and
> highlight notable command behavior differences? Kind of like what you
> already have in [13].

I'm a little lost on your third case there.  How would a
"`--no-restrict` across the board" setting be useful?  Doesn't having
checkout/switch default to --no-restrict defeat the point of
sparse-checkouts?  I suspect you meant something else by "across the
board", but I don't know what other usecase exists that defines the
edge of the board for your scenario.

> > +
> > +  * Should --sparse in ls-files be made an alias for --restrict?
> > +    `--restrict` is certainly a near synonym in cone-mode, but even then
> > +    it's not quite the same.  In non-cone mode, ls-files' `--sparse`
> > +    option has no effect, and in cone-mode it still shows the sparse
> > +    directory entries which are technically outside the sparsity
> > +    specification.
>
> I don't think so (for the reasons I mentioned earlier - tl;dr --sparse and
> --restrict are conceptually quite different, and functionally independent).
> I do think '--sparse' should be renamed as part of the "Implementation
> Goals/Plans", though.

Yeah, sounds good.

> > +
> > +  * Should --ignore-skip-worktree-bits in checkout-index, checkout, and
> > +    restore be made deprecated aliases for --no-restrict?  (They have the
> > +    same meaning.)
> > +
> > +  * Should --ignore-skip-worktree-entries in update-index be made a
> > +    deprecated alias for --no-restrict?  (Or, better yet, should the
> > +    option just be nuked from orbit after flipping the default, since
> > +    the reverse option is never wanted and the sole purpose of this
> > +    option was to turn off a bug?)
>
> That's an interesting bit of history! I tend to think of 'update-index' as
> "plumbing add/rm", so I think there's still a benefit to having a
> '--restrict' mode.
>
> In any case, if I'm reading this correctly, these two options are subtly
> different than what's proposed for '--restrict', since IIRC they don't take
> into account the sparse patterns at all (only operating based on
> 'SKIP_WORKTREE'). If '--restrict' will involve also using the sparse
> patterns, the behavior would change. I'm happy with doing that (I think the
> change would be beneficial), but it should probably be explicitly noted
> either here or whenever those commands are updated.

I think of `--restrict` as "apply operation to the sparse
specification", and as noted above, I view the sparse specification as
able to transiently diverge from the canonical sparsity patterns in
$GIT_DIR/info/sparse-checkout.

However, that's not really relevant here, because the difference
between sparse specification and sparsity patterns only matters for
--restrict.  In contrast, --no-restrict means apply operation to all
paths in both cases, making that subtle difference a moot point.

Since in this case these flags map to --no-restrict, we don't need to
worry about that distinction.

> > +
> > +  * sparse-checkout: once behavior A is fully implemented, should we
> > +    take an interim measure to easy people into switching the default?
>
> nit: s/easy/ease/

Indeed, thanks for catching.

> > +    Namely, if folks are not already in a sparse checkout, then require
> > +    `sparse-checkout init/set` to take a `--[no-]restrict` flag (which
> > +    would set core.restrictToSparse according to the setting given), and
> > +    throw an error if the flag is not provided?  That error would be a
> > +    great place to warn folks that the default may change in the future,
> > +    and get them used to specifying what they want so that the eventual
> > +    default switch is seamless for them.
>
> Sounds like a good approach to me! It avoids needing to constantly
> re-specify '--[no-]restrict' on every 'sparse-checkout set' (because it sets
> the config), and also provides visibility to users.

:-)

> > +
> > +  * clone: should we provide some mechanism for tying partial clones and
> > +    sparse checkouts together better.  Maybe an option
> > +     --sparse=dir1,dir2,...,dirN
> > +    which:
> > +       * Does initial fetch with `--filter=blob:none`
> > +       * Does the `sparse-checkout set --cone dir1 dir2 ... dirN` thing
> > +       * Runs a `git rev-list --objects --all -- dir1 dir2 ... dirN` to
> > +      fault in the missing blobs within the sparse
> > +      specification...except that rev-list needs some kind of options
> > +      to also get files from leading directories too.
> > +       * Sets --restrict mode to allow focusing on the cone of interest
> > +      (and to permit disconnected development)
>
> Similar to the '--restrict' default, this could also be a good fit for
> 'scalar clone'.

It's awesome that you're already thinking about how to get early testing.

> > +
> > +
> > +=== Implementation Goals/Plans ===
>
> The rest of this (+the "Known bugs" section) all look good to me.
>
> Thanks again for writing this document, I really appreciate the time &
> effort you put into it! It'll serve as a clear reference for work on
> sparse-checkout going forward, and ultimately make sparse-checkout usage a
> much better experience for users.

Thanks for taking the time to read through it and provide detailed feedback!

>
> > base-commit: 1b3d6e17fe83eb6f79ffbac2f2c61bbf1eaef5f8
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] sparse-checkout.txt: new document with sparse-checkout directions
  2022-09-26 22:36   ` Junio C Hamano
@ 2022-09-27  7:30     ` Elijah Newren
  2022-09-27 16:07       ` Junio C Hamano
  0 siblings, 1 reply; 42+ messages in thread
From: Elijah Newren @ 2022-09-27  7:30 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Victoria Dye, Elijah Newren via GitGitGadget, Git Mailing List,
	Derrick Stolee, Shaoxuan Yuan, Matheus Tavares, ZheNing Hu

On Mon, Sep 26, 2022 at 3:36 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Victoria Dye <vdye@github.com> writes:
>
> >> +* Commands behaving the same regardless of high-level use-case
> >> +
> >> +  * commands that only look at files within the sparsity specification
> >> +
> >> +      * status
> >> +      * diff (without --cached or REVISION arguments)
> >> +      * grep (without --cached or REVISION arguments)
> >
> > 'status' and 'diff' currently show information about untracked files outside
> > the working tree (since, not being in the index, they don't have a
> > 'SKIP_WORKTREE' to use). Should that change with the proposed '--restrict'
> > option?
>
> Most likely not.  When sparsity specification is in effect, as you
> said elsewhere in your response, no files, whether tracked or
> untrcked, should exist that are outside your area of interest.
> Their presence should be reported as anomalies by "git status".
>
> Unless the command is being run with the "-uno" option, that is.

Oh, wow, that's something completely outside what I had considered.  I
had viewed sparse-checkouts as splitting "tracked files" into two
subsets.  As such, `--[no-]restrict` could only affect selecting
whether the smaller or larger set of tracked files was of interest.
From that viewpoint, untracked files seemed orthogonal, and thus there
couldn't be such a thing as an "anamalous untracked file".

But this idea is very interesting.  Hmm...

>
> > - 'switch', 'checkout' (switch-like), and 'read-tree -m' block the operation
> >   & advise on how to clean up the modified files to re-align with the
> >   sparsity patterns.
> > - 'reset --hard' silently drops the modified file and resets the
> >   'SKIP_WORKTREE' bit on the corresponding index entry.
> >
> > With the exception of 'reset --hard' (aggressively and unconditionally
> > cleaning the worktree & index is an important aspect of the command, IMO),
> > I'd personally like to see commands in this category align with the behavior
> > of 'switch' where they don't already. Regardless of what we decide, though,
> > I think it's probably worth documenting the "modified outside of sparsity
> > patterns" case.
>
> True.  I agree on both counts.
>
> > Also, 'read-tree' (no args) doesn't apply the 'SKIP_WORKTREE' bit to *any*
> > of the entries it reads into the index. Having all of your files suddenly
> > appear "deleted" probably isn't desired behavior, so it might be a good
> > candidate for the "Known bugs" section.
>
> I would imagine that it actually is OK to say that it is the
> responsibility of whoever invokes read-tree the plumbing command
> to reapply the skip-worktree bits and/or collapse the index entries
> outside the area of interest into trees afterwards.

I'll keep that in mind, but that sounds very error prone to me.

> >> +* Commands that differ for behavior A vs. behavior B:
> >> +
> >> +  * commands that make modifications:
> >
> > nit: "make modifications" -> "make modifications to the index"?
>
> That clarification actually raises an interesting question.  Do we
> want three level distinction, i.e. different behaviour between
> commands that touch and do not touch the working tree, between those
> that touch and do not touch the index, and between those that touch
> and do not touch the commit?
>
> As the index is merely a way to express what the user did to
> eventually create the next tree to be recorded in the commit, my gut
> feeling is that it may be easier to understand if we treated the
> working tree and the index at the same level, actually.  I.e. if
> grepping in the working tree of a sparse checkout does not find a
> match outside the cones of interest, it may make sense to do the
> same at least by default in "grep --cached" mode.
>
> If I understand Stolee's write-up on the use case of those in the
> camp B, they are more aware of the larger whole and expect to see
> hits outside the area they have checkout when running "grep HEAD".
> But in their use case, they do not touch (only look) the area
> outside their cone of interest, so if we limit the operation to
> their cone of interest by default for working tree, the same default
> probably should apply equally for an operation that inspect the
> index.

That is an interesting angle to view things; I wondered if an idea
along these lines was going to come up when I was first responding to
Shaoxuan.  I also wondered if people would come to different
conclusions on whether "git grep --cached" should search outside the
sparsity-paths depending upon whether the sparse index was in use.

One thing that makes me a little leery about this path is whether we
can consistently apply the scoped-to-sparse-specification rule for
index operations.  For example:

  * You previously agreed that `git format-patch` should ignore sparse
specification and operate full tree.
  * `git apply --cached $PATCH` only updates the index, and you
suggested in an alternate email that apply should operate full-tree
(at least with --index or without --cached, but I assume by extension
it probably also applies with --cached).
  * What if someone ran the last two commands, and then goes to commit
the result?  Do we want to scope `git commit` to only accept staged
changes within the sparse specification by default?  I thought we
wouldn't and marked commit as a full-tree operation, by default.
  * What if someone runs `git diff --cached` just before that commit?
Do we scope the diff to only those paths within the sparse
specification?
  * What if someone runs `git status` just before that commit?  Do we
only show staged changes within the sparse specification?

It feels like "git grep --cached" is perhaps the next thing along this
sequence, and I don't see a clear line where to draw that we should
limit things to the sparse specification for the index while treating
the other operations as full tree; it seems like something feels
broken or inconsistent in this sequence of commands if we attempt to
do so.


Also, I have some users in camp B.  They specifically have been using
"git grep --cached ..." for a few years now to find other code of
interest outside of their current sparse-checkout (often in stubbed
out dependencies or other projects that depend on the area you are
modifying).  This allows them to make internal API changes and find
the other sites that need to be modified, including outside the normal
sparse cone.  Perhaps I could re-teach them to use "git grep ... HEAD"
instead, but it may feel like a bit of a break to them.  I've found
"git grep --cached" being documented by others who wrote various "how
to work in sparse checkouts" documents, all commenting on this being
the trick to do a whole-tree search.  I did warn them that we might
change that command on them (and sparse-checkouts in general have a
warning about potentially changing behavior), but I'm a little
hesitant to do so.  So that's a second reason I lean towards treating
index searches the same as REVISION ones -- full-tree for camp B.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] sparse-checkout.txt: new document with sparse-checkout directions
  2022-09-25  0:09 [PATCH] sparse-checkout.txt: new document with sparse-checkout directions Elijah Newren via GitGitGadget
                   ` (2 preceding siblings ...)
  2022-09-26 20:08 ` Victoria Dye
@ 2022-09-27 15:43 ` Junio C Hamano
  2022-09-28  7:49   ` Elijah Newren
  2022-09-27 16:36 ` Derrick Stolee
  2022-09-28  8:32 ` [PATCH v2] " Elijah Newren via GitGitGadget
  5 siblings, 1 reply; 42+ messages in thread
From: Junio C Hamano @ 2022-09-27 15:43 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: git, Victoria Dye, Derrick Stolee, Shaoxuan Yuan,
	Matheus Tavares, ZheNing Hu, Elijah Newren

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +  * Does the name --[no-]restrict sound good to others?  Are there better options?

Everybody in this thread are interested in sparse checkout, which
unfortunately blinds them from the fact that "restrict to", "limit
to", "focus on", etc. need not to be limited to the sparse checkout
feature.  We must have something that hints that the option is about
the sparse checkout feature.

As to the verbs, I do not mind "restrict to".  Other good ones I do
not mind choosing are "limit to" and "focus on".  They would equally
convey the same thing in this context.  And the object for these
verb phrases are the area of interest, those paths without the
skip-worktree bit, the paths outside the sparse cone(s).

Or we could go the other way.  We are excluding those paths with the
skip-worktree bit, so "exclude" and "ignore" are natural candidates.

These two classes are good if the "restrict" behaviour will never be
the default.  When it is the default, the option often used will
become "--no-restrict", which is awkward.

	Personally I am slightly in favor of "focus on" (i.e.
	"--focus" vs "--unfocus") as that meshes well with the
	concept of "the areas of the working tree paths that I am
	interested in right now", which may already hint that the
	option is about the sparse checkout feature (i.e. "I am
	focusing on these areas right now") and can stay short.  But
	this is just one person's opinion.

> +      * `--sparse`, as used in add/rm/mv, is totally backwards for
> +	grep/log/etc.  Changing the meaning of `--sparse` for these
> +	commands would fix the backwardness, but possibly break existing
> +	scripts.  Using a new name pairing would allow us to treat
> +	`--sparse` in these commands as a deprecated alias.

I actually am in favor of this, even though the appearance of
breaking backward compatibility may be big, but ...

> +      * There is a different `--sparse`/`--dense` pair for commands using
> +	revision machinery, so using that naming might cause confusion

... that is a good reason to avoid these two words.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] sparse-checkout.txt: new document with sparse-checkout directions
  2022-09-27  7:30     ` Elijah Newren
@ 2022-09-27 16:07       ` Junio C Hamano
  2022-09-28  6:13         ` Elijah Newren
  0 siblings, 1 reply; 42+ messages in thread
From: Junio C Hamano @ 2022-09-27 16:07 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Victoria Dye, Elijah Newren via GitGitGadget, Git Mailing List,
	Derrick Stolee, Shaoxuan Yuan, Matheus Tavares, ZheNing Hu

Elijah Newren <newren@gmail.com> writes:

> Oh, wow, that's something completely outside what I had considered.  I
> had viewed sparse-checkouts as splitting "tracked files" into two
> subsets.  As such, `--[no-]restrict` could only affect selecting
> whether the smaller or larger set of tracked files was of interest.
> From that viewpoint, untracked files seemed orthogonal, and thus there
> couldn't be such a thing as an "anamalous untracked file".
>
> But this idea is very interesting.  Hmm...

We need to design the behaviour of "git add" sensibly.  Even we say
"untracked files are just one class and there are two classes of
tracked ones, those path of current interest and those that are
uninteresting", we would need to say "'git add F' behaves this way
if F would become 'tracked path of current interest' when added, but
the command behaves this other way if F becomes 'tracked path that
is not interesting right now'".  It may be cleaner to separate the
untracked ones along the same line as the tracked ones.

Which in turn would mean that the skip-worktree bit cannot be the
source of truth.  Sparsity specification (either pattern matching or
being in listed directories) authoritatively decides if a path is of
the current interest or not.  This is simply because untracked ones
cannot have that bit ;-)  We can treat the skip-worktree bit as mere
implementation detail, a measure for optimization.

> One thing that makes me a little leery about this path is whether we
> can consistently apply the scoped-to-sparse-specification rule for
> index operations.  For example:
>
>   * You previously agreed that `git format-patch` should ignore sparse
> specification and operate full tree.

It is not "are we focusing on subset when we talk about index" to
begin with---format-patch is about a commit (or a series of commit),
and you should view it as a member of the "log" family.  Or the
first half of "rebase/cherry-pick" (the other half being "am"),
which should be full-tree, I would think.

>   * `git apply --cached $PATCH` only updates the index, and you
> suggested in an alternate email that apply should operate full-tree
> (at least with --index or without --cached, but I assume by extension
> it probably also applies with --cached).

I have not thought about "apply --cached".  Just like merge-tree can
merge without a working tree, "apply --cached" should be able to
serve as a foundation to apply a series out of lore archive and
create a topic branch without a working tree.

>   * What if someone runs `git diff --cached` just before that commit?
> Do we scope the diff to only those paths within the sparse
> specification?
>   * What if someone runs `git status` just before that commit?  Do we
> only show staged changes within the sparse specification?
>
> It feels like "git grep --cached" is perhaps the next thing along this
> sequence, and I don't see a clear line where to draw that we should
> limit things to the sparse specification for the index while treating
> the other operations as full tree; it seems like something feels
> broken or inconsistent in this sequence of commands if we attempt to
> do so.

OK, it seems that "--cached" has many cases that it wants to operate
on full tree.  I am in general more in favor of making things work
on full tree, simply because I feel it would have less chance of
going wrong, so defaulting to --no-restrict would be fine ;-)


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] sparse-checkout.txt: new document with sparse-checkout directions
  2022-09-25  0:09 [PATCH] sparse-checkout.txt: new document with sparse-checkout directions Elijah Newren via GitGitGadget
                   ` (3 preceding siblings ...)
  2022-09-27 15:43 ` Junio C Hamano
@ 2022-09-27 16:36 ` Derrick Stolee
  2022-09-28  5:38   ` Elijah Newren
  2022-09-30  9:09   ` ZheNing Hu
  2022-09-28  8:32 ` [PATCH v2] " Elijah Newren via GitGitGadget
  5 siblings, 2 replies; 42+ messages in thread
From: Derrick Stolee @ 2022-09-27 16:36 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git
  Cc: Victoria Dye, Shaoxuan Yuan, Matheus Tavares, ZheNing Hu, Elijah Newren

On 9/24/2022 8:09 PM, Elijah Newren via GitGitGadget wrote:
> From: Elijah Newren <newren@gmail.com>

> +  (Behavior A) Users are _only_ interested in the sparse portion of the repo
> +
> +These folks might know there are other things in the repository, but
> +don't care.  They are uninterested in other parts of the repository, and
> +only want to know about changes within their area of interest.  Showing
> +them other results from history (e.g. from diff/log/grep/etc.) is a
> +usability annoyance, potentially a huge one since other changes in
> +history may dwarf the changes they are interested in.

This idea of restricting the commit history to the sparse-checkout
definition (by default, with an escape hatch) seems like the most
radical of the things we've considered. I think it's interesting to
consider, but it might be better to think about things like diffstats,
grepping, and otherwise preventing out-of-cone adjustments by default.

That said, the idea of restricting history is also the simplest to
describe as a user-visible change.

> +Some of these users also arrive at this usecase from wanting to use
> +partial clones together with sparse checkouts and do disconnected
> +development.  Not only do these users generally not care about other
> +parts of the repository, but consider it a blocker for Git commands to
> +try to operate on those.  If commands attempt to access paths in history
> +outside the sparsity specification, then the partial clone will attempt
> +to download additional blobs on demand, fail, and then fail the user's
> +command.  (This may be unavoidable in some cases, e.g. when `git merge`
> +has non-trivial changes to reconcile outside the sparsity path, but we
> +should limit how often users are forced to connect to the network.)

This idea pairs well with a feature I've been meaning to build:
'git sparse-checkout backfill' would download all historical blobs
within the sparse-checkout definition. This is possible with rev-list,
but I want to investigate grouping blobs by path and making requests in
batches, hopefully allowing better deltification and ability to recover
from network disconnections. That makes this idea of "staying within
your sparse-checkout means no missing object downloads" even more likely.

> +  (Behavior B) Users want a sparse working tree, but are working in a larger whole
> +
> +Stolee described this usecase this way[11]:
> +
> +"I'm also focused on users that know that they are a part of a larger
> +whole. They know they are operating on a large repository but focus on
> +what they need to contribute their part. I expect multiple "roles" to
> +use very different, almost disjoint parts of the codebase. Some other
> +"architect" users operate across the entire tree or hop between different
> +sections of the codebase as necessary. In this situation, I'm wary of
> +scoping too many features to the sparse-checkout definition, especially
> +"git log," as it can be too confusing to have their view of the codebase
> +depend on your "point of view."

Thanks for including this.

> +People might also end up wanting behavior B due to complex inter-project
> +dependencies.  The initial attempts to use sparse-checkouts usually
> +involve the directories you are directly interested in plus what those
> +directories depend upon within your repository.  But there's a monkey
> +wrench here: if you have integration tests, they invert the hierarchy:
> +to run integration tests, you need not only what you are interested in
> +and its dependencies, you also need everything that depends upon what
> +you are interested in or that depends upon one of your
> +dependencies...AND you need all the dependencies of that expanded group.
> +That can easily change your sparse-checkout into a nearly dense one.

In my experience, the downstream dependencies are checked via builds in
the cloud, though that doesn't help if they are source dependencies and
you make a breaking change to an API interface. This kind of problem is
absolutely one of system architecture and I don't know what Git can do
other than to acknowledge it and recommend good patterns.

In a properly-organized project, 95% of engineers in the project can have
a small sparse-checkout, then 5% work on the common core that has these
downstream dependencies and require a large sparse-checkout definition.
There's nothing Git can do to help those engineers that do cross-tree
work.

(nit: this is a good place to break up this paragraph.)

> +Naturally, that tends to kill the benefits of sparse-checkouts.  There
> +are a couple solutions to this conundrum: either avoid grabbing
> +dependencies (maybe have built versions of your dependencies pulled from
> +a CI cache somewhere), or say that users shouldn't run integration tests
> +directly and instead do it on the CI server when they submit a code
> +review.  Or do both.  Regardless of whether you stub out your
> +dependencies or stub out the things that depend upon you, there is
> +certainly a reason to want to query and be aware of those other
> +stubbed-out parts of the repository, particularly when the dependencies
> +are complex or change relatively frequently.  Thus, for such uses,
> +sparse-checkouts can be used to limit what you directly build and
> +modify, but these users do not necessarily want their sparse checkout
> +paths to limit their queries of history.

...

> +* Commands behaving the same regardless of high-level use-case

Thank you for this audit of command usage.

> +* Commands that differ for behavior A vs. behavior B:
> +
> +  * commands that make modifications:
> +      * add
> +      * rm
> +      * mv

I think these, along with diff and grep, are great candidates to have
the default behavior fit category A with a flag to act with behavior B.

> +  * commands that query history
> +      * bisect

Interesting that 'bisect' could be considered differently, but I
suppose that if we are presenting the commit history graph in a
simplified form that we'd want to bisect on that simplified graph
instead of the full one.

> +      * blame
> +	* and annotate

blame and annotate operate on a single path, so they already
restrict within the sparse-checkout definition (unless the user
specifies a path outside of the sparse-checkout). The only difference
between A and B would be reporting an error if the path is outside the
definition, right? We don't need to do anything special to simplify
the history.

> +      * show (when given commit arguments)
> +      * log
> +	* and variants: shortlog, gitk, show-branch, whatchanged

And here is where we'd need to do that big changes for simplifying
the history graph. Does 'rev-list' not fit here? I tend to think of
'log' as a formatting layer on top of 'rev-list', but maybe that is
misguided.

> +* Comands I don't know how to classify

nit: s/Comands/Commands/

> +
> +  * ls-files> +  * checkout-index
> +  * update-index
> +  * plumbing -- diff-files, diff-index, diff-tree, ls-tree, rev-list

Plumbing commands might be a good candidate for "by default you
can do anything, but we can add ability to put guard rails on the
sparse-checkout set".

> +  * range-diff
> +
> +    Is this like `log` or `format-patch`?

I think this is more like format-patch. However, we need to be careful
if users use "git log" output to determine the range they provide to
the range-diff command, since that range could indicate a larger set of
commits.

> +=== Subcommand-dependent defaults ===
> +
> +Note that we have different defaults (for the desired behavior, not just
> +the current implementation) depending on the command:
> +
> +  * Commands defaulting to --restrict:

This appears to be the first mention of --restrict. Perhaps it would be
worth declaring what --restrict, --restrict-unless-conflicts, and
--no-restrict mean before creating this categorization?

> +    * status
> +    * diff (without --cached or REVISION arguments)
> +    * grep (without --cached or REVISION arguments)
> +    * switch
> +    * checkout (the switch-like half)
> +    * read-tree
> +    * reset (--hard)
> +    * restore/checkout
> +    * checkout-index
> +
> +    This behavior makes sense; these interact with the working tree.
> +
> +  * Commands defaulting to --restrict-unless-conflicts
> +    * merge
> +    * rebase
> +    * cherry-pick
> +    * revert

In my mind, --restrict-unless-conflicts doesn't provide any value unless
you want the --restrict mode to create an _error_ when trying to do
something outside of the sparse-checkout cone.

The only thing I can think about is that the diffstat might want to show
the stats for the conflicted files, in which case that's an important
perspective on the distinction from --restrict.

> +    In the case of am and apply, those commands only operate on the
> +    working tree, so they are kind of in the same boat as stash.
> +    Perhaps `git am` could run `git sparse-checkout reapply`
> +    automatically afterward and move into a category more similar to
> +    merge/rebase/cherry-pick, but it'd still be weird because it'd
> +    vivify files besides just conflicted ones when there are conflicts.

'git am' should be able to construct the resulting commit from the patch
without adding files outside of the sparse-checkout definition. If there
is a conflict, it fails in the application, anyway. I suppose you are
writing this here because 'git am' does not play nice with sparse-checkout
right now.

> +    In the case of ls-files, `git ls-files -t` is often used to see what
> +    is sparse and not, in which case restricting would not make sense.
> +    Also, ls-files has traditionally been used to get a list of "all
> +    tracked files", which would suggest not restricting.  But it's
> +    slightly funny, because sparse-checkouts essentially split tracked
> +    files into two categories -- those in the sparse specification and
> +    those outside -- and how does the user specify which of those two
> +    types of tracked files they want?

> +  * Commands defaulting to --restrict-but-warn (although Behavior A vs. Behavior B> +    may affect how verbose the warnings are):

More modes! OK.

> +    * add
> +    * rm
> +    * mv
> +
> +    The defaults here perhaps make sense since they are nearly --restrict, but
> +    actually using --restrict could cause user confusion if users specify a
> +    specific filename, so they warn by default.  That logic may sound like
> +    --no-restrict should be the default, but that's prone to even bigger confusion:
> +      * `git add <somefile>` if honored and outside the sparse cone, can result in
> +	the file randomly disappearing later when some subsequent command is run
> +	(since various commands automatically clean up unmodified files outside
> +	the sparsity specification).
> +      * `git rm '*.jpg'` could very negatively surprise users if it deletes files
> +	outside the range of the user's interest.  Much better to operate on the
> +	sparsity specification and give the user warnings if other files could have
> +	matched.

The cost of checking for other files that might match is sometimes too large
(needing to expand the sparse index or walk trees to find those path names) that
I would not recommend warning that we _didn't_ do something. Perhaps an advice
that says "we did not look outside the sparse-checkout definition for matching
paths" when the pathspec is not an exact path or a prefix match.

> +      * `git mv` has similar surprises when moving into or out of the cone, so
> +	best to restrict and throw warnings if restriction might affect the result.
> +
> +    There may be a difference in here between behavior A and behavior B.
> +    For behavior A, we probably only want to warn if there were no
> +    suitable matches for files in the sparsity specification, whereas
> +    for behavior B, we may want to warn even if there are valid files to
> +    operate on if the result would have been different under
> +    `--no-restrict`.

I think in behavior B, users who actually want to modify things tree-wide will
actually increase their sparse-checkout definition to include those files so
they can validate what they are doing.

> +  * Commands whose default for --restrict vs. --no-restrict should vary depending
> +    on Behavior A or Behavior B
> +    * diff (with --cached or REVISION arguments)
> +    * grep (with --cached or REVISION arguments)
> +    * show (when given commit arguments)
> +    * bisect
> +    * blame
> +      * and annotate
> +    * log
> +      * and variants: shortlog, gitk, show-branch, whatchanged
> +
> +    For now, we default to behavior B for these, which want a default of
> +    --no-restrict.

I do feel pretty strongly that we'll want a --no-restrict default here
because otherwise we will present confusion. I'm not even sure if we would
want to make this available via a config setting, but likely a config
setting makes sense in the long term.

> +=== Implementation Questions ===
> +
> +  * Does the name --[no-]restrict sound good to others?  Are there better options?
> +    * Names in use, or appearing in patches, or previously suggested:
> +      * --sparse/--dense
> +      * --ignore-skip-worktree-bits
> +      * --ignore-skip-worktree-entries
> +      * --ignore-sparsity
> +      * --[no-]restrict-to-sparse-paths
> +      * --full-tree/--sparse-tree
> +      * --[no-]restrict

I like the simplicity of --[no-]restrict, and my only worry is that it
doesn't immediately link to what it is restricting.

Perhaps something like "scope" would describe the set of things we care
about, but use a text mode:

	--scope=sparse	(--restrict)
	--scope=all	(--no-restrict)

But I'm notoriously bad at naming things.

> +  * Should --[no-]restrict be a git global option, or added as options to each
> +    relevant command?  (Does that make sense given the multitude of different
> +    default behaviors we have for different options?)

If we can make it a global option, that would be great, then update
the commands to behave under that mode as we go.

If that doesn't work, then adding the consistent option across commands
would be helpful. It might be good to make a OPT_RESTRICT macro (much
like OPT__VERBOSE, OPT__QUIET, and similar macros.

> +  * Should --sparse in ls-files be made an alias for --restrict?
> +    `--restrict` is certainly a near synonym in cone-mode, but even then
> +    it's not quite the same.  In non-cone mode, ls-files' `--sparse`
> +    option has no effect, and in cone-mode it still shows the sparse
> +    directory entries which are technically outside the sparsity
> +    specification.

We should definitely replace the --sparse option(s) with whatever we
choose here. For ls-files, we have the issue that we are reporting
what is in the index, and in non-cone-mode the index cannot be sparse.

Now, maybe we change what the ls-files mode does under --restrict and
only have it report the paths within the sparse-checkout and not even
show the results for sparse directory entries. The --no-restrict would
then expand a sparse-index to show only paths again.

> +  * Should --ignore-skip-worktree-bits in checkout-index, checkout, and
> +    restore be made deprecated aliases for --no-restrict?  (They have the
> +    same meaning.)

Yes.

> +  * Should --ignore-skip-worktree-entries in update-index be made a
> +    deprecated alias for --no-restrict?  (Or, better yet, should the
> +    option just be nuked from orbit after flipping the default, since
> +    the reverse option is never wanted and the sole purpose of this
> +    option was to turn off a bug?)

Yes and yes.

> +  * sparse-checkout: once behavior A is fully implemented, should we
> +    take an interim measure to easy people into switching the default?

nit: s/easy/ease/

> +    Namely, if folks are not already in a sparse checkout, then require
> +    `sparse-checkout init/set` to take a `--[no-]restrict` flag (which
> +    would set core.restrictToSparse according to the setting given), and
> +    throw an error if the flag is not provided?  That error would be a
> +    great place to warn folks that the default may change in the future,
> +    and get them used to specifying what they want so that the eventual
> +    default switch is seamless for them.

I don't like using the same option name (--[no-]restrict) for something
that sets a config option to keep that behavior permanently. Different
names that make it clearer could be:

	--enable-restrict-mode
	--set-scope=(sparse|all)

> +  * clone: should we provide some mechanism for tying partial clones and
> +    sparse checkouts together better.  Maybe an option
> +	--sparse=dir1,dir2,...,dirN
> +    which:
> +       * Does initial fetch with `--filter=blob:none`
> +       * Does the `sparse-checkout set --cone dir1 dir2 ... dirN` thing
> +       * Runs a `git rev-list --objects --all -- dir1 dir2 ... dirN` to
> +	 fault in the missing blobs within the sparse
> +	 specification...except that rev-list needs some kind of options
> +	 to also get files from leading directories too.
> +       * Sets --restrict mode to allow focusing on the cone of interest
> +	 (and to permit disconnected development)

As mentioned, I think we should have the option to backfill the blobs in
the sparse-checkout definition, but 'git clone' should not do this by
default. It's something that can be launched in the background, maybe, but
not a blocking operation on being able to use the repository.

'scalar clone' is an excellent testing bed for these kinds of things,
like setting the --restrict mode by default.

Hopefully my responses aren't too far off-base. I'll go read the rest of
the discussion now that I've contributed my thoughts on the doc.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] sparse-checkout.txt: new document with sparse-checkout directions
  2022-09-26 20:08 ` Victoria Dye
  2022-09-26 22:36   ` Junio C Hamano
  2022-09-27  6:09   ` Elijah Newren
@ 2022-09-27 16:42   ` Derrick Stolee
  2022-09-28  5:42     ` Elijah Newren
  2 siblings, 1 reply; 42+ messages in thread
From: Derrick Stolee @ 2022-09-27 16:42 UTC (permalink / raw)
  To: Victoria Dye, Elijah Newren via GitGitGadget, git
  Cc: Shaoxuan Yuan, Matheus Tavares, ZheNing Hu, Elijah Newren

On 9/26/2022 4:08 PM, Victoria Dye wrote:
> Elijah Newren via GitGitGadget wrote:
>> +=== Purpose of sparse-checkouts ===
>> +
>> +sparse-checkouts exist to allow users to work with a subset of their
>> +files.
>> +
>> +The idea is simple enough, but there are two different high-level
>> +usecases which affect how some Git subcommands should behave.  Further,
>> +even if we only considered one of those usecases, sparse-checkouts
>> +modify different subcommands in over a half dozen different ways.  Let's
>> +start by considering the high level usecases in this section:
>> +
>> +  A) Users are _only_ interested in the sparse portion of the repo
>> +
>> +  B) Users want a sparse working tree, but are working in a larger whole
> 
> Both of these use cases make sense to me! Two thoughts/comments:
> 
> 1. This could be a "me" problem, but I regularly struggle with "sparse"
>    having different meanings in similar contexts. For example, a "sparse
>    directory" is one *with* 'SKIP_WORKTREE' applied vs. "the sparse portion
>    of the repo"  here refers to the files *without* 'SKIP_WORKTREE' applied.
>    A quick note/section outlining some standard terminology would be
>    immensely helpful.

This difference is absolutely my fault, and maybe we should consider
fixing this problem by renaming sparse directories something else.
Perhaps "skipped directory" would be a better name?

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] sparse-checkout.txt: new document with sparse-checkout directions
  2022-09-27 16:36 ` Derrick Stolee
@ 2022-09-28  5:38   ` Elijah Newren
  2022-09-28 13:22     ` Derrick Stolee
  2022-09-30  9:54     ` ZheNing Hu
  2022-09-30  9:09   ` ZheNing Hu
  1 sibling, 2 replies; 42+ messages in thread
From: Elijah Newren @ 2022-09-28  5:38 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Victoria Dye,
	Shaoxuan Yuan, Matheus Tavares, ZheNing Hu

On Tue, Sep 27, 2022 at 9:36 AM Derrick Stolee <derrickstolee@github.com> wrote:
>
> On 9/24/2022 8:09 PM, Elijah Newren via GitGitGadget wrote:
> > From: Elijah Newren <newren@gmail.com>
>
> > +  (Behavior A) Users are _only_ interested in the sparse portion of the repo
> > +
> > +These folks might know there are other things in the repository, but
> > +don't care.  They are uninterested in other parts of the repository, and
> > +only want to know about changes within their area of interest.  Showing
> > +them other results from history (e.g. from diff/log/grep/etc.) is a
> > +usability annoyance, potentially a huge one since other changes in
> > +history may dwarf the changes they are interested in.
>
> This idea of restricting the commit history to the sparse-checkout
> definition (by default, with an escape hatch) seems like the most
> radical of the things we've considered. I think it's interesting to
> consider, but it might be better to think about things like diffstats,
> grepping, and otherwise preventing out-of-cone adjustments by default.
>
> That said, the idea of restricting history is also the simplest to
> describe as a user-visible change.

By "restricting commit history", are you thinking in terms of "git log
-- PATHS" or more like some kind of special --filter to git-clone?

I get the feeling you might be thinking about the latter, whereas I
was assuming users had all commits (and all trees), but log/diff would
restrict output based on relevant paths.

> > +Some of these users also arrive at this usecase from wanting to use
> > +partial clones together with sparse checkouts and do disconnected
> > +development.  Not only do these users generally not care about other
> > +parts of the repository, but consider it a blocker for Git commands to
> > +try to operate on those.  If commands attempt to access paths in history
> > +outside the sparsity specification, then the partial clone will attempt
> > +to download additional blobs on demand, fail, and then fail the user's
> > +command.  (This may be unavoidable in some cases, e.g. when `git merge`
> > +has non-trivial changes to reconcile outside the sparsity path, but we
> > +should limit how often users are forced to connect to the network.)
>
> This idea pairs well with a feature I've been meaning to build:
> 'git sparse-checkout backfill' would download all historical blobs
> within the sparse-checkout definition. This is possible with rev-list,
> but I want to investigate grouping blobs by path and making requests in
> batches, hopefully allowing better deltification and ability to recover
> from network disconnections. That makes this idea of "staying within
> your sparse-checkout means no missing object downloads" even more likely.

This sounds awesome.

> > +  (Behavior B) Users want a sparse working tree, but are working in a larger whole
> > +
> > +Stolee described this usecase this way[11]:
> > +
> > +"I'm also focused on users that know that they are a part of a larger
> > +whole. They know they are operating on a large repository but focus on
> > +what they need to contribute their part. I expect multiple "roles" to
> > +use very different, almost disjoint parts of the codebase. Some other
> > +"architect" users operate across the entire tree or hop between different
> > +sections of the codebase as necessary. In this situation, I'm wary of
> > +scoping too many features to the sparse-checkout definition, especially
> > +"git log," as it can be too confusing to have their view of the codebase
> > +depend on your "point of view."
>
> Thanks for including this.

I was actually worried this usecase was decreasing in priority for
you.  More on that later...

> > +People might also end up wanting behavior B due to complex inter-project
> > +dependencies.  The initial attempts to use sparse-checkouts usually
> > +involve the directories you are directly interested in plus what those
> > +directories depend upon within your repository.  But there's a monkey
> > +wrench here: if you have integration tests, they invert the hierarchy:
> > +to run integration tests, you need not only what you are interested in
> > +and its dependencies, you also need everything that depends upon what
> > +you are interested in or that depends upon one of your
> > +dependencies...AND you need all the dependencies of that expanded group.
> > +That can easily change your sparse-checkout into a nearly dense one.
>
> In my experience, the downstream dependencies are checked via builds in
> the cloud, though that doesn't help if they are source dependencies and
> you make a breaking change to an API interface. This kind of problem is
> absolutely one of system architecture and I don't know what Git can do
> other than to acknowledge it and recommend good patterns.

I was talking about (source) dependencies between
modules/projects/whatever-you-want-to-call-the-subcomponents of your
repository.  We have hundreds of modules, with various cross-module
dependencies that evolve over time.

I get the feeling from your description that your intra-repository
dependencies between modules/projects/whatever are much more static
for you than what we deal with.  (Which is a good thing; it'd be nice
if ours were more static.)

> In a properly-organized project, 95% of engineers in the project can have
> a small sparse-checkout, then 5% work on the common core that has these
> downstream dependencies and require a large sparse-checkout definition.

"In a properly-organized project"?  I'm unsure if this is an
indictment of some of the repositories I deal with in reality (and to
be fair, it might be a totally fair indictment), or if your statement
is starting to cross into "No true scotsman" territory.  ;-)

I would probably lean towards the former (we know it's more messy than
it should be), but I'm a bit puzzled that you'd just brush aside my
mention of integration tests.  We have people who want to run
integration tests locally, even when only modifying a small area of
the codebase.  These users are not doing cross-tree work, rather they
are doing cross-tree testing in conjunction with their work.  Running
such tests requires a build of the modules across the repository,
which naively would push folks into a dense checkout...and really long
local builds.  We want fast local builds, and sparse-checkouts help us
achieve that...but it does mean we have to be clever about how we
build in order to let these users run integration tests.  (And we have
to make it easy for users to discover the relevant integration tests,
and sometimes associated code components that depend on what they are
changing, which is where behavior B comes in).

> There's nothing Git can do to help those engineers that do cross-tree
> work.

I'm going to partially disagree with this, in part because of our
experience with many inter-module dependencies that evolve over time.
Folks can start on a certain module and begin refactoring.  Being
aware that their changes will affect other areas of the code, the can
do a search (e.g. "git grep --cached ..." to find cases outside their
current sparse checkout), and then selectively unsparsify to get the
relevant few dozen (or maybe even few hundred) modules added.  They
aren't switching to a dense checkout, just a less sparse one.  When
they are done, they may narrow their sparse specification again.  We
have a number of users doing cross-tree work who are using
sparse-checkouts, and who find it productive and say it still speeds
up their local build/test cycles.

So, I'd say that ensuring Git supports behavior B well in
sparse-checkouts, is something Git can do to help out both some of the
engineers doing cross-tree work, and some of the engineers that are
doing cross-tree testing.

(For full disclosure, we also have users doing cross-tree work using
regular dense checkouts and I agree there's not a lot we can do to
help them.)

> (nit: this is a good place to break up this paragraph.)

Yeah, it was kind of nice to have one paragraph per explanation of why
people might like behavior B.  But this is indeed a long paragraph.

[...]
> > +      * blame
> > +     * and annotate
>
> blame and annotate operate on a single path, so they already
> restrict within the sparse-checkout definition (unless the user
> specifies a path outside of the sparse-checkout). The only difference
> between A and B would be reporting an error if the path is outside the
> definition, right? We don't need to do anything special to simplify
> the history.

You're forgetting the possibility of one or more -C flags.  I'll note
it specifically on the line.

> > +      * show (when given commit arguments)
> > +      * log
> > +     * and variants: shortlog, gitk, show-branch, whatchanged
>
> And here is where we'd need to do that big changes for simplifying
> the history graph. Does 'rev-list' not fit here? I tend to think of
> 'log' as a formatting layer on top of 'rev-list', but maybe that is
> misguided.

Right, rev-list should probably be included here too.

> > +* Comands I don't know how to classify
>
> nit: s/Comands/Commands/

Thanks.

[...]
> > +=== Subcommand-dependent defaults ===
> > +
> > +Note that we have different defaults (for the desired behavior, not just
> > +the current implementation) depending on the command:
> > +
> > +  * Commands defaulting to --restrict:
>
> This appears to be the first mention of --restrict. Perhaps it would be
> worth declaring what --restrict, --restrict-unless-conflicts, and
> --no-restrict mean before creating this categorization?

Probably, yes.  Doing that might have even avoided some of the
confusion below...

[...]
> > +  * Commands defaulting to --restrict-unless-conflicts
> > +    * merge
> > +    * rebase
> > +    * cherry-pick
> > +    * revert
>
> In my mind, --restrict-unless-conflicts doesn't provide any value unless
> you want the --restrict mode to create an _error_ when trying to do
> something outside of the sparse-checkout cone.

Are you assuming here I was suggesting command line flags?  If so, I
apologize for my poor wording/descriptions.  At some point, I was just
noting that I was referring to behavior by the names of `--restrict`
and `--no-restrict`.  While pointing out that a strict interpretation
of the behaviors suggested by each name didn't match all commands, I
came up with names for alternate behaviors.  These names weren't meant
to become flags we'd use on the command line, despite the name that
perhaps suggests such.  Probably a really poor way to name these
behaviors; sorry about that.

Anyway, we do not want the behavior of `--restrict` for these
commands.  That would imply not providing conflicts to users for them
to resolve unless they are contained within the sparse specification,
which would clearly be broken.  We instead chose to write out files
with conflicts regardless of whether they are outside the sparse
specification.  This modified behavior I gave the name of
`--restrict-unless-conflict`, but we don't need or want an actual
command line flag for that.  I think the behavior should just remain
hardcoded into these commands.

(Note: these commands are among those that make me think
--[no-]restrict or --[un]focus or whatever might not make sense as a
git global option: `--restrict-unless-conflict` behavior is the
default for these and in fact that only sensible option, I think.  If
there's only one sensible option, no actual flag names are needed.)

> The only thing I can think about is that the diffstat might want to show
> the stats for the conflicted files, in which case that's an important
> perspective on the distinction from --restrict.

We only show the diffstat on a successful merge, so there's no
diffstat to show if there are any conflicted files.

> > +    In the case of am and apply, those commands only operate on the
> > +    working tree, so they are kind of in the same boat as stash.
> > +    Perhaps `git am` could run `git sparse-checkout reapply`
> > +    automatically afterward and move into a category more similar to
> > +    merge/rebase/cherry-pick, but it'd still be weird because it'd
> > +    vivify files besides just conflicted ones when there are conflicts.
>
> 'git am' should be able to construct the resulting commit from the patch
> without adding files outside of the sparse-checkout definition. If there

That's yet another interesting take on `git am` -- different than what
I originally had in mind, and different from what Junio suggested.  I
think both of your takes are better than what I was initially
thinking, I just wish your two approaches weren't pulling in opposite
directions.  :-)

> is a conflict, it fails in the application, anyway. I suppose you are
> writing this here because 'git am' does not play nice with sparse-checkout
> right now.

Well, as a result of this thread, we now have at least 2-3 potential
solutions we could pursue...

[...]
> > +    * add
> > +    * rm
> > +    * mv
> > +
> > +    The defaults here perhaps make sense since they are nearly --restrict, but
> > +    actually using --restrict could cause user confusion if users specify a
> > +    specific filename, so they warn by default.  That logic may sound like
> > +    --no-restrict should be the default, but that's prone to even bigger confusion:
> > +      * `git add <somefile>` if honored and outside the sparse cone, can result in
> > +     the file randomly disappearing later when some subsequent command is run
> > +     (since various commands automatically clean up unmodified files outside
> > +     the sparsity specification).
> > +      * `git rm '*.jpg'` could very negatively surprise users if it deletes files
> > +     outside the range of the user's interest.  Much better to operate on the
> > +     sparsity specification and give the user warnings if other files could have
> > +     matched.
>
> The cost of checking for other files that might match is sometimes too large
> (needing to expand the sparse index or walk trees to find those path names) that
> I would not recommend warning that we _didn't_ do something. Perhaps an advice
> that says "we did not look outside the sparse-checkout definition for matching
> paths" when the pathspec is not an exact path or a prefix match.

Ah, good point, and a good idea to keep in mind.

However, I think advise_on_updating_sparse_paths() currently does what
you're warning against.  Do you think there's a good chance this is
the cause of the performance bug reported over at
https://lore.kernel.org/git/CABPp-BEkJQoKZsQGCYioyga_uoDQ6iBeW+FKr8JhyuuTMK1RDw@mail.gmail.com
?

> > +  * Commands whose default for --restrict vs. --no-restrict should vary depending
> > +    on Behavior A or Behavior B
> > +    * diff (with --cached or REVISION arguments)
> > +    * grep (with --cached or REVISION arguments)
> > +    * show (when given commit arguments)
> > +    * bisect
> > +    * blame
> > +      * and annotate
> > +    * log
> > +      * and variants: shortlog, gitk, show-branch, whatchanged
> > +
> > +    For now, we default to behavior B for these, which want a default of
> > +    --no-restrict.
>
> I do feel pretty strongly that we'll want a --no-restrict default here
> because otherwise we will present confusion. I'm not even sure if we would
> want to make this available via a config setting, but likely a config
> setting makes sense in the long term.

You've got me slightly confused.  You did say the same thing a long time ago:

    "But I also want to avoid doing this as a default or even behind a
config setting."[A]

BUT, when Shaoxuan proposed making --restrict/--focus the default for
one of these commands, you seemed to be on board[B].

Personally, I thought that if anyone would object to some of these
commands changing, that grep would be considered as among the riskier.
For diff and log, printing a "Warning: restricting output to the
sparse-checkout specification" would be pretty innocuous, but for grep
that wouldn't be.

I was a little unsure about making `--restrict/--focus` the default
for these commands, both based on your previous concerns and because
of thinking about some of my behavior B users.  But then, it seemed
like everyone else was pushing for not only having this behavior but
making it the default[C,D,E,F].  I was beginning to wonder if even you
had decided behavior B didn't matter anymore between your support of
Shaoxuan's change at [B] and your diffstat comments at [G].  But now
it sounds like you're not only against behavior A by default but even
implementing it at all...even though I don't see how that squares with
your previous comments on grep and diffstat.

Is it just a matter of presentation?  Is it specific subcommands you
don't want changed?  Or am I either missing or misunderstanding
something?


Anyway...I will note that without a configurable option to give these
commands a behavior of `--restrict`, I think you make working in
disconnected partial clones practically impossible.  I want to be able
to do "git log -p", "git diff REV1 REV2", and "git grep TERM REV" in
disconnected partial clones, and I've wanted that kind of capability
for well over a decade[H].  So, don't be surprised if I keep bringing
up a config option of some sort for these commands.  :-)

[A] https://lore.kernel.org/git/1a1e33f6-3514-9afc-0a28-5a6b85bd8014@gmail.com/
[B] https://lore.kernel.org/git/e719d1e1-1849-07bc-ea08-2729985e5048@github.com/,
and the others in the thread
[C] https://lore.kernel.org/git/2fc889c9c264fc10d878f31bd89cc44e79982516.1599758167.git.matheus.bernardino@usp.br/
[D] paragraphs with "transitioning" in them from
https://lore.kernel.org/git/a89413b5-464b-2d54-5b8c-4502392afde8@github.com/
[E] https://lore.kernel.org/git/xmqqh719pcoo.fsf@gitster.g/
[F] https://lore.kernel.org/git/xmqqzgeqw0sy.fsf@gitster.g/
[G] https://lore.kernel.org/git/a86af661-cf58-a4e5-0214-a67d3a794d7e@github.com/
[H] https://lore.kernel.org/git/1283645647-1891-1-git-send-email-newren@gmail.com/


> > +=== Implementation Questions ===
> > +
> > +  * Does the name --[no-]restrict sound good to others?  Are there better options?
> > +    * Names in use, or appearing in patches, or previously suggested:
> > +      * --sparse/--dense
> > +      * --ignore-skip-worktree-bits
> > +      * --ignore-skip-worktree-entries
> > +      * --ignore-sparsity
> > +      * --[no-]restrict-to-sparse-paths
> > +      * --full-tree/--sparse-tree
> > +      * --[no-]restrict
>
> I like the simplicity of --[no-]restrict, and my only worry is that it
> doesn't immediately link to what it is restricting.

Yeah, Junio and Victoria brought up other flavors of this same
concern, and it's also the one thing I find suboptimal about this
name.

The problem is just that we need to add the flag in more places,
"sparse" is already taken in some of them with a different meaning,
and I'm not sure there is any other flag that does automatically link
to sparse-checkouts and/or self-describe without being excessively
wordy.

> Perhaps something like "scope" would describe the set of things we care
> about, but use a text mode:
>
>         --scope=sparse  (--restrict)
>         --scope=all     (--no-restrict)
>
> But I'm notoriously bad at naming things.

Yeah, me too.  Naming things is one of the two hard problems in
computer science, right?  (The others being cache invalidation, and
off-by-one errors.)

However, in this case, your suggestion sounds pretty decent to me.
I'll add it to the list for us to consider.

> > +  * Should --[no-]restrict be a git global option, or added as options to each
> > +    relevant command?  (Does that make sense given the multitude of different
> > +    default behaviors we have for different options?)
>
> If we can make it a global option, that would be great, then update
> the commands to behave under that mode as we go.
>
> If that doesn't work, then adding the consistent option across commands
> would be helpful. It might be good to make a OPT_RESTRICT macro (much
> like OPT__VERBOSE, OPT__QUIET, and similar macros.

Ooh, I didn't know about OPT__VERBOSE and OPT__QUIET.  Thanks for the flag.

[...]
> > +  * clone: should we provide some mechanism for tying partial clones and
> > +    sparse checkouts together better.  Maybe an option
> > +     --sparse=dir1,dir2,...,dirN
> > +    which:
> > +       * Does initial fetch with `--filter=blob:none`
> > +       * Does the `sparse-checkout set --cone dir1 dir2 ... dirN` thing
> > +       * Runs a `git rev-list --objects --all -- dir1 dir2 ... dirN` to
> > +      fault in the missing blobs within the sparse
> > +      specification...except that rev-list needs some kind of options
> > +      to also get files from leading directories too.
> > +       * Sets --restrict mode to allow focusing on the cone of interest
> > +      (and to permit disconnected development)
>
> As mentioned, I think we should have the option to backfill the blobs in
> the sparse-checkout definition, but 'git clone' should not do this by
> default. It's something that can be launched in the background, maybe, but
> not a blocking operation on being able to use the repository.
>
> 'scalar clone' is an excellent testing bed for these kinds of things,
> like setting the --restrict mode by default.

Earlier in this same email you were against even making an option to
request --restrict mode, but now you're suggesting to not only
implement it but make it the default in scalar?

> Hopefully my responses aren't too far off-base. I'll go read the rest of
> the discussion now that I've contributed my thoughts on the doc.

Thanks for the detailed response!

I figured we'd have one or two places where all of us had some
disagreements on the big picture, but more and more I'm finding we
aren't even always thinking about the problems the same (e.g. the 3+
different solutions to the `am` issues).  All the more reason that a
document like this is important for us to discuss these details and
work out a plan.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] sparse-checkout.txt: new document with sparse-checkout directions
  2022-09-27 16:42   ` Derrick Stolee
@ 2022-09-28  5:42     ` Elijah Newren
  0 siblings, 0 replies; 42+ messages in thread
From: Elijah Newren @ 2022-09-28  5:42 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Victoria Dye, Elijah Newren via GitGitGadget, Git Mailing List,
	Shaoxuan Yuan, Matheus Tavares, ZheNing Hu

On Tue, Sep 27, 2022 at 9:42 AM Derrick Stolee <derrickstolee@github.com> wrote:
>
> On 9/26/2022 4:08 PM, Victoria Dye wrote:
[...]
> > 1. This could be a "me" problem, but I regularly struggle with "sparse"
> >    having different meanings in similar contexts. For example, a "sparse
> >    directory" is one *with* 'SKIP_WORKTREE' applied vs. "the sparse portion
> >    of the repo"  here refers to the files *without* 'SKIP_WORKTREE' applied.
> >    A quick note/section outlining some standard terminology would be
> >    immensely helpful.
>
> This difference is absolutely my fault, and maybe we should consider
> fixing this problem by renaming sparse directories something else.

Hey now, don't reviewers also get some of the "credit"?  ;-)

> Perhaps "skipped directory" would be a better name?

Sounds reasonable to me.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] sparse-checkout.txt: new document with sparse-checkout directions
  2022-09-27 16:07       ` Junio C Hamano
@ 2022-09-28  6:13         ` Elijah Newren
  0 siblings, 0 replies; 42+ messages in thread
From: Elijah Newren @ 2022-09-28  6:13 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Victoria Dye, Elijah Newren via GitGitGadget, Git Mailing List,
	Derrick Stolee, Shaoxuan Yuan, Matheus Tavares, ZheNing Hu

On Tue, Sep 27, 2022 at 9:07 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> > Oh, wow, that's something completely outside what I had considered.  I
> > had viewed sparse-checkouts as splitting "tracked files" into two
> > subsets.  As such, `--[no-]restrict` could only affect selecting
> > whether the smaller or larger set of tracked files was of interest.
> > From that viewpoint, untracked files seemed orthogonal, and thus there
> > couldn't be such a thing as an "anamalous untracked file".
> >
> > But this idea is very interesting.  Hmm...
>
> We need to design the behaviour of "git add" sensibly.  Even we say
> "untracked files are just one class and there are two classes of
> tracked ones, those path of current interest and those that are
> uninteresting", we would need to say "'git add F' behaves this way
> if F would become 'tracked path of current interest' when added, but
> the command behaves this other way if F becomes 'tracked path that
> is not interesting right now'".  It may be cleaner to separate the
> untracked ones along the same line as the tracked ones.
>
> Which in turn would mean that the skip-worktree bit cannot be the
> source of truth.  Sparsity specification (either pattern matching or
> being in listed directories) authoritatively decides if a path is of
> the current interest or not.  This is simply because untracked ones
> cannot have that bit ;-)  We can treat the skip-worktree bit as mere
> implementation detail, a measure for optimization.

I like this idea.  Seems I should then move 'status' into the category
with add/rm/mv -- commands that need to be modified to treat untracked
files carefully.

Of course, this also may drag "git clean" into that category...though
I'm not sure how or if it'd differ.


[...]
> > It feels like "git grep --cached" is perhaps the next thing along this
> > sequence, and I don't see a clear line where to draw that we should
> > limit things to the sparse specification for the index while treating
> > the other operations as full tree; it seems like something feels
> > broken or inconsistent in this sequence of commands if we attempt to
> > do so.
>
> OK, it seems that "--cached" has many cases that it wants to operate
> on full tree.  I am in general more in favor of making things work
> on full tree, simply because I feel it would have less chance of
> going wrong, so defaulting to --no-restrict would be fine ;-)

Yeah, I think for the camp B folks, "--no-restrict" may make more
sense for operations searching or comparing to the index.

However, there's also another possibility I'm still mulling over.  To
understand it, first note that relative to the working tree, the
"sparse specification" can temporarily differ from the "paths matching
the sparsity patterns", because additional files might be transiently
present.  This most often happens due to conflicts, and we want
worktree related operations that behave under "restrict" mode (such as
"diff" or "grep" or "switch") to operate on all present tracked
files[1].  With that understanding, we could similarly consider that
relative to the index, the "sparse specification" could temporarily
differ from the "paths matching the sparsity patterns", because
additional paths outside the sparsity patterns could have been
modified in the index (e.g. during a merge or rebase or whatever).

Using a temporarily expanded sparsity specification may allow a
"restrict-like" behavior to make sense for index-related operations.
I currently think that'd be more useful for the camp A folks than the
camp B folks, though.

Either way, I don't think the index should use the sparsity defined by
or for the working tree.  The idea of using the working tree sparsity
for index-related operations may sound nice at first, but I think it
only behaves well when all paths modified in the index or working tree
are limited to those paths matching the sparsity patterns.  And
there's too many normal cases where that just doesn't hold.

[1] See also 82386b4496 ("Merge branch 'en/present-despite-skipped'",
2022-03-09)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] sparse-checkout.txt: new document with sparse-checkout directions
  2022-09-27 15:43 ` Junio C Hamano
@ 2022-09-28  7:49   ` Elijah Newren
  0 siblings, 0 replies; 42+ messages in thread
From: Elijah Newren @ 2022-09-28  7:49 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Victoria Dye,
	Derrick Stolee, Shaoxuan Yuan, Matheus Tavares, ZheNing Hu

On Tue, Sep 27, 2022 at 8:44 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > +  * Does the name --[no-]restrict sound good to others?  Are there better options?
>
> Everybody in this thread are interested in sparse checkout, which
> unfortunately blinds them from the fact that "restrict to", "limit
> to", "focus on", etc. need not to be limited to the sparse checkout
> feature.  We must have something that hints that the option is about
> the sparse checkout feature.
>
> As to the verbs, I do not mind "restrict to".  Other good ones I do
> not mind choosing are "limit to" and "focus on".  They would equally
> convey the same thing in this context.  And the object for these
> verb phrases are the area of interest, those paths without the
> skip-worktree bit, the paths outside the sparse cone(s).
>
> Or we could go the other way.  We are excluding those paths with the
> skip-worktree bit, so "exclude" and "ignore" are natural candidates.

If you're thinking about plain "exclude", that's already a flag in
'apply', 'am', 'clean', and 'ls-files'.

Also, if you want these words alone, then they also seem to lack hints
that the option is about the sparse checkout feature.  Expand them a
bit, perhaps?  "--ignore-sparsity"?
"--exclude-sparse-checkout-restrictions"?

Assuming we are worried about needing "--no-" variants, wouldn't the
risk of a "--no-ignore-sparsity" be worse than a "--no-restrict" in
terms of awkwardness, given the double negative?

> These two classes are good if the "restrict" behaviour will never be
> the default.  When it is the default, the option often used will
> become "--no-restrict", which is awkward.
>
>         Personally I am slightly in favor of "focus on" (i.e.
>         "--focus" vs "--unfocus") as that meshes well with the
>         concept of "the areas of the working tree paths that I am
>         interested in right now", which may already hint that the
>         option is about the sparse checkout feature (i.e. "I am
>         focusing on these areas right now") and can stay short.  But
>         this is just one person's opinion.

I'll add --focus/--unfocus to the list.  --unfocus seems a bit more
awkward to me than --no-restrict, but that might just be me.  If
others really liked it, I'd be fine with it.

Right now, I'm leaning a bit more towards Stolee's
--scope={sparse,all} (or maybe --scope={sparse,dense}?)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v2] sparse-checkout.txt: new document with sparse-checkout directions
  2022-09-25  0:09 [PATCH] sparse-checkout.txt: new document with sparse-checkout directions Elijah Newren via GitGitGadget
                   ` (4 preceding siblings ...)
  2022-09-27 16:36 ` Derrick Stolee
@ 2022-09-28  8:32 ` Elijah Newren via GitGitGadget
  2022-10-08 22:52   ` [PATCH v3] " Elijah Newren via GitGitGadget
  5 siblings, 1 reply; 42+ messages in thread
From: Elijah Newren via GitGitGadget @ 2022-09-28  8:32 UTC (permalink / raw)
  To: git
  Cc: Victoria Dye, Derrick Stolee, Shaoxuan Yuan, Matheus Tavares,
	ZheNing Hu, Elijah Newren, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

Once upon a time, Matheus wrote some patches to make
   git grep [--cached | <REVISION>] ...
restrict its output to the sparsity specification when working in a
sparse checkout[1].  That effort got derailed by two things:

  (1) The --sparse-index work just beginning which we wanted to avoid
      creating conflicts for
  (2) Never deciding on flag and config names and planned high level
      behavior for all commands.

More recently, Shaoxuan implemented a more limited form of Matheus'
patches that only affected --cached, using a different flag name,
but also changing the default behavior in line with what Matheus did.
This again highlighted the fact that we never decided on command line
flag names, config option names, and the big picture path forward.

The --sparse-index work has been mostly complete (or at least released
into production even if some small edges remain) for quite some time
now.  We have also had several discussions on flag and config names,
though we never came to solid conclusions.  Stolee once upon a time
suggested putting all these into some document in
Documentation/technical[3], which Victoria recently also requested[4].
I'm behind the times, but here's a patch attempting to finally do that.

Note that the "Implementation Questions" section is pretty large,
reflecting the fact that this is perhaps more RFC than proposal.

[1] https://lore.kernel.org/git/5f3f7ac77039d41d1692ceae4b0c5df3bb45b74a.1612901326.git.matheus.bernardino@usp.br/
    (See his second link in that email in particular)
[2] https://lore.kernel.org/git/20220908001854.206789-2-shaoxuan.yuan02@gmail.com/
[3] https://lore.kernel.org/git/CABPp-BHwNoVnooqDFPAsZxBT9aR5Dwk5D9sDRCvYSb8akxAJgA@mail.gmail.com/
    (Scroll to the very end for the final few paragraphs)
[4] https://lore.kernel.org/git/cafcedba-96a2-cb85-d593-ef47c8c8397c@github.com/

Signed-off-by: Elijah Newren <newren@gmail.com>
---
    [RFC] sparse-checkout.txt: new document with sparse-checkout directions
    
    As discussion has shown so far, we seem to have a variety of different
    ideas in a number of areas, and sometimes are pulling a bit in different
    directions. But the discussion is very illuminating. Anyway, take any
    proposal or option names with a big grain of salt and don't consider
    anything final. Thoughts and opinions still very much welcome.
    
    (I've worked really hard to get this document out, because I feel bad
    that I've blocked multiple contributors' changes in this area over
    concerns of not having a clear direction and possibly painting ourselves
    into corners. But it's taken a lot of time, so I may have to back off
    for a bit, so I may wait a week or two to respond further to this topic.
    That might be better anyway, because it's long enough that folks may
    need time to digest it and all its updates.)
    
    Changes since v1:
    
     * Added new sections:
       * "Terminology"
       * "Behavior classes"
       * "Sparse specification vs. sparsity patterns"
     * Tried to shuffle commands from unknown into appropriate sections
       based on feedback, but I got some conflicting feedback, so...who
       knows if thing are in the right place
     * More consistency in using "sparse specification" over other terms
     * Extra comments about how add/rm/mv operate on moving files across the
       tracked/untracked boundary
     * --restrict-but-warn should have been "restrict or error", but
       reworded even more heavily as part of "Behavior classes" section
     * Added extra questions based on feedback (--no-expand, update-index
       stuff, apply --index)
     * More details on apply/am bugs
     * Documented read-tree issue
     * A few cases of fixing line wrapping at <=80 chars
     * Added more alternate name suggestions for options instead of
       --[no-]restrict

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1367%2Fnewren%2Fsparse-checkout-directions-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1367/newren/sparse-checkout-directions-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/1367

Range-diff vs v1:

 1:  3b6b1910eb4 ! 1:  d20e63206dc sparse-checkout.txt: new document with sparse-checkout directions
     @@ Documentation/technical/sparse-checkout.txt (new)
      @@
      +Table of contents:
      +
     ++  * Terminology
      +  * Purpose of sparse-checkouts
      +  * Desired behavior
     ++  * Behavior classes
      +  * Subcommand-dependent defaults
     ++  * Sparse specification vs. sparsity patterns
      +  * Implementation Questions
      +  * Implementation Goals/Plans
      +  * Known bugs
      +  * Reference Emails
      +
      +
     ++=== Terminology ===
     ++
     ++cone mode: one of two modes for specifying the desired subset of files
     ++	in a sparse-checkout.  In cone-mode, the user specifies
     ++	directories (getting both everything under that directory as
     ++	well as everything in leading directories), while in non-cone
     ++	mode, the user specifies gitignore-style patterns.  Controlled
     ++	by the --[no-]cone option to sparse-checkout init|set.
     ++
     ++SKIP_WORKTREE: When tracked files do not match the sparse specification and
     ++	are removed from the working tree, the file in the index is marked
     ++	with a SKIP_WORKTREE bit.  Note that if a tracked file has the
     ++	SKIP_WORKTREE bit set but is later written by the user to the
     ++	working tree anyway, the SKIP_WORKTREE bit will be cleared at the
     ++	beginning of any Git operation.
     ++
     ++	Most sparse checkout users are unaware of this implementation
     ++	detail, and the term should generally be avoided in user-facing
     ++	descriptions and command flags.  Unfortunately, prior to the
     ++	`sparse-checkout` subcommand these low-level details were exposed,
     ++	and as of time of writing, still are in various places.
     ++
     ++sparse-checkout: a subcommand in git used to reduce the files present in
     ++	the working tree to a subset of all tracked files.  Also, the
     ++	name of the file in the $GIT_DIR/info directory used to track
     ++	the sparsity patterns corresponding to the user's desired
     ++	subset.
     ++
     ++sparse cone: see cone mode
     ++
     ++sparse directory: An entry in the index corresponding to a directory
     ++	rather, and used to replace all files under that directory that
     ++	would normally appear in the index.  See also sparse-index.
     ++	Something that can cause confusion is that the "sparse
     ++	directory" does NOT match the sparse specification, i.e. the
     ++	directory is NOT present in the working tree.
     ++
     ++sparse index: A special mode for sparse-checkout that also makes the
     ++	index sparse by recording a directory entry in lieu of all the
     ++	files underneath that directory.  Controlled by the
     ++	--[no-]sparse-index option to init|set|reapply.  See also
     ++	"sparse directory".
     ++
     ++sparsity patterns: patterns from $GIT_DIR/info/sparse-checkout used to
     ++	define the set of files of interest.  A warning: It is easy to
     ++	over-use this term (or the shortened "patterns" term), for two
     ++	reasons (1) users in cone mode specify directories rather
     ++	than patterns (their directories are transformed into patterns,
     ++	but users may think you are talking about non-cone mode if you
     ++	use the word "patterns"), and (b) the sparse specification might
     ++	transiently differ in the working tree from the sparsity
     ++	patterns (see "Sparse specification vs. sparsity patterns").
     ++
     ++sparse specification: The set of paths in the user's area of focus.  When
     ++	interacting with the working tree, this is the set of tracked files
     ++	present in the working copy or with a clear SKIP_WORKTREE bit.
     ++	When working with history, this is the set of files matching the
     ++	sparsity patterns.  Usually the tracked files present in the
     ++	working copy are precisely the set of tracked files matching
     ++	sparsity patterns, but they can temporarily differ.  (See also
     ++	"Sparse specification vs. sparsity patterns")
     ++
     ++vivifying: When a command restores a tracked file to the working tree
     ++	(and clearing the SKIP_WORKTREE bit in the index), this is
     ++	referred to as "vivifying" the file.
     ++
     ++
      +=== Purpose of sparse-checkouts ===
      +
      +sparse-checkouts exist to allow users to work with a subset of their
      +files.
      +
     -+The idea is simple enough, but there are two different high-level
     -+usecases which affect how some Git subcommands should behave.  Further,
     -+even if we only considered one of those usecases, sparse-checkouts
     -+modify different subcommands in over a half dozen different ways.  Let's
     -+start by considering the high level usecases in this section:
     ++You can think of sparse-checkouts as subdividing "tracked" files into
     ++two categories -- a sparse subset, and all the rest.
     ++Implementationally, we mark "all the rest" with SKIP_WORKTREE.  The
     ++SKIP_WORKTREE files are still tracked, just not present in the working
     ++tree.
     ++
     ++In the past, sparse-checkouts were defined by "SKIP_WORKTREE means the file
     ++is missing from the working tree but pretend the file matches HEAD".  That
     ++was a low-level detail which provided decent behavior for a few commands,
     ++but which had a surprising number of ways in which it violated user
     ++expectations and was a bad mental model.  However, it persisted for many
     ++years and may still be found in some corners of the code base.
     ++
     ++Anyway, the idea of "working with a subset of files" is simple enough, but
     ++there are two different high-level usecases which affect how some Git
     ++subcommands should behave.  Further, even if we only considered one of
     ++those usecases, sparse-checkouts modify different subcommands in over a
     ++half dozen different ways.  Let's start by considering the high level
     ++usecases:
      +
      +  A) Users are _only_ interested in the sparse portion of the repo
      +
     @@ Documentation/technical/sparse-checkout.txt (new)
      +outside the sparsity specification, then the partial clone will attempt
      +to download additional blobs on demand, fail, and then fail the user's
      +command.  (This may be unavoidable in some cases, e.g. when `git merge`
     -+has non-trivial changes to reconcile outside the sparsity path, but we
     -+should limit how often users are forced to connect to the network.)
     ++has non-trivial changes to reconcile outside the sparse specification,
     ++but we should limit how often users are forced to connect to the
     ++network.)
      +
      +Also, even for users using partial clones that do not mind being
      +always connected to the network, the need to download blobs as
     @@ Documentation/technical/sparse-checkout.txt (new)
      +you are interested in or that depends upon one of your
      +dependencies...AND you need all the dependencies of that expanded group.
      +That can easily change your sparse-checkout into a nearly dense one.
     ++
      +Naturally, that tends to kill the benefits of sparse-checkouts.  There
      +are a couple solutions to this conundrum: either avoid grabbing
      +dependencies (maybe have built versions of your dependencies pulled from
     @@ Documentation/technical/sparse-checkout.txt (new)
      +
      +  * commands that only look at files within the sparsity specification
      +
     -+      * status
      +      * diff (without --cached or REVISION arguments)
      +      * grep (without --cached or REVISION arguments)
     ++      * diff-files
      +
     -+  * commands that restore files to the working tree that match sparsity patterns, and
     -+    remove unmodified files that don't match those patterns:
     ++  * commands that restore files to the working tree that match sparsity
     ++    patterns, and remove unmodified files that don't match those
     ++    patterns:
      +
      +      * switch
      +      * checkout (the switch-like half)
     @@ Documentation/technical/sparse-checkout.txt (new)
      +      * cherry-pick
      +      * revert
      +
     ++      * `am` and `apply --index` should probably be in this section but are buggy
     ++	(see the "Known bugs" section below)
     ++
      +    Note that this somewhat depends upon the merge strategy being used:
      +      * `ort` behaves as described above
      +      * `recursive` tries to not vivify files unnecessarily, but does sometimes
     @@ Documentation/technical/sparse-checkout.txt (new)
      +    and whether those paths match sparsity patterns or not):
      +
      +      * stash
     -+
     -+      * am/apply probably should be in the above category, but need to be fixed to
     -+	auto-vivify instead of failing
     ++      * apply (without `--index` or `--cached`)
      +
      +* Commands that differ for behavior A vs. behavior B:
      +
     -+  * commands that make modifications:
     ++  * commands that make modifications to which files are tracked:
      +      * add
      +      * rm
      +      * mv
     ++      * update-index
     ++
     ++    The fact that files can move between the 'tracked' and 'untracked'
     ++    categories means some commands will have to treat untracked files
     ++    differently.  But if we have to treat untracked files differently,
     ++    then additional commands may also need changes:
     ++
     ++      * status
     ++      * clean
     ++
     ++    In particular, `status` may need to report any untracked files outside
     ++    the sparsity specification as an erroneous condition (especially to
     ++    avoid the user trying to `git add` them, forcing `git add` to display
     ++    an error).
     ++
     ++    It's not clear to me exactly how (or if `clean` would change, but it's
     ++    the other command that also affects untracked files.
     ++
     ++    `update-index` may be slightly special.  Its --[no-]skip-worktree flag
     ++    may need to ignore the sparse specification by its nature.  Also, its
     ++    current --[no-]ignore-skip-worktree-entries default is totally bogus.
      +
      +  * commands that query history
      +      * diff (with --cached or REVISION arguments)
      +      * grep (with --cached or REVISION arguments)
      +      * show (when given commit arguments)
      +      * bisect
     -+      * blame
     ++      * blame (only matters when one or more -C flags passed)
      +	* and annotate
      +      * log
     -+	* and variants: shortlog, gitk, show-branch, whatchanged
     -+
     -+* Comands I don't know how to classify
     -+
     -+  * ls-files
     ++	* and variants: shortlog, gitk, show-branch, whatchanged, rev-list
     ++      * ls-files
     ++      * diff-index
     ++      * diff-tree
     ++      * ls-tree
      +
     -+    Shows all tracked files by default, and with an option can show
     -+    sparse directory entries instead of expanding them.  Should there be
     -+    a way to restrict to just the non SKIP_WORKTREE files?
     ++    ls-files may be slightly special in that e.g. `git ls-files -t` is
     ++    often used to see what is sparse and what is not.  Perhaps -t should
     ++    always work on the full tree?
      +
     -+    Note that `git ls-files -t` is often used to see what is sparse and
     -+    what is not, which only works with a non-restricted assumption.
     -+
     -+  * checkout-index
     -+
     -+    should it be like `checkout` and pay attention to sparsity paths, or
     -+    be considered special and write to working tree anyway?  The
     -+    interaction with --prefix, and the use of specifically named files
     -+    (rather than globs) makes me wonder.
     -+
     -+  * update-index
     -+
     -+    The --[no-]ignore-skip-worktree-entries default is totally bogus,
     -+    but otherwise this command seems okay?  Not sure what category it
     -+    would go under, though.
     ++* Commands I don't know how to classify
      +
      +  * range-diff
      +
     @@ Documentation/technical/sparse-checkout.txt (new)
      +
      +    See range-diff
      +
     -+  * plumbing -- diff-files, diff-index, diff-tree, ls-tree, rev-list
     -+
     -+    should these be tweaked or always operate full-tree?
     -+
      +* Commands unaffected by sparse-checkouts
      +
      +  * branch
     -+  * clean (works on untracked files, whereas SKIP_WORKTREE files are still tracked)
      +  * describe
      +  * fetch
      +  * gc
     @@ Documentation/technical/sparse-checkout.txt (new)
      +  * merge-index
      +
      +
     ++=== Behavior classes ====
     ++
     ++From the above there are a few classes of behavior:
     ++
     ++  * "restrict"
     ++
     ++    Commands in this class only read or write files within the sparse
     ++    specification.  Some of these commands may also attempt, at the end of
     ++    their operation, to cull transient differences between the sparse
     ++    specification and the sparsity patterns (see "Sparse specification
     ++    vs. sparsity patterns" for details, but this basically means either
     ++    removing unmodified files not matching the sparsity patterns and
     ++    marking those files as SKIP_WORKTREE, or vivifying files that match the
     ++    sparsity patterns and marking those files as !SKIP_WORKTREE).
     ++
     ++  * "restrict modulo conflicts"
     ++
     ++    Commands in this class generally behave like the "restrict" class,
     ++    except that:
     ++      (1) they ignore the sparse specification in terms of updates to the
     ++	  index, though they'll preserve or update the SKIP_WORKTREE bit
     ++	  for files as needed to follow the sparsity patterns.
     ++      (2) they will ignore the sparse specification and write files with
     ++	  conflicts to the working tree (thus temporarily expanding the
     ++	  sparse specification to include such files.)
     ++
     ++  * "restrict also specially applied to untracked files"
     ++
     ++    Commands in this class generally behave like the "restrict" class,
     ++    except that they have to handle untracked files differently too, often
     ++    because these commands are dealing with files changing state between
     ++    'tracked' and 'untracked'.  Often, this may mean printing an error
     ++    message if the command had nothing to do, but the arguments may have
     ++    referred to files whose tracked-ness state could have changed were it
     ++    not for the sparsity patterns excluding them.
     ++
     ++  * "no restrict"
     ++
     ++    Commands in this class ignore the sparse specification entirely.
     ++
     ++  * "restrict or no restrict dependent upon behavior A vs. behavior B"
     ++
     ++    Commands in this class behave like "no restrict" for folks in the
     ++    behavior B camp, and like "restrict" for folks in the behavior A camp.
     ++    However, when behaving like "restrict" a warning of some sort might be
     ++    provided that history queries have been limited by the sparse-checkout
     ++    specification.
     ++
     ++
      +=== Subcommand-dependent defaults ===
      +
     -+Note that we have different defaults (for the desired behavior, not just
     -+the current implementation) depending on the command:
     ++Note that we have different defaults depending on the command for the
     ++desired behavior :
      +
     -+  * Commands defaulting to --restrict:
     ++  * Commands defaulting to "restrict":
      +    * status
      +    * diff (without --cached or REVISION arguments)
      +    * grep (without --cached or REVISION arguments)
     @@ Documentation/technical/sparse-checkout.txt (new)
      +    * reset (--hard)
      +    * restore/checkout
      +    * checkout-index
     ++    * diff-files
      +
      +    This behavior makes sense; these interact with the working tree.
      +
     -+  * Commands defaulting to --restrict-unless-conflicts
     ++  * Commands defaulting to "restrict modulo conflicts":
      +    * merge
      +    * rebase
      +    * cherry-pick
      +    * revert
      +
     ++    * am
     ++    * apply --index
     ++
      +    These also interact with the working tree, but require slightly different
      +    behavior so that conflicts can be resolved.
      +
     -+  * Commands defaulting to --no-restrict
     ++    (See also the "Known bugs" section below regarding `am` and `apply`)
     ++
     ++  * Commands defaulting to "no restrict":
      +    * archive
      +    * bundle
      +    * commit
     @@ Documentation/technical/sparse-checkout.txt (new)
      +    * fast-import
      +    * commit-tree
      +
     -+    * ls-files
      +    * stash
     -+    * am
     -+    * apply
     ++    * apply (without `--index`)
      +
     -+    These have completely different defaults and perhaps deserve the most detailed
     -+    explanation:
     ++    These have completely different defaults and perhaps deserve the most
     ++    detailed explanation:
      +
      +    In the case of commands in the first group (format-patch,
      +    fast-export, bundle, archive, etc.), these are commands for
     @@ Documentation/technical/sparse-checkout.txt (new)
      +    In the case of stash, it needs to vivify files to avoid losing the
      +    user's changes.
      +
     -+    In the case of am and apply, those commands only operate on the
     -+    working tree, so they are kind of in the same boat as stash.
     -+    Perhaps `git am` could run `git sparse-checkout reapply`
     -+    automatically afterward and move into a category more similar to
     -+    merge/rebase/cherry-pick, but it'd still be weird because it'd
     -+    vivify files besides just conflicted ones when there are conflicts.
     -+
     -+    In the case of ls-files, `git ls-files -t` is often used to see what
     -+    is sparse and not, in which case restricting would not make sense.
     -+    Also, ls-files has traditionally been used to get a list of "all
     -+    tracked files", which would suggest not restricting.  But it's
     -+    slightly funny, because sparse-checkouts essentially split tracked
     -+    files into two categories -- those in the sparse specification and
     -+    those outside -- and how does the user specify which of those two
     -+    types of tracked files they want?
     -+
     -+  * Commands defaulting to --restrict-but-warn (although Behavior A vs. Behavior B
     -+    may affect how verbose the warnings are):
     ++    In the case of apply without `--index`, that command needs to update
     ++    the working tree without the index (or the index without the working
     ++    tree if `--cached` is passed), and if we restrict those updates to the
     ++    sparse specification then we'll lose changes from the user.
     ++
     ++  * Commands defaulting to "restrict also specially applied to untracked files":
      +    * add
      +    * rm
      +    * mv
      +
     -+    The defaults here perhaps make sense since they are nearly --restrict, but
     -+    actually using --restrict could cause user confusion if users specify a
     -+    specific filename, so they warn by default.  That logic may sound like
     -+    --no-restrict should be the default, but that's prone to even bigger confusion:
     -+      * `git add <somefile>` if honored and outside the sparse cone, can result in
     -+	the file randomly disappearing later when some subsequent command is run
     -+	(since various commands automatically clean up unmodified files outside
     -+	the sparsity specification).
     -+      * `git rm '*.jpg'` could very negatively surprise users if it deletes files
     -+	outside the range of the user's interest.  Much better to operate on the
     -+	sparsity specification and give the user warnings if other files could have
     -+	matched.
     -+      * `git mv` has similar surprises when moving into or out of the cone, so
     -+	best to restrict and throw warnings if restriction might affect the result.
     -+
     -+    There may be a difference in here between behavior A and behavior B.
     -+    For behavior A, we probably only want to warn if there were no
     -+    suitable matches for files in the sparsity specification, whereas
     -+    for behavior B, we may want to warn even if there are valid files to
     -+    operate on if the result would have been different under
     -+    `--no-restrict`.
     -+
     -+  * Commands whose default for --restrict vs. --no-restrict should vary depending
     -+    on Behavior A or Behavior B
     ++    Our original implementation for these commands was "no restrict", but
     ++    it had some severe usability issues:
     ++      * `git add <somefile>` if honored and outside the sparse
     ++	specification, can result in the file randomly disappearing later
     ++	when some subsequent command is run (since various commands
     ++	automatically clean up unmodified files outside the sparse
     ++	specification).
     ++      * `git rm '*.jpg'` could very negatively surprise users if it deletes
     ++	files outside the range of the user's interest.
     ++      * `git mv` has similar surprises when moving into or out of the cone,
     ++	so best to restrict by default
     ++
     ++    So, we switched `add` and `rm` to default to "restrict", which made
     ++    usability problems much less severe and less frequent, but we still got
     ++    complaints because commands like:
     ++	git add <file-outside-sparse-specification>
     ++	git rm <file-outside-sparse-specification>
     ++    would silently do nothing.  We should instead print an error in those
     ++    cases to get usability right.
     ++
     ++    There may be a difference in here between behavior A and behavior B in
     ++    terms of verboseness of errors or additional warnings.
     ++
     ++  * Commands falling under "restrict or no restrict dependent upon behavior
     ++    A vs. behavior B"
     ++
      +    * diff (with --cached or REVISION arguments)
      +    * grep (with --cached or REVISION arguments)
      +    * show (when given commit arguments)
      +    * bisect
     -+    * blame
     ++    * blame (only matters when one or more -C flags passed)
      +      * and annotate
      +    * log
     -+      * and variants: shortlog, gitk, show-branch, whatchanged
     ++      * and variants: shortlog, gitk, show-branch, whatchanged, rev-list
     ++    * ls-files
     ++    * diff-index
     ++    * diff-tree
     ++    * ls-tree
      +
      +    For now, we default to behavior B for these, which want a default of
     -+    --no-restrict.
     -+
     -+    Note that two of these commands -- diff and grep -- also appeared in
     -+    a different list with a default of --restrict, but only when limited
     -+    to searching the working tree.  The working tree vs. history
     -+    distinction is fundamental in how behavior B operates, so this is
     -+    expected.
     -+
     -+    --restrict may make more sense as the long term default for
     -+    these[12], but that's a fair amount of work to implement, and it'd
     -+    be very problematic for behavior B users.  Making it the default
     -+    now, and then slowly implementing that default in various
     -+    subcommands over multiple releases would mean that behavior B users
     -+    would need to learn to slowly add additional flags to their
     -+    commands, depending on git version, to get the behavior they want.
     -+    That gradual switchover would be painful, so we should avoid it at
     -+    least until it's fully implemented.
     ++    "no restrict".
     ++
     ++    Note that two of these commands -- diff and grep -- also appeared in a
     ++    different list with a default of "restrict", but only when limited to
     ++    searching the working tree.  The working tree vs. history distinction
     ++    is fundamental in how behavior B operates, so this is expected.
     ++
     ++    "restrict" may make more sense as the long term default for these[12],
     ++    though Stolee seems to have some reservations[17].  Also, supporting
     ++    "restrict" for these commands might be a fair amount of work to
     ++    implement, meaning it might be implemented over multiple releases.  If
     ++    that behavior were the default in the commands that supported it, that
     ++    would force behavior B users to need to learn to slowly add additional
     ++    flags to their commands, depending on git version, to get the behavior
     ++    they want.  That gradual switchover would be painful, so we should
     ++    avoid it at least until it's fully implemented.
     ++
     ++
     ++=== Sparse specification vs. sparsity patterns ===
     ++
     ++In a well-behaved situation, the sparse specification is given directly
     ++by the $GIT_DIR/info/sparse-checkout file.  However, it can transiently
     ++diverge for a few reasons:
     ++
     ++    * needing to resolve conflicts (merging will vivify conflicted files)
     ++    * running Git commands that implicitly vivify files (e.g. "git stash apply")
     ++    * running Git commands that explicitly vivify files (e.g. "git checkout
     ++      --ignore-skip-worktree-bits FILENAME")
     ++    * other commands that write to these files (perhaps a user copies it
     ++      from elsewhere)
     ++
     ++For the last item, note that we do automatically clear the SKIP_WORKTREE
     ++bit for files that are present in the working tree.  This has been true
     ++since 82386b4496 ("Merge branch 'en/present-despite-skipped'",
     ++2022-03-09)
     ++
     ++However, such a situation is transient because:
     ++
     ++   * Such transient differences can and will be automatically removed as
     ++     a side-effect of commands which call unpack_trees() (checkout,
     ++     merge, reset, etc.).
     ++   * Users can also request such transient differences be corrected via
     ++     running `git sparse-checkout reapply`.  Various places recommand
     ++     running that command.
     ++   * Additional commands are also welcome to implicitly fix these
     ++     differences; we may add more in the future.
     ++
     ++While we avoid dropping unstaged changes or files which have conflicts,
     ++we otherwise aggressively try to fix these transient differences.  If
     ++users want these differences to persist, they should run the `set` or
     ++`add` subcommands of `git sparse-checkout` to reflect their intended
     ++sparse specification.
     ++
     ++However, when we need to do a query on history restricted to the
     ++"relevant subset of files" such a transiently expanded sparse
     ++specification is ignored.  There are a couple reasons for this:
     ++
     ++   * The behavior wanted when doing something like
     ++	 git grep expression REVISION
     ++     is roughly what the users would expect from
     ++	 git checkout REVISION && git grep expression
     ++     (modulo a "REVISION:" prefix), which has a couple ramifications:
     ++
     ++   * REVISION may have paths not in the current index, so there is no
     ++     path we can consult for a SKIP_WORKTREE setting for those paths.
     ++
     ++   * Since `checkout` is one of those commands that tries to remove
     ++     transient differences in the sparse specification, it makes sense
     ++     to use the corrected sparse specification
     ++     (i.e. $GIT_DIR/info/sparse-checkout) rather than attempting to
     ++     consult SKIP_WORKTREE anyway.
     ++
     ++So, a transiently expanded (or restricted) sparse specification applies to
     ++the working tree, but not to history history queries where we always use
     ++the sparsity patterns.  (See [16] for an early discussion of this.)
     ++
     ++Similar to a transiently expanded sparse specification of the working tree
     ++based on additional files being present in the working tree, we could also
     ++consider the concept of a transiently expanded sparse specification for the
     ++index.  In particular, if the user has staged changes to files that do not
     ++match the sparsity patterns, and the file is not present in the working
     ++tree, we may still want to consider the file part of the sparse
     ++specification if we are specifically performing a query related to the
     ++index (e.g. git diff REVISION, git diff-index REVISION, etc.)
      +
      +
      +=== Implementation Questions ===
      +
     -+  * Does the name --[no-]restrict sound good to others?  Are there better options?
     ++  * Does the name --[no-]restrict sound good to others?  Are there better
     ++    options?
      +    * Names in use, or appearing in patches, or previously suggested:
      +      * --sparse/--dense
      +      * --ignore-skip-worktree-bits
     @@ Documentation/technical/sparse-checkout.txt (new)
      +      * --[no-]restrict-to-sparse-paths
      +      * --full-tree/--sparse-tree
      +      * --[no-]restrict
     ++      * --scope={sparse,all}
     ++      * --focus/--unfocus
     ++      * --limit/--unlimited
      +    * Rationale making me lean slightly towards --[no-]restrict:
      +      * We want a name that works for many commands, so we need a name that
      +	does not conflict
     @@ Documentation/technical/sparse-checkout.txt (new)
      +	which would probably be even more ridiculously long.  (But we
      +	can make --ignore-skip-worktree-bits a deprecated alias for
      +	--no-restrict.)
     ++    * BUT, as others points out, --[no-]restrict isn't very clear about what
     ++      it's restricting nor does it automatically tie in to the concept of
     ++      "sparse-checkout" in the user's mind
      +
      +  * Should --[no-]restrict be a git global option, or added as options to each
      +    relevant command?  (Does that make sense given the multitude of different
     @@ Documentation/technical/sparse-checkout.txt (new)
      +    commands (add/rm/mv), but certainly not most the others.  Previous config
      +    suggestion here: [13]
      +
     -+  * Should --sparse in ls-files be made an alias for --restrict?
     -+    `--restrict` is certainly a near synonym in cone-mode, but even then
     -+    it's not quite the same.  In non-cone mode, ls-files' `--sparse`
     -+    option has no effect, and in cone-mode it still shows the sparse
     -+    directory entries which are technically outside the sparsity
     -+    specification.
     ++  * Is `--no-expand` a good alias for ls-files's `--sparse` option?
     ++    (`--sparse` does not map to either `--restrict` or `--no-restrict`,
     ++    because in non-cone mode it does nothing and in cone-mode it shows the
     ++    sparse directory entries which are technically outside the sparse
     ++    specification) Should `--restrict` be the default (does that imply that
     ++    `--no-expand` needs a `--no-restrict` or that it just partially
     ++    overrides it)?  Should `-t` imply `--no-restrict`?
      +
      +  * Should --ignore-skip-worktree-bits in checkout-index, checkout, and
      +    restore be made deprecated aliases for --no-restrict?  (They have the
     @@ Documentation/technical/sparse-checkout.txt (new)
      +    the reverse option is never wanted and the sole purpose of this
      +    option was to turn off a bug?)
      +
     -+  * sparse-checkout: once behavior A is fully implemented, should we
     -+    take an interim measure to easy people into switching the default?
     -+    Namely, if folks are not already in a sparse checkout, then require
     -+    `sparse-checkout init/set` to take a `--[no-]restrict` flag (which
     -+    would set core.restrictToSparse according to the setting given), and
     -+    throw an error if the flag is not provided?  That error would be a
     -+    great place to warn folks that the default may change in the future,
     -+    and get them used to specifying what they want so that the eventual
     -+    default switch is seamless for them.
     ++  * Should update-index be made like add/rm/mv with the restrict-or-error
     ++    default functionality?  If we do, should some flags like
     ++    --[no-]skip-worktree imply --no-restrict?
     ++
     ++  * Should `apply --index` preserve SKIP_WORKTREE bits for
     ++    non-conflicted files?  We normally like preserving those bits (and
     ++    it'd make git-am more like cherry-pick/rebase/merge), but `apply`
     ++    without `--index` should unconditionally clear them and it seems a
     ++    little weird for the addition of the `--index` flag to affect how
     ++    the working tree is treated.  On the other hand, `am` builds on
     ++    `apply --index` and it needs the SKIP_WORKTREE bits preserved for
     ++    non-conflicted files in order to behave like
     ++    cherry-pick/rebase/merge.
     ++
     ++  * sparse-checkout: once behavior A is fully implemented, should we take
     ++    an interim measure to ease people into switching the default?  Namely,
     ++    if folks are not already in a sparse checkout, then require
     ++    `sparse-checkout init/set` to take a `--set-[no-]restrict-mode` or
     ++    `--set-scope=(sparse|all)` flag (which would set core.restrictToSparse
     ++    according to the setting given), and throw an error if the flag is not
     ++    provided?  That error would be a great place to warn folks that the
     ++    default may change in the future, and get them used to specifying what
     ++    they want so that the eventual default switch is seamless for them.
      +
      +  * clone: should we provide some mechanism for tying partial clones and
      +    sparse checkouts together better.  Maybe an option
     @@ Documentation/technical/sparse-checkout.txt (new)
      +
      +=== Implementation Goals/Plans ===
      +
     ++ * Get buy-in on this document in general.
     ++
      + * Figure out answers to the 'Implementation Questions' sections (above)
      +
      + * Fix bugs in the 'Known bugs' section (below)
      +
     ++ [Below here is kind of spitballing since the first two haven't been resolved]
     ++
      + * update-index: flip the default to --no-ignore-skip-worktree-entries, possibly
      +   nuke this stupid "Oh, there's a bug?  Let me add a flag to let users request
      +   that they not trigger this bug." flag
      +
     -+  * Flags & Config
     -+    * Make `--sparse` in add/rm/mv a deprecated alias for `--no-restrict`
     -+    * Make `--ignore-skip-worktree-bits` in checkout-index/checkout/restore
     -+      a deprecated aliases for `--no-restrict`
     -+    * Create config option (core.restrictToSparsity?), note how it only
     -+      affects two classes of commands
     ++ * ls-files: add a --[no-]restrict flag for limiting tracked files listed to
     ++   the relevant subset.  (Plus more stuff after questions are answered.)
     ++
     ++ * Flags & Config
     ++   * Make `--sparse` in add/rm/mv a deprecated alias for `--no-restrict`
     ++   * Make `--ignore-skip-worktree-bits` in checkout-index/checkout/restore
     ++     a deprecated aliases for `--no-restrict`
     ++   * Create config option (core.restrictToSparsity?), note how it only
     ++     affects two classes of commands
      +
      + * Behavioral plans:
      +     add, rm, mv:
     -+	Behavior B: throw error if would have affected paths outside of sparsity.
     -+	Behavior A: throw error if would have only affected paths outside of sparsity.
     ++	Behavior B: throw error if would have affected paths outside of sparse
     ++		    specification.
     ++	Behavior A: throw error if would have *only* affected paths outside of
     ++		    sparse specification.
      +     grep (on history), diff (on history), log, etc:
      +	Behavior B: act on all paths (already implemented)
     -+	Behavior A: act on limited paths, maybe show stderr warning ("results limited")
     -+		    if selected via config rather than explicitly
     ++	Behavior A: act on limited paths, maybe show stderr warning ("results
     ++		    limited") if selected via config rather than explicitly
      +     other diff machinery:
     -+	make sure diff machinery changes don't mess with format-patch, fast-export, etc.
     ++	make sure diff machinery changes don't mess with format-patch,
     ++	fast-export, etc.
      +
      +  * Fix performance issues, such as
      +    https://lore.kernel.org/git/CABPp-BEkJQoKZsQGCYioyga_uoDQ6iBeW+FKr8JhyuuTMK1RDw@mail.gmail.com/
     @@ Documentation/technical/sparse-checkout.txt (new)
      +This list used to be a lot longer (see e.g. [1,2,3,4,5,6,7,8,9]), but we've
      +been working on it.
      +
     -+0. Behavior A is not well supported in Git.  (Behavior B didn't used to be either,
     -+   but was the easier of the two to implement.)
     ++0. Behavior A is not well supported in Git.  (Behavior B didn't used to
     ++   be either, but was the easier of the two to implement.)
      +
      +1. am and apply:
      +
     -+   am and apply rely on files being present in the working copy, and
     -+   also write to them unconditionally.  They should probably first check
     -+   for the files' presence, and if found to be SKIP_WORKTREE, then clear
     -+   the bit and vivify the paths, then do its work.
     ++   apply, without `--index` or `--cached`, relies on files being present
     ++   in the working copy, and also writes to them unconditionally.  As
     ++   such, it should first check for the files' presence, and if found to
     ++   be SKIP_WORKTREE, then clear the bit and vivify the paths, then do
     ++   its work.  Currently, it just throws an error.
     ++
     ++   apply, with either `--cached` or `--index`, will not preserve the
     ++   SKIP_WORKTREE bit.  This is fine if the file has conflicts, but
     ++   otherwise SKIP_WORKTREE bits should be preserved for --cached and
     ++   probably also for --index.
     ++
     ++   am, if there are no conflicts, will vivify files and fail to preserve
     ++   the SKIP_WORKTREE bit.  If there are conflicts and `-3` is not
     ++   specified, it will vivify files and then complain the patch doesn't
     ++   apply.  If there are conflicts and `-3` is specified, it will vivify
     ++   files and then complain that those vivified files would be
     ++   overwritten by merge.
      +
      +2. reset --hard:
      +
     @@ Documentation/technical/sparse-checkout.txt (new)
      +    `git reset --hard` DID remove addme from the index and the working tree, contrary
      +    to the error message, but in line with how reset --hard should behave.
      +
     -+3. Checkout, restore:
     ++3. read-tree
     ++
     ++   `read-tree` doesn't apply the 'SKIP_WORKTREE' bit to *any* of the
     ++   entries it reads into the index, resulting in all your files suddenly
     ++   appearing to be "deleted".
     ++
     ++4. Checkout, restore:
      +
      +   These command do not handle path & revision arguments appropriately:
      +
     @@ Documentation/technical/sparse-checkout.txt (new)
      +[9] (Move from out-of-cone to in-cone)
      +    https://lore.kernel.org/git/20220630023737.473690-6-shaoxuan.yuan02@gmail.com/
      +    https://lore.kernel.org/git/20220630023737.473690-4-shaoxuan.yuan02@gmail.com/
     -+[10] (Unnecessarily downloading objects outside sparsity specification)
     ++[10] (Unnecessarily downloading objects outside sparse specification)
      +     https://lore.kernel.org/git/CAOLTT8QfwOi9yx_qZZgyGa8iL8kHWutEED7ok_jxwTcYT_hf9Q@mail.gmail.com/
      +
      +[11] (Stolee's comments on high-level usecases)
     @@ Documentation/technical/sparse-checkout.txt (new)
      +[13] Previous config name suggestion and description
      +  * https://lore.kernel.org/git/CABPp-BE6zW0nJSStcVU=_DoDBnPgLqOR8pkTXK3dW11=T01OhA@mail.gmail.com/
      +
     -+[14] Tangential issue: switch to cone mode as default sparsity specification mechanism:
     ++[14] Tangential issue: switch to cone mode as default sparse specification mechanism:
      +  https://lore.kernel.org/git/a1b68fd6126eb341ef3637bb93fedad4309b36d0.1650594746.git.gitgitgadget@gmail.com/
      +
      +[15] Lengthy email on grep behavior, covering what should be searched:
      +  * https://lore.kernel.org/git/CABPp-BGVO3QdbfE84uF_3QDF0-y2iHHh6G5FAFzNRfeRitkuHw@mail.gmail.com/
     ++
     ++[16] Email explaining sparsity patterns vs. SKIP_WORKTREE and history operations,
     ++     search for the parenthetical comment starting "We do not check".
     ++    https://lore.kernel.org/git/CABPp-BFsCPPNOZ92JQRJeGyNd0e-TCW-LcLyr0i_+VSQJP+GCg@mail.gmail.com/
     ++
     ++[17] "I'm not even sure if we would want to make this available via a
     ++     config setting"
     ++   and
     ++     "But I also want to avoid doing this as a default or even behind a
     ++     config setting"
     ++   respectively, from:
     ++     https://lore.kernel.org/git/1a1e33f6-3514-9afc-0a28-5a6b85bd8014@gmail.com/
     ++     https://lore.kernel.org/git/07a25d48-e364-0d9b-6ffa-41a5984eb5db@github.com/


 Documentation/technical/sparse-checkout.txt | 938 ++++++++++++++++++++
 1 file changed, 938 insertions(+)
 create mode 100644 Documentation/technical/sparse-checkout.txt

diff --git a/Documentation/technical/sparse-checkout.txt b/Documentation/technical/sparse-checkout.txt
new file mode 100644
index 00000000000..408f66eaba6
--- /dev/null
+++ b/Documentation/technical/sparse-checkout.txt
@@ -0,0 +1,938 @@
+Table of contents:
+
+  * Terminology
+  * Purpose of sparse-checkouts
+  * Desired behavior
+  * Behavior classes
+  * Subcommand-dependent defaults
+  * Sparse specification vs. sparsity patterns
+  * Implementation Questions
+  * Implementation Goals/Plans
+  * Known bugs
+  * Reference Emails
+
+
+=== Terminology ===
+
+cone mode: one of two modes for specifying the desired subset of files
+	in a sparse-checkout.  In cone-mode, the user specifies
+	directories (getting both everything under that directory as
+	well as everything in leading directories), while in non-cone
+	mode, the user specifies gitignore-style patterns.  Controlled
+	by the --[no-]cone option to sparse-checkout init|set.
+
+SKIP_WORKTREE: When tracked files do not match the sparse specification and
+	are removed from the working tree, the file in the index is marked
+	with a SKIP_WORKTREE bit.  Note that if a tracked file has the
+	SKIP_WORKTREE bit set but is later written by the user to the
+	working tree anyway, the SKIP_WORKTREE bit will be cleared at the
+	beginning of any Git operation.
+
+	Most sparse checkout users are unaware of this implementation
+	detail, and the term should generally be avoided in user-facing
+	descriptions and command flags.  Unfortunately, prior to the
+	`sparse-checkout` subcommand these low-level details were exposed,
+	and as of time of writing, still are in various places.
+
+sparse-checkout: a subcommand in git used to reduce the files present in
+	the working tree to a subset of all tracked files.  Also, the
+	name of the file in the $GIT_DIR/info directory used to track
+	the sparsity patterns corresponding to the user's desired
+	subset.
+
+sparse cone: see cone mode
+
+sparse directory: An entry in the index corresponding to a directory
+	rather, and used to replace all files under that directory that
+	would normally appear in the index.  See also sparse-index.
+	Something that can cause confusion is that the "sparse
+	directory" does NOT match the sparse specification, i.e. the
+	directory is NOT present in the working tree.
+
+sparse index: A special mode for sparse-checkout that also makes the
+	index sparse by recording a directory entry in lieu of all the
+	files underneath that directory.  Controlled by the
+	--[no-]sparse-index option to init|set|reapply.  See also
+	"sparse directory".
+
+sparsity patterns: patterns from $GIT_DIR/info/sparse-checkout used to
+	define the set of files of interest.  A warning: It is easy to
+	over-use this term (or the shortened "patterns" term), for two
+	reasons (1) users in cone mode specify directories rather
+	than patterns (their directories are transformed into patterns,
+	but users may think you are talking about non-cone mode if you
+	use the word "patterns"), and (b) the sparse specification might
+	transiently differ in the working tree from the sparsity
+	patterns (see "Sparse specification vs. sparsity patterns").
+
+sparse specification: The set of paths in the user's area of focus.  When
+	interacting with the working tree, this is the set of tracked files
+	present in the working copy or with a clear SKIP_WORKTREE bit.
+	When working with history, this is the set of files matching the
+	sparsity patterns.  Usually the tracked files present in the
+	working copy are precisely the set of tracked files matching
+	sparsity patterns, but they can temporarily differ.  (See also
+	"Sparse specification vs. sparsity patterns")
+
+vivifying: When a command restores a tracked file to the working tree
+	(and clearing the SKIP_WORKTREE bit in the index), this is
+	referred to as "vivifying" the file.
+
+
+=== Purpose of sparse-checkouts ===
+
+sparse-checkouts exist to allow users to work with a subset of their
+files.
+
+You can think of sparse-checkouts as subdividing "tracked" files into
+two categories -- a sparse subset, and all the rest.
+Implementationally, we mark "all the rest" with SKIP_WORKTREE.  The
+SKIP_WORKTREE files are still tracked, just not present in the working
+tree.
+
+In the past, sparse-checkouts were defined by "SKIP_WORKTREE means the file
+is missing from the working tree but pretend the file matches HEAD".  That
+was a low-level detail which provided decent behavior for a few commands,
+but which had a surprising number of ways in which it violated user
+expectations and was a bad mental model.  However, it persisted for many
+years and may still be found in some corners of the code base.
+
+Anyway, the idea of "working with a subset of files" is simple enough, but
+there are two different high-level usecases which affect how some Git
+subcommands should behave.  Further, even if we only considered one of
+those usecases, sparse-checkouts modify different subcommands in over a
+half dozen different ways.  Let's start by considering the high level
+usecases:
+
+  A) Users are _only_ interested in the sparse portion of the repo
+
+  B) Users want a sparse working tree, but are working in a larger whole
+
+It may be worth explaining both of these in a bit more detail:
+
+  (Behavior A) Users are _only_ interested in the sparse portion of the repo
+
+These folks might know there are other things in the repository, but
+don't care.  They are uninterested in other parts of the repository, and
+only want to know about changes within their area of interest.  Showing
+them other results from history (e.g. from diff/log/grep/etc.) is a
+usability annoyance, potentially a huge one since other changes in
+history may dwarf the changes they are interested in.
+
+Some of these users also arrive at this usecase from wanting to use
+partial clones together with sparse checkouts and do disconnected
+development.  Not only do these users generally not care about other
+parts of the repository, but consider it a blocker for Git commands to
+try to operate on those.  If commands attempt to access paths in history
+outside the sparsity specification, then the partial clone will attempt
+to download additional blobs on demand, fail, and then fail the user's
+command.  (This may be unavoidable in some cases, e.g. when `git merge`
+has non-trivial changes to reconcile outside the sparse specification,
+but we should limit how often users are forced to connect to the
+network.)
+
+Also, even for users using partial clones that do not mind being
+always connected to the network, the need to download blobs as
+side-effects of various other commands (such as the printed diffstat
+after a merge or pull) can lead to worries about local repository size
+growing unnecessarily[10].
+
+  (Behavior B) Users want a sparse working tree, but are working in a larger whole
+
+Stolee described this usecase this way[11]:
+
+"I'm also focused on users that know that they are a part of a larger
+whole. They know they are operating on a large repository but focus on
+what they need to contribute their part. I expect multiple "roles" to
+use very different, almost disjoint parts of the codebase. Some other
+"architect" users operate across the entire tree or hop between different
+sections of the codebase as necessary. In this situation, I'm wary of
+scoping too many features to the sparse-checkout definition, especially
+"git log," as it can be too confusing to have their view of the codebase
+depend on your "point of view."
+
+People might also end up wanting behavior B due to complex inter-project
+dependencies.  The initial attempts to use sparse-checkouts usually
+involve the directories you are directly interested in plus what those
+directories depend upon within your repository.  But there's a monkey
+wrench here: if you have integration tests, they invert the hierarchy:
+to run integration tests, you need not only what you are interested in
+and its dependencies, you also need everything that depends upon what
+you are interested in or that depends upon one of your
+dependencies...AND you need all the dependencies of that expanded group.
+That can easily change your sparse-checkout into a nearly dense one.
+
+Naturally, that tends to kill the benefits of sparse-checkouts.  There
+are a couple solutions to this conundrum: either avoid grabbing
+dependencies (maybe have built versions of your dependencies pulled from
+a CI cache somewhere), or say that users shouldn't run integration tests
+directly and instead do it on the CI server when they submit a code
+review.  Or do both.  Regardless of whether you stub out your
+dependencies or stub out the things that depend upon you, there is
+certainly a reason to want to query and be aware of those other
+stubbed-out parts of the repository, particularly when the dependencies
+are complex or change relatively frequently.  Thus, for such uses,
+sparse-checkouts can be used to limit what you directly build and
+modify, but these users do not necessarily want their sparse checkout
+paths to limit their queries of history.
+
+Some people may also be interested in behavior B simply as a performance
+workaround: if they are using non-cone mode, then they have to deal with
+its inherent quadratic performance problems.  In that mode, every
+operation that checks whether paths match the sparsity specification can
+be expensive.  As such, these users may only be willing to pay for those
+expensive checks when interacting with the working copy, and may prefer
+getting "unrelated" results from their history queries over having slow
+commands.
+
+
+=== Desired behavior ===
+
+As noted in the previous section, despite the simple idea of just
+working with a subset of files, there are a range of different
+behavioral changes that need to be made to different subcommands to work
+well with such a feature.  See [1,2,3,4,5,6,7,8,9,10] for various
+examples.  In particular, at [2], we saw that mere composition of other
+commands that individually worked correctly in a sparse-checkout context
+did not imply that the higher level command would work correctly; it
+sometimes requires further tweaks.  So, understanding these differences
+can be beneficial.
+
+* Commands behaving the same regardless of high-level use-case
+
+  * commands that only look at files within the sparsity specification
+
+      * diff (without --cached or REVISION arguments)
+      * grep (without --cached or REVISION arguments)
+      * diff-files
+
+  * commands that restore files to the working tree that match sparsity
+    patterns, and remove unmodified files that don't match those
+    patterns:
+
+      * switch
+      * checkout (the switch-like half)
+      * read-tree
+      * reset --hard
+
+      * `restore` & the restore-like half of `checkout` SHOULD be in this above
+	category, but are buggy (see the "Known bugs" section below)
+
+  * commands that write conflicted files to the working tree, but otherwise will
+    omit writing files that do not match the sparsity patterns:
+
+      * merge
+      * rebase
+      * cherry-pick
+      * revert
+
+      * `am` and `apply --index` should probably be in this section but are buggy
+	(see the "Known bugs" section below)
+
+    Note that this somewhat depends upon the merge strategy being used:
+      * `ort` behaves as described above
+      * `recursive` tries to not vivify files unnecessarily, but does sometimes
+	vivify files without conflicts.
+      * `octopus` and `resolve` will always vivify any file changed in the merge
+	relative to the first parent, which is rather suboptimal.
+
+  * commands that always ignore sparsity since commits must be full-tree
+
+      * archive
+      * bundle
+      * commit
+      * format-patch
+      * fast-export
+      * fast-import
+      * commit-tree
+
+  * commands that write any modified file to the working tree (conflicted or not,
+    and whether those paths match sparsity patterns or not):
+
+      * stash
+      * apply (without `--index` or `--cached`)
+
+* Commands that differ for behavior A vs. behavior B:
+
+  * commands that make modifications to which files are tracked:
+      * add
+      * rm
+      * mv
+      * update-index
+
+    The fact that files can move between the 'tracked' and 'untracked'
+    categories means some commands will have to treat untracked files
+    differently.  But if we have to treat untracked files differently,
+    then additional commands may also need changes:
+
+      * status
+      * clean
+
+    In particular, `status` may need to report any untracked files outside
+    the sparsity specification as an erroneous condition (especially to
+    avoid the user trying to `git add` them, forcing `git add` to display
+    an error).
+
+    It's not clear to me exactly how (or if `clean` would change, but it's
+    the other command that also affects untracked files.
+
+    `update-index` may be slightly special.  Its --[no-]skip-worktree flag
+    may need to ignore the sparse specification by its nature.  Also, its
+    current --[no-]ignore-skip-worktree-entries default is totally bogus.
+
+  * commands that query history
+      * diff (with --cached or REVISION arguments)
+      * grep (with --cached or REVISION arguments)
+      * show (when given commit arguments)
+      * bisect
+      * blame (only matters when one or more -C flags passed)
+	* and annotate
+      * log
+	* and variants: shortlog, gitk, show-branch, whatchanged, rev-list
+      * ls-files
+      * diff-index
+      * diff-tree
+      * ls-tree
+
+    ls-files may be slightly special in that e.g. `git ls-files -t` is
+    often used to see what is sparse and what is not.  Perhaps -t should
+    always work on the full tree?
+
+* Commands I don't know how to classify
+
+  * range-diff
+
+    Is this like `log` or `format-patch`?
+
+  * cherry
+
+    See range-diff
+
+* Commands unaffected by sparse-checkouts
+
+  * branch
+  * describe
+  * fetch
+  * gc
+  * init
+  * maintenance
+  * notes
+  * pull (merge & rebase have the necessary changes)
+  * push
+  * submodule
+  * tag
+
+  * config
+  * filter-branch (works in separate checkout without sparse-checkout setup)
+  * pack-refs
+  * prune
+  * remote
+  * repack
+  * replace
+
+  * bugreport
+  * count-objects
+  * fsck
+  * gitweb
+  * help
+  * instaweb
+  * merge-tree (doesn't touch worktree or index, and merges always compute full-tree)
+  * rerere
+  * verify-commit
+  * verify-tag
+
+  * commit-graph
+  * hash-object
+  * index-pack
+  * mktag
+  * mktree
+  * multi-pack-index
+  * pack-objects
+  * prune-packed
+  * symbolic-ref
+  * unpack-objects
+  * update-ref
+  * write-tree (operates on index, possibly optimized to use sparse dir entries)
+
+  * for-each-ref
+  * get-tar-commit-id
+  * ls-remote
+  * merge-base (merges are computed full tree, so merge base should be too)
+  * name-rev
+  * pack-redundant
+  * rev-parse
+  * show-index
+  * show-ref
+  * unpack-file
+  * var
+  * verify-pack
+
+  * <Everything under 'Interacting with Others' in 'git help --all'>
+  * <Everything under 'Low-level...Syncing' in 'git help --all'>
+  * <Everything under 'Low-level...Internal Helpers' in 'git help --all'>
+  * <Everything under 'External commands' in 'git help --all'>
+
+* Commands that might be affected, but who cares?
+
+  * merge-file
+  * merge-index
+
+
+=== Behavior classes ====
+
+From the above there are a few classes of behavior:
+
+  * "restrict"
+
+    Commands in this class only read or write files within the sparse
+    specification.  Some of these commands may also attempt, at the end of
+    their operation, to cull transient differences between the sparse
+    specification and the sparsity patterns (see "Sparse specification
+    vs. sparsity patterns" for details, but this basically means either
+    removing unmodified files not matching the sparsity patterns and
+    marking those files as SKIP_WORKTREE, or vivifying files that match the
+    sparsity patterns and marking those files as !SKIP_WORKTREE).
+
+  * "restrict modulo conflicts"
+
+    Commands in this class generally behave like the "restrict" class,
+    except that:
+      (1) they ignore the sparse specification in terms of updates to the
+	  index, though they'll preserve or update the SKIP_WORKTREE bit
+	  for files as needed to follow the sparsity patterns.
+      (2) they will ignore the sparse specification and write files with
+	  conflicts to the working tree (thus temporarily expanding the
+	  sparse specification to include such files.)
+
+  * "restrict also specially applied to untracked files"
+
+    Commands in this class generally behave like the "restrict" class,
+    except that they have to handle untracked files differently too, often
+    because these commands are dealing with files changing state between
+    'tracked' and 'untracked'.  Often, this may mean printing an error
+    message if the command had nothing to do, but the arguments may have
+    referred to files whose tracked-ness state could have changed were it
+    not for the sparsity patterns excluding them.
+
+  * "no restrict"
+
+    Commands in this class ignore the sparse specification entirely.
+
+  * "restrict or no restrict dependent upon behavior A vs. behavior B"
+
+    Commands in this class behave like "no restrict" for folks in the
+    behavior B camp, and like "restrict" for folks in the behavior A camp.
+    However, when behaving like "restrict" a warning of some sort might be
+    provided that history queries have been limited by the sparse-checkout
+    specification.
+
+
+=== Subcommand-dependent defaults ===
+
+Note that we have different defaults depending on the command for the
+desired behavior :
+
+  * Commands defaulting to "restrict":
+    * status
+    * diff (without --cached or REVISION arguments)
+    * grep (without --cached or REVISION arguments)
+    * switch
+    * checkout (the switch-like half)
+    * read-tree
+    * reset (--hard)
+    * restore/checkout
+    * checkout-index
+    * diff-files
+
+    This behavior makes sense; these interact with the working tree.
+
+  * Commands defaulting to "restrict modulo conflicts":
+    * merge
+    * rebase
+    * cherry-pick
+    * revert
+
+    * am
+    * apply --index
+
+    These also interact with the working tree, but require slightly different
+    behavior so that conflicts can be resolved.
+
+    (See also the "Known bugs" section below regarding `am` and `apply`)
+
+  * Commands defaulting to "no restrict":
+    * archive
+    * bundle
+    * commit
+    * format-patch
+    * fast-export
+    * fast-import
+    * commit-tree
+
+    * stash
+    * apply (without `--index`)
+
+    These have completely different defaults and perhaps deserve the most
+    detailed explanation:
+
+    In the case of commands in the first group (format-patch,
+    fast-export, bundle, archive, etc.), these are commands for
+    communicating history, which will be broken if they restrict to a
+    subset of the repository.  As such, they operate on full paths and
+    have no `--restrict` option for overriding.  Some of these commands may
+    take paths for manually restricting what is exported, but it needs to
+    be very explicit.
+
+    In the case of stash, it needs to vivify files to avoid losing the
+    user's changes.
+
+    In the case of apply without `--index`, that command needs to update
+    the working tree without the index (or the index without the working
+    tree if `--cached` is passed), and if we restrict those updates to the
+    sparse specification then we'll lose changes from the user.
+
+  * Commands defaulting to "restrict also specially applied to untracked files":
+    * add
+    * rm
+    * mv
+
+    Our original implementation for these commands was "no restrict", but
+    it had some severe usability issues:
+      * `git add <somefile>` if honored and outside the sparse
+	specification, can result in the file randomly disappearing later
+	when some subsequent command is run (since various commands
+	automatically clean up unmodified files outside the sparse
+	specification).
+      * `git rm '*.jpg'` could very negatively surprise users if it deletes
+	files outside the range of the user's interest.
+      * `git mv` has similar surprises when moving into or out of the cone,
+	so best to restrict by default
+
+    So, we switched `add` and `rm` to default to "restrict", which made
+    usability problems much less severe and less frequent, but we still got
+    complaints because commands like:
+	git add <file-outside-sparse-specification>
+	git rm <file-outside-sparse-specification>
+    would silently do nothing.  We should instead print an error in those
+    cases to get usability right.
+
+    There may be a difference in here between behavior A and behavior B in
+    terms of verboseness of errors or additional warnings.
+
+  * Commands falling under "restrict or no restrict dependent upon behavior
+    A vs. behavior B"
+
+    * diff (with --cached or REVISION arguments)
+    * grep (with --cached or REVISION arguments)
+    * show (when given commit arguments)
+    * bisect
+    * blame (only matters when one or more -C flags passed)
+      * and annotate
+    * log
+      * and variants: shortlog, gitk, show-branch, whatchanged, rev-list
+    * ls-files
+    * diff-index
+    * diff-tree
+    * ls-tree
+
+    For now, we default to behavior B for these, which want a default of
+    "no restrict".
+
+    Note that two of these commands -- diff and grep -- also appeared in a
+    different list with a default of "restrict", but only when limited to
+    searching the working tree.  The working tree vs. history distinction
+    is fundamental in how behavior B operates, so this is expected.
+
+    "restrict" may make more sense as the long term default for these[12],
+    though Stolee seems to have some reservations[17].  Also, supporting
+    "restrict" for these commands might be a fair amount of work to
+    implement, meaning it might be implemented over multiple releases.  If
+    that behavior were the default in the commands that supported it, that
+    would force behavior B users to need to learn to slowly add additional
+    flags to their commands, depending on git version, to get the behavior
+    they want.  That gradual switchover would be painful, so we should
+    avoid it at least until it's fully implemented.
+
+
+=== Sparse specification vs. sparsity patterns ===
+
+In a well-behaved situation, the sparse specification is given directly
+by the $GIT_DIR/info/sparse-checkout file.  However, it can transiently
+diverge for a few reasons:
+
+    * needing to resolve conflicts (merging will vivify conflicted files)
+    * running Git commands that implicitly vivify files (e.g. "git stash apply")
+    * running Git commands that explicitly vivify files (e.g. "git checkout
+      --ignore-skip-worktree-bits FILENAME")
+    * other commands that write to these files (perhaps a user copies it
+      from elsewhere)
+
+For the last item, note that we do automatically clear the SKIP_WORKTREE
+bit for files that are present in the working tree.  This has been true
+since 82386b4496 ("Merge branch 'en/present-despite-skipped'",
+2022-03-09)
+
+However, such a situation is transient because:
+
+   * Such transient differences can and will be automatically removed as
+     a side-effect of commands which call unpack_trees() (checkout,
+     merge, reset, etc.).
+   * Users can also request such transient differences be corrected via
+     running `git sparse-checkout reapply`.  Various places recommand
+     running that command.
+   * Additional commands are also welcome to implicitly fix these
+     differences; we may add more in the future.
+
+While we avoid dropping unstaged changes or files which have conflicts,
+we otherwise aggressively try to fix these transient differences.  If
+users want these differences to persist, they should run the `set` or
+`add` subcommands of `git sparse-checkout` to reflect their intended
+sparse specification.
+
+However, when we need to do a query on history restricted to the
+"relevant subset of files" such a transiently expanded sparse
+specification is ignored.  There are a couple reasons for this:
+
+   * The behavior wanted when doing something like
+	 git grep expression REVISION
+     is roughly what the users would expect from
+	 git checkout REVISION && git grep expression
+     (modulo a "REVISION:" prefix), which has a couple ramifications:
+
+   * REVISION may have paths not in the current index, so there is no
+     path we can consult for a SKIP_WORKTREE setting for those paths.
+
+   * Since `checkout` is one of those commands that tries to remove
+     transient differences in the sparse specification, it makes sense
+     to use the corrected sparse specification
+     (i.e. $GIT_DIR/info/sparse-checkout) rather than attempting to
+     consult SKIP_WORKTREE anyway.
+
+So, a transiently expanded (or restricted) sparse specification applies to
+the working tree, but not to history history queries where we always use
+the sparsity patterns.  (See [16] for an early discussion of this.)
+
+Similar to a transiently expanded sparse specification of the working tree
+based on additional files being present in the working tree, we could also
+consider the concept of a transiently expanded sparse specification for the
+index.  In particular, if the user has staged changes to files that do not
+match the sparsity patterns, and the file is not present in the working
+tree, we may still want to consider the file part of the sparse
+specification if we are specifically performing a query related to the
+index (e.g. git diff REVISION, git diff-index REVISION, etc.)
+
+
+=== Implementation Questions ===
+
+  * Does the name --[no-]restrict sound good to others?  Are there better
+    options?
+    * Names in use, or appearing in patches, or previously suggested:
+      * --sparse/--dense
+      * --ignore-skip-worktree-bits
+      * --ignore-skip-worktree-entries
+      * --ignore-sparsity
+      * --[no-]restrict-to-sparse-paths
+      * --full-tree/--sparse-tree
+      * --[no-]restrict
+      * --scope={sparse,all}
+      * --focus/--unfocus
+      * --limit/--unlimited
+    * Rationale making me lean slightly towards --[no-]restrict:
+      * We want a name that works for many commands, so we need a name that
+	does not conflict
+      * --[no-]restrict isn't overly long and seems relatively explanatory
+      * `--sparse`, as used in add/rm/mv, is totally backwards for
+	grep/log/etc.  Changing the meaning of `--sparse` for these
+	commands would fix the backwardness, but possibly break existing
+	scripts.  Using a new name pairing would allow us to treat
+	`--sparse` in these commands as a deprecated alias.
+      * There is a different `--sparse`/`--dense` pair for commands using
+	revision machinery, so using that naming might cause confusion
+      * There is also a `--sparse` in both pack-objects and show-branch, which
+	don't conflict but do suggest that `--sparse` is overloaded
+      * The name --ignore-skip-worktree-bits is a double negative, is
+	quite a mouthful, refers to an implementation detail that many
+	users may not be familiar with, and we'd need a negation for it
+	which would probably be even more ridiculously long.  (But we
+	can make --ignore-skip-worktree-bits a deprecated alias for
+	--no-restrict.)
+    * BUT, as others points out, --[no-]restrict isn't very clear about what
+      it's restricting nor does it automatically tie in to the concept of
+      "sparse-checkout" in the user's mind
+
+  * Should --[no-]restrict be a git global option, or added as options to each
+    relevant command?  (Does that make sense given the multitude of different
+    default behaviors we have for different options?)
+
+  * If a config option is added (core.restrictToSparsity?) what should
+    the values and description be?  There's a risk of confusion, because
+    we only want this config option to affect the history-querying
+    commands (log/diff/grep) and maybe the path-modifying worktree
+    commands (add/rm/mv), but certainly not most the others.  Previous config
+    suggestion here: [13]
+
+  * Is `--no-expand` a good alias for ls-files's `--sparse` option?
+    (`--sparse` does not map to either `--restrict` or `--no-restrict`,
+    because in non-cone mode it does nothing and in cone-mode it shows the
+    sparse directory entries which are technically outside the sparse
+    specification) Should `--restrict` be the default (does that imply that
+    `--no-expand` needs a `--no-restrict` or that it just partially
+    overrides it)?  Should `-t` imply `--no-restrict`?
+
+  * Should --ignore-skip-worktree-bits in checkout-index, checkout, and
+    restore be made deprecated aliases for --no-restrict?  (They have the
+    same meaning.)
+
+  * Should --ignore-skip-worktree-entries in update-index be made a
+    deprecated alias for --no-restrict?  (Or, better yet, should the
+    option just be nuked from orbit after flipping the default, since
+    the reverse option is never wanted and the sole purpose of this
+    option was to turn off a bug?)
+
+  * Should update-index be made like add/rm/mv with the restrict-or-error
+    default functionality?  If we do, should some flags like
+    --[no-]skip-worktree imply --no-restrict?
+
+  * Should `apply --index` preserve SKIP_WORKTREE bits for
+    non-conflicted files?  We normally like preserving those bits (and
+    it'd make git-am more like cherry-pick/rebase/merge), but `apply`
+    without `--index` should unconditionally clear them and it seems a
+    little weird for the addition of the `--index` flag to affect how
+    the working tree is treated.  On the other hand, `am` builds on
+    `apply --index` and it needs the SKIP_WORKTREE bits preserved for
+    non-conflicted files in order to behave like
+    cherry-pick/rebase/merge.
+
+  * sparse-checkout: once behavior A is fully implemented, should we take
+    an interim measure to ease people into switching the default?  Namely,
+    if folks are not already in a sparse checkout, then require
+    `sparse-checkout init/set` to take a `--set-[no-]restrict-mode` or
+    `--set-scope=(sparse|all)` flag (which would set core.restrictToSparse
+    according to the setting given), and throw an error if the flag is not
+    provided?  That error would be a great place to warn folks that the
+    default may change in the future, and get them used to specifying what
+    they want so that the eventual default switch is seamless for them.
+
+  * clone: should we provide some mechanism for tying partial clones and
+    sparse checkouts together better.  Maybe an option
+	--sparse=dir1,dir2,...,dirN
+    which:
+       * Does initial fetch with `--filter=blob:none`
+       * Does the `sparse-checkout set --cone dir1 dir2 ... dirN` thing
+       * Runs a `git rev-list --objects --all -- dir1 dir2 ... dirN` to
+	 fault in the missing blobs within the sparse
+	 specification...except that rev-list needs some kind of options
+	 to also get files from leading directories too.
+       * Sets --restrict mode to allow focusing on the cone of interest
+	 (and to permit disconnected development)
+
+
+=== Implementation Goals/Plans ===
+
+ * Get buy-in on this document in general.
+
+ * Figure out answers to the 'Implementation Questions' sections (above)
+
+ * Fix bugs in the 'Known bugs' section (below)
+
+ [Below here is kind of spitballing since the first two haven't been resolved]
+
+ * update-index: flip the default to --no-ignore-skip-worktree-entries, possibly
+   nuke this stupid "Oh, there's a bug?  Let me add a flag to let users request
+   that they not trigger this bug." flag
+
+ * ls-files: add a --[no-]restrict flag for limiting tracked files listed to
+   the relevant subset.  (Plus more stuff after questions are answered.)
+
+ * Flags & Config
+   * Make `--sparse` in add/rm/mv a deprecated alias for `--no-restrict`
+   * Make `--ignore-skip-worktree-bits` in checkout-index/checkout/restore
+     a deprecated aliases for `--no-restrict`
+   * Create config option (core.restrictToSparsity?), note how it only
+     affects two classes of commands
+
+ * Behavioral plans:
+     add, rm, mv:
+	Behavior B: throw error if would have affected paths outside of sparse
+		    specification.
+	Behavior A: throw error if would have *only* affected paths outside of
+		    sparse specification.
+     grep (on history), diff (on history), log, etc:
+	Behavior B: act on all paths (already implemented)
+	Behavior A: act on limited paths, maybe show stderr warning ("results
+		    limited") if selected via config rather than explicitly
+     other diff machinery:
+	make sure diff machinery changes don't mess with format-patch,
+	fast-export, etc.
+
+  * Fix performance issues, such as
+    https://lore.kernel.org/git/CABPp-BEkJQoKZsQGCYioyga_uoDQ6iBeW+FKr8JhyuuTMK1RDw@mail.gmail.com/
+
+
+=== Known bugs ===
+
+This list used to be a lot longer (see e.g. [1,2,3,4,5,6,7,8,9]), but we've
+been working on it.
+
+0. Behavior A is not well supported in Git.  (Behavior B didn't used to
+   be either, but was the easier of the two to implement.)
+
+1. am and apply:
+
+   apply, without `--index` or `--cached`, relies on files being present
+   in the working copy, and also writes to them unconditionally.  As
+   such, it should first check for the files' presence, and if found to
+   be SKIP_WORKTREE, then clear the bit and vivify the paths, then do
+   its work.  Currently, it just throws an error.
+
+   apply, with either `--cached` or `--index`, will not preserve the
+   SKIP_WORKTREE bit.  This is fine if the file has conflicts, but
+   otherwise SKIP_WORKTREE bits should be preserved for --cached and
+   probably also for --index.
+
+   am, if there are no conflicts, will vivify files and fail to preserve
+   the SKIP_WORKTREE bit.  If there are conflicts and `-3` is not
+   specified, it will vivify files and then complain the patch doesn't
+   apply.  If there are conflicts and `-3` is specified, it will vivify
+   files and then complain that those vivified files would be
+   overwritten by merge.
+
+2. reset --hard:
+
+   reset --hard provides confusing error message (works correctly, but
+   misleads the user into believing it didn't):
+
+    $ touch addme
+    $ git add addme
+    $ git ls-files -t
+    H addme
+    H tracked
+    S tracked-but-maybe-skipped
+    $ git reset --hard                           # usually works great
+    error: Path 'addme' not uptodate; will not remove from working tree.
+    HEAD is now at bdbbb6f third
+    $ git ls-files -t
+    H tracked
+    S tracked-but-maybe-skipped
+    $ ls -1
+    tracked
+
+    `git reset --hard` DID remove addme from the index and the working tree, contrary
+    to the error message, but in line with how reset --hard should behave.
+
+3. read-tree
+
+   `read-tree` doesn't apply the 'SKIP_WORKTREE' bit to *any* of the
+   entries it reads into the index, resulting in all your files suddenly
+   appearing to be "deleted".
+
+4. Checkout, restore:
+
+   These command do not handle path & revision arguments appropriately:
+
+    $ ls
+    tracked
+    $ git ls-files -t
+    H tracked
+    S tracked-but-maybe-skipped
+    $ git status --porcelain
+    $ git checkout -- '*skipped'
+    error: pathspec '*skipped' did not match any file(s) known to git
+    $ git ls-files -- '*skipped'
+    tracked-but-maybe-skipped
+    $ git checkout HEAD -- '*skipped'
+    error: pathspec '*skipped' did not match any file(s) known to git
+    $ git ls-tree HEAD | grep skipped
+    100644 blob 276f5a64354b791b13840f02047738c77ad0584f	tracked-but-maybe-skipped
+    $ git status --porcelain
+    $ git checkout HEAD~1 -- '*skipped'
+    $ git ls-files -t
+    H tracked
+    H tracked-but-maybe-skipped
+    $ git status --porcelain
+    M  tracked-but-maybe-skipped
+    $ git checkout HEAD -- '*skipped'
+    $ git status --porcelain
+    $
+
+    Note that checkout without a revision (or restore --staged) fails to
+    find a file to restore from the index, even though ls-files shows
+    such a file certainly exists.
+
+    Similar issues occur with HEAD (--source=HEAD in restore's case),
+    but suddenly works when HEAD~1 is specified.  And then after that it
+    will work with HEAD specified, even though it didn't before.
+
+    Directories are also an issue:
+
+    $ git sparse-checkout set nomatches
+    $ git status
+    On branch main
+    You are in a sparse checkout with 0% of tracked files present.
+
+    nothing to commit, working tree clean
+    $ git checkout .
+    error: pathspec '.' did not match any file(s) known to git
+    $ git checkout HEAD~1 .
+    Updated 1 path from 58916d9
+    $ git ls-files -t
+    S tracked
+    H tracked-but-maybe-skipped
+
+
+=== Reference Emails ===
+
+Emails that detail various bugs we've had in sparse-checkout:
+
+[1] (Original descriptions of behavior A & behavior B)
+    https://lore.kernel.org/git/CABPp-BGJ_Nvi5TmgriD9Bh6eNXE2EDq2f8e8QKXAeYG3BxZafA@mail.gmail.com/
+[2] (Fix stash applications in sparse checkouts; bugs from behavioral differences)
+    https://lore.kernel.org/git/ccfedc7140dbf63ba26a15f93bd3885180b26517.1606861519.git.gitgitgadget@gmail.com/
+[3] (Present-despite-skipped entries)
+    https://lore.kernel.org/git/11d46a399d26c913787b704d2b7169cafc28d639.1642175983.git.gitgitgadget@gmail.com/
+[4] (Clone --no-checkout interaction)
+    https://lore.kernel.org/git/pull.801.v2.git.git.1591324899170.gitgitgadget@gmail.com/ (clone --no-checkout)
+[5] (The need for update_sparsity() and avoiding `read-tree -mu HEAD`)
+    https://lore.kernel.org/git/3a1f084641eb47515b5a41ed4409a36128913309.1585270142.git.gitgitgadget@gmail.com/
+[6] (SKIP_WORKTREE is advisory, not mandatory)
+    https://lore.kernel.org/git/844306c3e86ef67591cc086decb2b760e7d710a3.1585270142.git.gitgitgadget@gmail.com/
+[7] (`worktree add` should copy sparsity settings from current worktree)
+    https://lore.kernel.org/git/c51cb3714e7b1d2f8c9370fe87eca9984ff4859f.1644269584.git.gitgitgadget@gmail.com/
+[8] (Avoid negative surprises in add, rm, and mv)
+    https://lore.kernel.org/git/cover.1617914011.git.matheus.bernardino@usp.br/
+    https://lore.kernel.org/git/pull.1018.v4.git.1632497954.gitgitgadget@gmail.com/
+[9] (Move from out-of-cone to in-cone)
+    https://lore.kernel.org/git/20220630023737.473690-6-shaoxuan.yuan02@gmail.com/
+    https://lore.kernel.org/git/20220630023737.473690-4-shaoxuan.yuan02@gmail.com/
+[10] (Unnecessarily downloading objects outside sparse specification)
+     https://lore.kernel.org/git/CAOLTT8QfwOi9yx_qZZgyGa8iL8kHWutEED7ok_jxwTcYT_hf9Q@mail.gmail.com/
+
+[11] (Stolee's comments on high-level usecases)
+     https://lore.kernel.org/git/1a1e33f6-3514-9afc-0a28-5a6b85bd8014@gmail.com/
+
+[12] Others commenting on eventually switching default to behavior A:
+  * https://lore.kernel.org/git/xmqqh719pcoo.fsf@gitster.g/
+  * https://lore.kernel.org/git/xmqqzgeqw0sy.fsf@gitster.g/
+  * https://lore.kernel.org/git/a86af661-cf58-a4e5-0214-a67d3a794d7e@github.com/
+
+[13] Previous config name suggestion and description
+  * https://lore.kernel.org/git/CABPp-BE6zW0nJSStcVU=_DoDBnPgLqOR8pkTXK3dW11=T01OhA@mail.gmail.com/
+
+[14] Tangential issue: switch to cone mode as default sparse specification mechanism:
+  https://lore.kernel.org/git/a1b68fd6126eb341ef3637bb93fedad4309b36d0.1650594746.git.gitgitgadget@gmail.com/
+
+[15] Lengthy email on grep behavior, covering what should be searched:
+  * https://lore.kernel.org/git/CABPp-BGVO3QdbfE84uF_3QDF0-y2iHHh6G5FAFzNRfeRitkuHw@mail.gmail.com/
+
+[16] Email explaining sparsity patterns vs. SKIP_WORKTREE and history operations,
+     search for the parenthetical comment starting "We do not check".
+    https://lore.kernel.org/git/CABPp-BFsCPPNOZ92JQRJeGyNd0e-TCW-LcLyr0i_+VSQJP+GCg@mail.gmail.com/
+
+[17] "I'm not even sure if we would want to make this available via a
+     config setting"
+   and
+     "But I also want to avoid doing this as a default or even behind a
+     config setting"
+   respectively, from:
+     https://lore.kernel.org/git/1a1e33f6-3514-9afc-0a28-5a6b85bd8014@gmail.com/
+     https://lore.kernel.org/git/07a25d48-e364-0d9b-6ffa-41a5984eb5db@github.com/

base-commit: 1b3d6e17fe83eb6f79ffbac2f2c61bbf1eaef5f8
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH] sparse-checkout.txt: new document with sparse-checkout directions
  2022-09-28  5:38   ` Elijah Newren
@ 2022-09-28 13:22     ` Derrick Stolee
  2022-10-06  7:10       ` Elijah Newren
  2022-09-30  9:54     ` ZheNing Hu
  1 sibling, 1 reply; 42+ messages in thread
From: Derrick Stolee @ 2022-09-28 13:22 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Victoria Dye,
	Shaoxuan Yuan, Matheus Tavares, ZheNing Hu

On 9/28/22 1:38 AM, Elijah Newren wrote:
> On Tue, Sep 27, 2022 at 9:36 AM Derrick Stolee <derrickstolee@github.com> wrote:
>>
>> On 9/24/2022 8:09 PM, Elijah Newren via GitGitGadget wrote:
>>> From: Elijah Newren <newren@gmail.com>
>>
>>> +  (Behavior A) Users are _only_ interested in the sparse portion of the repo
>>> +
>>> +These folks might know there are other things in the repository, but
>>> +don't care.  They are uninterested in other parts of the repository, and
>>> +only want to know about changes within their area of interest.  Showing
>>> +them other results from history (e.g. from diff/log/grep/etc.) is a
>>> +usability annoyance, potentially a huge one since other changes in
>>> +history may dwarf the changes they are interested in.
>>
>> This idea of restricting the commit history to the sparse-checkout
>> definition (by default, with an escape hatch) seems like the most
>> radical of the things we've considered. I think it's interesting to
>> consider, but it might be better to think about things like diffstats,
>> grepping, and otherwise preventing out-of-cone adjustments by default.
>>
>> That said, the idea of restricting history is also the simplest to
>> describe as a user-visible change.
> 
> By "restricting commit history", are you thinking in terms of "git log
> -- PATHS" or more like some kind of special --filter to git-clone?
> 
> I get the feeling you might be thinking about the latter, whereas I
> was assuming users had all commits (and all trees), but log/diff would
> restrict output based on relevant paths.

I'm most skeptical of the "git log -- <sparse-checkout-paths>"
restriction showing a simplified history graph. I get enough
complaints about "missing commits" from simplified file history
as it is. Adding this simplified history scoped to the sparse-
checkout is more likely to add confusion than help users, in my
opinion.

>>> +People might also end up wanting behavior B due to complex inter-project
>>> +dependencies.  The initial attempts to use sparse-checkouts usually
>>> +involve the directories you are directly interested in plus what those
>>> +directories depend upon within your repository.  But there's a monkey
>>> +wrench here: if you have integration tests, they invert the hierarchy:
>>> +to run integration tests, you need not only what you are interested in
>>> +and its dependencies, you also need everything that depends upon what
>>> +you are interested in or that depends upon one of your
>>> +dependencies...AND you need all the dependencies of that expanded group.
>>> +That can easily change your sparse-checkout into a nearly dense one.
>>
>> In my experience, the downstream dependencies are checked via builds in
>> the cloud, though that doesn't help if they are source dependencies and
>> you make a breaking change to an API interface. This kind of problem is
>> absolutely one of system architecture and I don't know what Git can do
>> other than to acknowledge it and recommend good patterns.
> 
> I was talking about (source) dependencies between
> modules/projects/whatever-you-want-to-call-the-subcomponents of your
> repository.  We have hundreds of modules, with various cross-module
> dependencies that evolve over time.
> 
> I get the feeling from your description that your intra-repository
> dependencies between modules/projects/whatever are much more static
> for you than what we deal with.  (Which is a good thing; it'd be nice
> if ours were more static.)

The internal monorepo I know the most about has a very strict project
system that has less granularity than other build systems, so the
projects themselves don't change dependencies very frequently (but
they have lots of internal build adjustments that they can make
without updating the sparse-checkout). This is probably atypical,
especially from what I've heard from companies working with a build
system like Bazel.

>> In a properly-organized project, 95% of engineers in the project can have
>> a small sparse-checkout, then 5% work on the common core that has these
>> downstream dependencies and require a large sparse-checkout definition.
> 
> "In a properly-organized project"?  I'm unsure if this is an
> indictment of some of the repositories I deal with in reality (and to
> be fair, it might be a totally fair indictment), or if your statement
> is starting to cross into "No true scotsman" territory.  ;-)

I should probably say things like "If system architects want to
optimize for Git performance for the majority of their engineers, then
this kind of dependency organization is desirable." Building projects
in a vacuum, ignoring Git entirely, there is still a benefit to
minimizing local build costs for individual engineers. I think that
most of the time those improvements to the build system will also
result in more efficient sparse-checkout definitions for engineers
working on a small set of components.

> I would probably lean towards the former (we know it's more messy than
> it should be), but I'm a bit puzzled that you'd just brush aside my
> mention of integration tests.  We have people who want to run
> integration tests locally, even when only modifying a small area of
> the codebase.  These users are not doing cross-tree work, rather they
> are doing cross-tree testing in conjunction with their work.

I include "this component is used tree-wide" as tree-wide work, even
if it doesn't mean they are modifying code across the entire tree.
I will still assert that the vast majority of engineers in a large
repository should not be doing work that has tree-wide implications
such as this.

I would still argue that the most efficient way for these engineers
to work would be to modify their component directly locally, relying
on project-specific tests that check their API boundary for expectations,
then rely on a distributed build system to verify their changes across
the tree. They can then pull in the component(s) that have failing tests
in order to re-run tests locally and verify the correct fix.
 
>> There's nothing Git can do to help those engineers that do cross-tree
>> work.
> 
> I'm going to partially disagree with this, in part because of our
> experience with many inter-module dependencies that evolve over time.
> Folks can start on a certain module and begin refactoring.  Being
> aware that their changes will affect other areas of the code, the can
> do a search (e.g. "git grep --cached ..." to find cases outside their
> current sparse checkout), and then selectively unsparsify to get the
> relevant few dozen (or maybe even few hundred) modules added.  They
> aren't switching to a dense checkout, just a less sparse one.  When
> they are done, they may narrow their sparse specification again.  We
> have a number of users doing cross-tree work who are using
> sparse-checkouts, and who find it productive and say it still speeds
> up their local build/test cycles.

This matches my expectation of how to engage selectively with
dependent components, where we expand the sparse-checkout selectively.
My only difference is that unless there is a breaking change to the
API boundary that this expansion happens reactively, not proactively.
(Expand to another project if it has failing tests due to changes to
the local components.)
 
> So, I'd say that ensuring Git supports behavior B well in
> sparse-checkouts, is something Git can do to help out both some of the
> engineers doing cross-tree work, and some of the engineers that are
> doing cross-tree testing.
> 
> (For full disclosure, we also have users doing cross-tree work using
> regular dense checkouts and I agree there's not a lot we can do to
> help them.)

Perhaps there are two different categories going on here:

 1. The engineer is building a component consumed by many others
    across the tree, but all edits are within that component.

 2. The engineer is editing code across many components across the
    tree.

>>> +  * Commands defaulting to --restrict-unless-conflicts
>>> +    * merge
>>> +    * rebase
>>> +    * cherry-pick
>>> +    * revert
>>
>> In my mind, --restrict-unless-conflicts doesn't provide any value unless
>> you want the --restrict mode to create an _error_ when trying to do
>> something outside of the sparse-checkout cone.
> 
> Are you assuming here I was suggesting command line flags?  If so, I
> apologize for my poor wording/descriptions.

Yes, I think that was my misunderstanding.

>> The only thing I can think about is that the diffstat might want to show
>> the stats for the conflicted files, in which case that's an important
>> perspective on the distinction from --restrict.
> 
> We only show the diffstat on a successful merge, so there's no
> diffstat to show if there are any conflicted files.

Thanks! TIL.

>>> +    * add
>>> +    * rm
>>> +    * mv
>>> +
>>> +    The defaults here perhaps make sense since they are nearly --restrict, but
>>> +    actually using --restrict could cause user confusion if users specify a
>>> +    specific filename, so they warn by default.  That logic may sound like
>>> +    --no-restrict should be the default, but that's prone to even bigger confusion:
>>> +      * `git add <somefile>` if honored and outside the sparse cone, can result in
>>> +     the file randomly disappearing later when some subsequent command is run
>>> +     (since various commands automatically clean up unmodified files outside
>>> +     the sparsity specification).
>>> +      * `git rm '*.jpg'` could very negatively surprise users if it deletes files
>>> +     outside the range of the user's interest.  Much better to operate on the
>>> +     sparsity specification and give the user warnings if other files could have
>>> +     matched.
>>
>> The cost of checking for other files that might match is sometimes too large
>> (needing to expand the sparse index or walk trees to find those path names) that
>> I would not recommend warning that we _didn't_ do something. Perhaps an advice
>> that says "we did not look outside the sparse-checkout definition for matching
>> paths" when the pathspec is not an exact path or a prefix match.
> 
> Ah, good point, and a good idea to keep in mind.
> 
> However, I think advise_on_updating_sparse_paths() currently does what
> you're warning against.  Do you think there's a good chance this is
> the cause of the performance bug reported over at
> https://lore.kernel.org/git/CABPp-BEkJQoKZsQGCYioyga_uoDQ6iBeW+FKr8JhyuuTMK1RDw@mail.gmail.com
> ?

Perhaps. You're right that it is warning about all of the paths that
match. That method was created before the sparse index was established,
so 'git add' was already checking all of the paths in the index, so
adding the warning made sense as something not too difficult to do after
checking each of those paths.

In the sparse index world, things are much more expensive to do that
check, hence the work to add modes that focus the action only to the
paths in the sparse-checkout. In that world, we _may_ want to recognize
that the user ran 'git rm *.png' and we want to provide advice that
we didn't look for '*.png' files outside of the sparse-checkout definition.

This makes less sense for 'git add *.png' because it already would not do
anything for files outside of the sparse-checkout definition. 

>>> +  * Commands whose default for --restrict vs. --no-restrict should vary depending
>>> +    on Behavior A or Behavior B
>>> +    * diff (with --cached or REVISION arguments)
>>> +    * grep (with --cached or REVISION arguments)
>>> +    * show (when given commit arguments)
>>> +    * bisect
>>> +    * blame
>>> +      * and annotate
>>> +    * log
>>> +      * and variants: shortlog, gitk, show-branch, whatchanged
>>> +
>>> +    For now, we default to behavior B for these, which want a default of
>>> +    --no-restrict.
>>
>> I do feel pretty strongly that we'll want a --no-restrict default here
>> because otherwise we will present confusion. I'm not even sure if we would
>> want to make this available via a config setting, but likely a config
>> setting makes sense in the long term.
> 
> You've got me slightly confused.  You did say the same thing a long time ago:
> 
>     "But I also want to avoid doing this as a default or even behind a
> config setting."[A]
> 
> BUT, when Shaoxuan proposed making --restrict/--focus the default for
> one of these commands, you seemed to be on board[B].

I'm specifically talking about 'git log'. I think that having that be
in a restricted mode is extremely dangerous and will only confuse users.
This includes 'git show' (with commit arguments) and 'git bisect', I
think.

The rest, (diff, grep, blame) are worktree-focused, so having a restrict
mode by default makes sense to me.

> Personally, I thought that if anyone would object to some of these
> commands changing, that grep would be considered as among the riskier.
> For diff and log, printing a "Warning: restricting output to the
> sparse-checkout specification" would be pretty innocuous, but for grep
> that wouldn't be.

My main concern with 'git grep --cached' is its interaction with
partial clone. Perhaps a restrict mode for grep should be toggled with
partial clone and not sparse-checkout alone. But, that becomes more
confusing when the restrictions are applied or not.

> I was a little unsure about making `--restrict/--focus` the default
> for these commands, both based on your previous concerns and because
> of thinking about some of my behavior B users.  But then, it seemed
> like everyone else was pushing for not only having this behavior but
> making it the default[C,D,E,F].  I was beginning to wonder if even you
> had decided behavior B didn't matter anymore between your support of
> Shaoxuan's change at [B] and your diffstat comments at [G].  But now
> it sounds like you're not only against behavior A by default but even
> implementing it at all...even though I don't see how that squares with
> your previous comments on grep and diffstat.
> 
> Is it just a matter of presentation?  Is it specific subcommands you
> don't want changed?  Or am I either missing or misunderstanding
> something?

I think the biggest point is that the implications of behavior A
saying "I don't care about any changes outside of my sparse-checkout"
leading to changed history are unappealing to me. After removing that
kind of feature from consideration, I don't see any difference
between the behaviors.

> Anyway...I will note that without a configurable option to give these
> commands a behavior of `--restrict`, I think you make working in
> disconnected partial clones practically impossible.  I want to be able
> to do "git log -p", "git diff REV1 REV2", and "git grep TERM REV" in
> disconnected partial clones, and I've wanted that kind of capability
> for well over a decade[H].  So, don't be surprised if I keep bringing
> up a config option of some sort for these commands.  :-)

Now, if we're talking about "don't download extra objects" as a goal,
then we're thinking about things not just related to sparse-checkout
but even history within the sparse-checkout. Even if we make the
'backfill' command something that users could run, there isn't a
guarantee that users will want to have even that much data downloaded.
We would need a way to say "yes, I ran 'git blame' on this path in my
sparse-checkout, but please don't just fail if you can't get new objects,
instead inform me that the results are incomplete."

I think the sparse-checkout boundary is a good way to minimize the
number of objects downloaded by these commands, but to actually
remove the need for downloads at all we need a way to gracefully
return partial results.

>>> +  * clone: should we provide some mechanism for tying partial clones and
>>> +    sparse checkouts together better.  Maybe an option
>>> +     --sparse=dir1,dir2,...,dirN
>>> +    which:
>>> +       * Does initial fetch with `--filter=blob:none`
>>> +       * Does the `sparse-checkout set --cone dir1 dir2 ... dirN` thing
>>> +       * Runs a `git rev-list --objects --all -- dir1 dir2 ... dirN` to
>>> +      fault in the missing blobs within the sparse
>>> +      specification...except that rev-list needs some kind of options
>>> +      to also get files from leading directories too.
>>> +       * Sets --restrict mode to allow focusing on the cone of interest
>>> +      (and to permit disconnected development)
>>
>> As mentioned, I think we should have the option to backfill the blobs in
>> the sparse-checkout definition, but 'git clone' should not do this by
>> default. It's something that can be launched in the background, maybe, but
>> not a blocking operation on being able to use the repository.
>>
>> 'scalar clone' is an excellent testing bed for these kinds of things,
>> like setting the --restrict mode by default.
> 
> Earlier in this same email you were against even making an option to
> request --restrict mode, but now you're suggesting to not only
> implement it but make it the default in scalar?

As I hope I've clarified earlier, there are some commands where I think
a --restrict mode is inadvisable, and turning it on by default is
dangerous. If we can configure the worktree commands to be restricted
by default and _not_ the history simplifyng ones, then that's what I
would want enabled in Scalar.
> I figured we'd have one or two places where all of us had some
> disagreements on the big picture, but more and more I'm finding we
> aren't even always thinking about the problems the same (e.g. the 3+
> different solutions to the `am` issues).  All the more reason that a
> document like this is important for us to discuss these details and
> work out a plan.

With such a massive doc and an ambitious plan, we are bound to have
misunderstandings and seem to self-contradict here and there. This
discussion is helping to drive clarity, and I appreciate all of your
work to drive towards mutual understanding.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] sparse-checkout.txt: new document with sparse-checkout directions
  2022-09-27 16:36 ` Derrick Stolee
  2022-09-28  5:38   ` Elijah Newren
@ 2022-09-30  9:09   ` ZheNing Hu
  1 sibling, 0 replies; 42+ messages in thread
From: ZheNing Hu @ 2022-09-30  9:09 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Elijah Newren via GitGitGadget, git, Victoria Dye, Shaoxuan Yuan,
	Matheus Tavares, Elijah Newren

Derrick Stolee <derrickstolee@github.com> 于2022年9月28日周三 00:36写道:
>
> > +Some of these users also arrive at this usecase from wanting to use
> > +partial clones together with sparse checkouts and do disconnected
> > +development.  Not only do these users generally not care about other
> > +parts of the repository, but consider it a blocker for Git commands to
> > +try to operate on those.  If commands attempt to access paths in history
> > +outside the sparsity specification, then the partial clone will attempt
> > +to download additional blobs on demand, fail, and then fail the user's
> > +command.  (This may be unavoidable in some cases, e.g. when `git merge`
> > +has non-trivial changes to reconcile outside the sparsity path, but we
> > +should limit how often users are forced to connect to the network.)
>
> This idea pairs well with a feature I've been meaning to build:
> 'git sparse-checkout backfill' would download all historical blobs
> within the sparse-checkout definition. This is possible with rev-list,
> but I want to investigate grouping blobs by path and making requests in
> batches, hopefully allowing better deltification and ability to recover
> from network disconnections. That makes this idea of "staying within
> your sparse-checkout means no missing object downloads" even more likely.
>

I think this is very useful: if I use sparse-checkout + partial-clone,
plugins like
git blame in vscode (or other IDE) will be invalidated, or require a
lot of network
overhead to download the missing blobs, so this git sparse-checkout backfill
looks like a promising solution to that problem.

> > +People might also end up wanting behavior B due to complex inter-project
> > +dependencies.  The initial attempts to use sparse-checkouts usually
> > +involve the directories you are directly interested in plus what those
> > +directories depend upon within your repository.  But there's a monkey
> > +wrench here: if you have integration tests, they invert the hierarchy:
> > +to run integration tests, you need not only what you are interested in
> > +and its dependencies, you also need everything that depends upon what
> > +you are interested in or that depends upon one of your
> > +dependencies...AND you need all the dependencies of that expanded group.
> > +That can easily change your sparse-checkout into a nearly dense one.
>
> In my experience, the downstream dependencies are checked via builds in
> the cloud, though that doesn't help if they are source dependencies and
> you make a breaking change to an API interface. This kind of problem is
> absolutely one of system architecture and I don't know what Git can do
> other than to acknowledge it and recommend good patterns.
>
> In a properly-organized project, 95% of engineers in the project can have
> a small sparse-checkout, then 5% work on the common core that has these
> downstream dependencies and require a large sparse-checkout definition.
> There's nothing Git can do to help those engineers that do cross-tree
> work.
>

This feels like it's because your project code is stable enough, but at other
companies I think many of the project dependencies are subject to frequent
changes.

> > +      * `git mv` has similar surprises when moving into or out of the cone, so
> > +     best to restrict and throw warnings if restriction might affect the result.
> > +
> > +    There may be a difference in here between behavior A and behavior B.
> > +    For behavior A, we probably only want to warn if there were no
> > +    suitable matches for files in the sparsity specification, whereas
> > +    for behavior B, we may want to warn even if there are valid files to
> > +    operate on if the result would have been different under
> > +    `--no-restrict`.
>
> I think in behavior B, users who actually want to modify things tree-wide will
> actually increase their sparse-checkout definition to include those files so
> they can validate what they are doing.
>

Agree.

> > +=== Implementation Questions ===
> > +
> > +  * Does the name --[no-]restrict sound good to others?  Are there better options?
> > +    * Names in use, or appearing in patches, or previously suggested:
> > +      * --sparse/--dense
> > +      * --ignore-skip-worktree-bits
> > +      * --ignore-skip-worktree-entries
> > +      * --ignore-sparsity
> > +      * --[no-]restrict-to-sparse-paths
> > +      * --full-tree/--sparse-tree
> > +      * --[no-]restrict
>
> I like the simplicity of --[no-]restrict, and my only worry is that it
> doesn't immediately link to what it is restricting.
>
> Perhaps something like "scope" would describe the set of things we care
> about, but use a text mode:
>
>         --scope=sparse  (--restrict)
>         --scope=all     (--no-restrict)
>
> But I'm notoriously bad at naming things.
>
> > +  * Should --[no-]restrict be a git global option, or added as options to each
> > +    relevant command?  (Does that make sense given the multitude of different
> > +    default behaviors we have for different options?)
>
> If we can make it a global option, that would be great, then update
> the commands to behave under that mode as we go.
>
> If that doesn't work, then adding the consistent option across commands
> would be helpful. It might be good to make a OPT_RESTRICT macro (much
> like OPT__VERBOSE, OPT__QUIET, and similar macros.
>
> > +  * Should --sparse in ls-files be made an alias for --restrict?
> > +    `--restrict` is certainly a near synonym in cone-mode, but even then
> > +    it's not quite the same.  In non-cone mode, ls-files' `--sparse`
> > +    option has no effect, and in cone-mode it still shows the sparse
> > +    directory entries which are technically outside the sparsity
> > +    specification.
>
> We should definitely replace the --sparse option(s) with whatever we
> choose here. For ls-files, we have the issue that we are reporting
> what is in the index, and in non-cone-mode the index cannot be sparse.
>
> Now, maybe we change what the ls-files mode does under --restrict and
> only have it report the paths within the sparse-checkout and not even
> show the results for sparse directory entries. The --no-restrict would
> then expand a sparse-index to show only paths again.
>

> > +    Namely, if folks are not already in a sparse checkout, then require
> > +    `sparse-checkout init/set` to take a `--[no-]restrict` flag (which
> > +    would set core.restrictToSparse according to the setting given), and
> > +    throw an error if the flag is not provided?  That error would be a
> > +    great place to warn folks that the default may change in the future,
> > +    and get them used to specifying what they want so that the eventual
> > +    default switch is seamless for them.
>
> I don't like using the same option name (--[no-]restrict) for something
> that sets a config option to keep that behavior permanently. Different
> names that make it clearer could be:
>
>         --enable-restrict-mode
>         --set-scope=(sparse|all)
>

The name sounds clear enough. I had a idea that add some configuration like:

scope.<cmd>.mode=sparse|all

and then let scalar help users set some default configs...

> > +  * clone: should we provide some mechanism for tying partial clones and
> > +    sparse checkouts together better.  Maybe an option
> > +     --sparse=dir1,dir2,...,dirN
> > +    which:
> > +       * Does initial fetch with `--filter=blob:none`
> > +       * Does the `sparse-checkout set --cone dir1 dir2 ... dirN` thing
> > +       * Runs a `git rev-list --objects --all -- dir1 dir2 ... dirN` to
> > +      fault in the missing blobs within the sparse
> > +      specification...except that rev-list needs some kind of options
> > +      to also get files from leading directories too.
> > +       * Sets --restrict mode to allow focusing on the cone of interest
> > +      (and to permit disconnected development)
>
> As mentioned, I think we should have the option to backfill the blobs in
> the sparse-checkout definition, but 'git clone' should not do this by
> default. It's something that can be launched in the background, maybe, but
> not a blocking operation on being able to use the repository.
>
> 'scalar clone' is an excellent testing bed for these kinds of things,
> like setting the --restrict mode by default.
>

This sounds interesting and would like to see scalar support them!

> Hopefully my responses aren't too far off-base. I'll go read the rest of
> the discussion now that I've contributed my thoughts on the doc.
>
> Thanks,
> -Stolee

Thanks,
--
ZheNing Hu

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] sparse-checkout.txt: new document with sparse-checkout directions
  2022-09-28  5:38   ` Elijah Newren
  2022-09-28 13:22     ` Derrick Stolee
@ 2022-09-30  9:54     ` ZheNing Hu
  2022-10-06  7:53       ` Elijah Newren
  1 sibling, 1 reply; 42+ messages in thread
From: ZheNing Hu @ 2022-09-30  9:54 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Derrick Stolee, Elijah Newren via GitGitGadget, Git Mailing List,
	Victoria Dye, Shaoxuan Yuan, Matheus Tavares

I am not sure if these ideas are feasible.

Elijah Newren <newren@gmail.com> 于2022年9月28日周三 13:38写道:
>
> > > +People might also end up wanting behavior B due to complex inter-project
> > > +dependencies.  The initial attempts to use sparse-checkouts usually
> > > +involve the directories you are directly interested in plus what those
> > > +directories depend upon within your repository.  But there's a monkey
> > > +wrench here: if you have integration tests, they invert the hierarchy:
> > > +to run integration tests, you need not only what you are interested in
> > > +and its dependencies, you also need everything that depends upon what
> > > +you are interested in or that depends upon one of your
> > > +dependencies...AND you need all the dependencies of that expanded group.
> > > +That can easily change your sparse-checkout into a nearly dense one.
> >
> > In my experience, the downstream dependencies are checked via builds in
> > the cloud, though that doesn't help if they are source dependencies and
> > you make a breaking change to an API interface. This kind of problem is
> > absolutely one of system architecture and I don't know what Git can do
> > other than to acknowledge it and recommend good patterns.
>
> I was talking about (source) dependencies between
> modules/projects/whatever-you-want-to-call-the-subcomponents of your
> repository.  We have hundreds of modules, with various cross-module
> dependencies that evolve over time.
>
> I get the feeling from your description that your intra-repository
> dependencies between modules/projects/whatever are much more static
> for you than what we deal with.  (Which is a good thing; it'd be nice
> if ours were more static.)
>
> > In a properly-organized project, 95% of engineers in the project can have
> > a small sparse-checkout, then 5% work on the common core that has these
> > downstream dependencies and require a large sparse-checkout definition.
>
> "In a properly-organized project"?  I'm unsure if this is an
> indictment of some of the repositories I deal with in reality (and to
> be fair, it might be a totally fair indictment), or if your statement
> is starting to cross into "No true scotsman" territory.  ;-)
>
> I would probably lean towards the former (we know it's more messy than
> it should be), but I'm a bit puzzled that you'd just brush aside my
> mention of integration tests.  We have people who want to run
> integration tests locally, even when only modifying a small area of
> the codebase.  These users are not doing cross-tree work, rather they
> are doing cross-tree testing in conjunction with their work.  Running
> such tests requires a build of the modules across the repository,
> which naively would push folks into a dense checkout...and really long
> local builds.  We want fast local builds, and sparse-checkouts help us
> achieve that...but it does mean we have to be clever about how we
> build in order to let these users run integration tests.  (And we have
> to make it easy for users to discover the relevant integration tests,
> and sometimes associated code components that depend on what they are
> changing, which is where behavior B comes in).
>
> > There's nothing Git can do to help those engineers that do cross-tree
> > work.
>
> I'm going to partially disagree with this, in part because of our
> experience with many inter-module dependencies that evolve over time.
> Folks can start on a certain module and begin refactoring.  Being
> aware that their changes will affect other areas of the code, the can
> do a search (e.g. "git grep --cached ..." to find cases outside their
> current sparse checkout), and then selectively unsparsify to get the
> relevant few dozen (or maybe even few hundred) modules added.  They
> aren't switching to a dense checkout, just a less sparse one.  When
> they are done, they may narrow their sparse specification again.  We
> have a number of users doing cross-tree work who are using
> sparse-checkouts, and who find it productive and say it still speeds
> up their local build/test cycles.
>
> So, I'd say that ensuring Git supports behavior B well in
> sparse-checkouts, is something Git can do to help out both some of the
> engineers doing cross-tree work, and some of the engineers that are
> doing cross-tree testing.
>
> (For full disclosure, we also have users doing cross-tree work using
> regular dense checkouts and I agree there's not a lot we can do to
> help them.)
>

Let me guess where the cross tree users using sparse-checkout are
getting their revenue from:

1. they don't have to download the entire repository of blobs at once
2. their working tree can be easily resized.
3. they could have something like sparse-index to optimize the performance
of git commands.

But it's still worth worrying about the size of the git repository blobs,
even if it's just only blobs in mono-repo's HEAD, that may also be too big
for the user's local area to handle.

Perhaps it would make more sense to place this integration testing work on
a remote server.

I am not sure if these ideas are feasible:

1. mount the large git repo on the server to local.
2. just ssh to a remote server to run integration tests.
3. use an external tool to run integration tests on the remote server.

>
> Anyway, we do not want the behavior of `--restrict` for these
> commands.  That would imply not providing conflicts to users for them
> to resolve unless they are contained within the sparse specification,
> which would clearly be broken.  We instead chose to write out files
> with conflicts regardless of whether they are outside the sparse
> specification.  This modified behavior I gave the name of
> `--restrict-unless-conflict`, but we don't need or want an actual
> command line flag for that.  I think the behavior should just remain
> hardcoded into these commands.
>
> (Note: these commands are among those that make me think
> --[no-]restrict or --[un]focus or whatever might not make sense as a
> git global option: `--restrict-unless-conflict` behavior is the
> default for these and in fact that only sensible option, I think.  If
> there's only one sensible option, no actual flag names are needed.)
>
> > The only thing I can think about is that the diffstat might want to show
> > the stats for the conflicted files, in which case that's an important
> > perspective on the distinction from --restrict.
>
> We only show the diffstat on a successful merge, so there's no
> diffstat to show if there are any conflicted files.
>

Sorry, I have some questions here: how does git merge know there are
no conflicts without downloading the blobs?

> > Perhaps something like "scope" would describe the set of things we care
> > about, but use a text mode:
> >
> >         --scope=sparse  (--restrict)
> >         --scope=all     (--no-restrict)
> >
> > But I'm notoriously bad at naming things.
>
> Yeah, me too.  Naming things is one of the two hard problems in
> computer science, right?  (The others being cache invalidation, and
> off-by-one errors.)
>
> However, in this case, your suggestion sounds pretty decent to me.
> I'll add it to the list for us to consider.
>

Agree.

Thanks,
--
ZheNing Hu

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] sparse-checkout.txt: new document with sparse-checkout directions
  2022-09-28 13:22     ` Derrick Stolee
@ 2022-10-06  7:10       ` Elijah Newren
  2022-10-06 18:27         ` Derrick Stolee
  0 siblings, 1 reply; 42+ messages in thread
From: Elijah Newren @ 2022-10-06  7:10 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Victoria Dye,
	Shaoxuan Yuan, Matheus Tavares, ZheNing Hu

On Wed, Sep 28, 2022 at 6:22 AM Derrick Stolee <derrickstolee@github.com> wrote:
>
> On 9/28/22 1:38 AM, Elijah Newren wrote:
> > On Tue, Sep 27, 2022 at 9:36 AM Derrick Stolee <derrickstolee@github.com> wrote:
> >>
> >> On 9/24/2022 8:09 PM, Elijah Newren via GitGitGadget wrote:
> >>> From: Elijah Newren <newren@gmail.com>
> >>
[...]
> >>> +  * Commands whose default for --restrict vs. --no-restrict should vary depending
> >>> +    on Behavior A or Behavior B
> >>> +    * diff (with --cached or REVISION arguments)
> >>> +    * grep (with --cached or REVISION arguments)
> >>> +    * show (when given commit arguments)
> >>> +    * bisect
> >>> +    * blame
> >>> +      * and annotate
> >>> +    * log
> >>> +      * and variants: shortlog, gitk, show-branch, whatchanged
> >>> +
> >>> +    For now, we default to behavior B for these, which want a default of
> >>> +    --no-restrict.
> >>
> >> I do feel pretty strongly that we'll want a --no-restrict default here
> >> because otherwise we will present confusion. I'm not even sure if we would
> >> want to make this available via a config setting, but likely a config
> >> setting makes sense in the long term.
> >
> > You've got me slightly confused.  You did say the same thing a long time ago:
> >
> >     "But I also want to avoid doing this as a default or even behind a
> > config setting."[A]
> >
> > BUT, when Shaoxuan proposed making --restrict/--focus the default for
> > one of these commands, you seemed to be on board[B].
>
> I'm specifically talking about 'git log'. I think that having that be
> in a restricted mode is extremely dangerous and will only confuse users.
> This includes 'git show' (with commit arguments) and 'git bisect', I
> think.

Thanks, that helps me understand your position better.

I'm curious if, due to the length of the document and this thread,
you're just skimming past the idea I mentioned of showing a warning at
the beginning of `diff`, `log`, or `show` output when restricting
based on config or defaults.  Without such a warning, I agree that
restricting might be confusing at times, but I think such a warning
may be sufficient to address the concerns around partial/incomplete
results.  The one command that this warning idea doesn't help with is
`grep` since it cannot safely be applied there, which potentially
leaves `grep` giving confusing results when users pass either
`--cached` or revisions, but you seem to not be concerned about that.

I'm also curious if the problem partially stems from the fact that
with `git log` there is no way to control revision limiting and diff
generation paths independently.  If there was a way to make `git log
-p` continue showing the regular list of commits but restrict which
paths were shown in the diffs, and we made the --scope-sparse handling
do this so that only diffs were limited but not the revisions
traversed/printed, would that help address your concerns?

> The rest, (diff, grep, blame) are worktree-focused, so having a restrict
> mode by default makes sense to me.

I was specifically calling out diff & grep when passed revision
arguments, which are definitely *not* worktree-focused operations.

Also, blame incorporates a component of changes from the worktree, but
it's mostly about history (and one or more -C's make it check other
paths as well).

[...]
> I think the biggest point is that the implications of behavior A
> saying "I don't care about any changes outside of my sparse-checkout"
> leading to changed history are unappealing to me. After removing that
> kind of feature from consideration, I don't see any difference
> between the behaviors.

Indeed, the differences between the behaviors is (mostly?) about
history queries, be it `git grep --cached`, `git grep REV`, `git diff
REV1 REV2`, `git log -p`, etc.

And I understand it's unappealing to you, but I haven't seen an
alternative solution to disconnected development in partial clones.
Nor have I seen an alternate plan for users who want to really focus
on their small subset of the repository.

So, maybe you don't want to use a configuration knob and always want a
certain default, but I very much want a knob.

> > Anyway...I will note that without a configurable option to give these
> > commands a behavior of `--restrict`, I think you make working in
> > disconnected partial clones practically impossible.  I want to be able
> > to do "git log -p", "git diff REV1 REV2", and "git grep TERM REV" in
> > disconnected partial clones, and I've wanted that kind of capability
> > for well over a decade[H].  So, don't be surprised if I keep bringing
> > up a config option of some sort for these commands.  :-)
>
> Now, if we're talking about "don't download extra objects" as a goal,
> then we're thinking about things not just related to sparse-checkout
> but even history within the sparse-checkout. Even if we make the
> 'backfill' command something that users could run, there isn't a
> guarantee that users will want to have even that much data downloaded.
> We would need a way to say "yes, I ran 'git blame' on this path in my
> sparse-checkout, but please don't just fail if you can't get new objects,
> instead inform me that the results are incomplete."
>
> I think the sparse-checkout boundary is a good way to minimize the
> number of objects downloaded by these commands, but to actually
> remove the need for downloads at all we need a way to gracefully
> return partial results.

There may be some merits to a partial clone with shallow blob history,
but I've never really been all that interested in it.  I know that
partial clones only really implement that kind of feature, but I've
always wanted a full-depth sparse clone instead.  I tried to create
that alternate reality[H], but didn't get the time to push it very
far, and in the meantime others came along and implemented both
shallow clones and partial clones.  I still want my thing, but at this
point rather than introduce a new kind of clone, it makes more sense
for me to reuse the existing partial clone framework and extend it --
especially since it more gracefully handles cases where additional
data outside user-specified sparsity is needed (such as for merges).

[H] https://lore.kernel.org/git/1283645647-1891-1-git-send-email-newren@gmail.com/

But you've got me curious.  You seem to be suggesting that partial
results are okay if the user is informed.  I have suggested making
diff-with-revisions, log -p, etc. show a warning that results may be
incomplete when restricting them to the sparse checkout based on
config.  So, aren't you suggesting that my proposal is safe after all?

Anyway, if someone wants to implement something like you suggest here,
while I might not use it, it sounds reasonable to me.  It'd probably
fit in as yet another config setting.  Then, for history queries, our
config would select the default between --scope=all (for behavior B
folks), --scope=sparse (for the behavior A folks) and
--scope=sparse-and-already-downloaded (the behavior you suggest above,
though it probably needs a better name).  Also, it sounds to me like
implementing --scope=sparse would be a step along the path to
implementing what you are suggesting here, if I'm understanding you
correctly.  (Also, this idea makes me like your --scope= naming even
more, because it's awkward to add a third option to
--restrict/--no-restrict.)

> > I figured we'd have one or two places where all of us had some
> > disagreements on the big picture, but more and more I'm finding we
> > aren't even always thinking about the problems the same (e.g. the 3+
> > different solutions to the `am` issues).  All the more reason that a
> > document like this is important for us to discuss these details and
> > work out a plan.
>
> With such a massive doc and an ambitious plan, we are bound to have
> misunderstandings and seem to self-contradict here and there. This
> discussion is helping to drive clarity, and I appreciate all of your
> work to drive towards mutual understanding.

Thanks for taking the time to read through it and respond in detail!

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] sparse-checkout.txt: new document with sparse-checkout directions
  2022-09-30  9:54     ` ZheNing Hu
@ 2022-10-06  7:53       ` Elijah Newren
  2022-10-15  2:17         ` ZheNing Hu
  0 siblings, 1 reply; 42+ messages in thread
From: Elijah Newren @ 2022-10-06  7:53 UTC (permalink / raw)
  To: ZheNing Hu
  Cc: Derrick Stolee, Elijah Newren via GitGitGadget, Git Mailing List,
	Victoria Dye, Shaoxuan Yuan, Matheus Tavares

On Fri, Sep 30, 2022 at 2:54 AM ZheNing Hu <adlternative@gmail.com> wrote:
>
> I am not sure if these ideas are feasible.
>
> Elijah Newren <newren@gmail.com> 于2022年9月28日周三 13:38写道:
> >
[...]
> > > There's nothing Git can do to help those engineers that do cross-tree
> > > work.
> >
> > I'm going to partially disagree with this, in part because of our
> > experience with many inter-module dependencies that evolve over time.
> > Folks can start on a certain module and begin refactoring.  Being
> > aware that their changes will affect other areas of the code, the can
> > do a search (e.g. "git grep --cached ..." to find cases outside their
> > current sparse checkout), and then selectively unsparsify to get the
> > relevant few dozen (or maybe even few hundred) modules added.  They
> > aren't switching to a dense checkout, just a less sparse one.  When
> > they are done, they may narrow their sparse specification again.  We
> > have a number of users doing cross-tree work who are using
> > sparse-checkouts, and who find it productive and say it still speeds
> > up their local build/test cycles.
> >
> > So, I'd say that ensuring Git supports behavior B well in
> > sparse-checkouts, is something Git can do to help out both some of the
> > engineers doing cross-tree work, and some of the engineers that are
> > doing cross-tree testing.
> >
> > (For full disclosure, we also have users doing cross-tree work using
> > regular dense checkouts and I agree there's not a lot we can do to
> > help them.)
> >
>
> Let me guess where the cross tree users using sparse-checkout are
> getting their revenue from:

Is "revenue" perhaps a case of auto-correct choosing the wrong word?

> 1. they don't have to download the entire repository of blobs at once
> 2. their working tree can be easily resized.
> 3. they could have something like sparse-index to optimize the performance
> of git commands.

These correspond to partial clone, sparse-checkout, and sparse-index.
I think these 3 features and the various work done to support them,
plus submodule (which is a different kind of solution) are the
features Git provides to work with repository subsets.  Some
repositories (especially the big monorepos like the Microsoft ones)
will benefit from using all three of these features.  Others might
only want to use one or two of them.

As an example, the repository where we first applied sparse-checkouts
to (and which had the complicated dependencies) does not use partial
clones or a sparse-index.   While partial clone and sparse-index might
help a little, the .git directory for a full clone is merely 2G, and
there are less than 100K entries in the index.  However,
sparse-checkout helps out a lot.

> But it's still worth worrying about the size of the git repository blobs,
> even if it's just only blobs in mono-repo's HEAD, that may also be too big
> for the user's local area to handle.
>
> Perhaps it would make more sense to place this integration testing work on
> a remote server.
>
> I am not sure if these ideas are feasible:
>
> 1. mount the large git repo on the server to local.
> 2. just ssh to a remote server to run integration tests.
> 3. use an external tool to run integration tests on the remote server.

Are you suggesting #1 as a way for just handling the git history, or
also for handling the worktree with some kind of virtual file system
where not all files are actually written locally?  If you're only
talking about the history, then you're kind of going on a tangent
unrelated to this document.  If you're talking about worktrees and
virtual file systems, then Git proper doesn't have anything of the
sort currently.  There are at least two solutions in this space --
Microsoft's Git-VFS (which I think they are phasing out) and Google's
similar virtual file system -- but I'm not currently particularly
interested in either one.

#3 is precisely what we did first (except "*a* remote server" rather
than "*the* remote server").  I think I called it out in the email
you're responding to; it's often good enough for many people.
However, sometimes those tests fail and people want to run locally so
it's easier to inspect.  Or they just want to be able to run locally
anyway.  So, while #3 helped, it wasn't good enough.

#2 is also something we did.  Using tools like Coder or GitHub
codespaces or other offerings in that area, you can provide developers
a nice beefy box with good network connectivity to the main Git
repository, on which they can do development and running of tests.
Then developers can connect to such machines from a variety of
different external locations.  Works great for some people...but build
times and ability of IDEs to handle the code base are still an issue,
so doing smarter things with sparse-checkouts is still important.
And, even if #2 works for some people, others still want to develop
and run integration tests on their (beefy) laptops.

All three of these, as far as I can tell, are just things that
individual teams setup and aren't anything that would affect Git's
development one way or another.


However, I'll note that while we internally definitely did two of the
three things you suggested here, it wasn't a complete enough solution
for us and sparse-checkout adoption was still pretty minimal at that
point.  So, we went back to our sparse-checkouts and asked how we
could modify the build system to allow us to not check out the in-tree
dependencies of the things we are tweaking, but still get a correct
build and allow us to run tests.  Once we got that working, we finally
really unlocked the value of sparse checkouts for us (both improving
things for developers on laptops, and for developers on the
development box in the cloud).  It went from very few folks using
sparse checkouts with that repository, to being the default and
recommended usage at that point.

While the build changes were internal things we did, I think that the
underlying usage scenario matters to Git development because it helps
inform how sparse-checkout can be used.  In particular, it suggests
why some sparse-checkout users may be interested in finding results
for files that do not match their sparse-checkout patterns -- in-tree
dependencies may not necessarily be checked out, but those are related
enough to the code that developers are working on, that developers are
still potentially interested in using e.g. "git grep" or "git log -p"
to find out information about code or changes in those other areas.
(And, of course, developers are also potentially interested in finding
out what other code depends on what they are changing, but I suspect
folks were already aware of that usecase.)  It's certainly not the
only usecase, but it's an additional one that I didn't think was quite
reflected in Stolee's description of why users would want searches to
turn up results for files not found in their working tree.

> > > The only thing I can think about is that the diffstat might want to show
> > > the stats for the conflicted files, in which case that's an important
> > > perspective on the distinction from --restrict.
> >
> > We only show the diffstat on a successful merge, so there's no
> > diffstat to show if there are any conflicted files.
> >
>
> Sorry, I have some questions here: how does git merge know there are
> no conflicts without downloading the blobs?

Not sure how that's related to the above, but to answer your question:

Sometimes merge has to download blobs to know if there are conflicts
or not.  But only sometimes.  Since tree objects have the hashes of
the blobs, having the tree objects is sufficient to determine which
side(s) of history modified each path.

If both sides of history modified the same file, then you *might* have
conflicts, and you indeed need the blobs to verify.  But if only one
side of history modified a file and the other left it alone, then
there is no conflict.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] sparse-checkout.txt: new document with sparse-checkout directions
  2022-10-06  7:10       ` Elijah Newren
@ 2022-10-06 18:27         ` Derrick Stolee
  2022-10-07  2:56           ` Elijah Newren
  0 siblings, 1 reply; 42+ messages in thread
From: Derrick Stolee @ 2022-10-06 18:27 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Victoria Dye,
	Shaoxuan Yuan, Matheus Tavares, ZheNing Hu

On 10/6/22 3:10 AM, Elijah Newren wrote:
> On Wed, Sep 28, 2022 at 6:22 AM Derrick Stolee <derrickstolee@github.com> wrote:
>>
>> On 9/28/22 1:38 AM, Elijah Newren wrote:
>>> On Tue, Sep 27, 2022 at 9:36 AM Derrick Stolee <derrickstolee@github.com> wrote:
>>>>
>>>> On 9/24/2022 8:09 PM, Elijah Newren via GitGitGadget wrote:
>>>>> From: Elijah Newren <newren@gmail.com>
>>>>
> [...]
>>>>> +  * Commands whose default for --restrict vs. --no-restrict should vary depending
>>>>> +    on Behavior A or Behavior B
>>>>> +    * diff (with --cached or REVISION arguments)
>>>>> +    * grep (with --cached or REVISION arguments)
>>>>> +    * show (when given commit arguments)
>>>>> +    * bisect
>>>>> +    * blame
>>>>> +      * and annotate
>>>>> +    * log
>>>>> +      * and variants: shortlog, gitk, show-branch, whatchanged
>>>>> +
>>>>> +    For now, we default to behavior B for these, which want a default of
>>>>> +    --no-restrict.
>>>>
>>>> I do feel pretty strongly that we'll want a --no-restrict default here
>>>> because otherwise we will present confusion. I'm not even sure if we would
>>>> want to make this available via a config setting, but likely a config
>>>> setting makes sense in the long term.
>>>
>>> You've got me slightly confused.  You did say the same thing a long time ago:
>>>
>>>     "But I also want to avoid doing this as a default or even behind a
>>> config setting."[A]
>>>
>>> BUT, when Shaoxuan proposed making --restrict/--focus the default for
>>> one of these commands, you seemed to be on board[B].
>>
>> I'm specifically talking about 'git log'. I think that having that be
>> in a restricted mode is extremely dangerous and will only confuse users.
>> This includes 'git show' (with commit arguments) and 'git bisect', I
>> think.
> 
> Thanks, that helps me understand your position better.
> 
> I'm curious if, due to the length of the document and this thread,
> you're just skimming past the idea I mentioned of showing a warning at
> the beginning of `diff`, `log`, or `show` output when restricting
> based on config or defaults.  Without such a warning, I agree that
> restricting might be confusing at times, but I think such a warning
> may be sufficient to address the concerns around partial/incomplete
> results.  The one command that this warning idea doesn't help with is
> `grep` since it cannot safely be applied there, which potentially
> leaves `grep` giving confusing results when users pass either
> `--cached` or revisions, but you seem to not be concerned about that.

I'm not convinced that warnings are enough for some cases, especially
for output that is fed to a pager. Do the warnings stick around in
the pager? I'm not sure.

> I'm also curious if the problem partially stems from the fact that
> with `git log` there is no way to control revision limiting and diff
> generation paths independently.  If there was a way to make `git log
> -p` continue showing the regular list of commits but restrict which
> paths were shown in the diffs, and we made the --scope-sparse handling
> do this so that only diffs were limited but not the revisions
> traversed/printed, would that help address your concerns?

My biggest issue is with the idea of simplifying the commit history
based on the sparse-checkout path definitions. The '-p' option having
a diff scoped to the sparse-checkout paths would be fine.

>> The rest, (diff, grep, blame) are worktree-focused, so having a restrict
>> mode by default makes sense to me.
> 
> I was specifically calling out diff & grep when passed revision
> arguments, which are definitely *not* worktree-focused operations.

You're right. I'm not using the right terminology. They _are_
operations on a single tree, where path scopes make sense.

> Also, blame incorporates a component of changes from the worktree, but
> it's mostly about history (and one or more -C's make it check other
> paths as well).

Since each input is a specific file path, I'm not sure we need
anything here except perhaps a warning that they are requesting
a file outside the sparse-checkout definition (if even that).

>>> Anyway...I will note that without a configurable option to give these
>>> commands a behavior of `--restrict`, I think you make working in
>>> disconnected partial clones practically impossible.  I want to be able
>>> to do "git log -p", "git diff REV1 REV2", and "git grep TERM REV" in
>>> disconnected partial clones, and I've wanted that kind of capability
>>> for well over a decade[H].  So, don't be surprised if I keep bringing
>>> up a config option of some sort for these commands.  :-)
>>
>> Now, if we're talking about "don't download extra objects" as a goal,
>> then we're thinking about things not just related to sparse-checkout
>> but even history within the sparse-checkout. Even if we make the
>> 'backfill' command something that users could run, there isn't a
>> guarantee that users will want to have even that much data downloaded.
>> We would need a way to say "yes, I ran 'git blame' on this path in my
>> sparse-checkout, but please don't just fail if you can't get new objects,
>> instead inform me that the results are incomplete."
>>
>> I think the sparse-checkout boundary is a good way to minimize the
>> number of objects downloaded by these commands, but to actually
>> remove the need for downloads at all we need a way to gracefully
>> return partial results.
> 
> There may be some merits to a partial clone with shallow blob history,
> but I've never really been all that interested in it. ......
> But you've got me curious.  You seem to be suggesting that partial
> results are okay if the user is informed.  I have suggested making
> diff-with-revisions, log -p, etc. show a warning that results may be
> incomplete when restricting them to the sparse checkout based on
> config.  So, aren't you suggesting that my proposal is safe after all?

I think the following things are true:

1. It's really important to keep the current partial clone default of
   only downloading blobs on-demand. Even with a limited sparse-checkout,
   it's rare that users will need every version of every file in that
   sparse-checkout, and they may not want that tax on their local storage.

2. Adding an opt-in backfill for a sparse-checkout definition will
   prevent most on-demand downloads (although it might want to be
   integrated into 'git fetch' behind an option to be really sure that
   state continues in the future).

3. Updating Git features to scope down to sparse-checkout will prevent
   many of the remaining on-demand downloads.

4. To be _absolutely sure_ that on-demand downloads don't happen, we
   need an extra mode for Git and new ways of reporting partial results.
   Without this mode, Git commands fail when triggering an on-demand
   download and the network is unavailable.

So, I'm saying that (4) is a direction that we could go. It also seems
extremely difficult to do, so we should do (2) & (3) first, which will
get us 99% of the way there.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] sparse-checkout.txt: new document with sparse-checkout directions
  2022-10-06 18:27         ` Derrick Stolee
@ 2022-10-07  2:56           ` Elijah Newren
  0 siblings, 0 replies; 42+ messages in thread
From: Elijah Newren @ 2022-10-07  2:56 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Victoria Dye,
	Shaoxuan Yuan, Matheus Tavares, ZheNing Hu

On Thu, Oct 6, 2022 at 11:27 AM Derrick Stolee <derrickstolee@github.com> wrote:
>
> On 10/6/22 3:10 AM, Elijah Newren wrote:
> > On Wed, Sep 28, 2022 at 6:22 AM Derrick Stolee <derrickstolee@github.com> wrote:
> >>
> >> On 9/28/22 1:38 AM, Elijah Newren wrote:
[...]
> >> I'm specifically talking about 'git log'. I think that having that be
> >> in a restricted mode is extremely dangerous and will only confuse users.
> >> This includes 'git show' (with commit arguments) and 'git bisect', I
> >> think.
> >
> > Thanks, that helps me understand your position better.
> >
> > I'm curious if, due to the length of the document and this thread,
> > you're just skimming past the idea I mentioned of showing a warning at
> > the beginning of `diff`, `log`, or `show` output when restricting
> > based on config or defaults.  Without such a warning, I agree that
> > restricting might be confusing at times, but I think such a warning
> > may be sufficient to address the concerns around partial/incomplete
> > results.  The one command that this warning idea doesn't help with is
> > `grep` since it cannot safely be applied there, which potentially
> > leaves `grep` giving confusing results when users pass either
> > `--cached` or revisions, but you seem to not be concerned about that.
>
> I'm not convinced that warnings are enough for some cases

I'm not sure I'm following.  You suggested earlier in this thread that
we may want to provide a mode where commands "don't just fail if you
can't get new objects, instead inform me that the results are
incomplete".  You re-emphasized that in your most recent email by
saying "To be _absolutely sure_ that on-demand downloads don't happen,
we need an extra mode for Git and new ways of reporting partial
results."  So it sounds like you're suggesting a mode where partial
results are a forced option, because how else can you be "_absolutely
sure_ that on-demand downloads don't happen"?  And if we always want
to allow partial results, don't you need to inform users about those
results being potentially incomplete?  How exactly does one inform the
user that results are incomplete if not by a warning?  Something seems
inconsistent here, but perhaps I'm just misunderstanding something?

I think, based on what you said below, that you're uncomfortable with
certain types of incompleteness, such as partial revision results, but
are fine with others such as those dealing with partial blob results
(whether in breadth or in depth).  But if so, I'm still not sure what
your statement about warnings means.  If we scope operations down to
the sparsity paths (e.g. potentially giving a partial-breadth diff for
"git diff REV1 REV2"), what's your expectation with regards to
warnings?

>, especially
> for output that is fed to a pager. Do the warnings stick around in
> the pager? I'm not sure.

If the warning is printed on stdout, then yes the warning will stick
around in a pager.  If the warning is printed on stderr, then the
warning is likely of dubious utility since it can easily get lost.
Since log & diff output are not adversely affected by additional
preliminary output, I think stdout is where such a warning should go
(unless folks feel like we don't even need a warning?).  However, grep
would be strongly negatively affected by additional output, and that's
why I've stated several times that warnings cannot reasonably be
included with grep.

But, so far, no one has expressed concern with providing partial
results for grep even if no warning can be given, so perhaps it
doesn't matter.

> > I'm also curious if the problem partially stems from the fact that
> > with `git log` there is no way to control revision limiting and diff
> > generation paths independently.  If there was a way to make `git log
> > -p` continue showing the regular list of commits but restrict which
> > paths were shown in the diffs, and we made the --scope-sparse handling
> > do this so that only diffs were limited but not the revisions
> > traversed/printed, would that help address your concerns?
>
> My biggest issue is with the idea of simplifying the commit history
> based on the sparse-checkout path definitions. The '-p' option having
> a diff scoped to the sparse-checkout paths would be fine.

Wahoo!  Sounds like we have a path forward then.  I'll update the
document in my patch to reflect this distinction.

Note that it's not just the -p option to log, though, but anything
related to patches: diff formatting, diff filtering, rename & copy
detection, and pickaxe-related options.  The one place where the
scoping to sparse-checkouts is slightly funny for `git log` is with
--remerge-diff (because the merge machinery ignores sparsity patterns
when generating the new toplevel tree; however after the new toplevel
tree is generated, we would generate a diff that is limited to the
sparsity patterns).

[...]
> > Also, blame incorporates a component of changes from the worktree, but
> > it's mostly about history (and one or more -C's make it check other
> > paths as well).
>
> Since each input is a specific file path, I'm not sure we need
> anything here except perhaps a warning that they are requesting
> a file outside the sparse-checkout definition (if even that).

Your statement seems to suggest you are assuming that git blame will
only operate on the path listed on the command line.  Am I reading
your assumption correctly, or am I totally misunderstanding why you
would claim nothing is needed beyond a warning about the path the user
typed?  If I'm understanding your assumption correctly, your
assumption does not hold when one or more -C options are passed.
Since my earlier mentions of those options and their ramification
didn't connect, perhaps it would help if I was a bit more explicit
about what I mean.  Let's take a simple example, in git.git, which you
can run right now:

   git blame -C -C cache.h

This command will show lines of text that now appear in cache.h but
which came *from* all of these files:

    * builtin/clean.c
    * cache.h
    * merge-recursive.h
    * notes.c
    * object-file.c
    * object.h
    * read-cache.c
    * setup.c
    * sha1-file.c
    * sha1_file.c
    * sha1_name.c
    * show-diff.c
    * symlinks.c
    * tree-walk.h

In order to find out and report that the current lines of cache.h came
from these other files, blame has to search a wide range of other
files in the repository.  That potential wide range of other files in
the repository is something we could consider tailoring when in a
sparse-checkout, at least for Behavior A folks.

[...]
> I think the following things are true:
>
> 1. It's really important to keep the current partial clone default of
>    only downloading blobs on-demand. Even with a limited sparse-checkout,
>    it's rare that users will need every version of every file in that
>    sparse-checkout, and they may not want that tax on their local storage.

I do agree we need to keep these in mind for some usecases, but I do
not agree these are universally true among sparse-checkout users.
However, our differences on this probably don't matter in practice
since you then immediately suggested...

> 2. Adding an opt-in backfill for a sparse-checkout definition will
>    prevent most on-demand downloads (although it might want to be
>    integrated into 'git fetch' behind an option to be really sure that
>    state continues in the future).

Yes, this would be great.  One question, though: integrated with
`fetch` or with `sparse-checkout set|add`?  If users adjust their
sparse-checkout definition, that might be a good time to allow them to
automatically trigger fixing the missing backfill at the same time.

> 3. Updating Git features to scope down to sparse-checkout will prevent
>    many of the remaining on-demand downloads.

Yes, though I'd clarify "scope down to sparse-checkout where it can
make sense".  Things like merge & bundle have to pay attention to
changes outside the sparse-checkout, but we can get commands like
diff/log -p/grep to scope down in breadth.

> 4. To be _absolutely sure_ that on-demand downloads don't happen, we
>    need an extra mode for Git and new ways of reporting partial results.
>    Without this mode, Git commands fail when triggering an on-demand
>    download and the network is unavailable.

While many commands might be able to produce partial results
realistically, I think things like merge & bundle should not support
such a mode and just fail if they are missing any data they normally
need.  Basically, we'd still have commands that would fail without a
network connection beyond push/pull/fetch, but this mode would limit
the list as much as possible through allowing commands to limit both
breadth and depth of the blobs we act upon.

> So, I'm saying that (4) is a direction that we could go. It also seems
> extremely difficult to do, so we should do (2) & (3) first, which will
> get us 99% of the way there.

Agreed on all three counts.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v3] sparse-checkout.txt: new document with sparse-checkout directions
  2022-09-28  8:32 ` [PATCH v2] " Elijah Newren via GitGitGadget
@ 2022-10-08 22:52   ` Elijah Newren via GitGitGadget
  2022-11-06  6:04     ` [PATCH v4] " Elijah Newren via GitGitGadget
  0 siblings, 1 reply; 42+ messages in thread
From: Elijah Newren via GitGitGadget @ 2022-10-08 22:52 UTC (permalink / raw)
  To: git
  Cc: Victoria Dye, Derrick Stolee, Shaoxuan Yuan, Matheus Tavares,
	ZheNing Hu, Elijah Newren, Glen Choo, Martin von Zweigbergk,
	Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

Once upon a time, Matheus wrote some patches to make
   git grep [--cached | <REVISION>] ...
restrict its output to the sparsity specification when working in a
sparse checkout[1].  That effort got derailed by two things:

  (1) The --sparse-index work just beginning which we wanted to avoid
      creating conflicts for
  (2) Never deciding on flag and config names and planned high level
      behavior for all commands.

More recently, Shaoxuan implemented a more limited form of Matheus'
patches that only affected --cached, using a different flag name,
but also changing the default behavior in line with what Matheus did.
This again highlighted the fact that we never decided on command line
flag names, config option names, and the big picture path forward.

The --sparse-index work has been mostly complete (or at least released
into production even if some small edges remain) for quite some time
now.  We have also had several discussions on flag and config names,
though we never came to solid conclusions.  Stolee once upon a time
suggested putting all these into some document in
Documentation/technical[3], which Victoria recently also requested[4].
I'm behind the times, but here's a patch attempting to finally do that.

[1] https://lore.kernel.org/git/5f3f7ac77039d41d1692ceae4b0c5df3bb45b74a.1612901326.git.matheus.bernardino@usp.br/
    (See his second link in that email in particular)
[2] https://lore.kernel.org/git/20220908001854.206789-2-shaoxuan.yuan02@gmail.com/
[3] https://lore.kernel.org/git/CABPp-BHwNoVnooqDFPAsZxBT9aR5Dwk5D9sDRCvYSb8akxAJgA@mail.gmail.com/
    (Scroll to the very end for the final few paragraphs)
[4] https://lore.kernel.org/git/cafcedba-96a2-cb85-d593-ef47c8c8397c@github.com/

Signed-off-by: Elijah Newren <newren@gmail.com>
---
    [RFC] sparse-checkout.txt: new document with sparse-checkout directions
    
    I think we're starting to converge on actual proposals; there's some
    areas we've agreed on, others we've compromised on, and some we've just
    figured out what the others were saying. The discussion has been very
    illuminating; thanks to everyone who has chimed in. I've tried to take
    my best stab at cleaning up and culling things that don't need to remain
    as open questions, but if I've mis-represented anyone or missed
    something, don't hesitate to speak up. Everything is still open for
    debate, even if not marked as a currently open question.
    
    Changes since v2:
    
     * Compromised with Stollee on log -- Behavior A only affects
       patch-related operations, not revision walking
     * Incorporated Junio's suggestions about untracked file handling
     * Added new usecases, one brought up by Martin, one by Stolee
     * Added new sections:
       * Usecases of primary concern
       * Oversimplified mental models ("Cliff Notes" for this document!)
     * Recategorization of a few commands based on discussion
     * Greater details on how index operations work under Behavior A, to
       avoid weird edge cases
     * Extended explanation of the sparse specification, particularly when
       index differs from HEAD
     * Switched proposed flag names to --scope={sparse,all} to avoid binary
       flags that are hard to extend
     * Switched proposed config option name (still need good values and
       descriptions for it, though)
     * Removed questions we seemed to have agreement on. Modified/extended
       some existing questions.
     * Added Stolee's sparse-backfill ideas to the plans
     * Additional Known bugs
     * Various wording improvements
     * Possibly other things I've missed.
    
    Changes since v1:
    
     * Added new sections:
       * "Terminology"
       * "Behavior classes"
       * "Sparse specification vs. sparsity patterns"
     * Tried to shuffle commands from unknown into appropriate sections
       based on feedback, but I got some conflicting feedback, so...who
       knows if thing are in the right place
     * More consistency in using "sparse specification" over other terms
     * Extra comments about how add/rm/mv operate on moving files across the
       tracked/untracked boundary
     * --restrict-but-warn should have been "restrict or error", but
       reworded even more heavily as part of "Behavior classes" section
     * Added extra questions based on feedback (--no-expand, update-index
       stuff, apply --index)
     * More details on apply/am bugs
     * Documented read-tree issue
     * A few cases of fixing line wrapping at <=80 chars
     * Added more alternate name suggestions for options instead of
       --[no-]restrict

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1367%2Fnewren%2Fsparse-checkout-directions-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1367/newren/sparse-checkout-directions-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/1367

Range-diff vs v2:

 1:  d20e63206dc ! 1:  5923e75195c sparse-checkout.txt: new document with sparse-checkout directions
     @@ Commit message
          Documentation/technical[3], which Victoria recently also requested[4].
          I'm behind the times, but here's a patch attempting to finally do that.
      
     -    Note that the "Implementation Questions" section is pretty large,
     -    reflecting the fact that this is perhaps more RFC than proposal.
     -
          [1] https://lore.kernel.org/git/5f3f7ac77039d41d1692ceae4b0c5df3bb45b74a.1612901326.git.matheus.bernardino@usp.br/
              (See his second link in that email in particular)
          [2] https://lore.kernel.org/git/20220908001854.206789-2-shaoxuan.yuan02@gmail.com/
     @@ Documentation/technical/sparse-checkout.txt (new)
      +
      +  * Terminology
      +  * Purpose of sparse-checkouts
     ++  * Usecases of primary concern
     ++  * Oversimplified mental models ("Cliff Notes" for this document!)
      +  * Desired behavior
      +  * Behavior classes
      +  * Subcommand-dependent defaults
     @@ Documentation/technical/sparse-checkout.txt (new)
      +SKIP_WORKTREE: When tracked files do not match the sparse specification and
      +	are removed from the working tree, the file in the index is marked
      +	with a SKIP_WORKTREE bit.  Note that if a tracked file has the
     -+	SKIP_WORKTREE bit set but is later written by the user to the
     -+	working tree anyway, the SKIP_WORKTREE bit will be cleared at the
     -+	beginning of any Git operation.
     ++	SKIP_WORKTREE bit set but the file is later written by the user to
     ++	the working tree anyway, the SKIP_WORKTREE bit will be cleared at
     ++	the beginning of any Git operation.
      +
      +	Most sparse checkout users are unaware of this implementation
      +	detail, and the term should generally be avoided in user-facing
      +	descriptions and command flags.  Unfortunately, prior to the
     -+	`sparse-checkout` subcommand these low-level details were exposed,
     -+	and as of time of writing, still are in various places.
     ++	`sparse-checkout` subcommand this low-level detail was exposed,
     ++	and as of time of writing, is still exposed in various places.
      +
      +sparse-checkout: a subcommand in git used to reduce the files present in
      +	the working tree to a subset of all tracked files.  Also, the
     @@ Documentation/technical/sparse-checkout.txt (new)
      +
      +sparse cone: see cone mode
      +
     -+sparse directory: An entry in the index corresponding to a directory
     -+	rather, and used to replace all files under that directory that
     -+	would normally appear in the index.  See also sparse-index.
     -+	Something that can cause confusion is that the "sparse
     -+	directory" does NOT match the sparse specification, i.e. the
     -+	directory is NOT present in the working tree.
     ++sparse directory: An entry in the index corresponding to a directory, which
     ++	appears in the index instead of all the files under that directory
     ++	that would normally appear.  See also sparse-index.  Something that
     ++	can cause confusion is that the "sparse directory" does NOT match
     ++	the sparse specification, i.e. the directory is NOT present in the
     ++	working tree.  May be renamed in the future (e.g. to "skipped
     ++	directory").
      +
      +sparse index: A special mode for sparse-checkout that also makes the
      +	index sparse by recording a directory entry in lieu of all the
     @@ Documentation/technical/sparse-checkout.txt (new)
      +sparsity patterns: patterns from $GIT_DIR/info/sparse-checkout used to
      +	define the set of files of interest.  A warning: It is easy to
      +	over-use this term (or the shortened "patterns" term), for two
     -+	reasons (1) users in cone mode specify directories rather
     -+	than patterns (their directories are transformed into patterns,
     -+	but users may think you are talking about non-cone mode if you
     -+	use the word "patterns"), and (b) the sparse specification might
     -+	transiently differ in the working tree from the sparsity
     ++	reasons: (1) users in cone mode specify directories rather than
     ++	patterns (their directories are transformed into patterns, but
     ++	users may think you are talking about non-cone mode if you use the
     ++	word "patterns"), and (b) the sparse specification might
     ++	transiently differ in the working tree or index from the sparsity
      +	patterns (see "Sparse specification vs. sparsity patterns").
      +
     -+sparse specification: The set of paths in the user's area of focus.  When
     -+	interacting with the working tree, this is the set of tracked files
     -+	present in the working copy or with a clear SKIP_WORKTREE bit.
     -+	When working with history, this is the set of files matching the
     -+	sparsity patterns.  Usually the tracked files present in the
     -+	working copy are precisely the set of tracked files matching
     -+	sparsity patterns, but they can temporarily differ.  (See also
     -+	"Sparse specification vs. sparsity patterns")
     ++sparse specification: The set of paths in the user's area of focus.  This
     ++	is typically just the tracked files that match the sparsity
     ++	patterns, but the sparse specification can temporarily differ and
     ++	include additional files.  (See also "Sparse specification
     ++	vs. sparsity patterns")
     ++
     ++	* When working with history, the sparse specification is exactly
     ++	  the set of files matching the sparsity patterns.
     ++	* When interacting with the working tree, the sparse specification
     ++	  is the set of tracked files with a clear SKIP_WORKTREE bit or
     ++	  tracked files present in the working copy.
     ++	* When modifying or showing results from the index, the sparse
     ++	  specification is the set of files with a clear SKIP_WORKTREE bit
     ++	  or that differ in the index from HEAD.
     ++	* If working with the index and the working copy, the sparse
     ++	  specification is the union of the paths from above.
      +
     -+vivifying: When a command restores a tracked file to the working tree
     -+	(and clearing the SKIP_WORKTREE bit in the index), this is
     -+	referred to as "vivifying" the file.
     ++vivifying: When a command restores a tracked file to the working tree (and
     ++	hopefully also clears the SKIP_WORKTREE bit in the index for that
     ++	file), this is referred to as "vivifying" the file.
      +
      +
      +=== Purpose of sparse-checkouts ===
     @@ Documentation/technical/sparse-checkout.txt (new)
      +sparse-checkouts exist to allow users to work with a subset of their
      +files.
      +
     -+You can think of sparse-checkouts as subdividing "tracked" files into
     -+two categories -- a sparse subset, and all the rest.
     -+Implementationally, we mark "all the rest" with SKIP_WORKTREE.  The
     -+SKIP_WORKTREE files are still tracked, just not present in the working
     -+tree.
     ++You can think of sparse-checkouts as subdividing "tracked" files into two
     ++categories -- a sparse subset, and all the rest.  Implementationally, we
     ++mark "all the rest" in the index with a SKIP_WORKTREE bit and leave them
     ++out of the working tree.  The SKIP_WORKTREE files are still tracked, just
     ++not present in the working tree.
      +
      +In the past, sparse-checkouts were defined by "SKIP_WORKTREE means the file
     -+is missing from the working tree but pretend the file matches HEAD".  That
     -+was a low-level detail which provided decent behavior for a few commands,
     -+but which had a surprising number of ways in which it violated user
     -+expectations and was a bad mental model.  However, it persisted for many
     -+years and may still be found in some corners of the code base.
     ++is missing from the working tree but pretend the file contents match HEAD".
     ++That was not only bogus (it actually meant the file missing from the
     ++working tree matched the index rather than HEAD), but it was also a
     ++low-level detail which only provided decent behavior for a few commands.
     ++There were a surprising number of ways in which that guiding principle gave
     ++command results that violated user expectations, and as such was a bad
     ++mental model.  However, it persisted for many years and may still be found
     ++in some corners of the code base.
      +
      +Anyway, the idea of "working with a subset of files" is simple enough, but
     -+there are two different high-level usecases which affect how some Git
     ++there are multiple different high-level usecases which affect how some Git
      +subcommands should behave.  Further, even if we only considered one of
     -+those usecases, sparse-checkouts modify different subcommands in over a
     ++those usecases, sparse-checkouts can modify different subcommands in over a
      +half dozen different ways.  Let's start by considering the high level
      +usecases:
      +
     @@ Documentation/technical/sparse-checkout.txt (new)
      +
      +  B) Users want a sparse working tree, but are working in a larger whole
      +
     -+It may be worth explaining both of these in a bit more detail:
     ++  C) sparse-checkout is a behind-the-scenes implementation detail allowing
     ++     Git to work with a specially crafted in-house virtual file system;
     ++     users are actually working with a "full" working tree that is
     ++     lazily populated, and sparse-checkout helps with the lazy population
     ++     piece.
     ++
     ++  A*) Users are _only_ interested in the sparse portion of the repo that
     ++      they have downloaded so far (a variant on the first usecase)
     ++
     ++
     ++It may be worth explaining each of these in a bit more detail:
     ++
      +
      +  (Behavior A) Users are _only_ interested in the sparse portion of the repo
      +
     @@ Documentation/technical/sparse-checkout.txt (new)
      +usability annoyance, potentially a huge one since other changes in
      +history may dwarf the changes they are interested in.
      +
     -+Some of these users also arrive at this usecase from wanting to use
     -+partial clones together with sparse checkouts and do disconnected
     -+development.  Not only do these users generally not care about other
     -+parts of the repository, but consider it a blocker for Git commands to
     -+try to operate on those.  If commands attempt to access paths in history
     -+outside the sparsity specification, then the partial clone will attempt
     -+to download additional blobs on demand, fail, and then fail the user's
     -+command.  (This may be unavoidable in some cases, e.g. when `git merge`
     -+has non-trivial changes to reconcile outside the sparse specification,
     -+but we should limit how often users are forced to connect to the
     -+network.)
     ++Some of these users also arrive at this usecase from wanting to use partial
     ++clones together with sparse checkouts (in a way where they have downloaded
     ++blobs within the sparse specification) and do disconnected development.
     ++Not only do these users generally not care about other parts of the
     ++repository, but consider it a blocker for Git commands to try to operate on
     ++those.  If commands attempt to access paths in history outside the sparsity
     ++specification, then the partial clone will attempt to download additional
     ++blobs on demand, fail, and then fail the user's command.  (This may be
     ++unavoidable in some cases, e.g. when `git merge` has non-trivial changes to
     ++reconcile outside the sparse specification, but we should limit how often
     ++users are forced to connect to the network.)
      +
      +Also, even for users using partial clones that do not mind being
      +always connected to the network, the need to download blobs as
     @@ Documentation/technical/sparse-checkout.txt (new)
      +depend on your "point of view."
      +
      +People might also end up wanting behavior B due to complex inter-project
     -+dependencies.  The initial attempts to use sparse-checkouts usually
     -+involve the directories you are directly interested in plus what those
     -+directories depend upon within your repository.  But there's a monkey
     -+wrench here: if you have integration tests, they invert the hierarchy:
     -+to run integration tests, you need not only what you are interested in
     -+and its dependencies, you also need everything that depends upon what
     -+you are interested in or that depends upon one of your
     -+dependencies...AND you need all the dependencies of that expanded group.
     -+That can easily change your sparse-checkout into a nearly dense one.
     -+
     -+Naturally, that tends to kill the benefits of sparse-checkouts.  There
     -+are a couple solutions to this conundrum: either avoid grabbing
     -+dependencies (maybe have built versions of your dependencies pulled from
     -+a CI cache somewhere), or say that users shouldn't run integration tests
     -+directly and instead do it on the CI server when they submit a code
     -+review.  Or do both.  Regardless of whether you stub out your
     ++dependencies.  The initial attempts to use sparse-checkouts usually involve
     ++the directories you are directly interested in plus what those directories
     ++depend upon within your repository.  But there's a monkey wrench here: if
     ++you have integration tests, they invert the hierarchy: to run integration
     ++tests, you need not only what you are interested in and its in-tree
     ++dependencies, you also need everything that depends upon what you are
     ++interested in or that depends upon one of your dependencies...AND you need
     ++all the in-tree dependencies of that expanded group.  That can easily
     ++change your sparse-checkout into a nearly dense one.
     ++
     ++Naturally, that tends to kill the benefits of sparse-checkouts.  There are
     ++a couple solutions to this conundrum: either avoid grabbing in-repo
     ++dependencies (maybe have built versions of your in-repo dependencies pulled
     ++from a CI cache somewhere), or say that users shouldn't run integration
     ++tests directly and instead do it on the CI server when they submit a code
     ++review.  Or do both.  Regardless of whether you stub out your in-repo
      +dependencies or stub out the things that depend upon you, there is
     -+certainly a reason to want to query and be aware of those other
     -+stubbed-out parts of the repository, particularly when the dependencies
     -+are complex or change relatively frequently.  Thus, for such uses,
     -+sparse-checkouts can be used to limit what you directly build and
     -+modify, but these users do not necessarily want their sparse checkout
     -+paths to limit their queries of history.
     -+
     -+Some people may also be interested in behavior B simply as a performance
     -+workaround: if they are using non-cone mode, then they have to deal with
     -+its inherent quadratic performance problems.  In that mode, every
     -+operation that checks whether paths match the sparsity specification can
     -+be expensive.  As such, these users may only be willing to pay for those
     -+expensive checks when interacting with the working copy, and may prefer
     -+getting "unrelated" results from their history queries over having slow
     -+commands.
     ++certainly a reason to want to query and be aware of those other stubbed-out
     ++parts of the repository, particularly when the dependencies are complex or
     ++change relatively frequently.  Thus, for such uses, sparse-checkouts can be
     ++used to limit what you directly build and modify, but these users do not
     ++necessarily want their sparse checkout paths to limit their queries of
     ++versions in history.
     ++
     ++Some people may also be interested in behavior B over behavior A simply as
     ++a performance workaround: if they are using non-cone mode, then they have
     ++to deal with its inherent quadratic performance problems.  In that mode,
     ++every operation that checks whether paths match the sparsity specification
     ++can be expensive.  As such, these users may only be willing to pay for
     ++those expensive checks when interacting with the working copy, and may
     ++prefer getting "unrelated" results from their history queries over having
     ++slow commands.
     ++
     ++  (Behavior C) sparse-checkout is an implementational detail supporting a
     ++	       special VFS.
     ++
     ++This usecase goes slightly against the traditional definition of
     ++sparse-checkout in that it actually tries to present a full or dense
     ++checkout to the user.  However, this usecase utilizes the same underlying
     ++technical underpinnings in a new way which does provide some performance
     ++advantages to users.  The basic idea is that a company can have an in-house
     ++Git-aware Virtual File System which pretends all files are present in the
     ++working tree, by intercepting all file system accesses and using those to
     ++fetch and write accessed files on demand via partial clones.  The VFS uses
     ++sparse-checkout to prevent Git from writing or paying attention to many
     ++files, and manually updates the sparse checkout patterns itself based on
     ++user access and modification of files in the working tree.  See commit
     ++ecc7c8841d ("repo_read_index: add config to expect files outside sparse
     ++patterns", 2022-02-25) and the link at [17] for a more detailed description
     ++of such a VFS.
     ++
     ++The biggest difference here is that users are completely unaware that the
     ++sparse-checkout machinery is even in use.  The sparse patterns are not
     ++specified by the user but rather are under the complete control of the VFS
     ++(and the patterns are updated frequently and dynamically by it).  The user
     ++will perceive the checkout as dense, and commands should thus behave as if
     ++all files are present.
     ++
     ++  (Behavior A*) Users are _only_ interested in the sparse portion of the repo
     ++      that they have downloaded so far (a variant on the first usecase)
     ++
     ++This variant is driven by folks who using partial clones together with
     ++sparse checkouts and do disconnected development (so far sounding like a
     ++subset of behavior A users) and doing so on very large repositories.  The
     ++reason for yet another variant is that downloading even just the blobs
     ++through history within their sparse specification may be too much, so they
     ++only download some.  They would still like operations to succeed without
     ++network connectivity, though, so things like `git log -S${SEARCH_TERM} -p`
     ++or `git grep ${SEARCH_TERM} OLDREV ` would need to be prepared to provide
     ++partial results.
     ++
     ++This variant could be viewed as Behavior A with the sparse specification
     ++for history querying operations modified from "sparsity patterns" to
     ++"sparsity patterns limited to the blobs we have already downloaded".
     ++
     ++
     ++=== Usecases of primary concern ===
     ++
     ++Most of the rest of this document will focus on the first two usecases:
     ++Behavior A and Behavior B.  Some notes about the other two cases and why we
     ++are not focusing on them:
     ++
     ++  (Behavior A*)
     ++
     ++Supporting this usecase is estimated to be difficult and a lot of work.
     ++There are no plans to implement it currently, but it may be a potential
     ++future alternative.  Knowing about the existence of additional alternatives
     ++may affect our choice of command line flags (e.g. if we need tri-state or
     ++quad-state flags rather than just binary flags), so it was still important
     ++to at least note.
     ++
     ++Further, I believe the descriptions below for Behavior A are probably still
     ++valid for this usecase, with the only exception being that it redefines the
     ++sparse specification to restrict it to already-downloaded blobs.  The hard
     ++part is in making commands capable of respecting that modified definition.
     ++
     ++  (Behavior C)
     ++
     ++This usecase violates some of the early sparse-checkout documented
     ++assumptions (since files marked as SKIP_WORKTREE will be displayed to users
     ++as present in the working tree).  That violation may mean various
     ++sparse-checkout related behaviors are not well suited to this usecase and
     ++we may need tweaks -- to both documentation and code -- to handle it.
     ++However, this usecase is also perhaps the simplest model to support in that
     ++everything behaves like a dense checkout with a few exceptions (e.g. branch
     ++checkouts and switches write fewer things, knowing the VFS will lazily
     ++write the rest on an as-needed basis).
     ++
     ++Since there is no publically available VFS-related code for folks to try,
     ++the number of folks who can test such a usecase is limited.
     ++
     ++The primary reason to note the Behavior C usecase is that as we fix things
     ++to better support Behaviors A and B, there may be additional places where
     ++we need to make tweaks allowing folks in this usecase to get the original
     ++non-sparse treatment.  For an example, see ecc7c8841d ("repo_read_index:
     ++add config to expect files outside sparse patterns", 2022-02-25).  The
     ++secondary reason to note Behavior C, is so that folks taking advantage of
     ++Behavior C do not assume they are part of the Behavior B camp and propose
     ++patches that break things for the real Behavior B folks.
     ++
     ++
     ++=== Oversimplified mental models ===
     ++
     ++An oversimplification of the differences in the above behaviors is:
     ++
     ++  Behavior A: Restrict worktree and history operations to sparse specification
     ++  Behavior B: Restrict worktree operations to sparse specification; have any
     ++	      history operations work across all files
     ++  Behavior C: Do not restrict either worktree or history operations to the
     ++	      sparse specification...with the exception of branch checkouts or
     ++	      switches which avoid writing files that will match the index so
     ++	      they can later lazily be populated instead.
      +
      +
      +=== Desired behavior ===
      +
     -+As noted in the previous section, despite the simple idea of just
     -+working with a subset of files, there are a range of different
     -+behavioral changes that need to be made to different subcommands to work
     -+well with such a feature.  See [1,2,3,4,5,6,7,8,9,10] for various
     -+examples.  In particular, at [2], we saw that mere composition of other
     -+commands that individually worked correctly in a sparse-checkout context
     -+did not imply that the higher level command would work correctly; it
     -+sometimes requires further tweaks.  So, understanding these differences
     -+can be beneficial.
     ++As noted previously, despite the simple idea of just working with a subset
     ++of files, there are a range of different behavioral changes that need to be
     ++made to different subcommands to work well with such a feature.  See
     ++[1,2,3,4,5,6,7,8,9,10] for various examples.  In particular, at [2], we saw
     ++that mere composition of other commands that individually worked correctly
     ++in a sparse-checkout context did not imply that the higher level command
     ++would work correctly; it sometimes requires further tweaks.  So,
     ++understanding these differences can be beneficial.
      +
      +* Commands behaving the same regardless of high-level use-case
      +
     @@ Documentation/technical/sparse-checkout.txt (new)
      +      * read-tree
      +      * reset --hard
      +
     -+      * `restore` & the restore-like half of `checkout` SHOULD be in this above
     -+	category, but are buggy (see the "Known bugs" section below)
     -+
     -+  * commands that write conflicted files to the working tree, but otherwise will
     -+    omit writing files that do not match the sparsity patterns:
     ++  * commands that write conflicted files to the working tree, but otherwise
     ++    will omit writing files to the working tree that do not match the
     ++    sparsity patterns:
      +
      +      * merge
      +      * rebase
      +      * cherry-pick
      +      * revert
      +
     -+      * `am` and `apply --index` should probably be in this section but are buggy
     -+	(see the "Known bugs" section below)
     ++      * `am` and `apply --cached` should probably be in this section but
     ++	are buggy (see the "Known bugs" section below)
      +
     -+    Note that this somewhat depends upon the merge strategy being used:
     ++    The behavior for these commands somewhat depends upon the merge
     ++    strategy being used:
      +      * `ort` behaves as described above
      +      * `recursive` tries to not vivify files unnecessarily, but does sometimes
      +	vivify files without conflicts.
      +      * `octopus` and `resolve` will always vivify any file changed in the merge
      +	relative to the first parent, which is rather suboptimal.
      +
     ++    It is also important to note that these commands WILL update the index
     ++    outside the sparse specification relative to when the operation began,
     ++    BUT these commands often make a commit just before or after such that
     ++    by the end of the operation there is no change to the index outside the
     ++    sparse specification.  Of course, if the operation hits conflicts or
     ++    does not make a commit, then these operations clearly can modify the
     ++    index outside the sparse specification.
     ++
     ++    Finally, it is important to note that at least the first four of these
     ++    commands also try to remove differences between the sparse
     ++    specification and the sparsity patterns (much like the commands in the
     ++    previous section).
     ++
      +  * commands that always ignore sparsity since commits must be full-tree
      +
      +      * archive
     @@ Documentation/technical/sparse-checkout.txt (new)
      +      * stash
      +      * apply (without `--index` or `--cached`)
      +
     -+* Commands that differ for behavior A vs. behavior B:
     ++* Commands that may slightly differ for behavior A vs. behavior B:
     ++
     ++  Commands in this category behave mostly the same between the two
     ++  behaviors, but may differ in verbosity and types of warning and error
     ++  messages.
      +
      +  * commands that make modifications to which files are tracked:
      +      * add
     @@ Documentation/technical/sparse-checkout.txt (new)
      +    may need to ignore the sparse specification by its nature.  Also, its
      +    current --[no-]ignore-skip-worktree-entries default is totally bogus.
      +
     ++  * commands for manually tweaking paths in both the index and the working tree
     ++      * `restore`
     ++      * the restore-like half of `checkout`
     ++
     ++    These commands should be similar to add/rm/mv in that they should
     ++    only operate on the sparse specification by default, and require a
     ++    special flag to operate on all files.
     ++
     ++    Also, note that these commands currently have a number of issues (see
     ++    the "Known bugs" section below)
     ++
     ++* Commands that significantly differ for behavior A vs. behavior B:
     ++
      +  * commands that query history
      +      * diff (with --cached or REVISION arguments)
      +      * grep (with --cached or REVISION arguments)
      +      * show (when given commit arguments)
     -+      * bisect
     -+      * blame (only matters when one or more -C flags passed)
     ++      * blame (only matters when one or more -C flags are passed)
      +	* and annotate
      +      * log
     -+	* and variants: shortlog, gitk, show-branch, whatchanged, rev-list
     ++      * whatchanged
      +      * ls-files
      +      * diff-index
      +      * diff-tree
      +      * ls-tree
      +
     ++    Note: for log and whatchanged, only patch related parts are affected by
     ++    scoping the command to the sparse-checkout; the revision walking is
     ++    unaffected.  (The fact that revision walking is unaffected is why
     ++    rev-list, shortlog, show-branch, and bisect are not in this list.)
     ++
      +    ls-files may be slightly special in that e.g. `git ls-files -t` is
      +    often used to see what is sparse and what is not.  Perhaps -t should
      +    always work on the full tree?
     @@ Documentation/technical/sparse-checkout.txt (new)
      +
      +* Commands unaffected by sparse-checkouts
      +
     ++  * shortlog
     ++  * show-branch
     ++  * rev-list
     ++  * bisect
     ++
      +  * branch
      +  * describe
      +  * fetch
     @@ Documentation/technical/sparse-checkout.txt (new)
      +
      +  * merge-file
      +  * merge-index
     ++  * gitk?
      +
      +
     -+=== Behavior classes ====
     ++=== Behavior classes ===
      +
      +From the above there are a few classes of behavior:
      +
      +  * "restrict"
      +
     -+    Commands in this class only read or write files within the sparse
     -+    specification.  Some of these commands may also attempt, at the end of
     -+    their operation, to cull transient differences between the sparse
     -+    specification and the sparsity patterns (see "Sparse specification
     -+    vs. sparsity patterns" for details, but this basically means either
     -+    removing unmodified files not matching the sparsity patterns and
     -+    marking those files as SKIP_WORKTREE, or vivifying files that match the
     -+    sparsity patterns and marking those files as !SKIP_WORKTREE).
     ++    Commands in this class only read or write files in the working tree
     ++    within the sparse specification.
     ++
     ++    When moving to a new commit (e.g. switch, reset --hard), these commands
     ++    may update index files outside the sparse specification as of the start
     ++    of the operation, but by the end of the operation those index files
     ++    will match HEAD again and thus those files will again be outside the
     ++    sparse specification.
     ++
     ++    When paths are explicitly specified, these paths are intersected with
     ++    the sparse specification and will only operate on such paths.
     ++    (e.g. `git restore [--staged] -- '*.png'`, `git reset -p -- '*.md'`)
     ++
     ++    Some of these commands may also attempt, at the end of their operation,
     ++    to cull transient differences between the sparse specification and the
     ++    sparsity patterns (see "Sparse specification vs. sparsity patterns" for
     ++    details, but this basically means either removing unmodified files not
     ++    matching the sparsity patterns and marking those files as
     ++    SKIP_WORKTREE, or vivifying files that match the sparsity patterns and
     ++    marking those files as !SKIP_WORKTREE).
      +
      +  * "restrict modulo conflicts"
      +
      +    Commands in this class generally behave like the "restrict" class,
      +    except that:
     -+      (1) they ignore the sparse specification in terms of updates to the
     -+	  index, though they'll preserve or update the SKIP_WORKTREE bit
     -+	  for files as needed to follow the sparsity patterns.
     -+      (2) they will ignore the sparse specification and write files with
     ++      (1) they will ignore the sparse specification and write files with
      +	  conflicts to the working tree (thus temporarily expanding the
      +	  sparse specification to include such files.)
     ++      (2) they are grouped with commands which move to a new commit, since
     ++	  they often create a commit and then move to it, even though we
     ++	  know there are many exceptions to moving to the new commit.  (For
     ++	  example, the user may rebase a commit that becomes empty, or have
     ++	  a cherry-pick which conflicts, or a user could run `merge
     ++	  --no-commit`, and we also view `apply --index` kind of like `am
     ++	  --no-commit`.)  As such, these commands can make changes to index
     ++	  files outside the sparse specification, though they'll mark such
     ++	  files with SKIP_WORKTREE.
      +
      +  * "restrict also specially applied to untracked files"
      +
     @@ Documentation/technical/sparse-checkout.txt (new)
      +desired behavior :
      +
      +  * Commands defaulting to "restrict":
     -+    * status
     ++    * diff-files
      +    * diff (without --cached or REVISION arguments)
      +    * grep (without --cached or REVISION arguments)
      +    * switch
      +    * checkout (the switch-like half)
     -+    * read-tree
     -+    * reset (--hard)
     -+    * restore/checkout
     ++    * reset (<commit>)
     ++
     ++    * restore
     ++    * checkout (the restore-like half)
      +    * checkout-index
     -+    * diff-files
     ++    * reset (with pathspec)
      +
      +    This behavior makes sense; these interact with the working tree.
      +
     @@ Documentation/technical/sparse-checkout.txt (new)
      +    * revert
      +
      +    * am
     -+    * apply --index
     ++    * apply --index (which is kind of like an `am --no-commit`)
      +
     -+    These also interact with the working tree, but require slightly different
     -+    behavior so that conflicts can be resolved.
     ++    * read-tree (especially with -m or -u; is kind of like a --no-commit merge)
     ++    * reset (<tree-ish>, due to similarity to read-tree)
     ++
     ++    These also interact with the working tree, but require slightly
     ++    different behavior either so that (a) conflicts can be resolved or (b)
     ++    because they are kind of like a merge-without-commit operation.
      +
      +    (See also the "Known bugs" section below regarding `am` and `apply`)
      +
     @@ Documentation/technical/sparse-checkout.txt (new)
      +    * add
      +    * rm
      +    * mv
     ++    * update-index
     ++    * status
     ++    * clean (?)
      +
      +    Our original implementation for these commands was "no restrict", but
      +    it had some severe usability issues:
     @@ Documentation/technical/sparse-checkout.txt (new)
      +    would silently do nothing.  We should instead print an error in those
      +    cases to get usability right.
      +
     ++    update-index needs to be updated to match, and status and maybe clean
     ++    also need to be updated to specially handle untracked paths.
     ++
      +    There may be a difference in here between behavior A and behavior B in
      +    terms of verboseness of errors or additional warnings.
      +
     @@ Documentation/technical/sparse-checkout.txt (new)
      +    Note that two of these commands -- diff and grep -- also appeared in a
      +    different list with a default of "restrict", but only when limited to
      +    searching the working tree.  The working tree vs. history distinction
     -+    is fundamental in how behavior B operates, so this is expected.
     -+
     -+    "restrict" may make more sense as the long term default for these[12],
     -+    though Stolee seems to have some reservations[17].  Also, supporting
     -+    "restrict" for these commands might be a fair amount of work to
     -+    implement, meaning it might be implemented over multiple releases.  If
     -+    that behavior were the default in the commands that supported it, that
     -+    would force behavior B users to need to learn to slowly add additional
     -+    flags to their commands, depending on git version, to get the behavior
     -+    they want.  That gradual switchover would be painful, so we should
     -+    avoid it at least until it's fully implemented.
     ++    is fundamental in how behavior B operates, so this is expected.  Note,
     ++    though, that for diff and grep with --cached, when doing "restrict"
     ++    behavior, the difference between sparse specification and sparsity
     ++    patterns is important to handle.
     ++
     ++    "restrict" may make more sense as the long term default for these[12].
     ++    Also, supporting "restrict" for these commands might be a fair amount
     ++    of work to implement, meaning it might be implemented over multiple
     ++    releases.  If that behavior were the default in the commands that
     ++    supported it, that would force behavior B users to need to learn to
     ++    slowly add additional flags to their commands, depending on git
     ++    version, to get the behavior they want.  That gradual switchover would
     ++    be painful, so we should avoid it at least until it's fully
     ++    implemented.
      +
      +
      +=== Sparse specification vs. sparsity patterns ===
     @@ Documentation/technical/sparse-checkout.txt (new)
      +     a side-effect of commands which call unpack_trees() (checkout,
      +     merge, reset, etc.).
      +   * Users can also request such transient differences be corrected via
     -+     running `git sparse-checkout reapply`.  Various places recommand
     ++     running `git sparse-checkout reapply`.  Various places recommend
      +     running that command.
      +   * Additional commands are also welcome to implicitly fix these
      +     differences; we may add more in the future.
     @@ Documentation/technical/sparse-checkout.txt (new)
      +     consult SKIP_WORKTREE anyway.
      +
      +So, a transiently expanded (or restricted) sparse specification applies to
     -+the working tree, but not to history history queries where we always use
     -+the sparsity patterns.  (See [16] for an early discussion of this.)
     ++the working tree, but not to history queries where we always use the
     ++sparsity patterns.  (See [16] for an early discussion of this.)
      +
      +Similar to a transiently expanded sparse specification of the working tree
     -+based on additional files being present in the working tree, we could also
     -+consider the concept of a transiently expanded sparse specification for the
     -+index.  In particular, if the user has staged changes to files that do not
     ++based on additional files being present in the working tree, we also need
     ++to consider additional files being modified in the index.  In particular,
     ++if the user has staged changes to files (relative to HEAD) that do not
      +match the sparsity patterns, and the file is not present in the working
     -+tree, we may still want to consider the file part of the sparse
     -+specification if we are specifically performing a query related to the
     -+index (e.g. git diff REVISION, git diff-index REVISION, etc.)
     ++tree, we still want to consider the file part of the sparse specification
     ++if we are specifically performing a query related to the index (e.g. git
     ++diff --cached [REVISION], git diff-index [REVISION], git restore --staged
     ++--source=REVISION -- PATHS, etc.)
      +
      +
      +=== Implementation Questions ===
      +
     -+  * Does the name --[no-]restrict sound good to others?  Are there better
     ++  * Do the options --scope={sparse,all} sound good to others?  Are there better
      +    options?
      +    * Names in use, or appearing in patches, or previously suggested:
      +      * --sparse/--dense
     @@ Documentation/technical/sparse-checkout.txt (new)
      +      * --scope={sparse,all}
      +      * --focus/--unfocus
      +      * --limit/--unlimited
     -+    * Rationale making me lean slightly towards --[no-]restrict:
     ++    * Rationale making me lean slightly towards --scope={sparse,all}:
      +      * We want a name that works for many commands, so we need a name that
      +	does not conflict
     -+      * --[no-]restrict isn't overly long and seems relatively explanatory
     ++      * We know that we have more than two possible usecases, so it is best
     ++	to avoid a flag that appears to be binary.
     ++      * --scope={sparse,all} isn't overly long and seems relatively
     ++	explanatory
      +      * `--sparse`, as used in add/rm/mv, is totally backwards for
      +	grep/log/etc.  Changing the meaning of `--sparse` for these
      +	commands would fix the backwardness, but possibly break existing
     @@ Documentation/technical/sparse-checkout.txt (new)
      +	which would probably be even more ridiculously long.  (But we
      +	can make --ignore-skip-worktree-bits a deprecated alias for
      +	--no-restrict.)
     -+    * BUT, as others points out, --[no-]restrict isn't very clear about what
     -+      it's restricting nor does it automatically tie in to the concept of
     -+      "sparse-checkout" in the user's mind
     -+
     -+  * Should --[no-]restrict be a git global option, or added as options to each
     -+    relevant command?  (Does that make sense given the multitude of different
     -+    default behaviors we have for different options?)
      +
     -+  * If a config option is added (core.restrictToSparsity?) what should
     -+    the values and description be?  There's a risk of confusion, because
     -+    we only want this config option to affect the history-querying
     -+    commands (log/diff/grep) and maybe the path-modifying worktree
     -+    commands (add/rm/mv), but certainly not most the others.  Previous config
     -+    suggestion here: [13]
     ++  * If a config option is added (sparse.scope?) what should the values and
     ++    description be?  "sparse" (behavior A), "worktree-sparse-history-dense"
     ++    (behavior B), "dense" (behavior C)?  There's a risk of confusion,
     ++    because even for Behaviors A and B we want some commands to be
     ++    full-tree and others to operate sparsely, so the wording may need to be
     ++    more tied to the usecases and somehow explain that.  Also, right now,
     ++    the primary difference we are focusing is just the history-querying
     ++    commands (log/diff/grep).  Previous config suggestion here: [13]
      +
      +  * Is `--no-expand` a good alias for ls-files's `--sparse` option?
     -+    (`--sparse` does not map to either `--restrict` or `--no-restrict`,
     ++    (`--sparse` does not map to either `--scope=sparse` or `--scope=all`,
      +    because in non-cone mode it does nothing and in cone-mode it shows the
      +    sparse directory entries which are technically outside the sparse
     -+    specification) Should `--restrict` be the default (does that imply that
     -+    `--no-expand` needs a `--no-restrict` or that it just partially
     -+    overrides it)?  Should `-t` imply `--no-restrict`?
     -+
     -+  * Should --ignore-skip-worktree-bits in checkout-index, checkout, and
     -+    restore be made deprecated aliases for --no-restrict?  (They have the
     -+    same meaning.)
     -+
     -+  * Should --ignore-skip-worktree-entries in update-index be made a
     -+    deprecated alias for --no-restrict?  (Or, better yet, should the
     -+    option just be nuked from orbit after flipping the default, since
     -+    the reverse option is never wanted and the sole purpose of this
     -+    option was to turn off a bug?)
     -+
     -+  * Should update-index be made like add/rm/mv with the restrict-or-error
     -+    default functionality?  If we do, should some flags like
     -+    --[no-]skip-worktree imply --no-restrict?
     -+
     -+  * Should `apply --index` preserve SKIP_WORKTREE bits for
     -+    non-conflicted files?  We normally like preserving those bits (and
     -+    it'd make git-am more like cherry-pick/rebase/merge), but `apply`
     -+    without `--index` should unconditionally clear them and it seems a
     -+    little weird for the addition of the `--index` flag to affect how
     -+    the working tree is treated.  On the other hand, `am` builds on
     -+    `apply --index` and it needs the SKIP_WORKTREE bits preserved for
     -+    non-conflicted files in order to behave like
     -+    cherry-pick/rebase/merge.
     ++    specification)
     ++
     ++  * Under Behavior A:
     ++    * Does ls-files' `--no-expand` override the default `--scope=all`, or
     ++      does it need an extra flag?
     ++    * Does ls-files' `-t` option imply `--scope=all`?
     ++    * Does update-index's `--[no-]skip-worktree` option imply `--scope=all`?
      +
      +  * sparse-checkout: once behavior A is fully implemented, should we take
      +    an interim measure to ease people into switching the default?  Namely,
      +    if folks are not already in a sparse checkout, then require
     -+    `sparse-checkout init/set` to take a `--set-[no-]restrict-mode` or
     -+    `--set-scope=(sparse|all)` flag (which would set core.restrictToSparse
     -+    according to the setting given), and throw an error if the flag is not
     -+    provided?  That error would be a great place to warn folks that the
     -+    default may change in the future, and get them used to specifying what
     -+    they want so that the eventual default switch is seamless for them.
     -+
     -+  * clone: should we provide some mechanism for tying partial clones and
     -+    sparse checkouts together better.  Maybe an option
     -+	--sparse=dir1,dir2,...,dirN
     -+    which:
     -+       * Does initial fetch with `--filter=blob:none`
     -+       * Does the `sparse-checkout set --cone dir1 dir2 ... dirN` thing
     -+       * Runs a `git rev-list --objects --all -- dir1 dir2 ... dirN` to
     -+	 fault in the missing blobs within the sparse
     -+	 specification...except that rev-list needs some kind of options
     -+	 to also get files from leading directories too.
     -+       * Sets --restrict mode to allow focusing on the cone of interest
     -+	 (and to permit disconnected development)
     ++    `sparse-checkout init/set` to take a
     ++    `--set-scope=(sparse|worktree-sparse-history-dense|dense)` flag (which
     ++    would set sparse.scope according to the setting given), and throw an
     ++    error if the flag is not provided?  That error would be a great place
     ++    to warn folks that the default may change in the future, and get them
     ++    used to specifying what they want so that the eventual default switch
     ++    is seamless for them.
      +
      +
      +=== Implementation Goals/Plans ===
     @@ Documentation/technical/sparse-checkout.txt (new)
      +
      + * Fix bugs in the 'Known bugs' section (below)
      +
     -+ [Below here is kind of spitballing since the first two haven't been resolved]
     ++ * Provide some kind of method for backfilling the blobs within the sparse
     ++   specification in a partial clone
      +
     -+ * update-index: flip the default to --no-ignore-skip-worktree-entries, possibly
     -+   nuke this stupid "Oh, there's a bug?  Let me add a flag to let users request
     -+   that they not trigger this bug." flag
     ++ [Below here is kind of spitballing since the first two haven't been resolved]
      +
     -+ * ls-files: add a --[no-]restrict flag for limiting tracked files listed to
     -+   the relevant subset.  (Plus more stuff after questions are answered.)
     ++ * update-index: flip the default to --no-ignore-skip-worktree-entries,
     ++   nuke this stupid "Oh, there's a bug?  Let me add a flag to let users
     ++   request that they not trigger this bug." flag
      +
      + * Flags & Config
     -+   * Make `--sparse` in add/rm/mv a deprecated alias for `--no-restrict`
     ++   * Make `--sparse` in add/rm/mv a deprecated alias for `--scope=all`
      +   * Make `--ignore-skip-worktree-bits` in checkout-index/checkout/restore
     -+     a deprecated aliases for `--no-restrict`
     -+   * Create config option (core.restrictToSparsity?), note how it only
     -+     affects two classes of commands
     -+
     -+ * Behavioral plans:
     -+     add, rm, mv:
     -+	Behavior B: throw error if would have affected paths outside of sparse
     -+		    specification.
     -+	Behavior A: throw error if would have *only* affected paths outside of
     -+		    sparse specification.
     -+     grep (on history), diff (on history), log, etc:
     -+	Behavior B: act on all paths (already implemented)
     -+	Behavior A: act on limited paths, maybe show stderr warning ("results
     -+		    limited") if selected via config rather than explicitly
     -+     other diff machinery:
     -+	make sure diff machinery changes don't mess with format-patch,
     -+	fast-export, etc.
     -+
     -+  * Fix performance issues, such as
     -+    https://lore.kernel.org/git/CABPp-BEkJQoKZsQGCYioyga_uoDQ6iBeW+FKr8JhyuuTMK1RDw@mail.gmail.com/
     ++     a deprecated aliases for `--scope=all`
     ++   * Create config option (sparse.scope?), tie it to the "Cliff notes"
     ++     overview
      +
     ++   * Add --scope=sparse (and --scope=all) flag to each of the history querying
     ++     commands.  IMPORATNT: make sure diff machinery changes don't mess with
     ++     format-patch, fast-export, etc.
      +
      +=== Known bugs ===
      +
     @@ Documentation/technical/sparse-checkout.txt (new)
      +    S tracked
      +    H tracked-but-maybe-skipped
      +
     ++5. checkout and restore --staged, continued:
     ++
     ++   These commands do not correctly scope operations to the sparse
     ++   specification, and make it worse by not setting important SKIP_WORKTREE
     ++   bits:
     ++
     ++   $ git restore --source OLDREV --staged outside-sparse-cone/
     ++   $ git status --porcelain
     ++   MD outside-sparse-cone/file1
     ++   MD outside-sparse-cone/file2
     ++   MD outside-sparse-cone/file3
     ++
     ++   We can add a --scope=all mode to `git restore` to let it operate outside
     ++   the sparse specification, but then it will be important to set the
     ++   SKIP_WORKTREE bits appropriately.
     ++
     ++6. Performance issues; see:
     ++    https://lore.kernel.org/git/CABPp-BEkJQoKZsQGCYioyga_uoDQ6iBeW+FKr8JhyuuTMK1RDw@mail.gmail.com/
     ++
      +
      +=== Reference Emails ===
      +
     @@ Documentation/technical/sparse-checkout.txt (new)
      +     search for the parenthetical comment starting "We do not check".
      +    https://lore.kernel.org/git/CABPp-BFsCPPNOZ92JQRJeGyNd0e-TCW-LcLyr0i_+VSQJP+GCg@mail.gmail.com/
      +
     -+[17] "I'm not even sure if we would want to make this available via a
     -+     config setting"
     -+   and
     -+     "But I also want to avoid doing this as a default or even behind a
     -+     config setting"
     -+   respectively, from:
     -+     https://lore.kernel.org/git/1a1e33f6-3514-9afc-0a28-5a6b85bd8014@gmail.com/
     -+     https://lore.kernel.org/git/07a25d48-e364-0d9b-6ffa-41a5984eb5db@github.com/
     ++[17] https://lore.kernel.org/git/20220207190320.2960362-1-jonathantanmy@google.com/


 Documentation/technical/sparse-checkout.txt | 1098 +++++++++++++++++++
 1 file changed, 1098 insertions(+)
 create mode 100644 Documentation/technical/sparse-checkout.txt

diff --git a/Documentation/technical/sparse-checkout.txt b/Documentation/technical/sparse-checkout.txt
new file mode 100644
index 00000000000..2518ca17faa
--- /dev/null
+++ b/Documentation/technical/sparse-checkout.txt
@@ -0,0 +1,1098 @@
+Table of contents:
+
+  * Terminology
+  * Purpose of sparse-checkouts
+  * Usecases of primary concern
+  * Oversimplified mental models ("Cliff Notes" for this document!)
+  * Desired behavior
+  * Behavior classes
+  * Subcommand-dependent defaults
+  * Sparse specification vs. sparsity patterns
+  * Implementation Questions
+  * Implementation Goals/Plans
+  * Known bugs
+  * Reference Emails
+
+
+=== Terminology ===
+
+cone mode: one of two modes for specifying the desired subset of files
+	in a sparse-checkout.  In cone-mode, the user specifies
+	directories (getting both everything under that directory as
+	well as everything in leading directories), while in non-cone
+	mode, the user specifies gitignore-style patterns.  Controlled
+	by the --[no-]cone option to sparse-checkout init|set.
+
+SKIP_WORKTREE: When tracked files do not match the sparse specification and
+	are removed from the working tree, the file in the index is marked
+	with a SKIP_WORKTREE bit.  Note that if a tracked file has the
+	SKIP_WORKTREE bit set but the file is later written by the user to
+	the working tree anyway, the SKIP_WORKTREE bit will be cleared at
+	the beginning of any Git operation.
+
+	Most sparse checkout users are unaware of this implementation
+	detail, and the term should generally be avoided in user-facing
+	descriptions and command flags.  Unfortunately, prior to the
+	`sparse-checkout` subcommand this low-level detail was exposed,
+	and as of time of writing, is still exposed in various places.
+
+sparse-checkout: a subcommand in git used to reduce the files present in
+	the working tree to a subset of all tracked files.  Also, the
+	name of the file in the $GIT_DIR/info directory used to track
+	the sparsity patterns corresponding to the user's desired
+	subset.
+
+sparse cone: see cone mode
+
+sparse directory: An entry in the index corresponding to a directory, which
+	appears in the index instead of all the files under that directory
+	that would normally appear.  See also sparse-index.  Something that
+	can cause confusion is that the "sparse directory" does NOT match
+	the sparse specification, i.e. the directory is NOT present in the
+	working tree.  May be renamed in the future (e.g. to "skipped
+	directory").
+
+sparse index: A special mode for sparse-checkout that also makes the
+	index sparse by recording a directory entry in lieu of all the
+	files underneath that directory.  Controlled by the
+	--[no-]sparse-index option to init|set|reapply.  See also
+	"sparse directory".
+
+sparsity patterns: patterns from $GIT_DIR/info/sparse-checkout used to
+	define the set of files of interest.  A warning: It is easy to
+	over-use this term (or the shortened "patterns" term), for two
+	reasons: (1) users in cone mode specify directories rather than
+	patterns (their directories are transformed into patterns, but
+	users may think you are talking about non-cone mode if you use the
+	word "patterns"), and (b) the sparse specification might
+	transiently differ in the working tree or index from the sparsity
+	patterns (see "Sparse specification vs. sparsity patterns").
+
+sparse specification: The set of paths in the user's area of focus.  This
+	is typically just the tracked files that match the sparsity
+	patterns, but the sparse specification can temporarily differ and
+	include additional files.  (See also "Sparse specification
+	vs. sparsity patterns")
+
+	* When working with history, the sparse specification is exactly
+	  the set of files matching the sparsity patterns.
+	* When interacting with the working tree, the sparse specification
+	  is the set of tracked files with a clear SKIP_WORKTREE bit or
+	  tracked files present in the working copy.
+	* When modifying or showing results from the index, the sparse
+	  specification is the set of files with a clear SKIP_WORKTREE bit
+	  or that differ in the index from HEAD.
+	* If working with the index and the working copy, the sparse
+	  specification is the union of the paths from above.
+
+vivifying: When a command restores a tracked file to the working tree (and
+	hopefully also clears the SKIP_WORKTREE bit in the index for that
+	file), this is referred to as "vivifying" the file.
+
+
+=== Purpose of sparse-checkouts ===
+
+sparse-checkouts exist to allow users to work with a subset of their
+files.
+
+You can think of sparse-checkouts as subdividing "tracked" files into two
+categories -- a sparse subset, and all the rest.  Implementationally, we
+mark "all the rest" in the index with a SKIP_WORKTREE bit and leave them
+out of the working tree.  The SKIP_WORKTREE files are still tracked, just
+not present in the working tree.
+
+In the past, sparse-checkouts were defined by "SKIP_WORKTREE means the file
+is missing from the working tree but pretend the file contents match HEAD".
+That was not only bogus (it actually meant the file missing from the
+working tree matched the index rather than HEAD), but it was also a
+low-level detail which only provided decent behavior for a few commands.
+There were a surprising number of ways in which that guiding principle gave
+command results that violated user expectations, and as such was a bad
+mental model.  However, it persisted for many years and may still be found
+in some corners of the code base.
+
+Anyway, the idea of "working with a subset of files" is simple enough, but
+there are multiple different high-level usecases which affect how some Git
+subcommands should behave.  Further, even if we only considered one of
+those usecases, sparse-checkouts can modify different subcommands in over a
+half dozen different ways.  Let's start by considering the high level
+usecases:
+
+  A) Users are _only_ interested in the sparse portion of the repo
+
+  B) Users want a sparse working tree, but are working in a larger whole
+
+  C) sparse-checkout is a behind-the-scenes implementation detail allowing
+     Git to work with a specially crafted in-house virtual file system;
+     users are actually working with a "full" working tree that is
+     lazily populated, and sparse-checkout helps with the lazy population
+     piece.
+
+  A*) Users are _only_ interested in the sparse portion of the repo that
+      they have downloaded so far (a variant on the first usecase)
+
+
+It may be worth explaining each of these in a bit more detail:
+
+
+  (Behavior A) Users are _only_ interested in the sparse portion of the repo
+
+These folks might know there are other things in the repository, but
+don't care.  They are uninterested in other parts of the repository, and
+only want to know about changes within their area of interest.  Showing
+them other results from history (e.g. from diff/log/grep/etc.) is a
+usability annoyance, potentially a huge one since other changes in
+history may dwarf the changes they are interested in.
+
+Some of these users also arrive at this usecase from wanting to use partial
+clones together with sparse checkouts (in a way where they have downloaded
+blobs within the sparse specification) and do disconnected development.
+Not only do these users generally not care about other parts of the
+repository, but consider it a blocker for Git commands to try to operate on
+those.  If commands attempt to access paths in history outside the sparsity
+specification, then the partial clone will attempt to download additional
+blobs on demand, fail, and then fail the user's command.  (This may be
+unavoidable in some cases, e.g. when `git merge` has non-trivial changes to
+reconcile outside the sparse specification, but we should limit how often
+users are forced to connect to the network.)
+
+Also, even for users using partial clones that do not mind being
+always connected to the network, the need to download blobs as
+side-effects of various other commands (such as the printed diffstat
+after a merge or pull) can lead to worries about local repository size
+growing unnecessarily[10].
+
+  (Behavior B) Users want a sparse working tree, but are working in a larger whole
+
+Stolee described this usecase this way[11]:
+
+"I'm also focused on users that know that they are a part of a larger
+whole. They know they are operating on a large repository but focus on
+what they need to contribute their part. I expect multiple "roles" to
+use very different, almost disjoint parts of the codebase. Some other
+"architect" users operate across the entire tree or hop between different
+sections of the codebase as necessary. In this situation, I'm wary of
+scoping too many features to the sparse-checkout definition, especially
+"git log," as it can be too confusing to have their view of the codebase
+depend on your "point of view."
+
+People might also end up wanting behavior B due to complex inter-project
+dependencies.  The initial attempts to use sparse-checkouts usually involve
+the directories you are directly interested in plus what those directories
+depend upon within your repository.  But there's a monkey wrench here: if
+you have integration tests, they invert the hierarchy: to run integration
+tests, you need not only what you are interested in and its in-tree
+dependencies, you also need everything that depends upon what you are
+interested in or that depends upon one of your dependencies...AND you need
+all the in-tree dependencies of that expanded group.  That can easily
+change your sparse-checkout into a nearly dense one.
+
+Naturally, that tends to kill the benefits of sparse-checkouts.  There are
+a couple solutions to this conundrum: either avoid grabbing in-repo
+dependencies (maybe have built versions of your in-repo dependencies pulled
+from a CI cache somewhere), or say that users shouldn't run integration
+tests directly and instead do it on the CI server when they submit a code
+review.  Or do both.  Regardless of whether you stub out your in-repo
+dependencies or stub out the things that depend upon you, there is
+certainly a reason to want to query and be aware of those other stubbed-out
+parts of the repository, particularly when the dependencies are complex or
+change relatively frequently.  Thus, for such uses, sparse-checkouts can be
+used to limit what you directly build and modify, but these users do not
+necessarily want their sparse checkout paths to limit their queries of
+versions in history.
+
+Some people may also be interested in behavior B over behavior A simply as
+a performance workaround: if they are using non-cone mode, then they have
+to deal with its inherent quadratic performance problems.  In that mode,
+every operation that checks whether paths match the sparsity specification
+can be expensive.  As such, these users may only be willing to pay for
+those expensive checks when interacting with the working copy, and may
+prefer getting "unrelated" results from their history queries over having
+slow commands.
+
+  (Behavior C) sparse-checkout is an implementational detail supporting a
+	       special VFS.
+
+This usecase goes slightly against the traditional definition of
+sparse-checkout in that it actually tries to present a full or dense
+checkout to the user.  However, this usecase utilizes the same underlying
+technical underpinnings in a new way which does provide some performance
+advantages to users.  The basic idea is that a company can have an in-house
+Git-aware Virtual File System which pretends all files are present in the
+working tree, by intercepting all file system accesses and using those to
+fetch and write accessed files on demand via partial clones.  The VFS uses
+sparse-checkout to prevent Git from writing or paying attention to many
+files, and manually updates the sparse checkout patterns itself based on
+user access and modification of files in the working tree.  See commit
+ecc7c8841d ("repo_read_index: add config to expect files outside sparse
+patterns", 2022-02-25) and the link at [17] for a more detailed description
+of such a VFS.
+
+The biggest difference here is that users are completely unaware that the
+sparse-checkout machinery is even in use.  The sparse patterns are not
+specified by the user but rather are under the complete control of the VFS
+(and the patterns are updated frequently and dynamically by it).  The user
+will perceive the checkout as dense, and commands should thus behave as if
+all files are present.
+
+  (Behavior A*) Users are _only_ interested in the sparse portion of the repo
+      that they have downloaded so far (a variant on the first usecase)
+
+This variant is driven by folks who using partial clones together with
+sparse checkouts and do disconnected development (so far sounding like a
+subset of behavior A users) and doing so on very large repositories.  The
+reason for yet another variant is that downloading even just the blobs
+through history within their sparse specification may be too much, so they
+only download some.  They would still like operations to succeed without
+network connectivity, though, so things like `git log -S${SEARCH_TERM} -p`
+or `git grep ${SEARCH_TERM} OLDREV ` would need to be prepared to provide
+partial results.
+
+This variant could be viewed as Behavior A with the sparse specification
+for history querying operations modified from "sparsity patterns" to
+"sparsity patterns limited to the blobs we have already downloaded".
+
+
+=== Usecases of primary concern ===
+
+Most of the rest of this document will focus on the first two usecases:
+Behavior A and Behavior B.  Some notes about the other two cases and why we
+are not focusing on them:
+
+  (Behavior A*)
+
+Supporting this usecase is estimated to be difficult and a lot of work.
+There are no plans to implement it currently, but it may be a potential
+future alternative.  Knowing about the existence of additional alternatives
+may affect our choice of command line flags (e.g. if we need tri-state or
+quad-state flags rather than just binary flags), so it was still important
+to at least note.
+
+Further, I believe the descriptions below for Behavior A are probably still
+valid for this usecase, with the only exception being that it redefines the
+sparse specification to restrict it to already-downloaded blobs.  The hard
+part is in making commands capable of respecting that modified definition.
+
+  (Behavior C)
+
+This usecase violates some of the early sparse-checkout documented
+assumptions (since files marked as SKIP_WORKTREE will be displayed to users
+as present in the working tree).  That violation may mean various
+sparse-checkout related behaviors are not well suited to this usecase and
+we may need tweaks -- to both documentation and code -- to handle it.
+However, this usecase is also perhaps the simplest model to support in that
+everything behaves like a dense checkout with a few exceptions (e.g. branch
+checkouts and switches write fewer things, knowing the VFS will lazily
+write the rest on an as-needed basis).
+
+Since there is no publically available VFS-related code for folks to try,
+the number of folks who can test such a usecase is limited.
+
+The primary reason to note the Behavior C usecase is that as we fix things
+to better support Behaviors A and B, there may be additional places where
+we need to make tweaks allowing folks in this usecase to get the original
+non-sparse treatment.  For an example, see ecc7c8841d ("repo_read_index:
+add config to expect files outside sparse patterns", 2022-02-25).  The
+secondary reason to note Behavior C, is so that folks taking advantage of
+Behavior C do not assume they are part of the Behavior B camp and propose
+patches that break things for the real Behavior B folks.
+
+
+=== Oversimplified mental models ===
+
+An oversimplification of the differences in the above behaviors is:
+
+  Behavior A: Restrict worktree and history operations to sparse specification
+  Behavior B: Restrict worktree operations to sparse specification; have any
+	      history operations work across all files
+  Behavior C: Do not restrict either worktree or history operations to the
+	      sparse specification...with the exception of branch checkouts or
+	      switches which avoid writing files that will match the index so
+	      they can later lazily be populated instead.
+
+
+=== Desired behavior ===
+
+As noted previously, despite the simple idea of just working with a subset
+of files, there are a range of different behavioral changes that need to be
+made to different subcommands to work well with such a feature.  See
+[1,2,3,4,5,6,7,8,9,10] for various examples.  In particular, at [2], we saw
+that mere composition of other commands that individually worked correctly
+in a sparse-checkout context did not imply that the higher level command
+would work correctly; it sometimes requires further tweaks.  So,
+understanding these differences can be beneficial.
+
+* Commands behaving the same regardless of high-level use-case
+
+  * commands that only look at files within the sparsity specification
+
+      * diff (without --cached or REVISION arguments)
+      * grep (without --cached or REVISION arguments)
+      * diff-files
+
+  * commands that restore files to the working tree that match sparsity
+    patterns, and remove unmodified files that don't match those
+    patterns:
+
+      * switch
+      * checkout (the switch-like half)
+      * read-tree
+      * reset --hard
+
+  * commands that write conflicted files to the working tree, but otherwise
+    will omit writing files to the working tree that do not match the
+    sparsity patterns:
+
+      * merge
+      * rebase
+      * cherry-pick
+      * revert
+
+      * `am` and `apply --cached` should probably be in this section but
+	are buggy (see the "Known bugs" section below)
+
+    The behavior for these commands somewhat depends upon the merge
+    strategy being used:
+      * `ort` behaves as described above
+      * `recursive` tries to not vivify files unnecessarily, but does sometimes
+	vivify files without conflicts.
+      * `octopus` and `resolve` will always vivify any file changed in the merge
+	relative to the first parent, which is rather suboptimal.
+
+    It is also important to note that these commands WILL update the index
+    outside the sparse specification relative to when the operation began,
+    BUT these commands often make a commit just before or after such that
+    by the end of the operation there is no change to the index outside the
+    sparse specification.  Of course, if the operation hits conflicts or
+    does not make a commit, then these operations clearly can modify the
+    index outside the sparse specification.
+
+    Finally, it is important to note that at least the first four of these
+    commands also try to remove differences between the sparse
+    specification and the sparsity patterns (much like the commands in the
+    previous section).
+
+  * commands that always ignore sparsity since commits must be full-tree
+
+      * archive
+      * bundle
+      * commit
+      * format-patch
+      * fast-export
+      * fast-import
+      * commit-tree
+
+  * commands that write any modified file to the working tree (conflicted or not,
+    and whether those paths match sparsity patterns or not):
+
+      * stash
+      * apply (without `--index` or `--cached`)
+
+* Commands that may slightly differ for behavior A vs. behavior B:
+
+  Commands in this category behave mostly the same between the two
+  behaviors, but may differ in verbosity and types of warning and error
+  messages.
+
+  * commands that make modifications to which files are tracked:
+      * add
+      * rm
+      * mv
+      * update-index
+
+    The fact that files can move between the 'tracked' and 'untracked'
+    categories means some commands will have to treat untracked files
+    differently.  But if we have to treat untracked files differently,
+    then additional commands may also need changes:
+
+      * status
+      * clean
+
+    In particular, `status` may need to report any untracked files outside
+    the sparsity specification as an erroneous condition (especially to
+    avoid the user trying to `git add` them, forcing `git add` to display
+    an error).
+
+    It's not clear to me exactly how (or if `clean` would change, but it's
+    the other command that also affects untracked files.
+
+    `update-index` may be slightly special.  Its --[no-]skip-worktree flag
+    may need to ignore the sparse specification by its nature.  Also, its
+    current --[no-]ignore-skip-worktree-entries default is totally bogus.
+
+  * commands for manually tweaking paths in both the index and the working tree
+      * `restore`
+      * the restore-like half of `checkout`
+
+    These commands should be similar to add/rm/mv in that they should
+    only operate on the sparse specification by default, and require a
+    special flag to operate on all files.
+
+    Also, note that these commands currently have a number of issues (see
+    the "Known bugs" section below)
+
+* Commands that significantly differ for behavior A vs. behavior B:
+
+  * commands that query history
+      * diff (with --cached or REVISION arguments)
+      * grep (with --cached or REVISION arguments)
+      * show (when given commit arguments)
+      * blame (only matters when one or more -C flags are passed)
+	* and annotate
+      * log
+      * whatchanged
+      * ls-files
+      * diff-index
+      * diff-tree
+      * ls-tree
+
+    Note: for log and whatchanged, only patch related parts are affected by
+    scoping the command to the sparse-checkout; the revision walking is
+    unaffected.  (The fact that revision walking is unaffected is why
+    rev-list, shortlog, show-branch, and bisect are not in this list.)
+
+    ls-files may be slightly special in that e.g. `git ls-files -t` is
+    often used to see what is sparse and what is not.  Perhaps -t should
+    always work on the full tree?
+
+* Commands I don't know how to classify
+
+  * range-diff
+
+    Is this like `log` or `format-patch`?
+
+  * cherry
+
+    See range-diff
+
+* Commands unaffected by sparse-checkouts
+
+  * shortlog
+  * show-branch
+  * rev-list
+  * bisect
+
+  * branch
+  * describe
+  * fetch
+  * gc
+  * init
+  * maintenance
+  * notes
+  * pull (merge & rebase have the necessary changes)
+  * push
+  * submodule
+  * tag
+
+  * config
+  * filter-branch (works in separate checkout without sparse-checkout setup)
+  * pack-refs
+  * prune
+  * remote
+  * repack
+  * replace
+
+  * bugreport
+  * count-objects
+  * fsck
+  * gitweb
+  * help
+  * instaweb
+  * merge-tree (doesn't touch worktree or index, and merges always compute full-tree)
+  * rerere
+  * verify-commit
+  * verify-tag
+
+  * commit-graph
+  * hash-object
+  * index-pack
+  * mktag
+  * mktree
+  * multi-pack-index
+  * pack-objects
+  * prune-packed
+  * symbolic-ref
+  * unpack-objects
+  * update-ref
+  * write-tree (operates on index, possibly optimized to use sparse dir entries)
+
+  * for-each-ref
+  * get-tar-commit-id
+  * ls-remote
+  * merge-base (merges are computed full tree, so merge base should be too)
+  * name-rev
+  * pack-redundant
+  * rev-parse
+  * show-index
+  * show-ref
+  * unpack-file
+  * var
+  * verify-pack
+
+  * <Everything under 'Interacting with Others' in 'git help --all'>
+  * <Everything under 'Low-level...Syncing' in 'git help --all'>
+  * <Everything under 'Low-level...Internal Helpers' in 'git help --all'>
+  * <Everything under 'External commands' in 'git help --all'>
+
+* Commands that might be affected, but who cares?
+
+  * merge-file
+  * merge-index
+  * gitk?
+
+
+=== Behavior classes ===
+
+From the above there are a few classes of behavior:
+
+  * "restrict"
+
+    Commands in this class only read or write files in the working tree
+    within the sparse specification.
+
+    When moving to a new commit (e.g. switch, reset --hard), these commands
+    may update index files outside the sparse specification as of the start
+    of the operation, but by the end of the operation those index files
+    will match HEAD again and thus those files will again be outside the
+    sparse specification.
+
+    When paths are explicitly specified, these paths are intersected with
+    the sparse specification and will only operate on such paths.
+    (e.g. `git restore [--staged] -- '*.png'`, `git reset -p -- '*.md'`)
+
+    Some of these commands may also attempt, at the end of their operation,
+    to cull transient differences between the sparse specification and the
+    sparsity patterns (see "Sparse specification vs. sparsity patterns" for
+    details, but this basically means either removing unmodified files not
+    matching the sparsity patterns and marking those files as
+    SKIP_WORKTREE, or vivifying files that match the sparsity patterns and
+    marking those files as !SKIP_WORKTREE).
+
+  * "restrict modulo conflicts"
+
+    Commands in this class generally behave like the "restrict" class,
+    except that:
+      (1) they will ignore the sparse specification and write files with
+	  conflicts to the working tree (thus temporarily expanding the
+	  sparse specification to include such files.)
+      (2) they are grouped with commands which move to a new commit, since
+	  they often create a commit and then move to it, even though we
+	  know there are many exceptions to moving to the new commit.  (For
+	  example, the user may rebase a commit that becomes empty, or have
+	  a cherry-pick which conflicts, or a user could run `merge
+	  --no-commit`, and we also view `apply --index` kind of like `am
+	  --no-commit`.)  As such, these commands can make changes to index
+	  files outside the sparse specification, though they'll mark such
+	  files with SKIP_WORKTREE.
+
+  * "restrict also specially applied to untracked files"
+
+    Commands in this class generally behave like the "restrict" class,
+    except that they have to handle untracked files differently too, often
+    because these commands are dealing with files changing state between
+    'tracked' and 'untracked'.  Often, this may mean printing an error
+    message if the command had nothing to do, but the arguments may have
+    referred to files whose tracked-ness state could have changed were it
+    not for the sparsity patterns excluding them.
+
+  * "no restrict"
+
+    Commands in this class ignore the sparse specification entirely.
+
+  * "restrict or no restrict dependent upon behavior A vs. behavior B"
+
+    Commands in this class behave like "no restrict" for folks in the
+    behavior B camp, and like "restrict" for folks in the behavior A camp.
+    However, when behaving like "restrict" a warning of some sort might be
+    provided that history queries have been limited by the sparse-checkout
+    specification.
+
+
+=== Subcommand-dependent defaults ===
+
+Note that we have different defaults depending on the command for the
+desired behavior :
+
+  * Commands defaulting to "restrict":
+    * diff-files
+    * diff (without --cached or REVISION arguments)
+    * grep (without --cached or REVISION arguments)
+    * switch
+    * checkout (the switch-like half)
+    * reset (<commit>)
+
+    * restore
+    * checkout (the restore-like half)
+    * checkout-index
+    * reset (with pathspec)
+
+    This behavior makes sense; these interact with the working tree.
+
+  * Commands defaulting to "restrict modulo conflicts":
+    * merge
+    * rebase
+    * cherry-pick
+    * revert
+
+    * am
+    * apply --index (which is kind of like an `am --no-commit`)
+
+    * read-tree (especially with -m or -u; is kind of like a --no-commit merge)
+    * reset (<tree-ish>, due to similarity to read-tree)
+
+    These also interact with the working tree, but require slightly
+    different behavior either so that (a) conflicts can be resolved or (b)
+    because they are kind of like a merge-without-commit operation.
+
+    (See also the "Known bugs" section below regarding `am` and `apply`)
+
+  * Commands defaulting to "no restrict":
+    * archive
+    * bundle
+    * commit
+    * format-patch
+    * fast-export
+    * fast-import
+    * commit-tree
+
+    * stash
+    * apply (without `--index`)
+
+    These have completely different defaults and perhaps deserve the most
+    detailed explanation:
+
+    In the case of commands in the first group (format-patch,
+    fast-export, bundle, archive, etc.), these are commands for
+    communicating history, which will be broken if they restrict to a
+    subset of the repository.  As such, they operate on full paths and
+    have no `--restrict` option for overriding.  Some of these commands may
+    take paths for manually restricting what is exported, but it needs to
+    be very explicit.
+
+    In the case of stash, it needs to vivify files to avoid losing the
+    user's changes.
+
+    In the case of apply without `--index`, that command needs to update
+    the working tree without the index (or the index without the working
+    tree if `--cached` is passed), and if we restrict those updates to the
+    sparse specification then we'll lose changes from the user.
+
+  * Commands defaulting to "restrict also specially applied to untracked files":
+    * add
+    * rm
+    * mv
+    * update-index
+    * status
+    * clean (?)
+
+    Our original implementation for these commands was "no restrict", but
+    it had some severe usability issues:
+      * `git add <somefile>` if honored and outside the sparse
+	specification, can result in the file randomly disappearing later
+	when some subsequent command is run (since various commands
+	automatically clean up unmodified files outside the sparse
+	specification).
+      * `git rm '*.jpg'` could very negatively surprise users if it deletes
+	files outside the range of the user's interest.
+      * `git mv` has similar surprises when moving into or out of the cone,
+	so best to restrict by default
+
+    So, we switched `add` and `rm` to default to "restrict", which made
+    usability problems much less severe and less frequent, but we still got
+    complaints because commands like:
+	git add <file-outside-sparse-specification>
+	git rm <file-outside-sparse-specification>
+    would silently do nothing.  We should instead print an error in those
+    cases to get usability right.
+
+    update-index needs to be updated to match, and status and maybe clean
+    also need to be updated to specially handle untracked paths.
+
+    There may be a difference in here between behavior A and behavior B in
+    terms of verboseness of errors or additional warnings.
+
+  * Commands falling under "restrict or no restrict dependent upon behavior
+    A vs. behavior B"
+
+    * diff (with --cached or REVISION arguments)
+    * grep (with --cached or REVISION arguments)
+    * show (when given commit arguments)
+    * bisect
+    * blame (only matters when one or more -C flags passed)
+      * and annotate
+    * log
+      * and variants: shortlog, gitk, show-branch, whatchanged, rev-list
+    * ls-files
+    * diff-index
+    * diff-tree
+    * ls-tree
+
+    For now, we default to behavior B for these, which want a default of
+    "no restrict".
+
+    Note that two of these commands -- diff and grep -- also appeared in a
+    different list with a default of "restrict", but only when limited to
+    searching the working tree.  The working tree vs. history distinction
+    is fundamental in how behavior B operates, so this is expected.  Note,
+    though, that for diff and grep with --cached, when doing "restrict"
+    behavior, the difference between sparse specification and sparsity
+    patterns is important to handle.
+
+    "restrict" may make more sense as the long term default for these[12].
+    Also, supporting "restrict" for these commands might be a fair amount
+    of work to implement, meaning it might be implemented over multiple
+    releases.  If that behavior were the default in the commands that
+    supported it, that would force behavior B users to need to learn to
+    slowly add additional flags to their commands, depending on git
+    version, to get the behavior they want.  That gradual switchover would
+    be painful, so we should avoid it at least until it's fully
+    implemented.
+
+
+=== Sparse specification vs. sparsity patterns ===
+
+In a well-behaved situation, the sparse specification is given directly
+by the $GIT_DIR/info/sparse-checkout file.  However, it can transiently
+diverge for a few reasons:
+
+    * needing to resolve conflicts (merging will vivify conflicted files)
+    * running Git commands that implicitly vivify files (e.g. "git stash apply")
+    * running Git commands that explicitly vivify files (e.g. "git checkout
+      --ignore-skip-worktree-bits FILENAME")
+    * other commands that write to these files (perhaps a user copies it
+      from elsewhere)
+
+For the last item, note that we do automatically clear the SKIP_WORKTREE
+bit for files that are present in the working tree.  This has been true
+since 82386b4496 ("Merge branch 'en/present-despite-skipped'",
+2022-03-09)
+
+However, such a situation is transient because:
+
+   * Such transient differences can and will be automatically removed as
+     a side-effect of commands which call unpack_trees() (checkout,
+     merge, reset, etc.).
+   * Users can also request such transient differences be corrected via
+     running `git sparse-checkout reapply`.  Various places recommend
+     running that command.
+   * Additional commands are also welcome to implicitly fix these
+     differences; we may add more in the future.
+
+While we avoid dropping unstaged changes or files which have conflicts,
+we otherwise aggressively try to fix these transient differences.  If
+users want these differences to persist, they should run the `set` or
+`add` subcommands of `git sparse-checkout` to reflect their intended
+sparse specification.
+
+However, when we need to do a query on history restricted to the
+"relevant subset of files" such a transiently expanded sparse
+specification is ignored.  There are a couple reasons for this:
+
+   * The behavior wanted when doing something like
+	 git grep expression REVISION
+     is roughly what the users would expect from
+	 git checkout REVISION && git grep expression
+     (modulo a "REVISION:" prefix), which has a couple ramifications:
+
+   * REVISION may have paths not in the current index, so there is no
+     path we can consult for a SKIP_WORKTREE setting for those paths.
+
+   * Since `checkout` is one of those commands that tries to remove
+     transient differences in the sparse specification, it makes sense
+     to use the corrected sparse specification
+     (i.e. $GIT_DIR/info/sparse-checkout) rather than attempting to
+     consult SKIP_WORKTREE anyway.
+
+So, a transiently expanded (or restricted) sparse specification applies to
+the working tree, but not to history queries where we always use the
+sparsity patterns.  (See [16] for an early discussion of this.)
+
+Similar to a transiently expanded sparse specification of the working tree
+based on additional files being present in the working tree, we also need
+to consider additional files being modified in the index.  In particular,
+if the user has staged changes to files (relative to HEAD) that do not
+match the sparsity patterns, and the file is not present in the working
+tree, we still want to consider the file part of the sparse specification
+if we are specifically performing a query related to the index (e.g. git
+diff --cached [REVISION], git diff-index [REVISION], git restore --staged
+--source=REVISION -- PATHS, etc.)
+
+
+=== Implementation Questions ===
+
+  * Do the options --scope={sparse,all} sound good to others?  Are there better
+    options?
+    * Names in use, or appearing in patches, or previously suggested:
+      * --sparse/--dense
+      * --ignore-skip-worktree-bits
+      * --ignore-skip-worktree-entries
+      * --ignore-sparsity
+      * --[no-]restrict-to-sparse-paths
+      * --full-tree/--sparse-tree
+      * --[no-]restrict
+      * --scope={sparse,all}
+      * --focus/--unfocus
+      * --limit/--unlimited
+    * Rationale making me lean slightly towards --scope={sparse,all}:
+      * We want a name that works for many commands, so we need a name that
+	does not conflict
+      * We know that we have more than two possible usecases, so it is best
+	to avoid a flag that appears to be binary.
+      * --scope={sparse,all} isn't overly long and seems relatively
+	explanatory
+      * `--sparse`, as used in add/rm/mv, is totally backwards for
+	grep/log/etc.  Changing the meaning of `--sparse` for these
+	commands would fix the backwardness, but possibly break existing
+	scripts.  Using a new name pairing would allow us to treat
+	`--sparse` in these commands as a deprecated alias.
+      * There is a different `--sparse`/`--dense` pair for commands using
+	revision machinery, so using that naming might cause confusion
+      * There is also a `--sparse` in both pack-objects and show-branch, which
+	don't conflict but do suggest that `--sparse` is overloaded
+      * The name --ignore-skip-worktree-bits is a double negative, is
+	quite a mouthful, refers to an implementation detail that many
+	users may not be familiar with, and we'd need a negation for it
+	which would probably be even more ridiculously long.  (But we
+	can make --ignore-skip-worktree-bits a deprecated alias for
+	--no-restrict.)
+
+  * If a config option is added (sparse.scope?) what should the values and
+    description be?  "sparse" (behavior A), "worktree-sparse-history-dense"
+    (behavior B), "dense" (behavior C)?  There's a risk of confusion,
+    because even for Behaviors A and B we want some commands to be
+    full-tree and others to operate sparsely, so the wording may need to be
+    more tied to the usecases and somehow explain that.  Also, right now,
+    the primary difference we are focusing is just the history-querying
+    commands (log/diff/grep).  Previous config suggestion here: [13]
+
+  * Is `--no-expand` a good alias for ls-files's `--sparse` option?
+    (`--sparse` does not map to either `--scope=sparse` or `--scope=all`,
+    because in non-cone mode it does nothing and in cone-mode it shows the
+    sparse directory entries which are technically outside the sparse
+    specification)
+
+  * Under Behavior A:
+    * Does ls-files' `--no-expand` override the default `--scope=all`, or
+      does it need an extra flag?
+    * Does ls-files' `-t` option imply `--scope=all`?
+    * Does update-index's `--[no-]skip-worktree` option imply `--scope=all`?
+
+  * sparse-checkout: once behavior A is fully implemented, should we take
+    an interim measure to ease people into switching the default?  Namely,
+    if folks are not already in a sparse checkout, then require
+    `sparse-checkout init/set` to take a
+    `--set-scope=(sparse|worktree-sparse-history-dense|dense)` flag (which
+    would set sparse.scope according to the setting given), and throw an
+    error if the flag is not provided?  That error would be a great place
+    to warn folks that the default may change in the future, and get them
+    used to specifying what they want so that the eventual default switch
+    is seamless for them.
+
+
+=== Implementation Goals/Plans ===
+
+ * Get buy-in on this document in general.
+
+ * Figure out answers to the 'Implementation Questions' sections (above)
+
+ * Fix bugs in the 'Known bugs' section (below)
+
+ * Provide some kind of method for backfilling the blobs within the sparse
+   specification in a partial clone
+
+ [Below here is kind of spitballing since the first two haven't been resolved]
+
+ * update-index: flip the default to --no-ignore-skip-worktree-entries,
+   nuke this stupid "Oh, there's a bug?  Let me add a flag to let users
+   request that they not trigger this bug." flag
+
+ * Flags & Config
+   * Make `--sparse` in add/rm/mv a deprecated alias for `--scope=all`
+   * Make `--ignore-skip-worktree-bits` in checkout-index/checkout/restore
+     a deprecated aliases for `--scope=all`
+   * Create config option (sparse.scope?), tie it to the "Cliff notes"
+     overview
+
+   * Add --scope=sparse (and --scope=all) flag to each of the history querying
+     commands.  IMPORATNT: make sure diff machinery changes don't mess with
+     format-patch, fast-export, etc.
+
+=== Known bugs ===
+
+This list used to be a lot longer (see e.g. [1,2,3,4,5,6,7,8,9]), but we've
+been working on it.
+
+0. Behavior A is not well supported in Git.  (Behavior B didn't used to
+   be either, but was the easier of the two to implement.)
+
+1. am and apply:
+
+   apply, without `--index` or `--cached`, relies on files being present
+   in the working copy, and also writes to them unconditionally.  As
+   such, it should first check for the files' presence, and if found to
+   be SKIP_WORKTREE, then clear the bit and vivify the paths, then do
+   its work.  Currently, it just throws an error.
+
+   apply, with either `--cached` or `--index`, will not preserve the
+   SKIP_WORKTREE bit.  This is fine if the file has conflicts, but
+   otherwise SKIP_WORKTREE bits should be preserved for --cached and
+   probably also for --index.
+
+   am, if there are no conflicts, will vivify files and fail to preserve
+   the SKIP_WORKTREE bit.  If there are conflicts and `-3` is not
+   specified, it will vivify files and then complain the patch doesn't
+   apply.  If there are conflicts and `-3` is specified, it will vivify
+   files and then complain that those vivified files would be
+   overwritten by merge.
+
+2. reset --hard:
+
+   reset --hard provides confusing error message (works correctly, but
+   misleads the user into believing it didn't):
+
+    $ touch addme
+    $ git add addme
+    $ git ls-files -t
+    H addme
+    H tracked
+    S tracked-but-maybe-skipped
+    $ git reset --hard                           # usually works great
+    error: Path 'addme' not uptodate; will not remove from working tree.
+    HEAD is now at bdbbb6f third
+    $ git ls-files -t
+    H tracked
+    S tracked-but-maybe-skipped
+    $ ls -1
+    tracked
+
+    `git reset --hard` DID remove addme from the index and the working tree, contrary
+    to the error message, but in line with how reset --hard should behave.
+
+3. read-tree
+
+   `read-tree` doesn't apply the 'SKIP_WORKTREE' bit to *any* of the
+   entries it reads into the index, resulting in all your files suddenly
+   appearing to be "deleted".
+
+4. Checkout, restore:
+
+   These command do not handle path & revision arguments appropriately:
+
+    $ ls
+    tracked
+    $ git ls-files -t
+    H tracked
+    S tracked-but-maybe-skipped
+    $ git status --porcelain
+    $ git checkout -- '*skipped'
+    error: pathspec '*skipped' did not match any file(s) known to git
+    $ git ls-files -- '*skipped'
+    tracked-but-maybe-skipped
+    $ git checkout HEAD -- '*skipped'
+    error: pathspec '*skipped' did not match any file(s) known to git
+    $ git ls-tree HEAD | grep skipped
+    100644 blob 276f5a64354b791b13840f02047738c77ad0584f	tracked-but-maybe-skipped
+    $ git status --porcelain
+    $ git checkout HEAD~1 -- '*skipped'
+    $ git ls-files -t
+    H tracked
+    H tracked-but-maybe-skipped
+    $ git status --porcelain
+    M  tracked-but-maybe-skipped
+    $ git checkout HEAD -- '*skipped'
+    $ git status --porcelain
+    $
+
+    Note that checkout without a revision (or restore --staged) fails to
+    find a file to restore from the index, even though ls-files shows
+    such a file certainly exists.
+
+    Similar issues occur with HEAD (--source=HEAD in restore's case),
+    but suddenly works when HEAD~1 is specified.  And then after that it
+    will work with HEAD specified, even though it didn't before.
+
+    Directories are also an issue:
+
+    $ git sparse-checkout set nomatches
+    $ git status
+    On branch main
+    You are in a sparse checkout with 0% of tracked files present.
+
+    nothing to commit, working tree clean
+    $ git checkout .
+    error: pathspec '.' did not match any file(s) known to git
+    $ git checkout HEAD~1 .
+    Updated 1 path from 58916d9
+    $ git ls-files -t
+    S tracked
+    H tracked-but-maybe-skipped
+
+5. checkout and restore --staged, continued:
+
+   These commands do not correctly scope operations to the sparse
+   specification, and make it worse by not setting important SKIP_WORKTREE
+   bits:
+
+   $ git restore --source OLDREV --staged outside-sparse-cone/
+   $ git status --porcelain
+   MD outside-sparse-cone/file1
+   MD outside-sparse-cone/file2
+   MD outside-sparse-cone/file3
+
+   We can add a --scope=all mode to `git restore` to let it operate outside
+   the sparse specification, but then it will be important to set the
+   SKIP_WORKTREE bits appropriately.
+
+6. Performance issues; see:
+    https://lore.kernel.org/git/CABPp-BEkJQoKZsQGCYioyga_uoDQ6iBeW+FKr8JhyuuTMK1RDw@mail.gmail.com/
+
+
+=== Reference Emails ===
+
+Emails that detail various bugs we've had in sparse-checkout:
+
+[1] (Original descriptions of behavior A & behavior B)
+    https://lore.kernel.org/git/CABPp-BGJ_Nvi5TmgriD9Bh6eNXE2EDq2f8e8QKXAeYG3BxZafA@mail.gmail.com/
+[2] (Fix stash applications in sparse checkouts; bugs from behavioral differences)
+    https://lore.kernel.org/git/ccfedc7140dbf63ba26a15f93bd3885180b26517.1606861519.git.gitgitgadget@gmail.com/
+[3] (Present-despite-skipped entries)
+    https://lore.kernel.org/git/11d46a399d26c913787b704d2b7169cafc28d639.1642175983.git.gitgitgadget@gmail.com/
+[4] (Clone --no-checkout interaction)
+    https://lore.kernel.org/git/pull.801.v2.git.git.1591324899170.gitgitgadget@gmail.com/ (clone --no-checkout)
+[5] (The need for update_sparsity() and avoiding `read-tree -mu HEAD`)
+    https://lore.kernel.org/git/3a1f084641eb47515b5a41ed4409a36128913309.1585270142.git.gitgitgadget@gmail.com/
+[6] (SKIP_WORKTREE is advisory, not mandatory)
+    https://lore.kernel.org/git/844306c3e86ef67591cc086decb2b760e7d710a3.1585270142.git.gitgitgadget@gmail.com/
+[7] (`worktree add` should copy sparsity settings from current worktree)
+    https://lore.kernel.org/git/c51cb3714e7b1d2f8c9370fe87eca9984ff4859f.1644269584.git.gitgitgadget@gmail.com/
+[8] (Avoid negative surprises in add, rm, and mv)
+    https://lore.kernel.org/git/cover.1617914011.git.matheus.bernardino@usp.br/
+    https://lore.kernel.org/git/pull.1018.v4.git.1632497954.gitgitgadget@gmail.com/
+[9] (Move from out-of-cone to in-cone)
+    https://lore.kernel.org/git/20220630023737.473690-6-shaoxuan.yuan02@gmail.com/
+    https://lore.kernel.org/git/20220630023737.473690-4-shaoxuan.yuan02@gmail.com/
+[10] (Unnecessarily downloading objects outside sparse specification)
+     https://lore.kernel.org/git/CAOLTT8QfwOi9yx_qZZgyGa8iL8kHWutEED7ok_jxwTcYT_hf9Q@mail.gmail.com/
+
+[11] (Stolee's comments on high-level usecases)
+     https://lore.kernel.org/git/1a1e33f6-3514-9afc-0a28-5a6b85bd8014@gmail.com/
+
+[12] Others commenting on eventually switching default to behavior A:
+  * https://lore.kernel.org/git/xmqqh719pcoo.fsf@gitster.g/
+  * https://lore.kernel.org/git/xmqqzgeqw0sy.fsf@gitster.g/
+  * https://lore.kernel.org/git/a86af661-cf58-a4e5-0214-a67d3a794d7e@github.com/
+
+[13] Previous config name suggestion and description
+  * https://lore.kernel.org/git/CABPp-BE6zW0nJSStcVU=_DoDBnPgLqOR8pkTXK3dW11=T01OhA@mail.gmail.com/
+
+[14] Tangential issue: switch to cone mode as default sparse specification mechanism:
+  https://lore.kernel.org/git/a1b68fd6126eb341ef3637bb93fedad4309b36d0.1650594746.git.gitgitgadget@gmail.com/
+
+[15] Lengthy email on grep behavior, covering what should be searched:
+  * https://lore.kernel.org/git/CABPp-BGVO3QdbfE84uF_3QDF0-y2iHHh6G5FAFzNRfeRitkuHw@mail.gmail.com/
+
+[16] Email explaining sparsity patterns vs. SKIP_WORKTREE and history operations,
+     search for the parenthetical comment starting "We do not check".
+    https://lore.kernel.org/git/CABPp-BFsCPPNOZ92JQRJeGyNd0e-TCW-LcLyr0i_+VSQJP+GCg@mail.gmail.com/
+
+[17] https://lore.kernel.org/git/20220207190320.2960362-1-jonathantanmy@google.com/

base-commit: 1b3d6e17fe83eb6f79ffbac2f2c61bbf1eaef5f8
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH] sparse-checkout.txt: new document with sparse-checkout directions
  2022-10-06  7:53       ` Elijah Newren
@ 2022-10-15  2:17         ` ZheNing Hu
  2022-10-15  4:37           ` Elijah Newren
  0 siblings, 1 reply; 42+ messages in thread
From: ZheNing Hu @ 2022-10-15  2:17 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Derrick Stolee, Elijah Newren via GitGitGadget, Git Mailing List,
	Victoria Dye, Shaoxuan Yuan, Matheus Tavares

Elijah Newren <newren@gmail.com> 于2022年10月6日周四 15:53写道:
>
> On Fri, Sep 30, 2022 at 2:54 AM ZheNing Hu <adlternative@gmail.com> wrote:
> >
> > I am not sure if these ideas are feasible.
> >
> > Elijah Newren <newren@gmail.com> 于2022年9月28日周三 13:38写道:
> > >
> [...]
> > > > There's nothing Git can do to help those engineers that do cross-tree
> > > > work.
> > >
> > > I'm going to partially disagree with this, in part because of our
> > > experience with many inter-module dependencies that evolve over time.
> > > Folks can start on a certain module and begin refactoring.  Being
> > > aware that their changes will affect other areas of the code, the can
> > > do a search (e.g. "git grep --cached ..." to find cases outside their
> > > current sparse checkout), and then selectively unsparsify to get the
> > > relevant few dozen (or maybe even few hundred) modules added.  They
> > > aren't switching to a dense checkout, just a less sparse one.  When
> > > they are done, they may narrow their sparse specification again.  We
> > > have a number of users doing cross-tree work who are using
> > > sparse-checkouts, and who find it productive and say it still speeds
> > > up their local build/test cycles.
> > >
> > > So, I'd say that ensuring Git supports behavior B well in
> > > sparse-checkouts, is something Git can do to help out both some of the
> > > engineers doing cross-tree work, and some of the engineers that are
> > > doing cross-tree testing.
> > >
> > > (For full disclosure, we also have users doing cross-tree work using
> > > regular dense checkouts and I agree there's not a lot we can do to
> > > help them.)
> > >
> >
> > Let me guess where the cross tree users using sparse-checkout are
> > getting their revenue from:
>
> Is "revenue" perhaps a case of auto-correct choosing the wrong word?
>

s/revenue/benefits

> > 1. they don't have to download the entire repository of blobs at once
> > 2. their working tree can be easily resized.
> > 3. they could have something like sparse-index to optimize the performance
> > of git commands.
>
> These correspond to partial clone, sparse-checkout, and sparse-index.
> I think these 3 features and the various work done to support them,
> plus submodule (which is a different kind of solution) are the
> features Git provides to work with repository subsets.  Some
> repositories (especially the big monorepos like the Microsoft ones)
> will benefit from using all three of these features.  Others might
> only want to use one or two of them.
>

Here I am just amazed that cross-tree users can shorten the
test/build cycle when only using sparse-checkout. So this benefits
don't come from above there conjectures. Not partial clone, not
sparse-index, not resize repo frequently.

> As an example, the repository where we first applied sparse-checkouts
> to (and which had the complicated dependencies) does not use partial
> clones or a sparse-index.   While partial clone and sparse-index might
> help a little, the .git directory for a full clone is merely 2G, and
> there are less than 100K entries in the index.  However,
> sparse-checkout helps out a lot.
>

Yes, you make a good explanation here that we don't necessarily need
to apply all these kinds of features. But I still feel a little confuse: Where
does the time savings come from? Is it saved by the time reduction of
git checkout? Or is it the reduction of some unnecessary working tree scans
during test/build time?

> > But it's still worth worrying about the size of the git repository blobs,
> > even if it's just only blobs in mono-repo's HEAD, that may also be too big
> > for the user's local area to handle.
> >
> > Perhaps it would make more sense to place this integration testing work on
> > a remote server.
> >
> > I am not sure if these ideas are feasible:
> >
> > 1. mount the large git repo on the server to local.
> > 2. just ssh to a remote server to run integration tests.
> > 3. use an external tool to run integration tests on the remote server.
>
> Are you suggesting #1 as a way for just handling the git history, or
> also for handling the worktree with some kind of virtual file system
> where not all files are actually written locally?  If you're only
> talking about the history, then you're kind of going on a tangent
> unrelated to this document.  If you're talking about worktrees and
> virtual file systems, then Git proper doesn't have anything of the
> sort currently.  There are at least two solutions in this space --
> Microsoft's Git-VFS (which I think they are phasing out) and Google's
> similar virtual file system -- but I'm not currently particularly
> interested in either one.
>

Here I mean git nfs, or some kind of git virtual file system, or some
git workspace, I don't really understand why they are now
phasing out?

> #3 is precisely what we did first (except "*a* remote server" rather
> than "*the* remote server").  I think I called it out in the email
> you're responding to; it's often good enough for many people.
> However, sometimes those tests fail and people want to run locally so
> it's easier to inspect.  Or they just want to be able to run locally
> anyway.  So, while #3 helped, it wasn't good enough.
>

Agree, testing locally sometimes is necessary.

> #2 is also something we did.  Using tools like Coder or GitHub
> codespaces or other offerings in that area, you can provide developers
> a nice beefy box with good network connectivity to the main Git
> repository, on which they can do development and running of tests.
> Then developers can connect to such machines from a variety of
> different external locations.  Works great for some people...but build
> times and ability of IDEs to handle the code base are still an issue,
> so doing smarter things with sparse-checkouts is still important.
> And, even if #2 works for some people, others still want to develop
> and run integration tests on their (beefy) laptops.
>

Agree too.

> All three of these, as far as I can tell, are just things that
> individual teams setup and aren't anything that would affect Git's
> development one way or another.
>
>
> However, I'll note that while we internally definitely did two of the
> three things you suggested here, it wasn't a complete enough solution
> for us and sparse-checkout adoption was still pretty minimal at that
> point.  So, we went back to our sparse-checkouts and asked how we
> could modify the build system to allow us to not check out the in-tree
> dependencies of the things we are tweaking, but still get a correct
> build and allow us to run tests.  Once we got that working, we finally
> really unlocked the value of sparse checkouts for us (both improving
> things for developers on laptops, and for developers on the
> development box in the cloud).  It went from very few folks using
> sparse checkouts with that repository, to being the default and
> recommended usage at that point.
>

Yeah, I'm a big believer in sparse-checkout or partial-clone which are
good features but not many people realize that they can use them.

> While the build changes were internal things we did, I think that the
> underlying usage scenario matters to Git development because it helps
> inform how sparse-checkout can be used.  In particular, it suggests
> why some sparse-checkout users may be interested in finding results
> for files that do not match their sparse-checkout patterns -- in-tree
> dependencies may not necessarily be checked out, but those are related
> enough to the code that developers are working on, that developers are
> still potentially interested in using e.g. "git grep" or "git log -p"
> to find out information about code or changes in those other areas.
> (And, of course, developers are also potentially interested in finding
> out what other code depends on what they are changing, but I suspect
> folks were already aware of that usecase.)  It's certainly not the
> only usecase, but it's an additional one that I didn't think was quite
> reflected in Stolee's description of why users would want searches to
> turn up results for files not found in their working tree.
>

Some users may really want to focus only on their subprojects, so I think
"git log -p" shouldn't show files that don't satisfy the
sparse-checkout patterns,
and "git grep" too. But some users may need to search something globally,
and I think those people are in the minority, so maybe there should be a
"git log -p --scrope=all" or "git grep --scrope=all" for them.

> > > > The only thing I can think about is that the diffstat might want to show
> > > > the stats for the conflicted files, in which case that's an important
> > > > perspective on the distinction from --restrict.
> > >
> > > We only show the diffstat on a successful merge, so there's no
> > > diffstat to show if there are any conflicted files.
> > >
> >
> > Sorry, I have some questions here: how does git merge know there are
> > no conflicts without downloading the blobs?
>
> Not sure how that's related to the above, but to answer your question:
>

Ah, this question relates to my previous question in [1]. At first I always
thought it was git merge that caused the extra blob downloading.
In the end, it turned out to be caused by the last diffstat of merge...

> Sometimes merge has to download blobs to know if there are conflicts
> or not.  But only sometimes.  Since tree objects have the hashes of
> the blobs, having the tree objects is sufficient to determine which
> side(s) of history modified each path.
>
> If both sides of history modified the same file, then you *might* have
> conflicts, and you indeed need the blobs to verify.  But if only one
> side of history modified a file and the other left it alone, then
> there is no conflict.

I think I probably get it. e.g. tree of HEAD of user1 have a tree entry
"a4e1fc out/file1" which is same SHA1 to blob in merge base, because
it's out of sparse-checkout specification, and it fetch a commit of user2,
and its tree has a tree entry "13f91e out/file1", so git merge doesn't really
need to check the contents of the file here, because only one side
changes it.

Thanks for your answers!

[1]: https://lore.kernel.org/git/CABPp-BEBB1oqdVcXrWwMAdtb0TwHZvr-6KDa210j5ncw54Di_g@mail.gmail.com/

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] sparse-checkout.txt: new document with sparse-checkout directions
  2022-10-15  2:17         ` ZheNing Hu
@ 2022-10-15  4:37           ` Elijah Newren
  2022-10-15 14:49             ` ZheNing Hu
  0 siblings, 1 reply; 42+ messages in thread
From: Elijah Newren @ 2022-10-15  4:37 UTC (permalink / raw)
  To: ZheNing Hu
  Cc: Derrick Stolee, Elijah Newren via GitGitGadget, Git Mailing List,
	Victoria Dye, Shaoxuan Yuan, Matheus Tavares

On Fri, Oct 14, 2022 at 7:17 PM ZheNing Hu <adlternative@gmail.com> wrote:
>
> Elijah Newren <newren@gmail.com> 于2022年10月6日周四 15:53写道:
> >
> > On Fri, Sep 30, 2022 at 2:54 AM ZheNing Hu <adlternative@gmail.com> wrote:
> > >
> > > Elijah Newren <newren@gmail.com> 于2022年9月28日周三 13:38写道:
> > > >
[...]
> > As an example, the repository where we first applied sparse-checkouts
> > to (and which had the complicated dependencies) does not use partial
> > clones or a sparse-index.   While partial clone and sparse-index might
> > help a little, the .git directory for a full clone is merely 2G, and
> > there are less than 100K entries in the index.  However,
> > sparse-checkout helps out a lot.
>
> Yes, you make a good explanation here that we don't necessarily need
> to apply all these kinds of features. But I still feel a little confuse: Where
> does the time savings come from? Is it saved by the time reduction of
> git checkout? Or is it the reduction of some unnecessary working tree scans
> during test/build time?

It is neither git checkout time, nor tree scans; it's the ability to
avoid building larging parts of the project coupled with the
significantly better responsiveness of IDEs when project scope is
limited.  When directories are entirely missing, we don't need to
build any of the code in those directories and can instead just use
already built artifacts from the most recent point in history that has
been built on our continuous integration infrastructure.  (Note: our
sparsification tool will keep any modules/directories where there have
been modifications since the most recent upstream commit that has been
built, so we don't risk getting a wrong build via this strategy.)

[...]
> > > 1. mount the large git repo on the server to local.
> > > 2. just ssh to a remote server to run integration tests.
> > > 3. use an external tool to run integration tests on the remote server.
> >
> > Are you suggesting #1 as a way for just handling the git history, or
> > also for handling the worktree with some kind of virtual file system
> > where not all files are actually written locally?  If you're only
> > talking about the history, then you're kind of going on a tangent
> > unrelated to this document.  If you're talking about worktrees and
> > virtual file systems, then Git proper doesn't have anything of the
> > sort currently.  There are at least two solutions in this space --
> > Microsoft's Git-VFS (which I think they are phasing out) and Google's
> > similar virtual file system -- but I'm not currently particularly
> > interested in either one.
> >
>
> Here I mean git nfs, or some kind of git virtual file system, or some
> git workspace, I don't really understand why they are now
> phasing out?

You'd have to ask them, or read their comments on it.  I think they
believe sparse-checkout with a normal file system is or will be better
than the behavior they are getting from their virtual file system (and
they've put a lot of really good work behind making sure that is the
case).

[...]
> Some users may really want to focus only on their subprojects, so I think
> "git log -p" shouldn't show files that don't satisfy the
> sparse-checkout patterns,
> and "git grep" too. But some users may need to search something globally,
> and I think those people are in the minority, so maybe there should be a
> "git log -p --scrope=all" or "git grep --scrope=all" for them.

Good to know you're in the "Behavior A" camp and we've got another
vote for implementing things in that direction.  A couple of small
points, though:
  * It's --scope rather than --scrope.  ;-)
  * I have to disagree here slightly about people using a --scope=all
flag -- I don't think users should have to specify it with every grep
or log invocation.  Users in the "Behavior B" camp would want
`--scope=all` behavior for nearly every grep and log -p invocation
they make; it's annoying and unfair to force them to spell it out
every time.  So, I think we need a configuration option.

[...]
> > Sometimes merge has to download blobs to know if there are conflicts
> > or not.  But only sometimes.  Since tree objects have the hashes of
> > the blobs, having the tree objects is sufficient to determine which
> > side(s) of history modified each path.
> >
> > If both sides of history modified the same file, then you *might* have
> > conflicts, and you indeed need the blobs to verify.  But if only one
> > side of history modified a file and the other left it alone, then
> > there is no conflict.
>
> I think I probably get it. e.g. tree of HEAD of user1 have a tree entry
> "a4e1fc out/file1" which is same SHA1 to blob in merge base, because
> it's out of sparse-checkout specification, and it fetch a commit of user2,
> and its tree has a tree entry "13f91e out/file1", so git merge doesn't really
> need to check the contents of the file here, because only one side
> changes it.

Precisely.  :-)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH] sparse-checkout.txt: new document with sparse-checkout directions
  2022-10-15  4:37           ` Elijah Newren
@ 2022-10-15 14:49             ` ZheNing Hu
  0 siblings, 0 replies; 42+ messages in thread
From: ZheNing Hu @ 2022-10-15 14:49 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Derrick Stolee, Elijah Newren via GitGitGadget, Git Mailing List,
	Victoria Dye, Shaoxuan Yuan, Matheus Tavares

Elijah Newren <newren@gmail.com> 于2022年10月15日周六 12:38写道:
>
> On Fri, Oct 14, 2022 at 7:17 PM ZheNing Hu <adlternative@gmail.com> wrote:
> >
> > Elijah Newren <newren@gmail.com> 于2022年10月6日周四 15:53写道:
> > >
> > > On Fri, Sep 30, 2022 at 2:54 AM ZheNing Hu <adlternative@gmail.com> wrote:
> > > >
> > > > Elijah Newren <newren@gmail.com> 于2022年9月28日周三 13:38写道:
> > > > >
> [...]
> > > As an example, the repository where we first applied sparse-checkouts
> > > to (and which had the complicated dependencies) does not use partial
> > > clones or a sparse-index.   While partial clone and sparse-index might
> > > help a little, the .git directory for a full clone is merely 2G, and
> > > there are less than 100K entries in the index.  However,
> > > sparse-checkout helps out a lot.
> >
> > Yes, you make a good explanation here that we don't necessarily need
> > to apply all these kinds of features. But I still feel a little confuse: Where
> > does the time savings come from? Is it saved by the time reduction of
> > git checkout? Or is it the reduction of some unnecessary working tree scans
> > during test/build time?
>
> It is neither git checkout time, nor tree scans; it's the ability to
> avoid building larging parts of the project coupled with the
> significantly better responsiveness of IDEs when project scope is
> limited.  When directories are entirely missing, we don't need to
> build any of the code in those directories and can instead just use
> already built artifacts from the most recent point in history that has
> been built on our continuous integration infrastructure.  (Note: our
> sparsification tool will keep any modules/directories where there have
> been modifications since the most recent upstream commit that has been
> built, so we don't risk getting a wrong build via this strategy.)
>

So these users are just building/testing on a few projects and save time
from building/testing on some other projects. This is reasonable.

> [...]
> > > > 1. mount the large git repo on the server to local.
> > > > 2. just ssh to a remote server to run integration tests.
> > > > 3. use an external tool to run integration tests on the remote server.
> > >
> > > Are you suggesting #1 as a way for just handling the git history, or
> > > also for handling the worktree with some kind of virtual file system
> > > where not all files are actually written locally?  If you're only
> > > talking about the history, then you're kind of going on a tangent
> > > unrelated to this document.  If you're talking about worktrees and
> > > virtual file systems, then Git proper doesn't have anything of the
> > > sort currently.  There are at least two solutions in this space --
> > > Microsoft's Git-VFS (which I think they are phasing out) and Google's
> > > similar virtual file system -- but I'm not currently particularly
> > > interested in either one.
> > >
> >
> > Here I mean git nfs, or some kind of git virtual file system, or some
> > git workspace, I don't really understand why they are now
> > phasing out?
>
> You'd have to ask them, or read their comments on it.  I think they
> believe sparse-checkout with a normal file system is or will be better
> than the behavior they are getting from their virtual file system (and
> they've put a lot of really good work behind making sure that is the
> case).
>

Okay.

> [...]
> > Some users may really want to focus only on their subprojects, so I think
> > "git log -p" shouldn't show files that don't satisfy the
> > sparse-checkout patterns,
> > and "git grep" too. But some users may need to search something globally,
> > and I think those people are in the minority, so maybe there should be a
> > "git log -p --scrope=all" or "git grep --scrope=all" for them.
>
> Good to know you're in the "Behavior A" camp and we've got another
> vote for implementing things in that direction.  A couple of small
> points, though:
>   * It's --scope rather than --scrope.  ;-)
>   * I have to disagree here slightly about people using a --scope=all
> flag -- I don't think users should have to specify it with every grep
> or log invocation.  Users in the "Behavior B" camp would want
> `--scope=all` behavior for nearly every grep and log -p invocation
> they make; it's annoying and unfair to force them to spell it out
> every time.  So, I think we need a configuration option.
>

Fine, this configuration looks like it can balance the needs of both camps.

Thanks,
ZheNing Hu

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v4] sparse-checkout.txt: new document with sparse-checkout directions
  2022-10-08 22:52   ` [PATCH v3] " Elijah Newren via GitGitGadget
@ 2022-11-06  6:04     ` Elijah Newren via GitGitGadget
  2022-11-07 20:44       ` Derrick Stolee
  2022-11-15  4:03       ` ZheNing Hu
  0 siblings, 2 replies; 42+ messages in thread
From: Elijah Newren via GitGitGadget @ 2022-11-06  6:04 UTC (permalink / raw)
  To: git
  Cc: Victoria Dye, Derrick Stolee, Shaoxuan Yuan, Matheus Tavares,
	ZheNing Hu, Elijah Newren, Glen Choo, Martin von Zweigbergk,
	Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

Once upon a time, Matheus wrote some patches to make
   git grep [--cached | <REVISION>] ...
restrict its output to the sparsity specification when working in a
sparse checkout[1].  That effort got derailed by two things:

  (1) The --sparse-index work just beginning which we wanted to avoid
      creating conflicts for
  (2) Never deciding on flag and config names and planned high level
      behavior for all commands.

More recently, Shaoxuan implemented a more limited form of Matheus'
patches that only affected --cached, using a different flag name,
but also changing the default behavior in line with what Matheus did.
This again highlighted the fact that we never decided on command line
flag names, config option names, and the big picture path forward.

The --sparse-index work has been mostly complete (or at least released
into production even if some small edges remain) for quite some time
now.  We have also had several discussions on flag and config names,
though we never came to solid conclusions.  Stolee once upon a time
suggested putting all these into some document in
Documentation/technical[3], which Victoria recently also requested[4].
I'm behind the times, but here's a patch attempting to finally do that.

[1] https://lore.kernel.org/git/5f3f7ac77039d41d1692ceae4b0c5df3bb45b74a.1612901326.git.matheus.bernardino@usp.br/
    (See his second link in that email in particular)
[2] https://lore.kernel.org/git/20220908001854.206789-2-shaoxuan.yuan02@gmail.com/
[3] https://lore.kernel.org/git/CABPp-BHwNoVnooqDFPAsZxBT9aR5Dwk5D9sDRCvYSb8akxAJgA@mail.gmail.com/
    (Scroll to the very end for the final few paragraphs)
[4] https://lore.kernel.org/git/cafcedba-96a2-cb85-d593-ef47c8c8397c@github.com/

Signed-off-by: Elijah Newren <newren@gmail.com>
---
    sparse-checkout.txt: new document with sparse-checkout directions
    
    v2 and v3 didn't get any reviews (I know, I know, this document is
    really long), but it's been nearly a month and this patch is still
    marked as "Needs Review", so I'm hoping sending a v4 will encourage
    feedback. I think it's good enough to accept and start iterating, but
    want to be sure others agree.
    
    As before, I think we're starting to converge on actual proposals;
    there's some areas we've agreed on, others we've compromised on, and
    some we've just figured out what the others were saying. The discussion
    has been very illuminating; thanks to everyone who has chimed in. I've
    tried to take my best stab at cleaning up and culling things that don't
    need to remain as open questions, but if I've mis-represented anyone or
    missed something, don't hesitate to speak up. Everything is still open
    for debate, even if not marked as a currently open question.
    
    Changes since v3:
    
     * A few minor wording cleanups here and there, and one paragraph moved
       to keep similar things together.
    
    Changes since v2:
    
     * Compromised with Stollee on log -- Behavior A only affects
       patch-related operations, not revision walking
     * Incorporated Junio's suggestions about untracked file handling
     * Added new usecases, one brought up by Martin, one by Stolee
     * Added new sections:
       * Usecases of primary concern
       * Oversimplified mental models ("Cliff Notes" for this document!)
     * Recategorization of a few commands based on discussion
     * Greater details on how index operations work under Behavior A, to
       avoid weird edge cases
     * Extended explanation of the sparse specification, particularly when
       index differs from HEAD
     * Switched proposed flag names to --scope={sparse,all} to avoid binary
       flags that are hard to extend
     * Switched proposed config option name (still need good values and
       descriptions for it, though)
     * Removed questions we seemed to have agreement on. Modified/extended
       some existing questions.
     * Added Stolee's sparse-backfill ideas to the plans
     * Additional Known bugs
     * Various wording improvements
     * Possibly other things I've missed.
    
    Changes since v1:
    
     * Added new sections:
       * "Terminology"
       * "Behavior classes"
       * "Sparse specification vs. sparsity patterns"
     * Tried to shuffle commands from unknown into appropriate sections
       based on feedback, but I got some conflicting feedback, so...who
       knows if thing are in the right place
     * More consistency in using "sparse specification" over other terms
     * Extra comments about how add/rm/mv operate on moving files across the
       tracked/untracked boundary
     * --restrict-but-warn should have been "restrict or error", but
       reworded even more heavily as part of "Behavior classes" section
     * Added extra questions based on feedback (--no-expand, update-index
       stuff, apply --index)
     * More details on apply/am bugs
     * Documented read-tree issue
     * A few cases of fixing line wrapping at <=80 chars
     * Added more alternate name suggestions for options instead of
       --[no-]restrict

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1367%2Fnewren%2Fsparse-checkout-directions-v4
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1367/newren/sparse-checkout-directions-v4
Pull-Request: https://github.com/gitgitgadget/git/pull/1367

Range-diff vs v3:

 1:  5923e75195c ! 1:  e09c7aa2396 sparse-checkout.txt: new document with sparse-checkout directions
     @@ Documentation/technical/sparse-checkout.txt (new)
      +	with a SKIP_WORKTREE bit.  Note that if a tracked file has the
      +	SKIP_WORKTREE bit set but the file is later written by the user to
      +	the working tree anyway, the SKIP_WORKTREE bit will be cleared at
     -+	the beginning of any Git operation.
     ++	the beginning of any subsequent Git operation.
      +
      +	Most sparse checkout users are unaware of this implementation
      +	detail, and the term should generally be avoided in user-facing
     @@ Documentation/technical/sparse-checkout.txt (new)
      +
      +sparse index: A special mode for sparse-checkout that also makes the
      +	index sparse by recording a directory entry in lieu of all the
     -+	files underneath that directory.  Controlled by the
     -+	--[no-]sparse-index option to init|set|reapply.  See also
     -+	"sparse directory".
     ++	files underneath that directory (thus making that a "skipped
     ++	directory" which unfortunately has also been called a "sparse
     ++	directory"), and does this for potentially multiple
     ++	directories.  Controlled by the --[no-]sparse-index option to
     ++	init|set|reapply.
      +
      +sparsity patterns: patterns from $GIT_DIR/info/sparse-checkout used to
      +	define the set of files of interest.  A warning: It is easy to
     @@ Documentation/technical/sparse-checkout.txt (new)
      +
      +  A) Users are _only_ interested in the sparse portion of the repo
      +
     ++  A*) Users are _only_ interested in the sparse portion of the repo
     ++      that they have downloaded so far
     ++
      +  B) Users want a sparse working tree, but are working in a larger whole
      +
      +  C) sparse-checkout is a behind-the-scenes implementation detail allowing
     @@ Documentation/technical/sparse-checkout.txt (new)
      +     lazily populated, and sparse-checkout helps with the lazy population
      +     piece.
      +
     -+  A*) Users are _only_ interested in the sparse portion of the repo that
     -+      they have downloaded so far (a variant on the first usecase)
     -+
     -+
      +It may be worth explaining each of these in a bit more detail:
      +
      +
     @@ Documentation/technical/sparse-checkout.txt (new)
      +These folks might know there are other things in the repository, but
      +don't care.  They are uninterested in other parts of the repository, and
      +only want to know about changes within their area of interest.  Showing
     -+them other results from history (e.g. from diff/log/grep/etc.) is a
     ++them other files from history (e.g. from diff/log/grep/etc.)  is a
      +usability annoyance, potentially a huge one since other changes in
      +history may dwarf the changes they are interested in.
      +
     @@ Documentation/technical/sparse-checkout.txt (new)
      +after a merge or pull) can lead to worries about local repository size
      +growing unnecessarily[10].
      +
     -+  (Behavior B) Users want a sparse working tree, but are working in a larger whole
     ++  (Behavior A*) Users are _only_ interested in the sparse portion of the repo
     ++      that they have downloaded so far (a variant on the first usecase)
     ++
     ++This variant is driven by folks who using partial clones together with
     ++sparse checkouts and do disconnected development (so far sounding like a
     ++subset of behavior A users) and doing so on very large repositories.  The
     ++reason for yet another variant is that downloading even just the blobs
     ++through history within their sparse specification may be too much, so they
     ++only download some.  They would still like operations to succeed without
     ++network connectivity, though, so things like `git log -S${SEARCH_TERM} -p`
     ++or `git grep ${SEARCH_TERM} OLDREV ` would need to be prepared to provide
     ++partial results that depend on what happens to have been downloaded.
     ++
     ++This variant could be viewed as Behavior A with the sparse specification
     ++for history querying operations modified from "sparsity patterns" to
     ++"sparsity patterns limited to the blobs we have already downloaded".
     ++
     ++  (Behavior B) Users want a sparse working tree, but are working in a
     ++      larger whole
      +
      +Stolee described this usecase this way[11]:
      +
     @@ Documentation/technical/sparse-checkout.txt (new)
      +will perceive the checkout as dense, and commands should thus behave as if
      +all files are present.
      +
     -+  (Behavior A*) Users are _only_ interested in the sparse portion of the repo
     -+      that they have downloaded so far (a variant on the first usecase)
     -+
     -+This variant is driven by folks who using partial clones together with
     -+sparse checkouts and do disconnected development (so far sounding like a
     -+subset of behavior A users) and doing so on very large repositories.  The
     -+reason for yet another variant is that downloading even just the blobs
     -+through history within their sparse specification may be too much, so they
     -+only download some.  They would still like operations to succeed without
     -+network connectivity, though, so things like `git log -S${SEARCH_TERM} -p`
     -+or `git grep ${SEARCH_TERM} OLDREV ` would need to be prepared to provide
     -+partial results.
     -+
     -+This variant could be viewed as Behavior A with the sparse specification
     -+for history querying operations modified from "sparsity patterns" to
     -+"sparsity patterns limited to the blobs we have already downloaded".
     -+
      +
      +=== Usecases of primary concern ===
      +
     -+Most of the rest of this document will focus on the first two usecases:
     -+Behavior A and Behavior B.  Some notes about the other two cases and why we
     -+are not focusing on them:
     ++Most of the rest of this document will focus on Behavior A and Behavior
     ++B.  Some notes about the other two cases and why we are not focusing on
     ++them:
      +
      +  (Behavior A*)
      +
     @@ Documentation/technical/sparse-checkout.txt (new)
      +      * fast-import
      +      * commit-tree
      +
     -+  * commands that write any modified file to the working tree (conflicted or not,
     -+    and whether those paths match sparsity patterns or not):
     ++  * commands that write any modified file to the working tree (conflicted
     ++    or not, and whether those paths match sparsity patterns or not):
      +
      +      * stash
      +      * apply (without `--index` or `--cached`)
     @@ Documentation/technical/sparse-checkout.txt (new)
      +    avoid the user trying to `git add` them, forcing `git add` to display
      +    an error).
      +
     -+    It's not clear to me exactly how (or if `clean` would change, but it's
     -+    the other command that also affects untracked files.
     ++    It's not clear to me exactly how (or even if) `clean` would change,
     ++    but it's the other command that also affects untracked files.
      +
      +    `update-index` may be slightly special.  Its --[no-]skip-worktree flag
      +    may need to ignore the sparse specification by its nature.  Also, its
     @@ Documentation/technical/sparse-checkout.txt (new)
      +      * diff-tree
      +      * ls-tree
      +
     -+    Note: for log and whatchanged, only patch related parts are affected by
     -+    scoping the command to the sparse-checkout; the revision walking is
     -+    unaffected.  (The fact that revision walking is unaffected is why
     -+    rev-list, shortlog, show-branch, and bisect are not in this list.)
     ++    Note: for log and whatchanged, revision walking logic is unaffected
     ++    but displaying of patches is affected by scoping the command to the
     ++    sparse-checkout.  (The fact that revision walking is unaffected is
     ++    why rev-list, shortlog, show-branch, and bisect are not in this
     ++    list.)
      +
      +    ls-files may be slightly special in that e.g. `git ls-files -t` is
      +    often used to see what is sparse and what is not.  Perhaps -t should
     @@ Documentation/technical/sparse-checkout.txt (new)
      +    * status
      +    * clean (?)
      +
     -+    Our original implementation for these commands was "no restrict", but
     -+    it had some severe usability issues:
     ++    Our original implementation for the first three of these commands was
     ++    "no restrict", but it had some severe usability issues:
      +      * `git add <somefile>` if honored and outside the sparse
      +	specification, can result in the file randomly disappearing later
      +	when some subsequent command is run (since various commands
     @@ Documentation/technical/sparse-checkout.txt (new)
      +    * diff (with --cached or REVISION arguments)
      +    * grep (with --cached or REVISION arguments)
      +    * show (when given commit arguments)
     -+    * bisect
      +    * blame (only matters when one or more -C flags passed)
      +      * and annotate
      +    * log
     @@ Documentation/technical/sparse-checkout.txt (new)
      +tree, we still want to consider the file part of the sparse specification
      +if we are specifically performing a query related to the index (e.g. git
      +diff --cached [REVISION], git diff-index [REVISION], git restore --staged
     -+--source=REVISION -- PATHS, etc.)
     ++--source=REVISION -- PATHS, etc.)  Note that a transiently expanded sparse
     ++specification for the index usually only matters under behavior A, since
     ++under behavior B index operations are lumped with history and tend to
     ++operate full-tree.
      +
      +
      +=== Implementation Questions ===
     @@ Documentation/technical/sparse-checkout.txt (new)
      +     overview
      +
      +   * Add --scope=sparse (and --scope=all) flag to each of the history querying
     -+     commands.  IMPORATNT: make sure diff machinery changes don't mess with
     ++     commands.  IMPORTANT: make sure diff machinery changes don't mess with
      +     format-patch, fast-export, etc.
      +
      +=== Known bugs ===


 Documentation/technical/sparse-checkout.txt | 1103 +++++++++++++++++++
 1 file changed, 1103 insertions(+)
 create mode 100644 Documentation/technical/sparse-checkout.txt

diff --git a/Documentation/technical/sparse-checkout.txt b/Documentation/technical/sparse-checkout.txt
new file mode 100644
index 00000000000..fa0d01cbda4
--- /dev/null
+++ b/Documentation/technical/sparse-checkout.txt
@@ -0,0 +1,1103 @@
+Table of contents:
+
+  * Terminology
+  * Purpose of sparse-checkouts
+  * Usecases of primary concern
+  * Oversimplified mental models ("Cliff Notes" for this document!)
+  * Desired behavior
+  * Behavior classes
+  * Subcommand-dependent defaults
+  * Sparse specification vs. sparsity patterns
+  * Implementation Questions
+  * Implementation Goals/Plans
+  * Known bugs
+  * Reference Emails
+
+
+=== Terminology ===
+
+cone mode: one of two modes for specifying the desired subset of files
+	in a sparse-checkout.  In cone-mode, the user specifies
+	directories (getting both everything under that directory as
+	well as everything in leading directories), while in non-cone
+	mode, the user specifies gitignore-style patterns.  Controlled
+	by the --[no-]cone option to sparse-checkout init|set.
+
+SKIP_WORKTREE: When tracked files do not match the sparse specification and
+	are removed from the working tree, the file in the index is marked
+	with a SKIP_WORKTREE bit.  Note that if a tracked file has the
+	SKIP_WORKTREE bit set but the file is later written by the user to
+	the working tree anyway, the SKIP_WORKTREE bit will be cleared at
+	the beginning of any subsequent Git operation.
+
+	Most sparse checkout users are unaware of this implementation
+	detail, and the term should generally be avoided in user-facing
+	descriptions and command flags.  Unfortunately, prior to the
+	`sparse-checkout` subcommand this low-level detail was exposed,
+	and as of time of writing, is still exposed in various places.
+
+sparse-checkout: a subcommand in git used to reduce the files present in
+	the working tree to a subset of all tracked files.  Also, the
+	name of the file in the $GIT_DIR/info directory used to track
+	the sparsity patterns corresponding to the user's desired
+	subset.
+
+sparse cone: see cone mode
+
+sparse directory: An entry in the index corresponding to a directory, which
+	appears in the index instead of all the files under that directory
+	that would normally appear.  See also sparse-index.  Something that
+	can cause confusion is that the "sparse directory" does NOT match
+	the sparse specification, i.e. the directory is NOT present in the
+	working tree.  May be renamed in the future (e.g. to "skipped
+	directory").
+
+sparse index: A special mode for sparse-checkout that also makes the
+	index sparse by recording a directory entry in lieu of all the
+	files underneath that directory (thus making that a "skipped
+	directory" which unfortunately has also been called a "sparse
+	directory"), and does this for potentially multiple
+	directories.  Controlled by the --[no-]sparse-index option to
+	init|set|reapply.
+
+sparsity patterns: patterns from $GIT_DIR/info/sparse-checkout used to
+	define the set of files of interest.  A warning: It is easy to
+	over-use this term (or the shortened "patterns" term), for two
+	reasons: (1) users in cone mode specify directories rather than
+	patterns (their directories are transformed into patterns, but
+	users may think you are talking about non-cone mode if you use the
+	word "patterns"), and (b) the sparse specification might
+	transiently differ in the working tree or index from the sparsity
+	patterns (see "Sparse specification vs. sparsity patterns").
+
+sparse specification: The set of paths in the user's area of focus.  This
+	is typically just the tracked files that match the sparsity
+	patterns, but the sparse specification can temporarily differ and
+	include additional files.  (See also "Sparse specification
+	vs. sparsity patterns")
+
+	* When working with history, the sparse specification is exactly
+	  the set of files matching the sparsity patterns.
+	* When interacting with the working tree, the sparse specification
+	  is the set of tracked files with a clear SKIP_WORKTREE bit or
+	  tracked files present in the working copy.
+	* When modifying or showing results from the index, the sparse
+	  specification is the set of files with a clear SKIP_WORKTREE bit
+	  or that differ in the index from HEAD.
+	* If working with the index and the working copy, the sparse
+	  specification is the union of the paths from above.
+
+vivifying: When a command restores a tracked file to the working tree (and
+	hopefully also clears the SKIP_WORKTREE bit in the index for that
+	file), this is referred to as "vivifying" the file.
+
+
+=== Purpose of sparse-checkouts ===
+
+sparse-checkouts exist to allow users to work with a subset of their
+files.
+
+You can think of sparse-checkouts as subdividing "tracked" files into two
+categories -- a sparse subset, and all the rest.  Implementationally, we
+mark "all the rest" in the index with a SKIP_WORKTREE bit and leave them
+out of the working tree.  The SKIP_WORKTREE files are still tracked, just
+not present in the working tree.
+
+In the past, sparse-checkouts were defined by "SKIP_WORKTREE means the file
+is missing from the working tree but pretend the file contents match HEAD".
+That was not only bogus (it actually meant the file missing from the
+working tree matched the index rather than HEAD), but it was also a
+low-level detail which only provided decent behavior for a few commands.
+There were a surprising number of ways in which that guiding principle gave
+command results that violated user expectations, and as such was a bad
+mental model.  However, it persisted for many years and may still be found
+in some corners of the code base.
+
+Anyway, the idea of "working with a subset of files" is simple enough, but
+there are multiple different high-level usecases which affect how some Git
+subcommands should behave.  Further, even if we only considered one of
+those usecases, sparse-checkouts can modify different subcommands in over a
+half dozen different ways.  Let's start by considering the high level
+usecases:
+
+  A) Users are _only_ interested in the sparse portion of the repo
+
+  A*) Users are _only_ interested in the sparse portion of the repo
+      that they have downloaded so far
+
+  B) Users want a sparse working tree, but are working in a larger whole
+
+  C) sparse-checkout is a behind-the-scenes implementation detail allowing
+     Git to work with a specially crafted in-house virtual file system;
+     users are actually working with a "full" working tree that is
+     lazily populated, and sparse-checkout helps with the lazy population
+     piece.
+
+It may be worth explaining each of these in a bit more detail:
+
+
+  (Behavior A) Users are _only_ interested in the sparse portion of the repo
+
+These folks might know there are other things in the repository, but
+don't care.  They are uninterested in other parts of the repository, and
+only want to know about changes within their area of interest.  Showing
+them other files from history (e.g. from diff/log/grep/etc.)  is a
+usability annoyance, potentially a huge one since other changes in
+history may dwarf the changes they are interested in.
+
+Some of these users also arrive at this usecase from wanting to use partial
+clones together with sparse checkouts (in a way where they have downloaded
+blobs within the sparse specification) and do disconnected development.
+Not only do these users generally not care about other parts of the
+repository, but consider it a blocker for Git commands to try to operate on
+those.  If commands attempt to access paths in history outside the sparsity
+specification, then the partial clone will attempt to download additional
+blobs on demand, fail, and then fail the user's command.  (This may be
+unavoidable in some cases, e.g. when `git merge` has non-trivial changes to
+reconcile outside the sparse specification, but we should limit how often
+users are forced to connect to the network.)
+
+Also, even for users using partial clones that do not mind being
+always connected to the network, the need to download blobs as
+side-effects of various other commands (such as the printed diffstat
+after a merge or pull) can lead to worries about local repository size
+growing unnecessarily[10].
+
+  (Behavior A*) Users are _only_ interested in the sparse portion of the repo
+      that they have downloaded so far (a variant on the first usecase)
+
+This variant is driven by folks who using partial clones together with
+sparse checkouts and do disconnected development (so far sounding like a
+subset of behavior A users) and doing so on very large repositories.  The
+reason for yet another variant is that downloading even just the blobs
+through history within their sparse specification may be too much, so they
+only download some.  They would still like operations to succeed without
+network connectivity, though, so things like `git log -S${SEARCH_TERM} -p`
+or `git grep ${SEARCH_TERM} OLDREV ` would need to be prepared to provide
+partial results that depend on what happens to have been downloaded.
+
+This variant could be viewed as Behavior A with the sparse specification
+for history querying operations modified from "sparsity patterns" to
+"sparsity patterns limited to the blobs we have already downloaded".
+
+  (Behavior B) Users want a sparse working tree, but are working in a
+      larger whole
+
+Stolee described this usecase this way[11]:
+
+"I'm also focused on users that know that they are a part of a larger
+whole. They know they are operating on a large repository but focus on
+what they need to contribute their part. I expect multiple "roles" to
+use very different, almost disjoint parts of the codebase. Some other
+"architect" users operate across the entire tree or hop between different
+sections of the codebase as necessary. In this situation, I'm wary of
+scoping too many features to the sparse-checkout definition, especially
+"git log," as it can be too confusing to have their view of the codebase
+depend on your "point of view."
+
+People might also end up wanting behavior B due to complex inter-project
+dependencies.  The initial attempts to use sparse-checkouts usually involve
+the directories you are directly interested in plus what those directories
+depend upon within your repository.  But there's a monkey wrench here: if
+you have integration tests, they invert the hierarchy: to run integration
+tests, you need not only what you are interested in and its in-tree
+dependencies, you also need everything that depends upon what you are
+interested in or that depends upon one of your dependencies...AND you need
+all the in-tree dependencies of that expanded group.  That can easily
+change your sparse-checkout into a nearly dense one.
+
+Naturally, that tends to kill the benefits of sparse-checkouts.  There are
+a couple solutions to this conundrum: either avoid grabbing in-repo
+dependencies (maybe have built versions of your in-repo dependencies pulled
+from a CI cache somewhere), or say that users shouldn't run integration
+tests directly and instead do it on the CI server when they submit a code
+review.  Or do both.  Regardless of whether you stub out your in-repo
+dependencies or stub out the things that depend upon you, there is
+certainly a reason to want to query and be aware of those other stubbed-out
+parts of the repository, particularly when the dependencies are complex or
+change relatively frequently.  Thus, for such uses, sparse-checkouts can be
+used to limit what you directly build and modify, but these users do not
+necessarily want their sparse checkout paths to limit their queries of
+versions in history.
+
+Some people may also be interested in behavior B over behavior A simply as
+a performance workaround: if they are using non-cone mode, then they have
+to deal with its inherent quadratic performance problems.  In that mode,
+every operation that checks whether paths match the sparsity specification
+can be expensive.  As such, these users may only be willing to pay for
+those expensive checks when interacting with the working copy, and may
+prefer getting "unrelated" results from their history queries over having
+slow commands.
+
+  (Behavior C) sparse-checkout is an implementational detail supporting a
+	       special VFS.
+
+This usecase goes slightly against the traditional definition of
+sparse-checkout in that it actually tries to present a full or dense
+checkout to the user.  However, this usecase utilizes the same underlying
+technical underpinnings in a new way which does provide some performance
+advantages to users.  The basic idea is that a company can have an in-house
+Git-aware Virtual File System which pretends all files are present in the
+working tree, by intercepting all file system accesses and using those to
+fetch and write accessed files on demand via partial clones.  The VFS uses
+sparse-checkout to prevent Git from writing or paying attention to many
+files, and manually updates the sparse checkout patterns itself based on
+user access and modification of files in the working tree.  See commit
+ecc7c8841d ("repo_read_index: add config to expect files outside sparse
+patterns", 2022-02-25) and the link at [17] for a more detailed description
+of such a VFS.
+
+The biggest difference here is that users are completely unaware that the
+sparse-checkout machinery is even in use.  The sparse patterns are not
+specified by the user but rather are under the complete control of the VFS
+(and the patterns are updated frequently and dynamically by it).  The user
+will perceive the checkout as dense, and commands should thus behave as if
+all files are present.
+
+
+=== Usecases of primary concern ===
+
+Most of the rest of this document will focus on Behavior A and Behavior
+B.  Some notes about the other two cases and why we are not focusing on
+them:
+
+  (Behavior A*)
+
+Supporting this usecase is estimated to be difficult and a lot of work.
+There are no plans to implement it currently, but it may be a potential
+future alternative.  Knowing about the existence of additional alternatives
+may affect our choice of command line flags (e.g. if we need tri-state or
+quad-state flags rather than just binary flags), so it was still important
+to at least note.
+
+Further, I believe the descriptions below for Behavior A are probably still
+valid for this usecase, with the only exception being that it redefines the
+sparse specification to restrict it to already-downloaded blobs.  The hard
+part is in making commands capable of respecting that modified definition.
+
+  (Behavior C)
+
+This usecase violates some of the early sparse-checkout documented
+assumptions (since files marked as SKIP_WORKTREE will be displayed to users
+as present in the working tree).  That violation may mean various
+sparse-checkout related behaviors are not well suited to this usecase and
+we may need tweaks -- to both documentation and code -- to handle it.
+However, this usecase is also perhaps the simplest model to support in that
+everything behaves like a dense checkout with a few exceptions (e.g. branch
+checkouts and switches write fewer things, knowing the VFS will lazily
+write the rest on an as-needed basis).
+
+Since there is no publically available VFS-related code for folks to try,
+the number of folks who can test such a usecase is limited.
+
+The primary reason to note the Behavior C usecase is that as we fix things
+to better support Behaviors A and B, there may be additional places where
+we need to make tweaks allowing folks in this usecase to get the original
+non-sparse treatment.  For an example, see ecc7c8841d ("repo_read_index:
+add config to expect files outside sparse patterns", 2022-02-25).  The
+secondary reason to note Behavior C, is so that folks taking advantage of
+Behavior C do not assume they are part of the Behavior B camp and propose
+patches that break things for the real Behavior B folks.
+
+
+=== Oversimplified mental models ===
+
+An oversimplification of the differences in the above behaviors is:
+
+  Behavior A: Restrict worktree and history operations to sparse specification
+  Behavior B: Restrict worktree operations to sparse specification; have any
+	      history operations work across all files
+  Behavior C: Do not restrict either worktree or history operations to the
+	      sparse specification...with the exception of branch checkouts or
+	      switches which avoid writing files that will match the index so
+	      they can later lazily be populated instead.
+
+
+=== Desired behavior ===
+
+As noted previously, despite the simple idea of just working with a subset
+of files, there are a range of different behavioral changes that need to be
+made to different subcommands to work well with such a feature.  See
+[1,2,3,4,5,6,7,8,9,10] for various examples.  In particular, at [2], we saw
+that mere composition of other commands that individually worked correctly
+in a sparse-checkout context did not imply that the higher level command
+would work correctly; it sometimes requires further tweaks.  So,
+understanding these differences can be beneficial.
+
+* Commands behaving the same regardless of high-level use-case
+
+  * commands that only look at files within the sparsity specification
+
+      * diff (without --cached or REVISION arguments)
+      * grep (without --cached or REVISION arguments)
+      * diff-files
+
+  * commands that restore files to the working tree that match sparsity
+    patterns, and remove unmodified files that don't match those
+    patterns:
+
+      * switch
+      * checkout (the switch-like half)
+      * read-tree
+      * reset --hard
+
+  * commands that write conflicted files to the working tree, but otherwise
+    will omit writing files to the working tree that do not match the
+    sparsity patterns:
+
+      * merge
+      * rebase
+      * cherry-pick
+      * revert
+
+      * `am` and `apply --cached` should probably be in this section but
+	are buggy (see the "Known bugs" section below)
+
+    The behavior for these commands somewhat depends upon the merge
+    strategy being used:
+      * `ort` behaves as described above
+      * `recursive` tries to not vivify files unnecessarily, but does sometimes
+	vivify files without conflicts.
+      * `octopus` and `resolve` will always vivify any file changed in the merge
+	relative to the first parent, which is rather suboptimal.
+
+    It is also important to note that these commands WILL update the index
+    outside the sparse specification relative to when the operation began,
+    BUT these commands often make a commit just before or after such that
+    by the end of the operation there is no change to the index outside the
+    sparse specification.  Of course, if the operation hits conflicts or
+    does not make a commit, then these operations clearly can modify the
+    index outside the sparse specification.
+
+    Finally, it is important to note that at least the first four of these
+    commands also try to remove differences between the sparse
+    specification and the sparsity patterns (much like the commands in the
+    previous section).
+
+  * commands that always ignore sparsity since commits must be full-tree
+
+      * archive
+      * bundle
+      * commit
+      * format-patch
+      * fast-export
+      * fast-import
+      * commit-tree
+
+  * commands that write any modified file to the working tree (conflicted
+    or not, and whether those paths match sparsity patterns or not):
+
+      * stash
+      * apply (without `--index` or `--cached`)
+
+* Commands that may slightly differ for behavior A vs. behavior B:
+
+  Commands in this category behave mostly the same between the two
+  behaviors, but may differ in verbosity and types of warning and error
+  messages.
+
+  * commands that make modifications to which files are tracked:
+      * add
+      * rm
+      * mv
+      * update-index
+
+    The fact that files can move between the 'tracked' and 'untracked'
+    categories means some commands will have to treat untracked files
+    differently.  But if we have to treat untracked files differently,
+    then additional commands may also need changes:
+
+      * status
+      * clean
+
+    In particular, `status` may need to report any untracked files outside
+    the sparsity specification as an erroneous condition (especially to
+    avoid the user trying to `git add` them, forcing `git add` to display
+    an error).
+
+    It's not clear to me exactly how (or even if) `clean` would change,
+    but it's the other command that also affects untracked files.
+
+    `update-index` may be slightly special.  Its --[no-]skip-worktree flag
+    may need to ignore the sparse specification by its nature.  Also, its
+    current --[no-]ignore-skip-worktree-entries default is totally bogus.
+
+  * commands for manually tweaking paths in both the index and the working tree
+      * `restore`
+      * the restore-like half of `checkout`
+
+    These commands should be similar to add/rm/mv in that they should
+    only operate on the sparse specification by default, and require a
+    special flag to operate on all files.
+
+    Also, note that these commands currently have a number of issues (see
+    the "Known bugs" section below)
+
+* Commands that significantly differ for behavior A vs. behavior B:
+
+  * commands that query history
+      * diff (with --cached or REVISION arguments)
+      * grep (with --cached or REVISION arguments)
+      * show (when given commit arguments)
+      * blame (only matters when one or more -C flags are passed)
+	* and annotate
+      * log
+      * whatchanged
+      * ls-files
+      * diff-index
+      * diff-tree
+      * ls-tree
+
+    Note: for log and whatchanged, revision walking logic is unaffected
+    but displaying of patches is affected by scoping the command to the
+    sparse-checkout.  (The fact that revision walking is unaffected is
+    why rev-list, shortlog, show-branch, and bisect are not in this
+    list.)
+
+    ls-files may be slightly special in that e.g. `git ls-files -t` is
+    often used to see what is sparse and what is not.  Perhaps -t should
+    always work on the full tree?
+
+* Commands I don't know how to classify
+
+  * range-diff
+
+    Is this like `log` or `format-patch`?
+
+  * cherry
+
+    See range-diff
+
+* Commands unaffected by sparse-checkouts
+
+  * shortlog
+  * show-branch
+  * rev-list
+  * bisect
+
+  * branch
+  * describe
+  * fetch
+  * gc
+  * init
+  * maintenance
+  * notes
+  * pull (merge & rebase have the necessary changes)
+  * push
+  * submodule
+  * tag
+
+  * config
+  * filter-branch (works in separate checkout without sparse-checkout setup)
+  * pack-refs
+  * prune
+  * remote
+  * repack
+  * replace
+
+  * bugreport
+  * count-objects
+  * fsck
+  * gitweb
+  * help
+  * instaweb
+  * merge-tree (doesn't touch worktree or index, and merges always compute full-tree)
+  * rerere
+  * verify-commit
+  * verify-tag
+
+  * commit-graph
+  * hash-object
+  * index-pack
+  * mktag
+  * mktree
+  * multi-pack-index
+  * pack-objects
+  * prune-packed
+  * symbolic-ref
+  * unpack-objects
+  * update-ref
+  * write-tree (operates on index, possibly optimized to use sparse dir entries)
+
+  * for-each-ref
+  * get-tar-commit-id
+  * ls-remote
+  * merge-base (merges are computed full tree, so merge base should be too)
+  * name-rev
+  * pack-redundant
+  * rev-parse
+  * show-index
+  * show-ref
+  * unpack-file
+  * var
+  * verify-pack
+
+  * <Everything under 'Interacting with Others' in 'git help --all'>
+  * <Everything under 'Low-level...Syncing' in 'git help --all'>
+  * <Everything under 'Low-level...Internal Helpers' in 'git help --all'>
+  * <Everything under 'External commands' in 'git help --all'>
+
+* Commands that might be affected, but who cares?
+
+  * merge-file
+  * merge-index
+  * gitk?
+
+
+=== Behavior classes ===
+
+From the above there are a few classes of behavior:
+
+  * "restrict"
+
+    Commands in this class only read or write files in the working tree
+    within the sparse specification.
+
+    When moving to a new commit (e.g. switch, reset --hard), these commands
+    may update index files outside the sparse specification as of the start
+    of the operation, but by the end of the operation those index files
+    will match HEAD again and thus those files will again be outside the
+    sparse specification.
+
+    When paths are explicitly specified, these paths are intersected with
+    the sparse specification and will only operate on such paths.
+    (e.g. `git restore [--staged] -- '*.png'`, `git reset -p -- '*.md'`)
+
+    Some of these commands may also attempt, at the end of their operation,
+    to cull transient differences between the sparse specification and the
+    sparsity patterns (see "Sparse specification vs. sparsity patterns" for
+    details, but this basically means either removing unmodified files not
+    matching the sparsity patterns and marking those files as
+    SKIP_WORKTREE, or vivifying files that match the sparsity patterns and
+    marking those files as !SKIP_WORKTREE).
+
+  * "restrict modulo conflicts"
+
+    Commands in this class generally behave like the "restrict" class,
+    except that:
+      (1) they will ignore the sparse specification and write files with
+	  conflicts to the working tree (thus temporarily expanding the
+	  sparse specification to include such files.)
+      (2) they are grouped with commands which move to a new commit, since
+	  they often create a commit and then move to it, even though we
+	  know there are many exceptions to moving to the new commit.  (For
+	  example, the user may rebase a commit that becomes empty, or have
+	  a cherry-pick which conflicts, or a user could run `merge
+	  --no-commit`, and we also view `apply --index` kind of like `am
+	  --no-commit`.)  As such, these commands can make changes to index
+	  files outside the sparse specification, though they'll mark such
+	  files with SKIP_WORKTREE.
+
+  * "restrict also specially applied to untracked files"
+
+    Commands in this class generally behave like the "restrict" class,
+    except that they have to handle untracked files differently too, often
+    because these commands are dealing with files changing state between
+    'tracked' and 'untracked'.  Often, this may mean printing an error
+    message if the command had nothing to do, but the arguments may have
+    referred to files whose tracked-ness state could have changed were it
+    not for the sparsity patterns excluding them.
+
+  * "no restrict"
+
+    Commands in this class ignore the sparse specification entirely.
+
+  * "restrict or no restrict dependent upon behavior A vs. behavior B"
+
+    Commands in this class behave like "no restrict" for folks in the
+    behavior B camp, and like "restrict" for folks in the behavior A camp.
+    However, when behaving like "restrict" a warning of some sort might be
+    provided that history queries have been limited by the sparse-checkout
+    specification.
+
+
+=== Subcommand-dependent defaults ===
+
+Note that we have different defaults depending on the command for the
+desired behavior :
+
+  * Commands defaulting to "restrict":
+    * diff-files
+    * diff (without --cached or REVISION arguments)
+    * grep (without --cached or REVISION arguments)
+    * switch
+    * checkout (the switch-like half)
+    * reset (<commit>)
+
+    * restore
+    * checkout (the restore-like half)
+    * checkout-index
+    * reset (with pathspec)
+
+    This behavior makes sense; these interact with the working tree.
+
+  * Commands defaulting to "restrict modulo conflicts":
+    * merge
+    * rebase
+    * cherry-pick
+    * revert
+
+    * am
+    * apply --index (which is kind of like an `am --no-commit`)
+
+    * read-tree (especially with -m or -u; is kind of like a --no-commit merge)
+    * reset (<tree-ish>, due to similarity to read-tree)
+
+    These also interact with the working tree, but require slightly
+    different behavior either so that (a) conflicts can be resolved or (b)
+    because they are kind of like a merge-without-commit operation.
+
+    (See also the "Known bugs" section below regarding `am` and `apply`)
+
+  * Commands defaulting to "no restrict":
+    * archive
+    * bundle
+    * commit
+    * format-patch
+    * fast-export
+    * fast-import
+    * commit-tree
+
+    * stash
+    * apply (without `--index`)
+
+    These have completely different defaults and perhaps deserve the most
+    detailed explanation:
+
+    In the case of commands in the first group (format-patch,
+    fast-export, bundle, archive, etc.), these are commands for
+    communicating history, which will be broken if they restrict to a
+    subset of the repository.  As such, they operate on full paths and
+    have no `--restrict` option for overriding.  Some of these commands may
+    take paths for manually restricting what is exported, but it needs to
+    be very explicit.
+
+    In the case of stash, it needs to vivify files to avoid losing the
+    user's changes.
+
+    In the case of apply without `--index`, that command needs to update
+    the working tree without the index (or the index without the working
+    tree if `--cached` is passed), and if we restrict those updates to the
+    sparse specification then we'll lose changes from the user.
+
+  * Commands defaulting to "restrict also specially applied to untracked files":
+    * add
+    * rm
+    * mv
+    * update-index
+    * status
+    * clean (?)
+
+    Our original implementation for the first three of these commands was
+    "no restrict", but it had some severe usability issues:
+      * `git add <somefile>` if honored and outside the sparse
+	specification, can result in the file randomly disappearing later
+	when some subsequent command is run (since various commands
+	automatically clean up unmodified files outside the sparse
+	specification).
+      * `git rm '*.jpg'` could very negatively surprise users if it deletes
+	files outside the range of the user's interest.
+      * `git mv` has similar surprises when moving into or out of the cone,
+	so best to restrict by default
+
+    So, we switched `add` and `rm` to default to "restrict", which made
+    usability problems much less severe and less frequent, but we still got
+    complaints because commands like:
+	git add <file-outside-sparse-specification>
+	git rm <file-outside-sparse-specification>
+    would silently do nothing.  We should instead print an error in those
+    cases to get usability right.
+
+    update-index needs to be updated to match, and status and maybe clean
+    also need to be updated to specially handle untracked paths.
+
+    There may be a difference in here between behavior A and behavior B in
+    terms of verboseness of errors or additional warnings.
+
+  * Commands falling under "restrict or no restrict dependent upon behavior
+    A vs. behavior B"
+
+    * diff (with --cached or REVISION arguments)
+    * grep (with --cached or REVISION arguments)
+    * show (when given commit arguments)
+    * blame (only matters when one or more -C flags passed)
+      * and annotate
+    * log
+      * and variants: shortlog, gitk, show-branch, whatchanged, rev-list
+    * ls-files
+    * diff-index
+    * diff-tree
+    * ls-tree
+
+    For now, we default to behavior B for these, which want a default of
+    "no restrict".
+
+    Note that two of these commands -- diff and grep -- also appeared in a
+    different list with a default of "restrict", but only when limited to
+    searching the working tree.  The working tree vs. history distinction
+    is fundamental in how behavior B operates, so this is expected.  Note,
+    though, that for diff and grep with --cached, when doing "restrict"
+    behavior, the difference between sparse specification and sparsity
+    patterns is important to handle.
+
+    "restrict" may make more sense as the long term default for these[12].
+    Also, supporting "restrict" for these commands might be a fair amount
+    of work to implement, meaning it might be implemented over multiple
+    releases.  If that behavior were the default in the commands that
+    supported it, that would force behavior B users to need to learn to
+    slowly add additional flags to their commands, depending on git
+    version, to get the behavior they want.  That gradual switchover would
+    be painful, so we should avoid it at least until it's fully
+    implemented.
+
+
+=== Sparse specification vs. sparsity patterns ===
+
+In a well-behaved situation, the sparse specification is given directly
+by the $GIT_DIR/info/sparse-checkout file.  However, it can transiently
+diverge for a few reasons:
+
+    * needing to resolve conflicts (merging will vivify conflicted files)
+    * running Git commands that implicitly vivify files (e.g. "git stash apply")
+    * running Git commands that explicitly vivify files (e.g. "git checkout
+      --ignore-skip-worktree-bits FILENAME")
+    * other commands that write to these files (perhaps a user copies it
+      from elsewhere)
+
+For the last item, note that we do automatically clear the SKIP_WORKTREE
+bit for files that are present in the working tree.  This has been true
+since 82386b4496 ("Merge branch 'en/present-despite-skipped'",
+2022-03-09)
+
+However, such a situation is transient because:
+
+   * Such transient differences can and will be automatically removed as
+     a side-effect of commands which call unpack_trees() (checkout,
+     merge, reset, etc.).
+   * Users can also request such transient differences be corrected via
+     running `git sparse-checkout reapply`.  Various places recommend
+     running that command.
+   * Additional commands are also welcome to implicitly fix these
+     differences; we may add more in the future.
+
+While we avoid dropping unstaged changes or files which have conflicts,
+we otherwise aggressively try to fix these transient differences.  If
+users want these differences to persist, they should run the `set` or
+`add` subcommands of `git sparse-checkout` to reflect their intended
+sparse specification.
+
+However, when we need to do a query on history restricted to the
+"relevant subset of files" such a transiently expanded sparse
+specification is ignored.  There are a couple reasons for this:
+
+   * The behavior wanted when doing something like
+	 git grep expression REVISION
+     is roughly what the users would expect from
+	 git checkout REVISION && git grep expression
+     (modulo a "REVISION:" prefix), which has a couple ramifications:
+
+   * REVISION may have paths not in the current index, so there is no
+     path we can consult for a SKIP_WORKTREE setting for those paths.
+
+   * Since `checkout` is one of those commands that tries to remove
+     transient differences in the sparse specification, it makes sense
+     to use the corrected sparse specification
+     (i.e. $GIT_DIR/info/sparse-checkout) rather than attempting to
+     consult SKIP_WORKTREE anyway.
+
+So, a transiently expanded (or restricted) sparse specification applies to
+the working tree, but not to history queries where we always use the
+sparsity patterns.  (See [16] for an early discussion of this.)
+
+Similar to a transiently expanded sparse specification of the working tree
+based on additional files being present in the working tree, we also need
+to consider additional files being modified in the index.  In particular,
+if the user has staged changes to files (relative to HEAD) that do not
+match the sparsity patterns, and the file is not present in the working
+tree, we still want to consider the file part of the sparse specification
+if we are specifically performing a query related to the index (e.g. git
+diff --cached [REVISION], git diff-index [REVISION], git restore --staged
+--source=REVISION -- PATHS, etc.)  Note that a transiently expanded sparse
+specification for the index usually only matters under behavior A, since
+under behavior B index operations are lumped with history and tend to
+operate full-tree.
+
+
+=== Implementation Questions ===
+
+  * Do the options --scope={sparse,all} sound good to others?  Are there better
+    options?
+    * Names in use, or appearing in patches, or previously suggested:
+      * --sparse/--dense
+      * --ignore-skip-worktree-bits
+      * --ignore-skip-worktree-entries
+      * --ignore-sparsity
+      * --[no-]restrict-to-sparse-paths
+      * --full-tree/--sparse-tree
+      * --[no-]restrict
+      * --scope={sparse,all}
+      * --focus/--unfocus
+      * --limit/--unlimited
+    * Rationale making me lean slightly towards --scope={sparse,all}:
+      * We want a name that works for many commands, so we need a name that
+	does not conflict
+      * We know that we have more than two possible usecases, so it is best
+	to avoid a flag that appears to be binary.
+      * --scope={sparse,all} isn't overly long and seems relatively
+	explanatory
+      * `--sparse`, as used in add/rm/mv, is totally backwards for
+	grep/log/etc.  Changing the meaning of `--sparse` for these
+	commands would fix the backwardness, but possibly break existing
+	scripts.  Using a new name pairing would allow us to treat
+	`--sparse` in these commands as a deprecated alias.
+      * There is a different `--sparse`/`--dense` pair for commands using
+	revision machinery, so using that naming might cause confusion
+      * There is also a `--sparse` in both pack-objects and show-branch, which
+	don't conflict but do suggest that `--sparse` is overloaded
+      * The name --ignore-skip-worktree-bits is a double negative, is
+	quite a mouthful, refers to an implementation detail that many
+	users may not be familiar with, and we'd need a negation for it
+	which would probably be even more ridiculously long.  (But we
+	can make --ignore-skip-worktree-bits a deprecated alias for
+	--no-restrict.)
+
+  * If a config option is added (sparse.scope?) what should the values and
+    description be?  "sparse" (behavior A), "worktree-sparse-history-dense"
+    (behavior B), "dense" (behavior C)?  There's a risk of confusion,
+    because even for Behaviors A and B we want some commands to be
+    full-tree and others to operate sparsely, so the wording may need to be
+    more tied to the usecases and somehow explain that.  Also, right now,
+    the primary difference we are focusing is just the history-querying
+    commands (log/diff/grep).  Previous config suggestion here: [13]
+
+  * Is `--no-expand` a good alias for ls-files's `--sparse` option?
+    (`--sparse` does not map to either `--scope=sparse` or `--scope=all`,
+    because in non-cone mode it does nothing and in cone-mode it shows the
+    sparse directory entries which are technically outside the sparse
+    specification)
+
+  * Under Behavior A:
+    * Does ls-files' `--no-expand` override the default `--scope=all`, or
+      does it need an extra flag?
+    * Does ls-files' `-t` option imply `--scope=all`?
+    * Does update-index's `--[no-]skip-worktree` option imply `--scope=all`?
+
+  * sparse-checkout: once behavior A is fully implemented, should we take
+    an interim measure to ease people into switching the default?  Namely,
+    if folks are not already in a sparse checkout, then require
+    `sparse-checkout init/set` to take a
+    `--set-scope=(sparse|worktree-sparse-history-dense|dense)` flag (which
+    would set sparse.scope according to the setting given), and throw an
+    error if the flag is not provided?  That error would be a great place
+    to warn folks that the default may change in the future, and get them
+    used to specifying what they want so that the eventual default switch
+    is seamless for them.
+
+
+=== Implementation Goals/Plans ===
+
+ * Get buy-in on this document in general.
+
+ * Figure out answers to the 'Implementation Questions' sections (above)
+
+ * Fix bugs in the 'Known bugs' section (below)
+
+ * Provide some kind of method for backfilling the blobs within the sparse
+   specification in a partial clone
+
+ [Below here is kind of spitballing since the first two haven't been resolved]
+
+ * update-index: flip the default to --no-ignore-skip-worktree-entries,
+   nuke this stupid "Oh, there's a bug?  Let me add a flag to let users
+   request that they not trigger this bug." flag
+
+ * Flags & Config
+   * Make `--sparse` in add/rm/mv a deprecated alias for `--scope=all`
+   * Make `--ignore-skip-worktree-bits` in checkout-index/checkout/restore
+     a deprecated aliases for `--scope=all`
+   * Create config option (sparse.scope?), tie it to the "Cliff notes"
+     overview
+
+   * Add --scope=sparse (and --scope=all) flag to each of the history querying
+     commands.  IMPORTANT: make sure diff machinery changes don't mess with
+     format-patch, fast-export, etc.
+
+=== Known bugs ===
+
+This list used to be a lot longer (see e.g. [1,2,3,4,5,6,7,8,9]), but we've
+been working on it.
+
+0. Behavior A is not well supported in Git.  (Behavior B didn't used to
+   be either, but was the easier of the two to implement.)
+
+1. am and apply:
+
+   apply, without `--index` or `--cached`, relies on files being present
+   in the working copy, and also writes to them unconditionally.  As
+   such, it should first check for the files' presence, and if found to
+   be SKIP_WORKTREE, then clear the bit and vivify the paths, then do
+   its work.  Currently, it just throws an error.
+
+   apply, with either `--cached` or `--index`, will not preserve the
+   SKIP_WORKTREE bit.  This is fine if the file has conflicts, but
+   otherwise SKIP_WORKTREE bits should be preserved for --cached and
+   probably also for --index.
+
+   am, if there are no conflicts, will vivify files and fail to preserve
+   the SKIP_WORKTREE bit.  If there are conflicts and `-3` is not
+   specified, it will vivify files and then complain the patch doesn't
+   apply.  If there are conflicts and `-3` is specified, it will vivify
+   files and then complain that those vivified files would be
+   overwritten by merge.
+
+2. reset --hard:
+
+   reset --hard provides confusing error message (works correctly, but
+   misleads the user into believing it didn't):
+
+    $ touch addme
+    $ git add addme
+    $ git ls-files -t
+    H addme
+    H tracked
+    S tracked-but-maybe-skipped
+    $ git reset --hard                           # usually works great
+    error: Path 'addme' not uptodate; will not remove from working tree.
+    HEAD is now at bdbbb6f third
+    $ git ls-files -t
+    H tracked
+    S tracked-but-maybe-skipped
+    $ ls -1
+    tracked
+
+    `git reset --hard` DID remove addme from the index and the working tree, contrary
+    to the error message, but in line with how reset --hard should behave.
+
+3. read-tree
+
+   `read-tree` doesn't apply the 'SKIP_WORKTREE' bit to *any* of the
+   entries it reads into the index, resulting in all your files suddenly
+   appearing to be "deleted".
+
+4. Checkout, restore:
+
+   These command do not handle path & revision arguments appropriately:
+
+    $ ls
+    tracked
+    $ git ls-files -t
+    H tracked
+    S tracked-but-maybe-skipped
+    $ git status --porcelain
+    $ git checkout -- '*skipped'
+    error: pathspec '*skipped' did not match any file(s) known to git
+    $ git ls-files -- '*skipped'
+    tracked-but-maybe-skipped
+    $ git checkout HEAD -- '*skipped'
+    error: pathspec '*skipped' did not match any file(s) known to git
+    $ git ls-tree HEAD | grep skipped
+    100644 blob 276f5a64354b791b13840f02047738c77ad0584f	tracked-but-maybe-skipped
+    $ git status --porcelain
+    $ git checkout HEAD~1 -- '*skipped'
+    $ git ls-files -t
+    H tracked
+    H tracked-but-maybe-skipped
+    $ git status --porcelain
+    M  tracked-but-maybe-skipped
+    $ git checkout HEAD -- '*skipped'
+    $ git status --porcelain
+    $
+
+    Note that checkout without a revision (or restore --staged) fails to
+    find a file to restore from the index, even though ls-files shows
+    such a file certainly exists.
+
+    Similar issues occur with HEAD (--source=HEAD in restore's case),
+    but suddenly works when HEAD~1 is specified.  And then after that it
+    will work with HEAD specified, even though it didn't before.
+
+    Directories are also an issue:
+
+    $ git sparse-checkout set nomatches
+    $ git status
+    On branch main
+    You are in a sparse checkout with 0% of tracked files present.
+
+    nothing to commit, working tree clean
+    $ git checkout .
+    error: pathspec '.' did not match any file(s) known to git
+    $ git checkout HEAD~1 .
+    Updated 1 path from 58916d9
+    $ git ls-files -t
+    S tracked
+    H tracked-but-maybe-skipped
+
+5. checkout and restore --staged, continued:
+
+   These commands do not correctly scope operations to the sparse
+   specification, and make it worse by not setting important SKIP_WORKTREE
+   bits:
+
+   $ git restore --source OLDREV --staged outside-sparse-cone/
+   $ git status --porcelain
+   MD outside-sparse-cone/file1
+   MD outside-sparse-cone/file2
+   MD outside-sparse-cone/file3
+
+   We can add a --scope=all mode to `git restore` to let it operate outside
+   the sparse specification, but then it will be important to set the
+   SKIP_WORKTREE bits appropriately.
+
+6. Performance issues; see:
+    https://lore.kernel.org/git/CABPp-BEkJQoKZsQGCYioyga_uoDQ6iBeW+FKr8JhyuuTMK1RDw@mail.gmail.com/
+
+
+=== Reference Emails ===
+
+Emails that detail various bugs we've had in sparse-checkout:
+
+[1] (Original descriptions of behavior A & behavior B)
+    https://lore.kernel.org/git/CABPp-BGJ_Nvi5TmgriD9Bh6eNXE2EDq2f8e8QKXAeYG3BxZafA@mail.gmail.com/
+[2] (Fix stash applications in sparse checkouts; bugs from behavioral differences)
+    https://lore.kernel.org/git/ccfedc7140dbf63ba26a15f93bd3885180b26517.1606861519.git.gitgitgadget@gmail.com/
+[3] (Present-despite-skipped entries)
+    https://lore.kernel.org/git/11d46a399d26c913787b704d2b7169cafc28d639.1642175983.git.gitgitgadget@gmail.com/
+[4] (Clone --no-checkout interaction)
+    https://lore.kernel.org/git/pull.801.v2.git.git.1591324899170.gitgitgadget@gmail.com/ (clone --no-checkout)
+[5] (The need for update_sparsity() and avoiding `read-tree -mu HEAD`)
+    https://lore.kernel.org/git/3a1f084641eb47515b5a41ed4409a36128913309.1585270142.git.gitgitgadget@gmail.com/
+[6] (SKIP_WORKTREE is advisory, not mandatory)
+    https://lore.kernel.org/git/844306c3e86ef67591cc086decb2b760e7d710a3.1585270142.git.gitgitgadget@gmail.com/
+[7] (`worktree add` should copy sparsity settings from current worktree)
+    https://lore.kernel.org/git/c51cb3714e7b1d2f8c9370fe87eca9984ff4859f.1644269584.git.gitgitgadget@gmail.com/
+[8] (Avoid negative surprises in add, rm, and mv)
+    https://lore.kernel.org/git/cover.1617914011.git.matheus.bernardino@usp.br/
+    https://lore.kernel.org/git/pull.1018.v4.git.1632497954.gitgitgadget@gmail.com/
+[9] (Move from out-of-cone to in-cone)
+    https://lore.kernel.org/git/20220630023737.473690-6-shaoxuan.yuan02@gmail.com/
+    https://lore.kernel.org/git/20220630023737.473690-4-shaoxuan.yuan02@gmail.com/
+[10] (Unnecessarily downloading objects outside sparse specification)
+     https://lore.kernel.org/git/CAOLTT8QfwOi9yx_qZZgyGa8iL8kHWutEED7ok_jxwTcYT_hf9Q@mail.gmail.com/
+
+[11] (Stolee's comments on high-level usecases)
+     https://lore.kernel.org/git/1a1e33f6-3514-9afc-0a28-5a6b85bd8014@gmail.com/
+
+[12] Others commenting on eventually switching default to behavior A:
+  * https://lore.kernel.org/git/xmqqh719pcoo.fsf@gitster.g/
+  * https://lore.kernel.org/git/xmqqzgeqw0sy.fsf@gitster.g/
+  * https://lore.kernel.org/git/a86af661-cf58-a4e5-0214-a67d3a794d7e@github.com/
+
+[13] Previous config name suggestion and description
+  * https://lore.kernel.org/git/CABPp-BE6zW0nJSStcVU=_DoDBnPgLqOR8pkTXK3dW11=T01OhA@mail.gmail.com/
+
+[14] Tangential issue: switch to cone mode as default sparse specification mechanism:
+  https://lore.kernel.org/git/a1b68fd6126eb341ef3637bb93fedad4309b36d0.1650594746.git.gitgitgadget@gmail.com/
+
+[15] Lengthy email on grep behavior, covering what should be searched:
+  * https://lore.kernel.org/git/CABPp-BGVO3QdbfE84uF_3QDF0-y2iHHh6G5FAFzNRfeRitkuHw@mail.gmail.com/
+
+[16] Email explaining sparsity patterns vs. SKIP_WORKTREE and history operations,
+     search for the parenthetical comment starting "We do not check".
+    https://lore.kernel.org/git/CABPp-BFsCPPNOZ92JQRJeGyNd0e-TCW-LcLyr0i_+VSQJP+GCg@mail.gmail.com/
+
+[17] https://lore.kernel.org/git/20220207190320.2960362-1-jonathantanmy@google.com/

base-commit: 1b3d6e17fe83eb6f79ffbac2f2c61bbf1eaef5f8
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH v4] sparse-checkout.txt: new document with sparse-checkout directions
  2022-11-06  6:04     ` [PATCH v4] " Elijah Newren via GitGitGadget
@ 2022-11-07 20:44       ` Derrick Stolee
  2022-11-16  4:39         ` Elijah Newren
  2022-11-15  4:03       ` ZheNing Hu
  1 sibling, 1 reply; 42+ messages in thread
From: Derrick Stolee @ 2022-11-07 20:44 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git
  Cc: Victoria Dye, Shaoxuan Yuan, Matheus Tavares, ZheNing Hu,
	Elijah Newren, Glen Choo, Martin von Zweigbergk

On 11/6/22 1:04 AM, Elijah Newren via GitGitGadget wrote:

> The --sparse-index work has been mostly complete (or at least released
> into production even if some small edges remain) for quite some time
> now.  We have also had several discussions on flag and config names,
> though we never came to solid conclusions.  Stolee once upon a time
> suggested putting all these into some document in
> Documentation/technical[3], which Victoria recently also requested[4].
> I'm behind the times, but here's a patch attempting to finally do that.

This is a correct summary of where the sparse index feature is right now.

It also is a highly-requested document. Thank you for working so hard on
it and sorry for being slow to sign off on your edits since v1.

Today, I'm rereading the whole document anew, but I'll avoid any nits
since I think you are converging on a solid foundation for us to build on.

Mostly, if you asked a question in the doc, I'll reply. Nothing is binding
since the point is to ask the question in the context of the problem
statement and examples. We should remember to update this document when we
actually implement the options, so the decisions are documented here
instead of leaving answered questions lingering.

> +  * Do the options --scope={sparse,all} sound good to others?  Are there better
> +    options?
> +    * Names in use, or appearing in patches, or previously suggested:
> +      * --sparse/--dense
> +      * --ignore-skip-worktree-bits
> +      * --ignore-skip-worktree-entries
> +      * --ignore-sparsity
> +      * --[no-]restrict-to-sparse-paths
> +      * --full-tree/--sparse-tree
> +      * --[no-]restrict
> +      * --scope={sparse,all}
> +      * --focus/--unfocus
> +      * --limit/--unlimited

I'm partial to --scope={sparse|all} (with the option to add another
value if we see the need).

> +  * If a config option is added (sparse.scope?) what should the values and
> +    description be?  "sparse" (behavior A), "worktree-sparse-history-dense"
> +    (behavior B), "dense" (behavior C)?  There's a risk of confusion,
> +    because even for Behaviors A and B we want some commands to be
> +    full-tree and others to operate sparsely, so the wording may need to be
> +    more tied to the usecases and somehow explain that.  Also, right now,
> +    the primary difference we are focusing is just the history-querying
> +    commands (log/diff/grep).  Previous config suggestion here: [13]

Personally, I think we should have the same values for 'sparse.scope' and
'--scope=<X>'. For now, let's pick one behavior for the 'sparse' value and
we can add a new value to differentiate between A and B when necessary in
the future.

> +  * Is `--no-expand` a good alias for ls-files's `--sparse` option?
> +    (`--sparse` does not map to either `--scope=sparse` or `--scope=all`,
> +    because in non-cone mode it does nothing and in cone-mode it shows the
> +    sparse directory entries which are technically outside the sparse
> +    specification)
> +
> +  * Under Behavior A:
> +    * Does ls-files' `--no-expand` override the default `--scope=all`, or
> +      does it need an extra flag?
> +    * Does ls-files' `-t` option imply `--scope=all`?
> +    * Does update-index's `--[no-]skip-worktree` option imply `--scope=all`?

Since the --no-expand option is rather new, and we have a big experimental
banner on the sparse-checkout documentation, it might be good to plan for
a deprecation of these non-standard options. We could start by making them
aliases for the --scope=sparse option, but with a warning that the option
is deprecated and we will _remove_ the option in a future version. We can
document here which versions we expect those removals to happen.

> +  * sparse-checkout: once behavior A is fully implemented, should we take
> +    an interim measure to ease people into switching the default?  Namely,
> +    if folks are not already in a sparse checkout, then require
> +    `sparse-checkout init/set` to take a
> +    `--set-scope=(sparse|worktree-sparse-history-dense|dense)` flag (which
> +    would set sparse.scope according to the setting given), and throw an
> +    error if the flag is not provided?  That error would be a great place
> +    to warn folks that the default may change in the future, and get them
> +    used to specifying what they want so that the eventual default switch
> +    is seamless for them.

I'm not sure that we need a warning here. I think picking an initial default
is good enough. Let's reconsider this warning after we have more implementation
changes that provide a choice between behaviors A and B.

> +=== Implementation Goals/Plans ===
> +
> + * Get buy-in on this document in general.

Consider me bought-in.

> + * Figure out answers to the 'Implementation Questions' sections (above)
> +
> + * Fix bugs in the 'Known bugs' section (below)
> +
> + * Provide some kind of method for backfilling the blobs within the sparse
> +   specification in a partial clone
> +
> + [Below here is kind of spitballing since the first two haven't been resolved]

We can update this document as we gain clarity after the first few updates.

> + * update-index: flip the default to --no-ignore-skip-worktree-entries,
> +   nuke this stupid "Oh, there's a bug?  Let me add a flag to let users
> +   request that they not trigger this bug." flag
> +
> + * Flags & Config
> +   * Make `--sparse` in add/rm/mv a deprecated alias for `--scope=all`

This '--sparse' deprecation can eventually be a removal, I think.

> +   * Make `--ignore-skip-worktree-bits` in checkout-index/checkout/restore
> +     a deprecated aliases for `--scope=all`

This one might be harder to remove since it's much older. We can consider
it, though.

> +   * Create config option (sparse.scope?), tie it to the "Cliff notes"
> +     overview

Implementation detail: it might be nice to create a parse-opt macro that
will read the '--scope={sparse|all}' command-line option but _also_
create a method to initialize the value to the 'sparse.scope' config
option. These can both happen with the very first implementation of the
command-line option and all future integrations can follow that pattern to
get both options.

Thanks for working so hard on this doc. I think this version is ready to
merge down. Let's get started on this work. I'm excited!

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4] sparse-checkout.txt: new document with sparse-checkout directions
  2022-11-06  6:04     ` [PATCH v4] " Elijah Newren via GitGitGadget
  2022-11-07 20:44       ` Derrick Stolee
@ 2022-11-15  4:03       ` ZheNing Hu
  2022-11-16  3:18         ` ZheNing Hu
  2022-11-16  5:49         ` Elijah Newren
  1 sibling, 2 replies; 42+ messages in thread
From: ZheNing Hu @ 2022-11-15  4:03 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: git, Victoria Dye, Derrick Stolee, Shaoxuan Yuan,
	Matheus Tavares, Elijah Newren, Glen Choo, Martin von Zweigbergk

Hi,

Elijah Newren via GitGitGadget <gitgitgadget@gmail.com> 于2022年11月6日周日 14:04写道:
>
> From: Elijah Newren <newren@gmail.com>
>
> Once upon a time, Matheus wrote some patches to make
>    git grep [--cached | <REVISION>] ...
> restrict its output to the sparsity specification when working in a
> sparse checkout[1].  That effort got derailed by two things:
>
>   (1) The --sparse-index work just beginning which we wanted to avoid
>       creating conflicts for
>   (2) Never deciding on flag and config names and planned high level
>       behavior for all commands.
>
> More recently, Shaoxuan implemented a more limited form of Matheus'
> patches that only affected --cached, using a different flag name,
> but also changing the default behavior in line with what Matheus did.
> This again highlighted the fact that we never decided on command line
> flag names, config option names, and the big picture path forward.
>
> The --sparse-index work has been mostly complete (or at least released
> into production even if some small edges remain) for quite some time
> now.  We have also had several discussions on flag and config names,
> though we never came to solid conclusions.  Stolee once upon a time
> suggested putting all these into some document in
> Documentation/technical[3], which Victoria recently also requested[4].
> I'm behind the times, but here's a patch attempting to finally do that.
>
> [1] https://lore.kernel.org/git/5f3f7ac77039d41d1692ceae4b0c5df3bb45b74a.1612901326.git.matheus.bernardino@usp.br/
>     (See his second link in that email in particular)
> [2] https://lore.kernel.org/git/20220908001854.206789-2-shaoxuan.yuan02@gmail.com/
> [3] https://lore.kernel.org/git/CABPp-BHwNoVnooqDFPAsZxBT9aR5Dwk5D9sDRCvYSb8akxAJgA@mail.gmail.com/
>     (Scroll to the very end for the final few paragraphs)
> [4] https://lore.kernel.org/git/cafcedba-96a2-cb85-d593-ef47c8c8397c@github.com/
>
> Signed-off-by: Elijah Newren <newren@gmail.com>
> ---
>     sparse-checkout.txt: new document with sparse-checkout directions
>
>     v2 and v3 didn't get any reviews (I know, I know, this document is
>     really long), but it's been nearly a month and this patch is still
>     marked as "Needs Review", so I'm hoping sending a v4 will encourage
>     feedback. I think it's good enough to accept and start iterating, but
>     want to be sure others agree.
>
>     As before, I think we're starting to converge on actual proposals;
>     there's some areas we've agreed on, others we've compromised on, and
>     some we've just figured out what the others were saying. The discussion
>     has been very illuminating; thanks to everyone who has chimed in. I've
>     tried to take my best stab at cleaning up and culling things that don't
>     need to remain as open questions, but if I've mis-represented anyone or
>     missed something, don't hesitate to speak up. Everything is still open
>     for debate, even if not marked as a currently open question.
>
>     Changes since v3:
>
>      * A few minor wording cleanups here and there, and one paragraph moved
>        to keep similar things together.
>
>     Changes since v2:
>
>      * Compromised with Stollee on log -- Behavior A only affects
>        patch-related operations, not revision walking
>      * Incorporated Junio's suggestions about untracked file handling
>      * Added new usecases, one brought up by Martin, one by Stolee
>      * Added new sections:
>        * Usecases of primary concern
>        * Oversimplified mental models ("Cliff Notes" for this document!)
>      * Recategorization of a few commands based on discussion
>      * Greater details on how index operations work under Behavior A, to
>        avoid weird edge cases
>      * Extended explanation of the sparse specification, particularly when
>        index differs from HEAD
>      * Switched proposed flag names to --scope={sparse,all} to avoid binary
>        flags that are hard to extend
>      * Switched proposed config option name (still need good values and
>        descriptions for it, though)
>      * Removed questions we seemed to have agreement on. Modified/extended
>        some existing questions.
>      * Added Stolee's sparse-backfill ideas to the plans
>      * Additional Known bugs
>      * Various wording improvements
>      * Possibly other things I've missed.
>
>     Changes since v1:
>
>      * Added new sections:
>        * "Terminology"
>        * "Behavior classes"
>        * "Sparse specification vs. sparsity patterns"
>      * Tried to shuffle commands from unknown into appropriate sections
>        based on feedback, but I got some conflicting feedback, so...who
>        knows if thing are in the right place
>      * More consistency in using "sparse specification" over other terms
>      * Extra comments about how add/rm/mv operate on moving files across the
>        tracked/untracked boundary
>      * --restrict-but-warn should have been "restrict or error", but
>        reworded even more heavily as part of "Behavior classes" section
>      * Added extra questions based on feedback (--no-expand, update-index
>        stuff, apply --index)
>      * More details on apply/am bugs
>      * Documented read-tree issue
>      * A few cases of fixing line wrapping at <=80 chars
>      * Added more alternate name suggestions for options instead of
>        --[no-]restrict
>
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1367%2Fnewren%2Fsparse-checkout-directions-v4
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1367/newren/sparse-checkout-directions-v4
> Pull-Request: https://github.com/gitgitgadget/git/pull/1367
>
>  Documentation/technical/sparse-checkout.txt | 1103 +++++++++++++++++++
>  1 file changed, 1103 insertions(+)
>  create mode 100644 Documentation/technical/sparse-checkout.txt
>
> diff --git a/Documentation/technical/sparse-checkout.txt b/Documentation/technical/sparse-checkout.txt
> new file mode 100644
> +=== Terminology ===
> +
> +sparse directory: An entry in the index corresponding to a directory, which
> +       appears in the index instead of all the files under that directory
> +       that would normally appear.  See also sparse-index.  Something that
> +       can cause confusion is that the "sparse directory" does NOT match
> +       the sparse specification, i.e. the directory is NOT present in the
> +       working tree.  May be renamed in the future (e.g. to "skipped
> +       directory").
> +
> +sparse index: A special mode for sparse-checkout that also makes the
> +       index sparse by recording a directory entry in lieu of all the
> +       files underneath that directory (thus making that a "skipped
> +       directory" which unfortunately has also been called a "sparse
> +       directory"), and does this for potentially multiple
> +       directories.  Controlled by the --[no-]sparse-index option to
> +       init|set|reapply.
> +
> +sparsity patterns: patterns from $GIT_DIR/info/sparse-checkout used to
> +       define the set of files of interest.  A warning: It is easy to
> +       over-use this term (or the shortened "patterns" term), for two
> +       reasons: (1) users in cone mode specify directories rather than
> +       patterns (their directories are transformed into patterns, but
> +       users may think you are talking about non-cone mode if you use the
> +       word "patterns"), and (b) the sparse specification might

nit: s/(b)/(2)/g

> +       transiently differ in the working tree or index from the sparsity
> +       patterns (see "Sparse specification vs. sparsity patterns").
> +
> +sparse specification: The set of paths in the user's area of focus.  This
> +       is typically just the tracked files that match the sparsity
> +       patterns, but the sparse specification can temporarily differ and
> +       include additional files.  (See also "Sparse specification
> +       vs. sparsity patterns")
> +
> +       * When working with history, the sparse specification is exactly
> +         the set of files matching the sparsity patterns.
> +       * When interacting with the working tree, the sparse specification
> +         is the set of tracked files with a clear SKIP_WORKTREE bit or
> +         tracked files present in the working copy.

I'm guessing what you mean here is:
Some files are stored with a flag bit of !SKIP_WORKTREE in its index entry.
But files are "vivifying" (restore to worktree) or new files added to
index (tracked files),
they also belong to the sparse specification.

I think we can add some examples to describe these terms.

#!/bin/sh

set -x

rm -rf mono-repo
git init mono-repo -b main
(
  cd mono-repo &&
  mkdir p1 p2 &&
  echo a >p1/a &&
  echo b >p1/b &&
  echo a >p2/a &&
  echo b >p2/b &&
  git add . &&
  git commit -m ok &&
  git sparse-checkout set p1 &&
  git ls-files -t &&
  echo a >>p1/a &&
  echo b >>p1/b &&
  mkdir p2 p3 &&
  echo next >>p2/a &&
  echo next >>p3/c &&
  git add p3/c &&
  # p2/a and p3/c vivify
  git ls-files -t &&
  # compare wortree/commit
  git --no-pager diff HEAD --name-only
)

> +       * When modifying or showing results from the index, the sparse
> +         specification is the set of files with a clear SKIP_WORKTREE bit
> +         or that differ in the index from HEAD.

#!/bin/sh

set -x

rm -rf mono-repo
git init mono-repo -b main
(
  cd mono-repo &&
  mkdir p1 p2 &&
  echo a >p1/a &&
  echo b >p1/b &&
  echo a >p2/a &&
  echo b >p2/b &&
  git add . &&
  git commit -m ok &&
  git sparse-checkout set p1 &&
  git update-index --chmod=+x p2/a &&
  # compare commit/index
  git --no-pager diff --cached --name-only
)

> +       * If working with the index and the working copy, the sparse
> +         specification is the union of the paths from above.
> +
> +vivifying: When a command restores a tracked file to the working tree (and
> +       hopefully also clears the SKIP_WORKTREE bit in the index for that
> +       file), this is referred to as "vivifying" the file.
> +
> +
> +=== Purpose of sparse-checkouts ===
> +
> +sparse-checkouts exist to allow users to work with a subset of their
> +files.
> +
> +You can think of sparse-checkouts as subdividing "tracked" files into two
> +categories -- a sparse subset, and all the rest.  Implementationally, we
> +mark "all the rest" in the index with a SKIP_WORKTREE bit and leave them
> +out of the working tree.  The SKIP_WORKTREE files are still tracked, just
> +not present in the working tree.
> +
> +In the past, sparse-checkouts were defined by "SKIP_WORKTREE means the file
> +is missing from the working tree but pretend the file contents match HEAD".
> +That was not only bogus (it actually meant the file missing from the
> +working tree matched the index rather than HEAD), but it was also a
> +low-level detail which only provided decent behavior for a few commands.
> +There were a surprising number of ways in which that guiding principle gave
> +command results that violated user expectations, and as such was a bad
> +mental model.  However, it persisted for many years and may still be found
> +in some corners of the code base.
> +
> +Anyway, the idea of "working with a subset of files" is simple enough, but
> +there are multiple different high-level usecases which affect how some Git
> +subcommands should behave.  Further, even if we only considered one of
> +those usecases, sparse-checkouts can modify different subcommands in over a
> +half dozen different ways.  Let's start by considering the high level
> +usecases:
> +
> +  A) Users are _only_ interested in the sparse portion of the repo
> +
> +  A*) Users are _only_ interested in the sparse portion of the repo
> +      that they have downloaded so far
> +
> +  B) Users want a sparse working tree, but are working in a larger whole
> +
> +  C) sparse-checkout is a behind-the-scenes implementation detail allowing
> +     Git to work with a specially crafted in-house virtual file system;
> +     users are actually working with a "full" working tree that is
> +     lazily populated, and sparse-checkout helps with the lazy population
> +     piece.
> +
> +It may be worth explaining each of these in a bit more detail:
> +
> +
> +  (Behavior A) Users are _only_ interested in the sparse portion of the repo
> +
> +These folks might know there are other things in the repository, but
> +don't care.  They are uninterested in other parts of the repository, and
> +only want to know about changes within their area of interest.  Showing
> +them other files from history (e.g. from diff/log/grep/etc.)  is a
> +usability annoyance, potentially a huge one since other changes in
> +history may dwarf the changes they are interested in.
> +
> +Some of these users also arrive at this usecase from wanting to use partial
> +clones together with sparse checkouts (in a way where they have downloaded
> +blobs within the sparse specification) and do disconnected development.
> +Not only do these users generally not care about other parts of the
> +repository, but consider it a blocker for Git commands to try to operate on
> +those.  If commands attempt to access paths in history outside the sparsity
> +specification, then the partial clone will attempt to download additional
> +blobs on demand, fail, and then fail the user's command.  (This may be
> +unavoidable in some cases, e.g. when `git merge` has non-trivial changes to
> +reconcile outside the sparse specification, but we should limit how often
> +users are forced to connect to the network.)
> +
> +Also, even for users using partial clones that do not mind being
> +always connected to the network, the need to download blobs as
> +side-effects of various other commands (such as the printed diffstat
> +after a merge or pull) can lead to worries about local repository size
> +growing unnecessarily[10].
> +
> +  (Behavior A*) Users are _only_ interested in the sparse portion of the repo
> +      that they have downloaded so far (a variant on the first usecase)
> +
> +This variant is driven by folks who using partial clones together with
> +sparse checkouts and do disconnected development (so far sounding like a
> +subset of behavior A users) and doing so on very large repositories.  The
> +reason for yet another variant is that downloading even just the blobs
> +through history within their sparse specification may be too much, so they
> +only download some.  They would still like operations to succeed without
> +network connectivity, though, so things like `git log -S${SEARCH_TERM} -p`
> +or `git grep ${SEARCH_TERM} OLDREV ` would need to be prepared to provide
> +partial results that depend on what happens to have been downloaded.
> +
> +This variant could be viewed as Behavior A with the sparse specification
> +for history querying operations modified from "sparsity patterns" to
> +"sparsity patterns limited to the blobs we have already downloaded".
> +

 I think A's users might need a little more refinement.

A: Some users are _only_ interested in the sparse portion of the repo,
but they don't want to download all the blobs, they can accept to download
other data later using partial-clone, which can reduce the local storage size.
(Current default behavior)

A** : Some users are _only_ interested in the sparse portion of the repo,
but they want to download all the blobs in it to avoid some unnecessary
network connections afterwards.

> +=== Usecases of primary concern ===
> +
> +Most of the rest of this document will focus on Behavior A and Behavior
> +B.  Some notes about the other two cases and why we are not focusing on
> +them:
> +
> +  (Behavior A*)
> +
> +Supporting this usecase is estimated to be difficult and a lot of work.
> +There are no plans to implement it currently, but it may be a potential
> +future alternative.  Knowing about the existence of additional alternatives
> +may affect our choice of command line flags (e.g. if we need tri-state or
> +quad-state flags rather than just binary flags), so it was still important
> +to at least note.
> +
> +Further, I believe the descriptions below for Behavior A are probably still
> +valid for this usecase, with the only exception being that it redefines the
> +sparse specification to restrict it to already-downloaded blobs.  The hard
> +part is in making commands capable of respecting that modified definition.
> +
> +  (Behavior C)
> +
> +This usecase violates some of the early sparse-checkout documented
> +assumptions (since files marked as SKIP_WORKTREE will be displayed to users
> +as present in the working tree).  That violation may mean various
> +sparse-checkout related behaviors are not well suited to this usecase and
> +we may need tweaks -- to both documentation and code -- to handle it.
> +However, this usecase is also perhaps the simplest model to support in that
> +everything behaves like a dense checkout with a few exceptions (e.g. branch
> +checkouts and switches write fewer things, knowing the VFS will lazily
> +write the rest on an as-needed basis).
> +
> +Since there is no publically available VFS-related code for folks to try,
> +the number of folks who can test such a usecase is limited.
> +
> +The primary reason to note the Behavior C usecase is that as we fix things
> +to better support Behaviors A and B, there may be additional places where
> +we need to make tweaks allowing folks in this usecase to get the original
> +non-sparse treatment.  For an example, see ecc7c8841d ("repo_read_index:
> +add config to expect files outside sparse patterns", 2022-02-25).  The
> +secondary reason to note Behavior C, is so that folks taking advantage of
> +Behavior C do not assume they are part of the Behavior B camp and propose
> +patches that break things for the real Behavior B folks.
> +
> +
> +=== Oversimplified mental models ===
> +
> +An oversimplification of the differences in the above behaviors is:
> +
> +  Behavior A: Restrict worktree and history operations to sparse specification
> +  Behavior B: Restrict worktree operations to sparse specification; have any
> +             history operations work across all files
> +  Behavior C: Do not restrict either worktree or history operations to the
> +             sparse specification...with the exception of branch checkouts or
> +             switches which avoid writing files that will match the index so
> +             they can later lazily be populated instead.
> +
> +
> +=== Desired behavior ===
> +
> +As noted previously, despite the simple idea of just working with a subset
> +of files, there are a range of different behavioral changes that need to be
> +made to different subcommands to work well with such a feature.  See
> +[1,2,3,4,5,6,7,8,9,10] for various examples.  In particular, at [2], we saw
> +that mere composition of other commands that individually worked correctly
> +in a sparse-checkout context did not imply that the higher level command
> +would work correctly; it sometimes requires further tweaks.  So,
> +understanding these differences can be beneficial.
> +
> +* Commands behaving the same regardless of high-level use-case
> +
> +  * commands that only look at files within the sparsity specification
> +
> +      * diff (without --cached or REVISION arguments)
> +      * grep (without --cached or REVISION arguments)
> +      * diff-files
> +
> +  * commands that restore files to the working tree that match sparsity
> +    patterns, and remove unmodified files that don't match those
> +    patterns:
> +
> +      * switch
> +      * checkout (the switch-like half)
> +      * read-tree
> +      * reset --hard
> +
> +  * commands that write conflicted files to the working tree, but otherwise
> +    will omit writing files to the working tree that do not match the
> +    sparsity patterns:
> +
> +      * merge
> +      * rebase
> +      * cherry-pick
> +      * revert
> +
> +      * `am` and `apply --cached` should probably be in this section but
> +       are buggy (see the "Known bugs" section below)
> +
> +    The behavior for these commands somewhat depends upon the merge
> +    strategy being used:
> +      * `ort` behaves as described above
> +      * `recursive` tries to not vivify files unnecessarily, but does sometimes
> +       vivify files without conflicts.
> +      * `octopus` and `resolve` will always vivify any file changed in the merge
> +       relative to the first parent, which is rather suboptimal.
> +
> +    It is also important to note that these commands WILL update the index
> +    outside the sparse specification relative to when the operation began,
> +    BUT these commands often make a commit just before or after such that
> +    by the end of the operation there is no change to the index outside the
> +    sparse specification.  Of course, if the operation hits conflicts or
> +    does not make a commit, then these operations clearly can modify the
> +    index outside the sparse specification.
> +
> +    Finally, it is important to note that at least the first four of these
> +    commands also try to remove differences between the sparse
> +    specification and the sparsity patterns (much like the commands in the
> +    previous section).
> +
> +  * commands that always ignore sparsity since commits must be full-tree
> +
> +      * archive
> +      * bundle
> +      * commit
> +      * format-patch
> +      * fast-export
> +      * fast-import
> +      * commit-tree
> +
> +  * commands that write any modified file to the working tree (conflicted
> +    or not, and whether those paths match sparsity patterns or not):
> +
> +      * stash
> +      * apply (without `--index` or `--cached`)
> +
> +* Commands that may slightly differ for behavior A vs. behavior B:
> +
> +  Commands in this category behave mostly the same between the two
> +  behaviors, but may differ in verbosity and types of warning and error
> +  messages.
> +
> +  * commands that make modifications to which files are tracked:
> +      * add
> +      * rm
> +      * mv
> +      * update-index
> +
> +    The fact that files can move between the 'tracked' and 'untracked'
> +    categories means some commands will have to treat untracked files
> +    differently.  But if we have to treat untracked files differently,
> +    then additional commands may also need changes:
> +
> +      * status
> +      * clean
> +

I'm a bit worried about git status, because it's used in many shells
(e.g. zsh) i
in the git prompt function. Its default behavior is restricted, otherwise users
may get blocked when they use zsh to cd to that directory. I don't know how
to reproduce this problem (since the scenario is built on checkout to a local
unborn branch).

> +    In particular, `status` may need to report any untracked files outside
> +    the sparsity specification as an erroneous condition (especially to
> +    avoid the user trying to `git add` them, forcing `git add` to display
> +    an error).
> +
> +    It's not clear to me exactly how (or even if) `clean` would change,
> +    but it's the other command that also affects untracked files.
> +
> +    `update-index` may be slightly special.  Its --[no-]skip-worktree flag
> +    may need to ignore the sparse specification by its nature.  Also, its
> +    current --[no-]ignore-skip-worktree-entries default is totally bogus.
> +
> +  * commands for manually tweaking paths in both the index and the working tree
> +      * `restore`
> +      * the restore-like half of `checkout`
> +
> +    These commands should be similar to add/rm/mv in that they should
> +    only operate on the sparse specification by default, and require a
> +    special flag to operate on all files.
> +
> +    Also, note that these commands currently have a number of issues (see
> +    the "Known bugs" section below)
> +
> +* Commands that significantly differ for behavior A vs. behavior B:
> +
> +  * commands that query history
> +      * diff (with --cached or REVISION arguments)
> +      * grep (with --cached or REVISION arguments)
> +      * show (when given commit arguments)
> +      * blame (only matters when one or more -C flags are passed)
> +       * and annotate
> +      * log
> +      * whatchanged
> +      * ls-files
> +      * diff-index
> +      * diff-tree
> +      * ls-tree
> +
> +    Note: for log and whatchanged, revision walking logic is unaffected
> +    but displaying of patches is affected by scoping the command to the
> +    sparse-checkout.  (The fact that revision walking is unaffected is
> +    why rev-list, shortlog, show-branch, and bisect are not in this
> +    list.)
> +
> +    ls-files may be slightly special in that e.g. `git ls-files -t` is
> +    often used to see what is sparse and what is not.  Perhaps -t should
> +    always work on the full tree?
> +

Recently git ls-files added a --format option, perhaps this can be modified to
show if a file is SKIP_WORKTREE in the future.

diff --git a/builtin/ls-files.c b/builtin/ls-files.c
index 4cf8a23648..0aeff8e514 100644
--- a/builtin/ls-files.c
+++ b/builtin/ls-files.c
@@ -280,6 +280,9 @@ static size_t expand_show_index(struct strbuf *sb,
const char *start,
                              data->pathname));
        else if (skip_prefix(start, "(path)", &p))
                write_name_to_buf(sb, data->pathname);
+       else if (skip_prefix(start, "(skiptree)", &p))
+               strbuf_addstr(sb, ce_skip_worktree(data->ce) ?
+                             "true" : "false");
        else
                die(_("bad ls-files format: %%%.*s"), (int)len, start);


> +=== Behavior classes ===
> +
> +From the above there are a few classes of behavior:
> +
> +  * "restrict"
> +
> +    Commands in this class only read or write files in the working tree
> +    within the sparse specification.
> +
> +    When moving to a new commit (e.g. switch, reset --hard), these commands
> +    may update index files outside the sparse specification as of the start
> +    of the operation, but by the end of the operation those index files
> +    will match HEAD again and thus those files will again be outside the
> +    sparse specification.
> +
> +    When paths are explicitly specified, these paths are intersected with
> +    the sparse specification and will only operate on such paths.
> +    (e.g. `git restore [--staged] -- '*.png'`, `git reset -p -- '*.md'`)
> +
> +    Some of these commands may also attempt, at the end of their operation,
> +    to cull transient differences between the sparse specification and the
> +    sparsity patterns (see "Sparse specification vs. sparsity patterns" for
> +    details, but this basically means either removing unmodified files not
> +    matching the sparsity patterns and marking those files as
> +    SKIP_WORKTREE, or vivifying files that match the sparsity patterns and
> +    marking those files as !SKIP_WORKTREE).
> +
> +  * "restrict modulo conflicts"
> +
> +    Commands in this class generally behave like the "restrict" class,
> +    except that:
> +      (1) they will ignore the sparse specification and write files with
> +         conflicts to the working tree (thus temporarily expanding the
> +         sparse specification to include such files.)
> +      (2) they are grouped with commands which move to a new commit, since
> +         they often create a commit and then move to it, even though we
> +         know there are many exceptions to moving to the new commit.  (For
> +         example, the user may rebase a commit that becomes empty, or have
> +         a cherry-pick which conflicts, or a user could run `merge
> +         --no-commit`, and we also view `apply --index` kind of like `am
> +         --no-commit`.)  As such, these commands can make changes to index
> +         files outside the sparse specification, though they'll mark such
> +         files with SKIP_WORKTREE.
> +
> +  * "restrict also specially applied to untracked files"
> +
> +    Commands in this class generally behave like the "restrict" class,
> +    except that they have to handle untracked files differently too, often
> +    because these commands are dealing with files changing state between
> +    'tracked' and 'untracked'.  Often, this may mean printing an error
> +    message if the command had nothing to do, but the arguments may have
> +    referred to files whose tracked-ness state could have changed were it
> +    not for the sparsity patterns excluding them.
> +
> +  * "no restrict"
> +
> +    Commands in this class ignore the sparse specification entirely.
> +
> +  * "restrict or no restrict dependent upon behavior A vs. behavior B"
> +
> +    Commands in this class behave like "no restrict" for folks in the
> +    behavior B camp, and like "restrict" for folks in the behavior A camp.
> +    However, when behaving like "restrict" a warning of some sort might be
> +    provided that history queries have been limited by the sparse-checkout
> +    specification.
> +
> +
> +=== Subcommand-dependent defaults ===
> +
> +Note that we have different defaults depending on the command for the
> +desired behavior :
> +
> +  * Commands defaulting to "restrict":
> +    * diff-files
> +    * diff (without --cached or REVISION arguments)
> +    * grep (without --cached or REVISION arguments)
> +    * switch
> +    * checkout (the switch-like half)
> +    * reset (<commit>)
> +
> +    * restore
> +    * checkout (the restore-like half)
> +    * checkout-index
> +    * reset (with pathspec)
> +
> +    This behavior makes sense; these interact with the working tree.
> +
> +  * Commands defaulting to "restrict modulo conflicts":
> +    * merge
> +    * rebase
> +    * cherry-pick
> +    * revert
> +
> +    * am
> +    * apply --index (which is kind of like an `am --no-commit`)
> +
> +    * read-tree (especially with -m or -u; is kind of like a --no-commit merge)
> +    * reset (<tree-ish>, due to similarity to read-tree)
> +
> +    These also interact with the working tree, but require slightly
> +    different behavior either so that (a) conflicts can be resolved or (b)
> +    because they are kind of like a merge-without-commit operation.
> +
> +    (See also the "Known bugs" section below regarding `am` and `apply`)
> +
> +  * Commands defaulting to "no restrict":
> +    * archive
> +    * bundle
> +    * commit
> +    * format-patch
> +    * fast-export
> +    * fast-import
> +    * commit-tree
> +
> +    * stash
> +    * apply (without `--index`)
> +
> +    These have completely different defaults and perhaps deserve the most
> +    detailed explanation:
> +
> +    In the case of commands in the first group (format-patch,
> +    fast-export, bundle, archive, etc.), these are commands for
> +    communicating history, which will be broken if they restrict to a
> +    subset of the repository.  As such, they operate on full paths and
> +    have no `--restrict` option for overriding.  Some of these commands may
> +    take paths for manually restricting what is exported, but it needs to
> +    be very explicit.
> +
> +    In the case of stash, it needs to vivify files to avoid losing the
> +    user's changes.
> +
> +    In the case of apply without `--index`, that command needs to update
> +    the working tree without the index (or the index without the working
> +    tree if `--cached` is passed), and if we restrict those updates to the
> +    sparse specification then we'll lose changes from the user.
> +
> +  * Commands defaulting to "restrict also specially applied to untracked files":
> +    * add
> +    * rm
> +    * mv
> +    * update-index
> +    * status
> +    * clean (?)
> +
> +    Our original implementation for the first three of these commands was
> +    "no restrict", but it had some severe usability issues:
> +      * `git add <somefile>` if honored and outside the sparse
> +       specification, can result in the file randomly disappearing later
> +       when some subsequent command is run (since various commands
> +       automatically clean up unmodified files outside the sparse
> +       specification).
> +      * `git rm '*.jpg'` could very negatively surprise users if it deletes
> +       files outside the range of the user's interest.
> +      * `git mv` has similar surprises when moving into or out of the cone,
> +       so best to restrict by default
> +
> +    So, we switched `add` and `rm` to default to "restrict", which made
> +    usability problems much less severe and less frequent, but we still got
> +    complaints because commands like:
> +       git add <file-outside-sparse-specification>
> +       git rm <file-outside-sparse-specification>
> +    would silently do nothing.  We should instead print an error in those
> +    cases to get usability right.
> +
> +    update-index needs to be updated to match, and status and maybe clean
> +    also need to be updated to specially handle untracked paths.
> +
> +    There may be a difference in here between behavior A and behavior B in
> +    terms of verboseness of errors or additional warnings.
> +
> +  * Commands falling under "restrict or no restrict dependent upon behavior
> +    A vs. behavior B"
> +
> +    * diff (with --cached or REVISION arguments)
> +    * grep (with --cached or REVISION arguments)
> +    * show (when given commit arguments)
> +    * blame (only matters when one or more -C flags passed)
> +      * and annotate
> +    * log
> +      * and variants: shortlog, gitk, show-branch, whatchanged, rev-list
> +    * ls-files
> +    * diff-index
> +    * diff-tree
> +    * ls-tree
> +
> +    For now, we default to behavior B for these, which want a default of
> +    "no restrict".
> +
> +    Note that two of these commands -- diff and grep -- also appeared in a
> +    different list with a default of "restrict", but only when limited to
> +    searching the working tree.  The working tree vs. history distinction
> +    is fundamental in how behavior B operates, so this is expected.  Note,
> +    though, that for diff and grep with --cached, when doing "restrict"
> +    behavior, the difference between sparse specification and sparsity
> +    patterns is important to handle.
> +
> +    "restrict" may make more sense as the long term default for these[12].
> +    Also, supporting "restrict" for these commands might be a fair amount
> +    of work to implement, meaning it might be implemented over multiple
> +    releases.  If that behavior were the default in the commands that
> +    supported it, that would force behavior B users to need to learn to
> +    slowly add additional flags to their commands, depending on git
> +    version, to get the behavior they want.  That gradual switchover would
> +    be painful, so we should avoid it at least until it's fully
> +    implemented.
> +
> +
> +=== Sparse specification vs. sparsity patterns ===
> +
> +In a well-behaved situation, the sparse specification is given directly
> +by the $GIT_DIR/info/sparse-checkout file.  However, it can transiently
> +diverge for a few reasons:
> +
> +    * needing to resolve conflicts (merging will vivify conflicted files)
> +    * running Git commands that implicitly vivify files (e.g. "git stash apply")
> +    * running Git commands that explicitly vivify files (e.g. "git checkout
> +      --ignore-skip-worktree-bits FILENAME")
> +    * other commands that write to these files (perhaps a user copies it
> +      from elsewhere)
> +
> +For the last item, note that we do automatically clear the SKIP_WORKTREE
> +bit for files that are present in the working tree.  This has been true
> +since 82386b4496 ("Merge branch 'en/present-despite-skipped'",
> +2022-03-09)
> +
> +However, such a situation is transient because:
> +
> +   * Such transient differences can and will be automatically removed as
> +     a side-effect of commands which call unpack_trees() (checkout,
> +     merge, reset, etc.).
> +   * Users can also request such transient differences be corrected via
> +     running `git sparse-checkout reapply`.  Various places recommend
> +     running that command.
> +   * Additional commands are also welcome to implicitly fix these
> +     differences; we may add more in the future.
> +
> +While we avoid dropping unstaged changes or files which have conflicts,
> +we otherwise aggressively try to fix these transient differences.  If
> +users want these differences to persist, they should run the `set` or
> +`add` subcommands of `git sparse-checkout` to reflect their intended
> +sparse specification.
> +
> +However, when we need to do a query on history restricted to the
> +"relevant subset of files" such a transiently expanded sparse
> +specification is ignored.  There are a couple reasons for this:
> +
> +   * The behavior wanted when doing something like
> +        git grep expression REVISION
> +     is roughly what the users would expect from
> +        git checkout REVISION && git grep expression
> +     (modulo a "REVISION:" prefix), which has a couple ramifications:
> +
> +   * REVISION may have paths not in the current index, so there is no
> +     path we can consult for a SKIP_WORKTREE setting for those paths.
> +
> +   * Since `checkout` is one of those commands that tries to remove
> +     transient differences in the sparse specification, it makes sense
> +     to use the corrected sparse specification
> +     (i.e. $GIT_DIR/info/sparse-checkout) rather than attempting to
> +     consult SKIP_WORKTREE anyway.
> +
> +So, a transiently expanded (or restricted) sparse specification applies to
> +the working tree, but not to history queries where we always use the
> +sparsity patterns.  (See [16] for an early discussion of this.)
> +
> +Similar to a transiently expanded sparse specification of the working tree
> +based on additional files being present in the working tree, we also need
> +to consider additional files being modified in the index.  In particular,
> +if the user has staged changes to files (relative to HEAD) that do not
> +match the sparsity patterns, and the file is not present in the working
> +tree, we still want to consider the file part of the sparse specification
> +if we are specifically performing a query related to the index (e.g. git
> +diff --cached [REVISION], git diff-index [REVISION], git restore --staged
> +--source=REVISION -- PATHS, etc.)  Note that a transiently expanded sparse
> +specification for the index usually only matters under behavior A, since
> +under behavior B index operations are lumped with history and tend to
> +operate full-tree.
> +
> +
> +=== Implementation Questions ===
> +
> +  * Do the options --scope={sparse,all} sound good to others?  Are there better
> +    options?
> +    * Names in use, or appearing in patches, or previously suggested:
> +      * --sparse/--dense
> +      * --ignore-skip-worktree-bits
> +      * --ignore-skip-worktree-entries
> +      * --ignore-sparsity
> +      * --[no-]restrict-to-sparse-paths
> +      * --full-tree/--sparse-tree
> +      * --[no-]restrict
> +      * --scope={sparse,all}
> +      * --focus/--unfocus
> +      * --limit/--unlimited
> +    * Rationale making me lean slightly towards --scope={sparse,all}:
> +      * We want a name that works for many commands, so we need a name that
> +       does not conflict
> +      * We know that we have more than two possible usecases, so it is best
> +       to avoid a flag that appears to be binary.
> +      * --scope={sparse,all} isn't overly long and seems relatively
> +       explanatory
> +      * `--sparse`, as used in add/rm/mv, is totally backwards for
> +       grep/log/etc.  Changing the meaning of `--sparse` for these
> +       commands would fix the backwardness, but possibly break existing
> +       scripts.  Using a new name pairing would allow us to treat
> +       `--sparse` in these commands as a deprecated alias.
> +      * There is a different `--sparse`/`--dense` pair for commands using
> +       revision machinery, so using that naming might cause confusion
> +      * There is also a `--sparse` in both pack-objects and show-branch, which
> +       don't conflict but do suggest that `--sparse` is overloaded
> +      * The name --ignore-skip-worktree-bits is a double negative, is
> +       quite a mouthful, refers to an implementation detail that many
> +       users may not be familiar with, and we'd need a negation for it
> +       which would probably be even more ridiculously long.  (But we
> +       can make --ignore-skip-worktree-bits a deprecated alias for
> +       --no-restrict.)
> +
> +  * If a config option is added (sparse.scope?) what should the values and
> +    description be?  "sparse" (behavior A), "worktree-sparse-history-dense"
> +    (behavior B), "dense" (behavior C)?  There's a risk of confusion,
> +    because even for Behaviors A and B we want some commands to be
> +    full-tree and others to operate sparsely, so the wording may need to be
> +    more tied to the usecases and somehow explain that.  Also, right now,
> +    the primary difference we are focusing is just the history-querying
> +    commands (log/diff/grep).  Previous config suggestion here: [13]
> +

Maybe sparse.scope={sparse, all}?

> +  * Is `--no-expand` a good alias for ls-files's `--sparse` option?
> +    (`--sparse` does not map to either `--scope=sparse` or `--scope=all`,
> +    because in non-cone mode it does nothing and in cone-mode it shows the
> +    sparse directory entries which are technically outside the sparse
> +    specification)
> +
> +  * Under Behavior A:
> +    * Does ls-files' `--no-expand` override the default `--scope=all`, or
> +      does it need an extra flag?
> +    * Does ls-files' `-t` option imply `--scope=all`?
> +    * Does update-index's `--[no-]skip-worktree` option imply `--scope=all`?
> +
> +  * sparse-checkout: once behavior A is fully implemented, should we take
> +    an interim measure to ease people into switching the default?  Namely,
> +    if folks are not already in a sparse checkout, then require
> +    `sparse-checkout init/set` to take a
> +    `--set-scope=(sparse|worktree-sparse-history-dense|dense)` flag (which
> +    would set sparse.scope according to the setting given), and throw an
> +    error if the flag is not provided?  That error would be a great place
> +    to warn folks that the default may change in the future, and get them
> +    used to specifying what they want so that the eventual default switch
> +    is seamless for them.
> +
> +
> +=== Implementation Goals/Plans ===
> +
> + * Get buy-in on this document in general.
> +
> + * Figure out answers to the 'Implementation Questions' sections (above)
> +
> + * Fix bugs in the 'Known bugs' section (below)
> +
> + * Provide some kind of method for backfilling the blobs within the sparse
> +   specification in a partial clone
> +
> + [Below here is kind of spitballing since the first two haven't been resolved]
> +
> + * update-index: flip the default to --no-ignore-skip-worktree-entries,
> +   nuke this stupid "Oh, there's a bug?  Let me add a flag to let users
> +   request that they not trigger this bug." flag
> +
> + * Flags & Config
> +   * Make `--sparse` in add/rm/mv a deprecated alias for `--scope=all`
> +   * Make `--ignore-skip-worktree-bits` in checkout-index/checkout/restore
> +     a deprecated aliases for `--scope=all`
> +   * Create config option (sparse.scope?), tie it to the "Cliff notes"
> +     overview
> +
> +   * Add --scope=sparse (and --scope=all) flag to each of the history querying
> +     commands.  IMPORTANT: make sure diff machinery changes don't mess with
> +     format-patch, fast-export, etc.
> +

Thanks,
ZheNing Hu

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH v4] sparse-checkout.txt: new document with sparse-checkout directions
  2022-11-15  4:03       ` ZheNing Hu
@ 2022-11-16  3:18         ` ZheNing Hu
  2022-11-16  6:51           ` Elijah Newren
  2022-11-16  5:49         ` Elijah Newren
  1 sibling, 1 reply; 42+ messages in thread
From: ZheNing Hu @ 2022-11-16  3:18 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: git, Victoria Dye, Derrick Stolee, Shaoxuan Yuan,
	Matheus Tavares, Elijah Newren, Glen Choo, Martin von Zweigbergk

ZheNing Hu <adlternative@gmail.com> 于2022年11月15日周二 12:03写道:
>
> Hi,
>
> Elijah Newren via GitGitGadget <gitgitgadget@gmail.com> 于2022年11月6日周日 14:04写道:
> >
> > From: Elijah Newren <newren@gmail.com>
> >
> > Once upon a time, Matheus wrote some patches to make
> >    git grep [--cached | <REVISION>] ...
> > restrict its output to the sparsity specification when working in a
> > sparse checkout[1].  That effort got derailed by two things:
> >
> >   (1) The --sparse-index work just beginning which we wanted to avoid
> >       creating conflicts for
> >   (2) Never deciding on flag and config names and planned high level
> >       behavior for all commands.
> >
> > More recently, Shaoxuan implemented a more limited form of Matheus'
> > patches that only affected --cached, using a different flag name,
> > but also changing the default behavior in line with what Matheus did.
> > This again highlighted the fact that we never decided on command line
> > flag names, config option names, and the big picture path forward.
> >
> > The --sparse-index work has been mostly complete (or at least released
> > into production even if some small edges remain) for quite some time
> > now.  We have also had several discussions on flag and config names,
> > though we never came to solid conclusions.  Stolee once upon a time
> > suggested putting all these into some document in
> > Documentation/technical[3], which Victoria recently also requested[4].
> > I'm behind the times, but here's a patch attempting to finally do that.
> >
> > [1] https://lore.kernel.org/git/5f3f7ac77039d41d1692ceae4b0c5df3bb45b74a.1612901326.git.matheus.bernardino@usp.br/
> >     (See his second link in that email in particular)
> > [2] https://lore.kernel.org/git/20220908001854.206789-2-shaoxuan.yuan02@gmail.com/
> > [3] https://lore.kernel.org/git/CABPp-BHwNoVnooqDFPAsZxBT9aR5Dwk5D9sDRCvYSb8akxAJgA@mail.gmail.com/
> >     (Scroll to the very end for the final few paragraphs)
> > [4] https://lore.kernel.org/git/cafcedba-96a2-cb85-d593-ef47c8c8397c@github.com/
> >
> > Signed-off-by: Elijah Newren <newren@gmail.com>
> > ---
> >     sparse-checkout.txt: new document with sparse-checkout directions
> >
> >     v2 and v3 didn't get any reviews (I know, I know, this document is
> >     really long), but it's been nearly a month and this patch is still
> >     marked as "Needs Review", so I'm hoping sending a v4 will encourage
> >     feedback. I think it's good enough to accept and start iterating, but
> >     want to be sure others agree.
> >
> >     As before, I think we're starting to converge on actual proposals;
> >     there's some areas we've agreed on, others we've compromised on, and
> >     some we've just figured out what the others were saying. The discussion
> >     has been very illuminating; thanks to everyone who has chimed in. I've
> >     tried to take my best stab at cleaning up and culling things that don't
> >     need to remain as open questions, but if I've mis-represented anyone or
> >     missed something, don't hesitate to speak up. Everything is still open
> >     for debate, even if not marked as a currently open question.
> >
> >     Changes since v3:
> >
> >      * A few minor wording cleanups here and there, and one paragraph moved
> >        to keep similar things together.
> >
> >     Changes since v2:
> >
> >      * Compromised with Stollee on log -- Behavior A only affects
> >        patch-related operations, not revision walking
> >      * Incorporated Junio's suggestions about untracked file handling
> >      * Added new usecases, one brought up by Martin, one by Stolee
> >      * Added new sections:
> >        * Usecases of primary concern
> >        * Oversimplified mental models ("Cliff Notes" for this document!)
> >      * Recategorization of a few commands based on discussion
> >      * Greater details on how index operations work under Behavior A, to
> >        avoid weird edge cases
> >      * Extended explanation of the sparse specification, particularly when
> >        index differs from HEAD
> >      * Switched proposed flag names to --scope={sparse,all} to avoid binary
> >        flags that are hard to extend
> >      * Switched proposed config option name (still need good values and
> >        descriptions for it, though)
> >      * Removed questions we seemed to have agreement on. Modified/extended
> >        some existing questions.
> >      * Added Stolee's sparse-backfill ideas to the plans
> >      * Additional Known bugs
> >      * Various wording improvements
> >      * Possibly other things I've missed.
> >
> >     Changes since v1:
> >
> >      * Added new sections:
> >        * "Terminology"
> >        * "Behavior classes"
> >        * "Sparse specification vs. sparsity patterns"
> >      * Tried to shuffle commands from unknown into appropriate sections
> >        based on feedback, but I got some conflicting feedback, so...who
> >        knows if thing are in the right place
> >      * More consistency in using "sparse specification" over other terms
> >      * Extra comments about how add/rm/mv operate on moving files across the
> >        tracked/untracked boundary
> >      * --restrict-but-warn should have been "restrict or error", but
> >        reworded even more heavily as part of "Behavior classes" section
> >      * Added extra questions based on feedback (--no-expand, update-index
> >        stuff, apply --index)
> >      * More details on apply/am bugs
> >      * Documented read-tree issue
> >      * A few cases of fixing line wrapping at <=80 chars
> >      * Added more alternate name suggestions for options instead of
> >        --[no-]restrict
> >
> > Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1367%2Fnewren%2Fsparse-checkout-directions-v4
> > Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1367/newren/sparse-checkout-directions-v4
> > Pull-Request: https://github.com/gitgitgadget/git/pull/1367
> >
> >  Documentation/technical/sparse-checkout.txt | 1103 +++++++++++++++++++
> >  1 file changed, 1103 insertions(+)
> >  create mode 100644 Documentation/technical/sparse-checkout.txt
> >
> > diff --git a/Documentation/technical/sparse-checkout.txt b/Documentation/technical/sparse-checkout.txt
> > new file mode 100644
> > +=== Terminology ===
> > +
> > +sparse directory: An entry in the index corresponding to a directory, which
> > +       appears in the index instead of all the files under that directory
> > +       that would normally appear.  See also sparse-index.  Something that
> > +       can cause confusion is that the "sparse directory" does NOT match
> > +       the sparse specification, i.e. the directory is NOT present in the
> > +       working tree.  May be renamed in the future (e.g. to "skipped
> > +       directory").
> > +
> > +sparse index: A special mode for sparse-checkout that also makes the
> > +       index sparse by recording a directory entry in lieu of all the
> > +       files underneath that directory (thus making that a "skipped
> > +       directory" which unfortunately has also been called a "sparse
> > +       directory"), and does this for potentially multiple
> > +       directories.  Controlled by the --[no-]sparse-index option to
> > +       init|set|reapply.
> > +
> > +sparsity patterns: patterns from $GIT_DIR/info/sparse-checkout used to
> > +       define the set of files of interest.  A warning: It is easy to
> > +       over-use this term (or the shortened "patterns" term), for two
> > +       reasons: (1) users in cone mode specify directories rather than
> > +       patterns (their directories are transformed into patterns, but
> > +       users may think you are talking about non-cone mode if you use the
> > +       word "patterns"), and (b) the sparse specification might
>
> nit: s/(b)/(2)/g
>
> > +       transiently differ in the working tree or index from the sparsity
> > +       patterns (see "Sparse specification vs. sparsity patterns").
> > +
> > +sparse specification: The set of paths in the user's area of focus.  This
> > +       is typically just the tracked files that match the sparsity
> > +       patterns, but the sparse specification can temporarily differ and
> > +       include additional files.  (See also "Sparse specification
> > +       vs. sparsity patterns")
> > +
> > +       * When working with history, the sparse specification is exactly
> > +         the set of files matching the sparsity patterns.
> > +       * When interacting with the working tree, the sparse specification
> > +         is the set of tracked files with a clear SKIP_WORKTREE bit or
> > +         tracked files present in the working copy.
>

I found af6a518 (repo_read_index: clear SKIP_WORKTREE bit from files
present in worktree) which maybe a good place to learn about "sparse
specification",
it has a long commit message though.

> I'm guessing what you mean here is:
> Some files are stored with a flag bit of !SKIP_WORKTREE in its index entry.
> But files are "vivifying" (restore to worktree) or new files added to
> index (tracked files),
> they also belong to the sparse specification.
>
> I think we can add some examples to describe these terms.
>
> #!/bin/sh
>
> set -x
>
> rm -rf mono-repo
> git init mono-repo -b main
> (
>   cd mono-repo &&
>   mkdir p1 p2 &&
>   echo a >p1/a &&
>   echo b >p1/b &&
>   echo a >p2/a &&
>   echo b >p2/b &&
>   git add . &&
>   git commit -m ok &&
>   git sparse-checkout set p1 &&
>   git ls-files -t &&
>   echo a >>p1/a &&
>   echo b >>p1/b &&
>   mkdir p2 p3 &&
>   echo next >>p2/a &&
>   echo next >>p3/c &&
>   git add p3/c &&
>   # p2/a and p3/c vivify
>   git ls-files -t &&
>   # compare wortree/commit
>   git --no-pager diff HEAD --name-only
> )
>
> > +       * When modifying or showing results from the index, the sparse
> > +         specification is the set of files with a clear SKIP_WORKTREE bit
> > +         or that differ in the index from HEAD.
>
> #!/bin/sh
>
> set -x
>
> rm -rf mono-repo
> git init mono-repo -b main
> (
>   cd mono-repo &&
>   mkdir p1 p2 &&
>   echo a >p1/a &&
>   echo b >p1/b &&
>   echo a >p2/a &&
>   echo b >p2/b &&
>   git add . &&
>   git commit -m ok &&
>   git sparse-checkout set p1 &&
>   git update-index --chmod=+x p2/a &&
>   # compare commit/index
>   git --no-pager diff --cached --name-only
> )
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4] sparse-checkout.txt: new document with sparse-checkout directions
  2022-11-07 20:44       ` Derrick Stolee
@ 2022-11-16  4:39         ` Elijah Newren
  0 siblings, 0 replies; 42+ messages in thread
From: Elijah Newren @ 2022-11-16  4:39 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Elijah Newren via GitGitGadget, git, Victoria Dye, Shaoxuan Yuan,
	Matheus Tavares, ZheNing Hu, Glen Choo, Martin von Zweigbergk

On Mon, Nov 7, 2022 at 12:44 PM Derrick Stolee <derrickstolee@github.com> wrote:
>
> It also is a highly-requested document. Thank you for working so hard on
> it and sorry for being slow to sign off on your edits since v1.
>
> Today, I'm rereading the whole document anew, but I'll avoid any nits
> since I think you are converging on a solid foundation for us to build on.

Thanks for reading it over!

And sorry for my delay in responding; my Git time has been sadly
limited as of late.

> Mostly, if you asked a question in the doc, I'll reply. Nothing is binding
> since the point is to ask the question in the context of the problem
> statement and examples. We should remember to update this document when we
> actually implement the options, so the decisions are documented here
> instead of leaving answered questions lingering.

Yes, I think this sounds good.

> > +  * Do the options --scope={sparse,all} sound good to others?  Are there better
> > +    options?
> > +    * Names in use, or appearing in patches, or previously suggested:
> > +      * --sparse/--dense
> > +      * --ignore-skip-worktree-bits
> > +      * --ignore-skip-worktree-entries
> > +      * --ignore-sparsity
> > +      * --[no-]restrict-to-sparse-paths
> > +      * --full-tree/--sparse-tree
> > +      * --[no-]restrict
> > +      * --scope={sparse,all}
> > +      * --focus/--unfocus
> > +      * --limit/--unlimited
>
> I'm partial to --scope={sparse|all} (with the option to add another
> value if we see the need).
>
> > +  * If a config option is added (sparse.scope?) what should the values and
> > +    description be?  "sparse" (behavior A), "worktree-sparse-history-dense"
> > +    (behavior B), "dense" (behavior C)?  There's a risk of confusion,
> > +    because even for Behaviors A and B we want some commands to be
> > +    full-tree and others to operate sparsely, so the wording may need to be
> > +    more tied to the usecases and somehow explain that.  Also, right now,
> > +    the primary difference we are focusing is just the history-querying
> > +    commands (log/diff/grep).  Previous config suggestion here: [13]
>
> Personally, I think we should have the same values for 'sparse.scope' and
> '--scope=<X>'. For now, let's pick one behavior for the 'sparse' value and
> we can add a new value to differentiate between A and B when necessary in
> the future.

I think this is untenable.  For example, under behavior B:

   * default to --scope=all: diff REV, grep REV, log, etc.
   * default to --scope=sparse: restore, add, diff [without REV or
--cached], etc.

So sparse.scope=all would not yield behavior B.  In fact, there'd be
no way to behavior B since it is inherently a mix of different types
of scopes, as reflected in its "oversimplified" description:

   "Restrict worktree operations to sparse specification; have any
history operations work across all files"

I think it'd *also* potentially set us up for problems under behavior
A.  Behavior A is roughly thought of as --scope=sparse for everything,
but some commands ignore the sparse specification entirely -- commit,
fast-export, bundle, stash, apply, etc.  Perhaps those other
subcommands just never take a --scope option, and thus we have no
issues.  But what if someone asks for a feature where they want to
just apply a subset of the patch with "stash pop" or "apply", and
particularly the subset overlapping with the sparse specification?  Or
perhaps a user wants to do a fast-export of a subset of the repository
-- which they can already do by specifying paths already on the
command line -- but they don't want to have to type all the paths and
want a simple flag for limiting to the sparse specification?  If so,
--scope=sparse is a pretty clear flag that could be used.  But then
we'd have the problem that:

   * default to --scope=all: commit, fast-export, bundle, stash,
apply, and a few others
   * default to --scope=sparse: pretty much everything else

If any of the full-tree commands ever morphs in this direction, then
sparse.scope=sparse would *not* yield behavior A, and there'd be no
way to get it, because behavior A would also be a mix of different
types of scopes.

Personally, I can't imagine that either having --scope=sparse or
--scope=all be the default for all commands would even be a useful
mode for anyone.  So, I think the values of scope.sparse should not be
either "sparse" or "all".

> > +  * Is `--no-expand` a good alias for ls-files's `--sparse` option?
> > +    (`--sparse` does not map to either `--scope=sparse` or `--scope=all`,
> > +    because in non-cone mode it does nothing and in cone-mode it shows the
> > +    sparse directory entries which are technically outside the sparse
> > +    specification)
> > +
> > +  * Under Behavior A:
> > +    * Does ls-files' `--no-expand` override the default `--scope=all`, or
> > +      does it need an extra flag?
> > +    * Does ls-files' `-t` option imply `--scope=all`?
> > +    * Does update-index's `--[no-]skip-worktree` option imply `--scope=all`?
>
> Since the --no-expand option is rather new, and we have a big experimental
> banner on the sparse-checkout documentation, it might be good to plan for
> a deprecation of these non-standard options. We could start by making them
> aliases for the --scope=sparse option, but with a warning that the option
> is deprecated and we will _remove_ the option in a future version. We can
> document here which versions we expect those removals to happen.

I do agree that elsewhere aliasing flags to --scope=sparse makes sense.

But that's not applicable here.  `--no-expand` does not exist yet; it
was suggested as a rename for `--sparse` because ls-files' `--sparse`
option cannot be mapped to either --scope=sparse or --scope=all (nor
any other --scope= option we thought of).  The reason for a different
name was specifically that this option name didn't fit the mold and we
know of no analogous options anywhere.  --scope=sparse means only show
the non-SKIP_WORKTREE entries (which would exclude the sparse
directories and everything under them), while --scope=all means show
all the files (without the directories).  This option, in contrast,
means to show the non-SKIP_WORKTREE file entries plus the
SKIP_WORKTREE directory entries.

> > +  * sparse-checkout: once behavior A is fully implemented, should we take
> > +    an interim measure to ease people into switching the default?  Namely,
> > +    if folks are not already in a sparse checkout, then require
> > +    `sparse-checkout init/set` to take a
> > +    `--set-scope=(sparse|worktree-sparse-history-dense|dense)` flag (which
> > +    would set sparse.scope according to the setting given), and throw an
> > +    error if the flag is not provided?  That error would be a great place
> > +    to warn folks that the default may change in the future, and get them
> > +    used to specifying what they want so that the eventual default switch
> > +    is seamless for them.
>
> I'm not sure that we need a warning here. I think picking an initial default
> is good enough. Let's reconsider this warning after we have more implementation
> changes that provide a choice between behaviors A and B.
>
> > +=== Implementation Goals/Plans ===
> > +
> > + * Get buy-in on this document in general.
>
> Consider me bought-in.

Wahoo!

> > + * Figure out answers to the 'Implementation Questions' sections (above)
> > +
> > + * Fix bugs in the 'Known bugs' section (below)
> > +
> > + * Provide some kind of method for backfilling the blobs within the sparse
> > +   specification in a partial clone
> > +
> > + [Below here is kind of spitballing since the first two haven't been resolved]
>
> We can update this document as we gain clarity after the first few updates.
>
> > + * update-index: flip the default to --no-ignore-skip-worktree-entries,
> > +   nuke this stupid "Oh, there's a bug?  Let me add a flag to let users
> > +   request that they not trigger this bug." flag
> > +
> > + * Flags & Config
> > +   * Make `--sparse` in add/rm/mv a deprecated alias for `--scope=all`
>
> This '--sparse' deprecation can eventually be a removal, I think.

Sounds fair.  Should I clarify that in the document as well?

> > +   * Make `--ignore-skip-worktree-bits` in checkout-index/checkout/restore
> > +     a deprecated aliases for `--scope=all`
>
> This one might be harder to remove since it's much older. We can consider
> it, though.

Yeah, if we end up with deprecated-but-kept-around, that's fine so
long as we recommend the new flag over the old one.

> > +   * Create config option (sparse.scope?), tie it to the "Cliff notes"
> > +     overview
>
> Implementation detail: it might be nice to create a parse-opt macro that
> will read the '--scope={sparse|all}' command-line option but _also_
> create a method to initialize the value to the 'sparse.scope' config
> option. These can both happen with the very first implementation of the
> command-line option and all future integrations can follow that pattern to
> get both options.

I'm not sure how this could work, since `sparse.scope` should not use
the values {sparse,all}, and the correct default scope is
command-dependent for both behavior B and behavior A.

> Thanks for working so hard on this doc. I think this version is ready to
> merge down. Let's get started on this work. I'm excited!

:-)

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4] sparse-checkout.txt: new document with sparse-checkout directions
  2022-11-15  4:03       ` ZheNing Hu
  2022-11-16  3:18         ` ZheNing Hu
@ 2022-11-16  5:49         ` Elijah Newren
  2022-11-16 10:04           ` ZheNing Hu
  1 sibling, 1 reply; 42+ messages in thread
From: Elijah Newren @ 2022-11-16  5:49 UTC (permalink / raw)
  To: ZheNing Hu
  Cc: Elijah Newren via GitGitGadget, git, Victoria Dye,
	Derrick Stolee, Shaoxuan Yuan, Matheus Tavares, Glen Choo,
	Martin von Zweigbergk

On Mon, Nov 14, 2022 at 8:03 PM ZheNing Hu <adlternative@gmail.com> wrote:
> Elijah Newren via GitGitGadget <gitgitgadget@gmail.com> 于2022年11月6日周日 14:04写道:
[...]
> > +sparsity patterns: patterns from $GIT_DIR/info/sparse-checkout used to
> > +       define the set of files of interest.  A warning: It is easy to
> > +       over-use this term (or the shortened "patterns" term), for two
> > +       reasons: (1) users in cone mode specify directories rather than
> > +       patterns (their directories are transformed into patterns, but
> > +       users may think you are talking about non-cone mode if you use the
> > +       word "patterns"), and (b) the sparse specification might
>
> nit: s/(b)/(2)/g

Thanks.

> > +sparse specification: The set of paths in the user's area of focus.  This
> > +       is typically just the tracked files that match the sparsity
> > +       patterns, but the sparse specification can temporarily differ and
> > +       include additional files.  (See also "Sparse specification
> > +       vs. sparsity patterns")
> > +
> > +       * When working with history, the sparse specification is exactly
> > +         the set of files matching the sparsity patterns.
> > +       * When interacting with the working tree, the sparse specification
> > +         is the set of tracked files with a clear SKIP_WORKTREE bit or
> > +         tracked files present in the working copy.
>
> I'm guessing what you mean here is:
> Some files are stored with a flag bit of !SKIP_WORKTREE in its index entry.
> But files are "vivifying" (restore to worktree) or new files added to
> index (tracked files),
> they also belong to the sparse specification.

For this case, when interacting with the working tree, I mean the
union of the following two sets of files:
  * files with !SKIP_WORKTREE in the index entry
  * files present in the working copy whose index entry has the
SKIP_WORKTREE bit set.

The fact that files might be new index entries (i.e. newly tracked
files) is irrelevant; it only matters whether such files fit into one
of the two categories above or not.

The fact that files have been "vivified" is slightly ambiguous and
thus a bad way to define this set.  When git vivifies files, it'll
clear the SKIP_WORKTREE bit.  If an editor, or script external to git,
or something else restores the file, it will likely overlook that
detail.  We want vivified files to be part of the sparse specification
when interacting with the working tree regardless of whether the
SKIP_WORKTREE bit was correctly updated, so I defined it the way I did
to remove such ambiguity.  (I guess I should note that as per
af6a51875a ("repo_read_index: clear SKIP_WORKTREE bit from files
present in worktree", 2022-01-14), git will clear the SKIP_WORKTREE
bit for files present in the working copy as one of the first things
it does, but that could leave people wondering whether I meant the
SKIP_WORKTREE bit was set as of the time of git invocation.  So, I
explicitly call out files present in the working copy for which the
index entry has the SKIP_WORKTREE bit set, so folks know these files
are definitely included in the sparse specification.)

> I think we can add some examples to describe these terms.
>
> #!/bin/sh
>
> set -x
>
> rm -rf mono-repo
> git init mono-repo -b main
> (
>   cd mono-repo &&
>   mkdir p1 p2 &&
>   echo a >p1/a &&
>   echo b >p1/b &&
>   echo a >p2/a &&
>   echo b >p2/b &&
>   git add . &&
>   git commit -m ok &&
>   git sparse-checkout set p1 &&
>   git ls-files -t &&
>   echo a >>p1/a &&
>   echo b >>p1/b &&
>   mkdir p2 p3 &&
>   echo next >>p2/a &&
>   echo next >>p3/c &&
>   git add p3/c &&
>   # p2/a and p3/c vivify
>   git ls-files -t &&
>   # compare wortree/commit
>   git --no-pager diff HEAD --name-only
> )

You've added a bunch of code with this example, but you have not said
what the output should be, so how exactly does this help describe the
terms?

> > +       * When modifying or showing results from the index, the sparse
> > +         specification is the set of files with a clear SKIP_WORKTREE bit
> > +         or that differ in the index from HEAD.
>
> #!/bin/sh
>
> set -x
>
> rm -rf mono-repo
> git init mono-repo -b main
> (
>   cd mono-repo &&
>   mkdir p1 p2 &&
>   echo a >p1/a &&
>   echo b >p1/b &&
>   echo a >p2/a &&
>   echo b >p2/b &&
>   git add . &&
>   git commit -m ok &&
>   git sparse-checkout set p1 &&
>   git update-index --chmod=+x p2/a &&
>   # compare commit/index
>   git --no-pager diff --cached --name-only
> )

Same issue here; you haven't stated the expected output of these
commands, so I don't see how they help with the description at all.

Perhaps it's worth noting why I think the sparse specification should
be extended when dealing with the index:

  * "mergy" commands (merge, rebase, cherry-pick, am, revert) can
modify the index outside the sparsity patterns, without creating a
commit.
  * `git commit` (or `rebase --continue`, or whatever) will create a
commit from whatever staged versions of files there are
  => `git status` should show what is about to be committed
  => `git diff --cached --name-only` ought to be usable to show what
is to be committed
  => `git grep --cached ...` ought to be usable to search through what
is about to be committed

See also https://lore.kernel.org/git/CABPp-BESkb=04vVnqTvZyeCa+7cymX7rosUW3rhtA02khMJKHA@mail.gmail.com/
(starting with the paragraph with "leery" in it), and the thread
starting there.  If the sparse specification is not expanded, users
will get some nasty surprises, and the only other alternative I can
think of to avoid such surprises would be making several commands
always run full tree.  Running full-tree with a non-default option to
run sparse forces behavior A folks into a "pick your poison"
situation, which is not nice.  Extending the sparse specification to
include files whose index entries do not match HEAD for index-related
operations provides the nice middle ground that avoids such usability
problems while also allowing users to avoid operating on a full tree.

[...]
>  I think A's users might need a little more refinement.
>
> A: Some users are _only_ interested in the sparse portion of the repo,
> but they don't want to download all the blobs, they can accept to download
> other data later using partial-clone, which can reduce the local storage size.
> (Current default behavior)

Behavior A is definitely not the current default behavior.  Also,
behavior A is not tied to partial clones; some users may well want it
even with a dense clone, so we need to avoid suggesting it is only for
users with partial clones.  (Though, if users are using partial clones
with behavior A, then I agree with the part you wrote other than your
parenthetical comment.)

> A** : Some users are _only_ interested in the sparse portion of the repo,
> but they want to download all the blobs in it to avoid some unnecessary
> network connections afterwards.

Here you just repeated `A*` but relabelled it as `A**`.  Yes, this one
is explicitly tied to partial clone behavior.

[...]
> > +    The fact that files can move between the 'tracked' and 'untracked'
> > +    categories means some commands will have to treat untracked files
> > +    differently.  But if we have to treat untracked files differently,
> > +    then additional commands may also need changes:
> > +
> > +      * status
> > +      * clean
> > +
>
> I'm a bit worried about git status, because it's used in many shells
> (e.g. zsh) i
> in the git prompt function. Its default behavior is restricted, otherwise users
> may get blocked when they use zsh to cd to that directory. I don't know how
> to reproduce this problem (since the scenario is built on checkout to a local
> unborn branch).

Could you elaborate?  I'm not sure if you are talking about an
existing problem that you are worried about being exacerbated, or a
hypothetical problem that could occur with changes.  Further, your
wording is so vague about the problem, that I have no idea what its
nature is or whether any changes to status would even possibly have
any bearing on it.  But the suggested changes to git status are
simply:

> > +    In particular, `status` may need to report any untracked files outside
> > +    the sparsity specification as an erroneous condition (especially to
> > +    avoid the user trying to `git add` them, forcing `git add` to display
> > +    an error).

[...]
> > +    ls-files may be slightly special in that e.g. `git ls-files -t` is
> > +    often used to see what is sparse and what is not.  Perhaps -t should
> > +    always work on the full tree?
> > +
>
> Recently git ls-files added a --format option, perhaps this can be modified to
> show if a file is SKIP_WORKTREE in the future.

If so, then you've made my question also applicable for `--format`;
much like -t, should --format always work on the full tree?  (Or maybe
just when the format specifies the skip worktree bit?)

[...]
> > +  * If a config option is added (sparse.scope?) what should the values and
> > +    description be?  "sparse" (behavior A), "worktree-sparse-history-dense"
> > +    (behavior B), "dense" (behavior C)?  There's a risk of confusion,
> > +    because even for Behaviors A and B we want some commands to be
> > +    full-tree and others to operate sparsely, so the wording may need to be
> > +    more tied to the usecases and somehow explain that.  Also, right now,
> > +    the primary difference we are focusing is just the history-querying
> > +    commands (log/diff/grep).  Previous config suggestion here: [13]
> > +
>
> Maybe sparse.scope={sparse, all}?

I guess that's people's common first guess.  However, when you dig in,
I think this would be badly broken -- see my response to Stolee I just
sent out.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4] sparse-checkout.txt: new document with sparse-checkout directions
  2022-11-16  3:18         ` ZheNing Hu
@ 2022-11-16  6:51           ` Elijah Newren
  0 siblings, 0 replies; 42+ messages in thread
From: Elijah Newren @ 2022-11-16  6:51 UTC (permalink / raw)
  To: ZheNing Hu
  Cc: Elijah Newren via GitGitGadget, git, Victoria Dye,
	Derrick Stolee, Shaoxuan Yuan, Matheus Tavares, Glen Choo,
	Martin von Zweigbergk

On Tue, Nov 15, 2022 at 7:18 PM ZheNing Hu <adlternative@gmail.com> wrote:
> ZheNing Hu <adlternative@gmail.com> 于2022年11月15日周二 12:03写道:
> > Elijah Newren via GitGitGadget <gitgitgadget@gmail.com> 于2022年11月6日周日 14:04写道:
[...]
> > > +sparse specification: The set of paths in the user's area of focus.  This
> > > +       is typically just the tracked files that match the sparsity
> > > +       patterns, but the sparse specification can temporarily differ and
> > > +       include additional files.  (See also "Sparse specification
> > > +       vs. sparsity patterns")
> > > +
> > > +       * When working with history, the sparse specification is exactly
> > > +         the set of files matching the sparsity patterns.
> > > +       * When interacting with the working tree, the sparse specification
> > > +         is the set of tracked files with a clear SKIP_WORKTREE bit or
> > > +         tracked files present in the working copy.
> >
>
> I found af6a518 (repo_read_index: clear SKIP_WORKTREE bit from files
> present in worktree)

Yes, that was one of the footnotes referenced in the file:

+[3] (Present-despite-skipped entries)
+    https://lore.kernel.org/git/11d46a399d26c913787b704d2b7169cafc28d639.1642175983.git.gitgitgadget@gmail.com/

> which maybe a good place to learn about "sparse specification",
> it has a long commit message though.

Not quite; it was a predecessor that described some of the bugs caused
by the facts that:
  * "SKIP_WORKTREE" meant the file would be missing from the worktree
  * the above promise was often violated in a variety of ways
Experience with all the bugs caused by this situation (and the many
other attempted workarounds we tried that kept falling short)
certainly informed my suggestions about the sparse specification.
However, that only looks at the working tree side; the sparse
specification is also expanded for index-related operations, as I
called out in the other email I just sent you.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4] sparse-checkout.txt: new document with sparse-checkout directions
  2022-11-16  5:49         ` Elijah Newren
@ 2022-11-16 10:04           ` ZheNing Hu
  2022-11-16 10:10             ` ZheNing Hu
  2022-11-19  2:15             ` Elijah Newren
  0 siblings, 2 replies; 42+ messages in thread
From: ZheNing Hu @ 2022-11-16 10:04 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Elijah Newren via GitGitGadget, git, Victoria Dye,
	Derrick Stolee, Shaoxuan Yuan, Matheus Tavares, Glen Choo,
	Martin von Zweigbergk

Elijah Newren <newren@gmail.com> 于2022年11月16日周三 13:49写道:
>
> On Mon, Nov 14, 2022 at 8:03 PM ZheNing Hu <adlternative@gmail.com> wrote:
> > Elijah Newren via GitGitGadget <gitgitgadget@gmail.com> 于2022年11月6日周日 14:04写道:
> [...]
> > > +sparsity patterns: patterns from $GIT_DIR/info/sparse-checkout used to
> > > +       define the set of files of interest.  A warning: It is easy to
> > > +       over-use this term (or the shortened "patterns" term), for two
> > > +       reasons: (1) users in cone mode specify directories rather than
> > > +       patterns (their directories are transformed into patterns, but
> > > +       users may think you are talking about non-cone mode if you use the
> > > +       word "patterns"), and (b) the sparse specification might
> >
> > nit: s/(b)/(2)/g
>
> Thanks.
>
> > > +sparse specification: The set of paths in the user's area of focus.  This
> > > +       is typically just the tracked files that match the sparsity
> > > +       patterns, but the sparse specification can temporarily differ and
> > > +       include additional files.  (See also "Sparse specification
> > > +       vs. sparsity patterns")
> > > +
> > > +       * When working with history, the sparse specification is exactly
> > > +         the set of files matching the sparsity patterns.
> > > +       * When interacting with the working tree, the sparse specification
> > > +         is the set of tracked files with a clear SKIP_WORKTREE bit or
> > > +         tracked files present in the working copy.
> >
> > I'm guessing what you mean here is:
> > Some files are stored with a flag bit of !SKIP_WORKTREE in its index entry.
> > But files are "vivifying" (restore to worktree) or new files added to
> > index (tracked files),
> > they also belong to the sparse specification.
>
> For this case, when interacting with the working tree, I mean the
> union of the following two sets of files:
>   * files with !SKIP_WORKTREE in the index entry
>   * files present in the working copy whose index entry has the
> SKIP_WORKTREE bit set.
>

Precise description. This is consistent with the behavior in
clear_skip_worktree_from_present_files().

> The fact that files might be new index entries (i.e. newly tracked
> files) is irrelevant; it only matters whether such files fit into one
> of the two categories above or not.
>

Okay.

> The fact that files have been "vivified" is slightly ambiguous and
> thus a bad way to define this set.  When git vivifies files, it'll
> clear the SKIP_WORKTREE bit.  If an editor, or script external to git,
> or something else restores the file, it will likely overlook that
> detail.  We want vivified files to be part of the sparse specification
> when interacting with the working tree regardless of whether the
> SKIP_WORKTREE bit was correctly updated, so I defined it the way I did
> to remove such ambiguity.  (I guess I should note that as per
> af6a51875a ("repo_read_index: clear SKIP_WORKTREE bit from files
> present in worktree", 2022-01-14), git will clear the SKIP_WORKTREE
> bit for files present in the working copy as one of the first things
> it does, but that could leave people wondering whether I meant the
> SKIP_WORKTREE bit was set as of the time of git invocation.  So, I
> explicitly call out files present in the working copy for which the
> index entry has the SKIP_WORKTREE bit set, so folks know these files
> are definitely included in the sparse specification.)
>

You are right: the definition of vivifying explicitly clears the
SKIP_WORKTREE bit from the index, So the behavior described
here is not vivifying, but very much like vivifying: clear the
SKIP_WORKTREE bit from index_entry in memory instead of actually
clearing it from the index on disk.

Anyway, for the file on "worktree", we can use ce_skip_worktree(ce)
to check if it belongs to the sparse specification.

> > I think we can add some examples to describe these terms.
> >
> > #!/bin/sh
> >
> > set -x
> >
> > rm -rf mono-repo
> > git init mono-repo -b main
> > (
> >   cd mono-repo &&
> >   mkdir p1 p2 &&
> >   echo a >p1/a &&
> >   echo b >p1/b &&
> >   echo a >p2/a &&
> >   echo b >p2/b &&
> >   git add . &&
> >   git commit -m ok &&
> >   git sparse-checkout set p1 &&
> >   git ls-files -t &&
> >   echo a >>p1/a &&
> >   echo b >>p1/b &&
> >   mkdir p2 p3 &&
> >   echo next >>p2/a &&
> >   echo next >>p3/c &&
> >   git add p3/c &&

Here I forget a "--sparse"

> >   # p2/a and p3/c vivify
> >   git ls-files -t &&
> >   # compare wortree/commit
> >   git --no-pager diff HEAD --name-only
> > )
>
> You've added a bunch of code with this example, but you have not said
> what the output should be, so how exactly does this help describe the
> terms?
>

We create a repo with two sub projects p1/ p2/, then set
sparsity directory p1.

First git ls-files -t outputs:

H p1/a
H p1/b
S p2/a
S p2/b

It shows that index entries in p2 are skip-worktree.
Then we restore p2/a in the working tree and create a
new file p3/c and add it to the index.

The second git ls-files -t output:

H p1/a
H p1/b
H p2/a
S p2/b
H p3/c

p2/a and p3/c are not in sparse patterns, but they are in
sparse specification. It's like a special "vivifying".

> > > +       * When modifying or showing results from the index, the sparse
> > > +         specification is the set of files with a clear SKIP_WORKTREE bit
> > > +         or that differ in the index from HEAD.
> >
> > #!/bin/sh
> >
> > set -x
> >
> > rm -rf mono-repo
> > git init mono-repo -b main
> > (
> >   cd mono-repo &&
> >   mkdir p1 p2 &&
> >   echo a >p1/a &&
> >   echo b >p1/b &&
> >   echo a >p2/a &&
> >   echo b >p2/b &&
> >   git add . &&
> >   git commit -m ok &&
> >   git sparse-checkout set p1 &&
> >   git update-index --chmod=+x p2/a &&
> >   # compare commit/index
> >   git --no-pager diff --cached --name-only
> > )
>
> Same issue here; you haven't stated the expected output of these
> commands, so I don't see how they help with the description at all.
>

Here only output p2/a:

p2/a is out of sparse patterns, but this index entry mode has been
changed compared to HEAD. So we should consider it as a part of
sparse specification.

> Perhaps it's worth noting why I think the sparse specification should
> be extended when dealing with the index:
>
>   * "mergy" commands (merge, rebase, cherry-pick, am, revert) can
> modify the index outside the sparsity patterns, without creating a
> commit.
>   * `git commit` (or `rebase --continue`, or whatever) will create a
> commit from whatever staged versions of files there are
>   => `git status` should show what is about to be committed
>   => `git diff --cached --name-only` ought to be usable to show what
> is to be committed
>   => `git grep --cached ...` ought to be usable to search through what
> is about to be committed
>
> See also https://lore.kernel.org/git/CABPp-BESkb=04vVnqTvZyeCa+7cymX7rosUW3rhtA02khMJKHA@mail.gmail.com/
> (starting with the paragraph with "leery" in it), and the thread
> starting there.  If the sparse specification is not expanded, users
> will get some nasty surprises, and the only other alternative I can
> think of to avoid such surprises would be making several commands
> always run full tree.  Running full-tree with a non-default option to
> run sparse forces behavior A folks into a "pick your poison"
> situation, which is not nice.  Extending the sparse specification to
> include files whose index entries do not match HEAD for index-related
> operations provides the nice middle ground that avoids such usability
> problems while also allowing users to avoid operating on a full tree.
>

I can understand the reason why we need to extend sparse specification:
index often needs to handle files that are not in the sparse pattern.

> [...]
> >  I think A's users might need a little more refinement.
> >
> > A: Some users are _only_ interested in the sparse portion of the repo,
> > but they don't want to download all the blobs, they can accept to download
> > other data later using partial-clone, which can reduce the local storage size.
> > (Current default behavior)
>
> Behavior A is definitely not the current default behavior.  Also,
> behavior A is not tied to partial clones; some users may well want it
> even with a dense clone, so we need to avoid suggesting it is only for
> users with partial clones.  (Though, if users are using partial clones
> with behavior A, then I agree with the part you wrote other than your
> parenthetical comment.)
>

Makes sense. This should be considered scalar-style behavior.

> > A** : Some users are _only_ interested in the sparse portion of the repo,
> > but they want to download all the blobs in it to avoid some unnecessary
> > network connections afterwards.
>
> Here you just repeated `A*` but relabelled it as `A**`.  Yes, this one
> is explicitly tied to partial clone behavior.

Ah,  `A*` part say “so things like `git log -S${SEARCH_TERM} -p`
or `git grep ${SEARCH_TERM} OLDREV ` would need to be prepared to provide
partial results that depend on what happens to have been downloaded."

So I think it's probably a lot like the behavior after a shallow
clone: git log -p or other
git commands returning partial results.

The expectation of A** is to have all blobs under the entire sparse-patterns.

>
> [...]
> > > +    The fact that files can move between the 'tracked' and 'untracked'
> > > +    categories means some commands will have to treat untracked files
> > > +    differently.  But if we have to treat untracked files differently,
> > > +    then additional commands may also need changes:
> > > +
> > > +      * status
> > > +      * clean
> > > +
> >
> > I'm a bit worried about git status, because it's used in many shells
> > (e.g. zsh) i
> > in the git prompt function. Its default behavior is restricted, otherwise users
> > may get blocked when they use zsh to cd to that directory. I don't know how
> > to reproduce this problem (since the scenario is built on checkout to a local
> > unborn branch).
>
> Could you elaborate?  I'm not sure if you are talking about an
> existing problem that you are worried about being exacerbated, or a
> hypothetical problem that could occur with changes.  Further, your
> wording is so vague about the problem, that I have no idea what its
> nature is or whether any changes to status would even possibly have
> any bearing on it.  But the suggested changes to git status are
> simply:
>

I just might have caused this in one particular case. So it's not very
important at the moment. But it's worth noting that many shells, IDEs’
git plugins
may also need to understand sparse-checkout properly, otherwise it can
 cause some usability problems.

> > > +    In particular, `status` may need to report any untracked files outside
> > > +    the sparsity specification as an erroneous condition (especially to
> > > +    avoid the user trying to `git add` them, forcing `git add` to display
> > > +    an error).
>
> [...]
> > > +    ls-files may be slightly special in that e.g. `git ls-files -t` is
> > > +    often used to see what is sparse and what is not.  Perhaps -t should
> > > +    always work on the full tree?
> > > +
> >
> > Recently git ls-files added a --format option, perhaps this can be modified to
> > show if a file is SKIP_WORKTREE in the future.
>
> If so, then you've made my question also applicable for `--format`;
> much like -t, should --format always work on the full tree?  (Or maybe
> just when the format specifies the skip worktree bit?)
>

My personal opinion is to default to "restrict" , but we can use "git
ls-files -t -scope=all"
to check all index entries
.
> [...]
> > > +  * If a config option is added (sparse.scope?) what should the values and
> > > +    description be?  "sparse" (behavior A), "worktree-sparse-history-dense"
> > > +    (behavior B), "dense" (behavior C)?  There's a risk of confusion,
> > > +    because even for Behaviors A and B we want some commands to be
> > > +    full-tree and others to operate sparsely, so the wording may need to be
> > > +    more tied to the usecases and somehow explain that.  Also, right now,
> > > +    the primary difference we are focusing is just the history-querying
> > > +    commands (log/diff/grep).  Previous config suggestion here: [13]
> > > +
> >
> > Maybe sparse.scope={sparse, all}?
>
> I guess that's people's common first guess.  However, when you dig in,
> I think this would be badly broken -- see my response to Stolee I just
> sent out.

Ok...

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4] sparse-checkout.txt: new document with sparse-checkout directions
  2022-11-16 10:04           ` ZheNing Hu
@ 2022-11-16 10:10             ` ZheNing Hu
  2022-11-16 14:33               ` ZheNing Hu
  2022-11-19  2:15             ` Elijah Newren
  1 sibling, 1 reply; 42+ messages in thread
From: ZheNing Hu @ 2022-11-16 10:10 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Elijah Newren via GitGitGadget, git, Victoria Dye,
	Derrick Stolee, Shaoxuan Yuan, Matheus Tavares, Glen Choo,
	Martin von Zweigbergk

ZheNing Hu <adlternative@gmail.com> 于2022年11月16日周三 18:04写道:
>
> Elijah Newren <newren@gmail.com> 于2022年11月16日周三 13:49写道:
> >
> > Perhaps it's worth noting why I think the sparse specification should
> > be extended when dealing with the index:
> >
> >   * "mergy" commands (merge, rebase, cherry-pick, am, revert) can
> > modify the index outside the sparsity patterns, without creating a
> > commit.
> >   * `git commit` (or `rebase --continue`, or whatever) will create a
> > commit from whatever staged versions of files there are
> >   => `git status` should show what is about to be committed
> >   => `git diff --cached --name-only` ought to be usable to show what
> > is to be committed
> >   => `git grep --cached ...` ought to be usable to search through what
> > is about to be committed
> >
> > See also https://lore.kernel.org/git/CABPp-BESkb=04vVnqTvZyeCa+7cymX7rosUW3rhtA02khMJKHA@mail.gmail.com/
> > (starting with the paragraph with "leery" in it), and the thread
> > starting there.  If the sparse specification is not expanded, users
> > will get some nasty surprises, and the only other alternative I can
> > think of to avoid such surprises would be making several commands
> > always run full tree.  Running full-tree with a non-default option to
> > run sparse forces behavior A folks into a "pick your poison"
> > situation, which is not nice.  Extending the sparse specification to
> > include files whose index entries do not match HEAD for index-related
> > operations provides the nice middle ground that avoids such usability
> > problems while also allowing users to avoid operating on a full tree.
> >
>
> I can understand the reason why we need to extend sparse specification:
> index often needs to handle files that are not in the sparse pattern.
>

I might have one more question: when we use "git diff -cached HEAD~",
what is the best way to check if an index entry is the same as HEAD here?
Do we need to run "git diff --cached HEAD <file>" again?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4] sparse-checkout.txt: new document with sparse-checkout directions
  2022-11-16 10:10             ` ZheNing Hu
@ 2022-11-16 14:33               ` ZheNing Hu
  2022-11-19  2:36                 ` Elijah Newren
  0 siblings, 1 reply; 42+ messages in thread
From: ZheNing Hu @ 2022-11-16 14:33 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Elijah Newren via GitGitGadget, git, Victoria Dye,
	Derrick Stolee, Shaoxuan Yuan, Matheus Tavares, Glen Choo,
	Martin von Zweigbergk

ZheNing Hu <adlternative@gmail.com> 于2022年11月16日周三 18:10写道:
>
> ZheNing Hu <adlternative@gmail.com> 于2022年11月16日周三 18:04写道:
> >
> > Elijah Newren <newren@gmail.com> 于2022年11月16日周三 13:49写道:
> > >
> > > Perhaps it's worth noting why I think the sparse specification should
> > > be extended when dealing with the index:
> > >
> > >   * "mergy" commands (merge, rebase, cherry-pick, am, revert) can
> > > modify the index outside the sparsity patterns, without creating a
> > > commit.
> > >   * `git commit` (or `rebase --continue`, or whatever) will create a
> > > commit from whatever staged versions of files there are
> > >   => `git status` should show what is about to be committed
> > >   => `git diff --cached --name-only` ought to be usable to show what
> > > is to be committed
> > >   => `git grep --cached ...` ought to be usable to search through what
> > > is about to be committed
> > >
> > > See also https://lore.kernel.org/git/CABPp-BESkb=04vVnqTvZyeCa+7cymX7rosUW3rhtA02khMJKHA@mail.gmail.com/
> > > (starting with the paragraph with "leery" in it), and the thread
> > > starting there.  If the sparse specification is not expanded, users
> > > will get some nasty surprises, and the only other alternative I can
> > > think of to avoid such surprises would be making several commands
> > > always run full tree.  Running full-tree with a non-default option to
> > > run sparse forces behavior A folks into a "pick your poison"
> > > situation, which is not nice.  Extending the sparse specification to
> > > include files whose index entries do not match HEAD for index-related
> > > operations provides the nice middle ground that avoids such usability
> > > problems while also allowing users to avoid operating on a full tree.
> > >
> >
> > I can understand the reason why we need to extend sparse specification:
> > index often needs to handle files that are not in the sparse pattern.
> >
>
> I might have one more question: when we use "git diff -cached HEAD~",
> what is the best way to check if an index entry is the same as HEAD here?
> Do we need to run "git diff --cached HEAD <file>" again?

I found that git commit will execute index_differs_from() to determine
whether the index has changed, It defaults to comparing HEAD.
But if we use git commit --amend, index_differs_from() will compare
to HEAD~.

the docs say:

       * When modifying or showing results from the index, the sparse
         specification is the set of files with a clear SKIP_WORKTREE bit
         or that differ in the index from HEAD.

I wonder if there is some description error here? Not always "from HEAD"?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4] sparse-checkout.txt: new document with sparse-checkout directions
  2022-11-16 10:04           ` ZheNing Hu
  2022-11-16 10:10             ` ZheNing Hu
@ 2022-11-19  2:15             ` Elijah Newren
  2022-11-23  9:08               ` ZheNing Hu
  1 sibling, 1 reply; 42+ messages in thread
From: Elijah Newren @ 2022-11-19  2:15 UTC (permalink / raw)
  To: ZheNing Hu
  Cc: Elijah Newren via GitGitGadget, git, Victoria Dye,
	Derrick Stolee, Shaoxuan Yuan, Matheus Tavares, Glen Choo,
	Martin von Zweigbergk

On Wed, Nov 16, 2022 at 2:04 AM ZheNing Hu <adlternative@gmail.com> wrote:
> Elijah Newren <newren@gmail.com> 于2022年11月16日周三 13:49写道:
> > On Mon, Nov 14, 2022 at 8:03 PM ZheNing Hu <adlternative@gmail.com> wrote:
> > > Elijah Newren via GitGitGadget <gitgitgadget@gmail.com> 于2022年11月6日周日 14:04写道:
[...]
> > The fact that files have been "vivified" is slightly ambiguous and
> > thus a bad way to define this set.  When git vivifies files, it'll
> > clear the SKIP_WORKTREE bit.  If an editor, or script external to git,
> > or something else restores the file, it will likely overlook that
> > detail.  We want vivified files to be part of the sparse specification
> > when interacting with the working tree regardless of whether the
> > SKIP_WORKTREE bit was correctly updated, so I defined it the way I did
> > to remove such ambiguity.  (I guess I should note that as per
> > af6a51875a ("repo_read_index: clear SKIP_WORKTREE bit from files
> > present in worktree", 2022-01-14), git will clear the SKIP_WORKTREE
> > bit for files present in the working copy as one of the first things
> > it does, but that could leave people wondering whether I meant the
> > SKIP_WORKTREE bit was set as of the time of git invocation.  So, I
> > explicitly call out files present in the working copy for which the
> > index entry has the SKIP_WORKTREE bit set, so folks know these files
> > are definitely included in the sparse specification.)
> >
>
> You are right: the definition of vivifying explicitly clears the
> SKIP_WORKTREE bit from the index, So the behavior described
> here is not vivifying, but very much like vivifying: clear the
> SKIP_WORKTREE bit from index_entry in memory instead of actually
> clearing it from the index on disk.

No, the definition of vivifying in my PR does not explicitly state
that; in fact, it doesn't even imply that it always happens.  The
wording was:

+vivifying: When a command restores a tracked file to the working tree (and
+ hopefully also clears the SKIP_WORKTREE bit in the index for that
+ file), this is referred to as "vivifying" the file.

In particular, it's important to note that:

  * some git commands won't even clear the SKIP_WORKTREE bit when they
vivify files (e.g. I think git-checkout-index falls in this category,
for example); we could never audit all codepaths and fix them all.
But when they restore files we still consider that to be "vivifying"
those paths whether or not they clear the SKIP_WORKTREE bit.

  * I considered the restoration of files by non-git commands (e.g.
"echo contents >filename") to also be considered "vivifying" of those
files, and I certainly don't expect non-git commands to clear the
SKIP_WORKTREE bit.

> Anyway, for the file on "worktree", we can use ce_skip_worktree(ce)
> to check if it belongs to the sparse specification.

Yes, due to the commit af6a51875a referenced above, our implementation
can just check this.  Except, it's !ce_skep_worktree(ce), of course.
:-)

> > > I think we can add some examples to describe these terms.
> > >
> > > #!/bin/sh
> > >
> > > set -x
> > >
> > > rm -rf mono-repo
> > > git init mono-repo -b main
> > > (
> > >   cd mono-repo &&
> > >   mkdir p1 p2 &&
> > >   echo a >p1/a &&
> > >   echo b >p1/b &&
> > >   echo a >p2/a &&
> > >   echo b >p2/b &&
> > >   git add . &&
> > >   git commit -m ok &&
> > >   git sparse-checkout set p1 &&
> > >   git ls-files -t &&
> > >   echo a >>p1/a &&
> > >   echo b >>p1/b &&
> > >   mkdir p2 p3 &&
> > >   echo next >>p2/a &&
> > >   echo next >>p3/c &&
> > >   git add p3/c &&
>
> Here I forget a "--sparse"
>
> > >   # p2/a and p3/c vivify
> > >   git ls-files -t &&
> > >   # compare wortree/commit
> > >   git --no-pager diff HEAD --name-only
> > > )
> >
> > You've added a bunch of code with this example, but you have not said
> > what the output should be, so how exactly does this help describe the
> > terms?
> >
>
> We create a repo with two sub projects p1/ p2/, then set
> sparsity directory p1.
>
> First git ls-files -t outputs:
>
> H p1/a
> H p1/b
> S p2/a
> S p2/b
>
> It shows that index entries in p2 are skip-worktree.
> Then we restore p2/a in the working tree and create a
> new file p3/c and add it to the index.
>
> The second git ls-files -t output:
>
> H p1/a
> H p1/b
> H p2/a
> S p2/b
> H p3/c
>
> p2/a and p3/c are not in sparse patterns, but they are in
> sparse specification. It's like a special "vivifying".

What do you mean by a "special" vivifying?

Also, there's multiple problems with using your example so far to
describe the sparse specification:

  * You are specifying `git ls-files -t` output here, which may or may
not ignore the sparse specification (as mentioned elsewhere in the new
document); if it doesn't, then specifying how commands behave when
they ignore the sparse specification as a way of describing the sparse
specification seems less than helpful.  We could overlook that, but:
  * You didn't specify the output for `git diff HEAD` and `git diff
REV` was one of the cases where the sparse specification matters.
Explaining how `git diff REV` works relative to the sparse
specification seems like the point of you having an example, BUT even
if you tried to do that with this particular example:
  1) Users are probably left wondering whether p3/c is present in the
working copy at the time these commands run, and thus whether it is in
the sparse specification for that reason rather than for the reason of
there being a difference in the index relative to HEAD.
  2) You didn't specify the differences in the output between behavior
A and behavior B for your example, if any, which might be needed for
an appropriate contrast.  Further...
  3) You picked an example where the output might be the same for both
behavior A and behavior B, and since behavior B ignores the sparse
specification, it's really hard to see how this example helps
elucidate the meaning of the sparse specification.

So, I'm still not seeing how this example helps describe the sparse
specification.

> > > > +       * When modifying or showing results from the index, the sparse
> > > > +         specification is the set of files with a clear SKIP_WORKTREE bit
> > > > +         or that differ in the index from HEAD.
> > >
> > > #!/bin/sh
> > >
> > > set -x
> > >
> > > rm -rf mono-repo
> > > git init mono-repo -b main
> > > (
> > >   cd mono-repo &&
> > >   mkdir p1 p2 &&
> > >   echo a >p1/a &&
> > >   echo b >p1/b &&
> > >   echo a >p2/a &&
> > >   echo b >p2/b &&
> > >   git add . &&
> > >   git commit -m ok &&
> > >   git sparse-checkout set p1 &&
> > >   git update-index --chmod=+x p2/a &&
> > >   # compare commit/index
> > >   git --no-pager diff --cached --name-only
> > > )
> >
> > Same issue here; you haven't stated the expected output of these
> > commands, so I don't see how they help with the description at all.
> >
>
> Here only output p2/a:
>
> p2/a is out of sparse patterns, but this index entry mode has been
> changed compared to HEAD. So we should consider it as a part of
> sparse specification.

Same thing here about the fact that you've given an example with the
same output under behavior A and behavior B, and since behavior B
ignores the sparse specification, I'm not sure your example elucidates
the sparse specification that much other than to make clear it
includes more than the sparse patterns.  But didn't my wording already
do that?

(Note that `git diff --cached` without a revision is just inherently
susceptible to this problem; it should always produce the same output
under both modes.)

> > Perhaps it's worth noting why I think the sparse specification should
> > be extended when dealing with the index:
> >
> >   * "mergy" commands (merge, rebase, cherry-pick, am, revert) can
> > modify the index outside the sparsity patterns, without creating a
> > commit.
> >   * `git commit` (or `rebase --continue`, or whatever) will create a
> > commit from whatever staged versions of files there are
> >   => `git status` should show what is about to be committed
> >   => `git diff --cached --name-only` ought to be usable to show what
> > is to be committed
> >   => `git grep --cached ...` ought to be usable to search through what
> > is about to be committed
> >
> > See also https://lore.kernel.org/git/CABPp-BESkb=04vVnqTvZyeCa+7cymX7rosUW3rhtA02khMJKHA@mail.gmail.com/
> > (starting with the paragraph with "leery" in it), and the thread
> > starting there.  If the sparse specification is not expanded, users
> > will get some nasty surprises, and the only other alternative I can
> > think of to avoid such surprises would be making several commands
> > always run full tree.  Running full-tree with a non-default option to
> > run sparse forces behavior A folks into a "pick your poison"
> > situation, which is not nice.  Extending the sparse specification to
> > include files whose index entries do not match HEAD for index-related
> > operations provides the nice middle ground that avoids such usability
> > problems while also allowing users to avoid operating on a full tree.
> >
>
> I can understand the reason why we need to extend sparse specification:
> index often needs to handle files that are not in the sparse pattern.

Yep!

[...]
> > > A** : Some users are _only_ interested in the sparse portion of the repo,
> > > but they want to download all the blobs in it to avoid some unnecessary
> > > network connections afterwards.
> >
> > Here you just repeated `A*` but relabelled it as `A**`.  Yes, this one
> > is explicitly tied to partial clone behavior.
>
> Ah,  `A*` part say “so things like `git log -S${SEARCH_TERM} -p`
> or `git grep ${SEARCH_TERM} OLDREV ` would need to be prepared to provide
> partial results that depend on what happens to have been downloaded."
>
> So I think it's probably a lot like the behavior after a shallow
> clone: git log -p or other
> git commands returning partial results.

Yes, though not being a fan of shallow clones, the comparison makes me
recoil slightly.  ;-)

> The expectation of A** is to have all blobs under the entire sparse-patterns.

Ah, I misread your `A**`.  I agree there are users that want to do
this; I'm one of them.

But how does that affect the results that users see from running
operations?  Does it change any definitions or categorize any commands
differently, or affect anything in the document?  Why is it worth
calling out that people want full history of the paths matching the
sparsity patterns?

> >
> > [...]
> > > > +    The fact that files can move between the 'tracked' and 'untracked'
> > > > +    categories means some commands will have to treat untracked files
> > > > +    differently.  But if we have to treat untracked files differently,
> > > > +    then additional commands may also need changes:
> > > > +
> > > > +      * status
> > > > +      * clean
> > > > +
> > >
> > > I'm a bit worried about git status, because it's used in many shells
> > > (e.g. zsh) i
> > > in the git prompt function. Its default behavior is restricted, otherwise users
> > > may get blocked when they use zsh to cd to that directory. I don't know how
> > > to reproduce this problem (since the scenario is built on checkout to a local
> > > unborn branch).
> >
> > Could you elaborate?  I'm not sure if you are talking about an
> > existing problem that you are worried about being exacerbated, or a
> > hypothetical problem that could occur with changes.  Further, your
> > wording is so vague about the problem, that I have no idea what its
> > nature is or whether any changes to status would even possibly have
> > any bearing on it.  But the suggested changes to git status are
> > simply:
> >
>
> I just might have caused this in one particular case. So it's not very
> important at the moment. But it's worth noting that many shells, IDEs’
> git plugins
> may also need to understand sparse-checkout properly, otherwise it can
>  cause some usability problems.

Why do these tools need to understand sparse-checkout?  What kind of
usability problems could occur?  Can you describe what range of issues
can occur, or even give any specific examples?

The whole point of the document is trying to address remaining
sparse-checkout issues, and we even have a section highlighting known
current problems.  If you know of additional issues, it would be great
to make them known, but I cannot figure out what you might be referring
to from these vague descriptions.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4] sparse-checkout.txt: new document with sparse-checkout directions
  2022-11-16 14:33               ` ZheNing Hu
@ 2022-11-19  2:36                 ` Elijah Newren
  0 siblings, 0 replies; 42+ messages in thread
From: Elijah Newren @ 2022-11-19  2:36 UTC (permalink / raw)
  To: ZheNing Hu
  Cc: Elijah Newren via GitGitGadget, git, Victoria Dye,
	Derrick Stolee, Shaoxuan Yuan, Matheus Tavares, Glen Choo,
	Martin von Zweigbergk

On Wed, Nov 16, 2022 at 6:33 AM ZheNing Hu <adlternative@gmail.com> wrote:
>
> ZheNing Hu <adlternative@gmail.com> 于2022年11月16日周三 18:10写道:
> >
> > ZheNing Hu <adlternative@gmail.com> 于2022年11月16日周三 18:04写道:
> > >
> > > Elijah Newren <newren@gmail.com> 于2022年11月16日周三 13:49写道:
> > > >
> > > > Perhaps it's worth noting why I think the sparse specification should
> > > > be extended when dealing with the index:
> > > >
> > > >   * "mergy" commands (merge, rebase, cherry-pick, am, revert) can
> > > > modify the index outside the sparsity patterns, without creating a
> > > > commit.
> > > >   * `git commit` (or `rebase --continue`, or whatever) will create a
> > > > commit from whatever staged versions of files there are
> > > >   => `git status` should show what is about to be committed
> > > >   => `git diff --cached --name-only` ought to be usable to show what
> > > > is to be committed
> > > >   => `git grep --cached ...` ought to be usable to search through what
> > > > is about to be committed
> > > >
> > > > See also https://lore.kernel.org/git/CABPp-BESkb=04vVnqTvZyeCa+7cymX7rosUW3rhtA02khMJKHA@mail.gmail.com/
> > > > (starting with the paragraph with "leery" in it), and the thread
> > > > starting there.  If the sparse specification is not expanded, users
> > > > will get some nasty surprises, and the only other alternative I can
> > > > think of to avoid such surprises would be making several commands
> > > > always run full tree.  Running full-tree with a non-default option to
> > > > run sparse forces behavior A folks into a "pick your poison"
> > > > situation, which is not nice.  Extending the sparse specification to
> > > > include files whose index entries do not match HEAD for index-related
> > > > operations provides the nice middle ground that avoids such usability
> > > > problems while also allowing users to avoid operating on a full tree.
> > > >
> > >
> > > I can understand the reason why we need to extend sparse specification:
> > > index often needs to handle files that are not in the sparse pattern.
> > >
> >
> > I might have one more question: when we use "git diff -cached HEAD~",
> > what is the best way to check if an index entry is the same as HEAD here?
> > Do we need to run "git diff --cached HEAD <file>" again?
>
> I found that git commit will execute index_differs_from() to determine
> whether the index has changed, It defaults to comparing HEAD.
> But if we use git commit --amend, index_differs_from() will compare
> to HEAD~.
>
> the docs say:
>
>        * When modifying or showing results from the index, the sparse
>          specification is the set of files with a clear SKIP_WORKTREE bit
>          or that differ in the index from HEAD.
>
> I wonder if there is some description error here? Not always "from HEAD"?

Perhaps this part of the document will help:

+  * commands that always ignore sparsity since commits must be full-tree
+
+      * archive
+      * bundle
+      * commit
+      * format-patch
+      * fast-export
+      * fast-import
+      * commit-tree

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v4] sparse-checkout.txt: new document with sparse-checkout directions
  2022-11-19  2:15             ` Elijah Newren
@ 2022-11-23  9:08               ` ZheNing Hu
  0 siblings, 0 replies; 42+ messages in thread
From: ZheNing Hu @ 2022-11-23  9:08 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Elijah Newren via GitGitGadget, git, Victoria Dye,
	Derrick Stolee, Shaoxuan Yuan, Matheus Tavares, Glen Choo,
	Martin von Zweigbergk

Elijah Newren <newren@gmail.com> 于2022年11月19日周六 10:16写道:
>
> On Wed, Nov 16, 2022 at 2:04 AM ZheNing Hu <adlternative@gmail.com> wrote:
> > Elijah Newren <newren@gmail.com> 于2022年11月16日周三 13:49写道:
> > > On Mon, Nov 14, 2022 at 8:03 PM ZheNing Hu <adlternative@gmail.com> wrote:
> > > > Elijah Newren via GitGitGadget <gitgitgadget@gmail.com> 于2022年11月6日周日 14:04写道:
> [...]
> > > The fact that files have been "vivified" is slightly ambiguous and
> > > thus a bad way to define this set.  When git vivifies files, it'll
> > > clear the SKIP_WORKTREE bit.  If an editor, or script external to git,
> > > or something else restores the file, it will likely overlook that
> > > detail.  We want vivified files to be part of the sparse specification
> > > when interacting with the working tree regardless of whether the
> > > SKIP_WORKTREE bit was correctly updated, so I defined it the way I did
> > > to remove such ambiguity.  (I guess I should note that as per
> > > af6a51875a ("repo_read_index: clear SKIP_WORKTREE bit from files
> > > present in worktree", 2022-01-14), git will clear the SKIP_WORKTREE
> > > bit for files present in the working copy as one of the first things
> > > it does, but that could leave people wondering whether I meant the
> > > SKIP_WORKTREE bit was set as of the time of git invocation.  So, I
> > > explicitly call out files present in the working copy for which the
> > > index entry has the SKIP_WORKTREE bit set, so folks know these files
> > > are definitely included in the sparse specification.)
> > >
> >
> > You are right: the definition of vivifying explicitly clears the
> > SKIP_WORKTREE bit from the index, So the behavior described
> > here is not vivifying, but very much like vivifying: clear the
> > SKIP_WORKTREE bit from index_entry in memory instead of actually
> > clearing it from the index on disk.
>
> No, the definition of vivifying in my PR does not explicitly state
> that; in fact, it doesn't even imply that it always happens.  The
> wording was:
>
> +vivifying: When a command restores a tracked file to the working tree (and
> + hopefully also clears the SKIP_WORKTREE bit in the index for that
> + file), this is referred to as "vivifying" the file.
>
> In particular, it's important to note that:
>
>   * some git commands won't even clear the SKIP_WORKTREE bit when they
> vivify files (e.g. I think git-checkout-index falls in this category,
> for example); we could never audit all codepaths and fix them all.
> But when they restore files we still consider that to be "vivifying"
> those paths whether or not they clear the SKIP_WORKTREE bit.
>
>   * I considered the restoration of files by non-git commands (e.g.
> "echo contents >filename") to also be considered "vivifying" of those
> files, and I certainly don't expect non-git commands to clear the
> SKIP_WORKTREE bit.
>

Ok, I probably understand, what you mean by vivify is the "activation" of the
file, and I thought it was the "restoration" of the file state in the index
(clear SKIP_WORKTREE).

> > Anyway, for the file on "worktree", we can use ce_skip_worktree(ce)
> > to check if it belongs to the sparse specification.
>
> Yes, due to the commit af6a51875a referenced above, our implementation
> can just check this.  Except, it's !ce_skep_worktree(ce), of course.
> :-)
>

Yes, but I think I should need one or more explicit interfaces to determine
whether a file belongs to "sparse specification", instead of using
!ce_skip_worktree(ce) every time. I'll expose this interface in my next
“diff --scope” patch.

> > > > I think we can add some examples to describe these terms.
> > > >
> > > > #!/bin/sh
> > > >
> > > > set -x
> > > >
> > > > rm -rf mono-repo
> > > > git init mono-repo -b main
> > > > (
> > > >   cd mono-repo &&
> > > >   mkdir p1 p2 &&
> > > >   echo a >p1/a &&
> > > >   echo b >p1/b &&
> > > >   echo a >p2/a &&
> > > >   echo b >p2/b &&
> > > >   git add . &&
> > > >   git commit -m ok &&
> > > >   git sparse-checkout set p1 &&
> > > >   git ls-files -t &&
> > > >   echo a >>p1/a &&
> > > >   echo b >>p1/b &&
> > > >   mkdir p2 p3 &&
> > > >   echo next >>p2/a &&
> > > >   echo next >>p3/c &&
> > > >   git add p3/c &&
> >
> > Here I forget a "--sparse"
> >
> > > >   # p2/a and p3/c vivify
> > > >   git ls-files -t &&
> > > >   # compare wortree/commit
> > > >   git --no-pager diff HEAD --name-only
> > > > )
> > >
> > > You've added a bunch of code with this example, but you have not said
> > > what the output should be, so how exactly does this help describe the
> > > terms?
> > >
> >
> > We create a repo with two sub projects p1/ p2/, then set
> > sparsity directory p1.
> >
> > First git ls-files -t outputs:
> >
> > H p1/a
> > H p1/b
> > S p2/a
> > S p2/b
> >
> > It shows that index entries in p2 are skip-worktree.
> > Then we restore p2/a in the working tree and create a
> > new file p3/c and add it to the index.
> >
> > The second git ls-files -t output:
> >
> > H p1/a
> > H p1/b
> > H p2/a
> > S p2/b
> > H p3/c
> >
> > p2/a and p3/c are not in sparse patterns, but they are in
> > sparse specification. It's like a special "vivifying".
>
> What do you mean by a "special" vivifying?
>

As mentioned above, I misunderstood your "vivify". What I just
want to say here is that skip_worktree bit of these files has
been cleared, their status is "activated".

> Also, there's multiple problems with using your example so far to
> describe the sparse specification:
>
>   * You are specifying `git ls-files -t` output here, which may or may
> not ignore the sparse specification (as mentioned elsewhere in the new
> document); if it doesn't, then specifying how commands behave when
> they ignore the sparse specification as a way of describing the sparse
> specification seems less than helpful.  We could overlook that, but:
>   * You didn't specify the output for `git diff HEAD` and `git diff
> REV` was one of the cases where the sparse specification matters.

Makes sense. My example misses the point.

> Explaining how `git diff REV` works relative to the sparse
> specification seems like the point of you having an example, BUT even
> if you tried to do that with this particular example:
>   1) Users are probably left wondering whether p3/c is present in the
> working copy at the time these commands run, and thus whether it is in
> the sparse specification for that reason rather than for the reason of
> there being a difference in the index relative to HEAD.
>   2) You didn't specify the differences in the output between behavior
> A and behavior B for your example, if any, which might be needed for
> an appropriate contrast.  Further...
>   3) You picked an example where the output might be the same for both
> behavior A and behavior B, and since behavior B ignores the sparse
> specification, it's really hard to see how this example helps
> elucidate the meaning of the sparse specification.
>
> So, I'm still not seeing how this example helps describe the sparse
> specification.
>
> > > > > +       * When modifying or showing results from the index, the sparse
> > > > > +         specification is the set of files with a clear SKIP_WORKTREE bit
> > > > > +         or that differ in the index from HEAD.
> > > >
> > > > #!/bin/sh
> > > >
> > > > set -x
> > > >
> > > > rm -rf mono-repo
> > > > git init mono-repo -b main
> > > > (
> > > >   cd mono-repo &&
> > > >   mkdir p1 p2 &&
> > > >   echo a >p1/a &&
> > > >   echo b >p1/b &&
> > > >   echo a >p2/a &&
> > > >   echo b >p2/b &&
> > > >   git add . &&
> > > >   git commit -m ok &&
> > > >   git sparse-checkout set p1 &&
> > > >   git update-index --chmod=+x p2/a &&
> > > >   # compare commit/index
> > > >   git --no-pager diff --cached --name-only
> > > > )
> > >
> > > Same issue here; you haven't stated the expected output of these
> > > commands, so I don't see how they help with the description at all.
> > >
> >
> > Here only output p2/a:
> >
> > p2/a is out of sparse patterns, but this index entry mode has been
> > changed compared to HEAD. So we should consider it as a part of
> > sparse specification.
>
> Same thing here about the fact that you've given an example with the
> same output under behavior A and behavior B, and since behavior B
> ignores the sparse specification, I'm not sure your example elucidates
> the sparse specification that much other than to make clear it
> includes more than the sparse patterns.  But didn't my wording already
> do that?
>
> (Note that `git diff --cached` without a revision is just inherently
> susceptible to this problem; it should always produce the same output
> under both modes.)
>

You're right. here need a commit other than HEAD for comparison.

> > > Perhaps it's worth noting why I think the sparse specification should
> > > be extended when dealing with the index:
> > >
> > >   * "mergy" commands (merge, rebase, cherry-pick, am, revert) can
> > > modify the index outside the sparsity patterns, without creating a
> > > commit.
> > >   * `git commit` (or `rebase --continue`, or whatever) will create a
> > > commit from whatever staged versions of files there are
> > >   => `git status` should show what is about to be committed
> > >   => `git diff --cached --name-only` ought to be usable to show what
> > > is to be committed
> > >   => `git grep --cached ...` ought to be usable to search through what
> > > is about to be committed
> > >
> > > See also https://lore.kernel.org/git/CABPp-BESkb=04vVnqTvZyeCa+7cymX7rosUW3rhtA02khMJKHA@mail.gmail.com/
> > > (starting with the paragraph with "leery" in it), and the thread
> > > starting there.  If the sparse specification is not expanded, users
> > > will get some nasty surprises, and the only other alternative I can
> > > think of to avoid such surprises would be making several commands
> > > always run full tree.  Running full-tree with a non-default option to
> > > run sparse forces behavior A folks into a "pick your poison"
> > > situation, which is not nice.  Extending the sparse specification to
> > > include files whose index entries do not match HEAD for index-related
> > > operations provides the nice middle ground that avoids such usability
> > > problems while also allowing users to avoid operating on a full tree.
> > >
> >
> > I can understand the reason why we need to extend sparse specification:
> > index often needs to handle files that are not in the sparse pattern.
>
> Yep!
>
> [...]
> > > > A** : Some users are _only_ interested in the sparse portion of the repo,
> > > > but they want to download all the blobs in it to avoid some unnecessary
> > > > network connections afterwards.
> > >
> > > Here you just repeated `A*` but relabelled it as `A**`.  Yes, this one
> > > is explicitly tied to partial clone behavior.
> >
> > Ah,  `A*` part say “so things like `git log -S${SEARCH_TERM} -p`
> > or `git grep ${SEARCH_TERM} OLDREV ` would need to be prepared to provide
> > partial results that depend on what happens to have been downloaded."
> >
> > So I think it's probably a lot like the behavior after a shallow
> > clone: git log -p or other
> > git commands returning partial results.
>
> Yes, though not being a fan of shallow clones, the comparison makes me
> recoil slightly.  ;-)
>
> > The expectation of A** is to have all blobs under the entire sparse-patterns.
>
> Ah, I misread your `A**`.  I agree there are users that want to do
> this; I'm one of them.
>
> But how does that affect the results that users see from running
> operations?  Does it change any definitions or categorize any commands
> differently, or affect anything in the document?  Why is it worth
> calling out that people want full history of the paths matching the
> sparsity patterns?
>

Hmmm,It does not change the current definition of "restrict" or "scope",
because “restirct”/"scope" care about the range of the "horizontal" file path,
and this "A**" cares about the "vertical" historical depth. While this seems
like a digression, I think it's still relevant to mention it.

I have another issue related to sparse checkout: monorepo size:

1. The user started working under project1, but he accidentally wanted
to see the content of project2, so he modified the speciation patterns, but
then he went back to working on project1. Yes, the size of the worktree is
reduced, but the size of git objects is not, and the size of the git repository
will gradually expand accordingly.

2. Since many git commands now accidentally touch objects outside
the sparse checkout specification (e.g. accidentally downloading objects
after the last git pull), until we actually implement --scope, this repository
size will gradually increase.

Therefore, can we create a new gc option for removing objects out of
sparse specification?

> > >
> > > [...]
> > > > > +    The fact that files can move between the 'tracked' and 'untracked'
> > > > > +    categories means some commands will have to treat untracked files
> > > > > +    differently.  But if we have to treat untracked files differently,
> > > > > +    then additional commands may also need changes:
> > > > > +
> > > > > +      * status
> > > > > +      * clean
> > > > > +
> > > >
> > > > I'm a bit worried about git status, because it's used in many shells
> > > > (e.g. zsh) i
> > > > in the git prompt function. Its default behavior is restricted, otherwise users
> > > > may get blocked when they use zsh to cd to that directory. I don't know how
> > > > to reproduce this problem (since the scenario is built on checkout to a local
> > > > unborn branch).
> > >
> > > Could you elaborate?  I'm not sure if you are talking about an
> > > existing problem that you are worried about being exacerbated, or a
> > > hypothetical problem that could occur with changes.  Further, your
> > > wording is so vague about the problem, that I have no idea what its
> > > nature is or whether any changes to status would even possibly have
> > > any bearing on it.  But the suggested changes to git status are
> > > simply:
> > >
> >
> > I just might have caused this in one particular case. So it's not very
> > important at the moment. But it's worth noting that many shells, IDEs’
> > git plugins
> > may also need to understand sparse-checkout properly, otherwise it can
> >  cause some usability problems.
>
> Why do these tools need to understand sparse-checkout?  What kind of
> usability problems could occur?  Can you describe what range of issues
> can occur, or even give any specific examples?
>
> The whole point of the document is trying to address remaining
> sparse-checkout issues, and we even have a section highlighting known
> current problems.  If you know of additional issues, it would be great
> to make them known, but I cannot figure out what you might be referring
> to from these vague descriptions.


Sorry, I haven't been able to reproduce this specific example yet, and
when I find it,
I'll re-propose it.

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2022-11-23  9:13 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-25  0:09 [PATCH] sparse-checkout.txt: new document with sparse-checkout directions Elijah Newren via GitGitGadget
2022-09-26 17:20 ` Junio C Hamano
2022-09-26 17:38 ` Junio C Hamano
2022-09-27  3:05   ` Elijah Newren
2022-09-27  4:30     ` Junio C Hamano
2022-09-26 20:08 ` Victoria Dye
2022-09-26 22:36   ` Junio C Hamano
2022-09-27  7:30     ` Elijah Newren
2022-09-27 16:07       ` Junio C Hamano
2022-09-28  6:13         ` Elijah Newren
2022-09-27  6:09   ` Elijah Newren
2022-09-27 16:42   ` Derrick Stolee
2022-09-28  5:42     ` Elijah Newren
2022-09-27 15:43 ` Junio C Hamano
2022-09-28  7:49   ` Elijah Newren
2022-09-27 16:36 ` Derrick Stolee
2022-09-28  5:38   ` Elijah Newren
2022-09-28 13:22     ` Derrick Stolee
2022-10-06  7:10       ` Elijah Newren
2022-10-06 18:27         ` Derrick Stolee
2022-10-07  2:56           ` Elijah Newren
2022-09-30  9:54     ` ZheNing Hu
2022-10-06  7:53       ` Elijah Newren
2022-10-15  2:17         ` ZheNing Hu
2022-10-15  4:37           ` Elijah Newren
2022-10-15 14:49             ` ZheNing Hu
2022-09-30  9:09   ` ZheNing Hu
2022-09-28  8:32 ` [PATCH v2] " Elijah Newren via GitGitGadget
2022-10-08 22:52   ` [PATCH v3] " Elijah Newren via GitGitGadget
2022-11-06  6:04     ` [PATCH v4] " Elijah Newren via GitGitGadget
2022-11-07 20:44       ` Derrick Stolee
2022-11-16  4:39         ` Elijah Newren
2022-11-15  4:03       ` ZheNing Hu
2022-11-16  3:18         ` ZheNing Hu
2022-11-16  6:51           ` Elijah Newren
2022-11-16  5:49         ` Elijah Newren
2022-11-16 10:04           ` ZheNing Hu
2022-11-16 10:10             ` ZheNing Hu
2022-11-16 14:33               ` ZheNing Hu
2022-11-19  2:36                 ` Elijah Newren
2022-11-19  2:15             ` Elijah Newren
2022-11-23  9:08               ` ZheNing Hu

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).