From: ZheNing Hu <adlternative@gmail.com>
To: Elijah Newren via GitGitGadget <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, Victoria Dye <vdye@github.com>,
Derrick Stolee <derrickstolee@github.com>,
Shaoxuan Yuan <shaoxuan.yuan02@gmail.com>,
Matheus Tavares <matheus.bernardino@usp.br>,
Elijah Newren <newren@gmail.com>, Glen Choo <chooglen@google.com>,
Martin von Zweigbergk <martinvonz@google.com>
Subject: Re: [PATCH v4] sparse-checkout.txt: new document with sparse-checkout directions
Date: Tue, 15 Nov 2022 12:03:37 +0800 [thread overview]
Message-ID: <CAOLTT8TzpfoH7pz7gxgFvNWOaUZUcg1q_Tap+2anwHfAUgDV8Q@mail.gmail.com> (raw)
In-Reply-To: <pull.1367.v4.git.1667714666810.gitgitgadget@gmail.com>
Hi,
Elijah Newren via GitGitGadget <gitgitgadget@gmail.com> 于2022年11月6日周日 14:04写道:
>
> From: Elijah Newren <newren@gmail.com>
>
> Once upon a time, Matheus wrote some patches to make
> git grep [--cached | <REVISION>] ...
> restrict its output to the sparsity specification when working in a
> sparse checkout[1]. That effort got derailed by two things:
>
> (1) The --sparse-index work just beginning which we wanted to avoid
> creating conflicts for
> (2) Never deciding on flag and config names and planned high level
> behavior for all commands.
>
> More recently, Shaoxuan implemented a more limited form of Matheus'
> patches that only affected --cached, using a different flag name,
> but also changing the default behavior in line with what Matheus did.
> This again highlighted the fact that we never decided on command line
> flag names, config option names, and the big picture path forward.
>
> The --sparse-index work has been mostly complete (or at least released
> into production even if some small edges remain) for quite some time
> now. We have also had several discussions on flag and config names,
> though we never came to solid conclusions. Stolee once upon a time
> suggested putting all these into some document in
> Documentation/technical[3], which Victoria recently also requested[4].
> I'm behind the times, but here's a patch attempting to finally do that.
>
> [1] https://lore.kernel.org/git/5f3f7ac77039d41d1692ceae4b0c5df3bb45b74a.1612901326.git.matheus.bernardino@usp.br/
> (See his second link in that email in particular)
> [2] https://lore.kernel.org/git/20220908001854.206789-2-shaoxuan.yuan02@gmail.com/
> [3] https://lore.kernel.org/git/CABPp-BHwNoVnooqDFPAsZxBT9aR5Dwk5D9sDRCvYSb8akxAJgA@mail.gmail.com/
> (Scroll to the very end for the final few paragraphs)
> [4] https://lore.kernel.org/git/cafcedba-96a2-cb85-d593-ef47c8c8397c@github.com/
>
> Signed-off-by: Elijah Newren <newren@gmail.com>
> ---
> sparse-checkout.txt: new document with sparse-checkout directions
>
> v2 and v3 didn't get any reviews (I know, I know, this document is
> really long), but it's been nearly a month and this patch is still
> marked as "Needs Review", so I'm hoping sending a v4 will encourage
> feedback. I think it's good enough to accept and start iterating, but
> want to be sure others agree.
>
> As before, I think we're starting to converge on actual proposals;
> there's some areas we've agreed on, others we've compromised on, and
> some we've just figured out what the others were saying. The discussion
> has been very illuminating; thanks to everyone who has chimed in. I've
> tried to take my best stab at cleaning up and culling things that don't
> need to remain as open questions, but if I've mis-represented anyone or
> missed something, don't hesitate to speak up. Everything is still open
> for debate, even if not marked as a currently open question.
>
> Changes since v3:
>
> * A few minor wording cleanups here and there, and one paragraph moved
> to keep similar things together.
>
> Changes since v2:
>
> * Compromised with Stollee on log -- Behavior A only affects
> patch-related operations, not revision walking
> * Incorporated Junio's suggestions about untracked file handling
> * Added new usecases, one brought up by Martin, one by Stolee
> * Added new sections:
> * Usecases of primary concern
> * Oversimplified mental models ("Cliff Notes" for this document!)
> * Recategorization of a few commands based on discussion
> * Greater details on how index operations work under Behavior A, to
> avoid weird edge cases
> * Extended explanation of the sparse specification, particularly when
> index differs from HEAD
> * Switched proposed flag names to --scope={sparse,all} to avoid binary
> flags that are hard to extend
> * Switched proposed config option name (still need good values and
> descriptions for it, though)
> * Removed questions we seemed to have agreement on. Modified/extended
> some existing questions.
> * Added Stolee's sparse-backfill ideas to the plans
> * Additional Known bugs
> * Various wording improvements
> * Possibly other things I've missed.
>
> Changes since v1:
>
> * Added new sections:
> * "Terminology"
> * "Behavior classes"
> * "Sparse specification vs. sparsity patterns"
> * Tried to shuffle commands from unknown into appropriate sections
> based on feedback, but I got some conflicting feedback, so...who
> knows if thing are in the right place
> * More consistency in using "sparse specification" over other terms
> * Extra comments about how add/rm/mv operate on moving files across the
> tracked/untracked boundary
> * --restrict-but-warn should have been "restrict or error", but
> reworded even more heavily as part of "Behavior classes" section
> * Added extra questions based on feedback (--no-expand, update-index
> stuff, apply --index)
> * More details on apply/am bugs
> * Documented read-tree issue
> * A few cases of fixing line wrapping at <=80 chars
> * Added more alternate name suggestions for options instead of
> --[no-]restrict
>
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1367%2Fnewren%2Fsparse-checkout-directions-v4
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1367/newren/sparse-checkout-directions-v4
> Pull-Request: https://github.com/gitgitgadget/git/pull/1367
>
> Documentation/technical/sparse-checkout.txt | 1103 +++++++++++++++++++
> 1 file changed, 1103 insertions(+)
> create mode 100644 Documentation/technical/sparse-checkout.txt
>
> diff --git a/Documentation/technical/sparse-checkout.txt b/Documentation/technical/sparse-checkout.txt
> new file mode 100644
> +=== Terminology ===
> +
> +sparse directory: An entry in the index corresponding to a directory, which
> + appears in the index instead of all the files under that directory
> + that would normally appear. See also sparse-index. Something that
> + can cause confusion is that the "sparse directory" does NOT match
> + the sparse specification, i.e. the directory is NOT present in the
> + working tree. May be renamed in the future (e.g. to "skipped
> + directory").
> +
> +sparse index: A special mode for sparse-checkout that also makes the
> + index sparse by recording a directory entry in lieu of all the
> + files underneath that directory (thus making that a "skipped
> + directory" which unfortunately has also been called a "sparse
> + directory"), and does this for potentially multiple
> + directories. Controlled by the --[no-]sparse-index option to
> + init|set|reapply.
> +
> +sparsity patterns: patterns from $GIT_DIR/info/sparse-checkout used to
> + define the set of files of interest. A warning: It is easy to
> + over-use this term (or the shortened "patterns" term), for two
> + reasons: (1) users in cone mode specify directories rather than
> + patterns (their directories are transformed into patterns, but
> + users may think you are talking about non-cone mode if you use the
> + word "patterns"), and (b) the sparse specification might
nit: s/(b)/(2)/g
> + transiently differ in the working tree or index from the sparsity
> + patterns (see "Sparse specification vs. sparsity patterns").
> +
> +sparse specification: The set of paths in the user's area of focus. This
> + is typically just the tracked files that match the sparsity
> + patterns, but the sparse specification can temporarily differ and
> + include additional files. (See also "Sparse specification
> + vs. sparsity patterns")
> +
> + * When working with history, the sparse specification is exactly
> + the set of files matching the sparsity patterns.
> + * When interacting with the working tree, the sparse specification
> + is the set of tracked files with a clear SKIP_WORKTREE bit or
> + tracked files present in the working copy.
I'm guessing what you mean here is:
Some files are stored with a flag bit of !SKIP_WORKTREE in its index entry.
But files are "vivifying" (restore to worktree) or new files added to
index (tracked files),
they also belong to the sparse specification.
I think we can add some examples to describe these terms.
#!/bin/sh
set -x
rm -rf mono-repo
git init mono-repo -b main
(
cd mono-repo &&
mkdir p1 p2 &&
echo a >p1/a &&
echo b >p1/b &&
echo a >p2/a &&
echo b >p2/b &&
git add . &&
git commit -m ok &&
git sparse-checkout set p1 &&
git ls-files -t &&
echo a >>p1/a &&
echo b >>p1/b &&
mkdir p2 p3 &&
echo next >>p2/a &&
echo next >>p3/c &&
git add p3/c &&
# p2/a and p3/c vivify
git ls-files -t &&
# compare wortree/commit
git --no-pager diff HEAD --name-only
)
> + * When modifying or showing results from the index, the sparse
> + specification is the set of files with a clear SKIP_WORKTREE bit
> + or that differ in the index from HEAD.
#!/bin/sh
set -x
rm -rf mono-repo
git init mono-repo -b main
(
cd mono-repo &&
mkdir p1 p2 &&
echo a >p1/a &&
echo b >p1/b &&
echo a >p2/a &&
echo b >p2/b &&
git add . &&
git commit -m ok &&
git sparse-checkout set p1 &&
git update-index --chmod=+x p2/a &&
# compare commit/index
git --no-pager diff --cached --name-only
)
> + * If working with the index and the working copy, the sparse
> + specification is the union of the paths from above.
> +
> +vivifying: When a command restores a tracked file to the working tree (and
> + hopefully also clears the SKIP_WORKTREE bit in the index for that
> + file), this is referred to as "vivifying" the file.
> +
> +
> +=== Purpose of sparse-checkouts ===
> +
> +sparse-checkouts exist to allow users to work with a subset of their
> +files.
> +
> +You can think of sparse-checkouts as subdividing "tracked" files into two
> +categories -- a sparse subset, and all the rest. Implementationally, we
> +mark "all the rest" in the index with a SKIP_WORKTREE bit and leave them
> +out of the working tree. The SKIP_WORKTREE files are still tracked, just
> +not present in the working tree.
> +
> +In the past, sparse-checkouts were defined by "SKIP_WORKTREE means the file
> +is missing from the working tree but pretend the file contents match HEAD".
> +That was not only bogus (it actually meant the file missing from the
> +working tree matched the index rather than HEAD), but it was also a
> +low-level detail which only provided decent behavior for a few commands.
> +There were a surprising number of ways in which that guiding principle gave
> +command results that violated user expectations, and as such was a bad
> +mental model. However, it persisted for many years and may still be found
> +in some corners of the code base.
> +
> +Anyway, the idea of "working with a subset of files" is simple enough, but
> +there are multiple different high-level usecases which affect how some Git
> +subcommands should behave. Further, even if we only considered one of
> +those usecases, sparse-checkouts can modify different subcommands in over a
> +half dozen different ways. Let's start by considering the high level
> +usecases:
> +
> + A) Users are _only_ interested in the sparse portion of the repo
> +
> + A*) Users are _only_ interested in the sparse portion of the repo
> + that they have downloaded so far
> +
> + B) Users want a sparse working tree, but are working in a larger whole
> +
> + C) sparse-checkout is a behind-the-scenes implementation detail allowing
> + Git to work with a specially crafted in-house virtual file system;
> + users are actually working with a "full" working tree that is
> + lazily populated, and sparse-checkout helps with the lazy population
> + piece.
> +
> +It may be worth explaining each of these in a bit more detail:
> +
> +
> + (Behavior A) Users are _only_ interested in the sparse portion of the repo
> +
> +These folks might know there are other things in the repository, but
> +don't care. They are uninterested in other parts of the repository, and
> +only want to know about changes within their area of interest. Showing
> +them other files from history (e.g. from diff/log/grep/etc.) is a
> +usability annoyance, potentially a huge one since other changes in
> +history may dwarf the changes they are interested in.
> +
> +Some of these users also arrive at this usecase from wanting to use partial
> +clones together with sparse checkouts (in a way where they have downloaded
> +blobs within the sparse specification) and do disconnected development.
> +Not only do these users generally not care about other parts of the
> +repository, but consider it a blocker for Git commands to try to operate on
> +those. If commands attempt to access paths in history outside the sparsity
> +specification, then the partial clone will attempt to download additional
> +blobs on demand, fail, and then fail the user's command. (This may be
> +unavoidable in some cases, e.g. when `git merge` has non-trivial changes to
> +reconcile outside the sparse specification, but we should limit how often
> +users are forced to connect to the network.)
> +
> +Also, even for users using partial clones that do not mind being
> +always connected to the network, the need to download blobs as
> +side-effects of various other commands (such as the printed diffstat
> +after a merge or pull) can lead to worries about local repository size
> +growing unnecessarily[10].
> +
> + (Behavior A*) Users are _only_ interested in the sparse portion of the repo
> + that they have downloaded so far (a variant on the first usecase)
> +
> +This variant is driven by folks who using partial clones together with
> +sparse checkouts and do disconnected development (so far sounding like a
> +subset of behavior A users) and doing so on very large repositories. The
> +reason for yet another variant is that downloading even just the blobs
> +through history within their sparse specification may be too much, so they
> +only download some. They would still like operations to succeed without
> +network connectivity, though, so things like `git log -S${SEARCH_TERM} -p`
> +or `git grep ${SEARCH_TERM} OLDREV ` would need to be prepared to provide
> +partial results that depend on what happens to have been downloaded.
> +
> +This variant could be viewed as Behavior A with the sparse specification
> +for history querying operations modified from "sparsity patterns" to
> +"sparsity patterns limited to the blobs we have already downloaded".
> +
I think A's users might need a little more refinement.
A: Some users are _only_ interested in the sparse portion of the repo,
but they don't want to download all the blobs, they can accept to download
other data later using partial-clone, which can reduce the local storage size.
(Current default behavior)
A** : Some users are _only_ interested in the sparse portion of the repo,
but they want to download all the blobs in it to avoid some unnecessary
network connections afterwards.
> +=== Usecases of primary concern ===
> +
> +Most of the rest of this document will focus on Behavior A and Behavior
> +B. Some notes about the other two cases and why we are not focusing on
> +them:
> +
> + (Behavior A*)
> +
> +Supporting this usecase is estimated to be difficult and a lot of work.
> +There are no plans to implement it currently, but it may be a potential
> +future alternative. Knowing about the existence of additional alternatives
> +may affect our choice of command line flags (e.g. if we need tri-state or
> +quad-state flags rather than just binary flags), so it was still important
> +to at least note.
> +
> +Further, I believe the descriptions below for Behavior A are probably still
> +valid for this usecase, with the only exception being that it redefines the
> +sparse specification to restrict it to already-downloaded blobs. The hard
> +part is in making commands capable of respecting that modified definition.
> +
> + (Behavior C)
> +
> +This usecase violates some of the early sparse-checkout documented
> +assumptions (since files marked as SKIP_WORKTREE will be displayed to users
> +as present in the working tree). That violation may mean various
> +sparse-checkout related behaviors are not well suited to this usecase and
> +we may need tweaks -- to both documentation and code -- to handle it.
> +However, this usecase is also perhaps the simplest model to support in that
> +everything behaves like a dense checkout with a few exceptions (e.g. branch
> +checkouts and switches write fewer things, knowing the VFS will lazily
> +write the rest on an as-needed basis).
> +
> +Since there is no publically available VFS-related code for folks to try,
> +the number of folks who can test such a usecase is limited.
> +
> +The primary reason to note the Behavior C usecase is that as we fix things
> +to better support Behaviors A and B, there may be additional places where
> +we need to make tweaks allowing folks in this usecase to get the original
> +non-sparse treatment. For an example, see ecc7c8841d ("repo_read_index:
> +add config to expect files outside sparse patterns", 2022-02-25). The
> +secondary reason to note Behavior C, is so that folks taking advantage of
> +Behavior C do not assume they are part of the Behavior B camp and propose
> +patches that break things for the real Behavior B folks.
> +
> +
> +=== Oversimplified mental models ===
> +
> +An oversimplification of the differences in the above behaviors is:
> +
> + Behavior A: Restrict worktree and history operations to sparse specification
> + Behavior B: Restrict worktree operations to sparse specification; have any
> + history operations work across all files
> + Behavior C: Do not restrict either worktree or history operations to the
> + sparse specification...with the exception of branch checkouts or
> + switches which avoid writing files that will match the index so
> + they can later lazily be populated instead.
> +
> +
> +=== Desired behavior ===
> +
> +As noted previously, despite the simple idea of just working with a subset
> +of files, there are a range of different behavioral changes that need to be
> +made to different subcommands to work well with such a feature. See
> +[1,2,3,4,5,6,7,8,9,10] for various examples. In particular, at [2], we saw
> +that mere composition of other commands that individually worked correctly
> +in a sparse-checkout context did not imply that the higher level command
> +would work correctly; it sometimes requires further tweaks. So,
> +understanding these differences can be beneficial.
> +
> +* Commands behaving the same regardless of high-level use-case
> +
> + * commands that only look at files within the sparsity specification
> +
> + * diff (without --cached or REVISION arguments)
> + * grep (without --cached or REVISION arguments)
> + * diff-files
> +
> + * commands that restore files to the working tree that match sparsity
> + patterns, and remove unmodified files that don't match those
> + patterns:
> +
> + * switch
> + * checkout (the switch-like half)
> + * read-tree
> + * reset --hard
> +
> + * commands that write conflicted files to the working tree, but otherwise
> + will omit writing files to the working tree that do not match the
> + sparsity patterns:
> +
> + * merge
> + * rebase
> + * cherry-pick
> + * revert
> +
> + * `am` and `apply --cached` should probably be in this section but
> + are buggy (see the "Known bugs" section below)
> +
> + The behavior for these commands somewhat depends upon the merge
> + strategy being used:
> + * `ort` behaves as described above
> + * `recursive` tries to not vivify files unnecessarily, but does sometimes
> + vivify files without conflicts.
> + * `octopus` and `resolve` will always vivify any file changed in the merge
> + relative to the first parent, which is rather suboptimal.
> +
> + It is also important to note that these commands WILL update the index
> + outside the sparse specification relative to when the operation began,
> + BUT these commands often make a commit just before or after such that
> + by the end of the operation there is no change to the index outside the
> + sparse specification. Of course, if the operation hits conflicts or
> + does not make a commit, then these operations clearly can modify the
> + index outside the sparse specification.
> +
> + Finally, it is important to note that at least the first four of these
> + commands also try to remove differences between the sparse
> + specification and the sparsity patterns (much like the commands in the
> + previous section).
> +
> + * commands that always ignore sparsity since commits must be full-tree
> +
> + * archive
> + * bundle
> + * commit
> + * format-patch
> + * fast-export
> + * fast-import
> + * commit-tree
> +
> + * commands that write any modified file to the working tree (conflicted
> + or not, and whether those paths match sparsity patterns or not):
> +
> + * stash
> + * apply (without `--index` or `--cached`)
> +
> +* Commands that may slightly differ for behavior A vs. behavior B:
> +
> + Commands in this category behave mostly the same between the two
> + behaviors, but may differ in verbosity and types of warning and error
> + messages.
> +
> + * commands that make modifications to which files are tracked:
> + * add
> + * rm
> + * mv
> + * update-index
> +
> + The fact that files can move between the 'tracked' and 'untracked'
> + categories means some commands will have to treat untracked files
> + differently. But if we have to treat untracked files differently,
> + then additional commands may also need changes:
> +
> + * status
> + * clean
> +
I'm a bit worried about git status, because it's used in many shells
(e.g. zsh) i
in the git prompt function. Its default behavior is restricted, otherwise users
may get blocked when they use zsh to cd to that directory. I don't know how
to reproduce this problem (since the scenario is built on checkout to a local
unborn branch).
> + In particular, `status` may need to report any untracked files outside
> + the sparsity specification as an erroneous condition (especially to
> + avoid the user trying to `git add` them, forcing `git add` to display
> + an error).
> +
> + It's not clear to me exactly how (or even if) `clean` would change,
> + but it's the other command that also affects untracked files.
> +
> + `update-index` may be slightly special. Its --[no-]skip-worktree flag
> + may need to ignore the sparse specification by its nature. Also, its
> + current --[no-]ignore-skip-worktree-entries default is totally bogus.
> +
> + * commands for manually tweaking paths in both the index and the working tree
> + * `restore`
> + * the restore-like half of `checkout`
> +
> + These commands should be similar to add/rm/mv in that they should
> + only operate on the sparse specification by default, and require a
> + special flag to operate on all files.
> +
> + Also, note that these commands currently have a number of issues (see
> + the "Known bugs" section below)
> +
> +* Commands that significantly differ for behavior A vs. behavior B:
> +
> + * commands that query history
> + * diff (with --cached or REVISION arguments)
> + * grep (with --cached or REVISION arguments)
> + * show (when given commit arguments)
> + * blame (only matters when one or more -C flags are passed)
> + * and annotate
> + * log
> + * whatchanged
> + * ls-files
> + * diff-index
> + * diff-tree
> + * ls-tree
> +
> + Note: for log and whatchanged, revision walking logic is unaffected
> + but displaying of patches is affected by scoping the command to the
> + sparse-checkout. (The fact that revision walking is unaffected is
> + why rev-list, shortlog, show-branch, and bisect are not in this
> + list.)
> +
> + ls-files may be slightly special in that e.g. `git ls-files -t` is
> + often used to see what is sparse and what is not. Perhaps -t should
> + always work on the full tree?
> +
Recently git ls-files added a --format option, perhaps this can be modified to
show if a file is SKIP_WORKTREE in the future.
diff --git a/builtin/ls-files.c b/builtin/ls-files.c
index 4cf8a23648..0aeff8e514 100644
--- a/builtin/ls-files.c
+++ b/builtin/ls-files.c
@@ -280,6 +280,9 @@ static size_t expand_show_index(struct strbuf *sb,
const char *start,
data->pathname));
else if (skip_prefix(start, "(path)", &p))
write_name_to_buf(sb, data->pathname);
+ else if (skip_prefix(start, "(skiptree)", &p))
+ strbuf_addstr(sb, ce_skip_worktree(data->ce) ?
+ "true" : "false");
else
die(_("bad ls-files format: %%%.*s"), (int)len, start);
> +=== Behavior classes ===
> +
> +From the above there are a few classes of behavior:
> +
> + * "restrict"
> +
> + Commands in this class only read or write files in the working tree
> + within the sparse specification.
> +
> + When moving to a new commit (e.g. switch, reset --hard), these commands
> + may update index files outside the sparse specification as of the start
> + of the operation, but by the end of the operation those index files
> + will match HEAD again and thus those files will again be outside the
> + sparse specification.
> +
> + When paths are explicitly specified, these paths are intersected with
> + the sparse specification and will only operate on such paths.
> + (e.g. `git restore [--staged] -- '*.png'`, `git reset -p -- '*.md'`)
> +
> + Some of these commands may also attempt, at the end of their operation,
> + to cull transient differences between the sparse specification and the
> + sparsity patterns (see "Sparse specification vs. sparsity patterns" for
> + details, but this basically means either removing unmodified files not
> + matching the sparsity patterns and marking those files as
> + SKIP_WORKTREE, or vivifying files that match the sparsity patterns and
> + marking those files as !SKIP_WORKTREE).
> +
> + * "restrict modulo conflicts"
> +
> + Commands in this class generally behave like the "restrict" class,
> + except that:
> + (1) they will ignore the sparse specification and write files with
> + conflicts to the working tree (thus temporarily expanding the
> + sparse specification to include such files.)
> + (2) they are grouped with commands which move to a new commit, since
> + they often create a commit and then move to it, even though we
> + know there are many exceptions to moving to the new commit. (For
> + example, the user may rebase a commit that becomes empty, or have
> + a cherry-pick which conflicts, or a user could run `merge
> + --no-commit`, and we also view `apply --index` kind of like `am
> + --no-commit`.) As such, these commands can make changes to index
> + files outside the sparse specification, though they'll mark such
> + files with SKIP_WORKTREE.
> +
> + * "restrict also specially applied to untracked files"
> +
> + Commands in this class generally behave like the "restrict" class,
> + except that they have to handle untracked files differently too, often
> + because these commands are dealing with files changing state between
> + 'tracked' and 'untracked'. Often, this may mean printing an error
> + message if the command had nothing to do, but the arguments may have
> + referred to files whose tracked-ness state could have changed were it
> + not for the sparsity patterns excluding them.
> +
> + * "no restrict"
> +
> + Commands in this class ignore the sparse specification entirely.
> +
> + * "restrict or no restrict dependent upon behavior A vs. behavior B"
> +
> + Commands in this class behave like "no restrict" for folks in the
> + behavior B camp, and like "restrict" for folks in the behavior A camp.
> + However, when behaving like "restrict" a warning of some sort might be
> + provided that history queries have been limited by the sparse-checkout
> + specification.
> +
> +
> +=== Subcommand-dependent defaults ===
> +
> +Note that we have different defaults depending on the command for the
> +desired behavior :
> +
> + * Commands defaulting to "restrict":
> + * diff-files
> + * diff (without --cached or REVISION arguments)
> + * grep (without --cached or REVISION arguments)
> + * switch
> + * checkout (the switch-like half)
> + * reset (<commit>)
> +
> + * restore
> + * checkout (the restore-like half)
> + * checkout-index
> + * reset (with pathspec)
> +
> + This behavior makes sense; these interact with the working tree.
> +
> + * Commands defaulting to "restrict modulo conflicts":
> + * merge
> + * rebase
> + * cherry-pick
> + * revert
> +
> + * am
> + * apply --index (which is kind of like an `am --no-commit`)
> +
> + * read-tree (especially with -m or -u; is kind of like a --no-commit merge)
> + * reset (<tree-ish>, due to similarity to read-tree)
> +
> + These also interact with the working tree, but require slightly
> + different behavior either so that (a) conflicts can be resolved or (b)
> + because they are kind of like a merge-without-commit operation.
> +
> + (See also the "Known bugs" section below regarding `am` and `apply`)
> +
> + * Commands defaulting to "no restrict":
> + * archive
> + * bundle
> + * commit
> + * format-patch
> + * fast-export
> + * fast-import
> + * commit-tree
> +
> + * stash
> + * apply (without `--index`)
> +
> + These have completely different defaults and perhaps deserve the most
> + detailed explanation:
> +
> + In the case of commands in the first group (format-patch,
> + fast-export, bundle, archive, etc.), these are commands for
> + communicating history, which will be broken if they restrict to a
> + subset of the repository. As such, they operate on full paths and
> + have no `--restrict` option for overriding. Some of these commands may
> + take paths for manually restricting what is exported, but it needs to
> + be very explicit.
> +
> + In the case of stash, it needs to vivify files to avoid losing the
> + user's changes.
> +
> + In the case of apply without `--index`, that command needs to update
> + the working tree without the index (or the index without the working
> + tree if `--cached` is passed), and if we restrict those updates to the
> + sparse specification then we'll lose changes from the user.
> +
> + * Commands defaulting to "restrict also specially applied to untracked files":
> + * add
> + * rm
> + * mv
> + * update-index
> + * status
> + * clean (?)
> +
> + Our original implementation for the first three of these commands was
> + "no restrict", but it had some severe usability issues:
> + * `git add <somefile>` if honored and outside the sparse
> + specification, can result in the file randomly disappearing later
> + when some subsequent command is run (since various commands
> + automatically clean up unmodified files outside the sparse
> + specification).
> + * `git rm '*.jpg'` could very negatively surprise users if it deletes
> + files outside the range of the user's interest.
> + * `git mv` has similar surprises when moving into or out of the cone,
> + so best to restrict by default
> +
> + So, we switched `add` and `rm` to default to "restrict", which made
> + usability problems much less severe and less frequent, but we still got
> + complaints because commands like:
> + git add <file-outside-sparse-specification>
> + git rm <file-outside-sparse-specification>
> + would silently do nothing. We should instead print an error in those
> + cases to get usability right.
> +
> + update-index needs to be updated to match, and status and maybe clean
> + also need to be updated to specially handle untracked paths.
> +
> + There may be a difference in here between behavior A and behavior B in
> + terms of verboseness of errors or additional warnings.
> +
> + * Commands falling under "restrict or no restrict dependent upon behavior
> + A vs. behavior B"
> +
> + * diff (with --cached or REVISION arguments)
> + * grep (with --cached or REVISION arguments)
> + * show (when given commit arguments)
> + * blame (only matters when one or more -C flags passed)
> + * and annotate
> + * log
> + * and variants: shortlog, gitk, show-branch, whatchanged, rev-list
> + * ls-files
> + * diff-index
> + * diff-tree
> + * ls-tree
> +
> + For now, we default to behavior B for these, which want a default of
> + "no restrict".
> +
> + Note that two of these commands -- diff and grep -- also appeared in a
> + different list with a default of "restrict", but only when limited to
> + searching the working tree. The working tree vs. history distinction
> + is fundamental in how behavior B operates, so this is expected. Note,
> + though, that for diff and grep with --cached, when doing "restrict"
> + behavior, the difference between sparse specification and sparsity
> + patterns is important to handle.
> +
> + "restrict" may make more sense as the long term default for these[12].
> + Also, supporting "restrict" for these commands might be a fair amount
> + of work to implement, meaning it might be implemented over multiple
> + releases. If that behavior were the default in the commands that
> + supported it, that would force behavior B users to need to learn to
> + slowly add additional flags to their commands, depending on git
> + version, to get the behavior they want. That gradual switchover would
> + be painful, so we should avoid it at least until it's fully
> + implemented.
> +
> +
> +=== Sparse specification vs. sparsity patterns ===
> +
> +In a well-behaved situation, the sparse specification is given directly
> +by the $GIT_DIR/info/sparse-checkout file. However, it can transiently
> +diverge for a few reasons:
> +
> + * needing to resolve conflicts (merging will vivify conflicted files)
> + * running Git commands that implicitly vivify files (e.g. "git stash apply")
> + * running Git commands that explicitly vivify files (e.g. "git checkout
> + --ignore-skip-worktree-bits FILENAME")
> + * other commands that write to these files (perhaps a user copies it
> + from elsewhere)
> +
> +For the last item, note that we do automatically clear the SKIP_WORKTREE
> +bit for files that are present in the working tree. This has been true
> +since 82386b4496 ("Merge branch 'en/present-despite-skipped'",
> +2022-03-09)
> +
> +However, such a situation is transient because:
> +
> + * Such transient differences can and will be automatically removed as
> + a side-effect of commands which call unpack_trees() (checkout,
> + merge, reset, etc.).
> + * Users can also request such transient differences be corrected via
> + running `git sparse-checkout reapply`. Various places recommend
> + running that command.
> + * Additional commands are also welcome to implicitly fix these
> + differences; we may add more in the future.
> +
> +While we avoid dropping unstaged changes or files which have conflicts,
> +we otherwise aggressively try to fix these transient differences. If
> +users want these differences to persist, they should run the `set` or
> +`add` subcommands of `git sparse-checkout` to reflect their intended
> +sparse specification.
> +
> +However, when we need to do a query on history restricted to the
> +"relevant subset of files" such a transiently expanded sparse
> +specification is ignored. There are a couple reasons for this:
> +
> + * The behavior wanted when doing something like
> + git grep expression REVISION
> + is roughly what the users would expect from
> + git checkout REVISION && git grep expression
> + (modulo a "REVISION:" prefix), which has a couple ramifications:
> +
> + * REVISION may have paths not in the current index, so there is no
> + path we can consult for a SKIP_WORKTREE setting for those paths.
> +
> + * Since `checkout` is one of those commands that tries to remove
> + transient differences in the sparse specification, it makes sense
> + to use the corrected sparse specification
> + (i.e. $GIT_DIR/info/sparse-checkout) rather than attempting to
> + consult SKIP_WORKTREE anyway.
> +
> +So, a transiently expanded (or restricted) sparse specification applies to
> +the working tree, but not to history queries where we always use the
> +sparsity patterns. (See [16] for an early discussion of this.)
> +
> +Similar to a transiently expanded sparse specification of the working tree
> +based on additional files being present in the working tree, we also need
> +to consider additional files being modified in the index. In particular,
> +if the user has staged changes to files (relative to HEAD) that do not
> +match the sparsity patterns, and the file is not present in the working
> +tree, we still want to consider the file part of the sparse specification
> +if we are specifically performing a query related to the index (e.g. git
> +diff --cached [REVISION], git diff-index [REVISION], git restore --staged
> +--source=REVISION -- PATHS, etc.) Note that a transiently expanded sparse
> +specification for the index usually only matters under behavior A, since
> +under behavior B index operations are lumped with history and tend to
> +operate full-tree.
> +
> +
> +=== Implementation Questions ===
> +
> + * Do the options --scope={sparse,all} sound good to others? Are there better
> + options?
> + * Names in use, or appearing in patches, or previously suggested:
> + * --sparse/--dense
> + * --ignore-skip-worktree-bits
> + * --ignore-skip-worktree-entries
> + * --ignore-sparsity
> + * --[no-]restrict-to-sparse-paths
> + * --full-tree/--sparse-tree
> + * --[no-]restrict
> + * --scope={sparse,all}
> + * --focus/--unfocus
> + * --limit/--unlimited
> + * Rationale making me lean slightly towards --scope={sparse,all}:
> + * We want a name that works for many commands, so we need a name that
> + does not conflict
> + * We know that we have more than two possible usecases, so it is best
> + to avoid a flag that appears to be binary.
> + * --scope={sparse,all} isn't overly long and seems relatively
> + explanatory
> + * `--sparse`, as used in add/rm/mv, is totally backwards for
> + grep/log/etc. Changing the meaning of `--sparse` for these
> + commands would fix the backwardness, but possibly break existing
> + scripts. Using a new name pairing would allow us to treat
> + `--sparse` in these commands as a deprecated alias.
> + * There is a different `--sparse`/`--dense` pair for commands using
> + revision machinery, so using that naming might cause confusion
> + * There is also a `--sparse` in both pack-objects and show-branch, which
> + don't conflict but do suggest that `--sparse` is overloaded
> + * The name --ignore-skip-worktree-bits is a double negative, is
> + quite a mouthful, refers to an implementation detail that many
> + users may not be familiar with, and we'd need a negation for it
> + which would probably be even more ridiculously long. (But we
> + can make --ignore-skip-worktree-bits a deprecated alias for
> + --no-restrict.)
> +
> + * If a config option is added (sparse.scope?) what should the values and
> + description be? "sparse" (behavior A), "worktree-sparse-history-dense"
> + (behavior B), "dense" (behavior C)? There's a risk of confusion,
> + because even for Behaviors A and B we want some commands to be
> + full-tree and others to operate sparsely, so the wording may need to be
> + more tied to the usecases and somehow explain that. Also, right now,
> + the primary difference we are focusing is just the history-querying
> + commands (log/diff/grep). Previous config suggestion here: [13]
> +
Maybe sparse.scope={sparse, all}?
> + * Is `--no-expand` a good alias for ls-files's `--sparse` option?
> + (`--sparse` does not map to either `--scope=sparse` or `--scope=all`,
> + because in non-cone mode it does nothing and in cone-mode it shows the
> + sparse directory entries which are technically outside the sparse
> + specification)
> +
> + * Under Behavior A:
> + * Does ls-files' `--no-expand` override the default `--scope=all`, or
> + does it need an extra flag?
> + * Does ls-files' `-t` option imply `--scope=all`?
> + * Does update-index's `--[no-]skip-worktree` option imply `--scope=all`?
> +
> + * sparse-checkout: once behavior A is fully implemented, should we take
> + an interim measure to ease people into switching the default? Namely,
> + if folks are not already in a sparse checkout, then require
> + `sparse-checkout init/set` to take a
> + `--set-scope=(sparse|worktree-sparse-history-dense|dense)` flag (which
> + would set sparse.scope according to the setting given), and throw an
> + error if the flag is not provided? That error would be a great place
> + to warn folks that the default may change in the future, and get them
> + used to specifying what they want so that the eventual default switch
> + is seamless for them.
> +
> +
> +=== Implementation Goals/Plans ===
> +
> + * Get buy-in on this document in general.
> +
> + * Figure out answers to the 'Implementation Questions' sections (above)
> +
> + * Fix bugs in the 'Known bugs' section (below)
> +
> + * Provide some kind of method for backfilling the blobs within the sparse
> + specification in a partial clone
> +
> + [Below here is kind of spitballing since the first two haven't been resolved]
> +
> + * update-index: flip the default to --no-ignore-skip-worktree-entries,
> + nuke this stupid "Oh, there's a bug? Let me add a flag to let users
> + request that they not trigger this bug." flag
> +
> + * Flags & Config
> + * Make `--sparse` in add/rm/mv a deprecated alias for `--scope=all`
> + * Make `--ignore-skip-worktree-bits` in checkout-index/checkout/restore
> + a deprecated aliases for `--scope=all`
> + * Create config option (sparse.scope?), tie it to the "Cliff notes"
> + overview
> +
> + * Add --scope=sparse (and --scope=all) flag to each of the history querying
> + commands. IMPORTANT: make sure diff machinery changes don't mess with
> + format-patch, fast-export, etc.
> +
Thanks,
ZheNing Hu
next prev parent reply other threads:[~2022-11-15 4:04 UTC|newest]
Thread overview: 46+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-09-25 0:09 [PATCH] sparse-checkout.txt: new document with sparse-checkout directions Elijah Newren via GitGitGadget
2022-09-26 17:20 ` Junio C Hamano
2022-09-26 17:38 ` Junio C Hamano
2022-09-27 3:05 ` Elijah Newren
2022-09-27 4:30 ` Junio C Hamano
2022-09-26 20:08 ` Victoria Dye
2022-09-26 22:36 ` Junio C Hamano
2022-09-27 7:30 ` Elijah Newren
2022-09-27 16:07 ` Junio C Hamano
2022-09-28 6:13 ` Elijah Newren
2022-09-27 6:09 ` Elijah Newren
2022-09-27 16:42 ` Derrick Stolee
2022-09-28 5:42 ` Elijah Newren
2022-09-27 15:43 ` Junio C Hamano
2022-09-28 7:49 ` Elijah Newren
2022-09-27 16:36 ` Derrick Stolee
2022-09-28 5:38 ` Elijah Newren
2022-09-28 13:22 ` Derrick Stolee
2022-10-06 7:10 ` Elijah Newren
2022-10-06 18:27 ` Derrick Stolee
2022-10-07 2:56 ` Elijah Newren
2022-09-30 9:54 ` ZheNing Hu
2022-10-06 7:53 ` Elijah Newren
2022-10-15 2:17 ` ZheNing Hu
2022-10-15 4:37 ` Elijah Newren
2022-10-15 14:49 ` ZheNing Hu
2022-09-30 9:09 ` ZheNing Hu
2022-09-28 8:32 ` [PATCH v2] " Elijah Newren via GitGitGadget
2022-10-08 22:52 ` [PATCH v3] " Elijah Newren via GitGitGadget
2022-11-06 6:04 ` [PATCH v4] " Elijah Newren via GitGitGadget
2022-11-07 20:44 ` Derrick Stolee
2022-11-16 4:39 ` Elijah Newren
2022-11-15 4:03 ` ZheNing Hu [this message]
2022-11-16 3:18 ` ZheNing Hu
2022-11-16 6:51 ` Elijah Newren
2022-11-16 5:49 ` Elijah Newren
2022-11-16 10:04 ` ZheNing Hu
2022-11-16 10:10 ` ZheNing Hu
2022-11-16 14:33 ` ZheNing Hu
2022-11-19 2:36 ` Elijah Newren
2022-11-19 2:15 ` Elijah Newren
2022-11-23 9:08 ` ZheNing Hu
2023-01-14 10:18 ` ZheNing Hu
2023-01-20 4:30 ` Elijah Newren
2023-01-23 15:05 ` ZheNing Hu
2023-01-24 3:17 ` Elijah Newren
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CAOLTT8TzpfoH7pz7gxgFvNWOaUZUcg1q_Tap+2anwHfAUgDV8Q@mail.gmail.com \
--to=adlternative@gmail.com \
--cc=chooglen@google.com \
--cc=derrickstolee@github.com \
--cc=git@vger.kernel.org \
--cc=gitgitgadget@gmail.com \
--cc=martinvonz@google.com \
--cc=matheus.bernardino@usp.br \
--cc=newren@gmail.com \
--cc=shaoxuan.yuan02@gmail.com \
--cc=vdye@github.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).