From: Johannes Schindelin <Johannes.Schindelin@gmx.de>
To: git@vger.kernel.org
Subject: [Summit topic] Sparse checkout behavior and plans
Date: Thu, 21 Oct 2021 13:56:26 +0200 (CEST)
Message-ID: <nycvar.QRO.7.76.6.2110211148230.56@tvgsbejvaqbjf.bet>
In-Reply-To: <nycvar.QRO.7.76.6.2110211129130.56@tvgsbejvaqbjf.bet>
This session was led by Derrick Stolee. Supporting cast: Jonathan
"jrnieder" Nieder, Elijah Newren, Jeff Hostetler, Jeff "Peff" King,
Johannes "Dscho" Schindelin, Ævar Arnfjörð Bjarmason, Emily Shaffer,
Victoria Dye, brian m. carlson, and CB Bailey.
Notes:
1. Cone mode has stabilized
2. jrnieder: would sparse index without cone mode support be welcome?
1. Stolee: you’re welcome to try ;-)
2. Elijah: main theme: performance. Cone mode allows reasonable
performance due to fewer rules to check
3. Stolee: directory-level lookups mean lookups can have sublinear cost,
since you can skip sparse rules (no need to check them in order to
figure out whether a file is excluded)
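A minimal sketch of why directory-level lookups can be cheap, assuming
the cone is stored as two hash sets (directories included recursively,
and their parent directories, whose direct files are also included).
This is an illustration only, not Git's implementation; the names
build_cone and in_cone are hypothetical. The cost per path depends on
the path depth rather than on the number of rules:

```python
from pathlib import PurePosixPath


def build_cone(dirs):
    """Split cone directories into two hash sets (hypothetical layout).

    recursive: directories whose whole subtree is included.
    parents:   ancestors of those directories; only files directly
               inside them are included. "" stands for the root.
    """
    recursive = {d.strip("/") for d in dirs}
    parents = {""}
    for d in recursive:
        for ancestor in PurePosixPath(d).parents:
            text = str(ancestor)
            parents.add("" if text == "." else text)
    return recursive, parents


def in_cone(path, recursive, parents):
    """Return True if path falls inside the cone.

    Only the path's own ancestors are probed, so the work is bounded
    by the directory depth, not by how many directories the cone has.
    """
    parent = str(PurePosixPath(path).parent)
    parent = "" if parent == "." else parent
    if parent in parents:
        return True
    d = parent
    while d:
        if d in recursive:
            return True
        d = d.rpartition("/")[0]
    return False
```

With a cone of just "src/lib", this includes "README", "src/Makefile"
and everything below "src/lib", and excludes "docs/a.txt" and
"src/other/z.c".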
3. Elijah: interested in “sparse clones”, i.e. clones that download
everything related to a specified cone
1. Would be nice not having to download extra objects when already having
specified a cone of interest
2. Jeff: the original partial clone had code to restrict to a cone
3. Peff: we still have the code, but turned it off, you can have bitmaps
with that (too heavy on the server)
4. Stolee: also, how can the cone be updated if things change? Never
solved that problem
5. Stolee: but the extra blob downloads turned out not to be too big of a
problem
6. Stolee: got a feature request to restrict git log to the current cone,
git grep already does that (thanks Matheus)
7. Elijah: “git grep” without revision arguments is restricted to
worktree, so it respects the sparse checkout. When you pass a
revision, though, it searches the whole tree
8. Many commands want to examine the whole tree, makes sense to figure
out the UX (configuration, etc) of them together
9. Peff: Is diff code on someone’s radar?
10. Stolee: I’d view that as part of the same story as “git log”, “git log
-p”.
11. Sparse index means we can avoid faulting in trees outside of HEAD, so
it helps unlock this
4. Sparse index: Victoria and Lessley are taking the lead on increasing
the number of commands supporting sparse index
1. update-index, diff, blame, clean, stash, sparse-checkout itself are so
far supported only in the Microsoft fork of Git
2. Enabled by default internally so helps us gather data
3. Elijah: awesome that you’re working on this, sorry I haven’t been as
responsive as I’d like on reviews
4. I’m interested in “clean” in particular --- isn’t that about untracked
files?
5. Stolee: It uses the index to find what is tracked, want to avoid
expanding the in-memory index. If there are files outside the sparse
checkout area then it does expand.
5. jrnieder: question about failure modes
1. When I convert a command, I make sure my code path doesn’t assume the
cache array contains all entries. Then I turn off
command_requires_full_index. What happens if I missed a spot?
2. Stolee: I put ensure_full_index() in front of everything that assumes a
full index, but if there’s a loop that we missed, there’s no extra
protection.
3. Example: cache-tree was calling itself, invalidating pointers, and
segfaulted.
4. More worrying failure mode would be if commands proceed with bad data.
Segfaulting is the good case!
5. jrnieder is not too worried since we’re pretty far along and soon
enough we’ll have converted all commands and these questions would be
moot
1. Stolee: goal isn’t to get 100% coverage, so the point at which these
questions become moot isn’t coming soon
2. jrnieder: Thanks! Okay, I’ll take a look.
6. http://sweng.the-davies.net/Home/rustys-api-design-manifesto
7. Stolee is less worried because we have sufficient ensure_full_index
calls.
6. One optimization we’re considering: not expanding the full index when
anything outside the cone is needed (we’d like to maybe expand just the
part that needs expanding)
1. Elijah: we would still keep cone mode, but it’s a bit weird because the
cone mode does not match what we have in the index
2. Stolee: we might actually not need this
7. Stolee: in the process of this work, found D/F conflict issue, made a test
illustrating it
8. Elijah: atomicity
1. checkout is a non-atomic operation. ^C makes a mess
2. “git sparse-checkout disable” is non-atomic. Takes a while, people ^C,
and the very last step is updating the sparsity files. Leaves the
worktree with a bunch of files they don’t need but commands ignore
them
3. We run into problems because then they can check out a different
branch, do a bunch of other work, then update the sparse-checkout and
it will see these precious files it doesn’t want to overwrite
4. Should “git status” show them?
5. Dscho: We could set a flag on disk when you’re about to disable, then
if we were interrupted print an error message to get the user to sort
things out
6. Peff: I was going to suggest something similar. FS doesn’t make
transactions easy, but we can at least do a rollback (signal handler),
not foolproof, but it works pretty well and covers your ^C case.
7. Stolee: coming in 2.34: sparse-checkout reapply will delete ignored
(and tracked?) files. Helps with these leftover files.
8. Elijah: no current way to get out of that state, thank you for making
sparse-checkout reapply do that
9. Stolee: noticed during experimental release to people from Office.
Everything was slow because they had run build and left behind ignored
files
10. jrnieder: Piggy-backing on Dscho’s comment, there’s a database
analogy: record intent (in the database case, that’s a transaction
journal) before the non-atomic steps that act on that intent. Suggests
maybe we should be updating the sparsity pattern before the checkout
step
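The "record intent first" idea from Dscho, Peff and jrnieder can be
sketched as follows. This is only an illustration under assumptions:
the marker file name and the functions begin, finish, pending_intent
and run_journaled are hypothetical, and nothing here reflects how Git
stores its state. The intent is written durably before the non-atomic
steps run and removed only after they all succeed, so a later command
can detect an interrupted operation:

```python
import json
import os
from pathlib import Path

MARKER = "sparse-op-in-progress"  # hypothetical file name


def begin(state_dir, intent):
    """Durably record what we are about to do, before doing it."""
    marker = Path(state_dir) / MARKER
    tmp = marker.with_suffix(".tmp")
    with open(tmp, "w") as f:
        json.dump(intent, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, marker)  # atomic rename into place


def finish(state_dir):
    """Clear the marker once every step has completed."""
    (Path(state_dir) / MARKER).unlink()


def pending_intent(state_dir):
    """Return the recorded intent of an interrupted run, or None."""
    marker = Path(state_dir) / MARKER
    if not marker.exists():
        return None
    return json.loads(marker.read_text())


def run_journaled(state_dir, intent, steps):
    """Run non-atomic steps; an interruption leaves the marker behind."""
    begin(state_dir, intent)
    for step in steps:
        step()  # may be interrupted, e.g. by KeyboardInterrupt on ^C
    finish(state_dir)
```

A later invocation could call pending_intent() first and either resume
the recorded operation or print an error asking the user to sort
things out.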
9. That’s it, that’s the status update on what’s currently on the list.
10. We have more plans, though.
11. Idea: use git.git itself
1. Tried it, but had to have 97% of the files for it to still be workable
2. Could change the Makefile to accept that, say, po/ is missing
3. Ævar: creates a lot of complexity for the build
4. jrnieder: as VCS provider, what is our recommendation to build authors?
Do we want them querying sparse checkout, do we want builds that Just
Work in cone mode, do we want to treat sparse checkout as a thing that
builds don’t need to support?
5. Stolee: want build system to be able to tell Git about what needs to be
checked out. “In-tree sparse checkout” (see below)
12. Emily: we’re interested in sparse-checkout affecting the set of active
submodules, just mentioning this as a heads-up
13. [PATCH 00/10] [RFC] In-tree sparse-checkout definitions - Derrick Stolee
via GitGitGadget
(https://lore.kernel.org/git/pull.627.git.1588857462.gitgitgadget@gmail.com/)
14. Victoria: today when you switch gears and work on something else you have
to update the sparse checkout pattern
15. Proposal here is to have in-tree sparse checkout definitions, e.g. a
.gitdependencies file that lists, for the directories you’re working with,
what other subdirectories they depend on
16. That way, you get exactly the folders you need
17. Stolee: Office has their own tool “scoper” that figures out dependencies
and runs “git sparse-checkout set” for the user. It is confusing when you
rebase and need to remember to run it
18. Currently lives in a hook, custom and built for one engineering system,
want to generalize and make a standard feature
19. Victoria: being built in to Git would make sense because it’s general
enough to work in most monorepo environments.
20. Involves two pieces: having git understand the dependencies and assemble
your sparse checkout cone using them, and having the build system maintain
and use sparse checkout correctly.
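To make the first of those two pieces concrete, here is a rough sketch
of assembling a cone from declared dependencies. It assumes the
dependency information has already been parsed into a mapping from
directory to the directories it depends on; the on-disk format of a
.gitdependencies file is not specified here, and resolve_cone is a
hypothetical name, not part of Git:

```python
from collections import deque


def resolve_cone(wanted, deps):
    """Return every directory needed for the wanted directories.

    wanted: directories the user is working in.
    deps:   mapping of directory -> directories it depends on
            (assumed to come from in-tree dependency definitions).
    Walks the dependency graph breadth-first; the seen set makes
    cycles harmless.
    """
    seen = set()
    queue = deque(wanted)
    while queue:
        d = queue.popleft()
        if d in seen:
            continue
        seen.add(d)
        queue.extend(deps.get(d, ()))
    return sorted(seen)
```

The resulting list is the kind of input a wrapper tool could pass to
“git sparse-checkout set”, and it would need recomputing whenever the
in-tree definitions change, e.g. after a rebase.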
21. Some build setups tolerate missing directories reasonably well. If we make
.gitdependencies more of a first-class concept then we could go further
and make build systems handle missing directories as something that would
be expected
22. C# .proj files link to dependencies on other .proj files with relative
path. But in a solution file collecting all .proj files, it lists all of
them and you need to have them all present. If a subdirectory isn’t
present, proposal is to build what is there instead of everything.
23. Tried another prototype on how to do this in Bazel. It has a rigorous
definition of inputs and outputs, and based on that you could translate to
a .gitdependencies file or sparse-checkout pattern.
24. Microsoft’s buildxl has similar properties
25. Victoria asks: how general is the above?
26. brian: Many monorepos have multiple microservices. A cone can represent
what a particular service needs to run.
27. If you’re building one coherent product like Windows, you’re going to need
some prebuilt artifacts that you pull down.
28. jrnieder: Large monorepos often have strong remote build. Not everything
you depend on needs to be present in source form locally
29. CB: My team at Bloomberg has a teamwide “monorepo” (not Bloomberg-wide).
We’re cmake based. Sparse checkout would be interesting for us. We’re
experimenting with what’s called workspace builds: you have a thing you
can build (a subdirectory), that you pull into the toplevel CMakeLists.txt
as a single thing.
30. With cmake you can declare a dependency with target_link_libraries. A
dependency name can either be a cmake defined target in the codebase
you’re building, or it can be a pre-built library pulled in another
way, e.g. importing via a pkg-config file.
31. At build time if I decide I want to change that library, I’ll expand my
sparse-checkout region, and rerun cmake to have it understand the newly
available source.
32. Optionality: I don’t have to have that source checked out, but when it’s
present I want to use it.
33. Victoria: sounds like in-tree sparse checkout is more of an intermediate
step. Sometimes you want the source, sometimes you want to pull in an
external artifact.
34. Elijah: we have a monorepo, about the size of the Linux kernel. Multiple
separate services, interconnected pieces. Using sparse-checkout required
some code changes, refactoring that wasn’t just around the build system.
We created a tool before the sparse-checkout command existed, using older
mechanisms, and then switched to sparse-checkout when it came out. We
track our dependencies ourselves --- you need this set of modules (3 or 4)
or the modules relevant to a particular team, and it then computes the
relevant directories to get. We had to make some changes to adopt cone
mode but I like it and the changes it led to. Then you run the build
system --- you have files that declare the dependencies, are they newer
than .git/info/sparse-checkout? If not then recompute them again.
35. Potentially would want to rerun the dependency generation after you run a
rebase as well…
36. If we track it in-tree, there are some interesting cases we’ll run into
(merge conflicts on this generated file).
37. Also, tracking dependencies in two places can result in difficulty, skew.
Maybe can generate one from the other.
38. Our sparse checkout tends to be build oriented “what do I need for this
build”. But testing inverts the dependency graph, want to see what tests
depend on this code. We encourage them to test in the cloud but not
everyone does that, leads fewer people to use sparse checkout.
39. There’s some remote build, mixing-and-matching pieces built remotely and
locally.
40. Part of working in a monorepo is you need strong tool hygiene enforcement.
Without that, you get a ball of mud of dependencies. Adopting sparse
checkout drove modularity.
41. Ævar: I’d be interested in a summary
42. Git’s lack of support for sparse checkout was unusual, so I think this
topic is well explored by previous version control systems