git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Elijah Newren <newren@gmail.com>
To: Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>
Cc: Git Mailing List <git@vger.kernel.org>,
	newren@gmaill.com, Jeff King <peff@peff.net>,
	Taylor Blau <me@ttaylorr.com>,
	Jonathan Nieder <jrnieder@gmail.com>,
	Derrick Stolee <dstolee@microsoft.com>
Subject: Re: [PATCH 00/10] [RFC] In-tree sparse-checkout definitions
Date: Wed, 20 May 2020 10:38:30 -0700	[thread overview]
Message-ID: <CABPp-BEz+XE48_vmEa4jmH53YSHUoj5G-OY+UT1mv92irQVEkQ@mail.gmail.com> (raw)
In-Reply-To: <pull.627.git.1588857462.gitgitgadget@gmail.com>

On Thu, May 7, 2020 at 6:20 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> This is based on ds/sparse-allow-empty-working-tree.
>
> BACKGROUND
> ==========
>
> After discussing the future of sparse-checkout at the Git Contributor
> Summit, Minh Tai caught me during the break to introduce a very interesting
> idea: in-tree sparse-checkout definitions.
>
> Here is the context: I was talking about how the Office team is building a
> tool to update the sparse-checkout definition dynamically as their build
> definitions change. They have a very compartmentalized build system, and
> certain "projects" depend on each other. To ensure that builds do not break
> as dependencies change, the plan was to have this tool run as a hook when
> HEAD changes. This is a bit expensive to do, and we were discussing how much
> we need this to be automated versus a part of the build process.
>
> This kind of idea would need to be replicated for every team that wants to
> use sparse-checkout in this way. That doesn't bode well for wide adoption of
> the feature.
>
> IN-TREE SPARSE-CHECKOUT DEFINITIONS
> ===================================
>
> Minh's idea was simple: have sparse-checkout files in the working directory
> and use config to point to them. As these in-tree files update, we can
> automatically update the sparse-checkout definition accordingly. Now, the
> only thing to do would be to ensure that the sparse-checkout files are
> updated when someone updates the build definitions. This requires some extra
> build validation, but would not require special tools built on every client.

I'm not sure how well having a parallel dependency structure will be
received, but the idea of auto-updating does seem really nice.  (I put
a hook in the build system that compares the timestamps on the files
that define dependency structures to the .git/info/sparse-checkout
file, and re-run `git sparse-checkout reapply`, which helps with
updating but is still delayed from the checkout/merge/rebase that
brought in new dependencies.)

> As soon as I sat down to implement this idea, I discovered some
> complications to that basic idea. Each of these sparse-checkout definitions
> should fit a "role" on the team. To continue with the Office analogy,
> suppose we had a definition for developers working on each of Word,
> PowerPoint, and Excel. What happens when someone wants to port a feature
> from Excel to Word? That user would want to take the union of the two
> sparse-checkout definitions. But what does that mean when we are working
> with arbitrary sparse-checkout files?
>
> This leads to the restriction that I built into this feature: we only care
> about "cone mode" definitions. These rely on directory-based prefix matches
> that are very fast. With this restriction, it is simple to understand what
> the "union" operation should do: take all directories that would be included
> by either.

Seems like you're adopting my idea of pushing folks towards "cone
mode" and limiting support for non-code modes[1], eh?  ;-)

I'm totally on board.  :-)

[1] https://lore.kernel.org/git/CABPp-BHKYR9QBcAG_pM6DniaeGS40X=ErKGGsvWa_KhogCZzEA@mail.gmail.com/

> Once we restrict to cone mode, there is no reason to continue storing files
> using the sparse-checkout file format. The list of patterns is larger than
> the input to "git sparse-checkout set" or output from "git sparse-checkout
> list" in cone mode, and more prone to user error. Thus, let's be simpler and
> use a list of directories to specify the sparse-checkout definition.

Ahah, you did come around to using a new config file[2].  ;-)

I like having simpler config files; sounds great.

[2] https://lore.kernel.org/git/c919513a-f41f-2ce8-60ed-e0b0733c0c7f@gmail.com/
 ("another config file")

> This leads to the second issue that complicates things. If a build
> dependency changes to a core library, but there are N "roles" that depend on
> that library, then the person changing that library also needs to change and
> validate N sparse-checkout definitions! This does not scale well. So, let's
> take a page from build dependencies and allow these in-tree sparse-checkout
> definitions to depend on each other!
>
> The end result is a mechanism that should be flexible enough to future needs
> but simple enough to build using existing Git features. The format re-uses
> the config file format. This provides good re-use of code as well as being
> something easy to extend in the future. In Patch 4, I tried to describe
> several alternatives and why this is the best of those options. I'm open to
> suggestions here.
>
> OUTLINE
> =======
>
>  * Patch 1 is a bugfix that should work on its own. I caught it in tests
>    after Patch 5.
>
>
>  * Patches 2-7 implement the feature.
>
>
>  * Patches 8-9 make the Git build system a bit more robust to missing
>    directories.
>
>
>  * Patch 10 adds a .sparse directory to Git as well as two in-tree
>    sparse-checkout definitions. One is for a bare-bones Linux developer and
>    the other adds the compat code for Windows on top.
>
>
>
> As mentioned in patch 10, this allows the following workflow for Git
> contributors that want to "eat our own dogfood" with the partial clone and
> sparse-checkout features:
>
>  $ git clone --sparse --filter=blob:none https://github.com/git/git
> $ cd git
> $ git sparse-checkout set --in-tree .sparse/base.deps
>
> Then, we have a smaller working directory but can still build and test the
> code.
>
> FUTURE DIRECTIONS
> =================
>
> There are definitely ways to make this feature more seamless, or more useful
> to us.
>
>  * It would be nice to extend git clone to have a string value to
>    --sparse=<file> and possibly have it multi-valued. This would initialize
>    the in-tree sparse-checkout definition immediately upon cloning.

Also in terms of future directions, I would like to have the option
for these to be passed to --filter instead of using --filter:none.

And on top of that, it may also make sense for users to be able to
fetch cones of history somehow.  For example, if they started with
`git clone --sparse=featureA ...` and later wanted to add the
'featureB' cone, it'd be nice to be able to run some kind of `git
fetch --sparse=featureB ...` command when they know they are on a good
network to download the history of objects that touched the featureB
cone instead of relying on the normal lazy fetching of partial clones.

This kind of thing could be especially useful if, say, we were in a
pandemic and everyone was forced to work from home.  But it'd also be
useful if you just had people travelling a lot and stuck in third
world places with low connectivity, like CalTrain.  ;-)

>  * I think there are some ways we could reorganize the Git codebase to make
>    our "most basic" sparse-checkout file have even fewer files. Mostly, it
>    would require modifying the build to ask "do these files exist on-disk?
>    if not, then disable this build option." This would also require adding
>    test dependencies that also disable some tests when those do not exist.
>
>
>  * Currently, if we have an in-tree sparse-checkout definition, then we
>    can't add more directories to it manually. If someone runs git
>    sparse-checkout set --in-tree <file> followed by git sparse-checkout add
>    <dir>, then will disappear as it is overridden by the in-tree definition.
>    The only way I can think to get around this would be to add these
>    directories through Git config, since we were using the sparse-checkout
>    file as the storage of "what directories do I care about?" but it is no
>    longer the source of truth in this situation. I'm open to ideas here, and
>    it would be nice to have this interaction hardened before moving the
>    series out of RFC status.

This seems like a big problem to me.  Actually, it sounds like two big problems.

1) non-extensibility

I suspect most will be happy enough with the in-tree definitions, but
there are always a few users that want to add to the pre-defined
sparsity sets to get some additional directories.  We could use
.gitignore as an example; although people have shared ignore rules in
.gitignore, we still allow user-defined additional ignore rules that
are not meant to be shared.  Perhaps we can allow users to use some
pre-defined additional file (similar to .git/info/exclude) or even
user-defined additional file (similar to core.excludesFile)?

We would still need to be careful to not allow the in-tree definitions
to depend on files that are not in-tree, of course.

(Even with my `sparsify` script which allowed things like `./sparsify
--modules moduleA moduleB` and which walked build system dependencies
to get the right directories, I still allowed additional directories
to be specified within a .git/info/sparse-checkout-user file.  Not
sure I liked the location of the file (it's sometimes hard to find for
users of worktrees), or the basename of the file I randomly chose that
day, but it came in handy for a few people.)

2) Ugly, overly verbose (and even error-prone?) UI.

If `git sparse-checkout set --in-tree <file>` means that any
subsequent checkout will blow apart any user-ran `git sparse-checkout
add <dir>`, then we really need to give at least warning and maybe
even error messages if people are using in-tree mode and use either
set or add.  Also, this feels like we're just about deprecating the
usage of set and add without the --in-tree flag, which could lead to
years of questions of "Why do I have to specify --in-tree every time I
run these commands?"  Can we avoid that somehow?

Maybe we just take advantage of the huge backward compatibility
warning we put in place and require non-in-tree mode to specify an
additional flag?  Maybe we just check the argument to set or add when
in cone mode, and if it's a file (instead of a directory) then we
automatically assume --in-tree (or attempt to parse it first and throw
an error if it isn't in the --in-tree format)?  Maybe we pick a
different verb than add or set ('use') and guide people to it instead
of `add` and `set`?

Also, as a side note, can people specify multiple in-tree files at
once (`git sparse-checkout set --in-tree <file1> <file2>`)?

  parent reply	other threads:[~2020-05-20 17:38 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-07 13:17 [PATCH 00/10] [RFC] In-tree sparse-checkout definitions Derrick Stolee via GitGitGadget
2020-05-07 13:17 ` [PATCH 01/10] unpack-trees: avoid array out-of-bounds error Derrick Stolee via GitGitGadget
2020-05-07 22:27   ` Junio C Hamano
2020-05-08 12:19     ` Derrick Stolee
2020-05-08 15:09       ` Junio C Hamano
2020-05-20 16:32     ` Elijah Newren
2020-05-07 13:17 ` [PATCH 02/10] sparse-checkout: move code from builtin Derrick Stolee via GitGitGadget
2020-05-07 13:17 ` [PATCH 03/10] sparse-checkout: move code from unpack-trees.c Derrick Stolee via GitGitGadget
2020-05-07 13:17 ` [PATCH 04/10] sparse-checkout: allow in-tree definitions Derrick Stolee via GitGitGadget
2020-05-07 22:58   ` Junio C Hamano
2020-05-08 15:40     ` Derrick Stolee
2020-05-20 17:52       ` Elijah Newren
2020-06-17 23:07         ` Elijah Newren
2020-06-18  8:18           ` Son Luong Ngoc
2020-05-07 13:17 ` [PATCH 05/10] sparse-checkout: automatically update in-tree definition Derrick Stolee via GitGitGadget
2020-05-20 16:28   ` Elijah Newren
2020-05-07 13:17 ` [PATCH 06/10] sparse-checkout: use oidset to prevent repeat blobs Derrick Stolee via GitGitGadget
2020-05-20 16:40   ` Elijah Newren
2020-05-21  3:49     ` Elijah Newren
2020-05-21 17:54       ` Derrick Stolee
2020-05-07 13:17 ` [PATCH 07/10] sparse-checkout: define in-tree dependencies Derrick Stolee via GitGitGadget
2020-05-20 18:10   ` Elijah Newren
2020-05-30 17:26     ` Elijah Newren
2020-05-07 13:17 ` [PATCH 08/10] Makefile: skip git-gui if dir is missing Derrick Stolee via GitGitGadget
2020-05-07 13:17 ` [PATCH 09/10] Makefile: disable GETTEXT when 'po' " Derrick Stolee via GitGitGadget
2020-05-07 13:17 ` [PATCH 10/10] .sparse: add in-tree sparse-checkout for Git Derrick Stolee via GitGitGadget
2020-05-20 17:38 ` Elijah Newren [this message]
2020-06-17 23:14 ` [PATCH 00/10] [RFC] In-tree sparse-checkout definitions Elijah Newren
2020-06-18  1:42   ` Derrick Stolee
2020-06-18  1:59     ` Elijah Newren
2020-06-18  3:01       ` Derrick Stolee
2020-06-18  5:03         ` Elijah Newren

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CABPp-BEz+XE48_vmEa4jmH53YSHUoj5G-OY+UT1mv92irQVEkQ@mail.gmail.com \
    --to=newren@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=jrnieder@gmail.com \
    --cc=me@ttaylorr.com \
    --cc=newren@gmaill.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).