From: Derrick Stolee <stolee@gmail.com>
To: Shaun Case <warmsocks@gmail.com>, Elijah Newren <newren@gmail.com>
Cc: Git Mailing List <git@vger.kernel.org>
Subject: Re: Possible regression for sparse-checkout in git 2.27.0
Date: Thu, 4 Jun 2020 20:32:42 -0400 [thread overview]
Message-ID: <0aab7056-7176-01bc-ca7b-01356cbace4c@gmail.com> (raw)
In-Reply-To: <CAD3+_6AK390F5iVqCmP-FY8MxJtcUoS7HsbunhZ0qNzOdT53CQ@mail.gmail.com>
On 6/4/2020 4:50 PM, Shaun Case wrote:
> Hi Elijah, Derrick,
>
> On Thu, Jun 4, 2020 at 12:27 AM Elijah Newren <newren@gmail.com> wrote:
>>
>> Hi,
>>
>> On Wed, Jun 3, 2020 at 8:36 AM Derrick Stolee <stolee@gmail.com> wrote:
>>>
>>> On 6/3/2020 12:37 AM, Elijah Newren wrote:
>>>> I think it'd be more natural to run
>>>
>>>> git clone --filter=blob:none --sparse
>>>> https://github.com/r-spacex/launch-timeline.git
>>>
>>>> in place of the combination of
>>>
>>>> git clone --filter=blob:none --no-checkout
>>>> https://github.com/r-spacex/launch-timeline.git
>>>> git sparse-checkout init --cone
>>>
>>>> since the --sparse flag was added just for this kind of case -- to do
>>>> a clone but start with only a few things checked out. It's easier, is
>>>> the route we're moving towards, and as a bonus also happens to work.
>>>
>>> Just one warning: the --sparse option in "git clone" does not currently
>>> enable core.sparseCheckoutCone, so running "git sparse-checkout init --cone"
>>> afterwards is a good idea, or else your "git sparse-checkout (set|add)"
>>> commands will not behave the way you expect.
>
> Ok, great. My production script now does this:
>
> git clone --filter=blob:none --sparse <URL>
> git sparse-checkout init --cone
> git sparse-checkout add <PATH1>
> git sparse-checkout add <PATH2>
> [ ... ]
>
> ... and everything works as expected in 2.26.2 and 2.27.0. Perfection!
>
> See below for why I was doing it the other way.
>
>>> (I think that I will propose a change in behavior to make it do so during
>>> this release cycle.)
>
> That sounds good to me. Could you also consider adding an error
> message for git sparse-checkout init, if something unexpected follows? My
> rationale is that my script initially contained this:
>
> git sparse-checkout init cone
>
> Note the missing "--" in front of cone. This failed silently, so I
> thought I was in
> cone mode, but wasn't. An error message would have helped. Your proposal,
> having it happen automatically, would too.
Yes, this is probably a bug that's my fault. The option parser should
be more strict in the "init" case.
>>>> A bit of a side note, or a few of them, but this command of yours is broken:
>>>> git sparse-checkout set README.md
>>>> because --cone mode means you are specifying *directories* that should
>>>> be checked out.
>
> Thank you for pointing this out. This is a surprise. I'm migrating some large
> Subversion repos and workspace setup scripts to git (thus the interest
> in partial
> cloning and sparse checkouts), and svn has the ability to sparse checkout
> individual files as well as directories, so I expected it to work the
> same way in git.
>
> Is this limitation a design decision, a technical limitation, a
> planned feature? It
> seems to be fairly popular in svn, and there is even more interest
> for it in git:
>
> https://stackoverflow.com/questions/122107/checkout-one-file-from-subversion
> https://stackoverflow.com/questions/2466735/how-to-sparsely-checkout-only-one-single-file-from-a-git-repository
There are a couple reasons why this is tricky in Git.
The first is due to the .gitignore syntax. The syntax allows exact
matches for _directories_ using a trailing slash "/". For example,
we can match everything in A/B/C with the pattern
A/B/C/
This would match the files in A/B/C/ and its subdirectories, but will
not match a file A/B/C.txt or A/B/C1/. There is no equivalent matching
for files, so A/B/C _will_ match a file A/B/C and A/B/C.txt. Whether this
matters to you or not depends on your file structure.
The second is how the pattern-matching works for cone mode. Everything
is based on prefix matches of directories, but also that if a directory
is included, then all of its parent directories are included, but not
recursively. All files contained directly in one of the parent directories
are included automatically. Thus, if A/B/C/ is recursively added, then
A/B/X.txt, A/Y.txt, and Z.txt all match in cone mode. This allows us some
significant performance gains by allowing us to stop walking trees when
we reach a directory that isn't one of these "parent" directories. We
know that A/D/ is not in our list of "parents" so we don't need to look
in there.
What does this have to do with file matches? If we include A/B/C.txt as
a file, then we will need to match all sibling files in A/B/! This is
very similar to matching A/B/* or even A/B/C/*.
This idea was brought up and debated shortly after 2.25.0 released [1]
[1] https://lore.kernel.org/git/CADSBhNbbO=aq-Oo2MpzDMN2VAX4m6f9Jb-eCtVVX1NfWKE9zJw@mail.gmail.com/
>>>> Currently, this gives no error, it instead silently
>>>> drops you back to non-cone mode, which seems bad to me.
>>>> sparse-checkout should provide some kind of error -- or at very least
>>>> a warning -- when you make that mistake.
>
> Agreed.
>
> I'm going to leave the following here for context, but please scroll down.
>
>>>> Now let's talk about the commit in question that changed behavior
>>>> here. The point of sparse-checkout is never to switch branches or
>>>> checkout new commits; all it does is update which paths are in the
>>>> current working directory. A related point to this is it should never
>>>> add or remove entries from the index and shouldn't change any hashes
>>>> of files in the index. It used to violate this, at first via an
>>>> implementation that was literally invoking `git read-tree -mu HEAD` in
>>>> a subprocess, and then later using internal code equivalent to
>>>> invoking that command in a subprocess. But by violating the
>>>> leave-index-entries-alone mandate, it left folks who were in the
>>>> middle of a rebase and wanted to update their sparse-checkout to
>>>> include some more directories in their working tree in a precarious
>>>> spot -- if they didn't update, then they didn't have the files
>>>> necessary to build, and if they did forcibly update via `git read-tree
>>>> -mu HEAD` then their staged changes would all get wiped out. I spent
>>>> some quality time helping users recover their files and teaching them
>>>> about the git storage model.
>>>>
>>>> So that brings us back to your original question. When you said
>>>> --no-checkout, it means that there is no commit checked out and the
>>>> index is empty. update_sparsity() is correctly toggling the
>>>> SKIP_WORKTREE bits for the existing index entries that don't match the
>>>> sparsity patterns, and it is correctly calling check_updates().
>>>> check_updates() is correctly checking for files currently in the index
>>>> which have toggled to being needed in the current worktree so that it
>>>> can issue downloads related to promisor packs. The problem is just
>>>> that there aren't any index entries to begin with, so there are no
>>>> SKIP_WORKTREE bits to update, and thus no files that need to be
>>>> downloaded.
>>>>
>>>> It seems a bit risky to make sparse-checkout start doing
>>>> checkout/switch behavior and adding entries to the index. There's a
>>>> couple ways forward. One, we could decide this is a special edge or
>>>> corner case where we allow it: if the index is completely empty, then
>>>> there's no data to lose and thus we could make `git sparse-checkout
>>>> init [--cone]` in that one case use the old 'read-tree -mu HEAD'
>>>> logic. Alternatively, we could just require users to run 'git reset
>>>> --hard' at the end of your script.
>>>>
>>>> Stolee: Thoughts?
>>>
>>> I agree that using "--sparse" instead of "--no-checkout" is the
>>> best way forward for now, but I'll classify that as a workaround
>>> and not necessarily the end of the conversation.
>>
>> Agreed.
>
> Agree too. I also agree that Elijah's changes are necessary, and will save me
> a ton of headaches going forward. I don't want them backed out.
>
> So, it seems --sparse instead of --no-checkout is what I should have been doing
> from the beginning, I had to fiddle a bit to get what I had working,
> by following
> some online resources and experimenting, and what I ended up with working
> satisfactorily was --no-checkout, with 2.26.2.
>
> Why?
>
> Perhaps the messaging has inadvertently become a little muddled. I started
> by working on getting partial cloning working; when I searched, these are the
> resources I found and followed:
>
> https://about.gitlab.com/blog/2020/03/13/partial-clone-for-massive-repositories/
> https://github.blog/2020-01-13-highlights-from-git-2-25/
> https://stackoverflow.com/questions/4114887/is-it-possible-to-do-a-sparse-checkout-without-checking-out-the-whole-repository
> https://github.blog/2020-01-17-bring-your-monorepo-down-to-size-with-sparse-checkout/
> https://docs.gitlab.com/ce/topics/git/partial_clone.html
>
> The first four mention --no-checkout, but do not mention --sparse. The last one
> actually gives proper guidance, but by the time I came across it, I had the
> --no-checkout method working and I came away not even realizing --sparse
> exists. Perhaps I'm the only one who went down this path. If not, the
> --no-checkout case may be a bit more common than you might expect.
There's a good reason for this. In v2.25.0, the "--sparse" option was
present but broken. See [1] for the fix which was released in v2.26.0
as 47dbf10d8 (clone: fix --sparse option with URLs, 2020-01-24).
[1] https://lore.kernel.org/git/4991a51f6d5d840eaa3bb830e68f1530c2ee08e4.1579900782.git.gitgitgadget@gmail.com/
>>> In general, the commit in question is doing something extremely
>>> valuable for common situations, like rebase that you mention.
>>> I also think that this change in behavior is warranted by the
>>> clear warning placed at the top of the docs [1]:
>>>
>>> THIS COMMAND IS EXPERIMENTAL. ITS BEHAVIOR, AND THE
>>> BEHAVIOR OF OTHER COMMANDS IN THE PRESENCE OF SPARSE-
>>> CHECKOUTS, WILL LIKELY CHANGE IN THE FUTURE.
>>>
>>> [1] https://git-scm.com/docs/git-sparse-checkout#_description
>
> Got it. Consider your butts covered. :)
:D
Thanks,
-Stolee
next prev parent reply other threads:[~2020-06-05 0:32 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2020-06-03 1:08 Possible regression for sparse-checkout in git 2.27.0 Shaun Case
2020-06-03 4:37 ` Elijah Newren
2020-06-03 15:36 ` Derrick Stolee
2020-06-04 7:27 ` Elijah Newren
2020-06-04 20:50 ` Shaun Case
2020-06-05 0:32 ` Derrick Stolee [this message]
2020-06-05 0:56 ` Junio C Hamano
2020-06-05 13:07 ` Derrick Stolee
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=0aab7056-7176-01bc-ca7b-01356cbace4c@gmail.com \
--to=stolee@gmail.com \
--cc=git@vger.kernel.org \
--cc=newren@gmail.com \
--cc=warmsocks@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).