git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Finn Bryant <finnbryant@gmail.com>
To: Derrick Stolee <stolee@gmail.com>
Cc: git@vger.kernel.org, Elijah Newren <newren@gmail.com>
Subject: Re: Sparse checkout inconsistency with non-folder paths between cone mode and full matching (2.5.0)
Date: Fri, 31 Jan 2020 15:13:23 +0000	[thread overview]
Message-ID: <CADSBhNZ7g_eqUSRczGJPkoiG57OL4DXray3_kTTtj-DF4=aM1g@mail.gmail.com> (raw)
In-Reply-To: <ef39f8c5-ce0b-a48b-940d-821df563b292@gmail.com>

(resent, I forgot to reply all)

On Thu, 30 Jan 2020 at 18:52, Derrick Stolee <stolee@gmail.com> wrote:
>
> On 1/30/2020 10:25 AM, Finn Bryant wrote:
> > Hi,
>
> Hello! Thanks for chiming in with feedback.
>
> > With cone mode enabled for a sparse checkout, a pattern like the
> > following is accepted:
> >
> > /*
> > !/*/
> > /a_file_or_folder/
>
> As you mention below, you say this path is a file OR a directory.
> However, the trailing slash means _only_ match directories with this
> name. This is a specific choice made in creating cone mode to restrict
> only by directories.

Right, I thought it was supposed to work that way, but the bug you've
patched left me uncertain.

> > If matching behaviour with full pattern mode is a non-goal for cone
> > mode, I'd still question the logic of this behaviour, though I suppose
> > it does have the benefit of (accidentally?) adding support for
> > excluding individual files from a sparse checkout, which I imagine
> > some could find useful. Personally I'd prefer that was instead added
> > with a more sane syntax, if needed.
>
> Cone mode is specifically designed for performance using a hashset-based
> pattern matching. This naturally restricts the possible patterns since
> we cannot match the full pattern set [1] this quickly. This means there
> are some trade-offs that are required, but they were made with some
> information of real-world repositories organized in a way that they
> could take advantage of this pattern set.

Absolutely, that makes sense to me, and I think this fits the model
I'm working with well, or will once your patch filters through.

> > $ git sparse-checkout init --cone
> > $ git read-tree -mu HEAD
> > $ ls -1
> > a_file_or_folder
> > some_file
> > $ git sparse-checkout set a_file_or_folder
> > $ git read-tree -mu HEAD
> > $ ls -1
> > some_file
> > $ cat .git/info/sparse-checkout
> > /*
> > !/*/
> > /a_file_or_folder/
>
> This is an interesting test, because I would expect the /a_file_or_folder/
> pattern to not cause the _file_ to not match. It does match the first two
> patterns, and just because it doesn't match the third pattern shouldn't
> remove it.
>
> This is actually a bug in the hashset-based pattern-matching algorithm,
> since setting core.sparseCheckoutCone=false does not have this behavior.
> I'll make a patch to fix this.

I was hoping you'd say that!

> > Right now I'm trying to integrate cone mode with my company's existing
> > manifest infrastructure, which doesn't differentiate between files and
> > folders, so this is forcing me to perform a lot of repo checks to
> > confirm we aren't accidentally excluding files we were supposed to
> > include. Just in case you needed another example of why this behaviour
> > is unhelpful.
>
> I apologize that this wrong behavior is causing you some issues. If you
> are able to identify which paths are files instead of directories, then
> you can remove the filename from the path to include its parent directory.
> This is only a workaround until the bug is fixed, but the end-result will
> be the same: you'll have all sibling files in your working directory as
> well as the files you specified.

No need for apologies, I just wanted to convey why I thought the bug
was worth attention.
I realise this is a brand new feature and appreciate the impressively
fast feedback!
I am loving the cone mode improvements by the way, I've been living
with the poor performance from full matching for a long time now, and
this is a huge improvement.

> This is important: if you have a directory with many large files, but
> intend to include only a subset of those files, then you will be
> disappointed by the size of your working directory.

Understood, this thankfully isn't a huge issue for my use-case, our
folder structures are mostly not too shallow, so there's only a few
cases where the lack of file matching hurts us, and the benefits far
outweigh the costs.

> This requirement of specifying directories instead of files is part
> of the cone mode design for two reasons: one philosophical and another
> more practical:
>
> Philisophical: Filenames change more often than directories.
>
> That is, as users are interacting with your repo, the overall directory
> structure at a high level is typically static. If new cross-component
> dependencies are introduced, then those are usually big architecture-level
> changes. However, at an individual file level dependencies change at
> a higher rate, causing users to react by changing their sparse-checkout
> definition more frequently.

It's not too much of a concern for me, but the few cases where it has
an impact for me tend to follow the pattern of:
project A needs file X, but X lives in a folder with 1000 other files
that aren't needed by project A.

> Practical: How do we specify if an input is a directory or a file?
>
> As you mentioned, if you run "git sparse-checkout set X" then you get
> a sparse-checkout file containing:
>
>         /*
>         !/*/
>         /X/
>
> But if X is actually a _file_, then we should write this pattern:
>
>         /*
>         !/*/
>         /X
>
> So the input does not provide enough information to say which pattern
> should be written.
>
> The other concern about this second set of patterns is that "/X" is
> only a _prefix_ match, not an exact path name match. That means we
> cannot use the existing hashset matching to result in the same
> pattern matching as non-cone-mode.

This is an interesting point, currently (at least in 2.25) both `/X/`
and `/X` formats are allowed, and appear to be treated the same way.

Given `/X` is only a prefix match, perhaps it should be invalid for
cone mode, and force the full matching fallback?

This wouldn't have any effect on the matching capabilities of cone
mode, I think, but it would reduce confusion and ensure cone mode and
full matching have as close to the same results as possible.
Additionally, by preventing it now, you can effectively reserve that
syntax in case it proves useful for later extensions to cone mode.

> Now, there is perhaps a way to decide between the two: inspect the
> current index or tree for whether the input corresponds to a directory
> or a file. This adds a layer of complexity that is not currently in
> the sparse-checkout builtin, but it is not insurmountable. The only
> thing to consider is what happens when the input path is not in the
> index/tree at all! It may be that we want to specify the patterns
> before moving HEAD to a commit that _does_ contain the path, and then
> we would do the wrong thing at this point.

Agreed, this is a nasty problem.

> So I'll put out a question to the community: is this practical issue
> something worth overcoming? Is my concern about a non-existent path
> a true problem, or something more?

Personally I think it's a reasonable compromise. Many will benefit
with only folder matching support, and the cases where file matching
is needed (or at least a significant benefit) are going to be quite a
bit less common.

> Thanks,
> -Stolee

Thank you for the quick response and the patch!

Finn

  parent reply	other threads:[~2020-01-31 15:13 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-01-30 15:25 Sparse checkout inconsistency with non-folder paths between cone mode and full matching (2.5.0) Finn Bryant
2020-01-30 18:52 ` Derrick Stolee
2020-01-30 19:22   ` Derrick Stolee
2020-01-31 15:13   ` Finn Bryant [this message]
2020-01-31 19:59   ` Elijah Newren

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CADSBhNZ7g_eqUSRczGJPkoiG57OL4DXray3_kTTtj-DF4=aM1g@mail.gmail.com' \
    --to=finnbryant@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=newren@gmail.com \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).