git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* Sparse checkout inconsistency with non-folder paths between cone mode and full matching (2.5.0)
@ 2020-01-30 15:25 Finn Bryant
  2020-01-30 18:52 ` Derrick Stolee
  0 siblings, 1 reply; 5+ messages in thread
From: Finn Bryant @ 2020-01-30 15:25 UTC (permalink / raw)
  To: git

Hi,

With cone mode enabled for a sparse checkout, a pattern like the
following is accepted:

/*
!/*/
/a_file_or_folder/

As the name suggests, a_file_or_folder might be a file, or might be a
directory. If it's a directory, then this is a "valid" recursive match
for the directory and everything works as expected.
But if the path leads to an ordinary file, it appears that cone mode
will *exclude* the file from the worktree (ie, it'll have the
skip-worktree flag set).
This seems bizarre and unexpected behaviour.

My suspicion is that cone mode is supposed to be a pure subset of full
pattern matching, such that if cone mode is ever disabled, the
sparseness of the worktree will be unchanged. Clearly, this scenario
is breaking that pattern.

I think the correct behaviour is that recursive matches for a
potential directory do not have any effect on a non-directory file
with the same name. Alternatively, you could forbid any patterns which
match non-directory files instead (and downgrade to full pattern
matching), though I'd not be a fan of this, since it'd mean the
validity of a cone-mode sparse configuration file is dependent on the
contents of the repo, and is thus much harder to ascertain (scripts
can't simply prove if it's a valid cone mode file by parsing it,
they'd need to have access to any repo it may be applied to, and
inspect the type of any matching file/folder paths, and its validity
could be changed simply by replacing a folder with a file in the
repo).

If matching behaviour with full pattern mode is a non-goal for cone
mode, I'd still question the logic of this behaviour, though I suppose
it does have the benefit of (accidentally?) adding support for
excluding individual files from a sparse checkout, which I imagine
some could find useful. Personally I'd prefer that was instead added
with a more sane syntax, if needed.

Full test case:

$ git init test_repo
Initialized empty Git repository in [path]/test_repo/.git/
$ cd test_repo/
$ touch some_file a_file_or_folder
$ git add some_file a_file_or_folder
$ git commit -m "some files"
[master (root-commit) 80d5918] some files
 2 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 a_file_or_folder
 create mode 100644 some_file
$ git sparse-checkout init --cone
$ git read-tree -mu HEAD
$ ls -1
a_file_or_folder
some_file
$ git sparse-checkout set a_file_or_folder
$ git read-tree -mu HEAD
$ ls -1
some_file
$ cat .git/info/sparse-checkout
/*
!/*/
/a_file_or_folder/
$


Right now I'm trying to integrate cone mode with my company's existing
manifest infrastructure, which doesn't differentiate between files and
folders, so this is forcing me to perform a lot of repo checks to
confirm we aren't accidentally excluding files we were supposed to
include. Just in case you needed another example of why this behaviour
is unhelpful.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Sparse checkout inconsistency with non-folder paths between cone mode and full matching (2.5.0)
  2020-01-30 15:25 Sparse checkout inconsistency with non-folder paths between cone mode and full matching (2.5.0) Finn Bryant
@ 2020-01-30 18:52 ` Derrick Stolee
  2020-01-30 19:22   ` Derrick Stolee
                     ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Derrick Stolee @ 2020-01-30 18:52 UTC (permalink / raw)
  To: finnbryant, git, Elijah Newren

On 1/30/2020 10:25 AM, Finn Bryant wrote:
> Hi,

Hello! Thanks for chiming in with feedback.

> With cone mode enabled for a sparse checkout, a pattern like the
> following is accepted:
> 
> /*
> !/*/
> /a_file_or_folder/

As you mention below, you say this path is a file OR a directory.
However, the trailing slash means _only_ match directories with this
name. This is a specific choice made in creating cone mode to restrict
only by directories.

> My suspicion is that cone mode is supposed to be a pure subset of full
> pattern matching, such that if cone mode is ever disabled, the
> sparseness of the worktree will be unchanged. Clearly, this scenario
> is breaking that pattern.

By "pure subset" you must mean "strict subset". Cone mode is definitely
a smaller set of patterns.

> I think the correct behaviour is that recursive matches for a
> potential directory do not have any effect on a non-directory file
> with the same name. Alternatively, you could forbid any patterns which
> match non-directory files instead (and downgrade to full pattern
> matching), though I'd not be a fan of this, since it'd mean the
> validity of a cone-mode sparse configuration file is dependent on the
> contents of the repo, and is thus much harder to ascertain (scripts
> can't simply prove if it's a valid cone mode file by parsing it,
> they'd need to have access to any repo it may be applied to, and
> inspect the type of any matching file/folder paths, and its validity
> could be changed simply by replacing a folder with a file in the
> repo).
> 
> If matching behaviour with full pattern mode is a non-goal for cone
> mode, I'd still question the logic of this behaviour, though I suppose
> it does have the benefit of (accidentally?) adding support for
> excluding individual files from a sparse checkout, which I imagine
> some could find useful. Personally I'd prefer that was instead added
> with a more sane syntax, if needed.

Cone mode is specifically designed for performance using a hashset-based
pattern matching. This naturally restricts the possible patterns since
we cannot match the full pattern set [1] this quickly. This means there
are some trade-offs that are required, but they were made with some
information of real-world repositories organized in a way that they
could take advantage of this pattern set.

[1] https://git-scm.com/docs/gitignore#_pattern_format
 
> $ git sparse-checkout init --cone
> $ git read-tree -mu HEAD
> $ ls -1
> a_file_or_folder
> some_file
> $ git sparse-checkout set a_file_or_folder
> $ git read-tree -mu HEAD
> $ ls -1
> some_file
> $ cat .git/info/sparse-checkout
> /*
> !/*/
> /a_file_or_folder/

This is an interesting test, because I would expect the /a_file_or_folder/
pattern to not cause the _file_ to not match. It does match the first two
patterns, and just because it doesn't match the third pattern shouldn't
remove it.

This is actually a bug in the hashset-based pattern-matching algorithm,
since setting core.sparseCheckoutCone=false does not have this behavior.
I'll make a patch to fix this.

> Right now I'm trying to integrate cone mode with my company's existing
> manifest infrastructure, which doesn't differentiate between files and
> folders, so this is forcing me to perform a lot of repo checks to
> confirm we aren't accidentally excluding files we were supposed to
> include. Just in case you needed another example of why this behaviour
> is unhelpful.

I apologize that this wrong behavior is causing you some issues. If you
are able to identify which paths are files instead of directories, then
you can remove the filename from the path to include its parent directory.
This is only a workaround until the bug is fixed, but the end-result will
be the same: you'll have all sibling files in your working directory as
well as the files you specified.

This is important: if you have a directory with many large files, but
intend to include only a subset of those files, then you will be
disappointed by the size of your working directory.

This requirement of specifying directories instead of files is part
of the cone mode design for two reasons: one philosophical and another
more practical:

Philisophical: Filenames change more often than directories.

That is, as users are interacting with your repo, the overall directory
structure at a high level is typically static. If new cross-component
dependencies are introduced, then those are usually big architecture-level
changes. However, at an individual file level dependencies change at
a higher rate, causing users to react by changing their sparse-checkout
definition more frequently.

Practical: How do we specify if an input is a directory or a file?

As you mentioned, if you run "git sparse-checkout set X" then you get
a sparse-checkout file containing:

	/*
	!/*/
	/X/

But if X is actually a _file_, then we should write this pattern:

	/*
	!/*/
	/X

So the input does not provide enough information to say which pattern
should be written.

The other concern about this second set of patterns is that "/X" is
only a _prefix_ match, not an exact path name match. That means we
cannot use the existing hashset matching to result in the same
pattern matching as non-cone-mode.

Now, there is perhaps a way to decide between the two: inspect the
current index or tree for whether the input corresponds to a directory
or a file. This adds a layer of complexity that is not currently in
the sparse-checkout builtin, but it is not insurmountable. The only
thing to consider is what happens when the input path is not in the
index/tree at all! It may be that we want to specify the patterns
before moving HEAD to a commit that _does_ contain the path, and then
we would do the wrong thing at this point.

So I'll put out a question to the community: is this practical issue
something worth overcoming? Is my concern about a non-existent path
a true problem, or something more?

Thanks,
-Stolee



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Sparse checkout inconsistency with non-folder paths between cone mode and full matching (2.5.0)
  2020-01-30 18:52 ` Derrick Stolee
@ 2020-01-30 19:22   ` Derrick Stolee
  2020-01-31 15:13   ` Finn Bryant
  2020-01-31 19:59   ` Elijah Newren
  2 siblings, 0 replies; 5+ messages in thread
From: Derrick Stolee @ 2020-01-30 19:22 UTC (permalink / raw)
  To: finnbryant, git, Elijah Newren

On 1/30/2020 1:52 PM, Derrick Stolee wrote:
> On 1/30/2020 10:25 AM, Finn Bryant wrote:
>> $ git sparse-checkout init --cone
>> $ git read-tree -mu HEAD
>> $ ls -1
>> a_file_or_folder
>> some_file
>> $ git sparse-checkout set a_file_or_folder
>> $ git read-tree -mu HEAD
>> $ ls -1
>> some_file
>> $ cat .git/info/sparse-checkout
>> /*
>> !/*/
>> /a_file_or_folder/
> 
> This is an interesting test, because I would expect the /a_file_or_folder/
> pattern to not cause the _file_ to not match. It does match the first two
> patterns, and just because it doesn't match the third pattern shouldn't
> remove it.
> 
> This is actually a bug in the hashset-based pattern-matching algorithm,
> since setting core.sparseCheckoutCone=false does not have this behavior.
> I'll make a patch to fix this.

And here is a patch. I'll add this on top of my "Harden the sparse-checkout
builtin" series in its v4.

-->8--

From bf2115670e8ef79844ec7e87f3ba99d1f776a44d Mon Sep 17 00:00:00 2001
From: Derrick Stolee <dstolee@microsoft.com>
Date: Thu, 30 Jan 2020 19:11:26 +0000
Subject: [PATCH] sparse-checkout: fix cone mode behavior mismatch

The intention of the special "cone mode" in the sparse-checkout feature
is to always match the same patterns that are matched by the same
sparse-checkout file as when cone mode is disabled.

When a file path is given to "git sparse-checkout set" in cone mode,
then the cone mode improperly matches the file as a recursive path.
When setting the skip-worktree bits, files were not expecting the
MATCHED_RECURSIVE response, and hence these were left out of the
matched cone.

Fix this bug by checking for MATCHED_RECURSIVE in addition to MATCHED
and add a test that prevents regression.

Reported-by: Finn Bryant <finnbryant@gmail.com>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/t1091-sparse-checkout-builtin.sh | 12 ++++++++++++
 unpack-trees.c                     |  2 +-
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/t/t1091-sparse-checkout-builtin.sh b/t/t1091-sparse-checkout-builtin.sh
index 37e9304ef3..7d982096fb 100755
--- a/t/t1091-sparse-checkout-builtin.sh
+++ b/t/t1091-sparse-checkout-builtin.sh
@@ -305,6 +305,18 @@ test_expect_success 'different sparse-checkouts with worktrees' '
 	check_files worktree a deep
 '
 
+test_expect_success 'set using filename keeps file on-disk' '
+	git -C repo sparse-checkout set a deep &&
+	cat >expect <<-\EOF &&
+	/*
+	!/*/
+	/a/
+	/deep/
+	EOF
+	test_cmp expect repo/.git/info/sparse-checkout &&
+	check_files repo a deep
+'
+
 check_read_tree_errors () {
 	REPO=$1
 	FILES=$2
diff --git a/unpack-trees.c b/unpack-trees.c
index 3789a22cf0..78425ce74b 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -1416,7 +1416,7 @@ static int clear_ce_flags_1(struct index_state *istate,
 						name, &dtype, pl, istate);
 		if (ret == UNDECIDED)
 			ret = default_match;
-		if (ret == MATCHED)
+		if (ret == MATCHED || ret == MATCHED_RECURSIVE)
 			ce->ce_flags &= ~clear_mask;
 		cache++;
 		progress_nr++;
-- 
2.25.0.vfs.1.1




^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: Sparse checkout inconsistency with non-folder paths between cone mode and full matching (2.5.0)
  2020-01-30 18:52 ` Derrick Stolee
  2020-01-30 19:22   ` Derrick Stolee
@ 2020-01-31 15:13   ` Finn Bryant
  2020-01-31 19:59   ` Elijah Newren
  2 siblings, 0 replies; 5+ messages in thread
From: Finn Bryant @ 2020-01-31 15:13 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, Elijah Newren

(resent, I forgot to reply all)

On Thu, 30 Jan 2020 at 18:52, Derrick Stolee <stolee@gmail.com> wrote:
>
> On 1/30/2020 10:25 AM, Finn Bryant wrote:
> > Hi,
>
> Hello! Thanks for chiming in with feedback.
>
> > With cone mode enabled for a sparse checkout, a pattern like the
> > following is accepted:
> >
> > /*
> > !/*/
> > /a_file_or_folder/
>
> As you mention below, you say this path is a file OR a directory.
> However, the trailing slash means _only_ match directories with this
> name. This is a specific choice made in creating cone mode to restrict
> only by directories.

Right, I thought it was supposed to work that way, but the bug you've
patched left me uncertain.

> > If matching behaviour with full pattern mode is a non-goal for cone
> > mode, I'd still question the logic of this behaviour, though I suppose
> > it does have the benefit of (accidentally?) adding support for
> > excluding individual files from a sparse checkout, which I imagine
> > some could find useful. Personally I'd prefer that was instead added
> > with a more sane syntax, if needed.
>
> Cone mode is specifically designed for performance using a hashset-based
> pattern matching. This naturally restricts the possible patterns since
> we cannot match the full pattern set [1] this quickly. This means there
> are some trade-offs that are required, but they were made with some
> information of real-world repositories organized in a way that they
> could take advantage of this pattern set.

Absolutely, that makes sense to me, and I think this fits the model
I'm working with well, or will once your patch filters through.

> > $ git sparse-checkout init --cone
> > $ git read-tree -mu HEAD
> > $ ls -1
> > a_file_or_folder
> > some_file
> > $ git sparse-checkout set a_file_or_folder
> > $ git read-tree -mu HEAD
> > $ ls -1
> > some_file
> > $ cat .git/info/sparse-checkout
> > /*
> > !/*/
> > /a_file_or_folder/
>
> This is an interesting test, because I would expect the /a_file_or_folder/
> pattern to not cause the _file_ to not match. It does match the first two
> patterns, and just because it doesn't match the third pattern shouldn't
> remove it.
>
> This is actually a bug in the hashset-based pattern-matching algorithm,
> since setting core.sparseCheckoutCone=false does not have this behavior.
> I'll make a patch to fix this.

I was hoping you'd say that!

> > Right now I'm trying to integrate cone mode with my company's existing
> > manifest infrastructure, which doesn't differentiate between files and
> > folders, so this is forcing me to perform a lot of repo checks to
> > confirm we aren't accidentally excluding files we were supposed to
> > include. Just in case you needed another example of why this behaviour
> > is unhelpful.
>
> I apologize that this wrong behavior is causing you some issues. If you
> are able to identify which paths are files instead of directories, then
> you can remove the filename from the path to include its parent directory.
> This is only a workaround until the bug is fixed, but the end-result will
> be the same: you'll have all sibling files in your working directory as
> well as the files you specified.

No need for apologies, I just wanted to convey why I thought the bug
was worth attention.
I realise this is a brand new feature and appreciate the impressively
fast feedback!
I am loving the cone mode improvements by the way, I've been living
with the poor performance from full matching for a long time now, and
this is a huge improvement.

> This is important: if you have a directory with many large files, but
> intend to include only a subset of those files, then you will be
> disappointed by the size of your working directory.

Understood, this thankfully isn't a huge issue for my use-case, our
folder structures are mostly not too shallow, so there's only a few
cases where the lack of file matching hurts us, and the benefits far
outweigh the costs.

> This requirement of specifying directories instead of files is part
> of the cone mode design for two reasons: one philosophical and another
> more practical:
>
> Philisophical: Filenames change more often than directories.
>
> That is, as users are interacting with your repo, the overall directory
> structure at a high level is typically static. If new cross-component
> dependencies are introduced, then those are usually big architecture-level
> changes. However, at an individual file level dependencies change at
> a higher rate, causing users to react by changing their sparse-checkout
> definition more frequently.

It's not too much of a concern for me, but the few cases where it has
an impact for me tend to follow the pattern of:
project A needs file X, but X lives in a folder with 1000 other files
that aren't needed by project A.

> Practical: How do we specify if an input is a directory or a file?
>
> As you mentioned, if you run "git sparse-checkout set X" then you get
> a sparse-checkout file containing:
>
>         /*
>         !/*/
>         /X/
>
> But if X is actually a _file_, then we should write this pattern:
>
>         /*
>         !/*/
>         /X
>
> So the input does not provide enough information to say which pattern
> should be written.
>
> The other concern about this second set of patterns is that "/X" is
> only a _prefix_ match, not an exact path name match. That means we
> cannot use the existing hashset matching to result in the same
> pattern matching as non-cone-mode.

This is an interesting point, currently (at least in 2.25) both `/X/`
and `/X` formats are allowed, and appear to be treated the same way.

Given `/X` is only a prefix match, perhaps it should be invalid for
cone mode, and force the full matching fallback?

This wouldn't have any effect on the matching capabilities of cone
mode, I think, but it would reduce confusion and ensure cone mode and
full matching have as close to the same results as possible.
Additionally, by preventing it now, you can effectively reserve that
syntax in case it proves useful for later extensions to cone mode.

> Now, there is perhaps a way to decide between the two: inspect the
> current index or tree for whether the input corresponds to a directory
> or a file. This adds a layer of complexity that is not currently in
> the sparse-checkout builtin, but it is not insurmountable. The only
> thing to consider is what happens when the input path is not in the
> index/tree at all! It may be that we want to specify the patterns
> before moving HEAD to a commit that _does_ contain the path, and then
> we would do the wrong thing at this point.

Agreed, this is a nasty problem.

> So I'll put out a question to the community: is this practical issue
> something worth overcoming? Is my concern about a non-existent path
> a true problem, or something more?

Personally I think it's a reasonable compromise. Many will benefit
with only folder matching support, and the cases where file matching
is needed (or at least a significant benefit) are going to be quite a
bit less common.

> Thanks,
> -Stolee

Thank you for the quick response and the patch!

Finn

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: Sparse checkout inconsistency with non-folder paths between cone mode and full matching (2.5.0)
  2020-01-30 18:52 ` Derrick Stolee
  2020-01-30 19:22   ` Derrick Stolee
  2020-01-31 15:13   ` Finn Bryant
@ 2020-01-31 19:59   ` Elijah Newren
  2 siblings, 0 replies; 5+ messages in thread
From: Elijah Newren @ 2020-01-31 19:59 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: finnbryant, Git Mailing List

On Thu, Jan 30, 2020 at 10:52 AM Derrick Stolee <stolee@gmail.com> wrote:
> Practical: How do we specify if an input is a directory or a file?
>
> As you mentioned, if you run "git sparse-checkout set X" then you get
> a sparse-checkout file containing:
>
>         /*
>         !/*/
>         /X/
>
> But if X is actually a _file_, then we should write this pattern:
>
>         /*
>         !/*/
>         /X
>
> So the input does not provide enough information to say which pattern
> should be written.
>
> The other concern about this second set of patterns is that "/X" is
> only a _prefix_ match, not an exact path name match. That means we
> cannot use the existing hashset matching to result in the same
> pattern matching as non-cone-mode.
>
> Now, there is perhaps a way to decide between the two: inspect the
> current index or tree for whether the input corresponds to a directory
> or a file. This adds a layer of complexity that is not currently in
> the sparse-checkout builtin, but it is not insurmountable. The only
> thing to consider is what happens when the input path is not in the
> index/tree at all! It may be that we want to specify the patterns
> before moving HEAD to a commit that _does_ contain the path, and then
> we would do the wrong thing at this point.

And once people ran into the difference because their repo was weird,
they'd be bewildered.  I don't want to deal with those support
questions; let's just disallow the latter form and require the
trailing slash.  There are three alternate universes we could have
supported such an ability under:

   * If we had decided to use regexes instead of fnmatch for
sparse-checkout/gitignore then we could have a /<path>$ pattern to
match just files
  * If fnmatch had an end-of-input marker like regex's '$', then we
could use it to mark exact files and be able to translate that into
hashes.
  * If we had decided that cone mode uses an input file other than
.git/info/sparse-checkout with a different format, we wouldn't have to
worry about strict compatibility of the format with non-cone mode and
support different types of abilities.

But I think it's too late now.  Let's just leave the trailing slash
there and say we only support directory-based matching in cone mode.

> So I'll put out a question to the community: is this practical issue
> something worth overcoming? Is my concern about a non-existent path
> a true problem, or something more?
>
> Thanks,
> -Stolee

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2020-01-31 19:59 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-30 15:25 Sparse checkout inconsistency with non-folder paths between cone mode and full matching (2.5.0) Finn Bryant
2020-01-30 18:52 ` Derrick Stolee
2020-01-30 19:22   ` Derrick Stolee
2020-01-31 15:13   ` Finn Bryant
2020-01-31 19:59   ` Elijah Newren

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).