git@vger.kernel.org mailing list mirror (one of many)
 help / Atom feed
* [BUG] gitignore documentation inconsistent with actual behaviour
@ 2018-10-11 10:19 dana
  2018-10-11 10:37 ` dana
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: dana @ 2018-10-11 10:19 UTC (permalink / raw)
  To: git

Hello,

I'm a contributor to ripgrep, which is a grep-like tool that supports using
gitignore files to control which files are searched in a repo (or any other
directory tree). ripgrep's support for the patterns in these files is based on
git's official documentation, as seen here:

  https://git-scm.com/docs/gitignore

One of the most common reports on the ripgrep bug tracker is that it does not
allow patterns like the following real-world examples, where a ** is used along
with other text within the same path component:

  **/**$$*.java
  **.orig
  **local.properties
  !**.sha1

The reason it doesn't allow them is that the gitignore documentation explicitly
states that they're invalid:

  A leading "**" followed by a slash means match in all directories...

  A trailing "/**" matches everything inside...

  A slash followed by two consecutive asterisks then a slash matches zero or
  more directories...

  Other consecutive asterisks are considered invalid.

git itself happily accepts these patterns, however, apparently treating the **
like a single * without fnmatch(3)'s FNM_PATHNAME flag set (in other words, it
matches / as a regular character, thus crossing directory boundaries).

ripgrep's developer is loathe to reverse-engineer this undocumented behaviour,
and so the reports keep coming, both for ripgrep itself and for down-stream
consumers of it and its ignore crate (including most notably Microsoft's VS Code
editor).

My request: Assuming that git's actual handling of these patterns is intended,
would it be possible to make it 'official' and explicitly add it to the
documentation?

References (the first one is the main bug):

https://github.com/BurntSushi/ripgrep/issues/373
https://github.com/BurntSushi/ripgrep/issues/507
https://github.com/BurntSushi/ripgrep/issues/859
https://github.com/BurntSushi/ripgrep/issues/945
https://github.com/BurntSushi/ripgrep/issues/1080
https://github.com/BurntSushi/ripgrep/issues/1082
https://github.com/Microsoft/vscode/issues/24050

dana


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [BUG] gitignore documentation inconsistent with actual behaviour
  2018-10-11 10:19 [BUG] gitignore documentation inconsistent with actual behaviour dana
@ 2018-10-11 10:37 ` dana
  2018-10-11 11:08 ` Ævar Arnfjörð Bjarmason
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 17+ messages in thread
From: dana @ 2018-10-11 10:37 UTC (permalink / raw)
  To: git

On 11 Oct 2018, at 05:19, dana <dana@dana.is> wrote:
>git itself happily accepts these patterns, however, apparently treating the **
>like a single * without fnmatch(3)'s FNM_PATHNAME flag set (in other words, it
>matches / as a regular character, thus crossing directory boundaries).

Sorry for replying to myself so quickly, but i think this bit is wrong. It seems
like it actually just treats ** in this sort of pattern like a regular * — but
i haven't studied the source very closely, so maybe that's not the full story
either, i'm not sure.

dana


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [BUG] gitignore documentation inconsistent with actual behaviour
  2018-10-11 10:19 [BUG] gitignore documentation inconsistent with actual behaviour dana
  2018-10-11 10:37 ` dana
@ 2018-10-11 11:08 ` Ævar Arnfjörð Bjarmason
  2018-10-14  2:14   ` dana
  2018-10-14 12:15   ` Duy Nguyen
  2018-10-20  5:26 ` Duy Nguyen
  2018-10-27  8:48 ` [PATCH] wildmatch: change behavior of "foo**bar" in WM_PATHNAME mode Nguyễn Thái Ngọc Duy
  3 siblings, 2 replies; 17+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-10-11 11:08 UTC (permalink / raw)
  To: dana; +Cc: git, Matthieu Moy, Michael Haggerty


On Thu, Oct 11 2018, dana wrote:

> Hello,
>
> I'm a contributor to ripgrep, which is a grep-like tool that supports using
> gitignore files to control which files are searched in a repo (or any other
> directory tree). ripgrep's support for the patterns in these files is based on
> git's official documentation, as seen here:
>
>   https://git-scm.com/docs/gitignore
>
> One of the most common reports on the ripgrep bug tracker is that it does not
> allow patterns like the following real-world examples, where a ** is used along
> with other text within the same path component:
>
>   **/**$$*.java
>   **.orig
>   **local.properties
>   !**.sha1
>
> The reason it doesn't allow them is that the gitignore documentation explicitly
> states that they're invalid:
>
>   A leading "**" followed by a slash means match in all directories...
>
>   A trailing "/**" matches everything inside...
>
>   A slash followed by two consecutive asterisks then a slash matches zero or
>   more directories...
>
>   Other consecutive asterisks are considered invalid.
>
> git itself happily accepts these patterns, however, apparently treating the **
> like a single * without fnmatch(3)'s FNM_PATHNAME flag set (in other words, it
> matches / as a regular character, thus crossing directory boundaries).
>
> ripgrep's developer is loathe to reverse-engineer this undocumented behaviour,
> and so the reports keep coming, both for ripgrep itself and for down-stream
> consumers of it and its ignore crate (including most notably Microsoft's VS Code
> editor).
>
> My request: Assuming that git's actual handling of these patterns is intended,
> would it be possible to make it 'official' and explicitly add it to the
> documentation?
>
> References (the first one is the main bug):
>
> https://github.com/BurntSushi/ripgrep/issues/373
> https://github.com/BurntSushi/ripgrep/issues/507
> https://github.com/BurntSushi/ripgrep/issues/859
> https://github.com/BurntSushi/ripgrep/issues/945
> https://github.com/BurntSushi/ripgrep/issues/1080
> https://github.com/BurntSushi/ripgrep/issues/1082
> https://github.com/Microsoft/vscode/issues/24050

Yeah those docs seem wrong. In general the docs for the matching
function are quite bad. I have on my TODO list to factor this out into
some gitwildmatch manpage, but right now the bit in gitignore is all we
have.

Our matching function comes from rsync originally, whose manpage says:

    use ’**’ to match anything, including slashes.

I believe this is accurate as far as the implementation goes. You can
also see the rather exhaustive tests here:
https://github.com/git/git/blob/master/t/t3070-wildmatch.sh

Note the different behavior with e.g. --glob-pathspecs v.s. the
default. There's also stuff like:

    $ grep diff=perl .gitattributes
    *.perl eol=lf diff=perl
    *.pl eof=lf diff=perl
    *.pm eol=lf diff=perl
    $ git ls-files ":(attr:diff=perl)" | wc -l
    65

And then the exclude syntax. This is not in .gitignore:

    $ git ls-files ":(exclude)*.pm" ":(attr:diff=perl)" | wc -l
    41
    $ git ls-files ":^*.pm" ":(attr:diff=perl)" | wc -l
    41

I.e. we have wildmatch() on one hand and then the pathspec matching.

For an unrelated thing I have been thinking of adding a new plumbing
command who'd get input like this on stdin:

    1 text t/t0202/test.pl\0\0
    2 text perl/Git.pm\0\0
    3 text *.pm\0\0
    4 text :^*.pm"\0:(attr:diff=perl)\0\0
    5 match glob,icase\04\03\0\0
    6 match icase\04\02\0\0
    7 match \04\01\0\0

Which would return (in any order):

    1 OK
    2 OK
    3 OK
    4 OK
    5 NO
    6 NO
    7 YES

Or whatever. I.e. something where you can as a batch feed various
strings into a program in a batch, and then ask how some of them would
match against each other.

The reason for this is to extend something like git-multimail[1] with a
config where users can subscribe to changes to paths as declared by git
pathspecs, and be informed about which of the glob rules they defined
matched their config.

Right now you'd need to e.g. for a "git push" run each match rule for
each user with such config against each changed path in each commit
that's pushed, but plumbing like this would allow for feeding arbitrary
combinations of those in and ask what matches against what.

The reason I'm on this tangent is to ask whether if such a thing
existed, if it's something you can see something like ripgrep
using. I.e. ask git given its .gitignore and .gitattributes what thing
matches one of its pathspecs instead of carrying your own bug-compatible
implementation.

1. https://github.com/git-multimail/git-multimail/

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [BUG] gitignore documentation inconsistent with actual behaviour
  2018-10-11 11:08 ` Ævar Arnfjörð Bjarmason
@ 2018-10-14  2:14   ` dana
  2018-10-14 12:15   ` Duy Nguyen
  1 sibling, 0 replies; 17+ messages in thread
From: dana @ 2018-10-14  2:14 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Matthieu Moy, Michael Haggerty, Andrew Gallant

On 11 Oct 2018, at 06:08, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>Our matching function comes from rsync originally, whose manpage says:
>
>   use ’**’ to match anything, including slashes.
>
>I believe this is accurate as far as the implementation goes. You can
>also see the rather exhaustive tests here:
>https://github.com/git/git/blob/master/t/t3070-wildmatch.sh

Thanks. The tests are a little hard to follow at first glance, so i guess i'd
have to study them more to see what the semantics actually are. When i tested it
briefly it didn't seem like it was behaving as you described, but maybe i wasn't
paying close enough attention. In any case, i'm sure these will be useful.

On 11 Oct 2018, at 06:08, Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>The reason I'm on this tangent is to ask whether if such a thing
>existed, if it's something you can see something like ripgrep
>using. I.e. ask git given its .gitignore and .gitattributes what thing
>matches one of its pathspecs instead of carrying your own bug-compatible
>implementation.

My impression is that it probably isn't practical to use something like that as
ripgrep's main pattern-matching facility — for one thing there are performance
concerns, and for another it makes it dependent on git for some of its core
functionality (and it's not a git-specific tool). But i was going to say that
it'd be useful for testing, and when i e-mailed Andrew (the main ripgrep dev)
about this he said the same. I've CCed him on this message in case he has
anything to add.

dana


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [BUG] gitignore documentation inconsistent with actual behaviour
  2018-10-11 11:08 ` Ævar Arnfjörð Bjarmason
  2018-10-14  2:14   ` dana
@ 2018-10-14 12:15   ` Duy Nguyen
  2018-10-14 22:56     ` Junio C Hamano
  1 sibling, 1 reply; 17+ messages in thread
From: Duy Nguyen @ 2018-10-14 12:15 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: dana, Git Mailing List, Matthieu Moy, Michael Haggerty

On Thu, Oct 11, 2018 at 2:41 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>
> On Thu, Oct 11 2018, dana wrote:
>
> > Hello,
> >
> > I'm a contributor to ripgrep, which is a grep-like tool that supports using
> > gitignore files to control which files are searched in a repo (or any other
> > directory tree). ripgrep's support for the patterns in these files is based on
> > git's official documentation, as seen here:
> >
> >   https://git-scm.com/docs/gitignore
> >
> > One of the most common reports on the ripgrep bug tracker is that it does not
> > allow patterns like the following real-world examples, where a ** is used along
> > with other text within the same path component:
> >
> >   **/**$$*.java
> >   **.orig
> >   **local.properties
> >   !**.sha1
> >
> > The reason it doesn't allow them is that the gitignore documentation explicitly
> > states that they're invalid:
> >
> >   A leading "**" followed by a slash means match in all directories...
> >
> >   A trailing "/**" matches everything inside...
> >
> >   A slash followed by two consecutive asterisks then a slash matches zero or
> >   more directories...
> >
> >   Other consecutive asterisks are considered invalid.

Perhaps "undefined" is a better word than "invalid".

> > git itself happily accepts these patterns, however, apparently treating the **
> > like a single * without fnmatch(3)'s FNM_PATHNAME flag set (in other words, it
> > matches / as a regular character, thus crossing directory boundaries).
> >
> > ripgrep's developer is loathe to reverse-engineer this undocumented behaviour,
> > and so the reports keep coming, both for ripgrep itself and for down-stream
> > consumers of it and its ignore crate (including most notably Microsoft's VS Code
> > editor).
> >
> > My request: Assuming that git's actual handling of these patterns is intended,
> > would it be possible to make it 'official' and explicitly add it to the
> > documentation?

You mean "**" in the fourth case is interpreted as "*"? Yes I guess we
could rephrase it as "Other consecutive asterisks are considered
normal wildcard asterisks"

> Yeah those docs seem wrong. In general the docs for the matching
> function are quite bad. I have on my TODO list to factor this out into
> some gitwildmatch manpage, but right now the bit in gitignore is all we
> have.
>
> Our matching function comes from rsync originally, whose manpage says:
>
>     use ’**’ to match anything, including slashes.
>
> I believe this is accurate as far as the implementation goes.

No. "**" semantics is not the same as from rsync. The three cases
"**/", "/**/" and "/**" were requested by Junio if I remember
correctly. You can search the mail archive for more information.
-- 
Duy

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [BUG] gitignore documentation inconsistent with actual behaviour
  2018-10-14 12:15   ` Duy Nguyen
@ 2018-10-14 22:56     ` Junio C Hamano
  2018-10-15 15:27       ` Duy Nguyen
  0 siblings, 1 reply; 17+ messages in thread
From: Junio C Hamano @ 2018-10-14 22:56 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Ævar Arnfjörð Bjarmason, dana, Git Mailing List,
	Matthieu Moy, Michael Haggerty

Duy Nguyen <pclouds@gmail.com> writes:

>> Our matching function comes from rsync originally, whose manpage says:
>>
>>     use ’**’ to match anything, including slashes.
>>
>> I believe this is accurate as far as the implementation goes.
>
> No. "**" semantics is not the same as from rsync. The three cases
> "**/", "/**/" and "/**" were requested by Junio if I remember
> correctly. You can search the mail archive for more information.

Perhaps spelling the rules out would be more benefitial for the
purpose of this thread?  I do not recall what I requested, but let
me throw out my guesses (i.e. what I would have wished if I were
making a request to implement something) to keep the thread alive,
you can correct me, and people can take it from there to update the
docs ;-)

    A double-asterisk, both of whose ends are adjacent to a
    directory boundary (i.e. the beginning of the pattern, the end
    of the pattern or a slash) macthes 0 or more levels of
    directories.  e.g. **/a/b would match a/b, x/a/b, x/y/a/b, but
    not z-a/b.  a/**/b would match a/b, a/x/b, but not a/z-b or
    a-z-b.

What a double-asterisk that does not sit on a directory boundary,
e.g. "a**b", "a**/b", "a/**b", or "**a/b", matches, as far as I am
concerned, is undefined, meaning that (1) I do not care all that
much what the code actually do when a pattern like that is given as
long as it does not segfault, and (2) I do not think I would mind
changing the behaviour as a "bugfix", if their current behaviour
does not make sense and we can come up with a saner alternative.

But the documentation probably should describe what these currently
match as the starting point.

Thanks.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [BUG] gitignore documentation inconsistent with actual behaviour
  2018-10-14 22:56     ` Junio C Hamano
@ 2018-10-15 15:27       ` Duy Nguyen
  0 siblings, 0 replies; 17+ messages in thread
From: Duy Nguyen @ 2018-10-15 15:27 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Ævar Arnfjörð Bjarmason, dana geier,
	Git Mailing List, Michael Haggerty

On Mon, Oct 15, 2018 at 12:57 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Duy Nguyen <pclouds@gmail.com> writes:
>
> >> Our matching function comes from rsync originally, whose manpage says:
> >>
> >>     use ’**’ to match anything, including slashes.
> >>
> >> I believe this is accurate as far as the implementation goes.
> >
> > No. "**" semantics is not the same as from rsync. The three cases
> > "**/", "/**/" and "/**" were requested by Junio if I remember
> > correctly. You can search the mail archive for more information.
>
> Perhaps spelling the rules out would be more benefitial for the
> purpose of this thread?  I do not recall what I requested, but let

OK I gave you too much credit. You pointed out semantics problem with
rsync "**" [1] and gently pushed me to implement a safer subset ;-)

[1] https://public-inbox.org/git/7vbogj5sji.fsf@alter.siamese.dyndns.org/

> me throw out my guesses (i.e. what I would have wished if I were
> making a request to implement something) to keep the thread alive,
> you can correct me, and people can take it from there to update the
> docs ;-)
>
>     A double-asterisk, both of whose ends are adjacent to a
>     directory boundary (i.e. the beginning of the pattern, the end
>     of the pattern or a slash) macthes 0 or more levels of
>     directories.  e.g. **/a/b would match a/b, x/a/b, x/y/a/b, but
>     not z-a/b.  a/**/b would match a/b, a/x/b, but not a/z-b or
>     a-z-b.
>
> What a double-asterisk that does not sit on a directory boundary,
> e.g. "a**b", "a**/b", "a/**b", or "**a/b", matches, as far as I am
> concerned, is undefined, meaning that (1) I do not care all that
> much what the code actually do when a pattern like that is given as
> long as it does not segfault, and (2) I do not think I would mind
> changing the behaviour as a "bugfix", if their current behaviour
> does not make sense and we can come up with a saner alternative.

I think the document describes more or less what you wrote above in
the indented paragraph (but maybe less clear, patches are of course
welcome). It's the last paragraph that is problematic. It right now
says "invalid" which could be interpreted as "bad pattern, rejected"
but we in fact accept these "*" are regular ones.

There are not many alternatives we could do though. Erroring out could
really flood the stderr because we match these patterns zillions of
time and traditionally fnmatch gracefully accepts bad patterns, trying
to make the best of of them. So keeping undefined "**" as two "*"
sounds good enough.

But of course I'm open for saner alternatives if people find any.
-- 
Duy

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [BUG] gitignore documentation inconsistent with actual behaviour
  2018-10-11 10:19 [BUG] gitignore documentation inconsistent with actual behaviour dana
  2018-10-11 10:37 ` dana
  2018-10-11 11:08 ` Ævar Arnfjörð Bjarmason
@ 2018-10-20  5:26 ` Duy Nguyen
  2018-10-20  5:53   ` dana
  2018-10-27  8:48 ` [PATCH] wildmatch: change behavior of "foo**bar" in WM_PATHNAME mode Nguyễn Thái Ngọc Duy
  3 siblings, 1 reply; 17+ messages in thread
From: Duy Nguyen @ 2018-10-20  5:26 UTC (permalink / raw)
  To: dana; +Cc: git, Ævar Arnfjörð, Junio C Hamano

On Thu, Oct 11, 2018 at 05:19:06AM -0500, dana wrote:
> Hello,
> 
> I'm a contributor to ripgrep, which is a grep-like tool that supports using
> gitignore files to control which files are searched in a repo (or any other
> directory tree). ripgrep's support for the patterns in these files is based on
> git's official documentation, as seen here:
> 
>   https://git-scm.com/docs/gitignore
> 
> One of the most common reports on the ripgrep bug tracker is that it does not
> allow patterns like the following real-world examples, where a ** is used along
> with other text within the same path component:
> 
>   **/**$$*.java
>   **.orig
>   **local.properties
>   !**.sha1
> 
> The reason it doesn't allow them is that the gitignore documentation explicitly
> states that they're invalid:
>
> ...

I've checked the code and run some tests. There is a twist here. "**"
is only special when matched in "pathname" mode. That is when the
pattern contains at least one slash. In your patterns above, that only
applies to the first pattern.

When '**' is special, if it's neither '**/', '/**/' or '/**', it _is_
considered invalid (i.e. bad pattern) and the pattern will not match
anything.

The confusion comes from when '**' is not special for the remaining
three patterns, it's considered as regular '*' and still matches
stuff.

So, I think we have two options. The document could be clarified with
something like this

-- 8< --
diff --git a/Documentation/gitignore.txt b/Documentation/gitignore.txt
index d107daaffd..500cd43939 100644
--- a/Documentation/gitignore.txt
+++ b/Documentation/gitignore.txt
@@ -100,7 +100,8 @@ PATTERN FORMAT
    a shell glob pattern and checks for a match against the
    pathname relative to the location of the `.gitignore` file
    (relative to the toplevel of the work tree if not from a
-   `.gitignore` file).
+   `.gitignore` file). Note that the "two consecutive asterisks" rule
+   below does not apply.
 
  - Otherwise, Git treats the pattern as a shell glob: "`*`" matches
    anything except "`/`", "`?`" matches any one character except "`/`"
@@ -129,7 +130,8 @@ full pathname may have special meaning:
    matches zero or more directories. For example, "`a/**/b`"
    matches "`a/b`", "`a/x/b`", "`a/x/y/b`" and so on.
 
- - Other consecutive asterisks are considered invalid.
+ - Other consecutive asterisks are considered invalid and the pattern
+   is ignored.
 
 NOTES
 -----
-- 8< --

Or we could make the behavior consistent. If '**' is invalid, just
consider it two separate regular '*'. Then all four of your patterns
will behave the same way. The change for that is quite simple

-- 8< --
diff --git a/wildmatch.c b/wildmatch.c
index d074c1be10..64087bf02c 100644
--- a/wildmatch.c
+++ b/wildmatch.c
@@ -104,8 +104,10 @@ static int dowild(const uchar *p, const uchar *text, unsigned int flags)
 					    dowild(p + 1, text, flags) == WM_MATCH)
 						return WM_MATCH;
 					match_slash = 1;
-				} else
-					return WM_ABORT_MALFORMED;
+				} else {
+					/* without WM_PATHNAME, '*' == '**' */
+					match_slash = flags & WM_PATHNAME ? 0 : 1;
+				}
 			} else
 				/* without WM_PATHNAME, '*' == '**' */
 				match_slash = flags & WM_PATHNAME ? 0 : 1;
-- 8< --

Which way should we go? I'm leaning towards the second one...
--
Duy

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [BUG] gitignore documentation inconsistent with actual behaviour
  2018-10-20  5:26 ` Duy Nguyen
@ 2018-10-20  5:53   ` dana
  2018-10-20  6:03     ` Duy Nguyen
  0 siblings, 1 reply; 17+ messages in thread
From: dana @ 2018-10-20  5:53 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: git, Ævar Arnfjörð, Junio C Hamano, Andrew Gallant

On 20 Oct 2018, at 00:26, Duy Nguyen <pclouds@gmail.com> wrote:
>Which way should we go? I'm leaning towards the second one...

Not sure how much my opinion is worth, but the second option does feel more
friendly (from a usage perspective) as well as more straight-forward (from a
re-implementation perspective).

There's a third option too, though, right? The 'rsync' behaviour mentioned
earlier? It wouldn't matter either way in any of the examples i listed, but is
there ever a conceivable use case for something like `foo**bar`, where the `**`
matches across slashes? (I can't think of any reason i'd need that personally,
but then again i don't understand why these people are using `**` the way they
are in the first place.)

dana


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [BUG] gitignore documentation inconsistent with actual behaviour
  2018-10-20  5:53   ` dana
@ 2018-10-20  6:03     ` Duy Nguyen
  2018-10-20  6:26       ` dana
  0 siblings, 1 reply; 17+ messages in thread
From: Duy Nguyen @ 2018-10-20  6:03 UTC (permalink / raw)
  To: dana geier
  Cc: Git Mailing List, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, jamslam

On Sat, Oct 20, 2018 at 7:53 AM dana <dana@dana.is> wrote:
>
> On 20 Oct 2018, at 00:26, Duy Nguyen <pclouds@gmail.com> wrote:
> >Which way should we go? I'm leaning towards the second one...
>
> Not sure how much my opinion is worth, but the second option does feel more
> friendly (from a usage perspective) as well as more straight-forward (from a
> re-implementation perspective).

Yeah. And not having to describe all the corner cases is a plus. Too
many corner cases are a sign of bad implementation anyway. I'll wait
some more time for the others to speak up before I cook a proper
patch.

> There's a third option too, though, right? The 'rsync' behaviour mentioned
> earlier? It wouldn't matter either way in any of the examples i listed, but is
> there ever a conceivable use case for something like `foo**bar`, where the `**`
> matches across slashes? (I can't think of any reason i'd need that personally,
> but then again i don't understand why these people are using `**` the way they
> are in the first place.)

foo**bar would match foobar as well as foo/bar, foo/x/bar and
foo/x/y/bar... Its behavior is error prone in my opinion. There's also
some concerns in early iterations of this "**" support that we would
need to revisit if we want 'rsync' behavior. I'm not very excited
about doing that.
-- 
Duy

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [BUG] gitignore documentation inconsistent with actual behaviour
  2018-10-20  6:03     ` Duy Nguyen
@ 2018-10-20  6:26       ` dana
  0 siblings, 0 replies; 17+ messages in thread
From: dana @ 2018-10-20  6:26 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Git Mailing List, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, jamslam

On 20 Oct 2018, at 01:03, Duy Nguyen <pclouds@gmail.com> wrote:
>foo**bar would match foobar as well as foo/bar, foo/x/bar and
>foo/x/y/bar... Its behavior is error prone in my opinion. There's also
>some concerns in early iterations of this "**" support that we would
>need to revisit if we want 'rsync' behavior. I'm not very excited
>about doing that.

That's fair.

I guess another point in favour of your second option is that it's essentially
the same behaviour used by bash (with the `globstar` option) and zsh (with the
default options); they also give `**` special recursion powers only when used in
a path component by itself, otherwise it acts like `*`. So there's precedent
there.

dana


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH] wildmatch: change behavior of "foo**bar" in WM_PATHNAME mode
  2018-10-11 10:19 [BUG] gitignore documentation inconsistent with actual behaviour dana
                   ` (2 preceding siblings ...)
  2018-10-20  5:26 ` Duy Nguyen
@ 2018-10-27  8:48 ` Nguyễn Thái Ngọc Duy
  2018-10-28  6:25   ` Torsten Bögershausen
  2018-10-29 13:24   ` Ævar Arnfjörð Bjarmason
  3 siblings, 2 replies; 17+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2018-10-27  8:48 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Ævar Arnfjörð,
	dana, Nguyễn Thái Ngọc Duy

In WM_PATHNAME mode (or FNM_PATHNAME), '*' does not match '/' and '**'
can but only in three patterns:

- '**/' matches zero or more leading directories
- '/**/' matches zero or more directories in between
- '/**' matches zero or more trailing directories/files

When '**' is present but not in one of these patterns, the current
behavior is consider the pattern invalid and stop matching. In other
words, 'foo**bar' never matches anything, whatever you throw at it.

This behavior is arguably a bit confusing partly because we can't
really tell the user their pattern is invalid so that they can fix
it. So instead, tolerate it and make '**' act like two regular '*'s
(which is essentially the same as a single asterisk). This behavior
seems more predictable.

Noticed-by: dana <dana@dana.is>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 Documentation/gitignore.txt | 3 ++-
 t/t3070-wildmatch.sh        | 4 ++--
 wildmatch.c                 | 4 ++--
 wildmatch.h                 | 1 -
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/Documentation/gitignore.txt b/Documentation/gitignore.txt
index d107daaffd..1c94f08ff4 100644
--- a/Documentation/gitignore.txt
+++ b/Documentation/gitignore.txt
@@ -129,7 +129,8 @@ full pathname may have special meaning:
    matches zero or more directories. For example, "`a/**/b`"
    matches "`a/b`", "`a/x/b`", "`a/x/y/b`" and so on.
 
- - Other consecutive asterisks are considered invalid.
+ - Other consecutive asterisks are considered regular asterisks and
+   will match according to the previous rules.
 
 NOTES
 -----
diff --git a/t/t3070-wildmatch.sh b/t/t3070-wildmatch.sh
index 46aca0af10..891d4d7cb9 100755
--- a/t/t3070-wildmatch.sh
+++ b/t/t3070-wildmatch.sh
@@ -237,7 +237,7 @@ match 0 0 0 0 foobar 'foo\*bar'
 match 1 1 1 1 'f\oo' 'f\\oo'
 match 1 1 1 1 ball '*[al]?'
 match 0 0 0 0 ten '[ten]'
-match 0 0 1 1 ten '**[!te]'
+match 1 1 1 1 ten '**[!te]'
 match 0 0 0 0 ten '**[!ten]'
 match 1 1 1 1 ten 't[a-g]n'
 match 0 0 0 0 ten 't[!a-g]n'
@@ -253,7 +253,7 @@ match 1 1 1 1 ']' ']'
 # Extended slash-matching features
 match 0 0 1 1 'foo/baz/bar' 'foo*bar'
 match 0 0 1 1 'foo/baz/bar' 'foo**bar'
-match 0 0 1 1 'foobazbar' 'foo**bar'
+match 1 1 1 1 'foobazbar' 'foo**bar'
 match 1 1 1 1 'foo/baz/bar' 'foo/**/bar'
 match 1 1 0 0 'foo/baz/bar' 'foo/**/**/bar'
 match 1 1 1 1 'foo/b/a/z/bar' 'foo/**/bar'
diff --git a/wildmatch.c b/wildmatch.c
index d074c1be10..9e9e2a2f95 100644
--- a/wildmatch.c
+++ b/wildmatch.c
@@ -104,8 +104,8 @@ static int dowild(const uchar *p, const uchar *text, unsigned int flags)
 					    dowild(p + 1, text, flags) == WM_MATCH)
 						return WM_MATCH;
 					match_slash = 1;
-				} else
-					return WM_ABORT_MALFORMED;
+				} else /* WM_PATHNAME is set */
+					match_slash = 0;
 			} else
 				/* without WM_PATHNAME, '*' == '**' */
 				match_slash = flags & WM_PATHNAME ? 0 : 1;
diff --git a/wildmatch.h b/wildmatch.h
index b8c826aa68..5993696298 100644
--- a/wildmatch.h
+++ b/wildmatch.h
@@ -4,7 +4,6 @@
 #define WM_CASEFOLD 1
 #define WM_PATHNAME 2
 
-#define WM_ABORT_MALFORMED 2
 #define WM_NOMATCH 1
 #define WM_MATCH 0
 #define WM_ABORT_ALL -1
-- 
2.19.1.647.g708186aaf9


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] wildmatch: change behavior of "foo**bar" in WM_PATHNAME mode
  2018-10-27  8:48 ` [PATCH] wildmatch: change behavior of "foo**bar" in WM_PATHNAME mode Nguyễn Thái Ngọc Duy
@ 2018-10-28  6:25   ` Torsten Bögershausen
  2018-10-28  6:35     ` Duy Nguyen
  2018-10-29 13:24   ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 17+ messages in thread
From: Torsten Bögershausen @ 2018-10-28  6:25 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy
  Cc: git, Junio C Hamano, Ævar Arnfjörð, dana

On Sat, Oct 27, 2018 at 10:48:23AM +0200, Nguyễn Thái Ngọc Duy wrote:
> In WM_PATHNAME mode (or FNM_PATHNAME), '*' does not match '/' and '**'
> can but only in three patterns:
> 
> - '**/' matches zero or more leading directories
> - '/**/' matches zero or more directories in between
> - '/**' matches zero or more trailing directories/files
> 
> When '**' is present but not in one of these patterns, the current
> behavior is consider the pattern invalid and stop matching. In other
> words, 'foo**bar' never matches anything, whatever you throw at it.
> 
> This behavior is arguably a bit confusing partly because we can't
> really tell the user their pattern is invalid so that they can fix
> it. So instead, tolerate it and make '**' act like two regular '*'s
> (which is essentially the same as a single asterisk). This behavior
> seems more predictable.

Nice analyzes.
I have one question here:
If the user specifies '**' and nothhing is found,
would it be better to die() with a useful message
instead of silently correcting it ?

See the the patch below:
> -				} else
> -					return WM_ABORT_MALFORMED;

Would it be possible to put in the die() here?
As it is outlined so nicely above, a '**' must have either a '/'
before, or behind, or both, to make sense.
When there is no '/' then the user specified something wrong.
Either a '/' has been forgotten, or the '*' key may be bouncing.
I don't think that Git should assume anything here.
(but I didn't follow the previous discussions, so I may have missed
some arguments.)

[]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] wildmatch: change behavior of "foo**bar" in WM_PATHNAME mode
  2018-10-28  6:25   ` Torsten Bögershausen
@ 2018-10-28  6:35     ` Duy Nguyen
  2018-10-29  2:28       ` Junio C Hamano
  0 siblings, 1 reply; 17+ messages in thread
From: Duy Nguyen @ 2018-10-28  6:35 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Git Mailing List, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, dana geier

On Sun, Oct 28, 2018 at 7:25 AM Torsten Bögershausen <tboegi@web.de> wrote:
>
> On Sat, Oct 27, 2018 at 10:48:23AM +0200, Nguyễn Thái Ngọc Duy wrote:
> > In WM_PATHNAME mode (or FNM_PATHNAME), '*' does not match '/' and '**'
> > can but only in three patterns:
> >
> > - '**/' matches zero or more leading directories
> > - '/**/' matches zero or more directories in between
> > - '/**' matches zero or more trailing directories/files
> >
> > When '**' is present but not in one of these patterns, the current
> > behavior is consider the pattern invalid and stop matching. In other
> > words, 'foo**bar' never matches anything, whatever you throw at it.
> >
> > This behavior is arguably a bit confusing partly because we can't
> > really tell the user their pattern is invalid so that they can fix
> > it. So instead, tolerate it and make '**' act like two regular '*'s
> > (which is essentially the same as a single asterisk). This behavior
> > seems more predictable.
>
> Nice analyzes.
> I have one question here:
> If the user specifies '**' and nothing is found,
> would it be better to die() with a useful message
> instead of silently correcting it ?

Consider the main use case of wildmatch, .gitignore patterns, dying
would be really bad because it can affect a lot of commands. It would
be much better if wildmatch api allows the caller to handle the error
so they can either die() or propagate the error upwards. But even then
the current API is not suited for that, ideally we should have a
compile phase where you can validate the pattern first. Without it,
you encounter the error every time you try to match something and
handling errors becomes much uglier.

And it goes against the way pattern errors are handled by
fnmatch/wildmatch anyway. If you write '[ab' instead of '[ab]', '['
will just be considered literal '[' instead of erroring out.

> See the the patch below:
> > -                             } else
> > -                                     return WM_ABORT_MALFORMED;
>
> Would it be possible to put in the die() here?
> As it is outlined so nicely above, a '**' must have either a '/'
> before, or behind, or both, to make sense.
> When there is no '/' then the user specified something wrong.
> Either a '/' has been forgotten, or the '*' key may be bouncing.
> I don't think that Git should assume anything here.
> (but I didn't follow the previous discussions, so I may have missed
> some arguments.)
>
> []



-- 
Duy

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] wildmatch: change behavior of "foo**bar" in WM_PATHNAME mode
  2018-10-28  6:35     ` Duy Nguyen
@ 2018-10-29  2:28       ` Junio C Hamano
  0 siblings, 0 replies; 17+ messages in thread
From: Junio C Hamano @ 2018-10-29  2:28 UTC (permalink / raw)
  To: Duy Nguyen
  Cc: Torsten Bögershausen, Git Mailing List,
	Ævar Arnfjörð Bjarmason, dana geier

Duy Nguyen <pclouds@gmail.com> writes:

>> Nice analyzes.
>> I have one question here:
>> If the user specifies '**' and nothing is found,
>> would it be better to die() with a useful message
>> instead of silently correcting it ?
>
> Consider the main use case of wildmatch, .gitignore patterns, dying
> would be really bad because it can affect a lot of commands....

If the user gives 'foo*' and nothing is found, we may say "no match"
and some codepaths that uses wildmatch API may die.  And in such place,
when the user gives '**' and nothing is found, we should do the same
in the same codepath.  In either case, the implementation of wildmatch
API is not the place to call a die(), I think.

And yes, treating an unanchored "**" as if there is just a "*" followed
by another '*" makes good sense.

Thanks, both.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] wildmatch: change behavior of "foo**bar" in WM_PATHNAME mode
  2018-10-27  8:48 ` [PATCH] wildmatch: change behavior of "foo**bar" in WM_PATHNAME mode Nguyễn Thái Ngọc Duy
  2018-10-28  6:25   ` Torsten Bögershausen
@ 2018-10-29 13:24   ` Ævar Arnfjörð Bjarmason
  2018-10-29 15:53     ` Duy Nguyen
  1 sibling, 1 reply; 17+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-10-29 13:24 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy; +Cc: git, Junio C Hamano, dana


On Sat, Oct 27 2018, Nguyễn Thái Ngọc Duy wrote:

> In WM_PATHNAME mode (or FNM_PATHNAME), '*' does not match '/' and '**'
> can but only in three patterns:
>
> - '**/' matches zero or more leading directories
> - '/**/' matches zero or more directories in between
> - '/**' matches zero or more trailing directories/files
>
> When '**' is present but not in one of these patterns, the current
> behavior is consider the pattern invalid and stop matching. In other
> words, 'foo**bar' never matches anything, whatever you throw at it.
>
> This behavior is arguably a bit confusing partly because we can't
> really tell the user their pattern is invalid so that they can fix
> it. So instead, tolerate it and make '**' act like two regular '*'s
> (which is essentially the same as a single asterisk). This behavior
> seems more predictable.
>
> Noticed-by: dana <dana@dana.is>
> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
> ---
>  Documentation/gitignore.txt | 3 ++-
>  t/t3070-wildmatch.sh        | 4 ++--
>  wildmatch.c                 | 4 ++--
>  wildmatch.h                 | 1 -
>  4 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/gitignore.txt b/Documentation/gitignore.txt
> index d107daaffd..1c94f08ff4 100644
> --- a/Documentation/gitignore.txt
> +++ b/Documentation/gitignore.txt
> @@ -129,7 +129,8 @@ full pathname may have special meaning:
>     matches zero or more directories. For example, "`a/**/b`"
>     matches "`a/b`", "`a/x/b`", "`a/x/y/b`" and so on.
>
> - - Other consecutive asterisks are considered invalid.
> + - Other consecutive asterisks are considered regular asterisks and
> +   will match according to the previous rules.
>
>  NOTES
>  -----
> diff --git a/t/t3070-wildmatch.sh b/t/t3070-wildmatch.sh
> index 46aca0af10..891d4d7cb9 100755
> --- a/t/t3070-wildmatch.sh
> +++ b/t/t3070-wildmatch.sh
> @@ -237,7 +237,7 @@ match 0 0 0 0 foobar 'foo\*bar'
>  match 1 1 1 1 'f\oo' 'f\\oo'
>  match 1 1 1 1 ball '*[al]?'
>  match 0 0 0 0 ten '[ten]'
> -match 0 0 1 1 ten '**[!te]'
> +match 1 1 1 1 ten '**[!te]'
>  match 0 0 0 0 ten '**[!ten]'
>  match 1 1 1 1 ten 't[a-g]n'
>  match 0 0 0 0 ten 't[!a-g]n'
> @@ -253,7 +253,7 @@ match 1 1 1 1 ']' ']'
>  # Extended slash-matching features
>  match 0 0 1 1 'foo/baz/bar' 'foo*bar'
>  match 0 0 1 1 'foo/baz/bar' 'foo**bar'
> -match 0 0 1 1 'foobazbar' 'foo**bar'
> +match 1 1 1 1 'foobazbar' 'foo**bar'
>  match 1 1 1 1 'foo/baz/bar' 'foo/**/bar'
>  match 1 1 0 0 'foo/baz/bar' 'foo/**/**/bar'
>  match 1 1 1 1 'foo/b/a/z/bar' 'foo/**/bar'
> diff --git a/wildmatch.c b/wildmatch.c
> index d074c1be10..9e9e2a2f95 100644
> --- a/wildmatch.c
> +++ b/wildmatch.c
> @@ -104,8 +104,8 @@ static int dowild(const uchar *p, const uchar *text, unsigned int flags)
>  					    dowild(p + 1, text, flags) == WM_MATCH)
>  						return WM_MATCH;
>  					match_slash = 1;
> -				} else
> -					return WM_ABORT_MALFORMED;
> +				} else /* WM_PATHNAME is set */
> +					match_slash = 0;
>  			} else
>  				/* without WM_PATHNAME, '*' == '**' */
>  				match_slash = flags & WM_PATHNAME ? 0 : 1;
> diff --git a/wildmatch.h b/wildmatch.h
> index b8c826aa68..5993696298 100644
> --- a/wildmatch.h
> +++ b/wildmatch.h
> @@ -4,7 +4,6 @@
>  #define WM_CASEFOLD 1
>  #define WM_PATHNAME 2
>
> -#define WM_ABORT_MALFORMED 2
>  #define WM_NOMATCH 1
>  #define WM_MATCH 0
>  #define WM_ABORT_ALL -1

This patch looks good to me, but I think it's a bad state of affairs to
keep changing these semantics and not having something like a
"gitwildmatch" doc were we document this matching syntax.

Also I still need to dig up the work for using PCRE as an alternate
matching engine, the PCRE devs produced a bug-for-bug compatible version
of our wildmatch function (all the more reason to document it), so I
think they'll need to change it now that this is in, but I haven't
rebased those ancient patches yet.

Do you have any thoughts on how to proceed with getting this documented
/ into some stable state where we can specify it? Even if we don't end
up using PCRE as a matching engine (sometimes it was faster, sometimes
slower) I think it would be very useful if we can spew out "here's your
pattern as a regex" for self-documentation purposes.

Then that can be piped into e.g. "perl -Mre=debug" to see a step-by-step
guide for how the pattern compiles, and why it does or doesn't match a
given thing.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] wildmatch: change behavior of "foo**bar" in WM_PATHNAME mode
  2018-10-29 13:24   ` Ævar Arnfjörð Bjarmason
@ 2018-10-29 15:53     ` Duy Nguyen
  0 siblings, 0 replies; 17+ messages in thread
From: Duy Nguyen @ 2018-10-29 15:53 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Git Mailing List, Junio C Hamano, dana geier

On Mon, Oct 29, 2018 at 2:24 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
> This patch looks good to me, but I think it's a bad state of affairs to
> keep changing these semantics and not having something like a
> "gitwildmatch" doc were we document this matching syntax.

While we don't have a separate document for it, the behavior _is_
documented even if perhaps it wasn't as clear. There were even tests
for this corner case.

> Do you have any thoughts on how to proceed with getting this documented
> / into some stable state where we can specify it?

wildmatch has been used in git for a few years if I remember correctly
so to me it is stable. Granted there are corner cases like this, but I
can't prevent all bugs (especially this one which is more like design
mistake than bug per se). You are welcome to refactor gitignore.txt
and add gitwildmatch.txt.
-- 
Duy

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, back to index

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-11 10:19 [BUG] gitignore documentation inconsistent with actual behaviour dana
2018-10-11 10:37 ` dana
2018-10-11 11:08 ` Ævar Arnfjörð Bjarmason
2018-10-14  2:14   ` dana
2018-10-14 12:15   ` Duy Nguyen
2018-10-14 22:56     ` Junio C Hamano
2018-10-15 15:27       ` Duy Nguyen
2018-10-20  5:26 ` Duy Nguyen
2018-10-20  5:53   ` dana
2018-10-20  6:03     ` Duy Nguyen
2018-10-20  6:26       ` dana
2018-10-27  8:48 ` [PATCH] wildmatch: change behavior of "foo**bar" in WM_PATHNAME mode Nguyễn Thái Ngọc Duy
2018-10-28  6:25   ` Torsten Bögershausen
2018-10-28  6:35     ` Duy Nguyen
2018-10-29  2:28       ` Junio C Hamano
2018-10-29 13:24   ` Ævar Arnfjörð Bjarmason
2018-10-29 15:53     ` Duy Nguyen

git@vger.kernel.org mailing list mirror (one of many)

Archives are clonable:
	git clone --mirror https://public-inbox.org/git
	git clone --mirror http://ou63pmih66umazou.onion/git
	git clone --mirror http://czquwvybam4bgbro.onion/git
	git clone --mirror http://hjrcffqmbrq6wope.onion/git

Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.version-control.git
	nntp://ou63pmih66umazou.onion/inbox.comp.version-control.git
	nntp://czquwvybam4bgbro.onion/inbox.comp.version-control.git
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.version-control.git
	nntp://news.gmane.org/gmane.comp.version-control.git

 note: .onion URLs require Tor: https://www.torproject.org/
       or Tor2web: https://www.tor2web.org/

AGPL code for this site: git clone https://public-inbox.org/ public-inbox