git@vger.kernel.org mailing list mirror (one of many)
From: Michael Haggerty <mhagger@alum.mit.edu>
To: Junio C Hamano <gitster@pobox.com>
Cc: Nguyen Thai Ngoc Duy <pclouds@gmail.com>, git@vger.kernel.org
Subject: Re: What's cooking in git.git (Oct 2012, #01; Tue, 2)
Date: Thu, 04 Oct 2012 11:34:47 +0200
Message-ID: <506D5837.6020708@alum.mit.edu>
In-Reply-To: <7vbogj5sji.fsf@alter.siamese.dyndns.org>

On 10/03/2012 08:17 PM, Junio C Hamano wrote:
> Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:
> 
>> There's an interesting case: "**foo". According to our rules, that
>> pattern does not contain slashes therefore is basename match. But some
>> might find that confusing because "**" can match slashes,...
> 
> By "our rules", if you mean "if a pattern has slash, it is anchored",
> that obviously need to be updated with this series, if "**" is meant
> to match multiple hierarchies.
>> I think the latter makes more sense. When users put "**" they expect
>> to match some slashes. But that may call for a refactoring in
>> path_matches() in attr.c. Putting strstr(pattern, "**") in that
>> matching function may increase overhead unnecessarily.
>>
>> The third option is just die() and let users decide either "*foo",
>> "**/foo" or "/**foo", never "**foo".
> 
> For the double-star at the beginning, you should just turn it into "**/"
> if it is not followed by a slash internally, I think.
> 
> What is the semantics of ** in the first place?  Is it described to
> a reasonable level of detail in the documentation updates?  For
> example does "**foo" match "afoo", "a/b/foo", "a/bfoo", "a/foo/b",
> "a/bfoo/c"?  Does "x**y" match "xy", "xay", "xa/by", "x/a/y"?
> 
> I am guessing that the only sensible definition is that "**"
> requires anything that comes before it (if exists) is at a proper
> hierarchy boundary, and anything matches it is also at a proper
> hierarchy boundary, so "x**y" matches "x/a/y" and not "xy", "xay",
> nor "xa/by" in the above example.  If "x**y" can match "xy" or "xay"
> (or "**foo" can match "afoo"), it would be unreasonable to say it
> implies the pattern is anchored at any level, no?

Given that there is no obvious interpretation for what a construct like
"x**y" would mean, and many plausible guesses (most of which sound
rather useless), I suggest that we forbid it.  This will make the
feature easier to explain and make .gitignore files that use it easier
to understand.

I think that 98% of the usefulness of "**" would be in constructs where
it replaces a proper part of the pathname, like "**/SOMETHING" or
"SOMETHING/**/SOMETHING"; in other words, where its use matches the
regexp "(^|/)\*\*/".  In these constructs the only ambiguity is whether
"**/" matches regexp

    "([^/]+/)+"

or

    "([^/]+/)*"

(e.g., whether "foo/**/bar" matches "foo/bar").  I personally prefer the
second, because the first behavior can be obtained under the second
interpretation by writing "SOMETHING/*/**/SOMETHING", whereas the second
behavior cannot be expressed in terms of the first in a single line of
the .gitignore file.
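
To make the difference between the two readings concrete, here is a
minimal sketch that hand-translates "foo/**/bar" under each of them and
also builds "foo/*/**/bar" under the second reading.  The names
(ONE_OR_MORE, ZERO_OR_MORE, compile_foo_bar, strict, loose, emulated)
are hypothetical and exist only for this illustration; "*" is assumed to
mean "[^/]*", as in the translation table further down.

```python
import re

# The two candidate regexps for "**/" discussed above.
ONE_OR_MORE = r'([^/]+/)+'
ZERO_OR_MORE = r'([^/]+/)*'


def compile_foo_bar(doublestar):
    """Hand-translate the glob "foo/**/bar" with the given regexp for "**/"."""
    return re.compile(r'^foo/' + doublestar + r'bar$')


strict = compile_foo_bar(ONE_OR_MORE)    # first reading
loose = compile_foo_bar(ZERO_OR_MORE)    # second reading

# "foo/*/**/bar" under the second reading: "*/" consumes one path
# component and "**/" consumes zero or more further components.
emulated = re.compile(r'^foo/[^/]*/' + ZERO_OR_MORE + r'bar$')

if __name__ == '__main__':
    for path in ['foo/bar', 'foo/a/bar', 'foo/a/b/bar', 'foobar']:
        print(path,
              bool(strict.match(path)),
              bool(loose.match(path)),
              bool(emulated.match(path)))
```

On paths without empty components, "emulated" should agree with
"strict", which is the sense in which the first behavior is recoverable
from the second.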

Optionally, one might also like to support "SOMETHING/**" or "**" alone
in the obvious ways.
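
The restriction itself is cheap to check.  The following is only a
sketch, with a hypothetical function name, under the assumption that
"**" is accepted exactly when it makes up a whole path component (which
covers "**/SOMETHING", "SOMETHING/**/SOMETHING", "SOMETHING/**" and "**"
alone) and rejected everywhere else:

```python
def doublestar_is_well_placed(pattern):
    """Return True if every '**' in pattern is a whole path component.

    Hypothetical helper for illustration; it is not part of git.
    """
    for component in pattern.split('/'):
        if '**' in component and component != '**':
            return False
    return True


if __name__ == '__main__':
    for pattern in ['**/foo', 'foo/**/bar', 'foo/**', '**',
                    'x**y', '**foo', '/**foo']:
        print(pattern, doublestar_is_well_placed(pattern))
```

A caller could die() or warn on patterns for which this returns False,
which would rule out "x**y" and "**foo" without assigning them a meaning.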

As for the implementation, it is quite easy to textually convert a glob
pattern, including "**" parts, into a regexp.  I happen to have written
some Python code that does this for another project (see below).  An
obvious optimization would be to read any literal parts of the path off
the beginning of the glob pattern and only use regexps for the tail
part.  Would a regexp-based implementation be too slow?

Michael

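import os
import re

# Regexp for a single character within a filename (anything except '/').
_filename_char_pattern = r'[^/]'

# (glob token, regexp) pairs.  They are tried in order, so the '**'
# forms have to come before the single '*'.
_glob_patterns = [
    ('?', _filename_char_pattern),
    ('/**', r'(/.+)?'),
    ('**/', r'(.+/)?'),
    ('*', _filename_char_pattern + r'*'),
    ]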


def glob_to_regexp(pattern):
    pattern = os.path.normpath(pattern) # remove trivial redundancies

    if pattern == '**':
        # This case has to be handled separately because it doesn't
        # involve a '/' character adjacent to the '**' pattern.  (Such
        # slashes otherwise have to be considered part of the pattern
        # to handle the matching of zero path components.)
        return re.compile(
            r'^' + _filename_char_pattern + r'(.+' +
            _filename_char_pattern + r')?$'
            )

    regexp = [r'^']
    i = 0
    while i < len(pattern):
        for (s, r) in _glob_patterns:
            if pattern.startswith(s, i):
                regexp.append(r)
                i += len(s)
                break
        else:
            # AFAIK it's a normal character.  Escape it and add it to
            # the regexp.
            regexp.append(re.escape(pattern[i]))
            i += 1

    regexp.append(r'$')

    return re.compile(''.join(regexp))
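
The literal-prefix optimization mentioned above could look roughly like
the sketch below.  Everything in it is hypothetical: the names
split_literal_prefix, make_matcher and simple_glob_to_regexp are made up
for the illustration, only '*' and '?' are treated as glob characters,
and simple_glob_to_regexp is a deliberately reduced stand-in translator
(no "**") so that the block runs on its own.  Any translator that
returns a compiled regexp could be passed in its place.

```python
import re

_GLOB_CHARS = '*?'


def split_literal_prefix(pattern):
    """Split pattern into (literal_prefix, glob_tail) at a '/' boundary."""
    components = pattern.split('/')
    n = 0
    while (n < len(components)
           and not any(c in components[n] for c in _GLOB_CHARS)):
        n += 1
    return '/'.join(components[:n]), '/'.join(components[n:])


def simple_glob_to_regexp(glob):
    """Stand-in translator for this sketch: handles '*' and '?' only."""
    parts = []
    for ch in glob:
        if ch == '*':
            parts.append(r'[^/]*')
        elif ch == '?':
            parts.append(r'[^/]')
        else:
            parts.append(re.escape(ch))
    return re.compile('^' + ''.join(parts) + '$')


def make_matcher(pattern, compile_glob):
    """Return a function path -> bool that compares the literal prefix
    as a plain string and uses a regexp only for the remaining tail."""
    prefix, tail = split_literal_prefix(pattern)
    if not tail:
        # No glob characters at all: a plain string comparison suffices.
        return lambda path: path == prefix
    tail_re = compile_glob(tail)
    if not prefix:
        return lambda path: tail_re.match(path) is not None
    lead = prefix + '/'

    def matcher(path):
        return (path.startswith(lead)
                and tail_re.match(path[len(lead):]) is not None)
    return matcher


if __name__ == '__main__':
    match = make_matcher('src/lib/*.c', simple_glob_to_regexp)
    for path in ['src/lib/a.c', 'src/lib/sub/a.c', 'src/libx/a.c']:
        print(path, match(path))
```

Splitting at a '/' boundary keeps the sketch simple, but it can disagree
with a whole-pattern translation in edge cases where a token spans the
boundary, for example a trailing "/**" that is meant to match zero
components.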




-- 
Michael Haggerty
mhagger@alum.mit.edu
http://softwareswirl.blogspot.com/
