From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michael Haggerty Subject: Re: What's cooking in git.git (Oct 2012, #01; Tue, 2) Date: Thu, 04 Oct 2012 11:34:47 +0200 Message-ID: <506D5837.6020708@alum.mit.edu> References: <7vmx045umh.fsf@alter.siamese.dyndns.org> <7vbogj5sji.fsf@alter.siamese.dyndns.org> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: Nguyen Thai Ngoc Duy , git@vger.kernel.org To: Junio C Hamano X-From: git-owner@vger.kernel.org Fri Oct 05 00:13:54 2012 Return-path: Envelope-to: gcvg-git-2@plane.gmane.org Received: from vger.kernel.org ([209.132.180.67]) by plane.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1TJtVj-0001w8-Ac for gcvg-git-2@plane.gmane.org; Fri, 05 Oct 2012 00:03:15 +0200 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755843Ab2JDJew (ORCPT ); Thu, 4 Oct 2012 05:34:52 -0400 Received: from ALUM-MAILSEC-SCANNER-5.MIT.EDU ([18.7.68.17]:56214 "EHLO alum-mailsec-scanner-5.mit.edu" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755293Ab2JDJeu (ORCPT ); Thu, 4 Oct 2012 05:34:50 -0400 X-AuditID: 12074411-b7fa36d0000008cc-8e-506d5839d5e1 Received: from outgoing-alum.mit.edu (OUTGOING-ALUM.MIT.EDU [18.7.68.33]) by alum-mailsec-scanner-5.mit.edu (Symantec Messaging Gateway) with SMTP id 21.15.02252.9385D605; Thu, 4 Oct 2012 05:34:49 -0400 (EDT) Received: from [192.168.101.152] (ssh.berlin.jpk.com [212.222.128.135]) (authenticated bits=0) (User authenticated as mhagger@ALUM.MIT.EDU) by outgoing-alum.mit.edu (8.13.8/8.12.4) with ESMTP id q949YlWl002818 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NOT); Thu, 4 Oct 2012 05:34:49 -0400 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:15.0) Gecko/20120827 Thunderbird/15.0 In-Reply-To: <7vbogj5sji.fsf@alter.siamese.dyndns.org> X-Brightmail-Tracker: H4sIAAAAAAAAA+NgFlrOKsWRmVeSWpSXmKPExsUixO6iqGsZkRtg8OanuUXXlW4mi4beK8wW 3VPeMjowe+ycdZfd4+IlZY/Pm+QCmKO4bZISS8qCM9Pz9O0SuDN+rPzBXrBDuaKn8RhLA2OP dBcjJ4eEgInEtcl32CFsMYkL99azdTFycQgJXGaU2DbpNCtIQkjgGJNEw9fwLkYODl4BbYnZ vx1BTBYBVYm/z7RBKtgEdCUW9TQzgdiiAiESMy5PZgaxeQUEJU7OfMICYosIqElMbDsEZjML OEqcWrYZrF5YwE7iy4lXLBBrlzBKvLz9E6yZU8BMou/DN3aIBh2Jd30PmCFseYntb+cwT2AU mIVkxywkZbOQlC1gZF7FKJeYU5qrm5uYmVOcmqxbnJyYl5dapGuql5tZopeaUrqJERK4gjsY Z5yUO8QowMGoxMOr3ZITIMSaWFZcmXuIUZKDSUmUVzg8N0CILyk/pTIjsTgjvqg0J7X4EKME B7OSCO9EH6Acb0piZVVqUT5MSpqDRUmcl2+Jup+QQHpiSWp2ampBahFMVoaDQ0mCNxhkqGBR anpqRVpmTglCmomDE2Q4l5RIcWpeSmpRYmlJRjwoTuOLgZEKkuIB2qsC0s5bXJCYCxSFaD3F aMlx9M3ch4wcJ/6ByI+N8x4yCrHk5eelSonzaoI0CIA0ZJTmwa2Dpa9XjOJA3wvzuoJU8QBT H9zUV0ALmYAWrtDNAllYkoiQkmpgnH6b42RjFstfxd8TCy4/nS/brb58vVW56flzkY8n79m+ vGKZt971dbWtXbysQcdLYk5a1Hi1R6zy9F/7r3Rq2sbTKwuV2yzWRRen96rlXuAO Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Archived-At: On 10/03/2012 08:17 PM, Junio C Hamano wrote: > Nguyen Thai Ngoc Duy writes: > >> There's an interesting case: "**foo". According to our rules, that >> pattern does not contain slashes therefore is basename match. But some >> might find that confusing because "**" can match slashes,... > > By "our rules", if you mean "if a pattern has slash, it is anchored", > that obviously need to be updated with this series, if "**" is meant > to match multiple hierarchies. >> I think the latter makes more sense. When users put "**" they expect >> to match some slashes. But that may call for a refactoring in >> path_matches() in attr.c. Putting strstr(pattern, "**") in that >> matching function may increase overhead unnecessarily. >> >> The third option is just die() and let users decide either "*foo", >> "**/foo" or "/**foo", never "**foo". > > For the double-star at the beginning, you should just turn it into "**/" > if it is not followed by a slash internally, I think. > > What is the semantics of ** in the first place? Is it described to > a reasonable level of detail in the documentation updates? For > example does "**foo" match "afoo", "a/b/foo", "a/bfoo", "a/foo/b", > "a/bfoo/c"? Does "x**y" match "xy", "xay", "xa/by", "x/a/y"? > > I am guessing that the only sensible definition is that "**" > requires anything that comes before it (if exists) is at a proper > hierarchy boundary, and anything matches it is also at a proper > hierarchy boundary, so "x**y" matches "x/a/y" and not "xy", "xay", > nor "xa/by" in the above example. If "x**y" can match "xy" or "xay" > (or "**foo" can match "afoo"), it would be unreasonable to say it > implies the pattern is anchored at any level, no? Given that there is no obvious interpretation for what a construct like "x**y" would mean, and many plausible guesses (most of which sound rather useless), I suggest that we forbid it. This will make the feature easier to explain and make .gitignore files that use it easier to understand. I think that 98% of the usefulness of "**" would be in constructs where it replaces a proper part of the pathname, like "**/SOMETHING" or "SOMETHING/**/SOMETHING"; in other words, where its use matches the regexp "(^|/)\*\*/". In these constructs the only ambiguity is whether "**/" matches regexp "([^/]+/)+" or "([^/]+/)*" (e.g., whether "foo/**/bar" matches "foo/bar"). I personally prefer the second, because the first behavior can be had using the second interpretation by using "SOMETHING/*/**/SOMETHING", whereas the second behavior cannot be implemented in terms of the first in a single line of the .gitignore file. Optionally, one might also like to support "SOMETHING/**" or "**" alone in the obvious ways. As for the implementation, it is quite easy to textually convert a glob pattern, including "**" parts, into a regexp. I happen to have written some Python code that does this for another project (see below). An obvious optimization would be to read any literal parts of the path off the beginning of the glob pattern and only use regexps for the tail part. Would a regexp-based implementation be too slow? Michael _filename_char_pattern = r'[^/]' _glob_patterns = [ ('?', _filename_char_pattern), ('/**', r'(/.+)?'), ('**/', r'(.+/)?'), ('*', _filename_char_pattern + r'*'), ] def glob_to_regexp(pattern): pattern = os.path.normpath(pattern) # remove trivial redundancies if pattern == '**': # This case has to be handled separately because it doesn't # involve a '/' character adjacent to the '**' pattern. (Such # slashes otherwise have to be considered part of the pattern # to handle the matching of zero path components.) return re.compile( r'^' + _filename_char_pattern + r'(.+' + _filename_char_pattern + r')?$' ) regexp = [r'^'] i = 0 while i < len(pattern): for (s, r) in _glob_patterns: if pattern.startswith(s, i): regexp.append(r) i += len(s) break else: # AFAIK it's a normal character. Escape it and add it to # pattern. regexp.append(re.escape(pattern[i])) i += 1 regexp.append(r'$') return re.compile(''.join(regexp)) -- Michael Haggerty mhagger@alum.mit.edu http://softwareswirl.blogspot.com/