git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* Git grep does not support multi-byte characters (like UTF-8)
@ 2015-07-06 11:28 Plamen Totev
  2015-07-06 12:23 ` Duy Nguyen
  2015-07-06 12:42 ` [PATCH] grep: use regcomp() for icase search with non-ascii patterns Nguyễn Thái Ngọc Duy
  0 siblings, 2 replies; 53+ messages in thread
From: Plamen Totev @ 2015-07-06 11:28 UTC (permalink / raw)
  To: git

Hello, 

It looks like the git grep command does not support multi-byte character sets like UTF-8. As a result some of the grep functionality is not working. For example if you search for non Latin words the ignore case flag does not have effect(the search is case sensitive). I suspect there are some regular expressions that will not work as expected too. 

When I'm using the git from the shell I could use the GNU grep utility instead. But some tools like gitweb use git grep so they are not working properly with UTF-8 files with non Latin characters. 

Much of the grep code seems to be copied from the GNU grep utility but the multi-byte support is left out. I just wondered what could be the reason. 

Regards, 
Plamen Totev

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Git grep does not support multi-byte characters (like UTF-8)
  2015-07-06 11:28 Git grep does not support multi-byte characters (like UTF-8) Plamen Totev
@ 2015-07-06 12:23 ` Duy Nguyen
  2015-07-07  8:58   ` Plamen Totev
  2015-07-06 12:42 ` [PATCH] grep: use regcomp() for icase search with non-ascii patterns Nguyễn Thái Ngọc Duy
  1 sibling, 1 reply; 53+ messages in thread
From: Duy Nguyen @ 2015-07-06 12:23 UTC (permalink / raw)
  To: Plamen Totev; +Cc: Git Mailing List

On Mon, Jul 6, 2015 at 6:28 PM, Plamen Totev <plamen.totev@abv.bg> wrote:
> Hello,
>
> It looks like the git grep command does not support multi-byte character sets like UTF-8. As a result some of the grep functionality is not working. For example if you search for non Latin words the ignore case flag does not have effect(the search is case sensitive). I suspect there are some regular expressions that will not work as expected too.

I think we over-optimized a bit. If you your system provides regex
with locale support (e.g. Linux) and you don't explicitly use fallback
regex implementation, it should work. I suppose your test patterns
look "fixed" (i.e. no regex special characters)? Can you try just add
"." and see if case insensitive matching works?
-- 
Duy

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH] grep: use regcomp() for icase search with non-ascii patterns
  2015-07-06 11:28 Git grep does not support multi-byte characters (like UTF-8) Plamen Totev
  2015-07-06 12:23 ` Duy Nguyen
@ 2015-07-06 12:42 ` Nguyễn Thái Ngọc Duy
  2015-07-06 20:10   ` René Scharfe
  2015-07-08 10:38   ` [PATCH v2 0/9] icase match on non-ascii Nguyễn Thái Ngọc Duy
  1 sibling, 2 replies; 53+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2015-07-06 12:42 UTC (permalink / raw)
  To: git; +Cc: plamen.totev, Nguyễn Thái Ngọc Duy

Noticed-by: Plamen Totev <plamen.totev@abv.bg>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 grep.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/grep.c b/grep.c
index b58c7c6..48db15a 100644
--- a/grep.c
+++ b/grep.c
@@ -378,7 +378,7 @@ static void free_pcre_regexp(struct grep_pat *p)
 }
 #endif /* !USE_LIBPCRE */
 
-static int is_fixed(const char *s, size_t len)
+static int is_fixed(const char *s, size_t len, int ignore_icase)
 {
 	size_t i;
 
@@ -391,6 +391,13 @@ static int is_fixed(const char *s, size_t len)
 	for (i = 0; i < len; i++) {
 		if (is_regex_special(s[i]))
 			return 0;
+		/*
+		 * The builtin substring search can only deal with case
+		 * insensitivity in ascii range. If there is something outside
+		 * of that range, fall back to regcomp.
+		 */
+		if (ignore_icase && (unsigned char)s[i] >= 128)
+			return 0;
 	}
 
 	return 1;
@@ -398,18 +405,19 @@ static int is_fixed(const char *s, size_t len)
 
 static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
+	int ignore_icase = opt->regflags & REG_ICASE || p->ignore_case;
 	int err;
 
 	p->word_regexp = opt->word_regexp;
 	p->ignore_case = opt->ignore_case;
 
-	if (opt->fixed || is_fixed(p->pattern, p->patternlen))
+	if (opt->fixed || is_fixed(p->pattern, p->patternlen, ignore_icase))
 		p->fixed = 1;
 	else
 		p->fixed = 0;
 
 	if (p->fixed) {
-		if (opt->regflags & REG_ICASE || p->ignore_case)
+		if (ignore_case)
 			p->kws = kwsalloc(tolower_trans_tbl);
 		else
 			p->kws = kwsalloc(NULL);
-- 
2.3.0.rc1.137.g477eb31

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH] grep: use regcomp() for icase search with non-ascii patterns
  2015-07-06 12:42 ` [PATCH] grep: use regcomp() for icase search with non-ascii patterns Nguyễn Thái Ngọc Duy
@ 2015-07-06 20:10   ` René Scharfe
  2015-07-06 23:02     ` Duy Nguyen
  2015-07-08 10:38   ` [PATCH v2 0/9] icase match on non-ascii Nguyễn Thái Ngọc Duy
  1 sibling, 1 reply; 53+ messages in thread
From: René Scharfe @ 2015-07-06 20:10 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy, git; +Cc: plamen.totev

Am 06.07.2015 um 14:42 schrieb Nguyễn Thái Ngọc Duy:
> Noticed-by: Plamen Totev <plamen.totev@abv.bg>
> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
> ---
>   grep.c | 14 +++++++++++---
>   1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/grep.c b/grep.c
> index b58c7c6..48db15a 100644
> --- a/grep.c
> +++ b/grep.c
> @@ -378,7 +378,7 @@ static void free_pcre_regexp(struct grep_pat *p)
>   }
>   #endif /* !USE_LIBPCRE */
>
> -static int is_fixed(const char *s, size_t len)
> +static int is_fixed(const char *s, size_t len, int ignore_icase)
>   {
>   	size_t i;
>
> @@ -391,6 +391,13 @@ static int is_fixed(const char *s, size_t len)
>   	for (i = 0; i < len; i++) {
>   		if (is_regex_special(s[i]))
>   			return 0;
> +		/*
> +		 * The builtin substring search can only deal with case
> +		 * insensitivity in ascii range. If there is something outside
> +		 * of that range, fall back to regcomp.
> +		 */
> +		if (ignore_icase && (unsigned char)s[i] >= 128)
> +			return 0;

How about "isascii(s[i])"?

>   	}
>
>   	return 1;
> @@ -398,18 +405,19 @@ static int is_fixed(const char *s, size_t len)
>
>   static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
>   {
> +	int ignore_icase = opt->regflags & REG_ICASE || p->ignore_case;
>   	int err;
>
>   	p->word_regexp = opt->word_regexp;
>   	p->ignore_case = opt->ignore_case;

Using p->ignore_case before this line, as in initialization of the new 
variable ignore_icase above, changes the meaning.

>
> -	if (opt->fixed || is_fixed(p->pattern, p->patternlen))
> +	if (opt->fixed || is_fixed(p->pattern, p->patternlen, ignore_icase))
>   		p->fixed = 1;
>   	else
>   		p->fixed = 0;
>
>   	if (p->fixed) {
> -		if (opt->regflags & REG_ICASE || p->ignore_case)
> +		if (ignore_case)

ignore_icase instead?  ignore_case is for the config variable 
core.ignorecase.  Tricky.

>   			p->kws = kwsalloc(tolower_trans_tbl);
>   		else
>   			p->kws = kwsalloc(NULL);
>

So the optimization before this patch was that if a string was searched 
for without -F then it would be treated as a fixed string anyway unless 
it contained regex special characters.  Searching for fixed strings 
using the kwset functions is faster than using regcomp and regexec, 
which makes the exercise worthwhile.

Your patch disables the optimization if non-ASCII characters are 
searched for because kwset handles case transformations only for ASCII 
chars.

Another consequence of this limitation is that -Fi (explicit 
case-insensitive fixed-string search) doesn't work properly with 
non-ASCII chars neither.  How can we handle this one?  Fall back to 
regcomp by escaping all special characters?  Or at least warn?

Tests would be nice. :)

René

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH] grep: use regcomp() for icase search with non-ascii patterns
  2015-07-06 20:10   ` René Scharfe
@ 2015-07-06 23:02     ` Duy Nguyen
  2015-07-07 14:25       ` Plamen Totev
  0 siblings, 1 reply; 53+ messages in thread
From: Duy Nguyen @ 2015-07-06 23:02 UTC (permalink / raw)
  To: René Scharfe; +Cc: Git Mailing List, plamen.totev

On Tue, Jul 7, 2015 at 3:10 AM, René Scharfe <l.s.r@web.de> wrote:
> Am 06.07.2015 um 14:42 schrieb Nguyễn Thái Ngọc Duy:
>>
>> Noticed-by: Plamen Totev <plamen.totev@abv.bg>
>> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
>> ---
>>   grep.c | 14 +++++++++++---
>>   1 file changed, 11 insertions(+), 3 deletions(-)
>>
>> diff --git a/grep.c b/grep.c
>> index b58c7c6..48db15a 100644
>> --- a/grep.c
>> +++ b/grep.c
>> @@ -378,7 +378,7 @@ static void free_pcre_regexp(struct grep_pat *p)
>>   }
>>   #endif /* !USE_LIBPCRE */
>>
>> -static int is_fixed(const char *s, size_t len)
>> +static int is_fixed(const char *s, size_t len, int ignore_icase)
>>   {
>>         size_t i;
>>
>> @@ -391,6 +391,13 @@ static int is_fixed(const char *s, size_t len)
>>         for (i = 0; i < len; i++) {
>>                 if (is_regex_special(s[i]))
>>                         return 0;
>> +               /*
>> +                * The builtin substring search can only deal with case
>> +                * insensitivity in ascii range. If there is something
>> outside
>> +                * of that range, fall back to regcomp.
>> +                */
>> +               if (ignore_icase && (unsigned char)s[i] >= 128)
>> +                       return 0;
>
>
> How about "isascii(s[i])"?

Yes, better.

>
>>         }
>>
>>         return 1;
>> @@ -398,18 +405,19 @@ static int is_fixed(const char *s, size_t len)
>>
>>   static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
>>   {
>> +       int ignore_icase = opt->regflags & REG_ICASE || p->ignore_case;
>>         int err;
>>
>>         p->word_regexp = opt->word_regexp;
>>         p->ignore_case = opt->ignore_case;
>
>
> Using p->ignore_case before this line, as in initialization of the new
> variable ignore_icase above, changes the meaning.

Oops.

>> -       if (opt->fixed || is_fixed(p->pattern, p->patternlen))
>> +       if (opt->fixed || is_fixed(p->pattern, p->patternlen,
>> ignore_icase))
>>                 p->fixed = 1;
>>         else
>>                 p->fixed = 0;
>>
>>         if (p->fixed) {
>> -               if (opt->regflags & REG_ICASE || p->ignore_case)
>> +               if (ignore_case)
>
>
> ignore_icase instead?  ignore_case is for the config variable
> core.ignorecase.  Tricky.

Maybe we can test isascii separately and save the result in
has_non_ascii, then we can avoid ignore_(i)case

>
>>                         p->kws = kwsalloc(tolower_trans_tbl);
>>                 else
>>                         p->kws = kwsalloc(NULL);
>>
>
> So the optimization before this patch was that if a string was searched for
> without -F then it would be treated as a fixed string anyway unless it
> contained regex special characters.  Searching for fixed strings using the
> kwset functions is faster than using regcomp and regexec, which makes the
> exercise worthwhile.
>
> Your patch disables the optimization if non-ASCII characters are searched
> for because kwset handles case transformations only for ASCII chars.
>
> Another consequence of this limitation is that -Fi (explicit
> case-insensitive fixed-string search) doesn't work properly with non-ASCII
> chars neither.  How can we handle this one?  Fall back to regcomp by
> escaping all special characters?  Or at least warn?

Hehe.. I noticed it too shortly after sending the patch. I was torn
between simply documenting the limitation and waiting for the next
person to come and fix it, or quoting the regex then passing to
regcomp. GNU grep does the quoting in this case, but that code is
GPLv3 so we can't simply copy over. It could be a problem if we need
to quote a regex in a multibyte charset where ascii is not a subset.
But i guess we can just go with utf-8..

> Tests would be nice. :)

Yeah.. but we now rely on system regcomp which may behave differently
across platforms. Then we need some locale to be always there. Some
platforms (like Gentoo) even allow building glibc without i18n.. So
I'm not sure how we know when to test or skip.
-- 
Duy

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Git grep does not support multi-byte characters (like UTF-8)
  2015-07-06 12:23 ` Duy Nguyen
@ 2015-07-07  8:58   ` Plamen Totev
  2015-07-07 12:22     ` Duy Nguyen
  2015-07-07 16:07     ` Junio C Hamano
  0 siblings, 2 replies; 53+ messages in thread
From: Plamen Totev @ 2015-07-07  8:58 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Git Mailing List

Nguyen, thanks for the help and the patch. Also the escaping suggested by Scharfe seems as good choice. But i dig some more into the problem and I found some other thing. That's why I replied on the main thread not on the patch. I hope you'll excuse me if this is a bad practice.

git grep -i -P also does not works because the PCRE_UTF8 is not set and pcre library does not treat the string as UTF-8.

pickaxe search also uses kwsearch so the case insensitive search with it does not work (e.g. git log -i -S).  Maybe this is a less of a problem here as one is expected to search for exact string (hence knows the case)

There is a interesting corner case. is_fixed treats all patterns containing nulls as fixed. So what about if the string contains non-ASCII symbols as well as nulls and the search is case insensitive :) I have to admin that my knowledge in UTF-8 is not enough to answer the question if this could occur during normal usage. For example the second byte in multi-byte symbol is NULL. I would guess that's not true as it would break a lot of programs that depend on NULL delimited string but it's good if somebody could confirm.

GNU grep indeed uses escaped regular expressions when the string is using multi-byte encoding and the search is case insensitive. If the encoding is UTF-8 then this strategy could be used in git too. Especially that git already have support and helper functions to work with UTF-8. As for the other multi-byte encodings - I think the things would become more complicated. As far I know in UTF-8 the '{' character for example is two bytes not one. Maybe really a support could be added only for the UTF-8 and if the string is not UTF-8 to issue a warning.

So maybe the following makes sense when a grep search is performed:
* check if the multi-byte encoding is used. If it's and the search is case insensitive and the encoding is not UTF-8 give a warning;
* if pcre is used and the string is UTF-8 encoded set the PCRE_UTF8 flag;
* if the search is case insensitive, the string is fixed and the encoding  used is UTF-8 use regcomp instead of kwsearch and escape any regex special characters in the pattern;

And the question with the behavior of pickaxe search remains open. Using kwset does not work with case insensitive non-ASCII searches. Instead of fixing grep.c maybe it's better if new function is introduced that performs keyword searches so it could be used by both grep, diffcore-pickaxe and any other code in the future that may require such functionality. Or maybe diffcore-pickaxe should use grep instead of directly kwset/regcomp

Regards,
Plamen Totev



>-------- Оригинално писмо -------- 
>От: Duy Nguyen pclouds@gmail.com 
>Относно: Re: Git grep does not support multi-byte characters (like UTF-8) 
>До: Plamen Totev <plamen.totev@abv.bg> 
>Изпратено на: 06.07.2015 15:23 

> I think we over-optimized a bit. If you your system provides regex 
> with locale support (e.g. Linux) and you don't explicitly use fallback 
> regex implementation, it should work. I suppose your test patterns 
> look "fixed" (i.e. no regex special characters)? Can you try just add 
> "." and see if case insensitive matching works? 

This is indeed the problem. When I added the "." the matching works just fine.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Git grep does not support multi-byte characters (like UTF-8)
  2015-07-07  8:58   ` Plamen Totev
@ 2015-07-07 12:22     ` Duy Nguyen
  2015-07-07 16:07     ` Junio C Hamano
  1 sibling, 0 replies; 53+ messages in thread
From: Duy Nguyen @ 2015-07-07 12:22 UTC (permalink / raw)
  To: Plamen Totev; +Cc: Git Mailing List

On Tue, Jul 7, 2015 at 3:58 PM, Plamen Totev <plamen.totev@abv.bg> wrote:
> Nguyen, thanks for the help and the patch. Also the escaping suggested by Scharfe seems as good choice. But i dig some more into the problem and I found some other thing. That's why I replied on the main thread not on the patch. I hope you'll excuse me if this is a bad practice.

So far this is very good reporting. I can't complain :)

> git grep -i -P also does not works because the PCRE_UTF8 is not set and pcre library does not treat the string as UTF-8.

We do prefer utf-8, but i don't know if we can assume utf-8 everywhere
yet. I guess it's ok in this case.

> pickaxe search also uses kwsearch so the case insensitive search with it does not work (e.g. git log -i -S).  Maybe this is a less of a problem here as one is expected to search for exact string (hence knows the case)

Will fix (i'm close to being done with git-grep, not counting the pcre
bug you just found)

> There is a interesting corner case. is_fixed treats all patterns containing nulls as fixed. So what about if the string contains non-ASCII symbols as well as nulls and the search is case insensitive :) I have to admin that my knowledge in UTF-8 is not enough to answer the question if this could occur during normal usage. For example the second byte in multi-byte symbol is NULL. I would guess that's not true as it would break a lot of programs that depend on NULL delimited string but it's good if somebody could confirm.

For utf-8, if NUL occurs in a byte stream, it must be ASCII NUL, not
part of any multibyte character. Utf-8 is really well tuned for C
strings.

> GNU grep indeed uses escaped regular expressions when the string is using multi-byte encoding and the search is case insensitive. If the encoding is UTF-8 then this strategy could be used in git too. Especially that git already have support and helper functions to work with UTF-8. As for the other multi-byte encodings - I think the things would become more complicated. As far I know in UTF-8 the '{' character for example is two bytes not one. Maybe really a support could be added only for the UTF-8 and if the string is not UTF-8 to issue a warning.

In the worst case we could reuse the trick we do elsewhere in git:
convert to utf-8 with iconv, do whatever we need to (escaping...) then
convert back before feeding it to regcomp and friends.

> So maybe the following makes sense when a grep search is performed:
> * check if the multi-byte encoding is used. If it's and the search is case insensitive and the encoding is not UTF-8 give a warning;
> * if pcre is used and the string is UTF-8 encoded set the PCRE_UTF8 flag;
> * if the search is case insensitive, the string is fixed and the encoding  used is UTF-8 use regcomp instead of kwsearch and escape any regex special characters in the pattern;
>
> And the question with the behavior of pickaxe search remains open. Using kwset does not work with case insensitive non-ASCII searches. Instead of fixing grep.c maybe it's better if new function is introduced that performs keyword searches so it could be used by both grep, diffcore-pickaxe and any other code in the future that may require such functionality. Or maybe diffcore-pickaxe should use grep instead of directly kwset/regcomp

That would function be called "grep". More refactor would be needed.
"git grep regcomp" reveals some many places. Many some of them would
benefit from kws if we provide this new function you mentioned.

> Regards,
> Plamen Totev
>
>
>
>>-------- Оригинално писмо --------
>>От: Duy Nguyen pclouds@gmail.com
>>Относно: Re: Git grep does not support multi-byte characters (like UTF-8)
>>До: Plamen Totev <plamen.totev@abv.bg>
>>Изпратено на: 06.07.2015 15:23
>
>> I think we over-optimized a bit. If you your system provides regex
>> with locale support (e.g. Linux) and you don't explicitly use fallback
>> regex implementation, it should work. I suppose your test patterns
>> look "fixed" (i.e. no regex special characters)? Can you try just add
>> "." and see if case insensitive matching works?
>
> This is indeed the problem. When I added the "." the matching works just fine.



-- 
Duy

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH] grep: use regcomp() for icase search with non-ascii patterns
  2015-07-06 23:02     ` Duy Nguyen
@ 2015-07-07 14:25       ` Plamen Totev
  0 siblings, 0 replies; 53+ messages in thread
From: Plamen Totev @ 2015-07-07 14:25 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: René Scharfe, Git Mailing List


On 07.07. 2015 at 02:02, Duy Nguyen <pclouds@gmail.com> wrote: 
> On Tue, Jul 7, 2015 at 3:10 AM, René Scharfe <l.s.r@web.de> wrote: 
> > Am 06.07.2015 um 14:42 schrieb Nguyễn Thái Ngọc Duy: 

> > So the optimization before this patch was that if a string was searched for 
> > without -F then it would be treated as a fixed string anyway unless it 
> > contained regex special characters. Searching for fixed strings using the 
> > kwset functions is faster than using regcomp and regexec, which makes the 
> > exercise worthwhile. 
> > 
> > Your patch disables the optimization if non-ASCII characters are searched 
> > for because kwset handles case transformations only for ASCII chars. 
> > 
> > Another consequence of this limitation is that -Fi (explicit 
> > case-insensitive fixed-string search) doesn't work properly with non-ASCII 
> > chars neither. How can we handle this one? Fall back to regcomp by 
> > escaping all special characters? Or at least warn? 
> 
> Hehe.. I noticed it too shortly after sending the patch. I was torn 
> between simply documenting the limitation and waiting for the next 
> person to come and fix it, or quoting the regex then passing to 
> regcomp. GNU grep does the quoting in this case, but that code is 
> GPLv3 so we can't simply copy over. It could be a problem if we need 
> to quote a regex in a multibyte charset where ascii is not a subset. 
> But i guess we can just go with utf-8.. 

I played a little bit with the code and I came up with this function to escape
regular expressions in  utf-8. Hope it helps.

static void escape_regexp(const char *pattern, size_t len,
                char **new_pattern, size_t *new_len)
{
        const char *p = pattern;
        char *np = *new_pattern = xmalloc(2 * len);
        int chrlen;
        *new_len = len;

        while (len) {
                chrlen = mbs_chrlen(&p, &len, "utf-8");
                if (chrlen == 1 && is_regex_special(*pattern))
                        *np++ = '\\';

                memcpy(np, pattern, chrlen);
                np += chrlen;
                pattern = p;
        }

        *new_len = np - *new_pattern;
}

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Git grep does not support multi-byte characters (like UTF-8)
  2015-07-07  8:58   ` Plamen Totev
  2015-07-07 12:22     ` Duy Nguyen
@ 2015-07-07 16:07     ` Junio C Hamano
  2015-07-07 18:08       ` Plamen Totev
  1 sibling, 1 reply; 53+ messages in thread
From: Junio C Hamano @ 2015-07-07 16:07 UTC (permalink / raw)
  To: Plamen Totev; +Cc: Duy Nguyen, Git Mailing List

Plamen Totev <plamen.totev@abv.bg> writes:

> pickaxe search also uses kwsearch so the case insensitive search with
> it does not work (e.g. git log -i -S).  Maybe this is a less of a
> problem here as one is expected to search for exact string (hence
> knows the case)

You reasoned correctly, I think.  Pickaxe, as one of the building
blocks to implement Linus's ultimate change tracking tool [*1*],
should never pay attention to "-i".  It is a step to finding the
commit that touches the exact code block given (i.e. "how do you
drill down?" part of $gmane/217 message).

Thanks.

[Footnote]
*1* http://article.gmane.org/gmane.comp.version-control.git/217

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Git grep does not support multi-byte characters (like UTF-8)
  2015-07-07 16:07     ` Junio C Hamano
@ 2015-07-07 18:08       ` Plamen Totev
  2015-07-08  2:19         ` Duy Nguyen
  0 siblings, 1 reply; 53+ messages in thread
From: Plamen Totev @ 2015-07-07 18:08 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Duy Nguyen, Git Mailing List

Junio C Hamano <gitster@pobox.com> writes:

> Plamen Totev <plamen.totev@abv.bg> writes: 
> 
> > pickaxe search also uses kwsearch so the case insensitive search with 
> > it does not work (e.g. git log -i -S). Maybe this is a less of a 
> > problem here as one is expected to search for exact string (hence 
> > knows the case) 
> 
> You reasoned correctly, I think. Pickaxe, as one of the building 
> blocks to implement Linus's ultimate change tracking tool [*1*], 
> should never pay attention to "-i". It is a step to finding the 
> commit that touches the exact code block given (i.e. "how do you 
> drill down?" part of $gmane/217 message). 
> 
> Thanks. 
> 
> [Footnote] 
> *1* http://article.gmane.org/gmane.comp.version-control.git/217

Now that I read the link again and gave the matter a thought I'm not so sure.
In some contexts the case of the words does not matter. In SQL for example.
Let's consider a SQL script file that contains the following line:

select name, address from customers;

At some point we decide to change the coding style to:

SELECT name, address FROM customers;

What should pickaxe search return - the first commit where the line is introduced
or the commit with the refactoring? From this point of view I think the -i switch makes sense.
The SQL is not the only case insensitive language - BASIC and Pascal come into my mind 
(those two I was using while I was in the high school :)).

Also I think it makes sense (maybe even more?) for natural languages.
For example after editing a text a sentence could be split into two.
Then the first word of the second sentence may change its case.
Of course the natural languages always  complicate the things a bit.
An ultimate tracking tools should be able to handle typo fixes, punctuation changes, etc.

But I'm getting a bit off-topic. What I wanted to say is that in some contexts it makes sense
(at least to me) to have case insensitive pickaxe search.
Or I'm missing something and there is a better tools to use is such cases?

Regards,
Plamen Totev

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Git grep does not support multi-byte characters (like UTF-8)
  2015-07-07 18:08       ` Plamen Totev
@ 2015-07-08  2:19         ` Duy Nguyen
  2015-07-08  4:52           ` Junio C Hamano
  0 siblings, 1 reply; 53+ messages in thread
From: Duy Nguyen @ 2015-07-08  2:19 UTC (permalink / raw)
  To: Plamen Totev; +Cc: Junio C Hamano, Git Mailing List

On Wed, Jul 8, 2015 at 1:08 AM, Plamen Totev <plamen.totev@abv.bg> wrote:
> Junio C Hamano <gitster@pobox.com> writes:
>
>> Plamen Totev <plamen.totev@abv.bg> writes:
>>
>> > pickaxe search also uses kwsearch so the case insensitive search with
>> > it does not work (e.g. git log -i -S). Maybe this is a less of a
>> > problem here as one is expected to search for exact string (hence
>> > knows the case)
>>
>> You reasoned correctly, I think. Pickaxe, as one of the building
>> blocks to implement Linus's ultimate change tracking tool [*1*],
>> should never pay attention to "-i". It is a step to finding the
>> commit that touches the exact code block given (i.e. "how do you
>> drill down?" part of $gmane/217 message).
>>
>> Thanks.
>>
>> [Footnote]
>> *1* http://article.gmane.org/gmane.comp.version-control.git/217
>
> Now that I read the link again and gave the matter a thought I'm not so sure.
> In some contexts the case of the words does not matter. In SQL for example.
> Let's consider a SQL script file that contains the following line:
>
> select name, address from customers;
>
> At some point we decide to change the coding style to:
>
> SELECT name, address FROM customers;

On top of this, pickaxe already supports icase even kws is used. But
it only works for ascii, so either we fix it and support non-ascii, or
we remove icase support entirely from diffcore_pickaxe(). I vote the
former.
-- 
Duy

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: Git grep does not support multi-byte characters (like UTF-8)
  2015-07-08  2:19         ` Duy Nguyen
@ 2015-07-08  4:52           ` Junio C Hamano
  0 siblings, 0 replies; 53+ messages in thread
From: Junio C Hamano @ 2015-07-08  4:52 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Plamen Totev, Git Mailing List

Duy Nguyen <pclouds@gmail.com> writes:

> On top of this, pickaxe already supports icase even kws is used. But
> it only works for ascii, so either we fix it and support non-ascii, or
> we remove icase support entirely from diffcore_pickaxe(). I vote the
> former.

I think that is a different issue.  The pickaxe has a single very
narrowly-defined intended use case [*1*] and I do not care too much
how any use that is outside the intended use case behaves.  As long
as its intended use case does not suffer (1) correctness-wise, (2)
performance-wise and (3) code-cleanliness-wise, due to changes to
support such enhancements, I am perfectly fine.

Ascii-only icase match is one example of a feature that is outside
the intended use case, and implementation of it using kws is nearly
free if I am not mistaken, not making the primary use case suffer in
any way.

I however am highly skeptical that the same thing can be done with
non-ascii icase.  As long as it can be added without makinng the
primary use case suffer in any way, I do not mind it very much.

Thanks.


[Footnote]

*1* The requirement is very simple.  You get a string that is unique
in a blob that exists at the revision your traversal begins, and you
want to find the point where the blob at the corresponding path does
not have that exact string with minimal effort.  You do not need to
ensure that the input string is unique (it is a user error and the
behaviour is undefined) and for simplicity you are also allowed to
fire when the blob has more than one copies of the string (even
though the expected use is to find the place where the blob has
zero).

Any other cases, e.g. the string was not unique in the blob, the
user specified "ignore-case" and other irrelevant options, are
allowed to be incorrect or slow or both, as $gmane/217 does not need
such uses to implement it ;-)

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v2 0/9] icase match on non-ascii
  2015-07-06 12:42 ` [PATCH] grep: use regcomp() for icase search with non-ascii patterns Nguyễn Thái Ngọc Duy
  2015-07-06 20:10   ` René Scharfe
@ 2015-07-08 10:38   ` Nguyễn Thái Ngọc Duy
  2015-07-08 10:38     ` [PATCH v2 1/9] grep: allow -F -i combination Nguyễn Thái Ngọc Duy
                       ` (11 more replies)
  1 sibling, 12 replies; 53+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2015-07-08 10:38 UTC (permalink / raw)
  To: git
  Cc: plamen.totev, l.s.r, Junio C Hamano,
	Nguyễn Thái Ngọc Duy

This series fix case insensitive matching on non-ascii charsets.
"grep -i", "grep -F -i", "grep --pcre-regexp -i" and "log -i -S" are
fixed.

Side note, I almost added the third has_non_ascii() function. Maybe we
should consider merging the two existing has_non_ascii() functions
back, or rename one to something else.

Patch 5 is "funny". The patch itself is in iso-8859-1, but my name in
the commit message is in utf-8. Let's see how it goes on the apply
side. We may need to fix something in this area..

Nguyễn Thái Ngọc Duy (9):
  grep: allow -F -i combination
  grep: break down an "if" stmt in preparation for next changes
  grep/icase: avoid kwsset on literal non-ascii strings
  grep/icase: avoid kwsset when -F is specified
  grep/pcre: prepare locale-dependent tables for icase matching
  gettext: add is_utf8_locale()
  grep/pcre: support utf-8
  diffcore-pickaxe: "share" regex error handling code
  diffcore-pickaxe: support case insensitive match on non-ascii

 builtin/grep.c                           |  2 +-
 diffcore-pickaxe.c                       | 27 ++++++++++-----
 gettext.c                                |  7 +++-
 gettext.h                                |  5 +++
 grep.c                                   | 43 ++++++++++++++++++++---
 grep.h                                   |  1 +
 quote.c                                  | 37 ++++++++++++++++++++
 quote.h                                  |  1 +
 t/t7812-grep-icase-non-ascii.sh (new +x) | 58 ++++++++++++++++++++++++++++++++
 t/t7813-grep-icase-non-ascii.sh (new +x) | 19 +++++++++++
 10 files changed, 186 insertions(+), 14 deletions(-)
 create mode 100755 t/t7812-grep-icase-non-ascii.sh
 create mode 100755 t/t7813-grep-icase-non-ascii.sh

-- 
2.3.0.rc1.137.g477eb31

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v2 1/9] grep: allow -F -i combination
  2015-07-08 10:38   ` [PATCH v2 0/9] icase match on non-ascii Nguyễn Thái Ngọc Duy
@ 2015-07-08 10:38     ` Nguyễn Thái Ngọc Duy
  2015-07-08 10:38     ` [PATCH v2 2/9] grep: break down an "if" stmt in preparation for next changes Nguyễn Thái Ngọc Duy
                       ` (10 subsequent siblings)
  11 siblings, 0 replies; 53+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2015-07-08 10:38 UTC (permalink / raw)
  To: git
  Cc: plamen.totev, l.s.r, Junio C Hamano,
	Nguyễn Thái Ngọc Duy

-F means "no regex", not "case sensitive" so it should not override -i

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 builtin/grep.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index d04f440..2d392e9 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -806,7 +806,7 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
 
 	if (!opt.pattern_list)
 		die(_("no pattern given."));
-	if (!opt.fixed && opt.ignore_case)
+	if (opt.ignore_case)
 		opt.regflags |= REG_ICASE;
 
 	compile_grep_patterns(&opt);
-- 
2.3.0.rc1.137.g477eb31

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v2 2/9] grep: break down an "if" stmt in preparation for next changes
  2015-07-08 10:38   ` [PATCH v2 0/9] icase match on non-ascii Nguyễn Thái Ngọc Duy
  2015-07-08 10:38     ` [PATCH v2 1/9] grep: allow -F -i combination Nguyễn Thái Ngọc Duy
@ 2015-07-08 10:38     ` Nguyễn Thái Ngọc Duy
  2015-07-08 10:38     ` [PATCH v2 3/9] grep/icase: avoid kwsset on literal non-ascii strings Nguyễn Thái Ngọc Duy
                       ` (9 subsequent siblings)
  11 siblings, 0 replies; 53+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2015-07-08 10:38 UTC (permalink / raw)
  To: git
  Cc: plamen.totev, l.s.r, Junio C Hamano,
	Nguyễn Thái Ngọc Duy

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 grep.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/grep.c b/grep.c
index b58c7c6..bd32f66 100644
--- a/grep.c
+++ b/grep.c
@@ -403,9 +403,11 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 	p->word_regexp = opt->word_regexp;
 	p->ignore_case = opt->ignore_case;
 
-	if (opt->fixed || is_fixed(p->pattern, p->patternlen))
+	if (is_fixed(p->pattern, p->patternlen))
 		p->fixed = 1;
-	else
+	else if (opt->fixed) {
+		p->fixed = 1;
+	} else
 		p->fixed = 0;
 
 	if (p->fixed) {
-- 
2.3.0.rc1.137.g477eb31

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v2 3/9] grep/icase: avoid kwsset on literal non-ascii strings
  2015-07-08 10:38   ` [PATCH v2 0/9] icase match on non-ascii Nguyễn Thái Ngọc Duy
  2015-07-08 10:38     ` [PATCH v2 1/9] grep: allow -F -i combination Nguyễn Thái Ngọc Duy
  2015-07-08 10:38     ` [PATCH v2 2/9] grep: break down an "if" stmt in preparation for next changes Nguyễn Thái Ngọc Duy
@ 2015-07-08 10:38     ` Nguyễn Thái Ngọc Duy
  2015-07-08 10:38     ` [PATCH v2 4/9] grep/icase: avoid kwsset when -F is specified Nguyễn Thái Ngọc Duy
                       ` (8 subsequent siblings)
  11 siblings, 0 replies; 53+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2015-07-08 10:38 UTC (permalink / raw)
  To: git
  Cc: plamen.totev, l.s.r, Junio C Hamano,
	Nguyễn Thái Ngọc Duy

When we detect the pattern is just a literal string, we avoid heavy
regex engine and use fast substring search implemented in kwsset.c.
But kws uses git-ctype which is locale-independent so it does not know
how to fold case properly outside ascii range. Let regcomp or pcre
take care of this case instead. Slower, but accurate.

Helped-by: René Scharfe <l.s.r@web.de>
Noticed-by: Plamen Totev <plamen.totev@abv.bg>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 grep.c                                   |  7 ++++++-
 t/t7812-grep-icase-non-ascii.sh (new +x) | 19 +++++++++++++++++++
 2 files changed, 25 insertions(+), 1 deletion(-)
 create mode 100755 t/t7812-grep-icase-non-ascii.sh

diff --git a/grep.c b/grep.c
index bd32f66..d795b0e 100644
--- a/grep.c
+++ b/grep.c
@@ -4,6 +4,7 @@
 #include "xdiff-interface.h"
 #include "diff.h"
 #include "diffcore.h"
+#include "commit.h"
 
 static int grep_source_load(struct grep_source *gs);
 static int grep_source_is_binary(struct grep_source *gs);
@@ -398,12 +399,16 @@ static int is_fixed(const char *s, size_t len)
 
 static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
+	int icase_non_ascii;
 	int err;
 
 	p->word_regexp = opt->word_regexp;
 	p->ignore_case = opt->ignore_case;
+	icase_non_ascii =
+		(opt->regflags & REG_ICASE || p->ignore_case) &&
+		has_non_ascii(p->pattern);
 
-	if (is_fixed(p->pattern, p->patternlen))
+	if (!icase_non_ascii && is_fixed(p->pattern, p->patternlen))
 		p->fixed = 1;
 	else if (opt->fixed) {
 		p->fixed = 1;
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
new file mode 100755
index 0000000..63a2630
--- /dev/null
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -0,0 +1,19 @@
+#!/bin/sh
+
+test_description='grep icase on non-English locales'
+
+. ./lib-gettext.sh
+
+test_expect_success GETTEXT_LOCALE 'setup' '
+	printf "TILRAUN: Halló Heimur!" >file &&
+	git add file &&
+	LC_ALL="$is_IS_locale" &&
+	export LC_ALL
+'
+
+test_expect_success GETTEXT_LOCALE 'grep literal string, no -F' '
+	git grep -i "TILRAUN: Halló Heimur!" &&
+	git grep -i "TILRAUN: HALLÓ HEIMUR!"
+'
+
+test_done
-- 
2.3.0.rc1.137.g477eb31

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v2 4/9] grep/icase: avoid kwsset when -F is specified
  2015-07-08 10:38   ` [PATCH v2 0/9] icase match on non-ascii Nguyễn Thái Ngọc Duy
                       ` (2 preceding siblings ...)
  2015-07-08 10:38     ` [PATCH v2 3/9] grep/icase: avoid kwsset on literal non-ascii strings Nguyễn Thái Ngọc Duy
@ 2015-07-08 10:38     ` Nguyễn Thái Ngọc Duy
  2015-07-08 10:38     ` [PATCH v2 5/9] grep/pcre: prepare locale-dependent tables for icase matching Nguyễn Thái Ngọc Duy
                       ` (7 subsequent siblings)
  11 siblings, 0 replies; 53+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2015-07-08 10:38 UTC (permalink / raw)
  To: git
  Cc: plamen.totev, l.s.r, Junio C Hamano,
	Nguyễn Thái Ngọc Duy

Similar to the previous commit, we can't use kws on icase search
outside ascii range. But we can't simply pass the pattern to
regcomp/pcre like the previous commit because it may contain regex
special characters, so we need to quote the regex first.

To avoid misquote traps that could lead to undefined behavior, we
always stick to basic regex engine in this case. We don't need fancy
features for grepping a literal string anyway.

basic_regex_quote_buf() assumes that if the pattern is in a multibyte
encoding, ascii chars must be unambiguously encoded as single
bytes. This is true at least for UTF-8. For others, let's wait until
people yell up. Chances are nobody uses multibyte, non utf-8 charsets
any more..

Helped-by: René Scharfe <l.s.r@web.de>
Noticed-by: Plamen Totev <plamen.totev@abv.bg>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 grep.c                          | 25 ++++++++++++++++++++++++-
 quote.c                         | 37 +++++++++++++++++++++++++++++++++++++
 quote.h                         |  1 +
 t/t7812-grep-icase-non-ascii.sh | 26 ++++++++++++++++++++++++++
 4 files changed, 88 insertions(+), 1 deletion(-)

diff --git a/grep.c b/grep.c
index d795b0e..8fce54f 100644
--- a/grep.c
+++ b/grep.c
@@ -5,6 +5,7 @@
 #include "diff.h"
 #include "diffcore.h"
 #include "commit.h"
+#include "quote.h"
 
 static int grep_source_load(struct grep_source *gs);
 static int grep_source_is_binary(struct grep_source *gs);
@@ -397,6 +398,24 @@ static int is_fixed(const char *s, size_t len)
 	return 1;
 }
 
+static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt)
+{
+	struct strbuf sb = STRBUF_INIT;
+	int err;
+
+	basic_regex_quote_buf(&sb, p->pattern);
+	err = regcomp(&p->regexp, sb.buf, opt->regflags & ~REG_EXTENDED);
+	if (opt->debug)
+		fprintf(stderr, "fixed%s\n", sb.buf);
+	strbuf_release(&sb);
+	if (err) {
+		char errbuf[1024];
+		regerror(err, &p->regexp, errbuf, 1024);
+		regfree(&p->regexp);
+		compile_regexp_failed(p, errbuf);
+	}
+}
+
 static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
 	int icase_non_ascii;
@@ -411,7 +430,11 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 	if (!icase_non_ascii && is_fixed(p->pattern, p->patternlen))
 		p->fixed = 1;
 	else if (opt->fixed) {
-		p->fixed = 1;
+		p->fixed = !icase_non_ascii;
+		if (!p->fixed) {
+			compile_fixed_regexp(p, opt);
+			return;
+		}
 	} else
 		p->fixed = 0;
 
diff --git a/quote.c b/quote.c
index 7920e18..43a8057 100644
--- a/quote.c
+++ b/quote.c
@@ -439,3 +439,40 @@ void tcl_quote_buf(struct strbuf *sb, const char *src)
 	}
 	strbuf_addch(sb, '"');
 }
+
+void basic_regex_quote_buf(struct strbuf *sb, const char *src)
+{
+	char c;
+
+	if (*src == '^') {
+		/* only beginning '^' is special and needs quoting */
+		strbuf_addch(sb, '\\');
+		strbuf_addch(sb, *src++);
+	}
+	if (*src == '*')
+		/* beginning '*' is not special, no quoting */
+		strbuf_addch(sb, *src++);
+
+	while ((c = *src++)) {
+		switch (c) {
+		case '[':
+		case '.':
+		case '\\':
+		case '*':
+			strbuf_addch(sb, '\\');
+			strbuf_addch(sb, c);
+			break;
+
+		case '$':
+			/* only the end '$' is special and needs quoting */
+			if (*src == '\0')
+				strbuf_addch(sb, '\\');
+			strbuf_addch(sb, c);
+			break;
+
+		default:
+			strbuf_addch(sb, c);
+			break;
+		}
+	}
+}
diff --git a/quote.h b/quote.h
index 99e04d3..362d315 100644
--- a/quote.h
+++ b/quote.h
@@ -67,5 +67,6 @@ extern char *quote_path_relative(const char *in, const char *prefix,
 extern void perl_quote_buf(struct strbuf *sb, const char *src);
 extern void python_quote_buf(struct strbuf *sb, const char *src);
 extern void tcl_quote_buf(struct strbuf *sb, const char *src);
+extern void basic_regex_quote_buf(struct strbuf *sb, const char *src);
 
 #endif
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index 63a2630..c945589 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -16,4 +16,30 @@ test_expect_success GETTEXT_LOCALE 'grep literal string, no -F' '
 	git grep -i "TILRAUN: HALLÓ HEIMUR!"
 '
 
+test_expect_success GETTEXT_LOCALE 'grep literal string, with -F' '
+	git grep --debug -i -F "TILRAUN: Halló Heimur!"  2>&1 >/dev/null |
+		 grep fixed >debug1 &&
+	echo "fixedTILRAUN: Halló Heimur!" >expect1 &&
+	test_cmp expect1 debug1 &&
+
+	git grep --debug -i -F "TILRAUN: HALLÓ HEIMUR!"  2>&1 >/dev/null |
+		 grep fixed >debug2 &&
+	echo "fixedTILRAUN: HALLÓ HEIMUR!" >expect2 &&
+	test_cmp expect2 debug2
+'
+
+test_expect_success GETTEXT_LOCALE 'grep string with regex, with -F' '
+	printf "^*TILR^AUN:.* \\Halló \$He[]imur!\$" >file &&
+
+	git grep --debug -i -F "^*TILR^AUN:.* \\Halló \$He[]imur!\$" 2>&1 >/dev/null |
+		 grep fixed >debug1 &&
+	echo "fixed\\^*TILR^AUN:\\.\\* \\\\Halló \$He\\[]imur!\\\$" >expect1 &&
+	test_cmp expect1 debug1 &&
+
+	git grep --debug -i -F "^*TILR^AUN:.* \\HALLÓ \$HE[]IMUR!\$"  2>&1 >/dev/null |
+		 grep fixed >debug2 &&
+	echo "fixed\\^*TILR^AUN:\\.\\* \\\\HALLÓ \$HE\\[]IMUR!\\\$" >expect2 &&
+	test_cmp expect2 debug2
+'
+
 test_done
-- 
2.3.0.rc1.137.g477eb31

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v2 5/9] grep/pcre: prepare locale-dependent tables for icase matching
  2015-07-08 10:38   ` [PATCH v2 0/9] icase match on non-ascii Nguyễn Thái Ngọc Duy
                       ` (3 preceding siblings ...)
  2015-07-08 10:38     ` [PATCH v2 4/9] grep/icase: avoid kwsset when -F is specified Nguyễn Thái Ngọc Duy
@ 2015-07-08 10:38     ` Nguyễn Thái Ngọc Duy
  2015-07-08 11:00       ` Duy Nguyen
  2015-07-08 10:38     ` [PATCH v2 6/9] gettext: add is_utf8_locale() Nguyễn Thái Ngọc Duy
                       ` (6 subsequent siblings)
  11 siblings, 1 reply; 53+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2015-07-08 10:38 UTC (permalink / raw)
  To: git
  Cc: plamen.totev, l.s.r, Junio C Hamano,
	Nguyễn Thái Ngọc Duy

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=UTF-8, Size: 2371 bytes --]

The default tables are usually built with C locale and only suitable
for LANG=C or similar.  This should make case insensitive search work
correctly for all single-byte charsets.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 grep.c                                   |  7 +++++--
 grep.h                                   |  1 +
 t/t7813-grep-icase-non-ascii.sh (new +x) | 19 +++++++++++++++++++
 3 files changed, 25 insertions(+), 2 deletions(-)
 create mode 100755 t/t7813-grep-icase-non-ascii.sh

diff --git a/grep.c b/grep.c
index 8fce54f..c79aa70 100644
--- a/grep.c
+++ b/grep.c
@@ -324,11 +324,13 @@ static void compile_pcre_regexp(struct grep_pat *p, const struct grep_opt *opt)
 	int erroffset;
 	int options = PCRE_MULTILINE;
 
-	if (opt->ignore_case)
+	if (opt->ignore_case) {
+		p->pcre_tables = pcre_maketables();
 		options |= PCRE_CASELESS;
+	}
 
 	p->pcre_regexp = pcre_compile(p->pattern, options, &error, &erroffset,
-			NULL);
+				      p->pcre_tables);
 	if (!p->pcre_regexp)
 		compile_regexp_failed(p, error);
 
@@ -362,6 +364,7 @@ static void free_pcre_regexp(struct grep_pat *p)
 {
 	pcre_free(p->pcre_regexp);
 	pcre_free(p->pcre_extra_info);
+	pcre_free((void *)p->pcre_tables);
 }
 #else /* !USE_LIBPCRE */
 static void compile_pcre_regexp(struct grep_pat *p, const struct grep_opt *opt)
diff --git a/grep.h b/grep.h
index 95f197a..cee4357 100644
--- a/grep.h
+++ b/grep.h
@@ -48,6 +48,7 @@ struct grep_pat {
 	regex_t regexp;
 	pcre *pcre_regexp;
 	pcre_extra *pcre_extra_info;
+	const unsigned char *pcre_tables;
 	kwset_t kws;
 	unsigned fixed:1;
 	unsigned ignore_case:1;
diff --git a/t/t7813-grep-icase-non-ascii.sh b/t/t7813-grep-icase-non-ascii.sh
new file mode 100755
index 0000000..efef7fb
--- /dev/null
+++ b/t/t7813-grep-icase-non-ascii.sh
@@ -0,0 +1,19 @@
+#!/bin/sh
+
+test_description='grep icase on non-English locales'
+
+. ./lib-gettext.sh
+
+test_expect_success GETTEXT_ISO_LOCALE 'setup' '
+	printf "TILRAUN: Halló Heimur!" >file &&
+	git add file &&
+	LC_ALL="$is_IS_iso_locale" &&
+	export LC_ALL
+'
+
+test_expect_success GETTEXT_ISO_LOCALE,LIBPCRE 'grep pcre string' '
+	git grep --perl-regexp -i "TILRAUN: H.lló Heimur!" &&
+	git grep --perl-regexp -i "TILRAUN: H.LLÓ HEIMUR!"
+'
+
+test_done
-- 
2.3.0.rc1.137.g477eb31

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v2 6/9] gettext: add is_utf8_locale()
  2015-07-08 10:38   ` [PATCH v2 0/9] icase match on non-ascii Nguyễn Thái Ngọc Duy
                       ` (4 preceding siblings ...)
  2015-07-08 10:38     ` [PATCH v2 5/9] grep/pcre: prepare locale-dependent tables for icase matching Nguyễn Thái Ngọc Duy
@ 2015-07-08 10:38     ` Nguyễn Thái Ngọc Duy
  2015-07-08 10:38     ` [PATCH v2 7/9] grep/pcre: support utf-8 Nguyễn Thái Ngọc Duy
                       ` (5 subsequent siblings)
  11 siblings, 0 replies; 53+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2015-07-08 10:38 UTC (permalink / raw)
  To: git
  Cc: plamen.totev, l.s.r, Junio C Hamano,
	Nguyễn Thái Ngọc Duy

This function returns true if git is running under an UTF-8
locale. pcre in the next patch will need this.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 gettext.c | 7 ++++++-
 gettext.h | 5 +++++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/gettext.c b/gettext.c
index 7378ba2..601bc80 100644
--- a/gettext.c
+++ b/gettext.c
@@ -166,12 +166,17 @@ void git_setup_gettext(void)
 	textdomain("git");
 }
 
+int is_utf8_locale(void)
+{
+	return !strcmp(charset, "UTF-8");
+}
+
 /* return the number of columns of string 's' in current locale */
 int gettext_width(const char *s)
 {
 	static int is_utf8 = -1;
 	if (is_utf8 == -1)
-		is_utf8 = !strcmp(charset, "UTF-8");
+		is_utf8 = is_utf8_locale();
 
 	return is_utf8 ? utf8_strwidth(s) : strlen(s);
 }
diff --git a/gettext.h b/gettext.h
index 33696a4..5e733d4 100644
--- a/gettext.h
+++ b/gettext.h
@@ -31,6 +31,7 @@
 #ifndef NO_GETTEXT
 extern void git_setup_gettext(void);
 extern int gettext_width(const char *s);
+extern int is_utf8_locale(void);
 #else
 static inline void git_setup_gettext(void)
 {
@@ -39,6 +40,10 @@ static inline int gettext_width(const char *s)
 {
 	return strlen(s);
 }
+static inline int is_utf8_locale(void)
+{
+	return 0;
+}
 #endif
 
 #ifdef GETTEXT_POISON
-- 
2.3.0.rc1.137.g477eb31

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v2 7/9] grep/pcre: support utf-8
  2015-07-08 10:38   ` [PATCH v2 0/9] icase match on non-ascii Nguyễn Thái Ngọc Duy
                       ` (5 preceding siblings ...)
  2015-07-08 10:38     ` [PATCH v2 6/9] gettext: add is_utf8_locale() Nguyễn Thái Ngọc Duy
@ 2015-07-08 10:38     ` Nguyễn Thái Ngọc Duy
  2015-07-11  8:07       ` Plamen Totev
  2015-07-08 10:38     ` [PATCH v2 8/9] diffcore-pickaxe: "share" regex error handling code Nguyễn Thái Ngọc Duy
                       ` (4 subsequent siblings)
  11 siblings, 1 reply; 53+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2015-07-08 10:38 UTC (permalink / raw)
  To: git
  Cc: plamen.totev, l.s.r, Junio C Hamano,
	Nguyễn Thái Ngọc Duy

In the previous change in this function, we add locale support for
single-byte encodings only. It looks like pcre only supports utf-* as
multibyte encodings, the others are left in the cold (which is
fine). We need to enable PCRE_UTF8 so pcre can parse the string
correctly before folding case.

Noticed-by: Plamen Totev <plamen.totev@abv.bg>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 grep.c                          | 2 ++
 t/t7812-grep-icase-non-ascii.sh | 6 ++++++
 2 files changed, 8 insertions(+)

diff --git a/grep.c b/grep.c
index c79aa70..7c9e437 100644
--- a/grep.c
+++ b/grep.c
@@ -326,6 +326,8 @@ static void compile_pcre_regexp(struct grep_pat *p, const struct grep_opt *opt)
 
 	if (opt->ignore_case) {
 		p->pcre_tables = pcre_maketables();
+		if (is_utf8_locale())
+			options |= PCRE_UTF8;
 		options |= PCRE_CASELESS;
 	}
 
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index c945589..1306cc0 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -16,6 +16,12 @@ test_expect_success GETTEXT_LOCALE 'grep literal string, no -F' '
 	git grep -i "TILRAUN: HALLÓ HEIMUR!"
 '
 
+test_expect_success GETTEXT_LOCALE,LIBPCRE 'grep pcre string' '
+	git grep --perl-regexp    "TILRAUN: H.lló Heimur!" &&
+	git grep --perl-regexp -i "TILRAUN: H.lló Heimur!" &&
+	git grep --perl-regexp -i "TILRAUN: H.LLÓ HEIMUR!"
+'
+
 test_expect_success GETTEXT_LOCALE 'grep literal string, with -F' '
 	git grep --debug -i -F "TILRAUN: Halló Heimur!"  2>&1 >/dev/null |
 		 grep fixed >debug1 &&
-- 
2.3.0.rc1.137.g477eb31

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v2 8/9] diffcore-pickaxe: "share" regex error handling code
  2015-07-08 10:38   ` [PATCH v2 0/9] icase match on non-ascii Nguyễn Thái Ngọc Duy
                       ` (6 preceding siblings ...)
  2015-07-08 10:38     ` [PATCH v2 7/9] grep/pcre: support utf-8 Nguyễn Thái Ngọc Duy
@ 2015-07-08 10:38     ` Nguyễn Thái Ngọc Duy
  2015-07-08 10:38     ` [PATCH v2 9/9] diffcore-pickaxe: support case insensitive match on non-ascii Nguyễn Thái Ngọc Duy
                       ` (3 subsequent siblings)
  11 siblings, 0 replies; 53+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2015-07-08 10:38 UTC (permalink / raw)
  To: git
  Cc: plamen.totev, l.s.r, Junio C Hamano,
	Nguyễn Thái Ngọc Duy

There's another regcomp code block coming in this function. By moving
the error handling code out of this block, we don't have to add the
same error handling code in the new block.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 diffcore-pickaxe.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
index 185f86b..7a718fc 100644
--- a/diffcore-pickaxe.c
+++ b/diffcore-pickaxe.c
@@ -204,20 +204,13 @@ void diffcore_pickaxe(struct diff_options *o)
 	int opts = o->pickaxe_opts;
 	regex_t regex, *regexp = NULL;
 	kwset_t kws = NULL;
+	int err = 0;
 
 	if (opts & (DIFF_PICKAXE_REGEX | DIFF_PICKAXE_KIND_G)) {
-		int err;
 		int cflags = REG_EXTENDED | REG_NEWLINE;
 		if (DIFF_OPT_TST(o, PICKAXE_IGNORE_CASE))
 			cflags |= REG_ICASE;
 		err = regcomp(&regex, needle, cflags);
-		if (err) {
-			/* The POSIX.2 people are surely sick */
-			char errbuf[1024];
-			regerror(err, &regex, errbuf, 1024);
-			regfree(&regex);
-			die("invalid regex: %s", errbuf);
-		}
 		regexp = &regex;
 	} else {
 		kws = kwsalloc(DIFF_OPT_TST(o, PICKAXE_IGNORE_CASE)
@@ -225,6 +218,13 @@ void diffcore_pickaxe(struct diff_options *o)
 		kwsincr(kws, needle, strlen(needle));
 		kwsprep(kws);
 	}
+	if (err) {
+		/* The POSIX.2 people are surely sick */
+		char errbuf[1024];
+		regerror(err, &regex, errbuf, 1024);
+		regfree(&regex);
+		die("invalid regex: %s", errbuf);
+	}
 
 	/* Might want to warn when both S and G are on; I don't care... */
 	pickaxe(&diff_queued_diff, o, regexp, kws,
-- 
2.3.0.rc1.137.g477eb31

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v2 9/9] diffcore-pickaxe: support case insensitive match on non-ascii
  2015-07-08 10:38   ` [PATCH v2 0/9] icase match on non-ascii Nguyễn Thái Ngọc Duy
                       ` (7 preceding siblings ...)
  2015-07-08 10:38     ` [PATCH v2 8/9] diffcore-pickaxe: "share" regex error handling code Nguyễn Thái Ngọc Duy
@ 2015-07-08 10:38     ` Nguyễn Thái Ngọc Duy
  2015-07-09 22:55       ` Eric Sunshine
  2015-07-08 11:32     ` [PATCH v2 0/9] icase " Torsten Bögershausen
                       ` (2 subsequent siblings)
  11 siblings, 1 reply; 53+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2015-07-08 10:38 UTC (permalink / raw)
  To: git
  Cc: plamen.totev, l.s.r, Junio C Hamano,
	Nguyễn Thái Ngọc Duy

Similar to the "grep -F -i" case, we can't use kws on icase search
outside ascii range, quote we quote the string and pass it to regcomp
as a basic regexp and let regex engine deal with case sensitivity.

The new test is put in t7812 instead of t4209-log-pickaxe because
lib-gettext.sh might cause problems elsewhere, probably..

Noticed-by: Plamen Totev <plamen.totev@abv.bg>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 diffcore-pickaxe.c              | 11 +++++++++++
 t/t7812-grep-icase-non-ascii.sh |  7 +++++++
 2 files changed, 18 insertions(+)

diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
index 7a718fc..6946c15 100644
--- a/diffcore-pickaxe.c
+++ b/diffcore-pickaxe.c
@@ -7,6 +7,8 @@
 #include "diffcore.h"
 #include "xdiff-interface.h"
 #include "kwset.h"
+#include "commit.h"
+#include "quote.h"
 
 typedef int (*pickaxe_fn)(mmfile_t *one, mmfile_t *two,
 			  struct diff_options *o,
@@ -212,6 +214,15 @@ void diffcore_pickaxe(struct diff_options *o)
 			cflags |= REG_ICASE;
 		err = regcomp(&regex, needle, cflags);
 		regexp = &regex;
+	} else if (DIFF_OPT_TST(o, PICKAXE_IGNORE_CASE) &&
+		   has_non_ascii(needle)) {
+		struct strbuf sb = STRBUF_INIT;
+		int cflags = REG_NEWLINE | REG_ICASE;
+
+		basic_regex_quote_buf(&sb, needle);
+		err = regcomp(&regex, sb.buf, cflags);
+		strbuf_release(&sb);
+		regexp = &regex;
 	} else {
 		kws = kwsalloc(DIFF_OPT_TST(o, PICKAXE_IGNORE_CASE)
 			       ? tolower_trans_tbl : NULL);
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index 1306cc0..20a39d0 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -48,4 +48,11 @@ test_expect_success GETTEXT_LOCALE 'grep string with regex, with -F' '
 	test_cmp expect2 debug2
 '
 
+test_expect_success GETTEXT_LOCALE 'pickaxe -i on non-ascii' '
+	git commit -m first &&
+	git log --format=%f -i -S"TILRAUN: HALLÓ HEIMUR!" >actual &&
+	echo first >expected &&
+	test_cmp expected actual
+'
+
 test_done
-- 
2.3.0.rc1.137.g477eb31

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 5/9] grep/pcre: prepare locale-dependent tables for icase matching
  2015-07-08 10:38     ` [PATCH v2 5/9] grep/pcre: prepare locale-dependent tables for icase matching Nguyễn Thái Ngọc Duy
@ 2015-07-08 11:00       ` Duy Nguyen
  0 siblings, 0 replies; 53+ messages in thread
From: Duy Nguyen @ 2015-07-08 11:00 UTC (permalink / raw)
  To: Git Mailing List
  Cc: Plamen Totev, René Scharfe, Junio C Hamano,
	Nguyễn Thái Ngọc Duy

On Wed, Jul 8, 2015 at 5:38 PM, Nguyễn Thái Ngọc Duy <pclouds@gmail.com> wrote:
> diff --git a/grep.c b/grep.c
> index 8fce54f..c79aa70 100644
> --- a/grep.c
> +++ b/grep.c
> @@ -324,11 +324,13 @@ static void compile_pcre_regexp(struct grep_pat *p, const struct grep_opt *opt)
>         int erroffset;
>         int options = PCRE_MULTILINE;
>
> -       if (opt->ignore_case)
> +       if (opt->ignore_case) {
> +               p->pcre_tables = pcre_maketables();

This affects ascii-only case too because I didn't use has_non_ascii()
here. I guess it's ok because the doc says there are default tables
anyway. But I'm not really familiar with pcre... If they have
optimizations for when the last argument of pcre_compile() is NULL, we
have a regression.

>                 options |= PCRE_CASELESS;
> +       }
>
>         p->pcre_regexp = pcre_compile(p->pattern, options, &error, &erroffset,
> -                       NULL);
> +                                     p->pcre_tables);
>         if (!p->pcre_regexp)
>                 compile_regexp_failed(p, error);
>
-- 
Duy

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 0/9] icase match on non-ascii
  2015-07-08 10:38   ` [PATCH v2 0/9] icase match on non-ascii Nguyễn Thái Ngọc Duy
                       ` (8 preceding siblings ...)
  2015-07-08 10:38     ` [PATCH v2 9/9] diffcore-pickaxe: support case insensitive match on non-ascii Nguyễn Thái Ngọc Duy
@ 2015-07-08 11:32     ` Torsten Bögershausen
  2015-07-08 12:13       ` Duy Nguyen
  2015-07-08 15:36     ` Junio C Hamano
  2015-07-14 13:24     ` [PATCH v3 " Nguyễn Thái Ngọc Duy
  11 siblings, 1 reply; 53+ messages in thread
From: Torsten Bögershausen @ 2015-07-08 11:32 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy, git
  Cc: plamen.totev, l.s.r, Junio C Hamano

On 2015-07-08 12.38, Nguyễn Thái Ngọc Duy wrote:
> Side note, I almost added the third has_non_ascii() function. Maybe we
> should consider merging the two existing has_non_ascii() functions
> back, or rename one to something else.
>
Side question:

has_non_ascii can mean different things:
 UTF-8, ISO-8859-1, ISO-8859-x...

In short: everything.
Should we be more critical here ?

May be this can be used from commit.c:
static int verify_utf8(struct strbuf *buf)

Other question:
Should the "non-ascii" characters in the test scripts be octal-escaped ?

Third question:
What happens on systems, which don't have gettext, (for whatever reasons)
--- a/gettext.c
+++ b/gettext.c
@@ -166,12 +166,17 @@ void git_setup_gettext(void)
textdomain("git");
}

+int is_utf8_locale(void)
+{
+ return !strcmp(charset, "UTF-8");
+}


4th question:
What happens on systems which don't have locale support at all ?


As one may suspect, I'm not a friend of being dependent on gettext and/or
locale, at least not for this kind of business.

Would it make more sense to have a command line option ?

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 0/9] icase match on non-ascii
  2015-07-08 11:32     ` [PATCH v2 0/9] icase " Torsten Bögershausen
@ 2015-07-08 12:13       ` Duy Nguyen
  0 siblings, 0 replies; 53+ messages in thread
From: Duy Nguyen @ 2015-07-08 12:13 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Git Mailing List, Plamen Totev, René Scharfe, Junio C Hamano

On Wed, Jul 8, 2015 at 6:32 PM, Torsten Bögershausen <tboegi@web.de> wrote:
> On 2015-07-08 12.38, Nguyễn Thái Ngọc Duy wrote:
>> Side note, I almost added the third has_non_ascii() function. Maybe we
>> should consider merging the two existing has_non_ascii() functions
>> back, or rename one to something else.
>>
> Side question:
>
> has_non_ascii can mean different things:
>  UTF-8, ISO-8859-1, ISO-8859-x...
>
> In short: everything.
> Should we be more critical here ?

For the purpose of this series, no. What we need to ask is "can we
optimize this?" and that leads to "ok we can safely optimize for
ascii-only patterns, don't optimize otherwise".

> May be this can be used from commit.c:
> static int verify_utf8(struct strbuf *buf)

Except the pcre/utf8 patch, the rest should work for all single-byte encodings..

> Other question:
> Should the "non-ascii" characters in the test scripts be octal-escaped ?

I would prefer something readable from a text editor. Even though I
don't speak Icelandic (the strings were copied from gettext test
script) I can see uppercase/lowercase of letters. But I don't know how
many people have fonts covering more than just ascii..

> Third question:
> What happens on systems, which don't have gettext, (for whatever reasons)
> --- a/gettext.c
> +++ b/gettext.c
> @@ -166,12 +166,17 @@ void git_setup_gettext(void)
> textdomain("git");
> }
>
> +int is_utf8_locale(void)
> +{
> + return !strcmp(charset, "UTF-8");
> +}

Hm.. pcre on utf-8 is screwed. I really don't want to go through
nl_langinfo, $LC_ALL, $LC_CTYPE and $LANG to detect utf8 like what's
done in compat/regex/regcomp.c.. But I guess there's no other way,
people can disable gettext and expect git to work properly with utf-8
if their pcre library supports it.

> 4th question:
> What happens on systems which don't have locale support at all ?

I suppose by locale here you do not mean "gettext" any more, then
icase does not work. We simply delegate the work to system regex/pcre.
If they don't support locale, nothing we can do. Git itself does not
know how to fold case outside ascii (except maybe utf8, I don't know
how smart our utf-8 impl is). gettext support does not matter here,
except pcre/utf8 case above.

> As one may suspect, I'm not a friend of being dependent on gettext and/or
> locale, at least not for this kind of business.
>
> Would it make more sense to have a command line option ?

I assume you only care about utf-8 here (and utf8.c knows how to fold
case), you'll need to improve compat regex to take advantage of it
first, maybe make sure it can live side by side with system regex,
also make sure pcre is compiled with utf-8 support if you use pcre.
And if setting LANG to let git know you want to use utf-8 is too
subtle, then yes a command line option makes sense :)
-- 
Duy

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 0/9] icase match on non-ascii
  2015-07-08 10:38   ` [PATCH v2 0/9] icase match on non-ascii Nguyễn Thái Ngọc Duy
                       ` (9 preceding siblings ...)
  2015-07-08 11:32     ` [PATCH v2 0/9] icase " Torsten Bögershausen
@ 2015-07-08 15:36     ` Junio C Hamano
  2015-07-08 23:28       ` Duy Nguyen
  2015-07-14 13:24     ` [PATCH v3 " Nguyễn Thái Ngọc Duy
  11 siblings, 1 reply; 53+ messages in thread
From: Junio C Hamano @ 2015-07-08 15:36 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy
  Cc: Git Mailing List, Plamen Totev, René Scharfe

> Patch 5 is "funny". The patch itself is in iso-8859-1, but my name in
> the commit message is in utf-8.

As an e-mail message is a single file, by definition that is not merely
"funny" but just "broken", no matter what encoding your MUA
declares the contents are in, no?

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 0/9] icase match on non-ascii
  2015-07-08 15:36     ` Junio C Hamano
@ 2015-07-08 23:28       ` Duy Nguyen
  0 siblings, 0 replies; 53+ messages in thread
From: Duy Nguyen @ 2015-07-08 23:28 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List, Plamen Totev, René Scharfe

On Wed, Jul 8, 2015 at 10:36 PM, Junio C Hamano <gitster@pobox.com> wrote:
>> Patch 5 is "funny". The patch itself is in iso-8859-1, but my name in
>> the commit message is in utf-8.
>
> As an e-mail message is a single file, by definition that is not merely
> "funny" but just "broken", no matter what encoding your MUA
> declares the contents are in, no?

Yes. But if your MUA is not so strict and does not reject invalid byte
sequences, git-am might be able to process it. What are the options we
have? Teach git-format-patch to generate patches with non-ascii chars
as binary (utf-8 may be treated specially and kept as-is after
validation)?
-- 
Duy

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 9/9] diffcore-pickaxe: support case insensitive match on non-ascii
  2015-07-08 10:38     ` [PATCH v2 9/9] diffcore-pickaxe: support case insensitive match on non-ascii Nguyễn Thái Ngọc Duy
@ 2015-07-09 22:55       ` Eric Sunshine
  0 siblings, 0 replies; 53+ messages in thread
From: Eric Sunshine @ 2015-07-09 22:55 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy
  Cc: Git List, plamen.totev, René Scharfe, Junio C Hamano

On Wed, Jul 8, 2015 at 6:38 AM, Nguyễn Thái Ngọc Duy <pclouds@gmail.com> wrote:
> Similar to the "grep -F -i" case, we can't use kws on icase search
> outside ascii range, quote we quote the string and pass it to regcomp

s/quote we quote/so we quote/

(or something)

> as a basic regexp and let regex engine deal with case sensitivity.
>
> The new test is put in t7812 instead of t4209-log-pickaxe because
> lib-gettext.sh might cause problems elsewhere, probably..
>
> Noticed-by: Plamen Totev <plamen.totev@abv.bg>
> Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 7/9] grep/pcre: support utf-8
  2015-07-08 10:38     ` [PATCH v2 7/9] grep/pcre: support utf-8 Nguyễn Thái Ngọc Duy
@ 2015-07-11  8:07       ` Plamen Totev
  0 siblings, 0 replies; 53+ messages in thread
From: Plamen Totev @ 2015-07-11  8:07 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy; +Cc: git, l.s.r, Junio C Hamano

Nguyễn Thái Ngọc Duy <pclouds@gmail.com> writes:
> In the previous change in this function, we add locale support for 
> single-byte encodings only. It looks like pcre only supports utf-* as 
> multibyte encodings, the others are left in the cold (which is 
> fine). We need to enable PCRE_UTF8 so pcre can parse the string 
> correctly before folding case. 

> if (opt->ignore_case) { 
> p->pcre_tables = pcre_maketables(); 
> +	if (is_utf8_locale()) 
> +	options |= PCRE_UTF8; 
> options |= PCRE_CASELESS; 
> } 

We need to set the PCRE_UTF8 flag in all cases when the locale is UTF-8
not only when the search is case insensitive.
Otherwise pcre threats the encoding as single byte and if the regex contains
quantifiers it will not work as expected. The quantifier will try to match the
second byte of the multi-byte symbol instead of the whole symbol.

For example lets have file that contains the string

TILRAUN: HALLÓÓÓ HEIMUR!

the following command

git grep -P "HALLÓ{3}"

will not match the file while 

git grep -P "HAL{2}ÓÓÓ"

will. That's because the L symbol is a single byte.

Regards,
Plamen Totev

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v3 0/9] icase match on non-ascii
  2015-07-08 10:38   ` [PATCH v2 0/9] icase match on non-ascii Nguyễn Thái Ngọc Duy
                       ` (10 preceding siblings ...)
  2015-07-08 15:36     ` Junio C Hamano
@ 2015-07-14 13:24     ` Nguyễn Thái Ngọc Duy
  2015-07-14 13:24       ` [PATCH v3 1/9] grep: allow -F -i combination Nguyễn Thái Ngọc Duy
                         ` (10 more replies)
  11 siblings, 11 replies; 53+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2015-07-14 13:24 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, plamen.totev, l.s.r, Eric Sunshine, tboegi,
	Nguyễn Thái Ngọc Duy

Compared to v2

 - fix grep/pcre on utf-8 even in case is sensitive
 - peek at $LANG and friends anyway for utf-8 detection even
   when gettext support is not built in git
 - s/quote we quote/so we quote/ in 9/9
 - rename t7813, s/non-ascii/iso/

Nguyễn Thái Ngọc Duy (9):
  grep: allow -F -i combination
  grep: break down an "if" stmt in preparation for next changes
  grep/icase: avoid kwsset on literal non-ascii strings
  grep/icase: avoid kwsset when -F is specified
  grep/pcre: prepare locale-dependent tables for icase matching
  gettext: add is_utf8_locale()
  grep/pcre: support utf-8
  diffcore-pickaxe: "share" regex error handling code
  diffcore-pickaxe: support case insensitive match on non-ascii

 builtin/grep.c                           |  2 +-
 diffcore-pickaxe.c                       | 27 +++++++++----
 gettext.c                                | 24 +++++++++++-
 gettext.h                                |  1 +
 grep.c                                   | 44 +++++++++++++++++++--
 grep.h                                   |  1 +
 quote.c                                  | 37 ++++++++++++++++++
 quote.h                                  |  1 +
 t/t7812-grep-icase-non-ascii.sh (new +x) | 67 ++++++++++++++++++++++++++++++++
 t/t7813-grep-icase-iso.sh (new +x)       | 19 +++++++++
 10 files changed, 208 insertions(+), 15 deletions(-)
 create mode 100755 t/t7812-grep-icase-non-ascii.sh
 create mode 100755 t/t7813-grep-icase-iso.sh

-- 
2.3.0.rc1.137.g477eb31

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [PATCH v3 1/9] grep: allow -F -i combination
  2015-07-14 13:24     ` [PATCH v3 " Nguyễn Thái Ngọc Duy
@ 2015-07-14 13:24       ` Nguyễn Thái Ngọc Duy
  2015-07-14 13:24       ` [PATCH v3 2/9] grep: break down an "if" stmt in preparation for next changes Nguyễn Thái Ngọc Duy
                         ` (9 subsequent siblings)
  10 siblings, 0 replies; 53+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2015-07-14 13:24 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, plamen.totev, l.s.r, Eric Sunshine, tboegi,
	Nguyễn Thái Ngọc Duy

-F means "no regex", not "case sensitive" so it should not override -i

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 builtin/grep.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index d04f440..2d392e9 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -806,7 +806,7 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
 
 	if (!opt.pattern_list)
 		die(_("no pattern given."));
-	if (!opt.fixed && opt.ignore_case)
+	if (opt.ignore_case)
 		opt.regflags |= REG_ICASE;
 
 	compile_grep_patterns(&opt);
-- 
2.3.0.rc1.137.g477eb31

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v3 2/9] grep: break down an "if" stmt in preparation for next changes
  2015-07-14 13:24     ` [PATCH v3 " Nguyễn Thái Ngọc Duy
  2015-07-14 13:24       ` [PATCH v3 1/9] grep: allow -F -i combination Nguyễn Thái Ngọc Duy
@ 2015-07-14 13:24       ` Nguyễn Thái Ngọc Duy
  2015-07-14 13:24       ` [PATCH v3 3/9] grep/icase: avoid kwsset on literal non-ascii strings Nguyễn Thái Ngọc Duy
                         ` (8 subsequent siblings)
  10 siblings, 0 replies; 53+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2015-07-14 13:24 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, plamen.totev, l.s.r, Eric Sunshine, tboegi,
	Nguyễn Thái Ngọc Duy

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 grep.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/grep.c b/grep.c
index b58c7c6..bd32f66 100644
--- a/grep.c
+++ b/grep.c
@@ -403,9 +403,11 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 	p->word_regexp = opt->word_regexp;
 	p->ignore_case = opt->ignore_case;
 
-	if (opt->fixed || is_fixed(p->pattern, p->patternlen))
+	if (is_fixed(p->pattern, p->patternlen))
 		p->fixed = 1;
-	else
+	else if (opt->fixed) {
+		p->fixed = 1;
+	} else
 		p->fixed = 0;
 
 	if (p->fixed) {
-- 
2.3.0.rc1.137.g477eb31

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v3 3/9] grep/icase: avoid kwsset on literal non-ascii strings
  2015-07-14 13:24     ` [PATCH v3 " Nguyễn Thái Ngọc Duy
  2015-07-14 13:24       ` [PATCH v3 1/9] grep: allow -F -i combination Nguyễn Thái Ngọc Duy
  2015-07-14 13:24       ` [PATCH v3 2/9] grep: break down an "if" stmt in preparation for next changes Nguyễn Thái Ngọc Duy
@ 2015-07-14 13:24       ` Nguyễn Thái Ngọc Duy
  2015-07-14 13:24       ` [PATCH v3 4/9] grep/icase: avoid kwsset when -F is specified Nguyễn Thái Ngọc Duy
                         ` (7 subsequent siblings)
  10 siblings, 0 replies; 53+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2015-07-14 13:24 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, plamen.totev, l.s.r, Eric Sunshine, tboegi,
	Nguyễn Thái Ngọc Duy

When we detect the pattern is just a literal string, we avoid heavy
regex engine and use fast substring search implemented in kwsset.c.
But kws uses git-ctype which is locale-independent so it does not know
how to fold case properly outside ascii range. Let regcomp or pcre
take care of this case instead. Slower, but accurate.

Helped-by: René Scharfe <l.s.r@web.de>
Noticed-by: Plamen Totev <plamen.totev@abv.bg>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 grep.c                                   |  7 ++++++-
 t/t7812-grep-icase-non-ascii.sh (new +x) | 19 +++++++++++++++++++
 2 files changed, 25 insertions(+), 1 deletion(-)
 create mode 100755 t/t7812-grep-icase-non-ascii.sh

diff --git a/grep.c b/grep.c
index bd32f66..d795b0e 100644
--- a/grep.c
+++ b/grep.c
@@ -4,6 +4,7 @@
 #include "xdiff-interface.h"
 #include "diff.h"
 #include "diffcore.h"
+#include "commit.h"
 
 static int grep_source_load(struct grep_source *gs);
 static int grep_source_is_binary(struct grep_source *gs);
@@ -398,12 +399,16 @@ static int is_fixed(const char *s, size_t len)
 
 static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
+	int icase_non_ascii;
 	int err;
 
 	p->word_regexp = opt->word_regexp;
 	p->ignore_case = opt->ignore_case;
+	icase_non_ascii =
+		(opt->regflags & REG_ICASE || p->ignore_case) &&
+		has_non_ascii(p->pattern);
 
-	if (is_fixed(p->pattern, p->patternlen))
+	if (!icase_non_ascii && is_fixed(p->pattern, p->patternlen))
 		p->fixed = 1;
 	else if (opt->fixed) {
 		p->fixed = 1;
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
new file mode 100755
index 0000000..63a2630
--- /dev/null
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -0,0 +1,19 @@
+#!/bin/sh
+
+test_description='grep icase on non-English locales'
+
+. ./lib-gettext.sh
+
+test_expect_success GETTEXT_LOCALE 'setup' '
+	printf "TILRAUN: Halló Heimur!" >file &&
+	git add file &&
+	LC_ALL="$is_IS_locale" &&
+	export LC_ALL
+'
+
+test_expect_success GETTEXT_LOCALE 'grep literal string, no -F' '
+	git grep -i "TILRAUN: Halló Heimur!" &&
+	git grep -i "TILRAUN: HALLÓ HEIMUR!"
+'
+
+test_done
-- 
2.3.0.rc1.137.g477eb31

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v3 4/9] grep/icase: avoid kwsset when -F is specified
  2015-07-14 13:24     ` [PATCH v3 " Nguyễn Thái Ngọc Duy
                         ` (2 preceding siblings ...)
  2015-07-14 13:24       ` [PATCH v3 3/9] grep/icase: avoid kwsset on literal non-ascii strings Nguyễn Thái Ngọc Duy
@ 2015-07-14 13:24       ` Nguyễn Thái Ngọc Duy
  2015-07-14 13:24       ` [PATCH v3 5/9] grep/pcre: prepare locale-dependent tables for icase matching Nguyễn Thái Ngọc Duy
                         ` (6 subsequent siblings)
  10 siblings, 0 replies; 53+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2015-07-14 13:24 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, plamen.totev, l.s.r, Eric Sunshine, tboegi,
	Nguyễn Thái Ngọc Duy

Similar to the previous commit, we can't use kws on icase search
outside ascii range. But we can't simply pass the pattern to
regcomp/pcre like the previous commit because it may contain regex
special characters, so we need to quote the regex first.

To avoid misquote traps that could lead to undefined behavior, we
always stick to basic regex engine in this case. We don't need fancy
features for grepping a literal string anyway.

basic_regex_quote_buf() assumes that if the pattern is in a multibyte
encoding, ascii chars must be unambiguously encoded as single
bytes. This is true at least for UTF-8. For others, let's wait until
people yell up. Chances are nobody uses multibyte, non utf-8 charsets
any more..

Helped-by: René Scharfe <l.s.r@web.de>
Noticed-by: Plamen Totev <plamen.totev@abv.bg>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 grep.c                          | 25 ++++++++++++++++++++++++-
 quote.c                         | 37 +++++++++++++++++++++++++++++++++++++
 quote.h                         |  1 +
 t/t7812-grep-icase-non-ascii.sh | 26 ++++++++++++++++++++++++++
 4 files changed, 88 insertions(+), 1 deletion(-)

diff --git a/grep.c b/grep.c
index d795b0e..8fce54f 100644
--- a/grep.c
+++ b/grep.c
@@ -5,6 +5,7 @@
 #include "diff.h"
 #include "diffcore.h"
 #include "commit.h"
+#include "quote.h"
 
 static int grep_source_load(struct grep_source *gs);
 static int grep_source_is_binary(struct grep_source *gs);
@@ -397,6 +398,24 @@ static int is_fixed(const char *s, size_t len)
 	return 1;
 }
 
+static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt)
+{
+	struct strbuf sb = STRBUF_INIT;
+	int err;
+
+	basic_regex_quote_buf(&sb, p->pattern);
+	err = regcomp(&p->regexp, sb.buf, opt->regflags & ~REG_EXTENDED);
+	if (opt->debug)
+		fprintf(stderr, "fixed%s\n", sb.buf);
+	strbuf_release(&sb);
+	if (err) {
+		char errbuf[1024];
+		regerror(err, &p->regexp, errbuf, 1024);
+		regfree(&p->regexp);
+		compile_regexp_failed(p, errbuf);
+	}
+}
+
 static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
 	int icase_non_ascii;
@@ -411,7 +430,11 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 	if (!icase_non_ascii && is_fixed(p->pattern, p->patternlen))
 		p->fixed = 1;
 	else if (opt->fixed) {
-		p->fixed = 1;
+		p->fixed = !icase_non_ascii;
+		if (!p->fixed) {
+			compile_fixed_regexp(p, opt);
+			return;
+		}
 	} else
 		p->fixed = 0;
 
diff --git a/quote.c b/quote.c
index 7920e18..43a8057 100644
--- a/quote.c
+++ b/quote.c
@@ -439,3 +439,40 @@ void tcl_quote_buf(struct strbuf *sb, const char *src)
 	}
 	strbuf_addch(sb, '"');
 }
+
+void basic_regex_quote_buf(struct strbuf *sb, const char *src)
+{
+	char c;
+
+	if (*src == '^') {
+		/* only beginning '^' is special and needs quoting */
+		strbuf_addch(sb, '\\');
+		strbuf_addch(sb, *src++);
+	}
+	if (*src == '*')
+		/* beginning '*' is not special, no quoting */
+		strbuf_addch(sb, *src++);
+
+	while ((c = *src++)) {
+		switch (c) {
+		case '[':
+		case '.':
+		case '\\':
+		case '*':
+			strbuf_addch(sb, '\\');
+			strbuf_addch(sb, c);
+			break;
+
+		case '$':
+			/* only the end '$' is special and needs quoting */
+			if (*src == '\0')
+				strbuf_addch(sb, '\\');
+			strbuf_addch(sb, c);
+			break;
+
+		default:
+			strbuf_addch(sb, c);
+			break;
+		}
+	}
+}
diff --git a/quote.h b/quote.h
index 99e04d3..362d315 100644
--- a/quote.h
+++ b/quote.h
@@ -67,5 +67,6 @@ extern char *quote_path_relative(const char *in, const char *prefix,
 extern void perl_quote_buf(struct strbuf *sb, const char *src);
 extern void python_quote_buf(struct strbuf *sb, const char *src);
 extern void tcl_quote_buf(struct strbuf *sb, const char *src);
+extern void basic_regex_quote_buf(struct strbuf *sb, const char *src);
 
 #endif
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index 63a2630..c945589 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -16,4 +16,30 @@ test_expect_success GETTEXT_LOCALE 'grep literal string, no -F' '
 	git grep -i "TILRAUN: HALLÓ HEIMUR!"
 '
 
+test_expect_success GETTEXT_LOCALE 'grep literal string, with -F' '
+	git grep --debug -i -F "TILRAUN: Halló Heimur!"  2>&1 >/dev/null |
+		 grep fixed >debug1 &&
+	echo "fixedTILRAUN: Halló Heimur!" >expect1 &&
+	test_cmp expect1 debug1 &&
+
+	git grep --debug -i -F "TILRAUN: HALLÓ HEIMUR!"  2>&1 >/dev/null |
+		 grep fixed >debug2 &&
+	echo "fixedTILRAUN: HALLÓ HEIMUR!" >expect2 &&
+	test_cmp expect2 debug2
+'
+
+test_expect_success GETTEXT_LOCALE 'grep string with regex, with -F' '
+	printf "^*TILR^AUN:.* \\Halló \$He[]imur!\$" >file &&
+
+	git grep --debug -i -F "^*TILR^AUN:.* \\Halló \$He[]imur!\$" 2>&1 >/dev/null |
+		 grep fixed >debug1 &&
+	echo "fixed\\^*TILR^AUN:\\.\\* \\\\Halló \$He\\[]imur!\\\$" >expect1 &&
+	test_cmp expect1 debug1 &&
+
+	git grep --debug -i -F "^*TILR^AUN:.* \\HALLÓ \$HE[]IMUR!\$"  2>&1 >/dev/null |
+		 grep fixed >debug2 &&
+	echo "fixed\\^*TILR^AUN:\\.\\* \\\\HALLÓ \$HE\\[]IMUR!\\\$" >expect2 &&
+	test_cmp expect2 debug2
+'
+
 test_done
-- 
2.3.0.rc1.137.g477eb31

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v3 5/9] grep/pcre: prepare locale-dependent tables for icase matching
  2015-07-14 13:24     ` [PATCH v3 " Nguyễn Thái Ngọc Duy
                         ` (3 preceding siblings ...)
  2015-07-14 13:24       ` [PATCH v3 4/9] grep/icase: avoid kwsset when -F is specified Nguyễn Thái Ngọc Duy
@ 2015-07-14 13:24       ` Nguyễn Thái Ngọc Duy
  2015-07-14 13:24       ` [PATCH v3 6/9] gettext: add is_utf8_locale() Nguyễn Thái Ngọc Duy
                         ` (5 subsequent siblings)
  10 siblings, 0 replies; 53+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2015-07-14 13:24 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, plamen.totev, l.s.r, Eric Sunshine, tboegi,
	Nguyễn Thái Ngọc Duy

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=UTF-8, Size: 2366 bytes --]

The default tables are usually built with C locale and only suitable
for LANG=C or similar.  This should make case insensitive search work
correctly for all single-byte charsets.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 grep.c                             |  8 ++++++--
 grep.h                             |  1 +
 t/t7813-grep-icase-iso.sh (new +x) | 19 +++++++++++++++++++
 3 files changed, 26 insertions(+), 2 deletions(-)
 create mode 100755 t/t7813-grep-icase-iso.sh

diff --git a/grep.c b/grep.c
index 8fce54f..f0fbf99 100644
--- a/grep.c
+++ b/grep.c
@@ -324,11 +324,14 @@ static void compile_pcre_regexp(struct grep_pat *p, const struct grep_opt *opt)
 	int erroffset;
 	int options = PCRE_MULTILINE;
 
-	if (opt->ignore_case)
+	if (opt->ignore_case) {
+		if (has_non_ascii(p->pattern))
+			p->pcre_tables = pcre_maketables();
 		options |= PCRE_CASELESS;
+	}
 
 	p->pcre_regexp = pcre_compile(p->pattern, options, &error, &erroffset,
-			NULL);
+				      p->pcre_tables);
 	if (!p->pcre_regexp)
 		compile_regexp_failed(p, error);
 
@@ -362,6 +365,7 @@ static void free_pcre_regexp(struct grep_pat *p)
 {
 	pcre_free(p->pcre_regexp);
 	pcre_free(p->pcre_extra_info);
+	pcre_free((void *)p->pcre_tables);
 }
 #else /* !USE_LIBPCRE */
 static void compile_pcre_regexp(struct grep_pat *p, const struct grep_opt *opt)
diff --git a/grep.h b/grep.h
index 95f197a..cee4357 100644
--- a/grep.h
+++ b/grep.h
@@ -48,6 +48,7 @@ struct grep_pat {
 	regex_t regexp;
 	pcre *pcre_regexp;
 	pcre_extra *pcre_extra_info;
+	const unsigned char *pcre_tables;
 	kwset_t kws;
 	unsigned fixed:1;
 	unsigned ignore_case:1;
diff --git a/t/t7813-grep-icase-iso.sh b/t/t7813-grep-icase-iso.sh
new file mode 100755
index 0000000..efef7fb
--- /dev/null
+++ b/t/t7813-grep-icase-iso.sh
@@ -0,0 +1,19 @@
+#!/bin/sh
+
+test_description='grep icase on non-English locales'
+
+. ./lib-gettext.sh
+
+test_expect_success GETTEXT_ISO_LOCALE 'setup' '
+	printf "TILRAUN: Halló Heimur!" >file &&
+	git add file &&
+	LC_ALL="$is_IS_iso_locale" &&
+	export LC_ALL
+'
+
+test_expect_success GETTEXT_ISO_LOCALE,LIBPCRE 'grep pcre string' '
+	git grep --perl-regexp -i "TILRAUN: H.lló Heimur!" &&
+	git grep --perl-regexp -i "TILRAUN: H.LLÓ HEIMUR!"
+'
+
+test_done
-- 
2.3.0.rc1.137.g477eb31

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v3 6/9] gettext: add is_utf8_locale()
  2015-07-14 13:24     ` [PATCH v3 " Nguyễn Thái Ngọc Duy
                         ` (4 preceding siblings ...)
  2015-07-14 13:24       ` [PATCH v3 5/9] grep/pcre: prepare locale-dependent tables for icase matching Nguyễn Thái Ngọc Duy
@ 2015-07-14 13:24       ` Nguyễn Thái Ngọc Duy
  2015-07-14 13:24       ` [PATCH v3 7/9] grep/pcre: support utf-8 Nguyễn Thái Ngọc Duy
                         ` (4 subsequent siblings)
  10 siblings, 0 replies; 53+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2015-07-14 13:24 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, plamen.totev, l.s.r, Eric Sunshine, tboegi,
	Nguyễn Thái Ngọc Duy

This function returns true if git is running under an UTF-8
locale. pcre in the next patch will need this.

is_encoding_utf8() is used instead of strcmp() to catch both "utf-8"
and "utf8" suffixes.

When built with no gettext support, we peek in several env variables
to detect UTF-8. pcre library might support utf-8 even if libc or git
is built without locale support.. The peeking code is a copy from
compat/regex/regcomp.c.

Helped-by: Torsten Bögershausen <tboegi@web.de>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 gettext.c | 24 ++++++++++++++++++++++--
 gettext.h |  1 +
 2 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/gettext.c b/gettext.c
index 7378ba2..aaf1688 100644
--- a/gettext.c
+++ b/gettext.c
@@ -18,6 +18,8 @@
 #	endif
 #endif
 
+static const char *charset;
+
 /*
  * Guess the user's preferred languages from the value in LANGUAGE environment
  * variable and LC_MESSAGES locale category if NO_GETTEXT is not defined.
@@ -65,7 +67,6 @@ static int test_vsnprintf(const char *fmt, ...)
 	return ret;
 }
 
-static const char *charset;
 static void init_gettext_charset(const char *domain)
 {
 	/*
@@ -171,8 +172,27 @@ int gettext_width(const char *s)
 {
 	static int is_utf8 = -1;
 	if (is_utf8 == -1)
-		is_utf8 = !strcmp(charset, "UTF-8");
+		is_utf8 = is_utf8_locale();
 
 	return is_utf8 ? utf8_strwidth(s) : strlen(s);
 }
 #endif
+
+int is_utf8_locale(void)
+{
+#ifdef NO_GETTEXT
+	if (!charset) {
+		const char *env = getenv("LC_ALL");
+		if (!env || !*env)
+			env = getenv("LC_CTYPE");
+		if (!env || !*env)
+			env = getenv("LANG");
+		if (!env)
+			env = "";
+		if (strchr(env, '.'))
+			env = strchr(env, '.') + 1;
+		charset = xstrdup(env);
+	}
+#endif
+	return is_encoding_utf8(charset);
+}
diff --git a/gettext.h b/gettext.h
index 33696a4..7eee64a 100644
--- a/gettext.h
+++ b/gettext.h
@@ -90,5 +90,6 @@ const char *Q_(const char *msgid, const char *plu, unsigned long n)
 #endif
 
 const char *get_preferred_languages(void);
+extern int is_utf8_locale(void);
 
 #endif
-- 
2.3.0.rc1.137.g477eb31

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v3 7/9] grep/pcre: support utf-8
  2015-07-14 13:24     ` [PATCH v3 " Nguyễn Thái Ngọc Duy
                         ` (5 preceding siblings ...)
  2015-07-14 13:24       ` [PATCH v3 6/9] gettext: add is_utf8_locale() Nguyễn Thái Ngọc Duy
@ 2015-07-14 13:24       ` Nguyễn Thái Ngọc Duy
  2015-07-14 13:24       ` [PATCH v3 8/9] diffcore-pickaxe: "share" regex error handling code Nguyễn Thái Ngọc Duy
                         ` (3 subsequent siblings)
  10 siblings, 0 replies; 53+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2015-07-14 13:24 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, plamen.totev, l.s.r, Eric Sunshine, tboegi,
	Nguyễn Thái Ngọc Duy

In the previous change in this function, we add locale support for
single-byte encodings only. It looks like pcre only supports utf-* as
multibyte encodings, the others are left in the cold (which is
fine).

We need to enable PCRE_UTF8 so pcre can find character boundary
correctly. It's needed for case folding (when --ignore-case is used)
or '*', '+' or similar syntax is used.

The "has_non_ascii()" check is to be on the conservative side. If
there's non-ascii in the pattern, the searched content could still be
in utf-8, but we can treat it just like a byte stream and everything
should work. If we force utf-8 based on locale only and pcre validates
utf-8 and the file content is in non-utf8 encoding, things break.

Noticed-by: Plamen Totev <plamen.totev@abv.bg>
Helped-by: Plamen Totev <plamen.totev@abv.bg>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 grep.c                          |  2 ++
 t/t7812-grep-icase-non-ascii.sh | 15 +++++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/grep.c b/grep.c
index f0fbf99..07621c1 100644
--- a/grep.c
+++ b/grep.c
@@ -329,6 +329,8 @@ static void compile_pcre_regexp(struct grep_pat *p, const struct grep_opt *opt)
 			p->pcre_tables = pcre_maketables();
 		options |= PCRE_CASELESS;
 	}
+	if (is_utf8_locale() && has_non_ascii(p->pattern))
+		options |= PCRE_UTF8;
 
 	p->pcre_regexp = pcre_compile(p->pattern, options, &error, &erroffset,
 				      p->pcre_tables);
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index c945589..e861a15 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -16,6 +16,21 @@ test_expect_success GETTEXT_LOCALE 'grep literal string, no -F' '
 	git grep -i "TILRAUN: HALLÓ HEIMUR!"
 '
 
+test_expect_success GETTEXT_LOCALE,LIBPCRE 'grep pcre utf-8 icase' '
+	git grep --perl-regexp    "TILRAUN: H.lló Heimur!" &&
+	git grep --perl-regexp -i "TILRAUN: H.lló Heimur!" &&
+	git grep --perl-regexp -i "TILRAUN: H.LLÓ HEIMUR!"
+'
+
+test_expect_success GETTEXT_LOCALE,LIBPCRE 'grep pcre utf-8 string with "+"' '
+	printf "TILRAUN: Hallóó Heimur!" >file2 &&
+	git add file2 &&
+	git grep -l --perl-regexp "TILRAUN: H.lló+ Heimur!" >actual &&
+	echo file >expected &&
+	echo file2 >>expected &&
+	test_cmp expected actual
+'
+
 test_expect_success GETTEXT_LOCALE 'grep literal string, with -F' '
 	git grep --debug -i -F "TILRAUN: Halló Heimur!"  2>&1 >/dev/null |
 		 grep fixed >debug1 &&
-- 
2.3.0.rc1.137.g477eb31

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v3 8/9] diffcore-pickaxe: "share" regex error handling code
  2015-07-14 13:24     ` [PATCH v3 " Nguyễn Thái Ngọc Duy
                         ` (6 preceding siblings ...)
  2015-07-14 13:24       ` [PATCH v3 7/9] grep/pcre: support utf-8 Nguyễn Thái Ngọc Duy
@ 2015-07-14 13:24       ` Nguyễn Thái Ngọc Duy
  2015-07-14 13:24       ` [PATCH v3 9/9] diffcore-pickaxe: support case insensitive match on non-ascii Nguyễn Thái Ngọc Duy
                         ` (2 subsequent siblings)
  10 siblings, 0 replies; 53+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2015-07-14 13:24 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, plamen.totev, l.s.r, Eric Sunshine, tboegi,
	Nguyễn Thái Ngọc Duy

There's another regcomp code block coming in this function. By moving
the error handling code out of this block, we don't have to add the
same error handling code in the new block.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 diffcore-pickaxe.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
index 185f86b..7a718fc 100644
--- a/diffcore-pickaxe.c
+++ b/diffcore-pickaxe.c
@@ -204,20 +204,13 @@ void diffcore_pickaxe(struct diff_options *o)
 	int opts = o->pickaxe_opts;
 	regex_t regex, *regexp = NULL;
 	kwset_t kws = NULL;
+	int err = 0;
 
 	if (opts & (DIFF_PICKAXE_REGEX | DIFF_PICKAXE_KIND_G)) {
-		int err;
 		int cflags = REG_EXTENDED | REG_NEWLINE;
 		if (DIFF_OPT_TST(o, PICKAXE_IGNORE_CASE))
 			cflags |= REG_ICASE;
 		err = regcomp(&regex, needle, cflags);
-		if (err) {
-			/* The POSIX.2 people are surely sick */
-			char errbuf[1024];
-			regerror(err, &regex, errbuf, 1024);
-			regfree(&regex);
-			die("invalid regex: %s", errbuf);
-		}
 		regexp = &regex;
 	} else {
 		kws = kwsalloc(DIFF_OPT_TST(o, PICKAXE_IGNORE_CASE)
@@ -225,6 +218,13 @@ void diffcore_pickaxe(struct diff_options *o)
 		kwsincr(kws, needle, strlen(needle));
 		kwsprep(kws);
 	}
+	if (err) {
+		/* The POSIX.2 people are surely sick */
+		char errbuf[1024];
+		regerror(err, &regex, errbuf, 1024);
+		regfree(&regex);
+		die("invalid regex: %s", errbuf);
+	}
 
 	/* Might want to warn when both S and G are on; I don't care... */
 	pickaxe(&diff_queued_diff, o, regexp, kws,
-- 
2.3.0.rc1.137.g477eb31

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v3 9/9] diffcore-pickaxe: support case insensitive match on non-ascii
  2015-07-14 13:24     ` [PATCH v3 " Nguyễn Thái Ngọc Duy
                         ` (7 preceding siblings ...)
  2015-07-14 13:24       ` [PATCH v3 8/9] diffcore-pickaxe: "share" regex error handling code Nguyễn Thái Ngọc Duy
@ 2015-07-14 13:24       ` Nguyễn Thái Ngọc Duy
  2015-07-14 16:42       ` [PATCH v3 0/9] icase " Torsten Bögershausen
  2015-08-21 12:47       ` [PATCH v4 00/10] " Nguyễn Thái Ngọc Duy
  10 siblings, 0 replies; 53+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2015-07-14 13:24 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, plamen.totev, l.s.r, Eric Sunshine, tboegi,
	Nguyễn Thái Ngọc Duy

Similar to the "grep -F -i" case, we can't use kws on icase search
outside ascii range, so we quote the string and pass it to regcomp as
a basic regexp and let regex engine deal with case sensitivity.

The new test is put in t7812 instead of t4209-log-pickaxe because
lib-gettext.sh might cause problems elsewhere, probably..

Noticed-by: Plamen Totev <plamen.totev@abv.bg>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 diffcore-pickaxe.c              | 11 +++++++++++
 t/t7812-grep-icase-non-ascii.sh |  7 +++++++
 2 files changed, 18 insertions(+)

diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
index 7a718fc..6946c15 100644
--- a/diffcore-pickaxe.c
+++ b/diffcore-pickaxe.c
@@ -7,6 +7,8 @@
 #include "diffcore.h"
 #include "xdiff-interface.h"
 #include "kwset.h"
+#include "commit.h"
+#include "quote.h"
 
 typedef int (*pickaxe_fn)(mmfile_t *one, mmfile_t *two,
 			  struct diff_options *o,
@@ -212,6 +214,15 @@ void diffcore_pickaxe(struct diff_options *o)
 			cflags |= REG_ICASE;
 		err = regcomp(&regex, needle, cflags);
 		regexp = &regex;
+	} else if (DIFF_OPT_TST(o, PICKAXE_IGNORE_CASE) &&
+		   has_non_ascii(needle)) {
+		struct strbuf sb = STRBUF_INIT;
+		int cflags = REG_NEWLINE | REG_ICASE;
+
+		basic_regex_quote_buf(&sb, needle);
+		err = regcomp(&regex, sb.buf, cflags);
+		strbuf_release(&sb);
+		regexp = &regex;
 	} else {
 		kws = kwsalloc(DIFF_OPT_TST(o, PICKAXE_IGNORE_CASE)
 			       ? tolower_trans_tbl : NULL);
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index e861a15..d07fa20 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -57,4 +57,11 @@ test_expect_success GETTEXT_LOCALE 'grep string with regex, with -F' '
 	test_cmp expect2 debug2
 '
 
+test_expect_success GETTEXT_LOCALE 'pickaxe -i on non-ascii' '
+	git commit -m first &&
+	git log --format=%f -i -S"TILRAUN: HALLÓ HEIMUR!" >actual &&
+	echo first >expected &&
+	test_cmp expected actual
+'
+
 test_done
-- 
2.3.0.rc1.137.g477eb31

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH v3 0/9] icase match on non-ascii
  2015-07-14 13:24     ` [PATCH v3 " Nguyễn Thái Ngọc Duy
                         ` (8 preceding siblings ...)
  2015-07-14 13:24       ` [PATCH v3 9/9] diffcore-pickaxe: support case insensitive match on non-ascii Nguyễn Thái Ngọc Duy
@ 2015-07-14 16:42       ` Torsten Bögershausen
  2015-07-15  9:39         ` Duy Nguyen
  2015-08-21 12:47       ` [PATCH v4 00/10] " Nguyễn Thái Ngọc Duy
  10 siblings, 1 reply; 53+ messages in thread
From: Torsten Bögershausen @ 2015-07-14 16:42 UTC (permalink / raw)
  To: Nguyễn Thái Ngọc Duy, git
  Cc: Junio C Hamano, plamen.totev, l.s.r, tboegi

(I haven't been able to do more debugging yet,
but this doesn't fully work on my Mac OS X box:)

Initialized empty Git repository in
/Users/tb/NoBackup/projects/git/tb.150714_Duy_grep_utf8/t/trash
directory.t7812-grep-icase-non-ascii/.git/
# lib-gettext: Found 'is_IS.UTF-8' as an is_IS UTF-8 locale
# lib-gettext: Found 'is_IS.ISO8859-1' as an is_IS ISO-8859-1 locale
expecting success:
    printf "TILRAUN: Halló Heimur!" >file &&
    git add file &&
    LC_ALL="$is_IS_locale" &&
    export LC_ALL

ok 1 - setup

expecting success:
    git grep -i "TILRAUN: Halló Heimur!" &&
    git grep -i "TILRAUN: HALLÓ HEIMUR!"

file:TILRAUN: Halló Heimur!
not ok 2 - grep literal string, no -F
#   
#        git grep -i "TILRAUN: Halló Heimur!" &&
#        git grep -i "TILRAUN: HALLÓ HEIMUR!"
#   

skipping test: grep pcre utf-8 icase
    git grep --perl-regexp    "TILRAUN: H.lló Heimur!" &&
    git grep --perl-regexp -i "TILRAUN: H.lló Heimur!" &&
    git grep --perl-regexp -i "TILRAUN: H.LLÓ HEIMUR!"

ok 3 # skip grep pcre utf-8 icase (missing LIBPCRE of GETTEXT_LOCALE,LIBPCRE)

skipping test: grep pcre utf-8 string with "+"
    printf "TILRAUN: Hallóó Heimur!" >file2 &&
    git add file2 &&
    git grep -l --perl-regexp "TILRAUN: H.lló+ Heimur!" >actual &&
    echo file >expected &&
    echo file2 >>expected &&
    test_cmp expected actual

ok 4 # skip grep pcre utf-8 string with "+" (missing LIBPCRE of
GETTEXT_LOCALE,LIBPCRE)

expecting success:
    git grep --debug -i -F "TILRAUN: Halló Heimur!"  2>&1 >/dev/null |
         grep fixed >debug1 &&
    echo "fixedTILRAUN: Halló Heimur!" >expect1 &&
    test_cmp expect1 debug1 &&

    git grep --debug -i -F "TILRAUN: HALLÓ HEIMUR!"  2>&1 >/dev/null |
         grep fixed >debug2 &&
    echo "fixedTILRAUN: HALLÓ HEIMUR!" >expect2 &&
    test_cmp expect2 debug2

ok 5 - grep literal string, with -F

expecting success:
    printf "^*TILR^AUN:.* \\Halló \$He[]imur!\$" >file &&

    git grep --debug -i -F "^*TILR^AUN:.* \\Halló \$He[]imur!\$" 2>&1 >/dev/null |
         grep fixed >debug1 &&
    echo "fixed\\^*TILR^AUN:\\.\\* \\\\Halló \$He\\[]imur!\\\$" >expect1 &&
    test_cmp expect1 debug1 &&

    git grep --debug -i -F "^*TILR^AUN:.* \\HALLÓ \$HE[]IMUR!\$"  2>&1 >/dev/null |
         grep fixed >debug2 &&
    echo "fixed\\^*TILR^AUN:\\.\\* \\\\HALLÓ \$HE\\[]IMUR!\\\$" >expect2 &&
    test_cmp expect2 debug2

--- expect1    2015-07-14 16:38:22.000000000 +0000
+++ debug1    2015-07-14 16:38:22.000000000 +0000
@@ -1 +1 @@
-fixed\^*TILR^AUN:\.\* \Halló $He\[]imur!\$
+fixed\^*TILR^AUN:\.\* \\Halló $He\[]imur!\$
not ok 6 - grep string with regex, with -F
#   
#        printf "^*TILR^AUN:.* \\Halló \$He[]imur!\$" >file &&
#   
#        git grep --debug -i -F "^*TILR^AUN:.* \\Halló \$He[]imur!\$" 2>&1
>/dev/null |
#             grep fixed >debug1 &&
#        echo "fixed\\^*TILR^AUN:\\.\\* \\\\Halló \$He\\[]imur!\\\$" >expect1 &&
#        test_cmp expect1 debug1 &&
#   
#        git grep --debug -i -F "^*TILR^AUN:.* \\HALLÓ \$HE[]IMUR!\$"  2>&1
>/dev/null |
#             grep fixed >debug2 &&
#        echo "fixed\\^*TILR^AUN:\\.\\* \\\\HALLÓ \$HE\\[]IMUR!\\\$" >expect2 &&
#        test_cmp expect2 debug2
#   

expecting success:
    git commit -m first &&
    git log --format=%f -i -S"TILRAUN: HALLÓ HEIMUR!" >actual &&
    echo first >expected &&
    test_cmp expected actual

[master (root-commit) e6052d5] first
 Author: A U Thor <author@example.com>
 1 file changed, 1 insertion(+)
 create mode 100644 file
--- expected    2015-07-14 16:38:22.000000000 +0000
+++ actual    2015-07-14 16:38:22.000000000 +0000
@@ -1 +0,0 @@
-first
not ok 7 - pickaxe -i on non-ascii
#   
#        git commit -m first &&
#        git log --format=%f -i -S"TILRAUN: HALLÓ HEIMUR!" >actual &&
#        echo first >expected &&
#        test_cmp expected actual
#   

# failed 3 among 7 test(s)
1..7

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v3 0/9] icase match on non-ascii
  2015-07-14 16:42       ` [PATCH v3 0/9] icase " Torsten Bögershausen
@ 2015-07-15  9:39         ` Duy Nguyen
  2015-07-15 19:51           ` Torsten Bögershausen
  0 siblings, 1 reply; 53+ messages in thread
From: Duy Nguyen @ 2015-07-15  9:39 UTC (permalink / raw)
  To: Torsten Bögershausen
  Cc: Git Mailing List, Junio C Hamano, Plamen Totev, René Scharfe

On Tue, Jul 14, 2015 at 11:42 PM, Torsten Bögershausen <tboegi@web.de> wrote:
> (I haven't been able to do more debugging yet,
> but this doesn't fully work on my Mac OS X box:)
>
> Initialized empty Git repository in
> /Users/tb/NoBackup/projects/git/tb.150714_Duy_grep_utf8/t/trash
> directory.t7812-grep-icase-non-ascii/.git/
> # lib-gettext: Found 'is_IS.UTF-8' as an is_IS UTF-8 locale
> # lib-gettext: Found 'is_IS.ISO8859-1' as an is_IS ISO-8859-1 locale
> expecting success:
>     printf "TILRAUN: Halló Heimur!" >file &&
>     git add file &&
>     LC_ALL="$is_IS_locale" &&
>     export LC_ALL
>
> ok 1 - setup
>
> expecting success:
>     git grep -i "TILRAUN: Halló Heimur!" &&
>     git grep -i "TILRAUN: HALLÓ HEIMUR!"
>
> file:TILRAUN: Halló Heimur!
> not ok 2 - grep literal string, no -F
> #
> #        git grep -i "TILRAUN: Halló Heimur!" &&
> #        git grep -i "TILRAUN: HALLÓ HEIMUR!"
> #

I don't know if there's an easy way to test if regexec() on your
system supports locale (at least for is_IS). I can reproduce the same
by using compat regex. So it's not a good news because compat regex is
used in a few platforms, so this test will fail on those.

I don't see any way around it, except dropping all the tests. I don't
think there is a way for us to test regex locale support at runtime.
-- 
Duy

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v3 0/9] icase match on non-ascii
  2015-07-15  9:39         ` Duy Nguyen
@ 2015-07-15 19:51           ` Torsten Bögershausen
  0 siblings, 0 replies; 53+ messages in thread
From: Torsten Bögershausen @ 2015-07-15 19:51 UTC (permalink / raw)
  To: Duy Nguyen, Torsten Bögershausen
  Cc: Git Mailing List, Junio C Hamano, Plamen Totev, René Scharfe

> I don't see any way around it, except dropping all the tests. I don't
> think there is a way for us to test regex locale support at runtime.
> 
(I don't think dropping all tests is a good way forward)
Either there is runtime code similar to test-regex.c,
or how about something like this:

commit a1cdac0fc0df1dad20f4dc196688a73c11b00480
Author: Torsten Bögershausen <tboegi@web.de>
Date:   Wed Jul 15 21:43:47 2015 +0200

    t7812: More LIBPCRE preconditions

    Some (e.g. BSD based) regex libraries are not able to handle
    UTF-8 strings case-insensitive (if asked so)

    Exclude some test cases by using the LIBPCRE precondition

diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index d07fa20..30d3d68 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -11,9 +11,12 @@ test_expect_success GETTEXT_LOCALE 'setup' '
 	export LC_ALL
 '

-test_expect_success GETTEXT_LOCALE 'grep literal string, no -F' '
-	git grep -i "TILRAUN: Halló Heimur!" &&
-	git grep -i "TILRAUN: HALLÓ HEIMUR!"
+test_expect_success GETTEXT_LOCALE 'grep literal low string, no -F' '
+	git grep -i "TILRAUN: Halló Heimur!"
+'
+
+test_expect_success GETTEXT_LOCALE,LIBPCRE 'grep literal up string, no -F' '
+	git grep -i "TILRAUN: HALLÓ. HEIMUR!"
 '

 test_expect_success GETTEXT_LOCALE,LIBPCRE 'grep pcre utf-8 icase' '
@@ -31,33 +34,37 @@ test_expect_success GETTEXT_LOCALE,LIBPCRE 'grep pcre utf-8
string with "+"' '
 	test_cmp expected actual
 '

-test_expect_success GETTEXT_LOCALE 'grep literal string, with -F' '
+test_expect_success GETTEXT_LOCALE 'grep literal low string, with -F' '
 	git grep --debug -i -F "TILRAUN: Halló Heimur!"  2>&1 >/dev/null |
 		 grep fixed >debug1 &&
 	echo "fixedTILRAUN: Halló Heimur!" >expect1 &&
-	test_cmp expect1 debug1 &&
+	test_cmp expect1 debug1
+'

+test_expect_success GETTEXT_LOCALE 'grep literal up string, with -F' '
 	git grep --debug -i -F "TILRAUN: HALLÓ HEIMUR!"  2>&1 >/dev/null |
 		 grep fixed >debug2 &&
 	echo "fixedTILRAUN: HALLÓ HEIMUR!" >expect2 &&
 	test_cmp expect2 debug2
 '

-test_expect_success GETTEXT_LOCALE 'grep string with regex, with -F' '
+test_expect_success GETTEXT_LOCALE,LIBPCRE 'grep string with regex, with -F' '
 	printf "^*TILR^AUN:.* \\Halló \$He[]imur!\$" >file &&

 	git grep --debug -i -F "^*TILR^AUN:.* \\Halló \$He[]imur!\$" 2>&1 >/dev/null |
 		 grep fixed >debug1 &&
 	echo "fixed\\^*TILR^AUN:\\.\\* \\\\Halló \$He\\[]imur!\\\$" >expect1 &&
-	test_cmp expect1 debug1 &&
+	test_cmp expect1 debug1
+'

+test_expect_success GETTEXT_LOCALE,LIBPCRE 'grep up string with regex, with -F' '
 	git grep --debug -i -F "^*TILR^AUN:.* \\HALLÓ \$HE[]IMUR!\$"  2>&1 >/dev/null |
 		 grep fixed >debug2 &&
 	echo "fixed\\^*TILR^AUN:\\.\\* \\\\HALLÓ \$HE\\[]IMUR!\\\$" >expect2 &&
 	test_cmp expect2 debug2
 '

-test_expect_success GETTEXT_LOCALE 'pickaxe -i on non-ascii' '
+test_expect_success GETTEXT_LOCALE,LIBPCRE 'pickaxe -i on non-ascii' '
 	git commit -m first &&
 	git log --format=%f -i -S"TILRAUN: HALLÓ HEIMUR!" >actual &&
 	echo first >expected &&

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v4 00/10] icase match on non-ascii
  2015-07-14 13:24     ` [PATCH v3 " Nguyễn Thái Ngọc Duy
                         ` (9 preceding siblings ...)
  2015-07-14 16:42       ` [PATCH v3 0/9] icase " Torsten Bögershausen
@ 2015-08-21 12:47       ` Nguyễn Thái Ngọc Duy
  2015-08-21 12:47         ` [PATCH v4 01/10] grep: allow -F -i combination Nguyễn Thái Ngọc Duy
                           ` (9 more replies)
  10 siblings, 10 replies; 53+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2015-08-21 12:47 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, plamen.totev, l.s.r, Eric Sunshine, tboegi,
	Nguyễn Thái Ngọc Duy

Sorry it took more than a month for a simple reroll. Free time (with
energy left) is rare these days. v4 adds system regex's icase support
detection and only runs tests in these cases. This should fix test
failures on Windows where compat regex does not support icase.

Diff from v3 below

diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index d07fa20..a5475bb 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -11,7 +11,11 @@ test_expect_success GETTEXT_LOCALE 'setup' '
 	export LC_ALL
 '
 
-test_expect_success GETTEXT_LOCALE 'grep literal string, no -F' '
+test_have_prereq GETTEXT_LOCALE &&
+test-regex "HALLÓ" "Halló" ICASE &&
+test_set_prereq REGEX_LOCALE
+
+test_expect_success REGEX_LOCALE 'grep literal string, no -F' '
 	git grep -i "TILRAUN: Halló Heimur!" &&
 	git grep -i "TILRAUN: HALLÓ HEIMUR!"
 '
@@ -31,7 +35,7 @@ test_expect_success GETTEXT_LOCALE,LIBPCRE 'grep pcre utf-8 string with "+"' '
 	test_cmp expected actual
 '
 
-test_expect_success GETTEXT_LOCALE 'grep literal string, with -F' '
+test_expect_success REGEX_LOCALE 'grep literal string, with -F' '
 	git grep --debug -i -F "TILRAUN: Halló Heimur!"  2>&1 >/dev/null |
 		 grep fixed >debug1 &&
 	echo "fixedTILRAUN: Halló Heimur!" >expect1 &&
@@ -43,7 +47,7 @@ test_expect_success GETTEXT_LOCALE 'grep literal string, with -F' '
 	test_cmp expect2 debug2
 '
 
-test_expect_success GETTEXT_LOCALE 'grep string with regex, with -F' '
+test_expect_success REGEX_LOCALE 'grep string with regex, with -F' '
 	printf "^*TILR^AUN:.* \\Halló \$He[]imur!\$" >file &&
 
 	git grep --debug -i -F "^*TILR^AUN:.* \\Halló \$He[]imur!\$" 2>&1 >/dev/null |
@@ -57,7 +61,7 @@ test_expect_success GETTEXT_LOCALE 'grep string with regex, with -F' '
 	test_cmp expect2 debug2
 '
 
-test_expect_success GETTEXT_LOCALE 'pickaxe -i on non-ascii' '
+test_expect_success REGEX_LOCALE 'pickaxe -i on non-ascii' '
 	git commit -m first &&
 	git log --format=%f -i -S"TILRAUN: HALLÓ HEIMUR!" >actual &&
 	echo first >expected &&
diff --git a/test-regex.c b/test-regex.c
index 0dc598e..3b5641c 100644
--- a/test-regex.c
+++ b/test-regex.c
@@ -1,19 +1,63 @@
 #include "git-compat-util.h"
+#include "gettext.h"
+
+struct reg_flag {
+	const char *name;
+	int flag;
+};
+
+static struct reg_flag reg_flags[] = {
+	{ "EXTENDED",	 REG_EXTENDED	},
+	{ "NEWLINE",	 REG_NEWLINE	},
+	{ "ICASE",	 REG_ICASE	},
+	{ "NOTBOL",	 REG_NOTBOL	},
+#ifdef REG_STARTEND
+	{ "STARTEND",	 REG_STARTEND	},
+#endif
+	{ NULL, 0 }
+};
 
 int main(int argc, char **argv)
 {
-	char *pat = "[^={} \t]+";
-	char *str = "={}\nfred";
+	const char *pat;
+	const char *str;
+	int flags = 0;
 	regex_t r;
 	regmatch_t m[1];
 
-	if (regcomp(&r, pat, REG_EXTENDED | REG_NEWLINE))
+	if (argc == 1) {
+		/* special case, bug check */
+		pat = "[^={} \t]+";
+		str = "={}\nfred";
+		flags = REG_EXTENDED | REG_NEWLINE;
+	} else {
+		argv++;
+		pat = *argv++;
+		str = *argv++;
+		while (*argv) {
+			struct reg_flag *rf;
+			for (rf = reg_flags; rf->name; rf++)
+				if (!strcmp(*argv, rf->name)) {
+					flags |= rf->flag;
+					break;
+				}
+			if (!rf->name)
+				die("do not recognize %s", *argv);
+			argv++;
+		}
+		git_setup_gettext();
+	}
+
+	if (regcomp(&r, pat, flags))
 		die("failed regcomp() for pattern '%s'", pat);
-	if (regexec(&r, str, 1, m, 0))
-		die("no match of pattern '%s' to string '%s'", pat, str);
+	if (regexec(&r, str, 1, m, 0)) {
+		if (argc == 1)
+			die("no match of pattern '%s' to string '%s'", pat, str);
+		return 1;
+	}
 
 	/* http://sourceware.org/bugzilla/show_bug.cgi?id=3957  */
-	if (m[0].rm_so == 3) /* matches '\n' when it should not */
+	if (argc == 1 && m[0].rm_so == 3) /* matches '\n' when it should not */
 		die("regex bug confirmed: re-build git with NO_REGEX=1");
 
 	exit(0);
-- 
2.3.0.rc1.137.g477eb31

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v4 01/10] grep: allow -F -i combination
  2015-08-21 12:47       ` [PATCH v4 00/10] " Nguyễn Thái Ngọc Duy
@ 2015-08-21 12:47         ` Nguyễn Thái Ngọc Duy
  2015-08-21 12:47         ` [PATCH v4 02/10] grep: break down an "if" stmt in preparation for next changes Nguyễn Thái Ngọc Duy
                           ` (8 subsequent siblings)
  9 siblings, 0 replies; 53+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2015-08-21 12:47 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, plamen.totev, l.s.r, Eric Sunshine, tboegi,
	Nguyễn Thái Ngọc Duy

-F means "no regex", not "case sensitive" so it should not override -i

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 builtin/grep.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index d04f440..2d392e9 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -806,7 +806,7 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
 
 	if (!opt.pattern_list)
 		die(_("no pattern given."));
-	if (!opt.fixed && opt.ignore_case)
+	if (opt.ignore_case)
 		opt.regflags |= REG_ICASE;
 
 	compile_grep_patterns(&opt);
-- 
2.3.0.rc1.137.g477eb31

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v4 02/10] grep: break down an "if" stmt in preparation for next changes
  2015-08-21 12:47       ` [PATCH v4 00/10] " Nguyễn Thái Ngọc Duy
  2015-08-21 12:47         ` [PATCH v4 01/10] grep: allow -F -i combination Nguyễn Thái Ngọc Duy
@ 2015-08-21 12:47         ` Nguyễn Thái Ngọc Duy
  2015-08-21 12:47         ` [PATCH v4 03/10] test-regex: expose full regcomp() to the command line Nguyễn Thái Ngọc Duy
                           ` (7 subsequent siblings)
  9 siblings, 0 replies; 53+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2015-08-21 12:47 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, plamen.totev, l.s.r, Eric Sunshine, tboegi,
	Nguyễn Thái Ngọc Duy

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 grep.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/grep.c b/grep.c
index b58c7c6..bd32f66 100644
--- a/grep.c
+++ b/grep.c
@@ -403,9 +403,11 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 	p->word_regexp = opt->word_regexp;
 	p->ignore_case = opt->ignore_case;
 
-	if (opt->fixed || is_fixed(p->pattern, p->patternlen))
+	if (is_fixed(p->pattern, p->patternlen))
 		p->fixed = 1;
-	else
+	else if (opt->fixed) {
+		p->fixed = 1;
+	} else
 		p->fixed = 0;
 
 	if (p->fixed) {
-- 
2.3.0.rc1.137.g477eb31

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v4 03/10] test-regex: expose full regcomp() to the command line
  2015-08-21 12:47       ` [PATCH v4 00/10] " Nguyễn Thái Ngọc Duy
  2015-08-21 12:47         ` [PATCH v4 01/10] grep: allow -F -i combination Nguyễn Thái Ngọc Duy
  2015-08-21 12:47         ` [PATCH v4 02/10] grep: break down an "if" stmt in preparation for next changes Nguyễn Thái Ngọc Duy
@ 2015-08-21 12:47         ` Nguyễn Thái Ngọc Duy
  2015-08-21 12:47         ` [PATCH v4 04/10] grep/icase: avoid kwsset on literal non-ascii strings Nguyễn Thái Ngọc Duy
                           ` (6 subsequent siblings)
  9 siblings, 0 replies; 53+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2015-08-21 12:47 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, plamen.totev, l.s.r, Eric Sunshine, tboegi,
	Nguyễn Thái Ngọc Duy

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 test-regex.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 50 insertions(+), 6 deletions(-)

diff --git a/test-regex.c b/test-regex.c
index 0dc598e..3b5641c 100644
--- a/test-regex.c
+++ b/test-regex.c
@@ -1,19 +1,63 @@
 #include "git-compat-util.h"
+#include "gettext.h"
+
+struct reg_flag {
+	const char *name;
+	int flag;
+};
+
+static struct reg_flag reg_flags[] = {
+	{ "EXTENDED",	 REG_EXTENDED	},
+	{ "NEWLINE",	 REG_NEWLINE	},
+	{ "ICASE",	 REG_ICASE	},
+	{ "NOTBOL",	 REG_NOTBOL	},
+#ifdef REG_STARTEND
+	{ "STARTEND",	 REG_STARTEND	},
+#endif
+	{ NULL, 0 }
+};
 
 int main(int argc, char **argv)
 {
-	char *pat = "[^={} \t]+";
-	char *str = "={}\nfred";
+	const char *pat;
+	const char *str;
+	int flags = 0;
 	regex_t r;
 	regmatch_t m[1];
 
-	if (regcomp(&r, pat, REG_EXTENDED | REG_NEWLINE))
+	if (argc == 1) {
+		/* special case, bug check */
+		pat = "[^={} \t]+";
+		str = "={}\nfred";
+		flags = REG_EXTENDED | REG_NEWLINE;
+	} else {
+		argv++;
+		pat = *argv++;
+		str = *argv++;
+		while (*argv) {
+			struct reg_flag *rf;
+			for (rf = reg_flags; rf->name; rf++)
+				if (!strcmp(*argv, rf->name)) {
+					flags |= rf->flag;
+					break;
+				}
+			if (!rf->name)
+				die("do not recognize %s", *argv);
+			argv++;
+		}
+		git_setup_gettext();
+	}
+
+	if (regcomp(&r, pat, flags))
 		die("failed regcomp() for pattern '%s'", pat);
-	if (regexec(&r, str, 1, m, 0))
-		die("no match of pattern '%s' to string '%s'", pat, str);
+	if (regexec(&r, str, 1, m, 0)) {
+		if (argc == 1)
+			die("no match of pattern '%s' to string '%s'", pat, str);
+		return 1;
+	}
 
 	/* http://sourceware.org/bugzilla/show_bug.cgi?id=3957  */
-	if (m[0].rm_so == 3) /* matches '\n' when it should not */
+	if (argc == 1 && m[0].rm_so == 3) /* matches '\n' when it should not */
 		die("regex bug confirmed: re-build git with NO_REGEX=1");
 
 	exit(0);
-- 
2.3.0.rc1.137.g477eb31

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v4 04/10] grep/icase: avoid kwsset on literal non-ascii strings
  2015-08-21 12:47       ` [PATCH v4 00/10] " Nguyễn Thái Ngọc Duy
                           ` (2 preceding siblings ...)
  2015-08-21 12:47         ` [PATCH v4 03/10] test-regex: expose full regcomp() to the command line Nguyễn Thái Ngọc Duy
@ 2015-08-21 12:47         ` Nguyễn Thái Ngọc Duy
  2015-08-21 12:47         ` [PATCH v4 05/10] grep/icase: avoid kwsset when -F is specified Nguyễn Thái Ngọc Duy
                           ` (5 subsequent siblings)
  9 siblings, 0 replies; 53+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2015-08-21 12:47 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, plamen.totev, l.s.r, Eric Sunshine, tboegi,
	Nguyễn Thái Ngọc Duy

When we detect the pattern is just a literal string, we avoid heavy
regex engine and use fast substring search implemented in kwsset.c.
But kws uses git-ctype which is locale-independent so it does not know
how to fold case properly outside ascii range. Let regcomp or pcre
take care of this case instead. Slower, but accurate.

Helped-by: René Scharfe <l.s.r@web.de>
Noticed-by: Plamen Totev <plamen.totev@abv.bg>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 grep.c                                   |  7 ++++++-
 t/t7812-grep-icase-non-ascii.sh (new +x) | 23 +++++++++++++++++++++++
 2 files changed, 29 insertions(+), 1 deletion(-)
 create mode 100755 t/t7812-grep-icase-non-ascii.sh

diff --git a/grep.c b/grep.c
index bd32f66..d795b0e 100644
--- a/grep.c
+++ b/grep.c
@@ -4,6 +4,7 @@
 #include "xdiff-interface.h"
 #include "diff.h"
 #include "diffcore.h"
+#include "commit.h"
 
 static int grep_source_load(struct grep_source *gs);
 static int grep_source_is_binary(struct grep_source *gs);
@@ -398,12 +399,16 @@ static int is_fixed(const char *s, size_t len)
 
 static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
+	int icase_non_ascii;
 	int err;
 
 	p->word_regexp = opt->word_regexp;
 	p->ignore_case = opt->ignore_case;
+	icase_non_ascii =
+		(opt->regflags & REG_ICASE || p->ignore_case) &&
+		has_non_ascii(p->pattern);
 
-	if (is_fixed(p->pattern, p->patternlen))
+	if (!icase_non_ascii && is_fixed(p->pattern, p->patternlen))
 		p->fixed = 1;
 	else if (opt->fixed) {
 		p->fixed = 1;
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
new file mode 100755
index 0000000..6eff490
--- /dev/null
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -0,0 +1,23 @@
+#!/bin/sh
+
+test_description='grep icase on non-English locales'
+
+. ./lib-gettext.sh
+
+test_expect_success GETTEXT_LOCALE 'setup' '
+	printf "TILRAUN: Halló Heimur!" >file &&
+	git add file &&
+	LC_ALL="$is_IS_locale" &&
+	export LC_ALL
+'
+
+test_have_prereq GETTEXT_LOCALE &&
+test-regex "HALLÓ" "Halló" ICASE &&
+test_set_prereq REGEX_LOCALE
+
+test_expect_success REGEX_LOCALE 'grep literal string, no -F' '
+	git grep -i "TILRAUN: Halló Heimur!" &&
+	git grep -i "TILRAUN: HALLÓ HEIMUR!"
+'
+
+test_done
-- 
2.3.0.rc1.137.g477eb31

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v4 05/10] grep/icase: avoid kwsset when -F is specified
  2015-08-21 12:47       ` [PATCH v4 00/10] " Nguyễn Thái Ngọc Duy
                           ` (3 preceding siblings ...)
  2015-08-21 12:47         ` [PATCH v4 04/10] grep/icase: avoid kwsset on literal non-ascii strings Nguyễn Thái Ngọc Duy
@ 2015-08-21 12:47         ` Nguyễn Thái Ngọc Duy
  2015-08-21 12:47         ` [PATCH v4 06/10] grep/pcre: prepare locale-dependent tables for icase matching Nguyễn Thái Ngọc Duy
                           ` (4 subsequent siblings)
  9 siblings, 0 replies; 53+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2015-08-21 12:47 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, plamen.totev, l.s.r, Eric Sunshine, tboegi,
	Nguyễn Thái Ngọc Duy

Similar to the previous commit, we can't use kws on icase search
outside ascii range. But we can't simply pass the pattern to
regcomp/pcre like the previous commit because it may contain regex
special characters, so we need to quote the regex first.

To avoid misquote traps that could lead to undefined behavior, we
always stick to basic regex engine in this case. We don't need fancy
features for grepping a literal string anyway.

basic_regex_quote_buf() assumes that if the pattern is in a multibyte
encoding, ascii chars must be unambiguously encoded as single
bytes. This is true at least for UTF-8. For others, let's wait until
people yell up. Chances are nobody uses multibyte, non utf-8 charsets
any more..

Helped-by: René Scharfe <l.s.r@web.de>
Noticed-by: Plamen Totev <plamen.totev@abv.bg>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 grep.c                          | 25 ++++++++++++++++++++++++-
 quote.c                         | 37 +++++++++++++++++++++++++++++++++++++
 quote.h                         |  1 +
 t/t7812-grep-icase-non-ascii.sh | 26 ++++++++++++++++++++++++++
 4 files changed, 88 insertions(+), 1 deletion(-)

diff --git a/grep.c b/grep.c
index d795b0e..8fce54f 100644
--- a/grep.c
+++ b/grep.c
@@ -5,6 +5,7 @@
 #include "diff.h"
 #include "diffcore.h"
 #include "commit.h"
+#include "quote.h"
 
 static int grep_source_load(struct grep_source *gs);
 static int grep_source_is_binary(struct grep_source *gs);
@@ -397,6 +398,24 @@ static int is_fixed(const char *s, size_t len)
 	return 1;
 }
 
+static void compile_fixed_regexp(struct grep_pat *p, struct grep_opt *opt)
+{
+	struct strbuf sb = STRBUF_INIT;
+	int err;
+
+	basic_regex_quote_buf(&sb, p->pattern);
+	err = regcomp(&p->regexp, sb.buf, opt->regflags & ~REG_EXTENDED);
+	if (opt->debug)
+		fprintf(stderr, "fixed%s\n", sb.buf);
+	strbuf_release(&sb);
+	if (err) {
+		char errbuf[1024];
+		regerror(err, &p->regexp, errbuf, 1024);
+		regfree(&p->regexp);
+		compile_regexp_failed(p, errbuf);
+	}
+}
+
 static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
 	int icase_non_ascii;
@@ -411,7 +430,11 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 	if (!icase_non_ascii && is_fixed(p->pattern, p->patternlen))
 		p->fixed = 1;
 	else if (opt->fixed) {
-		p->fixed = 1;
+		p->fixed = !icase_non_ascii;
+		if (!p->fixed) {
+			compile_fixed_regexp(p, opt);
+			return;
+		}
 	} else
 		p->fixed = 0;
 
diff --git a/quote.c b/quote.c
index 7920e18..43a8057 100644
--- a/quote.c
+++ b/quote.c
@@ -439,3 +439,40 @@ void tcl_quote_buf(struct strbuf *sb, const char *src)
 	}
 	strbuf_addch(sb, '"');
 }
+
+void basic_regex_quote_buf(struct strbuf *sb, const char *src)
+{
+	char c;
+
+	if (*src == '^') {
+		/* only beginning '^' is special and needs quoting */
+		strbuf_addch(sb, '\\');
+		strbuf_addch(sb, *src++);
+	}
+	if (*src == '*')
+		/* beginning '*' is not special, no quoting */
+		strbuf_addch(sb, *src++);
+
+	while ((c = *src++)) {
+		switch (c) {
+		case '[':
+		case '.':
+		case '\\':
+		case '*':
+			strbuf_addch(sb, '\\');
+			strbuf_addch(sb, c);
+			break;
+
+		case '$':
+			/* only the end '$' is special and needs quoting */
+			if (*src == '\0')
+				strbuf_addch(sb, '\\');
+			strbuf_addch(sb, c);
+			break;
+
+		default:
+			strbuf_addch(sb, c);
+			break;
+		}
+	}
+}
diff --git a/quote.h b/quote.h
index 99e04d3..362d315 100644
--- a/quote.h
+++ b/quote.h
@@ -67,5 +67,6 @@ extern char *quote_path_relative(const char *in, const char *prefix,
 extern void perl_quote_buf(struct strbuf *sb, const char *src);
 extern void python_quote_buf(struct strbuf *sb, const char *src);
 extern void tcl_quote_buf(struct strbuf *sb, const char *src);
+extern void basic_regex_quote_buf(struct strbuf *sb, const char *src);
 
 #endif
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index 6eff490..aba6b15 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -20,4 +20,30 @@ test_expect_success REGEX_LOCALE 'grep literal string, no -F' '
 	git grep -i "TILRAUN: HALLÓ HEIMUR!"
 '
 
+test_expect_success REGEX_LOCALE 'grep literal string, with -F' '
+	git grep --debug -i -F "TILRAUN: Halló Heimur!"  2>&1 >/dev/null |
+		 grep fixed >debug1 &&
+	echo "fixedTILRAUN: Halló Heimur!" >expect1 &&
+	test_cmp expect1 debug1 &&
+
+	git grep --debug -i -F "TILRAUN: HALLÓ HEIMUR!"  2>&1 >/dev/null |
+		 grep fixed >debug2 &&
+	echo "fixedTILRAUN: HALLÓ HEIMUR!" >expect2 &&
+	test_cmp expect2 debug2
+'
+
+test_expect_success REGEX_LOCALE 'grep string with regex, with -F' '
+	printf "^*TILR^AUN:.* \\Halló \$He[]imur!\$" >file &&
+
+	git grep --debug -i -F "^*TILR^AUN:.* \\Halló \$He[]imur!\$" 2>&1 >/dev/null |
+		 grep fixed >debug1 &&
+	echo "fixed\\^*TILR^AUN:\\.\\* \\\\Halló \$He\\[]imur!\\\$" >expect1 &&
+	test_cmp expect1 debug1 &&
+
+	git grep --debug -i -F "^*TILR^AUN:.* \\HALLÓ \$HE[]IMUR!\$"  2>&1 >/dev/null |
+		 grep fixed >debug2 &&
+	echo "fixed\\^*TILR^AUN:\\.\\* \\\\HALLÓ \$HE\\[]IMUR!\\\$" >expect2 &&
+	test_cmp expect2 debug2
+'
+
 test_done
-- 
2.3.0.rc1.137.g477eb31

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v4 06/10] grep/pcre: prepare locale-dependent tables for icase matching
  2015-08-21 12:47       ` [PATCH v4 00/10] " Nguyễn Thái Ngọc Duy
                           ` (4 preceding siblings ...)
  2015-08-21 12:47         ` [PATCH v4 05/10] grep/icase: avoid kwsset when -F is specified Nguyễn Thái Ngọc Duy
@ 2015-08-21 12:47         ` Nguyễn Thái Ngọc Duy
  2015-08-21 12:47         ` [PATCH v4 07/10] gettext: add is_utf8_locale() Nguyễn Thái Ngọc Duy
                           ` (3 subsequent siblings)
  9 siblings, 0 replies; 53+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2015-08-21 12:47 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, plamen.totev, l.s.r, Eric Sunshine, tboegi,
	Nguyễn Thái Ngọc Duy

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=UTF-8, Size: 2366 bytes --]

The default tables are usually built with C locale and only suitable
for LANG=C or similar.  This should make case insensitive search work
correctly for all single-byte charsets.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 grep.c                             |  8 ++++++--
 grep.h                             |  1 +
 t/t7813-grep-icase-iso.sh (new +x) | 19 +++++++++++++++++++
 3 files changed, 26 insertions(+), 2 deletions(-)
 create mode 100755 t/t7813-grep-icase-iso.sh

diff --git a/grep.c b/grep.c
index 8fce54f..f0fbf99 100644
--- a/grep.c
+++ b/grep.c
@@ -324,11 +324,14 @@ static void compile_pcre_regexp(struct grep_pat *p, const struct grep_opt *opt)
 	int erroffset;
 	int options = PCRE_MULTILINE;
 
-	if (opt->ignore_case)
+	if (opt->ignore_case) {
+		if (has_non_ascii(p->pattern))
+			p->pcre_tables = pcre_maketables();
 		options |= PCRE_CASELESS;
+	}
 
 	p->pcre_regexp = pcre_compile(p->pattern, options, &error, &erroffset,
-			NULL);
+				      p->pcre_tables);
 	if (!p->pcre_regexp)
 		compile_regexp_failed(p, error);
 
@@ -362,6 +365,7 @@ static void free_pcre_regexp(struct grep_pat *p)
 {
 	pcre_free(p->pcre_regexp);
 	pcre_free(p->pcre_extra_info);
+	pcre_free((void *)p->pcre_tables);
 }
 #else /* !USE_LIBPCRE */
 static void compile_pcre_regexp(struct grep_pat *p, const struct grep_opt *opt)
diff --git a/grep.h b/grep.h
index 95f197a..cee4357 100644
--- a/grep.h
+++ b/grep.h
@@ -48,6 +48,7 @@ struct grep_pat {
 	regex_t regexp;
 	pcre *pcre_regexp;
 	pcre_extra *pcre_extra_info;
+	const unsigned char *pcre_tables;
 	kwset_t kws;
 	unsigned fixed:1;
 	unsigned ignore_case:1;
diff --git a/t/t7813-grep-icase-iso.sh b/t/t7813-grep-icase-iso.sh
new file mode 100755
index 0000000..efef7fb
--- /dev/null
+++ b/t/t7813-grep-icase-iso.sh
@@ -0,0 +1,19 @@
+#!/bin/sh
+
+test_description='grep icase on non-English locales'
+
+. ./lib-gettext.sh
+
+test_expect_success GETTEXT_ISO_LOCALE 'setup' '
+	printf "TILRAUN: Halló Heimur!" >file &&
+	git add file &&
+	LC_ALL="$is_IS_iso_locale" &&
+	export LC_ALL
+'
+
+test_expect_success GETTEXT_ISO_LOCALE,LIBPCRE 'grep pcre string' '
+	git grep --perl-regexp -i "TILRAUN: H.lló Heimur!" &&
+	git grep --perl-regexp -i "TILRAUN: H.LLÓ HEIMUR!"
+'
+
+test_done
-- 
2.3.0.rc1.137.g477eb31

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v4 07/10] gettext: add is_utf8_locale()
  2015-08-21 12:47       ` [PATCH v4 00/10] " Nguyễn Thái Ngọc Duy
                           ` (5 preceding siblings ...)
  2015-08-21 12:47         ` [PATCH v4 06/10] grep/pcre: prepare locale-dependent tables for icase matching Nguyễn Thái Ngọc Duy
@ 2015-08-21 12:47         ` Nguyễn Thái Ngọc Duy
  2015-08-21 12:47         ` [PATCH v4 08/10] grep/pcre: support utf-8 Nguyễn Thái Ngọc Duy
                           ` (2 subsequent siblings)
  9 siblings, 0 replies; 53+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2015-08-21 12:47 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, plamen.totev, l.s.r, Eric Sunshine, tboegi,
	Nguyễn Thái Ngọc Duy

This function returns true if git is running under an UTF-8
locale. pcre in the next patch will need this.

is_encoding_utf8() is used instead of strcmp() to catch both "utf-8"
and "utf8" suffixes.

When built with no gettext support, we peek in several env variables
to detect UTF-8. pcre library might support utf-8 even if libc is
built without locale support.. The peeking code is a copy from
compat/regex/regcomp.c

Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 gettext.c | 24 ++++++++++++++++++++++--
 gettext.h |  1 +
 2 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/gettext.c b/gettext.c
index a268a2c..db727ea 100644
--- a/gettext.c
+++ b/gettext.c
@@ -18,6 +18,8 @@
 #	endif
 #endif
 
+static const char *charset;
+
 /*
  * Guess the user's preferred languages from the value in LANGUAGE environment
  * variable and LC_MESSAGES locale category if NO_GETTEXT is not defined.
@@ -65,7 +67,6 @@ static int test_vsnprintf(const char *fmt, ...)
 	return ret;
 }
 
-static const char *charset;
 static void init_gettext_charset(const char *domain)
 {
 	/*
@@ -172,8 +173,27 @@ int gettext_width(const char *s)
 {
 	static int is_utf8 = -1;
 	if (is_utf8 == -1)
-		is_utf8 = !strcmp(charset, "UTF-8");
+		is_utf8 = is_utf8_locale();
 
 	return is_utf8 ? utf8_strwidth(s) : strlen(s);
 }
 #endif
+
+int is_utf8_locale(void)
+{
+#ifdef NO_GETTEXT
+	if (!charset) {
+		const char *env = getenv("LC_ALL");
+		if (!env || !*env)
+			env = getenv("LC_CTYPE");
+		if (!env || !*env)
+			env = getenv("LANG");
+		if (!env)
+			env = "";
+		if (strchr(env, '.'))
+			env = strchr(env, '.') + 1;
+		charset = xstrdup(env);
+	}
+#endif
+	return is_encoding_utf8(charset);
+}
diff --git a/gettext.h b/gettext.h
index 33696a4..7eee64a 100644
--- a/gettext.h
+++ b/gettext.h
@@ -90,5 +90,6 @@ const char *Q_(const char *msgid, const char *plu, unsigned long n)
 #endif
 
 const char *get_preferred_languages(void);
+extern int is_utf8_locale(void);
 
 #endif
-- 
2.3.0.rc1.137.g477eb31

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v4 08/10] grep/pcre: support utf-8
  2015-08-21 12:47       ` [PATCH v4 00/10] " Nguyễn Thái Ngọc Duy
                           ` (6 preceding siblings ...)
  2015-08-21 12:47         ` [PATCH v4 07/10] gettext: add is_utf8_locale() Nguyễn Thái Ngọc Duy
@ 2015-08-21 12:47         ` Nguyễn Thái Ngọc Duy
  2015-08-21 12:47         ` [PATCH v4 09/10] diffcore-pickaxe: "share" regex error handling code Nguyễn Thái Ngọc Duy
  2015-08-21 12:47         ` [PATCH v4 10/10] diffcore-pickaxe: support case insensitive match on non-ascii Nguyễn Thái Ngọc Duy
  9 siblings, 0 replies; 53+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2015-08-21 12:47 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, plamen.totev, l.s.r, Eric Sunshine, tboegi,
	Nguyễn Thái Ngọc Duy

In the previous change in this function, we add locale support for
single-byte encodings only. It looks like pcre only supports utf-* as
multibyte encodings, the others are left in the cold (which is
fine).

We need to enable PCRE_UTF8 so pcre can find character boundary
correctly. It's needed for case folding (when --ignore-case is used)
or '*', '+' or similar syntax is used.

The "has_non_ascii()" check is to be on the conservative side. If
there's non-ascii in the pattern, the searched content could still be
in utf-8, but we can treat it just like a byte stream and everything
should work. If we force utf-8 based on locale only and pcre validates
utf-8 and the file content is in non-utf8 encoding, things break.

Noticed-by: Plamen Totev <plamen.totev@abv.bg>
Helped-by: Plamen Totev <plamen.totev@abv.bg>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 grep.c                          |  2 ++
 t/t7812-grep-icase-non-ascii.sh | 15 +++++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/grep.c b/grep.c
index f0fbf99..07621c1 100644
--- a/grep.c
+++ b/grep.c
@@ -329,6 +329,8 @@ static void compile_pcre_regexp(struct grep_pat *p, const struct grep_opt *opt)
 			p->pcre_tables = pcre_maketables();
 		options |= PCRE_CASELESS;
 	}
+	if (is_utf8_locale() && has_non_ascii(p->pattern))
+		options |= PCRE_UTF8;
 
 	p->pcre_regexp = pcre_compile(p->pattern, options, &error, &erroffset,
 				      p->pcre_tables);
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index aba6b15..8896410 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -20,6 +20,21 @@ test_expect_success REGEX_LOCALE 'grep literal string, no -F' '
 	git grep -i "TILRAUN: HALLÓ HEIMUR!"
 '
 
+test_expect_success GETTEXT_LOCALE,LIBPCRE 'grep pcre utf-8 icase' '
+	git grep --perl-regexp    "TILRAUN: H.lló Heimur!" &&
+	git grep --perl-regexp -i "TILRAUN: H.lló Heimur!" &&
+	git grep --perl-regexp -i "TILRAUN: H.LLÓ HEIMUR!"
+'
+
+test_expect_success GETTEXT_LOCALE,LIBPCRE 'grep pcre utf-8 string with "+"' '
+	printf "TILRAUN: Hallóó Heimur!" >file2 &&
+	git add file2 &&
+	git grep -l --perl-regexp "TILRAUN: H.lló+ Heimur!" >actual &&
+	echo file >expected &&
+	echo file2 >>expected &&
+	test_cmp expected actual
+'
+
 test_expect_success REGEX_LOCALE 'grep literal string, with -F' '
 	git grep --debug -i -F "TILRAUN: Halló Heimur!"  2>&1 >/dev/null |
 		 grep fixed >debug1 &&
-- 
2.3.0.rc1.137.g477eb31

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v4 09/10] diffcore-pickaxe: "share" regex error handling code
  2015-08-21 12:47       ` [PATCH v4 00/10] " Nguyễn Thái Ngọc Duy
                           ` (7 preceding siblings ...)
  2015-08-21 12:47         ` [PATCH v4 08/10] grep/pcre: support utf-8 Nguyễn Thái Ngọc Duy
@ 2015-08-21 12:47         ` Nguyễn Thái Ngọc Duy
  2015-08-21 12:47         ` [PATCH v4 10/10] diffcore-pickaxe: support case insensitive match on non-ascii Nguyễn Thái Ngọc Duy
  9 siblings, 0 replies; 53+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2015-08-21 12:47 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, plamen.totev, l.s.r, Eric Sunshine, tboegi,
	Nguyễn Thái Ngọc Duy

There's another regcomp code block coming in this function. By moving
the error handling code out of this block, we don't have to add the
same error handling code in the new block.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 diffcore-pickaxe.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
index 185f86b..7a718fc 100644
--- a/diffcore-pickaxe.c
+++ b/diffcore-pickaxe.c
@@ -204,20 +204,13 @@ void diffcore_pickaxe(struct diff_options *o)
 	int opts = o->pickaxe_opts;
 	regex_t regex, *regexp = NULL;
 	kwset_t kws = NULL;
+	int err = 0;
 
 	if (opts & (DIFF_PICKAXE_REGEX | DIFF_PICKAXE_KIND_G)) {
-		int err;
 		int cflags = REG_EXTENDED | REG_NEWLINE;
 		if (DIFF_OPT_TST(o, PICKAXE_IGNORE_CASE))
 			cflags |= REG_ICASE;
 		err = regcomp(&regex, needle, cflags);
-		if (err) {
-			/* The POSIX.2 people are surely sick */
-			char errbuf[1024];
-			regerror(err, &regex, errbuf, 1024);
-			regfree(&regex);
-			die("invalid regex: %s", errbuf);
-		}
 		regexp = &regex;
 	} else {
 		kws = kwsalloc(DIFF_OPT_TST(o, PICKAXE_IGNORE_CASE)
@@ -225,6 +218,13 @@ void diffcore_pickaxe(struct diff_options *o)
 		kwsincr(kws, needle, strlen(needle));
 		kwsprep(kws);
 	}
+	if (err) {
+		/* The POSIX.2 people are surely sick */
+		char errbuf[1024];
+		regerror(err, &regex, errbuf, 1024);
+		regfree(&regex);
+		die("invalid regex: %s", errbuf);
+	}
 
 	/* Might want to warn when both S and G are on; I don't care... */
 	pickaxe(&diff_queued_diff, o, regexp, kws,
-- 
2.3.0.rc1.137.g477eb31

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v4 10/10] diffcore-pickaxe: support case insensitive match on non-ascii
  2015-08-21 12:47       ` [PATCH v4 00/10] " Nguyễn Thái Ngọc Duy
                           ` (8 preceding siblings ...)
  2015-08-21 12:47         ` [PATCH v4 09/10] diffcore-pickaxe: "share" regex error handling code Nguyễn Thái Ngọc Duy
@ 2015-08-21 12:47         ` Nguyễn Thái Ngọc Duy
  9 siblings, 0 replies; 53+ messages in thread
From: Nguyễn Thái Ngọc Duy @ 2015-08-21 12:47 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, plamen.totev, l.s.r, Eric Sunshine, tboegi,
	Nguyễn Thái Ngọc Duy

Similar to the "grep -F -i" case, we can't use kws on icase search
outside ascii range, so we quote the string and pass it to regcomp as
a basic regexp and let regex engine deal with case sensitivity.

The new test is put in t7812 instead of t4209-log-pickaxe because
lib-gettext.sh might cause problems elsewhere, probably..

Noticed-by: Plamen Totev <plamen.totev@abv.bg>
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 diffcore-pickaxe.c              | 11 +++++++++++
 t/t7812-grep-icase-non-ascii.sh |  7 +++++++
 2 files changed, 18 insertions(+)

diff --git a/diffcore-pickaxe.c b/diffcore-pickaxe.c
index 7a718fc..6946c15 100644
--- a/diffcore-pickaxe.c
+++ b/diffcore-pickaxe.c
@@ -7,6 +7,8 @@
 #include "diffcore.h"
 #include "xdiff-interface.h"
 #include "kwset.h"
+#include "commit.h"
+#include "quote.h"
 
 typedef int (*pickaxe_fn)(mmfile_t *one, mmfile_t *two,
 			  struct diff_options *o,
@@ -212,6 +214,15 @@ void diffcore_pickaxe(struct diff_options *o)
 			cflags |= REG_ICASE;
 		err = regcomp(&regex, needle, cflags);
 		regexp = &regex;
+	} else if (DIFF_OPT_TST(o, PICKAXE_IGNORE_CASE) &&
+		   has_non_ascii(needle)) {
+		struct strbuf sb = STRBUF_INIT;
+		int cflags = REG_NEWLINE | REG_ICASE;
+
+		basic_regex_quote_buf(&sb, needle);
+		err = regcomp(&regex, sb.buf, cflags);
+		strbuf_release(&sb);
+		regexp = &regex;
 	} else {
 		kws = kwsalloc(DIFF_OPT_TST(o, PICKAXE_IGNORE_CASE)
 			       ? tolower_trans_tbl : NULL);
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index 8896410..a5475bb 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -61,4 +61,11 @@ test_expect_success REGEX_LOCALE 'grep string with regex, with -F' '
 	test_cmp expect2 debug2
 '
 
+test_expect_success REGEX_LOCALE 'pickaxe -i on non-ascii' '
+	git commit -m first &&
+	git log --format=%f -i -S"TILRAUN: HALLÓ HEIMUR!" >actual &&
+	echo first >expected &&
+	test_cmp expected actual
+'
+
 test_done
-- 
2.3.0.rc1.137.g477eb31

^ permalink raw reply related	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2015-08-21 12:59 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-07-06 11:28 Git grep does not support multi-byte characters (like UTF-8) Plamen Totev
2015-07-06 12:23 ` Duy Nguyen
2015-07-07  8:58   ` Plamen Totev
2015-07-07 12:22     ` Duy Nguyen
2015-07-07 16:07     ` Junio C Hamano
2015-07-07 18:08       ` Plamen Totev
2015-07-08  2:19         ` Duy Nguyen
2015-07-08  4:52           ` Junio C Hamano
2015-07-06 12:42 ` [PATCH] grep: use regcomp() for icase search with non-ascii patterns Nguyễn Thái Ngọc Duy
2015-07-06 20:10   ` René Scharfe
2015-07-06 23:02     ` Duy Nguyen
2015-07-07 14:25       ` Plamen Totev
2015-07-08 10:38   ` [PATCH v2 0/9] icase match on non-ascii Nguyễn Thái Ngọc Duy
2015-07-08 10:38     ` [PATCH v2 1/9] grep: allow -F -i combination Nguyễn Thái Ngọc Duy
2015-07-08 10:38     ` [PATCH v2 2/9] grep: break down an "if" stmt in preparation for next changes Nguyễn Thái Ngọc Duy
2015-07-08 10:38     ` [PATCH v2 3/9] grep/icase: avoid kwsset on literal non-ascii strings Nguyễn Thái Ngọc Duy
2015-07-08 10:38     ` [PATCH v2 4/9] grep/icase: avoid kwsset when -F is specified Nguyễn Thái Ngọc Duy
2015-07-08 10:38     ` [PATCH v2 5/9] grep/pcre: prepare locale-dependent tables for icase matching Nguyễn Thái Ngọc Duy
2015-07-08 11:00       ` Duy Nguyen
2015-07-08 10:38     ` [PATCH v2 6/9] gettext: add is_utf8_locale() Nguyễn Thái Ngọc Duy
2015-07-08 10:38     ` [PATCH v2 7/9] grep/pcre: support utf-8 Nguyễn Thái Ngọc Duy
2015-07-11  8:07       ` Plamen Totev
2015-07-08 10:38     ` [PATCH v2 8/9] diffcore-pickaxe: "share" regex error handling code Nguyễn Thái Ngọc Duy
2015-07-08 10:38     ` [PATCH v2 9/9] diffcore-pickaxe: support case insensitive match on non-ascii Nguyễn Thái Ngọc Duy
2015-07-09 22:55       ` Eric Sunshine
2015-07-08 11:32     ` [PATCH v2 0/9] icase " Torsten Bögershausen
2015-07-08 12:13       ` Duy Nguyen
2015-07-08 15:36     ` Junio C Hamano
2015-07-08 23:28       ` Duy Nguyen
2015-07-14 13:24     ` [PATCH v3 " Nguyễn Thái Ngọc Duy
2015-07-14 13:24       ` [PATCH v3 1/9] grep: allow -F -i combination Nguyễn Thái Ngọc Duy
2015-07-14 13:24       ` [PATCH v3 2/9] grep: break down an "if" stmt in preparation for next changes Nguyễn Thái Ngọc Duy
2015-07-14 13:24       ` [PATCH v3 3/9] grep/icase: avoid kwsset on literal non-ascii strings Nguyễn Thái Ngọc Duy
2015-07-14 13:24       ` [PATCH v3 4/9] grep/icase: avoid kwsset when -F is specified Nguyễn Thái Ngọc Duy
2015-07-14 13:24       ` [PATCH v3 5/9] grep/pcre: prepare locale-dependent tables for icase matching Nguyễn Thái Ngọc Duy
2015-07-14 13:24       ` [PATCH v3 6/9] gettext: add is_utf8_locale() Nguyễn Thái Ngọc Duy
2015-07-14 13:24       ` [PATCH v3 7/9] grep/pcre: support utf-8 Nguyễn Thái Ngọc Duy
2015-07-14 13:24       ` [PATCH v3 8/9] diffcore-pickaxe: "share" regex error handling code Nguyễn Thái Ngọc Duy
2015-07-14 13:24       ` [PATCH v3 9/9] diffcore-pickaxe: support case insensitive match on non-ascii Nguyễn Thái Ngọc Duy
2015-07-14 16:42       ` [PATCH v3 0/9] icase " Torsten Bögershausen
2015-07-15  9:39         ` Duy Nguyen
2015-07-15 19:51           ` Torsten Bögershausen
2015-08-21 12:47       ` [PATCH v4 00/10] " Nguyễn Thái Ngọc Duy
2015-08-21 12:47         ` [PATCH v4 01/10] grep: allow -F -i combination Nguyễn Thái Ngọc Duy
2015-08-21 12:47         ` [PATCH v4 02/10] grep: break down an "if" stmt in preparation for next changes Nguyễn Thái Ngọc Duy
2015-08-21 12:47         ` [PATCH v4 03/10] test-regex: expose full regcomp() to the command line Nguyễn Thái Ngọc Duy
2015-08-21 12:47         ` [PATCH v4 04/10] grep/icase: avoid kwsset on literal non-ascii strings Nguyễn Thái Ngọc Duy
2015-08-21 12:47         ` [PATCH v4 05/10] grep/icase: avoid kwsset when -F is specified Nguyễn Thái Ngọc Duy
2015-08-21 12:47         ` [PATCH v4 06/10] grep/pcre: prepare locale-dependent tables for icase matching Nguyễn Thái Ngọc Duy
2015-08-21 12:47         ` [PATCH v4 07/10] gettext: add is_utf8_locale() Nguyễn Thái Ngọc Duy
2015-08-21 12:47         ` [PATCH v4 08/10] grep/pcre: support utf-8 Nguyễn Thái Ngọc Duy
2015-08-21 12:47         ` [PATCH v4 09/10] diffcore-pickaxe: "share" regex error handling code Nguyễn Thái Ngọc Duy
2015-08-21 12:47         ` [PATCH v4 10/10] diffcore-pickaxe: support case insensitive match on non-ascii Nguyễn Thái Ngọc Duy

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).