[PATCH] grep: skip UTF8 checks explicitally

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [PATCH] grep: skip UTF8 checks explicitally
@ 2019-07-21 18:31 Carlo Marcelo Arenas Belón
  2019-07-21 18:34 ` Eric Sunshine
  2019-07-22 11:55 ` Johannes Schindelin
  0 siblings, 2 replies; 26+ messages in thread
From: Carlo Marcelo Arenas Belón @ 2019-07-21 18:31 UTC (permalink / raw)
  To: git; +Cc: gitster, avarab, Johannes.Schindelin

Usually PCRE is compiled with JIT support, and therefore the code
path used includes calling pcre2_jit_match (for PCRE2), that ignores
invalid UTF-8 in the corpus.

Make that option explicit so it can be also used when JIT is not
enabled and pcre2_match is called instead, preventing `git grep`
to abort when hitting the first binary blob in a fixed match
after ed0479ce3d ("Merge branch 'ab/no-kwset' into next", 2019-07-15)

Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
---
 grep.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/grep.c b/grep.c
index fc0ed73ef3..146093f590 100644
--- a/grep.c
+++ b/grep.c
@@ -409,7 +409,7 @@ static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt)
 static int pcre1match(struct grep_pat *p, const char *line, const char *eol,
 		regmatch_t *match, int eflags)
 {
-	int ovector[30], ret, flags = 0;
+	int ovector[30], ret, flags = PCRE_NO_UTF8_CHECK;
 
 	if (eflags & REG_NOTBOL)
 		flags |= PCRE_NOTBOL;
@@ -554,7 +554,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 static int pcre2match(struct grep_pat *p, const char *line, const char *eol,
 		regmatch_t *match, int eflags)
 {
-	int ret, flags = 0;
+	int ret, flags = PCRE2_NO_UTF_CHECK;
 	PCRE2_SIZE *ovector;
 	PCRE2_UCHAR errbuf[256];
 
-- 
2.22.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH] grep: skip UTF8 checks explicitally
  2019-07-21 18:31 [PATCH] grep: skip UTF8 checks explicitally Carlo Marcelo Arenas Belón
@ 2019-07-21 18:34 ` Eric Sunshine
  2019-07-22 11:55 ` Johannes Schindelin
  1 sibling, 0 replies; 26+ messages in thread
From: Eric Sunshine @ 2019-07-21 18:34 UTC (permalink / raw)
  To: Carlo Marcelo Arenas Belón
  Cc: Git List, Junio C Hamano, Ævar Arnfjörð Bjarmason,
	Johannes Schindelin

On Sun, Jul 21, 2019 at 2:31 PM Carlo Marcelo Arenas Belón
<carenas@gmail.com> wrote:
> grep: skip UTF8 checks explicitally

s/explicitally/explicitly/

> Usually PCRE is compiled with JIT support, and therefore the code
> path used includes calling pcre2_jit_match (for PCRE2), that ignores
> invalid UTF-8 in the corpus.
>
> Make that option explicit so it can be also used when JIT is not
> enabled and pcre2_match is called instead, preventing `git grep`
> to abort when hitting the first binary blob in a fixed match
> after ed0479ce3d ("Merge branch 'ab/no-kwset' into next", 2019-07-15)
>
> Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] grep: skip UTF8 checks explicitally
  2019-07-21 18:31 [PATCH] grep: skip UTF8 checks explicitally Carlo Marcelo Arenas Belón
  2019-07-21 18:34 ` Eric Sunshine
@ 2019-07-22 11:55 ` Johannes Schindelin
  2019-07-22 14:43   ` [PATCH] grep: skip UTF8 checks explicitly Carlo Marcelo Arenas Belón
  2019-07-22 19:42   ` [PATCH] grep: skip UTF8 checks explicitally Ævar Arnfjörð Bjarmason
  1 sibling, 2 replies; 26+ messages in thread
From: Johannes Schindelin @ 2019-07-22 11:55 UTC (permalink / raw)
  To: Carlo Marcelo Arenas Belón; +Cc: git, gitster, avarab

[-- Attachment #1: Type: text/plain, Size: 1887 bytes --]

Hi Carlo,

On Sun, 21 Jul 2019, Carlo Marcelo Arenas Belón wrote:

> Usually PCRE is compiled with JIT support, and therefore the code
> path used includes calling pcre2_jit_match (for PCRE2), that ignores
> invalid UTF-8 in the corpus.
>
> Make that option explicit so it can be also used when JIT is not
> enabled and pcre2_match is called instead, preventing `git grep`
> to abort when hitting the first binary blob in a fixed match
> after ed0479ce3d ("Merge branch 'ab/no-kwset' into next", 2019-07-15)

Good idea.

The flag has been in PCRE1 since at least March 5, 2007, when the
pcre.h.in file was first recorded in their Subversion repository:
https://vcs.pcre.org/pcre/code/trunk/pcre.h.in?view=log

It also was part of PCRE2 from the first revision (rev 4, in fact, where
pcre2.h.in was added):
https://vcs.pcre.org/pcre2/code/trunk/src/pcre2.h.in?view=log

So I am fine with this patch.

Thanks,
Dscho

> ---
>  grep.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/grep.c b/grep.c
> index fc0ed73ef3..146093f590 100644
> --- a/grep.c
> +++ b/grep.c
> @@ -409,7 +409,7 @@ static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt)
>  static int pcre1match(struct grep_pat *p, const char *line, const char *eol,
>  		regmatch_t *match, int eflags)
>  {
> -	int ovector[30], ret, flags = 0;
> +	int ovector[30], ret, flags = PCRE_NO_UTF8_CHECK;
>
>  	if (eflags & REG_NOTBOL)
>  		flags |= PCRE_NOTBOL;
> @@ -554,7 +554,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
>  static int pcre2match(struct grep_pat *p, const char *line, const char *eol,
>  		regmatch_t *match, int eflags)
>  {
> -	int ret, flags = 0;
> +	int ret, flags = PCRE2_NO_UTF_CHECK;
>  	PCRE2_SIZE *ovector;
>  	PCRE2_UCHAR errbuf[256];
>
> --
> 2.22.0
>
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH] grep: skip UTF8 checks explicitly
  2019-07-22 11:55 ` Johannes Schindelin
@ 2019-07-22 14:43   ` Carlo Marcelo Arenas Belón
  2019-07-24  2:10     ` Junio C Hamano
  2019-07-22 19:42   ` [PATCH] grep: skip UTF8 checks explicitally Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 26+ messages in thread
From: Carlo Marcelo Arenas Belón @ 2019-07-22 14:43 UTC (permalink / raw)
  To: git; +Cc: gitster, avarab, Johannes Schindelin

Usually PCRE is compiled with JIT support, and therefore the code
path used includes calling pcre2_jit_match (for PCRE2), that ignores
invalid UTF-8 in the corpus.

Make that option explicit so it can be also used when JIT is not
enabled and pcre2_match is called instead, preventing `git grep`
to abort when hitting the first binary blob in a fixed match
after ed0479ce3d ("Merge branch 'ab/no-kwset' into next", 2019-07-15)

Reviewed-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
---
V2: spelling fixes from Eric Sunshine

 grep.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/grep.c b/grep.c
index fc0ed73ef3..146093f590 100644
--- a/grep.c
+++ b/grep.c
@@ -409,7 +409,7 @@ static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt)
 static int pcre1match(struct grep_pat *p, const char *line, const char *eol,
 		regmatch_t *match, int eflags)
 {
-	int ovector[30], ret, flags = 0;
+	int ovector[30], ret, flags = PCRE_NO_UTF8_CHECK;
 
 	if (eflags & REG_NOTBOL)
 		flags |= PCRE_NOTBOL;
@@ -554,7 +554,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 static int pcre2match(struct grep_pat *p, const char *line, const char *eol,
 		regmatch_t *match, int eflags)
 {
-	int ret, flags = 0;
+	int ret, flags = PCRE2_NO_UTF_CHECK;
 	PCRE2_SIZE *ovector;
 	PCRE2_UCHAR errbuf[256];
 
-- 
2.22.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH] grep: skip UTF8 checks explicitally
  2019-07-22 11:55 ` Johannes Schindelin
  2019-07-22 14:43   ` [PATCH] grep: skip UTF8 checks explicitly Carlo Marcelo Arenas Belón
@ 2019-07-22 19:42   ` Ævar Arnfjörð Bjarmason
  2019-07-23  3:50     ` Carlo Arenas
  1 sibling, 1 reply; 26+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-22 19:42 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Carlo Marcelo Arenas Belón, git, gitster


On Mon, Jul 22 2019, Johannes Schindelin wrote:

> Hi Carlo,
>
> On Sun, 21 Jul 2019, Carlo Marcelo Arenas Belón wrote:
>
>> Usually PCRE is compiled with JIT support, and therefore the code
>> path used includes calling pcre2_jit_match (for PCRE2), that ignores
>> invalid UTF-8 in the corpus.
>>
>> Make that option explicit so it can be also used when JIT is not
>> enabled and pcre2_match is called instead, preventing `git grep`
>> to abort when hitting the first binary blob in a fixed match
>> after ed0479ce3d ("Merge branch 'ab/no-kwset' into next", 2019-07-15)
>
> Good idea.
>
> The flag has been in PCRE1 since at least March 5, 2007, when the
> pcre.h.in file was first recorded in their Subversion repository:
> https://vcs.pcre.org/pcre/code/trunk/pcre.h.in?view=log
>
> It also was part of PCRE2 from the first revision (rev 4, in fact, where
> pcre2.h.in was added):
> https://vcs.pcre.org/pcre2/code/trunk/src/pcre2.h.in?view=log

Thanks for digging, that portability indeed sounds just fine.

> So I am fine with this patch.

I'm not, not because I dislike the approach. I haven't made up my mind
yet.

I stopped paying attention to this error-with-not-JIT discussion when I
heard that some other series going into next for Windows fixed that
issue[1]

But now we have it again in some form? My ab/no-kwset has a lot of tests
for encodings & locales combined with grep, don't some of those trigger
this? If so we should make any such failure a test & part of this patch.

Right now we don't have the info of whether we're really using the JIT
or not, but that would be easy to add to grep's --debug mode for use in
a test prereq.

As noted in [2] I'd be inclined to go the other way, if we indeed have
some cases where PCRE skips its own checks does not dying actually give
us anything useful? I'd think not, so just ignoring the issue seems like
the wrong thing to do.

Surely we're not producing useful grep results at that point, so just
not dying and mysteriously returning either nothing or garbage isn't
going to help much...

1. https://public-inbox.org/git/xmqq4l3wxk8j.fsf@gitster-ct.c.googlers.com/
2. https://public-inbox.org/git/87pnms7kv0.fsf@evledraar.gmail.com/

> Thanks,
> Dscho
>
>> ---
>>  grep.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/grep.c b/grep.c
>> index fc0ed73ef3..146093f590 100644
>> --- a/grep.c
>> +++ b/grep.c
>> @@ -409,7 +409,7 @@ static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt)
>>  static int pcre1match(struct grep_pat *p, const char *line, const char *eol,
>>  		regmatch_t *match, int eflags)
>>  {
>> -	int ovector[30], ret, flags = 0;
>> +	int ovector[30], ret, flags = PCRE_NO_UTF8_CHECK;
>>
>>  	if (eflags & REG_NOTBOL)
>>  		flags |= PCRE_NOTBOL;
>> @@ -554,7 +554,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
>>  static int pcre2match(struct grep_pat *p, const char *line, const char *eol,
>>  		regmatch_t *match, int eflags)
>>  {
>> -	int ret, flags = 0;
>> +	int ret, flags = PCRE2_NO_UTF_CHECK;
>>  	PCRE2_SIZE *ovector;
>>  	PCRE2_UCHAR errbuf[256];
>>
>> --
>> 2.22.0
>>
>>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] grep: skip UTF8 checks explicitally
  2019-07-22 19:42   ` [PATCH] grep: skip UTF8 checks explicitally Ævar Arnfjörð Bjarmason
@ 2019-07-23  3:50     ` Carlo Arenas
  2019-07-23 12:46       ` Johannes Schindelin
  0 siblings, 1 reply; 26+ messages in thread
From: Carlo Arenas @ 2019-07-23  3:50 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Johannes Schindelin, git, gitster

On Mon, Jul 22, 2019 at 12:42 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> On Mon, Jul 22 2019, Johannes Schindelin wrote:
>
> > So I am fine with this patch.
>
> I'm not, not because I dislike the approach. I haven't made up my mind
> yet.

my bad, I should had explained better the regression I was trying to
fix with this patch :

$ git version
git version 2.22.0.709.g102302147b.dirty
$ git grep "Nguyễn Thái"
.mailmap:Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
fatal: pcre2_match failed with error code -22: UTF-8 error: isolated
byte with 0x80 bit set
$ git grep -I "Nguyễn Thái"
.mailmap:Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
po/TEAMS:Members:       Nguyễn Thái Ngọc Duy <pclouds AT gmail.com>
po/vi.po:# Nguyễn Thái Ngọc Duy <pclouds@gmail.com>, 2012.
t/t1302-repo-version.sh:# Copyright (c) 2007 Nguyễn Thái Ngọc Duy
t/t2104-update-index-skip-worktree.sh:# Copyright (c) 2008 Nguyễn Thái Ngọc Duy
fatal: pcre2_match failed with error code -8: UTF-8 error: byte 2 top
bits not 0x80

> I stopped paying attention to this error-with-not-JIT discussion when I
> heard that some other series going into next for Windows fixed that
> issue[1]
>
> But now we have it again in some form? My ab/no-kwset has a lot of tests
> for encodings & locales combined with grep, don't some of those trigger
> this? If so we should make any such failure a test & part of this patch.

I don't have a windows environment to test, and I admit I wasn't
following clearly
the whole conversation but at least in my case, I never got any test
to fail, and I haven't
seen any test with broken UTF-8.

I apologize for not sending a test and will correct that, but I am
concerned that
according to PCRE documentation the behaviour is undefined and therefore the
test might be flacky.

> Right now we don't have the info of whether we're really using the JIT
> or not, but that would be easy to add to grep's --debug mode for use in
> a test prereq.

I would think that extending testtool would be a better option, I have indeed
a patch checking for UTF-8 support using git directly which I hadn't sent just
because I thought it looked too hacky, but agree it will be nice to have all
those dependencies clear and the corresponding tests otherwise

It is interesting to note that because of the conservative way we enable UTF-8
then it will depend on several factors if the problematic code gets triggered

> As noted in [2] I'd be inclined to go the other way, if we indeed have
> some cases where PCRE skips its own checks does not dying actually give
> us anything useful? I'd think not, so just ignoring the issue seems like
> the wrong thing to do.

we could reenable the checks by moving out of the JIT fast path in pcre so
that pcre2_match is used for all matches (with JIT compiled as an optimization)
and that might have the added benefit of solving the PCRE errors when JIT is
broken[1]

$ git version
git version 2.22.0
$ git grep "Nguyễn Thái"
.mailmap:Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
po/TEAMS:Members:       Nguyễn Thái Ngọc Duy <pclouds AT gmail.com>
po/vi.po:# Nguyễn Thái Ngọc Duy <pclouds@gmail.com>, 2012.
t/t1302-repo-version.sh:# Copyright (c) 2007 Nguyễn Thái Ngọc Duy
t/t2104-update-index-skip-worktree.sh:# Copyright (c) 2008 Nguyễn Thái Ngọc Duy
t/t7011-skip-worktree-reading.sh:# Copyright (c) 2008 Nguyễn Thái Ngọc Duy
t/t7012-skip-worktree-writing.sh:# Copyright (c) 2008 Nguyễn Thái Ngọc Duy

important to note that at least on my system (macOS 10.14.6) the output of all
engines for grep is the same for a UTF-8 pattern match, even if using invalid
UTF-8 in the corpus, and I would expect that to be a better indicative
of correctness

FWIW GNU grep (with -P) also ignores UTF-8 errors using the same flag and
even PCRE seems to be going in the other direction with the recent addition
of PCRE2_MATCH_INVALID_UTF[1]

$ od -b int.p
0000000   102 145 154 303 263 156 012 303
$ pcre2grep -U 'Beló' int.p
Belón

Carlo

[1] https://public-inbox.org/git/20181209230024.43444-1-carenas@gmail.com/
[2] https://lists.exim.org/lurker/message/20190528.141422.2af1e386.gl.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] grep: skip UTF8 checks explicitally
  2019-07-23  3:50     ` Carlo Arenas
@ 2019-07-23 12:46       ` Johannes Schindelin
  2019-07-24  1:47         ` Carlo Arenas
  0 siblings, 1 reply; 26+ messages in thread
From: Johannes Schindelin @ 2019-07-23 12:46 UTC (permalink / raw)
  To: Carlo Arenas; +Cc: Ævar Arnfjörð Bjarmason, git, gitster

[-- Attachment #1: Type: text/plain, Size: 5856 bytes --]

Hi Carlo,

On Mon, 22 Jul 2019, Carlo Arenas wrote:

> On Mon, Jul 22, 2019 at 12:42 PM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
> >
> > On Mon, Jul 22 2019, Johannes Schindelin wrote:
> >
> > > So I am fine with this patch.
> >
> > I'm not, not because I dislike the approach. I haven't made up my mind
> > yet.
>
> my bad, I should had explained better the regression I was trying to
> fix with this patch :
>
> $ git version
> git version 2.22.0.709.g102302147b.dirty
> $ git grep "Nguyễn Thái"
> .mailmap:Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
> fatal: pcre2_match failed with error code -22: UTF-8 error: isolated
> byte with 0x80 bit set
> $ git grep -I "Nguyễn Thái"
> .mailmap:Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
> po/TEAMS:Members:       Nguyễn Thái Ngọc Duy <pclouds AT gmail.com>
> po/vi.po:# Nguyễn Thái Ngọc Duy <pclouds@gmail.com>, 2012.
> t/t1302-repo-version.sh:# Copyright (c) 2007 Nguyễn Thái Ngọc Duy
> t/t2104-update-index-skip-worktree.sh:# Copyright (c) 2008 Nguyễn Thái Ngọc Duy
> fatal: pcre2_match failed with error code -8: UTF-8 error: byte 2 top
> bits not 0x80

This is worrisome.

The entire reason why we switch on Unicode mode is because it matters
for regular expressions. Take for example this one:

	Nguyễ*n

See how the ễ is supposed to be present an arbitrary number of times
(including 0 times)? If PCRE2 was to interpret this as UTF-8, it would
understand that the bytes 0xe1 0xbb 0x85 had to be repeated (or be
missing), but if it interprets it _not_ as UTF-8, then it would handle
_only the last_ byte, 0x85, that way, which will not work correctly.

So when PCRE2 complains about the top two bits not being 0x80, it fails
to parse the bytes correctly (byte 2 is 0xbb, whose two top bits are
indeed 0x80).

Maybe this is a bug in your PCRE2 version? Mine is 10.33... and this
does not happen here... But then, I don't need the `-I` option, and my
output looks like this:

-- snip --
$ git grep --perl-regexp "Nguyễn Thái"
.mailmap:Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
po/TEAMS:Members:       Nguyễn Thái Ngọc Duy <pclouds AT gmail.com>
po/vi.po:# Nguyễn Thái Ngọc Duy <pclouds@gmail.com>, 2012.
t/t1302-repo-version.sh:# Copyright (c) 2007 Nguyễn Thái Ngọc Duy
t/t2104-update-index-skip-worktree.sh:# Copyright (c) 2008 Nguyễn Thái Ngọc Duy
t/t7011-skip-worktree-reading.sh:# Copyright (c) 2008 Nguyễn Thái Ngọc Duy
t/t7012-skip-worktree-writing.sh:# Copyright (c) 2008 Nguyễn Thái Ngọc Duy
-- snap --

No error, you see?

Or maybe it is a problem with the locale? Is your locale using an
encoding different than UTF-8?

> > I stopped paying attention to this error-with-not-JIT discussion when I
> > heard that some other series going into next for Windows fixed that
> > issue[1]
> >
> > But now we have it again in some form? My ab/no-kwset has a lot of tests
> > for encodings & locales combined with grep, don't some of those trigger
> > this? If so we should make any such failure a test & part of this patch.
>
> I don't have a windows environment to test, and I admit I wasn't
> following clearly the whole conversation but at least in my case, I
> never got any test to fail, and I haven't seen any test with broken
> UTF-8.

The problem was not Windows-specific. It was specific to PCRE2 versions
compiled without JIT support (which was the case in Git for Windows'
SDK, but I fixed this on July 5th).

Your patch adequately addresses this problem: instead of erroring out in
the non-JIT version and passing in the JIT version, it will let also the
non-JIT version pass.

But still, I think the issue you found merits some deeper investigation,
methinks.

> > As noted in [2] I'd be inclined to go the other way, if we indeed have
> > some cases where PCRE skips its own checks does not dying actually give
> > us anything useful? I'd think not, so just ignoring the issue seems like
> > the wrong thing to do.
>
> we could reenable the checks by moving out of the JIT fast path in pcre so
> that pcre2_match is used for all matches (with JIT compiled as an optimization)
> and that might have the added benefit of solving the PCRE errors when JIT is
> broken[1]
>
> $ git version
> git version 2.22.0
> $ git grep "Nguyễn Thái"
> .mailmap:Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
> po/TEAMS:Members:       Nguyễn Thái Ngọc Duy <pclouds AT gmail.com>
> po/vi.po:# Nguyễn Thái Ngọc Duy <pclouds@gmail.com>, 2012.
> t/t1302-repo-version.sh:# Copyright (c) 2007 Nguyễn Thái Ngọc Duy
> t/t2104-update-index-skip-worktree.sh:# Copyright (c) 2008 Nguyễn Thái Ngọc Duy
> t/t7011-skip-worktree-reading.sh:# Copyright (c) 2008 Nguyễn Thái Ngọc Duy
> t/t7012-skip-worktree-writing.sh:# Copyright (c) 2008 Nguyễn Thái Ngọc Duy
>
> important to note that at least on my system (macOS 10.14.6) the output of all
> engines for grep is the same for a UTF-8 pattern match, even if using invalid
> UTF-8 in the corpus, and I would expect that to be a better indicative
> of correctness

Right. It is _inconsistent_ at the moment, and with your patch it would
at least be consistent again.

So I still would prefer your patch over "going the other way".

> FWIW GNU grep (with -P) also ignores UTF-8 errors using the same flag and
> even PCRE seems to be going in the other direction with the recent addition
> of PCRE2_MATCH_INVALID_UTF[1]

Good point!

Ciao,
Dscho

>
> $ od -b int.p
> 0000000   102 145 154 303 263 156 012 303
> $ pcre2grep -U 'Beló' int.p
> Belón
>
> Carlo
>
> [1] https://public-inbox.org/git/20181209230024.43444-1-carenas@gmail.com/
> [2] https://lists.exim.org/lurker/message/20190528.141422.2af1e386.gl.html
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] grep: skip UTF8 checks explicitally
  2019-07-23 12:46       ` Johannes Schindelin
@ 2019-07-24  1:47         ` Carlo Arenas
  2019-07-24 10:47           ` Johannes Schindelin
  0 siblings, 1 reply; 26+ messages in thread
From: Carlo Arenas @ 2019-07-24  1:47 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Ævar Arnfjörð Bjarmason, git, gitster

On Tue, Jul 23, 2019 at 5:47 AM Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
>
> So when PCRE2 complains about the top two bits not being 0x80, it fails
> to parse the bytes correctly (byte 2 is 0xbb, whose two top bits are
> indeed 0x80).

the error is confusing but it is not coming from the pattern, but from
what PCRE2 calls
the subject.

meaning that while going through the repository it found content that
it tried to match but
that it is not valid UTF-8, like all the png and a few txt files that
are not encoded as
UTF-8 (ex: t/t3900/ISO8859-1.txt).

> Maybe this is a bug in your PCRE2 version? Mine is 10.33... and this
> does not happen here... But then, I don't need the `-I` option, and my
> output looks like this:

-I was just an attempt to workaround the obvious binary files (like
PNG); I'll assume you
should be able to reproduce if using a non JIT enabled PCRE2,
regardless of version.

my point was that unlike in your report, I didn't have any test cases
failing, because
AFAIK there are no test cases using broken UTF-8 (the ones with binary data are
actually valid zero terminated UTF-8 strings)

Carlo

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] grep: skip UTF8 checks explicitly
  2019-07-22 14:43   ` [PATCH] grep: skip UTF8 checks explicitly Carlo Marcelo Arenas Belón
@ 2019-07-24  2:10     ` Junio C Hamano
  2019-07-24 10:50       ` Johannes Schindelin
  0 siblings, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2019-07-24  2:10 UTC (permalink / raw)
  To: Carlo Marcelo Arenas Belón; +Cc: git, avarab, Johannes Schindelin

Carlo Marcelo Arenas Belón  <carenas@gmail.com> writes:

> Usually PCRE is compiled with JIT support, and therefore the code
> path used includes calling pcre2_jit_match (for PCRE2), that ignores
> invalid UTF-8 in the corpus.
>
> Make that option explicit so it can be also used when JIT is not
> enabled and pcre2_match is called instead, preventing `git grep`
> to abort when hitting the first binary blob in a fixed match
> after ed0479ce3d ("Merge branch 'ab/no-kwset' into next", 2019-07-15)
>
> Reviewed-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
> Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
> ---
> V2: spelling fixes from Eric Sunshine

Good.  I was expecting fallouts like this from our recent push to
aggressively use pcre and that was why I merged ab/no-kwset before
I felt comfortable, so that we can have longer exposure.  It seems
to be paying off.

So with JIT, PCRE_NO_UTF8_CHECK is on by default, but without, we
need to give the option explicitly, and because it is on by default
in the JIT case, it would not hurt to explicitly pass it?  

That makes perfect sense to me.

>  grep.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/grep.c b/grep.c
> index fc0ed73ef3..146093f590 100644
> --- a/grep.c
> +++ b/grep.c
> @@ -409,7 +409,7 @@ static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt)
>  static int pcre1match(struct grep_pat *p, const char *line, const char *eol,
>  		regmatch_t *match, int eflags)
>  {
> -	int ovector[30], ret, flags = 0;
> +	int ovector[30], ret, flags = PCRE_NO_UTF8_CHECK;
>  
>  	if (eflags & REG_NOTBOL)
>  		flags |= PCRE_NOTBOL;
> @@ -554,7 +554,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
>  static int pcre2match(struct grep_pat *p, const char *line, const char *eol,
>  		regmatch_t *match, int eflags)
>  {
> -	int ret, flags = 0;
> +	int ret, flags = PCRE2_NO_UTF_CHECK;
>  	PCRE2_SIZE *ovector;
>  	PCRE2_UCHAR errbuf[256];

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] grep: skip UTF8 checks explicitally
  2019-07-24  1:47         ` Carlo Arenas
@ 2019-07-24 10:47           ` Johannes Schindelin
  2019-07-24 18:22             ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 26+ messages in thread
From: Johannes Schindelin @ 2019-07-24 10:47 UTC (permalink / raw)
  To: Carlo Arenas; +Cc: Ævar Arnfjörð Bjarmason, git, gitster

Hi Carlo,

On Tue, 23 Jul 2019, Carlo Arenas wrote:

> On Tue, Jul 23, 2019 at 5:47 AM Johannes Schindelin
> <Johannes.Schindelin@gmx.de> wrote:
> >
> > So when PCRE2 complains about the top two bits not being 0x80, it fails
> > to parse the bytes correctly (byte 2 is 0xbb, whose two top bits are
> > indeed 0x80).
>
> the error is confusing but it is not coming from the pattern, but from
> what PCRE2 calls
> the subject.
>
> meaning that while going through the repository it found content that
> it tried to match but
> that it is not valid UTF-8, like all the png and a few txt files that
> are not encoded as
> UTF-8 (ex: t/t3900/ISO8859-1.txt).
>
> > Maybe this is a bug in your PCRE2 version? Mine is 10.33... and this
> > does not happen here... But then, I don't need the `-I` option, and my
> > output looks like this:
>
> -I was just an attempt to workaround the obvious binary files (like
> PNG); I'll assume you
> should be able to reproduce if using a non JIT enabled PCRE2,
> regardless of version.
>
> my point was that unlike in your report, I didn't have any test cases
> failing, because
> AFAIK there are no test cases using broken UTF-8 (the ones with binary data are
> actually valid zero terminated UTF-8 strings)

Thank you for this explanation. I think it makes a total lot of sense.

So your motivation for this patch is actually a different one than mine,
and I would like to think that this actually strengthens the case _in
favor_ of it. The patch kind of kills two birds with one stone.

Thanks,
Dscho

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] grep: skip UTF8 checks explicitly
  2019-07-24  2:10     ` Junio C Hamano
@ 2019-07-24 10:50       ` Johannes Schindelin
  2019-07-24 16:08         ` Junio C Hamano
  0 siblings, 1 reply; 26+ messages in thread
From: Johannes Schindelin @ 2019-07-24 10:50 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Carlo Marcelo Arenas Belón, git, avarab

[-- Attachment #1: Type: text/plain, Size: 2666 bytes --]

Hi Junio,

On Tue, 23 Jul 2019, Junio C Hamano wrote:

> Carlo Marcelo Arenas Belón  <carenas@gmail.com> writes:
>
> > Usually PCRE is compiled with JIT support, and therefore the code
> > path used includes calling pcre2_jit_match (for PCRE2), that ignores
> > invalid UTF-8 in the corpus.
> >
> > Make that option explicit so it can be also used when JIT is not
> > enabled and pcre2_match is called instead, preventing `git grep`
> > to abort when hitting the first binary blob in a fixed match
> > after ed0479ce3d ("Merge branch 'ab/no-kwset' into next", 2019-07-15)
> >
> > Reviewed-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
> > Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
> > ---
> > V2: spelling fixes from Eric Sunshine
>
> Good.  I was expecting fallouts like this from our recent push to
> aggressively use pcre and that was why I merged ab/no-kwset before
> I felt comfortable, so that we can have longer exposure.  It seems
> to be paying off.

I agree: it pays off quite nicely.

> So with JIT, PCRE_NO_UTF8_CHECK is on by default, but without, we
> need to give the option explicitly, and because it is on by default
> in the JIT case, it would not hurt to explicitly pass it?
>
> That makes perfect sense to me.

My reading of the situation is slightly different. I think
PCRE_NO_UTF8_CHECK is off by default, but it only makes a difference in
the non-JIT'ed code path. Since we use PCRE2's JIT when possible
(because it leads to a quite nice performance improvement), we usually
don't see those warnings. Carlo's patch makes the non-JIT'ed code path
behave the same as our preferred code path.

Thanks,
Dscho

>
> >  grep.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/grep.c b/grep.c
> > index fc0ed73ef3..146093f590 100644
> > --- a/grep.c
> > +++ b/grep.c
> > @@ -409,7 +409,7 @@ static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt)
> >  static int pcre1match(struct grep_pat *p, const char *line, const char *eol,
> >  		regmatch_t *match, int eflags)
> >  {
> > -	int ovector[30], ret, flags = 0;
> > +	int ovector[30], ret, flags = PCRE_NO_UTF8_CHECK;
> >
> >  	if (eflags & REG_NOTBOL)
> >  		flags |= PCRE_NOTBOL;
> > @@ -554,7 +554,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
> >  static int pcre2match(struct grep_pat *p, const char *line, const char *eol,
> >  		regmatch_t *match, int eflags)
> >  {
> > -	int ret, flags = 0;
> > +	int ret, flags = PCRE2_NO_UTF_CHECK;
> >  	PCRE2_SIZE *ovector;
> >  	PCRE2_UCHAR errbuf[256];
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] grep: skip UTF8 checks explicitly
  2019-07-24 10:50       ` Johannes Schindelin
@ 2019-07-24 16:08         ` Junio C Hamano
  2019-08-28 14:54           ` [PATCH v2] " Carlo Marcelo Arenas Belón
  0 siblings, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2019-07-24 16:08 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Carlo Marcelo Arenas Belón, git, avarab

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> My reading of the situation is slightly different. I think
> PCRE_NO_UTF8_CHECK is off by default, but it only makes a difference in
> the non-JIT'ed code path. Since we use PCRE2's JIT when possible
> (because it leads to a quite nice performance improvement), we usually
> don't see those warnings. Carlo's patch makes the non-JIT'ed code path
> behave the same as our preferred code path.

You're right and we do want to make both codepaths behave the same
way.  The case we are having trouble with is without JIT, where the
machinery to look for needle in a non-UTF8 binary haystack barfs and
dies, which is unacceptable (imagine "git log --grep=..."), and we
want the machinery to ignore random binary gunk just like the JIT
codepath does.

Thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] grep: skip UTF8 checks explicitally
  2019-07-24 10:47           ` Johannes Schindelin
@ 2019-07-24 18:22             ` Ævar Arnfjörð Bjarmason
  2019-07-24 21:04               ` Junio C Hamano
  0 siblings, 1 reply; 26+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-24 18:22 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Carlo Arenas, git, gitster

On Wed, Jul 24 2019, Johannes Schindelin wrote:

> Hi Carlo,
>
> On Tue, 23 Jul 2019, Carlo Arenas wrote:
>
>> On Tue, Jul 23, 2019 at 5:47 AM Johannes Schindelin
>> <Johannes.Schindelin@gmx.de> wrote:
>> >
>> > So when PCRE2 complains about the top two bits not being 0x80, it fails
>> > to parse the bytes correctly (byte 2 is 0xbb, whose two top bits are
>> > indeed 0x80).
>>
>> the error is confusing but it is not coming from the pattern, but from
>> what PCRE2 calls
>> the subject.
>>
>> meaning that while going through the repository it found content that
>> it tried to match but
>> that it is not valid UTF-8, like all the png and a few txt files that
>> are not encoded as
>> UTF-8 (ex: t/t3900/ISO8859-1.txt).
>>
>> > Maybe this is a bug in your PCRE2 version? Mine is 10.33... and this
>> > does not happen here... But then, I don't need the `-I` option, and my
>> > output looks like this:
>>
>> -I was just an attempt to workaround the obvious binary files (like
>> PNG); I'll assume you
>> should be able to reproduce if using a non JIT enabled PCRE2,
>> regardless of version.
>>
>> my point was that unlike in your report, I didn't have any test cases
>> failing, because
>> AFAIK there are no test cases using broken UTF-8 (the ones with binary data are
>> actually valid zero terminated UTF-8 strings)
>
> Thank you for this explanation. I think it makes a total lot of sense.
>
> So your motivation for this patch is actually a different one than mine,
> and I would like to think that this actually strengthens the case _in
> favor_ of it. The patch kind of kills two birds with one stone.

This patch is really the wrong thing to do. Don't get me wrong, I'm
sympathetic to the *problem* and it should be solved, but this isn't the
solution.

The PCRE2_NO_UTF_CHECK flag means "I have checked that this is a valid
UTF-8 string so you, PCRE, don't need to re-check it". To quote
pcre2api(3):

    If you know that your pattern is a valid UTF string, and you want to
    skip this check for performance reasons, you can set the
    PCRE2_NO_UTF_CHECK option. When it is set, the effect of passing an
    in‐ valid UTF string as a pattern is undefined. It may cause your
    program to crash or loop.

(Later it's discussed that "pattern" here is also "subject string" in
the context of pcre2_{jit_,}match()).

I know almost nothing about the internals of PCRE's engine, but much of
it's based on Perl's, which I know way better. Doing the equivalent of
this in perl (setting the UTF8 flag on a SV) *will* cause asserts to
fail and possibly segfaults.

It's likely through dumb luck that this is "working". I.e. yes the JIT
mode is less anal about these checks, so if you say grep for "Nguyễn
Thái" in UTF-8 mode and there's binary data you're satisfied not to find
anything in that binary data.

But if you are I'm willing to bet this ruins your day, e.g PCRE would
"skip ahead" a character 4-byte character because it sees a telltale
U+10000 through U+10FFFF start sequence, except that wasn't a character,
it was some arbitrary binary.

Now, what is the solution? I don't have any patches yet, but things I
intend to look at:

 1) We're oversupplying PCRE2_UTF now, and one such case is what's being
    reported here. I.e. there's no reason I can think of for why a
    fixed-string pattern should need PCRE2_UTF set when not combined
    with --ignore-case. We can just not do that, but maybe I'm missing
    something there.

 2) We can do "try utf8, and fallback". A more advanced version of this
    is what the new PCRE2_MATCH_INVALID_UTF flag (mentioned upthread)
    does. I was thinking something closer to just carrying two compiled
    patterns, and falling back on the ~PCRE2_UTF one if we get a
    PCRE2_ERROR_UTF8_* error.

One reason we can't "just" go back to the pre-ab/no-kwset behavior is
that one thing it does is fix a long-standing bug where we'd do the
wrong thing under locales && -i && UTF-8 string/pattern. More precisely
we'd punt it to the C library's matching function, which would probably
do the wrong thing.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] grep: skip UTF8 checks explicitally
  2019-07-24 18:22             ` Ævar Arnfjörð Bjarmason
@ 2019-07-24 21:04               ` Junio C Hamano
  2019-07-25  9:48                 ` Johannes Schindelin
  0 siblings, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2019-07-24 21:04 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Johannes Schindelin, Carlo Arenas, git

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> The PCRE2_NO_UTF_CHECK flag means "I have checked that this is a valid
> UTF-8 string so you, PCRE, don't need to re-check it".

OK, in short, barfing and stopping is a problem, but that flag is
not the right knob to tweak.  And the right knob ...

>  1) We're oversupplying PCRE2_UTF now, and one such case is what's being
>     reported here. I.e. there's no reason I can think of for why a
>     fixed-string pattern should need PCRE2_UTF set when not combined
>     with --ignore-case. We can just not do that, but maybe I'm missing
>     something there.
>
>  2) We can do "try utf8, and fallback". A more advanced version of this
>     is what the new PCRE2_MATCH_INVALID_UTF flag (mentioned upthread)
>     does. I was thinking something closer to just carrying two compiled
>     patterns, and falling back on the ~PCRE2_UTF one if we get a
>     PCRE2_ERROR_UTF8_* error.

... lies somewhere along that line.  I think that is very sensible.
Let's make sure this gets sorted out soonish.

Thanks.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] grep: skip UTF8 checks explicitally
  2019-07-24 21:04               ` Junio C Hamano
@ 2019-07-25  9:48                 ` Johannes Schindelin
  2019-07-25 13:02                   ` Junio C Hamano
  0 siblings, 1 reply; 26+ messages in thread
From: Johannes Schindelin @ 2019-07-25  9:48 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Ævar Arnfjörð Bjarmason, Carlo Arenas, git

[-- Attachment #1: Type: text/plain, Size: 1946 bytes --]

Hi,

On Wed, 24 Jul 2019, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
> > The PCRE2_NO_UTF_CHECK flag means "I have checked that this is a valid
> > UTF-8 string so you, PCRE, don't need to re-check it".
>
> OK, in short, barfing and stopping is a problem, but that flag is
> not the right knob to tweak.  And the right knob ...
>
> >  1) We're oversupplying PCRE2_UTF now, and one such case is what's being
> >     reported here. I.e. there's no reason I can think of for why a
> >     fixed-string pattern should need PCRE2_UTF set when not combined
> >     with --ignore-case. We can just not do that, but maybe I'm missing
> >     something there.
> >
> >  2) We can do "try utf8, and fallback". A more advanced version of this
> >     is what the new PCRE2_MATCH_INVALID_UTF flag (mentioned upthread)
> >     does. I was thinking something closer to just carrying two compiled
> >     patterns, and falling back on the ~PCRE2_UTF one if we get a
> >     PCRE2_ERROR_UTF8_* error.
>
> ... lies somewhere along that line.  I think that is very sensible.

I am glad that everybody agrees with my original comment on ab/no-kwset
where I suggested that we should use our knowledge of the encoding of
the haystack and convert it to UTF-8 if we detect that the pattern is
UTF-8 encoded, and then pass the PCRE2_UTF flag only when applicable
(i.e. when we know that either needle or haystack is non-ASCII, and then
making sure that we convert to UTF-8 whenever necessary).

Okay, that came over a bit more sarcastic than I originally intended,
but if you try to filter that out, I think that is still the better
solution than to paper over the issue.

After all, PCRE2_MATCH_INVALID_UTF is only marginally better than what
Carlo suggested. _Marginally_. Not really worth considering, in my mind,
even.

> Let's make sure this gets sorted out soonish.

I agree,
Dscho

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] grep: skip UTF8 checks explicitally
  2019-07-25  9:48                 ` Johannes Schindelin
@ 2019-07-25 13:02                   ` Junio C Hamano
  2019-07-25 18:22                     ` Johannes Schindelin
  0 siblings, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2019-07-25 13:02 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Ævar Arnfjörð Bjarmason, Carlo Arenas, git

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

>> OK, in short, barfing and stopping is a problem, but that flag is
>> not the right knob to tweak.  And the right knob ...
>>
>> >  1) We're oversupplying PCRE2_UTF now, and one such case is what's being
>> >     reported here. I.e. there's no reason I can think of for why a
>> >     fixed-string pattern should need PCRE2_UTF set when not combined
>> >     with --ignore-case. We can just not do that, but maybe I'm missing
>> >     something there.
>> >
>> >  2) We can do "try utf8, and fallback". A more advanced version of this
>> >     is what the new PCRE2_MATCH_INVALID_UTF flag (mentioned upthread)
>> >     does. I was thinking something closer to just carrying two compiled
>> >     patterns, and falling back on the ~PCRE2_UTF one if we get a
>> >     PCRE2_ERROR_UTF8_* error.
>>
>> ... lies somewhere along that line.  I think that is very sensible.
>
> I am glad that everybody agrees with my original comment on ab/no-kwset
> where I suggested that we should use our knowledge of the encoding of
> the haystack and convert it to UTF-8 if we detect that the pattern is
> UTF-8 encoded,...

Please do not count me among "everybody", then.  I did not think
that Ævar meant to iconv the haystack when I wrote the message you
are responding to, but if that was what he meant, I would not have
said "very sensible".


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] grep: skip UTF8 checks explicitally
  2019-07-25 13:02                   ` Junio C Hamano
@ 2019-07-25 18:22                     ` Johannes Schindelin
  2019-07-26 15:15                       ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 26+ messages in thread
From: Johannes Schindelin @ 2019-07-25 18:22 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Ævar Arnfjörð Bjarmason, Carlo Arenas, git

[-- Attachment #1: Type: text/plain, Size: 1742 bytes --]

Hi Junio,

On Thu, 25 Jul 2019, Junio C Hamano wrote:

> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>
> >> OK, in short, barfing and stopping is a problem, but that flag is
> >> not the right knob to tweak.  And the right knob ...
> >>
> >> >  1) We're oversupplying PCRE2_UTF now, and one such case is what's being
> >> >     reported here. I.e. there's no reason I can think of for why a
> >> >     fixed-string pattern should need PCRE2_UTF set when not combined
> >> >     with --ignore-case. We can just not do that, but maybe I'm missing
> >> >     something there.
> >> >
> >> >  2) We can do "try utf8, and fallback". A more advanced version of this
> >> >     is what the new PCRE2_MATCH_INVALID_UTF flag (mentioned upthread)
> >> >     does. I was thinking something closer to just carrying two compiled
> >> >     patterns, and falling back on the ~PCRE2_UTF one if we get a
> >> >     PCRE2_ERROR_UTF8_* error.
> >>
> >> ... lies somewhere along that line.  I think that is very sensible.
> >
> > I am glad that everybody agrees with my original comment on ab/no-kwset
> > where I suggested that we should use our knowledge of the encoding of
> > the haystack and convert it to UTF-8 if we detect that the pattern is
> > UTF-8 encoded,...
>
> Please do not count me among "everybody", then.  I did not think
> that Ævar meant to iconv the haystack when I wrote the message you
> are responding to, but if that was what he meant, I would not have
> said "very sensible".

Okay, but in that case I cannot agree with your assessment that it is
very sensible.

If we're already deciding to paper over things, I'd much rather prefer
the simpler patch, i.e. Carlo's.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] grep: skip UTF8 checks explicitally
  2019-07-25 18:22                     ` Johannes Schindelin
@ 2019-07-26 15:15                       ` Ævar Arnfjörð Bjarmason
  2019-07-26 15:53                         ` Carlo Arenas
  2019-07-26 16:19                         ` Junio C Hamano
  0 siblings, 2 replies; 26+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-26 15:15 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Junio C Hamano, Carlo Arenas, git

On Thu, Jul 25 2019, Johannes Schindelin wrote:

> Hi Junio,
>
> On Thu, 25 Jul 2019, Junio C Hamano wrote:
>
>> Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:
>>
>> >> OK, in short, barfing and stopping is a problem, but that flag is
>> >> not the right knob to tweak.  And the right knob ...
>> >>
>> >> >  1) We're oversupplying PCRE2_UTF now, and one such case is what's being
>> >> >     reported here. I.e. there's no reason I can think of for why a
>> >> >     fixed-string pattern should need PCRE2_UTF set when not combined
>> >> >     with --ignore-case. We can just not do that, but maybe I'm missing
>> >> >     something there.
>> >> >
>> >> >  2) We can do "try utf8, and fallback". A more advanced version of this
>> >> >     is what the new PCRE2_MATCH_INVALID_UTF flag (mentioned upthread)
>> >> >     does. I was thinking something closer to just carrying two compiled
>> >> >     patterns, and falling back on the ~PCRE2_UTF one if we get a
>> >> >     PCRE2_ERROR_UTF8_* error.
>> >>
>> >> ... lies somewhere along that line.  I think that is very sensible.
>> >
>> > I am glad that everybody agrees with my original comment on ab/no-kwset
>> > where I suggested that we should use our knowledge of the encoding of
>> > the haystack and convert it to UTF-8 if we detect that the pattern is
>> > UTF-8 encoded,...
>>
>> Please do not count me among "everybody", then.  I did not think
>> that Ævar meant to iconv the haystack when I wrote the message you
>> are responding to, but if that was what he meant, I would not have
>> said "very sensible".
>
> Okay, but in that case I cannot agree with your assessment that it is
> very sensible.

FWIW what I meant was not that we'd run around and iconv() things, it
wouldn't make much sense to e.g. iconv() some PNG data to be "UTF-8
valid", which presumably would be the end result of something like that.

Rather that this model of assuming that a UTF-8 pattern means we can
consider everything in the repo UTF-8 in git-grep doesn't make sense. My
kwset patches *revealed* that problem in a painful way, but it was there
already.

I'm not sure what a real fix for that is. Part of it is probably 8/8 in
the series I mention below, but more generally we'd need to be more
encoding aware at a much higher callsite than "grep". So e.g. we'd know
that we match "binary" data as not-UTF-8. Now we just throw arbitrary
bytes around and hope something sticks.

> If we're already deciding to paper over things, I'd much rather prefer
> the simpler patch, i.e. Carlo's.

As I noted upthread PCRE's own docs promise undefined behavior and fire
and brimstone if that patch is applied. Those last two not
guaranteed. So we need another solution.

I've submitted
https://public-inbox.org/git/20190726150818.6373-1-avarab@gmail.com/
just now. See what you think about it.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] grep: skip UTF8 checks explicitally
  2019-07-26 15:15                       ` Ævar Arnfjörð Bjarmason
@ 2019-07-26 15:53                         ` Carlo Arenas
  2019-07-26 20:05                           ` Ævar Arnfjörð Bjarmason
  2019-07-26 16:19                         ` Junio C Hamano
  1 sibling, 1 reply; 26+ messages in thread
From: Carlo Arenas @ 2019-07-26 15:53 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Johannes Schindelin, Junio C Hamano, git

On Fri, Jul 26, 2019 at 8:15 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
> I'm not sure what a real fix for that is. Part of it is probably 8/8 in
> the series I mention below, but more generally we'd need to be more
> encoding aware at a much higher callsite than "grep". So e.g. we'd know
> that we match "binary" data as not-UTF-8. Now we just throw arbitrary
> bytes around and hope something sticks.

I haven't look yet at your proposed changes, but my gut feeling is that
the work to support invalid UTF in the yet unreleased PCRE version would
be needed as part of it, and therefore it might be better to keep PCRE
out of the main path until that gets released and can be relied upon.

kwset is not going away with this series anyway, regardless of the no-kwset
name on the branch.

> > If we're already deciding to paper over things, I'd much rather prefer
> > the simpler patch, i.e. Carlo's.
>
> As I noted upthread PCRE's own docs promise undefined behavior and fire
> and brimstone if that patch is applied. Those last two not
> guaranteed. So we need another solution.

in my original reply I mentioned I explicitly didn't do a test because of this
"undefined behavior", but I think it should be fair to mention that we are
already affected by that because using the JIT fast path does skip any
UTF-8 validations and is currently possible to get git into an infinite loop
or make it segfault when using PCRE.

in that line, I am not sure I understand the pushback against making that
explicit since it only makes both codepaths behave the same (bugs and
risks of burning alike)

Carlo

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] grep: skip UTF8 checks explicitally
  2019-07-26 15:15                       ` Ævar Arnfjörð Bjarmason
  2019-07-26 15:53                         ` Carlo Arenas
@ 2019-07-26 16:19                         ` Junio C Hamano
  2019-07-26 19:40                           ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2019-07-26 16:19 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Johannes Schindelin, Carlo Arenas, git

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> FWIW what I meant was not that we'd run around and iconv() things, it
> wouldn't make much sense to e.g. iconv() some PNG data to be "UTF-8
> valid", which presumably would be the end result of something like that.
>
> Rather that this model of assuming that a UTF-8 pattern means we can
> consider everything in the repo UTF-8 in git-grep doesn't make sense. My
> kwset patches *revealed* that problem in a painful way, but it was there
> already.

We already do assume that pathnames are UTF-8 (pathspecs on MacOS
are converted and then they are matched assuming that property).
Further, with the same mechanism, I think there is an assumption
that anything that comes from the command line is UTF-8 (and if I
recall correctly, doesn't the Windows port of Git force us to use
the same assumption---I recall we needed tests tweak for that).

In the very very longer term, I do not think we would want to keep
the assumption that the text encoding of blobs is always UTF-8, and
it would be nice to extend the system, so that blob data could be
marked in some way to say "I'm in Big-5, and not in UTF-8, so please
treat me as such" and magically the needle and the haystack can be
made to agree, with iconv() either one of them.  

But I do not think the current topic to fix the immediate/imminent
breakage should not be distracted by that.  Let's keep assuming that
any blob, when it is text, is UTF-8.

And from that point of view, I think the two pieces of idea in your
earlier message does make sense.  We can try to match as binary most
of the time, as UTF-8 would not let a valid UTF-8 needle match in
the haystack starting in the middle of a character.  When the user
is trying to match case-insensitively, we know the haystack in which
the user is interested in finding the needle is text, even though
there may be non-text blobs as well.

For example, "git grep -i 'foo' t/" may find a few png files under
the t/ directory.  We do not care if they happen to contain Foo and
we do not mind if they appear or do not appear in the result.  The
only two things we care about are (1) foo, Foo, FOO are found in the
text files under t/ and (2) the command does not die in the middle,
before processing all the files, only because a png file it found
were not UTF-8 valid.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] grep: skip UTF8 checks explicitally
  2019-07-26 16:19                         ` Junio C Hamano
@ 2019-07-26 19:40                           ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 26+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-26 19:40 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Johannes Schindelin, Carlo Arenas, git


On Fri, Jul 26 2019, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
>> FWIW what I meant was not that we'd run around and iconv() things, it
>> wouldn't make much sense to e.g. iconv() some PNG data to be "UTF-8
>> valid", which presumably would be the end result of something like that.
>>
>> Rather that this model of assuming that a UTF-8 pattern means we can
>> consider everything in the repo UTF-8 in git-grep doesn't make sense. My
>> kwset patches *revealed* that problem in a painful way, but it was there
>> already.
>
> We already do assume that pathnames are UTF-8 (pathspecs on MacOS
> are converted and then they are matched assuming that property).
> Further, with the same mechanism, I think there is an assumption
> that anything that comes from the command line is UTF-8 (and if I
> recall correctly, doesn't the Windows port of Git force us to use
> the same assumption---I recall we needed tests tweak for that).
>
> In the very very longer term, I do not think we would want to keep
> the assumption that the text encoding of blobs is always UTF-8, and
> it would be nice to extend the system, so that blob data could be
> marked in some way to say "I'm in Big-5, and not in UTF-8, so please
> treat me as such" and magically the needle and the haystack can be
> made to agree, with iconv() either one of them.
>
> But I do not think the current topic to fix the immediate/imminent
> breakage should not be distracted by that.  Let's keep assuming that
> any blob, when it is text, is UTF-8.
>
> And from that point of view, I think the two pieces of idea in your
> earlier message does make sense.  We can try to match as binary most
> of the time, as UTF-8 would not let a valid UTF-8 needle match in
> the haystack starting in the middle of a character.

*nod*

> When the user is trying to match case-insensitively, we know the
> haystack in which the user is interested in finding the needle is
> text, even though there may be non-text blobs as well.
>
> For example, "git grep -i 'foo' t/" may find a few png files under
> the t/ directory.  We do not care if they happen to contain Foo and
> we do not mind if they appear or do not appear in the result.  The
> only two things we care about are (1) foo, Foo, FOO are found in the
> text files under t/ and (2) the command does not die in the middle,
> before processing all the files, only because a png file it found
> were not UTF-8 valid.

I think this part's a step too far, and not how e.g. GNU grep
works. Peeking into binary data in a text grep is what people expect,
e.g. because you might want to recursively grep mixed text/mp3s for an
author. The text part of the mp3s means that metadata will be grepped
for inside the binary files.

Getting that right is hard around the edges though...

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH] grep: skip UTF8 checks explicitally
  2019-07-26 15:53                         ` Carlo Arenas
@ 2019-07-26 20:05                           ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 26+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-26 20:05 UTC (permalink / raw)
  To: Carlo Arenas
  Cc: Johannes Schindelin, Junio C Hamano, git,
	Nguyễn Thái Ngọc Duy

On Fri, Jul 26 2019, Carlo Arenas wrote:

> On Fri, Jul 26, 2019 at 8:15 AM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>> I'm not sure what a real fix for that is. Part of it is probably 8/8 in
>> the series I mention below, but more generally we'd need to be more
>> encoding aware at a much higher callsite than "grep". So e.g. we'd know
>> that we match "binary" data as not-UTF-8. Now we just throw arbitrary
>> bytes around and hope something sticks.
>
> I haven't look yet at your proposed changes, but my gut feeling is that
> the work to support invalid UTF in the yet unreleased PCRE version would
> be needed as part of it, and therefore it might be better to keep PCRE
> out of the main path until that gets released and can be relied upon.

I'm hoping my 8-part series is good enough to move it forward, but as
48de2a768c ("grep: remove the kwset optimization", .2019-07-01) shows we
can always just fall back on using regcomp instead.

> kwset is not going away with this series anyway, regardless of the no-kwset
> name on the branch.

The larger context here is that this is the 1st step of a 2-step series
to get rid of kwset. If I can pull that off successfully is another
matter, but that's the plan. After it's applied we just use it in the
pickaxe code, and it's relatively straightforward to convert that. See:
https://public-inbox.org/git/87v9x793qi.fsf@evledraar.gmail.com/

>> > If we're already deciding to paper over things, I'd much rather prefer
>> > the simpler patch, i.e. Carlo's.
>>
>> As I noted upthread PCRE's own docs promise undefined behavior and fire
>> and brimstone if that patch is applied. Those last two not
>> guaranteed. So we need another solution.
>
> in my original reply I mentioned I explicitly didn't do a test because of this
> "undefined behavior", but I think it should be fair to mention that we are
> already affected by that because using the JIT fast path does skip any
> UTF-8 validations and is currently possible to get git into an infinite loop
> or make it segfault when using PCRE.

Right, this is a good point that we should take notice of. I.e. this is
*not* a new bug per-se, you can do this on master and get a UTF-8 bug
from git.git:

    git grep -P '(*NO_JIT)[æ]'

> in that line, I am not sure I understand the pushback against making that
> explicit since it only makes both codepaths behave the same (bugs and
> risks of burning alike)

Because with my kwset series we're getting a lot more users of this
until-now obscure code, so we're finding old-but-new-to-us bugs.

We've had this bug dating all the way back to Duy's 18547aacf5
("grep/pcre: support utf-8", 2016-06-25). It was first released with git
2.10.

So why are we getting list discussion about it *now*? Because my kwset
series got merged to "next", and we apparently have a lot of users who'd
use fixed-string git-grep under locales, but never used PCRE via -P
explicitly before.

So it's worth getting the semantics right. As noted in the E-Mail I
linked to earlier my ulterior motive here is to get to a point where
we'll funnel all regex matching through PCRE implicitly if it's
available.

We need to get these UTF-8 edge cases right. I don't know if my recent
8-part series gets us 100% there, but hopefully it at least gets us
closer to it.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v2] grep: skip UTF8 checks explicitly
  2019-07-24 16:08         ` Junio C Hamano
@ 2019-08-28 14:54           ` Carlo Marcelo Arenas Belón
  2019-08-28 16:57             ` Carlo Arenas
  0 siblings, 1 reply; 26+ messages in thread
From: Carlo Marcelo Arenas Belón @ 2019-08-28 14:54 UTC (permalink / raw)
  To: git
  Cc: Johannes Schindelin, Eric Sunshine,
	Ævar Arnfjörð Bjarmason, Junio C Hamano

18547aacf5 ("grep/pcre: support utf-8", 2016-06-25) that was released
with git 2.10 added the PCRE_UTF8 flag to PCRE1 matching including a
call to has_non_ascii() to try to avoid breakage if there was non-utf8
encoded content in the haystack.

Usually PCRE is compiled with JIT support (even if is not the default),
and therefore the codepath used includes calling pcre_jit_exec, which
skips UTF-8 validation by design (which might result in crashes or hangs)
but when JIT support wasn't compiled we use pcre_exec instead with the
posibility that grep might be aborted if invalid UTF-8 is found in the
haystack.

PCRE1 provides a flag since Mar 5, 2007 that could be used to skip the
checks explicitly so use that to make both codepaths equivalent (the
flag is ignored by pcre1_jit_exec)

this fix is only implemented for PCRE1 because PCRE2 is likely to have
a better solution (without the risks) instead in the future

Helped-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Helped-by: Eric Sunshine <sunshine@sunshineco.com>
Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Suggested-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
---
V2:
* drop PCRE2 support
* add backward compatibility define

 grep.c | 4 ++--
 grep.h | 3 +++
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/grep.c b/grep.c
index f7c3a5803e..69ef69516e 100644
--- a/grep.c
+++ b/grep.c
@@ -421,7 +421,7 @@ static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt)
 static int pcre1match(struct grep_pat *p, const char *line, const char *eol,
 		regmatch_t *match, int eflags)
 {
-	int ovector[30], ret, flags = 0;
+	int ovector[30], ret, flags = PCRE_NO_UTF8_CHECK;
 
 	if (eflags & REG_NOTBOL)
 		flags |= PCRE_NOTBOL;
diff --git a/grep.h b/grep.h
index 1875880f37..9c8797a017 100644
--- a/grep.h
+++ b/grep.h
@@ -3,6 +3,9 @@
 #include "color.h"
 #ifdef USE_LIBPCRE1
 #include <pcre.h>
+#ifndef PCRE_NO_UTF8_CHECK
+#define PCRE_NO_UTF8_CHECK 0
+#endif
 #ifdef PCRE_CONFIG_JIT
 #if PCRE_MAJOR >= 8 && PCRE_MINOR >= 32
 #ifndef NO_LIBPCRE1_JIT
-- 
2.23.0

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v2] grep: skip UTF8 checks explicitly
  2019-08-28 14:54           ` [PATCH v2] " Carlo Marcelo Arenas Belón
@ 2019-08-28 16:57             ` Carlo Arenas
  2019-09-09 15:07               ` Carlo Arenas
  0 siblings, 1 reply; 26+ messages in thread
From: Carlo Arenas @ 2019-08-28 16:57 UTC (permalink / raw)
  To: git
  Cc: Johannes Schindelin, Eric Sunshine,
	Ævar Arnfjörð Bjarmason, Junio C Hamano

FWIW, the changes in grep.h are IMHO optional and hadn't been really
tested as I couldn't find a version of PCRE1 old enough to trigger it
but I am hoping are simple enough to work.

As in my original proposal, there is no test because this is going to
(as documented) trigger undefined behaviour and so I don't see how we
can reliably test that.  Goes without saying tthough, that the same is
triggered when JIT is enabled and the JIT fast path is being used, so
this is not a regression and will be more visible once
ab/pcre-jit-fixes gets merged because we are moving out of the JIT
fast path with a patch[0] in that series

While Ævar made a point[1] that this wasn't a solution to the problem,
it was because PCRE2 could have a better one (still missing but based
on ab/pcre-jit-fixes), and it was expected that PCRE2 will be used a
lot more and so it wasn't a good idea for it (something that doesn't
apply to PCRE1)

the patch was based on maint (like all bugfixes) but applies cleanly
to master and next, it will conflict with pu but for easy of merge I'd
applied it on top of cb/pcre1-cleanup and make it available in
GitHub[2]; that branch could be used as well as a reroll for that
topic (if that is preferred)

the error message hasn't been changed on this patch, as it might make
sense to improve it as well for PCRE2, but at least shouldn't be
triggered anymore (ex, from running a freshly built git without the
patch and linked against a non JIT enabled PCRE1):

$ ./git-grep -P 'Nguyễn Thái.Ngọc'
.mailmap:Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
fatal: pcre_exec failed with error code -10

Carlo

[0] 685668faaa (grep: stop using a custom JIT stack with PCRE v1, 2019-07-26)
[1] https://public-inbox.org/git/87lfwn70nb.fsf@evledraar.gmail.com/
[2] https://github.com/carenas/git/tree/pcre1-cleanup

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2] grep: skip UTF8 checks explicitly
  2019-08-28 16:57             ` Carlo Arenas
@ 2019-09-09 15:07               ` Carlo Arenas
  2019-09-09 18:49                 ` Junio C Hamano
  0 siblings, 1 reply; 26+ messages in thread
From: Carlo Arenas @ 2019-09-09 15:07 UTC (permalink / raw)
  To: git
  Cc: Johannes Schindelin, Eric Sunshine,
	Ævar Arnfjörð Bjarmason, Junio C Hamano

ping

any feedback on code/approach highly appreciated

Carlo

On Wed, Aug 28, 2019 at 9:57 AM Carlo Arenas <carenas@gmail.com> wrote:
>
> FWIW, the changes in grep.h are IMHO optional and hadn't been really
> tested as I couldn't find a version of PCRE1 old enough to trigger it
> but I am hoping are simple enough to work.
>
> As in my original proposal, there is no test because this is going to
> (as documented) trigger undefined behaviour and so I don't see how we
> can reliably test that.  Goes without saying tthough, that the same is
> triggered when JIT is enabled and the JIT fast path is being used, so
> this is not a regression and will be more visible once
> ab/pcre-jit-fixes gets merged because we are moving out of the JIT
> fast path with a patch[0] in that series
>
> While Ævar made a point[1] that this wasn't a solution to the problem,
> it was because PCRE2 could have a better one (still missing but based
> on ab/pcre-jit-fixes), and it was expected that PCRE2 will be used a
> lot more and so it wasn't a good idea for it (something that doesn't
> apply to PCRE1)
>
> the patch was based on maint (like all bugfixes) but applies cleanly
> to master and next, it will conflict with pu but for easy of merge I'd
> applied it on top of cb/pcre1-cleanup and make it available in
> GitHub[2]; that branch could be used as well as a reroll for that
> topic (if that is preferred)
>
> the error message hasn't been changed on this patch, as it might make
> sense to improve it as well for PCRE2, but at least shouldn't be
> triggered anymore (ex, from running a freshly built git without the
> patch and linked against a non JIT enabled PCRE1):
>
> $ ./git-grep -P 'Nguyễn Thái.Ngọc'
> .mailmap:Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
> fatal: pcre_exec failed with error code -10
>
> Carlo
>
> [0] 685668faaa (grep: stop using a custom JIT stack with PCRE v1, 2019-07-26)
> [1] https://public-inbox.org/git/87lfwn70nb.fsf@evledraar.gmail.com/
> [2] https://github.com/carenas/git/tree/pcre1-cleanup

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2] grep: skip UTF8 checks explicitly
  2019-09-09 15:07               ` Carlo Arenas
@ 2019-09-09 18:49                 ` Junio C Hamano
  0 siblings, 0 replies; 26+ messages in thread
From: Junio C Hamano @ 2019-09-09 18:49 UTC (permalink / raw)
  To: Carlo Arenas
  Cc: git, Johannes Schindelin, Eric Sunshine,
	Ævar Arnfjörð Bjarmason

Carlo Arenas <carenas@gmail.com> writes:

> ping
>
> any feedback on code/approach highly appreciated

I'd prefer to see others weigh in before starting to merge the pcre
stuff to 'next'.

I do not mind taking this updated one that limits the scope to pcre1
as a pure regressino fix, if others agree.  Thanks for keeping a tab
on these pcre issues.

>> While Ævar made a point[1] that this wasn't a solution to the problem,
>> it was because PCRE2 could have a better one (still missing but based
>> on ab/pcre-jit-fixes), and it was expected that PCRE2 will be used a
>> lot more and so it wasn't a good idea for it (something that doesn't
>> apply to PCRE1)
>>
>> the patch was based on maint (like all bugfixes) but applies cleanly
>> to master and next, it will conflict with pu but for easy of merge I'd
>> applied it on top of cb/pcre1-cleanup and make it available in
>> GitHub[2]; that branch could be used as well as a reroll for that
>> topic (if that is preferred)
>>
>> the error message hasn't been changed on this patch, as it might make
>> sense to improve it as well for PCRE2, but at least shouldn't be
>> triggered anymore (ex, from running a freshly built git without the
>> patch and linked against a non JIT enabled PCRE1):
>>
>> $ ./git-grep -P 'Nguyễn Thái.Ngọc'
>> .mailmap:Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
>> fatal: pcre_exec failed with error code -10
>>
>> Carlo
>>
>> [0] 685668faaa (grep: stop using a custom JIT stack with PCRE v1, 2019-07-26)
>> [1] https://public-inbox.org/git/87lfwn70nb.fsf@evledraar.gmail.com/
>> [2] https://github.com/carenas/git/tree/pcre1-cleanup

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2019-09-09 18:49 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-07-21 18:31 [PATCH] grep: skip UTF8 checks explicitally Carlo Marcelo Arenas Belón
2019-07-21 18:34 ` Eric Sunshine
2019-07-22 11:55 ` Johannes Schindelin
2019-07-22 14:43   ` [PATCH] grep: skip UTF8 checks explicitly Carlo Marcelo Arenas Belón
2019-07-24  2:10     ` Junio C Hamano
2019-07-24 10:50       ` Johannes Schindelin
2019-07-24 16:08         ` Junio C Hamano
2019-08-28 14:54           ` [PATCH v2] " Carlo Marcelo Arenas Belón
2019-08-28 16:57             ` Carlo Arenas
2019-09-09 15:07               ` Carlo Arenas
2019-09-09 18:49                 ` Junio C Hamano
2019-07-22 19:42   ` [PATCH] grep: skip UTF8 checks explicitally Ævar Arnfjörð Bjarmason
2019-07-23  3:50     ` Carlo Arenas
2019-07-23 12:46       ` Johannes Schindelin
2019-07-24  1:47         ` Carlo Arenas
2019-07-24 10:47           ` Johannes Schindelin
2019-07-24 18:22             ` Ævar Arnfjörð Bjarmason
2019-07-24 21:04               ` Junio C Hamano
2019-07-25  9:48                 ` Johannes Schindelin
2019-07-25 13:02                   ` Junio C Hamano
2019-07-25 18:22                     ` Johannes Schindelin
2019-07-26 15:15                       ` Ævar Arnfjörð Bjarmason
2019-07-26 15:53                         ` Carlo Arenas
2019-07-26 20:05                           ` Ævar Arnfjörð Bjarmason
2019-07-26 16:19                         ` Junio C Hamano
2019-07-26 19:40                           ` Ævar Arnfjörð Bjarmason

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).