git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / Atom feed
* [PATCH] grep: use custom JIT stack with pcre2
@ 2019-07-21 19:40 Carlo Marcelo Arenas Belón
  2019-07-24 15:14 ` [PATCH 0/3] grep: PCRE JIT fixes Ævar Arnfjörð Bjarmason
                   ` (3 more replies)
  0 siblings, 4 replies; 52+ messages in thread
From: Carlo Marcelo Arenas Belón @ 2019-07-21 19:40 UTC (permalink / raw)
  To: git; +Cc: avarab

94da9193a6 ("grep: add support for PCRE v2", 2017-06-01) allocate
a stack and assign it to a match context, but never pass it to
pcre2_jit_match, using instead the default.

Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
---
This might have positive performance consequences (per the comments)
but haven't tested them; if there is no difference might be better
instead to remove the stack and match_context and save the related
memory, since it seems the default was working fine anyway.

 grep.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/grep.c b/grep.c
index 146093f590..ff76907ceb 100644
--- a/grep.c
+++ b/grep.c
@@ -564,7 +564,7 @@ static int pcre2match(struct grep_pat *p, const char *line, const char *eol,
 	if (p->pcre2_jit_on)
 		ret = pcre2_jit_match(p->pcre2_pattern, (unsigned char *)line,
 				      eol - line, 0, flags, p->pcre2_match_data,
-				      NULL);
+				      p->pcre2_match_context);
 	else
 		ret = pcre2_match(p->pcre2_pattern, (unsigned char *)line,
 				  eol - line, 0, flags, p->pcre2_match_data,
-- 
2.22.0


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH 0/3] grep: PCRE JIT fixes
  2019-07-21 19:40 [PATCH] grep: use custom JIT stack with pcre2 Carlo Marcelo Arenas Belón
@ 2019-07-24 15:14 ` Ævar Arnfjörð Bjarmason
  2019-07-24 16:18   ` Junio C Hamano
                     ` (9 more replies)
  2019-07-24 15:14 ` [PATCH 1/3] grep: remove overly paranoid BUG(...) code Ævar Arnfjörð Bjarmason
                   ` (2 subsequent siblings)
  3 siblings, 10 replies; 52+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-24 15:14 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Carlo Marcelo Arenas Belón, Beat Bolli,
	Johannes Schindelin, Ævar Arnfjörð Bjarmason

There's a couple of patches fixing mistakes in the JIT code I added
for PCRE in <20190722181923.21572-1-dev+git@drbeat.li> and
<20190721194052.15440-1-carenas@gmail.com>

This small series proposes to replace both of those. In both cases I
think we're better off just removing the relevant code. The commit
messages for the patches themselves make the case for that.

Ævar Arnfjörð Bjarmason (3):
  grep: remove overly paranoid BUG(...) code
  grep: stop "using" a custom JIT stack with PCRE v2
  grep: stop using a custom JIT stack with PCRE v1

 grep.c | 46 ++++++----------------------------------------
 grep.h |  9 ---------
 2 files changed, 6 insertions(+), 49 deletions(-)

-- 
2.22.0.455.g172b71a6c5


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH 1/3] grep: remove overly paranoid BUG(...) code
  2019-07-21 19:40 [PATCH] grep: use custom JIT stack with pcre2 Carlo Marcelo Arenas Belón
  2019-07-24 15:14 ` [PATCH 0/3] grep: PCRE JIT fixes Ævar Arnfjörð Bjarmason
@ 2019-07-24 15:14 ` Ævar Arnfjörð Bjarmason
  2019-07-24 15:14 ` [PATCH 2/3] grep: stop "using" a custom JIT stack with PCRE v2 Ævar Arnfjörð Bjarmason
  2019-07-24 15:14 ` [PATCH 3/3] grep: stop using a custom JIT stack with PCRE v1 Ævar Arnfjörð Bjarmason
  3 siblings, 0 replies; 52+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-24 15:14 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Carlo Marcelo Arenas Belón, Beat Bolli,
	Johannes Schindelin, Ævar Arnfjörð Bjarmason

Remove code that would trigger if pcre_config() or pcre2_config() was
so broken that "do we have JIT?" wouldn't return a boolean.

I added this code back in fbaceaac47 ("grep: add support for the PCRE
v1 JIT API", 2017-05-25) and then as noted in [1] incorrectly
copy/pasted some of it in 94da9193a6 ("grep: add support for PCRE v2",
2017-06-01).

Let's just remove it instead of fixing that bug. Being this paranoid
about what PCRE returns is crossing the line into unreasonable
paranoia.

1. https://public-inbox.org/git/20190722181923.21572-1-dev+git@drbeat.li/

Reported-by:  Beat Bolli <dev+git@drbeat.li>
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 grep.c | 10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/grep.c b/grep.c
index f7c3a5803e..be4282fef3 100644
--- a/grep.c
+++ b/grep.c
@@ -406,14 +406,11 @@ static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt)
 
 #ifdef GIT_PCRE1_USE_JIT
 	pcre_config(PCRE_CONFIG_JIT, &p->pcre1_jit_on);
-	if (p->pcre1_jit_on == 1) {
+	if (p->pcre1_jit_on) {
 		p->pcre1_jit_stack = pcre_jit_stack_alloc(1, 1024 * 1024);
 		if (!p->pcre1_jit_stack)
 			die("Couldn't allocate PCRE JIT stack");
 		pcre_assign_jit_stack(p->pcre1_extra_info, NULL, p->pcre1_jit_stack);
-	} else if (p->pcre1_jit_on != 0) {
-		BUG("The pcre1_jit_on variable should be 0 or 1, not %d",
-		    p->pcre1_jit_on);
 	}
 #endif
 }
@@ -522,7 +519,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 	}
 
 	pcre2_config(PCRE2_CONFIG_JIT, &p->pcre2_jit_on);
-	if (p->pcre2_jit_on == 1) {
+	if (p->pcre2_jit_on) {
 		jitret = pcre2_jit_compile(p->pcre2_pattern, PCRE2_JIT_COMPLETE);
 		if (jitret)
 			die("Couldn't JIT the PCRE2 pattern '%s', got '%d'\n", p->pattern, jitret);
@@ -557,9 +554,6 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 		if (!p->pcre2_match_context)
 			die("Couldn't allocate PCRE2 match context");
 		pcre2_jit_stack_assign(p->pcre2_match_context, NULL, p->pcre2_jit_stack);
-	} else if (p->pcre2_jit_on != 0) {
-		BUG("The pcre2_jit_on variable should be 0 or 1, not %d",
-		    p->pcre1_jit_on);
 	}
 }
 
-- 
2.22.0.455.g172b71a6c5


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH 2/3] grep: stop "using" a custom JIT stack with PCRE v2
  2019-07-21 19:40 [PATCH] grep: use custom JIT stack with pcre2 Carlo Marcelo Arenas Belón
  2019-07-24 15:14 ` [PATCH 0/3] grep: PCRE JIT fixes Ævar Arnfjörð Bjarmason
  2019-07-24 15:14 ` [PATCH 1/3] grep: remove overly paranoid BUG(...) code Ævar Arnfjörð Bjarmason
@ 2019-07-24 15:14 ` Ævar Arnfjörð Bjarmason
  2019-07-24 16:24   ` Junio C Hamano
  2019-07-24 15:14 ` [PATCH 3/3] grep: stop using a custom JIT stack with PCRE v1 Ævar Arnfjörð Bjarmason
  3 siblings, 1 reply; 52+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-24 15:14 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Carlo Marcelo Arenas Belón, Beat Bolli,
	Johannes Schindelin, Ævar Arnfjörð Bjarmason

As reported in [1] the code I added in 94da9193a6 ("grep: add support
for PCRE v2", 2017-06-01) to use a custom JIT stack has never
worked. It was incorrectly copy/pasted from code I added in
fbaceaac47 ("grep: add support for the PCRE v1 JIT API", 2017-05-25),
which did work.

Thus our intention of starting with 1 byte of stack at a maximum of 1
MB didn't happen, we'd always use the 32 KB stack provided by PCRE
v2's jit_machine_stack_exec()[2]. The reason I allocated a custom
stack at all was this advice in pcrejit(3) (same in pcre2jit(3)):

    "By default, it uses 32KiB on the machine stack. However, some
    large or complicated patterns need more than this"

Since we've haven't had any reports of users running into
PCRE2_ERROR_JIT_STACKLIMIT in the wild I think we can safely assume
that we can just use the library defaults instead and drop this
code. This won't change with the wider use of PCRE v2 in
ed0479ce3d ("Merge branch 'ab/no-kwset' into next", 2019-07-15), a
fixed string search is not a "large or complicated pattern".

For good measure I ran the performance test noted in 94da9193a6,
although the command is simpler now due to my 0f50c8e32c ("Makefile:
remove the NO_R_TO_GCC_LINKER flag", 2019-05-17):

    GIT_PERF_REPEAT_COUNT=30 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_OPTS='-j8 USE_LIBPCRE2=Y CFLAGS=-O3 LIBPCREDIR=/home/avar/g/pcre2/inst' ./run HEAD~ HEAD p7820-grep-engines.sh

Just the /perl/ results are:

    Test                                            HEAD~             HEAD
    ---------------------------------------------------------------------------------------
    7820.3: perl grep 'how.to'                      0.17(0.27+0.65)   0.17(0.24+0.68) +0.0%
    7820.7: perl grep '^how to'                     0.16(0.23+0.66)   0.16(0.23+0.67) +0.0%
    7820.11: perl grep '[how] to'                   0.18(0.35+0.62)   0.18(0.33+0.65) +0.0%
    7820.15: perl grep '(e.t[^ ]*|v.ry) rare'       0.17(0.45+0.54)   0.17(0.49+0.50) +0.0%
    7820.19: perl grep 'm(ú|u)lt.b(æ|y)te'          0.16(0.33+0.58)   0.16(0.29+0.62) +0.0%

So, as expected there's no change, and running with valgrind reveals
that we have fewer allocations now.

1. https://public-inbox.org/git/20190721194052.15440-1-carenas@gmail.com/
2. I didn't really intend to start with 1 byte, looking at the PCRE v2
   code again what happened is that I cargo-culted some of PCRE v2's
   own test code which was meant to test re-allocations. It's more
   sane to start with say 32 KB with a max of 1 MB, as pcre2grep.c
   does.

Reported-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 grep.c | 10 ----------
 grep.h |  4 ----
 2 files changed, 14 deletions(-)

diff --git a/grep.c b/grep.c
index be4282fef3..20ce95270a 100644
--- a/grep.c
+++ b/grep.c
@@ -546,14 +546,6 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 			p->pcre2_jit_on = 0;
 			return;
 		}
-
-		p->pcre2_jit_stack = pcre2_jit_stack_create(1, 1024 * 1024, NULL);
-		if (!p->pcre2_jit_stack)
-			die("Couldn't allocate PCRE2 JIT stack");
-		p->pcre2_match_context = pcre2_match_context_create(NULL);
-		if (!p->pcre2_match_context)
-			die("Couldn't allocate PCRE2 match context");
-		pcre2_jit_stack_assign(p->pcre2_match_context, NULL, p->pcre2_jit_stack);
 	}
 }
 
@@ -597,8 +589,6 @@ static void free_pcre2_pattern(struct grep_pat *p)
 	pcre2_compile_context_free(p->pcre2_compile_context);
 	pcre2_code_free(p->pcre2_pattern);
 	pcre2_match_data_free(p->pcre2_match_data);
-	pcre2_jit_stack_free(p->pcre2_jit_stack);
-	pcre2_match_context_free(p->pcre2_match_context);
 }
 #else /* !USE_LIBPCRE2 */
 static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt)
diff --git a/grep.h b/grep.h
index 1875880f37..a65f4a1ae1 100644
--- a/grep.h
+++ b/grep.h
@@ -29,8 +29,6 @@ typedef int pcre_jit_stack;
 typedef int pcre2_code;
 typedef int pcre2_match_data;
 typedef int pcre2_compile_context;
-typedef int pcre2_match_context;
-typedef int pcre2_jit_stack;
 #endif
 #include "kwset.h"
 #include "thread-utils.h"
@@ -94,8 +92,6 @@ struct grep_pat {
 	pcre2_code *pcre2_pattern;
 	pcre2_match_data *pcre2_match_data;
 	pcre2_compile_context *pcre2_compile_context;
-	pcre2_match_context *pcre2_match_context;
-	pcre2_jit_stack *pcre2_jit_stack;
 	uint32_t pcre2_jit_on;
 	kwset_t kws;
 	unsigned fixed:1;
-- 
2.22.0.455.g172b71a6c5


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH 3/3] grep: stop using a custom JIT stack with PCRE v1
  2019-07-21 19:40 [PATCH] grep: use custom JIT stack with pcre2 Carlo Marcelo Arenas Belón
                   ` (2 preceding siblings ...)
  2019-07-24 15:14 ` [PATCH 2/3] grep: stop "using" a custom JIT stack with PCRE v2 Ævar Arnfjörð Bjarmason
@ 2019-07-24 15:14 ` Ævar Arnfjörð Bjarmason
  2019-07-26 13:15   ` Carlo Arenas
  3 siblings, 1 reply; 52+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-24 15:14 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Carlo Marcelo Arenas Belón, Beat Bolli,
	Johannes Schindelin, Ævar Arnfjörð Bjarmason

Simplify the PCRE v1 code for the same reasons as for the PCRE v2 code
in the last commit. Unlike with v2 we actually used the custom stack
in v1, but let's use PCRE's built-in 32 KB one instead, since
experience with v2 shows that's enough. Most distros are already using
v2 as a default, and the underlying sljit code is the same.

Unfortunately we can't just pass a NULL to pcre_jit_exec() as with
pcre2_jit_match(). Unlike the v2 function it doesn't support
that. Instead we need to use the fatter pcre_exec() if we'd like the
same behavior.

This will make things slightly slower than on the fast-path function,
but it's OK since we care less about v1 performance these days since
we have and recommend v2. Running a similar performance test as what I
ran in fbaceaac47 ("grep: add support for the PCRE v1 JIT API",
2017-05-25) via:

    GIT_PERF_REPEAT_COUNT=30 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_OPTS='-j8 USE_LIBPCRE1=Y CFLAGS=-O3 LIBPCREDIR=/home/avar/g/pcre/inst' ./run HEAD~ HEAD p7820-grep-engines.sh

Gives us this, just the /perl/ results:

    Test                                            HEAD~             HEAD
    ---------------------------------------------------------------------------------------
    7820.3: perl grep 'how.to'                      0.19(0.67+0.52)   0.19(0.65+0.52) +0.0%
    7820.7: perl grep '^how to'                     0.19(0.78+0.44)   0.19(0.72+0.49) +0.0%
    7820.11: perl grep '[how] to'                   0.39(2.13+0.43)   0.40(2.10+0.46) +2.6%
    7820.15: perl grep '(e.t[^ ]*|v.ry) rare'       0.44(2.55+0.37)   0.45(2.47+0.41) +2.3%
    7820.19: perl grep 'm(ú|u)lt.b(æ|y)te'          0.23(1.06+0.42)   0.22(1.03+0.43) -4.3%

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 grep.c | 28 +++++-----------------------
 grep.h |  5 -----
 2 files changed, 5 insertions(+), 28 deletions(-)

diff --git a/grep.c b/grep.c
index 20ce95270a..6b52fed53a 100644
--- a/grep.c
+++ b/grep.c
@@ -406,12 +406,6 @@ static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt)
 
 #ifdef GIT_PCRE1_USE_JIT
 	pcre_config(PCRE_CONFIG_JIT, &p->pcre1_jit_on);
-	if (p->pcre1_jit_on) {
-		p->pcre1_jit_stack = pcre_jit_stack_alloc(1, 1024 * 1024);
-		if (!p->pcre1_jit_stack)
-			die("Couldn't allocate PCRE JIT stack");
-		pcre_assign_jit_stack(p->pcre1_extra_info, NULL, p->pcre1_jit_stack);
-	}
 #endif
 }
 
@@ -423,18 +417,9 @@ static int pcre1match(struct grep_pat *p, const char *line, const char *eol,
 	if (eflags & REG_NOTBOL)
 		flags |= PCRE_NOTBOL;
 
-#ifdef GIT_PCRE1_USE_JIT
-	if (p->pcre1_jit_on) {
-		ret = pcre_jit_exec(p->pcre1_regexp, p->pcre1_extra_info, line,
-				    eol - line, 0, flags, ovector,
-				    ARRAY_SIZE(ovector), p->pcre1_jit_stack);
-	} else
-#endif
-	{
-		ret = pcre_exec(p->pcre1_regexp, p->pcre1_extra_info, line,
-				eol - line, 0, flags, ovector,
-				ARRAY_SIZE(ovector));
-	}
+	ret = pcre_exec(p->pcre1_regexp, p->pcre1_extra_info, line,
+			eol - line, 0, flags, ovector,
+			ARRAY_SIZE(ovector));
 
 	if (ret < 0 && ret != PCRE_ERROR_NOMATCH)
 		die("pcre_exec failed with error code %d", ret);
@@ -451,14 +436,11 @@ static void free_pcre1_regexp(struct grep_pat *p)
 {
 	pcre_free(p->pcre1_regexp);
 #ifdef GIT_PCRE1_USE_JIT
-	if (p->pcre1_jit_on) {
+	if (p->pcre1_jit_on)
 		pcre_free_study(p->pcre1_extra_info);
-		pcre_jit_stack_free(p->pcre1_jit_stack);
-	} else
+	else
 #endif
-	{
 		pcre_free(p->pcre1_extra_info);
-	}
 	pcre_free((void *)p->pcre1_tables);
 }
 #else /* !USE_LIBPCRE1 */
diff --git a/grep.h b/grep.h
index a65f4a1ae1..a405fc870c 100644
--- a/grep.h
+++ b/grep.h
@@ -14,13 +14,9 @@
 #ifndef GIT_PCRE_STUDY_JIT_COMPILE
 #define GIT_PCRE_STUDY_JIT_COMPILE 0
 #endif
-#if PCRE_MAJOR <= 8 && PCRE_MINOR < 20
-typedef int pcre_jit_stack;
-#endif
 #else
 typedef int pcre;
 typedef int pcre_extra;
-typedef int pcre_jit_stack;
 #endif
 #ifdef USE_LIBPCRE2
 #define PCRE2_CODE_UNIT_WIDTH 8
@@ -86,7 +82,6 @@ struct grep_pat {
 	regex_t regexp;
 	pcre *pcre1_regexp;
 	pcre_extra *pcre1_extra_info;
-	pcre_jit_stack *pcre1_jit_stack;
 	const unsigned char *pcre1_tables;
 	int pcre1_jit_on;
 	pcre2_code *pcre2_pattern;
-- 
2.22.0.455.g172b71a6c5


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/3] grep: PCRE JIT fixes
  2019-07-24 15:14 ` [PATCH 0/3] grep: PCRE JIT fixes Ævar Arnfjörð Bjarmason
@ 2019-07-24 16:18   ` Junio C Hamano
  2019-07-24 20:03     ` Ævar Arnfjörð Bjarmason
  2019-07-26 15:08   ` [PATCH v2 0/8] grep: PCRE JIT fixes + ab/no-kwset fix Ævar Arnfjörð Bjarmason
                     ` (8 subsequent siblings)
  9 siblings, 1 reply; 52+ messages in thread
From: Junio C Hamano @ 2019-07-24 16:18 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Carlo Marcelo Arenas Belón, Beat Bolli, Johannes Schindelin

Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:

> There's a couple of patches fixing mistakes in the JIT code I added
> for PCRE in <20190722181923.21572-1-dev+git@drbeat.li> and
> <20190721194052.15440-1-carenas@gmail.com>
>
> This small series proposes to replace both of those. In both cases I
> think we're better off just removing the relevant code. The commit
> messages for the patches themselves make the case for that.

I am not sure about the BUG() that practically never triggered so
far (AFAICT, the check that guards the BUG() would trigger only if
we later introduced a bug, calling the code to compile when we are
not asked to do so)---wouldn't it be better to leave it in while
there still are people who are touching the vicinity?

The other two I am perfectly OK with.  It is easy to resurrect the
support for v1 (which may not even be needed for long) and resurrect
the support for v2 with Carlo's fix, if it later turns out that some
users may need to use a more complex pattern.

Thanks.

> Ævar Arnfjörð Bjarmason (3):
>   grep: remove overly paranoid BUG(...) code
>   grep: stop "using" a custom JIT stack with PCRE v2
>   grep: stop using a custom JIT stack with PCRE v1
>
>  grep.c | 46 ++++++----------------------------------------
>  grep.h |  9 ---------
>  2 files changed, 6 insertions(+), 49 deletions(-)

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] grep: stop "using" a custom JIT stack with PCRE v2
  2019-07-24 15:14 ` [PATCH 2/3] grep: stop "using" a custom JIT stack with PCRE v2 Ævar Arnfjörð Bjarmason
@ 2019-07-24 16:24   ` Junio C Hamano
  2019-07-24 20:06     ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 52+ messages in thread
From: Junio C Hamano @ 2019-07-24 16:24 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Carlo Marcelo Arenas Belón, Beat Bolli, Johannes Schindelin

Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:

> Since we've haven't had any reports of users running into
> PCRE2_ERROR_JIT_STACKLIMIT in the wild I think we can safely assume
> that we can just use the library defaults instead and drop this
> code.

Does everybody use pcre2 with JIT with Git these days, or only those
who want to live near the bleeding edge?

> This won't change with the wider use of PCRE v2 in
> ed0479ce3d ("Merge branch 'ab/no-kwset' into next", 2019-07-15), a
> fixed string search is not a "large or complicated pattern".

In any case, if we were not "using" the custom stack anyway for v2,
this change does not hurt anybody, possibly other than those who
will learn about pcre2 support by reading this message and experiments
with larger patterns.  And it should be simple to wire it back if it
becomes necessary later.

Thanks for cleaning up.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/3] grep: PCRE JIT fixes
  2019-07-24 16:18   ` Junio C Hamano
@ 2019-07-24 20:03     ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 52+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-24 20:03 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Carlo Marcelo Arenas Belón, Beat Bolli, Johannes Schindelin


On Wed, Jul 24 2019, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:
>
>> There's a couple of patches fixing mistakes in the JIT code I added
>> for PCRE in <20190722181923.21572-1-dev+git@drbeat.li> and
>> <20190721194052.15440-1-carenas@gmail.com>
>>
>> This small series proposes to replace both of those. In both cases I
>> think we're better off just removing the relevant code. The commit
>> messages for the patches themselves make the case for that.
>
> I am not sure about the BUG() that practically never triggered so
> far (AFAICT, the check that guards the BUG() would trigger only if
> we later introduced a bug, calling the code to compile when we are
> not asked to do so)---wouldn't it be better to leave it in while
> there still are people who are touching the vicinity?

The BUG() in 1/3 is just checking if pcre2?_config() returns a boolean
when promised, so it amounts to black-box testing of that library.

I think code in that style is overly paranoid and verbose, it's
reasonable to just trust the library in that case.

I think the reason it ended up in the codebase in the first place was
converting some first-draft implementation I wrote where I was being
more paranoid about using the PCRE API as a black box.

> The other two I am perfectly OK with.  It is easy to resurrect the
> support for v1 (which may not even be needed for long) and resurrect
> the support for v2 with Carlo's fix, if it later turns out that some
> users may need to use a more complex pattern.
>
> Thanks.
>
>> Ævar Arnfjörð Bjarmason (3):
>>   grep: remove overly paranoid BUG(...) code
>>   grep: stop "using" a custom JIT stack with PCRE v2
>>   grep: stop using a custom JIT stack with PCRE v1
>>
>>  grep.c | 46 ++++++----------------------------------------
>>  grep.h |  9 ---------
>>  2 files changed, 6 insertions(+), 49 deletions(-)

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] grep: stop "using" a custom JIT stack with PCRE v2
  2019-07-24 16:24   ` Junio C Hamano
@ 2019-07-24 20:06     ` Ævar Arnfjörð Bjarmason
  2019-07-25  5:11       ` Carlo Arenas
  0 siblings, 1 reply; 52+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-24 20:06 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Carlo Marcelo Arenas Belón, Beat Bolli, Johannes Schindelin


On Wed, Jul 24 2019, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:
>
>> Since we've haven't had any reports of users running into
>> PCRE2_ERROR_JIT_STACKLIMIT in the wild I think we can safely assume
>> that we can just use the library defaults instead and drop this
>> code.
>
> Does everybody use pcre2 with JIT with Git these days, or only those
> who want to live near the bleeding edge?

My informal survey of various package recipies suggests that all the big
*nix distros are using it by default now, so we have a lot of users in
the wild, including in the just-released Debian stable.

So I'm confidend that if there were issues with e.g. it dying on
patterns in practical use we'd have heard about them.

>> This won't change with the wider use of PCRE v2 in
>> ed0479ce3d ("Merge branch 'ab/no-kwset' into next", 2019-07-15), a
>> fixed string search is not a "large or complicated pattern".
>
> In any case, if we were not "using" the custom stack anyway for v2,
> this change does not hurt anybody, possibly other than those who
> will learn about pcre2 support by reading this message and experiments
> with larger patterns.  And it should be simple to wire it back if it
> becomes necessary later.

*nod*

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 2/3] grep: stop "using" a custom JIT stack with PCRE v2
  2019-07-24 20:06     ` Ævar Arnfjörð Bjarmason
@ 2019-07-25  5:11       ` Carlo Arenas
  0 siblings, 0 replies; 52+ messages in thread
From: Carlo Arenas @ 2019-07-25  5:11 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Junio C Hamano, git, Beat Bolli, Johannes Schindelin

On Wed, Jul 24, 2019 at 1:20 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
> On Wed, Jul 24 2019, Junio C Hamano wrote:
> >
> > Does everybody use pcre2 with JIT with Git these days, or only those
> > who want to live near the bleeding edge?
>
> My informal survey of various package recipies suggests that all the big
> *nix distros are using it by default now, so we have a lot of users in
> the wild, including in the just-released Debian stable.

FWIW neither OpenBSD or NetBSD enable JIT, and the git that comes
with Xcode (AKA Apple Git) doesn't either, while still using PCRE1

> > In any case, if we were not "using" the custom stack anyway for v2,
> > this change does not hurt anybody, possibly other than those who
> > will learn about pcre2 support by reading this message and experiments
> > with larger patterns.  And it should be simple to wire it back if it
> > becomes necessary later.
>
> *nod*

the following pattern fails unless 1MB stack is available:

  '^([/](?!/)|[^/])*~/.*'

the workaround implemented in GNU grep (that uses PCRE1) and the related
discussion[1] are a very interesting read

Carlo

[1] https://www.mail-archive.com/bug-grep@gnu.org/msg05763.html

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 3/3] grep: stop using a custom JIT stack with PCRE v1
  2019-07-24 15:14 ` [PATCH 3/3] grep: stop using a custom JIT stack with PCRE v1 Ævar Arnfjörð Bjarmason
@ 2019-07-26 13:15   ` Carlo Arenas
  2019-07-26 13:50     ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 52+ messages in thread
From: Carlo Arenas @ 2019-07-26 13:15 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Junio C Hamano, Beat Bolli, Johannes Schindelin

since this moves PCRE1 out of the JIT fast path, introduces the
regression where git grep will abort if there is binary data or non
UTF-8 text in the repository/log and should be IMHO hold out until a
fix for that can be merged.

this also needs additional changes to better support NO_LIBPCRE1_JIT,
patch to follow

Carlo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 3/3] grep: stop using a custom JIT stack with PCRE v1
  2019-07-26 13:15   ` Carlo Arenas
@ 2019-07-26 13:50     ` Ævar Arnfjörð Bjarmason
  2019-07-26 14:12       ` Carlo Arenas
  0 siblings, 1 reply; 52+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-26 13:50 UTC (permalink / raw)
  To: Carlo Arenas; +Cc: git, Junio C Hamano, Beat Bolli, Johannes Schindelin


On Fri, Jul 26 2019, Carlo Arenas wrote:

> since this moves PCRE1 out of the JIT fast path,

I think you're mostly replying to the wrong thread. None of the patches
I've sent disable PCRE v1 JIT, as the performance numbers show. The JIT
stack is resized, and for v2 some dead code removed.

> introduces the regression where git grep will abort if there is binary
> data or non UTF-8 text in the repository/log and should be IMHO hold
> out until a fix for that can be merged.

You're talking about the kwset series, not this cleanup series.

> this also needs additional changes to better support NO_LIBPCRE1_JIT,
> patch to follow

Looking forward to it, thanks!

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 3/3] grep: stop using a custom JIT stack with PCRE v1
  2019-07-26 13:50     ` Ævar Arnfjörð Bjarmason
@ 2019-07-26 14:12       ` Carlo Arenas
  2019-07-26 14:43         ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 52+ messages in thread
From: Carlo Arenas @ 2019-07-26 14:12 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Junio C Hamano, Beat Bolli, Johannes Schindelin

On Fri, Jul 26, 2019 at 6:50 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> On Fri, Jul 26 2019, Carlo Arenas wrote:
>
> > since this moves PCRE1 out of the JIT fast path,
>
> I think you're mostly replying to the wrong thread. None of the patches
> I've sent disable PCRE v1 JIT, as the performance numbers show. The JIT
> stack is resized, and for v2 some dead code removed.

I didn't mean JIT was disabled, but that we are calling now the regular
PCRE1 function which does UTF-8 validation (unlike the one used before)

> > introduces the regression where git grep will abort if there is binary
> > data or non UTF-8 text in the repository/log and should be IMHO hold
> > out until a fix for that can be merged.
>
> You're talking about the kwset series, not this cleanup series.

a combination of both (as seen in pu) and that will also happen in next if
this series get merged there.

before this cleanup series, a git compiled against PCRE1 and not using
NO_LIBPCRE1_JIT will use the jit fast path function and therefore would
have no problems with binary or non UTF-8 content in the repository, but
will regress after.

Carlo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 3/3] grep: stop using a custom JIT stack with PCRE v1
  2019-07-26 14:12       ` Carlo Arenas
@ 2019-07-26 14:43         ` Ævar Arnfjörð Bjarmason
  2019-07-26 20:26           ` [RFC PATCH 0/2] PCRE1 cleanup Carlo Marcelo Arenas Belón
  0 siblings, 1 reply; 52+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-26 14:43 UTC (permalink / raw)
  To: Carlo Arenas; +Cc: git, Junio C Hamano, Beat Bolli, Johannes Schindelin


On Fri, Jul 26 2019, Carlo Arenas wrote:

> On Fri, Jul 26, 2019 at 6:50 AM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>>
>> On Fri, Jul 26 2019, Carlo Arenas wrote:
>>
>> > since this moves PCRE1 out of the JIT fast path,
>>
>> I think you're mostly replying to the wrong thread. None of the patches
>> I've sent disable PCRE v1 JIT, as the performance numbers show. The JIT
>> stack is resized, and for v2 some dead code removed.
>
> I didn't mean JIT was disabled, but that we are calling now the regular
> PCRE1 function which does UTF-8 validation (unlike the one used before)
>
>> > introduces the regression where git grep will abort if there is binary
>> > data or non UTF-8 text in the repository/log and should be IMHO hold
>> > out until a fix for that can be merged.
>>
>> You're talking about the kwset series, not this cleanup series.
>
> a combination of both (as seen in pu) and that will also happen in next if
> this series get merged there.
>
> before this cleanup series, a git compiled against PCRE1 and not using
> NO_LIBPCRE1_JIT will use the jit fast path function and therefore would
> have no problems with binary or non UTF-8 content in the repository, but
> will regress after.

I see. Yes you're right, I misread pcrejit(3) about how the "fast path
API" worked (or more accurately, misremembered). Yes, this is now a new
caveat.

I have some patches on top of next I'm about to send that hopefully make
this whole thing less of a mess.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v2 0/8] grep: PCRE JIT fixes + ab/no-kwset fix
  2019-07-24 15:14 ` [PATCH 0/3] grep: PCRE JIT fixes Ævar Arnfjörð Bjarmason
  2019-07-24 16:18   ` Junio C Hamano
@ 2019-07-26 15:08   ` Ævar Arnfjörð Bjarmason
  2019-07-26 20:27     ` Junio C Hamano
  2019-07-26 15:08   ` [PATCH v2 1/8] grep: remove overly paranoid BUG(...) code Ævar Arnfjörð Bjarmason
                     ` (7 subsequent siblings)
  9 siblings, 1 reply; 52+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-26 15:08 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Carlo Marcelo Arenas Belón, Beat Bolli,
	Johannes Schindelin, Ævar Arnfjörð Bjarmason

1-3 here are a re-roll on "next". I figured that was easier for
everyone with the state of the in-flight patches, it certainly was for
me. Sorry Junio if this creates a mess for you.

4-8 are a "fix" for the UTF-8 matching error noted in Carlo's "grep:
skip UTF8 checks explicitally" in
https://public-inbox.org/git/20190721183115.14985-1-carenas@gmail.com/

As noted the bug isn't fully fixed until 8/8, and that patch relies on
unreleased PCRE v2 code. I'm hoping that with 7/8 we're in a good
enough state to limp forward as noted in the rationale of those
commits.

Ævar Arnfjörð Bjarmason (8):
  grep: remove overly paranoid BUG(...) code
  grep: stop "using" a custom JIT stack with PCRE v2
  grep: stop using a custom JIT stack with PCRE v1
  grep: consistently use "p->fixed" in compile_regexp()
  grep: create a "is_fixed" member in "grep_pat"
  grep: stess test PCRE v2 on invalid UTF-8 data
  grep: do not enter PCRE2_UTF mode on fixed matching
  grep: optimistically use PCRE2_MATCH_INVALID_UTF

 Makefile                        |  1 +
 grep.c                          | 68 +++++++++++----------------------
 grep.h                          | 13 ++-----
 t/helper/test-pcre2-config.c    | 12 ++++++
 t/helper/test-tool.c            |  1 +
 t/helper/test-tool.h            |  1 +
 t/t7812-grep-icase-non-ascii.sh | 39 +++++++++++++++++++
 7 files changed, 80 insertions(+), 55 deletions(-)
 create mode 100644 t/helper/test-pcre2-config.c

-- 
2.22.0.455.g172b71a6c5


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v2 1/8] grep: remove overly paranoid BUG(...) code
  2019-07-24 15:14 ` [PATCH 0/3] grep: PCRE JIT fixes Ævar Arnfjörð Bjarmason
  2019-07-24 16:18   ` Junio C Hamano
  2019-07-26 15:08   ` [PATCH v2 0/8] grep: PCRE JIT fixes + ab/no-kwset fix Ævar Arnfjörð Bjarmason
@ 2019-07-26 15:08   ` Ævar Arnfjörð Bjarmason
  2019-07-26 15:08   ` [PATCH v2 2/8] grep: stop "using" a custom JIT stack with PCRE v2 Ævar Arnfjörð Bjarmason
                     ` (6 subsequent siblings)
  9 siblings, 0 replies; 52+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-26 15:08 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Carlo Marcelo Arenas Belón, Beat Bolli,
	Johannes Schindelin, Ævar Arnfjörð Bjarmason

Remove code that would trigger if pcre_config() or pcre2_config() was
so broken that "do we have JIT?" wouldn't return a boolean.

I added this code back in fbaceaac47 ("grep: add support for the PCRE
v1 JIT API", 2017-05-25) and then as noted in f002532784 ("grep: print
the pcre2_jit_on value", 2019-07-22) incorrectly copy/pasted some of
it in 94da9193a6 ("grep: add support for PCRE v2", 2017-06-01).

Let's just remove this code. Being this paranoid about the
pcre2?_config() function itself being broken is crossing the line into
unreasonable paranoia.

Reported-by:  Beat Bolli <dev+git@drbeat.li>
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 grep.c | 10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/grep.c b/grep.c
index 0937c5bfff..95af88cb74 100644
--- a/grep.c
+++ b/grep.c
@@ -394,14 +394,11 @@ static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt)
 
 #ifdef GIT_PCRE1_USE_JIT
 	pcre_config(PCRE_CONFIG_JIT, &p->pcre1_jit_on);
-	if (p->pcre1_jit_on == 1) {
+	if (p->pcre1_jit_on) {
 		p->pcre1_jit_stack = pcre_jit_stack_alloc(1, 1024 * 1024);
 		if (!p->pcre1_jit_stack)
 			die("Couldn't allocate PCRE JIT stack");
 		pcre_assign_jit_stack(p->pcre1_extra_info, NULL, p->pcre1_jit_stack);
-	} else if (p->pcre1_jit_on != 0) {
-		BUG("The pcre1_jit_on variable should be 0 or 1, not %d",
-		    p->pcre1_jit_on);
 	}
 #endif
 }
@@ -510,7 +507,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 	}
 
 	pcre2_config(PCRE2_CONFIG_JIT, &p->pcre2_jit_on);
-	if (p->pcre2_jit_on == 1) {
+	if (p->pcre2_jit_on) {
 		jitret = pcre2_jit_compile(p->pcre2_pattern, PCRE2_JIT_COMPLETE);
 		if (jitret)
 			die("Couldn't JIT the PCRE2 pattern '%s', got '%d'\n", p->pattern, jitret);
@@ -545,9 +542,6 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 		if (!p->pcre2_match_context)
 			die("Couldn't allocate PCRE2 match context");
 		pcre2_jit_stack_assign(p->pcre2_match_context, NULL, p->pcre2_jit_stack);
-	} else if (p->pcre2_jit_on != 0) {
-		BUG("The pcre2_jit_on variable should be 0 or 1, not %d",
-		    p->pcre2_jit_on);
 	}
 }
 
-- 
2.22.0.455.g172b71a6c5


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v2 2/8] grep: stop "using" a custom JIT stack with PCRE v2
  2019-07-24 15:14 ` [PATCH 0/3] grep: PCRE JIT fixes Ævar Arnfjörð Bjarmason
                     ` (2 preceding siblings ...)
  2019-07-26 15:08   ` [PATCH v2 1/8] grep: remove overly paranoid BUG(...) code Ævar Arnfjörð Bjarmason
@ 2019-07-26 15:08   ` Ævar Arnfjörð Bjarmason
  2019-07-29  0:33     ` Carlo Arenas
  2019-07-26 15:08   ` [PATCH v2 3/8] grep: stop using a custom JIT stack with PCRE v1 Ævar Arnfjörð Bjarmason
                     ` (5 subsequent siblings)
  9 siblings, 1 reply; 52+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-26 15:08 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Carlo Marcelo Arenas Belón, Beat Bolli,
	Johannes Schindelin, Ævar Arnfjörð Bjarmason

As reported in [1] the code I added in 94da9193a6 ("grep: add support
for PCRE v2", 2017-06-01) to use a custom JIT stack has never
worked. It was incorrectly copy/pasted from code I added in
fbaceaac47 ("grep: add support for the PCRE v1 JIT API", 2017-05-25),
which did work.

Thus our intention of starting with 1 byte of stack at a maximum of 1
MB didn't happen, we'd always use the 32 KB stack provided by PCRE
v2's jit_machine_stack_exec()[2]. The reason I allocated a custom
stack at all was this advice in pcrejit(3) (same in pcre2jit(3)):

    "By default, it uses 32KiB on the machine stack. However, some
    large or complicated patterns need more than this"

Since we've haven't had any reports of users running into
PCRE2_ERROR_JIT_STACKLIMIT in the wild I think we can safely assume
that we can just use the library defaults instead and drop this
code. This won't change with the wider use of PCRE v2 in
ed0479ce3d ("Merge branch 'ab/no-kwset' into next", 2019-07-15), a
fixed string search is not a "large or complicated pattern".

For good measure I ran the performance test noted in 94da9193a6,
although the command is simpler now due to my 0f50c8e32c ("Makefile:
remove the NO_R_TO_GCC_LINKER flag", 2019-05-17):

    GIT_PERF_REPEAT_COUNT=30 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_OPTS='-j8 USE_LIBPCRE2=Y CFLAGS=-O3 LIBPCREDIR=/home/avar/g/pcre2/inst' ./run HEAD~ HEAD p7820-grep-engines.sh

Just the /perl/ results are:

    Test                                            HEAD~             HEAD
    ---------------------------------------------------------------------------------------
    7820.3: perl grep 'how.to'                      0.17(0.27+0.65)   0.17(0.24+0.68) +0.0%
    7820.7: perl grep '^how to'                     0.16(0.23+0.66)   0.16(0.23+0.67) +0.0%
    7820.11: perl grep '[how] to'                   0.18(0.35+0.62)   0.18(0.33+0.65) +0.0%
    7820.15: perl grep '(e.t[^ ]*|v.ry) rare'       0.17(0.45+0.54)   0.17(0.49+0.50) +0.0%
    7820.19: perl grep 'm(ú|u)lt.b(æ|y)te'          0.16(0.33+0.58)   0.16(0.29+0.62) +0.0%

So, as expected there's no change, and running with valgrind reveals
that we have fewer allocations now.

As noted in [3] there are known regexes that will fail with the lower
stack limit, the way GNU grep fixed it is interesting, although I
believe the implementation is overly verbose, they could make PCRE v2
handle that gradual re-allocation, that's what min/max memory is
for.

So we might end up bringing this back, I'm more inclined to just kick
such cases upstairs to PCRE maintainers as a bug, perhaps they'll add
some overall "just allocate more then" flag to make this easier. In
any case there's no functional change here, we didn't have a custom
stack, so let's apply this first, we can always revert it later.

1. https://public-inbox.org/git/20190721194052.15440-1-carenas@gmail.com/
2. I didn't really intend to start with 1 byte, looking at the PCRE v2
   code again what happened is that I cargo-culted some of PCRE v2's
   own test code which was meant to test re-allocations. It's more
   sane to start with say 32 KB with a max of 1 MB, as pcre2grep.c
   does.
3. https://public-inbox.org/git/CAPUEspjj+fG8QDmf=bZXktfpLgkgiu34HTjKLhm-cmEE04FE-A@mail.gmail.com/

Reported-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 grep.c | 10 ----------
 grep.h |  4 ----
 2 files changed, 14 deletions(-)

diff --git a/grep.c b/grep.c
index 95af88cb74..4b1e917ac5 100644
--- a/grep.c
+++ b/grep.c
@@ -534,14 +534,6 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 			p->pcre2_jit_on = 0;
 			return;
 		}
-
-		p->pcre2_jit_stack = pcre2_jit_stack_create(1, 1024 * 1024, NULL);
-		if (!p->pcre2_jit_stack)
-			die("Couldn't allocate PCRE2 JIT stack");
-		p->pcre2_match_context = pcre2_match_context_create(NULL);
-		if (!p->pcre2_match_context)
-			die("Couldn't allocate PCRE2 match context");
-		pcre2_jit_stack_assign(p->pcre2_match_context, NULL, p->pcre2_jit_stack);
 	}
 }
 
@@ -585,8 +577,6 @@ static void free_pcre2_pattern(struct grep_pat *p)
 	pcre2_compile_context_free(p->pcre2_compile_context);
 	pcre2_code_free(p->pcre2_pattern);
 	pcre2_match_data_free(p->pcre2_match_data);
-	pcre2_jit_stack_free(p->pcre2_jit_stack);
-	pcre2_match_context_free(p->pcre2_match_context);
 }
 #else /* !USE_LIBPCRE2 */
 static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt)
diff --git a/grep.h b/grep.h
index d35a137fcb..4d8e300175 100644
--- a/grep.h
+++ b/grep.h
@@ -29,8 +29,6 @@ typedef int pcre_jit_stack;
 typedef int pcre2_code;
 typedef int pcre2_match_data;
 typedef int pcre2_compile_context;
-typedef int pcre2_match_context;
-typedef int pcre2_jit_stack;
 #endif
 #include "thread-utils.h"
 #include "userdiff.h"
@@ -93,8 +91,6 @@ struct grep_pat {
 	pcre2_code *pcre2_pattern;
 	pcre2_match_data *pcre2_match_data;
 	pcre2_compile_context *pcre2_compile_context;
-	pcre2_match_context *pcre2_match_context;
-	pcre2_jit_stack *pcre2_jit_stack;
 	uint32_t pcre2_jit_on;
 	unsigned fixed:1;
 	unsigned ignore_case:1;
-- 
2.22.0.455.g172b71a6c5


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v2 3/8] grep: stop using a custom JIT stack with PCRE v1
  2019-07-24 15:14 ` [PATCH 0/3] grep: PCRE JIT fixes Ævar Arnfjörð Bjarmason
                     ` (3 preceding siblings ...)
  2019-07-26 15:08   ` [PATCH v2 2/8] grep: stop "using" a custom JIT stack with PCRE v2 Ævar Arnfjörð Bjarmason
@ 2019-07-26 15:08   ` Ævar Arnfjörð Bjarmason
  2019-07-29  1:26     ` Carlo Arenas
  2019-07-26 15:08   ` [PATCH v2 4/8] grep: consistently use "p->fixed" in compile_regexp() Ævar Arnfjörð Bjarmason
                     ` (4 subsequent siblings)
  9 siblings, 1 reply; 52+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-26 15:08 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Carlo Marcelo Arenas Belón, Beat Bolli,
	Johannes Schindelin, Ævar Arnfjörð Bjarmason

Simplify the PCRE v1 code for the same reasons as for the PCRE v2 code
in the last commit. Unlike with v2 we actually used the custom stack
in v1, but let's use PCRE's built-in 32 KB one instead, since
experience with v2 shows that's enough. Most distros are already using
v2 as a default, and the underlying sljit code is the same.

Unfortunately we can't just pass a NULL to pcre_jit_exec() as with
pcre2_jit_match(). Unlike the v2 function it doesn't support
that. Instead we need to use the fatter pcre_exec() if we'd like the
same behavior.

This will make things slightly slower than on the fast-path function,
but it's OK since we care less about v1 performance these days since
we have and recommend v2. Running a similar performance test as what I
ran in fbaceaac47 ("grep: add support for the PCRE v1 JIT API",
2017-05-25) via:

    GIT_PERF_REPEAT_COUNT=30 GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_MAKE_OPTS='-j8 USE_LIBPCRE1=Y CFLAGS=-O3 LIBPCREDIR=/home/avar/g/pcre/inst' ./run HEAD~ HEAD p7820-grep-engines.sh

Gives us this, just the /perl/ results:

    Test                                            HEAD~             HEAD
    ---------------------------------------------------------------------------------------
    7820.3: perl grep 'how.to'                      0.19(0.67+0.52)   0.19(0.65+0.52) +0.0%
    7820.7: perl grep '^how to'                     0.19(0.78+0.44)   0.19(0.72+0.49) +0.0%
    7820.11: perl grep '[how] to'                   0.39(2.13+0.43)   0.40(2.10+0.46) +2.6%
    7820.15: perl grep '(e.t[^ ]*|v.ry) rare'       0.44(2.55+0.37)   0.45(2.47+0.41) +2.3%
    7820.19: perl grep 'm(ú|u)lt.b(æ|y)te'          0.23(1.06+0.42)   0.22(1.03+0.43) -4.3%

It will also implicitly re-enable UTF-8 validation for PCRE v1. As
noted in [1] we now have cases as a result where PCRE v1 is more eager
to error out. Subsequent patches will fix that for v2, and I think
it's fair to tell v1 users "just upgrade" and not worry about that
edge case for v1.

1.  https://public-inbox.org/git/CAPUEsphZJ_Uv9o1-yDpjNLA_q-f7gWXz9g1gCY2pYAYN8ri40g@mail.gmail.com/

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 grep.c | 28 +++++-----------------------
 grep.h |  5 -----
 2 files changed, 5 insertions(+), 28 deletions(-)

diff --git a/grep.c b/grep.c
index 4b1e917ac5..9c2b259771 100644
--- a/grep.c
+++ b/grep.c
@@ -394,12 +394,6 @@ static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt)
 
 #ifdef GIT_PCRE1_USE_JIT
 	pcre_config(PCRE_CONFIG_JIT, &p->pcre1_jit_on);
-	if (p->pcre1_jit_on) {
-		p->pcre1_jit_stack = pcre_jit_stack_alloc(1, 1024 * 1024);
-		if (!p->pcre1_jit_stack)
-			die("Couldn't allocate PCRE JIT stack");
-		pcre_assign_jit_stack(p->pcre1_extra_info, NULL, p->pcre1_jit_stack);
-	}
 #endif
 }
 
@@ -411,18 +405,9 @@ static int pcre1match(struct grep_pat *p, const char *line, const char *eol,
 	if (eflags & REG_NOTBOL)
 		flags |= PCRE_NOTBOL;
 
-#ifdef GIT_PCRE1_USE_JIT
-	if (p->pcre1_jit_on) {
-		ret = pcre_jit_exec(p->pcre1_regexp, p->pcre1_extra_info, line,
-				    eol - line, 0, flags, ovector,
-				    ARRAY_SIZE(ovector), p->pcre1_jit_stack);
-	} else
-#endif
-	{
-		ret = pcre_exec(p->pcre1_regexp, p->pcre1_extra_info, line,
-				eol - line, 0, flags, ovector,
-				ARRAY_SIZE(ovector));
-	}
+	ret = pcre_exec(p->pcre1_regexp, p->pcre1_extra_info, line,
+			eol - line, 0, flags, ovector,
+			ARRAY_SIZE(ovector));
 
 	if (ret < 0 && ret != PCRE_ERROR_NOMATCH)
 		die("pcre_exec failed with error code %d", ret);
@@ -439,14 +424,11 @@ static void free_pcre1_regexp(struct grep_pat *p)
 {
 	pcre_free(p->pcre1_regexp);
 #ifdef GIT_PCRE1_USE_JIT
-	if (p->pcre1_jit_on) {
+	if (p->pcre1_jit_on)
 		pcre_free_study(p->pcre1_extra_info);
-		pcre_jit_stack_free(p->pcre1_jit_stack);
-	} else
+	else
 #endif
-	{
 		pcre_free(p->pcre1_extra_info);
-	}
 	pcre_free((void *)p->pcre1_tables);
 }
 #else /* !USE_LIBPCRE1 */
diff --git a/grep.h b/grep.h
index 4d8e300175..ce2d72571f 100644
--- a/grep.h
+++ b/grep.h
@@ -14,13 +14,9 @@
 #ifndef GIT_PCRE_STUDY_JIT_COMPILE
 #define GIT_PCRE_STUDY_JIT_COMPILE 0
 #endif
-#if PCRE_MAJOR <= 8 && PCRE_MINOR < 20
-typedef int pcre_jit_stack;
-#endif
 #else
 typedef int pcre;
 typedef int pcre_extra;
-typedef int pcre_jit_stack;
 #endif
 #ifdef USE_LIBPCRE2
 #define PCRE2_CODE_UNIT_WIDTH 8
@@ -85,7 +81,6 @@ struct grep_pat {
 	regex_t regexp;
 	pcre *pcre1_regexp;
 	pcre_extra *pcre1_extra_info;
-	pcre_jit_stack *pcre1_jit_stack;
 	const unsigned char *pcre1_tables;
 	int pcre1_jit_on;
 	pcre2_code *pcre2_pattern;
-- 
2.22.0.455.g172b71a6c5


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v2 4/8] grep: consistently use "p->fixed" in compile_regexp()
  2019-07-24 15:14 ` [PATCH 0/3] grep: PCRE JIT fixes Ævar Arnfjörð Bjarmason
                     ` (4 preceding siblings ...)
  2019-07-26 15:08   ` [PATCH v2 3/8] grep: stop using a custom JIT stack with PCRE v1 Ævar Arnfjörð Bjarmason
@ 2019-07-26 15:08   ` Ævar Arnfjörð Bjarmason
  2019-07-29  1:48     ` Carlo Arenas
  2019-07-26 15:08   ` [PATCH v2 5/8] grep: create a "is_fixed" member in "grep_pat" Ævar Arnfjörð Bjarmason
                     ` (3 subsequent siblings)
  9 siblings, 1 reply; 52+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-26 15:08 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Carlo Marcelo Arenas Belón, Beat Bolli,
	Johannes Schindelin, Ævar Arnfjörð Bjarmason

At the start of this function we do:

    p->fixed = opt->fixed;

It's less confusing to use that variable consistently that switch back
& forth between the two.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 grep.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/grep.c b/grep.c
index 9c2b259771..b94e998680 100644
--- a/grep.c
+++ b/grep.c
@@ -616,7 +616,7 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 		die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2"));
 
 	pat_is_fixed = is_fixed(p->pattern, p->patternlen);
-	if (opt->fixed || pat_is_fixed) {
+	if (p->fixed || pat_is_fixed) {
 #ifdef USE_LIBPCRE2
 		opt->pcre2 = 1;
 		if (pat_is_fixed) {
-- 
2.22.0.455.g172b71a6c5


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v2 5/8] grep: create a "is_fixed" member in "grep_pat"
  2019-07-24 15:14 ` [PATCH 0/3] grep: PCRE JIT fixes Ævar Arnfjörð Bjarmason
                     ` (5 preceding siblings ...)
  2019-07-26 15:08   ` [PATCH v2 4/8] grep: consistently use "p->fixed" in compile_regexp() Ævar Arnfjörð Bjarmason
@ 2019-07-26 15:08   ` Ævar Arnfjörð Bjarmason
  2019-07-26 15:08   ` [PATCH v2 6/8] grep: stess test PCRE v2 on invalid UTF-8 data Ævar Arnfjörð Bjarmason
                     ` (2 subsequent siblings)
  9 siblings, 0 replies; 52+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-26 15:08 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Carlo Marcelo Arenas Belón, Beat Bolli,
	Johannes Schindelin, Ævar Arnfjörð Bjarmason

This change paves the way for later using this value the regex compile
functions themselves.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 grep.c | 7 +++----
 grep.h | 1 +
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/grep.c b/grep.c
index b94e998680..6d60e2e557 100644
--- a/grep.c
+++ b/grep.c
@@ -606,7 +606,6 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 {
 	int err;
 	int regflags = REG_NEWLINE;
-	int pat_is_fixed;
 
 	p->word_regexp = opt->word_regexp;
 	p->ignore_case = opt->ignore_case;
@@ -615,11 +614,11 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 	if (memchr(p->pattern, 0, p->patternlen) && !opt->pcre2)
 		die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2"));
 
-	pat_is_fixed = is_fixed(p->pattern, p->patternlen);
-	if (p->fixed || pat_is_fixed) {
+	p->is_fixed = is_fixed(p->pattern, p->patternlen);
+	if (p->fixed || p->is_fixed) {
 #ifdef USE_LIBPCRE2
 		opt->pcre2 = 1;
-		if (pat_is_fixed) {
+		if (p->is_fixed) {
 			compile_pcre2_pattern(p, opt);
 		} else {
 			/*
diff --git a/grep.h b/grep.h
index ce2d72571f..c0c71eb4a9 100644
--- a/grep.h
+++ b/grep.h
@@ -88,6 +88,7 @@ struct grep_pat {
 	pcre2_compile_context *pcre2_compile_context;
 	uint32_t pcre2_jit_on;
 	unsigned fixed:1;
+	unsigned is_fixed:1;
 	unsigned ignore_case:1;
 	unsigned word_regexp:1;
 };
-- 
2.22.0.455.g172b71a6c5


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v2 6/8] grep: stess test PCRE v2 on invalid UTF-8 data
  2019-07-24 15:14 ` [PATCH 0/3] grep: PCRE JIT fixes Ævar Arnfjörð Bjarmason
                     ` (6 preceding siblings ...)
  2019-07-26 15:08   ` [PATCH v2 5/8] grep: create a "is_fixed" member in "grep_pat" Ævar Arnfjörð Bjarmason
@ 2019-07-26 15:08   ` Ævar Arnfjörð Bjarmason
  2019-07-26 20:34     ` Junio C Hamano
                       ` (2 more replies)
  2019-07-26 15:08   ` [PATCH v2 7/8] grep: do not enter PCRE2_UTF mode on fixed matching Ævar Arnfjörð Bjarmason
  2019-07-26 15:08   ` [PATCH v2 8/8] grep: optimistically use PCRE2_MATCH_INVALID_UTF Ævar Arnfjörð Bjarmason
  9 siblings, 3 replies; 52+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-26 15:08 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Carlo Marcelo Arenas Belón, Beat Bolli,
	Johannes Schindelin, Ævar Arnfjörð Bjarmason

Since my b65abcafc7 ("grep: use PCRE v2 for optimized fixed-string
search", 2019-07-01) we've been dying on invalid UTF-8 data when
grepping for fixed strings if the following are all true:

    * The subject string is non-ASCII (e.g. "ævar")
    * We're under a is_utf8_locale(), e.g. "en_US.UTF-8", not "C"
    * We compiled with PCRE v2
    * That PCRE v2 did not have JIT support

The last of those is why this wasn't caught earlier, per pcre2jit(3):

    "unless PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested
    for validity. In the interests of speed, these checks do not
    happen on the JIT fast path, and if invalid data is passed, the
    result is undefined."

I.e. the subject being matched against our pattern was invalid, but we
were lucky and getting away with it on the JIT path, but the non-JIT
one is stricter.

This patch does nothing to fix that, instead we sneak in support for
fixed patterns starting with "(*NO_JIT)", this disables the PCRE v2
jit with implicit fixed-string matching for testing, see
pcre2syntax(3) the syntax.

This is technically a change in behavior, but it's so obscure that I
figured it was OK. We'd previously consider this an invalid regular
expression as regcomp() would die on it, now we feed it to the PCRE v2
fixed-string path. I thought this was better than introducing yet
another GIT_TEST_* environment variable.

We're also relying on a behavior of PCRE v2 that technically could
change, but I think the test coverage is worth dipping our toe into
some somewhat undefined behavior.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 grep.c                          | 10 ++++++++++
 t/t7812-grep-icase-non-ascii.sh | 28 ++++++++++++++++++++++++++++
 2 files changed, 38 insertions(+)

diff --git a/grep.c b/grep.c
index 6d60e2e557..5bc0f4f32a 100644
--- a/grep.c
+++ b/grep.c
@@ -615,6 +615,16 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
 		die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2"));
 
 	p->is_fixed = is_fixed(p->pattern, p->patternlen);
+#ifdef USE_LIBPCRE2
+       if (!p->fixed && !p->is_fixed) {
+	       const char *no_jit = "(*NO_JIT)";
+	       const int no_jit_len = strlen(no_jit);
+	       if (starts_with(p->pattern, no_jit) &&
+		   is_fixed(p->pattern + no_jit_len,
+			    p->patternlen - no_jit_len))
+		       p->is_fixed = 1;
+       }
+#endif
 	if (p->fixed || p->is_fixed) {
 #ifdef USE_LIBPCRE2
 		opt->pcre2 = 1;
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index 0c685d3598..96c3572056 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -53,4 +53,32 @@ test_expect_success REGEX_LOCALE 'pickaxe -i on non-ascii' '
 	test_cmp expected actual
 '
 
+test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: setup invalid UTF-8 data' '
+	printf "\\200\\n" >invalid-0x80 &&
+	echo "ævar" >expected &&
+	cat expected >>invalid-0x80 &&
+	git add invalid-0x80
+'
+
+test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep ASCII from invalid UTF-8 data' '
+	git grep -h "var" invalid-0x80 >actual &&
+	test_cmp expected actual &&
+	git grep -h "(*NO_JIT)var" invalid-0x80 >actual &&
+	test_cmp expected actual
+'
+
+test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from invalid UTF-8 data' '
+	test_might_fail git grep -h "æ" invalid-0x80 >actual &&
+	test_cmp expected actual &&
+	test_must_fail git grep -h "(*NO_JIT)æ" invalid-0x80 &&
+	test_cmp expected actual
+'
+
+test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from invalid UTF-8 data with -i' '
+	test_might_fail git grep -hi "Æ" invalid-0x80 >actual &&
+	test_cmp expected actual &&
+	test_must_fail git grep -hi "(*NO_JIT)Æ" invalid-0x80 &&
+	test_cmp expected actual
+'
+
 test_done
-- 
2.22.0.455.g172b71a6c5


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v2 7/8] grep: do not enter PCRE2_UTF mode on fixed matching
  2019-07-24 15:14 ` [PATCH 0/3] grep: PCRE JIT fixes Ævar Arnfjörð Bjarmason
                     ` (7 preceding siblings ...)
  2019-07-26 15:08   ` [PATCH v2 6/8] grep: stess test PCRE v2 on invalid UTF-8 data Ævar Arnfjörð Bjarmason
@ 2019-07-26 15:08   ` Ævar Arnfjörð Bjarmason
  2019-07-26 20:36     ` Junio C Hamano
  2019-07-26 15:08   ` [PATCH v2 8/8] grep: optimistically use PCRE2_MATCH_INVALID_UTF Ævar Arnfjörð Bjarmason
  9 siblings, 1 reply; 52+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-26 15:08 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Carlo Marcelo Arenas Belón, Beat Bolli,
	Johannes Schindelin, Ævar Arnfjörð Bjarmason

As discussed in the last commit partially fix a bug introduced in
b65abcafc7 ("grep: use PCRE v2 for optimized fixed-string search",
2019-07-01). Because PCRE v2, unlike kwset, validates its UTF-8 input
we'd die on e.g.:

    fatal: pcre2_match failed with error code -22: UTF-8 error:
    isolated byte with 0x80 bit set

When grepping a non-ASCII fixed string. This is a more general problem
that's hard to fix, but we can at least fix the most common case of
grepping for a fixed string without "-i". I can't think of a reason
for why we'd turn on PCRE2_UTF when matching byte-for-byte like that.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 grep.c                          | 3 ++-
 t/t7812-grep-icase-non-ascii.sh | 4 ++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/grep.c b/grep.c
index 5bc0f4f32a..c7c06ae08d 100644
--- a/grep.c
+++ b/grep.c
@@ -472,7 +472,8 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 		}
 		options |= PCRE2_CASELESS;
 	}
-	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern))
+	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
+	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
 		options |= PCRE2_UTF;
 
 	p->pcre2_pattern = pcre2_compile((PCRE2_SPTR)p->pattern,
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index 96c3572056..531eb59d57 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -68,9 +68,9 @@ test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep ASCII from invalid UT
 '
 
 test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from invalid UTF-8 data' '
-	test_might_fail git grep -h "æ" invalid-0x80 >actual &&
+	git grep -h "æ" invalid-0x80 >actual &&
 	test_cmp expected actual &&
-	test_must_fail git grep -h "(*NO_JIT)æ" invalid-0x80 &&
+	git grep -h "(*NO_JIT)æ" invalid-0x80 &&
 	test_cmp expected actual
 '
 
-- 
2.22.0.455.g172b71a6c5


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v2 8/8] grep: optimistically use PCRE2_MATCH_INVALID_UTF
  2019-07-24 15:14 ` [PATCH 0/3] grep: PCRE JIT fixes Ævar Arnfjörð Bjarmason
                     ` (8 preceding siblings ...)
  2019-07-26 15:08   ` [PATCH v2 7/8] grep: do not enter PCRE2_UTF mode on fixed matching Ævar Arnfjörð Bjarmason
@ 2019-07-26 15:08   ` Ævar Arnfjörð Bjarmason
  2019-07-26 21:07     ` Junio C Hamano
  9 siblings, 1 reply; 52+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-26 15:08 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Carlo Marcelo Arenas Belón, Beat Bolli,
	Johannes Schindelin, Ævar Arnfjörð Bjarmason

As discussed in the "grep: stess test PCRE v2 on invalid UTF-8 data"
commit leading up to this one there's a regression in
b65abcafc7 ("grep: use PCRE v2 for optimized fixed-string search",
2019-07-01) when matching UTF-8 data.

This ultimately isn't straightforward to just "fix", because the kwset
backend was so dumb about icase matching that we'd skip it entirely on
non-ASCII. See the code removed in 48de2a768c ("grep: remove the kwset
optimization", 2019-07-01).

Just going back to the C library for those isn't ideal, since it's
likely to be even dumber about these mixed-encoding cases.

So let's support this "properly" using the PCRE2_MATCH_INVALID_UTF
flag. This is new code that's not in any released PCRE v2 version, so
we might need a fix that emulates it somehow. I figure that the case
that with the non-icase bug out of the way this is obscure enough to
tell people "upgrade your PCRE v2 too!'. It'll likely be released by
the time we release the git version this commit is part of.

We can't just use PCRE2_NO_UTF_CHECK instead for the reasons discussed
in [1].

1. https://public-inbox.org/git/87lfwn70nb.fsf@evledraar.gmail.com/

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Makefile                        |  1 +
 grep.c                          |  2 +-
 grep.h                          |  3 +++
 t/helper/test-pcre2-config.c    | 12 ++++++++++++
 t/helper/test-tool.c            |  1 +
 t/helper/test-tool.h            |  1 +
 t/t7812-grep-icase-non-ascii.sh | 13 ++++++++++++-
 7 files changed, 31 insertions(+), 2 deletions(-)
 create mode 100644 t/helper/test-pcre2-config.c

diff --git a/Makefile b/Makefile
index bd246f2989..dd38d5e527 100644
--- a/Makefile
+++ b/Makefile
@@ -726,6 +726,7 @@ TEST_BUILTINS_OBJS += test-oidmap.o
 TEST_BUILTINS_OBJS += test-online-cpus.o
 TEST_BUILTINS_OBJS += test-parse-options.o
 TEST_BUILTINS_OBJS += test-path-utils.o
+TEST_BUILTINS_OBJS += test-pcre2-config.o
 TEST_BUILTINS_OBJS += test-pkt-line.o
 TEST_BUILTINS_OBJS += test-prio-queue.o
 TEST_BUILTINS_OBJS += test-reach.o
diff --git a/grep.c b/grep.c
index c7c06ae08d..8b8b9efe12 100644
--- a/grep.c
+++ b/grep.c
@@ -474,7 +474,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 	}
 	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
 	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
-		options |= PCRE2_UTF;
+		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
 
 	p->pcre2_pattern = pcre2_compile((PCRE2_SPTR)p->pattern,
 					 p->patternlen, options, &error, &erroffset,
diff --git a/grep.h b/grep.h
index c0c71eb4a9..506f05b97b 100644
--- a/grep.h
+++ b/grep.h
@@ -21,6 +21,9 @@ typedef int pcre_extra;
 #ifdef USE_LIBPCRE2
 #define PCRE2_CODE_UNIT_WIDTH 8
 #include <pcre2.h>
+#ifndef PCRE2_MATCH_INVALID_UTF
+#define PCRE2_MATCH_INVALID_UTF 0
+#endif
 #else
 typedef int pcre2_code;
 typedef int pcre2_match_data;
diff --git a/t/helper/test-pcre2-config.c b/t/helper/test-pcre2-config.c
new file mode 100644
index 0000000000..5258fdddba
--- /dev/null
+++ b/t/helper/test-pcre2-config.c
@@ -0,0 +1,12 @@
+#include "test-tool.h"
+#include "cache.h"
+#include "grep.h"
+
+int cmd__pcre2_config(int argc, const char **argv)
+{
+	if (argc == 2 && !strcmp(argv[1], "has-PCRE2_MATCH_INVALID_UTF")) {
+		int value = PCRE2_MATCH_INVALID_UTF;
+		return !value;
+	}
+	return 1;
+}
diff --git a/t/helper/test-tool.c b/t/helper/test-tool.c
index ce7e89028c..e022ce0e48 100644
--- a/t/helper/test-tool.c
+++ b/t/helper/test-tool.c
@@ -40,6 +40,7 @@ static struct test_cmd cmds[] = {
 	{ "online-cpus", cmd__online_cpus },
 	{ "parse-options", cmd__parse_options },
 	{ "path-utils", cmd__path_utils },
+	{ "pcre2-config", cmd__pcre2_config },
 	{ "pkt-line", cmd__pkt_line },
 	{ "prio-queue", cmd__prio_queue },
 	{ "reach", cmd__reach },
diff --git a/t/helper/test-tool.h b/t/helper/test-tool.h
index f805bb39ae..acd8af2a9d 100644
--- a/t/helper/test-tool.h
+++ b/t/helper/test-tool.h
@@ -30,6 +30,7 @@ int cmd__oidmap(int argc, const char **argv);
 int cmd__online_cpus(int argc, const char **argv);
 int cmd__parse_options(int argc, const char **argv);
 int cmd__path_utils(int argc, const char **argv);
+int cmd__pcre2_config(int argc, const char **argv);
 int cmd__pkt_line(int argc, const char **argv);
 int cmd__prio_queue(int argc, const char **argv);
 int cmd__reach(int argc, const char **argv);
diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index 531eb59d57..848d46e4f9 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -74,11 +74,22 @@ test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from invali
 	test_cmp expected actual
 '
 
-test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from invalid UTF-8 data with -i' '
+test_lazy_prereq PCRE2_MATCH_INVALID_UTF '
+	test-tool pcre2-config has-PCRE2_MATCH_INVALID_UTF
+'
+
+test_expect_success GETTEXT_LOCALE,LIBPCRE2,!PCRE2_MATCH_INVALID_UTF 'PCRE v2: grep non-ASCII from invalid UTF-8 data with -i' '
 	test_might_fail git grep -hi "Æ" invalid-0x80 >actual &&
 	test_cmp expected actual &&
 	test_must_fail git grep -hi "(*NO_JIT)Æ" invalid-0x80 &&
 	test_cmp expected actual
 '
 
+test_expect_success GETTEXT_LOCALE,LIBPCRE2,PCRE2_MATCH_INVALID_UTF 'PCRE v2: grep non-ASCII from invalid UTF-8 data with -i' '
+	git grep -hi "Æ" invalid-0x80 >actual &&
+	test_cmp expected actual &&
+	git grep -hi "(*NO_JIT)Æ" invalid-0x80 &&
+	test_cmp expected actual
+'
+
 test_done
-- 
2.22.0.455.g172b71a6c5


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [RFC PATCH 0/2] PCRE1 cleanup
  2019-07-26 14:43         ` Ævar Arnfjörð Bjarmason
@ 2019-07-26 20:26           ` Carlo Marcelo Arenas Belón
  2019-07-26 20:26             ` [RFC PATCH 1/2] grep: make sure NO_LIBPCRE1_JIT disable JIT in PCRE1 Carlo Marcelo Arenas Belón
  2019-07-26 20:26             ` [RFC PATCH 2/2] grep: refactor and simplify PCRE1 support Carlo Marcelo Arenas Belón
  0 siblings, 2 replies; 52+ messages in thread
From: Carlo Marcelo Arenas Belón @ 2019-07-26 20:26 UTC (permalink / raw)
  To: git; +Cc: avarab, Johannes.Schindelin, dev+git, gitster

Sent as an RFC since it was meant to be applied against ab/pcre-jit-fixes
but that is likely to change with the reroll of that branch.

 [PATCH 1/2] grep: make sure NO_LIBPCRE1_JIT disable JIT in PCRE1
 [PATCH 2/2] grep: refactor and simplify PCRE1 support

The end result could be squashed together before merging but sent as
independentt changes to make review easier, with the first change doing
the minimum required to make PCRE1 great again.

 Makefile |  9 ++-------
 grep.c   | 15 +++++++++------
 grep.h   | 11 -----------
 3 files changed, 11 insertions(+), 24 deletions(-)

base-commit: 0f8c4ddfdddb72dc62d76864f5d3d31f136c7129
-- 
2.22.0

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [RFC PATCH 1/2] grep: make sure NO_LIBPCRE1_JIT disable JIT in PCRE1
  2019-07-26 20:26           ` [RFC PATCH 0/2] PCRE1 cleanup Carlo Marcelo Arenas Belón
@ 2019-07-26 20:26             ` Carlo Marcelo Arenas Belón
  2019-07-26 20:26             ` [RFC PATCH 2/2] grep: refactor and simplify PCRE1 support Carlo Marcelo Arenas Belón
  1 sibling, 0 replies; 52+ messages in thread
From: Carlo Marcelo Arenas Belón @ 2019-07-26 20:26 UTC (permalink / raw)
  To: git; +Cc: avarab, Johannes.Schindelin, dev+git, gitster

e87de7cab4 ("grep: un-break building with PCRE < 8.32", 2017-05-25)
added a restriction for JIT support that is no longer needed after
pcre_jit_exec() calls were removed.

Reorganize the definitions in grep.h so that JIT support could be
detected early and NO_LIBPCRE1_JIT could be used reliably to enforce
JIT doesn't get used.

Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
---
 Makefile | 9 ++-------
 grep.h   | 4 +---
 2 files changed, 3 insertions(+), 10 deletions(-)

diff --git a/Makefile b/Makefile
index 11ccea4071..7e0e6cc129 100644
--- a/Makefile
+++ b/Makefile
@@ -34,13 +34,8 @@ all::
 # library. Support for version 1 will likely be removed in some future
 # release of Git, as upstream has all but abandoned it.
 #
-# When using USE_LIBPCRE1, define NO_LIBPCRE1_JIT if the PCRE v1
-# library is compiled without --enable-jit. We will auto-detect
-# whether the version of the PCRE v1 library in use has JIT support at
-# all, but we unfortunately can't auto-detect whether JIT support
-# hasn't been compiled in in an otherwise JIT-supporting version. If
-# you have link-time errors about a missing `pcre_jit_exec` define
-# this, or recompile PCRE v1 with --enable-jit.
+# When using USE_LIBPCRE1, define NO_LIBPCRE1_JIT if you want to
+# disable JIT even if supported by your library.
 #
 # Define LIBPCREDIR=/foo/bar if your PCRE header and library files are
 # in /foo/bar/include and /foo/bar/lib directories. Which version of
diff --git a/grep.h b/grep.h
index a405fc870c..2a74e28d94 100644
--- a/grep.h
+++ b/grep.h
@@ -3,14 +3,12 @@
 #include "color.h"
 #ifdef USE_LIBPCRE1
 #include <pcre.h>
-#ifdef PCRE_CONFIG_JIT
-#if PCRE_MAJOR >= 8 && PCRE_MINOR >= 32
 #ifndef NO_LIBPCRE1_JIT
+#ifdef PCRE_CONFIG_JIT
 #define GIT_PCRE1_USE_JIT
 #define GIT_PCRE_STUDY_JIT_COMPILE PCRE_STUDY_JIT_COMPILE
 #endif
 #endif
-#endif
 #ifndef GIT_PCRE_STUDY_JIT_COMPILE
 #define GIT_PCRE_STUDY_JIT_COMPILE 0
 #endif
-- 
2.22.0

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [RFC PATCH 2/2] grep: refactor and simplify PCRE1 support
  2019-07-26 20:26           ` [RFC PATCH 0/2] PCRE1 cleanup Carlo Marcelo Arenas Belón
  2019-07-26 20:26             ` [RFC PATCH 1/2] grep: make sure NO_LIBPCRE1_JIT disable JIT in PCRE1 Carlo Marcelo Arenas Belón
@ 2019-07-26 20:26             ` Carlo Marcelo Arenas Belón
  1 sibling, 0 replies; 52+ messages in thread
From: Carlo Marcelo Arenas Belón @ 2019-07-26 20:26 UTC (permalink / raw)
  To: git; +Cc: avarab, Johannes.Schindelin, dev+git, gitster

The code used both a macro and a variable to keep track if JIT
support was desired and relied on the fact that a non JIT
enabled library will ignore a request for JIT compilation
(as defined by the second parameter of the call to pcre_study)

Cleanup the multiple levels of macros used and call pcre_study
with the right parameter after JIT support has been confirmed
and unless it was requested to be disabled with NO_LIBPCRE1_JIT

Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
---
 grep.c | 15 +++++++++------
 grep.h |  9 ---------
 2 files changed, 9 insertions(+), 15 deletions(-)

diff --git a/grep.c b/grep.c
index 6b52fed53a..599765c5c1 100644
--- a/grep.c
+++ b/grep.c
@@ -386,6 +386,7 @@ static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt)
 	const char *error;
 	int erroffset;
 	int options = PCRE_MULTILINE;
+	int study_options = 0;
 
 	if (opt->ignore_case) {
 		if (has_non_ascii(p->pattern))
@@ -400,13 +401,15 @@ static void compile_pcre1_regexp(struct grep_pat *p, const struct grep_opt *opt)
 	if (!p->pcre1_regexp)
 		compile_regexp_failed(p, error);
 
-	p->pcre1_extra_info = pcre_study(p->pcre1_regexp, GIT_PCRE_STUDY_JIT_COMPILE, &error);
-	if (!p->pcre1_extra_info && error)
-		die("%s", error);
-
-#ifdef GIT_PCRE1_USE_JIT
+#if defined(PCRE_CONFIG_JIT) && !defined(NO_LIBPCRE1_JIT)
 	pcre_config(PCRE_CONFIG_JIT, &p->pcre1_jit_on);
+	if (p->pcre1_jit_on)
+		study_options = PCRE_STUDY_JIT_COMPILE;
 #endif
+
+	p->pcre1_extra_info = pcre_study(p->pcre1_regexp, study_options, &error);
+	if (!p->pcre1_extra_info && error)
+		die("%s", error);
 }
 
 static int pcre1match(struct grep_pat *p, const char *line, const char *eol,
@@ -435,7 +438,7 @@ static int pcre1match(struct grep_pat *p, const char *line, const char *eol,
 static void free_pcre1_regexp(struct grep_pat *p)
 {
 	pcre_free(p->pcre1_regexp);
-#ifdef GIT_PCRE1_USE_JIT
+#ifdef PCRE_CONFIG_JIT
 	if (p->pcre1_jit_on)
 		pcre_free_study(p->pcre1_extra_info);
 	else
diff --git a/grep.h b/grep.h
index 2a74e28d94..30f2503121 100644
--- a/grep.h
+++ b/grep.h
@@ -3,15 +3,6 @@
 #include "color.h"
 #ifdef USE_LIBPCRE1
 #include <pcre.h>
-#ifndef NO_LIBPCRE1_JIT
-#ifdef PCRE_CONFIG_JIT
-#define GIT_PCRE1_USE_JIT
-#define GIT_PCRE_STUDY_JIT_COMPILE PCRE_STUDY_JIT_COMPILE
-#endif
-#endif
-#ifndef GIT_PCRE_STUDY_JIT_COMPILE
-#define GIT_PCRE_STUDY_JIT_COMPILE 0
-#endif
 #else
 typedef int pcre;
 typedef int pcre_extra;
-- 
2.22.0

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 0/8] grep: PCRE JIT fixes + ab/no-kwset fix
  2019-07-26 15:08   ` [PATCH v2 0/8] grep: PCRE JIT fixes + ab/no-kwset fix Ævar Arnfjörð Bjarmason
@ 2019-07-26 20:27     ` Junio C Hamano
  2019-07-29  9:20       ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 52+ messages in thread
From: Junio C Hamano @ 2019-07-26 20:27 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Carlo Marcelo Arenas Belón, Beat Bolli, Johannes Schindelin

Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:

> 1-3 here are a re-roll on "next". I figured that was easier for
> everyone with the state of the in-flight patches, it certainly was for
> me. Sorry Junio if this creates a mess for you.

As long as I can just apply all of them on top of no-kwset and keep
it a single topic, it wouldn't be too much of a hassle.

> 4-8 are a "fix" for the UTF-8 matching error noted in Carlo's "grep:
> skip UTF8 checks explicitally" in
> https://public-inbox.org/git/20190721183115.14985-1-carenas@gmail.com/
>
> As noted the bug isn't fully fixed until 8/8, and that patch relies on
> unreleased PCRE v2 code. I'm hoping that with 7/8 we're in a good
> enough state to limp forward as noted in the rationale of those
> commits.

Yikes.  Perhaps we should kick the no-kwset thing out of 'next' and
start from scratch?  It does not sound that the world is ready yet.

But that is just a knee-jerk reaction before reading the actual
patches.  Let's see how they look ;-)

Thanks.

> Ævar Arnfjörð Bjarmason (8):
>   grep: remove overly paranoid BUG(...) code
>   grep: stop "using" a custom JIT stack with PCRE v2
>   grep: stop using a custom JIT stack with PCRE v1
>   grep: consistently use "p->fixed" in compile_regexp()
>   grep: create a "is_fixed" member in "grep_pat"
>   grep: stess test PCRE v2 on invalid UTF-8 data
>   grep: do not enter PCRE2_UTF mode on fixed matching
>   grep: optimistically use PCRE2_MATCH_INVALID_UTF
>
>  Makefile                        |  1 +
>  grep.c                          | 68 +++++++++++----------------------
>  grep.h                          | 13 ++-----
>  t/helper/test-pcre2-config.c    | 12 ++++++
>  t/helper/test-tool.c            |  1 +
>  t/helper/test-tool.h            |  1 +
>  t/t7812-grep-icase-non-ascii.sh | 39 +++++++++++++++++++
>  7 files changed, 80 insertions(+), 55 deletions(-)
>  create mode 100644 t/helper/test-pcre2-config.c

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 6/8] grep: stess test PCRE v2 on invalid UTF-8 data
  2019-07-26 15:08   ` [PATCH v2 6/8] grep: stess test PCRE v2 on invalid UTF-8 data Ævar Arnfjörð Bjarmason
@ 2019-07-26 20:34     ` Junio C Hamano
  2019-07-26 21:55       ` Ævar Arnfjörð Bjarmason
  2019-07-29  3:06     ` Carlo Arenas
  2019-11-26 21:50     ` [PATCH] t7812: add missing redirects Andreas Schwab
  2 siblings, 1 reply; 52+ messages in thread
From: Junio C Hamano @ 2019-07-26 20:34 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Carlo Marcelo Arenas Belón, Beat Bolli, Johannes Schindelin

Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:

> diff --git a/grep.c b/grep.c
> index 6d60e2e557..5bc0f4f32a 100644
> --- a/grep.c
> +++ b/grep.c
> @@ -615,6 +615,16 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
>  		die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2"));
>  
>  	p->is_fixed = is_fixed(p->pattern, p->patternlen);
> +#ifdef USE_LIBPCRE2
> +       if (!p->fixed && !p->is_fixed) {
> +	       const char *no_jit = "(*NO_JIT)";
> +	       const int no_jit_len = strlen(no_jit);
> +	       if (starts_with(p->pattern, no_jit) &&
> +		   is_fixed(p->pattern + no_jit_len,
> +			    p->patternlen - no_jit_len))
> +		       p->is_fixed = 1;

It is unfortunate that is_fixed() takes a counted string.
Otherwise, using skip_prefix() to avoid "+no_jit_len" would have
made it much easier to read. i.e.

	/* an illustration that does not quite work */
	char *pattern_body;
	if (skip_prefix(p->pattern, "(*NO_JIT)", &pattern_body) &&
            is_fixed(pattern_body))
		p->is_fixed = 1;
	
> +test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: setup invalid UTF-8 data' '
> +	printf "\\200\\n" >invalid-0x80 &&
> +	echo "ævar" >expected &&
> +	cat expected >>invalid-0x80 &&
> +	git add invalid-0x80
> +'
> +
> +test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep ASCII from invalid UTF-8 data' '
> +	git grep -h "var" invalid-0x80 >actual &&
> +	test_cmp expected actual &&
> +	git grep -h "(*NO_JIT)var" invalid-0x80 >actual &&
> +	test_cmp expected actual
> +'
> +
> +test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from invalid UTF-8 data' '
> +	test_might_fail git grep -h "æ" invalid-0x80 >actual &&
> +	test_cmp expected actual &&
> +	test_must_fail git grep -h "(*NO_JIT)æ" invalid-0x80 &&
> +	test_cmp expected actual
> +'
> +
> +test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from invalid UTF-8 data with -i' '
> +	test_might_fail git grep -hi "Æ" invalid-0x80 >actual &&
> +	test_cmp expected actual &&
> +	test_must_fail git grep -hi "(*NO_JIT)Æ" invalid-0x80 &&
> +	test_cmp expected actual
> +'
> +
>  test_done

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 7/8] grep: do not enter PCRE2_UTF mode on fixed matching
  2019-07-26 15:08   ` [PATCH v2 7/8] grep: do not enter PCRE2_UTF mode on fixed matching Ævar Arnfjörð Bjarmason
@ 2019-07-26 20:36     ` Junio C Hamano
  0 siblings, 0 replies; 52+ messages in thread
From: Junio C Hamano @ 2019-07-26 20:36 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Carlo Marcelo Arenas Belón, Beat Bolli, Johannes Schindelin

Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:

> When grepping a non-ASCII fixed string. This is a more general problem
> that's hard to fix, but we can at least fix the most common case of
> grepping for a fixed string without "-i". I can't think of a reason
> for why we'd turn on PCRE2_UTF when matching byte-for-byte like that.

Yes, exactly.  That's quite a sane and minimum fix/workaround, I
would think.

>  test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from invalid UTF-8 data' '
> -	test_might_fail git grep -h "æ" invalid-0x80 >actual &&
> +	git grep -h "æ" invalid-0x80 >actual &&
>  	test_cmp expected actual &&
> -	test_must_fail git grep -h "(*NO_JIT)æ" invalid-0x80 &&
> +	git grep -h "(*NO_JIT)æ" invalid-0x80 &&
>  	test_cmp expected actual
>  '

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 8/8] grep: optimistically use PCRE2_MATCH_INVALID_UTF
  2019-07-26 15:08   ` [PATCH v2 8/8] grep: optimistically use PCRE2_MATCH_INVALID_UTF Ævar Arnfjörð Bjarmason
@ 2019-07-26 21:07     ` Junio C Hamano
  2019-07-26 21:53       ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 52+ messages in thread
From: Junio C Hamano @ 2019-07-26 21:07 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Carlo Marcelo Arenas Belón, Beat Bolli, Johannes Schindelin

Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:

> diff --git a/Makefile b/Makefile
> index bd246f2989..dd38d5e527 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -726,6 +726,7 @@ TEST_BUILTINS_OBJS += test-oidmap.o
>  TEST_BUILTINS_OBJS += test-online-cpus.o
>  TEST_BUILTINS_OBJS += test-parse-options.o
>  TEST_BUILTINS_OBJS += test-path-utils.o
> +TEST_BUILTINS_OBJS += test-pcre2-config.o

This won't even build with any released pcre version; shouldn't we
make it at least conditionally compiled code?  Specifically...

>  TEST_BUILTINS_OBJS += test-pkt-line.o
>  TEST_BUILTINS_OBJS += test-prio-queue.o
>  TEST_BUILTINS_OBJS += test-reach.o
> diff --git a/grep.c b/grep.c
> index c7c06ae08d..8b8b9efe12 100644
> --- a/grep.c
> +++ b/grep.c
> @@ -474,7 +474,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
>  	}
>  	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
>  	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
> -		options |= PCRE2_UTF;
> +		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
>  
>  	p->pcre2_pattern = pcre2_compile((PCRE2_SPTR)p->pattern,
>  					 p->patternlen, options, &error, &erroffset,
> diff --git a/grep.h b/grep.h
> index c0c71eb4a9..506f05b97b 100644
> --- a/grep.h
> +++ b/grep.h
> @@ -21,6 +21,9 @@ typedef int pcre_extra;
>  #ifdef USE_LIBPCRE2
>  #define PCRE2_CODE_UNIT_WIDTH 8
>  #include <pcre2.h>
> +#ifndef PCRE2_MATCH_INVALID_UTF
> +#define PCRE2_MATCH_INVALID_UTF 0
> +#endif

... unlike this piece of code ...

>  #else
>  typedef int pcre2_code;
>  typedef int pcre2_match_data;
> diff --git a/t/helper/test-pcre2-config.c b/t/helper/test-pcre2-config.c
> new file mode 100644
> index 0000000000..5258fdddba
> --- /dev/null
> +++ b/t/helper/test-pcre2-config.c
> @@ -0,0 +1,12 @@
> +#include "test-tool.h"
> +#include "cache.h"
> +#include "grep.h"
> +
> +int cmd__pcre2_config(int argc, const char **argv)
> +{
> +	if (argc == 2 && !strcmp(argv[1], "has-PCRE2_MATCH_INVALID_UTF")) {
> +		int value = PCRE2_MATCH_INVALID_UTF;

... this part does not have any fallback definition.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 8/8] grep: optimistically use PCRE2_MATCH_INVALID_UTF
  2019-07-26 21:07     ` Junio C Hamano
@ 2019-07-26 21:53       ` Ævar Arnfjörð Bjarmason
  2019-07-26 21:57         ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 52+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-26 21:53 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Carlo Marcelo Arenas Belón, Beat Bolli, Johannes Schindelin


On Fri, Jul 26 2019, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:
>
>> diff --git a/Makefile b/Makefile
>> index bd246f2989..dd38d5e527 100644
>> --- a/Makefile
>> +++ b/Makefile
>> @@ -726,6 +726,7 @@ TEST_BUILTINS_OBJS += test-oidmap.o
>>  TEST_BUILTINS_OBJS += test-online-cpus.o
>>  TEST_BUILTINS_OBJS += test-parse-options.o
>>  TEST_BUILTINS_OBJS += test-path-utils.o
>> +TEST_BUILTINS_OBJS += test-pcre2-config.o
>
> This won't even build with any released pcre version; shouldn't we
> make it at least conditionally compiled code?  Specifically...
>
>>  TEST_BUILTINS_OBJS += test-pkt-line.o
>>  TEST_BUILTINS_OBJS += test-prio-queue.o
>>  TEST_BUILTINS_OBJS += test-reach.o
>> diff --git a/grep.c b/grep.c
>> index c7c06ae08d..8b8b9efe12 100644
>> --- a/grep.c
>> +++ b/grep.c
>> @@ -474,7 +474,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
>>  	}
>>  	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
>>  	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
>> -		options |= PCRE2_UTF;
>> +		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
>>
>>  	p->pcre2_pattern = pcre2_compile((PCRE2_SPTR)p->pattern,
>>  					 p->patternlen, options, &error, &erroffset,
>> diff --git a/grep.h b/grep.h
>> index c0c71eb4a9..506f05b97b 100644
>> --- a/grep.h
>> +++ b/grep.h
>> @@ -21,6 +21,9 @@ typedef int pcre_extra;
>>  #ifdef USE_LIBPCRE2
>>  #define PCRE2_CODE_UNIT_WIDTH 8
>>  #include <pcre2.h>
>> +#ifndef PCRE2_MATCH_INVALID_UTF
>> +#define PCRE2_MATCH_INVALID_UTF 0
>> +#endif
>
> ... unlike this piece of code ...
>
>>  #else
>>  typedef int pcre2_code;
>>  typedef int pcre2_match_data;
>> diff --git a/t/helper/test-pcre2-config.c b/t/helper/test-pcre2-config.c
>> new file mode 100644
>> index 0000000000..5258fdddba
>> --- /dev/null
>> +++ b/t/helper/test-pcre2-config.c
>> @@ -0,0 +1,12 @@
>> +#include "test-tool.h"
>> +#include "cache.h"
>> +#include "grep.h"
>> +
>> +int cmd__pcre2_config(int argc, const char **argv)
>> +{
>> +	if (argc == 2 && !strcmp(argv[1], "has-PCRE2_MATCH_INVALID_UTF")) {
>> +		int value = PCRE2_MATCH_INVALID_UTF;
>
> ... this part does not have any fallback definition.

It works because we include grep.h, which'll define
PCRE2_MATCH_INVALID_UTF=0 if pcre2.h doesn't give it to us. I've tested
this on PCRE versions with/without PCRE2_MATCH_INVALID_UTF and it works
& runs/skips the appropriate tests.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 6/8] grep: stess test PCRE v2 on invalid UTF-8 data
  2019-07-26 20:34     ` Junio C Hamano
@ 2019-07-26 21:55       ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 52+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-26 21:55 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Carlo Marcelo Arenas Belón, Beat Bolli, Johannes Schindelin


On Fri, Jul 26 2019, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:
>
>> diff --git a/grep.c b/grep.c
>> index 6d60e2e557..5bc0f4f32a 100644
>> --- a/grep.c
>> +++ b/grep.c
>> @@ -615,6 +615,16 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
>>  		die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2"));
>>
>>  	p->is_fixed = is_fixed(p->pattern, p->patternlen);
>> +#ifdef USE_LIBPCRE2
>> +       if (!p->fixed && !p->is_fixed) {
>> +	       const char *no_jit = "(*NO_JIT)";
>> +	       const int no_jit_len = strlen(no_jit);
>> +	       if (starts_with(p->pattern, no_jit) &&
>> +		   is_fixed(p->pattern + no_jit_len,
>> +			    p->patternlen - no_jit_len))
>> +		       p->is_fixed = 1;
>
> It is unfortunate that is_fixed() takes a counted string.
> Otherwise, using skip_prefix() to avoid "+no_jit_len" would have
> made it much easier to read. i.e.
>
> 	/* an illustration that does not quite work */
> 	char *pattern_body;
> 	if (skip_prefix(p->pattern, "(*NO_JIT)", &pattern_body) &&
>             is_fixed(pattern_body))
> 		p->is_fixed = 1;

Indeed, but then we couldn't use this for patterns that have NUL in
them, which we otherwise support (and support here). So I think it's
worth keeping it so it takes ptr+len.

>> +test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: setup invalid UTF-8 data' '
>> +	printf "\\200\\n" >invalid-0x80 &&
>> +	echo "ævar" >expected &&
>> +	cat expected >>invalid-0x80 &&
>> +	git add invalid-0x80
>> +'
>> +
>> +test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep ASCII from invalid UTF-8 data' '
>> +	git grep -h "var" invalid-0x80 >actual &&
>> +	test_cmp expected actual &&
>> +	git grep -h "(*NO_JIT)var" invalid-0x80 >actual &&
>> +	test_cmp expected actual
>> +'
>> +
>> +test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from invalid UTF-8 data' '
>> +	test_might_fail git grep -h "æ" invalid-0x80 >actual &&
>> +	test_cmp expected actual &&
>> +	test_must_fail git grep -h "(*NO_JIT)æ" invalid-0x80 &&
>> +	test_cmp expected actual
>> +'
>> +
>> +test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from invalid UTF-8 data with -i' '
>> +	test_might_fail git grep -hi "Æ" invalid-0x80 >actual &&
>> +	test_cmp expected actual &&
>> +	test_must_fail git grep -hi "(*NO_JIT)Æ" invalid-0x80 &&
>> +	test_cmp expected actual
>> +'
>> +
>>  test_done

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 8/8] grep: optimistically use PCRE2_MATCH_INVALID_UTF
  2019-07-26 21:53       ` Ævar Arnfjörð Bjarmason
@ 2019-07-26 21:57         ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 52+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-26 21:57 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Carlo Marcelo Arenas Belón, Beat Bolli, Johannes Schindelin


On Fri, Jul 26 2019, Ævar Arnfjörð Bjarmason wrote:

> On Fri, Jul 26 2019, Junio C Hamano wrote:
>
>> Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:
>>
>>> diff --git a/Makefile b/Makefile
>>> index bd246f2989..dd38d5e527 100644
>>> --- a/Makefile
>>> +++ b/Makefile
>>> @@ -726,6 +726,7 @@ TEST_BUILTINS_OBJS += test-oidmap.o
>>>  TEST_BUILTINS_OBJS += test-online-cpus.o
>>>  TEST_BUILTINS_OBJS += test-parse-options.o
>>>  TEST_BUILTINS_OBJS += test-path-utils.o
>>> +TEST_BUILTINS_OBJS += test-pcre2-config.o
>>
>> This won't even build with any released pcre version; shouldn't we
>> make it at least conditionally compiled code?  Specifically...
>>
>>>  TEST_BUILTINS_OBJS += test-pkt-line.o
>>>  TEST_BUILTINS_OBJS += test-prio-queue.o
>>>  TEST_BUILTINS_OBJS += test-reach.o
>>> diff --git a/grep.c b/grep.c
>>> index c7c06ae08d..8b8b9efe12 100644
>>> --- a/grep.c
>>> +++ b/grep.c
>>> @@ -474,7 +474,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
>>>  	}
>>>  	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
>>>  	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
>>> -		options |= PCRE2_UTF;
>>> +		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
>>>
>>>  	p->pcre2_pattern = pcre2_compile((PCRE2_SPTR)p->pattern,
>>>  					 p->patternlen, options, &error, &erroffset,
>>> diff --git a/grep.h b/grep.h
>>> index c0c71eb4a9..506f05b97b 100644
>>> --- a/grep.h
>>> +++ b/grep.h
>>> @@ -21,6 +21,9 @@ typedef int pcre_extra;
>>>  #ifdef USE_LIBPCRE2
>>>  #define PCRE2_CODE_UNIT_WIDTH 8
>>>  #include <pcre2.h>
>>> +#ifndef PCRE2_MATCH_INVALID_UTF
>>> +#define PCRE2_MATCH_INVALID_UTF 0
>>> +#endif
>>
>> ... unlike this piece of code ...
>>
>>>  #else
>>>  typedef int pcre2_code;
>>>  typedef int pcre2_match_data;
>>> diff --git a/t/helper/test-pcre2-config.c b/t/helper/test-pcre2-config.c
>>> new file mode 100644
>>> index 0000000000..5258fdddba
>>> --- /dev/null
>>> +++ b/t/helper/test-pcre2-config.c
>>> @@ -0,0 +1,12 @@
>>> +#include "test-tool.h"
>>> +#include "cache.h"
>>> +#include "grep.h"
>>> +
>>> +int cmd__pcre2_config(int argc, const char **argv)
>>> +{
>>> +	if (argc == 2 && !strcmp(argv[1], "has-PCRE2_MATCH_INVALID_UTF")) {
>>> +		int value = PCRE2_MATCH_INVALID_UTF;
>>
>> ... this part does not have any fallback definition.
>
> It works because we include grep.h, which'll define
> PCRE2_MATCH_INVALID_UTF=0 if pcre2.h doesn't give it to us. I've tested
> this on PCRE versions with/without PCRE2_MATCH_INVALID_UTF and it works
> & runs/skips the appropriate tests.

Ah, I spoke too soon, of course that's all guarded by "are we using PCRE
v2 in general?". I'll fix it...

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 2/8] grep: stop "using" a custom JIT stack with PCRE v2
  2019-07-26 15:08   ` [PATCH v2 2/8] grep: stop "using" a custom JIT stack with PCRE v2 Ævar Arnfjörð Bjarmason
@ 2019-07-29  0:33     ` Carlo Arenas
  0 siblings, 0 replies; 52+ messages in thread
From: Carlo Arenas @ 2019-07-29  0:33 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Junio C Hamano, Beat Bolli, Johannes Schindelin

On Fri, Jul 26, 2019 at 8:08 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> As noted in [3] there are known regexes that will fail with the lower
> stack limit, the way GNU grep fixed it is interesting, although I
> believe the implementation is overly verbose, they could make PCRE v2
> handle that gradual re-allocation, that's what min/max memory is
> for.

The part I liked about the grep implementation was how they went above
and beyond to make sure there was no abstraction leak and at the end
the end user doesn't even see a PCRE error message.

Presume thought that the end user we have is different, and might make
sense to expose them to the underlying mechanism, but in that case we
should also provide them with knobs to tweak (like the one I proposed
to disable jit, and that in this case might be to set a stacksize)

> So we might end up bringing this back, I'm more inclined to just kick
> such cases upstairs to PCRE maintainers as a bug, perhaps they'll add
> some overall "just allocate more then" flag to make this easier. In
> any case there's no functional change here, we didn't have a custom
> stack, so let's apply this first, we can always revert it later.

agree, LGTM other than by the comment below

> diff --git a/grep.c b/grep.c
> index 95af88cb74..4b1e917ac5 100644
> --- a/grep.c
> +++ b/grep.c
> @@ -534,14 +534,6 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
>                         p->pcre2_jit_on = 0;
>                         return;

this return and brackets no longer needed

Carlo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/8] grep: stop using a custom JIT stack with PCRE v1
  2019-07-26 15:08   ` [PATCH v2 3/8] grep: stop using a custom JIT stack with PCRE v1 Ævar Arnfjörð Bjarmason
@ 2019-07-29  1:26     ` Carlo Arenas
  0 siblings, 0 replies; 52+ messages in thread
From: Carlo Arenas @ 2019-07-29  1:26 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Junio C Hamano, Beat Bolli, Johannes Schindelin, sandals

On Fri, Jul 26, 2019 at 8:09 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> It will also implicitly re-enable UTF-8 validation for PCRE v1. As
> noted in [1] we now have cases as a result where PCRE v1 is more eager
> to error out. Subsequent patches will fix that for v2, and I think
> it's fair to tell v1 users "just upgrade" and not worry about that
> edge case for v1.
>
> 1.  https://public-inbox.org/git/CAPUEsphZJ_Uv9o1-yDpjNLA_q-f7gWXz9g1gCY2pYAYN8ri40g@mail.gmail.com/

Haven't seen any responses from packagers but there was a report[1] of
CentOS 6 users
that would be affected, and would certainly make things more difficult
to whoever is behind
the git binary that comes with Xcode in macOS and that is linked
against a system library
(PCRE1) that is IMHO unlikely to be upgraded.

The minimum I would expect if you want to move forward with this,
would be to make
their git not randomly die because of non UTF-8 haystack by applying
[2] and make clear we
will also introduce stack size problems like the one in [3] (as it was
done in the previous
commit)

Feedback on the patchset[4] that applies on top of this to make sure
JIT can be disabled at
compile time and the logic is less messy also appreciated.

Let me know also how you want to keep it on sync, as IMHO makes more
sense inside
your branch instead of as an independent topic.

Carlo

CC +Brian

[1] https://public-inbox.org/git/20190615191514.GD8616@genre.crustytoothpaste.net/
[2] https://public-inbox.org/git/20190722144350.46458-1-carenas@gmail.com/
[3] https://public-inbox.org/git/CAPUEspjj+fG8QDmf=bZXktfpLgkgiu34HTjKLhm-cmEE04FE-A@mail.gmail.com/
[4] https://public-inbox.org/git/20190726202642.7986-1-carenas@gmail.com/

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 4/8] grep: consistently use "p->fixed" in compile_regexp()
  2019-07-26 15:08   ` [PATCH v2 4/8] grep: consistently use "p->fixed" in compile_regexp() Ævar Arnfjörð Bjarmason
@ 2019-07-29  1:48     ` Carlo Arenas
  2019-07-29  9:05       ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 52+ messages in thread
From: Carlo Arenas @ 2019-07-29  1:48 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Junio C Hamano, Beat Bolli, Johannes Schindelin

On Fri, Jul 26, 2019 at 8:09 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> It's less confusing to use that variable consistently that switch back
> & forth between the two.
>
> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
> ---
>  grep.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/grep.c b/grep.c
> index 9c2b259771..b94e998680 100644
> --- a/grep.c
> +++ b/grep.c
> @@ -616,7 +616,7 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
>                 die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2"));
>
>         pat_is_fixed = is_fixed(p->pattern, p->patternlen);
> -       if (opt->fixed || pat_is_fixed) {
> +       if (p->fixed || pat_is_fixed) {

at the end of this series we have:

  if (p->fixed || p->is_fixed)

which doesn't make sense; at least with opt->fixed it was clear that
what was meant is that grep was passed -P

maybe is_fixed shouldn't exist and fixed when applied to the pattern
means we had determined it was a fixed
pattern and overridden the user selection of engine.

that at least will give us a logical way to fix the pattern reported
in [1] and that currently requires the user to know
git's grep internals and know he can skip the "is_fixed" optimization
by doing something like :

  $ git grep 'foo[ ]bar'

Carlo

[1] https://public-inbox.org/git/20190728235427.41425-1-carenas@gmail.com/

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 6/8] grep: stess test PCRE v2 on invalid UTF-8 data
  2019-07-26 15:08   ` [PATCH v2 6/8] grep: stess test PCRE v2 on invalid UTF-8 data Ævar Arnfjörð Bjarmason
  2019-07-26 20:34     ` Junio C Hamano
@ 2019-07-29  3:06     ` Carlo Arenas
  2019-11-26 21:50     ` [PATCH] t7812: add missing redirects Andreas Schwab
  2 siblings, 0 replies; 52+ messages in thread
From: Carlo Arenas @ 2019-07-29  3:06 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Junio C Hamano, Beat Bolli, Johannes Schindelin

On Fri, Jul 26, 2019 at 8:09 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> This patch does nothing to fix that, instead we sneak in support for
> fixed patterns starting with "(*NO_JIT)", this disables the PCRE v2
> jit with implicit fixed-string matching for testing, see
> pcre2syntax(3) the syntax.

Alternativelly; using `git -c pcre.jit=false grep ...` on top of [1],
might be cleaner

Carlo

[1] https://public-inbox.org/git/20190728235427.41425-1-carenas@gmail.com/

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 4/8] grep: consistently use "p->fixed" in compile_regexp()
  2019-07-29  1:48     ` Carlo Arenas
@ 2019-07-29  9:05       ` Ævar Arnfjörð Bjarmason
  2019-07-29  9:13         ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 52+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-29  9:05 UTC (permalink / raw)
  To: Carlo Arenas; +Cc: git, Junio C Hamano, Beat Bolli, Johannes Schindelin


On Mon, Jul 29 2019, Carlo Arenas wrote:

> On Fri, Jul 26, 2019 at 8:09 AM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>>
>> It's less confusing to use that variable consistently that switch back
>> & forth between the two.
>>
>> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
>> ---
>>  grep.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/grep.c b/grep.c
>> index 9c2b259771..b94e998680 100644
>> --- a/grep.c
>> +++ b/grep.c
>> @@ -616,7 +616,7 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
>>                 die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2"));
>>
>>         pat_is_fixed = is_fixed(p->pattern, p->patternlen);
>> -       if (opt->fixed || pat_is_fixed) {
>> +       if (p->fixed || pat_is_fixed) {
>
> at the end of this series we have:
>
>   if (p->fixed || p->is_fixed)
>
> which doesn't make sense; at least with opt->fixed it was clear that
> what was meant is that grep was passed -P

I assume you mean "was passed -F...".

> maybe is_fixed shouldn't exist and fixed when applied to the pattern
> means we had determined it was a fixed
> pattern and overridden the user selection of engine.

They're two flags because p->fixed is "--fixed-strings", and p->is_fixed
is "there's no metachars here". So the former case needs escaping, as
the code just below might do (the two aren't mutually exclusive).

I don't get how you think we can always fold them into one flag, but
maybe I'm missing something...

> that at least will give us a logical way to fix the pattern reported
> in [1] and that currently requires the user to know
> git's grep internals and know he can skip the "is_fixed" optimization
> by doing something like :
>
>   $ git grep 'foo[ ]bar'
>
> [1] https://public-inbox.org/git/20190728235427.41425-1-carenas@gmail.com/

As I noted in a reply there this seems like a way to fix a bug in "next"
with a config knob. Yes we should fix the bug, but we've had the kwset
code in git for years without needing this distinction, so after we work
out the bugs I don't see why we'd need this.

The reason we ignore the user's choice here is because you might
e.g. set grep.patternType=extended in your config, and you'd still want
grepping for a fixed "foo" to be fast.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 4/8] grep: consistently use "p->fixed" in compile_regexp()
  2019-07-29  9:05       ` Ævar Arnfjörð Bjarmason
@ 2019-07-29  9:13         ` Ævar Arnfjörð Bjarmason
  2019-07-29 16:23           ` Junio C Hamano
  0 siblings, 1 reply; 52+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-29  9:13 UTC (permalink / raw)
  To: Carlo Arenas; +Cc: git, Junio C Hamano, Beat Bolli, Johannes Schindelin


On Mon, Jul 29 2019, Ævar Arnfjörð Bjarmason wrote:

> On Mon, Jul 29 2019, Carlo Arenas wrote:
>
>> On Fri, Jul 26, 2019 at 8:09 AM Ævar Arnfjörð Bjarmason
>> <avarab@gmail.com> wrote:
>>>
>>> It's less confusing to use that variable consistently that switch back
>>> & forth between the two.
>>>
>>> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
>>> ---
>>>  grep.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/grep.c b/grep.c
>>> index 9c2b259771..b94e998680 100644
>>> --- a/grep.c
>>> +++ b/grep.c
>>> @@ -616,7 +616,7 @@ static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
>>>                 die(_("given pattern contains NULL byte (via -f <file>). This is only supported with -P under PCRE v2"));
>>>
>>>         pat_is_fixed = is_fixed(p->pattern, p->patternlen);
>>> -       if (opt->fixed || pat_is_fixed) {
>>> +       if (p->fixed || pat_is_fixed) {
>>
>> at the end of this series we have:
>>
>>   if (p->fixed || p->is_fixed)
>>
>> which doesn't make sense; at least with opt->fixed it was clear that
>> what was meant is that grep was passed -P
>
> I assume you mean "was passed -F...".
>
>> maybe is_fixed shouldn't exist and fixed when applied to the pattern
>> means we had determined it was a fixed
>> pattern and overridden the user selection of engine.
>
> They're two flags because p->fixed is "--fixed-strings", and p->is_fixed
> is "there's no metachars here". So the former case needs escaping, as
> the code just below might do (the two aren't mutually exclusive).
>
> I don't get how you think we can always fold them into one flag, but
> maybe I'm missing something...
>
>> that at least will give us a logical way to fix the pattern reported
>> in [1] and that currently requires the user to know
>> git's grep internals and know he can skip the "is_fixed" optimization
>> by doing something like :
>>
>>   $ git grep 'foo[ ]bar'
>>
>> [1] https://public-inbox.org/git/20190728235427.41425-1-carenas@gmail.com/
>
> As I noted in a reply there this seems like a way to fix a bug in "next"
> with a config knob. Yes we should fix the bug, but we've had the kwset
> code in git for years without needing this distinction, so after we work
> out the bugs I don't see why we'd need this.
>
> The reason we ignore the user's choice here is because you might
> e.g. set grep.patternType=extended in your config, and you'd still want
> grepping for a fixed "foo" to be fast.

...and more generally, for any future sanity of implementation and
maintenance I think we should only make the promise that we support
certain syntax & semantics, not that the -F, -G, -E, -P options are
guaranteed to dispatch to a given codepath.

Internally we should be free to switch between those, so e.g. if a
pattern is fixed and you configure "basic" regexp, but we know your C
library is faster for those matches with REG_EXTENDED we should just
pass that regardless of -G or -E.

Of course that means we *must* expose the same semantics (to some
reasonable extent), which means I have a lot of bugs in "next" to
address.

I'm just saying that the presence of those bugs means we should be
inclined to fix them / back out certain changes, not work around them
with user-servicable knobs.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 0/8] grep: PCRE JIT fixes + ab/no-kwset fix
  2019-07-26 20:27     ` Junio C Hamano
@ 2019-07-29  9:20       ` Ævar Arnfjörð Bjarmason
  2019-07-29 16:12         ` Junio C Hamano
  0 siblings, 1 reply; 52+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-07-29  9:20 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Carlo Marcelo Arenas Belón, Beat Bolli, Johannes Schindelin


On Fri, Jul 26 2019, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:
>
>> 1-3 here are a re-roll on "next". I figured that was easier for
>> everyone with the state of the in-flight patches, it certainly was for
>> me. Sorry Junio if this creates a mess for you.
>
> As long as I can just apply all of them on top of no-kwset and keep
> it a single topic, it wouldn't be too much of a hassle.
>
>> 4-8 are a "fix" for the UTF-8 matching error noted in Carlo's "grep:
>> skip UTF8 checks explicitally" in
>> https://public-inbox.org/git/20190721183115.14985-1-carenas@gmail.com/
>>
>> As noted the bug isn't fully fixed until 8/8, and that patch relies on
>> unreleased PCRE v2 code. I'm hoping that with 7/8 we're in a good
>> enough state to limp forward as noted in the rationale of those
>> commits.
>
> Yikes.  Perhaps we should kick the no-kwset thing out of 'next' and
> start from scratch?  It does not sound that the world is ready yet.

I have some fix-for-the-fix and was going to submit a v3 of this series,
but I think the more responsible thing to do at this point, especially
with various patches from Carlo that need to be integrated in one way or
another, is to back it out until the outstanding issues are addressed.

If it's not too much trouble, would you mind reverting just the two
patches at the tip of ab/no-kwset in "next"? I.e.

    b65abcafc7 ("grep: use PCRE v2 for optimized fixed-string search", 2019-07-01)
    48de2a768c ("grep: remove the kwset optimization", 2019-07-01)

I believe the rest are all settled & haven't had any issues raised with
them, and those tests & preparatory fixes would be very useful to have
in "master" for any re-roll without needing to be distracted by those
changes.

> But that is just a knee-jerk reaction before reading the actual
> patches.  Let's see how they look ;-)
>
> Thanks.
>
>> Ævar Arnfjörð Bjarmason (8):
>>   grep: remove overly paranoid BUG(...) code
>>   grep: stop "using" a custom JIT stack with PCRE v2
>>   grep: stop using a custom JIT stack with PCRE v1
>>   grep: consistently use "p->fixed" in compile_regexp()
>>   grep: create a "is_fixed" member in "grep_pat"
>>   grep: stess test PCRE v2 on invalid UTF-8 data
>>   grep: do not enter PCRE2_UTF mode on fixed matching
>>   grep: optimistically use PCRE2_MATCH_INVALID_UTF
>>
>>  Makefile                        |  1 +
>>  grep.c                          | 68 +++++++++++----------------------
>>  grep.h                          | 13 ++-----
>>  t/helper/test-pcre2-config.c    | 12 ++++++
>>  t/helper/test-tool.c            |  1 +
>>  t/helper/test-tool.h            |  1 +
>>  t/t7812-grep-icase-non-ascii.sh | 39 +++++++++++++++++++
>>  7 files changed, 80 insertions(+), 55 deletions(-)
>>  create mode 100644 t/helper/test-pcre2-config.c

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 0/8] grep: PCRE JIT fixes + ab/no-kwset fix
  2019-07-29  9:20       ` Ævar Arnfjörð Bjarmason
@ 2019-07-29 16:12         ` Junio C Hamano
  0 siblings, 0 replies; 52+ messages in thread
From: Junio C Hamano @ 2019-07-29 16:12 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Carlo Marcelo Arenas Belón, Beat Bolli, Johannes Schindelin

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> I have some fix-for-the-fix and was going to submit a v3 of this series,
> but I think the more responsible thing to do at this point, especially
> with various patches from Carlo that need to be integrated in one way or
> another, is to back it out until the outstanding issues are addressed.
>
> If it's not too much trouble, would you mind reverting just the two
> patches at the tip of ab/no-kwset in "next"? I.e.
>
>     b65abcafc7 ("grep: use PCRE v2 for optimized fixed-string search", 2019-07-01)
>     48de2a768c ("grep: remove the kwset optimization", 2019-07-01)

As we'd be in pre-release freeze soonish, let me revert the merge to
'next' and drop these two tip commits from the topic branch (keeping
the earlier parts), and ask you two to work together to get the
pcre-jit stuff into a good shape.

Thanks.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 4/8] grep: consistently use "p->fixed" in compile_regexp()
  2019-07-29  9:13         ` Ævar Arnfjörð Bjarmason
@ 2019-07-29 16:23           ` Junio C Hamano
  0 siblings, 0 replies; 52+ messages in thread
From: Junio C Hamano @ 2019-07-29 16:23 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Carlo Arenas, git, Beat Bolli, Johannes Schindelin

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

>> The reason we ignore the user's choice here is because you might
>> e.g. set grep.patternType=extended in your config, and you'd still want
>> grepping for a fixed "foo" to be fast.
>
> ...and more generally, for any future sanity of implementation and
> maintenance I think we should only make the promise that we support
> certain syntax & semantics, not that the -F, -G, -E, -P options are
> guaranteed to dispatch to a given codepath.
>
> Internally we should be free to switch between those, so e.g. if a
> pattern is fixed and you configure "basic" regexp, but we know your C
> library is faster for those matches with REG_EXTENDED we should just
> pass that regardless of -G or -E.

That certainly is a very sensible goal.

> I'm just saying that the presence of those bugs means we should be
> inclined to fix them / back out certain changes, not work around them
> with user-servicable knobs.

Amen.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH] t7812: add missing redirects
  2019-07-26 15:08   ` [PATCH v2 6/8] grep: stess test PCRE v2 on invalid UTF-8 data Ævar Arnfjörð Bjarmason
  2019-07-26 20:34     ` Junio C Hamano
  2019-07-29  3:06     ` Carlo Arenas
@ 2019-11-26 21:50     ` Andreas Schwab
  2019-11-26 22:27       ` Johannes Schindelin
  2019-11-30  0:46       ` [PATCH] t7812: expect failure for grep -i with invalid UTF-8 data Todd Zullinger
  2 siblings, 2 replies; 52+ messages in thread
From: Andreas Schwab @ 2019-11-26 21:50 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Junio C Hamano, Carlo Marcelo Arenas Belón, Beat Bolli,
	Johannes Schindelin

Two tests in t7812 were missing redirects, failing to actually test the
produced output.

Fixes: 8a5999838e ("grep: stess test PCRE v2 on invalid UTF-8 data")
Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
---
 t/t7812-grep-icase-non-ascii.sh | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index 531eb59d57..c4528432e5 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -70,14 +70,14 @@ test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep ASCII from invalid UT
 test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from invalid UTF-8 data' '
 	git grep -h "æ" invalid-0x80 >actual &&
 	test_cmp expected actual &&
-	git grep -h "(*NO_JIT)æ" invalid-0x80 &&
+	git grep -h "(*NO_JIT)æ" invalid-0x80 >actual &&
 	test_cmp expected actual
 '
 
 test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from invalid UTF-8 data with -i' '
 	test_might_fail git grep -hi "Æ" invalid-0x80 >actual &&
 	test_cmp expected actual &&
-	test_must_fail git grep -hi "(*NO_JIT)Æ" invalid-0x80 &&
+	test_must_fail git grep -hi "(*NO_JIT)Æ" invalid-0x80 >actual &&
 	test_cmp expected actual
 '
 
-- 
2.24.0


-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] t7812: add missing redirects
  2019-11-26 21:50     ` [PATCH] t7812: add missing redirects Andreas Schwab
@ 2019-11-26 22:27       ` Johannes Schindelin
  2019-11-26 23:11         ` Andreas Schwab
  2019-11-30  0:46       ` [PATCH] t7812: expect failure for grep -i with invalid UTF-8 data Todd Zullinger
  1 sibling, 1 reply; 52+ messages in thread
From: Johannes Schindelin @ 2019-11-26 22:27 UTC (permalink / raw)
  To: Andreas Schwab
  Cc: Ævar Arnfjörð Bjarmason, git, Junio C Hamano,
	Carlo Marcelo Arenas Belón, Beat Bolli

[-- Attachment #1: Type: text/plain, Size: 1814 bytes --]

Hi,


On Tue, 26 Nov 2019, Andreas Schwab wrote:

> Two tests in t7812 were missing redirects, failing to actually test the
> produced output.
>
> Fixes: 8a5999838e ("grep: stess test PCRE v2 on invalid UTF-8 data")

Apart from this line which is cuddled with a real footer (but is no
footer, and the commit reference is not in the recommended format either
because it lacks the date), this patch looks fine to me.

Ciao,
Johannes

> Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>
> ---
>  t/t7812-grep-icase-non-ascii.sh | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
> index 531eb59d57..c4528432e5 100755
> --- a/t/t7812-grep-icase-non-ascii.sh
> +++ b/t/t7812-grep-icase-non-ascii.sh
> @@ -70,14 +70,14 @@ test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep ASCII from invalid UT
>  test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from invalid UTF-8 data' '
>  	git grep -h "æ" invalid-0x80 >actual &&
>  	test_cmp expected actual &&
> -	git grep -h "(*NO_JIT)æ" invalid-0x80 &&
> +	git grep -h "(*NO_JIT)æ" invalid-0x80 >actual &&
>  	test_cmp expected actual
>  '
>
>  test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from invalid UTF-8 data with -i' '
>  	test_might_fail git grep -hi "Æ" invalid-0x80 >actual &&
>  	test_cmp expected actual &&
> -	test_must_fail git grep -hi "(*NO_JIT)Æ" invalid-0x80 &&
> +	test_must_fail git grep -hi "(*NO_JIT)Æ" invalid-0x80 >actual &&
>  	test_cmp expected actual
>  '
>
> --
> 2.24.0
>
>
> --
> Andreas Schwab, schwab@linux-m68k.org
> GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
> "And now for something completely different."
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] t7812: add missing redirects
  2019-11-26 22:27       ` Johannes Schindelin
@ 2019-11-26 23:11         ` Andreas Schwab
  2019-11-27 11:58           ` Jeff King
  0 siblings, 1 reply; 52+ messages in thread
From: Andreas Schwab @ 2019-11-26 23:11 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Ævar Arnfjörð Bjarmason, git, Junio C Hamano,
	Carlo Marcelo Arenas Belón, Beat Bolli

On Nov 26 2019, Johannes Schindelin wrote:

> footer, and the commit reference is not in the recommended format either
> because it lacks the date),

Where is that documented?

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] t7812: add missing redirects
  2019-11-26 23:11         ` Andreas Schwab
@ 2019-11-27 11:58           ` Jeff King
  0 siblings, 0 replies; 52+ messages in thread
From: Jeff King @ 2019-11-27 11:58 UTC (permalink / raw)
  To: Andreas Schwab
  Cc: Johannes Schindelin, Ævar Arnfjörð Bjarmason, git,
	Junio C Hamano, Carlo Marcelo Arenas Belón, Beat Bolli

On Wed, Nov 27, 2019 at 12:11:49AM +0100, Andreas Schwab wrote:

> On Nov 26 2019, Johannes Schindelin wrote:
> 
> > footer, and the commit reference is not in the recommended format either
> > because it lacks the date),
> 
> Where is that documented?

It's mentioned as the preferred way to reference commits in
SubmittingPatches (search for %ad).

But I don't see why it is "not a footer". The "Fixes:" key conforms to
the trailer syntax, and the value of a trailer is free-form. Running:

  git log --format='%(trailers:key=Fixes)'

shows that Git is happy with it. And indeed, a few other people have
used it before you. None of them with a date. ;)

-Peff

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH] t7812: expect failure for grep -i with invalid UTF-8 data
  2019-11-26 21:50     ` [PATCH] t7812: add missing redirects Andreas Schwab
  2019-11-26 22:27       ` Johannes Schindelin
@ 2019-11-30  0:46       ` Todd Zullinger
  2019-11-30  8:00         ` Andreas Schwab
  1 sibling, 1 reply; 52+ messages in thread
From: Todd Zullinger @ 2019-11-30  0:46 UTC (permalink / raw)
  To: Andreas Schwab
  Cc: Ævar Arnfjörð Bjarmason, git, Junio C Hamano,
	Carlo Marcelo Arenas Belón, Beat Bolli, Johannes Schindelin,
	Jeff King

When the 'grep with invalid UTF-8 data' tests were added/adjusted in
8a5999838e (grep: stess test PCRE v2 on invalid UTF-8 data, 2019-07-26)
and 870eea8166 (grep: do not enter PCRE2_UTF mode on fixed matching,
2019-07-26) they lacked a redirect which caused them to falsely succeed
on most architectures.  They failed on big-endian arches where the test
never reached the portion which was missing the redirect.

A recent patch add the missing redirect and exposed the fact that the
'PCRE v2: grep non-ASCII from invalid UTF-8 data with -i' test fails on
all architectures.

Based on the final paragraph in in 870eea8166:

    When grepping a non-ASCII fixed string. This is a more general problem
    that's hard to fix, but we can at least fix the most common case of
    grepping for a fixed string without "-i". I can't think of a reason
    for why we'd turn on PCRE2_UTF when matching byte-for-byte like that.

it seems that we don't expect that the case-insensitive grep will
succeed.  Adjust the test to reflect that expectation.

Signed-off-by: Todd Zullinger <tmz@pobox.com>
---

Hi,

Andreas Schwab wrote:
> Two tests in t7812 were missing redirects, failing to actually test the
> produced output.
> 
> Fixes: 8a5999838e ("grep: stess test PCRE v2 on invalid UTF-8 data")
> Signed-off-by: Andreas Schwab <schwab@linux-m68k.org>

Nice catch on these missing redirects.  While testing the
2.24.0 rc's this test failed only on big-endian arches (like
s390x in Fedora).  It was briefly mentioned at the time in
<20191020002648.GZ10893@pobox.com>.

After applying the fix to add the missing redirects, the
test now fails on all architectures.  It seems like that is
the intended state and we simply need to adjust the final
test with `grep -i` accordingly.

Of course, it's possible that I've misunderstood Ævar's
intentions in 870eea8166 (grep: do not enter PCRE2_UTF mode
on fixed matching, 2019-07-26) or otherwise have poorly
explained the reasoning in my commit message.

We don't appear to test PCRE (neither v1 nor v2) in our CI
builds.  I added libpcre2-dev and set USE_LIBPCRE to test
this in travis.  (Whether we want to enable that by default
is worth discussing.)

Doing so confirms that the tests (incorrectly) pass without
the missing redirects and fail when the redirects are added.
Inverting the expectation of test_cmp in the final `grep -i`
test results in a successful test run.

with PCRE2:
https://travis-ci.org/tmzullinger/git/jobs/618733182

with PCRE2 and 'add missing redirects':
https://travis-ci.org/tmzullinger/git/jobs/618723277

with PCRE2, 'add missing redirects' and this patch:
https://travis-ci.org/tmzullinger/git/jobs/618796963

This still doesn't fix the failure on s390x.  There

    test_might_fail git grep -hi "Æ" invalid-0x80 >actual

fails and leaves nothing in 'actual' which causes the test
to fail at the following

    test_cmp expected actual

line.  If we suspect the test might fail, it seems like we
need to account for that and not expect that 'expect' will
match 'actual' as we currently do.

 t/t7812-grep-icase-non-ascii.sh | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index c4528432e5..03dba6685a 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -76,9 +76,12 @@ test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from invali
 
 test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-ASCII from invalid UTF-8 data with -i' '
 	test_might_fail git grep -hi "Æ" invalid-0x80 >actual &&
-	test_cmp expected actual &&
+	if test -s actual
+	then
+	    test_cmp expected actual
+	fi &&
 	test_must_fail git grep -hi "(*NO_JIT)Æ" invalid-0x80 >actual &&
-	test_cmp expected actual
+	! test_cmp expected actual
 '
 
 test_done
-- 
2.24.0


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] t7812: expect failure for grep -i with invalid UTF-8 data
  2019-11-30  0:46       ` [PATCH] t7812: expect failure for grep -i with invalid UTF-8 data Todd Zullinger
@ 2019-11-30  8:00         ` Andreas Schwab
  2019-12-01 16:33           ` Junio C Hamano
  0 siblings, 1 reply; 52+ messages in thread
From: Andreas Schwab @ 2019-11-30  8:00 UTC (permalink / raw)
  To: Todd Zullinger
  Cc: Ævar Arnfjörð Bjarmason, git, Junio C Hamano,
	Carlo Marcelo Arenas Belón, Beat Bolli, Johannes Schindelin,
	Jeff King

On Nov 29 2019, Todd Zullinger wrote:

> When the 'grep with invalid UTF-8 data' tests were added/adjusted in
> 8a5999838e (grep: stess test PCRE v2 on invalid UTF-8 data, 2019-07-26)
> and 870eea8166 (grep: do not enter PCRE2_UTF mode on fixed matching,
> 2019-07-26) they lacked a redirect which caused them to falsely succeed
> on most architectures.  They failed on big-endian arches where the test
> never reached the portion which was missing the redirect.

It's not about big vs little endian, it's only about JIT vs non-JIT.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] t7812: expect failure for grep -i with invalid UTF-8 data
  2019-11-30  8:00         ` Andreas Schwab
@ 2019-12-01 16:33           ` Junio C Hamano
  2019-12-01 17:09             ` Andreas Schwab
  2019-12-01 18:32             ` Todd Zullinger
  0 siblings, 2 replies; 52+ messages in thread
From: Junio C Hamano @ 2019-12-01 16:33 UTC (permalink / raw)
  To: Andreas Schwab
  Cc: Todd Zullinger, Ævar Arnfjörð Bjarmason, git,
	Carlo Marcelo Arenas Belón, Beat Bolli, Johannes Schindelin,
	Jeff King

Andreas Schwab <schwab@linux-m68k.org> writes:

> On Nov 29 2019, Todd Zullinger wrote:
>
>> When the 'grep with invalid UTF-8 data' tests were added/adjusted in
>> 8a5999838e (grep: stess test PCRE v2 on invalid UTF-8 data, 2019-07-26)
>> and 870eea8166 (grep: do not enter PCRE2_UTF mode on fixed matching,
>> 2019-07-26) they lacked a redirect which caused them to falsely succeed
>> on most architectures.  They failed on big-endian arches where the test
>> never reached the portion which was missing the redirect.
>
> It's not about big vs little endian, it's only about JIT vs non-JIT.

So, which one of JIT / non-JIT sides did the test fail unexpectedly?

Should I do s/on big-endian arches/with PCRE with JIT disabled/
while queuing the patch?

Thanks.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] t7812: expect failure for grep -i with invalid UTF-8 data
  2019-12-01 16:33           ` Junio C Hamano
@ 2019-12-01 17:09             ` Andreas Schwab
  2019-12-01 18:32             ` Todd Zullinger
  1 sibling, 0 replies; 52+ messages in thread
From: Andreas Schwab @ 2019-12-01 17:09 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Todd Zullinger, Ævar Arnfjörð Bjarmason, git,
	Carlo Marcelo Arenas Belón, Beat Bolli, Johannes Schindelin,
	Jeff King

On Dez 01 2019, Junio C Hamano wrote:

> So, which one of JIT / non-JIT sides did the test fail unexpectedly?

The non-JIT was the one that wasn't actually tested.

> Should I do s/on big-endian arches/with PCRE with JIT disabled/
> while queuing the patch?

Right.

Andreas.

-- 
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] t7812: expect failure for grep -i with invalid UTF-8 data
  2019-12-01 16:33           ` Junio C Hamano
  2019-12-01 17:09             ` Andreas Schwab
@ 2019-12-01 18:32             ` Todd Zullinger
  2019-12-02  6:13               ` Junio C Hamano
  1 sibling, 1 reply; 52+ messages in thread
From: Todd Zullinger @ 2019-12-01 18:32 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Andreas Schwab, Ævar Arnfjörð Bjarmason, git,
	Carlo Marcelo Arenas Belón, Beat Bolli, Johannes Schindelin,
	Jeff King

Hi,

Junio C Hamano wrote:
> Andreas Schwab <schwab@linux-m68k.org> writes:
> 
>> On Nov 29 2019, Todd Zullinger wrote:
>>
>>> When the 'grep with invalid UTF-8 data' tests were added/adjusted in
>>> 8a5999838e (grep: stess test PCRE v2 on invalid UTF-8 data, 2019-07-26)
>>> and 870eea8166 (grep: do not enter PCRE2_UTF mode on fixed matching,
>>> 2019-07-26) they lacked a redirect which caused them to falsely succeed
>>> on most architectures.  They failed on big-endian arches where the test
>>> never reached the portion which was missing the redirect.
>>
>> It's not about big vs little endian, it's only about JIT vs non-JIT.
> 
> So, which one of JIT / non-JIT sides did the test fail unexpectedly?

On s390x, the initial:

    test_might_fail git grep -hi "Æ" invalid-0x80 >actual

fails to produce any output in actual, but since we use
test_might_fail, the test happily continues to:

    test_cmp expected actual

which fails.

The test output from and s390x build:

    expecting success of 7812.11 'PCRE v2: grep non-ASCII from invalid UTF-8 data with -i':
        test_might_fail git grep -hi "Æ" invalid-0x80 >actual &&
        test_cmp expected actual &&
        test_must_fail git grep -hi "(*NO_JIT)Æ" invalid-0x80 &&
        test_cmp expected actual
    ++ test_might_fail git grep -hi Æ invalid-0x80
    ++ test_must_fail ok=success git grep -hi Æ invalid-0x80
    ++ case "$1" in
    ++ _test_ok=success
    ++ shift
    ++ git grep -hi Æ invalid-0x80
    fatal: pcre2_match failed with error code -22: UTF-8 error: isolated byte with 0x80 bit set
    ++ exit_code=128
    ++ test 128 -eq 0
    ++ test_match_signal 13 128
    ++ test 128 = 141
    ++ test 128 = 269
    ++ return 1
    ++ test 128 -gt 129
    ++ test 128 -eq 127
    ++ test 128 -eq 126
    ++ return 0
    ++ test_cmp expected actual
    ++ diff -u expected actual
    --- expected    2019-10-19 21:56:08.634252012 +0000
    +++ actual      2019-10-19 21:56:08.714252012 +0000
    @@ -1 +0,0 @@
    -ævar
    error: last command exited with $?=1
    not ok 11 - PCRE v2: grep non-ASCII from invalid UTF-8 data with -i
    #
    #               test_might_fail git grep -hi "Æ" invalid-0x80 >actual &&
    #               test_cmp expected actual &&
    #               test_must_fail git grep -hi "(*NO_JIT)Æ" invalid-0x80 &&
    #               test_cmp expected actual
    #
    # failed 1 among 11 test(s)

After Andreas' missing redirect fix, that still fails on
s390x (not surprisingly).  But now systems with JIT enabled
fail at:

    test_must_fail git grep -hi "(*NO_JIT)Æ" invalid-0x80 >actual &&
    test_cmp expected actual

Though we say that the command must fail, so we shouldn't be
surprised that 'expect' and 'actual' don't match.  It would
be more surprising if they did. :)

> Should I do s/on big-endian arches/with PCRE with JIT disabled/
> while queuing the patch?

Here's how I changed the commit message locally.  I was
going to wait a day or so for any other feedback on the
actual test changes, being a holiday weekend in the US (and
more generally a weekend).

1:  d9aeaf0c98 ! 1:  d0c083db78 t7812: expect failure for grep -i with invalid UTF-8 data
    @@ Commit message
         8a5999838e (grep: stess test PCRE v2 on invalid UTF-8 data, 2019-07-26)
         and 870eea8166 (grep: do not enter PCRE2_UTF mode on fixed matching,
         2019-07-26) they lacked a redirect which caused them to falsely succeed
    -    on most architectures.  They failed on big-endian arches where the test
    -    never reached the portion which was missing the redirect.
    +    on most systems.  The 'grep -i' test failed on systems where JIT was
    +    disabled as it never reached the portion which was missing the redirect.
     
    -    A recent patch add the missing redirect and exposed the fact that the
    -    'PCRE v2: grep non-ASCII from invalid UTF-8 data with -i' test fails on
    -    all architectures.
    +    A recent patch added the missing redirect and exposed the fact that the
    +    'PCRE v2: grep non-ASCII from invalid UTF-8 data with -i' test fails
    +    regardless of whether JIT is enabled.
     
         Based on the final paragraph in in 870eea8166:

Thanks for pointing out the proper reasoning to use in the
commit message Andreas.  I hadn't looked at the Fedora pcre2
package to see that it explicitly disables JIT on s390x.

I'm not sure if s390x is supported upstream or not -- it
doesn't appear to have a specific entry in the sljit config
header¹, so it seems likely it's not well-tested at the
least.  (Not that any of that is our concern here.)

¹ https://github.com/zherczeg/sljit/blob/master/sljit_src/sljitConfigInternal.h

Thanks for the follow-up Junio.

-- 
Todd

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH] t7812: expect failure for grep -i with invalid UTF-8 data
  2019-12-01 18:32             ` Todd Zullinger
@ 2019-12-02  6:13               ` Junio C Hamano
  0 siblings, 0 replies; 52+ messages in thread
From: Junio C Hamano @ 2019-12-02  6:13 UTC (permalink / raw)
  To: Todd Zullinger
  Cc: Andreas Schwab, Ævar Arnfjörð Bjarmason, git,
	Carlo Marcelo Arenas Belón, Beat Bolli, Johannes Schindelin,
	Jeff King

Todd Zullinger <tmz@pobox.com> writes:

> After Andreas' missing redirect fix, that still fails on
> s390x (not surprisingly).  But now systems with JIT enabled
> fail at:
> ...
> Here's how I changed the commit message locally.  I was
> going to wait a day or so for any other feedback on the
> actual test changes, being a holiday weekend in the US (and
> more generally a weekend).
>
> 1:  d9aeaf0c98 ! 1:  d0c083db78 t7812: expect failure for grep -i with invalid UTF-8 data
>     @@ Commit message
>          8a5999838e (grep: stess test PCRE v2 on invalid UTF-8 data, 2019-07-26)
>          and 870eea8166 (grep: do not enter PCRE2_UTF mode on fixed matching,
>          2019-07-26) they lacked a redirect which caused them to falsely succeed
>     -    on most architectures.  They failed on big-endian arches where the test
>     -    never reached the portion which was missing the redirect.
>     +    on most systems.  The 'grep -i' test failed on systems where JIT was
>     +    disabled as it never reached the portion which was missing the redirect.
>      
>     -    A recent patch add the missing redirect and exposed the fact that the
>     -    'PCRE v2: grep non-ASCII from invalid UTF-8 data with -i' test fails on
>     -    all architectures.
>     +    A recent patch added the missing redirect and exposed the fact that the
>     +    'PCRE v2: grep non-ASCII from invalid UTF-8 data with -i' test fails
>     +    regardless of whether JIT is enabled.
>      
>          Based on the final paragraph in in 870eea8166:
>
> Thanks for pointing out the proper reasoning to use in the
> commit message Andreas.  I hadn't looked at the Fedora pcre2
> package to see that it explicitly disables JIT on s390x.

OK.  I locally edited the log message to match the above.  I guess
this forms an integral part of a topic that Andreas started with the
"missing redirects" fix, so let me queue your patch directly on the
same topic branch, without creating a separate one.

Thanks, both.

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, back to index

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-07-21 19:40 [PATCH] grep: use custom JIT stack with pcre2 Carlo Marcelo Arenas Belón
2019-07-24 15:14 ` [PATCH 0/3] grep: PCRE JIT fixes Ævar Arnfjörð Bjarmason
2019-07-24 16:18   ` Junio C Hamano
2019-07-24 20:03     ` Ævar Arnfjörð Bjarmason
2019-07-26 15:08   ` [PATCH v2 0/8] grep: PCRE JIT fixes + ab/no-kwset fix Ævar Arnfjörð Bjarmason
2019-07-26 20:27     ` Junio C Hamano
2019-07-29  9:20       ` Ævar Arnfjörð Bjarmason
2019-07-29 16:12         ` Junio C Hamano
2019-07-26 15:08   ` [PATCH v2 1/8] grep: remove overly paranoid BUG(...) code Ævar Arnfjörð Bjarmason
2019-07-26 15:08   ` [PATCH v2 2/8] grep: stop "using" a custom JIT stack with PCRE v2 Ævar Arnfjörð Bjarmason
2019-07-29  0:33     ` Carlo Arenas
2019-07-26 15:08   ` [PATCH v2 3/8] grep: stop using a custom JIT stack with PCRE v1 Ævar Arnfjörð Bjarmason
2019-07-29  1:26     ` Carlo Arenas
2019-07-26 15:08   ` [PATCH v2 4/8] grep: consistently use "p->fixed" in compile_regexp() Ævar Arnfjörð Bjarmason
2019-07-29  1:48     ` Carlo Arenas
2019-07-29  9:05       ` Ævar Arnfjörð Bjarmason
2019-07-29  9:13         ` Ævar Arnfjörð Bjarmason
2019-07-29 16:23           ` Junio C Hamano
2019-07-26 15:08   ` [PATCH v2 5/8] grep: create a "is_fixed" member in "grep_pat" Ævar Arnfjörð Bjarmason
2019-07-26 15:08   ` [PATCH v2 6/8] grep: stess test PCRE v2 on invalid UTF-8 data Ævar Arnfjörð Bjarmason
2019-07-26 20:34     ` Junio C Hamano
2019-07-26 21:55       ` Ævar Arnfjörð Bjarmason
2019-07-29  3:06     ` Carlo Arenas
2019-11-26 21:50     ` [PATCH] t7812: add missing redirects Andreas Schwab
2019-11-26 22:27       ` Johannes Schindelin
2019-11-26 23:11         ` Andreas Schwab
2019-11-27 11:58           ` Jeff King
2019-11-30  0:46       ` [PATCH] t7812: expect failure for grep -i with invalid UTF-8 data Todd Zullinger
2019-11-30  8:00         ` Andreas Schwab
2019-12-01 16:33           ` Junio C Hamano
2019-12-01 17:09             ` Andreas Schwab
2019-12-01 18:32             ` Todd Zullinger
2019-12-02  6:13               ` Junio C Hamano
2019-07-26 15:08   ` [PATCH v2 7/8] grep: do not enter PCRE2_UTF mode on fixed matching Ævar Arnfjörð Bjarmason
2019-07-26 20:36     ` Junio C Hamano
2019-07-26 15:08   ` [PATCH v2 8/8] grep: optimistically use PCRE2_MATCH_INVALID_UTF Ævar Arnfjörð Bjarmason
2019-07-26 21:07     ` Junio C Hamano
2019-07-26 21:53       ` Ævar Arnfjörð Bjarmason
2019-07-26 21:57         ` Ævar Arnfjörð Bjarmason
2019-07-24 15:14 ` [PATCH 1/3] grep: remove overly paranoid BUG(...) code Ævar Arnfjörð Bjarmason
2019-07-24 15:14 ` [PATCH 2/3] grep: stop "using" a custom JIT stack with PCRE v2 Ævar Arnfjörð Bjarmason
2019-07-24 16:24   ` Junio C Hamano
2019-07-24 20:06     ` Ævar Arnfjörð Bjarmason
2019-07-25  5:11       ` Carlo Arenas
2019-07-24 15:14 ` [PATCH 3/3] grep: stop using a custom JIT stack with PCRE v1 Ævar Arnfjörð Bjarmason
2019-07-26 13:15   ` Carlo Arenas
2019-07-26 13:50     ` Ævar Arnfjörð Bjarmason
2019-07-26 14:12       ` Carlo Arenas
2019-07-26 14:43         ` Ævar Arnfjörð Bjarmason
2019-07-26 20:26           ` [RFC PATCH 0/2] PCRE1 cleanup Carlo Marcelo Arenas Belón
2019-07-26 20:26             ` [RFC PATCH 1/2] grep: make sure NO_LIBPCRE1_JIT disable JIT in PCRE1 Carlo Marcelo Arenas Belón
2019-07-26 20:26             ` [RFC PATCH 2/2] grep: refactor and simplify PCRE1 support Carlo Marcelo Arenas Belón

git@vger.kernel.org list mirror (unofficial, one of many)

Archives are clonable:
	git clone --mirror https://public-inbox.org/git
	git clone --mirror http://ou63pmih66umazou.onion/git
	git clone --mirror http://czquwvybam4bgbro.onion/git
	git clone --mirror http://hjrcffqmbrq6wope.onion/git

Example config snippet for mirrors

Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.version-control.git
	nntp://ou63pmih66umazou.onion/inbox.comp.version-control.git
	nntp://czquwvybam4bgbro.onion/inbox.comp.version-control.git
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.version-control.git
	nntp://news.gmane.org/gmane.comp.version-control.git

 note: .onion URLs require Tor: https://www.torproject.org/

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git