git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* improve performance of PCRE2 bug 2642 bug workaround
@ 2022-03-22 16:38 Paul Eggert
  2022-03-22 20:26 ` René Scharfe
  0 siblings, 1 reply; 7+ messages in thread
From: Paul Eggert @ 2022-03-22 16:38 UTC (permalink / raw)
  To: git; +Cc: Carlo Arenas

[-- Attachment #1: Type: text/plain, Size: 805 bytes --]

Today, Carlo Arenas pointed out[1] that GNU grep didn't work around 
PCRE2 bug 2642, which Git grep has a workaround for. While installing a 
GNU grep patch to fix this[2] I noticed that Git's workaround appears to 
be too pessimistic: on older PCRE2 libraries Git grep sets 
PCRE2_NO_START_OPTIMIZE even when PCRE2_CASELESS is not set.

Attached is a patch to Git that I just now cobbled up and have not even 
compiled, much less tested. Please feel free to ignore it, as it would 
merely improve performance on older, buggy PCRE2 libraries and that 
might not be worth your trouble. I'm sending this email as more of a 
thank-you for letting us know indirectly of the PCRE2 bug.

[1]: https://lists.gnu.org/r/grep-devel/2022-03/msg00004.html
[2]: https://lists.gnu.org/r/grep-devel/2022-03/msg00005.html

[-- Attachment #2: git-grep.diff --]
[-- Type: text/x-patch, Size: 559 bytes --]

diff --git a/grep.c b/grep.c
index 82eb7da102..b9553ec9f5 100644
--- a/grep.c
+++ b/grep.c
@@ -297,7 +297,9 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 
 #ifndef GIT_PCRE2_VERSION_10_36_OR_HIGHER
 	/* Work around https://bugs.exim.org/show_bug.cgi?id=2642 fixed in 10.36 */
-	if (PCRE2_MATCH_INVALID_UTF && options & (PCRE2_UTF | PCRE2_CASELESS))
+	if (PCRE2_MATCH_INVALID_UTF &&
+	    ((options & (PCRE2_UTF | PCRE2_CASELESS)) ==
+	     (PCRE2_UTF | PCRE2_CASELESS)))
 		options |= PCRE2_NO_START_OPTIMIZE;
 #endif
 

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: improve performance of PCRE2 bug 2642 bug workaround
  2022-03-22 16:38 improve performance of PCRE2 bug 2642 bug workaround Paul Eggert
@ 2022-03-22 20:26 ` René Scharfe
  2022-03-22 21:12   ` Paul Eggert
  2022-03-23  1:09   ` Carlo Marcelo Arenas Belón
  0 siblings, 2 replies; 7+ messages in thread
From: René Scharfe @ 2022-03-22 20:26 UTC (permalink / raw)
  To: Paul Eggert; +Cc: Carlo Arenas, git, Ævar Arnfjörð Bjarmason

Am 22.03.22 um 17:38 schrieb Paul Eggert:
> Today, Carlo Arenas pointed out[1] that GNU grep didn't work around
> PCRE2 bug 2642, which Git grep has a workaround for. While installing
> a GNU grep patch to fix this[2] I noticed that Git's workaround
> appears to be too pessimistic: on older PCRE2 libraries Git grep sets
> PCRE2_NO_START_OPTIMIZE even when PCRE2_CASELESS is not set.
>
> Attached is a patch to Git that I just now cobbled up and have not
> even compiled, much less tested. Please feel free to ignore it, as it
> would merely improve performance on older, buggy PCRE2 libraries and
> that might not be worth your trouble. I'm sending this email as more
> of a thank-you for letting us know indirectly of the PCRE2 bug.
>
> [1]: https://lists.gnu.org/r/grep-devel/2022-03/msg00004.html
> [2]: https://lists.gnu.org/r/grep-devel/2022-03/msg00005.html

Interesting.  So you say bug 2642 [3] requires the flag PCRE2_CASELESS
(i.e. --ignore-case) to be triggered.  (That's probably documented in
Bugzilla, but I'm not authorized to access it.)

However, the looser check works around another bug, if only by accident.
I believe it was fixed upstream by [4].  That other bug was discussed in
the thread Carlo linked to, which started at [5].  You should be able to
reproduce it with something like this (search for leading white-space in
a Unicode haystack):

  $ echo ' Halló' | grep -P '^\s'

An affected version of PCRE2 would loop forever.

However, I can only test any of that with CI jobs, not locally, so
please take my findings with a heap of salt.

René


[3] https://bugs.exim.org/show_bug.cgi?id=2642
[4] https://github.com/PhilipHazel/pcre2/commit/e0c6029
[5] https://lore.kernel.org/git/20220129172542.GB2581@szeder.dev/

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: improve performance of PCRE2 bug 2642 bug workaround
  2022-03-22 20:26 ` René Scharfe
@ 2022-03-22 21:12   ` Paul Eggert
  2022-03-23  1:09   ` Carlo Marcelo Arenas Belón
  1 sibling, 0 replies; 7+ messages in thread
From: Paul Eggert @ 2022-03-22 21:12 UTC (permalink / raw)
  To: René Scharfe
  Cc: Carlo Arenas, git, Ævar Arnfjörð Bjarmason,
	GNU grep developers

[-- Attachment #1: Type: text/plain, Size: 263 bytes --]

On 3/22/22 13:26, René Scharfe wrote:

> However, the looser check works around another bug, if only by accident.

Thanks for letting me know. In that case, GNU grep should use a looser 
check too, like Git grep does. I installed the attached into GNU grep.

[-- Attachment #2: 0001-grep-work-around-another-potential-PCRE2-bug.patch --]
[-- Type: text/x-patch, Size: 1678 bytes --]

From ff2d24b08223e6c7b704a91127bac4391a9b8adb Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert@cs.ucla.edu>
Date: Tue, 22 Mar 2022 14:09:05 -0700
Subject: [PATCH] grep: work around another potential PCRE2 bug
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Potential problem reported by René Scharfe in:
https://lore.kernel.org/git/99b0adb6-26ba-293c-3a8f-679f59e7cb4d@web.de/T
* src/pcresearch.c (Pcompile): Mimic git grep’s workarounds
for PCRE2 bugs more closely; this is more conservative.
---
 src/pcresearch.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/src/pcresearch.c b/src/pcresearch.c
index 0cf804d..6947838 100644
--- a/src/pcresearch.c
+++ b/src/pcresearch.c
@@ -154,15 +154,16 @@ Pcompile (char *pattern, idx_t size, reg_syntax_t ignored, bool exact)
 #ifdef PCRE2_MATCH_INVALID_UTF
       /* Consider invalid UTF-8 as a barrier, instead of error.  */
       flags |= PCRE2_MATCH_INVALID_UTF;
-
-# if ! (10 < PCRE2_MAJOR + (36 <= PCRE2_MINOR))
-      /* Work around PCRE2 bug 2642.  */
-      if (flags & PCRE2_CASELESS)
-        flags |= PCRE2_NO_START_OPTIMIZE;
-# endif
 #endif
     }
 
+#if defined PCRE2_MATCH_INVALID_UTF && !(10 < PCRE2_MAJOR + (36 <= PCRE2_MINOR))
+  /* Work around PCRE2 bug 2642, and another bug reportedly fixed in
+     PCRE2 commit e0c6029a62db9c2161941ecdf459205382d4d379.  */
+  if (flags & (PCRE2_UTF | PCRE2_CASELESS))
+    flags |= PCRE2_NO_START_OPTIMIZE;
+#endif
+
   /* FIXME: Remove this restriction.  */
   if (rawmemchr (pattern, '\n') != patlim)
     die (EXIT_TROUBLE, 0, _("the -P option only supports a single pattern"));
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: improve performance of PCRE2 bug 2642 bug workaround
  2022-03-22 20:26 ` René Scharfe
  2022-03-22 21:12   ` Paul Eggert
@ 2022-03-23  1:09   ` Carlo Marcelo Arenas Belón
  2022-03-23  4:06     ` Paul Eggert
  2022-03-23 18:37     ` René Scharfe
  1 sibling, 2 replies; 7+ messages in thread
From: Carlo Marcelo Arenas Belón @ 2022-03-23  1:09 UTC (permalink / raw)
  To: René Scharfe
  Cc: Paul Eggert, git, Ævar Arnfjörð Bjarmason

On Tue, Mar 22, 2022 at 09:26:10PM +0100, René Scharfe wrote:
> Am 22.03.22 um 17:38 schrieb Paul Eggert:
> > Today, Carlo Arenas pointed out[1] that GNU grep didn't work around
> > PCRE2 bug 2642, which Git grep has a workaround for. While installing
> > a GNU grep patch to fix this[2] I noticed that Git's workaround
> > appears to be too pessimistic: on older PCRE2 libraries Git grep sets
> > PCRE2_NO_START_OPTIMIZE even when PCRE2_CASELESS is not set.
> 
> Interesting.  So you say bug 2642 [3] requires the flag PCRE2_CASELESS
> (i.e. --ignore-case) to be triggered.  (That's probably documented in
> Bugzilla, but I'm not authorized to access it.)

AFAIK the contents of the bugzilla are no longer accessible to anyone (lost in the migration of PCRE2 to github), but the use of PCRE2_CASELESS introduced in 95ca1f987e (grep/pcre2: better support invalid UTF-8 haystacks, 2021-01-24) might had been a mistake all along.

the bug will trigger when both PCRE2_UTF and PCRE2_MULTILINE are set (as shown in the PCRE2 regression added), with the later set by default in git and NEVER set in GNU grep, hence why I later retracted[6] my suggestion to add the workaround to grep, and suggest updating git with the following

Carlo

[6] https://lists.gnu.org/r/grep-devel/2022-03/msg00006.html
--- >8 ---
Subject: [PATCH] grep: remove check for case sensitivity in workaround for
 PCRE's bug2642

95ca1f987e (grep/pcre2: better support invalid UTF-8 haystacks, 2021-01-24)
add a workaround to an old PCRE2 bug, but includes in the logic a partial
check for case sensitivity without explanation.

Remove it so that the workaround (and its performance impact) will be only
triggered when needed (both PCRE2_MULTILINE and PCRE2_UTF and JIT is used)

Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
---
 grep.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/grep.c b/grep.c
index 82eb7da102..d910836569 100644
--- a/grep.c
+++ b/grep.c
@@ -296,8 +296,8 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
 
 #ifndef GIT_PCRE2_VERSION_10_36_OR_HIGHER
-	/* Work around https://bugs.exim.org/show_bug.cgi?id=2642 fixed in 10.36 */
-	if (PCRE2_MATCH_INVALID_UTF && options & (PCRE2_UTF | PCRE2_CASELESS))
+	/* Work around PCRE2 bug2642 fixed in 10.36 */
+	if (SUPPORT_JIT && PCRE2_MATCH_INVALID_UTF && (options & PCRE2_UTF))
 		options |= PCRE2_NO_START_OPTIMIZE;
 #endif
 
-- 
2.35.1.505.g27486cd1b2d


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: improve performance of PCRE2 bug 2642 bug workaround
  2022-03-23  1:09   ` Carlo Marcelo Arenas Belón
@ 2022-03-23  4:06     ` Paul Eggert
  2022-03-23 18:37     ` René Scharfe
  1 sibling, 0 replies; 7+ messages in thread
From: Paul Eggert @ 2022-03-23  4:06 UTC (permalink / raw)
  To: Carlo Marcelo Arenas Belón, René Scharfe
  Cc: git, Ævar Arnfjörð Bjarmason, GNU grep developers

[-- Attachment #1: Type: text/plain, Size: 533 bytes --]

On 3/22/22 18:09, Carlo Marcelo Arenas Belón wrote:

> AFAIK the contents of the bugzilla are no longer accessible to anyone (lost in the migration of PCRE2 to github),

Yes, that confused me too.

> the bug will trigger when both PCRE2_UTF and PCRE2_MULTILINE are set (as shown in the PCRE2 regression added), with the later set by default in git and NEVER set in GNU grep, hence why I later retracted[6] my suggestion to add the workaround to grep

OK, thanks, I installed the attached to GNU grep and we'll call it a day.

[-- Attachment #2: 0001-grep-Remove-recent-PCRE2-bug-workarounds.patch --]
[-- Type: text/x-patch, Size: 1331 bytes --]

From 743b1f6f5ca7ee86348fa0593da2eff03df1a82a Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert@cs.ucla.edu>
Date: Tue, 22 Mar 2022 20:12:38 -0700
Subject: [PATCH] grep: Remove recent PCRE2 bug workarounds
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* src/pcresearch.c (Pcompile): Remove recent workaround for PCRE2
bugs; apparently it’s not needed.  This reverts back to where
things were before today.  Suggested by Carlo Arenas in:
https://lists.gnu.org/r/grep-devel/2022-03/msg00006.html
---
 src/pcresearch.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/src/pcresearch.c b/src/pcresearch.c
index 6947838..f332a44 100644
--- a/src/pcresearch.c
+++ b/src/pcresearch.c
@@ -157,13 +157,6 @@ Pcompile (char *pattern, idx_t size, reg_syntax_t ignored, bool exact)
 #endif
     }
 
-#if defined PCRE2_MATCH_INVALID_UTF && !(10 < PCRE2_MAJOR + (36 <= PCRE2_MINOR))
-  /* Work around PCRE2 bug 2642, and another bug reportedly fixed in
-     PCRE2 commit e0c6029a62db9c2161941ecdf459205382d4d379.  */
-  if (flags & (PCRE2_UTF | PCRE2_CASELESS))
-    flags |= PCRE2_NO_START_OPTIMIZE;
-#endif
-
   /* FIXME: Remove this restriction.  */
   if (rawmemchr (pattern, '\n') != patlim)
     die (EXIT_TROUBLE, 0, _("the -P option only supports a single pattern"));
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: improve performance of PCRE2 bug 2642 bug workaround
  2022-03-23  1:09   ` Carlo Marcelo Arenas Belón
  2022-03-23  4:06     ` Paul Eggert
@ 2022-03-23 18:37     ` René Scharfe
  2022-03-23 20:24       ` Carlo Arenas
  1 sibling, 1 reply; 7+ messages in thread
From: René Scharfe @ 2022-03-23 18:37 UTC (permalink / raw)
  To: Carlo Marcelo Arenas Belón
  Cc: Paul Eggert, git, Ævar Arnfjörð Bjarmason

Am 23.03.22 um 02:09 schrieb Carlo Marcelo Arenas Belón:
> On Tue, Mar 22, 2022 at 09:26:10PM +0100, René Scharfe wrote:
>> Am 22.03.22 um 17:38 schrieb Paul Eggert:
>>> Today, Carlo Arenas pointed out[1] that GNU grep didn't work around
>>> PCRE2 bug 2642, which Git grep has a workaround for. While installing
>>> a GNU grep patch to fix this[2] I noticed that Git's workaround
>>> appears to be too pessimistic: on older PCRE2 libraries Git grep sets
>>> PCRE2_NO_START_OPTIMIZE even when PCRE2_CASELESS is not set.
>>
>> Interesting.  So you say bug 2642 [3] requires the flag PCRE2_CASELESS
>> (i.e. --ignore-case) to be triggered.  (That's probably documented in
>> Bugzilla, but I'm not authorized to access it.)
>
> AFAIK the contents of the bugzilla are no longer accessible to anyone
> (lost in the migration of PCRE2 to github), but the use of
> PCRE2_CASELESS introduced in 95ca1f987e (grep/pcre2: better support
> invalid UTF-8 haystacks, 2021-01-24) might had been a mistake all
> along.

Ah, OK.

> the bug will trigger when both PCRE2_UTF and PCRE2_MULTILINE are set
> (as shown in the PCRE2 regression added), with the later set by
> default in git and NEVER set in GNU grep, hence why I later
> retracted[6] my suggestion to add the workaround to grep, and suggest
> updating git with the following
> Carlo
>
> [6] https://lists.gnu.org/r/grep-devel/2022-03/msg00006.html
> --- >8 ---
> Subject: [PATCH] grep: remove check for case sensitivity in workaround for
>  PCRE's bug2642
>
> 95ca1f987e (grep/pcre2: better support invalid UTF-8 haystacks, 2021-01-24)
> add a workaround to an old PCRE2 bug, but includes in the logic a partial
> check for case sensitivity without explanation.
>
> Remove it so that the workaround (and its performance impact) will be only
> triggered when needed (both PCRE2_MULTILINE and PCRE2_UTF and JIT is used)
>
> Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
> ---
>  grep.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/grep.c b/grep.c
> index 82eb7da102..d910836569 100644
> --- a/grep.c
> +++ b/grep.c
> @@ -296,8 +296,8 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
>  		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
>
>  #ifndef GIT_PCRE2_VERSION_10_36_OR_HIGHER
> -	/* Work around https://bugs.exim.org/show_bug.cgi?id=2642 fixed in 10.36 */
> -	if (PCRE2_MATCH_INVALID_UTF && options & (PCRE2_UTF | PCRE2_CASELESS))
> +	/* Work around PCRE2 bug2642 fixed in 10.36 */
> +	if (SUPPORT_JIT && PCRE2_MATCH_INVALID_UTF && (options & PCRE2_UTF))

SUPPORT_JIT can be undefined and then this won't compile.

And if I turn it into an "#ifdef SUPPORT_JIT", t7812.16 fails with PCRE2
versions 10.34 and 10.35.

SUPPORT_JIT is defined in PCRE2's config.h and not baked into any of its
public headers, so it never will be set in grep.c, right?  Perhaps just
drop this condition?

The patch below adds a test that fails even with a PCRE2 configured with
--disable-git.  Current main passes this test even with PCRE2 versions
10.34 and 10.35.

"PCRE2_MATCH_INVALID_UTF && (options & PCRE2_UTF)" can be simplified to
"options & PCRE2_MATCH_INVALID_UTF".

>  		options |= PCRE2_NO_START_OPTIMIZE;
>  #endif
>

--- >8 ---
Subject: [PATCH] t7812: test PCRE2 whitespace bug

Check if git grep works around the PCRE2 big fixed by their e0c6029
(Fix inifinite loop when a single byte newline is searched in JIT.,
2020-05-29), which affects version 10.35 and earlier.

Searching for leading whitespace also triggers the endless loop.
Set a one-second alarm to abort in case we do get hit by the bug, to
avoid having to wait forever for the test result.

Reported-by: SZEDER Gábor <szeder.dev@gmail.com>
Signed-off-by: René Scharfe <l.s.r@web.de>
---
 t/t7812-grep-icase-non-ascii.sh | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/t/t7812-grep-icase-non-ascii.sh b/t/t7812-grep-icase-non-ascii.sh
index 9047d665a1..ac7be54714 100755
--- a/t/t7812-grep-icase-non-ascii.sh
+++ b/t/t7812-grep-icase-non-ascii.sh
@@ -4,6 +4,10 @@ test_description='grep icase on non-English locales'

 . ./lib-gettext.sh

+doalarm () {
+	perl -e 'alarm shift; exec @ARGV' -- "$@"
+}
+
 test_expect_success GETTEXT_LOCALE 'setup' '
 	test_write_lines "TILRAUN: Halló Heimur!" >file &&
 	git add file &&
@@ -139,4 +143,10 @@ test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep non-literal ASCII fro
 	test_cmp expected actual
 '

+test_expect_success GETTEXT_LOCALE,LIBPCRE2 'PCRE v2: grep avoid endless loop bug' '
+	echo " Halló" >leading-whitespace &&
+	git add leading-whitespace &&
+	doalarm 1 git grep --perl-regexp "^\s" leading-whitespace
+'
+
 test_done
--
2.35.1

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: improve performance of PCRE2 bug 2642 bug workaround
  2022-03-23 18:37     ` René Scharfe
@ 2022-03-23 20:24       ` Carlo Arenas
  0 siblings, 0 replies; 7+ messages in thread
From: Carlo Arenas @ 2022-03-23 20:24 UTC (permalink / raw)
  To: René Scharfe
  Cc: Paul Eggert, git, Ævar Arnfjörð Bjarmason

On Wed, Mar 23, 2022 at 11:37 AM René Scharfe <l.s.r@web.de> wrote:
>
> Am 23.03.22 um 02:09 schrieb Carlo Marcelo Arenas Belón:
> > On Tue, Mar 22, 2022 at 09:26:10PM +0100, René Scharfe wrote:
> >> Interesting.  So you say bug 2642 [3] requires the flag PCRE2_CASELESS
> >> (i.e. --ignore-case) to be triggered.  (That's probably documented in
> >> Bugzilla, but I'm not authorized to access it.)
> >
> > AFAIK the contents of the bugzilla are no longer accessible to anyone
> > (lost in the migration of PCRE2 to github), but the use of
> > PCRE2_CASELESS introduced in 95ca1f987e (grep/pcre2: better support
> > invalid UTF-8 haystacks, 2021-01-24) might have been a mistake all
> > along.
>
> Ah, OK.

It happens though that the original bug2642 did require PCRE2_CASELESS
and was fixed instead by PhilipHazel/pcre2@f8cbb1f5[7], so you were correct
when suggesting that the lazy coding of the condition was fixing more than one
bug and therefore it will be better left unchanged IMHO (except maybe from
an improved comment)

> The patch below adds a test that fails even with a PCRE2 configured with
> --disable-git.  Current main passes this test even with PCRE2 versions
> 10.34 and 10.35.
>
> "PCRE2_MATCH_INVALID_UTF && (options & PCRE2_UTF)" can be simplified to
> "options & PCRE2_MATCH_INVALID_UTF".

indeed, but will also need the PCRE2_CASELESS part that was required
from the original bug

thanks for adding a test for the infinite loop one, will definitely
help future readers.

Carlo

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2022-03-23 20:24 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-22 16:38 improve performance of PCRE2 bug 2642 bug workaround Paul Eggert
2022-03-22 20:26 ` René Scharfe
2022-03-22 21:12   ` Paul Eggert
2022-03-23  1:09   ` Carlo Marcelo Arenas Belón
2022-03-23  4:06     ` Paul Eggert
2022-03-23 18:37     ` René Scharfe
2022-03-23 20:24       ` Carlo Arenas

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).