git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Carlo Marcelo Arenas Belón" <carenas@gmail.com>
To: "René Scharfe" <l.s.r@web.de>
Cc: "Paul Eggert" <eggert@cs.ucla.edu>,
	git@vger.kernel.org, "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Subject: Re: improve performance of PCRE2 bug 2642 bug workaround
Date: Tue, 22 Mar 2022 18:09:31 -0700	[thread overview]
Message-ID: <20220323010931.jzf7op7hdusdty33@carlos-mbp.lan> (raw)
In-Reply-To: <99b0adb6-26ba-293c-3a8f-679f59e7cb4d@web.de>

On Tue, Mar 22, 2022 at 09:26:10PM +0100, René Scharfe wrote:
> Am 22.03.22 um 17:38 schrieb Paul Eggert:
> > Today, Carlo Arenas pointed out[1] that GNU grep didn't work around
> > PCRE2 bug 2642, which Git grep has a workaround for. While installing
> > a GNU grep patch to fix this[2] I noticed that Git's workaround
> > appears to be too pessimistic: on older PCRE2 libraries Git grep sets
> > PCRE2_NO_START_OPTIMIZE even when PCRE2_CASELESS is not set.
> 
> Interesting.  So you say bug 2642 [3] requires the flag PCRE2_CASELESS
> (i.e. --ignore-case) to be triggered.  (That's probably documented in
> Bugzilla, but I'm not authorized to access it.)

AFAIK the contents of the bugzilla are no longer accessible to anyone (lost in the migration of PCRE2 to github), but the use of PCRE2_CASELESS introduced in 95ca1f987e (grep/pcre2: better support invalid UTF-8 haystacks, 2021-01-24) might had been a mistake all along.

the bug will trigger when both PCRE2_UTF and PCRE2_MULTILINE are set (as shown in the PCRE2 regression added), with the later set by default in git and NEVER set in GNU grep, hence why I later retracted[6] my suggestion to add the workaround to grep, and suggest updating git with the following

Carlo

[6] https://lists.gnu.org/r/grep-devel/2022-03/msg00006.html
--- >8 ---
Subject: [PATCH] grep: remove check for case sensitivity in workaround for
 PCRE's bug2642

95ca1f987e (grep/pcre2: better support invalid UTF-8 haystacks, 2021-01-24)
add a workaround to an old PCRE2 bug, but includes in the logic a partial
check for case sensitivity without explanation.

Remove it so that the workaround (and its performance impact) will be only
triggered when needed (both PCRE2_MULTILINE and PCRE2_UTF and JIT is used)

Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
---
 grep.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/grep.c b/grep.c
index 82eb7da102..d910836569 100644
--- a/grep.c
+++ b/grep.c
@@ -296,8 +296,8 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
 
 #ifndef GIT_PCRE2_VERSION_10_36_OR_HIGHER
-	/* Work around https://bugs.exim.org/show_bug.cgi?id=2642 fixed in 10.36 */
-	if (PCRE2_MATCH_INVALID_UTF && options & (PCRE2_UTF | PCRE2_CASELESS))
+	/* Work around PCRE2 bug2642 fixed in 10.36 */
+	if (SUPPORT_JIT && PCRE2_MATCH_INVALID_UTF && (options & PCRE2_UTF))
 		options |= PCRE2_NO_START_OPTIMIZE;
 #endif
 
-- 
2.35.1.505.g27486cd1b2d


  parent reply	other threads:[~2022-03-23  1:09 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-03-22 16:38 improve performance of PCRE2 bug 2642 bug workaround Paul Eggert
2022-03-22 20:26 ` René Scharfe
2022-03-22 21:12   ` Paul Eggert
2022-03-23  1:09   ` Carlo Marcelo Arenas Belón [this message]
2022-03-23  4:06     ` Paul Eggert
2022-03-23 18:37     ` René Scharfe
2022-03-23 20:24       ` Carlo Arenas

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220323010931.jzf7op7hdusdty33@carlos-mbp.lan \
    --to=carenas@gmail.com \
    --cc=avarab@gmail.com \
    --cc=eggert@cs.ucla.edu \
    --cc=git@vger.kernel.org \
    --cc=l.s.r@web.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).