[PATCH] grep: correctly identify utf-8 characters with \{b,w} in -P

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [PATCH] grep: correctly identify utf-8 characters with \{b,w} in -P
@ 2023-01-08  6:23 Carlo Marcelo Arenas Belón
  2023-01-08  6:39 ` Junio C Hamano
  2023-01-08 15:52 ` [PATCH v2] " Carlo Marcelo Arenas Belón
  0 siblings, 2 replies; 36+ messages in thread
From: Carlo Marcelo Arenas Belón @ 2023-01-08  6:23 UTC (permalink / raw)
  To: git; +Cc: avarab, Carlo Marcelo Arenas Belón

When UTF is enabled for a PCRE match, the corresponding flags are
added to the pcre2_compile() call, but PCRE2_UCP wasn't included.

This prevents extending the meaning of the character classes to
include those new valid characters and therefore result in failed
matches for expressions that rely on that extention, for ex:

  $ git grep -P '\bÆvar'

Add PCRE2_UCP so that \w will include Æ and therefore \b could
correctly match the beginning of that word.

Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
---
 grep.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/grep.c b/grep.c
index 06eed69493..1687f65b64 100644
--- a/grep.c
+++ b/grep.c
@@ -293,7 +293,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 		options |= PCRE2_CASELESS;
 	}
 	if (!opt->ignore_locale && is_utf8_locale() && !literal)
-		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
+		options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF);
 
 #ifndef GIT_PCRE2_VERSION_10_36_OR_HIGHER
 	/* Work around https://bugs.exim.org/show_bug.cgi?id=2642 fixed in 10.36 */
-- 
2.37.1 (Apple Git-137.1)


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH] grep: correctly identify utf-8 characters with \{b,w} in -P
  2023-01-08  6:23 [PATCH] grep: correctly identify utf-8 characters with \{b,w} in -P Carlo Marcelo Arenas Belón
@ 2023-01-08  6:39 ` Junio C Hamano
  2023-01-08 15:52 ` [PATCH v2] " Carlo Marcelo Arenas Belón
  1 sibling, 0 replies; 36+ messages in thread
From: Junio C Hamano @ 2023-01-08  6:39 UTC (permalink / raw)
  To: Carlo Marcelo Arenas Belón; +Cc: git, avarab

Carlo Marcelo Arenas Belón  <carenas@gmail.com> writes:

> When UTF is enabled for a PCRE match, the corresponding flags are
> added to the pcre2_compile() call, but PCRE2_UCP wasn't included.

Would the same performance concern as

https://discourse.julialang.org/t/regex-pcre2-and-the-pcre2-ucp-ucp-flag/10930

apply to us as well?



>  	if (!opt->ignore_locale && is_utf8_locale() && !literal)
> -		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
> +		options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF);
>  
>  #ifndef GIT_PCRE2_VERSION_10_36_OR_HIGHER
>  	/* Work around https://bugs.exim.org/show_bug.cgi?id=2642 fixed in 10.36 */

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH v2] grep: correctly identify utf-8 characters with \{b,w} in -P
  2023-01-08  6:23 [PATCH] grep: correctly identify utf-8 characters with \{b,w} in -P Carlo Marcelo Arenas Belón
  2023-01-08  6:39 ` Junio C Hamano
@ 2023-01-08 15:52 ` Carlo Marcelo Arenas Belón
  2023-01-09 11:35   ` Ævar Arnfjörð Bjarmason
  2023-01-17 10:51   ` [PATCH v3] grep: correctly identify utf-8 characters with \{b,w} in -P Carlo Marcelo Arenas Belón
  1 sibling, 2 replies; 36+ messages in thread
From: Carlo Marcelo Arenas Belón @ 2023-01-08 15:52 UTC (permalink / raw)
  To: git; +Cc: avarab, gitster, Carlo Marcelo Arenas Belón

When UTF is enabled for a PCRE match, the corresponding flags are
added to the pcre2_compile() call, but PCRE2_UCP wasn't included.

This prevents extending the meaning of the character classes to
include those new valid characters and therefore result in failed
matches for expressions that rely on that extention, for ex:

  $ git grep -P '\bÆvar'

Add PCRE2_UCP so that \w will include Æ and therefore \b could
correctly match the beginning of that word.

This has an impact on performance that has been estimated to be
between 20% to 40% and that is shown through the added performance
test.

Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
---
 grep.c                              |  2 +-
 t/perf/p7822-grep-perl-character.sh | 42 +++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+), 1 deletion(-)
 create mode 100755 t/perf/p7822-grep-perl-character.sh

diff --git a/grep.c b/grep.c
index 06eed69493..1687f65b64 100644
--- a/grep.c
+++ b/grep.c
@@ -293,7 +293,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 		options |= PCRE2_CASELESS;
 	}
 	if (!opt->ignore_locale && is_utf8_locale() && !literal)
-		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
+		options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF);
 
 #ifndef GIT_PCRE2_VERSION_10_36_OR_HIGHER
 	/* Work around https://bugs.exim.org/show_bug.cgi?id=2642 fixed in 10.36 */
diff --git a/t/perf/p7822-grep-perl-character.sh b/t/perf/p7822-grep-perl-character.sh
new file mode 100755
index 0000000000..87009c60df
--- /dev/null
+++ b/t/perf/p7822-grep-perl-character.sh
@@ -0,0 +1,42 @@
+#!/bin/sh
+
+test_description="git-grep's perl regex
+
+If GIT_PERF_GREP_THREADS is set to a list of threads (e.g. '1 4 8'
+etc.) we will test the patterns under those numbers of threads.
+"
+
+. ./perf-lib.sh
+
+test_perf_large_repo
+test_checkout_worktree
+
+if test -n "$GIT_PERF_GREP_THREADS"
+then
+	test_set_prereq PERF_GREP_ENGINES_THREADS
+fi
+
+for pattern in \
+	'\\bhow' \
+	'\\bÆvar' \
+	'\\d+ \\bÆvar' \
+	'\\bBelón\\b' \
+	'\\w{12}\\b'
+do
+	echo '$pattern' >pat
+	if ! test_have_prereq PERF_GREP_ENGINES_THREADS
+	then
+		test_perf "grep -P '$pattern'" --prereq PCRE "
+			git -P grep -f pat || :
+		"
+	else
+		for threads in $GIT_PERF_GREP_THREADS
+		do
+			test_perf "grep -P '$pattern' with $threads threads" --prereq PTHREADS,PCRE "
+				git -c grep.threads=$threads -P grep -f pat || :
+			"
+		done
+	fi
+done
+
+test_done
-- 
2.39.0.199.g555ddd67e6


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v2] grep: correctly identify utf-8 characters with \{b,w} in -P
  2023-01-08 15:52 ` [PATCH v2] " Carlo Marcelo Arenas Belón
@ 2023-01-09 11:35   ` Ævar Arnfjörð Bjarmason
  2023-01-09 18:40     ` bug#60690: [PATCH v2] grep: correctly identify utf-8 characters with \{b, w} " Paul Eggert
                       ` (3 more replies)
  2023-01-17 10:51   ` [PATCH v3] grep: correctly identify utf-8 characters with \{b,w} in -P Carlo Marcelo Arenas Belón
  1 sibling, 4 replies; 36+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-01-09 11:35 UTC (permalink / raw)
  To: Carlo Marcelo Arenas Belón
  Cc: git, gitster, bug-grep, demerphq, pcre-dev

On Sun, Jan 08 2023, Carlo Marcelo Arenas Belón wrote:

> When UTF is enabled for a PCRE match, the corresponding flags are
> added to the pcre2_compile() call, but PCRE2_UCP wasn't included.
>
> This prevents extending the meaning of the character classes to
> include those new valid characters and therefore result in failed
> matches for expressions that rely on that extention, for ex:
>
>   $ git grep -P '\bÆvar'
>
> Add PCRE2_UCP so that \w will include Æ and therefore \b could
> correctly match the beginning of that word.
>
> This has an impact on performance that has been estimated to be
> between 20% to 40% and that is shown through the added performance
> test.
>
> Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
> ---
>  grep.c                              |  2 +-
>  t/perf/p7822-grep-perl-character.sh | 42 +++++++++++++++++++++++++++++
>  2 files changed, 43 insertions(+), 1 deletion(-)
>  create mode 100755 t/perf/p7822-grep-perl-character.sh
>
> diff --git a/grep.c b/grep.c
> index 06eed69493..1687f65b64 100644
> --- a/grep.c
> +++ b/grep.c
> @@ -293,7 +293,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
>  		options |= PCRE2_CASELESS;
>  	}
>  	if (!opt->ignore_locale && is_utf8_locale() && !literal)
> -		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
> +		options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF);

I have a definite bias towards liking this change, it would help my find
myself :)

But I don't think it's safe to change the default behavior "git-grep",
it's not a mere bug fix, but a major behavior change for existing users
of grep.patternType=perl. E.g. on git.git:

	$ diff <(git -P grep -P '\d+') <(git -P grep -P '(*UCP)\d')
	53360a53361,53362
	> git-gui/po/ja.po:"- 第１行: 何をしたか、を１行で要約。\n"
	> git-gui/po/ja.po:"- 第２行: 空白\n"

So, it will help "do the right thing" on e.g. "\bÆ", but it will also
find e.g. CJK numeric characters for \d etc.

I see per the discussion on
https://github.com/PCRE2Project/pcre2/issues/185 and
https://lists.gnu.org/archive/html/bug-grep/2023-01/threads.html that
you submitted similar fixes to GNU grep & PCRE itself.

I see that GNU grep integrated it a couple of days ago as
https://git.savannah.gnu.org/cgit/grep.git/commit/?id=5e3b760f65f13856e5717e5b9d935f5b4a615be3

As most discussions about PCRE will eventually devolve into "what does
Perl do?": "Perl" itself will promiscuously use this behavior by
default.

E.g. here the same "１" character (not the ASCII digit "1") will be
matched from the command-line:

	$ perl -Mre=debug -CA -wE 'shift =~ /\d/' "１"
	Compiling REx "\d"
	Final program:
	   1: POSIXU[\d] (2)
	   2: END (0)
	stclass POSIXU[\d] minlen 1
	Matching REx "\d" against "%x{ff11}"
	UTF-8 string...
	Matching stclass POSIXU[\d] against "%x{ff11}" (3 bytes)
	   0 <> <%x{ff11}>           |   0| 1:POSIXU[\d](2)
	   3 <%x{ff11}> <>           |   0| 2:END(0)
	Match successful!
	Freeing REx: "\d"

But I don't think it makes sense for "git grep" (or GNU "grep") to
follow Perl in this particular case.

For those not familiar with its Unicode model it doesn't assume by
default that strings are Unicode, they have to be explicitly marked as
such. in the above example I'm declaring that all of "argv" is UTF-8
(via the "-CA" flag).

If I didn't supply that flag the string wouldn't have the UTF-8 flag,
and wouldn't match, as the Perl regex engine won't use Unicode semantics
except on Unicode target strings.

Even for Perl, this behavior has been troublesome. Opinions differ, but
I think many would agree (and I've CC'd the main authority on Perl's
regex engine) that doing this by default was *probably* a mistake.

You almost never want "everything Unicode considers a digit", and if you
do using e.g. \p{Nd} instead of \d would be better in terms of
expressing your intent. I see you're running into this on the PCRE
tracker, where you're suggesting that the equivalent of /a (or /aa)
would be needed.

	https://github.com/PCRE2Project/pcre2/issues/185#issuecomment-1374796393

Which brings me home to the seeming digression about "Perl"
above.

Unlike a programming language where you'll typically "mark" your data as
it comes in, natural text as UTF-8, binary data as such etc., a "grep"
utility has to operate on more of an "all or nothing" basis (except in
the case of "-a"). I.e. we're usually searching through unknown data.

Enabling this by default means that we'll pick up characters most people
probably wouldn't expect, particularly from near-binary data formats
(those that won't require "-a", but contain non-Unicode non-ASCII
sequences).

I don't have some completely holistic view of what we should do in every
case, e.g. we turned on PCRE2_UTF so that things like "-i" would Just
Work, but even case-insensitivity has its own unexpected edge cases in
Unicode.

But I don't think those edge cases are nearly as common as those we'd
run into by enabling PCRE2_UCP. Rather than trying to opt-out with "/a"
or "/aa" I think this should be opt-in.

As the example at the start shows you can already do this with "(*UCP)"
in the pattern, so perhaps we should just link to the pcre2pattern(3)
manual from git-grep(1)?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: bug#60690: [PATCH v2] grep: correctly identify utf-8 characters with \{b, w} in -P
  2023-01-09 11:35   ` Ævar Arnfjörð Bjarmason
@ 2023-01-09 18:40     ` Paul Eggert
  2023-01-09 19:51       ` Ævar Arnfjörð Bjarmason
  2023-01-10  4:49     ` [PATCH v2] grep: correctly identify utf-8 characters with \{b,w} " Carlo Arenas
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 36+ messages in thread
From: Paul Eggert @ 2023-01-09 18:40 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason,
	Carlo Marcelo Arenas Belón
  Cc: demerphq, pcre-dev, 60690, gitster, git

On 1/9/23 03:35, Ævar Arnfjörð Bjarmason wrote:

> You almost never want "everything Unicode considers a digit", and if you
> do using e.g. \p{Nd} instead of \d would be better in terms of
> expressing your intent.

For GNU grep, PCRE2_UCP is needed because of examples like what Gro-Tsen 
and Karl Petterssen supplied. If there's some diagreement about how \d 
should behave with UTF-8 data the GNU grep hackers should let the Perl 
community decide that; that is, GNU grep can simply follow PCRE2's lead. 
But GNU grep does need PCRE2_UCP for \b etc.

> 	$ diff <(git -P grep -P '\d+') <(git -P grep -P '(*UCP)\d')
> 	53360a53361,53362
> 	> git-gui/po/ja.po:"- 第１行: 何をしたか、を１行で要約。\n"
> 	> git-gui/po/ja.po:"- 第２行: 空白\n"

Although I don't speak Japanese I have dealt with quite a bit of 
Japanese text in a previous job, and personally I would prefer \d to 
match those two lines as they do contain digits. So to me this 
particular case is not a good argument that git grep should not match 
those lines.

Of course other people might prefer differently, and there are cases 
where I want to match only ASCII digits. I've learned in the past to use 
[0-9] for that. I hope PCRE2 never changes [0-9] to match anything but 
ASCII digits when searching UTF-8 text.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: bug#60690: [PATCH v2] grep: correctly identify utf-8 characters with \{b, w} in -P
  2023-01-09 18:40     ` bug#60690: [PATCH v2] grep: correctly identify utf-8 characters with \{b, w} " Paul Eggert
@ 2023-01-09 19:51       ` Ævar Arnfjörð Bjarmason
  2023-01-09 23:12         ` Paul Eggert
  0 siblings, 1 reply; 36+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-01-09 19:51 UTC (permalink / raw)
  To: Paul Eggert
  Cc: Carlo Marcelo Arenas Belón, demerphq, pcre-dev, 60690,
	gitster, git

On Mon, Jan 09 2023, Paul Eggert wrote:

> On 1/9/23 03:35, Ævar Arnfjörð Bjarmason wrote:
>
>> You almost never want "everything Unicode considers a digit", and if you
>> do using e.g. \p{Nd} instead of \d would be better in terms of
>> expressing your intent.
>
> For GNU grep, PCRE2_UCP is needed because of examples like what
> Gro-Tsen and Karl Petterssen supplied.

[For reference, referring to this Twitter thread:
https://twitter.com/gro_tsen/status/1610972356972875777]

Those examples compared -E and -P. I think it's correct that UCP brings
the behavior closer to -E, but it's also different in various ways.

E.g. on emacs.git (which I've been finding to be quite a nice test case)
a comparison of the two, with "git grep" because I found it easier to
test, but GNU grep will presumably find the same for those files:

	for c in b s w
	do
		for pfx in '' '(*UCP)'
		do
			echo "$pfx/$c:" &&
			diff -u <(git -P grep -E "\\$c") <(git -P grep -P "$pfx\\$c") | wc -l
		done
	done

Yields:

	/b:
	155781
	(*UCP)/b:
	46035
	/s:
	0
	(*UCP)/s:
	0
	/w:
	142468
	(*UCP)/w:
	9706

So the output still differs, and some of those differences may or may
not be wanted.

> If there's some diagreement
> about how \d should behave with UTF-8 data the GNU grep hackers should
> let the Perl community decide that; that is, GNU grep can simply
> follow PCRE2's lead.

PCRE2 tends to follow Perl, I'm mainly trying to point out here that it
isn't a-priory clear how "let Perl decide" is supposed to map to the of
a "grep"-like utility, since the Perl behavior is inherently tied up
with knowing the encoding of the target data.

For GNU grep and "git grep" that's more of an all-or-nothing with
locales, although in this case being as close as possible to -E is
probably more correct than not.

>> 	$ diff <(git -P grep -P '\d+') <(git -P grep -P '(*UCP)\d')
>> 	53360a53361,53362
>> 	> git-gui/po/ja.po:"- 第１行: 何をしたか、を１行で要約。\n"
>> 	> git-gui/po/ja.po:"- 第２行: 空白\n"
>
> Although I don't speak Japanese I have dealt with quite a bit of
> Japanese text in a previous job, and personally I would prefer \d to
> match those two lines as they do contain digits. So to me this
> particular case is not a good argument that git grep should not match
> those lines.

I'm mainly raising the backwards compatibility concern, which GNU grep
and git grep may or may not want to handle differently, but let's at
least be aware of the various edge cases.

For \b I think it mostly does the right thing.

For \w and \d in particular I'm mainly noting that yes, sometimes you
want to match [0-9], and sometimes you'd want to match Japanese numbers,
but you rarely (or at least I haven't) want to match everything Unicode
considers X, unless you're doing some self-reflection on Unicode itself.

E.g. for \d it's at least (up from just 10):

	$ perl -CO -wE 'for (1..2**20) { say chr if chr =~ /\d/ }'|wc -l
	650

For \w you similarly go from ~60 to ~130k:

	$ perl -CO -wE 'for (1..2**24) { say chr if chr =~ /\w/ }'|wc -l
	134564

If all you're doing is matching either ASCII or Japanese text and you
want "locale-aware numbers" it might do the wrong thing.

But I've found it to be too promiscuous when casting a wider net, which
is the usual use-case with 'grep".

> Of course other people might prefer differently, and there are cases
> where I want to match only ASCII digits. I've learned in the past to
> use [0-9] for that. I hope PCRE2 never changes [0-9] to match anything
> but ASCII digits when searching UTF-8 text.

I think that'll never change.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: bug#60690: [PATCH v2] grep: correctly identify utf-8 characters with \{b, w} in -P
  2023-01-09 19:51       ` Ævar Arnfjörð Bjarmason
@ 2023-01-09 23:12         ` Paul Eggert
  0 siblings, 0 replies; 36+ messages in thread
From: Paul Eggert @ 2023-01-09 23:12 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Carlo Marcelo Arenas Belón, demerphq, pcre-dev, 60690,
	gitster, git

On 1/9/23 11:51, Ævar Arnfjörð Bjarmason wrote:

> 	/b:
> 	155781
> 	(*UCP)/b:
> 	46035
> 	/s:
> 	0
> 	(*UCP)/s:
> 	0
> 	/w:
> 	142468
> 	(*UCP)/w:
> 	9706
> 
> So the output still differs, and some of those differences may or may
> not be wanted.

I took a look at the output, and by and large I'd want the differences; 
that is, I'd want the UCP version, which generates less output. This is 
because several Emacs source files are not UTF-8, and \b has nonsense 
matches when searching text files encoded via Shift-JIS or Big 5 or 
whatever. For this sort of thing, the fewer matches the better.

> If all you're doing is matching either ASCII or Japanese text and you
> want "locale-aware numbers" it might do the wrong thing.

I'm not seeing much of a problem here. When searching Japanese text, I 
would expect \d and [0-9０-９] (using both ASCII and full-width digits) to 
be equivalent so (assuming UCP) it's not a big deal as to which regex 
you use, since Japanese text won't contain Bengali (or whatever) digits. 
And when searching binary data, I'd expect a bunch of garbage no matter 
how \d is interpreted.

Here I'm assuming [０-９] (using full-width digits) has the expected 
meaning in PCRE2, i.e., that PCRE2 didn't make the same mistake that 
POSIX made.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2] grep: correctly identify utf-8 characters with \{b,w} in -P
  2023-01-09 11:35   ` Ævar Arnfjörð Bjarmason
  2023-01-09 18:40     ` bug#60690: [PATCH v2] grep: correctly identify utf-8 characters with \{b, w} " Paul Eggert
@ 2023-01-10  4:49     ` Carlo Arenas
  2023-01-16 20:48     ` Junio C Hamano
  2023-04-03 21:38     ` -P '\d' in GNU and git grep Paul Eggert
  3 siblings, 0 replies; 36+ messages in thread
From: Carlo Arenas @ 2023-01-10  4:49 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git, gitster, demerphq

On Mon, Jan 9, 2023 at 4:17 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>Rather than trying to opt-out with "/a" or "/aa" I think this should be opt-in.
>
> As the example at the start shows you can already do this with "(*UCP)"
> in the pattern, so perhaps we should just link to the pcre2pattern(3)
> manual from git-grep(1)?

Considering that PCRE is used internally even for cases that don't
specify -P how would that opt-in work?

For example, in a repository with code that uses utf identifiers, the
following will fail:

  $ git grep -w -E motion
  u.c:  int émotion = 0;
  $ git grep -w -E '(*UCP)motion'
  fatal: command line, '(*UCP)motion': Invalid preceding regular expression
  $ git -P grep -P -w '(*UCP)motion'
  u.c:  int émotion = 0;

Carlo

CC removed gnu and the obsoleted PCRE developer list (if really needed
would be better to use the documented pcre2-dev@googlegroups.com,
instead)

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2] grep: correctly identify utf-8 characters with \{b,w} in -P
  2023-01-09 11:35   ` Ævar Arnfjörð Bjarmason
  2023-01-09 18:40     ` bug#60690: [PATCH v2] grep: correctly identify utf-8 characters with \{b, w} " Paul Eggert
  2023-01-10  4:49     ` [PATCH v2] grep: correctly identify utf-8 characters with \{b,w} " Carlo Arenas
@ 2023-01-16 20:48     ` Junio C Hamano
  2023-04-03 21:38     ` -P '\d' in GNU and git grep Paul Eggert
  3 siblings, 0 replies; 36+ messages in thread
From: Junio C Hamano @ 2023-01-16 20:48 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Carlo Marcelo Arenas Belón, git

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> But I don't think it's safe to change the default behavior "git-grep",
> it's not a mere bug fix, but a major behavior change for existing users
> of grep.patternType=perl.
> ...
> Even for Perl, this behavior has been troublesome. Opinions differ, but
> I think many would agree (and I've CC'd the main authority on Perl's
> regex engine) that doing this by default was *probably* a mistake.
> ...
> As the example at the start shows you can already do this with "(*UCP)"
> in the pattern, so perhaps we should just link to the pcre2pattern(3)
> manual from git-grep(1)?

So, now do we have a final verdict on this patch?  If we are not
taking the "unconditonally enable ucp" patch (which I tend to agree
with a safer choice for now), it may make sense to mention (*UCP) in
our documentation somewhere, perhaps?



^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH v3] grep: correctly identify utf-8 characters with \{b,w} in -P
  2023-01-08 15:52 ` [PATCH v2] " Carlo Marcelo Arenas Belón
  2023-01-09 11:35   ` Ævar Arnfjörð Bjarmason
@ 2023-01-17 10:51   ` Carlo Marcelo Arenas Belón
  2023-01-17 12:38     ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 36+ messages in thread
From: Carlo Marcelo Arenas Belón @ 2023-01-17 10:51 UTC (permalink / raw)
  To: git; +Cc: avarab, gitster, Carlo Marcelo Arenas Belón

When UTF is enabled for a PCRE match, the PCRE2_UTF flag is used
by the pcre2_compile() call, but that would only allow for the
use of Unicode character properties when caseless is required
but not to include the additional UTF characters for all other
class matches.

This would result in failed matches for expressions that rely
on those properties, for ex:

  $ git grep -P '\bÆvar'

Add a configuration that could be used to enable the PCRE2_UCP
flag to correctly match those cases, when required.

The use of this has an impact on performance that has been estimated
to be significant.

Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
---
Changes since v2:
* make setting UCP and opt-in as suggested by Ævar
* remove performance test and instead add a test

 Documentation/config/grep.txt |  6 ++++++
 grep.c                        | 11 ++++++++++-
 grep.h                        |  1 +
 t/t7810-grep.sh               | 13 +++++++++++++
 4 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
index e521f20390..8848db7311 100644
--- a/Documentation/config/grep.txt
+++ b/Documentation/config/grep.txt
@@ -26,3 +26,9 @@ grep.fullName::
 grep.fallbackToNoIndex::
 	If set to true, fall back to git grep --no-index if git grep
 	is executed outside of a git repository.  Defaults to false.
+
+pcre.ucp::
+	If set to true, will use all Unicode Character Properties when matching
+	`\w`, `\b`, `\d` or the POSIX classes (ex: `[:alnum:]`) and PCRE is used
+	as the underlying engine. If PCRE is not being used it is ignored.
+	Defaults to false
diff --git a/grep.c b/grep.c
index 06eed69493..ceafb8937d 100644
--- a/grep.c
+++ b/grep.c
@@ -102,6 +102,12 @@ int grep_config(const char *var, const char *value, void *cb)
 			return config_error_nonbool(var);
 		return color_parse(value, color);
 	}
+
+	if (!strcmp(var, "pcre.ucp")) {
+		opt->pcre_ucp = git_config_bool(var, value);
+		return 0;
+	}
+
 	return 0;
 }
 
@@ -292,8 +298,11 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 		}
 		options |= PCRE2_CASELESS;
 	}
-	if (!opt->ignore_locale && is_utf8_locale() && !literal)
+	if (!opt->ignore_locale && is_utf8_locale() && !literal) {
 		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
+		if (opt->pcre_ucp)
+			options |= PCRE2_UCP;
+	}
 
 #ifndef GIT_PCRE2_VERSION_10_36_OR_HIGHER
 	/* Work around https://bugs.exim.org/show_bug.cgi?id=2642 fixed in 10.36 */
diff --git a/grep.h b/grep.h
index 6075f997e6..082bd3a0c7 100644
--- a/grep.h
+++ b/grep.h
@@ -171,6 +171,7 @@ struct grep_opt {
 	int file_break;
 	int heading;
 	int max_count;
+	int pcre_ucp;
 	void *priv;
 
 	void (*output)(struct grep_opt *opt, const void *data, size_t size);
diff --git a/t/t7810-grep.sh b/t/t7810-grep.sh
index 8eded6ab27..a99a967060 100755
--- a/t/t7810-grep.sh
+++ b/t/t7810-grep.sh
@@ -95,6 +95,7 @@ test_expect_success setup '
 	then
 		echo "¿" >reverse-question-mark
 	fi &&
+	echo "émotion" >ucp &&
 	git add . &&
 	test_tick &&
 	git commit -m initial
@@ -1474,6 +1475,18 @@ test_expect_success PCRE 'grep -P backreferences work (the PCRE NO_AUTO_CAPTURE
 	test_cmp hello_world actual
 '
 
+test_expect_success PCRE 'grep -c pcre.ucp -P fixes \b' '
+	cat >expected <<-\EOF &&
+	ucp:émotion
+	EOF
+	cat >pattern <<-\EOF &&
+	\bémotion
+	EOF
+	LC_ALL=en_US.UTF-8 git -c pcre.ucp=true grep -P -f pattern >actual &&
+	test_cmp expected actual &&
+	LC_ALL=en_US.UTF-8 test_must_fail git grep -P -f pattern
+'
+
 test_expect_success 'grep -G invalidpattern properly dies ' '
 	test_must_fail git grep -G "a["
 '
-- 
2.37.1 (Apple Git-137.1)


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] grep: correctly identify utf-8 characters with \{b,w} in -P
  2023-01-17 10:51   ` [PATCH v3] grep: correctly identify utf-8 characters with \{b,w} in -P Carlo Marcelo Arenas Belón
@ 2023-01-17 12:38     ` Ævar Arnfjörð Bjarmason
  2023-01-17 15:19       ` Junio C Hamano
  0 siblings, 1 reply; 36+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-01-17 12:38 UTC (permalink / raw)
  To: Carlo Marcelo Arenas Belón; +Cc: git, gitster, Diomidis Spinellis

On Tue, Jan 17 2023, Carlo Marcelo Arenas Belón wrote:

> When UTF is enabled for a PCRE match, the PCRE2_UTF flag is used
> by the pcre2_compile() call, but that would only allow for the
> use of Unicode character properties when caseless is required
> but not to include the additional UTF characters for all other
> class matches.
>
> This would result in failed matches for expressions that rely
> on those properties, for ex:
>
>   $ git grep -P '\bÆvar'
>
> Add a configuration that could be used to enable the PCRE2_UCP
> flag to correctly match those cases, when required.
>
> The use of this has an impact on performance that has been estimated
> to be significant.
>
> Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
> ---
> Changes since v2:
> * make setting UCP and opt-in as suggested by Ævar

To argue with myself here, I'm not so sure that just making this the
default isn't the right move, especially as the GNU grep maintainer
seems to be convinced that that's the right thing for grep(1).

We've usually just followed GNU grep semantics, so if it's doing X and
we're doing Y after this it's probably better to unify our behavior with
theirs.

I was mainly concerned with the behavior change sneaking in as a mere
bugfix, I think it's OK if we change the behavior, as long as we're
going into it with our eyes open...

> * remove performance test and instead add a test

...I didn't follow the thread(s) where this may have been discussed, but
I for one would like to see a perf test with this, but maybe it was
removed for a good reason that I'm not aware of...

>  Documentation/config/grep.txt |  6 ++++++
>  grep.c                        | 11 ++++++++++-
>  grep.h                        |  1 +
>  t/t7810-grep.sh               | 13 +++++++++++++
>  4 files changed, 30 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
> index e521f20390..8848db7311 100644
> --- a/Documentation/config/grep.txt
> +++ b/Documentation/config/grep.txt
> @@ -26,3 +26,9 @@ grep.fullName::
>  grep.fallbackToNoIndex::
>  	If set to true, fall back to git grep --no-index if git grep
>  	is executed outside of a git repository.  Defaults to false.
> +
> +pcre.ucp::
> +	If set to true, will use all Unicode Character Properties when matching
> +	`\w`, `\b`, `\d` or the POSIX classes (ex: `[:alnum:]`) and PCRE is used
> +	as the underlying engine. If PCRE is not being used it is ignored.
> +	Defaults to false

There's a couple of exceptions to this, but we tend to stick config docs
in their corresponding Documentation/config/<namespace>.txt, so this
should be in a new Documentation/config/pcre.txt if we're adding this
name.

But I'd rather that we don't expose the implementation detail that we're
using PCRE, which we haven't done so far. We just have a
"grep.patternType=perl", which (and that's another issue, in any case)
we should explicitly mention here, i.e. that this is for use with that
config (and corresponding option(s)).

I think calling this e.g.:

	grep.perl.Unicode=<bool>
	grep.patternTypePerl.Unicode=<bool>

Or even:

	grep.patternTypePerl.Flags=u

Would be better, i.e. PCRE's C API is really just mapping to the flags
you can find in "perldoc perlre" (https://perldoc.perl.org/perlre). In
this case the /u flag maps to the "PCRE2_UCP" API flag.

That we happen to use PCRE to give ourselves "Perl" semantics is an
implementation detail we should avoid exposing, so we could either give
our config generic names, or literally map to the perl /flags/.

For now we could just die on any "Flags" value that isn't "u".

Of course all of this is predicated on us wanting to leave this as an
opt-in, which I'm not so sure about. If it's opt-out we'll avoid this
entire question,

> diff --git a/grep.c b/grep.c
> index 06eed69493..ceafb8937d 100644
> --- a/grep.c
> +++ b/grep.c
> @@ -102,6 +102,12 @@ int grep_config(const char *var, const char *value, void *cb)
>  			return config_error_nonbool(var);
>  		return color_parse(value, color);
>  	}
> +
> +	if (!strcmp(var, "pcre.ucp")) {
> +		opt->pcre_ucp = git_config_bool(var, value);
> +		return 0;
> +	}
> +
>  	return 0;
>  }
>  
> @@ -292,8 +298,11 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
>  		}
>  		options |= PCRE2_CASELESS;
>  	}
> -	if (!opt->ignore_locale && is_utf8_locale() && !literal)
> +	if (!opt->ignore_locale && is_utf8_locale() && !literal) {
>  		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
> +		if (opt->pcre_ucp)
> +			options |= PCRE2_UCP;
> +	}

This interaction with locale settings etc. is probably correct, but if
we're keeping the config etc. we should really document how this
interacts with those.

I.e. you might expect "-c grep.patternType=perl -c
<whatever_the_setting_is>=true" to give you UCP semantics, but we'll
ignore it based on these other criteria.

>  #ifndef GIT_PCRE2_VERSION_10_36_OR_HIGHER
>  	/* Work around https://bugs.exim.org/show_bug.cgi?id=2642 fixed in 10.36 */
> diff --git a/grep.h b/grep.h
> index 6075f997e6..082bd3a0c7 100644
> --- a/grep.h
> +++ b/grep.h
> @@ -171,6 +171,7 @@ struct grep_opt {
>  	int file_break;
>  	int heading;
>  	int max_count;
> +	int pcre_ucp;
>  	void *priv;

The reason for why we have some "bool"-like settings (like "int
ignore_case") as an "int" as opposed to an "unsigned int <name>:1"
bitfield is because we need to take their address via the
parse_options() API.

But in this case it's a purely internal field, so (and again, if we're
keeping the option) let's use "unsigned int ...:1" here instead?

Unless that is, we're expecting a corresponding command-line option.

>  
>  	void (*output)(struct grep_opt *opt, const void *data, size_t size);
> diff --git a/t/t7810-grep.sh b/t/t7810-grep.sh
> index 8eded6ab27..a99a967060 100755
> --- a/t/t7810-grep.sh
> +++ b/t/t7810-grep.sh
> @@ -95,6 +95,7 @@ test_expect_success setup '
>  	then
>  		echo "¿" >reverse-question-mark
>  	fi &&
> +	echo "émotion" >ucp &&

Here we carry this file through the entirety ouf our tests...

>  	git add . &&
>  	test_tick &&
>  	git commit -m initial
> @@ -1474,6 +1475,18 @@ test_expect_success PCRE 'grep -P backreferences work (the PCRE NO_AUTO_CAPTURE
>  	test_cmp hello_world actual
>  '
>  
> +test_expect_success PCRE 'grep -c pcre.ucp -P fixes \b' '
> +	cat >expected <<-\EOF &&
> +	ucp:émotion

...only to use it in this one test, it clearly didn't harm anything (or
rather, I didn't run this, but I expect you did and it passed), but how
about avoiding more global state here doing:

	test_when_finished "rm -rf repo" &&
	git init repo &&
	test_commit -C repo msg ucp émotion &&
	[...]

> +	EOF
> +	cat >pattern <<-\EOF &&
> +	\bémotion
> +	EOF
> +	LC_ALL=en_US.UTF-8 git -c pcre.ucp=true grep -P -f pattern >actual &&
> +	test_cmp expected actual &&
> +	LC_ALL=en_US.UTF-8 test_must_fail git grep -P -f pattern

This will break on platforms that don't have en_US.UTF-8 (and that's not
hypothetical, some systems will skip installing locales for various
reasons).

I see we have some almost recent breakage 1819ad327b7 (grep: fix
multibyte regex handling under macOS, 2022-08-26), but that one adds a
new MB_REGEX prereq, which presumably fails if we don't have this
locale.

You can just use the existing "GETTEXT_LOCALE", which will piggy-back on
our existing locale tests, or we could add a corresponding one for
en_US.UTF-8 and refactor the existing MB_REGEX to be in lib-gettext.sh
where it arguably belongs...

> +'
> +
>  test_expect_success 'grep -G invalidpattern properly dies ' '
>  	test_must_fail git grep -G "a["
>  '

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] grep: correctly identify utf-8 characters with \{b,w} in -P
  2023-01-17 12:38     ` Ævar Arnfjörð Bjarmason
@ 2023-01-17 15:19       ` Junio C Hamano
  2023-01-18  7:35         ` Carlo Arenas
  0 siblings, 1 reply; 36+ messages in thread
From: Junio C Hamano @ 2023-01-17 15:19 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Carlo Marcelo Arenas Belón, git, Diomidis Spinellis

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> To argue with myself here, I'm not so sure that just making this the
> default isn't the right move, especially as the GNU grep maintainer
> seems to be convinced that that's the right thing for grep(1).

OK.

> I think calling this e.g.:
>
> 	grep.perl.Unicode=<bool>
> 	grep.patternTypePerl.Unicode=<bool>
>
> Or even:
>
> 	grep.patternTypePerl.Flags=u
>
> Would be better, i.e. PCRE's C API is really just mapping to the flags
> you can find in "perldoc perlre" (https://perldoc.perl.org/perlre). In
> this case the /u flag maps to the "PCRE2_UCP" API flag.
>
> That we happen to use PCRE to give ourselves "Perl" semantics is an
> implementation detail we should avoid exposing, so we could either give
> our config generic names, or literally map to the perl /flags/.
>
> For now we could just die on any "Flags" value that isn't "u".
>
> Of course all of this is predicated on us wanting to leave this as an
> opt-in, which I'm not so sure about. If it's opt-out we'll avoid this
> entire question,

Making it opt-out would also require a similar knob to turn the
"flag" off, be it a configuration variable or a command line option,
wouldn't it?  I tend to agree with you that it makes sense to make
it a goal to take us closer to "grep -P" from GNU---do they have
such an opt-out knob?  If not, let's make it simple by turning it
always on, which would be the simplest ;-)

Again, thanks for a careful review with concrete points.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] grep: correctly identify utf-8 characters with \{b,w} in -P
  2023-01-17 15:19       ` Junio C Hamano
@ 2023-01-18  7:35         ` Carlo Arenas
  2023-01-18 11:49           ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 36+ messages in thread
From: Carlo Arenas @ 2023-01-18  7:35 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Ævar Arnfjörð Bjarmason, git, Diomidis Spinellis

On Tue, Jan 17, 2023 at 7:19 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
> > To argue with myself here, I'm not so sure that just making this the
> > default isn't the right move, especially as the GNU grep maintainer
> > seems to be convinced that that's the right thing for grep(1).
>
> OK.

I think that is definitely the right thing to do for grep, because the
current behaviour can only be described as a bug (and a bad one at
it), but after all the push back and performance testing, I am also
not convinced anymore it needs to be the default for git, because the
negatives outweigh the positives.

First there is the performance hit, which is inevitable because there
are just a lot more characters to match when UCP tables are being
used, and second there is the fact that PCRE2_UCP itself might not be
what you want when matching code, because for example numbers are
never going to be using digits outside what ASCII provides, and
identifiers have a narrow set of characters as valid than what you
would expect from all written human languages in history.

Lastly, even with PCRE2_UCP enabled, our current logic for word
matches is still broken, because the current code still uses a
definition of word that was done outside what the regex engines
provide and that roughly matches what you would expect of identifiers
from C in the ASCII times.

> > Of course all of this is predicated on us wanting to leave this as an
> > opt-in, which I'm not so sure about. If it's opt-out we'll avoid this
> > entire question,
>
> Making it opt-out would also require a similar knob to turn the
> "flag" off, be it a configuration variable or a command line option,
> wouldn't it?  I tend to agree with you that it makes sense to make
> it a goal to take us closer to "grep -P" from GNU---do they have
> such an opt-out knob?  If not, let's make it simple by turning it
> always on, which would be the simplest ;-)

GNU grep -P has no knob and would likely never have one.

So for now, I think we should acknowledge the bug, provide an option
for people that might need the fix, and fix all other problems we
have, which will include changes in PCRE2 as well to better fit our
use case.

Carlo

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] grep: correctly identify utf-8 characters with \{b,w} in -P
  2023-01-18  7:35         ` Carlo Arenas
@ 2023-01-18 11:49           ` Ævar Arnfjörð Bjarmason
  2023-01-18 16:20             ` Junio C Hamano
  0 siblings, 1 reply; 36+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-01-18 11:49 UTC (permalink / raw)
  To: Carlo Arenas; +Cc: Junio C Hamano, git, Diomidis Spinellis

On Tue, Jan 17 2023, Carlo Arenas wrote:

> On Tue, Jan 17, 2023 at 7:19 AM Junio C Hamano <gitster@pobox.com> wrote:
>>
>> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>>
>> > To argue with myself here, I'm not so sure that just making this the
>> > default isn't the right move, especially as the GNU grep maintainer
>> > seems to be convinced that that's the right thing for grep(1).
>>
>> OK.
>
> I think that is definitely the right thing to do for grep, because the
> current behaviour can only be described as a bug (and a bad one at
> it), but after all the push back and performance testing, I am also
> not convinced anymore it needs to be the default for git, because the
> negatives outweigh the positives.
>
> First there is the performance hit, which is inevitable because there
> are just a lot more characters to match when UCP tables are being
> used,[...]

I'm less concerned about the performance, we should aim for correctness
first. We can always provide an opt-out (and the locale setting is
already that opt-out).

> and second there is the fact that PCRE2_UCP itself might not be
> what you want when matching code, because for example numbers are
> never going to be using digits outside what ASCII provides, and
> identifiers have a narrow set of characters as valid than what you
> would expect from all written human languages in history.

[0-9] will be ASCII, but \d will use [^0-9] Unicode numbers.

I agree it might not be expected by some, but I can't really square that
view in my mind with the desire to match "\bÆvar" :). After all that "Æ"
is also arbitrary byte garbage in the ASCII-view of the world.

I can see how it might be more practical in some cases to have "\b" have
Unicode semantics, but to specifically make "\d" an exception. But the
ship has sailed on that in Perl & PCRE land years (or more than a decade
ago). I think us coming up with some exception to that would probably
suck more than going with their behavior.

> Lastly, even with PCRE2_UCP enabled, our current logic for word
> matches is still broken, because the current code still uses a
> definition of word that was done outside what the regex engines
> provide and that roughly matches what you would expect of identifiers
> from C in the ASCII times.

Yes, FWIW I have some WIP patches somewhere to get rid of that bit of
grep.c if we're using PCRE. I.e. the "-w" should be powered by just
adding "\b" to the start/end of the provided string.

That'll then be correct, and faster.

I can't remember if there were some subtle bugs in that, or why I didn't
finish that...

>> > Of course all of this is predicated on us wanting to leave this as an
>> > opt-in, which I'm not so sure about. If it's opt-out we'll avoid this
>> > entire question,
>>
>> Making it opt-out would also require a similar knob to turn the
>> "flag" off, be it a configuration variable or a command line option,
>> wouldn't it?  I tend to agree with you that it makes sense to make
>> it a goal to take us closer to "grep -P" from GNU---do they have
>> such an opt-out knob?  If not, let's make it simple by turning it
>> always on, which would be the simplest ;-)
>
> GNU grep -P has no knob and would likely never have one.

I think the general knob in not just GNU grep but GNU utils and the
wider *nix landscape is "tweak your LC_ALL and/or other locale
varibales".

Which works for it, and will work for us once we're using PCRE2_UCP too.

> So for now, I think we should acknowledge the bug, provide an option
> for people that might need the fix, and fix all other problems we
> have, which will include changes in PCRE2 as well to better fit our
> use case.

Hrm, what are those PCRE2 changes? The one I saw so far (or was it a
proposal) was to just make its "grep" utility use the PCRE2_UCP like GNU
grep is now doing in its unreleased version in its git repo...

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] grep: correctly identify utf-8 characters with \{b,w} in -P
  2023-01-18 11:49           ` Ævar Arnfjörð Bjarmason
@ 2023-01-18 16:20             ` Junio C Hamano
  2023-01-18 23:06               ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 36+ messages in thread
From: Junio C Hamano @ 2023-01-18 16:20 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Carlo Arenas, git, Diomidis Spinellis

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

>> GNU grep -P has no knob and would likely never have one.
>
> I think the general knob in not just GNU grep but GNU utils and the
> wider *nix landscape is "tweak your LC_ALL and/or other locale
> varibales".
>
> Which works for it, and will work for us once we're using PCRE2_UCP too.
>
>> So for now, I think we should acknowledge the bug, provide an option
>> for people that might need the fix, and fix all other problems we
>> have, which will include changes in PCRE2 as well to better fit our
>> use case.
>
> Hrm, what are those PCRE2 changes? The one I saw so far (or was it a
> proposal) was to just make its "grep" utility use the PCRE2_UCP like GNU
> grep is now doing in its unreleased version in its git repo...

Yeah, I didn't understand Carlo's comment in that paragraph at all.

In short, it sounds to me that the earlier one that added PCRE2_UCP
unconditionally would be the best alternative among those that have
been discussed.



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] grep: correctly identify utf-8 characters with \{b,w} in -P
  2023-01-18 16:20             ` Junio C Hamano
@ 2023-01-18 23:06               ` Ævar Arnfjörð Bjarmason
  2023-01-18 23:24                 ` Junio C Hamano
  0 siblings, 1 reply; 36+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-01-18 23:06 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Carlo Arenas, git, Diomidis Spinellis


On Wed, Jan 18 2023, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>
>>> GNU grep -P has no knob and would likely never have one.
>>
>> I think the general knob in not just GNU grep but GNU utils and the
>> wider *nix landscape is "tweak your LC_ALL and/or other locale
>> varibales".
>>
>> Which works for it, and will work for us once we're using PCRE2_UCP too.
>>
>>> So for now, I think we should acknowledge the bug, provide an option
>>> for people that might need the fix, and fix all other problems we
>>> have, which will include changes in PCRE2 as well to better fit our
>>> use case.
>>
>> Hrm, what are those PCRE2 changes? The one I saw so far (or was it a
>> proposal) was to just make its "grep" utility use the PCRE2_UCP like GNU
>> grep is now doing in its unreleased version in its git repo...
>
> Yeah, I didn't understand Carlo's comment in that paragraph at all.
>
> In short, it sounds to me that the earlier one that added PCRE2_UCP
> unconditionally would be the best alternative among those that have
> been discussed.

I agree.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3] grep: correctly identify utf-8 characters with \{b,w} in -P
  2023-01-18 23:06               ` Ævar Arnfjörð Bjarmason
@ 2023-01-18 23:24                 ` Junio C Hamano
  0 siblings, 0 replies; 36+ messages in thread
From: Junio C Hamano @ 2023-01-18 23:24 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Carlo Arenas, git, Diomidis Spinellis

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

>> In short, it sounds to me that the earlier one that added PCRE2_UCP
>> unconditionally would be the best alternative among those that have
>> been discussed.
>
> I agree.

So, ... we'll mark cb/grep-pcre-ucp, ea8bc435 (grep: correctly
identify utf-8 characters with \{b,w} in -P, 2023-01-08), to be
merged to 'next' and see what happens.  I'll add your Acked-by
while at it, if you do not mind.

---- >8 --------- >8 --------- >8 --------- >8 ------
From: Carlo Marcelo Arenas Belón <carenas@gmail.com>
Date: Sun, 8 Jan 2023 07:52:17 -0800
Subject: [PATCH] grep: correctly identify utf-8 characters with \{b,w} in -P

When UTF is enabled for a PCRE match, the corresponding flags are
added to the pcre2_compile() call, but PCRE2_UCP wasn't included.

This prevents extending the meaning of the character classes to
include those new valid characters and therefore result in failed
matches for expressions that rely on that extention, for ex:

  $ git grep -P '\bÆvar'

Add PCRE2_UCP so that \w will include Æ and therefore \b could
correctly match the beginning of that word.

This has an impact on performance that has been estimated to be
between 20% to 40% and that is shown through the added performance
test.

Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
Acked-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 grep.c                              |  2 +-
 t/perf/p7822-grep-perl-character.sh | 42 +++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+), 1 deletion(-)
 create mode 100755 t/perf/p7822-grep-perl-character.sh

diff --git a/grep.c b/grep.c
index 06eed69493..1687f65b64 100644
--- a/grep.c
+++ b/grep.c
@@ -293,7 +293,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 		options |= PCRE2_CASELESS;
 	}
 	if (!opt->ignore_locale && is_utf8_locale() && !literal)
-		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
+		options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF);
 
 #ifndef GIT_PCRE2_VERSION_10_36_OR_HIGHER
 	/* Work around https://bugs.exim.org/show_bug.cgi?id=2642 fixed in 10.36 */
diff --git a/t/perf/p7822-grep-perl-character.sh b/t/perf/p7822-grep-perl-character.sh
new file mode 100755
index 0000000000..87009c60df
--- /dev/null
+++ b/t/perf/p7822-grep-perl-character.sh
@@ -0,0 +1,42 @@
+#!/bin/sh
+
+test_description="git-grep's perl regex
+
+If GIT_PERF_GREP_THREADS is set to a list of threads (e.g. '1 4 8'
+etc.) we will test the patterns under those numbers of threads.
+"
+
+. ./perf-lib.sh
+
+test_perf_large_repo
+test_checkout_worktree
+
+if test -n "$GIT_PERF_GREP_THREADS"
+then
+	test_set_prereq PERF_GREP_ENGINES_THREADS
+fi
+
+for pattern in \
+	'\\bhow' \
+	'\\bÆvar' \
+	'\\d+ \\bÆvar' \
+	'\\bBelón\\b' \
+	'\\w{12}\\b'
+do
+	echo '$pattern' >pat
+	if ! test_have_prereq PERF_GREP_ENGINES_THREADS
+	then
+		test_perf "grep -P '$pattern'" --prereq PCRE "
+			git -P grep -f pat || :
+		"
+	else
+		for threads in $GIT_PERF_GREP_THREADS
+		do
+			test_perf "grep -P '$pattern' with $threads threads" --prereq PTHREADS,PCRE "
+				git -c grep.threads=$threads -P grep -f pat || :
+			"
+		done
+	fi
+done
+
+test_done
-- 
2.39.1-231-ga7caae2729


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* -P '\d' in GNU and git grep
@ 2023-04-03 21:38     ` Paul Eggert
  2023-04-04  3:30       ` bug#60690: " Jim Meyering
  2023-04-04  6:56       ` Carlo Arenas
  0 siblings, 2 replies; 36+ messages in thread
From: Paul Eggert @ 2023-04-03 21:38 UTC (permalink / raw)
  To: 60690
  Cc: Tukusej’s Sirs, Ævar Arnfjörð Bjarmason,
	mega lith01, demerphq, Carlo Marcelo Arenas Belón, pcre-dev,
	gitster, git

I've recently done some bug-report maintenance about a set of GNU grep 
bug reports related to whether whether "grep -P '\d'" should match 
non-ASCII digits, and have some thoughts about coordinating GNU grep 
with git grep in this department.

GNU Bug#62605[1] "`[\d]` does not work with PCRE" has been fixed on 
Savannah's copy of GNU grep, and some sort of fix should appear in the 
next grep release. However, I'm leaving the GNU grep bug report open for 
now because it's related to Bug#60690[2] "[PATCH v2] grep: correctly 
identify utf-8 characters with \{b,w} in -P" and to Bug#62552[3] "Bug 
found in latest stable release v3.10 of grep". I merged these related 
bug reports, and the oldest one, Bug#60690, is now the representative 
displayed in the GNU grep bug list[4].

For this set of grep bug reports there's still a pending issue discussed 
in my recent email[5], which proposes a patch so I've tagged Bug#60690 
with "patch". The proposal is that GNU grep -P '\d' should revert to the 
grep 3.9 behavior, i.e., that in a UTF-8 locale, \d should also match 
non-ASCII decimal digits.

In researching this a bit further, I found that on March 23 Git disabled 
the use of PCRE2_UCP in PCRE2 10.34 or earlier[6], due to a PCRE2 bug 
that can cause a crash when PCRE2_UCP is used[7]. A bug fix[8] should 
appear in the next PCRE2 release.

When PCRE2 10.35 comes out, it appears that 'git grep -P' will behave 
like 'grep -P' only if GNU grep adopts something like the solution 
proposed in [5].

[1]: https://bugs.gnu.org/62605
[2]: https://bugs.gnu.org/60690
[3]: https://bugs.gnu.org/62552
[4]: https://debbugs.gnu.org/cgi/pkgreport.cgi?package=grep
[5]: https://lists.gnu.org/archive/html/grep-devel/2023-04/msg00004.html
[6]: 
https://github.com/git/git/commit/14b9a044798ebb3858a1f1a1377309a3d6054ac8
[7]: 
https://lore.kernel.org/git/7E83DAA1-F9A9-4151-8D07-D80EA6D59EEA@clumio.com/
[8]: 
https://github.com/git/git/commit/14b9a044798ebb3858a1f1a1377309a3d6054ac8

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: bug#60690: -P '\d' in GNU and git grep
  2023-04-03 21:38     ` -P '\d' in GNU and git grep Paul Eggert
@ 2023-04-04  3:30       ` Jim Meyering
  2023-04-04  6:46         ` Paul Eggert
  2023-04-04  6:56       ` Carlo Arenas
  1 sibling, 1 reply; 36+ messages in thread
From: Jim Meyering @ 2023-04-04  3:30 UTC (permalink / raw)
  To: Paul Eggert
  Cc: 60690, demerphq, mega lith01, Carlo Marcelo Arenas Belón,
	Ævar Arnfjörð Bjarmason, git, gitster,
	Tukusej’s Sirs, pcre-dev

On Mon, Apr 3, 2023 at 2:39 PM Paul Eggert <eggert@cs.ucla.edu> wrote:
> I've recently done some bug-report maintenance about a set of GNU grep
> bug reports related to whether whether "grep -P '\d'" should match
> non-ASCII digits, and have some thoughts about coordinating GNU grep
> with git grep in this department.
>
> GNU Bug#62605[1] "`[\d]` does not work with PCRE" has been fixed on
> Savannah's copy of GNU grep, and some sort of fix should appear in the
> next grep release. However, I'm leaving the GNU grep bug report open for
> now because it's related to Bug#60690[2] "[PATCH v2] grep: correctly
> identify utf-8 characters with \{b,w} in -P" and to Bug#62552[3] "Bug
> found in latest stable release v3.10 of grep". I merged these related
> bug reports, and the oldest one, Bug#60690, is now the representative
> displayed in the GNU grep bug list[4].
>
> For this set of grep bug reports there's still a pending issue discussed
> in my recent email[5], which proposes a patch so I've tagged Bug#60690
> with "patch". The proposal is that GNU grep -P '\d' should revert to the
> grep 3.9 behavior, i.e., that in a UTF-8 locale, \d should also match
> non-ASCII decimal digits.
>
> In researching this a bit further, I found that on March 23 Git disabled
> the use of PCRE2_UCP in PCRE2 10.34 or earlier[6], due to a PCRE2 bug
> that can cause a crash when PCRE2_UCP is used[7]. A bug fix[8] should
> appear in the next PCRE2 release.
>
> When PCRE2 10.35 comes out,

Thanks for finding that.
It's clearly a good idea to disable PCRE2_UCP for those using those
older, known-buggy versions of pcre2.

The latest is 10.42, per https://github.com/PCRE2Project/pcre2/releases

> it appears that 'git grep -P' will behave
> like 'grep -P' only if GNU grep adopts something like the solution
> proposed in [5].
>
> [1]: https://bugs.gnu.org/62605
> [2]: https://bugs.gnu.org/60690
> [3]: https://bugs.gnu.org/62552
> [4]: https://debbugs.gnu.org/cgi/pkgreport.cgi?package=grep
> [5]: https://lists.gnu.org/archive/html/grep-devel/2023-04/msg00004.html
> [6]:
> https://github.com/git/git/commit/14b9a044798ebb3858a1f1a1377309a3d6054ac8
> [7]:
> https://lore.kernel.org/git/7E83DAA1-F9A9-4151-8D07-D80EA6D59EEA@clumio.com/
> [8]:
> https://github.com/git/git/commit/14b9a044798ebb3858a1f1a1377309a3d6054ac8

Thanks for all of the links. However, have you seen justification
(other than for compatibility with some other tool or language) for
allowing \d to match non-ASCII by default, in spite of the risks?
IMHO, we have an obligation to retain compatibility with how grep -P
'\d' has worked since -P was added. I'd be happy to see an option to
enable the match-multibyte-digits behavior, but making it the default
seems too likely to introduce unwarranted risk.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: bug#60690: -P '\d' in GNU and git grep
  2023-04-04  3:30       ` bug#60690: " Jim Meyering
@ 2023-04-04  6:46         ` Paul Eggert
  2023-04-04 15:31           ` Jim Meyering
  0 siblings, 1 reply; 36+ messages in thread
From: Paul Eggert @ 2023-04-04  6:46 UTC (permalink / raw)
  To: Jim Meyering
  Cc: 60690, demerphq, mega lith01, Carlo Marcelo Arenas Belón,
	Ævar Arnfjörð Bjarmason, git, gitster,
	Tukusej’s Sirs, pcre-dev

On 2023-04-03 20:30, Jim Meyering wrote:
> have you seen justification
> (other than for compatibility with some other tool or language) for
> allowing \d to match non-ASCII by default, in spite of the risks?

In the example Ævar supplied in <https://bugs.gnu.org/60690>, my 
impression was that it was better when \d matched non-ASCII digits. That 
is, in a UTF-8 locale it's better when \d finds matches in these lines:

>> 	> git-gui/po/ja.po:"- 第１行: 何をしたか、を１行で要約。\n"
>> 	> git-gui/po/ja.po:"- 第２行: 空白\n"

because they contain the Japanese digits "１" and "２". This was the only 
example I recall being given.

Also, I find it odd that grep -P '^[\w\d]*$' matches lines containing 
any sort of Arabic word characters, but it rejects lines containing 
Arabic digits like "٣" that are perfectly reasonable in Arabic-language 
text. I also find it odd that [\d] and [[:digit:]] mean different things.

There are arguments on the other side, otherwise we wouldn't be having 
this discussion. And it's true that grep -P '\d' formerly rejected 
Arabic digits (though it's also true that grep -P '\w' formerly rejected 
Arabic letters...). Still, the cure's oddness and incompatibility with 
Git, Perl, etc. appears to me to be worse than the disease of dealing 
with grep -P invocations that need to use [0-9] or LC_ALL="C" anyway if 
they want to be portable to any program other than GNU grep.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: -P '\d' in GNU and git grep
  2023-04-03 21:38     ` -P '\d' in GNU and git grep Paul Eggert
  2023-04-04  3:30       ` bug#60690: " Jim Meyering
@ 2023-04-04  6:56       ` Carlo Arenas
  2023-04-04 18:25         ` bug#60690: " Paul Eggert
  1 sibling, 1 reply; 36+ messages in thread
From: Carlo Arenas @ 2023-04-04  6:56 UTC (permalink / raw)
  To: Paul Eggert
  Cc: 60690, Tukusej’s Sirs,
	Ævar Arnfjörð Bjarmason, mega lith01, demerphq,
	pcre-dev, gitster, git

On Mon, Apr 3, 2023 at 2:38 PM Paul Eggert <eggert@cs.ucla.edu> wrote:
>
> In researching this a bit further, I found that on March 23 Git disabled
> the use of PCRE2_UCP in PCRE2 10.34 or earlier[6], due to a PCRE2 bug
> that can cause a crash when PCRE2_UCP is used[7]. A bug fix[8] should
> appear in the next PCRE2 release.

Presume PCRE2 is a typo and should have been "git" here?

FWIW the PCRE2 fix[1] has been released already with 10.35 and
backporting to the Ubuntu 20.04 package that crashed in the original
report would also solve the crash with 10.34.

Carlo

[1] https://github.com/PCRE2Project/pcre2/commit/c21bd977547d

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: bug#60690: -P '\d' in GNU and git grep
  2023-04-04  6:46         ` Paul Eggert
@ 2023-04-04 15:31           ` Jim Meyering
  0 siblings, 0 replies; 36+ messages in thread
From: Jim Meyering @ 2023-04-04 15:31 UTC (permalink / raw)
  To: Paul Eggert
  Cc: 60690, demerphq, mega lith01, Carlo Marcelo Arenas Belón,
	Ævar Arnfjörð Bjarmason, git, gitster,
	Tukusej’s Sirs, pcre-dev

On Mon, Apr 3, 2023 at 11:47 PM Paul Eggert <eggert@cs.ucla.edu> wrote:
> On 2023-04-03 20:30, Jim Meyering wrote:
> > have you seen justification
> > (other than for compatibility with some other tool or language) for
> > allowing \d to match non-ASCII by default, in spite of the risks?
>
> In the example Ævar supplied in <https://bugs.gnu.org/60690>, my
> impression was that it was better when \d matched non-ASCII digits. That
> is, in a UTF-8 locale it's better when \d finds matches in these lines:
>
> >>      > git-gui/po/ja.po:"- 第１行: 何をしたか、を１行で要約。\n"
> >>      > git-gui/po/ja.po:"- 第２行: 空白\n"
>
> because they contain the Japanese digits "１" and "２". This was the only
> example I recall being given.

Before it was unintentionally enabled in grep-3.9, lines like that have
never been matched by grep -P's '\d'. By relaxing \d, we'd weaken
any application that uses say grep -P '^\d+$' to perform input
validation intending to ensure that some input is all ASCII digits.
It's not a big stretch to imagine that some downstream processor
of that "verified" data is not prepared to deal with multi-byte digits.

> Also, I find it odd that grep -P '^[\w\d]*$' matches lines containing
> any sort of Arabic word characters, but it rejects lines containing
> Arabic digits like "٣" that are perfectly reasonable in Arabic-language
> text. I also find it odd that [\d] and [[:digit:]] mean different things.
>
> There are arguments on the other side, otherwise we wouldn't be having
> this discussion. And it's true that grep -P '\d' formerly rejected
> Arabic digits (though it's also true that grep -P '\w' formerly rejected
> Arabic letters...). Still, the cure's oddness and incompatibility with
> Git, Perl, etc. appears to me to be worse than the disease of dealing
> with grep -P invocations that need to use [0-9] or LC_ALL="C" anyway if
> they want to be portable to any program other than GNU grep.

I'm primarily concerned about not introducing a persistent regression in
how GNU grep's -P '\d' works in multibyte locales. The corner cases you
mention do matter, of course, but are far less likely to matter in practice.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: bug#60690: -P '\d' in GNU and git grep
  2023-04-04  6:56       ` Carlo Arenas
@ 2023-04-04 18:25         ` Paul Eggert
  2023-04-04 19:31           ` Junio C Hamano
  0 siblings, 1 reply; 36+ messages in thread
From: Paul Eggert @ 2023-04-04 18:25 UTC (permalink / raw)
  To: Carlo Arenas
  Cc: demerphq, 60690, mega lith01,
	Ævar Arnfjörð Bjarmason, git, gitster,
	Tukusej’s Sirs, pcre-dev

On 4/3/23 23:56, Carlo Arenas wrote:
> On Mon, Apr 3, 2023 at 2:38 PM Paul Eggert <eggert@cs.ucla.edu> wrote:
>>
>> on March 23 Git disabled
>> the use of PCRE2_UCP in PCRE2 10.34 or earlier[6], due to a PCRE2 bug
>> that can cause a crash when PCRE2_UCP is used[7]. A bug fix[8] should
>> appear in the next PCRE2 release.
> 
> Presume PCRE2 is a typo and should have been "git" here?

No, I was talking about what options Git uses when it calls PCRE2 
functions. In other words, this is about whether GNU 'grep -P' should be 
compatible with 'git grep -P' (as well as with Perl and with pcregrep), 
when interpreting \d and similar constructs.

This is an evolving area. Git master is fiddling with flags and options, 
and so is GNU grep master, and so is PCRE2, and there are bugs. If 
you're running bleeding-edge versions of this code you'll get different 
behavior than if you're running grep 3.8, pcregrep 8.45, Perl 5.36, and 
git 2.39.2 (which is what Fedora 37 has).

What I'm fearing is that we may evolve into mutually incompatible 
interpretations of how Perl regular expressions deal with UTF-8 text. 
That'd be a recipe for confusion down the road.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: bug#60690: -P '\d' in GNU and git grep
  2023-04-04 18:25         ` bug#60690: " Paul Eggert
@ 2023-04-04 19:31           ` Junio C Hamano
  2023-04-05 18:32             ` Paul Eggert
  2023-04-06 13:39             ` demerphq
  0 siblings, 2 replies; 36+ messages in thread
From: Junio C Hamano @ 2023-04-04 19:31 UTC (permalink / raw)
  To: Paul Eggert
  Cc: Carlo Arenas, demerphq, 60690, mega lith01,
	Ævar Arnfjörð Bjarmason, git, Tukusej’s Sirs,
	pcre-dev

Paul Eggert <eggert@cs.ucla.edu> writes:

> This is an evolving area. Git master is fiddling with flags and
> options, and so is GNU grep master, and so is PCRE2, and there are
> bugs. If you're running bleeding-edge versions of this code you'll get
> different behavior than if you're running grep 3.8, pcregrep 8.45,
> Perl 5.36, and git 2.39.2 (which is what Fedora 37 has).
>
> What I'm fearing is that we may evolve into mutually incompatible
> interpretations of how Perl regular expressions deal with UTF-8
> text. That'd be a recipe for confusion down the road.

Nicely said.  My personal inclination is to let Perl folks decide
and follow them (even though I am skeptical about the wisdom of
letting '\d' match anything other than [0-9]), but even in Git
circle there would be different opinions, so I am glad that the
discussion is visible on the list to those who are intrested.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: bug#60690: -P '\d' in GNU and git grep
  2023-04-04 19:31           ` Junio C Hamano
@ 2023-04-05 18:32             ` Paul Eggert
  2023-04-05 19:04               ` Paul Eggert
                                 ` (3 more replies)
  2023-04-06 13:39             ` demerphq
  1 sibling, 4 replies; 36+ messages in thread
From: Paul Eggert @ 2023-04-05 18:32 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Carlo Arenas, demerphq, 60690, mega lith01,
	Ævar Arnfjörð Bjarmason, git, Tukusej’s Sirs,
	pcre-dev, Philip.Hazel

On 2023-04-04 12:31, Junio C Hamano wrote:

> My personal inclination is to let Perl folks decide
> and follow them (even though I am skeptical about the wisdom of
> letting '\d' match anything other than [0-9])

I looked into what pcre2grep does. It has always done only 8-bit 
processing unless you use the -u or --utf option, so plain "pcre2grep 
'\d'" matches only ASCII digits.

Although this causes pcre2grep to mishandle Unicode characters:

   $ echo 'Ævar' | pcre2grep '[Ssß]'
   Ævar

it mimics Perl 5.36:

   $ echo 'Ævar' | perl -ne 'print $_ if /[Ssß]/'
   Ævar

so this seems to be what Perl users expect, despite its infelicities.

For better Unicode handling one can use pcre2grep's -u or --utf option, 
which causes pcre2grep to behave more like GNU grep -P and git grep -P: 
"echo 'Ævar' | pcre2grep -u '[Ssß]'" outputs nothing, which I think is 
what most people would expect (unless they're Perl users :-).

Neither git grep -P nor the current release of pcre2grep -u have \d 
matching non-ASCII digits, because they do not use PCRE2_UCP. However, 
in a February 8 commit[1], Philip Hazel changed pcre2grep to use 
PCRE2_UCP, so this will mean 10.43 pcre2grep -u will behave like 3.9 GNU 
grep -P did (though 3.10 has changed this).

That February commit also added a --no-ucp option, to disable PCRE2_UCP. 
So as I understand it, if you're in a UTF-8 locale:

* 10.43 pcre2grep -u will behave like 3.9 GNU grep -P.

* 10.43 pcre2grep -u --no-ucp will behave like git grep -P.

* Current GNU grep -P is different from everybody else.

This incompatibility is not good.

Here are two ways forward to fix this incompatibility (there are other 
possibilities of course):

(A) GNU grep adds a --no-ucp option that acts like 10.43 pcre2grep 
--no-ucp, and git grep -P follows suit. That is, both GNU and git grep 
act like 10.43 pcre2grep -u, in that they enable PCRE2_UTF, and also 
enable PCRE2_UCP unless --no-ucp is given. This would cause \d to match 
non-ASCII digits unless --no-ucp is given.

(B) GNU grep -P and git grep -P mimic pcre2grep in both -u and --no-ucp. 
That is, they would both do 8-bit-only by default, and use PCRE2_UTF 
only when -u or --utf is given, and use PCRE2_UCP only when --no-ucp is 
absent. This would cause \d to match non-ASCII digits only when -u is 
given but --no-ucp is not.

Under either (A) or (B), future pcre2grep -u, GNU grep -P, and git grep 
-P would be consistent.

I mildly prefer (B) but (A) would also work. (One advantage of (B) is 
that it should be faster....)

[1]: 
https://github.com/PCRE2Project/pcre2/commit/8385df8c97b6f8069a48e600c7e4e94cc3e3ebd9ht

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: bug#60690: -P '\d' in GNU and git grep
  2023-04-05 18:32             ` Paul Eggert
@ 2023-04-05 19:04               ` Paul Eggert
  2023-04-05 19:37               ` Junio C Hamano
                                 ` (2 subsequent siblings)
  3 siblings, 0 replies; 36+ messages in thread
From: Paul Eggert @ 2023-04-05 19:04 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: demerphq, Philip.Hazel, 60690, mega lith01, Carlo Arenas,
	Ævar Arnfjörð Bjarmason, pcre-dev,
	Tukusej’s Sirs, git

On 2023-04-05 11:32, Paul Eggert wrote:

> in a February 8 commit[1], Philip Hazel changed pcre2grep to use 
> PCRE2_UCP, so this will mean 10.43 pcre2grep -u will behave like 3.9 GNU 
> grep -P did (though 3.10 has changed this).

Sorry, due to fumblefingers I gave the wrong URL for [1]. Here's a 
corrected URL:

https://github.com/PCRE2Project/pcre2/commit/8385df8c97b6f8069a48e600c7e4e94cc3e3ebd9

It also mentions a new --case-restrict option, intended for 10.43 
pcre2grep. Given Perl's and PCRE2's plethora of options I suppose one 
could imagine several other options of that ilk.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: bug#60690: -P '\d' in GNU and git grep
  2023-04-05 18:32             ` Paul Eggert
  2023-04-05 19:04               ` Paul Eggert
@ 2023-04-05 19:37               ` Junio C Hamano
  2023-04-05 19:40               ` Jim Meyering
  2023-04-06 15:45               ` demerphq
  3 siblings, 0 replies; 36+ messages in thread
From: Junio C Hamano @ 2023-04-05 19:37 UTC (permalink / raw)
  To: Paul Eggert
  Cc: Carlo Arenas, demerphq, 60690, mega lith01,
	Ævar Arnfjörð Bjarmason, git, Tukusej’s Sirs,
	pcre-dev, Philip.Hazel

Paul Eggert <eggert@cs.ucla.edu> writes:

> Here are two ways forward to fix this incompatibility (there are other
> possibilities of course):
>
> (A) GNU grep adds a --no-ucp option that acts like 10.43 pcre2grep
> --no-ucp, and git grep -P follows suit. That is, both GNU and git grep
> act like 10.43 pcre2grep -u, in that they enable PCRE2_UTF, and also
> enable PCRE2_UCP unless --no-ucp is given. This would cause \d to
> match non-ASCII digits unless --no-ucp is given.
>
> (B) GNU grep -P and git grep -P mimic pcre2grep in both -u and
> --no-ucp. That is, they would both do 8-bit-only by default, and use
> PCRE2_UTF only when -u or --utf is given, and use PCRE2_UCP only when
> --no-ucp is absent. This would cause \d to match non-ASCII digits only
> when -u is given but --no-ucp is not.
>
> Under either (A) or (B), future pcre2grep -u, GNU grep -P, and git
> grep -P would be consistent.
>
> I mildly prefer (B) but (A) would also work. (One advantage of (B) is
> that it should be faster....)

For "git grep -P", I would like to hear from Carlo and Ævar; I agree
both (A) and (B) would be workable solutions, and have a slight
preference on a solution that does not add more options that take
only in effect when -P is given, simply because these options are
cumbersome to document and explain, but that is a very minor point.

Thanks.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: bug#60690: -P '\d' in GNU and git grep
  2023-04-05 18:32             ` Paul Eggert
  2023-04-05 19:04               ` Paul Eggert
  2023-04-05 19:37               ` Junio C Hamano
@ 2023-04-05 19:40               ` Jim Meyering
  2023-04-05 20:03                 ` Paul Eggert
  2023-04-05 21:20                 ` Carlo Arenas
  2023-04-06 15:45               ` demerphq
  3 siblings, 2 replies; 36+ messages in thread
From: Jim Meyering @ 2023-04-05 19:40 UTC (permalink / raw)
  To: Paul Eggert
  Cc: Junio C Hamano, demerphq, Philip.Hazel, 60690, mega lith01,
	Carlo Arenas, Ævar Arnfjörð Bjarmason, pcre-dev,
	Tukusej’s Sirs, git

On Wed, Apr 5, 2023 at 11:33 AM Paul Eggert <eggert@cs.ucla.edu> wrote:
> On 2023-04-04 12:31, Junio C Hamano wrote:
> > My personal inclination is to let Perl folks decide
> > and follow them (even though I am skeptical about the wisdom of
> > letting '\d' match anything other than [0-9])
>
> I looked into what pcre2grep does. It has always done only 8-bit
> processing unless you use the -u or --utf option, so plain "pcre2grep
> '\d'" matches only ASCII digits.
>
> Although this causes pcre2grep to mishandle Unicode characters:
>
>    $ echo 'Ævar' | pcre2grep '[Ssß]'
>    Ævar
>
> it mimics Perl 5.36:
>
>    $ echo 'Ævar' | perl -ne 'print $_ if /[Ssß]/'
>    Ævar
>
> so this seems to be what Perl users expect, despite its infelicities.
>
> For better Unicode handling one can use pcre2grep's -u or --utf option,
> which causes pcre2grep to behave more like GNU grep -P and git grep -P:
> "echo 'Ævar' | pcre2grep -u '[Ssß]'" outputs nothing, which I think is
> what most people would expect (unless they're Perl users :-).

Good argument for making PCRE2_UCP the default.

> Neither git grep -P nor the current release of pcre2grep -u have \d
> matching non-ASCII digits, because they do not use PCRE2_UCP. However,
> in a February 8 commit[1], Philip Hazel changed pcre2grep to use
> PCRE2_UCP, so this will mean 10.43 pcre2grep -u will behave like 3.9 GNU
> grep -P did (though 3.10 has changed this).
>
> That February commit also added a --no-ucp option, to disable PCRE2_UCP.
> So as I understand it, if you're in a UTF-8 locale:
>
> * 10.43 pcre2grep -u will behave like 3.9 GNU grep -P.
>
> * 10.43 pcre2grep -u --no-ucp will behave like git grep -P.
>
> * Current GNU grep -P is different from everybody else.
>
> This incompatibility is not good.
>
> Here are two ways forward to fix this incompatibility (there are other
> possibilities of course):
>
> (A) GNU grep adds a --no-ucp option that acts like 10.43 pcre2grep
> --no-ucp, and git grep -P follows suit. That is, both GNU and git grep
> act like 10.43 pcre2grep -u, in that they enable PCRE2_UTF, and also
> enable PCRE2_UCP unless --no-ucp is given. This would cause \d to match
> non-ASCII digits unless --no-ucp is given.
>
> (B) GNU grep -P and git grep -P mimic pcre2grep in both -u and --no-ucp.
> That is, they would both do 8-bit-only by default, and use PCRE2_UTF
> only when -u or --utf is given, and use PCRE2_UCP only when --no-ucp is
> absent. This would cause \d to match non-ASCII digits only when -u is
> given but --no-ucp is not.

Changing grep -P's \d to match multibyte digits by default would break
an important contract. Avoiding that feels like it must outweigh any
cross-tool portability concern.

(C)  preserve grep -P's tradition of \d matching only 0..9, and once
grep uses 10.43 or newer, \b and \w will also work as desired.

> Under either (A) or (B), future pcre2grep -u, GNU grep -P, and git grep
> -P would be consistent.

I hope git grep -P's \d will also stick to ASCII-only by default.
Those rare few who desire multibyte matches can always specify \p{Nd}
instead of \d, or (with new enough PCRE2), use (?-aD) and (?aD) to
toggle the digit-matching mode.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: bug#60690: -P '\d' in GNU and git grep
  2023-04-05 19:40               ` Jim Meyering
@ 2023-04-05 20:03                 ` Paul Eggert
  2023-04-05 21:20                 ` Carlo Arenas
  1 sibling, 0 replies; 36+ messages in thread
From: Paul Eggert @ 2023-04-05 20:03 UTC (permalink / raw)
  To: Jim Meyering
  Cc: Junio C Hamano, demerphq, Philip.Hazel, 60690, mega lith01,
	Carlo Arenas, Ævar Arnfjörð Bjarmason, pcre-dev,
	Tukusej’s Sirs, git

On 2023-04-05 12:40, Jim Meyering wrote:
> (C)  preserve grep -P's tradition of \d matching only 0..9, and once
> grep uses 10.43 or newer, \b and \w will also work as desired.

If I understand you correctly, (C) would mean that GNU grep -P, git grep 
-P, and pcre2grep -u would all use PCRE2_UTF | PCRE2_UCP, and would also 
use the extra option PCRE2_EXTRA_ASCII_BSD that is planned for 10.43 PCRE2.

This would require changes to bleeding-edge pcre2grep -u (since it would 
need to add PCRE2_EXTRA_ASCII_BSD unless --no-ucp is also given), and to 
git grep -P (which would need to add PCRE2_UCP and 
PCRE2_EXTRA_ASCII_BSD, when libpcre2 is new enough to #define 
PCRE2_EXTRA_ASCII_BSD).

This option works for me as well. In fact it's the least work for me 
since I already implemented it in bleeding-edge GNU grep (so it works 
this way already :-).

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: bug#60690: -P '\d' in GNU and git grep
  2023-04-05 19:40               ` Jim Meyering
  2023-04-05 20:03                 ` Paul Eggert
@ 2023-04-05 21:20                 ` Carlo Arenas
  1 sibling, 0 replies; 36+ messages in thread
From: Carlo Arenas @ 2023-04-05 21:20 UTC (permalink / raw)
  To: Jim Meyering
  Cc: Paul Eggert, Junio C Hamano, demerphq, Philip.Hazel, 60690,
	mega lith01, Ævar Arnfjörð Bjarmason,
	Tukusej’s Sirs, git, pcre2-dev

On Wed, Apr 5, 2023 at 12:40 PM Jim Meyering <jim@meyering.net> wrote:
>
> Changing grep -P's \d to match multibyte digits by default would break
> an important contract.

While I tend to agree[1] (and indeed that is why PCRE2_EXTRA_ASCII_BSD
was invented), it would be also important to note that it goes against
the Unicode recommendation[2] and it is actually not true already[3]
for Python, .NET or Rust (which means ripgrep behaves like GNU grep -P
3.9).

FWIW I also agree that (at least `git grep -P`) should use
PCRE2_EXTRA_ASCII_BSD by default as that is what makes more sense in
the context of matching source code and using instead `\P{Nd}` if you
really want all Unicode digits is not much of a burden, but I am also
not sure if that makes sense in other contexts, specially considering
that I am obviously biased since the languages I mostly interact with
ONLY use arabic numerals and therefore `\d` meaning `[0-9]` seems
"normal".

Carlo

CC: changed to the real email address for PCRE2 development, for full
context on this thread use [4]

[1] https://github.com/PCRE2Project/pcre2/pull/186
[2] https://unicode.org/reports/tr18/
[3] https://regex101.com/r/S5RW4c/1
[4] https://lore.kernel.org/git/230109.86v8lf297g.gmgdl@evledraar.gmail.com/T/

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: bug#60690: -P '\d' in GNU and git grep
  2023-04-04 19:31           ` Junio C Hamano
  2023-04-05 18:32             ` Paul Eggert
@ 2023-04-06 13:39             ` demerphq
  2023-04-07 19:00               ` Paul Eggert
  1 sibling, 1 reply; 36+ messages in thread
From: demerphq @ 2023-04-06 13:39 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Paul Eggert, Carlo Arenas, 60690, mega lith01,
	Ævar Arnfjörð Bjarmason, git, Tukusej’s Sirs,
	pcre-dev

On Tue, 4 Apr 2023 at 21:31, Junio C Hamano <gitster@pobox.com> wrote:
>
> Paul Eggert <eggert@cs.ucla.edu> writes:
>
> > This is an evolving area. Git master is fiddling with flags and
> > options, and so is GNU grep master, and so is PCRE2, and there are
> > bugs. If you're running bleeding-edge versions of this code you'll get
> > different behavior than if you're running grep 3.8, pcregrep 8.45,
> > Perl 5.36, and git 2.39.2 (which is what Fedora 37 has).
> >
> > What I'm fearing is that we may evolve into mutually incompatible
> > interpretations of how Perl regular expressions deal with UTF-8
> > text. That'd be a recipe for confusion down the road.
>
> Nicely said.  My personal inclination is to let Perl folks decide
> and follow them (even though I am skeptical about the wisdom of
> letting '\d' match anything other than [0-9]), but even in Git
> circle there would be different opinions, so I am glad that the
> discussion is visible on the list to those who are intrested.

Perl matches Unicode text according to the rules specified by the
Unicode consortium. It is the reference implementation for Unicode
regular expression matching. Unicode specifies that \d match any digit
in any script that it supports. Thus \d matches far more codepoints
than \p{PosixDigit} or [0-9] would.  Be aware that Unicode contains
and separates numbers and digits, eg, \x{1EC9E} represents a Lakh,
which is used in many Indian languages for 100,000, but which is not
considered a *digit* for obvious reasons.

FWIW, someone mentioned [[:digit:]] which matches the same as \d does
on Unicode strings and under the /u matching flag for regexes in Perl.
Arguably this was a mistake, [[:digit:]] is a POSIX character class,
and POSIX doesn't support Unicode so it should have matched [0-9] or
\p{PosixDigit}. But historically \d and [[:digit:]] in Perl were the
same and when \d was extended to meet the Unicode specification
[[:digit:]] came along for the ride likely inadvertently, thus
\p{PosixDigit} is equivalent to [0-9], but \p{XPosixDigit} is
equivalent to \d and [[:digit:]].

I notice that other posts in this thread have moved the conversation
on, and covered most of the points I wanted to make here. However I
wanted to say that there seem to be two different issues here. The
first is "what semantics do i expect from my regular expressions",
Unicode or legacy-ASCII, mostly this relates to case-insensitive
matching, but things like \d also surface discrepancies. The second is
"what encodings does the regular expression engine understand".
Unfortunately on *nix there is no tradition of using BOM's to
distinguish the 6 different possible encodings of Unicode (UTF-8,
UTF-EBCDIC, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE), and there seems
to be some level of desire of matching with unicode semantics against
files that are not uniformly encoded in one of these formats.

So the question comes up, A) how do you tell the regular expression
engine what semantics you want and B) how does the regular expression
library identify the encoding in the file, and how does it handle
malformed content in that file. For instance if I have a file which
contains snippets of UTF8 encoded data, *and* snippets of data that is
illegal in UTF8, what should the regular expression engine do if it is
asked to do a case insensitive match against that file.

cheers,
yves

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: bug#60690: -P '\d' in GNU and git grep
  2023-04-05 18:32             ` Paul Eggert
                                 ` (2 preceding siblings ...)
  2023-04-05 19:40               ` Jim Meyering
@ 2023-04-06 15:45               ` demerphq
  2023-04-07 16:48                 ` Paul Eggert
  3 siblings, 1 reply; 36+ messages in thread
From: demerphq @ 2023-04-06 15:45 UTC (permalink / raw)
  To: Paul Eggert
  Cc: Junio C Hamano, Carlo Arenas, 60690, mega lith01,
	Ævar Arnfjörð Bjarmason, git, Tukusej’s Sirs,
	pcre-dev, Philip.Hazel

On Wed, 5 Apr 2023 at 20:32, Paul Eggert <eggert@cs.ucla.edu> wrote:
>
> On 2023-04-04 12:31, Junio C Hamano wrote:
>
> > My personal inclination is to let Perl folks decide
> > and follow them (even though I am skeptical about the wisdom of
> > letting '\d' match anything other than [0-9])
>
> I looked into what pcre2grep does. It has always done only 8-bit
> processing unless you use the -u or --utf option, so plain "pcre2grep
> '\d'" matches only ASCII digits.
>
> Although this causes pcre2grep to mishandle Unicode characters:
>
>    $ echo 'Ævar' | pcre2grep '[Ssß]'
>    Ævar
>
> it mimics Perl 5.36:
>
>    $ echo 'Ævar' | perl -ne 'print $_ if /[Ssß]/'
>    Ævar
>
> so this seems to be what Perl users expect, despite its infelicities.

Actually no, I think you have misunderstood what is happening at the
different layers involved here.

Your terminal is rendering ß as a glyph. But it is almost certainly
actually the octets C3 9F (which is the UTF8 canonical representation
of the codepoint U+DF). So the code you provided to perl is close to
the equivalent of

echo 'Ævar' | perl -ne 'print $_ if /[Ss\x{C3}\x{9F}]/'

And if you check, you will see that U+C6 "Æ" in utf8 is represented as
the octets C3 86.

So what you have done is the equivalent of:

perl -le'print "\x{C3}\x{86}"' | perl -ne'print $_ if /[Ss\x{C3}\x{9F}]/'

which of course matches. \x{C3} matches \x{C3} always and everywhere.

What you should have done is something like this:

$ echo 'Ævar' | perl -ne 'utf8::decode($_); print $_ if /[Ss\x{DF}]/u'
$ echo 'baß' | perl -MEncode -ne 'utf8::decode($_); print
encode_utf8($_) if /[Ss\x{DF}]/u'
baß
$ echo 'Ævar' | perl -MEncode -ne 'utf8::decode($_); print
encode_utf8($_) if /[Ss\x{C6}]/u'
Ævar
$ echo 'Ævar' | perl -MEncode -ne 'utf8::decode($_); print
encode_utf8($_) if /[Ss\x{e6}]/ui'
Ævar

The "utf8::decode($_)" tells perl to decode the input string as though
it contained utf8 (which in this case it does). THe /u suffix tells
the regex engine that you want Unicode semantics.

I believe that the same thing is true of your pcre2grep example. You
simply aren't checking what you think you are checking. You terminal
renders UTF8 as glyphs, but the programs you are feeding those glyphs
to aren't seeing glyphs, they are seeing UTF8 sequences as distinct
octets, and are not decoding their input back as codepoints.

You could have checked your assumptions by using the -Mre=debug option to perl:

$ echo 'Ævar' | perl -Mre=debug -ne 'print $_ if /[Ssß]/'
Compiling REx "[Ss%x{c3}%x{9f}]"
Final program:
   1: ANYOF[Ss\x9F\xC3] (11)
  11: END (0)
stclass ANYOF[Ss\x9F\xC3] minlen 1
Matching REx "[Ss%x{c3}%x{9f}]" against "%x{c3}%x{86}var%n"
Matching stclass ANYOF[Ss\x9F\xC3] against "%x{c3}%x{86}var%n" (6 bytes)
   0 <> <%x{c3}>             |   0| 1:ANYOF[Ss\x9F\xC3](11)
   1 <%x{c3}> <%x{86}var>    |   0| 11:END(0)
Match successful!
Ævar
Freeing REx: "[Ss%x{c3}%x{9f}]"

The line: Matching REx "[Ss%x{c3}%x{9f}]" against "%x{c3}%x{86}var%n"

basically says it all. Perl has not decoded the UTF8 into U+C6, and it
has not decoded the UTF8 for U+DF either. Instead you have asked it if
the UTF8 sequence that represents U+C6 contains any of the same octets
as the UTF8 representation of U+53, U+73 and U+DF would. Which gives
the common octet of \x{c3}.

> For better Unicode handling one can use pcre2grep's -u or --utf option,
> which causes pcre2grep to behave more like GNU grep -P and git grep -P:
> "echo 'Ævar' | pcre2grep -u '[Ssß]'" outputs nothing, which I think is
> what most people would expect (unless they're Perl users :-).

It is what Perl users would expect also, assuming you actually wrote
the character class [Ss\x{DF}] and asked for unicode semantics. \x{DF}
is the Latin1 codepoint range, so perl will assume that you meant
ASCII semantics unless you tell it otherwise.

Basically these tests you have quoted here are just examples of
garbage in garbage out.

Perl has been working together with the Unicode consortium for over 20
years. Afaik we were and are the reference implementation for the spec
on regular expression matching in Unicode and we have a long history
of working together with the Unicode consortium to refine and
implement the spec. You should assume that if Perl seems to have made
a gross error in how it does Unicode matching that you are simply
using it wrong, we take a great deal of pride in having the best
Unicode support there is.

https://unicode.org/reports/tr18/

FWIW, i think this email nicely illustrates the issues with git and
regular expressions. To do regular expressions properly you need to
know a) what semantics do you expect, b) how to decode the text you
are matching against. If you want unicode semantics you need to have a
way to ask for it. If you want to match against Unicode data then you
need a way to determine which of the 6 possible encodings[1] of
Unicode data you are using. If you get either wrong you will not get
the results you expect.  You may even want to deal with cases where
you want Unicode semantics, but to match against non-unicode data. For
instance Latin-1. In Latin-1 the codepoint U+DF is the *octet* 0xDF.
Maybe you want that octet to match "ss" case-insensitively, as a
German speaker would expect and as Unicode specifies is correct.  Or
vice versa, maybe you are like some of the posters to this thread who
seem to expect that \d should not match U+16B51 (as a Hmong speaker
might expect). Perl resolves these problems at the pattern level by
supporting the suffixes /a and /u (for ascii and unicode), and at the
string level it supports two type of string, unicode strings, and
binary/ASCII strings. By default input is the latter but there are a
variety of ways of saying that a file handle should decode to Unicode
instead.

cheers,
Yves
[1] UTF-EBCDIC, UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE.

--
perl -Mre=debug -e "/just|another|perl|hacker/"

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: bug#60690: -P '\d' in GNU and git grep
  2023-04-06 15:45               ` demerphq
@ 2023-04-07 16:48                 ` Paul Eggert
  0 siblings, 0 replies; 36+ messages in thread
From: Paul Eggert @ 2023-04-07 16:48 UTC (permalink / raw)
  To: demerphq
  Cc: Philip.Hazel, 60690, mega lith01, Carlo Arenas,
	Ævar Arnfjörð Bjarmason, Junio C Hamano, pcre-dev,
	Tukusej’s Sirs, git

On 2023-04-06 08:45, demerphq wrote:
>> Although this causes pcre2grep to mishandle Unicode characters:
>>
>>     $ echo 'Ævar' | pcre2grep '[Ssß]'
>>     Ævar
>>
>> it mimics Perl 5.36:
>>
>>     $ echo 'Ævar' | perl -ne 'print $_ if /[Ssß]/'
>>     Ævar
>>
>> so this seems to be what Perl users expect, despite its infelicities.
> Actually no, I think you have misunderstood what is happening at the
> different layers involved here.

No, I understood what was going on. My point was that Perl users seem to 
have accepted this behavior, even though it does not match what people 
would ordinarily expect.

> What you should have done is something like this:

No, for two reasons. First, I'm no Perl expert and so I don't know (and 
don't particularly want to learn) its complicated Unicode options and 
calls. Second, /[Ss\x{DF}]/u is hard to read. If I want the S letters of 
traditional German, I'll write them in the obvious way, as [Ssß]. No 
doubt Perl will let me do this somehow - but it is telling that none of 
your examples do it in such a straightforward way.

> $ echo 'Ævar' | perl -ne 'utf8::decode($_); print $_ if /[Ss\x{DF}]/u'
> $ echo 'baß' | perl -MEncode -ne 'utf8::decode($_); print
> encode_utf8($_) if /[Ss\x{DF}]/u'
> baß
> $ echo 'Ævar' | perl -MEncode -ne 'utf8::decode($_); print
> encode_utf8($_) if /[Ss\x{C6}]/u'
> Ævar
> $ echo 'Ævar' | perl -MEncode -ne 'utf8::decode($_); print
> encode_utf8($_) if /[Ss\x{e6}]/ui'
> Ævar

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: bug#60690: -P '\d' in GNU and git grep
  2023-04-06 13:39             ` demerphq
@ 2023-04-07 19:00               ` Paul Eggert
  2023-04-08  5:01                 ` Carlo Arenas
  0 siblings, 1 reply; 36+ messages in thread
From: Paul Eggert @ 2023-04-07 19:00 UTC (permalink / raw)
  To: demerphq
  Cc: 60690, mega lith01, Carlo Arenas,
	Ævar Arnfjörð Bjarmason, Tukusej’s Sirs, git,
	pcre2-dev, Junio C Hamano

On 2023-04-06 06:39, demerphq wrote:

> Unicode specifies that \d match any digit
> in any script that it supports.

"Specifies" is too strong. The Unicode Regular Expressions technical 
standard (UTS#18) mentions \d only in Annex C[1], next to the word 
"digit" in a column labeled "Property" (even though \d is really syntax 
not a property). This is at best an informal recommendation, not a 
requirement, as UTS#18 0.2[2] says that UTS#18's syntax is only for 
illustration and that although it's similar to Perl's, the two syntax 
forms may not be exactly the same. So we can't look to UTS#18 for a 
definitive way out of the \d mess, as the Unicode folks specifically 
delegated matters to us.

Even ignoring the \d issue the digit situation is messy. UTS#18 Annex C 
says "\p{gc=Decimal_Number}" is the standard recommended syntax 
assignment for digits. However, PCRE2 does not support this syntax; it 
supports another variant \p{Nd} that UTS#18 also recommends. So it 
appears that PCRE2 already does not implement every recommended aspect 
of UTS#18 syntax. PCRE2 also doesn't match Perl, which does support 
"\p{gc=Decimal_Number}".

Anyway, since grep -P '\p{Nd}' implements Unicode's decimal digit class, 
that's clearly enough for grep -P to conform to UTS#18 with respect to 
digits.

> A) how do you tell the regular expression
> engine what semantics you want and B) how does the regular expression
> library identify the encoding in the file, and how does it handle
> malformed content in that file.

Here's how GNU grep does it:

* RE semantics are specified via command-line options like -P.

* Text encoding is specified by locale, e.g., LC_ALL='en_US.utf8'.

* REs do not match encoding errors.

> on *nix there is no tradition of using BOM's to
> distinguish the 6 different possible encodings of Unicode (UTF-8,
> UTF-EBCDIC, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE)

Yes, GNU/Linux never really experienced the joys of UTF-EBCDIC, Oracle 
UTFE, UTF-16LE vs UTF-16BE etc. If you're running legacy IBM mainframe 
or MS-Windows code these legacy encodings are obviously a big deal. 
However, there seems little reason to force their nontrivial hassles 
onto every GNU/Linux program that processes text. A few specialized apps 
like 'iconv' deal with offbeat encodings, and that is probably a better 
approach all around.

> there seems
> to be some level of desire of matching with unicode semantics against
> files that are not uniformly encoded in one of these formats.

That is a use case, yes. It's what 'strings' and 'grep' do.

[1]: https://unicode.org/reports/tr18/#Compatibility_Properties
[2]: https://unicode.org/reports/tr18/#Conformance

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: bug#60690: -P '\d' in GNU and git grep
  2023-04-07 19:00               ` Paul Eggert
@ 2023-04-08  5:01                 ` Carlo Arenas
  2023-04-08 22:45                   ` Paul Eggert
  0 siblings, 1 reply; 36+ messages in thread
From: Carlo Arenas @ 2023-04-08  5:01 UTC (permalink / raw)
  To: Paul Eggert
  Cc: demerphq, 60690, mega lith01,
	Ævar Arnfjörð Bjarmason, Tukusej’s Sirs, git,
	pcre2-dev, Junio C Hamano

On Fri, Apr 7, 2023 at 12:00 PM Paul Eggert <eggert@cs.ucla.edu> wrote:
>
> On 2023-04-06 06:39, demerphq wrote:
>
> > Unicode specifies that \d match any digit
> > in any script that it supports.
>
> "Specifies" is too strong. The Unicode Regular Expressions technical
> standard (UTS#18) mentions \d only in Annex C[1], next to the word
> "digit" in a column labeled "Property" (even though \d is really syntax
> not a property). This is at best an informal recommendation, not a
> requirement, as UTS#18 0.2[2] says that UTS#18's syntax is only for
> illustration and that although it's similar to Perl's, the two syntax
> forms may not be exactly the same. So we can't look to UTS#18 for a
> definitive way out of the \d mess, as the Unicode folks specifically
> delegated matters to us.
>
> Even ignoring the \d issue the digit situation is messy. UTS#18 Annex C
> says "\p{gc=Decimal_Number}" is the standard recommended syntax
> assignment for digits. However, PCRE2 does not support this syntax; it
> supports another variant \p{Nd} that UTS#18 also recommends. So it
> appears that PCRE2 already does not implement every recommended aspect
> of UTS#18 syntax. PCRE2 also doesn't match Perl, which does support
> "\p{gc=Decimal_Number}".

Not sure I follow the whole logic here, but PCRE2[3] (search for
"general category" which is what the "gc" above stands for) only
supports the abbreviated form of the unicode classes and `Nd` is
indeed the one that corresponds to `Decimal_Number`.

Carlo

[1]: https://unicode.org/reports/tr18/#Compatibility_Properties
[2]: https://unicode.org/reports/tr18/#Conformance
[3]: https://pcre2project.github.io/pcre2/doc/html/pcre2pattern.html

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: bug#60690: -P '\d' in GNU and git grep
  2023-04-08  5:01                 ` Carlo Arenas
@ 2023-04-08 22:45                   ` Paul Eggert
  0 siblings, 0 replies; 36+ messages in thread
From: Paul Eggert @ 2023-04-08 22:45 UTC (permalink / raw)
  To: Carlo Arenas
  Cc: demerphq, pcre2-dev, 60690, mega lith01,
	Ævar Arnfjörð Bjarmason, Junio C Hamano,
	Tukusej’s Sirs, git

On 2023-04-07 22:01, Carlo Arenas wrote:

> Not sure I follow the whole logic here, but PCRE2[3] (search for
> "general category" which is what the "gc" above stands for) only
> supports the abbreviated form of the unicode classes and `Nd` is
> indeed the one that corresponds to `Decimal_Number`.

That's fine: all that UTS#18[1] requires is that PCRE2 provide syntax 
for a regular expression that matches the Decimal Number class. Which 
PCRE2 does, via \p{Nd}.

The logic is that UTF#18 does not require that \d must behave like 
\p{Nd}, or even that \p{gc=Decimal_Number} must behave like \p{Nd}. It 
merely requires that there be some syntax for matching Decimal Number, 
and it says the choice of syntax is up to the implementer. This is why 
UTF#18 doesn't require that \d must also match non-ASCII digits (which 
is what I think Yves was saying).

[1]: https://unicode.org/reports/tr18/

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2023-04-08 22:45 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-08  6:23 [PATCH] grep: correctly identify utf-8 characters with \{b,w} in -P Carlo Marcelo Arenas Belón
2023-01-08  6:39 ` Junio C Hamano
2023-01-08 15:52 ` [PATCH v2] " Carlo Marcelo Arenas Belón
2023-01-09 11:35   ` Ævar Arnfjörð Bjarmason
2023-01-09 18:40     ` bug#60690: [PATCH v2] grep: correctly identify utf-8 characters with \{b, w} " Paul Eggert
2023-01-09 19:51       ` Ævar Arnfjörð Bjarmason
2023-01-09 23:12         ` Paul Eggert
2023-01-10  4:49     ` [PATCH v2] grep: correctly identify utf-8 characters with \{b,w} " Carlo Arenas
2023-01-16 20:48     ` Junio C Hamano
2023-04-03 21:38     ` -P '\d' in GNU and git grep Paul Eggert
2023-04-04  3:30       ` bug#60690: " Jim Meyering
2023-04-04  6:46         ` Paul Eggert
2023-04-04 15:31           ` Jim Meyering
2023-04-04  6:56       ` Carlo Arenas
2023-04-04 18:25         ` bug#60690: " Paul Eggert
2023-04-04 19:31           ` Junio C Hamano
2023-04-05 18:32             ` Paul Eggert
2023-04-05 19:04               ` Paul Eggert
2023-04-05 19:37               ` Junio C Hamano
2023-04-05 19:40               ` Jim Meyering
2023-04-05 20:03                 ` Paul Eggert
2023-04-05 21:20                 ` Carlo Arenas
2023-04-06 15:45               ` demerphq
2023-04-07 16:48                 ` Paul Eggert
2023-04-06 13:39             ` demerphq
2023-04-07 19:00               ` Paul Eggert
2023-04-08  5:01                 ` Carlo Arenas
2023-04-08 22:45                   ` Paul Eggert
2023-01-17 10:51   ` [PATCH v3] grep: correctly identify utf-8 characters with \{b,w} in -P Carlo Marcelo Arenas Belón
2023-01-17 12:38     ` Ævar Arnfjörð Bjarmason
2023-01-17 15:19       ` Junio C Hamano
2023-01-18  7:35         ` Carlo Arenas
2023-01-18 11:49           ` Ævar Arnfjörð Bjarmason
2023-01-18 16:20             ` Junio C Hamano
2023-01-18 23:06               ` Ævar Arnfjörð Bjarmason
2023-01-18 23:24                 ` Junio C Hamano

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).