From: Junio C Hamano <gitster@pobox.com>
To: "René Scharfe" <l.s.r@web.de>
Cc: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
"Hamza Mahfooz" <someguy@effective-light.com>,
git@vger.kernel.org
Subject: Re: [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data
Date: Sat, 16 Oct 2021 23:00:58 -0700 [thread overview]
Message-ID: <xmqq1r4k197p.fsf@gitster.g> (raw)
In-Reply-To: <fc7eb9fc-9521-5484-b05f-9c20086fd485@web.de> ("René Scharfe"'s message of "Sat, 16 Oct 2021 21:44:13 +0200")
René Scharfe <l.s.r@web.de> writes:
>>> Literal patterns are those that don't use any wildcards or case-folding.
>>> If the text is encoded in UTF-8 then we enable PCRE2_UTF either if the
>>> pattern only consists of ASCII characters, or if the pattern is encoded
>>> in UTF-8 and is not just a literal pattern.
>>>
>>> Hmm. Why enable PCRE2_UTF for literal patterns that consist of only
>>> ASCII chars?
>>> ...
>> echo 'René Scharfe' >f &&
>> $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?Ren. S' f; echo $?
>> 1
>> $ git -P grep --no-index -P '^(?:You are (?:wrong|correct), )?R[eé]n. S' f; echo $?
>> f:René Scharfe
>> 0
>>
>> So it's a choose-your-own adventure where you can pick if you're
>> right. I.e. do you want the "." metacharacter to match your "é" or not?
>
> Yes, I do, and it's what Hamza's patch is fixing.
That may be correct but is this discussion still about "Why enable
... for literal patterns that consist of only ASCII"? Calling "." a
"metacharacter" and wanting it to match anything other than a single
dot would mean the pattern we are discussing is no longer "literal",
isn't it? I am puzzled.
>> These sorts of patterns demonstrate nicely that the relationship between your
>> pattern being ASCII and wanting or not wanting UTF-8 matching semantics
>> isn't what you might imagine it to be.
>
> Differences are:
>
> o: opt->ignore_locale
> h: has_non_ascii(p->pattern)
> i: is_utf8_locale()
> l: literal
>
> o h i l master hamza rene
> 0 0 0 0 0 1 0
> 0 0 0 1 0 1 0
> 0 0 1 0 0 1 1 <== your first example
> 0 0 1 1 0 1 0
> 0 1 1 1 0 0 1
>
> Turning on PCRE2_UTF when is_utf8_locale() == 0 seems wrong (first two
> lines).
>
> Turning on PCRE2_UTF for literal matching (fourth line) goes against
> 870eea8166 (grep: do not enter PCRE2_UTF mode on fixed matching,
> 2019-07-26).
>
> Turning on PCRE2_UTF for literal matching of non-ASCII characters (fifth
> line) also goes against that, so my intuition betrayed me. When I
> adjust it, I get:
>
> if (!opt->ignore_locale && is_utf8_locale() && !literal)
> options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
>
> That looks deceptively simple -- just drop has_non_ascii(p->pattern)
> from the original condition.
>
> Your second example is handle the same by all versions btw.:
>
> o h i l master hamza rene
> 0 1 1 0 1 1 1
>
> René
next prev parent reply other threads:[~2021-10-17 6:01 UTC|newest]
Thread overview: 47+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-10-15 16:13 [PATCH v13 1/3] grep: refactor next_match() and match_one_pattern() for external use Hamza Mahfooz
2021-10-15 16:13 ` [PATCH v13 2/3] pretty: colorize pattern matches in commit messages Hamza Mahfooz
2021-10-15 16:13 ` [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data Hamza Mahfooz
2021-10-15 20:03 ` Junio C Hamano
2021-10-16 16:25 ` René Scharfe
2021-10-16 17:12 ` Ævar Arnfjörð Bjarmason
2021-10-16 19:44 ` René Scharfe
2021-10-17 6:00 ` Junio C Hamano [this message]
2021-10-17 6:55 ` René Scharfe
2021-10-17 9:44 ` Ævar Arnfjörð Bjarmason
2021-11-15 20:43 ` Andreas Schwab
2021-11-15 22:41 ` Ævar Arnfjörð Bjarmason
2021-11-16 2:12 ` Carlo Arenas
2021-11-16 8:41 ` Andreas Schwab
2021-11-16 9:06 ` Carlo Arenas
2021-11-16 9:18 ` Andreas Schwab
2021-11-16 9:29 ` Andreas Schwab
2021-11-16 9:38 ` Carlo Arenas
2021-11-16 9:55 ` Andreas Schwab
2021-11-16 11:00 ` [PATCH] grep: avoid setting UTF mode when not needed Carlo Marcelo Arenas Belón
2021-11-16 12:32 ` Ævar Arnfjörð Bjarmason
2021-11-16 13:35 ` Hamza Mahfooz
2021-11-17 14:31 ` Ævar Arnfjörð Bjarmason
2021-11-16 18:22 ` Carlo Arenas
2021-11-16 18:48 ` Junio C Hamano
2021-11-17 10:23 ` [PATCH v2] grep: avoid setting UTF mode when dangerous with PCRE Carlo Marcelo Arenas Belón
2021-11-18 7:29 ` Junio C Hamano
2021-11-18 10:15 ` Carlo Arenas
2021-11-18 12:49 ` Hamza Mahfooz
2021-11-19 6:58 ` Ævar Arnfjörð Bjarmason
2021-11-18 21:21 ` Hamza Mahfooz
2021-11-17 18:46 ` [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data René Scharfe
2021-11-17 19:56 ` Carlo Arenas
2021-11-17 20:59 ` Ævar Arnfjörð Bjarmason
2021-11-17 21:53 ` Carlo Arenas
2021-11-18 0:00 ` Ævar Arnfjörð Bjarmason
2021-11-18 18:17 ` Junio C Hamano
2021-11-18 20:57 ` René Scharfe
2021-11-19 7:00 ` Ævar Arnfjörð Bjarmason
2021-11-19 16:08 ` René Scharfe
2021-11-19 17:33 ` Carlo Arenas
2021-11-19 17:11 ` Junio C Hamano
2021-10-15 18:05 ` [PATCH v13 1/3] grep: refactor next_match() and match_one_pattern() for external use Junio C Hamano
2021-10-15 18:24 ` Hamza Mahfooz
2021-10-15 19:28 ` Junio C Hamano
2021-10-15 19:40 ` Hamza Mahfooz
2021-10-15 19:49 ` Junio C Hamano
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=xmqq1r4k197p.fsf@gitster.g \
--to=gitster@pobox.com \
--cc=avarab@gmail.com \
--cc=git@vger.kernel.org \
--cc=l.s.r@web.de \
--cc=someguy@effective-light.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).