From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Hamza Mahfooz <someguy@effective-light.com>
Cc: git@vger.kernel.org, Junio C Hamano <gitster@pobox.com>
Subject: Re: [PATCH v12 3/3] grep: fix an edge case concerning ascii patterns and UTF-8 data
Date: Sat, 09 Oct 2021 03:37:58 +0200
Message-ID: <87a6jjlzz1.fsf@evledraar.gmail.com>
In-Reply-To: <20211008224918.603392-3-someguy@effective-light.com>
On Fri, Oct 08 2021, Hamza Mahfooz wrote:
> If we attempt to grep non-ascii log message text with an ascii pattern, we
> run into the following issue:
>
> $ git log --color --author='.var.*Bjar' -1 origin/master | grep ^Author
> grep: (standard input): binary file matches
>
> So, to fix this teach the grep code to mark the pattern as UTF-8 (even if
> the pattern is composed of only ascii characters), so long as the log
> output is encoded using UTF-8.
>
> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
> Signed-off-by: Hamza Mahfooz <someguy@effective-light.com>
> ---
> v12: get rid of utf8_all_the_things and fix an issue with one of the unit
> tests.
> ---
> grep.c | 6 +++--
> t/t7812-grep-icase-non-ascii.sh | 48 +++++++++++++++++++++++++++++++++
> 2 files changed, 52 insertions(+), 2 deletions(-)
>
> diff --git a/grep.c b/grep.c
> index fe847a0111..f6e113e9f0 100644
> --- a/grep.c
> +++ b/grep.c
> @@ -382,8 +382,10 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
>  		}
>  		options |= PCRE2_CASELESS;
>  	}
> -	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
> -	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
> +	if ((!opt->ignore_locale && !has_non_ascii(p->pattern)) ||
> +	    (!opt->ignore_locale && is_utf8_locale() &&
> +	     has_non_ascii(p->pattern) && !(!opt->ignore_case &&
> +	     (p->fixed || p->is_fixed))))
>  		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
I think at least some of that existing "if" is my fault, and I *think*
your patch works here, but FWIW I'd find something like this way more
readable:
@@ -382,8 +383,16 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
 		}
 		options |= PCRE2_CASELESS;
 	}
+	if (opt->utf8_all_the_things) {
+		options |= PCRE2_UCP;
+		do_utf8 = 1;
+	}
+
 	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
 	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
+		do_utf8 = 1;
+
+	if (do_utf8)
 		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
Well, without the "utf8_all_the_things" probably. That's a reference to
a popular meme, probably better to name it differently, and the
PCRE2_UCP is just something I was experimenting with.
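
To make the comparison concrete, here is a minimal standalone sketch (not
git's actual code) of the v12 condition next to a stepwise version in the
"do_utf8" style, without the extra flag and without PCRE2_UCP. The struct
utf8_inputs and both function names are hypothetical stand-ins for the
fields that compile_pcre2_pattern() reads:

```c
#include <stdbool.h>

/*
 * Hypothetical stand-in for the inputs of the condition; these are
 * not git's real structs.
 */
struct utf8_inputs {
	bool ignore_locale;     /* opt->ignore_locale */
	bool utf8_locale;       /* is_utf8_locale() */
	bool non_ascii_pattern; /* has_non_ascii(p->pattern) */
	bool ignore_case;       /* opt->ignore_case */
	bool fixed;             /* p->fixed || p->is_fixed */
};

/* The v12 condition, transcribed as one expression. */
static bool v12_condition(const struct utf8_inputs *in)
{
	return (!in->ignore_locale && !in->non_ascii_pattern) ||
	       (!in->ignore_locale && in->utf8_locale &&
		in->non_ascii_pattern &&
		!(!in->ignore_case && in->fixed));
}

/* The same decision, split into steps that set a do_utf8 flag. */
static bool want_pcre2_utf(const struct utf8_inputs *in)
{
	bool do_utf8 = false;

	if (in->ignore_locale)
		return false;

	/* New in v12: an ASCII-only pattern turns on UTF mode. */
	if (!in->non_ascii_pattern)
		do_utf8 = true;
	/* Pre-existing rule for non-ASCII patterns. */
	else if (in->utf8_locale && !(!in->ignore_case && in->fixed))
		do_utf8 = true;

	return do_utf8;
}
```

Under these assumptions the two functions agree for every combination of
the five inputs, so the stepwise form is only a readability change.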
It's late here, but I've got to admit that I'm still a bit confused by
this. Let's see if I can try to sum up why:
Ultimately whether we use PCRE2_UTF *should* have nothing to do with
whether the pattern is UTF-8 or not, because even an expression like:
/.*/
will behave differently under UTF-8, i.e. character classes change, byte
boundaries change to "character" boundaries etc.
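
A rough illustration of the byte vs. "character" boundary point, without
involving PCRE at all: the sketch below counts how many steps a
byte-oriented "." would take over a string versus how many code points
the same string holds. The function names are made up for this example,
and utf8_codepoints() assumes its input is already valid UTF-8:

```c
#include <stddef.h>
#include <string.h>

/* Steps a byte-at-a-time matcher takes: one per byte. */
static size_t byte_steps(const char *s)
{
	return strlen(s);
}

/*
 * Steps a UTF-8-aware matcher takes: one per code point.
 * Assumes valid UTF-8, where every byte that is not a
 * continuation byte (10xxxxxx) starts a new code point.
 */
static size_t utf8_codepoints(const char *s)
{
	size_t n = 0;

	for (; *s; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}
```

For "Ævar" encoded as UTF-8 ("\xC3\x86" "var") the first function sees
five units and the second sees four, which is why an ASCII-only pattern
such as '.var' can still match differently depending on the mode.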
That the existing code has has_non_ascii() and the like is a trade-off
that had to be made for the grep-a-file case, because you might run into
arbitrary binary data, but logs are cleaner/encoded/re-encoded etc.
If you run PCRE in UTF-8 mode it will die on some of that data (as
you'll see from our test suite if you turn it on unconditionally).
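
As a sketch of why arbitrary file contents are a problem for a strict
UTF-8 mode, the hypothetical helper below does a simplified structural
check of a buffer. It is not what PCRE2 or git does internally, and it
deliberately skips some rules (it does not reject every overlong form or
surrogate code points):

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Simplified check: each lead byte must be followed by the right
 * number of continuation bytes (10xxxxxx). Illustrative only.
 */
static bool looks_like_utf8(const unsigned char *buf, size_t len)
{
	size_t i = 0;

	while (i < len) {
		unsigned char c = buf[i];
		size_t follow;

		if (c < 0x80)
			follow = 0;
		else if (c >= 0xC2 && c <= 0xDF)
			follow = 1;
		else if (c >= 0xE0 && c <= 0xEF)
			follow = 2;
		else if (c >= 0xF0 && c <= 0xF4)
			follow = 3;
		else
			return false; /* stray continuation or bad lead byte */

		if (follow > len - i - 1)
			return false; /* sequence truncated by end of buffer */

		for (size_t k = 1; k <= follow; k++)
			if ((buf[i + k] & 0xC0) != 0x80)
				return false;
		i += follow + 1;
	}
	return true;
}
```

Log messages that have been re-encoded to UTF-8 would pass a check like
this, while a binary blob containing, say, a lone 0xFF byte would not.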
Are there cases where my "utf8_all_the_things" POC patch would have
turned on PCRE2_UTF, but yours doesn't? IOW is there a 1:1 mapping still
between the encoding mode log/revision.c thinks it's in & PCRE2_UTF?
Anyway, I've tried to break things with this patch and haven't
succeeded, so maybe it's all OK now. Thanks a lot for working on this
again, it's a really neat feature. Just wondering if there are any
remaining edge cases we know about...