Re: [PATCH] grep: skip UTF8 checks explicitally

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Cc: Carlo Arenas <carenas@gmail.com>, git@vger.kernel.org, gitster@pobox.com
Subject: Re: [PATCH] grep: skip UTF8 checks explicitally
Date: Wed, 24 Jul 2019 20:22:48 +0200	[thread overview]
Message-ID: <87lfwn70nb.fsf@evledraar.gmail.com> (raw)
In-Reply-To: <nycvar.QRO.7.76.6.1907241245540.21907@tvgsbejvaqbjf.bet>

On Wed, Jul 24 2019, Johannes Schindelin wrote:

> Hi Carlo,
>
> On Tue, 23 Jul 2019, Carlo Arenas wrote:
>
>> On Tue, Jul 23, 2019 at 5:47 AM Johannes Schindelin
>> <Johannes.Schindelin@gmx.de> wrote:
>> >
>> > So when PCRE2 complains about the top two bits not being 0x80, it fails
>> > to parse the bytes correctly (byte 2 is 0xbb, whose two top bits are
>> > indeed 0x80).
>>
>> the error is confusing but it is not coming from the pattern, but from
>> what PCRE2 calls
>> the subject.
>>
>> meaning that while going through the repository it found content that
>> it tried to match but
>> that it is not valid UTF-8, like all the png and a few txt files that
>> are not encoded as
>> UTF-8 (ex: t/t3900/ISO8859-1.txt).
>>
>> > Maybe this is a bug in your PCRE2 version? Mine is 10.33... and this
>> > does not happen here... But then, I don't need the `-I` option, and my
>> > output looks like this:
>>
>> -I was just an attempt to workaround the obvious binary files (like
>> PNG); I'll assume you
>> should be able to reproduce if using a non JIT enabled PCRE2,
>> regardless of version.
>>
>> my point was that unlike in your report, I didn't have any test cases
>> failing, because
>> AFAIK there are no test cases using broken UTF-8 (the ones with binary data are
>> actually valid zero terminated UTF-8 strings)
>
> Thank you for this explanation. I think it makes a total lot of sense.
>
> So your motivation for this patch is actually a different one than mine,
> and I would like to think that this actually strengthens the case _in
> favor_ of it. The patch kind of kills two birds with one stone.

This patch is really the wrong thing to do. Don't get me wrong, I'm
sympathetic to the *problem* and it should be solved, but this isn't the
solution.

The PCRE2_NO_UTF_CHECK flag means "I have checked that this is a valid
UTF-8 string so you, PCRE, don't need to re-check it". To quote
pcre2api(3):

    If you know that your pattern is a valid UTF string, and you want to
    skip this check for performance reasons, you can set the
    PCRE2_NO_UTF_CHECK option. When it is set, the effect of passing an
    in‐ valid UTF string as a pattern is undefined. It may cause your
    program to crash or loop.

(Later it's discussed that "pattern" here is also "subject string" in
the context of pcre2_{jit_,}match()).

I know almost nothing about the internals of PCRE's engine, but much of
it's based on Perl's, which I know way better. Doing the equivalent of
this in perl (setting the UTF8 flag on a SV) *will* cause asserts to
fail and possibly segfaults.

It's likely through dumb luck that this is "working". I.e. yes the JIT
mode is less anal about these checks, so if you say grep for "Nguyễn
Thái" in UTF-8 mode and there's binary data you're satisfied not to find
anything in that binary data.

But if you are I'm willing to bet this ruins your day, e.g PCRE would
"skip ahead" a character 4-byte character because it sees a telltale
U+10000 through U+10FFFF start sequence, except that wasn't a character,
it was some arbitrary binary.

Now, what is the solution? I don't have any patches yet, but things I
intend to look at:

 1) We're oversupplying PCRE2_UTF now, and one such case is what's being
    reported here. I.e. there's no reason I can think of for why a
    fixed-string pattern should need PCRE2_UTF set when not combined
    with --ignore-case. We can just not do that, but maybe I'm missing
    something there.

 2) We can do "try utf8, and fallback". A more advanced version of this
    is what the new PCRE2_MATCH_INVALID_UTF flag (mentioned upthread)
    does. I was thinking something closer to just carrying two compiled
    patterns, and falling back on the ~PCRE2_UTF one if we get a
    PCRE2_ERROR_UTF8_* error.

One reason we can't "just" go back to the pre-ab/no-kwset behavior is
that one thing it does is fix a long-standing bug where we'd do the
wrong thing under locales && -i && UTF-8 string/pattern. More precisely
we'd punt it to the C library's matching function, which would probably
do the wrong thing.

next prev parent reply	other threads:[~2019-07-24 18:22 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-07-21 18:31 [PATCH] grep: skip UTF8 checks explicitally Carlo Marcelo Arenas Belón
2019-07-21 18:34 ` Eric Sunshine
2019-07-22 11:55 ` Johannes Schindelin
2019-07-22 14:43   ` [PATCH] grep: skip UTF8 checks explicitly Carlo Marcelo Arenas Belón
2019-07-24  2:10     ` Junio C Hamano
2019-07-24 10:50       ` Johannes Schindelin
2019-07-24 16:08         ` Junio C Hamano
2019-08-28 14:54           ` [PATCH v2] " Carlo Marcelo Arenas Belón
2019-08-28 16:57             ` Carlo Arenas
2019-09-09 15:07               ` Carlo Arenas
2019-09-09 18:49                 ` Junio C Hamano
2019-07-22 19:42   ` [PATCH] grep: skip UTF8 checks explicitally Ævar Arnfjörð Bjarmason
2019-07-23  3:50     ` Carlo Arenas
2019-07-23 12:46       ` Johannes Schindelin
2019-07-24  1:47         ` Carlo Arenas
2019-07-24 10:47           ` Johannes Schindelin
2019-07-24 18:22             ` Ævar Arnfjörð Bjarmason [this message]
2019-07-24 21:04               ` Junio C Hamano
2019-07-25  9:48                 ` Johannes Schindelin
2019-07-25 13:02                   ` Junio C Hamano
2019-07-25 18:22                     ` Johannes Schindelin
2019-07-26 15:15                       ` Ævar Arnfjörð Bjarmason
2019-07-26 15:53                         ` Carlo Arenas
2019-07-26 20:05                           ` Ævar Arnfjörð Bjarmason
2019-07-26 16:19                         ` Junio C Hamano
2019-07-26 19:40                           ` Ævar Arnfjörð Bjarmason

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87lfwn70nb.fsf@evledraar.gmail.com \
    --to=avarab@gmail.com \
    --cc=Johannes.Schindelin@gmx.de \
    --cc=carenas@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).