git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Carlo Arenas <carenas@gmail.com>
Cc: "Johannes Schindelin" <Johannes.Schindelin@gmx.de>,
	"Junio C Hamano" <gitster@pobox.com>,
	git@vger.kernel.org, "Nguyễn Thái Ngọc Duy" <pclouds@gmail.com>
Subject: Re: [PATCH] grep: skip UTF8 checks explicitally
Date: Fri, 26 Jul 2019 22:05:31 +0200	[thread overview]
Message-ID: <87blxg7e9g.fsf@evledraar.gmail.com> (raw)
In-Reply-To: <CAPUEspgseYvKTNN6EkqjKo1zQXRjUzyiLMe0kJwnNu=F3ATmOA@mail.gmail.com>


On Fri, Jul 26 2019, Carlo Arenas wrote:

> On Fri, Jul 26, 2019 at 8:15 AM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>> I'm not sure what a real fix for that is. Part of it is probably 8/8 in
>> the series I mention below, but more generally we'd need to be more
>> encoding aware at a much higher callsite than "grep". So e.g. we'd know
>> that we match "binary" data as not-UTF-8. Now we just throw arbitrary
>> bytes around and hope something sticks.
>
> I haven't look yet at your proposed changes, but my gut feeling is that
> the work to support invalid UTF in the yet unreleased PCRE version would
> be needed as part of it, and therefore it might be better to keep PCRE
> out of the main path until that gets released and can be relied upon.

I'm hoping my 8-part series is good enough to move it forward, but as
48de2a768c ("grep: remove the kwset optimization", .2019-07-01) shows we
can always just fall back on using regcomp instead.

> kwset is not going away with this series anyway, regardless of the no-kwset
> name on the branch.

The larger context here is that this is the 1st step of a 2-step series
to get rid of kwset. If I can pull that off successfully is another
matter, but that's the plan. After it's applied we just use it in the
pickaxe code, and it's relatively straightforward to convert that. See:
https://public-inbox.org/git/87v9x793qi.fsf@evledraar.gmail.com/

>> > If we're already deciding to paper over things, I'd much rather prefer
>> > the simpler patch, i.e. Carlo's.
>>
>> As I noted upthread PCRE's own docs promise undefined behavior and fire
>> and brimstone if that patch is applied. Those last two not
>> guaranteed. So we need another solution.
>
> in my original reply I mentioned I explicitly didn't do a test because of this
> "undefined behavior", but I think it should be fair to mention that we are
> already affected by that because using the JIT fast path does skip any
> UTF-8 validations and is currently possible to get git into an infinite loop
> or make it segfault when using PCRE.

Right, this is a good point that we should take notice of. I.e. this is
*not* a new bug per-se, you can do this on master and get a UTF-8 bug
from git.git:

    git grep -P '(*NO_JIT)[æ]'

> in that line, I am not sure I understand the pushback against making that
> explicit since it only makes both codepaths behave the same (bugs and
> risks of burning alike)

Because with my kwset series we're getting a lot more users of this
until-now obscure code, so we're finding old-but-new-to-us bugs.

We've had this bug dating all the way back to Duy's 18547aacf5
("grep/pcre: support utf-8", 2016-06-25). It was first released with git
2.10.

So why are we getting list discussion about it *now*? Because my kwset
series got merged to "next", and we apparently have a lot of users who'd
use fixed-string git-grep under locales, but never used PCRE via -P
explicitly before.

So it's worth getting the semantics right. As noted in the E-Mail I
linked to earlier my ulterior motive here is to get to a point where
we'll funnel all regex matching through PCRE implicitly if it's
available.

We need to get these UTF-8 edge cases right. I don't know if my recent
8-part series gets us 100% there, but hopefully it at least gets us
closer to it.

  reply	other threads:[~2019-07-26 20:05 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-07-21 18:31 [PATCH] grep: skip UTF8 checks explicitally Carlo Marcelo Arenas Belón
2019-07-21 18:34 ` Eric Sunshine
2019-07-22 11:55 ` Johannes Schindelin
2019-07-22 14:43   ` [PATCH] grep: skip UTF8 checks explicitly Carlo Marcelo Arenas Belón
2019-07-24  2:10     ` Junio C Hamano
2019-07-24 10:50       ` Johannes Schindelin
2019-07-24 16:08         ` Junio C Hamano
2019-08-28 14:54           ` [PATCH v2] " Carlo Marcelo Arenas Belón
2019-08-28 16:57             ` Carlo Arenas
2019-09-09 15:07               ` Carlo Arenas
2019-09-09 18:49                 ` Junio C Hamano
2019-07-22 19:42   ` [PATCH] grep: skip UTF8 checks explicitally Ævar Arnfjörð Bjarmason
2019-07-23  3:50     ` Carlo Arenas
2019-07-23 12:46       ` Johannes Schindelin
2019-07-24  1:47         ` Carlo Arenas
2019-07-24 10:47           ` Johannes Schindelin
2019-07-24 18:22             ` Ævar Arnfjörð Bjarmason
2019-07-24 21:04               ` Junio C Hamano
2019-07-25  9:48                 ` Johannes Schindelin
2019-07-25 13:02                   ` Junio C Hamano
2019-07-25 18:22                     ` Johannes Schindelin
2019-07-26 15:15                       ` Ævar Arnfjörð Bjarmason
2019-07-26 15:53                         ` Carlo Arenas
2019-07-26 20:05                           ` Ævar Arnfjörð Bjarmason [this message]
2019-07-26 16:19                         ` Junio C Hamano
2019-07-26 19:40                           ` Ævar Arnfjörð Bjarmason

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87blxg7e9g.fsf@evledraar.gmail.com \
    --to=avarab@gmail.com \
    --cc=Johannes.Schindelin@gmx.de \
    --cc=carenas@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=pclouds@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).