git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Carlo Arenas <carenas@gmail.com>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Cc: Johannes Schindelin <Johannes.Schindelin@gmx.de>,
	git@vger.kernel.org, gitster@pobox.com
Subject: Re: [PATCH] grep: skip UTF8 checks explicitally
Date: Mon, 22 Jul 2019 20:50:27 -0700	[thread overview]
Message-ID: <CAPUEspg1nUoPApTk5J2r_-9psxTTSC7nRAPw_X9no+2sFVSxAA@mail.gmail.com> (raw)
In-Reply-To: <87muh57t5r.fsf@evledraar.gmail.com>

On Mon, Jul 22, 2019 at 12:42 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> On Mon, Jul 22 2019, Johannes Schindelin wrote:
>
> > So I am fine with this patch.
>
> I'm not, not because I dislike the approach. I haven't made up my mind
> yet.

my bad, I should had explained better the regression I was trying to
fix with this patch :

$ git version
git version 2.22.0.709.g102302147b.dirty
$ git grep "Nguyễn Thái"
.mailmap:Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
fatal: pcre2_match failed with error code -22: UTF-8 error: isolated
byte with 0x80 bit set
$ git grep -I "Nguyễn Thái"
.mailmap:Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
po/TEAMS:Members:       Nguyễn Thái Ngọc Duy <pclouds AT gmail.com>
po/vi.po:# Nguyễn Thái Ngọc Duy <pclouds@gmail.com>, 2012.
t/t1302-repo-version.sh:# Copyright (c) 2007 Nguyễn Thái Ngọc Duy
t/t2104-update-index-skip-worktree.sh:# Copyright (c) 2008 Nguyễn Thái Ngọc Duy
fatal: pcre2_match failed with error code -8: UTF-8 error: byte 2 top
bits not 0x80

> I stopped paying attention to this error-with-not-JIT discussion when I
> heard that some other series going into next for Windows fixed that
> issue[1]
>
> But now we have it again in some form? My ab/no-kwset has a lot of tests
> for encodings & locales combined with grep, don't some of those trigger
> this? If so we should make any such failure a test & part of this patch.

I don't have a windows environment to test, and I admit I wasn't
following clearly
the whole conversation but at least in my case, I never got any test
to fail, and I haven't
seen any test with broken UTF-8.

I apologize for not sending a test and will correct that, but I am
concerned that
according to PCRE documentation the behaviour is undefined and therefore the
test might be flacky.

> Right now we don't have the info of whether we're really using the JIT
> or not, but that would be easy to add to grep's --debug mode for use in
> a test prereq.

I would think that extending testtool would be a better option, I have indeed
a patch checking for UTF-8 support using git directly which I hadn't sent just
because I thought it looked too hacky, but agree it will be nice to have all
those dependencies clear and the corresponding tests otherwise

It is interesting to note that because of the conservative way we enable UTF-8
then it will depend on several factors if the problematic code gets triggered

> As noted in [2] I'd be inclined to go the other way, if we indeed have
> some cases where PCRE skips its own checks does not dying actually give
> us anything useful? I'd think not, so just ignoring the issue seems like
> the wrong thing to do.

we could reenable the checks by moving out of the JIT fast path in pcre so
that pcre2_match is used for all matches (with JIT compiled as an optimization)
and that might have the added benefit of solving the PCRE errors when JIT is
broken[1]

$ git version
git version 2.22.0
$ git grep "Nguyễn Thái"
.mailmap:Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
po/TEAMS:Members:       Nguyễn Thái Ngọc Duy <pclouds AT gmail.com>
po/vi.po:# Nguyễn Thái Ngọc Duy <pclouds@gmail.com>, 2012.
t/t1302-repo-version.sh:# Copyright (c) 2007 Nguyễn Thái Ngọc Duy
t/t2104-update-index-skip-worktree.sh:# Copyright (c) 2008 Nguyễn Thái Ngọc Duy
t/t7011-skip-worktree-reading.sh:# Copyright (c) 2008 Nguyễn Thái Ngọc Duy
t/t7012-skip-worktree-writing.sh:# Copyright (c) 2008 Nguyễn Thái Ngọc Duy

important to note that at least on my system (macOS 10.14.6) the output of all
engines for grep is the same for a UTF-8 pattern match, even if using invalid
UTF-8 in the corpus, and I would expect that to be a better indicative
of correctness

FWIW GNU grep (with -P) also ignores UTF-8 errors using the same flag and
even PCRE seems to be going in the other direction with the recent addition
of PCRE2_MATCH_INVALID_UTF[1]

$ od -b int.p
0000000   102 145 154 303 263 156 012 303
$ pcre2grep -U 'Beló' int.p
Belón

Carlo

[1] https://public-inbox.org/git/20181209230024.43444-1-carenas@gmail.com/
[2] https://lists.exim.org/lurker/message/20190528.141422.2af1e386.gl.html

  reply	other threads:[~2019-07-23  3:50 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-07-21 18:31 [PATCH] grep: skip UTF8 checks explicitally Carlo Marcelo Arenas Belón
2019-07-21 18:34 ` Eric Sunshine
2019-07-22 11:55 ` Johannes Schindelin
2019-07-22 14:43   ` [PATCH] grep: skip UTF8 checks explicitly Carlo Marcelo Arenas Belón
2019-07-24  2:10     ` Junio C Hamano
2019-07-24 10:50       ` Johannes Schindelin
2019-07-24 16:08         ` Junio C Hamano
2019-08-28 14:54           ` [PATCH v2] " Carlo Marcelo Arenas Belón
2019-08-28 16:57             ` Carlo Arenas
2019-09-09 15:07               ` Carlo Arenas
2019-09-09 18:49                 ` Junio C Hamano
2019-07-22 19:42   ` [PATCH] grep: skip UTF8 checks explicitally Ævar Arnfjörð Bjarmason
2019-07-23  3:50     ` Carlo Arenas [this message]
2019-07-23 12:46       ` Johannes Schindelin
2019-07-24  1:47         ` Carlo Arenas
2019-07-24 10:47           ` Johannes Schindelin
2019-07-24 18:22             ` Ævar Arnfjörð Bjarmason
2019-07-24 21:04               ` Junio C Hamano
2019-07-25  9:48                 ` Johannes Schindelin
2019-07-25 13:02                   ` Junio C Hamano
2019-07-25 18:22                     ` Johannes Schindelin
2019-07-26 15:15                       ` Ævar Arnfjörð Bjarmason
2019-07-26 15:53                         ` Carlo Arenas
2019-07-26 20:05                           ` Ævar Arnfjörð Bjarmason
2019-07-26 16:19                         ` Junio C Hamano
2019-07-26 19:40                           ` Ævar Arnfjörð Bjarmason

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAPUEspg1nUoPApTk5J2r_-9psxTTSC7nRAPw_X9no+2sFVSxAA@mail.gmail.com \
    --to=carenas@gmail.com \
    --cc=Johannes.Schindelin@gmx.de \
    --cc=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).