From: Jeff King <peff@peff.net>
To: demerphq <demerphq@gmail.com>
Cc: "D. Ben Knoble" <ben.knoble@gmail.com>, git@vger.kernel.org
Subject: Re: grep: fix multibyte regex handling under macOS (1819ad327b7a1f19540a819813b70a0e8a7f798f)
Date: Wed, 1 Feb 2023 18:03:55 -0500
Message-ID: <Y9rv29c0dYUAYx8B@coredump.intra.peff.net>
In-Reply-To: <CANgJU+X_e0owKC3uWPaA_gVP54syF1+MJ-cTn+fjPrNS5LDsMA@mail.gmail.com>

On Wed, Feb 01, 2023 at 05:09:33PM +0100, demerphq wrote:

> > Failure (using Zsh to produce the characters; I think there's a Bash
> > equivalent):
> > ```
> > # git diff --word-diff --word-diff-regex=$'[\xc0-\xff][\x80-\xbf]+'
> > fatal : invalid regular expression: [¿-ˇ][Ä-ø]+
> > ```
> 
> FWIW that looks pretty weird to me, like the escapes in the charclass
> were interpolated before being fed to the regex engine. Are you sure
> you tested the right thing?

I think the point is that he is feeding a raw \xc0 byte (not the escape
sequence) to the regex engine, which is bogus UTF-8. And the internal
userdiff drivers do the same thing. They contain "[\xc0-\xff]", and
those "\x" will be interpolated by the compiler into their actual bytes.
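
To make the raw-byte versus escape-sequence distinction concrete, here
is a small sketch that dumps the bytes of each form. It assumes bash or
zsh for the $'...' quoting plus the standard od, tr, and sed tools; the
helper name hexdump_bytes is made up for illustration and is not
anything in git:

```bash
#!/usr/bin/env bash
# Raw form: the shell expands \xc0 and \xff into single bytes
# before the program ever sees the pattern.
raw=$'[\xc0-\xff]'
# Escaped form: single quotes keep the backslash sequences as text.
escaped='[\xc0-\xff]'

# Hypothetical helper: print the bytes of "$1" as space-separated hex.
hexdump_bytes() {
    printf '%s' "$1" | od -An -v -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

raw_hex=$(hexdump_bytes "$raw")
escaped_hex=$(hexdump_bytes "$escaped")

echo "raw:     $raw_hex"
echo "escaped: $escaped_hex"
```

The raw form is five bytes, two of which (0xc0 and 0xff) have the high
bit set without forming a valid UTF-8 sequence. The escaped form is
eleven plain ASCII bytes, and it is up to the regex engine to interpret
them.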

So the regex engine is complaining that it is getting bytes that have
the high bit set but are not part of a valid multi-byte character. I.e.,
it is not happy to do bytewise matching, but really wants valid UTF-8 in
the expression.

glibc's regex engine seems OK with this. Try:

  git grep $'[\xc0-\xff]'

in git.git, and it will find lots of multi-byte characters. But pcre,
for example, is not:

  $ git grep -P $'[\xc0-\xff]'
  fatal: command line, '[<C0>-<FF>]': UTF-8 error: byte 2 top bits not 0x80

There you really want to feed the literal escapes (obviously dropping
the $'...' shell interpolation is a better solution, but for the sake of
illustration):

  git grep -P $'[\\xc0-\\xff]'

But I don't think we can rely on the libc BRE supporting "\x" in
character classes. Glibc certainly doesn't. I'm not sure what the
portable solution is.
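
A quick way to see the "\x" problem, sketched here under the assumption
of GNU grep (other implementations may behave differently): inside a
bracket expression the backslash is just an ordinary character, so
'[\x41]' is a class containing '\', 'x', '4', and '1', not the letter
'A'. The variable names are only for illustration:

```bash
#!/usr/bin/env bash
# In a POSIX bracket expression the backslash is not an escape, so this
# class holds the four literal characters \ x 4 1.
pattern='[\x41]'

# grep -c prints the number of matching lines; "|| true" keeps a
# zero-match exit status from aborting the script.
count_A=$(printf 'A\n' | grep -c -- "$pattern" || true)
count_x=$(printf 'x\n' | grep -c -- "$pattern" || true)

echo "lines matching for input 'A': $count_A"
echo "lines matching for input 'x': $count_x"
```

If "\x41" were treated as an escape for 'A', the first count would be 1
and the second 0; with GNU grep it is the other way around.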

-Peff
