git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* RE: grep: fix multibyte regex handling under macOS (1819ad327b7a1f19540a819813b70a0e8a7f798f)
@ 2023-02-01 15:18 D. Ben Knoble
  2023-02-01 16:09 ` demerphq
  0 siblings, 1 reply; 22+ messages in thread
From: D. Ben Knoble @ 2023-02-01 15:18 UTC (permalink / raw)
  To: git

I recently updated to git 2.39.1 and noticed today that `git diff
--word-diff` fails for files with `diff=scheme`. I was able to narrow
the failure down to the inclusion of control characters \xc0, \xff,
\x80, \xbf by https://github.com/git/git/blob/2fc9e9ca3c7505bc60069f11e7ef09b1aeeee473/userdiff.c#L17
in the definition of the scheme diff pattern (really, all patterns).

I suspect the commit referenced in the subject, given that it messes
with regex handling on macOS.

Relevant environment that I can think of:
```
# locale
LANG="fr_FR.UTF-8"
LC_COLLATE="fr_FR.UTF-8"
LC_CTYPE="fr_FR.UTF-8"
LC_MESSAGES="fr_FR.UTF-8"
LC_MONETARY="fr_FR.UTF-8"
LC_NUMERIC="fr_FR.UTF-8"
LC_TIME="fr_FR.UTF-8"
LC_ALL="fr_FR.UTF-8"
```

I'm on macOS 11.7.

Failure (using Zsh to produce the characters; I think there's a Bash
equivalent):
```
# git diff --word-diff --word-diff-regex=$'[\xc0-\xff][\x80-\xbf]+'
fatal¬†: invalid regular expression: [¿-ˇ][Ä-ø]+
```
(Looks like the output is a bit scrambled; here's the hexdump)
```
# !! |& xxd
00000000: 6661 7461 6cc2 a03a 2069 6e76 616c 6964  fatal..: invalid
00000010: 2072 6567 756c 6172 2065 7870 7265 7373   regular express
00000020: 696f 6e3a 205b c02d ff5d 5b80 2dbf 5d2b  ion: [.-.][.-.]+
00000030: 0a                                       .
```

-- 
D. Ben Knoble

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: grep: fix multibyte regex handling under macOS (1819ad327b7a1f19540a819813b70a0e8a7f798f)
  2023-02-01 15:18 grep: fix multibyte regex handling under macOS (1819ad327b7a1f19540a819813b70a0e8a7f798f) D. Ben Knoble
@ 2023-02-01 16:09 ` demerphq
  2023-02-01 16:21   ` D. Ben Knoble
  2023-02-01 23:03   ` Jeff King
  0 siblings, 2 replies; 22+ messages in thread
From: demerphq @ 2023-02-01 16:09 UTC (permalink / raw)
  To: D. Ben Knoble; +Cc: git

On Wed, 1 Feb 2023 at 16:25, D. Ben Knoble <ben.knoble@gmail.com> wrote:
>
> I recently updated to git 2.39.1 and noticed today that `git diff
> --word-diff` fails for files with `diff=scheme`. I was able to narrow
> the failure down to the inclusion of control characters \xc0, \xff,
> \x80, \xbf by https://github.com/git/git/blob/2fc9e9ca3c7505bc60069f11e7ef09b1aeeee473/userdiff.c#L17
> in the definition of the scheme diff pattern (really, all patterns).
>
> I suspect the commit referenced in the subject, given that it messes
> with regex handling on macOS.
>
> Relevant environment that I can think of:
> ```
> # locale
> LANG="fr_FR.UTF-8"
> LC_COLLATE="fr_FR.UTF-8"
> LC_CTYPE="fr_FR.UTF-8"
> LC_MESSAGES="fr_FR.UTF-8"
> LC_MONETARY="fr_FR.UTF-8"
> LC_NUMERIC="fr_FR.UTF-8"
> LC_TIME="fr_FR.UTF-8"
> LC_ALL="fr_FR.UTF-8"
> ```
>
> I'm on macOS 11.7.
>
> Failure (using Zsh to produce the characters; I think there's a Bash
> equivalent):
> ```
> # git diff --word-diff --word-diff-regex=$'[\xc0-\xff][\x80-\xbf]+'
> fatal¬†: invalid regular expression: [¿-ˇ][Ä-ø]+
> ```

FWIW that looks pretty weird to me, like the escapes in the charclass
were interpolated before being fed to the regex engine. Are you sure
you tested the right thing?

Yves


--
perl -Mre=debug -e "/just|another|perl|hacker/"

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: grep: fix multibyte regex handling under macOS (1819ad327b7a1f19540a819813b70a0e8a7f798f)
  2023-02-01 16:09 ` demerphq
@ 2023-02-01 16:21   ` D. Ben Knoble
  2023-02-01 18:23     ` demerphq
  2023-02-01 23:03   ` Jeff King
  1 sibling, 1 reply; 22+ messages in thread
From: D. Ben Knoble @ 2023-02-01 16:21 UTC (permalink / raw)
  To: demerphq; +Cc: git

On Wed, Feb 1, 2023 at 11:09 AM demerphq <demerphq@gmail.com> wrote:
> FWIW that looks pretty weird to me, like the escapes in the charclass
> were interpolated before being fed to the regex engine. Are you sure
> you tested the right thing?

Quite sure. `git diff --word-diff` fails. This was just a smaller
example based on the linked C code.

Here's the output of `git diff --word-diff` (verbatim and dumped):

```
fatal : invalid regular expression: \|([^\\]*)\||([^][)(}{[
])+|[^[:space:]]|[¿-ˇ][Ä-ø]+
00000000: 6661 7461 6cc2 a03a 2069 6e76 616c 6964  fatal..: invalid
00000010: 2072 6567 756c 6172 2065 7870 7265 7373   regular express
00000020: 696f 6e3a 205c 7c28 5b5e 5c5c 5d2a 295c  ion: \|([^\\]*)\
00000030: 7c7c 285b 5e5d 5b29 287d 7b5b 2009 5d29  ||([^][)(}{[ .])
00000040: 2b7c 5b5e 5b3a 7370 6163 653a 5d5d 7c5b  +|[^[:space:]]|[
00000050: c02d ff5d 5b80 2dbf 5d2b 0a              .-.][.-.]+.
```

-- 
D. Ben Knoble

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: grep: fix multibyte regex handling under macOS (1819ad327b7a1f19540a819813b70a0e8a7f798f)
  2023-02-01 16:21   ` D. Ben Knoble
@ 2023-02-01 18:23     ` demerphq
  2023-02-01 18:54       ` Junio C Hamano
  0 siblings, 1 reply; 22+ messages in thread
From: demerphq @ 2023-02-01 18:23 UTC (permalink / raw)
  To: D. Ben Knoble, Git

On Wed, 1 Feb 2023 at 17:22, D. Ben Knoble <ben.knoble@gmail.com> wrote:
>
> On Wed, Feb 1, 2023 at 11:09 AM demerphq <demerphq@gmail.com> wrote:
> > FWIW that looks pretty weird to me, like the escapes in the charclass
> > were interpolated before being fed to the regex engine. Are you sure
> > you tested the right thing?
>
> Quite sure. `git diff --word-diff` fails. This was just a smaller
> example based on the linked C code.
>
> Here's the output of `git diff --word-diff` (verbatim and dumped):
>
> ```
> fatal : invalid regular expression: \|([^\\]*)\||([^][)(}{[
> ])+|[^[:space:]]|[¿-ˇ][Ä-ø]+
> 00000000: 6661 7461 6cc2 a03a 2069 6e76 616c 6964  fatal..: invalid
> 00000010: 2072 6567 756c 6172 2065 7870 7265 7373   regular express
> 00000020: 696f 6e3a 205c 7c28 5b5e 5c5c 5d2a 295c  ion: \|([^\\]*)\
> 00000030: 7c7c 285b 5e5d 5b29 287d 7b5b 2009 5d29  ||([^][)(}{[ .])
> 00000040: 2b7c 5b5e 5b3a 7370 6163 653a 5d5d 7c5b  +|[^[:space:]]|[
> 00000050: c02d ff5d 5b80 2dbf 5d2b 0a              .-.][.-.]+.
> ```

Interesting. The regex engine seems to be interpolating the \xC0 in
such a way you arent seeing the real pattern. In the Perl regex engine
I'd call that a bug (it used to do the same thing before we fixed it
years ago[1]). FWIW, this is a valid regex in Perl so i dont think the
pattern is at fault, its something else. I saw some discussion
recently that the mac regex engine doesn't play nicely in certain
ways, but i dont recollect the details.

Sorry i can't help more. Try searching for EXTENDED and regex and mac.
Maybe you can find the mail I mean.

cheers,
yves
[1] I am one of the maintainers of the perl regex engine.
-- 
perl -Mre=debug -e "/just|another|perl|hacker/"

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: grep: fix multibyte regex handling under macOS (1819ad327b7a1f19540a819813b70a0e8a7f798f)
  2023-02-01 18:23     ` demerphq
@ 2023-02-01 18:54       ` Junio C Hamano
  2023-02-01 21:33         ` D. Ben Knoble
  0 siblings, 1 reply; 22+ messages in thread
From: Junio C Hamano @ 2023-02-01 18:54 UTC (permalink / raw)
  To: demerphq; +Cc: D. Ben Knoble, Git

demerphq <demerphq@gmail.com> writes:

> ... I saw some discussion
> recently that the mac regex engine doesn't play nicely in certain
> ways, but i dont recollect the details.

There was a discussion on BRE having and not having GNU-like
extension on macOS, in this thread

  https://lore.kernel.org/git/26a0d4ca-3d97-ace4-1a1f-92b1ee6715a6@web.de/

The patch we ended up using avoids touching the behaviour with ERE,
as REG_ENHANCED on macOS affects REG_EXTENDED, but the issue we were
looking at in the thread was about BRE.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: grep: fix multibyte regex handling under macOS (1819ad327b7a1f19540a819813b70a0e8a7f798f)
  2023-02-01 18:54       ` Junio C Hamano
@ 2023-02-01 21:33         ` D. Ben Knoble
  2023-02-01 21:34           ` D. Ben Knoble
  2023-02-01 22:15           ` Junio C Hamano
  0 siblings, 2 replies; 22+ messages in thread
From: D. Ben Knoble @ 2023-02-01 21:33 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: demerphq, Git, dds

On Wed, Feb 1, 2023 at 1:54 PM Junio C Hamano <gitster@pobox.com> wrote:
> There was a discussion on BRE having and not having GNU-like
> extension on macOS, in this thread
>
>   https://lore.kernel.org/git/26a0d4ca-3d97-ace4-1a1f-92b1ee6715a6@web.de/
>
> The patch we ended up using avoids touching the behaviour with ERE,
> as REG_ENHANCED on macOS affects REG_EXTENDED, but the issue we were
> looking at in the thread was about BRE.
>

Are you suggesting that I try with `-P` and/or `-E`?

CC'ing the original author of
https://github.com/git/git/commit/1819ad327b7a1f19540a819813b70a0e8a7f798f,
which I suspect is the most relevant.

-- 
D. Ben Knoble

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: grep: fix multibyte regex handling under macOS (1819ad327b7a1f19540a819813b70a0e8a7f798f)
  2023-02-01 21:33         ` D. Ben Knoble
@ 2023-02-01 21:34           ` D. Ben Knoble
  2023-02-01 22:15           ` Junio C Hamano
  1 sibling, 0 replies; 22+ messages in thread
From: D. Ben Knoble @ 2023-02-01 21:34 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: demerphq, Git, dds

On Wed, Feb 1, 2023 at 4:33 PM D. Ben Knoble <ben.knoble@gmail.com> wrote:
> Are you suggesting that I try with `-P` and/or `-E`?

Er, obviously these don't exist for git-diff.



-- 
D. Ben Knoble

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: grep: fix multibyte regex handling under macOS (1819ad327b7a1f19540a819813b70a0e8a7f798f)
  2023-02-01 21:33         ` D. Ben Knoble
  2023-02-01 21:34           ` D. Ben Knoble
@ 2023-02-01 22:15           ` Junio C Hamano
  1 sibling, 0 replies; 22+ messages in thread
From: Junio C Hamano @ 2023-02-01 22:15 UTC (permalink / raw)
  To: D. Ben Knoble; +Cc: demerphq, Git, dds

"D. Ben Knoble" <ben.knoble@gmail.com> writes:

> On Wed, Feb 1, 2023 at 1:54 PM Junio C Hamano <gitster@pobox.com> wrote:
>> There was a discussion on BRE having and not having GNU-like
>> extension on macOS, in this thread
>>
>>   https://lore.kernel.org/git/26a0d4ca-3d97-ace4-1a1f-92b1ee6715a6@web.de/
>>
>> The patch we ended up using avoids touching the behaviour with ERE,
>> as REG_ENHANCED on macOS affects REG_EXTENDED, but the issue we were
>> looking at in the thread was about BRE.
>>
>
> Are you suggesting that I try with `-P` and/or `-E`?

No.  I was just giving a concrete URL for what I think demerphq was
referring to as "saw some discussion recently".

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: grep: fix multibyte regex handling under macOS (1819ad327b7a1f19540a819813b70a0e8a7f798f)
  2023-02-01 16:09 ` demerphq
  2023-02-01 16:21   ` D. Ben Knoble
@ 2023-02-01 23:03   ` Jeff King
  2023-02-02 16:22     ` demerphq
  2023-02-02 20:47     ` D. Ben Knoble
  1 sibling, 2 replies; 22+ messages in thread
From: Jeff King @ 2023-02-01 23:03 UTC (permalink / raw)
  To: demerphq; +Cc: D. Ben Knoble, git

On Wed, Feb 01, 2023 at 05:09:33PM +0100, demerphq wrote:

> > Failure (using Zsh to produce the characters; I think there's a Bash
> > equivalent):
> > ```
> > # git diff --word-diff --word-diff-regex=$'[\xc0-\xff][\x80-\xbf]+'
> > fatal¬†: invalid regular expression: [¿-ˇ][Ä-ø]+
> > ```
> 
> FWIW that looks pretty weird to me, like the escapes in the charclass
> were interpolated before being fed to the regex engine. Are you sure
> you tested the right thing?

I think the point is that he is feeding a raw \xc0 byte (not the escape
sequence) to the regex engine, which is bogus UTF8. And the internal
userdiff drivers do the same thing. They contain "[\xc0-\xff]", and
those "\x" will be interpolated by the compiler into their actual bytes.

So the regex engine is complaining that it is getting bytes with high
bits set, but that are not part of a multi-byte character. I.e., it is
not happy to do bytewise matching, but really wants valid UTF8 in the
expression.

glibc's regex engine seems OK with this. Try:

  git grep $'[\xc0-\xff]'

in git.git, and it will find lots of multi-byte characters. But pcre,
for example, is not:

  $ git grep -P $'[\xc0-\xff]'
  fatal: command line, '[<C0>-<FF>]': UTF-8 error: byte 2 top bits not 0x80

There you really want to feed the literal escapes (obviously dropping
the '$ shell interpolation is a better solution, but for the sake of
illustration):

  git grep -P $'[\\xc0-\\xff]'

But I don't think we can rely on the libc BRE supporting "\x" in
character classes. Glibc certainly doesn't. I'm not sure what the
portable solution is.

-Peff

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: grep: fix multibyte regex handling under macOS (1819ad327b7a1f19540a819813b70a0e8a7f798f)
  2023-02-01 23:03   ` Jeff King
@ 2023-02-02 16:22     ` demerphq
  2023-02-02 20:49       ` D. Ben Knoble
  2023-02-03 17:01       ` Jeff King
  2023-02-02 20:47     ` D. Ben Knoble
  1 sibling, 2 replies; 22+ messages in thread
From: demerphq @ 2023-02-02 16:22 UTC (permalink / raw)
  To: Jeff King; +Cc: D. Ben Knoble, git

On Thu, 2 Feb 2023 at 00:03, Jeff King <peff@peff.net> wrote:
>
> On Wed, Feb 01, 2023 at 05:09:33PM +0100, demerphq wrote:
>
> > > Failure (using Zsh to produce the characters; I think there's a Bash
> > > equivalent):
> > > ```
> > > # git diff --word-diff --word-diff-regex=$'[\xc0-\xff][\x80-\xbf]+'
> > > fatal¬†: invalid regular expression: [¿-ˇ][Ä-ø]+
> > > ```
> >
> > FWIW that looks pretty weird to me, like the escapes in the charclass
> > were interpolated before being fed to the regex engine. Are you sure
> > you tested the right thing?
>
> I think the point is that he is feeding a raw \xc0 byte (not the escape
> sequence) to the regex engine, which is bogus UTF8. And the internal
> userdiff drivers do the same thing. They contain "[\xc0-\xff]", and
> those "\x" will be interpolated by the compiler into their actual bytes.

Thanks, that was the bit that threw me off. I had completely forgotten
that C supports \x escapes :-(. The Perl internals and regex engine is
where I do most of my C hacking and it uses octal exclusively AFAIK.
(I guess it uses octal because of the "where does the escape end"
problem that C seems to have with hex escapes). So I had assumed that
something else, or the regex engine itself was interpolating them. I
appreciate that you took the time to set me straight.

> So the regex engine is complaining that it is getting bytes with high
> bits set, but that are not part of a multi-byte character. I.e., it is
> not happy to do bytewise matching, but really wants valid UTF8 in the
> expression.

Yeah, that was my first thought too, but as I said above the hex
escapes threw me off.

> glibc's regex engine seems OK with this. Try:
>
>   git grep $'[\xc0-\xff]'
>
> in git.git, and it will find lots of multi-byte characters. But pcre,
> for example, is not:
>
>   $ git grep -P $'[\xc0-\xff]'
>   fatal: command line, '[<C0>-<FF>]': UTF-8 error: byte 2 top bits not 0x80

I expect that has something to do with how you are configuring PCRE,
and that with a slightly different config it would be fine with this.

> There you really want to feed the literal escapes (obviously dropping
> the '$ shell interpolation is a better solution, but for the sake of
> illustration):
>
>   git grep -P $'[\\xc0-\\xff]'
>
> But I don't think we can rely on the libc BRE supporting "\x" in
> character classes. Glibc certainly doesn't. I'm not sure what the
> portable solution is.

I've been lurking watching some of the regex discussion on the list
and personally I think it is asking for trouble to use "whatever regex
engine is traditional in a given environment" instead of just choosing
a good open source engine and using it consistently everywhere.  I
don't really buy the arguments I have seen to justify a policy of "use
the standard library version"; regex engines vary widely in
performance and implementation and feature set, and even the really
good ones do not entirely agree on every semantic[1], so if you don't
standardize you will be forever dealing with bugs related to those
differences.

I think the git project should choose the feature set[2] it thinks are
important, and then choose a regex engine that provides those features
and is well supported, and then use it consistently everywhere that
git needs to do regex based matching. Anything else is asking for
trouble at some level or another.

Cheers,
yves
[1] Leaving aside advanced features, even something as simple as
alternation can vary by engine.  Consider "foo"=~/f|fo|foo/. Some
regex engines will match "foo", and some will match "f", depending on
whether they implement "longest match" (as most NFA/DFA engines do),
or if they implement "leftmost longest match" (as Perl and other
backtracking engines tend to do).

[2] Personally I think that features like recursive patterns, named
capture, negative and positive lookahead and lookbehind and branch
reset are so useful that it would be wise to choose an engine that
supports them, but some might argue for other priorities, performance
being a likely candidate.

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: grep: fix multibyte regex handling under macOS (1819ad327b7a1f19540a819813b70a0e8a7f798f)
  2023-02-01 23:03   ` Jeff King
  2023-02-02 16:22     ` demerphq
@ 2023-02-02 20:47     ` D. Ben Knoble
  2023-02-03 16:55       ` Jeff King
  1 sibling, 1 reply; 22+ messages in thread
From: D. Ben Knoble @ 2023-02-02 20:47 UTC (permalink / raw)
  To: Jeff King; +Cc: demerphq, git

On Wed, Feb 1, 2023 at 6:03 PM Jeff King <peff@peff.net> wrote:
> So the regex engine is complaining that it is getting bytes with high
> bits set, but that are not part of a multi-byte character. I.e., it is
> not happy to do bytewise matching, but really wants valid UTF8 in the
> expression.

I did manage to find that the call to regcomp in diff.c's
init_diff_words_data (line 2212 in v2.39.1) is what crashes; I could
not step into it with gdb, however.

Further, the following C program compiles without warnings (except for
the unused main parameters):
```
#include <regex.h>
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

int main(int argc, char **argv) {
    regex_t re;
    int ret = regcomp(&re, "[\xc0-\xff][\x80-\xbf]+", REG_EXTENDED |
REG_NEWLINE);
    /* assert(ret != 0); */
    size_t errbuf_size = regerror(ret, &re, NULL, 0);
    char errbuf[errbuf_size];
    regerror(ret, &re, errbuf, errbuf_size);
    printf("%s\n", errbuf);
}
```

```
# CFLAGS='-Wall -Wextra -Wmissing-prototypes -Wstrict-prototypes
-Wold-style-definition -Wshadow -Wpointer-arith -Wcast-qual -pedantic
-std=c11'
# cc $CFLAGS regtest.c -o regtest && ./regtest
*** unknown regexp error code ***
```
(the assertion fails because regcomp succeeds!)

So I can neither find out what's to blame nor what to fix. Here are
the linked libraries on macOS (IIUC):
```
# otool -L regtest
regtest:
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current
version 1311.0.0)
# otool -L ./git-diff # from v2.39.1 source build today
./git-diff:
/System/Library/Frameworks/CoreServices.framework/Versions/A/CoreServices
(compatibility version 1.0.0, current version 1141.1.0)
/usr/lib/libz.1.dylib (compatibility version 1.0.0, current version 1.2.11)
/usr/lib/libiconv.2.dylib (compatibility version 7.0.0, current version 7.0.0)
/usr/local/opt/gettext/lib/libintl.8.dylib (compatibility version
12.0.0, current version 12.0.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current
version 1311.0.0)
/System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation
(compatibility version 150.0.0, current version 1856.105.0)
```

-- 
D. Ben Knoble

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: grep: fix multibyte regex handling under macOS (1819ad327b7a1f19540a819813b70a0e8a7f798f)
  2023-02-02 16:22     ` demerphq
@ 2023-02-02 20:49       ` D. Ben Knoble
  2023-02-03 17:01       ` Jeff King
  1 sibling, 0 replies; 22+ messages in thread
From: D. Ben Knoble @ 2023-02-02 20:49 UTC (permalink / raw)
  To: demerphq; +Cc: Jeff King, git

On Thu, Feb 2, 2023 at 11:22 AM demerphq <demerphq@gmail.com> wrote:
> I've been lurking watching some of the regex discussion on the list
> and personally I think it is asking for trouble to use "whatever regex
> engine is traditional in a given environment" instead of just choosing
> a good open source engine and using it consistently everywhere.  I
> don't really buy the arguments I have seen to justify a policy of "use
> the standard library version"; regex engines vary widely in
> performance and implementation and feature set, and even the really
> good ones do not entirely agree on every semantic[1], so if you don't
> standardize you will be forever dealing with bugs related to those
> differences.

My understanding is that the patch suspected in the subject and linked
several times in this thread _does_ use the native Darwin regex
library which causes the problem, but I can't reproduce it in a
smaller program (see previous mail).


-- 
D. Ben Knoble

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: grep: fix multibyte regex handling under macOS (1819ad327b7a1f19540a819813b70a0e8a7f798f)
  2023-02-02 20:47     ` D. Ben Knoble
@ 2023-02-03 16:55       ` Jeff King
  2023-02-03 17:06         ` D. Ben Knoble
  0 siblings, 1 reply; 22+ messages in thread
From: Jeff King @ 2023-02-03 16:55 UTC (permalink / raw)
  To: D. Ben Knoble; +Cc: demerphq, git

On Thu, Feb 02, 2023 at 03:47:28PM -0500, D. Ben Knoble wrote:

> Further, the following C program compiles without warnings (except for
> the unused main parameters):
> ```
> #include <regex.h>
> #include <assert.h>
> #include <stddef.h>
> #include <stdio.h>
> 
> int main(int argc, char **argv) {
>     regex_t re;
>     int ret = regcomp(&re, "[\xc0-\xff][\x80-\xbf]+", REG_EXTENDED |
> REG_NEWLINE);
>     /* assert(ret != 0); */
>     size_t errbuf_size = regerror(ret, &re, NULL, 0);
>     char errbuf[errbuf_size];
>     regerror(ret, &re, errbuf, errbuf_size);
>     printf("%s\n", errbuf);
> }
> ```

Just a guess, but does calling:

  setlocale(LC_CTYPE, "");

at the start of the program change things (you'll probably need to also
include locale.h)? I don't have a macos system to test on.

-Peff

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: grep: fix multibyte regex handling under macOS (1819ad327b7a1f19540a819813b70a0e8a7f798f)
  2023-02-02 16:22     ` demerphq
  2023-02-02 20:49       ` D. Ben Knoble
@ 2023-02-03 17:01       ` Jeff King
  2023-02-03 21:56         ` Ævar Arnfjörð Bjarmason
  2023-02-04 11:32         ` demerphq
  1 sibling, 2 replies; 22+ messages in thread
From: Jeff King @ 2023-02-03 17:01 UTC (permalink / raw)
  To: demerphq; +Cc: D. Ben Knoble, git

On Thu, Feb 02, 2023 at 05:22:37PM +0100, demerphq wrote:

> I've been lurking watching some of the regex discussion on the list
> and personally I think it is asking for trouble to use "whatever regex
> engine is traditional in a given environment" instead of just choosing
> a good open source engine and using it consistently everywhere.  I
> don't really buy the arguments I have seen to justify a policy of "use
> the standard library version"; regex engines vary widely in
> performance and implementation and feature set, and even the really
> good ones do not entirely agree on every semantic[1], so if you don't
> standardize you will be forever dealing with bugs related to those
> differences.

I think this is a perennial question for portable software: is it better
to be consistent across platforms (by shipping our own regex engine), or
consistent with other programs on the same platform (by using the system
regex).

I don't have a strong opinion either way. The main concern I'd have is
handling dependencies. I like pcre a lot, but I'm not sure that I would
want building Git to require pcre on every platform. If there's an
engine we can ship as a vendored dependency that builds everywhere, that
helps. We have the engine imported from gawk in compat/regex. That
_probably_ builds everywhere (though we don't really know, because any
platform that doesn't set NO_REGEX has been happily using the system
routines). But it also may not be the best choice; avoiding its
multi-byte handling was the reason behind 1819ad327 in the first place.

> I think the git project should choose the feature set[2] it thinks are
> important, and then choose a regex engine that provides those features
> and is well supported, and then use it consistently everywhere that
> git needs to do regex based matching. Anything else is asking for
> trouble at some level or another.

IMHO the biggest issue here is that the built-in userdiff regexes are
doing something a bit questionable, which is embedding high-bit
characters directly into the regex. If we can avoid that, then having
consistency in multi-byte handling across platforms becomes a lot less
important.

-Peff

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: grep: fix multibyte regex handling under macOS (1819ad327b7a1f19540a819813b70a0e8a7f798f)
  2023-02-03 16:55       ` Jeff King
@ 2023-02-03 17:06         ` D. Ben Knoble
  0 siblings, 0 replies; 22+ messages in thread
From: D. Ben Knoble @ 2023-02-03 17:06 UTC (permalink / raw)
  To: Jeff King; +Cc: demerphq, git

On Fri, Feb 3, 2023 at 11:55 AM Jeff King <peff@peff.net> wrote:
> Just a guess, but does calling:
>
>   setlocale(LC_CTYPE, "");
>
> at the start of the program change things (you'll probably need to also
> include locale.h)?

Indeed, the new output is

    illegal byte sequence

For the following program

    #include <regex.h>
    #include <assert.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <locale.h>

    int main(int argc, char **argv) {
        char *loc = setlocale(LC_CTYPE, "");
        assert (loc != NULL);
        regex_t re;
        int ret = regcomp(&re, "[\xc0-\xff][\x80-\xbf]+", REG_EXTENDED
| REG_NEWLINE);
        /* assert(ret != 0); */
        size_t errbuf_size = regerror(ret, &re, NULL, 0);
        char errbuf[errbuf_size];
        regerror(ret, &re, errbuf, errbuf_size);
        printf("%s\n", errbuf);
    }

My own locale output, for completion's sake:

    LANG="fr_FR.UTF-8"
    LC_COLLATE="fr_FR.UTF-8"
    LC_CTYPE="fr_FR.UTF-8"
    LC_MESSAGES="fr_FR.UTF-8"
    LC_MONETARY="fr_FR.UTF-8"
    LC_NUMERIC="fr_FR.UTF-8"
    LC_TIME="fr_FR.UTF-8"
    LC_ALL="fr_FR.UTF-8"


-- 
D. Ben Knoble

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: grep: fix multibyte regex handling under macOS (1819ad327b7a1f19540a819813b70a0e8a7f798f)
  2023-02-03 17:01       ` Jeff King
@ 2023-02-03 21:56         ` Ævar Arnfjörð Bjarmason
  2023-02-04 11:17           ` Jeff King
  2023-02-04 11:32         ` demerphq
  1 sibling, 1 reply; 22+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-02-03 21:56 UTC (permalink / raw)
  To: Jeff King; +Cc: demerphq, D. Ben Knoble, git


On Fri, Feb 03 2023, Jeff King wrote:

> On Thu, Feb 02, 2023 at 05:22:37PM +0100, demerphq wrote:
>
>> I've been lurking watching some of the regex discussion on the list
>> and personally I think it is asking for trouble to use "whatever regex
>> engine is traditional in a given environment" instead of just choosing
>> a good open source engine and using it consistently everywhere.  I
>> don't really buy the arguments I have seen to justify a policy of "use
>> the standard library version"; regex engines vary widely in
>> performance and implementation and feature set, and even the really
>> good ones do not entirely agree on every semantic[1], so if you don't
>> standardize you will be forever dealing with bugs related to those
>> differences.
>
> I think this is a perennial question for portable software: is it better
> to be consistent across platforms (by shipping our own regex engine), or
> consistent with other programs on the same platform (by using the system
> regex).

*nod*

> I don't have a strong opinion either way. The main concern I'd have is
> handling dependencies. I like pcre a lot, but I'm not sure that I would
> want building Git to require pcre on every platform. If there's an
> engine we can ship as a vendored dependency that builds everywhere, that
> helps.

We can just make that fallback engine be PCRE. I submitted patches a
while ago to include a minimal version of it in compat/pcre, as we seem
to have some allergy to external dependencies:
https://lore.kernel.org/git/20170511175115.648-1-avarab@gmail.com/

It's ~80k lines instead of compat/regex's ~15k, but it's actually
maintained, and would be much easier to upgrade.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: grep: fix multibyte regex handling under macOS (1819ad327b7a1f19540a819813b70a0e8a7f798f)
  2023-02-03 21:56         ` Ævar Arnfjörð Bjarmason
@ 2023-02-04 11:17           ` Jeff King
  0 siblings, 0 replies; 22+ messages in thread
From: Jeff King @ 2023-02-04 11:17 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: demerphq, D. Ben Knoble, git

On Fri, Feb 03, 2023 at 10:56:53PM +0100, Ævar Arnfjörð Bjarmason wrote:

> > I don't have a strong opinion either way. The main concern I'd have is
> > handling dependencies. I like pcre a lot, but I'm not sure that I would
> > want building Git to require pcre on every platform. If there's an
> > engine we can ship as a vendored dependency that builds everywhere, that
> > helps.
> 
> We can just make that fallback engine be PCRE. I submitted patches a
> while ago to include a minimal version of it in compat/pcre, as we seem
> to have some allergy to external dependencies:
> https://lore.kernel.org/git/20170511175115.648-1-avarab@gmail.com/
> 
> It's ~80k lines instead of compat/regex's ~15k, but it's actually
> maintained, and would be much easier to upgrade.

I'm OK with that if we really think that libpcre will build without
problems on every platform that Git does. I don't know if we have any
data there. Obviously libpcre builds lots of places, but will we have
problems on obscure platforms like NonStop? Part of me wants to not
care, but if the value here is saying "the regex engine is always going
to be X", then there is not much point in saying "the regex engine is
usually X, but you can't rely on it because sometimes it's not".

"Usually" is enough for helping users quality of life (if we help 99% of
users, that is good). It isn't enough for making assumptions in the code
(like using constructs in userdiff regexes that would break horribly on
the other 1% of platforms).

-Peff

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: grep: fix multibyte regex handling under macOS (1819ad327b7a1f19540a819813b70a0e8a7f798f)
  2023-02-03 17:01       ` Jeff King
  2023-02-03 21:56         ` Ævar Arnfjörð Bjarmason
@ 2023-02-04 11:32         ` demerphq
  2023-02-05 19:51           ` D. Ben Knoble
  2023-02-07 18:19           ` Jeff King
  1 sibling, 2 replies; 22+ messages in thread
From: demerphq @ 2023-02-04 11:32 UTC (permalink / raw)
  To: Jeff King; +Cc: D. Ben Knoble, git

On Fri, 3 Feb 2023 at 18:01, Jeff King <peff@peff.net> wrote:
>
> On Thu, Feb 02, 2023 at 05:22:37PM +0100, demerphq wrote:
>
> > I've been lurking watching some of the regex discussion on the list
> > and personally I think it is asking for trouble to use "whatever regex
> > engine is traditional in a given environment" instead of just choosing
> > a good open source engine and using it consistently everywhere.  I
> > don't really buy the arguments I have seen to justify a policy of "use
> > the standard library version"; regex engines vary widely in
> > performance and implementation and feature set, and even the really
> > good ones do not entirely agree on every semantic[1], so if you don't
> > standardize you will be forever dealing with bugs related to those
> > differences.
>
> I think this is a perennial question for portable software: is it better
> to be consistent across platforms (by shipping our own regex engine), or
> consistent with other programs on the same platform (by using the system
> regex).

Personally I think that while this seems to be an impartial reading of
the question at hand I think it frames the debate in a way that has
the potential to bias[1] the discussion in favour of a particular
policy outcome[2]. It implies that all other things are equal between
the two options presented, and frames the question as something that
comes down to personal preference between one form of consistency and
another.

But I am not sure that all other things are equal here, at least as
far as regex engines go. I think there is evidence that suggests that
depending on the system regex engine introduces long term recurring
costs that would not be incurred if git chose to link to a specific
library everywhere. I think it ignores the implications on the wider
ecosystem and toolchains. For instance if the behavior of git grep
differs by platform then scripts that might want to use git grep to
automate git become less portable, or more expensive to maintain to
work around the inconsistencies.  It overlooks the costs of training
humans (arguably low) versus the costs of training computers (arguably
high). It also assumes that being consistent with other programs on
the same platform is inherently beneficial, when that doesn't seem to
be clearly established[2]. It also assumes that there are only two
options. Maybe there are more. Maybe there is a third or fourth option
as well. One would be to use a specific library for internal regexes,
and let the command line use the system library by default. Another
would be to make the default engine be one that ships with git, and
that users that want "platform compatibility" should use an option to
get it, much as they would with -P to enable PCRE. You mentioned that
there is already such an engine, but it isn't well tested. Maybe that
should be changed.

I think that if you look at other broadly ported projects there is
evidence that owning your own dependencies makes a project easier and
cheaper to port to new platforms. The more platforms you target the
more room there is for inconsistencies and the more costs there will
be to deal with them. If portability of git is a goal and minimizing
the cost of doing so is a secondary goal then I would say that using a
specific library will make achieving that goal easier and lower cost
than depending on the system libraries. There would be a high initial
cost to do the switch, and then a low cost in the long run.  As far as
I know Perl is more broadly ported than git, based on the fact that
when we migrated to git a number of the Perl maintainers could not use
git on their platforms (Vax comes to mind), and Perl definitely adopts
the view that it is better to own your dependencies, and wraps and
hides system inconsistencies as much as possible to make porting as
easy as possible. So that is one precedent to consider. No doubt there
are many other long running projects with precedents in this area.
What does Vi or Vim do? What does Emacs do? Etc.

Anyway, my opinion on these things doesn't matter that much, I am just
a git user who happens to have a passion for regular expressions and
regular expression engines and I am happily served by the PCRE support
in my git build. So I can't answer these questions for the project.
But I do think the questions that need to be answered are more complex
and nuanced than deciding which of two forms of consistency is more
important.

Thanks for hearing me out.

cheers,
yves
[1] PS I do not mean to imply that you are *intending* to bias the
discussion. I think your writing and measured approach indicates
strongly that you intend to be impartial, but nevertheless I think
this way of framing the question does bias the debate.

[2] My thinking about the framing of this question is probably pretty
strongly influenced by a recently released report from the BBC on
journalistic bias in presenting questions of economic debate. The
report presents some interesting perspectives on how framing questions
and data can bias the discussion even though the person presenting the
data or asking the question actually intended to be impartial.
Regardless of your position on what regex engine git should use I
think it is a good read. Especially in this day and age of austerity
politics and debt-ceiling debates around the world.
https://www.bbc.co.uk/aboutthebbc/documents/thematic-review-taxation-public-spending-govt-borrowing-debt.pdf

[3] Personally I use git grep with patterns and the -P flag far more
than I use grep with anything other than a constant string. These days
I barely ever use grep, so to me what it does is entirely irrelevant
to what git does.  I suspect I am not alone in this.
-- 
perl -Mre=debug -e "/just|another|perl|hacker/"

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: grep: fix multibyte regex handling under macOS (1819ad327b7a1f19540a819813b70a0e8a7f798f)
  2023-02-04 11:32         ` demerphq
@ 2023-02-05 19:51           ` D. Ben Knoble
  2023-02-07 18:23             ` Jeff King
  2023-02-07 18:19           ` Jeff King
  1 sibling, 1 reply; 22+ messages in thread
From: D. Ben Knoble @ 2023-02-05 19:51 UTC (permalink / raw)
  To: demerphq; +Cc: Jeff King, git

On Sat, Feb 4, 2023 at 6:32 AM demerphq <demerphq@gmail.com> wrote:
> I think that if you look at other broadly ported projects there is
> evidence that owning your own dependencies makes a project easier and
> cheaper to port to new platforms. […] No doubt there
> are many other long running projects with precedents in this area.
> What does Vi or Vim do? What does Emacs do? Etc.

Vi and Vim have their own pattern language descended from Ed. They
thus have their own implementation for it. (Vim's has at least two
different "engines" powering the same syntax, and tries to
automatically pick the best one in case; I'm not sure if Vim compiles
the patterns before using them.) The advantage is that, as has been
pointed out, you can rely on Vi or Vim patterns.

…

Any thoughts on some sort of stop-gap measure to fix --word-diff while
Git decides how to handle the regex engine incompatibilities? How
important is the sequence of bytes at the end of --word-diff regexes
in userdiff.c?

-- 
D. Ben Knoble

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: grep: fix multibyte regex handling under macOS (1819ad327b7a1f19540a819813b70a0e8a7f798f)
  2023-02-04 11:32         ` demerphq
  2023-02-05 19:51           ` D. Ben Knoble
@ 2023-02-07 18:19           ` Jeff King
  1 sibling, 0 replies; 22+ messages in thread
From: Jeff King @ 2023-02-07 18:19 UTC (permalink / raw)
  To: demerphq; +Cc: D. Ben Knoble, git

On Sat, Feb 04, 2023 at 12:32:09PM +0100, demerphq wrote:

> > I think this is a perennial question for portable software: is it better
> > to be consistent across platforms (by shipping our own regex engine), or
> > consistent with other programs on the same platform (by using the system
> > regex).
> 
> Personally I think that while this seems to be an impartial reading of
> the question at hand I think it frames the debate in a way that has
> the potential to bias[1] the discussion in favour of a particular
> policy outcome[2]. It implies that all other things are equal between
> the two options presented, and frames the question as something that
> comes down to personal preference between one form of consistency and
> another.

Sorry, I didn't mean to imply one side or the other, or even that they
were equal. I more just meant: you can make arguments either way.

_I_ don't have a strong opinion there, but you have made many of those
arguments in your email. And everything you said was pretty sensible.

I did want to mention one thing, though:

> It also assumes that being consistent with other programs on
> the same platform is inherently beneficial, when that doesn't seem to
> be clearly established[2].

This isn't consistency per se, but one problem that arises when a
portable program ships its own implementation is that its implementation
sucks more than the system one. And that was the case with 1819ad327;
our compat/regex was worse than the system with respect to multi-byte
characters.

But yes, if your point is: "well, don't ship a crappy implementation",
then I agree that is a way to mitigate that problem. ;)

-Peff

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: grep: fix multibyte regex handling under macOS (1819ad327b7a1f19540a819813b70a0e8a7f798f)
  2023-02-05 19:51           ` D. Ben Knoble
@ 2023-02-07 18:23             ` Jeff King
  2023-02-07 22:27               ` D. Ben Knoble
  0 siblings, 1 reply; 22+ messages in thread
From: Jeff King @ 2023-02-07 18:23 UTC (permalink / raw)
  To: D. Ben Knoble; +Cc: demerphq, git

On Sun, Feb 05, 2023 at 02:51:05PM -0500, D. Ben Knoble wrote:

> Any thoughts on some sort of stop-gap measure to fix --word-diff while
> Git decides how to handle the regex engine incompatibilities? How
> important is the sequence of bytes at the end of --word-diff regexes
> in userdiff.c?

It comes from 664d44ee7f (userdiff: simplify word-diff safeguard,
2011-01-11). So presumably we'd want to figure out a way to accomplish
the same thing in a portable way. I'm not sure that's possible, though,
without making assumptions about the regex engine.

There's a little more discussion in the original thread:

  https://lore.kernel.org/git/561247.22837.qm@web110707.mail.gq1.yahoo.com/

but I didn't read it very carefully.

It may also be possible to play tricks with LC_CTYPE (either not setting
it, or resetting before/after a regex), but that feels pretty hacky.

-Peff

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: grep: fix multibyte regex handling under macOS (1819ad327b7a1f19540a819813b70a0e8a7f798f)
  2023-02-07 18:23             ` Jeff King
@ 2023-02-07 22:27               ` D. Ben Knoble
  0 siblings, 0 replies; 22+ messages in thread
From: D. Ben Knoble @ 2023-02-07 22:27 UTC (permalink / raw)
  To: Jeff King; +Cc: demerphq, git, jrnieder

CC'ing Jonathan Nieder

On Tue, Feb 7, 2023 at 1:23 PM Jeff King <peff@peff.net> wrote:
>
> On Sun, Feb 05, 2023 at 02:51:05PM -0500, D. Ben Knoble wrote:
>
> > Any thoughts on some sort of stop-gap measure to fix --word-diff while
> > Git decides how to handle the regex engine incompatibilities? How
> > important is the sequence of bytes at the end of --word-diff regexes
> > in userdiff.c?
>
> It comes from 664d44ee7f (userdiff: simplify word-diff safeguard,
> 2011-01-11). So presumably we'd want to figure out a way to accomplish
> the same thing in a portable way. I'm not sure that's possible, though,
> without making assumptions about the regex engine.

If "use the safeguard portably" implies "make assumptions about the
regex engine," that sounds like an argument for Git to ship its own
engine with exactly the necessary features. If that implementation
includes proper locale and UTF-8 support alongside support for the
high-byte character classes, I think we would be all set…

OTOH, perhaps there is a way to express the safeguard character
classes portably?

Jonathan, can you provide more context for the safeguard? I've read
this message several times

> git's diff-words support has a detail that can be a little dangerous:
> any text not matched by a given language's tokenization pattern is
> treated as whitespace and changes in such text would go unnoticed.
> Therefore each of the built-in regexes allows a special token type
> consisting of a single non-whitespace character [^[:space:]].
>
> To make sure UTF-8 sequences remain human readable, the builtin
> regexes also have a special token type for runs of bytes with the high
> bit set.  In English, non-ASCII characters are usually isolated so
> this is analogous to the [^[:space:]] pattern, except it matches a
> single _multibyte_ character despite use of the C locale.
>
> Unfortunately it is easy to make typos or forget entirely to include
> these catch-all token types when adding support for new languages (see
> v1.7.3.5~16, userdiff: fix typo in ruby and python word regexes,
> 2010-12-18).  Avoid this by including them automatically within the
> PATTERNS and IPATTERN macros.
>
> While at it, change the UTF-8 sequence token type to match exactly one
> non-ASCII multi-byte character, rather than an arbitrary run of them.

and I can hardly make heads or tails of it.

-- 
D. Ben Knoble

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2023-02-07 22:27 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-01 15:18 grep: fix multibyte regex handling under macOS (1819ad327b7a1f19540a819813b70a0e8a7f798f) D. Ben Knoble
2023-02-01 16:09 ` demerphq
2023-02-01 16:21   ` D. Ben Knoble
2023-02-01 18:23     ` demerphq
2023-02-01 18:54       ` Junio C Hamano
2023-02-01 21:33         ` D. Ben Knoble
2023-02-01 21:34           ` D. Ben Knoble
2023-02-01 22:15           ` Junio C Hamano
2023-02-01 23:03   ` Jeff King
2023-02-02 16:22     ` demerphq
2023-02-02 20:49       ` D. Ben Knoble
2023-02-03 17:01       ` Jeff King
2023-02-03 21:56         ` Ævar Arnfjörð Bjarmason
2023-02-04 11:17           ` Jeff King
2023-02-04 11:32         ` demerphq
2023-02-05 19:51           ` D. Ben Knoble
2023-02-07 18:23             ` Jeff King
2023-02-07 22:27               ` D. Ben Knoble
2023-02-07 18:19           ` Jeff King
2023-02-02 20:47     ` D. Ben Knoble
2023-02-03 16:55       ` Jeff King
2023-02-03 17:06         ` D. Ben Knoble

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).