* regex compilation error with --color-words
@ 2023-03-29 22:55 Eric Sunshine
2023-03-30 7:55 ` Diomidis Spinellis
0 siblings, 1 reply; 8+ messages in thread
From: Eric Sunshine @ 2023-03-29 22:55 UTC (permalink / raw)
To: Git List, Diomidis Spinellis
Cc: René Scharfe, Junio C Hamano, demerphq, Mario Grgic
I'm encountering a failure on macOS High Sierra 10.13.6 when using
--color-words:
% git show --color-words HEAD
fatal: invalid regular expression:
[[:alpha:]_'][[:alnum:]_']*|0[xb]?[0-9a-fA-F_]*|[0-9a-fA-F_]+(\.[0-9a-fA-F_]+)?([eE][-+]?[0-9_]+)?|=>|-[rwxoRWXOezsfdlpSugkbctTBMAC>]|~~|::|&&=|\|\|=|//=|\*\*=|&&|\|\||//|\+\+|--|\*\*|\.\.\.?|[-+*/%.^&<>=!|]=|=~|!~|<<|<>|<=>|>>|[^[:space:]]|[<C0>-<FF>][<80>-<BF>]+
This crash happens when viewing the commit I sent to Peff today[1],
though it doesn't happen with all commits. The problem bisects to:
Author: Diomidis Spinellis <dds@aueb.gr>
Date: Fri Aug 26 11:58:15 2022 +0300
grep: fix multibyte regex handling under macOS
The commit 29de20504e (Makefile: fix default regex settings on
Darwin, 2013-05-11) fixed t0070-fundamental.sh under Darwin (macOS) by
adopting Git's regex library. However, this library is compiled with
NO_MBSUPPORT, which causes git-grep to work incorrectly on multibyte
(e.g. UTF-8) files. Current macOS versions pass t0070-fundamental.sh
with the native macOS regex library, which also supports multibyte
characters.
Adjust the Makefile to use the native regex library, and call
setlocale(3) to set CTYPE according to the user's preference.
The setlocale call is required on all platforms, but in platforms
supporting gettext(3), setlocale was called as a side-effect of
initializing gettext. Therefore, move the CTYPE setlocale call from
gettext.c to common-main.c and the corresponding locale.h include
into git-compat-util.h.
Thanks to the global initialization of CTYPE setlocale, the test-tool
regex command now works correctly with supported multibyte regexes, and
is used to set the MB_REGEX test prerequisite by assessing a platform's
support for them.
Signed-off-by: Diomidis Spinellis <dds@aueb.gr>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
I see that this same commit is also the subject of another bug report
currently being discussed[2], so I've Cc:'d the participants of that
thread, as well.
Any pointers aimed at getting this resolved would be appreciated.
[1]: https://lore.kernel.org/git/CAPig+cQiOGrDSUc34jHEBp87Rx-dnXNcPcF76bu0SJoOzD+1hw@mail.gmail.com/
[2]: https://lore.kernel.org/git/MW4PR20MB5517583CBEEF34B1E87CCF1290859@MW4PR20MB5517.namprd20.prod.outlook.com/
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: regex compilation error with --color-words
2023-03-29 22:55 Eric Sunshine
@ 2023-03-30 7:55 ` Diomidis Spinellis
2023-03-31 20:44 ` René Scharfe
0 siblings, 1 reply; 8+ messages in thread
From: Diomidis Spinellis @ 2023-03-30 7:55 UTC (permalink / raw)
To: Eric Sunshine, Git List
Cc: René Scharfe, Junio C Hamano, demerphq, Mario Grgic
On 30-Mar-23 1:55, Eric Sunshine wrote:
> I'm encountering a failure on macOS High Sierra 10.13.6 when using
> --color-words:
The built-in word separation regular expression pattern for the Perl
language fails to work with the macOS regex engine. The same also
happens with the FreeBSD one (tested on 14.0).
The issue can be replicated through the following sequence of commands.
git init color-words
cd color-words
echo '*.pl diff=perl' >.gitattributes
echo 'print 42;' >t.pl
git add t.pl
git commit -am Add
git show --color-words
Strangely, I haven't been able to reproduce the failure with egrep on
any of the two platforms.
egrep
'[[:alpha:]_'\''][[:alnum:]_'\'']*|0[xb]?[0-9a-fA-F_]*|[0-9a-fA-F_]+(\.[0-9a-fA-F_]+)?([eE][-+]?[0-9_]+)?|=>|-[rwxoRWXOezsfdlpSugkbctTBMAC>]|~~|::|&&=|\|\|=|//=|\*\*=|&&|\|\||//|\+\+|--|\*\*|\.\.\.?|[-+*/%.^&<>=!|]=|=~|!~|<<|<>|<=>|>>|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+'
/dev/null
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: regex compilation error with --color-words
2023-03-30 7:55 ` Diomidis Spinellis
@ 2023-03-31 20:44 ` René Scharfe
2023-04-02 9:44 ` René Scharfe
0 siblings, 1 reply; 8+ messages in thread
From: René Scharfe @ 2023-03-31 20:44 UTC (permalink / raw)
To: Diomidis Spinellis, Eric Sunshine, Git List
Cc: Junio C Hamano, demerphq, Mario Grgic
Am 30.03.23 um 09:55 schrieb Diomidis Spinellis:
> On 30-Mar-23 1:55, Eric Sunshine wrote:
>> I'm encountering a failure on macOS High Sierra 10.13.6 when using
>> --color-words:
>
> The built-in word separation regular expression pattern for the Perl language fails to work with the macOS regex engine. The same also happens with the FreeBSD one (tested on 14.0).
>
> The issue can be replicated through the following sequence of commands.
>
> git init color-words
> cd color-words
> echo '*.pl diff=perl' >.gitattributes
> echo 'print 42;' >t.pl
> git add t.pl
> git commit -am Add
> git show --color-words
Or in Git's own repo:
$ git log -p --color-words --no-merges '*.c'
Schwerwiegend: invalid regular expression: [a-zA-Z_][a-zA-Z0-9_]*|[0-9][0-9.]*([Ee][-+]?[0-9]+)?[fFlLuU]*|0[xXbB][0-9a-fA-F]+[lLuU]*|\.[0-9][0-9]*([Ee][-+]?[0-9]+)?[fFlL]?|[-+*/<>%&^|=!]=|--|\+\+|<<=?|>>=?|&&|\|\||::|->\*?|\.\*|<=>|[^[:space:]]|[<C0>-<FF>][<80>-<BF>]+
commit 14b9a044798ebb3858a1f1a1377309a3d6054ac8
[...]
The error disappears when localization is turned off:
$ LANG=C git log -p --color-words --no-merges '*.c' >/dev/null
# just finishes without an error
The issue also vanishes when the "|[\xc0-\xff][\x80-\xbf]+" part is
removed that the macros PATTERNS and IPATTERN in userdiff.c append.
So it seems regcomp(1) on macOS doesn't like invalid Unicode characters
unless it's in ASCII mode (LANG=C). 664d44ee7f (userdiff: simplify
word-diff safeguard, 2011-01-11) explains that this part exists to match
a multi-byte UTF-8 character. With a regcomp(1) that supports
multi-byte characters natively they need to be specified differently, I
guess, perhaps like this "[^\x00-\x7f]"?
> Strangely, I haven't been able to reproduce the failure with egrep on any of the two platforms.
>
> egrep '[[:alpha:]_'\''][[:alnum:]_'\'']*|0[xb]?[0-9a-fA-F_]*|[0-9a-fA-F_]+(\.[0-9a-fA-F_]+)?([eE][-+]?[0-9_]+)?|=>|-[rwxoRWXOezsfdlpSugkbctTBMAC>]|~~|::|&&=|\|\|=|//=|\*\*=|&&|\|\||//|\+\+|--|\*\*|\.\.\.?|[-+*/%.^&<>=!|]=|=~|!~|<<|<>|<=>|>>|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+' /dev/null
No idea how to specify non-ASCII bytes in shell or regex. '\xNN' does
not seem to do the trick. printf(1) interpretes octal numbers, though:
$ echo ö | egrep $(printf "[\200-\377]")
egrep: illegal byte sequence
(The regex contains "illegal bytes" -- UTF-8 multi-byte sequences cut
short; the "ö" is OK.)
René
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: regex compilation error with --color-words
2023-03-31 20:44 ` René Scharfe
@ 2023-04-02 9:44 ` René Scharfe
2023-04-03 16:29 ` Junio C Hamano
0 siblings, 1 reply; 8+ messages in thread
From: René Scharfe @ 2023-04-02 9:44 UTC (permalink / raw)
To: Diomidis Spinellis, Eric Sunshine, Git List
Cc: Junio C Hamano, demerphq, Mario Grgic
Am 31.03.23 um 22:44 schrieb René Scharfe:
> Am 30.03.23 um 09:55 schrieb Diomidis Spinellis:
>> On 30-Mar-23 1:55, Eric Sunshine wrote:
>>> I'm encountering a failure on macOS High Sierra 10.13.6 when using
>>> --color-words:
>>
>> The built-in word separation regular expression pattern for the Perl language fails to work with the macOS regex engine. The same also happens with the FreeBSD one (tested on 14.0).
>>
>> The issue can be replicated through the following sequence of commands.
>>
>> git init color-words
>> cd color-words
>> echo '*.pl diff=perl' >.gitattributes
>> echo 'print 42;' >t.pl
>> git add t.pl
>> git commit -am Add
>> git show --color-words
>
> Or in Git's own repo:
>
> $ git log -p --color-words --no-merges '*.c'
> Schwerwiegend: invalid regular expression: [a-zA-Z_][a-zA-Z0-9_]*|[0-9][0-9.]*([Ee][-+]?[0-9]+)?[fFlLuU]*|0[xXbB][0-9a-fA-F]+[lLuU]*|\.[0-9][0-9]*([Ee][-+]?[0-9]+)?[fFlL]?|[-+*/<>%&^|=!]=|--|\+\+|<<=?|>>=?|&&|\|\||::|->\*?|\.\*|<=>|[^[:space:]]|[<C0>-<FF>][<80>-<BF>]+
> commit 14b9a044798ebb3858a1f1a1377309a3d6054ac8
> [...]
>
> The error disappears when localization is turned off:
>
> $ LANG=C git log -p --color-words --no-merges '*.c' >/dev/null
> # just finishes without an error
>
> The issue also vanishes when the "|[\xc0-\xff][\x80-\xbf]+" part is
> removed that the macros PATTERNS and IPATTERN in userdiff.c append.
>
> So it seems regcomp(1) on macOS doesn't like invalid Unicode characters
> unless it's in ASCII mode (LANG=C). 664d44ee7f (userdiff: simplify
> word-diff safeguard, 2011-01-11) explains that this part exists to match
> a multi-byte UTF-8 character. With a regcomp(1) that supports
> multi-byte characters natively they need to be specified differently, I
> guess, perhaps like this "[^\x00-\x7f]"?
Actually we can drop the "|[\xc0-\xff][\x80-\xbf]+" part in that case
because the "[^[:space:]]" suffices. And we probably need to do that at
runtime because it depends on the locale. The rather elaborate patch
below does that. It leaks the truncated word_regex, which isn't that
bad because it's done only once per run, but certainly untidy.
I suspect/hope this can be done simpler and cleaner after refactoring
the userdiff code to allow for runtime assembly of regular expressions.
And it's regcomp(3), or rather regexec(3), not regcomp(1).
---
userdiff.c | 38 ++++++++++++++++++++++++++++++++++++--
userdiff.h | 2 ++
2 files changed, 38 insertions(+), 2 deletions(-)
diff --git a/userdiff.c b/userdiff.c
index 09203fbc35..aa2cd150ba 100644
--- a/userdiff.c
+++ b/userdiff.c
@@ -9,6 +9,8 @@ static struct userdiff_driver *drivers;
static int ndrivers;
static int drivers_alloc;
+#define OR_MULTI_BYTE_CHAR "|[\xc0-\xff][\x80-\xbf]+"
+
#define PATTERNS(lang, rx, wrx) { \
.name = lang, \
.binary = -1, \
@@ -16,7 +18,9 @@ static int drivers_alloc;
.pattern = rx, \
.cflags = REG_EXTENDED, \
}, \
- .word_regex = wrx "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+", \
+ .word_regex = wrx "|[^[:space:]]" OR_MULTI_BYTE_CHAR, \
+ .is_builtin = 1, \
+ .has_multi_byte_char_fallback = 1, \
}
#define IPATTERN(lang, rx, wrx) { \
.name = lang, \
@@ -25,7 +29,9 @@ static int drivers_alloc;
.pattern = rx, \
.cflags = REG_EXTENDED | REG_ICASE, \
}, \
- .word_regex = wrx "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+", \
+ .word_regex = wrx "|[^[:space:]]" OR_MULTI_BYTE_CHAR, \
+ .is_builtin = 1, \
+ .has_multi_byte_char_fallback = 1, \
}
/*
@@ -330,6 +336,25 @@ static int userdiff_find_by_namelen_cb(struct userdiff_driver *driver,
return 0;
}
+static int regexec_support_multi_byte_chars(void)
+{
+ static const char not_space[] = "[^[:space:]]";
+ static const char utf8_multi_byte_char[] = "\xc2\xa3";
+ regex_t re;
+ regmatch_t match;
+ static int result = -1;
+
+ if (result != -1)
+ return result;
+ if (regcomp(&re, not_space, REG_EXTENDED))
+ BUG("invalid regular expression: %s", not_space);
+ result = !regexec(&re, utf8_multi_byte_char, 1, &match, 0) &&
+ match.rm_so == 0 &&
+ match.rm_eo == strlen(utf8_multi_byte_char);
+ regfree(&re);
+ return result;
+}
+
static struct userdiff_driver *userdiff_find_by_namelen(const char *name, size_t len)
{
struct find_by_namelen_data udcbdata = {
@@ -337,6 +362,15 @@ static struct userdiff_driver *userdiff_find_by_namelen(const char *name, size_t
.len = len,
};
for_each_userdiff_driver(userdiff_find_by_namelen_cb, &udcbdata);
+ if (udcbdata.driver &&
+ udcbdata.driver->is_builtin &&
+ udcbdata.driver->has_multi_byte_char_fallback &&
+ regexec_support_multi_byte_chars()) {
+ const char *word_regex = udcbdata.driver->word_regex;
+ udcbdata.driver->word_regex = xmemdupz(word_regex,
+ strlen(word_regex) - strlen(OR_MULTI_BYTE_CHAR));
+ udcbdata.driver->has_multi_byte_char_fallback = 0;
+ }
return udcbdata.driver;
}
diff --git a/userdiff.h b/userdiff.h
index 24419db697..83f5863d58 100644
--- a/userdiff.h
+++ b/userdiff.h
@@ -21,6 +21,8 @@ struct userdiff_driver {
const char *textconv;
struct notes_cache *textconv_cache;
int textconv_want_cache;
+ int is_builtin;
+ int has_multi_byte_char_fallback;
};
enum userdiff_driver_type {
USERDIFF_DRIVER_TYPE_BUILTIN = 1<<0,
--
2.40.0
^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: regex compilation error with --color-words
2023-04-02 9:44 ` René Scharfe
@ 2023-04-03 16:29 ` Junio C Hamano
2023-04-03 19:32 ` René Scharfe
0 siblings, 1 reply; 8+ messages in thread
From: Junio C Hamano @ 2023-04-03 16:29 UTC (permalink / raw)
To: René Scharfe
Cc: Diomidis Spinellis, Eric Sunshine, Git List, demerphq,
Mario Grgic
René Scharfe <l.s.r@web.de> writes:
> Actually we can drop the "|[\xc0-\xff][\x80-\xbf]+" part in that case
> because the "[^[:space:]]" suffices. And we probably need to do that at
> runtime because it depends on the locale. The rather elaborate patch
> below does that. It leaks the truncated word_regex, which isn't that
> bad because it's done only once per run, but certainly untidy.
Small ugliness like what we see below is fine in a technology
demonostration.
> I suspect/hope this can be done simpler and cleaner after refactoring
> the userdiff code to allow for runtime assembly of regular expressions.
Do we expect "does the regcomp(3) and regexec(3) correctly match a
non-space multi-byte UTF-8 sequence as expected?" to be the only
choices, do we expect we will choose from only two, and do we expect
that the differences between the MB version and fallback version to
be the same "OR_MULTI_BYTE_CHAR may be omitted"? For now I think
it would be reasonable to answer yes to all three.
How are .is_builtin and .has_multi_byte_char_fallback bits expected
to be used? For what kind of files do we expect them to be set
differently?
In the simplest case, I would imagine that we could do this
...
const char *word_regex;
+ const char *word_regex_wo_mb;
const char *textconv;
...
in the definition of "struct userdifif_driver", use
#define PATTERNS(lang, rx, wrx) { \
...
.word_regex = wrx "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+", \
+ .word_regex_wo_mb = wrx "|[^[:space:]]", \
}
and similar for IPATTERN, and make a non-NULL .word.regex_wo_mb
serve as the .has_multi_byte_char_fallback bit to trigger "does our
regex engine do a good job for multi-byte?"
Thanks.
> diff --git a/userdiff.c b/userdiff.c
> index 09203fbc35..aa2cd150ba 100644
> --- a/userdiff.c
> +++ b/userdiff.c
> @@ -9,6 +9,8 @@ static struct userdiff_driver *drivers;
> static int ndrivers;
> static int drivers_alloc;
>
> +#define OR_MULTI_BYTE_CHAR "|[\xc0-\xff][\x80-\xbf]+"
> +
> #define PATTERNS(lang, rx, wrx) { \
> .name = lang, \
> .binary = -1, \
> @@ -16,7 +18,9 @@ static int drivers_alloc;
> .pattern = rx, \
> .cflags = REG_EXTENDED, \
> }, \
> - .word_regex = wrx "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+", \
> + .word_regex = wrx "|[^[:space:]]" OR_MULTI_BYTE_CHAR, \
> + .is_builtin = 1, \
> + .has_multi_byte_char_fallback = 1, \
> }
> #define IPATTERN(lang, rx, wrx) { \
> .name = lang, \
> @@ -25,7 +29,9 @@ static int drivers_alloc;
> .pattern = rx, \
> .cflags = REG_EXTENDED | REG_ICASE, \
> }, \
> - .word_regex = wrx "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+", \
> + .word_regex = wrx "|[^[:space:]]" OR_MULTI_BYTE_CHAR, \
> + .is_builtin = 1, \
> + .has_multi_byte_char_fallback = 1, \
> }
>
> /*
> @@ -330,6 +336,25 @@ static int userdiff_find_by_namelen_cb(struct userdiff_driver *driver,
> return 0;
> }
>
> +static int regexec_support_multi_byte_chars(void)
> +{
> + static const char not_space[] = "[^[:space:]]";
> + static const char utf8_multi_byte_char[] = "\xc2\xa3";
> + regex_t re;
> + regmatch_t match;
> + static int result = -1;
> +
> + if (result != -1)
> + return result;
> + if (regcomp(&re, not_space, REG_EXTENDED))
> + BUG("invalid regular expression: %s", not_space);
> + result = !regexec(&re, utf8_multi_byte_char, 1, &match, 0) &&
> + match.rm_so == 0 &&
> + match.rm_eo == strlen(utf8_multi_byte_char);
> + regfree(&re);
> + return result;
> +}
> +
> static struct userdiff_driver *userdiff_find_by_namelen(const char *name, size_t len)
> {
> struct find_by_namelen_data udcbdata = {
> @@ -337,6 +362,15 @@ static struct userdiff_driver *userdiff_find_by_namelen(const char *name, size_t
> .len = len,
> };
> for_each_userdiff_driver(userdiff_find_by_namelen_cb, &udcbdata);
> + if (udcbdata.driver &&
> + udcbdata.driver->is_builtin &&
> + udcbdata.driver->has_multi_byte_char_fallback &&
> + regexec_support_multi_byte_chars()) {
> + const char *word_regex = udcbdata.driver->word_regex;
> + udcbdata.driver->word_regex = xmemdupz(word_regex,
> + strlen(word_regex) - strlen(OR_MULTI_BYTE_CHAR));
> + udcbdata.driver->has_multi_byte_char_fallback = 0;
> + }
> return udcbdata.driver;
> }
>
> diff --git a/userdiff.h b/userdiff.h
> index 24419db697..83f5863d58 100644
> --- a/userdiff.h
> +++ b/userdiff.h
> @@ -21,6 +21,8 @@ struct userdiff_driver {
> const char *textconv;
> struct notes_cache *textconv_cache;
> int textconv_want_cache;
> + int is_builtin;
> + int has_multi_byte_char_fallback;
> };
> enum userdiff_driver_type {
> USERDIFF_DRIVER_TYPE_BUILTIN = 1<<0,
> --
> 2.40.0
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: regex compilation error with --color-words
2023-04-03 16:29 ` Junio C Hamano
@ 2023-04-03 19:32 ` René Scharfe
0 siblings, 0 replies; 8+ messages in thread
From: René Scharfe @ 2023-04-03 19:32 UTC (permalink / raw)
To: Junio C Hamano
Cc: Diomidis Spinellis, Eric Sunshine, Git List, demerphq,
Mario Grgic
Am 03.04.23 um 18:29 schrieb Junio C Hamano:
> René Scharfe <l.s.r@web.de> writes:
>
>> Actually we can drop the "|[\xc0-\xff][\x80-\xbf]+" part in that case
>> because the "[^[:space:]]" suffices. And we probably need to do that at
>> runtime because it depends on the locale. The rather elaborate patch
>> below does that. It leaks the truncated word_regex, which isn't that
>> bad because it's done only once per run, but certainly untidy.
>
> Small ugliness like what we see below is fine in a technology
> demonostration.
>
>> I suspect/hope this can be done simpler and cleaner after refactoring
>> the userdiff code to allow for runtime assembly of regular expressions.
>
> Do we expect "does the regcomp(3) and regexec(3) correctly match a
> non-space multi-byte UTF-8 sequence as expected?" to be the only
> choices, do we expect we will choose from only two, and do we expect
> that the differences between the MB version and fallback version to
> be the same "OR_MULTI_BYTE_CHAR may be omitted"? For now I think
> it would be reasonable to answer yes to all three.
>
> How are .is_builtin and .has_multi_byte_char_fallback bits expected
> to be used? For what kind of files do we expect them to be set
> differently?
.is_builtin is unnecessary here. It is a remnant of me noticing that we
don't add "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+" to user-defined
patterns, but .has_multi_byte_char_fallback alone suffices.
>
> In the simplest case, I would imagine that we could do this
>
> ...
> const char *word_regex;
> + const char *word_regex_wo_mb;
> const char *textconv;
> ...
>
> in the definition of "struct userdifif_driver", use
>
> #define PATTERNS(lang, rx, wrx) { \
> ...
> .word_regex = wrx "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+", \
> + .word_regex_wo_mb = wrx "|[^[:space:]]", \
Ah, nice, no allocation or string manipulation at runtime at all, at
the small cost of having near-duplicate static strings.
> }
>
> and similar for IPATTERN, and make a non-NULL .word.regex_wo_mb
> serve as the .has_multi_byte_char_fallback bit to trigger "does our
> regex engine do a good job for multi-byte?"
>
> Thanks.
>
>> diff --git a/userdiff.c b/userdiff.c
>> index 09203fbc35..aa2cd150ba 100644
>> --- a/userdiff.c
>> +++ b/userdiff.c
>> @@ -9,6 +9,8 @@ static struct userdiff_driver *drivers;
>> static int ndrivers;
>> static int drivers_alloc;
>>
>> +#define OR_MULTI_BYTE_CHAR "|[\xc0-\xff][\x80-\xbf]+"
>> +
>> #define PATTERNS(lang, rx, wrx) { \
>> .name = lang, \
>> .binary = -1, \
>> @@ -16,7 +18,9 @@ static int drivers_alloc;
>> .pattern = rx, \
>> .cflags = REG_EXTENDED, \
>> }, \
>> - .word_regex = wrx "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+", \
>> + .word_regex = wrx "|[^[:space:]]" OR_MULTI_BYTE_CHAR, \
>> + .is_builtin = 1, \
>> + .has_multi_byte_char_fallback = 1, \
>> }
>> #define IPATTERN(lang, rx, wrx) { \
>> .name = lang, \
>> @@ -25,7 +29,9 @@ static int drivers_alloc;
>> .pattern = rx, \
>> .cflags = REG_EXTENDED | REG_ICASE, \
>> }, \
>> - .word_regex = wrx "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+", \
>> + .word_regex = wrx "|[^[:space:]]" OR_MULTI_BYTE_CHAR, \
>> + .is_builtin = 1, \
>> + .has_multi_byte_char_fallback = 1, \
>> }
>>
>> /*
>> @@ -330,6 +336,25 @@ static int userdiff_find_by_namelen_cb(struct userdiff_driver *driver,
>> return 0;
>> }
>>
>> +static int regexec_support_multi_byte_chars(void)
>> +{
>> + static const char not_space[] = "[^[:space:]]";
>> + static const char utf8_multi_byte_char[] = "\xc2\xa3";
>> + regex_t re;
>> + regmatch_t match;
>> + static int result = -1;
>> +
>> + if (result != -1)
>> + return result;
>> + if (regcomp(&re, not_space, REG_EXTENDED))
>> + BUG("invalid regular expression: %s", not_space);
>> + result = !regexec(&re, utf8_multi_byte_char, 1, &match, 0) &&
>> + match.rm_so == 0 &&
>> + match.rm_eo == strlen(utf8_multi_byte_char);
>> + regfree(&re);
>> + return result;
>> +}
>> +
>> static struct userdiff_driver *userdiff_find_by_namelen(const char *name, size_t len)
>> {
>> struct find_by_namelen_data udcbdata = {
>> @@ -337,6 +362,15 @@ static struct userdiff_driver *userdiff_find_by_namelen(const char *name, size_t
>> .len = len,
>> };
>> for_each_userdiff_driver(userdiff_find_by_namelen_cb, &udcbdata);
>> + if (udcbdata.driver &&
>> + udcbdata.driver->is_builtin &&
>> + udcbdata.driver->has_multi_byte_char_fallback &&
>> + regexec_support_multi_byte_chars()) {
>> + const char *word_regex = udcbdata.driver->word_regex;
>> + udcbdata.driver->word_regex = xmemdupz(word_regex,
>> + strlen(word_regex) - strlen(OR_MULTI_BYTE_CHAR));
>> + udcbdata.driver->has_multi_byte_char_fallback = 0;
>> + }
>> return udcbdata.driver;
>> }
>>
>> diff --git a/userdiff.h b/userdiff.h
>> index 24419db697..83f5863d58 100644
>> --- a/userdiff.h
>> +++ b/userdiff.h
>> @@ -21,6 +21,8 @@ struct userdiff_driver {
>> const char *textconv;
>> struct notes_cache *textconv_cache;
>> int textconv_want_cache;
>> + int is_builtin;
>> + int has_multi_byte_char_fallback;
>> };
>> enum userdiff_driver_type {
>> USERDIFF_DRIVER_TYPE_BUILTIN = 1<<0,
>> --
>> 2.40.0
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: regex compilation error with --color-words
@ 2023-04-06 13:39 Benjamin
0 siblings, 0 replies; 8+ messages in thread
From: Benjamin @ 2023-04-06 13:39 UTC (permalink / raw)
To: sunshine; +Cc: dds, demerphq, git, gitster, l.s.r, mario_grgic
This was also discussed in February[1]. Giving --word-diff is to git-diff is enough to trigger the error in non-C locales.
I am pleasantly surprised to learn that LC_ALL=C temporarily avoids the problem; that makes it useable again, for now. But a fix is much welcome!
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: regex compilation error with --color-words
@ 2023-04-06 16:03 Benjamin
0 siblings, 0 replies; 8+ messages in thread
From: Benjamin @ 2023-04-06 16:03 UTC (permalink / raw)
To: Ben Knoble; +Cc: dds, demerphq, git, gitster, l.s.r, mario_grgic, sunshine
Ack, forgot the link apparently:
[1]: https://lore.kernel.org/git/CALnO6CAZtwfGY4SYeOuKqdP9+e_0EYNf4F703DRQB7UUfd_bUg@mail.gmail.com/
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2023-04-06 16:03 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-06 16:03 regex compilation error with --color-words Benjamin
-- strict thread matches above, loose matches on Subject: below --
2023-04-06 13:39 Benjamin
2023-03-29 22:55 Eric Sunshine
2023-03-30 7:55 ` Diomidis Spinellis
2023-03-31 20:44 ` René Scharfe
2023-04-02 9:44 ` René Scharfe
2023-04-03 16:29 ` Junio C Hamano
2023-04-03 19:32 ` René Scharfe
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).