git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* regex compilation error with --color-words
@ 2023-03-29 22:55 Eric Sunshine
  2023-03-30  7:55 ` Diomidis Spinellis
  0 siblings, 1 reply; 8+ messages in thread
From: Eric Sunshine @ 2023-03-29 22:55 UTC (permalink / raw)
  To: Git List, Diomidis Spinellis
  Cc: René Scharfe, Junio C Hamano, demerphq, Mario Grgic

I'm encountering a failure on macOS High Sierra 10.13.6 when using
--color-words:

    % git show --color-words HEAD
    fatal: invalid regular expression:
[[:alpha:]_'][[:alnum:]_']*|0[xb]?[0-9a-fA-F_]*|[0-9a-fA-F_]+(\.[0-9a-fA-F_]+)?([eE][-+]?[0-9_]+)?|=>|-[rwxoRWXOezsfdlpSugkbctTBMAC>]|~~|::|&&=|\|\|=|//=|\*\*=|&&|\|\||//|\+\+|--|\*\*|\.\.\.?|[-+*/%.^&<>=!|]=|=~|!~|<<|<>|<=>|>>|[^[:space:]]|[<C0>-<FF>][<80>-<BF>]+

This crash happens when viewing the commit I sent to Peff today[1],
though it doesn't happen with all commits. The problem bisects to:

    Author: Diomidis Spinellis <dds@aueb.gr>
    Date:   Fri Aug 26 11:58:15 2022 +0300

    grep: fix multibyte regex handling under macOS

    The commit 29de20504e (Makefile: fix default regex settings on
    Darwin, 2013-05-11) fixed t0070-fundamental.sh under Darwin (macOS) by
    adopting Git's regex library.  However, this library is compiled with
    NO_MBSUPPORT, which causes git-grep to work incorrectly on multibyte
    (e.g. UTF-8) files.  Current macOS versions pass t0070-fundamental.sh
    with the native macOS regex library, which also supports multibyte
    characters.

    Adjust the Makefile to use the native regex library, and call
    setlocale(3) to set CTYPE according to the user's preference.
    The setlocale call is required on all platforms, but in platforms
    supporting gettext(3), setlocale was called as a side-effect of
    initializing gettext.  Therefore, move the CTYPE setlocale call from
    gettext.c to common-main.c and the corresponding locale.h include
    into git-compat-util.h.

    Thanks to the global initialization of CTYPE setlocale, the test-tool
    regex command now works correctly with supported multibyte regexes, and
    is used to set the MB_REGEX test prerequisite by assessing a platform's
    support for them.

    Signed-off-by: Diomidis Spinellis <dds@aueb.gr>
    Signed-off-by: Junio C Hamano <gitster@pobox.com>

I see that this same commit is also the subject of another bug report
currently being discussed[2], so I've Cc:'d the participants of that
thread, as well.

Any pointers aimed at getting this resolved would be appreciated.

[1]: https://lore.kernel.org/git/CAPig+cQiOGrDSUc34jHEBp87Rx-dnXNcPcF76bu0SJoOzD+1hw@mail.gmail.com/
[2]: https://lore.kernel.org/git/MW4PR20MB5517583CBEEF34B1E87CCF1290859@MW4PR20MB5517.namprd20.prod.outlook.com/

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: regex compilation error with --color-words
  2023-03-29 22:55 Eric Sunshine
@ 2023-03-30  7:55 ` Diomidis Spinellis
  2023-03-31 20:44   ` René Scharfe
  0 siblings, 1 reply; 8+ messages in thread
From: Diomidis Spinellis @ 2023-03-30  7:55 UTC (permalink / raw)
  To: Eric Sunshine, Git List
  Cc: René Scharfe, Junio C Hamano, demerphq, Mario Grgic

On 30-Mar-23 1:55, Eric Sunshine wrote:
> I'm encountering a failure on macOS High Sierra 10.13.6 when using
> --color-words:

The built-in word separation regular expression pattern for the Perl 
language fails to work with the macOS regex engine.  The same also 
happens with the FreeBSD one (tested on 14.0).

The issue can be replicated through the following sequence of commands.

git init color-words
cd color-words
echo '*.pl   diff=perl' >.gitattributes
echo 'print 42;' >t.pl
git add t.pl
git commit -am Add
git show --color-words

Strangely, I haven't been able to reproduce the failure with egrep on 
any of the two platforms.

egrep 
'[[:alpha:]_'\''][[:alnum:]_'\'']*|0[xb]?[0-9a-fA-F_]*|[0-9a-fA-F_]+(\.[0-9a-fA-F_]+)?([eE][-+]?[0-9_]+)?|=>|-[rwxoRWXOezsfdlpSugkbctTBMAC>]|~~|::|&&=|\|\|=|//=|\*\*=|&&|\|\||//|\+\+|--|\*\*|\.\.\.?|[-+*/%.^&<>=!|]=|=~|!~|<<|<>|<=>|>>|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+' 
/dev/null

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: regex compilation error with --color-words
  2023-03-30  7:55 ` Diomidis Spinellis
@ 2023-03-31 20:44   ` René Scharfe
  2023-04-02  9:44     ` René Scharfe
  0 siblings, 1 reply; 8+ messages in thread
From: René Scharfe @ 2023-03-31 20:44 UTC (permalink / raw)
  To: Diomidis Spinellis, Eric Sunshine, Git List
  Cc: Junio C Hamano, demerphq, Mario Grgic

Am 30.03.23 um 09:55 schrieb Diomidis Spinellis:
> On 30-Mar-23 1:55, Eric Sunshine wrote:
>> I'm encountering a failure on macOS High Sierra 10.13.6 when using
>> --color-words:
>
> The built-in word separation regular expression pattern for the Perl language fails to work with the macOS regex engine.  The same also happens with the FreeBSD one (tested on 14.0).
>
> The issue can be replicated through the following sequence of commands.
>
> git init color-words
> cd color-words
> echo '*.pl   diff=perl' >.gitattributes
> echo 'print 42;' >t.pl
> git add t.pl
> git commit -am Add
> git show --color-words

Or in Git's own repo:

   $ git log -p --color-words --no-merges '*.c'
   Schwerwiegend: invalid regular expression: [a-zA-Z_][a-zA-Z0-9_]*|[0-9][0-9.]*([Ee][-+]?[0-9]+)?[fFlLuU]*|0[xXbB][0-9a-fA-F]+[lLuU]*|\.[0-9][0-9]*([Ee][-+]?[0-9]+)?[fFlL]?|[-+*/<>%&^|=!]=|--|\+\+|<<=?|>>=?|&&|\|\||::|->\*?|\.\*|<=>|[^[:space:]]|[<C0>-<FF>][<80>-<BF>]+
   commit 14b9a044798ebb3858a1f1a1377309a3d6054ac8
   [...]

The error disappears when localization is turned off:

   $ LANG=C git log -p --color-words --no-merges '*.c' >/dev/null
   # just finishes without an error

The issue also vanishes when the "|[\xc0-\xff][\x80-\xbf]+" part is
removed that the macros PATTERNS and IPATTERN in userdiff.c append.

So it seems regcomp(1) on macOS doesn't like invalid Unicode characters
unless it's in ASCII mode (LANG=C).  664d44ee7f (userdiff: simplify
word-diff safeguard, 2011-01-11) explains that this part exists to match
a multi-byte UTF-8 character.  With a regcomp(1) that supports
multi-byte characters natively they need to be specified differently, I
guess, perhaps like this "[^\x00-\x7f]"?

> Strangely, I haven't been able to reproduce the failure with egrep on any of the two platforms.
>
> egrep '[[:alpha:]_'\''][[:alnum:]_'\'']*|0[xb]?[0-9a-fA-F_]*|[0-9a-fA-F_]+(\.[0-9a-fA-F_]+)?([eE][-+]?[0-9_]+)?|=>|-[rwxoRWXOezsfdlpSugkbctTBMAC>]|~~|::|&&=|\|\|=|//=|\*\*=|&&|\|\||//|\+\+|--|\*\*|\.\.\.?|[-+*/%.^&<>=!|]=|=~|!~|<<|<>|<=>|>>|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+' /dev/null

No idea how to specify non-ASCII bytes in shell or regex.  '\xNN' does
not seem to do the trick.  printf(1) interpretes octal numbers, though:

   $ echo ö | egrep $(printf "[\200-\377]")
   egrep: illegal byte sequence

(The regex contains "illegal bytes" -- UTF-8 multi-byte sequences cut
short; the "ö" is OK.)

René

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: regex compilation error with --color-words
  2023-03-31 20:44   ` René Scharfe
@ 2023-04-02  9:44     ` René Scharfe
  2023-04-03 16:29       ` Junio C Hamano
  0 siblings, 1 reply; 8+ messages in thread
From: René Scharfe @ 2023-04-02  9:44 UTC (permalink / raw)
  To: Diomidis Spinellis, Eric Sunshine, Git List
  Cc: Junio C Hamano, demerphq, Mario Grgic

Am 31.03.23 um 22:44 schrieb René Scharfe:
> Am 30.03.23 um 09:55 schrieb Diomidis Spinellis:
>> On 30-Mar-23 1:55, Eric Sunshine wrote:
>>> I'm encountering a failure on macOS High Sierra 10.13.6 when using
>>> --color-words:
>>
>> The built-in word separation regular expression pattern for the Perl language fails to work with the macOS regex engine.  The same also happens with the FreeBSD one (tested on 14.0).
>>
>> The issue can be replicated through the following sequence of commands.
>>
>> git init color-words
>> cd color-words
>> echo '*.pl   diff=perl' >.gitattributes
>> echo 'print 42;' >t.pl
>> git add t.pl
>> git commit -am Add
>> git show --color-words
>
> Or in Git's own repo:
>
>    $ git log -p --color-words --no-merges '*.c'
>    Schwerwiegend: invalid regular expression: [a-zA-Z_][a-zA-Z0-9_]*|[0-9][0-9.]*([Ee][-+]?[0-9]+)?[fFlLuU]*|0[xXbB][0-9a-fA-F]+[lLuU]*|\.[0-9][0-9]*([Ee][-+]?[0-9]+)?[fFlL]?|[-+*/<>%&^|=!]=|--|\+\+|<<=?|>>=?|&&|\|\||::|->\*?|\.\*|<=>|[^[:space:]]|[<C0>-<FF>][<80>-<BF>]+
>    commit 14b9a044798ebb3858a1f1a1377309a3d6054ac8
>    [...]
>
> The error disappears when localization is turned off:
>
>    $ LANG=C git log -p --color-words --no-merges '*.c' >/dev/null
>    # just finishes without an error
>
> The issue also vanishes when the "|[\xc0-\xff][\x80-\xbf]+" part is
> removed that the macros PATTERNS and IPATTERN in userdiff.c append.
>
> So it seems regcomp(1) on macOS doesn't like invalid Unicode characters
> unless it's in ASCII mode (LANG=C).  664d44ee7f (userdiff: simplify
> word-diff safeguard, 2011-01-11) explains that this part exists to match
> a multi-byte UTF-8 character.  With a regcomp(1) that supports
> multi-byte characters natively they need to be specified differently, I
> guess, perhaps like this "[^\x00-\x7f]"?

Actually we can drop the "|[\xc0-\xff][\x80-\xbf]+" part in that case
because the "[^[:space:]]" suffices.  And we probably need to do that at
runtime because it depends on the locale.  The rather elaborate patch
below does that.  It leaks the truncated word_regex, which isn't that
bad because it's done only once per run, but certainly untidy.

I suspect/hope this can be done simpler and cleaner after refactoring
the userdiff code to allow for runtime assembly of regular expressions.

And it's regcomp(3), or rather regexec(3), not regcomp(1).

---
 userdiff.c | 38 ++++++++++++++++++++++++++++++++++++--
 userdiff.h |  2 ++
 2 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/userdiff.c b/userdiff.c
index 09203fbc35..aa2cd150ba 100644
--- a/userdiff.c
+++ b/userdiff.c
@@ -9,6 +9,8 @@ static struct userdiff_driver *drivers;
 static int ndrivers;
 static int drivers_alloc;

+#define OR_MULTI_BYTE_CHAR "|[\xc0-\xff][\x80-\xbf]+"
+
 #define PATTERNS(lang, rx, wrx) { \
 	.name = lang, \
 	.binary = -1, \
@@ -16,7 +18,9 @@ static int drivers_alloc;
 		.pattern = rx, \
 		.cflags = REG_EXTENDED, \
 	}, \
-	.word_regex = wrx "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+", \
+	.word_regex = wrx "|[^[:space:]]" OR_MULTI_BYTE_CHAR, \
+	.is_builtin = 1, \
+	.has_multi_byte_char_fallback = 1, \
 }
 #define IPATTERN(lang, rx, wrx) { \
 	.name = lang, \
@@ -25,7 +29,9 @@ static int drivers_alloc;
 		.pattern = rx, \
 		.cflags = REG_EXTENDED | REG_ICASE, \
 	}, \
-	.word_regex = wrx "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+", \
+	.word_regex = wrx "|[^[:space:]]" OR_MULTI_BYTE_CHAR, \
+	.is_builtin = 1, \
+	.has_multi_byte_char_fallback = 1, \
 }

 /*
@@ -330,6 +336,25 @@ static int userdiff_find_by_namelen_cb(struct userdiff_driver *driver,
 	return 0;
 }

+static int regexec_support_multi_byte_chars(void)
+{
+	static const char not_space[] = "[^[:space:]]";
+	static const char utf8_multi_byte_char[] = "\xc2\xa3";
+	regex_t re;
+	regmatch_t match;
+	static int result = -1;
+
+	if (result != -1)
+		return result;
+	if (regcomp(&re, not_space, REG_EXTENDED))
+		BUG("invalid regular expression: %s", not_space);
+	result = !regexec(&re, utf8_multi_byte_char, 1, &match, 0) &&
+		match.rm_so == 0 &&
+		match.rm_eo == strlen(utf8_multi_byte_char);
+	regfree(&re);
+	return result;
+}
+
 static struct userdiff_driver *userdiff_find_by_namelen(const char *name, size_t len)
 {
 	struct find_by_namelen_data udcbdata = {
@@ -337,6 +362,15 @@ static struct userdiff_driver *userdiff_find_by_namelen(const char *name, size_t
 		.len = len,
 	};
 	for_each_userdiff_driver(userdiff_find_by_namelen_cb, &udcbdata);
+	if (udcbdata.driver &&
+	    udcbdata.driver->is_builtin &&
+	    udcbdata.driver->has_multi_byte_char_fallback &&
+	    regexec_support_multi_byte_chars()) {
+		const char *word_regex = udcbdata.driver->word_regex;
+		udcbdata.driver->word_regex = xmemdupz(word_regex,
+			strlen(word_regex) - strlen(OR_MULTI_BYTE_CHAR));
+		udcbdata.driver->has_multi_byte_char_fallback = 0;
+	}
 	return udcbdata.driver;
 }

diff --git a/userdiff.h b/userdiff.h
index 24419db697..83f5863d58 100644
--- a/userdiff.h
+++ b/userdiff.h
@@ -21,6 +21,8 @@ struct userdiff_driver {
 	const char *textconv;
 	struct notes_cache *textconv_cache;
 	int textconv_want_cache;
+	int is_builtin;
+	int has_multi_byte_char_fallback;
 };
 enum userdiff_driver_type {
 	USERDIFF_DRIVER_TYPE_BUILTIN = 1<<0,
--
2.40.0

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: regex compilation error with --color-words
  2023-04-02  9:44     ` René Scharfe
@ 2023-04-03 16:29       ` Junio C Hamano
  2023-04-03 19:32         ` René Scharfe
  0 siblings, 1 reply; 8+ messages in thread
From: Junio C Hamano @ 2023-04-03 16:29 UTC (permalink / raw)
  To: René Scharfe
  Cc: Diomidis Spinellis, Eric Sunshine, Git List, demerphq,
	Mario Grgic

René Scharfe <l.s.r@web.de> writes:

> Actually we can drop the "|[\xc0-\xff][\x80-\xbf]+" part in that case
> because the "[^[:space:]]" suffices.  And we probably need to do that at
> runtime because it depends on the locale.  The rather elaborate patch
> below does that.  It leaks the truncated word_regex, which isn't that
> bad because it's done only once per run, but certainly untidy.

Small ugliness like what we see below is fine in a technology
demonostration.

> I suspect/hope this can be done simpler and cleaner after refactoring
> the userdiff code to allow for runtime assembly of regular expressions.

Do we expect "does the regcomp(3) and regexec(3) correctly match a
non-space multi-byte UTF-8 sequence as expected?" to be the only
choices, do we expect we will choose from only two, and do we expect
that the differences between the MB version and fallback version to
be the same "OR_MULTI_BYTE_CHAR may be omitted"?  For now I think
it would be reasonable to answer yes to all three.

How are .is_builtin and .has_multi_byte_char_fallback bits expected
to be used?  For what kind of files do we expect them to be set
differently?

In the simplest case, I would imagine that we could do this

 	...
 	const char *word_regex;
+	const char *word_regex_wo_mb;
 	const char *textconv;
 	...

in the definition of "struct userdifif_driver", use

 #define PATTERNS(lang, rx, wrx) { \
 	...
 	.word_regex = wrx "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+", \
+	.word_regex_wo_mb = wrx "|[^[:space:]]", \
 }

and similar for IPATTERN, and make a non-NULL .word.regex_wo_mb
serve as the .has_multi_byte_char_fallback bit to trigger "does our
regex engine do a good job for multi-byte?"

Thanks.

> diff --git a/userdiff.c b/userdiff.c
> index 09203fbc35..aa2cd150ba 100644
> --- a/userdiff.c
> +++ b/userdiff.c
> @@ -9,6 +9,8 @@ static struct userdiff_driver *drivers;
>  static int ndrivers;
>  static int drivers_alloc;
>
> +#define OR_MULTI_BYTE_CHAR "|[\xc0-\xff][\x80-\xbf]+"
> +
>  #define PATTERNS(lang, rx, wrx) { \
>  	.name = lang, \
>  	.binary = -1, \
> @@ -16,7 +18,9 @@ static int drivers_alloc;
>  		.pattern = rx, \
>  		.cflags = REG_EXTENDED, \
>  	}, \
> -	.word_regex = wrx "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+", \
> +	.word_regex = wrx "|[^[:space:]]" OR_MULTI_BYTE_CHAR, \
> +	.is_builtin = 1, \
> +	.has_multi_byte_char_fallback = 1, \
>  }
>  #define IPATTERN(lang, rx, wrx) { \
>  	.name = lang, \
> @@ -25,7 +29,9 @@ static int drivers_alloc;
>  		.pattern = rx, \
>  		.cflags = REG_EXTENDED | REG_ICASE, \
>  	}, \
> -	.word_regex = wrx "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+", \
> +	.word_regex = wrx "|[^[:space:]]" OR_MULTI_BYTE_CHAR, \
> +	.is_builtin = 1, \
> +	.has_multi_byte_char_fallback = 1, \
>  }
>
>  /*
> @@ -330,6 +336,25 @@ static int userdiff_find_by_namelen_cb(struct userdiff_driver *driver,
>  	return 0;
>  }
>
> +static int regexec_support_multi_byte_chars(void)
> +{
> +	static const char not_space[] = "[^[:space:]]";
> +	static const char utf8_multi_byte_char[] = "\xc2\xa3";
> +	regex_t re;
> +	regmatch_t match;
> +	static int result = -1;
> +
> +	if (result != -1)
> +		return result;
> +	if (regcomp(&re, not_space, REG_EXTENDED))
> +		BUG("invalid regular expression: %s", not_space);
> +	result = !regexec(&re, utf8_multi_byte_char, 1, &match, 0) &&
> +		match.rm_so == 0 &&
> +		match.rm_eo == strlen(utf8_multi_byte_char);
> +	regfree(&re);
> +	return result;
> +}
> +
>  static struct userdiff_driver *userdiff_find_by_namelen(const char *name, size_t len)
>  {
>  	struct find_by_namelen_data udcbdata = {
> @@ -337,6 +362,15 @@ static struct userdiff_driver *userdiff_find_by_namelen(const char *name, size_t
>  		.len = len,
>  	};
>  	for_each_userdiff_driver(userdiff_find_by_namelen_cb, &udcbdata);
> +	if (udcbdata.driver &&
> +	    udcbdata.driver->is_builtin &&
> +	    udcbdata.driver->has_multi_byte_char_fallback &&
> +	    regexec_support_multi_byte_chars()) {
> +		const char *word_regex = udcbdata.driver->word_regex;
> +		udcbdata.driver->word_regex = xmemdupz(word_regex,
> +			strlen(word_regex) - strlen(OR_MULTI_BYTE_CHAR));
> +		udcbdata.driver->has_multi_byte_char_fallback = 0;
> +	}
>  	return udcbdata.driver;
>  }
>
> diff --git a/userdiff.h b/userdiff.h
> index 24419db697..83f5863d58 100644
> --- a/userdiff.h
> +++ b/userdiff.h
> @@ -21,6 +21,8 @@ struct userdiff_driver {
>  	const char *textconv;
>  	struct notes_cache *textconv_cache;
>  	int textconv_want_cache;
> +	int is_builtin;
> +	int has_multi_byte_char_fallback;
>  };
>  enum userdiff_driver_type {
>  	USERDIFF_DRIVER_TYPE_BUILTIN = 1<<0,
> --
> 2.40.0

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: regex compilation error with --color-words
  2023-04-03 16:29       ` Junio C Hamano
@ 2023-04-03 19:32         ` René Scharfe
  0 siblings, 0 replies; 8+ messages in thread
From: René Scharfe @ 2023-04-03 19:32 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Diomidis Spinellis, Eric Sunshine, Git List, demerphq,
	Mario Grgic

Am 03.04.23 um 18:29 schrieb Junio C Hamano:
> René Scharfe <l.s.r@web.de> writes:
>
>> Actually we can drop the "|[\xc0-\xff][\x80-\xbf]+" part in that case
>> because the "[^[:space:]]" suffices.  And we probably need to do that at
>> runtime because it depends on the locale.  The rather elaborate patch
>> below does that.  It leaks the truncated word_regex, which isn't that
>> bad because it's done only once per run, but certainly untidy.
>
> Small ugliness like what we see below is fine in a technology
> demonostration.
>
>> I suspect/hope this can be done simpler and cleaner after refactoring
>> the userdiff code to allow for runtime assembly of regular expressions.
>
> Do we expect "does the regcomp(3) and regexec(3) correctly match a
> non-space multi-byte UTF-8 sequence as expected?" to be the only
> choices, do we expect we will choose from only two, and do we expect
> that the differences between the MB version and fallback version to
> be the same "OR_MULTI_BYTE_CHAR may be omitted"?  For now I think
> it would be reasonable to answer yes to all three.
>
> How are .is_builtin and .has_multi_byte_char_fallback bits expected
> to be used?  For what kind of files do we expect them to be set
> differently?

.is_builtin is unnecessary here.  It is a remnant of me noticing that we
don't add "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+" to user-defined
patterns, but .has_multi_byte_char_fallback alone suffices.

>
> In the simplest case, I would imagine that we could do this
>
>  	...
>  	const char *word_regex;
> +	const char *word_regex_wo_mb;
>  	const char *textconv;
>  	...
>
> in the definition of "struct userdifif_driver", use
>
>  #define PATTERNS(lang, rx, wrx) { \
>  	...
>  	.word_regex = wrx "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+", \
> +	.word_regex_wo_mb = wrx "|[^[:space:]]", \

Ah, nice, no allocation or string manipulation at runtime at all, at
the small cost of having near-duplicate static strings.

>  }
>
> and similar for IPATTERN, and make a non-NULL .word.regex_wo_mb
> serve as the .has_multi_byte_char_fallback bit to trigger "does our
> regex engine do a good job for multi-byte?"
>
> Thanks.
>
>> diff --git a/userdiff.c b/userdiff.c
>> index 09203fbc35..aa2cd150ba 100644
>> --- a/userdiff.c
>> +++ b/userdiff.c
>> @@ -9,6 +9,8 @@ static struct userdiff_driver *drivers;
>>  static int ndrivers;
>>  static int drivers_alloc;
>>
>> +#define OR_MULTI_BYTE_CHAR "|[\xc0-\xff][\x80-\xbf]+"
>> +
>>  #define PATTERNS(lang, rx, wrx) { \
>>  	.name = lang, \
>>  	.binary = -1, \
>> @@ -16,7 +18,9 @@ static int drivers_alloc;
>>  		.pattern = rx, \
>>  		.cflags = REG_EXTENDED, \
>>  	}, \
>> -	.word_regex = wrx "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+", \
>> +	.word_regex = wrx "|[^[:space:]]" OR_MULTI_BYTE_CHAR, \
>> +	.is_builtin = 1, \
>> +	.has_multi_byte_char_fallback = 1, \
>>  }
>>  #define IPATTERN(lang, rx, wrx) { \
>>  	.name = lang, \
>> @@ -25,7 +29,9 @@ static int drivers_alloc;
>>  		.pattern = rx, \
>>  		.cflags = REG_EXTENDED | REG_ICASE, \
>>  	}, \
>> -	.word_regex = wrx "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+", \
>> +	.word_regex = wrx "|[^[:space:]]" OR_MULTI_BYTE_CHAR, \
>> +	.is_builtin = 1, \
>> +	.has_multi_byte_char_fallback = 1, \
>>  }
>>
>>  /*
>> @@ -330,6 +336,25 @@ static int userdiff_find_by_namelen_cb(struct userdiff_driver *driver,
>>  	return 0;
>>  }
>>
>> +static int regexec_support_multi_byte_chars(void)
>> +{
>> +	static const char not_space[] = "[^[:space:]]";
>> +	static const char utf8_multi_byte_char[] = "\xc2\xa3";
>> +	regex_t re;
>> +	regmatch_t match;
>> +	static int result = -1;
>> +
>> +	if (result != -1)
>> +		return result;
>> +	if (regcomp(&re, not_space, REG_EXTENDED))
>> +		BUG("invalid regular expression: %s", not_space);
>> +	result = !regexec(&re, utf8_multi_byte_char, 1, &match, 0) &&
>> +		match.rm_so == 0 &&
>> +		match.rm_eo == strlen(utf8_multi_byte_char);
>> +	regfree(&re);
>> +	return result;
>> +}
>> +
>>  static struct userdiff_driver *userdiff_find_by_namelen(const char *name, size_t len)
>>  {
>>  	struct find_by_namelen_data udcbdata = {
>> @@ -337,6 +362,15 @@ static struct userdiff_driver *userdiff_find_by_namelen(const char *name, size_t
>>  		.len = len,
>>  	};
>>  	for_each_userdiff_driver(userdiff_find_by_namelen_cb, &udcbdata);
>> +	if (udcbdata.driver &&
>> +	    udcbdata.driver->is_builtin &&
>> +	    udcbdata.driver->has_multi_byte_char_fallback &&
>> +	    regexec_support_multi_byte_chars()) {
>> +		const char *word_regex = udcbdata.driver->word_regex;
>> +		udcbdata.driver->word_regex = xmemdupz(word_regex,
>> +			strlen(word_regex) - strlen(OR_MULTI_BYTE_CHAR));
>> +		udcbdata.driver->has_multi_byte_char_fallback = 0;
>> +	}
>>  	return udcbdata.driver;
>>  }
>>
>> diff --git a/userdiff.h b/userdiff.h
>> index 24419db697..83f5863d58 100644
>> --- a/userdiff.h
>> +++ b/userdiff.h
>> @@ -21,6 +21,8 @@ struct userdiff_driver {
>>  	const char *textconv;
>>  	struct notes_cache *textconv_cache;
>>  	int textconv_want_cache;
>> +	int is_builtin;
>> +	int has_multi_byte_char_fallback;
>>  };
>>  enum userdiff_driver_type {
>>  	USERDIFF_DRIVER_TYPE_BUILTIN = 1<<0,
>> --
>> 2.40.0

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: regex compilation error with --color-words
@ 2023-04-06 13:39 Benjamin
  0 siblings, 0 replies; 8+ messages in thread
From: Benjamin @ 2023-04-06 13:39 UTC (permalink / raw)
  To: sunshine; +Cc: dds, demerphq, git, gitster, l.s.r, mario_grgic

This was also discussed in February[1]. Giving --word-diff is to git-diff is enough to trigger the error in non-C locales.

I am pleasantly surprised to learn that LC_ALL=C temporarily avoids the problem; that makes it useable again, for now. But a fix is much welcome!

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: regex compilation error with --color-words
@ 2023-04-06 16:03 Benjamin
  0 siblings, 0 replies; 8+ messages in thread
From: Benjamin @ 2023-04-06 16:03 UTC (permalink / raw)
  To: Ben Knoble; +Cc: dds, demerphq, git, gitster, l.s.r, mario_grgic, sunshine

Ack, forgot the link apparently:

[1]: https://lore.kernel.org/git/CALnO6CAZtwfGY4SYeOuKqdP9+e_0EYNf4F703DRQB7UUfd_bUg@mail.gmail.com/

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2023-04-06 16:03 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-06 13:39 regex compilation error with --color-words Benjamin
  -- strict thread matches above, loose matches on Subject: below --
2023-04-06 16:03 Benjamin
2023-03-29 22:55 Eric Sunshine
2023-03-30  7:55 ` Diomidis Spinellis
2023-03-31 20:44   ` René Scharfe
2023-04-02  9:44     ` René Scharfe
2023-04-03 16:29       ` Junio C Hamano
2023-04-03 19:32         ` René Scharfe

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).