git@vger.kernel.org mailing list mirror (one of many)
From: "D. Ben Knoble" <ben.knoble@gmail.com>
To: "René Scharfe" <l.s.r@web.de>
Cc: "Git List" <git@vger.kernel.org>,
	"Diomidis Spinellis" <dds@aueb.gr>,
	"Eric Sunshine" <sunshine@sunshineco.com>,
	demerphq <demerphq@gmail.com>,
	"Mario Grgic" <mario_grgic@hotmail.com>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Junio C Hamano" <gitster@pobox.com>, "Jeff King" <peff@peff.net>
Subject: Re: [PATCH] userdiff: support regexec(3) with multi-byte support
Date: Fri, 7 Apr 2023 10:41:07 -0400
Message-ID: <CALnO6CDgs+of5KCRRwpmzEoHcqZ4udbHVhNrd63q4fFh_5TwHg@mail.gmail.com>
In-Reply-To: <7327ac06-d5da-ec53-543e-78e7729e78bb@web.de>

On Thu, Apr 6, 2023 at 4:19 PM René Scharfe <l.s.r@web.de> wrote:
>
> Since 1819ad327b (grep: fix multibyte regex handling under macOS,
> 2022-08-26) we use the system library for all regular expression
> matching on macOS, not just for git grep.  It supports multi-byte
> strings and rejects invalid multi-byte characters.
>
> This broke all built-in userdiff word regexes in UTF-8 locales because
> they all include such invalid bytes in expressions that are intended to
> match multi-byte characters without explicit support for that from the
> regex engine.
>
> "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+" is added to all built-in word
> regexes to match a single non-space or multi-byte character.  The \xNN
> characters are invalid if interpreted as UTF-8 because they have their
> high bit set, which indicates they are part of a multi-byte character,
> but they are surrounded by single-byte characters.
>
> Replace that expression with "|[^[:space:]]" if the regex engine
> supports multi-byte matching, as there is no need to have an explicit
> range for multi-byte characters then.  Check for that capability at
> runtime, because it depends on the locale and thus on environment
> variables.  Construct the full replacement expression at build time
> and just switch it in if necessary to avoid string manipulation and
> allocations at runtime.
>
> Additionally the word regex for tex contains the expression
> "[a-zA-Z0-9\x80-\xff]+" with a similarly invalid range.  The best
> replacement with only valid characters that I can come up with is
> "([a-zA-Z0-9]|[^\x01-\x7f])+".  Unlike the original it matches NUL
> characters, though.  Assuming that tex files usually don't contain NUL
> this should be acceptable.
>
> Reported-by: D. Ben Knoble <ben.knoble@gmail.com>
> Reported-by: Eric Sunshine <sunshine@sunshineco.com>
> Helped-by: Junio C Hamano <gitster@pobox.com>
> Signed-off-by: René Scharfe <l.s.r@web.de>

I tested the patch locally on top of ae73b2c8f1 and it solved my
problem. Seems like there's still some further discussion, though.
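
For anyone following along, here is a minimal sketch of the runtime
check the quoted message describes. It is not the code from the patch:
the function names and the probe character are assumptions chosen for
illustration, and only POSIX regcomp(3)/regexec(3) are used. The idea
is to ask the regex engine whether "[^[:space:]]" consumes a whole
two-byte UTF-8 character, and to pick the word-regex suffix on that
basis.

```c
#include <regex.h>
#include <string.h>

/*
 * Hypothetical helper (not git's API): returns 1 if the regex engine,
 * in the current locale, matches a two-byte UTF-8 character as a
 * single "non-space" character, and 0 otherwise.
 */
int regex_matches_multibyte_char(void)
{
	static const char pattern[] = "[^[:space:]]";
	static const char probe[] = "\xc2\xa3"; /* U+00A3 encoded as UTF-8 */
	regex_t re;
	regmatch_t m;
	int ok;

	if (regcomp(&re, pattern, REG_EXTENDED))
		return 0; /* treat a compile failure as "no support" */

	/* A multi-byte aware engine matches both bytes as one character. */
	ok = !regexec(&re, probe, 1, &m, 0) &&
	     m.rm_so == 0 && (size_t)m.rm_eo == strlen(probe);
	regfree(&re);
	return ok;
}

/*
 * Hypothetical helper: both suffix variants are string constants, so
 * choosing one needs no string manipulation or allocation at runtime.
 */
const char *word_regex_suffix(void)
{
	return regex_matches_multibyte_char()
		? "|[^[:space:]]"
		: "|[^[:space:]]|[\xc0-\xff][\x80-\xbf]+";
}
```

A caller would run setlocale(LC_CTYPE, "") first, since the result
depends on the locale and thus on environment variables such as LANG
and LC_ALL.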

