git@vger.kernel.org mailing list mirror (one of many)
From: "D. Ben Knoble" <ben.knoble@gmail.com>
To: Jeff King <peff@peff.net>
Cc: demerphq <demerphq@gmail.com>, git@vger.kernel.org, jrnieder@gmail.com
Subject: Re: grep: fix multibyte regex handling under macOS (1819ad327b7a1f19540a819813b70a0e8a7f798f)
Date: Tue, 7 Feb 2023 17:27:14 -0500
Message-ID: <CALnO6CCzp0brmCjbNmOKe9E1bcxLzGNrqpfK_JrU=+LXt-DUyQ@mail.gmail.com>
In-Reply-To: <Y+KXC5b0qUg2/nxt@coredump.intra.peff.net>

CC'ing Jonathan Nieder

On Tue, Feb 7, 2023 at 1:23 PM Jeff King <peff@peff.net> wrote:
>
> On Sun, Feb 05, 2023 at 02:51:05PM -0500, D. Ben Knoble wrote:
>
> > Any thoughts on some sort of stop-gap measure to fix --word-diff while
> > Git decides how to handle the regex engine incompatibilities? How
> > important is the sequence of bytes at the end of --word-diff regexes
> > in userdiff.c?
>
> It comes from 664d44ee7f (userdiff: simplify word-diff safeguard,
> 2011-01-11). So presumably we'd want to figure out a way to accomplish
> the same thing in a portable way. I'm not sure that's possible, though,
> without making assumptions about the regex engine.

If "use the safeguard portably" implies "make assumptions about the
regex engine," that sounds like an argument for Git to ship its own
engine with exactly the necessary features. If that implementation
includes proper locale and UTF-8 support alongside support for the
high-byte character classes, I think we would be all set…

OTOH, perhaps there is a way to express the safeguard character
classes portably?

Jonathan, can you provide more context for the safeguard? I've read
this message several times

> git's diff-words support has a detail that can be a little dangerous:
> any text not matched by a given language's tokenization pattern is
> treated as whitespace and changes in such text would go unnoticed.
> Therefore each of the built-in regexes allows a special token type
> consisting of a single non-whitespace character [^[:space:]].
>
> To make sure UTF-8 sequences remain human readable, the builtin
> regexes also have a special token type for runs of bytes with the high
> bit set.  In English, non-ASCII characters are usually isolated so
> this is analogous to the [^[:space:]] pattern, except it matches a
> single _multibyte_ character despite use of the C locale.
>
> Unfortunately it is easy to make typos or forget entirely to include
> these catch-all token types when adding support for new languages (see
> v1.7.3.5~16, userdiff: fix typo in ruby and python word regexes,
> 2010-12-18).  Avoid this by including them automatically within the
> PATTERNS and IPATTERN macros.
>
> While at it, change the UTF-8 sequence token type to match exactly one
> non-ASCII multi-byte character, rather than an arbitrary run of them.

and I can hardly make heads or tails of it.
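
For anyone following along, here is a minimal sketch of the behavior the
quoted commit message describes, written in Python rather than Git's C.
Everything in it is an assumption made for illustration: WORD is a made-up
stand-in for a language's word regex (not a pattern from userdiff.c),
tokenize() is a hypothetical helper, and the byte ranges are simply a UTF-8
lead byte followed by continuation bytes. It only shows why the two
catch-all token types matter; it is not how Git implements word-diff.

```python
import re

# Hypothetical word regex for some language: identifiers and integers.
WORD = rb"[A-Za-z_][A-Za-z0-9_]*|[0-9]+"

# Catch-all token types from the quoted description:
# exactly one multibyte character (a high lead byte plus its
# continuation bytes), or any single non-whitespace byte.
MULTIBYTE = rb"[\xc0-\xff][\x80-\xbf]+"
ANY_NONSPACE = rb"\S"


def tokenize(data, with_safeguard=True):
    """Split bytes into word-diff tokens; unmatched bytes act as whitespace."""
    pattern = WORD
    if with_safeguard:
        # Python's re takes the first alternative that matches, while POSIX
        # regexes prefer the longest match, so the multibyte alternative is
        # listed before the single-byte catch-all here.
        pattern += b"|" + MULTIBYTE + b"|" + ANY_NONSPACE
    return re.findall(pattern, data)


line = b"x += \xc3\xa9"  # "x += é" encoded as UTF-8

# Without the catch-alls, "+=" and "é" match nothing, so changes to them
# would be treated like whitespace changes and go unnoticed.
print(tokenize(line, with_safeguard=False))  # [b'x']

# With the catch-alls, every non-whitespace byte lands in some token and
# the two bytes of "é" stay together as a single token.
print(tokenize(line))  # [b'x', b'+', b'=', b'\xc3\xa9']
```

The sketch works on raw bytes, which roughly mirrors matching in the C
locale. The portability question in this thread is what a regex engine does
with those high-byte ranges once it interprets the input as UTF-8.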

-- 
D. Ben Knoble

