git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Johannes Sixt <j6t@kdbg.org>
Cc: git@vger.kernel.org,
	Johannes Sixt via GitGitGadget <gitgitgadget@gmail.com>
Subject: Re: [PATCH v2 0/5] Fun with cpp word regex
Date: Sat, 09 Oct 2021 02:00:12 +0200	[thread overview]
Message-ID: <87mtnjm416.fsf@evledraar.gmail.com> (raw)
In-Reply-To: <25363715-dc39-1f18-a937-f715b106f529@kdbg.org>


On Sat, Oct 09 2021, Johannes Sixt wrote:

> Am 08.10.21 um 22:07 schrieb Ævar Arnfjörð Bjarmason:

[re-arranged a bit]

> [...]As far as tokenization is concerned, C
> is a subset of C++. I don't think we need to separate the drivers.
>
>>  * I found myself back-porting some of your tests (manually mostly),
>>    maybe you disagree, but in cases like 123'123, <=> etc. I'd find it
>>    easier to follow if we first added the test data, and then the
>>    changed behavior.
>> 
>>    Because after all, we're going to change how we highlight existing
>>    data, so testing for that would be informative.
>
> Good point. I'll work a bit more on that.

Great!

>>  * I wonder if it isn't time to split up "cpp" into a "c" driver,
>>    e.g. git.git's .gitattributes has "cpp" for *.[ch] files, but as C++
>>    adds more syntax sugar.
>> 
>>    So e.g. if you use "<=>" after this series we'll tokenize it
>>    differently in *.c files, but it's a C++-only operator, on the other
>>    hand probably nobody cares that much...
>
> Yes, it is that: <=> won't appear in a correct C file (outside of
> comments), so no-one will care. [...]

..mmm..

>>  * This pre-dates your much improved tests, but these test files could
>>    really use some test comments, as in:
>> 
>>    /* Now that we're going to understand the "'" character somehow, will any of this change? */
>>    /* We haven't written code like this since the 1960's ... */
>>    /* Run & free */
>> 
>>    I.e. we don't just highlight code the compiler likes to eat, but also
>>    comments. So particularly for smaller tokens that also occur in
>>    natural language like "'" and "&" are we getting expected results?
>
> Comments are free text. Anything can happen. There is no such thing as
> "correct tokenization" in comments. Not interested.

Sure there is, just because the problem is fuzzy doesn't mean there
aren't more and less correct things to do.

But most importantly the output of "git diff" is made for human
consumption, people who use --word-diff are going to be looking at code
that contains comments, embedded natural language in C-strings etc.

I've got no reason to think that your changes here make it worse, but
just as a general matter we absolutely should consider that one of the
top priorities when it comes to these language drivers.

E.g. in some languages (like CJK) it's common for characters or words
not to have any unicode whitespace between them, so even a word-diff
mode for C can benefit from recognizing those character ranges and
splitting them appropriately, or at least trying.

So to take a comment of yours where you changed a comment at random:
    
    $ git log -U0 --oneline -1 --word-diff -p af920e36977 --word-diff-regex='[ a-zA-Z]*'
    [...]
    [- Note that this seemingly redundant second declaration is required-]{+ Note that this redundant forward declaration is required+}

Don't you think that would suck v.s. the now-behavior of:

    [...]
    Note that this[-seemingly-] redundant [-second-]{+forward+} declaration is required

The former is exactly the sort of thing you'd get in CJK languages with
a word-diff driver thought the line we should stop at was the same as a
comiler tokenizer.

Anyway, the cpp driver seems to do just fine on CJK.

I'm just saying that as a general thing it's definitely a priority for
these drivers to not only handle the narrow cases of tokens a compiler
would know about. Text that people commonly use should be presented in
some way that isn't line noise.

For an example of something we do a bit badly with the cpp driver is
parts of my 66f5f6dca95 (C style: use standard style for "TRANSLATORS"
comments, 2017-05-11).

I.e. there I was changing a comment format, and added a full stop to a
sentence, the word-diff is:

        /*
         {+*+} TRANSLATORS: here is a comment that explains the string to
         {+*+} be translated, that follows immediately after [-it-]{+it.+}
         */

Even though it has nothing to do with C syntax per-se that would be much more useful as:

        /*
         {+*+} TRANSLATORS: here is a comment that explains the string to
         {+*+} be translated, that follows immediately after it{+.+}
         */

I.e. treating a "." at the end of a word specially isn't C or C++
syntax, but it's absolutely input that the cpp driver *is* getting and
should be if possible be handling well.

I just did that by experimenting with
--word-diff-regex='([A-Za-z:]*|\*|\.)', that example is unchanged with
your series, but maybe low-hanging fruit....

Thanks for working on this, just an unsolicited braindump :)

  reply	other threads:[~2021-10-09  0:32 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-10-07  6:50 [PATCH 0/3] Fun with cpp word regex Johannes Sixt via GitGitGadget
2021-10-07  6:50 ` [PATCH 1/3] userdiff: tighten " Johannes Sixt via GitGitGadget
2021-10-07  6:50 ` [PATCH 2/3] userdiff: permit the digit-separating single-quote in numbers Johannes Sixt via GitGitGadget
2021-10-07  6:51 ` [PATCH 3/3] userdiff: learn the C++ spaceship operator Johannes Sixt via GitGitGadget
2021-10-07  9:14 ` [PATCH 0/3] Fun with cpp word regex Ævar Arnfjörð Bjarmason
2021-10-07 16:40   ` Johannes Sixt
2021-10-08 19:09 ` [PATCH v2 0/5] " Johannes Sixt via GitGitGadget
2021-10-08 19:09   ` [PATCH v2 1/5] t4034/cpp: actually test that operator tokens are not split Johannes Sixt via GitGitGadget
2021-10-08 19:09   ` [PATCH v2 2/5] t4034: add tests showing problematic cpp tokenizations Johannes Sixt via GitGitGadget
2021-10-08 19:09   ` [PATCH v2 3/5] userdiff-cpp: tighten word regex Johannes Sixt via GitGitGadget
2021-10-08 19:09   ` [PATCH v2 4/5] userdiff-cpp: permit the digit-separating single-quote in numbers Johannes Sixt via GitGitGadget
2021-10-08 19:09   ` [PATCH v2 5/5] userdiff-cpp: learn the C++ spaceship operator Johannes Sixt via GitGitGadget
2021-10-08 20:07   ` [PATCH v2 0/5] Fun with cpp word regex Ævar Arnfjörð Bjarmason
2021-10-08 22:11     ` Johannes Sixt
2021-10-09  0:00       ` Ævar Arnfjörð Bjarmason [this message]
2021-10-10 20:15         ` Johannes Sixt
2021-10-10 17:02   ` [PATCH v3 0/6] " Johannes Sixt via GitGitGadget
2021-10-10 17:02     ` [PATCH v3 1/6] t4034/cpp: actually test that operator tokens are not split Johannes Sixt via GitGitGadget
2021-10-10 17:03     ` [PATCH v3 2/6] t4034: add tests showing problematic cpp tokenizations Johannes Sixt via GitGitGadget
2021-10-10 17:03     ` [PATCH v3 3/6] userdiff-cpp: tighten word regex Johannes Sixt via GitGitGadget
2021-10-10 17:03     ` [PATCH v3 4/6] userdiff-cpp: prepare test cases with yet unsupported features Johannes Sixt via GitGitGadget
2021-10-10 17:03     ` [PATCH v3 5/6] userdiff-cpp: permit the digit-separating single-quote in numbers Johannes Sixt via GitGitGadget
2021-10-10 17:03     ` [PATCH v3 6/6] userdiff-cpp: learn the C++ spaceship operator Johannes Sixt via GitGitGadget
2021-10-24  9:56     ` [PATCH 7/6] userdiff-cpp: back out the digit-separators in numbers Johannes Sixt

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87mtnjm416.fsf@evledraar.gmail.com \
    --to=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=j6t@kdbg.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).