git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Johannes Sixt via GitGitGadget <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, Johannes Sixt <j6t@kdbg.org>
Subject: Re: [PATCH v2 0/5] Fun with cpp word regex
Date: Fri, 08 Oct 2021 22:07:18 +0200	[thread overview]
Message-ID: <87r1cvmg0c.fsf@evledraar.gmail.com> (raw)
In-Reply-To: <pull.1054.v2.git.1633720197.gitgitgadget@gmail.com>


On Fri, Oct 08 2021, Johannes Sixt via GitGitGadget wrote:

> The cpp word regex driver is a bit too loose and can match too much text
> where the intent is to match only a number.
>
> The first patch makes the cpp word regex tests more effective.
>
> The second patch adds problematic test cases. The third patch fixes these
> problems.
>
> The final two patches add support for digit separators and the spaceship
> operator <=> (generalized comparison operator).
>
> I left out support for hexadecimal floating point constants because that
> would require to tighten the regex even more to avoid that entire
> expressions are treated as single tokens.
>
> Changes since V1:
>
>  * Tests, tests, tests.
>  * Polished commit messages.

I've read this much improved v2 over, thanks a lot. This looks much
better.

Just some general comments, in my mind this is more than good enough
already and doesn't require a re-roll, just food for thought:

 * I wonder if it isn't time to split up "cpp" into a "c" driver,
   e.g. git.git's .gitattributes has "cpp" for *.[ch] files, but as C++
   adds more syntax sugar.

   So e.g. if you use "<=>" after this series we'll tokenize it
   differently in *.c files, but it's a C++-only operator, on the other
   hand probably nobody cares that much...

 * I found myself back-porting some of your tests (manually mostly),
   maybe you disagree, but in cases like 123'123, <=> etc. I'd find it
   easier to follow if we first added the test data, and then the
   changed behavior.

   Because after all, we're going to change how we highlight existing
   data, so testing for that would be informative.

 * This pre-dates your much improved tests, but these test files could
   really use some test comments, as in:

   /* Now that we're going to understand the "'" character somehow, will any of this change? */
   /* We haven't written code like this since the 1960's ... */
   /* Run & free */

   I.e. we don't just highlight code the compiler likes to eat, but also
   comments. So particularly for smaller tokens that also occur in
   natural language like "'" and "&" are we getting expected results?

   It looked like it from skimming your changes, i.e. the ' rule is
   checked by surrounding digits, just might be handy to test it ... :)

> Johannes Sixt (5):
>   t4034/cpp: actually test that operator tokens are not split
>   t4034: add tests showing problematic cpp tokenizations
>   userdiff-cpp: tighten word regex
>   userdiff-cpp: permit the digit-separating single-quote in numbers
>   userdiff-cpp: learn the C++ spaceship operator
>
>  t/t4034/cpp/expect | 63 +++++++++++++++++++++++-----------------------
>  t/t4034/cpp/post   | 47 +++++++++++++++++++++-------------
>  t/t4034/cpp/pre    | 41 +++++++++++++++++++-----------
>  userdiff.c         | 10 ++++++--
>  4 files changed, 94 insertions(+), 67 deletions(-)
>
>
> base-commit: 225bc32a989d7a22fa6addafd4ce7dcd04675dbf
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1054%2Fj6t%2Ffun-with-cpp-word-regex-v2
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1054/j6t/fun-with-cpp-word-regex-v2
> Pull-Request: https://github.com/gitgitgadget/git/pull/1054
>
> Range-diff vs v1:
>
>  -:  ----------- > 1:  dd9f82ba712 t4034/cpp: actually test that operator tokens are not split
>  -:  ----------- > 2:  5a84fc9cf71 t4034: add tests showing problematic cpp tokenizations
>  1:  a47ab9ba20e ! 3:  d4ebe45fddc userdiff: tighten cpp word regex
>      @@ Metadata
>       Author: Johannes Sixt <j6t@kdbg.org>
>       
>        ## Commit message ##
>      -    userdiff: tighten cpp word regex
>      +    userdiff-cpp: tighten word regex
>       
>           Generally, word regex can be written such that they match tokens
>           liberally and need not model the actual syntax because it can be assumed
>      @@ Commit message
>       
>              .l      as in str.length
>              .f      as in str.find
>      +       .e      as in str.erase
>       
>           Tighten the regex in the following way:
>       
>      @@ Commit message
>       
>           For readability, factor hex- and binary numbers into an own term.
>       
>      -    As a drive-by, this fixes that floatingpoint numbers such as 12E5
>      +    As a drive-by, this fixes that floating point numbers such as 12E5
>           (with upper-case E) were split into two tokens.
>       
>           Signed-off-by: Johannes Sixt <j6t@kdbg.org>
>       
>      + ## t/t4034/cpp/expect ##
>      +@@
>      + <BOLD>--- a/pre<RESET>
>      + <BOLD>+++ b/post<RESET>
>      + <CYAN>@@ -1,30 +1,30 @@<RESET>
>      +-Foo() : x(0<RED>&&1<RESET><GREEN>&42<RESET>) { <RED>foo0<RESET><GREEN>bar<RESET>(x<RED>.f<RESET><GREEN>.F<RESET>ind); }
>      ++Foo() : x(0<RED>&&1<RESET><GREEN>&42<RESET>) { <RED>foo0<RESET><GREEN>bar<RESET>(x.<RED>find<RESET><GREEN>Find<RESET>); }
>      + cout<<"Hello World<RED>!<RESET><GREEN>?<RESET>\n"<<endl;
>      +-<GREEN>(<RESET>1 <RED>-1e10<RESET><GREEN>+1e10<RESET> 0xabcdef<GREEN>)<RESET> '<RED>x<RESET><GREEN>y<RESET>'
>      ++<GREEN>(<RESET>1 <RED>-<RESET><GREEN>+<RESET>1e10 0xabcdef<GREEN>)<RESET> '<RED>x<RESET><GREEN>y<RESET>'
>      + // long double<RESET>
>      + <RED>3.141592653e-10l<RESET><GREEN>3.141592654e+10l<RESET>
>      + // float<RESET>
>      +-120<RED>E5f<RESET><GREEN>E6f<RESET>
>      ++<RED>120E5f<RESET><GREEN>120E6f<RESET>
>      + // hex<RESET>
>      +-<RED>0xdeadbeaf+8<RESET><GREEN>0xdeadBeaf+7<RESET>ULL
>      ++<RED>0xdeadbeaf<RESET><GREEN>0xdeadBeaf<RESET>+<RED>8ULL<RESET><GREEN>7ULL<RESET>
>      + // octal<RESET>
>      + <RED>01234567<RESET><GREEN>01234560<RESET>
>      + // binary<RESET>
>      + <RED>0b1000<RESET><GREEN>0b1100<RESET>+e1
>      + // expression<RESET>
>      +-<RED>1.5-e+2+f<RESET><GREEN>1.5-e+3+f<RESET>
>      ++1.5-e+<RED>2<RESET><GREEN>3<RESET>+f
>      + // another one<RESET>
>      +-str<RED>.e+65<RESET><GREEN>.e+75<RESET>
>      +-[a] b<RED>-><RESET><GREEN>->*<RESET>v d<RED>.e<RESET><GREEN>.*e<RESET>
>      ++str.e+<RED>65<RESET><GREEN>75<RESET>
>      ++[a] b<RED>-><RESET><GREEN>->*<RESET>v d<RED>.<RESET><GREEN>.*<RESET>e
>      + <GREEN>~<RESET>!a <GREEN>!<RESET>~b c<RED>++<RESET><GREEN>+<RESET> d<RED>--<RESET><GREEN>-<RESET> e*<GREEN>*<RESET>f g<RED>&<RESET><GREEN>&&<RESET>h
>      + a<RED>*<RESET><GREEN>*=<RESET>b c<RED>/<RESET><GREEN>/=<RESET>d e<RED>%<RESET><GREEN>%=<RESET>f
>      + a<RED>+<RESET><GREEN>++<RESET>b c<RED>-<RESET><GREEN>--<RESET>d
>      +@@ t/t4034/cpp/expect: a<RED>==<RESET><GREEN>!=<RESET>b c<RED>!=<RESET><GREEN>=<RESET>d
>      + a<RED>^<RESET><GREEN>^=<RESET>b c<RED>|<RESET><GREEN>|=<RESET>d e<RED>&&<RESET><GREEN>&=<RESET>f
>      + a<RED>||<RESET><GREEN>|<RESET>b
>      + a?<GREEN>:<RESET>b
>      +-a<RED>=<RESET><GREEN>==<RESET>b c<RED>+=<RESET><GREEN>+<RESET>d <RED>e-=f<RESET><GREEN>e-f<RESET> g<RED>*=<RESET><GREEN>*<RESET>h i<RED>/=<RESET><GREEN>/<RESET>j k<RED>%=<RESET><GREEN>%<RESET>l m<RED><<=<RESET><GREEN><<<RESET>n o<RED>>>=<RESET><GREEN>>><RESET>p q<RED>&=<RESET><GREEN>&<RESET>r s<RED>^=<RESET><GREEN>^<RESET>t u<RED>|=<RESET><GREEN>|<RESET>v
>      ++a<RED>=<RESET><GREEN>==<RESET>b c<RED>+=<RESET><GREEN>+<RESET>d e<RED>-=<RESET><GREEN>-<RESET>f g<RED>*=<RESET><GREEN>*<RESET>h i<RED>/=<RESET><GREEN>/<RESET>j k<RED>%=<RESET><GREEN>%<RESET>l m<RED><<=<RESET><GREEN><<<RESET>n o<RED>>>=<RESET><GREEN>>><RESET>p q<RED>&=<RESET><GREEN>&<RESET>r s<RED>^=<RESET><GREEN>^<RESET>t u<RED>|=<RESET><GREEN>|<RESET>v
>      + a,b<RESET>
>      + a<RED>::<RESET><GREEN>:<RESET>b
>      +
>        ## userdiff.c ##
>       @@ userdiff.c: PATTERNS("cpp",
>        	 /* functions/methods, variables, and compounds at top level */
>  2:  9d1c05f5f41 ! 4:  dd75d19cee9 userdiff: permit the digit-separating single-quote in numbers
>      @@ Metadata
>       Author: Johannes Sixt <j6t@kdbg.org>
>       
>        ## Commit message ##
>      -    userdiff: permit the digit-separating single-quote in numbers
>      +    userdiff-cpp: permit the digit-separating single-quote in numbers
>       
>           Since C++17, the single-quote can be used as digit separator:
>       
>      @@ Commit message
>              1'000'000
>              0xdead'beaf
>       
>      -    Make it known to the word regex of the cpp driver, so that numbers are not
>      -    split into separate tokens at the single-quotes.
>      +    Make it known to the word regex of the cpp driver, so that numbers are
>      +    not split into separate tokens at the single-quotes.
>       
>           Signed-off-by: Johannes Sixt <j6t@kdbg.org>
>       
>      + ## t/t4034/cpp/expect ##
>      +@@
>      + <BOLD>diff --git a/pre b/post<RESET>
>      +-<BOLD>index 1229cdb..3feae6f 100644<RESET>
>      ++<BOLD>index 60f3640..f6fbf7b 100644<RESET>
>      + <BOLD>--- a/pre<RESET>
>      + <BOLD>+++ b/post<RESET>
>      + <CYAN>@@ -1,30 +1,30 @@<RESET>
>      +@@ t/t4034/cpp/expect: Foo() : x(0<RED>&&1<RESET><GREEN>&42<RESET>) { <RED>foo0<RESET><GREEN>bar<RESET>
>      + cout<<"Hello World<RED>!<RESET><GREEN>?<RESET>\n"<<endl;
>      + <GREEN>(<RESET>1 <RED>-<RESET><GREEN>+<RESET>1e10 0xabcdef<GREEN>)<RESET> '<RED>x<RESET><GREEN>y<RESET>'
>      + // long double<RESET>
>      +-<RED>3.141592653e-10l<RESET><GREEN>3.141592654e+10l<RESET>
>      ++<RED>3.141'592'653e-10l<RESET><GREEN>3.141'592'654e+10l<RESET>
>      + // float<RESET>
>      + <RED>120E5f<RESET><GREEN>120E6f<RESET>
>      + // hex<RESET>
>      +-<RED>0xdeadbeaf<RESET><GREEN>0xdeadBeaf<RESET>+<RED>8ULL<RESET><GREEN>7ULL<RESET>
>      ++<RED>0xdead'beaf<RESET><GREEN>0xdead'Beaf<RESET>+<RED>8ULL<RESET><GREEN>7ULL<RESET>
>      + // octal<RESET>
>      +-<RED>01234567<RESET><GREEN>01234560<RESET>
>      ++<RED>0123'4567<RESET><GREEN>0123'4560<RESET>
>      + // binary<RESET>
>      +-<RED>0b1000<RESET><GREEN>0b1100<RESET>+e1
>      ++<RED>0b10'00<RESET><GREEN>0b11'00<RESET>+e1
>      + // expression<RESET>
>      + 1.5-e+<RED>2<RESET><GREEN>3<RESET>+f
>      + // another one<RESET>
>      +
>      + ## t/t4034/cpp/post ##
>      +@@ t/t4034/cpp/post: Foo() : x(0&42) { bar(x.Find); }
>      + cout<<"Hello World?\n"<<endl;
>      + (1 +1e10 0xabcdef) 'y'
>      + // long double
>      +-3.141592654e+10l
>      ++3.141'592'654e+10l
>      + // float
>      + 120E6f
>      + // hex
>      +-0xdeadBeaf+7ULL
>      ++0xdead'Beaf+7ULL
>      + // octal
>      +-01234560
>      ++0123'4560
>      + // binary
>      +-0b1100+e1
>      ++0b11'00+e1
>      + // expression
>      + 1.5-e+3+f
>      + // another one
>      +
>      + ## t/t4034/cpp/pre ##
>      +@@ t/t4034/cpp/pre: Foo():x(0&&1){ foo0( x.find); }
>      + cout<<"Hello World!\n"<<endl;
>      + 1 -1e10 0xabcdef 'x'
>      + // long double
>      +-3.141592653e-10l
>      ++3.141'592'653e-10l
>      + // float
>      + 120E5f
>      + // hex
>      +-0xdeadbeaf+8ULL
>      ++0xdead'beaf+8ULL
>      + // octal
>      +-01234567
>      ++0123'4567
>      + // binary
>      +-0b1000+e1
>      ++0b10'00+e1
>      + // expression
>      + 1.5-e+2+f
>      + // another one
>      +
>        ## userdiff.c ##
>       @@ userdiff.c: PATTERNS("cpp",
>        	 /* identifiers and keywords */
>  3:  df66485e7f0 ! 5:  43a701f5ffd userdiff: learn the C++ spaceship operator
>      @@ Metadata
>       Author: Johannes Sixt <j6t@kdbg.org>
>       
>        ## Commit message ##
>      -    userdiff: learn the C++ spaceship operator
>      +    userdiff-cpp: learn the C++ spaceship operator
>       
>      -    Since C++20, the language has a generalized comparison operator. Teach
>      -    the cpp driver not to separate it into <= and > tokens.
>      +    Since C++20, the language has a generalized comparison operator <=>.
>      +    Teach the cpp driver not to separate it into <= and > tokens.
>       
>           Signed-off-by: Johannes Sixt <j6t@kdbg.org>
>       
>      + ## t/t4034/cpp/expect ##
>      +@@
>      + <BOLD>diff --git a/pre b/post<RESET>
>      +-<BOLD>index 60f3640..f6fbf7b 100644<RESET>
>      ++<BOLD>index 144cd98..244f79c 100644<RESET>
>      + <BOLD>--- a/pre<RESET>
>      + <BOLD>+++ b/post<RESET>
>      + <CYAN>@@ -1,30 +1,30 @@<RESET>
>      +@@ t/t4034/cpp/expect: str.e+<RED>65<RESET><GREEN>75<RESET>
>      + a<RED>*<RESET><GREEN>*=<RESET>b c<RED>/<RESET><GREEN>/=<RESET>d e<RED>%<RESET><GREEN>%=<RESET>f
>      + a<RED>+<RESET><GREEN>++<RESET>b c<RED>-<RESET><GREEN>--<RESET>d
>      + a<RED><<<RESET><GREEN><<=<RESET>b c<RED>>><RESET><GREEN>>>=<RESET>d
>      +-a<RED><<RESET><GREEN><=<RESET>b c<RED><=<RESET><GREEN><<RESET>d e<RED>><RESET><GREEN>>=<RESET>f g<RED>>=<RESET><GREEN>><RESET>h
>      ++a<RED><<RESET><GREEN><=<RESET>b c<RED><=<RESET><GREEN><<RESET>d e<RED>><RESET><GREEN>>=<RESET>f g<RED>>=<RESET><GREEN>><RESET>h i<RED><=<RESET><GREEN><=><RESET>j
>      + a<RED>==<RESET><GREEN>!=<RESET>b c<RED>!=<RESET><GREEN>=<RESET>d
>      + a<RED>^<RESET><GREEN>^=<RESET>b c<RED>|<RESET><GREEN>|=<RESET>d e<RED>&&<RESET><GREEN>&=<RESET>f
>      + a<RED>||<RESET><GREEN>|<RESET>b
>      +
>      + ## t/t4034/cpp/post ##
>      +@@ t/t4034/cpp/post: str.e+75
>      + a*=b c/=d e%=f
>      + a++b c--d
>      + a<<=b c>>=d
>      +-a<=b c<d e>=f g>h
>      ++a<=b c<d e>=f g>h i<=>j
>      + a!=b c=d
>      + a^=b c|=d e&=f
>      + a|b
>      +
>      + ## t/t4034/cpp/pre ##
>      +@@ t/t4034/cpp/pre: str.e+65
>      + a*b c/d e%f
>      + a+b c-d
>      + a<<b c>>d
>      +-a<b c<=d e>f g>=h
>      ++a<b c<=d e>f g>=h i<=j
>      + a==b c!=d
>      + a^b c|d e&&f
>      + a||b
>      +
>        ## userdiff.c ##
>       @@ userdiff.c: PATTERNS("cpp",
>        	 "|0[xXbB][0-9a-fA-F']+[lLuU]*"


  parent reply	other threads:[~2021-10-08 20:13 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-10-07  6:50 [PATCH 0/3] Fun with cpp word regex Johannes Sixt via GitGitGadget
2021-10-07  6:50 ` [PATCH 1/3] userdiff: tighten " Johannes Sixt via GitGitGadget
2021-10-07  6:50 ` [PATCH 2/3] userdiff: permit the digit-separating single-quote in numbers Johannes Sixt via GitGitGadget
2021-10-07  6:51 ` [PATCH 3/3] userdiff: learn the C++ spaceship operator Johannes Sixt via GitGitGadget
2021-10-07  9:14 ` [PATCH 0/3] Fun with cpp word regex Ævar Arnfjörð Bjarmason
2021-10-07 16:40   ` Johannes Sixt
2021-10-08 19:09 ` [PATCH v2 0/5] " Johannes Sixt via GitGitGadget
2021-10-08 19:09   ` [PATCH v2 1/5] t4034/cpp: actually test that operator tokens are not split Johannes Sixt via GitGitGadget
2021-10-08 19:09   ` [PATCH v2 2/5] t4034: add tests showing problematic cpp tokenizations Johannes Sixt via GitGitGadget
2021-10-08 19:09   ` [PATCH v2 3/5] userdiff-cpp: tighten word regex Johannes Sixt via GitGitGadget
2021-10-08 19:09   ` [PATCH v2 4/5] userdiff-cpp: permit the digit-separating single-quote in numbers Johannes Sixt via GitGitGadget
2021-10-08 19:09   ` [PATCH v2 5/5] userdiff-cpp: learn the C++ spaceship operator Johannes Sixt via GitGitGadget
2021-10-08 20:07   ` Ævar Arnfjörð Bjarmason [this message]
2021-10-08 22:11     ` [PATCH v2 0/5] Fun with cpp word regex Johannes Sixt
2021-10-09  0:00       ` Ævar Arnfjörð Bjarmason
2021-10-10 20:15         ` Johannes Sixt
2021-10-10 17:02   ` [PATCH v3 0/6] " Johannes Sixt via GitGitGadget
2021-10-10 17:02     ` [PATCH v3 1/6] t4034/cpp: actually test that operator tokens are not split Johannes Sixt via GitGitGadget
2021-10-10 17:03     ` [PATCH v3 2/6] t4034: add tests showing problematic cpp tokenizations Johannes Sixt via GitGitGadget
2021-10-10 17:03     ` [PATCH v3 3/6] userdiff-cpp: tighten word regex Johannes Sixt via GitGitGadget
2021-10-10 17:03     ` [PATCH v3 4/6] userdiff-cpp: prepare test cases with yet unsupported features Johannes Sixt via GitGitGadget
2021-10-10 17:03     ` [PATCH v3 5/6] userdiff-cpp: permit the digit-separating single-quote in numbers Johannes Sixt via GitGitGadget
2021-10-10 17:03     ` [PATCH v3 6/6] userdiff-cpp: learn the C++ spaceship operator Johannes Sixt via GitGitGadget
2021-10-24  9:56     ` [PATCH 7/6] userdiff-cpp: back out the digit-separators in numbers Johannes Sixt

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87r1cvmg0c.fsf@evledraar.gmail.com \
    --to=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=j6t@kdbg.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).