* [PATCH 0/3] Fun with cpp word regex
@ 2021-10-07  6:50 Johannes Sixt via GitGitGadget
  2021-10-07  6:50 ` [PATCH 1/3] userdiff: tighten " Johannes Sixt via GitGitGadget
                   ` (4 more replies)
  0 siblings, 5 replies; 24+ messages in thread
From: Johannes Sixt via GitGitGadget @ 2021-10-07  6:50 UTC
  To: git; +Cc: Johannes Sixt

The word regex of the cpp driver is a bit too loose and can match too much
text where the intent is to match only a number. The first patch fixes that.

The other two patches add support for digit separators and the spaceship
operator <=> (generalized comparison operator).

I left out support for hexadecimal floating point constants because that
would require tightening the regex even more to keep entire expressions
from being treated as single tokens.

Johannes Sixt (3):
  userdiff: tighten cpp word regex
  userdiff: permit the digit-separating single-quote in numbers
  userdiff: learn the C++ spaceship operator

 userdiff.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)


base-commit: 225bc32a989d7a22fa6addafd4ce7dcd04675dbf
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1054%2Fj6t%2Ffun-with-cpp-word-regex-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1054/j6t/fun-with-cpp-word-regex-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1054
-- 
gitgitgadget


* [PATCH 1/3] userdiff: tighten cpp word regex
  2021-10-07  6:50 [PATCH 0/3] Fun with cpp word regex Johannes Sixt via GitGitGadget
@ 2021-10-07  6:50 ` Johannes Sixt via GitGitGadget
  2021-10-07  6:50 ` [PATCH 2/3] userdiff: permit the digit-separating single-quote in numbers Johannes Sixt via GitGitGadget
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 24+ messages in thread
From: Johannes Sixt via GitGitGadget @ 2021-10-07  6:50 UTC
  To: git; +Cc: Johannes Sixt, Johannes Sixt

From: Johannes Sixt <j6t@kdbg.org>

Generally, word regexes can be written such that they match tokens
liberally and need not model the actual syntax because it can be assumed
that the regex will only be applied to syntactically correct text.

The regex for cpp (C/C++) is too liberal, though. It regards these
sequences as single tokens:

   1+2
   1.5-e+2+f

and the following amalgams as one token:

   .l      as in str.length
   .f      as in str.find

Tighten the regex in the following way:

- Accept + and - only in one position in the exponent. + and - are no
  longer regarded as the sign of a number and are handled by the
  catch-all term that is not visible in the driver's regex.

- Accept a leading decimal point only when it is followed by a digit.

For readability, factor hexadecimal and binary numbers out into a term
of their own.

As a drive-by, this fixes the splitting of floating point numbers such
as 12E5 (with upper-case E) into two tokens.

Signed-off-by: Johannes Sixt <j6t@kdbg.org>
---
 userdiff.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)
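
To illustrate the effect, here is a minimal sketch that is not part of the
patch. It compiles the number terms of the old and the new word regex with
POSIX regcomp() and reports how many characters are taken as one token at
the start of a string. The helper token_len_at_start() and the constant
names are made up for this illustration. It assumes that the word regex is
compiled as an extended regular expression by an engine with POSIX
leftmost-longest matching; the identifier and operator terms of the driver
are deliberately left out.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Number part of the cpp word regex before this patch (copied from the diff). */
static const char OLD_NUMBER[] =
	"[-+0-9.e]+[fFlL]?|0[xXbB]?[0-9a-fA-F]+[lLuU]*";

/* Number terms after this patch (copied from the diff). */
static const char NEW_NUMBER[] =
	"[0-9][0-9.]*([Ee][-+]?[0-9]+)?[fFlLuU]*"
	"|0[xXbB][0-9a-fA-F]+[lLuU]*"
	"|\\.[0-9]+([Ee][-+]?[0-9]+)?[fFlL]?";

/*
 * Hypothetical helper: length of the match that starts at text[0],
 * or 0 if the leftmost match does not start there (or there is none).
 */
static size_t token_len_at_start(const char *pattern, const char *text)
{
	regex_t re;
	regmatch_t m;
	size_t len = 0;
	int rc = regcomp(&re, pattern, REG_EXTENDED);

	assert(rc == 0); /* the patterns are constants, so failure is a bug */
	(void)rc;
	if (regexec(&re, text, 1, &m, 0) == 0 && m.rm_so == 0)
		len = (size_t)m.rm_eo;
	regfree(&re);
	return len;
}
```

With OLD_NUMBER, the strings "1+2" and "1.5-e+2+f" are consumed entirely and
".length" yields the token ".l". With NEW_NUMBER, the match stops after "1"
and "1.5" respectively, ".length" no longer starts with a number token, and
"12E5" is taken as a whole.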

diff --git a/userdiff.c b/userdiff.c
index d9b2ba752f0..ce2a9230703 100644
--- a/userdiff.c
+++ b/userdiff.c
@@ -54,8 +54,14 @@ PATTERNS("cpp",
 	 /* functions/methods, variables, and compounds at top level */
 	 "^((::[[:space:]]*)?[A-Za-z_].*)$",
 	 /* -- */
+	 /* identifiers and keywords */
 	 "[a-zA-Z_][a-zA-Z0-9_]*"
-	 "|[-+0-9.e]+[fFlL]?|0[xXbB]?[0-9a-fA-F]+[lLuU]*"
+	 /* decimal and octal integers as well as floatingpoint numbers */
+	 "|[0-9][0-9.]*([Ee][-+]?[0-9]+)?[fFlLuU]*"
+	 /* hexadecimal and binary integers */
+	 "|0[xXbB][0-9a-fA-F]+[lLuU]*"
+	 /* floatingpoint numbers that begin with a decimal point */
+	 "|\\.[0-9]+([Ee][-+]?[0-9]+)?[fFlL]?"
 	 "|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->\\*?|\\.\\*"),
 PATTERNS("csharp",
 	 /* Keywords */
-- 
gitgitgadget



* [PATCH 2/3] userdiff: permit the digit-separating single-quote in numbers
  2021-10-07  6:50 [PATCH 0/3] Fun with cpp word regex Johannes Sixt via GitGitGadget
  2021-10-07  6:50 ` [PATCH 1/3] userdiff: tighten " Johannes Sixt via GitGitGadget
@ 2021-10-07  6:50 ` Johannes Sixt via GitGitGadget
  2021-10-07  6:51 ` [PATCH 3/3] userdiff: learn the C++ spaceship operator Johannes Sixt via GitGitGadget
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 24+ messages in thread
From: Johannes Sixt via GitGitGadget @ 2021-10-07  6:50 UTC
  To: git; +Cc: Johannes Sixt, Johannes Sixt

From: Johannes Sixt <j6t@kdbg.org>

Since C++14, the single-quote can be used as a digit separator:

   3.141'592'654
   1'000'000
   0xdead'beaf

Make it known to the word regex of the cpp driver, so that numbers are not
split into separate tokens at the single-quotes.

Signed-off-by: Johannes Sixt <j6t@kdbg.org>
---
 userdiff.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
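
As an illustration that is not part of the patch, the following sketch
counts the tokens that the number terms produce before and after this
change. The helper count_tokens() and the constant names are hypothetical.
The trailing "|[^[:space:]]" term is an assumption that stands in for the
catch-all which the driver adds implicitly; the identifier and operator
terms are omitted, so only purely numeric input is meaningful here.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Number terms before this patch, plus an assumed catch-all term. */
static const char BEFORE[] =
	"[0-9][0-9.]*([Ee][-+]?[0-9]+)?[fFlLuU]*"
	"|0[xXbB][0-9a-fA-F]+[lLuU]*"
	"|\\.[0-9]+([Ee][-+]?[0-9]+)?[fFlL]?"
	"|[^[:space:]]";

/* The same terms with the single-quote admitted, as in the diff. */
static const char AFTER[] =
	"[0-9][0-9.']*([Ee][-+]?[0-9]+)?[fFlLuU]*"
	"|0[xXbB][0-9a-fA-F']+[lLuU]*"
	"|\\.[0-9']+([Ee][-+]?[0-9]+)?[fFlL]?"
	"|[^[:space:]]";

/* Hypothetical helper: number of consecutive matches found in text. */
static size_t count_tokens(const char *pattern, const char *text)
{
	regex_t re;
	regmatch_t m;
	size_t count = 0;
	int rc = regcomp(&re, pattern, REG_EXTENDED);

	assert(rc == 0); /* the patterns are constants, so failure is a bug */
	(void)rc;
	/* Stop at the end of the text, on no match, or on an empty match. */
	while (*text && regexec(&re, text, 1, &m, 0) == 0 && m.rm_eo > m.rm_so) {
		count++;
		text += m.rm_eo;
	}
	regfree(&re);
	return count;
}
```

For "1'000'000", the BEFORE pattern yields five tokens (three digit groups
and two quotes), while the AFTER pattern yields a single token.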

diff --git a/userdiff.c b/userdiff.c
index ce2a9230703..1b640c7df79 100644
--- a/userdiff.c
+++ b/userdiff.c
@@ -57,11 +57,11 @@ PATTERNS("cpp",
 	 /* identifiers and keywords */
 	 "[a-zA-Z_][a-zA-Z0-9_]*"
 	 /* decimal and octal integers as well as floatingpoint numbers */
-	 "|[0-9][0-9.]*([Ee][-+]?[0-9]+)?[fFlLuU]*"
+	 "|[0-9][0-9.']*([Ee][-+]?[0-9]+)?[fFlLuU]*"
 	 /* hexadecimal and binary integers */
-	 "|0[xXbB][0-9a-fA-F]+[lLuU]*"
+	 "|0[xXbB][0-9a-fA-F']+[lLuU]*"
 	 /* floatingpoint numbers that begin with a decimal point */
-	 "|\\.[0-9]+([Ee][-+]?[0-9]+)?[fFlL]?"
+	 "|\\.[0-9']+([Ee][-+]?[0-9]+)?[fFlL]?"
 	 "|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->\\*?|\\.\\*"),
 PATTERNS("csharp",
 	 /* Keywords */
-- 
gitgitgadget



* [PATCH 3/3] userdiff: learn the C++ spaceship operator
  2021-10-07  6:50 [PATCH 0/3] Fun with cpp word regex Johannes Sixt via GitGitGadget
  2021-10-07  6:50 ` [PATCH 1/3] userdiff: tighten " Johannes Sixt via GitGitGadget
  2021-10-07  6:50 ` [PATCH 2/3] userdiff: permit the digit-separating single-quote in numbers Johannes Sixt via GitGitGadget
@ 2021-10-07  6:51 ` Johannes Sixt via GitGitGadget
  2021-10-07  9:14 ` [PATCH 0/3] Fun with cpp word regex Ævar Arnfjörð Bjarmason
  2021-10-08 19:09 ` [PATCH v2 0/5] " Johannes Sixt via GitGitGadget
  4 siblings, 0 replies; 24+ messages in thread
From: Johannes Sixt via GitGitGadget @ 2021-10-07  6:51 UTC
  To: git; +Cc: Johannes Sixt, Johannes Sixt

From: Johannes Sixt <j6t@kdbg.org>

Since C++20, the language has a generalized (three-way) comparison
operator <=>. Teach the cpp driver not to separate it into <= and >
tokens.

Signed-off-by: Johannes Sixt <j6t@kdbg.org>
---
 userdiff.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
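
The following sketch, which is not part of the patch, shows the operator
term in isolation. The helper first_match() and the constant names are made
up for this illustration. Note that the new alternative is appended at the
end, so the example relies on POSIX leftmost-longest matching to prefer <=>
over the shorter <=; an engine with ordered alternation would behave
differently.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Operator term before and after this patch (copied from the diff). */
static const char OPS_BEFORE[] =
	"[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->\\*?|\\.\\*";
static const char OPS_AFTER[] =
	"[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->\\*?|\\.\\*|<=>";

/*
 * Hypothetical helper: copy the leftmost match in text into out.
 * out is left empty if there is no match or the match does not fit.
 */
static void first_match(const char *pattern, const char *text,
			char *out, size_t outsize)
{
	regex_t re;
	regmatch_t m;
	int rc = regcomp(&re, pattern, REG_EXTENDED);

	assert(rc == 0); /* the patterns are constants, so failure is a bug */
	(void)rc;
	assert(outsize > 0);
	out[0] = '\0';
	if (regexec(&re, text, 1, &m, 0) == 0) {
		size_t len = (size_t)(m.rm_eo - m.rm_so);
		if (len < outsize) {
			memcpy(out, text + m.rm_so, len);
			out[len] = '\0';
		}
	}
	regfree(&re);
}
```

Applied to "i<=>j", OPS_BEFORE finds "<=" and leaves the ">" to the
catch-all, whereas OPS_AFTER finds "<=>" as one token.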

diff --git a/userdiff.c b/userdiff.c
index 1b640c7df79..13cec0b48db 100644
--- a/userdiff.c
+++ b/userdiff.c
@@ -62,7 +62,7 @@ PATTERNS("cpp",
 	 "|0[xXbB][0-9a-fA-F']+[lLuU]*"
 	 /* floatingpoint numbers that begin with a decimal point */
 	 "|\\.[0-9']+([Ee][-+]?[0-9]+)?[fFlL]?"
-	 "|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->\\*?|\\.\\*"),
+	 "|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->\\*?|\\.\\*|<=>"),
 PATTERNS("csharp",
 	 /* Keywords */
 	 "!^[ \t]*(do|while|for|if|else|instanceof|new|return|switch|case|throw|catch|using)\n"
-- 
gitgitgadget


* Re: [PATCH 0/3] Fun with cpp word regex
  2021-10-07  6:50 [PATCH 0/3] Fun with cpp word regex Johannes Sixt via GitGitGadget
                   ` (2 preceding siblings ...)
  2021-10-07  6:51 ` [PATCH 3/3] userdiff: learn the C++ spaceship operator Johannes Sixt via GitGitGadget
@ 2021-10-07  9:14 ` Ævar Arnfjörð Bjarmason
  2021-10-07 16:40   ` Johannes Sixt
  2021-10-08 19:09 ` [PATCH v2 0/5] " Johannes Sixt via GitGitGadget
  4 siblings, 1 reply; 24+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-10-07  9:14 UTC
  To: Johannes Sixt via GitGitGadget; +Cc: git, Johannes Sixt


On Thu, Oct 07 2021, Johannes Sixt via GitGitGadget wrote:

> The cpp word regex driver is a bit too loose and can match too much text
> where the intent is to match only a number. The first patch fixes that.
>
> The other two patches add support for digit separators and the spaceship
> operator <=> (generalized comparison operator).
>
> I left out support for hexadecimal floating point constants because that
> would require to tighten the regex even more to avoid that entire
> expressions are treated as single tokens.
>
> Johannes Sixt (3):
>   userdiff: tighten cpp word regex
>   userdiff: permit the digit-separating single-quote in numbers
>   userdiff: learn the C++ spaceship operator
>
>  userdiff.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)

I haven't dug into the C++ syntax here; I trust you know what you're
doing there.

But some tests in t/t4034/cpp/* would be great and, judging from the
files, seem easy to add. E.g. wouldn't changing:

    a<b a<=b a>b a>=b


to:

    a<b a<=b a>b a>=b a<=>b

give you an updated regression test for your 3/3? Similar changes can be
made for 1-2/3.


* Re: [PATCH 0/3] Fun with cpp word regex
  2021-10-07  9:14 ` [PATCH 0/3] Fun with cpp word regex Ævar Arnfjörð Bjarmason
@ 2021-10-07 16:40   ` Johannes Sixt
  0 siblings, 0 replies; 24+ messages in thread
From: Johannes Sixt @ 2021-10-07 16:40 UTC
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Johannes Sixt via GitGitGadget

On 07.10.21 at 11:14, Ævar Arnfjörð Bjarmason wrote:
> But some tests in t/t4034/cpp/* would be great, and seem from the files
> easy to add. E.g. wouldn't changing:

Ah, I didn't notice that we actually have tests for word-diff. I'll look
into it and resubmit.

-- Hannes


* [PATCH v2 0/5] Fun with cpp word regex
  2021-10-07  6:50 [PATCH 0/3] Fun with cpp word regex Johannes Sixt via GitGitGadget
                   ` (3 preceding siblings ...)
  2021-10-07  9:14 ` [PATCH 0/3] Fun with cpp word regex Ævar Arnfjörð Bjarmason
@ 2021-10-08 19:09 ` Johannes Sixt via GitGitGadget
  2021-10-08 19:09   ` [PATCH v2 1/5] t4034/cpp: actually test that operator tokens are not split Johannes Sixt via GitGitGadget
                     ` (6 more replies)
  4 siblings, 7 replies; 24+ messages in thread
From: Johannes Sixt via GitGitGadget @ 2021-10-08 19:09 UTC
  To: git; +Cc: Ævar Arnfjörð Bjarmason, Johannes Sixt

The word regex of the cpp driver is a bit too loose and can match too much
text where the intent is to match only a number.

The first patch makes the cpp word regex tests more effective.

The second patch adds problematic test cases. The third patch fixes these
problems.

The final two patches add support for digit separators and the spaceship
operator <=> (generalized comparison operator).

I left out support for hexadecimal floating point constants because that
would require tightening the regex even more to keep entire expressions
from being treated as single tokens.

Changes since v1:

 * Tests, tests, tests.
 * Polished commit messages.

Johannes Sixt (5):
  t4034/cpp: actually test that operator tokens are not split
  t4034: add tests showing problematic cpp tokenizations
  userdiff-cpp: tighten word regex
  userdiff-cpp: permit the digit-separating single-quote in numbers
  userdiff-cpp: learn the C++ spaceship operator

 t/t4034/cpp/expect | 63 +++++++++++++++++++++++-----------------------
 t/t4034/cpp/post   | 47 +++++++++++++++++++++-------------
 t/t4034/cpp/pre    | 41 +++++++++++++++++++-----------
 userdiff.c         | 10 ++++++--
 4 files changed, 94 insertions(+), 67 deletions(-)


base-commit: 225bc32a989d7a22fa6addafd4ce7dcd04675dbf
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1054%2Fj6t%2Ffun-with-cpp-word-regex-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1054/j6t/fun-with-cpp-word-regex-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/1054

Range-diff vs v1:

 -:  ----------- > 1:  dd9f82ba712 t4034/cpp: actually test that operator tokens are not split
 -:  ----------- > 2:  5a84fc9cf71 t4034: add tests showing problematic cpp tokenizations
 1:  a47ab9ba20e ! 3:  d4ebe45fddc userdiff: tighten cpp word regex
     @@ Metadata
      Author: Johannes Sixt <j6t@kdbg.org>
      
       ## Commit message ##
     -    userdiff: tighten cpp word regex
     +    userdiff-cpp: tighten word regex
      
          Generally, word regex can be written such that they match tokens
          liberally and need not model the actual syntax because it can be assumed
     @@ Commit message
      
             .l      as in str.length
             .f      as in str.find
     +       .e      as in str.erase
      
          Tighten the regex in the following way:
      
     @@ Commit message
      
          For readability, factor hex- and binary numbers into an own term.
      
     -    As a drive-by, this fixes that floatingpoint numbers such as 12E5
     +    As a drive-by, this fixes that floating point numbers such as 12E5
          (with upper-case E) were split into two tokens.
      
          Signed-off-by: Johannes Sixt <j6t@kdbg.org>
      
     + ## t/t4034/cpp/expect ##
     +@@
     + <BOLD>--- a/pre<RESET>
     + <BOLD>+++ b/post<RESET>
     + <CYAN>@@ -1,30 +1,30 @@<RESET>
     +-Foo() : x(0<RED>&&1<RESET><GREEN>&42<RESET>) { <RED>foo0<RESET><GREEN>bar<RESET>(x<RED>.f<RESET><GREEN>.F<RESET>ind); }
     ++Foo() : x(0<RED>&&1<RESET><GREEN>&42<RESET>) { <RED>foo0<RESET><GREEN>bar<RESET>(x.<RED>find<RESET><GREEN>Find<RESET>); }
     + cout<<"Hello World<RED>!<RESET><GREEN>?<RESET>\n"<<endl;
     +-<GREEN>(<RESET>1 <RED>-1e10<RESET><GREEN>+1e10<RESET> 0xabcdef<GREEN>)<RESET> '<RED>x<RESET><GREEN>y<RESET>'
     ++<GREEN>(<RESET>1 <RED>-<RESET><GREEN>+<RESET>1e10 0xabcdef<GREEN>)<RESET> '<RED>x<RESET><GREEN>y<RESET>'
     + // long double<RESET>
     + <RED>3.141592653e-10l<RESET><GREEN>3.141592654e+10l<RESET>
     + // float<RESET>
     +-120<RED>E5f<RESET><GREEN>E6f<RESET>
     ++<RED>120E5f<RESET><GREEN>120E6f<RESET>
     + // hex<RESET>
     +-<RED>0xdeadbeaf+8<RESET><GREEN>0xdeadBeaf+7<RESET>ULL
     ++<RED>0xdeadbeaf<RESET><GREEN>0xdeadBeaf<RESET>+<RED>8ULL<RESET><GREEN>7ULL<RESET>
     + // octal<RESET>
     + <RED>01234567<RESET><GREEN>01234560<RESET>
     + // binary<RESET>
     + <RED>0b1000<RESET><GREEN>0b1100<RESET>+e1
     + // expression<RESET>
     +-<RED>1.5-e+2+f<RESET><GREEN>1.5-e+3+f<RESET>
     ++1.5-e+<RED>2<RESET><GREEN>3<RESET>+f
     + // another one<RESET>
     +-str<RED>.e+65<RESET><GREEN>.e+75<RESET>
     +-[a] b<RED>-><RESET><GREEN>->*<RESET>v d<RED>.e<RESET><GREEN>.*e<RESET>
     ++str.e+<RED>65<RESET><GREEN>75<RESET>
     ++[a] b<RED>-><RESET><GREEN>->*<RESET>v d<RED>.<RESET><GREEN>.*<RESET>e
     + <GREEN>~<RESET>!a <GREEN>!<RESET>~b c<RED>++<RESET><GREEN>+<RESET> d<RED>--<RESET><GREEN>-<RESET> e*<GREEN>*<RESET>f g<RED>&<RESET><GREEN>&&<RESET>h
     + a<RED>*<RESET><GREEN>*=<RESET>b c<RED>/<RESET><GREEN>/=<RESET>d e<RED>%<RESET><GREEN>%=<RESET>f
     + a<RED>+<RESET><GREEN>++<RESET>b c<RED>-<RESET><GREEN>--<RESET>d
     +@@ t/t4034/cpp/expect: a<RED>==<RESET><GREEN>!=<RESET>b c<RED>!=<RESET><GREEN>=<RESET>d
     + a<RED>^<RESET><GREEN>^=<RESET>b c<RED>|<RESET><GREEN>|=<RESET>d e<RED>&&<RESET><GREEN>&=<RESET>f
     + a<RED>||<RESET><GREEN>|<RESET>b
     + a?<GREEN>:<RESET>b
     +-a<RED>=<RESET><GREEN>==<RESET>b c<RED>+=<RESET><GREEN>+<RESET>d <RED>e-=f<RESET><GREEN>e-f<RESET> g<RED>*=<RESET><GREEN>*<RESET>h i<RED>/=<RESET><GREEN>/<RESET>j k<RED>%=<RESET><GREEN>%<RESET>l m<RED><<=<RESET><GREEN><<<RESET>n o<RED>>>=<RESET><GREEN>>><RESET>p q<RED>&=<RESET><GREEN>&<RESET>r s<RED>^=<RESET><GREEN>^<RESET>t u<RED>|=<RESET><GREEN>|<RESET>v
     ++a<RED>=<RESET><GREEN>==<RESET>b c<RED>+=<RESET><GREEN>+<RESET>d e<RED>-=<RESET><GREEN>-<RESET>f g<RED>*=<RESET><GREEN>*<RESET>h i<RED>/=<RESET><GREEN>/<RESET>j k<RED>%=<RESET><GREEN>%<RESET>l m<RED><<=<RESET><GREEN><<<RESET>n o<RED>>>=<RESET><GREEN>>><RESET>p q<RED>&=<RESET><GREEN>&<RESET>r s<RED>^=<RESET><GREEN>^<RESET>t u<RED>|=<RESET><GREEN>|<RESET>v
     + a,b<RESET>
     + a<RED>::<RESET><GREEN>:<RESET>b
     +
       ## userdiff.c ##
      @@ userdiff.c: PATTERNS("cpp",
       	 /* functions/methods, variables, and compounds at top level */
 2:  9d1c05f5f41 ! 4:  dd75d19cee9 userdiff: permit the digit-separating single-quote in numbers
     @@ Metadata
      Author: Johannes Sixt <j6t@kdbg.org>
      
       ## Commit message ##
     -    userdiff: permit the digit-separating single-quote in numbers
     +    userdiff-cpp: permit the digit-separating single-quote in numbers
      
          Since C++17, the single-quote can be used as digit separator:
      
     @@ Commit message
             1'000'000
             0xdead'beaf
      
     -    Make it known to the word regex of the cpp driver, so that numbers are not
     -    split into separate tokens at the single-quotes.
     +    Make it known to the word regex of the cpp driver, so that numbers are
     +    not split into separate tokens at the single-quotes.
      
          Signed-off-by: Johannes Sixt <j6t@kdbg.org>
      
     + ## t/t4034/cpp/expect ##
     +@@
     + <BOLD>diff --git a/pre b/post<RESET>
     +-<BOLD>index 1229cdb..3feae6f 100644<RESET>
     ++<BOLD>index 60f3640..f6fbf7b 100644<RESET>
     + <BOLD>--- a/pre<RESET>
     + <BOLD>+++ b/post<RESET>
     + <CYAN>@@ -1,30 +1,30 @@<RESET>
     +@@ t/t4034/cpp/expect: Foo() : x(0<RED>&&1<RESET><GREEN>&42<RESET>) { <RED>foo0<RESET><GREEN>bar<RESET>
     + cout<<"Hello World<RED>!<RESET><GREEN>?<RESET>\n"<<endl;
     + <GREEN>(<RESET>1 <RED>-<RESET><GREEN>+<RESET>1e10 0xabcdef<GREEN>)<RESET> '<RED>x<RESET><GREEN>y<RESET>'
     + // long double<RESET>
     +-<RED>3.141592653e-10l<RESET><GREEN>3.141592654e+10l<RESET>
     ++<RED>3.141'592'653e-10l<RESET><GREEN>3.141'592'654e+10l<RESET>
     + // float<RESET>
     + <RED>120E5f<RESET><GREEN>120E6f<RESET>
     + // hex<RESET>
     +-<RED>0xdeadbeaf<RESET><GREEN>0xdeadBeaf<RESET>+<RED>8ULL<RESET><GREEN>7ULL<RESET>
     ++<RED>0xdead'beaf<RESET><GREEN>0xdead'Beaf<RESET>+<RED>8ULL<RESET><GREEN>7ULL<RESET>
     + // octal<RESET>
     +-<RED>01234567<RESET><GREEN>01234560<RESET>
     ++<RED>0123'4567<RESET><GREEN>0123'4560<RESET>
     + // binary<RESET>
     +-<RED>0b1000<RESET><GREEN>0b1100<RESET>+e1
     ++<RED>0b10'00<RESET><GREEN>0b11'00<RESET>+e1
     + // expression<RESET>
     + 1.5-e+<RED>2<RESET><GREEN>3<RESET>+f
     + // another one<RESET>
     +
     + ## t/t4034/cpp/post ##
     +@@ t/t4034/cpp/post: Foo() : x(0&42) { bar(x.Find); }
     + cout<<"Hello World?\n"<<endl;
     + (1 +1e10 0xabcdef) 'y'
     + // long double
     +-3.141592654e+10l
     ++3.141'592'654e+10l
     + // float
     + 120E6f
     + // hex
     +-0xdeadBeaf+7ULL
     ++0xdead'Beaf+7ULL
     + // octal
     +-01234560
     ++0123'4560
     + // binary
     +-0b1100+e1
     ++0b11'00+e1
     + // expression
     + 1.5-e+3+f
     + // another one
     +
     + ## t/t4034/cpp/pre ##
     +@@ t/t4034/cpp/pre: Foo():x(0&&1){ foo0( x.find); }
     + cout<<"Hello World!\n"<<endl;
     + 1 -1e10 0xabcdef 'x'
     + // long double
     +-3.141592653e-10l
     ++3.141'592'653e-10l
     + // float
     + 120E5f
     + // hex
     +-0xdeadbeaf+8ULL
     ++0xdead'beaf+8ULL
     + // octal
     +-01234567
     ++0123'4567
     + // binary
     +-0b1000+e1
     ++0b10'00+e1
     + // expression
     + 1.5-e+2+f
     + // another one
     +
       ## userdiff.c ##
      @@ userdiff.c: PATTERNS("cpp",
       	 /* identifiers and keywords */
 3:  df66485e7f0 ! 5:  43a701f5ffd userdiff: learn the C++ spaceship operator
     @@ Metadata
      Author: Johannes Sixt <j6t@kdbg.org>
      
       ## Commit message ##
     -    userdiff: learn the C++ spaceship operator
     +    userdiff-cpp: learn the C++ spaceship operator
      
     -    Since C++20, the language has a generalized comparison operator. Teach
     -    the cpp driver not to separate it into <= and > tokens.
     +    Since C++20, the language has a generalized comparison operator <=>.
     +    Teach the cpp driver not to separate it into <= and > tokens.
      
          Signed-off-by: Johannes Sixt <j6t@kdbg.org>
      
     + ## t/t4034/cpp/expect ##
     +@@
     + <BOLD>diff --git a/pre b/post<RESET>
     +-<BOLD>index 60f3640..f6fbf7b 100644<RESET>
     ++<BOLD>index 144cd98..244f79c 100644<RESET>
     + <BOLD>--- a/pre<RESET>
     + <BOLD>+++ b/post<RESET>
     + <CYAN>@@ -1,30 +1,30 @@<RESET>
     +@@ t/t4034/cpp/expect: str.e+<RED>65<RESET><GREEN>75<RESET>
     + a<RED>*<RESET><GREEN>*=<RESET>b c<RED>/<RESET><GREEN>/=<RESET>d e<RED>%<RESET><GREEN>%=<RESET>f
     + a<RED>+<RESET><GREEN>++<RESET>b c<RED>-<RESET><GREEN>--<RESET>d
     + a<RED><<<RESET><GREEN><<=<RESET>b c<RED>>><RESET><GREEN>>>=<RESET>d
     +-a<RED><<RESET><GREEN><=<RESET>b c<RED><=<RESET><GREEN><<RESET>d e<RED>><RESET><GREEN>>=<RESET>f g<RED>>=<RESET><GREEN>><RESET>h
     ++a<RED><<RESET><GREEN><=<RESET>b c<RED><=<RESET><GREEN><<RESET>d e<RED>><RESET><GREEN>>=<RESET>f g<RED>>=<RESET><GREEN>><RESET>h i<RED><=<RESET><GREEN><=><RESET>j
     + a<RED>==<RESET><GREEN>!=<RESET>b c<RED>!=<RESET><GREEN>=<RESET>d
     + a<RED>^<RESET><GREEN>^=<RESET>b c<RED>|<RESET><GREEN>|=<RESET>d e<RED>&&<RESET><GREEN>&=<RESET>f
     + a<RED>||<RESET><GREEN>|<RESET>b
     +
     + ## t/t4034/cpp/post ##
     +@@ t/t4034/cpp/post: str.e+75
     + a*=b c/=d e%=f
     + a++b c--d
     + a<<=b c>>=d
     +-a<=b c<d e>=f g>h
     ++a<=b c<d e>=f g>h i<=>j
     + a!=b c=d
     + a^=b c|=d e&=f
     + a|b
     +
     + ## t/t4034/cpp/pre ##
     +@@ t/t4034/cpp/pre: str.e+65
     + a*b c/d e%f
     + a+b c-d
     + a<<b c>>d
     +-a<b c<=d e>f g>=h
     ++a<b c<=d e>f g>=h i<=j
     + a==b c!=d
     + a^b c|d e&&f
     + a||b
     +
       ## userdiff.c ##
      @@ userdiff.c: PATTERNS("cpp",
       	 "|0[xXbB][0-9a-fA-F']+[lLuU]*"

-- 
gitgitgadget


* [PATCH v2 1/5] t4034/cpp: actually test that operator tokens are not split
  2021-10-08 19:09 ` [PATCH v2 0/5] " Johannes Sixt via GitGitGadget
@ 2021-10-08 19:09   ` Johannes Sixt via GitGitGadget
  2021-10-08 19:09   ` [PATCH v2 2/5] t4034: add tests showing problematic cpp tokenizations Johannes Sixt via GitGitGadget
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Johannes Sixt via GitGitGadget @ 2021-10-08 19:09 UTC
  To: git; +Cc: Ævar Arnfjörð Bjarmason, Johannes Sixt,
	Johannes Sixt

From: Johannes Sixt <j6t@kdbg.org>

8d96e7288f2b (t4034: bulk verify builtin word regex sanity, 2010-12-18)
added many tests with the intent to verify that operators consisting of
more than one symbol are kept together. These are tested by probing a
transition from, e.g., a!=b to x!=y, which results in the word-diff

  [-a-]{+x+}!=[-b-]{+y+}

But that proves only that the letters and operators are separate tokens.
To prove that != is an inseparable token, we have to probe a transition
from, e.g., a=b to a!=b, which has the word-diff

  a[-=-]{+!=+}b

that proves that the ! is not separate from the =.

In the post-image, add a character to or remove a character from each
operator such that it turns into another valid operator.

Change the identifiers used around operators such that the diff
algorithm does not have an incentive to match, e.g., a<b in one spot
in the pre-image with a<b elsewhere in the post-image.

Adjust the expected output to match the new differences. Notice that
there are some undesirable tokenizations around e, ., and -.  This will
be addressed in a later change.

Signed-off-by: Johannes Sixt <j6t@kdbg.org>
---
 t/t4034/cpp/expect | 45 +++++++++++++++------------------------------
 t/t4034/cpp/post   | 29 +++++++++++++----------------
 t/t4034/cpp/pre    | 25 +++++++++++--------------
 3 files changed, 39 insertions(+), 60 deletions(-)
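
The reasoning above can be made concrete with a small sketch that is not
part of the patch. It tokenizes a string with two simplified word regexes,
one that knows != and == and one that does not. The helper tokenize(), the
constant names, and the simplified patterns (including the catch-all term
"[^[:space:]]") are assumptions made for this illustration, not the
driver's actual regex.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Simplified word regexes; both end with an assumed catch-all term. */
static const char WITH_NE[] = "[a-zA-Z_][a-zA-Z0-9_]*|[=!]=|[^[:space:]]";
static const char WITHOUT_NE[] = "[a-zA-Z_][a-zA-Z0-9_]*|[^[:space:]]";

/* Hypothetical helper: write the tokens of text into out, separated by spaces. */
static void tokenize(const char *pattern, const char *text,
		     char *out, size_t outsize)
{
	regex_t re;
	regmatch_t m;
	size_t used = 0;
	int rc = regcomp(&re, pattern, REG_EXTENDED);

	assert(rc == 0); /* the patterns are constants, so failure is a bug */
	(void)rc;
	assert(outsize > 0);
	out[0] = '\0';
	while (*text && regexec(&re, text, 1, &m, 0) == 0 && m.rm_eo > m.rm_so) {
		int n = snprintf(out + used, outsize - used, "%s%.*s",
				 used ? " " : "",
				 (int)(m.rm_eo - m.rm_so), text + m.rm_so);
		if (n < 0 || (size_t)n >= outsize - used)
			break; /* output buffer too small: stop early */
		used += (size_t)n;
		text += m.rm_eo;
	}
	regfree(&re);
}
```

With either pattern, a!=b and x!=y differ only in the identifiers, so that
transition cannot tell the two regexes apart. The transition from a=b to
a!=b can: WITH_NE yields "a = b" and "a != b" (the = is replaced by !=),
while WITHOUT_NE yields "a = b" and "a ! = b" (only a ! is inserted).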

diff --git a/t/t4034/cpp/expect b/t/t4034/cpp/expect
index 37d1ea25870..41976971b93 100644
--- a/t/t4034/cpp/expect
+++ b/t/t4034/cpp/expect
@@ -1,36 +1,21 @@
 <BOLD>diff --git a/pre b/post<RESET>
-<BOLD>index 23d5c8a..7e8c026 100644<RESET>
+<BOLD>index c5672a2..4229868 100644<RESET>
 <BOLD>--- a/pre<RESET>
 <BOLD>+++ b/post<RESET>
-<CYAN>@@ -1,19 +1,19 @@<RESET>
+<CYAN>@@ -1,16 +1,16 @@<RESET>
 Foo() : x(0<RED>&&1<RESET><GREEN>&42<RESET>) { <GREEN>bar(x);<RESET> }
 cout<<"Hello World<RED>!<RESET><GREEN>?<RESET>\n"<<endl;
 <GREEN>(<RESET>1<GREEN>) (<RESET>-1e10<GREEN>) (<RESET>0xabcdef<GREEN>)<RESET> '<RED>x<RESET><GREEN>y<RESET>'
-[<RED>a<RESET><GREEN>x<RESET>] <RED>a<RESET><GREEN>x<RESET>-><RED>b a<RESET><GREEN>y x<RESET>.<RED>b<RESET><GREEN>y<RESET>
-!<RED>a<RESET><GREEN>x<RESET> ~<RED>a a<RESET><GREEN>x x<RESET>++ <RED>a<RESET><GREEN>x<RESET>-- <RED>a<RESET><GREEN>x<RESET>*<RED>b a<RESET><GREEN>y x<RESET>&<RED>b<RESET>
-<RED>a<RESET><GREEN>y<RESET>
-<GREEN>x<RESET>*<RED>b a<RESET><GREEN>y x<RESET>/<RED>b a<RESET><GREEN>y x<RESET>%<RED>b<RESET>
-<RED>a<RESET><GREEN>y<RESET>
-<GREEN>x<RESET>+<RED>b a<RESET><GREEN>y x<RESET>-<RED>b<RESET>
-<RED>a<RESET><GREEN>y<RESET>
-<GREEN>x<RESET><<<RED>b a<RESET><GREEN>y x<RESET>>><RED>b<RESET>
-<RED>a<RESET><GREEN>y<RESET>
-<GREEN>x<RESET><<RED>b a<RESET><GREEN>y x<RESET><=<RED>b a<RESET><GREEN>y x<RESET>><RED>b a<RESET><GREEN>y x<RESET>>=<RED>b<RESET>
-<RED>a<RESET><GREEN>y<RESET>
-<GREEN>x<RESET>==<RED>b a<RESET><GREEN>y x<RESET>!=<RED>b<RESET>
-<RED>a<RESET><GREEN>y<RESET>
-<GREEN>x<RESET>&<RED>b<RESET>
-<RED>a<RESET><GREEN>y<RESET>
-<GREEN>x<RESET>^<RED>b<RESET>
-<RED>a<RESET><GREEN>y<RESET>
-<GREEN>x<RESET>|<RED>b<RESET>
-<RED>a<RESET><GREEN>y<RESET>
-<GREEN>x<RESET>&&<RED>b<RESET>
-<RED>a<RESET><GREEN>y<RESET>
-<GREEN>x<RESET>||<RED>b<RESET>
-<RED>a<RESET><GREEN>y<RESET>
-<GREEN>x<RESET>?<RED>b<RESET><GREEN>y<RESET>:z
-<RED>a<RESET><GREEN>x<RESET>=<RED>b a<RESET><GREEN>y x<RESET>+=<RED>b a<RESET><GREEN>y x<RESET>-=<RED>b a<RESET><GREEN>y x<RESET>*=<RED>b a<RESET><GREEN>y x<RESET>/=<RED>b a<RESET><GREEN>y x<RESET>%=<RED>b a<RESET><GREEN>y x<RESET><<=<RED>b a<RESET><GREEN>y x<RESET>>>=<RED>b a<RESET><GREEN>y x<RESET>&=<RED>b a<RESET><GREEN>y x<RESET>^=<RED>b a<RESET><GREEN>y x<RESET>|=<RED>b<RESET>
-<RED>a<RESET><GREEN>y<RESET>
-<GREEN>x<RESET>,y
-<RED>a<RESET><GREEN>x<RESET>::<RED>b<RESET><GREEN>y<RESET>
+[a] b<RED>-><RESET><GREEN>->*<RESET>v d<RED>.e<RESET><GREEN>.*e<RESET>
+<GREEN>~<RESET>!a <GREEN>!<RESET>~b c<RED>++<RESET><GREEN>+<RESET> d<RED>--<RESET><GREEN>-<RESET> e*<GREEN>*<RESET>f g<RED>&<RESET><GREEN>&&<RESET>h
+a<RED>*<RESET><GREEN>*=<RESET>b c<RED>/<RESET><GREEN>/=<RESET>d e<RED>%<RESET><GREEN>%=<RESET>f
+a<RED>+<RESET><GREEN>++<RESET>b c<RED>-<RESET><GREEN>--<RESET>d
+a<RED><<<RESET><GREEN><<=<RESET>b c<RED>>><RESET><GREEN>>>=<RESET>d
+a<RED><<RESET><GREEN><=<RESET>b c<RED><=<RESET><GREEN><<RESET>d e<RED>><RESET><GREEN>>=<RESET>f g<RED>>=<RESET><GREEN>><RESET>h
+a<RED>==<RESET><GREEN>!=<RESET>b c<RED>!=<RESET><GREEN>=<RESET>d
+a<RED>^<RESET><GREEN>^=<RESET>b c<RED>|<RESET><GREEN>|=<RESET>d e<RED>&&<RESET><GREEN>&=<RESET>f
+a<RED>||<RESET><GREEN>|<RESET>b
+a?<GREEN>:<RESET>b
+a<RED>=<RESET><GREEN>==<RESET>b c<RED>+=<RESET><GREEN>+<RESET>d <RED>e-=f<RESET><GREEN>e-f<RESET> g<RED>*=<RESET><GREEN>*<RESET>h i<RED>/=<RESET><GREEN>/<RESET>j k<RED>%=<RESET><GREEN>%<RESET>l m<RED><<=<RESET><GREEN><<<RESET>n o<RED>>>=<RESET><GREEN>>><RESET>p q<RED>&=<RESET><GREEN>&<RESET>r s<RED>^=<RESET><GREEN>^<RESET>t u<RED>|=<RESET><GREEN>|<RESET>v
+a,b<RESET>
+a<RED>::<RESET><GREEN>:<RESET>b
diff --git a/t/t4034/cpp/post b/t/t4034/cpp/post
index 7e8c026cefb..4229868ae62 100644
--- a/t/t4034/cpp/post
+++ b/t/t4034/cpp/post
@@ -1,19 +1,16 @@
 Foo() : x(0&42) { bar(x); }
 cout<<"Hello World?\n"<<endl;
 (1) (-1e10) (0xabcdef) 'y'
-[x] x->y x.y
-!x ~x x++ x-- x*y x&y
-x*y x/y x%y
-x+y x-y
-x<<y x>>y
-x<y x<=y x>y x>=y
-x==y x!=y
-x&y
-x^y
-x|y
-x&&y
-x||y
-x?y:z
-x=y x+=y x-=y x*=y x/=y x%=y x<<=y x>>=y x&=y x^=y x|=y
-x,y
-x::y
+[a] b->*v d.*e
+~!a !~b c+ d- e**f g&&h
+a*=b c/=d e%=f
+a++b c--d
+a<<=b c>>=d
+a<=b c<d e>=f g>h
+a!=b c=d
+a^=b c|=d e&=f
+a|b
+a?:b
+a==b c+d e-f g*h i/j k%l m<<n o>>p q&r s^t u|v
+a,b
+a:b
diff --git a/t/t4034/cpp/pre b/t/t4034/cpp/pre
index 23d5c8adf54..c5672a24cfc 100644
--- a/t/t4034/cpp/pre
+++ b/t/t4034/cpp/pre
@@ -1,19 +1,16 @@
 Foo():x(0&&1){}
 cout<<"Hello World!\n"<<endl;
 1 -1e10 0xabcdef 'x'
-[a] a->b a.b
-!a ~a a++ a-- a*b a&b
-a*b a/b a%b
-a+b a-b
-a<<b a>>b
-a<b a<=b a>b a>=b
-a==b a!=b
-a&b
-a^b
-a|b
-a&&b
+[a] b->v d.e
+!a ~b c++ d-- e*f g&h
+a*b c/d e%f
+a+b c-d
+a<<b c>>d
+a<b c<=d e>f g>=h
+a==b c!=d
+a^b c|d e&&f
 a||b
-a?b:z
-a=b a+=b a-=b a*=b a/=b a%=b a<<=b a>>=b a&=b a^=b a|=b
-a,y
+a?b
+a=b c+=d e-=f g*=h i/=j k%=l m<<=n o>>=p q&=r s^=t u|=v
+a,b
 a::b
-- 
gitgitgadget



* [PATCH v2 2/5] t4034: add tests showing problematic cpp tokenizations
  2021-10-08 19:09 ` [PATCH v2 0/5] " Johannes Sixt via GitGitGadget
  2021-10-08 19:09   ` [PATCH v2 1/5] t4034/cpp: actually test that operator tokens are not split Johannes Sixt via GitGitGadget
@ 2021-10-08 19:09   ` Johannes Sixt via GitGitGadget
  2021-10-08 19:09   ` [PATCH v2 3/5] userdiff-cpp: tighten word regex Johannes Sixt via GitGitGadget
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Johannes Sixt via GitGitGadget @ 2021-10-08 19:09 UTC
  To: git; +Cc: Ævar Arnfjörð Bjarmason, Johannes Sixt,
	Johannes Sixt

From: Johannes Sixt <j6t@kdbg.org>

The word regex is too loose and matches long streaks of characters
that should actually be separate tokens.  Add these problematic test
cases. Separate the lines with text that will remain identical in the
pre- and post-image so that the diff algorithm will not lump removals
and additions of consecutive lines together. This makes the expected
output easier to read.

Signed-off-by: Johannes Sixt <j6t@kdbg.org>
---
 t/t4034/cpp/expect | 22 ++++++++++++++++++----
 t/t4034/cpp/post   | 18 ++++++++++++++++--
 t/t4034/cpp/pre    | 16 +++++++++++++++-
 3 files changed, 49 insertions(+), 7 deletions(-)

diff --git a/t/t4034/cpp/expect b/t/t4034/cpp/expect
index 41976971b93..63e53a61e62 100644
--- a/t/t4034/cpp/expect
+++ b/t/t4034/cpp/expect
@@ -1,11 +1,25 @@
 <BOLD>diff --git a/pre b/post<RESET>
-<BOLD>index c5672a2..4229868 100644<RESET>
+<BOLD>index 1229cdb..3feae6f 100644<RESET>
 <BOLD>--- a/pre<RESET>
 <BOLD>+++ b/post<RESET>
-<CYAN>@@ -1,16 +1,16 @@<RESET>
-Foo() : x(0<RED>&&1<RESET><GREEN>&42<RESET>) { <GREEN>bar(x);<RESET> }
+<CYAN>@@ -1,30 +1,30 @@<RESET>
+Foo() : x(0<RED>&&1<RESET><GREEN>&42<RESET>) { <RED>foo0<RESET><GREEN>bar<RESET>(x<RED>.f<RESET><GREEN>.F<RESET>ind); }
 cout<<"Hello World<RED>!<RESET><GREEN>?<RESET>\n"<<endl;
-<GREEN>(<RESET>1<GREEN>) (<RESET>-1e10<GREEN>) (<RESET>0xabcdef<GREEN>)<RESET> '<RED>x<RESET><GREEN>y<RESET>'
+<GREEN>(<RESET>1 <RED>-1e10<RESET><GREEN>+1e10<RESET> 0xabcdef<GREEN>)<RESET> '<RED>x<RESET><GREEN>y<RESET>'
+// long double<RESET>
+<RED>3.141592653e-10l<RESET><GREEN>3.141592654e+10l<RESET>
+// float<RESET>
+120<RED>E5f<RESET><GREEN>E6f<RESET>
+// hex<RESET>
+<RED>0xdeadbeaf+8<RESET><GREEN>0xdeadBeaf+7<RESET>ULL
+// octal<RESET>
+<RED>01234567<RESET><GREEN>01234560<RESET>
+// binary<RESET>
+<RED>0b1000<RESET><GREEN>0b1100<RESET>+e1
+// expression<RESET>
+<RED>1.5-e+2+f<RESET><GREEN>1.5-e+3+f<RESET>
+// another one<RESET>
+str<RED>.e+65<RESET><GREEN>.e+75<RESET>
 [a] b<RED>-><RESET><GREEN>->*<RESET>v d<RED>.e<RESET><GREEN>.*e<RESET>
 <GREEN>~<RESET>!a <GREEN>!<RESET>~b c<RED>++<RESET><GREEN>+<RESET> d<RED>--<RESET><GREEN>-<RESET> e*<GREEN>*<RESET>f g<RED>&<RESET><GREEN>&&<RESET>h
 a<RED>*<RESET><GREEN>*=<RESET>b c<RED>/<RESET><GREEN>/=<RESET>d e<RED>%<RESET><GREEN>%=<RESET>f
diff --git a/t/t4034/cpp/post b/t/t4034/cpp/post
index 4229868ae62..3feae6f430f 100644
--- a/t/t4034/cpp/post
+++ b/t/t4034/cpp/post
@@ -1,6 +1,20 @@
-Foo() : x(0&42) { bar(x); }
+Foo() : x(0&42) { bar(x.Find); }
 cout<<"Hello World?\n"<<endl;
-(1) (-1e10) (0xabcdef) 'y'
+(1 +1e10 0xabcdef) 'y'
+// long double
+3.141592654e+10l
+// float
+120E6f
+// hex
+0xdeadBeaf+7ULL
+// octal
+01234560
+// binary
+0b1100+e1
+// expression
+1.5-e+3+f
+// another one
+str.e+75
 [a] b->*v d.*e
 ~!a !~b c+ d- e**f g&&h
 a*=b c/=d e%=f
diff --git a/t/t4034/cpp/pre b/t/t4034/cpp/pre
index c5672a24cfc..1229cdb59d1 100644
--- a/t/t4034/cpp/pre
+++ b/t/t4034/cpp/pre
@@ -1,6 +1,20 @@
-Foo():x(0&&1){}
+Foo():x(0&&1){ foo0( x.find); }
 cout<<"Hello World!\n"<<endl;
 1 -1e10 0xabcdef 'x'
+// long double
+3.141592653e-10l
+// float
+120E5f
+// hex
+0xdeadbeaf+8ULL
+// octal
+01234567
+// binary
+0b1000+e1
+// expression
+1.5-e+2+f
+// another one
+str.e+65
 [a] b->v d.e
 !a ~b c++ d-- e*f g&h
 a*b c/d e%f
-- 
gitgitgadget



* [PATCH v2 3/5] userdiff-cpp: tighten word regex
  2021-10-08 19:09 ` [PATCH v2 0/5] " Johannes Sixt via GitGitGadget
  2021-10-08 19:09   ` [PATCH v2 1/5] t4034/cpp: actually test that operator tokens are not split Johannes Sixt via GitGitGadget
  2021-10-08 19:09   ` [PATCH v2 2/5] t4034: add tests showing problematic cpp tokenizations Johannes Sixt via GitGitGadget
@ 2021-10-08 19:09   ` Johannes Sixt via GitGitGadget
  2021-10-08 19:09   ` [PATCH v2 4/5] userdiff-cpp: permit the digit-separating single-quote in numbers Johannes Sixt via GitGitGadget
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Johannes Sixt via GitGitGadget @ 2021-10-08 19:09 UTC (permalink / raw)
  To: git; +Cc: Ævar Arnfjörð Bjarmason, Johannes Sixt,
	Johannes Sixt

From: Johannes Sixt <j6t@kdbg.org>

Generally, word regexes can be written such that they match tokens
liberally and need not model the actual syntax, because it can be
assumed that the regex will only be applied to syntactically correct
text.

The regex for cpp (C/C++) is too liberal, though. It regards these
sequences as single tokens:

   1+2
   1.5-e+2+f

and the following amalgams as one token:

   .l      as in str.length
   .f      as in str.find
   .e      as in str.erase

Tighten the regex in the following way:

- Accept + and - only in one position: directly after the exponent
  marker. A leading + or - is no longer regarded as the sign of a
  number; it is handled by the catch-all that is not visible in the
  driver's regex.

- Accept a leading decimal point only when it is followed by a digit.

For readability, factor hexadecimal and binary numbers into a term of
their own.

As a drive-by fix, floating point numbers such as 12E5 (with upper-case
E) are no longer split into two tokens.

Signed-off-by: Johannes Sixt <j6t@kdbg.org>
---
 t/t4034/cpp/expect | 16 ++++++++--------
 userdiff.c         |  8 +++++++-
 2 files changed, 15 insertions(+), 9 deletions(-)
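
A minimal sketch of the effect described above, not part of the patch:
the C fragment below feeds the number terms from before and after this
change to POSIX regexec(). The tokenize() helper and the macro names are
hypothetical, the operator alternatives are left out, and
"[^[:space:]]" stands in for the catch-all, so this only approximates
what the word-diff machinery does.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Number term of the cpp word regex before the patch (copied from the diff). */
#define OLD_NUMBERS "[-+0-9.e]+[fFlL]?|0[xXbB]?[0-9a-fA-F]+[lLuU]*"

/* Number terms after the patch (copied from the diff). */
#define NEW_NUMBERS \
	"[0-9][0-9.]*([Ee][-+]?[0-9]+)?[fFlLuU]*" \
	"|0[xXbB][0-9a-fA-F]+[lLuU]*" \
	"|\\.[0-9]+([Ee][-+]?[0-9]+)?[fFlL]?"

/* Identifiers first, then numbers, then a simplified catch-all. */
#define WORD_REGEX(numbers) \
	"[a-zA-Z_][a-zA-Z0-9_]*|" numbers "|[^[:space:]]"

/*
 * Hypothetical helper: split `text` into tokens by matching `pattern`
 * repeatedly and write the tokens to `out`, separated by single spaces.
 * `outsz` must be at least 1.
 */
static void tokenize(const char *pattern, const char *text,
		     char *out, size_t outsz)
{
	regex_t re;
	regmatch_t m;
	size_t used = 0;
	int rc = regcomp(&re, pattern, REG_EXTENDED);

	assert(rc == 0); /* the fixed patterns above are expected to compile */
	(void)rc;
	out[0] = '\0';
	while (*text && regexec(&re, text, 1, &m, 0) == 0 &&
	       m.rm_eo > m.rm_so) {
		int len = (int)(m.rm_eo - m.rm_so);
		int n = snprintf(out + used, outsz - used, "%s%.*s",
				 used ? " " : "", len, text + m.rm_so);

		if (n < 0 || (size_t)n >= outsz - used) {
			out[used] = '\0'; /* buffer full: drop the partial token */
			break;
		}
		used += (size_t)n;
		text += m.rm_eo; /* continue right after this token */
	}
	regfree(&re);
}
```

With these definitions, tokenize(WORD_REGEX(OLD_NUMBERS), "1.5-e+2+f",
buf, sizeof(buf)) leaves the whole expression in buf as one token,
whereas the NEW_NUMBERS variant yields "1.5 - e + 2 + f". Likewise,
"str.erase" goes from "str .e rase" to "str . erase".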

diff --git a/t/t4034/cpp/expect b/t/t4034/cpp/expect
index 63e53a61e62..46c9460a968 100644
--- a/t/t4034/cpp/expect
+++ b/t/t4034/cpp/expect
@@ -3,24 +3,24 @@
 <BOLD>--- a/pre<RESET>
 <BOLD>+++ b/post<RESET>
 <CYAN>@@ -1,30 +1,30 @@<RESET>
-Foo() : x(0<RED>&&1<RESET><GREEN>&42<RESET>) { <RED>foo0<RESET><GREEN>bar<RESET>(x<RED>.f<RESET><GREEN>.F<RESET>ind); }
+Foo() : x(0<RED>&&1<RESET><GREEN>&42<RESET>) { <RED>foo0<RESET><GREEN>bar<RESET>(x.<RED>find<RESET><GREEN>Find<RESET>); }
 cout<<"Hello World<RED>!<RESET><GREEN>?<RESET>\n"<<endl;
-<GREEN>(<RESET>1 <RED>-1e10<RESET><GREEN>+1e10<RESET> 0xabcdef<GREEN>)<RESET> '<RED>x<RESET><GREEN>y<RESET>'
+<GREEN>(<RESET>1 <RED>-<RESET><GREEN>+<RESET>1e10 0xabcdef<GREEN>)<RESET> '<RED>x<RESET><GREEN>y<RESET>'
 // long double<RESET>
 <RED>3.141592653e-10l<RESET><GREEN>3.141592654e+10l<RESET>
 // float<RESET>
-120<RED>E5f<RESET><GREEN>E6f<RESET>
+<RED>120E5f<RESET><GREEN>120E6f<RESET>
 // hex<RESET>
-<RED>0xdeadbeaf+8<RESET><GREEN>0xdeadBeaf+7<RESET>ULL
+<RED>0xdeadbeaf<RESET><GREEN>0xdeadBeaf<RESET>+<RED>8ULL<RESET><GREEN>7ULL<RESET>
 // octal<RESET>
 <RED>01234567<RESET><GREEN>01234560<RESET>
 // binary<RESET>
 <RED>0b1000<RESET><GREEN>0b1100<RESET>+e1
 // expression<RESET>
-<RED>1.5-e+2+f<RESET><GREEN>1.5-e+3+f<RESET>
+1.5-e+<RED>2<RESET><GREEN>3<RESET>+f
 // another one<RESET>
-str<RED>.e+65<RESET><GREEN>.e+75<RESET>
-[a] b<RED>-><RESET><GREEN>->*<RESET>v d<RED>.e<RESET><GREEN>.*e<RESET>
+str.e+<RED>65<RESET><GREEN>75<RESET>
+[a] b<RED>-><RESET><GREEN>->*<RESET>v d<RED>.<RESET><GREEN>.*<RESET>e
 <GREEN>~<RESET>!a <GREEN>!<RESET>~b c<RED>++<RESET><GREEN>+<RESET> d<RED>--<RESET><GREEN>-<RESET> e*<GREEN>*<RESET>f g<RED>&<RESET><GREEN>&&<RESET>h
 a<RED>*<RESET><GREEN>*=<RESET>b c<RED>/<RESET><GREEN>/=<RESET>d e<RED>%<RESET><GREEN>%=<RESET>f
 a<RED>+<RESET><GREEN>++<RESET>b c<RED>-<RESET><GREEN>--<RESET>d
@@ -30,6 +30,6 @@ a<RED>==<RESET><GREEN>!=<RESET>b c<RED>!=<RESET><GREEN>=<RESET>d
 a<RED>^<RESET><GREEN>^=<RESET>b c<RED>|<RESET><GREEN>|=<RESET>d e<RED>&&<RESET><GREEN>&=<RESET>f
 a<RED>||<RESET><GREEN>|<RESET>b
 a?<GREEN>:<RESET>b
-a<RED>=<RESET><GREEN>==<RESET>b c<RED>+=<RESET><GREEN>+<RESET>d <RED>e-=f<RESET><GREEN>e-f<RESET> g<RED>*=<RESET><GREEN>*<RESET>h i<RED>/=<RESET><GREEN>/<RESET>j k<RED>%=<RESET><GREEN>%<RESET>l m<RED><<=<RESET><GREEN><<<RESET>n o<RED>>>=<RESET><GREEN>>><RESET>p q<RED>&=<RESET><GREEN>&<RESET>r s<RED>^=<RESET><GREEN>^<RESET>t u<RED>|=<RESET><GREEN>|<RESET>v
+a<RED>=<RESET><GREEN>==<RESET>b c<RED>+=<RESET><GREEN>+<RESET>d e<RED>-=<RESET><GREEN>-<RESET>f g<RED>*=<RESET><GREEN>*<RESET>h i<RED>/=<RESET><GREEN>/<RESET>j k<RED>%=<RESET><GREEN>%<RESET>l m<RED><<=<RESET><GREEN><<<RESET>n o<RED>>>=<RESET><GREEN>>><RESET>p q<RED>&=<RESET><GREEN>&<RESET>r s<RED>^=<RESET><GREEN>^<RESET>t u<RED>|=<RESET><GREEN>|<RESET>v
 a,b<RESET>
 a<RED>::<RESET><GREEN>:<RESET>b
diff --git a/userdiff.c b/userdiff.c
index d9b2ba752f0..ce2a9230703 100644
--- a/userdiff.c
+++ b/userdiff.c
@@ -54,8 +54,14 @@ PATTERNS("cpp",
 	 /* functions/methods, variables, and compounds at top level */
 	 "^((::[[:space:]]*)?[A-Za-z_].*)$",
 	 /* -- */
+	 /* identifiers and keywords */
 	 "[a-zA-Z_][a-zA-Z0-9_]*"
-	 "|[-+0-9.e]+[fFlL]?|0[xXbB]?[0-9a-fA-F]+[lLuU]*"
+	 /* decimal and octal integers as well as floatingpoint numbers */
+	 "|[0-9][0-9.]*([Ee][-+]?[0-9]+)?[fFlLuU]*"
+	 /* hexadecimal and binary integers */
+	 "|0[xXbB][0-9a-fA-F]+[lLuU]*"
+	 /* floatingpoint numbers that begin with a decimal point */
+	 "|\\.[0-9]+([Ee][-+]?[0-9]+)?[fFlL]?"
 	 "|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->\\*?|\\.\\*"),
 PATTERNS("csharp",
 	 /* Keywords */
-- 
gitgitgadget



* [PATCH v2 4/5] userdiff-cpp: permit the digit-separating single-quote in numbers
  2021-10-08 19:09 ` [PATCH v2 0/5] " Johannes Sixt via GitGitGadget
                     ` (2 preceding siblings ...)
  2021-10-08 19:09   ` [PATCH v2 3/5] userdiff-cpp: tighten word regex Johannes Sixt via GitGitGadget
@ 2021-10-08 19:09   ` Johannes Sixt via GitGitGadget
  2021-10-08 19:09   ` [PATCH v2 5/5] userdiff-cpp: learn the C++ spaceship operator Johannes Sixt via GitGitGadget
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Johannes Sixt via GitGitGadget @ 2021-10-08 19:09 UTC (permalink / raw)
  To: git; +Cc: Ævar Arnfjörð Bjarmason, Johannes Sixt,
	Johannes Sixt

From: Johannes Sixt <j6t@kdbg.org>

Since C++14, the single-quote can be used as a digit separator:

   3.141'592'654
   1'000'000
   0xdead'beaf

Make it known to the word regex of the cpp driver, so that numbers are
not split into separate tokens at the single-quotes.

Signed-off-by: Johannes Sixt <j6t@kdbg.org>
---
 t/t4034/cpp/expect | 10 +++++-----
 t/t4034/cpp/post   |  8 ++++----
 t/t4034/cpp/pre    |  8 ++++----
 userdiff.c         |  6 +++---
 4 files changed, 16 insertions(+), 16 deletions(-)
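
A small sketch, not part of the patch, of what the added single-quote
changes for the first token found in a number. The two number terms are
copied from the diff; first_token_len() and the macro names are
hypothetical, and only these two terms are tried, without identifiers,
operators or the catch-all.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Decimal/octal/floating point term before and after the patch. */
#define DEC_OLD "[0-9][0-9.]*([Ee][-+]?[0-9]+)?[fFlLuU]*"
#define DEC_NEW "[0-9][0-9.']*([Ee][-+]?[0-9]+)?[fFlLuU]*"

/* Hexadecimal/binary term before and after the patch. */
#define HEX_OLD "0[xXbB][0-9a-fA-F]+[lLuU]*"
#define HEX_NEW "0[xXbB][0-9a-fA-F']+[lLuU]*"

/*
 * Hypothetical helper: length of the token that `pattern` matches at
 * the very start of `text`, or 0 if nothing matches there.
 */
static size_t first_token_len(const char *pattern, const char *text)
{
	regex_t re;
	regmatch_t m;
	size_t len = 0;
	int rc = regcomp(&re, pattern, REG_EXTENDED);

	assert(rc == 0); /* the fixed patterns above are expected to compile */
	(void)rc;
	if (regexec(&re, text, 1, &m, 0) == 0 && m.rm_so == 0)
		len = (size_t)m.rm_eo;
	regfree(&re);
	return len;
}
```

For "1'000'000", first_token_len(DEC_OLD "|" HEX_OLD, ...) stops after
the leading "1", while the DEC_NEW "|" HEX_NEW variant covers all nine
characters. Because the decimal term only requires a leading digit, the
same sketch also takes "1960'" as the first token of free text such as
"1960's".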

diff --git a/t/t4034/cpp/expect b/t/t4034/cpp/expect
index 46c9460a968..a3a234f5461 100644
--- a/t/t4034/cpp/expect
+++ b/t/t4034/cpp/expect
@@ -1,5 +1,5 @@
 <BOLD>diff --git a/pre b/post<RESET>
-<BOLD>index 1229cdb..3feae6f 100644<RESET>
+<BOLD>index 60f3640..f6fbf7b 100644<RESET>
 <BOLD>--- a/pre<RESET>
 <BOLD>+++ b/post<RESET>
 <CYAN>@@ -1,30 +1,30 @@<RESET>
@@ -7,15 +7,15 @@ Foo() : x(0<RED>&&1<RESET><GREEN>&42<RESET>) { <RED>foo0<RESET><GREEN>bar<RESET>
 cout<<"Hello World<RED>!<RESET><GREEN>?<RESET>\n"<<endl;
 <GREEN>(<RESET>1 <RED>-<RESET><GREEN>+<RESET>1e10 0xabcdef<GREEN>)<RESET> '<RED>x<RESET><GREEN>y<RESET>'
 // long double<RESET>
-<RED>3.141592653e-10l<RESET><GREEN>3.141592654e+10l<RESET>
+<RED>3.141'592'653e-10l<RESET><GREEN>3.141'592'654e+10l<RESET>
 // float<RESET>
 <RED>120E5f<RESET><GREEN>120E6f<RESET>
 // hex<RESET>
-<RED>0xdeadbeaf<RESET><GREEN>0xdeadBeaf<RESET>+<RED>8ULL<RESET><GREEN>7ULL<RESET>
+<RED>0xdead'beaf<RESET><GREEN>0xdead'Beaf<RESET>+<RED>8ULL<RESET><GREEN>7ULL<RESET>
 // octal<RESET>
-<RED>01234567<RESET><GREEN>01234560<RESET>
+<RED>0123'4567<RESET><GREEN>0123'4560<RESET>
 // binary<RESET>
-<RED>0b1000<RESET><GREEN>0b1100<RESET>+e1
+<RED>0b10'00<RESET><GREEN>0b11'00<RESET>+e1
 // expression<RESET>
 1.5-e+<RED>2<RESET><GREEN>3<RESET>+f
 // another one<RESET>
diff --git a/t/t4034/cpp/post b/t/t4034/cpp/post
index 3feae6f430f..f6fbf7bc04c 100644
--- a/t/t4034/cpp/post
+++ b/t/t4034/cpp/post
@@ -2,15 +2,15 @@ Foo() : x(0&42) { bar(x.Find); }
 cout<<"Hello World?\n"<<endl;
 (1 +1e10 0xabcdef) 'y'
 // long double
-3.141592654e+10l
+3.141'592'654e+10l
 // float
 120E6f
 // hex
-0xdeadBeaf+7ULL
+0xdead'Beaf+7ULL
 // octal
-01234560
+0123'4560
 // binary
-0b1100+e1
+0b11'00+e1
 // expression
 1.5-e+3+f
 // another one
diff --git a/t/t4034/cpp/pre b/t/t4034/cpp/pre
index 1229cdb59d1..60f3640d773 100644
--- a/t/t4034/cpp/pre
+++ b/t/t4034/cpp/pre
@@ -2,15 +2,15 @@ Foo():x(0&&1){ foo0( x.find); }
 cout<<"Hello World!\n"<<endl;
 1 -1e10 0xabcdef 'x'
 // long double
-3.141592653e-10l
+3.141'592'653e-10l
 // float
 120E5f
 // hex
-0xdeadbeaf+8ULL
+0xdead'beaf+8ULL
 // octal
-01234567
+0123'4567
 // binary
-0b1000+e1
+0b10'00+e1
 // expression
 1.5-e+2+f
 // another one
diff --git a/userdiff.c b/userdiff.c
index ce2a9230703..1b640c7df79 100644
--- a/userdiff.c
+++ b/userdiff.c
@@ -57,11 +57,11 @@ PATTERNS("cpp",
 	 /* identifiers and keywords */
 	 "[a-zA-Z_][a-zA-Z0-9_]*"
 	 /* decimal and octal integers as well as floatingpoint numbers */
-	 "|[0-9][0-9.]*([Ee][-+]?[0-9]+)?[fFlLuU]*"
+	 "|[0-9][0-9.']*([Ee][-+]?[0-9]+)?[fFlLuU]*"
 	 /* hexadecimal and binary integers */
-	 "|0[xXbB][0-9a-fA-F]+[lLuU]*"
+	 "|0[xXbB][0-9a-fA-F']+[lLuU]*"
 	 /* floatingpoint numbers that begin with a decimal point */
-	 "|\\.[0-9]+([Ee][-+]?[0-9]+)?[fFlL]?"
+	 "|\\.[0-9']+([Ee][-+]?[0-9]+)?[fFlL]?"
 	 "|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->\\*?|\\.\\*"),
 PATTERNS("csharp",
 	 /* Keywords */
-- 
gitgitgadget



* [PATCH v2 5/5] userdiff-cpp: learn the C++ spaceship operator
  2021-10-08 19:09 ` [PATCH v2 0/5] " Johannes Sixt via GitGitGadget
                     ` (3 preceding siblings ...)
  2021-10-08 19:09   ` [PATCH v2 4/5] userdiff-cpp: permit the digit-separating single-quote in numbers Johannes Sixt via GitGitGadget
@ 2021-10-08 19:09   ` Johannes Sixt via GitGitGadget
  2021-10-08 20:07   ` [PATCH v2 0/5] Fun with cpp word regex Ævar Arnfjörð Bjarmason
  2021-10-10 17:02   ` [PATCH v3 0/6] " Johannes Sixt via GitGitGadget
  6 siblings, 0 replies; 24+ messages in thread
From: Johannes Sixt via GitGitGadget @ 2021-10-08 19:09 UTC (permalink / raw)
  To: git; +Cc: Ævar Arnfjörð Bjarmason, Johannes Sixt,
	Johannes Sixt

From: Johannes Sixt <j6t@kdbg.org>

Since C++20, the language has a generalized comparison operator <=>.
Teach the cpp driver not to separate it into <= and > tokens.

Signed-off-by: Johannes Sixt <j6t@kdbg.org>
---
 t/t4034/cpp/expect | 4 ++--
 t/t4034/cpp/post   | 2 +-
 t/t4034/cpp/pre    | 2 +-
 userdiff.c         | 2 +-
 4 files changed, 5 insertions(+), 5 deletions(-)
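
A short sketch, not part of the patch, of why appending "|<=>" at the
end of the operator alternation is enough: POSIX regexec() returns the
longest match at the leftmost position, so "<=>" wins over the "<="
found by an earlier alternative. The operator term is copied from the
diff; count_tokens(), the macro names and the simplified catch-all are
assumptions made for illustration.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Operator term before the patch (copied from the diff). */
#define OPS_OLD \
	"[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->\\*?|\\.\\*"
/* After the patch, the spaceship operator is appended. */
#define OPS_NEW OPS_OLD "|<=>"
/* Simplified stand-in for the catch-all: any single non-space character. */
#define CATCH_ALL "|[^[:space:]]"

/* Hypothetical helper: number of tokens `pattern` finds in `text`. */
static size_t count_tokens(const char *pattern, const char *text)
{
	regex_t re;
	regmatch_t m;
	size_t n = 0;
	int rc = regcomp(&re, pattern, REG_EXTENDED);

	assert(rc == 0); /* the fixed patterns above are expected to compile */
	(void)rc;
	while (*text && regexec(&re, text, 1, &m, 0) == 0 &&
	       m.rm_eo > m.rm_so) {
		n++;
		text += m.rm_eo; /* continue right after this token */
	}
	regfree(&re);
	return n;
}
```

count_tokens(OPS_OLD CATCH_ALL, "i<=>j") counts four tokens (i, <=, >,
j), and the OPS_NEW variant counts three (i, <=>, j); "a<=b" is three
tokens either way.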

diff --git a/t/t4034/cpp/expect b/t/t4034/cpp/expect
index a3a234f5461..bf3cd2abc74 100644
--- a/t/t4034/cpp/expect
+++ b/t/t4034/cpp/expect
@@ -1,5 +1,5 @@
 <BOLD>diff --git a/pre b/post<RESET>
-<BOLD>index 60f3640..f6fbf7b 100644<RESET>
+<BOLD>index 144cd98..244f79c 100644<RESET>
 <BOLD>--- a/pre<RESET>
 <BOLD>+++ b/post<RESET>
 <CYAN>@@ -1,30 +1,30 @@<RESET>
@@ -25,7 +25,7 @@ str.e+<RED>65<RESET><GREEN>75<RESET>
 a<RED>*<RESET><GREEN>*=<RESET>b c<RED>/<RESET><GREEN>/=<RESET>d e<RED>%<RESET><GREEN>%=<RESET>f
 a<RED>+<RESET><GREEN>++<RESET>b c<RED>-<RESET><GREEN>--<RESET>d
 a<RED><<<RESET><GREEN><<=<RESET>b c<RED>>><RESET><GREEN>>>=<RESET>d
-a<RED><<RESET><GREEN><=<RESET>b c<RED><=<RESET><GREEN><<RESET>d e<RED>><RESET><GREEN>>=<RESET>f g<RED>>=<RESET><GREEN>><RESET>h
+a<RED><<RESET><GREEN><=<RESET>b c<RED><=<RESET><GREEN><<RESET>d e<RED>><RESET><GREEN>>=<RESET>f g<RED>>=<RESET><GREEN>><RESET>h i<RED><=<RESET><GREEN><=><RESET>j
 a<RED>==<RESET><GREEN>!=<RESET>b c<RED>!=<RESET><GREEN>=<RESET>d
 a<RED>^<RESET><GREEN>^=<RESET>b c<RED>|<RESET><GREEN>|=<RESET>d e<RED>&&<RESET><GREEN>&=<RESET>f
 a<RED>||<RESET><GREEN>|<RESET>b
diff --git a/t/t4034/cpp/post b/t/t4034/cpp/post
index f6fbf7bc04c..244f79c9900 100644
--- a/t/t4034/cpp/post
+++ b/t/t4034/cpp/post
@@ -20,7 +20,7 @@ str.e+75
 a*=b c/=d e%=f
 a++b c--d
 a<<=b c>>=d
-a<=b c<d e>=f g>h
+a<=b c<d e>=f g>h i<=>j
 a!=b c=d
 a^=b c|=d e&=f
 a|b
diff --git a/t/t4034/cpp/pre b/t/t4034/cpp/pre
index 60f3640d773..144cd980d6b 100644
--- a/t/t4034/cpp/pre
+++ b/t/t4034/cpp/pre
@@ -20,7 +20,7 @@ str.e+65
 a*b c/d e%f
 a+b c-d
 a<<b c>>d
-a<b c<=d e>f g>=h
+a<b c<=d e>f g>=h i<=j
 a==b c!=d
 a^b c|d e&&f
 a||b
diff --git a/userdiff.c b/userdiff.c
index 1b640c7df79..13cec0b48db 100644
--- a/userdiff.c
+++ b/userdiff.c
@@ -62,7 +62,7 @@ PATTERNS("cpp",
 	 "|0[xXbB][0-9a-fA-F']+[lLuU]*"
 	 /* floatingpoint numbers that begin with a decimal point */
 	 "|\\.[0-9']+([Ee][-+]?[0-9]+)?[fFlL]?"
-	 "|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->\\*?|\\.\\*"),
+	 "|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->\\*?|\\.\\*|<=>"),
 PATTERNS("csharp",
 	 /* Keywords */
 	 "!^[ \t]*(do|while|for|if|else|instanceof|new|return|switch|case|throw|catch|using)\n"
-- 
gitgitgadget


* Re: [PATCH v2 0/5] Fun with cpp word regex
  2021-10-08 19:09 ` [PATCH v2 0/5] " Johannes Sixt via GitGitGadget
                     ` (4 preceding siblings ...)
  2021-10-08 19:09   ` [PATCH v2 5/5] userdiff-cpp: learn the C++ spaceship operator Johannes Sixt via GitGitGadget
@ 2021-10-08 20:07   ` Ævar Arnfjörð Bjarmason
  2021-10-08 22:11     ` Johannes Sixt
  2021-10-10 17:02   ` [PATCH v3 0/6] " Johannes Sixt via GitGitGadget
  6 siblings, 1 reply; 24+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-10-08 20:07 UTC (permalink / raw)
  To: Johannes Sixt via GitGitGadget; +Cc: git, Johannes Sixt


On Fri, Oct 08 2021, Johannes Sixt via GitGitGadget wrote:

> The cpp word regex driver is a bit too loose and can match too much text
> where the intent is to match only a number.
>
> The first patch makes the cpp word regex tests more effective.
>
> The second patch adds problematic test cases. The third patch fixes these
> problems.
>
> The final two patches add support for digit separators and the spaceship
> operator <=> (generalized comparison operator).
>
> I left out support for hexadecimal floating point constants because that
> would require to tighten the regex even more to avoid that entire
> expressions are treated as single tokens.
>
> Changes since V1:
>
>  * Tests, tests, tests.
>  * Polished commit messages.

I've read this much improved v2 over, thanks a lot. This looks much
better.

Just some general comments, in my mind this is more than good enough
already and doesn't require a re-roll, just food for thought:

 * I wonder if it isn't time to split a "c" driver off from "cpp";
   e.g. git.git's .gitattributes has "cpp" for *.[ch] files, but C++
   keeps adding more syntax sugar that C does not have.

   So e.g. if you use "<=>" after this series we'll tokenize it
   differently in *.c files, but it's a C++-only operator, on the other
   hand probably nobody cares that much...

 * I found myself back-porting some of your tests (manually mostly),
   maybe you disagree, but in cases like 123'123, <=> etc. I'd find it
   easier to follow if we first added the test data, and then the
   changed behavior.

   Because after all, we're going to change how we highlight existing
   data, so testing for that would be informative.

 * This pre-dates your much improved tests, but these test files could
   really use some test comments, as in:

   /* Now that we're going to understand the "'" character somehow, will any of this change? */
   /* We haven't written code like this since the 1960's ... */
   /* Run & free */

   I.e. we don't just highlight code the compiler likes to eat, but also
   comments. So particularly for smaller tokens that also occur in
   natural language like "'" and "&" are we getting expected results?

   It looked like it from skimming your changes, i.e. the ' rule is
   checked by surrounding digits, just might be handy to test it ... :)

> Johannes Sixt (5):
>   t4034/cpp: actually test that operator tokens are not split
>   t4034: add tests showing problematic cpp tokenizations
>   userdiff-cpp: tighten word regex
>   userdiff-cpp: permit the digit-separating single-quote in numbers
>   userdiff-cpp: learn the C++ spaceship operator
>
>  t/t4034/cpp/expect | 63 +++++++++++++++++++++++-----------------------
>  t/t4034/cpp/post   | 47 +++++++++++++++++++++-------------
>  t/t4034/cpp/pre    | 41 +++++++++++++++++++-----------
>  userdiff.c         | 10 ++++++--
>  4 files changed, 94 insertions(+), 67 deletions(-)
>
>
> base-commit: 225bc32a989d7a22fa6addafd4ce7dcd04675dbf
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1054%2Fj6t%2Ffun-with-cpp-word-regex-v2
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1054/j6t/fun-with-cpp-word-regex-v2
> Pull-Request: https://github.com/gitgitgadget/git/pull/1054
>
> Range-diff vs v1:
>
>  -:  ----------- > 1:  dd9f82ba712 t4034/cpp: actually test that operator tokens are not split
>  -:  ----------- > 2:  5a84fc9cf71 t4034: add tests showing problematic cpp tokenizations
>  1:  a47ab9ba20e ! 3:  d4ebe45fddc userdiff: tighten cpp word regex
>      @@ Metadata
>       Author: Johannes Sixt <j6t@kdbg.org>
>       
>        ## Commit message ##
>      -    userdiff: tighten cpp word regex
>      +    userdiff-cpp: tighten word regex
>       
>           Generally, word regex can be written such that they match tokens
>           liberally and need not model the actual syntax because it can be assumed
>      @@ Commit message
>       
>              .l      as in str.length
>              .f      as in str.find
>      +       .e      as in str.erase
>       
>           Tighten the regex in the following way:
>       
>      @@ Commit message
>       
>           For readability, factor hex- and binary numbers into an own term.
>       
>      -    As a drive-by, this fixes that floatingpoint numbers such as 12E5
>      +    As a drive-by, this fixes that floating point numbers such as 12E5
>           (with upper-case E) were split into two tokens.
>       
>           Signed-off-by: Johannes Sixt <j6t@kdbg.org>
>       
>      + ## t/t4034/cpp/expect ##
>      +@@
>      + <BOLD>--- a/pre<RESET>
>      + <BOLD>+++ b/post<RESET>
>      + <CYAN>@@ -1,30 +1,30 @@<RESET>
>      +-Foo() : x(0<RED>&&1<RESET><GREEN>&42<RESET>) { <RED>foo0<RESET><GREEN>bar<RESET>(x<RED>.f<RESET><GREEN>.F<RESET>ind); }
>      ++Foo() : x(0<RED>&&1<RESET><GREEN>&42<RESET>) { <RED>foo0<RESET><GREEN>bar<RESET>(x.<RED>find<RESET><GREEN>Find<RESET>); }
>      + cout<<"Hello World<RED>!<RESET><GREEN>?<RESET>\n"<<endl;
>      +-<GREEN>(<RESET>1 <RED>-1e10<RESET><GREEN>+1e10<RESET> 0xabcdef<GREEN>)<RESET> '<RED>x<RESET><GREEN>y<RESET>'
>      ++<GREEN>(<RESET>1 <RED>-<RESET><GREEN>+<RESET>1e10 0xabcdef<GREEN>)<RESET> '<RED>x<RESET><GREEN>y<RESET>'
>      + // long double<RESET>
>      + <RED>3.141592653e-10l<RESET><GREEN>3.141592654e+10l<RESET>
>      + // float<RESET>
>      +-120<RED>E5f<RESET><GREEN>E6f<RESET>
>      ++<RED>120E5f<RESET><GREEN>120E6f<RESET>
>      + // hex<RESET>
>      +-<RED>0xdeadbeaf+8<RESET><GREEN>0xdeadBeaf+7<RESET>ULL
>      ++<RED>0xdeadbeaf<RESET><GREEN>0xdeadBeaf<RESET>+<RED>8ULL<RESET><GREEN>7ULL<RESET>
>      + // octal<RESET>
>      + <RED>01234567<RESET><GREEN>01234560<RESET>
>      + // binary<RESET>
>      + <RED>0b1000<RESET><GREEN>0b1100<RESET>+e1
>      + // expression<RESET>
>      +-<RED>1.5-e+2+f<RESET><GREEN>1.5-e+3+f<RESET>
>      ++1.5-e+<RED>2<RESET><GREEN>3<RESET>+f
>      + // another one<RESET>
>      +-str<RED>.e+65<RESET><GREEN>.e+75<RESET>
>      +-[a] b<RED>-><RESET><GREEN>->*<RESET>v d<RED>.e<RESET><GREEN>.*e<RESET>
>      ++str.e+<RED>65<RESET><GREEN>75<RESET>
>      ++[a] b<RED>-><RESET><GREEN>->*<RESET>v d<RED>.<RESET><GREEN>.*<RESET>e
>      + <GREEN>~<RESET>!a <GREEN>!<RESET>~b c<RED>++<RESET><GREEN>+<RESET> d<RED>--<RESET><GREEN>-<RESET> e*<GREEN>*<RESET>f g<RED>&<RESET><GREEN>&&<RESET>h
>      + a<RED>*<RESET><GREEN>*=<RESET>b c<RED>/<RESET><GREEN>/=<RESET>d e<RED>%<RESET><GREEN>%=<RESET>f
>      + a<RED>+<RESET><GREEN>++<RESET>b c<RED>-<RESET><GREEN>--<RESET>d
>      +@@ t/t4034/cpp/expect: a<RED>==<RESET><GREEN>!=<RESET>b c<RED>!=<RESET><GREEN>=<RESET>d
>      + a<RED>^<RESET><GREEN>^=<RESET>b c<RED>|<RESET><GREEN>|=<RESET>d e<RED>&&<RESET><GREEN>&=<RESET>f
>      + a<RED>||<RESET><GREEN>|<RESET>b
>      + a?<GREEN>:<RESET>b
>      +-a<RED>=<RESET><GREEN>==<RESET>b c<RED>+=<RESET><GREEN>+<RESET>d <RED>e-=f<RESET><GREEN>e-f<RESET> g<RED>*=<RESET><GREEN>*<RESET>h i<RED>/=<RESET><GREEN>/<RESET>j k<RED>%=<RESET><GREEN>%<RESET>l m<RED><<=<RESET><GREEN><<<RESET>n o<RED>>>=<RESET><GREEN>>><RESET>p q<RED>&=<RESET><GREEN>&<RESET>r s<RED>^=<RESET><GREEN>^<RESET>t u<RED>|=<RESET><GREEN>|<RESET>v
>      ++a<RED>=<RESET><GREEN>==<RESET>b c<RED>+=<RESET><GREEN>+<RESET>d e<RED>-=<RESET><GREEN>-<RESET>f g<RED>*=<RESET><GREEN>*<RESET>h i<RED>/=<RESET><GREEN>/<RESET>j k<RED>%=<RESET><GREEN>%<RESET>l m<RED><<=<RESET><GREEN><<<RESET>n o<RED>>>=<RESET><GREEN>>><RESET>p q<RED>&=<RESET><GREEN>&<RESET>r s<RED>^=<RESET><GREEN>^<RESET>t u<RED>|=<RESET><GREEN>|<RESET>v
>      + a,b<RESET>
>      + a<RED>::<RESET><GREEN>:<RESET>b
>      +
>        ## userdiff.c ##
>       @@ userdiff.c: PATTERNS("cpp",
>        	 /* functions/methods, variables, and compounds at top level */
>  2:  9d1c05f5f41 ! 4:  dd75d19cee9 userdiff: permit the digit-separating single-quote in numbers
>      @@ Metadata
>       Author: Johannes Sixt <j6t@kdbg.org>
>       
>        ## Commit message ##
>      -    userdiff: permit the digit-separating single-quote in numbers
>      +    userdiff-cpp: permit the digit-separating single-quote in numbers
>       
>           Since C++14, the single-quote can be used as a digit separator:
>       
>      @@ Commit message
>              1'000'000
>              0xdead'beaf
>       
>      -    Make it known to the word regex of the cpp driver, so that numbers are not
>      -    split into separate tokens at the single-quotes.
>      +    Make it known to the word regex of the cpp driver, so that numbers are
>      +    not split into separate tokens at the single-quotes.
>       
>           Signed-off-by: Johannes Sixt <j6t@kdbg.org>
>       
>      + ## t/t4034/cpp/expect ##
>      +@@
>      + <BOLD>diff --git a/pre b/post<RESET>
>      +-<BOLD>index 1229cdb..3feae6f 100644<RESET>
>      ++<BOLD>index 60f3640..f6fbf7b 100644<RESET>
>      + <BOLD>--- a/pre<RESET>
>      + <BOLD>+++ b/post<RESET>
>      + <CYAN>@@ -1,30 +1,30 @@<RESET>
>      +@@ t/t4034/cpp/expect: Foo() : x(0<RED>&&1<RESET><GREEN>&42<RESET>) { <RED>foo0<RESET><GREEN>bar<RESET>
>      + cout<<"Hello World<RED>!<RESET><GREEN>?<RESET>\n"<<endl;
>      + <GREEN>(<RESET>1 <RED>-<RESET><GREEN>+<RESET>1e10 0xabcdef<GREEN>)<RESET> '<RED>x<RESET><GREEN>y<RESET>'
>      + // long double<RESET>
>      +-<RED>3.141592653e-10l<RESET><GREEN>3.141592654e+10l<RESET>
>      ++<RED>3.141'592'653e-10l<RESET><GREEN>3.141'592'654e+10l<RESET>
>      + // float<RESET>
>      + <RED>120E5f<RESET><GREEN>120E6f<RESET>
>      + // hex<RESET>
>      +-<RED>0xdeadbeaf<RESET><GREEN>0xdeadBeaf<RESET>+<RED>8ULL<RESET><GREEN>7ULL<RESET>
>      ++<RED>0xdead'beaf<RESET><GREEN>0xdead'Beaf<RESET>+<RED>8ULL<RESET><GREEN>7ULL<RESET>
>      + // octal<RESET>
>      +-<RED>01234567<RESET><GREEN>01234560<RESET>
>      ++<RED>0123'4567<RESET><GREEN>0123'4560<RESET>
>      + // binary<RESET>
>      +-<RED>0b1000<RESET><GREEN>0b1100<RESET>+e1
>      ++<RED>0b10'00<RESET><GREEN>0b11'00<RESET>+e1
>      + // expression<RESET>
>      + 1.5-e+<RED>2<RESET><GREEN>3<RESET>+f
>      + // another one<RESET>
>      +
>      + ## t/t4034/cpp/post ##
>      +@@ t/t4034/cpp/post: Foo() : x(0&42) { bar(x.Find); }
>      + cout<<"Hello World?\n"<<endl;
>      + (1 +1e10 0xabcdef) 'y'
>      + // long double
>      +-3.141592654e+10l
>      ++3.141'592'654e+10l
>      + // float
>      + 120E6f
>      + // hex
>      +-0xdeadBeaf+7ULL
>      ++0xdead'Beaf+7ULL
>      + // octal
>      +-01234560
>      ++0123'4560
>      + // binary
>      +-0b1100+e1
>      ++0b11'00+e1
>      + // expression
>      + 1.5-e+3+f
>      + // another one
>      +
>      + ## t/t4034/cpp/pre ##
>      +@@ t/t4034/cpp/pre: Foo():x(0&&1){ foo0( x.find); }
>      + cout<<"Hello World!\n"<<endl;
>      + 1 -1e10 0xabcdef 'x'
>      + // long double
>      +-3.141592653e-10l
>      ++3.141'592'653e-10l
>      + // float
>      + 120E5f
>      + // hex
>      +-0xdeadbeaf+8ULL
>      ++0xdead'beaf+8ULL
>      + // octal
>      +-01234567
>      ++0123'4567
>      + // binary
>      +-0b1000+e1
>      ++0b10'00+e1
>      + // expression
>      + 1.5-e+2+f
>      + // another one
>      +
>        ## userdiff.c ##
>       @@ userdiff.c: PATTERNS("cpp",
>        	 /* identifiers and keywords */
>  3:  df66485e7f0 ! 5:  43a701f5ffd userdiff: learn the C++ spaceship operator
>      @@ Metadata
>       Author: Johannes Sixt <j6t@kdbg.org>
>       
>        ## Commit message ##
>      -    userdiff: learn the C++ spaceship operator
>      +    userdiff-cpp: learn the C++ spaceship operator
>       
>      -    Since C++20, the language has a generalized comparison operator. Teach
>      -    the cpp driver not to separate it into <= and > tokens.
>      +    Since C++20, the language has a generalized comparison operator <=>.
>      +    Teach the cpp driver not to separate it into <= and > tokens.
>       
>           Signed-off-by: Johannes Sixt <j6t@kdbg.org>
>       
>      + ## t/t4034/cpp/expect ##
>      +@@
>      + <BOLD>diff --git a/pre b/post<RESET>
>      +-<BOLD>index 60f3640..f6fbf7b 100644<RESET>
>      ++<BOLD>index 144cd98..244f79c 100644<RESET>
>      + <BOLD>--- a/pre<RESET>
>      + <BOLD>+++ b/post<RESET>
>      + <CYAN>@@ -1,30 +1,30 @@<RESET>
>      +@@ t/t4034/cpp/expect: str.e+<RED>65<RESET><GREEN>75<RESET>
>      + a<RED>*<RESET><GREEN>*=<RESET>b c<RED>/<RESET><GREEN>/=<RESET>d e<RED>%<RESET><GREEN>%=<RESET>f
>      + a<RED>+<RESET><GREEN>++<RESET>b c<RED>-<RESET><GREEN>--<RESET>d
>      + a<RED><<<RESET><GREEN><<=<RESET>b c<RED>>><RESET><GREEN>>>=<RESET>d
>      +-a<RED><<RESET><GREEN><=<RESET>b c<RED><=<RESET><GREEN><<RESET>d e<RED>><RESET><GREEN>>=<RESET>f g<RED>>=<RESET><GREEN>><RESET>h
>      ++a<RED><<RESET><GREEN><=<RESET>b c<RED><=<RESET><GREEN><<RESET>d e<RED>><RESET><GREEN>>=<RESET>f g<RED>>=<RESET><GREEN>><RESET>h i<RED><=<RESET><GREEN><=><RESET>j
>      + a<RED>==<RESET><GREEN>!=<RESET>b c<RED>!=<RESET><GREEN>=<RESET>d
>      + a<RED>^<RESET><GREEN>^=<RESET>b c<RED>|<RESET><GREEN>|=<RESET>d e<RED>&&<RESET><GREEN>&=<RESET>f
>      + a<RED>||<RESET><GREEN>|<RESET>b
>      +
>      + ## t/t4034/cpp/post ##
>      +@@ t/t4034/cpp/post: str.e+75
>      + a*=b c/=d e%=f
>      + a++b c--d
>      + a<<=b c>>=d
>      +-a<=b c<d e>=f g>h
>      ++a<=b c<d e>=f g>h i<=>j
>      + a!=b c=d
>      + a^=b c|=d e&=f
>      + a|b
>      +
>      + ## t/t4034/cpp/pre ##
>      +@@ t/t4034/cpp/pre: str.e+65
>      + a*b c/d e%f
>      + a+b c-d
>      + a<<b c>>d
>      +-a<b c<=d e>f g>=h
>      ++a<b c<=d e>f g>=h i<=j
>      + a==b c!=d
>      + a^b c|d e&&f
>      + a||b
>      +
>        ## userdiff.c ##
>       @@ userdiff.c: PATTERNS("cpp",
>        	 "|0[xXbB][0-9a-fA-F']+[lLuU]*"



* Re: [PATCH v2 0/5] Fun with cpp word regex
  2021-10-08 20:07   ` [PATCH v2 0/5] Fun with cpp word regex Ævar Arnfjörð Bjarmason
@ 2021-10-08 22:11     ` Johannes Sixt
  2021-10-09  0:00       ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 24+ messages in thread
From: Johannes Sixt @ 2021-10-08 22:11 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Johannes Sixt via GitGitGadget

Am 08.10.21 um 22:07 schrieb Ævar Arnfjörð Bjarmason:
>  * I wonder if it isn't time to split a "c" driver off from "cpp";
>    e.g. git.git's .gitattributes has "cpp" for *.[ch] files, but C++
>    keeps adding more syntax sugar that C does not have.
> 
>    So e.g. if you use "<=>" after this series we'll tokenize it
>    differently in *.c files, but it's a C++-only operator, on the other
>    hand probably nobody cares that much...

Yes, it is that: <=> won't appear in a correct C file (outside of
comments), so no-one will care. As far as tokenization is concerned, C
is a subset of C++. I don't think we need to separate the drivers.

>  * I found myself back-porting some of your tests (manually mostly),
>    maybe you disagree, but in cases like 123'123, <=> etc. I'd find it
>    easier to follow if we first added the test data, and then the
>    changed behavior.
> 
>    Because after all, we're going to change how we highlight existing
>    data, so testing for that would be informative.

Good point. I'll work a bit more on that.

>  * This pre-dates your much improved tests, but these test files could
>    really use some test comments, as in:
> 
>    /* Now that we're going to understand the "'" character somehow, will any of this change? */
>    /* We haven't written code like this since the 1960's ... */
>    /* Run & free */
> 
>    I.e. we don't just highlight code the compiler likes to eat, but also
>    comments. So particularly for smaller tokens that also occur in
>    natural language like "'" and "&" are we getting expected results?

Comments are free text. Anything can happen. There is no such thing as
"correct tokenization" in comments. Not interested.

Thank you for the review.
-- Hannes


* Re: [PATCH v2 0/5] Fun with cpp word regex
  2021-10-08 22:11     ` Johannes Sixt
@ 2021-10-09  0:00       ` Ævar Arnfjörð Bjarmason
  2021-10-10 20:15         ` Johannes Sixt
  0 siblings, 1 reply; 24+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-10-09  0:00 UTC (permalink / raw)
  To: Johannes Sixt; +Cc: git, Johannes Sixt via GitGitGadget


On Sat, Oct 09 2021, Johannes Sixt wrote:

> Am 08.10.21 um 22:07 schrieb Ævar Arnfjörð Bjarmason:

[re-arranged a bit]

> [...]As far as tokenization is concerned, C
> is a subset of C++. I don't think we need to separate the drivers.
>
>>  * I found myself back-porting some of your tests (manually mostly),
>>    maybe you disagree, but in cases like 123'123, <=> etc. I'd find it
>>    easier to follow if we first added the test data, and then the
>>    changed behavior.
>> 
>>    Because after all, we're going to change how we highlight existing
>>    data, so testing for that would be informative.
>
> Good point. I'll work a bit more on that.

Great!

>>  * I wonder if it isn't time to split a "c" driver off from "cpp";
>>    e.g. git.git's .gitattributes has "cpp" for *.[ch] files, but C++
>>    keeps adding more syntax sugar that C does not have.
>> 
>>    So e.g. if you use "<=>" after this series we'll tokenize it
>>    differently in *.c files, but it's a C++-only operator, on the other
>>    hand probably nobody cares that much...
>
> Yes, it is that: <=> won't appear in a correct C file (outside of
> comments), so no-one will care. [...]

..mmm..

>>  * This pre-dates your much improved tests, but these test files could
>>    really use some test comments, as in:
>> 
>>    /* Now that we're going to understand the "'" character somehow, will any of this change? */
>>    /* We haven't written code like this since the 1960's ... */
>>    /* Run & free */
>> 
>>    I.e. we don't just highlight code the compiler likes to eat, but also
>>    comments. So particularly for smaller tokens that also occur in
>>    natural language like "'" and "&" are we getting expected results?
>
> Comments are free text. Anything can happen. There is no such thing as
> "correct tokenization" in comments. Not interested.

Sure there is, just because the problem is fuzzy doesn't mean there
aren't more and less correct things to do.

But most importantly the output of "git diff" is made for human
consumption, people who use --word-diff are going to be looking at code
that contains comments, embedded natural language in C-strings etc.

I've got no reason to think that your changes here make it worse, but
just as a general matter we absolutely should consider that one of the
top priorities when it comes to these language drivers.

E.g. in some languages (like CJK) it's common for characters or words
not to have any Unicode whitespace between them, so even a word-diff
mode for C can benefit from recognizing those character ranges and
splitting them appropriately, or at least trying.

So to take a commit of yours where you changed a comment, picked at random:
    
    $ git log -U0 --oneline -1 --word-diff -p af920e36977 --word-diff-regex='[ a-zA-Z]*'
    [...]
    [- Note that this seemingly redundant second declaration is required-]{+ Note that this redundant forward declaration is required+}

Don't you think that would suck vs. the current behavior of:

    [...]
    Note that this[-seemingly-] redundant [-second-]{+forward+} declaration is required

The former is exactly the sort of thing you'd get in CJK languages with
a word-diff driver that thought the line we should stop at was the same
as a compiler tokenizer.

Anyway, the cpp driver seems to do just fine on CJK.

I'm just saying that as a general thing it's definitely a priority for
these drivers to not only handle the narrow cases of tokens a compiler
would know about. Text that people commonly use should be presented in
some way that isn't line noise.

An example of something we do a bit badly with the cpp driver is
parts of my 66f5f6dca95 (C style: use standard style for "TRANSLATORS"
comments, 2017-05-11).

I.e. there I was changing a comment format, and added a full stop to a
sentence, the word-diff is:

        /*
         {+*+} TRANSLATORS: here is a comment that explains the string to
         {+*+} be translated, that follows immediately after [-it-]{+it.+}
         */

Even though it has nothing to do with C syntax per se, that would be much
more useful as:

        /*
         {+*+} TRANSLATORS: here is a comment that explains the string to
         {+*+} be translated, that follows immediately after it{+.+}
         */

I.e. treating a "." at the end of a word specially isn't C or C++
syntax, but it's absolutely input that the cpp driver *is* getting and
should, if possible, be handling well.

I got that by experimenting with
--word-diff-regex='([A-Za-z:]*|\*|\.)'. That example is unchanged with
your series, but maybe it's low-hanging fruit...

Thanks for working on this, just an unsolicited braindump :)

* [PATCH v3 0/6] Fun with cpp word regex
  2021-10-08 19:09 ` [PATCH v2 0/5] " Johannes Sixt via GitGitGadget
                     ` (5 preceding siblings ...)
  2021-10-08 20:07   ` [PATCH v2 0/5] Fun with cpp word regex Ævar Arnfjörð Bjarmason
@ 2021-10-10 17:02   ` Johannes Sixt via GitGitGadget
  2021-10-10 17:02     ` [PATCH v3 1/6] t4034/cpp: actually test that operator tokens are not split Johannes Sixt via GitGitGadget
                       ` (6 more replies)
  6 siblings, 7 replies; 24+ messages in thread
From: Johannes Sixt via GitGitGadget @ 2021-10-10 17:02 UTC (permalink / raw)
  To: git; +Cc: Ævar Arnfjörð Bjarmason, Johannes Sixt

The cpp word regex driver is a bit too loose and can match too much text
where the intent is to match only a number.

The first patch makes the cpp word regex tests more effective.

The second patch adds problematic test cases. The third patch fixes these
problems.

The remaining three patches add support for digit separators and the
spaceship operator <=> (generalized comparison operator).

I left out support for hexadecimal floating point constants because that
would require tightening the regex even more to avoid entire
expressions being treated as single tokens.

Changes since V2:

 * Add test cases for the new features in a separate commit so that the new
   behavior is better visible.
 * Don't treat .' as in '.' as a token.
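
The second item can be illustrated with a small POSIX regex sketch. It is
not part of the series: token_len_at_start and the macro names are
hypothetical, the two terms are copied from the range-diff below, and
[^[:space:]] is only a stand-in for the catch-all that makes any other
non-space character a token of its own.

```c
#include <assert.h>
#include <regex.h>

/* Stand-in for the catch-all: any other non-space character is a token. */
#define CATCH_ALL "|[^[:space:]]"
/* "Leading decimal point" term as it was in v2 ... */
#define V2_LEADING_POINT "\\.[0-9']+([Ee][-+]?[0-9]+)?[fFlL]?" CATCH_ALL
/* ... and as it is in v3: a digit must follow the point. */
#define V3_LEADING_POINT "\\.[0-9][0-9']*([Ee][-+]?[0-9]+)?[fFlL]?" CATCH_ALL

/*
 * Length of the token that `pattern` matches at the very start of
 * `text`, or -1 if no match starts there.
 */
static int token_len_at_start(const char *pattern, const char *text)
{
	regex_t re;
	regmatch_t m[1];
	int len = -1;
	int rc = regcomp(&re, pattern, REG_EXTENDED);

	assert(rc == 0);
	(void)rc;
	if (regexec(&re, text, 1, m, 0) == 0 && m[0].rm_so == 0)
		len = (int)m[0].rm_eo;
	regfree(&re);
	return len;
}
```

Under these assumptions, the v2 term takes the two characters .' from the
character literal '.' as one token, while the v3 term leaves the point to
the catch-all. A number such as .5'000f is one token with either term.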

Changes since V1:

 * Tests, tests, tests.
 * Polished commit messages.

Johannes Sixt (6):
  t4034/cpp: actually test that operator tokens are not split
  t4034: add tests showing problematic cpp tokenizations
  userdiff-cpp: tighten word regex
  userdiff-cpp: prepare test cases with yet unsupported features
  userdiff-cpp: permit the digit-separating single-quote in numbers
  userdiff-cpp: learn the C++ spaceship operator

 t/t4034/cpp/expect | 63 +++++++++++++++++++++++-----------------------
 t/t4034/cpp/post   | 47 +++++++++++++++++++++-------------
 t/t4034/cpp/pre    | 41 +++++++++++++++++++-----------
 userdiff.c         | 10 ++++++--
 4 files changed, 94 insertions(+), 67 deletions(-)


base-commit: 225bc32a989d7a22fa6addafd4ce7dcd04675dbf
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1054%2Fj6t%2Ffun-with-cpp-word-regex-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1054/j6t/fun-with-cpp-word-regex-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/1054

Range-diff vs v2:

 1:  dd9f82ba712 = 1:  dd9f82ba712 t4034/cpp: actually test that operator tokens are not split
 2:  5a84fc9cf71 = 2:  5a84fc9cf71 t4034: add tests showing problematic cpp tokenizations
 3:  d4ebe45fddc = 3:  d4ebe45fddc userdiff-cpp: tighten word regex
 4:  dd75d19cee9 ! 4:  c9f58b5e82f userdiff-cpp: permit the digit-separating single-quote in numbers
     @@ Metadata
      Author: Johannes Sixt <j6t@kdbg.org>
      
       ## Commit message ##
     -    userdiff-cpp: permit the digit-separating single-quote in numbers
     +    userdiff-cpp: prepare test cases with yet unsupported features
      
     -    Since C++17, the single-quote can be used as digit separator:
     -
     -       3.141'592'654
     -       1'000'000
     -       0xdead'beaf
     -
     -    Make it known to the word regex of the cpp driver, so that numbers are
     -    not split into separate tokens at the single-quotes.
     +    We are going to add support for C++'s digit-separating single-quote and
     +    the spaceship operator. By adding the test cases in this separate
     +    commit, the effect on the word highlighting will become more obvious
     +    as the features are implemented and the file cpp/expect is updated.
      
          Signed-off-by: Johannes Sixt <j6t@kdbg.org>
      
     @@ t/t4034/cpp/expect
      @@
       <BOLD>diff --git a/pre b/post<RESET>
      -<BOLD>index 1229cdb..3feae6f 100644<RESET>
     -+<BOLD>index 60f3640..f6fbf7b 100644<RESET>
     ++<BOLD>index 144cd98..64e78af 100644<RESET>
       <BOLD>--- a/pre<RESET>
       <BOLD>+++ b/post<RESET>
       <CYAN>@@ -1,30 +1,30 @@<RESET>
     -@@ t/t4034/cpp/expect: Foo() : x(0<RED>&&1<RESET><GREEN>&42<RESET>) { <RED>foo0<RESET><GREEN>bar<RESET>
     + Foo() : x(0<RED>&&1<RESET><GREEN>&42<RESET>) { <RED>foo0<RESET><GREEN>bar<RESET>(x.<RED>find<RESET><GREEN>Find<RESET>); }
       cout<<"Hello World<RED>!<RESET><GREEN>?<RESET>\n"<<endl;
     - <GREEN>(<RESET>1 <RED>-<RESET><GREEN>+<RESET>1e10 0xabcdef<GREEN>)<RESET> '<RED>x<RESET><GREEN>y<RESET>'
     +-<GREEN>(<RESET>1 <RED>-<RESET><GREEN>+<RESET>1e10 0xabcdef<GREEN>)<RESET> '<RED>x<RESET><GREEN>y<RESET>'
     ++<GREEN>(<RESET>1 <RED>-<RESET><GREEN>+<RESET>1e10 0xabcdef<GREEN>)<RESET> '<RED>x<RESET><GREEN>.<RESET>'
       // long double<RESET>
      -<RED>3.141592653e-10l<RESET><GREEN>3.141592654e+10l<RESET>
     -+<RED>3.141'592'653e-10l<RESET><GREEN>3.141'592'654e+10l<RESET>
     ++3.141'592'<RED>653e-10l<RESET><GREEN>654e+10l<RESET>
       // float<RESET>
       <RED>120E5f<RESET><GREEN>120E6f<RESET>
       // hex<RESET>
      -<RED>0xdeadbeaf<RESET><GREEN>0xdeadBeaf<RESET>+<RED>8ULL<RESET><GREEN>7ULL<RESET>
     -+<RED>0xdead'beaf<RESET><GREEN>0xdead'Beaf<RESET>+<RED>8ULL<RESET><GREEN>7ULL<RESET>
     ++0xdead'<RED>beaf<RESET><GREEN>Beaf<RESET>+<RED>8ULL<RESET><GREEN>7ULL<RESET>
       // octal<RESET>
      -<RED>01234567<RESET><GREEN>01234560<RESET>
     -+<RED>0123'4567<RESET><GREEN>0123'4560<RESET>
     ++0123'<RED>4567<RESET><GREEN>4560<RESET>
       // binary<RESET>
      -<RED>0b1000<RESET><GREEN>0b1100<RESET>+e1
     -+<RED>0b10'00<RESET><GREEN>0b11'00<RESET>+e1
     ++<RED>0b10<RESET><GREEN>0b11<RESET>'00+e1
       // expression<RESET>
       1.5-e+<RED>2<RESET><GREEN>3<RESET>+f
       // another one<RESET>
     +@@ t/t4034/cpp/expect: str.e+<RED>65<RESET><GREEN>75<RESET>
     + a<RED>*<RESET><GREEN>*=<RESET>b c<RED>/<RESET><GREEN>/=<RESET>d e<RED>%<RESET><GREEN>%=<RESET>f
     + a<RED>+<RESET><GREEN>++<RESET>b c<RED>-<RESET><GREEN>--<RESET>d
     + a<RED><<<RESET><GREEN><<=<RESET>b c<RED>>><RESET><GREEN>>>=<RESET>d
     +-a<RED><<RESET><GREEN><=<RESET>b c<RED><=<RESET><GREEN><<RESET>d e<RED>><RESET><GREEN>>=<RESET>f g<RED>>=<RESET><GREEN>><RESET>h
     ++a<RED><<RESET><GREEN><=<RESET>b c<RED><=<RESET><GREEN><<RESET>d e<RED>><RESET><GREEN>>=<RESET>f g<RED>>=<RESET><GREEN>><RESET>h i<=<GREEN>><RESET>j
     + a<RED>==<RESET><GREEN>!=<RESET>b c<RED>!=<RESET><GREEN>=<RESET>d
     + a<RED>^<RESET><GREEN>^=<RESET>b c<RED>|<RESET><GREEN>|=<RESET>d e<RED>&&<RESET><GREEN>&=<RESET>f
     + a<RED>||<RESET><GREEN>|<RESET>b
      
       ## t/t4034/cpp/post ##
     -@@ t/t4034/cpp/post: Foo() : x(0&42) { bar(x.Find); }
     +@@
     + Foo() : x(0&42) { bar(x.Find); }
       cout<<"Hello World?\n"<<endl;
     - (1 +1e10 0xabcdef) 'y'
     +-(1 +1e10 0xabcdef) 'y'
     ++(1 +1e10 0xabcdef) '.'
       // long double
      -3.141592654e+10l
      +3.141'592'654e+10l
     @@ t/t4034/cpp/post: Foo() : x(0&42) { bar(x.Find); }
       // expression
       1.5-e+3+f
       // another one
     +@@ t/t4034/cpp/post: str.e+75
     + a*=b c/=d e%=f
     + a++b c--d
     + a<<=b c>>=d
     +-a<=b c<d e>=f g>h
     ++a<=b c<d e>=f g>h i<=>j
     + a!=b c=d
     + a^=b c|=d e&=f
     + a|b
      
       ## t/t4034/cpp/pre ##
      @@ t/t4034/cpp/pre: Foo():x(0&&1){ foo0( x.find); }
     @@ t/t4034/cpp/pre: Foo():x(0&&1){ foo0( x.find); }
       // expression
       1.5-e+2+f
       // another one
     -
     - ## userdiff.c ##
     -@@ userdiff.c: PATTERNS("cpp",
     - 	 /* identifiers and keywords */
     - 	 "[a-zA-Z_][a-zA-Z0-9_]*"
     - 	 /* decimal and octal integers as well as floatingpoint numbers */
     --	 "|[0-9][0-9.]*([Ee][-+]?[0-9]+)?[fFlLuU]*"
     -+	 "|[0-9][0-9.']*([Ee][-+]?[0-9]+)?[fFlLuU]*"
     - 	 /* hexadecimal and binary integers */
     --	 "|0[xXbB][0-9a-fA-F]+[lLuU]*"
     -+	 "|0[xXbB][0-9a-fA-F']+[lLuU]*"
     - 	 /* floatingpoint numbers that begin with a decimal point */
     --	 "|\\.[0-9]+([Ee][-+]?[0-9]+)?[fFlL]?"
     -+	 "|\\.[0-9']+([Ee][-+]?[0-9]+)?[fFlL]?"
     - 	 "|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->\\*?|\\.\\*"),
     - PATTERNS("csharp",
     - 	 /* Keywords */
     +@@ t/t4034/cpp/pre: str.e+65
     + a*b c/d e%f
     + a+b c-d
     + a<<b c>>d
     +-a<b c<=d e>f g>=h
     ++a<b c<=d e>f g>=h i<=j
     + a==b c!=d
     + a^b c|d e&&f
     + a||b
 -:  ----------- > 5:  037c743d9e3 userdiff-cpp: permit the digit-separating single-quote in numbers
 5:  43a701f5ffd ! 6:  cc9dc967f10 userdiff-cpp: learn the C++ spaceship operator
     @@ Commit message
          Signed-off-by: Johannes Sixt <j6t@kdbg.org>
      
       ## t/t4034/cpp/expect ##
     -@@
     - <BOLD>diff --git a/pre b/post<RESET>
     --<BOLD>index 60f3640..f6fbf7b 100644<RESET>
     -+<BOLD>index 144cd98..244f79c 100644<RESET>
     - <BOLD>--- a/pre<RESET>
     - <BOLD>+++ b/post<RESET>
     - <CYAN>@@ -1,30 +1,30 @@<RESET>
      @@ t/t4034/cpp/expect: str.e+<RED>65<RESET><GREEN>75<RESET>
       a<RED>*<RESET><GREEN>*=<RESET>b c<RED>/<RESET><GREEN>/=<RESET>d e<RED>%<RESET><GREEN>%=<RESET>f
       a<RED>+<RESET><GREEN>++<RESET>b c<RED>-<RESET><GREEN>--<RESET>d
       a<RED><<<RESET><GREEN><<=<RESET>b c<RED>>><RESET><GREEN>>>=<RESET>d
     --a<RED><<RESET><GREEN><=<RESET>b c<RED><=<RESET><GREEN><<RESET>d e<RED>><RESET><GREEN>>=<RESET>f g<RED>>=<RESET><GREEN>><RESET>h
     +-a<RED><<RESET><GREEN><=<RESET>b c<RED><=<RESET><GREEN><<RESET>d e<RED>><RESET><GREEN>>=<RESET>f g<RED>>=<RESET><GREEN>><RESET>h i<=<GREEN>><RESET>j
      +a<RED><<RESET><GREEN><=<RESET>b c<RED><=<RESET><GREEN><<RESET>d e<RED>><RESET><GREEN>>=<RESET>f g<RED>>=<RESET><GREEN>><RESET>h i<RED><=<RESET><GREEN><=><RESET>j
       a<RED>==<RESET><GREEN>!=<RESET>b c<RED>!=<RESET><GREEN>=<RESET>d
       a<RED>^<RESET><GREEN>^=<RESET>b c<RED>|<RESET><GREEN>|=<RESET>d e<RED>&&<RESET><GREEN>&=<RESET>f
       a<RED>||<RESET><GREEN>|<RESET>b
      
     - ## t/t4034/cpp/post ##
     -@@ t/t4034/cpp/post: str.e+75
     - a*=b c/=d e%=f
     - a++b c--d
     - a<<=b c>>=d
     --a<=b c<d e>=f g>h
     -+a<=b c<d e>=f g>h i<=>j
     - a!=b c=d
     - a^=b c|=d e&=f
     - a|b
     -
     - ## t/t4034/cpp/pre ##
     -@@ t/t4034/cpp/pre: str.e+65
     - a*b c/d e%f
     - a+b c-d
     - a<<b c>>d
     --a<b c<=d e>f g>=h
     -+a<b c<=d e>f g>=h i<=j
     - a==b c!=d
     - a^b c|d e&&f
     - a||b
     -
       ## userdiff.c ##
      @@ userdiff.c: PATTERNS("cpp",
       	 "|0[xXbB][0-9a-fA-F']+[lLuU]*"
       	 /* floatingpoint numbers that begin with a decimal point */
     - 	 "|\\.[0-9']+([Ee][-+]?[0-9]+)?[fFlL]?"
     + 	 "|\\.[0-9][0-9']*([Ee][-+]?[0-9]+)?[fFlL]?"
      -	 "|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->\\*?|\\.\\*"),
      +	 "|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->\\*?|\\.\\*|<=>"),
       PATTERNS("csharp",

-- 
gitgitgadget

* [PATCH v3 1/6] t4034/cpp: actually test that operator tokens are not split
  2021-10-10 17:02   ` [PATCH v3 0/6] " Johannes Sixt via GitGitGadget
@ 2021-10-10 17:02     ` Johannes Sixt via GitGitGadget
  2021-10-10 17:03     ` [PATCH v3 2/6] t4034: add tests showing problematic cpp tokenizations Johannes Sixt via GitGitGadget
                       ` (5 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Johannes Sixt via GitGitGadget @ 2021-10-10 17:02 UTC (permalink / raw)
  To: git; +Cc: Ævar Arnfjörð Bjarmason, Johannes Sixt,
	Johannes Sixt

From: Johannes Sixt <j6t@kdbg.org>

8d96e7288f2b (t4034: bulk verify builtin word regex sanity, 2010-12-18)
added many tests with the intent to verify that operators consisting of
more than one symbol are kept together. These are tested by probing a
transition from, e.g., a!=b to x!=y, which results in the word-diff

  [-a-]{+x+}!=[-b-]{+y+}

But that proves only that the letters and operators are separate tokens.
To prove that != is an inseparable token, we have to probe a transition
from, e.g., a=b to a!=b, which has the word-diff

  a[-=-]{+!=+}b

that proves that the ! is not separate from the =.
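
The reasoning can be checked with a small POSIX regex sketch. It is an
illustration only: count_tokens and the two macros are hypothetical
names, the patterns are cut-down stand-ins for the cpp word regex, and
[^[:space:]] stands in for the catch-all that treats any other non-space
character as a token.

```c
#include <assert.h>
#include <regex.h>

/* Cut-down stand-ins for the cpp word regex (hypothetical names). */
#define KNOWS_NE "[a-zA-Z_][a-zA-Z0-9_]*|[-+*/<>%&^|=!]=|[^[:space:]]"
#define LACKS_NE "[a-zA-Z_][a-zA-Z0-9_]*|[^[:space:]]"

/* Count the tokens that `pattern` cuts out of `text`, left to right. */
static int count_tokens(const char *pattern, const char *text)
{
	regex_t re;
	regmatch_t m[1];
	int n = 0;
	int rc = regcomp(&re, pattern, REG_EXTENDED);

	assert(rc == 0);
	(void)rc;
	while (*text && regexec(&re, text, 1, m, 0) == 0 && m[0].rm_eo > 0) {
		n++;
		text += m[0].rm_eo;	/* continue after the matched token */
	}
	regfree(&re);
	return n;
}
```

With these definitions, a!=b is three tokens under KNOWS_NE and four
under LACKS_NE, while a=b is three tokens under both. A transition from
a=b to a!=b therefore replaces one token in the first case and inserts a
lone ! in the second, whereas a transition from a!=b to x!=y changes only
the letters either way.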

In the post-image, add a character to or remove a character from each
operator such that it turns into another valid operator.

Change the identifiers used around operators such that the diff
algorithm does not have an incentive to match, e.g., a<b in one spot
in the pre-image with a<b elsewhere in the post-image.

Adjust the expected output to match the new differences. Notice that
there are some undesirable tokenizations around e, ., and -.  This will
be addressed in a later change.

Signed-off-by: Johannes Sixt <j6t@kdbg.org>
---
 t/t4034/cpp/expect | 45 +++++++++++++++------------------------------
 t/t4034/cpp/post   | 29 +++++++++++++----------------
 t/t4034/cpp/pre    | 25 +++++++++++--------------
 3 files changed, 39 insertions(+), 60 deletions(-)

diff --git a/t/t4034/cpp/expect b/t/t4034/cpp/expect
index 37d1ea25870..41976971b93 100644
--- a/t/t4034/cpp/expect
+++ b/t/t4034/cpp/expect
@@ -1,36 +1,21 @@
 <BOLD>diff --git a/pre b/post<RESET>
-<BOLD>index 23d5c8a..7e8c026 100644<RESET>
+<BOLD>index c5672a2..4229868 100644<RESET>
 <BOLD>--- a/pre<RESET>
 <BOLD>+++ b/post<RESET>
-<CYAN>@@ -1,19 +1,19 @@<RESET>
+<CYAN>@@ -1,16 +1,16 @@<RESET>
 Foo() : x(0<RED>&&1<RESET><GREEN>&42<RESET>) { <GREEN>bar(x);<RESET> }
 cout<<"Hello World<RED>!<RESET><GREEN>?<RESET>\n"<<endl;
 <GREEN>(<RESET>1<GREEN>) (<RESET>-1e10<GREEN>) (<RESET>0xabcdef<GREEN>)<RESET> '<RED>x<RESET><GREEN>y<RESET>'
-[<RED>a<RESET><GREEN>x<RESET>] <RED>a<RESET><GREEN>x<RESET>-><RED>b a<RESET><GREEN>y x<RESET>.<RED>b<RESET><GREEN>y<RESET>
-!<RED>a<RESET><GREEN>x<RESET> ~<RED>a a<RESET><GREEN>x x<RESET>++ <RED>a<RESET><GREEN>x<RESET>-- <RED>a<RESET><GREEN>x<RESET>*<RED>b a<RESET><GREEN>y x<RESET>&<RED>b<RESET>
-<RED>a<RESET><GREEN>y<RESET>
-<GREEN>x<RESET>*<RED>b a<RESET><GREEN>y x<RESET>/<RED>b a<RESET><GREEN>y x<RESET>%<RED>b<RESET>
-<RED>a<RESET><GREEN>y<RESET>
-<GREEN>x<RESET>+<RED>b a<RESET><GREEN>y x<RESET>-<RED>b<RESET>
-<RED>a<RESET><GREEN>y<RESET>
-<GREEN>x<RESET><<<RED>b a<RESET><GREEN>y x<RESET>>><RED>b<RESET>
-<RED>a<RESET><GREEN>y<RESET>
-<GREEN>x<RESET><<RED>b a<RESET><GREEN>y x<RESET><=<RED>b a<RESET><GREEN>y x<RESET>><RED>b a<RESET><GREEN>y x<RESET>>=<RED>b<RESET>
-<RED>a<RESET><GREEN>y<RESET>
-<GREEN>x<RESET>==<RED>b a<RESET><GREEN>y x<RESET>!=<RED>b<RESET>
-<RED>a<RESET><GREEN>y<RESET>
-<GREEN>x<RESET>&<RED>b<RESET>
-<RED>a<RESET><GREEN>y<RESET>
-<GREEN>x<RESET>^<RED>b<RESET>
-<RED>a<RESET><GREEN>y<RESET>
-<GREEN>x<RESET>|<RED>b<RESET>
-<RED>a<RESET><GREEN>y<RESET>
-<GREEN>x<RESET>&&<RED>b<RESET>
-<RED>a<RESET><GREEN>y<RESET>
-<GREEN>x<RESET>||<RED>b<RESET>
-<RED>a<RESET><GREEN>y<RESET>
-<GREEN>x<RESET>?<RED>b<RESET><GREEN>y<RESET>:z
-<RED>a<RESET><GREEN>x<RESET>=<RED>b a<RESET><GREEN>y x<RESET>+=<RED>b a<RESET><GREEN>y x<RESET>-=<RED>b a<RESET><GREEN>y x<RESET>*=<RED>b a<RESET><GREEN>y x<RESET>/=<RED>b a<RESET><GREEN>y x<RESET>%=<RED>b a<RESET><GREEN>y x<RESET><<=<RED>b a<RESET><GREEN>y x<RESET>>>=<RED>b a<RESET><GREEN>y x<RESET>&=<RED>b a<RESET><GREEN>y x<RESET>^=<RED>b a<RESET><GREEN>y x<RESET>|=<RED>b<RESET>
-<RED>a<RESET><GREEN>y<RESET>
-<GREEN>x<RESET>,y
-<RED>a<RESET><GREEN>x<RESET>::<RED>b<RESET><GREEN>y<RESET>
+[a] b<RED>-><RESET><GREEN>->*<RESET>v d<RED>.e<RESET><GREEN>.*e<RESET>
+<GREEN>~<RESET>!a <GREEN>!<RESET>~b c<RED>++<RESET><GREEN>+<RESET> d<RED>--<RESET><GREEN>-<RESET> e*<GREEN>*<RESET>f g<RED>&<RESET><GREEN>&&<RESET>h
+a<RED>*<RESET><GREEN>*=<RESET>b c<RED>/<RESET><GREEN>/=<RESET>d e<RED>%<RESET><GREEN>%=<RESET>f
+a<RED>+<RESET><GREEN>++<RESET>b c<RED>-<RESET><GREEN>--<RESET>d
+a<RED><<<RESET><GREEN><<=<RESET>b c<RED>>><RESET><GREEN>>>=<RESET>d
+a<RED><<RESET><GREEN><=<RESET>b c<RED><=<RESET><GREEN><<RESET>d e<RED>><RESET><GREEN>>=<RESET>f g<RED>>=<RESET><GREEN>><RESET>h
+a<RED>==<RESET><GREEN>!=<RESET>b c<RED>!=<RESET><GREEN>=<RESET>d
+a<RED>^<RESET><GREEN>^=<RESET>b c<RED>|<RESET><GREEN>|=<RESET>d e<RED>&&<RESET><GREEN>&=<RESET>f
+a<RED>||<RESET><GREEN>|<RESET>b
+a?<GREEN>:<RESET>b
+a<RED>=<RESET><GREEN>==<RESET>b c<RED>+=<RESET><GREEN>+<RESET>d <RED>e-=f<RESET><GREEN>e-f<RESET> g<RED>*=<RESET><GREEN>*<RESET>h i<RED>/=<RESET><GREEN>/<RESET>j k<RED>%=<RESET><GREEN>%<RESET>l m<RED><<=<RESET><GREEN><<<RESET>n o<RED>>>=<RESET><GREEN>>><RESET>p q<RED>&=<RESET><GREEN>&<RESET>r s<RED>^=<RESET><GREEN>^<RESET>t u<RED>|=<RESET><GREEN>|<RESET>v
+a,b<RESET>
+a<RED>::<RESET><GREEN>:<RESET>b
diff --git a/t/t4034/cpp/post b/t/t4034/cpp/post
index 7e8c026cefb..4229868ae62 100644
--- a/t/t4034/cpp/post
+++ b/t/t4034/cpp/post
@@ -1,19 +1,16 @@
 Foo() : x(0&42) { bar(x); }
 cout<<"Hello World?\n"<<endl;
 (1) (-1e10) (0xabcdef) 'y'
-[x] x->y x.y
-!x ~x x++ x-- x*y x&y
-x*y x/y x%y
-x+y x-y
-x<<y x>>y
-x<y x<=y x>y x>=y
-x==y x!=y
-x&y
-x^y
-x|y
-x&&y
-x||y
-x?y:z
-x=y x+=y x-=y x*=y x/=y x%=y x<<=y x>>=y x&=y x^=y x|=y
-x,y
-x::y
+[a] b->*v d.*e
+~!a !~b c+ d- e**f g&&h
+a*=b c/=d e%=f
+a++b c--d
+a<<=b c>>=d
+a<=b c<d e>=f g>h
+a!=b c=d
+a^=b c|=d e&=f
+a|b
+a?:b
+a==b c+d e-f g*h i/j k%l m<<n o>>p q&r s^t u|v
+a,b
+a:b
diff --git a/t/t4034/cpp/pre b/t/t4034/cpp/pre
index 23d5c8adf54..c5672a24cfc 100644
--- a/t/t4034/cpp/pre
+++ b/t/t4034/cpp/pre
@@ -1,19 +1,16 @@
 Foo():x(0&&1){}
 cout<<"Hello World!\n"<<endl;
 1 -1e10 0xabcdef 'x'
-[a] a->b a.b
-!a ~a a++ a-- a*b a&b
-a*b a/b a%b
-a+b a-b
-a<<b a>>b
-a<b a<=b a>b a>=b
-a==b a!=b
-a&b
-a^b
-a|b
-a&&b
+[a] b->v d.e
+!a ~b c++ d-- e*f g&h
+a*b c/d e%f
+a+b c-d
+a<<b c>>d
+a<b c<=d e>f g>=h
+a==b c!=d
+a^b c|d e&&f
 a||b
-a?b:z
-a=b a+=b a-=b a*=b a/=b a%=b a<<=b a>>=b a&=b a^=b a|=b
-a,y
+a?b
+a=b c+=d e-=f g*=h i/=j k%=l m<<=n o>>=p q&=r s^=t u|=v
+a,b
 a::b
-- 
gitgitgadget


* [PATCH v3 2/6] t4034: add tests showing problematic cpp tokenizations
  2021-10-10 17:02   ` [PATCH v3 0/6] " Johannes Sixt via GitGitGadget
  2021-10-10 17:02     ` [PATCH v3 1/6] t4034/cpp: actually test that operator tokens are not split Johannes Sixt via GitGitGadget
@ 2021-10-10 17:03     ` Johannes Sixt via GitGitGadget
  2021-10-10 17:03     ` [PATCH v3 3/6] userdiff-cpp: tighten word regex Johannes Sixt via GitGitGadget
                       ` (4 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Johannes Sixt via GitGitGadget @ 2021-10-10 17:03 UTC (permalink / raw)
  To: git; +Cc: Ævar Arnfjörð Bjarmason, Johannes Sixt,
	Johannes Sixt

From: Johannes Sixt <j6t@kdbg.org>

The word regex is too loose and matches long streaks of characters
that should actually be separate tokens.  Add these problematic test
cases. Separate the lines with text that will remain identical in the
pre- and post-image so that the diff algorithm will not lump removals
and additions of consecutive lines together. This makes the expected
output easier to read.

Signed-off-by: Johannes Sixt <j6t@kdbg.org>
---
 t/t4034/cpp/expect | 22 ++++++++++++++++++----
 t/t4034/cpp/post   | 18 ++++++++++++++++--
 t/t4034/cpp/pre    | 16 +++++++++++++++-
 3 files changed, 49 insertions(+), 7 deletions(-)

diff --git a/t/t4034/cpp/expect b/t/t4034/cpp/expect
index 41976971b93..63e53a61e62 100644
--- a/t/t4034/cpp/expect
+++ b/t/t4034/cpp/expect
@@ -1,11 +1,25 @@
 <BOLD>diff --git a/pre b/post<RESET>
-<BOLD>index c5672a2..4229868 100644<RESET>
+<BOLD>index 1229cdb..3feae6f 100644<RESET>
 <BOLD>--- a/pre<RESET>
 <BOLD>+++ b/post<RESET>
-<CYAN>@@ -1,16 +1,16 @@<RESET>
-Foo() : x(0<RED>&&1<RESET><GREEN>&42<RESET>) { <GREEN>bar(x);<RESET> }
+<CYAN>@@ -1,30 +1,30 @@<RESET>
+Foo() : x(0<RED>&&1<RESET><GREEN>&42<RESET>) { <RED>foo0<RESET><GREEN>bar<RESET>(x<RED>.f<RESET><GREEN>.F<RESET>ind); }
 cout<<"Hello World<RED>!<RESET><GREEN>?<RESET>\n"<<endl;
-<GREEN>(<RESET>1<GREEN>) (<RESET>-1e10<GREEN>) (<RESET>0xabcdef<GREEN>)<RESET> '<RED>x<RESET><GREEN>y<RESET>'
+<GREEN>(<RESET>1 <RED>-1e10<RESET><GREEN>+1e10<RESET> 0xabcdef<GREEN>)<RESET> '<RED>x<RESET><GREEN>y<RESET>'
+// long double<RESET>
+<RED>3.141592653e-10l<RESET><GREEN>3.141592654e+10l<RESET>
+// float<RESET>
+120<RED>E5f<RESET><GREEN>E6f<RESET>
+// hex<RESET>
+<RED>0xdeadbeaf+8<RESET><GREEN>0xdeadBeaf+7<RESET>ULL
+// octal<RESET>
+<RED>01234567<RESET><GREEN>01234560<RESET>
+// binary<RESET>
+<RED>0b1000<RESET><GREEN>0b1100<RESET>+e1
+// expression<RESET>
+<RED>1.5-e+2+f<RESET><GREEN>1.5-e+3+f<RESET>
+// another one<RESET>
+str<RED>.e+65<RESET><GREEN>.e+75<RESET>
 [a] b<RED>-><RESET><GREEN>->*<RESET>v d<RED>.e<RESET><GREEN>.*e<RESET>
 <GREEN>~<RESET>!a <GREEN>!<RESET>~b c<RED>++<RESET><GREEN>+<RESET> d<RED>--<RESET><GREEN>-<RESET> e*<GREEN>*<RESET>f g<RED>&<RESET><GREEN>&&<RESET>h
 a<RED>*<RESET><GREEN>*=<RESET>b c<RED>/<RESET><GREEN>/=<RESET>d e<RED>%<RESET><GREEN>%=<RESET>f
diff --git a/t/t4034/cpp/post b/t/t4034/cpp/post
index 4229868ae62..3feae6f430f 100644
--- a/t/t4034/cpp/post
+++ b/t/t4034/cpp/post
@@ -1,6 +1,20 @@
-Foo() : x(0&42) { bar(x); }
+Foo() : x(0&42) { bar(x.Find); }
 cout<<"Hello World?\n"<<endl;
-(1) (-1e10) (0xabcdef) 'y'
+(1 +1e10 0xabcdef) 'y'
+// long double
+3.141592654e+10l
+// float
+120E6f
+// hex
+0xdeadBeaf+7ULL
+// octal
+01234560
+// binary
+0b1100+e1
+// expression
+1.5-e+3+f
+// another one
+str.e+75
 [a] b->*v d.*e
 ~!a !~b c+ d- e**f g&&h
 a*=b c/=d e%=f
diff --git a/t/t4034/cpp/pre b/t/t4034/cpp/pre
index c5672a24cfc..1229cdb59d1 100644
--- a/t/t4034/cpp/pre
+++ b/t/t4034/cpp/pre
@@ -1,6 +1,20 @@
-Foo():x(0&&1){}
+Foo():x(0&&1){ foo0( x.find); }
 cout<<"Hello World!\n"<<endl;
 1 -1e10 0xabcdef 'x'
+// long double
+3.141592653e-10l
+// float
+120E5f
+// hex
+0xdeadbeaf+8ULL
+// octal
+01234567
+// binary
+0b1000+e1
+// expression
+1.5-e+2+f
+// another one
+str.e+65
 [a] b->v d.e
 !a ~b c++ d-- e*f g&h
 a*b c/d e%f
-- 
gitgitgadget


* [PATCH v3 3/6] userdiff-cpp: tighten word regex
  2021-10-10 17:02   ` [PATCH v3 0/6] " Johannes Sixt via GitGitGadget
  2021-10-10 17:02     ` [PATCH v3 1/6] t4034/cpp: actually test that operator tokens are not split Johannes Sixt via GitGitGadget
  2021-10-10 17:03     ` [PATCH v3 2/6] t4034: add tests showing problematic cpp tokenizations Johannes Sixt via GitGitGadget
@ 2021-10-10 17:03     ` Johannes Sixt via GitGitGadget
  2021-10-10 17:03     ` [PATCH v3 4/6] userdiff-cpp: prepare test cases with yet unsupported features Johannes Sixt via GitGitGadget
                       ` (3 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Johannes Sixt via GitGitGadget @ 2021-10-10 17:03 UTC (permalink / raw)
  To: git; +Cc: Ævar Arnfjörð Bjarmason, Johannes Sixt,
	Johannes Sixt

From: Johannes Sixt <j6t@kdbg.org>

Generally, word regexes can be written such that they match tokens
liberally and need not model the actual syntax because it can be assumed
that the regex will only be applied to syntactically correct text.

The regex for cpp (C/C++) is too liberal, though. It regards these
sequences as single tokens:

   1+2
   1.5-e+2+f

and the following amalgams as one token:

   .l      as in str.length
   .f      as in str.find
   .e      as in str.erase

Tighten the regex in the following way:

- Accept + and - only in one position in the exponent. + and - are no
  longer regarded as the sign of a number and are handled by the
  catch-all that is not visible in the driver's regex.

- Accept a leading decimal point only when it is followed by a digit.

For readability, factor hexadecimal and binary numbers into a term of
their own.

As a drive-by, this fixes the splitting of floating point numbers such
as 12E5 (with upper-case E) into two tokens.
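
The effect of the tightening can be reproduced with a small POSIX regex
sketch. It is an illustration, not part of the patch:
token_len_at_start and the macro names are hypothetical, the two word
regexes are copied from the userdiff.c hunk below, and [^[:space:]] is
only a stand-in for the catch-all mentioned above.

```c
#include <assert.h>
#include <regex.h>

/* Stand-in for the catch-all: any other non-space character is a token. */
#define CATCH_ALL "|[^[:space:]]"
#define OPERATORS \
	"|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->\\*?|\\.\\*"

/* Word regex before this change. */
#define OLD_CPP_WORDS \
	"[a-zA-Z_][a-zA-Z0-9_]*" \
	"|[-+0-9.e]+[fFlL]?|0[xXbB]?[0-9a-fA-F]+[lLuU]*" \
	OPERATORS CATCH_ALL

/* Word regex after this change. */
#define NEW_CPP_WORDS \
	"[a-zA-Z_][a-zA-Z0-9_]*" \
	"|[0-9][0-9.]*([Ee][-+]?[0-9]+)?[fFlLuU]*" \
	"|0[xXbB][0-9a-fA-F]+[lLuU]*" \
	"|\\.[0-9]+([Ee][-+]?[0-9]+)?[fFlL]?" \
	OPERATORS CATCH_ALL

/*
 * Length of the token that `pattern` matches at the very start of
 * `text`, or -1 if no match starts there.
 */
static int token_len_at_start(const char *pattern, const char *text)
{
	regex_t re;
	regmatch_t m[1];
	int len = -1;
	int rc = regcomp(&re, pattern, REG_EXTENDED);

	assert(rc == 0);
	(void)rc;
	if (regexec(&re, text, 1, m, 0) == 0 && m[0].rm_so == 0)
		len = (int)m[0].rm_eo;
	regfree(&re);
	return len;
}
```

Under these assumptions, the old regex swallows all nine characters of
1.5-e+2+f and all five of .e+65, and it stops after 120 in 120E5f. The
new regex stops after 1.5, leaves the point in .e+65 to the catch-all,
and takes 120E5f as a whole.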

Signed-off-by: Johannes Sixt <j6t@kdbg.org>
---
 t/t4034/cpp/expect | 16 ++++++++--------
 userdiff.c         |  8 +++++++-
 2 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/t/t4034/cpp/expect b/t/t4034/cpp/expect
index 63e53a61e62..46c9460a968 100644
--- a/t/t4034/cpp/expect
+++ b/t/t4034/cpp/expect
@@ -3,24 +3,24 @@
 <BOLD>--- a/pre<RESET>
 <BOLD>+++ b/post<RESET>
 <CYAN>@@ -1,30 +1,30 @@<RESET>
-Foo() : x(0<RED>&&1<RESET><GREEN>&42<RESET>) { <RED>foo0<RESET><GREEN>bar<RESET>(x<RED>.f<RESET><GREEN>.F<RESET>ind); }
+Foo() : x(0<RED>&&1<RESET><GREEN>&42<RESET>) { <RED>foo0<RESET><GREEN>bar<RESET>(x.<RED>find<RESET><GREEN>Find<RESET>); }
 cout<<"Hello World<RED>!<RESET><GREEN>?<RESET>\n"<<endl;
-<GREEN>(<RESET>1 <RED>-1e10<RESET><GREEN>+1e10<RESET> 0xabcdef<GREEN>)<RESET> '<RED>x<RESET><GREEN>y<RESET>'
+<GREEN>(<RESET>1 <RED>-<RESET><GREEN>+<RESET>1e10 0xabcdef<GREEN>)<RESET> '<RED>x<RESET><GREEN>y<RESET>'
 // long double<RESET>
 <RED>3.141592653e-10l<RESET><GREEN>3.141592654e+10l<RESET>
 // float<RESET>
-120<RED>E5f<RESET><GREEN>E6f<RESET>
+<RED>120E5f<RESET><GREEN>120E6f<RESET>
 // hex<RESET>
-<RED>0xdeadbeaf+8<RESET><GREEN>0xdeadBeaf+7<RESET>ULL
+<RED>0xdeadbeaf<RESET><GREEN>0xdeadBeaf<RESET>+<RED>8ULL<RESET><GREEN>7ULL<RESET>
 // octal<RESET>
 <RED>01234567<RESET><GREEN>01234560<RESET>
 // binary<RESET>
 <RED>0b1000<RESET><GREEN>0b1100<RESET>+e1
 // expression<RESET>
-<RED>1.5-e+2+f<RESET><GREEN>1.5-e+3+f<RESET>
+1.5-e+<RED>2<RESET><GREEN>3<RESET>+f
 // another one<RESET>
-str<RED>.e+65<RESET><GREEN>.e+75<RESET>
-[a] b<RED>-><RESET><GREEN>->*<RESET>v d<RED>.e<RESET><GREEN>.*e<RESET>
+str.e+<RED>65<RESET><GREEN>75<RESET>
+[a] b<RED>-><RESET><GREEN>->*<RESET>v d<RED>.<RESET><GREEN>.*<RESET>e
 <GREEN>~<RESET>!a <GREEN>!<RESET>~b c<RED>++<RESET><GREEN>+<RESET> d<RED>--<RESET><GREEN>-<RESET> e*<GREEN>*<RESET>f g<RED>&<RESET><GREEN>&&<RESET>h
 a<RED>*<RESET><GREEN>*=<RESET>b c<RED>/<RESET><GREEN>/=<RESET>d e<RED>%<RESET><GREEN>%=<RESET>f
 a<RED>+<RESET><GREEN>++<RESET>b c<RED>-<RESET><GREEN>--<RESET>d
@@ -30,6 +30,6 @@ a<RED>==<RESET><GREEN>!=<RESET>b c<RED>!=<RESET><GREEN>=<RESET>d
 a<RED>^<RESET><GREEN>^=<RESET>b c<RED>|<RESET><GREEN>|=<RESET>d e<RED>&&<RESET><GREEN>&=<RESET>f
 a<RED>||<RESET><GREEN>|<RESET>b
 a?<GREEN>:<RESET>b
-a<RED>=<RESET><GREEN>==<RESET>b c<RED>+=<RESET><GREEN>+<RESET>d <RED>e-=f<RESET><GREEN>e-f<RESET> g<RED>*=<RESET><GREEN>*<RESET>h i<RED>/=<RESET><GREEN>/<RESET>j k<RED>%=<RESET><GREEN>%<RESET>l m<RED><<=<RESET><GREEN><<<RESET>n o<RED>>>=<RESET><GREEN>>><RESET>p q<RED>&=<RESET><GREEN>&<RESET>r s<RED>^=<RESET><GREEN>^<RESET>t u<RED>|=<RESET><GREEN>|<RESET>v
+a<RED>=<RESET><GREEN>==<RESET>b c<RED>+=<RESET><GREEN>+<RESET>d e<RED>-=<RESET><GREEN>-<RESET>f g<RED>*=<RESET><GREEN>*<RESET>h i<RED>/=<RESET><GREEN>/<RESET>j k<RED>%=<RESET><GREEN>%<RESET>l m<RED><<=<RESET><GREEN><<<RESET>n o<RED>>>=<RESET><GREEN>>><RESET>p q<RED>&=<RESET><GREEN>&<RESET>r s<RED>^=<RESET><GREEN>^<RESET>t u<RED>|=<RESET><GREEN>|<RESET>v
 a,b<RESET>
 a<RED>::<RESET><GREEN>:<RESET>b
diff --git a/userdiff.c b/userdiff.c
index d9b2ba752f0..ce2a9230703 100644
--- a/userdiff.c
+++ b/userdiff.c
@@ -54,8 +54,14 @@ PATTERNS("cpp",
 	 /* functions/methods, variables, and compounds at top level */
 	 "^((::[[:space:]]*)?[A-Za-z_].*)$",
 	 /* -- */
+	 /* identifiers and keywords */
 	 "[a-zA-Z_][a-zA-Z0-9_]*"
-	 "|[-+0-9.e]+[fFlL]?|0[xXbB]?[0-9a-fA-F]+[lLuU]*"
+	 /* decimal and octal integers as well as floatingpoint numbers */
+	 "|[0-9][0-9.]*([Ee][-+]?[0-9]+)?[fFlLuU]*"
+	 /* hexadecimal and binary integers */
+	 "|0[xXbB][0-9a-fA-F]+[lLuU]*"
+	 /* floatingpoint numbers that begin with a decimal point */
+	 "|\\.[0-9]+([Ee][-+]?[0-9]+)?[fFlL]?"
 	 "|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->\\*?|\\.\\*"),
 PATTERNS("csharp",
 	 /* Keywords */
-- 
gitgitgadget


* [PATCH v3 4/6] userdiff-cpp: prepare test cases with yet unsupported features
  2021-10-10 17:02   ` [PATCH v3 0/6] " Johannes Sixt via GitGitGadget
                       ` (2 preceding siblings ...)
  2021-10-10 17:03     ` [PATCH v3 3/6] userdiff-cpp: tighten word regex Johannes Sixt via GitGitGadget
@ 2021-10-10 17:03     ` Johannes Sixt via GitGitGadget
  2021-10-10 17:03     ` [PATCH v3 5/6] userdiff-cpp: permit the digit-separating single-quote in numbers Johannes Sixt via GitGitGadget
                       ` (2 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Johannes Sixt via GitGitGadget @ 2021-10-10 17:03 UTC (permalink / raw)
  To: git; +Cc: Ævar Arnfjörð Bjarmason, Johannes Sixt,
	Johannes Sixt

From: Johannes Sixt <j6t@kdbg.org>

We are going to add support for C++'s digit-separating single-quote and
the spaceship operator. By adding the test cases in this separate
commit, the effect on the word highlighting will become more obvious
as the features are implemented and the file cpp/expect is updated.

Signed-off-by: Johannes Sixt <j6t@kdbg.org>
---
 t/t4034/cpp/expect | 14 +++++++-------
 t/t4034/cpp/post   | 12 ++++++------
 t/t4034/cpp/pre    | 10 +++++-----
 3 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/t/t4034/cpp/expect b/t/t4034/cpp/expect
index 46c9460a968..3d37ddac42c 100644
--- a/t/t4034/cpp/expect
+++ b/t/t4034/cpp/expect
@@ -1,21 +1,21 @@
 <BOLD>diff --git a/pre b/post<RESET>
-<BOLD>index 1229cdb..3feae6f 100644<RESET>
+<BOLD>index 144cd98..64e78af 100644<RESET>
 <BOLD>--- a/pre<RESET>
 <BOLD>+++ b/post<RESET>
 <CYAN>@@ -1,30 +1,30 @@<RESET>
 Foo() : x(0<RED>&&1<RESET><GREEN>&42<RESET>) { <RED>foo0<RESET><GREEN>bar<RESET>(x.<RED>find<RESET><GREEN>Find<RESET>); }
 cout<<"Hello World<RED>!<RESET><GREEN>?<RESET>\n"<<endl;
-<GREEN>(<RESET>1 <RED>-<RESET><GREEN>+<RESET>1e10 0xabcdef<GREEN>)<RESET> '<RED>x<RESET><GREEN>y<RESET>'
+<GREEN>(<RESET>1 <RED>-<RESET><GREEN>+<RESET>1e10 0xabcdef<GREEN>)<RESET> '<RED>x<RESET><GREEN>.<RESET>'
 // long double<RESET>
-<RED>3.141592653e-10l<RESET><GREEN>3.141592654e+10l<RESET>
+3.141'592'<RED>653e-10l<RESET><GREEN>654e+10l<RESET>
 // float<RESET>
 <RED>120E5f<RESET><GREEN>120E6f<RESET>
 // hex<RESET>
-<RED>0xdeadbeaf<RESET><GREEN>0xdeadBeaf<RESET>+<RED>8ULL<RESET><GREEN>7ULL<RESET>
+0xdead'<RED>beaf<RESET><GREEN>Beaf<RESET>+<RED>8ULL<RESET><GREEN>7ULL<RESET>
 // octal<RESET>
-<RED>01234567<RESET><GREEN>01234560<RESET>
+0123'<RED>4567<RESET><GREEN>4560<RESET>
 // binary<RESET>
-<RED>0b1000<RESET><GREEN>0b1100<RESET>+e1
+<RED>0b10<RESET><GREEN>0b11<RESET>'00+e1
 // expression<RESET>
 1.5-e+<RED>2<RESET><GREEN>3<RESET>+f
 // another one<RESET>
@@ -25,7 +25,7 @@ str.e+<RED>65<RESET><GREEN>75<RESET>
 a<RED>*<RESET><GREEN>*=<RESET>b c<RED>/<RESET><GREEN>/=<RESET>d e<RED>%<RESET><GREEN>%=<RESET>f
 a<RED>+<RESET><GREEN>++<RESET>b c<RED>-<RESET><GREEN>--<RESET>d
 a<RED><<<RESET><GREEN><<=<RESET>b c<RED>>><RESET><GREEN>>>=<RESET>d
-a<RED><<RESET><GREEN><=<RESET>b c<RED><=<RESET><GREEN><<RESET>d e<RED>><RESET><GREEN>>=<RESET>f g<RED>>=<RESET><GREEN>><RESET>h
+a<RED><<RESET><GREEN><=<RESET>b c<RED><=<RESET><GREEN><<RESET>d e<RED>><RESET><GREEN>>=<RESET>f g<RED>>=<RESET><GREEN>><RESET>h i<=<GREEN>><RESET>j
 a<RED>==<RESET><GREEN>!=<RESET>b c<RED>!=<RESET><GREEN>=<RESET>d
 a<RED>^<RESET><GREEN>^=<RESET>b c<RED>|<RESET><GREEN>|=<RESET>d e<RED>&&<RESET><GREEN>&=<RESET>f
 a<RED>||<RESET><GREEN>|<RESET>b
diff --git a/t/t4034/cpp/post b/t/t4034/cpp/post
index 3feae6f430f..64e78afbfb5 100644
--- a/t/t4034/cpp/post
+++ b/t/t4034/cpp/post
@@ -1,16 +1,16 @@
 Foo() : x(0&42) { bar(x.Find); }
 cout<<"Hello World?\n"<<endl;
-(1 +1e10 0xabcdef) 'y'
+(1 +1e10 0xabcdef) '.'
 // long double
-3.141592654e+10l
+3.141'592'654e+10l
 // float
 120E6f
 // hex
-0xdeadBeaf+7ULL
+0xdead'Beaf+7ULL
 // octal
-01234560
+0123'4560
 // binary
-0b1100+e1
+0b11'00+e1
 // expression
 1.5-e+3+f
 // another one
@@ -20,7 +20,7 @@ str.e+75
 a*=b c/=d e%=f
 a++b c--d
 a<<=b c>>=d
-a<=b c<d e>=f g>h
+a<=b c<d e>=f g>h i<=>j
 a!=b c=d
 a^=b c|=d e&=f
 a|b
diff --git a/t/t4034/cpp/pre b/t/t4034/cpp/pre
index 1229cdb59d1..144cd980d6b 100644
--- a/t/t4034/cpp/pre
+++ b/t/t4034/cpp/pre
@@ -2,15 +2,15 @@ Foo():x(0&&1){ foo0( x.find); }
 cout<<"Hello World!\n"<<endl;
 1 -1e10 0xabcdef 'x'
 // long double
-3.141592653e-10l
+3.141'592'653e-10l
 // float
 120E5f
 // hex
-0xdeadbeaf+8ULL
+0xdead'beaf+8ULL
 // octal
-01234567
+0123'4567
 // binary
-0b1000+e1
+0b10'00+e1
 // expression
 1.5-e+2+f
 // another one
@@ -20,7 +20,7 @@ str.e+65
 a*b c/d e%f
 a+b c-d
 a<<b c>>d
-a<b c<=d e>f g>=h
+a<b c<=d e>f g>=h i<=j
 a==b c!=d
 a^b c|d e&&f
 a||b
-- 
gitgitgadget



* [PATCH v3 5/6] userdiff-cpp: permit the digit-separating single-quote in numbers
  2021-10-10 17:02   ` [PATCH v3 0/6] " Johannes Sixt via GitGitGadget
                       ` (3 preceding siblings ...)
  2021-10-10 17:03     ` [PATCH v3 4/6] userdiff-cpp: prepare test cases with yet unsupported features Johannes Sixt via GitGitGadget
@ 2021-10-10 17:03     ` Johannes Sixt via GitGitGadget
  2021-10-10 17:03     ` [PATCH v3 6/6] userdiff-cpp: learn the C++ spaceship operator Johannes Sixt via GitGitGadget
  2021-10-24  9:56     ` [PATCH 7/6] userdiff-cpp: back out the digit-separators in numbers Johannes Sixt
  6 siblings, 0 replies; 24+ messages in thread
From: Johannes Sixt via GitGitGadget @ 2021-10-10 17:03 UTC (permalink / raw)
  To: git; +Cc: Ævar Arnfjörð Bjarmason, Johannes Sixt,
	Johannes Sixt

From: Johannes Sixt <j6t@kdbg.org>

Since C++14, the single-quote can be used as a digit separator:

   3.141'592'654
   1'000'000
   0xdead'beaf

Make it known to the word regex of the cpp driver, so that numbers are
not split into separate tokens at the single-quotes.

Signed-off-by: Johannes Sixt <j6t@kdbg.org>
---
 t/t4034/cpp/expect | 8 ++++----
 userdiff.c         | 6 +++---
 2 files changed, 7 insertions(+), 7 deletions(-)
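
A minimal sketch of the effect, not part of the patch: it feeds the
numeric alternatives of the word regex, copied from the userdiff.c hunk
below, to POSIX regexec() and reports how long the first token is. The
helper first_token_len() and the names NUM_BEFORE and NUM_AFTER are made
up for this illustration; the other alternatives of the word regex are
left out.

```c
#include <assert.h>
#include <regex.h>

/* Numeric alternatives of the cpp word regex before this patch (excerpt). */
static const char NUM_BEFORE[] =
	"[0-9][0-9.]*([Ee][-+]?[0-9]+)?[fFlLuU]*"
	"|0[xXbB][0-9a-fA-F]+[lLuU]*"
	"|\\.[0-9]+([Ee][-+]?[0-9]+)?[fFlL]?";

/* The same alternatives with the single-quote permitted (this patch). */
static const char NUM_AFTER[] =
	"[0-9][0-9.']*([Ee][-+]?[0-9]+)?[fFlLuU]*"
	"|0[xXbB][0-9a-fA-F']+[lLuU]*"
	"|\\.[0-9][0-9']*([Ee][-+]?[0-9]+)?[fFlL]?";

/*
 * Hypothetical helper: length of the first token that `pattern` finds
 * in `text`, or -1 if nothing matches.
 */
static long first_token_len(const char *pattern, const char *text)
{
	regex_t re;
	regmatch_t m;
	long len = -1;
	int rc = regcomp(&re, pattern, REG_EXTENDED);

	/* The patterns are fixed strings, so a compile failure is a bug. */
	assert(rc == 0);
	(void)rc;

	if (regexec(&re, text, 1, &m, 0) == 0)
		len = (long)(m.rm_eo - m.rm_so);
	regfree(&re);
	return len;
}
```

With these definitions, first_token_len(NUM_BEFORE, "3.141'592'654")
should be 5, because the token ends at the first single-quote, whereas
NUM_AFTER should yield 13, the whole number.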

diff --git a/t/t4034/cpp/expect b/t/t4034/cpp/expect
index 3d37ddac42c..b90b3f207bf 100644
--- a/t/t4034/cpp/expect
+++ b/t/t4034/cpp/expect
@@ -7,15 +7,15 @@ Foo() : x(0<RED>&&1<RESET><GREEN>&42<RESET>) { <RED>foo0<RESET><GREEN>bar<RESET>
 cout<<"Hello World<RED>!<RESET><GREEN>?<RESET>\n"<<endl;
 <GREEN>(<RESET>1 <RED>-<RESET><GREEN>+<RESET>1e10 0xabcdef<GREEN>)<RESET> '<RED>x<RESET><GREEN>.<RESET>'
 // long double<RESET>
-3.141'592'<RED>653e-10l<RESET><GREEN>654e+10l<RESET>
+<RED>3.141'592'653e-10l<RESET><GREEN>3.141'592'654e+10l<RESET>
 // float<RESET>
 <RED>120E5f<RESET><GREEN>120E6f<RESET>
 // hex<RESET>
-0xdead'<RED>beaf<RESET><GREEN>Beaf<RESET>+<RED>8ULL<RESET><GREEN>7ULL<RESET>
+<RED>0xdead'beaf<RESET><GREEN>0xdead'Beaf<RESET>+<RED>8ULL<RESET><GREEN>7ULL<RESET>
 // octal<RESET>
-0123'<RED>4567<RESET><GREEN>4560<RESET>
+<RED>0123'4567<RESET><GREEN>0123'4560<RESET>
 // binary<RESET>
-<RED>0b10<RESET><GREEN>0b11<RESET>'00+e1
+<RED>0b10'00<RESET><GREEN>0b11'00<RESET>+e1
 // expression<RESET>
 1.5-e+<RED>2<RESET><GREEN>3<RESET>+f
 // another one<RESET>
diff --git a/userdiff.c b/userdiff.c
index ce2a9230703..5072d12e51e 100644
--- a/userdiff.c
+++ b/userdiff.c
@@ -57,11 +57,11 @@ PATTERNS("cpp",
 	 /* identifiers and keywords */
 	 "[a-zA-Z_][a-zA-Z0-9_]*"
 	 /* decimal and octal integers as well as floatingpoint numbers */
-	 "|[0-9][0-9.]*([Ee][-+]?[0-9]+)?[fFlLuU]*"
+	 "|[0-9][0-9.']*([Ee][-+]?[0-9]+)?[fFlLuU]*"
 	 /* hexadecimal and binary integers */
-	 "|0[xXbB][0-9a-fA-F]+[lLuU]*"
+	 "|0[xXbB][0-9a-fA-F']+[lLuU]*"
 	 /* floatingpoint numbers that begin with a decimal point */
-	 "|\\.[0-9]+([Ee][-+]?[0-9]+)?[fFlL]?"
+	 "|\\.[0-9][0-9']*([Ee][-+]?[0-9]+)?[fFlL]?"
 	 "|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->\\*?|\\.\\*"),
 PATTERNS("csharp",
 	 /* Keywords */
-- 
gitgitgadget



* [PATCH v3 6/6] userdiff-cpp: learn the C++ spaceship operator
  2021-10-10 17:02   ` [PATCH v3 0/6] " Johannes Sixt via GitGitGadget
                       ` (4 preceding siblings ...)
  2021-10-10 17:03     ` [PATCH v3 5/6] userdiff-cpp: permit the digit-separating single-quote in numbers Johannes Sixt via GitGitGadget
@ 2021-10-10 17:03     ` Johannes Sixt via GitGitGadget
  2021-10-24  9:56     ` [PATCH 7/6] userdiff-cpp: back out the digit-separators in numbers Johannes Sixt
  6 siblings, 0 replies; 24+ messages in thread
From: Johannes Sixt via GitGitGadget @ 2021-10-10 17:03 UTC (permalink / raw)
  To: git; +Cc: Ævar Arnfjörð Bjarmason, Johannes Sixt,
	Johannes Sixt

From: Johannes Sixt <j6t@kdbg.org>

Since C++20, the language has a generalized comparison operator <=>.
Teach the cpp driver not to separate it into <= and > tokens.

Signed-off-by: Johannes Sixt <j6t@kdbg.org>
---
 t/t4034/cpp/expect | 2 +-
 userdiff.c         | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)
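
A small sketch, not part of the patch, of why appending "|<=>" at the
end of the alternation seems to be sufficient: as far as I understand
POSIX regexec(), it reports the longest of the leftmost matches
regardless of the order of the alternatives. The operator alternatives
are copied from the userdiff.c hunk below; the names OPS_BEFORE,
OPS_AFTER and first_token_len() are made up for this illustration.

```c
#include <assert.h>
#include <regex.h>

/* Operator alternatives of the cpp word regex before this patch. */
#define OPS_BEFORE \
	"[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->\\*?|\\.\\*"

/* The same alternatives with the spaceship operator appended. */
#define OPS_AFTER OPS_BEFORE "|<=>"

/*
 * Hypothetical helper: length of the first token that `pattern` finds
 * in `text`, or -1 if nothing matches.
 */
static long first_token_len(const char *pattern, const char *text)
{
	regex_t re;
	regmatch_t m;
	long len = -1;
	int rc = regcomp(&re, pattern, REG_EXTENDED);

	/* The patterns are fixed strings, so a compile failure is a bug. */
	assert(rc == 0);
	(void)rc;

	if (regexec(&re, text, 1, &m, 0) == 0)
		len = (long)(m.rm_eo - m.rm_so);
	regfree(&re);
	return len;
}
```

For the input "i<=>j", first_token_len(OPS_BEFORE, ...) should report 2
(only "<=" is a token) and first_token_len(OPS_AFTER, ...) should
report 3. A plain "<=" remains a two-character token with either
pattern.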

diff --git a/t/t4034/cpp/expect b/t/t4034/cpp/expect
index b90b3f207bf..5ff4ce477b4 100644
--- a/t/t4034/cpp/expect
+++ b/t/t4034/cpp/expect
@@ -25,7 +25,7 @@ str.e+<RED>65<RESET><GREEN>75<RESET>
 a<RED>*<RESET><GREEN>*=<RESET>b c<RED>/<RESET><GREEN>/=<RESET>d e<RED>%<RESET><GREEN>%=<RESET>f
 a<RED>+<RESET><GREEN>++<RESET>b c<RED>-<RESET><GREEN>--<RESET>d
 a<RED><<<RESET><GREEN><<=<RESET>b c<RED>>><RESET><GREEN>>>=<RESET>d
-a<RED><<RESET><GREEN><=<RESET>b c<RED><=<RESET><GREEN><<RESET>d e<RED>><RESET><GREEN>>=<RESET>f g<RED>>=<RESET><GREEN>><RESET>h i<=<GREEN>><RESET>j
+a<RED><<RESET><GREEN><=<RESET>b c<RED><=<RESET><GREEN><<RESET>d e<RED>><RESET><GREEN>>=<RESET>f g<RED>>=<RESET><GREEN>><RESET>h i<RED><=<RESET><GREEN><=><RESET>j
 a<RED>==<RESET><GREEN>!=<RESET>b c<RED>!=<RESET><GREEN>=<RESET>d
 a<RED>^<RESET><GREEN>^=<RESET>b c<RED>|<RESET><GREEN>|=<RESET>d e<RED>&&<RESET><GREEN>&=<RESET>f
 a<RED>||<RESET><GREEN>|<RESET>b
diff --git a/userdiff.c b/userdiff.c
index 5072d12e51e..96adddd6f9a 100644
--- a/userdiff.c
+++ b/userdiff.c
@@ -62,7 +62,7 @@ PATTERNS("cpp",
 	 "|0[xXbB][0-9a-fA-F']+[lLuU]*"
 	 /* floatingpoint numbers that begin with a decimal point */
 	 "|\\.[0-9][0-9']*([Ee][-+]?[0-9]+)?[fFlL]?"
-	 "|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->\\*?|\\.\\*"),
+	 "|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->\\*?|\\.\\*|<=>"),
 PATTERNS("csharp",
 	 /* Keywords */
 	 "!^[ \t]*(do|while|for|if|else|instanceof|new|return|switch|case|throw|catch|using)\n"
-- 
gitgitgadget


* Re: [PATCH v2 0/5] Fun with cpp word regex
  2021-10-09  0:00       ` Ævar Arnfjörð Bjarmason
@ 2021-10-10 20:15         ` Johannes Sixt
  0 siblings, 0 replies; 24+ messages in thread
From: Johannes Sixt @ 2021-10-10 20:15 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Johannes Sixt via GitGitGadget

On 09.10.21 at 02:00, Ævar Arnfjörð Bjarmason wrote:
> An example of something the cpp driver handles a bit badly is parts of
> my 66f5f6dca95 (C style: use standard style for "TRANSLATORS" comments,
> 2017-05-11).
> 
> I.e. there I was changing a comment format and added a full stop to a
> sentence; the word-diff is:
> 
>         /*
>          {+*+} TRANSLATORS: here is a comment that explains the string to
>          {+*+} be translated, that follows immediately after [-it-]{+it.+}
>          */
> 
> Even though it has nothing to do with C syntax per se, that would be
> much more useful as:
> 
>         /*
>          {+*+} TRANSLATORS: here is a comment that explains the string to
>          {+*+} be translated, that follows immediately after it{+.+}
>          */
> 
> I.e. treating a "." at the end of a word specially isn't C or C++
> syntax, but it's absolutely input that the cpp driver *is* getting and
> should, if possible, handle well.

FWIW, I wondered why the cpp driver does not handle this case as you
expect, and the answer is that it actually does when this comment
appears in a C file. This particular word-diff occurs in
Documentation/CodingGuidelines, which is not covered by the cpp driver.

-- Hannes


* [PATCH 7/6] userdiff-cpp: back out the digit-separators in numbers
  2021-10-10 17:02   ` [PATCH v3 0/6] " Johannes Sixt via GitGitGadget
                       ` (5 preceding siblings ...)
  2021-10-10 17:03     ` [PATCH v3 6/6] userdiff-cpp: learn the C++ spaceship operator Johannes Sixt via GitGitGadget
@ 2021-10-24  9:56     ` Johannes Sixt
  6 siblings, 0 replies; 24+ messages in thread
From: Johannes Sixt @ 2021-10-24  9:56 UTC (permalink / raw)
  To: git; +Cc: Ævar Arnfjörð Bjarmason,
	Johannes Sixt via GitGitGadget

The implementation of digit-separating single-quotes introduced a
noteworthy regression: a change to a character literal that contains
a digit splices the digit and the closing single-quote into one token.
For example, the change from 'a' to '2' is now tokenized as
'[-a'-]{+2'+} instead of '[-a-]{+2+}'.

The options to fix the regression are:

- Tighten the regular expression such that the single-quote can only
  occur between digits (that would match the official syntax).

- Remove support for digit separators.

I chose to remove support, because

- I have not seen a lot of code make use of digit separators.

- If code does use digit separators, then the numbers are typically
  long. If a change occurs in one of the segments, it is actually
  easier to see when only that segment is highlighted as the word
  that changed instead of the whole long number.

This choice does introduce another minor regression, though, which
is highlighted in the test case: when a change occurs in the second
or later segment of a hexadecimal number and that segment begins
with a digit but also contains letters, the segment is mistaken for
a number followed by an identifier. I can live with that.

Signed-off-by: Johannes Sixt <j6t@kdbg.org>
---
 t/t4034/cpp/expect | 12 ++++++------
 t/t4034/cpp/post   | 10 +++++-----
 t/t4034/cpp/pre    |  8 ++++----
 userdiff.c         |  6 +++---
 4 files changed, 18 insertions(+), 18 deletions(-)
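
A minimal sketch of the regression and of the two options, not part of
the patch. DEC_WITH_QUOTE and DEC_BACKED_OUT are the decimal alternative
before and after the userdiff.c hunk below. DEC_TIGHT is only a guess at
what the first option (single-quote permitted only in front of a digit)
could look like; it is hypothetical and has not been run against t4034.
The struct and helper names are made up for this illustration.

```c
#include <assert.h>
#include <regex.h>

/* Decimal alternative with the digit separator permitted anywhere. */
static const char DEC_WITH_QUOTE[] =
	"[0-9][0-9.']*([Ee][-+]?[0-9]+)?[fFlLuU]*";

/* Decimal alternative after backing out the digit separator. */
static const char DEC_BACKED_OUT[] =
	"[0-9][0-9.]*([Ee][-+]?[0-9]+)?[fFlLuU]*";

/* Hypothetical tighter variant: a single-quote must be followed by a digit. */
static const char DEC_TIGHT[] =
	"[0-9][0-9.]*('[0-9][0-9.]*)*([Ee][-+]?[0-9]+)?[fFlLuU]*";

/* Position and length of a token; len is -1 if nothing matched. */
struct token {
	long start;
	long len;
};

/* Hypothetical helper: the first token that `pattern` finds in `text`. */
static struct token first_token(const char *pattern, const char *text)
{
	regex_t re;
	regmatch_t m;
	struct token t = { -1, -1 };
	int rc = regcomp(&re, pattern, REG_EXTENDED);

	/* The patterns are fixed strings, so a compile failure is a bug. */
	assert(rc == 0);
	(void)rc;

	if (regexec(&re, text, 1, &m, 0) == 0) {
		t.start = (long)m.rm_so;
		t.len = (long)(m.rm_eo - m.rm_so);
	}
	regfree(&re);
	return t;
}
```

For the character literal '2', first_token(DEC_WITH_QUOTE, "'2'")
should find a token of length 2 at offset 1, i.e. the digit together
with the closing single-quote, whereas DEC_BACKED_OUT and DEC_TIGHT
should find only the digit. DEC_TIGHT should still take 1'000'000 as a
single token of length 9.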

diff --git a/t/t4034/cpp/expect b/t/t4034/cpp/expect
index 5ff4ce477b..dc500ae092 100644
--- a/t/t4034/cpp/expect
+++ b/t/t4034/cpp/expect
@@ -1,21 +1,21 @@
 <BOLD>diff --git a/pre b/post<RESET>
-<BOLD>index 144cd98..64e78af 100644<RESET>
+<BOLD>index a1a09b7..f1b6f3c 100644<RESET>
 <BOLD>--- a/pre<RESET>
 <BOLD>+++ b/post<RESET>
 <CYAN>@@ -1,30 +1,30 @@<RESET>
 Foo() : x(0<RED>&&1<RESET><GREEN>&42<RESET>) { <RED>foo0<RESET><GREEN>bar<RESET>(x.<RED>find<RESET><GREEN>Find<RESET>); }
 cout<<"Hello World<RED>!<RESET><GREEN>?<RESET>\n"<<endl;
-<GREEN>(<RESET>1 <RED>-<RESET><GREEN>+<RESET>1e10 0xabcdef<GREEN>)<RESET> '<RED>x<RESET><GREEN>.<RESET>'
+<GREEN>(<RESET>1 <RED>-<RESET><GREEN>+<RESET>1e10 0xabcdef<GREEN>)<RESET> '<RED>x<RESET><GREEN>2<RESET>'
 // long double<RESET>
-<RED>3.141'592'653e-10l<RESET><GREEN>3.141'592'654e+10l<RESET>
+<RED>3.141592653e-10l<RESET><GREEN>3.141592654e+10l<RESET>
 // float<RESET>
 <RED>120E5f<RESET><GREEN>120E6f<RESET>
 // hex<RESET>
-<RED>0xdead'beaf<RESET><GREEN>0xdead'Beaf<RESET>+<RED>8ULL<RESET><GREEN>7ULL<RESET>
+<RED>0xdead<RESET><GREEN>0xdeaf<RESET>'1<RED>eaF<RESET><GREEN>eaf<RESET>+<RED>8ULL<RESET><GREEN>7ULL<RESET>
 // octal<RESET>
-<RED>0123'4567<RESET><GREEN>0123'4560<RESET>
+<RED>01234567<RESET><GREEN>01234560<RESET>
 // binary<RESET>
-<RED>0b10'00<RESET><GREEN>0b11'00<RESET>+e1
+<RED>0b1000<RESET><GREEN>0b1100<RESET>+e1
 // expression<RESET>
 1.5-e+<RED>2<RESET><GREEN>3<RESET>+f
 // another one<RESET>
diff --git a/t/t4034/cpp/post b/t/t4034/cpp/post
index 64e78afbfb..f1b6f3c228 100644
--- a/t/t4034/cpp/post
+++ b/t/t4034/cpp/post
@@ -1,16 +1,16 @@
 Foo() : x(0&42) { bar(x.Find); }
 cout<<"Hello World?\n"<<endl;
-(1 +1e10 0xabcdef) '.'
+(1 +1e10 0xabcdef) '2'
 // long double
-3.141'592'654e+10l
+3.141592654e+10l
 // float
 120E6f
 // hex
-0xdead'Beaf+7ULL
+0xdeaf'1eaf+7ULL
 // octal
-0123'4560
+01234560
 // binary
-0b11'00+e1
+0b1100+e1
 // expression
 1.5-e+3+f
 // another one
diff --git a/t/t4034/cpp/pre b/t/t4034/cpp/pre
index 144cd980d6..a1a09b7712 100644
--- a/t/t4034/cpp/pre
+++ b/t/t4034/cpp/pre
@@ -2,15 +2,15 @@ Foo():x(0&&1){ foo0( x.find); }
 cout<<"Hello World!\n"<<endl;
 1 -1e10 0xabcdef 'x'
 // long double
-3.141'592'653e-10l
+3.141592653e-10l
 // float
 120E5f
 // hex
-0xdead'beaf+8ULL
+0xdead'1eaF+8ULL
 // octal
-0123'4567
+01234567
 // binary
-0b10'00+e1
+0b1000+e1
 // expression
 1.5-e+2+f
 // another one
diff --git a/userdiff.c b/userdiff.c
index 7b143ef36b..8578cb0d12 100644
--- a/userdiff.c
+++ b/userdiff.c
@@ -67,11 +67,11 @@
 	 /* identifiers and keywords */
 	 "[a-zA-Z_][a-zA-Z0-9_]*"
 	 /* decimal and octal integers as well as floatingpoint numbers */
-	 "|[0-9][0-9.']*([Ee][-+]?[0-9]+)?[fFlLuU]*"
+	 "|[0-9][0-9.]*([Ee][-+]?[0-9]+)?[fFlLuU]*"
 	 /* hexadecimal and binary integers */
-	 "|0[xXbB][0-9a-fA-F']+[lLuU]*"
+	 "|0[xXbB][0-9a-fA-F]+[lLuU]*"
 	 /* floatingpoint numbers that begin with a decimal point */
-	 "|\\.[0-9][0-9']*([Ee][-+]?[0-9]+)?[fFlL]?"
+	 "|\\.[0-9][0-9]*([Ee][-+]?[0-9]+)?[fFlL]?"
 	 "|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->\\*?|\\.\\*|<=>"),
 PATTERNS("csharp",
 	 /* Keywords */
-- 
2.33.0.129.g739793498e


Thread overview: 24+ messages
2021-10-07  6:50 [PATCH 0/3] Fun with cpp word regex Johannes Sixt via GitGitGadget
2021-10-07  6:50 ` [PATCH 1/3] userdiff: tighten " Johannes Sixt via GitGitGadget
2021-10-07  6:50 ` [PATCH 2/3] userdiff: permit the digit-separating single-quote in numbers Johannes Sixt via GitGitGadget
2021-10-07  6:51 ` [PATCH 3/3] userdiff: learn the C++ spaceship operator Johannes Sixt via GitGitGadget
2021-10-07  9:14 ` [PATCH 0/3] Fun with cpp word regex Ævar Arnfjörð Bjarmason
2021-10-07 16:40   ` Johannes Sixt
2021-10-08 19:09 ` [PATCH v2 0/5] " Johannes Sixt via GitGitGadget
2021-10-08 19:09   ` [PATCH v2 1/5] t4034/cpp: actually test that operator tokens are not split Johannes Sixt via GitGitGadget
2021-10-08 19:09   ` [PATCH v2 2/5] t4034: add tests showing problematic cpp tokenizations Johannes Sixt via GitGitGadget
2021-10-08 19:09   ` [PATCH v2 3/5] userdiff-cpp: tighten word regex Johannes Sixt via GitGitGadget
2021-10-08 19:09   ` [PATCH v2 4/5] userdiff-cpp: permit the digit-separating single-quote in numbers Johannes Sixt via GitGitGadget
2021-10-08 19:09   ` [PATCH v2 5/5] userdiff-cpp: learn the C++ spaceship operator Johannes Sixt via GitGitGadget
2021-10-08 20:07   ` [PATCH v2 0/5] Fun with cpp word regex Ævar Arnfjörð Bjarmason
2021-10-08 22:11     ` Johannes Sixt
2021-10-09  0:00       ` Ævar Arnfjörð Bjarmason
2021-10-10 20:15         ` Johannes Sixt
2021-10-10 17:02   ` [PATCH v3 0/6] " Johannes Sixt via GitGitGadget
2021-10-10 17:02     ` [PATCH v3 1/6] t4034/cpp: actually test that operator tokens are not split Johannes Sixt via GitGitGadget
2021-10-10 17:03     ` [PATCH v3 2/6] t4034: add tests showing problematic cpp tokenizations Johannes Sixt via GitGitGadget
2021-10-10 17:03     ` [PATCH v3 3/6] userdiff-cpp: tighten word regex Johannes Sixt via GitGitGadget
2021-10-10 17:03     ` [PATCH v3 4/6] userdiff-cpp: prepare test cases with yet unsupported features Johannes Sixt via GitGitGadget
2021-10-10 17:03     ` [PATCH v3 5/6] userdiff-cpp: permit the digit-separating single-quote in numbers Johannes Sixt via GitGitGadget
2021-10-10 17:03     ` [PATCH v3 6/6] userdiff-cpp: learn the C++ spaceship operator Johannes Sixt via GitGitGadget
2021-10-24  9:56     ` [PATCH 7/6] userdiff-cpp: back out the digit-separators in numbers Johannes Sixt
