git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Johannes Sixt <j6t@kdbg.org>
To: Steven Jeuris via GitGitGadget <gitgitgadget@gmail.com>,
	git@vger.kernel.org
Cc: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Jeff King" <peff@peff.net>, "Linus Arver" <linusa@google.com>,
	"Steven Jeuris" <steven.jeuris@gmail.com>,
	"Steven Jeuris" <steven.jeuris@3shape.com>
Subject: Re: [PATCH v4] userdiff: better method/property matching for C#
Date: Sat, 30 Mar 2024 19:49:02 +0100	[thread overview]
Message-ID: <bb4fe56f-f45d-404a-834e-31ba877641c7@kdbg.org> (raw)
In-Reply-To: <pull.1682.v4.git.git.1711653257043.gitgitgadget@gmail.com>

Am 28.03.24 um 20:14 schrieb Steven Jeuris via GitGitGadget:
> diff --git a/userdiff.c b/userdiff.c
> index e399543823b..35e0e1183f7 100644
> --- a/userdiff.c
> +++ b/userdiff.c
> @@ -89,12 +89,46 @@ PATTERNS("cpp",
>  	 "|\\.[0-9][0-9]*([Ee][-+]?[0-9]+)?[fFlL]?"
>  	 "|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->\\*?|\\.\\*|<=>"),
>  PATTERNS("csharp",
> -	 /* Keywords */
> -	 "!^[ \t]*(do|while|for|if|else|instanceof|new|return|switch|case|throw|catch|using)\n"
> -	 /* Methods and constructors */
> -	 "^[ \t]*(((static|public|internal|private|protected|new|virtual|sealed|override|unsafe|async)[ \t]+)*[][<>@.~_[:alnum:]]+[ \t]+[<>@._[:alnum:]]+[ \t]*\\(.*\\))[ \t]*$\n"
> -	 /* Properties */
> -	 "^[ \t]*(((static|public|internal|private|protected|new|virtual|sealed|override|unsafe)[ \t]+)*[][<>@.~_[:alnum:]]+[ \t]+[@._[:alnum:]]+)[ \t]*$\n"
> +	 /*
> +	  * Jump over reserved keywords which are illegal method names, but which
> +	  * can be followed by parentheses without special characters in between,
> +	  * making them look like methods.
> +	  */
> +	 "!(^|[ \t]+)" /* Start of line or whitespace. */
> +		"(do|while|for|foreach|if|else|new|default|return|switch|case|throw"
> +		"|catch|using|lock|fixed)"
> +		"([ \t(]+|$)\n" /* Whitespace, "(", or end of line. */
> +	 /*
> +	  * Methods/constructors:
> +	  * The strategy is to identify a minimum of two groups (any combination
> +	  * of keywords/type/name) before the opening parenthesis, and without
> +	  * final unexpected characters, normally only used in ordinary statements.
> +	  */
> +	 "^[ \t]*" /* Remove leading whitespace. */
> +		"(" /* Start chunk header capture. */
> +		"(" /* Group of keywords/type/names. */
> +		"([][[:alnum:]@_<>.]|, [ |\t]*)+" /* Space only allowed after ",". */

Mental note: This pattern means:

   a sequence of at least one of
   (
       one character of a certain set
       OR
       a comma
          followed by SP
          optionally followed by whitespace or |
   )

> +		"[ \t]+" /* One required space forces a minimum of two items. */

OK.

> +		"([][[:alnum:]@_<>.]|, [ |\t]*)+"

Same here.

Is it the case of t4018/csharp-method-generics?

    Example<int, string> Method<TA, TB>(TA RIGHT, TB b)

Let me see if I make sense of this. The idea is to treat the ', '
sequence as a single "character" so that the SP in this sequence does
not count as the word separator that we otherwise have between two plain
words.

I am unsure whether this is a workable solution. The set of symbols that
can occur before a method definition overlaps too much with the symbols
that can occur elsewhere. For example, if you have these lines of code:

    something(arg,
        meth, func(x),
        prop, y,
        more);

you have a false positive in the lines in the middle.

I have the feeling that it is impossible to distinguish method
definitions from other code reliably. We can go back and forth a while,
but at some point we have to stop and accept that there are false
positives. Where that point is, I cannot tell, because I do not know
what is common and what is uncommon in typical C# code. It is your call.

> +		"[ \t]*" /* Optional space before parameters start. */
> +		")+"
> +		"\\(" /* Start of method parameters. */
> +		"[^;]*" /* Allow complex parameters, but exclude statements (;). */
> +		")$\n" /* Close chunk header capture. */
> +	 /*
> +	  * Properties:
> +	  * As with methods, expect a minimum of two groups. But, more trivial than
> +	  * methods, the vast majority of properties long enough to be worth
> +	  * showing a chunk header for don't include "=:;,()" on the line they are
> +	  * defined, since they don't have a parameter list.
> +	  */
> +	 "^[ \t]*("
> +		"("
> +		"([][[:alnum:]@_<>.]|, [ |\t]*)+[ \t]+"
> +		"([][[:alnum:]@_<>.]|, [ |\t]*)+[ \t]*"
> +		")+" /* Up to here, same as methods regex. */

Is the intent to match pairs of words? Or is the intent to find at least
two words? If the latter, then the preceding 4 lines are better written as

		       "([][[:alnum:]@_<>.]|,[ \t]+)+"
		"([ \t]+([][[:alnum:]@_<>.]|,[ \t]+)+)+"

If the former, then the space after the second word must not be
optional, because it would match

    a bc d

as two pairs

    (a b)(c d)

BTW that there is an | symbol in the potential space after the comma is
certainly an oversight, right? And, why is a tab after the comma not
permitted?

When you construct regular expressions, make sure that repeated parts
always begin with a mandatory part and that this mandatory part cannot
be an optional part at the end of the preceding or the repeated pattern.
Otherwise, in a non-matching case you send the matcher into an expensive
backtracking loop.

> +		"[^;=:,()]*" /* Compared to methods, no parameter list allowed. */
> +		")$\n"
>  	 /* Type definitions */
>  	 "^[ \t]*(((static|public|internal|private|protected|new|unsafe|sealed|abstract|partial)[ \t]+)*(class|enum|interface|struct|record)[ \t]+.*)$\n"
>  	 /* Namespace */
> 
> base-commit: f41f85c9ec8d4d46de0fd5fded88db94d3ec8c11


  parent reply	other threads:[~2024-03-30 18:49 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-02-25 17:33 [PATCH] userdiff: better method/property matching for C# Steven Jeuris via GitGitGadget
2024-03-06 20:21 ` [PATCH v2] " Steven Jeuris via GitGitGadget
2024-03-07  2:11   ` Junio C Hamano
2024-03-16 18:14   ` Linus Arver
2024-03-26 21:38   ` Junio C Hamano
2024-03-27  8:40     ` Jeff King
2024-03-27  7:30   ` Johannes Sixt
2024-03-28  8:07   ` [PATCH v3] " Steven Jeuris via GitGitGadget
2024-03-28 19:14     ` [PATCH v4] " Steven Jeuris via GitGitGadget
2024-03-28 19:33       ` Junio C Hamano
2024-03-30 18:49       ` Johannes Sixt [this message]
2024-04-03 21:42       ` [PATCH v5] " Steven Jeuris via GitGitGadget
2024-04-05 22:02         ` Johannes Sixt
2024-04-05 22:10           ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=bb4fe56f-f45d-404a-834e-31ba877641c7@kdbg.org \
    --to=j6t@kdbg.org \
    --cc=avarab@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=linusa@google.com \
    --cc=peff@peff.net \
    --cc=steven.jeuris@3shape.com \
    --cc=steven.jeuris@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).