From: Johannes Sixt <j6t@kdbg.org>
To: Steven Jeuris via GitGitGadget <gitgitgadget@gmail.com>,
git@vger.kernel.org
Cc: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
"Jeff King" <peff@peff.net>, "Linus Arver" <linusa@google.com>,
"Steven Jeuris" <steven.jeuris@gmail.com>,
"Steven Jeuris" <steven.jeuris@3shape.com>
Subject: Re: [PATCH v4] userdiff: better method/property matching for C#
Date: Sat, 30 Mar 2024 19:49:02 +0100 [thread overview]
Message-ID: <bb4fe56f-f45d-404a-834e-31ba877641c7@kdbg.org> (raw)
In-Reply-To: <pull.1682.v4.git.git.1711653257043.gitgitgadget@gmail.com>
Am 28.03.24 um 20:14 schrieb Steven Jeuris via GitGitGadget:
> diff --git a/userdiff.c b/userdiff.c
> index e399543823b..35e0e1183f7 100644
> --- a/userdiff.c
> +++ b/userdiff.c
> @@ -89,12 +89,46 @@ PATTERNS("cpp",
> "|\\.[0-9][0-9]*([Ee][-+]?[0-9]+)?[fFlL]?"
> "|[-+*/<>%&^|=!]=|--|\\+\\+|<<=?|>>=?|&&|\\|\\||::|->\\*?|\\.\\*|<=>"),
> PATTERNS("csharp",
> - /* Keywords */
> - "!^[ \t]*(do|while|for|if|else|instanceof|new|return|switch|case|throw|catch|using)\n"
> - /* Methods and constructors */
> - "^[ \t]*(((static|public|internal|private|protected|new|virtual|sealed|override|unsafe|async)[ \t]+)*[][<>@.~_[:alnum:]]+[ \t]+[<>@._[:alnum:]]+[ \t]*\\(.*\\))[ \t]*$\n"
> - /* Properties */
> - "^[ \t]*(((static|public|internal|private|protected|new|virtual|sealed|override|unsafe)[ \t]+)*[][<>@.~_[:alnum:]]+[ \t]+[@._[:alnum:]]+)[ \t]*$\n"
> + /*
> + * Jump over reserved keywords which are illegal method names, but which
> + * can be followed by parentheses without special characters in between,
> + * making them look like methods.
> + */
> + "!(^|[ \t]+)" /* Start of line or whitespace. */
> + "(do|while|for|foreach|if|else|new|default|return|switch|case|throw"
> + "|catch|using|lock|fixed)"
> + "([ \t(]+|$)\n" /* Whitespace, "(", or end of line. */
> + /*
> + * Methods/constructors:
> + * The strategy is to identify a minimum of two groups (any combination
> + * of keywords/type/name) before the opening parenthesis, and without
> + * final unexpected characters, normally only used in ordinary statements.
> + */
> + "^[ \t]*" /* Remove leading whitespace. */
> + "(" /* Start chunk header capture. */
> + "(" /* Group of keywords/type/names. */
> + "([][[:alnum:]@_<>.]|, [ |\t]*)+" /* Space only allowed after ",". */
Mental note: This pattern means:
a sequence of at least one of
(
one character of a certain set
OR
a comma
followed by SP
optionally followed by whitespace or |
)
> + "[ \t]+" /* One required space forces a minimum of two items. */
OK.
> + "([][[:alnum:]@_<>.]|, [ |\t]*)+"
Same here.
Is it the case of t4018/csharp-method-generics?
Example<int, string> Method<TA, TB>(TA RIGHT, TB b)
Let me see if I make sense of this. The idea is to treat the ', '
sequence as a single "character" so that the SP in this sequence does
not count as the word separator that we otherwise have between two plain
words.
I am unsure whether this is a workable solution. The set of symbols that
can occur before a method definition overlaps too much with the symbols
that can occur elsewhere. For example, if you have these lines of code:
something(arg,
meth, func(x),
prop, y,
more);
you have a false positive in the lines in the middle.
I have the feeling that it is impossible to distinguish method
definitions from other code reliably. We can go back and forth a while,
but at some point we have to stop and accept that there are false
positives. Where that point is, I cannot tell, because I do not know
what is common and what is uncommon in typical C# code. It is your call.
> + "[ \t]*" /* Optional space before parameters start. */
> + ")+"
> + "\\(" /* Start of method parameters. */
> + "[^;]*" /* Allow complex parameters, but exclude statements (;). */
> + ")$\n" /* Close chunk header capture. */
> + /*
> + * Properties:
> + * As with methods, expect a minimum of two groups. But, more trivial than
> + * methods, the vast majority of properties long enough to be worth
> + * showing a chunk header for don't include "=:;,()" on the line they are
> + * defined, since they don't have a parameter list.
> + */
> + "^[ \t]*("
> + "("
> + "([][[:alnum:]@_<>.]|, [ |\t]*)+[ \t]+"
> + "([][[:alnum:]@_<>.]|, [ |\t]*)+[ \t]*"
> + ")+" /* Up to here, same as methods regex. */
Is the intent to match pairs of words? Or is the intent to find at least
two words? If the latter, then the preceding 4 lines are better written as
"([][[:alnum:]@_<>.]|,[ \t]+)+"
"([ \t]+([][[:alnum:]@_<>.]|,[ \t]+)+)+"
If the former, then the space after the second word must not be
optional, because it would match
a bc d
as two pairs
(a b)(c d)
BTW that there is an | symbol in the potential space after the comma is
certainly an oversight, right? And, why is a tab after the comma not
permitted?
When you construct regular expressions, make sure that repeated parts
always begin with a mandatory part and that this mandatory part cannot
be an optional part at the end of the preceding or the repeated pattern.
Otherwise, in a non-matching case you send the matcher into an expensive
backtracking loop.
> + "[^;=:,()]*" /* Compared to methods, no parameter list allowed. */
> + ")$\n"
> /* Type definitions */
> "^[ \t]*(((static|public|internal|private|protected|new|unsafe|sealed|abstract|partial)[ \t]+)*(class|enum|interface|struct|record)[ \t]+.*)$\n"
> /* Namespace */
>
> base-commit: f41f85c9ec8d4d46de0fd5fded88db94d3ec8c11
next prev parent reply other threads:[~2024-03-30 18:49 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-02-25 17:33 [PATCH] userdiff: better method/property matching for C# Steven Jeuris via GitGitGadget
2024-03-06 20:21 ` [PATCH v2] " Steven Jeuris via GitGitGadget
2024-03-07 2:11 ` Junio C Hamano
2024-03-16 18:14 ` Linus Arver
2024-03-26 21:38 ` Junio C Hamano
2024-03-27 8:40 ` Jeff King
2024-03-27 7:30 ` Johannes Sixt
2024-03-28 8:07 ` [PATCH v3] " Steven Jeuris via GitGitGadget
2024-03-28 19:14 ` [PATCH v4] " Steven Jeuris via GitGitGadget
2024-03-28 19:33 ` Junio C Hamano
2024-03-30 18:49 ` Johannes Sixt [this message]
2024-04-03 21:42 ` [PATCH v5] " Steven Jeuris via GitGitGadget
2024-04-05 22:02 ` Johannes Sixt
2024-04-05 22:10 ` Junio C Hamano
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=bb4fe56f-f45d-404a-834e-31ba877641c7@kdbg.org \
--to=j6t@kdbg.org \
--cc=avarab@gmail.com \
--cc=git@vger.kernel.org \
--cc=gitgitgadget@gmail.com \
--cc=linusa@google.com \
--cc=peff@peff.net \
--cc=steven.jeuris@3shape.com \
--cc=steven.jeuris@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).