[RFC 0/1] Leading whitespace as a function identification heuristic?

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [RFC 0/1] Leading whitespace as a function identification heuristic?
@ 2020-09-23 21:59 Ryan Zoeller
  2020-09-23 21:59 ` [RFC 1/1] xdiff: use leading whitespace in function heuristic Ryan Zoeller
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Ryan Zoeller @ 2020-09-23 21:59 UTC (permalink / raw)
  To: git; +Cc: Ryan Zoeller

I've recently been annoyed by git's choice of lines when invoking
`git diff --function-context`. git isn't doing a _bad_ job per se,
given that it's using regular expressions to parse nonregular languages,
but in my opinion it could do better.

The issue I'm seeing is that constructs like nested functions cause
`git diff --function-context` to stop taking lines prematurely; a line
containing a function declaration has been reached, but it's not the one
I'm expecting git to stop at. This results in some of the function being
omitted from the output of `git diff --function-context`, and also
manifests itself as the incorrect function being listed in the hunk header
in general.

A trivial example: in a Python file containing the following contents,
modifying the indicated line will result in git picking `bar()` for the
function header, rather than `foo()`. This is obviously worse for longer
functions, or code with multiple inner functions or classes.

```
def foo(xs, i):
    def bar(x):
        return x + i

    # Intentionally not using a lambda,
    # because the issue occurs for inner functions.
    # Making this comment extra long so the function name
    # doesn't get automatically included in the context.

    return map(bar, xs) # Modify this line
```

Even without properly parsing the file as code, git could use
indentation as a heuristic to determine which lines containing functions
are relevant. I've used Python as an example here, but I think the
issue and heuristic apply to many languages.

This patch implements a proof of concept for this heuristic: only
lines which are indented less than the hunk are considered eligible
to contain function definitions. It only supports `git diff`,
ignoring things like `git grep` for simplicity (and introducing
`-1`s to the diff in leiu of handling them).

While I welcome feedback on the existing implementation, I'm really
requesting commentary on the following:

1. Is this indentation-aware function identification useful, and
    generally worth pursuing further?
2. What level of granularity would be desirable for enabling this?
    Something akin to xfuncname in .gitattributes, where it can be
    enabled per path?
3. How do we handle files with a mix of tabs and spaces?

Best,
Ryan

Ryan Zoeller (1):
  xdiff: use leading whitespace in function heuristic

 grep.c            |  2 +-
 line-range.c      |  2 +-
 xdiff-interface.c | 14 +++++++++++++-
 xdiff/xdiff.h     |  2 +-
 xdiff/xemit.c     | 23 +++++++++++++++++------
 5 files changed, 33 insertions(+), 10 deletions(-)

-- 
2.28.0.586.g47c91ef7fe

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [RFC 1/1] xdiff: use leading whitespace in function heuristic
  2020-09-23 21:59 [RFC 0/1] Leading whitespace as a function identification heuristic? Ryan Zoeller
@ 2020-09-23 21:59 ` Ryan Zoeller
  2020-09-24  6:45 ` [RFC 0/1] Leading whitespace as a function identification heuristic? Junio C Hamano
  2020-09-25 18:12 ` Johannes Sixt
  2 siblings, 0 replies; 10+ messages in thread
From: Ryan Zoeller @ 2020-09-23 21:59 UTC (permalink / raw)
  To: git; +Cc: Ryan Zoeller

The regular expressions specified in userdiff.c, as well as user-defined
expressions, allow git to detect which lines of code which declare functions
(as well as other notable lines, such as class declarations).

Although useful, these regular expressions can't identify which function
a line of code belongs to, only the closest function to it.
Languages which allow for nested functions -- or functions inside of
classes -- can trip this mechanism up.

Since many languages use indentation to associate lines of code with a
function (either semantically or cosmetically), we can use indentation
as an additional heuristic for identifying the owning function.

Specifically, this assumes code belongs to a function which is less
indented than it.

Signed-off-by: Ryan Zoeller <rtzoeller@rtzoeller.com>
---
 grep.c            |  2 +-
 line-range.c      |  2 +-
 xdiff-interface.c | 14 +++++++++++++-
 xdiff/xdiff.h     |  2 +-
 xdiff/xemit.c     | 23 +++++++++++++++++------
 5 files changed, 33 insertions(+), 10 deletions(-)

diff --git a/grep.c b/grep.c
index 54af9f813e..3281f19977 100644
--- a/grep.c
+++ b/grep.c
@@ -1555,7 +1555,7 @@ static int match_funcname(struct grep_opt *opt, struct grep_source *gs, char *bo
 
 	if (xecfg) {
 		char buf[1];
-		return xecfg->find_func(bol, eol - bol, buf, 1,
+		return xecfg->find_func(bol, eol - bol, buf, 1, -1,
 					xecfg->find_func_priv) >= 0;
 	}
 
diff --git a/line-range.c b/line-range.c
index 9b50583dc0..eb9540bc76 100644
--- a/line-range.c
+++ b/line-range.c
@@ -119,7 +119,7 @@ static int match_funcname(xdemitconf_t *xecfg, const char *bol, const char *eol)
 {
 	if (xecfg) {
 		char buf[1];
-		return xecfg->find_func(bol, eol - bol, buf, 1,
+		return xecfg->find_func(bol, eol - bol, buf, 1, -1,
 					xecfg->find_func_priv) >= 0;
 	}
 
diff --git a/xdiff-interface.c b/xdiff-interface.c
index 4d20069302..d93cb5c72e 100644
--- a/xdiff-interface.c
+++ b/xdiff-interface.c
@@ -201,7 +201,7 @@ struct ff_regs {
 };
 
 static long ff_regexp(const char *line, long len,
-		char *buffer, long buffer_size, void *priv)
+		char *buffer, long buffer_size, long max_leading_spaces, void *priv)
 {
 	struct ff_regs *regs = priv;
 	regmatch_t pmatch[2];
@@ -216,6 +216,18 @@ static long ff_regexp(const char *line, long len,
 			len--;
 	}
 
+	// TODO: Is it faster to check whitespace only after matching the regex?
+	if (max_leading_spaces >= 0) {
+		long leading_spaces;
+		for (leading_spaces = 0; leading_spaces < len
+				&& leading_spaces <= max_leading_spaces
+				&& isspace(line[leading_spaces]); leading_spaces++)
+			;
+
+		if (leading_spaces > max_leading_spaces)
+			return -1;
+	}
+
 	for (i = 0; i < regs->nr; i++) {
 		struct ff_reg *reg = regs->array + i;
 		if (!regexec_buf(&reg->re, line, len, 2, pmatch, 0)) {
diff --git a/xdiff/xdiff.h b/xdiff/xdiff.h
index 032e3a9f41..f78c30c527 100644
--- a/xdiff/xdiff.h
+++ b/xdiff/xdiff.h
@@ -93,7 +93,7 @@ typedef struct s_xdemitcb {
 	int (*out_line)(void *, mmbuffer_t *, int);
 } xdemitcb_t;
 
-typedef long (*find_func_t)(const char *line, long line_len, char *buffer, long buffer_size, void *priv);
+typedef long (*find_func_t)(const char *line, long line_len, char *buffer, long buffer_size, long max_leading_spaces, void *priv);
 
 typedef int (*xdl_emit_hunk_consume_func_t)(long start_a, long count_a,
 					    long start_b, long count_b,
diff --git a/xdiff/xemit.c b/xdiff/xemit.c
index 9d7d6c5087..1de68008f9 100644
--- a/xdiff/xemit.c
+++ b/xdiff/xemit.c
@@ -95,7 +95,7 @@ xdchange_t *xdl_get_hunk(xdchange_t **xscr, xdemitconf_t const *xecfg)
 }
 
 
-static long def_ff(const char *rec, long len, char *buf, long sz, void *priv)
+static long def_ff(const char *rec, long len, char *buf, long sz, long max_leading_spaces, void *priv)
 {
 	if (len > 0 &&
 			(isalpha((unsigned char)*rec) || /* identifier? */
@@ -112,19 +112,19 @@ static long def_ff(const char *rec, long len, char *buf, long sz, void *priv)
 }
 
 static long match_func_rec(xdfile_t *xdf, xdemitconf_t const *xecfg, long ri,
-			   char *buf, long sz)
+			   char *buf, long sz, long max_leading_spaces)
 {
 	const char *rec;
 	long len = xdl_get_rec(xdf, ri, &rec);
 	if (!xecfg->find_func)
-		return def_ff(rec, len, buf, sz, xecfg->find_func_priv);
-	return xecfg->find_func(rec, len, buf, sz, xecfg->find_func_priv);
+		return def_ff(rec, len, buf, sz, max_leading_spaces, xecfg->find_func_priv);
+	return xecfg->find_func(rec, len, buf, sz, max_leading_spaces, xecfg->find_func_priv);
 }
 
 static int is_func_rec(xdfile_t *xdf, xdemitconf_t const *xecfg, long ri)
 {
 	char dummy[1];
-	return match_func_rec(xdf, xecfg, ri, dummy, sizeof(dummy)) >= 0;
+	return match_func_rec(xdf, xecfg, ri, dummy, -1, sizeof(dummy)) >= 0;
 }
 
 struct func_line {
@@ -137,12 +137,23 @@ static long get_func_line(xdfenv_t *xe, xdemitconf_t const *xecfg,
 {
 	long l, size, step = (start > limit) ? -1 : 1;
 	char *buf, dummy[1];
+	long leading_spaces;
+
+	if (start - step >= 0 && start - step < xe->xdf1.nrec) {
+		xrecord_t *first_line = xe->xdf1.recs[start - step];
+
+		for (leading_spaces = 0; first_line->ptr[leading_spaces]
+				&& isspace(first_line->ptr[leading_spaces]); leading_spaces++)
+			;
+	} else {
+		leading_spaces = 0;
+	}
 
 	buf = func_line ? func_line->buf : dummy;
 	size = func_line ? sizeof(func_line->buf) : sizeof(dummy);
 
 	for (l = start; l != limit && 0 <= l && l < xe->xdf1.nrec; l += step) {
-		long len = match_func_rec(&xe->xdf1, xecfg, l, buf, size);
+		long len = match_func_rec(&xe->xdf1, xecfg, l, buf, size, leading_spaces - 1);
 		if (len >= 0) {
 			if (func_line)
 				func_line->len = len;
-- 
2.28.0.586.g47c91ef7fe



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [RFC 0/1] Leading whitespace as a function identification heuristic?
  2020-09-23 21:59 [RFC 0/1] Leading whitespace as a function identification heuristic? Ryan Zoeller
  2020-09-23 21:59 ` [RFC 1/1] xdiff: use leading whitespace in function heuristic Ryan Zoeller
@ 2020-09-24  6:45 ` Junio C Hamano
  2020-09-24 21:17   ` Jeff King
  2020-09-25 18:12 ` Johannes Sixt
  2 siblings, 1 reply; 10+ messages in thread
From: Junio C Hamano @ 2020-09-24  6:45 UTC (permalink / raw)
  To: Ryan Zoeller; +Cc: git

Ryan Zoeller <rtzoeller@rtzoeller.com> writes:

> 1. Is this indentation-aware function identification useful, and
>     generally worth pursuing further?

I cannot shake the feeling off that this is being overly generous
and inviting for misidentification for languages whose usual
convention is not to nest and indent the definitions freely.

IOW, why can't the "we allow leading whitespaces" a per-language
thing?  IOW, why do we even need any new code---shouldn't it be just
the matter of defining xfuncname patterns for such a language with
nested and indented definitions?

So, a mild Meh from me at this point.  I may change my mind in the
morning, though ;-)

Thanks.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC 0/1] Leading whitespace as a function identification heuristic?
  2020-09-24  6:45 ` [RFC 0/1] Leading whitespace as a function identification heuristic? Junio C Hamano
@ 2020-09-24 21:17   ` Jeff King
  2020-09-24 22:01     ` Ryan Zoeller
  0 siblings, 1 reply; 10+ messages in thread
From: Jeff King @ 2020-09-24 21:17 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Ryan Zoeller, git

On Wed, Sep 23, 2020 at 11:45:55PM -0700, Junio C Hamano wrote:

> Ryan Zoeller <rtzoeller@rtzoeller.com> writes:
> 
> > 1. Is this indentation-aware function identification useful, and
> >     generally worth pursuing further?
> 
> I cannot shake the feeling off that this is being overly generous
> and inviting for misidentification for languages whose usual
> convention is not to nest and indent the definitions freely.
> 
> IOW, why can't the "we allow leading whitespaces" a per-language
> thing?  IOW, why do we even need any new code---shouldn't it be just
> the matter of defining xfuncname patterns for such a language with
> nested and indented definitions?

If I understand the problem correctly, I don't think a static regex can
accomplish this, because you need context from the original line. E.g.,
consider something like this:

  void foo() {
	void bar() {
		...code 1...
	}
	...code 2...
  }

If we change one of the lines of code, then we find the function header
by walking backwards up the lines, evaluating a regex for each line. But
for "code 2", how do we know to keep walking past bar() and up to foo()?
Or more specifically, what is different when evaluating a change from
"code 2" that is different than when we would evaluate "code 1"?

If the only input to the question of "is this line a function header" is
the regex from the config, then changes to either line of code must
produce the same answer (either bar() if we allow leading whitespace, or
foo() if we do not).

So I think Ryan's proposal is to give that search an extra piece of
information: the indentation of the original changed line. Which is
enough to differentiate the two cases.

If I understand the patch correctly, it is always picking the first line
where indentation is lessened (and which matches the funcname pattern).
That works out of the box with existing funcname patterns, which is
nice. Though I wonder if there are cases where the funcname regex could
use the information more flexibly (i.e., some marker in the regex that
means "match less than this many spaces" or something).

I do agree that this should not be on automatically for all languages.
Some file formats may want to show a header that's at the same
indentation as the change. Adding a diff.foo.funcnameIndentation config
option would be one solution. But requiring the funcname pattern to make
explicit use of the information is another (and would allow a format to
match some things at one indentation level and some at another; but
again, I'm hand-waving a bit on whether this level of flexibility is
even useful).

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC 0/1] Leading whitespace as a function identification heuristic?
  2020-09-24 21:17   ` Jeff King
@ 2020-09-24 22:01     ` Ryan Zoeller
  2020-09-25  9:11       ` Phillip Wood
  0 siblings, 1 reply; 10+ messages in thread
From: Ryan Zoeller @ 2020-09-24 22:01 UTC (permalink / raw)
  To: Jeff King, Junio C Hamano; +Cc: git

On 9/24/20 4:17 PM, Jeff King wrote:
> 
> 
> If I understand the problem correctly, I don't think a static regex can
> accomplish this, because you need context from the original line. E.g.,
> consider something like this:
> 
>    void foo() {
> 	void bar() {
> 		...code 1...
> 	}
> 	...code 2...
>    }
> 
> If we change one of the lines of code, then we find the function header
> by walking backwards up the lines, evaluating a regex for each line. But
> for "code 2", how do we know to keep walking past bar() and up to foo()?
> Or more specifically, what is different when evaluating a change from
> "code 2" that is different than when we would evaluate "code 1"?
> 
> If the only input to the question of "is this line a function header" is
> the regex from the config, then changes to either line of code must
> produce the same answer (either bar() if we allow leading whitespace, or
> foo() if we do not).
> 
> So I think Ryan's proposal is to give that search an extra piece of
> information: the indentation of the original changed line. Which is
> enough to differentiate the two cases.

You've explained this better than I could have. Thanks.

> If I understand the patch correctly, it is always picking the first line
> where indentation is lessened (and which matches the funcname pattern).
> That works out of the box with existing funcname patterns, which is
> nice. Though I wonder if there are cases where the funcname regex could
> use the information more flexibly (i.e., some marker in the regex that
> means "match less than this many spaces" or something).

My original intent was to work with existing funcname expressions 
without modifications. Some of the funcname regexes are rather 
impenetrable at first glance, and not requiring modifications seemed 
like an easy win.

Especially for funcname patterns specified by a user, I assumed it
would be easier to set an additional configuration option than
rewrite an existing regex to handle this complexity.

> I do agree that this should not be on automatically for all languages.
> Some file formats may want to show a header that's at the same
> indentation as the change. Adding a diff.foo.funcnameIndentation config
> option would be one solution. But requiring the funcname pattern to make
> explicit use of the information is another (and would allow a format to
> match some things at one indentation level and some at another; but
> again, I'm hand-waving a bit on whether this level of flexibility is
> even useful)
If the configuration option is implemented correctly (i.e. as an enum 
rather than a binary toggle), I think we could leave the door open for
a more flexible approach in the future, without needing to answer how 
useful that flexibility is now. I couldn't think of any situations 
requiring this flexibility, but that doesn't mean they don't exist.

Ryan


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC 0/1] Leading whitespace as a function identification heuristic?
  2020-09-24 22:01     ` Ryan Zoeller
@ 2020-09-25  9:11       ` Phillip Wood
  2020-09-25 18:43         ` Jeff King
  0 siblings, 1 reply; 10+ messages in thread
From: Phillip Wood @ 2020-09-25  9:11 UTC (permalink / raw)
  To: Ryan Zoeller, Jeff King, Junio C Hamano; +Cc: git

On 24/09/2020 23:01, Ryan Zoeller wrote:
> On 9/24/20 4:17 PM, Jeff King wrote:
>>
>>
>> If I understand the problem correctly, I don't think a static regex can
>> accomplish this, because you need context from the original line. E.g.,
>> consider something like this:
>>
>>     void foo() {
>> 	void bar() {
>> 		...code 1...
>> 	}
>> 	...code 2...
>>     }
>>

If I've understood correctly when ...code 2... contains changes that are 
themselves indented then we'll pick the wrong function header for example

	void foo() {
		void bar() {
			...code 1...
		}
		for (...) {
			// if this line is changed we pick bar rather
			// than foo because it is the first function
			// header with a smaller indentation than the
			// first changed line.
		}
	}

Unfortunately I suspect code like that is not uncommon and the diff 
would regress with this simple heuristic. It might be possible to 
recalculate the required indentation as we walk backwards up the file 
though, so when we hit the "for" line we reduce the maximum indentation 
allowed for a match and so skip "bar" as a function header.

Best Wishes

Phillip

>> If we change one of the lines of code, then we find the function header
>> by walking backwards up the lines, evaluating a regex for each line. But
>> for "code 2", how do we know to keep walking past bar() and up to foo()?
>> Or more specifically, what is different when evaluating a change from
>> "code 2" that is different than when we would evaluate "code 1"?
>>
>> If the only input to the question of "is this line a function header" is
>> the regex from the config, then changes to either line of code must
>> produce the same answer (either bar() if we allow leading whitespace, or
>> foo() if we do not).
>>
>> So I think Ryan's proposal is to give that search an extra piece of
>> information: the indentation of the original changed line. Which is
>> enough to differentiate the two cases.
> 
> You've explained this better than I could have. Thanks.
> 
>> If I understand the patch correctly, it is always picking the first line
>> where indentation is lessened (and which matches the funcname pattern).
>> That works out of the box with existing funcname patterns, which is
>> nice. Though I wonder if there are cases where the funcname regex could
>> use the information more flexibly (i.e., some marker in the regex that
>> means "match less than this many spaces" or something).
> 
> My original intent was to work with existing funcname expressions
> without modifications. Some of the funcname regexes are rather
> impenetrable at first glance, and not requiring modifications seemed
> like an easy win.
> 
> Especially for funcname patterns specified by a user, I assumed it
> would be easier to set an additional configuration option than
> rewrite an existing regex to handle this complexity.
> 
>> I do agree that this should not be on automatically for all languages.
>> Some file formats may want to show a header that's at the same
>> indentation as the change. Adding a diff.foo.funcnameIndentation config
>> option would be one solution. But requiring the funcname pattern to make
>> explicit use of the information is another (and would allow a format to
>> match some things at one indentation level and some at another; but
>> again, I'm hand-waving a bit on whether this level of flexibility is
>> even useful)
> If the configuration option is implemented correctly (i.e. as an enum
> rather than a binary toggle), I think we could leave the door open for
> a more flexible approach in the future, without needing to answer how
> useful that flexibility is now. I couldn't think of any situations
> requiring this flexibility, but that doesn't mean they don't exist.
> 
> Ryan
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC 0/1] Leading whitespace as a function identification heuristic?
  2020-09-23 21:59 [RFC 0/1] Leading whitespace as a function identification heuristic? Ryan Zoeller
  2020-09-23 21:59 ` [RFC 1/1] xdiff: use leading whitespace in function heuristic Ryan Zoeller
  2020-09-24  6:45 ` [RFC 0/1] Leading whitespace as a function identification heuristic? Junio C Hamano
@ 2020-09-25 18:12 ` Johannes Sixt
  2 siblings, 0 replies; 10+ messages in thread
From: Johannes Sixt @ 2020-09-25 18:12 UTC (permalink / raw)
  To: Ryan Zoeller; +Cc: git

Am 23.09.20 um 23:59 schrieb Ryan Zoeller:
> I've recently been annoyed by git's choice of lines when invoking
> `git diff --function-context`. git isn't doing a _bad_ job per se,
> given that it's using regular expressions to parse nonregular languages,
> but in my opinion it could do better.
> 
> The issue I'm seeing is that constructs like nested functions cause
> `git diff --function-context` to stop taking lines prematurely; a line
> containing a function declaration has been reached, but it's not the one
> I'm expecting git to stop at. This results in some of the function being
> omitted from the output of `git diff --function-context`, and also
> manifests itself as the incorrect function being listed in the hunk header
> in general.

There are two mechanisms at play here:

The first is that the context above and below a point of interest is
extended to the next function. This is relevant for both `git diff` and
`git grep` when --function-context is specified.

The second is the determination of the hunk header. It is only relevant
for `git diff`.

The purpose of the hunk header is to get an estimation of where the hunk
is located. It is most helpful in this case when the closest anchor
point is listed; it is not important to find where the semantically
surrounding nesting context begins.

Therefore, I suggest that you aim only for the --function-context case,
because there it makes sense to find the correctly nested surrounding
context if possible.

-- Hannes

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC 0/1] Leading whitespace as a function identification heuristic?
  2020-09-25  9:11       ` Phillip Wood
@ 2020-09-25 18:43         ` Jeff King
  2020-09-25 19:01           ` Phillip Wood
  0 siblings, 1 reply; 10+ messages in thread
From: Jeff King @ 2020-09-25 18:43 UTC (permalink / raw)
  To: phillip.wood; +Cc: Ryan Zoeller, Junio C Hamano, git

On Fri, Sep 25, 2020 at 10:11:33AM +0100, Phillip Wood wrote:

> If I've understood correctly when ...code 2... contains changes that are
> themselves indented then we'll pick the wrong function header for example
> 
> 	void foo() {
> 		void bar() {
> 			...code 1...
> 		}
> 		for (...) {
> 			// if this line is changed we pick bar rather
> 			// than foo because it is the first function
> 			// header with a smaller indentation than the
> 			// first changed line.
> 		}
> 	}
> 
> Unfortunately I suspect code like that is not uncommon and the diff would
> regress with this simple heuristic. It might be possible to recalculate the
> required indentation as we walk backwards up the file though, so when we hit
> the "for" line we reduce the maximum indentation allowed for a match and so
> skip "bar" as a function header.

Thanks, that's a great counter-example I hadn't considered.

Yes, I agree that adjusting the desired indentation as we walk backwards
would work. That's assuming indentation is hierarchical, but I think
that's implied by the existence of this feature at all.

Another possible corner case: tabs vs spaces. If I have:

  <space><space><space><space><space><space><space><space>foo
  <tab><tab>bar

which is more indented? Counting isspace(), it is the first one. But
visually, it would _usually_ be the second one. But of course it would
depend on your tabstops.

The above example is obviously stupid and contrived, but I wonder if
there are legitimate confusing cases where people mix tabs and spaces
(e.g., mixed tabs and spaces to align function parameters, etc).

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC 0/1] Leading whitespace as a function identification heuristic?
  2020-09-25 18:43         ` Jeff King
@ 2020-09-25 19:01           ` Phillip Wood
  2020-09-25 19:05             ` Jeff King
  0 siblings, 1 reply; 10+ messages in thread
From: Phillip Wood @ 2020-09-25 19:01 UTC (permalink / raw)
  To: Jeff King, phillip.wood; +Cc: Ryan Zoeller, Junio C Hamano, git

Hi Peff

On 25/09/2020 19:43, Jeff King wrote:
> On Fri, Sep 25, 2020 at 10:11:33AM +0100, Phillip Wood wrote:
> 
>> If I've understood correctly when ...code 2... contains changes that are
>> themselves indented then we'll pick the wrong function header for example
>>
>> 	void foo() {
>> 		void bar() {
>> 			...code 1...
>> 		}
>> 		for (...) {
>> 			// if this line is changed we pick bar rather
>> 			// than foo because it is the first function
>> 			// header with a smaller indentation than the
>> 			// first changed line.
>> 		}
>> 	}
>>
>> Unfortunately I suspect code like that is not uncommon and the diff would
>> regress with this simple heuristic. It might be possible to recalculate the
>> required indentation as we walk backwards up the file though, so when we hit
>> the "for" line we reduce the maximum indentation allowed for a match and so
>> skip "bar" as a function header.
> 
> Thanks, that's a great counter-example I hadn't considered.
> 
> Yes, I agree that adjusting the desired indentation as we walk backwards
> would work. That's assuming indentation is hierarchical, but I think
> that's implied by the existence of this feature at all.
> 
> Another possible corner case: tabs vs spaces. If I have:
> 
>    <space><space><space><space><space><space><space><space>foo
>    <tab><tab>bar
> 
> which is more indented? Counting isspace(), it is the first one. But
> visually, it would _usually_ be the second one. But of course it would
> depend on your tabstops.
> 
> The above example is obviously stupid and contrived, but I wonder if
> there are legitimate confusing cases where people mix tabs and spaces
> (e.g., mixed tabs and spaces to align function parameters, etc).

To calculate the indentation for diff 
--color-moved-ws=allow-indentation-change in fill_es_indent_data() we 
use the tabwidth whitespace attribute to calculate the width of a tab in 
spaces

	int tab_width = es->flags & WS_TAB_WIDTH_MASK;
	/* calculate the visual width of indentation */
	while(1) {
		if (s[off] == ' ') {
			width++;
			off++;
		} else if (s[off] == '\t') {
			width += tab_width - (width % tab_width);
			while (s[++off] == '\t')
				width += tab_width;
		} else {
			break;
		}
	}

I think we could probably do something similar here assuming it is the 
visual width of the indentation that we care about.

Best Wishes

Phillip

> -Peff
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC 0/1] Leading whitespace as a function identification heuristic?
  2020-09-25 19:01           ` Phillip Wood
@ 2020-09-25 19:05             ` Jeff King
  0 siblings, 0 replies; 10+ messages in thread
From: Jeff King @ 2020-09-25 19:05 UTC (permalink / raw)
  To: phillip.wood; +Cc: Ryan Zoeller, Junio C Hamano, git

On Fri, Sep 25, 2020 at 08:01:07PM +0100, Phillip Wood wrote:

> > Another possible corner case: tabs vs spaces. If I have:
> > 
> >    <space><space><space><space><space><space><space><space>foo
> >    <tab><tab>bar
> > 
> > which is more indented? Counting isspace(), it is the first one. But
> > visually, it would _usually_ be the second one. But of course it would
> > depend on your tabstops.
> > 
> > The above example is obviously stupid and contrived, but I wonder if
> > there are legitimate confusing cases where people mix tabs and spaces
> > (e.g., mixed tabs and spaces to align function parameters, etc).
> 
> To calculate the indentation for diff
> --color-moved-ws=allow-indentation-change in fill_es_indent_data() we use
> the tabwidth whitespace attribute to calculate the width of a tab in spaces
> [...]

Ah, right, I forgot we already had "tabwidth" config to deal with this.
I agree we could make use of that same technique here.

-Peff

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2020-09-25 20:37 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-23 21:59 [RFC 0/1] Leading whitespace as a function identification heuristic? Ryan Zoeller
2020-09-23 21:59 ` [RFC 1/1] xdiff: use leading whitespace in function heuristic Ryan Zoeller
2020-09-24  6:45 ` [RFC 0/1] Leading whitespace as a function identification heuristic? Junio C Hamano
2020-09-24 21:17   ` Jeff King
2020-09-24 22:01     ` Ryan Zoeller
2020-09-25  9:11       ` Phillip Wood
2020-09-25 18:43         ` Jeff King
2020-09-25 19:01           ` Phillip Wood
2020-09-25 19:05             ` Jeff King
2020-09-25 18:12 ` Johannes Sixt

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).