git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH] diff compaction heuristic: favor shortest neighboring blank lines
@ 2016-06-16 17:46 Stefan Beller
  2016-06-16 20:27 ` Junio C Hamano
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Stefan Beller @ 2016-06-16 17:46 UTC (permalink / raw)
  To: peff; +Cc: git, jacob.keller, mhagger, Stefan Beller

The compaction heuristic for diffs was deemed quite good, but in 5580b27
we have an example demonstrates a common case, that fails with the existing
heuristic. That is why we disabled the heuristic in the v2.9.0 release.

With this patch a diff looks like:

         def bar
                 do_bar_stuff()

                 common_ending()
         end

+        def bal
+                do_bal_stuff()
+
+                common_ending()
+        end
+
         def baz
                 do_baz_stuff()

                 common_ending()
         end

whereas before we had:

  WIP (TODO: ask peff to provide an example that actually triggers, I seem to be
       unable to write one, though I thought the above was one)


The way we do it, is by inspecting the neighboring lines and see how
much indent there is and we choose the line that has the shortest amount
of blanks in the neighboring lines.

(TODO: update comments in the file)

Signed-off-by: Stefan Beller <sbeller@google.com>
---

Hi Jeff, hi Michael,

thanks for pointing out the flaw in this ruby code base before the 2.9 release.
I think this patch would fix your finding, though I cannot reproduce it.
Can you provide an example repo/patch that looks ugly on current origin/master,
so I can be sure this fixes the issue?

This can also be a start of a discussion if this is a too short-sighted enhancement
of the heuristic. Essentially we'd want to prefer 

+        end
+
+        def bal

over:

+                do_bal_stuff()
+
+                common_ending()

because there is less space between line start and {end, def bal}
than for {do_bal_stuff, common_ending}.

Thanks,
Stefan

 xdiff/xdiffi.c | 41 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 39 insertions(+), 2 deletions(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index b3c6848..58adc74 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -405,6 +405,35 @@ static int is_blank_line(xrecord_t **recs, long ix, long flags)
 	return xdl_blankline(recs[ix]->ptr, recs[ix]->size, flags);
 }
 
+static unsigned int leading_blank(const char *line)
+{
+	unsigned int ret = 0;
+	while (*line) {
+		if (*line == '\t')
+			ret += 8;
+		else if (*line == ' ')
+			ret ++;
+		else
+			break;
+		line++;
+	}
+	return ret;
+}
+
+static unsigned int surrounding_leading_blank(xrecord_t **recs, long ix,
+		long flags, long nrec)
+{
+	unsigned int i, ret = UINT_MAX;
+	if (ix > 0)
+		ret = leading_blank(recs[ix - 1]->ptr);
+	if (ix < nrec - 1) {
+		i = leading_blank(recs[ix + 1]->ptr);
+		if (i < ret)
+			ret = i;
+	}
+	return ret;
+}
+
 static int recs_match(xrecord_t **recs, long ixs, long ix, long flags)
 {
 	return (recs[ixs]->ha == recs[ix]->ha &&
@@ -416,7 +445,7 @@ static int recs_match(xrecord_t **recs, long ixs, long ix, long flags)
 int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 	long ix, ixo, ixs, ixref, grpsiz, nrec = xdf->nrec;
 	char *rchg = xdf->rchg, *rchgo = xdfo->rchg;
-	unsigned int blank_lines;
+	unsigned int blank_lines, min_bl_neigh_indent;
 	xrecord_t **recs = xdf->recs;
 
 	/*
@@ -451,6 +480,7 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 		do {
 			grpsiz = ix - ixs;
 			blank_lines = 0;
+			min_bl_neigh_indent = UINT_MAX;
 
 			/*
 			 * If the line before the current change group, is equal to
@@ -485,7 +515,13 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 			 * the group.
 			 */
 			while (ix < nrec && recs_match(recs, ixs, ix, flags)) {
-				blank_lines += is_blank_line(recs, ix, flags);
+				if (is_blank_line(recs, ix, flags)) {
+					unsigned int bl_neigh_indent =
+						surrounding_leading_blank(recs, ix, flags, nrec);
+					if (min_bl_neigh_indent > bl_neigh_indent)
+						min_bl_neigh_indent = min_bl_neigh_indent;
+					blank_lines++;
+				}
 
 				rchg[ixs++] = 0;
 				rchg[ix++] = 1;
@@ -525,6 +561,7 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 		if ((flags & XDF_COMPACTION_HEURISTIC) && blank_lines) {
 			while (ixs > 0 &&
 			       !is_blank_line(recs, ix - 1, flags) &&
+			       surrounding_leading_blank(recs, ix - 1, flags, nrec) > min_bl_neigh_indent &&
 			       recs_match(recs, ixs - 1, ix - 1, flags)) {
 				rchg[--ixs] = 1;
 				rchg[--ix] = 0;
-- 
2.9.0.8.gb6cbe66


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] diff compaction heuristic: favor shortest neighboring blank lines
  2016-06-16 17:46 [PATCH] diff compaction heuristic: favor shortest neighboring blank lines Stefan Beller
@ 2016-06-16 20:27 ` Junio C Hamano
  2016-06-16 21:06   ` Stefan Beller
  2016-06-16 21:10 ` Michael Haggerty
  2016-06-17 15:36 ` Jeff King
  2 siblings, 1 reply; 15+ messages in thread
From: Junio C Hamano @ 2016-06-16 20:27 UTC (permalink / raw)
  To: Stefan Beller; +Cc: peff, git, jacob.keller, mhagger

Stefan Beller <sbeller@google.com> writes:

> +        def bal
> +                do_bal_stuff()
> +
> +                common_ending()
> +        end
> +
>          def baz
>                  do_baz_stuff()
>
>                  common_ending()
>          end
>
> whereas before we had:
>
>   WIP (TODO: ask peff to provide an example that actually triggers, I seem to be
>        unable to write one, though I thought the above was one)
>
>
> The way we do it, is by inspecting the neighboring lines and see how
> much indent there is and we choose the line that has the shortest amount
> of blanks in the neighboring lines.
> ...
> because there is less space between line start and {end, def bal}
> than for {do_bal_stuff, common_ending}.

I haven't thought this carefully yet, but would this equally work
well for Python, where it does not have the "end" or does the lack
of "end" pose a problem?  You'll still find "def bal" is a good
boundary (but you cannot tell if it is the beginning or the end of a
block, unless you understand the language), though.

> +static unsigned int leading_blank(const char *line)
> +{
> +	unsigned int ret = 0;
> +	while (*line) {
> +		if (*line == '\t')
> +			ret += 8;

This will be broken with a line with space-before-tab whitespace
breakage, I suspect...

> +		else if (*line == ' ')
> +			ret ++;
> +		else
> +			break;
> +		line++;
> +	}
> +	return ret;
> +}
> +
> +static unsigned int surrounding_leading_blank(xrecord_t **recs, long ix,
> +		long flags, long nrec)
> +{
> +	unsigned int i, ret = UINT_MAX;
> +	if (ix > 0)
> +		ret = leading_blank(recs[ix - 1]->ptr);
> +	if (ix < nrec - 1) {
> +		i = leading_blank(recs[ix + 1]->ptr);
> +		if (i < ret)
> +			ret = i;
> +	}
> +	return ret;
> +}
> +
>  static int recs_match(xrecord_t **recs, long ixs, long ix, long flags)
>  {
>  	return (recs[ixs]->ha == recs[ix]->ha &&
> @@ -416,7 +445,7 @@ static int recs_match(xrecord_t **recs, long ixs, long ix, long flags)
>  int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
>  	long ix, ixo, ixs, ixref, grpsiz, nrec = xdf->nrec;
>  	char *rchg = xdf->rchg, *rchgo = xdfo->rchg;
> -	unsigned int blank_lines;
> +	unsigned int blank_lines, min_bl_neigh_indent;
>  	xrecord_t **recs = xdf->recs;
>  
>  	/*
> @@ -451,6 +480,7 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
>  		do {
>  			grpsiz = ix - ixs;
>  			blank_lines = 0;
> +			min_bl_neigh_indent = UINT_MAX;
>  
>  			/*
>  			 * If the line before the current change group, is equal to
> @@ -485,7 +515,13 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
>  			 * the group.
>  			 */
>  			while (ix < nrec && recs_match(recs, ixs, ix, flags)) {
> -				blank_lines += is_blank_line(recs, ix, flags);
> +				if (is_blank_line(recs, ix, flags)) {
> +					unsigned int bl_neigh_indent =
> +						surrounding_leading_blank(recs, ix, flags, nrec);
> +					if (min_bl_neigh_indent > bl_neigh_indent)
> +						min_bl_neigh_indent = min_bl_neigh_indent;
> +					blank_lines++;
> +				}
>  
>  				rchg[ixs++] = 0;
>  				rchg[ix++] = 1;
> @@ -525,6 +561,7 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
>  		if ((flags & XDF_COMPACTION_HEURISTIC) && blank_lines) {
>  			while (ixs > 0 &&
>  			       !is_blank_line(recs, ix - 1, flags) &&
> +			       surrounding_leading_blank(recs, ix - 1, flags, nrec) > min_bl_neigh_indent &&
>  			       recs_match(recs, ixs - 1, ix - 1, flags)) {
>  				rchg[--ixs] = 1;
>  				rchg[--ix] = 0;

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] diff compaction heuristic: favor shortest neighboring blank lines
  2016-06-16 20:27 ` Junio C Hamano
@ 2016-06-16 21:06   ` Stefan Beller
  0 siblings, 0 replies; 15+ messages in thread
From: Stefan Beller @ 2016-06-16 21:06 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, git@vger.kernel.org, Jacob Keller, Michael Haggerty

On Thu, Jun 16, 2016 at 1:27 PM, Junio C Hamano <gitster@pobox.com> wrote:
>> ...
>> because there is less space between line start and {end, def bal}
>> than for {do_bal_stuff, common_ending}.
>
> I haven't thought this carefully yet, but would this equally work
> well for Python, where it does not have the "end" or does the lack
> of "end" pose a problem?  You'll still find "def bal" is a good
> boundary (but you cannot tell if it is the beginning or the end of a
> block, unless you understand the language), though.

Good point.  I found a flaw in my implementation
(as it doesn't match my mental model, not necessarily a bad thing)

We take the minimum of the two neighbors, i.e.

+                do_bal_stuff()
+
+        common_ending()

is preferrable to

+                do_bal_stuff()
+
+                common_ending()

and in python the example would look like:

    def foo():
        do_foo()

        common_thing()

+    def baz():
+        do_baz()
+
+        common_thing()
+
    def bar():
        do_bar()

        common_thing()

and breaking between

        common_thing()

    def bar():

is more favorable than between

        do_baz()

        common_thing()

because in the first former the count of
white space in front of "def bar():" is smaller
than for any of "do_baz()" and "common_thing()"


>
>> +static unsigned int leading_blank(const char *line)
>> +{
>> +     unsigned int ret = 0;
>> +     while (*line) {
>> +             if (*line == '\t')
>> +                     ret += 8;
>
> This will be broken with a line with space-before-tab whitespace
> breakage, I suspect...

How so? We inspect each character on its own and then move on later
by line++. (I am not seeing how this could cause trouble, so please
help me?)

Going back to python, this may become a problem when you have a code like:

 def baz():

        do_baz()

        common_thing()

 def bar():

+       do_bal()
+
+       common_thing()
+
+def bar():
+
        do_bar()

        common_thing()


but this was fabricated with a typo (the first definition of bar
should have been bal),
(Also it doesn't worsen the diff, as it is same without the heuristic)

once that typo is fixed we get:
(both with and without the heuristic)

        do_foo()

        common_thing()

 def baz():
        do_baz()

        common_thing()

+def bal():
+
+       do_bal()
+
+       common_thing()
+
 def bar():

        do_bar()

        common_thing()

Clearly it can also be intentional to have 2 methods with the same
code for historical reasons, (even without the blank line after the
function definition this produces the same result)

When playing around with various diffs I could not find a thing that
this patch makes worse, it only fixes the actual issue.
(I realized Peff actually attached a script to produce a bad diff, which
is gone with this patch)

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] diff compaction heuristic: favor shortest neighboring blank lines
  2016-06-16 17:46 [PATCH] diff compaction heuristic: favor shortest neighboring blank lines Stefan Beller
  2016-06-16 20:27 ` Junio C Hamano
@ 2016-06-16 21:10 ` Michael Haggerty
  2016-06-16 21:36   ` Stefan Beller
  2016-06-17 15:36 ` Jeff King
  2 siblings, 1 reply; 15+ messages in thread
From: Michael Haggerty @ 2016-06-16 21:10 UTC (permalink / raw)
  To: Stefan Beller, peff; +Cc: git, jacob.keller

On 06/16/2016 07:46 PM, Stefan Beller wrote:
> The compaction heuristic for diffs was deemed quite good, but in 5580b27
> we have an example demonstrates a common case, that fails with the existing
> heuristic. That is why we disabled the heuristic in the v2.9.0 release.
> 
> With this patch a diff looks like:
> 
>          def bar
>                  do_bar_stuff()
> 
>                  common_ending()
>          end
> 
> +        def bal
> +                do_bal_stuff()
> +
> +                common_ending()
> +        end
> +
>          def baz
>                  do_baz_stuff()
> 
>                  common_ending()
>          end
> 
> whereas before we had:
> 
>   WIP (TODO: ask peff to provide an example that actually triggers, I seem to be
>        unable to write one, though I thought the above was one)
> 
> 
> The way we do it, is by inspecting the neighboring lines and see how
> much indent there is and we choose the line that has the shortest amount
> of blanks in the neighboring lines.
> 
> (TODO: update comments in the file)
> 
> Signed-off-by: Stefan Beller <sbeller@google.com>
> ---
> 
> Hi Jeff, hi Michael,
> 
> thanks for pointing out the flaw in this ruby code base before the 2.9 release.
> I think this patch would fix your finding, though I cannot reproduce it.
> Can you provide an example repo/patch that looks ugly on current origin/master,
> so I can be sure this fixes the issue?
> 
> This can also be a start of a discussion if this is a too short-sighted enhancement
> of the heuristic. Essentially we'd want to prefer 
> 
> +        end
> +
> +        def bal
> 
> over:
> 
> +                do_bal_stuff()
> +
> +                common_ending()
> 
> because there is less space between line start and {end, def bal}
> than for {do_bal_stuff, common_ending}.

It's really cool that you are trying to implement the indentation
heuristic. I'm curious how it works in practice (something we can
probably only determine by applying it to a corpus of diffs to see how
often it outperforms/underperforms the other possible approaches).

The heuristic I proposed was

* Prefer to start and end hunks *following* lines with the least
  indentation.

* Define the "indentation" of a blank line to be the indentation of
  the previous non-blank line minus epsilon.

* In the case of a tie, prefer to slide the hunk down as far as
  possible.

If that's what you are trying to implement, then notice that the first
rule says that the hunk should start *following* the line with the least
indentation. So the "score" for

> +        end
> +
> +        def bal

would be the indentation of the line preceding "end", which is larger
than the indentation of "end". And the score for

> +                do_bal_stuff()
> +
> +                common_ending()

would come from the line preceding "do_bal_stuff()", namely "def bal()",
which is indented the same as "end". So the latter would be preferred.

But actually, I don't understand how your example is meaningful. I think
this heuristic is only used to slide hunks around; i.e., when the line
following the hunk is the same as the first lines of the hunk, or when
the line preceding the hunk is the same as the last line of the hunk.
Right? So your snippets would never compete against each other. Let's
take the full example. The five possible placements for the `+`
characters are marked with the digits 1 through 5 here:

>              def bar
>                      do_bar_stuff()
> 1
> 12                   common_ending()
> 123          end
> 1234
> 12345        def bal
> 12345                do_bal_stuff()
>  2345
>   345                common_ending()
>    45        end
>     5
>              def baz
>                      do_baz_stuff()
>
>                      common_ending()
>              end

Of the lines preceding the candidate hunks, the blank line preceding the
hunk marked "5" has the smallest indent, namely "epsilon" less than the
indentation of the "end" line preceding it. So placement "5" would be
chosen.

I don't know enough about this area of the code to review your patch in
detail, but I did note below a typo that jumped out at me.

Michael

>  xdiff/xdiffi.c | 41 +++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 39 insertions(+), 2 deletions(-)
> 
> diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
> index b3c6848..58adc74 100644
> --- a/xdiff/xdiffi.c
> +++ b/xdiff/xdiffi.c
> [...]
> @@ -485,7 +515,13 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
>  			 * the group.
>  			 */
>  			while (ix < nrec && recs_match(recs, ixs, ix, flags)) {
> -				blank_lines += is_blank_line(recs, ix, flags);
> +				if (is_blank_line(recs, ix, flags)) {
> +					unsigned int bl_neigh_indent =
> +						surrounding_leading_blank(recs, ix, flags, nrec);
> +					if (min_bl_neigh_indent > bl_neigh_indent)
> +						min_bl_neigh_indent = min_bl_neigh_indent;

The line above has no effect (same variable on both sides of the =).

> +					blank_lines++;
> +				}
>  
>  				rchg[ixs++] = 0;
>  				rchg[ix++] = 1;
> @@ -525,6 +561,7 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
>  		if ((flags & XDF_COMPACTION_HEURISTIC) && blank_lines) {
>  			while (ixs > 0 &&
>  			       !is_blank_line(recs, ix - 1, flags) &&
> +			       surrounding_leading_blank(recs, ix - 1, flags, nrec) > min_bl_neigh_indent &&
>  			       recs_match(recs, ixs - 1, ix - 1, flags)) {
>  				rchg[--ixs] = 1;
>  				rchg[--ix] = 0;
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] diff compaction heuristic: favor shortest neighboring blank lines
  2016-06-16 21:10 ` Michael Haggerty
@ 2016-06-16 21:36   ` Stefan Beller
  0 siblings, 0 replies; 15+ messages in thread
From: Stefan Beller @ 2016-06-16 21:36 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Jeff King, git@vger.kernel.org, Jacob Keller

>
> The heuristic I proposed was
>
> * Prefer to start and end hunks *following* lines with the least
>   indentation.
>
> * Define the "indentation" of a blank line to be the indentation of
>   the previous non-blank line minus epsilon.
>
> * In the case of a tie, prefer to slide the hunk down as far as
>   possible.
>
> If that's what you are trying to implement, then notice that the first
> rule says that the hunk should start *following* the line with the least
> indentation. So the "score" for

I did not try to implement that heuristic.

> * Prefer to start and end hunks *following* lines with the least
>   indentation.

We had this alone as a heuristic and it turned out badly. (More
false positives than the current "empty line only" thing)

It may however be a good starting block when combined with the other
conditions, as blank lines are assigned a longer virtual length than
what they have.

> * In the case of a tie, prefer to slide the hunk down as far as
>   possible.
>

This is what the current implementation does as well. (i.e. if there is
more than one blank line, we have a tie, so choose the last one as the
end of the hunk)


So what I am proposing in this patch:

 * Prefer to start and end hunks *following* lines with an empty line.
   (as implemented currently)

 * In the case of a tie, make use of a second tier heuristic.

 * Define the "length" of a blank line to be the
    minimum of the amount of white space in the line before and after

 * Choose that empty line (first tier) that has the shortest length
(second tier)

 * In the case of a tie, prefer to slide the hunk down as far as
   possible.

I would be really interested how these two heuristics compare in practice.

>
>> +        end
>> +
>> +        def bal
>
> would be the indentation of the line preceding "end", which is larger
> than the indentation of "end". And the score for
>
>> +                do_bal_stuff()
>> +
>> +                common_ending()
>
> would come from the line preceding "do_bal_stuff()", namely "def bal()",
> which is indented the same as "end". So the latter would be preferred.
>
> But actually, I don't understand how your example is meaningful. I think
> this heuristic is only used to slide hunks around; i.e., when the line
> following the hunk is the same as the first lines of the hunk, or when
> the line preceding the hunk is the same as the last line of the hunk.

Right.

> Right? So your snippets would never compete against each other.

right, I was using that to show why some parts are more favorable to
put a break in (i.e. the transisiton from "+ lines" to surrounding lines.
I was not clear enough there. :(

> Let's
> take the full example. The five possible placements for the `+`
> characters are marked with the digits 1 through 5 here:
>
>>              def bar
>>                      do_bar_stuff()
>> 1
>> 12                   common_ending()
>> 123          end
>> 1234
>> 12345        def bal
>> 12345                do_bal_stuff()
>>  2345
>>   345                common_ending()
>>    45        end
>>     5
>>              def baz
>>                      do_baz_stuff()
>>
>>                      common_ending()
>>              end
>
> Of the lines preceding the candidate hunks, the blank line preceding the
> hunk marked "5" has the smallest indent, namely "epsilon" less than the
> indentation of the "end" line preceding it. So placement "5" would be
> chosen.

I agree 5 is the best here, but for other reasons.

Imagine /s/end/long final boilerplate code/, and it may be more clear.
What I am trying to say is, that you "indentation" of the blank line is not
reaching until the end of the "end" line, but rather to the front of "end".

(If you don't care of white spaces in the git history and have an editor, that
makes fancy white space stuff, i.e. your text editor cursor is in the "end" line
and you press "enter" to get to the next line, it is most likely indented to
strlen ("<end line>") - strlen("end"), as that is how much indentation we had
in the preceding line?

>
> I don't know enough about this area of the code to review your patch in
> detail, but I did note below a typo that jumped out at me.

I did not know the code either before writing the patch. :)


>> +                                     if (min_bl_neigh_indent > bl_neigh_indent)
>> +                                             min_bl_neigh_indent = min_bl_neigh_indent;
>
> The line above has no effect (same variable on both sides of the =).

Thanks!

I had the feeling something was off judging from behavior,
but could not tell what exactly. This is it.

It should be

    min_bl_neigh_indent = bl_neigh_indent;

i.e. the minimum of the existing and newly computed surrounding line
starting blanks.

Junio writes:
> Now, on what column does 'd' sit on the example below?
>
>        def foo
>
> I typed SP SP HT 'd' 'e' 'f'; it still is indented by 8 places.

I see, so for the TAB case we would need to do a

    ret += (8 - ret%8);

Thanks!
Stefan

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] diff compaction heuristic: favor shortest neighboring blank lines
  2016-06-16 17:46 [PATCH] diff compaction heuristic: favor shortest neighboring blank lines Stefan Beller
  2016-06-16 20:27 ` Junio C Hamano
  2016-06-16 21:10 ` Michael Haggerty
@ 2016-06-17 15:36 ` Jeff King
  2016-06-17 16:09   ` Stefan Beller
  2 siblings, 1 reply; 15+ messages in thread
From: Jeff King @ 2016-06-17 15:36 UTC (permalink / raw)
  To: Stefan Beller; +Cc: git, jacob.keller, mhagger

On Thu, Jun 16, 2016 at 10:46:20AM -0700, Stefan Beller wrote:

> The compaction heuristic for diffs was deemed quite good, but in 5580b27
> we have an example demonstrates a common case, that fails with the existing
> heuristic. That is why we disabled the heuristic in the v2.9.0 release.
> 
> With this patch a diff looks like:
> 
>          def bar
>                  do_bar_stuff()
> 
>                  common_ending()
>          end
> 
> +        def bal
> +                do_bal_stuff()
> +
> +                common_ending()
> +        end
> +
>          def baz
>                  do_baz_stuff()
> 
>                  common_ending()
>          end
> 
> whereas before we had:
> 
>   WIP (TODO: ask peff to provide an example that actually triggers, I seem to be
>        unable to write one, though I thought the above was one)

Did you want something that triggers the "bad" case with the existing
compaction heuristic?

I gave one in:

  http://article.gmane.org/gmane.comp.version-control.git/296947

I think the difference is that in my example, the diff (before
compaction) has the blank line at the top (because we are adding a new
entry at the bottom), whereas in yours, the blank line is at the bottom.

This patch does make my "bad" case look better. Unfortunately, it
re-breaks another case:

	cat >file.rb <<\EOF
	##
	# This is the foo function.
	def foo
	  foo stuff
	end

	##
	# This is the bar function.
	def bar
	  bar stuff
	end

	##
	# This is the baz function.
	def baz
	  baz stuff
	end
	EOF

	git init
	git add file.rb
	sed -i 7,12d file.rb
	git diff

That goes back to the original, pre-compaction-heuristic diff:

	diff --git a/file.rb b/file.rb
	index 0b31fb6..ee25d63 100644
	--- a/file.rb
	+++ b/file.rb
	@@ -5,12 +5,6 @@ def foo
	 end
	 
	 ##
	-# This is the bar function.
	-def bar
	-  bar stuff
	-end
	-
	-##
	 # This is the baz function.
	 def baz
	   baz stuff

-Peff

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] diff compaction heuristic: favor shortest neighboring blank lines
  2016-06-17 15:36 ` Jeff King
@ 2016-06-17 16:09   ` Stefan Beller
  2016-06-23 17:10     ` Michael Haggerty
  0 siblings, 1 reply; 15+ messages in thread
From: Stefan Beller @ 2016-06-17 16:09 UTC (permalink / raw)
  To: Jeff King; +Cc: git@vger.kernel.org, Jacob Keller, Michael Haggerty

On Fri, Jun 17, 2016 at 8:36 AM, Jeff King <peff@peff.net> wrote:
>
> Did you want something that triggers the "bad" case with the existing
> compaction heuristic?
>
> I gave one in:
>
>   http://article.gmane.org/gmane.comp.version-control.git/296947

I found that out after sending the initial patch. Thanks for pointing out!


>
> I think the difference is that in my example, the diff (before
> compaction) has the blank line at the top (because we are adding a new
> entry at the bottom), whereas in yours, the blank line is at the bottom.
>
> This patch does make my "bad" case look better. Unfortunately, it
> re-breaks another case:

I think before spending more time on discussing and implementing new
(hopefully better) heuristics, I'd want to step back and try to be a bit more
systematic, i.e. I'll want to collect lots of test cases in different languages
and use cases. A mini test suite, maybe?

I think I'll include these examples in there, too.

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] diff compaction heuristic: favor shortest neighboring blank lines
  2016-06-17 16:09   ` Stefan Beller
@ 2016-06-23 17:10     ` Michael Haggerty
  2016-06-23 17:25       ` Stefan Beller
                         ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Michael Haggerty @ 2016-06-23 17:10 UTC (permalink / raw)
  To: Stefan Beller, Jeff King; +Cc: git@vger.kernel.org, Jacob Keller

On 06/17/2016 06:09 PM, Stefan Beller wrote:
> I think before spending more time on discussing and implementing new
> (hopefully better) heuristics, I'd want to step back and try to be a bit more
> systematic, i.e. I'll want to collect lots of test cases in different languages
> and use cases. A mini test suite, maybe?

Stefan,

I've also been playing around with diff heuristics and also trying to
find some good test data. Have you put together some test cases yet?

In case you are interested, the current iteration of my heuristic
considers six things to determine a "score" for breaking a diff above a
particular line:

* `blank` -- is the line blank?
* `indent` -- the indentation of the line (if it is not blank)
* `pre_blank` -- is the preceding line blank?
* `pre_indent` -- the indentation of the closest preceding non-blank
  line
* `post_blank` -- is the following line blank?
* `post_indent` -- the indentation of the closest following non-blank
  line

The meat of the scoring algorithm is below. I compute the score for the
beginning and the end of the insert/delete block for each candidate
shift of an insertion/deletion hunk, then choose the one that has the
lowest score.

There are some adjustable "bonus" values below. The values shown give
quite good results for C, shell, Python, XML, Ruby, CSS, YAML, and HTML
in the couple of projects that I've tried it on. But I haven't done
enough systematic testing over a large enough corpus to know whether the
parameters could be improved or whether there are types of code that it
performs really terribly on.

Right now my program doesn't output diffs in the usual format, but
rather as a diff notated with various information. So it's not in a form
that other people could use so easily. Here's an example:

> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
>                   *
>                   * Default behaviour is to initialize the notes tree from the tree object
>                   * specified by the given (or default) notes ref.
>                   */
>                  #define NOTES_INIT_EMPTY 1
>     |     -4     
>     |     -6 +   /*
>     ||    28 +  + * By default, the notes tree is only readable, and the notes ref can be
>     ||       +  + * any treeish. The notes tree can however be made writable with this flag,
>     ||       +  + * in which case only strict ref names can be used.
>     ||       +  + */
>     ||       +  +#define NOTES_INIT_WRITABLE 2
>      |       +  +
>      |          +/*
>                   * Initialize the given notes_tree with the notes tree structure at the given
>                   * ref. If given ref is NULL, the value of the $GIT_NOTES_REF environment
>                   * variable is used, and if that is missing, the default notes ref is used
>                   * ("refs/notes/commits").
>                   *
>                   * If you need to re-initialize a notes_tree structure (e.g. when switching from
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The two columns of vertical bars show the highest and lowest range of
lines that this insertion block can be shifted to. The numbers show the
score for breaking the diff above that line. The first column of +/- is
where this version of the heuristic would place the diff, and the second
column is where the master version of Git would place it.

My program can also search for and output "slideable" insertion/deletion
blocks, so it can be used to find diffs that are theoretically capable
of being optimized, even if they don't happen to be handled differently
by two different algorithms.

Let me know if you want to collaborate on this somehow.

Michael

Scoring heuristic:

>     # A place to accumulate bonus factors (positive makes this
>     # index more favored):
>     bonus = 0
> 
>     # Bonuses based on the location of blank lines:
>     if pre_blank and not blank:
>         bonus += 3
>     elif blank and not pre_blank:
>         bonus += 2
>     elif blank and pre_blank:
>         bonus += 1
> 
>     # Now fill in missing indent values:
>     if pre_indent is None:
>         pre_indent = 0
> 
>     if post_indent is None:
>         post_indent = 0
> 
>     if blank:
>         indent = post_indent
> 
>     if indent > pre_indent:
>         # The line is indented more than its predecessor. It is
>         # preferable to keep these lines together, so we score it
>         # based on the larger indent:
>         score = indent
>         bonus -= 4
> 
>     elif indent < pre_indent:
>         # The line is indented less than its predecessor. It could
>         # be that this line is the start of a new block (e.g., of
>         # an "else" block, or of a block without a block
>         # terminator) or it could be the end of the previous
>         # block.
>         if indent < post_indent:
>             # The following line is indented more. So it is likely
>             # that this line is the start of a block. It's a
>             # pretty good place to split, so score it based on the
>             # smaller indent:
>             score = indent
>             bonus += 2
>         else:
>             # This was probably the end of a block. We score based
>             # on the line's own indent:
>             score = indent
> 
>     else:
>         # The line has the same indentation level as its
>         # predecessor. We score it based on its own indent:
>         score = indent
> 
>     return 10 * score - bonus


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] diff compaction heuristic: favor shortest neighboring blank lines
  2016-06-23 17:10     ` Michael Haggerty
@ 2016-06-23 17:25       ` Stefan Beller
  2016-06-23 17:37       ` Junio C Hamano
  2016-06-30 13:54       ` Michael Haggerty
  2 siblings, 0 replies; 15+ messages in thread
From: Stefan Beller @ 2016-06-23 17:25 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Jeff King, git@vger.kernel.org, Jacob Keller

On Thu, Jun 23, 2016 at 10:10 AM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> On 06/17/2016 06:09 PM, Stefan Beller wrote:
>> I think before spending more time on discussing and implementing new
>> (hopefully better) heuristics, I'd want to step back and try to be a bit more
>> systematic, i.e. I'll want to collect lots of test cases in different languages
>> and use cases. A mini test suite, maybe?
>
> Stefan,
>
> I've also been playing around with diff heuristics and also trying to
> find some good test data. Have you put together some test cases yet?

I started with that and had 2 or 3 test cases written out, but the
diffs of diffs
confused me and then I got distracted, so I am stalling on putting together
a mini test suite. You'll find my attempt at [1] as an integration of our
test suite. Maybe we don't need to have these tests in tree, but once we're
done developing the heuristic, we can put the test suite in the commit message
of that commit? (Assuming the tests are short and as for a heuristic there is
no right and wrong, but only a better or worse)

[1] https://github.com/stefanbeller/git/tree/diff_heuristic_improvements


>
> In case you are interested, the current iteration of my heuristic
> considers six things to determine a "score" for breaking a diff above a
> particular line:
>
> * `blank` -- is the line blank?
> * `indent` -- the indentation of the line (if it is not blank)
> * `pre_blank` -- is the preceding line blank?
> * `pre_indent` -- the indentation of the closest preceding non-blank
>   line
> * `post_blank` -- is the following line blank?
> * `post_indent` -- the indentation of the closest following non-blank
>   line
>
> The meat of the scoring algorithm is below. I compute the score for the
> beginning and the end of the insert/delete block for each candidate
> shift of an insertion/deletion hunk, then choose the one that has the
> lowest score.
>
> There are some adjustable "bonus" values below. The values shown give
> quite good results for C, shell, Python, XML, Ruby, CSS, YAML, and HTML
> in the couple of projects that I've tried it on. But I haven't done
> enough systematic testing over a large enough corpus to know whether the
> parameters could be improved or whether there are types of code that it
> performs really terribly on.

I think the corpus doesn't have to be large, but rather diverse.

The current heuristic was tested by Jacob and Jeff over the history of the
linux kernel, (and some other C projects), which I consider a large corpus --
But it's all well formed C, i.e a well known coding style. [2]

[2] don't read this too seriously:
http://codegolf.stackexchange.com/questions/42510/reverse-indentation

>
> Right now my program doesn't output diffs in the usual format, but
> rather as a diff notated with various information. So it's not in a form
> that other people could use so easily. Here's an example:
>
>> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
>>                   *
>>                   * Default behaviour is to initialize the notes tree from the tree object
>>                   * specified by the given (or default) notes ref.
>>                   */
>>                  #define NOTES_INIT_EMPTY 1
>>     |     -4
>>     |     -6 +   /*
>>     ||    28 +  + * By default, the notes tree is only readable, and the notes ref can be
>>     ||       +  + * any treeish. The notes tree can however be made writable with this flag,
>>     ||       +  + * in which case only strict ref names can be used.
>>     ||       +  + */
>>     ||       +  +#define NOTES_INIT_WRITABLE 2
>>      |       +  +
>>      |          +/*
>>                   * Initialize the given notes_tree with the notes tree structure at the given
>>                   * ref. If given ref is NULL, the value of the $GIT_NOTES_REF environment
>>                   * variable is used, and if that is missing, the default notes ref is used
>>                   * ("refs/notes/commits").
>>                   *
>>                   * If you need to re-initialize a notes_tree structure (e.g. when switching from
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> The two columns of vertical bars show the highest and lowest range of
> lines that this insertion block can be shifted to. The numbers show the
> score for breaking the diff above that line. The first column of +/- is
> where this version of the heuristic would place the diff, and the second
> column is where the master version of Git would place it.

This is very valuable for understanding and testing different
patterns! Great work!

>
> My program can also search for and output "slideable" insertion/deletion
> blocks, so it can be used to find diffs that are theoretically capable
> of being optimized, even if they don't happen to be handled differently
> by two different algorithms.

Oh right, that is what we can use to find good examples to test on.

I wondered if we could "machine learn" the coefficients for these 6
parameters that you outline above to bring the best matches in most of the time.

>
> Let me know if you want to collaborate on this somehow.

I am interested, but I dropped the ball last week. I'll pick it up again. :)

Stefan

>
> Michael
>
> Scoring heuristic:
>
>>     # A place to accumulate bonus factors (positive makes this
>>     # index more favored):
>>     bonus = 0
>>
>>     # Bonuses based on the location of blank lines:
>>     if pre_blank and not blank:
>>         bonus += 3
>>     elif blank and not pre_blank:
>>         bonus += 2
>>     elif blank and pre_blank:
>>         bonus += 1
>>
>>     # Now fill in missing indent values:
>>     if pre_indent is None:
>>         pre_indent = 0
>>
>>     if post_indent is None:
>>         post_indent = 0
>>
>>     if blank:
>>         indent = post_indent
>>
>>     if indent > pre_indent:
>>         # The line is indented more than its predecessor. It is
>>         # preferable to keep these lines together, so we score it
>>         # based on the larger indent:
>>         score = indent
>>         bonus -= 4
>>
>>     elif indent < pre_indent:
>>         # The line is indented less than its predecessor. It could
>>         # be that this line is the start of a new block (e.g., of
>>         # an "else" block, or of a block without a block
>>         # terminator) or it could be the end of the previous
>>         # block.
>>         if indent < post_indent:
>>             # The following line is indented more. So it is likely
>>             # that this line is the start of a block. It's a
>>             # pretty good place to split, so score it based on the
>>             # smaller indent:
>>             score = indent
>>             bonus += 2
>>         else:
>>             # This was probably the end of a block. We score based
>>             # on the line's own indent:
>>             score = indent
>>
>>     else:
>>         # The line has the same indentation level as its
>>         # predecessor. We score it based on its own indent:
>>         score = indent
>>
>>     return 10 * score - bonus
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] diff compaction heuristic: favor shortest neighboring blank lines
  2016-06-23 17:10     ` Michael Haggerty
  2016-06-23 17:25       ` Stefan Beller
@ 2016-06-23 17:37       ` Junio C Hamano
  2016-06-23 20:13         ` Michael Haggerty
  2016-06-30 13:54       ` Michael Haggerty
  2 siblings, 1 reply; 15+ messages in thread
From: Junio C Hamano @ 2016-06-23 17:37 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: Stefan Beller, Jeff King, git@vger.kernel.org, Jacob Keller

Michael Haggerty <mhagger@alum.mit.edu> writes:

> Scoring heuristic:
>
>>     # A place to accumulate bonus factors (positive makes this
>>     # index more favored):
>>     bonus = 0
>> 
>>     # Bonuses based on the location of blank lines:
>>     if pre_blank and not blank:
>>         bonus += 3
>>     elif blank and not pre_blank:
>>         bonus += 2
>>     elif blank and pre_blank:
>>         bonus += 1
>> 
>>     # Now fill in missing indent values:
>>     if pre_indent is None:
>>         pre_indent = 0
>> 
>>     if post_indent is None:
>>         post_indent = 0

These "missing" are to handle the lines at the boundary of shiftable
block, or do you have pre_indent if a line is the earliest possible
one that could be included in the hunk by shifting, as long as it is
not at the beginning of the file?

I think "this is indented deeper than the previous, and it wants to
stay with the previous if possible" is an excellent heuristic, which
you have below.  I wonder if "however, the previous cannot be
included in the hunk by shifting, hence we try to keep this together
with it by favouring its exclusion from the hunk" could be part of
the same heuristics, if you do have pre_indent value for a line at
the boundary of shiftable block.

Also I am wondering if 0 is a sensible thing to use as the fallback
value.  When pre_indent is not definable, turning it to 0 and
declaring that our indented line is indented deeper than the
previous line and penalizing with "bonus -= 4" (which you do) feels
just as fishy as turning the undefined pre_indent to maxint and
declaring that our indented line is indented shallower than the
previous line (which you don't, but is just as justifiable, I would
think).

Thanks.

>>     if blank:
>>         indent = post_indent
>> 
>>     if indent > pre_indent:
>>         # The line is indented more than its predecessor. It is
>>         # preferable to keep these lines together, so we score it
>>         # based on the larger indent:
>>         score = indent
>>         bonus -= 4
>> 
>>     elif indent < pre_indent:
>>         # The line is indented less than its predecessor. It could
>>         # be that this line is the start of a new block (e.g., of
>>         # an "else" block, or of a block without a block
>>         # terminator) or it could be the end of the previous
>>         # block.
>>         if indent < post_indent:
>>             # The following line is indented more. So it is likely
>>             # that this line is the start of a block. It's a
>>             # pretty good place to split, so score it based on the
>>             # smaller indent:
>>             score = indent
>>             bonus += 2
>>         else:
>>             # This was probably the end of a block. We score based
>>             # on the line's own indent:
>>             score = indent
>> 
>>     else:
>>         # The line has the same indentation level as its
>>         # predecessor. We score it based on its own indent:
>>         score = indent
>> 
>>     return 10 * score - bonus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] diff compaction heuristic: favor shortest neighboring blank lines
  2016-06-23 17:37       ` Junio C Hamano
@ 2016-06-23 20:13         ` Michael Haggerty
  0 siblings, 0 replies; 15+ messages in thread
From: Michael Haggerty @ 2016-06-23 20:13 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Stefan Beller, Jeff King, git@vger.kernel.org, Jacob Keller

On 06/23/2016 07:37 PM, Junio C Hamano wrote:
> Michael Haggerty <mhagger@alum.mit.edu> writes:
> 
>> Scoring heuristic:
>>
>>>     # A place to accumulate bonus factors (positive makes this
>>>     # index more favored):
>>>     bonus = 0
>>>
>>>     # Bonuses based on the location of blank lines:
>>>     if pre_blank and not blank:
>>>         bonus += 3
>>>     elif blank and not pre_blank:
>>>         bonus += 2
>>>     elif blank and pre_blank:
>>>         bonus += 1
>>>
>>>     # Now fill in missing indent values:
>>>     if pre_indent is None:
>>>         pre_indent = 0
>>>
>>>     if post_indent is None:
>>>         post_indent = 0
> 
> These "missing" are to handle the lines at the boundary of shiftable
> block, or do you have pre_indent if a line is the earliest possible
> one that could be included in the hunk by shifting, as long as it is
> not at the beginning of the file?

The missing values are meant to handle the case where there are only
blank lines between the add/delete block and the beginning/end of the
file. Though currently (due to a limitation of the prototype) I only
pass context up to the next add/delete line, so that fallback also kicks
in if there are only blank lines between this add/delete block and the
next added/deleted line. When that is fixed, I think it might help to
give beginning/end of file a little bit more bonus.

This isn't all implemented yet, but I think the best score for an
add/delete block positioning would be the sum of three values. Let's
take the following "complete file" diff as an example, and consider an
add block with the following positioning:

> diff --git a/file.txt b/file.txt
> index 9405325..6a6b95c 100644
> --- a/file.txt
> +++ b/file.txt
> @@ -1,5 +1,7 @@
>  a
>  b
>  c
> +x
> +c
>  d
>  e

The score for splitting the old version of the file between
"bc" and "de", plus

the score for splitting the new version of the file between
"bc" and "xc", plus

the score for splitting the new version of the file between
"xc" and "de".

> I think "this is indented deeper than the previous, and it wants to
> stay with the previous if possible" is an excellent heuristic, which
> you have below.  I wonder if "however, the previous cannot be
> included in the hunk by shifting, hence we try to keep this together
> with it by favouring its exclusion from the hunk" could be part of
> the same heuristics, if you do have pre_indent value for a line at
> the boundary of shiftable block.

That's automatic, isn't it? If splitting between two lines is assigned a
high cost, then the algorithm will avoid splitting there *either* by
including both lines in the hunk *or* by omitting both lines.

> Also I am wondering if 0 is a sensible thing to use as the fallback
> value.  When pre_indent is not definable, turning it to 0 and
> declaring that our indented line is indented deeper than the
> previous line and penalizing with "bonus -= 4" (which you do) feels
> just as fishy as turning the undefined pre_indent to maxint and
> declaring that our indented line is indented shallower than the
> previous line (which you don't, but is just as justifiable, I would
> think).

Once the algorithm is fixed to consider the whole file, then the
fallback value will only be used at the beginning and end of the file,
where an indent of 0 seems like the natural thing to do. As I mentioned
above, it might even help to give beginning/end of file a little more
bonus than that.

But really, testing against real-world diffs will be the best basis for
adjusting the heuristic.

Michael


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] diff compaction heuristic: favor shortest neighboring blank lines
  2016-06-23 17:10     ` Michael Haggerty
  2016-06-23 17:25       ` Stefan Beller
  2016-06-23 17:37       ` Junio C Hamano
@ 2016-06-30 13:54       ` Michael Haggerty
  2016-07-01 17:04         ` diff heuristics dramatically improved by considering line indentation and " Michael Haggerty
  2016-07-01 18:01         ` [PATCH] diff compaction heuristic: favor shortest neighboring " Junio C Hamano
  2 siblings, 2 replies; 15+ messages in thread
From: Michael Haggerty @ 2016-06-30 13:54 UTC (permalink / raw)
  To: Stefan Beller, Jeff King
  Cc: git@vger.kernel.org, Jacob Keller, Junio C Hamano

On 06/23/2016 07:10 PM, Michael Haggerty wrote:
> On 06/17/2016 06:09 PM, Stefan Beller wrote:
>> I think before spending more time on discussing and implementing new
>> (hopefully better) heuristics, I'd want to step back and try to be a bit more
>> systematic, i.e. I'll want to collect lots of test cases in different languages
>> and use cases. A mini test suite, maybe?
> 
> Stefan,
> 
> I've also been playing around with diff heuristics and also trying to
> find some good test data. Have you put together some test cases yet?

I put quite of work into tooling to evaluate diff heuristics, and just
published it on GitHub:

    https://github.com/mhagger/diff-slider-tools

The README there is hopefully enough to let people get started using it
to test their own favorite heuristics.

!!!
Please be careful! The scripts are not hardened against strangely-named
files!
!!!

I've run a bunch of tests of the heuristic that I described in the
previous email against a bunch of repositories (to try to get a sampling
of popular languages): ant, bugzilla, couchdb, docker, git, ipython,
junit, nodejs, phpmyadmin, test-more, test-unit, xmonad.

I haven't done a lot of systematic analysis yet, nor tried to optimize
the heuristic systematically, but by eye the results look very good.
Some constructs (C preprocessor directives, Shell here documents)
sometimes confuse it because of their weird indentation, and of course
code with inconsistent indentation or strangely-placed blank lines
doesn't get the best results. But in a large majority of cases where it
disagrees with the standard Git or with `git --compaction-heuristic`, it
does a better job.

Below are a few examples to show off some typical results. I promise I
tried not to cherry pick only flattering examples.

In the output below, `|` mark the highest and lowest possible
positioning of the "slider" (a group of added or deleted lines whose
exact identity is ambiguous). Current Git, by default, always chooses
the lowest positioning. "c" is where Git master with
`--compaction-heuristic` chooses to position the slider, and `i` is the
choice of my indentation-based algorithm. The numbers are just the shift
relative to the standard shift.

Michael

The algorithm does pretty well with XML and HTML:

> fef3ea39f8bd474add292bb6437df6cbd22e1ba7:contributors.xml a394a0bdf8e6240dc09023a8260059c57c158a00:contributors.xml + 1624
> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
>                >    <last>Toni</last>
>                >  </name>
>                >  <name>
>                >    <first>Vincent</first>
>                >    <last>Legoll</last>
>       -2 |     >  </name>
>       -1 |   i >  <name>
>        0 || ci >    <first>Vincent</first>
>          || ci >    <last>Privat</last>
>           | ci >  </name>
>           | c  >  <name>
>                >    <first>Vimil</first>
>                >    <last>Saju</last>
>                >  </name>
>                >  <name>
>                >    <first>Vitold</first>
>                >    <last>Sedyshev</last>
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

> 789422e131b6c2c003d94f394169a64297e986c6:manual/Tasks/signjar.html 7f51882300a8e40ff675867e055061867ba6c8bd:manual/Tasks/signjar.html + 151
> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
>                >  </tr>
>                >  <tr>
>                >    <td valign="top">tsacert</td>
>                >    <td valign="top">alias in the keystore for a timestamp authority for
>                >    timestamped JAR files in Java1.5+</td>
>       -3 |     >    <td valign="top" align="center">No</td>
>       -2 |     >  </tr>
>       -1 |   i >  <tr>
>        0 || ci >    <td valign="top">tsaproxyhost</td>
>          || ci >    <td valign="top">proxy host to be used when connecting to TSA server</td>
>          || ci >    <td valign="top" align="center">No</td>
>          || ci >  </tr>
>          || ci >  <tr>
>          || ci >    <td valign="top">getTsaproxyport</td>
>          || ci >    <td valign="top">proxy port to be used when connecting to TSA server</td>
>           | ci >    <td valign="top" align="center">No</td>
>           | ci >  </tr>
>           | c  >  <tr>
>                >    <td valign="top">executable</td>
>                >    <td valign="top">Specify a particular <code>jarsigner</code> executable
>                >      to use in place of the default binary (found in the same JDK as
>                >      Apache Ant is running in).<br/>
>                >      Must support the same command line options as the Sun JDK
>                >      jarsigner command.
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Java:

> de5b4058b9bb039cdb17082fc543098de598ece2:src/main/org/apache/tools/ant/taskdefs/optional/ssh/AbstractSshMessage.java 3fe363e2120342f6de2b219c3bd74efe9aea2803:src/main/org/apache/tools/ant/taskdefs/optional/ssh/AbstractSshMessage.java + 208
> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
>                >     * @return true if the verbose attribute is set
>                >     * @since Ant 1.6.2
>                >     */
>                >    protected final boolean getVerbose() {
>                >        return verbose;
>       -3 |     >    }
>       -2 |     >
>       -1 |  ci >    /**
>        0 || ci >     * Is the compressed attribute set.
>          || ci >     * @return true if the compressed attribute is set
>          || ci >     * @since Ant 1.9.8
>          || ci >     */
>          || ci >    protected final boolean getCompressed() {
>          || ci >        return compressed;
>           | ci >    }
>           | ci >
>           |    >    /**
>                >     * Track progress every 10% if 100kb < filesize < 1mb. For larger
>                >     * files track progress for every percent transmitted.
>                >     * @param filesize the size of the file been transmitted
>                >     * @param totalLength the total transmission size
>                >     * @param percentTransmitted the current percent transmitted
>                >     * @return the percent that the file is of the total
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

> cd1ff3566ed915392041314eb583a638bcb1eb64:src/main/org/apache/tools/ant/taskdefs/optional/ejb/WeblogicDeploymentTool.java a8d6367ddc214b4956b2a7d6b779930df5e43515:src/main/org/apache/tools/ant/taskdefs/optional/ejb/WeblogicDeploymentTool.java - 888
> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
>                >                    // empty
>                >                }
>                >            }
>                >
>                >            if (newJarStream != null) {
>       -1 |   i >                try {
>        0 || ci >                    newJarStream.close();
>          || ci >                } catch (IOException closeException) {
>          || ci >                    // empty
>          || ci >                }
>          || ci >
>           | c  >                try {
>                >                    FILE_UTILS.rename(newWLJarFile, weblogicJarFile);
>                >                } catch (IOException renameException) {
>                >                    log(renameException.getMessage(), Project.MSG_WARN);
>                >                    rebuild = true;
>                >                }
>                >            }
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Perl:

> bb18f8a6d560fa141f812dddcf6f6d8fe8b49dbb:Bugzilla/DB.pm 16f1833e572297edd89faddb69364e09efecdfdb:Bugzilla/DB.pm + 98
> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
>                ># PostgreSQL, and hopefully will get us as much Unicode support as possible,
>                ># depending on how well the regexp engines of the various databases support
>                ># Unicode.
>                >use constant WORD_START => '(^|[^[:alnum:]])';
>                >use constant WORD_END   => '($|[^[:alnum:]])';
>       -2 |     >
>       -1 |  ci >#####################################################################
>        0 || ci ># Overridden Superclass Methods 
>          || ci >#####################################################################
>          || ci >
>          || ci >sub quote {
>          || ci >    my $self = shift;
>          || ci >    my $retval = $self->SUPER::quote(@_);
>          || ci >    trick_taint($retval) if defined $retval;
>          || ci >    return $retval;
>          || ci >}
>           | ci >
>           |    >#####################################################################
>                ># Connection Methods
>                >#####################################################################
>                >
>                >sub connect_shadow {
>                >    my $params = Bugzilla->params;
>                >    die "Tried to connect to non-existent shadowdb" 
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

> dbfd6207290d1eee53fddec4c7c3b4aac0b2d47a:Bugzilla/Install/Requirements.pm 3c7ab8316f97fe84222ccacf0d1b4be9cd43e9f8:Bugzilla/Install/Requirements.pm - 338
> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
>                >        feature => ['rest']
>                >    },
>                >    {
>                >        package => 'HTTP-Response',
>                >        module  => 'HTTP::Response',
>       -4 |     >        version => 0,
>       -3 |     >        feature => ['rest']
>       -2 |     >    },
>       -1 |   i >    {
>        0 || ci >        package => 'URI-Escape',
>          || ci >        module  => 'URI::Escape',
>           | ci >        version => 0,
>           | ci >        feature => ['rest']
>           | ci >    },
>           | c  >    {
>                >        # We need the 'utf8_mode' method of HTML::Parser, for HTML::Scrubber.
>                >        package => 'HTML-Parser',
>                >        module  => 'HTML::Parser',
>                >        version => ($^V >= v5.13.3) ? '3.67' : '3.40',
>                >        feature => ['html_desc'],
>                >    },
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Erlang:

> 56f969b3c49375c5321eb3cceb0fe346f8535c22:src/couchdb/couch_httpd.erl b90e40212663474e873fde6cab343c31c1e635e7:src/couchdb/couch_httpd.erl + 332
> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
>                >        case couch_httpd_cors:is_preflight_request(HttpReq) of
>                >        #httpd{} ->
>                >        case authenticate_request(HttpReq, AuthHandlers) of
>                >        #httpd{} = Req ->
>                >            HandlerFun(Req);
>       -2 |   i >        Response ->
>       -1 |   i >            Response
>        0 || ci >            end;
>           | c  >        Response ->
>           | c  >            Response
>                >        end
>                >    catch
>                >        throw:{http_head_abort, Resp0} ->
>                >            {ok, Resp0};
>                >        throw:{invalid_json, S} ->
>                >            ?LOG_ERROR("attempted upload of invalid JSON (set log_level to debug to log it)", []),
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Go:

> 7b2b4eb40cfd907f3aa20b630cc2feeb8460404a:opts/hosts.go fb3eb1c27ef5520571c599ead8a72b343748db39:opts/hosts.go + 132
> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
>                >	u, err := url.Parse("tcp://" + addr)
>                >	if err != nil {
>                >		return "", err
>                >	}
>                >	host, port, err := net.SplitHostPort(u.Host)
>       -1 |   i >	if err != nil {
>        0 || ci >		// try port addition once
>          || ci >		host, port, err = net.SplitHostPort(net.JoinHostPort(u.Host, defaultPort))
>          || ci >	}
>           | c  >	if err != nil {
>                >		return "", fmt.Errorf("Invalid bind address format: %s", tryAddr)
>                >	}
>                >
>                >	if host == "" {
>                >		host = defaultHost
>                >	}
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

gettext source:

> aef18cc6063106075afeb3a78ec72656b19c8bde:po/de.po b0e098ce46ce7f8ecd975720ccec8d0eb7787e50:po/de.po - 12954
> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
>                >#~ msgid "unmerged:   %s"
>                >#~ msgstr "nicht zusammengeführt:   %s"
>                >
>                >#~ msgid "input paths are terminated by a null character"
>                >#~ msgstr "Eingabepfade sind durch ein NUL Zeichen abgeschlossen"
>       -2 |     >
>       -1 |  ci >#~ msgid ""
>        0 || ci >#~ "The following untracked files would NOT be saved but need to be removed "
>          || ci >#~ "by stash save:"
>          || ci >#~ msgstr ""
>          || ci >#~ "Die folgenden unbeobachteten Dateien würden NICHT gespeichert werden,\n"
>          || ci >#~ "müssen aber durch \"stash save\" entfernt werden:"
>           | ci >
>           |    >#~ msgid ""
>                >#~ "Aborting. Consider using either the --force or --include-untracked option."
>                >#~ msgstr ""
>                >#~ "Abgebrochen. Benutzen Sie entweder die Option --force oder --include-"
>                >#~ "untracked."
>                >
>                >#~ msgid "  (fix conflicts and then run \"git am --resolved\")"
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

C:

> eeff891ac756fd97a05476446f15269b714ce4cc:grep.c 79a77109d3d0d364910ff7fa8c605c554dc4c3e0:grep.c + 1133
> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
>                >		regmatch_t match;
>                >		enum grep_context ctx = GREP_CONTEXT_BODY;
>                >		int ch = *eol;
>                >		int eflags = 0;
>                >
>       -1 |   i >		if (sign == ':')
>        0 || ci >			match_color = opt->color_match_selected;
>          || ci >		else
>          || ci >			match_color = opt->color_match_context;
>           | c  >		if (sign == ':')
>                >			line_color = opt->color_selected;
>                >		else if (sign == '-')
>                >			line_color = opt->color_context;
>                >		else if (sign == '=')
>                >			line_color = opt->color_function;
>                >		*eol = '\0';
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Tcl:

> a9ae14a1c50a9b5bd051be629dba0fb04b6fd67a:git-gui.sh 25476c63e7d3357f955b44dd6a7c825aba819987:git-gui.sh - 3158
> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
>                >
>                ># -- Diff Body Context Menu
>                >#
>                >
>                >proc create_common_diff_popup {ctxm} {
>       -1 |   i >	$ctxm add command \
>        0 || ci >		-label [mc "Show Less Context"] \
>          || ci >		-command show_less_context
>          || ci >	lappend diff_actions [list $ctxm entryconf [$ctxm index last] -state]
>          || ci >	$ctxm add command \
>          || ci >		-label [mc "Show More Context"] \
>          || ci >		-command show_more_context
>          || ci >	lappend diff_actions [list $ctxm entryconf [$ctxm index last] -state]
>          || ci >	$ctxm add separator
>           | c  >	$ctxm add command \
>                >		-label [mc Refresh] \
>                >		-command reshow_diff
>                >	lappend diff_actions [list $ctxm entryconf [$ctxm index last] -state]
>                >	$ctxm add command \
>                >		-label [mc Copy] \
>                >		-command {tk_textCopy $ui_diff}
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Shell:

> 34a82bda7766f000ef646130ed3f6af58ca23aa2:git-subtree.sh 942dce5578c8eb03fdc7f9109c8418d499e931ff:git-subtree.sh + 40
> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
>                >rejoin=
>                >ignore_joins=
>                >annotate=
>                >
>                >debug()
>       -1 |   i >{
>        0 || ci >	if [ -n "$debug" ]; then
>          || ci >		echo "$@" >&2
>          || ci >	fi
>          || ci >}
>          || ci >
>          || ci >say()
>           | c  >{
>                >	if [ -z "$quiet" ]; then
>                >		echo "$@" >&2
>                >	fi
>                >}
>                >
>                >assert()
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Python:

> 9ccc64b8ea94a9083038b079a8e2691ae94591fe:IPython/html/base/handlers.py db6ce120fc0bc0a4b29969f43af98e4841d5e1be:IPython/html/base/handlers.py - 90
> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
>                >    def cookie_name(self):
>                >        default_cookie_name = non_alphanum.sub('-', 'username-{}'.format(
>                >            self.request.host
>                >        ))
>                >        return self.settings.get('cookie_name', default_cookie_name)
>       -2 |     >    
>       -1 |   i >    @property
>        0 || ci >    def password(self):
>          || ci >        """our password"""
>          || ci >        return self.settings.get('password', '')
>           | ci >    
>           | c  >    @property
>                >    def logged_in(self):
>                >        """Is a user currently logged in?
>                >
>                >        """
>                >        user = self.get_current_user()
>                >        return (user and not user == 'anonymous')
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

javascript (and JSON):

> d3c3a928c3a2f9e961881b47ef0796e57ae8d429:tools/eslint/lib/rules/no-arrow-condition.js d7aa8fa088f3b8a31c7d85c6d71824c8c60e7c17:tools/eslint/lib/rules/no-confusing-arrow.js - 58
> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
>                >//------------------------------------------------------------------------------
>                >// Rule Definition
>                >//------------------------------------------------------------------------------
>                >
>                >module.exports = function(context) {
>       -1 |   i >    /**
>        0 || ci >     * Reports if a conditional statement is an arrow function.
>          || ci >     * @param {ASTNode} node - A node to check and report.
>          || ci >     * @returns {void}
>          || ci >     */
>          || ci >    function checkCondition(node) {
>          || ci >        if (isArrowFunction(node)) {
>          || ci >            context.report(node, "Arrow function `=>` used inside {{statementType}} instead of comparison operator.", {statementType: node.type});
>          || ci >        }
>          || ci >    }
>          || ci >
>           | c  >    /**
>                >     * Reports if an arrow function contains an ambiguous conditional.
>                >     * @param {ASTNode} node - A node to check and report.
>                >     * @returns {void}
>                >     */
>                >    function checkArrowFunc(node) {
>                >        if (isConditional(node)) {
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

PHP:

> cc9550579c02d652686bdc574bc91fb982f7d95b:libraries/ThemeManager.php ca5eb903b04035f78819ae1e14ff888155bd3e32:libraries/ThemeManager.php + 20
> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
>                > *
>                > * @package PhpMyAdmin
>                > */
>                >class ThemeManager
>                >{
>       -1 |   i >    /**
>        0 || ci >     * ThemeManager instance
>          || ci >     *
>          || ci >     * @access private
>          || ci >     * @static
>          || ci >     * @var ThemeManager
>          || ci >     */
>          || ci >    private static $_instance;
>          || ci >
>           | c  >    /**
>                >     * @var string path to theme folder
>                >     * @access protected
>                >     */
>                >    private $_themes_path = './themes/';
>                >
>                >    /**
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Ruby:

> 0ae3800be7ca632c47c089d527b53a8f66960c60:test/test_assertions.rb c35d9e2de60441cae592a300dae1460b45562cce:test/test_assertions.rb - 120
> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
>                >      end
>                >      
>                >      def test_assert_equal
>                >        check_nothing_fails {
>                >          assert_equal("string1", "string1")
>       -2 |     >        }
>       -1 |   i >        check_nothing_fails {
>        0 || ci >          assert_equal( "string1", "string1", "successful assert_equal")
>           | ci >        }
>           | c  >        check_nothing_fails {
>                >          assert_equal("string1", "string1", "successful assert_equal")
>                >        }
>                >
>                >        message = <<-EOM.chomp
>                ><"string1"> expected but was
>                ><"string2">.
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


It often fails to get C preprocessor directives right:

> a08595f76159b09d57553e37a5123f1091bb13e7:http.c aeff8a61216bf6e0d663c08c583bc8552fa3c344:http.c + 429
> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
>                >		curl_easy_setopt(result, CURLOPT_SSLKEY, ssl_key);
>                >#endif
>                >#if LIBCURL_VERSION_NUM >= 0x070908
>                >	if (ssl_capath != NULL)
>                >		curl_easy_setopt(result, CURLOPT_CAPATH, ssl_capath);
>       -1 |   i >#endif
>        0 || ci >#if LIBCURL_VERSION_NUM >= 0x072c00
>          || ci >	if (ssl_pinnedkey != NULL)
>          || ci >		curl_easy_setopt(result, CURLOPT_PINNEDPUBLICKEY, ssl_pinnedkey);
>           | c  >#endif
>                >	if (ssl_cainfo != NULL)
>                >		curl_easy_setopt(result, CURLOPT_CAINFO, ssl_cainfo);
>                >
>                >	if (curl_low_speed_limit > 0 && curl_low_speed_time > 0) {
>                >		curl_easy_setopt(result, CURLOPT_LOW_SPEED_LIMIT,
>                >				 curl_low_speed_limit);
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

And it gets confused by unusual blank line placement:

> ed55169834a3ce16a271def9630c858626ded34d:tools/eslint/node_modules/doctrine/lib/doctrine.js 2d441493a4a46a511ba1bdf93e442c3288fbe92d:tools/eslint/node_modules/doctrine/lib/doctrine.js + 330
> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
>                >                        name === 'external' ||
>                >                        name === 'event')) {
>                >                    name += advance();
>                >                    name += scanIdentifier(last);
>                >
>       -1 |   i >                }
>        0 || ci >                if(source.charCodeAt(index) === 0x5B  /* '[' */ && source.charCodeAt(index + 1) === 0x5D  /* ']' */){
>          || ci >                    name += advance();
>          || ci >                    name += advance();
>           | c  >                }
>                >                while (source.charCodeAt(index) === 0x2E  /* '.' */ ||
>                >                        source.charCodeAt(index) === 0x23  /* '#' */ ||
>                >                        source.charCodeAt(index) === 0x7E  /* '~' */) {
>                >                    name += advance();
>                >                    name += scanIdentifier(last);
>                >                }
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

or inconsistent indentation:

> 1768d08d029dc3bf1ab88e26df0a9b40ae61227f:src/main/java/org/junit/internal/runners/statements/FailOnTimeout.java 5ca9da987a7d4dc00e082aaf552cbd8ee8c7bd33:src/main/java/org/junit/internal/runners/statements/FailOnTimeout.java + 136
> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
>                >                    maxCpuTime = threadCpuTime;
>                >                }
>                >    		}   			
>                >    	}
>                >    	return (stuckThread == mainThread) ? null : stuckThread;
>       -2 |   i >    }
>       -1 |   i >
>        0 || ci >    /**
>          || ci >     * Returns the CPU time used by a thread, if possible.
>          || ci >     * @param thr The thread to query.
>          || ci >     * @return The CPU time used by {@code thr}, or 0 if it cannot be determined.
>          || ci >     */
>          || ci >    private long cpuTime (Thread thr) {
>          || ci >		ThreadMXBean mxBean = ManagementFactory.getThreadMXBean();
>          || ci >        if (mxBean.isThreadCpuTimeSupported()) {
>          || ci >            try {
>          || ci >                return mxBean.getThreadCpuTime(thr.getId());
>          || ci >            } catch (UnsupportedOperationException e) {
>          || ci >            }
>          || ci >        }
>          || ci >        return 0;
>           | c  >    }
>           | c  >
>                >	private class CallableStatement implements Callable<Throwable> {
>                >        public Throwable call() throws Exception {
>                >            try {
>                >                fOriginalStatement.evaluate();
>                >            } catch (Exception e) {
>                >                throw e;
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


^ permalink raw reply	[flat|nested] 15+ messages in thread

* diff heuristics dramatically improved by considering line indentation and blank lines
  2016-06-30 13:54       ` Michael Haggerty
@ 2016-07-01 17:04         ` Michael Haggerty
  2016-07-01 18:01         ` [PATCH] diff compaction heuristic: favor shortest neighboring " Junio C Hamano
  1 sibling, 0 replies; 15+ messages in thread
From: Michael Haggerty @ 2016-07-01 17:04 UTC (permalink / raw)
  To: Stefan Beller, Jeff King
  Cc: git@vger.kernel.org, Jacob Keller, Junio C Hamano

On 06/30/2016 03:54 PM, Michael Haggerty wrote:
> On 06/23/2016 07:10 PM, Michael Haggerty wrote:
>> On 06/17/2016 06:09 PM, Stefan Beller wrote:
>>> I think before spending more time on discussing and implementing new
>>> (hopefully better) heuristics, I'd want to step back and try to be a bit more
>>> systematic, i.e. I'll want to collect lots of test cases in different languages
>>> and use cases. A mini test suite, maybe?
>>
>> Stefan,
>>
>> I've also been playing around with diff heuristics and also trying to
>> find some good test data. Have you put together some test cases yet?
> 
> I put quite of work into tooling to evaluate diff heuristics, and just
> published it on GitHub:
> 
>     https://github.com/mhagger/diff-slider-tools

Today I hand-optimized about 2700 diff sliders in 12 different
repositories and used that as a reference corpus against which I tested
the accuracy of four different algorithms. The tools and the
hand-optimized values are now in the GitHub repository mentioned above.
See also this pull request [1].

If I didn't make any mistakes, the number of errors (i.e., cases where
the algorithm choose a different slider shift than the one I picked by
hand) look like this:

> | repository           | default | compaction | indent | indent-favor-edges |
> | -------------------- | ------- | ---------- | ------ | ------------------ |
> | ant                  |     225 |        102 |      7 |                  7 |
> | bugzilla             |     208 |         81 |     14 |                 14 |
> | couchdb              |      44 |         24 |     13 |                 10 |
> | docker               |     180 |        160 |     29 |                 18 |
> | git                  |     446 |         59 |     27 |                 19 |
> | ipython              |     358 |        163 |     61 |                 11 |
> | junit                |     146 |         67 |      5 |                  5 |
> | nodejs               |     489 |         78 |     12 |                 12 |
> | phpmyadmin           |     330 |         49 |      1 |                  0 |
> | test-more            |      15 |          2 |      2 |                  0 |
> | test-unit            |      33 |         13 |      4 |                  4 |
> | -------------------- | ------- | ---------- | ------ | ------------------ |
> | totals               |    2474 |        798 |    175 |                100 |

The algorithms are:

* default -- the default behavior `git diff` on the current master
* compaction -- `git diff --compaction-heuristic`
* indent -- the indent-based algorithm as reported earlier in
  this thread
* indent-favor-edges -- the indent-based algorithm, with some
  improved heuristics regarding sliders near the begin/end of file
  and probably one or two fixes

I encourage other people to run the tests against some code samples in
their favorite programming languages so that the testing can cover a
wider range of inputs, and submit your data to the GitHub project. Also
feel free to tweak the heuristic (there's probably still room for
improvement). All the tools and raw data for my experiments are already
there.

I'll be going on vacation for the next two weeks so I probably can't
follow up on this for a while. But I think we should build an algorithm
like this into Git!

Michael

[1] https://github.com/mhagger/diff-slider-tools/pull/1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] diff compaction heuristic: favor shortest neighboring blank lines
  2016-06-30 13:54       ` Michael Haggerty
  2016-07-01 17:04         ` diff heuristics dramatically improved by considering line indentation and " Michael Haggerty
@ 2016-07-01 18:01         ` Junio C Hamano
  2016-07-04 14:33           ` Jakub Narębski
  1 sibling, 1 reply; 15+ messages in thread
From: Junio C Hamano @ 2016-07-01 18:01 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: Stefan Beller, Jeff King, git@vger.kernel.org, Jacob Keller

Michael Haggerty <mhagger@alum.mit.edu> writes:

> I put quite of work into tooling to evaluate diff heuristics, and just
> published it on GitHub:
>
>     https://github.com/mhagger/diff-slider-tools
>
> The README there is hopefully enough to let people get started using it
> to test their own favorite heuristics.

Intereting.  A fair TL;DR of this would be that what we write and
want to diff have the structure where things often are ordered in an
inherent hierarchy, and things at the higher level are highlighted
by being less indented, and the indent-based heuristics exploit that
well to choose a slide that breaks at the higher-level boundary, e.g.

> The algorithm does pretty well with XML and HTML:
>
>> fef3ea39f8bd474add292bb6437df6cbd22e1ba7:contributors.xml a394a0bdf8e6240dc09023a8260059c57c158a00:contributors.xml + 1624
>> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
>>                >    <last>Toni</last>
>>                >  </name>
>>                >  <name>
>>                >    <first>Vincent</first>
>>                >    <last>Legoll</last>
>>       -2 |     >  </name>
>>       -1 |   i >  <name>
>>        0 || ci >    <first>Vincent</first>
>>          || ci >    <last>Privat</last>
>>           | ci >  </name>
>>           | c  >  <name>

And that trait is shared among common programming languages (Java,
Perl, Go, C, etc.).

The only case I can think of offhand that does not follow "higher
level heading is less indented" pattern is an already typeset book,
where chapter headers are often centered on the page while section
headers may be at the left edge.  But we are not likely to be
interested to get that case right anyway, so it is OK.

> gettext source:
>
>> aef18cc6063106075afeb3a78ec72656b19c8bde:po/de.po b0e098ce46ce7f8ecd975720ccec8d0eb7787e50:po/de.po - 12954
>> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
>>                >#~ msgid "unmerged:   %s"
>>                >#~ msgstr "nicht zusammengeführt:   %s"
>>                >
>>                >#~ msgid "input paths are terminated by a null character"
>>                >#~ msgstr "Eingabepfade sind durch ein NUL Zeichen abgeschlossen"
>>       -2 |     >
>>       -1 |  ci >#~ msgid ""
>>        0 || ci >#~ "The following untracked files would NOT be saved but need to be removed "
>>          || ci >#~ "by stash save:"
>>          || ci >#~ msgstr ""
>>          || ci >#~ "Die folgenden unbeobachteten Dateien würden NICHT gespeichert werden,\n"
>>          || ci >#~ "müssen aber durch \"stash save\" entfernt werden:"
>>           | ci >

This is an interesting example of "have only a single level
hierarchy.  A line is either in one block or a blank between
blocks".

> It often fails to get C preprocessor directives right:
>
>> a08595f76159b09d57553e37a5123f1091bb13e7:http.c aeff8a61216bf6e0d663c08c583bc8552fa3c344:http.c + 429
>> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
>>                >		curl_easy_setopt(result, CURLOPT_SSLKEY, ssl_key);
>>                >#endif
>>                >#if LIBCURL_VERSION_NUM >= 0x070908
>>                >	if (ssl_capath != NULL)
>>                >		curl_easy_setopt(result, CURLOPT_CAPATH, ssl_capath);
>>       -1 |   i >#endif
>>        0 || ci >#if LIBCURL_VERSION_NUM >= 0x072c00
>>          || ci >	if (ssl_pinnedkey != NULL)
>>          || ci >		curl_easy_setopt(result, CURLOPT_PINNEDPUBLICKEY, ssl_pinnedkey);
>>           | c  >#endif

Yes, this is "non-human do not know 'end' is likely to be at the end
of a logical block".

> And it gets confused by unusual blank line placement:
>
>> ed55169834a3ce16a271def9630c858626ded34d:tools/eslint/node_modules/doctrine/lib/doctrine.js 2d441493a4a46a511ba1bdf93e442c3288fbe92d:tools/eslint/node_modules/doctrine/lib/doctrine.js + 330
>> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
>>                >                        name === 'external' ||
>>                >                        name === 'event')) {
>>                >                    name += advance();
>>                >                    name += scanIdentifier(last);
>>                >
>>       -1 |   i >                }
>>        0 || ci >                if(source.charCodeAt(index) === 0x5B  /* '[' */ && source.charCodeAt(index + 1) === 0x5D  /* ']' */){
>>          || ci >                    name += advance();
>>          || ci >                    name += advance();
>>           | c  >                }

Likewise, this is showing that a "non-human not knowing } is a closing
and { is an opening token".

By the way, I no longer remember what indent level your
"indent-only" heuristics gives to a blank line.  Does the closing
brace we see above (marked with -1) count as increasing the
indentation level from the previous empty-line at indent 0, or does it
count as dedenting from the previous empty-line that has virtually
at the same indent level as "name += scanIdentifier(last);" line?


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] diff compaction heuristic: favor shortest neighboring blank lines
  2016-07-01 18:01         ` [PATCH] diff compaction heuristic: favor shortest neighboring " Junio C Hamano
@ 2016-07-04 14:33           ` Jakub Narębski
  0 siblings, 0 replies; 15+ messages in thread
From: Jakub Narębski @ 2016-07-04 14:33 UTC (permalink / raw)
  To: git; +Cc: Stefan Beller, Jeff King, git@vger.kernel.org, Jacob Keller

W dniu 2016-07-01 o 20:01, Junio C Hamano pisze:
> Michael Haggerty <mhagger@alum.mit.edu> writes:

>> It often fails to get C preprocessor directives right:
>>
>>> a08595f76159b09d57553e37a5123f1091bb13e7:http.c aeff8a61216bf6e0d663c08c583bc8552fa3c344:http.c + 429
>>> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
>>>                >		curl_easy_setopt(result, CURLOPT_SSLKEY, ssl_key);
>>>                >#endif
>>>                >#if LIBCURL_VERSION_NUM >= 0x070908
>>>                >	if (ssl_capath != NULL)
>>>                >		curl_easy_setopt(result, CURLOPT_CAPATH, ssl_capath);
>>>       -1 |   i >#endif
>>>        0 || ci >#if LIBCURL_VERSION_NUM >= 0x072c00
>>>          || ci >	if (ssl_pinnedkey != NULL)
>>>          || ci >		curl_easy_setopt(result, CURLOPT_PINNEDPUBLICKEY, ssl_pinnedkey);
>>>           | c  >#endif
> 
> Yes, this is "non-human do not know 'end' is likely to be at the end
> of a logical block".

I wonder if taking into account xfuncname match to adjust heuristics
(or to change "slider") would help there.  I think good heuristic
would be to break before xfuncname match (before new section).
 
>> And it gets confused by unusual blank line placement:
>>
>>> ed55169834a3ce16a271def9630c858626ded34d:tools/eslint/node_modules/doctrine/lib/doctrine.js 2d441493a4a46a511ba1bdf93e442c3288fbe92d:tools/eslint/node_modules/doctrine/lib/doctrine.js + 330
>>> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
>>>                >                        name === 'external' ||
>>>                >                        name === 'event')) {
>>>                >                    name += advance();
>>>                >                    name += scanIdentifier(last);
>>>                >
>>>       -1 |   i >                }
>>>        0 || ci >                if(source.charCodeAt(index) === 0x5B  /* '[' */ && source.charCodeAt(index + 1) === 0x5D  /* ']' */){
>>>          || ci >                    name += advance();
>>>          || ci >                    name += advance();
>>>           | c  >                }
> 
> Likewise, this is showing that a "non-human not knowing } is a closing
> and { is an opening token".

If not encoding heuristic that [,{,( are opening token, and ],},) are
closing token into heuristics, perhaps length of the line could be
a consideration about where to start a diff chunk?

-- 
Jakub Narębski



^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2016-07-04 14:34 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-16 17:46 [PATCH] diff compaction heuristic: favor shortest neighboring blank lines Stefan Beller
2016-06-16 20:27 ` Junio C Hamano
2016-06-16 21:06   ` Stefan Beller
2016-06-16 21:10 ` Michael Haggerty
2016-06-16 21:36   ` Stefan Beller
2016-06-17 15:36 ` Jeff King
2016-06-17 16:09   ` Stefan Beller
2016-06-23 17:10     ` Michael Haggerty
2016-06-23 17:25       ` Stefan Beller
2016-06-23 17:37       ` Junio C Hamano
2016-06-23 20:13         ` Michael Haggerty
2016-06-30 13:54       ` Michael Haggerty
2016-07-01 17:04         ` diff heuristics dramatically improved by considering line indentation and " Michael Haggerty
2016-07-01 18:01         ` [PATCH] diff compaction heuristic: favor shortest neighboring " Junio C Hamano
2016-07-04 14:33           ` Jakub Narębski

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).