Re: [PATCH] diff compaction heuristic: favor shortest neighboring blank lines

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: Michael Haggerty <mhagger@alum.mit.edu>
To: Stefan Beller <sbeller@google.com>, peff@peff.net
Cc: git@vger.kernel.org, jacob.keller@gmail.com
Subject: Re: [PATCH] diff compaction heuristic: favor shortest neighboring blank lines
Date: Thu, 16 Jun 2016 23:10:57 +0200	[thread overview]
Message-ID: <576315E1.2050405@alum.mit.edu> (raw)
In-Reply-To: <20160616174620.1011-1-sbeller@google.com>

On 06/16/2016 07:46 PM, Stefan Beller wrote:
> The compaction heuristic for diffs was deemed quite good, but in 5580b27
> we have an example demonstrates a common case, that fails with the existing
> heuristic. That is why we disabled the heuristic in the v2.9.0 release.
> 
> With this patch a diff looks like:
> 
>          def bar
>                  do_bar_stuff()
> 
>                  common_ending()
>          end
> 
> +        def bal
> +                do_bal_stuff()
> +
> +                common_ending()
> +        end
> +
>          def baz
>                  do_baz_stuff()
> 
>                  common_ending()
>          end
> 
> whereas before we had:
> 
>   WIP (TODO: ask peff to provide an example that actually triggers, I seem to be
>        unable to write one, though I thought the above was one)
> 
> 
> The way we do it, is by inspecting the neighboring lines and see how
> much indent there is and we choose the line that has the shortest amount
> of blanks in the neighboring lines.
> 
> (TODO: update comments in the file)
> 
> Signed-off-by: Stefan Beller <sbeller@google.com>
> ---
> 
> Hi Jeff, hi Michael,
> 
> thanks for pointing out the flaw in this ruby code base before the 2.9 release.
> I think this patch would fix your finding, though I cannot reproduce it.
> Can you provide an example repo/patch that looks ugly on current origin/master,
> so I can be sure this fixes the issue?
> 
> This can also be a start of a discussion if this is a too short-sighted enhancement
> of the heuristic. Essentially we'd want to prefer 
> 
> +        end
> +
> +        def bal
> 
> over:
> 
> +                do_bal_stuff()
> +
> +                common_ending()
> 
> because there is less space between line start and {end, def bal}
> than for {do_bal_stuff, common_ending}.

It's really cool that you are trying to implement the indentation
heuristic. I'm curious how it works in practice (something we can
probably only determine by applying it to a corpus of diffs to see how
often it outperforms/underperforms the other possible approaches).

The heuristic I proposed was

* Prefer to start and end hunks *following* lines with the least
  indentation.

* Define the "indentation" of a blank line to be the indentation of
  the previous non-blank line minus epsilon.

* In the case of a tie, prefer to slide the hunk down as far as
  possible.

If that's what you are trying to implement, then notice that the first
rule says that the hunk should start *following* the line with the least
indentation. So the "score" for

> +        end
> +
> +        def bal

would be the indentation of the line preceding "end", which is larger
than the indentation of "end". And the score for

> +                do_bal_stuff()
> +
> +                common_ending()

would come from the line preceding "do_bal_stuff()", namely "def bal()",
which is indented the same as "end". So the latter would be preferred.

But actually, I don't understand how your example is meaningful. I think
this heuristic is only used to slide hunks around; i.e., when the line
following the hunk is the same as the first lines of the hunk, or when
the line preceding the hunk is the same as the last line of the hunk.
Right? So your snippets would never compete against each other. Let's
take the full example. The five possible placements for the `+`
characters are marked with the digits 1 through 5 here:

>              def bar
>                      do_bar_stuff()
> 1
> 12                   common_ending()
> 123          end
> 1234
> 12345        def bal
> 12345                do_bal_stuff()
>  2345
>   345                common_ending()
>    45        end
>     5
>              def baz
>                      do_baz_stuff()
>
>                      common_ending()
>              end

Of the lines preceding the candidate hunks, the blank line preceding the
hunk marked "5" has the smallest indent, namely "epsilon" less than the
indentation of the "end" line preceding it. So placement "5" would be
chosen.

I don't know enough about this area of the code to review your patch in
detail, but I did note below a typo that jumped out at me.

Michael

>  xdiff/xdiffi.c | 41 +++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 39 insertions(+), 2 deletions(-)
> 
> diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
> index b3c6848..58adc74 100644
> --- a/xdiff/xdiffi.c
> +++ b/xdiff/xdiffi.c
> [...]
> @@ -485,7 +515,13 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
>  			 * the group.
>  			 */
>  			while (ix < nrec && recs_match(recs, ixs, ix, flags)) {
> -				blank_lines += is_blank_line(recs, ix, flags);
> +				if (is_blank_line(recs, ix, flags)) {
> +					unsigned int bl_neigh_indent =
> +						surrounding_leading_blank(recs, ix, flags, nrec);
> +					if (min_bl_neigh_indent > bl_neigh_indent)
> +						min_bl_neigh_indent = min_bl_neigh_indent;

The line above has no effect (same variable on both sides of the =).

> +					blank_lines++;
> +				}
>  
>  				rchg[ixs++] = 0;
>  				rchg[ix++] = 1;
> @@ -525,6 +561,7 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
>  		if ((flags & XDF_COMPACTION_HEURISTIC) && blank_lines) {
>  			while (ixs > 0 &&
>  			       !is_blank_line(recs, ix - 1, flags) &&
> +			       surrounding_leading_blank(recs, ix - 1, flags, nrec) > min_bl_neigh_indent &&
>  			       recs_match(recs, ixs - 1, ix - 1, flags)) {
>  				rchg[--ixs] = 1;
>  				rchg[--ix] = 0;
>

next prev parent reply	other threads:[~2016-06-16 21:11 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-06-16 17:46 [PATCH] diff compaction heuristic: favor shortest neighboring blank lines Stefan Beller
2016-06-16 20:27 ` Junio C Hamano
2016-06-16 21:06   ` Stefan Beller
2016-06-16 21:10 ` Michael Haggerty [this message]
2016-06-16 21:36   ` Stefan Beller
2016-06-17 15:36 ` Jeff King
2016-06-17 16:09   ` Stefan Beller
2016-06-23 17:10     ` Michael Haggerty
2016-06-23 17:25       ` Stefan Beller
2016-06-23 17:37       ` Junio C Hamano
2016-06-23 20:13         ` Michael Haggerty
2016-06-30 13:54       ` Michael Haggerty
2016-07-01 17:04         ` diff heuristics dramatically improved by considering line indentation and " Michael Haggerty
2016-07-01 18:01         ` [PATCH] diff compaction heuristic: favor shortest neighboring " Junio C Hamano
2016-07-04 14:33           ` Jakub Narębski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=576315E1.2050405@alum.mit.edu \
    --to=mhagger@alum.mit.edu \
    --cc=git@vger.kernel.org \
    --cc=jacob.keller@gmail.com \
    --cc=peff@peff.net \
    --cc=sbeller@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).