git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Michael Haggerty <mhagger@alum.mit.edu>
To: Stefan Beller <sbeller@google.com>, Jeff King <peff@peff.net>
Cc: "git@vger.kernel.org" <git@vger.kernel.org>,
	Jacob Keller <jacob.keller@gmail.com>,
	Junio C Hamano <gitster@pobox.com>
Subject: diff heuristics dramatically improved by considering line indentation and blank lines
Date: Fri, 1 Jul 2016 19:04:31 +0200	[thread overview]
Message-ID: <5776A29F.4020609@alum.mit.edu> (raw)
In-Reply-To: <57752478.1000302@alum.mit.edu>

On 06/30/2016 03:54 PM, Michael Haggerty wrote:
> On 06/23/2016 07:10 PM, Michael Haggerty wrote:
>> On 06/17/2016 06:09 PM, Stefan Beller wrote:
>>> I think before spending more time on discussing and implementing new
>>> (hopefully better) heuristics, I'd want to step back and try to be a bit more
>>> systematic, i.e. I'll want to collect lots of test cases in different languages
>>> and use cases. A mini test suite, maybe?
>>
>> Stefan,
>>
>> I've also been playing around with diff heuristics and also trying to
>> find some good test data. Have you put together some test cases yet?
> 
> I put quite of work into tooling to evaluate diff heuristics, and just
> published it on GitHub:
> 
>     https://github.com/mhagger/diff-slider-tools

Today I hand-optimized about 2700 diff sliders in 12 different
repositories and used that as a reference corpus against which I tested
the accuracy of four different algorithms. The tools and the
hand-optimized values are now in the GitHub repository mentioned above.
See also this pull request [1].

If I didn't make any mistakes, the number of errors (i.e., cases where
the algorithm choose a different slider shift than the one I picked by
hand) look like this:

> | repository           | default | compaction | indent | indent-favor-edges |
> | -------------------- | ------- | ---------- | ------ | ------------------ |
> | ant                  |     225 |        102 |      7 |                  7 |
> | bugzilla             |     208 |         81 |     14 |                 14 |
> | couchdb              |      44 |         24 |     13 |                 10 |
> | docker               |     180 |        160 |     29 |                 18 |
> | git                  |     446 |         59 |     27 |                 19 |
> | ipython              |     358 |        163 |     61 |                 11 |
> | junit                |     146 |         67 |      5 |                  5 |
> | nodejs               |     489 |         78 |     12 |                 12 |
> | phpmyadmin           |     330 |         49 |      1 |                  0 |
> | test-more            |      15 |          2 |      2 |                  0 |
> | test-unit            |      33 |         13 |      4 |                  4 |
> | -------------------- | ------- | ---------- | ------ | ------------------ |
> | totals               |    2474 |        798 |    175 |                100 |

The algorithms are:

* default -- the default behavior `git diff` on the current master
* compaction -- `git diff --compaction-heuristic`
* indent -- the indent-based algorithm as reported earlier in
  this thread
* indent-favor-edges -- the indent-based algorithm, with some
  improved heuristics regarding sliders near the begin/end of file
  and probably one or two fixes

I encourage other people to run the tests against some code samples in
their favorite programming languages so that the testing can cover a
wider range of inputs, and submit your data to the GitHub project. Also
feel free to tweak the heuristic (there's probably still room for
improvement). All the tools and raw data for my experiments are already
there.

I'll be going on vacation for the next two weeks so I probably can't
follow up on this for a while. But I think we should build an algorithm
like this into Git!

Michael

[1] https://github.com/mhagger/diff-slider-tools/pull/1


  reply	other threads:[~2016-07-01 17:04 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-06-16 17:46 [PATCH] diff compaction heuristic: favor shortest neighboring blank lines Stefan Beller
2016-06-16 20:27 ` Junio C Hamano
2016-06-16 21:06   ` Stefan Beller
2016-06-16 21:10 ` Michael Haggerty
2016-06-16 21:36   ` Stefan Beller
2016-06-17 15:36 ` Jeff King
2016-06-17 16:09   ` Stefan Beller
2016-06-23 17:10     ` Michael Haggerty
2016-06-23 17:25       ` Stefan Beller
2016-06-23 17:37       ` Junio C Hamano
2016-06-23 20:13         ` Michael Haggerty
2016-06-30 13:54       ` Michael Haggerty
2016-07-01 17:04         ` Michael Haggerty [this message]
2016-07-01 18:01         ` Junio C Hamano
2016-07-04 14:33           ` Jakub Narębski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5776A29F.4020609@alum.mit.edu \
    --to=mhagger@alum.mit.edu \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jacob.keller@gmail.com \
    --cc=peff@peff.net \
    --cc=sbeller@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).