git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Michael Haggerty <mhagger@alum.mit.edu>
To: git@vger.kernel.org
Cc: "Stefan Beller" <sbeller@google.com>,
	"Junio C Hamano" <gitster@pobox.com>, "Jeff King" <peff@peff.net>,
	"Jakub Narębski" <jnareb@gmail.com>,
	"Jacob Keller" <jacob.keller@gmail.com>,
	"Michael Haggerty" <mhagger@alum.mit.edu>
Subject: [PATCH v2 0/7] Better heuristics make prettier diffs
Date: Mon, 22 Aug 2016 13:22:39 +0200	[thread overview]
Message-ID: <cover.1471864378.git.mhagger@alum.mit.edu> (raw)

This is v2 of a patch series to improve the heuristics that `git diff`
and related commands use to position ambiguous blocks of added/deleted
lines. Thanks to Stefan, Jacob, Peff, Junio, and Jakub for their
comments about v1 [1]. This patch series is also available from my
GitHub account [2] as branch `diff-indent-heuristic`.

This version started out as a light touch-up of v1, but as I was
working I realized that it needed more changes. Among other things,
the final heuristic is now improved relative to the one proposed in
v1. Summary of changes:

* I changed the `--compaction-heuristic` code such that if a group of
  added/deleted lines can be aligned with a group of deleted/added
  lines in the other file, then that should be done rather than try to
  slide to a blank line. In the existing code, the compaction
  heuristic was attempted, incorrectly, *after* trying to align
  groups, which doesn't make sense.

* I changed `recs_match()` similarly to `is_blank_line()`: now it
  takes two `xrecord_t *` as arguments rather than an array of
  `xrecord_t` and two indexes.

* I refactored the old `xdl_change_compact()` more thoroughly. All of
  the index manipulation was pretty confusing, and I was having
  trouble convincing myself that even the old code was working
  correctly. So I introduced a `struct group`, which is like a cursor
  that keeps track of a (possibly empty) group of changed lines in the
  old or new version of a file. I added functions to initialize a
  group to the first change in a file, to move to the next or previous
  groups, and to slide a group forward or backward if possible (i.e.,
  if the group is a "slider").

* In the course of that refactoring, I found (and fixed) another bug
  in the `--compaction-heuristic` code: it didn't notice if the top
  possible position of a slider had a blank line as its last line. See
  the commit message of patch [5/5] for more details.

* I realized that the behavior of the indent heuristic from v1,
  because it multiplied weighting factors times indent widths, would
  behave differently if a project changed its convention for how many
  spaces to use per level of indentation. That doesn't make sense, so
  I changed the handling of indentation:

  Now, the sum of the indentation width at the top and bottom of the
  slider are added, *but those sums are only compared* between slider
  positions, and the weight is multipled by -1, 0, or +1 depending on
  whether the first indent sum is less than, equal, or greater than
  the other. This should make the behavior relatively independent of
  the project's indentation convention, and thus make the heuristic
  work more consistently across projects.

  The resulting heuristic works significantly better than the one
  proposed in v1:

  | repository            | count |      Git 2.9.0 |             v1 |             v2 |
  | --------------------- | ----- | -------------- | -------------- | -------------- |
  | afnetworking          |   109 |    89  (81.7%) |     3   (2.8%) |     2   (1.8%) |
  | alamofire             |    30 |    18  (60.0%) |     2   (6.7%) |     0   (0.0%) |
  | angular               |   184 |   127  (69.0%) |    15   (8.2%) |     5   (2.7%) |
  | animate               |   313 |     2   (0.6%) |     2   (0.6%) |     2   (0.6%) |
  | ant                   |   380 |   356  (93.7%) |    22   (5.8%) |    15   (3.9%) |
  | bugzilla              |   306 |   263  (85.9%) |    24   (7.8%) |    15   (4.9%) |
  | corefx                |   126 |    91  (72.2%) |    15  (11.9%) |     6   (4.8%) |
  | couchdb               |    78 |    44  (56.4%) |    17  (21.8%) |     6   (7.7%) |
  | cpython               |   937 |   158  (16.9%) |    26   (2.8%) |     5   (0.5%) |
  | discourse             |   160 |    95  (59.4%) |    24  (15.0%) |    13   (8.1%) |
  | docker                |   307 |   194  (63.2%) |    16   (5.2%) |     8   (2.6%) |
  | electron              |   163 |   132  (81.0%) |    15   (9.2%) |     6   (3.7%) |
  | git                   |   536 |   470  (87.7%) |    18   (3.4%) |    16   (3.0%) |
  | gitflow               |   127 |     0   (0.0%) |     0   (0.0%) |     0   (0.0%) |
  | ionic                 |   133 |    89  (66.9%) |     6   (4.5%) |     1   (0.8%) |
  | ipython               |   482 |   362  (75.1%) |    45   (9.3%) |    11   (2.3%) |
  | junit                 |   161 |   147  (91.3%) |     4   (2.5%) |     1   (0.6%) |
  | lighttable            |    15 |     5  (33.3%) |     2  (13.3%) |     0   (0.0%) |
  | magit                 |    88 |    75  (85.2%) |     5   (5.7%) |     0   (0.0%) |
  | neural-style          |    28 |     0   (0.0%) |     0   (0.0%) |     0   (0.0%) |
  | nodejs                |   781 |   649  (83.1%) |    98  (12.5%) |     5   (0.6%) |
  | phpmyadmin            |   491 |   481  (98.0%) |     7   (1.4%) |     2   (0.4%) |
  | react-native          |   168 |   130  (77.4%) |     5   (3.0%) |     0   (0.0%) |
  | rust                  |   171 |   128  (74.9%) |    17   (9.9%) |    14   (8.2%) |
  | spark                 |   186 |   149  (80.1%) |    17   (9.1%) |     2   (1.1%) |
  | tensorflow            |   115 |    66  (57.4%) |     7   (6.1%) |     5   (4.3%) |
  | test-more             |    19 |    15  (78.9%) |     1   (5.3%) |     1   (5.3%) |
  | test-unit             |    51 |    34  (66.7%) |     8  (15.7%) |     2   (3.9%) |
  | xmonad                |    23 |    22  (95.7%) |     1   (4.3%) |     1   (4.3%) |
  | --------------------- | ----- | -------------- | -------------- | -------------- |
  | totals                |  6668 |  4391  (65.9%) |   422   (6.3%) |   144   (2.2%) |

* I noticed that most of the "bonus" weights were actually negative,
  so calling them "bonuses" was misleading. Therefore, I negated the
  coefficients and now call them "penalties"

* I noticed that `git blame` was parsing diff options like
  `--indent-heuristic` from the command line, but wasn't using the
  values. That is fixed in a new patch [07/07].

* I redid all of the analysis and training with a bigger corpus. To
  check whether I was overtraining the heuristic, I also did the
  following experiment: I optimized the weights against a training set
  consisting of only some of the repositories, then tested it against
  the rest of the corpus. See the full results in the commit message
  for patch [06/06].

* I added a couple of smoke tests.

The companion project [3] now provides an easy way to replicate and
extend these results, if anybody is interested.

The new heuristic still has to be enabled via the `--indent-heuristic`
command-line parameter or the `diff.indentHeuristic` configuration
setting. Before we release this, we should decide what the final UI
should look like and make it so. If we decide to replace the existing
compaction heuristic with this one and/or turn this heuristic on for
all users by default, those steps might look like branch
`indent-heuristic-default` in my GitHub repository [2].

Michael

[1] http://public-inbox.org/git/cover.1470259583.git.mhagger@alum.mit.edu/t/#u
[2] https://github.com/mhagger/git
[3] https://github.com/mhagger/diff-slider-tools

Michael Haggerty (7):
  xdl_change_compact(): fix compaction heuristic to adjust ixo
  xdl_change_compact(): only use heuristic if group can't be matched
  is_blank_line(): take a single xrecord_t as argument
  recs_match(): take two xrecord_t pointers as arguments
  xdl_change_compact(): introduce the concept of a change group
  diff: improve positioning of add/delete blocks in diffs
  blame: actually use the diff opts parsed from the command line

 Documentation/diff-config.txt  |   7 +-
 Documentation/diff-options.txt |   9 +-
 builtin/blame.c                |  11 +-
 diff.c                         |  23 +-
 git-add--interactive.perl      |   5 +-
 t/t4059-diff-indent.sh         | 160 +++++++++++
 xdiff/xdiff.h                  |   1 +
 xdiff/xdiffi.c                 | 635 ++++++++++++++++++++++++++++++++++-------
 8 files changed, 736 insertions(+), 115 deletions(-)
 create mode 100755 t/t4059-diff-indent.sh

-- 
2.9.3


             reply	other threads:[~2016-08-22 11:24 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-08-22 11:22 Michael Haggerty [this message]
2016-08-22 11:22 ` [PATCH v2 1/7] xdl_change_compact(): fix compaction heuristic to adjust ixo Michael Haggerty
2016-08-22 11:22 ` [PATCH v2 2/7] xdl_change_compact(): only use heuristic if group can't be matched Michael Haggerty
2016-08-22 11:22 ` [PATCH v2 3/7] is_blank_line(): take a single xrecord_t as argument Michael Haggerty
2016-08-22 11:22 ` [PATCH v2 4/7] recs_match(): take two xrecord_t pointers as arguments Michael Haggerty
2016-08-22 11:22 ` [PATCH v2 5/7] xdl_change_compact(): introduce the concept of a change group Michael Haggerty
2016-08-23 21:35   ` Junio C Hamano
2016-08-22 11:22 ` [PATCH v2 6/7] diff: improve positioning of add/delete blocks in diffs Michael Haggerty
2016-08-22 11:22 ` [PATCH v2 7/7] blame: actually use the diff opts parsed from the command line Michael Haggerty
2016-08-23  9:56   ` René Scharfe
2016-09-05  4:12     ` Michael Haggerty
2016-09-07 17:58       ` Junio C Hamano
2016-08-23 17:01   ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=cover.1471864378.git.mhagger@alum.mit.edu \
    --to=mhagger@alum.mit.edu \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jacob.keller@gmail.com \
    --cc=jnareb@gmail.com \
    --cc=peff@peff.net \
    --cc=sbeller@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).