[PATCH 0/8] Better heuristics make prettier diffs

* [PATCH 0/8] Better heuristics make prettier diffs
@ 2016-08-03 22:00 Michael Haggerty
  2016-08-03 22:00 ` [PATCH 1/8] xdl_change_compact(): rename some local variables for clarity Michael Haggerty
                   ` (9 more replies)
  0 siblings, 10 replies; 57+ messages in thread
From: Michael Haggerty @ 2016-08-03 22:00 UTC
  To: git
  Cc: Stefan Beller, Junio C Hamano, Jeff King, Jakub Narębski,
	Jacob Keller, Michael Haggerty

I've talked about this quite a bit on the list already. The idea is to
improve ugly diffs like

    @@ -231,6 +231,9 @@ if (!defined $initial_reply_to && $prompting) {
     }

     if (!$smtp_server) {
    +       $smtp_server = $repo->config('sendemail.smtpserver');
    +}
    +if (!$smtp_server) {
            foreach (qw( /usr/sbin/sendmail /usr/lib/sendmail )) {
                    if (-x $_) {
                            $smtp_server = $_;

by feeding clues from the surrounding lines (namely their patterns of
indentation and blank lines) into a heuristic that more often produces
the diffs that users would rather see, like

    --- a/9c572b21dd090a1e5c5bb397053bf8043ffe7fb4:git-send-email.perl
    +++ b/6dcfa306f2b67b733a7eb2d7ded1bc9987809edb:git-send-email.perl
    @@ -230,6 +230,9 @@ if (!defined $initial_reply_to && $prompting) {
            $initial_reply_to =~ s/(^\s+|\s+$)//g;
     }

    +if (!$smtp_server) {
    +       $smtp_server = $repo->config('sendemail.smtpserver');
    +}
     if (!$smtp_server) {
            foreach (qw( /usr/sbin/sendmail /usr/lib/sendmail )) {
                    if (-x $_) {

See the last commit's log message for a very detailed explanation of
the heuristic, how it was optimized, and how you can get involved to
make sure that the heuristic also works well for your favorite
language.

When tested against a corpus of 2700 diffs that I optimized by hand,
this heuristic gets a "wrong" answer only about 1.7% as frequently as
the current default Git algorithm and only about 5.3% as often as `git
diff --compaction-heuristic`. (Though please don't treat these numbers
as final; I want to verify them again first.)

For now the new algorithm has to be enabled explicitly using either
`--indent-heuristic` or `git config diff.indentheuristic true`.

Michael Haggerty (8):
  xdl_change_compact(): rename some local variables for clarity
  xdl_change_compact(): clarify code
  xdl_change_compact(): rename i to end
  xdl_change_compact(): do one final shift or the other, not both
  xdl_change_compact(): fix compaction heuristic to adjust io
  xdl_change_compact(): keep track of the earliest end
  is_blank_line: take a single xrecord_t as argument
  diff: improve positioning of add/delete blocks in diffs

 Documentation/diff-options.txt |   6 +-
 diff.c                         |  11 +
 git-add--interactive.perl      |   5 +-
 xdiff/xdiff.h                  |   1 +
 xdiff/xdiffi.c                 | 458 ++++++++++++++++++++++++++++++++++-------
 5 files changed, 409 insertions(+), 72 deletions(-)

-- 
2.8.1



* [PATCH 1/8] xdl_change_compact(): rename some local variables for clarity
  2016-08-03 22:00 [PATCH 0/8] Better heuristics make prettier diffs Michael Haggerty
@ 2016-08-03 22:00 ` Michael Haggerty
  2016-08-04  7:06   ` Jeff King
  2016-08-03 22:00 ` [PATCH 2/8] xdl_change_compact(): clarify code Michael Haggerty
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 57+ messages in thread
From: Michael Haggerty @ 2016-08-03 22:00 UTC
  To: git
  Cc: Stefan Beller, Junio C Hamano, Jeff King, Jakub Narębski,
	Jacob Keller, Michael Haggerty

* ix -> i
* ixo -> io
* ixs -> start
* grpsiz -> groupsize

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
Thankfully, we don't have to limit identifiers to six characters, so
start naming things more understandably.

 xdiff/xdiffi.c | 66 +++++++++++++++++++++++++++++-----------------------------
 1 file changed, 33 insertions(+), 33 deletions(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index b3c6848..ff7fc42 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -414,7 +414,7 @@ static int recs_match(xrecord_t **recs, long ixs, long ix, long flags)
 }
 
 int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
-	long ix, ixo, ixs, ixref, grpsiz, nrec = xdf->nrec;
+	long i, io, start, ixref, groupsize, nrec = xdf->nrec;
 	char *rchg = xdf->rchg, *rchgo = xdfo->rchg;
 	unsigned int blank_lines;
 	xrecord_t **recs = xdf->recs;
@@ -424,7 +424,7 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 	 * change groups for a consistent and pretty diff output. This also
 	 * helps in finding joinable change groups and reduce the diff size.
 	 */
-	for (ix = ixo = 0;;) {
+	for (i = io = 0;;) {
 		/*
 		 * Find the first changed line in the to-be-compacted file.
 		 * We need to keep track of both indexes, so if we find a
@@ -434,22 +434,22 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 		 * not need index bounding since the array is prepared with
 		 * a zero at position -1 and N.
 		 */
-		for (; ix < nrec && !rchg[ix]; ix++)
-			while (rchgo[ixo++]);
-		if (ix == nrec)
+		for (; i < nrec && !rchg[i]; i++)
+			while (rchgo[io++]);
+		if (i == nrec)
 			break;
 
 		/*
 		 * Record the start of a changed-group in the to-be-compacted file
 		 * and find the end of it, on both to-be-compacted and other file
-		 * indexes (ix and ixo).
+		 * indexes (i and io).
 		 */
-		ixs = ix;
-		for (ix++; rchg[ix]; ix++);
-		for (; rchgo[ixo]; ixo++);
+		start = i;
+		for (i++; rchg[i]; i++);
+		for (; rchgo[io]; io++);
 
 		do {
-			grpsiz = ix - ixs;
+			groupsize = i - start;
 			blank_lines = 0;
 
 			/*
@@ -457,9 +457,9 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 			 * the last line of the current change group, shift backward
 			 * the group.
 			 */
-			while (ixs > 0 && recs_match(recs, ixs - 1, ix - 1, flags)) {
-				rchg[--ixs] = 1;
-				rchg[--ix] = 0;
+			while (start > 0 && recs_match(recs, start - 1, i - 1, flags)) {
+				rchg[--start] = 1;
+				rchg[--i] = 0;
 
 				/*
 				 * This change might have joined two change groups,
@@ -467,8 +467,8 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 				 * the start index accordingly (and so the other-file
 				 * end-of-group index).
 				 */
-				for (; rchg[ixs - 1]; ixs--);
-				while (rchgo[--ixo]);
+				for (; rchg[start - 1]; start--);
+				while (rchgo[--io]);
 			}
 
 			/*
@@ -477,18 +477,18 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 			 * change record before the end-of-group index in the other
 			 * file is set).
 			 */
-			ixref = rchgo[ixo - 1] ? ix: nrec;
+			ixref = rchgo[io - 1] ? i : nrec;
 
 			/*
 			 * If the first line of the current change group, is equal to
 			 * the line next of the current change group, shift forward
 			 * the group.
 			 */
-			while (ix < nrec && recs_match(recs, ixs, ix, flags)) {
-				blank_lines += is_blank_line(recs, ix, flags);
+			while (i < nrec && recs_match(recs, start, i, flags)) {
+				blank_lines += is_blank_line(recs, i, flags);
 
-				rchg[ixs++] = 0;
-				rchg[ix++] = 1;
+				rchg[start++] = 0;
+				rchg[i++] = 1;
 
 				/*
 				 * This change might have joined two change groups,
@@ -498,20 +498,20 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 				 * index in case we are shifting together with a
 				 * corresponding group of changes in the other file.
 				 */
-				for (; rchg[ix]; ix++);
-				while (rchgo[++ixo])
-					ixref = ix;
+				for (; rchg[i]; i++);
+				while (rchgo[++io])
+					ixref = i;
 			}
-		} while (grpsiz != ix - ixs);
+		} while (groupsize != i - start);
 
 		/*
 		 * Try to move back the possibly merged group of changes, to match
 		 * the recorded position in the other file.
 		 */
-		while (ixref < ix) {
-			rchg[--ixs] = 1;
-			rchg[--ix] = 0;
-			while (rchgo[--ixo]);
+		while (ixref < i) {
+			rchg[--start] = 1;
+			rchg[--i] = 0;
+			while (rchgo[--io]);
 		}
 
 		/*
@@ -523,11 +523,11 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 		 * in the earlier loop, we need to shift it back only if at all.
 		 */
 		if ((flags & XDF_COMPACTION_HEURISTIC) && blank_lines) {
-			while (ixs > 0 &&
-			       !is_blank_line(recs, ix - 1, flags) &&
-			       recs_match(recs, ixs - 1, ix - 1, flags)) {
-				rchg[--ixs] = 1;
-				rchg[--ix] = 0;
+			while (start > 0 &&
+			       !is_blank_line(recs, i - 1, flags) &&
+			       recs_match(recs, start - 1, i - 1, flags)) {
+				rchg[--start] = 1;
+				rchg[--i] = 0;
 			}
 		}
 	}
-- 
2.8.1



* [PATCH 2/8] xdl_change_compact(): clarify code
  2016-08-03 22:00 [PATCH 0/8] Better heuristics make prettier diffs Michael Haggerty
  2016-08-03 22:00 ` [PATCH 1/8] xdl_change_compact(): rename some local variables for clarity Michael Haggerty
@ 2016-08-03 22:00 ` Michael Haggerty
  2016-08-03 22:11   ` Stefan Beller
  2016-08-03 22:00 ` [PATCH 3/8] xdl_change_compact(): rename i to end Michael Haggerty
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 57+ messages in thread
From: Michael Haggerty @ 2016-08-03 22:00 UTC
  To: git
  Cc: Stefan Beller, Junio C Hamano, Jeff King, Jakub Narębski,
	Jacob Keller, Michael Haggerty

Write things out a bit longer but less cryptically. Add more comments.

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
I find the loops in the old code, with unfamiliar patterns of embedded
increment/decrement operators, confusing, and I think that writing
things out a little bit more verbosely (and with more comments) makes
it much easier to read the code and be sure that it is correct.
The compiled code and performance shouldn't be affected materially.
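
As a quick check of that claim, here is a minimal standalone sketch that
compares the old compact loop with the written-out form. The function
names (`skip_compact`, `skip_verbose`, `skip_versions_agree`) are made up
for this illustration and are not part of xdiff; the sketch also assumes
the array ends with a zero entry, standing in for the zero that xdiff
keeps at position N.

```c
#include <assert.h>

/*
 * Old style: skip any changed lines plus one unchanged line,
 * using an embedded post-increment.
 */
static long skip_compact(const char *rchgo, long io)
{
	while (rchgo[io++])
		;
	return io;
}

/* New style: the same steps written out one at a time. */
static long skip_verbose(const char *rchgo, long io)
{
	/* skip over any changed lines... */
	while (rchgo[io])
		io++;

	/* ...plus one non-changed line. */
	io++;
	return io;
}

/* Returns 1 if both versions agree for every start index below n. */
static int skip_versions_agree(const char *rchgo, long n)
{
	long io;

	/* the loops rely on a zero entry to stop at (assumption of this sketch) */
	assert(n > 0 && !rchgo[n - 1]);
	for (io = 0; io < n; io++)
		if (skip_compact(rchgo, io) != skip_verbose(rchgo, io))
			return 0;
	return 1;
}
```

Calling `skip_versions_agree()` on a small flag array such as
`{0, 1, 1, 0, 0, 1, 0}` exercises both forms from every starting index.
The backward loops (`while (rchgo[--io]);`) can be compared the same way.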

 xdiff/xdiffi.c | 106 +++++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 73 insertions(+), 33 deletions(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index ff7fc42..a0a485c 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -434,8 +434,14 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 		 * not need index bounding since the array is prepared with
 		 * a zero at position -1 and N.
 		 */
-		for (; i < nrec && !rchg[i]; i++)
-			while (rchgo[io++]);
+		for (; i < nrec && !rchg[i]; i++) {
+			/* skip over any changed lines in the other file... */
+			while (rchgo[io])
+				io++;
+
+			/* ...plus one non-changed line. */
+			io++;
+		}
 		if (i == nrec)
 			break;
 
@@ -444,45 +450,70 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 		 * and find the end of it, on both to-be-compacted and other file
 		 * indexes (i and io).
 		 */
-		start = i;
-		for (i++; rchg[i]; i++);
-		for (; rchgo[io]; io++);
+		start = i++;
+
+		while (rchg[i])
+			i++;
+
+		while (rchgo[io])
+		       io++;
 
 		do {
 			groupsize = i - start;
+
+			/*
+			 * Are there any blank lines that could appear as the last
+			 * line of this group?
+			 */
 			blank_lines = 0;
 
 			/*
-			 * If the line before the current change group, is equal to
-			 * the last line of the current change group, shift backward
-			 * the group.
+			 * While the line before the current change group is equal
+			 * to the last line of the current change group, shift the
+			 * group backward.
 			 */
 			while (start > 0 && recs_match(recs, start - 1, i - 1, flags)) {
 				rchg[--start] = 1;
 				rchg[--i] = 0;
 
 				/*
-				 * This change might have joined two change groups,
-				 * so we try to take this scenario in account by moving
-				 * the start index accordingly (and so the other-file
-				 * end-of-group index).
+				 * This change might have joined two change groups.
+				 * If so, move the start index to the beginning of
+				 * the combined group:
 				 */
-				for (; rchg[start - 1]; start--);
-				while (rchgo[--io]);
+				while (rchg[start - 1])
+					start--;
+
+				/*
+				 * Move the other file index past a non-changed
+				 * line...
+				 */
+				io--;
+
+				/* ...and also past any changed lines: */
+				while (rchgo[io])
+					io--;
 			}
 
-			/*
-			 * Record the end-of-group position in case we are matched
-			 * with a group of changes in the other file (that is, the
-			 * change record before the end-of-group index in the other
-			 * file is set).
-			 */
-			ixref = rchgo[io - 1] ? i : nrec;
+			if (rchgo[io - 1]) {
+				/*
+				 * This change is matched to a group of changes in
+				 * the other file. Record the end-of-group
+				 * position:
+				 */
+				ixref = i;
+			} else {
+				/*
+				 * Otherwise, set a value to signify that there
+				 * are no matched changes in the other file:
+				 */
+				ixref = nrec;
+			}
 
 			/*
-			 * If the first line of the current change group, is equal to
-			 * the line next of the current change group, shift forward
-			 * the group.
+			 * Now shift the group forward as long as the first line
+			 * of the current change group is equal to the line after
+			 * the current change group.
 			 */
 			while (i < nrec && recs_match(recs, start, i, flags)) {
 				blank_lines += is_blank_line(recs, i, flags);
@@ -491,16 +522,22 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 				rchg[i++] = 1;
 
 				/*
-				 * This change might have joined two change groups,
-				 * so we try to take this scenario in account by moving
-				 * the start index accordingly (and so the other-file
-				 * end-of-group index). Keep tracking the reference
-				 * index in case we are shifting together with a
-				 * corresponding group of changes in the other file.
+				 * This change might have joined two change
+				 * groups. If so, move the start index accordingly
+				 * (and so the other-file end-of-group index).
+				 * Keep tracking the reference index in case we
+				 * are shifting together with a corresponding
+				 * group of changes in the other file.
 				 */
-				for (; rchg[i]; i++);
-				while (rchgo[++io])
+				while (rchg[i])
+					i++;
+
+				io++;
+				if (rchgo[io]) {
 					ixref = i;
+					while (rchgo[io])
+						io++;
+				}
 			}
 		} while (groupsize != i - start);
 
@@ -511,7 +548,10 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 		while (ixref < i) {
 			rchg[--start] = 1;
 			rchg[--i] = 0;
-			while (rchgo[--io]);
+
+			io--;
+			while (rchgo[io])
+				io--;
 		}
 
 		/*
-- 
2.8.1



* [PATCH 3/8] xdl_change_compact(): rename i to end
  2016-08-03 22:00 [PATCH 0/8] Better heuristics make prettier diffs Michael Haggerty
  2016-08-03 22:00 ` [PATCH 1/8] xdl_change_compact(): rename some local variables for clarity Michael Haggerty
  2016-08-03 22:00 ` [PATCH 2/8] xdl_change_compact(): clarify code Michael Haggerty
@ 2016-08-03 22:00 ` Michael Haggerty
  2016-08-04  7:16   ` Jeff King
  2016-08-03 22:00 ` [PATCH 4/8] xdl_change_compact(): do one final shift or the other, not both Michael Haggerty
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 57+ messages in thread
From: Michael Haggerty @ 2016-08-03 22:00 UTC
  To: git
  Cc: Stefan Beller, Junio C Hamano, Jeff King, Jakub Narębski,
	Jacob Keller, Michael Haggerty

Rename i to end, and alternate between using start and end as the
indexing variable as appropriate.

Rename ixref to end_matching_other.

Add some more comments.

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
 xdiff/xdiffi.c | 70 ++++++++++++++++++++++++++++++++--------------------------
 1 file changed, 39 insertions(+), 31 deletions(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index a0a485c..0f235bc 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -414,7 +414,7 @@ static int recs_match(xrecord_t **recs, long ixs, long ix, long flags)
 }
 
 int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
-	long i, io, start, ixref, groupsize, nrec = xdf->nrec;
+	long start, end, io, end_matching_other, groupsize, nrec = xdf->nrec;
 	char *rchg = xdf->rchg, *rchgo = xdfo->rchg;
 	unsigned int blank_lines;
 	xrecord_t **recs = xdf->recs;
@@ -424,7 +424,8 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 	 * change groups for a consistent and pretty diff output. This also
 	 * helps in finding joinable change groups and reduce the diff size.
 	 */
-	for (i = io = 0;;) {
+	end = io = 0;
+	while (1) {
 		/*
 		 * Find the first changed line in the to-be-compacted file.
 		 * We need to keep track of both indexes, so if we find a
@@ -434,7 +435,7 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 		 * not need index bounding since the array is prepared with
 		 * a zero at position -1 and N.
 		 */
-		for (; i < nrec && !rchg[i]; i++) {
+		for (start = end; start < nrec && !rchg[start]; start++) {
 			/* skip over any changed lines in the other file... */
 			while (rchgo[io])
 				io++;
@@ -442,24 +443,29 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 			/* ...plus one non-changed line. */
 			io++;
 		}
-		if (i == nrec)
+		if (start == nrec)
 			break;
 
 		/*
-		 * Record the start of a changed-group in the to-be-compacted file
-		 * and find the end of it, on both to-be-compacted and other file
-		 * indexes (i and io).
+		 * That's the start of a changed-group in the to-be-compacted
+		 * file. Now find its end.
 		 */
-		start = i++;
-
-		while (rchg[i])
-			i++;
+		end = start + 1;
+		while (rchg[end])
+			end++;
 
 		while (rchgo[io])
 		       io++;
 
+		/*
+		 * Now shift the change up and then down as far as possible in
+		 * each direction. If it bumps into any other changes, merge them.
+		 * If there are any changes in the other file that this change
+		 * could line up with, set end_matching_other to the end position
+		 * of this change that would leave them aligned.
+		 */
 		do {
-			groupsize = i - start;
+			groupsize = end - start;
 
 			/*
 			 * Are there any blank lines that could appear as the last
@@ -472,9 +478,9 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 			 * to the last line of the current change group, shift the
 			 * group backward.
 			 */
-			while (start > 0 && recs_match(recs, start - 1, i - 1, flags)) {
+			while (start > 0 && recs_match(recs, start - 1, end - 1, flags)) {
 				rchg[--start] = 1;
-				rchg[--i] = 0;
+				rchg[--end] = 0;
 
 				/*
 				 * This change might have joined two change groups.
@@ -501,13 +507,13 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 				 * the other file. Record the end-of-group
 				 * position:
 				 */
-				ixref = i;
+				end_matching_other = end;
 			} else {
 				/*
 				 * Otherwise, set a value to signify that there
 				 * are no matched changes in the other file:
 				 */
-				ixref = nrec;
+				end_matching_other = -1;
 			}
 
 			/*
@@ -515,11 +521,11 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 			 * of the current change group is equal to the line after
 			 * the current change group.
 			 */
-			while (i < nrec && recs_match(recs, start, i, flags)) {
-				blank_lines += is_blank_line(recs, i, flags);
+			while (end < nrec && recs_match(recs, start, end, flags)) {
+				blank_lines += is_blank_line(recs, end, flags);
 
 				rchg[start++] = 0;
-				rchg[i++] = 1;
+				rchg[end++] = 1;
 
 				/*
 				 * This change might have joined two change
@@ -529,29 +535,31 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 				 * are shifting together with a corresponding
 				 * group of changes in the other file.
 				 */
-				while (rchg[i])
-					i++;
+				while (rchg[end])
+					end++;
 
 				io++;
 				if (rchgo[io]) {
-					ixref = i;
+					end_matching_other = end;
 					while (rchgo[io])
 						io++;
 				}
 			}
-		} while (groupsize != i - start);
+		} while (groupsize != end - start);
 
 		/*
 		 * Try to move back the possibly merged group of changes, to match
 		 * the recorded position in the other file.
 		 */
-		while (ixref < i) {
-			rchg[--start] = 1;
-			rchg[--i] = 0;
+		if (end_matching_other != -1) {
+			while (end_matching_other < end) {
+				rchg[--start] = 1;
+				rchg[--end] = 0;
 
-			io--;
-			while (rchgo[io])
 				io--;
+				while (rchgo[io])
+					io--;
+			}
 		}
 
 		/*
@@ -564,10 +572,10 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 		 */
 		if ((flags & XDF_COMPACTION_HEURISTIC) && blank_lines) {
 			while (start > 0 &&
-			       !is_blank_line(recs, i - 1, flags) &&
-			       recs_match(recs, start - 1, i - 1, flags)) {
+			       !is_blank_line(recs, end - 1, flags) &&
+			       recs_match(recs, start - 1, end - 1, flags)) {
 				rchg[--start] = 1;
-				rchg[--i] = 0;
+				rchg[--end] = 0;
 			}
 		}
 	}
-- 
2.8.1



* [PATCH 4/8] xdl_change_compact(): do one final shift or the other, not both
  2016-08-03 22:00 [PATCH 0/8] Better heuristics make prettier diffs Michael Haggerty
                   ` (2 preceding siblings ...)
  2016-08-03 22:00 ` [PATCH 3/8] xdl_change_compact(): rename i to end Michael Haggerty
@ 2016-08-03 22:00 ` Michael Haggerty
  2016-08-03 22:00 ` [PATCH 5/8] xdl_change_compact(): fix compaction heuristic to adjust io Michael Haggerty
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 57+ messages in thread
From: Michael Haggerty @ 2016-08-03 22:00 UTC
  To: git
  Cc: Stefan Beller, Junio C Hamano, Jeff King, Jakub Narębski,
	Jacob Keller, Michael Haggerty

There is no need to shift the group to match a diff in the other file if
we're just going to override that shift based on the compaction
heuristic. Note that this changes the behavior if the matching shift
would have shifted the group higher than the last blank line: the old
code would have ignored the compaction heuristic in that case, whereas
the new code always gives precedence to the compaction heuristic when it
is turned on.

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
 xdiff/xdiffi.c | 44 ++++++++++++++++++++++----------------------
 1 file changed, 22 insertions(+), 22 deletions(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 0f235bc..c67cfe3 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -547,11 +547,28 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 			}
 		} while (groupsize != end - start);
 
-		/*
-		 * Try to move back the possibly merged group of changes, to match
-		 * the recorded position in the other file.
-		 */
-		if (end_matching_other != -1) {
+		if ((flags & XDF_COMPACTION_HEURISTIC) && blank_lines) {
+			/*
+			 * Compaction heuristic: if a group can be moved back and
+			 * forth, then if possible shift the group to make its
+			 * bottom line a blank line.
+			 *
+			 * As we already shifted the group forward as far as
+			 * possible in the earlier loop, we only need to handle
+			 * backward shifts, not forward ones.
+			 */
+			while (start > 0 &&
+			       !is_blank_line(recs, end - 1, flags) &&
+			       recs_match(recs, start - 1, end - 1, flags)) {
+				rchg[--start] = 1;
+				rchg[--end] = 0;
+			}
+		} else if (end_matching_other != -1) {
+			/*
+			 * Move the possibly merged group of changes back to line
+			 * up with the last group of changes from the other file
+			 * that it can align with.
+			 */
 			while (end_matching_other < end) {
 				rchg[--start] = 1;
 				rchg[--end] = 0;
@@ -561,23 +578,6 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 					io--;
 			}
 		}
-
-		/*
-		 * If a group can be moved back and forth, see if there is a
-		 * blank line in the moving space. If there is a blank line,
-		 * make sure the last blank line is the end of the group.
-		 *
-		 * As we already shifted the group forward as far as possible
-		 * in the earlier loop, we need to shift it back only if at all.
-		 */
-		if ((flags & XDF_COMPACTION_HEURISTIC) && blank_lines) {
-			while (start > 0 &&
-			       !is_blank_line(recs, end - 1, flags) &&
-			       recs_match(recs, start - 1, end - 1, flags)) {
-				rchg[--start] = 1;
-				rchg[--end] = 0;
-			}
-		}
 	}
 
 	return 0;
-- 
2.8.1



* [PATCH 5/8] xdl_change_compact(): fix compaction heuristic to adjust io
  2016-08-03 22:00 [PATCH 0/8] Better heuristics make prettier diffs Michael Haggerty
                   ` (3 preceding siblings ...)
  2016-08-03 22:00 ` [PATCH 4/8] xdl_change_compact(): do one final shift or the other, not both Michael Haggerty
@ 2016-08-03 22:00 ` Michael Haggerty
  2016-08-04  7:27   ` Jeff King
  2016-08-04 18:43   ` Junio C Hamano
  2016-08-03 22:00 ` [PATCH 6/8] xdl_change_compact(): keep track of the earliest end Michael Haggerty
                   ` (4 subsequent siblings)
  9 siblings, 2 replies; 57+ messages in thread
From: Michael Haggerty @ 2016-08-03 22:00 UTC
  To: git
  Cc: Stefan Beller, Junio C Hamano, Jeff King, Jakub Narębski,
	Jacob Keller, Michael Haggerty

The code branch used for the compaction heuristic failed to keep io in
sync while the group was shifted. I think that could have led to
reading past the end of the rchgo array.

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
I didn't actually try to verify the presence of a bug, because it
seemed like more work than it was worth. But here is my reasoning:

If io is not decremented correctly during one iteration of the outer
`while` loop, then it will lose sync with the `end` counter. In
particular it will be too large.

Suppose that the next iterations of the outer `while` loop (i.e.,
processing the next block of add/delete lines) don't have any sliders.
Then the `io` counter would be incremented by the number of
non-changed lines in xdf, which is the same as the number of
non-changed lines in xdfo that *should have* followed the group that
experienced the malfunction. But since `io` was too large at the end
of that iteration, it will be incremented past the end of the
xdfo->rchg array, and will try to read that memory illegally.
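
The reasoning above can be sketched with a toy model. This is not xdiff
code: the helper names (`unchanged_before`, `shift_group_up`, `in_sync`)
are hypothetical, and the invariant it checks, that `end` and `io` have
the same number of unchanged lines before them, is only one way of
stating what "in sync" means here.

```c
#include <assert.h>

/* Count the unchanged (zero) entries in flags[0..upto). */
static long unchanged_before(const char *flags, long upto)
{
	long i, n = 0;

	for (i = 0; i < upto; i++)
		if (!flags[i])
			n++;
	return n;
}

/*
 * Slide the change group [*start, *end) up by one line. If sync_io is
 * nonzero, also step *io back over one unchanged line and then over any
 * changed lines, which is what the patch adds. rchgo must have a zero
 * at index -1 so that the backward scan stops.
 */
static void shift_group_up(char *rchg, const char *rchgo,
			   long *start, long *end, long *io, int sync_io)
{
	assert(*start > 0);
	rchg[--*start] = 1;
	rchg[--*end] = 0;

	if (sync_io) {
		--*io;
		while (rchgo[*io])
			--*io;
	}
}

/* Returns 1 if end and io have equally many unchanged lines before them. */
static int in_sync(const char *rchg, long end, const char *rchgo, long io)
{
	return unchanged_before(rchg, end) == unchanged_before(rchgo, io);
}
```

With `sync_io` set, `in_sync()` still holds after the shift. With it
cleared, `io` is left one unchanged line too far ahead, which is the
drift described above.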

 xdiff/xdiffi.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index c67cfe3..66129db 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -562,6 +562,10 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 			       recs_match(recs, start - 1, end - 1, flags)) {
 				rchg[--start] = 1;
 				rchg[--end] = 0;
+
+				io--;
+				while (rchgo[io])
+					io--;
 			}
 		} else if (end_matching_other != -1) {
 			/*
-- 
2.8.1



* [PATCH 6/8] xdl_change_compact(): keep track of the earliest end
  2016-08-03 22:00 [PATCH 0/8] Better heuristics make prettier diffs Michael Haggerty
                   ` (4 preceding siblings ...)
  2016-08-03 22:00 ` [PATCH 5/8] xdl_change_compact(): fix compaction heuristic to adjust io Michael Haggerty
@ 2016-08-03 22:00 ` Michael Haggerty
  2016-08-04 18:46   ` Junio C Hamano
  2016-08-03 22:00 ` [PATCH 7/8] is_blank_line: take a single xrecord_t as argument Michael Haggerty
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 57+ messages in thread
From: Michael Haggerty @ 2016-08-03 22:00 UTC
  To: git
  Cc: Stefan Beller, Junio C Hamano, Jeff King, Jakub Narębski,
	Jacob Keller, Michael Haggerty

This makes it easier to detect whether shifting is possible, and will
also make the next change easier.

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
 xdiff/xdiffi.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 66129db..34f021a 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -414,7 +414,8 @@ static int recs_match(xrecord_t **recs, long ixs, long ix, long flags)
 }
 
 int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
-	long start, end, io, end_matching_other, groupsize, nrec = xdf->nrec;
+	long start, end, earliest_end, end_matching_other;
+	long io, groupsize, nrec = xdf->nrec;
 	char *rchg = xdf->rchg, *rchgo = xdfo->rchg;
 	unsigned int blank_lines;
 	xrecord_t **recs = xdf->recs;
@@ -516,6 +517,8 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 				end_matching_other = -1;
 			}
 
+			earliest_end = end;
+
 			/*
 			 * Now shift the group forward as long as the first line
 			 * of the current change group is equal to the line after
@@ -547,6 +550,9 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 			}
 		} while (groupsize != end - start);
 
+		if (end == earliest_end)
+			continue; /* no shifting is possible */
+
 		if ((flags & XDF_COMPACTION_HEURISTIC) && blank_lines) {
 			/*
 			 * Compaction heuristic: if a group can be moved back and
-- 
2.8.1



* [PATCH 7/8] is_blank_line: take a single xrecord_t as argument
  2016-08-03 22:00 [PATCH 0/8] Better heuristics make prettier diffs Michael Haggerty
                   ` (5 preceding siblings ...)
  2016-08-03 22:00 ` [PATCH 6/8] xdl_change_compact(): keep track of the earliest end Michael Haggerty
@ 2016-08-03 22:00 ` Michael Haggerty
  2016-08-04 18:48   ` Junio C Hamano
  2016-08-03 22:00 ` [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs Michael Haggerty
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 57+ messages in thread
From: Michael Haggerty @ 2016-08-03 22:00 UTC
  To: git
  Cc: Stefan Beller, Junio C Hamano, Jeff King, Jakub Narębski,
	Jacob Keller, Michael Haggerty

There is no reason for it to take an array and an index as arguments,
as it only accesses a single element of the array.

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
 xdiff/xdiffi.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 34f021a..7518cd5 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -400,9 +400,9 @@ static xdchange_t *xdl_add_change(xdchange_t *xscr, long i1, long i2, long chg1,
 }
 
 
-static int is_blank_line(xrecord_t **recs, long ix, long flags)
+static int is_blank_line(xrecord_t *rec, long flags)
 {
-	return xdl_blankline(recs[ix]->ptr, recs[ix]->size, flags);
+	return xdl_blankline(rec->ptr, rec->size, flags);
 }
 
 static int recs_match(xrecord_t **recs, long ixs, long ix, long flags)
@@ -525,7 +525,7 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 			 * the current change group.
 			 */
 			while (end < nrec && recs_match(recs, start, end, flags)) {
-				blank_lines += is_blank_line(recs, end, flags);
+				blank_lines += is_blank_line(recs[end], flags);
 
 				rchg[start++] = 0;
 				rchg[end++] = 1;
@@ -564,7 +564,7 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 			 * backward shifts, not forward ones.
 			 */
 			while (start > 0 &&
-			       !is_blank_line(recs, end - 1, flags) &&
+			       !is_blank_line(recs[end - 1], flags) &&
 			       recs_match(recs, start - 1, end - 1, flags)) {
 				rchg[--start] = 1;
 				rchg[--end] = 0;
-- 
2.8.1



* [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
  2016-08-03 22:00 [PATCH 0/8] Better heuristics make prettier diffs Michael Haggerty
                   ` (6 preceding siblings ...)
  2016-08-03 22:00 ` [PATCH 7/8] is_blank_line: take a single xrecord_t as argument Michael Haggerty
@ 2016-08-03 22:00 ` Michael Haggerty
  2016-08-03 22:29   ` Jacob Keller
                     ` (3 more replies)
  2016-08-03 22:08 ` [PATCH 0/8] Better heuristics make prettier diffs Michael Haggerty
  2016-08-04  7:38 ` Jeff King
  9 siblings, 4 replies; 57+ messages in thread
From: Michael Haggerty @ 2016-08-03 22:00 UTC (permalink / raw)
  To: git
  Cc: Stefan Beller, Junio C Hamano, Jeff King, Jakub Narębski,
	Jacob Keller, Michael Haggerty

Some groups of added/deleted lines in diffs can be slid up or down,
because lines at the edges of the group are not unique. Picking good
shifts for such groups is not a matter of correctness but definitely has
a big effect on aesthetics. For example, consider the following two
diffs. The first is what standard Git emits:

    --- a/9c572b21dd090a1e5c5bb397053bf8043ffe7fb4:git-send-email.perl
    +++ b/6dcfa306f2b67b733a7eb2d7ded1bc9987809edb:git-send-email.perl
    @@ -231,6 +231,9 @@ if (!defined $initial_reply_to && $prompting) {
     }

     if (!$smtp_server) {
    +       $smtp_server = $repo->config('sendemail.smtpserver');
    +}
    +if (!$smtp_server) {
            foreach (qw( /usr/sbin/sendmail /usr/lib/sendmail )) {
                    if (-x $_) {
                            $smtp_server = $_;

The following diff is equivalent, but is obviously preferable from an
aesthetic point of view:

    --- a/9c572b21dd090a1e5c5bb397053bf8043ffe7fb4:git-send-email.perl
    +++ b/6dcfa306f2b67b733a7eb2d7ded1bc9987809edb:git-send-email.perl
    @@ -230,6 +230,9 @@ if (!defined $initial_reply_to && $prompting) {
            $initial_reply_to =~ s/(^\s+|\s+$)//g;
     }

    +if (!$smtp_server) {
    +       $smtp_server = $repo->config('sendemail.smtpserver');
    +}
     if (!$smtp_server) {
            foreach (qw( /usr/sbin/sendmail /usr/lib/sendmail )) {
                    if (-x $_) {

This patch teaches Git to pick better positions for such "diff sliders".

The original Git code basically always shifted such "sliders" as far
down in the file as possible. The only exception is when the slider can
be aligned with a group of the lines in the other file, in which case
Git favors one add+delete block over one add and a slightly offset
delete block. This naive algorithm often yields ugly diffs.

Commit d634d61ed6 improved the situation somewhat by preferring to
position add/delete groups to make their last line a blank line, when
that is possible. This heuristic does more good than harm, but can only
help if there are blank lines in the right places. It still leaves a lot
of ugly diffs.

This commit implements a new and much better heuristic for picking
optimal "slider" positions using the following approach: First observe
that each hypothetical positioning of a diff slider introduces two
splits: one between the context lines preceding the group and the first
added/deleted line, and the other between the last added/deleted line
and the first line of context following it. The heuristic tries to find
the positioning whose two splits are the least bad.

Splits are evaluated based only on the presence and locations of nearby
blank lines, and the indentation of lines near the split. Basically, it
prefers to introduce splits between lines that are indented less and
adjacent to blank lines. In more detail:

1. It measures the following characteristics of a proposed splitting
   position:

   * the number of blank lines above the proposed split
   * whether the line directly after the split is blank
   * the number of blank lines following that line
   * the indentation of the nearest non-blank line above the split
   * the indentation of the line directly below the split
   * the indentation of the nearest non-blank line after that line

2. It combines these attributes using a bunch of empirically-optimized
   weighting factors to estimate a score of the "badness" of splitting
   the text at that position.

3. It defines the score for a positioning of a diff slider to be the sum
   of the scores for the splits at the top and bottom of the slider.

4. It computes scores like this for all possible positions of the diff
   slider, and selects the position with the smallest "badness" score.
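
Steps 3 and 4 can be sketched in isolation. The following is a minimal
illustration, not the code in this patch: toy_badness() and its weights
(10 per column of indentation, 20 for a preceding blank line) are made
up for the example, and pick_slider_end() and example_lines are
hypothetical names. Only the loop structure (score both splits for each
candidate position, keep the lowest sum) follows the description above.

```c
#include <assert.h>

/* New-file lines around the git-send-email.perl example (abridged). */
static const char *example_lines[] = {
	"}",
	"",
	"if (!$smtp_server) {",
	"\t$smtp_server = $repo->config('sendemail.smtpserver');",
	"}",
	"if (!$smtp_server) {",
	"\tforeach (qw( /usr/sbin/sendmail /usr/lib/sendmail )) {",
};

/* Indentation of a line in columns (TAB = 8), or -1 if the line is blank. */
static int indent_of(const char *line)
{
	int ret = 0;

	for (; *line; line++) {
		if (*line == ' ')
			ret += 1;
		else if (*line == '\t')
			ret += 8 - ret % 8;
		else
			return ret;
	}
	return -1;
}

/*
 * Toy badness of splitting the file just above lines[split]. Hypothetical
 * weights: 10 per column of indentation, minus 20 if a blank line precedes.
 */
static int toy_badness(const char **lines, long nlines, long split)
{
	int indent = split < nlines ? indent_of(lines[split]) : -1;
	int badness = indent > 0 ? 10 * indent : 0;

	if (split > 0 && indent_of(lines[split - 1]) == -1)
		badness -= 20;
	return badness;
}

/*
 * Try every end index (exclusive) from earliest_end to latest_end and
 * return the one whose top and bottom splits have the lowest summed badness.
 */
static long pick_slider_end(const char **lines, long nlines, long groupsize,
			    long earliest_end, long latest_end)
{
	long shift, best_shift = -1;
	int best_score = 0;

	assert(earliest_end <= latest_end);
	for (shift = earliest_end; shift <= latest_end; shift++) {
		int score = toy_badness(lines, nlines, shift - groupsize) +
			    toy_badness(lines, nlines, shift);

		if (best_shift == -1 || score <= best_score) {
			best_score = score;
			best_shift = shift;
		}
	}
	return best_shift;
}
```

In example_lines the three added lines can end at index 5 or 6.
pick_slider_end(example_lines, 7, 3, 5, 6) returns 5, which is the
position where the group starts right after the blank line.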

The weighting factors were found by collecting a corpus of code samples
in various programming languages, deciding "by eye" the best positioning
for about 2700 diff sliders, then optimizing the weights against this
corpus to get the best agreement with the manually-determined values.
(One caveat is that the same corpus was used for both optimization and
testing.)

The resulting numbers of non-optimal diff groups were as follows:

| repository           | default | compaction | indent | optimized |
| -------------------- | ------- | ---------- | ------ | --------- |
| ant                  |     225 |        102 |      7 |         5 |
| bugzilla             |     208 |         81 |     14 |         8 |
| couchdb              |      44 |         24 |     13 |         4 |
| docker               |     180 |        160 |     29 |         7 |
| git                  |     446 |         59 |     27 |         4 |
| ipython              |     358 |        163 |     61 |        11 |
| junit                |     146 |         67 |      5 |         1 |
| nodejs               |     489 |         78 |     12 |         2 |
| phpmyadmin           |     330 |         49 |      1 |         0 |
| test-more            |      15 |          2 |      2 |         0 |
| test-unit            |      33 |         13 |      4 |         0 |
| xmonad               |      20 |          1 |      1 |         0 |
| -------------------- | ------- | ---------- | ------ | --------- |
| totals               |    2494 |        788 |    176 |        42 |

This table shows the number of diff slider groups that were positioned
differently than the human-generated values, for various repositories.
"default" is the default "git diff" algorithm. "compaction" is Git 2.9.0
with the `--compaction-heuristic` option. "indent" is an earlier,
hand-optimized version of this heuristic. "optimized" is the
machine-optimized version, which is implemented in this patch. As you
can see, this new heuristic disagrees with the hand-optimized positions
only 1.7% as often as the default Git algorithm.
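
The 1.7% figure follows from the totals row: 42 disagreements for
"optimized" against 2494 for "default". As a small sketch of the
arithmetic (the function name is made up for illustration):

```c
#include <assert.h>

/* Disagreements of one heuristic as a percentage of a baseline's count. */
static double relative_disagreement_percent(int heuristic, int baseline)
{
	assert(baseline > 0);
	return 100.0 * heuristic / baseline;
}
```

relative_disagreement_percent(42, 2494) is about 1.68, which rounds to
the 1.7% quoted above. The same calculation for the "indent" column,
176 against 2494, gives about 7.1%.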

The tools that were used to do this optimization and analysis and the
human-generated data values are recorded in a separate project [1]. If
other people add more test data (especially in other programming
languages and/or other text formats) and come up with weighting factors
that work better over a wider breadth of samples, it will be easy to
tweak the factors in the code.

[1] https://github.com/mhagger/diff-slider-tools

Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
---
Most of the testing I've done is to run it against the large number of
diffs in the corpus that I have been using to optimize the heuristic.
The diffs that it generates are "equivalent" to the originals
generated by Git (modulo slider positioning), and they are the same
(including slider positioning) as the diffs generated by the Python
code that I used to prototype this heuristic. Since I have very
carefully tested the output of the Python version, I conclude that
this C version is working correctly.

I think it would be hard to add test cases for this code. Maybe one or
two sanity checks would be worthwhile?

 Documentation/diff-options.txt |   6 +-
 diff.c                         |  11 ++
 git-add--interactive.perl      |   5 +-
 xdiff/xdiff.h                  |   1 +
 xdiff/xdiffi.c                 | 276 +++++++++++++++++++++++++++++++++++++++--
 5 files changed, 289 insertions(+), 10 deletions(-)

diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt
index 705a873..78733c8 100644
--- a/Documentation/diff-options.txt
+++ b/Documentation/diff-options.txt
@@ -63,10 +63,12 @@ ifndef::git-format-patch[]
 	Synonym for `-p --raw`.
 endif::git-format-patch[]
 
+--indent-heuristic::
+--no-indent-heuristic::
 --compaction-heuristic::
 --no-compaction-heuristic::
-	These are to help debugging and tuning an experimental
-	heuristic (which is off by default) that shifts the hunk
+	These are to help debugging and tuning experimental
+	heuristics (which are off by default) that shift the hunk
 	boundary in an attempt to make the resulting patch easier
 	to read.
 
diff --git a/diff.c b/diff.c
index 7d03419..24b5818 100644
--- a/diff.c
+++ b/diff.c
@@ -26,6 +26,7 @@
 #endif
 
 static int diff_detect_rename_default;
+static int diff_indent_heuristic; /* experimental */
 static int diff_compaction_heuristic; /* experimental */
 static int diff_rename_limit_default = 400;
 static int diff_suppress_blank_empty;
@@ -190,6 +191,10 @@ int git_diff_ui_config(const char *var, const char *value, void *cb)
 		diff_detect_rename_default = git_config_rename(var, value);
 		return 0;
 	}
+	if (!strcmp(var, "diff.indentheuristic")) {
+		diff_indent_heuristic = git_config_bool(var, value);
+		return 0;
+	}
 	if (!strcmp(var, "diff.compactionheuristic")) {
 		diff_compaction_heuristic = git_config_bool(var, value);
 		return 0;
@@ -3286,6 +3291,8 @@ void diff_setup(struct diff_options *options)
 	options->use_color = diff_use_color_default;
 	options->detect_rename = diff_detect_rename_default;
 	options->xdl_opts |= diff_algorithm;
+	if (diff_indent_heuristic)
+		DIFF_XDL_SET(options, INDENT_HEURISTIC);
 	if (diff_compaction_heuristic)
 		DIFF_XDL_SET(options, COMPACTION_HEURISTIC);
 
@@ -3808,6 +3815,10 @@ int diff_opt_parse(struct diff_options *options,
 		DIFF_XDL_SET(options, IGNORE_WHITESPACE_AT_EOL);
 	else if (!strcmp(arg, "--ignore-blank-lines"))
 		DIFF_XDL_SET(options, IGNORE_BLANK_LINES);
+	else if (!strcmp(arg, "--indent-heuristic"))
+		DIFF_XDL_SET(options, INDENT_HEURISTIC);
+	else if (!strcmp(arg, "--no-indent-heuristic"))
+		DIFF_XDL_CLR(options, INDENT_HEURISTIC);
 	else if (!strcmp(arg, "--compaction-heuristic"))
 		DIFF_XDL_SET(options, COMPACTION_HEURISTIC);
 	else if (!strcmp(arg, "--no-compaction-heuristic"))
diff --git a/git-add--interactive.perl b/git-add--interactive.perl
index 642cce1..ee3d812 100755
--- a/git-add--interactive.perl
+++ b/git-add--interactive.perl
@@ -45,6 +45,7 @@ my ($diff_new_color) =
 my $normal_color = $repo->get_color("", "reset");
 
 my $diff_algorithm = $repo->config('diff.algorithm');
+my $diff_indent_heuristic = $repo->config_bool('diff.indentheuristic');
 my $diff_compaction_heuristic = $repo->config_bool('diff.compactionheuristic');
 my $diff_filter = $repo->config('interactive.difffilter');
 
@@ -750,7 +751,9 @@ sub parse_diff {
 	if (defined $diff_algorithm) {
 		splice @diff_cmd, 1, 0, "--diff-algorithm=${diff_algorithm}";
 	}
-	if ($diff_compaction_heuristic) {
+	if ($diff_indent_heuristic) {
+		splice @diff_cmd, 1, 0, "--indent-heuristic";
+	} elsif ($diff_compaction_heuristic) {
 		splice @diff_cmd, 1, 0, "--compaction-heuristic";
 	}
 	if (defined $patch_mode_revision) {
diff --git a/xdiff/xdiff.h b/xdiff/xdiff.h
index 7423f77..8db16d4 100644
--- a/xdiff/xdiff.h
+++ b/xdiff/xdiff.h
@@ -42,6 +42,7 @@ extern "C" {
 #define XDF_IGNORE_BLANK_LINES (1 << 7)
 
 #define XDF_COMPACTION_HEURISTIC (1 << 8)
+#define XDF_INDENT_HEURISTIC (1 << 9)
 
 #define XDL_EMIT_FUNCNAMES (1 << 0)
 #define XDL_EMIT_FUNCCONTEXT (1 << 2)
diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 7518cd5..a07c63e 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -413,6 +413,228 @@ static int recs_match(xrecord_t **recs, long ixs, long ix, long flags)
 			     flags));
 }
 
+/*
+ * If a line is indented more than this, get_indent() just returns this value.
+ * This avoids having to do absurd amounts of work for data that are not
+ * human-readable text, and also ensures that the output of get_indent fits within
+ * an int.
+ */
+#define MAX_INDENT 200
+
+/*
+ * Return the amount of indentation of the specified line, treating TAB as 8
+ * columns. Return -1 if line is empty or contains only whitespace. Clamp the
+ * output value at MAX_INDENT.
+ */
+static int get_indent(xrecord_t *rec)
+{
+	long i;
+	int ret = 0;
+
+	for (i = 0; i < rec->size; i++) {
+		char c = rec->ptr[i];
+
+		if (!XDL_ISSPACE(c))
+			return ret;
+		else if (c == ' ')
+			ret += 1;
+		else if (c == '\t')
+			ret += 8 - ret % 8;
+		/* ignore other whitespace characters */
+
+		if (ret >= MAX_INDENT)
+			return MAX_INDENT;
+	}
+	/*
+	 * We have reached the end of the line without finding any non-space
+	 * characters; i.e., the whole line consists of trailing spaces, which we
+	 * are not interested in.
+	 */
+	return -1;
+}
+
+/*
+ * If more than this number of consecutive blank rows are found, just return this
+ * value. This avoids requiring O(N^2) work for pathological cases, and also
+ * ensures that the output of score_split fits in an int.
+ */
+#define MAX_BLANKS 20
+
+/* Characteristics measured about a hypothetical split position. */
+struct split_measurement {
+	/*
+	 * Is the split at the end of the file (aside from any blank lines)?
+	 */
+	int end_of_file;
+
+	/*
+	 * How much is the line immediately following the split indented (or -1 if
+	 * the line is blank):
+	 */
+	int indent;
+
+	/*
+	 * How many consecutive lines above the split are blank?
+	 */
+	int pre_blank;
+
+	/*
+	 * How much is the nearest non-blank line above the split indented (or -1
+	 * if there is no such line)?
+	 */
+	int pre_indent;
+
+	/*
+	 * How many lines after the line following the split are blank?
+	 */
+	int post_blank;
+
+	/*
+	 * How much is the nearest non-blank line after the line following the
+	 * split indented (or -1 if there is no such line)?
+	 */
+	int post_indent;
+};
+
+/*
+ * Fill m with information about a hypothetical split of xdf above line split.
+ */
+void measure_split(const xdfile_t *xdf, long split, struct split_measurement *m)
+{
+	long i;
+
+	if (split >= xdf->nrec) {
+		m->end_of_file = 1;
+		m->indent = -1;
+	} else {
+		m->end_of_file = 0;
+		m->indent = get_indent(xdf->recs[split]);
+	}
+
+	m->pre_blank = 0;
+	for (i = split - 1; i >= 0; i--) {
+		m->pre_indent = get_indent(xdf->recs[i]);
+		if (m->pre_indent != -1)
+			break;
+		m->pre_blank += 1;
+		if (m->pre_blank == MAX_BLANKS) {
+			m->pre_indent = 0;
+			break;
+		}
+	}
+
+	m->post_blank = 0;
+	for (i = split + 1; i < xdf->nrec; i++) {
+		m->post_indent = get_indent(xdf->recs[i]);
+		if (m->post_indent != -1)
+			break;
+		m->post_blank += 1;
+		if (m->post_blank == MAX_BLANKS) {
+			m->post_indent = 0;
+			break;
+		}
+	}
+}
+
+#define START_OF_FILE_BONUS 9
+#define END_OF_FILE_BONUS 46
+#define TOTAL_BLANK_WEIGHT 4
+#define PRE_BLANK_WEIGHT 16
+#define RELATIVE_INDENT_BONUS -1
+#define RELATIVE_INDENT_HAS_BLANK_BONUS 15
+#define RELATIVE_OUTDENT_BONUS -19
+#define RELATIVE_OUTDENT_HAS_BLANK_BONUS 2
+#define RELATIVE_DEDENT_BONUS -63
+#define RELATIVE_DEDENT_HAS_BLANK_BONUS 50
+
+/*
+ * Compute a badness score for the hypothetical split whose measurements are
+ * stored in m. The weight factors were determined empirically using the tools and
+ * corpus described in
+ *
+ *     https://github.com/mhagger/diff-slider-tools
+ *
+ * Also see that project if you want to improve the weights based on, for example,
+ * a larger or more diverse corpus.
+ */
+int score_split(const struct split_measurement *m)
+{
+	/*
+	 * A place to accumulate bonus factors (positive makes this index more
+	 * favored):
+	 */
+        int bonus = 0, score, total_blanks, indent, any_blanks;
+
+        if (m->pre_indent == -1 && m->pre_blank == 0)
+		bonus += START_OF_FILE_BONUS;
+
+        if (m->end_of_file)
+		bonus += END_OF_FILE_BONUS;
+
+        total_blanks = m->pre_blank;
+        if (m->indent == -1)
+		total_blanks += 1 + m->post_blank;
+
+	/* Bonuses based on the location of blank lines: */
+        bonus += TOTAL_BLANK_WEIGHT * total_blanks;
+	bonus += PRE_BLANK_WEIGHT * m->pre_blank;
+
+        if (m->indent != -1)
+		indent = m->indent;
+        else
+		indent = m->post_indent;
+
+        any_blanks = (total_blanks != 0);
+
+        if (indent == -1) {
+		score = 0;
+        } else if (m->pre_indent == -1) {
+		score = indent;
+        } else if (indent > m->pre_indent) {
+		/*
+		 * The line is indented more than its predecessor. Score it based
+		 * on the larger indent:
+		 */
+		score = indent;
+		bonus += RELATIVE_INDENT_BONUS;
+		bonus += RELATIVE_INDENT_HAS_BLANK_BONUS * any_blanks;
+	} else if (indent < m->pre_indent) {
+		/*
+		 * The line is indented less than its predecessor. It could be
+		 * that this line is the start of a new block (e.g., of an "else"
+		 * block, or of a block without a block terminator) or it could be
+		 * the end of the previous block.
+		 */
+		if (m->post_indent == -1 || indent >= m->post_indent) {
+			/*
+			 * That was probably the end of a block. Score based on
+			 * the line's own indent:
+			 */
+			score = indent;
+			bonus += RELATIVE_DEDENT_BONUS;
+			bonus += RELATIVE_DEDENT_HAS_BLANK_BONUS * any_blanks;
+		} else {
+			/*
+			 * The following line is indented more. So it is likely
+			 * that this line is the start of a block. It's a pretty
+			 * good place to split, so score it based on its own
+			 * indent:
+			 */
+			score = indent;
+			bonus += RELATIVE_OUTDENT_BONUS;
+			bonus += RELATIVE_OUTDENT_HAS_BLANK_BONUS * any_blanks;
+		}
+	} else {
+		/*
+		 * The line has the same indentation level as its predecessor. We
+		 * score it based on its own indent:
+		 */
+		score = indent;
+	}
+
+        return 10 * score - bonus;
+}
+
 int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 	long start, end, earliest_end, end_matching_other;
 	long io, groupsize, nrec = xdf->nrec;
@@ -553,15 +775,18 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 		if (end == earliest_end)
 			continue; /* no shifting is possible */
 
+		/*
+		 * The group can be shifted. Possibly use this freedom to produce
+		 * a more intuitive diff.
+		 *
+		 * The group is currently shifted as far down as possible, so the
+		 * heuristics below only have to handle upwards shifts.
+		 */
+
 		if ((flags & XDF_COMPACTION_HEURISTIC) && blank_lines) {
 			/*
-			 * Compaction heuristic: if a group can be moved back and
-			 * forth, then if possible shift the group to make its
-			 * bottom line a blank line.
-			 *
-			 * As we already shifted the group forward as far as
-			 * possible in the earlier loop, we only need to handle
-			 * backward shifts, not forward ones.
+			 * Compaction heuristic: if possible, shift the group to
+			 * make its bottom line a blank line.
 			 */
 			while (start > 0 &&
 			       !is_blank_line(recs[end - 1], flags) &&
@@ -587,6 +812,43 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 				while (rchgo[io])
 					io--;
 			}
+		} else if (flags & XDF_INDENT_HEURISTIC) {
+			/*
+			 * Indent heuristic: a group of pure add/delete lines
+			 * implies two splits, one between the end of the "before"
+			 * context and the start of the group, and another between
+			 * the end of the group and the beginning of the "after"
+			 * context. Some splits are aesthetically better and some
+			 * are worse. We compute a badness "score" for each split,
+			 * and add the scores for the two splits to define a
+			 * "score" for each position that the group can be shifted
+			 * to. Then we pick the shift with the lowest score.
+			 */
+			long shift, best_shift = -1;
+			int best_score = 0;
+
+			for (shift = earliest_end; shift <= end; shift++) {
+				struct split_measurement m;
+				int score;
+
+				measure_split(xdf, shift, &m);
+				score = score_split(&m);
+				measure_split(xdf, shift - groupsize, &m);
+				score += score_split(&m);
+				if (best_shift == -1 || score <= best_score) {
+					best_score = score;
+					best_shift = shift;
+				}
+			}
+
+			while (end > best_shift) {
+				rchg[--start] = 1;
+				rchg[--end] = 0;
+
+				io--;
+				while (rchgo[io])
+					io--;
+			}
 		}
 	}
 
-- 
2.8.1



* Re: [PATCH 0/8] Better heuristics make prettier diffs
  2016-08-03 22:00 [PATCH 0/8] Better heuristics make prettier diffs Michael Haggerty
                   ` (7 preceding siblings ...)
  2016-08-03 22:00 ` [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs Michael Haggerty
@ 2016-08-03 22:08 ` Michael Haggerty
  2016-08-04  7:38 ` Jeff King
  9 siblings, 0 replies; 57+ messages in thread
From: Michael Haggerty @ 2016-08-03 22:08 UTC (permalink / raw)
  To: git
  Cc: Stefan Beller, Junio C Hamano, Jeff King, Jakub Narębski,
	Jacob Keller

On 08/04/2016 12:00 AM, Michael Haggerty wrote:
> I've talked about this quite a bit on the list already. The idea is to
> improve ugly diffs

I forgot to note that this patch series is also available from my GitHub
account [1] as branch "diff-indent-heuristics".

Also, I will be away from my computer for the next five days, so don't
be insulted if I don't respond promptly to your emails. (Well, actually
I'm never that prompt so people might not even notice :-/ ).

Michael

[1] https://github.com/mhagger/git



* Re: [PATCH 2/8] xdl_change_compact(): clarify code
  2016-08-03 22:00 ` [PATCH 2/8] xdl_change_compact(): clarify code Michael Haggerty
@ 2016-08-03 22:11   ` Stefan Beller
  2016-08-03 23:14     ` Michael Haggerty
  0 siblings, 1 reply; 57+ messages in thread
From: Stefan Beller @ 2016-08-03 22:11 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: git@vger.kernel.org, Junio C Hamano, Jeff King,
	Jakub Narębski, Jacob Keller

On Wed, Aug 3, 2016 at 3:00 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> Write things out a bit longer but less cryptically. Add more comments.

By less cryptic you mean in Git coding style ;)
The original author (do we want to cc Davide?) may disagree.

> +
> +                       /*
> +                        * Are there any blank lines that could appear as the last
> +                        * line of this group?
> +                        */

IIRC this comment is not quite correct as this 'only' counts the number of
blank lines within the forward shifting section, i.e. in the movable space.

Later we use it as a boolean indicator (whether or not it is equal to 0)
to see if we can do better.

Any other change in code and comments looks good to me, but this stood out
like a sore thumb. (Probably the old heuristic as a whole stood out, but the
comment here specifically sounds /wrong/ to me in this place. How can
a question document a variable? I'd rather expect a question comment
to ease the understanding of a condition)

Thanks for reviving this topic!
Stefan


* Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
  2016-08-03 22:00 ` [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs Michael Haggerty
@ 2016-08-03 22:29   ` Jacob Keller
  2016-08-03 22:36     ` Michael Haggerty
  2016-08-03 22:30   ` Stefan Beller
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 57+ messages in thread
From: Jacob Keller @ 2016-08-03 22:29 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: Git mailing list, Stefan Beller, Junio C Hamano, Jeff King,
	Jakub Narębski

On Wed, Aug 3, 2016 at 3:00 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> +/*
> + * If a line is indented more than this, get_indent() just returns this value.
> + * This avoids having to do absurd amounts of work for data that are not
> + * human-readable text, and also ensures that the output of get_indent fits within
> + * an int.
> + */
> +#define MAX_INDENT 200
> +
> +/*
> + * Return the amount of indentation of the specified line, treating TAB as 8
> + * columns. Return -1 if line is empty or contains only whitespace. Clamp the
> + * output value at MAX_INDENT.
> + */
> +static int get_indent(xrecord_t *rec)
> +{
> +       long i;
> +       int ret = 0;
> +
> +       for (i = 0; i < rec->size; i++) {
> +               char c = rec->ptr[i];
> +
> +               if (!XDL_ISSPACE(c))
> +                       return ret;
> +               else if (c == ' ')
> +                       ret += 1;
> +               else if (c == '\t')
> +                       ret += 8 - ret % 8;
> +               /* ignore other whitespace characters */
> +
> +               if (ret >= MAX_INDENT)
> +                       return MAX_INDENT;

Should we return -1 here?

> +       }
> +       /*
> +        * We have reached the end of the line without finding any non-space
> +        * characters; i.e., the whole line consists of trailing spaces, which we
> +        * are not interested in.
> +        */
> +       return -1;

It seems odd to me that a line with "199" spaces and nothing else will
return "-1" but a line with 200 spaces and nothing else will return
200..? Would it be safe to just return -1 in both cases (if a line is
all spaces or starts with more than 200 spaces just return -1)?

> +}
> +


* Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
  2016-08-03 22:00 ` [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs Michael Haggerty
  2016-08-03 22:29   ` Jacob Keller
@ 2016-08-03 22:30   ` Stefan Beller
  2016-08-03 22:41     ` Michael Haggerty
  2016-08-04  7:56   ` Jeff King
  2016-08-04 19:52   ` Junio C Hamano
  3 siblings, 1 reply; 57+ messages in thread
From: Stefan Beller @ 2016-08-03 22:30 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: git@vger.kernel.org, Junio C Hamano, Jeff King,
	Jakub Narębski, Jacob Keller

On Wed, Aug 3, 2016 at 3:00 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:

> +        return 10 * score - bonus;

Would it make sense to define-ify the 10 as well
as this is the only hardcoded number in the
scoring function?


* Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
  2016-08-03 22:29   ` Jacob Keller
@ 2016-08-03 22:36     ` Michael Haggerty
  2016-08-04  4:47       ` Jacob Keller
  2016-08-04 19:39       ` Junio C Hamano
  0 siblings, 2 replies; 57+ messages in thread
From: Michael Haggerty @ 2016-08-03 22:36 UTC (permalink / raw)
  To: Jacob Keller
  Cc: Git mailing list, Stefan Beller, Junio C Hamano, Jeff King,
	Jakub Narębski

On 08/04/2016 12:29 AM, Jacob Keller wrote:
> On Wed, Aug 3, 2016 at 3:00 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>> +/*
>> + * If a line is indented more than this, get_indent() just returns this value.
>> + * This avoids having to do absurd amounts of work for data that are not
>> + * human-readable text, and also ensures that the output of get_indent fits within
>> + * an int.
>> + */
>> +#define MAX_INDENT 200
>> +
>> +/*
>> + * Return the amount of indentation of the specified line, treating TAB as 8
>> + * columns. Return -1 if line is empty or contains only whitespace. Clamp the
>> + * output value at MAX_INDENT.
>> + */
>> +static int get_indent(xrecord_t *rec)
>> +{
>> +       long i;
>> +       int ret = 0;
>> +
>> +       for (i = 0; i < rec->size; i++) {
>> +               char c = rec->ptr[i];
>> +
>> +               if (!XDL_ISSPACE(c))
>> +                       return ret;
>> +               else if (c == ' ')
>> +                       ret += 1;
>> +               else if (c == '\t')
>> +                       ret += 8 - ret % 8;
>> +               /* ignore other whitespace characters */
>> +
>> +               if (ret >= MAX_INDENT)
>> +                       return MAX_INDENT;
> 
> Should we return -1 here?
> 
>> +       }
>> +       /*
>> +        * We have reached the end of the line without finding any non-space
>> +        * characters; i.e., the whole line consists of trailing spaces, which we
>> +        * are not interested in.
>> +        */
>> +       return -1;
> 
> It seems odd to me that a line with "199" spaces and nothing else will
> return "-1" but a line with 200 spaces and nothing else will return
> 200..? Would it be safe to just return -1 in both cases (if a line is
> all spaces or starts with more than 200 spaces just return -1)?
> 
>> +}
>> +

Thanks for your feedback.

I was implicitly assuming that such lines would have text somewhere
after those 200 spaces (or 25 TABs or whatever). But you're right, the
line could consist only of whitespace. Unfortunately, the only way to
distinguish these two cases is to read the rest of the line, which is
exactly what we *don't* want to do.

But I think it doesn't matter anyway. Such "text" will likely never be
read by a human, so it's not a big deal if the slider position is not
picked perfectly. And remember, this whole saga is just to improve the
aesthetics of the diff. The diff is *correct* (e.g., in the sense of
applicable) regardless of where we position the sliders.
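
To make the corner case concrete, here is a standalone sketch of the
same logic as get_indent(), taking a pointer and a length instead of an
xrecord_t so that it compiles outside of xdiff. It assumes that
XDL_ISSPACE() behaves like isspace() on an unsigned char, and
get_indent_sketch is a made-up name.

```c
#include <ctype.h>

#define MAX_INDENT 200

/*
 * Indentation in columns (TAB = 8), clamped at MAX_INDENT, or -1 if the
 * line is empty or whitespace-only and shorter than the clamp.
 */
static int get_indent_sketch(const char *ptr, long size)
{
	long i;
	int ret = 0;

	for (i = 0; i < size; i++) {
		char c = ptr[i];

		if (!isspace((unsigned char)c))
			return ret;
		else if (c == ' ')
			ret += 1;
		else if (c == '\t')
			ret += 8 - ret % 8;
		/* other whitespace characters do not add to the indent */

		if (ret >= MAX_INDENT)
			return MAX_INDENT; /* stop without reading the rest */
	}
	return -1; /* ran off the end: nothing but whitespace */
}
```

With this version a line of 199 spaces yields -1, a line of 200 spaces
yields 200, and "\tfoo" yields 8. Telling the second case apart from a
deeply indented line with text would mean scanning past the clamp.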

Michael



* Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
  2016-08-03 22:30   ` Stefan Beller
@ 2016-08-03 22:41     ` Michael Haggerty
  2016-08-03 22:51       ` Stefan Beller
  0 siblings, 1 reply; 57+ messages in thread
From: Michael Haggerty @ 2016-08-03 22:41 UTC (permalink / raw)
  To: Stefan Beller
  Cc: git@vger.kernel.org, Junio C Hamano, Jeff King,
	Jakub Narębski, Jacob Keller

On 08/04/2016 12:30 AM, Stefan Beller wrote:
> On Wed, Aug 3, 2016 at 3:00 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> 
>> +        return 10 * score - bonus;
> 
> Would it make sense to define-ify the 10 as well
> as this is the only hardcoded number in the
> scoring function?

I started answering this question by explaining why it is not important
to *optimize* the number 10 (namely because scores are only ever
compared against other scores, so an overall scaling factor makes no
difference). The factor 10 only has to be large enough to provide enough
dynamic range for the other (adjustable) parameters.

But I think you are asking a simpler question: should we give this
constant a name rather than hardcoding it? I don't see a strong reason
for or against, so I'll give it a name in the next version, as you suggest.

Michael


* Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
  2016-08-03 22:41     ` Michael Haggerty
@ 2016-08-03 22:51       ` Stefan Beller
  2016-08-03 23:30         ` Michael Haggerty
  0 siblings, 1 reply; 57+ messages in thread
From: Stefan Beller @ 2016-08-03 22:51 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: git@vger.kernel.org, Junio C Hamano, Jeff King,
	Jakub Narębski, Jacob Keller

On Wed, Aug 3, 2016 at 3:41 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> On 08/04/2016 12:30 AM, Stefan Beller wrote:
>> On Wed, Aug 3, 2016 at 3:00 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>>
>>> +        return 10 * score - bonus;
>>
>> Would it make sense to define-ify the 10 as well
>> as this is the only hardcoded number in the
>> scoring function?
>
> I started answering this question by explaining why it is not important
> to *optimize* the number 10 (namely because scores are only ever
> compared against other scores, so an overall scaling factor makes no
> difference). The factor 10 only has to be large enough to provide enough
> dynamic range for the other (adjustable) parameters.

But it only scales the score, not the bonus. So another way to write it
would be

    score - bonus/10;

assuming the values of score and bonus are large enough.

In some prior conversation you said you take the indent and add an epsilon
for some special conditions to make one indent better than the other.

Epsilons are usually very small compared to the rest of the equation,
but if I look at the boni definitions ranging from -63..50 they are scaled up
so much that the bonus may become larger than '1' unit of 'score', i.e.
it is not an epsilon any more. Or to put it another way:
If we were to s/10/100/ the results would be worse.

Rather the 10 describes the ratio of "advanced magic" to pure indentation
based scoring in my understanding.
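
A small sketch of that point, with the factor pulled out as a
parameter. The name indent_weight and the sample numbers below are
invented for illustration; only the formula (factor * score - bonus)
comes from the patch.

```c
/*
 * Badness as in "10 * score - bonus", but with the factor as a parameter.
 * indent_weight is a hypothetical name, not taken from the patch.
 */
static int split_badness(int indent_weight, int score, int bonus)
{
	return indent_weight * score - bonus;
}

/* Return 1 if split A is preferred over split B (lower badness wins). */
static int prefers_a(int indent_weight, int score_a, int bonus_a,
		     int score_b, int bonus_b)
{
	return split_badness(indent_weight, score_a, bonus_a) <
	       split_badness(indent_weight, score_b, bonus_b);
}
```

Take a split A with score 4 (a line indented by 4 columns) and a bonus
of 50, and a split B with score 0 and no bonus. With a factor of 10, A
gets -10 and wins over B's 0. With a factor of 100, A gets 350 and
loses. So the factor is not a pure overall scale: it sets how much
indentation counts relative to the bonuses.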

>
> But I think you are asking a simpler question: should we give this
> constant a name rather than hardcoding it? I don't see a strong reason
> for or against, so I'll give it a name in the next version, as you suggest.

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 2/8] xdl_change_compact(): clarify code
  2016-08-03 22:11   ` Stefan Beller
@ 2016-08-03 23:14     ` Michael Haggerty
  2016-08-03 23:50       ` Stefan Beller
  0 siblings, 1 reply; 57+ messages in thread
From: Michael Haggerty @ 2016-08-03 23:14 UTC (permalink / raw)
  To: Stefan Beller
  Cc: git@vger.kernel.org, Junio C Hamano, Jeff King,
	Jakub Narębski, Jacob Keller

On 08/04/2016 12:11 AM, Stefan Beller wrote:
> On Wed, Aug 3, 2016 at 3:00 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>> Write things out a bit longer but less cryptically. Add more comments.
> 
> By less cryptic you mean in Git coding style ;)
> The original author (do we want to cc Davide?) may disagree.

Davide hasn't contributed since 2008 and libxdiff is not being
developed, so I didn't think he'd be interested.

Yes, tastes certainly differ. If more people like the old version
better, I will gnash my teeth and undo these "clarification" patches. I
mean, what's not to like about variable names like "grpsiz" and "ixref"?

>> +
>> +                       /*
>> +                        * Are there any blank lines that could appear as the last
>> +                        * line of this group?
>> +                        */
> 
> IIRC this comment is not quite correct as this 'only' counts the number of
> blank lines within the forward shifting section, i.e. in the movable space.
> 
> Later we use it as a boolean indicator (whether or not it is equal to 0)
> to see if we can do better.
> 
> Any other change in code and comments looks good to me, but this stood out
> like a sore thumb. (Probably the old heuristic as a whole stood out, but the
> comment here specifically sounds /wrong/ to me in this place. How can
> a question document a variable? I'd rather expect a question comment
> to ease the understanding of a condition)

I don't understand your objection. A blank line can appear as the last
line of the group if and only if it is within the shift range ("movable
space") of the group, right? So it seems like our formulations are
equivalent.

Since the variable is used as a boolean, it seemed natural to document
it by stating the question that the true/false value is the answer to.

If you have a concrete suggestion for a better comment, please let me know.

Michael

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
  2016-08-03 22:51       ` Stefan Beller
@ 2016-08-03 23:30         ` Michael Haggerty
  2016-08-04  0:04           ` Stefan Beller
  0 siblings, 1 reply; 57+ messages in thread
From: Michael Haggerty @ 2016-08-03 23:30 UTC (permalink / raw)
  To: Stefan Beller
  Cc: git@vger.kernel.org, Junio C Hamano, Jeff King,
	Jakub Narębski, Jacob Keller

On 08/04/2016 12:51 AM, Stefan Beller wrote:
> On Wed, Aug 3, 2016 at 3:41 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>> On 08/04/2016 12:30 AM, Stefan Beller wrote:
>>> On Wed, Aug 3, 2016 at 3:00 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>>>
>>>> +        return 10 * score - bonus;
>>>
>>> Would it make sense to define-ify the 10 as well
>>> as this is the only hardcoded number in the
>>> scoring function?
>>
>> I started answering this question by explaining why it is not important
>> to *optimize* the number 10 (namely because scores are only ever
>> compared against other scores, so an overall scaling factor makes no
>> difference). The factor 10 only has to be large enough to provide enough
>> dynamic range for the other (adjustable) parameters.
> 
> But it only scales the score, not the bonus.

Yes, that's how it provides the overall scale of the score. If it
multiplied both values, then it would be completely pointless.

This is an important point for the optimization, but less so for the
implementation of the heuristic here. I was dynamically optimizing the
ten other variables, and everything that goes into the bonus includes
one of those factors. If I had also let this factor of 10 vary, then the
net behavior of the algorithm would be completely unchanged if I would,
say, double all eleven parameters. This is bad for optimization, because
(1) it increases the search space unnecessarily, and (2) it means that
whole lines in the parameter space give identical behavior, making the
algorithm waste time searching along those lines for a minimum.

> So another way to write it
> would be
> 
>     score - bonus/10;
> 
> assuming the values of score and bonus are large enough.

Score is the number of columns that some line is indented, so it can be
0 or 1 or any other positive integer. It is not "large enough", which is
why the "10" can't be "1".

If the calculations were done in floating point, then the factor could
be "1", because then the other factors could be made less than one.
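
To make the integer-arithmetic point concrete, here is a minimal sketch.
The helper names scaled_score() and divided_score() are hypothetical and
exist only for this illustration; they are not part of the patch.

```c
/*
 * Illustration only: hypothetical helpers, not code from the patch.
 * They compare two ways of combining an indent-based score with a
 * bonus when everything is done in integer arithmetic.
 */

/* The form used in the patch: scale the indent up, then subtract. */
static int scaled_score(int indent, int bonus)
{
	return 10 * indent - bonus;
}

/* The hypothetical rewrite: scale the bonus down instead. */
static int divided_score(int indent, int bonus)
{
	/* Integer division truncates, so bonuses below 10 vanish. */
	return indent - bonus / 10;
}
```

With an indent of 4, bonuses of 4 and 9 give different results from
scaled_score() but the same result from divided_score(), because both
bonuses truncate to zero in the division.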

> In some prior conversation you said you take the indent and add an epsilon
> for some special conditions to make one indent better than the other.
> 
> Epsilons are usually very small compared to the rest of the equation,

I should have mentioned that this heuristic is quite a bit more advanced
than my original proposal to use "indent" plus some "epsilon" factors.
The old discussion about epsilons is not relevant here except maybe as
an inspiration and starting point for this version.

> but if I look at the boni definitions ranging from -63..50 they are scaled up
> so much that the bonus may become larger than '1' unit of 'score', i.e.
> it is not an epsilon any more. Or to put it another way:
> If we were to s/10/100/ the results would be worse.

If you would change s/10/100/ and simultaneously multiply the other
constants by 10, the end results would be unchanged.

> Rather the 10 describes the ratio of "advanced magic" to pure indentation
> based scoring in my understanding.

No, it's basically just a number against which the other constants are
compared. E.g., if another bonus wants to balance out against exactly
one space of indentation, its constant needs to be 10. If it wants to
balance out against exactly 5 spaces, its constant needs to be 50. Etc.
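
A small sketch of both claims, under the assumption that a candidate
position is fully described by an indent and a single summed bonus. The
names combined() and compare_positions() are made up for illustration
and do not appear in the patch.

```c
/* Hypothetical helper: combine indent and bonus using a given scale. */
static int combined(int scale, int indent, int bonus)
{
	return scale * indent - bonus;
}

/*
 * Compare two candidate positions A and B. Returns -1, 0 or 1 depending
 * on whether A's combined value is smaller than, equal to or larger
 * than B's. bonus_mult rescales every bonus, so passing (10, 1) and
 * (100, 10) models "s/10/100/ plus multiplying the other constants
 * by 10".
 */
static int compare_positions(int scale, int bonus_mult,
			     int indent_a, int bonus_a,
			     int indent_b, int bonus_b)
{
	int a = combined(scale, indent_a, bonus_mult * bonus_a);
	int b = combined(scale, indent_b, bonus_mult * bonus_b);

	return (a > b) - (a < b);
}
```

Since scores are only ever compared with each other, the comparison
result is the same for (scale, bonus_mult) = (10, 1) and (100, 10). With
a scale of 10, a bonus of 10 exactly offsets one extra column of
indentation, and a bonus of 50 offsets five.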

Michael

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 2/8] xdl_change_compact(): clarify code
  2016-08-03 23:14     ` Michael Haggerty
@ 2016-08-03 23:50       ` Stefan Beller
  2016-08-04  7:13         ` Jeff King
  2016-08-10 16:39         ` Michael Haggerty
  0 siblings, 2 replies; 57+ messages in thread
From: Stefan Beller @ 2016-08-03 23:50 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: git@vger.kernel.org, Junio C Hamano, Jeff King,
	Jakub Narębski, Jacob Keller

On Wed, Aug 3, 2016 at 4:14 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> On 08/04/2016 12:11 AM, Stefan Beller wrote:
>> On Wed, Aug 3, 2016 at 3:00 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>>> Write things out a bit longer but less cryptically. Add more comments.
>>
>> By less cryptic you mean in Git coding style ;)
>> The original author (do we want to cc Davide?) may disagree.
>
> Davide hasn't contributed since 2008 and libxdiff is not being
> developed, so I didn't think he'd be interested.

ok.

>
> Yes, tastes certainly differ. If more people like the old version
> better, I will gnash my teeth and undo these "clarification" patches.

I was not asking for undoing these, but giving short cryptic answers myself. ;)
While I agree the variable names are way better than before, the use of while
instead of for (and then doing another final ++ after the loop) extended some
one liners to about 5. I am totally fine with that as they are easier
to read for me as I understand them as Git style, hence easier to read.

There may be old timers (who have knowledge of C from other projects), that
would prefer the style as before:

e.g.

-               start = i;
-               for (i++; rchg[i]; i++);
-               for (; rchgo[io]; io++);
+               start = i++;
+
+               while (rchg[i])
+                       i++;
+
+               while (rchgo[io])
+                      io++;

This doesn't change variable names, but it only transforms a for loop with no
body in a more readable structure of while loops separated by white space.
So for such a chunk I could imagine people arguing about adding lines of code
(which is valuable screen real estate) for only slight gain if any.
I am not one of these.

> I
> mean, what's not to like about variable names like "grpsiz" and "ixref"?

faster typng ;)

>
>>> +
>>> +                       /*
>>> +                        * Are there any blank lines that could appear as the last
>>> +                        * line of this group?
>>> +                        */
>>
>> IIRC this comment is not quite correct as this 'only' counts the number of
>> blank lines within the forward shifting section, i.e. in the movable space.
>>
>> Later we use it as a boolean indicator (whether or not it is equal to 0)
>> to see if we can do better.
>>
>> Any other change in code and comments looks good to me, but this stood out
>> like a sore thumb. (Probably the old heuristic as a whole stood out, but the
>> comment here specifically sounds /wrong/ to me in this place. How can
>> a question document a variable? I'd rather expect a question comment
>> to ease the understanding of a condition)
>
> I don't understand your objection. A blank line can appear as the last
> line of the group if and only if it is within the shift range ("movable
> space") of the group, right? So it seems like our formulations are
> equivalent.

Sure, e.g. in 0fe5043da (2016-06-17, dir_iterator: new API for iterating
over a directory tree), struct dir_iterator_int we have a member

    struct dir_iterator_int {
...
         /*
         * The number of levels currently on the stack. This is always
         * at least 1, because when it becomes zero the iteration is
         * ended and this struct is freed.
         */
         size_t levels_nr;
...
};

you could have written that comment as

    /* How many levels do we have to free? */

but that would be misleading the same way as here.

I think a comment should carry useful information that is not
obvious from the code. So in this comment we want to convey the
message that we need to count blank lines to apply a heuristic later
on. Maybe

  /* Number of blank lines in the sliding area of the group */

as that

* states the actual thing we do
* doesn't hint at one particular intended use case later on
* it assumes you know what a "sliding area" is though.

I think what triggered me on questioning this comment, was the fact
that it is a question as we rarely have comments stated as questions.

>
> Since the variable is used as a boolean, it seemed natural to document
> it by stating the question that the true/false value is the answer to.

Oh I see. Another example (that maybe looks constructed) is the comment
of S_IFGITLINK in cache.h which is not "Is this entry a submodule?" but
rather some sentence of what a git link actually is. (though very short)

>
> If you have a concrete suggestion for a better comment, please let me know.

I'd go with the imperative form,

 /* Number of blank lines in the sliding area of the group */

if that makes sense to you?

Sorry for the bike shedding and not focusing on the real issues,

Stefan

>
> Michael
>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
  2016-08-03 23:30         ` Michael Haggerty
@ 2016-08-04  0:04           ` Stefan Beller
  2016-08-10 19:12             ` Michael Haggerty
  0 siblings, 1 reply; 57+ messages in thread
From: Stefan Beller @ 2016-08-04  0:04 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: git@vger.kernel.org, Junio C Hamano, Jeff King,
	Jakub Narębski, Jacob Keller

On Wed, Aug 3, 2016 at 4:30 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:

> This is an important point for the optimization, but less so for the
> implementation of the heuristic here. I was dynamically optimizing the
> ten other variables, and everything that goes into the bonus includes
> one of those factors. If I had also let this factor of 10 vary, then the
> net behavior of the algorithm would be completely unchanged if I would,
> say, double all eleven parameters. This is bad for optimization, because
> (1) it increases the search space unnecessarily, and (2) it means that
> whole lines in the parameter space give identical behavior, making the
> algorithm waste time searching along those lines for a minimum.
>
>> So another way to write it
>> would be
>>
>>     score - bonus/10;
>>
>> assuming the values of score and bonus are large enough.
>
> Score is the number of columns that some line is indented, so it can be
> 0 or 1 or any other positive integer. It is not "large enough", which is
> why the "10" can't be "1".

Right, I should have made it more clear that it was a hypothetical rewrite,
just to point out we are looking at only one of score or bonus.

>> but if I look at the boni definitions ranging from -63..50 they are scaled up
>> so much that the bonus may become larger than '1' unit of 'score', i.e.
>> it is not an epsilon any more. Or to put it another way:
>> If we were to s/10/100/ the results would be worse.
>
> If you would change s/10/100/ and simultaneously multiply the other
> constants by 10, the end results would be unchanged.

Right, so maybe a good name would be CONSTANT_SCALE_OF_ONE_INDENT
as it has the meaning that a bonus of 10 is equivalent of "one indent"
in the weighting.

Speaking of which, do we want to "over-optimize" to make that constant a
power of 2 as that is a supposedly faster multiplication?
(Just asking, feel free to reject; as I can imagine the numbers themselves are
already magic, so why scale them with 42?^H^H^H 16?)

>
>> Rather the 10 describes the ratio of "advanced magic" to pure indentation
>> based scoring in my understanding.
>
> No, it's basically just a number against which the other constants are
> compared. E.g., if another bonus wants to balance out against exactly
> one space of indentation, its constant needs to be 10. If it wants to
> balance out against exactly 5 spaces, its constant needs to be 50. Etc.

So another interpretation is that the 10 gives the resolution for all other
constants, i.e. if we keep 10, then we can only give weights in 1/10 of
"one indent". But the "ideal" weight may not be a multiple of 1/10,
so we approximate them to the nearest multiple of 1/10.

If we were to use 1000 here, we could have a higher accuracy of the
other constants, but probably we do not care about the 3rd decimal place
for these because they are created heuristically from a corpus that may
not warrant a precision of constants with a 3rd decimal place.

Stefan

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
  2016-08-03 22:36     ` Michael Haggerty
@ 2016-08-04  4:47       ` Jacob Keller
  2016-08-04 19:39       ` Junio C Hamano
  1 sibling, 0 replies; 57+ messages in thread
From: Jacob Keller @ 2016-08-04  4:47 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: Git mailing list, Stefan Beller, Junio C Hamano, Jeff King,
	Jakub Narębski

On Wed, Aug 3, 2016 at 3:36 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
> On 08/04/2016 12:29 AM, Jacob Keller wrote:
>> On Wed, Aug 3, 2016 at 3:00 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>> It seems odd to me that a line with "199" spaces and nothing else will
>> return "-1" but a line with 200 spaces and nothing else will return
>> 200..? Would it be safe to just return -1 in both cases (if a line is
>> all spaces or starts with more than 200 spaces just return -1)?
>>
>>> +}
>>> +
>
> Thanks for your feedback.
>
> I was implicitly assuming that such lines would have text somewhere
> after those 200 spaces (or 25 TABs or whatever). But you're right, the
> line could consist only of whitespace. Unfortunately, the only way to
> distinguish these two cases is to read the rest of the line, which is
> exactly what we *don't* want to do.
>
> But I think it doesn't matter anyway. Such "text" will likely never be
> read by a human, so it's not a big deal if the slider position is not
> picked perfectly. And remember, this whole saga is just to improve the
> aesthetics of the diff. The diff is *correct* (e.g., in the sense of
> applicable) regardless of where we position the sliders.
>
> Michael
>

I think in this case treating it as "all whitespace" is more natural
than treating it as "200 characters with something following it"
because the only thing we've found so far is all white space.

Either way it's not really a big deal here.
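
For readers following along without the patch in front of them, here is
a minimal sketch of the behavior under discussion. It is a reconstruction
from this thread, not the patch's actual get_indent(): the name
sketch_get_indent() is hypothetical, and it assumes that a tab advances
to the next multiple of 8 columns (consistent with "25 TABs" reaching
200).

```c
#include <stddef.h>

#define MAX_INDENT 200

/*
 * Sketch only, not the patch's get_indent(). Returns the indentation of
 * a line in columns, MAX_INDENT as soon as the indentation reaches that
 * cap, or -1 if the whole line was read and held only whitespace.
 */
static int sketch_get_indent(const char *line, size_t len)
{
	size_t i;
	int indent = 0;

	for (i = 0; i < len; i++) {
		char c = line[i];

		if (c == ' ')
			indent += 1;
		else if (c == '\t')
			indent += 8 - indent % 8;
		else if (c != '\n' && c != '\r')
			return indent;	/* first non-whitespace character */

		/* Stop reading once the cap is reached. */
		if (indent >= MAX_INDENT)
			return MAX_INDENT;
	}
	return -1;	/* only whitespace was seen */
}
```

A line of 199 spaces yields -1 while a line of 200 spaces yields 200,
because the function stops at the cap without looking at the rest of
the line. The alternative suggested above would amount to returning -1
instead of MAX_INDENT at the cap.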

Thanks,
Jake

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 1/8] xdl_change_compact(): rename some local variables for clarity
  2016-08-03 22:00 ` [PATCH 1/8] xdl_change_compact(): rename some local variables for clarity Michael Haggerty
@ 2016-08-04  7:06   ` Jeff King
  2016-08-04 18:24     ` Junio C Hamano
  2016-08-13 19:38     ` Michael Haggerty
  0 siblings, 2 replies; 57+ messages in thread
From: Jeff King @ 2016-08-04  7:06 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: git, Stefan Beller, Junio C Hamano, Jakub Narębski,
	Jacob Keller

On Thu, Aug 04, 2016 at 12:00:29AM +0200, Michael Haggerty wrote:

> * ix -> i
> * ixo -> io
> * ixs -> start
> * grpsiz -> groupsize

After your change, I immediately understand three of them. But what is
"io"?

-Peff

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 2/8] xdl_change_compact(): clarify code
  2016-08-03 23:50       ` Stefan Beller
@ 2016-08-04  7:13         ` Jeff King
  2016-08-10 16:39         ` Michael Haggerty
  1 sibling, 0 replies; 57+ messages in thread
From: Jeff King @ 2016-08-04  7:13 UTC (permalink / raw)
  To: Stefan Beller
  Cc: Michael Haggerty, git@vger.kernel.org, Junio C Hamano,
	Jakub Narębski, Jacob Keller

On Wed, Aug 03, 2016 at 04:50:46PM -0700, Stefan Beller wrote:

> I was not asking for undoing these, but giving short cryptic answers myself. ;)
> While I agree the variable names are way better than before, the use of while
> instead of for (and then doing another final ++ after the loop) extended some
> one liners to about 5. I am totally fine with that as they are easier
> to read for me as I understand them as Git style, hence easier to read.

One thing I try to do with loops is to use "for" loops only when I truly
want an iteration from point A to point B. If I care about the value of
the iterator _after_ the loop, I prefer a "while" loop.

Not everybody necessarily has the same taste, but I assume Michael does,
since that's what's happening in this hunk:

> -               start = i;
> -               for (i++; rchg[i]; i++);
> -               for (; rchgo[io]; io++);
> +               start = i++;
> +
> +               while (rchg[i])
> +                       i++;
> +
> +               while (rchgo[io])
> +                      io++;
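
As a standalone illustration of that preference, here is a sketch in
which the value of the index after the loop is the whole point. The
helper name end_of_run() is hypothetical, and it assumes the flags array
has a zero entry at or after the starting index.

```c
/*
 * Hypothetical helper, for illustration only. Advance past a run of
 * nonzero flags and return the index just past the run. The caller
 * must guarantee a zero entry at or after `start`.
 */
static long end_of_run(const char *flags, long start)
{
	long i = start;

	/* We care about i after the loop, hence "while" and not "for". */
	while (flags[i])
		i++;

	return i;
}
```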

-Peff

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 3/8] xdl_change_compact(): rename i to end
  2016-08-03 22:00 ` [PATCH 3/8] xdl_change_compact(): rename i to end Michael Haggerty
@ 2016-08-04  7:16   ` Jeff King
  0 siblings, 0 replies; 57+ messages in thread
From: Jeff King @ 2016-08-04  7:16 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: git, Stefan Beller, Junio C Hamano, Jakub Narębski,
	Jacob Keller

On Thu, Aug 04, 2016 at 12:00:31AM +0200, Michael Haggerty wrote:

> Rename i to end, and alternate between using start and end as the
> indexing variable as appropriate.
> 
> Rename ixref to end_matching_other.
> 
> Add some more comments.

I'd usually complain that there is too much "what" in your commit
message, but in this case, the diff really is hard to read. Having a
summary up front is nice.

There's no "why", but I imagine it is just "I had to do this to even
make sense of this function".

-Peff

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 5/8] xdl_change_compact(): fix compaction heuristic to adjust io
  2016-08-03 22:00 ` [PATCH 5/8] xdl_change_compact(): fix compaction heuristic to adjust io Michael Haggerty
@ 2016-08-04  7:27   ` Jeff King
  2016-08-10 16:58     ` Michael Haggerty
  2016-08-04 18:43   ` Junio C Hamano
  1 sibling, 1 reply; 57+ messages in thread
From: Jeff King @ 2016-08-04  7:27 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: git, Stefan Beller, Junio C Hamano, Jakub Narębski,
	Jacob Keller

On Thu, Aug 04, 2016 at 12:00:33AM +0200, Michael Haggerty wrote:

> The code branch used for the compaction heuristic incorrectly forgot to
> keep io in sync while the group was shifted. I think that could have
> led to reading past the end of the rchgo array.
> 
> Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
> ---
> I didn't actually try to verify the presence of a bug, because it
> seems like more work than worthwhile. But here is my reasoning:
> 
> If io is not decremented correctly during one iteration of the outer
> `while` loop, then it will lose sync with the `end` counter. In
> particular it will be too large.
> 
> Suppose that the next iterations of the outer `while` loop (i.e.,
> processing the next block of add/delete lines) don't have any sliders.
> Then the `io` counter would be incremented by the number of
> non-changed lines in xdf, which is the same as the number of
> non-changed lines in xdfo that *should have* followed the group that
> experienced the malfunction. But since `io` was too large at the end
> of that iteration, it will be incremented past the end of the
> xdfo->rchg array, and will try to read that memory illegally.

Hmm. In the loop:

  while (rchgo[io])
	io++;

that implies that rchgo has a zero-marker that we can rely on hitting.
And it looks like rchgo[io] always ends the loop on a 0. So it seems
like we would just hit that condition again.

That doesn't make it _right_, but I'm not sure I see how it would walk
off the end of the array.  But I'm very sure I don't understand this
code completely, so I may be missing something.

Anyway, I'd suggest putting your cover letter bits into the commit
message. Even though they are all suppositions, they are the kind of
thing that could really help somebody debugging this in 2 years, and are
better than nothing.

-Peff

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 0/8] Better heuristics make prettier diffs
  2016-08-03 22:00 [PATCH 0/8] Better heuristics make prettier diffs Michael Haggerty
                   ` (8 preceding siblings ...)
  2016-08-03 22:08 ` [PATCH 0/8] Better heuristics make prettier diffs Michael Haggerty
@ 2016-08-04  7:38 ` Jeff King
  2016-08-04 19:54   ` Junio C Hamano
  9 siblings, 1 reply; 57+ messages in thread
From: Jeff King @ 2016-08-04  7:38 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: git, Stefan Beller, Junio C Hamano, Jakub Narębski,
	Jacob Keller

On Thu, Aug 04, 2016 at 12:00:28AM +0200, Michael Haggerty wrote:

> I've talked about this quite a bit on the list already. The idea is to
> improve ugly diffs like
> 
>     @@ -231,6 +231,9 @@ if (!defined $initial_reply_to && $prompting) {
>      }
> 
>      if (!$smtp_server) {
>     +       $smtp_server = $repo->config('sendemail.smtpserver');
>     +}
>     +if (!$smtp_server) {
>             foreach (qw( /usr/sbin/sendmail /usr/lib/sendmail )) {
>                     if (-x $_) {
>                             $smtp_server = $_;

Not that you probably need more random cases of C code, but I happened
to be looking at a diff in git.git today, b333d0d6, which is another
regression for the compaction heuristic. The indent heuristic here gets
it right.

Coincidentally, another example is the final patch in this series.

So I am already happier even without digging further yet. :)

-Peff

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
  2016-08-03 22:00 ` [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs Michael Haggerty
  2016-08-03 22:29   ` Jacob Keller
  2016-08-03 22:30   ` Stefan Beller
@ 2016-08-04  7:56   ` Jeff King
  2016-08-04 16:55     ` Stefan Beller
  2016-08-12 23:25     ` Michael Haggerty
  2016-08-04 19:52   ` Junio C Hamano
  3 siblings, 2 replies; 57+ messages in thread
From: Jeff King @ 2016-08-04  7:56 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: git, Stefan Beller, Junio C Hamano, Jakub Narębski,
	Jacob Keller

On Thu, Aug 04, 2016 at 12:00:36AM +0200, Michael Haggerty wrote:

> This table shows the number of diff slider groups that were positioned
> differently than the human-generated values, for various repositories.
> "default" is the default "git diff" algorithm. "compaction" is Git 2.9.0
> with the `--compaction-heuristic` option "indent" is an earlier,

s/option/&./

>  static int diff_detect_rename_default;
> +static int diff_indent_heuristic; /* experimental */
>  static int diff_compaction_heuristic; /* experimental */

These two flags are mutually exclusive in the xdiff code, so we should
probably handle that here.

TBH, I do not care that much what:

  [diff]
  compactionHeuristic = true
  indentHeuristic = true

does. But right now:

  git config diff.compactionHeuristic true
  git show --indent-heuristic

still prefers the compaction heuristic, which I think is objectively
wrong.

So perhaps we need a single variable:

  enum {
    DIFF_HEURISTIC_COMPACTION,
    DIFF_HEURISTIC_INDENT
  } diff_heuristic;

and set it in last-one-wins fashion (it would be nice if the config and
command line options were shaped the same way so it's clear to the user
that they are exclusive, but we may have to keep --compaction-heuristic
around for compatibility, as an alias for --diff-heuristic=compaction).
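
A rough sketch of what last-one-wins could look like. Everything here is
hypothetical: the function apply_heuristic_option() is invented for the
example, and DIFF_HEURISTIC_NONE is an added assumption so that "no
heuristic" has a value. It is not Git's actual option parser.

```c
#include <string.h>

/* Hypothetical names throughout; a sketch, not Git's option parser. */
enum diff_heuristic {
	DIFF_HEURISTIC_NONE,		/* assumed default, not in the proposal */
	DIFF_HEURISTIC_COMPACTION,
	DIFF_HEURISTIC_INDENT
};

/*
 * Apply one command-line argument to *h. Returns 1 if the argument was
 * recognized and 0 otherwise. Each recognized option overwrites *h, so
 * whichever option comes last wins, whether the earlier value came from
 * the config or from an earlier argument.
 */
static int apply_heuristic_option(const char *arg, enum diff_heuristic *h)
{
	if (!strcmp(arg, "--indent-heuristic"))
		*h = DIFF_HEURISTIC_INDENT;
	else if (!strcmp(arg, "--compaction-heuristic"))
		*h = DIFF_HEURISTIC_COMPACTION;
	else
		return 0;
	return 1;
}
```

If the variable is first set to DIFF_HEURISTIC_COMPACTION from the
config and "--indent-heuristic" is then seen on the command line, the
result is DIFF_HEURISTIC_INDENT, which is the behavior asked for above.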

> diff --git a/git-add--interactive.perl b/git-add--interactive.perl
> index 642cce1..ee3d812 100755
> --- a/git-add--interactive.perl
> +++ b/git-add--interactive.perl
> @@ -45,6 +45,7 @@ my ($diff_new_color) =
>  my $normal_color = $repo->get_color("", "reset");
>  
>  my $diff_algorithm = $repo->config('diff.algorithm');
> +my $diff_indent_heuristic = $repo->config_bool('diff.indentheuristic');
>  my $diff_compaction_heuristic = $repo->config_bool('diff.compactionheuristic');

Nice touch.

Unfortunately the mutual-exclusivity handling will probably bleed over
to here, too.

> +/*
> + * If a line is indented more than this, get_indent() just returns this value.
> + * This avoids having to do absurd amounts of work for data that are not
> + * human-readable text, and also ensures that the output of get_indent fits within
> + * an int.
> + */
> +#define MAX_INDENT 200

Speaking of absurd amounts of work, I was curious if there was a
noticeable performance penalty for using this heuristic (just because
it's a lot more complicated than the others). I couldn't detect any
differences running "git log -p --no-merges -3000" on git.git with no
heuristic, compaction, and indent. There may be other repositories that
behave more pathologically (it looks like having 20 blank lines at the
end of each hunk?), but I'd guess in most cases this will always be
drowned out in the noise of doing the actual diff.

> +#define START_OF_FILE_BONUS 9
> +#define END_OF_FILE_BONUS 46
> +#define TOTAL_BLANK_WEIGHT 4
> +#define PRE_BLANK_WEIGHT 16
> +#define RELATIVE_INDENT_BONUS -1
> +#define RELATIVE_INDENT_HAS_BLANK_BONUS 15
> +#define RELATIVE_OUTDENT_BONUS -19
> +#define RELATIVE_OUTDENT_HAS_BLANK_BONUS 2
> +#define RELATIVE_DEDENT_BONUS -63
> +#define RELATIVE_DEDENT_HAS_BLANK_BONUS 50

I see there is a comment below here mentioning that these are empirical
voodoo, but it might be worth one at the top (or just moving these below
the comment) because the comment looks like it's just associated with
the function (and these are sufficiently bizarre that anybody reading is
going to double-take on them).

> +        return 10 * score - bonus;

I don't mind this "10" not being a #define constant, but after
reading the exchange between you and Stefan, I think it would be nice to
describe what it is in a comment. The rest of the function is commented
so nicely that this one left me thinking "huh?" upon seeing the "10".

The rest looks sane to me, though I am not sure I have absorbed all the
implications. IMHO the most interesting thing is the actual results,
though.

-Peff

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
  2016-08-04  7:56   ` Jeff King
@ 2016-08-04 16:55     ` Stefan Beller
  2016-08-04 19:47       ` Junio C Hamano
  2016-08-13  0:09       ` Michael Haggerty
  2016-08-12 23:25     ` Michael Haggerty
  1 sibling, 2 replies; 57+ messages in thread
From: Stefan Beller @ 2016-08-04 16:55 UTC (permalink / raw)
  To: Jeff King
  Cc: Michael Haggerty, git@vger.kernel.org, Junio C Hamano,
	Jakub Narębski, Jacob Keller

On Thu, Aug 4, 2016 at 12:56 AM, Jeff King <peff@peff.net> wrote:
> On Thu, Aug 04, 2016 at 12:00:36AM +0200, Michael Haggerty wrote:
>
>> This table shows the number of diff slider groups that were positioned
>> differently than the human-generated values, for various repositories.
>> "default" is the default "git diff" algorithm. "compaction" is Git 2.9.0
>> with the `--compaction-heuristic` option "indent" is an earlier,
>
> s/option/&./
>
>>  static int diff_detect_rename_default;
>> +static int diff_indent_heuristic; /* experimental */
>>  static int diff_compaction_heuristic; /* experimental */
>
> These two flags are mutually exclusive in the xdiff code, so we should
> probably handle that here.
>
> TBH, I do not care that much what:
>
>   [diff]
>   compactionHeuristic = true
>   indentHeuristic = true
>
> does. But right now:
>
>   git config diff.compactionHeuristic true
>   git show --indent-heuristic
>
> still prefers the compaction heuristic, which I think is objectively
> wrong.
>
> So perhaps we need a single variable:
>
>   enum {
>     DIFF_HEURISTIC_COMPACTION,
>     DIFF_HEURISTIC_INDENT
>   } diff_heuristic;
>
> and set it in last-one-wins fashion (it would be nice if the config and
> command line options were shaped the same way so it's clear to the user
> that they are exclusive, but we may have to keep --compaction-heuristic
> around for compatibility, as an alias for --diff-heuristic=compaction).
>
>> diff --git a/git-add--interactive.perl b/git-add--interactive.perl
>> index 642cce1..ee3d812 100755
>> --- a/git-add--interactive.perl
>> +++ b/git-add--interactive.perl
>> @@ -45,6 +45,7 @@ my ($diff_new_color) =
>>  my $normal_color = $repo->get_color("", "reset");
>>
>>  my $diff_algorithm = $repo->config('diff.algorithm');
>> +my $diff_indent_heuristic = $repo->config_bool('diff.indentheuristic');
>>  my $diff_compaction_heuristic = $repo->config_bool('diff.compactionheuristic');
>
> Nice touch.
>
> Unfortunately the mutual-exclusivity handling will probably bleed over
> to here, too.
>
>> +/*
>> + * If a line is indented more than this, get_indent() just returns this value.
>> + * This avoids having to do absurd amounts of work for data that are not
>> + * human-readable text, and also ensures that the output of get_indent fits within
>> + * an int.
>> + */
>> +#define MAX_INDENT 200
>
> Speaking of absurd amounts of work, I was curious if there was a
> noticeable performance penalty for using this heuristic (just because
> it's a lot more complicated than the others). I couldn't detect any
> differences running "git log -p --no-merges -3000" on git.git with no
> heuristic, compaction, and indent. There may be other repositories that
> behave more pathologically (it looks like having 20 blank lines at the
> end of each hunk?), but I'd guess in most cases this will always be
> drowned out in the noise of doing the actual diff.
>
>> +#define START_OF_FILE_BONUS 9
>> +#define END_OF_FILE_BONUS 46
>> +#define TOTAL_BLANK_WEIGHT 4
>> +#define PRE_BLANK_WEIGHT 16
>> +#define RELATIVE_INDENT_BONUS -1
>> +#define RELATIVE_INDENT_HAS_BLANK_BONUS 15
>> +#define RELATIVE_OUTDENT_BONUS -19
>> +#define RELATIVE_OUTDENT_HAS_BLANK_BONUS 2
>> +#define RELATIVE_DEDENT_BONUS -63
>> +#define RELATIVE_DEDENT_HAS_BLANK_BONUS 50
>
> I see there is a comment below here mentioning that these are empirical
> voodoo, but it might be worth one at the top (or just moving these below
> the comment) because the comment looks like it's just associated with
> the function (and these are sufficiently bizarre that anybody reading is
> going to double-take on them).
>
>> +        return 10 * score - bonus;
>
> I don't mind this "10" not being a #define constant, but after
> reading the exchange between you and Stefan, I think it would be nice to
> describe what it is in a comment. The rest of the function is commented
> so nicely that this one left me thinking "huh?" upon seeing the "10".

After a night of sleep I agree with Peff's statement here: it's not about
the #define, it's about the comment (which the #define would have given in
a short, cryptic way in angry capital letters).

I have just reread the scoring function and I think you could pull out the
`score=indent` assignment (it is always assigned except for indent <0)

        if (indent == -1)
               score = 0;
        else
               score = indent;
        ... lots of bonus computation below, which in its current implementation
        have lots of "score = indent;" lines as well.

Thanks,
Stefan


* Re: [PATCH 1/8] xdl_change_compact(): rename some local variables for clarity
  2016-08-04  7:06   ` Jeff King
@ 2016-08-04 18:24     ` Junio C Hamano
  2016-08-13 19:38     ` Michael Haggerty
  1 sibling, 0 replies; 57+ messages in thread
From: Junio C Hamano @ 2016-08-04 18:24 UTC (permalink / raw)
  To: Jeff King
  Cc: Michael Haggerty, git, Stefan Beller, Jakub Narębski,
	Jacob Keller

Jeff King <peff@peff.net> writes:

> On Thu, Aug 04, 2016 at 12:00:29AM +0200, Michael Haggerty wrote:
>
>> * ix -> i
>> * ixo -> io
>> * ixs -> start
>> * grpsiz -> groupsize
>
> After your change, I immediately understand three of them. But what is
> "io"?

I had the same reaction.



* Re: [PATCH 5/8] xdl_change_compact(): fix compaction heuristic to adjust io
  2016-08-03 22:00 ` [PATCH 5/8] xdl_change_compact(): fix compaction heuristic to adjust io Michael Haggerty
  2016-08-04  7:27   ` Jeff King
@ 2016-08-04 18:43   ` Junio C Hamano
  2016-08-10 17:13     ` Michael Haggerty
  1 sibling, 1 reply; 57+ messages in thread
From: Junio C Hamano @ 2016-08-04 18:43 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: git, Stefan Beller, Jeff King, Jakub Narębski, Jacob Keller

Michael Haggerty <mhagger@alum.mit.edu> writes:

> The code branch used for the compaction heuristic incorrectly forgot to
> keep io in sync while the group was shifted. I think that could have
> led to reading past the end of the rchgo array.

I had to read the first sentence three times, as "incorrectly forgot"
was a bit of a strange thing to say (as if there were a situation where
'forgetting to do' is the correct thing to do, but in that case we
would phrase it to stress that not doing is a deliberate choice,
e.g. 'refraining from doing').  Perhaps s/incorrectly // is the
simplest readability improvement?

> Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
> ---
> I didn't actually try to verify the presence of a bug, because it
> seems like more work than worthwhile. But here is my reasoning:
>
> If io is not decremented correctly during one iteration of the outer
> `while` loop, then it will lose sync with the `end` counter. In
> particular it will be too large.
>
> Suppose that the next iterations of the outer `while` loop (i.e.,
> processing the next block of add/delete lines) don't have any sliders.
> Then the `io` counter would be incremented by the number of
> non-changed lines in xdf, which is the same as the number of
> non-changed lines in xdfo that *should have* followed the group that
> experienced the malfunction. But since `io` was too large at the end
> of that iteration, it will be incremented past the end of the
> xdfo->rchg array, and will try to read that memory illegally.

I agree with Peff that these should be in the log message.


* Re: [PATCH 6/8] xdl_change_compact(): keep track of the earliest end
  2016-08-03 22:00 ` [PATCH 6/8] xdl_change_compact(): keep track of the earliest end Michael Haggerty
@ 2016-08-04 18:46   ` Junio C Hamano
  2016-08-10 17:16     ` Michael Haggerty
  0 siblings, 1 reply; 57+ messages in thread
From: Junio C Hamano @ 2016-08-04 18:46 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: git, Stefan Beller, Jeff King, Jakub Narębski, Jacob Keller

Michael Haggerty <mhagger@alum.mit.edu> writes:

> This makes it easier to detect whether shifting is possible, and will
> also make the next change easier.

I can see the code keeping track of earliest_end but the above does
not make it clear what the new "continue" is about.

... easier to detect whether shifting is possible (in which case we
can skip the shifting), and will also make ...

perhaps.

> Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
> ---
>  xdiff/xdiffi.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
> index 66129db..34f021a 100644
> --- a/xdiff/xdiffi.c
> +++ b/xdiff/xdiffi.c
> @@ -414,7 +414,8 @@ static int recs_match(xrecord_t **recs, long ixs, long ix, long flags)
>  }
>  
>  int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
> -	long start, end, io, end_matching_other, groupsize, nrec = xdf->nrec;
> +	long start, end, earliest_end, end_matching_other;
> +	long io, groupsize, nrec = xdf->nrec;
>  	char *rchg = xdf->rchg, *rchgo = xdfo->rchg;
>  	unsigned int blank_lines;
>  	xrecord_t **recs = xdf->recs;
> @@ -516,6 +517,8 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
>  				end_matching_other = -1;
>  			}
>  
> +			earliest_end = end;
> +
>  			/*
>  			 * Now shift the group forward as long as the first line
>  			 * of the current change group is equal to the line after
> @@ -547,6 +550,9 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
>  			}
>  		} while (groupsize != end - start);
>  
> +		if (end == earliest_end)
> +			continue; /* no shifting is possible */
> +
>  		if ((flags & XDF_COMPACTION_HEURISTIC) && blank_lines) {
>  			/*
>  			 * Compaction heuristic: if a group can be moved back and


* Re: [PATCH 7/8] is_blank_line: take a single xrecord_t as argument
  2016-08-03 22:00 ` [PATCH 7/8] is_blank_line: take a single xrecord_t as argument Michael Haggerty
@ 2016-08-04 18:48   ` Junio C Hamano
  0 siblings, 0 replies; 57+ messages in thread
From: Junio C Hamano @ 2016-08-04 18:48 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: git, Stefan Beller, Jeff King, Jakub Narębski, Jacob Keller

Michael Haggerty <mhagger@alum.mit.edu> writes:

> There is no reason for it to take an array and index as argument, as it
> only accesses a single element of the array.

Yup, I think I am partly guilty.  The result looks much nicer.


* Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
  2016-08-03 22:36     ` Michael Haggerty
  2016-08-04  4:47       ` Jacob Keller
@ 2016-08-04 19:39       ` Junio C Hamano
  2016-08-10 19:01         ` Michael Haggerty
  1 sibling, 1 reply; 57+ messages in thread
From: Junio C Hamano @ 2016-08-04 19:39 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: Jacob Keller, Git mailing list, Stefan Beller, Jeff King,
	Jakub Narębski

Michael Haggerty <mhagger@alum.mit.edu> writes:

>>> +       }
>>> +       /*
>>> +        * We have reached the end of the line without finding any non-space
>>> +        * characters; i.e., the whole line consists of trailing spaces, which we
>>> +        * are not interested in.
>>> +        */
>>> +       return -1;

Not related to Jacob's review, but "the whole line consists of
trailing spaces" made me read it twice; while it is technically
correct, "the whole line consists of spaces", or even "this is a
blank line", would read a lot more easily, at least for me.

> I was implicitly assuming that such lines would have text somewhere
> after those 200 spaces (or 25 TABs or whatever). But you're right, the
> line could consist only of whitespace. Unfortunately, the only way to
> distinguish these two cases is to read the rest of the line, which is
> exactly what we *don't* want to do.

Hmm, why is it exactly what we don't want to do?  Is it a
performance concern?  In other words, is it because this function is
called many times to measure the same line multiple times?  After
all, somebody in this file is already scanning each and every line
to see where it ends to split the input into records, so perhaps a
"right" (if the "theoretical correctness" of the return value from
this function mattered, which you wave-away below) optimization
could be to precompute it while the lines are broken into records
and store it in the "rec" structure?

> But I think it doesn't matter anyway. Such "text" will likely never be
> read by a human, so it's not a big deal if the slider position is not
> picked perfectly. And remember, this whole saga is just to improve the
> aesthetics of the diff. The diff is *correct* (e.g., in the sense of
> applicable) regardless of where we position the sliders.

A better argument may be "if the user is truly reading a diff output
for such an unusual "text", it is likely that she has a very wide
display and/or is running less -S, and treating such an overindented
line as if it were a blank line would give a result that is more
consistent with what appears on her display", perhaps?
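
The capped measurement being debated can be sketched as follows. This
is a minimal sketch, not the patch's code: toy_get_indent() and its
(pointer, length) signature are hypothetical, and the 8-column tab stop
is an assumption inferred from the "200 spaces (or 25 TABs)" remark.
Only MAX_INDENT and the -1 return for a whitespace-only line are taken
from the quoted patch.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_INDENT 200 /* value from the quoted patch */

/*
 * Hypothetical stand-in for get_indent(): return the indentation of
 * the line in columns, MAX_INDENT if the line is indented at least
 * that far, or -1 if the whole line is whitespace.
 */
static int toy_get_indent(const char *line, size_t len)
{
	size_t i;
	int indent = 0;

	assert(line != NULL || len == 0);

	for (i = 0; i < len; i++) {
		char c = line[i];

		if (c == ' ')
			indent += 1;
		else if (c == '\t')
			indent += 8 - indent % 8; /* assumed 8-column tab stops */
		else if (c != '\n' && c != '\r' && c != '\f' && c != '\v')
			return indent; /* first non-space character */

		/* Stop early: the rest of the line is never read. */
		if (indent >= MAX_INDENT)
			return MAX_INDENT;
	}

	return -1; /* only whitespace was seen */
}
```

In this sketch a line made of 25 TABs and nothing else returns 200
rather than -1, which is exactly the ambiguity discussed above: the
function cannot tell it apart from a deeply indented line of text
without reading further.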


* Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
  2016-08-04 16:55     ` Stefan Beller
@ 2016-08-04 19:47       ` Junio C Hamano
  2016-08-13  0:09       ` Michael Haggerty
  1 sibling, 0 replies; 57+ messages in thread
From: Junio C Hamano @ 2016-08-04 19:47 UTC (permalink / raw)
  To: Stefan Beller
  Cc: Jeff King, Michael Haggerty, git@vger.kernel.org,
	Jakub Narębski, Jacob Keller

Stefan Beller <sbeller@google.com> writes:

> I have just reread the scoring function and I think you could pull out the
> `score=indent` assignment (it is always assigned except for indent <0)
>
>         if (indent == -1)
>                score = 0;
>         else
>                score = indent;
>         ... lots of bonus computation below, which in its current implementation
>         have lots of "score = indent;" lines as well.

Yup.  If each part in this if/else if/... cascade independently sets
complete information (i.e. both "bonus" and "score") necessary for
the final result, then I do not mind the same "score = indent" in
many of them (these cases happen to get the same score), but that is
not what we have in this code (i.e. "bonus" has a shared component
that is not affected by this if/else if/... cascade), so setting score
to indent upfront and resetting it to 0 only on a blank line would
make sense.
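
A minimal sketch of that shape, assuming a much-reduced scoring
function: struct toy_measurement and toy_score() are hypothetical
names, and only the one indent branch quoted in this thread is
reproduced. The constants and the final "10 * score - bonus" are
taken from the quoted patch.

```c
#include <assert.h>
#include <stddef.h>

/* Constants as quoted from the patch. */
#define TOTAL_BLANK_WEIGHT 4
#define PRE_BLANK_WEIGHT 16
#define RELATIVE_INDENT_BONUS -1
#define RELATIVE_INDENT_HAS_BLANK_BONUS 15

/* Hypothetical, reduced measurement record (not the patch's struct). */
struct toy_measurement {
	int indent;       /* -1 means "blank line" */
	int pre_indent;   /* indent of the preceding non-blank line */
	int pre_blank;    /* blank lines directly before the split */
	int total_blanks; /* blank lines around the split */
};

static int toy_score(const struct toy_measurement *m)
{
	int bonus = 0;
	int any_blanks;
	int score;

	assert(m != NULL);

	/* Set the score once, up front; only a blank line resets it. */
	score = m->indent;
	if (m->indent == -1)
		score = 0;

	/* Shared component of the bonus, independent of the cascade. */
	any_blanks = m->total_blanks != 0;
	bonus += TOTAL_BLANK_WEIGHT * m->total_blanks;
	bonus += PRE_BLANK_WEIGHT * m->pre_blank;

	/* The cascade now only adjusts the bonus. */
	if (m->indent > m->pre_indent) {
		bonus += RELATIVE_INDENT_BONUS;
		bonus += RELATIVE_INDENT_HAS_BLANK_BONUS * any_blanks;
	}

	return 10 * score - bonus;
}
```

With the assignment hoisted, each branch of the cascade carries only
what differs between cases, so the repeated "score = indent" lines
disappear.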


* Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
  2016-08-03 22:00 ` [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs Michael Haggerty
                     ` (2 preceding siblings ...)
  2016-08-04  7:56   ` Jeff King
@ 2016-08-04 19:52   ` Junio C Hamano
  2016-08-13  0:11     ` Michael Haggerty
  3 siblings, 1 reply; 57+ messages in thread
From: Junio C Hamano @ 2016-08-04 19:52 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: git, Stefan Beller, Jeff King, Jakub Narębski, Jacob Keller

Michael Haggerty <mhagger@alum.mit.edu> writes:

I agree with Peff about "comment on the voodoo upfront".

> +#define START_OF_FILE_BONUS 9
> +#define END_OF_FILE_BONUS 46
> +#define TOTAL_BLANK_WEIGHT 4
> +#define PRE_BLANK_WEIGHT 16
> +#define RELATIVE_INDENT_BONUS -1
> +#define RELATIVE_INDENT_HAS_BLANK_BONUS 15
> +#define RELATIVE_OUTDENT_BONUS -19
> +#define RELATIVE_OUTDENT_HAS_BLANK_BONUS 2

When I read up to here, I thought "Heh, isn't the opposite of INdent
DEdent?" and then saw this:

> +#define RELATIVE_DEDENT_BONUS -63
> +#define RELATIVE_DEDENT_HAS_BLANK_BONUS 50

It turns out that you mean by OUTdent a line that indents further
(if I am reading the code correctly).  Is that obvious to everybody?

> +	/* Bonuses based on the location of blank lines: */
> +        bonus += TOTAL_BLANK_WEIGHT * total_blanks;
> +	bonus += PRE_BLANK_WEIGHT * m->pre_blank;

This and ...

> +        } else if (indent > m->pre_indent) {
> +		/*
> +		 * The line is indented more than its predecessor. Score it based
> +		 * on the larger indent:
> +		 */
> +		score = indent;
> +		bonus += RELATIVE_INDENT_BONUS;
> +		bonus += RELATIVE_INDENT_HAS_BLANK_BONUS * any_blanks;
> +	} else if (indent < m->pre_indent) {

... this seems to be indented correctly even after getting quoted,
which in turn means most of the lines in the added code share
indent-with-non-tab badness.


* Re: [PATCH 0/8] Better heuristics make prettier diffs
  2016-08-04  7:38 ` Jeff King
@ 2016-08-04 19:54   ` Junio C Hamano
  2016-08-04 20:01     ` Jeff King
  0 siblings, 1 reply; 57+ messages in thread
From: Junio C Hamano @ 2016-08-04 19:54 UTC (permalink / raw)
  To: Jeff King
  Cc: Michael Haggerty, git, Stefan Beller, Jakub Narębski,
	Jacob Keller

Jeff King <peff@peff.net> writes:

> Not that you probably need more random cases of C code, but I happened
> to be looking at a diff in git.git today, b333d0d6, which is another
> regression for the compaction heuristic.

Wow, that one is _really_ bad.  Does it have something to do with
the removal being at the very end of the file?

> The indent heuristic here gets it right.

Looks that way.


* Re: [PATCH 0/8] Better heuristics make prettier diffs
  2016-08-04 19:54   ` Junio C Hamano
@ 2016-08-04 20:01     ` Jeff King
  0 siblings, 0 replies; 57+ messages in thread
From: Jeff King @ 2016-08-04 20:01 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Michael Haggerty, git, Stefan Beller, Jakub Narębski,
	Jacob Keller

On Thu, Aug 04, 2016 at 12:54:51PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > Not that you probably need more random cases of C code, but I happened
> > to be looking at a diff in git.git today, b333d0d6, which is another
> > regression for the compaction heuristic.
> 
> Wow, that one is _really_ bad.  Does it have something to do with
> the removal being at the very end of the file?

I think so. If it were:

  func1() {
     ... unique stuff ...
     ... shared ending ...
  }

  func2() {
     ... more unique stuff ...
     ... shared ending ...
  }

  unrelated_func() {
  }

and we dropped func2, then I think the blank line between func2() and
unrelated_func() would cause the compaction heuristic to stop shifting.

OTOH, if it were:

  func2() {
     ...
  }
  unrelated_func() {
  }

with no blank line in between, you get a similar badly-shifted diff (which is not
surprising, as we were given no syntactic hint that "func2" is a
separate unit from "unrelated_func").

-Peff


* Re: [PATCH 2/8] xdl_change_compact(): clarify code
  2016-08-03 23:50       ` Stefan Beller
  2016-08-04  7:13         ` Jeff King
@ 2016-08-10 16:39         ` Michael Haggerty
  2016-08-10 16:58           ` Stefan Beller
  1 sibling, 1 reply; 57+ messages in thread
From: Michael Haggerty @ 2016-08-10 16:39 UTC (permalink / raw)
  To: Stefan Beller
  Cc: git@vger.kernel.org, Junio C Hamano, Jeff King,
	Jakub Narębski, Jacob Keller

On 08/04/2016 01:50 AM, Stefan Beller wrote:
> On Wed, Aug 3, 2016 at 4:14 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>> On 08/04/2016 12:11 AM, Stefan Beller wrote:
>>> On Wed, Aug 3, 2016 at 3:00 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>>>> [...]
>>>> +
>>>> +                       /*
>>>> +                        * Are there any blank lines that could appear as the last
>>>> +                        * line of this group?
>>>> +                        */
>>>
>>> IIRC this comment is not quite correct as this 'only' counts the number of
>>> blank lines within the forward shifting section, i.e. in the movable space.
>>>
>>> Later we use it as a boolean indicator (whether or not it is equal to 0)
>>> to see if we can do better.
>>> [...]

Thanks for your comments, Stefan.

I realized that the main thing that took me a while to grok when I was
reading this code was that blank_lines was really only used as a boolean
value, even though it was updated with "+=". That's the main information
that I'd like to convey to the reader.

So I decided to change the comment to emphasize this fact (and change it
from a question to a statement), and also changed the place that
blank_lines is updated to treat it more like a boolean. The latter
change also has the advantage of not calling is_blank_line()
unnecessarily when blank_lines is already true.

If you have no objections, that is what I will put in v2 of this patch
series:

> diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
> index de15de2..fde0433 100644
> --- a/xdiff/xdiffi.c
> +++ b/xdiff/xdiffi.c
> @@ -460,6 +460,12 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
>  
>                 do {
>                         groupsize = i - start;
> +
> +                       /*
> +                        * Boolean value that records whether there are any blank
> +                        * lines that could be made to be the last line of this
> +                        * group.
> +                        */
>                         blank_lines = 0;
>  
>                         /*
> @@ -511,7 +517,8 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
>                          * the current change group.
>                          */
>                         while (i < nrec && recs_match(recs, start, i, flags)) {
> -                               blank_lines += is_blank_line(recs, i, flags);
> +                               if (!blank_lines)
> +                                       blank_lines = is_blank_line(recs, i, flags);
>  
>                                 rchg[start++] = 0;
>                                 rchg[i++] = 1;

Michael



* Re: [PATCH 5/8] xdl_change_compact(): fix compaction heuristic to adjust io
  2016-08-04  7:27   ` Jeff King
@ 2016-08-10 16:58     ` Michael Haggerty
  2016-08-10 17:09       ` Michael Haggerty
  2016-08-11  4:16       ` Jeff King
  0 siblings, 2 replies; 57+ messages in thread
From: Michael Haggerty @ 2016-08-10 16:58 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Stefan Beller, Junio C Hamano, Jakub Narębski,
	Jacob Keller

On 08/04/2016 09:27 AM, Jeff King wrote:
> On Thu, Aug 04, 2016 at 12:00:33AM +0200, Michael Haggerty wrote:
> 
>> The code branch used for the compaction heuristic incorrectly forgot to
>> keep io in sync while the group was shifted. I think that could have
>> led to reading past the end of the rchgo array.
>>
>> Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
>> ---
>> I didn't actually try to verify the presence of a bug, because it
>> seems like more work than worthwhile. But here is my reasoning:
>>
>> If io is not decremented correctly during one iteration of the outer
>> `while` loop, then it will lose sync with the `end` counter. In
>> particular it will be too large.
>>
>> Suppose that the next iterations of the outer `while` loop (i.e.,
>> processing the next block of add/delete lines) don't have any sliders.
>> Then the `io` counter would be incremented by the number of
>> non-changed lines in xdf, which is the same as the number of
>> non-changed lines in xdfo that *should have* followed the group that
>> experienced the malfunction. But since `io` was too large at the end
>> of that iteration, it will be incremented past the end of the
>> xdfo->rchg array, and will try to read that memory illegally.
> 
> Hmm. In the loop:
> 
>   while (rchgo[io])
> 	io++;
> 
> that implies that rchgo has a zero-marker that we can rely on hitting.

I agree.

> And it looks like rchgo[io] always ends the loop on a 0. So it seems
> like we would just hit that condition again.

Correct...in this loop. But there is another place where `io` is
incremented unconditionally. In the version before my changes, it is via
the preincrement operator in this while statement conditional:

https://github.com/mhagger/git/blob/a28705da929ad746abcb34270947f738549d3246/xdiff/xdiffi.c#L502

After my changes, the unconditional increment is more obvious because I
took it out of the while condition:

https://github.com/mhagger/git/blob/39a135da93834fd72ee923d95d0cebfe525dfe7a/xdiff/xdiffi.c#L541

(BTW, I think this is a good example of how patch 2/8 makes the code
easier to reason about.)

I didn't do the hard work to determine whether `io` could *really* walk
off the end of the array, but I don't see an obvious reason why it
*couldn't*.

> Anyway, I'd suggest putting your cover letter bits into the commit
> message. Even though they are all suppositions, they are the kind of
> thing that could really help somebody debugging this in 2 years, and are
> better than nothing.

Good idea. Will do.

Michael



* Re: [PATCH 2/8] xdl_change_compact(): clarify code
  2016-08-10 16:39         ` Michael Haggerty
@ 2016-08-10 16:58           ` Stefan Beller
  0 siblings, 0 replies; 57+ messages in thread
From: Stefan Beller @ 2016-08-10 16:58 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: git@vger.kernel.org, Junio C Hamano, Jeff King,
	Jakub Narębski, Jacob Keller

On Wed, Aug 10, 2016 at 9:39 AM, Michael Haggerty <mhagger@alum.mit.edu> wrote:

>
> I realized that the main thing that took me a while to grok when I was
> reading this code was that blank_lines was really only used as a boolean
> value, even though it was updated with "+=". That's the main information
> that I'd like to convey to the reader.

Oh :(

I think there was some discussion when we added the blank line counting
whether we would want to have it boolean or counting. And we settled
for counting as "future algorithms can make use of this additional information"
IIRC.

>
> So I decided to change the comment to emphasize this fact (and change it
> from a question to a statement), and also changed the place that
> blank_lines is updated to treat it more like a boolean. The latter
> change also has the advantage of not calling is_blank_line()
> unnecessarily when blank_lines is already true.
>
> If you have no objections, that is what I will put in v2 of this patch
> series:

No objections from my side,
sorry for this lengthy discussion about a comment,

Thanks,
Stefan


* Re: [PATCH 5/8] xdl_change_compact(): fix compaction heuristic to adjust io
  2016-08-10 16:58     ` Michael Haggerty
@ 2016-08-10 17:09       ` Michael Haggerty
  2016-08-11  4:16       ` Jeff King
  1 sibling, 0 replies; 57+ messages in thread
From: Michael Haggerty @ 2016-08-10 17:09 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Stefan Beller, Junio C Hamano, Jakub Narębski,
	Jacob Keller

On 08/10/2016 06:58 PM, Michael Haggerty wrote:
> On 08/04/2016 09:27 AM, Jeff King wrote:
>> On Thu, Aug 04, 2016 at 12:00:33AM +0200, Michael Haggerty wrote:
>>
>>> The code branch used for the compaction heuristic incorrectly forgot to
>>> keep io in sync while the group was shifted. I think that could have
>>> led to reading past the end of the rchgo array.
>>>
>>> Signed-off-by: Michael Haggerty <mhagger@alum.mit.edu>
>>> ---
>>> I didn't actually try to verify the presence of a bug, because it
>>> seems like more work than worthwhile. But here is my reasoning:
>>>
>>> If io is not decremented correctly during one iteration of the outer
>>> `while` loop, then it will lose sync with the `end` counter. In
>>> particular it will be too large.
>>>
>>> Suppose that the next iterations of the outer `while` loop (i.e.,
>>> processing the next block of add/delete lines) don't have any sliders.
>>> Then the `io` counter would be incremented by the number of
>>> non-changed lines in xdf, which is the same as the number of
>>> non-changed lines in xdfo that *should have* followed the group that
>>> experienced the malfunction. But since `io` was too large at the end
>>> of that iteration, it will be incremented past the end of the
>>> xdfo->rchg array, and will try to read that memory illegally.
>>
>> Hmm. In the loop:
>>
>>   while (rchgo[io])
>> 	io++;
>>
>> that implies that rchgo has a zero-marker that we can rely on hitting.
> 
> I agree.
> 
>> And it looks like rchgo[io] always ends the loop on a 0. So it seems
>> like we would just hit that condition again.
> 
> Correct...in this loop. But there is another place where `io` is
> incremented unconditionally. In the version before my changes, it is via
> the preincrement operator in this while statement conditional:
> 
> https://github.com/mhagger/git/blob/a28705da929ad746abcb34270947f738549d3246/xdiff/xdiffi.c#L502
> 
> After my changes, the unconditional increment is more obvious because I
> took it out of the while condition:
> 
> https://github.com/mhagger/git/blob/39a135da93834fd72ee923d95d0cebfe525dfe7a/xdiff/xdiffi.c#L541
> 
> (BTW, I think this is a good example of how patch 2/8 makes the code
> easier to reason about.)

Actually, for the case that no more sliders are found in the file, the
key lines where io is incremented unconditionally are

https://github.com/mhagger/git/blob/a28705da929ad746abcb34270947f738549d3246/xdiff/xdiffi.c#L438

before the change (note that the post-increment happens even if the
while condition returns false), and

https://github.com/mhagger/git/blob/39a135da93834fd72ee923d95d0cebfe525dfe7a/xdiff/xdiffi.c#L443-L444

after the change. (The lines I mentioned in my previous email are also
unconditional increments, but those are only executed in the case that
more sliders are found.)
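
The point about the post-increment can be illustrated in isolation.
The functions below are a hedged sketch with hypothetical names, not
code from xdiffi.c: they only show that an increment placed inside a
`while` condition also runs on the final, failing test, so the index
ends up one past the terminating zero.

```c
#include <assert.h>
#include <stddef.h>

/* Increment inside the condition: runs even when the test fails. */
static long skip_postincrement(const char *rchg, long i)
{
	assert(rchg != NULL);
	while (rchg[i++])
		; /* empty body */
	return i;
}

/* Equivalent spelled-out form: skip the run, then step once more. */
static long skip_then_step(const char *rchg, long i)
{
	assert(rchg != NULL);
	while (rchg[i])
		i++;
	i++; /* the unconditional increment, now visible */
	return i;
}

/* Test-only loop: stops on the terminating zero itself. */
static long skip_only(const char *rchg, long i)
{
	assert(rchg != NULL);
	while (rchg[i])
		i++;
	return i;
}
```

For a flag array {1, 1, 0, 0} and a start index of 0, the first two
functions return 3 and the third returns 2. Every call of the first
two moves the index forward by at least one, which is why an index
that is already too large can be carried past the end of the array.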

Michael



* Re: [PATCH 5/8] xdl_change_compact(): fix compaction heuristic to adjust io
  2016-08-04 18:43   ` Junio C Hamano
@ 2016-08-10 17:13     ` Michael Haggerty
  0 siblings, 0 replies; 57+ messages in thread
From: Michael Haggerty @ 2016-08-10 17:13 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Stefan Beller, Jeff King, Jakub Narębski, Jacob Keller

On 08/04/2016 08:43 PM, Junio C Hamano wrote:
> Michael Haggerty <mhagger@alum.mit.edu> writes:
> 
>> The code branch used for the compaction heuristic incorrectly forgot to
>> keep io in sync while the group was shifted. I think that could have
>> led to reading past the end of the rchgo array.
> 
> I had to read the first sentence three times, as "incorrectly forgot"
> was a bit of a strange thing to say (as if there were a situation where
> 'forgetting to do' is the correct thing to do, but in that case we
> would phrase it to stress that not doing is a deliberate choice,
> e.g. 'refraining from doing').  Perhaps s/incorrectly // is the
> simplest readability improvement?

Yes, that makes it clearer. Will change.

Michael



* Re: [PATCH 6/8] xdl_change_compact(): keep track of the earliest end
  2016-08-04 18:46   ` Junio C Hamano
@ 2016-08-10 17:16     ` Michael Haggerty
  0 siblings, 0 replies; 57+ messages in thread
From: Michael Haggerty @ 2016-08-10 17:16 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Stefan Beller, Jeff King, Jakub Narębski, Jacob Keller

On 08/04/2016 08:46 PM, Junio C Hamano wrote:
> Michael Haggerty <mhagger@alum.mit.edu> writes:
> 
>> This makes it easier to detect whether shifting is possible, and will
>> also make the next change easier.
> 
> I can see the code keeping track of earliest_end but the above does
> not make it clear what the new "continue" is about.
> 
> ... easier to detect whether shifting is possible (in which case we
> can skip the shifting), and will also make ...
> 
> perhaps.

Thanks. I will make the change that you suggest.

Michael



* Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
  2016-08-04 19:39       ` Junio C Hamano
@ 2016-08-10 19:01         ` Michael Haggerty
  2016-08-10 21:28           ` Junio C Hamano
  0 siblings, 1 reply; 57+ messages in thread
From: Michael Haggerty @ 2016-08-10 19:01 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jacob Keller, Git mailing list, Stefan Beller, Jeff King,
	Jakub Narębski

On 08/04/2016 09:39 PM, Junio C Hamano wrote:
> Michael Haggerty <mhagger@alum.mit.edu> writes:
> 
>>>> +       }
>>>> +       /*
>>>> +        * We have reached the end of the line without finding any non-space
>>>> +        * characters; i.e., the whole line consists of trailing spaces, which we
>>>> +        * are not interested in.
>>>> +        */
>>>> +       return -1;
> 
> Not related to Jacob's review, but "the whole line consists of
> trailing spaces" made me read it twice; while it is technically
> correct, "the whole line consists of spaces", or even "this is a
> blank line", would read a lot more easily, at least for me.

Hehe, yes, ETOOMANYWORDS.

>> I was implicitly assuming that such lines would have text somewhere
>> after those 200 spaces (or 25 TABs or whatever). But you're right, the
>> line could consist only of whitespace. Unfortunately, the only way to
>> distinguish these two cases is to read the rest of the line, which is
>> exactly what we *don't* want to do.
> 
> Hmm, why is it exactly what we don't want to do?  Is it a
> performance concern?  In other words, is it because this function is
> called many times to measure the same line multiple times?

Yes, performance is the reason, and especially the desire to avoid
unreasonable runtimes for pathological cases. Thanks for asking, though,
because it's worthwhile to look into this more rigorously.

Suppose there is a slider that can be shifted to any of `numshift`
positions. Then we have to call `measure_split()` `2*numshift` times
(once for the beginning and once for the end of each candidate slider
position).

Suppose there are `numblanks` blank lines in the neighborhood of the
slider. Each time we call `measure_split()`, we count the number of
blank lines before and after the proposed split position. So we end up
calling `get_indent()` `2*numshift*numblanks` times.

Finally, suppose that the blank lines each contain `numws` whitespace
characters. Then each call to `get_indent()` has to do `O(numws)` work.

So altogether, if there were no limits, then the amount of work to
position a slider would scale like

    O(numshift * numblanks * numws)

However, the total number of characters in the file might only be

    O(numblanks * numws)

So without limits, the amount of work to position sliders could scale by
numshift times the size of the file.

The worst case is pretty easy to achieve. Just create a file consisting
of a million or so LF characters, then add another LF to it. The diff
would be a slider with

    numshift = 1000000
    numblanks = 1000000
    numws = 1

so the heuristic would take O(N^2) in the size of the file.
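
To make the scaling argument above concrete, here is a rough cost model in C. It is only a sketch: the helper names are made up for illustration, and it counts characters examined rather than measuring anything in xdiff.

```c
/*
 * Illustrative cost model only; these helpers are hypothetical and
 * are not part of xdiff.
 */

/* Characters examined to position one slider if nothing is capped. */
unsigned long long uncapped_work(unsigned long long numshift,
				 unsigned long long numblanks,
				 unsigned long long numws)
{
	/* 2*numshift calls to measure_split(), each scanning every blank line. */
	return 2 * numshift * numblanks * numws;
}

/* Approximate size of a file that is dominated by those blank lines. */
unsigned long long file_size(unsigned long long numblanks,
			     unsigned long long numws)
{
	return numblanks * numws;
}
```

For the million-LF file, `uncapped_work(1000000, 1000000, 1)` comes to 2 * 10^12 against a `file_size(1000000, 1)` of 10^6, which is the quadratic blow-up described above.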

In effect, the limits cap `numblanks` at `2*MAX_BLANKS` (which is 2*20)
and `numws` at `MAX_INDENT` (which is 200), meaning that the maximum
effort scales like

    numshift * 2*20 * 200

which is still a big number but not absurd. Empirically, for the case
described above, `git diff` takes 104 ms and `git diff
--indent-heuristic` takes 490 ms. I think that's not prohibitive for a
pathological case.
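
As a rough illustration of how the `MAX_INDENT` cap bounds the per-line cost, a `get_indent()`-like function could look like the sketch below. This is not the code from the patch: the name `get_indent_sketch`, the pointer-plus-length signature, the tab width of 8, and the exact set of characters treated as whitespace are assumptions made for the example.

```c
#include <stddef.h>

#define MAX_INDENT 200	/* cap quoted in the message above */

/*
 * Sketch only: return the indentation width of a line, capped at
 * MAX_INDENT, or -1 if the whole line is whitespace and the cap was
 * not reached.
 */
int get_indent_sketch(const char *line, size_t len)
{
	size_t i;
	int width = 0;

	for (i = 0; i < len; i++) {
		char c = line[i];

		if (c == ' ')
			width += 1;
		else if (c == '\t')
			width += 8 - width % 8;	/* assumed tab stops every 8 columns */
		else if (c != '\n' && c != '\r' && c != '\f' && c != '\v')
			return width;	/* first non-whitespace character */

		/* Stop early so that the scan is bounded by the cap. */
		if (width >= MAX_INDENT)
			return MAX_INDENT;
	}
	return -1;	/* whitespace-only (blank) line */
}
```

Note that a line of 300 spaces returns `MAX_INDENT` rather than -1 here. That is the ambiguity discussed above: telling the two cases apart would require reading the rest of the line.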

Meanwhile, Myers's algorithm scales like O(ND), where N is the number of
lines and D is the edit distance, so I suppose that it is already
possible to find diffs that are intractable to compute.

> After
> all, somebody in this file is already scanning each and every line
> to see where it ends to split the input into records, so perhaps a
> "right" (if the "theoretical correctness" of the return value from
> this function mattered, which you wave-away below) optimization
> could be to precompute it while the lines are broken into records
> and store it in the "rec" structure?

That would certainly be possible, and would help in cases where there
are a lot of lines with lots of leading whitespace. You could get nearly
the same benefit by recording a single bit in struct rec, indicating
whether the line is blank or not.

But it wouldn't help the worst case described above, where each call to
`get_indent()` is already very cheap. And I didn't think it was worth
allocating the extra memory to optimize this heuristic

* which isn't used all that often in the first place,
* which (for normal inputs) doesn't take a significant amount of time, and
* for which the optimization doesn't improve the worst-case scenario
  (and thus any DoS potential).
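
For reference, the "single bit in struct rec" idea could be sketched roughly as follows. The struct and function names are invented for the example and do not match xdiff's actual record type.

```c
#include <stddef.h>

/* Hypothetical record type; xdiff's real record struct differs. */
struct rec_sketch {
	const char *ptr;	/* start of the line */
	size_t size;		/* length including any trailing LF */
	unsigned is_blank : 1;	/* set once, when the record is created */
};

/* Fill in a record, scanning the line a single time. */
void rec_init(struct rec_sketch *rec, const char *ptr, size_t size)
{
	size_t i;

	rec->ptr = ptr;
	rec->size = size;
	rec->is_blank = 1;
	for (i = 0; i < size; i++) {
		char c = ptr[i];

		if (c != ' ' && c != '\t' && c != '\n' && c != '\r') {
			rec->is_blank = 0;
			break;
		}
	}
}
```

Later blankness queries become O(1) per line. In the million-LF case each line was already O(1) to inspect, though, so the quadratic number of queries would remain.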

I think the only way to ensure O(size_of_file) runtime in the above case
would be to record, along with each line, how many blank lines
immediately precede and succeed it. You could achieve something like
O(size_of_file lg(size_of_file)) by storing, e.g., the total number of
nonblank lines that precede each line and doing a binary search to find
the nearest non-blank line.
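
The first idea above (recording how many blank lines immediately precede and follow each line) amounts to two linear passes. A minimal sketch, assuming the blank/non-blank status of each line is already available as an array; the function name and signature are hypothetical:

```c
/*
 * Sketch only (hypothetical helper, not xdiff code): for each of the n
 * lines, record how many consecutive blank lines immediately precede it
 * (before[i]) and immediately follow it (after[i]). Two linear passes.
 */
void count_adjacent_blanks(const char *is_blank, long n,
			   long *before, long *after)
{
	long i;

	/* Forward pass: extend the run ending at the previous line. */
	for (i = 0; i < n; i++)
		before[i] = (i > 0 && is_blank[i - 1]) ? before[i - 1] + 1 : 0;

	/* Backward pass: extend the run starting at the next line. */
	for (i = n - 1; i >= 0; i--)
		after[i] = (i + 1 < n && is_blank[i + 1]) ? after[i + 1] + 1 : 0;
}
```

With such arrays, the blank-line counts around a candidate split could in principle be looked up in O(1), at the cost of two extra `long`s per line.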

>> But I think it doesn't matter anyway. Such "text" will likely never be
>> read by a human, so it's not a big deal if the slider position is not
>> picked perfectly. And remember, this whole saga is just to improve the
>> aesthetics of the diff. The diff is *correct* (e.g., in the sense of
>> applicable) regardless of where we position the sliders.
> 
> A better argument may be "if the user is truly reading a diff output
> for such an unusual "text", it is likely that she has a very wide
> display and/or running less -S, and treating such an overindented line
> as if it were a blank line would give a result that is more consistent
> with what appears on her display", perhaps?

I don't know. It seems like a pretty contrived justification for what is
basically, "your input is too weird for us. We're not going to break our
necks trying to give you the best possible slider positioning."

Michael

* Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
  2016-08-04  0:04           ` Stefan Beller
@ 2016-08-10 19:12             ` Michael Haggerty
  0 siblings, 0 replies; 57+ messages in thread
From: Michael Haggerty @ 2016-08-10 19:12 UTC (permalink / raw)
  To: Stefan Beller
  Cc: git@vger.kernel.org, Junio C Hamano, Jeff King,
	Jakub Narębski, Jacob Keller

On 08/04/2016 02:04 AM, Stefan Beller wrote:
> On Wed, Aug 3, 2016 at 4:30 PM, Michael Haggerty <mhagger@alum.mit.edu> wrote:
>> Stefan Beller wrote:
>>> [...]
>>> Rather the 10 describes the ratio of "advanced magic" to pure indentation
>>> based scoring in my understanding.
>>
>> No, it's basically just a number against which the other constants are
>> compared. E.g., if another bonus wants to balance out against exactly
>> one space of indentation, its constant needs to be 10. If it wants to
>> balance out against exactly 5 spaces, its constant needs to be 50. Etc.
> 
> So another interpretation is that the 10 gives the resolution for all other
> constants, i.e. if we keep 10, then we can only give weights in 1/10 of
> "one indent". But the "ideal" weight may not be a multiple of 1/10,
> so we approximate them to the nearest multiple of 1/10.
> 
> If we were to use 1000 here, we could have a higher accuracy of the
> other constants, but probably we do not care about the 3rd decimal place
> for these because they are created heuristically from a corpus that may
> not warrant a precision of constants with a 3rd decimal place.

Not only that. Since all of the inputs to the heuristic are integers,
the outputs are discontinuous and can take only certain discrete values.
And the outputs are only ever compared against one another. So a small
adjustment to the output will only make a difference if it causes the
value to become larger/smaller than another of the possible output
values. So too high a resolution makes no sense at all.
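
The role of the factor 10 can be seen in a tiny sketch of the final combination step, which the patch quotes as `return 10 * score - bonus;`. The function name here is made up, and nothing below says which direction of comparison wins; it only shows how the constants trade off against indentation.

```c
/* Sketch of the combination step quoted from the patch: 10 * score - bonus. */
int combined_score(int indent_score, int bonus)
{
	return 10 * indent_score - bonus;
}
```

A bonus of 10 exactly cancels one column of indentation (`combined_score(1, 10) == combined_score(0, 0)`), and a bonus of 50 cancels five. A bonus of 4 on one candidate, with none on the other, can only change the comparison when the two indents are equal.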

That being said, I didn't actually check in any systematic way whether
the resolution of 10:1 is high enough in practice. But I can say that
the overall performance of the heuristic (in terms of number of errors
counted) remained constant over a relatively large parameter range, so I
think the resolution is sufficient.

Feel free to play with the parameters and/or other heuristics. The tools
and raw data are all published in [1]. Let me know if you need help
getting started.

Michael

[1] https://github.com/mhagger/diff-slider-tools

* Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
  2016-08-10 19:01         ` Michael Haggerty
@ 2016-08-10 21:28           ` Junio C Hamano
  0 siblings, 0 replies; 57+ messages in thread
From: Junio C Hamano @ 2016-08-10 21:28 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: Jacob Keller, Git mailing list, Stefan Beller, Jeff King,
	Jakub Narębski

Michael Haggerty <mhagger@alum.mit.edu> writes:

>> After
>> all, somebody in this file is already scanning each and every line
>> to see where it ends to split the input into records, so perhaps a
>> "right" (if the "theoretical correctness" of the return value from
>> this function mattered, which you wave-away below) optimization
>> could be to precompute it while the lines are broken into records
>> and store it in the "rec" structure?
>
> That would certainly be possible, and would help in cases where there
> are a lot of lines with lots of leading whitespace. You could get nearly
> the same benefit by recording a single bit in struct rec, indicating
> whether the line is blank or not.
>
> But it wouldn't help the worst case described above, where each call to
> `get_indent()` is already very cheap. And I didn't think it was worth
> allocating the extra memory to optimize this heuristic

True.

> I don't know. It seems like a pretty contrived justification for what is
> basically, "your input is too weird for us. We're not going to break our
> necks trying to give you the best possible slider positioning."

True again.

* Re: [PATCH 5/8] xdl_change_compact(): fix compaction heuristic to adjust io
  2016-08-10 16:58     ` Michael Haggerty
  2016-08-10 17:09       ` Michael Haggerty
@ 2016-08-11  4:16       ` Jeff King
  1 sibling, 0 replies; 57+ messages in thread
From: Jeff King @ 2016-08-11  4:16 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: git, Stefan Beller, Junio C Hamano, Jakub Narębski,
	Jacob Keller

On Wed, Aug 10, 2016 at 06:58:06PM +0200, Michael Haggerty wrote:

> > And it looks like rchgo[io] always ends the loop on a 0. So it seems
> > like we would just hit that condition again.
> 
> Correct...in this loop. But there is another place where `io` is
> incremented unconditionally. In the version before my changes, it is via
> the preincrement operator in this while statement conditional:
> 
> https://github.com/mhagger/git/blob/a28705da929ad746abcb34270947f738549d3246/xdiff/xdiffi.c#L502
> 
> After my changes, the unconditional increment is more obvious because I
> took it out of the while condition:
> 
> https://github.com/mhagger/git/blob/39a135da93834fd72ee923d95d0cebfe525dfe7a/xdiff/xdiffi.c#L541
> 
> (BTW, I think this is a good example of how patch 2/8 makes the code
> easier to reason about.)

Ah, yeah, I totally missed that case (and I agree the code after your
2/8 makes it much more obvious).

> I didn't do the hard work to determine whether `io` could *really* walk
> off the end of the array, but I don't see an obvious reason why it
> *couldn't*.

Yeah, I'm in agreement.

-Peff

* Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
  2016-08-04  7:56   ` Jeff King
  2016-08-04 16:55     ` Stefan Beller
@ 2016-08-12 23:25     ` Michael Haggerty
  2016-08-13  8:59       ` Jeff King
  1 sibling, 1 reply; 57+ messages in thread
From: Michael Haggerty @ 2016-08-12 23:25 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Stefan Beller, Junio C Hamano, Jakub Narębski,
	Jacob Keller

On 08/04/2016 09:56 AM, Jeff King wrote:
> On Thu, Aug 04, 2016 at 12:00:36AM +0200, Michael Haggerty wrote:
> 
>> This table shows the number of diff slider groups that were positioned
>> differently than the human-generated values, for various repositories.
>> "default" is the default "git diff" algorithm. "compaction" is Git 2.9.0
>> with the `--compaction-heuristic` option "indent" is an earlier,
> 
> s/option/&./

Thanks, will change.

>>  static int diff_detect_rename_default;
>> +static int diff_indent_heuristic; /* experimental */
>>  static int diff_compaction_heuristic; /* experimental */
> 
> These two flags are mutually exclusive in the xdiff code, so we should
> probably handle that here.
> 
> TBH, I do not care that much what:
> 
>   [diff]
>   compactionHeuristic = true
>   indentHeuristic = true
> 
> does. But right now:
> 
>   git config diff.compactionHeuristic true
>   git show --indent-heuristic
> 
> still prefers the compaction heuristic, which I think is objectively
> wrong.

I wasn't worrying about that yet, given that these two features are both
still experimental. I also have a strong inkling that at most one of
them needs to be made permanent. I propose that I repair the semantics
in the simplest way possible for now while we decide on the long-term
plan, which might conceivably be:

* keep both options permanently
* keep only one option permanently
* choose one heuristic and use it always (i.e., make it part of the new
standard one-and-only diff algorithm)
* discard both heuristics (I hope not!)

After we've decided on that, *then* let's decide on a suitable UI and
implement it before we declare either feature non-experimental.
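
One simple way to "repair the semantics" might be to let an explicit command-line option override whatever the configuration selected. The following is a sketch of that idea only, not the change that was made: the flag names and bit values are placeholders, while the two option strings are the ones mentioned in this thread.

```c
#include <string.h>

/* Assumed bit values for illustration; the real flag definitions may differ. */
#define SKETCH_COMPACTION_HEURISTIC (1UL << 0)
#define SKETCH_INDENT_HEURISTIC     (1UL << 1)

/*
 * Apply one command-line option to the flags, keeping the two heuristics
 * mutually exclusive: whichever is named last wins. Returns the new flags.
 */
unsigned long apply_heuristic_option(unsigned long flags, const char *arg)
{
	const unsigned long both =
		SKETCH_COMPACTION_HEURISTIC | SKETCH_INDENT_HEURISTIC;

	if (!strcmp(arg, "--indent-heuristic"))
		return (flags & ~both) | SKETCH_INDENT_HEURISTIC;
	if (!strcmp(arg, "--compaction-heuristic"))
		return (flags & ~both) | SKETCH_COMPACTION_HEURISTIC;
	return flags;	/* unrelated option: leave the flags alone */
}
```

Starting from flags where the configuration enabled the compaction heuristic, applying `--indent-heuristic` leaves only the indent bit set, which is the behavior asked for in the review.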

> [...]
> Speaking of absurd amounts of work, I was curious if there was a
> noticeable performance penalty for using this heuristic [...]

I included some performance numbers in my response to Junio [1].

>> +#define START_OF_FILE_BONUS 9
>> +#define END_OF_FILE_BONUS 46
>> +#define TOTAL_BLANK_WEIGHT 4
>> +#define PRE_BLANK_WEIGHT 16
>> +#define RELATIVE_INDENT_BONUS -1
>> +#define RELATIVE_INDENT_HAS_BLANK_BONUS 15
>> +#define RELATIVE_OUTDENT_BONUS -19
>> +#define RELATIVE_OUTDENT_HAS_BLANK_BONUS 2
>> +#define RELATIVE_DEDENT_BONUS -63
>> +#define RELATIVE_DEDENT_HAS_BLANK_BONUS 50
> 
> I see there is a comment below here mentioning that these are empirical
> voodoo, but it might be worth one at the top (or just moving these below
> the comment) because the comment looks like it's just associated with
> the function (and these are sufficiently bizarre that anybody reading is
> going to double-take on them).

Good idea.

>> +        return 10 * score - bonus;
> 
> I don't mind this "10" not being a #define constant, but after
> reading the exchange between you and Stefan, I think it would be nice to
> describe what it is in a comment. The rest of the function is commented
> so nicely that this one left me thinking "huh?" upon seeing the "10".

Done. Thanks for your review.

Michael

[1]
http://public-inbox.org/git/5fe0edbc-3659-058f-3328-639d1343fa05@alum.mit.edu/


* Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
  2016-08-04 16:55     ` Stefan Beller
  2016-08-04 19:47       ` Junio C Hamano
@ 2016-08-13  0:09       ` Michael Haggerty
  1 sibling, 0 replies; 57+ messages in thread
From: Michael Haggerty @ 2016-08-13  0:09 UTC (permalink / raw)
  To: Stefan Beller, Jeff King
  Cc: git@vger.kernel.org, Junio C Hamano, Jakub Narębski,
	Jacob Keller

On 08/04/2016 06:55 PM, Stefan Beller wrote:
> [...]
> I have just reread the scoring function and I think you could pull out the
> `score=indent` assignment (it is always assigned except for indent <0)
> 
>         if (indent == -1)
>                score = 0;
>         else
>                score = indent;
>         ... lots of bonus computation below, which in its current implementation
>         have lots of "score = indent;" lines as well.

Yes. An earlier version of the heuristic used different indent values in
different situations, but that's gone away so the code can be made
simpler now. I'll make the change.

Thanks,
Michael


* Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
  2016-08-04 19:52   ` Junio C Hamano
@ 2016-08-13  0:11     ` Michael Haggerty
  0 siblings, 0 replies; 57+ messages in thread
From: Michael Haggerty @ 2016-08-13  0:11 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Stefan Beller, Jeff King, Jakub Narębski, Jacob Keller

On 08/04/2016 09:52 PM, Junio C Hamano wrote:
> Michael Haggerty <mhagger@alum.mit.edu> writes:
>> +#define START_OF_FILE_BONUS 9
>> +#define END_OF_FILE_BONUS 46
>> +#define TOTAL_BLANK_WEIGHT 4
>> +#define PRE_BLANK_WEIGHT 16
>> +#define RELATIVE_INDENT_BONUS -1
>> +#define RELATIVE_INDENT_HAS_BLANK_BONUS 15
>> +#define RELATIVE_OUTDENT_BONUS -19
>> +#define RELATIVE_OUTDENT_HAS_BLANK_BONUS 2
> 
> When I read up to here, I thought "Heh, isn't the opposite of INdent
> DEdent?" and then saw this:
> 
>> +#define RELATIVE_DEDENT_BONUS -63
>> +#define RELATIVE_DEDENT_HAS_BLANK_BONUS 50
> 
> It turns out that you mean by OUTdent a line that indents further
> (if I am reading the code correctly).  Is that obvious to everybody?

I'll comment it better.

>> +	/* Bonuses based on the location of blank lines: */
>> +        bonus += TOTAL_BLANK_WEIGHT * total_blanks;
>> +	bonus += PRE_BLANK_WEIGHT * m->pre_blank;
> 
> This and ...
> 
>> +        } else if (indent > m->pre_indent) {
>> +		/*
>> +		 * The line is indented more than its predecessor. Score it based
>> +		 * on the larger indent:
>> +		 */
>> +		score = indent;
>> +		bonus += RELATIVE_INDENT_BONUS;
>> +		bonus += RELATIVE_INDENT_HAS_BLANK_BONUS * any_blanks;
>> +	} else if (indent < m->pre_indent) {
> 
> ... this seems to be indented correctly even after getting quoted,
> which in turn means most of the lines in the added code share
> indent-with-non-tab badness.

The code was copy-pasted from a Python prototype then converted to C :-)

I'll fix the whitespace.

Thanks,
Michael


* Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
  2016-08-12 23:25     ` Michael Haggerty
@ 2016-08-13  8:59       ` Jeff King
  2016-08-13 15:59         ` Junio C Hamano
  2016-08-15  6:33         ` Stefan Beller
  0 siblings, 2 replies; 57+ messages in thread
From: Jeff King @ 2016-08-13  8:59 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: git, Stefan Beller, Junio C Hamano, Jakub Narębski,
	Jacob Keller

On Sat, Aug 13, 2016 at 01:25:05AM +0200, Michael Haggerty wrote:

> > These two flags are mutually exclusive in the xdiff code, so we should
> > probably handle that here.
> > 
> > TBH, I do not care that much what:
> > 
> >   [diff]
> >   compactionHeuristic = true
> >   indentHeuristic = true
> > 
> > does. But right now:
> > 
> >   git config diff.compactionHeuristic true
> >   git show --indent-heuristic
> > 
> > still prefers the compaction heuristic, which I think is objectively
> > wrong.
> 
> I wasn't worrying about that yet, given that these two features are both
> still experimental. I also have a strong inkling that at most one of
> them needs to be made permanent. I propose that I repair the semantics
> in the simplest way possible for now while we decide on the long-term
> plan, which might conceivably be:
> 
> * keep both options permanently
> * keep only one option permanently
> * choose one heuristic and use it always (i.e., make it part of the new
> standard one-and-only diff algorithm)
> * discard both heuristics (I hope not!)
> 
> After we've decided on that, *then* let's decide on a suitable UI and
> implement it before we declare either feature non-experimental.

Is there a case where the compaction heuristic produces a better result
than this indent heuristic? AFAICT, you have not found one, and I'd be
surprised if there is one, because this _seems_ like a superset
generally. I suppose there is always the possibility that the empirical
knobs behave badly in some particular case that the compaction heuristic
just happens to get right, but it should be quite rare.

So assuming everything I just said isn't complete bollocks, I think we
can move to a future where nobody uses the compaction heuristic. And
there are three ways to deal with that:

  1. The knob and feature stay. It might be useful for somebody who
     wants to experiment in the future.

  2. The knob and feature go away completely. It was an experiment, but
     now we have something more useful.

  3. The feature goes away, but the knob stays as noop, or maybe as an
     alias for the indent heuristic, just because we did ship a version
     that accepts "--compaction-heuristic", and maybe somebody somewhere
     put it in a script?

I think I'd be in favor of (2). It doesn't seem likely enough for people
to experiment with to merit a run-time knob; they can always patch and
build if they want to do so. And (3) just seems like a pain for
something that was only shipped in one version and was kind of
experimental, and was unlikely to end up in scripts (much more likely is
that people set the config, but that's easier to ignore). But it does
violate our usual backwards-compatibility rules.

So if we assume that indent is useful and compaction goes away, the only
questions are "does indent become the default" and "if so, does it
still get a knob". I'd say "yes" to both. Making the new behavior the
default was what we planned to do with compaction until we saw that it
regressed some cases. But as a new feature, it's nice for users to be
able to easily disable it to see if it's causing a problem (or to see
what a big improvement it is!).

I think we could get by with just a command-line option for that
purpose, rather than a config option; that saves a lot of effort in
having porcelains manually propagate the config option when they call a
plumbing diff-tree.

I guess the only users that leaves out are ones who really want stable
backwards-compatible diff. I guess "patch-id --stable" is one such user
(but that one we could handle internally). But let's say you had a code
review system that attached comments to lines of a diff. You might want
to disable the feature entirely to avoid invalidating old comments.

-Peff

* Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
  2016-08-13  8:59       ` Jeff King
@ 2016-08-13 15:59         ` Junio C Hamano
  2016-08-14  7:21           ` Jacob Keller
  2016-08-15  6:33         ` Stefan Beller
  1 sibling, 1 reply; 57+ messages in thread
From: Junio C Hamano @ 2016-08-13 15:59 UTC (permalink / raw)
  To: Jeff King
  Cc: Michael Haggerty, git, Stefan Beller, Jakub Narębski,
	Jacob Keller

Jeff King <peff@peff.net> writes:

> So assuming everything I just said isn't complete bollocks, I think we
> can move to a future where nobody uses the compaction heuristic. And
> there are three ways to deal with that:
>
>   1. The knob and feature stay. It might be useful for somebody who
>      wants to experiment in the future.
>
>   2. The knob and feature go away completely. It was an experiment, but
>      now we have something more useful.
>
>   3. The feature goes away, but the knob stays as noop, or maybe as an
>      alias for the indent heuristic, just because we did ship a version
>      that accepts "--compaction-heuristic", and maybe somebody somewhere
>      put it in a script?
>
> I think I'd be in favor of (2).

I am all for (2) [*1*]

This and the previous "take a blank line as a hint" are both
heuristics.  As long as the resulting code does not tax runtime
performance visibly and improves the resulting output 99% of the
time, there is no reason to leave end-users a knob.  "Among 9 hunks
in this patch that touch hello.c, 7 are made much more readable but
2 are worse" cannot even be helped with a command line option.


[Footnote]

*1* I am also strongly against (3), if only to teach people a
    lesson ;-).

* Re: [PATCH 1/8] xdl_change_compact(): rename some local variables for clarity
  2016-08-04  7:06   ` Jeff King
  2016-08-04 18:24     ` Junio C Hamano
@ 2016-08-13 19:38     ` Michael Haggerty
  2016-08-14 12:26       ` Jeff King
  1 sibling, 1 reply; 57+ messages in thread
From: Michael Haggerty @ 2016-08-13 19:38 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Stefan Beller, Junio C Hamano, Jakub Narębski,
	Jacob Keller

On 08/04/2016 09:06 AM, Jeff King wrote:
> On Thu, Aug 04, 2016 at 12:00:29AM +0200, Michael Haggerty wrote:
> 
>> * ix -> i
>> * ixo -> io
>> * ixs -> start
>> * grpsiz -> groupsize
> 
> After your change, I immediately understand three of them. But what is
> "io"?

The (pre-existing) convention in this function is that variable names
dealing with the "other" file have a trailing "o"; e.g., (xdf, xdfo),
(rchg, rchgo). There used to also be (i, io), the indexes tracking the
current line number in the file and the other file. But I renamed "i".

At first I was just going to add a comment for variable "io", but in
trying to figure out its exact semantics I realized that this code is
still pretty hard to follow. Part of the problem is that "the line in
the other file corresponding to a line in the to-be-compacted file" is
not a well-defined concept. In fact it is *groups of lines* that
correlate with each other. So I totally refactored the function, using a

    struct group {
            long start, end;
    };

as a kind of a cursor used to iterate through the groups on both sides.
I think the result is a lot easier to read, and while refactoring I
found and fixed another bug in the pre-existing code :-O
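
To illustrate the cursor idea, a group can be treated as a maximal (possibly empty) run of changed lines, with the cursor advanced from one run to the next. This sketch is not the refactored function: the `changed` array, the explicit length `n`, and the helper names are assumptions for the example.

```c
struct group {
	long start, end;	/* half-open range [start, end) of changed lines */
};

/* Position the cursor on the first group, which may be empty. */
void group_init(const char *changed, long n, struct group *g)
{
	g->start = g->end = 0;
	while (g->end < n && changed[g->end])
		g->end++;
}

/*
 * Advance to the next group, skipping the single unchanged line that
 * separates it from the current one. Returns -1 when there is no next group.
 */
int group_next(const char *changed, long n, struct group *g)
{
	if (g->end >= n)
		return -1;

	g->start = g->end + 1;
	g->end = g->start;
	while (g->end < n && changed[g->end])
		g->end++;
	return 0;
}
```

With `changed = {0, 1, 1, 0, 0, 1}`, the cursor visits the groups [0,0), [1,3), [4,4) and [5,6) in turn. Consecutive unchanged lines show up as empty groups, so one cursor per file can be stepped in lockstep.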

I hope to have v2 out soon.

Michael

* Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
  2016-08-13 15:59         ` Junio C Hamano
@ 2016-08-14  7:21           ` Jacob Keller
  0 siblings, 0 replies; 57+ messages in thread
From: Jacob Keller @ 2016-08-14  7:21 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Michael Haggerty, Git mailing list, Stefan Beller,
	Jakub Narębski

On Sat, Aug 13, 2016 at 8:59 AM, Junio C Hamano <gitster@pobox.com> wrote:
> Jeff King <peff@peff.net> writes:
>
>> So assuming everything I just said isn't complete bollocks, I think we
>> can move to a future where nobody uses the compaction heuristic. And
>> there are three ways to deal with that:
>>
>>   1. The knob and feature stay. It might be useful for somebody who
>>      wants to experiment in the future.
>>
>>   2. The knob and feature go away completely. It was an experiment, but
>>      now we have something more useful.
>>
>>   3. The feature goes away, but the knob stays as noop, or maybe as an
>>      alias for the indent heuristic, just because we did ship a version
>>      that accepts "--compaction-heuristic", and maybe somebody somewhere
>>      put it in a script?
>>
>> I think I'd be in favor of (2).
>
> I am all for (2) [*1*]
>

I also am in favor of (2). I understand the reasoning for maintaining
compatibility, but this was a known experimental feature that was
unlikely to have been used by many people. Even if it was, these are the very sorts
of people who should be aware that the experimental feature is going
away. It reduces code complexity if it just goes away, and I believe
the new heuristic is much better (Thank you Michael!!!!!)

As for a knob on the new feature, I think it can become the default
with a way to disable the feature via command line. I'm not really
sure it needs a config option at all.

> This and the previous "take a blank line as a hint" are both
> heuristics.  As long as the resulting code does not tax runtime
> performance visibly and improves the resulting output 99% of the
> time, there is no reason to leave end-users a knob.  "Among 9 hunks
> in this patch that touch hello.c, 7 are made much more readable but
> 2 are worse" cannot even be helped with a command line option.
>

Yea I agree. It might be worth having it disabled via the stable patch
IDs (? I don't know if we guarantee this?) but otherwise I don't see
it being important either way. I would vote for a way to disable it
via command line just because we *are* changing behavior here. But I
don't think it needs to be a config option at all.

>
> [Footnote]
>
> *1* I am also strongly against (3), if only to teach people a
>     lesson ;-).

* Re: [PATCH 1/8] xdl_change_compact(): rename some local variables for clarity
  2016-08-13 19:38     ` Michael Haggerty
@ 2016-08-14 12:26       ` Jeff King
  0 siblings, 0 replies; 57+ messages in thread
From: Jeff King @ 2016-08-14 12:26 UTC (permalink / raw)
  To: Michael Haggerty
  Cc: git, Stefan Beller, Junio C Hamano, Jakub Narębski,
	Jacob Keller

On Sat, Aug 13, 2016 at 09:38:33PM +0200, Michael Haggerty wrote:

> On 08/04/2016 09:06 AM, Jeff King wrote:
> > On Thu, Aug 04, 2016 at 12:00:29AM +0200, Michael Haggerty wrote:
> > 
> >> * ix -> i
> >> * ixo -> io
> >> * ixs -> start
> >> * grpsiz -> groupsize
> > 
> > After your change, I immediately understand three of them. But what is
> > "io"?
> 
> The (pre-existing) convention in this function is that variable names
> dealing with the "other" file have a trailing "o"; e.g., (xdf, xdfo),
> (rchg, rchgo). There used to also be (i, io), the indexes tracking the
> current line number in the file and the other file. But I renamed "i".

Yeah, after reading the rest of the patches, the "o" suffix sort of
started to make sense.

> At first I was just going to add a comment for variable "io", but in
> trying to figure out its exact semantics I realized that this code is
> still pretty hard to follow. Part of the problem is that "the line in
> the other file corresponding to a line in the to-be-compacted file" is
> not a well-defined concept. In fact it is *groups of lines* that
> correlate with each other. So I totally refactored the function, using a
> 
>     struct group {
>             long start, end;
>     };
> 
> as a kind of a cursor used to iterate through the groups on both sides.
> I think the result is a lot easier to read, and while refactoring I
> found and fixed another bug in the pre-existing code :-O

That sounds like it would be nicer. And bug fixes are always good.

Don't kill yourself polishing up the function names, though (unless you
keep finding bugs. ;) ).

-Peff

* Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
  2016-08-13  8:59       ` Jeff King
  2016-08-13 15:59         ` Junio C Hamano
@ 2016-08-15  6:33         ` Stefan Beller
  2016-08-15 20:24           ` Junio C Hamano
  1 sibling, 1 reply; 57+ messages in thread
From: Stefan Beller @ 2016-08-15  6:33 UTC (permalink / raw)
  To: Jeff King
  Cc: Michael Haggerty, git@vger.kernel.org, Junio C Hamano,
	Jakub Narębski, Jacob Keller

> Is there a case where the compaction heuristic produces a better result
> than this indent heuristic? AFAICT, you have not found one, and I'd be
> surprised if there is one, because this _seems_ like a superset
> generally. I suppose there is always the possibility that the empirical
> knobs behave badly in some particular case that the compaction heuristic
> just happens to get right, but it should be quite rare.

This is how I understand it as well. I would not mind removing the
blank-line-suggested-split heuristic.

Maybe we can enable Michael's heuristic with the same
config/command line flag, i.e. "the flag changes its algorithm"?
Then people who read the prior announcement don't have to
do anything, while we keep it as an experimental feature for the
next release and auto-on it after that?

We could also be a bit more aggressive and auto-on the new
heuristic with the old heuristic removed and we only have an
(undocumented) emergency-off knob for one or two releases?

Thanks,
Stefan

* Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
  2016-08-15  6:33         ` Stefan Beller
@ 2016-08-15 20:24           ` Junio C Hamano
  0 siblings, 0 replies; 57+ messages in thread
From: Junio C Hamano @ 2016-08-15 20:24 UTC (permalink / raw)
  To: Stefan Beller
  Cc: Jeff King, Michael Haggerty, git@vger.kernel.org,
	Jakub Narębski, Jacob Keller

Stefan Beller <sbeller@google.com> writes:

> Maybe we can enable Michael's heuristic with the same
> config/command line flag, i.e. "the flag changes its algorithm"?

I think that is a very sensible proposal.  After all, the name
diff.compactionHeuristic only tells us in what part of the diff process
the heuristic is used, and does not say anything about what the
heuristic does.  It is neutral between "take a blank as a hint" vs
"take indentation levels as a hint".

Thread overview: 57+ messages
2016-08-03 22:00 [PATCH 0/8] Better heuristics make prettier diffs Michael Haggerty
2016-08-03 22:00 ` [PATCH 1/8] xdl_change_compact(): rename some local variables for clarity Michael Haggerty
2016-08-04  7:06   ` Jeff King
2016-08-04 18:24     ` Junio C Hamano
2016-08-13 19:38     ` Michael Haggerty
2016-08-14 12:26       ` Jeff King
2016-08-03 22:00 ` [PATCH 2/8] xdl_change_compact(): clarify code Michael Haggerty
2016-08-03 22:11   ` Stefan Beller
2016-08-03 23:14     ` Michael Haggerty
2016-08-03 23:50       ` Stefan Beller
2016-08-04  7:13         ` Jeff King
2016-08-10 16:39         ` Michael Haggerty
2016-08-10 16:58           ` Stefan Beller
2016-08-03 22:00 ` [PATCH 3/8] xdl_change_compact(): rename i to end Michael Haggerty
2016-08-04  7:16   ` Jeff King
2016-08-03 22:00 ` [PATCH 4/8] xdl_change_compact(): do one final shift or the other, not both Michael Haggerty
2016-08-03 22:00 ` [PATCH 5/8] xdl_change_compact(): fix compaction heuristic to adjust io Michael Haggerty
2016-08-04  7:27   ` Jeff King
2016-08-10 16:58     ` Michael Haggerty
2016-08-10 17:09       ` Michael Haggerty
2016-08-11  4:16       ` Jeff King
2016-08-04 18:43   ` Junio C Hamano
2016-08-10 17:13     ` Michael Haggerty
2016-08-03 22:00 ` [PATCH 6/8] xdl_change_compact(): keep track of the earliest end Michael Haggerty
2016-08-04 18:46   ` Junio C Hamano
2016-08-10 17:16     ` Michael Haggerty
2016-08-03 22:00 ` [PATCH 7/8] is_blank_line: take a single xrecord_t as argument Michael Haggerty
2016-08-04 18:48   ` Junio C Hamano
2016-08-03 22:00 ` [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs Michael Haggerty
2016-08-03 22:29   ` Jacob Keller
2016-08-03 22:36     ` Michael Haggerty
2016-08-04  4:47       ` Jacob Keller
2016-08-04 19:39       ` Junio C Hamano
2016-08-10 19:01         ` Michael Haggerty
2016-08-10 21:28           ` Junio C Hamano
2016-08-03 22:30   ` Stefan Beller
2016-08-03 22:41     ` Michael Haggerty
2016-08-03 22:51       ` Stefan Beller
2016-08-03 23:30         ` Michael Haggerty
2016-08-04  0:04           ` Stefan Beller
2016-08-10 19:12             ` Michael Haggerty
2016-08-04  7:56   ` Jeff King
2016-08-04 16:55     ` Stefan Beller
2016-08-04 19:47       ` Junio C Hamano
2016-08-13  0:09       ` Michael Haggerty
2016-08-12 23:25     ` Michael Haggerty
2016-08-13  8:59       ` Jeff King
2016-08-13 15:59         ` Junio C Hamano
2016-08-14  7:21           ` Jacob Keller
2016-08-15  6:33         ` Stefan Beller
2016-08-15 20:24           ` Junio C Hamano
2016-08-04 19:52   ` Junio C Hamano
2016-08-13  0:11     ` Michael Haggerty
2016-08-03 22:08 ` [PATCH 0/8] Better heuristics make prettier diffs Michael Haggerty
2016-08-04  7:38 ` Jeff King
2016-08-04 19:54   ` Junio C Hamano
2016-08-04 20:01     ` Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).