git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [RFC PATCH] xdiff/xpatience: support anchoring a line
@ 2017-11-21 22:17 Jonathan Tan
  2017-11-21 23:28 ` Stefan Beller
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Jonathan Tan @ 2017-11-21 22:17 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan

Teach the patience diff to support prohibiting a user-specified line
from appearing as a deletion or addition in the end result.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
I'm sending this out to see if a change similar to this would be
welcome. It is useful to me as a reviewer (to check my own code, e.g.
when checking [1]). Probably more design needs to go into this,
including the best way to specify the "anchor" line, and the correct
behavior when the anchor is either not found or appears more than once.

Any thoughts?

[1]
https://public-inbox.org/git/20171121221256.154741-1-jonathantanmy@google.com/
---
 t/t4033-diff-patience.sh | 13 +++++++++++++
 xdiff/xpatience.c        | 29 +++++++++++++++++++++++++++--
 2 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/t/t4033-diff-patience.sh b/t/t4033-diff-patience.sh
index 113304dc5..2147fd688 100755
--- a/t/t4033-diff-patience.sh
+++ b/t/t4033-diff-patience.sh
@@ -13,6 +13,19 @@ test_expect_success '--ignore-space-at-eol with a single appended character' '
 	grep "^+.*X" diff
 '
 
+test_expect_success 'anchor' '
+	printf "a\nb\nc\n" >pre &&
+	printf "c\na\nb\n" >post &&
+
+	# without anchor, c is moved
+	test_expect_code 1 git diff --no-index --patience pre post >diff &&
+	grep "^+c" diff &&
+
+	# with anchor, a is moved
+	DIFF_ANCHOR=c test_expect_code 1 git diff --no-index --patience pre post >diff &&
+	grep "^+a" diff
+'
+
 test_diff_frobnitz "patience"
 
 test_diff_unique "patience"
diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c
index a44e77632..195a60e57 100644
--- a/xdiff/xpatience.c
+++ b/xdiff/xpatience.c
@@ -62,6 +62,8 @@ struct hashmap {
 		 * initially, "next" reflects only the order in file1.
 		 */
 		struct entry *next, *previous;
+
+		unsigned anchor : 1;
 	} *entries, *first, *last;
 	/* were common records found? */
 	unsigned long has_matches;
@@ -70,6 +72,14 @@ struct hashmap {
 	xpparam_t const *xpp;
 };
 
+static int is_anchor(const char *line)
+{
+	char *anchor = getenv("DIFF_ANCHOR");
+	if (!anchor)
+		return 0;
+	return !strncmp(line, anchor, strlen(anchor));
+}
+
 /* The argument "pass" is 1 for the first file, 2 for the second. */
 static void insert_record(int line, struct hashmap *map, int pass)
 {
@@ -110,6 +120,7 @@ static void insert_record(int line, struct hashmap *map, int pass)
 		return;
 	map->entries[index].line1 = line;
 	map->entries[index].hash = record->ha;
+	map->entries[index].anchor = is_anchor(map->env->xdf1.recs[line - 1]->ptr);
 	if (!map->first)
 		map->first = map->entries + index;
 	if (map->last) {
@@ -192,14 +203,28 @@ static struct entry *find_longest_common_sequence(struct hashmap *map)
 	int longest = 0, i;
 	struct entry *entry;
 
+	/*
+	 * If not -1, this entry in sequence must never be overridden. (Also,
+	 * do not override entries in sequence before this entry, since it is
+	 * useless.)
+	 */
+	int anchor_i = -1;
+
 	for (entry = map->first; entry; entry = entry->next) {
 		if (!entry->line2 || entry->line2 == NON_UNIQUE)
 			continue;
 		i = binary_search(sequence, longest, entry);
 		entry->previous = i < 0 ? NULL : sequence[i];
-		sequence[++i] = entry;
-		if (i == longest)
+		++i;
+		if (i <= anchor_i)
+			continue;
+		sequence[i] = entry;
+		if (anchor_i == -1 && entry->anchor) {
+			anchor_i = i;
+			longest = anchor_i + 1;
+		} else if (i == longest) {
 			longest++;
+		}
 	}
 
 	/* No common unique lines were found */
-- 
2.15.0.448.gf294e3d99a-goog


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH] xdiff/xpatience: support anchoring a line
  2017-11-21 22:17 [RFC PATCH] xdiff/xpatience: support anchoring a line Jonathan Tan
@ 2017-11-21 23:28 ` Stefan Beller
  2017-11-22  2:27 ` Junio C Hamano
  2017-11-22 23:41 ` [PATCH] xdiff/xpatience: support anchoring line(s) Jonathan Tan
  2 siblings, 0 replies; 15+ messages in thread
From: Stefan Beller @ 2017-11-21 23:28 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git

On Tue, Nov 21, 2017 at 2:17 PM, Jonathan Tan <jonathantanmy@google.com> wrote:
> Teach the patience diff to support prohibiting a user-specified line
> from appearing as a deletion or addition in the end result.
>
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
> I'm sending this out to see if a change similar to this would be
> welcome. It is useful to me as a reviewer (to check my own code, e.g.
> when checking [1]). Probably more design needs to go into this,
> including the best way to specify the "anchor" line, and the correct
> behavior when the anchor is either not found or appears more than once.
>
> Any thoughts?

The background from this whole idea is that the Myers diff algorithm
may produce the shortest diff, which is good for computer consumption
or transport, but not ideal for human review. To accommodate human
review we need to couple the diff with higher level concepts, such as
move detection or ignoring blanks.

I would imagine that this anchor can be set by a user to a function
header or other significant line, such that the commit is more in line with
the diff itself ("move code out of function A into its helper" would not
want to have the "function A" line jump around, but the code should
be removed from the function, hence you'd anchor the function).

The solution you provide is a good thing to experiment with, but
longer term, I would want to have huge record of configs in which
humans selected the best diff, such that we can use that data
to reason about better automatic diff generation.
The diff heuristic was based on a lot of human generated data,
that was generated by Michael at the time. I wonder if we want to
permanently store the anchor so the data collection will happen
automatically over time.

I had a similar idea, which would affix a given coordinate of the map[1]
to be on the path. I imagine it similar to e.g. Google Maps in which you
can select intermittent way points on your route.

When having this rather abstract coordinate, which can be
given as line number in pre and post image, then we would not
need to think about questions whether a given line is found or
appears multiple times; however users like concise input the best,
so the idea of an "anchor line" might be the best representation for
the user, which is internally translated into way points.

[1] think of http://simplygenius.net/ArticleFiles/DiffTutorial/diagonals.png

Stefan

> [1]
> https://public-inbox.org/git/20171121221256.154741-1-jonathantanmy@google.com/
> ---
>  t/t4033-diff-patience.sh | 13 +++++++++++++
>  xdiff/xpatience.c        | 29 +++++++++++++++++++++++++++--
>  2 files changed, 40 insertions(+), 2 deletions(-)
>
> diff --git a/t/t4033-diff-patience.sh b/t/t4033-diff-patience.sh
> index 113304dc5..2147fd688 100755
> --- a/t/t4033-diff-patience.sh
> +++ b/t/t4033-diff-patience.sh
> @@ -13,6 +13,19 @@ test_expect_success '--ignore-space-at-eol with a single appended character' '
>         grep "^+.*X" diff
>  '
>
> +test_expect_success 'anchor' '
> +       printf "a\nb\nc\n" >pre &&
> +       printf "c\na\nb\n" >post &&
> +
> +       # without anchor, c is moved
> +       test_expect_code 1 git diff --no-index --patience pre post >diff &&
> +       grep "^+c" diff &&
> +
> +       # with anchor, a is moved
> +       DIFF_ANCHOR=c test_expect_code 1 git diff --no-index --patience pre post >diff &&
> +       grep "^+a" diff

or rather: "c is not moved, we don't care how the diff actually looks like",
so maybe
      ! grep "+c" diff


> diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c
> index a44e77632..195a60e57 100644
> --- a/xdiff/xpatience.c
> +++ b/xdiff/xpatience.c
> @@ -62,6 +62,8 @@ struct hashmap {
>                  * initially, "next" reflects only the order in file1.
>                  */
>                 struct entry *next, *previous;
> +
> +               unsigned anchor : 1;

While this is RFC, I should not expect comments, though it would
be nice to have them in the final series. ;-)

>         } *entries, *first, *last;
>         /* were common records found? */
>         unsigned long has_matches;
> @@ -70,6 +72,14 @@ struct hashmap {
>         xpparam_t const *xpp;
>  };
>
> +static int is_anchor(const char *line)
> +{
> +       char *anchor = getenv("DIFF_ANCHOR");
> +       if (!anchor)
> +               return 0;
> +       return !strncmp(line, anchor, strlen(anchor));
> +}
> +
>  /* The argument "pass" is 1 for the first file, 2 for the second. */
>  static void insert_record(int line, struct hashmap *map, int pass)
>  {
> @@ -110,6 +120,7 @@ static void insert_record(int line, struct hashmap *map, int pass)
>                 return;
>         map->entries[index].line1 = line;
>         map->entries[index].hash = record->ha;
> +       map->entries[index].anchor = is_anchor(map->env->xdf1.recs[line - 1]->ptr);
>         if (!map->first)
>                 map->first = map->entries + index;
>         if (map->last) {
> @@ -192,14 +203,28 @@ static struct entry *find_longest_common_sequence(struct hashmap *map)
>         int longest = 0, i;
>         struct entry *entry;
>
> +       /*
> +        * If not -1, this entry in sequence must never be overridden. (Also,
> +        * do not override entries in sequence before this entry, since it is
> +        * useless.)
> +        */
> +       int anchor_i = -1;
> +
>         for (entry = map->first; entry; entry = entry->next) {
>                 if (!entry->line2 || entry->line2 == NON_UNIQUE)
>                         continue;
>                 i = binary_search(sequence, longest, entry);
>                 entry->previous = i < 0 ? NULL : sequence[i];
> -               sequence[++i] = entry;
> -               if (i == longest)
> +               ++i;
> +               if (i <= anchor_i)
> +                       continue;
> +               sequence[i] = entry;
> +               if (anchor_i == -1 && entry->anchor) {
> +                       anchor_i = i;
> +                       longest = anchor_i + 1;
> +               } else if (i == longest) {
>                         longest++;
> +               }
>         }
>
>         /* No common unique lines were found */
> --
> 2.15.0.448.gf294e3d99a-goog
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH] xdiff/xpatience: support anchoring a line
  2017-11-21 22:17 [RFC PATCH] xdiff/xpatience: support anchoring a line Jonathan Tan
  2017-11-21 23:28 ` Stefan Beller
@ 2017-11-22  2:27 ` Junio C Hamano
  2017-11-22 23:41 ` [PATCH] xdiff/xpatience: support anchoring line(s) Jonathan Tan
  2 siblings, 0 replies; 15+ messages in thread
From: Junio C Hamano @ 2017-11-22  2:27 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git

Jonathan Tan <jonathantanmy@google.com> writes:

> Teach the patience diff to support prohibiting a user-specified line
> from appearing as a deletion or addition in the end result.
>
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
> I'm sending this out to see if a change similar to this would be
> welcome. It is useful to me as a reviewer (to check my own code, e.g.
> when checking [1]). Probably more design needs to go into this,
> including the best way to specify the "anchor" line, and the correct
> behavior when the anchor is either not found or appears more than once.
>
> Any thoughts?

This is a natural extension of the idea the patience algorithm is
built upon.  If this were a cumulative command line option that can
be given to specify multiple lines and can be used across the diff
family, it would make a welcome addition, I would think.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH] xdiff/xpatience: support anchoring line(s)
  2017-11-21 22:17 [RFC PATCH] xdiff/xpatience: support anchoring a line Jonathan Tan
  2017-11-21 23:28 ` Stefan Beller
  2017-11-22  2:27 ` Junio C Hamano
@ 2017-11-22 23:41 ` Jonathan Tan
  2017-11-23  0:15   ` Stefan Beller
                     ` (2 more replies)
  2 siblings, 3 replies; 15+ messages in thread
From: Jonathan Tan @ 2017-11-22 23:41 UTC (permalink / raw)
  To: jonathantanmy, git; +Cc: gitster, sbeller

Teach the patience diff to attempt preventing user-specified lines from
appearing as a deletion or addition in the end result. The end user can
use this by specifying "--anchor=<text>" one or more times when using
Git commands like "diff" and "show".

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
Actual patch instead of RFC.

One thing that might help is to warn if --anchor is used without
--patience, but I couldn't find a good place to put that warning. Let me
know if you know of a good place.

Replying to Stefan's and Junio's comments:

> The solution you provide is a good thing to experiment with, but
> longer term, I would want to have huge record of configs in which
> humans selected the best diff, such that we can use that data
> to reason about better automatic diff generation.
> The diff heuristic was based on a lot of human generated data,
> that was generated by Michael at the time. I wonder if we want to
> permanently store the anchor so the data collection will happen
> automatically over time.

I think machine learning is beyond the scope of this patch :-)

> or rather: "c is not moved, we don't care how the diff actually looks
> like",
> so maybe
>       ! grep "+c" diff

I think it's less error-prone to show "a" moving. With this, if the
command somehow prints nothing, the test would still pass.

> While this is RFC, I should not expect comments, though it would
> be nice to have them in the final series. ;-)

Added comments :-)

> This is a natural extension of the idea the patience algorithm is
> built upon.  If this were a cumulative command line option that can
> be given to specify multiple lines and can be used across the diff
> family, it would make a welcome addition, I would think.

Specifying multiple lines - done.

By diff family, do you mean "diff", "show", and others? If yes, that has
also been done in this patch.
---
 Documentation/diff-options.txt |  9 +++++++
 diff.c                         |  8 ++++++
 diff.h                         |  4 +++
 t/t4033-diff-patience.sh       | 59 ++++++++++++++++++++++++++++++++++++++++++
 xdiff/xdiff.h                  |  4 +++
 xdiff/xpatience.c              | 42 ++++++++++++++++++++++++++----
 6 files changed, 121 insertions(+), 5 deletions(-)

diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt
index dd0dba5b1..b17d62696 100644
--- a/Documentation/diff-options.txt
+++ b/Documentation/diff-options.txt
@@ -100,6 +100,15 @@ For instance, if you configured diff.algorithm variable to a
 non-default value and want to use the default one, then you
 have to use `--diff-algorithm=default` option.
 
+--anchor=<text>::
+	This option may be specified more than once.
++
+If the "patience diff" algorithm is used, Git attempts to prevent lines
+starting with this text from appearing as a deletion or addition in the
+output.
++
+If another diff algorithm is used, this has no effect.
+
 --stat[=<width>[,<name-width>[,<count>]]]::
 	Generate a diffstat. By default, as much space as necessary
 	will be used for the filename part, and the rest for the graph
diff --git a/diff.c b/diff.c
index 0763e8926..29a7176ee 100644
--- a/diff.c
+++ b/diff.c
@@ -3210,6 +3210,8 @@ static void builtin_diff(const char *name_a,
 		ecbdata.opt = o;
 		ecbdata.header = header.len ? &header : NULL;
 		xpp.flags = o->xdl_opts;
+		xpp.anchors = o->anchors;
+		xpp.anchors_nr = o->anchors_nr;
 		xecfg.ctxlen = o->context;
 		xecfg.interhunkctxlen = o->interhunkcontext;
 		xecfg.flags = XDL_EMIT_FUNCNAMES;
@@ -3302,6 +3304,8 @@ static void builtin_diffstat(const char *name_a, const char *name_b,
 		memset(&xpp, 0, sizeof(xpp));
 		memset(&xecfg, 0, sizeof(xecfg));
 		xpp.flags = o->xdl_opts;
+		xpp.anchors = o->anchors;
+		xpp.anchors_nr = o->anchors_nr;
 		xecfg.ctxlen = o->context;
 		xecfg.interhunkctxlen = o->interhunkcontext;
 		if (xdi_diff_outf(&mf1, &mf2, diffstat_consume, diffstat,
@@ -4608,6 +4612,10 @@ int diff_opt_parse(struct diff_options *options,
 		options->xdl_opts &= ~XDF_DIFF_ALGORITHM_MASK;
 		options->xdl_opts |= value;
 		return argcount;
+	} else if (skip_prefix(arg, "--anchor=", &arg)) {
+		ALLOC_GROW(options->anchors, options->anchors_nr + 1,
+			   options->anchors_alloc);
+		options->anchors[options->anchors_nr++] = xstrdup(arg);
 	}
 
 	/* flags options */
diff --git a/diff.h b/diff.h
index 0fb18dd73..7cf276f07 100644
--- a/diff.h
+++ b/diff.h
@@ -166,6 +166,10 @@ struct diff_options {
 	const char *stat_sep;
 	long xdl_opts;
 
+	/* see Documentation/diff-options.txt */
+	char **anchors;
+	size_t anchors_nr, anchors_alloc;
+
 	int stat_width;
 	int stat_name_width;
 	int stat_graph_width;
diff --git a/t/t4033-diff-patience.sh b/t/t4033-diff-patience.sh
index 113304dc5..2d00d1056 100755
--- a/t/t4033-diff-patience.sh
+++ b/t/t4033-diff-patience.sh
@@ -13,6 +13,65 @@ test_expect_success '--ignore-space-at-eol with a single appended character' '
 	grep "^+.*X" diff
 '
 
+test_expect_success '--anchor' '
+	printf "a\nb\nc\n" >pre &&
+	printf "c\na\nb\n" >post &&
+
+	# without anchor, c is moved
+	test_expect_code 1 git diff --no-index --patience pre post >diff &&
+	grep "^+c" diff &&
+
+	# with anchor, a is moved
+	test_expect_code 1 git diff --no-index --patience --anchor=c pre post >diff &&
+	grep "^+a" diff
+'
+
+test_expect_success '--anchor multiple' '
+	printf "a\nb\nc\nd\ne\nf\n" >pre &&
+	printf "c\na\nb\nf\nd\ne\n" >post &&
+
+	# with 1 anchor, c is not moved, but f is moved
+	test_expect_code 1 git diff --no-index --patience --anchor=c pre post >diff &&
+	grep "^+a" diff && # a is moved instead of c
+	grep "^+f" diff &&
+
+	# with 2 anchors, c and f are not moved
+	test_expect_code 1 git diff --no-index --patience --anchor=c --anchor=f pre post >diff &&
+	grep "^+a" diff &&
+	grep "^+d" diff # d is moved instead of f
+'
+
+test_expect_success '--anchor with nonexistent line has no effect' '
+	printf "a\nb\nc\n" >pre &&
+	printf "c\na\nb\n" >post &&
+
+	test_expect_code 1 git diff --no-index --patience --anchor=x pre post >diff &&
+	grep "^+c" diff
+'
+
+test_expect_success 'diff still produced with impossible multiple --anchor' '
+	printf "a\nb\nc\n" >pre &&
+	printf "c\na\nb\n" >post &&
+
+	test_expect_code 1 git diff --no-index --patience --anchor=a --anchor=c pre post >diff &&
+	mv post expected_post &&
+	git apply diff &&
+	diff expected_post post # make sure that the diff is correct
+'
+
+test_expect_success '--anchor works with other commands like "git show"' '
+	printf "a\nb\nc\n" >file &&
+	git add file &&
+	git commit -m foo &&
+	printf "c\na\nb\n" >file &&
+	git add file &&
+	git commit -m foo &&
+
+	# with anchor, a is moved
+	git show --patience --anchor=c >diff &&
+	grep "^+a" diff
+'
+
 test_diff_frobnitz "patience"
 
 test_diff_unique "patience"
diff --git a/xdiff/xdiff.h b/xdiff/xdiff.h
index 785ecb089..e6217e1d7 100644
--- a/xdiff/xdiff.h
+++ b/xdiff/xdiff.h
@@ -80,6 +80,10 @@ typedef struct s_mmbuffer {
 
 typedef struct s_xpparam {
 	unsigned long flags;
+
+	/* See Documentation/diff-options.txt. */
+	char **anchors;
+	size_t anchors_nr;
 } xpparam_t;
 
 typedef struct s_xdemitcb {
diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c
index a44e77632..f55c9fb11 100644
--- a/xdiff/xpatience.c
+++ b/xdiff/xpatience.c
@@ -62,6 +62,12 @@ struct hashmap {
 		 * initially, "next" reflects only the order in file1.
 		 */
 		struct entry *next, *previous;
+
+		/*
+		 * If 1, this entry can serve as the anchor. See
+		 * Documentation/diff-options.txt for more information.
+		 */
+		unsigned anchor : 1;
 	} *entries, *first, *last;
 	/* were common records found? */
 	unsigned long has_matches;
@@ -70,8 +76,19 @@ struct hashmap {
 	xpparam_t const *xpp;
 };
 
+static int is_anchor(xpparam_t const *xpp, const char *line)
+{
+	int i;
+	for (i = 0; i < xpp->anchors_nr; i++) {
+		if (!strncmp(line, xpp->anchors[i], strlen(xpp->anchors[i])))
+			return 1;
+	}
+	return 0;
+}
+
 /* The argument "pass" is 1 for the first file, 2 for the second. */
-static void insert_record(int line, struct hashmap *map, int pass)
+static void insert_record(xpparam_t const *xpp, int line, struct hashmap *map,
+			  int pass)
 {
 	xrecord_t **records = pass == 1 ?
 		map->env->xdf1.recs : map->env->xdf2.recs;
@@ -110,6 +127,7 @@ static void insert_record(int line, struct hashmap *map, int pass)
 		return;
 	map->entries[index].line1 = line;
 	map->entries[index].hash = record->ha;
+	map->entries[index].anchor = is_anchor(xpp, map->env->xdf1.recs[line - 1]->ptr);
 	if (!map->first)
 		map->first = map->entries + index;
 	if (map->last) {
@@ -147,11 +165,11 @@ static int fill_hashmap(mmfile_t *file1, mmfile_t *file2,
 
 	/* First, fill with entries from the first file */
 	while (count1--)
-		insert_record(line1++, result, 1);
+		insert_record(xpp, line1++, result, 1);
 
 	/* Then search for matches in the second file */
 	while (count2--)
-		insert_record(line2++, result, 2);
+		insert_record(xpp, line2++, result, 2);
 
 	return 0;
 }
@@ -192,14 +210,28 @@ static struct entry *find_longest_common_sequence(struct hashmap *map)
 	int longest = 0, i;
 	struct entry *entry;
 
+	/*
+	 * If not -1, this entry in sequence must never be overridden.
+	 * Therefore, overriding entries before this has no effect, so
+	 * do not do that either.
+	 */
+	int anchor_i = -1;
+
 	for (entry = map->first; entry; entry = entry->next) {
 		if (!entry->line2 || entry->line2 == NON_UNIQUE)
 			continue;
 		i = binary_search(sequence, longest, entry);
 		entry->previous = i < 0 ? NULL : sequence[i];
-		sequence[++i] = entry;
-		if (i == longest)
+		++i;
+		if (i <= anchor_i)
+			continue;
+		sequence[i] = entry;
+		if (entry->anchor) {
+			anchor_i = i;
+			longest = anchor_i + 1;
+		} else if (i == longest) {
 			longest++;
+		}
 	}
 
 	/* No common unique lines were found */
-- 
2.15.0.448.gf294e3d99a-goog


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] xdiff/xpatience: support anchoring line(s)
  2017-11-22 23:41 ` [PATCH] xdiff/xpatience: support anchoring line(s) Jonathan Tan
@ 2017-11-23  0:15   ` Stefan Beller
  2017-11-23  2:05   ` Junio C Hamano
  2017-11-27 19:47   ` [PATCH v2] diff: " Jonathan Tan
  2 siblings, 0 replies; 15+ messages in thread
From: Stefan Beller @ 2017-11-23  0:15 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, Junio C Hamano

On Wed, Nov 22, 2017 at 3:41 PM, Jonathan Tan <jonathantanmy@google.com> wrote:
> Teach the patience diff to attempt preventing user-specified lines from
> appearing as a deletion or addition in the end result. The end user can
> use this by specifying "--anchor=<text>" one or more times when using
> Git commands like "diff" and "show".
>
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
> Actual patch instead of RFC.
>
> One thing that might help is to warn if --anchor is used without
> --patience, but I couldn't find a good place to put that warning. Let me
> know if you know of a good place.

Would it make sense to have `--anchor` imply patience?
(not necessarily in this patch, might be a "yes, let's do
it in a year when users complain")

> Replying to Stefan's and Junio's comments:
>
>> The solution you provide is a good thing to experiment with, but
>> longer term, I would want to have huge record of configs in which
>> humans selected the best diff, such that we can use that data
>> to reason about better automatic diff generation.
>> The diff heuristic was based on a lot of human generated data,
>> that was generated by Michael at the time. I wonder if we want to
>> permanently store the anchor so the data collection will happen
>> automatically over time.
>
> I think machine learning is beyond the scope of this patch :-)

agreed; I just wanted to share what I think we could do in the future
to select sane default. For that we'd want to collect some "most useful"
configurations.

When I proposed separate flags for the move detection regarding
ignoring whitespaces, the question "how is the user sanely select
from so many flags?" came up. And in that spirit I would want think
adding this rather fundamental flag, and then machine learn (e.g. the
weights in traversing the diff matrix) off of this collected data later
might be a viable approach.

>> or rather: "c is not moved, we don't care how the diff actually looks
>> like",
>> so maybe
>>       ! grep "+c" diff
>
> I think it's less error-prone to show "a" moving. With this, if the
> command somehow prints nothing, the test would still pass.

Makes sense.

> diff --git a/t/t4033-diff-patience.sh b/t/t4033-diff-patience.sh
> index 113304dc5..2d00d1056 100755
> --- a/t/t4033-diff-patience.sh
> +++ b/t/t4033-diff-patience.sh

I was waiting for

    test_expect_success 'one --anchor anchors many lines' '
        printf "a\nb\na\nc\na\n" >file && # many 'a's
        ....
        --anchor=a
        ...


Thanks for writing this patch,
I hope we can make use of this addition eventually a lot. :)

Stefan

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] xdiff/xpatience: support anchoring line(s)
  2017-11-22 23:41 ` [PATCH] xdiff/xpatience: support anchoring line(s) Jonathan Tan
  2017-11-23  0:15   ` Stefan Beller
@ 2017-11-23  2:05   ` Junio C Hamano
  2017-11-23  2:16     ` Junio C Hamano
  2017-11-27 19:47   ` [PATCH v2] diff: " Jonathan Tan
  2 siblings, 1 reply; 15+ messages in thread
From: Junio C Hamano @ 2017-11-23  2:05 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, sbeller

Jonathan Tan <jonathantanmy@google.com> writes:

> One thing that might help is to warn if --anchor is used without
> --patience, but I couldn't find a good place to put that warning. Let me
> know if you know of a good place.

How about dropping --anchor option and do it as "--patience=<pattern>"?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] xdiff/xpatience: support anchoring line(s)
  2017-11-23  2:05   ` Junio C Hamano
@ 2017-11-23  2:16     ` Junio C Hamano
  2017-11-23  2:47       ` Junio C Hamano
  0 siblings, 1 reply; 15+ messages in thread
From: Junio C Hamano @ 2017-11-23  2:16 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, sbeller

Junio C Hamano <gitster@pobox.com> writes:

> Jonathan Tan <jonathantanmy@google.com> writes:
>
>> One thing that might help is to warn if --anchor is used without
>> --patience, but I couldn't find a good place to put that warning. Let me
>> know if you know of a good place.
>
> How about dropping --anchor option and do it as "--patience=<pattern>"?

Well, that may not be an optimal suggestion, as it is always a pain
to have to deal with an option with optional argument.

I understand that the case you would really want to warn against is

    git diff --histogram --anchor=foo

and not

    git diff --anchor=foo

as it is sensible to turn --patience on implicitly.  Perhaps a good
starting point might be to rename the option to include "patience"
somewhere in its name ("--patience-anchor=<pattern>"?) to make it
more obvious that it is about helping the patience algorithm.  And
then make "--patience-anchor=<pattern>" without any other algorithm
selection option to behave as if "--patience" was also given.

What do we do when

    git diff --histogram --patience

is given?  Do we warn?  If we don't, perhaps it may not be too bad
if 

    git diff --histogram --patience-anchor=foo
    git diff --patience-anchor=foo --histogram

did not get any warning.  Instead we just implicitly turn any
occurence of --patience-anchor=foo into --patience followed by the
same option, and assume that the user wanted the usual "last one
wins" semantics.  It would turn patience on for the former, and
ignore the anchor for the latter and use historgram.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] xdiff/xpatience: support anchoring line(s)
  2017-11-23  2:16     ` Junio C Hamano
@ 2017-11-23  2:47       ` Junio C Hamano
  2017-11-27 18:30         ` Jonathan Tan
  0 siblings, 1 reply; 15+ messages in thread
From: Junio C Hamano @ 2017-11-23  2:47 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, sbeller

Junio C Hamano <gitster@pobox.com> writes:

> What do we do when
>
>     git diff --histogram --patience
>
> is given?  Do we warn?  If we don't, perhaps it may not be too bad
> if 
>
>     git diff --histogram --patience-anchor=foo
>     git diff --patience-anchor=foo --histogram
>
> did not get any warning.  Instead we just implicitly turn any
> occurence of --patience-anchor=foo into --patience followed by the
> same option, and assume that the user wanted the usual "last one
> wins" semantics.  It would turn patience on for the former, and
> ignore the anchor for the latter and use historgram.

Thinking about this a bit more, I do like the basic idea of the UI
even better.  What we could do is to sell this to the end users as a
new kind of diff algorithm choice (i.e. myers, patience, ... will
gain a new friend) that internally happens to be implemented by
piggybacking on patience (just like minimal is piggybacking on
myers) and call it "anchor".  Then just like this command line

    git diff --histogram --patience

makes the last one win without complaint, it is sane that these
command lines

    git diff --histogram --anchored=<pattern>
    git diff --anchored=<pattern> --histogram

make the last one win without complaint, either.

Hmm?


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] xdiff/xpatience: support anchoring line(s)
  2017-11-23  2:47       ` Junio C Hamano
@ 2017-11-27 18:30         ` Jonathan Tan
  0 siblings, 0 replies; 15+ messages in thread
From: Jonathan Tan @ 2017-11-27 18:30 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, sbeller

On Thu, 23 Nov 2017 11:47:02 +0900
Junio C Hamano <gitster@pobox.com> wrote:

> Thinking about this a bit more, I do like the basic idea of the UI
> even better.  What we could do is to sell this to the end users as a
> new kind of diff algorithm choice (i.e. myers, patience, ... will
> gain a new friend) that internally happens to be implemented by
> piggybacking on patience (just like minimal is piggybacking on
> myers) and call it "anchor".  Then just like this command line
> 
>     git diff --histogram --patience
> 
> makes the last one win without complaint, it is sane that these
> command lines
> 
>     git diff --histogram --anchored=<pattern>
>     git diff --anchored=<pattern> --histogram
> 
> make the last one win without complaint, either.
> 
> Hmm?

This sounds good. There will be a bit of inconsistency in that in "git
diff --anchored=<pattern1> --anchored=<pattern2>", it is not the last
one that wins, but both of them will in fact be used. But I think that
in practice, this will be fine.

I'll send out another version with this UI (and with Stefan's test
suggestion).

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2] diff: support anchoring line(s)
  2017-11-22 23:41 ` [PATCH] xdiff/xpatience: support anchoring line(s) Jonathan Tan
  2017-11-23  0:15   ` Stefan Beller
  2017-11-23  2:05   ` Junio C Hamano
@ 2017-11-27 19:47   ` Jonathan Tan
  2017-11-28  1:38     ` Junio C Hamano
  2 siblings, 1 reply; 15+ messages in thread
From: Jonathan Tan @ 2017-11-27 19:47 UTC (permalink / raw)
  To: jonathantanmy, git; +Cc: gitster, sbeller

Teach diff a new algorithm, one that attempts to prevent user-specified
lines from appearing as a deletion or addition in the end result. The
end user can use this by specifying "--anchored=<text>" one or more
times when using Git commands like "diff" and "show".

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
In v2, the UI treats it as a new algorithm. (Internally it's still an
option for "patience".)

Stefan suggested that I add a test where a single --anchored=<> matches
multiple lines, and that reminded me that the patience algorithm only
really works on unique lines. I have updated the documentation to remark
on that, and also added a test to show what happens. I have also added a
test to show that later algorithm options override earlier ones.
---
 Documentation/diff-options.txt | 10 +++++
 diff.c                         | 22 +++++++++-
 diff.h                         |  4 ++
 t/t4064-diff-anchored.sh       | 94 ++++++++++++++++++++++++++++++++++++++++++
 xdiff/xdiff.h                  |  4 ++
 xdiff/xpatience.c              | 42 ++++++++++++++++---
 6 files changed, 169 insertions(+), 7 deletions(-)
 create mode 100755 t/t4064-diff-anchored.sh

diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt
index dd0dba5b1..6ce39fb69 100644
--- a/Documentation/diff-options.txt
+++ b/Documentation/diff-options.txt
@@ -80,6 +80,16 @@ endif::git-format-patch[]
 --histogram::
 	Generate a diff using the "histogram diff" algorithm.
 
+--anchored=<text>::
+	Generate a diff using the "anchored diff" algorithm.
++
+This option may be specified more than once.
++
+If a line exists in both the source and destination, exists only once,
+and starts with this text, this algorithm attempts to prevent it from
+appearing as a deletion or addition in the output. It uses the "patience
+diff" algorithm internally.
+
 --diff-algorithm={patience|minimal|histogram|myers}::
 	Choose a diff algorithm. The variants are as follows:
 +
diff --git a/diff.c b/diff.c
index 0763e8926..98aa638f8 100644
--- a/diff.c
+++ b/diff.c
@@ -3210,6 +3210,8 @@ static void builtin_diff(const char *name_a,
 		ecbdata.opt = o;
 		ecbdata.header = header.len ? &header : NULL;
 		xpp.flags = o->xdl_opts;
+		xpp.anchors = o->anchors;
+		xpp.anchors_nr = o->anchors_nr;
 		xecfg.ctxlen = o->context;
 		xecfg.interhunkctxlen = o->interhunkcontext;
 		xecfg.flags = XDL_EMIT_FUNCNAMES;
@@ -3302,6 +3304,8 @@ static void builtin_diffstat(const char *name_a, const char *name_b,
 		memset(&xpp, 0, sizeof(xpp));
 		memset(&xecfg, 0, sizeof(xecfg));
 		xpp.flags = o->xdl_opts;
+		xpp.anchors = o->anchors;
+		xpp.anchors_nr = o->anchors_nr;
 		xecfg.ctxlen = o->context;
 		xecfg.interhunkctxlen = o->interhunkcontext;
 		if (xdi_diff_outf(&mf1, &mf2, diffstat_consume, diffstat,
@@ -4594,9 +4598,18 @@ int diff_opt_parse(struct diff_options *options,
 		DIFF_XDL_SET(options, INDENT_HEURISTIC);
 	else if (!strcmp(arg, "--no-indent-heuristic"))
 		DIFF_XDL_CLR(options, INDENT_HEURISTIC);
-	else if (!strcmp(arg, "--patience"))
+	else if (!strcmp(arg, "--patience")) {
+		int i;
 		options->xdl_opts = DIFF_WITH_ALG(options, PATIENCE_DIFF);
-	else if (!strcmp(arg, "--histogram"))
+		/*
+		 * Both --patience and --anchored use PATIENCE_DIFF
+		 * internally, so remove any anchors previously
+		 * specified.
+		 */
+		for (i = 0; i < options->anchors_nr; i++)
+			free(options->anchors[i]);
+		options->anchors_nr = 0;
+	} else if (!strcmp(arg, "--histogram"))
 		options->xdl_opts = DIFF_WITH_ALG(options, HISTOGRAM_DIFF);
 	else if ((argcount = parse_long_opt("diff-algorithm", av, &optarg))) {
 		long value = parse_algorithm_value(optarg);
@@ -4608,6 +4621,11 @@ int diff_opt_parse(struct diff_options *options,
 		options->xdl_opts &= ~XDF_DIFF_ALGORITHM_MASK;
 		options->xdl_opts |= value;
 		return argcount;
+	} else if (skip_prefix(arg, "--anchored=", &arg)) {
+		options->xdl_opts = DIFF_WITH_ALG(options, PATIENCE_DIFF);
+		ALLOC_GROW(options->anchors, options->anchors_nr + 1,
+			   options->anchors_alloc);
+		options->anchors[options->anchors_nr++] = xstrdup(arg);
 	}
 
 	/* flags options */
diff --git a/diff.h b/diff.h
index 0fb18dd73..7cf276f07 100644
--- a/diff.h
+++ b/diff.h
@@ -166,6 +166,10 @@ struct diff_options {
 	const char *stat_sep;
 	long xdl_opts;
 
+	/* see Documentation/diff-options.txt */
+	char **anchors;
+	size_t anchors_nr, anchors_alloc;
+
 	int stat_width;
 	int stat_name_width;
 	int stat_graph_width;
diff --git a/t/t4064-diff-anchored.sh b/t/t4064-diff-anchored.sh
new file mode 100755
index 000000000..b3f510f04
--- /dev/null
+++ b/t/t4064-diff-anchored.sh
@@ -0,0 +1,94 @@
+#!/bin/sh
+
+test_description='anchored diff algorithm'
+
+. ./test-lib.sh
+
+test_expect_success '--anchored' '
+	printf "a\nb\nc\n" >pre &&
+	printf "c\na\nb\n" >post &&
+
+	# normally, c is moved to produce the smallest diff
+	test_expect_code 1 git diff --no-index pre post >diff &&
+	grep "^+c" diff &&
+
+	# with anchor, a is moved
+	test_expect_code 1 git diff --no-index --anchored=c pre post >diff &&
+	grep "^+a" diff
+'
+
+test_expect_success '--anchored multiple' '
+	printf "a\nb\nc\nd\ne\nf\n" >pre &&
+	printf "c\na\nb\nf\nd\ne\n" >post &&
+
+	# with 1 anchor, c is not moved, but f is moved
+	test_expect_code 1 git diff --no-index --anchored=c pre post >diff &&
+	grep "^+a" diff && # a is moved instead of c
+	grep "^+f" diff &&
+
+	# with 2 anchors, c and f are not moved
+	test_expect_code 1 git diff --no-index --anchored=c --anchored=f pre post >diff &&
+	grep "^+a" diff &&
+	grep "^+d" diff # d is moved instead of f
+'
+
+test_expect_success '--anchored with nonexistent line has no effect' '
+	printf "a\nb\nc\n" >pre &&
+	printf "c\na\nb\n" >post &&
+
+	test_expect_code 1 git diff --no-index --anchored=x pre post >diff &&
+	grep "^+c" diff
+'
+
+test_expect_success '--anchored with non-unique line has no effect' '
+	printf "a\nb\nc\nd\ne\nc\n" >pre &&
+	printf "c\na\nb\nc\nd\ne\n" >post &&
+
+	test_expect_code 1 git diff --no-index --anchored=c pre post >diff &&
+	grep "^+c" diff
+'
+
+test_expect_success 'diff still produced with impossible multiple --anchored' '
+	printf "a\nb\nc\n" >pre &&
+	printf "c\na\nb\n" >post &&
+
+	test_expect_code 1 git diff --no-index --anchored=a --anchored=c pre post >diff &&
+	mv post expected_post &&
+
+	# Ensure that the diff is correct by applying it and then
+	# comparing the result with the original
+	git apply diff &&
+	diff expected_post post
+'
+
+test_expect_success 'later algorithm arguments override earlier ones' '
+	printf "a\nb\nc\n" >pre &&
+	printf "c\na\nb\n" >post &&
+
+	test_expect_code 1 git diff --no-index --patience --anchored=c pre post >diff &&
+	grep "^+a" diff &&
+
+	test_expect_code 1 git diff --no-index --anchored=c --patience pre post >diff &&
+	grep "^+c" diff &&
+
+	test_expect_code 1 git diff --no-index --histogram --anchored=c pre post >diff &&
+	grep "^+a" diff &&
+
+	test_expect_code 1 git diff --no-index --anchored=c --histogram pre post >diff &&
+	grep "^+c" diff
+'
+
+test_expect_success '--anchored works with other commands like "git show"' '
+	printf "a\nb\nc\n" >file &&
+	git add file &&
+	git commit -m foo &&
+	printf "c\na\nb\n" >file &&
+	git add file &&
+	git commit -m foo &&
+
+	# with anchor, a is moved
+	git show --patience --anchored=c >diff &&
+	grep "^+a" diff
+'
+
+test_done
diff --git a/xdiff/xdiff.h b/xdiff/xdiff.h
index 785ecb089..e6217e1d7 100644
--- a/xdiff/xdiff.h
+++ b/xdiff/xdiff.h
@@ -80,6 +80,10 @@ typedef struct s_mmbuffer {
 
 typedef struct s_xpparam {
 	unsigned long flags;
+
+	/* See Documentation/diff-options.txt. */
+	char **anchors;
+	size_t anchors_nr;
 } xpparam_t;
 
 typedef struct s_xdemitcb {
diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c
index a44e77632..f3573d9f0 100644
--- a/xdiff/xpatience.c
+++ b/xdiff/xpatience.c
@@ -62,6 +62,12 @@ struct hashmap {
 		 * initially, "next" reflects only the order in file1.
 		 */
 		struct entry *next, *previous;
+
+		/*
+		 * If 1, this entry can serve as an anchor. See
+		 * Documentation/diff-options.txt for more information.
+		 */
+		unsigned anchor : 1;
 	} *entries, *first, *last;
 	/* were common records found? */
 	unsigned long has_matches;
@@ -70,8 +76,19 @@ struct hashmap {
 	xpparam_t const *xpp;
 };
 
+static int is_anchor(xpparam_t const *xpp, const char *line)
+{
+	int i;
+	for (i = 0; i < xpp->anchors_nr; i++) {
+		if (!strncmp(line, xpp->anchors[i], strlen(xpp->anchors[i])))
+			return 1;
+	}
+	return 0;
+}
+
 /* The argument "pass" is 1 for the first file, 2 for the second. */
-static void insert_record(int line, struct hashmap *map, int pass)
+static void insert_record(xpparam_t const *xpp, int line, struct hashmap *map,
+			  int pass)
 {
 	xrecord_t **records = pass == 1 ?
 		map->env->xdf1.recs : map->env->xdf2.recs;
@@ -110,6 +127,7 @@ static void insert_record(int line, struct hashmap *map, int pass)
 		return;
 	map->entries[index].line1 = line;
 	map->entries[index].hash = record->ha;
+	map->entries[index].anchor = is_anchor(xpp, map->env->xdf1.recs[line - 1]->ptr);
 	if (!map->first)
 		map->first = map->entries + index;
 	if (map->last) {
@@ -147,11 +165,11 @@ static int fill_hashmap(mmfile_t *file1, mmfile_t *file2,
 
 	/* First, fill with entries from the first file */
 	while (count1--)
-		insert_record(line1++, result, 1);
+		insert_record(xpp, line1++, result, 1);
 
 	/* Then search for matches in the second file */
 	while (count2--)
-		insert_record(line2++, result, 2);
+		insert_record(xpp, line2++, result, 2);
 
 	return 0;
 }
@@ -192,14 +210,28 @@ static struct entry *find_longest_common_sequence(struct hashmap *map)
 	int longest = 0, i;
 	struct entry *entry;
 
+	/*
+	 * If not -1, this entry in sequence must never be overridden.
+	 * Therefore, overriding entries before this has no effect, so
+	 * do not do that either.
+	 */
+	int anchor_i = -1;
+
 	for (entry = map->first; entry; entry = entry->next) {
 		if (!entry->line2 || entry->line2 == NON_UNIQUE)
 			continue;
 		i = binary_search(sequence, longest, entry);
 		entry->previous = i < 0 ? NULL : sequence[i];
-		sequence[++i] = entry;
-		if (i == longest)
+		++i;
+		if (i <= anchor_i)
+			continue;
+		sequence[i] = entry;
+		if (entry->anchor) {
+			anchor_i = i;
+			longest = anchor_i + 1;
+		} else if (i == longest) {
 			longest++;
+		}
 	}
 
 	/* No common unique lines were found */
-- 
2.15.0.417.g466bffb3ac-goog


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] diff: support anchoring line(s)
  2017-11-27 19:47   ` [PATCH v2] diff: " Jonathan Tan
@ 2017-11-28  1:38     ` Junio C Hamano
  2017-11-28 18:47       ` [PATCH v3] " Jonathan Tan
  0 siblings, 1 reply; 15+ messages in thread
From: Junio C Hamano @ 2017-11-28  1:38 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, sbeller

Jonathan Tan <jonathantanmy@google.com> writes:

> -	else if (!strcmp(arg, "--patience"))
> +	else if (!strcmp(arg, "--patience")) {
> +		int i;
>  		options->xdl_opts = DIFF_WITH_ALG(options, PATIENCE_DIFF);
> -	else if (!strcmp(arg, "--histogram"))
> +		/*
> +		 * Both --patience and --anchored use PATIENCE_DIFF
> +		 * internally, so remove any anchors previously
> +		 * specified.
> +		 */
> +		for (i = 0; i < options->anchors_nr; i++)
> +			free(options->anchors[i]);
> +		options->anchors_nr = 0;

This makes sense, but "--diff-algorithm=patience" would want to do
the same, I suspect, so the loop would want to become a little
helper function "clear_patience_anchors(options)" or something like
that.

> diff --git a/t/t4064-diff-anchored.sh b/t/t4064-diff-anchored.sh
> new file mode 100755
> index 000000000..b3f510f04
> --- /dev/null
> +++ b/t/t4064-diff-anchored.sh
> @@ -0,0 +1,94 @@
> +#!/bin/sh
> +
> +test_description='anchored diff algorithm'
> +
> +. ./test-lib.sh
> +
> +test_expect_success '--anchored' '
> +	printf "a\nb\nc\n" >pre &&
> +	printf "c\na\nb\n" >post &&

This may be too little to matter, but I'd find

	printf "%s\n" a b c >pre

vastly easier to read.  Or perhaps just use

	test_write_lines a b c >pre


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v3] diff: support anchoring line(s)
  2017-11-28  1:38     ` Junio C Hamano
@ 2017-11-28 18:47       ` Jonathan Tan
  2017-11-30  0:36         ` Johannes Schindelin
  0 siblings, 1 reply; 15+ messages in thread
From: Jonathan Tan @ 2017-11-28 18:47 UTC (permalink / raw)
  To: gitster, git; +Cc: jonathantanmy, sbeller

Teach diff a new algorithm, one that attempts to prevent user-specified
lines from appearing as a deletion or addition in the end result. The
end user can use this by specifying "--anchored=<text>" one or more
times when using Git commands like "diff" and "show".

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
v3 addresses Junio's comments.

> This makes sense, but "--diff-algorithm=patience" would want to do
> the same, I suspect, so the loop would want to become a little
> helper function "clear_patience_anchors(options)" or something like
> that.

Done. Also updated a test to test the interaction of
"--diff-algorithm=patience" with "--anchored=<>".

> This may be too little to matter, but I'd find
>
> 	printf "%s\n" a b c >pre
>
> vastly easier to read.  Or perhaps just use
>
> 	test_write_lines a b c >pre

Thanks for the pointer to test_write_lines - I used that and it is
indeed clearer.
---
 Documentation/diff-options.txt |  10 +++++
 diff.c                         |  31 ++++++++++++-
 diff.h                         |   4 ++
 t/t4064-diff-anchored.sh       | 100 +++++++++++++++++++++++++++++++++++++++++
 xdiff/xdiff.h                  |   4 ++
 xdiff/xpatience.c              |  42 ++++++++++++++---
 6 files changed, 184 insertions(+), 7 deletions(-)
 create mode 100755 t/t4064-diff-anchored.sh

diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt
index dd0dba5b1..6ce39fb69 100644
--- a/Documentation/diff-options.txt
+++ b/Documentation/diff-options.txt
@@ -80,6 +80,16 @@ endif::git-format-patch[]
 --histogram::
 	Generate a diff using the "histogram diff" algorithm.
 
+--anchored=<text>::
+	Generate a diff using the "anchored diff" algorithm.
++
+This option may be specified more than once.
++
+If a line exists in both the source and destination, exists only once,
+and starts with this text, this algorithm attempts to prevent it from
+appearing as a deletion or addition in the output. It uses the "patience
+diff" algorithm internally.
+
 --diff-algorithm={patience|minimal|histogram|myers}::
 	Choose a diff algorithm. The variants are as follows:
 +
diff --git a/diff.c b/diff.c
index 0763e8926..7149ff627 100644
--- a/diff.c
+++ b/diff.c
@@ -3210,6 +3210,8 @@ static void builtin_diff(const char *name_a,
 		ecbdata.opt = o;
 		ecbdata.header = header.len ? &header : NULL;
 		xpp.flags = o->xdl_opts;
+		xpp.anchors = o->anchors;
+		xpp.anchors_nr = o->anchors_nr;
 		xecfg.ctxlen = o->context;
 		xecfg.interhunkctxlen = o->interhunkcontext;
 		xecfg.flags = XDL_EMIT_FUNCNAMES;
@@ -3302,6 +3304,8 @@ static void builtin_diffstat(const char *name_a, const char *name_b,
 		memset(&xpp, 0, sizeof(xpp));
 		memset(&xecfg, 0, sizeof(xecfg));
 		xpp.flags = o->xdl_opts;
+		xpp.anchors = o->anchors;
+		xpp.anchors_nr = o->anchors_nr;
 		xecfg.ctxlen = o->context;
 		xecfg.interhunkctxlen = o->interhunkcontext;
 		if (xdi_diff_outf(&mf1, &mf2, diffstat_consume, diffstat,
@@ -4487,6 +4491,21 @@ static int parse_ws_error_highlight_opt(struct diff_options *opt, const char *ar
 	return 1;
 }
 
+/*
+ * Clear the anchors configured for the PATIENCE_DIFF algorithm.
+ *
+ * This must be called whenever a command-line argument indicating that
+ * the patience algorithm is to be used is seen, because the patience
+ * and anchored algorithms both use PATIENCE_DIFF internally.
+ */
+static void clear_patience_anchors(struct diff_options *options)
+{
+	int i;
+	for (i = 0; i < options->anchors_nr; i++)
+		free(options->anchors[i]);
+	options->anchors_nr = 0;
+}
+
 int diff_opt_parse(struct diff_options *options,
 		   const char **av, int ac, const char *prefix)
 {
@@ -4594,9 +4613,10 @@ int diff_opt_parse(struct diff_options *options,
 		DIFF_XDL_SET(options, INDENT_HEURISTIC);
 	else if (!strcmp(arg, "--no-indent-heuristic"))
 		DIFF_XDL_CLR(options, INDENT_HEURISTIC);
-	else if (!strcmp(arg, "--patience"))
+	else if (!strcmp(arg, "--patience")) {
 		options->xdl_opts = DIFF_WITH_ALG(options, PATIENCE_DIFF);
-	else if (!strcmp(arg, "--histogram"))
+		clear_patience_anchors(options);
+	} else if (!strcmp(arg, "--histogram"))
 		options->xdl_opts = DIFF_WITH_ALG(options, HISTOGRAM_DIFF);
 	else if ((argcount = parse_long_opt("diff-algorithm", av, &optarg))) {
 		long value = parse_algorithm_value(optarg);
@@ -4607,7 +4627,14 @@ int diff_opt_parse(struct diff_options *options,
 		DIFF_XDL_CLR(options, NEED_MINIMAL);
 		options->xdl_opts &= ~XDF_DIFF_ALGORITHM_MASK;
 		options->xdl_opts |= value;
+		if (value == XDF_PATIENCE_DIFF)
+			clear_patience_anchors(options);
 		return argcount;
+	} else if (skip_prefix(arg, "--anchored=", &arg)) {
+		options->xdl_opts = DIFF_WITH_ALG(options, PATIENCE_DIFF);
+		ALLOC_GROW(options->anchors, options->anchors_nr + 1,
+			   options->anchors_alloc);
+		options->anchors[options->anchors_nr++] = xstrdup(arg);
 	}
 
 	/* flags options */
diff --git a/diff.h b/diff.h
index 0fb18dd73..7cf276f07 100644
--- a/diff.h
+++ b/diff.h
@@ -166,6 +166,10 @@ struct diff_options {
 	const char *stat_sep;
 	long xdl_opts;
 
+	/* see Documentation/diff-options.txt */
+	char **anchors;
+	size_t anchors_nr, anchors_alloc;
+
 	int stat_width;
 	int stat_name_width;
 	int stat_graph_width;
diff --git a/t/t4064-diff-anchored.sh b/t/t4064-diff-anchored.sh
new file mode 100755
index 000000000..d51ca2c35
--- /dev/null
+++ b/t/t4064-diff-anchored.sh
@@ -0,0 +1,100 @@
+#!/bin/sh
+
+test_description='anchored diff algorithm'
+
+. ./test-lib.sh
+
+test_expect_success '--anchored' '
+	test_write_lines a b c >pre &&
+	test_write_lines c a b >post &&
+
+	# normally, c is moved to produce the smallest diff
+	test_expect_code 1 git diff --no-index pre post >diff &&
+	grep "^+c" diff &&
+
+	# with anchor, a is moved
+	test_expect_code 1 git diff --no-index --anchored=c pre post >diff &&
+	grep "^+a" diff
+'
+
+test_expect_success '--anchored multiple' '
+	test_write_lines a b c d e f >pre &&
+	test_write_lines c a b f d e >post &&
+
+	# with 1 anchor, c is not moved, but f is moved
+	test_expect_code 1 git diff --no-index --anchored=c pre post >diff &&
+	grep "^+a" diff && # a is moved instead of c
+	grep "^+f" diff &&
+
+	# with 2 anchors, c and f are not moved
+	test_expect_code 1 git diff --no-index --anchored=c --anchored=f pre post >diff &&
+	grep "^+a" diff &&
+	grep "^+d" diff # d is moved instead of f
+'
+
+test_expect_success '--anchored with nonexistent line has no effect' '
+	test_write_lines a b c >pre &&
+	test_write_lines c a b >post &&
+
+	test_expect_code 1 git diff --no-index --anchored=x pre post >diff &&
+	grep "^+c" diff
+'
+
+test_expect_success '--anchored with non-unique line has no effect' '
+	test_write_lines a b c d e c >pre &&
+	test_write_lines c a b c d e >post &&
+
+	test_expect_code 1 git diff --no-index --anchored=c pre post >diff &&
+	grep "^+c" diff
+'
+
+test_expect_success 'diff still produced with impossible multiple --anchored' '
+	test_write_lines a b c >pre &&
+	test_write_lines c a b >post &&
+
+	test_expect_code 1 git diff --no-index --anchored=a --anchored=c pre post >diff &&
+	mv post expected_post &&
+
+	# Ensure that the diff is correct by applying it and then
+	# comparing the result with the original
+	git apply diff &&
+	diff expected_post post
+'
+
+test_expect_success 'later algorithm arguments override earlier ones' '
+	test_write_lines a b c >pre &&
+	test_write_lines c a b >post &&
+
+	test_expect_code 1 git diff --no-index --patience --anchored=c pre post >diff &&
+	grep "^+a" diff &&
+
+	test_expect_code 1 git diff --no-index --anchored=c --patience pre post >diff &&
+	grep "^+c" diff &&
+
+	test_expect_code 1 git diff --no-index --histogram --anchored=c pre post >diff &&
+	grep "^+a" diff &&
+
+	test_expect_code 1 git diff --no-index --anchored=c --histogram pre post >diff &&
+	grep "^+c" diff &&
+
+	test_expect_code 1 git diff --no-index --diff-algorithm=patience --anchored=c pre post >diff &&
+	grep "^+a" diff &&
+
+	test_expect_code 1 git diff --no-index --anchored=c --diff-algorithm=patience pre post >diff &&
+	grep "^+c" diff
+'
+
+test_expect_success '--anchored works with other commands like "git show"' '
+	test_write_lines a b c >file &&
+	git add file &&
+	git commit -m foo &&
+	test_write_lines c a b >file &&
+	git add file &&
+	git commit -m foo &&
+
+	# with anchor, a is moved
+	git show --patience --anchored=c >diff &&
+	grep "^+a" diff
+'
+
+test_done
diff --git a/xdiff/xdiff.h b/xdiff/xdiff.h
index 785ecb089..e6217e1d7 100644
--- a/xdiff/xdiff.h
+++ b/xdiff/xdiff.h
@@ -80,6 +80,10 @@ typedef struct s_mmbuffer {
 
 typedef struct s_xpparam {
 	unsigned long flags;
+
+	/* See Documentation/diff-options.txt. */
+	char **anchors;
+	size_t anchors_nr;
 } xpparam_t;
 
 typedef struct s_xdemitcb {
diff --git a/xdiff/xpatience.c b/xdiff/xpatience.c
index a44e77632..f3573d9f0 100644
--- a/xdiff/xpatience.c
+++ b/xdiff/xpatience.c
@@ -62,6 +62,12 @@ struct hashmap {
 		 * initially, "next" reflects only the order in file1.
 		 */
 		struct entry *next, *previous;
+
+		/*
+		 * If 1, this entry can serve as an anchor. See
+		 * Documentation/diff-options.txt for more information.
+		 */
+		unsigned anchor : 1;
 	} *entries, *first, *last;
 	/* were common records found? */
 	unsigned long has_matches;
@@ -70,8 +76,19 @@ struct hashmap {
 	xpparam_t const *xpp;
 };
 
+static int is_anchor(xpparam_t const *xpp, const char *line)
+{
+	int i;
+	for (i = 0; i < xpp->anchors_nr; i++) {
+		if (!strncmp(line, xpp->anchors[i], strlen(xpp->anchors[i])))
+			return 1;
+	}
+	return 0;
+}
+
 /* The argument "pass" is 1 for the first file, 2 for the second. */
-static void insert_record(int line, struct hashmap *map, int pass)
+static void insert_record(xpparam_t const *xpp, int line, struct hashmap *map,
+			  int pass)
 {
 	xrecord_t **records = pass == 1 ?
 		map->env->xdf1.recs : map->env->xdf2.recs;
@@ -110,6 +127,7 @@ static void insert_record(int line, struct hashmap *map, int pass)
 		return;
 	map->entries[index].line1 = line;
 	map->entries[index].hash = record->ha;
+	map->entries[index].anchor = is_anchor(xpp, map->env->xdf1.recs[line - 1]->ptr);
 	if (!map->first)
 		map->first = map->entries + index;
 	if (map->last) {
@@ -147,11 +165,11 @@ static int fill_hashmap(mmfile_t *file1, mmfile_t *file2,
 
 	/* First, fill with entries from the first file */
 	while (count1--)
-		insert_record(line1++, result, 1);
+		insert_record(xpp, line1++, result, 1);
 
 	/* Then search for matches in the second file */
 	while (count2--)
-		insert_record(line2++, result, 2);
+		insert_record(xpp, line2++, result, 2);
 
 	return 0;
 }
@@ -192,14 +210,28 @@ static struct entry *find_longest_common_sequence(struct hashmap *map)
 	int longest = 0, i;
 	struct entry *entry;
 
+	/*
+	 * If not -1, this entry in sequence must never be overridden.
+	 * Therefore, overriding entries before this has no effect, so
+	 * do not do that either.
+	 */
+	int anchor_i = -1;
+
 	for (entry = map->first; entry; entry = entry->next) {
 		if (!entry->line2 || entry->line2 == NON_UNIQUE)
 			continue;
 		i = binary_search(sequence, longest, entry);
 		entry->previous = i < 0 ? NULL : sequence[i];
-		sequence[++i] = entry;
-		if (i == longest)
+		++i;
+		if (i <= anchor_i)
+			continue;
+		sequence[i] = entry;
+		if (entry->anchor) {
+			anchor_i = i;
+			longest = anchor_i + 1;
+		} else if (i == longest) {
 			longest++;
+		}
 	}
 
 	/* No common unique lines were found */
-- 
2.15.0.417.g466bffb3ac-goog


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v3] diff: support anchoring line(s)
  2017-11-28 18:47       ` [PATCH v3] " Jonathan Tan
@ 2017-11-30  0:36         ` Johannes Schindelin
  2017-11-30 23:26           ` Jonathan Tan
  0 siblings, 1 reply; 15+ messages in thread
From: Johannes Schindelin @ 2017-11-30  0:36 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: gitster, git, sbeller

Hi Jonathan,

On Tue, 28 Nov 2017, Jonathan Tan wrote:

> @@ -4607,7 +4627,14 @@ int diff_opt_parse(struct diff_options *options,
>  		DIFF_XDL_CLR(options, NEED_MINIMAL);
>  		options->xdl_opts &= ~XDF_DIFF_ALGORITHM_MASK;
>  		options->xdl_opts |= value;
> +		if (value == XDF_PATIENCE_DIFF)
> +			clear_patience_anchors(options);
>  		return argcount;
> +	} else if (skip_prefix(arg, "--anchored=", &arg)) {
> +		options->xdl_opts = DIFF_WITH_ALG(options, PATIENCE_DIFF);
> +		ALLOC_GROW(options->anchors, options->anchors_nr + 1,
> +			   options->anchors_alloc);
> +		options->anchors[options->anchors_nr++] = xstrdup(arg);

I looked and failed to find the code that releases this array after the
diff is done... did I miss anything?

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3] diff: support anchoring line(s)
  2017-11-30  0:36         ` Johannes Schindelin
@ 2017-11-30 23:26           ` Jonathan Tan
  2017-12-04 19:45             ` Stefan Beller
  0 siblings, 1 reply; 15+ messages in thread
From: Jonathan Tan @ 2017-11-30 23:26 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: gitster, git, sbeller

On Thu, 30 Nov 2017 01:36:37 +0100 (CET)
Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:

> Hi Jonathan,
> 
> On Tue, 28 Nov 2017, Jonathan Tan wrote:
> 
> > @@ -4607,7 +4627,14 @@ int diff_opt_parse(struct diff_options *options,
> >  		DIFF_XDL_CLR(options, NEED_MINIMAL);
> >  		options->xdl_opts &= ~XDF_DIFF_ALGORITHM_MASK;
> >  		options->xdl_opts |= value;
> > +		if (value == XDF_PATIENCE_DIFF)
> > +			clear_patience_anchors(options);
> >  		return argcount;
> > +	} else if (skip_prefix(arg, "--anchored=", &arg)) {
> > +		options->xdl_opts = DIFF_WITH_ALG(options, PATIENCE_DIFF);
> > +		ALLOC_GROW(options->anchors, options->anchors_nr + 1,
> > +			   options->anchors_alloc);
> > +		options->anchors[options->anchors_nr++] = xstrdup(arg);
> 
> I looked and failed to find the code that releases this array after the
> diff is done... did I miss anything?

You didn't miss anything. As far as I can tell, occurrences of struct
diff_options live throughout the lifetime of an invocation of Git and
are not freed. (Even if the struct itself is allocated on the stack, its
pathspec has some elements allocated on the heap.)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v3] diff: support anchoring line(s)
  2017-11-30 23:26           ` Jonathan Tan
@ 2017-12-04 19:45             ` Stefan Beller
  0 siblings, 0 replies; 15+ messages in thread
From: Stefan Beller @ 2017-12-04 19:45 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Johannes Schindelin, Junio C Hamano, git

On Thu, Nov 30, 2017 at 3:26 PM, Jonathan Tan <jonathantanmy@google.com> wrote:
> On Thu, 30 Nov 2017 01:36:37 +0100 (CET)
> Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote:
>
>> Hi Jonathan,
>>
>> On Tue, 28 Nov 2017, Jonathan Tan wrote:
>>
>> > @@ -4607,7 +4627,14 @@ int diff_opt_parse(struct diff_options *options,
>> >             DIFF_XDL_CLR(options, NEED_MINIMAL);
>> >             options->xdl_opts &= ~XDF_DIFF_ALGORITHM_MASK;
>> >             options->xdl_opts |= value;
>> > +           if (value == XDF_PATIENCE_DIFF)
>> > +                   clear_patience_anchors(options);
>> >             return argcount;
>> > +   } else if (skip_prefix(arg, "--anchored=", &arg)) {
>> > +           options->xdl_opts = DIFF_WITH_ALG(options, PATIENCE_DIFF);
>> > +           ALLOC_GROW(options->anchors, options->anchors_nr + 1,
>> > +                      options->anchors_alloc);
>> > +           options->anchors[options->anchors_nr++] = xstrdup(arg);
>>
>> I looked and failed to find the code that releases this array after the
>> diff is done... did I miss anything?
>
> You didn't miss anything. As far as I can tell, occurrences of struct
> diff_options live throughout the lifetime of an invocation of Git and
> are not freed. (Even if the struct itself is allocated on the stack, its
> pathspec has some elements allocated on the heap.)

I thought at least for the intra-line word diff, which allocates its own
diff options struct, we'd clear them, but looking around we seem to leak
the diff options consistently. (builtin/blame.c is nice enough to
`clear_pathspec(&diff_opts.pathspec)`, but that's about it, no other
command takes care of the cruft.

I wonder if diff_flush might be a good place to clean up the diff options
and after the flush the diff options are declared invalid?

> test_expect_success '--anchored with non-unique line has no effect'

okay, if it turns out we need this case in the future we can still come up
with ideas.

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2017-12-04 19:45 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-21 22:17 [RFC PATCH] xdiff/xpatience: support anchoring a line Jonathan Tan
2017-11-21 23:28 ` Stefan Beller
2017-11-22  2:27 ` Junio C Hamano
2017-11-22 23:41 ` [PATCH] xdiff/xpatience: support anchoring line(s) Jonathan Tan
2017-11-23  0:15   ` Stefan Beller
2017-11-23  2:05   ` Junio C Hamano
2017-11-23  2:16     ` Junio C Hamano
2017-11-23  2:47       ` Junio C Hamano
2017-11-27 18:30         ` Jonathan Tan
2017-11-27 19:47   ` [PATCH v2] diff: " Jonathan Tan
2017-11-28  1:38     ` Junio C Hamano
2017-11-28 18:47       ` [PATCH v3] " Jonathan Tan
2017-11-30  0:36         ` Johannes Schindelin
2017-11-30 23:26           ` Jonathan Tan
2017-12-04 19:45             ` Stefan Beller

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).