[PATCH] Documentation/diff-options: explain different diff algorithms

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [PATCH] Documentation/diff-options: explain different diff algorithms
@ 2018-07-24  0:36 Stefan Beller
  2018-07-24  4:40 ` Jonathan Nieder
  2018-08-10  0:10 ` [PATCH 0/2] Getting data on different diff algorithms WAS: " Stefan Beller
  0 siblings, 2 replies; 13+ messages in thread
From: Stefan Beller @ 2018-07-24  0:36 UTC (permalink / raw)
  To: git; +Cc: Stefan Beller

As a user I wondered what the diff algorithms are about. Offer at least
a basic explanation on the differences of the diff algorithms.

Signed-off-by: Stefan Beller <sbeller@google.com>
---
 Documentation/diff-options.txt | 32 +++++++++++++++++++++++++-------
 1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt
index f466600972f..0d765482027 100644
--- a/Documentation/diff-options.txt
+++ b/Documentation/diff-options.txt
@@ -94,16 +94,34 @@ diff" algorithm internally.
 	Choose a diff algorithm. The variants are as follows:
 +
 --
-`default`, `myers`;;
-	The basic greedy diff algorithm. Currently, this is the default.
 `minimal`;;
-	Spend extra time to make sure the smallest possible diff is
-	produced.
+	A diff as produced by the basic greedy algorithm described in
+	link:http://www.xmailserver.org/diff2.pdf[An O(ND) Difference Algorithm and its Variations]
+`default`, `myers`;;
+	The same algorithm as `minimal`, extended with a heuristic to
+	reduce extensive searches. Currently, this is the default.
 `patience`;;
-	Use "patience diff" algorithm when generating patches.
+	Use "patience diff" algorithm when generating patches. This
+	matches the longest common subsequence of unique lines on
+	both sides, recursively. It obtained its name by the way the
+	longest subsequence is found, as that is a byproduct of the
+	patience sorting algorithm. If there are no unique lines left
+	it falls back to `myers`. Empirically this algorithm produces
+	a more readable output for code, but it does not garantuee
+	the shortest output.
 `histogram`;;
-	This algorithm extends the patience algorithm to "support
-	low-occurrence common elements".
+	This algorithm re-implements the `patience` algorithm with
+	"support of low-occurrence common elements" and only picks
+	one element of the LCS for the recursion. It is often the
+	fastest, but in cornercases (when there are many longest
+	common subsequences of the same length) it produces bad
+	results as seen in:
+
+		seq 1 100 >one
+		echo 99 > two
+		seq 1 2 98 >>two
+		git diff --no-index --histogram one two
+
 --
 +
 For instance, if you configured diff.algorithm variable to a
-- 
2.18.0.345.g5c9ce644c3-goog


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH] Documentation/diff-options: explain different diff algorithms
  2018-07-24  0:36 [PATCH] Documentation/diff-options: explain different diff algorithms Stefan Beller
@ 2018-07-24  4:40 ` Jonathan Nieder
  2018-07-24 17:38   ` Stefan Beller
  2018-08-06 22:25   ` Stefan Beller
  2018-08-10  0:10 ` [PATCH 0/2] Getting data on different diff algorithms WAS: " Stefan Beller
  1 sibling, 2 replies; 13+ messages in thread
From: Jonathan Nieder @ 2018-07-24  4:40 UTC (permalink / raw)
  To: Stefan Beller; +Cc: git

Hi,

Stefan Beller wrote:

> As a user I wondered what the diff algorithms are about. Offer at least
> a basic explanation on the differences of the diff algorithms.

Interesting.  Let's see.

[...]
> --- a/Documentation/diff-options.txt
> +++ b/Documentation/diff-options.txt
> @@ -94,16 +94,34 @@ diff" algorithm internally.
>  	Choose a diff algorithm. The variants are as follows:
>  +
>  --
> -`default`, `myers`;;
> -	The basic greedy diff algorithm. Currently, this is the default.
>  `minimal`;;
> +	A diff as produced by the basic greedy algorithm described in

Why this reordering?

> +	link:http://www.xmailserver.org/diff2.pdf[An O(ND) Difference Algorithm and its Variations]

This URL doesn't seem likely to stay stable.  Are appearances
deceiving?  (Maybe so, since that's the same host as libxdiff's
homepage <http://www.xmailserver.org/xdiff-lib.html>.)  It might be
worth mentioning the author and date just in case.

> -	Spend extra time to make sure the smallest possible diff is
> -	produced.
> +`default`, `myers`;;
> +	The same algorithm as `minimal`, extended with a heuristic to
> +	reduce extensive searches. Currently, this is the default.

Amusing --- this means that the Myers diff is spelled as "minimal"
instead of "myers".

>  `patience`;;
> -	Use "patience diff" algorithm when generating patches.
> +	Use "patience diff" algorithm when generating patches. This
> +	matches the longest common subsequence of unique lines on
> +	both sides, recursively. It obtained its name by the way the
> +	longest subsequence is found, as that is a byproduct of the
> +	patience sorting algorithm. If there are no unique lines left
> +	it falls back to `myers`. Empirically this algorithm produces
> +	a more readable output for code, but it does not garantuee
> +	the shortest output.

Probably also worth mentioning that this was invented by Bram Cohen
for the bazaar vcs.

>  `histogram`;;
> -	This algorithm extends the patience algorithm to "support
> -	low-occurrence common elements".
> +	This algorithm re-implements the `patience` algorithm with

What does reimplements mean here?  Do you mean that it is a variation
on patience?

> +	"support of low-occurrence common elements" and only picks
> +	one element of the LCS for the recursion. It is often the

Does LCS mean longest common subsequence or longest common substring
here?  Probably best to spell it out to avoid confusion.

> +	fastest, but in cornercases (when there are many longest

s/cornercases/corner cases/

> +	common subsequences of the same length) it produces bad
> +	results as seen in:
> +
> +		seq 1 100 >one
> +		echo 99 > two
> +		seq 1 2 98 >>two
> +		git diff --no-index --histogram one two

I wonder if these details (about all the algorithms) should go in a
separate Diff Algorithms section and this section could focus on
what use cases warrant using each, referring to that section for
details.  What do you think?

Thanks and hope that helps,
Jonathan

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] Documentation/diff-options: explain different diff algorithms
  2018-07-24  4:40 ` Jonathan Nieder
@ 2018-07-24 17:38   ` Stefan Beller
  2018-07-24 20:06     ` Junio C Hamano
  2018-08-06 22:25   ` Stefan Beller
  1 sibling, 1 reply; 13+ messages in thread
From: Stefan Beller @ 2018-07-24 17:38 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: git

On Mon, Jul 23, 2018 at 9:41 PM Jonathan Nieder <jrnieder@gmail.com> wrote:
>
> Hi,
>
> Stefan Beller wrote:
>
> > As a user I wondered what the diff algorithms are about. Offer at least
> > a basic explanation on the differences of the diff algorithms.
>
> Interesting.  Let's see.
>
> [...]
> > --- a/Documentation/diff-options.txt
> > +++ b/Documentation/diff-options.txt
> > @@ -94,16 +94,34 @@ diff" algorithm internally.
> >       Choose a diff algorithm. The variants are as follows:
> >  +
> >  --
> > -`default`, `myers`;;
> > -     The basic greedy diff algorithm. Currently, this is the default.
> >  `minimal`;;
> > +     A diff as produced by the basic greedy algorithm described in
>
> Why this reordering?

because Myers (below) is constructed from minimal + heuristic.
Before this patch we say Myers is myers and minimal "spends
extra cycles", which is true if you read the code, as it just is

    for (...) {
        test_path
        if (need_min)
            continue;
        if (heuristic_good_enough())
            break;
    }

https://github.com/git/git/blob/master/xdiff/xdiffi.c#L152

So the "minimal" algorithm is the basic myers algorithm,
and the "myers" algorithm is the extended version (extended
with a heuristic to be fast in corner cases, still producing good
enough diffs).

>
> > +     link:http://www.xmailserver.org/diff2.pdf[An O(ND) Difference Algorithm and its Variations]
>
> This URL doesn't seem likely to stay stable.  Are appearances
> deceiving?  (Maybe so, since that's the same host as libxdiff's
> homepage <http://www.xmailserver.org/xdiff-lib.html>.)  It might be
> worth mentioning the author and date just in case.

I though about it and did not mention it, as the name
"An O(ND) Difference Algorithm and its Variations" is already
enough to find that paper quickly.

> > -     Spend extra time to make sure the smallest possible diff is
> > -     produced.
> > +`default`, `myers`;;
> > +     The same algorithm as `minimal`, extended with a heuristic to
> > +     reduce extensive searches. Currently, this is the default.
>
> Amusing --- this means that the Myers diff is spelled as "minimal"
> instead of "myers".
>
> >  `patience`;;
> > -     Use "patience diff" algorithm when generating patches.
> > +     Use "patience diff" algorithm when generating patches. This
> > +     matches the longest common subsequence of unique lines on
> > +     both sides, recursively. It obtained its name by the way the
> > +     longest subsequence is found, as that is a byproduct of the
> > +     patience sorting algorithm. If there are no unique lines left
> > +     it falls back to `myers`. Empirically this algorithm produces
> > +     a more readable output for code, but it does not garantuee
> > +     the shortest output.
>
> Probably also worth mentioning that this was invented by Bram Cohen
> for the bazaar vcs.

I tried to balance the part that reads like a history lesson and the
immediately useful part (why is this better, when should I use it?)

Mentioning bazaar probably makes sense. Not sure about the author,
but I'll include him in a reroll of this patch.

> >  `histogram`;;
> > -     This algorithm extends the patience algorithm to "support
> > -     low-occurrence common elements".
> > +     This algorithm re-implements the `patience` algorithm with
>
> What does reimplements mean here?  Do you mean that it is a variation
> on patience?

It is supposed to be a variation of patience, but as it comes with its
own implementation which I would not fully trust (as it was ported
from JGit, and reading the comments over there, a misunderstanding
of LCS made it possible to have only one midpoint before recursing)

So it is not just taking the patience algorithm and adding support for
"low-occurrence common elements"  (what does that even mean?
patience already distinguishes uniques and non-uniques), but
its own implementation, hence it is not bug-for-bug compatible.

> > +     "support of low-occurrence common elements" and only picks
> > +     one element of the LCS for the recursion. It is often the
>
> Does LCS mean longest common subsequence or longest common substring
> here?  Probably best to spell it out to avoid confusion.

When writing the patch I meant to refer to longest common subsequence
from above, but by picking just one element, this algorithm understands it
as longest common string.


>
> > +     fastest, but in cornercases (when there are many longest
>
> s/cornercases/corner cases/
>
> > +     common subsequences of the same length) it produces bad
> > +     results as seen in:
> > +
> > +             seq 1 100 >one
> > +             echo 99 > two
> > +             seq 1 2 98 >>two
> > +             git diff --no-index --histogram one two
>
> I wonder if these details (about all the algorithms) should go in a
> separate Diff Algorithms section and this section could focus on
> what use cases warrant using each, referring to that section for
> details.  What do you think?

Yeah I thought about putting together a
Documentation/technical/diffs.txt that talk about all kinds of diffing
(basic diff algorithms, options, how to diff trees), but considered
rolling this as a band aid until that document fully materializes.

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] Documentation/diff-options: explain different diff algorithms
  2018-07-24 17:38   ` Stefan Beller
@ 2018-07-24 20:06     ` Junio C Hamano
  0 siblings, 0 replies; 13+ messages in thread
From: Junio C Hamano @ 2018-07-24 20:06 UTC (permalink / raw)
  To: Stefan Beller; +Cc: Jonathan Nieder, git

Stefan Beller <sbeller@google.com> writes:

> On Mon, Jul 23, 2018 at 9:41 PM Jonathan Nieder <jrnieder@gmail.com> wrote:
>>
>> Hi,
>>
>> Stefan Beller wrote:
>>
>> > As a user I wondered what the diff algorithms are about. Offer at least
>> > a basic explanation on the differences of the diff algorithms.
>>
>> Interesting.  Let's see.
>>
>> [...]
>> > --- a/Documentation/diff-options.txt
>> > +++ b/Documentation/diff-options.txt
>> > @@ -94,16 +94,34 @@ diff" algorithm internally.
>> >       Choose a diff algorithm. The variants are as follows:
>> >  +
>> >  --
>> > -`default`, `myers`;;
>> > -     The basic greedy diff algorithm. Currently, this is the default.
>> >  `minimal`;;
>> > +     A diff as produced by the basic greedy algorithm described in
>>
>> Why this reordering?
>
> because Myers (below) is constructed from minimal + heuristic.
> ...
> So the "minimal" algorithm is the basic myers algorithm,
> and the "myers" algorithm is the extended version (extended
> with a heuristic to be fast in corner cases, still producing good
> enough diffs).

I am not sure that is a good reason for the target readers of this
document to move the default from the beginning to somewhere in the
middle.  For a textbook that explains different algorithms and gives
overview of how they work, that order may be appropriate, though.

>> > -     Spend extra time to make sure the smallest possible diff is
>> > -     produced.
>> > +`default`, `myers`;;
>> > +     The same algorithm as `minimal`, extended with a heuristic to
>> > +     reduce extensive searches. Currently, this is the default.
>>
>> Amusing --- this means that the Myers diff is spelled as "minimal"
>> instead of "myers".

Yeah, I had the same thought---for an end-user documentation update,
this change and the reordering does not feel like an improvement.

>> I wonder if these details (about all the algorithms) should go in a
>> separate Diff Algorithms section and this section could focus on
>> what use cases warrant using each, referring to that section for
>> details.  What do you think?

Good suggestion.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH] Documentation/diff-options: explain different diff algorithms
  2018-07-24  4:40 ` Jonathan Nieder
  2018-07-24 17:38   ` Stefan Beller
@ 2018-08-06 22:25   ` Stefan Beller
  2018-08-06 23:18     ` Jonathan Nieder
  1 sibling, 1 reply; 13+ messages in thread
From: Stefan Beller @ 2018-08-06 22:25 UTC (permalink / raw)
  To: jrnieder; +Cc: git, sbeller

As a user I wondered what the diff algorithms are about. Offer at least
a basic explanation on the differences of the diff algorithms.

Signed-off-by: Stefan Beller <sbeller@google.com>
---
 Documentation/diff-options.txt | 10 +++++++---
 Documentation/git-diff.txt     | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+), 3 deletions(-)

diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt
index f394608b42c..eae033a21ea 100644
--- a/Documentation/diff-options.txt
+++ b/Documentation/diff-options.txt
@@ -91,14 +91,18 @@ appearing as a deletion or addition in the output. It uses the "patience
 diff" algorithm internally.
 
 --diff-algorithm={patience|minimal|histogram|myers}::
-	Choose a diff algorithm. The variants are as follows:
+	Choose a diff algorithm. See the discussion of DIFF ALGORITHMS
+ifndef::git-diff[]
+	in linkgit:git-diff[1]
+endif::git-diff[]
+	. The variants are as follows:
 +
 --
 `default`, `myers`;;
 	The basic greedy diff algorithm. Currently, this is the default.
 `minimal`;;
-	Spend extra time to make sure the smallest possible diff is
-	produced.
+	The same algorithm as `myers`, but spend extra time to make
+	sure the smallest possible diff is produced.
 `patience`;;
 	Use "patience diff" algorithm when generating patches.
 `histogram`;;
diff --git a/Documentation/git-diff.txt b/Documentation/git-diff.txt
index b180f1fa5bf..b182389aaae 100644
--- a/Documentation/git-diff.txt
+++ b/Documentation/git-diff.txt
@@ -119,6 +119,40 @@ include::diff-options.txt[]
 
 include::diff-format.txt[]
 
+DIFF ALGORITHMS
+---------------
+`Myers`
+
+A diff as produced by the basic greedy algorithm described in
+link:http://www.xmailserver.org/diff2.pdf[An O(ND) Difference Algorithm and its Variations].
+with a run time of O(M + N + D^2). It employs a heuristic to allow for
+a faster diff at the small cost of diff size.
+The `minimal` algorithm has that heuristic turned off.
+
+`Patience`
+
+This algorithm by Bram Cohen matches the longest common subsequence
+of unique lines on both sides, recursively. It obtained its name by
+the way the longest subsequence is found, as that is a byproduct of
+the patience sorting algorithm. If there are no unique lines left
+it falls back to `myers`. Empirically this algorithm produces
+a more readable output for code, but it does not garantuee
+the shortest output.
+
+`Histogram`
+
+This algorithm finds the longest common substring and recursively
+diffs the content before and after the longest common substring.
+If there are no common substrings left, fallback to `myers`.
+This is often the fastest, but in corner cases (when there are
+many common substrings of the same length) it produces bad
+results as seen in:
+
+	seq 1 100 >one
+	echo 99 > two
+	seq 1 2 98 >>two
+	git diff --no-index --histogram one two
+
 EXAMPLES
 --------
 
-- 
2.18.0.597.ga71716f1ad-goog


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH] Documentation/diff-options: explain different diff algorithms
  2018-08-06 22:25   ` Stefan Beller
@ 2018-08-06 23:18     ` Jonathan Nieder
  2018-08-07 15:56       ` Junio C Hamano
  2018-08-09 19:51       ` Stefan Beller
  0 siblings, 2 replies; 13+ messages in thread
From: Jonathan Nieder @ 2018-08-06 23:18 UTC (permalink / raw)
  To: Stefan Beller; +Cc: git

Hi,

Stefan Beller wrote:
> --- a/Documentation/diff-options.txt
> +++ b/Documentation/diff-options.txt
> @@ -91,14 +91,18 @@ appearing as a deletion or addition in the output. It uses the "patience
>  diff" algorithm internally.
>  
>  --diff-algorithm={patience|minimal|histogram|myers}::
> -	Choose a diff algorithm. The variants are as follows:
> +	Choose a diff algorithm. See the discussion of DIFF ALGORITHMS
> +ifndef::git-diff[]
> +	in linkgit:git-diff[1]
> +endif::git-diff[]
> +	. The variants are as follows:

This means outside of git-diff(1), I'd see

	See the discussion of DIFF ALGORITHMS in git-diff(1) .

And in git-diff(1), I'd see

	See the discussion of DIFF ALGORITHMS .

Both don't seem quite right, since they have an extra space before the
period.  The git-diff(1) seems especially not quite right --- does it
intend to say something like "See the DIFF ALGORITHMS section for more
discussion"?

[...]
> --- a/Documentation/git-diff.txt
> +++ b/Documentation/git-diff.txt
> @@ -119,6 +119,40 @@ include::diff-options.txt[]
>  
>  include::diff-format.txt[]
>  
> +DIFF ALGORITHMS
> +---------------

Please add some introductory words about what the headings refer to.

> +`Myers`
> +
> +A diff as produced by the basic greedy algorithm described in
> +link:http://www.xmailserver.org/diff2.pdf[An O(ND) Difference Algorithm and its Variations].
> +with a run time of O(M + N + D^2). It employs a heuristic to allow for
> +a faster diff at the small cost of diff size.
> +The `minimal` algorithm has that heuristic turned off.
> +
> +`Patience`
> +
> +This algorithm by Bram Cohen matches the longest common subsequence
> +of unique lines on both sides, recursively. It obtained its name by
> +the way the longest subsequence is found, as that is a byproduct of
> +the patience sorting algorithm. If there are no unique lines left
> +it falls back to `myers`. Empirically this algorithm produces
> +a more readable output for code, but it does not garantuee

nit: s/garantuee/guarantee/

> +the shortest output.

Trivia: the `minimal` variant of Myers doesn't guarantee shortest
output, either: what it minimizes is the number of lines marked as
added or removed.  If you want to minimize context lines too, then
that would be a new variant. ;-)

[...]
> +`Histogram`
> +
> +This algorithm finds the longest common substring and recursively
> +diffs the content before and after the longest common substring.

optional: may be worth a short aside in the text about the distinction
between LCS and LCS. ;-)

It would be especially useful here, since the alphabet used in these
strings is *lines* instead of characters, so the first-time reader
could probably use some help in building their intuition.

> +If there are no common substrings left, fallback to `myers`.

nit: fallback is the noun, fall back is the verb.

> +This is often the fastest, but in corner cases (when there are
> +many common substrings of the same length) it produces bad

Can you clarify what "bad" means?  E.g. would "unexpected", or "poorly
aligned", match what you mean?

> +results as seen in:
> +
> +	seq 1 100 >one
> +	echo 99 > two
> +	seq 1 2 98 >>two
> +	git diff --no-index --histogram one two
> +
>  EXAMPLES
>  --------

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] Documentation/diff-options: explain different diff algorithms
  2018-08-06 23:18     ` Jonathan Nieder
@ 2018-08-07 15:56       ` Junio C Hamano
  2018-08-09 19:26         ` Stefan Beller
  2018-08-09 19:51       ` Stefan Beller
  1 sibling, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2018-08-07 15:56 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Stefan Beller, git

Jonathan Nieder <jrnieder@gmail.com> writes:

> Both don't seem quite right, since they have an extra space before the
> period.  The git-diff(1) seems especially not quite right --- does it
> intend to say something like "See the DIFF ALGORITHMS section for more
> discussion"?

Good suggestion and doing so would nicely allow side-stepping the
space before the dot issue, i.e. "See the DIFF ALGO section (in
git-diff(1)) for more discussion".   I like it.

> [...]
>> --- a/Documentation/git-diff.txt
>> +++ b/Documentation/git-diff.txt
>> @@ -119,6 +119,40 @@ include::diff-options.txt[]
>>  
>>  include::diff-format.txt[]
>>  
>> +DIFF ALGORITHMS
>> +---------------
>
> Please add some introductory words about what the headings refer to.
>
>> +`Myers`
>> +
>> +A diff as produced by the basic greedy algorithm described in
>> +link:http://www.xmailserver.org/diff2.pdf[An O(ND) Difference Algorithm and its Variations].
>> +...
>> +
>> +`Patience`
>> +
>> +This algorithm by Bram Cohen matches the longest common subsequence
>> +...
> [...]
>> +`Histogram`
>> +
>> +This algorithm finds the longest common substring and recursively
>> +diffs the content before and after the longest common substring.

As a first-time reader, I felt that it is a bit uneven that there is
no attribution for only this item, unlike the description for Myers
and Patience.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] Documentation/diff-options: explain different diff algorithms
  2018-08-07 15:56       ` Junio C Hamano
@ 2018-08-09 19:26         ` Stefan Beller
  2018-08-10 22:18           ` Stefan Beller
  0 siblings, 1 reply; 13+ messages in thread
From: Stefan Beller @ 2018-08-09 19:26 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Nieder, git

On Tue, Aug 7, 2018 at 8:56 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Jonathan Nieder <jrnieder@gmail.com> writes:
>
> > Both don't seem quite right, since they have an extra space before the
> > period.  The git-diff(1) seems especially not quite right --- does it
> > intend to say something like "See the DIFF ALGORITHMS section for more
> > discussion"?
>
> Good suggestion and doing so would nicely allow side-stepping the
> space before the dot issue, i.e. "See the DIFF ALGO section (in
> git-diff(1)) for more discussion".   I like it.

Me, too.

> >> +`Histogram`
> >> +
> >> +This algorithm finds the longest common substring and recursively
> >> +diffs the content before and after the longest common substring.
>
> As a first-time reader, I felt that it is a bit uneven that there is
> no attribution for only this item, unlike the description for Myers
> and Patience.

Yeah. That is because I spent too much time thinking how this algo is
flawed (in its design as the comments in the initial commit in JGit
indicate that design error when reading closely) when I was trying to
understand it deeply.

Maybe I should emphasize the trade off for performance gain
more as Shawn really cared about performance. After all it offers
the best performance in *many* common cases, but has a
"long tail" of really unfortunate results, too.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH] Documentation/diff-options: explain different diff algorithms
  2018-08-06 23:18     ` Jonathan Nieder
  2018-08-07 15:56       ` Junio C Hamano
@ 2018-08-09 19:51       ` Stefan Beller
  1 sibling, 0 replies; 13+ messages in thread
From: Stefan Beller @ 2018-08-09 19:51 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: git

On Mon, Aug 6, 2018 at 4:18 PM Jonathan Nieder <jrnieder@gmail.com> wrote:

> > +DIFF ALGORITHMS
> > +---------------
>
> Please add some introductory words about what the headings refer to.

ok.

>
> > +the shortest output.
>
> Trivia: the `minimal` variant of Myers doesn't guarantee shortest
> output, either: what it minimizes is the number of lines marked as
> added or removed.  If you want to minimize context lines too, then
> that would be a new variant. ;-)

... and take line length into account. ;-)

It minimizes the edit distance in terms of lines, i.e. in a context-less diff
we get the lowest number of lines possible.

> > +This algorithm finds the longest common substring and recursively
> > +diffs the content before and after the longest common substring.
>
> optional: may be worth a short aside in the text about the distinction
> between LCS and LCS. ;-)
>
> It would be especially useful here, since the alphabet used in these
> strings is *lines* instead of characters, so the first-time reader
> could probably use some help in building their intuition.

That makes sense.

>
> > +This is often the fastest, but in corner cases (when there are
> > +many common substrings of the same length) it produces bad
>
> Can you clarify what "bad" means?  E.g. would "unexpected", or "poorly
> aligned", match what you mean?

I'll just go with unexpected.

> > +results as seen in:
> > +
> > +     seq 1 100 >one
> > +     echo 99 > two
> > +     seq 1 2 98 >>two
> > +     git diff --no-index --histogram one two

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 0/2] Getting data on different diff algorithms WAS: Documentation/diff-options: explain different diff algorithms
  2018-07-24  0:36 [PATCH] Documentation/diff-options: explain different diff algorithms Stefan Beller
  2018-07-24  4:40 ` Jonathan Nieder
@ 2018-08-10  0:10 ` Stefan Beller
  2018-08-10  0:10   ` [PATCH 1/2] WIP: range-diff: take extra arguments for different diffs Stefan Beller
  2018-08-10  0:10   ` [PATCH 2/2] WIP range-diff: print some statistics about the range Stefan Beller
  1 sibling, 2 replies; 13+ messages in thread
From: Stefan Beller @ 2018-08-10  0:10 UTC (permalink / raw)
  To: sbeller; +Cc: git

I instrumented range-diff to use different diff algorithms for the patches,
(I chose default-myers and patience) and then run on the same range,
via ./git-range-diff v2.10.0..HEAD v2.10.0..HEAD
and I found 5245 same, 304 slightly different and 4 completely different
patches in that range.

Looking at the interdiff is not very pleasing even when reading it with
coloring and move detection.

Manually looking at them, I found the patience diff easier to review.

Comparing the 'default' diff algorithm to 'minimal', I'll see 
 5491 same, 58 slightly different and 0 completely different patches.

Comparing 'default' to 'histogram', I'll see
 5255 same, 294 slightly different and 8 completely different patches.

Comparing 'histogram' to 'patience', I'll see 
 5337 same, 212 slightly different and 10 completely different patches.

This is all to just put data out there, on how much difference to expect from
the diff algorithms. I have not yet dug into the quality of the diffs.

Stefan

Stefan Beller (2):
  WIP: range-diff: take extra arguments for different diffs.
  WIP range-diff: print some statistics about the range

 range-diff.c | 29 +++++++++++++++++++++++++----
 1 file changed, 25 insertions(+), 4 deletions(-)

-- 
2.18.0.597.ga71716f1ad-goog

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/2] WIP: range-diff: take extra arguments for different diffs.
  2018-08-10  0:10 ` [PATCH 0/2] Getting data on different diff algorithms WAS: " Stefan Beller
@ 2018-08-10  0:10   ` Stefan Beller
  2018-08-10  0:10   ` [PATCH 2/2] WIP range-diff: print some statistics about the range Stefan Beller
  1 sibling, 0 replies; 13+ messages in thread
From: Stefan Beller @ 2018-08-10  0:10 UTC (permalink / raw)
  To: sbeller; +Cc: git

We can use the range-diff on the same range to examine differences in the
diff algorithm.

Signed-off-by: Stefan Beller <sbeller@google.com>
---
 range-diff.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/range-diff.c b/range-diff.c
index 347b4a79f25..a977289b7dc 100644
--- a/range-diff.c
+++ b/range-diff.c
@@ -28,7 +28,8 @@ struct patch_util {
  * Reads the patches into a string list, with the `util` field being populated
  * as struct object_id (will need to be free()d).
  */
-static int read_patches(const char *range, struct string_list *list)
+static int read_patches(const char *range, struct string_list *list,
+			struct argv_array *extra_log_args)
 {
 	struct child_process cp = CHILD_PROCESS_INIT;
 	FILE *in;
@@ -36,7 +37,12 @@ static int read_patches(const char *range, struct string_list *list)
 	struct patch_util *util = NULL;
 	int in_header = 1;
 
-	argv_array_pushl(&cp.args, "log", "--no-color", "-p", "--no-merges",
+	argv_array_pushl(&cp.args, "log", "--no-color", "-p", "--no-merges", NULL);
+
+	if (extra_log_args)
+		argv_array_pushv(&cp.args, extra_log_args->argv);
+
+	argv_array_pushl(&cp.args,
 			"--reverse", "--date-order", "--decorate=no",
 			"--no-abbrev-commit", range,
 			NULL);
@@ -419,14 +425,20 @@ int show_range_diff(const char *range1, const char *range2,
 {
 	int res = 0;
 
+	struct argv_array extra = ARGV_ARRAY_INIT;
+
 	struct string_list branch1 = STRING_LIST_INIT_DUP;
 	struct string_list branch2 = STRING_LIST_INIT_DUP;
 
-	if (read_patches(range1, &branch1))
+	argv_array_push(&extra, "--diff-algorithm=patience");
+	argv_array_push(&extra, "--indent-heuristic");
+
+	if (read_patches(range1, &branch1, NULL))
 		res = error(_("could not parse log for '%s'"), range1);
-	if (!res && read_patches(range2, &branch2))
+	if (!res && read_patches(range2, &branch2, &extra))
 		res = error(_("could not parse log for '%s'"), range2);
 
+	diffopt->color_moved = COLOR_MOVED_DEFAULT;
 	if (!res) {
 		find_exact_matches(&branch1, &branch2);
 		get_correspondences(&branch1, &branch2, creation_factor);
-- 
2.18.0.597.ga71716f1ad-goog


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/2] WIP range-diff: print some statistics about the range
  2018-08-10  0:10 ` [PATCH 0/2] Getting data on different diff algorithms WAS: " Stefan Beller
  2018-08-10  0:10   ` [PATCH 1/2] WIP: range-diff: take extra arguments for different diffs Stefan Beller
@ 2018-08-10  0:10   ` Stefan Beller
  1 sibling, 0 replies; 13+ messages in thread
From: Stefan Beller @ 2018-08-10  0:10 UTC (permalink / raw)
  To: sbeller; +Cc: git

Signed-off-by: Stefan Beller <sbeller@google.com>
---
 range-diff.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/range-diff.c b/range-diff.c
index a977289b7dc..fbabce5f91f 100644
--- a/range-diff.c
+++ b/range-diff.c
@@ -264,6 +264,8 @@ static void get_correspondences(struct string_list *a, struct string_list *b,
 	free(b2a);
 }
 
+int completely_different, slightly_different, same;
+
 static void output_pair_header(struct diff_options *diffopt,
 			       int patch_no_width,
 			       struct strbuf *buf,
@@ -288,15 +290,19 @@ static void output_pair_header(struct diff_options *diffopt,
 	if (!b_util) {
 		color = color_old;
 		status = '<';
+		slightly_different++;
 	} else if (!a_util) {
 		color = color_new;
 		status = '>';
+		completely_different++;
 	} else if (strcmp(a_util->patch, b_util->patch)) {
 		color = color_commit;
 		status = '!';
+		slightly_different++;
 	} else {
 		color = color_commit;
 		status = '=';
+		same++;
 	}
 
 	strbuf_reset(buf);
@@ -439,12 +445,15 @@ int show_range_diff(const char *range1, const char *range2,
 		res = error(_("could not parse log for '%s'"), range2);
 
 	diffopt->color_moved = COLOR_MOVED_DEFAULT;
+
 	if (!res) {
 		find_exact_matches(&branch1, &branch2);
 		get_correspondences(&branch1, &branch2, creation_factor);
 		output(&branch1, &branch2, diffopt);
 	}
 
+	printf("patch ranges are %d same, %d slightly different and %d completely different\n", same, slightly_different, completely_different);
+
 	string_list_clear(&branch1, 1);
 	string_list_clear(&branch2, 1);
 
-- 
2.18.0.597.ga71716f1ad-goog


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH] Documentation/diff-options: explain different diff algorithms
  2018-08-09 19:26         ` Stefan Beller
@ 2018-08-10 22:18           ` Stefan Beller
  0 siblings, 0 replies; 13+ messages in thread
From: Stefan Beller @ 2018-08-10 22:18 UTC (permalink / raw)
  To: sbeller; +Cc: git, gitster, jrnieder

As a user I wondered what the diff algorithms are about. Offer at least
a basic explanation on the differences of the diff algorithms.

Signed-off-by: Stefan Beller <sbeller@google.com>
---

 Not sure if this is finished, I just want to put out the state that I
 have sitting on my disk.

 Documentation/diff-options.txt | 10 +++--
 Documentation/git-diff.txt     | 72 ++++++++++++++++++++++++++++++++++
 2 files changed, 79 insertions(+), 3 deletions(-)

diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt
index f394608b42c..00684b8936f 100644
--- a/Documentation/diff-options.txt
+++ b/Documentation/diff-options.txt
@@ -91,14 +91,18 @@ appearing as a deletion or addition in the output. It uses the "patience
 diff" algorithm internally.
 
 --diff-algorithm={patience|minimal|histogram|myers}::
-	Choose a diff algorithm. The variants are as follows:
+	Choose a diff algorithm. See the DIFF ALGORITHMS section
+ifndef::git-diff[]
+	in linkgit:git-diff[1]
+endif::git-diff[]
+	for more discussion. The variants are as follows:
 +
 --
 `default`, `myers`;;
 	The basic greedy diff algorithm. Currently, this is the default.
 `minimal`;;
-	Spend extra time to make sure the smallest possible diff is
-	produced.
+	The same algorithm as `myers`, but spend extra time to make
+	sure the smallest possible diff is produced.
 `patience`;;
 	Use "patience diff" algorithm when generating patches.
 `histogram`;;
diff --git a/Documentation/git-diff.txt b/Documentation/git-diff.txt
index b180f1fa5bf..8837492ed05 100644
--- a/Documentation/git-diff.txt
+++ b/Documentation/git-diff.txt
@@ -119,6 +119,78 @@ include::diff-options.txt[]
 
 include::diff-format.txt[]
 
+DIFF ALGORITHMS
+---------------
+
+This section explains background on the diff algorithms. All of them
+operate on two input sequences of symbols. In Git each symbol is
+represented by a line of a file unless the option to diff based on
+words is given. The following diff algorithms are available:
+
+`Myers`
+
+A diff as produced by the basic greedy algorithm described in
+link:http://www.xmailserver.org/diff2.pdf[An O(ND) Difference Algorithm and its Variations].
+with a run time of O(M + N + D^2). To understand this algorithm, one
+can imagine a table spanned by the two input sequences with slides
+where there are the same symbols. For example the sequences 'ABCD' and 'ADB'
+the graph would look like
+
+	S | A | B | C | A
+	---------------------
+	A | \ |   |   | \ |
+	---------------------
+	D |   |   |   |   |
+	---------------------
+	B |   | \ |   |   |
+	---------------------
+	  |   |   |   |   |F
+
+and a greedy algorithm is used to find the cheapest path from start S to
+finish F, with each horizontal and vertical step having a cost of one and
+the diagonal slides having a cost of zero.
+
+This is simplified as the real algorithm only needs O(N+M) in terms of memory.
+In addition it employs a heuristic to allow for a faster diff at the small
+cost of diff size. The `minimal` algorithm has that heuristic turned off.
+
+`Minimal`
+The exact algorithm as described in the `Myers` paper without the heuristic
+that trades execution time for slightly worse diffs.
+
+`Patience`
+
+This algorithm by Bram Cohen originally for the bzr version control
+system matches the longest common subsequence of unique lines on
+both sides, recursively. It obtained its name by the way the longest
+subsequence is found, as that is a byproduct of the patience sorting
+algorithm. If there are no unique lines left it falls back to `myers`.
+Empirically this algorithm produces a more readable output for code,
+but it does not guarantee the shortest output.
+
+`Histogram`
+
+This algorithm by Shawn Pearce, originally implemented for
+JGit, finds the longest common substring and recursively
+diffs the content before and after the longest common substring.
+If there are no common substrings left, fall back to `myers`.
+This is often the fastest, but in corner cases (when there are
+many common substrings of the same length) it produces unexpected
+results as seen in:
+
+	seq 1 100 >one
+	echo 99 > two
+	seq 1 2 98 >>two
+	git diff --no-index --histogram one two
+
+
+Note how both `patience` and `histogram` use a concept that is abbreviated
+as 'LCS' (longest common subsequence and longest common substring).
+The longest common subsequence is a sequence of symbols that are found
+on both sides in the same order. The symbols do not need to be adjacent.
+The longest common substring is a sequence of adjacent symbols in order
+on both sides.
+
 EXAMPLES
 --------
 
-- 
2.18.0.865.gffc8e1a3cd6-goog


^ permalink raw reply related	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2018-08-10 22:19 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-24  0:36 [PATCH] Documentation/diff-options: explain different diff algorithms Stefan Beller
2018-07-24  4:40 ` Jonathan Nieder
2018-07-24 17:38   ` Stefan Beller
2018-07-24 20:06     ` Junio C Hamano
2018-08-06 22:25   ` Stefan Beller
2018-08-06 23:18     ` Jonathan Nieder
2018-08-07 15:56       ` Junio C Hamano
2018-08-09 19:26         ` Stefan Beller
2018-08-10 22:18           ` Stefan Beller
2018-08-09 19:51       ` Stefan Beller
2018-08-10  0:10 ` [PATCH 0/2] Getting data on different diff algorithms WAS: " Stefan Beller
2018-08-10  0:10   ` [PATCH 1/2] WIP: range-diff: take extra arguments for different diffs Stefan Beller
2018-08-10  0:10   ` [PATCH 2/2] WIP range-diff: print some statistics about the range Stefan Beller

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).