* [PATCH] Documentation/diff-options: explain different diff algorithms @ 2018-07-24 0:36 Stefan Beller 2018-07-24 4:40 ` Jonathan Nieder 2018-08-10 0:10 ` [PATCH 0/2] Getting data on different diff algorithms WAS: " Stefan Beller 0 siblings, 2 replies; 13+ messages in thread From: Stefan Beller @ 2018-07-24 0:36 UTC (permalink / raw) To: git; +Cc: Stefan Beller As a user I wondered what the diff algorithms are about. Offer at least a basic explanation on the differences of the diff algorithms. Signed-off-by: Stefan Beller <sbeller@google.com> --- Documentation/diff-options.txt | 32 +++++++++++++++++++++++++------- 1 file changed, 25 insertions(+), 7 deletions(-) diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt index f466600972f..0d765482027 100644 --- a/Documentation/diff-options.txt +++ b/Documentation/diff-options.txt @@ -94,16 +94,34 @@ diff" algorithm internally. Choose a diff algorithm. The variants are as follows: + -- -`default`, `myers`;; - The basic greedy diff algorithm. Currently, this is the default. `minimal`;; - Spend extra time to make sure the smallest possible diff is - produced. + A diff as produced by the basic greedy algorithm described in + link:http://www.xmailserver.org/diff2.pdf[An O(ND) Difference Algorithm and its Variations] +`default`, `myers`;; + The same algorithm as `minimal`, extended with a heuristic to + reduce extensive searches. Currently, this is the default. `patience`;; - Use "patience diff" algorithm when generating patches. + Use "patience diff" algorithm when generating patches. This + matches the longest common subsequence of unique lines on + both sides, recursively. It obtained its name by the way the + longest subsequence is found, as that is a byproduct of the + patience sorting algorithm. If there are no unique lines left + it falls back to `myers`. Empirically this algorithm produces + a more readable output for code, but it does not garantuee + the shortest output. `histogram`;; - This algorithm extends the patience algorithm to "support - low-occurrence common elements". + This algorithm re-implements the `patience` algorithm with + "support of low-occurrence common elements" and only picks + one element of the LCS for the recursion. It is often the + fastest, but in cornercases (when there are many longest + common subsequences of the same length) it produces bad + results as seen in: + + seq 1 100 >one + echo 99 > two + seq 1 2 98 >>two + git diff --no-index --histogram one two + -- + For instance, if you configured diff.algorithm variable to a -- 2.18.0.345.g5c9ce644c3-goog ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH] Documentation/diff-options: explain different diff algorithms 2018-07-24 0:36 [PATCH] Documentation/diff-options: explain different diff algorithms Stefan Beller @ 2018-07-24 4:40 ` Jonathan Nieder 2018-07-24 17:38 ` Stefan Beller 2018-08-06 22:25 ` Stefan Beller 2018-08-10 0:10 ` [PATCH 0/2] Getting data on different diff algorithms WAS: " Stefan Beller 1 sibling, 2 replies; 13+ messages in thread From: Jonathan Nieder @ 2018-07-24 4:40 UTC (permalink / raw) To: Stefan Beller; +Cc: git Hi, Stefan Beller wrote: > As a user I wondered what the diff algorithms are about. Offer at least > a basic explanation on the differences of the diff algorithms. Interesting. Let's see. [...] > --- a/Documentation/diff-options.txt > +++ b/Documentation/diff-options.txt > @@ -94,16 +94,34 @@ diff" algorithm internally. > Choose a diff algorithm. The variants are as follows: > + > -- > -`default`, `myers`;; > - The basic greedy diff algorithm. Currently, this is the default. > `minimal`;; > + A diff as produced by the basic greedy algorithm described in Why this reordering? > + link:http://www.xmailserver.org/diff2.pdf[An O(ND) Difference Algorithm and its Variations] This URL doesn't seem likely to stay stable. Are appearances deceiving? (Maybe so, since that's the same host as libxdiff's homepage <http://www.xmailserver.org/xdiff-lib.html>.) It might be worth mentioning the author and date just in case. > - Spend extra time to make sure the smallest possible diff is > - produced. > +`default`, `myers`;; > + The same algorithm as `minimal`, extended with a heuristic to > + reduce extensive searches. Currently, this is the default. Amusing --- this means that the Myers diff is spelled as "minimal" instead of "myers". > `patience`;; > - Use "patience diff" algorithm when generating patches. > + Use "patience diff" algorithm when generating patches. This > + matches the longest common subsequence of unique lines on > + both sides, recursively. It obtained its name by the way the > + longest subsequence is found, as that is a byproduct of the > + patience sorting algorithm. If there are no unique lines left > + it falls back to `myers`. Empirically this algorithm produces > + a more readable output for code, but it does not garantuee > + the shortest output. Probably also worth mentioning that this was invented by Bram Cohen for the bazaar vcs. > `histogram`;; > - This algorithm extends the patience algorithm to "support > - low-occurrence common elements". > + This algorithm re-implements the `patience` algorithm with What does reimplements mean here? Do you mean that it is a variation on patience? > + "support of low-occurrence common elements" and only picks > + one element of the LCS for the recursion. It is often the Does LCS mean longest common subsequence or longest common substring here? Probably best to spell it out to avoid confusion. > + fastest, but in cornercases (when there are many longest s/cornercases/corner cases/ > + common subsequences of the same length) it produces bad > + results as seen in: > + > + seq 1 100 >one > + echo 99 > two > + seq 1 2 98 >>two > + git diff --no-index --histogram one two I wonder if these details (about all the algorithms) should go in a separate Diff Algorithms section and this section could focus on what use cases warrant using each, referring to that section for details. What do you think? Thanks and hope that helps, Jonathan ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] Documentation/diff-options: explain different diff algorithms 2018-07-24 4:40 ` Jonathan Nieder @ 2018-07-24 17:38 ` Stefan Beller 2018-07-24 20:06 ` Junio C Hamano 2018-08-06 22:25 ` Stefan Beller 1 sibling, 1 reply; 13+ messages in thread From: Stefan Beller @ 2018-07-24 17:38 UTC (permalink / raw) To: Jonathan Nieder; +Cc: git On Mon, Jul 23, 2018 at 9:41 PM Jonathan Nieder <jrnieder@gmail.com> wrote: > > Hi, > > Stefan Beller wrote: > > > As a user I wondered what the diff algorithms are about. Offer at least > > a basic explanation on the differences of the diff algorithms. > > Interesting. Let's see. > > [...] > > --- a/Documentation/diff-options.txt > > +++ b/Documentation/diff-options.txt > > @@ -94,16 +94,34 @@ diff" algorithm internally. > > Choose a diff algorithm. The variants are as follows: > > + > > -- > > -`default`, `myers`;; > > - The basic greedy diff algorithm. Currently, this is the default. > > `minimal`;; > > + A diff as produced by the basic greedy algorithm described in > > Why this reordering? because Myers (below) is constructed from minimal + heuristic. Before this patch we say Myers is myers and minimal "spends extra cycles", which is true if you read the code, as it just is for (...) { test_path if (need_min) continue; if (heuristic_good_enough()) break; } https://github.com/git/git/blob/master/xdiff/xdiffi.c#L152 So the "minimal" algorithm is the basic myers algorithm, and the "myers" algorithm is the extended version (extended with a heuristic to be fast in corner cases, still producing good enough diffs). > > > + link:http://www.xmailserver.org/diff2.pdf[An O(ND) Difference Algorithm and its Variations] > > This URL doesn't seem likely to stay stable. Are appearances > deceiving? (Maybe so, since that's the same host as libxdiff's > homepage <http://www.xmailserver.org/xdiff-lib.html>.) It might be > worth mentioning the author and date just in case. I though about it and did not mention it, as the name "An O(ND) Difference Algorithm and its Variations" is already enough to find that paper quickly. > > - Spend extra time to make sure the smallest possible diff is > > - produced. > > +`default`, `myers`;; > > + The same algorithm as `minimal`, extended with a heuristic to > > + reduce extensive searches. Currently, this is the default. > > Amusing --- this means that the Myers diff is spelled as "minimal" > instead of "myers". > > > `patience`;; > > - Use "patience diff" algorithm when generating patches. > > + Use "patience diff" algorithm when generating patches. This > > + matches the longest common subsequence of unique lines on > > + both sides, recursively. It obtained its name by the way the > > + longest subsequence is found, as that is a byproduct of the > > + patience sorting algorithm. If there are no unique lines left > > + it falls back to `myers`. Empirically this algorithm produces > > + a more readable output for code, but it does not garantuee > > + the shortest output. > > Probably also worth mentioning that this was invented by Bram Cohen > for the bazaar vcs. I tried to balance the part that reads like a history lesson and the immediately useful part (why is this better, when should I use it?) Mentioning bazaar probably makes sense. Not sure about the author, but I'll include him in a reroll of this patch. > > `histogram`;; > > - This algorithm extends the patience algorithm to "support > > - low-occurrence common elements". > > + This algorithm re-implements the `patience` algorithm with > > What does reimplements mean here? Do you mean that it is a variation > on patience? It is supposed to be a variation of patience, but as it comes with its own implementation which I would not fully trust (as it was ported from JGit, and reading the comments over there, a misunderstanding of LCS made it possible to have only one midpoint before recursing) So it is not just taking the patience algorithm and adding support for "low-occurrence common elements" (what does that even mean? patience already distinguishes uniques and non-uniques), but its own implementation, hence it is not bug-for-bug compatible. > > + "support of low-occurrence common elements" and only picks > > + one element of the LCS for the recursion. It is often the > > Does LCS mean longest common subsequence or longest common substring > here? Probably best to spell it out to avoid confusion. When writing the patch I meant to refer to longest common subsequence from above, but by picking just one element, this algorithm understands it as longest common string. > > > + fastest, but in cornercases (when there are many longest > > s/cornercases/corner cases/ > > > + common subsequences of the same length) it produces bad > > + results as seen in: > > + > > + seq 1 100 >one > > + echo 99 > two > > + seq 1 2 98 >>two > > + git diff --no-index --histogram one two > > I wonder if these details (about all the algorithms) should go in a > separate Diff Algorithms section and this section could focus on > what use cases warrant using each, referring to that section for > details. What do you think? Yeah I thought about putting together a Documentation/technical/diffs.txt that talk about all kinds of diffing (basic diff algorithms, options, how to diff trees), but considered rolling this as a band aid until that document fully materializes. Thanks, Stefan ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] Documentation/diff-options: explain different diff algorithms 2018-07-24 17:38 ` Stefan Beller @ 2018-07-24 20:06 ` Junio C Hamano 0 siblings, 0 replies; 13+ messages in thread From: Junio C Hamano @ 2018-07-24 20:06 UTC (permalink / raw) To: Stefan Beller; +Cc: Jonathan Nieder, git Stefan Beller <sbeller@google.com> writes: > On Mon, Jul 23, 2018 at 9:41 PM Jonathan Nieder <jrnieder@gmail.com> wrote: >> >> Hi, >> >> Stefan Beller wrote: >> >> > As a user I wondered what the diff algorithms are about. Offer at least >> > a basic explanation on the differences of the diff algorithms. >> >> Interesting. Let's see. >> >> [...] >> > --- a/Documentation/diff-options.txt >> > +++ b/Documentation/diff-options.txt >> > @@ -94,16 +94,34 @@ diff" algorithm internally. >> > Choose a diff algorithm. The variants are as follows: >> > + >> > -- >> > -`default`, `myers`;; >> > - The basic greedy diff algorithm. Currently, this is the default. >> > `minimal`;; >> > + A diff as produced by the basic greedy algorithm described in >> >> Why this reordering? > > because Myers (below) is constructed from minimal + heuristic. > ... > So the "minimal" algorithm is the basic myers algorithm, > and the "myers" algorithm is the extended version (extended > with a heuristic to be fast in corner cases, still producing good > enough diffs). I am not sure that is a good reason for the target readers of this document to move the default from the beginning to somewhere in the middle. For a textbook that explains different algorithms and gives overview of how they work, that order may be appropriate, though. >> > - Spend extra time to make sure the smallest possible diff is >> > - produced. >> > +`default`, `myers`;; >> > + The same algorithm as `minimal`, extended with a heuristic to >> > + reduce extensive searches. Currently, this is the default. >> >> Amusing --- this means that the Myers diff is spelled as "minimal" >> instead of "myers". Yeah, I had the same thought---for an end-user documentation update, this change and the reordering does not feel like an improvement. >> I wonder if these details (about all the algorithms) should go in a >> separate Diff Algorithms section and this section could focus on >> what use cases warrant using each, referring to that section for >> details. What do you think? Good suggestion. ^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH] Documentation/diff-options: explain different diff algorithms 2018-07-24 4:40 ` Jonathan Nieder 2018-07-24 17:38 ` Stefan Beller @ 2018-08-06 22:25 ` Stefan Beller 2018-08-06 23:18 ` Jonathan Nieder 1 sibling, 1 reply; 13+ messages in thread From: Stefan Beller @ 2018-08-06 22:25 UTC (permalink / raw) To: jrnieder; +Cc: git, sbeller As a user I wondered what the diff algorithms are about. Offer at least a basic explanation on the differences of the diff algorithms. Signed-off-by: Stefan Beller <sbeller@google.com> --- Documentation/diff-options.txt | 10 +++++++--- Documentation/git-diff.txt | 34 ++++++++++++++++++++++++++++++++++ 2 files changed, 41 insertions(+), 3 deletions(-) diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt index f394608b42c..eae033a21ea 100644 --- a/Documentation/diff-options.txt +++ b/Documentation/diff-options.txt @@ -91,14 +91,18 @@ appearing as a deletion or addition in the output. It uses the "patience diff" algorithm internally. --diff-algorithm={patience|minimal|histogram|myers}:: - Choose a diff algorithm. The variants are as follows: + Choose a diff algorithm. See the discussion of DIFF ALGORITHMS +ifndef::git-diff[] + in linkgit:git-diff[1] +endif::git-diff[] + . The variants are as follows: + -- `default`, `myers`;; The basic greedy diff algorithm. Currently, this is the default. `minimal`;; - Spend extra time to make sure the smallest possible diff is - produced. + The same algorithm as `myers`, but spend extra time to make + sure the smallest possible diff is produced. `patience`;; Use "patience diff" algorithm when generating patches. `histogram`;; diff --git a/Documentation/git-diff.txt b/Documentation/git-diff.txt index b180f1fa5bf..b182389aaae 100644 --- a/Documentation/git-diff.txt +++ b/Documentation/git-diff.txt @@ -119,6 +119,40 @@ include::diff-options.txt[] include::diff-format.txt[] +DIFF ALGORITHMS +--------------- +`Myers` + +A diff as produced by the basic greedy algorithm described in +link:http://www.xmailserver.org/diff2.pdf[An O(ND) Difference Algorithm and its Variations]. +with a run time of O(M + N + D^2). It employs a heuristic to allow for +a faster diff at the small cost of diff size. +The `minimal` algorithm has that heuristic turned off. + +`Patience` + +This algorithm by Bram Cohen matches the longest common subsequence +of unique lines on both sides, recursively. It obtained its name by +the way the longest subsequence is found, as that is a byproduct of +the patience sorting algorithm. If there are no unique lines left +it falls back to `myers`. Empirically this algorithm produces +a more readable output for code, but it does not garantuee +the shortest output. + +`Histogram` + +This algorithm finds the longest common substring and recursively +diffs the content before and after the longest common substring. +If there are no common substrings left, fallback to `myers`. +This is often the fastest, but in corner cases (when there are +many common substrings of the same length) it produces bad +results as seen in: + + seq 1 100 >one + echo 99 > two + seq 1 2 98 >>two + git diff --no-index --histogram one two + EXAMPLES -------- -- 2.18.0.597.ga71716f1ad-goog ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH] Documentation/diff-options: explain different diff algorithms 2018-08-06 22:25 ` Stefan Beller @ 2018-08-06 23:18 ` Jonathan Nieder 2018-08-07 15:56 ` Junio C Hamano 2018-08-09 19:51 ` Stefan Beller 0 siblings, 2 replies; 13+ messages in thread From: Jonathan Nieder @ 2018-08-06 23:18 UTC (permalink / raw) To: Stefan Beller; +Cc: git Hi, Stefan Beller wrote: > --- a/Documentation/diff-options.txt > +++ b/Documentation/diff-options.txt > @@ -91,14 +91,18 @@ appearing as a deletion or addition in the output. It uses the "patience > diff" algorithm internally. > > --diff-algorithm={patience|minimal|histogram|myers}:: > - Choose a diff algorithm. The variants are as follows: > + Choose a diff algorithm. See the discussion of DIFF ALGORITHMS > +ifndef::git-diff[] > + in linkgit:git-diff[1] > +endif::git-diff[] > + . The variants are as follows: This means outside of git-diff(1), I'd see See the discussion of DIFF ALGORITHMS in git-diff(1) . And in git-diff(1), I'd see See the discussion of DIFF ALGORITHMS . Both don't seem quite right, since they have an extra space before the period. The git-diff(1) seems especially not quite right --- does it intend to say something like "See the DIFF ALGORITHMS section for more discussion"? [...] > --- a/Documentation/git-diff.txt > +++ b/Documentation/git-diff.txt > @@ -119,6 +119,40 @@ include::diff-options.txt[] > > include::diff-format.txt[] > > +DIFF ALGORITHMS > +--------------- Please add some introductory words about what the headings refer to. > +`Myers` > + > +A diff as produced by the basic greedy algorithm described in > +link:http://www.xmailserver.org/diff2.pdf[An O(ND) Difference Algorithm and its Variations]. > +with a run time of O(M + N + D^2). It employs a heuristic to allow for > +a faster diff at the small cost of diff size. > +The `minimal` algorithm has that heuristic turned off. > + > +`Patience` > + > +This algorithm by Bram Cohen matches the longest common subsequence > +of unique lines on both sides, recursively. It obtained its name by > +the way the longest subsequence is found, as that is a byproduct of > +the patience sorting algorithm. If there are no unique lines left > +it falls back to `myers`. Empirically this algorithm produces > +a more readable output for code, but it does not garantuee nit: s/garantuee/guarantee/ > +the shortest output. Trivia: the `minimal` variant of Myers doesn't guarantee shortest output, either: what it minimizes is the number of lines marked as added or removed. If you want to minimize context lines too, then that would be a new variant. ;-) [...] > +`Histogram` > + > +This algorithm finds the longest common substring and recursively > +diffs the content before and after the longest common substring. optional: may be worth a short aside in the text about the distinction between LCS and LCS. ;-) It would be especially useful here, since the alphabet used in these strings is *lines* instead of characters, so the first-time reader could probably use some help in building their intuition. > +If there are no common substrings left, fallback to `myers`. nit: fallback is the noun, fall back is the verb. > +This is often the fastest, but in corner cases (when there are > +many common substrings of the same length) it produces bad Can you clarify what "bad" means? E.g. would "unexpected", or "poorly aligned", match what you mean? > +results as seen in: > + > + seq 1 100 >one > + echo 99 > two > + seq 1 2 98 >>two > + git diff --no-index --histogram one two > + > EXAMPLES > -------- Thanks, Jonathan ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] Documentation/diff-options: explain different diff algorithms 2018-08-06 23:18 ` Jonathan Nieder @ 2018-08-07 15:56 ` Junio C Hamano 2018-08-09 19:26 ` Stefan Beller 2018-08-09 19:51 ` Stefan Beller 1 sibling, 1 reply; 13+ messages in thread From: Junio C Hamano @ 2018-08-07 15:56 UTC (permalink / raw) To: Jonathan Nieder; +Cc: Stefan Beller, git Jonathan Nieder <jrnieder@gmail.com> writes: > Both don't seem quite right, since they have an extra space before the > period. The git-diff(1) seems especially not quite right --- does it > intend to say something like "See the DIFF ALGORITHMS section for more > discussion"? Good suggestion and doing so would nicely allow side-stepping the space before the dot issue, i.e. "See the DIFF ALGO section (in git-diff(1)) for more discussion". I like it. > [...] >> --- a/Documentation/git-diff.txt >> +++ b/Documentation/git-diff.txt >> @@ -119,6 +119,40 @@ include::diff-options.txt[] >> >> include::diff-format.txt[] >> >> +DIFF ALGORITHMS >> +--------------- > > Please add some introductory words about what the headings refer to. > >> +`Myers` >> + >> +A diff as produced by the basic greedy algorithm described in >> +link:http://www.xmailserver.org/diff2.pdf[An O(ND) Difference Algorithm and its Variations]. >> +... >> + >> +`Patience` >> + >> +This algorithm by Bram Cohen matches the longest common subsequence >> +... > [...] >> +`Histogram` >> + >> +This algorithm finds the longest common substring and recursively >> +diffs the content before and after the longest common substring. As a first-time reader, I felt that it is a bit uneven that there is no attribution for only this item, unlike the description for Myers and Patience. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] Documentation/diff-options: explain different diff algorithms 2018-08-07 15:56 ` Junio C Hamano @ 2018-08-09 19:26 ` Stefan Beller 2018-08-10 22:18 ` Stefan Beller 0 siblings, 1 reply; 13+ messages in thread From: Stefan Beller @ 2018-08-09 19:26 UTC (permalink / raw) To: Junio C Hamano; +Cc: Jonathan Nieder, git On Tue, Aug 7, 2018 at 8:56 AM Junio C Hamano <gitster@pobox.com> wrote: > > Jonathan Nieder <jrnieder@gmail.com> writes: > > > Both don't seem quite right, since they have an extra space before the > > period. The git-diff(1) seems especially not quite right --- does it > > intend to say something like "See the DIFF ALGORITHMS section for more > > discussion"? > > Good suggestion and doing so would nicely allow side-stepping the > space before the dot issue, i.e. "See the DIFF ALGO section (in > git-diff(1)) for more discussion". I like it. Me, too. > >> +`Histogram` > >> + > >> +This algorithm finds the longest common substring and recursively > >> +diffs the content before and after the longest common substring. > > As a first-time reader, I felt that it is a bit uneven that there is > no attribution for only this item, unlike the description for Myers > and Patience. Yeah. That is because I spent too much time thinking how this algo is flawed (in its design as the comments in the initial commit in JGit indicate that design error when reading closely) when I was trying to understand it deeply. Maybe I should emphasize the trade off for performance gain more as Shawn really cared about performance. After all it offers the best performance in *many* common cases, but has a "long tail" of really unfortunate results, too. ^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH] Documentation/diff-options: explain different diff algorithms 2018-08-09 19:26 ` Stefan Beller @ 2018-08-10 22:18 ` Stefan Beller 0 siblings, 0 replies; 13+ messages in thread From: Stefan Beller @ 2018-08-10 22:18 UTC (permalink / raw) To: sbeller; +Cc: git, gitster, jrnieder As a user I wondered what the diff algorithms are about. Offer at least a basic explanation on the differences of the diff algorithms. Signed-off-by: Stefan Beller <sbeller@google.com> --- Not sure if this is finished, I just want to put out the state that I have sitting on my disk. Documentation/diff-options.txt | 10 +++-- Documentation/git-diff.txt | 72 ++++++++++++++++++++++++++++++++++ 2 files changed, 79 insertions(+), 3 deletions(-) diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt index f394608b42c..00684b8936f 100644 --- a/Documentation/diff-options.txt +++ b/Documentation/diff-options.txt @@ -91,14 +91,18 @@ appearing as a deletion or addition in the output. It uses the "patience diff" algorithm internally. --diff-algorithm={patience|minimal|histogram|myers}:: - Choose a diff algorithm. The variants are as follows: + Choose a diff algorithm. See the DIFF ALGORITHMS section +ifndef::git-diff[] + in linkgit:git-diff[1] +endif::git-diff[] + for more discussion. The variants are as follows: + -- `default`, `myers`;; The basic greedy diff algorithm. Currently, this is the default. `minimal`;; - Spend extra time to make sure the smallest possible diff is - produced. + The same algorithm as `myers`, but spend extra time to make + sure the smallest possible diff is produced. `patience`;; Use "patience diff" algorithm when generating patches. `histogram`;; diff --git a/Documentation/git-diff.txt b/Documentation/git-diff.txt index b180f1fa5bf..8837492ed05 100644 --- a/Documentation/git-diff.txt +++ b/Documentation/git-diff.txt @@ -119,6 +119,78 @@ include::diff-options.txt[] include::diff-format.txt[] +DIFF ALGORITHMS +--------------- + +This section explains background on the diff algorithms. All of them +operate on two input sequences of symbols. In Git each symbol is +represented by a line of a file unless the option to diff based on +words is given. The following diff algorithms are available: + +`Myers` + +A diff as produced by the basic greedy algorithm described in +link:http://www.xmailserver.org/diff2.pdf[An O(ND) Difference Algorithm and its Variations]. +with a run time of O(M + N + D^2). To understand this algorithm, one +can imagine a table spanned by the two input sequences with slides +where there are the same symbols. For example the sequences 'ABCD' and 'ADB' +the graph would look like + + S | A | B | C | A + --------------------- + A | \ | | | \ | + --------------------- + D | | | | | + --------------------- + B | | \ | | | + --------------------- + | | | | |F + +and a greedy algorithm is used to find the cheapest path from start S to +finish F, with each horizontal and vertical step having a cost of one and +the diagonal slides having a cost of zero. + +This is simplified as the real algorithm only needs O(N+M) in terms of memory. +In addition it employs a heuristic to allow for a faster diff at the small +cost of diff size. The `minimal` algorithm has that heuristic turned off. + +`Minimal` +The exact algorithm as described in the `Myers` paper without the heuristic +that trades execution time for slightly worse diffs. + +`Patience` + +This algorithm by Bram Cohen originally for the bzr version control +system matches the longest common subsequence of unique lines on +both sides, recursively. It obtained its name by the way the longest +subsequence is found, as that is a byproduct of the patience sorting +algorithm. If there are no unique lines left it falls back to `myers`. +Empirically this algorithm produces a more readable output for code, +but it does not guarantee the shortest output. + +`Histogram` + +This algorithm by Shawn Pearce, originally implemented for +JGit, finds the longest common substring and recursively +diffs the content before and after the longest common substring. +If there are no common substrings left, fall back to `myers`. +This is often the fastest, but in corner cases (when there are +many common substrings of the same length) it produces unexpected +results as seen in: + + seq 1 100 >one + echo 99 > two + seq 1 2 98 >>two + git diff --no-index --histogram one two + + +Note how both `patience` and `histogram` use a concept that is abbreviated +as 'LCS' (longest common subsequence and longest common substring). +The longest common subsequence is a sequence of symbols that are found +on both sides in the same order. The symbols do not need to be adjacent. +The longest common substring is a sequence of adjacent symbols in order +on both sides. + EXAMPLES -------- -- 2.18.0.865.gffc8e1a3cd6-goog ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH] Documentation/diff-options: explain different diff algorithms 2018-08-06 23:18 ` Jonathan Nieder 2018-08-07 15:56 ` Junio C Hamano @ 2018-08-09 19:51 ` Stefan Beller 1 sibling, 0 replies; 13+ messages in thread From: Stefan Beller @ 2018-08-09 19:51 UTC (permalink / raw) To: Jonathan Nieder; +Cc: git On Mon, Aug 6, 2018 at 4:18 PM Jonathan Nieder <jrnieder@gmail.com> wrote: > > +DIFF ALGORITHMS > > +--------------- > > Please add some introductory words about what the headings refer to. ok. > > > +the shortest output. > > Trivia: the `minimal` variant of Myers doesn't guarantee shortest > output, either: what it minimizes is the number of lines marked as > added or removed. If you want to minimize context lines too, then > that would be a new variant. ;-) ... and take line length into account. ;-) It minimizes the edit distance in terms of lines, i.e. in a context-less diff we get the lowest number of lines possible. > > +This algorithm finds the longest common substring and recursively > > +diffs the content before and after the longest common substring. > > optional: may be worth a short aside in the text about the distinction > between LCS and LCS. ;-) > > It would be especially useful here, since the alphabet used in these > strings is *lines* instead of characters, so the first-time reader > could probably use some help in building their intuition. That makes sense. > > > +This is often the fastest, but in corner cases (when there are > > +many common substrings of the same length) it produces bad > > Can you clarify what "bad" means? E.g. would "unexpected", or "poorly > aligned", match what you mean? I'll just go with unexpected. > > +results as seen in: > > + > > + seq 1 100 >one > > + echo 99 > two > > + seq 1 2 98 >>two > > + git diff --no-index --histogram one two ^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH 0/2] Getting data on different diff algorithms WAS: Documentation/diff-options: explain different diff algorithms 2018-07-24 0:36 [PATCH] Documentation/diff-options: explain different diff algorithms Stefan Beller 2018-07-24 4:40 ` Jonathan Nieder @ 2018-08-10 0:10 ` Stefan Beller 2018-08-10 0:10 ` [PATCH 1/2] WIP: range-diff: take extra arguments for different diffs Stefan Beller 2018-08-10 0:10 ` [PATCH 2/2] WIP range-diff: print some statistics about the range Stefan Beller 1 sibling, 2 replies; 13+ messages in thread From: Stefan Beller @ 2018-08-10 0:10 UTC (permalink / raw) To: sbeller; +Cc: git I instrumented range-diff to use different diff algorithms for the patches, (I chose default-myers and patience) and then run on the same range, via ./git-range-diff v2.10.0..HEAD v2.10.0..HEAD and I found 5245 same, 304 slightly different and 4 completely different patches in that range. Looking at the interdiff is not very pleasing even when reading it with coloring and move detection. Manually looking at them, I found the patience diff easier to review. Comparing the 'default' diff algorithm to 'minimal', I'll see 5491 same, 58 slightly different and 0 completely different patches. Comparing 'default' to 'histogram', I'll see 5255 same, 294 slightly different and 8 completely different patches. Comparing 'histogram' to 'patience', I'll see 5337 same, 212 slightly different and 10 completely different patches. This is all to just put data out there, on how much difference to expect from the diff algorithms. I have not yet dug into the quality of the diffs. Stefan Stefan Beller (2): WIP: range-diff: take extra arguments for different diffs. WIP range-diff: print some statistics about the range range-diff.c | 29 +++++++++++++++++++++++++---- 1 file changed, 25 insertions(+), 4 deletions(-) -- 2.18.0.597.ga71716f1ad-goog ^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH 1/2] WIP: range-diff: take extra arguments for different diffs. 2018-08-10 0:10 ` [PATCH 0/2] Getting data on different diff algorithms WAS: " Stefan Beller @ 2018-08-10 0:10 ` Stefan Beller 2018-08-10 0:10 ` [PATCH 2/2] WIP range-diff: print some statistics about the range Stefan Beller 1 sibling, 0 replies; 13+ messages in thread From: Stefan Beller @ 2018-08-10 0:10 UTC (permalink / raw) To: sbeller; +Cc: git We can use the range-diff on the same range to examine differences in the diff algorithm. Signed-off-by: Stefan Beller <sbeller@google.com> --- range-diff.c | 20 ++++++++++++++++---- 1 file changed, 16 insertions(+), 4 deletions(-) diff --git a/range-diff.c b/range-diff.c index 347b4a79f25..a977289b7dc 100644 --- a/range-diff.c +++ b/range-diff.c @@ -28,7 +28,8 @@ struct patch_util { * Reads the patches into a string list, with the `util` field being populated * as struct object_id (will need to be free()d). */ -static int read_patches(const char *range, struct string_list *list) +static int read_patches(const char *range, struct string_list *list, + struct argv_array *extra_log_args) { struct child_process cp = CHILD_PROCESS_INIT; FILE *in; @@ -36,7 +37,12 @@ static int read_patches(const char *range, struct string_list *list) struct patch_util *util = NULL; int in_header = 1; - argv_array_pushl(&cp.args, "log", "--no-color", "-p", "--no-merges", + argv_array_pushl(&cp.args, "log", "--no-color", "-p", "--no-merges", NULL); + + if (extra_log_args) + argv_array_pushv(&cp.args, extra_log_args->argv); + + argv_array_pushl(&cp.args, "--reverse", "--date-order", "--decorate=no", "--no-abbrev-commit", range, NULL); @@ -419,14 +425,20 @@ int show_range_diff(const char *range1, const char *range2, { int res = 0; + struct argv_array extra = ARGV_ARRAY_INIT; + struct string_list branch1 = STRING_LIST_INIT_DUP; struct string_list branch2 = STRING_LIST_INIT_DUP; - if (read_patches(range1, &branch1)) + argv_array_push(&extra, "--diff-algorithm=patience"); + argv_array_push(&extra, "--indent-heuristic"); + + if (read_patches(range1, &branch1, NULL)) res = error(_("could not parse log for '%s'"), range1); - if (!res && read_patches(range2, &branch2)) + if (!res && read_patches(range2, &branch2, &extra)) res = error(_("could not parse log for '%s'"), range2); + diffopt->color_moved = COLOR_MOVED_DEFAULT; if (!res) { find_exact_matches(&branch1, &branch2); get_correspondences(&branch1, &branch2, creation_factor); -- 2.18.0.597.ga71716f1ad-goog ^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH 2/2] WIP range-diff: print some statistics about the range 2018-08-10 0:10 ` [PATCH 0/2] Getting data on different diff algorithms WAS: " Stefan Beller 2018-08-10 0:10 ` [PATCH 1/2] WIP: range-diff: take extra arguments for different diffs Stefan Beller @ 2018-08-10 0:10 ` Stefan Beller 1 sibling, 0 replies; 13+ messages in thread From: Stefan Beller @ 2018-08-10 0:10 UTC (permalink / raw) To: sbeller; +Cc: git Signed-off-by: Stefan Beller <sbeller@google.com> --- range-diff.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/range-diff.c b/range-diff.c index a977289b7dc..fbabce5f91f 100644 --- a/range-diff.c +++ b/range-diff.c @@ -264,6 +264,8 @@ static void get_correspondences(struct string_list *a, struct string_list *b, free(b2a); } +int completely_different, slightly_different, same; + static void output_pair_header(struct diff_options *diffopt, int patch_no_width, struct strbuf *buf, @@ -288,15 +290,19 @@ static void output_pair_header(struct diff_options *diffopt, if (!b_util) { color = color_old; status = '<'; + slightly_different++; } else if (!a_util) { color = color_new; status = '>'; + completely_different++; } else if (strcmp(a_util->patch, b_util->patch)) { color = color_commit; status = '!'; + slightly_different++; } else { color = color_commit; status = '='; + same++; } strbuf_reset(buf); @@ -439,12 +445,15 @@ int show_range_diff(const char *range1, const char *range2, res = error(_("could not parse log for '%s'"), range2); diffopt->color_moved = COLOR_MOVED_DEFAULT; + if (!res) { find_exact_matches(&branch1, &branch2); get_correspondences(&branch1, &branch2, creation_factor); output(&branch1, &branch2, diffopt); } + printf("patch ranges are %d same, %d slightly different and %d completely different\n", same, slightly_different, completely_different); + string_list_clear(&branch1, 1); string_list_clear(&branch2, 1); -- 2.18.0.597.ga71716f1ad-goog ^ permalink raw reply related [flat|nested] 13+ messages in thread
end of thread, other threads:[~2018-08-10 22:19 UTC | newest] Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2018-07-24 0:36 [PATCH] Documentation/diff-options: explain different diff algorithms Stefan Beller 2018-07-24 4:40 ` Jonathan Nieder 2018-07-24 17:38 ` Stefan Beller 2018-07-24 20:06 ` Junio C Hamano 2018-08-06 22:25 ` Stefan Beller 2018-08-06 23:18 ` Jonathan Nieder 2018-08-07 15:56 ` Junio C Hamano 2018-08-09 19:26 ` Stefan Beller 2018-08-10 22:18 ` Stefan Beller 2018-08-09 19:51 ` Stefan Beller 2018-08-10 0:10 ` [PATCH 0/2] Getting data on different diff algorithms WAS: " Stefan Beller 2018-08-10 0:10 ` [PATCH 1/2] WIP: range-diff: take extra arguments for different diffs Stefan Beller 2018-08-10 0:10 ` [PATCH 2/2] WIP range-diff: print some statistics about the range Stefan Beller
Code repositories for project(s) associated with this public inbox https://80x24.org/mirrors/git.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).