From: Stefan Beller <sbeller@google.com>
To: git <git@vger.kernel.org>, jgit-dev@eclipse.org
Subject: FYI: histogram diff bug
Date: Fri, 20 Jul 2018 12:57:31 -0700 [thread overview]
Message-ID: <CAGZ79kZYO6hHiAM8Sfp3J=VX11c=0-7YDSx3_EAKt5-uvvt-Ew@mail.gmail.com> (raw)
seq 1 100 >one
echo 99 > two
seq 1 2 98 >>two
git diff --no-index --histogram one two
In addition to the recent patches[1], I found another bug in the
histogram algorithm,
which can be produced with the instructions given above. At least I
think it is a bug as
the diff looks terrible to me. (It is a correct diff, but the output
is just bad.)
[1] https://public-inbox.org/git/20180719221942.179565-1-sbeller@google.com/
And I think the issue is in the algorithm, which is basically as follows:
1) build up information about one side ("scanA()"), by putting each
line into a hashmap (and counting its occurrences, such that you
can make the histogram)
2) "find_LCS" on the other side (B) by trying low occurrence lines
and seeing how long the common consequential string match is.
(I think this LCS refers to [2], not to be confused with [3])
3) The whole mental model of the difference
between both sides now look like:
stuff before the LCS
LCS
stuff after the LCS
As the LCS has no output associated with it, just call recursively
do_histogram on the remainders before and after.
This is my rough understanding of the algorithm so far.
(I am unsure about the exact find_LCS part how and why to
pick certain candidates for finding the LCS).
The big issue however is the count of LCS, so far we assumed
there is only one! In the example given above there are many,
i.e. lots of "longest" common continuous substrings of length 1.
We are unfortunate to pick "99" as the LCS and recurse before
and after; the other LCSs would be a better pick.
So I think we would need to find "all LCS", which there can be
more than one at the same time, and then fill the gaps between
them using recursion.
As the order of LCSs can be different in the two sides,
we actually have to produce a diff of LCSs and those which are
out of order need to be emitted as full edits (deletes or creates).
In the example above we'd want to have "99" to be a full create
and delete instead of being *the* anchor; all other LCSs ought to
be anchors for the first (zero'th?) level of recursion.
This bug is also present in JGit which this algorithm was ported
from, hence jgit-dev is cc'd (and furthers my suspicion that the
issue is not in the port but in the algorithm itself).
[2] https://en.wikipedia.org/wiki/Longest_common_substring_problem
[3] https://en.wikipedia.org/wiki/Longest_common_subsequence_problem
I'll think about this and see if I can fix it properly,
Thoughts or Opinions?
Stefan
reply other threads:[~2018-07-20 19:57 UTC|newest]
Thread overview: [no followups] expand[flat|nested] mbox.gz Atom feed
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to='CAGZ79kZYO6hHiAM8Sfp3J=VX11c=0-7YDSx3_EAKt5-uvvt-Ew@mail.gmail.com' \
--to=sbeller@google.com \
--cc=git@vger.kernel.org \
--cc=jgit-dev@eclipse.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).