git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / mirror / code / Atom feed
From: Phillip Wood <phillip.wood123@gmail.com>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Atharva Raykar" <raykar.ath@gmail.com>
Cc: Count of San Francisco <countofsanfrancisco@gmail.com>,
	git <git@vger.kernel.org>
Subject: Re: git diff word diff bug??
Date: Sun, 2 May 2021 19:00:00 +0100	[thread overview]
Message-ID: <c8e51afa-ec67-fd8a-09e5-2111020a5c55@gmail.com> (raw)
In-Reply-To: <87o8e14b6a.fsf@evledraar.gmail.com>

On 26/04/2021 10:45, Ævar Arnfjörð Bjarmason wrote:
> 
> On Sun, Apr 25 2021, Atharva Raykar wrote:
> 
>> On 20-Apr-2021, at 22:08, Count of San Francisco <countofsanfrancisco@gmail.com> wrote:
>>>[...]
>>> The diff above does not make it clear that the "{+river+}" is really to be appended (or related) to the first line.
>>> I expected the first diff line to look like this:
>>> The fox jumped over the [-wall.-]{+river+} and the rest of the lines are delete lines.
>>>
>>> git diff --word-diff=porcelain file0.txt file1.txt produced this:
>>> diff --git a/file0.txt b/file1.txt
>>> index c8756ba..3413f10 100644
>>> --- a/file0.txt
>>> +++ b/file1.txt
>>> @@ -1,11 +1,2 @@
>>>   The fox jumped over the
>>> -wall.
>>> ~
>>> -Blah1e32
>>> ~
>>> -q432423
>>> ~
>>> -qe23234
>>> ~
>>> - 233
>>> ~
>>> -253
>>> ~
>>> -345235
>>> ~
>>> ~
>>> -53243
>>> ~
>>> -afsfffas
>>> +river.
>>> ~
>>> +  He made it over.
>>> ~
>>>
>>> This is more non-discernable. The git diff --help documentation says
>>> that "Newlines in the input are represented by a tilde ~ on a line
>>> of its own". So a script would see the '~' character and interpret
>>> that as a new line. The script would have mistaken the "+river" for
>>> a different line. The git diff --help documentation does not explain
>>> what to do in this scenario.
>>>
>>> I expected this:
>>>   The fox jumped over the
>>> -wall.
>>> +river.
>>> ~
>>
>> This is also consistent with the behaviour I mentioned above.
>> A script would need to interpret this as:
>> delete "wall"        (this starts the streak of deletions)
>> go to next line
>> delete "Blah1e32"
>> ...
>>
>> and as soon as it sees a '+', that is, an addition, it knows
>> the series of deletions are done with, so it will add "river"
>> to the last line that was common to both, that is,
>> "the fox jumped over the".
>>
>>> Is this a bug? If not, how do I make the distinction that the {+river+} (in the first case) and the +river (in the 2nd case) is really for the first line?
>>
>> I do not think this is a bug, because it does not really
>> deviate from any specified behaviour. But I do see the source
>> of confusion.
>>
>> I hope I could explain that well enough.
> 
> I do think it's somewhere between a "bug" and a "missing feature"
> though.
> 
> The core issue here is that in diff.c we "fake up" the whole word-diff
> mode by essentially taking both sides, splitting them on whitespace, and
> then diffing that. So for A/B like:
> 
>      A: foo bar
>         baz
>      B: foo baz
>         boo
> 
> We diff:
> 
>      A: foo\nbar\nbaz
>      B: foo\nbaz\nboo
> 
> If you insert an extra newline into that stream you'll get different
> results, which is closer to what the Count is expecting here.
> 
> In this case using --word-diff-regex='[^\n]' will produce: >
>      The fox jumped over the [-wall-]{+river+}.
> 
> But the rest of the output isn't ideal, exercise for the user and all
> that.

Yes it's a complete mess as we're matching a character at a time rather 
than words, the only reason it works for the first line is that there 
happens to be some context at the end of the line.

> The bug is arguably that we should be doing this by default, i.e. we
> split up our output into "\n" internally, and arguably confuse that with
> what the user wanted from their input.
> 
> But see bf82940dbf1 (color-words: enable REG_NEWLINE to help user,
> 2009-01-17) which is probably the source of the "bug" here (if any) for
> a trade-off related to that.

Reverting that commit has no effect on the output, looking at 
find_word_boundaries() it has always terminated the match at a newline 
so I'm not sure what the commit message for bf82940dbf1 means when it 
says "Of course newlines can still be matched explicitly.".

To me the problem is not with how the word diff is calculated but how it 
is presented. At the moment it just prints all the deletions followed by 
all the insertions. I think that if following some context it just 
printed the deletions and insertions up to the end of that line and then 
dumped the rest of the deletions and insertions following the line break 
that would give a more readable result. It would need to do something 
similar for deletions and insertions at the beginning of a line that are 
followed by some context before the end of the line.

In general I find it useful that --word-diff ignores the line structure 
of the input, this enables it to show changes to the words in re-flowed 
text without the clutter one sees in a line-by-line diff from moving the 
line breaks. I think the changes above would probably be compatible with 
that.

Best Wishes

Phillip


  parent reply	other threads:[~2021-05-02 18:00 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-20 16:38 Count of San Francisco
2021-04-25  8:54 ` Atharva Raykar
2021-04-26  9:45   ` Ævar Arnfjörð Bjarmason
2021-04-30  4:12     ` Count of San Francisco
2021-05-02 18:00     ` Phillip Wood [this message]
2021-05-03  9:47       ` Phillip Wood
2022-06-07 22:40         ` Scott Phuong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=c8e51afa-ec67-fd8a-09e5-2111020a5c55@gmail.com \
    --to=phillip.wood123@gmail.com \
    --cc=avarab@gmail.com \
    --cc=countofsanfrancisco@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=phillip.wood@dunelm.org.uk \
    --cc=raykar.ath@gmail.com \
    --subject='Re: git diff word diff bug??' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Code repositories for project(s) associated with this inbox:

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).