git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [Git BUG] Please do not use "-B -M" in "diff" family for now
@ 2015-01-31 19:12 Junio C Hamano
  2015-02-02  3:18 ` [RFC PATCH] diff: do not use creation-half of -B as a rename target candidate Junio C Hamano
  0 siblings, 1 reply; 5+ messages in thread
From: Junio C Hamano @ 2015-01-31 19:12 UTC (permalink / raw)
  To: git; +Cc: linux-kernel

Please avoid the combination "-B -M" when running "diff" family of
commands, as it can produce incorrect results [*1*] in corner cases.
Use of either "-B" or "-M" by itself is fine, but not both at the
same time.

This problem exists even in Git v1.7.12, and I have no reason to
suspect that it worked correctly in any earlier versions, so I do
not consider it an urgent issue to fix during the pre-release for
upcoming Git v2.3 release.

End of TL;DR.

For a simple reproduction, go to your copy of the kernel tree and do
this:

    $ git diff -B -M v2.6.13 v2.6.12 -- \
        arch/ppc64/kernel/rtas_pci.c arch/ppc64/kernel/pSeries_pci.c >:patch

    $ git reset --hard
    $ git checkout v2.6.13

    $ git apply --cached --whitespace=nowarn :patch
    error: arch/ppc64/kernel/pSeries_pci.c: already exists in index

This is not a bug in "apply", but in "diff".  The resulting patch
looks like this:

    $ git apply --whitespace=nowarn --numstat --summary :patch
    112     5       arch/ppc64/kernel/pSeries_pci.c
     rename arch/ppc64/kernel/{rtas_pci.c => pSeries_pci.c} (81%)

That is, it wants to rename rtas_pci.c to pSeries_pci.c with a bit
of editing.

However, what really happens when going from 2.6.13 to 2.6.12 is
this:

    $ git diff v2.6.13 v2.6.12 -- \
        arch/ppc64/kernel/rtas_pci.c arch/ppc64/kernel/pSeries_pci.c |
        git apply --whitespace=nowarn --numstat --summary
    478     19      arch/ppc64/kernel/pSeries_pci.c
    0       495     arch/ppc64/kernel/rtas_pci.c
     delete mode 100644 arch/ppc64/kernel/rtas_pci.c

That is:

    #1 rtas_pci.c exists in 2.6.13 but not 2.6.12.

    #2 pSeries_pci.c exists in both 2.6.12 and 2.6.13 but is majorly
       rewritten; in fact, the difference between rtas_pci.c in
       2.6.13 and pSeries_pci.c in 2.6.12 is much smaller than the
       difference between pSeries_pci.c from 2.6.13 and 2.6.12.

What seems to happen is that "diff -B" splits the above #2 into
"removal of pSeries_pci.c with contents from 2.6.13" and "creation
of pSeries_pci.c with contents from 2.6.12" and these gets further
combined with the "removal of rtas_pci.c" [*2*].  In the end result,
we incorrectly get "rename rtas_pci.c to create pSeries_pci.c with
some changes".  We shouldn't do this because pSeries_pci.c is not
created and is not a new file.

[Footnotes]

*1* "git apply" refuses to apply the output affected by the bug, so
    at least this will not lead to silent corruption.

*2* The "-B" option was introduced solely to find a possible
    rename/copy source to express this sequence:

    $ cp A B
    $ edit B slightly ;# optional
    $ edit A heavily

as if it was done like this instead:

    $ mv A B
    $ edit B slightly ;# optional
    $ create B from scratch

Without "-B", the change would be expressed as "Heavily edit A,
create B from scratch", and with "-B", we would say "Create B by
copying from A and then edit, and edit A heavily", which would
result in more readable patch (in fact, we would see in the output
of "diff -B -M v2.6.12 v2.6.13" exactly that).

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [RFC PATCH] diff: do not use creation-half of -B as a rename target candidate
  2015-01-31 19:12 [Git BUG] Please do not use "-B -M" in "diff" family for now Junio C Hamano
@ 2015-02-02  3:18 ` Junio C Hamano
  2015-02-02  5:52   ` Stefan Beller
  2015-02-02 18:25   ` Junio C Hamano
  0 siblings, 2 replies; 5+ messages in thread
From: Junio C Hamano @ 2015-02-02  3:18 UTC (permalink / raw)
  To: git

When a commit creates new file B by copying the contents of an
existing file A and making a small edit and makes large edit to A,
"diff -M" would not see any copying or renaming, because the file A
appears in both preimage and postimage.  The output ends up showing
two large patches, one that adds almost the entirety of original A
to the newly created file B, and the other that removes almost the
entirety of the original contents from A and adds new material to
it.

"diff -B -M" was invented to allow us notice this case and instead
express the change as one patch that copies A to B with small edit,
and rewrites A with contents unrelated to its original.  The patch
expressed this way becomes much easier to read, because the reader
can see that most of the contents in B after the change came from
the original A (the patch header shows "copy from A" for B), and A
was completely rewritten by the patch (the patch body shows
everything removed first and then all new material added).

However this logic has a bug to incorrectly produce an unapplicable
patch in other cases.  Starting from existing files A and B, when a
commit removes A and makes the resulting B similar to the original
contents of A, it incorrectly expressed it as a change that renames
A to B and then makes small edits.  Such a patch will not apply to
the original state the patch was taken from, as B exists there.

The root cause of the problem is that after the complete rewrite of
B is detected and is internally split into deletion of old B and
creation of new B, the rename detection machinery matches the old
contents of A (which will go away) with the newly created B, because
they are similar.  Considering the deletion-half of the change to B
as possible rename source (i.e. from which a new file is created) is
good, but considering the creation-half as possible rename
destination (i.e. the file is created by taking whole contents from
elsewhere) is bad---because we know B being a broken filepair means
it already existed, and cannot have been newly _created_.

For a simple reproduction, go to your copy of the kernel tree and do
this:

    $ git diff -B -M v2.6.13 v2.6.12 -- \
        arch/ppc64/kernel/rtas_pci.c arch/ppc64/kernel/pSeries_pci.c >:patch

    $ git reset --hard
    $ git checkout v2.6.13

    $ git apply --cached --whitespace=nowarn :patch
    error: arch/ppc64/kernel/pSeries_pci.c: already exists in index

In this example, rtas_pci.c and pSeries_pci.c corresponds to files A
and B, respectively, in the more general description of the problem
given earlier.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---

 * Here is what I am at the moment; I cannot quite explain (hence I
   cannot convince myself) why this is the right solution, but it
   seems to make the above sample case work without breaking any
   existing tests.  It is possible that the tests that would break
   without the "&& !p->score" bit are expecting wrong results, but I
   didn't look at them in detail.

 diffcore-rename.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 6c7a72f..f4e8e00 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -516,6 +516,8 @@ void diffcore_rename(struct diff_options *options)
 			else if (!DIFF_OPT_TST(options, RENAME_EMPTY) &&
 				 is_empty_blob_sha1(p->two->sha1))
 				continue;
+			else if (p->broken_pair && !p->score)
+				continue;
 			else
 				locate_rename_dst(p->two, 1);
 		}
-- 
2.3.0-rc2-165-gbd2cd9b

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH] diff: do not use creation-half of -B as a rename target candidate
  2015-02-02  3:18 ` [RFC PATCH] diff: do not use creation-half of -B as a rename target candidate Junio C Hamano
@ 2015-02-02  5:52   ` Stefan Beller
  2015-02-02  6:47     ` Yue Lin Ho
  2015-02-02 18:25   ` Junio C Hamano
  1 sibling, 1 reply; 5+ messages in thread
From: Stefan Beller @ 2015-02-02  5:52 UTC (permalink / raw)
  To: Junio C Hamano, git

On 01.02.2015 19:18, Junio C Hamano wrote:
> When a commit creates new file B by copying the contents of an
> existing file A and making a small edit and makes large edit to A,

This part is hard to parse
"When ... and making a small edit and makes a large edit"
So large or small? It's a bit hard to parse and understand when just
trying to read the first sentence. IT becomes clear (somewhat) later.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH] diff: do not use creation-half of -B as a rename target candidate
  2015-02-02  5:52   ` Stefan Beller
@ 2015-02-02  6:47     ` Yue Lin Ho
  0 siblings, 0 replies; 5+ messages in thread
From: Yue Lin Ho @ 2015-02-02  6:47 UTC (permalink / raw)
  To: git

A1 = "I am file A."
B1 is copied from A1, so B1 = "I am file A."
B1 changes to B2 = "I am file B."
A1 changes to A2 = "file A is changed a lot, a lot, ..., a lot."
At this moment, commit A2 and B2 files.





--
View this message in context: http://git.661346.n2.nabble.com/Git-BUG-Please-do-not-use-B-M-in-diff-family-for-now-tp7624794p7624836.html
Sent from the git mailing list archive at Nabble.com.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH] diff: do not use creation-half of -B as a rename target candidate
  2015-02-02  3:18 ` [RFC PATCH] diff: do not use creation-half of -B as a rename target candidate Junio C Hamano
  2015-02-02  5:52   ` Stefan Beller
@ 2015-02-02 18:25   ` Junio C Hamano
  1 sibling, 0 replies; 5+ messages in thread
From: Junio C Hamano @ 2015-02-02 18:25 UTC (permalink / raw)
  To: git

Junio C Hamano <gitster@pobox.com> writes:

>  * Here is what I am at the moment; I cannot quite explain (hence I
>    cannot convince myself) why this is the right solution, but it
>    seems to make the above sample case work without breaking any
>    existing tests.  It is possible that the tests that would break
>    without the "&& !p->score" bit are expecting wrong results, but I
>    didn't look at them in detail.

Sadly, I think this is garbage.  "Do not consider creation-half of a
broken pair, ever" is too simple and cripples this case that starts
with two files A and B that are quite different:

	$ git add A B
	$ mv A B.new
        $ mv B A
        $ mv B.new B
        $ git diff -B -M

where the internal machinery breaks both A and B into these two file
pairs:

	delete A(old)
        create A(new)

	delete B(old)
        create B(new)

and then match them up to produce

	rename A to B
        rename B to A

The rule need to be "creation-half of a broken pair can be used as
the destination of a rename, if and only if its corresponding
deletion-half is used as the source of another rename elsewhere".
Under that condition, a file A that is completely rewritten to
become similar to another existing file B can be expressed as a
rename of B, because A is renamed away to make room in the same
change.

Fixing this is turning out to be more complex than I originally
hoped X-<.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2015-02-02 18:26 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-31 19:12 [Git BUG] Please do not use "-B -M" in "diff" family for now Junio C Hamano
2015-02-02  3:18 ` [RFC PATCH] diff: do not use creation-half of -B as a rename target candidate Junio C Hamano
2015-02-02  5:52   ` Stefan Beller
2015-02-02  6:47     ` Yue Lin Ho
2015-02-02 18:25   ` Junio C Hamano

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).