git@vger.kernel.org mailing list mirror (one of many)
* difflame improvements
@ 2017-02-15  5:19 Edmundo Carmona Antoranz
  2017-02-17  5:17 ` Jeff King
From: Edmundo Carmona Antoranz @ 2017-02-15  5:19 UTC
  To: Git List; +Cc: Jeff King

Hi!

I've been working on detecting the revisions where a "real" deletion
was made, and I think I have advanced a lot on that front. I still have
to work on many scenarios (renamed files, for example... also
performance), but I have at least tried a few runs against git-scm
history and the results are "promising":

23:05 $ git blame -s --reverse -L 25,40 HEAD~20..HEAD -- versioncmp.c
066fb0494e 25) static int initialized;
066fb0494e 26)
066fb0494e 27) /*
8ec68d1ae2 28)  * p1 and p2 point to the first different character in two strings. If
8ec68d1ae2 29)  * either p1 or p2 starts with a prerelease suffix, it will be forced
8ec68d1ae2 30)  * to be on top.
8ec68d1ae2 31)  *
8ec68d1ae2 32)  * If both p1 and p2 start with (different) suffix, the order is
8ec68d1ae2 33)  * determined by config file.
066fb0494e 34)  *
8ec68d1ae2 35)  * Note that we don't have to deal with the situation when both p1 and
8ec68d1ae2 36)  * p2 start with the same suffix because the common part is already
8ec68d1ae2 37)  * consumed by the caller.
066fb0494e 38)  *
066fb0494e 39)  * Return non-zero if *diff contains the return value for versioncmp()
066fb0494e 40)  */

Lines 28-33:

23:05 $ git show --summary 8ec68d1ae2
commit 8ec68d1ae2863823b74d67c5e92297e38bbf97bc
Merge: e801be066 c48886779
Author: Junio C Hamano <>
Date:   Mon Jan 23 15:59:21 2017 -0800

    Merge branch 'vn/diff-ihc-config'

    "git diff" learned diff.interHunkContext configuration variable
    that gives the default value for its --inter-hunk-context option.

    * vn/diff-ihc-config:
      diff: add interhunk context config option



And this is not telling me the _real_ revision where the lines were
_deleted_ so it's not very helpful, as Peff has already mentioned.
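
To make the limitation concrete: "git blame --reverse" reports, for each
line, the last revision in which that line still existed, which is not
necessarily the revision that removed it. The following is a minimal
sketch (not difflame's code) that parses "git blame -s" output like the
above and groups consecutive lines by reported revision. The names
parse_blame_s and last_seen_ranges are hypothetical, and the regular
expression assumes the short format without a filename column.

```python
import re
from itertools import groupby

# Matches one line of `git blame -s` output, e.g. "8ec68d1ae2 28)  * text".
# A leading "^" marks a boundary commit. Assumes no filename column.
BLAME_S_LINE = re.compile(r"^(\^?)([0-9a-f]+)\s+(\d+)\) ?(.*)$")


def parse_blame_s(output):
    """Return a list of (revision, is_boundary, line_number, text) tuples."""
    parsed = []
    for raw in output.splitlines():
        match = BLAME_S_LINE.match(raw)
        if match is None:
            continue  # skip anything that is not a blame line
        boundary, revision, number, text = match.groups()
        parsed.append((revision, boundary == "^", int(number), text))
    return parsed


def last_seen_ranges(parsed):
    """Group consecutive lines by the revision that blame reported."""
    ranges = []
    for revision, group in groupby(parsed, key=lambda entry: entry[0]):
        numbers = [entry[2] for entry in group]
        ranges.append((revision, numbers[0], numbers[-1]))
    return ranges
```

Fed the output above, the grouping would show lines 28-33 attributed to
8ec68d1ae2, the merge where they were last seen, rather than to the
commit that actually deleted them.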

Running difflame:

23:06 $ time ~/proyectos/git/difflame/difflame.py -bp=-s -w HEAD~20 HEAD -- versioncmp.c
diff --git a/versioncmp.c b/versioncmp.c
index 80bfd109f..9f81dc106 100644
--- a/versioncmp.c
+++ b/versioncmp.c
@@ -24,42 +24,83 @@
.
.
.
+b17846432d  33) static void find_better_matching_suffix(const char *tagname, const char *suffix,
+b17846432d  34)                                        int suffix_len, int start, int conf_pos,
+b17846432d  35)                                        struct suffix_match *match)
+b17846432d  36) {
b17846432d  37)        /*
       c026557a3 versioncmp: generalize version sort suffix reordering
-c026557a3 (SZEDER 28)  * p1 and p2 point to the first different character in two strings. If
-c026557a3 (SZEDER 29)  * either p1 or p2 starts with a prerelease suffix, it will be forced
-c026557a3 (SZEDER 30)  * to be on top.
-c026557a3 (SZEDER 31)  *
-c026557a3 (SZEDER 32)  * If both p1 and p2 start with (different) suffix, the order is
-c026557a3 (SZEDER 33)  * determined by config file.
       b17846432 versioncmp: factor out helper for suffix matching
+b17846432d  38)         * A better match either starts earlier or starts at the same offset
+b17846432d  39)         * but is longer.
+b17846432d  40)         */
+b17846432d  41)        int end = match->len < suffix_len ? match->start : match->start-1;
.
.
.

Same range of (deleted) lines:

23:10 $ git show --name-status c026557a3
commit c026557a37361b7019acca28f240a19f546739e9
Author: SZEDER Gábor <>
Date:   Thu Dec 8 15:24:01 2016 +0100

   versioncmp: generalize version sort suffix reordering

   The 'versionsort.prereleaseSuffix' configuration variable, as its name
   suggests, is supposed to only deal with tagnames with prerelease
.
.
.


   Signed-off-by: SZEDER Gábor <>
   Signed-off-by: Junio C Hamano <>

M       Documentation/config.txt
M       Documentation/git-tag.txt
M       t/t7004-tag.sh
M       versioncmp.c


This is the revision where the deletion happened.

That's it for the time being.


* Re: difflame improvements
  2017-02-15  5:19 difflame improvements Edmundo Carmona Antoranz
@ 2017-02-17  5:17 ` Jeff King
  2017-02-17  7:01   ` Edmundo Carmona Antoranz
From: Jeff King @ 2017-02-17  5:17 UTC
  To: Edmundo Carmona Antoranz; +Cc: Git List

On Tue, Feb 14, 2017 at 11:19:05PM -0600, Edmundo Carmona Antoranz wrote:

> I've been working on detecting the revisions where a "real" deletion
> was made, and I think I have advanced a lot on that front. I still have
> to work on many scenarios (renamed files, for example... also
> performance), but I have at least tried a few runs against git-scm
> history and the results are "promising":

I played with this a bit more, and it did turn up the correct results
for some deletions in my experiments.

One thing I noticed is that it also turned up nonsense for lines that
blame in weird ways. For instance, I have a diff like this (these are
real examples, but don't pay attention to the sha1s; it's in a fork of
git, not upstream):

  $ git diff v2.6.5 builtin/prune-packed.c
  diff --git a/builtin/prune-packed.c b/builtin/prune-packed.c
  index 7cf900ea07..5e3727e841 100644
  --- a/builtin/prune-packed.c
  +++ b/builtin/prune-packed.c
  @@ -2,6 +2,7 @@
   #include "cache.h"
   #include "progress.h"
   #include "parse-options.h"
  +#include "gh-log.h"
   
   static const char * const prune_packed_usage[] = {
   	N_("git prune-packed [-n | --dry-run] [-q | --quiet]"),
  @@ -29,8 +30,11 @@ static int prune_object(const unsigned char *sha1, const char *path,
   
   	if (*opts & PRUNE_PACKED_DRY_RUN)
   		printf("rm -f %s\n", path);
  -	else
  +	else {
  +		gh_logf("prune", "%s Duplicate loose object pruned\n",
  +			sha1_to_hex(sha1));
   		unlink_or_warn(path);
  +	}
   	return 0;
   }
   

Running difflame on it says this:

  $ python /path/to/difflame.py v2.6.5..HEAD -- builtin/prune-packed.c
  [...]
  -2c0b29e662 (Jeff King 2016-01-26 15:27:55 -0500 32) 	else
  +d60032f8640 builtin/prune-packed.c (Jeff King        2015-02-02 23:15:33 -0500 33) 	else {
  +d60032f8640 builtin/prune-packed.c (Jeff King        2015-02-02 23:15:33 -0500 34) 		gh_logf("prune", "%s Duplicate loose object pruned\n",
  +d60032f8640 builtin/prune-packed.c (Jeff King        2015-02-02 23:15:33 -0500 35) 			sha1_to_hex(sha1));
   0d3b729680e builtin/prune-packed.c (Jeff King        2014-10-15 18:40:53 -0400 36) 		unlink_or_warn(path);
  +2396ec85bd1 prune-packed.c         (Linus Torvalds   2005-07-03 14:27:34 -0700 37) 	}

There are two weird things. One is that the old "else" is attributed to
my 2c0b29e662. That's quite weird, because that is a merge commit which
did not touch the file at all. I haven't tracked it down, but presumably
that is weirdness with the --reverse blame.

But there's another one, that I think is easy to fix. The closing brace
is attributed to some ancient commit from Linus. Which yes, I'm sure had
a closing brace, but not _my_ closing brace that was added by
d60032f8640, that the rest of the lines got attributed to.

This isn't difflame's fault; that's what "git blame" tells you about
that line. But since I already told difflame "v2.6.5..HEAD", it would
probably make sense to similarly limit the blame to that range. That
turns up a boundary commit for the line. Which is _also_ not helpful,
but at least the tool is telling me that the line came from before
v2.6.5, and I don't really need to care much about it.

Part of this is that my use case may be a bit different than yours. I
don't actually want to look at the blame results directly. I just want
to see the set of commits that I'd need to look at and possibly
cherry-pick in order to re-create the diff.
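
As an illustration of that use case, here is a rough sketch that reduces
annotated diff output to the set of commits worth reviewing. It is not
part of difflame: the function name commits_to_review is hypothetical,
and the line layout (a "+" or "-" marker, an optional "^" boundary
marker, then an abbreviated revision ID) is assumed from the output
quoted above.

```python
import re

# One annotated diff line: a "+" or "-" marker, an optional "^" boundary
# marker, then an abbreviated revision ID. This layout is an assumption
# based on the quoted output, not taken from difflame's source.
ANNOTATED_LINE = re.compile(r"^([+-])(\^?)([0-9a-f]{7,40})\b")


def commits_to_review(annotated_diff):
    """Return revisions blamed for changed lines, in first-seen order.

    Context lines, diff headers and boundary commits (lines that predate
    the requested range) are ignored.
    """
    seen = []
    for raw in annotated_diff.splitlines():
        match = ANNOTATED_LINE.match(raw.strip())
        if match is None:
            continue  # context line, header, or hint line
        _marker, boundary, revision = match.groups()
        if boundary:
            continue  # the line came from before the range
        if revision not in seen:
            seen.append(revision)
    return seen
```

With blame limited to the requested range, the closing brace would show
up as a boundary line and simply drop out of the resulting list.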

-Peff


* Re: difflame improvements
  2017-02-17  5:17 ` Jeff King
@ 2017-02-17  7:01   ` Edmundo Carmona Antoranz
  2017-02-19  6:35     ` Edmundo Carmona Antoranz
From: Edmundo Carmona Antoranz @ 2017-02-17  7:01 UTC
  To: Jeff King; +Cc: Git List

On Thu, Feb 16, 2017 at 11:17 PM, Jeff King <peff@peff.net> wrote:

> This isn't difflame's fault; that's what "git blame" tells you about
> that line. But since I already told difflame "v2.6.5..HEAD", it would
> probably make sense to similarly limit the blame to that range. That
> turns up a boundary commit for the line. Which is _also_ not helpful,
> but at least the tool is telling me that the line came from before
> v2.6.5, and I don't really need to care much about it.


I'm running my own tests on difflame and I have a theory about when it
breaks (at least for one of the failing cases):

Analysis for deleted lines is driven by git blame --reverse. What I
have noticed is that it "breaks" when blame --reverse leads the
analysis into revisions that do not have "treeish1" in their history.
In other words, the analysis drifts "to the sides" of treeish1 instead
of staying within the revisions in the history of treeish2 that have
treeish1 as one of their ancestors. Those side revisions are definitely
a valid case for analysis, but there blame --reverse stops being
helpful.
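
The condition described above can be checked directly. This is a small
sketch, not difflame's actual logic: it assumes git is on the PATH and
uses "git merge-base --is-ancestor", which exits with status 0 when the
first revision is an ancestor of the second and 1 when it is not. The
function name and the injectable "run" parameter are hypothetical.

```python
import subprocess


def is_ancestor(ancestor, descendant, run=subprocess.run):
    """Return True if `ancestor` is reachable from `descendant`.

    `run` defaults to subprocess.run; it is a parameter only so that the
    check can be exercised without a repository.
    """
    result = run(["git", "merge-base", "--is-ancestor", ancestor, descendant])
    if result.returncode not in (0, 1):
        # Any other status means git itself failed (bad revision, etc.).
        raise RuntimeError(
            "git merge-base failed for %s %s" % (ancestor, descendant)
        )
    return result.returncode == 0
```

Under this theory, reverse blame over treeish1..X would only be expected
to help when is_ancestor(treeish1, X) is true.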

Take this example (I just pushed a debug-deletion branch into gh...
probably more debugging messages will be needed):

$ difflame.py HEAD~100 HEAD -- Documentation/git-commit.txt
diff --git a/Documentation/git-commit.txt b/Documentation/git-commit.txt
index f2ab0ee2e..4f8f20a36 100644
--- a/Documentation/git-commit.txt
+++ b/Documentation/git-commit.txt
@@ -265,7 +265,8 @@ FROM UPSTREAM REBASE" section in linkgit:git-rebase[1].)
bcf9626a71 (Matthieu Moy      2016-06-28 13:40:11 +0200 265)   If this option is specified together with `--amend`, then
04c8ce9c1c (Markus Heidelberg 2008-12-19 13:14:18 +0100 266)   no paths need to be specified, which can be used to amend
d4ba07cac5 (Johannes Sixt     2008-04-10 13:33:09 +0200 267)   the last commit without committing changes that have
       Range of revisions: 02db2d..066fb04
               Treeish1 02db2d04: 02db2d042 Merge branch 'ah/grammos'
               Treeish2 066fb0494: 066fb0494 blame: draft of line format
       Blamed Revision afe0e2a39: afe0e2a39 Merge branch 'da/difftool-dir-diff-fix'
       Original Filename a/Documentation/git-commit.txt Deleted Line 268
       Children revisions:
               3aead1cad7a: 3aead1cad Merge branch 'ak/commit-only-allow-empty'
       There's only one child revision.... on that revision the line we are tracking is gone
       Parents of this child revision:
               afe0e2a39166: afe0e2a39 Merge branch 'da/difftool-dir-diff-fix'
               beb635ca9ce: beb635ca9 commit: remove 'Clever' message for --only --amend
       Finding parent where the line has been deleted:
               beb635ca9: beb635ca9 commit: remove 'Clever' message for --only --amend
       Range of revisions: 02db2d042..beb635c
               Treeish1 02db2d0: 02db2d042 Merge branch 'ah/grammos'
               Treeish2 beb635c: beb635ca9 commit: remove 'Clever' message for --only --amend
       Blamed Revision 02db2d0: 02db2d042 Merge branch 'ah/grammos'
       Original Filename a/Documentation/git-commit.txt Deleted Line 268
       Children revisions:
       Found no children... will return the original blamed revision (02db2d0) saying that the deleting revision could not be found
       beb635ca9 commit: remove 'Clever' message for --only --amend
-beb635ca9 (Andreas Krey 2016-12-09 05:10:21 +0100 268) already been staged.
       319d83524 commit: make --only --allow-empty work without paths
+319d835240 (Andreas Krey      2016-12-02 23:15:13 +0100 268) already been staged. ...
+319d835240 (Andreas Krey      2016-12-02 23:15:13 +0100 269)   paths are also not requi...
d4ba07cac5 (Johannes Sixt     2008-04-10 13:33:09 +0200 270)
1947bdbc31 (Junio C Hamano    2008-06-22 14:32:27 -0700 271) -u[<mode>]::
1947bdbc31 (Junio C Hamano    2008-06-22 14:32:27 -0700 272) --untracked-files[=<mode>]::



I know that line 268 was deleted on 319d835240.

So.... on the first round of merge analysis it says "let's go into
beb635ca9". That's fine. That's exactly the path that is required to
reach 319d835240. But then when using this new "range of revisions"
for git blame --reverse, we get that line 268 is not telling us
anything useful:

$ git blame --reverse -L268,268 02db2d042..beb635c -- Documentation/git-commit.txt
^02db2d042 (Junio C Hamano 2016-12-19 14:45:30 -0800 268) already been staged.

So, instead of pointing to 319d835240 (the parent of beb635c), it's
basically saying something like "I give up". My hunch (I haven't sat
down to digest all the details of the output of git blame --reverse...
YET) is that, given that 02db2d042 is _not_ part of the history of
beb635c, git blame --reverse is trying to tell me just that... and that
means I'll have to "script around this scenario".

$ git merge-base 02db2d042 beb635c
0202c411edc25940cc381bf317badcdf67670be4
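
One way to "script around" this, sketched under the assumption that the
leading "^" in blame output marks a boundary commit (here, the start of
the range): split the first token of the blame line and treat a boundary
result as "no deleting revision found in this range". The function name
reverse_blame_revision is hypothetical and this is not difflame's code.

```python
def reverse_blame_revision(blame_line):
    """Return (revision, is_boundary) for one line of `git blame` output.

    A revision prefixed with "^" is a boundary commit; for a reverse
    blame this sketch takes it to mean the walk never left the starting
    revision, so the caller should fall back to another strategy.
    """
    tokens = blame_line.split(None, 1)
    if not tokens:
        raise ValueError("empty blame line")
    token = tokens[0]
    return token.lstrip("^"), token.startswith("^")
```

For the output above this yields the revision 02db2d042 with the
boundary flag set, which a caller could use to trigger the fallback.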


Thanks in advance.


* Re: difflame improvements
  2017-02-17  7:01   ` Edmundo Carmona Antoranz
@ 2017-02-19  6:35     ` Edmundo Carmona Antoranz
From: Edmundo Carmona Antoranz @ 2017-02-19  6:35 UTC
  To: Jeff King; +Cc: Git List

On Fri, Feb 17, 2017 at 1:01 AM, Edmundo Carmona Antoranz
<eantoranz@gmail.com> wrote:
> On Thu, Feb 16, 2017 at 11:17 PM, Jeff King <peff@peff.net> wrote:
>
>> This isn't difflame's fault; that's what "git blame" tells you about
>> that line. But since I already told difflame "v2.6.5..HEAD", it would
>> probably make sense to similarly limit the blame to that range. That
>> turns up a boundary commit for the line. Which is _also_ not helpful,
>> but at least the tool is telling me that the line came from before
>> v2.6.5, and I don't really need to care much about it.
>
>
> I'm running my own tests on difflame and I have a theory about when it
> breaks (at least for one of the failing cases):
>
> Analysis for deleted lines is driven by git blame --reverse. What I
> have noticed is that it "breaks" when blame --reverse leads the
> analysis into revisions that do not have "treeish1" in their history.
> In other words, the analysis drifts "to the sides" of treeish1 instead
> of staying within the revisions in the history of treeish2 that have
> treeish1 as one of their ancestors. Those side revisions are definitely
> a valid case for analysis, but there blame --reverse stops being
> helpful.
>

At the cost of being slower, I just pushed to master the best results yet.

The workaround I developed for the case I described on the previous
mail ended up providing much better results overall so I ended up
replacing the whole merge-analysis logic with it.

Thanks for your kind help and comments, Peff. Let me know how it goes.


* difflame improvements
@ 2017-03-05 16:18 Edmundo Carmona Antoranz
From: Edmundo Carmona Antoranz @ 2017-03-05 16:18 UTC
  To: Git List

Hi!

Since my last post, the biggest improvement is the ability to detect
that the user has requested a "reverse" analysis.

Under "normal" circumstances a user would ask difflame for the diff
starting from an ancestor (calling "difflame treeish1 treeish2" so that
the merge-base of treeish1 and treeish2 equals treeish1). In this case
the result uses straight blame output for added lines and additional
analysis to detect where a line was deleted (the analysis has improved a
lot in this regard.... I haven't heard anything from Peff, though).
But if the user requests the opposite (calling "difflame treeish1
treeish2" so that the merge-base of treeish1 and treeish2 is treeish2),
then the analysis has to be driven "in reverse".
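
The detection described above can be sketched as follows. This is an
illustration rather than difflame's implementation: the helper run_git,
the function names, and the labels "forward", "reverse" and "diverged"
are all hypothetical. It relies only on "git rev-parse" and
"git merge-base".

```python
import subprocess


def run_git(args):
    """Run a git command and return its stripped standard output."""
    return subprocess.check_output(["git"] + args, text=True).strip()


def classify_direction(treeish1_id, treeish2_id, merge_base_id):
    """Compare full commit IDs to decide how the analysis should run."""
    if merge_base_id == treeish1_id:
        return "forward"   # treeish1 is an ancestor of treeish2
    if merge_base_id == treeish2_id:
        return "reverse"   # treeish2 is an ancestor of treeish1
    return "diverged"      # neither is an ancestor of the other


def detect_direction(treeish1, treeish2, git=run_git):
    """Resolve both names to commit IDs and classify the request."""
    # "^{commit}" peels tags so that IDs are comparable with merge-base.
    id1 = git(["rev-parse", treeish1 + "^{commit}"])
    id2 = git(["rev-parse", treeish2 + "^{commit}"])
    base = git(["merge-base", id1, id2])
    return classify_direction(id1, id2, base)
```

In a repository, detect_direction("HEAD~3", "HEAD~2") would be expected
to report a forward request on a linear history, and swapping the
arguments a reverse one.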

Here's one example taken from difflame itself:

normal "forward" call (hope output doesn't get butchered):

$ ./difflame.py HEAD~3 HEAD~2
diff --git a/difflame.py b/difflame.py
index e70154a..04c7577 100755
--- a/difflame.py
+++ b/difflame.py
@@ -365,7 +365,7 @@ def get_full_revision_id(revision):
 e5b218e4 (Edmundo 2017-02-01 365)         # we already had the revision
 50528377 (Edmundo 2017-03-04 366)         return REVISIONS_ID_CACHE[revision]
 d1d11d8a (Edmundo 2017-02-02 367)     # fallback to get it from git
       b1a6693 use rev-list to get revision IDs
-b1a6693 (Edmundo 2017-03-04 368)     full_revision = run_git_command(["show", "--pretty=%H", revision]).split("\n")[0]
+b1a66932 (Edmundo 2017-03-04 368)     full_revision = run_git_command(["rev-list", "--max-count=1", revision]).split("\n")[0]
 50528377 (Edmundo 2017-03-04 369)     REVISIONS_ID_CACHE[revision] = full_revision
 e5b218e4 (Edmundo 2017-02-01 370)     return full_revision
 91b7d3f5 (Edmundo 2017-01-31 371)

"reverse" call:
$ ./difflame.py HEAD~2 HEAD~3
diff --git a/difflame.py b/difflame.py
index 04c7577..e70154a 100755
--- a/difflame.py
+++ b/difflame.py
@@ -365,7 +365,7 @@ def get_full_revision_id(revision):
 e5b218e4 (Edmundo 2017-02-01 365)         # we already had the revision
 50528377 (Edmundo 2017-03-04 366)         return REVISIONS_ID_CACHE[revision]
 d1d11d8a (Edmundo 2017-02-02 367)     # fallback to get it from git
       b1a6693 use rev-list to get revision IDs
-b1a66932 (Edmundo 2017-03-04 368)     full_revision = run_git_command(["rev-list", "--max-count=1", revision]).split("\n")[0]
+b1a6693 (Edmundo 2017-03-04 368)     full_revision = run_git_command(["show", "--pretty=%H", revision]).split("\n")[0]
 50528377 (Edmundo 2017-03-04 369)     REVISIONS_ID_CACHE[revision] = full_revision
 e5b218e4 (Edmundo 2017-02-01 370)     return full_revision
 91b7d3f5 (Edmundo 2017-01-31 371)

Notice how the revision reported in both difflame calls is the same:

$ git show b1a66932
commit b1a66932704245fd653f8d48c0a718f168f334a7
Author: Edmundo Carmona Antoranz <whocares@gmail.com>
Date:   Sat Mar 4 13:59:50 2017 -0600

   use rev-list to get revision IDs

diff --git a/difflame.py b/difflame.py
index e70154a..04c7577 100755
--- a/difflame.py
+++ b/difflame.py
@@ -365,7 +365,7 @@ def get_full_revision_id(revision):
        # we already had the revision
        return REVISIONS_ID_CACHE[revision]
    # fallback to get it from git
-    full_revision = run_git_command(["show", "--pretty=%H", revision]).split("\n")[0]
+    full_revision = run_git_command(["rev-list", "--max-count=1", revision]).split("\n")[0]
    REVISIONS_ID_CACHE[revision] = full_revision
    return full_revision
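
For readers without the difflame source at hand, the function touched by
this commit can be approximated in a self-contained form. This is a
sketch reconstructed from the diff above: the body of run_git_command
and the optional "git" parameter are assumptions, since the real helper
is not shown in this thread.

```python
import subprocess

# Cache of already-resolved revision names, as in the function shown above.
REVISIONS_ID_CACHE = {}


def run_git_command(args):
    """Run git with the given arguments and return its standard output.

    Hypothetical stand-in: difflame's real helper is not shown here.
    """
    return subprocess.check_output(["git"] + args, text=True)


def get_full_revision_id(revision, git=None):
    """Resolve a revision name to a full commit ID, caching the answer."""
    if revision in REVISIONS_ID_CACHE:
        # we already had the revision
        return REVISIONS_ID_CACHE[revision]
    # fallback to get it from git
    git = git or run_git_command
    full_revision = git(["rev-list", "--max-count=1", revision]).split("\n")[0]
    REVISIONS_ID_CACHE[revision] = full_revision
    return full_revision
```

Repeated lookups of the same name hit the cache, so git is only invoked
once per distinct revision name.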


If this "detection" to perform reverse analysis hadn't been done, then
there wouldn't be a lot of useful information because there are no
revisions in HEAD~2..HEAD~3 and so the output would have been
something like:

diff --git a/difflame.py b/difflame.py
index 04c7577..e70154a 100755
--- a/difflame.py
+++ b/difflame.py
@@ -365,7 +365,7 @@ def get_full_revision_id(revision):
 e5b218e4 (Edmundo 2017-02-01 365)         # we already had the revision
 50528377 (Edmundo 2017-03-04 366)         return REVISIONS_ID_CACHE[revision]
 d1d11d8a (Edmundo 2017-02-02 367)     # fallback to get it from git
       b1a6693 use rev-list to get revision IDs
%b1a6693 (Edmundo 2017-03-04 368)     full_revision = run_git_command(["rev-list", "--max-count=1", revision]).split("\n")[0]
       e5b218e printing hints for deleted lines
+e5b218e4 (Edmundo 2017-02-01 368)     full_revision = run_git_command(["show", "--pretty=%H", revision]).split("\n")[0]
 50528377 (Edmundo 2017-03-04 369)     REVISIONS_ID_CACHE[revision] = full_revision
 e5b218e4 (Edmundo 2017-02-01 370)     return full_revision
 91b7d3f5 (Edmundo 2017-01-31 371)

Notice how both the added line and the deleted line are reporting the
_wrong_ revision. It should be b1a66932 in all cases.


One question that has been bugging me for a while is what to do in
cases where treeish1 and treeish2 are not "direct" descendants of each
other (as in, merge-base treeish1 treeish2 is something other than
treeish1 or treeish2). Suppose a line was added on an ancestor of
treeish1 but it hasn't been merged into treeish2. In this case, if we
diff treeish1..treeish2 we will get a _deleted_ line. However, analysis
to find a deleting revision in treeish1..treeish2 will fail. I'm
wondering if it would be OK in this case to blame the deleted line on
the ancestor of treeish1 where the line was _added_.

Another thing I added is support for tags.

Best regards!
