git@vger.kernel.org mailing list mirror (one of many)
* A case where diff.colorMoved=plain is more sensible than diff.colorMoved=zebra & others
From: Ævar Arnfjörð Bjarmason @ 2018-12-06 13:54 UTC
  To: Stefan Beller; +Cc: git

Let's ignore how bad this patch is for git.git, and just focus on how
diff.colorMoved treats it:

    diff --git a/builtin/add.c b/builtin/add.c
    index f65c172299..d1155322ef 100644
    --- a/builtin/add.c
    +++ b/builtin/add.c
    @@ -6,5 +6,3 @@
     #include "cache.h"
    -#include "config.h"
     #include "builtin.h"
    -#include "lockfile.h"
     #include "dir.h"
    diff --git a/builtin/am.c b/builtin/am.c
    index 8f27f3375b..eded15aa8a 100644
    --- a/builtin/am.c
    +++ b/builtin/am.c
    @@ -6,3 +6,2 @@
     #include "cache.h"
    -#include "config.h"
     #include "builtin.h"
    diff --git a/builtin/blame.c b/builtin/blame.c
    index 06a7163ffe..44a754f190 100644
    --- a/builtin/blame.c
    +++ b/builtin/blame.c
    @@ -8,3 +8,2 @@
     #include "cache.h"
    -#include "config.h"
     #include "color.h"
    diff --git a/cache.h b/cache.h
    index ca36b44ee0..ea8d60b94a 100644
    --- a/cache.h
    +++ b/cache.h
    @@ -4,2 +4,4 @@
     #include "git-compat-util.h"
    +#include "config.h"
    +#include "new.h"
     #include "strbuf.h"

This is a common thing that's useful to have highlighted, e.g. we move
includes of config.h to some common file, so I want to see all the
deleted config.h lines shown as moved into the cache.h line, the
"lockfile.h" I removed while I was at it shown as plainly removed, and
the new "new.h" shown as plainly added.

That is exactly what you get with diff.colorMoved=plain, but the default
of diff.colorMoved=zebra gets confused by this and highlights no moves
at all; the same goes for "blocks" and "dimmed-zebra".

At first I thought this had something to do with the many->one
detection, but it seems to be simpler: we just don't detect a 1-line
move with anything but plain. E.g. this works as expected in all modes
and detects the many->one:

    diff --git a/builtin/add.c b/builtin/add.c
    index f65c172299..f4fda75890 100644
    --- a/builtin/add.c
    +++ b/builtin/add.c
    @@ -5,4 +5,2 @@
      */
    -#include "cache.h"
    -#include "config.h"
     #include "builtin.h"
    diff --git a/builtin/branch.c b/builtin/branch.c
    index 0c55f7f065..52e39924d3 100644
    --- a/builtin/branch.c
    +++ b/builtin/branch.c
    @@ -7,4 +7,2 @@

    -#include "cache.h"
    -#include "config.h"
     #include "color.h"
    diff --git a/cache.h b/cache.h
    index ca36b44ee0..d4146dbf8a 100644
    --- a/cache.h
    +++ b/cache.h
    @@ -3,2 +3,4 @@

    +#include "cache.h"
    +#include "config.h"
     #include "git-compat-util.h"

So is there some "must be at least two consecutive lines" condition for
not-plain, or is something else going on here?

* Re: A case where diff.colorMoved=plain is more sensible than diff.colorMoved=zebra & others
From: Phillip Wood @ 2018-12-06 14:58 UTC
  To: Ævar Arnfjörð Bjarmason, Stefan Beller; +Cc: git

Hi Ævar

On 06/12/2018 13:54, Ævar Arnfjörð Bjarmason wrote:
> Let's ignore how bad this patch is for git.git, and just focus on how
> diff.colorMoved treats it:
> 
>      diff --git a/builtin/add.c b/builtin/add.c
>      index f65c172299..d1155322ef 100644
>      --- a/builtin/add.c
>      +++ b/builtin/add.c
>      @@ -6,5 +6,3 @@
>       #include "cache.h"
>      -#include "config.h"
>       #include "builtin.h"
>      -#include "lockfile.h"
>       #include "dir.h"
>      diff --git a/builtin/am.c b/builtin/am.c
>      index 8f27f3375b..eded15aa8a 100644
>      --- a/builtin/am.c
>      +++ b/builtin/am.c
>      @@ -6,3 +6,2 @@
>       #include "cache.h"
>      -#include "config.h"
>       #include "builtin.h"
>      diff --git a/builtin/blame.c b/builtin/blame.c
>      index 06a7163ffe..44a754f190 100644
>      --- a/builtin/blame.c
>      +++ b/builtin/blame.c
>      @@ -8,3 +8,2 @@
>       #include "cache.h"
>      -#include "config.h"
>       #include "color.h"
>      diff --git a/cache.h b/cache.h
>      index ca36b44ee0..ea8d60b94a 100644
>      --- a/cache.h
>      +++ b/cache.h
>      @@ -4,2 +4,4 @@
>       #include "git-compat-util.h"
>      +#include "config.h"
>      +#include "new.h"
>       #include "strbuf.h"
> 
> This is a common thing that's useful to have highlighted, e.g. we move
> includes of config.h to some common file, so I want to see all the
> deleted config.h lines as moved into the cache.h line, and then the
> "lockfile.h" I removed while I was at it plain remove, and the new
> "new.h" plain added.
> 
> Exactly that is what you get with diff.colorMoved=plain, but the default
> of diff.colorMoved=zebra gets confused by this and highlights no moves
> at all, same for "blocks" and "dimmed-zebra".
> 
> So at first I thought this had something to do with the many->one
> detection, but it seems to be simpler, we just don't detect a move of
> 1-line with anything but plain, e.g. this works as expected in all modes
> and detects the many->one:
> 
>      diff --git a/builtin/add.c b/builtin/add.c
>      index f65c172299..f4fda75890 100644
>      --- a/builtin/add.c
>      +++ b/builtin/add.c
>      @@ -5,4 +5,2 @@
>        */
>      -#include "cache.h"
>      -#include "config.h"
>       #include "builtin.h"
>      diff --git a/builtin/branch.c b/builtin/branch.c
>      index 0c55f7f065..52e39924d3 100644
>      --- a/builtin/branch.c
>      +++ b/builtin/branch.c
>      @@ -7,4 +7,2 @@
> 
>      -#include "cache.h"
>      -#include "config.h"
>       #include "color.h"
>      diff --git a/cache.h b/cache.h
>      index ca36b44ee0..d4146dbf8a 100644
>      --- a/cache.h
>      +++ b/cache.h
>      @@ -3,2 +3,4 @@
> 
>      +#include "cache.h"
>      +#include "config.h"
>       #include "git-compat-util.h"
> 
> So is there some "must be at least two consecutive lines" condition for
> not-plain, or is something else going on here?

To be considered a block, a run of moved lines has to contain at least
20 alphanumeric characters - see commit f0b8fb6e59 ("diff: define block
by number of alphanumeric chars", 2017-08-15). This stops things like
random '}' lines being marked as moved on their own. It might be better
to use some kind of frequency information (a bit like Python's difflib
junk parameter) instead, so that (fairly) unique short lines also get
marked properly.
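
The arithmetic for the two examples in this thread can be checked by
hand. The sketch below is only an illustration in Python, not Git's C
code: MIN_ALNUM and alnum_count() are names made up here, and it assumes
the alphanumeric characters are summed over all lines of a candidate
block.

```python
# Illustrative only: mirrors the rule described above, not Git's actual code.
MIN_ALNUM = 20  # threshold from f0b8fb6e59, as described above


def alnum_count(block):
    """Count ASCII alphanumeric characters across all lines of a block."""
    return sum(1 for line in block for ch in line if ch.isascii() and ch.isalnum())


one_line = ['#include "config.h"']
two_lines = ['#include "cache.h"', '#include "config.h"']

for block in (one_line, two_lines):
    n = alnum_count(block)
    print(n, "block" if n >= MIN_ALNUM else "too short to count as a block")
```

The single config.h line has 14 alphanumeric characters and so falls
below the limit, while the cache.h + config.h pair has 27, which would
explain why only the second diff is coloured in the non-plain modes.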

Best Wishes

Phillip

* Re: A case where diff.colorMoved=plain is more sensible than diff.colorMoved=zebra & others
From: Stefan Beller @ 2018-12-06 18:11 UTC
  To: Phillip Wood; +Cc: Ævar Arnfjörð Bjarmason, git

On Thu, Dec 6, 2018 at 6:58 AM Phillip Wood <phillip.wood@talktalk.net> wrote:

> > So is there some "must be at least two consecutive lines" condition for
> > not-plain, or is something else going on here?
>
> To be considered a block, a run of moved lines has to contain at least
> 20 alphanumeric characters - see
> commit f0b8fb6e59 ("diff: define block by number of alphanumeric chars",
> 2017-08-15). This stops things like random '}' lines being marked as
> moved on their own.

This is spot on.

All but the "plain" mode use the concept of "blocks" of code
(there is even one mode called "blocks", which adds to the confusion).

> It might be better to use some kind of frequency
> information (a bit like python's difflib junk parameter) instead so that
> (fairly) unique short lines also get marked properly.

Yes that is what I was initially thinking about. However to have good
information, you'd need to index a whole lot (the whole repository,
i.e. all text blobs in existence?) to get an accurate picture of frequency
information, which I'd prefer to call entropy, as I come from a background
familiar with https://en.wikipedia.org/wiki/Information_theory. I am not
sure where 'frequency information' comes from -- it sounds like the
same concept.

Of course it is too expensive to run an operation O(repository size)
just for this diff, so maybe we could get away with some smaller
corpus to build up this information on what is sufficient for coloring.

When only looking at the given diff, I would imagine that each line
would not carry a whole lot of information as its characters occur
rather frequently compared to the rest of the diff.
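
To put a rough number on that intuition, one could measure each line
against the character distribution of the diff itself. The following is
a hypothetical Python sketch, not anything Git does; char_model() and
line_information() are invented names, and the five sample lines are
just taken from or modelled on the example above.

```python
import math
from collections import Counter


def char_model(diff_lines):
    """Relative frequency of each character over the whole diff."""
    counts = Counter(ch for line in diff_lines for ch in line)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


def line_information(line, model):
    """Self-information of a line in bits: the sum of -log2 p(ch)."""
    return sum(-math.log2(model[ch]) for ch in line)


diff_lines = [
    '#include "cache.h"',
    '#include "config.h"',
    '#include "builtin.h"',
    '#include "lockfile.h"',
    '}',
]
model = char_model(diff_lines)
for line in diff_lines:
    print(f"{line_information(line, model):6.1f} bits  {line}")
```

With so small a corpus the score is dominated by line length, so this
says little about which lines are actually distinctive.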

Best,
Stefan

* Re: A case where diff.colorMoved=plain is more sensible than diff.colorMoved=zebra & others
From: Phillip Wood @ 2018-12-10 14:43 UTC
  To: Stefan Beller, Phillip Wood; +Cc: Ævar Arnfjörð Bjarmason, git

On 06/12/2018 18:11, Stefan Beller wrote:
> On Thu, Dec 6, 2018 at 6:58 AM Phillip Wood <phillip.wood@talktalk.net> wrote:
> 
>>> So is there some "must be at least two consecutive lines" condition for
>>> not-plain, or is something else going on here?
>>
>> To be considered a block, a run of moved lines has to contain at least
>> 20 alphanumeric characters - see
>> commit f0b8fb6e59 ("diff: define block by number of alphanumeric chars",
>> 2017-08-15). This stops things like random '}' lines being marked as
>> moved on their own.
> 
> This is spot on.
> 
> All but the "plain" mode use the concept of "blocks" of code
> (there is even one mode called "blocks", which adds to the confusion).
> 
>> It might be better to use some kind of frequency
>> information (a bit like python's difflib junk parameter) instead so that
>> (fairly) unique short lines also get marked properly.
> 
> Yes that is what I was initially thinking about. However to have good
> information, you'd need to index a whole lot (the whole repository,
> i.e. all text blobs in existence?) to get an accurate picture of frequency
> information, which I'd prefer to call entropy as I come from a background
> familiar with https://en.wikipedia.org/wiki/Information_theory, I am not
> sure where 'frequency information' comes from -- it sounds like the
> same concept.
> 
> Of course it is too expensive to run an operation O(repository size)
> just for this diff, so maybe we could get away with some smaller
> corpus to build up this information on what is sufficient for coloring.
> 
> When only looking at the given diff, I would imagine that each line
> would not carry a whole lot of information as its characters occur
> rather frequently compared to the rest of the diff.

I was thinking of using lines rather than characters as the unit of 
information (if that's the right phrase). I was hoping that seeing how 
often a given line occurs within the set of files being diffed would be 
good enough to tell us if it is an "interesting" move or not. In the
meantime I wonder if decreasing the block limit to 10 alphanumeric
characters would be enough to prevent too much noise in the output 
without suppressing matches that it would be useful to highlight.
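
A minimal sketch of the line-frequency idea, in Python for brevity. It
is purely hypothetical: rare_lines() and its max_occurrences cutoff are
invented for illustration, the input file is made up, and a real cutoff
would need tuning (for instance, a line moved out of many files at once
occurs many times in the files being diffed).

```python
from collections import Counter


def rare_lines(file_lines, max_occurrences=1):
    """Return the non-blank lines occurring at most max_occurrences times."""
    counts = Counter(line.strip() for line in file_lines)
    return {line for line, n in counts.items() if line and n <= max_occurrences}


# Pre-image of a made-up file: '}' is common, the include is unique.
pre_image = [
    '#include "config.h"',
    'int f(void) {',
    '}',
    'int g(void) {',
    '}',
]
interesting = rare_lines(pre_image)
print('#include "config.h"' in interesting)  # short, but unique
print('}' in interesting)                    # short and frequent
```

Here the include line would qualify as an "interesting" move despite
being short, while the repeated '}' would not.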

Best Wishes

Phillip

* Re: A case where diff.colorMoved=plain is more sensible than diff.colorMoved=zebra & others
From: Stefan Beller @ 2018-12-11  0:54 UTC
  To: Phillip Wood; +Cc: Ævar Arnfjörð Bjarmason, git

On Mon, Dec 10, 2018 at 6:43 AM Phillip Wood <phillip.wood@talktalk.net> wrote:
>
> On 06/12/2018 18:11, Stefan Beller wrote:
> > On Thu, Dec 6, 2018 at 6:58 AM Phillip Wood <phillip.wood@talktalk.net> wrote:
> >
> >>> So is there some "must be at least two consecutive lines" condition for
> >>> not-plain, or is something else going on here?
> >>
> >> To be considered a block, a run of moved lines has to contain at least
> >> 20 alphanumeric characters - see
> >> commit f0b8fb6e59 ("diff: define block by number of alphanumeric chars",
> >> 2017-08-15). This stops things like random '}' lines being marked as
> >> moved on their own.
> >
> > This is spot on.
> >
> > All but the "plain" mode use the concept of "blocks" of code
> > (there is even one mode called "blocks", which adds to the confusion).
> >
> >> It might be better to use some kind of frequency
> >> information (a bit like python's difflib junk parameter) instead so that
> >> (fairly) unique short lines also get marked properly.
> >
> > Yes that is what I was initially thinking about. However to have good
> > information, you'd need to index a whole lot (the whole repository,
> > i.e. all text blobs in existence?) to get an accurate picture of frequency
> > information, which I'd prefer to call entropy as I come from a background
> > familiar with https://en.wikipedia.org/wiki/Information_theory, I am not
> > sure where 'frequency information' comes from -- it sounds like the
> > same concept.
> >
> > Of course it is too expensive to run an operation O(repository size)
> > just for this diff, so maybe we could get away with some smaller
> > corpus to build up this information on what is sufficient for coloring.
> >
> > When only looking at the given diff, I would imagine that each line
> > would not carry a whole lot of information as its characters occur
> > rather frequently compared to the rest of the diff.
>
> I was thinking of using lines rather than characters as the unit of
> information (if that's the right phrase). I was hoping that seeing how
> often a given line occurs within the set of files being diffed would be
> good enough to tell us if it is an "interesting" move or not.

That sounds reasonable. We should try that.

> In the
> meantime I wonder if decreasing the block limit to 10 alphanumeric
> characters would be enough to prevent too much noise in the output
> without suppressing matches that it would be useful to highlight.

Jonathan elegantly sidestepped the need to come up with data on why 20
is a good choice by pointing to prior art in git-blame instead; see
f0b8fb6e59 (diff: define block by number of alphanumeric chars,
2017-08-15), which seems to say we'd want to follow the
model of having a blame_entry_score (that counts the number
of isalnum() characters per line) and the
BLAME_DEFAULT_MOVE_SCORE, which came into existence in
4a0fc95f18 (git-pickaxe: introduce heuristics to avoid "trivial" chunks,
2006-10-20), simply stating

    The current heuristics are quite simple and may need to be
    tweaked later, but we need to start somewhere.

so I guess replacing 20 by 10 is totally doable, but the burden of proof
on why 10 is better than 20 is on you. ;-(
Probably it doesn't need to be as fancy as in 433860f3d0
(diff: improve positioning of add/delete blocks in diffs, 2016-09-05)
but we'd need to gather *some* data to convince
(s/convince/fool/)  ourselves that it is better.
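
A first, very small step towards such data could be to list which
candidate blocks each limit lets through. The Python sketch below is
only an illustration: the helper names and the candidate blocks are made
up, and a real comparison would have to run over diffs from actual
history.

```python
# Illustrative harness; the candidate blocks below are made up for this example.
def alnum_count(block):
    """Count ASCII alphanumeric characters across all lines of a block."""
    return sum(1 for line in block for ch in line if ch.isascii() and ch.isalnum())


def accepted(candidates, threshold):
    """Labels of candidate blocks with at least `threshold` alnum characters."""
    return [label for label, block in candidates.items()
            if alnum_count(block) >= threshold]


candidates = {
    "closing brace": ["}"],
    "return": ["return 0;"],
    "one include": ['#include "config.h"'],
    "release call": ["strbuf_release(&sb);"],
    "two includes": ['#include "cache.h"', '#include "config.h"'],
}

for threshold in (10, 20):
    print(threshold, accepted(candidates, threshold))
```

On this toy input a limit of 10 would additionally colour the single
include and the strbuf_release() call, while still rejecting '}' and
"return 0;".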

It could also be the case that we need to tune blame differently
from move detection, but we could still reuse some of the code
that does the processing.
