git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Stefan Beller <sbeller@google.com>
To: Jonathan Tan <jonathantanmy@google.com>
Cc: "git@vger.kernel.org" <git@vger.kernel.org>,
	Junio C Hamano <gitster@pobox.com>
Subject: Re: [PATCH v3 3/3] diff: define block by number of non-space chars
Date: Mon, 14 Aug 2017 19:29:19 -0700	[thread overview]
Message-ID: <CAGZ79kbtscZ_OrXH1iDmiRNhNQkU-636351kaKxVYOYEBtBoeQ@mail.gmail.com> (raw)
In-Reply-To: <41a7ff674072559415f98b81ffde798d94aed2fc.1502754962.git.jonathantanmy@google.com>

On Mon, Aug 14, 2017 at 4:57 PM, Jonathan Tan <jonathantanmy@google.com> wrote:
> The existing behavior of diff --color-moved=zebra does not define the
> minimum size of a block at all, instead relying on a heuristic applied
> later to filter out sets of adjacent moved lines that are shorter than 3
> lines long. This can be confusing, because a block could thus be colored
> as moved at the source but not at the destination (or vice versa),
> depending on its neighbors.
>
> Instead, teach diff that the minimum size of a block is 10
> non-whitespace characters. This allows diff to still exclude
> uninteresting lines appearing on their own (such as those solely
> consisting of one or a few closing braces), as was the intention of the
> adjacent-moved-line heuristic.

After some thought, I really like this heuristic, however allow me
a moment to bikeshed 10 as a number here.

One could think that 10 equals roughly 3 lines a 3 characters and
in C based languages the shortest meaningful lines have more than
3 characters ("i++;", "a();", "int i;" have 4 or 5 each), but I would still
think that 10 is too much, as we'd want to detect the closing braces
in their own lines.

>  dimmed_zebra::
>         Similar to 'zebra', but additional dimming of uninteresting parts
>         of moved code is performed. The bordering lines of two adjacent
> diff --git a/diff.c b/diff.c
> index f598d8a3a..305ce4126 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -864,19 +864,28 @@ static int shrink_potential_moved_blocks(struct moved_entry **pmb,
>  /*
>   * If o->color_moved is COLOR_MOVED_PLAIN, this function does nothing.
>   *
> - * Otherwise, if the last block has fewer lines than
> - * COLOR_MOVED_MIN_BLOCK_LENGTH, unset DIFF_SYMBOL_MOVED_LINE on all lines in
> - * that block.
> + * Otherwise, if the last block has fewer non-space characters than
> + * COLOR_MOVED_MIN_NON_SPACE_COUNT, unset DIFF_SYMBOL_MOVED_LINE on all lines
> + * in that block.
>   *
>   * The last block consists of the (n - block_length)'th line up to but not
>   * including the nth line.
>   */
>  static void adjust_last_block(struct diff_options *o, int n, int block_length)
>  {
> -       int i;
> -       if (block_length >= COLOR_MOVED_MIN_BLOCK_LENGTH ||
> -           o->color_moved == COLOR_MOVED_PLAIN)
> +       int i, non_space_count = 0;
> +       if (o->color_moved == COLOR_MOVED_PLAIN)
>                 return;
> +       for (i = 1; i < block_length + 1; i++) {
> +               const char *c = o->emitted_symbols->buf[n - i].line;
> +               for (; *c; c++) {
> +                       if (isspace(*c))
> +                               continue;
> +                       non_space_count++;
> +                       if (non_space_count >= COLOR_MOVED_MIN_NON_SPACE_COUNT)
> +                               return;

When we do this counting, we could count the lines ourselves here as well.
`n-block_count` should be equal to the line that has a different
(flags & (DIFF_SYMBOL_MOVED_LINE | DIFF_SYMBOL_MOVED_LINE_ALT))
pattern than those before. (although we'd also have to check for i > 0, too)
Your choice.

> +               }
> +       }
>         for (i = 1; i < block_length + 1; i++)
>                 o->emitted_symbols->buf[n - i].flags &= ~DIFF_SYMBOL_MOVED_LINE;
>  }
> @@ -923,7 +932,6 @@ static void mark_color_as_moved(struct diff_options *o,
>                 }
>
>                 l->flags |= DIFF_SYMBOL_MOVED_LINE;
> -               block_length++;
>
>                 if (o->color_moved == COLOR_MOVED_PLAIN)
>                         continue;
> @@ -953,8 +961,13 @@ static void mark_color_as_moved(struct diff_options *o,
>                         }
>
>                         flipped_block = (flipped_block + 1) % 2;
> +
> +                       adjust_last_block(o, n, block_length);
> +                       block_length = 0;
>                 }
>
> +               block_length++;
> +
>                 if (flipped_block)
>                         l->flags |= DIFF_SYMBOL_MOVED_LINE_ALT;
>         }
> diff --git a/diff.h b/diff.h
> index 5755f465d..9e2fece5b 100644
> --- a/diff.h
> +++ b/diff.h
> @@ -195,7 +195,7 @@ struct diff_options {
>                 COLOR_MOVED_ZEBRA_DIM = 3,
>         } color_moved;
>         #define COLOR_MOVED_DEFAULT COLOR_MOVED_ZEBRA
> -       #define COLOR_MOVED_MIN_BLOCK_LENGTH 3
> +       #define COLOR_MOVED_MIN_NON_SPACE_COUNT 10
>  };
>
>  void diff_emit_submodule_del(struct diff_options *o, const char *line);
> diff --git a/t/t4015-diff-whitespace.sh b/t/t4015-diff-whitespace.sh
> index 6f7758e5c..d8e7b77b9 100755
> --- a/t/t4015-diff-whitespace.sh
> +++ b/t/t4015-diff-whitespace.sh
> @@ -1101,9 +1101,9 @@ test_expect_success 'detect malicious moved code, inside file' '
>         <BRED>-{<RESET>
>         <BLUE>-if (!u->is_allowed_foo)<RESET>
>         <BLUE>-return;<RESET>
> -       <BRED>-foo(u);<RESET>
> -       <BLUE>-}<RESET>
> -       <BLUE>-<RESET>
> +       <RED>-foo(u);<RESET>
> +       <RED>-}<RESET>
> +       <RED>-<RESET>

Here we have 2 blocks, the first has 7 character,
which we may want to detect, the second has only 1 char.

The longest "uninteresting" line in C like languages might
be "\t } else {" which has 6 non-ws characters.

Thinking of other languages (shell "fi" is uninteresting, others are
interesting,
Latex \"custom" all bets are off), I think we may want to go lower and have
COLOR_MOVED_MIN_NON_SPACE_COUNT to be about 6
(~ 2 characters on a 3 line block).

That said this is all bikeshedding, feel free to ignore.

Acked-by: Stefan Beller <sbeller@google.com>

  reply	other threads:[~2017-08-15  2:29 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-08-11 22:49 [RFC PATCH 0/3] Fixes to "diff --color-moved" MIN_BLOCK_LENGTH handling Jonathan Tan
2017-08-11 22:49 ` [RFC PATCH 1/3] diff: avoid redundantly clearing a flag Jonathan Tan
2017-08-14 17:13   ` Stefan Beller
2017-08-11 22:49 ` [RFC PATCH 2/3] diff: respect MIN_BLOCK_LENGTH for last block Jonathan Tan
2017-08-14 17:17   ` Stefan Beller
2017-08-11 22:49 ` [RFC PATCH 3/3] diff: check MIN_BLOCK_LENGTH at start of new block Jonathan Tan
2017-08-14 17:22   ` Stefan Beller
2017-08-12  0:39 ` [RFC PATCH 0/3] Fixes to "diff --color-moved" MIN_BLOCK_LENGTH handling Junio C Hamano
2017-08-14 17:29   ` Stefan Beller
2017-08-14 19:37     ` Junio C Hamano
2017-08-14 19:51       ` Stefan Beller
2017-08-15 17:07         ` Junio C Hamano
2017-08-14 21:31 ` [PATCH v2 " Jonathan Tan
2017-08-14 21:31 ` [PATCH v2 1/3] diff: avoid redundantly clearing a flag Jonathan Tan
2017-08-14 21:31 ` [PATCH v2 2/3] diff: respect MIN_BLOCK_LENGTH for last block Jonathan Tan
2017-08-14 21:31 ` [PATCH v2 3/3] diff: check MIN_BLOCK_LENGTH at start of new block Jonathan Tan
2017-08-14 22:46   ` Stefan Beller
2017-08-14 23:57 ` [PATCH v3 0/3] "diff --color-moved" with different heuristic Jonathan Tan
2017-08-14 23:57 ` [PATCH v3 1/3] diff: avoid redundantly clearing a flag Jonathan Tan
2017-08-14 23:57 ` [PATCH v3 2/3] diff: respect MIN_BLOCK_LENGTH for last block Jonathan Tan
2017-08-14 23:57 ` [PATCH v3 3/3] diff: define block by number of non-space chars Jonathan Tan
2017-08-15  2:29   ` Stefan Beller [this message]
2017-08-15 19:54   ` Junio C Hamano
2017-08-15 20:06     ` Stefan Beller
2017-08-15 20:53       ` Junio C Hamano
2017-08-16  1:27 ` [PATCH v4 0/3] "diff --color-moved" with yet another heuristic Jonathan Tan
2017-08-16  5:55   ` Stefan Beller
2017-08-16  1:27 ` [PATCH v4 1/3] diff: avoid redundantly clearing a flag Jonathan Tan
2017-08-16  1:27 ` [PATCH v4 2/3] diff: respect MIN_BLOCK_LENGTH for last block Jonathan Tan
2017-08-16  1:27 ` [PATCH v4 3/3] diff: define block by number of alphanumeric chars Jonathan Tan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAGZ79kbtscZ_OrXH1iDmiRNhNQkU-636351kaKxVYOYEBtBoeQ@mail.gmail.com \
    --to=sbeller@google.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jonathantanmy@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).