git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Stefan Beller <sbeller@google.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: Jonathan Tan <jonathantanmy@google.com>,
	"git@vger.kernel.org" <git@vger.kernel.org>
Subject: Re: [PATCH v3 3/3] diff: define block by number of non-space chars
Date: Tue, 15 Aug 2017 13:06:49 -0700	[thread overview]
Message-ID: <CAGZ79kahg7q2CSg7OiaLNvHKzCUgTtq2-n=Km0N+eGRuNZ=yJw@mail.gmail.com> (raw)
In-Reply-To: <xmqqy3qkppxt.fsf@gitster.mtv.corp.google.com>

On Tue, Aug 15, 2017 at 12:54 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Jonathan Tan <jonathantanmy@google.com> writes:
>
>> The existing behavior of diff --color-moved=zebra does not define the
>> minimum size of a block at all, instead relying on a heuristic applied
>> later to filter out sets of adjacent moved lines that are shorter than 3
>> lines long. This can be confusing, because a block could thus be colored
>> as moved at the source but not at the destination (or vice versa),
>> depending on its neighbors.
>>
>> Instead, teach diff that the minimum size of a block is 10
>> non-whitespace characters. This allows diff to still exclude
>> uninteresting lines appearing on their own (such as those solely
>> consisting of one or a few closing braces), as was the intention of the
>> adjacent-moved-line heuristic.
>
> I recall that there is a logic backed by a similar rationale in
> blame.c::blame_entry_score() but over there we count alnum, not
> !isspace, to judge if a block has been split into too small a piece
> to be significant.  I do not know which one is better, but if there
> is no strong reason, perhaps we want to unify the two, so that we
> can improve both heuristics at the same time?

In an ideal world we would use entropy of the diffed characters as
that is a best approximation on how much "interesting" things are
going on in that particular diff.

Computing the entropy is cumbersome, but maybe ok for this
purpose (we're most likely IO bound anyway, specifically when
including the human understanding as the last part of IO).

Reasons that may influence the choice
* Non latin/ASCII characters should also work.
* Do we care about the whitespace esoteric
  programming language?

The function blame_entry_score is documented to approach
this exact problem that we are trying to solve here, so I agree
we should have a common heuristic.

Stefan

  reply	other threads:[~2017-08-15 20:06 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-08-11 22:49 [RFC PATCH 0/3] Fixes to "diff --color-moved" MIN_BLOCK_LENGTH handling Jonathan Tan
2017-08-11 22:49 ` [RFC PATCH 1/3] diff: avoid redundantly clearing a flag Jonathan Tan
2017-08-14 17:13   ` Stefan Beller
2017-08-11 22:49 ` [RFC PATCH 2/3] diff: respect MIN_BLOCK_LENGTH for last block Jonathan Tan
2017-08-14 17:17   ` Stefan Beller
2017-08-11 22:49 ` [RFC PATCH 3/3] diff: check MIN_BLOCK_LENGTH at start of new block Jonathan Tan
2017-08-14 17:22   ` Stefan Beller
2017-08-12  0:39 ` [RFC PATCH 0/3] Fixes to "diff --color-moved" MIN_BLOCK_LENGTH handling Junio C Hamano
2017-08-14 17:29   ` Stefan Beller
2017-08-14 19:37     ` Junio C Hamano
2017-08-14 19:51       ` Stefan Beller
2017-08-15 17:07         ` Junio C Hamano
2017-08-14 21:31 ` [PATCH v2 " Jonathan Tan
2017-08-14 21:31 ` [PATCH v2 1/3] diff: avoid redundantly clearing a flag Jonathan Tan
2017-08-14 21:31 ` [PATCH v2 2/3] diff: respect MIN_BLOCK_LENGTH for last block Jonathan Tan
2017-08-14 21:31 ` [PATCH v2 3/3] diff: check MIN_BLOCK_LENGTH at start of new block Jonathan Tan
2017-08-14 22:46   ` Stefan Beller
2017-08-14 23:57 ` [PATCH v3 0/3] "diff --color-moved" with different heuristic Jonathan Tan
2017-08-14 23:57 ` [PATCH v3 1/3] diff: avoid redundantly clearing a flag Jonathan Tan
2017-08-14 23:57 ` [PATCH v3 2/3] diff: respect MIN_BLOCK_LENGTH for last block Jonathan Tan
2017-08-14 23:57 ` [PATCH v3 3/3] diff: define block by number of non-space chars Jonathan Tan
2017-08-15  2:29   ` Stefan Beller
2017-08-15 19:54   ` Junio C Hamano
2017-08-15 20:06     ` Stefan Beller [this message]
2017-08-15 20:53       ` Junio C Hamano
2017-08-16  1:27 ` [PATCH v4 0/3] "diff --color-moved" with yet another heuristic Jonathan Tan
2017-08-16  5:55   ` Stefan Beller
2017-08-16  1:27 ` [PATCH v4 1/3] diff: avoid redundantly clearing a flag Jonathan Tan
2017-08-16  1:27 ` [PATCH v4 2/3] diff: respect MIN_BLOCK_LENGTH for last block Jonathan Tan
2017-08-16  1:27 ` [PATCH v4 3/3] diff: define block by number of alphanumeric chars Jonathan Tan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAGZ79kahg7q2CSg7OiaLNvHKzCUgTtq2-n=Km0N+eGRuNZ=yJw@mail.gmail.com' \
    --to=sbeller@google.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jonathantanmy@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).