git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Junio C Hamano <gitster@pobox.com>
To: Michael Haggerty <mhagger@alum.mit.edu>
Cc: Stefan Beller <sbeller@google.com>, Jeff King <peff@peff.net>,
	"git\@vger.kernel.org" <git@vger.kernel.org>,
	Jacob Keller <jacob.keller@gmail.com>
Subject: Re: [PATCH] diff compaction heuristic: favor shortest neighboring blank lines
Date: Fri, 01 Jul 2016 11:01:01 -0700	[thread overview]
Message-ID: <xmqq8txlz19e.fsf@gitster.mtv.corp.google.com> (raw)
In-Reply-To: <57752478.1000302@alum.mit.edu> (Michael Haggerty's message of "Thu, 30 Jun 2016 15:54:00 +0200")

Michael Haggerty <mhagger@alum.mit.edu> writes:

> I put quite of work into tooling to evaluate diff heuristics, and just
> published it on GitHub:
>
>     https://github.com/mhagger/diff-slider-tools
>
> The README there is hopefully enough to let people get started using it
> to test their own favorite heuristics.

Intereting.  A fair TL;DR of this would be that what we write and
want to diff have the structure where things often are ordered in an
inherent hierarchy, and things at the higher level are highlighted
by being less indented, and the indent-based heuristics exploit that
well to choose a slide that breaks at the higher-level boundary, e.g.

> The algorithm does pretty well with XML and HTML:
>
>> fef3ea39f8bd474add292bb6437df6cbd22e1ba7:contributors.xml a394a0bdf8e6240dc09023a8260059c57c158a00:contributors.xml + 1624
>> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
>>                >    <last>Toni</last>
>>                >  </name>
>>                >  <name>
>>                >    <first>Vincent</first>
>>                >    <last>Legoll</last>
>>       -2 |     >  </name>
>>       -1 |   i >  <name>
>>        0 || ci >    <first>Vincent</first>
>>          || ci >    <last>Privat</last>
>>           | ci >  </name>
>>           | c  >  <name>

And that trait is shared among common programming languages (Java,
Perl, Go, C, etc.).

The only case I can think of offhand that does not follow "higher
level heading is less indented" pattern is an already typeset book,
where chapter headers are often centered on the page while section
headers may be at the left edge.  But we are not likely to be
interested to get that case right anyway, so it is OK.

> gettext source:
>
>> aef18cc6063106075afeb3a78ec72656b19c8bde:po/de.po b0e098ce46ce7f8ecd975720ccec8d0eb7787e50:po/de.po - 12954
>> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
>>                >#~ msgid "unmerged:   %s"
>>                >#~ msgstr "nicht zusammengeführt:   %s"
>>                >
>>                >#~ msgid "input paths are terminated by a null character"
>>                >#~ msgstr "Eingabepfade sind durch ein NUL Zeichen abgeschlossen"
>>       -2 |     >
>>       -1 |  ci >#~ msgid ""
>>        0 || ci >#~ "The following untracked files would NOT be saved but need to be removed "
>>          || ci >#~ "by stash save:"
>>          || ci >#~ msgstr ""
>>          || ci >#~ "Die folgenden unbeobachteten Dateien würden NICHT gespeichert werden,\n"
>>          || ci >#~ "müssen aber durch \"stash save\" entfernt werden:"
>>           | ci >

This is an interesting example of "have only a single level
hierarchy.  A line is either in one block or a blank between
blocks".

> It often fails to get C preprocessor directives right:
>
>> a08595f76159b09d57553e37a5123f1091bb13e7:http.c aeff8a61216bf6e0d663c08c583bc8552fa3c344:http.c + 429
>> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
>>                >		curl_easy_setopt(result, CURLOPT_SSLKEY, ssl_key);
>>                >#endif
>>                >#if LIBCURL_VERSION_NUM >= 0x070908
>>                >	if (ssl_capath != NULL)
>>                >		curl_easy_setopt(result, CURLOPT_CAPATH, ssl_capath);
>>       -1 |   i >#endif
>>        0 || ci >#if LIBCURL_VERSION_NUM >= 0x072c00
>>          || ci >	if (ssl_pinnedkey != NULL)
>>          || ci >		curl_easy_setopt(result, CURLOPT_PINNEDPUBLICKEY, ssl_pinnedkey);
>>           | c  >#endif

Yes, this is "non-human do not know 'end' is likely to be at the end
of a logical block".

> And it gets confused by unusual blank line placement:
>
>> ed55169834a3ce16a271def9630c858626ded34d:tools/eslint/node_modules/doctrine/lib/doctrine.js 2d441493a4a46a511ba1bdf93e442c3288fbe92d:tools/eslint/node_modules/doctrine/lib/doctrine.js + 330
>> vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
>>                >                        name === 'external' ||
>>                >                        name === 'event')) {
>>                >                    name += advance();
>>                >                    name += scanIdentifier(last);
>>                >
>>       -1 |   i >                }
>>        0 || ci >                if(source.charCodeAt(index) === 0x5B  /* '[' */ && source.charCodeAt(index + 1) === 0x5D  /* ']' */){
>>          || ci >                    name += advance();
>>          || ci >                    name += advance();
>>           | c  >                }

Likewise, this is showing that a "non-human not knowing } is a closing
and { is an opening token".

By the way, I no longer remember what indent level your
"indent-only" heuristics gives to a blank line.  Does the closing
brace we see above (marked with -1) count as increasing the
indentation level from the previous empty-line at indent 0, or does it
count as dedenting from the previous empty-line that has virtually
at the same indent level as "name += scanIdentifier(last);" line?


  parent reply	other threads:[~2016-07-01 18:01 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-06-16 17:46 [PATCH] diff compaction heuristic: favor shortest neighboring blank lines Stefan Beller
2016-06-16 20:27 ` Junio C Hamano
2016-06-16 21:06   ` Stefan Beller
2016-06-16 21:10 ` Michael Haggerty
2016-06-16 21:36   ` Stefan Beller
2016-06-17 15:36 ` Jeff King
2016-06-17 16:09   ` Stefan Beller
2016-06-23 17:10     ` Michael Haggerty
2016-06-23 17:25       ` Stefan Beller
2016-06-23 17:37       ` Junio C Hamano
2016-06-23 20:13         ` Michael Haggerty
2016-06-30 13:54       ` Michael Haggerty
2016-07-01 17:04         ` diff heuristics dramatically improved by considering line indentation and " Michael Haggerty
2016-07-01 18:01         ` Junio C Hamano [this message]
2016-07-04 14:33           ` [PATCH] diff compaction heuristic: favor shortest neighboring " Jakub Narębski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=xmqq8txlz19e.fsf@gitster.mtv.corp.google.com \
    --to=gitster@pobox.com \
    --cc=git@vger.kernel.org \
    --cc=jacob.keller@gmail.com \
    --cc=mhagger@alum.mit.edu \
    --cc=peff@peff.net \
    --cc=sbeller@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).