git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Michael Haggerty <mhagger@alum.mit.edu>
To: Junio C Hamano <gitster@pobox.com>
Cc: "Jacob Keller" <jacob.keller@gmail.com>,
	"Git mailing list" <git@vger.kernel.org>,
	"Stefan Beller" <sbeller@google.com>, "Jeff King" <peff@peff.net>,
	"Jakub Narębski" <jnareb@gmail.com>
Subject: Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs
Date: Wed, 10 Aug 2016 21:01:06 +0200	[thread overview]
Message-ID: <5fe0edbc-3659-058f-3328-639d1343fa05@alum.mit.edu> (raw)
In-Reply-To: <xmqqd1lo2uj1.fsf@gitster.mtv.corp.google.com>

On 08/04/2016 09:39 PM, Junio C Hamano wrote:
> Michael Haggerty <mhagger@alum.mit.edu> writes:
> 
>>>> +       }
>>>> +       /*
>>>> +        * We have reached the end of the line without finding any non-space
>>>> +        * characters; i.e., the whole line consists of trailing spaces, which we
>>>> +        * are not interested in.
>>>> +        */
>>>> +       return -1;
> 
> Not related to Jacob's review, but "the whole line consists of
> trailing spaces" made me read it twice; while it is technically
> correct, "the whole line consists of spaces", or even "this is a
> blank line", would read a lot more easily, at least for me.

Hehe, yes, ETOOMANYWORDS.

>> I was implicitly assuming that such lines would have text somewhere
>> after those 200 spaces (or 25 TABs or whatever). But you're right, the
>> line could consist only of whitespace. Unfortunately, the only way to
>> distinguish these two cases is to read the rest of the line, which is
>> exactly what we *don't* want to do.
> 
> Hmm, why is it exactly what we don't want to do?  Is it a
> performance concern?  In other words, is it because this function is
> called many times to measure the same line multiple times?

Yes, performance is the reason, and especially the desire to avoid
unreasonable runtimes for pathological cases. Thanks for asking, though,
because it's worthwhile to look into this more rigorously.

Suppose there is a slider that can be shifted to any of `numshift`
positions. Then we have to call `measure_split()` `2*numshift` times
(once for the beginning and once for the end of each candidate slider
position).

Suppose there are `numblanks` blank lines in the neighborhood of the
slider. Each time we call `measure_split()`, we count the number of
blank lines before and after the proposed split position. So we end up
calling `get_indent()` `2*numshift*numblanks` times.

Finally, suppose that the blank lines each contain `numws` whitespace
characters. Then each call to `get_indent()` has to do `O(numws)` work.

So altogether, if there were no limits, then the amount of work to
position a slider would scale like

    O(numshift * numblanks * numws)

However, the total number of characters in the file might only be

    O(numblanks * numws)

So without limits, the amount of work to position sliders could scale by
numshift times the size of the file.

The worst case is pretty easy to achieve. Just create a file consisting
of a million or so LF characters, then add another LF to it. The diff
would be a slider with

    numshift = 1000000
    numblanks = 1000000
    numws = 1

so the heuristic would take O(N^2) in the size of the file.

Effectively the limits cap the effective `numblanks` at `2*MAX_BLANKS`
(which is 2*20) and the effective `numws` at `MAX_INDENT` (which is
200), meaning that the maximum effort scales like

    numshift * 2*20 * 200

which is still a big number but not absurd. Empirically, for the case
described above, `git diff` takes 104 ms and `git diff
--indent-heuristic` takes 490 ms. I think that's not prohibitive for a
pathological case.

Meanwhile, Myers's algorithm scales like O(ND), where N is the number of
lines and D is the edit distance, so I suppose that it is already
possible to find diffs that are intractable to compute.

> After
> all, somebody in this file is already scanning each and every line
> to see where it ends to split the input into records, so perhaps a
> "right" (if the "theoretical correctness" of the return value from
> this function mattered, which you wave-away below) optimization
> could be to precompute it while the lines are broken into records
> and store it in the "rec" structure?

That would certainly be possible, and would help in cases where there
are a lot of lines with lots of leading whitespace. You could get nearly
the same benefit by recording a single bit in struct rec, indicating
whether the line is blank or not.

But it wouldn't help the worst case described above, where each call to
`git_indent()` is already very cheap. And I didn't think it was worth
allocating the extra memory to optimize this heuristic

* which isn't used all that often in the first place,
* which (for normal inputs) doesn't take a significant amount of time, and
* when the optimization doesn't improve the worst-case scenario (and
thus any DoS potential)

I think the only way to ensure O(size_of_file) runtime in the above case
would be to record, along with each line, how many blank lines
immediately precede and succeed it. You could achieve something like
O(size_of_file lg(size_of_file)) by storing, e.g., the total number of
nonblank lines that precede each line and doing a binary search to find
the nearest non-blank line.

>> But I think it doesn't matter anyway. Such "text" will likely never be
>> read by a human, so it's not a big deal if the slider position is not
>> picked perfectly. And remember, this whole saga is just to improve the
>> aesthetics of the diff. The diff is *correct* (e.g., in the sense of
>> applicable) regardless of where we position the sliders.
> 
> A better argument may be "if the user is truly reading a diff output
> for such an unusual "text", it is likely that she has a very wide
> display and/or running less -S, and treating such an overindented line
> as if it were a blank line would give a result that is more consistent
> to what appears on her display", perhaps?

I don't know. It seems like a pretty contrived justification for what is
basically, "your input is too weird for us. We're not going to break our
necks trying to give you the best possible slider positioning."

Michael


  reply	other threads:[~2016-08-10 19:23 UTC|newest]

Thread overview: 57+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-08-03 22:00 [PATCH 0/8] Better heuristics make prettier diffs Michael Haggerty
2016-08-03 22:00 ` [PATCH 1/8] xdl_change_compact(): rename some local variables for clarity Michael Haggerty
2016-08-04  7:06   ` Jeff King
2016-08-04 18:24     ` Junio C Hamano
2016-08-13 19:38     ` Michael Haggerty
2016-08-14 12:26       ` Jeff King
2016-08-03 22:00 ` [PATCH 2/8] xdl_change_compact(): clarify code Michael Haggerty
2016-08-03 22:11   ` Stefan Beller
2016-08-03 23:14     ` Michael Haggerty
2016-08-03 23:50       ` Stefan Beller
2016-08-04  7:13         ` Jeff King
2016-08-10 16:39         ` Michael Haggerty
2016-08-10 16:58           ` Stefan Beller
2016-08-03 22:00 ` [PATCH 3/8] xdl_change_compact(): rename i to end Michael Haggerty
2016-08-04  7:16   ` Jeff King
2016-08-03 22:00 ` [PATCH 4/8] xdl_change_compact(): do one final shift or the other, not both Michael Haggerty
2016-08-03 22:00 ` [PATCH 5/8] xdl_change_compact(): fix compaction heuristic to adjust io Michael Haggerty
2016-08-04  7:27   ` Jeff King
2016-08-10 16:58     ` Michael Haggerty
2016-08-10 17:09       ` Michael Haggerty
2016-08-11  4:16       ` Jeff King
2016-08-04 18:43   ` Junio C Hamano
2016-08-10 17:13     ` Michael Haggerty
2016-08-03 22:00 ` [PATCH 6/8] xdl_change_compact(): keep track of the earliest end Michael Haggerty
2016-08-04 18:46   ` Junio C Hamano
2016-08-10 17:16     ` Michael Haggerty
2016-08-03 22:00 ` [PATCH 7/8] is_blank_line: take a single xrecord_t as argument Michael Haggerty
2016-08-04 18:48   ` Junio C Hamano
2016-08-03 22:00 ` [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs Michael Haggerty
2016-08-03 22:29   ` Jacob Keller
2016-08-03 22:36     ` Michael Haggerty
2016-08-04  4:47       ` Jacob Keller
2016-08-04 19:39       ` Junio C Hamano
2016-08-10 19:01         ` Michael Haggerty [this message]
2016-08-10 21:28           ` Junio C Hamano
2016-08-03 22:30   ` Stefan Beller
2016-08-03 22:41     ` Michael Haggerty
2016-08-03 22:51       ` Stefan Beller
2016-08-03 23:30         ` Michael Haggerty
2016-08-04  0:04           ` Stefan Beller
2016-08-10 19:12             ` Michael Haggerty
2016-08-04  7:56   ` Jeff King
2016-08-04 16:55     ` Stefan Beller
2016-08-04 19:47       ` Junio C Hamano
2016-08-13  0:09       ` Michael Haggerty
2016-08-12 23:25     ` Michael Haggerty
2016-08-13  8:59       ` Jeff King
2016-08-13 15:59         ` Junio C Hamano
2016-08-14  7:21           ` Jacob Keller
2016-08-15  6:33         ` Stefan Beller
2016-08-15 20:24           ` Junio C Hamano
2016-08-04 19:52   ` Junio C Hamano
2016-08-13  0:11     ` Michael Haggerty
2016-08-03 22:08 ` [PATCH 0/8] Better heuristics make prettier diffs Michael Haggerty
2016-08-04  7:38 ` Jeff King
2016-08-04 19:54   ` Junio C Hamano
2016-08-04 20:01     ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5fe0edbc-3659-058f-3328-639d1343fa05@alum.mit.edu \
    --to=mhagger@alum.mit.edu \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jacob.keller@gmail.com \
    --cc=jnareb@gmail.com \
    --cc=peff@peff.net \
    --cc=sbeller@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).