From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on dcvr.yhbt.net X-Spam-Level: X-Spam-ASN: AS31976 209.132.180.0/23 X-Spam-Status: No, score=-3.6 required=3.0 tests=AWL,BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,RCVD_IN_DNSWL_HI,RP_MATCHES_RCVD shortcircuit=no autolearn=ham autolearn_force=no version=3.4.0 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by dcvr.yhbt.net (Postfix) with ESMTP id D91271FD99 for ; Wed, 10 Aug 2016 19:23:02 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934758AbcHJTWx (ORCPT ); Wed, 10 Aug 2016 15:22:53 -0400 Received: from alum-mailsec-scanner-8.mit.edu ([18.7.68.20]:61743 "EHLO alum-mailsec-scanner-8.mit.edu" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S940192AbcHJTBL (ORCPT ); Wed, 10 Aug 2016 15:01:11 -0400 X-AuditID: 12074414-ca3ff70000000931-79-57ab79f5d120 Received: from outgoing-alum.mit.edu (OUTGOING-ALUM.MIT.EDU [18.7.68.33]) by (Symantec Messaging Gateway) with SMTP id A5.43.02353.5F97BA75; Wed, 10 Aug 2016 15:01:09 -0400 (EDT) Received: from [192.168.69.130] (p5B104255.dip0.t-ipconnect.de [91.16.66.85]) (authenticated bits=0) (User authenticated as mhagger@ALUM.MIT.EDU) by outgoing-alum.mit.edu (8.13.8/8.12.4) with ESMTP id u7AJ166Y011909 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES128-SHA bits=128 verify=NOT); Wed, 10 Aug 2016 15:01:07 -0400 Subject: Re: [PATCH 8/8] diff: improve positioning of add/delete blocks in diffs To: Junio C Hamano References: <7b0680ed7a10fc13acd8d7816a75ed05a5f9e28c.1470259583.git.mhagger@alum.mit.edu> Cc: Jacob Keller , Git mailing list , Stefan Beller , Jeff King , =?UTF-8?Q?Jakub_Nar=c4=99bski?= From: Michael Haggerty Message-ID: <5fe0edbc-3659-058f-3328-639d1343fa05@alum.mit.edu> Date: Wed, 10 Aug 2016 21:01:06 +0200 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Icedove/45.1.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit X-Brightmail-Tracker: H4sIAAAAAAAAA+NgFmpkleLIzCtJLcpLzFFi42IRYndR1P1auTrc4P0fa4uuK91MFg29V5gt di/uZ7ZYcXUOs8WPlh5mi82b21kc2Dx2zrrL7rFgU6nHs949jB4XLyl7fN4kF8AaxWWTkpqT WZZapG+XwJXxc0J8wTedio6Td9kaGHcrdzFyckgImEhsnP+JFcQWEtjKKHHjr04XIxeQfZZJ ovnRHkaQhLBAkMTuTfvBbBEBNYmJbYdYIIq2MUmcetXCCuIwC9xnlPgxt4MJpIpNQFdiUU8z mM0rYC+xvO0tM4jNIqAqMfVPHwuILSoQIrHtZgMbRI2gxMmZT8DinALWEj9uLwKzmQX0JHZc /8UKYctLbH87h3kCI/8sJC2zkJTNQlK2gJF5FaNcYk5prm5uYmZOcWqybnFyYl5eapGuhV5u ZoleakrpJkZISIvsYDxyUu4QowAHoxIPr0fm6nAh1sSy4srcQ4ySHExKorzCMUAhvqT8lMqM xOKM+KLSnNTiQ4wSHMxKIrwPS4ByvCmJlVWpRfkwKWkOFiVx3m+L1f2EBNITS1KzU1MLUotg sjIcHEoSvGLA2BUSLEpNT61Iy8wpQUgzcXCCDOcBGp4MUsNbXJCYW5yZDpE/xagoJc57oQIo IQCSyCjNg+uFpZxXjOJArwjzloC08wDTFVz3K6DBTECDk1RXgAwuSURISTUwOvb/Eoyemi/M eerKip3TQ35p7I3ZxGazK6dG3oFHu+De46/qG5jOtse8VXC5cWn/pmIX5ascbTyudVeiZ/p/ 2Vfi8iRherrjusjZrXP+eVgmqCyR31Nofs9pz/yF4v//e27J3Vvue/Lq/V5xBcFP5dbfuZpP /n3G/uZJR/RcsV7B8ptSR3RfKLEUZyQaajEXFScCAJyMcbgUAwAA Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org On 08/04/2016 09:39 PM, Junio C Hamano wrote: > Michael Haggerty writes: > >>>> + } >>>> + /* >>>> + * We have reached the end of the line without finding any non-space >>>> + * characters; i.e., the whole line consists of trailing spaces, which we >>>> + * are not interested in. >>>> + */ >>>> + return -1; > > Not related to Jacob's review, but "the whole line consists of > trailing spaces" made me read it twice; while it is technically > correct, "the whole line consists of spaces", or even "this is a > blank line", would read a lot more easily, at least for me. Hehe, yes, ETOOMANYWORDS. >> I was implicitly assuming that such lines would have text somewhere >> after those 200 spaces (or 25 TABs or whatever). But you're right, the >> line could consist only of whitespace. Unfortunately, the only way to >> distinguish these two cases is to read the rest of the line, which is >> exactly what we *don't* want to do. > > Hmm, why is it exactly what we don't want to do? Is it a > performance concern? In other words, is it because this function is > called many times to measure the same line multiple times? Yes, performance is the reason, and especially the desire to avoid unreasonable runtimes for pathological cases. Thanks for asking, though, because it's worthwhile to look into this more rigorously. Suppose there is a slider that can be shifted to any of `numshift` positions. Then we have to call `measure_split()` `2*numshift` times (once for the beginning and once for the end of each candidate slider position). Suppose there are `numblanks` blank lines in the neighborhood of the slider. Each time we call `measure_split()`, we count the number of blank lines before and after the proposed split position. So we end up calling `get_indent()` `2*numshift*numblanks` times. Finally, suppose that the blank lines each contain `numws` whitespace characters. Then each call to `get_indent()` has to do `O(numws)` work. So altogether, if there were no limits, then the amount of work to position a slider would scale like O(numshift * numblanks * numws) However, the total number of characters in the file might only be O(numblanks * numws) So without limits, the amount of work to position sliders could scale by numshift times the size of the file. The worst case is pretty easy to achieve. Just create a file consisting of a million or so LF characters, then add another LF to it. The diff would be a slider with numshift = 1000000 numblanks = 1000000 numws = 1 so the heuristic would take O(N^2) in the size of the file. Effectively the limits cap the effective `numblanks` at `2*MAX_BLANKS` (which is 2*20) and the effective `numws` at `MAX_INDENT` (which is 200), meaning that the maximum effort scales like numshift * 2*20 * 200 which is still a big number but not absurd. Empirically, for the case described above, `git diff` takes 104 ms and `git diff --indent-heuristic` takes 490 ms. I think that's not prohibitive for a pathological case. Meanwhile, Myers's algorithm scales like O(ND), where N is the number of lines and D is the edit distance, so I suppose that it is already possible to find diffs that are intractable to compute. > After > all, somebody in this file is already scanning each and every line > to see where it ends to split the input into records, so perhaps a > "right" (if the "theoretical correctness" of the return value from > this function mattered, which you wave-away below) optimization > could be to precompute it while the lines are broken into records > and store it in the "rec" structure? That would certainly be possible, and would help in cases where there are a lot of lines with lots of leading whitespace. You could get nearly the same benefit by recording a single bit in struct rec, indicating whether the line is blank or not. But it wouldn't help the worst case described above, where each call to `git_indent()` is already very cheap. And I didn't think it was worth allocating the extra memory to optimize this heuristic * which isn't used all that often in the first place, * which (for normal inputs) doesn't take a significant amount of time, and * when the optimization doesn't improve the worst-case scenario (and thus any DoS potential) I think the only way to ensure O(size_of_file) runtime in the above case would be to record, along with each line, how many blank lines immediately precede and succeed it. You could achieve something like O(size_of_file lg(size_of_file)) by storing, e.g., the total number of nonblank lines that precede each line and doing a binary search to find the nearest non-blank line. >> But I think it doesn't matter anyway. Such "text" will likely never be >> read by a human, so it's not a big deal if the slider position is not >> picked perfectly. And remember, this whole saga is just to improve the >> aesthetics of the diff. The diff is *correct* (e.g., in the sense of >> applicable) regardless of where we position the sliders. > > A better argument may be "if the user is truly reading a diff output > for such an unusual "text", it is likely that she has a very wide > display and/or running less -S, and treating such an overindented line > as if it were a blank line would give a result that is more consistent > to what appears on her display", perhaps? I don't know. It seems like a pretty contrived justification for what is basically, "your input is too weird for us. We're not going to break our necks trying to give you the best possible slider positioning." Michael