git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Michael Haggerty <mhagger@alum.mit.edu>
To: Stefan Beller <sbeller@google.com>
Cc: git@vger.kernel.org, jamill@microsoft.com, mh@glandium.org
Subject: Re: [PATCH] xdiff: reduce indent heuristic overhead
Date: Sun, 1 Jul 2018 17:57:35 +0200	[thread overview]
Message-ID: <72ac1ac2-f567-f241-41d6-d0f83072e0b3@alum.mit.edu> (raw)
In-Reply-To: <20180629202811.131265-1-sbeller@google.com>

On 06/29/2018 10:28 PM, Stefan Beller wrote:
> [...]
>     Adds some threshold to avoid expensive cases, like:
> 
>     ```
>     #!python
>     open('a', 'w').write(" \n" * 1000000)
>     open('b', 'w').write(" \n" * 1000001)
>     ```
> 
>     The indent heuristic is O(N * 20) (N = 1000000) for the above case, and
>     makes diff much slower.
> [...]
> +/*
> + * For indentation heuristic, skip searching for better slide position after
> + * checking MAX_BORING lines without finding an improvement. This defends the
> + * indentation heuristic logic against pathological cases. The value is not
> + * picked scientifically but should be good enough.
> + */
> +#define MAX_BORING 100

This is an interesting case, and a speed difference of almost a factor
of five seems impressive. But this is a pretty pathological case, isn't
it? And I'm pretty sure that the algorithm is `O(N)` both before and
after this change. Remember that to find `earliest_end` and `g.end`,
there has already been a scan through all 1000000 lines. In other words,
you're not improving how the overall algorithm scales with `N`; you're
only changing the constant factor in front. So it's a little bit
questionable whether it is worth complicating the code for this unusual
case.

But *if* we want to improve this case, I think that we could be smarter
about it.

By the time we get to this point in the code, we already know that there
is a "slider" hunk of length `M` (`groupsize`) that can be slid up or
down within a range of `N` (`g.end - earliest_end + 1`) possible
positions. The interesting case here is `N ≫ M`, because then naively
the number of positions to try out is a lot bigger than the size of the
hunk itself. (In the case described above, `N` is 1000000 and `M` is 1.)

But how can that situation even arise? Remember, a hunk can only be slid
down by a line if the first line *after* the hunk is identical to the
first line *of* the hunk. It follows that if you shift a hunk down `M`
lines, then it has the same contents as when you started—you've just
rotated all of the hunk lines around once.

So if `N ≫ M`, there is necessarily a lot of repetition among the `N +
M` lines that the hunk could possibly overlay. Specifically, it must
consist of `floor((N + M)/M)` identical copies of the hunk, plus
possibly a few leftover lines constituting the start of another repetition.

Given this large amount of repetition, it seems to me that there is
never a need to scan more than the bottom `M + 1` possible positions [1]
plus the highest possible position [2] to be sure of finding the very
best one. In the pathological case that you described above, where `M`
is 1, only three positions have to be evaluated, not 100.

In fact, it *could* be that there is even more repetition, namely if the
hunk itself contains multiple copies of an even shorter block of `K`
lines. In that case, you would only have to scan `K + 1` positions at
the bottom plus one at the top to be sure to find the best hunk
position. This would be an interesting optimization for a case like

>     open('a', 'w').write(" \n" * 1000000)
>     open('b', 'w').write(" \n" * 1100000)

(`N = 1000000`, `M = 100000`, `K = 1`) or

>     open('a', 'w').write("<item>\nMISSING\n</item>\n" * 1000000)
>     open('b', 'w').write("<item>\nMISSING\n</item>\n" * 1100000)

(`N = 3000000`, `M = 300000`, `K = 3`). On the other hand, it's not
entirely trivial to find periodicity in a group of lines (i.e., to find
`K`), and I don't know offhand how that task scales with `M`.

Michael

[1] Actually, to be rigorously correct it might be necessary to check
even a bit more than `M + 1` positions at the bottom because the
heuristic looks a bit beyond the lines of the hunk.

[2] The position at the top has different predecessor lines than the
other positions, so it could have a lower score than all of the others.
It's worth checking it. Here too, to be rigorously correct it might be
necessary to check more than one position at the top because the
heuristic looks a bit beyond the lines of the hunk.

  parent reply	other threads:[~2018-07-01 15:58 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-06-29  9:44 fast-import slowness when importing large files with small differences Mike Hommey
2018-06-29 20:14 ` Stefan Beller
2018-06-29 20:28   ` [PATCH] xdiff: reduce indent heuristic overhead Stefan Beller
2018-06-29 21:17     ` Junio C Hamano
2018-06-29 23:37       ` Stefan Beller
2018-06-30  1:11         ` Jun Wu
2018-07-01 15:57     ` Michael Haggerty [this message]
2018-07-02 17:27       ` Stefan Beller
2018-07-03  9:15         ` Michael Haggerty
2018-07-27 22:23           ` Stefan Beller
2018-07-03 18:14       ` Junio C Hamano
2018-06-29 20:39   ` fast-import slowness when importing large files with small differences Jeff King
2018-06-29 20:51     ` Stefan Beller
2018-06-29 22:10 ` Ævar Arnfjörð Bjarmason
2018-06-29 23:35   ` Mike Hommey
2018-07-03 16:05     ` Ævar Arnfjörð Bjarmason
2018-07-03 22:38       ` Mike Hommey

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=72ac1ac2-f567-f241-41d6-d0f83072e0b3@alum.mit.edu \
    --to=mhagger@alum.mit.edu \
    --cc=git@vger.kernel.org \
    --cc=jamill@microsoft.com \
    --cc=mh@glandium.org \
    --cc=sbeller@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).