git@vger.kernel.org mailing list mirror
* fast-import slowness when importing large files with small differences
@ 2018-06-29  9:44 Mike Hommey
  2018-06-29 20:14 ` Stefan Beller
  2018-06-29 22:10 ` Ævar Arnfjörð Bjarmason
  0 siblings, 2 replies; 17+ messages in thread
From: Mike Hommey @ 2018-06-29  9:44 UTC (permalink / raw)
  To: git

Hi,

I noticed some slowness when fast-importing data from the Firefox mercurial
repository, where fast-import spends more than 5 minutes importing ~2000
revisions of one particular file. I reduced a testcase while still
using real data. One could synthesize data with kind of the same
properties, but I figured real data could be useful.

To reproduce:
$ git clone https://gist.github.com/b6b8edcff2005cc482cf84972adfbba9.git foo
$ git init bar
$ cd bar
$ python ../foo/import.py ../foo/data.gz | git fast-import --depth=2000

(--depth=2000 to minimize the pack size)

The python script doesn't have much overhead:
$ time python ../foo/import.py ../foo/data.gz > /dev/null

real	0m14.564s
user	0m9.813s
sys	0m4.703s

It generates about 26GB of data from that 4.2MB data.gz.

$ python ../foo/import.py ../foo/data.gz | time git fast-import --depth=2000
git-fast-import statistics:
---------------------------------------------------------------------
Alloc'd objects:       5000
Total objects:         1868 (       133 duplicates                  )
      blobs  :         1868 (       133 duplicates       1867 deltas of       1868 attempts)
      trees  :            0 (         0 duplicates          0 deltas of          0 attempts)
      commits:            0 (         0 duplicates          0 deltas of          0 attempts)
      tags   :            0 (         0 duplicates          0 deltas of          0 attempts)
Total branches:           0 (         0 loads     )
      marks:           1024 (         0 unique    )
      atoms:              0
Memory total:          2282 KiB
       pools:          2048 KiB
     objects:           234 KiB
---------------------------------------------------------------------
pack_report: getpagesize()            =       4096
pack_report: core.packedGitWindowSize = 1073741824
pack_report: core.packedGitLimit      = 35184372088832
pack_report: pack_used_ctr            =          0
pack_report: pack_mmap_calls          =          0
pack_report: pack_open_windows        =          0 /          0
pack_report: pack_mapped              =          0 /          0
---------------------------------------------------------------------

321.61user 6.60system 5:50.08elapsed 93%CPU (0avgtext+0avgdata 83192maxresident)k
0inputs+10568outputs (0major+38689minor)pagefaults 0swaps

(The resulting pack is 5.3MB, fwiw)

Obviously, sha1'ing 26GB is not going to be free, but it's also not the
dominating cost, according to perf:

    63.52%  git-fast-import  git-fast-import     [.] create_delta_index
    17.46%  git-fast-import  git-fast-import     [.] sha1_compression_states
     9.89%  git-fast-import  git-fast-import     [.] ubc_check
     6.23%  git-fast-import  git-fast-import     [.] create_delta
     2.49%  git-fast-import  git-fast-import     [.] sha1_process

That's a whole lot of time spent on create_delta_index.
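
For context, create_delta_index hashes the entire source buffer in
fixed-size steps before any matching happens, so its cost scales with the
total bytes indexed. Roughly, in Python (a sketch of the shape only; the
real code in diff-delta.c uses a rolling Rabin fingerprint and packed
hash buckets):

```
RABIN_WINDOW = 16  # git's diff-delta.c indexes the source every 16 bytes

def build_index(src):
    """Map each 16-byte block of src to the offsets where it occurs."""
    index = {}
    for ofs in range(0, len(src) - RABIN_WINDOW + 1, RABIN_WINDOW):
        index.setdefault(src[ofs:ofs + RABIN_WINDOW], []).append(ofs)
    return index

# every byte of the source gets hashed, which is why indexing ~26GB of
# blob data dominates the profile even though each delta is tiny
idx = build_index(b"0123456789abcdef" * 4)
print(len(idx), idx[b"0123456789abcdef"])   # 1 [0, 16, 32, 48]
```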

FWIW, if the delta computation were 100% free (yes, I tested that), the
fast-import would take 1:40 with the following profile:

    58.74%  git-fast-import  git-fast-import     [.] sha1_compression_states
    32.45%  git-fast-import  git-fast-import     [.] ubc_check
     8.25%  git-fast-import  git-fast-import     [.] sha1_process

I toyed with the idea of eliminating the common head and tail before
creating the delta, and got some promising results: a fast-import taking
3:22 instead of 5:50, with the following profile:

    34.67%  git-fast-import  git-fast-import     [.] create_delta_index
    30.88%  git-fast-import  git-fast-import     [.] sha1_compression_states
    17.15%  git-fast-import  git-fast-import     [.] ubc_check
     7.25%  git-fast-import  git-fast-import     [.] store_object
     4.47%  git-fast-import  git-fast-import     [.] sha1_process
     2.72%  git-fast-import  git-fast-import     [.] create_delta2

The resulting pack is, however, much larger (for some reason, many objects
are left non-deltaed), and the deltas are partly broken (they don't
apply cleanly), but that just means the code is not ready to be sent. I
don't expect working code to be much slower than this. The remaining
question is whether this is beneficial for more normal cases.
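
For illustration, here is a minimal sketch of that trimming idea in
Python (the names are mine, not fast-import's; a real change would do
this on the blob buffers in C before building the delta index):

```
def trim_common(a, b):
    """Strip the common head and tail of two byte strings, so only the
    differing middle needs a delta index."""
    head = 0
    limit = min(len(a), len(b))
    while head < limit and a[head] == b[head]:
        head += 1
    tail = 0
    while tail < limit - head and a[len(a) - 1 - tail] == b[len(b) - 1 - tail]:
        tail += 1
    return head, tail, a[head:len(a) - tail], b[head:len(b) - tail]

head, tail, mid_a, mid_b = trim_common(b"common OLD common", b"common NEW common")
# mid_a == b"OLD", mid_b == b"NEW"; only these get deltaed, while the
# head and tail become plain copy ranges in the final delta
```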

I also seem to remember from when I tested a while ago that somehow xdiff
handles those files faster than diff-delta, and I'm wondering if it
would make sense to make the pack code use xdiff. So I tested
replacing diff_delta with a call to xdi_diff_outf with a callback that
does nothing and zeroed-out xpparam_t and xdemitconf_t (not sure that's
best, though; I haven't looked very deeply), and that finished in 5:15
with the following profile (without common head trimming;
xdiff-interface apparently already does common tail trimming):

    32.99%  git-fast-import  git-fast-import     [.] xdl_prepare_ctx.isra.0
    20.42%  git-fast-import  git-fast-import     [.] sha1_compression_states
    15.26%  git-fast-import  git-fast-import     [.] xdl_hash_record
    11.65%  git-fast-import  git-fast-import     [.] ubc_check
     3.09%  git-fast-import  git-fast-import     [.] xdl_recs_cmp
     3.03%  git-fast-import  git-fast-import     [.] sha1_process
     2.91%  git-fast-import  git-fast-import     [.] xdl_prepare_env

So maybe it would make sense to consolidate the diff code (after all,
diff-delta.c is an old specialized fork of xdiff). With manual trimming
of common head and tail, this gets down to 3:33.

I'll also note that Facebook has imported xdiff from the git code base
into mercurial and improved performance on it, so it might also be worth
looking at what's worth taking from there.

Cheers,

Mike


* Re: fast-import slowness when importing large files with small differences
  2018-06-29  9:44 fast-import slowness when importing large files with small differences Mike Hommey
@ 2018-06-29 20:14 ` Stefan Beller
  2018-06-29 20:28   ` [PATCH] xdiff: reduce indent heuristic overhead Stefan Beller
  2018-06-29 20:39   ` fast-import slowness when importing large files with small differences Jeff King
  2018-06-29 22:10 ` Ævar Arnfjörð Bjarmason
  1 sibling, 2 replies; 17+ messages in thread
From: Stefan Beller @ 2018-06-29 20:14 UTC (permalink / raw)
  To: Mike Hommey, Jameson Miller; +Cc: git

On Fri, Jun 29, 2018 at 3:18 AM Mike Hommey <mh@glandium.org> wrote:
>
> Hi,
>
> I noticed some slowness when fast-importing data from the Firefox mercurial
> repository, where fast-import spends more than 5 minutes importing ~2000
> revisions of one particular file. I reduced a testcase while still
> using real data. One could synthesize data with kind of the same
> properties, but I figured real data could be useful.

I cc'd Jameson, who refactored memory allocation in fast-import recently.
(I am not aware of other refactorings in the area of fast-import.)

> To reproduce:
[...]
> Memory total:          2282 KiB
>        pools:          2048 KiB
>      objects:           234 KiB
>
[...]
> Obviously, sha1'ing 26GB is not going to be free, but it's also not the
> dominating cost, according to perf:
>
>     63.52%  git-fast-import  git-fast-import     [.] create_delta_index

So this doesn't sound like a memory issue, but a diffing/deltaing issue.

> So maybe it would make sense to consolidate the diff code (after all,
> diff-delta.c is an old specialized fork of xdiff). With manual trimming
> of common head and tail, this gets down to 3:33.

This sounds interesting. I'd love to see that code unified.

> I'll also note that Facebook has imported xdiff from the git code base
> into mercurial and improved performance on it, so it might also be worth
> looking at what's worth taking from there.

So starting with
https://www.mercurial-scm.org/repo/hg/rev/34e2ff1f9cd8
("xdiff: vendor xdiff library from git")
they adapted it slightly:
$ hg log --template '{node|short} {desc|firstline}\n' -- mercurial/thirdparty/xdiff/
a2baa61bbb14 xdiff: move stdint.h to xdiff.h
d40b9e29c114 xdiff: fix a hard crash on Windows
651c80720eed xdiff: silence a 32-bit shift warning on Windows
d255744de97a xdiff: backport int64_t and uint64_t types to Windows
e5b14f5b8b94 xdiff: resolve signed unsigned comparison warning
f1ef0e53e628 xdiff: use int64 for hash table size
f0d9811dda8e xdiff: remove unused xpp and xecfg parameters
49fe6249937a xdiff: remove unused flags parameter
882657a9f768 xdiff: replace {unsigned ,}long with {u,}int64_t
0c7350656f93 xdiff: add comments for fields in xdfile_t
f33a87cf60cc xdiff: add a preprocessing step that trims files
3cf40112efb7 xdiff: remove xmerge related logic
90f8fe72446c xdiff: remove xemit related logic
b5bb0f99064d xdiff: remove unused structure, functions, and constants
09f320067591 xdiff: remove whitespace related feature
1f9bbd1d6b8a xdiff: fix builds on Windows
c420792217c8 xdiff: reduce indent heuristic overhead
b3c9c483cac9 xdiff: add a bdiff hunk mode
9e7b14caf67f xdiff: remove patience and histogram diff algorithms
34e2ff1f9cd8 xdiff: vendor xdiff library from git

Interesting pieces regarding performance:

c420792217c8 xdiff: reduce indent heuristic overhead
https://phab.mercurial-scm.org/rHGc420792217c89622482005c99e959b9071c109c5

f33a87cf60cc xdiff: add a preprocessing step that trims files
https://phab.mercurial-scm.org/rHGf33a87cf60ccb8b46e06b85e60bc5031420707d6

I'll see if I can make that into patches.

Thanks,
Stefan


* [PATCH] xdiff: reduce indent heuristic overhead
  2018-06-29 20:14 ` Stefan Beller
@ 2018-06-29 20:28   ` Stefan Beller
  2018-06-29 21:17     ` Junio C Hamano
  2018-07-01 15:57     ` Michael Haggerty
  2018-06-29 20:39   ` fast-import slowness when importing large files with small differences Jeff King
  1 sibling, 2 replies; 17+ messages in thread
From: Stefan Beller @ 2018-06-29 20:28 UTC (permalink / raw)
  To: sbeller; +Cc: mhagger, git, jamill, mh

This patch was written originally for mercurial at
https://phab.mercurial-scm.org/rHGc420792217c89622482005c99e959b9071c109c5

    changeset:   36674:c420792217c8
    user:        Jun Wu <quark@fb.com>
    date:        Sat Mar 03 12:39:11 2018 -0800
    files:       mercurial/thirdparty/xdiff/xdiffi.c
    description:
    xdiff: reduce indent heuristic overhead

    Adds some threshold to avoid expensive cases, like:

    ```
    #!python
    open('a', 'w').write(" \n" * 1000000)
    open('b', 'w').write(" \n" * 1000001)
    ```

    The indent heuristic is O(N * 20) (N = 1000000) for the above case, and
    makes diff much slower.

    Before this patch (system git 2.14.2):

    ```
    git diff --no-indent-heuristic a b  0.21s user 0.03s system 100% cpu 0.239 total
    git diff --indent-heuristic a b     0.77s user 0.02s system 99% cpu 0.785 total
    ```

    After this patch (git 2fc74f41, with xdiffi.c patched):

    ```
    # with the changed xdiffi.c
    git diff --indent-heuristic a b      0.16s user 0.01s system 90% cpu 0.188 total
    git diff --no-indent-heuristic a b   0.18s user 0.01s system 99% cpu 0.192 total
    ```

    Now turning on indent-heuristic has no visible impact on performance.

    Differential Revision: https://phab.mercurial-scm.org/D2601

Signed-off-by: Stefan Beller <sbeller@google.com>
---

This applies on our master branch; I have not thought of a
good commit message, or whether we need to test it.

Thanks,
Stefan

 xdiff/xdiffi.c | 38 +++++++++++++++++++++++++++++++++++---
 1 file changed, 35 insertions(+), 3 deletions(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 0de1ef463bf..c74ec77da58 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -807,6 +807,14 @@ static void xdl_bug(const char *msg)
 	exit(1);
 }
 
+/*
+ * For indentation heuristic, skip searching for better slide position after
+ * checking MAX_BORING lines without finding an improvement. This defends the
+ * indentation heuristic logic against pathological cases. The value is not
+ * picked scientifically but should be good enough.
+ */
+#define MAX_BORING 100
+
 /*
  * Move back and forward change groups for a consistent and pretty diff output.
  * This also helps in finding joinable change groups and reducing the diff
@@ -903,19 +911,43 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 			long shift, best_shift = -1;
 			struct split_score best_score;
 
-			for (shift = earliest_end; shift <= g.end; shift++) {
+			/*
+			 * This is O(N * MAX_BLANKS) (N = shift-able lines).
+			 * Even with MAX_BLANKS bounded to a small value, a
+			 * large N could still make this loop take several
+			 * times longer than the main diff algorithm. The
+			 * "boring" value is to help cut down N to something
+			 * like (MAX_BORING + groupsize).
+			 *
+			 * Scan from bottom to top. So we can exit the loop
+			 * without compromising the assumption "for a same best
+			 * score, pick the bottommost shift".
+			 */
+			int boring = 0;
+			for (shift = g.end; shift >= earliest_end; shift--) {
 				struct split_measurement m;
 				struct split_score score = {0, 0};
+				int cmp;
 
 				measure_split(xdf, shift, &m);
 				score_add_split(&m, &score);
 				measure_split(xdf, shift - groupsize, &m);
 				score_add_split(&m, &score);
-				if (best_shift == -1 ||
-				    score_cmp(&score, &best_score) <= 0) {
+
+				if (best_shift == -1) {
+					cmp = -1;
+				} else {
+					cmp = score_cmp(&score, &best_score);
+				}
+				if (cmp < 0) {
+					boring = 0;
 					best_score.effective_indent = score.effective_indent;
 					best_score.penalty = score.penalty;
 					best_shift = shift;
+				} else {
+					boring += 1;
+					if (boring >= MAX_BORING)
+						break;
 				}
 			}
 
-- 
2.18.0.399.gad0ab374a1-goog



* Re: fast-import slowness when importing large files with small differences
  2018-06-29 20:14 ` Stefan Beller
  2018-06-29 20:28   ` [PATCH] xdiff: reduce indent heuristic overhead Stefan Beller
@ 2018-06-29 20:39   ` Jeff King
  2018-06-29 20:51     ` Stefan Beller
  1 sibling, 1 reply; 17+ messages in thread
From: Jeff King @ 2018-06-29 20:39 UTC (permalink / raw)
  To: Stefan Beller; +Cc: Mike Hommey, Jameson Miller, git

On Fri, Jun 29, 2018 at 01:14:52PM -0700, Stefan Beller wrote:

> Interesting pieces regarding performance:
> 
> c420792217c8 xdiff: reduce indent heuristic overhead
> https://phab.mercurial-scm.org/rHGc420792217c89622482005c99e959b9071c109c5
> 
> f33a87cf60cc xdiff: add a preprocessing step that trims files
> https://phab.mercurial-scm.org/rHGf33a87cf60ccb8b46e06b85e60bc5031420707d6
> 
> I'll see if I can make that into patches.

Apparently the second one is not so trivial as you might hope; see
https://public-inbox.org/git/1520337165-sup-4504@x1c/.

-Peff


* Re: fast-import slowness when importing large files with small differences
  2018-06-29 20:39   ` fast-import slowness when importing large files with small differences Jeff King
@ 2018-06-29 20:51     ` Stefan Beller
  0 siblings, 0 replies; 17+ messages in thread
From: Stefan Beller @ 2018-06-29 20:51 UTC (permalink / raw)
  To: Jeff King, quark; +Cc: Mike Hommey, Jameson Miller, git

+cc Jun Wu, original author of these patches

On Fri, Jun 29, 2018 at 1:39 PM Jeff King <peff@peff.net> wrote:

> > Interesting pieces regarding performance:
> >
> > c420792217c8 xdiff: reduce indent heuristic overhead
> > https://phab.mercurial-scm.org/rHGc420792217c89622482005c99e959b9071c109c5

Going by the mailing list, the first patch was not brought over yet,
so sending it here was warranted.

Jun, you may want to take ownership of
https://public-inbox.org/git/20180629202811.131265-1-sbeller@google.com/
as I merely resent it to the git mailing list?
If not, that is fine, too.

> >
> > f33a87cf60cc xdiff: add a preprocessing step that trims files
> > https://phab.mercurial-scm.org/rHGf33a87cf60ccb8b46e06b85e60bc5031420707d6
> >
> > I'll see if I can make that into patches.
>
> Apparently the second one is not so trivial as you might hope; see
> https://public-inbox.org/git/1520337165-sup-4504@x1c/.

Thanks so much; this saves me the effort of digging further there.
So I'll just stop porting this.

Thanks,
Stefan


* Re: [PATCH] xdiff: reduce indent heuristic overhead
  2018-06-29 20:28   ` [PATCH] xdiff: reduce indent heuristic overhead Stefan Beller
@ 2018-06-29 21:17     ` Junio C Hamano
  2018-06-29 23:37       ` Stefan Beller
  2018-07-01 15:57     ` Michael Haggerty
  1 sibling, 1 reply; 17+ messages in thread
From: Junio C Hamano @ 2018-06-29 21:17 UTC (permalink / raw)
  To: Stefan Beller; +Cc: mhagger, git, jamill, mh

Stefan Beller <sbeller@google.com> writes:

> This patch was written originally for mercurial at
> https://phab.mercurial-scm.org/rHGc420792217c89622482005c99e959b9071c109c5
>
>     changeset:   36674:c420792217c8
>     user:        Jun Wu <quark@fb.com>
>     date:        Sat Mar 03 12:39:11 2018 -0800
>     files:       mercurial/thirdparty/xdiff/xdiffi.c
>     description:
>     xdiff: reduce indent heuristic overhead

... This should come with an in-body header to credit the original
author as the author, I think.



* Re: fast-import slowness when importing large files with small differences
  2018-06-29  9:44 fast-import slowness when importing large files with small differences Mike Hommey
  2018-06-29 20:14 ` Stefan Beller
@ 2018-06-29 22:10 ` Ævar Arnfjörð Bjarmason
  2018-06-29 23:35   ` Mike Hommey
  1 sibling, 1 reply; 17+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-06-29 22:10 UTC (permalink / raw)
  To: Mike Hommey; +Cc: git


On Fri, Jun 29 2018, Mike Hommey wrote:

> I noticed some slowness when fast-importing data from the Firefox mercurial
> repository, where fast-import spends more than 5 minutes importing ~2000
> revisions of one particular file. I reduced a testcase while still
> using real data. One could synthesize data with kind of the same
> properties, but I figured real data could be useful.
>
> To reproduce:
> $ git clone https://gist.github.com/b6b8edcff2005cc482cf84972adfbba9.git foo
> $ git init bar
> $ cd bar
> $ python ../foo/import.py ../foo/data.gz | git fast-import --depth=2000
>
> [...]
> So maybe it would make sense to consolidate the diff code (after all,
> diff-delta.c is an old specialized fork of xdiff). With manual trimming
> of common head and tail, this gets down to 3:33.
>
> I'll also note that Facebook has imported xdiff from the git code base
> into mercurial and improved performance on it, so it might also be worth
> looking at what's worth taking from there.

It would be interesting to see how this compares with a more naïve
approach of committing every version of this file one at a time into a
new repository (with & without gc.auto=0). Perhaps deltaing as we go is
suboptimal compared to just writing out a lot of redundant data and
repacking it all at once later.
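
Something like this rough Python sketch is what I have in mind
(`versions` here is a made-up stand-in for the file's real revision
contents; it just illustrates the shape of the experiment):

```
import subprocess

def run(*cmd):
    subprocess.run(cmd, check=True)

# stand-in contents; the real experiment would use the ~2000 revisions
# of the file from the mercurial repository
versions = [b"revision %d\n" % i for i in range(3)]

run("git", "init", "-q", "naive")
run("git", "-C", "naive", "config", "gc.auto", "0")
for i, data in enumerate(versions):
    with open("naive/file", "wb") as f:
        f.write(data)
    run("git", "-C", "naive", "add", "file")
    run("git", "-C", "naive", "commit", "-q", "-m", "rev %d" % i)

# write redundant full objects first, then delta everything in one go
run("git", "-C", "naive", "repack", "-adf", "--depth=2000", "--window=250")
```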


* Re: fast-import slowness when importing large files with small differences
  2018-06-29 22:10 ` Ævar Arnfjörð Bjarmason
@ 2018-06-29 23:35   ` Mike Hommey
  2018-07-03 16:05     ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 17+ messages in thread
From: Mike Hommey @ 2018-06-29 23:35 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git

On Sat, Jun 30, 2018 at 12:10:24AM +0200, Ævar Arnfjörð Bjarmason wrote:
> 
> On Fri, Jun 29 2018, Mike Hommey wrote:
> 
> > I noticed some slowness when fast-importing data from the Firefox mercurial
> > repository, where fast-import spends more than 5 minutes importing ~2000
> > revisions of one particular file. I reduced a testcase while still
> > using real data. One could synthesize data with kind of the same
> > properties, but I figured real data could be useful.
> >
> > To reproduce:
> > $ git clone https://gist.github.com/b6b8edcff2005cc482cf84972adfbba9.git foo
> > $ git init bar
> > $ cd bar
> > $ python ../foo/import.py ../foo/data.gz | git fast-import --depth=2000
> >
> > [...]
> > So maybe it would make sense to consolidate the diff code (after all,
> > diff-delta.c is an old specialized fork of xdiff). With manual trimming
> > of common head and tail, this gets down to 3:33.
> >
> > I'll also note that Facebook has imported xdiff from the git code base
> > into mercurial and improved performance on it, so it might also be worth
> > looking at what's worth taking from there.
> 
> It would be interesting to see how this compares with a more naïve
> approach of committing every version of this file one at a time into a
> new repository (with & without gc.auto=0). Perhaps deltaing as we go is
> suboptimal compared to just writing out a lot of redundant data and
> repacking it all at once later.

"Just" writing 26GB? And that's only one file. If I were to do that for
the whole repository, it would yield a > 100GB pack. Instead of < 2GB
currently.

Mike


* [PATCH] xdiff: reduce indent heuristic overhead
  2018-06-29 21:17     ` Junio C Hamano
@ 2018-06-29 23:37       ` Stefan Beller
  2018-06-30  1:11         ` Jun Wu
  0 siblings, 1 reply; 17+ messages in thread
From: Stefan Beller @ 2018-06-29 23:37 UTC (permalink / raw)
  To: gitster; +Cc: git, jamill, mh, mhagger, sbeller, quark

From: Jun Wu <quark@fb.com>

This patch was originally written for mercurial at [1],
adding a limit on how long we keep searching for an
optimal slide position for the indent heuristic. The limit
is chosen high enough that only edge cases are affected.

    Adds some threshold to avoid expensive cases, like:

    ```
    #!python
    open('a', 'w').write(" \n" * 1000000)
    open('b', 'w').write(" \n" * 1000001)
    ```

    The indent heuristic is O(N * 20) (N = 1000000) for the above case, and
    makes diff much slower.

    Before this patch (system git 2.14.2):

    ```
    git diff --no-indent-heuristic a b  0.21s user 0.03s system 100% cpu 0.239 total
    git diff --indent-heuristic a b     0.77s user 0.02s system 99% cpu 0.785 total
    ```

    After this patch (git 2fc74f41, with xdiffi.c patched):

    ```
    # with the changed xdiffi.c
    git diff --indent-heuristic a b      0.16s user 0.01s system 90% cpu 0.188 total
    git diff --no-indent-heuristic a b   0.18s user 0.01s system 99% cpu 0.192 total
    ```

    Now turning on indent-heuristic has no visible impact on performance.

    Differential Revision: https://phab.mercurial-scm.org/D2601

[1] https://phab.mercurial-scm.org/rHGc420792217c89622482005c99e959b9071c109c5

Signed-off-by: Stefan Beller <sbeller@google.com>
---

Jun, Junio

Since this changes the authorship, we'd want a sign-off from the original
author before applying; in the previous attempt, I was merely taking the code
from mercurial, as their copy of xdiff is also LGPLv2, so we are free to use it.

Thanks,
Stefan

 xdiff/xdiffi.c | 38 +++++++++++++++++++++++++++++++++++---
 1 file changed, 35 insertions(+), 3 deletions(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 0de1ef463bf..c74ec77da58 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -807,6 +807,14 @@ static void xdl_bug(const char *msg)
 	exit(1);
 }
 
+/*
+ * For indentation heuristic, skip searching for better slide position after
+ * checking MAX_BORING lines without finding an improvement. This defends the
+ * indentation heuristic logic against pathological cases. The value is not
+ * picked scientifically but should be good enough.
+ */
+#define MAX_BORING 100
+
 /*
  * Move back and forward change groups for a consistent and pretty diff output.
  * This also helps in finding joinable change groups and reducing the diff
@@ -903,19 +911,43 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 			long shift, best_shift = -1;
 			struct split_score best_score;
 
-			for (shift = earliest_end; shift <= g.end; shift++) {
+			/*
+			 * This is O(N * MAX_BLANKS) (N = shift-able lines).
+			 * Even with MAX_BLANKS bounded to a small value, a
+			 * large N could still make this loop take several
+			 * times longer than the main diff algorithm. The
+			 * "boring" value is to help cut down N to something
+			 * like (MAX_BORING + groupsize).
+			 *
+			 * Scan from bottom to top. So we can exit the loop
+			 * without compromising the assumption "for a same best
+			 * score, pick the bottommost shift".
+			 */
+			int boring = 0;
+			for (shift = g.end; shift >= earliest_end; shift--) {
 				struct split_measurement m;
 				struct split_score score = {0, 0};
+				int cmp;
 
 				measure_split(xdf, shift, &m);
 				score_add_split(&m, &score);
 				measure_split(xdf, shift - groupsize, &m);
 				score_add_split(&m, &score);
-				if (best_shift == -1 ||
-				    score_cmp(&score, &best_score) <= 0) {
+
+				if (best_shift == -1) {
+					cmp = -1;
+				} else {
+					cmp = score_cmp(&score, &best_score);
+				}
+				if (cmp < 0) {
+					boring = 0;
 					best_score.effective_indent = score.effective_indent;
 					best_score.penalty = score.penalty;
 					best_shift = shift;
+				} else {
+					boring += 1;
+					if (boring >= MAX_BORING)
+						break;
 				}
 			}
 
-- 
2.18.0.399.gad0ab374a1-goog



* Re: [PATCH] xdiff: reduce indent heuristic overhead
  2018-06-29 23:37       ` Stefan Beller
@ 2018-06-30  1:11         ` Jun Wu
  0 siblings, 0 replies; 17+ messages in thread
From: Jun Wu @ 2018-06-30  1:11 UTC (permalink / raw)
  To: Stefan Beller; +Cc: gitster, git, jamill, mh, mhagger

Excerpts from Stefan Beller's message of 2018-06-29 16:37:41 -0700:
> [...]
> Jun, Junio
> 
> By changing the authorship we'd want to have a sign off from the original author,
> before applying; in the previous attempt, I was merely taking the code from
> mercurial as their copy of xdiff is also LGPLv2 so we are free to use that.

I'm fine with signing off the patch. I didn't send this one here mainly
because the indent heuristic was off by default at the time I made the
changes, and I wasn't sure how to test this properly according to the git
community's standards. If this change is fine without additional tests, I
can send it myself, too.

> Thanks,
> Stefan
> 
> [...]


* Re: [PATCH] xdiff: reduce indent heuristic overhead
  2018-06-29 20:28   ` [PATCH] xdiff: reduce indent heuristic overhead Stefan Beller
  2018-06-29 21:17     ` Junio C Hamano
@ 2018-07-01 15:57     ` Michael Haggerty
  2018-07-02 17:27       ` Stefan Beller
  2018-07-03 18:14       ` Junio C Hamano
  1 sibling, 2 replies; 17+ messages in thread
From: Michael Haggerty @ 2018-07-01 15:57 UTC (permalink / raw)
  To: Stefan Beller; +Cc: git, jamill, mh

On 06/29/2018 10:28 PM, Stefan Beller wrote:
> [...]
>     Adds some threshold to avoid expensive cases, like:
> 
>     ```
>     #!python
>     open('a', 'w').write(" \n" * 1000000)
>     open('b', 'w').write(" \n" * 1000001)
>     ```
> 
>     The indent heuristic is O(N * 20) (N = 1000000) for the above case, and
>     makes diff much slower.
> [...]
> +/*
> + * For indentation heuristic, skip searching for better slide position after
> + * checking MAX_BORING lines without finding an improvement. This defends the
> + * indentation heuristic logic against pathological cases. The value is not
> + * picked scientifically but should be good enough.
> + */
> +#define MAX_BORING 100

This is an interesting case, and a speed difference of almost a factor
of five seems impressive. But this is a pretty pathological case, isn't
it? And I'm pretty sure that the algorithm is `O(N)` both before and
after this change. Remember that to find `earliest_end` and `g.end`,
there has already been a scan through all 1000000 lines. In other words,
you're not improving how the overall algorithm scales with `N`; you're
only changing the constant factor in front. So it's a little bit
questionable whether it is worth complicating the code for this unusual
case.

But *if* we want to improve this case, I think that we could be smarter
about it.

By the time we get to this point in the code, we already know that there
is a "slider" hunk of length `M` (`groupsize`) that can be slid up or
down within a range of `N` (`g.end - earliest_end + 1`) possible
positions. The interesting case here is `N ≫ M`, because then naively
the number of positions to try out is a lot bigger than the size of the
hunk itself. (In the case described above, `N` is 1000000 and `M` is 1.)

But how can that situation even arise? Remember, a hunk can only be slid
down by a line if the first line *after* the hunk is identical to the
first line *of* the hunk. It follows that if you shift a hunk down `M`
lines, then it has the same contents as when you started—you've just
rotated all of the hunk lines around once.

So if `N ≫ M`, there is necessarily a lot of repetition among the `N +
M` lines that the hunk could possibly overlay. Specifically, it must
consist of `floor((N + M)/M)` identical copies of the hunk, plus
possibly a few leftover lines constituting the start of another repetition.

Given this large amount of repetition, it seems to me that there is
never a need to scan more than the bottom `M + 1` possible positions [1]
plus the highest possible position [2] to be sure of finding the very
best one. In the pathological case that you described above, where `M`
is 1, only three positions have to be evaluated, not 100.
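
To make the repetition concrete, here is a tiny Python check (purely
illustrative; the region and `M` are made up) that a slidable region
repeats with the hunk's period, so there are at most `M` distinct hunk
contents to score:

```
# A hunk starting at p can slide down one line only if
# lines[p] == lines[p + M], so the whole slide range is periodic in M.
lines = ["<item>", "MISSING", "</item>"] * 5   # region a hunk can slide over
M = 3                                          # hunk size
hunks = {tuple(lines[p:p + M]) for p in range(len(lines) - M + 1)}
assert len(hunks) == M   # only M distinct hunk contents need scoring
```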

In fact, it *could* be that there is even more repetition, namely if the
hunk itself contains multiple copies of an even shorter block of `K`
lines. In that case, you would only have to scan `K + 1` positions at
the bottom plus one at the top to be sure to find the best hunk
position. This would be an interesting optimization for a case like

>     open('a', 'w').write(" \n" * 1000000)
>     open('b', 'w').write(" \n" * 1100000)

(`N = 1000000`, `M = 100000`, `K = 1`) or

>     open('a', 'w').write("<item>\nMISSING\n</item>\n" * 1000000)
>     open('b', 'w').write("<item>\nMISSING\n</item>\n" * 1100000)

(`N = 3000000`, `M = 300000`, `K = 3`). On the other hand, it's not
entirely trivial to find periodicity in a group of lines (i.e., to find
`K`), and I don't know offhand how that task scales with `M`.

Michael

[1] Actually, to be rigorously correct it might be necessary to check
even a bit more than `M + 1` positions at the bottom because the
heuristic looks a bit beyond the lines of the hunk.

[2] The position at the top has different predecessor lines than the
other positions, so it could have a lower score than all of the others.
It's worth checking it. Here too, to be rigorously correct it might be
necessary to check more than one position at the top because the
heuristic looks a bit beyond the lines of the hunk.


* Re: [PATCH] xdiff: reduce indent heuristic overhead
  2018-07-01 15:57     ` Michael Haggerty
@ 2018-07-02 17:27       ` Stefan Beller
  2018-07-03  9:15         ` Michael Haggerty
  2018-07-03 18:14       ` Junio C Hamano
  1 sibling, 1 reply; 17+ messages in thread
From: Stefan Beller @ 2018-07-02 17:27 UTC (permalink / raw)
  To: Michael Haggerty, quark; +Cc: git, Jameson Miller, Mike Hommey

On Sun, Jul 1, 2018 at 8:57 AM Michael Haggerty <mhagger@alum.mit.edu> wrote:
>
> On 06/29/2018 10:28 PM, Stefan Beller wrote:
> > [...]
> >     Adds some threshold to avoid expensive cases, like:
> >
> >     ```
> >     #!python
> >     open('a', 'w').write(" \n" * 1000000)
> >     open('b', 'w').write(" \n" * 1000001)
> >     ```
> >
> >     The indent heuristic is O(N * 20) (N = 1000000) for the above case, and
> >     makes diff much slower.
> > [...]
> > +/*
> > + * For indentation heuristic, skip searching for better slide position after
> > + * checking MAX_BORING lines without finding an improvement. This defends the
> > + * indentation heuristic logic against pathological cases. The value is not
> > + * picked scientifically but should be good enough.
> > + */
> > +#define MAX_BORING 100
>
> This is an interesting case, and a speed difference of almost a factor
> of five seems impressive. But this is a pretty pathological case, isn't
> it? And I'm pretty sure that the algorithm is `O(N)` both before and
> after this change. Remember that to find `earliest_end` and `g.end`,
> there has already been a scan through all 1000000 lines. In other words,
> you're not improving how the overall algorithm scales with `N`; you're
> only changing the constant factor in front. So it's a little bit
> questionable whether it is worth complicating the code for this unusual
> case.
>
> But *if* we want to improve this case, I think that we could be smarter
> about it.
>
> By the time we get to this point in the code, we already know that there
> is a "slider" hunk of length `M` (`groupsize`) that can be slid up or
> down within a range of `N` (`g.end - earliest_end + 1`) possible
> positions. The interesting case here is `N ≫ M`, because then naively
> the number of positions to try out is a lot bigger than the size of the
> hunk itself. (In the case described above, `N` is 1000000 and `M` is 1.)
>
> But how can that situation even arise? Remember, a hunk can only be slid
> down by a line if the first line *after* the hunk is identical to the
> first line *of* the hunk. It follows that if you shift a hunk down `M`
> lines, then it has the same contents as when you started—you've just
> rotated all of the hunk lines around once.
>
> So if `N ≫ M`, there is necessarily a lot of repetition among the `N +
> M` lines that the hunk could possibly overlay. Specifically, it must
> consist of `floor((N + M)/M)` identical copies of the hunk, plus
> possibly a few leftover lines constituting the start of another repetition.
>
> Given this large amount of repetition, it seems to me that there is
> never a need to scan more than the bottom `M + 1` possible positions [1]
> plus the highest possible position [2] to be sure of finding the very
> best one. In the pathological case that you described above, where `M`
> is 1, only three positions have to be evaluated, not 100.
>
> In fact, it *could* be that there is even more repetition, namely if the
> hunk itself contains multiple copies of an even shorter block of `K`
> lines. In that case, you would only have to scan `K + 1` positions at
> the bottom plus one at the top to be sure to find the best hunk
> position. This would be an interesting optimization for a case like
>
> >     open('a', 'w').write(" \n" * 1000000)
> >     open('b', 'w').write(" \n" * 1100000)
>
> (`N = 1000000`, `M = 100000`, `K = 1`) or
>
> >     open('a', 'w').write("<item>\nMISSING\n</item>\n" * 1000000)
> >     open('b', 'w').write("<item>\nMISSING\n</item>\n" * 1100000)
>
> (`N = 3000000`, `M = 300000`, `K = 3`). On the other hand, it's not
> entirely trivial to find periodicity in a group of lines (i.e., to find
> `K`), and I don't know offhand how that task scales with `M`.
>
> Michael
>
> [1] Actually, to be rigorously correct it might be necessary to check
> even a bit more than `M + 1` positions at the bottom because the
> heuristic looks a bit beyond the lines of the hunk.
>
> [2] The position at the top has different predecessor lines than the
> other positions, so it could have a lower score than all of the others.
> It's worth checking it. Here too, to be rigorously correct it might be
> necessary to check more than one position at the top because the
> heuristic looks a bit beyond the lines of the hunk.

So this suggests making MAX_BORING
"hunk size + some small constant offset"?

Stefan


* Re: [PATCH] xdiff: reduce indent heuristic overhead
  2018-07-02 17:27       ` Stefan Beller
@ 2018-07-03  9:15         ` Michael Haggerty
  2018-07-27 22:23           ` Stefan Beller
  0 siblings, 1 reply; 17+ messages in thread
From: Michael Haggerty @ 2018-07-03  9:15 UTC (permalink / raw)
  To: Stefan Beller; +Cc: quark, Git Mailing List, jamill, mh

On Mon, Jul 2, 2018 at 7:27 PM Stefan Beller <sbeller@google.com> wrote:
> On Sun, Jul 1, 2018 at 8:57 AM Michael Haggerty <mhagger@alum.mit.edu> wrote:
> [...]
> So this suggests to have MAX_BORING to be
> "hunk size + some small constant offset" ?

That would be my suggestion, yes. There are cases where it will be
more expensive than a fixed `MAX_BORING`, but I bet on average it will
be faster. Plus, it should always give the right answer.
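
Concretely, the combined suggestion amounts to clamping where the scan
starts (a Python sketch with made-up names, mirroring the variables in
xdl_change_compact):

```
earliest_end, g_end = 0, 1000000   # the pathological slide range above
groupsize = 1                      # hunk size M
MAX_SLIDING = 100                  # small constant cap

# begin at most min(groupsize + 1, MAX_SLIDING) lines above the bottom
start = max(earliest_end, g_end - groupsize - 1, g_end - MAX_SLIDING)
print(len(range(start, g_end + 1)))   # 3 positions instead of 1000001
```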

Michael


* Re: fast-import slowness when importing large files with small differences
  2018-06-29 23:35   ` Mike Hommey
@ 2018-07-03 16:05     ` Ævar Arnfjörð Bjarmason
  2018-07-03 22:38       ` Mike Hommey
  0 siblings, 1 reply; 17+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-07-03 16:05 UTC (permalink / raw)
  To: Mike Hommey; +Cc: git


On Fri, Jun 29 2018, Mike Hommey wrote:

> On Sat, Jun 30, 2018 at 12:10:24AM +0200, Ævar Arnfjörð Bjarmason wrote:
>>
>> On Fri, Jun 29 2018, Mike Hommey wrote:
>>
>> > I noticed some slowness when fast-importing data from the Firefox mercurial
>> > repository, where fast-import spends more than 5 minutes importing ~2000
>> > revisions of one particular file. I reduced a testcase while still
>> > using real data. One could synthesize data with kind of the same
>> > properties, but I figured real data could be useful.
>> >
>> > To reproduce:
>> > $ git clone https://gist.github.com/b6b8edcff2005cc482cf84972adfbba9.git foo
>> > $ git init bar
>> > $ cd bar
>> > $ python ../foo/import.py ../foo/data.gz | git fast-import --depth=2000
>> >
>> > [...]
>> > So maybe it would make sense to consolidate the diff code (after all,
>> > diff-delta.c is an old specialized fork of xdiff). With manual trimming
>> > of common head and tail, this gets down to 3:33.
>> >
>> > I'll also note that Facebook has imported xdiff from the git code base
>> > into mercurial and improved performance on it, so it might also be worth
>> > looking at what's worth taking from there.
>>
>> It would be interesting to see how this compares with a more naïve
>> approach of committing every version of this file one at a time into a
>> new repository (with & without gc.auto=0). Perhaps deltaing as we go is
>> suboptimal compared to just writing out a lot of redundant data and
>> repacking it all at once later.
>
> "Just" writing 26GB? And that's only one file. If I were to do that for
> the whole repository, it would yield a > 100GB pack. Instead of < 2GB
> currently.

To clarify my terse response: I meant to try this on an isolated test
case to see to what extent the problem you're describing is unique to
fast-import, and to what extent it's encountered during "normal" git use
when you commit all the revisions of that file in succession.

Perhaps the difference between the two would give some hint as to how to
proceed, or not.


* Re: [PATCH] xdiff: reduce indent heuristic overhead
  2018-07-01 15:57     ` Michael Haggerty
  2018-07-02 17:27       ` Stefan Beller
@ 2018-07-03 18:14       ` Junio C Hamano
  1 sibling, 0 replies; 17+ messages in thread
From: Junio C Hamano @ 2018-07-03 18:14 UTC (permalink / raw)
  To: Michael Haggerty; +Cc: Stefan Beller, git, jamill, mh

Michael Haggerty <mhagger@alum.mit.edu> writes:

> So if `N ≫ M`, there is necessarily a lot of repetition among the `N +
> M` lines that the hunk could possibly overlay. Specifically, it must
> consist of `floor((N + M)/M)` identical copies of the hunk, plus
> possibly a few leftover lines constituting the start of another repetition.
>
> Given this large amount of repetition, it seems to me that there is
> never a need to scan more than the bottom `M + 1` possible positions [1]
> plus the highest possible position [2] to be sure of finding the very
> best one. In the pathological case that you described above, where `M`
> is 1, only three positions have to be evaluated, not 100.

Nicely analysed.


* Re: fast-import slowness when importing large files with small differences
  2018-07-03 16:05     ` Ævar Arnfjörð Bjarmason
@ 2018-07-03 22:38       ` Mike Hommey
  0 siblings, 0 replies; 17+ messages in thread
From: Mike Hommey @ 2018-07-03 22:38 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git

On Tue, Jul 03, 2018 at 06:05:16PM +0200, Ævar Arnfjörð Bjarmason wrote:
> 
> On Fri, Jun 29 2018, Mike Hommey wrote:
> 
> > On Sat, Jun 30, 2018 at 12:10:24AM +0200, Ævar Arnfjörð Bjarmason wrote:
> >>
> >> On Fri, Jun 29 2018, Mike Hommey wrote:
> >>
> >> > I noticed some slowness when fast-importing data from the Firefox mercurial
> >> > repository, where fast-import spends more than 5 minutes importing ~2000
> >> > revisions of one particular file. I reduced a testcase while still
> >> > using real data. One could synthesize data with kind of the same
> >> > properties, but I figured real data could be useful.
> >> >
> >> > To reproduce:
> >> > $ git clone https://gist.github.com/b6b8edcff2005cc482cf84972adfbba9.git foo
> >> > $ git init bar
> >> > $ cd bar
> >> > $ python ../foo/import.py ../foo/data.gz | git fast-import --depth=2000
> >> >
> >> > [...]
> >> > So maybe it would make sense to consolidate the diff code (after all,
> >> > diff-delta.c is an old specialized fork of xdiff). With manual trimming
> >> > of common head and tail, this gets down to 3:33.
> >> >
> >> > I'll also note that Facebook has imported xdiff from the git code base
> >> > into mercurial and improved performance on it, so it might also be worth
> >> > looking at what's worth taking from there.
> >>
> >> It would be interesting to see how this compares with a more naïve
> >> approach of committing every version of this file one at a time into a
> >> new repository (with & without gc.auto=0). Perhaps deltaing as we go is
> >> suboptimal compared to just writing out a lot of redundant data and
> >> repacking it all at once later.
> >
> > "Just" writing 26GB? And that's only one file. If I were to do that for
> > the whole repository, it would yield a > 100GB pack. Instead of < 2GB
> > currently.
> 
> To clarify my terse response: I meant to try this on an isolated test
> case to see to what extent the problem you're describing is unique to
> fast-import, and to what extent it's encountered during "normal" git use
> when you commit all the revisions of that file in succession.
> 
> Perhaps the difference between the two would give some hint as to how to
> proceed, or not.

AIUI, git repack will end up creating delta indexes for every blob, so the
problem should exist there too. But because it compares "random" blobs, it
can't take the same kinds of shortcuts as fast-import can: fast-import only
cares about diffing against the last imported blob. So while fast-import
can reduce the amount of work it does by not indexing the common head and
tail of the compared blobs, git repack can't.

Mike


* [PATCH] xdiff: reduce indent heuristic overhead
  2018-07-03  9:15         ` Michael Haggerty
@ 2018-07-27 22:23           ` Stefan Beller
  0 siblings, 0 replies; 17+ messages in thread
From: Stefan Beller @ 2018-07-27 22:23 UTC (permalink / raw)
  To: mhagger; +Cc: git, jamill, mh, quark, sbeller

Skip searching for a better slide position for the indent heuristic once
we would slide a hunk further than its size. This is the easiest fix
proposed in the analysis[1] written in response to a patch that mercurial
took for xdiff, which limits the search by a constant. Using a performance
test such as:

     #!python
     open('a', 'w').write(" \n" * 1000000)
     open('b', 'w').write(" \n" * 1000001)

This patch reduces the execution time of "git diff --no-index a b" from
0.70s to 0.31s. However, limiting the sliding to the size of the diff hunk,
which was the proposed solution (and the one I found easiest to implement
for now), is not optimal for cases like

     open('a', 'w').write(" \n" * 1000000)
     open('b', 'w').write(" \n" * 2000000)

as then we'd still slide 1000000 times.

In addition to limiting the sliding to the size of the hunk, also limit
it by a constant. Choose 100 lines as that constant, as that is more than
fits on a screen; if a hunk can slide further than that, the sliding is
probably not providing a lot of benefit anyway.

[1] https://public-inbox.org/git/72ac1ac2-f567-f241-41d6-d0f83072e0b3@alum.mit.edu/

Reported-by: Jun Wu <quark@fb.com>
Analysis-by: Michael Haggerty <mhagger@alum.mit.edu>
Signed-off-by: Stefan Beller <sbeller@google.com>
---

> Plus, it should always give the right answer.

I was tempted to do just that, but I caved. The diff is correct either
way, and the hunk sliding is purely cosmetic, for the benefit of humans
looking at diffs. If your hunk can slide more than a monitor height,
you're not interested in the best-slid diff; something else is going on.

> There are cases where it will be
> more expensive than a fixed `MAX_BORING`, but I bet on average it will
> be faster.

So I did both, settling on performance as the utmost priority. ;-)

 xdiff/xdiffi.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 0de1ef463bf..91e98ee9869 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -591,6 +591,11 @@ static void measure_split(const xdfile_t *xdf, long split,
  */
 #define INDENT_WEIGHT 60
 
+/*
+ * How far do we slide a hunk at most?
+ */
+#define INDENT_HEURISTIC_MAX_SLIDING 100
+
 /*
  * Compute a badness score for the hypothetical split whose measurements are
  * stored in m. The weight factors were determined empirically using the tools and
@@ -903,7 +908,12 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 			long shift, best_shift = -1;
 			struct split_score best_score;
 
-			for (shift = earliest_end; shift <= g.end; shift++) {
+			shift = earliest_end;
+			if (g.end - groupsize - 1 > shift)
+				shift = g.end - groupsize - 1;
+			if (g.end - INDENT_HEURISTIC_MAX_SLIDING > shift)
+				shift = g.end - INDENT_HEURISTIC_MAX_SLIDING;
+			for (; shift <= g.end; shift++) {
 				struct split_measurement m;
 				struct split_score score = {0, 0};
 
-- 
2.18.0.345.g5c9ce644c3-goog


