* fast-import slowness when importing large files with small differences
@ 2018-06-29 9:44 Mike Hommey
2018-06-29 20:14 ` Stefan Beller
2018-06-29 22:10 ` Ævar Arnfjörð Bjarmason
0 siblings, 2 replies; 17+ messages in thread
From: Mike Hommey @ 2018-06-29 9:44 UTC (permalink / raw)
To: git
Hi,
I noticed some slowness when fast-importing data from the Firefox mercurial
repository, where fast-import spends more than 5 minutes importing ~2000
revisions of one particular file. I reduced a testcase while still
using real data. One could synthesize data with kind of the same
properties, but I figured real data could be useful.
To reproduce:
$ git clone https://gist.github.com/b6b8edcff2005cc482cf84972adfbba9.git foo
$ git init bar
$ cd bar
$ python ../foo/import.py ../foo/data.gz | git fast-import --depth=2000
(--depth=2000 to minimize the pack size)
The python script doesn't have much overhead:
$ time python ../foo/import.py ../foo/data.gz > /dev/null
real 0m14.564s
user 0m9.813s
sys 0m4.703s
It generates about 26GB of data from that 4.2MB data.gz.
$ python ../foo/import.py ../foo/data.gz | time git fast-import --depth=2000
git-fast-import statistics:
---------------------------------------------------------------------
Alloc'd objects: 5000
Total objects: 1868 ( 133 duplicates )
blobs : 1868 ( 133 duplicates 1867 deltas of 1868 attempts)
trees : 0 ( 0 duplicates 0 deltas of 0 attempts)
commits: 0 ( 0 duplicates 0 deltas of 0 attempts)
tags : 0 ( 0 duplicates 0 deltas of 0 attempts)
Total branches: 0 ( 0 loads )
marks: 1024 ( 0 unique )
atoms: 0
Memory total: 2282 KiB
pools: 2048 KiB
objects: 234 KiB
---------------------------------------------------------------------
pack_report: getpagesize() = 4096
pack_report: core.packedGitWindowSize = 1073741824
pack_report: core.packedGitLimit = 35184372088832
pack_report: pack_used_ctr = 0
pack_report: pack_mmap_calls = 0
pack_report: pack_open_windows = 0 / 0
pack_report: pack_mapped = 0 / 0
---------------------------------------------------------------------
321.61user 6.60system 5:50.08elapsed 93%CPU (0avgtext+0avgdata 83192maxresident)k
0inputs+10568outputs (0major+38689minor)pagefaults 0swaps
(The resulting pack is 5.3MB, fwiw)
Obviously, sha1'ing 26GB is not going to be free, but it's also not the
dominating cost, according to perf:
63.52% git-fast-import git-fast-import [.] create_delta_index
17.46% git-fast-import git-fast-import [.] sha1_compression_states
9.89% git-fast-import git-fast-import [.] ubc_check
6.23% git-fast-import git-fast-import [.] create_delta
2.49% git-fast-import git-fast-import [.] sha1_process
That's a whole lot of time spent on create_delta_index.
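For a sense of why the indexing dominates: before each delta, diff-delta builds an index over the whole source blob. A toy model of that step (made-up code, not diff-delta.c, which hashes small fixed-size windows with a rolling checksum; this only models the cost):
```
#!python
# Toy model of delta indexing: record where each window of the source
# occurs so the target can later look up candidate copy locations.
WINDOW = 16

def build_index(source: bytes) -> dict:
    index = {}
    for ofs in range(0, len(source) - WINDOW + 1, WINDOW):
        index.setdefault(source[ofs:ofs + WINDOW], ofs)
    return index

# The index is rebuilt from scratch for every delta attempt, so 1868
# attempts over blobs totalling ~26GB means indexing ~26GB of sources.
```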
FWIW, if delta was 100% free (yes, I tested that), the fast-import would
take 1:40 with the following profile:
58.74% git-fast-import git-fast-import [.] sha1_compression_states
32.45% git-fast-import git-fast-import [.] ubc_check
8.25% git-fast-import git-fast-import [.] sha1_process
I toyed with the idea of eliminating common head and tail before
creating the delta, and got some promising results: a fast-import taking
3:22 instead of 5:50, with the following profile:
34.67% git-fast-import git-fast-import [.] create_delta_index
30.88% git-fast-import git-fast-import [.] sha1_compression_states
17.15% git-fast-import git-fast-import [.] ubc_check
7.25% git-fast-import git-fast-import [.] store_object
4.47% git-fast-import git-fast-import [.] sha1_process
2.72% git-fast-import git-fast-import [.] create_delta2
The resulting pack is, however, much larger (for some reason, many
objects are left non-deltaed), and the deltas are partly broken (they
don't apply cleanly), but that just shows the code is not ready to be
sent. I don't expect working code to be much slower than this. The
remaining question is whether this is beneficial for more normal cases.
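The trimming itself is cheap compared to indexing; a minimal sketch of the idea (hypothetical helper, not the code I actually tested):
```
#!python
# Minimal sketch of head/tail trimming: strip the longest common prefix
# and suffix, so the delta index only has to cover the differing middle
# of the source blob.
def trim_common(a: bytes, b: bytes):
    n = min(len(a), len(b))
    head = 0
    while head < n and a[head] == b[head]:
        head += 1
    tail = 0
    while tail < n - head and a[-1 - tail] == b[-1 - tail]:
        tail += 1
    return head, tail

# With ~2000 similar revisions, the differing middles are tiny compared
# to the full ~13MB blobs, so the indexing work drops dramatically.
```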
I also seem to remember, from when I tested a while ago, that xdiff
somehow handles those files faster than diff-delta, and I'm wondering if
it would make sense to make the pack code use xdiff. So I tested
replacing diff_delta with a call to xdi_diff_outf with a callback that
does nothing, and with zeroed-out xpparam_t and xdemitconf_t (not sure
that's best, though; I haven't looked very deeply), and that finished in
5:15 with the following profile (without common head trimming;
xdiff-interface apparently does common tail trimming):
32.99% git-fast-import git-fast-import [.] xdl_prepare_ctx.isra.0
20.42% git-fast-import git-fast-import [.] sha1_compression_states
15.26% git-fast-import git-fast-import [.] xdl_hash_record
11.65% git-fast-import git-fast-import [.] ubc_check
3.09% git-fast-import git-fast-import [.] xdl_recs_cmp
3.03% git-fast-import git-fast-import [.] sha1_process
2.91% git-fast-import git-fast-import [.] xdl_prepare_env
So maybe it would make sense to consolidate the diff code (after all,
diff-delta.c is an old specialized fork of xdiff). With manual trimming
of common head and tail, this gets down to 3:33.
I'll also note that Facebook has imported xdiff from the git code base
into mercurial and improved performance on it, so it might also be worth
looking at what's worth taking from there.
Cheers,
Mike
* Re: fast-import slowness when importing large files with small differences
2018-06-29 9:44 fast-import slowness when importing large files with small differences Mike Hommey
@ 2018-06-29 20:14 ` Stefan Beller
2018-06-29 20:28 ` [PATCH] xdiff: reduce indent heuristic overhead Stefan Beller
2018-06-29 20:39 ` fast-import slowness when importing large files with small differences Jeff King
2018-06-29 22:10 ` Ævar Arnfjörð Bjarmason
1 sibling, 2 replies; 17+ messages in thread
From: Stefan Beller @ 2018-06-29 20:14 UTC (permalink / raw)
To: Mike Hommey, Jameson Miller; +Cc: git
On Fri, Jun 29, 2018 at 3:18 AM Mike Hommey <mh@glandium.org> wrote:
>
> Hi,
>
> I noticed some slowness when fast-importing data from the Firefox mercurial
> repository, where fast-import spends more than 5 minutes importing ~2000
> revisions of one particular file. I reduced a testcase while still
> using real data. One could synthesize data with kind of the same
> properties, but I figured real data could be useful.
I cc'd Jameson, who refactored memory allocation in fast-import recently.
(I am not aware of other refactorings in the area of fast-import.)
> To reproduce:
[...]
> Memory total: 2282 KiB
> pools: 2048 KiB
> objects: 234 KiB
>
[...]
> Obviously, sha1'ing 26GB is not going to be free, but it's also not the
> dominating cost, according to perf:
>
> 63.52% git-fast-import git-fast-import [.] create_delta_index
So this doesn't sound like a memory issue, but a diffing/deltaing issue.
> So maybe it would make sense to consolidate the diff code (after all,
> diff-delta.c is an old specialized fork of xdiff). With manual trimming
> of common head and tail, this gets down to 3:33.
This sounds interesting. I'd love to see that code unified.
> I'll also note that Facebook has imported xdiff from the git code base
> into mercurial and improved performance on it, so it might also be worth
> looking at what's worth taking from there.
So starting with
https://www.mercurial-scm.org/repo/hg/rev/34e2ff1f9cd8
("xdiff: vendor xdiff library from git")
they adapted it slightly:
$ hg log --template '{node|short} {desc|firstline}\n' --
mercurial/thirdparty/xdiff/
a2baa61bbb14 xdiff: move stdint.h to xdiff.h
d40b9e29c114 xdiff: fix a hard crash on Windows
651c80720eed xdiff: silence a 32-bit shift warning on Windows
d255744de97a xdiff: backport int64_t and uint64_t types to Windows
e5b14f5b8b94 xdiff: resolve signed unsigned comparison warning
f1ef0e53e628 xdiff: use int64 for hash table size
f0d9811dda8e xdiff: remove unused xpp and xecfg parameters
49fe6249937a xdiff: remove unused flags parameter
882657a9f768 xdiff: replace {unsigned ,}long with {u,}int64_t
0c7350656f93 xdiff: add comments for fields in xdfile_t
f33a87cf60cc xdiff: add a preprocessing step that trims files
3cf40112efb7 xdiff: remove xmerge related logic
90f8fe72446c xdiff: remove xemit related logic
b5bb0f99064d xdiff: remove unused structure, functions, and constants
09f320067591 xdiff: remove whitespace related feature
1f9bbd1d6b8a xdiff: fix builds on Windows
c420792217c8 xdiff: reduce indent heuristic overhead
b3c9c483cac9 xdiff: add a bdiff hunk mode
9e7b14caf67f xdiff: remove patience and histogram diff algorithms
34e2ff1f9cd8 xdiff: vendor xdiff library from git
Interesting pieces regarding performance:
c420792217c8 xdiff: reduce indent heuristic overhead
https://phab.mercurial-scm.org/rHGc420792217c89622482005c99e959b9071c109c5
f33a87cf60cc xdiff: add a preprocessing step that trims files
https://phab.mercurial-scm.org/rHGf33a87cf60ccb8b46e06b85e60bc5031420707d6
I'll see if I can make that into patches.
Thanks,
Stefan
* [PATCH] xdiff: reduce indent heuristic overhead
2018-06-29 20:14 ` Stefan Beller
@ 2018-06-29 20:28 ` Stefan Beller
2018-06-29 21:17 ` Junio C Hamano
2018-07-01 15:57 ` Michael Haggerty
2018-06-29 20:39 ` fast-import slowness when importing large files with small differences Jeff King
1 sibling, 2 replies; 17+ messages in thread
From: Stefan Beller @ 2018-06-29 20:28 UTC (permalink / raw)
To: sbeller; +Cc: mhagger, git, jamill, mh
This patch was written originally for mercurial at
https://phab.mercurial-scm.org/rHGc420792217c89622482005c99e959b9071c109c5
changeset: 36674:c420792217c8
user: Jun Wu <quark@fb.com>
date: Sat Mar 03 12:39:11 2018 -0800
files: mercurial/thirdparty/xdiff/xdiffi.c
description:
xdiff: reduce indent heuristic overhead
Adds some threshold to avoid expensive cases, like:
```
#!python
open('a', 'w').write(" \n" * 1000000)
open('b', 'w').write(" \n" * 1000001)
```
The indent heuristic is O(N * 20) (N = 1000000) for the above case, and
makes diff much slower.
Before this patch (system git 2.14.2):
```
git diff --no-indent-heuristic a b 0.21s user 0.03s system 100% cpu 0.239 total
git diff --indent-heuristic a b 0.77s user 0.02s system 99% cpu 0.785 total
```
After this patch (git 2fc74f41, with xdiffi.c patched):
```
# with the changed xdiffi.c
git diff --indent-heuristic a b 0.16s user 0.01s system 90% cpu 0.188 total
git diff --no-indent-heuristic a b 0.18s user 0.01s system 99% cpu 0.192 total
```
Now turning on indent-heuristic has no visible impact on performance.
Differential Revision: https://phab.mercurial-scm.org/D2601
Signed-off-by: Stefan Beller <sbeller@google.com>
---
This applies on our master branch. I have not thought of a
good commit message, or whether we need to test it.
Thanks,
Stefan
xdiff/xdiffi.c | 38 +++++++++++++++++++++++++++++++++++---
1 file changed, 35 insertions(+), 3 deletions(-)
diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 0de1ef463bf..c74ec77da58 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -807,6 +807,14 @@ static void xdl_bug(const char *msg)
 	exit(1);
 }
 
+/*
+ * For indentation heuristic, skip searching for better slide position after
+ * checking MAX_BORING lines without finding an improvement. This defends the
+ * indentation heuristic logic against pathological cases. The value is not
+ * picked scientifically but should be good enough.
+ */
+#define MAX_BORING 100
+
 /*
  * Move back and forward change groups for a consistent and pretty diff output.
  * This also helps in finding joinable change groups and reducing the diff
@@ -903,19 +911,43 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 			long shift, best_shift = -1;
 			struct split_score best_score;
 
-			for (shift = earliest_end; shift <= g.end; shift++) {
+			/*
+			 * This is O(N * MAX_BLANKS) (N = shift-able lines).
+			 * Even with MAX_BLANKS bounded to a small value, a
+			 * large N could still make this loop take several
+			 * times longer than the main diff algorithm. The
+			 * "boring" value is to help cut down N to something
+			 * like (MAX_BORING + groupsize).
+			 *
+			 * Scan from bottom to top. So we can exit the loop
+			 * without compromising the assumption "for a same best
+			 * score, pick the bottommost shift".
+			 */
+			int boring = 0;
+			for (shift = g.end; shift >= earliest_end; shift--) {
 				struct split_measurement m;
 				struct split_score score = {0, 0};
+				int cmp;
 
 				measure_split(xdf, shift, &m);
 				score_add_split(&m, &score);
 				measure_split(xdf, shift - groupsize, &m);
 				score_add_split(&m, &score);
-				if (best_shift == -1 ||
-				    score_cmp(&score, &best_score) <= 0) {
+
+				if (best_shift == -1) {
+					cmp = -1;
+				} else {
+					cmp = score_cmp(&score, &best_score);
+				}
+				if (cmp < 0) {
+					boring = 0;
 					best_score.effective_indent = score.effective_indent;
 					best_score.penalty = score.penalty;
 					best_shift = shift;
+				} else {
+					boring += 1;
+					if (boring >= MAX_BORING)
+						break;
 				}
 			}
--
2.18.0.399.gad0ab374a1-goog
* Re: fast-import slowness when importing large files with small differences
2018-06-29 20:14 ` Stefan Beller
2018-06-29 20:28 ` [PATCH] xdiff: reduce indent heuristic overhead Stefan Beller
@ 2018-06-29 20:39 ` Jeff King
2018-06-29 20:51 ` Stefan Beller
1 sibling, 1 reply; 17+ messages in thread
From: Jeff King @ 2018-06-29 20:39 UTC (permalink / raw)
To: Stefan Beller; +Cc: Mike Hommey, Jameson Miller, git
On Fri, Jun 29, 2018 at 01:14:52PM -0700, Stefan Beller wrote:
> Interesting pieces regarding performance:
>
> c420792217c8 xdiff: reduce indent heuristic overhead
> https://phab.mercurial-scm.org/rHGc420792217c89622482005c99e959b9071c109c5
>
> f33a87cf60cc xdiff: add a preprocessing step that trims files
> https://phab.mercurial-scm.org/rHGf33a87cf60ccb8b46e06b85e60bc5031420707d6
>
> I'll see if I can make that into patches.
Apparently the second one is not so trivial as you might hope; see
https://public-inbox.org/git/1520337165-sup-4504@x1c/.
-Peff
* Re: fast-import slowness when importing large files with small differences
2018-06-29 20:39 ` fast-import slowness when importing large files with small differences Jeff King
@ 2018-06-29 20:51 ` Stefan Beller
0 siblings, 0 replies; 17+ messages in thread
From: Stefan Beller @ 2018-06-29 20:51 UTC (permalink / raw)
To: Jeff King, quark; +Cc: Mike Hommey, Jameson Miller, git
+cc Jun Wu, original author of these patches
On Fri, Jun 29, 2018 at 1:39 PM Jeff King <peff@peff.net> wrote:
> > Interesting pieces regarding performance:
> >
> > c420792217c8 xdiff: reduce indent heuristic overhead
> > https://phab.mercurial-scm.org/rHGc420792217c89622482005c99e959b9071c109c5
Going by the mailing list, the first patch was not brought over yet,
so sending it here was warranted.
Jun, you may want to take ownership of
https://public-inbox.org/git/20180629202811.131265-1-sbeller@google.com/
as I merely resent it to the git mailing list?
If not, that is fine, too.
> >
> > f33a87cf60cc xdiff: add a preprocessing step that trims files
> > https://phab.mercurial-scm.org/rHGf33a87cf60ccb8b46e06b85e60bc5031420707d6
> >
> > I'll see if I can make that into patches.
>
> Apparently the second one is not so trivial as you might hope; see
> https://public-inbox.org/git/1520337165-sup-4504@x1c/.
Thanks so much, this saves me further effort digging there.
So I'll just stop porting this.
Thanks,
Stefan
* Re: [PATCH] xdiff: reduce indent heuristic overhead
2018-06-29 20:28 ` [PATCH] xdiff: reduce indent heuristic overhead Stefan Beller
@ 2018-06-29 21:17 ` Junio C Hamano
2018-06-29 23:37 ` Stefan Beller
2018-07-01 15:57 ` Michael Haggerty
1 sibling, 1 reply; 17+ messages in thread
From: Junio C Hamano @ 2018-06-29 21:17 UTC (permalink / raw)
To: Stefan Beller; +Cc: mhagger, git, jamill, mh
Stefan Beller <sbeller@google.com> writes:
> This patch was written originally for mercurial at
> https://phab.mercurial-scm.org/rHGc420792217c89622482005c99e959b9071c109c5
>
> changeset: 36674:c420792217c8
> user: Jun Wu <quark@fb.com>
> date: Sat Mar 03 12:39:11 2018 -0800
> files: mercurial/thirdparty/xdiff/xdiffi.c
> description:
> xdiff: reduce indent heuristic overhead
... This should come with an in-body header to credit the original
author as the author, I think.
* Re: fast-import slowness when importing large files with small differences
2018-06-29 9:44 fast-import slowness when importing large files with small differences Mike Hommey
2018-06-29 20:14 ` Stefan Beller
@ 2018-06-29 22:10 ` Ævar Arnfjörð Bjarmason
2018-06-29 23:35 ` Mike Hommey
1 sibling, 1 reply; 17+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-06-29 22:10 UTC (permalink / raw)
To: Mike Hommey; +Cc: git
On Fri, Jun 29 2018, Mike Hommey wrote:
> I noticed some slowness when fast-importing data from the Firefox mercurial
> repository, where fast-import spends more than 5 minutes importing ~2000
> revisions of one particular file. I reduced a testcase while still
> using real data. One could synthesize data with kind of the same
> properties, but I figured real data could be useful.
>
> To reproduce:
> $ git clone https://gist.github.com/b6b8edcff2005cc482cf84972adfbba9.git foo
> $ git init bar
> $ cd bar
> $ python ../foo/import.py ../foo/data.gz | git fast-import --depth=2000
>
> [...]
> So maybe it would make sense to consolidate the diff code (after all,
> diff-delta.c is an old specialized fork of xdiff). With manual trimming
> of common head and tail, this gets down to 3:33.
>
> I'll also note that Facebook has imported xdiff from the git code base
> into mercurial and improved performance on it, so it might also be worth
> looking at what's worth taking from there.
It would be interesting to see how this compares with a more naïve
approach of committing every version of this file one at a time into a
new repository (with & without gc.auto=0). Perhaps deltaing as we go is
suboptimal compared to just writing out a lot of redundant data and
repacking it all at once later.
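Concretely, something along these lines (a rough sketch, assuming a fresh "git init" repository; 'versions' is a stand-in for the real revision contents):
```
#!python
# Rough sketch of the experiment: commit each revision one at a time
# (no deltaing while writing; commits create loose objects), then let
# a single repack do all the delta search at the end.
import subprocess

versions = [b"v%d\n" % i for i in range(3)]  # stand-in for the real data

for i, rev in enumerate(versions):
    with open("file", "wb") as f:
        f.write(rev)
    subprocess.run(["git", "add", "file"], check=True)
    subprocess.run(["git", "-c", "gc.auto=0", "commit", "-q", "-m",
                    "rev %d" % i], check=True)

subprocess.run(["git", "repack", "-adf", "--depth=2000"], check=True)
```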
* Re: fast-import slowness when importing large files with small differences
2018-06-29 22:10 ` Ævar Arnfjörð Bjarmason
@ 2018-06-29 23:35 ` Mike Hommey
2018-07-03 16:05 ` Ævar Arnfjörð Bjarmason
0 siblings, 1 reply; 17+ messages in thread
From: Mike Hommey @ 2018-06-29 23:35 UTC (permalink / raw)
To: Ævar Arnfjörð Bjarmason; +Cc: git
On Sat, Jun 30, 2018 at 12:10:24AM +0200, Ævar Arnfjörð Bjarmason wrote:
>
> On Fri, Jun 29 2018, Mike Hommey wrote:
>
> > I noticed some slowness when fast-importing data from the Firefox mercurial
> > repository, where fast-import spends more than 5 minutes importing ~2000
> > revisions of one particular file. I reduced a testcase while still
> > using real data. One could synthesize data with kind of the same
> > properties, but I figured real data could be useful.
> >
> > To reproduce:
> > $ git clone https://gist.github.com/b6b8edcff2005cc482cf84972adfbba9.git foo
> > $ git init bar
> > $ cd bar
> > $ python ../foo/import.py ../foo/data.gz | git fast-import --depth=2000
> >
> > [...]
> > So maybe it would make sense to consolidate the diff code (after all,
> > diff-delta.c is an old specialized fork of xdiff). With manual trimming
> > of common head and tail, this gets down to 3:33.
> >
> > I'll also note that Facebook has imported xdiff from the git code base
> > into mercurial and improved performance on it, so it might also be worth
> > looking at what's worth taking from there.
>
> It would be interesting to see how this compares with a more naïve
> approach of committing every version of this file one at a time into a
> new repository (with & without gc.auto=0). Perhaps deltaing as we go is
> suboptimal compared to just writing out a lot of redundant data and
> repacking it all at once later.
"Just" writing 26GB? And that's only one file. If I were to do that for
the whole repository, it would yield a > 100GB pack. Instead of < 2GB
currently.
Mike
* [PATCH] xdiff: reduce indent heuristic overhead
2018-06-29 21:17 ` Junio C Hamano
@ 2018-06-29 23:37 ` Stefan Beller
2018-06-30 1:11 ` Jun Wu
0 siblings, 1 reply; 17+ messages in thread
From: Stefan Beller @ 2018-06-29 23:37 UTC (permalink / raw)
To: gitster; +Cc: git, jamill, mh, mhagger, sbeller, quark
From: Jun Wu <quark@fb.com>
This patch was written originally for mercurial at [1],
adding a limit on how long we search for an
optimal indent heuristic position. The limit is chosen
high enough to only affect edge cases.
Adds some threshold to avoid expensive cases, like:
```
#!python
open('a', 'w').write(" \n" * 1000000)
open('b', 'w').write(" \n" * 1000001)
```
The indent heuristic is O(N * 20) (N = 1000000) for the above case, and
makes diff much slower.
Before this patch (system git 2.14.2):
```
git diff --no-indent-heuristic a b 0.21s user 0.03s system 100% cpu 0.239 total
git diff --indent-heuristic a b 0.77s user 0.02s system 99% cpu 0.785 total
```
After this patch (git 2fc74f41, with xdiffi.c patched):
```
# with the changed xdiffi.c
git diff --indent-heuristic a b 0.16s user 0.01s system 90% cpu 0.188 total
git diff --no-indent-heuristic a b 0.18s user 0.01s system 99% cpu 0.192 total
```
Now turning on indent-heuristic has no visible impact on performance.
Differential Revision: https://phab.mercurial-scm.org/D2601
[1] https://phab.mercurial-scm.org/rHGc420792217c89622482005c99e959b9071c109c5
Signed-off-by: Stefan Beller <sbeller@google.com>
---
Jun, Junio
By changing the authorship, we'd want to have a sign-off from the original author
before applying; in the previous attempt, I was merely taking the code from
mercurial, as their copy of xdiff is also LGPLv2, so we are free to use that.
Thanks,
Stefan
xdiff/xdiffi.c | 38 +++++++++++++++++++++++++++++++++++---
1 file changed, 35 insertions(+), 3 deletions(-)
diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 0de1ef463bf..c74ec77da58 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -807,6 +807,14 @@ static void xdl_bug(const char *msg)
 	exit(1);
 }
 
+/*
+ * For indentation heuristic, skip searching for better slide position after
+ * checking MAX_BORING lines without finding an improvement. This defends the
+ * indentation heuristic logic against pathological cases. The value is not
+ * picked scientifically but should be good enough.
+ */
+#define MAX_BORING 100
+
 /*
  * Move back and forward change groups for a consistent and pretty diff output.
  * This also helps in finding joinable change groups and reducing the diff
@@ -903,19 +911,43 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 			long shift, best_shift = -1;
 			struct split_score best_score;
 
-			for (shift = earliest_end; shift <= g.end; shift++) {
+			/*
+			 * This is O(N * MAX_BLANKS) (N = shift-able lines).
+			 * Even with MAX_BLANKS bounded to a small value, a
+			 * large N could still make this loop take several
+			 * times longer than the main diff algorithm. The
+			 * "boring" value is to help cut down N to something
+			 * like (MAX_BORING + groupsize).
+			 *
+			 * Scan from bottom to top. So we can exit the loop
+			 * without compromising the assumption "for a same best
+			 * score, pick the bottommost shift".
+			 */
+			int boring = 0;
+			for (shift = g.end; shift >= earliest_end; shift--) {
 				struct split_measurement m;
 				struct split_score score = {0, 0};
+				int cmp;
 
 				measure_split(xdf, shift, &m);
 				score_add_split(&m, &score);
 				measure_split(xdf, shift - groupsize, &m);
 				score_add_split(&m, &score);
-				if (best_shift == -1 ||
-				    score_cmp(&score, &best_score) <= 0) {
+
+				if (best_shift == -1) {
+					cmp = -1;
+				} else {
+					cmp = score_cmp(&score, &best_score);
+				}
+				if (cmp < 0) {
+					boring = 0;
 					best_score.effective_indent = score.effective_indent;
 					best_score.penalty = score.penalty;
 					best_shift = shift;
+				} else {
+					boring += 1;
+					if (boring >= MAX_BORING)
+						break;
 				}
 			}
--
2.18.0.399.gad0ab374a1-goog
* Re: [PATCH] xdiff: reduce indent heuristic overhead
2018-06-29 23:37 ` Stefan Beller
@ 2018-06-30 1:11 ` Jun Wu
0 siblings, 0 replies; 17+ messages in thread
From: Jun Wu @ 2018-06-30 1:11 UTC (permalink / raw)
To: Stefan Beller; +Cc: gitster, git, jamill, mh, mhagger
Excerpts from Stefan Beller's message of 2018-06-29 16:37:41 -0700:
> [...]
> Jun, Junio
>
> By changing the authorship, we'd want to have a sign-off from the original author
> before applying; in the previous attempt, I was merely taking the code from
> mercurial, as their copy of xdiff is also LGPLv2, so we are free to use that.
I'm fine with signing off the patch. I didn't send this one here mainly
because the indent heuristic was off by default at the time I made the
changes, and I wasn't sure how to test this properly according to the
git community's standards. If this change is fine without additional
tests, I can send it myself, too.
> Thanks,
> Stefan
>
> [...]
* Re: [PATCH] xdiff: reduce indent heuristic overhead
2018-06-29 20:28 ` [PATCH] xdiff: reduce indent heuristic overhead Stefan Beller
2018-06-29 21:17 ` Junio C Hamano
@ 2018-07-01 15:57 ` Michael Haggerty
2018-07-02 17:27 ` Stefan Beller
2018-07-03 18:14 ` Junio C Hamano
1 sibling, 2 replies; 17+ messages in thread
From: Michael Haggerty @ 2018-07-01 15:57 UTC (permalink / raw)
To: Stefan Beller; +Cc: git, jamill, mh
On 06/29/2018 10:28 PM, Stefan Beller wrote:
> [...]
> Adds some threshold to avoid expensive cases, like:
>
> ```
> #!python
> open('a', 'w').write(" \n" * 1000000)
> open('b', 'w').write(" \n" * 1000001)
> ```
>
> The indent heuristic is O(N * 20) (N = 1000000) for the above case, and
> makes diff much slower.
> [...]
> +/*
> + * For indentation heuristic, skip searching for better slide position after
> + * checking MAX_BORING lines without finding an improvement. This defends the
> + * indentation heuristic logic against pathological cases. The value is not
> + * picked scientifically but should be good enough.
> + */
> +#define MAX_BORING 100
This is an interesting case, and a speed difference of almost a factor
of five seems impressive. But this is a pretty pathological case, isn't
it? And I'm pretty sure that the algorithm is `O(N)` both before and
after this change. Remember that to find `earliest_end` and `g.end`,
there has already been a scan through all 1000000 lines. In other words,
you're not improving how the overall algorithm scales with `N`; you're
only changing the constant factor in front. So it's a little bit
questionable whether it is worth complicating the code for this unusual
case.
But *if* we want to improve this case, I think that we could be smarter
about it.
By the time we get to this point in the code, we already know that there
is a "slider" hunk of length `M` (`groupsize`) that can be slid up or
down within a range of `N` (`g.end - earliest_end + 1`) possible
positions. The interesting case here is `N ≫ M`, because then naively
the number of positions to try out is a lot bigger than the size of the
hunk itself. (In the case described above, `N` is 1000000 and `M` is 1.)
But how can that situation even arise? Remember, a hunk can only be slid
down by a line if the first line *after* the hunk is identical to the
first line *of* the hunk. It follows that if you shift a hunk down `M`
lines, then it has the same contents as when you started—you've just
rotated all of the hunk lines around once.
So if `N ≫ M`, there is necessarily a lot of repetition among the `N +
M` lines that the hunk could possibly overlay. Specifically, it must
consist of `floor((N + M)/M)` identical copies of the hunk, plus
possibly a few leftover lines constituting the start of another repetition.
Given this large amount of repetition, it seems to me that there is
never a need to scan more than the bottom `M + 1` possible positions [1]
plus the highest possible position [2] to be sure of finding the very
best one. In the pathological case that you described above, where `M`
is 1, only three positions have to be evaluated, not 100.
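(To make the repetition concrete, here is a toy periodicity check; plain Python, not xdiff code:)
```
#!python
# Toy check of the periodicity argument: if a hunk of M lines can be
# slid down one position, line i must equal line i + M throughout the
# region the hunk can occupy, so the region repeats with period M.
M = 3
region = ["<item>", "MISSING", "</item>"] * 5  # the N + M candidate lines
assert all(region[i] == region[i + M] for i in range(len(region) - M))
```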
In fact, it *could* be that there is even more repetition, namely if the
hunk itself contains multiple copies of an even shorter block of `K`
lines. In that case, you would only have to scan `K + 1` positions at
the bottom plus one at the top to be sure to find the best hunk
position. This would be an interesting optimization for a case like
> open('a', 'w').write(" \n" * 1000000)
> open('b', 'w').write(" \n" * 1100000)
(`N = 1000000`, `M = 100000`, `K = 1`) or
> open('a', 'w').write("<item>\nMISSING\n</item>\n" * 1000000)
> open('b', 'w').write("<item>\nMISSING\n</item>\n" * 1100000)
(`N = 3000000`, `M = 300000`, `K = 3`). On the other hand, it's not
entirely trivial to find periodicity in a group of lines (i.e., to find
`K`), and I don't know offhand how that task scales with `M`.
Michael
[1] Actually, to be rigorously correct it might be necessary to check
even a bit more than `M + 1` positions at the bottom because the
heuristic looks a bit beyond the lines of the hunk.
[2] The position at the top has different predecessor lines than the
other positions, so it could have a lower score than all of the others.
It's worth checking it. Here too, to be rigorously correct it might be
necessary to check more than one position at the top because the
heuristic looks a bit beyond the lines of the hunk.
* Re: [PATCH] xdiff: reduce indent heuristic overhead
2018-07-01 15:57 ` Michael Haggerty
@ 2018-07-02 17:27 ` Stefan Beller
2018-07-03 9:15 ` Michael Haggerty
2018-07-03 18:14 ` Junio C Hamano
1 sibling, 1 reply; 17+ messages in thread
From: Stefan Beller @ 2018-07-02 17:27 UTC (permalink / raw)
To: Michael Haggerty, quark; +Cc: git, Jameson Miller, Mike Hommey
On Sun, Jul 1, 2018 at 8:57 AM Michael Haggerty <mhagger@alum.mit.edu> wrote:
>
> On 06/29/2018 10:28 PM, Stefan Beller wrote:
> > [...]
> > Adds some threshold to avoid expensive cases, like:
> >
> > ```
> > #!python
> > open('a', 'w').write(" \n" * 1000000)
> > open('b', 'w').write(" \n" * 1000001)
> > ```
> >
> > The indent heuristic is O(N * 20) (N = 1000000) for the above case, and
> > makes diff much slower.
> > [...]
> > +/*
> > + * For indentation heuristic, skip searching for better slide position after
> > + * checking MAX_BORING lines without finding an improvement. This defends the
> > + * indentation heuristic logic against pathological cases. The value is not
> > + * picked scientifically but should be good enough.
> > + */
> > +#define MAX_BORING 100
>
> This is an interesting case, and a speed difference of almost a factor
> of five seems impressive. But this is a pretty pathological case, isn't
> it? And I'm pretty sure that the algorithm is `O(N)` both before and
> after this change. Remember that to find `earliest_end` and `g.end`,
> there has already been a scan through all 1000000 lines. In other words,
> you're not improving how the overall algorithm scales with `N`; you're
> only changing the constant factor in front. So it's a little bit
> questionable whether it is worth complicating the code for this unusual
> case.
>
> But *if* we want to improve this case, I think that we could be smarter
> about it.
>
> By the time we get to this point in the code, we already know that there
> is a "slider" hunk of length `M` (`groupsize`) that can be slid up or
> down within a range of `N` (`g.end - earliest_end + 1`) possible
> positions. The interesting case here is `N ≫ M`, because then naively
> the number of positions to try out is a lot bigger than the size of the
> hunk itself. (In the case described above, `N` is 1000000 and `M` is 1.)
>
> But how can that situation even arise? Remember, a hunk can only be slid
> down by a line if the first line *after* the hunk is identical to the
> first line *of* the hunk. It follows that if you shift a hunk down `M`
> lines, then it has the same contents as when you started—you've just
> rotated all of the hunk lines around once.
>
> So if `N ≫ M`, there is necessarily a lot of repetition among the `N +
> M` lines that the hunk could possibly overlay. Specifically, it must
> consist of `floor((N + M)/M)` identical copies of the hunk, plus
> possibly a few leftover lines constituting the start of another repetition.
>
> Given this large amount of repetition, it seems to me that there is
> never a need to scan more than the bottom `M + 1` possible positions [1]
> plus the highest possible position [2] to be sure of finding the very
> best one. In the pathological case that you described above, where `M`
> is 1, only three positions have to be evaluated, not 100.
>
> In fact, it *could* be that there is even more repetition, namely if the
> hunk itself contains multiple copies of an even shorter block of `K`
> lines. In that case, you would only have to scan `K + 1` positions at
> the bottom plus one at the top to be sure to find the best hunk
> position. This would be an interesting optimization for a case like
>
> > open('a', 'w').write(" \n" * 1000000)
> > open('b', 'w').write(" \n" * 1100000)
>
> (`N = 1000000`, `M = 100000`, `K = 1`) or
>
> > open('a', 'w').write("<item>\nMISSING\n</item>\n" * 1000000)
> > open('b', 'w').write("<item>\nMISSING\n</item>\n" * 1100000)
>
> (`N = 3000000`, `M = 300000`, `K = 3`). On the other hand, it's not
> entirely trivial to find periodicity in a group of lines (i.e., to find
> `K`), and I don't know offhand how that task scales with `M`.
>
> Michael
>
> [1] Actually, to be rigorously correct it might be necessary to check
> even a bit more than `M + 1` positions at the bottom because the
> heuristic looks a bit beyond the lines of the hunk.
>
> [2] The position at the top has different predecessor lines than the
> other positions, so it could have a lower score than all of the others.
> It's worth checking it. Here too, to be rigorously correct it might be
> necessary to check more than one position at the top because the
> heuristic looks a bit beyond the lines of the hunk.
So this suggests making MAX_BORING
"hunk size + some small constant offset"?
Stefan
* Re: [PATCH] xdiff: reduce indent heuristic overhead
2018-07-02 17:27 ` Stefan Beller
@ 2018-07-03 9:15 ` Michael Haggerty
2018-07-27 22:23 ` Stefan Beller
0 siblings, 1 reply; 17+ messages in thread
From: Michael Haggerty @ 2018-07-03 9:15 UTC (permalink / raw)
To: Stefan Beller; +Cc: quark, Git Mailing List, jamill, mh
On Mon, Jul 2, 2018 at 7:27 PM Stefan Beller <sbeller@google.com> wrote:
> On Sun, Jul 1, 2018 at 8:57 AM Michael Haggerty <mhagger@alum.mit.edu> wrote:
> [...]
> So this suggests to have MAX_BORING to be
> "hunk size + some small constant offset" ?
That would be my suggestion, yes. There are cases where it will be
more expensive than a fixed `MAX_BORING`, but I bet on average it will
be faster. Plus, it should always give the right answer.
Michael
* Re: fast-import slowness when importing large files with small differences
2018-06-29 23:35 ` Mike Hommey
@ 2018-07-03 16:05 ` Ævar Arnfjörð Bjarmason
2018-07-03 22:38 ` Mike Hommey
0 siblings, 1 reply; 17+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-07-03 16:05 UTC (permalink / raw)
To: Mike Hommey; +Cc: git
On Fri, Jun 29 2018, Mike Hommey wrote:
> On Sat, Jun 30, 2018 at 12:10:24AM +0200, Ævar Arnfjörð Bjarmason wrote:
>>
>> On Fri, Jun 29 2018, Mike Hommey wrote:
>>
>> > I noticed some slowness when fast-importing data from the Firefox mercurial
>> > repository, where fast-import spends more than 5 minutes importing ~2000
>> > revisions of one particular file. I reduced a testcase while still
>> > using real data. One could synthesize data with kind of the same
>> > properties, but I figured real data could be useful.
>> >
>> > To reproduce:
>> > $ git clone https://gist.github.com/b6b8edcff2005cc482cf84972adfbba9.git foo
>> > $ git init bar
>> > $ cd bar
>> > $ python ../foo/import.py ../foo/data.gz | git fast-import --depth=2000
>> >
>> > [...]
>> > So maybe it would make sense to consolidate the diff code (after all,
>> > diff-delta.c is an old specialized fork of xdiff). With manual trimming
>> > of common head and tail, this gets down to 3:33.
>> >
>> > I'll also note that Facebook has imported xdiff from the git code base
>> > into mercurial and improved performance on it, so it might also be worth
>> > looking at what's worth taking from there.
>>
>> It would be interesting to see how this compares with a more naïve
>> approach of committing every version of this file one at a time into a
>> new repository (with & without gc.auto=0). Perhaps deltaing as we go is
>> suboptimal compared to just writing out a lot of redundant data and
>> repacking it all at once later.
>
> "Just" writing 26GB? And that's only one file. If I were to do that for
> the whole repository, it would yield a > 100GB pack. Instead of < 2GB
> currently.
To clarify my terse response: I meant trying this on an isolated test
case to see to what extent the problem you're describing is unique to
fast-import, and to what extent it's encountered during "normal" git use
when you commit all the revisions of that file in succession.
Perhaps the difference between the two would give some hint as to how to
proceed, or not.
* Re: [PATCH] xdiff: reduce indent heuristic overhead
2018-07-01 15:57 ` Michael Haggerty
2018-07-02 17:27 ` Stefan Beller
@ 2018-07-03 18:14 ` Junio C Hamano
1 sibling, 0 replies; 17+ messages in thread
From: Junio C Hamano @ 2018-07-03 18:14 UTC (permalink / raw)
To: Michael Haggerty; +Cc: Stefan Beller, git, jamill, mh
Michael Haggerty <mhagger@alum.mit.edu> writes:
> So if `N ≫ M`, there is necessarily a lot of repetition among the `N +
> M` lines that the hunk could possibly overlay. Specifically, it must
> consist of `floor((N + M)/M)` identical copies of the hunk, plus
> possibly a few leftover lines constituting the start of another repetition.
>
> Given this large amount of repetition, it seems to me that there is
> never a need to scan more than the bottom `M + 1` possible positions [1]
> plus the highest possible position [2] to be sure of finding the very
> best one. In the pathological case that you described above, where `M`
> is 1, only three positions have to be evaluated, not 100.
Nicely analysed.
* Re: fast-import slowness when importing large files with small differences
2018-07-03 16:05 ` Ævar Arnfjörð Bjarmason
@ 2018-07-03 22:38 ` Mike Hommey
0 siblings, 0 replies; 17+ messages in thread
From: Mike Hommey @ 2018-07-03 22:38 UTC (permalink / raw)
To: Ævar Arnfjörð Bjarmason; +Cc: git
On Tue, Jul 03, 2018 at 06:05:16PM +0200, Ævar Arnfjörð Bjarmason wrote:
>
> On Fri, Jun 29 2018, Mike Hommey wrote:
>
> > On Sat, Jun 30, 2018 at 12:10:24AM +0200, Ævar Arnfjörð Bjarmason wrote:
> >>
> >> On Fri, Jun 29 2018, Mike Hommey wrote:
> >>
> >> > I noticed some slowness when fast-importing data from the Firefox mercurial
> >> > repository, where fast-import spends more than 5 minutes importing ~2000
> >> > revisions of one particular file. I reduced a testcase while still
> >> > using real data. One could synthesize data with kind of the same
> >> > properties, but I figured real data could be useful.
> >> >
> >> > To reproduce:
> >> > $ git clone https://gist.github.com/b6b8edcff2005cc482cf84972adfbba9.git foo
> >> > $ git init bar
> >> > $ cd bar
> >> > $ python ../foo/import.py ../foo/data.gz | git fast-import --depth=2000
> >> >
> >> > [...]
> >> > So maybe it would make sense to consolidate the diff code (after all,
> >> > diff-delta.c is an old specialized fork of xdiff). With manual trimming
> >> > of common head and tail, this gets down to 3:33.
> >> >
> >> > I'll also note that Facebook has imported xdiff from the git code base
> >> > into mercurial and improved performance on it, so it might also be worth
> >> > looking at what's worth taking from there.
> >>
> >> It would be interesting to see how this compares with a more naïve
> >> approach of committing every version of this file one at a time into a
> >> new repository (with & without gc.auto=0). Perhaps deltaing as we go is
> >> suboptimal compared to just writing out a lot of redundant data and
> >> repacking it all at once later.
> >
> > "Just" writing 26GB? And that's only one file. If I were to do that for
> > the whole repository, it would yield a > 100GB pack. Instead of < 2GB
> > currently.
>
> To clarify on my terse response. I mean to try this on an isolated test
> case to see to what extent the problem you're describing is unique to
> fast-import, and to what extent it's encountered during "normal" git use
> when you commit all the revisions of that file in succession.
>
> Perhaps the difference between the two would give some hint as to how to
> proceed, or not.
AIUI, git repack will end up creating delta indexes for every blob, so the
problem should exist there too. But because it will be comparing "random"
blobs, it can't take the same kinds of shortcuts as fast-import could:
fast-import only cares about diffing against the last imported
blob. So while fast-import can reduce the amount of work it does by not
creating an index for the common heads and tails of the compared blobs, git
repack can't.
Mike
* [PATCH] xdiff: reduce indent heuristic overhead
2018-07-03 9:15 ` Michael Haggerty
@ 2018-07-27 22:23 ` Stefan Beller
0 siblings, 0 replies; 17+ messages in thread
From: Stefan Beller @ 2018-07-27 22:23 UTC (permalink / raw)
To: mhagger; +Cc: git, jamill, mh, quark, sbeller
Skip searching for a better indent heuristic position once we'd slide a
hunk more than its size. This is the easiest fix proposed in the
analysis[1], in response to a patch that mercurial took for xdiff to
limit the search by a constant. Using a performance test such as:
#!python
open('a', 'w').write(" \n" * 1000000)
open('b', 'w').write(" \n" * 1000001)
This patch reduces the execution of "git diff --no-index a b" from
0.70s to 0.31s. However, limiting the sliding to the size of the diff
hunk, which was proposed as a solution (and which I found easiest to
implement for now), is not optimal for cases like
open('a', 'w').write(" \n" * 1000000)
open('b', 'w').write(" \n" * 2000000)
as then we'd still slide 1000000 times.
In addition to limiting the sliding to the size of the hunk, also limit
it by a constant. Choose 100 lines as the constant, as that is more than
fits on a screen, which really means that the diff sliding is probably
not providing a lot of benefit anyway.
[1] https://public-inbox.org/git/72ac1ac2-f567-f241-41d6-d0f83072e0b3@alum.mit.edu/
Reported-by: Jun Wu <quark@fb.com>
Analysis-by: Michael Haggerty <mhagger@alum.mit.edu>
Signed-off-by: Stefan Beller <sbeller@google.com>
---
> Plus, it should always give the right answer.
I was tempted to do just that, but I caved. The diff is correct,
and the hunk sliding is purely to appease the visual aspect of
humans looking at diffs. If your diff can slide more than a
monitor height, you're not interested in the best-slid diff;
something else is going on.
> There are cases where it will be
> more expensive than a fixed `MAX_BORING`, but I bet on average it will
> be faster.
So I did both, settling for performance as the utmost desire. ;-)
xdiff/xdiffi.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/xdiff/xdiffi.c b/xdiff/xdiffi.c
index 0de1ef463bf..91e98ee9869 100644
--- a/xdiff/xdiffi.c
+++ b/xdiff/xdiffi.c
@@ -591,6 +591,11 @@ static void measure_split(const xdfile_t *xdf, long split,
  */
 #define INDENT_WEIGHT 60
 
+/*
+ * How far do we slide a hunk at most?
+ */
+#define INDENT_HEURISTIC_MAX_SLIDING 100
+
 /*
  * Compute a badness score for the hypothetical split whose measurements are
  * stored in m. The weight factors were determined empirically using the tools and
@@ -903,7 +908,12 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
 			long shift, best_shift = -1;
 			struct split_score best_score;
 
-			for (shift = earliest_end; shift <= g.end; shift++) {
+			shift = earliest_end;
+			if (g.end - groupsize - 1 > shift)
+				shift = g.end - groupsize - 1;
+			if (g.end - INDENT_HEURISTIC_MAX_SLIDING > shift)
+				shift = g.end - INDENT_HEURISTIC_MAX_SLIDING;
+			for (; shift <= g.end; shift++) {
 				struct split_measurement m;
 				struct split_score score = {0, 0};
--
2.18.0.345.g5c9ce644c3-goog