From: Mike Hommey <mh@glandium.org>
To: git@vger.kernel.org
Subject: fast-import slowness when importing large files with small differences
Date: Fri, 29 Jun 2018 18:44:13 +0900
Message-ID: <20180629094413.bgltep6ntlza6vhz@glandium.org>
Hi,
I noticed some slowness when fast-importing data from the Firefox
Mercurial repository: fast-import spends more than 5 minutes importing
~2000 revisions of one particular file. I reduced a test case that
still uses real data. One could synthesize data with roughly the same
properties, but I figured real data would be more useful.
To reproduce:
$ git clone https://gist.github.com/b6b8edcff2005cc482cf84972adfbba9.git foo
$ git init bar
$ cd bar
$ python ../foo/import.py ../foo/data.gz | git fast-import --depth=2000
(--depth=2000 to minimize the pack size)
The python script doesn't have much overhead:
$ time python ../foo/import.py ../foo/data.gz > /dev/null
real 0m14.564s
user 0m9.813s
sys 0m4.703s
It generates about 26GB of data from that 4.2MB data.gz.
$ python ../foo/import.py ../foo/data.gz | time git fast-import --depth=2000
git-fast-import statistics:
---------------------------------------------------------------------
Alloc'd objects: 5000
Total objects: 1868 ( 133 duplicates )
blobs : 1868 ( 133 duplicates 1867 deltas of 1868 attempts)
trees : 0 ( 0 duplicates 0 deltas of 0 attempts)
commits: 0 ( 0 duplicates 0 deltas of 0 attempts)
tags : 0 ( 0 duplicates 0 deltas of 0 attempts)
Total branches: 0 ( 0 loads )
marks: 1024 ( 0 unique )
atoms: 0
Memory total: 2282 KiB
pools: 2048 KiB
objects: 234 KiB
---------------------------------------------------------------------
pack_report: getpagesize() = 4096
pack_report: core.packedGitWindowSize = 1073741824
pack_report: core.packedGitLimit = 35184372088832
pack_report: pack_used_ctr = 0
pack_report: pack_mmap_calls = 0
pack_report: pack_open_windows = 0 / 0
pack_report: pack_mapped = 0 / 0
---------------------------------------------------------------------
321.61user 6.60system 5:50.08elapsed 93%CPU (0avgtext+0avgdata 83192maxresident)k
0inputs+10568outputs (0major+38689minor)pagefaults 0swaps
(The resulting pack is 5.3MB, fwiw)
Obviously, sha1'ing 26GB is not going to be free, but it's also not the
dominating cost, according to perf:
63.52% git-fast-import git-fast-import [.] create_delta_index
17.46% git-fast-import git-fast-import [.] sha1_compression_states
9.89% git-fast-import git-fast-import [.] ubc_check
6.23% git-fast-import git-fast-import [.] create_delta
2.49% git-fast-import git-fast-import [.] sha1_process
That's a whole lot of time spent on create_delta_index.
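For context, create_delta_index (in diff-delta.c) rebuilds, for every
new blob, a hash table over fixed-size windows of the reference blob,
so its cost scales with the full multi-megabyte size of each revision
rather than with the small difference between revisions. A toy
equivalent, with a plain dict and Python's hash() standing in for the
real Rabin-fingerprint table:

```python
BLOCK = 16  # diff-delta indexes the reference in 16-byte windows

def build_index(source: bytes) -> dict:
    # Map a hash of each 16-byte-aligned block to the offsets where
    # it occurs. diff-delta uses a Rabin fingerprint; hash() stands
    # in for it here. The cost is O(len(source)), and it is paid
    # again for every new blob that deltas against `source`.
    index = {}
    for off in range(0, len(source) - BLOCK + 1, BLOCK):
        index.setdefault(hash(source[off:off + BLOCK]), []).append(off)
    return index
```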
FWIW, if the delta were 100% free (yes, I tested that), the
fast-import would take 1:40, with the following profile:
58.74% git-fast-import git-fast-import [.] sha1_compression_states
32.45% git-fast-import git-fast-import [.] ubc_check
8.25% git-fast-import git-fast-import [.] sha1_process
I toyed with the idea of eliminating the common head and tail before
creating the delta, and got some promising results: a fast-import
taking 3:22 instead of 5:50, with the following profile:
34.67% git-fast-import git-fast-import [.] create_delta_index
30.88% git-fast-import git-fast-import [.] sha1_compression_states
17.15% git-fast-import git-fast-import [.] ubc_check
7.25% git-fast-import git-fast-import [.] store_object
4.47% git-fast-import git-fast-import [.] sha1_process
2.72% git-fast-import git-fast-import [.] create_delta2
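The trimming itself is two cheap linear scans from the ends of the
buffers; a minimal sketch of what "eliminating common head and tail"
means here (the real change would then index and delta only the
differing middle, which is not shown):

```python
def trim_common(a: bytes, b: bytes):
    # Return (head, tail): the number of identical leading and
    # trailing bytes shared by a and b, with head counted first so
    # the two regions never overlap.
    n = min(len(a), len(b))
    head = 0
    while head < n and a[head] == b[head]:
        head += 1
    tail = 0
    while tail < n - head and a[-1 - tail] == b[-1 - tail]:
        tail += 1
    return head, tail
```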
The resulting pack is, however, much larger (for some reason, many
objects are left non-deltified), and the deltas are partly broken
(they don't apply cleanly), but that just means the code is not ready
to be sent. I don't expect working code to be much slower than this.
The remaining question is whether this is beneficial in more normal
cases.
I also seem to remember, from testing a while ago, that xdiff somehow
handles those files faster than diff-delta, and I'm wondering whether
it would make sense to make the pack code use xdiff. So I tested
replacing diff_delta with a call to xdi_diff_outf with a callback that
does nothing, and with xpparam_t and xdemitconf_t zeroed out (not sure
that's best, though; I haven't looked very deeply), and that finished
in 5:15 with the following profile (without common head trimming;
xdiff-interface apparently does common tail trimming):
32.99% git-fast-import git-fast-import [.] xdl_prepare_ctx.isra.0
20.42% git-fast-import git-fast-import [.] sha1_compression_states
15.26% git-fast-import git-fast-import [.] xdl_hash_record
11.65% git-fast-import git-fast-import [.] ubc_check
3.09% git-fast-import git-fast-import [.] xdl_recs_cmp
3.03% git-fast-import git-fast-import [.] sha1_process
2.91% git-fast-import git-fast-import [.] xdl_prepare_env
So maybe it would make sense to consolidate the diff code (after all,
diff-delta.c is an old specialized fork of xdiff). With manual trimming
of common head and tail, this gets down to 3:33.
I'll also note that Facebook has imported xdiff from the git code base
into mercurial and improved performance on it, so it might also be worth
looking at what's worth taking from there.
Cheers,
Mike
Thread overview: 17+ messages
2018-06-29 9:44 Mike Hommey [this message]
2018-06-29 20:14 ` fast-import slowness when importing large files with small differences Stefan Beller
2018-06-29 20:28 ` [PATCH] xdiff: reduce indent heuristic overhead Stefan Beller
2018-06-29 21:17 ` Junio C Hamano
2018-06-29 23:37 ` Stefan Beller
2018-06-30 1:11 ` Jun Wu
2018-07-01 15:57 ` Michael Haggerty
2018-07-02 17:27 ` Stefan Beller
2018-07-03 9:15 ` Michael Haggerty
2018-07-27 22:23 ` Stefan Beller
2018-07-03 18:14 ` Junio C Hamano
2018-06-29 20:39 ` fast-import slowness when importing large files with small differences Jeff King
2018-06-29 20:51 ` Stefan Beller
2018-06-29 22:10 ` Ævar Arnfjörð Bjarmason
2018-06-29 23:35 ` Mike Hommey
2018-07-03 16:05 ` Ævar Arnfjörð Bjarmason
2018-07-03 22:38 ` Mike Hommey