git@vger.kernel.org mailing list mirror (one of many)
From: Mike Hommey <mh@glandium.org>
To: git@vger.kernel.org
Subject: fast-import slowness when importing large files with small differences
Date: Fri, 29 Jun 2018 18:44:13 +0900
Message-ID: <20180629094413.bgltep6ntlza6vhz@glandium.org>

Hi,

I noticed some slowness when fast-importing data from the Firefox mercurial
repository, where fast-import spends more than 5 minutes importing ~2000
revisions of one particular file. I reduced this to a testcase that still
uses real data. One could synthesize data with similar properties, but I
figured real data would be more useful.

To reproduce:
$ git clone https://gist.github.com/b6b8edcff2005cc482cf84972adfbba9.git foo
$ git init bar
$ cd bar
$ python ../foo/import.py ../foo/data.gz | git fast-import --depth=2000

(--depth=2000 to minimize the pack size)
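For readers unfamiliar with the stream format, here is a minimal sketch of
what a blob-only fast-import stream looks like. This is not the import.py
from the gist; blob_command and write_stream are hypothetical helper names,
and the only thing assumed about the real script is that it emits blob
commands (which the statistics below suggest, since no commits or trees are
created).

```python
import sys


def blob_command(data: bytes) -> bytes:
    """Encode one fast-import 'blob' command with an exact byte count.

    Hypothetical helper, not taken from the gist's import.py.
    """
    # 'data <count>' is followed by exactly <count> raw bytes and an
    # optional trailing LF, which is emitted here for readability.
    return b"blob\ndata %d\n" % len(data) + data + b"\n"


def write_stream(revisions, out=None):
    """Write one blob command per revision to a binary stream."""
    out = out if out is not None else sys.stdout.buffer
    for content in revisions:
        out.write(blob_command(content))
```

Piping the output of write_stream into git fast-import would store each
revision as a blob, with consecutive blobs being candidates for
deltification against one another.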

The python script doesn't have much overhead:
$ time python ../foo/import.py ../foo/data.gz > /dev/null

real	0m14.564s
user	0m9.813s
sys	0m4.703s

It generates about 26GB of data from that 4.2MB data.gz.

$ python ../foo/import.py ../foo/data.gz | time git fast-import --depth=2000
git-fast-import statistics:
---------------------------------------------------------------------
Alloc'd objects:       5000
Total objects:         1868 (       133 duplicates                  )
      blobs  :         1868 (       133 duplicates       1867 deltas of       1868 attempts)
      trees  :            0 (         0 duplicates          0 deltas of          0 attempts)
      commits:            0 (         0 duplicates          0 deltas of          0 attempts)
      tags   :            0 (         0 duplicates          0 deltas of          0 attempts)
Total branches:           0 (         0 loads     )
      marks:           1024 (         0 unique    )
      atoms:              0
Memory total:          2282 KiB
       pools:          2048 KiB
     objects:           234 KiB
---------------------------------------------------------------------
pack_report: getpagesize()            =       4096
pack_report: core.packedGitWindowSize = 1073741824
pack_report: core.packedGitLimit      = 35184372088832
pack_report: pack_used_ctr            =          0
pack_report: pack_mmap_calls          =          0
pack_report: pack_open_windows        =          0 /          0
pack_report: pack_mapped              =          0 /          0
---------------------------------------------------------------------

321.61user 6.60system 5:50.08elapsed 93%CPU (0avgtext+0avgdata 83192maxresident)k
0inputs+10568outputs (0major+38689minor)pagefaults 0swaps

(The resulting pack is 5.3MB, fwiw)

Obviously, sha1'ing 26GB is not going to be free, but it's also not the
dominating cost, according to perf:

    63.52%  git-fast-import  git-fast-import     [.] create_delta_index
    17.46%  git-fast-import  git-fast-import     [.] sha1_compression_states
     9.89%  git-fast-import  git-fast-import     [.] ubc_check
     6.23%  git-fast-import  git-fast-import     [.] create_delta
     2.49%  git-fast-import  git-fast-import     [.] sha1_process

That's a whole lot of time spent on create_delta_index.

FWIW, if deltas were 100% free (yes, I tested that), the fast-import would
take 1:40 with the following profile:

    58.74%  git-fast-import  git-fast-import     [.] sha1_compression_states
    32.45%  git-fast-import  git-fast-import     [.] ubc_check
     8.25%  git-fast-import  git-fast-import     [.] sha1_process
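As a rough cross-check, the first profile and the time(1) output above can
be combined to estimate how much CPU time is left once the delta functions
are taken out. This is only a back-of-the-envelope sketch: it assumes the
perf percentages are shares of the total user+system CPU time, which is an
approximation.

```python
# Figures copied from the time(1) output and the first perf profile.
user_s = 321.61
sys_s = 6.60
create_delta_index_pct = 63.52
create_delta_pct = 6.23

cpu_s = user_s + sys_s                                # total CPU seconds
delta_pct = create_delta_index_pct + create_delta_pct # share spent on deltas
remaining_s = cpu_s * (1 - delta_pct / 100)           # CPU time without deltas

print(f"delta share: {delta_pct:.2f}%")
print(f"estimated CPU time without deltas: {remaining_s:.0f}s")
```

Under that assumption the estimate comes out close to the 1:40 measured
with deltas disabled, so the two measurements are consistent with each
other.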

I toyed with the idea of eliminating common head and tail before
creating the delta, and got some promising results: a fast-import taking
3:22 instead of 5:50, with the following profile:

    34.67%  git-fast-import  git-fast-import     [.] create_delta_index
    30.88%  git-fast-import  git-fast-import     [.] sha1_compression_states
    17.15%  git-fast-import  git-fast-import     [.] ubc_check
     7.25%  git-fast-import  git-fast-import     [.] store_object
     4.47%  git-fast-import  git-fast-import     [.] sha1_process
     2.72%  git-fast-import  git-fast-import     [.] create_delta2

The resulting pack is however much larger (for some reason, many objects
are left non-deltaed), and the deltas are partly broken (they don't
apply cleanly), but that just means the code is not ready to be sent. I
don't expect working code would be much slower than this. The remaining
question is whether this is beneficial for more normal cases.
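To illustrate the idea (not the actual C patch, which is not shown here),
a minimal sketch of common head and tail trimming could look like the
following. trim_common is a hypothetical name; a real implementation would
also have to offset the copy instructions in the resulting delta by the
length of the trimmed head.

```python
def trim_common(src: bytes, dst: bytes):
    """Strip the common prefix and suffix shared by two buffers.

    Returns (head_len, tail_len, src_middle, dst_middle). Only the two
    middles would need to go through the expensive delta indexing.
    """
    limit = min(len(src), len(dst))

    # Length of the common head.
    head = 0
    while head < limit and src[head] == dst[head]:
        head += 1

    # Length of the common tail, never overlapping the head.
    tail = 0
    while tail < limit - head and src[-1 - tail] == dst[-1 - tail]:
        tail += 1

    return head, tail, src[head:len(src) - tail], dst[head:len(dst) - tail]
```

For two large revisions that differ only in a small region, the middles
are tiny compared to the full buffers, which is where the savings in
create_delta_index would come from.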

I also seem to remember from testing a while ago that xdiff somehow
handles those files faster than diff-delta, and I'm wondering if it
would make sense to make the pack code use xdiff. So I tested
replacing diff_delta with a call to xdi_diff_outf with a callback that
does nothing and zeroed out xpparam_t and xdemitconf_t (not sure that's
best, though, I haven't looked very deeply), and that finished in 5:15
with the following profile (without common head trimming;
xdiff-interface apparently does common tail trimming):

    32.99%  git-fast-import  git-fast-import     [.] xdl_prepare_ctx.isra.0
    20.42%  git-fast-import  git-fast-import     [.] sha1_compression_states
    15.26%  git-fast-import  git-fast-import     [.] xdl_hash_record
    11.65%  git-fast-import  git-fast-import     [.] ubc_check
     3.09%  git-fast-import  git-fast-import     [.] xdl_recs_cmp
     3.03%  git-fast-import  git-fast-import     [.] sha1_process
     2.91%  git-fast-import  git-fast-import     [.] xdl_prepare_env

So maybe it would make sense to consolidate the diff code (after all,
diff-delta.c is an old specialized fork of xdiff). With manual trimming
of common head and tail, this gets down to 3:33.

I'll also note that Facebook has imported xdiff from the git code base
into mercurial and improved performance on it, so it might also be worth
looking at what's worth taking from there.

Cheers,

Mike
