git@vger.kernel.org mailing list mirror (one of many)
From: Marius Storm-Olsen <mstormo@gmail.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Git Mailing List <git@vger.kernel.org>
Subject: Re: Delta compression not so effective
Date: Sat, 4 Mar 2017 02:27:00 -0600
Message-ID: <9961a973-0d5d-5ff9-ab78-eea07bdb5dbf@gmail.com>
In-Reply-To: <CA+55aFxxQUixAJWXkUgVvDNCHD4LuYYuQRTE7dJ_OZTo9Gxqew@mail.gmail.com>

On 3/1/2017 18:43, Linus Torvalds wrote:
>> So, this repo must be knocking several parts of Git's insides. I was curious
>> about why it was so slow on the writing objects part, since the whole repo
>> is on a 4x RAID 5, 7k spindles. Now, they are not SSDs sure, but the thing
>> has ~400MB/s continuous throughput available.
>>
>> iostat -m 5 showed trickle read/write to the process, and 80-100% CPU single
>> thread (since the "write objects" stage is single threaded, obviously).
>
> So the writing phase isn't multi-threaded because it's not expected to
> matter. But if you can't even generate deltas, you aren't just
> *writing* much more data, you're compressing all that data with zlib
> too.
>
> So even with a fast disk subsystem, you won't even be able to saturate
> the disk, simply because the compression will be slower (and
> single-threaded).

I did a simple
     $ time zip -r repo.zip repo/
...
     total bytes=219353596620, compressed=214310715074 -> 2% savings

     real    154m6.323s
     user    133m5.209s
     sys     5m5.338s

also using a single thread and the same disk as git repack. Comparing
it to the numbers below, that is 2.6 hrs with zip vs. 14.2 hrs with
repack (about 1:5.5). So it can't just be the overhead of having to
compress the full blobs for lack of deltas.
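
For reference, the arithmetic behind that ratio can be sketched as 
follows. This is only a minimal sketch: to_minutes is a made-up helper 
name, and the inputs are the "real" times from the zip run above and the 
repack run below.

```shell
# to_minutes is a hypothetical helper (not part of any tool):
# $1 = minutes, $2 = seconds, as printed by `time` on the "real" line.
to_minutes() {
    awk -v m="$1" -v s="$2" 'BEGIN { printf "%.3f\n", m + s / 60 }'
}

zip_min=$(to_minutes 154 6.323)      # real time of `zip -r`
repack_min=$(to_minutes 850 3.473)   # real time of `git repack`

# Print both durations in hours plus the zip:repack ratio.
awk -v z="$zip_min" -v r="$repack_min" 'BEGIN {
    printf "zip: %.1f h, repack: %.1f h, ratio 1:%.1f\n", z / 60, r / 60, r / z
}'
```

The same run also gives the zip savings: 1 - 214310715074/219353596620 
is roughly 2.3%, i.e. the blobs barely compress as a plain archive.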


>> Filenames are fairly static, and the bulk of the 6000 biggest non-delta'ed
>> blobs are the same DLLs (multiple of them)
>
> I think the first thing you should test is to repack with fewer
> threads, and a bigger pack window. Do something like
>
>   -c pack.threads=4 --window-memory=30g
>
> instead. Just to see if that starts finding deltas.

I reran the repack with the options above (dropping the zlib=9, as you 
suggested)

     $ time git -c pack.threads=4 repack -a -d -F \
                --window=350 --depth=250 --window-memory=30g

     Delta compression using up to 4 threads.
     Compressing objects:   100% (609413/609413)
     Writing objects: 100% (666515/666515), done.
     Total 666515 (delta 499585), reused 0 (delta 0)

     real	850m3.473s
     user	897m36.280s
     sys 	10m8.824s

and ended up with
     $ du -sh .
     205G	.

In other words, going from 6G to 30G window didn't help a lick on 
finding deltas for those binaries. (205G was what I had with the 
non-aggressive 'git gc', before zlib=9 repack.)

BUT, oddly enough, even though the new size is almost identical to the 
previous version without zlib=9,
     $ git verify-pack --verbose \
           objects/pack/pack-29b06ae4d458ac03efd98b330702d30e851b2933.idx |
           sort -k3n | tail -n15
gives me a VERY different list than before

   17e5b2146311256dc8317d6e0ed1291363c31a76 blob   673399562 110248747 190398904084
   04c881d9069eab3bd0d50dd48a047a60f79cc415 blob   673863358 111710559 188818868865
   fdcabd75aeda86ce234d6e43b54d27d993acddcd blob   674523614 111956017 185706433825
   d8815033d1b00b151ae762be8a69ffa35f55c4b4 blob   675286758 112099638 185153570292
   997e0b9d3bcf440af10c7bbe535a597ca46c492c blob   678274978 112654668 184041692883
   dfed141679e5c33caaa921cbe1595a24967a3c2c blob   681692132 113121410 186753502634
   76a4000e71cd5b85f2265e02eb876acf1f33cc55 blob   682673430 112743915 184563542298
   81e7292c4d2da2d2d236fbfaa572b6c4e8d787f4 blob   684543130 112797325 181805773038
   991184c60e1fc6b2721bf40f181012b72b10d02d blob   684543130 112796892 182344388066
   0e9269f4abd1440addd05d4f964c96d74d11cd89 blob   684547270 112809074 181070719237
   6019b6d09759cf5adeac678c8b56d177803a0486 blob   684547270 112809336 180517242193
   70a5f70bd205329472d6f9c660eb3f7d207a596e blob   686852038 112873611 183520467528
   e86a0064d9652be9f5e3a877b11a665f64198ecd blob   686852038 112874133 182893219377
   bae8de0555be5b1ffa0988cbc6cba698f6745c26 blob   894041802 137223252 2355250324
   94dc773600e03ac1e6f3ab077b70b8297325ad77 blob   945197364 145219485 16560137220

compared to the last 3 entries of the previous pack
   e9916da851962265a9d5b099e72f60659a74c144 blob   170113524 73514361 966299538
   f7bf1313752deb1bae592cc7fc54289aea87ff19 blob   170113524 70756581 1039814687
   8afc6f2a51f0fa1cc4b03b8d10c70599866804ad blob   248959314 237612609 606692699
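
To make the two listings easier to compare, here is a rough sketch that 
turns verify-pack lines into "packed size as a percentage of full size". 
It assumes the five-column layout that 'git verify-pack -v' prints for 
non-delta objects (object name, type, size, size in pack, offset); 
pack_ratio is a made-up helper name, and the two sample lines are the 
last entry of each listing above.

```shell
# pack_ratio is a hypothetical helper: it reads `git verify-pack -v`
# lines on stdin and, for non-delta blobs (exactly five fields), prints
# the abbreviated object name and packed size as a percentage of the
# uncompressed size.
pack_ratio() {
    awk '$2 == "blob" && NF == 5 {
        printf "%s %.1f%%\n", substr($1, 1, 8), 100 * $4 / $3
    }'
}

# Last entry of the new pack, then last entry of the previous pack.
pack_ratio <<'EOF'
94dc773600e03ac1e6f3ab077b70b8297325ad77 blob   945197364 145219485 16560137220
8afc6f2a51f0fa1cc4b03b8d10c70599866804ad blob   248959314 237612609 606692699
EOF
```

On these two lines the biggest blob in the new pack deflates to roughly 
15% of its size, while the biggest one in the previous pack stayed at 
about 95%. In real use one would pipe the full verify-pack output into 
pack_ratio instead of the here-document.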


> So the first thing you might want to do is to just print out the
> objects after sorting them, and before it starts trying to find
> deltas.
...
> and notice that QSORT() line: that's what sorts the objects. You can
> do something like
>
>                 for (i = 0; i < n; i++)
>                         show_object_entry_details(delta_list[i]);

I did
     fprintf(stderr, "%s %u %lu\n",
             sha1_to_hex(delta_list[i]->idx.sha1),
             delta_list[i]->hash,
             delta_list[i]->size);

I assume that's correct?


> In fact, if your data is not *so* sensitive, and you're ok with making
> the one-line commit logs and the filenames public, you could make just
> those things available, and maybe I'll have time to look at it.

I've removed all commit messages, and "sanitized" some filepaths etc, so 
name hashes won't match what's reported, but that should be fine. (the 
object_entry->hash seems to be just a trivial uint32 hash for sorting 
anyways)

I really don't want the files on the mailing list, so I'll send you a 
link directly. However, small snippets for public discussions about 
potential issues would be fine, obviously.

BUT, if I look at the last 3 entries of the sorted git verify-pack 
output, and look for them in the 'git log --oneline --raw -R 
--abbrev=40' output, I get:
  :100644 100644 991184c60e1fc6b2721bf40f181012b72b10d02d e86a0064d9652be9f5e3a877b11a665f64198ecd M extern/win/FlammableV3/x64/lib/FlameProxyLibD.lib
  :100644 000000 bae8de0555be5b1ffa0988cbc6cba698f6745c26 0000000000000000000000000000000000000000 D extern/win/gdal-2.0.0/lib/x64/Debug/libgdal.lib
  :000000 100644 0000000000000000000000000000000000000000 94dc773600e03ac1e6f3ab077b70b8297325ad77 A extern/win/gdal-2.0.0/lib/x64/Debug/gdal.lib

while I cannot find ANY of them in the delta_list output. Shouldn't 
delta_list contain all objects, sorted by some heuristics? Or is 
delta_list already limited by some other metric at this point, before 
the QSORT?
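
The cross-check itself can be sketched roughly like this. It assumes the 
debug lines printed to stderr were captured in a file with the object 
name as the first space-separated field (as in the fprintf above); 
in_delta_list and both file names are made up for illustration.

```shell
# in_delta_list is a hypothetical helper.
#   $1 = file with one object name per line (e.g. the biggest blobs)
#   $2 = file holding the captured debug output, object name first
# For each name it reports whether a line in $2 starts with that name.
in_delta_list() {
    while read -r oid; do
        if grep -q "^$oid " "$2"; then
            echo "$oid present"
        else
            echo "$oid missing"
        fi
    done < "$1"
}

# Possible usage (file names are placeholders):
#   git verify-pack -v <pack>.idx | sort -k3n | tail -n3 |
#       cut -d' ' -f1 > top.txt
#   in_delta_list top.txt delta_list.txt
```

Anchoring the pattern at the start of the line avoids accidental matches 
against the hash or size columns of the debug output.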

Also note that 'git log --oneline --raw -R --abbrev=40' only gave me 
the log for trunk, so the second-to-last object must have been added 
on a branch and deleted on trunk; that is why I could only see the 
deletion of that object in the output.


You might get an idea for how to easily create a repo which reproduces 
the issue, and which would highlight it more easily for the ML.

I was thinking of maybe scripting up
     make install prefix=extern
for each Git release, and rewrite trunk history with extern/ binary 
commits at the time of each tag; maybe that would show the same 
behavior? But then again, most of the binaries are just copies of each 
other, and only ~10M, so probably not a big win.


Thanks!

-- 
.marius
