git@vger.kernel.org mailing list mirror (one of many)
From: Marius Storm-Olsen <mstormo@gmail.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Git Mailing List <git@vger.kernel.org>
Subject: Re: Delta compression not so effective
Date: Wed, 1 Mar 2017 11:57:27 -0600
Message-ID: <eba83461-34cf-6d64-4013-873b04af9b82@gmail.com>
In-Reply-To: <CA+55aFzQ0o2R2kShS=AuKu0TLnfPV-0JCkViqx5J_afCK0Yt5g@mail.gmail.com>

On 3/1/2017 11:36, Linus Torvalds wrote:
> On Wed, Mar 1, 2017 at 5:51 AM, Marius Storm-Olsen <mstormo@gmail.com> wrote:
>>
>> When first importing, I disabled gc to avoid any repacking until completed.
>> When done importing, there was 209GB of all loose objects (~670k files).
>> With the hopes of quick consolidation, I did a
>>     git -c gc.autoDetach=0 -c gc.reflogExpire=0 \
>>           -c gc.reflogExpireUnreachable=0 -c gc.rerereresolved=0 \
>>           -c gc.rerereunresolved=0 -c gc.pruneExpire=now \
>>           gc --prune
>> which brought it down to 206GB in a single pack. I then ran
>>     git repack -a -d -F --window=350 --depth=250
>> which took it down to 203GB, where I'm at right now.
>
> Considering that it was 209GB in loose objects, I don't think it
> delta-packed the big objects at all.
>
> I wonder if the big objects end up hitting some size limit that causes
> the delta creation to fail.

You're likely on to something here.
I just ran
     git verify-pack --verbose \
         objects/pack/pack-9473815bc36d20fbcd38021d7454fbe09f791931.idx \
         | sort -k3n | tail -n15
and got no blobs with deltas in them.
   feb35d6dc7af8463e038c71cc3893d163d47c31c blob   36841958 36461935 3259424358
   007b65e603cdcec6644ddc25c2a729a394534927 blob   36845345 36462120 3341677889
   0727a97f68197c99c63fcdf7254e5867f8512f14 blob   37368646 36983862 3677338718
   576ce2e0e7045ee36d0370c2365dc730cb435f40 blob   37399203 37014740 3639613780
   7f6e8b22eed5d8348467d9b0180fc4ae01129052 blob   125296632 83609223 5045853543
   014b9318d2d969c56d46034a70223554589b3dc4 blob   170113524 6124878 1118227958
   22d83cb5240872006c01651eb1166c8db62c62d8 blob   170113524 65941491 1257435955
   292ac84f48a3d5c4de8d12bfb2905e055f9a33b1 blob   170113524 67770601 1323377446
   2b9329277e379dfbdcd0b452b39c6b0bf3549005 blob   170113524 7656690 1110571268
   37517efb4818a15ad7bba79b515170b3ee18063b blob   170113524 133083119 1124352836
   55a4a70500eb3b99735677d0025f33b1bb78624a blob   170113524 6592386 1398975989
   e669421ea5bf2e733d5bf10cf505904d168de749 blob   170113524 7827942 1391148047
   e9916da851962265a9d5b099e72f60659a74c144 blob   170113524 73514361 966299538
   f7bf1313752deb1bae592cc7fc54289aea87ff19 blob   170113524 70756581 1039814687
   8afc6f2a51f0fa1cc4b03b8d10c70599866804ad blob   248959314 237612609 606692699

In fact, I don't see a single "deltified" blob until the 6355th-last line!
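
For anyone wanting to repeat this check without eyeballing the listing,
here is a minimal Python sketch that counts deltified blobs in
'git verify-pack --verbose' output. It relies on the documented layout:
non-delta objects have five fields (name, type, size, size in pack,
offset) and deltified objects have two more (depth, base name). The
names PackEntry, parse_verify_pack and count_deltified_blobs are made up
for illustration, and the second SAMPLE line is a hypothetical deltified
entry, not data from my pack.

```python
from typing import List, NamedTuple, Optional, Tuple


class PackEntry(NamedTuple):
    name: str
    kind: str
    size: int
    size_in_pack: int
    offset: int
    depth: Optional[int]  # None for non-delta objects
    base: Optional[str]   # None for non-delta objects


def _is_object_name(field: str) -> bool:
    # SHA-1 names are 40 hex digits, SHA-256 names are 64.
    return len(field) in (40, 64) and all(
        c in "0123456789abcdef" for c in field
    )


def parse_verify_pack(text: str) -> List[PackEntry]:
    """Parse object lines, skipping summary lines such as 'non delta: ...'."""
    entries = []
    for line in text.splitlines():
        f = line.split()
        if len(f) not in (5, 7) or not _is_object_name(f[0]):
            continue
        depth = int(f[5]) if len(f) == 7 else None
        base = f[6] if len(f) == 7 else None
        entries.append(
            PackEntry(f[0], f[1], int(f[2]), int(f[3]), int(f[4]), depth, base)
        )
    return entries


def count_deltified_blobs(entries: List[PackEntry]) -> Tuple[int, int]:
    """Return (deltified blobs, total blobs)."""
    blobs = [e for e in entries if e.kind == "blob"]
    return sum(1 for e in blobs if e.base is not None), len(blobs)


# First line is from the listing above; the second is a HYPOTHETICAL
# deltified entry added only to exercise the seven-field case.
SAMPLE = """\
8afc6f2a51f0fa1cc4b03b8d10c70599866804ad blob   248959314 237612609 606692699
1111111111111111111111111111111111111111 blob   1024 200 12 1 8afc6f2a51f0fa1cc4b03b8d10c70599866804ad
non delta: 1 object
"""

if __name__ == "__main__":
    deltified, total = count_deltified_blobs(parse_verify_pack(SAMPLE))
    print(f"{deltified} of {total} blobs are deltified")
```

In practice the verify-pack output would be piped into the script
instead of using SAMPLE.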


> For example, we have that HASH_LIMIT  that limits how many hashes
> we'll create for the same hash bucket, because there's some quadratic
> behavior in the delta algorithm. It triggered with things like big
> files that have lots of repeated content.
>
> We also have various memory limits, in particular
> 'window_memory_limit'. That one should default to 0, but maybe you
> limited it at some point in a config file and forgot about it?

Indeed, I did do a
     -c pack.threads=20 --window-memory=6g
to 'git repack', since the machine is a 20-core (40 threads) machine 
with 126GB of RAM.

So I guess with objects of this size, even 6GB per thread is not 
enough to get a big enough window for proper delta-packing?

This repo took >14hr to repack on 20 threads though (the "compression" 
step was very fast, but it was stuck in "writing objects" for 95% of the 
time), so I can only imagine how long pack.threads=1 will take :)

But aren't the blobs sorted by some metric for reasonable delta-pack 
locality, so even with a 6GB window it should have seen ~25 similar 
objects to deltify against?
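
The ~25 figure is just the window memory divided by the largest blob
size. A rough Python sketch of that arithmetic, assuming git's usual
size suffixes (k, m, g as powers of 1024) and counting only raw object
sizes; whatever else pack-objects charges against the limit would make
the real number smaller. The helper names parse_git_size and
objects_in_window are invented for illustration.

```python
def parse_git_size(value: str) -> int:
    """Convert a git-style size such as '6g' into bytes (k/m/g = 1024-based)."""
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    v = value.strip().lower()
    if v and v[-1] in units:
        return int(v[:-1]) * units[v[-1]]
    return int(v)


def objects_in_window(window_memory: str, object_size: int,
                      window: int = 350) -> int:
    """Upper bound on how many objects of object_size fit in the window.

    A window memory of 0 means no memory limit, so only the object
    count given by --window applies.
    """
    limit = parse_git_size(window_memory)
    if limit == 0:
        return window
    return min(window, limit // object_size)


if __name__ == "__main__":
    # Sizes taken from the verify-pack listing in this mail.
    for size in (248959314, 170113524, 37399203):
        print(size, objects_in_window("6g", size))
```

For the 248959314-byte blob this gives 25, and for the 170113524-byte
blobs it gives 37, so by this estimate the window was not down to a
single object.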


-- 
.marius
