git@vger.kernel.org mailing list mirror (one of many)
* Delta compression not so effective
@ 2017-03-01 13:51 Marius Storm-Olsen
  2017-03-01 16:06 ` Junio C Hamano
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Marius Storm-Olsen @ 2017-03-01 13:51 UTC (permalink / raw)
  To: git

I have just converted an SVN repo to Git (using SubGit), where I feel 
delta compression has let me down :)

Suffice it to say, this is a "traditional" SVN repo, with an extern/ 
blown out of proportion with many binary check-ins. BUT, even still, I 
would expect Git's delta compression to be quite effective, compared to 
the compression present in SVN. In this case however, the Git repo ends 
up being 46% larger than the SVN DB.

Details - SVN:
     Commits: 32988
     DB (server) size: 139GB
     Branches: 103
     Tags: 1088

Details - Git:
     $ git count-objects -v
       count: 0
       size: 0
       in-pack: 666515
       packs: 1
       size-pack: 211933109
       prune-packable: 0
       garbage: 0
       size-garbage: 0
     $ du -sh .
       203G    .

     $ java -jar ~/sources/bfg/bfg.jar --delete-folders extern 
--no-blob-protection && \
       git reflog expire --expire=now --all && \
       git gc --prune=now --aggressive
     $ git count-objects -v
       count: 0
       size: 0
       in-pack: 495070
       packs: 1
       size-pack: 5765365
       prune-packable: 0
       garbage: 0
       size-garbage: 0
     $ du -sh .
       5.6G    .

When first importing, I disabled gc to avoid any repacking until 
completed. When done importing, there was 209GB of all loose objects 
(~670k files). With the hopes of quick consolidation, I did a
     git -c gc.autoDetach=0 -c gc.reflogExpire=0 \
           -c gc.reflogExpireUnreachable=0 -c gc.rerereresolved=0 \
           -c gc.rerereunresolved=0 -c gc.pruneExpire=now \
           gc --prune
which brought it down to 206GB in a single pack. I then ran
     git repack -a -d -F --window=350 --depth=250
which took it down to 203GB, where I'm at right now.

However, this is still miles away from the 139GB in SVN's DB.

Any ideas what's going on, and why my results are so terrible, compared 
to SVN?

Thanks!

-- 
.marius

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Delta compression not so effective
  2017-03-01 13:51 Delta compression not so effective Marius Storm-Olsen
@ 2017-03-01 16:06 ` Junio C Hamano
  2017-03-01 16:17   ` Junio C Hamano
  2017-03-01 17:36 ` Linus Torvalds
  2017-03-01 20:19 ` Martin Langhoff
  2 siblings, 1 reply; 15+ messages in thread
From: Junio C Hamano @ 2017-03-01 16:06 UTC (permalink / raw)
  To: Marius Storm-Olsen; +Cc: Git Mailing List

On Wed, Mar 1, 2017 at 5:51 AM, Marius Storm-Olsen <mstormo@gmail.com> wrote:
> ... which brought it down to 206GB in a single pack. I then ran
>     git repack -a -d -F --window=350 --depth=250
> which took it down to 203GB, where I'm at right now.

Just a hunch. s/F/f/ perhaps?  "-F" does not allow Git to recover from poor
delta-base choice the original importer may have made (and if the original
importer used fast-import, it is known that its choice of the delta-base is
suboptimal).

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Delta compression not so effective
  2017-03-01 16:06 ` Junio C Hamano
@ 2017-03-01 16:17   ` Junio C Hamano
  0 siblings, 0 replies; 15+ messages in thread
From: Junio C Hamano @ 2017-03-01 16:17 UTC (permalink / raw)
  To: Marius Storm-Olsen; +Cc: Git Mailing List

On Wed, Mar 1, 2017 at 8:06 AM, Junio C Hamano <gitster@pobox.com> wrote:

> Just a hunch. s/F/f/ perhaps?  "-F" does not allow Git to recover from poor

Nah, sorry for the noise. Between -F and -f there shouldn't be any difference.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Delta compression not so effective
  2017-03-01 13:51 Delta compression not so effective Marius Storm-Olsen
  2017-03-01 16:06 ` Junio C Hamano
@ 2017-03-01 17:36 ` Linus Torvalds
  2017-03-01 17:57   ` Marius Storm-Olsen
  2017-03-01 20:19 ` Martin Langhoff
  2 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2017-03-01 17:36 UTC (permalink / raw)
  To: Marius Storm-Olsen; +Cc: Git Mailing List

On Wed, Mar 1, 2017 at 5:51 AM, Marius Storm-Olsen <mstormo@gmail.com> wrote:
>
> When first importing, I disabled gc to avoid any repacking until completed.
> When done importing, there was 209GB of all loose objects (~670k files).
> With the hopes of quick consolidation, I did a
>     git -c gc.autoDetach=0 -c gc.reflogExpire=0 \
>           -c gc.reflogExpireUnreachable=0 -c gc.rerereresolved=0 \
>           -c gc.rerereunresolved=0 -c gc.pruneExpire=now \
>           gc --prune
> which brought it down to 206GB in a single pack. I then ran
>     git repack -a -d -F --window=350 --depth=250
> which took it down to 203GB, where I'm at right now.

Considering that it was 209GB in loose objects, I don't think it
delta-packed the big objects at all.

I wonder if the big objects end up hitting some size limit that causes
the delta creation to fail.

For example, we have that HASH_LIMIT  that limits how many hashes
we'll create for the same hash bucket, because there's some quadratic
behavior in the delta algorithm. It triggered with things like big
files that have lots of repeated content.

We also have various memory limits, in particular
'window_memory_limit'. That one should default to 0, but maybe you
limited it at some point in a config file and forgot about it?
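
A quick sanity check (these are just the standard pack config keys, in
case one of them got set and forgotten) would be something like

   git config --get-all pack.windowMemory
   git config --get-all pack.window
   git config --get-all pack.depth

to make sure nothing is limiting the repack behind your back.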

                                     Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Delta compression not so effective
  2017-03-01 17:36 ` Linus Torvalds
@ 2017-03-01 17:57   ` Marius Storm-Olsen
  2017-03-01 18:30     ` Linus Torvalds
  0 siblings, 1 reply; 15+ messages in thread
From: Marius Storm-Olsen @ 2017-03-01 17:57 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

On 3/1/2017 11:36, Linus Torvalds wrote:
> On Wed, Mar 1, 2017 at 5:51 AM, Marius Storm-Olsen <mstormo@gmail.com> wrote:
>>
>> When first importing, I disabled gc to avoid any repacking until completed.
>> When done importing, there was 209GB of all loose objects (~670k files).
>> With the hopes of quick consolidation, I did a
>>     git -c gc.autoDetach=0 -c gc.reflogExpire=0 \
>>           -c gc.reflogExpireUnreachable=0 -c gc.rerereresolved=0 \
>>           -c gc.rerereunresolved=0 -c gc.pruneExpire=now \
>>           gc --prune
>> which brought it down to 206GB in a single pack. I then ran
>>     git repack -a -d -F --window=350 --depth=250
>> which took it down to 203GB, where I'm at right now.
>
> Considering that it was 209GB in loose objects, I don't think it
> delta-packed the big objects at all.
>
> I wonder if the big objects end up hitting some size limit that causes
> the delta creation to fail.

You're likely on to something here.
I just ran
     git verify-pack --verbose objects/pack/pack-9473815bc36d20fbcd38021d7454fbe09f791931.idx |
         sort -k3n | tail -n15
and got no blobs with deltas in them.
   feb35d6dc7af8463e038c71cc3893d163d47c31c blob   36841958 36461935 3259424358
   007b65e603cdcec6644ddc25c2a729a394534927 blob   36845345 36462120 3341677889
   0727a97f68197c99c63fcdf7254e5867f8512f14 blob   37368646 36983862 3677338718
   576ce2e0e7045ee36d0370c2365dc730cb435f40 blob   37399203 37014740 3639613780
   7f6e8b22eed5d8348467d9b0180fc4ae01129052 blob   125296632 83609223 5045853543
   014b9318d2d969c56d46034a70223554589b3dc4 blob   170113524 6124878 1118227958
   22d83cb5240872006c01651eb1166c8db62c62d8 blob   170113524 65941491 1257435955
   292ac84f48a3d5c4de8d12bfb2905e055f9a33b1 blob   170113524 67770601 1323377446
   2b9329277e379dfbdcd0b452b39c6b0bf3549005 blob   170113524 7656690 1110571268
   37517efb4818a15ad7bba79b515170b3ee18063b blob   170113524 133083119 1124352836
   55a4a70500eb3b99735677d0025f33b1bb78624a blob   170113524 6592386 1398975989
   e669421ea5bf2e733d5bf10cf505904d168de749 blob   170113524 7827942 1391148047
   e9916da851962265a9d5b099e72f60659a74c144 blob   170113524 73514361 966299538
   f7bf1313752deb1bae592cc7fc54289aea87ff19 blob   170113524 70756581 1039814687
   8afc6f2a51f0fa1cc4b03b8d10c70599866804ad blob   248959314 237612609 606692699

In fact, I don't see a single "deltified" blob until the 6355th line from the end!
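
For reference, a rough way to count how many blobs got deltified at all,
assuming the usual 'verify-pack -v' column layout (deltified entries carry
two extra columns for delta depth and base object), is something like

     git verify-pack --verbose objects/pack/pack-9473815bc36d20fbcd38021d7454fbe09f791931.idx |
         awk '$2 == "blob" { if (NF >= 7) d++; else f++ }
              END { print d+0 " deltified, " f+0 " full" }'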


> For example, we have that HASH_LIMIT  that limits how many hashes
> we'll create for the same hash bucket, because there's some quadratic
> behavior in the delta algorithm. It triggered with things like big
> files that have lots of repeated content.
>
> We also have various memory limits, in particular
> 'window_memory_limit'. That one should default to 0, but maybe you
> limited it at some point in a config file and forgot about it?

Indeed, I did do a
     -c pack.threads=20 --window-memory=6g
to 'git repack', since the machine is a 20-core (40 threads) machine 
with 126GB of RAM.

So I guess with these sized objects, even at 6GB per thread, it's not 
enough to get a big enough Window for proper delta-packing?

This repo took >14hr to repack on 20 threads though ("compression" step 
was very fast, but stuck 95% of the time in "writing objects"), so I can 
only imagine how long a pack.threads=1 will take :)

But aren't the blobs sorted by some metric for reasonable delta-pack 
locality, so even with a 6GB window it should have seen ~25 similar 
objects to deltify against?


-- 
.marius

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Delta compression not so effective
  2017-03-01 17:57   ` Marius Storm-Olsen
@ 2017-03-01 18:30     ` Linus Torvalds
  2017-03-01 21:08       ` Martin Langhoff
  2017-03-02  0:12       ` Marius Storm-Olsen
  0 siblings, 2 replies; 15+ messages in thread
From: Linus Torvalds @ 2017-03-01 18:30 UTC (permalink / raw)
  To: Marius Storm-Olsen; +Cc: Git Mailing List

On Wed, Mar 1, 2017 at 9:57 AM, Marius Storm-Olsen <mstormo@gmail.com> wrote:
>
> Indeed, I did do a
>     -c pack.threads=20 --window-memory=6g
> to 'git repack', since the machine is a 20-core (40 threads) machine with
> 126GB of RAM.
>
> So I guess with these sized objects, even at 6GB per thread, it's not enough
> to get a big enough Window for proper delta-packing?

Hmm. The 6GB window should be plenty good enough, unless your blobs
are in the gigabyte range too.

> This repo took >14hr to repack on 20 threads though ("compression" step was
> very fast, but stuck 95% of the time in "writing objects"), so I can only
> imagine how long a pack.threads=1 will take :)

Actually, it's usually the compression phase that should be slow - but
if something is limiting finding deltas (so that we abort early), then
that would certainly tend to speed up compression.

The "writing objects" phase should be mainly about the actual IO.
Which should be much faster *if* you actually find deltas.

> But aren't the blobs sorted by some metric for reasonable delta-pack
> locality, so even with a 6GB window it should have seen ~25 similar objects
> to deltify against?

Yes they are. The sorting for delta packing tries to make sure that
the window is effective. However, the sorting is also just a
heuristic, and it may well be that your repository layout ends up
screwing up the sorting, so that the windows just work very badly.

For example, the sorting code thinks that objects with the same name
across the history are good sources of deltas. But it may be that for
your case, the binary blobs that you have don't tend to actually
change in the history, so that heuristic doesn't end up doing
anything.

The sorting does use the size and the type too, but the "filename
hash" (which isn't really a hash, it's something nasty to give
reasonable results for the case where files get renamed) is the main
sort key.

So you might well want to look at the sorting code too. If filenames
(particularly the end of filenames) for the blobs aren't good hints
for the sorting code, that sort might end up spreading all the blobs
out rather than sort them by size.

And again, if that happens, the "can I delta these two objects" code
will notice that the size of the objects are wildly different and
won't even bother trying. Which speeds up the "compressing" phase, of
course, but then because you don't get any good deltas, the "writing
out" phase sucks donkey balls because it does zlib compression on big
objects and writes them out to disk.

So there are certainly multiple possible reasons for the deltification
to not work well for you.

How sensitive is your material? Could you make a smaller repo with
some of the blobs that still show the symptoms? I don't think I want
to download 206GB of data even if my internet access is good.

                    Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Delta compression not so effective
  2017-03-01 13:51 Delta compression not so effective Marius Storm-Olsen
  2017-03-01 16:06 ` Junio C Hamano
  2017-03-01 17:36 ` Linus Torvalds
@ 2017-03-01 20:19 ` Martin Langhoff
  2017-03-01 23:59   ` Marius Storm-Olsen
  2 siblings, 1 reply; 15+ messages in thread
From: Martin Langhoff @ 2017-03-01 20:19 UTC (permalink / raw)
  To: Marius Storm-Olsen; +Cc: Git Mailing List

On Wed, Mar 1, 2017 at 8:51 AM, Marius Storm-Olsen <mstormo@gmail.com> wrote:
> BUT, even still, I would expect Git's delta compression to be quite effective, compared to the compression present in SVN.

jar files are zipfiles. They don't delta in any useful form, and in
fact they differ even if they contain identical binary files inside.

>     Commits: 32988
>     DB (server) size: 139GB

Are you certain of the on-disk storage at the SVN server? Ideally,
you've taken the size with a low-level tool like `du -sh
/path/to/SVNRoot`.

Even with no delta compression (as per Junio and Linus' discussion),
based on past experience importing jar/wars/binaries from SVN into
git... I'd expect git's worst case to be on-par with SVN, perhaps ~5%
larger due to compression headers on uncompressible data.

cheers,


m
-- 
 martin.langhoff@gmail.com
 - ask interesting questions  ~  http://linkedin.com/in/martinlanghoff
 - don't be distracted        ~  http://github.com/martin-langhoff
   by shiny stuff

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Delta compression not so effective
  2017-03-01 18:30     ` Linus Torvalds
@ 2017-03-01 21:08       ` Martin Langhoff
  2017-03-02  0:12       ` Marius Storm-Olsen
  1 sibling, 0 replies; 15+ messages in thread
From: Martin Langhoff @ 2017-03-01 21:08 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Marius Storm-Olsen, Git Mailing List

On Wed, Mar 1, 2017 at 1:30 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> For example, the sorting code thinks that objects with the same name
> across the history are good sources of deltas.

Marius has indicated he is working with jar files. IME jar and war
files, which are zipfiles containing Java bytecode, range from not
delta-ing in a useful fashion, to pretty good deltas.

Depending on the build process (hi Maven!) there can be enough
variance in the build metadata to throw all the compression machinery
off.

On a simple Maven-driven project I have at hand, two .war files
compiled from the same codebase compressed really well in git. I've
also seen projects where storage space is ~101% of the "uncompressed"
size.

my 2c,



m
-- 
 martin.langhoff@gmail.com
 - ask interesting questions  ~  http://linkedin.com/in/martinlanghoff
 - don't be distracted        ~  http://github.com/martin-langhoff
   by shiny stuff

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Delta compression not so effective
  2017-03-01 20:19 ` Martin Langhoff
@ 2017-03-01 23:59   ` Marius Storm-Olsen
  0 siblings, 0 replies; 15+ messages in thread
From: Marius Storm-Olsen @ 2017-03-01 23:59 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Git Mailing List

On 3/1/2017 14:19, Martin Langhoff wrote:
> On Wed, Mar 1, 2017 at 8:51 AM, Marius Storm-Olsen <mstormo@gmail.com> wrote:
>> BUT, even still, I would expect Git's delta compression to be quite effective, compared to the compression present in SVN.
>
> jar files are zipfiles. They don't delta in any useful form, and in
> fact they differ even if they contain identical binary files inside.

If you look through the initial post, you'll see that the jar in 
question is in fact a tool (BFG) by Roberto Tyley, which is basically 
git filter-branch on steroids. I used it to quickly filter out the 
extern/ folder, just to prove most of the original size stems from that 
particular folder. That's all.

The repo does not contain zip or jar files. There are a few images and
other compressed formats (except a few hundred MB of proprietary files,
which never change), but nothing unusual.


>>     Commits: 32988
>>     DB (server) size: 139GB
>
> Are you certain of the on-disk storage at the SVN server? Ideally,
> you've taken the size with a low-level tool like `du -sh
> /path/to/SVNRoot`.

139GB is from 'du -sh' on the SVN server. I imported (via SubGit) 
directly from the (hotcopied) SVN folder on the server. So true SVN size.


> Even with no delta compression (as per Junio and Linus' discussion),
> based on past experience importing jar/wars/binaries from SVN into
> git... I'd expect git's worst case to be on-par with SVN, perhaps ~5%
> larger due to compression headers on uncompressible data.

Yes, I was expecting a Git repo <139GB, but like Linus mentioned, 
something must be knocking the delta search off its feet, so it bails 
out. Going from loose objects to a 'hard' repack didn't show that much 
difference.


Thanks!

-- 
.marius

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Delta compression not so effective
  2017-03-01 18:30     ` Linus Torvalds
  2017-03-01 21:08       ` Martin Langhoff
@ 2017-03-02  0:12       ` Marius Storm-Olsen
  2017-03-02  0:43         ` Linus Torvalds
  1 sibling, 1 reply; 15+ messages in thread
From: Marius Storm-Olsen @ 2017-03-02  0:12 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

On 3/1/2017 12:30, Linus Torvalds wrote:
> On Wed, Mar 1, 2017 at 9:57 AM, Marius Storm-Olsen <mstormo@gmail.com> wrote:
>>
>> Indeed, I did do a
>>     -c pack.threads=20 --window-memory=6g
>> to 'git repack', since the machine is a 20-core (40 threads) machine with
>> 126GB of RAM.
>>
>> So I guess with these sized objects, even at 6GB per thread, it's not enough
>> to get a big enough Window for proper delta-packing?
>
> Hmm. The 6GB window should be plenty good enough, unless your blobs
> are in the gigabyte range too.

No, the list of git verify-pack output in the previous post was from the 
bottom of the sorted list, so those are the largest blobs, ~249MB.


>> This repo took >14hr to repack on 20 threads though ("compression" step was
>> very fast, but stuck 95% of the time in "writing objects"), so I can only
>> imagine how long a pack.threads=1 will take :)
>
> Actually, it's usually the compression phase that should be slow - but
> if something is limiting finding deltas (so that we abort early), then
> that would certainly tend to speed up compression.
>
> The "writing objects" phase should be mainly about the actual IO.
> Which should be much faster *if* you actually find deltas.

So, this repo must be knocking several parts of Git's insides. I was 
curious about why it was so slow on the writing objects part, since the 
whole repo is on a 4x RAID 5 with 7k spindles. Now, sure, they are not 
SSDs, but the thing has ~400MB/s continuous throughput available.

iostat -m 5 showed only a trickle of read/write to the process, and 
80-100% CPU on a single thread (since the "write objects" stage is 
single threaded, obviously).

The failing delta must be triggering other negative behavior.


> For example, the sorting code thinks that objects with the same name
> across the history are good sources of deltas. But it may be that for
> your case, the binary blobs that you have don't tend to actually
> change in the history, so that heuristic doesn't end up doing
> anything.

These are generally just DLLs (debug & release), whose content is 
updated due to upstream project updates. So, filenames/paths tend to 
stay identical, while content changes throughout history.


> The sorting does use the size and the type too, but the "filename
> hash" (which isn't really a hash, it's something nasty to give
> reasonable results for the case where files get renamed) is the main
> sort key.
>
> So you might well want to look at the sorting code too. If filenames
> (particularly the end of filenames) for the blobs aren't good hints
> for the sorting code, that sort might end up spreading all the blobs
> out rather than sort them by size.

Filenames are fairly static, and the bulk of the 6000 biggest 
non-delta'ed blobs are the same DLLs (multiple of them)


> And again, if that happens, the "can I delta these two objects" code
> will notice that the size of the objects are wildly different and
> won't even bother trying. Which speeds up the "compressing" phase, of
> course, but then because you don't get any good deltas, the "writing
> out" phase sucks donkey balls because it does zlib compression on big
> objects and writes them out to disk.

Right, now on this machine, I really didn't notice much difference 
between standard zlib level and doing -9. The 203GB version was actually 
with zlib=9.


> So there are certainly multiple possible reasons for the deltification
> to not work well for you.
>
> How sensitive is your material? Could you make a smaller repo with
> some of the blobs that still show the symptoms? I don't think I want
> to download 206GB of data even if my internet access is good.

Pretty sensitive, and I'm not sure how I can reproduce this reasonably 
well. However, I can easily recompile git with any recommended 
instrumentation/printfs, if you have any suggestions of good places to 
start. If anyone has good file/line numbers, I'll give that a go and 
report back.

Thanks!

-- 
.marius

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Delta compression not so effective
  2017-03-02  0:12       ` Marius Storm-Olsen
@ 2017-03-02  0:43         ` Linus Torvalds
  2017-03-04  8:27           ` Marius Storm-Olsen
  0 siblings, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2017-03-02  0:43 UTC (permalink / raw)
  To: Marius Storm-Olsen; +Cc: Git Mailing List

On Wed, Mar 1, 2017 at 4:12 PM, Marius Storm-Olsen <mstormo@gmail.com> wrote:
>
> No, the list of git verify-pack output in the previous post was from the
> bottom of the sorted list, so those are the largest blobs, ~249MB.

.. so with a 6GB window, you should easily still have 20+ objects. Not
a huge window, but it should find some deltas.

But a smaller window - _together_ with a suboptimal sorting choice -
could then result in a lack of successful delta matches.

> So, this repo must be knocking several parts of Git's insides. I was curious
> about why it was so slow on the writing objects part, since the whole repo
> is on a 4x RAID 5 with 7k spindles. Now, sure, they are not SSDs, but the
> thing has ~400MB/s continuous throughput available.
>
> iostat -m 5 showed only a trickle of read/write to the process, and 80-100%
> CPU on a single thread (since the "write objects" stage is single threaded,
> obviously).

So the writing phase isn't multi-threaded because it's not expected to
matter. But if you can't even generate deltas, you aren't just
*writing* much more data, you're compressing all that data with zlib
too.

So even with a fast disk subsystem, you won't even be able to saturate
the disk, simply because the compression will be slower (and
single-threaded).

> Filenames are fairly static, and the bulk of the 6000 biggest non-delta'ed
> blobs are the same DLLs (multiple of them)

I think the first thing you should test is to repack with fewer
threads, and a bigger pack window. Do something like

  -c pack.threads=4 --window-memory=30g

instead. Just to see if that starts finding deltas.
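
I.e., keeping the rest of your repack invocation as it was, something
along the lines of

   git -c pack.threads=4 repack -a -d -F --window=350 --depth=250 \
       --window-memory=30g

purely as an experiment.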

> Right, now on this machine, I really didn't notice much difference between
> standard zlib level and doing -9. The 203GB version was actually with
> zlib=9.

Don't. zlib has *horrible* scaling with higher compressions. It
doesn't actually improve the end result very much, and it makes things
*much* slower.

zlib was a reasonable choice when git started - well-known, stable, easy to use.

But realistically it's a relatively horrible choice today, just
because there are better alternatives now.

>> How sensitive is your material? Could you make a smaller repo with
>> some of the blobs that still show the symptoms? I don't think I want
>> to download 206GB of data even if my internet access is good.
>
> Pretty sensitive, and I'm not sure how I can reproduce this reasonably well.
> However, I can easily recompile git with any recommended
> instrumentation/printfs, if you have any suggestions of good places to
> start. If anyone has good file/line numbers, I'll give that a go and
> report back.

So the first thing you might want to do is to just print out the
objects after sorting them, and before it starts trying to find
deltas.

See prepare_pack() in builtin/pack-objects.c, where it does something like this:

        if (nr_deltas && n > 1) {
                unsigned nr_done = 0;
                if (progress)
                        progress_state = start_progress(_("Compressing objects"),
                                                        nr_deltas);
                QSORT(delta_list, n, type_size_sort);
                ll_find_deltas(delta_list, n, window+1, depth, &nr_done);
                stop_progress(&progress_state);


and notice that QSORT() line: that's what sorts the objects. You can
do something like

                for (i = 0; i < n; i++)
                        show_object_entry_details(delta_list[i]);

right after that QSORT(), and make that print out the object hash,
filename hash, and size (we don't have the filename that the object
was associated with any more at that stage - they take too much
space).

Save off that array for off-line processing: when you have the object
hash, you can see what the contents are, and match it up with the
file in the git history using something like

   git log --oneline --raw -R --abbrev=40

which shows you the log, but also the "diff" in the form of "this
filename changed from SHA1 to SHA1", so you can match up the object
hashes with where they are in the tree (and where they are in
history).
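
Alternatively, if you just want to look up the path for a single blob,
something like

   git rev-list --objects --all | grep ^<sha1>

should print the object id followed by the path it was found at.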

So then you could try to figure out if that type_size_sort() heuristic
is just particularly horrible for you.

In fact, if your data is not *so* sensitive, and you're ok with making
the one-line commit logs and the filenames public, you could make just
those things available, and maybe I'll have time to look at it.

I'm in the middle of the kernel merge window, but I'm in the last
stretch, and because of the SHA1 thing I've been looking at git
lately. No promises, though.

                   Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Delta compression not so effective
  2017-03-02  0:43         ` Linus Torvalds
@ 2017-03-04  8:27           ` Marius Storm-Olsen
  2017-03-06  1:14             ` Linus Torvalds
  2017-03-07  9:07             ` Thomas Braun
  0 siblings, 2 replies; 15+ messages in thread
From: Marius Storm-Olsen @ 2017-03-04  8:27 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

On 3/1/2017 18:43, Linus Torvalds wrote:
>> So, this repo must be knocking several parts of Git's insides. I was curious
>> about why it was so slow on the writing objects part, since the whole repo
>> is on a 4x RAID 5 with 7k spindles. Now, sure, they are not SSDs, but the
>> thing has ~400MB/s continuous throughput available.
>>
>> iostat -m 5 showed only a trickle of read/write to the process, and 80-100%
>> CPU on a single thread (since the "write objects" stage is single threaded,
>> obviously).
>
> So the writing phase isn't multi-threaded because it's not expected to
> matter. But if you can't even generate deltas, you aren't just
> *writing* much more data, you're compressing all that data with zlib
> too.
>
> So even with a fast disk subsystem, you won't even be able to saturate
> the disk, simply because the compression will be slower (and
> single-threaded).

I did a simple
     $ time zip -r repo.zip repo/
...
     total bytes=219353596620, compressed=214310715074 -> 2% savings

     real    154m6.323s
     user    133m5.209s
     sys     5m5.338s

also using a single thread and the same disk as git repack. But if you 
compare it to the numbers below, it's 2.6hrs with zip vs 14.2hrs 
(1:5.5). So it can't just be the overhead of having to compress the full 
blobs due to the lack of deltas.


>> Filenames are fairly static, and the bulk of the 6000 biggest non-delta'ed
>> blobs are the same DLLs (multiple of them)
>
> I think the first thing you should test is to repack with fewer
> threads, and a bigger pack window. Do something like
>
>   -c pack.threads=4 --window-memory=30g
>
> instead. Just to see if that starts finding deltas.

I reran the repack with the options above (dropping the zlib=9, as you 
suggested)

     $ time git -c pack.threads=4 repack -a -d -F \
                --window=350 --depth=250 --window-memory=30g

     Delta compression using up to 4 threads.
     Compressing objects:   100% (609413/609413)
     Writing objects: 100% (666515/666515), done.
     Total 666515 (delta 499585), reused 0 (delta 0)

     real	850m3.473s
     user	897m36.280s
     sys 	10m8.824s

and ended up with
     $ du -sh .
     205G	.

In other words, going from 6G to 30G window didn't help a lick on 
finding deltas for those binaries. (205G was what I had with the 
non-aggressive 'git gc', before zlib=9 repack.)

BUT, oddly enough, even if the new size is almost identical to the 
previous version without zlib=9,
     git verify-pack --verbose objects/pack/pack-29b06ae4d458ac03efd98b330702d30e851b2933.idx |
         sort -k3n | tail -n15
gives me a VERY different list than before:

   17e5b2146311256dc8317d6e0ed1291363c31a76 blob   673399562 110248747 190398904084
   04c881d9069eab3bd0d50dd48a047a60f79cc415 blob   673863358 111710559 188818868865
   fdcabd75aeda86ce234d6e43b54d27d993acddcd blob   674523614 111956017 185706433825
   d8815033d1b00b151ae762be8a69ffa35f55c4b4 blob   675286758 112099638 185153570292
   997e0b9d3bcf440af10c7bbe535a597ca46c492c blob   678274978 112654668 184041692883
   dfed141679e5c33caaa921cbe1595a24967a3c2c blob   681692132 113121410 186753502634
   76a4000e71cd5b85f2265e02eb876acf1f33cc55 blob   682673430 112743915 184563542298
   81e7292c4d2da2d2d236fbfaa572b6c4e8d787f4 blob   684543130 112797325 181805773038
   991184c60e1fc6b2721bf40f181012b72b10d02d blob   684543130 112796892 182344388066
   0e9269f4abd1440addd05d4f964c96d74d11cd89 blob   684547270 112809074 181070719237
   6019b6d09759cf5adeac678c8b56d177803a0486 blob   684547270 112809336 180517242193
   70a5f70bd205329472d6f9c660eb3f7d207a596e blob   686852038 112873611 183520467528
   e86a0064d9652be9f5e3a877b11a665f64198ecd blob   686852038 112874133 182893219377
   bae8de0555be5b1ffa0988cbc6cba698f6745c26 blob   894041802 137223252 2355250324
   94dc773600e03ac1e6f3ab077b70b8297325ad77 blob   945197364 145219485 16560137220

compared to the last 3 entries of the previous pack:
   e9916da851962265a9d5b099e72f60659a74c144 blob   170113524 73514361 966299538
   f7bf1313752deb1bae592cc7fc54289aea87ff19 blob   170113524 70756581 1039814687
   8afc6f2a51f0fa1cc4b03b8d10c70599866804ad blob   248959314 237612609 606692699


> So the first thing you might want to do is to just print out the
> objects after sorting them, and before it starts trying to find
> deltas.
...
> and notice that QSORT() line: that's what sorts the objects. You can
> do something like
>
>                 for (i = 0; i < n; i++)
>                         show_object_entry_details(delta_list[i]);

I did
     fprintf(stderr, "%s %u %lu\n",
             sha1_to_hex(delta_list[i]->idx.sha1),
             delta_list[i]->hash,
             delta_list[i]->size);

I assume that's correct?


> In fact, if your data is not *so* sensitive, and you're ok with making
> the one-line commit logs and the filenames public, you could make just
> those things available, and maybe I'll have time to look at it.

I've removed all commit messages, and "sanitized" some filepaths etc, so 
name hashes won't match what's reported, but that should be fine. (the 
object_entry->hash seems to be just a trivial uint32 hash for sorting 
anyways)

I really don't want the files on the mailing list, so I'll send you a 
link directly. However, small snippets for public discussion of 
potential issues would be fine, obviously.

BUT, if I look at the last 3 entries of the sorted git verify-pack 
output, and look for them in the 'git log --oneline --raw -R 
--abbrev=40' output, I get:
  :100644 100644 991184c60e1fc6b2721bf40f181012b72b10d02d e86a0064d9652be9f5e3a877b11a665f64198ecd M extern/win/FlammableV3/x64/lib/FlameProxyLibD.lib
  :100644 000000 bae8de0555be5b1ffa0988cbc6cba698f6745c26 0000000000000000000000000000000000000000 D extern/win/gdal-2.0.0/lib/x64/Debug/libgdal.lib
  :000000 100644 0000000000000000000000000000000000000000 94dc773600e03ac1e6f3ab077b70b8297325ad77 A extern/win/gdal-2.0.0/lib/x64/Debug/gdal.lib

while I cannot find ANY of them in the delta_list output?? Shouldn't 
delta_list contain all objects, sorted by some heuristics? Or is the 
delta_list already here limited by some other metric, before the QSORT?

Also note that the 'git log --oneline --raw -R --abbrev=40' only gave me 
the log for trunk, so the second-to-last object must have been added on 
a branch and deleted on trunk; I could only see the deletion of that 
object in the output.


You might get an idea for how to easily create a repo which reproduces 
the issue, and which would highlight it more easily for the ML.

I was thinking of maybe scripting up
     make install prefix=extern
for each Git release, and rewriting trunk history with extern/ binary 
commits at the time of each tag; maybe that would show the same 
behavior? But then again, most of the binaries are just copies of each 
other, and only ~10M, so probably not a big win.
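
Roughly something like this, I suppose (untested sketch, paths made up):

     test=/tmp/delta-test
     git init -q "$test"
     for tag in $(git tag -l 'v2.*' | sort -V); do
         git checkout -q "$tag" &&
         make -s prefix="$test/extern" install &&
         git -C "$test" add extern &&
         git -C "$test" commit -q -m "$tag"
     done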


Thanks!

-- 
.marius

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Delta compression not so effective
  2017-03-04  8:27           ` Marius Storm-Olsen
@ 2017-03-06  1:14             ` Linus Torvalds
  2017-03-06 13:36               ` Marius Storm-Olsen
  2017-03-07  9:07             ` Thomas Braun
  1 sibling, 1 reply; 15+ messages in thread
From: Linus Torvalds @ 2017-03-06  1:14 UTC (permalink / raw)
  To: Marius Storm-Olsen; +Cc: Git Mailing List

On Sat, Mar 4, 2017 at 12:27 AM, Marius Storm-Olsen <mstormo@gmail.com> wrote:
>
> I reran the repack with the options above (dropping the zlib=9, as you
> suggested)
>
>     $ time git -c pack.threads=4 repack -a -d -F \
>                --window=350 --depth=250 --window-memory=30g
>
> and ended up with
>     $ du -sh .
>     205G        .
>
> In other words, going from 6G to 30G window didn't help a lick on finding
> deltas for those binaries.

Ok.

> I did
>     fprintf(stderr, "%s %u %lu\n",
>             sha1_to_hex(delta_list[i]->idx.sha1),
>             delta_list[i]->hash,
>             delta_list[i]->size);
>
> I assume that's correct?

Looks good.

> I've removed all commit messages, and "sanitized" some filepaths etc, so
> name hashes won't match what's reported, but that should be fine. (the
> object_entry->hash seems to be just a trivial uint32 hash for sorting
> anyways)

Yes. I see your name list and your pack-file index.

> BUT, if I look at the last 3 entries of the sorted git verify-pack output,
> and look for them in the 'git log --oneline --raw -R --abbrev=40' output, I
> get:
...
> while I cannot find ANY of them in the delta_list output?? \

Yes. You have a lot of object names in that log file you sent in
private that aren't in the delta list.

Now, objects smaller than 50 bytes we don't ever try to even delta. I
can't see the object sizes when they don't show up in the delta list,
but looking at some of those filenames I'd expect them to not fall in
that category.

I guess you could do the printout a bit earlier (on the
"to_pack.objects[]" array - to_pack.nr_objects is the count there).
That should show all of them. But the small objects shouldn't matter.

But if you have a file like

   extern/win/FlammableV3/x64/lib/FlameProxyLibD.lib

I would have assumed that it has a size that is > 50. Unless those
"extern" things are placeholders?

> You might get an idea for how to easily create a repo which reproduces the
> issue, and which would highlight it more easily for the ML.

Looking at your sorted object list ready for packing, it doesn't look
horrible. When sorting for size, it still shows a lot of those large
files with the same name hash, so they sorted together in that form
too.

I do wonder if your dll data just simply is absolutely horrible for
xdelta. We've also limited the delta finding a bit, simply because it
had some O(m*n) behavior that gets very expensive on some patterns.
Maybe your blobs trigger some of those cases.

The diff-delta work all goes back to 2005 and 2006, so it's a long time ago.

What I'd ask you to do is try to find out if you could make a repository of
just one of the bigger DLLs with its history, particularly if you can
find one that you don't think is _that_ sensitive.

Looking at it, for example, I see that you have that file

   extern/redhat-5/FlammableV3/x64/plugins/libFlameCUDA-3.0.703.so

that seems to have changed several times, and is a largish blob. Could
you try creating a repository with git fast-import that *only*
contains that file (or pick another one), and see if that deltas
well?
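
Even without bothering with fast-import, a rough and untested way to
replay just that one file's history into a fresh repo would be something
like

   path=extern/redhat-5/FlammableV3/x64/plugins/libFlameCUDA-3.0.703.so
   git init -q /tmp/single-file
   git log --reverse --format=%H -- "$path" |
   while read rev; do
       git cat-file blob "$rev:$path" > /tmp/single-file/blob.so &&
       git -C /tmp/single-file add blob.so &&
       git -C /tmp/single-file commit -q -m "$rev"
   done

and then repacking that and seeing whether the versions delta against
each other at all.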

And if you find some case that doesn't xdelta well, and that you feel
you could make available outside, we could have a test-case...

                 Linus

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Delta compression not so effective
  2017-03-06  1:14             ` Linus Torvalds
@ 2017-03-06 13:36               ` Marius Storm-Olsen
  0 siblings, 0 replies; 15+ messages in thread
From: Marius Storm-Olsen @ 2017-03-06 13:36 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

On 3/5/2017 19:14, Linus Torvalds wrote:
> On Sat, Mar 4, 2017 at 12:27 AM, Marius Storm-Olsen <mstormo@gmail.com> wrote:
> I guess you could do the printout a bit earlier (on the
> "to_pack.objects[]" array - to_pack.nr_objects is the count there).
> That should show all of them. But the small objects shouldn't matter.
>
> But if you have a file like
>
>    extern/win/FlammableV3/x64/lib/FlameProxyLibD.lib
>
> I would have assumed that it has a size that is > 50. Unless those
> "extern" things are placeholders?

No placeholders, the FlameProxyLibD.lib is a debug lib, and probably the 
largest in the whole repo (with a replace count > 5).


> I do wonder if your dll data just simply is absolutely horrible for
> xdelta. We've also limited the delta finding a bit, simply because it
> had some O(m*n) behavior that gets very expensive on some patterns.
> Maybe your blobs trigger some of those cases.

Ok, but given that the SVN delta compression, which is forward-linear 
only, is ~45% better, perhaps that particular search could be done fairly 
cheaply? Although, I bet time(stamps) are out of the loop at that point, 
so they're not a factor anymore. Even if they were, I'm not sure it would 
solve anything, if there are other factors also limiting deltafication.


> The diff-delta work all goes back to 2005 and 2006, so it's a long time ago.
>
> What I'd ask you to do is try to find out if you could make a repository of
> just one of the bigger DLLs with its history, particularly if you can
> find one that you don't think is _that_ sensitive.
>
> Looking at it, for example, I see that you have that file
>
>    extern/redhat-5/FlammableV3/x64/plugins/libFlameCUDA-3.0.703.so
>
> that seems to have changed several times, and is a largish blob. Could
> you try creating a repository with git fast-import that *only*
> contains that file (or pick another one), and see if that deltas
> well?

I'll filter-branch to extern/ only; however, the whole FlammableV3 needs 
to go too, I'm afraid (extern for that project, but internal to $WORK).
I'll do some rewrites and see what comes up.

> And if you find some case that doesn't xdelta well, and that you feel
> you could make available outside, we could have a test-case...

I'll try with this repo first; if that doesn't work, I'll see if I can construct one.

Thanks!


-- 
.marius

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Delta compression not so effective
  2017-03-04  8:27           ` Marius Storm-Olsen
  2017-03-06  1:14             ` Linus Torvalds
@ 2017-03-07  9:07             ` Thomas Braun
  1 sibling, 0 replies; 15+ messages in thread
From: Thomas Braun @ 2017-03-07  9:07 UTC (permalink / raw)
  To: Marius Storm-Olsen, Linus Torvalds; +Cc: Git Mailing List



> Marius Storm-Olsen <mstormo@gmail.com> wrote on 4 March 2017 at 09:27:

[...]

> I really don't want the files on the mailing list, so I'll send you a
> link directly. However, small snippets for public discussion of
> potential issues would be fine, obviously.

git fast-export can anonymize a repository [1]. Maybe an anonymized repository
still shows the issue you are seeing.

[1]: https://www.git-scm.com/docs/git-fast-export#_anonymizing
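
The usual round-trip (roughly following the example in the docs) would be
something like

    git fast-export --anonymize --all > anon-stream
    mkdir anon-repo && cd anon-repo && git init
    git fast-import < ../anon-stream

though whether the anonymized blobs still reproduce the delta behavior is
of course the open question.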

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2017-03-07  9:07 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-01 13:51 Delta compression not so effective Marius Storm-Olsen
2017-03-01 16:06 ` Junio C Hamano
2017-03-01 16:17   ` Junio C Hamano
2017-03-01 17:36 ` Linus Torvalds
2017-03-01 17:57   ` Marius Storm-Olsen
2017-03-01 18:30     ` Linus Torvalds
2017-03-01 21:08       ` Martin Langhoff
2017-03-02  0:12       ` Marius Storm-Olsen
2017-03-02  0:43         ` Linus Torvalds
2017-03-04  8:27           ` Marius Storm-Olsen
2017-03-06  1:14             ` Linus Torvalds
2017-03-06 13:36               ` Marius Storm-Olsen
2017-03-07  9:07             ` Thomas Braun
2017-03-01 20:19 ` Martin Langhoff
2017-03-01 23:59   ` Marius Storm-Olsen

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git
