* git repack: --depth=100000 causing larger not smaler pack file?
From: Kjetil Barvik @ 2009-03-17 19:05 UTC
To: git
aloha!
Yesterday I ran the following command on the updated Git repository:
    git repack -adf --window=250000 --depth=100000
After 280 minutes or so it finished, but the strange thing was that
the resulting pack file was larger than before. I had expected it to
be smaller, or at least the same size as before.
kjetil git (my_next)$ ls -l .git/objects/pack/*
-r-------- 1 kjetil kjetil 2757280 2009-03-16 15:18 .git/objects/pack/pack-c5f15d5c48d6b3902a49046d7e8a8d717e167051.idx
-r-------- 1 kjetil kjetil 19961120 2009-03-16 15:18 .git/objects/pack/pack-c5f15d5c48d6b3902a49046d7e8a8d717e167051.pack
Before I started the pack file was around 19 250 000 bytes, and was
the result of the following commands:
1) git repack -adf --window=250000 --depth=20000
   - not completely sure about the --window number here
   - the resulting pack file was a little less than 19 100 000 bytes
2) 'git fetch' to get the latest Git patches
3) since 'git fetch' always makes an extra new "small" pack file, I ran
   the command 'git repack -ad --window=40000 --depth=10000' to get
   one single pack file of 19 250 000 bytes or so.
I can think of one thing which is special about the "--depth=100000"
number, and that is that it is now larger than the total number of
objects in the pack, which is around 96000 to 97000, or so.
I have run 'git fsck --strict --full' on the pack with no resulting
error/debug output or change in the file size.
Any help on how to debug this?
-- kjetil
* Re: git repack: --depth=100000 causing larger not smaler pack file?
From: Nicolas Pitre @ 2009-03-17 20:38 UTC
To: Kjetil Barvik; +Cc: git
On Tue, 17 Mar 2009, Kjetil Barvik wrote:
> aloha!
>
> Yesterday I ran the following command on the updated Git repository:
>
>     git repack -adf --window=250000 --depth=100000
>
> After 280 minutes or so it finished, but the strange thing was that
> the resulting pack file was larger than before. I had expected it to
> be smaller, or at least the same size as before.
>
> kjetil git (my_next)$ ls -l .git/objects/pack/*
> -r-------- 1 kjetil kjetil 2757280 2009-03-16 15:18 .git/objects/pack/pack-c5f15d5c48d6b3902a49046d7e8a8d717e167051.idx
> -r-------- 1 kjetil kjetil 19961120 2009-03-16 15:18 .git/objects/pack/pack-c5f15d5c48d6b3902a49046d7e8a8d717e167051.pack
>
> Before I started the pack file was around 19 250 000 bytes, and was
> the result of the following commands:
>
> 1) git repack -adf --window=250000 --depth=20000
>    - not completely sure about the --window number here
>    - the resulting pack file was a little less than 19 100 000 bytes
>
> 2) 'git fetch' to get the latest Git patches
>
> 3) since 'git fetch' always makes an extra new "small" pack file, I ran
>    the command 'git repack -ad --window=40000 --depth=10000' to get
>    one single pack file of 19 250 000 bytes or so.
>
> I can think of one thing which is special about the "--depth=100000"
> number, and that is that it is now larger than the total number of
> objects in the pack, which is around 96000 to 97000, or so.
No, the depth should have zero negative influence on the pack size.
For tight compression, the larger the better. What this will impact
though is runtime access to the pack data afterward. The deeper a
given object is, the slower its access will be. But since the object
recency order tends to put newer objects at the top of a delta chain,
this should impact older objects more than recent ones.
> I have run 'git fsck --strict --full' on the pack with no resulting
> error/debug output or change in the file size.
There shouldn't be any.
> Any help on how to debug this?
I doubt there is anything to debug. In this case the window size is
used to evaluate a threshold slope for matching objects in the delta
search. What we want is a broader delta tree more than a deep one in
order to have more deltas with a lower depth limit. Therefore a size
threshold is applied, based on the object distance in the delta search
window (see commit c83f032e and the other ones referenced therein).
By providing a big window value, the threshold slope becomes rather flat
and ineffective, and this changes the delta match outcome. While delta
selection is based on the uncompressed delta result, the compressed size
of different deltas with the same size may vary. I suspect you might
have been unlucky in that regard and this could explain the negative
effect on the pack size.
Nicolas
* Re: git repack: --depth=100000 causing larger not smaler pack file?
From: Kjetil Barvik @ 2009-03-23 10:11 UTC
To: Nicolas Pitre; +Cc: git
Nicolas Pitre <nico@cam.org> writes:
> On Tue, 17 Mar 2009, Kjetil Barvik wrote:
>
>> aloha!
>>
>> Yesterday I ran the following command on the updated Git repository:
>>
>>     git repack -adf --window=250000 --depth=100000
>>
>> After 280 minutes or so it finished, but the strange thing was that
>> the resulting pack file was larger than before. I had expected it to
>> be smaller, or at least the same size as before.
[snip]
>> I can think of one thing which is special about the "--depth=100000"
>> number, and that is that it is now larger than the total number of
>> objects in the pack, which is around 96000 to 97000, or so.
>
> No, the depth should have zero negative influence on the pack size.
> For tight compression, the larger the better. What this will impact
> though is runtime access to the pack data afterward. The deeper a
> given object is, the slower its access will be. But since the object
> recency order tends to put newer objects at the top of a delta chain,
> this should impact older objects more than recent ones.
I have done some more tests, and have copied the whole git/ directory
to a new directory (such that I do not accidentally add or delete any
objects/commits), and have made the following table:
All pack file sizes, F, below were computed with the following git
command:
    git repack -adf --window=250000 --depth=D
     D |        F | (F - F_prev) / (D - D_prev)
-------|----------|----------------------------
  5000 | 19129934 |
 10000 | 19128956 |    -978 /  5000 =  -0.1956
 15000 | 19126077 |   -2879 /  5000 =  -0.5758
 20000 | 19126077 |       0 /  5000 =   0
 25000 | 19126077 |       0 /  5000 =   0
 30000 | 19197575 |   71498 /  5000 =  14.2996
 45000 | 19312240 |  114665 / 15000 =   7.6443
 60000 | 19560083 |  247843 / 15000 =  16.5229
 75000 | 19803043 |  242960 / 15000 =  16.1973
 90000 | 19669923 | -133120 / 15000 =  -8.8747
 95000 | 20463780 |  793857 /  5000 = 158.7714
From the table it seems that you get the smallest pack file (for this
particular repository) when the --depth value is somewhere between 15000
and 25000. And when the --depth value was 95000, the resulting pack
file was 20463780 - 19126077 = 1 337 703 bytes (1.28 MiB, or 7%)
larger than this.
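The per-step slopes and the size difference can be recomputed mechanically. The following is only a small sketch: the (D, F) pairs are copied from the table, while the function names are made up for this illustration and are not part of Git.

```python
# (depth D, pack size F in bytes) pairs copied from the table.
MEASUREMENTS = [
    (5000, 19129934), (10000, 19128956), (15000, 19126077),
    (20000, 19126077), (25000, 19126077), (30000, 19197575),
    (45000, 19312240), (60000, 19560083), (75000, 19803043),
    (90000, 19669923), (95000, 20463780),
]


def slopes(points):
    """Return (D, F - F_prev, D - D_prev, slope) for each row after the first."""
    rows = []
    for (d_prev, f_prev), (d, f) in zip(points, points[1:]):
        rows.append((d, f - f_prev, d - d_prev, (f - f_prev) / (d - d_prev)))
    return rows


def growth_over_best(points):
    """Return (extra bytes, extra fraction) of the largest pack vs. the smallest."""
    sizes = [f for _, f in points]
    extra = max(sizes) - min(sizes)
    return extra, extra / min(sizes)


if __name__ == "__main__":
    for d, df, dd, slope in slopes(MEASUREMENTS):
        print(f"{d:>6} | {df:>8} / {dd:>5} = {slope:.4f}")
    extra, fraction = growth_over_best(MEASUREMENTS)
    print(f"largest - smallest = {extra} bytes "
          f"({extra / 2**20:.2f} MiB, {fraction:.1%})")
```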
> I doubt there is anything to debug. In this case the window size is
> used to evaluate a threshold slope for matching objects in the delta
> search. What we want is a broader delta tree more than a deep one in
> order to have more deltas with a lower depth limit. Therefore a size
> threshold is applied, based on the object distance in the delta search
> window (see commit c83f032e and the other ones referenced therein).
>
> By providing a big window value, the threshold slope becomes rather flat
> and ineffective, and this changes the delta match outcome. While delta
> selection is based on the uncompressed delta result, the compressed size
> of different deltas with the same size may vary. I suspect you might
> have been unlucky in that regard and this could explain the negative
> effect on the pack size.
From the table above it seems that I have been unlucky with _all_
--depth values above 25000 or so.
Question: is there some low-level Git command I can run to compare two
pack files, so that I can maybe see the reason behind the above table?
Maybe to see some details about how many deltas there are, how big each
one is, total sizes, etc.
-- kjetil
PS! I have the following in my $HOME/.gitconfig file:
    [repack]
        UseDeltaBaseOffset = true
    [gc]
        auto = 25
        autopacklimit = 1
* Re: git repack: --depth=100000 causing larger not smaler pack file?
From: Mike Ralphson @ 2009-03-23 10:20 UTC
To: Kjetil Barvik; +Cc: Nicolas Pitre, git
2009/3/23 Kjetil Barvik <barvik@broadpark.no>:
> PS! I have the following in my $HOME/.gitconfig file:
>
> [repack]
> UseDeltaBaseOffset = true
> [gc]
> auto = 25
> autopacklimit = 1
Just an aside, but from my reading of how it works, there's very
little point in setting gc.auto to anything less than 257 and
statistically it won't kick in predictably unless set quite a bit
higher (say an order of magnitude).
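To illustrate the point, here is a rough sketch. It rests on an assumption not stated in this thread: that 'git gc --auto' estimates the loose object count by looking at a single one of the 256 object subdirectories and comparing the file count there against (gc.auto + 255) / 256. The function names are hypothetical.

```python
def sampled_threshold(gc_auto: int) -> int:
    """Per-directory threshold under the ASSUMED rule (gc_auto + 255) // 256."""
    return (gc_auto + 255) // 256


def would_trigger(gc_auto: int, loose_in_sampled_dir: int) -> bool:
    """ASSUMED rule: gc runs once the sampled directory holds more files
    than the per-directory threshold; gc.auto <= 0 disables the check."""
    return gc_auto > 0 and loose_in_sampled_dir > sampled_threshold(gc_auto)


if __name__ == "__main__":
    # Under this assumption every value from 1 to 256 gives the same threshold.
    for value in (1, 25, 256, 257, 2570):
        print(f"gc.auto = {value:>4} -> per-directory threshold {sampled_threshold(value)}")
```

Under that assumption, gc.auto = 25 behaves exactly like gc.auto = 256, and because only one directory is sampled, a small threshold makes the trigger depend heavily on where the object names happen to fall.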
Mike
* Re: git repack: --depth=100000 causing larger not smaler pack file?
From: Peter Harris @ 2009-03-23 14:05 UTC
To: Kjetil Barvik; +Cc: Nicolas Pitre, git
On Mon, Mar 23, 2009 at 6:11 AM, Kjetil Barvik wrote:
> Question: is there some low-level Git command I can run to compare two
> pack files, so that I can maybe see the reason behind the above table?
> Maybe to see some details about how many deltas there are, how big each
> one is, total sizes, etc.
git verify-pack -v <pack.idx>
The columns are: SHA1 type size size-in-packfile offset-in-packfile
depth base-SHA1
(the last two columns are only present for deltified objects)
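To compare two packs, that output can be reduced to a few totals per pack. Below is a minimal sketch that relies only on the column layout described above; the script name and the function names are hypothetical, and any line that does not start with a 40-character hex SHA1 and have 5 or 7 fields is skipped.

```python
import sys

HEX_DIGITS = set("0123456789abcdef")


def is_sha1(token):
    """True if token looks like a 40-character lowercase hex SHA1."""
    return len(token) == 40 and set(token) <= HEX_DIGITS


def summarize(lines):
    """Summarize 'git verify-pack -v' object lines.

    Columns: SHA1 type size size-in-packfile offset-in-packfile
    [depth base-SHA1], the last two only for deltified objects.
    """
    stats = {"objects": 0, "deltas": 0, "packed_bytes": 0,
             "delta_packed_bytes": 0, "max_depth": 0}
    for line in lines:
        fields = line.split()
        if len(fields) not in (5, 7) or not is_sha1(fields[0]):
            continue  # summary or status line, not an object entry
        packed = int(fields[3])
        stats["objects"] += 1
        stats["packed_bytes"] += packed
        if len(fields) == 7:  # deltified object
            stats["deltas"] += 1
            stats["delta_packed_bytes"] += packed
            stats["max_depth"] = max(stats["max_depth"], int(fields[5]))
    return stats


if __name__ == "__main__" and not sys.stdin.isatty():
    for key, value in summarize(sys.stdin).items():
        print(f"{key}: {value}")
```

Saved as, say, summarize_pack.py, it could be fed with 'git verify-pack -v <pack.idx> | python3 summarize_pack.py' once per pack, and the two summaries compared by hand.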
Peter Harris
* Re: git repack: --depth=100000 causing larger not smaler pack file?
From: Nicolas Pitre @ 2009-03-23 14:14 UTC
To: Kjetil Barvik; +Cc: git
On Mon, 23 Mar 2009, Kjetil Barvik wrote:
> Nicolas Pitre <nico@cam.org> writes:
>
> > On Tue, 17 Mar 2009, Kjetil Barvik wrote:
> >
> >> aloha!
> >>
> >> Yesterday I ran the following command on the updated Git repository:
> >>
> >>     git repack -adf --window=250000 --depth=100000
> >>
> >> After 280 minutes or so it finished, but the strange thing was that
> >> the resulting pack file was larger than before. I had expected it to
> >> be smaller, or at least the same size as before.
> [snip]
> >> I can think of one thing which is special about the "--depth=100000"
> >> number, and that is that it is now larger than the total number of
> >> objects in the pack, which is around 96000 to 97000, or so.
> >
> > No, the depth should have zero negative influence on the pack size.
> > For tight compression, the larger the better. What this will impact
> > though is runtime access to the pack data afterward. The deeper a
> > given object is, the slower its access will be. But since the object
> > recency order tends to put newer objects at the top of a delta chain,
> > this should impact older objects more than recent ones.
>
> I have done some more tests, and have copied the whole git/ directory
> to a new directory (such that I do not accidentally add or delete any
> objects/commits), and have made the following table:
>
> All pack file sizes, F, below were computed with the following git
> command:
>
> git repack -adf --window=250000 --depth=D
>
>      D |        F | (F - F_prev) / (D - D_prev)
> -------|----------|----------------------------
>   5000 | 19129934 |
>  10000 | 19128956 |    -978 /  5000 =  -0.1956
>  15000 | 19126077 |   -2879 /  5000 =  -0.5758
>  20000 | 19126077 |       0 /  5000 =   0
>  25000 | 19126077 |       0 /  5000 =   0
>  30000 | 19197575 |   71498 /  5000 =  14.2996
>  45000 | 19312240 |  114665 / 15000 =   7.6443
>  60000 | 19560083 |  247843 / 15000 =  16.5229
>  75000 | 19803043 |  242960 / 15000 =  16.1973
>  90000 | 19669923 | -133120 / 15000 =  -8.8747
>  95000 | 20463780 |  793857 /  5000 = 158.7714
>
> From the table it seems that you get the smallest pack file (for this
> particular repository) when the --depth value is somewhere between 15000
> and 25000. And when the --depth value was 95000, the resulting pack
> file was 20463780 - 19126077 = 1 337 703 bytes (1.28 MiB, or 7%)
> larger than this.
This is a bit intriguing.
Of course, before going any further, you must realize that a depth of
15000 is a bit excessive. With a delta chain that deep, access to the
object at the end of the chain requires that 14999 other objects be
accessed before the 15000th one is retrieved. This will give horrible
runtime performance for something like a 10% reduction in the best case,
which is probably not a good tradeoff.
This being said, I still stand by my assertion that, in theory, greater
delta depth should not make the pack bigger. And your table appears to
confirm that, even to the point of reaching a stable size as one would
expect, until a breaking point is reached after which results tend to
become rather random.
What I'm suspecting in that case is some computation overflow in
try_delta(). Consider for instance this piece:
    max_size = max_size * (max_depth - src->depth) /
                    (max_depth - ref_depth + 1);
[ This is the threshold slope I was talking about, but contrary to
what I said before, it is affected by the depth, not the window size. ]
In this case, if you have a max_depth of 95000, then any object larger
than 90461 bytes will cause a multiplication overflow, and the resulting
max_size will be capped to some random smaller value than expected
depending on the remaining bits. For example, suppose max_size = 45211,
max_depth = 95000 and src->depth = 0 then you should have max_size still
equal to 45211, but in this case it'll become 0 and no delta will be
attempted at all. The number of deltas reported at the end of the
repack process probably reflects that.
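The wraparound in that example can be reproduced with a short sketch. Two assumptions are made here that the message does not spell out: the multiplication is done in a 32-bit unsigned type, and ref_depth is 1, which is the value that makes the exact result come out as 45211. The 90461-byte figure is consistent with an initial max_size of about half the object size minus 20, but that too is an assumption about try_delta(), not something shown in the snippet. The function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # ASSUMPTION: the arithmetic is done in 32-bit unsigned


def threshold_exact(max_size, max_depth, src_depth, ref_depth):
    """The slope formula evaluated with unbounded integers."""
    return max_size * (max_depth - src_depth) // (max_depth - ref_depth + 1)


def threshold_32bit(max_size, max_depth, src_depth, ref_depth):
    """The same formula, with the product truncated to 32 bits first."""
    product = (max_size * (max_depth - src_depth)) & MASK32
    return product // (max_depth - ref_depth + 1)


def smallest_overflowing_max_size(max_depth):
    """Smallest max_size whose product with max_depth no longer fits in 32 bits."""
    return MASK32 // max_depth + 1


if __name__ == "__main__":
    args = (45211, 95000, 0, 1)  # ref_depth = 1 is an assumption
    print("exact :", threshold_exact(*args))
    print("32-bit:", threshold_32bit(*args))
    print("first overflowing max_size at depth 95000:",
          smallest_overflowing_max_size(95000))
```

The larger max_depth is, the smaller the max_size at which the product wraps, which would fit the observation that results become erratic only above some depth.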
> > I doubt there is anything to debug. In this case the window size is
> > used to evaluate a threshold slope for matching objects in the delta
> > search. What we want is a broader delta tree more than a deep one in
> > order to have more deltas with a lower depth limit. Therefore a size
> > threshold is applied, based on the object distance in the delta search
> > window (see commit c83f032e and the other ones referenced therein).
> >
> > By providing a big window value, the threshold slope becomes rather flat
> > and ineffective, and this changes the delta match outcome. While delta
> > selection is based on the uncompressed delta result, the compressed size
> > of different deltas with the same size may vary. I suspect you might
> > have been unlucky in that regard and this could explain the negative
> > effect on the pack size.
>
> From the table above it seems that I have been unlucky with _all_
> --depth values above 25000 or so.
See explanation (and self correction) above.
> Question: is there some low-level Git command I can run to compare two
> pack files, so that I can maybe see the reason behind the above table?
> Maybe to see some details about how many deltas there are, how big each
> one is, total sizes, etc.
Yes -- see the -v option of 'git verify-pack'.
Nicolas