git@vger.kernel.org mailing list mirror (one of many)
* git repack: --depth=100000 causing larger not smaler pack file?
@ 2009-03-17 19:05 Kjetil Barvik
  2009-03-17 20:38 ` Nicolas Pitre
  0 siblings, 1 reply; 6+ messages in thread
From: Kjetil Barvik @ 2009-03-17 19:05 UTC
  To: git

  aloha!

  Yesterday I ran the following command on the updated GIT repository:

    git repack -adf --window=250000 --depth=100000

  After 280 minutes or so it finished, but the strange thing was that
  the resulting pack file was larger than before.  I had expected it
  to be smaller, or at least the same size as before.

  kjetil git (my_next)$ ls -l .git/objects/pack/*
-r-------- 1 kjetil kjetil  2757280 2009-03-16 15:18 .git/objects/pack/pack-c5f15d5c48d6b3902a49046d7e8a8d717e167051.idx
-r-------- 1 kjetil kjetil 19961120 2009-03-16 15:18 .git/objects/pack/pack-c5f15d5c48d6b3902a49046d7e8a8d717e167051.pack

  Before I started the pack file was around 19 250 000 bytes, and was
  the result of the following commands:

  1) git repack -adf --window=250000 --depth=20000
          - not completely sure about the --window number here
          - the resulting pack file was a little less than 19 100 000

  2) 'git fetch' to get the latest GIT patches

  3) since 'git fetch' always makes an extra new "small" pack file, I ran
     the command 'git repack -ad --window=40000 --depth=10000' to
     get one single pack file of 19 250 000 bytes or so.

  I can think of one thing which is special about the "--depth=100000"
  number, and that is that it is now larger than the total number of
  objects in the pack, which is around 96000 to 97000, or so.

  I have run 'git fsck --strict --full' on the pack with no resulting
  error/debug output or change in the file size.

  Any help on how to debug this?

  -- kjetil

* Re: git repack: --depth=100000 causing larger not smaler pack file?
  2009-03-17 19:05 git repack: --depth=100000 causing larger not smaler pack file? Kjetil Barvik
@ 2009-03-17 20:38 ` Nicolas Pitre
  2009-03-23 10:11   ` Kjetil Barvik
  0 siblings, 1 reply; 6+ messages in thread
From: Nicolas Pitre @ 2009-03-17 20:38 UTC
  To: Kjetil Barvik; +Cc: git

On Tue, 17 Mar 2009, Kjetil Barvik wrote:

>   aloha!
> 
>   Yesterday I ran the following command on the updated GIT repository:
> 
>     git repack -adf --window=250000 --depth=100000
> 
>   After 280 minutes or so it finished, but the strange thing was that
>   the resulting pack file was larger than before.  I had expected it
>   to be smaller, or at least the same size as before.
> 
>   kjetil git (my_next)$ ls -l .git/objects/pack/*
> -r-------- 1 kjetil kjetil  2757280 2009-03-16 15:18 .git/objects/pack/pack-c5f15d5c48d6b3902a49046d7e8a8d717e167051.idx
> -r-------- 1 kjetil kjetil 19961120 2009-03-16 15:18 .git/objects/pack/pack-c5f15d5c48d6b3902a49046d7e8a8d717e167051.pack
> 
>   Before I started the pack file was around 19 250 000 bytes, and was
>   the result of the following commands:
> 
>   1) git repack -adf --window=250000 --depth=20000
>           - not completely sure about the --window number here
>           - the resulting pack file was a little less than 19 100 000
> 
>   2) 'git fetch' to get the latest GIT patches
> 
>   3) since 'git fetch' always makes an extra new "small" pack file, I ran
>      the command 'git repack -ad --window=40000 --depth=10000' to
>      get one single pack file of 19 250 000 bytes or so.
> 
>   I can think of one thing which is special about the "--depth=100000"
>   number, and that is that it is now larger than the total number of
>   objects in the pack, which is around 96000 to 97000, or so.

No, the depth should have zero negative influence on the pack size.  
For tight compression, the larger the better.  What this will impact 
though is runtime access to the pack data afterward.  The deeper a 
given object is, the slower its access will be.  But since the object 
recency order tends to put newer objects at the top of a delta chain,
this should impact older objects more than recent ones.

>   I have run 'git fsck --strict --full' on the pack with no resulting
>   error/debug output or change in the file size.

There shouldn't be any.

>   Any help on how to debug this?

I doubt there is anything to debug.  In this case the window size is 
used to evaluate a threshold slope for matching objects in the delta 
search.  What we want is a broader delta tree more than a deep one in 
order to have more deltas with a lower depth limit.  Therefore a size 
threshold is applied, based on the object distance in the delta search 
window (see commit c83f032e and the other ones referenced therein).

By providing a big window value, the threshold slope becomes rather flat 
and ineffective, and this changes the delta match outcome.  While delta 
selection is based on the uncompressed delta result, the compressed size 
of different deltas with the same size may vary.  I suspect you might 
have been unlucky in that regard and this could explain the negative 
effect on the pack size.


Nicolas

* Re: git repack: --depth=100000 causing larger not smaler pack file?
  2009-03-17 20:38 ` Nicolas Pitre
@ 2009-03-23 10:11   ` Kjetil Barvik
  2009-03-23 10:20     ` Mike Ralphson
                       ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Kjetil Barvik @ 2009-03-23 10:11 UTC
  To: Nicolas Pitre; +Cc: git

Nicolas Pitre <nico@cam.org> writes:

> On Tue, 17 Mar 2009, Kjetil Barvik wrote:
>
>>   aloha!
>> 
>>   Yesterday I ran the following command on the updated GIT repository:
>> 
>>     git repack -adf --window=250000 --depth=100000
>> 
>>   After 280 minutes or so it finished, but the strange thing was that
>>   the resulting pack file was larger than before.  I had expected it
>>   to be smaller, or at least the same size as before.
  [snip]
>>   I can think of one thing which is special about the "--depth=100000"
>>   number, and that is that it is now larger than the total number of
>>   objects in the pack, which is around 96000 to 97000, or so.
>
> No, the depth should have zero negative influence on the pack size.  
> For tight compression, the larger the better.  What this will impact 
> though is runtime access to the pack data afterward.  The deeper a 
> given object is, the slower its access will be.  But since the object 
> recency order tends to put newer objects at the top of a delta chain,
> this should impact older objects more than recent ones.

  I have done some more tests, and have copied the whole git/ directory
  to a new directory (such that I do not accidentally add or delete any
  objects/commits), and have made the following table:

  All pack file sizes, F, below were computed with the following git
  command:

      git repack -adf --window=250000 --depth=D

     D   |     F      | (F - F_prev) / (D - D_prev)
  -------|------------|----------------------------
    5000 |  19129934  |
   10000 |  19128956  |    -978 /  5000 =  -0.1956
   15000 |  19126077  |   -2879 /  5000 =  -0.5758
   20000 |  19126077  |       0 /  5000 =   0
   25000 |  19126077  |       0 /  5000 =   0
   30000 |  19197575  |   71498 /  5000 =  14.2996
   45000 |  19312240  |  114665 / 15000 =   7.6443
   60000 |  19560083  |  247843 / 15000 =  16.5229
   75000 |  19803043  |  242960 / 15000 =  16.1973
   90000 |  19669923  | -133120 / 15000 =  -8.8746
   95000 |  20463780  |  793857 /  5000 = 158.7714

  From the table it seems that you get the smallest pack file (for this
  particular repository) when --depth value is somewhere between 15000
  and 25000.  And when the --depth value was 95000, the resulting pack
  file was (- 20463780 19126077) = 1 337 703 bytes, about 1.28 MiB or 7%,
  larger than this.
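
  The last column of the table can be recomputed mechanically.  The
  sketch below is only an illustration: the (D, F) pairs are copied from
  the table, while the helper names `slopes` and `growth_over_smallest`
  are made up for this example and are not part of any GIT tool.

```python
# (depth D, pack size F in bytes), copied from the table above.
measurements = [
    (5000, 19129934),
    (10000, 19128956),
    (15000, 19126077),
    (20000, 19126077),
    (25000, 19126077),
    (30000, 19197575),
    (45000, 19312240),
    (60000, 19560083),
    (75000, 19803043),
    (90000, 19669923),
    (95000, 20463780),
]


def slopes(rows):
    """Return (D, F - F_prev, D - D_prev, slope) for every row after the first."""
    out = []
    for (d_prev, f_prev), (d, f) in zip(rows, rows[1:]):
        out.append((d, f - f_prev, d - d_prev, (f - f_prev) / (d - d_prev)))
    return out


def growth_over_smallest(rows):
    """Return (extra bytes, extra MiB, extra percent) of the largest pack
    relative to the smallest pack in the measurements."""
    smallest = min(f for _, f in rows)
    largest = max(f for _, f in rows)
    extra = largest - smallest
    return extra, extra / 2**20, 100.0 * extra / smallest


for d, df, dd, slope in slopes(measurements):
    print(f"{d:6d} | {df:8d} / {dd:5d} = {slope:9.4f}")
```

  Calling `growth_over_smallest(measurements)` gives the difference
  between the largest and the smallest pack in bytes, in MiB and as a
  percentage.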

> I doubt there is anything to debug.  In this case the window size is 
> used to evaluate a threshold slope for matching objects in the delta 
> search.  What we want is a broader delta tree more than a deep one in 
> order to have more deltas with a lower depth limit.  Therefore a size 
> threshold is applied, based on the object distance in the delta search 
> window (see commit c83f032e and the other ones referenced therein).
>
> By providing a big window value, the threshold slope becomes rather flat 
> and ineffective, and this changes the delta match outcome.  While delta 
> selection is based on the uncompressed delta result, the compressed size 
> of different deltas with the same size may vary.  I suspect you might 
> have been unlucky in that regard and this could explain the negative 
> effect on the pack size.

  From the table above it seems that I have been unlucky with _all_
  --depth values above 25000 or so.

  Question: is there some low-level GIT command I can run to compare 2
  pack files, to maybe be able to see the reason behind the above table?
  Maybe to see some details about how many deltas there are, how big
  each one is, total sizes, etc.

  -- kjetil

  PS!  I have the following in my $HOME/.gitconfig file:

[repack]
	UseDeltaBaseOffset = true
[gc]
	auto = 25
	autopacklimit = 1

* Re: git repack: --depth=100000 causing larger not smaler pack file?
  2009-03-23 10:11   ` Kjetil Barvik
@ 2009-03-23 10:20     ` Mike Ralphson
  2009-03-23 14:05     ` Peter Harris
  2009-03-23 14:14     ` Nicolas Pitre
  2 siblings, 0 replies; 6+ messages in thread
From: Mike Ralphson @ 2009-03-23 10:20 UTC
  To: Kjetil Barvik; +Cc: Nicolas Pitre, git

2009/3/23 Kjetil Barvik <barvik@broadpark.no>:
>  PS!  I have the following in my $HOME/.gitconfig file:
>
> [repack]
>        UseDeltaBaseOffset = true
> [gc]
>        auto = 25
>        autopacklimit = 1

Just an aside, but from my reading of how it works, there's very
little point in setting gc.auto to anything less than 257 and
statistically it won't kick in predictably unless set quite a bit
higher (say an order of magnitude).
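
A rough sketch of the reasoning, under an assumption that is not spelled
out in this thread: that 'git gc --auto' counts the loose objects in
only one of the 256 fan-out directories under .git/objects/ and compares
that count against gc.auto divided by 256, rounded up.  The function
names below are invented for the illustration.

```python
def sampled_dir_threshold(gc_auto):
    """Per-directory threshold, assuming gc.auto is spread evenly over
    256 fan-out directories and rounded up."""
    return (gc_auto + 255) // 256


def auto_gc_would_trigger(loose_in_sampled_dir, gc_auto):
    """True if the sampled directory holds more loose objects than the
    per-directory threshold.  A gc.auto of 0 or less disables the check."""
    if gc_auto <= 0:
        return False
    return loose_in_sampled_dir > sampled_dir_threshold(gc_auto)


for value in (1, 25, 256, 257, 6700):
    print(value, sampled_dir_threshold(value))
```

Under that assumption every value from 1 to 256 rounds to the same
per-directory threshold, and whether the check fires depends on how
many objects happen to land in the one directory that is sampled.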

Mike

* Re: git repack: --depth=100000 causing larger not smaler pack file?
  2009-03-23 10:11   ` Kjetil Barvik
  2009-03-23 10:20     ` Mike Ralphson
@ 2009-03-23 14:05     ` Peter Harris
  2009-03-23 14:14     ` Nicolas Pitre
  2 siblings, 0 replies; 6+ messages in thread
From: Peter Harris @ 2009-03-23 14:05 UTC
  To: Kjetil Barvik; +Cc: Nicolas Pitre, git

On Mon, Mar 23, 2009 at 6:11 AM, Kjetil Barvik wrote:
>  Question: is there some low-level GIT command I can run to compare 2
>  pack files, to maybe be able to see the reason behind the above table?
>  Maybe to see some details about how many deltas there are, how big
>  each one is, total sizes, etc.

git verify-pack -v <pack.idx>

The columns are: SHA1 type size size-in-packfile offset-in-packfile
depth base-SHA1
(the last two columns are only present for deltified objects)
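
To compare two packs, the output can be saved to a file and summarized.
The following is a minimal sketch that assumes the column layout listed
above; the names `parse_verify_pack` and `summarize` are hypothetical
helpers, not GIT commands.  Lines that do not look like object entries
(such as any trailing summary lines) are skipped.

```python
import re

_SHA1 = re.compile(r"^[0-9a-f]{40}$")


def parse_verify_pack(text):
    """Parse 'git verify-pack -v' output into a list of dicts.

    Expected columns: SHA1 type size size-in-packfile offset-in-packfile,
    plus depth and base-SHA1 for deltified objects.
    """
    objects = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) not in (5, 7) or not _SHA1.match(fields[0]):
            continue  # not an object entry
        entry = {
            "sha1": fields[0],
            "type": fields[1],
            "size": int(fields[2]),
            "size_in_pack": int(fields[3]),
            "offset": int(fields[4]),
            "depth": 0,
            "base": None,
        }
        if len(fields) == 7:
            entry["depth"] = int(fields[5])
            entry["base"] = fields[6]
        objects.append(entry)
    return objects


def summarize(objects):
    """Count objects and deltas, and total up their in-pack sizes."""
    deltas = [o for o in objects if o["base"] is not None]
    return {
        "objects": len(objects),
        "deltas": len(deltas),
        "delta_bytes_in_pack": sum(o["size_in_pack"] for o in deltas),
        "total_bytes_in_pack": sum(o["size_in_pack"] for o in objects),
        "max_depth": max((o["depth"] for o in deltas), default=0),
    }
```

Running `summarize(parse_verify_pack(open("pack.txt").read()))` on the
saved output of each pack gives two dictionaries whose delta counts and
byte totals can be compared side by side.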

Peter Harris

* Re: git repack: --depth=100000 causing larger not smaler pack file?
  2009-03-23 10:11   ` Kjetil Barvik
  2009-03-23 10:20     ` Mike Ralphson
  2009-03-23 14:05     ` Peter Harris
@ 2009-03-23 14:14     ` Nicolas Pitre
  2 siblings, 0 replies; 6+ messages in thread
From: Nicolas Pitre @ 2009-03-23 14:14 UTC
  To: Kjetil Barvik; +Cc: git

On Mon, 23 Mar 2009, Kjetil Barvik wrote:

> Nicolas Pitre <nico@cam.org> writes:
> 
> > On Tue, 17 Mar 2009, Kjetil Barvik wrote:
> >
> >>   aloha!
> >> 
> >>   Yesterday I ran the following command on the updated GIT repository:
> >> 
> >>     git repack -adf --window=250000 --depth=100000
> >> 
> >>   After 280 minutes or so it finished, but the strange thing was that
> >>   the resulting pack file was larger than before.  I had expected it
> >>   to be smaller, or at least the same size as before.
>   [snip]
> >>   I can think of one thing which is special about the "--depth=100000"
> >>   number, and that is that it is now larger than the total number of
> >>   objects in the pack, which is around 96000 to 97000, or so.
> >
> > No, the depth should have zero negative influence on the pack size.  
> > For tight compression, the larger the better.  What this will impact 
> > though is runtime access to the pack data afterward.  The deeper a 
> > given object is, the slower its access will be.  But since the object 
> > recency order tends to put newer objects at the top of a delta chain,
> > this should impact older objects more than recent ones.
> 
>   I have done some more tests, and have copied the whole git/ directory
>   to a new directory (such that I do not accidentally add or delete any
>   objects/commits), and have made the following table:
> 
>   All pack file sizes, F, below were computed with the following git
>   command:
> 
>       git repack -adf --window=250000 --depth=D
> 
>      D   |     F      | (F - F_prev) / (D - D_prev)
>   -------|------------|----------------------------
>     5000 |  19129934  |
>    10000 |  19128956  |    -978 /  5000 =  -0.1956
>    15000 |  19126077  |   -2879 /  5000 =  -0.5758
>    20000 |  19126077  |       0 /  5000 =   0
>    25000 |  19126077  |       0 /  5000 =   0
>    30000 |  19197575  |   71498 /  5000 =  14.2996
>    45000 |  19312240  |  114665 / 15000 =   7.6443
>    60000 |  19560083  |  247843 / 15000 =  16.5229
>    75000 |  19803043  |  242960 / 15000 =  16.1973
>    90000 |  19669923  | -133120 / 15000 =  -8.8746
>    95000 |  20463780  |  793857 /  5000 = 158.7714
> 
>   From the table it seems that you get the smallest pack file (for this
>   particular repository) when --depth value is somewhere between 15000
>   and 25000.  And when the --depth value was 95000, the resulting pack
>   file was (- 20463780 19126077) = 1 337 703 bytes, about 1.28 MiB or 7%,
>   larger than this.

This is a bit intriguing.

Of course, before going any further, you must realize that a depth of 
15000 is a bit excessive.  With a delta chain that deep, access to the 
object at the end of the chain requires that 14999 other objects be 
accessed before the 15000th one is retrieved.  This will have horrible 
runtime performance for something like a 10% reduction in the best 
cases, which is probably not a good tradeoff.

This being said, I still stand by my assertion that, in theory, greater 
delta depth should not make the pack bigger.  And your table appears to 
confirm that, even to the point of reaching a stable size as one would 
expect, until a breaking point is reached after which results tend to 
become rather random.

What I'm suspecting in that case is some computation overflow in 
try_delta().  Consider for instance this piece:

    max_size = max_size * (max_depth - src->depth) /
                                            (max_depth - ref_depth + 1);

[ This is the threshold slope I was talking about, but contrary to
  what I said before, it is affected by the depth, not the window size. ]

In this case, if you have a max_depth of 95000, then any object larger 
than 45210 bytes will cause a multiplication overflow (with 32-bit 
arithmetic, 2^32 / 95000 is about 45210), and the resulting max_size 
will wrap to some random smaller value than expected depending on the 
remaining bits.  For example, suppose max_size = 45211, max_depth = 
95000 and src->depth = 0: then you should have max_size still equal to 
45211, but in this case it'll become 0 and no delta will be attempted 
at all.  The number of deltas reported at the end of the repack process 
probably reflects that.
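
The wrap-around can be reproduced outside of GIT.  The sketch below 
emulates the expression quoted from try_delta() with a fixed-width 
unsigned product; the 32-bit width and ref_depth = 1 are assumptions 
made for the illustration, and the function names are invented here.

```python
def scaled_max_size(max_size, max_depth, src_depth, ref_depth=1, bits=32):
    """Emulate max_size * (max_depth - src_depth) / (max_depth - ref_depth + 1)
    where the product is truncated to an unsigned integer of `bits` bits.

    Assumption: the product is computed in a 32-bit unsigned type unless
    another width is passed in.
    """
    mask = (1 << bits) - 1
    product = (max_size * (max_depth - src_depth)) & mask  # wraps on overflow
    return product // (max_depth - ref_depth + 1)


def largest_safe_max_size(max_depth, bits=32):
    """Largest max_size whose product with max_depth still fits in `bits` bits
    (the worst case is src_depth = 0)."""
    return ((1 << bits) - 1) // max_depth


print(largest_safe_max_size(95000))
print(scaled_max_size(45211, 95000, 0))           # truncated product
print(scaled_max_size(45211, 95000, 0, bits=64))  # no truncation
```

With the truncated product the size limit collapses, while the same 
call with a 64-bit product returns the value one would expect.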

> > I doubt there is anything to debug.  In this case the window size is 
> > used to evaluate a threshold slope for matching objects in the delta 
> > search.  What we want is a broader delta tree more than a deep one in 
> > order to have more deltas with a lower depth limit.  Therefore a size 
> > threshold is applied, based on the object distance in the delta search 
> > window (see commit c83f032e and the other ones referenced therein).
> >
> > By providing a big window value, the threshold slope becomes rather flat 
> > and ineffective, and this changes the delta match outcome.  While delta 
> > selection is based on the uncompressed delta result, the compressed size 
> > of different deltas with the same size may vary.  I suspect you might 
> > have been unlucky in that regard and this could explain the negative 
> > effect on the pack size.
> 
>   From the table above it seems that I have been unlucky with _all_
>   --depth values above 25000 or so.

See explanation (and self correction) above.

>   Question: is there some low-level GIT command I can run to compare 2
>   pack files, to maybe be able to see the reason behind the above table?
>   Maybe to see some details about how many deltas there are, how big
>   each one is, total sizes, etc.

Yes -- see the -v option of 'git verify-pack'.


Nicolas

