git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH] gc: default aggressive depth to 50
@ 2016-08-11 16:13 Jeff King
  2016-08-11 17:20 ` Jeff King
  0 siblings, 1 reply; 3+ messages in thread
From: Jeff King @ 2016-08-11 16:13 UTC (permalink / raw)
  To: git

This commit message is long and has lots of background and
numbers. The summary is: the current default of 250 doesn't
save much space, and costs CPU. It's not a good tradeoff.
Read on for details.

The "--aggressive" flag to git-gc does three things:

  1. use "-f" to throw out existing deltas and recompute from
     scratch

  2. use "--window=250" to look harder for deltas

  3. use "--depth=250" to make longer delta chains

Items (1) and (2) are good matches for an "aggressive"
repack. They ask the repack to do more computation work in
the hopes of getting a better pack. You pay the costs during
the repack, and other operations see only the benefit.

Item (3) is not so clear. Allowing longer chains means fewer
restrictions on the deltas, which means potentially finding
better ones and saving some space. But it also means that
operations which access the deltas have to follow longer
chains, which affects their performance. So it's a tradeoff,
and it's not clear that the tradeoff is even a good one.

The existing "250" numbers for "--aggressive" come
originally from this thread:

  http://public-inbox.org/git/alpine.LFD.0.9999.0712060803430.13796@woody.linux-foundation.org/

where Linus says:

  So when I said "--depth=250 --window=250", I chose those
  numbers more as an example of extremely aggressive
  packing, and I'm not at all sure that the end result is
  necessarily wonderfully usable. It's going to save disk
  space (and network bandwidth - the delta's will be re-used
  for the network protocol too!), but there are definitely
  downsides too, and using long delta chains may
  simply not be worth it in practice.

There are some numbers in that thread, but they're mostly
focused on the improved window size, and measure the
improvement from --depth=250 and --window=250 together.
E.g.:

  http://public-inbox.org/git/9e4733910712062006l651571f3w7f76ce64c6650dff@mail.gmail.com/

talks about the improved run-time of "git-blame", which
comes from the reduced pack size. But most of that reduction
is coming from --window=250, whereas most of the extra costs
come from --depth=250. There's a link in that thread showing
that increasing the depth beyond 50 doesn't seem to help
much with the size:

  https://vcscompare.blogspot.com/2008/06/git-repack-parameters.html

but again, no discussion of the timing impact.

In an earlier thread from Ted Ts'o which discussed setting
the non-aggressive default (from 10 to 50):

  http://public-inbox.org/git/20070509134958.GA21489%40thunk.org/

we have more numbers, with the conclusion that going past 50
does not help size much, and hurts the speed of normal
operations.

So from that, we might guess that 50 is actually a sweet
spot, even for aggressive, if we interpret aggressive to
"spend time now to make a better pack". It is not clear that
"--depth=250" is actually a better pack. It may be slightly
_smaller_, but it carries a run-time penalty.

Here are some more recent timings I did to verify that. They
show three things:

  - the size of the resulting pack (so disk saved to store,
    bandwidth saved on clones/fetches)

  - the cost of "rev-list --objects --all", which shows the
    effect of the delta chains on trees (commits typically
    don't delta, and the command doesn't touch the blobs at
    all)

  - the cost of "log -Sfoo", which will additionally access
    each blob

All cases were repacked with "git repack -adf --depth=$d
--window=250" (so basically, what would happen if we tweaked
the "gc --aggressive" default depth).

The timings are all wall-clock best-of-3. The machine itself
has plenty of RAM compared to the repositories (which is
probably typical of most workstations these days), so we're
really measuring CPU usage, as the whole thing will be in
disk cache after the first run.

The core.deltaBaseCacheLimit is at its default of 96MiB.
It's possible that tweaking it would have some impact on the
tests, as some of them (especially "log -S" on a large repo)
are likely to overflow that. But bumping that carries a
run-time memory cost, so for these tests, I focused on what
we could do just with the on-disk pack tradeoffs.

Each test is done for four depths: 250 (the current value),
50 (the current default that tested well previously), 100
(to show something on the larger side, which previous tests
showed was not a good tradeoff), and 10 (the very old
default, which previous tests showed was worse than 50).

Here are the numbers for linux.git:

   depth |  size |  %    | rev-list |  %     | log -Sfoo |   %
  -------+-------+-------+----------+--------+-----------+-------
    250  | 967MB |  n/a  | 48.159s  |   n/a  | 378.088   |   n/a
    100  | 971MB | +0.4% | 41.471s  | -13.9% | 342.060   |  -9.5%
     50  | 979MB | +1.2% | 37.778s  | -21.6% | 311.040s  | -17.7%
     10  | 1.1GB | +6.6% | 32.518s  | -32.5% | 279.890s  | -25.9%

and for git.git:

   depth |  size |  %    | rev-list |  %     | log -Sfoo |   %
  -------+-------+-------+----------+--------+-----------+-------
    250  |  48MB |  n/a  |  2.215s  |   n/a  |  20.922s  |   n/a
    100  |  49MB | +0.5% |  2.140s  |  -3.4% |  17.736s  | -15.2%
     50  |  49MB | +1.7% |  2.099s  |  -5.2% |  15.418s  | -26.3%
     10  |  53MB | +9.3% |  2.001s  |  -9.7% |  12.677s  | -39.4%

You can see that that the CPU savings for regular operations improves as we
decrease the depth. The savings are less for "rev-list" on a smaller repository
than they are for blob-accessing operations, or even rev-list on a larger
repository. This may mean that a larger delta cache would help (though setting
core.deltaBaseCacheLimit by itself doesn't).

But we can also see that the space savings are not that great as the depth goes
higher. Saving 5-10% between 10 and 50 is probably worth the CPU tradeoff.
Saving 1% to go from 50 to 100, or another 0.5% to go from 100 to 250 is
probably not.

Signed-off-by: Jeff King <peff@peff.net>
---
 Documentation/config.txt | 2 +-
 builtin/gc.c             | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index 0bcb679..8eb403d 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -1366,7 +1366,7 @@ fsck.skipList::
 gc.aggressiveDepth::
 	The depth parameter used in the delta compression
 	algorithm used by 'git gc --aggressive'.  This defaults
-	to 250.
+	to 50.
 
 gc.aggressiveWindow::
 	The window size parameter used in the delta compression
diff --git a/builtin/gc.c b/builtin/gc.c
index 332bcf7..069950d 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -28,7 +28,7 @@ static const char * const builtin_gc_usage[] = {
 
 static int pack_refs = 1;
 static int prune_reflogs = 1;
-static int aggressive_depth = 250;
+static int aggressive_depth = 50;
 static int aggressive_window = 250;
 static int gc_auto_threshold = 6700;
 static int gc_auto_pack_limit = 50;
-- 
2.9.2.790.gaa5bc72

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH] gc: default aggressive depth to 50
  2016-08-11 16:13 [PATCH] gc: default aggressive depth to 50 Jeff King
@ 2016-08-11 17:20 ` Jeff King
  2016-08-11 18:52   ` Junio C Hamano
  0 siblings, 1 reply; 3+ messages in thread
From: Jeff King @ 2016-08-11 17:20 UTC (permalink / raw)
  To: git

On Thu, Aug 11, 2016 at 12:13:09PM -0400, Jeff King wrote:

> Here are the numbers for linux.git:
> 
>    depth |  size |  %    | rev-list |  %     | log -Sfoo |   %
>   -------+-------+-------+----------+--------+-----------+-------
>     250  | 967MB |  n/a  | 48.159s  |   n/a  | 378.088   |   n/a
>     100  | 971MB | +0.4% | 41.471s  | -13.9% | 342.060   |  -9.5%
>      50  | 979MB | +1.2% | 37.778s  | -21.6% | 311.040s  | -17.7%
>      10  | 1.1GB | +6.6% | 32.518s  | -32.5% | 279.890s  | -25.9%
> [...]
> 
> You can see that that the CPU savings for regular operations improves as we
> decrease the depth. The savings are less for "rev-list" on a smaller repository
> than they are for blob-accessing operations, or even rev-list on a larger
> repository. This may mean that a larger delta cache would help (though setting
> core.deltaBaseCacheLimit by itself doesn't).

The problem with deltaBaseCacheLimit is that it only changes the memory
parameter, but there are a fixed number of slots in the data structure.
Bumping it like this:

diff --git a/sha1_file.c b/sha1_file.c
index 02940f1..ca79703 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2073,7 +2073,7 @@ static void *unpack_compressed_entry(struct packed_git *p,
 	return buffer;
 }
 
-#define MAX_DELTA_CACHE (256)
+#define MAX_DELTA_CACHE (1024)
 
 static size_t delta_base_cached;
 

along with the cache size does help (this was discussed a year or two
ago, but nobody ever followed up with numbers or patches).

Here are best-of-3 numbers for rev-list on linux.git at each depth with
that patch and "-c core.deltabasecachelimit=256m":

  depth |   time
  ----------------
   250  | 36.524s
   100  | 33.200s
    50  | 31.065s
    10  | 28.266s

So there's clearly a lot of room for improvement on the reading side in
general. And doing so clearly lessens the impact of the delta chains.
But you still get a 15% improvement moving to depth=50, versus only a
1.2% storage increase. So I don't think it fundamentally changes the
conclusion of the "depth=50" patch I'm responding to.

I don't think bumping MAX_DELTA_CACHE naively is a good idea, though. I
seem to recall that it has scaling problems as it grows, so we may want
a better data structure (but I haven't looked at it recently enough to
say anything intelligent). I might work on it in the near future, but if
anybody is interested, please be my guest.

-Peff

^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH] gc: default aggressive depth to 50
  2016-08-11 17:20 ` Jeff King
@ 2016-08-11 18:52   ` Junio C Hamano
  0 siblings, 0 replies; 3+ messages in thread
From: Junio C Hamano @ 2016-08-11 18:52 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> On Thu, Aug 11, 2016 at 12:13:09PM -0400, Jeff King wrote:
>
>> Here are the numbers for linux.git:
>> 
>>    depth |  size |  %    | rev-list |  %     | log -Sfoo |   %
>>   -------+-------+-------+----------+--------+-----------+-------
>>     250  | 967MB |  n/a  | 48.159s  |   n/a  | 378.088   |   n/a
>>     100  | 971MB | +0.4% | 41.471s  | -13.9% | 342.060   |  -9.5%
>>      50  | 979MB | +1.2% | 37.778s  | -21.6% | 311.040s  | -17.7%
>>      10  | 1.1GB | +6.6% | 32.518s  | -32.5% | 279.890s  | -25.9%
>> [...]
>> 
>> You can see that that the CPU savings for regular operations improves as we
>> decrease the depth. The savings are less for "rev-list" on a smaller repository
>> than they are for blob-accessing operations, or even rev-list on a larger
>> repository. This may mean that a larger delta cache would help (though setting
>> core.deltaBaseCacheLimit by itself doesn't).
>
> The problem with deltaBaseCacheLimit is that it only changes the memory
> parameter, but there are a fixed number of slots in the data structure.
> Bumping it like this:
>
> diff --git a/sha1_file.c b/sha1_file.c
> index 02940f1..ca79703 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -2073,7 +2073,7 @@ static void *unpack_compressed_entry(struct packed_git *p,
>  	return buffer;
>  }
>  
> -#define MAX_DELTA_CACHE (256)
> +#define MAX_DELTA_CACHE (1024)
>  
>  static size_t delta_base_cached;
>
> along with the cache size does help (this was discussed a year or two
> ago, but nobody ever followed up with numbers or patches).

Yeah, and I also think Linus's "--depth=250 is just a sample; it
will not perform well" already cited the number of delta-cache
entries being the limiting factor.

> I don't think bumping MAX_DELTA_CACHE naively is a good idea, though. I
> seem to recall that it has scaling problems as it grows, so we may want
> a better data structure (but I haven't looked at it recently enough to
> say anything intelligent).

Me neither.  In any case, I do think reducing the aggressive depth
down to 50 is a very sensible move.  I also suspect that window size
may want to be a bit increased (or even made dynamic; the first time
we need the window size determined is after to_pack.objects[] array
is fully populated, so we could use the number of commits as one of
the hint, for example), but that can be treated as a separate topic.



^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2016-08-11 18:52 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-11 16:13 [PATCH] gc: default aggressive depth to 50 Jeff King
2016-08-11 17:20 ` Jeff King
2016-08-11 18:52   ` Junio C Hamano

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).