* [PATCH] pack-objects: never deltify objects bigger than window_memory_limit.
From: Avery Pennarun @ 2010-09-22 10:25 UTC
  To: git, gitster, nico; +Cc: Avery Pennarun

With very large objects, just loading them into the delta window wastes a
huge amount of memory.  In one repo, I have some objects around 1GB in size,
and git-pack-objects seems to require about 8x that much memory to deltify
them, even when the window memory limit is small (e.g. --window-memory=100M).
With this patch, the maximum memory usage is about halved.

Perhaps more importantly, however, disabling deltification for large objects
seems to reduce memory thrashing when you can't fit multiple large objects
into physical RAM at once.  It seems to be the difference between "never
finishes" and "finishes eventually" for me.

Test:

I created a test repo with 10 sequential commits containing a bunch of
nearly-identical 110MB files (just appending a line each time).

Without this patch:

    $ /usr/bin/time git repack -a --window-memory=100M

    Counting objects: 43, done.
    warning: suboptimal pack - out of memory
    Compressing objects: 100% (43/43), done.
    Writing objects: 100% (43/43), done.
    Total 43 (delta 14), reused 0 (delta 0)
    42.79user 1.07system 0:44.53elapsed 98%CPU (0avgtext+0avgdata
      866736maxresident)k
      0inputs+2752outputs (0major+718471minor)pagefaults 0swaps

With this patch:

    $ /usr/bin/time -a git repack -a --window-memory=100M

    Counting objects: 43, done.
    Compressing objects: 100% (30/30), done.
    Writing objects: 100% (43/43), done.
    Total 43 (delta 14), reused 0 (delta 0)
    35.86user 0.65system 0:36.30elapsed 100%CPU (0avgtext+0avgdata
      437568maxresident)k
      0inputs+2768outputs (0major+366137minor)pagefaults 0swaps

It apparently still uses about 4x the memory of the largest object, which is
about twice as good as before, though still kind of awful.  (Ideally, we
wouldn't load the entire large object into memory even once.)

Signed-off-by: Avery Pennarun <apenwarr@gmail.com>
---
 builtin/pack-objects.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 0e81673..9f1a289 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -1791,6 +1791,9 @@ static void prepare_pack(int window, int depth)
 		if (entry->size < 50)
 			continue;
 
+		if (window_memory_limit && entry->size > window_memory_limit)
+			continue;
+
 		if (entry->no_try_delta)
 			continue;
 
-- 
1.7.3.1.gca9d1


* Re: [PATCH] pack-objects: never deltify objects bigger than window_memory_limit.
From: Nicolas Pitre @ 2010-09-22 12:00 UTC
  To: Avery Pennarun; +Cc: git, gitster

On Wed, 22 Sep 2010, Avery Pennarun wrote:

> With very large objects, just loading them into the delta window wastes a
> huge amount of memory.  In one repo, I have some objects around 1GB in size,
> and git-pack-objects seems to require about 8x that much memory to deltify
> them, even when the window memory limit is small (e.g. --window-memory=100M).
> With this patch, the maximum memory usage is about halved.
> 
> Perhaps more importantly, however, disabling deltification for large objects
> seems to reduce memory thrashing when you can't fit multiple large objects
> into physical RAM at once.  It seems to be the difference between "never
> finishes" and "finishes eventually" for me.
> 
> Test:
> 
> I created a test repo with 10 sequential commits containing a bunch of
> nearly-identical 110MB files (just appending a line each time).
> 
> Without this patch:
> 
>     $ /usr/bin/time git repack -a --window-memory=100M
> 
>     Counting objects: 43, done.
>     warning: suboptimal pack - out of memory
>     Compressing objects: 100% (43/43), done.
>     Writing objects: 100% (43/43), done.
>     Total 43 (delta 14), reused 0 (delta 0)
>     42.79user 1.07system 0:44.53elapsed 98%CPU (0avgtext+0avgdata
>       866736maxresident)k
>       0inputs+2752outputs (0major+718471minor)pagefaults 0swaps
> 
> With this patch:
> 
>     $ /usr/bin/time -a git repack -a --window-memory=100M
> 
>     Counting objects: 43, done.
>     Compressing objects: 100% (30/30), done.
>     Writing objects: 100% (43/43), done.
>     Total 43 (delta 14), reused 0 (delta 0)
>     35.86user 0.65system 0:36.30elapsed 100%CPU (0avgtext+0avgdata
>       437568maxresident)k
>       0inputs+2768outputs (0major+366137minor)pagefaults 0swaps
> 
> It apparently still uses about 4x the memory of the largest object, which is
> about twice as good as before, though still kind of awful.  (Ideally, we
> wouldn't load the entire large object into memory even once.)

To not load big objects into memory, we'd have to add support for the 
core.bigFileThreshold config option in more places.

>  builtin/pack-objects.c |    3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
> 
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index 0e81673..9f1a289 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -1791,6 +1791,9 @@ static void prepare_pack(int window, int depth)
>  		if (entry->size < 50)
>  			continue;
>  
> +		if (window_memory_limit && entry->size > window_memory_limit)
> +                	continue;
> +

I think you should even use entry->size/2 here, or even entry->size/4.
The reasons for that are: 1) you need at least 2 such similar objects in
memory to find a possible delta, and 2) the reference object to delta
against has to be block indexed, and that index table is almost the same
size as the object itself, especially on 64-bit machines.


Nicolas


* Re: [PATCH] pack-objects: never deltify objects bigger than window_memory_limit.
From: Avery Pennarun @ 2010-09-23  5:01 UTC
  To: Nicolas Pitre; +Cc: git, gitster

On Wed, Sep 22, 2010 at 5:00 AM, Nicolas Pitre <nico@fluxnic.net> wrote:
> On Wed, 22 Sep 2010, Avery Pennarun wrote:
> To not load big objects into memory, we'd have to add support for the
> core.bigFileThreshold config option in more places.

Well, it could be automatic in git-pack-objects if it were enabled when
the object is larger than window_memory_limit.  But I see your point.
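To make that concrete, here is a purely hypothetical sketch of what I mean
(none of this wiring exists; I'm only assuming the existing
big_file_threshold global that backs core.bigFileThreshold):

    /* Hypothetical sketch: derive an effective cutoff from the window
     * memory budget and the existing core.bigFileThreshold knob, so that
     * oversized entries are skipped automatically.  big_file_threshold is
     * the global behind that config option; the rest of this wiring does
     * not exist today. */
    unsigned long effective_limit = big_file_threshold;
    if (window_memory_limit && window_memory_limit < effective_limit)
            effective_limit = window_memory_limit;
    if (entry->size >= effective_limit)
            continue;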

Anyway, it requires more knowledge of the guts of git (plus patience
and motivation) than I currently have.  My patch was sufficient to get
my previously-un-packable repo packed on my system, which was a big
step forward.

>> +             if (window_memory_limit && entry->size > window_memory_limit)
>> +                     continue;
>
> I think you should even use entry->size/2 here, or even entry->size/4.
> The reasons for that are: 1) you need at least 2 such similar objects in
> memory to find a possible delta, and 2) the reference object to delta
> against has to be block indexed, and that index table is almost the same
> size as the object itself, especially on 64-bit machines.

I would be happy to re-spin my exciting two-line patch to include
whatever threshold you think is best :)

Good point about #1; dividing by two does seem like a good idea.
(Though arguably you could construct a small object from a large one
just by deleting a lot of stuff, and then the small object plus the
large one could fit in a window not much bigger than the large object
itself.  I doubt this happens frequently, though.)

As for #2, does the block index table count toward the
window_memory_limit right now?  If it doesn't, then dividing by 4
doesn't seem to make sense.  If it does, then it does make sense, I
guess.  (Though maybe dividing by 3 is a bit more generous just in
case.)

Anyway, the window_memory_limit stuff seems pretty weakly implemented
overall; with or without my patch, it seems like you could end up
using vastly more memory than the window size.  Imagine setting
window_memory_limit to 10MB and packing a 1GB object; it still ends up
in memory at least two or three times, for a 200x overshoot.  Without
my patch, it's about 2x worse than even that, though I don't know why.

So my question is: should we try to treat window_memory_limit as
"maximum memory used by the pack operation," or should it literally be
the maximum size of the window?  If the former, well, we've got a lot
of work ahead of us, but dividing by 4 might be appropriate.  If the
latter, dividing by two is probably the most we should do.

FWIW, the test timings described in my commit message are unaffected
whether we divide by 1, 2, or 4, because as it happens, I didn't have
any objects between 25 and 100 MB in size :)

Have fun,

Avery
