git@vger.kernel.org mailing list mirror (one of many)
* Performance of "git gc..." is extremely bad in some cases
@ 2021-03-08 21:15 Anthony Muller
  2021-03-08 22:29 ` Bryan Turner
  0 siblings, 1 reply; 5+ messages in thread
From: Anthony Muller @ 2021-03-08 21:15 UTC (permalink / raw)
  To: git

What did you do before the bug happened? (Steps to reproduce your issue)

git clone https://github.com/notracking/hosts-blocklists
cd hosts-blocklists
git reflog expire --all --expire=now && git gc --prune=now --aggressive


What did you expect to happen? (Expected behavior)

Running gc on a ~300 MB repo should not take 1 hour 55 minutes when
running gc on a 2.6 GB repo (LLVM) only takes 24 minutes.


What happened instead? (Actual behavior)

The command took 1h 55m to complete on a ~300 MB repo and used enough
resources that the machine was almost unusable.


What's different between what you expected and what actually happened?

The compression stage uses the majority of the resources and time. Compression
itself, compared to something like zlib or lzma, should not take very long.
While more may be happening as objects are compressed, both the time
gc takes to compress the objects and the resources it consumes seem
unreasonable.

Memory: RSS = 3451152 KB (3.29 GB), VSZ = 29286272 KB (27.92 GB)
Time: 12902.83s user 8995.41s system 315% cpu 1:55:36.73 total

I've seen this issue with a number of repos, and the size of the repo does
not determine whether it happens. LLVM @ 2.6 GB worked flawlessly, a 900 MB
repo never finished, this 300 MB repo takes forever, and if you test something
like Chromium, git will just crash.


[System Info]
hardware: 2.9 GHz quad-core i7
git version:
git version 2.30.0
cpu: x86_64
no commit associated with this build
sizeof-long: 8
sizeof-size_t: 8
shell-path: /bin/sh
uname: Darwin 19.6.0 Darwin Kernel Version 19.6.0: Tue Jan 12 22:13:05 PST 2021; root:xnu-6153.141.16~1/RELEASE_X86_64 x86_64
compiler info: clang: 12.0.0 (clang-1200.0.32.28)
libc info: no libc information available
$SHELL (typically, interactive shell): /usr/local/bin/zsh



* Re: Performance of "git gc..." is extremely bad in some cases
  2021-03-08 21:15 Performance of "git gc..." is extremely bad in some cases Anthony Muller
@ 2021-03-08 22:29 ` Bryan Turner
       [not found]   ` <178140c3b3b.c7a29306868075.2037370475662478386@monospace.sh>
  2021-03-08 23:56   ` brian m. carlson
  0 siblings, 2 replies; 5+ messages in thread
From: Bryan Turner @ 2021-03-08 22:29 UTC (permalink / raw)
  To: Anthony Muller; +Cc: git

On Mon, Mar 8, 2021 at 1:32 PM Anthony Muller <anthony@monospace.sh> wrote:
>
> What did you do before the bug happened? (Steps to reproduce your issue)
>
> git clone https://github.com/notracking/hosts-blocklists
> cd hosts-blocklists
> git reflog expire --all --expire=now && git gc --prune=now --aggressive

--aggressive tells git gc to discard all of its existing delta chains
and go find new ones, and to be fairly aggressive in how it looks for
candidates. This is going to be the primary source of the resource
usage you see, as well as the time.

Aggressive GCs are something you do once in a (very great) while. If
you try this without the --aggressive, how does it look?
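
For the record, the comparison is just (a sketch; run inside the clone
from the report):

  time git gc --prune=now               # reuses existing deltas from the clone
  time git gc --prune=now --aggressive  # discards and recomputes all deltas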

>
>
> What did you expect to happen? (Expected behavior)
>
> Running gc on a ~300 MB repo should not take 1 hour 55 minutes when
> running gc on a 2.6 GB repo (LLVM) only takes 24 minutes.
>
>
> What happened instead? (Actual behavior)
>
> The command took 1h 55m to complete on a ~300 MB repo and used enough
> resources that the machine was almost unusable.
>
>
> What's different between what you expected and what actually happened?
>
> The compression stage uses the majority of the resources and time. Compression
> itself, compared to something like zlib or lzma, should not take very long.
> While more may be happening as objects are compressed, both the time
> gc takes to compress the objects and the resources it consumes seem
> unreasonable.

The compression happening here is delta compression, not simple
compression like zip. Git searches across the repository for similar
objects and stores them as chains with a base object and (essentially)
instructions for converting that base object into another object.
That's significantly more resource-intensive work than zipping some
data.
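
If you're curious what that delta search produced, one way is to
inspect the pack afterward (a sketch; assumes a single pack, as you'd
have after a fresh gc):

  # -v lists each object's type, size, packed size, and, for deltas,
  # the chain depth and base object; the summary at the end counts
  # objects at each chain length
  git verify-pack -v .git/objects/pack/pack-*.idx | tail -20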

>
> Memory: RSS = 3451152 KB (3.29 GB), VSZ = 29286272 KB (27.92 GB)
> Time: 12902.83s user 8995.41s system 315% cpu 1:55:36.73 total

Git offers several knobs that can be used to influence (though not
necessarily control) its resource usage. On 64-bit Linux the defaults
are 1 thread per logical CPU (so hyperthreaded CPUs use double) and
_unlimited_ memory usage per thread. You might want to investigate
some options like pack.threads and pack.windowMemory to apply some
constraints.
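
For example (illustrative values, not a recommendation):

  git config pack.threads 4        # cap the number of delta search threads
  git config pack.windowMemory 1g  # cap the window memory used per thread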

>
> I've seen this issue with a number of repos, and the size of the repo does
> not determine whether it happens. LLVM @ 2.6 GB worked flawlessly, a 900 MB
> repo never finished, this 300 MB repo takes forever, and if you test something
> like Chromium, git will just crash.
>
>
> [System Info]
> hardware: 2.9 GHz quad-core i7
> git version:
> git version 2.30.0
> cpu: x86_64
> no commit associated with this build
> sizeof-long: 8
> sizeof-size_t: 8
> shell-path: /bin/sh
> uname: Darwin 19.6.0 Darwin Kernel Version 19.6.0: Tue Jan 12 22:13:05 PST 2021; root:xnu-6153.141.16~1/RELEASE_X86_64 x86_64
> compiler info: clang: 12.0.0 (clang-1200.0.32.28)
> libc info: no libc information available
> $SHELL (typically, interactive shell): /usr/local/bin/zsh
>

Hope this helps!
-b


* Re: Performance of "git gc..." is extremely bad in some cases
       [not found]   ` <178140c3b3b.c7a29306868075.2037370475662478386@monospace.sh>
@ 2021-03-08 23:55     ` Bryan Turner
  0 siblings, 0 replies; 5+ messages in thread
From: Bryan Turner @ 2021-03-08 23:55 UTC (permalink / raw)
  To: Anthony Muller, Git Users

Re-adding the list.

On Mon, Mar 8, 2021 at 2:54 PM Anthony Muller <anthony@monospace.sh> wrote:
>
>  ---- On Mon, 08 Mar 2021 22:29:16 +0000 Bryan Turner <bturner@atlassian.com> wrote ----
>  > On Mon, Mar 8, 2021 at 1:32 PM Anthony Muller <anthony@monospace.sh> wrote:
>  > >
>  > > What did you do before the bug happened? (Steps to reproduce your issue)
>  > >
>  > > git clone https://github.com/notracking/hosts-blocklists
>  > > cd hosts-blocklists
>  > > git reflog expire --all --expire=now && git gc --prune=now --aggressive
>  >
>  > --aggressive tells git gc to discard all of its existing delta chains
>  > and go find new ones, and to be fairly aggressive in how it looks for
>  > candidates. This is going to be the primary source of the resource
>  > usage you see, as well as the time.
>  >
>  > Aggressive GCs are something you do once in a (very great) while. If
>  > you try this without the --aggressive, how does it look?
>
> Hi Bryan,
>
> Without --aggressive it's fine, and I do expect it to take longer with --aggressive.
>
> I find it very odd that a repo ~8x the size, and with probably 400x as many objects, took 1/4 the time, though. I would think size and object count would play a role in time and resources.

Looking at that blocklists repository, it doesn't have many files or
commits, but the files are pretty large (10-25MB). For delta
compression, large files can cause a lot of pain.

Setting core.bigFileThreshold=5m (reduced from its default of 512m)
and pack.windowMemory=1g "fixes" the "problem" for me locally, at
least (which is to say it changes the behavior). The GC runs in under
10 minutes:
$ /usr/bin/time -l git gc --prune=now --aggressive
Enumerating objects: 10777, done.
Counting objects: 100% (10777/10777), done.
Delta compression using up to 20 threads
Compressing objects: 100% (8672/8672), done.
Writing objects: 100% (10777/10777), done.
Reusing bitmaps: 101, done.
Selecting bitmap commits: 2146, done.
Building bitmaps: 100% (126/126), done.
Total 10777 (delta 3986), reused 6784 (delta 0)
      298.00 real       996.76 user        18.84 sys
          9284980736  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
             2861811  page reclaims
                   1  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                 296  signals received
                 172  voluntary context switches
              171245  involuntary context switches
            20586171  instructions retired
            28100595  cycles elapsed
              880640  peak memory footprint

Of course, that also takes the size of the repository from 367MB to
2.3GB--not exactly your desired outcome if you're trying to save
space.
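
(For reference, a one-shot way to apply both settings to a single run,
rather than writing them into the config, is something like:

  git -c core.bigFileThreshold=5m -c pack.windowMemory=1g \
      gc --prune=now --aggressive

since -c overrides are exported to the environment and reach the
repack that gc spawns.)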

From there I tried just reducing the threads from 20 to 8 and keeping
the 1g window memory limit, but leaving core.bigFileThreshold at its
default. That allows everything to be delta compressed, and for me it
completes in just under 12 minutes:
$ /usr/bin/time -l git gc --prune=now --aggressive
Enumerating objects: 10777, done.
Counting objects: 100% (10777/10777), done.
Delta compression using up to 8 threads
Compressing objects: 100% (10077/10077), done.
Writing objects: 100% (10777/10777), done.
Reusing bitmaps: 101, done.
Selecting bitmap commits: 2146, done.
Building bitmaps: 100% (126/126), done.
Total 10777 (delta 5387), reused 5383 (delta 0)
      713.98 real      3053.41 user        31.91 sys
         13408837632  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
             3804319  page reclaims
                   1  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                 712  signals received
                  57  voluntary context switches
             1011681  involuntary context switches
            20568579  instructions retired
            31734809  cycles elapsed
              872448  peak memory footprint

That also reduced the repository from 367MB to 320MB. (Technically
from 2.3GB to 320MB, since I ran this after the earlier attempt.)

Of course, there's a machine difference to consider here as well. I'm
guessing you're on a MacBook Pro, based on the specs part of the bug
report. My testing here is on a 10-core iMac Pro with 64GB of RAM, so
some of the difference may just be that I'm on a less constrained
system.

>
> What factors would make that happen? Is it a combination of more commits with fewer objects?

Big files are the biggest issue, in my experience. The total number of
objects certainly has an impact (it's not really about object type, as
far as I can tell), but having big files (where "big" here is anything
larger than a normal source code file, which is typically well under
1MB) is likely to balloon both time and resource consumption.

>
> I've been using aggressive after cloning repos I use primarily for reference/offline/etc to recover a lot of wasted space.

To some extent, I'm not sure there's an easy answer for this. It may
come down to looking at the repositories before you do a local GC to
see what "shape" they have (starting size on disk, in-repository file
sizes, etc.) and deciding from there whether the savings are likely to
be worth the time investment.
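
A few quick commands give a rough picture of that shape (a sketch;
where to draw the line is a judgment call):

  du -sh .git                                  # starting size on disk
  git count-objects -v                         # loose vs. packed object counts
  git rev-list --count --all                   # total number of commits
  git ls-tree -r -l HEAD | sort -k4 -n | tail  # largest files at the tip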

>
>  >
>  > >
>  > >
>  > > What did you expect to happen? (Expected behavior)
>  > >
>  > > Running gc on a ~300 MB repo should not take 1 hour 55 minutes when
>  > > running gc on a 2.6 GB repo (LLVM) only takes 24 minutes.
>  > >
>  > >
>  > > What happened instead? (Actual behavior)
>  > >
>  > > The command took 1h 55m to complete on a ~300 MB repo and used enough
>  > > resources that the machine was almost unusable.
>  > >
>  > >
>  > > What's different between what you expected and what actually happened?
>  > >
>  > > The compression stage uses the majority of the resources and time. Compression
>  > > itself, compared to something like zlib or lzma, should not take very long.
>  > > While more may be happening as objects are compressed, both the time
>  > > gc takes to compress the objects and the resources it consumes seem
>  > > unreasonable.
>  >
>  > The compression happening here is delta compression, not simple
>  > compression like zip. Git searches across the repository for similar
>  > objects and stores them as chains with a base object and (essentially)
>  > instructions for converting that base object into another object.
>  > That's significantly more resource-intensive work than zipping some
>  > data.
>  >
>  > >
>  > > Memory: RSS = 3451152 KB (3.29 GB), VSZ = 29286272 KB (27.92 GB)
>  > > Time: 12902.83s user 8995.41s system 315% cpu 1:55:36.73 total
>  >
>  > Git offers several knobs that can be used to influence (though not
>  > necessarily control) its resource usage. On 64-bit Linux the defaults
>  > are 1 thread per logical CPU (so hyperthreaded CPUs use double) and
>  > _unlimited_ memory usage per thread. You might want to investigate
>  > some options like pack.threads and pack.windowMemory to apply some
>  > constraints.
>  >
>  > >
>  > > I've seen this issue with a number of repos, and the size of the repo does
>  > > not determine whether it happens. LLVM @ 2.6 GB worked flawlessly, a 900 MB
>  > > repo never finished, this 300 MB repo takes forever, and if you test something
>  > > like Chromium, git will just crash.

I should add that for something like Chromium, and potentially
whatever 900MB repository you tested with, you're very likely to need
to do some explicit configuration for things like threads/window
memory unless you're on a _very_ beefy machine. The default unlimited
behavior is very likely to run afoul of the OOM killer (or something
similar).

>  > >
>  > >
>  > > [System Info]
>  > > hardware: 2.9 GHz quad-core i7
>  > > git version:
>  > > git version 2.30.0
>  > > cpu: x86_64
>  > > no commit associated with this build
>  > > sizeof-long: 8
>  > > sizeof-size_t: 8
>  > > shell-path: /bin/sh
>  > > uname: Darwin 19.6.0 Darwin Kernel Version 19.6.0: Tue Jan 12 22:13:05 PST 2021; root:xnu-6153.141.16~1/RELEASE_X86_64 x86_64
>  > > compiler info: clang: 12.0.0 (clang-1200.0.32.28)
>  > > libc info: no libc information available
>  > > $SHELL (typically, interactive shell): /usr/local/bin/zsh
>  > >
>  >

Hope this helps!
-b


* Re: Performance of "git gc..." is extremely bad in some cases
  2021-03-08 22:29 ` Bryan Turner
       [not found]   ` <178140c3b3b.c7a29306868075.2037370475662478386@monospace.sh>
@ 2021-03-08 23:56   ` brian m. carlson
  2021-03-09  0:14     ` Anthony Muller
  1 sibling, 1 reply; 5+ messages in thread
From: brian m. carlson @ 2021-03-08 23:56 UTC (permalink / raw)
  To: Bryan Turner; +Cc: Anthony Muller, git


On 2021-03-08 at 22:29:16, Bryan Turner wrote:
> On Mon, Mar 8, 2021 at 1:32 PM Anthony Muller <anthony@monospace.sh> wrote:
> >
> > What did you do before the bug happened? (Steps to reproduce your issue)
> >
> > git clone https://github.com/notracking/hosts-blocklists
> > cd hosts-blocklists
> > git reflog expire --all --expire=now && git gc --prune=now --aggressive
> 
> --aggressive tells git gc to discard all of its existing delta chains
> and go find new ones, and to be fairly aggressive in how it looks for
> candidates. This is going to be the primary source of the resource
> usage you see, as well as the time.
> 
> Aggressive GCs are something you do once in a (very great) while. If
> you try this without the --aggressive, how does it look?

I should point out that this repository is also rather pathologically
structured.  Almost every commit is an automatic commit updating the
same five files, which are text files ranging from 5 MB to 11 MB.
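
(That's easy to confirm from a clone; for example:

  git log --stat -5   # each commit touches the same few multi-MB list files

shows the pattern immediately.)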

When you use --aggressive, as Bryan pointed out, you're asking to throw
away all the deltas and try really hard to compute all of them fresh.
That's going to use a lot of memory because you're loading many large
text files into memory.  It's also going to use a lot of CPU because
these files do indeed delta extremely well, and computing deltas
on larger files is more expensive, especially when there are many of
them.

And that's just the blobs.  The trees and commits are also going to be
nearly identically structured and will also delta well with virtually
every other similar object of their type.  Normally Git sorts by size,
which helps pick better candidates, but since these are all going to be
identically sized, the performance is going to suffer.

Now, I have the advantage in this case of being someone who's sometimes
on call for the maintenance of Git repositories, and in that capacity it
is obvious to me that this repo is pathologically structured.  But, yeah,
I would definitely not run --aggressive on this repo unless I needed to,
and I would not expect it to perform well.
-- 
brian m. carlson (he/him or they/them)
Houston, Texas, US



* Re: Performance of "git gc..." is extremely bad in some cases
  2021-03-08 23:56   ` brian m. carlson
@ 2021-03-09  0:14     ` Anthony Muller
  0 siblings, 0 replies; 5+ messages in thread
From: Anthony Muller @ 2021-03-09  0:14 UTC (permalink / raw)
  To: brian m. carlson; +Cc: Bryan Turner, git

Thank you, Brian and Bryan. You both clarified what was happening, and now I know what to look for.

I can use a shallow clone for most repos, but there are some I want to keep history for. I don't need a full copy of this repo, but it was a good repo to show the issue I was facing.
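
(For reference, the shallow form is roughly:

  git clone --depth 1 <url>           # shallow: no history at all

and, for the repos where history matters, a partial clone is an
alternative worth knowing about:

  git clone --filter=blob:none <url>  # full history, blobs fetched on demand

though it needs a server that supports partial clone; GitHub does.)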

Thanks again!


 ---- On Mon, 08 Mar 2021 23:56:53 +0000 brian m. carlson <sandals@crustytoothpaste.net> wrote ----
 > On 2021-03-08 at 22:29:16, Bryan Turner wrote:
 > > On Mon, Mar 8, 2021 at 1:32 PM Anthony Muller <anthony@monospace.sh> wrote:
 > > >
 > > > What did you do before the bug happened? (Steps to reproduce your issue)
 > > >
 > > > git clone https://github.com/notracking/hosts-blocklists
 > > > cd hosts-blocklists
 > > > git reflog expire --all --expire=now && git gc --prune=now --aggressive
 > > 
 > > --aggressive tells git gc to discard all of its existing delta chains
 > > and go find new ones, and to be fairly aggressive in how it looks for
 > > candidates. This is going to be the primary source of the resource
 > > usage you see, as well as the time.
 > > 
 > > Aggressive GCs are something you do once in a (very great) while. If
 > > you try this without the --aggressive, how does it look?
 > 
 > I should point out that this repository is also rather pathologically
 > structured.  Almost every commit is an automatic commit updating the
 > same five files, which are text files ranging from 5 MB to 11 MB.
 > 
 > When you use --aggressive, as Bryan pointed out, you're asking to throw
 > away all the deltas and try really hard to compute all of them fresh.
 > That's going to use a lot of memory because you're loading many large
 > text files into memory.  It's also going to use a lot of CPU because
 > these files do indeed delta extremely well, and computing deltas
 > on larger files is more expensive, especially when there are many of
 > them.
 > 
 > And that's just the blobs.  The trees and commits are also going to be
 > nearly identically structured and will also delta well with virtually
 > every other similar object of their type.  Normally Git sorts by size,
 > which helps pick better candidates, but since these are all going to be
 > identically sized, the performance is going to suffer.
 > 
 > Now, I have the advantage in this case of being someone who's sometimes
 > on call for the maintenance of Git repositories, and in that capacity it
 > is obvious to me that this repo is pathologically structured.  But, yeah,
 > I would definitely not run --aggressive on this repo unless I needed to,
 > and I would not expect it to perform well.
 > -- 
 > brian m. carlson (he/him or they/them)
 > Houston, Texas, US
 > 

