From: Mike Hommey <mh@glandium.org>
To: git@vger.kernel.org
Subject: Surprising use of memory and time when repacking mozilla's gecko repository
Date: Thu, 4 Jul 2019 19:05:30 +0900
Message-ID: <20190704100530.smn4rpiekwtfylhz@glandium.org>

Hi,

I was looking at the disk size of the gecko repository on github[1],
which started at 4.7GB. Running `git gc --aggressive` on it brought that
down to 2.0GB, but achieving that required quite some resources.

My first attempt failed with OOM, on an AWS instance with 16 cores and
32GB RAM. I then went to another AWS instance, with 36 cores and 96GB
RAM. And that went through after a while... with a peak memory usage
above 60GB!

Since then, Peff kindly repacked the repo on the github end, so it
doesn't really need repacking locally anymore, but I can still reproduce
the > 60GB memory usage with the packed repository.

I gathered some data[2], all on the same 36 cores, 96GB RAM instance, with
36, 16 and 1 threads, and here's what can be observed:

With 36 threads, the overall process takes 45 minutes:
- 50 seconds enumerating and counting objects.
- ~22 minutes compressing objects.
- ~22 minutes writing objects.

Of the 22 minutes compressing objects, more than 15 minutes are spent on
the last percent of objects, and only during that part does the memory
usage balloon above 20GB.

Memory usage goes back down to 2.4GB once compression finishes.

With 16 threads, the overall process takes about the same time as above,
with about the same breakdown between phases.

But less time is spent on compressing the last percent of objects, and
memory usage goes above 20GB later than with 36 threads.

Finally, with 1 thread, the picture changes greatly. The overall process
takes 2.5h:
- 50 seconds enumerating and counting objects.
- ~2.5h compressing objects.
- 3 minutes and 25 seconds writing objects!
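
To put the compression timings side by side, here is a back-of-the-envelope
sketch in Python. It only uses the rounded figures quoted above (~2.5h with
1 thread, ~22 minutes with 36 threads, and "about the same" with 16, which
is an approximation, not a separate measurement). The function names are
made up for illustration.

```python
def speedup(serial_minutes, parallel_minutes):
    """How many times faster the parallel run is than the 1-thread run."""
    return serial_minutes / parallel_minutes


def efficiency(serial_minutes, parallel_minutes, threads):
    """Speedup per thread; 1.0 would mean perfect scaling."""
    return speedup(serial_minutes, parallel_minutes) / threads


# Rounded "compressing objects" timings from this message.
SERIAL_COMPRESS_MIN = 150    # ~2.5h with 1 thread
PARALLEL_COMPRESS_MIN = 22   # ~22 minutes with 36 threads (16 is similar)

if __name__ == "__main__":
    for threads in (36, 16):
        s = speedup(SERIAL_COMPRESS_MIN, PARALLEL_COMPRESS_MIN)
        e = efficiency(SERIAL_COMPRESS_MIN, PARALLEL_COMPRESS_MIN, threads)
        print(f"{threads} threads: speedup {s:.1f}x, efficiency {e:.0%}")
```

Read this way, going from 16 to 36 threads buys essentially nothing for
the compression phase, which fits with most of the time being spent on the
last percent of objects.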

Memory usage stays reasonable until about 47 minutes in, when it starts
climbing, up to 12.7GB. It goes back down about half an hour later, and
progress stalls around the 13% mark the whole time.

My guess is all those stalls are happening when processing the files I
already had problems with in the past[3], except there are more of them
now (thankfully, they were removed, so there won't be more, but that
doesn't make the existing ones go away).

I never ended up working on trying to make that diff faster. Maybe that
would help a little here, but it would probably not help much wrt the
memory usage. I wonder what git could reasonably do to avoid OOMing in
this case. Reduce the window size temporarily? Trade memory for time,
by not keeping the objects in memory?
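
As a rough way to reason about the window question, here is a minimal
Python sketch of an upper bound on the memory held by delta windows. The
model (threads x window x object size) is my simplification, not how
pack-objects actually accounts for memory: it ignores delta indexes, the
delta cache and everything else. The 8 MiB object size is purely
hypothetical, since I haven't measured the sizes of the problematic files.
The window of 250 is the documented default for gc.aggressiveWindow, and
the optional cap models pack.windowMemory, which is documented as a
per-thread limit (0 meaning no limit).

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def window_memory_bound(threads, window, object_bytes, window_memory=0):
    """Rough upper bound, in bytes, on memory held by delta windows.

    Simplified model: each thread keeps up to `window` objects of
    `object_bytes` each, optionally capped per thread by `window_memory`
    (0 means no cap, mirroring pack.windowMemory's default).
    """
    per_thread = window * object_bytes
    if window_memory > 0:
        per_thread = min(per_thread, window_memory)
    return threads * per_thread


if __name__ == "__main__":
    # Hypothetical 8 MiB objects, window of 250 as with --aggressive.
    for threads in (36, 16, 1):
        uncapped = window_memory_bound(threads, 250, 8 * MIB)
        capped = window_memory_bound(threads, 250, 8 * MIB, window_memory=GIB)
        print(f"{threads:2d} threads: uncapped {uncapped / GIB:.1f} GiB, "
              f"capped at 1 GiB/thread {capped / GIB:.1f} GiB")
```

Under this model the bound scales linearly with the thread count, so
setting pack.windowMemory or lowering pack.threads would both shrink it,
at the cost of a smaller effective window or a longer run.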

I'm puzzled by the fact that writing objects is so much faster with 1
thread.

It's worth noting that the AWS instances don't have swap by default,
which is actually good in this case: had the machine started to swap,
the repack would have taken forever.

1. https://github.com/mozilla/gecko
2. https://docs.google.com/spreadsheets/d/1IE8E3BhKurXsXgwBYFXs4mRBT_512v--ip6Vhxc3o-Y/edit?usp=sharing
3. https://public-inbox.org/git/20180703223823.qedmoy2imp4dcvkp@glandium.org/T/

Any thoughts?

Mike
