From: Jeff King <peff@peff.net>
To: git@vger.kernel.org
Subject: [PATCH 0/12] reducing resource usage of for_each_alternate_ref
Date: Mon, 23 Jan 2017 19:37:29 -0500 [thread overview]
Message-ID: <20170124003729.j4ygjcgypdq7hceg@sigill.intra.peff.net> (raw)
As I've mentioned before, I have some alternates repositories with
absurd numbers of refs, most of which are duplicates of each other[1].
There are a couple of problems I've seen:
1. the way that for_each_alternate_ref() works has very high peak memory
usage for this case
2. the way that receive-pack de-duplicates the refs has high peak memory
usage
3. we access the alternate refs twice in fetch-pack
This fixes all three, along with a few other minor bugfixes, cleanups,
and optimizations. I've tried to order the series to keep bugfixes and
less-contentious changes near the front.
Just to give some numbers, on my worst-case repository (see [1]), this
drops peak RSS for "git clone --reference" from over 25GB to about 40MB.
Sort of, anyway. You still pay a big CPU and RSS cost on the process in
the alternates repo that accesses packed-refs, but the 25GB came on top
of that. So this is a first pass at the low-hanging fruit.
I'll be the first to admit that this setup is insane. And some of the
optimizations are tradeoffs that help particularly the case where your
refs aren't unique. But for the most part should help _every_ case. And
in the cases where your refs are unique, either you don't have many (so
the tradeoffs are OK) or you have so many that you are pretty much
screwed no matter what (if your fetch is looking at 30 million unique
ref tips, the object storage is your problem, not looking at the refs).
A brief overview of the patches:
[01/12]: for_each_alternate_ref: handle failure from real_pathdup()
[02/12]: for_each_alternate_ref: stop trimming trailing slashes
[03/12]: for_each_alternate_ref: use strbuf for path allocation
Bugfixes and cleanups (the first one is actually a recent-ish
regression).
[04/12]: for_each_alternate_ref: pass name/oid instead of ref struct
[05/12]: for_each_alternate_ref: replace transport code with for-each-ref
[06/12]: clone: disable save_commit_buffer
This gives the 25GB->40MB benefit. There are a bunch of caveats in
patch 05, but I _think_ it's the right direction for the reasons I
outlined there.
[07/12]: fetch-pack: cache results of for_each_alternate_ref
Just running for-each-ref in the giant alternates repo is about a
minute of CPU, plus 10GB heap, and fetch-pack wants to do it twice.
This drops it to once.
[08/12]: add oidset API
[09/12]: receive-pack: use oidset to de-duplicate .have lines
These give another ~12GB RSS improvement when receive-pack looks at
the alternates in my worst-case repo.
This is less tested than the earlier ones, as we've disabled
receive-pack looking at alternates in production (see [1] below).
I just did them for completeness upstream.
[10/12]: receive-pack: fix misleading namespace/.have comment
[11/12]: receive-pack: treat namespace .have lines like alternates
[12/12]: receive-pack: avoid duplicates between our refs and alternates
These are optimizations to avoid more duplicate .have lines, and
should benefit even non-insane cases. As with 8-12, not as well
tested by me.
Makefile | 1 +
builtin/clone.c | 1 +
builtin/receive-pack.c | 41 +++++++++++++++-------------
fetch-pack.c | 48 ++++++++++++++++++++++++++++-----
object.h | 2 +-
oidset.c | 49 ++++++++++++++++++++++++++++++++++
oidset.h | 45 +++++++++++++++++++++++++++++++
t/t5400-send-pack.sh | 38 ++++++++++++++++++++++++++
transport.c | 72 +++++++++++++++++++++++++++++++++++---------------
transport.h | 2 +-
10 files changed, 250 insertions(+), 49 deletions(-)
create mode 100644 oidset.c
create mode 100644 oidset.h
-Peff
[1] Background, if you care:
I've mentioned before that GitHub's repository storage uses a
fork-network layout. Each user gets their own "fork" repository, and
it points to "network.git" as shared storage. The network.git
repository has a ref for each fork that is updated periodically via
"git fetch".
So the network.git repo contains O(nr_forks * nr_refs_in_fork) refs.
Quite a few of these point to the same tip sha1s, as each fork has
the same tags. One of the most pathological cases is a popular
public repo that has ~44K tags in it. The network repo has ~80
million refs, of which ~60K are unique. You can imagine that it
takes some time to access the refs. Basically anything that loads
the network.git packed-refs file takes an extra minute of CPU and
needs ~10G RSS to store the internal cache of `packed-refs`.
Most operations don't care about this; they work on the fork repo,
and never look at the network refs. But we _do_ sometimes peek at
alternate refs from receive-pack and fetch-pack to tell the other
side about tips we have.
These optimizations backfire completely in such a setting. Besides
the CPU and memory spikes on the server, even just the unique refs
add over 3MB to the ref advertisement. So for the time being, we've
disabled them. I have patches that add a receive.advertiseAlternates
config option, if anybody is interested.
So why do I care? There is one exception: when we are cloning a
fork, we turn on the fetch-pack side of the optimizations. This is
worth it for a large repository, as it saves us having to transfer
any objects at all (and therefore not process them with index-pack,
which is expensive).
This series is also a first step towards other optimizations, like
asking for just the alternate refs from _one_ of the forks. I hope
one day to turn alternates optimizations back on everywhere.
next reply other threads:[~2017-01-24 0:37 UTC|newest]
Thread overview: 48+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-01-24 0:37 Jeff King [this message]
2017-01-24 0:38 ` [PATCH 01/12] for_each_alternate_ref: handle failure from real_pathdup() Jeff King
2017-01-25 18:26 ` Junio C Hamano
2017-01-24 0:39 ` [PATCH 02/12] for_each_alternate_ref: stop trimming trailing slashes Jeff King
2017-01-24 0:40 ` [PATCH 03/12] for_each_alternate_ref: use strbuf for path allocation Jeff King
2017-01-25 18:29 ` Junio C Hamano
2017-01-25 18:40 ` Jeff King
2017-01-24 0:40 ` [PATCH 04/12] for_each_alternate_ref: pass name/oid instead of ref struct Jeff King
2017-01-24 0:44 ` [PATCH 05/12] for_each_alternate_ref: replace transport code with for-each-ref Jeff King
2017-01-25 19:00 ` Junio C Hamano
2017-01-24 0:45 ` [PATCH 06/12] clone: disable save_commit_buffer Jeff King
2017-01-25 19:11 ` Junio C Hamano
2017-01-25 19:27 ` Jeff King
2017-01-25 19:35 ` Jeff King
2017-01-25 21:07 ` Jeff King
2017-01-24 0:45 ` [PATCH 07/12] fetch-pack: cache results of for_each_alternate_ref Jeff King
2017-01-25 19:21 ` Junio C Hamano
2017-01-25 19:47 ` Jeff King
2017-01-24 0:46 ` [PATCH 08/12] add oidset API Jeff King
2017-01-24 20:26 ` Ramsay Jones
2017-01-24 20:35 ` Jeff King
2017-01-24 0:47 ` [PATCH 09/12] receive-pack: use oidset to de-duplicate .have lines Jeff King
2017-01-25 19:32 ` Junio C Hamano
2017-01-25 19:54 ` Jeff King
2017-01-24 0:47 ` [PATCH 10/12] receive-pack: fix misleading namespace/.have comment Jeff King
2017-01-24 0:48 ` [PATCH 11/12] receive-pack: treat namespace .have lines like alternates Jeff King
2017-01-25 19:51 ` Junio C Hamano
2017-01-25 19:58 ` Jeff King
2017-01-27 17:45 ` Lukas Fleischer
2017-01-27 17:58 ` Jeff King
2017-01-27 20:42 ` Junio C Hamano
2017-01-24 0:48 ` [PATCH 12/12] receive-pack: avoid duplicates between our refs and alternates Jeff King
2017-01-25 20:02 ` Junio C Hamano
2017-01-25 20:05 ` Jeff King
2017-01-24 1:33 ` [PATCH 0/12] reducing resource usage of for_each_alternate_ref Brandon Williams
2017-01-24 2:12 ` Jeff King
2017-02-08 20:52 ` [PATCH v2 0/11] " Jeff King
2017-02-08 20:52 ` [PATCH v2 01/11] for_each_alternate_ref: handle failure from real_pathdup() Jeff King
2017-02-08 20:52 ` [PATCH v2 02/11] for_each_alternate_ref: stop trimming trailing slashes Jeff King
2017-02-08 20:52 ` [PATCH v2 03/11] for_each_alternate_ref: use strbuf for path allocation Jeff King
2017-02-08 20:52 ` [PATCH v2 04/11] for_each_alternate_ref: pass name/oid instead of ref struct Jeff King
2017-02-08 20:53 ` [PATCH v2 05/11] for_each_alternate_ref: replace transport code with for-each-ref Jeff King
2017-02-08 20:53 ` [PATCH v2 06/11] fetch-pack: cache results of for_each_alternate_ref Jeff King
2017-02-08 20:53 ` [PATCH v2 07/11] add oidset API Jeff King
2017-02-08 20:53 ` [PATCH v2 08/11] receive-pack: use oidset to de-duplicate .have lines Jeff King
2017-02-08 20:53 ` [PATCH v2 09/11] receive-pack: fix misleading namespace/.have comment Jeff King
2017-02-08 20:53 ` [PATCH v2 10/11] receive-pack: treat namespace .have lines like alternates Jeff King
2017-02-08 20:53 ` [PATCH v2 11/11] receive-pack: avoid duplicates between our refs and alternates Jeff King
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20170124003729.j4ygjcgypdq7hceg@sigill.intra.peff.net \
--to=peff@peff.net \
--cc=git@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).