From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-ASN: AS31976 209.132.180.0/23
X-Spam-Status: No, score=-4.2 required=3.0 tests=AWL,BAYES_00,
    HEADER_FROM_DIFFERENT_DOMAINS,RCVD_IN_DNSWL_HI,RP_MATCHES_RCVD
    shortcircuit=no autolearn=ham autolearn_force=no version=3.4.0
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
    by dcvr.yhbt.net (Postfix) with ESMTP id 84B3F1FD99
    for ; Mon, 29 Aug 2016 21:31:09 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
    id S1755942AbcH2VbH (ORCPT );
    Mon, 29 Aug 2016 17:31:07 -0400
Received: from cloud.peff.net ([104.130.231.41]:35090 "HELO cloud.peff.net"
    rhost-flags-OK-OK-OK-OK) by vger.kernel.org with SMTP
    id S1754688AbcH2VbG (ORCPT );
    Mon, 29 Aug 2016 17:31:06 -0400
Received: (qmail 14625 invoked by uid 109); 29 Aug 2016 21:31:05 -0000
Received: from Unknown (HELO peff.net) (10.0.1.2)
    by cloud.peff.net (qpsmtpd/0.84) with SMTP; Mon, 29 Aug 2016 21:31:05 +0000
Received: (qmail 23973 invoked by uid 111); 29 Aug 2016 21:31:10 -0000
Received: from sigill.intra.peff.net (HELO sigill.intra.peff.net) (10.0.0.7)
    by peff.net (qpsmtpd/0.84) with SMTP; Mon, 29 Aug 2016 17:31:10 -0400
Received: by sigill.intra.peff.net (sSMTP sendmail emulation);
    Mon, 29 Aug 2016 17:31:01 -0400
Date: Mon, 29 Aug 2016 17:31:01 -0400
From: Jeff King
To: "W. David Jarvis"
Cc: git@vger.kernel.org
Subject: Re: Reducing CPU load on git server
Message-ID: <20160829213101.3ulrw5hrh5pytjii@sigill.intra.peff.net>
References: <20160829054725.r6pqf3xlusxi7ibq@sigill.intra.peff.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To:
Sender: git-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: git@vger.kernel.org

On Mon, Aug 29, 2016 at 12:16:20PM -0700, W. David Jarvis wrote:

> > Do you know which processes are generating the load?
> > git-upload-pack does the negotiation, and then pack-objects does the
> > actual packing.
>
> When I look at expensive operations (ones that I can see consuming
> 90%+ of a CPU for more than a second), there are often pack-objects
> processes running that will consume an entire core for multiple
> seconds (I also saw one pack-object counting process run for several
> minutes while using up a full core).

Pegging CPU for a few seconds doesn't sound out-of-place for
pack-objects serving a fetch or clone on a large repository. And I can
certainly believe "minutes", especially if it was not serving a fetch,
but doing repository maintenance on a large repository.

Talk to GitHub Enterprise support folks about what kind of process
monitoring and accounting is available. Recent versions of GHE can
easily tell things like which repositories and processes are using the
most CPU, RAM, I/O, and network, which ones are seeing a lot of
parallelism, etc.

> rev-list shows up as a pretty active CPU consumer, as do prune and
> blame-tree.
>
> I'd say overall that in terms of high-CPU consumption activities,
> `prune` and `rev-list` show up the most frequently.

None of those operations is triggered by client fetches. You'll see
"rev-list" for a variety of operations, so that's hard to pinpoint. But
I'm surprised that "prune" is a common one for you. It is run as part of
the post-push, but I happen to know that the version that ships on GHE
is optimized to use bitmaps, and to avoid doing any work when there are
no loose objects that need pruning in the first place.

Blame-tree is a GitHub-specific command (it feeds the main repository
view page), and is a known CPU hog. There's more clever caching for that
coming down the pipe, but it's not shipped yet.

> On the subject of prune - I forgot to mention that the `git fetch`
> calls from the subscribers are running `git fetch --prune`. I'm not
> sure if that changes the projected load profile.

That shouldn't change anything; the pruning is purely a client side
thing.

> > Maybe. If pack-objects is where your load is coming from, then
> > counter-intuitively things sometimes get _worse_ as you fetch less. The
> > problem is that git will generally re-use deltas it has on disk when
> > sending to the clients. But if the clients are missing some of the
> > objects (because they don't fetch all of the branches), then we cannot
> > use those deltas and may need to recompute new ones. So you might see
> > some parts of the fetch get cheaper (negotiation, pack-object's
> > "Counting objects" phase), but "Compressing objects" gets more
> > expensive.
>
> I might be misunderstanding this, but if the subscriber is already "up
> to date" modulo a single updated ref tip, then this problem shouldn't
> occur, right? Concretely: if ref A is built off of ref B, and the
> subscriber already has B when it requests A, that shouldn't cause this
> behavior, but it would cause this behavior if the subscriber didn't
> have B when it requested A.

Correct. So this shouldn't be a thing you are running into now, but it's
something that might be made worse if you switch to fetching only single
refs.

> See comment above about a long-running counting objects process. I
> couldn't tell which of our repositories it was counting, but it went
> for about 3 minutes with full core utilization. I didn't see us
> counting pack-objects frequently but it's an expensive operation.

That really sounds like repository maintenance. Repacks of
torvalds/linux (including all of its forks) on github.com take ~15
minutes of CPU. There may be some optimization opportunities there (I
have a few things I'd like to explore in the next few months), but most
of it is pretty fundamental. It literally takes a few minutes just to
walk the entire object graph for that repo (that's one of the more
extreme cases, of course, but presumably you are hosting some large
repositories).

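A rough way to see how long an object-graph walk takes on a given
repository is to time `git rev-list --objects --all`, which traverses
every object reachable from the refs. The following is only a minimal
sketch under that assumption: it is not something GHE ships, it
approximates the walk rather than a full repack, and the helper names
`time_command` and `object_walk_command` are made up for illustration.

```python
import subprocess
import sys
import time


def time_command(cmd, cwd=None):
    """Run cmd and return (elapsed_seconds, number_of_stdout_lines).

    Output is streamed and counted line by line, so a large
    object list is never held in memory.
    """
    start = time.perf_counter()
    proc = subprocess.Popen(cmd, cwd=cwd, stdout=subprocess.PIPE)
    line_count = sum(1 for _ in proc.stdout)
    returncode = proc.wait()
    elapsed = time.perf_counter() - start
    if returncode != 0:
        raise subprocess.CalledProcessError(returncode, cmd)
    return elapsed, line_count


def object_walk_command():
    """Hypothetical helper: the command used here as a proxy for the walk.

    `git rev-list --objects --all` prints one line per reachable object.
    """
    return ["git", "rev-list", "--objects", "--all"]


# Example use (assumes the path is a git repository):
#   elapsed, objects = time_command(object_walk_command(), cwd="/path/to/repo")
#   print(f"walked {objects} objects in {elapsed:.1f}s")
```

Wall-clock time from a run like this depends heavily on disk cache state
and on whether reachability bitmaps are present, so treat the result as
an order-of-magnitude figure.
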
Maintenance like that should be a very occasional operation, but it's
possible that you have a very busy repo.

> > There's nothing in upstream git to help smooth these loads, but since
> > you mentioned GitHub Enterprise, I happen to know that it does have a
> > system for coalescing multiple fetches into a single pack-objects. I
> > _think_ it's in GHE 2.5, so you might check which version you're
> > running (and possibly also talk to GitHub Support, who might have more
> > advice; there are also tools for finding out which git processes are
> > generating the most load, etc).
>
> We're on 2.6.4 at the moment.

OK, I double-checked, and your version should be coalescing identical
fetches.

Given that, and that a lot of the load you mentioned above is coming
from non-fetch sources, it sounds like switching anything with your
replica fetch strategy isn't likely to help much. And a multi-tiered
architecture won't help if the load is being generated by requests that
are serving the web-views directly on the box.

I'd really encourage you to talk with GitHub Support about performance
and clustering. It sounds like there may be some GitHub-specific things
to tweak. And it may be that the load is just too much for a single
machine, and would benefit from spreading the load across multiple git
servers.

-Peff