Date: Tue, 10 Sep 2019 18:12:24 +0000
From: Eric Wong
To: Konstantin Ryabitsev
Cc: meta@public-inbox.org
Subject: Re: trying to figure out 100% CPU usage in nntpd...
Message-ID: <20190910181224.urhyoo6av7mhjs67@dcvr>
References: <20190908104518.11919-1-e@80x24.org>
 <20190908105243.GA15983@dcvr>
 <20190909100500.GA9452@pure.paranoia.local>
 <20190909175340.u5aq4ztfzukko7zb@dcvr>
 <20190910083820.GA8018@pure.paranoia.local>
In-Reply-To: <20190910083820.GA8018@pure.paranoia.local>

Konstantin Ryabitsev wrote:
> On Mon, Sep 09, 2019 at 05:53:41PM +0000, Eric Wong wrote:
> > Konstantin Ryabitsev wrote:
> > > There also was a weird problem a couple of days ago where one of
> > > the httpd daemons started returning "Internal Server Error" to
> > > all requests.  Restarting public-inbox-httpd fixed the problem,
> > > but I am not sure how I would troubleshoot the causes if it
> > > happens next time -- is there a way to enable error logging for
> > > the httpd daemon?
> >
> > That's a new one... I haven't seen any problems from -httpd myself
> > in ages.  So -httpd could not handle requests at all?
> >
> > The daemons already spits errors to stderr which typically ends up
> > in syslog via systemd.  So, that's the first place to look
> > (also "systemctl status $SERVICE"); anything in there?
>
> Bah, I hadn't even considered looking in syslog. I blame jet lag.
> Well, the explanation is simple, then:
>
> Sep  7 22:22:35 lore.kernel.org public-inbox-httpd: pipe failed: Too many open files at /usr/local/share/perl5/PublicInbox/Git.pm line 210.
> Sep  7 22:22:36 lore.kernel.org public-inbox-httpd: pipe: Too many open files
> Sep  7 22:23:01 lore.kernel.org public-inbox-httpd: pipe: Too many open files
> Sep  7 22:23:02 lore.kernel.org public-inbox-httpd: pipe: Too many open files
> Sep  7 22:23:02 lore.kernel.org public-inbox-httpd: pipe: Too many open files
>
> I see that there's already a LimitNOFILE=30000 -- is exhausting
> 30,000 open files a reasonable sort of thing for a system as busy as
> lore.kernel.org, or should we look for a leak. Currently:
>
> publici+ 13125     1  0  52297  25096 0 Sep09 ? 00:00:00 /usr/bin/perl -w /usr/local/bin/public-inbox-httpd -W4 -1 /var/log/public-inbox/httpd.out.log
> publici+ 13128 13125  1 200056 591376 4 Sep09 ? 00:21:06 /usr/bin/perl -w /usr/local/bin/public-inbox-httpd -W4 -1 /var/log/public-inbox/httpd.out.log
> publici+ 13129 13125  1 117789 262000 2 Sep09 ? 00:16:09 /usr/bin/perl -w /usr/local/bin/public-inbox-httpd -W4 -1 /var/log/public-inbox/httpd.out.log
> publici+ 13130 13125  0 103569 204772 9 Sep09 ? 00:08:06 /usr/bin/perl -w /usr/local/bin/public-inbox-httpd -W4 -1 /var/log/public-inbox/httpd.out.log
> publici+ 13131 13125  0 101510 199268 1 Sep09 ? 00:03:33 /usr/bin/perl -w /usr/local/bin/public-inbox-httpd -W4 -1 /var/log/public-inbox/httpd.out.log

Overall limit of open files could be higher (just limited by RAM) but
the balance of pipes/unlinked/other (sockets) seems way off.
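Something like this rough sketch (not part of public-inbox; needs root
or the daemon's UID so /proc/$pid/fd is readable) shows that balance
per worker without wading through lsof:

  #!/usr/bin/perl -w
  # classify the open FDs of a PID by what the /proc links point to
  use strict;
  my $pid = shift or die "Usage: $0 PID\n";
  my %count;
  opendir(my $dh, "/proc/$pid/fd") or die "opendir /proc/$pid/fd: $!\n";
  for my $fd (grep { /\A[0-9]+\z/ } readdir($dh)) {
          my $tgt = readlink("/proc/$pid/fd/$fd") or next;
          if ($tgt =~ /\Apipe:/) { $count{pipe}++ }
          elsif ($tgt =~ /\Asocket:/) { $count{socket}++ }
          elsif ($tgt =~ /\(deleted\)\z/) { $count{unlinked}++ }
          else { $count{other}++ }
  }
  printf "%-9s %d\n", $_, $count{$_} for sort keys %count;

(e.g. run it against 13128 from the ps output above)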
I wonder if it's some bad interaction with hitting
pipe-user-pages-soft (see bottom), too...

> # ls /proc/13125/fd/ | wc -l
> 9
> # ls /proc/13128/fd/ | wc -l
> 17776
> # ls /proc/13129/fd/ | wc -l
> 3937
> # ls /proc/13130/fd/ | wc -l
> 648
> # ls /proc/13131/fd/ | wc -l
> 68

OK, there's a correlation between CPU time and open FDs.  The poor
load distribution (based on CPU time) between processes sucks, but
that's an unrelated problem [1].

> 17776 since Sep 09 seems... fairly excessive. Most of them are pipes:
>
> # lsof -p 13128 | grep pipe | wc -l
> 14741
>
> a bunch of them are to (deleted) entries in /tmp:
>
> # lsof -p 13128 | grep deleted | wc -l
> 3015
>
> (e.g.):
>
> public-in 13128 publicinbox 45u  REG 259,4 984 4200094 /tmp/PerlIO_yd43mN (deleted)
> public-in 13128 publicinbox 46r FIFO   0,9 0t0 21818751 pipe
> public-in 13128 publicinbox 47u  REG 259,4 984 4200096 /tmp/PerlIO_siY3Pb (deleted)
> public-in 13128 publicinbox 48r FIFO   0,9 0t0 21818758 pipe
> public-in 13128 publicinbox 49u  REG 259,4 984 4200100 /tmp/PerlIO_WyIaaC (deleted)
>
> It does seem like there's perhaps a leak somewhere?

Probably.  I'm not seeing any of that on my (smaller) instances, even
though those -httpd haven't been restarted in weeks/months.

The "PerlIO_" prefix comes from open(..., '+>', undef), so it has to
be one of:

  1. POST bodies for git-http-backend
  2. git cat-file --batch-check stderr
  3. the experimental ViewVCS|SolverGit, which isn't configured on lore :)

= git-http-backend

Looking at the size (984), the PerlIO_* prefix, and the proximity of
the FD numbers, I think the lines above are git-http-backend POST
bodies.

Any git-http-backend processes stuck from people fetching/cloning?
If so, fd=0 of those should correspond to the inodes of those unlinked
PerlIO files.

This is still -httpd writing to varnish, right?  Anything in the
varnish error logs?

3K git-http-backend processes would be a lot; the Qspawn mechanism
limits them to 32 by default (same as git-daemon), so it shouldn't hit
RLIMIT_NPROC.  Qspawn also won't create the pipe until right before
spawning git-http-backend, and the unlinked file is only created once
the POST request hits.

And 3K git-http-backend processes would only give a 1:1 ratio of pipes
to unlinked files, so that still leaves over 10K pipes unaccounted
for...

= git cat-file

There can be two cat-file processes (--batch and --batch-check) for
every inbox in every -httpd worker.  Every cat-file process uses 2
pipes (public-inbox closes the ends it hands to git), and the
--batch-check ones use an additional unlinked regular file for stderr.

So: 80 inboxes * 2 cat-file * 2 pipes is only 320 pipes per -httpd
worker.  And I'm not sure whether 160 cat-file processes would cause
RLIMIT_NPROC problems; probably not...

Fwiw, while the new DS code also uses temporary files for output
buffering, those have the "wbuf-" prefix and not PerlIO_*, since
'open(FH, "+>>", undef)' didn't work until recently:
https://rt.perl.org/Ticket/Display.html?id=134221

I've been meaning to expand dtas-graph to include temporary file and
socket connectivity, too, but it's already pretty useful at graphing
pipe connectivity between processes, and it can help find unconnected
pipes:
https://80x24.org/dtas-all/b243086/s/?b=perl/dtas-graph

= /proc/sys/fs/pipe-user-pages-soft

I also just remembered this knob, which you could hit from cat-file
alone, and I seem to remember some bugs in older kernels when pipe
limits are hit.
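If it helps to put an actual number on that for lore, here's a rough
sketch (again, not part of public-inbox; run it as root or the
daemon's UID against the worker PIDs).  It counts distinct pipes and
assumes the 16-page default per pipe, so it's only an upper bound:

  #!/usr/bin/perl -w
  # count distinct pipes held open by the given PIDs and estimate page
  # usage against fs.pipe-user-pages-soft (dedup on the pipe:[inode]
  # target so shared ends aren't counted twice)
  use strict;
  my %pipes;
  for my $pid (@ARGV) {
          opendir(my $dh, "/proc/$pid/fd") or next;
          for my $fd (grep { /\A[0-9]+\z/ } readdir($dh)) {
                  my $tgt = readlink("/proc/$pid/fd/$fd") or next;
                  $pipes{$1} = 1 if $tgt =~ /\Apipe:\[([0-9]+)\]\z/;
          }
  }
  my $soft = do {
          open(my $fh, '<', '/proc/sys/fs/pipe-user-pages-soft')
                  or die "open: $!";
          local $/;
          <$fh>;
  };
  chomp $soft;
  my $n = keys %pipes;
  printf "%d pipes, <= %d pages (pipe-user-pages-soft=%d)\n",
          $n, 16 * $n, $soft;

Pipes we've already shrunk to a single page get counted at 16 here,
so the real figure will be lower.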
For -httpd with 80 inboxes:

  160 pipes * 4 workers * 16 pages = 10240 pages for --batch
  160 pipes * 4 workers *  1 page  =   640 pages for --batch-check

That's already 10880 pages for cat-file alone, despite only
F_SETPIPE_SZ-ing the --batch-check pipes down to a single page (the
default is 16 pages, 65536 bytes).

And there are still pipes for nntpd (not sure how many workers it has
there), plus various internal self-pipe stuff (single-page), some of
which can be made eventfd...

/proc/sys/fs/pipe-user-pages-soft seems to default to 16384 regardless
of system RAM, and I've needed to increase it in other places for
different projects.

For anybody using SQLite, I could see about supporting a single
cat-file process in a repo where every inbox is an alternate of a
main one...

[1] maybe a multiaccept knob can improve load distribution:
    https://public-inbox.org/meta/20190513025610.30378-1-e@80x24.org/
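Btw, the single-page shrink is just an fcntl, in case anyone wants to
apply it elsewhere; a minimal sketch (constant hardcoded since older
Fcntl.pm doesn't export F_SETPIPE_SZ):

  use strict;
  my $F_SETPIPE_SZ = 1031; # F_LINUX_SPECIFIC_BASE + 7, Linux-only
  pipe(my $rd, my $wr) or die "pipe: $!";
  # the kernel rounds the request up to a full page (usually 4096)
  fcntl($wr, $F_SETPIPE_SZ, 4096) or warn "F_SETPIPE_SZ: $!\n";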