user/dev discussion of public-inbox itself
 help / color / mirror / code / Atom feed
From: Eric Wong <e@80x24.org>
To: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
Cc: meta@public-inbox.org
Subject: Re: trying to figure out 100% CPU usage in nntpd...
Date: Tue, 10 Sep 2019 18:12:24 +0000	[thread overview]
Message-ID: <20190910181224.urhyoo6av7mhjs67@dcvr> (raw)
In-Reply-To: <20190910083820.GA8018@pure.paranoia.local>

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Mon, Sep 09, 2019 at 05:53:41PM +0000, Eric Wong wrote:
> > Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> > > There also was a weird problem a couple of days ago where one of the
> > > httpd daemons started returning "Internal Server Error" to all requests.
> > > Restarting public-inbox-httpd fixed the problem, but I am not sure how I
> > > would troubleshoot the causes if it happens next time -- is there a way
> > > to enable error logging for the httpd daemon?
> > 
> > That's a new one... I haven't seen any problems from -httpd myself
> > in ages.  So -httpd could not handle requests at all?
> > 
> > The daemons already spits errors to stderr which typically ends up
> > in syslog via systemd.  So, that's the first place to look
> > (also "systemctl status $SERVICE"); anything in there?
> 
> Bah, I hadn't even considered looking in syslog. I blame jet lag. Well,
> the explanation is simple, then:
> 
> Sep  7 22:22:35 lore.kernel.org public-inbox-httpd: pipe failed: Too many open files at /usr/local/share/perl5/PublicInbox/Git.pm line 210.
> Sep  7 22:22:36 lore.kernel.org public-inbox-httpd: pipe: Too many open files
> Sep  7 22:23:01 lore.kernel.org public-inbox-httpd: pipe: Too many open files
> Sep  7 22:23:02 lore.kernel.org public-inbox-httpd: pipe: Too many open files
> Sep  7 22:23:02 lore.kernel.org public-inbox-httpd: pipe: Too many open files
> 
> I see that there's already a LimitNOFILE=30000 -- is exhausting 30,000
> open files a reasonable sort of thing for a system as busy as
> lore.kernel.org, or should we look for a leak. Currently:
> 
> publici+ 13125     1  0 52297 25096   0 Sep09 ?        00:00:00 /usr/bin/perl -w /usr/local/bin/public-inbox-httpd -W4 -1 /var/log/public-inbox/httpd.out.log
> publici+ 13128 13125  1 200056 591376 4 Sep09 ?        00:21:06 /usr/bin/perl -w /usr/local/bin/public-inbox-httpd -W4 -1 /var/log/public-inbox/httpd.out.log
> publici+ 13129 13125  1 117789 262000 2 Sep09 ?        00:16:09 /usr/bin/perl -w /usr/local/bin/public-inbox-httpd -W4 -1 /var/log/public-inbox/httpd.out.log
> publici+ 13130 13125  0 103569 204772 9 Sep09 ?        00:08:06 /usr/bin/perl -w /usr/local/bin/public-inbox-httpd -W4 -1 /var/log/public-inbox/httpd.out.log
> publici+ 13131 13125  0 101510 199268 1 Sep09 ?        00:03:33 /usr/bin/perl -w /usr/local/bin/public-inbox-httpd -W4 -1 /var/log/public-inbox/httpd.out.log

Overall limit of open files could be higher (just limited by
RAM) but the balance of pipes/unlinked/other (sockets) seems way off.

I wonder if it's some bad interaction with hitting
pipe-user-pages-soft (see bottom), too...

> # ls /proc/13125/fd/ | wc -l
> 9
> # ls /proc/13128/fd/ | wc -l
> 17776
> # ls /proc/13129/fd/ | wc -l
> 3937
> # ls /proc/13130/fd/ | wc -l
> 648
> # ls /proc/13131/fd/ | wc -l
> 68

OK, there is correlation between CPU time and open FDs.

The poor load distribution (based on CPU time) between processes
sucks, but that's an unrelated problem[1]

> 17776 since Sep 09 seems... fairly excessive. Most of them are pipes:
> 
> # lsof -p 13128 | grep pipe | wc -l
> 14741
> 
> a bunch of them are to (deleted) entries in /tmp:
> 
> # lsof -p 13128 | grep deleted | wc -l
> 3015
> 
> (e.g.):
> 
> public-in 13128 publicinbox   45u      REG              259,4        984   4200094 /tmp/PerlIO_yd43mN (deleted)
> public-in 13128 publicinbox   46r     FIFO                0,9        0t0  21818751 pipe
> public-in 13128 publicinbox   47u      REG              259,4        984   4200096 /tmp/PerlIO_siY3Pb (deleted)
> public-in 13128 publicinbox   48r     FIFO                0,9        0t0  21818758 pipe
> public-in 13128 publicinbox   49u      REG              259,4        984   4200100 /tmp/PerlIO_WyIaaC (deleted)
> 
> It does seem like there's perhaps a leak somewhere?

Probably.  Not seeing any of that on my (smaller) instances;
but those -httpd haven't been restarted in weeks/months.

The "PerlIO_" prefix is created from open(..., '+>', undef), so
it either has to be for:

1. POST bodies for git http-backend

1. git cat-file --batch-check stderr

3. experimental ViewVCS|SolverGit which isn't configured on lore :)

= git-http-backend

Looking at the size (984), PerlIO_* prefix and proximity of FD
numbers, I think those lines above are git-http-backend POST body.

Any git-http-backend stuck from people fetching/cloning?

If so, fd=0 of those should correspond to the inodes of those
unlinked PerlIO files.

This is -httpd writing to varnish, still, right?
Anything from varnish error logs?

3K git-http-backend is a lot, the Qspawn mechanism
limits it to 32 processes by default (same as git-daemon),
so it shouldn't hit RLIMIT_NOPROC.  Qspawn also won't create
the pipe until right before spawning git-http-backend.

The unlinked file is created once the POST request hits.

And 3K git-http-backend would only be a 1:1 ratio of pipes to
unlinked files, though, so that's over 10K pipes still
unaccounted for...


= git cat-file

There can be two cat-file processes (--batch and --batch-check)
for every inbox in every -httpd worker.

Every cat-file process uses 2 pipes (public-inbox closes the
ends it hands to git), and there's --batch-check ones which use
an additional unlinked regular file for stderr.

So: 80 inboxes * 2 cat-file * 2 pipes is only 320 pipes in -httpd

And not sure if 160 cat-file processes would cause RLIMIT_NPROC
problems, probably not...

Fwiw, while the new DS code also uses temporary files for output
buffering, those have the "wbuf-" prefix and not PerlIO_* since
'open(FH, "+>>", undef)' didn't work until recently:
https://rt.perl.org/Ticket/Display.html?id=134221


I've been meaning to expand dtas-graph to include
temporary file and socket connectivity, too, but it's pretty
useful at graphing pipe connectivity between processes:
https://80x24.org/dtas-all/b243086/s/?b=perl/dtas-graph
And it can help find unconnected pipes.


= /proc/sys/fs/pipe-user-pages-soft

I also just remembered this knob, which you could hit from just
cat-file, and I seem to remember some bugs in older kernels when
pipe limits are hit.

for httpd with 80 inboxes:
160 pipes * 4 workers * 16 pages = 10240 pages for --batch
160 pipes * 4 workers * 1 pages = 640 pages for --batch-check

That's already 10880 pages for cat-file alone despite only
F_SETPIPE_SZ-ing to a single page for batch-check (default is 16
pages for 65536 bytes).

And there's still pipes for nntpd (not sure how many workers
there), and various internal self-pipe stuff (single-page), some
of which can be made eventfd...

/proc/sys/fs/pipe-user-pages-soft seems to default to 16384
regardless of system RAM, and I've needed to increase it in other
places for different projects.

For anybody using SQLite, I could see about supporting a single
cat-file process in a repo where every inbox is an alternate of
a main one...

[1] maybe a multiaccept knob can improve load distribution:
    https://public-inbox.org/meta/20190513025610.30378-1-e@80x24.org/

  reply	other threads:[~2019-09-10 18:12 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-08 10:45 [PATCH] nntp: regexp always consumes rbuf if "\n" exists Eric Wong
2019-09-08 10:52 ` trying to figure out 100% CPU usage in nntpd Eric Wong
2019-09-09 10:05   ` Konstantin Ryabitsev
2019-09-09 17:53     ` Eric Wong
2019-09-10  8:38       ` Konstantin Ryabitsev
2019-09-10 18:12         ` Eric Wong [this message]
2019-09-11  2:22           ` httpd 502s [was: trying to figure out 100% CPU usage in nntpd...] Eric Wong
2019-09-11 10:24             ` Konstantin Ryabitsev
2019-09-11 17:12               ` Eric Wong
2019-09-11 17:36                 ` Konstantin Ryabitsev
2019-09-12  0:05                   ` Eric Wong
2019-09-12  2:49                     ` Eric Wong
2019-09-12  8:35                       ` Eric Wong
2019-09-12 11:37                         ` Konstantin Ryabitsev
2019-09-13  3:12                           ` Eric Wong
2019-09-13  7:03                             ` Eric Wong
2019-09-13  9:01                             ` Eric Wong
2019-09-13 18:07                             ` Konstantin Ryabitsev
2019-09-14  5:25                               ` Eric Wong
2019-09-11  9:44           ` trying to figure out 100% CPU usage in nntpd Konstantin Ryabitsev
2019-09-11 17:12             ` Eric Wong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://public-inbox.org/README

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190910181224.urhyoo6av7mhjs67@dcvr \
    --to=e@80x24.org \
    --cc=konstantin@linuxfoundation.org \
    --cc=meta@public-inbox.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/public-inbox.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).