user/dev discussion of public-inbox itself
* public-inbox.org VPS hopefully stable, now...
From: Eric Wong @ 2024-10-14 22:48 UTC
  To: meta

I've got a lot of orphaned sockets and OOM from the kernel the
past few days.  It's a combination of kernel TCP memory use,
OpenSSL, zlib, glibc malloc, Perl 5, and probably other things...

It looks like a lot of bot traffic trying to scrape IMAP(S),
too :<

WolfSSL might be an option via Inline::C *shrug*

I've cut down on connections via the iptables/ip6tables
connlimit and state modules; still not sure where the limits
should be atm..

Current sysctls are here, many limits lowered from defaults.
Mostly going off Documentation/networking/ip-sysctl.rst in
linux.git

I'm not 100% sure about many of these so holler if you see anything
amiss...

	net.core.somaxconn = 128
	net.ipv4.tcp_timestamps = 1
	net.ipv4.tcp_tw_reuse = 1
	net.ipv4.tcp_fin_timeout = 20
	net.ipv4.tcp_slow_start_after_idle = 0
	net.ipv4.tcp_retries2 = 8 # default 15
	net.ipv4.tcp_orphan_retries = 1 # default 8
	net.ipv4.tcp_max_orphans = 2048 # default 4096

	# Things will probably be worse for LFNs w/ smaller tcp_wmem
	net.ipv4.tcp_rmem = 4096 16384 65536
	net.ipv4.tcp_wmem = 4096 16384 65536

	# tcp_mem thresholds untouched atm..

	net.netfilter.nf_conntrack_tcp_timeout_established = 600

	# can probably drop this...
	net.netfilter.nf_conntrack_max = 30000

I "only" have 1GB of RAM since it's the cheapest available
(32-bit userspace, x86_64 kernel).  Getting more RAM or CPU
is absolutely NOT an option; optimizing data structures,
code and tweaking knobs are the only ways to fix this.

Down with consumerism!


* robots.txt ignored (was: public-inbox.org VPS hopefully stable, now...)
From: Eric Wong @ 2025-03-18 18:28 UTC
  To: meta

Eric Wong <e@80x24.org> wrote:
> I've got a lot of orphaned sockets and OOM from the kernel the
> past few days.

No longer a problem, at least.  More rambling thoughts below...

> It's a combination of kernel TCP memory use,
> OpenSSL, zlib, glibc malloc, Perl 5, and probably other things...

Yeah...

> It looks like a lot of bot traffic trying to scrape IMAP(S),
> too :<
> 
> WolfSSL might be an option via Inline::C *shrug*
> 
> I've cut down on connections via the iptables/ip6tables
> connlimit and state modules; still not sure where the limits
> should be atm..

Per-IP limits don't seem effective because bots don't reuse
connections and keep changing IPs and User-Agents.
Throttling IPv4 /24 blocks seems OK; I haven't gone to /16, yet.

IPv6 doesn't seem to be a problem at the moment.

HTTPS termination is still provided by a horribly-named Ruby
webserver (no nginx/haproxy/etc.), and some janky Ruby userspace
throttles were added last week for non-(curl|w3m|git|lynx)
User-Agents.

The chain is currently like this:

                        __/--> varnish -> public-inbox-netd
        bots -> ruby --|__
                          \--> ssh(*) -> varnish -> public-inbox-netd

I might expand the public-inbox HTTP server code to provide
reverse proxying to Varnish and get rid of ruby(**):

                        __/--> varnish -> public-inbox-netd
        bots -> perl --|__
                          \--> ssh(*) -> varnish -> public-inbox-netd

My small Varnish caches get poisoned from scanning, too.  While
Varnish is great for dealing with traffic bursts from popular sites,
it's a waste for crawlers which rarely/never repeat requests.

The "limiter" feature (see public-inbox-config(5)) could be
expanded to handle some endpoints such as /T/, /t/, /t.mbox.gz
to reduce excessive parallelism while still providing enough to
keep git processes saturated.  Basically the same thing I did
with /s/ which seems quite effective:
8d6a50ff (www: use a dedicated limiter for blob solver, 2024-03-11)
2368fb20 (viewvcs: -codeblob limiter w/ depth for solver, 2025-02-08)

OTOH, limiter currently doesn't check IP ranges.  Perhaps it could?

limiter is great for reducing memory use from zlib buffers and
would be good for limiting the thread-skeleton structure, as
well.

> I "only" have 1GB of RAM since it's the cheapest available
> (32-bit userspace, x86_64 kernel).  Getting more RAM or CPU
> is absolutely NOT an option; optimizing data structures,
> code and tweaking knobs are the only ways to fix this.
> 
> Down with consumerism!

Unchanged :>  And I forgot to mention it's all happening on a
single core, even.

Of course, JavaScript or MITM-based services are not an option
due to accessibility.


(*) the yhbt.net/lore/ mirror is behind the SSH tunnel.  I know
    I could move varnish in front of the SSH tunnel to reduce
    SSH traffic, but I'm more familiar with using Ruby||Perl to
    route than VCL.

(**) deterministic DESTROY from Perl would've been so useful to
     have in the janky Ruby throttler I made


* [RFC] plack_limiter: middleware to limit concurrency
From: Eric Wong @ 2025-03-20  0:05 UTC
  To: meta

Eric Wong <e@80x24.org> wrote:
> HTTPS termination is still provided by a horribly-named Ruby
> webserver (no nginx/haproxy/etc.), and some janky Ruby userspace
> throttles were added last week for non-(curl|w3m|git|lynx)
> User-Agents.

Those throttles are replaced by the patch below...
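
A minimal sketch of how the old User-Agent rule might map onto
`match_cb' (hypothetical, not what's deployed; the name
`$ua_match_cb' and the exact regexp are assumptions).  It returns
true, meaning "send this request through the limiter", for anything
that doesn't identify itself as curl, w3m, git or lynx:

```perl
use v5.12;

# hypothetical match_cb: limit every client except a few well-behaved ones.
# A true return value means the request is subject to the limiter.
my $ua_match_cb = sub {
	my ($env) = @_;
	# a missing User-Agent is treated like any other unknown client
	($env->{HTTP_USER_AGENT} // '') !~ m!\A(?:curl|w3m|git|lynx)\b!i;
};
```

It would be passed as `match_cb => $ua_match_cb' in the builder
block.  User-Agents are trivially forged, so this only helps
against bots that don't bother.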

> The "limiter" feature (see public-inbox-config(5)) could be
> expanded to handle some endpoints such as /T/, /t/, /t.mbox.gz
> to reduce excessive parallelism while still providing enough to
> keep git processes saturated.

<snip>

> OTOH, limiter currently doesn't check IP ranges.  Perhaps it could?

Can be coded into match_cb below...
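
Roughly like this, as an untested-in-production sketch: the
`%limit_24' table, the `ipv4_24' helper and the address blocks
(RFC 5737 documentation ranges) are all made up for illustration.
It also assumes REMOTE_ADDR carries the real client address, which
isn't a given behind a reverse proxy:

```perl
use v5.12;

# hypothetical set of IPv4 /24 blocks to limit (documentation ranges)
my %limit_24 = map { $_ => 1 } qw(192.0.2 198.51.100);

# returns the "a.b.c" prefix of a dotted-quad IPv4 address,
# undef for anything else (IPv6, missing address, garbage)
sub ipv4_24 {
	my ($addr) = @_;
	($addr // '') =~ /\A(\d{1,3}\.\d{1,3}\.\d{1,3})\.\d{1,3}\z/ ? $1 : undef;
}

# match_cb: true if the request should go through the limiter
my $range_match_cb = sub {
	my ($env) = @_;
	# non-IPv4 clients are left unlimited in this sketch
	my $pfx = ipv4_24($env->{REMOTE_ADDR}) // return 0;
	$limit_24{$pfx} ? 1 : 0;
};
```

A plain regexp is used instead of Socket::inet_aton to avoid any
chance of a DNS lookup on odd input.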

> limiter is great for reducing memory use from zlib buffers and
> would be good for limiting the thread-skeleton structure, as
> well.

Seems to be working, but there's a lull in traffic today.
I'm keeping this as a Plack/PSGI component to be less intrusive
to the rest of the codebase and offer more configurability via
`match_cb' and `stats_match_cb', which accept Perl code(*).

Documentation in the POD at the end...
-----8<------
Subject: [PATCH] plack_limiter: PSGI middleware to limit concurrency

While processing several concurrent requests within the same
worker process is helpful to exploit parallelism in git blob
lookups and smooth out delays, excessive parallelism is harmful
since it allows too much memory to be allocated at once for zlib
buffers and such.

While PublicInbox::WWW already uses the limiter for certain
expensive endpoints (e.g. /s/ and anything using Qspawn), some
long-running endpoints with many inexpensive steps (e.g. /T/,
/t/, /d/, *.atom, *.mbox.gz, etc.) can end up using a large
amount of memory for gzip buffers despite being fair to other
responses and being able to stream >500 messages/sec on 2010-era
hardware.

So give sysadmins an option to balance between smoothing out
delays in blob retrieval and memory usage required to compress
and spew out chunks of potentially large multi-email responses.
---
 MANIFEST                        |   1 +
 lib/PublicInbox/PlackLimiter.pm | 117 ++++++++++++++++++++++++++++++++
 2 files changed, 118 insertions(+)
 create mode 100644 lib/PublicInbox/PlackLimiter.pm

diff --git a/MANIFEST b/MANIFEST
index 93407a46..5e599990 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -328,6 +328,7 @@ lib/PublicInbox/OverIdx.pm
 lib/PublicInbox/POP3.pm
 lib/PublicInbox/POP3D.pm
 lib/PublicInbox/PktOp.pm
+lib/PublicInbox/PlackLimiter.pm
 lib/PublicInbox/Qspawn.pm
 lib/PublicInbox/Reply.pm
 lib/PublicInbox/RepoAtom.pm
diff --git a/lib/PublicInbox/PlackLimiter.pm b/lib/PublicInbox/PlackLimiter.pm
new file mode 100644
index 00000000..a1cc51dc
--- /dev/null
+++ b/lib/PublicInbox/PlackLimiter.pm
@@ -0,0 +1,117 @@
+# Copyright (C) all contributors <meta@public-inbox.org>
+# License: GPL-3.0+ <https://www.gnu.org/licenses/gpl-3.0.txt>
+# generic Plack/PSGI middleware to expose PublicInbox::Limiter, (see __END__)
+package PublicInbox::PlackLimiter;
+use v5.12;
+use parent qw(Plack::Middleware);
+use PublicInbox::OnDestroy;
+
+sub prepare_app { # called via Plack::Component (used by Plack::Middleware)
+	my ($self) = @_;
+	$self->{match_cb} //= sub { 1 };
+	$self->{max} //= 2;
+	$self->{run_queue} = [];
+	$self->{running} = 0;
+	$self->{rejected} = 0;
+	$self->{message} //= "too busy\n";
+}
+
+sub r503 ($) {
+	my @body = ($_[0]->{message});
+	++$_[0]->{rejected};
+	[ 503, [ 'Content-Type' => 'text/plain',
+		'Content-Length' => length($body[0]) ], \@body ]
+}
+
+sub next_req { # on_destroy cb
+	my ($self) = @_;
+	--$self->{running};
+	my $env = shift @{$self->{run_queue}} or return;
+	my $wcb = delete $env->{'p-i.limiter.wcb'} // die 'BUG: no wcb';
+	my $res = eval { call($self, $env) };
+	return warn("W: $@") if $@;
+	ref($res) eq 'CODE' ? $res->($wcb) : $wcb->($res);
+}
+
+sub stats ($) {
+	my ($self) = @_;
+	my $nq = scalar @{$self->{run_queue}};
+	my $res = <<EOM;
+running: $self->{running}
+queued: $nq
+rejected: $self->{rejected}
+max: $self->{max}
+EOM
+	[ 200, [ 'Content-Type' => 'text/plain',
+		'Content-Length' => length($res) ], [ $res ] ]
+}
+
+sub call {
+	my ($self, $env) = @_;
+	if (defined $self->{stats_match_cb}) {
+		return stats $self if $self->{stats_match_cb}->($env);
+	}
+	return $self->app->($env) if !$self->{match_cb}->($env);
+	return r503($self) if @{$self->{run_queue}} > ($self->{depth} // 32);
+	if ($self->{running} < $self->{max}) {
+		++$self->{running};
+		$env->{'p-i.limiter.next'} = on_destroy \&next_req, $self;
+		$self->app->($env);
+	} else { # capture write cb from PSGI server and queue up
+		sub {
+			$env->{'p-i.limiter.wcb'} = $_[0];
+			push @{$self->{run_queue}}, $env;
+		};
+	}
+}
+
+1;
+__END__
+
+=head1 NAME
+
+PublicInbox::PlackLimiter - limit concurrency to parts of a PSGI app
+
+=head1 SYNOPSIS
+
+	# In your .psgi file
+	use Plack::Builder;
+	builder {
+
+	# by default, only 2 requests may be processed at once:
+	enable '+PublicInbox::PlackLimiter';
+
+	# You will likely only want to limit certain expensive endpoints,
+	# while allowing maximum concurrency for inexpensive endpoints.
+	# You can do that by passing a `match_cb' parameter:
+	enable '+PublicInbox::PlackLimiter',
+		# some expensive endpoints for my public-inbox instance, YMMV
+		match_cb => sub {
+			my ($env) = @_;
+			$env->{PATH_INFO} =~ m!/(?:[Ttd]/|.+\.
+						(?:mbox\.gz|atom|html))\z!x ||
+				$env->{QUERY_STRING} =~ /\bx=[tA]\b/
+		},
+		# You can increase `max' and `depth' to higher numbers
+		max => 3, # maximum concurrent requests
+		depth => 128, # maximum queue depth (size)
+		# You can also enable a stats endpoint if you wish (optional):
+		stats_match_cb => sub {
+			my ($env) = @_;
+			$env->{REQUEST_URI} eq '/stats' &&
+				$env->{REMOTE_ADDR} eq '127.0.0.1'
+		};
+	# ...
+	}; # /builder
+
+=head1 DESCRIPTION
+
+PublicInbox::PlackLimiter lets a sysadmin limit concurrency to certain
+expensive endpoints while allowing the normal concurrency level of the
+server to run inexpensive requests.
+
+=head1 SEE ALSO
+
+L<Plack> L<Plack::Builder> L<Plack::Middleware>
+
+=cut


(*)	Side note: stuff like VCL (Varnish config language) and nginx
	configs inevitably ends up being an awk-like language anyways;
	might as well just use Perl :P


Code repositories for project(s) associated with this public inbox

	https://80x24.org/public-inbox.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).