From: Eric Wong <e@80x24.org>
To: meta@public-inbox.org
Subject: robots.txt ignored (was: public-inbox.org VPS hopefully stable, now...)
Date: Tue, 18 Mar 2025 18:28:01 +0000
Message-ID: <20250318182801.M799744@dcvr>
In-Reply-To: <20241014224822.M489222@dcvr>
Eric Wong <e@80x24.org> wrote:
> I've got a lot of orphaned sockets and OOM from the kernel the
> past few days.
No longer a problem, at least. More rambling thoughts below...
> It's a combination of kernel TCP memory use,
> OpenSSL, zlib, glibc malloc, Perl 5, and probably other things...
Yeah...
> It looks like a lot of bot traffic trying to scrape IMAP(S),
> too :<
>
> WolfSSL might be an option via Inline::C *shrug*
>
> I've cut down on connections and via iptables/ip6tables
> connlimit and state modules; still not sure where they
> should be atm..
Per-IP limits don't seem effective because bots are not
reusing connections, and they keep changing IPs and User-Agents.
Throttling IPv4 /24 blocks seems OK; I haven't gone to /16 yet.
IPv6 doesn't seem to be a problem at the moment.
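
To illustrate the /24 idea above, here is a minimal sketch of a token
bucket keyed by IPv4 /24 block rather than by single address. It is not
the iptables/ip6tables setup or the Ruby throttler described in this
message; the class name `BlockThrottle` and the `rate`, `burst` and
`prefix` parameters are hypothetical, and IPv6 is simply let through
since it isn't a problem at the moment.

```python
import ipaddress
import time


class BlockThrottle:
    """Token bucket per IPv4 block (hypothetical sketch, not public-inbox code)."""

    def __init__(self, rate=1.0, burst=10, prefix=24):
        self.rate = rate      # tokens refilled per second, per block
        self.burst = burst    # bucket capacity, per block
        self.prefix = prefix  # 24 groups clients by /24; 16 would group by /16
        self.buckets = {}     # IPv4Network -> (tokens, last_seen)

    def allow(self, addr, now=None):
        now = time.monotonic() if now is None else now
        ip = ipaddress.ip_address(addr)
        if ip.version == 6:
            return True  # assumption: IPv6 is not throttled in this sketch
        # strict=False masks off the host bits, e.g. 192.0.2.77 -> 192.0.2.0/24
        net = ipaddress.ip_network(f"{addr}/{self.prefix}", strict=False)
        tokens, last = self.buckets.get(net, (self.burst, now))
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[net] = (tokens - 1, now)
            return True
        self.buckets[net] = (tokens, now)
        return False
```

A bot hopping between addresses inside one /24 drains the same bucket,
which is the point of keying on the block. The `buckets` dict grows
without bound here; on a small-memory box it would need expiry.
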
HTTPS termination is still provided by a horribly-named Ruby
webserver (no nginx/haproxy/etc.), and some janky Ruby userspace
throttles were added last week for non-(curl|w3m|git|lynx)
User-Agents.
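
A minimal sketch of the User-Agent check described above, written in
Python rather than the Ruby actually in use. Only the four client names
come from this message; the prefix match, the case-insensitivity and the
function name `is_throttled` are assumptions of the sketch.

```python
import re

# Clients named in the message. Matching only at the start of the header
# is an assumption of this sketch.
EXEMPT_UA = re.compile(r"\A(?:curl|w3m|git|lynx)\b", re.IGNORECASE)


def is_throttled(user_agent):
    """Return True when a request should go through the throttle."""
    return not EXEMPT_UA.match(user_agent or "")
```

The header is client-controlled, so this only separates well-behaved
tools from everything else; a bot can claim to be curl.
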
The chain is currently like this:
                 __/--> varnish -> public-inbox-netd
  bots -> ruby --|__
                    \--> ssh(*) -> varnish -> public-inbox-netd
I might expand the public-inbox HTTP server code to provide
reverse proxying to Varnish and get rid of ruby(**):
                 __/--> varnish -> public-inbox-netd
  bots -> perl --|__
                    \--> ssh(*) -> varnish -> public-inbox-netd
My small Varnish caches get poisoned from scanning, too. While
Varnish is great for dealing with traffic bursts from popular sites,
it's a waste for crawlers which rarely/never repeat requests.
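
The difference between the two traffic patterns can be shown with a toy
simulation. This uses a plain LRU cache as a stand-in; Varnish's real
eviction and TTL handling is more involved, and the URLs and capacity
below are made up for illustration.

```python
from collections import OrderedDict


def hit_rate(requests, capacity):
    """Fraction of requests served from a simple LRU cache."""
    cache = OrderedDict()
    hits = 0
    for url in requests:
        if url in cache:
            hits += 1
            cache.move_to_end(url)         # mark as most recently used
        else:
            cache[url] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(requests) if requests else 0.0


burst = ["/popular/"] * 100                  # many readers, one page
crawl = [f"/msg{i}/" for i in range(100)]    # a crawler, no repeats
# a crawler scan between two visits to the same page evicts that page
mixed = ["/popular/"] + [f"/msg{i}/" for i in range(8)] + ["/popular/"]
```

With a capacity of 8, `hit_rate(burst, 8)` is 0.99 while both
`hit_rate(crawl, 8)` and `hit_rate(mixed, 8)` are 0.0: the scan gets no
hits of its own and also pushes out the entry that was worth keeping.
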
The "limiter" feature (see public-inbox-config(5)) could be
expanded to handle some endpoints such as /T/, /t/, /t.mbox.gz
to reduce excessive parallelism while still providing enough to
keep git processes saturated. Basically the same thing I did
with /s/ which seems quite effective:
8d6a50ff (www: use a dedicated limiter for blob solver, 2024-03-11)
2368fb20 (viewvcs: -codeblob limiter w/ depth for solver, 2025-02-08)
OTOH, limiter currently doesn't check IP ranges. Perhaps it could?
limiter is great for reducing memory use from zlib buffers and
would be good for limiting the thread-skeleton structure, as
well.
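
For readers unfamiliar with the concept, a concurrency limiter with a
bounded wait queue can be sketched as below. This is not the
public-inbox implementation; the class, its method names and the meaning
given to `depth` (maximum number of waiting jobs) are assumptions of the
sketch and may not match the semantics in public-inbox-config(5).

```python
from collections import deque


class Limiter:
    """Cap running jobs and bound the queue behind them (hypothetical sketch)."""

    def __init__(self, max_running, depth):
        self.max_running = max_running  # jobs allowed to run at once
        self.depth = depth              # jobs allowed to wait; the rest are refused
        self.running = 0
        self.waiting = deque()

    def submit(self, job):
        """Return 'run', 'queued' or 'rejected' for a new job."""
        if self.running < self.max_running:
            self.running += 1
            return "run"
        if len(self.waiting) < self.depth:
            self.waiting.append(job)
            return "queued"
        return "rejected"

    def done(self):
        """Call when a running job finishes; returns the next job to start, if any."""
        if self.waiting:
            return self.waiting.popleft()  # the freed slot goes to the oldest waiter
        self.running -= 1
        return None
```

Memory use is bounded by `max_running` (live buffers) plus `depth`
(queued requests), while `max_running` can still be set high enough to
keep the backend processes busy.
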
> I "only" have 1GB of RAM since it's the cheapest available
> (32-bit userspace, x86_64 kernel). Getting more RAM or CPU
> is absolutely NOT an option; optimizing data structures,
> code and tweaking knobs are the only ways to fix this.
>
> Down with consumerism!
Unchanged :> And I forgot to mention it's all happening on a
single core, even.
Of course, JavaScript or MITM-based services are not an option
due to accessibility.
(*) the yhbt.net/lore/ mirror is behind the SSH tunnel. I know
I could move varnish in front of the SSH tunnel to reduce
SSH traffic, but I'm more familiar with using Ruby||Perl to
route than VCL.
(**) deterministic DESTROY from Perl would've been so useful to
have in the janky Ruby throttler I made