user/dev discussion of public-inbox itself
From: Eric Wong <e@80x24.org>
To: meta@public-inbox.org
Subject: robots.txt ignored (was: public-inbox.org VPS hopefully stable, now...)
Date: Tue, 18 Mar 2025 18:28:01 +0000
Message-ID: <20250318182801.M799744@dcvr>
In-Reply-To: <20241014224822.M489222@dcvr>

Eric Wong <e@80x24.org> wrote:
> I've got a lot of orphaned sockets and OOM from the kernel the
> past few days.

No longer a problem, at least.  More rambling thoughts below...

> It's a combination of kernel TCP memory use,
> OpenSSL, zlib, glibc malloc, Perl 5, and probably other things...

Yeah...

> It looks like a lot of bot traffic trying to scrape IMAP(S),
> too :<
> 
> WolfSSL might be an option via Inline::C *shrug*
> 
> I've cut down on connections and via iptables/ip6tables
> connlimit and state modules; still not sure where they
> should be atm..

Per-IP limits don't seem effective because bots don't reuse
connections and keep changing IPs and User-Agents.
Throttling IPv4 /24 blocks seems OK; I haven't gone to /16, yet.
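
To illustrate the /24 idea, here is a rough Python sketch (not the
iptables rules or the Ruby code actually in use) that counts active
connections per network block instead of per address.  The names
throttle_key and BlockThrottle, the limit of 8, and the IPv6 /64
prefix are all assumptions made up for the example:

```python
import ipaddress
from collections import Counter


def throttle_key(addr: str, v4_prefix: int = 24, v6_prefix: int = 64) -> str:
    """Map a client address to the network block it is counted under.

    The /64 default for IPv6 is an assumption for this sketch only.
    """
    ip = ipaddress.ip_address(addr)
    prefix = v4_prefix if ip.version == 4 else v6_prefix
    # strict=False masks off the host bits: 192.0.2.77/24 -> 192.0.2.0/24
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))


class BlockThrottle:
    """Hypothetical per-block counter of concurrent connections."""

    def __init__(self, max_per_block: int = 8):
        self.max_per_block = max_per_block
        self.active = Counter()  # block -> connections currently open

    def acquire(self, addr: str) -> bool:
        """Return True and count the connection if its block has room."""
        key = throttle_key(addr)
        if self.active[key] >= self.max_per_block:
            return False
        self.active[key] += 1
        return True

    def release(self, addr: str) -> None:
        """Give back a slot when the connection closes."""
        key = throttle_key(addr)
        if self.active[key] > 0:
            self.active[key] -= 1
            if self.active[key] == 0:
                del self.active[key]  # keep the table small
```

Going to /16 would only mean passing v4_prefix=16; the trade-off is
that more unrelated clients end up sharing one budget.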

IPv6 doesn't seem to be a problem at the moment.

HTTPS termination is still provided by a horribly-named Ruby
webserver (no nginx/haproxy/etc.), and some janky Ruby userspace
throttles were added last week for non-(curl|w3m|git|lynx)
User-Agents.
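
A minimal Python sketch of that kind of User-Agent check follows.
It is not the Ruby code in question; the function name and the
choice to match only the leading "product/" token of the header are
assumptions for illustration:

```python
import re

# Assumption: the exempt clients identify themselves with a leading
# product token, e.g. "curl/8.5.0" or "Lynx/2.9.0 libwww-FM/2.14".
EXEMPT_UA = re.compile(r"\A(?:curl|w3m|git|lynx)/", re.IGNORECASE)


def is_throttled(user_agent):
    """True if the request should go through the userspace throttle."""
    # A missing header is treated like any other unknown client.
    return not EXEMPT_UA.match(user_agent or "")
```

Since the header is trivially spoofed, this only separates polite
clients from the rest; it is not a defense on its own.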

The chain is currently like this:

                        __/--> varnish -> public-inbox-netd
        bots -> ruby --|__
                          \--> ssh(*) -> varnish -> public-inbox-netd

I might expand the public-inbox HTTP server code to provide
reverse proxying to Varnish and get rid of ruby(**):

                        __/--> varnish -> public-inbox-netd
        bots -> perl --|__
                          \--> ssh(*) -> varnish -> public-inbox-netd

My small Varnish caches get poisoned from scanning, too.  While
Varnish is great for dealing with traffic bursts from popular sites,
it's a waste for crawlers which rarely/never repeat requests.
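
The effect can be shown with a toy LRU cache in Python.  This is
only a model: it is not Varnish's actual eviction code, and the
capacity and URL paths are made up for the example:

```python
from collections import OrderedDict


class LRUCache:
    """Toy least-recently-used cache that counts hits and misses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()  # oldest entry first
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.data:
            self.data.move_to_end(key)  # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1
            self.data[key] = True  # pretend we fetched it from the backend
            if len(self.data) > self.capacity:
                self.data.popitem(last=False)  # evict least recently used


def simulate():
    cache = LRUCache(capacity=4)
    hot = ["/", "/new.atom", "/README"]  # hypothetical popular URLs
    for url in hot * 3:  # a burst from real readers warms the cache
        cache.get(url)
    warm_hits = cache.hits
    for i in range(10):  # a crawler walks URLs it never repeats
        cache.get(f"/scan/{i}/")
    for url in hot:  # the popular URLs now have to be refetched
        cache.get(url)
    return warm_hits, cache.hits, cache.misses
```

In this model the scan adds no hits at all and pushes every popular
entry out, so the next real reader pays the full backend cost again.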

The "limiter" feature (see public-inbox-config(5)) could be
expanded to handle some endpoints such as /T/, /t/, /t.mbox.gz
to reduce excessive parallelism while still providing enough to
keep git processes saturated.  Basically the same thing I did
with /s/ which seems quite effective:
8d6a50ff (www: use a dedicated limiter for blob solver, 2024-03-11)
2368fb20 (viewvcs: -codeblob limiter w/ depth for solver, 2025-02-08)

OTOH, limiter currently doesn't check IP ranges.  Perhaps it could?

limiter is great for reducing memory use from zlib buffers and
would be good for limiting the thread-skeleton structure, as
well.
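
As a sketch of the general idea (bounded parallelism plus a bounded
wait queue), here is a small thread-based limiter in Python.  It is
not public-inbox's implementation, which is Perl and event-driven;
the class names and the max_running/depth parameters are assumptions
chosen for the example:

```python
import threading
from contextlib import contextmanager


class LimiterBusy(Exception):
    """Raised when both the running slots and the wait queue are full."""


class Limiter:
    """Hypothetical limiter: at most max_running jobs, depth waiters."""

    def __init__(self, max_running, depth):
        self.max_running = max_running
        self.depth = depth
        self.running = 0
        self.waiting = 0
        self.cond = threading.Condition()

    @contextmanager
    def slot(self):
        with self.cond:
            if self.running >= self.max_running:
                if self.waiting >= self.depth:
                    raise LimiterBusy("queue full")  # e.g. answer with 503
                self.waiting += 1
                try:
                    while self.running >= self.max_running:
                        self.cond.wait()
                finally:
                    self.waiting -= 1
            self.running += 1
        try:
            yield  # the expensive work runs here
        finally:
            # The slot is always returned, even if the work raises.
            with self.cond:
                self.running -= 1
                self.cond.notify()
```

One such object could be shared by several expensive endpoints, with
each request wrapped in "with limiter.slot():" so that the number of
buffers alive at once stays bounded.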

> I "only" have 1GB of RAM since it's the cheapest available
> (32-bit userspace, x86_64 kernel).  Getting more RAM or CPU
> is absolutely NOT an option; optimizing data structures,
> code and tweaking knobs are the only ways to fix this.
> 
> Down with consumerism!

Unchanged :>  And I forgot to mention it's all happening on a
single core, even.

Of course, JavaScript or MITM-based services are not an option
due to accessibility.


(*) the yhbt.net/lore/ mirror is behind the SSH tunnel.  I know
    I could move varnish in front of the SSH tunnel to reduce
    SSH traffic, but I'm more familiar with using Ruby||Perl to
    route than VCL.

(**) deterministic DESTROY from Perl would've been so useful to
     have in the janky Ruby throttler I made

Thread overview: 3+ messages
2024-10-14 22:48 public-inbox.org VPS hopefully stable, now Eric Wong
2025-03-18 18:28 ` Eric Wong [this message]
2025-03-20  0:05   ` [RFC] plack_limiter: middleware to limit concurrency Eric Wong

Code repositories for project(s) associated with this public inbox

	https://80x24.org/public-inbox.git
