about summary refs log tree commit homepage
path: root/lib/PublicInbox/HTTP.pm
DateCommit message (Collapse)
2021-10-24http: use a larger buffer for ->getline responses
64K matches the Linux pipe default, and matches what we use in httpd/async and qspawn. This should reduce syscalls used for serving git packs via dumb HTTP and any ->getline code paths used by other PSGI code. This appears to speed up HTML rendering by w3m when serving giant HTML responsees from the Devel::Mwrap::PSGI memory debugger.
2021-10-20httpd: reject requests with spaces in header names
Malicious clients may attempt HTTP request smuggling this way. This doesn't affect our current code as we only look for exact matches, but it could affect other servers behind a to-be-implemented reverse proxy built around our -httpd. This doesn't affect users behind varnish at all, nor the HTTPS/HTTP reverse proxy I use (I don't know about nginx), but could be passed through by other reverse proxies. This change is only needed for HTTP::Parser::XS which most users probably use. Users of the pure Perl parser (via PLACK_HTTP_PARSER_PP=1) already hit 400 errors in this case, so this makes the common XS case consistent with the pure Perl case. cf. https://www.mozilla.org/en-US/security/advisories/mfsa2006-33/
2021-10-16httpd: move pipeline logic into event_step
Most of the HTTP server code was written for Danga::Socket and not fully-transitioned to take advantage of PublicInbox::DS. This change brings it up-to-date with the style of pipeline handling used for -imapd and -nntpd.
2021-10-16imapd+nntpd: drop timer-based expiration
It's needlessly complex and O(n), so it doesn't scale well to a high number of clients nor is it easy-to-scale with the data structures available to us in pure Perl. In any case, I see no evidence of either -imapd nor -nntpd experiencing high connection loads on public-facing sites. -httpd has never had its own timer-based expiration, either. Fwiw, public-inbox.org itself has been running a public-facing HTTP/HTTPS server with no userspace idle client expiration for the past 8 years or with no ill effect. Clients can come and go as they wish, and SO_KEEPALIVE takes care of truly broken connections if they're gone for ~2 hours. Internet connections drop all time, so it should be harmless to drop connections w/o warning since both NNTP and IMAP protocols have well-defined semantics for determining if a message was truncated (as does HTTP/1.1+).
2021-10-13treewide: use warn() or carp() instead of env->{psgi.errors}
Large chunks of our codebase and 3rd-party dependencies do not use ->{psgi.errors}, so trying to standardize on it was a fruitless endeavor. Since warn() and carp() are standard mechanism within Perl, just use that instead and simplify a bunch of existing code.
2021-10-09http: avoid Perl target cache for psgi.input
By using syswrite to populate env->{psgi.input}. The substr() call IO::Handle->write will trigger Perl's target/scratchpad and result in a permanent allocation. Since this is a cold path, that allocation is pointless, and syswrite() can already write a substring. Allowing Perl to cache a large allocation in a cold path only result in fragmentation and wasted RAM. write(2) on a regular file won't result in short writes unless the FS quotas or free space limits are hit, or the buffer is close to overflowing (e.g. the 0x7ffff000-byte Linux limit). Since our HTTP server will never buffer that much in RAM, there's no need to retry syswrite nor rely on the retrying implicit in IO::Handle->write and the "print" perlop.
2021-08-28get rid of unnecessary bytes::length usage
The only place where we could return wide characters with -httpd was the raw $INBOX_DIR/description text, which is now converted to octets. All daemon (HTTP/NNTP/IMAP) sockets are opened in binary mode, so length() and bytes::length() are equivalent on reads. For socket writes, any non-octet data would warn about wide characters and we are strict in warnings with test_httpd. All gzipped buffers are also octets, as is PublicInbox::Eml->body, and anything from PerlIO objects ("git cat-file --batch" output, filesystems), so bytes::length was unnecessary in all those places.
2021-01-01update copyrights for 2021
Using "make update-copyrights" after setting GNULIB_PATH in my config.mak
2020-10-30tls: epollbit: account for miscellaneous OpenSSL errors
Apparently they happen (triggered by my -imapd instance), so bail out by closing the underlying socket rather than stopping the event loop and daemon process.
2020-07-06mboxgz: do asynchronous git blob retrievals
This lets the -httpd worker process make better use of time instead of waiting for git-cat-file to respond. With 4 jobs in the new test case against a clone of <https://public-inbox.org/meta/>, a speedup of 10-12% is shown. Even a single job shows a 2-5% improvement on an SSD.
2020-06-28ds: remove fields.pm usage
Since the removal of pseudo-hash support in Perl 5.10, the "fields" module no longer provides the space or speed benefits it did in 5.8. It also does not allow for compile-time checks, only run-time checks. To me, the extra developer overhead in maintaining "use fields" args has become a hassle. None of our non-DS-related code uses fields.pm, nor do any of our current dependencies. In fact, Danga::Socket (which DS was originally forked from) and its subclasses are the only fields.pm users I've ever encountered in the wild. Removing fields may make our code more approachable to other Perl hackers. So stop using fields.pm and locked hashes, but continue to document what fields do for non-trivial classes.
2020-06-21daemon: use ->can to check for IO::Socket::SSL
Doing a ref($obj) string comparison ties us to IO::Socket::SSL (and OpenSSL) In the future, we may support GnuTLS or other TLS implementations. This was already done in the IMAP code.
2020-03-19http: fix RFC conformance w.r.t. message length
We need to favor "Transfer-Encoding: chunked" over the value of the Content-Length header. We should also reject bogus, duplicate and/or unreasonable values for both these, since they can trigger unexpected behavior when combined with other HTTP parsers in proxies such as varnish, nginx, haproxy, etc... See RFC 7230 (and RFC 2616) for more details: https://tools.ietf.org/html/rfc7230 https://www.rfc-editor.org/errata_search.php?rfc=7230
2020-02-06treewide: run update-copyrights from gnulib for 2019
I didn't wait until September to do it, this year!
2020-01-25http: eliminate short-lived cyclic ref for psgix.io
While there is no known actual leak due to reference cycles, here, eliminating a potential source of leaks is helpful.
2020-01-13ds|http|nntp: simplify {wbuf} population
We can rely on autovification to turn `undef' value of {wbuf} into an arrayref. Furthermore, "push" returns the (new) size of the array since at least Perl 5.0 (I didn't look further back), so we can use that return value instead of calling "scalar" again.
2020-01-11allow HTTP_HOST to be '0' via defined() checks
'0' is a valid value for HTTP_HOST, and maybe some folks will want to hit that as port 80 where the HTTP client won't send the ":$PORT" suffix.
2020-01-09http: log response_write errors
Application-supplied callbacks may error out, try to log them so the PSGI app developer can figure out what went wrong.
2020-01-06treewide: "require" + "use" cleanup and docs
There's a bunch of leftover "require" and "use" statements we no longer need and can get rid of, along with some excessive imports via "use". IO::Handle usage isn't always obvious, so add comments describing why a package loads it. Along the same lines, document the tmpdir support as the reason we depend on File::Temp 0.19, even though every Perl 5.10.1+ user has it. While we're at it, favor "use" over "require", since it it gives us extra compile-time checking.
2020-01-01http: update comment about psgix.io usage
We've been using async_pass for a while.
2019-12-22http: avoid anonymous sub for getline callback
We can avoid the danger of self-referential subs entirely for code internal to PublicInbox::HTTP. This change was only made possible by commit 8e1c3155da4edc082e8e3d8b30351f0c861757a7 ("ds: pass $self to code references")
2019-12-22http: get rid of anonymous subs for write/close
Each sub costs us several kilobytes of memory for every response we make. An arrayref only costs 80 bytes on 64-bit, so bless that to packages with appropriate ->write and ->close methods.
2019-12-14ds: move EvCleanup code into DS
EvCleanup only existed since Danga::Socket was a separate component, and cleanup code belongs with the event loop.
2019-09-17http: remove unnecessary delete
Only removing $http->{env} is needed to prevent circular references. $env->{'psgix.io'} does not need to be deleted since $env will no longer have any references to it when ->close returns.
2019-09-17http: drop unused `$env' variable after delete
And explain why we need to do that delete in a comment.
2019-09-14tmpfile: give temporary files meaningful names
Although we always unlink temporary files, give them a meaningful name so that we can we can still make sense of the pre-unlink name when using lsof(8) or similar tools on Linux.
2019-09-09run update-copyrights from gnulib for 2019
2019-07-10http|nntp: avoid recursion inside ->write
In HTTP.pm, we can use the same technique NNTP.pm uses with long_response with the $long_cb callback and avoid storing $pull in the per-client structure at all. We can also reuse the same logic to push the callback into wbuf from NNTP. This does NOT introduce a new circular reference, but documents it more clearly.
2019-07-08http|nntp: "use PublicInbox::DS" instead of ->import
Relying on "use" to import during BEGIN means we get to take advantage of prototype checking of function args during the rest of the compilation phase.
2019-06-29httpd/async: switch to buffering-as-fast-as-possible
With DS buffering to a temporary file nowadays, applying backpressure to git-http-backend(1) hurts overall memory usage of the system. Instead, try to get git-http-backend(1) to finish as quickly as possible and use edge-triggered notifications to reduce wakeups on our end.
2019-06-29http: support HTTPS (kinda)
It's barely any effort at all to support HTTPS now that we have NNTPS support and can share all the code for writing daemons. However, we still depend on Varnish to avoid hug-of-death situations, so supporting reverse-proxying will be required.
2019-06-29ds: handle deferred DS->close after timers
Our hacks in EvCleanup::next_tick and EvCleanup::asap were due to the fact "closed" sockets were deferred and could not wake up the event loop, causing certain actions to be delayed until an event fired. Instead, ensure we don't sleep if there are pending sockets to close. We can then remove most of the EvCleanup stuff While we're at it, split out immediate timer handling into a separate array so we don't need to deal with time calculations for the event loop.
2019-06-29http: use requeue instead of watch_in1
Don't use epoll or kqueue to watch for anything unless we hit EAGAIN, since we don't know if a socket is SSL or not.
2019-06-29ds: share lazy rbuf handling between HTTP and NNTP
Doing this for HTTP cuts the memory usage of 10K idle-after-one-request HTTP clients from 92 MB to 47 MB. The savings over the equivalent NNTP change in commit 6f173864f5acac89769a67739b8c377510711d49, ("nntp: lazily allocate and stash rbuf") seems down to the size of HTTP requests and the fact HTTP is a client-sends-first protocol where as NNTP is server-sends-first.
2019-06-24allow use of PerlIO layers for filesystem writes
It may make sense to use PerlIO::mmap or PerlIO::scalar for DS write buffering with IO::Socket::SSL or similar (since we can't use MSG_MORE), so that means we need to go through buffering in userspace for the common case; while still being easily compatible with slow clients. And it also simplifies GitHTTPBackend slightly. Maybe it can make sense for HTTP input buffering, too...
2019-06-24ds: hoist out do_read from NNTP and HTTP
Both NNTP and HTTP have common needs and we can factor out some common code to make dealing with IO::Socket::SSL easier.
2019-06-24http|nntp: be explicit about bytes::length on rbuf
It should not matter because our rbuf is always from a socket without encoding layers, but this makes things easier to follow.
2019-06-24ds: pass $self to code references
We can reduce the amount of short-lived anonymous subs we create by passing $self to code references.
2019-06-24http: don't pass extra args to PublicInbox::DS::close
YAGNI Followup-to: commit 30ab5cf82b9d47242640f748a0f9a088ca783e32 ("ds: reduce Errno imports and drop ->close reason")
2019-06-24ds: favor `delete' over assigning fields to `undef'
This is cleaner in most cases and may allow Perl to reuse memory from unused fields. We can do this now that we no longer support Perl 5.8; since Danga::Socket was written with struct-like pseudo-hash support in mind, and Perl 5.9+ dropped support for pseudo-hashes over a decade ago.
2019-06-24http|nntp: favor "$! == EFOO" over $!{EFOO} checks
Integer comparisions of "$!" are faster than hash lookups. See commit 6fa2b29fcd0477d126ebb7db7f97b334f74bbcbc ("ds: cleanup Errno imports and favor constant comparisons") for benchmarks.
2019-06-24ds: get rid of event_watch field
We don't need to keep track of that field since we always know what events we're interested in when using one-shot wakeups.
2019-06-24ds: set event flags directly at initialization
We can avoid the EPOLL_CTL_ADD && EPOLL_CTL_MOD sequence with a single EPOLL_CTL_ADD.
2019-06-24ds: switch write buffering to use a tempfile
Data which can't fit into a generously-sized socket buffer, has no business being stored in heap.
2019-06-24ds: share send(..., MSG_MORE) logic
No sense in having similar Linux-specific functionality in both our NNTP.pm and HTTP.pm
2019-06-24http: favor DS->write(strref) when reasonable
This can avoid large memory copies when strings can't be copy-on-write and saves us the trouble of creating new refs in the code.
2019-06-24ds: lazy-initialize wbuf
We don't need write buffering unless we encounter slow clients requesting large responses. So don't waste a hash slot or (empty) arrayref for it.
2019-06-24ds: get rid of {closed} field
Merely checking the presence of the {sock} field is enough, and having multiple sources of truth increases confusion and the likelyhood of bugs.
2019-06-16ds: stop distinguishing event read and write callbacks
Having separate read/write callbacks in every class is too confusing to my easily-confused mind. Instead, give every class an "event_step" callback which is easier to wrap my head around. This will make future code to support IO::Socket::SSL-wrapped sockets easier-to-digest, since SSL_write() can require waiting on POLLIN events, and SSL_read() can require waiting on POLLOUT events.
2019-06-10ds: do not distinguish between POLLHUP and POLLERR
In my experience, both are worthless as any normal read/write call path will be wanting to check errors and deal with them appropriately; so we can just call event_read, for now. Eventually, there'll probably be only one callback for dealing with all in/out/err/hup events to simplify logic, especially w.r.t TLS socket negotiation.