path: root/lib/PublicInbox/NNTP.pm
Date Commit message
2021-08-28 move ->ids_after from mm to over
Since we favor ->over in WWW and IMAP, move this method to ->over to reduce open files in common cases. This fixes the /$EXTINDEX_NAME/all.mbox.gz endpoint for extindex entries (which may get expensive...).
2021-08-28 get rid of unnecessary bytes::length usage
The only place where we could return wide characters with -httpd was the raw $INBOX_DIR/description text, which is now converted to octets. All daemon (HTTP/NNTP/IMAP) sockets are opened in binary mode, so length() and bytes::length() are equivalent on reads. For socket writes, any non-octet data would warn about wide characters and we are strict in warnings with test_httpd. All gzipped buffers are also octets, as is PublicInbox::Eml->body, and anything from PerlIO objects ("git cat-file --batch" output, filesystems), so bytes::length was unnecessary in all those places.
2021-08-25 imap+nntp: die loudly if ->mm or ->over disappear
While the WWW front-end can gracefully handle ->mm and ->over disappearing (in most cases), IMAP+NNTP front-ends are completely dependent on these and fail mysteriously when they go missing after startup. These will hopefully make issues like what Konstantin encountered more obvious: Link: https://public-inbox.org/meta/20210824204855.ejspej4z7r2rpu63@nitro.local/
2021-06-24 favor git(1) rather than libgit2 for ExtSearch
While both git and libgit2 take around 16 minutes to load 100K alternates, there's already a proposed patch to make git faster: <https://lore.kernel.org/git/20210624005806.12079-1-e@80x24.org/> It's also easier to patch and install git locally since the git.git build system defaults to prefix=$HOME and dealing with dynamic linking with libgit2 is more difficult for end users relying on Inline::C. libgit2 remains in use for the non-ALL.git case, but maybe it's not necessary (libgit2 is significantly slower than git in Debian 10 due to SHA-1 collision checking).
2021-03-16 nntp: remove unused header_append method
It was unused since 1bf653ad139bf7bb3d853ab0b5eae3eaa1b13a95 ("nntp+www: drop List-* and Archived-At headers")
2021-01-01 update copyrights for 2021
Using "make update-copyrights" after setting GNULIB_PATH in my config.mak
2020-12-11 nntp+www: drop List-* and Archived-At headers
These headers can conflict with headers in the DKIM signature; and parsing the DKIM-Signature header to determine whether or not we can safely add a header would be more code and CPU cycles. Since IMAP seems fine without these headers (and JMAP will likely be, too), there's likely no need to continue appending these to every message. Nowadays, developers seem sufficiently trained to use URLs with Message-IDs in them. So drop the headers and save some cycles and bandwidth all around.
2020-12-10 www+nntp: deal with lack of addresses for ->ALL
Since extindex is an amalgamation of several inboxes, discerning an appropriate address for List-Post: would be expensive and most likely unnecessary. Some legacy/historical inboxes may have no active address, either, so don't attempt to set the List-Post header if no addresses are configured.
2020-12-09 rename {pi_config} fields to {pi_cfg}
{pi_config} may be confused with the documented `PI_CONFIG' environment variable, and we'll favor vowel-removal to be consistent with our usage of object references. The `pi_' prefix may stay in some places, for now; since a separate namespace may come into this codebase for local/private client-tooling. For InboxIdle, we'll also remove an invalid comment about holding a reference to the PublicInbox::Config object, too.
2020-12-09 nntp: replace {ng} with {ibx} for consistency
They're PublicInbox::Inbox objects just like the rest of the non-NNTP code. So rename the NNTP code for consistency with the rest of the codebase. Furthermore, {ng} and $ng may be confused with the `--ng' switch for -init, and that's a non-ref scalar string.
2020-12-05 nntp: small speed up for multi-line responses
Using a non-zero-length separator for `join' requires extra work inside Perl. We can shove the cost of appending "\r\n" into the `map' loop, instead. This speeds up the `join' operation. The "deferred" log entry for a "LISTGROUP org.kernel.vger.linux-kernel" command (with nearly 3.8 million messages) goes from ~3.96s to 3.86s on my workstation.
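The two ways of assembling a multi-line response can be shown outside of Perl. The following is a minimal Python sketch, not the project's code: both function names are hypothetical, and the speed difference described above is specific to Perl's `join` and is not claimed for Python. It only demonstrates that appending the terminator per line and joining with an empty separator yields the same output as joining with a "\r\n" separator.

```python
def join_with_separator(lines):
    # Join with a non-empty separator, then add the final terminator.
    return "\r\n".join(lines) + "\r\n" if lines else ""


def join_with_appended_terminator(lines):
    # Append "\r\n" to every line first, then join with an empty separator.
    return "".join(line + "\r\n" for line in lines)


if __name__ == "__main__":
    article_numbers = ["1", "2", "3"]
    same = (join_with_separator(article_numbers)
            == join_with_appended_terminator(article_numbers))
    print(same)
```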
2020-12-05 nntp: xref_by_tc: simplify slightly
We can invalidate ibx->{newsgroup} at config load-time to avoid having to check ibx->{newsgroup} validity in To/Cc: matching. This saves us some hash lookups in all cases.
2020-12-01 nntp: make ->ALL Xref generation more fuzzy
For ->ALL users, this mitigates the regression introduced by commit 811b8d3cbaa790f59b7b107140b86248da16499b ("nntp: xref: use ->ALL extindex if available"), since it's common to cross post messages to some mailing lists with per-list trailers for unsubscribe information. We won't bother dealing with Bcc-ed messages since those are nearly all spam when it comes to public mailing lists. Fixes: 811b8d3cbaa790f5 ("nntp: xref: use ->ALL extindex if available") Link: https://public-inbox.org/meta/20201130194201.GA6687@dcvr/
2020-11-29 nntpd: remove redundant {groups} shortcut
It's not worth confusing hackers reading the source to have two ways to access the same (large) hash table. So just go through PublicInbox::Config objects for now since the extra hash lookup isn't going to be noticeable. I've also started favoring "for" instead of "foreach" since they're the equivalent perlop and less wear on my fingers + keyboard.
2020-11-29 nntp: XPATH uses ->ALL extindex, too
Another 30-40% speedup when testing against a local lore.kernel.org mirror. In either case, we'll consistently sort the response for ease-of-testing and client-side cache-friendliness.
2020-11-29 nntp: art_lookup: use mid_lookup and simplify
This lets us take advantage of mid_lookup speedup from the previous commit. While we're at it, start moving towards using `$ibx' as the abbreviation for PublicInbox::Inbox objects even in the NNTP code, since they've been shared with the WWW code for several years, now.
2020-11-29 nntp: speed up mid_lookup() using ->ALL extindex
We can reuse "xref3" information in extindex to quickly match messages matching a given Message-ID across hundreds or thousands of newsgroups with a few SQL statements. "XHDR Xref $MESSAGE_ID" is around 40% faster, on top of previous speedups.
2020-11-29 nntp: NEWGROUPS uses long_response
We can amortize the cost of NEWGROUPS time filtering using the long_response API. This lets us handle hundreds/thousands of inboxes without monopolizing the event loop for this command. Further speedup is possible using MiscSearch, but that requires not-yet-done indexing changes to MiscIdx.
2020-11-28 nntp: xref: use ->ALL extindex if available
Getting Xref for cross-posted messages is an O(n) operation where `n' is the number of newsgroups on the server. This works acceptably when there are dozens of groups, but would be unacceptable when there are tens of thousands of newsgroups. With ~140 newsgroups, a lore.kernel.org mirror already handles "XHDR Xref $MESSAGE_ID" requests around 30% faster after creating the xref3.idx_nntp index. The SQL additions to ExtSearch.pm may be a bit strange and seem more appropriate for Over.pm; however it currently makes sense to me since those bits of over.sqlite3 access are exclusive to ExtSearch and can't be used by traditional v1/v2 inboxes...
2020-11-28 nntp: xref: simplify sub signature
We'll be using the `xref3' table in extindex to speed up xref(), and that'll require comparisons against $smsg->{blob}. So pass the entire $smsg through.
2020-11-28 nntp: some minor golfing
Reduce screen real estate usage to reduce human attention span requirements.
2020-11-28 nntp: move LIST iterators to long_response
Iterating through many newsgroups can hog the event loop if many random seeks are required. Avoid monopolizing the event loop in that case by using the long_response API. For now, we can still rely on grep() since it seems to work reasonably well with 50K test newsgroup names.
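The general idea of amortizing a long iteration across event-loop turns can be sketched as follows. This is a Python illustration under stated assumptions, not the actual long_response API: the generator, the toy round-robin scheduler, and the `batch_size` parameter are all hypothetical stand-ins for whatever the Perl event loop does.

```python
from collections import deque


def long_response(names, matches, batch_size=1000):
    """Yield matching names in batches so other clients can be serviced."""
    out = []
    for i, name in enumerate(names, 1):
        if matches(name):
            out.append(name)
        if i % batch_size == 0:
            yield out  # hand back partial results and give up our turn
            out = []
    if out:
        yield out


def run_round_robin(tasks):
    """Toy scheduler: advance each (task_id, generator) one step in turn."""
    queue = deque(tasks)
    results = []
    while queue:
        task_id, gen = queue.popleft()
        try:
            chunk = next(gen)
        except StopIteration:
            continue  # this task is finished; do not requeue it
        results.append((task_id, chunk))
        queue.append((task_id, gen))
    return results
```

With two tasks queued, the scheduler interleaves their batches rather than letting the first one run to completion, which is the fairness property the commit message describes.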
2020-11-28 nntp: LIST ACTIVE.TIMES use angle brackets around address
This matches the example shown in RFC 3977, section
2020-11-28 nntp: NEWNEWS: speed up filtering
With 50K newsgroups, the filtering phase goes from ~2000 seconds to ~90 MILLISECONDS by relying on the grep perlop. This moves ->over checking out of the main dispatch and amortizes the cost via long_response. (Fairly scheduled) long_response time in newnews_i now takes ~360 seconds as opposed to ~30 seconds before this change; however, eliminating ~2000 seconds from the initial filtering phase is more than worth it.
2020-11-28 nntp: use grep operation for wildmat matching
Based on experiences with the IMAP server, this ought to be significantly faster (as to be demonstrated in the next commit).
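As an illustration of filtering a list of newsgroup names against a wildmat in a single pass, here is a minimal Python sketch. It is not the project's Perl code: the function names are hypothetical, and the translation rules are a simplified reading of the wildmat format in RFC 3977 section 4 (`*`, `?`, comma-separated patterns, `!` negation, rightmost matching pattern wins). A list comprehension stands in for Perl's `grep`.

```python
import re


def wildmat_pattern_to_regex(pattern):
    # Translate one wildmat pattern: "*" -> any run, "?" -> one character.
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def compile_wildmat(wildmat):
    # Compile once, up front, so the per-name check is only regex matching.
    rules = []
    for pat in wildmat.split(","):
        negated = pat.startswith("!")
        if negated:
            pat = pat[1:]
        rules.append((negated, wildmat_pattern_to_regex(pat)))
    return rules


def wildmat_matches(rules, name):
    # The rightmost pattern that matches decides the outcome.
    for negated, rx in reversed(rules):
        if rx.fullmatch(name):
            return not negated
    return False


def filter_groups(wildmat, names):
    rules = compile_wildmat(wildmat)
    return [n for n in names if wildmat_matches(rules, n)]
```

Compiling the patterns once before the loop is the design choice that matters here; the per-name work is then a handful of regex matches.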
2020-11-28 mm: min/max: return 0 instead of undef
This simplifies callers and allows empty newsgroups to be represented (the WWW UI may be insufficient there, too).
2020-11-28 nntp: use Inbox->uidvalidity instead of ->mm->created_at
This is memoized, and may allow us some future flexibility w.r.t PublicInbox::Inbox-like objects. While we're at it, use defined-or ("//") in case somebody really set a public-inbox creation time to the Unix epoch.
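The reason for preferring defined-or over a truthiness test can be shown with a small Python analogue. This is a sketch only: the function names and the `fallback` argument are hypothetical, and `None` plays the role of Perl's `undef`. A creation time of 0 (the Unix epoch) is a valid value that a truthiness test would wrongly discard.

```python
def creation_time_truthy(created_at, fallback):
    # Analogue of Perl's "||": treats 0 as missing, which is wrong here.
    return created_at or fallback


def creation_time_defined_or(created_at, fallback):
    # Analogue of Perl's "//": only a missing value triggers the fallback.
    return created_at if created_at is not None else fallback
```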
2020-11-04 nntp: attempt RFC 5536 3.1.5-conformant Path: headers
Perhaps some NNTP clients would be unhappy with the old value "y". So use a bit more bandwidth+space to use the server-name and historical "!not-for-mail" tail-entry to better conform to a published RFC. Reported-by: Andrey Melnikov <temnota.am@gmail.com>
2020-11-04 nntp: delimit Newsgroup: header with commas
...instead of spaces. This is specified in RFC 5536 3.1.4. Include references to RFC 1036, 5536 and 5537 in our docs while we're at it. Reported-by: Andrey Melnikov <temnota.am@gmail.com> Link: https://public-inbox.org/meta/CA+PODjpUN5Q4gBFQhAzUNuMasVEdmp9f=8Uo0Ej0mFumdSwi4w@mail.gmail.com/
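The header formats described in the two 2020-11-04 commits can be sketched as below. This is a hedged Python illustration rather than the Perl implementation; the function names and the example server name are hypothetical, and only the comma delimiter and the "!not-for-mail" tail-entry are taken from the commit messages.

```python
def newsgroups_header(groups):
    # RFC 5536 3.1.4: newsgroup names are separated by commas, not spaces.
    return "Newsgroups: " + ",".join(groups)


def path_header(server_name):
    # Server name followed by the historical "!not-for-mail" tail-entry.
    return "Path: " + server_name + "!not-for-mail"


if __name__ == "__main__":
    print(newsgroups_header(["inbox.comp.test", "inbox.comp.misc"]))
    print(path_header("news.example.org"))  # hypothetical server name
```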
2020-10-30 tls: epollbit: account for miscellaneous OpenSSL errors
Apparently they happen (triggered by my -imapd instance), so bail out by closing the underlying socket rather than stopping the event loop and daemon process.
2020-09-12 nntp: share more code between art_lookup callers
This prepares us for future changes to improve scalability to many inboxes.
2020-09-10 nntp: fix cross-newsgroup Message-ID lookups
We cannot blindly use the selected newsgroup for HEAD/ARTICLE/BODY requests using Message-ID, since those commands look across all newsgroups; not just the selected one (if any). So stuff a reference to the Inbox object into $smsg. We can reduce args passed into set_nntp_headers() and msg_hdr_write(), too. Fixes: 0e6ceff37fc38f28 ("nntp: support slow blob retrievals")
2020-08-02 nntp: fix STAT command
The return value of art_lookup changed but this command wasn't updated since it wasn't tested. Fixes: 0e6ceff37fc38f28 ("nntp: support slow blob retrievals")
2020-07-06 daemon: warn on missing blobs
Since -edit and -purge should be rare and TOCTOU around them rarer still; missing {blobs} could be indicative of a real bug elsewhere. Warn on them. And I somehow ended up with 3 different field names for Inbox objects. Perhaps they'll be made consistent in the future.
2020-06-28 ds: remove fields.pm usage
Since the removal of pseudo-hash support in Perl 5.10, the "fields" module no longer provides the space or speed benefits it did in 5.8. It also does not allow for compile-time checks, only run-time checks. To me, the extra developer overhead in maintaining "use fields" args has become a hassle. None of our non-DS-related code uses fields.pm, nor do any of our current dependencies. In fact, Danga::Socket (which DS was originally forked from) and its subclasses are the only fields.pm users I've ever encountered in the wild. Removing fields may make our code more approachable to other Perl hackers. So stop using fields.pm and locked hashes, but continue to document what fields do for non-trivial classes.
2020-06-25 git_async_cat: remove circular reference
While this circular reference was carefully managed to not leak memory, it was still triggering a warning at -imapd/-nntpd shutdown due to the EPOLL_CTL_DEL op failing after the $Epoll FD gets closed. So remove the circular reference by providing a ref to `undef', instead.
2020-06-21 nntp: support slow blob retrievals
Having `git cat-file' as a separate process naturally lends itself to asynchronous dispatch. Our event loop for -nntpd no longer blocks on slow git storage. Pipelining in -imapd was tricky and bugs were exposed by mbsync(1). Update t/nntpd.t to support pipelining ARTICLE requests to ensure we don't have the same problems -imapd did during development.
2020-06-21 nntp: event_step: prepare for async git reads
This matches PublicInbox::IMAP::event_step and will allow us to handle blob retrievals from git asynchronously without falling over on pipelined requests.
2020-06-21 daemon: use ->can to check for IO::Socket::SSL
Doing a ref($obj) string comparison ties us to IO::Socket::SSL (and OpenSSL). In the future, we may support GnuTLS or other TLS implementations. This was already done in the IMAP code.
2020-06-13 nntpd+imapd: detect replaced over.sqlite3
For v1 inboxes (and possibly v2 in the future, for VACUUM), public-inbox-compact replaces over.sqlite3 with a new file. This currently doesn't need an extra inotify watch descriptor (or FD for kevent) at the moment, so it can coexist nicely for systems w/o IO::KQueue or Linux::Inotify2.
2020-06-03 smsg: remove remaining accessor methods
We'll continue to favor simpler data models that can be used directly rather than wasting time and memory with accessor APIs. The ->from, ->to, ->cc, ->mid, ->subject, ->references methods can all be trivially replaced by hash lookups since all their values are stored in doc_data. Most remaining callers of those methods were test cases, anyways. ->from_name is only used in the PSGI code, so we can just use ->psgi_cull to take care of populating the {from_name} field.
2020-06-03 nntp: smsg_range_i: favor ->{$field} lookups when possible
PublicInbox::Smsg::date remains the only exception which requires any subroutine calls, here, so we'll just have a branch just for that.
2020-05-09 switch read-only Email::Simple users to Eml
Since PublicInbox::Eml doesn't parse MIME subparts up front, it can replace most uses of Email::Simple without performance penalty. This will eventually allow us to lower overall internal API footprint by not having to keep the MIME vs Simple distinction.
2020-04-22 make zlib-related modules a hard dependency
This allows us to simplify some of our existing code and make future changes easier. I doubt anybody goes through the trouble to have a Perl installation without zlib support. The zlib source code is even bundled with Perl since 5.9.3 for systems without existing zlib development headers and libraries. Of course, zlib is also a requirement of git, too; and we're not going to stop using git :) [squashed: "wwwaltid: use gzipfilter up front"]
2020-04-19 reduce scope of mbox From_ line removal
It's unnecessary overhead for anything which does Email::MIME parsing. It was never done for v2 indexing, even though v1->v2 conversions did NOT remove those From_ lines. There was never a need to remove From_ lines in the v1 SearchIdx paths, either. Hitting a /$INBOX_URL/$MSGID/T/ endpoint with an 18 message thread reveals a ~0.5% speed improvement. This will become more apparent when we have a faster MIME parser.
2020-04-02 nntp: allow multiple spaces or tabs to delimit args
While this is not a known problem in practice, RFC 3977 section 3.1 states: Keywords and arguments MUST each be separated by one or more space or TAB characters.
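A command-line split that follows the quoted rule can be sketched like this. It is a Python illustration with a hypothetical function name, not the parser used by the NNTP module; upper-casing the keyword and tolerating surrounding whitespace are assumptions made for the example.

```python
import re


def split_command(line):
    """Split an NNTP command line into (keyword, args)."""
    line = line.rstrip("\r\n")
    # One or more space or TAB characters separate keyword and arguments.
    parts = re.split(r"[ \t]+", line.strip(" \t"))
    keyword, args = parts[0], parts[1:]
    # Assumption: keywords are compared case-insensitively, so normalize.
    return keyword.upper(), args
```

Splitting on a single space instead would produce empty arguments whenever a client sends two spaces or a TAB between tokens.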
2020-04-02 mid: add $MID_EXTRACT regexp for export
This allows us to consistently enforce the same Message-ID extraction rules everywhere and makes it easier for us to make changes in the future. Update scripts/ssoma-replay, as well, but don't rely on PublicInbox::* modules in that since it's legacy and public-inbox was never a dependency of ssoma.
2020-03-22 rename PublicInbox::SearchMsg => PublicInbox::Smsg
Since the introduction of over.sqlite3, SearchMsg is not tied to our search functionality in any way, so stop confusing ourselves and future hackers by just calling it "PublicInbox::Smsg". Add a missing "use" in ExtMsg while we're at it.
2020-02-06 treewide: run update-copyrights from gnulib for 2019
I didn't wait until September to do it, this year!
2020-01-24 nntp: simplify setting X-Alt-Message-ID
We can cut down on the number of operations required using "grep" instead of "foreach".