path: root/lib/PublicInbox/NNTP.pm
Date Commit message
2021-01-01 update copyrights for 2021
Using "make update-copyrights" after setting GNULIB_PATH in my config.mak
2020-12-11 nntp+www: drop List-* and Archived-At headers
These headers can conflict with headers in the DKIM signature; and parsing the DKIM-Signature header to determine whether or not we can safely add a header would be more code and CPU cycles. Since IMAP seems fine without these headers (and JMAP will likely be, too), there's likely no need to continue appending these to every message. Nowadays, developers seem sufficiently trained to use URLs with Message-IDs in them. So drop the headers and save some cycles and bandwidth all around.
2020-12-10 www+nntp: deal with lack of addresses for ->ALL
Since extindex is an amalgamation of several inboxes, discerning an appropriate address for List-Post: would be expensive and most likely unnecessary. Some legacy/historical inboxes may have no active address, either, so don't attempt to set the List-Post header if no addresses are configured.
2020-12-09 rename {pi_config} fields to {pi_cfg}
{pi_config} may be confused with the documented `PI_CONFIG' environment variable, and we'll favor vowel-removal to be consistent with our usage of object references. The `pi_' prefix may stay in some places, for now; since a separate namespace may come into this codebase for local/private client-tooling. For InboxIdle, we'll also remove an invalid comment about holding a reference to the PublicInbox::Config object, too.
2020-12-09 nntp: replace {ng} with {ibx} for consistency
They're PublicInbox::Inbox objects just like the rest of the non-NNTP code. So rename the NNTP code for consistency with the rest of the codebase. Furthermore, {ng} and $ng may be confused with the `--ng' switch for -init, and that's a non-ref scalar string.
2020-12-05 nntp: small speed up for multi-line responses
Using a non-zero-length separator for `join' requires extra work inside Perl. We can shove the cost of appending "\r\n" into the `map' loop, instead. This speeds up the `join' operation. The "deferred" log entry for a "LISTGROUP org.kernel.vger.linux-kernel" command (with nearly 3.8 million messages) goes from ~3.96s to 3.86s on my workstation.
2020-12-05 nntp: xref_by_tc: simplify slightly
We can invalidate ibx->{newsgroup} at config load-time to avoid having to check ibx->{newsgroup} validity in To/Cc: matching. This saves us some hash lookups in all cases.
2020-12-01 nntp: make ->ALL Xref generation more fuzzy
For ->ALL users, this mitigates the regression introduced by commit 811b8d3cbaa790f59b7b107140b86248da16499b ("nntp: xref: use ->ALL extindex if available"), since it's common to cross post messages to some mailing lists with per-list trailers for unsubscribe information. We won't bother dealing with Bcc-ed messages since those are nearly all spam when it comes to public mailing lists. Fixes: 811b8d3cbaa790f5 ("nntp: xref: use ->ALL extindex if available") Link: https://public-inbox.org/meta/20201130194201.GA6687@dcvr/
2020-11-29 nntpd: remove redundant {groups} shortcut
It's not worth confusing hackers reading the source to have two ways to access the same (large) hash table. So just go through PublicInbox::Config objects for now since the extra hash lookup isn't going to be noticeable. I've also started favoring "for" instead of "foreach" since they're the equivalent perlop and less wear on my fingers + keyboard.
2020-11-29 nntp: XPATH uses ->ALL extindex, too
Another 30-40% speedup when testing against a local lore.kernel.org mirror. In either case, we'll consistently sort the response for ease-of-testing and client-side cache-friendliness.
2020-11-29 nntp: art_lookup: use mid_lookup and simplify
This lets us take advantage of mid_lookup speedup from the previous commit. While we're at it, start moving towards using `$ibx' as the abbreviation for PublicInbox::Inbox objects even in the NNTP code, since they've been shared with the WWW code for several years, now.
2020-11-29 nntp: speed up mid_lookup() using ->ALL extindex
We can reuse "xref3" information in extindex to quickly match messages matching a given Message-ID across hundreds or thousands of newsgroups with a few SQL statements. "XHDR Xref $MESSAGE_ID" is around 40% faster, on top of previous speedups.
2020-11-29 nntp: NEWGROUPS uses long_response
We can amortize the cost of NEWGROUPS time filtering using the long_response API. This lets us handle hundreds/thousands of inboxes without monopolizing the event loop for this command. Further speedup is possible using MiscSearch, but that requires not-yet-done indexing changes to MiscIdx.
2020-11-28 nntp: xref: use ->ALL extindex if available
Getting Xref for cross-posted messages is an O(n) operation where `n' is the number of newsgroups on the server. This works acceptably when there are dozens of groups, but would be unacceptable when there are tens of thousands of newsgroups. With ~140 newsgroups, a lore.kernel.org mirror already handles "XHDR Xref $MESSAGE_ID" requests around 30% faster after creating the xref3.idx_nntp index. The SQL additions to ExtSearch.pm may be a bit strange and seem more appropriate for Over.pm; however it currently makes sense to me since those bits of over.sqlite3 access are exclusive to ExtSearch and can't be used by traditional v1/v2 inboxes...
2020-11-28 nntp: xref: simplify sub signature
We'll be using the `xref3' table in extindex to speed up xref(), and that'll require comparisons against $smsg->{blob}. So pass the entire $smsg through.
2020-11-28 nntp: some minor golfing
Reduce screen real estate usage to reduce human attention span requirements.
2020-11-28 nntp: move LIST iterators to long_response
Iterating through many newsgroups can hog the event loop if many random seeks are required. Avoid monopolizing the event loop in that case by using the long_response API. For now, we can still rely on grep() since it seems to work reasonably well with 50K test newsgroup names.
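The idea of spreading a long listing across event-loop iterations can be sketched roughly as follows. This is a Python illustration rather than the project's Perl; `long_response`, `step`, `emit` and `batch` are hypothetical names chosen for the sketch and are not the real PublicInbox::NNTP API.

```python
from collections import deque

def long_response(items, emit, batch=100):
    """Return a step() callable that emits at most `batch` items per call.

    step() returns True while work remains, so a (hypothetical) event loop
    can requeue it and serve other clients between calls.
    """
    pending = deque(items)

    def step():
        for _ in range(min(batch, len(pending))):
            emit(pending.popleft())
        return bool(pending)

    return step

# A toy "event loop": run one step at a time until the response is done.
out = []
step = long_response([f"group.{i}" for i in range(250)], out.append, batch=100)
rounds = 1
while step():
    rounds += 1
```

A real server would interleave other clients' work between calls to `step()` instead of looping immediately; the batch size trades per-response latency against fairness.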
2020-11-28 nntp: LIST ACTIVE.TIMES use angle brackets around address
This matches the example shown in RFC 3977, section 7.6.1.3
2020-11-28 nntp: NEWNEWS: speed up filtering
With 50K newsgroups, the filtering phase goes from ~2000 seconds to ~90 MILLISECONDS by relying on the grep perlop. This moves ->over checking out of the main dispatch and amortizes the cost via long_response. (Fairly scheduled) long_response time in newnews_i now takes ~360 seconds as opposed to ~30 seconds before this change, however; but the initial filtering speedup eliminating 2000s is more than worth it.
2020-11-28 nntp: use grep operation for wildmat matching
Based on experiences with the IMAP server, this ought to be significantly faster (as to be demonstrated in the next commit).
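As a rough illustration of filtering a list of newsgroup names with a compiled wildmat, here is a Python sketch (not the project's Perl code). It assumes a simplified wildmat: only `*`, `?` and comma-separated alternatives are handled, while `!` negation and backslash escapes are deliberately left out. The function names are hypothetical.

```python
import re

def wildmat_to_regex(wildmat):
    """Compile a simplified wildmat (no '!' negation, no escapes) to a regex."""
    alternatives = []
    for pattern in wildmat.split(","):
        parts = []
        for ch in pattern:
            if ch == "*":
                parts.append(".*")   # any run of characters
            elif ch == "?":
                parts.append(".")    # exactly one character
            else:
                parts.append(re.escape(ch))
        alternatives.append("".join(parts))
    return re.compile("(?:" + "|".join(alternatives) + ")")

def filter_groups(wildmat, names):
    """Keep only the names that match the whole pattern."""
    rx = wildmat_to_regex(wildmat)  # compile once, reuse for every name
    return [n for n in names if rx.fullmatch(n)]

groups = ["inbox.comp.lang.perl", "inbox.comp.lang.c", "org.kernel.vger.git"]
comp_only = filter_groups("inbox.comp.*", groups)
both = filter_groups("inbox.comp.lang.?,org.kernel.*", groups)
```

Compiling the pattern once and applying it over the whole list in a single pass is the part that corresponds to relying on a grep-style operation.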
2020-11-28 mm: min/max: return 0 instead of undef
This simplifies callers and allows empty newsgroups to be represented (the WWW UI may be insufficient there, too).
2020-11-28 nntp: use Inbox->uidvalidity instead of ->mm->created_at
This is memoized, and may allow us some future flexibility w.r.t PublicInbox::Inbox-like objects. While we're at it, use defined-or ("//") in case somebody really set a public-inbox creation time to the Unix epoch.
2020-11-04 nntp: attempt RFC 5536 3.1.5-conformant Path: headers
Perhaps some NNTP clients would be unhappy with the old value "y". So use a bit more bandwidth+space to use the server-name and historical "!not-for-mail" tail-entry to better conform to a published RFC. Reported-by: Andrey Melnikov <temnota.am@gmail.com>
2020-11-04 nntp: delimit Newsgroup: header with commas
...instead of spaces. This is specified in RFC 5536 3.1.4. Include references to RFC 1036, 5536 and 5537 in our docs while we're at it. Reported-by: Andrey Melnikov <temnota.am@gmail.com> Link: https://public-inbox.org/meta/CA+PODjpUN5Q4gBFQhAzUNuMasVEdmp9f=8Uo0Ej0mFumdSwi4w@mail.gmail.com/
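The two RFC 5536 header shapes mentioned in the entries above can be illustrated with a small Python sketch (not the project's Perl). The function names and the `news.example.com` server name are hypothetical.

```python
def newsgroups_header(groups):
    # RFC 5536 3.1.4: newsgroup names are separated by commas, not spaces.
    return "Newsgroups: " + ",".join(groups)

def path_header(server_name):
    # RFC 5536 3.1.5: path identities are joined by "!" and end with a
    # tail-entry; "not-for-mail" is the historical tail-entry.
    return "Path: " + server_name + "!not-for-mail"

ng = newsgroups_header(["inbox.comp.lang.perl", "inbox.comp.mail"])
path = path_header("news.example.com")
```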
2020-10-30 tls: epollbit: account for miscellaneous OpenSSL errors
Apparently they happen (triggered by my -imapd instance), so bail out by closing the underlying socket rather than stopping the event loop and daemon process.
2020-09-12 nntp: share more code between art_lookup callers
This prepares us for future changes to improve scalability to many inboxes.
2020-09-10 nntp: fix cross-newsgroup Message-ID lookups
We cannot blindly use the selected newsgroup for HEAD/ARTICLE/BODY requests using Message-ID, since those commands look across all newsgroups; not just the selected one (if any). So stuff a reference to the Inbox object into $smsg. We can reduce args passed into set_nntp_headers() and msg_hdr_write(), too. Fixes: 0e6ceff37fc38f28 ("nntp: support slow blob retrievals")
2020-08-02 nntp: fix STAT command
The return value of art_lookup changed but this command wasn't updated since it wasn't tested. Fixes: 0e6ceff37fc38f28 ("nntp: support slow blob retrievals")
2020-07-06 daemon: warn on missing blobs
Since -edit and -purge should be rare and TOCTOU around them rarer still; missing {blobs} could be indicative of a real bug elsewhere. Warn on them. And I somehow ended up with 3 different field names for Inbox objects. Perhaps they'll be made consistent in the future.
2020-06-28 ds: remove fields.pm usage
Since the removal of pseudo-hash support in Perl 5.10, the "fields" module no longer provides the space or speed benefits it did in 5.8. It also does not allow for compile-time checks, only run-time checks. To me, the extra developer overhead in maintaining "use fields" args has become a hassle. None of our non-DS-related code uses fields.pm, nor do any of our current dependencies. In fact, Danga::Socket (which DS was originally forked from) and its subclasses are the only fields.pm users I've ever encountered in the wild. Removing fields may make our code more approachable to other Perl hackers. So stop using fields.pm and locked hashes, but continue to document what fields do for non-trivial classes.
2020-06-25 git_async_cat: remove circular reference
While this circular reference was carefully managed to not leak memory; it was still triggering a warning at -imapd/-nntpd shutdown due to the EPOLL_CTL_DEL op failing after the $Epoll FD gets closed. So remove the circular reference by providing a ref to `undef', instead.
2020-06-21 nntp: support slow blob retrievals
Having `git cat-file' as a separate process naturally lends itself to asynchronous dispatch. Our event loop for -nntpd no longer blocks on slow git storage. Pipelining in -imapd was tricky and bugs were exposed by mbsync(1). Update t/nntpd.t to support pipelining ARTICLE requests to ensure we don't have the same problems -imapd did during development.
2020-06-21 nntp: event_step: prepare for async git reads
This matches PublicInbox::IMAP::event_step and will allow us to handle blob retrievals from git asynchronously without falling over on pipelined requests.
2020-06-21 daemon: use ->can to check for IO::Socket::SSL
Doing a ref($obj) string comparison ties us to IO::Socket::SSL (and OpenSSL). In the future, we may support GnuTLS or other TLS implementations. This was already done in the IMAP code.
2020-06-13 nntpd+imapd: detect replaced over.sqlite3
For v1 inboxes (and possibly v2 in the future, for VACUUM), public-inbox-compact replaces over.sqlite3 with a new file. This currently doesn't need an extra inotify watch descriptor (or FD for kevent) at the moment, so it can coexist nicely for systems w/o IO::KQueue or Linux::Inotify2.
2020-06-03 smsg: remove remaining accessor methods
We'll continue to favor simpler data models that can be used directly rather than wasting time and memory with accessor APIs. The ->from, ->to, ->cc, ->mid, ->subject, ->references methods can all be trivially replaced by hash lookups since all their values are stored in doc_data. Most remaining callers of those methods were test cases, anyways. ->from_name is only used in the PSGI code, so we can just use ->psgi_cull to take care of populating the {from_name} field.
2020-06-03 nntp: smsg_range_i: favor ->{$field} lookups when possible
PublicInbox::Smsg::date remains the only exception which requires any subroutine calls, here, so we'll just have a branch just for that.
2020-05-09 switch read-only Email::Simple users to Eml
Since PublicInbox::Eml doesn't parse MIME subparts up front, it can replace most uses of Email::Simple without performance penalty. This will eventually allow us to lower overall internal API footprint by not having to keep the MIME vs Simple distinction.
2020-04-22 make zlib-related modules a hard dependency
This allows us to simplify some of our existing code and make future changes easier. I doubt anybody goes through the trouble to have a Perl installation without zlib support. The zlib source code is even bundled with Perl since 5.9.3 for systems without existing zlib development headers and libraries. Of course, zlib is also a requirement of git, too; and we're not going to stop using git :) [squashed: "wwwaltid: use gzipfilter up front"]
2020-04-19 reduce scope of mbox From_ line removal
It's unnecessary overhead for anything which does Email::MIME parsing. It was never done for v2 indexing, even though v1->v2 conversions did NOT remove those From_ lines. There was never a need to remove From_ lines in the v1 SearchIdx paths, either. Hitting a /$INBOX_URL/$MSGID/T/ endpoint with an 18 message thread reveals a ~0.5% speed improvement. This will become more apparent when we have a faster MIME parser.
2020-04-02 nntp: allow multiple spaces or tabs to delimit args
While this is not a known problem in practice, RFC 3977 section 3.1 states: Keywords and arguments MUST each be separated by one or more space or TAB characters.
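A minimal Python sketch of splitting a command line on runs of SP/TAB, as the quoted RFC 3977 text requires, might look like this. It is an illustration only, not the project's Perl parser, and `split_command` is a hypothetical name.

```python
import re

def split_command(line):
    """Split an NNTP command line into (keyword, args) on runs of SP/TAB."""
    # Drop the CRLF terminator and any leading/trailing blanks first.
    tokens = re.split(r"[ \t]+", line.rstrip("\r\n").strip(" \t"))
    return tokens[0], tokens[1:]

keyword, args = split_command("GROUP \t  inbox.test\r\n")
```

Splitting on `[ \t]+` rather than a single space means "GROUP  inbox.test" and "GROUP\tinbox.test" parse the same way as the single-space form.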
2020-04-02 mid: add $MID_EXTRACT regexp for export
This allows us to consistently enforce the same Message-ID extraction rules everywhere and makes it easier for us to make changes in the future. Update scripts/ssoma-replay, as well, but don't rely on PublicInbox::* modules in that since it's legacy and public-inbox was never a dependency of ssoma.
2020-03-22 rename PublicInbox::SearchMsg => PublicInbox::Smsg
Since the introduction of over.sqlite3, SearchMsg is not tied to our search functionality in any way, so stop confusing ourselves and future hackers by just calling it "PublicInbox::Smsg". Add a missing "use" in ExtMsg while we're at it.
2020-02-06 treewide: run update-copyrights from gnulib for 2019
I didn't wait until September to do it, this year!
2020-01-24 nntp: simplify setting X-Alt-Message-ID
We can cut down on the number of operations required using "grep" instead of "foreach".
2020-01-13 ds|http|nntp: simplify {wbuf} population
We can rely on autovivification to turn an `undef' value of {wbuf} into an arrayref. Furthermore, "push" returns the (new) size of the array since at least Perl 5.0 (I didn't look further back), so we can use that return value instead of calling "scalar" again.
2020-01-08 nntp: correctly log long response errors
We cannot safely call "fileno(undef)" without bringing down the entire -nntpd process :x. To ensure no logging regression, we now stash the FD for the duration of the long response to ensure the error can be matched to the original command in logs. Fixes: 207b89615a1a0c06 ("nntp: remove cyclic refs from long_response")
2020-01-01 nntp: handle 2-digit year "70" properly
Time::Local has the concept of a "rolling century" which is defined at 50 years on either side of the current year. Since it's now 2020 and >50 years since the Unix epoch, the year "70" gets interpreted by Time::Local as 2070-01-01 instead of 1970-01-01. Since NNTP servers are unlikely to store messages from the future, we'll feed 4-digit year to Time::Local::{timegm,timelocal} and hopefully not have to worry about things until Y10K. This fixes test failures on t/v2writable.t and t/nntpd.t since 2020-01-01.
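One way to expand a 2-digit year before handing a full 4-digit year to a UTC time conversion is sketched below in Python (not the project's Perl). It assumes the rule from RFC 3977 section 7.3.2: a 2-digit year belongs to the current century if it is less than or equal to the current 2-digit year, and to the previous century otherwise. The entry above only says that a 4-digit year is passed to Time::Local, so the exact expansion used by the commit may differ. The function names are hypothetical.

```python
import calendar

def expand_year(yy, current_year):
    """Expand a 2-digit year using the RFC 3977 section 7.3.2 rule."""
    century = current_year - current_year % 100
    if yy <= current_year % 100:
        return century + yy        # current century
    return century - 100 + yy      # previous century

def nntp_date_to_epoch(date, time, current_year):
    """Convert "[YY]YYMMDD" and "HHMMSS" (taken as UTC) to Unix epoch seconds."""
    if len(date) == 6:
        year = expand_year(int(date[:2]), current_year)
    else:
        year = int(date[:4])
    month, day = int(date[-4:-2]), int(date[-2:])
    hh, mm, ss = int(time[:2]), int(time[2:4]), int(time[4:6])
    # Always pass the 4-digit year, so no rolling-century guess is involved.
    return calendar.timegm((year, month, day, hh, mm, ss, 0, 0, 0))

epoch = nntp_date_to_epoch("700101", "000000", current_year=2020)
```

Taking `current_year` as a parameter keeps the sketch deterministic; a server would normally derive it from the clock.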
2019-12-22 nntp: cmd_xover: use named sub for long_response
Introduce xover_i, which does the same thing as the anonymous sub it replaces.
2019-12-22 nntp: hdr_msg_id: use named sub for long_response
Introduce hdr_msgid_range_i, which does the same thing as the anonymous sub it replaces.