public-inbox.git - an "archives first" approach to mailing lists

Date	Commit message
2020-06-15	testcommon: allow OR-ing module dependencies
	IMAP requires either the Email::Address::XS or Mail::Address package (part of perl-MailTools RPM or libmailtools-perl deb); and Email::Address::XS is not officially packaged for some older distros, most notably CentOS 7.x.
2020-06-13	nntpd+imapd: detect replaced over.sqlite3
	For v1 inboxes (and possibly v2 in the future, for VACUUM), public-inbox-compact replaces over.sqlite3 with a new file. Detecting this doesn't currently need an extra inotify watch descriptor (or FD for kevent), so it can coexist nicely with systems w/o IO::KQueue or Linux::Inotify2.
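	A replaced file can be noticed without any watch descriptor by comparing the identity of the already-open file against whatever now sits at the same path. The following is a minimal Python sketch of that idea (the project itself is Perl); the helper names `file_identity` and `was_replaced` are hypothetical, and comparing device and inode numbers is an assumption about one workable approach, not a description of what the daemons actually do.

```python
import os


def file_identity(path):
    """Return the (device, inode) pair of whatever currently lives at path."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


def was_replaced(open_file, path):
    """True if path no longer refers to the file behind open_file."""
    st = os.fstat(open_file.fileno())
    try:
        return (st.st_dev, st.st_ino) != file_identity(path)
    except FileNotFoundError:
        # The path vanished entirely; treat that as a replacement, too.
        return True
```

	A reader holding a handle on the old file would call `was_replaced` before reusing it and reopen the path when it returns true.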
2020-06-13	imap: introduce memory-efficient uo2m mapping
	Since we limit our mailbox slices to 50K and can guarantee a contiguous UID space for those mailboxes, we can store a mapping of "UID offsets" (not full UIDs) to Message Sequence Numbers as an array of 16-bit unsigned integers in a 100K scalar. For UID-only FETCH responses, we can momentarily unpack the compact 100K representation to a ~1.6M Perl array of IV/UV elements for a slight speedup. Furthermore, we can (ab)use hash key deduplication in Perl5 to deduplicate this 100K scalar across all clients with the same mailbox slice open. Technically we can increase our slice size to 64K w/o increasing our storage overhead, but I suspect humans are more accustomed to slices easily divisible by 10.
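	The arithmetic behind the 100K figure is 50,000 entries of 2 bytes each. The sketch below illustrates the packed table in Python (the project itself is Perl); `build_uo2m` and `uid_to_msn` are hypothetical names, and using 0 to mark a missing UID is an assumption made for illustration only.

```python
from array import array

SLICE_SIZE = 50_000  # slice size named in the commit message


def build_uo2m(uids, uid_base):
    """Pack a "UID offset -> MSN" table as 16-bit unsigned integers.

    uids: sorted UIDs present in the slice, each in
          uid_base + 1 .. uid_base + SLICE_SIZE.
    Entry i holds the MSN of UID uid_base + 1 + i, or 0 (assumed
    marker) when that UID does not exist.
    """
    table = array("H", [0]) * SLICE_SIZE
    for msn, uid in enumerate(uids, start=1):
        table[uid - uid_base - 1] = msn
    return table.tobytes()


def uid_to_msn(packed, uid, uid_base):
    """Unpack the compact table for one lookup (0 means no such message)."""
    table = array("H")
    table.frombytes(packed)
    return table[uid - uid_base - 1]
```

	Since an MSN within one slice never exceeds 50,000, it always fits in an unsigned 16-bit slot, which tops out at 65,535.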
2020-06-13	imap: remove non-UID SEARCH for now
	Supporting MSNs in long-lived connections beyond the lifetime of a single request/response cycle is not scalable to a C10K scenario. It's probably not needed, since most clients seem to use UIDs. A somewhat efficient implementation I can come up with uses pack("S*" ...) (AKA "uint16_t mapping[50000]") and has an overhead of 100K per-client socket on a mailbox with 50K messages. The 100K is a contiguous scalar, so it could be swapped out for idle clients on most architectures if THP is disabled. An alternative could be to use a tempfile as an allocator partitioned into 100K chunks (or SQLite); but I'll only do that if somebody presents a compelling case to support MSN SEARCH.
2020-06-13	imapd: don't bother sorting LIST output
	The sort was unstable on my test instance anyways, and clients don't seem to mind. So stop wasting CPU cycles.
2020-06-13	imap: wire up Xapian, MSN SEARCH and multi sequence-sets
	Simple queries work; more complex queries involving parentheses, "OR", or "NOT" don't work yet. Tested with "=b", "=B", and "=H" search and limits in mutt on both v1 and v2 with multiple Xapian shards.
2020-06-13	imap: STATUS/EXAMINE: rely on SQLite overview
	We can get exact values for EXISTS, UIDNEXT using SQLite rather than calculating off $ibx->mm->max ourselves. Furthermore, $ibx->mm is less useful than $ibx->over for IMAP (and for our read-only daemons in general) so do not depend on $ibx->mm outside of startup/reload to save FDs and reduce kernel page cache footprint.
2020-06-13	imap: FETCH: try to make fake MSNs sequentially
	This appears to significantly improve header caching behavior with mutt. With the current public-inbox.org/git mirror(*), mutt will only re-FETCH the last ~300 or so messages in the final "inbox.comp.version-control.git.7" mailbox, instead of ~49,000 messages every time. It's not perfect, but a 500ms query is better than a >10s query and mutt itself spends as much time loading its header cache. (*) there are many gaps in NNTP article numbers (UIDs) due to spam removal from public-inbox-learn.
2020-06-13	imap: FETCH: more granular CRLF conversion
	This speeds up requests from mutt for HEADER.FIELDS by around 10% since we don't waste time doing CRLF conversion on large message bodies that get discarded, anyways.
2020-06-13	imap: LIST shows "INBOX" in all caps
	While selecting a mailbox is done case-insensitively, "INBOX" is special for the LIST command, according to RFC 3501 6.3.8: "The special name INBOX is included in the output from LIST, if INBOX is supported by this server for this user and if the uppercase string "INBOX" matches the interpreted reference and mailbox name arguments with wildcards as described above. The criteria for omitting INBOX is whether SELECT INBOX will return failure; it is not relevant whether the user's real INBOX resides on this or some other server." Thus, the existing news.public-inbox.org convention of naming newsgroups starting with "inbox." needs to be special-cased to not confuse clients. While we're at it, do not create ".0" for dummy newsgroups if they're selected, either.
2020-06-13	index: account for CRLF conversion when storing bytes
	NNTP and IMAP both require CRLF conversions on the wire. They're also the only components which care about $smsg->{bytes}, so store the CRLF-adjusted value in over.sqlite3 and Xapian DBs. This will allow us to optimize RFC822.SIZE fetch item in IMAP without triggering size mismatch errors in some clients' default configurations (e.g. Mail::IMAPClient), but not most others. It could also fix hypothetical problems with NNTP clients that report discrepancies between overview and article data.
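	The CRLF-adjusted size can be computed without building the converted message: every bare LF grows by one byte on the wire. A minimal Python sketch of that arithmetic follows (the project itself is Perl); `crlf_size` is a hypothetical helper name, and it assumes the stored message may already contain some CRLF pairs that must not be counted twice.

```python
def crlf_size(raw: bytes) -> int:
    """Size of raw after converting bare LF line endings to CRLF."""
    lf = raw.count(b"\n")        # every line ending, bare or not
    crlf = raw.count(b"\r\n")    # endings that are already CRLF
    return len(raw) + (lf - crlf)
```

	For example, `crlf_size(b"a\nb\n")` is 6, two bytes more than the 4 bytes stored as LF-only.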
2020-06-13	imap: compile UID FETCH to opcodes
	This is just a hair faster and cacheable in the future, if we need it. Most notably, this avoids doing PublicInbox::Eml->new for simple "RFC822", "BODY[]", and "RFC822.SIZE" requests.
2020-06-13	imap: remove dummies from sequence number FETCH
	Dummy messages make for bad user experience with MUAs which still use sequence numbers. Not being able to fetch a message doesn't seem fatal in mutt, so just ignore (sometimes large) gaps.
2020-06-13	search: index UID for IMAP search, too
	We'll need to support searching UID ranges for IMAP, so make sure it's indexed, too.
2020-06-13	search: index byte size of a message for IMAP search
	Searching for messages smaller than a certain size is allowed by offlineimap(1), mbsync(1), and possibly other tools. Maybe public-inbox-watch will support it, too. I don't see a reason to expose searching by size via WWW search right now (but maybe in the future, I could be convinced to). Note: we only store the byte-size of the message in git; this is typically LF-only and we won't have the correct size after CRLF conversion for NNTP or IMAP.
2020-06-13	imap: allow UID range search on timestamps
	Since it seems somewhat common for IMAP clients to limit searches by sent Date: or INTERNALDATE, we can rely on the NNTP/WWW-optimized overview DB. For other queries, we'll have to depend on the Xapian DB.
2020-06-13	imap: start parsing out queries for SQLite and Xapian
	None of the new cases are wired up, yet, but existing cases still work.
2020-06-13	imap: avoid uninitialized warnings on incomplete commands
	No point in spewing "uninitialized" warnings into logs when the cat jumps on the Enter key.
2020-06-13	t/config.t: always compare against git bool behavior
	We'll use the xqx() to avoid losing too much performance compared to normal `backtick` (qx) when testing using "make check-run" + Inline::C.
2020-06-13	imap: omit $UID_END from mailbox name, use index
	Having two large numbers separated by a dash can make visual comparisons difficult when numbers are in the 3,000,000 range for LKML. So avoid the $UID_END value, since it can be calculated from $UID_MIN. And we can avoid large values of $UID_MIN, too, by instead storing the block index and just multiplying it by 50000 (and adding 1) on the server side. Of course, LKML still goes up to 72, at the moment.
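	The mapping from block index back to a UID range is simple enough to write out. The Python sketch below (the project itself is Perl) uses hypothetical helper names; the end of the range is derived from the 50000-message slice size, and the "$NEWSGROUP.$INDEX" name layout is inferred from mailbox names such as "inbox.comp.version-control.git.7".

```python
SLICE_SIZE = 50_000  # messages per mailbox slice


def slice_uid_range(index):
    """Return (uid_min, uid_end) for a slice index, as described above."""
    uid_min = index * SLICE_SIZE + 1
    uid_end = uid_min + SLICE_SIZE - 1
    return uid_min, uid_end


def split_mailbox(name):
    """Split "newsgroup.INDEX" into (newsgroup, index)."""
    newsgroup, _, idx = name.rpartition(".")
    if not newsgroup or not (idx.isascii() and idx.isdigit()):
        raise ValueError("no numeric slice suffix: %r" % name)
    return newsgroup, int(idx)
```

	With this scheme, index 72 covers UIDs 3600001 through 3650000, which matches the "3,000,000 range" mentioned for LKML.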
2020-06-13	imapd: ensure LIST is sorted alphabetically, for now
	I'm not sure this matters, and it could be a waste of CPU cycles if no real clients care. However, it does make debugging over telnet or s_client a bit easier.
2020-06-13	imap: require ".$UID_MIN-$UID_END" suffix
	Finish up the IMAP-only portion of iterative config reloading, which allows us to create all sub-ranges of an inbox up front. The InboxIdler still uses ->each_inbox which will struggle with 100K inboxes. Having messages in the top-level newsgroup name of an inbox will still waste bandwidth for clients which want to do full syncs once there's a rollover to a new 50K range. So instead, make every inbox accessible exclusively via 50K slices in the form of "$NEWSGROUP.$UID_MIN-$UID_END". This introduces the DummyInbox, which makes $NEWSGROUP and every parent component a selectable, empty inbox. This aids navigation with mutt and possibly other MUAs. Finally, the xt/perf-imap-list maintainer test is broken, now, so remove it. The grep perlfunc is already proven effective, and we'll have separate tests for mocking out ~100k inboxes.
2020-06-13	imap: case-insensitive mailbox name comparisons
	IMAP RFC 3501 stipulates case-insensitive comparisons, and so does RFC 977 (NNTP). However, INN (nnrpd) uses case-sensitive comparisons, so we've always used case-sensitive comparisons for NNTP to match nnrpd behavior. Unfortunately, some IMAP clients insist on sending "INBOX" with caps, which causes problems for us. Since NNTP group names are typically all lowercase anyways, just force all comparisons to lowercase for IMAP and warn admins if uppercase-containing newsgroups won't be accessible over IMAP. This ensures our existing -nntpd behavior remains unchanged while being compatible with the expectations of real-world IMAP clients.
2020-06-13	imap: support out-of-bounds ranges
	"$UID_START:*" needs to return at least one message according to RFC 3501 section 6.4.8. While we're in the area, coerce ranges to (unsigned) integers by adding zero ("+ 0") to reduce memory overhead.
2020-06-13	imapclient: wrapper for Mail::IMAPClient
	We'll be using this wrapper class to workaround some upstream bugs in Mail::IMAPClient. There may also be experiments with new APIs for more performance.
2020-06-13	git: async: automatic retry on alternates change
	This matches the behavior of the existing synchronous ->cat_file method. In fact, ->cat_file now becomes a small wrapper around the ->cat_async method.
2020-06-13	git: cat_async: provide requested OID + "missing" on missing blobs
	This will make it easier to implement the retries on alternates_changed() of the synchronous ->cat_file API.
2020-06-13	imap: FETCH: support comma-delimited ranges
	The RFC 3501 `sequence-set' definition allows comma-delimited ranges, so we'll support it in case clients send them. Coalescing overlapping ranges isn't required, so we won't support it as such an attempt to save bandwidth would waste memory on the server, instead.
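	Handling each comma-delimited part on its own, in request order, keeps server memory use flat because no merged set ever has to be built. A minimal Python sketch of that approach follows (the project itself is Perl); `iter_sequence_set` is a hypothetical name and `max_id` stands for whatever "*" resolves to in the selected mailbox.

```python
def iter_sequence_set(sequence_set, max_id):
    """Yield one (lo, hi) pair per comma-delimited part, in request order.

    Overlapping parts are yielded as-is; nothing is merged or sorted.
    """
    for part in sequence_set.split(","):
        beg, sep, end = part.partition(":")
        lo = max_id if beg == "*" else int(beg)
        if not sep:
            hi = lo  # a single number, not a range
        else:
            hi = max_id if end == "*" else int(end)
        # RFC 3501 treats "4:2" the same as "2:4"
        yield (min(lo, hi), max(lo, hi))
```

	A request for "1:3,2:5" therefore produces the pairs (1, 3) and (2, 5), and messages 2 and 3 are simply sent twice.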
2020-06-13	imap: use git-cat-file asynchronously
	This ought to improve overall performance with multiple clients. Single client performance suffers a tiny bit due to extra syscall overhead from epoll. This also makes the existing async interface easier-to-use, since calling cat_async_begin is no longer required.
2020-06-13	imap: speed up HEADER.FIELDS[.NOT] range fetches
	While we can't memoize the regexp forever like we do with other Eml users, we can still benefit from caching regexp compilation on a per-request basis. A FETCH request from mutt on a 4K message inbox is around 8% faster after this. Since regexp compilation via qr// isn't unbearably slow, a shared cache probably isn't worth the trouble of implementing. A per-request cache seems enough.
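	A per-request cache here is just a dictionary that lives as long as the FETCH being served. The Python sketch below (the project itself is Perl, where the compilation is done with qr//) shows the shape of such a cache; the function names and the exact regexp are illustrative assumptions, not the project's code.

```python
import re


def header_fields_re(names, cache):
    """Return a compiled regexp for the named headers, compiling once per cache."""
    key = tuple(n.upper() for n in names)
    rx = cache.get(key)
    if rx is None:
        alt = "|".join(re.escape(n) for n in key).encode()
        # One header line plus any folded continuation lines after it
        rx = re.compile(rb"^(?:%s):[^\n]*\n(?:[ \t][^\n]*\n)*" % alt,
                        re.I | re.M)
        cache[key] = rx
    return rx


def fetch_header_fields(header, names, cache):
    """Extract only the requested header fields from a raw header block."""
    return b"".join(header_fields_re(names, cache).findall(header))
```

	The caller creates an empty dict at the start of a range FETCH, passes it for every message in the range, and lets it be discarded when the request completes.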
2020-06-13	imap: support the CLOSE command
	It seems worthless to support CLOSE for read-only inboxes, but mutt sends it, so don't return a BAD error with proper use.
2020-06-13	imap: do not include ".PEEK" in responses
	They're not specified in RFC 3501 for responses, and at least mutt fails to handle it.
2020-06-13	imap: support sequence number FETCH
	We'll return dummy messages for now when sequence numbers go missing, in case clients can't handle missing messages.
2020-06-13	imap: fix multi-message partial header fetches
	We must keep the contents of {-partial} around when handling a request to fetch multiple messages.
2020-06-13	imap: split out unit tests and benchmarks
	This makes the test code easier-to-manage and allows us to run faster unit tests which don't involve loading Mail::IMAPClient.
2020-06-13	imap: allow fetch of partial of BODY[...] and headers
	IMAP supports a high level of granularity when it comes to fetching, but fortunately Perl makes it fairly easy to support.
2020-06-13	eml: each_part: single part $idx is 1
	Instead of counts starting at 0, we start the single-part message at 1 like we do with subparts of a multipart message. This will make it easier to map offsets for "BODY[$SECTION]" when using IMAP FETCH, since $SECTION must contain non-zero numbers according to RFC 3501. This doesn't make any difference for WWW URLs, since single part messages cannot have downloadable attachments.
2020-06-13	imap: support fetch for BODYSTRUCTURE and BODY
	I'm not sure which clients use these, but it could be useful down the line.
2020-06-13	t/imapd: support FakeInotify and KQNotify
	We can fill in some missing pieces from the emulation APIs to enable IMAP IDLE tests on non-Linux platforms.
2020-06-13	imap: support LIST command
	We'll optimize for the common case of: $TAG LIST "" * and rely on the grep perlfunc to handle trickier cases.
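	The split between the fast path and the filtered path can be sketched as follows, in Python rather than the project's Perl. The wildcard rules come from RFC 3501 ("*" matches anything, "%" matches anything except the hierarchy delimiter); the function names are hypothetical and the "." delimiter is assumed from the newsgroup-style mailbox names.

```python
import re


def list_pattern_to_re(pattern, delim="."):
    """Translate an IMAP LIST wildcard pattern into a compiled regexp."""
    out = []
    for ch in pattern:
        if ch == "*":
            out.append(".*")
        elif ch == "%":
            out.append("[^%s]*" % re.escape(delim))
        else:
            out.append(re.escape(ch))
    return re.compile("".join(out))


def list_mailboxes(names, pattern):
    """Return the mailbox names matching pattern."""
    if pattern == "*":
        return list(names)  # common case: no matching work at all
    rx = list_pattern_to_re(pattern)
    # Filtering step, the counterpart of Perl's grep
    return [n for n in names if rx.fullmatch(n)]
```

	Given the names "inbox.a.0", "inbox.a", and "inbox.b.0", the pattern "inbox.%" selects only "inbox.a", while "*" returns all three untouched.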
2020-06-13	imap: implement STATUS command
	I'm not sure if there's much use for this command, but it's part of RFC3501 and works read-only.
2020-06-13	imap: delay InboxIdle start, support refresh
	InboxIdle should not be holding onto Inbox objects after the Config object they came from expires, and Config objects may expire on SIGHUP. Old Inbox objects still persist due to IMAP clients holding onto them, but that's a concern we'll deal with at another time, or not at all, since all clients expire, eventually. Regardless, stale inotify watch descriptors should not be left hanging after SIGHUP refreshes.
2020-06-13	imap: support IDLE
	It seems to be working as far as Mail::IMAPClient is concerned.
2020-06-13	inboxidle: new class to detect inbox changes
	This will be used to implement IMAP IDLE, first. Eventually, it may be used to trigger other things:
	* incremental internal updates for manifest.js.gz
	* restart `git cat-file' processes on pack index unlink
	* IMAP IDLE-like long-polling HTTP endpoint
	And maybe more things we haven't thought of, yet. It uses Linux::Inotify2 or IO::KQueue depending on what packages are installed and what the kernel supports. It falls back to nanosecond-aware Time::HiRes::stat() (available with Perl 5.10.0+) on systems lacking Linux::Inotify2 and IO::KQueue. In the future, a pure Perl alternative to Linux::Inotify2 may be supplied for users of architectures we already support signalfd and epoll on.
	v2 changes:
	- avoid O_TRUNC on lock file
	- change ctime on Linux systems w/o inotify
	- fix naming of comments and fields
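	The stat()-based fallback amounts to remembering a timestamp signature per watched path and comparing it on each poll. Below is a minimal Python sketch of that idea (the project itself is Perl and uses Time::HiRes::stat); the `StatPoller` class and the particular fields in the signature are assumptions for illustration.

```python
import os


class StatPoller:
    """Detect file changes by polling nanosecond-resolution stat() data."""

    def __init__(self):
        self._seen = {}

    @staticmethod
    def _signature(path):
        try:
            st = os.stat(path)
        except FileNotFoundError:
            return None  # a missing file is its own distinct state
        return (st.st_mtime_ns, st.st_ctime_ns, st.st_size, st.st_ino)

    def watch(self, path):
        """Start tracking path from its current state."""
        self._seen[path] = self._signature(path)

    def poll(self):
        """Return the watched paths whose signature changed since last poll."""
        changed = []
        for path, old in self._seen.items():
            new = self._signature(path)
            if new != old:
                self._seen[path] = new
                changed.append(path)
        return changed
```

	A caller would invoke `poll()` from a timer, which trades some latency and a few syscalls per interval for not needing inotify or kqueue at all.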
2020-06-13	preliminary imap server implementation
	It shares a bit of code with NNTP. It's copy+pasted for now since this provides new ground to experiment with APIs for dealing with slow storage and many inboxes.
2020-06-08	index: v2: parallel by default
	InboxWritable should only set $v2w->{parallel} if the $parallel flag is defined to 0 or 1. We want indexing a new inbox to utilize SMP, just like --reindex. -index once again allows -j0/--jobs=0 to force single-process use, and we'll be ensuring that works in tests to maintain performance on small systems. Fixes: 61a2fff5b34a3e32 ("admin: move index_inbox over")
2020-06-03	smsg: remove remaining accessor methods
	We'll continue to favor simpler data models that can be used directly rather than wasting time and memory with accessor APIs. The ->from, ->to, ->cc, ->mid, ->subject, ->references methods can all be trivially replaced by hash lookups since all their values are stored in doc_data. Most remaining callers of those methods were test cases, anyways. ->from_name is only used in the PSGI code, so we can just use ->psgi_cull to take care of populating the {from_name} field.
2020-06-03	www: remove smsg_mime API and adjust callers
	To further simplify callers and avoid embarrassing memory explosions[1], we can finally eliminate this method in favor of smsg_eml. [1] commit 7d02b9e64455831d3bda20cd2e64e0c15dc07df5 ("view: stop storing all MIME objects on large threads") fixed a huge memory blowup.
2020-06-03	smsg: introduce ->populate method
	This will eventually replace the __hdr() calling methods and eradicate {mime} usage from Smsg. For now, we can eliminate PublicInbox::Smsg->new since most callers already rely on an open `bless' to avoid the old {mime} arg.
2020-05-29	treat $INBOX_DIR/description and gitweb.owner as UTF-8
	gitweb does the same with $GIT_DIR/description and gitweb.owner. Allowing UTF-8 description should not cause problems when used in responses to the NNTP "LIST NEWSGROUPS" request, either, since RFC 3977 section 7.6.6 recommends the description be UTF-8 (but does not require it). Link: https://public-inbox.org/meta/20200528151216.l7vmnmrs4ojw372g@sourcephile.fr/