public-inbox.git - an "archives first" approach to mailing lists

Date	Commit message
2020-12-12	doc: add public-inbox-extindex-format(5) manpage
	The CLI tool still needs usability work, and "misc" is still in flux, but the core message indexing part is stable (since it's stolen from v2 :P).
2020-12-05	isearch: emulate per-inbox search with ->ALL
	Using "eidx_key:" boolean prefix to limit results to a given inbox, we can use ->ALL to emulate and replace per-Inbox xap15/[0-9] search indices. With this change, the presence of "extindex.all.topdir" in the $PI_CONFIG will cause the WWW code to use that extindex and ignore per-inbox Xapian DBs in xap15/[0-9]. Unfortunately IMAP search still requires old per-inbox indices, for now. Mapping extindex Xapian docids to per-Inbox UIDs and vice-versa is proving tricky. Fortunately, IMAP search is rarely used and optional. The RFCs don't specify expensive phrase search, either, so `indexlevel=medium' can be used in per-inbox Xapian indices to save space. For primarily WWW (and future JMAP) users, this should result in significant disk space, FD, and page cache footprint savings for large instances with many inboxes and many cross-posted messages.
2020-12-05	over: ensure old, merged {tid} is really gone
	We must use the result of link_refs() since it can trigger merge_threads() and invalidate $old_tid. In case merge_threads() isn't triggered, link_refs() will return $old_tid anyways. When rethreading and allocating new {tid}, we also must update the row where the now-expired {tid} came from to ensure only the new {tid} is seen when reindexing subsequent messages in history. Otherwise, every subsequently reindexed+rethreaded message could end up getting a new {tid}. Reported-by: Kyle Meyer <kyle@kyleam.com> Link: https://public-inbox.org/meta/87360nlc44.fsf@kyleam.com/
2020-11-24	miscsearch: a new Xapian sub-DB for extindex
	This will be used to index and search Inbox objects and perhaps individual git repositories/epochs for grokmirror manifest.js.gz generation. There is no sharding planned for this at the moment since inbox count should remain low (~100K to 1M) compared to message count. Folding this into the existing sharded DBs could be possible; but would likely increase query and maintenance costs, as well as development complexity. So we'll use a few more inodes and FDs at runtime, instead.
2020-11-08	extsearch: rename -eindex to -extindex
	"eindex" rhymes with "reindex", which could be confusing; so rename the command and config prefix to "extindex", which is hopefully less confusing.
2020-11-07	script: add preliminary eindex implementation
	Not documented, yet, but it runs...
2020-11-07	extsearchidx: initial implementation
	It compiles...
2020-11-07	extsearch: start mocking out
	This will provide a similar API to PublicInbox::Inbox for read-only WWW, -imapd, and -nntpd interfaces.
2020-10-17	xt: remove eml_check_roundtrip
	If there's no body ({bdy} field), ->each_part sets the {bdy} field to "\n" and the ->as_string result afterwards is one "\n" byte longer than the original. It's not worth extra cycles in common ->each_part calls to ensure 100% round-trip matches of header-only messages (which are likely spam), especially when the only difference is a trailing "\n".
2020-09-26	xt: add eml ->as_string round trip checker
	Unlike Email::MIME, PublicInbox::Eml::as_string should be able to round trip from the Perl object to a raw scalar and back without changes.
2020-09-20	doc: post-1.6 updates, start 1.7
	I should've dropped "PENDING" notes before the 1.6 release; they're dropped now, and a note is added to remind my future self to drop them before 1.7.
2020-09-19	gcf2: wire up read-only daemons and rm -gcf2 script
	It seems easiest to have a singleton Gcf2Client client object per daemon worker for all inboxes to use. This reduces overall FD usage from pipes. The `public-inbox-gcf2' command + manpage are gone and a `$^X' one-liner is used, instead. This saves inodes for internal commands and hopefully makes it easier to avoid mismatched PERL5LIB include paths (as noticed during development :x). We'll also make the existing cat-file process management infrastructure more resilient to BOFHs on process killing sprees (or in case our libgit2-based code fails on us). (Rare) PublicInbox::WWW PSGI users NOT using public-inbox-httpd won't automatically benefit from this change, and extra configuration will be required (to be documented later).
2020-09-19	add gcf2 client and executable script
	This should be able to replace multiple `git cat-file' for blob retrieval, but adjustments may be needed.
2020-09-19	gcf2: libgit2-based git cat-file alternative
	Having tens of thousands of inboxes and associated git processes won't work well, so we'll use libgit2 to access the object DB directly. We only care about OID lookups and won't need to rely on per-repo revision names or paths. The Git::Raw XS package won't be used since its manpages don't promise a stable API. Since we already use Inline::C and have experience with I::C when it comes to compatibility, this only introduces libgit2 itself as a source of new incompatibilities. This also provides an excuse for me to writev(2) to reduce syscalls, but liburing is on the horizon for next year.
2020-09-10	config: split out iterator into separate object
	We will need to allow simultaneous iterators on the same config object, since we'll need this for ExtMsg, NNTPD, WwwListing, NewsWWW, and other places.
2020-09-10	www: manifest.js.gz generation no longer hogs event loop
	It's still as slow as before with hundreds/thousands of inboxes, but at least it's fair. Future changes will allow it to be cached and memoized with persistent HTTP servers.
2020-09-02	t/v2dupindex: test indexing mirrors with duplicate messages
	While it's not a known problem, our deduplicating logic may change in the future; or a BOFH could be manually injecting duplicate messages directly into the git epoch repositories. Ensure indexing in mirrors doesn't break when there's duplicates. This is in preparation for detached indices for multi-inbox search.
2020-09-01	replace ParentPipe with EOFpipe
	ParentPipe was a subset of EOFpipe, except EOFpipe correctly accounts for theoretical(*) spurious wakeups on the pipe. (*) AFAIK, spurious wakeups are/were more likely on TCP sockets due to checksum failures, something that's not a problem on local pipes. We're also not sharing pipes like we do with listen sockets on accept(2), so there's no chance of another process grabbing bytes (unless we have bugs in our code).
2020-09-01	watch: use EOFpipe to reduce dwaitpid wakeups
	It's a bit inefficient to use a pipe, here. However, using dwaitpid() on a process that's not expected to exit soon is also inefficient as it causes excessive wakeups as most of our inbox-writing code expects synchronous waitpid(). This only affects -watch instances configured for NNTP and IMAP clients.
2020-09-01	rename WatchMaildir => Watch
	This is no longer limited to Maildirs now that IMAP and NNTP support exist; so give it a shorter name.
2020-08-25	examples: add imapd systemd examples
	We've got examples for all the other daemons, too!
2020-08-20	searchquery: split off from searchview
	Since this was already a separate package, split it off into its own file since SearchView may not handle inbox groups.
2020-08-16	doc: add public-inbox-tuning(7) manpage
	Determining storage device speed and latencies doesn't seem portable or even possible with the wide variety of storage layers in use. This means we need to write a tuning document and hope users read and improve on it :P
2020-08-03	t/nntpd: do not fork on indexing, test v2
	No need to waste resources when doing minimal work. With PI_TEST_VERSION=2, this fixes a test failure where Net::NNTP::DESTROY was getting called in the shard process. We'll also get rid of an unnecessary use_ok under v2, too.
2020-07-29	searchidx: disable CoW for SQLite and Xapian under btrfs
	SQLite and Xapian files are written randomly, thus they become fragmented under btrfs with copy-on-write. This leads to noticeable performance problems (and probably ENOSPC) as these files get big. lore/git (v2, <1GB) indexes around 20% faster with this on an ancient SSD. lore/lkml seems to be taking forever and I'll probably cancel it to save wear on my SSD. Unfortunately, disabling CoW also means disabling checksumming (and compression), so we'll be careful to only set the No_COW attribute on regeneratable data. We want to keep CoW (and checksums+compression) on git storage because current ref storage is neither checksummed nor compressed, and git streams pack output.
2020-07-25	v2writable: introduce idx_stack
	This avoids pinning a potentially large chunk of memory from `git-log --reverse' into RAM (or triggering less predictable swap behavior). Instead it uses a contiguous temporary file with a fixed-size record for every blob we'll need to index.
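A minimal sketch of the fixed-size-record idea described above, written in Python for illustration rather than taken from the project's Perl code. The `IdxStack` name, the record layout (two 64-bit timestamps plus a 20-byte OID) and the method names are assumptions made for this example only.

```python
import struct
import tempfile

# Hypothetical record layout: commit time, author time, 20-byte blob OID.
RECORD = struct.Struct(">QQ20s")


class IdxStack:
    """LIFO stack kept in a temporary file instead of in RAM."""

    def __init__(self):
        self._fh = tempfile.TemporaryFile()  # removed automatically on close
        self._count = 0

    def push(self, commit_time, author_time, oid_hex):
        # Fixed-size records make the offset of record N simply N * size.
        self._fh.seek(self._count * RECORD.size)
        self._fh.write(RECORD.pack(commit_time, author_time,
                                   bytes.fromhex(oid_hex)))
        self._count += 1

    def pop(self):
        """Return the most recently pushed record, or None when empty."""
        if self._count == 0:
            return None
        self._count -= 1
        self._fh.seek(self._count * RECORD.size)
        ct, at, oid = RECORD.unpack(self._fh.read(RECORD.size))
        return ct, at, oid.hex()
```

Because every record has the same size, walking the file backwards needs only a seek per record, so memory use stays constant regardless of history length.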
2020-07-25	v2: index forwards (via `git log --reverse')
	Since we'll need to expose THREADID to JMAP and IMAP users, index all messages in the order they were committed to ensure our `tid' (thread ID) column ascends in mirrors the same way it does in the source inbox. This drastically simplifies our code but increases memory usage of `git-log'. The next commit will bring memory use back down at the expense of $TMPDIR usage.
2020-07-06	www: start making gzipfilter the parent response class
	Virtually all of our responses are going to be gzipped, anyways. This will allow us to utilize zlib as a buffering layer and share common code for async blob retrieval responses. To streamline this and allow GzipFilter to be a parent class, we'll replace the NoopFilter with a similar CompressNoop class which emulates the two Compress::Raw::Zlib::Deflate methods we use. This drops a bunch of redundant code and will hopefully make upcoming WwwStream changes easier to reason about.
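The pass-through pattern described above can be illustrated in Python, where `zlib.compressobj()` objects expose `compress()` and `flush()`. This is only an analogy to the Perl `Compress::Raw::Zlib::Deflate` interface; the `render` helper and its `accept_gzip` flag are hypothetical names for this sketch.

```python
import zlib


class CompressNoop:
    """Stand-in that mimics the two compressor methods but changes nothing."""

    def compress(self, data: bytes) -> bytes:
        return data

    def flush(self) -> bytes:
        return b""


def render(chunks, accept_gzip):
    # wbits=31 asks zlib for a gzip container; otherwise use the no-op.
    z = zlib.compressobj(wbits=31) if accept_gzip else CompressNoop()
    body = b"".join(z.compress(c) for c in chunks)
    return body + z.flush()
```

The calling code never branches on whether compression is active after the object is chosen, which is what lets a single parent class serve both cases.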
2020-07-06	mboxgz: do asynchronous git blob retrievals
	This lets the -httpd worker process make better use of time instead of waiting for git-cat-file to respond. With 4 jobs in the new test case against a clone of <https://public-inbox.org/meta/>, a speedup of 10-12% is shown. Even a single job shows a 2-5% improvement on an SSD.
2020-07-06	wwwlisting: use GzipFilter for HTML
	The changes to GzipFilter here may be beneficial for building HTML and XML responses in other places, too.
2020-06-28	watch: use our own "git credential" wrapper
	Git.pm may not be installed on some systems; or some users have multiple Perl installations and Git.pm is not available to the Perl running -watch. Accommodate both those types of users by providing our own "git credential" wrapper.
2020-06-28	watch: add NNTP support
	This is similar to IMAP support, but only supports polling. Automatic altid support is not yet implemented, but may be in the future. v2: small grammar fix by Kyle Meyer Link: https://public-inbox.org/meta/87sgeg5nxf.fsf@kyleam.com/
2020-06-28	watch: remove Filesys::Notify::Simple dependency
	Since we already use inotify and EVFILT_VNODE (kqueue) in -imapd, we might as well use them directly in -watch, too. This will allow public-inbox-watch to use PublicInbox::DS for timers to watch newsgroups/mailboxes and have saner signal handling in future commits.
2020-06-28	kqnotify|fake_inotify: detect Maildir write ops
	We need to detect link(2) and rename(2) in other apps writing to the Maildir. We'll be removing the Filesys::Notify::Simple from -watch in favor of using IO::KQueue or Linux::Inotify2 directly. Ensure non-inotify emulations can support everything we expect for Maildir writers.
2020-06-28	watch: preliminary IMAP support
	Only servers with IDLE are supported, for now. Polling will be needed since users may need to watch many inboxes with a few active connections due to IMAP server limitations.
2020-06-28	URI IMAP support
	We'll be supporting the IMAP URL scheme described in RFC 5092 for -watch, so add this module to fill in what the `URI' package lacks.
2020-06-28	imaptracker: use ~/.local/share/public-inbox/imap.sqlite3
	Respect XDG_DATA_HOME to avoid cluttering ~/.public-inbox/. Existing users of ~/.public-inbox/imap.sqlite3 will remain supported, but the preference for new data is to use ~/.local/share and other paths standardized by XDG. Cc: "Eric W. Biederman" <ebiederm@xmission.com>
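A rough Python sketch of the path selection described above. The lookup order (an existing legacy file wins, otherwise the XDG location) is an assumption for illustration, and `imap_tracker_path` with its injectable `env` and `exists` parameters is a hypothetical helper, not the project's API.

```python
import os


def imap_tracker_path(env=os.environ, exists=os.path.exists):
    home = env.get("HOME", "")
    legacy = os.path.join(home, ".public-inbox", "imap.sqlite3")
    if exists(legacy):
        # Assumed behavior: keep using a pre-existing legacy database.
        return legacy
    # XDG Base Directory default when XDG_DATA_HOME is unset or empty.
    data_home = env.get("XDG_DATA_HOME") or os.path.join(home, ".local", "share")
    return os.path.join(data_home, "public-inbox", "imap.sqlite3")
```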
2020-06-28	IMAPTracker: Add a helper to track our place in reading imap mailboxes
	This removes the need to delete from an IMAP mailbox when downloading its messages. [ew: minor style changes] Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2020-06-16	imap: *SEARCH: use Parse::RecDescent
	For properly parsing IMAP search requests, it's easier to use a recursive descent parser generator to deal with subqueries and the "OR" statement. Parse::RecDescent was chosen since it's mature, well-known, widely available and already used by our optional dependencies: Inline::C and Mail::IMAPClient. While it's possible to build Xapian queries without using the Xapian string query parser; this iteration of the IMAP parser still builds a string which is passed to Xapian's query parser for ease-of-diagnostics. Since this is a recursive descent parser dealing with untrusted inputs, subqueries have a nesting limit of 10. I expect that is more than adequate for real-world use.
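To illustrate the approach (not the actual Parse::RecDescent grammar), here is a hand-written recursive descent parser in Python for a tiny subset of IMAP SEARCH. It handles only `OR`, `NOT`, parenthesized groups, `SUBJECT` and `FROM`, builds a query string, and enforces a nesting limit of 10. The `parse_search` name and the handling of each key are assumptions for this sketch; `s:` and `f:` follow public-inbox's subject and from search prefixes.

```python
import re

MAX_DEPTH = 10  # nesting limit mentioned in the commit message
TOKEN = re.compile(r'\(|\)|"[^"]*"|[^\s()"]+')
PREFIX = {"SUBJECT": "s", "FROM": "f"}


def parse_search(text):
    tokens = TOKEN.findall(text)
    pos = 0

    def key(depth):
        nonlocal pos
        if depth > MAX_DEPTH:
            raise ValueError("search nested too deeply")
        if pos >= len(tokens):
            raise ValueError("unexpected end of search")
        tok = tokens[pos]
        pos += 1
        up = tok.upper()
        if tok == "(":  # parenthesized list of search keys
            parts = []
            while pos < len(tokens) and tokens[pos] != ")":
                parts.append(key(depth + 1))
            if pos >= len(tokens):
                raise ValueError("missing closing parenthesis")
            pos += 1  # consume ")"
            return "(" + " ".join(parts) + ")"
        if up == "OR":  # OR takes exactly two search keys
            left = key(depth + 1)
            right = key(depth + 1)
            return f"({left} OR {right})"
        if up == "NOT":
            return "NOT " + key(depth + 1)
        if up in PREFIX:  # key followed by one string argument
            if pos >= len(tokens):
                raise ValueError(f"{up} needs an argument")
            arg = tokens[pos].strip('"')
            pos += 1
            return f'{PREFIX[up]}:"{arg}"'
        raise ValueError(f"unsupported search key: {tok}")

    parts = []
    while pos < len(tokens):
        parts.append(key(1))
    return " ".join(parts)
```

Passing the depth down each recursive call keeps the limit check in one place, so hostile input such as a long run of `(` fails quickly instead of exhausting the stack.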
2020-06-16	MANIFEST: add missing 1.6.0 release notes entry

2020-06-13	imap: introduce memory-efficient uo2m mapping
	Since we limit our mailbox slices to 50K and can guarantee a contiguous UID space for those mailboxes, we can store a mapping of "UID offsets" (not full UIDs) to Message Sequence Numbers as an array of 16-bit unsigned integers in a 100K scalar. For UID-only FETCH responses, we can momentarily unpack the compact 100K representation to a ~1.6M Perl array of IV/UV elements for a slight speedup. Furthermore, we can (ab)use hash key deduplication in Perl5 to deduplicate this 100K scalar across all clients with the same mailbox slice open. Technically we can increase our slice size to 64K w/o increasing our storage overhead, but I suspect humans are more accustomed to slices easily divisible by 10.
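The arithmetic above (50,000 offsets x 2 bytes = 100,000 bytes) can be sketched in Python with `struct`. The function names, the little-endian layout, and the use of 0 for UIDs that do not exist are assumptions for illustration; the real Perl code may fill gaps differently.

```python
import struct

SLICE = 50_000  # UIDs per mailbox slice, as stated in the commit message


def uo2m_pack(uids, uid_base):
    """Pack ascending UIDs from (uid_base, uid_base + SLICE] into 16-bit MSNs."""
    table = [0] * SLICE  # assumed convention: 0 means "no such UID"
    for msn, uid in enumerate(uids, 1):
        table[uid - uid_base - 1] = msn  # index by UID offset, not full UID
    return struct.pack(f"<{SLICE}H", *table)


def uid_to_msn(packed, uid_base, uid):
    off = uid - uid_base - 1
    if not 0 <= off < SLICE:
        return 0
    # Each entry is 2 bytes, so the byte offset is off * 2.
    return struct.unpack_from("<H", packed, off * 2)[0]
```

A sequence number never exceeds the slice size, so it always fits in 16 bits; that is also why a 64K slice would cost no extra storage per entry.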
2020-06-13	imap: require ".$UID_MIN-$UID_END" suffix
	Finish up the IMAP-only portion of iterative config reloading, which allows us to create all sub-ranges of an inbox up front. The InboxIdler still uses ->each_inbox which will struggle with 100K inboxes. Having messages in the top-level newsgroup name of an inbox will still waste bandwidth for clients which want to do full syncs once there's a rollover to a new 50K range. So instead, make every inbox accessible exclusively via 50K slices in the form of "$NEWSGROUP.$UID_MIN-$UID_END". This introduces the DummyInbox, which makes $NEWSGROUP and every parent component a selectable, empty inbox. This aids navigation with mutt and possibly other MUAs. Finally, the xt/perf-imap-list maintainer test is broken, now, so remove it. The grep perlfunc is already proven effective, and we'll have separate tests for mocking out ~100k inboxes.
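A small Python sketch of the naming scheme described above. That slices start at UID 1 and are inclusive on both ends is an assumption for illustration, as are the `slice_names` and `dummy_parents` helper names.

```python
SLICE = 50_000  # messages per IMAP mailbox slice


def slice_names(newsgroup, max_uid):
    """Mailbox names of the form "$NEWSGROUP.$UID_MIN-$UID_END"."""
    names = []
    uid_min = 1  # assumed first UID
    while uid_min <= max(max_uid, 1):  # always create at least one slice
        uid_end = uid_min + SLICE - 1
        names.append(f"{newsgroup}.{uid_min}-{uid_end}")
        uid_min = uid_end + 1
    return names


def dummy_parents(newsgroup):
    """$NEWSGROUP and every parent component, each a selectable empty inbox."""
    parts = newsgroup.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]
```

Under these assumptions, a client that has fully synced an older slice never needs to re-download it when new messages roll over into the next range.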
2020-06-13	xt: add imapd-validate and imapd-mbsync-oimap
	imapd-validate is a beefed up version of our nntpd-validate test which hammers the server with parallel connections over regular IMAP, IMAPS, IMAP+STARTTLS; and COMPRESS=DEFLATE variants of each of those. It uses $START_UID:$END_UID fetch ranges to reduce requests and slurp many responses at once to saturate "git cat-file --batch" processes. mbsync(1) also uses pipelining extensively (but IMHO unnecessarily), so it was able to shake out some bugs in the async git code. Finally, we remove xt/cmp-imapd-compress.t since it's redundant now that we have PublicInbox::IMAPClient to work around bugs in Mail::IMAPClient.
2020-06-13	imapclient: wrapper for Mail::IMAPClient
	We'll be using this wrapper class to work around some upstream bugs in Mail::IMAPClient. There may also be experiments with new APIs for more performance.
2020-06-13	add imapd compression test
	Include a test for Mail::IMAPTalk, here, since Mail::IMAPClient stalls with compression enabled: https://rt.cpan.org/Ticket/Display.html?id=132720
2020-06-13	imap: use git-cat-file asynchronously
	This ought to improve overall performance with multiple clients. Single client performance suffers a tiny bit due to extra syscall overhead from epoll. This also makes the existing async interface easier-to-use, since calling cat_async_begin is no longer required.
2020-06-13	imap: split out unit tests and benchmarks
	This makes the test code easier-to-manage and allows us to run faster unit tests which don't involve loading Mail::IMAPClient.
2020-06-13	imap: allow fetch of partial of BODY[...] and headers
	IMAP supports a high level of granularity when it comes to fetching, but fortunately Perl makes it fairly easy to support.
2020-06-13	inboxidle: new class to detect inbox changes
	This will be used to implement IMAP IDLE, first. Eventually, it may be used to trigger other things:
	* incremental internal updates for manifest.js.gz
	* restart `git cat-file' processes on pack index unlink
	* IMAP IDLE-like long-polling HTTP endpoint
	And maybe more things we haven't thought of, yet. It uses Linux::Inotify2 or IO::KQueue depending on what packages are installed and what the kernel supports. It falls back to nanosecond-aware Time::HiRes::stat() (available with Perl 5.10.0+) on systems lacking Linux::Inotify2 and IO::KQueue. In the future, a pure Perl alternative to Linux::Inotify2 may be supplied for users of architectures we already support signalfd and epoll on.
	v2 changes:
	- avoid O_TRUNC on lock file
	- change ctime on Linux systems w/o inotify
	- fix naming of comments and fields
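The stat()-based fallback can be sketched in Python, which exposes nanosecond timestamps as `st_mtime_ns` and `st_ctime_ns`. The `StatPoller` class and the choice to compare mtime, ctime and size together are assumptions for this illustration, not a description of the Perl implementation.

```python
import os


class StatPoller:
    """Detect file changes by comparing stat() signatures between polls."""

    def __init__(self):
        self._seen = {}

    @staticmethod
    def _signature(path):
        try:
            st = os.stat(path)
        except FileNotFoundError:
            return None  # a missing file is a state of its own
        return (st.st_mtime_ns, st.st_ctime_ns, st.st_size)

    def watch(self, path):
        self._seen[path] = self._signature(path)

    def poll(self):
        """Return the watched paths whose signature changed since last poll."""
        changed = []
        for path, old in self._seen.items():
            new = self._signature(path)
            if new != old:
                self._seen[path] = new  # value update only, keys unchanged
                changed.append(path)
        return changed
```

A caller would invoke `poll()` from a timer; nanosecond resolution makes it less likely that two writes within the same second go unnoticed.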
2020-06-13	preliminary imap server implementation
	It shares a bit of code with NNTP. It's copy+pasted for now since this provides new ground to experiment with APIs for dealing with slow storage and many inboxes.