public-inbox.git - an "archives first" approach to mailing lists

Date	Commit message
2020-09-01	rename WatchMaildir => Watch
	This is no longer limited to Maildirs now that IMAP and NNTP support exist, so give it a shorter name.
2020-08-30	imapd: filter out unusable flags from search
	Quiet down logs from -imapd when clients are blindly sending some unsupported flag conditions (e.g. "DRAFT", "DELETED") specified in RFC 3501.
2020-08-29	tests: check-run: fixup un-squashed simplification
	Link: https://public-inbox.org/meta/20200828221803.GA89978@dcvr/
2020-08-28	tests: check-run: show skipped tests
	We'll deduplicate redundant lines and show counts of skipped tests to ensure it's easy to notice if something is unexpectedly skipped.
2020-08-27	overidx: inline create_ghost sub
	There's no need for this to be a separate sub since there's only a single caller. This saves a few kilobytes at least in short-lived processes.
2020-08-27	over: recent: remove expensive COUNT query
	As noted in commit 87dca6d8d5988c5eb54019cca342450b0b7dd6b7 ("www: rework query responses to avoid COUNT in SQLite"), COUNT on many rows is expensive on big SQLite DBs. We've already stopped using that code path long ago in WWW while -imapd and -nntpd never used it. So we'll adjust our remaining test cases to not need it, either.
2020-08-27	over: rename ->disconnect to ->dbh_close
	Since we got rid of over->connect, `disconnect' no longer pairs with it. So name it after the `close(2)' syscall it ultimately issues.
2020-08-27	over: rename ->connect method to ->dbh
	`->connect' is confused with the perlfunc for the `connect(2)' syscall, and also `DBI->connect'. Since SQLite doesn't use sockets, the word "connect" needlessly confuses me. Give it a short name to match the field name we use for it, which also matches the variable name used by the DBI(3pm) and DBD::SQLite(3pm) manpages.
2020-08-26	over+msgmap: respect WAL journal_mode if set
	WAL actually seems to have ideal locking characteristics given concurrency problems I'm experiencing with --reindex running in parallel with expensive read-only SQLite queries: <https://public-inbox.org/meta/20200825001204.GA840@dcvr/> Unfortunately, we cannot blindly use WAL while preserving compatibility with existing setups or our guarantees that read-only daemons are indeed "read-only". However, respect a user's choice to set WAL on their own if they're comfortable with giving -nntpd/-httpd/-imapd processes write permission to the directory storing SQLite DBs.
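
A minimal sketch of the "respect WAL if already set" idea, written in Python rather than the project's Perl. The function name `open_respecting_wal` and the `truncate` fallback are assumptions for illustration, not taken from the commit.

```python
import sqlite3


def open_respecting_wal(path, fallback="truncate"):
    """Open an SQLite DB and leave journal_mode alone if it is already WAL.

    `fallback` is a hypothetical default applied only when the DB is not
    in WAL mode; it must be a trusted journal_mode keyword.
    """
    conn = sqlite3.connect(path)
    # WAL is persistent: it is recorded in the DB file, so a fresh
    # connection reports whatever an admin chose earlier.
    (mode,) = conn.execute("PRAGMA journal_mode").fetchone()
    if mode.lower() != "wal":
        (mode,) = conn.execute(f"PRAGMA journal_mode = {fallback}").fetchone()
    return conn, mode.lower()
```

An admin who has run `PRAGMA journal_mode = WAL` on the file once keeps that setting across later opens; every other DB gets the fallback mode.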
2020-08-23	mbox: disable "&t" on existing Xapian until full reindex
	Expanding threads via over.sqlite3 for mbox.gz downloads without Xapian effectively collapsing on the THREADID column leads to repeated messages getting downloaded. To avoid that situation, use a "has_threadid" Xapian metadata flag that's only set on --reindex (and brand new Xapian DBs). This allows admins to upgrade WWW or do --reindex in any order, without worrying about users eating up bandwidth and CPU cycles.
2020-08-23	searchidx: index THREADID in Xapian
	This is the `tid' column from over.sqlite3, and will be used for IMAP and JMAP search (among other things).
2020-08-20	init+index: support --skip-docdata for Xapian
	Since we no longer read document data from Xapian, allow users to opt-out of storing it. This breaks compatibility with previous releases of public-inbox, but gives us a ~1.5% space savings on Xapian storage (and associated I/O and page cache pressure reduction).
2020-08-20	t/nntpd-v2: set PI_TEST_VERSION=2 properly
	Numbers are hard :<
2020-08-20	smsg: remove from_mitem
	We no longer read docdata.glass from anywhere in our code base. Some adjustments were needed to t/search.t to deal with the Xapian::WritableDatabase committing at different times, since our ->query is avoided from PublicInbox::SearchIdx to avoid needing a {over_ro} field.
2020-08-20	init: drop -N alias for --skip-artnum
	It may be too easily confused for --newsgroup or --ng. This is too rarely used and never made it into a release, so it should be fine.
2020-08-20	init: support --newsgroup option
	We can reduce the need to edit the config file for NNTP group names this way.
2020-08-19	smsg: handle wide characters in raw mail headers
	There may be messages in the wild with wide characters in headers which aren't RFC 2047 encoded. Assume UTF-8 so those fields can round trip through over.sqlite3. This doesn't affect docdata.glass in Xapian, but it does affect how over.sqlite3 stores the same deflated info.
2020-08-10	avoid File::Temp::tempfile in more places
	We can use open(..., undef) natively in Perl in t/import.t. In places where we need a pathname, the File::Temp OO API gives us auto-unlinking for free.
2020-08-08	support setting No_COW on Perl <5.22
	fileno(DIRHANDLE) only works on Perl 5.22+, so we need to use dirfd(3) ourselves from Inline::C (or rely on chattr(1) being installed). While we're at it, rename `set_nodatacow' to `nodatacow_fd' for consistency with `nodatacow_dir'.
2020-08-07	index: v2: --sequential-shard option
	This gives better page cache utilization for Xapian indexing on slow storage by improving locality for random I/O activity on the Xapian DB. Instead of doing a single pass to index both SQLite and Xapian, this indexes them separately. The first pass is identical to indexlevel=basic: it indexes both over.sqlite3 and msgmap.sqlite3. Subsequent passes only operate on a single Xapian shard for documents belonging to that shard. Given enough shards, each individual shard can be made small enough to fit into the kernel page cache and avoid HDD seeks for read activity. Doing rough tests with a busy system with a 7200 RPM HDD with ext4, full indexing of LKML (9 epochs) goes from ~80 hours (-j0) to ~30 hours (-j8) with 16GB RAM with 7 shards configured and fsync(2) disabled (--no-sync) and `--batch-size=10m'.
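
The pass structure described above can be sketched roughly as follows, in Python rather than Perl. The function name and the modulo rule that assigns an article number to a shard are assumptions made for illustration; the commit message only says each later pass handles "documents belonging to that shard".

```python
def sequential_shard_passes(article_nums, nshards):
    """Return the ordered list of (pass_name, article_numbers) to process.

    Pass 0 covers every message (the SQLite-only work); each later pass
    touches only one shard, so its working set stays small.
    """
    nums = list(article_nums)
    passes = [("sqlite", nums)]
    for shard in range(nshards):
        # Assumed assignment rule: article number modulo shard count.
        mine = [n for n in nums if n % nshards == shard]
        passes.append((f"xapian-shard-{shard}", mine))
    return passes
```

Because every Xapian pass writes to a single shard, a shard that fits in the page cache can be indexed without competing against the others for cached pages.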
2020-08-07	xapcmd: quietly no-op on indexlevel=basic
	I find myself mindlessly adding "-c" to public-inbox-index, and other users may do the same. Instead of erroring out, we'll just silently ignore it for now, and allow public-inbox-compact to work on SQLite-only inboxes. We'll only check for xapian-compact if search exists, since it won't be needed in case we support SQLite VACUUM.
2020-08-07	syscall: support sparc64 (and maybe other big-endian systems)
	Thanks to the GCC compile farm project, we can wire up syscalls for sparc64 and set system-specific SFD_* constants properly. I've FINALLY figured out how to use POSIX::SigSet to generate a usable buffer for the syscall perlfunc. This is required for endian-neutral behavior and relevant to sparc64, at least. There's no need for signalfd-related stuff to be constants, either. signalfd initialization is never a hot path and a stub subroutine for constants uses several KB of memory in the interpreter. We'll drop the needless SEEK_CUR import while we're importing O_NONBLOCK, too.
2020-08-07	imap: search support BODY key
	This is specified in RFC 3501 but was accidentally omitted :x I probably got it confused with TEXT, so add a comment about TEXT being "everything" in the message.
2020-08-07	www: avoid warnings on YYYYMMDD-only t= query parameter
	While we always generate YYYYMMDDhhmmss query parameters ourselves, the regexps in paginate_recent allow YYYYMMDD-only (no hhmmss) timestamps, so don't trigger Time::Local::timegm warnings about empty numeric comparisons on empty strings when a client starts making up their own URLs.
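
A minimal sketch of accepting both timestamp forms, in Python rather than Perl. `parse_t` is a hypothetical helper, not the project's paginate_recent code; it treats a missing hhmmss part as midnight UTC so that no empty string reaches the time conversion.

```python
import calendar
import re

# Eight digits are required; the six hhmmss digits are optional.
_T_RE = re.compile(
    r"\A(\d{4})(\d{2})(\d{2})(?:(\d{2})(\d{2})(\d{2}))?\Z", re.ASCII
)


def parse_t(t):
    """Convert YYYYMMDD or YYYYMMDDhhmmss to epoch seconds (UTC), else None."""
    m = _T_RE.match(t)
    if m is None:
        return None
    # Missing hh/mm/ss groups are None; default them to zero.
    y, mo, d, h, mi, s = (int(g) if g is not None else 0 for g in m.groups())
    return calendar.timegm((y, mo, d, h, mi, s, 0, 0, 0))
```

With this, `parse_t("20200807")` and `parse_t("20200807000000")` yield the same value, and a malformed parameter returns `None` instead of triggering a warning.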
2020-08-06	t/epoll: adjust for u64_mod_8 case
	epoll_wait_mod8 places a dummy element into the [2] slot of the nested array, which caused is_deeply to fail. Tested on aarch64.
2020-08-03	t/indexlevels-mirror-v1: localize ENV change
	We don't want ENV changes propagated to other tests when using t/run.perl via "make check-run"
2020-08-03	t/nntpd: do not fork on indexing, test v2
	No need to waste resources when doing minimal work. With PI_TEST_VERSION=2, this fixes a test failure where Net::NNTP::DESTROY was getting called in the shard process. We'll also get rid of an unnecessary use_ok under v2, too.
2020-08-02	searchidx: remove v1-only msg_mime sub
	We can rely on the newer mids() sub directly and use faster numeric comparisons for Msgmap unindexing in v1.
2020-08-02	nntp: fix STAT command
	The return value of art_lookup changed but this command wasn't updated since it wasn't tested. Fixes: 0e6ceff37fc38f28 ("nntp: support slow blob retrievals")
2020-08-01	improve error handling on import fork / lock failures
	v?fork failures seem to be the cause of locks not getting released in -watch. Ensure lock release doesn't get skipped in ->done for both v1 and v2 inboxes. We also need to do everything we can to ensure DB handles, pipes and processes get released even in the face of failure. While we're at it, make failures around `git update-server-info' non-fatal, since smart HTTP seems more popular anyways.
	v2 changes:
	- spawn: show failing command
	- ensure waitpid is synchronous for inotify events
	- teardown all fast-import processes on exception, not just the failing one
	- beef up lock_release error handling
	- release lock on fast-import spawn failure
2020-07-29	t/init: fix test when ~/.public-inbox/ does not exist
	We'll just set the documented PI_EMERGENCY env to a writable location.
2020-07-29	t/imap_searchqp: fix test dependencies
	The query parser test pulls in all of the IMAP stuff, so it has the same dependencies.
2020-07-29	searchidx: disable CoW for SQLite and Xapian under btrfs
	SQLite and Xapian files are written randomly, thus they become fragmented under btrfs with copy-on-write. This leads to noticeable performance problems (and probably ENOSPC) as these files get big. lore/git (v2, <1GB) indexes around 20% faster with this on an ancient SSD. lore/lkml seems to be taking forever and I'll probably cancel it to save wear on my SSD. Unfortunately, disabling CoW also means disabling checksumming (and compression), so we'll be careful to only set the No_COW attribute on regeneratable data. We want to keep CoW (and checksums+compression) on git storage because current ref storage is neither checksummed nor compressed, and git streams pack output.
2020-07-29	v2writable: use {inboxdir} for msgmap->tmp_clone
	Otherwise, a user is more likely to remove the msgmap-XXXXXXXX SQLite file from $TMPDIR and cause SQLite to error out.
2020-07-26	t/init.t: don't modify ~/.public-inbox/
	Tests for failures should not leave junk temporary files lying around in a user's ~/.public-inbox/. On a side note, I'm not sure if PI_DIR is or was ever necessary. It's never been documented, so perhaps using $HOME for this is better...
2020-07-25	searchidx: make v1 indexing closer to v2
	We'll switch to using IdxStack here to ensure we get repeatable results and ascending THREADIDs according to git chronology. This means we'll need a two-pass reindex to index existing messages before indexing new messages. Since we no longer have a long-lived git-log process, we don't have to worry about old Xapian referencing the git-log pipe w/o FD_CLOEXEC, either.
2020-07-25	searchidx: rename _xdb_{acquire,release} => idx_
	The "xdb" prefix was inaccurate since it's used by indexlevel=basic, which is Xapian-free. The '_' (underscore) prefix was also wrong for a method which is called across package boundaries.
2020-07-25	v2writable: introduce idx_stack
	This avoids pinning a potentially large chunk of memory from `git-log --reverse' into RAM (or triggering less predictable swap behavior). Instead it uses a contiguous temporary file with a fixed-size record for every blob we'll need to index.
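
A rough sketch of a file-backed stack of fixed-size records, in Python rather than Perl. The class name `RecordStack` and the record layout (one type byte, a 64-bit timestamp, a 20-byte object ID) are assumptions for illustration and are not taken from the commit.

```python
import struct
import tempfile


class RecordStack:
    """Push fixed-size records to a temp file, then read them newest-first."""

    # ">" disables padding, so every record is exactly 1 + 8 + 20 = 29 bytes.
    REC = struct.Struct(">cQ20s")

    def __init__(self):
        self.fh = tempfile.TemporaryFile()  # removed automatically on close

    def push(self, kind, timestamp, oid):
        self.fh.write(self.REC.pack(kind, timestamp, oid))

    def pop_all(self):
        """Yield records in reverse push order without loading them all."""
        self.fh.flush()
        pos = self.fh.seek(0, 2)  # byte offset of end-of-file
        while pos > 0:
            pos -= self.REC.size
            self.fh.seek(pos)
            yield self.REC.unpack(self.fh.read(self.REC.size))
```

Since each record has the same size, the reader can step backwards by a constant offset, which gives oldest-first order from newest-first input while keeping only one record in memory at a time.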
2020-07-25	index: support --rethread switch to fix old indices
	Older versions of public-inbox < 1.3.0 had subtly different semantics around threading in some corner cases. This switch (when combined with --reindex) allows us to fix them by regenerating associations.
2020-07-18	msgmap: fix atfork_* callbacks
	Noticed while reindexing a largish v2 inbox in parallel on an SSD which required checkpointing and respawning shard workers. Fixes: f06e84220e5566e7 ("over+msgmap: do not store filename after DBI->connect")
2020-07-17	search: simplify unindexing
	Since over.sqlite3 seems here to stay, we no longer need to do Message-ID lookups against Xapian and can simply rely on the docid <=> NNTP article number equivalency SCHEMA_VERSION=15 gave us. This rids us of the closure-using batch_do sub in the v1 code path and vastly simplifies both v1 and v2 unindexing.
2020-07-17	t/import: quiet warning, clobber variable
	The eval in key2sub via t/run.perl ("make check-run") won't trigger the warning, but running "prove -bvw t/import.t" directly does. In any case, ensure the contents of this variable don't linger across runs.
2020-07-17	config: reject `\n' in `inboxdir'
	"\n" and other characters requiring quoting and/or escaping in $GIT_DIR/objects/info/alternates were not supported in git 2.11 and earlier; nor do they seem supported at all in libgit2. This will allow us to support sharing git-cat-file or similar endpoints across multiple inboxes via alternates. This breaks an existing use case for anybody wacky enough to put `\n' in the `inboxdir' pathname; but I doubt this affects anybody.
2020-07-14	over+msgmap: do not store filename after DBI->connect
	SQLite already knows the filename internally, so avoid having it as a long-lived Perl SV to save some bytes when there are many inboxes and open DBs.
2020-07-14	nntpd+imapd: detect unlinked msgmap
	While it's even less common to experience a replaced msgmap.sqlite3 file, BOFHs may do the darndest things. This is another step towards reducing the number of needless wakeups we need to do in long-lived read-only daemons.
2020-07-10	imap: avoid warnings on non-slice mailboxes
	Non-slice mailboxes never have messages themselves, so we must not assume a message exists when sending untagged EXISTS messages.
2020-07-10	hval: to_filename: return `undef' instead of empty string
	Returning an empty string for a filename makes no sense, so instead return `undef' so the caller can set up a fallback using the "//" operator. This fixes uninitialized variable warnings because split() on an empty string returns `undef', which caused to_filename to warn on s// and tr// ops.
2020-07-07	t/spawn: fix test reliability
	Since Perl doesn't internally use a self-pipe for sleep/select/poll/etc, wake up every 10ms to ensure it can see the SIGCHLD; since neither signalfd nor EVFILT_SIGNAL are always available. Fixes: 761baa2a300e4268 ("spawn: unblock SIGCHLD in subprocess")
2020-07-06	wwwattach: support async blob retrievals
	We can reuse some of the GzipFilter infrastructure used by other WWW components to handle slow blob retrieval, here. The difference from previous changes is we don't decide on the 200 status code until we've retrieved the blob and found the attachment. While we're at it, ensure we can compress text attachment responses once again, since all text attachments are served as text/plain.
2020-07-06	wwwatomstream: support async blob fetch
	This allows -httpd to handle other requests while waiting for git to retrieve and decode blobs. We'll also break apart t/psgi_v2.t further to ensure tests run against -httpd in addition to generic PSGI testing. Using xt/httpd-async-stream.t to test against clones of meta@public-inbox.org shows a 10-12% performance improvement with the following env: TEST_JOBS=1000 TEST_CURL_OPT=--compressed TEST_ENDPOINT=new.atom