public-inbox.git - an "archives first" approach to mailing lists

Date	Commit message
2020-12-19	tests: more common JSON module loading
	We'll probably be using JSON more in the future, so make it easier to require in tests
2020-12-19	lei_store: local storage for Local Email Interface
	Still unstable, this builds off the equally unstable extindex :P This will be used for caching/memoization of traditional mail stores (IMAP, Maildir, etc) while providing indexing via Xapian, along with compression, and checksumming from git. Most notably, this adds the ability to add/remove per-message keywords (draft, seen, flagged, answered) as described in the JMAP specification (RFC 8621 section 4.1.1). We'll use `.' (a single period) as an $eidx_key since it's an invalid {inboxdir} or {newsgroup} name.
2020-12-19	lei: proposed command-listing and options
	In an attempt to ensure a coherent UI/UX, we'll try to document all proposed commands and options in one place for easy reference
2020-12-19	lei: FD-passing and IPC basics
	The start of lei, a Local Email Interface. It'll support a daemon via FD passing to avoid startup time penalties if IO::FDPass is installed, but fall back to a slow one-shot mode if not. Compared to a traditional socket daemon, FD passing should allow us to eventually do stuff like run "git show" and still have proper terminal support for pager and color.
2020-12-18	extsearchidx: improve missing machine-id fallback
	It's likely most GNU/Linux systems have /etc/machine-id these days, so anything missing it is likely a *BSD, most of which support and favor "sysctl -n kern.hostid". We'll also support "ghostid" since GNU utils are commonly prefixed with 'g' on non-GNU platforms. In any case, we'll suppress stderr from missing commands and fall back to hard coding an $OSNAME-based identifier as a last resort and hope the hostname is unique.
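	The fallback chain described above can be sketched roughly as follows. This is a minimal illustration in Python, not the project's Perl code: the helper names (`read_first_line`, `run_quiet`, `host_identifier`) are hypothetical, and the exact order of attempts and the format of the last-resort identifier are assumptions.

```python
import platform
import socket
import subprocess


def read_first_line(path):
    """Return the first line of `path` stripped, or None if unreadable/empty."""
    try:
        with open(path, "r", encoding="ascii", errors="replace") as fh:
            return fh.readline().strip() or None
    except OSError:
        return None


def run_quiet(cmd):
    """Run `cmd`, discarding stderr; return stripped stdout or None on failure."""
    try:
        proc = subprocess.run(cmd, stdout=subprocess.PIPE,
                              stderr=subprocess.DEVNULL, timeout=5)
    except (OSError, subprocess.TimeoutExpired):
        return None  # command missing or hung: fall through to the next option
    if proc.returncode != 0:
        return None
    return proc.stdout.decode("ascii", "replace").strip() or None


def host_identifier():
    """Try each source in turn; the final fallback mixes OS name and hostname."""
    return (read_first_line("/etc/machine-id")
            or run_quiet(["sysctl", "-n", "kern.hostid"])
            or run_quiet(["ghostid"])
            or f"{platform.system()}-{socket.gethostname()}")  # assumed format
```

	Suppressing stderr keeps a missing `sysctl` or `ghostid` from cluttering output on platforms that lack them.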
2020-12-18	nntpd: skip inboxes w/o {newsgroup}
	So we don't trigger an uninitialized variable warning :x
2020-12-18	import: drop X-Status in addition to Status
	It's actually supported by mutt, dovecot[1], and likely some other software to augment the Status: header. While dovecot doesn't expose X-Status to clients, mutt will write 'A' (answered) and 'F' to X-Status (but not T (draft)). So we'll drop it like we do Status since it's not suitable for public mail, but sticking it in an @UNWANTED_HEADERS array will allow us to configure an override if needed. [1] https://doc.dovecot.org/configuration_manual/mail_location/mbox/
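	As a rough illustration of list-driven header dropping, here is a minimal Python sketch using the standard `email` package. It is not the project's Perl implementation: `UNWANTED_HEADERS` mirrors the array name from the commit message, while `drop_unwanted_headers` and the sample message are hypothetical.

```python
from email import message_from_string

# Headers carrying private per-user state, unsuitable for a public archive.
UNWANTED_HEADERS = ["Status", "X-Status"]


def drop_unwanted_headers(msg, unwanted=UNWANTED_HEADERS):
    """Remove every occurrence of each unwanted header (case-insensitive)."""
    for name in unwanted:
        del msg[name]  # no error is raised if the header is absent
    return msg


raw = "Subject: hello\nStatus: RO\nX-Status: AF\n\nbody\n"
cleaned = drop_unwanted_headers(message_from_string(raw))
```

	Keeping the names in one list means an override only has to replace the list, not the loop.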
2020-12-17	extsearchidx: no need to make InboxWritable
	extindex treats v1/v2 public inboxes as read-only, so there's no need to scare people by using the InboxWritable package now that ->git_dir_n is gone and we can use ->max_git_epoch instead of ->git_dir_latest.
2020-12-17	inboxwritable: drop git_dir_n sub
	There's only one caller, there are unlikely to be any more, and it should be harmless to open-code.
2020-12-17	inbox: simplify v2 epoch counting
	Perl readdir detects list context and can return an array suitable for the grep op. From there, we can rely on substr to remove the ".git" suffix and integerize the value to save a few bytes before letting List::Util::max return the value. This is how we detect Xapian shards nowadays, too, and we'll also use defined-or (//) to simplify the return value there. We'll also simplify InboxWritable->git_dir_latest, remove some callers, and consider removing it entirely.
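	The same list, filter, strip-suffix, and max pipeline can be sketched in Python. This assumes epochs are directories named `N.git` under a single parent directory; the function name `max_git_epoch` is borrowed from the accessor mentioned elsewhere in this log, but the body is only an illustration, not the Perl code.

```python
import os
import re
import tempfile


def max_git_epoch(git_dir):
    """Return the highest N among `N.git` entries in git_dir, or None."""
    try:
        names = os.listdir(git_dir)
    except FileNotFoundError:
        return None
    # Keep only all-digit names ending in ".git", then drop the 4-char suffix.
    epochs = [int(n[:-4]) for n in names if re.fullmatch(r"[0-9]+\.git", n)]
    return max(epochs, default=None)


# Example layout: three epochs plus an unrelated entry that must be ignored.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("0.git", "1.git", "10.git", "all.git"):
        os.mkdir(os.path.join(tmp, name))
    latest = max_git_epoch(tmp)
    empty = max_git_epoch(os.path.join(tmp, "missing"))
```

	Converting to integers before taking the maximum matters: a string comparison would rank "9.git" above "10.git".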
2020-12-17	index: ignore some warnings, set {current_info} for v1
	-index runs on data that's already frozen in git, so there's no point in warning users about it. While we're at it, set the {current_info} prefix for v1 as we do in v2 inboxes in case new problems show up.
2020-12-17	inboxwritable: warn_ignore: "Bad UTF7 data escape"
	As with the other messages in this callback, there's nothing we can do about invalid messages ending up in our Maildirs for -watch.
2020-12-17	extsearchidx: lock eidxq on full --reindex
	Incremental indexing can use the `eidxq' reindexing queue for handling deletes and resuming interrupted indexing. Ensure those incremental -extindex invocations do not steal (and prematurely perform) work that an "-extindex --reindex" invocation is handling.
2020-12-17	searchidxshard: simplify newline elimination
	This overdue change fixes {current_info} to not inject a newline into every warning message. Simpler code helps us avoid bugs and the need to make fixes like commit 44de182766037948d62bc2a8ba924de2264dd5fc ("searchidxshard: chomp $eidx_key from pipe").
2020-12-17	extsearchidx: reindex releases over.sqlite3 handles properly
	When checkpointing and yielding the lock to other processes, we need to ensure any open DB statement handles are closed, since they reference and prevent DB FDs from being closed and unlocked. And clean up some progress reporting while we're at it.
2020-12-17	extsearchidx: simplify reindex code paths
	Since we're inside a Xapian transaction, calling ->index_raw followed by ->shard_add_eidx_info calls on the same docid doesn't seem to hurt indexing performance. It definitely reduces FS read traffic and IPC from git at the cost of some more IPC between the parent and workers. Nevertheless, the code and FD reductions seem worth it.
2020-12-17	extsearchidx: checkpoint releases locks
	--reindex can take many hours or days, so ensure we release locks according to --batch-size so automated fetch+index jobs can write new data to indices while we update old data.
2020-12-17	extsearchidx: reindex works on Xapian, too
	Instead of just working on over.sqlite3, we need to work on the Xapian DBs as well. While no changes to our Xapian use have taken place recently, they could in the future and --reindex exists to account for that.
2020-12-17	extindex: support --rethread and content bifurcation
	--rethread is useful for dealing with bugs and behaves just like it does with current inboxes. This is in case our content deduplication logic changes for whatever reason and causes previously merged messages to be considered "different". As with v2, this won't allow us to merge messages in a way that allows deduplicating messages which were previously considered different, but v2 inboxes do not allow that, either. In other words, this makes the --reindex and --rethread switches of -extindex match the behavior of v2 -index.
2020-12-17	over: sort xref3 by xnum if ibx_id repeats
	While unlikely to happen, it may be possible for messages from the same inbox to get indexed multiple times. Provide consistent results in this case for ease-of-testing.
2020-12-17	extindex: delete stale messages from over.sqlite3
	In addition to removing stale messages from Xapian, we must also remove them from over.sqlite3.
2020-12-17	extindex: preliminary --reindex support
	--reindex allows us to catch missed and stale messages due to -extindex vs -index races prior to commit 02b2fcc46f364b51 ("extsearchidx: enforce -index before -extindex"). We'll also rely on reindex to internally deal with v1/v2 inbox removals and partial-unindexing of messages which are only removed from one inbox out of many. This reindex design is completely different than how normal v1/v2 inbox reindex operates due to extindex having multiple histories to work with. Instead of scanning git history, this relies exclusively on comparing over.sqlite3 contents between the v1/v2 inboxes and the extindex. Changes to Xapian behavior also get picked up, now. Xapian indexing is handled by workers with minimal IPC to the parent process. This results in more read I/O but fewer writes when dealing with cross-posted messages. Changes to $smsg->populate and --rethread still need further work.
2020-12-17	inbox: ->uidvalidity returns undef w/o ->mm
	While totally unindexed inboxes are rare, we still support them for v1 and may hit code which calls this method. Just return `undef' when ->mm access fails.
2020-12-17	imap: rename parse_query => parse_imap_query
	Avoid confusing hackers since this conflicts with a method name provided by (Search::)Xapian::QueryParser.
2020-12-16	daemon: simplify fork() failure checks
	The defined-or `//' operator in 5.10 allows us to golf down our code slightly.
2020-12-16	daemon: support --daemonize without Net::Server::Daemonize
	We don't actually need Net::Server::Daemonize to support the --daemonize flag, since the daemonize() sub provided by N::S::D doesn't exactly do the things we want.
2020-12-14	PublicInbox::Feed owns `feedmax' default value
	There's no need to have extra code in the Inbox package for this or to waste dozens of bytes for every Inbox object which uses the default value. This makes our code more flexible w.r.t Inbox-like ExtSearch objects and fixes uninitialized value warnings with ->ALL.
2020-12-11	nntp+www: drop List-* and Archived-At headers
	These headers can conflict with headers in the DKIM signature; and parsing the DKIM-Signature header to determine whether or not we can safely add a header would be more code and CPU cycles. Since IMAP seems fine without these headers (and JMAP will likely be, too), there's likely no need to continue appending these to every message. Nowadays, developers seem sufficiently trained to use URLs with Message-IDs in them. So drop the headers and save some cycles and bandwidth all around.
2020-12-11	extmsg: avoid exceptions when /all/$MSGID/ fails
	If a message can't be found in ->ALL, we shouldn't attempt to enter code paths which iterate normal inboxes or attempt to access non-existent fields (e.g. {name}, {newsgroup}, {inboxdir}) in the ExtSearch object.
2020-12-10	manifest: account for future cache in MiscIdx docdata
	We'll be storing private data inside the "" (empty string) key of the JSON doc we store for manifest.js.gz generation. This private data will allow us to reduce FS activity and speed up startup times, but some will also be in Xapian boolean terms and values for searching and filtering.
2020-12-10	extsearchidx: enforce -index before -extindex
	We cannot set xref3 data without the `xnum' column to tie it to the per-inbox over.sqlite3 DB. So ensure we don't read brand-new history that only exists in git, but instead rely on last_commit and last_xap15-$EPOCH metadata in msgmap to decide how far we can index. Before this change, it was possible to miss messages in the extindex if -index did not run (which will be fixable by upcoming --reindex support in -extindex).
2020-12-10	searchidx: all indexers check for bad blobs
	This should help us detect bugs in our code or storage synchronization problems more easily. This probably won't detect corrupted git storage, but can detect corrupted SQLite files. "Bad blobs, bad blobs, whatcha gonna do when they come for you?"
2020-12-10	www+nntp: deal with lack of addresses for ->ALL
	Since extindex is an amalgamation of several inboxes, discerning an appropriate address for List-Post: would be expensive and most likely unnecessary. Some legacy/historical inboxes may have no active address, either, so don't attempt to set the List-Post header if no addresses are configured.
2020-12-09	extsearchidx: ck_existing: set $OID for warning context
	The content_hash() hash in the same scope may trigger warnings for a given blob, so ensure we correctly report the blob where it happens.
2020-12-09	admin: resolve_repo_dir => resolve_inboxdir
	We've stopped referring to inboxdirs as "repos" a while ago since v2 inboxes have multiple git repos associated with them. So update the name to reflect that and avoid an unnecessary export that's only used by a test case.
2020-12-09	extindex: do not use current dir like -index does
	At least not for resolving inboxes, since there's no good way for a user to specify what is an inbox or extindex directory without a command-line switch. Instead of changing the -extindex command, we change the -index command internals to rely on the new {-use_cwd} flag to avoid internal use of negation, since double-negatives and the like are confusing to me.
2020-12-09	rename {pi_config} fields to {pi_cfg}
	{pi_config} may be confused with the documented `PI_CONFIG' environment variable, and we'll favor vowel-removal to be consistent with our usage of object references. The `pi_' prefix may stay in some places for now, since a separate namespace may come into this codebase for local/private client-tooling. For InboxIdle, we'll also remove an invalid comment about holding a reference to the PublicInbox::Config object.
2020-12-09	nntp: replace {ng} with {ibx} for consistency
	They're PublicInbox::Inbox objects just like the rest of the non-NNTP code. So rename the NNTP code for consistency with the rest of the codebase. Furthermore, {ng} and $ng may be confused with the `--ng' switch for -init, and that's a non-ref scalar string.
2020-12-09	treewide: replace {-inbox} with {ibx} for consistency
	{ibx} is shorter and is the most prevalent abbreviation in indexing and IMAP code, and the `$ibx' local variable is already prevalent throughout. In general, the codebase favors removal of vowels in variable and field names to denote references (because references are "lighter" than non-references). So update WWW and Filter users to use the same code since it reduces confusion and may allow easier code sharing.
2020-12-09	search: reinstate "uid:" internal search prefix
	User-supplied queries (via PublicInbox::IMAPsearchqp) may restrict messages to certain UID ranges in addition to the limits we impose ourselves for mailbox slices. So we'll continue to ask Xapian::QueryParser to parse "uid:" numeric ranges. Fixes: 4b551c884a648b45 ("imap: support isearch and reduce Xapian queries")
2020-12-08	shard_add_eidx_info: pass $eidx_key instead of $ibx object
	This improves consistency with sibling methods such as ->shard_remove_eidx_info and ->add_xref3. Passing the $eidx_key scalar is preferable to the entire $ibx object for IPC-friendliness.
2020-12-08	searchidx: remove $oid parameter from most calls
	Xapian docids have been tied to the over {num} column for nearly 3 years, now; and OIDs are no longer stored in Xapian document data. There's no need to increase code and IPC complexity by passing the OID around.
2020-12-08	extsearchidx: remove needless SHA-1 check
	There is no need to verify checksums of data already stored in git. Doing this ourselves also limits flexibility in moving to other hashes.
2020-12-08	overidx: wrap eidx_key => ibx_id mapping
	This makes things a little less noisy and will be called by ExtSearchIdx.
2020-12-08	over: gracefully show invalid ibx_id
	While "public-inbox-extindex --gc" invocations try to ensure proper ordering, it is still possible for users to change the `inboxes' tables via sqlite3(1) or similar means. So show a "missing://ibx_id=$ibx_id" placeholder to avoid undefined variable warnings. URLs such as "imaps://..." will eventually be supported as eidx_keys, so having a URL-like "missing://" as a placeholder probably makes sense.
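	A placeholder lookup of this kind might look like the following Python sketch. The two-column `inboxes` schema and the helper name `eidx_key_for` are assumptions made for illustration; only the table name, the `ibx_id` and `eidx_key` names, and the "missing://ibx_id=..." format come from the commit message.

```python
import sqlite3


def eidx_key_for(con, ibx_id):
    """Map ibx_id to its eidx_key, or a URL-like placeholder if the row is gone."""
    row = con.execute(
        "SELECT eidx_key FROM inboxes WHERE ibx_id = ?", (ibx_id,)
    ).fetchone()
    return row[0] if row else f"missing://ibx_id={ibx_id}"


con = sqlite3.connect(":memory:")
# Hypothetical minimal schema, not the real over.sqlite3 layout.
con.execute("CREATE TABLE inboxes (ibx_id INTEGER PRIMARY KEY, eidx_key TEXT)")
con.execute("INSERT INTO inboxes VALUES (1, '/path/to/inbox')")

known = eidx_key_for(con, 1)
stale = eidx_key_for(con, 2)
```

	Returning a string in both cases lets callers format output without checking for an undefined value first.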
2020-12-07	overidx: {num} column is INTEGER PRIMARY KEY
	INTEGER PRIMARY KEY can be an alias for ROWID in SQLite and is already unique, so there's no need for a separate UNIQUE(num) index. With a smallish ~3K, freshly indexed v2 inbox, this results in a ~40K space savings, reducing over.sqlite3 from 1.375M to 1.335M (post-VACUUM). This only affects newly-indexed inboxes; existing DBs will require manual intervention to take advantage of space savings. Link: https://www.sqlite.org/rowidtable.html
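	The rowid aliasing is easy to observe with Python's bundled `sqlite3` module. The table below is a hypothetical stand-in, not the real over.sqlite3 schema; it only shows that an INTEGER PRIMARY KEY column shares its value with `rowid`, needs no separate index, and still rejects duplicates.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical table: `num` is declared INTEGER PRIMARY KEY, so it aliases rowid.
con.execute("CREATE TABLE over_demo (num INTEGER PRIMARY KEY, ddd BLOB)")
con.execute("INSERT INTO over_demo (num, ddd) VALUES (42, x'00')")

# rowid and num are the same stored value.
row = con.execute("SELECT rowid, num FROM over_demo").fetchone()

# No automatic index is created for a rowid alias.
indexes = con.execute("PRAGMA index_list('over_demo')").fetchall()

# Uniqueness is still enforced by the table's own B-tree key.
try:
    con.execute("INSERT INTO over_demo (num, ddd) VALUES (42, x'01')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

	Since uniqueness already comes from the table key, an extra UNIQUE(num) constraint adds no guarantee, which is why the commit drops it.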
2020-12-05	imap: support isearch and reduce Xapian queries
	Since IMAP search (either with Isearch or traditional per-Inbox search) only returns UIDs, we can safely set the limit to the UID slice size(*). With isearch, we can also trust the Xapian result to fit any docid range we specify. Limiting Xapian results to 1000 was making ->ALL docid <=> per-Inbox UID mapping impossible since results could overlap between ranges unpredictably. Finally, we can map the ->ALL docids into per-Inbox UIDs and show them to the client in the UID order of the Inbox, not the docid order of the ->ALL extindex. This also lets us get rid of the "uid:" query parser prefix and use the Xapian::Query API directly to reduce our search prefix footprint. For mbox.gz downloads in WWW, we'll also make a best effort to preserve the order from the Inbox, not the order of extindex; though it's possible large result sets can have non-overlapping windows. (*) by definition, UID slice size is a "safe" value which shouldn't OOM either the server or clients.
2020-12-05	isearch: emulate per-inbox search with ->ALL
	Using the "eidx_key:" boolean prefix to limit results to a given inbox, we can use ->ALL to emulate and replace per-Inbox xap15/[0-9] search indices. With this change, the presence of "extindex.all.topdir" in the $PI_CONFIG will cause the WWW code to use that extindex and ignore per-inbox Xapian DBs in xap15/[0-9]. Unfortunately IMAP search still requires old per-inbox indices, for now. Mapping extindex Xapian docids to per-Inbox UIDs and vice-versa is proving tricky. Fortunately, IMAP search is rarely used and optional. The RFCs don't specify expensive phrase search, either, so `indexlevel=medium' can be used in per-inbox Xapian indices to save space. For primarily WWW (and future JMAP) users, this should result in significant disk space, FD, and page cache footprint savings for large instances with many inboxes and many cross-posted messages.
2020-12-05	inbox: simplify ->search and callers
	Stop leaking WWW/PSGI-specific logic into classes like PublicInbox::Inbox, which is used universally. We'll also decouple $ibx->over from $ibx->search and just deal with duplicating the code inside ->over to reduce argument complexity in ->search. This is also a step in moving away from using {psgi.errors} to ease code sharing between IMAP, NNTP, and command-line interfaces. Perl's built-in `warn' and `local $SIG{__WARN__}' provide all the flexibility we need to control warning output and should be universally understood by Perl hackers who may be unfamiliar with PSGI.
2020-12-05	extmsg: use ->ALL for "global" MID lookups
	As with NewsWWW and NNTP, we can use ->ALL to completely avoid trying SQLite/Xapian lookups across hundreds/thousands of inboxes.