path: root/lib
Date Commit message
2020-12-19 search: simplify initialization, add ->xdb_shards_flat
This reduces differences between v1 and v2 code, and introduces ->xdb_shards_flat to provide read-only access to shards without using Xapian::MultiDatabase. This will allow us to combine shards of several inboxes AND extindexes for lei.
2020-12-19 lei_store: simplify git_epoch_max, slightly
This follows how we detect the max epoch for v2 and shard count in Xapian.
2020-12-19 lei: support `daemon-env' for modifying long-lived env
While lei(1) socket connections can set environment variables for its running context, it may not completely remove some of them. The background daemon just inherits whatever env the client spawning it had. This command ensures the persistent env can be modified as needed. Similar to env(1), this supports "-u", "-" (--clear), and "-0"/"-z" switches. It may be useful to unset or change or even completely clear the environment independently of what a socket client feeds us. "-i" is omitted since "--ignore-environment" seems like a bad name for a persistent daemon as opposed to a one-shot command. "-" and --clear (like clearenv(3)) will completely clobber the environment. "Lonesome dash" support is added to our option/help parsing for the "-" shortcut to "--clear". Getopt::Long doesn't seem to support specs like "clear|" or "stdin|", but only "", so we do a little pre/post-processing to merge the cases.
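A minimal sketch of the env(1)-style handling described above, not lei's actual code. The helper name `daemon_env_apply` and its NAME=VALUE argument convention are hypothetical, and the "-0"/"-z" switches are left out. The lonesome dash is rewritten to "--clear" before Getopt::Long sees the arguments:

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper: apply env(1)-style edits to a hash of
# environment variables and return the same hash reference.
sub daemon_env_apply {
	my ($env, @argv) = @_;
	# pre-process: a lonesome "-" means "--clear"; stop at "--"
	for (@argv) {
		last if $_ eq '--';
		$_ = '--clear' if $_ eq '-';
	}
	my ($clear, @unset);
	GetOptionsFromArray(\@argv,
			'clear' => \$clear,
			'unset|u=s' => \@unset) or die "bad options\n";
	%$env = () if $clear; # clobber everything, like clearenv(3)
	delete @$env{@unset}; # -u NAME may be given several times
	for (@argv) { # whatever remains must be NAME=VALUE
		my ($k, $v) = split(/=/, $_, 2);
		die "expected NAME=VALUE, got `$_'\n" unless defined $v;
		$env->{$k} = $v;
	}
	$env;
}
```

Operating on a plain hash rather than %ENV keeps the sketch easy to test; a daemon would presumably pass \%ENV.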
2020-12-19 lei: ensure we run a restrictive umask
While we configure the LeiStore git repos and DBs to have a restrictive umask, lei may also write to Maildirs/mboxes/etc. We will follow mutt behavior when saving files/messages to the FS. We only want to create files which are only readable by the local user since this is intended for private mail and could be used on shared systems. We may allow passing the umask on a per-command-basis, but it's probably not worth the effort to support.
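As a rough illustration of the effect, and not the code from this commit: with a umask of 077, newly created files end up readable and writable by the owner only. The helper name `write_private` is hypothetical, and the exact mask value is an assumption; the commit message only says files should be readable by the local user alone.

```perl
use strict;
use warnings;

# Hypothetical helper: write $data to $path under a restrictive umask
# and return the resulting permission bits.
sub write_private {
	my ($path, $data) = @_;
	my $old = umask(077); # assumed mask: no group/other access
	open(my $fh, '>', $path) or die "open $path: $!";
	print $fh $data or die "print $path: $!";
	close($fh) or die "close $path: $!";
	umask($old); # restored here only to keep the demo side-effect free
	(stat($path))[2] & 0777;
}
```

On a POSIX system the returned mode should be 0600. A long-lived process would more likely set the umask once at startup instead of around each write.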
2020-12-19 t/lei-oneshot: standalone oneshot (non-socket) test
We can use the same "local $ENV{FOO}" hack we do with t/nntpd-v2.t to test the oneshot code path without imposing an extra script in the users' $PATH.
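The "local $ENV{FOO}" technique can be sketched as follows. The helper `with_env` and the variable name MY_ONESHOT_FLAG are hypothetical and not taken from the test suite:

```perl
use strict;
use warnings;

# Hypothetical helper: run $cb with $ENV{$name} temporarily set.
sub with_env {
	my ($name, $value, $cb) = @_;
	# `local' saves the old value (or its absence) and restores it
	# automatically when this sub returns or dies
	local $ENV{$name} = $value;
	$cb->();
}

# the callback (and any process it spawns) sees the override
my $seen = with_env('MY_ONESHOT_FLAG', '1', sub { $ENV{MY_ONESHOT_FLAG} });
```

Because the override is scoped, a test can select a code path this way without installing a separate wrapper script.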
2020-12-19 lei: refine help/option parsing, implement "init"
There's a bunch of work in here as the foundations are being fleshed out. One of the UI/UX goals is to make it easy to keep built-in help and shell completions consistent.
2020-12-19 lei: use spawn (vfork + execve) for lazy start
This allows us to rely on FD_CLOEXEC being set on pipes from prove(1), so forgetting `daemon-stop' won't cause tests to hang. Unfortunately, daemon tests will be slower with this.
2020-12-19 tests: more common JSON module loading
We'll probably be using JSON more in the future, so make it easier to require in tests.
2020-12-19 lei_store: local storage for Local Email Interface
Still unstable, this builds off the equally unstable extindex :P This will be used for caching/memoization of traditional mail stores (IMAP, Maildir, etc) while providing indexing via Xapian, along with compression, and checksumming from git. Most notably, this adds the ability to add/remove per-message keywords (draft, seen, flagged, answered) as described in the JMAP specification (RFC 8621 section 4.1.1). We'll use `.' (a single period) as an $eidx_key since it's an invalid {inboxdir} or {newsgroup} name.
2020-12-19 lei: proposed command-listing and options
In an attempt to ensure a coherent UI/UX, we'll try to document all proposed commands and options in one place for easy reference.
2020-12-19 lei: FD-passing and IPC basics
The start of lei, a Local Email Interface. It'll support a daemon via FD passing to avoid startup time penalties if IO::FDPass is installed, but fall back to a slow one-shot mode if not. Compared to a traditional socket daemon, FD passing should allow us to eventually do stuff like run "git show" and still have proper terminal support for pager and color.
2020-12-18 extsearchidx: improve missing machine-id fallback
It's likely most GNU/Linux systems have /etc/machine-id these days, so anything missing it is likely a *BSD, most of which support and favor "sysctl -n kern.hostid". We'll also support "ghostid" since GNU utils are commonly prefixed with 'g' on non-GNU platforms. In any case, we'll suppress stderr from missing commands and fall back to hard coding an $OSNAME-based identifier as a last resort and hope the hostname is unique.
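The fallback order described above could look roughly like this. It is a sketch under assumptions: the helper name `host_id` is hypothetical, the file and command list are passed in by the caller, and stderr is silenced through the shell rather than however the real code does it.

```perl
use strict;
use warnings;
use Sys::Hostname qw(hostname);

# Hypothetical helper: return the first usable identifier from
# $file, then from each command in @cmds, then an OS-name fallback.
sub host_id {
	my ($file, @cmds) = @_;
	if (open(my $fh, '<', $file)) {
		my $id = <$fh>;
		if (defined $id) {
			chomp $id;
			return $id if $id ne '';
		}
	}
	for my $cmd (@cmds) {
		# missing commands fail quietly; we just try the next one
		my $out = `$cmd 2>/dev/null`;
		next if $? != 0 || !defined($out);
		chomp $out;
		return $out if $out ne '';
	}
	# last resort: hope the hostname is unique
	$^O . '-' . hostname();
}

# e.g. host_id('/etc/machine-id', 'sysctl -n kern.hostid', 'ghostid')
```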
2020-12-18 nntpd: skip inboxes w/o {newsgroup}
So we don't trigger an uninitialized variable warning :x
2020-12-18 import: drop X-Status in addition to Status
It's actually supported by mutt, dovecot[1], and likely some other software to augment the Status: header. While dovecot doesn't expose X-Status to clients, mutt will write 'A' (answered) and 'F' to X-Status (but not T (draft)). So we'll drop it like we do Status since it's not suitable for public mail, but sticking it in an @UNWANTED_HEADERS array will allow us to configure an override if needed. [1] https://doc.dovecot.org/configuration_manual/mail_location/mbox/
2020-12-17 extsearchidx: no need to make InboxWritable
extindex treats v1/v2 public inboxes as read-only, so there's no need to scare people by using the InboxWritable package now that ->git_dir_n is gone and we can use ->max_git_epoch instead of ->git_dir_latest.
2020-12-17 inboxwritable: drop git_dir_n sub
There's only one caller, unlikely to be any more, and should be harmless to open code.
2020-12-17 inbox: simplify v2 epoch counting
Perl readdir detects list context and can return an array suitable for the grep op. From there, we can rely on substr to remove the ".git" suffix and integerize the value to save a few bytes before letting List::Util::max return the value. This is how we detect Xapian shards nowadays, too, and we'll also use defined-or (//) to simplify the return value there. We'll also simplify InboxWritable->git_dir_latest, remove some callers, and consider removing it entirely.
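The readdir/grep/substr/max pipeline described above can be sketched like this. The sub name and the assumption that epochs live directly under the given directory as "0.git", "1.git", and so on are illustrative, not copied from the commit:

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical helper: return the highest N among "N.git" entries in
# $dir, or undef if the directory is missing or has no epochs.
sub max_git_epoch {
	my ($dir) = @_;
	opendir(my $dh, $dir) or return;
	# readdir in list context returns every entry at once, so the
	# result can be fed straight to grep
	my @epochs = map { substr($_, 0, -4) + 0 } # drop ".git", integerize
		grep(/\A[0-9]+\.git\z/, readdir($dh));
	max(@epochs); # undef for an empty list
}
```

A caller can then use defined-or to pick a default, for example `max_git_epoch($dir) // -1` when no epoch exists yet.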
2020-12-17 index: ignore some warnings, set {current_info} for v1
-index runs on data that's already frozen in git, so there's no point in warning users about it. While we're at it, set the {current_info} prefix for v1 as we do in v2 inboxes in case new problems show up.
2020-12-17 inboxwritable: warn_ignore: "Bad UTF7 data escape"
As with the other messages in this callback, there's nothing we can do about invalid messages ending up in our Maildirs for -watch.
2020-12-17 extsearchidx: lock eidxq on full --reindex
Incremental indexing can use the `eidxq' reindexing queue for handling deletes and resuming interrupted indexing. Ensure those incremental -extindex invocations do not steal (and prematurely perform) work that an "-extindex --reindex" invocation is handling.
2020-12-17 searchidxshard: simplify newline elimination
This overdue change fixes {current_info} to not inject a newline into every warning message. Simpler code helps us avoid bugs and the need to make fixes like commit 44de182766037948d62bc2a8ba924de2264dd5fc ("searchidxshard: chomp $eidx_key from pipe").
2020-12-17 extsearchidx: reindex releases over.sqlite3 handles properly
When checkpointing and yielding the lock to other processes, we need to ensure any open DB statement handles are closed, since they reference and prevent DB FDs from being closed and unlocked. And clean up some progress reporting while we're at it.
2020-12-17 extsearchidx: simplify reindex code paths
Since we're inside a Xapian transaction, calling ->index_raw followed by ->shard_add_eidx_info calls on the same docid doesn't seem to hurt indexing performance. It definitely reduces FS read traffic and IPC from git at the cost of some more IPC between the parent and workers. Nevertheless, the code and FD reductions seem worth it.
2020-12-17 extsearchidx: checkpoint releases locks
--reindex can take many hours or days, so ensure we release locks according to --batch-size, letting automated fetch+index jobs write new data to indices while we update old data.
2020-12-17 extsearchidx: reindex works on Xapian, too
Instead of just working on over.sqlite3, we need to work on the Xapian DBs as well. While no changes to our Xapian use have taken place recently, they could in the future and --reindex exists to account for that.
2020-12-17 extindex: support --rethread and content bifurcation
--rethread is useful for dealing with bugs and behaves just like it does with current inboxes. This is in case our content deduplication logic changes for whatever reason and causes previously merged messages to be considered "different". As with v2, this won't allow us to merge messages in a way that allows deduplicating messages which were previously considered different, but v2 inboxes do not allow that, either. In other words, this makes the --reindex and --rethread switches of -extindex match the behavior of v2 -index.
2020-12-17 over: sort xref3 by xnum if ibx_id repeats
While unlikely to happen, it may be possible for messages from the same inbox to get indexed multiple times. Provide consistent results in this case for ease-of-testing.
2020-12-17 extindex: delete stale messages from over.sqlite3
In addition to removing stale messages from Xapian, we must also remove them from over.sqlite3.
2020-12-17 extindex: preliminary --reindex support
--reindex allows us to catch missed and stale messages due to -extindex vs -index races prior to commit 02b2fcc46f364b51 ("extsearchidx: enforce -index before -extindex"). We'll also rely on reindex to internally deal with v1/v2 inbox removals and partial-unindexing of messages which are only removed from one inbox out of many. This reindex design is completely different than how normal v1/v2 inbox reindex operates due to extindex having multiple histories to work with. Instead of scanning git history, this relies exclusively on comparing over.sqlite3 contents between the v1/v2 inboxes and the extindex. Changes to Xapian behavior also get picked up, now. Xapian indexing is handled by workers with minimal IPC to the parent process. This results in more read I/O but fewer writes when dealing with cross-posted messages. Changes to $smsg->populate and --rethread still need further work.
2020-12-17 inbox: ->uidvalidity returns undef w/o ->mm
While totally unindexed inboxes are rare, we still support them for v1 and may hit code which calls this method. Just return `undef' when ->mm access fails.
2020-12-17 imap: rename parse_query => parse_imap_query
Avoid confusing hackers since this conflicts with a method name provided by (Search::)Xapian::QueryParser.
2020-12-16 daemon: simplify fork() failure checks
The defined-or `//' operator in 5.10 allows us to golf down our code slightly.
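A small sketch of the idiom, assuming nothing about the daemon code itself (the helper name `run_child` is hypothetical). fork returns undef on failure, 0 in the child, and the child's PID in the parent, so defined-or separates the failure case from the child's legitimate 0:

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper: run $cb in a child process and return the
# exit code the child reported.
sub run_child {
	my ($cb) = @_;
	# `||' would wrongly die in the child, where fork returns 0;
	# `//' only dies when fork returns undef
	my $pid = fork() // die "fork: $!";
	if ($pid == 0) {
		POSIX::_exit($cb->()); # skip END blocks and stdio flushing
	}
	waitpid($pid, 0);
	$? >> 8;
}
```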
2020-12-16 daemon: support --daemonize without Net::Server::Daemonize
We don't actually need Net::Server::Daemonize to support the --daemonize flag, since the daemonize() sub provided by N::S::D doesn't exactly do the things we want.
2020-12-14 PublicInbox::Feed owns `feedmax' default value
There's no need to have extra code in the Inbox package for this or to waste dozens of bytes for every Inbox object which uses the default value. This makes our code more flexible w.r.t Inbox-like ExtSearch objects and fixes uninitialized value warnings with ->ALL.
2020-12-11 nntp+www: drop List-* and Archived-At headers
These headers can conflict with headers in the DKIM signature; and parsing the DKIM-Signature header to determine whether or not we can safely add a header would be more code and CPU cycles. Since IMAP seems fine without these headers (and JMAP will likely be, too), there's likely no need to continue appending these to every message. Nowadays, developers seem sufficiently trained to use URLs with Message-IDs in them. So drop the headers and save some cycles and bandwidth all around.
2020-12-11 extmsg: avoid exceptions when /all/$MSGID/ fails
If a message can't be found in ->ALL, we shouldn't attempt to enter code paths which iterate normal inboxes or attempt to access non-existent fields (e.g. {name}, {newsgroup}, {inboxdir}) in the ExtSearch object.
2020-12-10 manifest: account for future cache in MiscIdx docdata
We'll be storing private data inside the "" (empty string) key of the JSON doc we store for manifest.js.gz generation. This private data will allow us to reduce FS activity at startup and speed up startup times, but some will also be in Xapian boolean terms and values for searching and filtering.
2020-12-10 extsearchidx: enforce -index before -extindex
We cannot set xref3 data without the `xnum' column to tie it to the per-inbox over.sqlite3 DB. So ensure we don't read brand-new history that only exists in git, but instead rely on last_commit and last_xap15-$EPOCH metadata in msgmap to decide how far we can index. Before this change, it was possible to miss messages in the extindex if -index did not run (which will be fixable by upcoming --reindex support in -extindex).
2020-12-10 searchidx: all indexers check for bad blobs
This should help us detect bugs in our code or storage synchronization problems more easily. This probably won't detect corrupted git storage, but can detect corrupted SQLite files. "Bad blobs, bad blobs, whatcha gonna do when they come for you?"
2020-12-10 www+nntp: deal with lack of addresses for ->ALL
Since extindex is an amalgamation of several inboxes, discerning an appropriate address for List-Post: would be expensive and most likely unnecessary. Some legacy/historical inboxes may have no active address, either, so don't attempt to set the List-Post header if no addresses are configured.
2020-12-09 extsearchidx: ck_existing: set $OID for warning context
The content_hash() hash in the same scope may trigger warnings for a given blob, so ensure we correctly report the blob where it happens.
2020-12-09 admin: resolve_repo_dir => resolve_inboxdir
We've stopped referring to inboxdirs as "repos" a while ago since v2 inboxes have multiple git repos associated with them. So update the name to reflect that and avoid an unnecessary export that's only used by a test case.
2020-12-09 extindex: do not use current dir like -index does
At least not for resolving inboxes, since there's no good way for a user to specify what is an inbox or extindex directory without a command-line switch. Instead of changing the -extindex command, we change the -index command internals to rely on the new {-use_cwd} flag to avoid internal use of negation, since double-negatives and the like are confusing to me.
2020-12-09 rename {pi_config} fields to {pi_cfg}
{pi_config} may be confused with the documented `PI_CONFIG' environment variable, and we'll favor vowel-removal to be consistent with our usage of object references. The `pi_' prefix may stay in some places, for now; since a separate namespace may come into this codebase for local/private client-tooling. For InboxIdle, we'll also remove an invalid comment about holding a reference to the PublicInbox::Config object, too.
2020-12-09 nntp: replace {ng} with {ibx} for consistency
They're PublicInbox::Inbox objects just like the rest of the non-NNTP code. So rename the NNTP code for consistency with the rest of the codebase. Furthermore, {ng} and $ng may be confused with the `--ng' switch for -init, and that's a non-ref scalar string.
2020-12-09 treewide: replace {-inbox} with {ibx} for consistency
{ibx} is shorter and is the most prevalent abbreviation in indexing and IMAP code, and the `$ibx' local variable is already prevalent throughout. In general, the codebase favors removal of vowels in variable and field names to denote references (because references are "lighter" than non-references). So update WWW and Filter users to use the same code since it reduces confusion and may allow easier code sharing.
2020-12-09 search: reinstate "uid:" internal search prefix
User-supplied queries (via PublicInbox::IMAPsearchqp) may restrict messages to certain UID ranges in addition to the limits we impose ourselves for mailbox slices. So we'll continue to ask Xapian::QueryParser to handle "uid:" numeric ranges. Fixes: 4b551c884a648b45 ("imap: support isearch and reduce Xapian queries")
2020-12-08 shard_add_eidx_info: pass $eidx_key instead of $ibx object
This improves consistency with sibling methods such as ->shard_remove_eidx_info and ->add_xref3. Passing the $eidx_key scalar is preferable to the entire $ibx object for IPC-friendliness.
2020-12-08 searchidx: remove $oid parameter from most calls
Xapian docids have been tied to the over {num} column for nearly 3 years, now; and OIDs are no longer stored in Xapian document data. There's no need to increase code and IPC complexity by passing the OID around.
2020-12-08 extsearchidx: remove needless SHA-1 check
There is no need to verify checksums of data already stored in git. Doing this ourselves also limits flexibility in moving to other hashes.