path: root/t
Date Commit message
2021-01-01 ipc: support Sereal
Some testing will be needed to see if it's worth the code and maintenance overhead, but it seems easy-enough to get working.
2021-01-01 ipc: generic IPC dispatch based on Storable
I intend to use this with LeiStore when importing from multiple slow sources at once (e.g. curl, IMAP, etc). This is because over.sqlite3 can only have a single writer, and we'll have several slow readers running in parallel. Watch and SearchIdxShard should also be able to use this code in the future, but this will be proven with LeiStore, first.
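A minimal sketch of Storable-based dispatch, assuming a length-prefixed frame carrying [ $method, @args ]. The handler names (add, upcase) and the framing are hypothetical and not taken from the commit; in practice the frame would travel over a pipe or socketpair to the single writer process.

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);

# Hypothetical worker-side handlers; not the project's real method names.
my %dispatch = (
    add    => sub { my ($x, $y) = @_; $x + $y },
    upcase => sub { uc $_[0] },
);

# Serialize a request as [ $method, @args ] with a 4-byte length prefix.
sub encode_req {
    my ($method, @args) = @_;
    my $payload = freeze([ $method, @args ]);
    pack('N', length($payload)) . $payload;
}

# Decode one complete frame and call the matching handler.
sub handle_frame {
    my ($frame) = @_;
    my $len = unpack('N', substr($frame, 0, 4));
    my ($method, @args) = @{ thaw(substr($frame, 4, $len)) };
    my $cb = $dispatch{$method} or die "unknown method: $method\n";
    $cb->(@args);
}
```

Several slow readers could each send such frames while only the receiving process touches the SQLite file.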
2021-01-01 lei_to_mail: support Maildir, fix+test --augment
Maildir should be plenty fine for short-lived output folders.
2021-01-01 lei_to_mail: support for non-seekable outputs
Users may wish to pipe output to "git am", "spamc", or similar, so we need to support those cases and not bail out on lseek(2) or ftruncate(2) failures.
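One way to tolerate non-seekable outputs is to probe with a no-op seek first. This is a sketch under that assumption; is_seekable and prepare_output are hypothetical helper names, not functions from the commit.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET SEEK_CUR);

# True if $fh supports lseek(2); pipes and sockets fail with ESPIPE.
sub is_seekable {
    my ($fh) = @_;
    defined(sysseek($fh, 0, SEEK_CUR));
}

# Truncate and rewind only when the output allows it; otherwise stream.
sub prepare_output {
    my ($fh) = @_;
    return 'stream' unless is_seekable($fh);
    truncate($fh, 0) or die "truncate: $!";
    defined(sysseek($fh, 0, SEEK_SET)) or die "seek: $!";
    'file';
}
```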
2021-01-01 lei: implement various deduplication strategies
For writing mboxes and Maildirs, users may wish to use stricter or looser deduplication strategies. This gives them more control.
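A sketch of how selectable strategies might look. The strategy names (mid, content, none) and the message shape (a hashref with {mid} and {body}) are assumptions for illustration only.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha256_hex);

# Hypothetical strategy table: each entry maps a message to the key
# used to detect duplicates.
my %DEDUPE_KEY = (
    mid     => sub { $_[0]->{mid} },                # looser: Message-ID only
    content => sub { sha256_hex($_[0]->{body}) },   # stricter: hash of the body
    none    => sub { undef },                       # never a duplicate
);

# Returns a closure which is true the first time a message is seen.
sub make_dedupe {
    my ($strategy) = @_;
    my $key_for = $DEDUPE_KEY{$strategy}
        or die "unknown strategy: $strategy\n";
    my %seen;
    sub {
        my ($msg) = @_;
        my $k = $key_for->($msg);
        return 1 unless defined $k;
        !$seen{$k}++;
    };
}
```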
2021-01-01 lei_to_mail: start --augment, dedupe, bz2 and xz
--augment will match the mairix(1) option of the same name to augment existing search results. We'll need to implement deduplication for a better user experience. mutt ships with compressed mbox support for bz2 and xz, at least, so we'll support those out-of-the-box.
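Compressed outputs could be selected by filename suffix, roughly as below. The table and the compressor_for name are hypothetical; only the tool names come from this log.

```perl
use strict;
use warnings;

# Hypothetical suffix-to-command table. "-c" makes each tool write
# its result to stdout.
my %COMPRESSOR = (
    gz  => [ qw(gzip -c) ],
    bz2 => [ qw(bzip2 -c) ],
    xz  => [ qw(xz -c) ],
);

# Returns an argv array ref for compressed outputs, undef for plain mbox.
sub compressor_for {
    my ($path) = @_;
    $path =~ /\.(gz|bz2|xz)\z/ ? $COMPRESSOR{$1} : undef;
}
```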
2021-01-01 mboxreader: new class for reading various mbox formats
This is only lightly-tested against stuff LeiToMail generates and will need real-world tests to validate.
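For illustration, one of the mbox variants (mboxrd) can be read as sketched here. This assumes "From " separator lines and ">From " escaping in bodies; mboxrd_messages is a hypothetical name and operates on an in-memory string rather than a stream.

```perl
use strict;
use warnings;

# Split an mboxrd-format string into raw messages.
sub mboxrd_messages {
    my ($mbox) = @_;
    my @msgs;
    # A new message starts at every line beginning with "From ".
    for my $chunk (split(/^(?=From )/m, $mbox)) {
        next if $chunk eq '';
        $chunk =~ s/\AFrom [^\n]*\n//;   # drop the separator line
        $chunk =~ s/^>(>*From )/$1/gm;   # undo one level of escaping
        push @msgs, $chunk;
    }
    @msgs;
}
```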
2021-01-01 lei_to_mail: start atomic and compressed mbox writing
We'll allow using multiple workers to write to a single mbox (which could be compressed). This can be done safely with O_APPEND + syswrite for uncompressed files, and by using a lock when piping to pigz/gzip/bzip2/xz.
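The uncompressed case can be sketched as follows, assuming each message fits in one write(2) call on a local filesystem. append_record and parallel_append are hypothetical names.

```perl
use strict;
use warnings;
use Fcntl qw(O_WRONLY O_APPEND O_CREAT);
use POSIX ();

# Append one complete record with a single write(2) call. With O_APPEND
# the kernel positions every write at the current end of file.
sub append_record {
    my ($path, $buf) = @_;
    sysopen(my $fh, $path, O_WRONLY|O_APPEND|O_CREAT, 0600)
        or die "open $path: $!";
    my $w = syswrite($fh, $buf);
    defined($w) && $w == length($buf) or die "short or failed write: $!";
    close($fh) or die "close: $!";
}

# Fork $n workers that each append $per_worker records to the same file.
sub parallel_append {
    my ($path, $n, $per_worker) = @_;
    my @pids;
    for my $i (1..$n) {
        my $pid = fork();
        defined($pid) or die "fork: $!";
        if ($pid == 0) {
            append_record($path, "From worker$i\n\nmsg $_\n\n")
                for 1..$per_worker;
            POSIX::_exit(0); # skip END blocks inherited from the parent
        }
        push @pids, $pid;
    }
    waitpid($_, 0) for @pids;
}
```

The compressed case differs because the compressor reads from a pipe, where a lock is needed to keep each message contiguous.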
2021-01-01 sharedkv: split out index_values
In most cases, we won't need to index by value, so don't waste cycles or space on it.
2021-01-01 sharedkv: fork()-friendly key-value store
This is intended for maintaining Maildir states and mbox message deduplication, but may be useful for other purposes...
2021-01-01 lei_to_mail: initial implementation for writing mbox formats
No Maildir support yet, but it'll come.
2020-12-31 Merge remote-tracking branch 'origin/master' into lorelei
* origin/master: (58 commits)
  ds: flatten + reuse @events, epoll_wait style fixes
  ds: simplify EventLoop implementation
  check defined return value for localized slurp errors
  import: check for git->qx errors, clearer return values
  git: qx: avoid extra "local" for scalar context case
  search: remove {mset} option for ->mset method
  search: remove pointless {relevance} setting
  miscsearch: take reopen from Search and use it
  extsearch: unconditionally reopen on access
  extindex: allow using --all without EXTINDEX_DIR
  extindex: add undocumented --no-scan switch
  extindex: enable autoflush on STDOUT/STDERR
  extindex: various --watch signal handling fixes
  extindex: --watch for inotify-based updates
  eml: fix undefined vars on <Perl 5.28
  t/config: test --get-urlmatch for git <2.26
  default to CORE::warn in $SIG{__WARN__} handlers
  inbox: name variable for values loop iterator
  inboxidle: avoid needless syscalls on refresh
  inboxidle: clue users into resolving ENOSPC from inotify
  ...
2020-12-31 lei_xsearch: cross-(inbox|extindex) search
While a single extindex combines multiple inboxes into a single search index, extindex still requires up-front indexing on items which can be searched. XSearch has no on-disk footprint itself and uses Xapian DBs of existing publicinbox and extindex ("extinbox") exclusively. XSearch still suffers from the multi-shard Xapian scalability problems which led to the creation of extindex, but I expect the number of shards to remain relatively low. I envision users hosting public-inbox instances on their workstations will only have two extindex combined by this, one read-only extindex for serving public archives, and one read-write extindex managed by LeiStore for private mail.
2020-12-28 ds: flatten + reuse @events, epoll_wait style fixes
Consistently returning the equivalent of pollfd.revents in a portable manner was never worth the effort for us, as we use the same ->event_step callback regardless of POLLIN/POLLOUT/POLLHUP. Being a Perl array, @events knows its size and we don't have to return a maximum index for the caller to iterate on. We can also avoid redundant integer coercion ("+0") since we ensure everything is an IV in other places. Finally, vec() is preferable to ("\0" x $size) for resizing buffers because it only needs to write the extended portion and not overwrite the entire buffer.
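The vec() resizing idiom can be sketched like this; grow_buf is a hypothetical helper, not a function from the commit. Assigning to vec() past the end of a string extends it with NUL bytes and leaves the existing bytes alone.

```perl
use strict;
use warnings;

# Grow $$bufref to at least $size bytes without rewriting existing bytes.
# Expects $$bufref to be a defined string.
sub grow_buf {
    my ($bufref, $size) = @_;
    vec($$bufref, $size - 1, 8) = 0 if length($$bufref) < $size;
    length($$bufref);
}
```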
2020-12-28 import: check for git->qx errors, clearer return values
Those git commands can fail and git->qx will set $? when it fails. There's no need for the extra indirection of the @ret array, either. Improve git->qx coverage to check for $? while we're at it.
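A generic sketch of checking $? after capturing command output. run_qx is a hypothetical stand-in and not the git->qx method itself.

```perl
use strict;
use warnings;

# Run a command, capture its stdout, and die if it exited non-zero.
sub run_qx {
    my (@cmd) = @_;
    open(my $fh, '-|', @cmd) or die "exec @cmd: $!";
    my $out = do { local $/; <$fh> };
    close($fh); # closing a pipe waits for the child and sets $?
    die "@cmd failed: exit=" . ($? >> 8) . "\n" if $?;
    $out;
}
```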
2020-12-28 git: qx: avoid extra "local" for scalar context case
We can use the ternary operator to avoid an early return here.
2020-12-26 t/config: test --get-urlmatch for git <2.26
While git 1.8.5 learned --get-urlmatch, git did not learn to match URLs against wildcards until 2.26. So only depend on 1.8.5 for this test since 2.26 is too new.
Reported-by: Ali Alnubani <alialnu@nvidia.com>
Link: https://public-inbox.org/meta/DM6PR12MB49106F8E3BD697B63B943A22DADB0@DM6PR12MB4910.namprd12.prod.outlook.com/
Tested-by: Ali Alnubani <alialnu@nvidia.com>
2020-12-26 inboxidle: avoid needless syscalls on refresh
We don't have to replace a bunch of existing watches with identical new ones. On Linux with Linux::Inotify2 installed, this avoids a storm of inotify_add_watch(2) and inotify_rm_watch(2) syscalls on SIGHUP with -imapd and "-extindex --watch".
2020-12-23 miscsearch: index UIDVALIDITY, use as startup cache
This brings -nntpd startup time down from ~35s to ~5s with 50K inboxes. Further improvements ought to be possible with deeper changes to MiscIdx, since -mda having to load every inbox seems unreasonable; but this general change is fairly unintrusive.
2020-12-21 searchidx: rename get_val to int_val and return IV
Values can be strings in Xapian, although we currently use integer values exclusively. Give the wrapper a more appropriate name in case we start using string columns. For future-proofing, we'll now return `undef' on missing columns and coerce the return value to an IV (integer value) to save memory, as sortable_unserialise returns a PV (pointer value) scalar despite it existing to support numeric values.
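The return convention described above (undef for a missing column, an integer otherwise) might look like the sketch below. It uses a plain hashref of decimal strings as a stand-in for a Xapian document, and it omits the sortable_unserialise step, so it is not the real wrapper.

```perl
use strict;
use warnings;

# $slots is a hypothetical { column_number => string } stand-in.
sub int_val {
    my ($slots, $col) = @_;
    my $v = $slots->{$col};
    defined($v) ? int($v) : undef; # int() yields a numeric scalar
}
```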
2020-12-20 script/public-inbox-*: favor caller-provided pathnames
We'll try to avoid calling Cwd::abs_path and use File::Spec->rel2abs instead, since abs_path will resolve symlinks the user specified on the command-line. Unfortunately, ->rel2abs still leaves "/.." and "/../" uncollapsed, so we still need to fall back to Cwd::abs_path in those cases. While we are at it, we'll also resolve inboxdir from deep inside v2 directories instead of misdetecting them as v1 bare git repos. In any case, stop matching directories by name and instead rely on the unique combination of st_dev + st_ino on stat() as we started doing in the extindex code.
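Both ideas can be sketched with core modules. user_abs_path and same_inode are hypothetical helper names; the fallback condition is an approximation of "uncollapsed .. components".

```perl
use strict;
use warnings;
use Cwd ();
use File::Spec;

# Keep the user's spelling of a path (symlinks intact); fall back to
# Cwd::abs_path only when ".." components remain to be collapsed.
sub user_abs_path {
    my ($path) = @_;
    my $abs = File::Spec->rel2abs($path);
    return $abs unless $abs =~ m!/\.\.(?:/|\z)!;
    Cwd::abs_path($abs);
}

# Two pathnames refer to the same file if st_dev and st_ino both match.
sub same_inode {
    my ($x, $y) = @_;
    my @x = stat($x) or return;
    my @y = stat($y) or return;
    $x[0] == $y[0] && $x[1] == $y[1];
}
```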
2020-12-19 lei: extinbox: start implementing in config file
They need to be indexed by MiscIdx, but MiscIdx still needs more work to support faster config loading when dealing with ~100K data sources.
2020-12-19 lei: support for -$DIGIT and -$SIG CLI switches
I'm a bit spoiled by using single-dash digit options from common tools: ("git log -$DIGIT", "kill -9", "tail -1", ...), so we'll support it for limiting query results. But first, make it easier to send arbitrary signals to the daemon via "daemon-kill". "daemon-stop" is redundant, now, and removed, since the default for "daemon-kill" is SIGTERM to match kill(1) behavior.
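One way to accept a bare -$DIGIT switch is to rewrite it before handing the arguments to Getopt::Long. This is a sketch only: the --limit option name is hypothetical, and the rewrite deliberately ignores "--" terminators and option values for brevity.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Rewrite "-$DIGIT" into a long option, then parse normally.
sub parse_argv {
    my (@argv) = @_;
    my %opt;
    @argv = map { /\A-([0-9]+)\z/ ? "--limit=$1" : $_ } @argv;
    GetOptionsFromArray(\@argv, \%opt, 'limit|n=i', 'verbose|v')
        or die "bad command-line args\n";
    (\%opt, \@argv); # parsed options and the remaining arguments
}
```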
2020-12-19 lei: drop $SIG{__DIE__}, add oneshot fallbacks
We'll force stdout+stderr to be a pipe the spawning client controls, thus there's no need to lose error reporting by prematurely redirecting stdout+stderr to /dev/null. We can now rely exclusively on OnDestroy to write to syslog() on uncaught die failures. Also support falling back to oneshot mode on socket and cwd failures, since some commands may still be useful if the current working directory goes missing :P
2020-12-19 on_destroy: generic localized END
This is a localized version of the process-wide END{}, but runs at the end of variable scope. A subroutine ref and arguments may be passed, which allows us to avoid anonymous subs and problems they cause. It's similar to `defer' or `ensure' in other languages; Perl can rely on deterministic destructors due to refcounting.
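The idea can be illustrated with a tiny scope guard. My::ScopeGuard is a stand-in written for this sketch, not the project's module; it takes a code ref plus arguments, as the description above suggests.

```perl
use strict;
use warnings;

package My::ScopeGuard {
    # Store a code ref and the arguments to pass at destruction time.
    sub new {
        my ($class, $cb, @args) = @_;
        bless [ $cb, @args ], $class;
    }

    sub DESTROY {
        my ($cb, @args) = @{$_[0]};
        $cb->(@args);
    }
}

my @log;
sub note { push @log, @_ }

sub work {
    my $guard = My::ScopeGuard->new(\&note, 'cleanup');
    note('working');
    return;
    # $guard's refcount drops to zero on return, which runs note('cleanup')
}
work();
```

Passing \&note and its arguments avoids an anonymous sub closing over variables in the enclosing scope.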
2020-12-19 lei_store: keyword extraction from mbox and Maildir
Dovecot, mutt, and likely much other software support mbox Status/X-Status headers. Ensure we have a way to extract these headers as JMAP-compatible keywords before removing them for git storage. ->add_eml now accepts setting keywords at import time, and will probably be called like this:
    $lst->add_eml($eml, $lst->mbox_keywords($eml));
    $lst->add_eml($eml, $lst->maildir_keywords($fn));
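A sketch of the extraction step, assuming the conventional Maildir info flags (D, F, R, S) and mbox Status/X-Status letters (R, A, F, T). The helper names and signatures below are hypothetical and take plain strings instead of the $eml object used in the calls above.

```perl
use strict;
use warnings;

# Maildir info flags (after ":2,") mapped to keyword names.
my %MAILDIR_KW = (D => 'draft', F => 'flagged', R => 'answered', S => 'seen');
# mbox Status / X-Status letters mapped to keyword names.
my %MBOX_KW = (R => 'seen', A => 'answered', F => 'flagged', T => 'draft');

# Keywords from a Maildir filename such as "cur/123.host:2,FS".
sub kw_from_maildir_name {
    my ($fn) = @_;
    my ($flags) = ($fn =~ /:2,([A-Za-z]*)\z/) or return [];
    [ sort map { $MAILDIR_KW{$_} // () } split(//, $flags) ];
}

# Keywords from the values of the Status and X-Status headers.
sub kw_from_mbox_status {
    my ($status, $x_status) = @_;
    my %kw;
    for my $c (split(//, ($status // '') . ($x_status // ''))) {
        $kw{$MBOX_KW{$c}} = 1 if exists $MBOX_KW{$c};
    }
    [ sort keys %kw ];
}
```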
2020-12-19 lei: help: show actual paths being operated on
This allows us to respect XDG_* environment variables to override HOME. We'll also make the $lei wrapper easier-to-use by auto-clearing $out/$err and reducing [] needed for common cases.
2020-12-19 lei: support pass-through for `lei config'
This will be a handy wrapper for "git config" for manipulating ~/.config/lei/config. Since we'll have many commands, start breaking up t/lei.t into more distinct sections for ease-of-testing.
2020-12-19 rename LeiDaemon package to PublicInbox::LEI
"LEI" is an acronym, and ALL CAPS is consistent with existing PublicInbox::{IMAP,HTTP,NNTP,WWW} naming for top-level modules, 3 of 4 old ones which deal directly with sockets and requests.
2020-12-19 lei: support `daemon-env' for modifying long-lived env
While lei(1) socket connections can set environment variables for its running context, it may not completely remove some of them. The background daemon just inherits whatever env the client spawning it had. This command ensures the persistent env can be modified as needed. Similar to env(1), this supports "-u", "-" (--clear), and "-0"/"-z" switches. It may be useful to unset or change or even completely clear the environment independently of what a socket client feeds us. "-i" is omitted since "--ignore-environment" seems like a bad name for a persistent daemon as opposed to a one-shot command. "-" and --clear (like clearenv(3)) will completely clobber the environment. "Lonesome dash" support is added to our option/help parsing for the "-" shortcut to "--clear". Getopt::Long doesn't seem to support specs like "clear|" or "stdin|", but only "", so we do a little pre/post-processing to merge the cases.
2020-12-19 t/lei-oneshot: standalone oneshot (non-socket) test
We can use the same "local $ENV{FOO}" hack we do with t/nntpd-v2.t to test the oneshot code path without imposing an extra script in the users' $PATH.
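The general shape of the "local $ENV{FOO}" technique is sketched below. TEST_ONESHOT is a hypothetical variable name chosen for this example, not the one used by the tests.

```perl
use strict;
use warnings;

# Code under test picks its path from the environment.
sub mode { $ENV{TEST_ONESHOT} ? 'oneshot' : 'daemon' }

sub run_oneshot_tests {
    # Visible to this dynamic scope and to any child processes it spawns.
    local $ENV{TEST_ONESHOT} = 1;
    mode();
} # the previous value (or absence) of TEST_ONESHOT is restored here
```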
2020-12-19 lei: refine help/option parsing, implement "init"
There's a bunch of work in here as the foundations are being fleshed out. One of the UI/UX goals is to make it easy to keep built-in help and shell completions consistent.
2020-12-19 tests: more common JSON module loading
We'll probably be using JSON more in the future, so make it easier to require in tests.
2020-12-19 lei_store: local storage for Local Email Interface
Still unstable, this builds off the equally unstable extindex :P This will be used for caching/memoization of traditional mail stores (IMAP, Maildir, etc) while providing indexing via Xapian, along with compression, and checksumming from git. Most notably, this adds the ability to add/remove per-message keywords (draft, seen, flagged, answered) as described in the JMAP specification (RFC 8621 section 4.1.1). We'll use `.' (a single period) as an $eidx_key since it's an invalid {inboxdir} or {newsgroup} name.
2020-12-19 lei: FD-passing and IPC basics
The start of lei, a Local Email Interface. It'll support a daemon via FD passing to avoid startup time penalties if IO::FDPass is installed, but fall back to a slow one-shot mode if not. Compared to a traditional socket daemon, FD passing should allow us to eventually do stuff like run "git show" and still have proper terminal support for pager and color.
2020-12-17 extindex: support --rethread and content bifurcation
--rethread is useful for dealing with bugs and behaves just like it does with current inboxes. This is in case our content deduplication logic changes for whatever reason and causes previously merged messages to be considered "different". As with v2, this won't allow us to merge messages in a way that allows deduplicating messages which were previously considered different, but v2 inboxes do not allow that, either. In other words, this makes the --reindex and --rethread switches of -extindex match the behavior of v2 -index.
2020-12-17 extindex: delete stale messages from over.sqlite3
In addition to removing stale messages from Xapian, we must also remove them from over.sqlite3.
2020-12-17 extindex: preliminary --reindex support
--reindex allows us to catch missed and stale messages due to -extindex vs -index races prior to commit 02b2fcc46f364b51 ("extsearchidx: enforce -index before -extindex"). We'll also rely on reindex to internally deal with v1/v2 inbox removals and partial-unindexing of messages which are only removed from one inbox out of many. This reindex design is completely different than how normal v1/v2 inbox reindex operates due to extindex having multiple histories to work with. Instead of scanning git history, this relies exclusively on comparing over.sqlite3 contents between the v1/v2 inboxes and the extindex. Changes to Xapian behavior also get picked up, now. Xapian indexing is handled by workers with minimal IPC to the parent process. This results in more read I/O but fewer writes when dealing with cross-posted messages. Changes to $smsg->populate and --rethread still need further work.
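The comparison-based approach can be pictured as a set difference. In this sketch each overview database is reduced to a hypothetical hashref of { blob_oid => 1 }; the real code reads SQLite tables and handles far more state.

```perl
use strict;
use warnings;

# Returns (\@missing, \@stale): entries the extindex lacks, and entries
# the extindex holds that no inbox has any more.
sub diff_over {
    my ($inbox_oids, $eidx_oids) = @_;
    my @missing = sort grep { !exists $eidx_oids->{$_} } keys %$inbox_oids;
    my @stale   = sort grep { !exists $inbox_oids->{$_} } keys %$eidx_oids;
    (\@missing, \@stale);
}
```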
2020-12-17 t/psgi_v2: ignore warnings on missing P::M::ReverseProxy
Plack::Test::ExternalServer doesn't depend on Plack::Middleware::ReverseProxy, so we need to account for some warnings in stderr if P::M::RP is missing.
2020-12-14 PublicInbox::Feed owns `feedmax' default value
There's no need to have extra code in the Inbox package for this or to waste dozens of bytes for every Inbox object which uses the default value. This makes our code more flexible w.r.t Inbox-like ExtSearch objects and fixes uninitialized value warnings with ->ALL.
2020-12-11 nntp+www: drop List-* and Archived-At headers
These headers can conflict with headers in the DKIM signature; and parsing the DKIM-Signature header to determine whether or not we can safely add a header would be more code and CPU cycles. Since IMAP seems fine without these headers (and JMAP will likely be, too), there's likely no need to continue appending these to every message. Nowadays, developers seem sufficiently trained to use URLs with Message-IDs in them. So drop the headers and save some cycles and bandwidth all around.
2020-12-10 extsearchidx: enforce -index before -extindex
We cannot set xref3 data without the `xnum' column to tie it to the per-inbox over.sqlite3 DB. So ensure we don't read brand-new history that only exists in git, but instead rely on last_commit and last_xap15-$EPOCH metadata in msgmap to decide how far we can index. Before this change, it was possible to miss messages in the extindex if -index did not run (which will be fixable by upcoming --reindex support in -extindex).
2020-12-10 t/extsearch: use indexlevel=basic in inboxes
There's no need for per-inbox Xapian DBs when using extindex, so reduce wear on the poor systems this test runs on.
2020-12-09 admin: resolve_repo_dir => resolve_inboxdir
We've stopped referring to inboxdirs as "repos" a while ago since v2 inboxes have multiple git repos associated with them. So update the name to reflect that and avoid an unnecessary export that's only used by a test case.
2020-12-09 rename {pi_config} fields to {pi_cfg}
{pi_config} may be confused with the documented `PI_CONFIG' environment variable, and we'll favor vowel-removal to be consistent with our usage of object references. The `pi_' prefix may stay in some places, for now; since a separate namespace may come into this codebase for local/private client-tooling. For InboxIdle, we'll also remove an invalid comment about holding a reference to the PublicInbox::Config object, too.
2020-12-09 nntp: replace {ng} with {ibx} for consistency
They're PublicInbox::Inbox objects just like the rest of the non-NNTP code. So rename the NNTP code for consistency with the rest of the codebase. Furthermore, {ng} and $ng may be confused with the `--ng' switch for -init, and that's a non-ref scalar string.
2020-12-09 treewide: replace {-inbox} with {ibx} for consistency
{ibx} is shorter and is the most prevalent abbreviation in indexing and IMAP code, and the `$ibx' local variable is already prevalent throughout. In general, the codebase favors removal of vowels in variable and field names to denote references (because references are "lighter" than non-references). So update WWW and Filter users to use the same code since it reduces confusion and may allow easier code sharing.
2020-12-05 isearch: emulate per-inbox search with ->ALL
Using "eidx_key:" boolean prefix to limit results to a given inbox, we can use ->ALL to emulate and replace per-Inbox xap15/[0-9] search indices. With this change, the presence of "extindex.all.topdir" in the $PI_CONFIG will cause the WWW code to use that extindex and ignore per-inbox Xapian DBs in xap15/[0-9]. Unfortunately IMAP search still requires old per-inbox indices, for now. Mapping extindex Xapian docids to per-Inbox UIDs and vice-versa is proving tricky. Fortunately, IMAP search is rarely used and optional. The RFCs don't specify expensive phrase search, either, so `indexlevel=medium' can be used in per-inbox Xapian indices to save space. For primarily WWW (and future JMAP) users; this should result in significant disk space, FD, and page cache footprint savings for large instances with many inboxes and many cross-posted messages.
2020-12-05 over: ensure old, merged {tid} is really gone
We must use the result of link_refs() since it can trigger merge_threads() and invalidate $old_tid. In case merge_threads() isn't triggered, link_refs() will return $old_tid anyways. When rethreading and allocating new {tid}, we also must update the row where the now-expired {tid} came from to ensure only the new {tid} is seen when reindexing subsequent messages in history. Otherwise, every subsequently reindexed+rethreaded message could end up getting a new {tid}.
Reported-by: Kyle Meyer <kyle@kyleam.com>
Link: https://public-inbox.org/meta/87360nlc44.fsf@kyleam.com/
2020-12-01 nntp: make ->ALL Xref generation more fuzzy
For ->ALL users, this mitigates the regression introduced by commit 811b8d3cbaa790f59b7b107140b86248da16499b ("nntp: xref: use ->ALL extindex if available"), since it's common to cross post messages to some mailing lists with per-list trailers for unsubscribe information. We won't bother dealing with Bcc-ed messages since those are nearly all spam when it comes to public mailing lists.
Fixes: 811b8d3cbaa790f5 ("nntp: xref: use ->ALL extindex if available")
Link: https://public-inbox.org/meta/20201130194201.GA6687@dcvr/