public-inbox.git - an "archives first" approach to mailing lists

Date	Commit message (Collapse)
2021-01-18	lei: q: results output to Maildir and mbox* working
	All the augment and deduplication stuff seems to be working based on unit tests. OpPipe is a nice general addition that will probably make future state machines easier.
2021-01-15	lei: q: lock stdout on overview output
	Most writes to stdout aren't atomic and we need locking to prevent workers from interleaving and corrupting JSON output. The one case stdout won't require locking is if it's pointed to a regular file with O_APPEND; as POSIX O_APPEND semantics guarantees atomicity.
2021-01-14	lei: test SIGPIPE, stop xsearch workers on client abort
	The new test ensures consistency between oneshot and client/daemon users. Cancelling an in-progress result now also stops xsearch workers to avoid wasted CPU and I/O. Note the lei->atfork_child_wq usage changes, it is to workaround a bug in Perl 5: http://nntp.perl.org/group/perl.perl5.porters/258784 <CAHhgV8hPbcmkzWizp6Vijw921M5BOXixj4+zTh3nRS9vRBYk8w@mail.gmail.com> This switches the internal protocol to use SOCK_SEQPACKET AF_UNIX sockets to prevent merging messages from the daemon to client to run pager and kill/exit the client script.
2021-01-12	lei: query: restore JSON output overview
	This internal API is better suited for fork-friendliness (but locking + dedupe still needs to be re-added). Normal "json" is the default, though stream-friendly "concatjson" and "jsonl" (AKA "ndjson" AKA "ldjson") all seem working (though tests aren't working, yet). For normal "json", the biggest downside is the necessity of a trailing "null" element at the end of the array because of parallel processes, since (AFAIK) regular JSON doesn't allow trailing commas, unlike JavaScript.
2021-01-12	lei_xsearch: transfer 4 FDs internally, drop IO::FDPass
	It's easier to make the code more generic by transferring all four FDs (std(in\|out\|err) + socket) instead of omitting stdin. We'll be reading from stdin on some imports, and possibly outputting to stdout, so omitting stdin now would needlessly complicate things. The differences with IO::FDPass "1" code paths and the "4" code paths used by Inline::C and Socket::MsgHdr are far too much to support and test at the moment.
2021-01-12	cmd_ipc: send FDs with buffer payload
	For another step in in syscall reduction, we'll support transferring 3 FDs and a buffer with a single sendmsg/recvmsg syscall using Socket::MsgHdr if available. Beyond script/lei itself, this will be used for internal IPC between search backends (perhaps with SOCK_SEQPACKET). There's a chance this could make it to the public-facing daemons, too. This adds an optional dependency on the Socket::MsgHdr package, available as libsocket-msghdr-perl on Debian-based distros (but not CentOS 7.x and FreeBSD 11.x, at least). Our Inline::C version in PublicInbox::Spawn remains the last choice for script/lei due to the high startup time, and IO::FDPass remains supported for non-Debian distros. Since the socket name prefix changes from 3 to 4, we'll also take this opportunity to make the argv+env buffer transfer less error-prone by relying on argc instead of designated delimiters.
2021-01-12	lei query + pagination sorta working
	Parallelism and interactivity with pager + SIGPIPE needs work; but results are shown and phrase search works without shell users having to apply Xapian quoting rules on top of standard shell quoting.
2021-01-01	lei: rename "extinbox" => "external"
	The words "extinbox" and "extindex" are too close and easy to confuse with the other. Rename "extinbox" to "external", since these could be IMAP, JMAP or other non-public-inbox search APIs. Link: https://public-inbox.org/meta/20201226112649.GB6226@dcvr/
2021-01-01	ipc: generic IPC dispatch based on Storable
	I intend to use this with LeiStore when importing from multiple slow sources at once (e.g. curl, IMAP, etc). This is because over.sqlite3 can only have a single writer, and we'll have several slow readers running in parallel. Watch and SearchIdxShard should also be able to use this code in the future, but this will be proven with LeiStore, first.
2021-01-01	lei: implement various deduplication strategies
	For writing mboxes and Maildirs, users may wish to use stricter or looser deduplication strategies. This gives them more control.
2021-01-01	mboxreader: new class for reading various mbox formats
	This is only lightly-tested against stuff LeiToMail generates and will need real-world tests to validate.
2021-01-01	sharedkv: fork()-friendly key-value store
	This is intended for maintaining Maildir states, mbox message deduplication, but may be useful for other purposes...
2021-01-01	lei_to_mail: initial implementation for writing mbox formats
	No Maildir, support, yet, but it'll come.
2020-12-31	Merge remote-tracking branch 'origin/master' into lorelei
	* origin/master: (58 commits) ds: flatten + reuse @events, epoll_wait style fixes ds: simplify EventLoop implementation check defined return value for localized slurp errors import: check for git->qx errors, clearer return values git: qx: avoid extra "local" for scalar context case search: remove {mset} option for ->mset method search: remove pointless {relevance} setting miscsearch: take reopen from Search and use it extsearch: unconditionally reopen on access extindex: allow using --all without EXTINDEX_DIR extindex: add undocumented --no-scan switch extindex: enable autoflush on STDOUT/STDERR extindex: various --watch signal handling fixes extindex: --watch for inotify-based updates eml: fix undefined vars on <Perl 5.28 t/config: test --get-urlmatch for git <2.26 default to CORE::warn in $SIG{__WARN__} handlers inbox: name variable for values loop iterator inboxidle: avoid needless syscalls on refresh inboxidle: clue users into resolving ENOSPC from inotify ...
2020-12-31	lei_xsearch: cross-(inbox\|extindex) search
	While a single extindex combines multiple inboxes into a single search index, extindex still requires up-front indexing on items which can be searched. XSearch has no on-disk footprint itself and uses Xapian DBs of existing publicinbox and extindex ("extinbox") exclusively. XSearch still suffers from the multi-shard Xapian scalability problems which led to the creation of extindex, but I expect the number of shards to remain relatively low. I envision users hosting public-inbox instances on their workstations will only have two extindex combined by this, one read-only extindex for serving public archives, and one read-write extindex managed by LeiStore for private mail.
2020-12-23	xt: add create-many-inboxes helper test
	I've been using something like this to mock out thousands of inboxes for testing.
2020-12-19	lei: extinbox: start implementing in config file
	They need to be indexed by MiscIdx, but MiscIdx still needs more work to support faster config loading when dealing with ~100K data sources.
2020-12-19	build: add lei.sh + "make symlink-install" target
	This could've been done ages ago, but I rarely invoked public-inbox-* commands from an interactive terminal like I would with lei.
2020-12-19	lei: start working on bash completion
	Much work still needs to be done, but that goes for this entire project :P
2020-12-19	on_destroy: generic localized END
	This is a localized version of the process-wide END{}, but runs at the end of variable scope. A subroutine ref and arguments may be passed, which allows us to avoid anonymous subs and problems they cause. It's similar to `defer' or `ensure' in other languages; Perl can rely on deterministic destructors due to refcounting.
2020-12-19	rename LeiDaemon package to PublicInbox::LEI
	"LEI" is an acronym, and ALL CAPS is consistent with existing PublicInbox::{IMAP,HTTP,NNTP,WWW} naming for top-level modules, 3 of 4 old ones which deal directly with sockets and requests.
2020-12-19	t/lei-oneshot: standalone oneshot (non-socket) test
	We can use the same "local $ENV{FOO}" hack we do with t/nntpd-v2.t to test the oneshot code path without imposing an extra script in the users' $PATH.
2020-12-19	lei_store: local storage for Local Email Interface
	Still unstable, this builds off the equally unstable extindex :P This will be used for caching/memoization of traditional mail stores (IMAP, Maildir, etc) while providing indexing via Xapian, along with compression, and checksumming from git. Most notably, this adds the ability to add/remove per-message keywords (draft, seen, flagged, answered) as described in the JMAP specification (RFC 8621 section 4.1.1). We'll use `.' (a single period) as an $eidx_key since it's an invalid {inboxdir} or {newsgroup} name.
2020-12-19	lei: FD-passing and IPC basics
	The start of lei, a Local Email Interface. It'll support a daemon via FD passing to avoid startup time penalties if IO::FDPass is installed, but fall back to a slow one-shot mode if not. Compared to traditional socket daemon, FD passing should allow us to eventually do stuff like run "git show" and still have proper terminal support for pager and color.
2020-12-12	doc: add public-inbox-extindex-format(5) manpage
	The CLI tool still needs usability work, and "misc" is still in flux, but the core message indexing part is stable (since it's stolen from v2 :P).
2020-12-05	isearch: emulate per-inbox search with ->ALL
	Using "eidx_key:" boolean prefix to limit results to a given inbox, we can use ->ALL to emulate and replace per-Inbox xap15/[0-9] search indices. With this change, the presence of "extindex.all.topdir" in the $PI_CONFIG will cause the WWW code to use that extindex and ignore per-inbox Xapian DBs in xap15/[0-9]. Unfortunately IMAP search still requires old per-inbox indices, for now. Mapping extindex Xapian docids to per-Inbox UIDs and vice-versa is proving tricky. Fortunately, IMAP search is rarely used and optional. The RFCs don't specify expensive phrase search, either, so `indexlevel=medium' can be used in per-inbox Xapian indices to save space. For primarily WWW (and future JMAP) users; this should result in significant disk space, FD, and page cache footprint savings for large instances with many inboxes and many cross-posted messages.
2020-12-05	over: ensure old, merged {tid} is really gone
	We must use the result of link_refs() since it can trigger merge_threads() and invalidate $old_tid. In case merge_threads() isn't triggered, link_refs() will return $old_tid anyways. When rethreading and allocating new {tid}, we also must update the row where the now-expired {tid} came from to ensure only the new {tid} is seen when reindexing subsequent messages in history. Otherwise, every subsequently reindexed+rethreaded message could end up getting a new {tid}. Reported-by: Kyle Meyer <kyle@kyleam.com> Link: https://public-inbox.org/meta/87360nlc44.fsf@kyleam.com/
2020-11-24	miscsearch: a new Xapian sub-DB for extindex
	This will be used to index and search Inbox objects and perhaps individual git repositories/epochs for grokmirror manifest.js.gz generation. There is no sharding planned for this at the moment since inbox count should remain low (~100K to 1M) compared to message count. Folding this into the existing sharded DBs could be possible; but would likely increase query and maintenance costs, as well as development complexity. So we'll use a few more inodes and FDs at runtime, instead.
2020-11-08	extsearch: rename -eindex to -extindex
	Upon "eindex" rhymes with "reindex", which could be confusing; so name the command and config prefix to use "extindex" which is hopefully less confusing.
2020-11-07	script: add preliminary eindex implementation
	Not documented, yet, but it runs...
2020-11-07	extsearchidx: initial implementation
	It compiles...
2020-11-07	extsearch: start mocking out
	This will provide a similar API to PublicInbox::Inbox for read-only WWW, -imapd, and -nntpd interfaces.
2020-10-17	xt: remove eml_check_roundtrip
	If there's no body ({bdy} field), ->each_part set the {bdy} field to "\n" and the ->as_string result afterwards is one extra "\n" byte longer than the original. It's not worth extra cycles in common ->each_part calls to ensure 100% round-trip matches of header-only messages (which are likely spam), especially when the only difference is a trailing "\n".
2020-09-26	xt: add eml ->as_string round trip checker
	Unlike Email::MIME, PublicInbox::Eml::as_string should be able to round trip from the Perl object to a raw scalar and back without changes.
2020-09-20	doc: post-1.6 updates, start 1.7
	I should've dropped "PENDING" notes before the 1.6 release; they're dropped now, and a note is added to remind my future self to drop them before 1.7.
2020-09-19	gcf2: wire up read-only daemons and rm -gcf2 script
	It seems easiest to have a singleton Gcf2Client client object per daemon worker for all inboxes to use. This reduces overall FD usage from pipes. The `public-inbox-gcf2' command + manpage are gone and a `$^X' one-liner is used, instead. This saves inodes for internal commands and hopefully makes it easier to avoid mismatched PERL5LIB include paths (as noticed during development :x). We'll also make the existing cat-file process management infrastructure more resilient to BOFHs on process killing sprees (or in case our libgit2-based code fails on us). (Rare) PublicInbox::WWW PSGI users NOT using public-inbox-httpd won't automatically benefit from this change, and extra configuration will be required (to be documented later).
2020-09-19	add gcf2 client and executable script
	This should be able to replace multiple `git cat-file' for blob retrieval, but adjustments may be needed.
2020-09-19	gcf2: libgit2-based git cat-file alternative
	Having tens of thousands of inboxes and associated git processes won't work well, so we'll use libgit2 to access the object DB directly. We only care about OID lookups and won't need to rely on per-repo revision names or paths. The Git::Raw XS package won't be used since its manpages don't promise a stable API. Since we already use Inline::C and have experience with I::C when it comes to compatibility, this only introduces libgit2 itself as a source of new incompatibilities. This also provides an excuse for me to writev(2) to reduce syscalls, but liburing is on the horizon for next year.
2020-09-10	config: split out iterator into separate object
	We will need to allow simultaneous iterators on the same config object, since we'll need this for ExtMsg, NNTPD, WwwListing, NewsWWW, and other places.
2020-09-10	www: manifest.js.gz generation no longer hogs event loop
	It's still as slow as before with hundreds/thousands of inboxes, but at least it's fair. Future changes will allow it to be cached and memoized with persistent HTTP servers.
2020-09-02	t/v2dupindex: test indexing mirrors with duplicate messages
	While it's not a known problem, our deduplicating logic may change in the future; or a BOFH could be manually injecting duplicate messages directly into the git epoch repositories. Ensure indexing in mirrors doesn't break when there's duplicates. This is in preparation for detached indices for multi-inbox search.
2020-09-01	replace ParentPipe with EOFpipe
	ParentPipe was a subset of EOFpipe, except EOFpipe correctly accounts for theoretical() spurious wakeups on the pipe. () AFAIK, spurious wakeups are/were more likely on TCP sockets due to checksum failures, something that's not a problem on local pipes. We're also not sharing pipes like we do with listen sockets on accept(2), so there's no chance of another process grabbing bytes (unless we have bugs in our code).
2020-09-01	watch: use EOFpipe to reduce dwaitpid wakeups
	It's a bit inefficient to use a pipe, here. However, using dwaitpid() on a process that's not expected to exit soon is also inefficient as it causes excessive wakeups as most of our inbox-writing code expects synchronous waitpid(). This only affects -watch instances configured for NNTP and IMAP clients.
2020-09-01	rename WatchMaildir => Watch
	This is no longer limited to Maildirs now that IMAP and NNTP support exist; so give it a shorter name.
2020-08-25	examples: add imapd systemd examples
	We've got examples for all the other daemons, too!
2020-08-20	searchquery: split off from searchview
	Since this was already a separate package, split it off into its own file since SearchView may not handle inbox groups.
2020-08-16	doc: add public-inbox-tuning(7) manpage
	Determining storage device speed and latencies doesn't seem portable or even possible with the wide variety of storage layers in use. This means we need to write a tuning document and hope users read and improve on it :P
2020-08-03	t/nntpd: do not fork on indexing, test v2
	No need to waste resources when doing minimal work. With PI_TEST_VERSION=2, this fixes a test failure where Net::NNTP::DESTROY was getting called in the shard process. We'll also get rid of an unnecessary use_ok under v2, too.
2020-07-29	searchidx: disable CoW for SQLite and Xapian under btrfs
	SQLite and Xapian files are written randomly, thus they become fragmented under btrfs with copy-on-write. This leads to noticeable performance problems (and probably ENOSPC) as these files get big. lore/git (v2, <1GB) indexes around 20% faster with this on an ancient SSD. lore/lkml seems to be taking forever and I'll probably cancel it to save wear on my SSD. Unfortunately, disabling CoW also means disabling checksumming (and compression), so we'll be careful to only set the No_COW attribute on regeneratable data. We want to keep CoW (and checksums+compression) on git storage because current ref storage is neither checksummed nor compressed, and git streams pack output.
2020-07-25	v2writable: introduce idx_stack
	This avoids pinning a potentially large chunk of memory from `git-log --reverse' into RAM (or triggering less predictable swap behavior). Instead it uses a contiguous temporary file with a fixed-size record for every blob we'll need to index.