public-inbox.git - an "archives first" approach to mailing lists

Date	Commit message
2021-01-01	lei: implement various deduplication strategies
	For writing mboxes and Maildirs, users may wish to use stricter or looser deduplication strategies. This gives them more control.
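	A minimal sketch of what "stricter or looser" could mean, in Python rather than the project's Perl. The two key functions and their names are assumptions for illustration only, not lei's actual strategies: one collapses any messages sharing a Message-ID, the other collapses only messages whose raw text is identical.

```python
import hashlib
from email import message_from_string


def key_by_mid(raw: str) -> str:
    """Hypothetical strategy: same Message-ID header means duplicate."""
    msg = message_from_string(raw)
    return (msg.get("Message-ID") or "").strip()


def key_by_content(raw: str) -> str:
    """Hypothetical strategy: only identical raw text means duplicate."""
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def dedupe(messages, key_func):
    """Keep the first message seen for each key, preserving order."""
    seen = set()
    kept = []
    for raw in messages:
        key = key_func(raw)
        if key not in seen:
            seen.add(key)
            kept.append(raw)
    return kept


a = "Message-ID: <1@example>\nSubject: hi\n\nbody\n"
b = "Message-ID: <1@example>\nSubject: hi\n\nbody (edited)\n"
msgs = [a, b, a]

by_mid = dedupe(msgs, key_by_mid)          # a and b share a Message-ID
by_content = dedupe(msgs, key_by_content)  # a and b differ in body text
```

	Passing the key function as a parameter keeps the writer code identical no matter which strategy the user picks.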
2021-01-01	mboxreader: new class for reading various mbox formats
	This is only lightly-tested against stuff LeiToMail generates and will need real-world tests to validate.
2021-01-01	sharedkv: fork()-friendly key-value store
	This is intended for maintaining Maildir states and mbox message deduplication, but may be useful for other purposes...
2021-01-01	lei_to_mail: initial implementation for writing mbox formats
	No Maildir support yet, but it'll come.
2020-12-31	public-inbox 1.6.1 - minor bugfix release v1.6.1

2020-12-31	Merge remote-tracking branch 'origin/master' into lorelei
	* origin/master: (58 commits)
	  ds: flatten + reuse @events, epoll_wait style fixes
	  ds: simplify EventLoop implementation
	  check defined return value for localized slurp errors
	  import: check for git->qx errors, clearer return values
	  git: qx: avoid extra "local" for scalar context case
	  search: remove {mset} option for ->mset method
	  search: remove pointless {relevance} setting
	  miscsearch: take reopen from Search and use it
	  extsearch: unconditionally reopen on access
	  extindex: allow using --all without EXTINDEX_DIR
	  extindex: add undocumented --no-scan switch
	  extindex: enable autoflush on STDOUT/STDERR
	  extindex: various --watch signal handling fixes
	  extindex: --watch for inotify-based updates
	  eml: fix undefined vars on <Perl 5.28
	  t/config: test --get-urlmatch for git <2.26
	  default to CORE::warn in $SIG{__WARN__} handlers
	  inbox: name variable for values loop iterator
	  inboxidle: avoid needless syscalls on refresh
	  inboxidle: clue users into resolving ENOSPC from inotify
	  ...
2020-12-31	lei_xsearch: cross-(inbox|extindex) search
	While a single extindex combines multiple inboxes into a single search index, extindex still requires up-front indexing on items which can be searched. XSearch has no on-disk footprint itself and uses Xapian DBs of existing publicinbox and extindex ("extinbox") exclusively. XSearch still suffers from the multi-shard Xapian scalability problems which led to the creation of extindex, but I expect the number of shards to remain relatively low. I envision users hosting public-inbox instances on their workstations will only have two extindex combined by this, one read-only extindex for serving public archives, and one read-write extindex managed by LeiStore for private mail.
2020-12-26	over: ensure old, merged {tid} is really gone
	We must use the result of link_refs() since it can trigger merge_threads() and invalidate $old_tid. In case merge_threads() isn't triggered, link_refs() will return $old_tid anyways. When rethreading and allocating new {tid}, we also must update the row where the now-expired {tid} came from to ensure only the new {tid} is seen when reindexing subsequent messages in history. Otherwise, every subsequently reindexed+rethreaded message could end up getting a new {tid}. Reported-by: Kyle Meyer <kyle@kyleam.com> Link: https://public-inbox.org/meta/87360nlc44.fsf@kyleam.com/ (cherry picked from commit 9356ec0cc5afc95a8fd398ddf898942ef0acdb74)
2020-12-26	doc: post-1.6 updates, start 1.7
	I should've dropped "PENDING" notes before the 1.6 release; they're dropped now, and a note is added to remind my future self to drop them before 1.7. (cherry picked from commit 3b5d3d1910f1db526a488142c01f42db5255ac72)
2020-12-23	xt: add create-many-inboxes helper test
	I've been using something like this to mock out thousands of inboxes for testing.
2020-12-19	lei: extinbox: start implementing in config file
	They need to be indexed by MiscIdx, but MiscIdx still needs more work to support faster config loading when dealing with ~100K data sources.
2020-12-19	build: add lei.sh + "make symlink-install" target
	This could've been done ages ago, but I rarely invoked public-inbox-* commands from an interactive terminal like I would with lei.
2020-12-19	lei: start working on bash completion
	Much work still needs to be done, but that goes for this entire project :P
2020-12-19	on_destroy: generic localized END
	This is a localized version of the process-wide END{}, but runs at the end of variable scope. A subroutine ref and arguments may be passed, which allows us to avoid anonymous subs and problems they cause. It's similar to `defer' or `ensure' in other languages; Perl can rely on deterministic destructors due to refcounting.
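	A rough Python analogue of the idea, not the Perl module itself. It assumes CPython, where refcounting makes `__del__` run as soon as the last reference goes away; the class name and call shape here are illustrative only.

```python
class OnDestroy:
    """Run callback(*args) when the last reference to this object is dropped."""

    def __init__(self, callback, *args):
        # Storing the callable and its arguments avoids needing a closure.
        self._callback = callback
        self._args = args

    def __del__(self):
        self._callback(*self._args)


log = []


def work():
    guard = OnDestroy(log.append, "cleanup")
    log.append("working")
    # On CPython, guard's refcount hits zero when work() returns,
    # so the callback fires here, at the end of the variable's scope.


work()
```

	After `work()` returns, `log` holds "working" followed by "cleanup". On Python implementations without refcounting the timing is not guaranteed, and a `with` block would be the safer choice.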
2020-12-19	rename LeiDaemon package to PublicInbox::LEI
	"LEI" is an acronym, and ALL CAPS is consistent with existing PublicInbox::{IMAP,HTTP,NNTP,WWW} naming for top-level modules, 3 of 4 old ones which deal directly with sockets and requests.
2020-12-19	t/lei-oneshot: standalone oneshot (non-socket) test
	We can use the same "local $ENV{FOO}" hack we do with t/nntpd-v2.t to test the oneshot code path without imposing an extra script in the users' $PATH.
2020-12-19	lei_store: local storage for Local Email Interface
	Still unstable, this builds off the equally unstable extindex :P This will be used for caching/memoization of traditional mail stores (IMAP, Maildir, etc) while providing indexing via Xapian, along with compression, and checksumming from git. Most notably, this adds the ability to add/remove per-message keywords (draft, seen, flagged, answered) as described in the JMAP specification (RFC 8621 section 4.1.1). We'll use `.' (a single period) as an $eidx_key since it's an invalid {inboxdir} or {newsgroup} name.
2020-12-19	lei: FD-passing and IPC basics
	The start of lei, a Local Email Interface. It'll support a daemon via FD passing to avoid startup time penalties if IO::FDPass is installed, but fall back to a slow one-shot mode if not. Compared to a traditional socket daemon, FD passing should allow us to eventually do stuff like run "git show" and still have proper terminal support for pager and color.
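	To show the mechanism only: a small Python sketch of passing a descriptor over a Unix socket, using the standard library's `socket.send_fds` and `socket.recv_fds` (Python 3.9+, Unix only). It is not lei's protocol; the socket pair stands in for the client/daemon link and the pipe stands in for the client's stdout, both assumptions for illustration.

```python
import os
import socket

# A connected pair of Unix sockets stands in for the client <-> daemon link.
client, daemon = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# The pipe's write end plays the role of the client's stdout.
read_end, write_end = os.pipe()

# Client side: send one byte of payload along with the descriptor itself.
socket.send_fds(client, [b"x"], [write_end])

# Daemon side: receive the payload and a new descriptor for the same pipe.
data, fds, _flags, _addr = socket.recv_fds(daemon, 1, 1)
os.write(fds[0], b"hello from daemon\n")

# Close every write end so the reader sees EOF.
os.close(fds[0])
os.close(write_end)
with os.fdopen(read_end, "rb") as reader:
    output = reader.read()

client.close()
daemon.close()
```

	Because the daemon writes to the very same open file the client handed over, whatever is attached to it (a terminal, a pager) sees the output directly instead of a proxied copy.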
2020-12-12	doc: add public-inbox-extindex-format(5) manpage
	The CLI tool still needs usability work, and "misc" is still in flux, but the core message indexing part is stable (since it's stolen from v2 :P).
2020-12-05	isearch: emulate per-inbox search with ->ALL
	Using the "eidx_key:" boolean prefix to limit results to a given inbox, we can use ->ALL to emulate and replace per-Inbox xap15/[0-9] search indices. With this change, the presence of "extindex.all.topdir" in the $PI_CONFIG will cause the WWW code to use that extindex and ignore per-inbox Xapian DBs in xap15/[0-9]. Unfortunately IMAP search still requires old per-inbox indices, for now. Mapping extindex Xapian docids to per-Inbox UIDs and vice-versa is proving tricky. Fortunately, IMAP search is rarely used and optional. The RFCs don't specify expensive phrase search, either, so `indexlevel=medium' can be used in per-inbox Xapian indices to save space. For primarily WWW (and future JMAP) users, this should result in significant disk space, FD, and page cache footprint savings for large instances with many inboxes and many cross-posted messages.
2020-12-05	over: ensure old, merged {tid} is really gone
	We must use the result of link_refs() since it can trigger merge_threads() and invalidate $old_tid. In case merge_threads() isn't triggered, link_refs() will return $old_tid anyways. When rethreading and allocating new {tid}, we also must update the row where the now-expired {tid} came from to ensure only the new {tid} is seen when reindexing subsequent messages in history. Otherwise, every subsequently reindexed+rethreaded message could end up getting a new {tid}. Reported-by: Kyle Meyer <kyle@kyleam.com> Link: https://public-inbox.org/meta/87360nlc44.fsf@kyleam.com/
2020-11-24	miscsearch: a new Xapian sub-DB for extindex
	This will be used to index and search Inbox objects and perhaps individual git repositories/epochs for grokmirror manifest.js.gz generation. There is no sharding planned for this at the moment since inbox count should remain low (~100K to 1M) compared to message count. Folding this into the existing sharded DBs could be possible, but would likely increase query and maintenance costs, as well as development complexity. So we'll use a few more inodes and FDs at runtime, instead.
2020-11-08	extsearch: rename -eindex to -extindex
	"eindex" rhymes with "reindex", which could be confusing, so rename the command and config prefix to "extindex", which is hopefully less confusing.
2020-11-07	script: add preliminary eindex implementation
	Not documented, yet, but it runs...
2020-11-07	extsearchidx: initial implementation
	It compiles...
2020-11-07	extsearch: start mocking out
	This will provide a similar API to PublicInbox::Inbox for read-only WWW, -imapd, and -nntpd interfaces.
2020-10-17	xt: remove eml_check_roundtrip
	If there's no body ({bdy} field), ->each_part sets the {bdy} field to "\n" and the ->as_string result afterwards is one extra "\n" byte longer than the original. It's not worth extra cycles in common ->each_part calls to ensure 100% round-trip matches of header-only messages (which are likely spam), especially when the only difference is a trailing "\n".
2020-09-26	xt: add eml ->as_string round trip checker
	Unlike Email::MIME, PublicInbox::Eml::as_string should be able to round trip from the Perl object to a raw scalar and back without changes.
2020-09-20	doc: post-1.6 updates, start 1.7
	I should've dropped "PENDING" notes before the 1.6 release; they're dropped now, and a note is added to remind my future self to drop them before 1.7.
2020-09-19	gcf2: wire up read-only daemons and rm -gcf2 script
	It seems easiest to have a singleton Gcf2Client object per daemon worker for all inboxes to use. This reduces overall FD usage from pipes. The `public-inbox-gcf2' command + manpage are gone and a `$^X' one-liner is used, instead. This saves inodes for internal commands and hopefully makes it easier to avoid mismatched PERL5LIB include paths (as noticed during development :x). We'll also make the existing cat-file process management infrastructure more resilient to BOFHs on process killing sprees (or in case our libgit2-based code fails on us). (Rare) PublicInbox::WWW PSGI users NOT using public-inbox-httpd won't automatically benefit from this change, and extra configuration will be required (to be documented later).
2020-09-19	add gcf2 client and executable script
	This should be able to replace multiple `git cat-file' for blob retrieval, but adjustments may be needed.
2020-09-19	gcf2: libgit2-based git cat-file alternative
	Having tens of thousands of inboxes and associated git processes won't work well, so we'll use libgit2 to access the object DB directly. We only care about OID lookups and won't need to rely on per-repo revision names or paths. The Git::Raw XS package won't be used since its manpages don't promise a stable API. Since we already use Inline::C and have experience with I::C when it comes to compatibility, this only introduces libgit2 itself as a source of new incompatibilities. This also provides an excuse for me to use writev(2) to reduce syscalls, but liburing is on the horizon for next year.
2020-09-10	config: split out iterator into separate object
	We will need to allow simultaneous iterators on the same config object, since we'll need this for ExtMsg, NNTPD, WwwListing, NewsWWW, and other places.
2020-09-10	www: manifest.js.gz generation no longer hogs event loop
	It's still as slow as before with hundreds/thousands of inboxes, but at least it's fair. Future changes will allow it to be cached and memoized with persistent HTTP servers.
2020-09-02	t/v2dupindex: test indexing mirrors with duplicate messages
	While it's not a known problem, our deduplicating logic may change in the future, or a BOFH could be manually injecting duplicate messages directly into the git epoch repositories. Ensure indexing in mirrors doesn't break when there are duplicates. This is in preparation for detached indices for multi-inbox search.
2020-09-01	replace ParentPipe with EOFpipe
	ParentPipe was a subset of EOFpipe, except EOFpipe correctly accounts for theoretical(*) spurious wakeups on the pipe. (*) AFAIK, spurious wakeups are/were more likely on TCP sockets due to checksum failures, something that's not a problem on local pipes. We're also not sharing pipes like we do with listen sockets on accept(2), so there's no chance of another process grabbing bytes (unless we have bugs in our code).
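	A Python sketch of the general technique, not the EOFpipe module: treat readiness as a hint and only act once a read actually returns EOF. The function name and the choice to discard stray bytes are assumptions for illustration.

```python
import os
import select


def wait_for_eof(rfd: int) -> int:
    """Wait until every writer has closed the pipe; return bytes discarded."""
    os.set_blocking(rfd, False)
    discarded = 0
    while True:
        select.select([rfd], [], [])
        try:
            chunk = os.read(rfd, 4096)
        except BlockingIOError:
            continue  # spurious wakeup: nothing to read after all, wait again
        if chunk == b"":
            return discarded  # real EOF: all write ends are closed
        discarded += len(chunk)  # unexpected data is ignored; only EOF matters


r, w = os.pipe()
os.write(w, b"ignored")
os.close(w)  # closing the last write end is what signals EOF
discarded = wait_for_eof(r)
os.close(r)
```

	A version that assumed "readable means EOF" would fire early on the stray bytes above, or on a wakeup with nothing to read.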
2020-09-01	watch: use EOFpipe to reduce dwaitpid wakeups
	It's a bit inefficient to use a pipe, here. However, using dwaitpid() on a process that's not expected to exit soon is also inefficient as it causes excessive wakeups as most of our inbox-writing code expects synchronous waitpid(). This only affects -watch instances configured for NNTP and IMAP clients.
2020-09-01	rename WatchMaildir => Watch
	This is no longer limited to Maildirs now that IMAP and NNTP support exists, so give it a shorter name.
2020-08-25	examples: add imapd systemd examples
	We've got examples for all the other daemons, too!
2020-08-20	searchquery: split off from searchview
	Since this was already a separate package, split it off into its own file since SearchView may not handle inbox groups.
2020-08-16	doc: add public-inbox-tuning(7) manpage
	Determining storage device speed and latencies doesn't seem portable or even possible with the wide variety of storage layers in use. This means we need to write a tuning document and hope users read and improve on it :P
2020-08-03	t/nntpd: do not fork on indexing, test v2
	No need to waste resources when doing minimal work. With PI_TEST_VERSION=2, this fixes a test failure where Net::NNTP::DESTROY was getting called in the shard process. We'll also get rid of an unnecessary use_ok under v2, too.
2020-07-29	searchidx: disable CoW for SQLite and Xapian under btrfs
	SQLite and Xapian files are written randomly, thus they become fragmented under btrfs with copy-on-write. This leads to noticeable performance problems (and probably ENOSPC) as these files get big. lore/git (v2, <1GB) indexes around 20% faster with this on an ancient SSD. lore/lkml seems to be taking forever and I'll probably cancel it to save wear on my SSD. Unfortunately, disabling CoW also means disabling checksumming (and compression), so we'll be careful to only set the No_COW attribute on regeneratable data. We want to keep CoW (and checksums+compression) on git storage because current ref storage is neither checksummed nor compressed, and git streams pack output.
2020-07-25	v2writable: introduce idx_stack
	This avoids pinning a potentially large chunk of memory from `git-log --reverse' into RAM (or triggering less predictable swap behavior). Instead it uses a contiguous temporary file with a fixed-size record for every blob we'll need to index.
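	A Python sketch of the fixed-size-record idea. The record layout (type byte, commit time, 20-byte object ID) and the last-in-first-out read order are assumptions inferred from the name, not taken from the Perl code.

```python
import struct
import tempfile

# Hypothetical record layout: 1-byte type, 8-byte commit time, 20-byte object ID.
RECORD = struct.Struct("!cQ20s")


class IdxStack:
    """Stack of fixed-size records kept in a temporary file instead of RAM."""

    def __init__(self):
        self._fh = tempfile.TemporaryFile()
        self._count = 0

    def push(self, kind: bytes, commit_time: int, oid: bytes) -> None:
        # Record N always lives at byte offset N * RECORD.size.
        self._fh.seek(self._count * RECORD.size)
        self._fh.write(RECORD.pack(kind, commit_time, oid))
        self._count += 1

    def pop(self):
        """Return the most recently pushed record, or None when empty."""
        if self._count == 0:
            return None
        self._count -= 1
        self._fh.seek(self._count * RECORD.size)
        return RECORD.unpack(self._fh.read(RECORD.size))


stack = IdxStack()
stack.push(b"m", 300, b"c" * 20)  # newest commit pushed first
stack.push(b"m", 200, b"b" * 20)
stack.push(b"m", 100, b"a" * 20)  # oldest commit pushed last

order = []
while (rec := stack.pop()) is not None:
    order.append(rec[1])  # commit times come back oldest first
```

	With fixed-size records, any entry can be found by multiplying its index by the record size, so reading the file backwards costs one seek and one read per entry.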
2020-07-25	v2: index forwards (via `git log --reverse')
	Since we'll need to expose THREADID to JMAP and IMAP users, index all messages in the order they were committed to ensure our `tid' (thread ID) column ascends in mirrors the same way it does in the source inbox. This drastically simplifies our code but increases memory usage of `git-log'. The next commit will bring memory use back down at the expense of $TMPDIR usage.
2020-07-06	www: start making gzipfilter the parent response class
	Virtually all of our responses are going to be gzipped, anyways. This will allow us to utilize zlib as a buffering layer and share common code for async blob retrieval responses. To streamline this and allow GzipFilter to be a parent class, we'll replace the NoopFilter with a similar CompressNoop class which emulates the two Compress::Raw::Zlib::Deflate methods we use. This drops a bunch of redundant code and will hopefully make upcoming WwwStream changes easier to reason about.
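	The no-op stand-in can be illustrated in Python, whose zlib compressor objects expose `compress()` and `flush()`. Treating those as the analogue of the two Perl methods is an assumption, as are the `render` helper and the sample chunks.

```python
import zlib


class CompressNoop:
    """Pass-through object with the same two methods as a zlib compressor."""

    def compress(self, data: bytes) -> bytes:
        return data

    def flush(self) -> bytes:
        return b""


def render(chunks, compressor) -> bytes:
    """Response code only needs compress() and flush(), whatever the object is."""
    out = [compressor.compress(c) for c in chunks]
    out.append(compressor.flush())
    return b"".join(out)


chunks = [b"<html>", b"hello", b"</html>"]
plain = render(chunks, CompressNoop())
gzipped = render(chunks, zlib.compressobj(wbits=31))  # wbits=31 selects gzip framing
```

	Since both objects answer to the same two calls, the response code needs no separate branch for clients that don't accept gzip.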
2020-07-06	mboxgz: do asynchronous git blob retrievals
	This lets the -httpd worker process make better use of time instead of waiting for git-cat-file to respond. With 4 jobs in the new test case against a clone of <https://public-inbox.org/meta/>, a speedup of 10-12% is shown. Even a single job shows a 2-5% improvement on an SSD.
2020-07-06	wwwlisting: use GzipFilter for HTML
	The changes to GzipFilter here may be beneficial for building HTML and XML responses in other places, too.
2020-06-28	watch: use our own "git credential" wrapper
	Git.pm may not be installed on some systems, or some users may have multiple Perl installations where Git.pm is not available to the Perl running -watch. Accommodate both types of users by providing our own "git credential" wrapper.
2020-06-28	watch: add NNTP support
	This is similar to IMAP support, but only supports polling. Automatic altid is not yet supported, but may be in the future. v2: small grammar fix by Kyle Meyer Link: https://public-inbox.org/meta/87sgeg5nxf.fsf@kyleam.com/