public-inbox.git - an "archives first" approach to mailing lists

Date	Commit message
2020-08-26	doc: 1.6.0 release notes update
	A few more things happened, here.
2020-08-26	doc: add some more tuning notes
	I've learned a thing or three about btrfs in the past few weeks and remembered some old HDD things, too. The Xapian MultiDatabase problem will need to be addressed for 1.7...
2020-08-23	searchidx: index THREADID in Xapian
	This is the `tid' column from over.sqlite3, and it will be used for IMAP and JMAP search (among other things).
2020-08-20	init+index: support --skip-docdata for Xapian
	Since we no longer read document data from Xapian, allow users to opt-out of storing it. This breaks compatibility with previous releases of public-inbox, but gives us a ~1.5% space savings on Xapian storage (and associated I/O and page cache pressure reduction).
2020-08-20	init: drop -N alias for --skip-artnum
	It may be too easily confused for --newsgroup or --ng. This is too rarely used and never made it into a release, so it should be fine.
2020-08-20	init: support --newsgroup option
	We can reduce the need to edit the config file for NNTP group names this way.
2020-08-20	doc: note -compact and -xcpdb are rarely used
	Slowly improving the learning curve...
2020-08-16	doc: add public-inbox-tuning(7) manpage
	Determining storage device speed and latencies doesn't seem portable or even possible with the wide variety of storage layers in use. This means we need to write a tuning document and hope users read and improve on it :P
2020-08-14	index|compact|xcpdb: support --all switch
	For -index, this is a convenient way to quickly index all inboxes after a grok-pull. Might as well support it for rarely used commands like -compact and -xcpdb, too.
2020-08-13	xcpdb: wire up new index options and --help
	--sequential-shard also disables the copy parallelism (--jobs), so it can be useful for systems which are unable to handle parallel random I/O but still want many shards. There was a missing "use strict", too, which is fixed.
2020-08-10	convert: support new -index options
	Converting v1 inboxes to v2 can be a painful experience on HDD. Some of the new options in the CLI or config file make it less painful.
2020-08-10	index: cleanup internal variables
	Move away from hard-to-read alllowercase naming and favor snake_case or separated-by-dashes. We'll keep `--indexlevel' as-is for now, since it's been around for several releases; but we'll support `--index-level' in the CLI and update our documentation in a few months. We'll also clarify that publicInbox.indexMaxSize is only intended for -index, and not -watch or -mda.
2020-08-10	admin: use a generic variable name
	We parse other options, too, not just --max-size
2020-08-10	doc: add some notes around -xcpdb / -edit / -purge
	These rarely-used commands have some caveats that needed expanding on.
2020-08-10	doc: index: more notes about latest changes
	With LKML on an HDD, a giant --batch-size of 500m ends up being pretty useful. I was able to index LKML in ~16 hours on a system that had other activity on it. The big downside was it was eating up over 5g of RAM :x. We'll also fix up a duplicated indexBatchSize section, fix formatting around global vs per-inbox indexSequentialShard, and ensure section 5 manpages are linked correctly.
2020-08-07	index: add built-in --help / -?
	Eventually, commonly-used commands run by the user will all support --help / -? for user-friendliness. The changes from up-front `use' to lazy `require' speed up `--help' by 3x or so.
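The lazy-loading idea above can be sketched in Python for illustration (the project itself is Perl, where the change is from `use' to `require'). The usage string and the use of `json` as a stand-in for a heavy dependency are assumptions made for this example only:

```python
def main(argv):
    """Handle --help / -? before loading anything expensive."""
    if "--help" in argv or "-?" in argv:
        # Hypothetical usage text; no heavy modules are loaded on this path.
        return "usage: index [options] INBOX_DIR"
    # Stand-in for an expensive dependency, imported only when real work is needed.
    import json
    return json.dumps({"args": argv})
```

The help path returns before the deferred import runs, which is where a start-up saving would come from.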
2020-08-07	index+xcpdb: rename `--no-sync' to `--no-fsync'
	We'll continue supporting `--no-sync' even if it's yet to make it into a release, but the term `sync' is overloaded in our codebase which may be confusing to new hackers and users. None of our code nor dependencies issue the sync(2) syscall, either, only fsync(2) and fdatasync(2).
2020-08-07	index: v2: --sequential-shard option
	This gives better page cache utilization for Xapian indexing on slow storage by improving locality for random I/O activity on the Xapian DB. Instead of doing a single pass to index both SQLite and Xapian, this indexes them separately. The first pass is identical to indexlevel=basic: it indexes both over.sqlite3 and msgmap.sqlite3. Subsequent passes only operate on a single Xapian shard for documents belonging to that shard. Given enough shards, each individual shard can be made small enough to fit into the kernel page cache and avoid HDD seeks for read activity. Doing rough tests with a busy system with a 7200 RPM HDD with ext4, full indexing of LKML (9 epochs) goes from ~80 hours (-j0) to ~30 hours (-j8) with 16GB RAM with 7 shards configured and fsync(2) disabled (--no-sync) and `--batch-size=10m'.
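The multi-pass scheme described above can be sketched roughly as follows. This is an illustration in Python, not the project's Perl code. Assigning a message to a shard by `num % nshards`, and the dictionaries standing in for the SQLite and Xapian stores, are assumptions made for the example:

```python
from collections import defaultdict

def index_sequential(messages, nshards):
    """messages: iterable of (article_number, text) pairs.

    Returns (overview, shards), where overview stands in for the SQLite
    pass and shards maps shard number -> {article_number: terms}.
    """
    messages = list(messages)  # the input is walked once per pass
    overview = {}
    # First pass: cheap metadata only (stand-in for over.sqlite3 and msgmap.sqlite3).
    for num, text in messages:
        overview[num] = len(text)
    # Later passes: touch exactly one shard per pass, so only that shard's
    # working set needs to stay cached at any one time.
    shards = defaultdict(dict)
    for shard in range(nshards):
        for num, text in messages:
            if num % nshards == shard:  # assumed shard assignment
                shards[shard][num] = text.lower().split()
    return overview, dict(shards)
```

The trade-off is re-reading the message list once per shard in exchange for writes that stay local to a single shard.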
2020-07-25	index+xcpdb: support --no-sync flag
	This allows us to speed up indexing operations to SQLite and Xapian. Unfortunately, it doesn't affect operations using `xapian-compact' and the compactor API, since that doesn't seem to support Xapian::DB_NO_SYNC, yet.
2020-07-25	index: support --rethread switch to fix old indices
	Older versions of public-inbox < 1.3.0 had subtly different semantics around threading in some corner cases. This switch (when combined with --reindex) allows us to fix them by regenerating associations.
2020-07-17	doc: add some recommendations around slow HDDs
	grok-pull is still painful with serialization on an old USB 2.0 HDD, but at least it can finish with flock(1) and disabling parallelization. While parallel "git fetch" doesn't seem so bad, slow seeks are exacerbated by parallel reads in Xapian. That means some updates can take days instead of hours. The same updates take only seconds or minutes on an SSD.
2020-07-14	doc: release notes and version info updates
	Update release notes with some features in the 1.6 timeline. We'll note the version availability of some command-line options, since it may help users who are reading the latest documentation online but running older versions.
2020-07-10	doc: standards: link IMAP capabilities and response codes
	We'll be implementing some IMAP search/threading extensions and providing analogues over HTTP via JMAP.
2020-07-06	doc/technical/whyperl: note Perl 7 announcement
	Right now[1] the Perl upstream plan is to maintain 5 compatibility in Perl 7 for at least 5 years[1], and perhaps drop it when Perl 8 comes along. That said, distros may pick it up and maintain 5 on their own given the vast amounts of perfectly good legacy code out there. [1] http://nntp.perl.org/group/perl.perl5.porters/257817 [2] http://nntp.perl.org/group/perl.perl5.porters/257565
2020-07-06	doc/technical/whyperl: reword bit around installed docs
	I originally proposed this rewording to address Leah's comment but forgot to squash it in :x Link: https://public-inbox.org/meta/20200408221741.GA10142@dcvr/ Cc: Leah Neukirchen <leah@vuxu.org>
2020-07-06	doc: daemon: update documentation around Inline::C
	`~/.cache/public-inbox/inline-c' is supported nowadays for convenience, but Inline::C usage will remain opt-in.
2020-07-06	view: simplify eml_entry callers further
	This simplifies the primary callers of eml_entry while only making mknews.perl worse.
2020-07-06	www: update internal docs
	We no longer favor getline+close for streaming PSGI responses when using public-inbox-httpd. We still support it for other PSGI servers, though.
2020-07-06	view: eml_entry: reduce parameters
	We can save stack space and simplify subroutine calls, here.
2020-07-06	wwwstream: reduce blob fetch paths for ->getline
	This will make it easier to support asynchronous blob retrievals. The `$ctx->{nr}' counter is no longer implicitly supplied since many users didn't care for it, so stack overhead is slightly reduced.
2020-07-06	wwwstream: reduce object graph depth
	Like with WwwAtomStream and MboxGz, we can bless the existing $ctx object directly to avoid allocating a new hashref. We'll also switch from "->" to "::" to reduce stack utilization.
2020-07-06	wwwatomstream: support async blob fetch
	This allows -httpd to handle other requests while waiting for git to retrieve and decode blobs. We'll also break apart t/psgi_v2.t further to ensure tests run against -httpd in addition to generic PSGI testing. Using xt/httpd-async-stream.t to test against clones of meta@public-inbox.org shows a 10-12% performance improvement with the following env: TEST_JOBS=1000 TEST_CURL_OPT=--compressed TEST_ENDPOINT=new.atom
2020-07-06	wwwatomstream: simplify feed_update callers
	We always return Z (UTC) times, anyways, so we'll always use gmtime() on the seconds-after-the-epoch.
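As an illustration of the point above (in Python rather than the project's Perl), formatting seconds-after-the-epoch with gmtime always yields a UTC timestamp, so a literal `Z` suffix is correct regardless of the server's local time zone. The function name is hypothetical:

```python
import time

def atom_updated(epoch_seconds):
    """Format seconds-after-the-epoch as a UTC timestamp with a Z suffix."""
    # gmtime() ignores the local time zone, so appending a literal "Z" is safe.
    return time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(epoch_seconds))
```

For example, `atom_updated(0)` gives `1970-01-01T00:00:00Z`.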
2020-07-06	stop auto-loading Plack::Middleware::Deflater
	Instead of gzipping some (mbox.gz, manifest.js.gz) responses and leaving P::M::D to do the rest, we gzip everything ourselves, now, so P::M::D is redundant.
2020-06-28	watch: remove Filesys::Notify::Simple dependency
	Since we already use inotify and EVFILT_VNODE (kqueue) in -imapd, we might as well use them directly in -watch, too. This will allow public-inbox-watch to use PublicInbox::DS for timers to watch newsgroups/mailboxes and have saner signal handling in future commits.
2020-06-23	init: add --skip-artnum parameter
	For archivists with only newer mail archives, this option allows reserving NNTP article numbers for yet-to-be-archived old messages. Indexers will need to be updated to support this feature in future commits. -V1 inboxes will now be initialized with SQLite and Xapian support if this option is used, or if --indexlevel= is specified.
2020-06-23	init: add -j / --jobs parameter
	On a powerful (by my standards) machine with 16GB RAM and a 7200 RPM HDD marketed for "enterprise" use, indexing an 8.1G (in git) LKML snapshot from Sep 2019 did not finish after 7 days with the default number (3) of Xapian shards (`--jobs=4') and `--batch-size=10m'. Indexing starts off fast, but progressively gets slower as contents of the inbox (including Xapian + SQLite DBs) could no longer be cached by the kernel. Once the on-disk size increased, HDD seek contention between the Xapian shard workers slowed the process down to a crawl. With a single shard, it still took around 3.5 days to index on the HDD. That's not good, but it's far better than not finishing after 7 days. So allow unfortunate HDD users to easily specify a single shard on public-inbox-init. For reference, a freshly TRIM-ed low-end TLC SSD on the SATA II bus on the same machine indexes that same snapshot of LKML in ~7 hours with 3 shards and the same 10m batch size. In the past, a higher-end consumer-grade MLC SSD on similar hardware indexed a similarly-sized data set in ~4 hours.
2020-06-13	doc: update TODO and WIP 1.6.0 release notes
	Lots of big changes coming. Thanks to The Linux Foundation for sponsoring me to hack on this in 2020 :)
2020-06-13	imap: support 8000 octet lines
	RFC 2683 section 3.2.1.5 recommends it:
	> For its part, a server should allow for a command line of at least
	> 8000 octets. This provides plenty of leeway for accepting reasonable
	> length commands from clients. The server should send a BAD response
	> to a command that does not end within the server's maximum accepted
	> command length.
	To conserve memory, we won't bother reading the entire line before sending the BAD response and disconnecting them.
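A minimal sketch of the bounded-line idea, in Python for illustration only (the server itself is Perl). The function name, the status strings, and counting the line terminator toward the 8000-octet limit are all assumptions of this example:

```python
MAX_LINE = 8000  # octets; the figure recommended by RFC 2683

def take_line(buf):
    """Inspect bytes buffered so far and return (status, line, rest).

    status is "line" (a complete command was found), "need_more"
    (keep reading), or "too_long" (caller should send BAD and disconnect).
    """
    end = buf.find(b"\n")
    if end < 0:
        # No terminator yet: give up as soon as the limit is exceeded,
        # without waiting for (or storing) the remainder of the line.
        if len(buf) > MAX_LINE:
            return ("too_long", None, b"")
        return ("need_more", None, buf)
    if end + 1 > MAX_LINE:
        return ("too_long", None, b"")
    return ("line", buf[:end + 1].rstrip(b"\r\n"), buf[end + 1:])
```

Rejecting as soon as the buffer passes the limit keeps per-client memory bounded, since an oversized command is never buffered whole.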
2020-06-13	imap: support IDLE
	It seems to be working as far as Mail::IMAPClient is concerned.
2020-06-13	preliminary imap server implementation
	It shares a bit of code with NNTP. It's copy+pasted for now since this provides new ground to experiment with APIs for dealing with slow storage and many inboxes.
2020-06-13	doc: add some IMAP standards
	There's more, but IMAP is big and complex already.
2020-06-03	www: remove smsg_mime API and adjust callers
	To further simplify callers and avoid embarrassing memory explosions[1], we can finally eliminate this method in favor of smsg_eml. [1] commit 7d02b9e64455831d3bda20cd2e64e0c15dc07df5 ("view: stop storing all MIME objects on large threads") fixed a huge memory blowup.
2020-05-27	learn: support --all with `rm'
	I found myself wanting to remove a message from all inboxes while working on a test case in another branch. I figure this could also be useful for globally removing messages which are in the grey area or too big for spamc.
2020-05-18	index: add --batch-size=SIZE option
	On powerful systems, having this option is preferable to XAPIAN_FLUSH_THRESHOLD due to lock granularity and contention with other processes (-learn, -mda, -watch). Setting XAPIAN_FLUSH_THRESHOLD can cause -learn, -mda, and -watch to get stuck until an epoch is completely processed.
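One way to picture size-based batching is the sketch below, in Python for illustration. It is not taken from the project's code: the function name and the exact flush rule (commit once the accumulated bytes reach the limit) are assumptions. The idea is that each yielded batch is a point where an indexer could commit and release its lock, giving other writers a turn:

```python
def batches(sizes, batch_size):
    """Group message indices so each batch holds roughly batch_size bytes.

    sizes: list of message sizes in bytes.
    Yields lists of indices; each yield marks a hypothetical commit point.
    """
    current, total = [], 0
    for i, nbytes in enumerate(sizes):
        current.append(i)
        total += nbytes
        if total >= batch_size:
            yield current  # commit here, then let other processes in
            current, total = [], 0
    if current:
        yield current  # final partial batch
```

For example, `list(batches([4, 4, 4, 1], 8))` groups the four messages into `[[0, 1], [2, 3]]`.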
2020-05-12	rename "ContentId" to "ContentHash"
	The old name may be confused with "Content-ID" as described in RFC 2392, so use an alternate name to avoid confusing future readers.
2020-05-10	public-inbox 1.5.0 v1.5.0

2020-05-10	various doc updates ahead of 1.5.0

2020-05-09	replace most uses of PublicInbox::MIME with Eml
	PublicInbox::Eml has enough functionality to replace the Email::MIME-based PublicInbox::MIME.
2020-04-27	doc: add clients.txt
	Since some client tools exist for dealing with public-inbox specifically, it seems like a good idea to list some of them. Cc: Danh Doan <congdanhqx@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Konstantin Ryabitsev <konstantin@linuxfoundation.org> Cc: Leah Neukirchen <leah@vuxu.org>