path: root/Documentation
Date Commit message
2020-08-26 doc: 1.6.0 release notes update
A few more things happened, here.
2020-08-26 doc: add some more tuning notes
I've learned a thing or three about btrfs in the past few weeks and remembered some old HDD things, too. The Xapian MultiDatabase problem will need to be addressed for 1.7...
2020-08-23 searchidx: index THREADID in Xapian
This is the `tid' column from over.sqlite3; and will be used for IMAP and JMAP search (among other things).
2020-08-20 init+index: support --skip-docdata for Xapian
Since we no longer read document data from Xapian, allow users to opt-out of storing it. This breaks compatibility with previous releases of public-inbox, but gives us a ~1.5% space savings on Xapian storage (and associated I/O and page cache pressure reduction).
2020-08-20 init: drop -N alias for --skip-artnum
It may be too easily confused for --newsgroup or --ng. This is too rarely used and never made it into a release, so it should be fine.
2020-08-20 init: support --newsgroup option
We can reduce the need to edit the config file for NNTP group names this way.
2020-08-20 doc: note -compact and -xcpdb are rarely used
Slowly improving the learning curve...
2020-08-16 doc: add public-inbox-tuning(7) manpage
Determining storage device speed and latencies doesn't seem portable or even possible with the wide variety of storage layers in use. This means we need to write a tuning document and hope users read and improve on it :P
2020-08-14 index|compact|xcpdb: support --all switch
For -index, this is a convenient way to quickly index all inboxes after a grok-pull. Might as well support it for rarely used commands like -compact and -xcpdb, too.
2020-08-13 xcpdb: wire up new index options and --help
--sequential-shard also disables the copy parallelism (--jobs), so it can be useful for systems that cannot handle parallel random I/O but still want many shards. There was a missing "use strict", too, which is fixed.
2020-08-10 convert: support new -index options
Converting v1 inboxes to v2 can be a painful experience on HDD. Some of the new options in the CLI or config file make it less painful.
2020-08-10 index: cleanup internal variables
Move away from hard-to-read alllowercase naming and favor snake_case or separated-by-dashes. We'll keep `--indexlevel' as-is for now, since it's been around for several releases; but we'll support `--index-level' in the CLI and update our documentation in a few months. We'll also clarify that publicInbox.indexMaxSize is only intended for -index, and not -watch or -mda.
2020-08-10 admin: use a generic variable name
We parse other options, too, not just --max-size
2020-08-10 doc: add some notes around -xcpdb / -edit / -purge
These rarely-used commands have some caveats that needed expanding on.
2020-08-10 doc: index: more notes about latest changes
With LKML on an HDD, a giant --batch-size of 500m ends up being pretty useful. I was able to index LKML in ~16 hours on a system that had other activity on it. The big downside was it was eating up over 5g of RAM :x. We'll also fix up a duplicated indexBatchSize section, fix formatting around global vs per-inbox indexSequentialShard, and ensure section 5 manpages are linked correctly.
2020-08-07 index: add built-in --help / -?
Eventually, commonly-used commands run by the user will all support --help / -? for user-friendliness. The changes from up-front `use' to lazy `require' speed up `--help' by 3x or so.
2020-08-07 index+xcpdb: rename `--no-sync' to `--no-fsync'
We'll continue supporting `--no-sync' even though it has yet to make it into a release, but the term `sync' is overloaded in our codebase, which may be confusing to new hackers and users. None of our code nor dependencies issue the sync(2) syscall, either, only fsync(2) and fdatasync(2).
2020-08-07 index: v2: --sequential-shard option
This gives better page cache utilization for Xapian indexing on slow storage by improving locality for random I/O activity on the Xapian DB. Instead of doing a single pass to index both SQLite and Xapian, this indexes them separately. The first pass is identical to indexlevel=basic: it indexes both over.sqlite3 and msgmap.sqlite3. Subsequent passes only operate on a single Xapian shard for documents belonging to that shard. Given enough shards, each individual shard can be made small enough to fit into the kernel page cache and avoid HDD seeks for read activity. Doing rough tests with a busy system with a 7200 RPM HDD with ext4, full indexing of LKML (9 epochs) goes from ~80 hours (-j0) to ~30 hours (-j8) with 16GB RAM with 7 shards configured and fsync(2) disabled (--no-sync) and `--batch-size=10m'.
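The pass ordering described above can be sketched roughly as follows. This is only an illustration, not the actual indexer: the function name is hypothetical, and assigning a document to a shard by article number modulo the shard count is an assumption made for the example.

```python
def sequential_shard_passes(article_numbers, nshards):
    """Return the order of work as a list of (pass_name, [article numbers])."""
    nums = sorted(article_numbers)
    # Pass 0: every message goes into the SQLite indices (like indexlevel=basic).
    passes = [("sqlite", nums)]
    # One further pass per Xapian shard, so only that shard's files
    # compete for the kernel page cache while the pass is running.
    for shard in range(nshards):
        # ASSUMPTION: shard membership is article number modulo shard count.
        members = [n for n in nums if n % nshards == shard]
        passes.append((f"xapian-shard-{shard}", members))
    return passes


for name, nums in sequential_shard_passes(range(1, 8), 3):
    print(name, nums)
```

Each Xapian pass touches one shard only, which is what lets a sufficiently small shard stay cached.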
2020-07-25 index+xcpdb: support --no-sync flag
This allows us to speed up indexing operations to SQLite and Xapian. Unfortunately, it doesn't affect operations using `xapian-compact' and the compactor API, since that doesn't seem to support Xapian::DB_NO_SYNC, yet.
2020-07-25 index: support --rethread switch to fix old indices
Older versions of public-inbox < 1.3.0 had subtly different semantics around threading in some corner cases. This switch (when combined with --reindex) allows us to fix them by regenerating associations.
2020-07-17 doc: add some recommendations around slow HDDs
grok-pull is still painful with serialization on an old USB 2.0 HDD, but at least it can finish with flock(1) and disabling parallelization. While parallel "git fetch" doesn't seem so bad, slow seeks are exacerbated by parallel reads in Xapian. That means some updates can take days instead of hours. The same updates take only seconds or minutes on an SSD.
2020-07-14 doc: release notes and version info updates
Update release notes with some features in the 1.6 timeline. We'll note the version availability of some command-line options; it may help users who are reading the latest documentation online but running older versions.
2020-07-10 doc: standards: link IMAP capabilities and response codes
We'll be implementing some IMAP search/threading extensions and providing analogues over HTTP via JMAP.
2020-07-06 doc/technical/whyperl: note Perl 7 announcement
Right now[1] the Perl upstream plan is to maintain 5 compatibility in Perl 7 for at least 5 years[2], and perhaps drop it when Perl 8 comes along. That said, distros may pick it up and maintain 5 on their own given the vast amounts of perfectly good legacy code out there.
[1] http://nntp.perl.org/group/perl.perl5.porters/257817
[2] http://nntp.perl.org/group/perl.perl5.porters/257565
2020-07-06 doc/technical/whyperl: reword bit around installed docs
I originally proposed this rewording to address Leah's comment but forgot to squash it in :x
Link: https://public-inbox.org/meta/20200408221741.GA10142@dcvr/
Cc: Leah Neukirchen <leah@vuxu.org>
2020-07-06 doc: daemon: update documentation around Inline::C
`~/.cache/public-inbox/inline-c' is supported nowadays for convenience, but Inline::C usage will remain opt-in.
2020-07-06 view: simplify eml_entry callers further
This simplifies the primary callers of eml_entry while only making mknews.perl worse.
2020-07-06 www: update internal docs
We no longer favor getline+close for streaming PSGI responses when using public-inbox-httpd. We still support it for other PSGI servers, though.
2020-07-06 view: eml_entry: reduce parameters
We can save stack space and simplify subroutine calls, here.
2020-07-06 wwwstream: reduce blob fetch paths for ->getline
This will make it easier to support asynchronous blob retrievals. The `$ctx->{nr}' counter is no longer implicitly supplied since many users didn't care for it, so stack overhead is slightly reduced.
2020-07-06 wwwstream: reduce object graph depth
Like with WwwAtomStream and MboxGz, we can bless the existing $ctx object directly to avoid allocating a new hashref. We'll also switch from "->" to "::" to reduce stack utilization.
2020-07-06 wwwatomstream: support async blob fetch
This allows -httpd to handle other requests while waiting for git to retrieve and decode blobs. We'll also break apart t/psgi_v2.t further to ensure tests run against -httpd in addition to generic PSGI testing. Using xt/httpd-async-stream.t to test against clones of meta@public-inbox.org shows a 10-12% performance improvement with the following env: TEST_JOBS=1000 TEST_CURL_OPT=--compressed TEST_ENDPOINT=new.atom
2020-07-06 wwwatomstream: simplify feed_update callers
We always return Z (UTC) times, anyways, so we'll always use gmtime() on the seconds-after-the-epoch.
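As a rough illustration of the point above (in Python rather than the project's Perl, with a hypothetical function name), formatting seconds-after-the-epoch through gmtime() always yields a `Z' timestamp, regardless of the local timezone:

```python
import time


def feed_updated(epoch_seconds):
    """Format seconds-after-the-epoch as a UTC timestamp with a Z suffix."""
    # time.gmtime() never consults the local timezone, so the literal "Z"
    # in the format string is always accurate.
    return time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(epoch_seconds))


print(feed_updated(0))  # 1970-01-01T00:00:00Z
```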
2020-07-06 stop auto-loading Plack::Middleware::Deflater
Instead of gzipping some (mbox.gz, manifest.js.gz) responses and leaving P::M::D to do the rest, we gzip everything ourselves, now, so P::M::D is redundant.
2020-06-28 watch: remove Filesys::Notify::Simple dependency
Since we already use inotify and EVFILT_VNODE (kqueue) in -imapd, we might as well use them directly in -watch, too. This will allow public-inbox-watch to use PublicInbox::DS for timers to watch newsgroups/mailboxes and have saner signal handling in future commits.
2020-06-23 init: add --skip-artnum parameter
For archivists with only newer mail archives, this option allows reserving NNTP article numbers for yet-to-be-archived old messages. Indexers will need to be updated to support this feature in future commits. V1 inboxes will now be initialized with SQLite and Xapian support if this option is used, or if --indexlevel= is specified.
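A minimal sketch of the idea, assuming the reserved range is simply 1 through the skipped count and that new messages are numbered consecutively after it. The function name and these exact semantics are assumptions for illustration, not taken from the implementation:

```python
def assign_article_numbers(message_ids, skip_artnum=0):
    """Number messages starting after the reserved range 1..skip_artnum."""
    # ASSUMPTION: the first archived message gets skip_artnum + 1, leaving
    # the lower numbers free for older mail that may be archived later.
    return {mid: skip_artnum + i for i, mid in enumerate(message_ids, start=1)}


print(assign_article_numbers(["a@example", "b@example"], skip_artnum=1000))
```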
2020-06-23 init: add -j / --jobs parameter
On a powerful (by my standards) machine with 16GB RAM and a 7200 RPM HDD marketed for "enterprise" use, indexing an 8.1G (in git) LKML snapshot from Sep 2019 did not finish after 7 days with the default number (3) of Xapian shards (`--jobs=4') and `--batch-size=10m'. Indexing starts off fast, but progressively gets slower as contents of the inbox (including Xapian + SQLite DBs) can no longer be cached by the kernel. Once the on-disk size increased, HDD seek contention between the Xapian shard workers slowed the process down to a crawl. With a single shard, it still took around 3.5 days to index on the HDD. That's not good, but it's far better than not finishing after 7 days. So allow unfortunate HDD users to easily specify a single shard on public-inbox-init. For reference, a freshly TRIM-ed low-end TLC SSD on the SATA II bus on the same machine indexes that same snapshot of LKML in ~7 hours with 3 shards and the same 10m batch size. In the past, a higher-end consumer-grade MLC SSD on similar hardware indexed a similarly-sized data set in ~4 hours.
2020-06-13 doc: update TODO and WIP 1.6.0 release notes
Lots of big changes coming. Thanks to The Linux Foundation for sponsoring me to hack on this in 2020 :)
2020-06-13 imap: support 8000 octet lines
RFC 2683 section 3.2.1.5 recommends it:
> For its part, a server should allow for a command line of at least
> 8000 octets. This provides plenty of leeway for accepting reasonable
> length commands from clients. The server should send a BAD response
> to a command that does not end within the server's maximum accepted
> command length.
To conserve memory, we won't bother reading the entire line before sending the BAD response and disconnecting the client.
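One way to enforce such a limit without buffering an oversized line is sketched below. This is a hypothetical Python illustration, not the server's code; the function name, the return convention and the choice to count the line terminator toward the 8000 octets are all assumptions.

```python
MAX_LINE = 8000  # octets; ASSUMPTION: the line terminator counts toward the limit


def take_command(buf):
    """Inspect buffered bytes and return (action, line, rest).

    action is "line" (a complete command), "more" (keep reading),
    or "bad" (over the limit: send BAD and disconnect).
    """
    end = buf.find(b"\n")
    if 0 <= end < MAX_LINE:
        return "line", buf[:end + 1], buf[end + 1:]
    if end >= 0 or len(buf) >= MAX_LINE:
        # Too long already: give up now instead of reading the rest of it.
        return "bad", b"", b""
    return "more", b"", buf


print(take_command(b"a NOOP\r\n"))
print(take_command(b"x" * 8000)[0])
```

The "bad" case triggers as soon as the buffer reaches the limit without a newline, so memory use stays bounded no matter how much the client sends.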
2020-06-13 imap: support IDLE
It seems to be working as far as Mail::IMAPClient is concerned.
2020-06-13 preliminary imap server implementation
It shares a bit of code with NNTP. It's copy+pasted for now since this provides new ground to experiment with APIs for dealing with slow storage and many inboxes.
2020-06-13 doc: add some IMAP standards
There's more, but IMAP is big and complex already.
2020-06-03 www: remove smsg_mime API and adjust callers
To further simplify callers and avoid embarrassing memory explosions[1], we can finally eliminate this method in favor of smsg_eml. [1] commit 7d02b9e64455831d3bda20cd2e64e0c15dc07df5 ("view: stop storing all MIME objects on large threads") fixed a huge memory blowup.
2020-05-27 learn: support --all with `rm'
I found myself wanting to remove a message from all inboxes while working on a test case in another branch. I figure this could also be useful for globally removing messages which are in the grey area or too big for spamc.
2020-05-18 index: add --batch-size=SIZE option
On powerful systems, having this option is preferable to XAPIAN_FLUSH_THRESHOLD due to lock granularity and contention with other processes (-learn, -mda, -watch). Setting XAPIAN_FLUSH_THRESHOLD can cause -learn, -mda, and -watch to get stuck until an epoch is completely processed.
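To illustrate why a size-based batch gives other writers regular chances to take the lock, here is a small sketch. The helper names are hypothetical, and both the k/m/g suffix handling (powers of 1024) and flushing on accumulated message bytes are assumptions for the example, not a description of the real option parser or indexer.

```python
def parse_size(text):
    """Parse sizes such as '10m' or '500m' into bytes (suffixes are an assumption)."""
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    text = text.strip().lower()
    if text and text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)


def index_in_batches(message_sizes, batch_size):
    """Return the number of messages in each commit."""
    commits, pending_bytes, pending_msgs = [], 0, 0
    for size in message_sizes:
        pending_bytes += size
        pending_msgs += 1
        if pending_bytes >= batch_size:
            # A real indexer would commit and release its lock here,
            # letting -learn, -mda or -watch make progress.
            commits.append(pending_msgs)
            pending_bytes = pending_msgs = 0
    if pending_msgs:
        commits.append(pending_msgs)
    return commits


print(parse_size("10m"))
print(index_in_batches([4, 4, 4, 4, 4], 8))
```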
2020-05-12 rename "ContentId" to "ContentHash"
The old name may be confused with "Content-ID" as described in RFC 2392, so use an alternate name to avoid confusing future readers.
2020-05-10 public-inbox 1.5.0 v1.5.0
2020-05-10 various doc updates ahead of 1.5.0
2020-05-09 replace most uses of PublicInbox::MIME with Eml
PublicInbox::Eml has enough functionality to replace the Email::MIME-based PublicInbox::MIME.
2020-04-27 doc: add clients.txt
Since some client tools exist for dealing with public-inbox specifically, it seems like a good idea to list some of them.
Cc: Danh Doan <congdanhqx@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
Cc: Leah Neukirchen <leah@vuxu.org>