about summary refs log tree commit homepage
path: root/Documentation
DateCommit message (Collapse)
2020-06-28watch: remove Filesys::Notify::Simple dependency
Since we already use inotify and EVFILT_VNODE (kqueue) in -imapd, we might as well use them directly in -watch, too. This will allow public-inbox-watch to use PublicInbox::DS for timers to watch newsgroups/mailboxes and have saner signal handling in future commits.
2020-06-23init: add --skip-artnum parameter
For archivists with only newer mail archives, this option allows reserving reserve NNTP article numbers for yet-to-be-archived old messages. Indexers will need to be updated to support this feature in future commits. -V1 inboxes will now be initialized with SQLite and Xapian support if this option is used, or if --indexlevel= is specified.
2020-06-23init: add -j / --jobs parameter
On a powerful (by my standards) machine with 16GB RAM and an 7200 RPM HDD marketed for "enterprise" use, indexing a 8.1G (in git) LKML snapshot from Sep 2019 did not finish after 7 days with the default number (3) of Xapian shards (`--jobs=4') and `--batch-size=10m'. Indexing starts off fast, but progressively get slower as contents of the inbox (including Xapian + SQLite DBs) could no longer be cached by the kernel. Once the on-disk size increased, HDD seek contention between the Xapian shard workers slowed the process down to a crawl. With a single shard, it still took around 3.5 days to index on the HDD. That's not good, but it's far better than not finishing after 7 days. So allow unfortunate HDD users to easily specify a single shard on public-inbox-init. For reference, a freshly TRIM-ed low-end TLC SSD on the SATA II bus on the same machine indexes that same snapshot of LKML in ~7 hours with 3 shards and the same 10m batch size. In the past, a higher-end consumer grade MLC SSDs on similar hardware indexed a similarly sized-data set in ~4 hours.
2020-06-13doc: update TODO and WIP 1.6.0 release notes
Lots of big changes coming Thanks to The Linux Foundation for sponsoring me to hack on this in 2020 :)
2020-06-13imap: support 8000 octet lines
RFC 2683 section 3.2.1.5 recommends it: > For its part, a server should allow for a command line of at least > 8000 octets. This provides plenty of leeway for accepting reasonable > length commands from clients. The server should send a BAD response > to a command that does not end within the server's maximum accepted > command length. To conserve memory, we won't bother reading the entire line before sending the BAD response and disconnecting them.
2020-06-13imap: support IDLE
It seems to be working as far as Mail::IMAPClient is concerned.
2020-06-13preliminary imap server implementation
It shares a bit of code with NNTP. It's copy+pasted for now since this provides new ground to experiment with APIs for dealing with slow storage and many inboxes.
2020-06-13doc: add some IMAP standards
There's more, but IMAP is big and complex already.
2020-06-03www: remove smsg_mime API and adjust callers
To further simplify callers and avoid embarrasing memory explosions[1], we can finally eliminate this method in favor of smsg_eml. [1] commit 7d02b9e64455831d3bda20cd2e64e0c15dc07df5 ("view: stop storing all MIME objects on large threads") fixed a huge memory blowup.
2020-05-27learn: support --all with `rm'
I found myself wanting to remove a message from all inboxes while working on a test case in another branch. I figure this could also be useful for globally removing messages which are in the grey area or too big for spamc.
2020-05-18index: add --batch-size=SIZE option
On powerful systems, having this option is preferable to XAPIAN_FLUSH_THRESHOLD due to lock granularity and contention with other processes (-learn, -mda, -watch). Setting XAPIAN_FLUSH_THRESHOLD can cause -learn, -mda, and -watch to get stuck until an epoch is completely processed.
2020-05-12rename "ContentId" to "ContentHash"
The old name may be confused with "Content-ID" as described in RFC 2392, so use an alternate name to avoid confusing future readers.
2020-05-10public-inbox 1.5.0 v1.5.0
2020-05-10various doc updates ahead of 1.5.0
2020-05-09replace most uses of PublicInbox::MIME with Eml
PublicInbox::Eml has enough functionality to replace the Email::MIME-based PublicInbox::MIME.
2020-04-27doc: add clients.txt
Since some client tools exist for dealing with public-inbox specifically, it seems like a good idea to list some of them. Cc: Danh Doan <congdanhqx@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Konstantin Ryabitsev <konstantin@linuxfoundation.org> Cc: Leah Neukirchen <leah@vuxu.org>
2020-04-25doc: note some changes for 1.5
As an established project (:P), it's important to document when new features appear in manpages. Users may be reading new documentation online which doesn't reflect an older version they have installed.
2020-04-21index: support --max-size / publicinbox.indexMaxSize
In normal mail paths, we can rely on MTAs being configured with reasonable limits in the -watch and -mda mail injection paths. However, the MTA is bypassed in a git-only delivery path, a BOFH could inject a large message and DoS users attempting to mirror a public-inbox. This doesn't protect unindexed WWW interfaces from Email::MIME memory explosions on v1 inboxes. Probably nobody cares about unindexed WWW interfaces anymore, especially now that Xapian is optional for indexing.
2020-04-20doc: txt2pre: fix URL of git-filter-repo(1)
Probably a typo when doing concatenation.
2020-04-20doc: HACKING: add a bit about faster testing
`make test' is annoyingly slow, and `make check-run' works wonders for improving the edit && test cycle.
2020-04-20watchmaildir: support multiple watchheader values
The watchheader key supports only a single value. Supporting multiple watchheader values was mentioned in discussion [1] of 8d3e3bd8 (doc: explain publicinbox.<name>.watchheader, 2019-10-09), and it wasn't clear if there was a need. One scenario in which matching multiple headers would be convenient is when someone wants to set up public-inbox archives for some small projects but does _not_ want to run mailing lists for them, instead allowing others to follow the project by any of the pull mechanisms. Using a common underlying address, an address alias for each project is configured via a third-party email provider, with messages for each alias being exposed as a separate public-inbox archive. In this setup, messages for an inbox cannot be selected by a List-ID header but can be identified by the inbox's address in either the To or Cc header. To support such a use case, update the watchheader handling to consider multiple values, accepting a message if it matches any value. While selecting a message based on matching _any_ rather than _all_ values is motivated by the above scenario, it's worth noting that the "any" behavior is consistent with how multiple listid config values are handled. [1] https://public-inbox.org/meta/20191010085118.r3amey4cayazfycb@dcvr/
2020-04-19doc: start writeup on semi-automatic memory management
I don't consider Perl's memory management "automatic". Instead, having an extra bit of control as a hacker is nice and there's no need to burden ordinary users with GC tuning knobs.
2020-04-19wwwatomstream: move {emit_header} field to $self
There's no need to pollute the cross-package $ctx with it.
2020-04-17searchthread: reduce indirection by removing container
We can rid ourselves of a layer of indirection by subclassing PublicInbox::Smsg instead of using a container object to hold each $smsg. Furthermore, the `{id}' vs. `{mid}' field name confusion is eliminated. This reduces the size of the $rootset passed to walk_thread by around 15%, that is over 50K memory when rendering a /$INBOX/ landing page.
2020-04-17doc: update 1.4.0 relnotes with date, start 1.5.0
2020-04-17public-inbox 1.4.0
2020-04-13doc: add technical/whyperl
Some people don't like Perl; but it exists, there's no avoiding it with everything that depends on it. And nearly all code still works unmodified after 20 years.
2020-04-13doc: start reproducibility document
Not new ideas, just gathering thoughts.
2020-04-12doc: escape internal ">" in listid code snippet
A code snippet in the listid description is incorrectly rendered as "publicinbox.$NAME.watchheader=List-Id:<foo.example.com"> Escape the closing bracket around the List-Id value to avoid this. Also escape the opening bracket for symmetry/readability.
2020-04-01doc: update notes and HACKING ahead of 1.4 release
There will probably be a 1.4 release in a few days...
2020-03-29index: support --compact / -c on command-line
It's more convenient to specify `-c' / `--compact' on the command-line when reindexing than it is to invoke public-inbox-compact(1) separately. This is especially convenient in low-space situations when public-inbox-index is operating on multiple inboxes sequentially, as compaction can happen immediately after indexing each inbox, instead of waiting until all inboxes are indexed.
2020-03-22rename PublicInbox::SearchMsg => PublicInbox::Smsg
Since the introduction of over.sqlite3, SearchMsg is not tied to our search functionality in any way, so stop confusing ourselves and future hackers by just calling it "PublicInbox::Smsg". Add a missing "use" in ExtMsg while we're at it.
2020-03-20doc: standards: add references to RFC 5322 (and RFC 822)
RFC 5322 is the latest one in this line, but much documentation and even command-line options in other programs (e.g. git) refer to RFC 2822 or even RFC 822.
2020-03-01doc: design_www: document offline friendliness
This isn't anything new and has been a part of the design since the beginning, but it may not be apparent to some folks.
2020-02-27doc: 1.4.0 release notes update
Perhaps 1.4.0 will be a small release, after all (and also smaller in terms of memory use :)
2020-02-24doc: technical: document data structures
Can't code without data structures, and we emphasize data over code just about everywhere.
2020-02-23doc: improve wording of "inbox" vs "repository"
Since v2 inboxes contain multiple git repositories, avoid the use of the word "repository" when referring to inboxes as a whole in most places.
2020-02-17doc: design_www: document solver endpoint
The blob regeneration (solving) part has been stable and performant for over a year with no problems, even with web crawlers constantly hitting it without needing rate limits. All the other stuff is open to bikeshedding (as long as my crappy hardware supports it :P)
2020-02-09doc: update v1.3.0.eml with actual headers, start v1.4.0
Bigger changes coming :>
2020-02-10public-inbox 1.3.0 v1.3.0
2020-02-08doc: more 1.3.0 release notes updates
2020-02-08doc: mark some TODO items as done
NNTP TLS and COMPRESS support and cgit spawning from the WWW interface were implemented last year. Given the lack of syscall number stability guarantee on the OpenBSD and FreeBSD, I don't think supporting a pure-Perl kevent is feasible. Inline::C may still be an option since IO::KQueue is abandoned, though, as it is for some Linux-only syscalls and maybe some POSIX ones not covered by POSIX.pm.
2020-02-08doc: update copyright for standards.perl
It was missing "(C)", so gnulib update-copyright missed it.
2020-02-06treewide: run update-copyrights from gnulib for 2019
I didn't wait until September to do it, this year!
2020-02-06doc: v1: add a reference to git-filter-repo(1), too
The git-filter-branch(1) manpage itself recommends git-filter-repo, nowadays due to performance and safety problems.
2020-02-06doc: txt2pre: auto-linkify manpage references
This can be more convenient for people browsing HTML docs remotely or locally.
2020-02-06doc: remove .x/ subdirectory for Xapian manpages
There's no need to keep Xapian manpage renderings in a separate subdirectory, after all. Eliminating this difference between the local FS and URL path will allow relative URLs to the Xapian manpages in our local HTML documentation to work smoothly, since there was never any ".x/" path component for files served from public-inbox.org
2020-02-06doc: add data flow diagram using Graph::Easy
Maybe this can make it easier for new and potential users to understand what's going on.
2020-02-04doc: recommend -compact after --reindex
It's likely a user will be low on space after running --reindex, so recommend the use of public-inbox-compact afterwards. And add a few more notes about using public-inbox-compact to clarify it's for inboxes-only (and not any old Xapian DBs) that using xapian-compact(1) directly is error-prone and likely to break things.
2020-02-04doc: spellling fixes for manpages
The wording for publicinbox.nntpserver was awkward, too, and I took this as opportunity to hopefully clarify it and favor "hostname" for Internet addresses, because we already use "address" to mean "email address" in the config.