path: root/script
Date Commit message
2020-06-23 init: add --skip-artnum parameter
For archivists with only newer mail archives, this option allows reserving NNTP article numbers for yet-to-be-archived old messages. Indexers will need to be updated to support this feature in future commits. -V1 inboxes will now be initialized with SQLite and Xapian support if this option is used, or if --indexlevel= is specified.
2020-06-23 init: refer to inboxes as "inbox" or "inboxes" in errors
Since V2 uses multiple git repositories, stop using the word "repo" when referring to inboxes.
2020-06-23 init: add -j / --jobs parameter
On a powerful (by my standards) machine with 16GB RAM and a 7200 RPM HDD marketed for "enterprise" use, indexing an 8.1G (in git) LKML snapshot from Sep 2019 did not finish after 7 days with the default number (3) of Xapian shards (`--jobs=4') and `--batch-size=10m'. Indexing starts off fast, but progressively gets slower once the contents of the inbox (including Xapian + SQLite DBs) can no longer be cached by the kernel. Once the on-disk size increased, HDD seek contention between the Xapian shard workers slowed the process down to a crawl. With a single shard, it still took around 3.5 days to index on the HDD. That's not good, but it's far better than not finishing after 7 days. So allow unfortunate HDD users to easily specify a single shard on public-inbox-init. For reference, a freshly TRIM-ed low-end TLC SSD on the SATA II bus on the same machine indexes that same snapshot of LKML in ~7 hours with 3 shards and the same 10m batch size. In the past, a higher-end consumer-grade MLC SSD on similar hardware indexed a similarly-sized data set in ~4 hours.
2020-06-13 imap: start doing iterative config reloading
This will be used to prevent reloading a giant config with tens/hundreds of thousands of inboxes from blocking the event loop.
2020-06-13 preliminary imap server implementation
It shares a bit of code with NNTP. It's copy+pasted for now since this provides new ground to experiment with APIs for dealing with slow storage and many inboxes.
2020-06-08 index: v2: parallel by default
InboxWritable should only set $v2w->{parallel} if the $parallel flag is defined to 0 or 1. We want indexing a new inbox to utilize SMP, just like --reindex. -index once again allows -j0/--jobs=0 to force single-process use, and we'll be ensuring that works in tests to maintain performance on small systems. Fixes: 61a2fff5b34a3e32 ("admin: move index_inbox over")
2020-05-27 learn: support --all with `rm'
I found myself wanting to remove a message from all inboxes while working on a test case in another branch. I figure this could also be useful for globally removing messages which are in the grey area or too big for spamc.
2020-05-27 learn: fix buggy typo on List-ID mapping
There is obviously a typo here, so fix it and add a test case to guard against future regressions. Fixes: 74a3206babe0572a ("mda: support multiple List-ID matches")
2020-05-20 convert: describe the release of fast-import pipes
Upon rereading the code, it wasn't immediately obvious to me why we didn't check for errors with `close($w)' instead of relying on `undef'. So add a comment for the benefit of future readers.
2020-05-19 favor readline() and print() as functions
In our inbox-writing code paths, ->getline as an OO method may be confused with the various definitions of `getline' used by the PSGI interface. It's also easier to do: "perldoc -f readline" than to figure out which class "->getline" belongs to (IO::Handle) and look up documentation for that. ->print is less confusing than the "readline" vs "getline" mismatch, but we can still make it clear we're using a real file handle and not a mock interface. Finally, functions are a bit faster than their OO counterparts.
2020-05-18 index: add --batch-size=SIZE option
On powerful systems, having this option is preferable to XAPIAN_FLUSH_THRESHOLD due to lock granularity and contention with other processes (-learn, -mda, -watch). Setting XAPIAN_FLUSH_THRESHOLD can cause -learn, -mda, and -watch to get stuck until an epoch is completely processed.
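To illustrate what a SIZE value such as the `10m' batch size mentioned elsewhere in this log might expand to, here is a minimal Python sketch. It is not taken from the public-inbox source: the helper name `parse_size`, the accepted suffixes (k, m, g), and the 1024-based multipliers are all assumptions made for illustration.

```python
import re

# Hypothetical helper: the suffix set and the 1024-based multipliers are
# assumptions for illustration, not public-inbox's documented behavior.
_MULTIPLIERS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_size(text):
    """Convert a string such as '10m' into a byte count."""
    match = re.fullmatch(r"([0-9]+)([kmg]?)", text.strip().lower())
    if match is None:
        raise ValueError(f"invalid size: {text!r}")
    return int(match.group(1)) * _MULTIPLIERS[match.group(2)]
```

Under these assumptions, `parse_size("10m")` yields 10 * 1024 * 1024 bytes, and a malformed value raises `ValueError` rather than being silently treated as zero.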
2020-05-12 rename "ContentId" to "ContentHash"
The old name may be confused with "Content-ID" as described in RFC 2392, so use an alternate name to avoid confusing future readers.
2020-05-09 replace most uses of PublicInbox::MIME with Eml
PublicInbox::Eml has enough functionality to replace the Email::MIME-based PublicInbox::MIME.
2020-04-22 make zlib-related modules a hard dependency
This allows us to simplify some of our existing code and make future changes easier. I doubt anybody goes through the trouble to have a Perl installation without zlib support. The zlib source code is even bundled with Perl since 5.9.3 for systems without existing zlib development headers and libraries. Of course, zlib is also a requirement of git, too; and we're not going to stop using git :) [squashed: "wwwaltid: use gzipfilter up front"]
2020-04-21 index: support --max-size / publicinbox.indexMaxSize
In normal mail paths, we can rely on MTAs being configured with reasonable limits in the -watch and -mda mail injection paths. However, since the MTA is bypassed in a git-only delivery path, a BOFH could inject a large message and DoS users attempting to mirror a public-inbox. This doesn't protect unindexed WWW interfaces from Email::MIME memory explosions on v1 inboxes. Probably nobody cares about unindexed WWW interfaces anymore, especially now that Xapian is optional for indexing.
2020-04-20 drop needless `eval {}' around Config->new
It hasn't been needed since commit 089cca37fa036411 ("config: ignore missing config files"). And we actually want to propagate errors when we can't start new processes or if git(1) is missing.
2020-04-19 favor `do {}' over `eval {}' for localized slurp
I did not know to use the return value of `do' back in the day. There's probably no practical difference in these cases, but `eval' is overkill for these uses and may hide actual errors. We can get rid of a few redundant `scalar' ops and pass scalar refs to Email::MIME->new to avoid copies in a few more places, too.
2020-04-19 inboxwritable: mime_from_path: reuse in more places
There's nothing Maildir-specific about the function, so `maildir_path_load' was a bad name. So give it a more appropriate name and use it in our tests. This saves us some code and inconsistency by reusing an existing internal library routine in more places. We can drop the "From_" line in some of our (formerly) mbox sample files.
2020-03-29 index: support --compact / -c on command-line
It's more convenient to specify `-c' / `--compact' on the command-line when reindexing than it is to invoke public-inbox-compact(1) separately. This is especially convenient in low-space situations when public-inbox-index is operating on multiple inboxes sequentially, as compaction can happen immediately after indexing each inbox, instead of waiting until all inboxes are indexed.
2020-02-23 doc: improve wording of "inbox" vs "repository"
Since v2 inboxes contain multiple git repositories, avoid the use of the word "repository" when referring to inboxes as a whole in most places.
2020-02-08 convert: preserve indexlevel on conversions
We don't want to blow up users' storage too badly when converting v1 to v2, or break because they don't have Xapian bindings installed.
2020-02-06 treewide: run update-copyrights from gnulib for 2019
I didn't wait until September to do it, this year!
2020-02-02 convert: fix --no-index switch
The (currently undocumented) "--no-index" flag did not trigger the V2Writable->done call necessary to make the import successful. Fixes: eea47b676127bcdb ("convert: preserve highwater mark from v1 msgmap")
2020-02-02 convert: shift @ARGV explicitly
Relying on implicit "@_" for shift fails with TestCommon::_run_sub iff GetOptions modifies @ARGV.
2020-02-02 convert: remove unused variables capturing :from
Looking at git history, they were never used.
2020-01-31 convert: preserve highwater mark from v1 msgmap
If we're reusing the msgmap from a v1 inbox, we also need to ensure the highwater mark doesn't get doubled in the v1->v2 conversion by internally triggering the equivalent of "--reindex" on a fresh v2 inbox. This was needed to convert an indexed v1 inbox which featured messages with multiple Message-IDs in it. Fresh, unindexed clones of v1 inboxes would not have been affected by this.
2020-01-27 init: use Import::run_die instead of system()
We already load PublicInbox::Import via PublicInbox::InboxWritable, so it's not an extra module to load. This can give us a slight speedup in tests.
2020-01-27 inbox: add ->version method
This allows us to simplify version checking by avoiding "//" or "||" operators sprinkled around.
2020-01-11 make Plack optional for non-WWW and non-httpd users
Some users just want to run -mda, -watch, and/or -nntpd. Let them run just those without forcing them to pull in a bunch of dependencies.
2020-01-06 treewide: "require" + "use" cleanup and docs
There's a bunch of leftover "require" and "use" statements we no longer need and can get rid of, along with some excessive imports via "use". IO::Handle usage isn't always obvious, so add comments describing why a package loads it. Along the same lines, document the tmpdir support as the reason we depend on File::Temp 0.19, even though every Perl 5.10.1+ user has it. While we're at it, favor "use" over "require", since it gives us extra compile-time checking.
2020-01-01 filter/base: export REJECT as a constant
And update callers to use it, as it makes the code a bit cleaner. Probably irrelevant, but it should be faster, too, as "perl -I lib -w -MO=Deparse $FILE" shows REJECT() calls are constant-folded.
2019-12-24 remove "no warnings 'once'" in a few places
We can use "use" to get the namespace into the "BEGIN" phase of the interpreter. While we're at it, use \&coderef syntax explicitly instead of globbing everything.
2019-11-24 check for File::Temp 0.19 for ->newdir method
This is distributed with Perl 5.10.1 and onwards, so it should not be an installation burden for any users. I'm planning to move away from tempdir() entirely and use File::Temp->newdir to remove dependencies on END{} blocks.
2019-11-16 inboxwritable: add ->cleanup method
We've been using this in -edit, and will be using it in some more scripts and tests to optimize for run_mode=2 with run_script. Keeping this in the *Writable modules since I don't see it being useful for the WWW and NNTP read-only interfaces which use PublicInbox::Inbox.
2019-11-16 learn: pass global variables into subs
Avoid 'Variable "%s" will not stay shared' warnings when the contents of this script are eval'ed into a sub.
2019-11-16 mda: pass global variables into subs
Avoid 'Variable "%s" will not stay shared' warnings when the contents of this script are eval'ed into a sub.
2019-11-16 init: pass global variables into subs
Avoid 'Variable "%s" will not stay shared' warnings when the contents of this script are eval'ed into a sub. We also need to rely on ->DESTROY instead of END{} to unlink the lock file on sub exit.
2019-11-16 index: pass global variables into subs
Avoid 'Variable "%s" will not stay shared' warnings when the contents of this script are eval'ed into a sub.
2019-11-16 admin: get rid of singleton $CFG var
PublicInbox::Admin::config() just adds an extra layer of indirection which we barely rely on. So get rid of this global variable and make it easier to run tests in the future without relying on global state.
2019-11-16 edit: use OO API of File::Temp to shorten lifetime
Instead of relying on END{} blocks, rely on ->DESTROY so the temporary files go out-of-scope and system resources get released, sooner.
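The same idea, tying cleanup to scope instead of to process exit, can be sketched in Python as a loose analogue of the Perl change. Nothing below comes from the public-inbox source: the function name `edit_in_scratch`, the file name `draft.eml`, and the upper-casing "edit" step are placeholders.

```python
import os
import tempfile


def edit_in_scratch(text):
    """Write text to a scratch file, 'edit' it, and return the result.

    The scratch directory is removed as soon as the `with` block ends,
    rather than lingering until an exit-time hook runs.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "draft.eml")  # placeholder file name
        with open(path, "w") as fh:
            fh.write(text)
        with open(path) as fh:
            result = fh.read().upper()  # stand-in for a real edit step
    # tmpdir has already been deleted at this point
    return result, os.path.exists(tmpdir)
```

The second return value is only there to show that the directory no longer exists once the function returns, which matters for long-lived processes where exit-time cleanup may never run.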
2019-11-16 edit: pass global variables into subs
Avoid 'Variable "%s" will not stay shared' warnings when the contents of this script are eval'ed into a sub.
2019-11-14 convert: remove duplicated GetOptions() call
We only need to parse the command-line once.
2019-11-14 inboxwritable: drop {-importer} cyclic reference
InboxWritable caching the result of ->importer leads to a circular reference, since the returned (V2Writable|Import) object holds onto the calling InboxWritable object. With public-inbox-watch, this leads to a memory leak if a user is reloading via SIGHUP after a message is imported (it would only become noticeable with SIGHUPs after every message imported). I would not expect anybody to notice this in real-world usage. I only noticed this since I was making -xcpdb suitable for long-lived process use (e.g. "mod_perl style") and a flock remained unreleased on v1 inboxes after resharding. WatchMaildir (used by -watch) already handles caching of the importer object itself, and all of our other real-world uses of ->importer are short-lived or designed for batch scripts, so there's no need to cache the importer result internally.
2019-11-08 edit: check for write errors writing "From_" line
We need to check every print to a regular file for errors, because storage devices inevitably fail.
2019-11-08 edit: propagate correct editor exit code
exit($?) is never correct, since ($? >> 8) is needed to extract the correct exit code, as other information (e.g. a signal) is encoded in $? in addition to the exit code.
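The layout of that status word can be sketched in Python. This assumes the traditional 16-bit POSIX wait status that Perl exposes as $?; `decode_wait_status` is a hypothetical helper written for this illustration.

```python
def decode_wait_status(status):
    """Split a 16-bit POSIX wait status into its parts.

    Returns (exit_code, signal_number, core_dumped). The bit layout is
    the traditional one: high byte = exit code, low 7 bits = signal,
    bit 7 = core dump flag.
    """
    exit_code = (status >> 8) & 0xFF
    signal_number = status & 0x7F
    core_dumped = bool(status & 0x80)
    return exit_code, signal_number, core_dumped
```

For example, an editor exiting with code 3 produces a status of 768 (3 << 8). Passing 768 directly to exit would be truncated to its low 8 bits, which is 0, so the failure would be reported as success.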
2019-10-30 mda: support multiple List-ID matches
While it's not RFC2919-conformant, mail software can theoretically set multiple List-ID headers. Deliver to all inboxes which match a given List-ID since that's likely the intent. Cc: Eric W. Biederman <ebiederm@xmission.com> Link: https://public-inbox.org/meta/87pniltscf.fsf@x220.int.ebiederm.org/
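The matching idea can be sketched in Python with the standard `email` package. This is an illustration only, not the -mda implementation: the function name `destinations`, the shape of the mapping, and the case-insensitive comparison are assumptions.

```python
import re
from email import message_from_string


def destinations(raw_message, list_id_to_inbox):
    """Return every inbox whose List-ID appears in the message.

    `list_id_to_inbox` is a hypothetical dict mapping a lower-cased list
    identifier (the part between angle brackets) to an inbox name.
    """
    msg = message_from_string(raw_message)
    found = []
    # get_all() returns one entry per header occurrence, so repeated
    # List-Id headers are all seen.
    for value in msg.get_all("List-Id", []):
        match = re.search(r"<([^>]+)>", value)
        if match is None:
            continue
        inbox = list_id_to_inbox.get(match.group(1).strip().lower())
        if inbox is not None and inbox not in found:
            found.append(inbox)
    return found
```

A message carrying two List-Id headers that both map to configured inboxes yields both inbox names, in header order, and a header with no configured match is skipped.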
2019-10-30 mda: prepare for multiple destinations
Multiple List-ID headers will be supported in the next commit.
2019-10-30 inboxwritable: add assert_usable_dir sub
And use it for mda, since "0" could be a usable directory if somebody insists on using relative paths...
2019-10-30 mda: skip MIME parsing if spam
We don't want to waste cycles parsing the message for MIME bits if it's spam.
2019-10-30 mda: hoist out mda_filter_adjust
It makes it easier to document that the default -mda behavior is stricter than normal, including "public-inbox-learn ham".