path: root/lib/PublicInbox/WatchMaildir.pm
Date Commit message
2020-09-01 rename WatchMaildir => Watch
This is no longer limited to Maildirs now that IMAP and NNTP support exist; so give it a shorter name.
2020-09-01 watchmaildir: use v5.10.1, drop warnings
Declare 5.10.1 to avoid potential compatibility problems with Perl 7/8 down the line. We'll rely on the command-line to set or drop warnings during development, at least.
2020-09-01 watch: limit batch size of NNTP and IMAP workers, too
We don't want to monopolize locks because processes can easily block each other if using `watchspam' on a Maildir while a big NNTP or IMAP import is happening. This can also happen if somebody configured a single inbox to watch from several sources to merge several mailboxes into one (e.g. both an IMAP and Maildir are watched).
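The idea of bounding how long a writer holds the lock can be sketched as follows. This is a minimal illustration in Python, not the project's Perl code; `batched`, `import_in_batches`, and the default `batch_size` are hypothetical names and values chosen for the example.

```python
import threading
from typing import Callable, Iterable, Iterator, List


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists holding at most `size` items each."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) >= size:
            yield batch
            batch = []
    if batch:
        yield batch


def import_in_batches(
    messages: Iterable[str],
    lock: threading.Lock,
    import_one: Callable[[str], None],
    batch_size: int = 100,  # hypothetical default, not taken from the source
) -> int:
    """Import messages while holding `lock` for one batch at a time.

    Returns how many times the lock was acquired, so other writers
    get a chance to run between batches.
    """
    acquisitions = 0
    for batch in batched(messages, batch_size):
        with lock:  # released at the end of every batch
            acquisitions += 1
            for msg in batch:
                import_one(msg)
    return acquisitions
```

With 250 messages and a batch size of 100, the lock is taken three times instead of being held for the whole import.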
2020-08-28 imaptracker: update_last: simplify callers
By making it a no-op if last_uid is not defined. This isn't a hot code path, so the extra method dispatch isn't an issue. It'll save some indentation/wrapping in future commits.
2020-08-28 watch: flush changes to inbox before updating IMAPTracker
Data needs to hit inboxes, first. Otherwise it's possible to skip messages in case git-fast-import is killed before it sees "done\n". Now, -watch will just waste a little bandwidth in re-downloading a seen message if it's interrupted immediately before updating IMAPTracker.
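The ordering constraint (flush to the inbox first, record progress second) can be sketched like this. It is an illustrative Python model only; the `Tracker` class, `import_then_track`, and the `write`/`flush` callbacks are hypothetical stand-ins for the real IMAPTracker and importer.

```python
from typing import Callable, Iterable, Optional


class Tracker:
    """Hypothetical stand-in that remembers the last UID seen."""

    def __init__(self) -> None:
        self.last_uid: Optional[int] = None

    def update_last(self, uid: Optional[int]) -> None:
        if uid is None:  # no-op when nothing was imported
            return
        self.last_uid = uid


def import_then_track(
    uids: Iterable[int],
    write: Callable[[int], None],
    flush: Callable[[], None],
    tracker: Tracker,
) -> None:
    """Write messages, flush them, and only then record progress."""
    last: Optional[int] = None
    for uid in uids:
        write(uid)
        last = uid
    flush()  # if this raises, the tracker is never advanced
    tracker.update_last(last)
```

If `flush` fails, the tracker keeps its old value, so the next run re-downloads a few messages rather than skipping them.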
2020-08-27 watch: imap: only remove \Seen spam
This matches the behavior of Maildir `watchspam' handling in not removing unseen messages. NNTP can't match this behavior, since NNTP servers don't store flags, clients do.
2020-08-27 watchmaildir: ensure I:/W:/E: prefixes in warnings
For consistency in output, any URL/path-context-dependent prefixes should have the same prefix as the actual warning which triggered it.
2020-08-07 index: v2: --sequential-shard option
This gives better page cache utilization for Xapian indexing on slow storage by improving locality for random I/O activity on the Xapian DB. Instead of indexing both SQLite and Xapian in a single pass, this indexes them separately. The first pass is identical to indexlevel=basic: it indexes both over.sqlite3 and msgmap.sqlite3. Subsequent passes only operate on a single Xapian shard for documents belonging to that shard. Given enough shards, each individual shard can be made small enough to fit into the kernel page cache and avoid HDD seeks for read activity. Doing rough tests with a busy system with a 7200 RPM HDD with ext4, full indexing of LKML (9 epochs) goes from ~80 hours (-j0) to ~30 hours (-j8) with 16GB RAM with 7 shards configured and fsync(2) disabled (--no-sync) and `--batch-size=10m'.
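The pass structure described above can be sketched in Python. This is only a model of the control flow: the callbacks are hypothetical, and assigning a document to a shard by `num % nshards` is an assumption made for the example, not something stated in the commit message.

```python
from typing import Callable, Iterable, List


def index_sequentially(
    doc_nums: Iterable[int],
    nshards: int,
    index_sqlite: Callable[[int], None],
    index_xapian_shard: Callable[[int, int], None],
) -> None:
    """Index SQLite first, then one Xapian shard per pass."""
    nums: List[int] = list(doc_nums)

    # Pass 1: SQLite-only work, comparable to indexlevel=basic.
    for num in nums:
        index_sqlite(num)

    # Passes 2..N+1: touch exactly one shard per pass so its working
    # set has a chance to stay in the page cache.
    for shard in range(nshards):
        for num in nums:
            if num % nshards == shard:  # assumed shard assignment
                index_xapian_shard(shard, num)
```

Each Xapian pass revisits the full document list but writes to a single shard, trading extra reads of the source for better locality on the index.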
2020-08-03 watch: quiet some warnings on spam mailboxes
Email::Address::XS and PublicInbox::MsgTime both emit warnings which are likely to trigger from spam messages. Since this can be configured to remove spam, just filter out those warnings to avoid cluttering up stderr with useless information.
2020-08-02 remove unnecessary ->header_obj calls
We used ->header_obj in the past as an optimization with Email::MIME. That optimization is no longer necessary with PublicInbox::Eml. This doesn't make any functional difference even if we were to go back to Email::MIME. However, it reduces the amount of code we have and slightly reduces allocations with PublicInbox::Eml.
2020-08-02 inboxwritable: rename mime_from_path to eml_from_path
This is more accurate given we use PublicInbox::Eml instead of Email::MIME/PublicInbox::MIME, nowadays.
2020-08-01 improve error handling on import fork / lock failures
v?fork failures seem to be the cause of locks not getting released in -watch. Ensure lock release doesn't get skipped in ->done for both v1 and v2 inboxes. We also need to do everything we can to ensure DB handles, pipes and processes get released even in the face of failure. While we're at it, make failures around `git update-server-info' non-fatal, since smart HTTP seems more popular anyways.
v2 changes:
- spawn: show failing command
- ensure waitpid is synchronous for inotify events
- tear down all fast-import processes on exception, not just the failing one
- beef up lock_release error handling
- release lock on fast-import spawn failure
2020-07-05 watch: don't burn CPU on IDLE failures
Network connections fail and need to be detected sooner rather than later during IDLE to avoid backtrace floods. In case the IDLE process dies completely, don't respawn right away, either, to avoid entering a respawn loop. There's also a typo fix :P
2020-07-02 watch: retry signals to kill IDLE and polling processes
To ensure reliable signal delivery in Perl, it seems we need to repeatedly signal processes which aren't using signalfd (or EVFILT_SIGNAL) with our event loop.
2020-06-30 watch: make waitpid() synchronous for Maildir scans
Maildir scanning still happens in the main process. Scanning dozens of Maildirs is still time-consuming and monopolizes the event loop during WatchMaildir::event_step. This can cause zombies to accumulate before Sigfd::event_step triggers DS::reap_pids.
2020-06-30 watch: ensure SIGCHLD works in forked children
In case our git or spam checker subprocesses spawn subprocesses of their own. We'll also ensure signal handlers are properly setup before unblocking them.
2020-06-30 watch: show path for warnings from spam messages
It could be useful to see warnings generated for known problematic messages just as it is for possibly non-problematic ones.
2020-06-30 watch: check for duplicates in ->over before spamcheck
It's cheaper to check for duplicates than run `spamc' repeatedly when rechecking. We already do this for v1 by using the "ls" command with fast-import, but v2 requires checking against over.sqlite3.
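The "cheap check before expensive check" ordering can be sketched as below. This is an illustrative Python model; `accept_message`, the `already_indexed` set, and the `is_spam` callback are hypothetical stand-ins for the over.sqlite3 lookup and the `spamc` invocation.

```python
from typing import Callable, Set


def accept_message(
    msg_id: str,
    already_indexed: Set[str],
    is_spam: Callable[[str], bool],
) -> bool:
    """Return True if the message should be imported.

    The duplicate lookup runs first because it is cheap; the spam
    checker is only consulted for messages not seen before.
    """
    if msg_id in already_indexed:
        return False  # duplicate: skip without running the spam checker
    return not is_spam(msg_id)
```

On a rescan where most messages are already known, the spam checker is invoked only for the few new ones.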
2020-06-28 watch: simplify internal structures
We won't be attempting to reuse Mail::IMAPClient connections used to check authentication info, for now, so stop storing $self->{mics}. We can also combine $poll initialization for IMAP and NNTP to avoid data structure duplication. Furthermore, rely on autovivification to create {idle_pids} and {poll_pids}.
2020-06-28 watch: support ~/.netrc via Net::Netrc
While git-credential-netrc exists in git.git contrib/, it may not be widely known or installed. Net::Netrc is already a standard part of most (if not all) Perl installations, so use it directly if available.
2020-06-28 watch: use our own "git credential" wrapper
Git.pm may not be installed on some systems; or some users have multiple Perl installations and Git.pm is not available to the Perl running -watch. Accommodate both those types of users by providing our own "git credential" wrapper.
2020-06-28 watch: show user-specified URL consistently.
Since we use the non-ref scalar URL in many error messages, favor keeping the unblessed URL in the long-lived process. This avoids showing "snews://" to users who've specified "nntps://" URLs, since "nntps" is IANA-registered nowadays and what we show in our documentation, while "snews" was just a draft the URI package picked up decades ago.
2020-06-28 watch: add NNTP support
This is similar to IMAP support, but only supports polling. Automatic altid support is not supported yet, but may be in the future.
v2: small grammar fix by Kyle Meyer
Link: https://public-inbox.org/meta/87sgeg5nxf.fsf@kyleam.com/
2020-06-28 watch: just use ->urlmatch
We may just modify PublicInbox::Config->urlmatch in the future to support git <1.8.5, but I wonder if there are enough users on git <1.8.5 to justify it.
2020-06-28 watch: remove {mdir} array
Since we store all watched directory names as keys in %mdmap, there should be no need to keep an array of those directories around. t/watch_maildir*.t required changes to remove trained spam. Once we've trained something as spam, there shouldn't be a need to rescan it.
2020-06-28 watch: support multiple watch: directives per-inbox
Some users will find it useful to merge several Maildir or IMAP mailboxes into one public-inbox. Let them do it, since we've always supported multi-address inboxes.
2020-06-28 watch: imap: be quiet about disconnecting on quit
If ->idle_done was handled successfully, we can just let normal ->DESTROY disconnect and avoid ugly backtraces when a user hits Ctrl-C to take down the process group.
2020-06-28 watch: support imap.fetchBatchSize parameter
IMAP allows retrieving multiple messages with a single command, and Mail::IMAPClient supports that. Unfortunately, it means we slurp multiple messages into memory at once. This option allows users to trade off memory usage to reduce network round-trips. Ideally, we'd support pipelining; but AFAIK no widely installed Perl IMAP library supports it.
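The memory versus round-trip trade-off comes down to how many UIDs go into each fetch command. A minimal Python sketch, assuming the UIDs are already known; `uid_batches` is a hypothetical helper, not part of Mail::IMAPClient or public-inbox:

```python
from typing import Iterator, Sequence


def uid_batches(uids: Sequence[int], batch_size: int) -> Iterator[str]:
    """Yield comma-separated UID sets of at most `batch_size` UIDs.

    A larger batch_size means fewer round-trips but more messages
    held in memory at once; a batch_size of 1 fetches one at a time.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(uids), batch_size):
        chunk = uids[start:start + batch_size]
        yield ",".join(str(uid) for uid in chunk)
```

For example, five UIDs with a batch size of 2 produce three fetch sets: `1,2`, `5,8`, and `9`.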
2020-06-28 watch: avoid long transaction to IMAPTracker
With different polling intervals, multiple processes may simultaneously write to IMAPTracker. This ought to reduce SQLite busy waiting and contention issues when importing many inboxes in parallel.
2020-06-28 imaptracker: add {url} field to reduce args
Passing a $url parameter to every function was error-prone and having {url} field for a short-lived object is appropriate.
This matches the version of IMAPTracker posted by Eric W. Biederman on 2020-05-15 at: https://public-inbox.org/meta/87ftc0c3r4.fsf_-_@x220.int.ebiederm.org/
The version I originally imported was based on the one posted on 2019-10-09: https://public-inbox.org/meta/874l0i9vhc.fsf_-_@x220.int.ebiederm.org/
Cc: Eric W. Biederman <ebiederm@xmission.com>
2020-06-28 ds: add_timer: allow passing arg to callback.
This allows callers to avoid creating expensive closures. We no longer pass the `$now' value to callers, as none of the callers used it.
2020-06-28 watch: use UID SEARCH to avoid empty UID FETCH
For mailboxes with many gaps in the UID sequence, performing a UID SEARCH beforehand can reduce the number of articles to fetch. However, the downside to this is we may end up with an arbitrarily large list of UIDs from the server.
2020-06-28 watch: stop importers before forking
This fixes cases where watch is handling both Maildirs and IMAP connections. While we're at it, close open directories in the IMAP children to save FDs.
2020-06-28 config: support ->urlmatch method for -watch
Since we have IMAP client support in -watch; make sure per-URL settings are familiar to git users by taking advantage of git's URL matching abilities. This requires git 1.8.5+, which most users ought to have (though base CentOS 7 is on 1.8.3).
2020-06-28 watch: support IMAP polling
Not all IMAP servers support IDLE, and IDLE may be prohibitively expensive for some IMAP servers with many inboxes. So allow configuring an imap.$IMAP_URL.pollInterval=SECONDS to poll mailboxes. We'll also need to poll for NNTP servers in the future.
2020-06-28 watch: wire up IMAP IDLE reapers to DS
We can avoid synchronous `waitpid(-1, 0)' and save a process when simultaneously watching Maildirs. One DS bug is fixed: ->Reset needs to clear the DS $in_loop flag in forked children so dwaitpid() fails and allows git processes to be reaped synchronously. TestCommon also calls DS->Reset when spawning new processes, since t/imapd.t uses DS->EventLoop while waiting on -watch to write.
2020-06-28 watch: use signalfd for Maildir watching
We can get rid of the janky wannabe self-using-a-directory-instead-of-pipe thing we needed to work around Filesys::Notify::Simple being blocking. For existing Maildir users, this should be more robust and immune to missed wakeups for signalfd and kqueue-enabled systems; as well as being immune to BOFHs clearing $TMPDIR and preventing notifications from firing. The IMAP IDLE code still uses normal Perl signals, so it's still vulnerable to missed wakeups. That will be addressed in future commits.
2020-06-28 watch: remove Filesys::Notify::Simple dependency
Since we already use inotify and EVFILT_VNODE (kqueue) in -imapd, we might as well use them directly in -watch, too. This will allow public-inbox-watch to use PublicInbox::DS for timers to watch newsgroups/mailboxes and have saner signal handling in future commits.
2020-06-28 watch: preliminary IMAP support
Only servers with IDLE are supported, for now. Polling will be needed since users may need to watch many inboxes with a few active connections due to IMAP server limitations.
2020-06-28 watchmaildir: fix check for spam vs ham inbox conflicts
The old check was ineffective since we process the spam folder config before ham inboxes; and would only fail when attempting to treat the scalar "watchspam" string as an array ref.
2020-06-28 watchmaildir: hoist out compile_watchheaders
It's too deeply indented, and we will be using it for IMAP, too.
2020-05-09 replace most uses of PublicInbox::MIME with Eml
PublicInbox::Eml has enough functionality to replace the Email::MIME-based PublicInbox::MIME.
2020-04-25 watchmaildir: match List-ID case-insensitively
RFC 2919 section 6 states the following: "There is only one operation defined for list identifiers, that of case insensitive equality." So no arguing with that. Now, the other headers are open to interpretation, so put a note about them.
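Case-insensitive equality on list identifiers can be sketched as follows. This Python helper is hypothetical and only illustrates the RFC 2919 comparison rule; it does not reproduce how the Perl code compiles its header matchers.

```python
import re


def list_id_equal(header_value: str, wanted: str) -> bool:
    """Compare a List-ID header value against a wanted identifier.

    The identifier is taken from inside the angle brackets when they
    are present (an optional phrase may precede them), and the
    comparison ignores case.
    """
    match = re.search(r"<([^<>]+)>", header_value)
    ident = match.group(1) if match else header_value
    return ident.strip().lower() == wanted.strip().lower()
```

With this helper, `Meta <META.Public-Inbox.org>` matches a configured `meta.public-inbox.org`.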
2020-04-25 watchmaildir: scan all matching headers
Some headers may appear more than once in a message, so it's probably best to ensure we attempt matches on all of them. This ought to allow matching on Received: or similar because a list lacks List-IDs :P
2020-04-20 watchmaildir: support multiple watchheader values
The watchheader key supports only a single value. Supporting multiple watchheader values was mentioned in discussion [1] of 8d3e3bd8 (doc: explain publicinbox.<name>.watchheader, 2019-10-09), and it wasn't clear if there was a need.

One scenario in which matching multiple headers would be convenient is when someone wants to set up public-inbox archives for some small projects but does _not_ want to run mailing lists for them, instead allowing others to follow the project by any of the pull mechanisms. Using a common underlying address, an address alias for each project is configured via a third-party email provider, with messages for each alias being exposed as a separate public-inbox archive. In this setup, messages for an inbox cannot be selected by a List-ID header but can be identified by the inbox's address in either the To or Cc header.

To support such a use case, update the watchheader handling to consider multiple values, accepting a message if it matches any value. While selecting a message based on matching _any_ rather than _all_ values is motivated by the above scenario, it's worth noting that the "any" behavior is consistent with how multiple listid config values are handled.

[1] https://public-inbox.org/meta/20191010085118.r3amey4cayazfycb@dcvr/
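The "accept if any value matches" rule, applied across every occurrence of a header, can be sketched in Python with the standard `email` package. The function name, the `(header, needle)` rule shape, and the case-insensitive substring comparison are assumptions made for this example; the addresses are placeholders.

```python
from email import message_from_string
from typing import Iterable, Tuple


def matches_any_watchheader(raw: str, rules: Iterable[Tuple[str, str]]) -> bool:
    """Return True if any rule matches any occurrence of its header.

    Each rule is a (header name, needle) pair. Every occurrence of
    the header is checked, since headers such as Cc or Received may
    appear more than once in a message.
    """
    msg = message_from_string(raw)
    for name, needle in rules:
        for value in msg.get_all(name, []):
            if needle.lower() in value.lower():  # assumed comparison
                return True
    return False


raw_message = (
    "To: someone@example.com\n"
    "Cc: a@example.com\n"
    "Cc: proj-b@example.com\n"
    "Subject: hi\n"
    "\n"
    "body\n"
)
rules = [("To", "proj-b@example.com"), ("Cc", "proj-b@example.com")]
print(matches_any_watchheader(raw_message, rules))
```

Here the To rule fails and the first Cc header does not match, but the second Cc header does, so the message is accepted.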
2020-04-19 inboxwritable: mime_from_path: reuse in more places
There's nothing Maildir-specific about the function, so `maildir_path_load' was a bad name. So give it a more appropriate name and use it in our tests. This saves us some code and inconsistency by reusing an existing internal library routine in more places. We can drop the "From_" line in some of our (formerly) mbox sample files.
2020-02-06 treewide: run update-copyrights from gnulib for 2019
I didn't wait until September to do it, this year!
2020-01-11 make Filesys::Notify::Simple optional
It's only used by us in public-inbox-watch, and maybe not for long. It's in most installations because Plack pulls it in though, but Plack is no longer required.
2020-01-11 spawn (and thus popen_rd) die on failure
Most spawn and popen_rd callers die on failure to spawn, anyways, and some are missing checks entirely. This saves us a bunch of verbose error-checking code in callers. This also makes popen_rd more consistent, since it already dies on pipe creation failures.
2020-01-06 treewide: "require" + "use" cleanup and docs
There's a bunch of leftover "require" and "use" statements we no longer need and can get rid of, along with some excessive imports via "use". IO::Handle usage isn't always obvious, so add comments describing why a package loads it. Along the same lines, document the tmpdir support as the reason we depend on File::Temp 0.19, even though every Perl 5.10.1+ user has it. While we're at it, favor "use" over "require", since it gives us extra compile-time checking.