about summary refs log tree commit homepage
path: root/lib/PublicInbox/LeiInput.pm
DateCommit message (Collapse)
2021-10-15lei + ipc: simplify process reaping
Simplify our APIs and force dwaitpid() to work in async mode for all lei workers. This avoids having lingering zombies for parallel searches if one worker finishes soon before another. The old distinction between "old" and "new" workers was needlessly complex, error-prone, and embarrasingly bad. We also never handled v2:// writers properly before on Ctrl-C/Ctrl-Z (SIGINT/SIGTSTP), so add them to @WQ_KEYS to ensure they get handled by $lei when appropropriate.
2021-10-15lei: TSTP affects all curl and related subprocesses
By relying more on pgroups for remaining remaining processes, this lets us pause all curl+tail subprocesses with a single kill(2) to avoid cluttering stderr. We won't bother pausing the pigz/gzip/bzip2/xz compressor process not cat-file processes, though, since those don't write to the terminal and they idle soon after the workers react to SIGSTOP. AutoReap is hoisted out from TestCommon.pm. CLONE_SKIP is gone since we won't be using Perl threads any time soon (they're discouraged by the maintainers of Perl).
2021-10-13lei: use standard warn() in more places
warn() is easier to augment with context information, and frankly unavoidable in the presence of 3rd-party libraries we don't control.
2021-10-02lei mail-diff: diagnostic command to diff mail contents
This is useful in finding the cause of deduplication bugs, and possibly the cause of missing threads reported by Konstantin in <20211001130527.z7eivotlgqbgetzz@meerkat.local> usage: u=https://yhbt.net/lore/all/87czop5j33.fsf@tynnyri.adurom.net/raw lei mail-diff $u
2021-09-19lei/store: use SOCK_SEQPACKET rather than pipe
This has several advantages: * no need to use ipc.lock to protect a pipe for non-atomic writes * ability to pass FDs. In another commit, this will let us simplify lei->sto_done_request and pass newly-created sockets to lei/store directly. disadvantages: - an extra pipe is required for rare messages over several hundred KB, this is probably a non-issue, though The performance delta is unknown, but I expect shards (which remain pipes) to be the primary bottleneck IPC-wise for lei/store.
2021-09-18lei_mail_sync: rely on flock(2), avoid IPC
Since 44917fdd24a8bec1 ("lei_mail_sync: do not use transactions"), relying on lei/store to serialize access was a pointless endeavor. Rely on flock(2) to serialize multiple writers since (in my experience) it's the easiest way to deal with parallel writers when using SQLite. This allows us to simplify existing callers while speeding up 'lei refresh-mail-sync --all=local' by 5% or so.
2021-09-17lei refresh-mail-sync: implicitly remove missing folders
There's no point in keeping mail_sync.sqlite3 entries around if the folder is gone. We do keep saved-search configs around, however, since somebody may decide to blow away a search and start over.
2021-09-09lei_input: provide hint for bare "L:" and "kw:"
I just made this mistake running "lei import" myself, so I figure giving a hint makes sense, here.
2021-09-03lei: ->child_error less error-prone
I was calling "child_error(1, ...)" in a few places where I meant to be calling "child_error(1 << 8, ...)" and inadvertantly triggering SIGHUP in script/lei. Since giving a zero exit code to child_error makes no sense, just allow falsy values to default to 1 << 8.
2021-09-02lei_input: set and prepare watches early
This will be needed as we track changes in real-time, especially for "lei index" since there's no storage involved.
2021-07-25lei rm-watch: new command to support removing watches
Pretty trivial since it just invokes "git-config". It's mainly intended to make shell completion easier.
2021-07-22lei: start implementing inotify Maildir support
This allows lei to automatically note keyword (message flag) changes made to a Maildir and propagate it into lei/store: lei add-watch --state=tag-ro /path/to/Maildir This doesn't persist across restarts, yet. In the future, it will be applied automatically to "lei q" output Maildirs by default (with an option to disable it). State values of tag-rw, index-<ro|rw>, import-<ro|rw> will all be supported for Maildir. This represents a fairly major internal change that's fairly intrusive, but the whole daemon-oriented design was to facilitate being able to automatically monitor (and propagate) Maildir/IMAP flag changes.
2021-06-17lei_input: prefix bare Maildir paths w/ "maildir:"
This will simplify upcoming code for watches.
2021-06-14lei_input: allow keywords when importing 1 file from Maildir
This will eventually be useful for supporting inotify watches on Maildir. It will also allow users to script their own FS watchers more easily.
2021-06-12lei ls-mail-source: list IMAP folders and NNTP groups
While other tools can provide the same functionality, having integration with git-credential is convenient, here. Caching and completion will be implemented separately.
2021-06-08lei import: speed up repeated Maildir imports
On a 4-core CPU, this speeds up "lei import" on a largish Maildir inbox with 75K messages from ~8 minutes down to ~40s. Parallelizing alone did not bring any improvement and may even hurt performance slightly, depending on CPU availability. However, creating the index on the "fid" and "name" columns in blob2name yields us the same speedup we got. Parallelizing IMAP makes more sense due to the fact most IMAP stores are non-local and subject to network latency. Followup-to: bdecd7ed8e0dcf0b45491b947cd737ba8cfe38a3 ("lei import: speed up kw updates for old IMAP messages")
2021-05-23lei_input: fix canonicalization of Maildirs for sync
This is needed for the upcoming "lei export-kw"
2021-05-16lei_input: drop misplaced word from error message
2021-05-05lei rediff: regenerate diffs from stdin
Sometimes a mailed patch is generated with non-ideal output, (lacking context, noisy whitespace changes, etc.), or a user wants to use the same external diff viewer they've configured git to use. Since we have SolverGit to regenerate arbitrary blobs from patches; this new command allows us to regenerate a diff with different options using the blobs SolverGit gives us. The amount of git-diff(1) options is mind numbing, so it's likely I missed some favorites or botched the getopt spec translation. This also fixes Inbox::base_url to check psgi.url_scheme before attempting to generate URLs and avoid uninitialized variable warnings. Oddly, the "lei blob" tests did not trigger these uninitialized warnings. Note: this will automatically import+index the message(s) it's regenerating, because solver relies on being able to lookup pre/postimage OIDs and read blobs.
2021-05-04lei index: new command to index mail w/o git storage
Since completely purging blobs from git is slow, users may wish to index messages in Maildirs (and eventually other local storage) without storing data in git. Much code from LeiImport and LeiInput is reused, and a new dummy FakeImport class supplies a non-storing $im->add and minimize changes to LeiStore. The tricky part of this command is to support "lei import" after a message has gone through "lei index". Relying on $smsg->{bytes} == 0 (as we do for external-only vmd storage) does not work here, since it would break searching for "z:" byte-ranges when not using externals. This eventually required PublicInbox::Import::add to use a SharedKV to keep track of imported blobs and prevent duplication.
2021-05-03lei_input: reject --mail-sync if using HTTP(S) for now
I'm not sure how we'll distinguish JMAP vs read-only HTTPS, yet; but we'll focus on currently-supported stuff, first.
2021-05-03lei_input: common net_merge_all_done for lei <import|tag>
I suspect there'll be more lei_input-only things in the future.
2021-05-01lei import: fix --mail-sync handling in LeiInput
"lei inspect" also shows "mail-sync" as a field name
2021-04-30lei: IMAP .onion support via --proxy=s switch
Mail::IMAPClient provides the ability to pass a pre-connected Socket to it. We can rely on this functionality to use IO::Socket::Socks in place whatever socket class Mail::IMAPClient chooses to use. The --proxy=s is shared with curl(1), though we only support socks5h:// at the moment. Is there any need for SOCKS4 or SOCKS5 without name resolution? Tor .onions require socks5h:// for name resolution and to prevent data leakage.
2021-04-30lei import: support UIDVALIDITY in IMAP URL
Specifying a UIDVALIDITY value allows the user to enforce a strict match and force failure. This necessitated changes to NetReader to allow die() and make error reporting more suitable for CLI usage rather than daemonized usage of -watch.
2021-04-30lei import: avoid IMAPTracker, use LeiMailSync more
IMAPTracker has a UNIQUE constraint on the `url' column, which may cause compatibility and/or rollback problems in attempting to deal with UIDVALIDITY changes. Having multiple sources of truth leads to confusion and bugs, so relying on LeiMailSync exclusively ought to simplify things. Furthermore, since LeiMailSync is only written to by LeiStore, it is safer in that it won't mark a UID or article as imported until git-fast-import has seen it, and the SQLite commit always happens after "done\n" is sent to fast-import. This mostly reverts recent commits to IMAPTracker to support lei, those are: 1) commit 7632d8f7590daf70c65d4270e750c36552fa9389 ("net_reader: restart on first UID when UIDVALIDITY changes") 2) commit 311a5d37ad275cd75b1e64d87827c4d13fe4bfab ("imap_tracker: prepare for use with lei"). This means public-inbox-watch will not change between 1.6 and 1.7: -watch stops synching a folder when UIDVALIDITY changes.
2021-04-26lei_input: support PublicInbox::WWW mboxrd URLs
This gives "lei import", "lei tag", and similar commands the ability to use URLs recognized by our PSGI frontend directly. This is more convenient than an equivalent shell pipeline since "set -o pipefail" is not portable and errors may be lost.
2021-04-24lei import: keep sync info for Maildir and IMAP folders
We aren't using it, yet, but the plan is to be able to use this information to propagate keyword changes back to IMAP and Maildir folders using some to-be-implemented command. "lei inspect" is a half-baked new command to make testing this change easier. It will be updated to support more SQLite+Xapian introspection duties in the future, including public-inbox things independent of lei.
2021-04-24lei_input: drop outdated comment w.r.t. compression
Followup-to: 49b036771ef3bf45 ("lei_input: support compressed mboxes")
2021-04-23lei import: support adding keywords and labels on import
This saves some work and makes it easier to set volatile metadata on a message at import time.
2021-04-05lei: maildir: move shard support to MdirReader
We'll eventually want lei_input users like "lei import" and "lei tag" to support parallel reads.
2021-03-31lei_input: reduce IPC traffic with multiple inputs
No point in sending a command for every input when a single one will do. We'll also trigger LeiStore->done sooner in the worker rather than later.
2021-03-29lei_input: support compressed mboxes
Since "lei q" and "lei convert" already support writing these compressed inboxes, it makes sense that all mbox readers support them, as well. Using compression is one reliable way to know an mboxrd or mboxo hasn't been unexpectedly truncated.
2021-03-29lei_input: treat ".eml" and ".patch" suffix as "eml"
".eml" is a suffix supported by (/usr/local)/etc/mime.types on Debian and FreeBSD systems using the "mime-support" package. ".patch" is what "git format-patch" generates by default since git v1.5.0 in 2007.
2021-03-29lei_input: avoid special case sub for --stdin
We can consistently open /dev/stdin correctly nowadays, so drop the input_stdin and just use the normal ->path_to_fd code path.
2021-03-26lei: support /dev/fd/[0-2] inputs and outputs in daemon
Since lei-daemon won't have the same FDs as the client, we need to special-case thse mappings and won't be able to open arbitrary, non-standard FDs. We also won't attempt to support /proc/self/fd/[0-2] since that's a Linux-ism. /dev/fd/[0-2] and /dev/std{in,out,err} are portable to FreeBSD, at least. mawk(1) also supports /dev/std{out,err}, as does gawk(1) (which supports everything we can support, and arbitrary /dev/fd/$FD).
2021-03-24lei_input: more common code between <mark|convert|import>
"lei convert" is actually a bit of the odd one, since it uses lei2mail for auth, unlike the others.
2021-03-24lei: hide *_atfork_child from command-line
Otherwise we could get non-sensical results if somebody tries running "lei atfork_child" from the command-line.
2021-03-24lei mark: command for (un)setting keywords and labels
Only tested for keywords and labels with file inputs, so far; but it seems to do what it needs to do. There's a bit more redundant code than I'd like, and more opportunities for code sharing in the future "lei import" will be expanded to support +kw:$KEYWORD and +L:$LABEL in the future.
2021-03-23lei_input: drop "From " line on single "eml" (message/rfc822)
This matches the long-standing behavior of public-inbox-mda, public-inbox-learn and our other tools. It is useful because mutt, "git format-patch", and likely other tools will pipe a single message with a "From " header line, but with no further "From " escaping or Content-Length: header.
2021-03-23lei_input: common filehandle reader for eml + mbox
This improve code regularity, and will let us deal with the "RFC822" messages with "From " line that mutt pipes to.
2021-03-23mbox_reader: add ->reads method to avoid nonsensical formats
Relying on UNIVERSAL::can may cause internal helper methods to be used, which can lead to failures or nonsensical results.
2021-03-23lei: share input code between convert and import
These commands accept mail the same way, and this forces us to maintain consistent input format support between commands. We'll be using this for "lei mark", too.