about summary refs log tree commit homepage
path: root/lib/PublicInbox/LeiToMail.pm
DateCommit message (Collapse)
2024-04-17lei: use async barrier for --import-before
Write barriers can take a long time to finish, especially when commands are issues in parallel. So handle it asynchronously without blocking lei-daemon by making EOFpipe a little more flexible by supporting arguments to the callback function. This is another step towards improving parallel use of lei.
2024-04-17lei: use ->barrier to commit to lei/store
barrier (synchronous checkpoint) is better than ->done with parallel lei commands being issued (via '&' or different terminals), since repeatedly stopping and restarting processes doesn't play nicely with expensive tasks like `lei reindex'. This introduces a slight regression in maintaining more processes (and thus resource use) when lei is idle, but that'll be fixed in the next commit.
2024-03-10lei: prevent empty {bytes} field in saved search
Noticed while tracking down fast-import crash bug report. Link: https://public-inbox.org/meta/CAL_JsqK7P4gjLPyvzxNEcYmxT4j6Ah5f3Pz1RqDHxmysTg3aEg@mail.gmail.com/
2024-02-01lei convert: explicitly allow --sort for inputs
LeiToMail can't sort v2 output, but sorting MH input (and NNTP spool + mlmmj archives) numerically makes sense.
2024-01-31lei_to_mail: improve SIGPIPE handling
I can't reproduce this in t/lei-sigpipe.t with GIANT_INBOX_DIR. In real-world usage, having a large `lei q -f text ...' output piped to a pager and killing the pager prematurely could trigger: non-fatal error from PublicInbox::LeiToMail $?=256 messages in my terminal. This is because $self->{lei} was becoming undefined in the process cleanup process of git_to_mail. So flip the cleanup logic around and unconditionally check for Git::cleanup state to bail out early. With this change, the `non-fatal error ...' message no longer appears when I stop reading results early.
2024-01-11lei_to_mail: show supported mbox formats on error
Users may accidentally or unknowingly write `mbox' and not know we support 4 incompatible mbox variants.
2023-12-30lei: support reading MH for convert+import+index
The MH format is widely-supported and used by various MUAs such as mutt and sylpheed, and a MH-like format is used by mlmmj for archives, as well. Locking implementations for writes are inconsistent, so this commit doesn't support writes, yet. inotify|EVFILT_VNODE watches aren't supported, yet, but that'll have to come since MH allows packing unused integers and renaming files.
2023-12-16lei: use ->child_error API properly
I noticed this bug while developing another feature and tests were getting SIGHUP (since SIGHUP == 1 on most systems).
2023-11-22lei_to_mail: don't close STDOUT unless it is a mbox* output
We only care about error checking when stdout is an mbox output pointed to a pathname. This is noticeable with `lei up' with multiple non-mbox* destinations. We'll also ensure throwing exceptions to trigger lei->x_it from lei->do_env results in the epoll/kqueue watch being discarded, otherwise commands may never terminate (leading to stuck tests)
2023-11-16lei q|up|convert: common finish_output to detect errors
We need to consistently check the exit code of pigz|gzip|xz|bzip2 when writing to compressed mboxes (or bad storage).
2023-11-16lei: avoid extra fork for v2 outputs
We've always forced LeiToMail to only have one process for v2 outputs anyways since v2 has its own sharding and IPC. Thus we can use the single LeiToMail process directly to avoid extra IPC overhead.
2023-11-16lei convert: fix repeat and idempotent v2 output
We should be able to treat v2 outputs just like any other mail format, with the exception that content dedupe is always enforced by the v2 format. This allows users hosting v2 public-inboxes to catch up broken synchronization from alternate archives such as the mbox archives hosted by https://lists.gnu.org/ Link: https://public-inbox.org/meta/20231114-hypersonic-papaya-starling-e1cfc8@nitro/
2023-11-13treewide: update read_all to avoid eof|close checks
read_all can be expanded to support FIFOs/pipes/sockets where read-until-EOF behavior is desired. We can also rely on wantarray to support splitting on EOL markers, but it's hard-coded to support only `$/ eq "\n"' since (AFAIK) it's the only way we use the wantarray form `readline'.
2023-11-11mda|learn|watch: support dropUniqueUnsubscribe config
List-Unsubscribe headers with unique identifiers (such as those generated by our examples/unsubscribe.milter) should not end up in public archives. Add a new config knob to strip List-Unsubscribe headers if they have the `List-Unsubscribe-Post: List-Unsubscribe=One-Click' header. Unfortunately, this breaks DKIM signatures if the signature covers either of these List-Unsubscribe* headers. However, breaking DKIM is the lesser evil compared to any archive reader being able to stop archival by an independent archivist. As much as I would like this to be the default, it probably affects few users at the moment since very few mailing lists use unique identifiers in List-Unsubscribe (but that number has grown, recently).
2023-11-07lei: fix SIGPIPE on large result sets to pager
When dealing with large search results, we need to deal with EPIPE not just from the pager, but also EPIPE or ECONNRESET between lei_xsearch and lei2mail processes. Without this fix, lei_xsearch processes could linger and get stuck writing to dead lei2mail processes if a user aborts the pager early during a large result set. To ensure lei_xsearch processes don't linger around after lei2mail workers all die, we must close $l2m->{-wq_s2} before spawning lei_xsearch processes, since $l2m->{-wq_s2} is only used in lei2mail workers. For `git cat-file' processes, we also need to trigger PublicInbox::Git->close to handle unpredictable destructor ordering to avoid using uninitialized IO refs. This combines with the `git_to_mail' change to deal with process cleanup handling from premature shutdowns. To test all this, we can't just rely on a single message being large, but also need to rely on the result set being large enough to saturate the lei_xsearch -> lei2mail socket so we rely on GIANT_INBOX_DIR once again.
2023-11-03treewide: check alternates writes with eof + autodie
We must use eof() combined with close() to detect errors in situations involving the readline()' op since `readline' (and most buffered I/O libraries) have weak error detection support. This fixes error detection for files opened for read/write access. The next commit will fix error detection for read-only files.
2023-11-03replace ProcessIO with untied PublicInbox::IO
This fixes two major problems with the use of tie for filehandles: * no way to do fcntl, stat, etc. calls directly on the tied handle, forcing callers to use the `tied' perlop to access the underlying IO::Handle * needing separate classes to handle blocking and non-blocking I/O As a result, Git->cleanup_if_unlinked, InputPipe->consume, and Qspawn->_yield_start have fewer bizzare bits and we can call `$io->blocking(0)' directly instead of `(tied *$io)->{fh}->blocking(0)' Having a PublicInbox::IO class will also allow us to support custom read buffering which allows inspecting the current state.
2023-10-11treewide: consolidate "From " line removal
Aside from our prior import bugs (fixed in a0c07cba0e5d8b6a (mda: drop leading "From " lines again, 2016-06-26)), we'll always have to be dealing with mutt piping messages to us and `git format-patch' output. So just share the regexp so we can use it everywhere. In may be desirable to allow importing messages with a leading "From " line for FUSE, even. Additionally, some instances of this regexp needlessly added optional `\r?' (CR) checks ahead of the `\n' (LF) element; but they're pointless anyways since [^\n]* is enough to exclude all non-LF bytes.
2023-10-08process_io: pass args to awaitpid as list
Specifying {cb_args} in the options hash felt awkward to me. Instead, just use the Perl stack like we do with awaitpid() and pass the list down directly.
2023-10-08rename ProcessPipe to ProcessIO
Since we deal with pipes (of either direction) and bidirectional stream sockets for this class, it's better to remove the `Pipe' from the name and replace it with `IO' to communicate that it works for any form of IO::Handle-like object tied to a process.
2023-10-04lei: get rid of l2m_progress PktOp callback
We already have an ->incr callback we can enhance to support multiple counters with a single request. Furthermore, we can just flatten the object graph by storing counters directly in the $lei object itself to reduce hash lookups.
2023-10-01lei_to_mail: avoid uninitialized vars for l2m_progress
nr_write may be undef if nothing was written due to dedupe, but with seen messages.
2023-10-01git: use Unix stream sockets for `cat-file --batch-*'
The benefit of 1MB potential pipe buffer size (on Linux) doesn't seem noticeable when reading from git (unlike when writing to v2 shards), so Unix stream sockets seem fine, here. This allows us to simplify our process management by using the same socket FD for reads and writes and enables us to use our ProcessPipe class for reaping (as we can do with Gcf2Client). Gcf2Client no longer relies on PublicInbox::DS for write buffering, and instead just waits for requests to complete once the number of inflight requests hits the MAX_INFLIGHT threshold as we do with PublicInbox::Git. We reuse the existing MAX_INFLIGHT limit (18) that was determined by the minimum allowed PIPE_BUF (512). (AFAIK) Unix stream sockets have no analogy to PIPE_BUF, but all *BSDs and Linux I've checked have default SO_RCVBUF and SO_SNDBUF values larger than the previously-required PIPE_BUF size of 512 bytes.
2023-09-28lei: don't gzip --rsyncable by default for mbox*
Using and memoizing the usability of `--rsyncable' is unsafe since pigz (or GNU gzip) can be uninstalled and leave a user with a non-rsync-aware gzip implementation in the long-running daemon. So we stop passing --rsyncable by default to pigz/gzip and no longer attempt to check for it (since it was a TOCTTOU error, anyways). Specifying --rsyncable explicitly didn't work, either, and ended up passing `1' to the gzip/pigz argv :x Finally, we now test --rsyncable on the CLI by adding support for it in `lei convert' and testing it in t/lei-convert.t
2023-09-28spawn: add popen_wr support
This makes interesting parts of our code easier to read IMHO. We can take advantage of `local' while avoiding `fileno' calls since it's called in spawn() anyways to reduce LoC even further.
2023-09-22lei_to_mail: drop awkward duplication of $lei object
Our awaitpid API now exists and ProcessPipe uses it, so it's immune to cyclic references. Thus there's no need to create a duplicate of the lei object to prevent leaks.
2023-09-22lei: improve ->fail internal API
Allow the exit code to be the first argument intead of the last to match our ->child_error, as well as the BSD err(3) API. We'll also avoid shifting user-passed exit codes so $? can be passed as-is without losing signal information.
2023-02-22treewide: simplify File::Path mkpath/make_path callers
File::Path already accounts for the existence of directories, handles races from redundant mkdir(2), and croaks on unrecoverable errors. So there's no point in doing any of that on our end. Furthermore, avoiding the overhead of loading File::Path doesn't seem worth it to save 20-60ms given the overhead of loading our other code. Instead, try to reduce optree overhead on our code, instead, since File::Path gets used in a bunch of places. We'll also favor the newer make_path for multi-directory invocations to avoid bloating our own optree to create an arrayref, but mkpath is one fewer subroutine call within File::Path itself, right now.
2023-01-30ipc: drop awaitpid_init to avoid circular refs
This brings t/lei-index.t back down from ~8 to ~3s. I didn't notice this before was because the LeiNoteEvent timer was firing every 5s and clearing circular refs and parallel testing meant the delay got hidden. Fixes: 4a2a95bbc78f99c8 (ipc+lei: switch to awaitpid, 2023-01-17)
2023-01-18ipc+lei: switch to awaitpid
This avoids awkwardly stuffing an arrayref into callbacks which expect multiple arguments. IPC->awaitpid_init now allows pre-registering callbacks before spawning workers.
2023-01-18ds: introduce awaitpid, switch ProcessPipe users
awaitpid is the new API which will eventually replace dwaitpid. It enables early registration of callback handlers. Eventually (once dwaitpid is gone) it'll be able to use fewer waitpid calls. The avoidance of waitpid(-1) in our earlier days was driven by the belief that threads may eventually become relevant for Perl 5, but that's extremely unlikely at this stage. I will still introduce optional threads via C, but they definitely won't be spawning/reaping processes. Argument order to callbacks is swapped (PID first) to allow flattened multiple arguments more natrually. The previous API (allowing only a single argument, as influenced by pthread_create(3)) was more tedious as it involved packing multiple arguments into yet another array.
2022-09-30lei_to_mail: propagate errors to script/lei
We need to rely on lei->fail to propagate errors in lei workers to the script/lei client, otherwise tests and other scripts can stumble forward with incomplete/incorrect/broken outputs. This helps me focus on occasional t/lei-up.t failures I see on CentOS 7.x where OverIdx->adj_counter fails on "lei up --all"...
2022-09-10lei: bail out earlier on IMAP writer failures
Excessive IMAP connections can overload IMAP servers and cause clients to be disconnected without diagnostic messages. Use $lei->fail on these exceptions to propagate errors to the CLI ASAP to avoid further errors down the line. This ought to make problems more apparent for users using IMAP destinations. Reported-by: Ricardo Ribalda <ribalda@chromium.org> Link: https://public-inbox.org/meta/CANiDSCsDfutAUMBLPZbxdyka+_jnhv+4YNYdL9QPRoC=wNUGCQ@mail.gmail.com/
2022-07-07lei: track seen messages to note duplicates
This may help track down deduplication or other bugs in lei which lead to occasionally missing messages. Link: https://public-inbox.org/meta/CAL_JsqJH8xx_2NyZffNsRXbGXiv3kjmCETvKXt3Yfb0uToLm9Q@mail.gmail.com/
2021-10-30lei_to_mail: avoid SEGV on worker exit via SIGTERM
->DESTROY ordering via "exit()" calls is tricky, and dedupe checks were causing problems. AFAIK, this only affects users who manually enable WAL on lei/store/ei*/over.sqlite3. Fortunately, there is no data corruption as a result even though "read-only" WAL requires write permissions.
2021-10-30lei_to_mail: limit workers for text, reply and v2 outputs
"text" and "reply" outputs are intended for the pager, so parallelizing them is a waste of resources. v2 has shards, of course, so parallelizing writes to it is also a waste since the deduplication work is a bit more complex.
2021-10-26lei_to_mail: only run lms_write_prepare for IMAP+Maildir
Mail synchronization in lei_to_mail only works for IMAP and Maildir; so don't waste time preparing mbox* writers for it.
2021-10-25lei_to_mail: write directly to mail_sync.sqlite3
No need to go through the lei/store process when we write mail_sync.sqlite3. This ought to reduce ENOBUFS errors (and the sleep workaround) on RAM-starved systems.
2021-10-22lei: use RENAME_NOREPLACE on Linux 3.15+
One syscall is better than two for atomicity in Maildirs. This means there's no window where another process can see both the old and new file at the same time (link && unlink), nor a window where we might inadvertantly clobber an existing file if we were to do `stat && rename'.
2021-10-16lei_to_mail: quiet down abort messages
We don't need to flood the terminal with "W: $oid is (!= blob)\n" messages when somebody nukes a git cat-file process from under us.
2021-10-15lei + ipc: simplify process reaping
Simplify our APIs and force dwaitpid() to work in async mode for all lei workers. This avoids having lingering zombies for parallel searches if one worker finishes soon before another. The old distinction between "old" and "new" workers was needlessly complex, error-prone, and embarrasingly bad. We also never handled v2:// writers properly before on Ctrl-C/Ctrl-Z (SIGINT/SIGTSTP), so add them to @WQ_KEYS to ensure they get handled by $lei when appropropriate.
2021-10-15lei: TSTP affects all curl and related subprocesses
By relying more on pgroups for remaining remaining processes, this lets us pause all curl+tail subprocesses with a single kill(2) to avoid cluttering stderr. We won't bother pausing the pigz/gzip/bzip2/xz compressor process not cat-file processes, though, since those don't write to the terminal and they idle soon after the workers react to SIGSTOP. AutoReap is hoisted out from TestCommon.pm. CLONE_SKIP is gone since we won't be using Perl threads any time soon (they're discouraged by the maintainers of Perl).
2021-10-10lei_to_mail: show --output on augment progress failure
Just in case it fails when there's many parallel invocations.
2021-09-25lei: make pkt_op easier-to-use and understand
Since switching to SOCK_SEQUENTIAL, we no longer have to use fixed-width records to guarantee atomic reads. Thus we can maintain more human-readable/searchable PktOp opcodes. Furthermore, we can infer the subroutine name in many cases to avoid repeating ourselves by specifying a command-name twice (e.g. $ops->{CMD} => [ \&CMD, $obj ]; can now simply be written as: $ops->{CMD} => [ $obj ] if CMD is a method of $obj.
2021-09-25lei2mail: augment_inprogress: guard against closed FDs
I'm not sure what caused it, but $err was undef and caused print to fail, leading to an event loop error. Guard the timer with an eval and assume warn() can't trigger an event loop failure.
2021-09-21lei q: show progress on >1s preparation phase
Overwriting existing destinations safe (but slow) by default, so show a progress message noting what we're doing while a user waits.
2021-09-19lei/store: use SOCK_SEQPACKET rather than pipe
This has several advantages: * no need to use ipc.lock to protect a pipe for non-atomic writes * ability to pass FDs. In another commit, this will let us simplify lei->sto_done_request and pass newly-created sockets to lei/store directly. disadvantages: - an extra pipe is required for rare messages over several hundred KB, this is probably a non-issue, though The performance delta is unknown, but I expect shards (which remain pipes) to be the primary bottleneck IPC-wise for lei/store.
2021-09-18lei up: automatically use dt: for remote externals
Since we can't use maxuid for remote externals, automatically maintaining the last time we got results and appending a dt: range to the query will prevent HTTP(S) responses from getting too big. We could be using "rt:", but no stable release of public-inbox supports it, yet, so we'll use dt:, instead. By default, there's a two day fudge factor to account for MTA downtime and delays; which is hopefully enough. The fudge factor may be changed per-invocation with the --remote-fudge-factor=INTERVAL option Since different externals can have different message transport routes, "lastresult" entries are stored on a per-external basis.
2021-09-18lei_mail_sync: rely on flock(2), avoid IPC
Since 44917fdd24a8bec1 ("lei_mail_sync: do not use transactions"), relying on lei/store to serialize access was a pointless endeavor. Rely on flock(2) to serialize multiple writers since (in my experience) it's the easiest way to deal with parallel writers when using SQLite. This allows us to simplify existing callers while speeding up 'lei refresh-mail-sync --all=local' by 5% or so.
2021-09-11lei q|lcat: support "-f reply" output format
When composing replies in "git format-patch" cover letters, I'd been relying on "lei q -f text ...", but that still requires several steps to make it suitable for composing a reply: * s/^/> / to quote the body * drop existing In-Reply-To+References * s/^Message-ID:/In-Reply-To:/; * add an attribute line ... "lei q -f reply" takes care of most of that and users will only have to trim "From " lines, unnecessary results and over-quoted text (and trimming is likely less error-prone than doing all the steps above manually). This should also be a good replacement for "git format-patch --in-reply-to=...", since copying long Message-IDs can be error-prone (and this lets you include quoted text in replies).