Date | Commit message (Collapse) |
|
This fixes a performance regression in multi-process v2 indexing
due to the switch to PublicInbox::IPC. While Unix sockets are
fewer FDs to manage, pipes allow unprivileged processes to use
larger buffers (up to 1M) on out-of-the-box Linux instances.
A larger buffer via F_SETPIPE_SZ afforded by pipes was proven
valuable during v2 development in 2018 and continues to be
valuable when we get significant amounts of one-way traffic from
the producer parent to worker children.
Compression may be an option for systems without F_SETPIPE_SZ;
but it increases CPU usage with no memory bandwidth savings on
hosts where larger buffers are available.
|
|
We don't need to be keeping the raw message around after it hits
git. Shard work now relies on Storable (or Sereal) and all of
the indexing code relies on the Email::MIME-like API of Eml to
access interesting parts of the message.
Similarly, smsg->{raw_bytes} is no longer carried around and we
do the CRLF adjustment when setting smsg->{bytes}.
There's also a small simplification to t/import.t while
we're in the area to use xqx instead of spawn/popen_rd.
|
|
Since Storable and Sereal are designed for lossless
serialization, we'll just pass $eml objects to whatever process
is running SearchIdx.
|
|
We can remove some now-pointless wrapper functions by using
->ipc_do in even more places.
|
|
It's nice to prove the new code works by swapping it into
the current V2Writable / SearchIdxShard packages. This is
only the first step for the core bits, and we'll be able
to delete more code in a subsequent patch.
|
|
Fix some comments and add some short summary descriptions to
hopefully make things easier-to-follow.
|
|
While Gcf2Client is designed to mimic what git-cat-file writes
to stdout, its request format is different to support requests
with a git repository path included.
We'll highlight the distinction and make the GitAsyncCat support
code easier-to-follow as a result.
Since Gcf2Client relies on DS, we can rely on DS-specific code
here, too, and use a single Unix socket instead of separate
input and output pipes, reducing memory overhead in both users
and kernel space. Due to the interactive nature of requests and
responses, the buffer size limitations of Unix sockets on Linux
seems inconsequential here (just like it is for existing "git
cat-file --batch" use).
|
|
The daemon needs to flush stdout before disconnecting or killing
clients, otherwise they may reread empty data on redirected
outputs. We also don't want to unbuffer stdout too early in
case we have lots of small chunks of data to output.
The received ($self->{2}) will always have autoflush, matching normal
STDERR behavior.
|
|
We'll always be transferring stdin, stdout, and stderr together
for lei. Perhaps I lack imagination or foresight, but I can't
think of a reason to send more or less FDs.
|
|
IO::FDPass may be an extra installation burden I don't want to
impose on users. We only support Linux and *BSDs, however.
|
|
I never hit these die() calls, but noticed it while debugging
another problem on FreeBSD.
|
|
ProcessPipe has a built-in mechanism to prevent siblings from
reaping children.
|
|
Only saves us one line of code, but that's better than nothing.
|
|
This saves us a few lines of code, but also prevents misreaping
by sibling processes.
|
|
If we're using ->qx, we're operating synchronously anyways,
so there's little point in relying on the event loop for
waitpid.
|
|
This saves over 20ms with scripts that only use PublicInbox::Spawn.
|
|
To get rid of the ugly $PublicInbox::DS::in_loop localization
in MboxReader, we'll distinguish between ->CLOSE and ->DESTROY
with ProcessPipe.
If we end up closing via ->DESTROY, we'll assume the caller will
want to deal with $? asynchronously via the event loop (or not
even care about $?).
If we hit ->CLOSE directly, we'll assume the caller called
close() and wants to check $? synchronously.
Note: wantarray doesn't seem to propagate into tied methods,
otherwise I'd be relying on that.
|
|
While the changes to git->qx/git->popen from commit 171a9c24022ad7ef
will be useful for the lei daemon, hiding git error messages from
actual users is probably wrong and we'll just localize GIT_*
vars for testing.
|
|
Hopefully this will make it easier to spot dependency
bugs in the future.
|
|
GIT_CONFIG is set by -convert, and user may have it set
for other reasons. In either case, it conflicts with
any any attempt to use `git config --global` so we have
to unset it.
This fixes t/multi-mid.t under TEST_RUN_MODE=0
|
|
The default $QP_FLAGS won't be set until after Xapian is
loaded, duh...
This fixes t/imapd.t with TEST_RUN_MODE=0
|
|
$git->qx and $git->popen now $env and $opt for redirects
like lower-level popen_rd. This may be beneficial in other
places.
|
|
Using "make update-copyrights" after setting GNULIB_PATH in my
config.mak
|
|
Since we'll be forking for Xapian indexing and maybe
other places, having a simple guard in place to ensure
OnDestroy doesn't unexpectedly unlink files or similar
is a safer option.
|
|
This may help ensure DESTROY callbacks will see in_loop
before the others.
|
|
Objects with DESTROY callbacks get propagated to children, so we
must be careful to not invoke waitpid from children on their
sibling processes. Only parents (and their parents...) can reap
child processes.
|
|
Spawn was designed to speed up process spawning inside
long-lived daemons with largish memory usage. It does not help
for short-lived scripts which only exist to start and connect to
a daemon.
This change actually speeds up initial lei startup from
~190ms to ~140ms(!). Normal usage once the daemon is running
is unaffected, at <20ms for help text.
While we're in the area, simplify Cwd error message generation,
too.
|
|
Since Perl exposes O_NONBLOCK as a constant, we can safely make
SFD_NONBLOCK a constant, too. This is not the case for
SFD_CLOEXEC, since O_CLOEXEC is not exposed by Perl despite
being used internally in the interpreter.
|
|
This simplifies our code and provides a more consistent API for
error handling. PublicInbox::DS can be loaded nowadays on all
*BSDs and Linux distros easily without extra packages to
install.
The downside is possibly increased startup time, but it's
probably not as a big problem with lei being a daemon
(and -mda possibly following suite).
|
|
The daemon for the local email interface will be inside
the DS->EventLoop. -watch currently doesn't trigger this
bug since it doesn't enable parallelism, but it may in
the future.
|
|
Opening a FIFO with O_RDWR always succeeds on Linux, which
cause the cat(1) process invoked by t/lei_to_mail.t to get
stuck. Furthermore O_APPEND makes no sense on FIFOs and
perhaps there's some kernel out there which will reject it.
|
|
We don't want to leave Xapcmd waitpid(-1, ...) call to hit it.
|
|
It seems like a more logical place for it, but we'll favor the
newly-added xsys_e() in tests for BAIL_OUT use.
|
|
This will be helpful for mairix users.
|
|
This matches mairix(1) behavior and may be safer if there's
concurrent readers on the existing mbox, especially since
we don't do currently implement mbox locking (nor does mairix).
|
|
shutdown(2) on a socket can be preferable if there's multiple
forked processes writing to a single worker and we really want
to shut things down ASAP.
It may also be good to provide an ipc_worker_exit method which
subclasses can override if needed for graceful shutdown. But we
won't need equivalents to atexit(3) since we can rely on DESTROY
handlers given this is Perl5.
|
|
For personal mail, unsent drafts messages are a common source of
messages without Message-IDs.
|
|
We'll be using it for Resent-Message-ID with lei, and possibly
other places.
|
|
As shown recently in commit a05445fb400108e60ede7d377cf3b26a0392eb24
("config: config_fh_parse: micro-optimize"), the relying on
the return value of `push' and defined-or operators can avoid
modifying a the hash value scalar with an increment.
|
|
The words "extinbox" and "extindex" are too close and easy to
confuse with the other. Rename "extinbox" to "external", since
these could be IMAP, JMAP or other non-public-inbox search APIs.
Link: https://public-inbox.org/meta/20201226112649.GB6226@dcvr/
|
|
Add a ->set_eml method which can be a useful fire-and-forget
way of either adding new files to store OR setting keywords
on them.
When seeing brand-new messages, add_eml can afford to return
more information in the smsg instead of just the OID.
|
|
Some testing will be needed to see if it's worth the code
and maintenance overhead, but it seems easy-enough to get
working.
|
|
I intend to use this with LeiStore when importing from multiple
slow sources at once (e.g. curl, IMAP, etc). This is because
over.sqlite3 can only have a single writer, and we'll have
several slow readers running in parallel.
Watch and SearchIdxShard should also be able to use this code
in the future, but this will be proven with LeiStore, first.
|
|
Maildir should be plenty fine for short-lived output folders.
|
|
Users may wish to pipe output to "git am", "spamc",
or similar, so we need to support those cases and
not bail out on lseek(2) or ftruncate(2) failures.
|
|
LeiDedupe requires SQLite, so we may want to be able to test
writing mail without DBI or SQLite down the line.
|
|
For writing mboxes and Maildirs, users may wish to use
stricter or looser deduplication strategies. This
gives them more control.
|
|
--augment will match the mairix(1) option of the same
name to augment existing search results. We'll need
to implement deduplication for a better user experience.
mutt ships with compressed mbox support for bz2 and xz,
at least, so we'll support those out-of-the-box.
|
|
This is only lightly-tested against stuff LeiToMail generates
and will need real-world tests to validate.
|
|
We'll allow using multiple workers to write to a single
mbox (which could be compressed). This is can be done
safely with O_APPEND + syswrite for uncompressed files,
and using a lock when piping to pigz/gzip/bzip2/xz.
|