path: root/lib/PublicInbox/LeiXSearch.pm
Date Commit message
2021-03-04 lei_xsearch: add_eml for remote mboxrd, not set_eml
set_eml will clobber any existing keywords. Since remote mboxrds cannot (and should not) be sending keywords to us, we shouldn't let remote external requests clobber already-set keywords if they exist.
2021-03-01 lei q: improve early aborts w/ remote externals
We must issue LeiStore->done if a client disconnects while we're streaming from a remote external. This can happen via SIGPIPE, or if a client process is interrupted by any other means.
2021-02-26 lei_xsearch: more detail about ->xdb call chain
I was just wondering this myself :x
2021-02-26 lei q: support mbox locking by default
While this diverges from mairix(1) behavior, it's the safer option. We'll follow Debian policy by supporting fcntl and dotlocks by default (in that order). Users who do not want locking can use "--lock=none". This will be used in a read-only capacity for watching mailboxes for keyword updates via inotify or EVFILT_VNODE.
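The "fcntl, then dotlock" ordering described above can be sketched in Python (not the Perl used by lei). This is a minimal illustration under stated assumptions: `locked_mbox` is a hypothetical helper, an empty `methods` tuple stands in for "--lock=none", and the dotlock step fails immediately instead of retrying as a real implementation likely would.

```python
import contextlib
import fcntl
import os


@contextlib.contextmanager
def locked_mbox(path, methods=("fcntl", "dotlock")):
    """Hold the requested locks on an mbox file while the block runs.

    Hypothetical helper, not lei's API. Locks are taken in the order
    given; an empty tuple takes no lock at all.
    """
    dotlock = path + ".lock"
    have_dotlock = False
    with open(path, "r+b") as fh:
        try:
            for method in methods:
                if method == "fcntl":
                    # Blocks until the exclusive lock is granted.
                    fcntl.lockf(fh, fcntl.LOCK_EX)
                elif method == "dotlock":
                    # O_EXCL makes creation fail with FileExistsError
                    # if another process already holds the dotlock.
                    fd = os.open(dotlock,
                                 os.O_WRONLY | os.O_CREAT | os.O_EXCL,
                                 0o644)
                    os.close(fd)
                    have_dotlock = True
                else:
                    raise ValueError(f"unknown lock method: {method}")
            yield fh
        finally:
            if have_dotlock:
                os.unlink(dotlock)
            # The fcntl lock is dropped when fh is closed on exit.
```

A caller would wrap its mbox access in `with locked_mbox("/path/to/mbox") as fh:` and pass `methods=()` to skip locking entirely.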
2021-02-26 lei q: -tt marks direct hits as "flagged"
This can be used to quickly distinguish messages which were direct hits when doing thread expansion vs messages that were merely part of the same thread. This is NOT mairix-derived behavior, but I occasionally found it useful when looking at results in an MUA to know whether a message was a direct hit or not. This makes "-t" consistent with non-"-t" cases as far as keyword reading goes.
2021-02-25 lei q: auto-memoize remote messages into lei/store
This lets users avoid network traffic on subsequent searches at the expense of local disk space. --no-import-remote may be specified to reverse this trade-off for users with little storage.
2021-02-24 lei: avoid needless env passing to subcommands
We already localize %ENV before calling dispatch(), so it's needless overhead in spawn() to be checking env for undef values in those cases.
2021-02-22 lei q: reduce wasted IMAP connection for auth
We can rework the first lei2mail worker to authenticate, and then share auth info with the rest of the lei2mail workers. As with "lei import", this uses PktOp and lei-daemon to share updated credentials between the first and subsequent l2m workers.
2021-02-21 lei2mail: parallel augment for lock-free stores
This lets us make use of multiple cores on IMAP and Maildir backed by SSD (or better) storage. This benefits IMAP stores with high network latency, but may still penalize IMAP servers with rotational storage.
2021-02-21 ipc: support setting a locked number of WQ workers
We can use this to ensure sharded work doesn't do unexpected things if workers are added/removed. We currently don't increase/decrease workers once a workqueue is started, but non-lei code (-httpd/imapd) may start doing so. This also fixes a bug where lei2mail workers could not be adjusted via --jobs on the command-line.
2021-02-21 lei q: move augment into lei2mail workers
This is a step which will allow us to parallelize augment on Maildir and IMAP.
2021-02-08 lei q: SIGWINCH process group with the terminal
While using utime on the destination Maildir is enough for mutt to eventually notice new mail, "eventually" isn't good enough. Send a SIGWINCH to wake mutt (and likely other MUAs) immediately. This is more portable than relying on MUAs to support inotify or EVFILT_VNODE.
2021-02-08 lei_xsearch: quiet Eml warnings from remote mboxrds
This will probably cover full Atom/HTML feed generation or any outputs which are order-dependent, but those aren't prioritized at the moment.
2021-02-08 lei q: improve remote mboxrd UX + MUA
For early MUA spawners using lock-free outputs, we need to wait on the startq pipe to silence progress reporting. For --augment users, we can start the MUA even earlier by creating Maildirs in the pre-augment phase. To improve progress reporting for non-MUA (or late-MUA) spawners, we'll no longer blindly append "--compressed" to the curl(1) command when POST-ing for the gzipped mboxrd. Furthermore, we'll overload stringify ('""') in LeiCurl to ensure the empty -d '' string shows up properly. v2: fix startq waiting with --threads. mset_progress is never shown with early MUA spawning. The plan is to still show progress when augmenting and deduping. This fixes all local search cases. A leftover debug bit is dropped, too.
2021-02-07 lei: replace --thread with --threads
Nobody is expected to use long options, but for consistency with mairix(1), we'll use the pluralized option throughout (including existing PublicInbox::{Search,SearchView}). Link: https://public-inbox.org/meta/20210206090119.GA14519@dcvr/
2021-02-07 lei: more consistent IPC exit and error handling
We're able to propagate $? from wq_workers in a consistent manner, now.
2021-02-07 ipc: wq_do => wq_io_do
We will have a ->wq_do that doesn't pass FDs for I/O.
2021-02-07 lei add-external: handle interrupts with --mirror
This also updates lei_xsearch to follow the same pattern for stopping curl(1) and tail(1) processes it spawns.
2021-02-07 lei: add-external --mirror support
This can be useful for users who want to clone and mirror an existing public-inbox. This doesn't have update support, yet, so users will need to run "git fetch && public-inbox-index" for now.
2021-02-05 lei import: initial implementation
Only tested with .eml files so far, but Maildir + IMAP will be supported.
2021-02-05 lei_xsearch: drop unused imports
Reaping is handled by the parent PublicInbox::IPC, and we have no business using PublicInbox::Import since LeiXSearch won't write to git directly (it will write via LeiStore).
2021-02-05 lei q: eliminate $not_done temporary git dir hack
Another step towards simplifying lei internals. None of our current uses of ->wq_do involve FD passing, and the plan is to only rely on FD passing between lei-daemon and lei(1). Internally, it ought to be possible for lei-daemon internal bits to be ordered properly to not need FD passing.
2021-02-05 lei q: reinstate early MUA spawn for Maildir
Once all files are written, we can use utime() to poke Maildirs to wake up MUAs that fail to account for nanosecond timestamp resolution.
2021-02-05 lei q: only start pager if output is to stdout
No need to be starting a pager if we're writing to a regular file.
2021-02-05 lei q: reorder internals to reduce FD passing
While FD passing is critical for script/lei <=> lei-daemon, lei-daemon doesn't need to use it internally if FDs are created in the proper order before forking.
2021-02-05 lei q: delay worker spawn
Now that --stdin support is sorted, we can delay spawning workers until we know the query is ready-to-run.
2021-02-04 lei q: support reading queries from stdin
This will be useful on shared machines when a user doesn't want search queries visible to other users looking at the ps(1) output or similar.
2021-02-04 lei: propagate curl errors, improve internal consistency
IO::Uncompress::Gunzip seems to be losing $? when closing PublicInbox::ProcessPipe. To work around this, do a synchronous waitpid ourselves to force proper $? reporting, and update tests to use the new --only feature for testing invalid URLs. This improves internal code consistency by having {pkt_op} parse the same ASCII-only protocol script/lei understands. We no longer pass {sock} to worker processes at all, further reducing FD pressure on per-user limits.
2021-02-04 pkt_op: rely on DS::in_loop global
No reason to check for $lei->{oneshot} here.
2021-02-03 lei q: support --jobs [SEARCHERS],[WRITERS]
This comma-delimited parameter allows controlling the number of lei_xsearch and lei2mail worker processes. With the change to make IPC wq_* work use the event loop, it's now safe to run fewer worker processes for searching with no risk of deadlocks. MAX_PER_HOST isn't configurable yet for remote hosts, and maybe it shouldn't be due to potential for abuse.
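As an illustration of the "[SEARCHERS],[WRITERS]" form, here is a minimal Python sketch (not lei's Perl). `parse_jobs` is a hypothetical helper, and treating an omitted side as "keep the default" (`None`) is an assumption that the commit message does not spell out.

```python
from typing import Optional, Tuple


def parse_jobs(arg: str) -> Tuple[Optional[int], Optional[int]]:
    """Split a "[SEARCHERS],[WRITERS]" string into two optional counts.

    Hypothetical helper: an empty or missing side yields None, meaning
    "keep the default" (an assumption, not documented lei behavior).
    """
    searchers, _, writers = arg.partition(",")

    def to_count(text: str) -> Optional[int]:
        text = text.strip()
        if not text:
            return None
        count = int(text)  # raises ValueError on non-numeric input
        if count < 1:
            raise ValueError(f"job count must be positive: {text!r}")
        return count

    return to_count(searchers), to_count(writers)
```

Under these assumptions, `parse_jobs("4,2")` gives four searchers and two writers, while `parse_jobs(",2")` only overrides the writer count.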
2021-02-03 lei q: tidy up progress reporting
We won't be reporting progress when output is going to stdout since it can clutter up the terminal unless stderr != stdout, which probably isn't worth checking. We'll also use a more agnostic mset_progress which may make it easier to support worker-less invocations.
2021-02-03 lei_xsearch: ensure curl.err and tail(1) cleanup happens
We can safely rely on exit(0) here when interacting with curl(1) and git(1), unlike query workers which hit Xapian directly, where some badness happens when hit with a signal while retrieving an mset.
2021-02-03 lei q: do not leave temporary files after oneshot exit
Avoid on-stack shortcuts which may prevent destructors from firing since we're not inside the event loop. We'll also tidy up the unlink mechanism in LeiOverview while we're at it.
2021-02-03 lib: explicitly distinguish oneshot use
The daemon must not be fooled into thinking it's in oneshot after a lei client disconnects and erases {sock}.
2021-02-03 lei_xsearch: truncate curl stderr after reading it
We may have further URLs to read in that process, so ensure we don't end up having tail send stale data.
2021-02-03 lei q: emit progress and counting via PktOp
Sometimes it can be confusing for "lei q" to finish writing to a Maildir|mbox and not know if it did anything. So show some per-external progress and stats. These can be disabled via the new --quiet/-q switch. We differ slightly from mairix(1) here, as we use stderr instead of stdout for reporting totals (and we support parallel queries from various sources).
2021-02-03 lei: switch to use SEQPACKET socketpair instead of pipe
This will allow us to use larger messages and do progress reporting to accumulate in the main daemon.
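The practical difference from a pipe is that a SEQPACKET socket preserves message boundaries, so each receive returns exactly one message. A minimal Python sketch follows (not lei's Perl), assuming a platform that supports AF_UNIX SOCK_SEQPACKET, such as Linux. The payload strings and variable names are made up for illustration and are not lei's wire protocol.

```python
import socket


def packet_pair():
    """Return two connected AF_UNIX SOCK_SEQPACKET sockets."""
    return socket.socketpair(socket.AF_UNIX, socket.SOCK_SEQPACKET)


daemon_end, worker_end = packet_pair()

# Two separate sends stay two separate messages; a pipe could merge
# them into a single read on the receiving side.
worker_end.send(b"progress 10")
worker_end.send(b"progress 25")

first = daemon_end.recv(4096)   # one whole message per recv()
second = daemon_end.recv(4096)

daemon_end.close()
worker_end.close()
```

Because framing comes from the socket type, the receiver needs no delimiter or length prefix to accumulate per-worker progress messages.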
2021-02-01 lei_to_mail: reduce spew on Maildir removal
At most, we'll only warn once per worker when a Maildir disappears from under us. We'll also use the '!' OpPipe to note the exceptional condition, and use '|' to SIGPIPE so it'll be a bit easier for hackers to remember.
2021-02-01 lei_xsearch: load PublicInbox::Smsg
We use $smsg->populate here, so ensure it's loaded although PublicInbox::Search currently loads it.
2021-02-01 lei: keep $lei around until workers are reaped
This prevents SharedKV->DESTROY in lei-daemon from triggering before DB handles are closed in lei2mail processes. The {each_smsg_not_done} pipe was not sufficient in this case: that gets closed at the end of the last git_to_mail callback invocation.
2021-02-01 lei: remove SIGPIPE handler
It doesn't save us any code, and the action-at-a-distance element was making it confusing to track down actual problems. Another potential problem was keeping references alive too long. So do like we would a C100K server and check every write while still ensuring lei(1) exits with a proper SIGPIPE iff needed.
2021-01-30 lei: less error-prone FD mapping
Keeping track of non-standard FDs gets tricky, so make it easier by relying on st_dev/st_ino mapping in the transmitted objects. We'll keep using numbers for the standard FDs since we need to be able to easily redirect them in the producer (main daemon) process for (gzip|bzip2|xz) if writing to a compressed mbox.
2021-01-30 lei_xsearch: drop repeated "Xapian" in error message
Copy+paste error :x
2021-01-26 lei q: continue remote search if torsocks(1) is missing
torsocks is just one of many ways to get curl to use Tor, so we'll continue if we can't find torsocks in our PATH and assume the user has a proxy configured via curlrc, the command-line, environment variable, or even firewall rules.
2021-01-26 lei q: reject remotes early if curl(1) is missing
This ought to provide a better user experience for users if they attempt to use remote externals but don't have curl installed. We can avoid repeating PATH search in every worker here, too.
2021-01-26 lei q: demangle and quiet curl output
curl(1) writes to stderr one byte-at-a-time (presumably for the progress bar). This ends up being unreadable on my terminal when parallel processes are trying to write error messages. So instead, we'll capture the output to a file and run 'tail -f' on it if --verbose is enabled. Since HTTP 404s from non-existent results are a common response, we'll ignore them and stay silent, matching behavior of local searches.
2021-01-24 smsg: make parse_references an object method
Having parse_references in OverIdx was awkward and Smsg is a better place for it.
2021-01-24 lei q: fix JSON overview with remote externals
We can't (and don't need to) repeatedly get the $each_smsg callback for each URI since that clobbers {ovv_buf} before it can be output. I initially thought this was a dedupe-related bug and moved the dedupe code into the $each_smsg callback to minimize differences. Nevertheless it's a nice code reduction. I also thought it was related to incomplete smsg info, so {references} is now filled in correctly for dedupe.
2021-01-24 lei_xsearch: use curl -d '' for nginx compatibility
It appears Content-Length and/or Content-Type headers are required by nginx with POST requests. varnish alone doesn't have this requirement and my (perhaps lossy) reading of RFC 2616, 7230, 7231 didn't note this, either. In any case, we must support nginx even if it's overly strict. Reported-By: Kyle Meyer <kyle@kyleam.com> Link: https://public-inbox.org/meta/87v9bmswkh.fsf@kyleam.com/
2021-01-24 lei q: honor --no-local to force remote searches
This can be useful for testing remote behavior, or for augmenting local results. It'll also be possible to explicitly include/exclude externals via CLI switches (once names are decided).