path: root/lib/PublicInbox/LeiXSearch.pm
Date Commit message
2024-04-17 lei: use async barrier for --import-before
Write barriers can take a long time to finish, especially when commands are issued in parallel. So handle it asynchronously without blocking lei-daemon by making EOFpipe a little more flexible by supporting arguments to the callback function. This is another step towards improving parallel use of lei.
2024-04-17 lei: use ->barrier to commit to lei/store
barrier (synchronous checkpoint) is better than ->done with parallel lei commands being issued (via '&' or different terminals), since repeatedly stopping and restarting processes doesn't play nicely with expensive tasks like `lei reindex'. This introduces a slight regression in maintaining more processes (and thus resource use) when lei is idle, but that'll be fixed in the next commit.
2024-04-12 lei q: support --thread-id=$MSGID || -T $MSGID
This adds support for the "POST /$INBOX/$MSGID/?x=m&q=..." endpoint added last year to support per-thread searches in 764035c83 (www: support POST /$INBOX/$MSGID/?x=m&q=, 2023-03-30). This only works with public-inbox instances running 764035c83 or later, but unfortunately there hasn't been a release since then.
2023-12-13 treewide: avoid strftime %k for portability
The musl strftime(3) implementation on AlpineLinux 3.19.0 doesn't support `%k' and `%k' isn't in POSIX, either. So we fall back to using the `sprintf' perlop in the user-facing UI since leading zeroes require needless overhead for my eyes and brain to parse in the time.
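As a rough illustration of the fallback described above (in Python rather than the Perl `sprintf` the commit uses), the hour can be formatted with plain string formatting instead of the non-POSIX `%k` conversion. The function name `format_hm` and the `H:MM` layout are assumptions made for this sketch.

```python
import time

def format_hm(tm):
    """Format hour:minute with a space-padded hour, similar to strftime's %k.

    Plain string formatting is used, so the result does not depend on
    the libc strftime(3) supporting the non-POSIX %k conversion.
    """
    return f"{tm.tm_hour:2d}:{tm.tm_min:02d}"

# Example: format the current local time.
print(format_hm(time.localtime()))
```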
2023-11-16 lei q|up|convert: common finish_output to detect errors
We need to consistently check the exit code of pigz|gzip|xz|bzip2 when writing to compressed mboxes (or bad storage).
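The idea of checking the compressor's exit code can be sketched as follows. This is a Python illustration under stated assumptions, not the Perl `finish_output` code: the helper name `write_through` is hypothetical, and `cmd` stands in for whichever compressor command line is in use.

```python
import subprocess

def write_through(cmd, data, dst_path):
    """Pipe `data` through `cmd` into dst_path; raise if `cmd` fails."""
    with open(dst_path, "wb") as dst:
        proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=dst)
        try:
            proc.stdin.write(data)
        except BrokenPipeError:
            pass  # child exited early; its status is checked below
        finally:
            try:
                proc.stdin.close()  # flushes buffered data, signals EOF
            except BrokenPipeError:
                pass
        status = proc.wait()
    if status != 0:
        # A non-zero status means the output file cannot be trusted.
        raise RuntimeError(f"{cmd[0]} exited with status {status}")
```

A call such as `write_through(["gzip", "-c"], mbox_bytes, "out.mbox.gz")` would then fail loudly if the compressor dies or the storage is bad, instead of silently leaving a truncated file.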
2023-11-16 lei: avoid extra fork for v2 outputs
We've always forced LeiToMail to only have one process for v2 outputs anyways since v2 has its own sharding and IPC. Thus we can use the single LeiToMail process directly to avoid extra IPC overhead.
2023-11-16 lei convert: fix repeat and idempotent v2 output
We should be able to treat v2 outputs just like any other mail format, with the exception that content dedupe is always enforced by the v2 format. This allows users hosting v2 public-inboxes to catch up broken synchronization from alternate archives such as the mbox archives hosted by https://lists.gnu.org/ Link: https://public-inbox.org/meta/20231114-hypersonic-papaya-starling-e1cfc8@nitro/
2023-11-15 lei: use -signal numbers for old Perl
Unlike modern Perls, Perl 5.16.3 on CentOS doesn't accept negative string signals like "-TERM". This only became a problem since commit b231d91f42d7 (treewide: enable warnings in all exec-ed processes) made our code stricter by enabling more warnings. In both cases, the kill is probably unnecessary and safe to remove since we can rely on closing sockets to drop processes.
2023-11-09 lei: get rid of autoreap usage
We can rely on Process::IO->DESTROY to close and reap in these cases. This is the final step in eliminating the wantarray invocations of popen_rd (and popen_wr).
2023-11-09 lei_xsearch: put query in process title for debugging
Having queries in the process titles makes it easier to diagnose stuck queries due to IPC problems. This was used to diagnose commit e97a30e7624d (lei: fix SIGPIPE on large result sets to pager).
2023-11-07 lei: fix SIGPIPE on large result sets to pager
When dealing with large search results, we need to deal with EPIPE not just from the pager, but also EPIPE or ECONNRESET between lei_xsearch and lei2mail processes. Without this fix, lei_xsearch processes could linger and get stuck writing to dead lei2mail processes if a user aborts the pager early during a large result set. To ensure lei_xsearch processes don't linger around after lei2mail workers all die, we must close $l2m->{-wq_s2} before spawning lei_xsearch processes, since $l2m->{-wq_s2} is only used in lei2mail workers. For `git cat-file' processes, we also need to trigger PublicInbox::Git->close to handle unpredictable destructor ordering to avoid using uninitialized IO refs. This combines with the `git_to_mail' change to deal with process cleanup handling from premature shutdowns. To test all this, we can't just rely on a single message being large, but also need to rely on the result set being large enough to saturate the lei_xsearch -> lei2mail socket so we rely on GIANT_INBOX_DIR once again.
2023-11-03 treewide: use ->close to call ProcessIO->CLOSE
This will open the door for us to drop `tie' usage from ProcessIO completely in favor of OO method dispatch. While OO method dispatches (e.g. `$fh->close') are slower than normal subroutine calls, it hardly matters in this case since process teardown is a fairly rare operation and we continue to use `close($fh)' for Maildir writes.
2023-10-19 lei: simplify startq/au_done wakeup notifications
We only need to write one byte at MUA start instead of a byte for every LeiXSearch worker. Also, make sure it succeeds by enabling autodie for syswrite. When reading, we can rely on `:perlio' layer `read' semantics to retry on EINTR to avoid looping and other error checking.
2023-10-18 syscall: common $F_SETPIPE_SZ definition
We use this in various places to minimize or maximize pipe size on Linux. So keep it all in one place.
2023-10-12 lei: quiet excessive write/seen messages
We don't want to end up dumping nr_seen/nr_write when progress is disabled, nor do we want forked off `lei note-event' workers to dump them when DS->Reset is called on fork.
2023-10-11 lei_xsearch: improve curl progress reporting
Instead of having tail(1) follow a file when we're in verbose mode, unconditionally pipe stderr to a Perl 2-liner which tees its output to a regular file with line buffering. POSIX tee(1) isn't suitable for this task since it's required to be completely unbuffered while we want line-buffering when running parallel processes. Fortunately, Perl makes this easy. This also means we no longer leave curl-err.XXXX files around on premature shutdown if we're hit by a SIGKILL or similar and can't exit normally. We do need to stop and respawn the Perl process if we hit a curl error, though, since we need to be certain the output is flushed.
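The line-buffered tee described above can be sketched roughly as follows. This is a Python illustration, not the Perl 2-liner from the commit; the name `tee_lines` and its three stream arguments are assumptions made for this example.

```python
def tee_lines(src, log, out):
    """Copy src to both log and out one line at a time.

    Each stream is flushed after every line, so output from parallel
    processes is interleaved on whole-line boundaries rather than
    byte by byte (which is what an unbuffered tee would do).
    """
    for line in src:
        log.write(line)
        log.flush()
        out.write(line)
        out.flush()
```

In a real pipeline, `src` would be the read end of the pipe attached to the child's stderr, `log` a regular file, and `out` the terminal.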
2023-10-08 process_io: fix binmode and use it in lei_xsearch
The `binmode' perlop can only take two scalars, so passing `@_' blindly won't work since prototypes are checked. This means we can get IO::Uncompress::Gunzip working properly with ProcessIO and use it for curl. We'll also just autodie (instead of warn) on FS errors when dealing with curl stderr, since the process will likely be in bigger trouble soon anyways.
2023-10-08 lei: always use async `done' requests to store
It's safer against deadlocks and we still get proper error reporting by passing stderr across in addition to the lei socket.
2023-10-04 lei: get rid of l2m_progress PktOp callback
We already have an ->incr callback we can enhance to support multiple counters with a single request. Furthermore, we can just flatten the object graph by storing counters directly in the $lei object itself to reduce hash lookups.
2023-10-04 lei: do_env combines fchdir and local
This will make switching $lei contexts less error-prone and hopefully save us from some surprising bugs in the future. Followup-to: 759885e60e59 (lei: ensure --stdin sets %ENV and $current_lei, 2023-09-14)
2023-10-02 lei up: faster non-thread, single-source incremental query
When using isearch (that is v1/v2 inbox relying on extindex for search), there's actually no guarantee that IMAP UIDs are in the correct order with regard to Xapian docids. Thus we must iterate through every UID(num) to see if it's suitable to display in a saved search. The old grep filter (before commit a6fe84489127) was not effective since it didn't account for the mset->items correspondence. Fortunately, this bug merely manifests in reduced performance as of a6fe84489127. Prior to that, it could cause incorrect keywords and labels to be applied. Unfortunately, this behavior is hard-to-test so no test case is included. Followup-to: a6fe84489127 (lei up: fix missing -t/--threads matches w/ saved search)
2023-10-01 lei up: fix missing -t/--threads matches w/ saved search
We must not filter seen docids out of the mset itself; filtering must only apply to the result of over->expand_thread.
2023-09-16 lei q: set exit code for invalid Xapian queries
Xapian can't parse every query, so ensure we set the exit code for the client.
2023-01-30 ipc: drop awaitpid_init to avoid circular refs
This brings t/lei-index.t back down from ~8 to ~3s. I didn't notice this before because the LeiNoteEvent timer was firing every 5s and clearing circular refs, and parallel testing meant the delay got hidden. Fixes: 4a2a95bbc78f99c8 (ipc+lei: switch to awaitpid, 2023-01-17)
2023-01-18 ipc+lei: switch to awaitpid
This avoids awkwardly stuffing an arrayref into callbacks which expect multiple arguments. IPC->awaitpid_init now allows pre-registering callbacks before spawning workers.
2022-12-02 lei_saved_search: expand only/include/exclude to absolute paths
While users may specify relative paths for convenience on the command-line, absolute paths are required for `lei up' since that (especially `lei up --all') could run from anywhere. Note that we need to do this when parsing the command-line options, since shortcuts for URL matching on URL path components are allowed for `lei q', and those same shortcuts may remain in effect across to `lei up' as the underlying external may be moved to a different URI host.
2022-12-02 lei: stricter external checks for valid $GIT_DIR/objects
I ended up with my $HOME in ~/.cache/lei/all_locals_ever.git/objects/info/alternates and am trying to avoid that in the future.
2022-08-29 treewide: ditch inbox->recent method
It's a needless wrapper, nowadays. Originally, ->over was added on an experimental basis to optimize for /$INBOX/ where Xapian ->search is slower on gigantic (LKML-sized) inboxes. Nowadays with extindex, ->over is here to stay given NNTP and IMAP both benefit from it. So reduce the interpreter stack overhead and just access ->over directly. lxs->recent was never used outside of tests, anyways. And while we're in the area, avoid needlessly bumping the refcount of $ctx->{ibx} in View::paginate_recent.
2022-07-07 lei: track seen messages to note duplicates
This may help track down deduplication or other bugs in lei which lead to occasionally missing messages. Link: https://public-inbox.org/meta/CAL_JsqJH8xx_2NyZffNsRXbGXiv3kjmCETvKXt3Yfb0uToLm9Q@mail.gmail.com/
2022-07-07 lei_xsearch: simplify lei/store import check
There's no need to check for two fields when one will suffice.
2021-11-10 lei q: make HTTP(S) query strings even less ugly
Following commit 57fed2e4b78ed394 (lei: normalize whitespace in remote queries, 2021-09-11), leaving the trailing `\n' from stdin queries to be normalized to ` ' (SP) causes it to appear as `+' in URLs, which Xapian ignores.
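A small Python sketch shows why the trailing newline matters: once whitespace is normalized to spaces, form-style URL encoding turns every remaining space into `+`. The helper name `query_string` and the use of a lone `q` parameter are assumptions for illustration, not the Perl code from the commit.

```python
from urllib.parse import urlencode

def query_string(query):
    """Collapse whitespace runs and trim both ends before URL-encoding."""
    normalized = " ".join(query.split())
    return urlencode({"q": normalized})

# The trailing "\n" from stdin is dropped instead of becoming a stray "+".
print(query_string("s:foo  bar\n"))  # q=s%3Afoo+bar
```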
2021-10-30 lei_xsearch: quiet error message on SIG{PIPE,TERM}
SIGPIPE and SIGTERM are common and user-induced, so they're not worth warning on. Add the value of "$?", though, since it can help users notice other errors (e.g. SIGSEGV).
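The policy above can be sketched in Python as follows, assuming the usual POSIX layout of a raw wait status (terminating signal in the low 7 bits, exit code in the next byte). The function name `child_error` and the message text are hypothetical.

```python
import signal

# Signals that are common and user-induced, so not worth a warning.
QUIET_SIGNALS = {signal.SIGPIPE, signal.SIGTERM}

def child_error(status):
    """Return a warning for a raw wait status, or None to stay quiet."""
    if status == 0:
        return None
    sig = status & 0x7F  # low 7 bits hold the terminating signal, if any
    if sig in QUIET_SIGNALS:
        return None
    # Include the raw status so other failures (e.g. SIGSEGV) stand out.
    return f"child failed: $?={status}"
```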
2021-10-27 lei q: fix remote import accounting
We need to update the {-nr_remote_eml} counter regardless of progress display being enabled since it's needed for saved searches. We'll also split out the {-imported} flag separately and only call LeiStore->done if a new message was imported. Note: this change is NOT expected to fix errors reported by Thomas in <ebf92218-1470-4602-b534-6dae59639dc6@t-8ch.de> Cc: Thomas Weißschuh <thomas@t-8ch.de>
2021-10-24 lei: always pass $lei to LeiAuth->op_merge
This will make future developments easier.
2021-10-19 lei: use die for external and query handling
This allows "lei up" to continue processing unrelated externals if one output fails.
2021-10-16 lei: more eval guards for die on failure
Relying on $lei->fail is unsustainable since there'll always be parts of our code and dependencies which can trigger die() and break the event loop.
2021-10-15 lei q: guard query_done against die()
v2w->wq_do('done') may die on I/O errors, and likely other places. Just guard the entire block with an eval and ->fail as appropriate.
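The guard pattern (Perl `eval` plus `->fail`) can be approximated in Python with a try/except wrapper. This is only a sketch of the idea: `guarded` and `on_fail` are hypothetical names, with `on_fail` standing in for whatever error reporter the caller uses.

```python
def guarded(callback, on_fail):
    """Run callback; route any exception to on_fail instead of propagating.

    This keeps one failing step from unwinding through (and breaking)
    the surrounding event loop.
    """
    try:
        return callback()
    except Exception as exc:
        on_fail(exc)
        return None
```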
2021-10-15 lei + ipc: simplify process reaping
Simplify our APIs and force dwaitpid() to work in async mode for all lei workers. This avoids having lingering zombies for parallel searches if one worker finishes soon before another. The old distinction between "old" and "new" workers was needlessly complex, error-prone, and embarrassingly bad. We also never handled v2:// writers properly before on Ctrl-C/Ctrl-Z (SIGINT/SIGTSTP), so add them to @WQ_KEYS to ensure they get handled by $lei when appropriate.
2021-10-15 lei q: avoid kw lookup failure on remote mboxrd
When importing several sources in parallel via http(s) mboxrd, we need to be able to get keywords of uncommitted documents directly from shard workers. Otherwise, Xapian DocNotFound errors happen because the read-only LeiSearch won't see documents from uncommitted transactions. Keep in mind that it's possible the keywords can be changed on-the-fly even for uncommitted documents because of inotify watches from LeiNoteEvent.
2021-10-15 lei: TSTP affects all curl and related subprocesses
By relying more on pgroups for remaining processes, this lets us pause all curl+tail subprocesses with a single kill(2) to avoid cluttering stderr. We won't bother pausing the pigz/gzip/bzip2/xz compressor processes nor cat-file processes, though, since those don't write to the terminal and they idle soon after the workers react to SIGSTOP. AutoReap is hoisted out from TestCommon.pm. CLONE_SKIP is gone since we won't be using Perl threads any time soon (they're discouraged by the maintainers of Perl).
2021-10-15 lei: give workers their own process group
This lets users Ctrl-Z from their terminal to pause an entire git-clone process hierarchy.
2021-10-13 treewide: use warn() or carp() instead of env->{psgi.errors}
Large chunks of our codebase and 3rd-party dependencies do not use ->{psgi.errors}, so trying to standardize on it was a fruitless endeavor. Since warn() and carp() are standard mechanisms within Perl, just use those instead and simplify a bunch of existing code.
2021-10-13 lei: use standard warn() in more places
warn() is easier to augment with context information, and frankly unavoidable in the presence of 3rd-party libraries we don't control.
2021-09-25 lei: make pkt_op easier-to-use and understand
Since switching to SOCK_SEQPACKET, we no longer have to use fixed-width records to guarantee atomic reads. Thus we can maintain more human-readable/searchable PktOp opcodes. Furthermore, we can infer the subroutine name in many cases to avoid repeating ourselves by specifying a command-name twice (e.g. $ops->{CMD} => [ \&CMD, $obj ]; can now simply be written as: $ops->{CMD} => [ $obj ] if CMD is a method of $obj).
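The name-inference shortcut can be illustrated with a small Python dispatcher. This is a sketch of the idea only, not PktOp itself; `dispatch` and the example `Counter` class are hypothetical.

```python
def dispatch(ops, cmd, *args):
    """Call the handler registered for cmd.

    An entry is either [func, obj, ...] (explicit callable first) or
    [obj, ...], in which case the method named cmd is looked up on obj.
    """
    first, *bound = ops[cmd]
    if callable(first):
        return first(*bound, *args)
    return getattr(first, cmd)(*bound, *args)

class Counter:
    """Hypothetical receiver object used only for this example."""
    def __init__(self):
        self.n = 0

    def incr(self, k):
        self.n += k
        return self.n

c = Counter()
dispatch({"incr": [Counter.incr, c]}, "incr", 1)  # explicit form
dispatch({"incr": [c]}, "incr", 1)                # inferred from the opcode
```

One limitation of this sketch is that a receiver object which is itself callable would be mistaken for an explicit handler.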
2021-09-25 lei up: show timezone offset with localtime
Sometimes a user (e.g. me) isn't really sure what timezone they're in...
2021-09-23 lei_xsearch: use localtime for user message
It's probably least confusing for user-facing messages to display times in the user's configured timezone. I considered appending "UTC" to the message and sticking with gmtime(), too, but this output isn't intended to be web-cache friendly, nor do we expect users across multiple timezones to view the same output.
2021-09-21 lei q: improve --limit behavior and progress
Avoid slurping gigantic (e.g. 100000) result sets into a single response if a giant limit is specified, and instead use 10000 as a window for the mset with a given offset. We'll also warn and hint about the --limit= switch when the estimated result set is larger than the default limit.
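The windowing approach can be sketched as a Python generator. Here `search(offset, count)` is a hypothetical callable standing in for an mset query, and `fetch_windowed` is a name invented for this example; only the 10000 window size comes from the commit message.

```python
def fetch_windowed(search, limit, window=10000):
    """Yield up to `limit` results, asking for at most `window` per call.

    `search(offset, count)` is a hypothetical callable returning a list
    of results starting at `offset`.
    """
    offset = 0
    while offset < limit:
        batch = search(offset, min(window, limit - offset))
        if not batch:
            break  # result set exhausted before reaching the limit
        yield from batch
        offset += len(batch)
```

Because results are yielded batch by batch, a limit of 100000 never requires holding more than one window of results at a time.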
2021-09-21 lei q: show progress on >1s preparation phase
Overwriting existing destinations is safe (but slow) by default, so show a progress message noting what we're doing while a user waits.
2021-09-21 lei lcat: use single queue for ordering
If lcat-ing multiple argument types (blobs vs folders), maintain the original order of the arguments instead of dumping all blobs before folder contents.
2021-09-19 lei_xsearch: drop Data::Dumper use
We're not using Data::Dumper for JSON output.