path: root/lib/PublicInbox/ExtSearchIdx.pm
Date Commit message
2024-04-17 lei: use ->barrier to commit to lei/store
barrier (synchronous checkpoint) is better than ->done with parallel lei commands being issued (via '&' or different terminals), since repeatedly stopping and restarting processes doesn't play nicely with expensive tasks like `lei reindex'. This introduces a slight regression in maintaining more processes (and thus resource use) when lei is idle, but that'll be fixed in the next commit.
2024-01-30 spawn: support some rlimit uses via Inline::C
BSD::Resource isn't packaged for Alpine (as of 3.19), but we also have optional Inline::C support and already rely on calling setrlimit(2) directly from the Inline::C version of pi_fork_exec.
2023-12-13 treewide: avoid strftime %k for portability
The musl strftime(3) implementation on AlpineLinux 3.19.0 doesn't support `%k' and `%k' isn't in POSIX, either. So we fall back to using the `sprintf' perlop in the user-facing UI since leading zeroes require needless overhead for my eyes and brain to parse in the time.
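One way to avoid `%k' might look like the sketch below. It is not the code from this commit: `fmt_time' is a hypothetical helper and the output format is an assumption. The hour comes from `sprintf' so that it is not zero-padded, and the rest of the format uses only conversions found in POSIX.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper (not from public-inbox): format a local time
# without relying on the non-POSIX `%k' conversion.
sub fmt_time {
	my ($epoch) = @_;
	my @tm = localtime($epoch);
	# %H would zero-pad the hour; sprintf '%d' leaves it unpadded
	my $hour = sprintf('%d', $tm[2]);
	# $hour contains only digits, so it is safe inside the format
	strftime("%Y-%m-%d $hour:%M", @tm);
}

print fmt_time(time), "\n";
```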
2023-11-16 extindex: warn and hint about --gc on bad ibx_id
Stale entries from newsgroup name changes (including adding a `publicinbox.<name>.newsgroup' entry when none existed before) can wreak havoc during a --reindex. So give the hint to users about running -extindex with --gc to clean up stale entries.
2023-04-07 umask: rely on the OnDestroy-based call where applicable
This lets us get rid of some awkwardness around the old API and single-use subroutines while saving us some LoC.
2023-04-07 umask: hoist out of InboxWritable
Since CodeSearchIdx doesn't deal with inboxes, it makes sense to split it out from inbox-specific code and start moving towards using OnDestroy to restore the umask at the end of scope and reducing extra functions.
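The general idea of restoring the umask at the end of a scope can be sketched with a small guard object. `ScopeGuard' and `with_umask' are hypothetical names for illustration; the OnDestroy class in public-inbox may have a different API.

```perl
use strict;
use warnings;

# Hypothetical scope guard: runs a callback when the object is destroyed.
package ScopeGuard;
sub new { my ($class, $cb, @args) = @_; bless [ $cb, @args ], $class }
sub DESTROY { my ($cb, @args) = @{$_[0]}; $cb->(@args) }

package main;

sub restore_umask { umask($_[0]) }

sub with_umask {
	my ($new, $work) = @_;
	my $old = umask($new);
	# the old umask comes back when $guard leaves scope,
	# including when $work dies
	my $guard = ScopeGuard->new(\&restore_umask, $old);
	$work->();
}

my $before = umask();
with_umask(0077, sub { printf("inside: %04o\n", umask()) });
printf("after: %04o\n", umask());
```

The guard replaces an explicit "restore" call on every exit path, which is where the savings in lines of code would come from.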
2023-03-25 ds: @post_loop_do replaces SetPostLoopCallback
This allows us to avoid repeatedly using memory-intensive anonymous subs in CodeSearchIdx where the callback is assigned frequently. Anonymous subs are known to leak memory in old Perls (e.g. 5.16.3 in enterprise distros) and are still expensive in newer Perls. So favor the (\&subroutine, @args) form which allows us to eliminate anonymous subs going forward. Only CodeSearchIdx takes advantage of the new API at the moment, since it's the biggest repeat user of post-loop callback changes. Getting rid of the subroutine and relying on a global `our' variable also has two advantages: 1) Perl warnings can detect typos at compile-time, whereas the (now gone) method could only detect errors at run-time. 2) `our' variable assignment can be `local'-ized to a scope.
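The (\&subroutine, @args) form and the `local'-ized `our' variable can be illustrated with a miniature loop. This is only a sketch, not the PublicInbox::DS implementation: `event_loop', `post_loop_ok', `more_work' and `tick' are hypothetical, and it assumes the loop keeps running while the callback returns true.

```perl
use strict;
use warnings;

# Hypothetical miniature loop, not the PublicInbox::DS implementation.
# Assumption: the loop keeps running while the callback returns true.
our @post_loop_do; # (\&subroutine, @args)

sub more_work { # named sub, no closure needed
	my ($remaining) = @_;
	$$remaining > 0;
}

sub tick {
	my ($remaining) = @_;
	$$remaining--;
}

sub post_loop_ok {
	my ($cb, @args) = @post_loop_do or return;
	$cb->(@args);
}

sub event_loop {
	my ($step, @step_args) = @_;
	do { $step->(@step_args) } while (post_loop_ok());
}

my $n = 3;
{
	# the assignment only lasts until the end of this block
	local @post_loop_do = (\&more_work, \$n);
	event_loop(\&tick, \$n);
}
print "n=$n, callback still set: ", (@post_loop_do ? 'yes' : 'no'), "\n";
```

Under `use strict', misspelling the variable name fails at compile time, whereas a misspelled method name would only fail when called.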
2022-10-24 treewide: replace /^I: / prefix with /^# /
This is more familiar to readers of TAP (Test Anything Protocol) output, as well as shell and Perl scripters who also use `#' for comments. AFAIK, nobody is parsing our stderr, and I'm not sure how standardized the `I:' prefix is (nor `W:' and `E:'). It's already the prevailing style in Lei* code, too, so things have been moving in that direction for a bit.
2022-03-08 index|extindex: support --dangerous flag
This enables Xapian::DB_DANGEROUS to support in-place updates. This can speed up the initial index and reduce I/O at the cost of preventing concurrent readers and being unsafe in the face of any abnormal terminations. This is more dangerous than --no-fsync. --no-fsync is only unsafe in the event of a power loss or kernel crash; --dangerous is unsafe even on SIGKILL.
2021-10-18 extindex: show mismatches for messages deleted from inbox
There seems to be a bug in v2 inbox reindexing somewhere...
2021-10-17 extindex: better locations for {quit} checks
Check for graceful termination at every message since it's a fairly inexpensive check.
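A per-message check of this kind can be sketched as follows. The loop, the `index_all' helper and the use of SIGTERM are assumptions for illustration; only the {quit} field name is taken from the commit subject.

```perl
use strict;
use warnings;

# Hypothetical indexer loop; only the {quit} field name comes from
# the commit subject, everything else is illustrative.
sub index_all {
	my ($self, $msgs, $index_one) = @_;
	for my $msg (@$msgs) {
		last if $self->{quit}; # cheap hash lookup, done per message
		$index_one->($self, $msg);
		$self->{nr}++;
	}
	$self->{nr};
}

my $self = { quit => 0, nr => 0 };
$SIG{TERM} = sub { $self->{quit} = 1 };

my $done = index_all($self, [ 1..5 ], sub {
	my ($s, $msg) = @_;
	kill('TERM', $$) if $msg == 3; # simulate an operator's SIGTERM
});
print "indexed $done of 5 before quitting\n";
```

The message being processed when the signal arrives is finished before the loop exits, so the work stops at a message boundary.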
2021-10-17 extindex: guard against false mismatch unrefs
I'm not sure if this is a bug or not (or it could be an old bug in the v2 indexing code).
2021-10-17 extindex: retry sync_inbox before reindex
Ensure the num highwater mark of the target inbox is stable before using it. Otherwise we may end up repeating work done to index a message.
2021-10-17 extindex: use localtime to display lock time
Since this is intended for use on the command-line, include TZ offset in time and try to shorten the message a bit so it wraps less on a terminal.
2021-10-16 extindex: avoid triggering a buggy unref
We can't attempt to unref messages beyond the highwater mark of an inbox. This bugfix was found by commit c485036d0b1ce7ed (extindex: guard against buggy unrefs, 2021-10-14), which actually did its intended job and guarded against a buggy unref.
2021-10-16 extindex: prune invalid alternate entries on --gc
Seeing the same warning over and over again gets annoying.
2021-10-16 smsg: add ->oidbin method
This makes some of our code less noisy by reducing the amount of pack('H*', ...) use.
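The effect of such an accessor can be shown with a stand-in class. `MySmsg' is hypothetical and is not the real smsg class; the sketch only assumes the object keeps a hex OID in its {blob} field, as the field name elsewhere in this log suggests.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a message object carrying a hex blob OID.
package MySmsg;
sub new { my ($class, %f) = @_; bless { %f }, $class }
# 40 hex characters -> 20 raw bytes for SHA-1
sub oidbin { pack('H*', $_[0]->{blob}) }

package main;

# the well-known SHA-1 of git's empty blob
my $smsg = MySmsg->new(blob => 'e69de29bb2d1d6434b8b29ae775ad8c2e48c5391');
printf("hex=%d bytes, bin=%d bytes\n",
	length($smsg->{blob}), length($smsg->oidbin));
```

Callers then write `$smsg->oidbin' instead of repeating the pack('H*', ...) expression at each call site.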
2021-10-14 extindex: guard against buggy unrefs
I noticed some unref messages which shouldn't have been happening, but they were. Which is troubling. So add a guard around an unref path until we can get to the bottom of this.
2021-10-13 extindex: set {current_info} in eidxq processing
This gives context as to where warnings are coming from.
2021-10-13 extindex: show OID on bad blob failure
AFAIK I've never hit these messages, but I might be glad if I ever do.
2021-10-13 extindex: flush pending reindex before unref
This prevents unnecessary message renumbering and I/O. Without this change, there is a small window for long-running WWW streaming requests to miss a message that was unref-ed before reindexing. If we expose an "All Mail" mailbox via IMAP/JMAP, this will save client traffic.
2021-10-12 extindex: avoid invalid blobs after unref
When unref-ing a blob from xref3, make sure the "preferred" smsg->{blob} doesn't point to the blob we just unrefed. This is necessary because we periodically checkpoint our extindex process to allow -watch and -mda processes to run. This also gets rid of a lot of redundant code for ->remove_xref3, since it's all handled in ExtSearchIdx, now.
2021-10-12 extindex: more consistent doc removal
We need to ensure a message is consistently removed from eidxq, over and Xapian in all cases. Removing from eidxq saves users from some noisy error messages.
2021-10-12 extindex: share unref logic in more places
We can use the same logic for --gc and --reindex and 'd' log entries. They're similar enough and the actual need to unref should be fairly rare. We could go a lot faster if we didn't show progress for --gc and --reindex, actually.
2021-10-12 extindex: rename var: active => active_shards
We also have the idea of active inboxes, too, so "active shards" ought to make the purpose of the data structure more obvious.
2021-10-12 extindex: speed up --reindex --fast
This required some tweaking of xref3 indices in over.sqlite3, but the end result is it brings no-op "--reindex --fast --all" checks down to roughly 20 minutes (from 30-40 minutes) on lore/all. This is faster because a bunch of small SQLite queries are still slower en masse than a bunch of perlops. Despite the lack of IPC overhead, crossing .so boundaries and repeating lookups over btrees is still slower than doing the same with Perl hash tables.
2021-10-10 extindex: sync each inbox before checking for missed messages
Otherwise, it gets too noisy and we repeat some work when we do an actual sync, since the last_commit info will be out-of-date.
2021-10-10 extindex: --gc doesn't touch ghost entries
We were deleting ghost entries. This was usually harmless since other messages could fill in the blanks, but it could cause misthreading in odd cases where a big chunk of a thread is missing and the latest messages only referenced ghosts. We'll also save some cycles when scanning Xapian shards since docids won't be <= 0.
2021-10-10 extindex: minor cost reductions
Don't bother decoding the 20-byte SHA-1 to a 40-byte hex value since we don't read it, anyways. We can also use the on-stack ibx->eidx_key value instead of dispatching the method again.
2021-10-10 extindex: speed up Xapian cleanup in --gc
Avoiding repeated SQL statements brings --gc down to 2-3 minutes from around 10. We'll also add some checkpoints around over and xref3 cleanups.
2021-10-09 extindex: support --reindex --fast
This mode only checks history for missed/stale messages and doesn't attempt to reindex messages which are already indexed.
2021-10-06 extindex: --gc checkpoints
We need to ensure -extindex --gc runs don't prevent other work from happening in the meantime. I actually caused my -extindex to OOM due to the lack of checkpoints :x We'll also hoist out the shard scanning into its own sub in preparation for lei/store usage.
2021-10-05 extsearchidx: favor 20-byte OID comparison
As with most of our internal-only code, favor smaller comparisons to reduce memory traffic.
2021-10-02 extsearchidx: emit diagnostics for missing blobs
I'm not sure why they weren't emitted, earlier.
2021-10-02 extsearchidx: attach_config: set {ibx_map} value to $ibx
It doesn't seem to matter, actually, but this matches the behavior of attach_inbox and the comment in ->new.
2021-10-02 extsearchidx: do not process eidxq w/o config
When indexing a single inbox, do not attempt reindexing code paths without a full config, otherwise ordering comparisons won't work.
2021-10-01 ds: simplify signalfd use
Since signalfd is often combined with our event loop, give it a convenient API and reduce the code duplication required to use it. EventLoop is replaced with ::event_loop to allow consistent parameter passing and avoid needlessly passing the package name on stack. We also avoid exporting SFD_NONBLOCK since it's the only flag we support. There's no sense in having the memory overhead of a constant function when it's in cold code.
2021-09-22 treewide: fix %SIG localization, harder
This fixes the occasional t/lei-sigpipe.t infinite loop under "make check-run". Link: http://nntp.perl.org/group/perl.perl5.porters/258784 <CAHhgV8hPbcmkzWizp6Vijw921M5BOXixj4+zTh3nRS9vRBYk8w@mail.gmail.com> Followup-to: b552bb9150775fe4 ("daemon+watch: fix localization of %SIG for non-signalfd users")
2021-09-15 multi_git: hoist out common epoch/alternates handling
IMHO, this greatly improves code sharing and organization between v2, extindex, and lei/store. Common git-related logic for these is lightly-refactored and easier to reason about. The impetus for this big change was to ensure inboxes created+managed by public-inbox-{clone,fetch} could have alternates and configs setup properly without depending on SQLite (via V2Writable). This change does that while making old code shorter and better factored.
2021-09-01 extindex: --gc removes messages from over, too
While messages from removed inboxes were removed from Xapian search, --gc failed to remove messages from over.sqlite3 entirely. They no longer show up in the topic summary view. Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org> Link: https://public-inbox.org/20210830201723.dehoul4y6gpqf2cp@nitro.local/
2021-08-04 extindex: fix boost with partial runs
Boost relies on knowledge of all inboxes in a given config file to work properly. So while we support indexing a subset of inboxes, we must still account for boost in inboxes we're not indexing. So split internal inbox groups into "known" and "active", where previously we only cared for inboxes which were being actively indexed. Furthermore, boost checks need to be applied when a message arrives in different inboxes across multiple invocations. Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org> Link: https://public-inbox.org/meta/20210802204058.vscbxs5q7xyolyu2@nitro.local/
2021-08-04 extindex: do not over-account for cross-posted messages
Cross-posted messages don't result in massive writes to the Xapian DBs like a completely unseen message would, so stop accounting for their size. This ought to improve performance for heavily cross-posted setups, but --commit-interval still has effect.
2021-07-25 extindex: support --jobs/-j properly on creation for shard count
This wasn't wired up properly, but Xapian appears to suffer from I/O amplification problems as DB shards get larger: https://lists.xapian.org/pipermail/xapian-discuss/2019-February/009727.html <23640.32170.703368.841021@y.dockes.com> Of course, we shouldn't have too many shards, either; because performance problems with too many shards were the entire reason extindex was created: https://lists.xapian.org/pipermail/xapian-discuss/2020-August/009823.html <20200826064728.GA32239@dcvr>
2021-07-25 extsearchidx: favor binary comparison in common case
We'll use 20-byte SHA-1 comparisons instead of 40-byte hex representations for a minor reduction in memory traffic.
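The comparison itself is easy to sketch: convert the hex OID once, then compare raw 20-byte strings. The data and variable names below are made up for illustration; they are not taken from ExtSearchIdx.pm.

```perl
use strict;
use warnings;

# Hypothetical data: raw 20-byte OIDs as they might be stored in a DB.
# These are the well-known SHA-1s of git's empty blob and empty tree.
my @stored = map { pack('H*', $_) } (
	'e69de29bb2d1d6434b8b29ae775ad8c2e48c5391',
	'4b825dc642cb6eb9a060e54bf8d69288fbee4904',
);
my $want_hex = '4b825dc642cb6eb9a060e54bf8d69288fbee4904';

# convert the needle once instead of unpacking every stored value
my $want_bin = pack('H*', $want_hex);
my @hits = grep { $_ eq $want_bin } @stored;
print scalar(@hits), " match(es)\n";
```

Each `eq' then touches at most 20 bytes per operand instead of 40, and no hex copy of the stored values is ever created.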
2021-07-25 extsearchidx: use more appropriate max for dedupe
The over.msgid table may contain ghost Message-IDs and also Message-IDs of deleted spam messages, so over->max isn't a good approximation of dedupe progress.
2021-07-25 extindex: improve comment around git->async_wait_all
I found myself tempted to remove this, but it appears impossible due to odd messages which have multiple Message-IDs.
2021-07-25 extindex: support --dedupe[=MSGID]
Sometimes I just want to dedupe a single Message-ID to test something, and this lets me do it. This patch appears to do what it's supposed to. But it also appears to be finding duplicates that were previously missed. That's a good thing, but I wish I understood what seems to be fixed :x I'm not sure why the previous ExtSearchIdx.pm (blob 357312b8) was causing messages to be missed, even, and why this patch seems to fix it... And it's not infinite looping, either. Anyways, before this patch, "-extindex --dedupe" was taking ~5 min to no-op every message (after the initial full --dedupe run which took over a day to run). No-op --dedupes now take just under 2 hours to scan every single cross-posted message for a no-op dedupe. The initial dedupe took nearly 44 hours on my system for <https://yhbt.net/lore/all/> due to SATA-2 TLC SSD latency on 3 gigantic Xapian shards. Running --dedupe with this change seems to prevent /BUG\?.*?not deduplicated properly/ stderr messages from being triggered by View.pm. Current versions of -extindex do not seem susceptible to introducing duplicates.
2021-07-22 extsearch: support publicinbox.*.boost parameter
This behaves identically to the lei external "boost" parameter in prioritizing raw messages for extindex. Relying exclusively on the config file order doesn't work well for mirrors since it's impossible to guarantee config file ordering via grokmirror hooks. Config file ordering remains the default if boost is unconfigured, or in case of ties. Note: I chose the name "boost" rather than "priority" or "rank" since I always get confused by whether higher or lower numbers take precedence when it comes to kernel scheduling. "weight" is also a part of Xapian API terminology, which we currently do not expose to configuration (but may in the future).
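The ordering rule described here (boost first, config file order on ties) can be sketched as a two-key sort. The inbox records and the `order' field are hypothetical, and the sketch assumes a larger boost value is preferred; it is not the code from ExtSearchIdx.pm.

```perl
use strict;
use warnings;

# Hypothetical inbox records; `order' is the position in the config file.
my @ibxs = (
	{ name => 'mirror-a', boost => 0, order => 0 },
	{ name => 'canonical', boost => 10, order => 1 },
	{ name => 'mirror-b', boost => 0, order => 2 },
);

# Assumption: a larger boost sorts first; ties fall back to config order.
my @sorted = sort {
	$b->{boost} <=> $a->{boost} || $a->{order} <=> $b->{order}
} @ibxs;
print join(' ', map { $_->{name} } @sorted), "\n";
```

With no boost configured anywhere, every comparison falls through to `order', which reproduces plain config file ordering.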
2021-07-08 extindex: dedupe: reduce SQLite contention and dirty data
Complex queries cause SQLite to block readers for longer than their retry period. For dedupe, it was also preventing us from making good use of checkpoints due to the query time. With many deduplications, checkpoints are necessary to maintain system health due to having too much data piled up.
2021-07-08 extsearchidx: ignore Eml warnings across the board
There's nothing we can do about misformatted emails and headers we get from untrusted sources. They're too noisy and those messages already exist in public-inboxes, anyways, so just keep things quiet so we can spot real problems more easily.