path: root/lib/PublicInbox/ExtSearchIdx.pm
Date Commit message
2021-07-08extindex: dedupe: reduce SQLite contention and dirty data
Complex queries cause SQLite to block readers for longer than their retry period. For dedupe, it was also preventing us from making good use of checkpoints due to the query time. With many deduplications, checkpoints are necessary to maintain system health due to having too much data piled up.
2021-07-08extsearchidx: ignore Eml warnings across the board
There's nothing we can do about misformatted emails and headers we get from untrusted sources. They're too noisy and those messages already exist in public-inboxes, anyways, so just keep things quiet so we can spot real problems more easily.
2021-07-06extindex: implement --dedupe to fix old extindices
This is intended to fix older indices that had deduplication bugs for matching content. It'll also make dealing with future changes to ContentHash easier since that's never guaranteed stable. It also supports --dry-run to print changes only without making them.
2021-07-03extsearchidx: extra assertions for deduplication flow
I haven't found any bugs from this (still looking for missed deduplication bugs), and it's a bit shorter and more likely to catch future bugs. Clean up an unnecessary ->{mid} array copy while we're at it, too.
2021-07-01extsearchidx: lock before writing multi-pack-index
This avoids errors from git in case -extindex gets invoked in parallel.
2021-06-30extsearchidx: symlink .rev and .bitmap files into ALL.git
It's possible for these to exist and git can (or may eventually) take advantage of them to speed up functionality which affects us.
2021-06-27extindex: maintain pack symlinks and use "git multi-pack-index"
This is a fair amount of complexity, but it speeds up "git cat-file --batch" startup by 3-4% with 50K packfiles with a hot kernel cache. This appears extremely sensitive to RAM available to the kernel page cache with my SATA 2 SSD. Faster storage and more RAM can bring loading pack. 2.60s vs 2.69s were the best cases on my workstation with and without the multi-pack-index, however times could be all over the place (even in the minutes) with more activity on my workstation. Getting sub-minute times requires a git patch to speed up alt_odb_usable(): <https://lore.kernel.org/20210624005806.12079-1-e@80x24.org/> Otherwise, prepare to wait several minutes.
2021-04-30lei_store: fix locking w.r.t epoch creation
Prior to this change, it was possible for oneshot lei processes to race on epoch creation/rollover. lei-daemon normally prevents the problem by funnelling all writes to a single socket, but oneshot lei has no such protection.
2021-04-24extindex: --gc: use escape pathnames for SQL LIKE properly
This allows us to handle odd inboxes w/o a newsgroup configured if they also make the strange choice of having backslashes in their path name. Also, ensure we use case-sensitive LIKE, since case-insensitive FSes are not worth supporting.
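A minimal sketch of the escaping idea, using Python's sqlite3 module rather than the Perl DBI code in this file. The table name `inboxes`, the column `eidx_key`, and the helpers `like_escape` and `keys_under` are hypothetical and only illustrate how `ESCAPE` and `PRAGMA case_sensitive_like` combine; the actual schema and queries may differ.

```python
import sqlite3

def like_escape(s, esc="\\"):
    """Escape the LIKE wildcards (% and _) and the escape character itself."""
    return (s.replace(esc, esc + esc)
             .replace("%", esc + "%")
             .replace("_", esc + "_"))

conn = sqlite3.connect(":memory:")
# SQLite's LIKE ignores ASCII case by default; this pragma makes it case-sensitive
conn.execute("PRAGMA case_sensitive_like = ON")
conn.execute("CREATE TABLE inboxes (eidx_key TEXT)")  # hypothetical table
paths = ["/srv/a_b\\c/x", "/srv/aXb\\c/x", "/SRV/a_b\\c/x", "/srv/a_b\\c"]
conn.executemany("INSERT INTO inboxes VALUES (?)", [(p,) for p in paths])

def keys_under(conn, prefix):
    """Return keys starting with the literal prefix, wildcards neutralized."""
    pattern = like_escape(prefix) + "%"
    rows = conn.execute(
        "SELECT eidx_key FROM inboxes "
        "WHERE eidx_key LIKE ? ESCAPE '\\' ORDER BY eidx_key",
        (pattern,))
    return [r[0] for r in rows]
```

Without the escaping, the `_` in the path would match any single character and `/srv/aXb\c/x` would be selected as well.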
2021-03-04lei q: import flags when clobbering/augmenting Maildirs
This will eventually be supported for other mail stores, but Maildir is the easiest to test and support, here. This lets us avoid a situation where flag changes get lost between search results.
2021-02-24treewide: avoid "delete local" construct on hashes
Apparently this feature is only in Perl 5.12+, and we're still on Perl 5.10.
2021-02-08ds: improve add_timer usability
Packing args into an arrayref is awkward and we may be using this API more in lei.
2021-01-26miscidx: switch to lazy transactions
This fixes a sporadic failure on a 1/2 core VM where "git cat-file --batch" hasn't started up by the time $cleanup->() destroys the ALL.git directory in t/lei.t (but not t/lei-oneshot.t). This happens because dwaitpid() runs inside the event loop asynchronously and we were able to return to the client before the cat-file process could even start. I could not reproduce this failure on my usual 4-core workstation via "schedtool -a 0x1" to force the entire test to use a single core. Lazy transactions match OverIdx and SearchIdx behavior, and I've verified this lets us avoid problems with old Xapian versions (on CentOS 7.x) which failed to set FD_CLOEXEC.
2021-01-18extindex: fix w/ Xapian 1.2.21..1.2.24
Xapian v1.2.21..v1.2.24 failed to set the close-on-exec flag on the flintlock FD, causing "git cat-file" processes to hold onto the lock and prevent subsequent Xapian::WritableDatabase from locking the DB. So cleanup git processes after committing the miscidx transaction.
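For readers unfamiliar with the flag in question, here is a small sketch (Python on a Unix-like system, not the Perl or Xapian code) of how the close-on-exec bit on a file descriptor can be inspected and toggled. The helper names `is_cloexec` and `set_cloexec` are made up for illustration. A descriptor without this bit survives exec() in child processes, which is how a spawned "git cat-file" could keep a lock FD open.

```python
import fcntl
import os

def is_cloexec(fd):
    """True if the descriptor will be closed automatically on exec()."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)

def set_cloexec(fd, on=True):
    """Set or clear the close-on-exec bit, leaving other FD flags alone."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    if on:
        flags |= fcntl.FD_CLOEXEC
    else:
        flags &= ~fcntl.FD_CLOEXEC
    fcntl.fcntl(fd, fcntl.F_SETFD, flags)
```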
2021-01-12ds: block signals when reaping
This lets us call dwaitpid long before a process exits and not have to wait around for it. This is advantageous for lei where we can run dwaitpid on the pager as soon as we spawn it, instead of waiting for a client socket to go away on DESTROY.
2021-01-03use Eml (or MIME) objects for all indexing paths
We don't need to be keeping the raw message around after it hits git. Shard work now relies on Storable (or Sereal) and all of the indexing code relies on the Email::MIME-like API of Eml to access interesting parts of the message. Similarly, smsg->{raw_bytes} is no longer carried around and we do the CRLF adjustment when setting smsg->{bytes}. There's also a small simplification to t/import.t while we're in the area to use xqx instead of spawn/popen_rd.
2021-01-03searchidxshard: replace index_raw with index_eml
Since Storable and Sereal are designed for lossless serialization, we'll just pass $eml objects to whatever process is running SearchIdx.
2021-01-03searchidxshard: IPC conversion, part 2
We can remove some now-pointless wrapper functions by using ->ipc_do in even more places.
2021-01-01update copyrights for 2021
Using "make update-copyrights" after setting GNULIB_PATH in my config.mak
2020-12-31Merge remote-tracking branch 'origin/master' into lorelei
* origin/master: (58 commits)
  ds: flatten + reuse @events, epoll_wait style fixes
  ds: simplify EventLoop implementation
  check defined return value for localized slurp errors
  import: check for git->qx errors, clearer return values
  git: qx: avoid extra "local" for scalar context case
  search: remove {mset} option for ->mset method
  search: remove pointless {relevance} setting
  miscsearch: take reopen from Search and use it
  extsearch: unconditionally reopen on access
  extindex: allow using --all without EXTINDEX_DIR
  extindex: add undocumented --no-scan switch
  extindex: enable autoflush on STDOUT/STDERR
  extindex: various --watch signal handling fixes
  extindex: --watch for inotify-based updates
  eml: fix undefined vars on <Perl 5.28
  t/config: test --get-urlmatch for git <2.26
  default to CORE::warn in $SIG{__WARN__} handlers
  inbox: name variable for values loop iterator
  inboxidle: avoid needless syscalls on refresh
  inboxidle: clue users into resolving ENOSPC from inotify
  ...
2020-12-27extindex: add undocumented --no-scan switch
This makes diagnosing --watch problems easier when there's 50K inboxes by avoiding the lengthy scan (which is the reason --watch exists in the first place).
2020-12-27extindex: various --watch signal handling fixes
We need to clobber the SIGUSR1 resync queue on SIGHUP to invalidate old inbox objects. Furthermore, the lengthy initial scan needs to ignore signals intended for the event loop to avoid unexpected behavior. Finally, add some progress output to inform users on the terminal.
2020-12-27extindex: --watch for inotify-based updates
This reuses existing InboxIdle infrastructure to update external indices based on per-inbox updates. This is an alternative to auto-updating external indices via the -index command and also works with existing uses of -mda and public-inbox-watch. Using inotify (or EVFILT_VNODE) allows watching thousands of inboxes without having to scan every single one at every invocation. This is especially beneficial in cases where an external index is not writable to the users writing to per-inbox indices.
2020-12-26default to CORE::warn in $SIG{__WARN__} handlers
As with CORE::die and $SIG{__DIE__}, it turns out CORE::warn is safe to use inside $SIG{__WARN__} handlers without triggering infinite recursion. So fall back to reusing CORE::warn instead of creating a new sub.
2020-12-26index: fix --no-fsync flag propagation to extindex
Negation in flag names is confusing, but trying to deviate from the DB_NO_SYNC name used by Xapian is also confusing.
2020-12-26extsearchidx: close DB handles after use if FD constrained
Most distros ship with low RLIMIT_NOFILE limits and surprises may lurk for admins who configure many inboxes. Keep FD usage under control to avoid EMFILE errors at inopportune times during reindex. From what I can tell, this is the only place where extindex can have unpredictable FD growth when there's thousands of inboxes, and it's in an extremely rare code path.
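One way such a check could look, sketched in Python rather than Perl. The function name `fd_constrained`, the per-inbox FD estimate, and the reserve are all assumptions chosen for illustration; the commit does not state which threshold is actually used.

```python
import resource

def fd_constrained(n_inboxes, soft_limit=None, fds_per_inbox=2, reserve=64):
    """Guess whether keeping handles open for every inbox could hit EMFILE.

    fds_per_inbox and reserve are illustrative guesses, not measured values.
    """
    if soft_limit is None:
        soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY:
        return False
    return n_inboxes * fds_per_inbox + reserve > soft_limit
```

A caller would close DB handles right after use when this returns true, and keep them cached otherwise.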
2020-12-26extsearchidx: delay SQLite availability checks
This will make attach_inbox faster for no-op calls. It also helps us avoid races in case msgmap or over.sqlite3 gets unlinked while -extindex is running.
2020-12-25inboxwritable: delay umask_prepare calls
This simplifies all ->with_umask callers and opens the door for further optimizations to delay/elide process spawning.
2020-12-23extsearchidx: close SQLite handles after attaching
This is needed to prevent us from running out of FDs when indexing many inboxes. Perhaps checking these on attach_inbox is unnecessary and may be removed entirely down the line.
2020-12-23miscsearch: index UIDVALIDITY, use as startup cache
This brings -nntpd startup time down from ~35s to ~5s with 50K inboxes. Further improvements ought to be possible with deeper changes to MiscIdx, since -mda having to load every inbox seems unreasonable; but this general change is fairly unintrusive.
2020-12-21extsearch*: drop unnecessary path canonicalization
Unlike inboxdir, the canonical-ness of -extindex paths is not relevant at the moment, and may never be relevant at all. So don't mislead others into thinking these paths being canonicalized matters.
2020-12-21use rel2abs_collapsed when loading Inbox objects
We need to canonicalize paths for inboxes which do not have a newsgroup defined, otherwise ->eidx_key matches can fail in unexpected ways.
2020-12-19lei_store: local storage for Local Email Interface
Still unstable, this builds off the equally unstable extindex :P This will be used for caching/memoization of traditional mail stores (IMAP, Maildir, etc) while providing indexing via Xapian, along with compression, and checksumming from git. Most notably, this adds the ability to add/remove per-message keywords (draft, seen, flagged, answered) as described in the JMAP specification (RFC 8621 section 4.1.1). We'll use `.' (a single period) as an $eidx_key since it's an invalid {inboxdir} or {newsgroup} name.
2020-12-18extsearchidx: improve missing machine-id fallback
It's likely most GNU/Linux systems have /etc/machine-id these days, so anything missing it is likely a *BSD, most of which support and favor "sysctl -n kern.hostid". We'll also support "ghostid" since GNU utils are commonly prefixed with 'g' on non-GNU platforms. In any case, we'll suppress stderr from missing commands and fall back to hard coding an $OSNAME-based identifier as a last resort and hope the hostname is unique.
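The fallback chain described above might be sketched like this in Python. The function names, the order of the commands, and the exact format of the last-resort string are assumptions; only /etc/machine-id, "sysctl -n kern.hostid", "ghostid", and the suppressed stderr come from the commit message.

```python
import platform
import socket
import subprocess

def read_first_line(path):
    """Return the first line of a file, or '' if it cannot be read."""
    try:
        with open(path) as fh:
            return fh.readline().strip()
    except OSError:
        return ""

def run_quiet(cmd):
    """Run a command with stderr discarded; '' if it is missing or fails."""
    try:
        proc = subprocess.run(list(cmd), stdout=subprocess.PIPE,
                              stderr=subprocess.DEVNULL,
                              universal_newlines=True)
    except OSError:
        return ""
    return proc.stdout.strip() if proc.returncode == 0 else ""

def host_identifier(machine_id_path="/etc/machine-id",
                    commands=(("sysctl", "-n", "kern.hostid"), ("ghostid",))):
    """Try the machine-id file, then each command, then an OS-name fallback."""
    ident = read_first_line(machine_id_path)
    if ident:
        return ident
    for cmd in commands:
        ident = run_quiet(cmd)
        if ident:
            return ident
    # last resort: hope the hostname is unique (string format is an assumption)
    return "%s.%s" % (platform.system().lower(), socket.gethostname())
```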
2020-12-17extsearchidx: no need to make InboxWritable
extindex treats v1/v2 public inboxes as read-only, so there's no need to scare people by using the InboxWritable package now that ->git_dir_n is gone and we can use ->max_git_epoch instead of ->git_dir_latest.
2020-12-17inbox: simplify v2 epoch counting
Perl readdir detects list context and can return an array suitable for the grep op. From there, we can rely on substr to remove the ".git" suffix and integerize the value to save a few bytes before letting List::Util::max return the value. This is how we detect Xapian shards nowadays, too, and we'll also use defined-or (//) to simplify the return value there. We'll also simplify InboxWritable->git_dir_latest, remove some callers, and consider removing it entirely.
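The same readdir + grep + substr + max idea, transliterated into a Python sketch. The function name `max_epoch` is hypothetical, and the directory argument is assumed to be the one holding the numbered `N.git` epoch directories.

```python
import os
import re

def max_epoch(epoch_dir):
    """Return the highest N among N.git entries, or None if there are none."""
    try:
        names = os.listdir(epoch_dir)
    except FileNotFoundError:
        return None
    # keep only "<digits>.git", strip the 4-character suffix, integerize
    epochs = [int(n[:-4]) for n in names if re.fullmatch(r"[0-9]+\.git", n)]
    return max(epochs, default=None)
```

Integerizing before taking the maximum matters: comparing the names as strings would rank "9.git" above "10.git".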
2020-12-17extsearchidx: lock eidxq on full --reindex
Incremental indexing can use the `eidxq' reindexing queue for handling deletes and resuming interrupted indexing. Ensure those incremental -extindex invocations do not steal (and prematurely perform) work that an "-extindex --reindex" invocation is handling.
2020-12-17extsearchidx: reindex releases over.sqlite3 handles properly
When checkpointing and yielding the lock to other processes, we need to ensure any open DB statement handles are closed, since they reference and prevent DB FDs from being closed and unlocked. And clean up some progress reporting while we're at it.
2020-12-17extsearchidx: simplify reindex code paths
Since we're inside a Xapian transaction, calling ->index_raw followed by ->shard_add_eidx_info calls on the same docid doesn't seem to hurt indexing performance. It definitely reduces FS read traffic and IPC from git at the cost of some more IPC between the parent and workers. Nevertheless, the code and FD reductions seem worth it.
2020-12-17extsearchidx: checkpoint releases locks
--reindex can take many hours or days, ensure we release locks according to --batch-size so automated fetch+index jobs can write new data to indices while we update old data.
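A rough sketch of the checkpoint pattern in Python, assuming a Unix-like system with flock(2). The `reindex` function, the `index_one` and `commit` callbacks, and the use of a byte count as the batch measure are illustrative assumptions, not the actual ExtSearchIdx code.

```python
import fcntl

def reindex(items, lock_path, batch_size, index_one, commit):
    """Index items, committing and briefly dropping the lock every batch.

    Returns how many times the lock was yielded to other processes.
    """
    checkpoints = 0
    with open(lock_path, "a") as lock_fh:
        fcntl.flock(lock_fh, fcntl.LOCK_EX)
        pending = 0
        for item in items:
            index_one(item)
            pending += len(item)
            if pending >= batch_size:
                commit()                             # flush our work first
                fcntl.flock(lock_fh, fcntl.LOCK_UN)  # let other writers in
                checkpoints += 1
                fcntl.flock(lock_fh, fcntl.LOCK_EX)  # may block until they finish
                pending = 0
        if pending:
            commit()
        fcntl.flock(lock_fh, fcntl.LOCK_UN)
    return checkpoints
```

A smaller batch size yields the lock more often, at the cost of more commits over a long reindex.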
2020-12-17extsearchidx: reindex works on Xapian, too
Instead of just working on over.sqlite3, we need to work on the Xapian DBs as well. While no changes to our Xapian use have taken place recently, they could in the future and --reindex exists to account for that.
2020-12-17extindex: support --rethread and content bifurcation
--rethread is useful for dealing with bugs and behaves just like it does with current inboxes. This is in case our content deduplication logic changes for whatever reason and causes previously merged messages to be considered "different". As with v2, this won't allow us to merge messages in a way that allows deduplicating messages which were previously considered different, but v2 inboxes do not allow that, either. In other words, this makes the --reindex and --rethread switches of -extindex match the behavior of v2 -index.
2020-12-17extindex: delete stale messages from over.sqlite3
In addition to removing stale messages from Xapian, we must also remove them from over.sqlite3.
2020-12-17extindex: preliminary --reindex support
--reindex allows us to catch missed and stale messages due to -extindex vs -index races prior to commit 02b2fcc46f364b51 ("extsearchidx: enforce -index before -extindex"). We'll also rely on reindex to internally deal with v1/v2 inbox removals and partial-unindexing of messages which are only removed from one inbox out of many. This reindex design is completely different than how normal v1/v2 inbox reindex operates due to extindex having multiple histories to work with. Instead of scanning git history, this relies exclusively on comparing over.sqlite3 contents between the v1/v2 inboxes and the extindex. Changes to Xapian behavior also get picked up, now. Xapian indexing is handled by workers with minimal IPC to the parent process. This results in more read I/O but fewer writes when dealing with cross-posted messages. Changes to $smsg->populate and --rethread still need further work.
2020-12-10extsearchidx: enforce -index before -extindex
We cannot set xref3 data without the `xnum' column to tie it to the per-inbox over.sqlite3 DB. So ensure we don't read brand-new history that only exists in git, but instead rely on last_commit and last_xap15-$EPOCH metadata in msgmap to decide how far we can index. Before this change, it was possible to miss messages in the extindex if -index did not run (which will be fixable by upcoming --reindex support in -extindex).
2020-12-10searchidx: all indexers check for bad blobs
This should help us detect bugs in our code or storage synchronization problems more easily. This probably won't detect corrupted git storage, but can detect corrupted SQLite files. "Bad blobs, bad blobs, whatcha gonna do when they come for you?"
2020-12-09extsearchidx: ck_existing: set $OID for warning context
The content_hash() hash in the same scope may trigger warnings for a given blob, so ensure we correctly report the blob where it happens.
2020-12-08shard_add_eidx_info: pass $eidx_key instead of $ibx object
This improves consistency with sibling methods such as ->shard_remove_eidx_info and ->add_xref3. Passing the $eidx_key scalar is preferable to the entire $ibx object for IPC-friendliness.
2020-12-08searchidx: remove $oid parameter from most calls
Xapian docids have been tied to the over {num} column for nearly 3 years, now; and OIDs are no longer stored in Xapian document data. There's no need to increase code and IPC complexity by passing the OID around.
2020-12-08extsearchidx: remove needless SHA-1 check
There is no need to verify checksums of data already stored in git. Doing this ourselves also limits flexibility in moving to other hashes.