path: root/lib/PublicInbox/SearchIdx.pm
Date Commit message
2021-03-11 searchidx: remove smsg_from_doc
We no longer read Xapian docdata and favor hitting over.sqlite3, instead, as Xapian is less likely to be available than SQLite.
2021-01-24 smsg: make parse_references an object method
Having parse_references in OverIdx was awkward and Smsg is a better place for it.
2021-01-18 extindex: fix w/ Xapian 1.2.21..1.2.24
Xapian v1.2.21..v1.2.24 failed to set the close-on-exec flag on the flintlock FD, causing "git cat-file" processes to hold onto the lock and prevent subsequent Xapian::WritableDatabase from locking the DB. So cleanup git processes after committing the miscidx transaction.
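The failure mode described above can be illustrated outside of Xapian. What follows is a minimal Python sketch, not the public-inbox or Xapian code: it assumes a POSIX system, and `is_cloexec` and the temporary file standing in for the lock file are hypothetical names used only for illustration. A descriptor without the close-on-exec flag survives exec in a child process, so a lock tied to it stays held for as long as that child lives.

```python
import fcntl
import os
import tempfile

def is_cloexec(fd):
    """Return True if the descriptor has FD_CLOEXEC set (hypothetical helper)."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)

# Stand-in for a lock file; the real flintlock file is managed by Xapian.
lock_file = tempfile.TemporaryFile()
fd = lock_file.fileno()

# Python creates descriptors with close-on-exec set by default (PEP 446),
# so an exec'd child would not inherit this one.
default_flag = is_cloexec(fd)

# Clearing the flag reproduces the leak: a long-lived child process would
# now keep the descriptor (and any lock on it) open across exec.
os.set_inheritable(fd, True)
leaky_flag = is_cloexec(fd)

# Setting the flag again closes the descriptor in exec'd children.
os.set_inheritable(fd, False)
fixed_flag = is_cloexec(fd)

lock_file.close()
```

Since the flag is missing inside the affected Xapian releases, the commit works around the problem from the caller's side by cleaning up the git processes once the transaction is committed.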
2021-01-03 searchidxshard: use add_xapian directly for v2
This distinguishes the v1 and v2-only code paths more clearly, and may save a few cycles.
2021-01-03 use Eml (or MIME) objects for all indexing paths
We don't need to be keeping the raw message around after it hits git. Shard work now relies on Storable (or Sereal) and all of the indexing code relies on the Email::MIME-like API of Eml to access interesting parts of the message. Similarly, smsg->{raw_bytes} is no longer carried around and we do the CRLF adjustment when setting smsg->{bytes}. There's also a small simplification to t/import.t while we're in the area to use xqx instead of spawn/popen_rd.
2021-01-03 searchidxshard: IPC conversion, part 2
We can remove some now-pointless wrapper functions by using ->ipc_do in even more places.
2021-01-03 searchidxshard: use PublicInbox::IPC to kill lots of code
It's nice to prove the new code works by swapping it into the current V2Writable / SearchIdxShard packages. This is only the first step for the core bits, and we'll be able to delete more code in a subsequent patch.
2021-01-01 update copyrights for 2021
Using "make update-copyrights" after setting GNULIB_PATH in my config.mak
2020-12-31 Merge remote-tracking branch 'origin/master' into lorelei
* origin/master: (58 commits)
  ds: flatten + reuse @events, epoll_wait style fixes
  ds: simplify EventLoop implementation
  check defined return value for localized slurp errors
  import: check for git->qx errors, clearer return values
  git: qx: avoid extra "local" for scalar context case
  search: remove {mset} option for ->mset method
  search: remove pointless {relevance} setting
  miscsearch: take reopen from Search and use it
  extsearch: unconditionally reopen on access
  extindex: allow using --all without EXTINDEX_DIR
  extindex: add undocumented --no-scan switch
  extindex: enable autoflush on STDOUT/STDERR
  extindex: various --watch signal handling fixes
  extindex: --watch for inotify-based updates
  eml: fix undefined vars on <Perl 5.28
  t/config: test --get-urlmatch for git <2.26
  default to CORE::warn in $SIG{__WARN__} handlers
  inbox: name variable for values loop iterator
  inboxidle: avoid needless syscalls on refresh
  inboxidle: clue users into resolving ENOSPC from inotify
  ...
2020-12-26 index: do not attach inbox to extindex unless updated
We'll count the number of log changes (regardless of index or unindex) and only attach inboxes to ExtSearchIdx objects when they get new work. We'll also reduce lock bouncing and only update external indices after all per-inbox indexing is done. This also updates existing v2 indexing/unindexing callers to be more consistent and ensures unindex log entries update per-inbox last commit information.
2020-12-25 inboxwritable: delay umask_prepare calls
This simplifies all ->with_umask callers and opens the door for further optimizations to delay/elide process spawning.
2020-12-23 miscsearch: index UIDVALIDITY, use as startup cache
This brings -nntpd startup time down from ~35s to ~5s with 50K inboxes. Further improvements ought to be possible with deeper changes to MiscIdx, since -mda having to load every inbox seems unreasonable; but this general change is fairly unintrusive.
2020-12-21 searchidx: rename get_val to int_val and return IV
Values can be strings in Xapian, although we currently use integer values exclusively. Give the wrapper a more appropriate name in case we start using string columns. For future-proofing, we'll now return `undef' on missing columns and coerce the return value to an IV (integer value) to save memory, as sortable_unserialise returns a PV (pointer value) scalar despite it existing to support numeric values.
2020-12-19 search: simplify initialization, add ->xdb_shards_flat
This reduces differences between v1 and v2 code, and introduces ->xdb_shards_flat to provide read-only access to shards without using Xapian::MultiDatabase. This will allow us to combine shards of several inboxes AND extindexes for lei.
2020-12-19 lei_store: local storage for Local Email Interface
Still unstable, this builds off the equally unstable extindex :P This will be used for caching/memoization of traditional mail stores (IMAP, Maildir, etc) while providing indexing via Xapian, along with compression, and checksumming from git. Most notably, this adds the ability to add/remove per-message keywords (draft, seen, flagged, answered) as described in the JMAP specification (RFC 8621 section 4.1.1). We'll use `.' (a single period) as an $eidx_key since it's an invalid {inboxdir} or {newsgroup} name.
2020-12-17 index: ignore some warnings, set {current_info} for v1
-index runs on data that's already frozen in git, so there's no point in warning users about it. While we're at it, set the {current_info} prefix for v1 as we do in v2 inboxes in case new problems show up.
2020-12-17 extsearchidx: simplify reindex code paths
Since we're inside a Xapian transaction, calling ->index_raw followed by ->shard_add_eidx_info calls on the same docid doesn't seem to hurt indexing performance. It definitely reduces FS read traffic and IPC from git at the cost of some more IPC between the parent and workers. Nevertheless, the code and FD reductions seem worth it.
2020-12-17 extindex: preliminary --reindex support
--reindex allows us to catch missed and stale messages due to -extindex vs -index races prior to commit 02b2fcc46f364b51 ("extsearchidx: enforce -index before -extindex"). We'll also rely on reindex to internally deal with v1/v2 inbox removals and partial-unindexing of messages which are only removed from one inbox out of many. This reindex design is completely different than how normal v1/v2 inbox reindex operates due to extindex having multiple histories to work with. Instead of scanning git history, this relies exclusively on comparing over.sqlite3 contents between the v1/v2 inboxes and the extindex. Changes to Xapian behavior also get picked up, now. Xapian indexing is handled by workers with minimal IPC to the parent process. This results in more read I/O but fewer writes when dealing with cross-posted messages. Changes to $smsg->populate and --rethread still need further work.
2020-12-10 searchidx: all indexers check for bad blobs
This should help us detect bugs in our code or storage synchronization problems more easily. This probably won't detect corrupted git storage, but can detect corrupted SQLite files. "Bad blobs, bad blobs, whatcha gonna do when they come for you?"
2020-12-08 searchidx: remove $oid parameter from most calls
Xapian docids have been tied to the over {num} column for nearly 3 years, now; and OIDs are no longer stored in Xapian document data. There's no need to increase code and IPC complexity by passing the OID around.
2020-11-29 extindex: support `--gc' to remove dead inboxes
Inboxes may be removed or newsgroups renamed over time. Introduce a switch to do garbage collection and eliminate stale search and xref3 results based on inboxes which remain in the config file. This may also fixup stale results leftover from any bugs which may leave stale data around. This is also useful in case a clumsy BOFH (me :P) is swapping between several PI_CONFIGs and accidentally indexed a bunch of inboxes they didn't intend to.
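The garbage-collection idea can be sketched as a simple partition of indexed rows by whether their inbox is still configured. This is a hedged Python illustration, not the extindex implementation: `partition_stale`, the `(eidx_key, docid)` row shape, and the sample keys are all hypothetical.

```python
def partition_stale(rows, configured_keys):
    """Split (eidx_key, docid) rows into those to keep and those to remove.

    A row is stale when its eidx_key no longer appears in the config.
    Both the row shape and this function name are assumptions.
    """
    keep_keys = set(configured_keys)
    live = [row for row in rows if row[0] in keep_keys]
    stale = [row for row in rows if row[0] not in keep_keys]
    return live, stale

# Hypothetical index contents: two inboxes were indexed, one was later
# dropped from the config file.
rows = [
    ("inbox.comp.example.kept", 1),
    ("inbox.comp.example.removed", 2),
    ("inbox.comp.example.kept", 3),
]
live, stale = partition_stale(rows, ["inbox.comp.example.kept"])
```

Driving the decision from the config file alone is what lets the same pass clean up after an accidental run against the wrong PI_CONFIG.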
2020-11-28 *index: more consistent graceful shutdown checks
v1 and v2 inbox indexing now supports graceful shutdown checks just like ExtSearchIdx. Additionally, we'll consistently perform quit checks at the top of loops. Interaction with the --xapian-only and --sequential-shard options is a bit lacking, and will warn the user to use "--reindex --xapian-only" to fix.
2020-11-24 miscsearch: a new Xapian sub-DB for extindex
This will be used to index and search Inbox objects and perhaps individual git repositories/epochs for grokmirror manifest.js.gz generation. There is no sharding planned for this at the moment since inbox count should remain low (~100K to 1M) compared to message count. Folding this into the existing sharded DBs could be possible; but would likely increase query and maintenance costs, as well as development complexity. So we'll use a few more inodes and FDs at runtime, instead.
2020-11-15 searchidx: check for graceful shutdown in log2stack
The initial "git log" invocation for a git epoch can be time consuming, so check for graceful shutdown at each line to ensure timely shutdowns and avoid SSD/HDD wear.
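The per-line check can be sketched as a loop that tests a quit flag before handling each line of streamed output. This is a minimal Python illustration of the pattern, not the log2stack code: `collect_log_lines`, `request_quit`, and the `fake_git_log` generator are hypothetical stand-ins, and SIGTERM is assumed as the shutdown signal.

```python
import os
import signal

quit_requested = False

def request_quit(signum, frame):
    """Signal handler: only record the request, do no work here."""
    global quit_requested
    quit_requested = True

def collect_log_lines(lines):
    """Collect lines until a quit is requested; return (stack, completed)."""
    stack = []
    for line in lines:
        if quit_requested:  # checked at the top of every iteration
            return stack, False
        stack.append(line.rstrip("\n"))
    return stack, True

def fake_git_log():
    """Hypothetical stand-in for streaming `git log` output."""
    yield "commit 1\n"
    yield "commit 2\n"
    # Simulate an admin stopping the indexer partway through the log.
    os.kill(os.getpid(), signal.SIGTERM)
    yield "commit 3\n"
    yield "commit 4\n"

previous_handler = signal.signal(signal.SIGTERM, request_quit)
stack, completed = collect_log_lines(fake_git_log())
signal.signal(signal.SIGTERM, previous_handler)
```

Because the handler only sets a flag, the loop decides where it is safe to stop, and a caller can use the `completed` value to skip writing partial results.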
2020-11-15 *index: checkpoints write last_commit metadata
This will set us up for supporting graceful shutdown on -index without repeating any work.
2020-11-10 searchidx: fix fallback on unindex miss
In case of other bugs or intentional corruption of over.sqlite3, we don't want to attempt dereferencing a non-ref scalar when calling ->mid_delete in the fallback code path. Noticed while chasing another bug in extindex development...
2020-11-07 searchidx: ignore exceptions from ->remove_term
This seems necessary for some cross-posted messages (and we did it historically before we used over.sqlite3).
2020-11-07 searchidx: remove xref3 support for Xapian
It doesn't seem worth storing xref3 data in Xapian now that the same info is in over.sqlite3.
2020-11-07 searchidx: favor $sync->{ibx} (over $self->{ibx})
In case we want to reuse code with ExtSearchIdx or V2Writable.
2020-11-07 searchidx: reduce inbox-dependency, wrap ->with_umask
This will let us work consistently with both existing inboxes and external indices.
2020-11-07 searchidx: export prepare_stack
We'll be needing it in ExtSearchIdx for the next commit.
2020-11-07 searchidx: log2stack: simplify callers
Since we store {ibx} in $sync state, we no longer have to pass it as an argument to log2stack.
2020-11-07 searchidx: put {ibx} into $sync state
This will allow reusability with ExtSearchIdx.
2020-11-07 searchidxshard: special init for eidx
Having a special init path for external indices is probably easier than further overloading SearchIdx->new initialization to work without an Inbox object.
2020-11-07 searchidx: xref3 delete support
Not yet tested, but Perl compiles it!
2020-11-07 searchidx: index eidx_key as a boolean term
Using `O' (owner) here (according to Xapian omega's termprefixes.rst) since we could say the newsgroup or inbox is the owner of the given message.
2020-11-07 inboxwritable: eidx_key for external index
This is preferable to open-coding "newsgroup // inboxdir" everywhere.
2020-11-07 searchidx: introduce "xref3" concept
This will be used to track cross-posted messages in the external/detached index.
2020-11-07 searchidx: expose INDEXLEVELS as `our'
This will be used by external/detached indices, too.
2020-10-17 git: introduce async_wait_all
->cat_async and ->check_async may trigger each other (in future callers) while waiting, so we need a unified method to ensure both complete. This doesn't affect current code, but allows us to slightly simplify existing callers.
2020-09-29 searchidx: index lower-case List-Id value
We don't want a List-Id value being confused with a Xapian term prefix, here. Followup-to: 8b06cda3a3af3f0e ("mda: match List-Id insensitively")
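The ambiguity arises because Xapian term prefixes are conventionally capital letters, so a value that itself begins with a capital letter can read as a longer prefix once the two are concatenated. The following Python sketch shows the lower-casing step only; it is not the SearchIdx code, and `list_id_term` and the `G` prefix are assumptions made for illustration.

```python
LIST_ID_PREFIX = "G"  # assumed prefix, for illustration only

def list_id_term(header_value):
    """Build a boolean term from a List-Id header value (hypothetical helper)."""
    # Use the part inside <...> when present, otherwise the trimmed value.
    start = header_value.rfind("<")
    end = header_value.rfind(">")
    if 0 <= start < end:
        list_id = header_value[start + 1:end]
    else:
        list_id = header_value.strip()
    # Lower-casing keeps the boundary between prefix and value unambiguous:
    # "G" + "Meta.example.org" would read as a prefix of "GM...".
    return LIST_ID_PREFIX + list_id.lower()

term = list_id_term("Meta List <Meta.Example.ORG>")
bare_term = list_id_term(" Foo.Example.com ")
```

Lower-casing at index time also means lookups only match if the query side applies the same normalization, which is consistent with the case-insensitive List-Id matching referenced in the follow-up commit.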
2020-09-24 searchidx: fix (undocumented) --skip-docdata handling
This switch is still undocumented, but we can reduce the scope of our Xapian docdata dependency by moving its only caller to SearchIdx. This reduces the amount of code loaded by read-only code paths.
2020-09-03 disambiguate OverIdx and Over by field name
We'll use {oidx} as the common field name for the read-write OverIdx, here, to disambiguate it from the read-only {over} field. This hopefully makes it clearer which code paths are read-only and which are read-write.
2020-08-25 searchidx: croak for Xapian DB open failure
croak() can give more context on the failure, and setting `PERL5OPT=-MCarp=verbose' can force a stacktrace.
2020-08-23 mbox: disable "&t" on existing Xapian until full reindex
Expanding threads via over.sqlite3 for mbox.gz downloads without Xapian effectively collapsing on the THREADID column leads to repeated messages getting downloaded. To avoid that situation, use a "has_threadid" Xapian metadata flag that's only set on --reindex (and brand new Xapian DBs). This allows admins to upgrade WWW or do --reindex in any order; without worrying about users eating up bandwidth and CPU cycles.
2020-08-23 searchidx: index THREADID in Xapian
This is the `tid' column from over.sqlite3; and will be used for IMAP and JMAP search (among other things).
2020-08-23 searchidx: put all shard-related stuff in SearchIdxShard.pm
We'll also rename the /^remote_/ prefix to "shard_", since remote implies the process is on a different host. These methods only pass messages to a child process on the same host OR perform operations within the same process.
2020-08-20 init+index: support --skip-docdata for Xapian
Since we no longer read document data from Xapian, allow users to opt-out of storing it. This breaks compatibility with previous releases of public-inbox, but gives us a ~1.5% space savings on Xapian storage (and associated I/O and page cache pressure reduction).
2020-08-10 searchidx: use singular `$opt' for consistency with v2
The rest of our indexing code uses `$opt' instead of `$opts'.
2020-08-10 index: cleanup internal variables
Move away from hard-to-read alllowercase naming and favor snake_case or separated-by-dashes. We'll keep `--indexlevel' as-is for now, since it's been around for several releases; but we'll support `--index-level' in the CLI and update our documentation in a few months. We'll also clarify that publicInbox.indexMaxSize is only intended for -index, and not -watch or -mda.