path: root/lib/PublicInbox/ExtSearchIdx.pm
Date Commit message
2020-12-08 extsearchidx: remove needless SHA-1 check
There is no need to verify checksums of data already stored in git. Doing this ourselves also limits flexibility in moving to other hashes.
2020-11-29 extindex: support `--gc' to remove dead inboxes
Inboxes may be removed or newsgroups renamed over time. Introduce a switch to do garbage collection and eliminate stale search and xref3 results based on inboxes which remain in the config file. This may also fix up stale results left over from any bugs which may leave stale data around. This is also useful in case a clumsy BOFH (me :P) is swapping between several PI_CONFIGs and accidentally indexed a bunch of inboxes they didn't intend to.
2020-11-29 extindex: fix delete (`d') handling
We need to completely remove a message from over.sqlite3 and Xapian when no references remain, otherwise users will still see the removed messages in NNTP overviews and WWW search results/summaries. References to messages are now solely handled by the `xref3' table of over.sqlite3. We can also trust `xref3' when deciding whether to remove only the "O$eidx_key" and "G$lid" terms from a document in Xapian or to remove the entire Xapian document.
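The decision described above can be sketched roughly as follows. This is a minimal illustration, not the actual ExtSearchIdx code: the `xref3` and `docs` dictionaries are hypothetical in-memory stand-ins for the `xref3' table and the Xapian documents, and only the "O$eidx_key" term is modelled (a "G$lid" term would be handled the same way).

```python
def unindex(xref3, docs, docid, eidx_key):
    """Drop one inbox's reference to a message.

    xref3: dict mapping docid -> set of eidx_key values (hypothetical model)
    docs:  dict mapping docid -> set of term strings (hypothetical model)
    """
    refs = xref3.get(docid, set())
    refs.discard(eidx_key)
    if refs:
        # Other inboxes still reference the message: only strip this
        # inbox's term and keep the document searchable.
        docs[docid].discard("O" + eidx_key)
        return "terms-removed"
    # No references remain: remove the message everywhere.
    xref3.pop(docid, None)
    docs.pop(docid, None)
    return "doc-removed"
```

Trusting a single reference table for this choice means the search index never has to be consulted to find out whether a message is still live.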
2020-11-28 *index: more consistent graceful shutdown checks
v1 and v2 inbox indexing now supports graceful shutdown checks just like ExtSearchIdx. Additionally, we'll perform quit checks at the top of loops for consistency. Interaction with the --xapian-only and --sequential-shard options is a bit lacking, and we'll warn the user to use "--reindex --xapian-only" to fix.
2020-11-24 extsearchidx: deduplicate alternates based on st_dev + st_ino
This allows us to filter out duplicate alternates entries in case there are symlinks or bind mounts in play, as I (and perhaps some other users) tend to use symlinks and/or bind mounts heavily.
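As an illustration of the st_dev + st_ino idea, here is a small Python sketch (the real code is Perl; the function name and demo paths below are made up for this example). Two paths that resolve to the same device and inode are treated as the same directory, so only the first is kept.

```python
import os
import tempfile


def dedupe_by_inode(paths):
    """Return paths with duplicates (same st_dev + st_ino) removed, order kept."""
    seen = set()
    unique = []
    for path in paths:
        try:
            st = os.stat(path)  # follows symlinks, so aliases resolve to one inode
        except OSError:
            continue  # skip entries that cannot be stat'ed
        key = (st.st_dev, st.st_ino)
        if key not in seen:
            seen.add(key)
            unique.append(path)
    return unique


# Demo: a real directory plus a symlink alias pointing at it.
with tempfile.TemporaryDirectory() as top:
    real = os.path.join(top, "objects")
    alias = os.path.join(top, "alias")
    os.mkdir(real)
    os.symlink(real, alias)
    kept = dedupe_by_inode([real, alias, real])
    kept_names = [os.path.basename(p) for p in kept]
```

Comparing inode identity rather than path strings is what lets this catch symlinks and bind mounts, which plain string comparison would miss.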
2020-11-24 extsearchidx: do not short-circuit MiscIdx on no-op v2 prepare
This was intended to make development easier, but it also allows description, URL, and address changes to be picked up independently of message history.
2020-11-24 miscidx: cleanup git processes after manifest indexing
We shouldn't leave "cat-file --batch" processes around when we're done with an epoch or inbox, since there could be many thousands.
2020-11-24 miscsearch: a new Xapian sub-DB for extindex
This will be used to index and search Inbox objects and perhaps individual git repositories/epochs for grokmirror manifest.js.gz generation. There is no sharding planned for this at the moment since inbox count should remain low (~100K to 1M) compared to message count. Folding this into the existing sharded DBs could be possible; but would likely increase query and maintenance costs, as well as development complexity. So we'll use a few more inodes and FDs at runtime, instead.
2020-11-15 extindex: support graceful shutdown via QUIT/INT/TERM
Just like the daemon processes, -extindex now supports graceful shutdown via the same signals. This lets users avoid having to repeat indexing messages when a power outage strikes during a long (multi-hour/day) indexing run. Per-inbox (v1/v2) -index graceful shutdowns are not supported yet, but are planned for later.
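The general pattern behind a graceful shutdown can be sketched as below. This is an illustrative Python analogue under stated assumptions, not the -extindex implementation: `QuitFlag`, `index_all`, and the final checkpoint call are hypothetical names chosen for the example. The signal handler only sets a flag, and the indexing loop checks that flag at the top of each iteration so in-flight work is finished and saved before exit.

```python
import signal


class QuitFlag:
    """Records that a shutdown signal arrived; does no work in the handler."""

    def __init__(self):
        self.requested = False

    def handler(self, signum, frame):
        self.requested = True


def install(quit_flag):
    # Route the three shutdown signals to the same flag-setting handler.
    for sig in (signal.SIGQUIT, signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, quit_flag.handler)


def index_all(items, index_one, checkpoint, quit_flag):
    """Index items until done or until a quit is requested, then checkpoint."""
    done = []
    for item in items:
        if quit_flag.requested:  # quit check at the top of the loop
            break
        index_one(item)
        done.append(item)
    checkpoint()  # persist progress so the next run need not repeat it
    return done
```

Calling `install(flag)` once at startup and passing the same `flag` to `index_all` is enough to wire the two halves together.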
2020-11-15 *index: discard sync->{todo} on iteration
There's no need to continuously append to {todo} when indexing multiple inboxes. They're not redundantly indexed (because the IdxStack is discarded, making it a noop), but it's still a waste of memory keeping the $unit hashrefs around.
2020-11-15 *index: avoid per-epoch --batch-check processes
Since all.git (v2) and ALL.git (extindex) encompass every single epoch or indexed inbox; and is_ancestor() only uses hexadecimal OIDs; there is no good reason to use $unit->{git} for an epoch-local $git->check. This prevents dozens/hundreds of --batch-check processes from being left running after indexing and can improve locality if size checks are being done (since that uses --batch-check, too). Theoretically several epochs may have conflicting OIDs, but we're screwed in those cases, anyways, so we might as well detect it earlier (though I'm not sure what the behavior would be :x).
2020-11-15 *index: checkpoints write last_commit metadata
This will set us up for supporting graceful shutdown on -index without repeating any work.
2020-11-08 extindex: SIGUSR1 supports checkpoint
Matching the behavior of git-fast-import(1), we'll allow a user to send SIGUSR1 to checkpoint over.sqlite3 and Xapian.
2020-11-08 v2writable: more accurate {current_info} warnings/progress
With async git blob retrievals, the OID being enqueued and the OID being processed can be totally unrelated and misleading. We'll also prefix $INBOX_DIR for v2, and not just the epoch since we could be indexing multiple inboxes via both -index and -extindex.
2020-11-08 extsearch: canonicalize topdir
This makes `ps' output look a bit nicer if there are trailing slashes involved from the command-line.
2020-11-08 extsearchidx: quiet warning for unindexed `d' messages
"deleted" messages (via -learn <spam|rm>) in the source inboxes are likely to already be unindexed, so avoid triggering needless warnings about the spam message being missing.
2020-11-08 extsearchidx: avoid needless alternates rewrite in ALL.git
As with fill_alternates in V2Writable, we do not need to update $GIT_DIR/objects/info/alternates if nothing is changed.
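A "write only if changed" update might look like the following Python sketch. It is an assumption-laden illustration rather than the fill_alternates logic itself: the helper name, the temporary-file suffix, and the example entries are all hypothetical.

```python
import os
import tempfile


def write_if_changed(path, entries):
    """Rewrite path with one entry per line, but only if the content differs."""
    new = "".join(entry + "\n" for entry in entries)
    try:
        with open(path) as fh:
            if fh.read() == new:
                return False  # nothing changed: leave the file (and its mtime) alone
    except FileNotFoundError:
        pass  # first run: fall through and create the file
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        fh.write(new)
    os.replace(tmp, path)  # atomic rename over the old file
    return True


# Demo: the second call is a no-op, the third adds an entry and rewrites.
with tempfile.TemporaryDirectory() as top:
    alt = os.path.join(top, "alternates")
    results = [
        write_if_changed(alt, ["../../a.git/objects"]),
        write_if_changed(alt, ["../../a.git/objects"]),
        write_if_changed(alt, ["../../a.git/objects", "../../b.git/objects"]),
    ]
```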
2020-11-07 extsearchidx: support --batch-size checkpoints
This is needed to limit the RSS of processes and ensure the stored data in over.sqlite3 and Xapian DBs are consistent if interrupted. Without checkpoints, indexing lore causes shard workers to take several GB of memory and thrash/OOM smaller systems.
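The checkpointing idea can be sketched as below, assuming the batch size is a byte count and that `index_one` and `commit` are caller-supplied callbacks (both are hypothetical names for this example, not the ExtSearchIdx API). Pending work is committed whenever the accumulated size reaches the limit, which bounds how much uncommitted data is held in memory.

```python
def index_with_checkpoints(messages, batch_size, index_one, commit):
    """Index raw messages, committing each time batch_size bytes accumulate.

    Returns the number of commits performed.
    """
    pending = 0  # bytes indexed since the last commit
    commits = 0
    for raw in messages:
        index_one(raw)
        pending += len(raw)
        if pending >= batch_size:
            commit()  # flush to disk and release buffered state
            commits += 1
            pending = 0
    if pending:
        commit()  # final partial batch
        commits += 1
    return commits
```

A smaller batch size trades some throughput for lower peak memory and less repeated work after an interruption.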
2020-11-07 extsearchidx: set current_info in warning callbacks
This bit is duplicated with per-Inbox indexing in Admin; undecided if it's the right place for it.
2020-11-07 extsearchidx: handle edits
We can now handle cases where messages are edited in one inbox but not another, bifurcating the message. V2Writable::log_range handles some edge-cases which could happen in v2-only code paths, as well, but weren't usually triggered due to default git-gc knobs not pruning immediately.
2020-11-07 searchidx: remove xref3 support for Xapian
It doesn't seem worth storing xref3 data in Xapian now that the same info is in over.sqlite3.
2020-11-07 extsearchidx: sync updates
A couple more things to prepare us to run syncs on both v1 and v2 inboxes.
2020-11-07 extsearchidx: sync unit updates
Now that the V2Writable code is more generic, we can sync with it to use `units' which represent either a v2 epoch or an entire v1 inbox.
2020-11-07 extsearchidx: remove {unindex_range} field
Moved to per-epoch "units".
2020-11-07 extsearchidx: more compatibility with V2Writable callers
We'll use `index_oid' and `unindex_oid' as our method names so V2Writable methods may use `$self->can' to access them.
2020-11-07 extsearchidx: initial implementation
It compiles...