Date Commit message
2020-11-10 searchidx: fix fallback on unindex miss
In case of other bugs or intentional corruption of over.sqlite3, we don't want to attempt dereferencing a non-ref scalar when calling ->mid_delete in the fallback code path. Noticed while chasing another bug in extindex development...
2020-11-08 extindex: fix --batch-size support
Calling PublicInbox::Admin::index_prepare is required for --batch-size (k|m|g) modifiers and indexBatchSize in the config file. Otherwise, the default 1m batch size stuck and led to unexpectedly bad performance on a machine which could index v2 inboxes faster with larger batch sizes.
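A minimal sketch of how a size with a (k|m|g) modifier might be turned into a byte count. This is not the code in PublicInbox::Admin; `parse_batch_size` is a hypothetical name, and the power-of-1024 multipliers are an assumption borrowed from git-config conventions.

```python
import re

# Assumption: multipliers are powers of 1024, as in git-config.
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_batch_size(value: str) -> int:
    """Hypothetical helper: convert '512k', '10m' or '1g' to a byte count."""
    match = re.fullmatch(r"(\d+)([kmg]?)", value.strip().lower())
    if match is None:
        raise ValueError(f"invalid batch size: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]
```

If this parsing step is skipped, the caller only ever sees its built-in default, which is the failure mode described above.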
2020-11-08 extindex: SIGUSR1 supports checkpoint
Matching the behavior of git-fast-import(1), we'll allow a user to send SIGUSR1 to checkpoint over.sqlite3 and Xapian.
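The general pattern can be sketched as follows: the signal handler only records the request, and the indexing loop performs the checkpoint at a safe point between messages. This is an illustration for Unix-like systems, not the extindex implementation; `index_all`, `index_one` and `checkpoint` are hypothetical names.

```python
import signal

checkpoint_requested = False


def _on_sigusr1(signum, frame):
    # Only record the request; the real work happens at a safe point in the loop.
    global checkpoint_requested
    checkpoint_requested = True


signal.signal(signal.SIGUSR1, _on_sigusr1)


def index_all(messages, index_one, checkpoint):
    """Index each message, checkpointing between messages if SIGUSR1 arrived."""
    global checkpoint_requested
    for msg in messages:
        index_one(msg)
        if checkpoint_requested:
            checkpoint_requested = False
            checkpoint()
    checkpoint()  # final commit at the end of the run
```

Deferring the work to the loop keeps the handler trivial and avoids committing in the middle of a half-indexed message.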
2020-11-08 v2writable: more accurate {current_info} warnings/progress
With async git blob retrievals, the OID being enqueued and the OID being processed can be totally unrelated and misleading. We'll also prefix $INBOX_DIR for v2, and not just the epoch since we could be indexing multiple inboxes via both -index and -extindex.
2020-11-08 extsearch: canonicalize topdir
This makes `ps' output look a bit nicer if there's trailing slashes involved from the command-line.
2020-11-08 extsearchidx: quiet warning for unindexed `d' messages
"deleted" messages (via -learn <spam|rm>) in the source inboxes are likely to already be unindexed, so avoid triggering needless warnings about the spam message being missing.
2020-11-08 v2writable: less expensive checkpoint for extindex
Since extindex holds no locks on parallel inbox writers, we can simply use "barrier" IPC shard commands to checkpoint and avoid respawning shard or git processes.
2020-11-08 searchidxshard: further improve {current_info} readability
Add a space after \0 to visually disambiguate it from the {bytes} field.
2020-11-08 searchidxshard: reduce syscalls when writing ->eidx_key
We use ->autoflush(1) on this pipe to ensure the shard workers see data immediately on print; so this means we have to do our own buffering for optional data.
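A sketch of the idea, assuming an unbuffered pipe: assemble the mandatory and optional fields into one buffer and issue a single write instead of one per field. The space-separated, newline-terminated wire format and the name `send_shard_command` are hypothetical and do not reflect the real shard protocol.

```python
import os
from typing import List, Optional


def send_shard_command(fd: int, fields: List[str],
                       eidx_key: Optional[str] = None) -> None:
    """Assemble the whole command first, then issue a single write(2)."""
    parts = list(fields)
    if eidx_key is not None:
        parts.append(eidx_key)  # optional data joins the same buffer
    line = (" ".join(parts) + "\n").encode()
    # One syscall instead of one per field; short pipe writes are atomic on POSIX.
    os.write(fd, line)
```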
2020-11-08 extsearchidx: avoid needless alternates rewrite in ALL.git
As with fill_alternates in V2Writable, we do not need to update $GIT_DIR/objects/info/alternates if nothing is changed.
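The compare-before-write pattern might look like the sketch below. It is not the fill_alternates code; `update_alternates` is a hypothetical helper, and the one-path-per-line layout is the usual format of git's objects/info/alternates file.

```python
from pathlib import Path


def update_alternates(path: Path, wanted: list) -> bool:
    """Rewrite the alternates file only when its content would change."""
    new = "".join(f"{p}\n" for p in wanted)
    old = path.read_text() if path.exists() else ""
    if new == old:
        return False  # nothing changed: skip the write entirely
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(new)
    tmp.replace(path)  # atomic rename on POSIX filesystems
    return True
```

Returning a boolean lets the caller tell whether anything was written.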
2020-11-08 extsearch: rename -eindex to -extindex
"eindex" rhymes with "reindex", which could be confusing; so rename the command and config prefix to "extindex", which is hopefully less confusing.
2020-11-07 searchidxshard: make warnings with eidx_key less confusing
Seeing "Xorg.foo.bar" can be confusing in warnings if the eidx_key is only "org.foo.bar" with no relation to "Xorg" at all. Furthermore, printing "\0" to log or terminal output isn't very nice and could throw off some users/tools.
2020-11-07 extsearchidx: support --batch-size checkpoints
This is needed to limit the RSS of processes and ensure the stored data in over.sqlite3 and Xapian DBs are consistent if interrupted. Without checkpoints, indexing lore causes shard workers to take several GB of memory and thrash/OOM smaller systems.
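A minimal sketch of size-based checkpointing, under the assumption that the batch size is measured in bytes of indexed message data. `index_in_batches`, `index_one` and `checkpoint` are hypothetical names, and the real commit logic for over.sqlite3 and Xapian is not shown.

```python
def index_in_batches(blobs, index_one, checkpoint, batch_size):
    """Call checkpoint() whenever at least batch_size bytes were indexed."""
    pending = 0
    for blob in blobs:
        index_one(blob)
        pending += len(blob)
        if pending >= batch_size:
            checkpoint()  # flush buffered state so memory use stays bounded
            pending = 0
    if pending:
        checkpoint()  # commit whatever remains at the end
```

Because every checkpoint leaves the stores in a committed state, an interrupted run loses at most one batch of work.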
2020-11-07 extsearchidx: set current_info in warning callbacks
This bit is duplicated with per-Inbox indexing in Admin, undecided if it's the right place for it.
2020-11-07 searchidx: ignore exceptions from ->remove_term
This seems necessary for some cross-posted messages (and we did it historically before we used over.sqlite3).
2020-11-07 extsearch: wire up remaining Inbox-like methods for WWW
This lets us pretend an ExtSearch object is an Inbox object in most of the existing WWW code.
2020-11-07 extsearchidx: handle edits
We can now handle cases where messages are edited in one inbox but not another, bifurcating the message. V2Writable::log_range handles some edge-cases which could happen in v2-only code paths, as well, but weren't usually triggered due to default git-gc knobs not pruning immediately.
2020-11-07 extsearch: wire up smsg_eml
We'll probably still need synchronous message retrieval in a few places (tests, at least).
2020-11-07 t/v2writable: remove pointless ->barrier call
We don't actually use it anywhere, and may not need it in the future.
2020-11-07 t/extsearch.t: verify results and xref3 ordering
We want NNTP clients to see consistent Xref: headers to ensure client-side caches don't get confused.
2020-11-07 searchidx: remove xref3 support for Xapian
It doesn't seem worth storing xref3 data in Xapian now that the same info is in over.sqlite3.
2020-11-07 over: store xref3 data in over.sqlite3
We may not end up storing xref3 data in Xapian, actually. This will make indexlevel=basic possible, and along with --sequential-shard indexing support for slow storage. Making oidmap a separate table seems unnecessary, too, so fold it into the xref3 table since it's unlikely a git blob will be responsible for multiple xref3 rows.
2020-11-07 index: eindex wiring
This doesn't do anything, yet, but it will once the rest of the eindex stuff works.
2020-11-07 script: add preliminary eindex implementation
Not documented, yet, but it runs...
2020-11-07 Makefile.PL: do not build manpage if POD is missing
But warn on it, this lets us test new or throwaway commands more easily if we don't have to start a new POD for everything we want to dump in script/.
2020-11-07 searchidx: favor $sync->{ibx} (over $self->{ibx})
In case we want to reuse code with ExtSearchIdx or V2Writable.
2020-11-07 searchidx: reduce inbox-dependency, wrap ->with_umask
This will let us work consistently with both existing inboxes and external indices.
2020-11-07 extsearchidx: sync updates
A couple of more things to prepare us to run syncs on both v1 and v2 inboxes.
2020-11-07 searchidx: export prepare_stack
We'll be needing it in ExtSearchIdx for the next commit.
2020-11-07 extsearchidx: sync unit updates
Now that the V2Writable code is more generic, we can sync with it to use `units' which represent either a v2 epoch or an entire v1 inbox.
2020-11-07 v2writable: pass oid to uindex_oid
We'll be validating against this in the future to stop bugs from creeping in.
2020-11-07 extsearchidx: remove {unindex_range} field
Moved to per-epoch "units".
2020-11-07 v2writable: reduce scope of epoch-aware code
And clearly label it. We may try to reuse some of this for v1 indexing code paths.
2020-11-07 extsearchidx: more compatibility with V2Writable callers
We'll use `index_oid' and `unindex_oid' as our method names so V2Writable methods may use `$self->can' to access them.
2020-11-07 v2writable: move size check init to sync_prepare
This will let us use it from ExtSearchIdx.
2020-11-07 v2writable: make *last_commits and sync_prepare OO methods
This will allow ExtSearchIdx to override or reuse them more easily. Unfortunately we lose prototype validation, but that seems to be discouraged anyways given the 'signatures' feature in Perl 5.20+.
2020-11-07 v2writable: rename {v2w} field to {self}
This will make it easier to reuse some indexing code for ExtSearchIdx.
2020-11-07 v2writable: allow OO method references
Using `->can(method)' allows subclasses to override `index_oid' and `unindex_oid' methods.
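As a rough Python analogue of the Perl idiom (not the project's code), resolving a method by name on the actual object picks up a subclass override, much as `->can('index_oid')` returns a reference to the most specific implementation. The class names `BaseWriter` and `ExternalIndexer` are hypothetical and do not mirror the real class hierarchy.

```python
class BaseWriter:
    def index_oid(self, oid):
        return f"base index {oid}"

    def sync(self, oids):
        # Resolve the method on the actual object once, then reuse the reference.
        index_oid = getattr(self, "index_oid")
        return [index_oid(oid) for oid in oids]


class ExternalIndexer(BaseWriter):
    def index_oid(self, oid):
        # The override is what sync() ends up calling for this class.
        return f"ext index {oid}"
```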
2020-11-07 v2writable: more generic sync setup code
We want to reuse this code for ExtSearchIdx, eventually.
2020-11-07 searchidx: log2stack: simplify callers
Since we store {ibx} in $sync state, we no longer have to pass it as an argument to log2stack.
2020-11-07 searchidx: put {ibx} into $sync state
This will allow reusability with ExtSearchIdx.
2020-11-07 searchidxshard: special init for eidx
Having a special init path for external indices is probably easier than further overloading SearchIdx->new initialization to work without an Inbox object.
2020-11-07 searchidx: xref3 delete support
Not yet tested, but Perl compiles it!
2020-11-07 searchidx: index eidx_key as a boolean term
Using `O' (owner) here (according to Xapian omega's termprefixes.rst) since we could say the newsgroup or inbox is the owner of the given message.
2020-11-07 extsearchidx: initial implementation
It compiles...
2020-11-07 v2writable: checkpoint: account for lack of {mm}
ExtSearchIdx will not have Msgmap, since it may index non email blobs in the future (it'll still be usable with IMAP, but not NNTP).
2020-11-07 v2writable: rename remaining "remote" terminology
"remote" used to imply "child process on the same machine" which was somewhat non-sensical, anyways. And OverIdx has been in the same process since v2 was finalized. So use the suffix "aux" for "auxiliary" since it can be safely jettisoned without breaking URLs.
2020-11-07 inboxwritable: eidx_key for external index
This is preferable to open-coding "newsgroup // inboxdir" everywhere.
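The fallback expressed by "newsgroup // inboxdir" can be sketched in Python as below. The function is only an illustration of Perl's defined-or semantics; the standalone signature is an assumption and not the method's real interface.

```python
from typing import Optional


def eidx_key(newsgroup: Optional[str], inboxdir: str) -> str:
    """Return the newsgroup when defined, else fall back to the inbox directory."""
    # Perl's `//` tests definedness, not truthiness, so `is not None` mirrors it:
    # an empty-string newsgroup would still be returned.
    return newsgroup if newsgroup is not None else inboxdir
```

Centralizing this in one accessor means the fallback rule only has to change in one place.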
2020-11-07 v2: some changes for ExtSearchIdx compatibility
We'll be using per-sync-state {ibx} refs instead, so make parts of the v2 indexing code less-dependent on $self->{ibx} where $self is a V2Writable object.
2020-11-07 overidx: introduce changes for external index
Since external indices won't have msgmap.sqlite3, we'll need to store last_commit-* metadata in over.sqlite3 instead. This has longer limits to account for path names or newsgroup names stored in keys. We'll also rely on built-in counters for Xapian document IDs, since msgmap.sqlite3 no longer provides an AUTOINCREMENT column.