Date Commit message
2020-11-07 extsearchidx: support --batch-size checkpoints
This is needed to limit the RSS of processes and ensure the stored data in over.sqlite3 and Xapian DBs are consistent if interrupted. Without checkpoints, indexing lore causes shard workers to take several GB of memory and thrash/OOM smaller systems.
2020-11-07 extsearchidx: set current_info in warning callbacks
This bit is duplicated with per-Inbox indexing in Admin; undecided whether that's the right place for it.
2020-11-07 searchidx: ignore exceptions from ->remove_term
This seems necessary for some cross-posted messages (and we did it historically before we used over.sqlite3).
2020-11-07 extsearch: wire up remaining Inbox-like methods for WWW
This lets us pretend an ExtSearch object is an Inbox object in most of the existing WWW code.
2020-11-07 extsearchidx: handle edits
We can now handle cases where messages are edited in one inbox but not another, bifurcating the message. V2Writable::log_range also handles some edge cases which could happen in v2-only code paths, but weren't usually triggered because the default git-gc knobs do not prune immediately.
2020-11-07 extsearch: wire up smsg_eml
We'll probably still need synchronous message retrieval in a few places (tests, at least).
2020-11-07 t/v2writable: remove pointless ->barrier call
We don't actually use it anywhere, and may not need it in the future.
2020-11-07 t/extsearch.t: verify results and xref3 ordering
We want NNTP clients to see consistent Xref: headers to ensure client-side caches don't get confused.
2020-11-07 searchidx: remove xref3 support for Xapian
It doesn't seem worth storing xref3 data in Xapian now that the same info is in over.sqlite3.
2020-11-07 over: store xref3 data in over.sqlite3
We may not end up storing xref3 data in Xapian, actually. This will make indexlevel=basic possible, along with --sequential-shard indexing support for slow storage. Making oidmap a separate table seems unnecessary, too, so fold it into the xref3 table since it's unlikely a git blob will be responsible for multiple xref3 rows.
2020-11-07 index: eindex wiring
This doesn't do anything, yet, but it will once the rest of the eindex stuff works.
2020-11-07 script: add preliminary eindex implementation
Not documented, yet, but it runs...
2020-11-07 Makefile.PL: do not build manpage if POD is missing
But warn on it. This lets us test new or throwaway commands more easily, since we don't have to start a new POD for everything we want to dump in script/.
2020-11-07 searchidx: favor $sync->{ibx} (over $self->{ibx})
In case we want to reuse code with ExtSearchIdx or V2Writable.
2020-11-07 searchidx: reduce inbox-dependency, wrap ->with_umask
This will let us work consistently with both existing inboxes and external indices.
2020-11-07 extsearchidx: sync updates
A couple more things to prepare us to run syncs on both v1 and v2 inboxes.
2020-11-07 searchidx: export prepare_stack
We'll be needing it in ExtSearchIdx for the next commit.
2020-11-07 extsearchidx: sync unit updates
Now that the V2Writable code is more generic, we can sync with it to use `units' which represent either a v2 epoch or an entire v1 inbox.
2020-11-07 v2writable: pass oid to uindex_oid
We'll be validating against this in the future to stop bugs from creeping in.
2020-11-07 extsearchidx: remove {unindex_range} field
Moved to per-epoch "units".
2020-11-07 v2writable: reduce scope of epoch-aware code
And clearly label it. We may try to reuse some of this for v1 indexing code paths.
2020-11-07 extsearchidx: more compatibility with V2Writable callers
We'll use `index_oid' and `unindex_oid' as our method names so V2Writable methods may use `$self->can' to access them.
2020-11-07 v2writable: move size check init to sync_prepare
This will let us use it from ExtSearchIdx.
2020-11-07 v2writable: make *last_commits and sync_prepare OO methods
This will allow ExtSearchIdx to override or reuse them more easily. Unfortunately we lose prototype validation, but that seems to be discouraged anyways given the 'signatures' feature in Perl 5.20+.
2020-11-07 v2writable: rename {v2w} field to {self}
This will make it easier to reuse some indexing code for ExtSearchIdx.
2020-11-07 v2writable: allow OO method references
Using `->can(method)' allows subclasses to override `index_oid' and `unindex_oid' methods.
2020-11-07 v2writable: more generic sync setup code
We want to reuse this code for ExtSearchIdx, eventually.
2020-11-07 searchidx: log2stack: simplify callers
Since we store {ibx} in $sync state, we no longer have to pass it as an argument to log2stack.
2020-11-07 searchidx: put {ibx} into $sync state
This will allow reusability with ExtSearchIdx.
2020-11-07 searchidxshard: special init for eidx
Having a special init path for external indices is probably easier than further overloading SearchIdx->new initialization to work without an Inbox object.
2020-11-07 searchidx: xref3 delete support
Not yet tested, but Perl compiles it!
2020-11-07 searchidx: index eidx_key as a boolean term
Using `O' (owner) here (according to Xapian omega's termprefixes.rst) since we could say the newsgroup or inbox is the owner of the given message.
2020-11-07 extsearchidx: initial implementation
It compiles...
2020-11-07 v2writable: checkpoint: account for lack of {mm}
ExtSearchIdx will not have Msgmap, since it may index non-email blobs in the future (it'll still be usable with IMAP, but not NNTP).
2020-11-07 v2writable: rename remaining "remote" terminology
"remote" used to imply "child process on the same machine", which was somewhat nonsensical, anyways. And OverIdx has been in the same process since v2 was finalized. So use the suffix "aux" for "auxiliary" since it can be safely jettisoned without breaking URLs.
2020-11-07 inboxwritable: eidx_key for external index
This is preferable to open-coding "newsgroup // inboxdir" everywhere.
2020-11-07 v2: some changes for ExtSearchIdx compatibility
We'll be using per-sync-state {ibx} refs instead, so make parts of the v2 indexing code less dependent on $self->{ibx} where $self is a V2Writable object.
2020-11-07 overidx: introduce changes for external index
Since external indices won't have msgmap.sqlite3, we'll need to store last_commit-* metadata in over.sqlite3 instead. This has longer limits to account for path names or newsgroup names stored in keys. We'll also rely on built-in counters for Xapian document IDs, since msgmap.sqlite3 no longer provides an AUTOINCREMENT column.
2020-11-07 v2writable: count_shards: allow working without {ibx}
This will be needed for ExtSearchIdx which doesn't have a persistent PublicInbox::Inbox object.
2020-11-07 v2writable: idx_shard: simplify callers
This will make it easier to use in ExtSearchIdx.
2020-11-07 searchidxshard: allow msgref to be undef
We don't need to keep it in code paths which are guaranteed to only see PublicInbox::Eml (and not Email::MIME or PublicInbox::MIME which did not round-trip properly). However, we must set {raw_bytes} since PublicInbox::Eml may add an extra "\n" for rare messages with no bodies.
2020-11-07 v2writable: hoist out write_alternates
We'll be reusing this for external indices and possibly other places.
2020-11-07 v2writable: prepare initialization for external indices
External indices won't have $self->{ibx} since they need to deal with multiple inboxes. We can also hoist out ->parallel_init to make it easier to distinguish the non-parallel control flow.
2020-11-07 searchidx: introduce "xref3" concept
This will be used to track cross-posted messages in the external/detached index.
2020-11-07 search: xdb_sharded: make this a public method for ExtSearch
We can simplify callers by using $self->{xpfx} instead of passing another arg on the stack.
2020-11-07 v2writable: make OO calls to last_commit-related methods
We'll try to reuse as much V2Writable code as possible for external indices, but the way "last_commit" info is stored must be different as external indices will deal with last_commit info for multiple inboxes.
2020-11-07 v2writable: add git method
This will make it easier to share code with ExtSearchIdx.
2020-11-07 searchidx: expose INDEXLEVELS as `our'
This will be used by external/detached indices, too.
2020-11-07 extsearch: start mocking out
This will provide a similar API to PublicInbox::Inbox for read-only WWW, -imapd, and -nntpd interfaces.
2020-11-07 search: hoist out _xdb_sharded for v2 inboxes
We'll be using this in detached (ext) Xapian indexes for cross-inbox search.