path: root/lib/PublicInbox/V2Writable.pm
Date Commit message
2020-12-08 searchidx: remove $oid parameter from most calls
Xapian docids have been tied to the over {num} column for nearly 3 years, now; and OIDs are no longer stored in Xapian document data. There's no need to increase code and IPC complexity by passing the OID around.
2020-11-29 v2writable: detect shard count for ExtSearchIdx properly
Otherwise, any explicitly set shard counts were ignored and we'd be counting CPUs every single time.
2020-11-28 *index: more consistent graceful shutdown checks
v1 and v2 inbox indexing now supports graceful shutdown checks just like ExtSearchIdx. Additionally, we'll consistently perform quit checks at the top of loops. Interaction with the --xapian-only and --sequential-shard options is a bit lacking, and will warn the user to use "--reindex --xapian-only" to fix.
2020-11-28 mm: min/max: return 0 instead of undef
This simplifies callers and allows empty newsgroups to be represented (the WWW UI may be insufficient there, too).
2020-11-24 miscsearch: a new Xapian sub-DB for extindex
This will be used to index and search Inbox objects and perhaps individual git repositories/epochs for grokmirror manifest.js.gz generation. There is no sharding planned for this at the moment since inbox count should remain low (~100K to 1M) compared to message count. Folding this into the existing sharded DBs could be possible; but would likely increase query and maintenance costs, as well as development complexity. So we'll use a few more inodes and FDs at runtime, instead.
2020-11-17 v2writable: avoid initiating leftover unindex if interrupted
We can also avoid a needless progress message on log2stack interruptions.
2020-11-15 extindex: support graceful shutdown via QUIT/INT/TERM
Just like the daemon processes, -extindex now supports graceful shutdown via the same signals. This lets users avoid having to repeat indexing messages when a power outage strikes during a long (multi-hour/day) indexing run. Per-inbox (v1/v2) -index graceful shutdowns are not supported yet, but are planned for later.
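The shutdown pattern described above can be sketched roughly as follows. This is only an illustration, not the actual public-inbox code: `index_all', `$todo' and `$cb' are hypothetical names. The signal handlers merely record the request, and the loop checks the flag at the top of each iteration so completed work is kept.

```perl
use strict;
use warnings;

# Hypothetical indexing loop; $todo and $cb are illustrative names.
sub index_all {
	my ($todo, $cb) = @_;
	my $quit = 0;
	# Handlers only record the request; the loop decides when to stop.
	my $on_sig = sub { $quit = 1 };
	local $SIG{QUIT} = $on_sig;
	local $SIG{INT} = $on_sig;
	local $SIG{TERM} = $on_sig;
	my $done = 0;
	for my $item (@$todo) {
		last if $quit; # quit check at the top of the loop
		$cb->($item);
		$done++;
	}
	return $done; # number of items fully processed
}
```

A caller could store the returned count (or the last processed item) as checkpoint state, so that a later run resumes instead of starting over.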
2020-11-15 *index: discard sync->{todo} on iteration
There's no need to continuously append to {todo} when indexing multiple inboxes. They're not redundantly indexed (because the IdxStack is discarded, making it a noop), but it's still a waste of memory keeping the $unit hashrefs around.
2020-11-15 *index: avoid per-epoch --batch-check processes
Since all.git (v2) and ALL.git (extindex) encompass every single epoch or indexed inbox; and is_ancestor() only uses hexadecimal OIDs; there is no good reason to use $unit->{git} for an epoch-local $git->check. This prevents dozens/hundreds of --batch-check processes from being left running after indexing and can improve locality if size checks are being done (since that uses --batch-check, too). Theoretically several epochs may have conflicting OIDs, but we're screwed in those cases, anyways, so we might as well detect it earlier (though I'm not sure what the behavior would be :x).
2020-11-15 *index: checkpoints write last_commit metadata
This will set us up for supporting graceful shutdown on -index without repeating any work.
2020-11-08 v2writable: more accurate {current_info} warnings/progress
With async git blob retrievals, the OID being enqueued and the OID being processed can be totally unrelated and misleading. We'll also prefix $INBOX_DIR for v2, and not just the epoch since we could be indexing multiple inboxes via both -index and -extindex.
2020-11-08 v2writable: less expensive checkpoint for extindex
Since extindex holds no locks on parallel inbox writers, we can simply use "barrier" IPC shard commands to checkpoint and avoid respawning shard or git processes.
2020-11-07 extsearchidx: handle edits
We can now handle cases where messages are edited in one inbox but not another, bifurcating the message. V2Writable::log_range handles some edge-cases which could happen in v2-only code paths, as well, but weren't usually triggered due to default git-gc knobs not pruning immediately.
2020-11-07 v2writable: pass oid to uindex_oid
We'll be validating against this in the future to stop bugs from creeping in.
2020-11-07 v2writable: reduce scope of epoch-aware code
And clearly label it. We may try to reuse some of this for v1 indexing code paths.
2020-11-07 v2writable: move size check init to sync_prepare
This will let us use it from ExtSearchIdx.
2020-11-07 v2writable: make *last_commits and sync_prepare OO methods
This will allow ExtSearchIdx to override or reuse them more easily. Unfortunately we lose prototype validation, but that seems to be discouraged anyways given the 'signatures' feature in Perl 5.20+.
2020-11-07 v2writable: rename {v2w} field to {self}
This will make it easier to reuse some indexing code for ExtSearchIdx.
2020-11-07 v2writable: allow OO method references
Using `->can(method)' allows subclasses to override `index_oid' and `unindex_oid' methods.
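A minimal sketch of why `->can' helps here, using hypothetical `My::Writer' and `My::ExtWriter' classes (the method bodies are placeholders, not the real indexing code). `->can' resolves the method through the object's class hierarchy and returns a code reference, whereas a plain `\&index_oid' would always refer to the sub in the current package.

```perl
use strict;
use warnings;

package My::Writer; # hypothetical base class
sub new { bless {}, shift }
sub index_oid { "My::Writer::index_oid($_[1])" }

sub run {
	my ($self, $oid) = @_;
	# Look the method up via the object, so overrides are honored.
	my $cb = $self->can('index_oid');
	return $cb->($self, $oid);
}

package My::ExtWriter; # hypothetical subclass overriding index_oid
our @ISA = ('My::Writer');
sub index_oid { "My::ExtWriter::index_oid($_[1])" }

package main;
print My::Writer->new->run('abc'), "\n";
print My::ExtWriter->new->run('abc'), "\n";
```

The code reference can be stored and passed around as a callback while still dispatching to the subclass implementation.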
2020-11-07 v2writable: more generic sync setup code
We want to reuse this code for ExtSearchIdx, eventually.
2020-11-07 searchidx: log2stack: simplify callers
Since we store {ibx} in $sync state, we no longer have to pass it as an argument to log2stack.
2020-11-07 v2writable: checkpoint: account for lack of {mm}
ExtSearchIdx will not have Msgmap, since it may index non email blobs in the future (it'll still be usable with IMAP, but not NNTP).
2020-11-07 v2writable: rename remaining "remote" terminology
"remote" used to imply "child process on the same machine" which was somewhat nonsensical, anyways. And OverIdx has been in the same process since v2 was finalized. So use the suffix "aux" for "auxiliary" since it can be safely jettisoned without breaking URLs.
2020-11-07 v2: some changes for ExtSearchIdx compatibility
We'll be using per-sync-state {ibx} refs instead, so make parts of the v2 indexing code less-dependent on $self->{ibx} where $self is a V2Writable object.
2020-11-07 v2writable: count_shards: allow working without {ibx}
This will be needed for ExtSearchIdx which doesn't have a persistent PublicInbox::Inbox object.
2020-11-07 v2writable: idx_shard: simplify callers
This will make it easier-to-use in ExtSearchIdx.
2020-11-07 v2writable: hoist out write_alternates
We'll be reusing this for external indices and possibly other places.
2020-11-07 v2writable: prepare initialization for external indices
External indices won't have $self->{ibx} since it needs to deal with multiple inboxes. We can also hoist out ->parallel_init to make it easier to distinguish the non-parallel control flow.
2020-11-07 v2writable: make OO calls to last_commit-related methods
We'll try to reuse as much V2Writable code as possible for external indices, but the way "last_commit" info is stored must be different as external indices will deal with last_commit info for multiple inboxes.
2020-11-07 v2writable: add git method
This will make it easier to share code with ExtSearchIdx.
2020-10-17 git: introduce async_wait_all
->cat_async and ->check_async may trigger each other (in future callers) while waiting, so we need a unified method to ensure both complete. This doesn't affect current code, but allows us to slightly simplify existing callers.
2020-09-30 v2writable: use "HEAD" to match v1 indexing behavior
Users may want to change the default branch used for git epochs in v2 (v1 SearchIdx always used whatever "HEAD" pointed to).
2020-09-24 v2writable: drop outdated {unindex_range} check
{unindex_range} only exists in the $sync state, nowadays, not the V2Writable ($self) object. $sync->{unindex_range} won't be populated if $regen_max is zero, either, unless somebody is injecting importable commits into an epoch history, in which case this change will result in no-op indexing doing no work.
2020-09-12 treewide: avoid `goto &NAME' for tail recursion
While Perl implements tail recursion via `goto', which allows avoiding warnings on deep recursion, it doesn't (as of 5.28) optimize the speed of such dispatches, though it may reduce ephemeral memory usage. Make the code less alien to hackers coming from other languages by using normal subroutine dispatch. It's actually slightly faster in micro benchmarks due to the complexity of `goto &NAME'.
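For readers unfamiliar with the two styles, here is a small self-contained comparison. The `sum_goto' and `sum_call' subs are made up for illustration and do not come from the public-inbox tree; the recursion depth is kept below 100 so the normal call does not trigger a "Deep recursion" warning.

```perl
use strict;
use warnings;

# Tail call via `goto &NAME': replaces the current frame, passing @_.
sub sum_goto {
	my ($n, $acc) = @_;
	return $acc if $n == 0;
	@_ = ($n - 1, $acc + $n);
	goto &sum_goto;
}

# Normal dispatch: an ordinary recursive call, one frame per level.
sub sum_call {
	my ($n, $acc) = @_;
	return $acc if $n == 0;
	return sum_call($n - 1, $acc + $n);
}

print sum_goto(50, 0), "\n"; # sum of 1..50
print sum_call(50, 0), "\n"; # same result via normal calls
```

Both subs compute the same value; only the dispatch mechanism differs.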
2020-09-03 v2writable: reuse read-only shard counting code
We'll also fix the read-only code to ensure we notice missing Xapian shards, since gaps would throw off our expectation that Xapian document IDs and NNTP article numbers are interchangeable.
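One way to notice missing shards is sketched below. This assumes shards live in numbered subdirectories (0, 1, 2, ...) of a single directory; `contiguous_shards' and `$xpfx' are hypothetical names, and this is not claimed to be the code the commit touches.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical helper: return shard directories 0..N under $xpfx,
# or an empty list if any number in that range is missing.
sub contiguous_shards {
	my ($xpfx) = @_;
	opendir(my $dh, $xpfx) or return;
	my $last = max(grep(/\A[0-9]+\z/, readdir($dh)));
	closedir($dh);
	return unless defined $last; # no numbered shards at all
	my @dirs;
	for my $n (0..$last) {
		my $dir = "$xpfx/$n";
		unless (-d $dir) {
			warn "E: $dir missing\n"; # a gap breaks docid mapping
			return;
		}
		push @dirs, $dir;
	}
	return @dirs;
}
```

Refusing to proceed on a gap is a deliberate choice in this sketch: silently skipping a shard would shift document IDs relative to article numbers.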
2020-09-03 disambiguate OverIdx and Over by field name
We'll use {oidx} as the common field name for the read-write OverIdx, here, to disambiguate it from the read-only {over} field. This hopefully makes it clearer which code paths are read-only and which are read-write.
2020-09-01 watch: avoid unnecessary spawning on spam removals
This should further mitigate lock contention problems when -watch is configured to watch a Maildir for spam while performing a large NNTP import. There is now a small risk a message won't get removed if it's in the current (uncommitted) fast-import batch, but that is unlikely given the batch size is now only 10 messages. If that small window is hit, flipping the \Seen flag (e.g. marking it unread, and then read again) will trigger another removal attempt via IMAP or Maildir.
2020-08-27 over: rename ->disconnect to ->dbh_close
Since we got rid of over->connect, `disconnect' no longer pairs with it. So name it after the `close(2)' syscall it ultimately issues.
2020-08-26 v2writable: compatibility with SWIG Xapian binding
The SWIG binding won't auto-convert IV/UV to PV like the XS Search::Xapian binding would, so work around that shortcoming for now.
Fixes: a367ec1b15a2458 ("mbox: disable "&t" on existing Xapian until full reindex")
2020-08-23 index: --sequential-shard checkpoints after each shard
There's no reason we'd want Xapian to defer flushing once we've indexed everything belonging to a particular shard.
2020-08-23 mbox: disable "&t" on existing Xapian until full reindex
Expanding threads via over.sqlite3 for mbox.gz downloads without Xapian effectively collapsing on the THREADID column leads to repeated messages getting downloaded. To avoid that situation, use a "has_threadid" Xapian metadata flag that's only set on --reindex (and brand new Xapian DBs). This allows admins to upgrade WWW or do --reindex in any order; without worrying about users eating up bandwidth and CPU cycles.
2020-08-23 searchidx: put all shard-related stuff in SearchIdxShard.pm
We'll also rename the /^remote_/ prefix to "shard_", since remote implies the process is on a different host. These methods only pass messages to a child process on the same host OR perform operations within the same process.
2020-08-19 v2writable: show newline after "indexing all of .. " message
Otherwise things get very confusing when verbosity is enabled :x
2020-08-13 v2writable: remove IdxStack import
We use IdxStack via log2stack() from SearchIdx, now.
2020-08-10 index: cleanup internal variables
Move away from hard-to-read alllowercase naming and favor snake_case or separated-by-dashes. We'll keep `--indexlevel' as-is for now, since it's been around for several releases; but we'll support `--index-level' in the CLI and update our documentation in a few months. We'll also clarify that publicInbox.indexMaxSize is only intended for -index, and not -watch or -mda.
2020-08-10 avoid File::Temp::tempfile in more places
We can use open(..., undef) natively in Perl in t/import.t. In places where we need a pathname, the File::Temp OO API gives us auto-unlinking for free.
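Both approaches mentioned above are shown in this short sketch (variable names are illustrative only). Three-argument open with an undef filename gives an anonymous temporary file, while a File::Temp object provides a pathname and unlinks the file when the object is destroyed.

```perl
use strict;
use warnings;
use File::Temp;

# Anonymous temporary file: undef as the filename in 3-arg open
# yields a handle with no pathname to clean up.
open(my $anon, '+>', undef) or die "open: $!";
print $anon "hello\n";
seek($anon, 0, 0) or die "seek: $!";
my $line = <$anon>;

# When a pathname is needed, the File::Temp object unlinks its file
# once it goes out of scope.
my $path;
{
	my $tmp = File::Temp->new;
	$path = $tmp->filename;
	print $tmp "data\n";
}
print "anon read: $line";
print "removed: ", (-e $path ? "no" : "yes"), "\n";
```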
2020-08-10 index: --sequential-shard works incrementally
We should never reindex all data in Xapian unless --reindex is specified on the command-line. This means users who put publicInbox.indexSequentialShard in their config file won't have to put up with a full reindex at every invocation, only when they specify --reindex. We'll also clean up the progress output to not emit nonsensical ranges where the starting number is higher than the end.
2020-08-09 favor `getconf _NPROCESSORS_ONLN` over GNU nproc
getconf(1) itself is POSIX, while `_NPROCESSORS_ONLN' is not. However, FreeBSD (tested 11.4 and 12.1) and glibc (tested CentOS 7.x and Debian 10.x) both support `getconf _NPROCESSORS_ONLN'. GNU coreutils (and thus `nproc' or `gnproc') are not installed by default on the *BSDs, so we'll try the option most likely to exist on both glibc and *BSDs out-of-the-box.
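A minimal sketch of calling getconf(1) from Perl, assuming a POSIX shell is available for the redirect. The `nproc_online' helper and its fallback value of 1 are assumptions made for illustration, not the project's actual code.

```perl
use strict;
use warnings;

# Hypothetical helper: ask getconf(1) for the online CPU count and
# fall back to 1 if the command is missing or prints something odd.
sub nproc_online {
	my $n = `getconf _NPROCESSORS_ONLN 2>/dev/null`;
	return 1 unless defined $n && $n =~ /\A(\d+)\s*\z/ && $1 > 0;
	return $1 + 0;
}

print "online CPUs: ", nproc_online(), "\n";
```

Falling back to a single CPU keeps the caller working on systems where neither the command nor the variable is available.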
2020-08-07 v2writable: fix batch size accounting
We need to account for whether shard parallelization is enabled or not, since users of parallelization are expected to have more RAM.
2020-08-07 index+xcpdb: rename `--no-sync' to `--no-fsync'
We'll continue supporting `--no-sync' even though it has yet to make it into a release, but the term `sync' is overloaded in our codebase, which may be confusing to new hackers and users. None of our code or dependencies issue the sync(2) syscall, either, only fsync(2) and fdatasync(2).
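To illustrate the distinction in Perl terms: IO::Handle's `sync' method operates on a single file descriptor via fsync(2) and is unrelated to the system-wide sync(2). The `write_file' helper and its `no_fsync' option below are hypothetical and only mimic what a `--no-fsync' switch might control.

```perl
use strict;
use warnings;
use IO::Handle;
use File::Temp;

# Hypothetical helper: write data, then optionally fsync(2) the file.
sub write_file {
	my ($fh, $data, %opt) = @_;
	print $fh $data or die "print: $!";
	$fh->flush or die "flush: $!"; # push Perl's buffer to the kernel
	return 'flushed' if $opt{no_fsync};
	$fh->sync or die "fsync: $!"; # IO::Handle->sync calls fsync(2)
	return 'fsynced';
}

my $tmp = File::Temp->new;
print write_file($tmp, "a\n"), "\n";
print write_file($tmp, "b\n", no_fsync => 1), "\n";
```

Skipping the fsync step trades durability after a crash or power loss for faster writes.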