path: root/lib/PublicInbox/V2Writable.pm
Date Commit message
2020-11-15 index: discard sync->{todo} on iteration
There's no need to continuously append to {todo} when indexing multiple inboxes. They're not redundantly indexed (because the IdxStack is discarded, making it a noop), but it's still a waste of memory keeping the $unit hashrefs around.
2020-11-15 index: avoid per-epoch --batch-check processes
Since all.git (v2) and ALL.git (extindex) encompass every single epoch or indexed inbox, and is_ancestor() only uses hexadecimal OIDs, there is no good reason to use $unit->{git} for an epoch-local $git->check. This prevents dozens/hundreds of --batch-check processes from being left running after indexing and can improve locality if size checks are being done (since that uses --batch-check, too). Theoretically several epochs may have conflicting OIDs, but we're screwed in those cases, anyways, so we might as well detect it earlier (though I'm not sure what the behavior would be :x).
2020-11-15 index: checkpoints write last_commit metadata
This will set us up for supporting graceful shutdown on -index without repeating any work.
2020-11-08 v2writable: more accurate {current_info} warnings/progress
With async git blob retrievals, the OID being enqueued and the OID being processed can be totally unrelated and misleading. We'll also prefix $INBOX_DIR for v2, and not just the epoch since we could be indexing multiple inboxes via both -index and -extindex.
2020-11-08 v2writable: less expensive checkpoint for extindex
Since extindex holds no locks on parallel inbox writers, we can simply use "barrier" IPC shard commands to checkpoint and avoid respawning shard or git processes.
2020-11-07 extsearchidx: handle edits
We can now handle cases where messages are edited in one inbox but not another, bifurcating the message. V2Writable::log_range handles some edge cases which could happen in v2-only code paths, as well, but weren't usually triggered due to default git-gc knobs not pruning immediately.
2020-11-07 v2writable: pass oid to uindex_oid
We'll be validating against this in the future to stop bugs from creeping in.
2020-11-07 v2writable: reduce scope of epoch-aware code
And clearly label it. We may try to reuse some of this for v1 indexing code paths.
2020-11-07 v2writable: move size check init to sync_prepare
This will let us use it from ExtSearchIdx.
2020-11-07 v2writable: make *last_commits and sync_prepare OO methods
This will allow ExtSearchIdx to override or reuse them more easily. Unfortunately we lose prototype validation, but that seems to be discouraged anyways given the 'signatures' feature in Perl 5.20+.
2020-11-07 v2writable: rename {v2w} field to {self}
This will make it easier to reuse some indexing code for ExtSearchIdx.
2020-11-07 v2writable: allow OO method references
Using `->can(method)' allows subclasses to override `index_oid' and `unindex_oid' methods.
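A minimal sketch of the `->can(method)' pattern follows. The package names (Base, Ext), the return strings, and the index_all helper are hypothetical and are not taken from V2Writable; only the use of `->can' to obtain an overridable code reference mirrors the commit message.

```perl
use strict;
use warnings;

package Base; # hypothetical stand-in for a base indexer class
sub new { my ($class) = @_; bless {}, $class }
sub index_oid { my ($self, $oid) = @_; "base indexed $oid" }

# Resolve the method once, then reuse the code reference for every OID.
sub index_all {
	my ($self, @oids) = @_;
	my $cb = $self->can('index_oid'); # honors subclass overrides
	map { $cb->($self, $_) } @oids;
}

package Ext; # hypothetical subclass overriding index_oid
our @ISA = ('Base');
sub index_oid { my ($self, $oid) = @_; "ext indexed $oid" }

package main;
my @base = Base->new->index_all('abc');
my @ext = Ext->new->index_all('abc');
print "$base[0]\n$ext[0]\n";
```

Because `->can' is called on the object, the lookup follows the inheritance chain, so the subclass version is picked up without the caller naming it.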
2020-11-07 v2writable: more generic sync setup code
We want to reuse this code for ExtSearchIdx, eventually.
2020-11-07 searchidx: log2stack: simplify callers
Since we store {ibx} in $sync state, we no longer have to pass it as an argument to log2stack.
2020-11-07 v2writable: checkpoint: account for lack of {mm}
ExtSearchIdx will not have Msgmap, since it may index non-email blobs in the future (it'll still be usable with IMAP, but not NNTP).
2020-11-07 v2writable: rename remaining "remote" terminology
"remote" used to imply "child process on the same machine" which was somewhat non-sensical, anyways. And OverIdx has been in the same process since v2 was finalized. So use the suffix "aux" for "auxiliary" since it can be safely jettisoned without breaking URLs.
2020-11-07 v2: some changes for ExtSearchIdx compatibility
We'll be using per-sync-state {ibx} refs instead, so make parts of the v2 indexing code less-dependent on $self->{ibx} where $self is a V2Writable object.
2020-11-07 v2writable: count_shards: allow working without {ibx}
This will be needed for ExtSearchIdx which doesn't have a persistent PublicInbox::Inbox object.
2020-11-07 v2writable: idx_shard: simplify callers
This will make it easier to use in ExtSearchIdx.
2020-11-07 v2writable: hoist out write_alternates
We'll be reusing this for external indices and possibly other places.
2020-11-07 v2writable: prepare initialization for external indices
External indices won't have $self->{ibx} since they need to deal with multiple inboxes. We can also hoist out ->parallel_init to make it easier to distinguish the non-parallel control flow.
2020-11-07 v2writable: make OO calls to last_commit-related methods
We'll try to reuse as much V2Writable code as possible for external indices, but the way "last_commit" info is stored must be different as external indices will deal with last_commit info for multiple inboxes.
2020-11-07 v2writable: add git method
This will make it easier to share code with ExtSearchIdx.
2020-10-17 git: introduce async_wait_all
->cat_async and ->check_async may trigger each other (in future callers) while waiting, so we need a unified method to ensure both complete. This doesn't affect current code, but allows us to slightly simplify existing callers.
2020-09-30 v2writable: use "HEAD" to match v1 indexing behavior
Users may want to change the default branch used for git epochs in v2 (v1 SearchIdx always used whatever "HEAD" pointed to).
2020-09-24 v2writable: drop outdated {unindex_range} check
{unindex_range} only exists in the $sync state, nowadays, not the V2Writable ($self) object. $sync->{unindex_range} won't be populated if $regen_max is zero, either, unless somebody is injecting importable commits into an epoch history, in which case this change will result in no-op indexing doing no work.
2020-09-12 treewide: avoid `goto &NAME' for tail recursion
Perl implements tail recursion via `goto', which allows avoiding warnings on deep recursion. However, it doesn't (as of 5.28) optimize the speed of such dispatches, though it may reduce ephemeral memory usage. Make the code less alien to hackers coming from other languages by using normal subroutine dispatch. It's actually slightly faster in micro benchmarks due to the complexity of `goto &NAME'.
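For readers unfamiliar with the two dispatch styles, here is a small illustrative comparison. The subroutine names count_goto and count_plain are hypothetical and do not appear in the codebase; the recursion depth is kept below 100 so the plain version does not trigger Perl's "Deep recursion" warning.

```perl
use strict;
use warnings;

# Tail call via `goto &NAME': the current frame is replaced and the
# callee sees whatever is in @_ at the time of the goto.
sub count_goto {
	my ($n, $acc) = @_;
	return $acc if $n == 0;
	@_ = ($n - 1, $acc + 1);
	goto &count_goto;
}

# Normal subroutine dispatch: each call adds a frame.
sub count_plain {
	my ($n, $acc) = @_;
	return $acc if $n == 0;
	return count_plain($n - 1, $acc + 1);
}

my $via_goto = count_goto(50, 0);
my $via_plain = count_plain(50, 0);
print "$via_goto $via_plain\n"; # prints "50 50"
```

Both forms compute the same result; the second is the style the commit favors for readability.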
2020-09-03 v2writable: reuse read-only shard counting code
We'll also fix the read-only code to ensure we notice missing Xapian shards, since gaps would throw off our expectation that Xapian document IDs and NNTP article numbers are interchangeable.
2020-09-03 disambiguate OverIdx and Over by field name
We'll use {oidx} as the common field name for the read-write OverIdx, here, to disambiguate it from the read-only {over} field. This hopefully makes it clearer which code paths are read-only and which are read-write.
2020-09-01 watch: avoid unnecessary spawning on spam removals
This should further mitigate lock contention problems when -watch is configured to watch a Maildir for spam while performing a large NNTP import. There is now a small risk a message won't get removed if it's in the current (uncommitted) fast-import batch, but that is unlikely given the batch size is now only 10 messages. If that small window is hit, flipping the \Seen flag (e.g. marking it unread, and then read again) will trigger another removal attempt via IMAP or Maildir.
2020-08-27 over: rename ->disconnect to ->dbh_close
Since we got rid of over->connect, `disconnect' no longer pairs with it. So name it after the `close(2)' syscall it ultimately issues.
2020-08-26 v2writable: compatibility with SWIG Xapian binding
The SWIG binding won't auto-convert IV/UV to PV like the XS Search::Xapian binding would, so work around that shortcoming for now. Fixes: a367ec1b15a2458 ("mbox: disable "&t" on existing Xapian until full reindex")
2020-08-23 index: --sequential-shard checkpoints after each shard
There's no reason we'd want Xapian to defer flushing once we've indexed everything belonging to a particular shard.
2020-08-23 mbox: disable "&t" on existing Xapian until full reindex
Expanding threads via over.sqlite3 for mbox.gz downloads without Xapian effectively collapsing on the THREADID column leads to repeated messages getting downloaded. To avoid that situation, use a "has_threadid" Xapian metadata flag that's only set on --reindex (and brand new Xapian DBs). This allows admins to upgrade WWW or do --reindex in any order, without worrying about users eating up bandwidth and CPU cycles.
2020-08-23 searchidx: put all shard-related stuff in SearchIdxShard.pm
We'll also rename the /^remote_/ prefix to "shard_", since remote implies the process is on a different host. These methods only pass messages to a child process on the same host OR perform operations within the same process.
2020-08-19 v2writable: show newline after "indexing all of .. " message
Otherwise things get very confusing when verbosity is enabled :x
2020-08-13 v2writable: remove IdxStack import
We use IdxStack via log2stack() from SearchIdx, now.
2020-08-10 index: cleanup internal variables
Move away from hard-to-read alllowercase naming and favor snake_case or separated-by-dashes. We'll keep `--indexlevel' as-is for now, since it's been around for several releases; but we'll support `--index-level' in the CLI and update our documentation in a few months. We'll also clarify that publicInbox.indexMaxSize is only intended for -index, and not -watch or -mda.
2020-08-10 avoid File::Temp::tempfile in more places
We can use open(..., undef) natively in Perl in t/import.t. In places where we need a pathname, the File::Temp OO API gives us auto-unlinking for free.
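The two approaches mentioned above can be sketched as follows. This is an illustration only: the template name `example-XXXXXXXX' and the variable names are made up for the example and are not taken from the test suite.

```perl
use strict;
use warnings;
use File::Temp;

# Anonymous temporary file: passing undef as the filename gives a
# handle with no pathname, so there is nothing to unlink afterwards.
open(my $anon, '+>', undef) or die "open: $!";
print $anon "hello\n";
seek($anon, 0, 0) or die "seek: $!";
my $line = <$anon>;

# When a pathname is needed, the File::Temp object removes the file
# when it goes out of scope (UNLINK defaults to true for ->new).
my $path;
{
	my $tmp = File::Temp->new(TEMPLATE => 'example-XXXXXXXX', TMPDIR => 1);
	$path = $tmp->filename;
	print $tmp "data\n";
	die "missing $path" unless -f $path;
}
my $gone = !-e $path;

print "read: $line";
print "unlinked: ", ($gone ? "yes" : "no"), "\n";
```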
2020-08-10 index: --sequential-shard works incrementally
We should never reindex all data in Xapian unless --reindex is specified on the command-line. This means users who put publicInbox.indexSequentialShard in their config file won't have to put up with a full reindex at every invocation, only when they specify --reindex. We'll also clean up the progress output to not emit non-sensical ranges where the starting number is higher than the end.
2020-08-09 favor `getconf _NPROCESSORS_ONLN` over GNU nproc
getconf(1) itself is POSIX, while `_NPROCESSORS_ONLN' is not. However, FreeBSD (tested 11.4 and 12.1) and glibc (tested CentOS 7.x and Debian 10.x) both support `getconf _NPROCESSORS_ONLN'. GNU coreutils (and thus `nproc' or `gnproc') are not installed by default on the *BSDs, so we'll try the option most likely to exist on both glibc and *BSDs out-of-the-box.
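One way such a probe could look is sketched below. The helper name detect_nproc and the fallback value of 1 are assumptions made for this example; the commit message does not show the actual implementation.

```perl
use strict;
use warnings;

# Hypothetical helper: ask getconf(1) for the number of online CPUs
# and fall back to 1 if the command or the variable is unavailable.
sub detect_nproc {
	my $n = `getconf _NPROCESSORS_ONLN 2>/dev/null`;
	$n = '' unless defined $n;
	chomp $n;
	return ($n =~ /\A[0-9]+\z/ && $n > 0) ? $n + 0 : 1;
}

my $nproc = detect_nproc();
print "nproc=$nproc\n";
```

Validating the output before using it keeps an unexpected reply (for example "undefined") from being treated as a number.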
2020-08-07 v2writable: fix batch size accounting
We need to account for whether shard parallelization is enabled or not, since users of parallelization are expected to have more RAM.
2020-08-07 index+xcpdb: rename `--no-sync' to `--no-fsync'
We'll continue supporting `--no-sync' even though it has yet to make it into a release, but the term `sync' is overloaded in our codebase, which may be confusing to new hackers and users. None of our code nor dependencies issue the sync(2) syscall, either, only fsync(2) and fdatasync(2).
2020-08-07 index: support --xapian-only switch
This is useful for speeding up indexing runs when only Xapian rules change but SQLite indexing doesn't change. This mostly implies `--reindex', but does NOT pick up new messages (because SQLite indexing needs to occur for that). I'm leaving this undocumented in the manpage for now since it's mainly to speed up development and testing. Users upgrading to 1.6.0 will be advised to `--reindex --rethread', anyways, due to the threading improvements since 1.1.0-pre1. It may make sense to document for 1.7+ when there are Xapian-only indexing changes, though.
2020-08-07 index: v2: --sequential-shard option
This gives better page cache utilization for Xapian indexing on slow storage by improving locality for random I/O activity on the Xapian DB. Instead of doing a single pass to index both SQLite and Xapian, this indexes them separately. The first pass is identical to indexlevel=basic: it indexes both over.sqlite3 and msgmap.sqlite3. Subsequent passes only operate on a single Xapian shard for documents belonging to that shard. Given enough shards, each individual shard can be made small enough to fit into the kernel page cache and avoid HDD seeks for read activity. Doing rough tests with a busy system with a 7200 RPM HDD with ext4, full indexing of LKML (9 epochs) goes from ~80 hours (-j0) to ~30 hours (-j8) with 16GB RAM with 7 shards configured and fsync(2) disabled (--no-sync) and `--batch-size=10m'.
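The pass structure described above can be pictured with a toy sketch. Everything in it is hypothetical: the article numbers, the shard count, and in particular the assumption that a document belongs to shard `number modulo shard count', which the commit message does not state.

```perl
use strict;
use warnings;

my $nshards = 3;     # hypothetical shard count
my @nums = (1..10);  # hypothetical article numbers
my @log;

# First pass: SQLite only (over.sqlite3 and msgmap.sqlite3).
push @log, "sqlite: @nums";

# Later passes: one Xapian shard at a time, touching only the
# documents assumed to belong to that shard (num % nshards).
for my $shard (0 .. $nshards - 1) {
	my @mine = grep { $_ % $nshards == $shard } @nums;
	push @log, "xapian shard $shard: @mine";
}

print "$_\n" for @log;
```

Each Xapian pass works on a disjoint subset, which is what allows a single shard's working set to stay within the page cache.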
2020-08-07 v2writable: fix rethread cleanup
We need to drop old ghosts properly while inside the transaction, otherwise it becomes a no-op. This isn't a big deal, as it only results in a few dangling DB rows and a small amount of wasted space.
2020-08-02 remove unnecessary ->header_obj calls
We used ->header_obj in the past as an optimization with Email::MIME. That optimization is no longer necessary with PublicInbox::Eml. This doesn't make any functional difference even if we were to go back to Email::MIME. However, it reduces the amount of code we have and slightly reduces allocations with PublicInbox::Eml.
2020-08-01 improve error handling on import fork / lock failures
v?fork failures seem to be the cause of locks not getting released in -watch. Ensure lock release doesn't get skipped in ->done for both v1 and v2 inboxes. We also need to do everything we can to ensure DB handles, pipes and processes get released even in the face of failure. While we're at it, make failures around `git update-server-info' non-fatal, since smart HTTP seems more popular anyways.
v2 changes:
- spawn: show failing command
- ensure waitpid is synchronous for inotify events
- teardown all fast-import processes on exception, not just the failing one
- beef up lock_release error handling
- release lock on fast-import spawn failure
2020-07-29 v2writable: use {inboxdir} for msgmap->tmp_clone
Otherwise, a user is more likely to remove the msgmap-XXXXXXXX SQLite file from $TMPDIR and cause SQLite to error out.
2020-07-29 v2writable: support async git blob retrievals
This seems to speed up --reindex on smallish v2 inboxes by about 30% on both HDD and SSD. lore/git (~1GB) on an SSD even gives a 30% improvement with 3 shards. I'm only seeing a ~4% speedup on LKML with a SATA SSD (which is difficult to repeat because it takes around 4 hours). Testing LKML on an HDD will take much more time...