path: root/lib/PublicInbox/V2Writable.pm
Date Commit message
2020-09-12 treewide: avoid `goto &NAME' for tail recursion
While Perl implements tail recursion via `goto', which allows avoiding warnings on deep recursion, it doesn't (as of 5.28) optimize the speed of such dispatches, though it may reduce ephemeral memory usage. Make the code less alien to hackers coming from other languages by using normal subroutine dispatch. It's actually slightly faster in micro benchmarks due to the complexity of `goto &NAME'.
2020-09-03 v2writable: reuse read-only shard counting code
We'll also fix the read-only code to ensure we notice missing Xapian shards, since gaps would throw off our expectation that Xapian document IDs and NNTP article numbers are interchangeable.
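The gap check described above can be sketched roughly as follows. This is a minimal illustration, not public-inbox's actual code: it assumes (hypothetically) that shards live in subdirectories named 0, 1, 2, ... under one directory, and `count_shards` is an invented name.

```python
import os

def count_shards(xap_dir):
    """Return the number of contiguous shards, raising if a gap is found.

    Assumption: each shard is a subdirectory whose name is a decimal number.
    """
    nums = sorted(
        int(name) for name in os.listdir(xap_dir)
        if name.isdigit() and os.path.isdir(os.path.join(xap_dir, name))
    )
    # Shard numbers must be exactly 0..N-1; anything else means a shard is missing.
    for expected, found in enumerate(nums):
        if expected != found:
            raise RuntimeError(f"shard {expected} missing in {xap_dir}")
    return len(nums)
```

Failing loudly on a gap, instead of just counting directories, is what keeps a missing shard from silently shifting the document-ID-to-shard mapping.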
2020-09-03 disambiguate OverIdx and Over by field name
We'll use {oidx} as the common field name for the read-write OverIdx, here, to disambiguate it from the read-only {over} field. This hopefully makes it clearer which code paths are read-only and which are read-write.
2020-09-01 watch: avoid unnecessary spawning on spam removals
This should further mitigate lock contention problems when -watch is configured to watch on a Maildir for spam while performing a large NNTP import. There is now a small risk a message won't get removed if it's in the current (uncommitted) fast-import batch, but that is unlikely given the batch size is now only 10 messages. If that small window is hit, flipping the \Seen flag (e.g. marking it unread, and then read again) will trigger another removal attempt via IMAP or Maildir.
2020-08-27 over: rename ->disconnect to ->dbh_close
Since we got rid of over->connect, `disconnect' no longer pairs with it. So name it after the `close(2)' syscall it ultimately issues.
2020-08-26 v2writable: compatibility with SWIG Xapian binding
The SWIG binding won't auto-convert IV/UV to PV like the XS Search::Xapian binding would, so work around that shortcoming for now. Fixes: a367ec1b15a2458 ("mbox: disable "&t" on existing Xapian until full reindex")
2020-08-23 index: --sequential-shard checkpoints after each shard
There's no reason we'd want Xapian to defer flushing once we've indexed everything belonging to a particular shard.
2020-08-23 mbox: disable "&t" on existing Xapian until full reindex
Expanding threads via over.sqlite3 for mbox.gz downloads without Xapian effectively collapsing on the THREADID column leads to repeated messages getting downloaded. To avoid that situation, use a "has_threadid" Xapian metadata flag that's only set on --reindex (and brand new Xapian DBs). This allows admins to upgrade WWW or do --reindex in any order, without worrying about users eating up bandwidth and CPU cycles.
2020-08-23 searchidx: put all shard-related stuff in SearchIdxShard.pm
We'll also rename the /^remote_/ prefix to "shard_", since remote implies the process is on a different host. These methods only pass messages to a child process on the same host OR perform operations within the same process.
2020-08-19 v2writable: show newline after "indexing all of .. " message
Otherwise things get very confusing when verbosity is enabled :x
2020-08-13 v2writable: remove IdxStack import
We use IdxStack via log2stack() from SearchIdx, now.
2020-08-10 index: cleanup internal variables
Move away from hard-to-read alllowercase naming and favor snake_case or separated-by-dashes. We'll keep `--indexlevel' as-is for now, since it's been around for several releases; but we'll support `--index-level' in the CLI and update our documentation in a few months. We'll also clarify that publicInbox.indexMaxSize is only intended for -index, and not -watch or -mda.
2020-08-10 avoid File::Temp::tempfile in more places
We can use open(..., undef) natively in Perl in t/import.t. In places where we need a pathname, the File::Temp OO API gives us auto-unlinking for free.
2020-08-10 index: --sequential-shard works incrementally
We should never reindex all data in Xapian unless --reindex is specified on the command-line. This means users who put publicInbox.indexSequentialShard in their config file won't have to put up with a full reindex at every invocation, only when they specify --reindex. We'll also clean up the progress output to not emit non-sensical ranges where the starting number is higher than the end.
2020-08-09 favor `getconf _NPROCESSORS_ONLN` over GNU nproc
getconf(1) itself is POSIX, while `_NPROCESSORS_ONLN' is not. However, FreeBSD (tested 11.4 and 12.1) and glibc (tested CentOS 7.x and Debian 10.x) both support `getconf _NPROCESSORS_ONLN'. GNU coreutils (and thus `nproc' or `gnproc') are not installed by default on the *BSDs, so we'll try the option most likely to exist on both glibc and *BSDs out-of-the-box.
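For illustration, a caller might probe the processor count along these lines. This is a hedged Python sketch rather than the project's Perl code; `nproc_online` is a hypothetical name, and the fallback to `os.cpu_count()` is an assumption added for hosts where `getconf` is unavailable or does not know the variable.

```python
import os
import subprocess

def nproc_online():
    """Best-effort count of online processors, never less than 1."""
    try:
        out = subprocess.run(
            ["getconf", "_NPROCESSORS_ONLN"],
            capture_output=True, text=True, check=True,
        ).stdout
        n = int(out.strip())
        if n > 0:
            return n
    except (OSError, subprocess.CalledProcessError, ValueError):
        # getconf missing, variable unsupported, or unparsable output
        pass
    return os.cpu_count() or 1
```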
2020-08-07 v2writable: fix batch size accounting
We need to account for whether shard parallelization is enabled or not, since users of parallelization are expected to have more RAM.
2020-08-07 index+xcpdb: rename `--no-sync' to `--no-fsync'
We'll continue supporting `--no-sync' even though it has yet to make it into a release, but the term `sync' is overloaded in our codebase, which may be confusing to new hackers and users. Neither our code nor our dependencies issue the sync(2) syscall, either, only fsync(2) and fdatasync(2).
2020-08-07 index: support --xapian-only switch
This is useful for speeding up indexing runs when only Xapian rules change but SQLite indexing doesn't change. This mostly implies `--reindex', but does NOT pick up new messages (because SQLite indexing needs to occur for that). I'm leaving this undocumented in the manpage for now since it's mainly to speed up development and testing. Users upgrading to 1.6.0 will be advised to `--reindex --rethread', anyways, due to the threading improvements since 1.1.0-pre1. It may make sense to document for 1.7+ when there are Xapian-only indexing changes, though.
2020-08-07 index: v2: --sequential-shard option
This gives better page cache utilization for Xapian indexing on slow storage by improving locality for random I/O activity on the Xapian DB. Instead of doing a single pass to index both SQLite and Xapian, this indexes them separately. The first pass is identical to indexlevel=basic: it indexes both over.sqlite3 and msgmap.sqlite3. Subsequent passes only operate on a single Xapian shard for documents belonging to that shard. Given enough shards, each individual shard can be made small enough to fit into the kernel page cache and avoid HDD seeks for read activity. Doing rough tests with a busy system with a 7200 RPM HDD with ext4, full indexing of LKML (9 epochs) goes from ~80 hours (-j0) to ~30 hours (-j8) with 16GB RAM with 7 shards configured and fsync(2) disabled (--no-sync) and `--batch-size=10m'.
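The pass ordering described above can be sketched as below. This is only an illustration: `sequential_shard_plan` is a hypothetical name, and assigning a document to shard `docid % nshards` is an assumption made for the example, not something this commit message states.

```python
def sequential_shard_plan(docids, nshards):
    """Yield (pass_name, docids) pairs in the order a sequential run visits them."""
    docids = list(docids)
    # First pass: SQLite only (over.sqlite3 and msgmap.sqlite3), all documents.
    yield ("sqlite", docids)
    # Later passes: one Xapian shard at a time, so only that shard's
    # working set needs to stay in the page cache.
    for shard in range(nshards):
        # Assumed mapping: a document belongs to shard (docid % nshards).
        yield (f"xapian-shard-{shard}", [d for d in docids if d % nshards == shard])
```

With six documents and three shards, the plan is one SQLite pass over all six followed by three Xapian passes of two documents each.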
2020-08-07 v2writable: fix rethread cleanup
We need to drop old ghosts properly while inside the transaction, otherwise it becomes a no-op. This isn't a big deal, as it only results in a few dangling DB rows and a small amount of wasted space.
2020-08-02 remove unnecessary ->header_obj calls
We used ->header_obj in the past as an optimization with Email::MIME. That optimization is no longer necessary with PublicInbox::Eml. This doesn't make any functional difference even if we were to go back to Email::MIME. However, it reduces the amount of code we have and slightly reduces allocations with PublicInbox::Eml.
2020-08-01 improve error handling on import fork / lock failures
v?fork failures seem to be the cause of locks not getting released in -watch. Ensure lock release doesn't get skipped in ->done for both v1 and v2 inboxes. We also need to do everything we can to ensure DB handles, pipes and processes get released even in the face of failure. While we're at it, make failures around `git update-server-info' non-fatal, since smart HTTP seems more popular anyways.
v2 changes:
- spawn: show failing command
- ensure waitpid is synchronous for inotify events
- teardown all fast-import processes on exception, not just the failing one
- beef up lock_release error handling
- release lock on fast-import spawn failure
2020-07-29 v2writable: use {inboxdir} for msgmap->tmp_clone
Otherwise, a user is more likely to remove the msgmap-XXXXXXXX SQLite file from $TMPDIR and cause SQLite to error out.
2020-07-29 v2writable: support async git blob retrievals
This seems to speed up --reindex on smallish v2 inboxes by about 30% on both HDD and SSD. lore/git (~1GB) on an SSD even gives a 30% improvement with 3 shards. I'm only seeing a ~4% speedup on LKML with a SATA SSD (which is difficult to repeat because it takes around 4 hours). Testing LKML on an HDD will take much more time...
2020-07-25 v2writable: {unindexed} belongs in $sync state
There's no reason for {unindexed} to persist beyond an ->index_sync call.
2020-07-25 v2writable: share log2stack code with v1
Another step in making v1 and v2 more similar.
2020-07-25 index+xcpdb: support --no-sync flag
This allows us to speed up indexing operations to SQLite and Xapian. Unfortunately, it doesn't affect operations using `xapian-compact' and the compactor API, since that doesn't seem to support Xapian::DB_NO_SYNC, yet.
2020-07-25 v2writable: clarify "epoch" comment
2020-07-25 v2writable: get rid of {reindex_pipe} field
Since normal per-epoch indexing no longer holds a "git log" process open, we don't need to worry about not sharing the pipe with forked shards when we restart the indexer. While we're in the area, better describe what `unindex' does, since it's a rarely-used but necessary code path.
2020-07-25 v2writable: use read-only PublicInbox::Git for cat_file
We can reduce the number of parameters we pass around on the stack and make our read-write and read-only code paths more uniform.
2020-07-25 use consistent {ibx} field for writable code paths
This is a step which makes our use of abbreviations more consistent when referring to PublicInbox::Inbox objects. We'll also be reducing the number of redundant fields in SearchIdx and V2Writable code paths to make the object graph easier-to-follow.
2020-07-25 v2writable: drop "EPOCH.git indexing $RANGE" progress
It'll be one continuous range with IdxStack.
2020-07-25 v2writable: allow >= 40 byte git object IDs
Another step in slowly updating our code to support SHA-256 or whatever other hash algorithms git may support in the future.
2020-07-25 v2writable: move {autime} and {cotime} into $sync state
The V2Writable object may be long-lived, so it makes more sense to put the {autime} and {cotime} fields into the shorter-lived index_sync state.
2020-07-25 v2writable: index_sync: reduce fill_alternates calls
Instead of doing fill_alternates for every epoch we're indexing, just do it once at the start of index_sync invocation. This will set us up for using a single "git cat-file" process for indexing multiple epochs.
2020-07-25 v2writable: introduce idx_stack
This avoids pinning a potentially large chunk of memory from `git-log --reverse' into RAM (or triggering less predictable swap behavior). Instead it uses a contiguous temporary file with a fixed-size record for every blob we'll need to index.
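The idea of a file-backed stack of fixed-size records can be sketched as follows. This is a minimal Python illustration, not the IdxStack implementation: the `FileStack` name and the record layout (two 64-bit timestamps plus a 20-byte SHA-1 object ID) are assumptions chosen for the example.

```python
import struct
import tempfile

# Hypothetical record: author time, commit time, 20-byte binary object ID.
RECORD = struct.Struct(">qq20s")

class FileStack:
    """LIFO of fixed-size records kept in an anonymous temporary file."""

    def __init__(self):
        self.fh = tempfile.TemporaryFile()  # unlinked automatically on close
        self.count = 0

    def push(self, autime, cotime, oid_hex):
        self.fh.seek(self.count * RECORD.size)
        self.fh.write(RECORD.pack(autime, cotime, bytes.fromhex(oid_hex)))
        self.count += 1

    def pop(self):
        """Return the most recently pushed record, or None when empty."""
        if self.count == 0:
            return None
        self.count -= 1
        self.fh.seek(self.count * RECORD.size)
        autime, cotime, oid = RECORD.unpack(self.fh.read(RECORD.size))
        return autime, cotime, oid.hex()
```

Because every record has the same size, the Nth record is always at offset `N * RECORD.size`, so popping needs no index and memory use stays constant regardless of how many blobs are queued.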
2020-07-25 v2: index forwards (via `git log --reverse')
Since we'll need to expose THREADID to JMAP and IMAP users, index all messages in the order they were committed to ensure our `tid' (thread ID) column ascends in mirrors the same way it does in the source inbox. This drastically simplifies our code but increases memory usage of `git-log'. The next commit will bring memory use back down at the expense of $TMPDIR usage.
2020-07-25 index: support --rethread switch to fix old indices
Older versions of public-inbox < 1.3.0 had subtly different semantics around threading in some corner cases. This switch (when combined with --reindex) allows us to fix them by regenerating associations.
2020-07-17 v2writable: git_hash_raw: avoid $TMPDIR write
We can rely on FD_CLOEXEC being set by default (since Perl 5.6+) on pipes to avoid FS/page-cache traffic, here. We also know "git hash-object" won't output anything until it's consumed all of its standard input; so there's no danger of a deadlock even in the unlikely case git uses a hash that can't fit into PIPE_BUF :P
2020-07-17 search: simplify unindexing
Since over.sqlite3 seems here to stay, we no longer need to do Message-ID lookups against Xapian and can simply rely on the docid <=> NNTP article number equivalency SCHEMA_VERSION=15 gave us. This rids us of the closure-using batch_do sub in the v1 code path and vastly simplifies both v1 and v2 unindexing.
2020-07-17 with_umask: pass args to callback
While it makes the code flow slightly less well in some places, it saves us runtime allocations and indentation.
2020-07-17 v2: use v5.10.1, parent.pm, drop warnings
The "5.010_001" form was for Perl 5.6, which I doubt anybody would attempt; so favor "v5.10.1" as it is more readable to humans. Prefer "parent" to "base" since the former is lighter. We'll also rely on warnings from "-w" globally (or not) instead of via "use". We'll also update "use" statements to reflect what's actually used by V2Writable.
2020-06-30 watch: check for duplicates in ->over before spamcheck
It's cheaper to check for duplicates than run `spamc' repeatedly when rechecking. We already do this for v1 by using the "ls" command with fast-import, but v2 requires checking against over.sqlite3.
2020-06-25 lock: reduce inotify wakeups
We can reduce the amount of platform-specific code by always relying on IN_MODIFY/NOTE_WRITE notifications from lock release. This reduces the number of times our read-only daemons will need to wake up when -watch sees no-op message changes (e.g. replied, seen, recent flag changes).
2020-06-23 init: add --skip-artnum parameter
For archivists with only newer mail archives, this option allows reserving NNTP article numbers for yet-to-be-archived old messages. Indexers will need to be updated to support this feature in future commits. -V1 inboxes will now be initialized with SQLite and Xapian support if this option is used, or if --indexlevel= is specified.
2020-06-13 index: account for CRLF conversion when storing bytes
NNTP and IMAP both require CRLF conversions on the wire. They're also the only components which care about $smsg->{bytes}, so store the CRLF-adjusted value in over.sqlite3 and Xapian DBs. This will allow us to optimize RFC822.SIZE fetch item in IMAP without triggering size mismatch errors in some clients' default configurations (e.g. Mail::IMAPClient), but not most others. It could also fix hypothetical problems with NNTP clients that report discrepancies between overview and article data.
2020-06-03 smsg: introduce ->populate method
This will eventually replace the __hdr() calling methods and eradicate {mime} usage from Smsg. For now, we can eliminate PublicInbox::Smsg->new since most callers already rely on an open `bless' to avoid the old {mime} arg.
2020-06-03 v2writable: fix non-sensical interpolation in BUG message
No point in attempting to print the value of an undefined variable if there's a bug. Fortunately, (AFAIK) we've never hit that bug check :>
2020-05-25 v2writable: only load Xapian when a shard is found
We don't need to load Xapian until we have a directory which looks like a shard, otherwise we're wasting cycles on memory when running short-lived processes.
2020-05-19 favor readline() and print() as functions
In our inbox-writing code paths, ->getline as an OO method may be confused with the various definitions of `getline' used by the PSGI interface. It's also easier to do: "perldoc -f readline" than to figure out which class "->getline" belongs to (IO::Handle) and lookup documentation for that. ->print is less confusing than the "readline" vs "getline" mismatch, but we can still make it clear we're using a real file handle and not a mock interface. Finally, functions are a bit faster than their OO counterparts.