path: root/lib/PublicInbox/V2Writable.pm
Date / Commit message
2020-08-07 index: support --xapian-only switch
This is useful for speeding up indexing runs when only Xapian rules change but SQLite indexing doesn't change. This mostly implies `--reindex', but does NOT pick up new messages (because SQLite indexing needs to occur for that). I'm leaving this undocumented in the manpage for now since it's mainly to speed up development and testing. Users upgrading to 1.6.0 will be advised to `--reindex --rethread', anyways, due to the threading improvements since 1.1.0-pre1. It may make sense to document for 1.7+ when there are Xapian-only indexing changes, though.
2020-08-07 index: v2: --sequential-shard option
This gives better page cache utilization for Xapian indexing on slow storage by improving locality for random I/O activity on the Xapian DB. Instead of doing a single pass to index both SQLite and Xapian, this indexes them separately. The first pass is identical to indexlevel=basic: it indexes both over.sqlite3 and msgmap.sqlite3. Subsequent passes only operate on a single Xapian shard for documents belonging to that shard. Given enough shards, each individual shard can be made small enough to fit into the kernel page cache and avoid HDD seeks for read activity. Doing rough tests with a busy system with a 7200 RPM HDD with ext4, full indexing of LKML (9 epochs) goes from ~80 hours (-j0) to ~30 hours (-j8) with 16GB RAM with 7 shards configured and fsync(2) disabled (--no-sync) and `--batch-size=10m'.
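The pass ordering described above can be sketched roughly as follows. This is only an illustration in Python, not the Perl code from the commit; the function name is hypothetical, and assigning documents to shards by `docid % nshards` is an assumption that the commit message does not state.

```python
from collections import defaultdict

def sequential_shard_passes(docids, nshards):
    """Return the work order as a list of (pass_name, docids) tuples."""
    docids = list(docids)
    # Pass 1: SQLite only (over.sqlite3 + msgmap.sqlite3), like indexlevel=basic.
    passes = [("sqlite", docids)]
    # Group documents by shard; modulo assignment is an assumption here.
    by_shard = defaultdict(list)
    for docid in docids:
        by_shard[docid % nshards].append(docid)
    # Later passes: one Xapian shard at a time, to keep random I/O local.
    for shard in range(nshards):
        passes.append((f"xapian-shard-{shard}", by_shard[shard]))
    return passes
```

Each later pass touches only one shard's documents, which is what lets a small enough shard stay resident in the page cache.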
2020-08-07 v2writable: fix rethread cleanup
We need to drop old ghosts properly while inside the transaction, otherwise it becomes a no-op. This isn't a big deal, as it only results in a few dangling DB rows and a small amount of wasted space.
2020-08-02 remove unnecessary ->header_obj calls
We used ->header_obj in the past as an optimization with Email::MIME. That optimization is no longer necessary with PublicInbox::Eml. This doesn't make any functional difference even if we were to go back to Email::MIME. However, it reduces the amount of code we have and slightly reduces allocations with PublicInbox::Eml.
2020-08-01 improve error handling on import fork / lock failures
v?fork failures seem to be the cause of locks not getting released in -watch. Ensure lock release doesn't get skipped in ->done for both v1 and v2 inboxes. We also need to do everything we can to ensure DB handles, pipes and processes get released even in the face of failure. While we're at it, make failures around `git update-server-info' non-fatal, since smart HTTP seems more popular anyways.
v2 changes:
- spawn: show failing command
- ensure waitpid is synchronous for inotify events
- teardown all fast-import processes on exception, not just the failing one
- beef up lock_release error handling
- release lock on fast-import spawn failure
2020-07-29 v2writable: use {inboxdir} for msgmap->tmp_clone
Otherwise, a user is more likely to remove the msgmap-XXXXXXXX SQLite file from $TMPDIR and cause SQLite to error out.
2020-07-29 v2writable: support async git blob retrievals
This seems to speed up --reindex on smallish v2 inboxes by about 30% on both HDD and SSD. lore/git (~1GB) on an SSD even gives a 30% improvement with 3 shards. I'm only seeing a ~4% speedup on LKML with a SATA SSD (which is difficult to repeat because it takes around 4 hours). Testing LKML on an HDD will take much more time...
2020-07-25 v2writable: {unindexed} belongs in $sync state
There's no reason for {unindexed} to persist beyond an ->index_sync call.
2020-07-25 v2writable: share log2stack code with v1
Another step in making v1 and v2 more similar.
2020-07-25 index+xcpdb: support --no-sync flag
This allows us to speed up indexing operations to SQLite and Xapian. Unfortunately, it doesn't affect operations using `xapian-compact' and the compactor API, since that doesn't seem to support Xapian::DB_NO_SYNC, yet.
2020-07-25 v2writable: clarify "epoch" comment
2020-07-25 v2writable: get rid of {reindex_pipe} field
Since normal per-epoch indexing no longer holds a "git log" process open, we don't need to worry about not sharing the pipe with forked shards when we restart the indexer. While we're in the area, better describe what `unindex' does, since it's a rarely-used but necessary code path.
2020-07-25 v2writable: use read-only PublicInbox::Git for cat_file
We can reduce the number of parameters we pass around on stack and make our read-write and read-only code paths more uniform.
2020-07-25 use consistent {ibx} field for writable code paths
This is a step which makes our use of abbreviations more consistent when referring to PublicInbox::Inbox objects. We'll also be reducing the number of redundant fields in SearchIdx and V2Writable code paths to make the object graph easier-to-follow.
2020-07-25 v2writable: drop "EPOCH.git indexing $RANGE" progress
It'll be one continuous range with IdxStack.
2020-07-25 v2writable: allow >= 40 byte git object IDs
Another step in slowly updating our code to support SHA-256 or whatever other hash algorithms git may support in the future.
2020-07-25 v2writable: move {autime} and {cotime} into $sync state
The V2Writable object may be long-lived, so it makes more sense to put the {autime} and {cotime} fields into the shorter-lived index_sync state.
2020-07-25 v2writable: index_sync: reduce fill_alternates calls
Instead of doing fill_alternates for every epoch we're indexing, just do it once at the start of index_sync invocation. This will set us up for using a single "git cat-file" process for indexing multiple epochs.
2020-07-25 v2writable: introduce idx_stack
This avoids pinning a potentially large chunk of memory from `git-log --reverse' into RAM (or triggering less predictable swap behavior). Instead it uses a contiguous temporary file with a fixed-size record for every blob we'll need to index.
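A file-backed stack with fixed-size records might look like the sketch below. It is written in Python for illustration and is not the project's Perl implementation; the class name, the record layout (two 64-bit timestamps plus a 20-byte raw object ID), and the idea of pushing newest-first and reading back oldest-first are all assumptions.

```python
import struct
import tempfile

# Hypothetical layout: author time, commit time, 20-byte raw object ID.
RECORD = struct.Struct(">qq20s")

class IdxStack:
    """LIFO stack kept in a temporary file instead of RAM."""

    def __init__(self):
        self.fh = tempfile.TemporaryFile()  # binary read/write, removed on close
        self.count = 0

    def push(self, autime, cotime, oid_hex):
        # Appends are sequential, so the file stays contiguous.
        self.fh.write(RECORD.pack(autime, cotime, bytes.fromhex(oid_hex)))
        self.count += 1

    def pop_all(self):
        """Yield (autime, cotime, oid_hex) in reverse push order."""
        for i in range(self.count - 1, -1, -1):
            # Fixed-size records make the offset of record i trivial.
            self.fh.seek(i * RECORD.size)
            autime, cotime, oid = RECORD.unpack(self.fh.read(RECORD.size))
            yield autime, cotime, oid.hex()
```

Because every record has the same size, walking the file backwards needs no index and only one record is held in memory at a time.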
2020-07-25 v2: index forwards (via `git log --reverse')
Since we'll need to expose THREADID to JMAP and IMAP users, index all messages in the order they were committed to ensure our `tid' (thread ID) column ascends in mirrors the same way it does in the source inbox. This drastically simplifies our code but increases memory usage of `git-log'. The next commit will bring memory use back down at the expense of $TMPDIR usage.
2020-07-25 index: support --rethread switch to fix old indices
Older versions of public-inbox < 1.3.0 had subtly different semantics around threading in some corner cases. This switch (when combined with --reindex) allows us to fix them by regenerating associations.
2020-07-17 v2writable: git_hash_raw: avoid $TMPDIR write
We can rely on FD_CLOEXEC being set by default (since Perl 5.6+) on pipes to avoid FS/page-cache traffic, here. We also know "git hash-object" won't output anything until it's consumed all of its standard input; so there's no danger of a deadlock even in the unlikely case git uses a hash that can't fit into PIPE_BUF :P
2020-07-17 search: simplify unindexing
Since over.sqlite3 seems here to stay, we no longer need to do Message-ID lookups against Xapian and can simply rely on the docid <=> NNTP article number equivalency SCHEMA_VERSION=15 gave us. This rids us of the closure-using batch_do sub in the v1 code path and vastly simplifies both v1 and v2 unindexing.
2020-07-17 with_umask: pass args to callback
While it makes the code flow slightly less well in some places, it saves us runtime allocations and indentation.
2020-07-17 v2: use v5.10.1, parent.pm, drop warnings
The "5.010_001" form was for Perl 5.6, which I doubt anybody would attempt; so favor "v5.10.1" as it is more readable to humans. Prefer "parent" to "base" since the former is lighter. We'll also rely on warnings from "-w" globally (or not) instead of via "use". We'll also update "use" statements to reflect what's actually used by V2Writable.
2020-06-30 watch: check for duplicates in ->over before spamcheck
It's cheaper to check for duplicates than run `spamc' repeatedly when rechecking. We already do this for v1 by using the "ls" command with fast-import, but v2 requires checking against over.sqlite3.
2020-06-25 lock: reduce inotify wakeups
We can reduce the amount of platform-specific code by always relying on IN_MODIFY/NOTE_WRITE notifications from lock release. This reduces the number of times our read-only daemons will need to wake up when -watch sees no-op message changes (e.g. replied, seen, recent flag changes).
2020-06-23 init: add --skip-artnum parameter
For archivists with only newer mail archives, this option allows reserving NNTP article numbers for yet-to-be-archived old messages. Indexers will need to be updated to support this feature in future commits. -V1 inboxes will now be initialized with SQLite and Xapian support if this option is used, or if --indexlevel= is specified.
2020-06-13 index: account for CRLF conversion when storing bytes
NNTP and IMAP both require CRLF conversions on the wire. They're also the only components which care about $smsg->{bytes}, so store the CRLF-adjusted value in over.sqlite3 and Xapian DBs. This will allow us to optimize RFC822.SIZE fetch item in IMAP without triggering size mismatch errors in some clients' default configurations (e.g. Mail::IMAPClient), but not most others. It could also fix hypothetical problems with NNTP clients that report discrepancies between overview and article data.
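One way to compute a CRLF-adjusted size without actually rewriting the message is to count the bare LF line endings and add one byte for each. The Python sketch below illustrates that idea; the function name is hypothetical and the commit does not say the Perl code computes it this way.

```python
import re

def crlf_adjusted_size(raw: bytes) -> int:
    """Size of `raw` once every bare LF has been turned into CRLF."""
    # Only LFs not already preceded by CR gain an extra byte on the wire.
    bare_lf = len(re.findall(rb"(?<!\r)\n", raw))
    return len(raw) + bare_lf
```

For example, a two-line message stored with LF endings (`b"a\nb\n"`, 4 bytes) would be reported as 6 bytes, matching what an NNTP or IMAP client receives.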
2020-06-03 smsg: introduce ->populate method
This will eventually replace the __hdr() calling methods and eradicate {mime} usage from Smsg. For now, we can eliminate PublicInbox::Smsg->new since most callers already rely on an open `bless' to avoid the old {mime} arg.
2020-06-03 v2writable: fix non-sensical interpolation in BUG message
No point in attempting to print the value of an undefined variable if there's a bug. Fortunately, (AFAIK) we've never hit that bug check :>
2020-05-25 v2writable: only load Xapian when a shard is found
We don't need to load Xapian until we have a directory which looks like a shard, otherwise we're wasting cycles and memory when running short-lived processes.
2020-05-19 favor readline() and print() as functions
In our inbox-writing code paths, ->getline as an OO method may be confused with the various definitions of `getline' used by the PSGI interface. It's also easier to do: "perldoc -f readline" than to figure out which class "->getline" belongs to (IO::Handle) and lookup documentation for that. ->print is less confusing than the "readline" vs "getline" mismatch, but we can still make it clear we're using a real file handle and not a mock interface. Finally, functions are a bit faster than their OO counterparts.
2020-05-18 index: add --batch-size=SIZE option
On powerful systems, having this option is preferable to XAPIAN_FLUSH_THRESHOLD due to lock granularity and contention with other processes (-learn, -mda, -watch). Setting XAPIAN_FLUSH_THRESHOLD can cause -learn, -mda, and -watch to get stuck until an epoch is completely processed.
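The lock-granularity argument can be pictured with a size-based batching loop: the writer commits (and gives other processes a chance at the lock) whenever the accumulated message bytes reach the batch size, rather than holding on until a whole epoch is done. This is a hedged Python sketch; `BatchedIndexer` and its `commit` callback are hypothetical names, and measuring the batch in raw message bytes is an assumption.

```python
class BatchedIndexer:
    """Commit whenever the bytes indexed since the last commit reach batch_size."""

    def __init__(self, batch_size, commit):
        self.batch_size = batch_size
        self.commit = commit      # hypothetical callback: commit and release lock
        self.pending = 0          # bytes indexed since the last commit

    def add(self, raw: bytes):
        self.pending += len(raw)
        if self.pending >= self.batch_size:
            self.flush()

    def flush(self):
        # Skip empty commits so the lock is not cycled needlessly.
        if self.pending:
            self.commit(self.pending)
            self.pending = 0
```

A caller would invoke `flush()` once more at the end of the run to commit any remainder smaller than the batch size.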
2020-05-12 rename "ContentId" to "ContentHash"
The old name may be confused with "Content-ID" as described in RFC 2392, so use an alternate name to avoid confusing future readers.
2020-05-09 replace most uses of PublicInbox::MIME with Eml
PublicInbox::Eml has enough functionality to replace the Email::MIME-based PublicInbox::MIME.
2020-04-21 index: support --max-size / publicinbox.indexMaxSize
In normal mail paths, we can rely on MTAs being configured with reasonable limits in the -watch and -mda mail injection paths. However, the MTA is bypassed in a git-only delivery path; a BOFH could inject a large message and DoS users attempting to mirror a public-inbox. This doesn't protect unindexed WWW interfaces from Email::MIME memory explosions on v1 inboxes. Probably nobody cares about unindexed WWW interfaces anymore, especially now that Xapian is optional for indexing.
2020-04-20 v2writable: drop SQLite-based multi_mid_q_new
We switched to the SDBM-based queue to store author/committer info last month. Fixes: c7acdfe78bda5bf3 ("v2: SDBM-based multi Message-ID queue")
2020-04-20 import: init_bare: allow use as method, use in tests
Allowing ->init_bare to be used as a method saves some keystrokes, and we can save a little bit of time on systems with our vfork(2)-enabled spawn(). This also sets us up for future improvements where we can avoid spawning a process at all.
2020-03-22 v2: SDBM-based multi Message-ID queue
This lets us store author and committer times for deferred indexing messages with ambiguous Message-IDs. This allows us to reproducibly reindex messages with the git commit and author times when a rare message lacks Received and/or Date headers while having ambiguous Message-IDs.
2020-03-22 *idx: pass smsg in even more places
We can finally get rid of the awkward, ad-hoc use of V2Writable, SearchIdx, and OverIdx args for passing {cotime} and {autime} between classes. We'll still use those git time fields internally within V2Writable and SearchIdx for (re)indexing, but that's not worth avoiding as a fallback.
2020-03-22 v2: pass smsg in more places
We can pass fewer order-dependent args to V2Writable::do_idx and SearchIdxShard::index_raw by passing the smsg object, instead.
2020-03-22 *idx: pass $smsg in more places instead of many args
We can pass blessed PublicInbox::Smsg objects to internal indexing APIs instead of having long parameter lists in some places. The end goal is to avoid parsing redundant information each step of the way and hopefully make things more understandable.
2020-03-22 v2writable: preserve timestamps from import
While v2 indexing is triggered immediately after writing the commit to the git repository, there may be a gap between when PublicInbox::Import generates a timestamp and when PublicInbox::SearchIdx sees the message. So follow the mirror indexing behavior and take the to-be-indexed (time|date)stamps directly from the git commit.
2020-03-22 index: use git commit times on missing Date/Received
When indexing messages without Date: and/or Received: headers, fall back to using timestamps originally recorded by git in the commit object. This allows git mirrors to preserve the import datestamp and timestamp of a message according to what was fed into git, instead of blindly falling back to the current time.
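The fallback order can be illustrated with a small Python sketch. It is a simplification under stated assumptions: it only looks at the Date: header (the commit also mentions Received:), the function name is hypothetical, and `commit_time` stands in for whatever timestamp was read from the git commit object.

```python
from email import message_from_bytes
from email.utils import parsedate_to_datetime

def msg_timestamp(raw: bytes, commit_time: int) -> int:
    """Prefer the Date: header; fall back to the git commit time."""
    date = message_from_bytes(raw).get("Date")
    if date:
        try:
            return int(parsedate_to_datetime(str(date)).timestamp())
        except (TypeError, ValueError):
            pass  # unparsable Date: header, treat it as missing
    # Reproducible fallback: the same commit always yields the same value,
    # unlike falling back to the current time.
    return commit_time
```

The point of the fallback is reproducibility: reindexing a mirror yields the same timestamps as the original import.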
2020-02-24 v2writable: lookup_content => content_exists
It only needs to return a boolean, since none of the current callers care about the return value. Thus avoid a hash table assignment and use of `$smsg->{mime}', here.
2020-02-24 v2writable: make remove return-compatible w/ Import::remove
Import::remove is a documented interface, and the return value of the V2Writable work-alike should try to be compatible with what Import implements.
2020-02-06 treewide: run update-copyrights from gnulib for 2019
I didn't wait until September to do it, this year!
2020-02-02 v2writable: more ways to detect online CPU count
OpenBSD and FreeBSD support `getconf NPROCESSORS_ONLN` (no leading underscore). They may also have GNU nproc installed as "gnproc". We may also encounter Linux systems w/o GNU coreutils, but able to use `getconf _NPROCESSORS_ONLN` (with leading underscore).
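Probing the commands named above in turn could look like the following Python sketch. The probe order, the function name, and the fallback value of 1 are assumptions made for illustration; only the command names themselves come from the commit message.

```python
import subprocess

# Commands mentioned in the commit message; the order here is an assumption.
CANDIDATES = [
    ["nproc"],
    ["gnproc"],
    ["getconf", "_NPROCESSORS_ONLN"],
    ["getconf", "NPROCESSORS_ONLN"],
]

def detect_nproc(candidates=CANDIDATES, default=1):
    """Return the first positive CPU count reported by any candidate command."""
    for cmd in candidates:
        try:
            out = subprocess.run(cmd, capture_output=True, text=True,
                                 timeout=5).stdout
            n = int(out.strip())
        except (OSError, ValueError, subprocess.SubprocessError):
            continue  # command missing, timed out, or printed a non-number
        if n > 0:
            return n
    return default
```

Missing commands raise `OSError` and unsupported `getconf` variables print nothing usable, so each failure simply moves on to the next candidate.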
2020-02-02 v2writable: do not clobber {shards} or {parallel} if unset
The $jobs parameter in `public-inbox-convert' is passed to V2Writable->init_inbox as `undef' by default, causing parallelization to be disabled. Instead, leave the underlying {parallel} flag untouched if $shards is undef and do not clobber the default shard count. This allows us to take advantage of multicore systems when running public-inbox-convert with no command-line switches.