From 06a2418fd053c9a5b80217e74d1b47b8e1ca85e1 Mon Sep 17 00:00:00 2001
From: Eric Wong
Date: Fri, 7 Aug 2020 01:14:04 +0000
Subject: index: v2: --sequential-shard option

This gives better page cache utilization for Xapian indexing on
slow storage by improving locality for random I/O activity on the
Xapian DB.

Instead of doing a single pass to index both SQLite and Xapian,
this indexes them separately. The first pass is identical to
indexlevel=basic: it indexes both over.sqlite3 and msgmap.sqlite3.
Subsequent passes each operate on a single Xapian shard, indexing
only the documents belonging to that shard. Given enough shards,
each individual shard can be made small enough to fit into the
kernel page cache and avoid HDD seeks for read activity.

In rough tests on a busy system with a 7200 RPM HDD, ext4 and
16GB RAM, full indexing of LKML (9 epochs) goes from ~80 hours
(-j0) to ~30 hours (-j8) with 7 shards configured, fsync(2)
disabled (--no-sync) and `--batch-size=10m'.
---
 Documentation/public-inbox-index.pod | 53 +++++++++++++++++++++++++++++++++++-
 1 file changed, 52 insertions(+), 1 deletion(-)

diff --git a/Documentation/public-inbox-index.pod b/Documentation/public-inbox-index.pod
index aeb1b3a3..f525ba54 100644
--- a/Documentation/public-inbox-index.pod
+++ b/Documentation/public-inbox-index.pod
@@ -34,12 +34,16 @@ normal search functionality.
 
 =item --jobs=JOBS, -j
 
-Control the number of Xapian indexing jobs in a
+Influences the number of Xapian indexing shards in a
 (L) inbox.
 
 C<--jobs=0> is accepted as of public-inbox 1.6.0 (PENDING)
 to disable parallel indexing.
 
+If the inbox has not been indexed, C shards
+will be created (one job is always needed for indexing
+the overview and article number mapping).
+
 Default: the number of existing Xapian shards
 
 =item --compact / -c
@@ -120,6 +124,14 @@ and Xapian. This is only effective with Xapian 1.4+.
 
 Available in public-inbox 1.6.0 (PENDING).
 
+=item --sequential-shard
+
+Sets or overrides L on a
+per-invocation basis. See L
+below.
+
+Available in public-inbox 1.6.0 (PENDING).
+
 =back
 
 =head1 FILES
@@ -167,6 +179,45 @@ inbox with 3 shards will flush every 3 megabytes by default.
 
 Default: 1m (one megabyte)
 
+=item publicinbox.indexBatchSize
+
+Flushes changes to the filesystem and releases locks after
+indexing the given number of bytes. The default value of C<1m>
+(one megabyte) is low to minimize memory use and reduce
+contention with parallel invocations of L,
+L, and L.
+
+Increase this value on powerful systems to improve throughput at
+the expense of memory use. The reduction of lock granularity
+may not be noticeable on fast systems.
+
+This option is available in public-inbox 1.6 or later.
+public-inbox 1.5 and earlier used the current default, C<1m>.
+
+For L inboxes, this value is
+multiplied by the number of Xapian shards. Thus a typical v2
+inbox with 3 shards will flush every 3 megabytes by default.
+
+Default: 1m (one megabyte)
+
+=item publicinbox.indexSequentialShard
+=item publicinbox..indexSequentialShard
+
+For L inboxes, setting this to C
+allows indexing Xapian shards in multiple passes. This speeds up
+indexing on rotational storage with high seek latency by allowing
+individual shards to fit into the kernel page cache.
+
+Using a higher-than-normal number of C<--jobs> with
+L may be required to ensure individual
+shards are small enough to fit into cache.
+
+Available in public-inbox 1.6.0 (PENDING).
+
+This is ignored on L inboxes.
+
+Default: false, shards are indexed in parallel
+
 =back
 
 =head1 ENVIRONMENT
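
Illustrative note, not part of the patch: the multi-pass order described
in the commit message and the batch-size arithmetic from the
publicinbox.indexBatchSize text can be sketched roughly as below. This
is a minimal Python sketch, not public-inbox code. The function names
are hypothetical, and the rule that an article lands in shard
"number modulo shard count" is an assumption made for illustration only.

```python
from collections import defaultdict


def plan_passes(article_numbers, nshards):
    """Return the ordered list of (pass_name, articles) for a sequential run.

    ASSUMPTION: an article belongs to shard `num % nshards`. The patch text
    does not state the assignment rule; this is only for illustration.
    """
    article_numbers = list(article_numbers)
    by_shard = defaultdict(list)
    for num in article_numbers:
        by_shard[num % nshards].append(num)

    # First pass: SQLite only (over.sqlite3 and msgmap.sqlite3), every article.
    passes = [("sqlite", article_numbers)]
    # Later passes: one Xapian shard at a time, so only that shard's
    # working set has to stay in the page cache.
    for shard in range(nshards):
        passes.append((f"xapian-shard-{shard}", by_shard[shard]))
    return passes


def effective_batch_bytes(batch_size, nshards):
    """Per the documentation text, the batch size is multiplied by the shard count."""
    return batch_size * nshards


if __name__ == "__main__":
    for name, nums in plan_passes(range(1, 11), 3):
        print(name, nums)
    print(effective_batch_bytes(1024 * 1024, 3))
```

With 3 shards and the default 1m batch size, effective_batch_bytes
returns 3 megabytes, matching the "flush every 3 megabytes" statement
in the documentation text.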