about summary refs log tree commit homepage
path: root/Documentation/public-inbox-index.pod
diff options
authorEric Wong <e@yhbt.net>2020-08-07 01:14:04 +0000
committerEric Wong <e@yhbt.net>2020-08-07 23:45:38 +0000
commit06a2418fd053c9a5b80217e74d1b47b8e1ca85e1 (patch)
tree37dc120e64b6f2114164a3e4d2d358373b1b1eb5 /Documentation/public-inbox-index.pod
parent32f6a1f9498f759041b72d6f4d5cb959088a3dec (diff)
This gives better page cache utilization for Xapian indexing on
slow storage by improving locality for random I/O activity on
the Xapian DB.

Instead of doing a single-pass to index both SQLite and Xapian;
this indexes them separately.  The first pass is identical to
indexlevel=basic: it indexes both over.sqlite3 and msgmap.sqlite3.

Subsequent passes only operate on a single Xapian shard for
documents belonging to that shard.  Given enough shards, each
individual shard can be made small enough to fit into the kernel
page cache and avoid HDD seeks for read activity.

Doing rough tests with a busy system with a 7200 RPM HDD with ext4,
full indexing of LKML (9 epochs) goes from ~80 hours (-j0) to
~30 hours (-j8) with 16GB RAM with 7 shards configured and fsync(2)
disabled (--no-sync) and `--batch-size=10m'.
Diffstat (limited to 'Documentation/public-inbox-index.pod')
1 files changed, 52 insertions, 1 deletions
diff --git a/Documentation/public-inbox-index.pod b/Documentation/public-inbox-index.pod
index aeb1b3a3..f525ba54 100644
--- a/Documentation/public-inbox-index.pod
+++ b/Documentation/public-inbox-index.pod
@@ -34,12 +34,16 @@ normal search functionality.
 =item --jobs=JOBS, -j
-Control the number of Xapian indexing jobs in a
+Influences the number of Xapian indexing shards in a
 (L<public-inbox-v2-format(5)>) inbox.
 C<--jobs=0> is accepted as of public-inbox 1.6.0 (PENDING)
 to disable parallel indexing.
+If the inbox has not been indexed, C<JOBS - 1> shards
+will be created (one job is always needed for indexing
+the overview and article number mapping).
 Default: the number of existing Xapian shards
 =item --compact / -c
@@ -120,6 +124,14 @@ and Xapian.  This is only effective with Xapian 1.4+.
 Available in public-inbox 1.6.0 (PENDING).
+=item --sequential-shard
+Sets or overrides L</publicinbox.indexSequentialShard> on a
+per-invocation basis.  See L</publicinbox.indexSequentialShard>
+Available in public-inbox 1.6.0 (PENDING).
 =head1 FILES
@@ -167,6 +179,45 @@ inbox with 3 shards will flush every 3 megabytes by default.
 Default: 1m (one megabyte)
+=item publicinbox.indexBatchSize
+Flushes changes to the filesystem and releases locks after
+indexing the given number of bytes.  The default value of C<1m>
+(one megabyte) is low to minimize memory use and reduce
+contention with parallel invocations of L<public-inbox-mda(1)>,
+L<public-inbox-learn(1)>, and L<public-inbox-watch(1)>.
+Increase this value on powerful systems to improve throughput at
+the expense of memory use.  The reduction of lock granularity
+may not be noticeable on fast systems.
+This option is available in public-inbox 1.6 or later.
+public-inbox 1.5 and earlier used the current default, C<1m>.
+For L<public-inbox-v2-format(5)> inboxes, this value is
+multiplied by the number of Xapian shards.  Thus a typical v2
+inbox with 3 shards will flush every 3 megabytes by default.
+Default: 1m (one megabyte)
+=item publicinbox.indexSequentialShard
+=item publicinbox.<inbox_name>.indexSequentialShard
+For L<public-inbox-v2-format(5)> inboxes, setting this to C<true>
+allows indexing Xapian shards in multiple passes.  This speeds up
+indexing on rotational storage with high seek latency by allowing
+individual shards to fit into the kernel page cache.
+Using a higher-than-normal number of C<--jobs> with
+L<public-inbox-init(1)> may be required to ensure individual
+shards are small enough to fit into cache.
+Available in public-inbox 1.6.0 (PENDING).
+This is ignored on L<public-inbox-v1-format(5)> inboxes.
+Default: false, shards are indexed in parallel