index: v2: --sequential-shard option

This gives better page cache utilization for Xapian indexing on slow storage by improving locality for random I/O activity on the Xapian DB.

Instead of indexing both SQLite and Xapian in a single pass, this indexes them separately. The first pass is identical to indexlevel=basic: it indexes both over.sqlite3 and msgmap.sqlite3.

Each subsequent pass operates on a single Xapian shard and only handles documents belonging to that shard. Given enough shards, each individual shard can be made small enough to fit into the kernel page cache and avoid HDD seeks for read activity.

In rough tests on a busy system with a 7200 RPM HDD and ext4, full indexing of LKML (9 epochs) goes from ~80 hours (-j0) to ~30 hours (-j8). The test machine had 16GB RAM, with 7 shards configured, fsync(2) disabled (--no-sync), and --batch-size=10m.
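
The pass structure described above can be sketched roughly as follows. This is an illustrative Python model only, not public-inbox's Perl implementation: the function name plan_passes, the pass labels, and the assignment of documents to shards by article number modulo shard count are all assumptions made for the example.

```python
from collections import defaultdict


def plan_passes(article_numbers, nshards):
    """Return the ordered list of indexing passes as (label, numbers) pairs.

    Hypothetical model: the first pass covers every article (SQLite only),
    then one pass per Xapian shard covers only that shard's articles.
    """
    nums = list(article_numbers)
    by_shard = defaultdict(list)
    for num in nums:
        # ASSUMPTION: shard membership is article number modulo shard count
        by_shard[num % nshards].append(num)

    # Pass 0: overview and article number mapping for all messages
    passes = [("sqlite", nums)]
    # Passes 1..nshards: one Xapian shard at a time, so only that shard's
    # files need to stay resident in the page cache during the pass
    for shard in range(nshards):
        passes.append((f"xapian-shard-{shard}", by_shard[shard]))
    return passes


if __name__ == "__main__":
    for label, numbers in plan_passes(range(1, 11), 3):
        print(label, numbers)
```

The point of the sketch is the ordering: every message is visited once for SQLite, then each shard is visited in isolation rather than all shards being written concurrently.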
author: Eric Wong <e@yhbt.net> 2020-08-07 01:14:04 +0000
committer: Eric Wong <e@yhbt.net> 2020-08-07 23:45:38 +0000
commit: 06a2418fd053c9a5b80217e74d1b47b8e1ca85e1 (patch)
tree: 37dc120e64b6f2114164a3e4d2d358373b1b1eb5 /Documentation
parent: 32f6a1f9498f759041b72d6f4d5cb959088a3dec (diff)
download: public-inbox-06a2418fd053c9a5b80217e74d1b47b8e1ca85e1.tar.gz
3 files changed, 66 insertions, 4 deletions
diff --git a/Documentation/public-inbox-config.pod b/Documentation/public-inbox-config.pod
index e6108c35..05b84819 100644
--- a/Documentation/public-inbox-config.pod
+++ b/Documentation/public-inbox-config.pod
@@ -139,6 +139,10 @@ allow for searching for phrases using quoted text.
  
  Default: C<full>
  
+=item publicinbox.<name>.indexSequentialShard
+
+See L<public-inbox-index(1)/publicinbox.indexSequentialShard>
+
  =item publicinbox.<name>.httpbackendmax
  
  If a digit, the maximum number of parallel
@@ -291,6 +295,8 @@ or /usr/share/cgit/
  See L<public-inbox-edit(1)>
  
  =item publicinbox.indexMaxSize
+=item publicinbox.indexBatchSize
+=item publicinbox.indexSequentialShard
  
  See L<public-inbox-index(1)>
  
diff --git a/Documentation/public-inbox-index.pod b/Documentation/public-inbox-index.pod
index aeb1b3a3..f525ba54 100644
--- a/Documentation/public-inbox-index.pod
+++ b/Documentation/public-inbox-index.pod
@@ -34,12 +34,16 @@ normal search functionality.
  
  =item --jobs=JOBS, -j
  
-Control the number of Xapian indexing jobs in a
+Influences the number of Xapian indexing shards in a
  (L<public-inbox-v2-format(5)>) inbox.
  
  C<--jobs=0> is accepted as of public-inbox 1.6.0 (PENDING)
  to disable parallel indexing.
  
+If the inbox has not been indexed, C<JOBS - 1> shards
+will be created (one job is always needed for indexing
+the overview and article number mapping).
+
  Default: the number of existing Xapian shards
  
  =item --compact / -c
@@ -120,6 +124,14 @@ and Xapian.  This is only effective with Xapian 1.4+.
  
  Available in public-inbox 1.6.0 (PENDING).
  
+=item --sequential-shard
+
+Sets or overrides L</publicinbox.indexSequentialShard> on a
+per-invocation basis.  See L</publicinbox.indexSequentialShard>
+below.
+
+Available in public-inbox 1.6.0 (PENDING).
+
  =back
  
  =head1 FILES
@@ -167,6 +179,45 @@ inbox with 3 shards will flush every 3 megabytes by default.
  
  Default: 1m (one megabyte)
  
+=item publicinbox.indexBatchSize
+
+Flushes changes to the filesystem and releases locks after
+indexing the given number of bytes.  The default value of C<1m>
+(one megabyte) is low to minimize memory use and reduce
+contention with parallel invocations of L<public-inbox-mda(1)>,
+L<public-inbox-learn(1)>, and L<public-inbox-watch(1)>.
+
+Increase this value on powerful systems to improve throughput at
+the expense of memory use.  The reduction of lock granularity
+may not be noticeable on fast systems.
+
+This option is available in public-inbox 1.6 or later.
+public-inbox 1.5 and earlier used the current default, C<1m>.
+
+For L<public-inbox-v2-format(5)> inboxes, this value is
+multiplied by the number of Xapian shards.  Thus a typical v2
+inbox with 3 shards will flush every 3 megabytes by default.
+
+Default: 1m (one megabyte)
+
+=item publicinbox.indexSequentialShard
+=item publicinbox.<inbox_name>.indexSequentialShard
+
+For L<public-inbox-v2-format(5)> inboxes, setting this to C<true>
+allows indexing Xapian shards in multiple passes.  This speeds up
+indexing on rotational storage with high seek latency by allowing
+individual shards to fit into the kernel page cache.
+
+Using a higher-than-normal number of C<--jobs> with
+L<public-inbox-init(1)> may be required to ensure individual
+shards are small enough to fit into cache.
+
+Available in public-inbox 1.6.0 (PENDING).
+
+This is ignored on L<public-inbox-v1-format(5)> inboxes.
+
+Default: false, shards are indexed in parallel
+
  =back
  
  =head1 ENVIRONMENT
diff --git a/Documentation/public-inbox-v2-format.pod b/Documentation/public-inbox-v2-format.pod
index 9e284a75..6876989c 100644
--- a/Documentation/public-inbox-v2-format.pod
+++ b/Documentation/public-inbox-v2-format.pod
@@ -113,9 +113,14 @@ improved with high-quality and high-quantity solid-state storage.
  Issuing TRIM commands with L<fstrim(8)> was necessary to maintain
  consistent performance while developing this feature.
  
-Rotational storage devices are NOT recommended for indexing of
-large mail archives; but are fine for backup and usable for
-small instances.
+Rotational storage devices perform significantly worse than
+solid state storage for indexing of large mail archives; but are
+fine for backup and usable for small instances.
+
+As of public-inbox 1.6.0, the C<--sequential-shard> option of
+L<public-inbox-index(1)> may be used with a high shard count
+to ensure individual shards fit into page cache when the entire
+Xapian DB cannot.
  
  Our use of the L</OVERVIEW DB> requires Xapian document IDs to
  remain stable.  Using L<public-inbox-compact(1)> and