From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-Status: No, score=-4.0 required=3.0 tests=ALL_TRUSTED,BAYES_00
	shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
	by dcvr.yhbt.net (Postfix) with ESMTP id DE5161F66F
	for ; Mon, 10 Aug 2020 02:12:05 +0000 (UTC)
From: Eric Wong
To: meta@public-inbox.org
Subject: [PATCH 03/14] doc: index: more notes about latest changes
Date: Mon, 10 Aug 2020 02:11:54 +0000
Message-Id: <20200810021205.18909-4-e@yhbt.net>
In-Reply-To: <20200810021205.18909-1-e@yhbt.net>
References: <20200810021205.18909-1-e@yhbt.net>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
List-Id:

With LKML on an HDD, a giant --batch-size of 500m ends up being
pretty useful. I was able to index LKML in ~16 hours on a
system that had other activity on it. The big downside was it
was eating up over 5g of RAM :x.

We'll also fix up a duplicated indexBatchSize section,
fix formatting around global vs per-inbox indexSequentialShard,
and ensure section 5 manpages are linked correctly.
---
 Documentation/public-inbox-index.pod | 62 +++++++++++++++-------------
 1 file changed, 33 insertions(+), 29 deletions(-)

diff --git a/Documentation/public-inbox-index.pod b/Documentation/public-inbox-index.pod
index 56dec993..3ae3b008 100644
--- a/Documentation/public-inbox-index.pod
+++ b/Documentation/public-inbox-index.pod
@@ -115,6 +115,11 @@ Sets or overrides L on a per-invocation basis.

 See L below.

+When using rotational storage but abundant RAM, using a large
+value (e.g. C<500m>) with C<--sequential-shard> can
+significantly speed up the initial index and full C<--reindex>
+invocations (but not incremental updates).
+
 Available in public-inbox 1.6.0 (PENDING).

 =item --no-fsync
@@ -136,11 +141,11 @@ Available in public-inbox 1.6.0 (PENDING).

 =head1 FILES

-For v1 (ssoma) repositories described in L.
+For v1 (ssoma) repositories described in L.
 All public-inbox-specific files are contained within the
 C<$GIT_DIR/public-inbox/> directory.

-v2 inboxes are described in L.
+v2 inboxes are described in L.

 =head1 CONFIGURATION

@@ -168,40 +173,25 @@ L, and L.

 Increase this value on powerful systems to improve throughput at
 the expense of memory use. The reduction of lock granularity
-may not be noticeable on fast systems.
-
-This option is available in public-inbox 1.6 or later.
-public-inbox 1.5 and earlier used the current default, C<1m>.
+may not be noticeable on fast systems. With SSDs, values above
+C<4m> have little benefit.

 For L inboxes, this value is
 multiplied by the number of Xapian shards. Thus a typical v2
-inbox with 3 shards will flush every 3 megabytes by default.
-
-Default: 1m (one megabyte)
+inbox with 3 shards will flush every 3 megabytes by default
+unless parallelism is disabled via C<--sequential-shard>
+or C<--jobs=0>.

-=item publicinbox.indexBatchSize
-
-Flushes changes to the filesystem and releases locks after
-indexing the given number of bytes. The default value of C<1m>
-(one megabyte) is low to minimize memory use and reduce
-contention with parallel invocations of L,
-L, and L.
-
-Increase this value on powerful systems to improve throughput at
-the expense of memory use. The reduction of lock granularity
-may not be noticeable on fast systems.
+This influences memory usage of Xapian, but it is not exact.
+The actual memory used by Xapian and Perl has been observed
+in excess of 10x this value.

 This option is available in public-inbox 1.6 or later.
 public-inbox 1.5 and earlier used the current default, C<1m>.

-For L inboxes, this value is
-multiplied by the number of Xapian shards. Thus a typical v2
-inbox with 3 shards will flush every 3 megabytes by default.
-
 Default: 1m (one megabyte)

 =item publicinbox.indexSequentialShard
-=item publicinbox.<name>.indexSequentialShard

 For L inboxes, setting this to C allows
 indexing Xapian shards in multiple passes. This speeds up
@@ -212,12 +202,23 @@ Using a higher-than-normal number of C<--jobs> with
 L may be required to ensure individual
 shards are small enough to fit into cache.

+Warning: interrupting C while this option
+is in use may leave the search indices out-of-date with respect
+to SQLite databases. WWW and IMAP users may notice incomplete
+search results, but it is otherwise non-fatal. Using C<--reindex>
+will bring everything back up-to-date.
+
 Available in public-inbox 1.6.0 (PENDING).

 This is ignored on L inboxes.

 Default: false, shards are indexed in parallel

+=item publicinbox.<name>.indexSequentialShard
+
+Identical to L,
+but only affects the inbox matching E<lt>nameE<gt>.
+
 =back

 =head1 ENVIRONMENT
@@ -235,10 +236,13 @@ disk. This environment is handled directly by Xapian, refer to
 Xapian API documentation for more details.

 For public-inbox 1.6 and later, use C
-instead. Setting C for a large C<--reindex>
-may cause L, L and
-L tasks to wait long periods of time
-during C<--reindex>.
+instead.
+
+Setting C or
+C for a large C<--reindex> may cause
+L, L and
+L tasks to wait long and unpredictable
+periods of time during C<--reindex>.

 Default: none, uses C