user/dev discussion of public-inbox itself
 help / color / mirror / code / Atom feed
From: Eric Wong <e@yhbt.net>
To: meta@public-inbox.org
Subject: [PATCH 03/14] doc: index: more notes about latest changes
Date: Mon, 10 Aug 2020 02:11:54 +0000	[thread overview]
Message-ID: <20200810021205.18909-4-e@yhbt.net> (raw)
In-Reply-To: <20200810021205.18909-1-e@yhbt.net>

With LKML on an HDD, a giant --batch-size of 500m ends up being
pretty useful.  I was able to index LKML in ~16 hours on a
system that had other activity on it.  The big downside was it
was eating up over 5g of RAM :x.

We'll also fix up a duplicated indexBatchSize section, fix
formatting around global vs per-inbox indexSequentialShard,
and ensure section 5 manpages are linked correctly.
---
 Documentation/public-inbox-index.pod | 62 +++++++++++++++-------------
 1 file changed, 33 insertions(+), 29 deletions(-)

diff --git a/Documentation/public-inbox-index.pod b/Documentation/public-inbox-index.pod
index 56dec993..3ae3b008 100644
--- a/Documentation/public-inbox-index.pod
+++ b/Documentation/public-inbox-index.pod
@@ -115,6 +115,11 @@ Sets or overrides L</publicinbox.indexBatchSize> on a
 per-invocation basis.  See L</publicinbox.indexBatchSize>
 below.
 
+When using rotational storage but abundant RAM, using a large
+value (e.g. C<500m>) with C<--sequential-shard> can
+significantly speed up the initial index and full C<--reindex>
+invocations (but not incremental updates).
+
 Available in public-inbox 1.6.0 (PENDING).
 
 =item --no-fsync
@@ -136,11 +141,11 @@ Available in public-inbox 1.6.0 (PENDING).
 
 =head1 FILES
 
-For v1 (ssoma) repositories described in L<public-inbox-v1-format>.
+For v1 (ssoma) repositories described in L<public-inbox-v1-format(5)>.
 All public-inbox-specific files are contained within the
 C<$GIT_DIR/public-inbox/> directory.
 
-v2 inboxes are described in L<public-inbox-v2-format>.
+v2 inboxes are described in L<public-inbox-v2-format(5)>.
 
 =head1 CONFIGURATION
 
@@ -168,40 +173,25 @@ L<public-inbox-learn(1)>, and L<public-inbox-watch(1)>.
 
 Increase this value on powerful systems to improve throughput at
 the expense of memory use.  The reduction of lock granularity
-may not be noticeable on fast systems.
-
-This option is available in public-inbox 1.6 or later.
-public-inbox 1.5 and earlier used the current default, C<1m>.
+may not be noticeable on fast systems.  With SSDs, values above
+C<4m> have little benefit.
 
 For L<public-inbox-v2-format(5)> inboxes, this value is
 multiplied by the number of Xapian shards.  Thus a typical v2
-inbox with 3 shards will flush every 3 megabytes by default.
-
-Default: 1m (one megabyte)
+inbox with 3 shards will flush every 3 megabytes by default
+when unless parallelism is disabled via C<--sequential-shard>
+or C<--jobs=0>.
 
-=item publicinbox.indexBatchSize
-
-Flushes changes to the filesystem and releases locks after
-indexing the given number of bytes.  The default value of C<1m>
-(one megabyte) is low to minimize memory use and reduce
-contention with parallel invocations of L<public-inbox-mda(1)>,
-L<public-inbox-learn(1)>, and L<public-inbox-watch(1)>.
-
-Increase this value on powerful systems to improve throughput at
-the expense of memory use.  The reduction of lock granularity
-may not be noticeable on fast systems.
+This influences memory usage of Xapian, but it is not exact.
+The actual memory used by Xapian and Perl has been observed
+in excess of 10x this value.
 
 This option is available in public-inbox 1.6 or later.
 public-inbox 1.5 and earlier used the current default, C<1m>.
 
-For L<public-inbox-v2-format(5)> inboxes, this value is
-multiplied by the number of Xapian shards.  Thus a typical v2
-inbox with 3 shards will flush every 3 megabytes by default.
-
 Default: 1m (one megabyte)
 
 =item publicinbox.indexSequentialShard
-=item publicinbox.<inbox_name>.indexSequentialShard
 
 For L<public-inbox-v2-format(5)> inboxes, setting this to C<true>
 allows indexing Xapian shards in multiple passes.  This speeds up
@@ -212,12 +202,23 @@ Using a higher-than-normal number of C<--jobs> with
 L<public-inbox-init(1)> may be required to ensure individual
 shards are small enough to fit into cache.
 
+Warning: interrupting C<public-inbox-index(1)> while this option
+is in use may leave the search indices out-of-date with respect
+to SQLite databases.  WWW and IMAP users may notice incomplete
+search results, but it is otherwise non-fatal.  Using C<--reindex>
+will bring everything back up-to-date.
+
 Available in public-inbox 1.6.0 (PENDING).
 
 This is ignored on L<public-inbox-v1-format(5)> inboxes.
 
 Default: false, shards are indexed in parallel
 
+=item publicinbox.<name>.indexSequentialShard
+
+Identical to L</publicinbox.indexSequentialShard>,
+but only affect the inbox matching E<lt>nameE<gt>.
+
 =back
 
 =head1 ENVIRONMENT
@@ -235,10 +236,13 @@ disk.  This environment is handled directly by Xapian, refer to
 Xapian API documentation for more details.
 
 For public-inbox 1.6 and later, use C<publicinbox.indexBatchSize>
-instead.  Setting C<XAPIAN_FLUSH_THRESHOLD> for a large C<--reindex>
-may cause L<public-inbox-mda(1)>, L<public-inbox-learn(1)> and
-L<public-inbox-watch(1)> tasks to wait long periods of time
-during C<--reindex>.
+instead.
+
+Setting C<XAPIAN_FLUSH_THRESHOLD> or
+C<publicinbox.indexBatchSize> for a large C<--reindex> may cause
+L<public-inbox-mda(1)>, L<public-inbox-learn(1)> and
+L<public-inbox-watch(1)> tasks to wait long and unpredictable
+periods of time during C<--reindex>.
 
 Default: none, uses C<publicinbox.indexBatchSize>
 

  parent reply	other threads:[~2020-08-10  2:12 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-08-10  2:11 [PATCH 00/14] more indexing related improvements Eric Wong
2020-08-10  2:11 ` [PATCH 01/14] index: require --reindex when using --xapian-only Eric Wong
2020-08-10  2:11 ` [PATCH 02/14] index: --sequential-shard works incrementally Eric Wong
2020-08-10  2:11 ` Eric Wong [this message]
2020-08-10  2:38   ` [PATCH 03/14] doc: index: more notes about latest changes Kyle Meyer
2020-08-10  6:29     ` Eric Wong
2020-08-10  2:11 ` [PATCH 04/14] doc: add some notes around -xcpdb / -edit / -purge Eric Wong
2020-08-10  2:11 ` [PATCH 05/14] index+xcpdb: improve SIG{INT,TERM,HUP,PIPE} behavior Eric Wong
2020-08-10  2:11 ` [PATCH 06/14] msgmap: tmp_clone: simplify + meaningful filename Eric Wong
2020-08-10  2:11 ` [PATCH 07/14] avoid File::Temp::tempfile in more places Eric Wong
2020-08-10  2:11 ` [PATCH 08/14] admin: use a generic veriable name Eric Wong
2020-08-10  2:38   ` Kyle Meyer
2020-08-10  2:12 ` [PATCH 09/14] index: cleanup internal variables Eric Wong
2020-08-10  2:12 ` [PATCH 10/14] searchidx: use singular `$opt' for consistency with v2 Eric Wong
2020-08-10  2:12 ` [PATCH 11/14] convert: support new -index options Eric Wong
2020-08-10  2:12 ` [PATCH 12/14] convert: speed up --help Eric Wong
2020-08-10  2:12 ` [PATCH 13/14] convert: check ARGV more correctly Eric Wong
2020-08-10  2:12 ` [PATCH 14/14] convert: set No_COW on copied SQLite files Eric Wong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://public-inbox.org/README

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200810021205.18909-4-e@yhbt.net \
    --to=e@yhbt.net \
    --cc=meta@public-inbox.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/public-inbox.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).