diff options
Diffstat (limited to 'Documentation/public-inbox-index.pod')
-rw-r--r-- | Documentation/public-inbox-index.pod | 279 |
1 files changed, 261 insertions, 18 deletions
diff --git a/Documentation/public-inbox-index.pod b/Documentation/public-inbox-index.pod index 7c04f753..f1a2180a 100644 --- a/Documentation/public-inbox-index.pod +++ b/Documentation/public-inbox-index.pod @@ -4,15 +4,17 @@ public-inbox-index - create and update search indices =head1 SYNOPSIS -public-inbox-index [OPTIONS] INBOX_DIR +public-inbox-index [OPTIONS] INBOX_DIR... + +public-inbox-index [OPTIONS] --all =head1 DESCRIPTION public-inbox-index creates and updates the search, overview and NNTP article number database used by the read-only public-inbox HTTP and NNTP interfaces. Currently, this requires -L<DBD::SQLite> and L<DBI> Perl modules. L<Search::Xapian> -is optional, only to support the PSGI search interface. +L<DBD::SQLite> and L<DBI> Perl modules. L<Xapian> (or L<Search::Xapian>) +are optional, only to support the PSGI search interface. Once the initial indices are created by public-inbox-index, L<public-inbox-mda(1)> and L<public-inbox-watch(1)> will @@ -32,16 +34,79 @@ normal search functionality. =over +=item -j JOBS + +=item --jobs=JOBS + +Influences the number of Xapian indexing shards in a +(L<public-inbox-v2-format(5)>) inbox. + +See L<public-inbox-init(1)/--jobs> for a full description +of sharding. + +C<--jobs=0> is accepted as of public-inbox 1.6.0 +to disable parallel indexing regardless of the number of +pre-existing shards. + +If the inbox has not been indexed or initialized, C<JOBS - 1> +shards will be created (one job is always needed for indexing +the overview and article number mapping). + +Default: the number of existing Xapian shards + +=item -c + +=item --compact + +Compacts the Xapian DBs after indexing. This is recommended +when using C<--reindex> to avoid running out of disk space +while indexing multiple inboxes. + +While option takes a negligible amount of time compared to +C<--reindex>, it requires temporarily duplicating the entire +contents of the Xapian DB. + +This switch may be specified twice, in which case compaction +happens both before and after indexing to minimize the temporal +footprint of the (re)indexing operation. + +Available since public-inbox 1.4.0. + =item --reindex -Forces a search engine re-index of all messages in the -repository. This can be used for in-place upgrades while +Forces a re-index of all messages in the inbox. +This can be used for in-place upgrades and bugfixes while NNTP/HTTP server processes are utilizing the index. Keep in mind this roughly doubles the size of the already-large -Xapian database. Running L<public-inbox-compact(1)> -afterwards is recommended to release free space. +Xapian database. Using this with C<--compact> or running +L<public-inbox-compact(1)> afterwards is recommended to +release free space. + +public-inbox protects writes to various indices with +L<flock(2)>, so it is safe to reindex (and rethread) while +L<public-inbox-watch(1)>, L<public-inbox-mda(1)> or +L<public-inbox-learn(1)> run. This does not touch the NNTP article number database. +It does not affect threading unless C<--rethread> is +used. + +=item --all + +Index all inboxes configured in ~/.public-inbox/config. +This is an alternative to specifying individual inboxes directories +on the command-line. + +=item --rethread + +Regenerate internal THREADID and message thread associations +when reindexing. + +This fixes some bugs in older versions of public-inbox. While +it is possible to use this without C<--reindex>, it makes little +sense to do so. + +Available in public-inbox 1.6.0+. =item --prune @@ -50,15 +115,187 @@ is detected. This is intended to be used in mirrors after running L<public-inbox-edit(1)> or L<public-inbox-purge(1)> to ensure data is expunged from mirrors. +Available since public-inbox 1.2.0. + +=item --max-size SIZE + +Sets or overrides L</publicinbox.indexMaxSize> on a +per-invocation basis. See L</publicinbox.indexMaxSize> +below. + +Available since public-inbox 1.5.0. + +=item --batch-size SIZE + +Sets or overrides L</publicinbox.indexBatchSize> on a +per-invocation basis. See L</publicinbox.indexBatchSize> +below. + +When using rotational storage but abundant RAM, using a large +value (e.g. C<500m>) with C<--sequential-shard> can +significantly speed up and reduce fragmentation during the +initial index and full C<--reindex> invocations (but not +incremental updates). + +Available in public-inbox 1.6.0+. + +=item --no-fsync + +Disables L<fsync(2)> and L<fdatasync(2)> operations on SQLite +and Xapian. This is only effective with Xapian 1.4+. This is +primarily intended for systems with low RAM and the small +(default) C<--batch-size=1m>. Users of large C<--batch-size> +may even find disabling L<fdatasync(2)> causes too much dirty +data to accumulate, resulting on latency spikes from writeback. + +Available in public-inbox 1.6.0+. + +=item --dangerous + +Speed up initial index by using in-place updates and denying support for +concurrent readers. This is only effective with Xapian 1.4+. + +Available in public-inbox 1.8.0+ + +=item --sequential-shard + +Sets or overrides L</publicinbox.indexSequentialShard> on a +per-invocation basis. See L</publicinbox.indexSequentialShard> +below. + +Available in public-inbox 1.6.0+. + +=item --skip-docdata + +Stop storing document data in Xapian on an existing inbox. + +See L<public-inbox-init(1)/--skip-docdata> for description and caveats. + +Available in public-inbox 1.6.0+. + +=item -E EXTINDEX + +=item --update-extindex=EXTINDEX + +Update the given external index (L<public-inbox-extindex-format(5)>. +Either the configured section name (e.g. C<all>) or a directory name +may be specified. + +Defaults to C<all> if C<[extindex "all"]> is configured, +otherwise no external indices are updated. + +May be specified multiple times in rare cases where multiple +external indices are configured. + +=item --no-update-extindex + +Do not update the C<all> external index by default. This negates +all uses of C<-E> / C<--update-extindex=> on the command-line. + +=item --no-multi-pack-index + +Disables writing the multi-pack-index when using L</--update-extindex>. +See L<public-inbox-extindex(1)/--no-multi-pack-index> for details. + +Available in public-inbox 2.0.0+ + +=item --since=DATESTRING + +=item --after=DATESTRING + +=item --until=DATESTRING + +=item --before=DATESTRING + +Passed directly to L<git-log(1)> to limit changes for C<--reindex> + =back =head1 FILES -For v1 (ssoma) repositories described in L<public-inbox-v1-format>. +For v1 (ssoma) repositories described in L<public-inbox-v1-format(5)>. All public-inbox-specific files are contained within the C<$GIT_DIR/public-inbox/> directory. -v2 repositories are described in L<public-inbox-v2-format>. +v2 inboxes are described in L<public-inbox-v2-format(5)>. + +=head1 CONFIGURATION + +=over 8 + +=item publicinbox.indexMaxSize + +Prevents indexing of messages larger than the specified size +value. A single suffix modifier of C<k>, C<m> or C<g> is +supported, thus the value of C<1m> to prevents indexing of +messages larger than one megabyte. + +This is useful for avoiding memory exhaustion in mirrors +via git. It does not prevent L<public-inbox-mda(1)> or +L<public-inbox-watch(1)> from importing (and indexing) +a message. + +This option is only available in public-inbox 1.5 or later. + +Default: none + +=item publicinbox.indexBatchSize + +Flushes changes to the filesystem and releases locks after +indexing the given number of bytes. The default value of C<1m> +(one megabyte) is low to minimize memory use and reduce +contention with parallel invocations of L<public-inbox-mda(1)>, +L<public-inbox-learn(1)>, and L<public-inbox-watch(1)>. + +Increase this value on powerful systems to improve throughput at +the expense of memory use. The reduction of lock granularity +may not be noticeable on fast systems. With SSDs, values above +C<4m> have little benefit. + +For L<public-inbox-v2-format(5)> inboxes, this value is +multiplied by the number of Xapian shards. Thus a typical v2 +inbox with 3 shards will flush every 3 megabytes by default +unless parallelism is disabled via C<--sequential-shard> +or C<--jobs=0>. + +This influences memory usage of Xapian, but it is not exact. +The actual memory used by Xapian and Perl has been observed +in excess of 10x this value. + +This option is available in public-inbox 1.6 or later. +public-inbox 1.5 and earlier used the current default, C<1m>. + +Default: 1m (one megabyte) + +=item publicinbox.indexSequentialShard + +For L<public-inbox-v2-format(5)> inboxes, setting this to C<true> +allows indexing Xapian shards in multiple passes. This speeds up +indexing on rotational storage with high seek latency by allowing +individual shards to fit into the kernel page cache. + +Using a higher-than-normal number of C<--jobs> with +L<public-inbox-init(1)> may be required to ensure individual +shards are small enough to fit into cache. + +Warning: interrupting C<public-inbox-index(1)> while this option +is in use may leave the search indices out-of-date with respect +to SQLite databases. WWW and IMAP users may notice incomplete +search results, but it is otherwise non-fatal. Using C<--reindex> +will bring everything back up-to-date. + +Available in public-inbox 1.6.0+. + +This is ignored on L<public-inbox-v1-format(5)> inboxes. + +Default: false, shards are indexed in parallel + +=item publicinbox.<name>.indexSequentialShard + +Identical to L</publicinbox.indexSequentialShard>, +but only affect the inbox matching E<lt>nameE<gt>. + +=back =head1 ENVIRONMENT @@ -74,31 +311,37 @@ The number of documents to update before committing changes to disk. This environment is handled directly by Xapian, refer to Xapian API documentation for more details. -Default: our indexing code flushes every megabyte of mail seen -to keep memory usage low. Setting this environment variable to -any positive value will switch to a document count-based -threshold in Xapian. +For public-inbox 1.6 and later, use C<publicinbox.indexBatchSize> +instead. + +Setting C<XAPIAN_FLUSH_THRESHOLD> or +C<publicinbox.indexBatchSize> for a large C<--reindex> may cause +L<public-inbox-mda(1)>, L<public-inbox-learn(1)> and +L<public-inbox-watch(1)> tasks to wait long and unpredictable +periods of time during C<--reindex>. + +Default: none, uses C<publicinbox.indexBatchSize> =back =head1 UPGRADING -Occasionally, public-inbox will update it's schema version and +Occasionally, public-inbox will update its schema version and require a full index by running this command. =head1 CONTACT Feedback welcome via plain-text mail to L<mailto:meta@public-inbox.org> -The mail archives are hosted at L<https://public-inbox.org/meta/> -and L<http://hjrcffqmbrq6wope.onion/meta/> +The mail archives are hosted at L<https://public-inbox.org/meta/> and +L<http://4uok3hntl7oi7b4uf4rtfwefqeexfzil2w6kgk2jn5z2f764irre7byd.onion/meta/> =head1 COPYRIGHT -Copyright 2016-2020 all contributors L<mailto:meta@public-inbox.org> +Copyright all contributors L<mailto:meta@public-inbox.org> License: AGPL-3.0+ L<https://www.gnu.org/licenses/agpl-3.0.txt> =head1 SEE ALSO -L<Search::Xapian>, L<DBD::SQLite> +L<Search::Xapian>, L<DBD::SQLite>, L<public-inbox-extindex-format(5)> |