PUBLIC-INBOX-INDEX(1)      public-inbox user manual      PUBLIC-INBOX-INDEX(1)

NAME
       public-inbox-index - create and update search indices

SYNOPSIS
       public-inbox-index [OPTIONS] INBOX_DIR...

       public-inbox-index [OPTIONS] --all

DESCRIPTION
       public-inbox-index creates and updates the search, overview and NNTP
       article number database used by the read-only public-inbox HTTP and
       NNTP interfaces.  Currently, this requires the DBD::SQLite and DBI Perl
       modules.  Search::Xapian is optional and only needed to support the
       PSGI search interface.

       Once the initial indices are created by public-inbox-index,
       public-inbox-mda(1) and public-inbox-watch(1) will automatically
       maintain them.

       Running this manually to update indices is only required when relying
       on git-fetch(1) to mirror an existing public-inbox, or when upgrading
       to a new version of public-inbox (using the "--reindex" option).
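
       As a hedged illustration of the mirroring case above, the sketch below
       fetches and then indexes a v1 inbox.  It assumes INBOX_DIR is itself
       the bare git repository (as with v1 inboxes) and that a fetch remote
       is already configured; the function name "update_mirror" is
       hypothetical and not part of public-inbox.

```shell
# update_mirror INBOX_DIR
# Assumption: INBOX_DIR is a bare v1 inbox repository with a configured remote.
update_mirror() {
    inbox_dir=$1
    # Pull new messages from the upstream inbox; stop if the fetch fails.
    git --git-dir="$inbox_dir" fetch || return 1
    # Bring the indices up to date with the newly fetched history.
    public-inbox-index "$inbox_dir"
}
```

       It could be invoked as "update_mirror /path/to/inbox.git", for example
       from a periodic job.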

       Having the overview and article number database is essential to running
       the NNTP interface, and strongly recommended for the HTTP interface as
       it provides thread grouping in addition to normal search functionality.

OPTIONS
       -j JOBS
       --jobs=JOBS
           Influences the number of Xapian indexing shards in a
           (public-inbox-v2-format(5)) inbox.

           See "--jobs" in public-inbox-init(1) for a full description of
           sharding.

           "--jobs=0" is accepted as of public-inbox 1.6.0 to disable parallel
           indexing regardless of the number of pre-existing shards.

           If the inbox has not been indexed or initialized, "JOBS - 1" shards
           will be created (one job is always needed for indexing the overview
           and article number mapping).

           Default: the number of existing Xapian shards
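
           To make the "JOBS - 1" rule above concrete, here is a small
           arithmetic sketch.  The helper name "shards_for_jobs" is
           hypothetical, and it only models the not-yet-indexed case with a
           JOBS value of 2 or more.

```shell
# shards_for_jobs JOBS
# Prints the number of Xapian shards created for a new inbox: one job is
# reserved for the overview and article number mapping.
shards_for_jobs() {
    echo $(( $1 - 1 ))
}

shards_for_jobs 4   # prints 3
```

           Under that assumption, "--jobs=4" on an uninitialized inbox yields
           three shards.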

       -c
       --compact
           Compacts the Xapian DBs after indexing.  This is recommended when
           using "--reindex" to avoid running out of disk space while indexing
           multiple inboxes.

           While this option takes a negligible amount of time compared to
           "--reindex", it requires temporarily duplicating the entire
           contents of the Xapian DB.

           This switch may be specified twice, in which case compaction
           happens both before and after indexing to minimize the temporal
           footprint of the (re)indexing operation.

           Available since public-inbox 1.4.0.

       --reindex
           Forces a re-index of all messages in the inbox.  This can be used
           for in-place upgrades and bugfixes while NNTP/HTTP server processes
           are utilizing the index.  Keep in mind this roughly doubles the
           size of the already-large Xapian database.  Using this with
           "--compact" or running public-inbox-compact(1) afterwards is
           recommended to release free space.

           public-inbox protects writes to various indices with flock(2), so
           it is safe to reindex (and rethread) while public-inbox-watch(1),
           public-inbox-mda(1) or public-inbox-learn(1) run.

           This does not touch the NNTP article number database.  It does not
           affect threading unless "--rethread" is used.
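
           A minimal sketch of the recommended combination follows.  The
           wrapper name "reindex_inbox" is hypothetical; the flags are the
           ones described in this manual.

```shell
# reindex_inbox INBOX_DIR
reindex_inbox() {
    # --compact releases the space left behind by the full reindex.
    public-inbox-index --reindex --compact "$1"
}
```

           For example, "reindex_inbox /path/to/inbox".  Adding "--rethread"
           to the command would also regenerate thread associations on
           public-inbox 1.6.0+.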

       --all
           Index all inboxes configured in ~/.public-inbox/config.  This is an
           alternative to specifying individual inbox directories on the
           command-line.

       --rethread
           Regenerate internal THREADID and message thread associations when
           reindexing.

           This fixes some bugs in older versions of public-inbox.  While it
           is possible to use this without "--reindex", it makes little sense
           to do so.

           Available in public-inbox 1.6.0+.

       --prune
           Run git-gc(1) to prune and expire reflogs if discontiguous history
           is detected.  This is intended to be used in mirrors after running
           public-inbox-edit(1) or public-inbox-purge(1) to ensure data is
           expunged from mirrors.

           Available since public-inbox 1.2.0.

       --max-size SIZE
           Sets or overrides "publicinbox.indexMaxSize" on a per-invocation
           basis.  See "publicinbox.indexMaxSize" below.

           Available since public-inbox 1.5.0.

       --batch-size SIZE
           Sets or overrides "publicinbox.indexBatchSize" on a per-invocation
           basis.  See "publicinbox.indexBatchSize" below.

           When using rotational storage but abundant RAM, using a large value
           (e.g. "500m") with "--sequential-shard" can significantly speed up
           and reduce fragmentation during the initial index and full
           "--reindex" invocations (but not incremental updates).

           Available in public-inbox 1.6.0+.
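
           The rotational-storage advice above might look like the following
           sketch.  The function name "index_on_rotational" is hypothetical,
           and "500m" is simply the example value from the text; choose a
           size that fits the RAM actually available.

```shell
# index_on_rotational INBOX_DIR
# Intended for an initial index (or a full --reindex) on spinning disks.
index_on_rotational() {
    # A large batch size trades memory for fewer, larger flushes.
    public-inbox-index --batch-size=500m --sequential-shard "$1"
}
```

           As noted above, this is not expected to help incremental updates.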

       --no-fsync
           Disables fsync(2) and fdatasync(2) operations on SQLite and Xapian.
           This is only effective with Xapian 1.4+.  This is primarily
           intended for systems with low RAM and the small (default)
           "--batch-size=1m".  Users of large "--batch-size" may even find
           disabling fdatasync(2) causes too much dirty data to accumulate,
           resulting in latency spikes from writeback.

           Available in public-inbox 1.6.0+.

       --dangerous
           Speed up initial index by using in-place updates and denying
           support for concurrent readers.  This is only effective with Xapian
           1.4+.

           Available in public-inbox 1.8.0+.

       --sequential-shard
           Sets or overrides "publicinbox.indexSequentialShard" on a
           per-invocation basis.  See "publicinbox.indexSequentialShard" below.

           Available in public-inbox 1.6.0+.

       --skip-docdata
           Stop storing document data in Xapian on an existing inbox.

           See "--skip-docdata" in public-inbox-init(1) for description and
           caveats.

           Available in public-inbox 1.6.0+.

       -E EXTINDEX
       --update-extindex=EXTINDEX
           Update the given external index (public-inbox-extindex-format(5)).
           Either the configured section name (e.g. "all") or a directory name
           may be specified.

           Defaults to "all" if "[extindex "all"]" is configured, otherwise no
           external indices are updated.

           May be specified multiple times in rare cases where multiple
           external indices are configured.

       --no-update-extindex
           Do not update the "all" external index by default.  This negates
           all uses of "-E" / "--update-extindex=" on the command-line.

       --since=DATESTRING
       --after=DATESTRING
       --until=DATESTRING
       --before=DATESTRING
           Passed directly to git-log(1) to limit changes for "--reindex".
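
           As a sketch of a date-limited reindex, the hypothetical wrapper
           below forwards a DATESTRING to "--since".  Any date format that
           git-log(1) accepts should work; "2020-01-01" in the usage note is
           only an example value.

```shell
# reindex_since INBOX_DIR DATESTRING
reindex_since() {
    # DATESTRING is handed to git-log(1) unchanged by public-inbox-index.
    public-inbox-index --reindex --since="$2" "$1"
}
```

           For example, "reindex_since /path/to/inbox 2020-01-01".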

FILES
       For v1 (ssoma) repositories described in public-inbox-v1-format(5),
       all public-inbox-specific files are contained within the
       "$GIT_DIR/public-inbox/" directory.

       v2 inboxes are described in public-inbox-v2-format(5).

CONFIGURATION
       publicinbox.indexMaxSize
               Prevents indexing of messages larger than the specified size
               value.  A single suffix modifier of "k", "m" or "g" is
               supported; thus a value of "1m" prevents indexing of
               messages larger than one megabyte.

               This is useful for avoiding memory exhaustion in mirrors via
               git.  It does not prevent public-inbox-mda(1) or
               public-inbox-watch(1) from importing (and indexing) a message.

               This option is only available in public-inbox 1.5 or later.

               Default: none

       publicinbox.indexBatchSize
               Flushes changes to the filesystem and releases locks after
               indexing the given number of bytes.  The default value of "1m"
               (one megabyte) is low to minimize memory use and reduce
               contention with parallel invocations of public-inbox-mda(1),
               public-inbox-learn(1), and public-inbox-watch(1).

               Increase this value on powerful systems to improve throughput
               at the expense of memory use.  The reduction of lock
               granularity may not be noticeable on fast systems.  With SSDs,
               values above "4m" have little benefit.

               For public-inbox-v2-format(5) inboxes, this value is multiplied
               by the number of Xapian shards.  Thus a typical v2 inbox with 3
               shards will flush every 3 megabytes by default unless
               parallelism is disabled via "--sequential-shard" or "--jobs=0".

               This influences memory usage of Xapian, but it is not exact.
               The actual memory used by Xapian and Perl has been observed in
               excess of 10x this value.

               This option is available in public-inbox 1.6 or later.
               public-inbox 1.5 and earlier used the current default, "1m".

               Default: 1m (one megabyte)
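
               The per-shard multiplication described above for v2 inboxes
               can be sketched as simple arithmetic.  The helper name
               "flush_interval_mb" is hypothetical, and it only models the
               parallel case (no "--sequential-shard" or "--jobs=0").

```shell
# flush_interval_mb BATCH_MB SHARDS
# Prints the approximate number of megabytes indexed between flushes.
flush_interval_mb() {
    echo $(( $1 * $2 ))
}

flush_interval_mb 1 3   # prints 3
```

               With the default "1m" and 3 shards this gives the 3 megabytes
               mentioned above; it says nothing about actual memory use,
               which can be far higher.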

       publicinbox.indexSequentialShard
               For public-inbox-v2-format(5) inboxes, setting this to "true"
               allows indexing Xapian shards in multiple passes.  This speeds
               up indexing on rotational storage with high seek latency by
               allowing individual shards to fit into the kernel page cache.

               Using a higher-than-normal number of "--jobs" with
               public-inbox-init(1) may be required to ensure individual
               shards are small enough to fit into cache.

               Warning: interrupting "public-inbox-index(1)" while this option
               is in use may leave the search indices out-of-date with respect
               to SQLite databases.  WWW and IMAP users may notice incomplete
               search results, but it is otherwise non-fatal.  Using
               "--reindex" will bring everything back up-to-date.

               Available in public-inbox 1.6.0+.

               This is ignored on public-inbox-v1-format(5) inboxes.

               Default: false, shards are indexed in parallel

       publicinbox.<name>.indexSequentialShard
               Identical to "publicinbox.indexSequentialShard", but only
               affects the inbox matching <name>.
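
       The dotted names above map onto sections of a git-config(1) style
       file.  The sketch below writes such a fragment to a scratch file
       rather than to the real config; the inbox name "example" and the
       chosen values are assumptions for illustration only.

```shell
# Write an example fragment to a scratch file (NOT ~/.public-inbox/config).
config_sketch=$(mktemp) || exit 1
cat > "$config_sketch" <<'EOF'
[publicinbox]
    indexMaxSize = 1m
    indexBatchSize = 4m
[publicinbox "example"]
    indexSequentialShard = true
EOF
echo "wrote $config_sketch"
```

       The first section sets the global keys; the second applies
       "indexSequentialShard" only to the hypothetical inbox named
       "example".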

ENVIRONMENT
       PI_CONFIG
               Used to override the default "~/.public-inbox/config" value.
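
               A minimal sketch of a per-invocation override, combined with
               "--all"; the function name "index_all_with_config" and any
               path passed to it are hypothetical.

```shell
# index_all_with_config CONFIG_FILE
index_all_with_config() {
    # The assignment applies only to this one command's environment.
    PI_CONFIG="$1" public-inbox-index --all
}
```

               For example, "index_all_with_config /path/to/alt-config"
               indexes every inbox listed in that file instead of the
               default one.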

       XAPIAN_FLUSH_THRESHOLD
               The number of documents to update before committing changes to
               disk.  This environment variable is handled directly by Xapian;
               refer to the Xapian API documentation for more details.

               For public-inbox 1.6 and later, use
               "publicinbox.indexBatchSize" instead.

               Setting "XAPIAN_FLUSH_THRESHOLD" or
               "publicinbox.indexBatchSize" for a large "--reindex" may cause
               public-inbox-mda(1), public-inbox-learn(1) and
               public-inbox-watch(1) tasks to wait long and unpredictable
               periods of time during "--reindex".

               Default: none, uses "publicinbox.indexBatchSize"

UPGRADING
       Occasionally, public-inbox will update its schema version and require
       a full index by running this command.

CONTACT
       Feedback welcome via plain-text mail to <mailto:meta@public-inbox.org>

       The mail archives are hosted at <https://public-inbox.org/meta/> and
       <http://4uok3hntl7oi7b4uf4rtfwefqeexfzil2w6kgk2jn5z2f764irre7byd.onion/meta/>

COPYRIGHT
       Copyright all contributors <mailto:meta@public-inbox.org>

       License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>

SEE ALSO
       Search::Xapian, DBD::SQLite, public-inbox-extindex-format(5)

public-inbox.git                  1993-10-02             PUBLIC-INBOX-INDEX(1)