PUBLIC-INBOX-EXTINDEX-FORMApublic-inbox user maPUBLIC-INBOX-EXTINDEX-FORMAT(5)

NAME
       public-inbox-extindex-format - external index format description

DESCRIPTION
       The extindex is an index-only evolution of the per-inbox SQLite and
       Xapian indices used by public-inbox-v2-format(5) and
       public-inbox-v1-format(5).  It exists to facilitate searches across
       multiple inboxes as well as to reduce index space when messages are
       cross-posted to several existing inboxes.

       It transparently indexes messages across any combination of v1 and v2
       inboxes and data about inboxes themselves.

DIRECTORY LAYOUT
       While inspired by v2, there is no git blob storage nor "msgmap.sqlite3"
       DB.

       Instead, there is an "ALL.git" (all caps) git repo which treats every
       indexed v1 inbox or v2 epoch as a git alternate.

       As with v2 inboxes, it uses "over.sqlite3" and Xapian "shards" for WWW
       and IMAP use.  Several exclusive new tables are added to deal with
       "XREF3 DEDUPLICATION" and metadata.

       Unlike v1 and v2 inboxes, it is NOT designed to map to a NNTP
       newsgroup.  Thus it lacks "msgmap.sqlite3" to enforce the unique
       Message-ID requirement of NNTP.

   INDEX OVERVIEW AND DEFINITIONS
         $SCHEMA_VERSION - DB schema version (for Xapian)
         $SHARD - Integer starting with 0 based on parallelism

         foo/                              # "foo" is the name of the index
         - ei.lock                         # lock file to protect global state
         - ALL.git                         # empty, alternates for inboxes
         - ei$SCHEMA_VERSION/$SHARD        # per-shard Xapian DB
         - ei$SCHEMA_VERSION/over.sqlite3  # overview DB for WWW, IMAP
         - ei$SCHEMA_VERSION/misc          # misc Xapian DB

       File and directory names are intentionally different from analogous v2
       names to ensure extindex and v2 inboxes can easily be distinguished
       from each other.

   XREF3 DEDUPLICATION
       Due to cross-posted messages being the norm in the large Linux kernel
       development community and Xapian indices being the primary consumer of
       storage, it makes sense to deduplicate indexing as much as possible.

       The internal storage format is based on the NNTP "Xref" tuple, but with
       the addition of a third element: the git blob OID.  Thus the triple is
       expressed in string form as:

               $NEWSGROUP_NAME:$ARTICLE_NUM:$OID

       If no "newsgroup" is configured for an inbox, the "inboxdir" of the
       inbox is used.

       This data is stored in the "xref3" table of over.sqlite3.

   misc XAPIAN DB
       In addition to the numeric Xapian shards for indexing messages, there
       is a new, in-development Xapian index for storing data about inboxes
       themselves and other non-message data.  This index allows us to speed
       up operations involving hundreds or thousands of inboxes.

BENEFITS
       In addition to providing cross-inbox search capabilities, it can also
       replace per-inbox Xapian shards (but not per-inbox over.sqlite3).  This
       allows reduction in disk space, open file handles, and associated
       memory use.

CAVEATS
       Relocating v1 and v2 inboxes on the filesystem will require extindex to
       be garbage-collected and/or reindexed.

       Configuring and maintaining stable "newsgroup" names before any
       messages are indexed from every inbox can avoid expensive reindexing
       and rely exclusively on GC.

LOCKING
       flock(2) locking exclusively locks the empty ei.lock file for all non-
       atomic operations.

THANKS
       Thanks to the Linux Foundation for sponsoring the development and
       testing.

COPYRIGHT
       Copyright 2020-2021 all contributors <mailto:meta@public-inbox.org>

       License: AGPL-3.0+ <http://www.gnu.org/licenses/agpl-3.0.txt>

SEE ALSO
       public-inbox-v2-format(5)

public-inbox.git                  1993-10-02   PUBLIC-INBOX-EXTINDEX-FORMAT(5)