public-inbox extindex format description

    The extindex is an index-only evolution of the per-inbox SQLite and
    Xapian indices used by public-inbox-v2-format(5) and
    public-inbox-v1-format(5). It exists to facilitate searches across
    multiple inboxes as well as to reduce index space when messages are
    cross-posted to several existing inboxes.

    It transparently indexes messages across any combination of v1 and v2
    inboxes, as well as data about the inboxes themselves.

    While inspired by v2, there is neither git blob storage nor a
    "msgmap.sqlite3" DB.

    Instead, there is an "ALL.git" (all caps) git repo which treats every
    indexed v1 inbox or v2 epoch as a git alternate.
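
    As a rough illustration of the alternates idea, the Python sketch below
    builds the text of a git "objects/info/alternates" file, which lists one
    object directory per line. The helper name and the example paths are
    hypothetical, and the file actually written by public-inbox may differ
    (for instance, by using relative paths).

```python
from pathlib import PurePosixPath


def alternates_content(git_dirs):
    """Return text for objects/info/alternates, one object dir per line.

    git_dirs: paths of bare git repos (a v1 inbox or a v2 epoch).
    """
    lines = [str(PurePosixPath(d) / "objects") for d in git_dirs]
    return "".join(line + "\n" for line in lines)


# Hypothetical locations: one v1 inbox and two epochs of a v2 inbox.
example = alternates_content([
    "/srv/inboxes/v1-list",
    "/srv/inboxes/v2-list/git/0.git",
    "/srv/inboxes/v2-list/git/1.git",
])
```

    With such a file in place, git commands run against ALL.git can read
    blobs from every listed repository without copying them.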

    As with v2 inboxes, it uses "over.sqlite3" and Xapian "shards" for WWW
    and IMAP use. Several exclusive new tables are added to deal with "XREF3
    DEDUPLICATION" and metadata.

    Unlike v1 and v2 inboxes, it is NOT designed to map to an NNTP newsgroup.
    Thus it lacks "msgmap.sqlite3" to enforce the unique Message-ID
    requirement of NNTP.

      $SCHEMA_VERSION - DB schema version (for Xapian)
      $SHARD - Integer starting with 0 based on parallelism

      foo/                              # "foo" is the name of the index
      - ei.lock                         # lock file to protect global state
      - ALL.git                         # empty, alternates for inboxes
      - ei$SCHEMA_VERSION/$SHARD        # per-shard Xapian DB
      - ei$SCHEMA_VERSION/over.sqlite3  # overview DB for WWW, IMAP
      - ei$SCHEMA_VERSION/misc          # misc Xapian DB

    File and directory names are intentionally different from analogous v2
    names to ensure extindex and v2 inboxes can easily be distinguished from
    each other.
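
    The layout above can be sketched as a small Python helper that maps
    logical names to paths. The function name, the dictionary keys, and the
    example schema version and shard count are assumptions made purely for
    illustration; they are not taken from the public-inbox code.

```python
from pathlib import Path


def extindex_paths(index_dir, schema_version, nr_shards):
    """Map logical names to paths following the documented layout."""
    top = Path(index_dir)
    ei = top / f"ei{schema_version}"
    paths = {
        "lock": top / "ei.lock",        # lock file for global state
        "all_git": top / "ALL.git",     # empty repo, alternates only
        "over": ei / "over.sqlite3",    # overview DB for WWW, IMAP
        "misc": ei / "misc",            # misc Xapian DB
    }
    # Xapian shards are numbered starting with 0.
    for shard in range(nr_shards):
        paths[f"shard{shard}"] = ei / str(shard)
    return paths


# Schema version 15 and 3 shards are made-up values for illustration.
paths = extindex_paths("foo", 15, 3)
```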

XREF3 DEDUPLICATION
    Due to cross-posted messages being the norm in the large Linux kernel
    development community and Xapian indices being the primary consumer of
    storage, it makes sense to deduplicate indexing as much as possible.

    The internal storage format is based on the NNTP "Xref" tuple, but with
    the addition of a third element: the git blob OID. Thus the triple is
    expressed in string form as:

            $NEWSGROUP_NAME:$ARTICLE_NUM:$OID

    If no "newsgroup" is configured for an inbox, the "inboxdir" of the
    inbox is used.

    This data is stored in the "xref3" table of over.sqlite3.
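
    A minimal Python sketch of building and parsing such a triple follows.
    It assumes the three elements are joined with colons in the order
    newsgroup (or inboxdir), article number, OID. The type and function
    names are hypothetical, and this is not how public-inbox itself stores
    rows in the "xref3" table.

```python
from typing import NamedTuple


class Xref3(NamedTuple):
    ibx_id: str  # newsgroup name, or inboxdir if no newsgroup is set
    num: int     # NNTP article number
    oid: str     # git blob OID in hex


def format_xref3(newsgroup, inboxdir, num, oid):
    """Build the string form, falling back to inboxdir."""
    return f"{newsgroup or inboxdir}:{num}:{oid}"


def parse_xref3(s):
    """Split from the right so colons inside an inboxdir path survive."""
    ibx_id, num, oid = s.rsplit(":", 2)
    return Xref3(ibx_id, int(num), oid)


# The OID below is the well-known SHA-1 of git's empty blob.
EMPTY_BLOB = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
with_ng = format_xref3("inbox.comp.example", "/srv/inboxes/ex", 42, EMPTY_BLOB)
without_ng = format_xref3(None, "/srv/inboxes/ex", 42, EMPTY_BLOB)
```

    Splitting from the right is a design choice of this sketch: the article
    number and OID never contain colons, while a filesystem path might.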

  misc XAPIAN DB
    In addition to the numeric Xapian shards for indexing messages, there is
    a new, in-development Xapian index for storing data about inboxes
    themselves and other non-message data. This index allows us to speed up
    operations involving hundreds or thousands of inboxes.

    In addition to providing cross-inbox search capabilities, it can also
    replace per-inbox Xapian shards (but not per-inbox over.sqlite3). This
    allows reduction in disk space, open file handles, and associated memory
    use.

    Relocating v1 and v2 inboxes on the filesystem will require extindex to
    be garbage-collected and/or reindexed.

    Configuring stable "newsgroup" names for every inbox before any of its
    messages are indexed, and keeping those names unchanged afterwards,
    avoids expensive reindexing and lets extindex rely exclusively on GC.

    An exclusive flock(2) lock is held on the empty ei.lock file for all
    non-atomic operations.
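
    For readers unfamiliar with flock(2), the Python sketch below shows one
    way a cooperating tool could take the same kind of exclusive lock. The
    context manager name is hypothetical, and whether third-party tools
    should take this lock at all is an assumption of this example, not
    something this document specifies.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def ei_locked(index_dir):
    """Hold an exclusive flock(2) on index_dir/ei.lock for the block."""
    fd = os.open(os.path.join(index_dir, "ei.lock"),
                 os.O_RDWR | os.O_CREAT, 0o666)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the lock is acquired
        yield fd
    finally:
        os.close(fd)  # closing the descriptor releases the lock
```

    A caller would wrap its non-atomic work in "with ei_locked(path):". The
    lock file itself is never written to and stays empty.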

    Thanks to the Linux Foundation for sponsoring the development and

    Copyright 2020-2021 all contributors

    License: AGPL-3.0+