NAME
    public-inbox v2 repository description

DESCRIPTION
    The v2 format is designed primarily to address several scalability
    problems of the original format described at public-inbox-v1-format(5).
    It also handles messages with conflicting Message-IDs.

INBOX LAYOUT
    The key change in v2 is the inbox is no longer a bare git repository,
    but a directory with two or more git repositories. v2 divides git
    repositories by time "epochs" and Xapian databases for parallelism by
    "partitions".

  INBOX OVERVIEW AND DEFINITIONS
        $EPOCH - Integer starting with 0 based on time
        $SCHEMA_VERSION - PublicInbox::Search::SCHEMA_VERSION used by Xapian
        $PART - Integer (0..NPROCESSORS)

        foo/                              # assuming "foo" is the name of the list
        - inbox.lock                      # lock file (flock) to protect global state
        - git/$EPOCH.git                  # normal git repositories
        - all.git                         # empty git repo, alternates to git/$EPOCH.git
        - xap$SCHEMA_VERSION/$PART        # per-partition Xapian DB
        - xap$SCHEMA_VERSION/over.sqlite3 # OVER-view DB for NNTP and threading
        - msgmap.sqlite3                  # same as the v1 msgmap

    For blob lookups, the reader only needs to open the "all.git" repository
    with $GIT_DIR/objects/info/alternates which references every $EPOCH.git
    repo.

    Individual $EPOCH.git repos DO NOT use alternates themselves as git
    currently limits recursion of alternates nesting depth to 5.
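
    As a rough illustration of the layout above, the following sketch lists
    the entries an "all.git" alternates file could contain, one per epoch.
    The function name is hypothetical, and both the relative path form
    (relative to all.git/objects) and the newest-first ordering are
    assumptions rather than documented behavior.

```python
import os
import re

def alternates_entries(inbox_dir):
    """List one alternates entry per git/$EPOCH.git found under inbox_dir."""
    epochs = []
    for name in os.listdir(os.path.join(inbox_dir, "git")):
        m = re.fullmatch(r"(\d+)\.git", name)
        if m:  # skip anything that is not an epoch repository
            epochs.append(int(m.group(1)))
    # Assumption: paths are written relative to all.git/objects,
    # with the newest epoch listed first.
    return [f"../../git/{e}.git/objects" for e in sorted(epochs, reverse=True)]
```

    Writing these lines to all.git/objects/info/alternates is left to the
    caller; no epoch repository needs an alternates file of its own.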

  GIT EPOCHS
    One of the inherent scalability problems with git itself is the full
    history of a project must be stored and carried around to all clients.
    To address this problem, the v2 format uses multiple git repositories,
    stored as time-based "epochs".

    We currently divide epochs into roughly one gigabyte segments, but this
    size may become configurable (if needed) in the future.
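
    A minimal sketch of that rotation rule follows. The constant and
    function names are hypothetical, and how the packed size is measured is
    left to the caller; only the "roughly one gigabyte" threshold comes
    from the text above.

```python
ROTATE_BYTES = 1 << 30  # roughly one gigabyte; hypothetical constant name

def epoch_for_next_write(current_epoch, packed_bytes, limit=ROTATE_BYTES):
    """Return the epoch number the next message should be written to."""
    if packed_bytes >= limit:
        return current_epoch + 1  # start a new git/$EPOCH.git
    return current_epoch
```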

    A pleasant side-effect of this design is the git packs of older epochs
    are stable, allowing them to be cloned without requiring expensive pack
    generation. This also allows clients to clone only the epochs they are
    interested in to save bandwidth and storage.

    To minimize changes to existing v1-based code and simplify our code, we
    use the "alternates" mechanism described in gitrepository-layout(5) to
    link all the epoch repositories with a single read-only "all.git"
    endpoint.

    Processes retrieve blobs via the "all.git" repository, while writers
    write blobs directly to epochs.

  GIT TREE LAYOUT
    One key problem specific to v1 was large trees were frequently a
    performance problem as name lookups are expensive and there were limited
    deltafication opportunities with unpredictable file names. As a result,
    all Xapian-enabled installations retrieve blob object_ids directly in
    v1, bypassing tree lookups.

    While dividing git repositories into epochs caps the growth of trees,
    worst-case tree size was still unnecessary overhead and worth
    eliminating.

    So in contrast to the big trees of v1, the v2 git tree contains only a
    single file at the top-level of the tree, either 'm' (for 'mail' or
    'message') or 'd' (for deleted). A tree does not have 'm' and 'd' at the
    same time.

    Mail is still stored in blobs (instead of inline with the commit object)
    as we still need a stable reference in the indices in case commit
    history is rewritten to comply with legal requirements.

    After-the-fact invocations of public-inbox-index will ignore messages
    written to 'd' after they are written to 'm'.
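
    The effect of replaying 'm' and 'd' commits can be sketched as below.
    This is not the public-inbox-index implementation: the function name is
    hypothetical, history is modeled as a plain list, and we assume a 'd'
    commit carries the same blob as the 'm' commit it cancels.

```python
def live_blobs(commits):
    """Replay commits oldest-first and return blob ids still considered mail.

    commits is a list of (filename, blob_id) pairs where filename is the
    single top-level tree entry of that commit, either 'm' or 'd'.
    """
    live = {}  # blob_id -> True; a dict keeps insertion order
    for filename, blob_id in commits:
        if filename == "m":
            live[blob_id] = True
        elif filename == "d":
            # Assumption: 'd' holds the same blob as the earlier 'm'.
            live.pop(blob_id, None)
        else:
            raise ValueError(f"unexpected tree entry: {filename!r}")
    return list(live)
```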

    Deltafication is not significantly improved over v1, but overall storage
    for trees is made as small as possible. Initial statistics and
    benchmarks showing the benefits of this approach are documented at:

    <https://public-inbox.org/meta/20180209205140.GA11047@dcvr/>

  XAPIAN PARTITIONS
    A second scalability problem in v1 was the inability to utilize
    multiple CPU cores for Xapian indexing. This is addressed by using
    partitions in Xapian to perform import indexing in parallel.
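
    One simple way to spread work across partitions is to key on the
    article number, as sketched below. The modulo rule and both function
    names are assumptions made for illustration; this text does not specify
    how messages are assigned to a $PART.

```python
from collections import defaultdict

def partition_for(num, nparts):
    """Pick a partition for an article number (assumed rule: modulo)."""
    return num % nparts

def split_by_partition(nums, nparts):
    """Group article numbers so each partition can be indexed independently."""
    buckets = defaultdict(list)
    for num in nums:
        buckets[partition_for(num, nparts)].append(num)
    return dict(buckets)
```

    Each bucket could then be handed to its own indexing process.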

    As with git alternates, Xapian natively supports a read-only interface
    which transparently abstracts away the knowledge of multiple partitions.
    This allows us to simplify our read-only code paths.

    The performance of the storage device is now the bottleneck on larger
    multi-core systems. In our experience, performance improves with
    high-quality and high-quantity solid-state storage. Issuing TRIM
    commands with fstrim(8) was necessary to maintain consistent performance
    while developing this feature.

    Rotational storage devices are NOT recommended for indexing of large
    mail archives; but are fine for backup and usable for small instances.

  OVERVIEW DB
    Towards the end of v2 development, it became apparent Xapian did not
    perform well for sorting large result sets used to generate the landing
    page in the PSGI UI (/$INBOX/) or many queries used by the NNTP server.
    Thus, SQLite was employed and the Xapian "skeleton" DB was renamed to
    the "overview" DB (after the NNTP OVER/XOVER commands).

    The overview DB maintains all the header information necessary to
    implement the NNTP OVER/XOVER commands and non-search endpoints of
    the PSGI UI.
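
    To show why SQLite suits these queries, here is a sketch using a
    made-up table. The table name, columns and index are hypothetical and
    do not reflect the real over.sqlite3 schema; the point is only that
    sorted and range-limited lookups map directly onto indexed SQL.

```python
import sqlite3

def open_overview(path=":memory:"):
    """Create a toy overview table (hypothetical schema)."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS over_sketch ("
        "num INTEGER PRIMARY KEY, ts INTEGER NOT NULL, "
        "subject TEXT, sender TEXT)"
    )
    db.execute("CREATE INDEX IF NOT EXISTS idx_ts ON over_sketch (ts)")
    return db

def recent(db, limit):
    """Newest messages first, as a landing page would want."""
    return db.execute(
        "SELECT num, subject FROM over_sketch ORDER BY ts DESC LIMIT ?",
        (limit,),
    ).fetchall()

def over_range(db, first, last):
    """Header rows for an article number range, as OVER/XOVER would want."""
    return db.execute(
        "SELECT num, subject, sender, ts FROM over_sketch "
        "WHERE num BETWEEN ? AND ? ORDER BY num",
        (first, last),
    ).fetchall()
```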

    In the future, Xapian will become completely optional for v2 (as it is
    for v1) as SQLite turns out to be powerful enough to maintain overview
    information. Most of the PSGI and all of the NNTP functionality will be
    possible with only SQLite in addition to git.

    The overview DB was an instrumental piece in maintaining near
    constant-time read performance on a dataset 2-3 times larger than LKML
    history as of 2018.

   GHOST MESSAGES
    The overview DB also includes references to "ghost" messages, or
    messages which have replies but have not been seen by us. Thus it is
    expected to have more rows than the "msgmap" DB described below.
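
    The idea of a ghost can be sketched without any database: a ghost is
    any Message-ID that some seen message refers to but that we never
    received. The function name and the input shape are hypothetical.

```python
def find_ghosts(seen):
    """Return referenced Message-IDs that were never seen themselves.

    seen maps each received Message-ID to the list of Message-IDs it
    refers to (for example via References or In-Reply-To).
    """
    ghosts = set()
    for refs in seen.values():
        for ref in refs:
            if ref not in seen:
                ghosts.add(ref)
    return ghosts
```

    With two seen messages that both reply to a missing parent, threading
    needs three entries while only two messages have article numbers.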

  msgmap.sqlite3
    The SQLite msgmap DB is unchanged from v1, but it is now at the
    top-level of the directory.

OBJECT IDENTIFIERS
    There are three distinct types of identifiers. content_id is the new one
    for v2 and should make message removal and deduplication easier.
    object_id and Message-ID are already known.

    object_id
        The blob identifier git uses (currently SHA-1). No need to
        publicly expose this outside of normal git ops (cloning) and
        there's no need to make this searchable. As with v1 of public-inbox,
        this is stored as part of the Xapian document so expensive name
        lookups can be avoided for document retrieval.

    Message-ID
        The email header; duplicates allowed for archival purposes. This
        remains a searchable field in Xapian. Note: it's possible for emails
        to have multiple Message-ID headers (and git-send-email(1) had that
        bug for a bit); so we take all of them into account. In case of
        conflicts detected by content_id below, we generate a new Message-ID
        based on content_id; if the generated Message-ID still conflicts, a
        random one is generated.

    content_id
        A hash of relevant headers and raw body content for purging of
        unwanted content. This is not stored anywhere, but always calculated
        on-the-fly.

        For now, the relevant headers are:

                Subject, From, Date, References, In-Reply-To, To, Cc

        Received, List-Id, and similar headers are NOT part of content_id as
        they differ across lists and we will want removal to be able to
        cross lists.

        The textual parts of the body are decoded, CRLF normalized to LF,
        and trailing whitespace stripped. Notably, hashing the raw body
        risks being broken by list signatures; but we can use filters (e.g.
        PublicInbox::Filter::Vger) to clean the body for imports.

        content_id is SHA-256 for now; but can be changed at any time
        without making DB changes.
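
    The content_id description above can be approximated as follows. This
    sketch will NOT produce the same digests as public-inbox: the byte
    layout fed to the hash, the UTF-8 decoding, and the reading of
    "trailing whitespace stripped" as applying to the end of each text part
    are all assumptions, and the function names are hypothetical.

```python
import hashlib
from email import message_from_string

# Header names taken from the list above.
RELEVANT = ("Subject", "From", "Date", "References", "In-Reply-To", "To", "Cc")

def _normalized_text(msg):
    """Yield each textual part decoded, CRLF turned to LF, tail stripped."""
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        raw = part.get_payload(decode=True) or b""
        text = raw.decode("utf-8", "replace")  # assumption: treat as UTF-8
        yield text.replace("\r\n", "\n").rstrip()

def content_id_sketch(raw_message):
    """Hash relevant headers and normalized body text with SHA-256."""
    msg = message_from_string(raw_message)
    h = hashlib.sha256()
    for name in RELEVANT:
        for value in msg.get_all(name, []):
            # Hypothetical framing: name and value separated by NUL bytes.
            h.update(f"{name}\0{str(value).strip()}\0".encode("utf-8", "replace"))
    for text in _normalized_text(msg):
        h.update(b"b\0" + text.encode("utf-8") + b"\0")
    return h.hexdigest()
```

    Because Received and List-Id are never fed to the hash, copies of one
    message delivered via different lists yield the same digest here.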

LOCKING
    flock(2) locking exclusively locks the empty inbox.lock file for all
    non-atomic operations.
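
    Taking that lock from another tool might look like the sketch below,
    which uses Python's fcntl.flock on Unix-like systems. The context
    manager name is hypothetical, and creating the file when it is missing
    is an assumption made so the example is self-contained.

```python
import fcntl
import os
from contextlib import contextmanager

@contextmanager
def inbox_lock(inbox_dir):
    """Hold an exclusive flock(2) on inbox_dir/inbox.lock for the block."""
    path = os.path.join(inbox_dir, "inbox.lock")
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o666)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until the lock is free
        yield fd
    finally:
        os.close(fd)  # closing the descriptor releases the lock
```

    Non-atomic work then goes inside "with inbox_lock(path):".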

HEADERS
    Same handling as with v1, except the Message-ID header will be generated
    if not provided or conflicting. "Bytes", "Lines" and "Content-Length"
    headers are stripped and not allowed, as they can interfere with further
    processing.

    The "Status" mbox header is also stripped as that header makes no sense
    in a public archive.
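
    The header rules above can be sketched with Python's email package.
    The function name and the fallback_mid parameter are hypothetical; the
    caller is assumed to supply a generated Message-ID, and detecting
    conflicting Message-IDs is not shown.

```python
from email import message_from_string

# Headers named above as stripped on import.
STRIPPED = ("Bytes", "Lines", "Content-Length", "Status")

def clean_headers(raw_message, fallback_mid):
    """Drop disallowed headers and add a Message-ID if none is present."""
    msg = message_from_string(raw_message)
    for name in STRIPPED:
        del msg[name]  # removes every occurrence; no error if absent
    if not msg.get_all("Message-ID"):
        msg["Message-ID"] = fallback_mid  # hypothetical generated value
    return msg
```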

THANKS
    Thanks to the Linux Foundation for sponsoring the development and
    testing of the v2 repository format.

COPYRIGHT
    Copyright 2018-2019 all contributors <mailto:meta@public-inbox.org>

    License: AGPL-3.0+ <http://www.gnu.org/licenses/agpl-3.0.txt>

SEE ALSO
    gitrepository-layout(5), public-inbox-v1-format(5)