user/dev discussion of public-inbox itself
From: Eric Wong <e@80x24.org>
To: meta@public-inbox.org
Subject: v2 development notes
Date: Thu, 19 Apr 2018 01:58:13 +0000
Message-ID: <20180419015813.GA20051@dcvr>

Just some notes I was maintaining along the way.  It should
probably be turned into .pod for a manpage at some point...

public-inbox repository layout v2
---------------------------------

First off, Xapian remains a swappable component for another
search engine; but for now, only Xapian (along with SQLite) is
supported and required for v2 to work.

Basic idea is v2 won't be git-centered; in other words it will not
be a bare git repository polluted by public-inbox-specific files.
Instead, one or more bare git repositories will exist within
the hierarchy.


Partitioning
------------

There are two types of partitioning done in v2 to address
different performance problems with the original v1 (ssoma)
inboxes.

The first is size/time-based partitioning based on epoch.  Each
git repository is limited to roughly 1G (to be made
configurable, later).  Once a git repository hits its size
threshold, a new one is created and new messages only go to it.

For the server administrator, this has a pleasant side effect of
limiting pack sizes and clone times.  It also allows mirrors to
do a partial mirror for inboxes spanning several git repos.
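
The rollover rule can be sketched roughly as below.  This is only an
illustration of the idea, not the actual implementation: pick_epoch,
EPOCH_LIMIT, and passing in precomputed repository sizes are all
assumptions made for the example.

```python
# Hypothetical sketch of size-based epoch rollover; all names are illustrative.
EPOCH_LIMIT = 1 << 30  # roughly 1G, the threshold mentioned in the notes


def pick_epoch(epoch_sizes, limit=EPOCH_LIMIT):
    """Return the epoch number that new messages should be written to.

    epoch_sizes[i] is the current size in bytes of git/i.git.
    """
    if not epoch_sizes:
        return 0  # nothing exists yet: start with epoch 0
    latest = len(epoch_sizes) - 1
    if epoch_sizes[latest] >= limit:
        return latest + 1  # latest repo hit its threshold: create a new one
    return latest  # only the newest epoch ever receives new messages
```

Only the newest epoch is ever consulted, so older repositories stay
untouched once they are full; that is what keeps their packs stable
for mirrors.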

The second partition type is by CPU core count.  Xapian indexing
is an expensive operation and consumes a significant amount of
CPU time.  Since multi-core CPUs are common nowadays, we split
Xapian indexing across multiple cores.  Fortunately, Xapian's
read-only interface can transparently abstract away the multiple
partitions.
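
One way to picture the per-core split is the sketch below.  The notes
do not say how a message is assigned to a partition, so the modulo
rule here is purely an assumption, and partition_for and split_work
are hypothetical names.

```python
# Hypothetical sketch: spread numbered documents over N index partitions.
# The modulo assignment is an assumption, not taken from the notes.
def partition_for(doc_num, nparts):
    """Map a numeric document id to a partition in 0..nparts-1."""
    if nparts < 1:
        raise ValueError("need at least one partition")
    return doc_num % nparts


def split_work(doc_nums, nparts):
    """Group document ids by partition so each group can be handed
    to its own indexing worker."""
    parts = [[] for _ in range(nparts)]
    for num in doc_nums:
        parts[partition_for(num, nparts)].append(num)
    return parts
```

For example, split_work(range(1, 7), 3) yields [[3, 6], [1, 4], [2, 5]].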


object identifiers
------------------

There will be three distinct types of identifiers.  content_id is
the new one for v2 and should make message removal and
deduplication easier.  object_id and Message-ID are already
known.

* object_id - the blob identifier git uses (currently SHA-1)
  No need to publicly expose this outside of normal git ops (cloning)
  and there's no need to make this searchable.  As with v1 of
  public-inbox, this will be stored as part of the Xapian
  document so expensive name lookups can be avoided for document
  retrieval.

* Message-ID - the email header; duplicates allowed for archival purposes.
  Needs to be a searchable field in Xapian.  Note: it's possible
  for emails to have multiple Message-ID headers (and git-send-email(1)
  had that bug for a bit); so we take all of them into account.
  In case of conflicts detected by content_id below, we generate a new
  Message-ID based on content_id; if the generated Message-ID still
  conflicts, a random one is generated.

* content_id - a hash of relevant headers and raw body content for
  purging of unwanted content.  This is not stored anywhere,
  but calculated on-the-fly.

  For now, the relevant headers are:

	Subject, From, Date, References, In-Reply-To, To, Cc

  Received, List-Id, and similar headers are NOT part of content_id as
  they differ across lists and we will want removal to be able to cross
  lists.

  The textual parts of the body are decoded, CRLF normalized to
  LF, and trailing whitespace stripped.  Notably, hashing the
  raw body risks being broken by list signatures; but we can use
  filters (e.g. PublicInbox::Filter::Vger) to clean the body for
  imports.

  This is SHA-256 for now; but can be changed at any time without
  DB changes.
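
A minimal sketch of the content_id calculation described above, using
only the Python standard library.  The header list and SHA-256 come
from the notes; the byte-level serialization (lowercased name, NUL
separators), stripping trailing whitespace per line, and hashing
non-text parts as decoded bytes are assumptions made for the example,
so the digests will not match what public-inbox computes.

```python
import hashlib
from email import message_from_bytes, policy

# Header list taken from the notes; everything else is an assumption.
CONTENT_HEADERS = ("Subject", "From", "Date", "References",
                   "In-Reply-To", "To", "Cc")


def content_id(raw: bytes) -> str:
    """Hash the relevant headers and the normalized body of one message."""
    msg = message_from_bytes(raw, policy=policy.default)
    h = hashlib.sha256()
    for name in CONTENT_HEADERS:
        for value in msg.get_all(name, []):
            # Assumed serialization: "name\0value\0"
            h.update(name.lower().encode() + b"\0")
            h.update(str(value).encode("utf-8", "replace") + b"\0")
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        if part.get_content_maintype() == "text":
            text = part.get_content()          # decoded to str
            text = text.replace("\r\n", "\n")  # CRLF -> LF
            # Assumed reading of "trailing whitespace stripped": per line
            text = "\n".join(line.rstrip() for line in text.split("\n"))
            h.update(text.encode("utf-8", "replace"))
        else:
            # Assumption: non-text parts are hashed as decoded bytes
            h.update(part.get_payload(decode=True) or b"")
    return h.hexdigest()
```

Two copies of a message that differ only in Received or List-Id
headers, or in CRLF versus LF line endings, get the same content_id
here, which is the property needed for removal across lists.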

repository layout
-----------------

$EPOCH - Integer starting with 0 based on time
$SCHEMA_VERSION - SCHEMA_VERSION used by Xapian, we'll inherit and
                  start with '14' from v1.0.0
$PART - Integer (0..NPROCESSORS)

foo/ # assuming "foo" is the name of the list
- inbox.lock                 # lock file (flock) to protect global state
- git/$EPOCH.git             # normal git repositories
- all.git                    # empty git repo, alternates to git/$EPOCH.git
- xap$SCHEMA_VERSION/$PART   # per-partition Xapian DB
- xap$SCHEMA_VERSION/over.sqlite3 # OVER-view DB for NNTP and threading

For blob lookups, the reader only needs to open the "all.git"
repository with $GIT_DIR/objects/info/alternates which references
every $EPOCH.git repo.

Individual $EPOCH.git repos DO NOT use alternates themselves, as
git currently limits the nesting depth of alternates to 5.
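
As an illustration, the alternates file for "all.git" could be
generated along these lines.  write_alternates is a hypothetical
helper, and the use of relative paths (which git resolves against the
objects directory of all.git) is a choice made for the example.

```python
from pathlib import Path


def write_alternates(inbox_dir):
    """Point all.git at the object store of every git/$EPOCH.git.

    Hypothetical helper; returns the lines written for inspection.
    """
    inbox = Path(inbox_dir)
    epochs = sorted(
        (p for p in (inbox / "git").glob("*.git") if p.stem.isdigit()),
        key=lambda p: int(p.stem),  # numeric order, so 10 sorts after 2
    )
    info = inbox / "all.git" / "objects" / "info"
    info.mkdir(parents=True, exist_ok=True)
    # Relative alternates are resolved against all.git/objects
    lines = [f"../../git/{p.name}/objects\n" for p in epochs]
    (info / "alternates").write_text("".join(lines))
    return lines
```

Since every entry points directly at an epoch, the chain is only one
level deep no matter how many epochs exist.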

git tree layout
---------------

During the original (v1) development, large trees were frequently
a performance problem as name lookups are expensive and there
were limited deltafication opportunities.

Unlike the ssoma-based layout in v1, the v2 git tree contains
only a single file at the top-level of the tree, either 'm'
(for 'mail' or 'message') or 'd' (for deleted).

Mail is still stored in blobs (instead of inline with the commit
object) as we still need a stable reference in the indices in
case history is rewritten to comply with legal requirements.

After-the-fact invocations of public-inbox-index will ignore
messages written to 'd' after they are written to 'm'.
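
The effect on a reindex can be sketched as a replay of the history.
This is a simplification under a stated assumption: that a commit
writing 'd' references the same blob as the earlier commit which
wrote 'm'.  live_blobs and the (path, blob_id) pairs are hypothetical.

```python
def live_blobs(history):
    """Return blob ids that should be indexed, in arrival order.

    history: iterable of (path, blob_id) pairs, oldest commit first,
    where path is 'm' (message) or 'd' (deleted).
    Assumption: a 'd' entry names the same blob as the earlier 'm'.
    """
    live = {}  # dict keeps insertion order
    for path, blob_id in history:
        if path == "m":
            live[blob_id] = True
        elif path == "d":
            live.pop(blob_id, None)  # deleted later: do not index it
        else:
            raise ValueError(f"unexpected path in tree: {path!r}")
    return list(live)
```

For example, the history m:a, m:b, d:a, m:c leaves only b and c to
be indexed.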

Deltafication is not significantly improved over v1,
but overall storage for trees is greatly reduced.

https://public-inbox.org/meta/20180209205140.GA11047@dcvr/T/


Overview DB
-----------

Late into v2 development, it became apparent Xapian did not
perform well with sorting large result sets used to generate the
landing page in the PSGI UI (/$INBOX/) or many queries used
by the NNTP server.  Thus, SQLite was employed and the Xapian
"skeleton" DB was renamed to the "overview" DB (after the NNTP
XOVER/OVER commands).

In the future, Xapian will become optional for v2.  Most of the
PSGI and all of the NNTP functionality will be possible with only
SQLite in addition to git.

https://public-inbox.org/meta/20180402000456.13446-1-e@80x24.org/T/
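
To illustrate why SQLite suits these queries, here is a small sketch
with an indexed timestamp column.  The table name, columns, and
helper functions are all hypothetical and do not reflect the real
over.sqlite3 schema.

```python
import sqlite3


def open_overview(path=":memory:"):
    """Open (or create) a toy overview DB; the schema is hypothetical."""
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS over_sketch (
                      num INTEGER PRIMARY KEY, -- NNTP article number
                      ts INTEGER NOT NULL,     -- message timestamp
                      mid TEXT NOT NULL,       -- Message-ID
                      subject TEXT)""")
    # The index lets "newest first" queries avoid sorting the whole set
    db.execute("CREATE INDEX IF NOT EXISTS idx_over_sketch_ts "
               "ON over_sketch (ts)")
    return db


def recent(db, limit=25):
    """Newest messages first, as a landing page would list them."""
    return db.execute("SELECT num, mid, subject FROM over_sketch "
                      "ORDER BY ts DESC LIMIT ?", (limit,)).fetchall()


def xover_range(db, first, last):
    """Article-number range lookup, similar in spirit to NNTP OVER."""
    return db.execute("SELECT num, subject FROM over_sketch "
                      "WHERE num BETWEEN ? AND ? ORDER BY num",
                      (first, last)).fetchall()
```

Both queries are served from an index (ts, or the integer primary
key), so neither needs to load and sort a large result set.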
