Diffstat (limited to 'Documentation')
-rw-r--r-- Documentation/design_notes.txt 18
-rw-r--r-- Documentation/include.mk 4
-rw-r--r-- Documentation/public-inbox-mda.pod 2
-rw-r--r-- Documentation/public-inbox-v1-format.pod 171
-rw-r--r-- Documentation/public-inbox-v2-format.pod 234
5 files changed, 413 insertions, 16 deletions
diff --git a/Documentation/design_notes.txt b/Documentation/design_notes.txt
index c5d9427b..9ad49774 100644
--- a/Documentation/design_notes.txt
+++ b/Documentation/design_notes.txt
@@ -27,9 +27,7 @@ Use existing infrastructure
 * Existing spam filtering on an SMTP server is also effective on
   public-inbox.
 
-* readers may continue using use their choice of mail clients and
-  mailbox formats, only learning a few commands of the ssoma(1) tool
-  is required.
+* Readers may continue using their choice of NNTP and mail clients.
 
 * Atom is a reasonable feed format for casual readers and is supported
   by a variety of feed readers.
@@ -145,19 +143,11 @@ What sucks about public-inbox
 Scalability notes
 -----------------
 
-Even with shallow clone, storing the history of large/busy mailing lists
-may place much burden on subscribers and servers.  However, having a
-single (or few) refs representing the entire history of a list is good
-for small lists since it's easier to look up a message by Message-ID, so
-we want to avoid splitting refs with independent histories.
-
-ssoma will likely grow its own built-in ref rotation system based on
-message count (not rotating at fixed time intervals).  This would
-split the histories and require O(n) lookup time based on Message-ID,
-where `n' is the number of history splits.
+See the public-inbox-v2-format(5) manpage for all the scalability
+problems solved.
 
 Copyright
 ---------
 
-Copyright 2013-2018 all contributors <meta@public-inbox.org>
+Copyright 2013-2019 all contributors <meta@public-inbox.org>
 License: AGPL-3.0+ <http://www.gnu.org/licenses/agpl-3.0.txt>
diff --git a/Documentation/include.mk b/Documentation/include.mk
index ad7b80a6..28fa7574 100644
--- a/Documentation/include.mk
+++ b/Documentation/include.mk
@@ -1,4 +1,4 @@
-# Copyright (C) 2013-2018 all contributors <meta@public-inbox.org>
+# Copyright (C) 2013-2019 all contributors <meta@public-inbox.org>
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
 all::
 
@@ -24,6 +24,8 @@ m1 += public-inbox-watch
 m1 += public-inbox-index
 m5 =
 m5 += public-inbox-config
+m5 += public-inbox-v1-format
+m5 += public-inbox-v2-format
 m7 =
 m7 += public-inbox-overview
 m8 =
diff --git a/Documentation/public-inbox-mda.pod b/Documentation/public-inbox-mda.pod
index 1a5ade84..41a697b1 100644
--- a/Documentation/public-inbox-mda.pod
+++ b/Documentation/public-inbox-mda.pod
@@ -56,4 +56,4 @@ License: AGPL-3.0+ L<https://www.gnu.org/licenses/agpl-3.0.txt>
 
 =head1 SEE ALSO
 
-L<git(1)>, L<git-config(1)>, L<ssoma_repository(5)>
+L<git(1)>, L<git-config(1)>, L<public-inbox-v1-format(5)>
diff --git a/Documentation/public-inbox-v1-format.pod b/Documentation/public-inbox-v1-format.pod
new file mode 100644
index 00000000..2a6b8d3c
--- /dev/null
+++ b/Documentation/public-inbox-v1-format.pod
@@ -0,0 +1,171 @@
+% public-inbox developer manual
+
+=head1 NAME
+
+public-inbox v1 git repository and tree description (aka "ssoma")
+
+=head1 DESCRIPTION
+
+WARNING: this does NOT describe the scalable v2 format used
+by public-inbox.  Use of ssoma is not recommended for new
+installations due to scalability problems.
+
+ssoma uses a git repository to store each email as a git blob.
+The tree filename of the blob is based on the SHA1 hexdigest of
+the first Message-ID header.  A commit is made for each message
+delivered.  The commit SHA-1 identifier is used by ssoma clients
+to track synchronization state.
+
+=head1 PATHNAMES IN TREES
+
+A Message-ID may be extremely long and also contain slashes, so using
+them as a path name is challenging.  Instead we use the SHA-1 hexdigest
+of the Message-ID (excluding the leading "E<lt>" and trailing "E<gt>")
+to generate a path name.  Leading and trailing white space in the
+Message-ID header is ignored for hashing.
+
+A message with Message-ID of: E<lt>20131106023245.GA20224@dcvr.yhbt.netE<gt>
+
+Would be stored as: f2/8c6cfd2b0a65f994c3e1be266105413b3d3f63
+
+Thus it is easy to look up the contents of a message matching a given
+Message-ID.
+
+=head1 MESSAGE-ID CONFLICTS
+
+public-inbox v1 repositories currently do not resolve conflicting
+Message-IDs or messages with multiple Message-IDs.
+
+=head1 HEADERS
+
+The Message-ID header is required.
+"Bytes", "Lines" and "Content-Length" headers are stripped and not
+allowed, as they can interfere with further processing.
+When using ssoma with public-inbox-mda, the "Status" mbox header
+is also stripped as that header makes no sense in a public archive.
+
+=head1 LOCKING
+
+L<flock(2)> locking exclusively locks the empty $GIT_DIR/ssoma.lock file
+for all non-atomic operations.
+
+=head1 EXAMPLE INPUT FLOW (SERVER-SIDE MDA)
+
+1. Message is delivered to a mail transport agent (MTA)
+
+1a. (optional) reject/discard spam; this should run before ssoma-mda
+
+1b. (optional) reject/strip unwanted attachments
+
+ssoma-mda handles all steps once invoked.
+
+2. Mail transport agent invokes ssoma-mda
+
+3. reads message via stdin, extracting Message-ID
+
+4. acquires exclusive flock lock on $GIT_DIR/ssoma.lock
+
+5. creates or updates the blob of associated 2/38 SHA-1 path
+
+6. updates the index and commits
+
+7. releases $GIT_DIR/ssoma.lock
+
+ssoma-mda can also be used as an L<inotify(7)> trigger to monitor maildirs,
+and the ability to monitor IMAP mailboxes using IDLE will be available
+in the future.
+
+=head1 GIT REPOSITORIES (SERVERS)
+
+ssoma uses bare git repositories on both servers and clients.
+
+Using the L<git-init(1)> command with --bare is the recommended method
+of creating a git repository on a server:
+
+        git init --bare /path/to/wherever/you/want.git
+
+There are no standardized paths for servers; administrators make
+all the choices regarding git repository locations.
+
+Special files in $GIT_DIR on the server:
+
+=over
+
+=item $GIT_DIR/ssoma.lock
+
+An empty file for L<flock(2)> locking.
+This is necessary to ensure the index and commits are updated
+consistently and multiple processes running MDA do not step on
+each other.
+
+=item $GIT_DIR/public-inbox/msgmap.sqlite3
+
+SQLite3 database maintaining a stable mapping of Message-IDs to NNTP
+article numbers.  Used by L<public-inbox-nntpd(1)> and created
+and updated by L<public-inbox-index(1)>.
+
+Automatically updated by L<public-inbox-mda(1)>,
+L<public-inbox-learn(1)> and L<public-inbox-watch(1)>.
+
+Losing or damaging this file will cause synchronization problems for
+NNTP clients.  This file is expected to be stable and require no
+updates to its schema.
+
+Requires L<DBD::SQLite>.
+
+=item $GIT_DIR/public-inbox/xapian$N/
+
+Xapian database for search indices in the PSGI web UI.
+
+$N is the value of PublicInbox::Search::SCHEMA_VERSION, and
+installations may have parallel versions on disk during upgrades
+or to roll-back upgrades.
+
+This is created and updated by L<public-inbox-index(1)>.
+
+Automatically updated by L<public-inbox-mda(1)>,
+L<public-inbox-learn(1)> and L<public-inbox-watch(1)>.
+
+This directory can always be regenerated with L<public-inbox-index(1)>
+if lost or damaged.  There is no need to back it up unless the
+CPU/memory cost of regenerating it outweighs the storage/transfer cost.
+
+Since SCHEMA_VERSION 15 and the development of the v2 format,
+the "overview" DB also exists in the xapian directory for v1
+repositories.  See L<public-inbox-v2-format(5)/OVERVIEW DB>.
+
+=item $GIT_DIR/ssoma.index
+
+This file is no longer used or created by public-inbox, but it is
+updated if it exists to remain compatible with ssoma installations.
+
+A git index file used for MDA updates.  The normal git index (in
+$GIT_DIR/index) is not used at all as there is typically no working
+tree.
+
+=back
+
+Each client $GIT_DIR may have multiple mbox/maildir/command targets.
+It is possible for a client to extract the mail stored in the git
+repository to multiple mboxes for compatibility with a variety of
+different tools.
+
+=head1 CAVEATS
+
+It is NOT recommended to check out the working directory of a git
+repository, as there may be many files.
+
+It is impossible to completely expunge messages, even spam, as git
+retains full history.  Projects may (with adequate notice) cycle to new
+repositories/branches with history cleaned up via L<git-filter-branch(1)>.
+This is up to the administrators.
+
+=head1 COPYRIGHT
+
+Copyright 2013-2019 all contributors L<mailto:meta@public-inbox.org>
+
+License: AGPL-3.0+ L<http://www.gnu.org/licenses/agpl-3.0.txt>
+
+=head1 SEE ALSO
+
+L<gitrepository-layout(5)>, L<ssoma(1)>
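Reviewer note on the PATHNAMES IN TREES section added above: the path derivation it describes can be sketched in a few lines of Python. This is only an illustration of the prose, not code from public-inbox or ssoma. The function name `v1_path` is hypothetical, and hashing the UTF-8 bytes of the Message-ID is an assumption, since the text does not name an encoding.

```python
import hashlib


def v1_path(message_id_header: str) -> str:
    """Return a 2/38 tree path for a Message-ID header value (sketch)."""
    # Leading and trailing white space is ignored for hashing.
    mid = message_id_header.strip()
    # Exclude the leading "<" and trailing ">" before hashing.
    if mid.startswith("<"):
        mid = mid[1:]
    if mid.endswith(">"):
        mid = mid[:-1]
    # Assumption: the Message-ID is hashed as UTF-8 bytes.
    digest = hashlib.sha1(mid.encode("utf-8")).hexdigest()
    # Split the 40-character hexdigest into a 2-character directory
    # and a 38-character file name.
    return digest[:2] + "/" + digest[2:]


if __name__ == "__main__":
    print(v1_path("<20131106023245.GA20224@dcvr.yhbt.net>"))
```

The manpage gives f2/8c6cfd2b0a65f994c3e1be266105413b3d3f63 as the stored path for this Message-ID; running the sketch is a quick way to check whether these assumptions match that description.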
diff --git a/Documentation/public-inbox-v2-format.pod b/Documentation/public-inbox-v2-format.pod
new file mode 100644
index 00000000..05ef32a9
--- /dev/null
+++ b/Documentation/public-inbox-v2-format.pod
@@ -0,0 +1,234 @@
+% public-inbox developer manual
+
+=head1 NAME
+
+public-inbox v2 repository description
+
+=head1 DESCRIPTION
+
+The v2 format is designed primarily to address several
+scalability problems of the original format described at
+L<public-inbox-v1-format(5)>.  It also handles messages with
+conflicting Message-IDs.
+
+=head1 INBOX LAYOUT
+
+The key change in v2 is the inbox is no longer a bare git
+repository, but a directory with two or more git repositories.
+v2 divides git repositories by time "epochs" and Xapian
+databases for parallelism by "partitions".
+
+=head2 INBOX OVERVIEW AND DEFINITIONS
+
+  $EPOCH - Integer starting with 0 based on time
+  $SCHEMA_VERSION - PublicInbox::Search::SCHEMA_VERSION used by Xapian
+  $PART - Integer (0..NPROCESSORS)
+
+  foo/ # assuming "foo" is the name of the list
+  - inbox.lock                 # lock file (flock) to protect global state
+  - git/$EPOCH.git             # normal git repositories
+  - all.git                    # empty git repo, alternates to git/$EPOCH.git
+  - xap$SCHEMA_VERSION/$PART   # per-partition Xapian DB
+  - xap$SCHEMA_VERSION/over.sqlite3 # OVER-view DB for NNTP and threading
+  - msgmap.sqlite3             # same as the v1 msgmap
+
+For blob lookups, the reader only needs to open the "all.git"
+repository with $GIT_DIR/objects/info/alternates which references
+every $EPOCH.git repo.
+
+Individual $EPOCH.git repos DO NOT use alternates themselves as
+git currently limits recursion of alternates nesting depth to 5.
+
+=head2 GIT EPOCHS
+
+One of the inherent scalability problems with git itself is the
+full history of a project must be stored and carried around to
+all clients.  To address this problem, the v2 format uses
+multiple git repositories, stored as time-based "epochs".
+
+We currently divide epochs into roughly one gigabyte segments;
+but this size can be configurable (if needed) in the future.
+
+A pleasant side-effect of this design is the git packs of older
+epochs are stable, allowing them to be cloned without requiring
+expensive pack generation.  This also allows clients to clone
+only the epochs they are interested in to save bandwidth and
+storage.
+
+To minimize changes to existing v1-based code and simplify our
+code, we use the "alternates" mechanism described in
+L<gitrepository-layout(5)> to link all the epoch repositories
+with a single read-only "all.git" endpoint.
+
+Processes retrieve blobs via the "all.git" repository, while
+writers write blobs directly to epochs.
+
+=head2 GIT TREE LAYOUT
+
+One key problem specific to v1 was large trees were frequently a
+performance problem as name lookups are expensive and there were
+limited deltafication opportunities with unpredictable file
+names.  As a result, all Xapian-enabled installations retrieve
+blob object_ids directly in v1, bypassing tree lookups.
+
+While dividing git repositories into epochs caps the growth of
+trees, worst-case tree size was still unnecessary overhead and
+worth eliminating.
+
+So in contrast to the big trees of v1, the v2 git tree contains
+only a single file at the top-level of the tree, either 'm' (for
+'mail' or 'message') or 'd' (for deleted).  A tree does not have
+'m' and 'd' at the same time.
+
+Mail is still stored in blobs (instead of inline with the commit
+object) as we still need a stable reference in the indices in
+case commit history is rewritten to comply with legal
+requirements.
+
+After-the-fact invocations of L<public-inbox-index(1)> will ignore
+messages written to 'd' after they are written to 'm'.
+
+Deltafication is not significantly improved over v1, but overall
+storage for trees is made as small as possible.  Initial
+statistics and benchmarks showing the benefits of this approach
+are documented at:
+
+L<https://public-inbox.org/meta/20180209205140.GA11047@dcvr/>
+
+=head2 XAPIAN PARTITIONS
+
+A second scalability problem in v1 was the inability to
+utilize multiple CPU cores for Xapian indexing.  This is
+addressed by using partitions in Xapian to perform import
+indexing in parallel.
+
+As with git alternates, Xapian natively supports a read-only
+interface which transparently abstracts away the knowledge of
+multiple partitions.  This allows us to simplify our read-only
+code paths.
+
+The performance of the storage device is now the bottleneck on
+larger multi-core systems.  In our experience, performance
+improves with high-quality and high-quantity solid-state storage.
+Issuing TRIM commands with L<fstrim(8)> was necessary to maintain
+consistent performance while developing this feature.
+
+Rotational storage devices are NOT recommended for indexing of
+large mail archives; but are fine for backup and usable for
+small instances.
+
+=head2 OVERVIEW DB
+
+Towards the end of v2 development, it became apparent Xapian did
+not perform well for sorting large result sets used to generate
+the landing page in the PSGI UI (/$INBOX/) or many queries used
+by the NNTP server.  Thus, SQLite was employed and the Xapian
+"skeleton" DB was renamed to the "overview" DB (after the NNTP
+OVER/XOVER commands).
+
+The overview DB maintains all the header information necessary
+to implement the NNTP OVER/XOVER commands and non-search
+endpoints of the PSGI UI.
+
+In the future, Xapian will become completely optional for v2 (as
+it is for v1) as SQLite turns out to be powerful enough to
+maintain overview information.  Most of the PSGI and all of the
+NNTP functionality will be possible with only SQLite in addition
+to git.
+
+The overview DB was an instrumental piece in maintaining near
+constant-time read performance on a dataset 2-3 times larger
+than LKML history as of 2018.
+
+=head3 GHOST MESSAGES
+
+The overview DB also includes references to "ghost" messages,
+or messages which have replies but have not been seen by us.
+Thus it is expected to have more rows than the "msgmap" DB
+described below.
+
+=head2 msgmap.sqlite3
+
+The SQLite msgmap DB is unchanged from v1, but it is now at the
+top-level of the directory.
+
+=head1 OBJECT IDENTIFIERS
+
+There are three distinct types of identifiers.  content_id is the
+new one for v2 and should make message removal and deduplication
+easier.  object_id and Message-ID are already known.
+
+=over
+
+=item object_id
+
+The blob identifier git uses (currently SHA-1).  No need to
+publicly expose this outside of normal git ops (cloning) and
+there's no need to make this searchable.  As with v1 of
+public-inbox, this is stored as part of the Xapian document so
+expensive name lookups can be avoided for document retrieval.
+
+=item Message-ID
+
+The email header; duplicates allowed for archival purposes.
+This remains a searchable field in Xapian.  Note: it's possible
+for emails to have multiple Message-ID headers (and L<git-send-email(1)>
+had that bug for a bit); so we take all of them into account.
+In case of conflicts detected by content_id below, we generate a new
+Message-ID based on content_id; if the generated Message-ID still
+conflicts, a random one is generated.
+
+=item content_id
+
+A hash of relevant headers and raw body content for
+purging of unwanted content.  This is not stored anywhere,
+but always calculated on-the-fly.
+
+For now, the relevant headers are:
+
+        Subject, From, Date, References, In-Reply-To, To, Cc
+
+Received, List-Id, and similar headers are NOT part of content_id as
+they differ across lists and we will want removal to be able to cross
+lists.
+
+The textual parts of the body are decoded, CRLF normalized to
+LF, and trailing whitespace stripped.  Notably, hashing the
+raw body risks being broken by list signatures; but we can use
+filters (e.g. PublicInbox::Filter::Vger) to clean the body for
+imports.
+
+content_id is SHA-256 for now; but can be changed at any time
+without making DB changes.
+
+=back
+
+=head1 LOCKING
+
+L<flock(2)> locking exclusively locks the empty inbox.lock file
+for all non-atomic operations.
+
+=head1 HEADERS
+
+Same handling as with v1, except the Message-ID header will
+be generated if not provided or conflicting.  "Bytes", "Lines"
+and "Content-Length" headers are stripped and not allowed, as they
+can interfere with further processing.
+
+The "Status" mbox header is also stripped as that header makes
+no sense in a public archive.
+
+=head1 THANKS
+
+Thanks to the Linux Foundation for sponsoring the development
+and testing of the v2 repository format.
+
+=head1 COPYRIGHT
+
+Copyright 2018-2019 all contributors L<mailto:meta@public-inbox.org>
+
+License: AGPL-3.0+ L<http://www.gnu.org/licenses/agpl-3.0.txt>
+
+=head1 SEE ALSO
+
+L<gitrepository-layout(5)>, L<public-inbox-v1-format(5)>
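Reviewer note on the content_id item added above: the description (a SHA-256 over selected headers plus the decoded text body, with CRLF normalized to LF and trailing whitespace stripped) can be sketched as follows. This is a rough illustration only. The function name `content_id_sketch` is hypothetical, and the serialization of headers into the hash is an assumption, as is reading "trailing whitespace" as both per-line and end-of-body. It will not reproduce the values public-inbox computes.

```python
import hashlib
from email import message_from_string

# Headers the manpage lists as relevant "for now".
RELEVANT_HEADERS = ("Subject", "From", "Date", "References",
                    "In-Reply-To", "To", "Cc")


def content_id_sketch(raw: str) -> str:
    """Hash the relevant headers and normalized text body (sketch)."""
    msg = message_from_string(raw)
    h = hashlib.sha256()
    # Received, List-Id and similar headers are deliberately skipped,
    # since they differ across lists.
    for name in RELEVANT_HEADERS:
        for value in msg.get_all(name, []):
            # Assumption: "Name: value\n" is a hypothetical serialization.
            h.update(f"{name}: {value.strip()}\n".encode("utf-8"))
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "utf-8"
        text = payload.decode(charset, errors="replace")
        # Normalize CRLF to LF, then strip trailing whitespace.
        text = text.replace("\r\n", "\n")
        text = "\n".join(line.rstrip() for line in text.split("\n")).rstrip()
        h.update(text.encode("utf-8"))
    return h.hexdigest()
```

With this sketch, two copies of a message that differ only in Received headers, line endings, or trailing whitespace hash to the same value. That is the property the manpage relies on for removing a message across lists.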