user/dev discussion of public-inbox itself
From: Eric Wong <>
Subject: [PATCH] update and add documentation for repository formats
Date: Wed,  2 Jan 2019 08:33:05 +0000
Message-ID: <>

Remove confusing documentation around ssoma now that we
have NNTP and downloadable mbox support.

Only lightly-checked for grammar and spelling, and not yet
formatting.  Edits, corrections and addendums expected :>
 Documentation/design_notes.txt           |  18 +-
 Documentation/                 |   4 +-
 Documentation/public-inbox-mda.pod       |   2 +-
 Documentation/public-inbox-v1-format.pod | 171 +++++++++++++++++
 Documentation/public-inbox-v2-format.pod | 234 +++++++++++++++++++++++
 INSTALL                                  |   4 +-
 MANIFEST                                 |   2 +
 README                                   |  20 +-
 lib/PublicInbox/                |   6 +-
 9 files changed, 430 insertions(+), 31 deletions(-)
 create mode 100644 Documentation/public-inbox-v1-format.pod
 create mode 100644 Documentation/public-inbox-v2-format.pod

diff --git a/Documentation/design_notes.txt b/Documentation/design_notes.txt
index c5d9427..9ad4977 100644
--- a/Documentation/design_notes.txt
+++ b/Documentation/design_notes.txt
@@ -27,9 +27,7 @@ Use existing infrastructure
 * Existing spam filtering on an SMTP server is also effective on
-* readers may continue using use their choice of mail clients and
-  mailbox formats, only learning a few commands of the ssoma(1) tool
-  is required.
+* Readers may continue using their choice of NNTP and mail clients.
 * Atom is a reasonable feed format for casual readers and is supported
   by a variety of feed readers.
@@ -145,19 +143,11 @@ What sucks about public-inbox
 Scalability notes
-Even with shallow clone, storing the history of large/busy mailing lists
-may place much burden on subscribers and servers.  However, having a
-single (or few) refs representing the entire history of a list is good
-for small lists since it's easier to look up a message by Message-ID, so
-we want to avoid splitting refs with independent histories.
-ssoma will likely grow its own built-in ref rotation system based on
-message count (not rotating at fixed time intervals).  This would
-split the histories and require O(n) lookup time based on Message-ID,
-where `n' is the number of history splits.
+See the public-inbox-v2-format(5) manpage for all the scalability
+problems solved.
-Copyright 2013-2018 all contributors <>
+Copyright 2013-2019 all contributors <>
 License: AGPL-3.0+ <>
diff --git a/Documentation/ b/Documentation/
index ad7b80a..28fa757 100644
--- a/Documentation/
+++ b/Documentation/
@@ -1,4 +1,4 @@
-# Copyright (C) 2013-2018 all contributors <>
+# Copyright (C) 2013-2019 all contributors <>
 # License: AGPL-3.0+ <>
@@ -24,6 +24,8 @@ m1 += public-inbox-watch
 m1 += public-inbox-index
 m5 =
 m5 += public-inbox-config
+m5 += public-inbox-v1-format
+m5 += public-inbox-v2-format
 m7 =
 m7 += public-inbox-overview
 m8 =
diff --git a/Documentation/public-inbox-mda.pod b/Documentation/public-inbox-mda.pod
index 1a5ade8..41a697b 100644
--- a/Documentation/public-inbox-mda.pod
+++ b/Documentation/public-inbox-mda.pod
@@ -56,4 +56,4 @@ License: AGPL-3.0+ L<>
 =head1 SEE ALSO
-L<git(1)>, L<git-config(1)>, L<ssoma_repository(5)>
+L<git(1)>, L<git-config(1)>, L<public-inbox-v1-format(5)>
diff --git a/Documentation/public-inbox-v1-format.pod b/Documentation/public-inbox-v1-format.pod
new file mode 100644
index 0000000..2a6b8d3
--- /dev/null
+++ b/Documentation/public-inbox-v1-format.pod
@@ -0,0 +1,171 @@
+% public-inbox developer manual
+=head1 NAME
+public-inbox v1 git repository and tree description (aka "ssoma")
+WARNING: this does NOT describe the scalable v2 format used
+by public-inbox.  Use of ssoma is not recommended for new
+installations due to scalability problems.
+ssoma uses a git repository to store each email as a git blob.
+The tree filename of the blob is based on the SHA1 hexdigest of
+the first Message-ID header.  A commit is made for each message
+delivered.  The commit SHA-1 identifier is used by ssoma clients
+to track synchronization state.
+A Message-ID may be extremely long and also contain slashes, so using
+one as a path name is challenging.  Instead we use the SHA-1 hexdigest
+of the Message-ID (excluding the leading "E<lt>" and trailing "E<gt>")
+to generate a path name.  Leading and trailing white space in the
+Message-ID header is ignored for hashing.
+A message with Message-ID of: E<lt>20131106023245.GA20224@dcvr.yhbt.netE<gt>
+Would be stored as: f2/8c6cfd2b0a65f994c3e1be266105413b3d3f63
+Thus it is easy to look up the contents of a message matching a given
+Message-ID.
+public-inbox v1 repositories currently do not resolve conflicting
+Message-IDs or messages with multiple Message-IDs.
+=head1 HEADERS
+The Message-ID header is required.
+"Bytes", "Lines" and "Content-Length" headers are stripped and not
+allowed, as they can interfere with further processing.
+When using ssoma with public-inbox-mda, the "Status" mbox header
+is also stripped as that header makes no sense in a public archive.
+=head1 LOCKING
+L<flock(2)> locking exclusively locks the empty $GIT_DIR/ssoma.lock file
+for all non-atomic operations.
+1. Message is delivered to a mail transport agent (MTA)
+1a. (optional) reject/discard spam, this should run before ssoma-mda
+1b. (optional) reject/strip unwanted attachments
+ssoma-mda handles all steps once invoked.
+2. Mail transport agent invokes ssoma-mda
+3. reads message via stdin, extracting Message-ID
+4. acquires exclusive flock lock on $GIT_DIR/ssoma.lock
+5. creates or updates the blob of associated 2/38 SHA-1 path
+6. updates the index and commits
+7. releases $GIT_DIR/ssoma.lock
+ssoma-mda can also be used as an L<inotify(7)> trigger to monitor maildirs,
+and the ability to monitor IMAP mailboxes using IDLE will be available
+in the future.
+ssoma uses bare git repositories on both servers and clients.
+Using the L<git-init(1)> command with --bare is the recommended method
+of creating a git repository on a server:
+	git init --bare /path/to/wherever/you/want.git
+There are no standardized paths for servers; administrators make
+all the choices regarding git repository locations.
+Special files in $GIT_DIR on the server:
+=item $GIT_DIR/ssoma.lock
+An empty file for L<flock(2)> locking.
+This is necessary to ensure the index and commits are updated
+consistently and multiple processes running MDA do not step on
+each other.
+=item $GIT_DIR/public-inbox/msgmap.sqlite3
+SQLite3 database maintaining a stable mapping of Message-IDs to NNTP
+article numbers.  Used by L<public-inbox-nntpd(1)> and created
+and updated by L<public-inbox-index(1)>.
+Automatically updated by L<public-inbox-mda(1)>,
+L<public-inbox-learn(1)> and L<public-inbox-watch(1)>.
+Losing or damaging this file will cause synchronization problems for
+NNTP clients.  This file is expected to be stable and require no
+updates to its schema.
+Requires L<DBD::SQLite>.
+=item $GIT_DIR/public-inbox/xapian$N/
+Xapian database for search indices in the PSGI web UI.
+$N is the value of PublicInbox::Search::SCHEMA_VERSION, and
+installations may have parallel versions on disk during upgrades
+or to roll-back upgrades.
+This is created and updated by L<public-inbox-index(1)>.
+Automatically updated by L<public-inbox-mda(1)>,
+L<public-inbox-learn(1)> and L<public-inbox-watch(1)>.
+This directory can always be regenerated with L<public-inbox-index(1)>
+if lost or damaged.  There is no need to back it up unless the
+CPU/memory cost of regenerating it outweighs the storage/transfer cost.
+Since SCHEMA_VERSION 15 and the development of the v2 format,
+the "overview" DB also exists in the xapian directory for v1
+repositories.  See L<public-inbox-v2-format(5)/OVERVIEW DB>.
+=item $GIT_DIR/ssoma.index
+This file is no longer used or created by public-inbox, but it is
+updated if it exists to remain compatible with ssoma installations.
+A git index file used for MDA updates.  The normal git index (in
+$GIT_DIR/index) is not used at all as there is typically no working
+Each client $GIT_DIR may have multiple mbox/maildir/command targets.
+It is possible for a client to extract the mail stored in the git
+repository to multiple mboxes for compatibility with a variety of
+different tools.
+=head1 CAVEATS
+It is NOT recommended to check out the working directory of a git
+repository, as there may be many files.
+It is impossible to completely expunge messages, even spam, as git
+retains full history.  Projects may (with adequate notice) cycle to new
+repositories/branches with history cleaned up via L<git-filter-branch(1)>.
+This is up to the administrators.
+Copyright 2013-2019 all contributors L<>
+License: AGPL-3.0+ L<>
+=head1 SEE ALSO
+L<gitrepository-layout(5)>, L<ssoma(1)>
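
Reviewer note: the path rule described in the v1 document above (SHA-1
hexdigest of the Message-ID without its angle brackets, split 2/38) can
be sketched in Python.  This is a minimal illustration, not the ssoma or
public-inbox implementation; the function name v1_blob_path is
hypothetical, and encoding the Message-ID as UTF-8 before hashing is an
assumption.

```python
import hashlib


def v1_blob_path(message_id_header: str) -> str:
    """Map a Message-ID header value to a 2/38 tree path (sketch only)."""
    # Leading and trailing white space is ignored for hashing.
    mid = message_id_header.strip()
    # Exclude the leading "<" and trailing ">" if present.
    if mid.startswith("<"):
        mid = mid[1:]
    if mid.endswith(">"):
        mid = mid[:-1]
    # Assumption: the Message-ID is hashed as UTF-8 encoded bytes.
    digest = hashlib.sha1(mid.encode("utf-8")).hexdigest()
    # The first 2 hex digits form the directory, the other 38 the file name.
    return digest[:2] + "/" + digest[2:]


if __name__ == "__main__":
    print(v1_blob_path("<20131106023245.GA20224@dcvr.yhbt.net>"))
```

If the sketch matches the documented rule, the Message-ID used here
should print the path quoted in the v1 document.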
diff --git a/Documentation/public-inbox-v2-format.pod b/Documentation/public-inbox-v2-format.pod
new file mode 100644
index 0000000..05ef32a
--- /dev/null
+++ b/Documentation/public-inbox-v2-format.pod
@@ -0,0 +1,234 @@
+% public-inbox developer manual
+=head1 NAME
+public-inbox v2 repository description
+The v2 format is designed primarily to address several
+scalability problems of the original format described at
+L<public-inbox-v1-format(5)>.  It also handles messages with
+The key change in v2 is the inbox is no longer a bare git
+repository, but a directory with two or more git repositories.
+v2 divides git repositories by time "epochs" and Xapian
+databases for parallelism by "partitions".
+$EPOCH - Integer starting with 0 based on time
+$SCHEMA_VERSION - PublicInbox::Search::SCHEMA_VERSION used by Xapian
+$PART - Integer (0..NPROCESSORS)
+foo/ # assuming "foo" is the name of the list
+- inbox.lock                 # lock file (flock) to protect global state
+- git/$EPOCH.git             # normal git repositories
+- all.git                    # empty git repo, alternates to git/$EPOCH.git
+- xap$SCHEMA_VERSION/$PART   # per-partition Xapian DB
+- xap$SCHEMA_VERSION/over.sqlite3 # OVER-view DB for NNTP and threading
+- msgmap.sqlite3             # same as the v1 msgmap
+For blob lookups, the reader only needs to open the "all.git"
+repository with $GIT_DIR/objects/info/alternates which references
+every $EPOCH.git repo.
+Individual $EPOCH.git repos DO NOT use alternates themselves as
+git currently limits recursion of alternates nesting depth to 5.
+=head2 GIT EPOCHS
+One of the inherent scalability problems with git itself is the
+full history of a project must be stored and carried around to
+all clients.  To address this problem, the v2 format uses
+multiple git repositories, stored as time-based "epochs".
+We currently divide epochs into roughly one gigabyte segments;
+but this size can be configurable (if needed) in the future.
+A pleasant side-effect of this design is the git packs of older
+epochs are stable, allowing them to be cloned without requiring
+expensive pack generation.  This also allows clients to clone
+only the epochs they are interested in to save bandwidth and
+To minimize changes to existing v1-based code and simplify our
+code, we use the "alternates" mechanism described in
+L<gitrepository-layout(5)> to link all the epoch repositories
+with a single read-only "all.git" endpoint.
+Processes retrieve blobs via the "all.git" repository, while
+writers write blobs directly to epochs.
+One key problem specific to v1 was large trees were frequently a
+performance problem as name lookups are expensive and there were
+limited deltafication opportunities with unpredictable file
+names.  As a result, all Xapian-enabled installations retrieve
+blob object_ids directly in v1, bypassing tree lookups.
+While dividing git repositories into epochs caps the growth of
+trees, worst-case tree size was still unnecessary overhead and
+worth eliminating.
+So in contrast to the big trees of v1, the v2 git tree contains
+only a single file at the top-level of the tree, either 'm' (for
+'mail' or 'message') or 'd' (for deleted).  A tree does not have
+'m' and 'd' at the same time.
+Mail is still stored in blobs (instead of inline with the commit
+object) as we still need a stable reference in the indices in
+case commit history is rewritten to comply with legal
+After-the-fact invocations of L<public-inbox-index(1)> will ignore
+messages written to 'd' after they are written to 'm'.
+Deltafication is not significantly improved over v1, but overall
+storage for trees is made as small as possible.  Initial
+statistics and benchmarks showing the benefits of this approach
+are documented at:
+A second scalability problem in v1 was the inability to
+utilize multiple CPU cores for Xapian indexing.  This is
+addressed by using partitions in Xapian to perform import
+indexing in parallel.
+As with git alternates, Xapian natively supports a read-only
+interface which transparently abstracts away the knowledge of
+multiple partitions.  This allows us to simplify our read-only
+code paths.
+The performance of the storage device is now the bottleneck on
+larger multi-core systems.  In our experience, performance
+improves with high-quality and high-quantity solid-state storage.
+Issuing TRIM commands with L<fstrim(8)> was necessary to maintain
+consistent performance while developing this feature.
+Rotational storage devices are NOT recommended for indexing of
+large mail archives; but are fine for backup and usable for
+small instances.
+Towards the end of v2 development, it became apparent Xapian did
+not perform well for sorting large result sets used to generate
+the landing page in the PSGI UI (/$INBOX/) or many queries used
+by the NNTP server.  Thus, SQLite was employed and the Xapian
+"skeleton" DB was renamed to the "overview" DB (after the NNTP
+OVER/XOVER commands).
+The overview DB maintains all the header information necessary
+to implement the NNTP OVER/XOVER commands and non-search
+endpoints of the PSGI UI.
+In the future, Xapian will become completely optional for v2 (as
+it is for v1) as SQLite turns out to be powerful enough to
+maintain overview information.  Most of the PSGI and all of the
+NNTP functionality will be possible with only SQLite in addition
+to git.
+The overview DB was an instrumental piece in maintaining near
+constant-time read performance on a dataset 2-3 times larger
+than LKML history as of 2018.
+The overview DB also includes references to "ghost" messages,
+or messages which have replies but have not been seen by us.
+Thus it is expected to have more rows than the "msgmap" DB
+described below.
+=head2 msgmap.sqlite3
+The SQLite msgmap DB is unchanged from v1, but it is now at the
+top-level of the directory.
+There are three distinct types of identifiers.  content_id is the
+new one for v2 and should make message removal and deduplication
+easier.  object_id and Message-ID are already known.
+=item object_id
+The blob identifier git uses (currently SHA-1).  No need to
+publicly expose this outside of normal git ops (cloning) and
+there's no need to make this searchable.  As with v1 of
+public-inbox, this is stored as part of the Xapian document so
+expensive name lookups can be avoided for document retrieval.
+=item Message-ID
+The email header; duplicates allowed for archival purposes.
+This remains a searchable field in Xapian.  Note: it's possible
+for emails to have multiple Message-ID headers (and L<git-send-email(1)>
+had that bug for a bit); so we take all of them into account.
+In case of conflicts detected by content_id below, we generate a new
+Message-ID based on content_id; if the generated Message-ID still
+conflicts, a random one is generated.
+=item content_id
+A hash of relevant headers and raw body content for
+purging of unwanted content.  This is not stored anywhere,
+but always calculated on-the-fly.
+For now, the relevant headers are:
+	Subject, From, Date, References, In-Reply-To, To, Cc
+Received, List-Id, and similar headers are NOT part of content_id as
+they differ across lists and we will want removal to be able to cross
+The textual parts of the body are decoded, CRLF normalized to
+LF, and trailing whitespace stripped.  Notably, hashing the
+raw body risks being broken by list signatures; but we can use
+filters (e.g. PublicInbox::Filter::Vger) to clean the body for
+content_id is SHA-256 for now; but can be changed at any time
+without making DB changes.
+=head1 LOCKING
+L<flock(2)> locking exclusively locks the empty inbox.lock file
+for all non-atomic operations.
+=head1 HEADERS
+Same handling as with v1, except the Message-ID header will
+be generated if not provided or conflicting.  "Bytes", "Lines"
+and "Content-Length" headers are stripped and not allowed, as they
+can interfere with further processing.
+The "Status" mbox header is also stripped as that header makes
+no sense in a public archive.
+=head1 THANKS
+Thanks to the Linux Foundation for sponsoring the development
+and testing of the v2 repository format.
+Copyright 2018-2019 all contributors L<>
+License: AGPL-3.0+ L<>
+=head1 SEE ALSO
+L<gitrepository-layout(5)>, L<public-inbox-v1-format(5)>
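
Reviewer note: the content_id description in the v2 document above (a
SHA-256 over the listed headers plus the body, with CRLF normalized to
LF and trailing whitespace stripped) can be sketched in Python.  This is
only an illustration of the idea.  The function name content_id_sketch
is hypothetical, and the byte layout fed to the hash (NUL-separated
"name, value" pairs, trailing whitespace stripped at the end of each
text part) is an assumption, so the result will not match the values
public-inbox computes.

```python
import hashlib
from email import message_from_string

# Headers the v2 document lists as relevant to content_id.
RELEVANT_HEADERS = ("Subject", "From", "Date", "References",
                    "In-Reply-To", "To", "Cc")


def content_id_sketch(raw_message: str) -> str:
    """Return a SHA-256 hex digest over selected headers and the body.

    Hypothetical helper: the hashed byte layout is an assumption and
    does not reproduce public-inbox's real content_id values.
    """
    msg = message_from_string(raw_message)
    digest = hashlib.sha256()

    # Only the listed headers contribute; Received, List-Id and similar
    # headers are skipped so copies from different lists hash the same.
    for name in RELEVANT_HEADERS:
        for value in msg.get_all(name, []):
            digest.update(f"{name}\0{str(value).strip()}\0".encode("utf-8"))

    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        payload = part.get_payload(decode=True) or b""
        if part.get_content_maintype() == "text":
            # Decode text, normalize CRLF to LF, drop trailing whitespace.
            try:
                text = payload.decode(part.get_content_charset() or "utf-8",
                                      errors="replace")
            except LookupError:  # unknown charset label
                text = payload.decode("utf-8", errors="replace")
            payload = text.replace("\r\n", "\n").rstrip().encode("utf-8")
        digest.update(payload)
    return digest.hexdigest()
```

Two copies of a message that differ only in Received or List-Id headers
and in line endings hash identically under this sketch, which is the
property the document relies on for removal across lists.  A list
footer appended to the body would still change the digest, which is why
the document mentions body-cleaning filters.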
diff --git a/INSTALL b/INSTALL
index 3fe0e4f..aa4afb5 100644
@@ -2,7 +2,7 @@ public-inbox (server-side) installation
 This is for folks who want to setup their own public-inbox instance.
-Clients should see instead
+Clients should use normal git-clone/git-fetch, or NNTP clients
 if they want to import mail into their personal inboxes.
 TODO: this still needs to be documented better,
@@ -134,5 +134,5 @@ installation is complete.
-Copyright 2013-2018 all contributors <>
+Copyright 2013-2019 all contributors <>
 License: AGPL-3.0+ <>
diff --git a/MANIFEST b/MANIFEST
index f25a580..d56cd85 100644
@@ -16,6 +16,8 @@ Documentation/public-inbox-index.pod
diff --git a/README b/README
index 26e0b69..ffd433d 100644
--- a/README
+++ b/README
@@ -22,8 +22,9 @@ to run their own instances with minimal overhead.
-public-inbox stores mail in a git repository keyed by Message-ID
-as documented in:
+public-inbox stores mail in git repositories as documented
+in and
 By storing (and optionally) exposing an inbox via git, it is
 fast and efficient to host and mirror public-inboxes.
@@ -35,10 +36,10 @@ discussions if archives do not expose Message-ID and References
 headers.  List server admins are also burdened with delivery
-public-inbox uses the "pull" model.  Casual readers may also
+public-inbox uses the "pull" model.  Casual readers may
 follow the list via NNTP, Atom feed or HTML archives.
-If a reader loses interest, they simply stop syncing.
+If a reader loses interest, they simply stop following.
 Since we use git, mirrors are easy-to-setup, and lists are
 easy-to-relocate to different mail addresses without losing
@@ -75,6 +76,9 @@ Requirements (participant)
   their mailers to reduce the impact of a public-inbox as a
   single point of failure.
+* The HTTP web interface exposes mboxrd files, and NNTP clients often
+  feature reply-by-email functionality
 * participants do not need to install public-inbox, only server admins
 Requirements (server)
@@ -123,10 +127,6 @@ You may also clone all messages via git:
 	git clone --mirror
 	torsocks git clone --mirror http://hjrcffqmbrq6wope.onion/meta/
-Or pass the same git repository URL for ssoma using the instructions at:
@@ -140,7 +140,7 @@ Content Filtering
 To discourage phishing, trackers, exploits and other nuisances,
-only plain-text emails are allowed and HTML is rejected.
+only plain-text emails are allowed and HTML is rejected by default.
 This improves accessibility, and saves bandwidth and storage
 as mail is archived forever.
@@ -151,7 +151,7 @@ aims to preserve the focus on content, and not presentation.
-Copyright 2013-2018 all contributors <>
+Copyright 2013-2019 all contributors <>
 License: AGPL-3.0+ <>
 This program is free software: you can redistribute it and/or modify
diff --git a/lib/PublicInbox/ b/lib/PublicInbox/
index 3df7d98..29c482f 100644
--- a/lib/PublicInbox/
+++ b/lib/PublicInbox/
@@ -1,4 +1,4 @@
-# Copyright (C) 2016-2018 all contributors <>
+# Copyright (C) 2016-2019 all contributors <>
 # License: AGPL-3.0+ <>
 # git fast-import-based ssoma-mda MDA replacement
@@ -635,8 +635,8 @@ =head1 SYNOPSYS
 An importer and remover for public-inboxes which takes L<Email::MIME>
-messages as input and stores them in a ssoma repository as
-documented in L<>,
+messages as input and stores them in a git repository as
+documented in L<>,
 except it does not allow duplicate Message-IDs.
 It requires L<git(1)> and L<git-fast-import(1)> to be installed.
