From cf35d38e7f845393659dfce0249a76d529a2c92c Mon Sep 17 00:00:00 2001
From: Eric Wong
Date: Wed, 2 Jan 2019 08:23:13 +0000
Subject: update and add documentation for repository formats

Remove confusing documentation around ssoma now that we have
NNTP and downloadable mbox support.

Only lightly-checked for grammar and speling, and not yet
formatting. Edits, corrections and addendums expected :>
---
 Documentation/public-inbox-v1-format.pod | 171 +++++++++++++++++++++++++++++++
 1 file changed, 171 insertions(+)
 create mode 100644 Documentation/public-inbox-v1-format.pod

diff --git a/Documentation/public-inbox-v1-format.pod b/Documentation/public-inbox-v1-format.pod
new file mode 100644
index 00000000..2a6b8d3c
--- /dev/null
+++ b/Documentation/public-inbox-v1-format.pod
@@ -0,0 +1,171 @@
+% public-inbox developer manual
+
+=head1 NAME
+
+public-inbox v1 git repository and tree description (aka "ssoma")
+
+=head1 DESCRIPTION
+
+WARNING: this does NOT describe the scalable v2 format used
+by public-inbox. Use of ssoma is not recommended for new
+installations due to scalability problems.
+
+ssoma uses a git repository to store each email as a git blob.
+The tree filename of the blob is based on the SHA-1 hexdigest of
+the first Message-ID header. A commit is made for each message
+delivered. The commit SHA-1 identifier is used by ssoma clients
+to track synchronization state.
+
+=head1 PATHNAMES IN TREES
+
+A Message-ID may be extremely long and also contain slashes, so using
+it as a path name is challenging. Instead we use the SHA-1 hexdigest
+of the Message-ID (excluding the leading "E<lt>" and trailing "E<gt>")
+to generate a path name. Leading and trailing white space in the
+Message-ID header is ignored for hashing.
+
+A message with Message-ID of: E<lt>20131106023245.GA20224@dcvr.yhbt.netE<gt>
+
+Would be stored as: f2/8c6cfd2b0a65f994c3e1be266105413b3d3f63
+
+Thus it is easy to look up the contents of a message matching a given
+Message-ID.
+
+=head1 MESSAGE-ID CONFLICTS
+
+public-inbox v1 repositories currently do not resolve conflicting
+Message-IDs or messages with multiple Message-IDs.
+
+=head1 HEADERS
+
+The Message-ID header is required.
+"Bytes", "Lines" and "Content-Length" headers are stripped and not
+allowed, as they can interfere with further processing.
+When using ssoma with public-inbox-mda, the "Status" mbox header
+is also stripped as that header makes no sense in a public archive.
+
+=head1 LOCKING
+
+L<flock(2)> locking exclusively locks the empty $GIT_DIR/ssoma.lock file
+for all non-atomic operations.
+
+=head1 EXAMPLE INPUT FLOW (SERVER-SIDE MDA)
+
+1. Message is delivered to a mail transport agent (MTA)
+
+1a. (optional) reject/discard spam; this should run before ssoma-mda
+
+1b. (optional) reject/strip unwanted attachments
+
+ssoma-mda handles all steps once invoked.
+
+2. Mail transport agent invokes ssoma-mda
+
+3. reads message via stdin, extracting Message-ID
+
+4. acquires exclusive flock lock on $GIT_DIR/ssoma.lock
+
+5. creates or updates the blob at the associated 2/38 SHA-1 path
+
+6. updates the index and commits
+
+7. releases $GIT_DIR/ssoma.lock
+
+ssoma-mda can also be used as an L trigger to monitor maildirs,
+and the ability to monitor IMAP mailboxes using IDLE will be available
+in the future.
+
+=head1 GIT REPOSITORIES (SERVERS)
+
+ssoma uses bare git repositories on both servers and clients.
+
+Using the L<git-init(1)> command with --bare is the recommended method
+of creating a git repository on a server:
+
+	git init --bare /path/to/wherever/you/want.git
+
+There are no standardized paths for servers; administrators make
+all the choices regarding git repository locations.
+
+Special files in $GIT_DIR on the server:
+
+=over
+
+=item $GIT_DIR/ssoma.lock
+
+An empty file for L<flock(2)> locking.
+This is necessary to ensure the index and commits are updated
+consistently and multiple processes running MDA do not step on
+each other.
+
+=item $GIT_DIR/public-inbox/msgmap.sqlite3
+
+SQLite3 database maintaining a stable mapping of Message-IDs to NNTP
+article numbers. Used by L and created
+and updated by L.
+
+Automatically updated by L,
+L and L.
+
+Losing or damaging this file will cause synchronization problems for
+NNTP clients. This file is expected to be stable and require no
+updates to its schema.
+
+Requires L.
+
+=item $GIT_DIR/public-inbox/xapian$N/
+
+Xapian database for search indices in the PSGI web UI.
+
+$N is the value of PublicInbox::Search::SCHEMA_VERSION, and
+installations may have parallel versions on disk during upgrades
+or to roll back upgrades.
+
+This is created and updated by L.
+
+Automatically updated by L,
+L and L.
+
+This directory can always be regenerated with L
+if lost or damaged, so there is no need to back it up unless the
+CPU/memory cost of regenerating it outweighs the storage/transfer cost.
+
+Since SCHEMA_VERSION 15 and the development of the v2 format,
+the "overview" DB also exists in the xapian directory for v1
+repositories. See L
+
+=item $GIT_DIR/ssoma.index
+
+This file is no longer used or created by public-inbox, but it is
+updated if it exists to remain compatible with ssoma installations.
+
+A git index file used for MDA updates. The normal git index (in
+$GIT_DIR/index) is not used at all as there is typically no working
+tree.
+
+=back
+
+Each client $GIT_DIR may have multiple mbox/maildir/command targets.
+It is possible for a client to extract the mail stored in the git
+repository to multiple mboxes for compatibility with a variety of
+different tools.
+
+=head1 CAVEATS
+
+It is NOT recommended to check out the working directory of a git
+repository, as there may be many files.
+
+It is impossible to completely expunge messages, even spam, as git
+retains full history. Projects may (with adequate notice) cycle to new
+repositories/branches with history cleaned up via L.
+This is up to the administrators.
+
+=head1 COPYRIGHT
+
+Copyright 2013-2019 all contributors L
+
+License: AGPL-3.0+ L
+
+=head1 SEE ALSO
+
+L, L
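
For readers of the patch, the path derivation described under PATHNAMES IN TREES and the lock/update/release steps (4 through 7) of the example input flow can be sketched in Python. This is only an illustration under stated assumptions, not the code used by ssoma or public-inbox: the function names `v1_tree_path` and `with_ssoma_lock` are hypothetical, the Message-ID is assumed to be hashed as UTF-8 bytes, and `fcntl.flock` assumes a Unix-like system.

```python
import fcntl
import hashlib
import os
import tempfile


def v1_tree_path(message_id: str) -> str:
    """Map a Message-ID to a 2/38 tree path (hypothetical helper)."""
    # Leading and trailing white space is ignored for hashing.
    mid = message_id.strip()
    # The enclosing angle brackets are excluded from the hashed value.
    if mid.startswith("<") and mid.endswith(">"):
        mid = mid[1:-1]
    # Assumption: the Message-ID is hashed as UTF-8 encoded bytes.
    digest = hashlib.sha1(mid.encode("utf-8")).hexdigest()
    # First 2 hex digits name the directory, the remaining 38 the file.
    return digest[:2] + "/" + digest[2:]


def with_ssoma_lock(git_dir: str, update):
    """Run update() while holding an exclusive flock on ssoma.lock."""
    lock_path = os.path.join(git_dir, "ssoma.lock")
    # Mode "a" creates the lock file if it is missing and never
    # writes to it here, so the file stays empty.
    with open(lock_path, "a") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # step 4: acquire the lock
        try:
            return update()  # steps 5-6: blob, index and commit updates
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)  # step 7: release the lock


if __name__ == "__main__":
    # Stand-in for a bare repository's $GIT_DIR; no git commands are run.
    git_dir = tempfile.mkdtemp()
    path = v1_tree_path("<20131106023245.GA20224@dcvr.yhbt.net>")
    print(with_ssoma_lock(git_dir, lambda: path))
```

The `update` callback stands in for the git blob, index and commit operations, which are deliberately left out of the sketch. Holding the lock around the whole callback mirrors the statement that all non-atomic operations take the exclusive lock.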