From cf35d38e7f845393659dfce0249a76d529a2c92c Mon Sep 17 00:00:00 2001
From: Eric Wong
Date: Wed, 2 Jan 2019 08:23:13 +0000
Subject: update and add documentation for repository formats

Remove confusing documentation around ssoma now that we have
NNTP and downloadable mbox support. Only lightly-checked for
grammar and speling, and not yet formatting. Edits, corrections
and addendums expected :>
---
 Documentation/design_notes.txt           |  18 +--
 Documentation/include.mk                 |   4 +-
 Documentation/public-inbox-mda.pod       |   2 +-
 Documentation/public-inbox-v1-format.pod | 171 ++++++++++++++++++++++
 Documentation/public-inbox-v2-format.pod | 234 +++++++++++++++++++++++++++++++
 5 files changed, 413 insertions(+), 16 deletions(-)
 create mode 100644 Documentation/public-inbox-v1-format.pod
 create mode 100644 Documentation/public-inbox-v2-format.pod

(limited to 'Documentation')

diff --git a/Documentation/design_notes.txt b/Documentation/design_notes.txt
index c5d9427b..9ad49774 100644
--- a/Documentation/design_notes.txt
+++ b/Documentation/design_notes.txt
@@ -27,9 +27,7 @@ Use existing infrastructure
 * Existing spam filtering on an SMTP server is also effective on
   public-inbox.
 
-* readers may continue using use their choice of mail clients and
-  mailbox formats, only learning a few commands of the ssoma(1) tool
-  is required.
+* Readers may continue using their choice of NNTP and mail clients.
 
 * Atom is a reasonable feed format for casual readers and is supported by
   a variety of feed readers.
@@ -145,19 +143,11 @@ What sucks about public-inbox
 Scalability notes
 -----------------
 
-Even with shallow clone, storing the history of large/busy mailing lists
-may place much burden on subscribers and servers. However, having a
-single (or few) refs representing the entire history of a list is good
-for small lists since it's easier to look up a message by Message-ID, so
-we want to avoid splitting refs with independent histories.
-
-ssoma will likely grow its own built-in ref rotation system based on
-message count (not rotating at fixed time intervals). This would
-split the histories and require O(n) lookup time based on Message-ID,
-where `n' is the number of history splits.
+See the public-inbox-v2-format(5) manpage for all the scalability
+problems solved.
 
 Copyright
 ---------
 
-Copyright 2013-2018 all contributors
+Copyright 2013-2019 all contributors
 License: AGPL-3.0+
diff --git a/Documentation/include.mk b/Documentation/include.mk
index ad7b80a6..28fa7574 100644
--- a/Documentation/include.mk
+++ b/Documentation/include.mk
@@ -1,4 +1,4 @@
-# Copyright (C) 2013-2018 all contributors
+# Copyright (C) 2013-2019 all contributors
 # License: AGPL-3.0+
 all::
 
@@ -24,6 +24,8 @@ m1 += public-inbox-watch
 m1 += public-inbox-index
 m5 =
 m5 += public-inbox-config
+m5 += public-inbox-v1-format
+m5 += public-inbox-v2-format
 m7 =
 m7 += public-inbox-overview
 m8 =
diff --git a/Documentation/public-inbox-mda.pod b/Documentation/public-inbox-mda.pod
index 1a5ade84..41a697b1 100644
--- a/Documentation/public-inbox-mda.pod
+++ b/Documentation/public-inbox-mda.pod
@@ -56,4 +56,4 @@ License: AGPL-3.0+ L
 
 =head1 SEE ALSO
 
-L, L, L
+L, L, L
diff --git a/Documentation/public-inbox-v1-format.pod b/Documentation/public-inbox-v1-format.pod
new file mode 100644
index 00000000..2a6b8d3c
--- /dev/null
+++ b/Documentation/public-inbox-v1-format.pod
@@ -0,0 +1,171 @@
+% public-inbox developer manual
+
+=head1 NAME
+
+public-inbox v1 git repository and tree description (aka "ssoma")
+
+=head1 DESCRIPTION
+
+WARNING: this does NOT describe the scalable v2 format used
+by public-inbox. Use of ssoma is not recommended for new
+installations due to scalability problems.
+
+ssoma uses a git repository to store each email as a git blob.
+The tree filename of the blob is based on the SHA1 hexdigest of
+the first Message-ID header. A commit is made for each message
+delivered.
+The commit SHA-1 identifier is used by ssoma clients
+to track synchronization state.
+
+=head1 PATHNAMES IN TREES
+
+A Message-ID may be extremely long and also contain slashes, so using
+them as a path name is challenging. Instead we use the SHA-1 hexdigest
+of the Message-ID (excluding the leading "E<lt>" and trailing "E<gt>")
+to generate a path name. Leading and trailing white space in the
+Message-ID header is ignored for hashing.
+
+A message with Message-ID of: E<lt>20131106023245.GA20224@dcvr.yhbt.netE<gt>
+
+Would be stored as: f2/8c6cfd2b0a65f994c3e1be266105413b3d3f63
+
+Thus it is easy to look up the contents of a message matching a given
+Message-ID.
+
+=head1 MESSAGE-ID CONFLICTS
+
+public-inbox v1 repositories currently do not resolve conflicting
+Message-IDs or messages with multiple Message-IDs.
+
+=head1 HEADERS
+
+The Message-ID header is required.
+"Bytes", "Lines" and "Content-Length" headers are stripped and not
+allowed, they can interfere with further processing.
+When using ssoma with public-inbox-mda, the "Status" mbox header
+is also stripped as that header makes no sense in a public archive.
+
+=head1 LOCKING
+
+L locking exclusively locks the empty $GIT_DIR/ssoma.lock file
+for all non-atomic operations.
+
+=head1 EXAMPLE INPUT FLOW (SERVER-SIDE MDA)
+
+1. Message is delivered to a mail transport agent (MTA)
+
+1a. (optional) reject/discard spam, this should run before ssoma-mda
+
+1b. (optional) reject/strip unwanted attachments
+
+ssoma-mda handles all steps once invoked.
+
+2. Mail transport agent invokes ssoma-mda
+
+3. reads message via stdin, extracting Message-ID
+
+4. acquires exclusive flock lock on $GIT_DIR/ssoma.lock
+
+5. creates or updates the blob of associated 2/38 SHA-1 path
+
+6. updates the index and commits
+
+7. releases $GIT_DIR/ssoma.lock
+
+ssoma-mda can also be used as an L trigger to monitor maildirs,
+and the ability to monitor IMAP mailboxes using IDLE will be available
+in the future.
+
+=head1 GIT REPOSITORIES (SERVERS)
+
+ssoma uses bare git repositories on both servers and clients.
+
+Using the L command with --bare is the recommended method
+of creating a git repository on a server:
+
+ git init --bare /path/to/wherever/you/want.git
+
+There are no standardized paths for servers, administrators make
+all the choices regarding git repository locations.
+
+Special files in $GIT_DIR on the server:
+
+=over
+
+=item $GIT_DIR/ssoma.lock
+
+An empty file for L locking.
+This is necessary to ensure the index and commits are updated
+consistently and multiple processes running MDA do not step on
+each other.
+
+=item $GIT_DIR/public-inbox/msgmap.sqlite3
+
+SQLite3 database maintaining a stable mapping of Message-IDs to NNTP
+article numbers. Used by L and created
+and updated by L.
+
+Automatically updated by L,
+L and L.
+
+Losing or damaging this file will cause synchronization problems for
+NNTP clients. This file is expected to be stable and require no
+updates to its schema.
+
+Requires L.
+
+=item $GIT_DIR/public-inbox/xapian$N/
+
+Xapian database for search indices in the PSGI web UI.
+
+$N is the value of PublicInbox::Search::SCHEMA_VERSION, and
+installations may have parallel versions on disk during upgrades
+or to roll-back upgrades.
+
+This is created and updated by L.
+
+Automatically updated by L,
+L and L.
+
+This directory can always be regenerated with L.
+If lost or damaged, there is no need to back it up unless the
+CPU/memory cost of regenerating it outweighs the storage/transfer cost.
+
+Since SCHEMA_VERSION 15 and the development of the v2 format,
+the "overview" DB also exists in the xapian directory for v1
+repositories. See L
+
+=item $GIT_DIR/ssoma.index
+
+This file is no longer used or created by public-inbox, but it is
+updated if it exists to remain compatible with ssoma installations.
+
+A git index file used for MDA updates.
+The normal git index (in
+$GIT_DIR/index) is not used at all as there is typically no working
+tree.
+
+=back
+
+Each client $GIT_DIR may have multiple mbox/maildir/command targets.
+It is possible for a client to extract the mail stored in the git
+repository to multiple mboxes for compatibility with a variety of
+different tools.
+
+=head1 CAVEATS
+
+It is NOT recommended to check out the working directory of a git
+repository, as there may be many files.
+
+It is impossible to completely expunge messages, even spam, as git
+retains full history. Projects may (with adequate notice) cycle to new
+repositories/branches with history cleaned up via L.
+This is up to the administrators.
+
+=head1 COPYRIGHT
+
+Copyright 2013-2019 all contributors L
+
+License: AGPL-3.0+ L
+
+=head1 SEE ALSO
+
+L, L
diff --git a/Documentation/public-inbox-v2-format.pod b/Documentation/public-inbox-v2-format.pod
new file mode 100644
index 00000000..05ef32a9
--- /dev/null
+++ b/Documentation/public-inbox-v2-format.pod
@@ -0,0 +1,234 @@
+% public-inbox developer manual
+
+=head1 NAME
+
+public-inbox v2 repository description
+
+=head1 DESCRIPTION
+
+The v2 format is designed primarily to address several
+scalability problems of the original format described at
+L. It also handles messages with conflicting
+Message-IDs.
+
+=head1 INBOX LAYOUT
+
+The key change in v2 is the inbox is no longer a bare git
+repository, but a directory with two or more git repositories.
+v2 divides git repositories by time "epochs" and Xapian
+databases for parallelism by "partitions".
+
+=head2 INBOX OVERVIEW AND DEFINITIONS
+
+  $EPOCH - Integer starting with 0 based on time
+  $SCHEMA_VERSION - PublicInbox::Search::SCHEMA_VERSION used by Xapian
+  $PART - Integer (0..NPROCESSORS)
+
+  foo/                              # assuming "foo" is the name of the list
+  - inbox.lock                      # lock file (flock) to protect global state
+  - git/$EPOCH.git                  # normal git repositories
+  - all.git                         # empty git repo, alternates to git/$EPOCH.git
+  - xap$SCHEMA_VERSION/$PART        # per-partition Xapian DB
+  - xap$SCHEMA_VERSION/over.sqlite3 # OVER-view DB for NNTP and threading
+  - msgmap.sqlite3                  # same as the v1 msgmap
+
+For blob lookups, the reader only needs to open the "all.git"
+repository with $GIT_DIR/objects/info/alternates which references
+every $EPOCH.git repo.
+
+Individual $EPOCH.git repos DO NOT use alternates themselves as
+git currently limits recursion of alternates nesting depth to 5.
+
+=head2 GIT EPOCHS
+
+One of the inherent scalability problems with git itself is the
+full history of a project must be stored and carried around to
+all clients. To address this problem, the v2 format uses
+multiple git repositories, stored as time-based "epochs".
+
+We currently divide epochs into roughly one gigabyte segments;
+but this size can be configurable (if needed) in the future.
+
+A pleasant side-effect of this design is the git packs of older
+epochs are stable, allowing them to be cloned without requiring
+expensive pack generation. This also allows clients to clone
+only the epochs they are interested in to save bandwidth and
+storage.
+
+To minimize changes to existing v1-based code and simplify our
+code, we use the "alternates" mechanism described in
+L to link all the epoch repositories
+with a single read-only "all.git" endpoint.
+
+Processes retrieve blobs via the "all.git" repository, while
+writers write blobs directly to epochs.
+
+=head2 GIT TREE LAYOUT
+
+One key problem specific to v1 was large trees were frequently a
+performance problem as name lookups are expensive and there were
+limited deltafication opportunities with unpredictable file
+names. As a result, all Xapian-enabled installations retrieve
+blob object_ids directly in v1, bypassing tree lookups.
+
+While dividing git repositories into epochs caps the growth of
+trees, worst-case tree size was still unnecessary overhead and
+worth eliminating.
+
+So in contrast to the big trees of v1, the v2 git tree contains
+only a single file at the top-level of the tree, either 'm' (for
+'mail' or 'message') or 'd' (for deleted). A tree does not have
+'m' and 'd' at the same time.
+
+Mail is still stored in blobs (instead of inline with the commit
+object) as we still need a stable reference in the indices in
+case commit history is rewritten to comply with legal
+requirements.
+
+After-the-fact invocations of L will ignore
+messages written to 'd' after they are written to 'm'.
+
+Deltafication is not significantly improved over v1, but overall
+storage for trees is made as small as possible. Initial
+statistics and benchmarks showing the benefits of this approach
+are documented at:
+
+L
+
+=head2 XAPIAN PARTITIONS
+
+A second scalability problem in v1 was the inability to
+utilize multiple CPU cores for Xapian indexing. This is
+addressed by using partitions in Xapian to perform import
+indexing in parallel.
+
+As with git alternates, Xapian natively supports a read-only
+interface which transparently abstracts away the knowledge of
+multiple partitions. This allows us to simplify our read-only
+code paths.
+
+The performance of the storage device is now the bottleneck on
+larger multi-core systems. In our experience, performance
+improves with high-quality and high-quantity solid-state storage.
+Issuing TRIM commands with L was necessary to maintain
+consistent performance while developing this feature.
+
+Rotational storage devices are NOT recommended for indexing of
+large mail archives; but are fine for backup and usable for
+small instances.
+
+=head2 OVERVIEW DB
+
+Towards the end of v2 development, it became apparent Xapian did
+not perform well for sorting large result sets used to generate
+the landing page in the PSGI UI (/$INBOX/) or many queries used
+by the NNTP server. Thus, SQLite was employed and the Xapian
+"skeleton" DB was renamed to the "overview" DB (after the NNTP
+OVER/XOVER commands).
+
+The overview DB maintains all the header information necessary
+to implement the NNTP OVER/XOVER commands and non-search
+endpoints of the PSGI UI.
+
+In the future, Xapian will become completely optional for v2 (as
+it is for v1) as SQLite turns out to be powerful enough to
+maintain overview information. Most of the PSGI and all of the
+NNTP functionality will be possible with only SQLite in addition
+to git.
+
+The overview DB was an instrumental piece in maintaining near
+constant-time read performance on a dataset 2-3 times larger
+than LKML history as of 2018.
+
+=head3 GHOST MESSAGES
+
+The overview DB also includes references to "ghost" messages,
+or messages which have replies but have not been seen by us.
+Thus it is expected to have more rows than the "msgmap" DB
+described below.
+
+=head2 msgmap.sqlite3
+
+The SQLite msgmap DB is unchanged from v1, but it is now at the
+top-level of the directory.
+
+=head1 OBJECT IDENTIFIERS
+
+There are three distinct types of identifiers. content_id is the
+new one for v2 and should make message removal and deduplication
+easier. object_id and Message-ID are already known.
+
+=over
+
+=item object_id
+
+The blob identifier git uses (currently SHA-1). No need to
+publicly expose this outside of normal git ops (cloning) and
+there's no need to make this searchable. As with v1 of
+public-inbox, this is stored as part of the Xapian document so
+expensive name lookups can be avoided for document retrieval.
+
+=item Message-ID
+
+The email header; duplicates allowed for archival purposes.
+This remains a searchable field in Xapian. Note: it's possible
+for emails to have multiple Message-ID headers (and L
+had that bug for a bit); so we take all of them into account.
+In case of conflicts detected by content_id below, we generate a new
+Message-ID based on content_id; if the generated Message-ID still
+conflicts, a random one is generated.
+
+=item content_id
+
+A hash of relevant headers and raw body content for
+purging of unwanted content. This is not stored anywhere,
+but always calculated on-the-fly.
+
+For now, the relevant headers are:
+
+  Subject, From, Date, References, In-Reply-To, To, Cc
+
+Received, List-Id, and similar headers are NOT part of content_id as
+they differ across lists and we will want removal to be able to cross
+lists.
+
+The textual parts of the body are decoded, CRLF normalized to
+LF, and trailing whitespace stripped. Notably, hashing the
+raw body risks being broken by list signatures; but we can use
+filters (e.g. PublicInbox::Filter::Vger) to clean the body for
+imports.
+
+content_id is SHA-256 for now; but can be changed at any time
+without making DB changes.
+
+=back
+
+=head1 LOCKING
+
+L locking exclusively locks the empty inbox.lock file
+for all non-atomic operations.
+
+=head1 HEADERS
+
+Same handling as with v1, except the Message-ID header will
+be generated if not provided or conflicting. "Bytes", "Lines"
+and "Content-Length" headers are stripped and not allowed, they
+can interfere with further processing.
+
+The "Status" mbox header is also stripped as that header makes
+no sense in a public archive.
+
+=head1 THANKS
+
+Thanks to the Linux Foundation for sponsoring the development
+and testing of the v2 repository format.
+
+=head1 COPYRIGHT
+
+Copyright 2018-2019 all contributors L
+
+License: AGPL-3.0+ L
+
+=head1 SEE ALSO
+
+L, L
-- 
cgit v1.2.3-24-ge0c7
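
The v1 pathname rule described in the patch (SHA-1 hexdigest of the Message-ID without its angle brackets, with surrounding white space ignored, split 2/38) can be sketched in Python as follows. This is an illustration only and is not code from public-inbox or ssoma: the function name v1_blob_path is hypothetical, and hashing the Message-ID as UTF-8 bytes is an assumption.

```python
import hashlib


def v1_blob_path(message_id_header: str) -> str:
    """Hypothetical helper: map a Message-ID header value to a v1 tree path."""
    # Leading and trailing white space is ignored for hashing.
    mid = message_id_header.strip()
    # Exclude the leading "<" and trailing ">" before hashing.
    if mid.startswith("<"):
        mid = mid[1:]
    if mid.endswith(">"):
        mid = mid[:-1]
    # Assumption: the Message-ID is hashed as UTF-8 bytes.
    digest = hashlib.sha1(mid.encode("utf-8")).hexdigest()
    # 2/38 split: the first two hex digits form the directory name.
    return digest[:2] + "/" + digest[2:]


if __name__ == "__main__":
    print(v1_blob_path(" <20131106023245.GA20224@dcvr.yhbt.net> "))
```

The v1 document gives f2/8c6cfd2b0a65f994c3e1be266105413b3d3f63 as the stored path for this Message-ID, which makes a convenient check for the sketch.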