From cf35d38e7f845393659dfce0249a76d529a2c92c Mon Sep 17 00:00:00 2001
From: Eric Wong
Date: Wed, 2 Jan 2019 08:23:13 +0000
Subject: update and add documentation for repository formats

Remove confusing documentation around ssoma now that we have
NNTP and downloadable mbox support.

Only lightly-checked for grammar and speling, and not yet
formatting. Edits, corrections and addendums expected :>
---
 Documentation/public-inbox-v2-format.pod | 234 +++++++++++++++++++++++++++++++
 1 file changed, 234 insertions(+)
 create mode 100644 Documentation/public-inbox-v2-format.pod

(limited to 'Documentation/public-inbox-v2-format.pod')

diff --git a/Documentation/public-inbox-v2-format.pod b/Documentation/public-inbox-v2-format.pod
new file mode 100644
index 00000000..05ef32a9
--- /dev/null
+++ b/Documentation/public-inbox-v2-format.pod
@@ -0,0 +1,234 @@
+% public-inbox developer manual
+
+=head1 NAME
+
+public-inbox v2 repository description
+
+=head1 DESCRIPTION
+
+The v2 format is designed primarily to address several
+scalability problems of the original format described at
+L. It also handles messages with
+Message-IDs.
+
+=head1 INBOX LAYOUT
+
+The key change in v2 is the inbox is no longer a bare git
+repository, but a directory with two or more git repositories.
+v2 divides git repositories by time "epochs" and Xapian
+databases for parallelism by "partitions".
+
+=head2 INBOX OVERVIEW AND DEFINITIONS
+
+  $EPOCH - Integer starting with 0 based on time
+  $SCHEMA_VERSION - PublicInbox::Search::SCHEMA_VERSION used by Xapian
+  $PART - Integer (0..NPROCESSORS)
+
+  foo/                              # assuming "foo" is the name of the list
+  - inbox.lock                      # lock file (flock) to protect global state
+  - git/$EPOCH.git                  # normal git repositories
+  - all.git                         # empty git repo, alternates to git/$EPOCH.git
+  - xap$SCHEMA_VERSION/$PART        # per-partition Xapian DB
+  - xap$SCHEMA_VERSION/over.sqlite3 # OVER-view DB for NNTP and threading
+  - msgmap.sqlite3                  # same as the v1 msgmap
+
+For blob lookups, the reader only needs to open the "all.git"
+repository with $GIT_DIR/objects/info/alternates which references
+every $EPOCH.git repo.
+
+Individual $EPOCH.git repos DO NOT use alternates themselves as
+git currently limits the nesting depth of alternates to 5.
+
+=head2 GIT EPOCHS
+
+One of the inherent scalability problems with git itself is the
+full history of a project must be stored and carried around to
+all clients. To address this problem, the v2 format uses
+multiple git repositories, stored as time-based "epochs".
+
+We currently divide epochs into roughly one gigabyte segments;
+but this size can be made configurable (if needed) in the future.
+
+A pleasant side-effect of this design is the git packs of older
+epochs are stable, allowing them to be cloned without requiring
+expensive pack generation. This also allows clients to clone
+only the epochs they are interested in to save bandwidth and
+storage.
+
+To minimize changes to existing v1-based code and simplify our
+code, we use the "alternates" mechanism described in
+L to link all the epoch repositories
+with a single read-only "all.git" endpoint.
+
+Processes retrieve blobs via the "all.git" repository, while
+writers write blobs directly to epochs.
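The link between "all.git" and the epoch repositories described above can be sketched as follows. This is a minimal illustration and not code from public-inbox: the function names (epoch_dirs, alternates_lines, write_alternates) are hypothetical, and the use of relative "../../git/$EPOCH.git/objects" entries is an assumption that relies on git resolving relative alternates against the objects/ directory that contains info/alternates.

```python
import os
import re
import tempfile


def epoch_dirs(inbox_dir):
    """Return (epoch, dirname) pairs for every git/$EPOCH.git, in numeric order."""
    found = []
    for name in os.listdir(os.path.join(inbox_dir, "git")):
        m = re.fullmatch(r"(\d+)\.git", name)
        if m:  # skip anything that is not an epoch repository
            found.append((int(m.group(1)), name))
    return sorted(found)


def alternates_lines(inbox_dir):
    """Build one alternates entry per epoch (hypothetical helper).

    Assumption: entries are written relative to all.git/objects, which is
    the directory git uses as the base for relative alternates.
    """
    return ["../../git/%s/objects" % name for _, name in epoch_dirs(inbox_dir)]


def write_alternates(inbox_dir):
    """Write all.git/objects/info/alternates and return its path."""
    info = os.path.join(inbox_dir, "all.git", "objects", "info")
    os.makedirs(info, exist_ok=True)
    path = os.path.join(info, "alternates")
    with open(path, "w") as fh:
        for line in alternates_lines(inbox_dir):
            fh.write(line + "\n")
    return path


if __name__ == "__main__":
    # Build a throwaway "foo/" layout with three epochs and show the result.
    with tempfile.TemporaryDirectory() as foo:
        for epoch in (0, 1, 10):
            os.makedirs(os.path.join(foo, "git", "%d.git" % epoch, "objects"))
        with open(write_alternates(foo)) as fh:
            print(fh.read(), end="")
```

Because only "all.git" carries the alternates file, each epoch stays one level deep and the nesting limit mentioned above is never approached, however many epochs exist.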
+
+=head2 GIT TREE LAYOUT
+
+One key problem specific to v1 was that large trees frequently
+hurt performance, as name lookups are expensive and there were
+limited deltafication opportunities with unpredictable file
+names. As a result, all Xapian-enabled installations retrieve
+blob object_ids directly in v1, bypassing tree lookups.
+
+While dividing git repositories into epochs caps the growth of
+trees, worst-case tree size was still unnecessary overhead and
+worth eliminating.
+
+So in contrast to the big trees of v1, the v2 git tree contains
+only a single file at the top-level of the tree, either 'm' (for
+'mail' or 'message') or 'd' (for 'deleted'). A tree does not have
+'m' and 'd' at the same time.
+
+Mail is still stored in blobs (instead of inline with the commit
+object) as we still need a stable reference in the indices in
+case commit history is rewritten to comply with legal
+requirements.
+
+After-the-fact invocations of L will ignore
+messages written to 'd' after they are written to 'm'.
+
+Deltafication is not significantly improved over v1, but overall
+storage for trees is made as small as possible. Initial
+statistics and benchmarks showing the benefits of this approach
+are documented at:
+
+L
+
+=head2 XAPIAN PARTITIONS
+
+A second scalability problem in v1 was the inability to
+utilize multiple CPU cores for Xapian indexing. This is
+addressed by using partitions in Xapian to perform import
+indexing in parallel.
+
+As with git alternates, Xapian natively supports a read-only
+interface which transparently abstracts away the knowledge of
+multiple partitions. This allows us to simplify our read-only
+code paths.
+
+The performance of the storage device is now the bottleneck on
+larger multi-core systems. In our experience, performance
+improves with high-quality and high-quantity solid-state storage.
+Issuing TRIM commands with L was necessary to maintain
+consistent performance while developing this feature.
+
+Rotational storage devices are NOT recommended for indexing of
+large mail archives; but are fine for backup and usable for
+small instances.
+
+=head2 OVERVIEW DB
+
+Towards the end of v2 development, it became apparent Xapian did
+not perform well for sorting large result sets used to generate
+the landing page in the PSGI UI (/$INBOX/) or many queries used
+by the NNTP server. Thus, SQLite was employed and the Xapian
+"skeleton" DB was renamed to the "overview" DB (after the NNTP
+OVER/XOVER commands).
+
+The overview DB maintains all the header information necessary
+to implement the NNTP OVER/XOVER commands and non-search
+endpoints of the PSGI UI.
+
+In the future, Xapian will become completely optional for v2 (as
+it is for v1) as SQLite turns out to be powerful enough to
+maintain overview information. Most of the PSGI and all of the
+NNTP functionality will be possible with only SQLite in addition
+to git.
+
+The overview DB was an instrumental piece in maintaining near
+constant-time read performance on a dataset 2-3 times larger
+than LKML history as of 2018.
+
+=head3 GHOST MESSAGES
+
+The overview DB also includes references to "ghost" messages,
+or messages which have replies but have not been seen by us.
+Thus it is expected to have more rows than the "msgmap" DB
+described below.
+
+=head2 msgmap.sqlite3
+
+The SQLite msgmap DB is unchanged from v1, but it is now at the
+top-level of the directory.
+
+=head1 OBJECT IDENTIFIERS
+
+There are three distinct types of identifiers. content_id is the
+new one for v2 and should make message removal and deduplication
+easier. object_id and Message-ID are already known.
+
+=over
+
+=item object_id
+
+The blob identifier git uses (currently SHA-1). No need to
+publicly expose this outside of normal git ops (cloning) and
+there's no need to make this searchable. As with v1 of
+public-inbox, this is stored as part of the Xapian document so
+expensive name lookups can be avoided for document retrieval.
+
+=item Message-ID
+
+The email header; duplicates allowed for archival purposes.
+This remains a searchable field in Xapian. Note: it's possible
+for emails to have multiple Message-ID headers (and L
+had that bug for a bit); so we take all of them into account.
+In case of conflicts detected by content_id below, we generate a new
+Message-ID based on content_id; if the generated Message-ID still
+conflicts, a random one is generated.
+
+=item content_id
+
+A hash of relevant headers and raw body content for
+purging of unwanted content. This is not stored anywhere,
+but always calculated on-the-fly.
+
+For now, the relevant headers are:
+
+	Subject, From, Date, References, In-Reply-To, To, Cc
+
+Received, List-Id, and similar headers are NOT part of content_id as
+they differ across lists and we will want removal to be able to cross
+lists.
+
+The textual parts of the body are decoded, CRLF normalized to
+LF, and trailing whitespace stripped. Notably, hashing the
+raw body risks being broken by list signatures; but we can use
+filters (e.g. PublicInbox::Filter::Vger) to clean the body for
+imports.
+
+content_id is SHA-256 for now; but can be changed at any time
+without making DB changes.
+
+=back
+
+=head1 LOCKING
+
+L locking exclusively locks the empty inbox.lock file
+for all non-atomic operations.
+
+=head1 HEADERS
+
+Same handling as with v1, except the Message-ID header will
+be generated if not provided or conflicting. "Bytes", "Lines"
+and "Content-Length" headers are stripped and not allowed, as they
+can interfere with further processing.
+
+The "Status" mbox header is also stripped as that header makes
+no sense in a public archive.
+
+=head1 THANKS
+
+Thanks to the Linux Foundation for sponsoring the development
+and testing of the v2 repository format.
+
+=head1 COPYRIGHT
+
+Copyright 2018-2019 all contributors L
+
+License: AGPL-3.0+ L
+
+=head1 SEE ALSO
+
+L, L
-- 
cgit v1.2.3-24-ge0c7
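The content_id calculation described under OBJECT IDENTIFIERS can be sketched roughly as below. This is an illustration under stated assumptions, not the public-inbox implementation: the function names (normalize_text, content_id) are hypothetical, the way headers and body are fed into the hash is invented for the example, and "trailing whitespace stripped" is read here as per-line stripping. Digests produced by this sketch should not be expected to match those computed by public-inbox.

```python
import hashlib
from email import message_from_bytes, policy

# Header list taken from the content_id description above.
CONTENT_HEADERS = ("Subject", "From", "Date", "References",
                   "In-Reply-To", "To", "Cc")


def normalize_text(text):
    """Normalize CRLF to LF and strip trailing whitespace from each line.

    Assumption: per-line stripping is one reading of the description.
    """
    lines = text.replace("\r\n", "\n").split("\n")
    return "\n".join(line.rstrip() for line in lines)


def content_id(raw):
    """Return a hex SHA-256 over the relevant headers and textual body parts.

    Assumption: the "Name: value" framing below is made up for this sketch.
    """
    msg = message_from_bytes(raw, policy=policy.default)
    digest = hashlib.sha256()
    for name in CONTENT_HEADERS:
        # Received, List-Id and similar headers are deliberately not hashed.
        for value in msg.get_all(name, []):
            digest.update(("%s: %s\n" % (name, value)).encode("utf-8", "replace"))
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            body = normalize_text(part.get_content())  # decoded text
            digest.update(body.encode("utf-8", "replace"))
    return digest.hexdigest()


if __name__ == "__main__":
    a = b"From: a@example.com\nSubject: hi\nReceived: by x\n\nhello  \nworld\n"
    b = b"From: a@example.com\r\nSubject: hi\r\n\r\nhello\r\nworld\r\n"
    print(content_id(a) == content_id(b))
```

Since nothing is stored, swapping SHA-256 for another digest only requires changing the hashlib call, which mirrors the statement above that the hash can change without DB changes.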