diff options
Diffstat (limited to 'Documentation/public-inbox-extindex-format.pod')
-rw-r--r-- | Documentation/public-inbox-extindex-format.pod | 110 |
1 files changed, 110 insertions, 0 deletions
diff --git a/Documentation/public-inbox-extindex-format.pod b/Documentation/public-inbox-extindex-format.pod new file mode 100644 index 00000000..d47e2102 --- /dev/null +++ b/Documentation/public-inbox-extindex-format.pod @@ -0,0 +1,110 @@ +% public-inbox developer manual + +=head1 NAME + +public-inbox-extindex-format - external index format description + +=head1 DESCRIPTION + +The extindex is an index-only evolution of the per-inbox +SQLite and Xapian indices used by L<public-inbox-v2-format(5)> +and L<public-inbox-v1-format(5)>. It exists to facilitate +searches across multiple inboxes as well as to reduce index +space when messages are cross-posted to several existing +inboxes. + +It transparently indexes messages across any combination of v1 and v2 +inboxes and data about inboxes themselves. + +=head1 DIRECTORY LAYOUT + +While inspired by v2, there is no git blob storage nor +C<msgmap.sqlite3> DB. + +Instead, there is an C<ALL.git> (all caps) git repo which treats +every indexed v1 inbox or v2 epoch as a git alternate. + +As with v2 inboxes, it uses C<over.sqlite3> and Xapian "shards" +for WWW and IMAP use. Several exclusive new tables are added +to deal with L</XREF3 DEDUPLICATION> and metadata. + +Unlike v1 and v2 inboxes, it is NOT designed to map to a NNTP +newsgroup. Thus it lacks C<msgmap.sqlite3> to enforce the +unique Message-ID requirement of NNTP. + +=head2 INDEX OVERVIEW AND DEFINITIONS + + $SCHEMA_VERSION - DB schema version (for Xapian) + $SHARD - Integer starting with 0 based on parallelism + + foo/ # "foo" is the name of the index + - ei.lock # lock file to protect global state + - ALL.git # empty, alternates for inboxes + - ei$SCHEMA_VERSION/$SHARD # per-shard Xapian DB + - ei$SCHEMA_VERSION/over.sqlite3 # overview DB for WWW, IMAP + - ei$SCHEMA_VERSION/misc # misc Xapian DB + +File and directory names are intentionally different from +analogous v2 names to ensure extindex and v2 inboxes can +easily be distinguished from each other. + +=head2 XREF3 DEDUPLICATION + +Due to cross-posted messages being the norm in the large Linux kernel +development community and Xapian indices being the primary consumer of +storage, it makes sense to deduplicate indexing as much as possible. + +The internal storage format is based on the NNTP "Xref" tuple, +but with the addition of a third element: the git blob OID. +Thus the triple is expressed in string form as: + + $NEWSGROUP_NAME:$ARTICLE_NUM:$OID + +If no C<newsgroup> is configured for an inbox, the C<inboxdir> +of the inbox is used. + +This data is stored in the C<xref3> table of over.sqlite3. + +=head2 misc XAPIAN DB + +In addition to the numeric Xapian shards for indexing messages, +there is a new, in-development Xapian index for storing data +about inboxes themselves and other non-message data. This +index allows us to speed up operations involving hundreds or +thousands of inboxes. + +=head1 BENEFITS + +In addition to providing cross-inbox search capabilities, it can +also replace per-inbox Xapian shards (but not per-inbox +over.sqlite3). This allows reduction in disk space, open file +handles, and associated memory use. + +=head1 CAVEATS + +Relocating v1 and v2 inboxes on the filesystem will require +extindex to be garbage-collected and/or reindexed. + +Configuring and maintaining stable C<newsgroup> names before any +messages are indexed from every inbox can avoid expensive +reindexing and rely exclusively on GC. + +=head1 LOCKING + +L<flock(2)> locking exclusively locks the empty ei.lock file +for all non-atomic operations. + +=head1 THANKS + +Thanks to the Linux Foundation for sponsoring the development +and testing. + +=head1 COPYRIGHT + +Copyright 2020-2021 all contributors L<mailto:meta@public-inbox.org> + +License: AGPL-3.0+ L<http://www.gnu.org/licenses/agpl-3.0.txt> + +=head1 SEE ALSO + +L<public-inbox-v2-format(5)> |