1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
| | % public-inbox developer manual
=head1 NAME
public-inbox-extindex-format - external index format description
=head1 DESCRIPTION
The extindex is an index-only evolution of the per-inbox
SQLite and Xapian indices used by L<public-inbox-v2-format(5)>
and L<public-inbox-v1-format(5)>. It exists to facilitate
searches across multiple inboxes as well as to reduce index
space when messages are cross-posted to several existing
inboxes.
It transparently indexes messages across any combination of v1 and v2
inboxes and data about inboxes themselves.
=head1 DIRECTORY LAYOUT
While inspired by v2, there is no git blob storage nor
C<msgmap.sqlite3> DB.
Instead, there is an C<ALL.git> (all caps) git repo which treats
every indexed v1 inbox or v2 epoch as a git alternate.
As with v2 inboxes, it uses C<over.sqlite3> and Xapian "shards"
for WWW and IMAP use. Several exclusive new tables are added
to deal with L</XREF3 DEDUPLICATION> and metadata.
Unlike v1 and v2 inboxes, it is NOT designed to map to a NNTP
newsgroup. Thus it lacks C<msgmap.sqlite3> to enforce the
unique Message-ID requirement of NNTP.
=head2 INDEX OVERVIEW AND DEFINITIONS
$SCHEMA_VERSION - DB schema version (for Xapian)
$SHARD - Integer starting with 0 based on parallelism
foo/ # "foo" is the name of the index
- ei.lock # lock file to protect global state
- ALL.git # empty, alternates for inboxes
- ei$SCHEMA_VERSION/$SHARD # per-shard Xapian DB
- ei$SCHEMA_VERSION/over.sqlite3 # overview DB for WWW, IMAP
- ei$SCHEMA_VERSION/misc # misc Xapian DB
File and directory names are intentionally different from
analogous v2 names to ensure extindex and v2 inboxes can
easily be distinguished from each other.
=head2 XREF3 DEDUPLICATION
Due to cross-posted messages being the norm in the large Linux kernel
development community and Xapian indices being the primary consumer of
storage, it makes sense to deduplicate indexing as much as possible.
The internal storage format is based on the NNTP "Xref" tuple,
but with the addition of a third element: the git blob OID.
Thus the triple is expressed in string form as:
$NEWSGROUP_NAME:$ARTICLE_NUM:$OID
If no C<newsgroup> is configured for an inbox, the C<inboxdir>
of the inbox is used.
This data is stored in the C<xref3> table of over.sqlite3.
=head2 misc XAPIAN DB
In addition to the numeric Xapian shards for indexing messages,
there is a new, in-development Xapian index for storing data
about inboxes themselves and other non-message data. This
index allows us to speed up operations involving hundreds or
thousands of inboxes.
=head1 BENEFITS
In addition to providing cross-inbox search capabilities, it can
also replace per-inbox Xapian shards (but not per-inbox
over.sqlite3). This allows reduction in disk space, open file
handles, and associated memory use.
=head1 CAVEATS
Relocating v1 and v2 inboxes on the filesystem will require
extindex to be garbage-collected and/or reindexed.
Configuring and maintaining stable C<newsgroup> names before any
messages are indexed from every inbox can avoid expensive
reindexing and rely exclusively on GC.
=head1 LOCKING
L<flock(2)> locking exclusively locks the empty ei.lock file
for all non-atomic operations.
=head1 THANKS
Thanks to the Linux Foundation for sponsoring the development
and testing.
=head1 COPYRIGHT
Copyright 2020-2021 all contributors L<mailto:meta@public-inbox.org>
License: AGPL-3.0+ L<http://www.gnu.org/licenses/agpl-3.0.txt>
=head1 SEE ALSO
L<public-inbox-v2-format(5)>
|