public-inbox.git  about / heads / tags
an "archives first" approach to mailing lists
blob 14f157a5a66e2cc67198f48730dc610e9044a875 10126 bytes (raw)
$ git show HEAD:Documentation/public-inbox-index.pod	# shows this blob on the CLI

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
 
=head1 NAME

public-inbox-index - create and update search indices

=head1 SYNOPSIS

public-inbox-index [OPTIONS] INBOX_DIR...

public-inbox-index [OPTIONS] --all

=head1 DESCRIPTION

public-inbox-index creates and updates the search, overview and
NNTP article number database used by the read-only public-inbox
HTTP and NNTP interfaces.  Currently, this requires
L<DBD::SQLite> and L<DBI> Perl modules.  L<Xapian> (or L<Search::Xapian>)
are optional, only to support the PSGI search interface.

Once the initial indices are created by public-inbox-index,
L<public-inbox-mda(1)> and L<public-inbox-watch(1)> will
automatically maintain them.

Running this manually to update indices is only required if
relying on L<git-fetch(1)> to mirror an existing public-inbox;
or if upgrading to a new version of public-inbox using
the C<--reindex> option.

Having the overview and article number database is essential to
running the NNTP interface, and strongly recommended for the
HTTP interface as it provides thread grouping in addition to
normal search functionality.

=head1 OPTIONS

=over

=item -j JOBS

=item --jobs=JOBS

Influences the number of Xapian indexing shards in a
(L<public-inbox-v2-format(5)>) inbox.

See L<public-inbox-init(1)/--jobs> for a full description
of sharding.

C<--jobs=0> is accepted as of public-inbox 1.6.0
to disable parallel indexing regardless of the number of
pre-existing shards.

If the inbox has not been indexed or initialized, C<JOBS - 1>
shards will be created (one job is always needed for indexing
the overview and article number mapping).

Default: the number of existing Xapian shards

=item -c

=item --compact

Compacts the Xapian DBs after indexing.  This is recommended
when using C<--reindex> to avoid running out of disk space
while indexing multiple inboxes.

While option takes a negligible amount of time compared to
C<--reindex>, it requires temporarily duplicating the entire
contents of the Xapian DB.

This switch may be specified twice, in which case compaction
happens both before and after indexing to minimize the temporal
footprint of the (re)indexing operation.

Available since public-inbox 1.4.0.

=item --reindex

Forces a re-index of all messages in the inbox.
This can be used for in-place upgrades and bugfixes while
NNTP/HTTP server processes are utilizing the index.  Keep in
mind this roughly doubles the size of the already-large
Xapian database.  Using this with C<--compact> or running
L<public-inbox-compact(1)> afterwards is recommended to
release free space.

public-inbox protects writes to various indices with
L<flock(2)>, so it is safe to reindex (and rethread) while
L<public-inbox-watch(1)>, L<public-inbox-mda(1)> or
L<public-inbox-learn(1)> run.

This does not touch the NNTP article number database.
It does not affect threading unless C<--rethread> is
used.

=item --all

Index all inboxes configured in ~/.public-inbox/config.
This is an alternative to specifying individual inboxes directories
on the command-line.

=item --rethread

Regenerate internal THREADID and message thread associations
when reindexing.

This fixes some bugs in older versions of public-inbox.  While
it is possible to use this without C<--reindex>, it makes little
sense to do so.

Available in public-inbox 1.6.0+.

=item --prune

Run L<git-gc(1)> to prune and expire reflogs if discontiguous history
is detected.  This is intended to be used in mirrors after running
L<public-inbox-edit(1)> or L<public-inbox-purge(1)> to ensure data
is expunged from mirrors.

Available since public-inbox 1.2.0.

=item --max-size SIZE

Sets or overrides L</publicinbox.indexMaxSize> on a
per-invocation basis.  See L</publicinbox.indexMaxSize>
below.

Available since public-inbox 1.5.0.

=item --batch-size SIZE

Sets or overrides L</publicinbox.indexBatchSize> on a
per-invocation basis.  See L</publicinbox.indexBatchSize>
below.

When using rotational storage but abundant RAM, using a large
value (e.g. C<500m>) with C<--sequential-shard> can
significantly speed up and reduce fragmentation during the
initial index and full C<--reindex> invocations (but not
incremental updates).

Available in public-inbox 1.6.0+.

=item --no-fsync

Disables L<fsync(2)> and L<fdatasync(2)> operations on SQLite
and Xapian.  This is only effective with Xapian 1.4+.  This is
primarily intended for systems with low RAM and the small
(default) C<--batch-size=1m>.  Users of large C<--batch-size>
may even find disabling L<fdatasync(2)> causes too much dirty
data to accumulate, resulting on latency spikes from writeback.

Available in public-inbox 1.6.0+.

=item --dangerous

Speed up initial index by using in-place updates and denying support for
concurrent readers.  This is only effective with Xapian 1.4+.

Available in public-inbox 1.8.0+

=item --sequential-shard

Sets or overrides L</publicinbox.indexSequentialShard> on a
per-invocation basis.  See L</publicinbox.indexSequentialShard>
below.

Available in public-inbox 1.6.0+.

=item --skip-docdata

Stop storing document data in Xapian on an existing inbox.

See L<public-inbox-init(1)/--skip-docdata> for description and caveats.

Available in public-inbox 1.6.0+.

=item -E EXTINDEX

=item --update-extindex=EXTINDEX

Update the given external index (L<public-inbox-extindex-format(5)>.
Either the configured section name (e.g. C<all>) or a directory name
may be specified.

Defaults to C<all> if C<[extindex "all"]> is configured,
otherwise no external indices are updated.

May be specified multiple times in rare cases where multiple
external indices are configured.

=item --no-update-extindex

Do not update the C<all> external index by default.  This negates
all uses of C<-E> / C<--update-extindex=> on the command-line.

=item --since=DATESTRING

=item --after=DATESTRING

=item --until=DATESTRING

=item --before=DATESTRING

Passed directly to L<git-log(1)> to limit changes for C<--reindex>

=back

=head1 FILES

For v1 (ssoma) repositories described in L<public-inbox-v1-format(5)>.
All public-inbox-specific files are contained within the
C<$GIT_DIR/public-inbox/> directory.

v2 inboxes are described in L<public-inbox-v2-format(5)>.

=head1 CONFIGURATION

=over 8

=item publicinbox.indexMaxSize

Prevents indexing of messages larger than the specified size
value.  A single suffix modifier of C<k>, C<m> or C<g> is
supported, thus the value of C<1m> to prevents indexing of
messages larger than one megabyte.

This is useful for avoiding memory exhaustion in mirrors
via git.  It does not prevent L<public-inbox-mda(1)> or
L<public-inbox-watch(1)> from importing (and indexing)
a message.

This option is only available in public-inbox 1.5 or later.

Default: none

=item publicinbox.indexBatchSize

Flushes changes to the filesystem and releases locks after
indexing the given number of bytes.  The default value of C<1m>
(one megabyte) is low to minimize memory use and reduce
contention with parallel invocations of L<public-inbox-mda(1)>,
L<public-inbox-learn(1)>, and L<public-inbox-watch(1)>.

Increase this value on powerful systems to improve throughput at
the expense of memory use.  The reduction of lock granularity
may not be noticeable on fast systems.  With SSDs, values above
C<4m> have little benefit.

For L<public-inbox-v2-format(5)> inboxes, this value is
multiplied by the number of Xapian shards.  Thus a typical v2
inbox with 3 shards will flush every 3 megabytes by default
unless parallelism is disabled via C<--sequential-shard>
or C<--jobs=0>.

This influences memory usage of Xapian, but it is not exact.
The actual memory used by Xapian and Perl has been observed
in excess of 10x this value.

This option is available in public-inbox 1.6 or later.
public-inbox 1.5 and earlier used the current default, C<1m>.

Default: 1m (one megabyte)

=item publicinbox.indexSequentialShard

For L<public-inbox-v2-format(5)> inboxes, setting this to C<true>
allows indexing Xapian shards in multiple passes.  This speeds up
indexing on rotational storage with high seek latency by allowing
individual shards to fit into the kernel page cache.

Using a higher-than-normal number of C<--jobs> with
L<public-inbox-init(1)> may be required to ensure individual
shards are small enough to fit into cache.

Warning: interrupting C<public-inbox-index(1)> while this option
is in use may leave the search indices out-of-date with respect
to SQLite databases.  WWW and IMAP users may notice incomplete
search results, but it is otherwise non-fatal.  Using C<--reindex>
will bring everything back up-to-date.

Available in public-inbox 1.6.0+.

This is ignored on L<public-inbox-v1-format(5)> inboxes.

Default: false, shards are indexed in parallel

=item publicinbox.<name>.indexSequentialShard

Identical to L</publicinbox.indexSequentialShard>,
but only affect the inbox matching E<lt>nameE<gt>.

=back

=head1 ENVIRONMENT

=over 8

=item PI_CONFIG

Used to override the default "~/.public-inbox/config" value.

=item XAPIAN_FLUSH_THRESHOLD

The number of documents to update before committing changes to
disk.  This environment is handled directly by Xapian, refer to
Xapian API documentation for more details.

For public-inbox 1.6 and later, use C<publicinbox.indexBatchSize>
instead.

Setting C<XAPIAN_FLUSH_THRESHOLD> or
C<publicinbox.indexBatchSize> for a large C<--reindex> may cause
L<public-inbox-mda(1)>, L<public-inbox-learn(1)> and
L<public-inbox-watch(1)> tasks to wait long and unpredictable
periods of time during C<--reindex>.

Default: none, uses C<publicinbox.indexBatchSize>

=back

=head1 UPGRADING

Occasionally, public-inbox will update its schema version and
require a full index by running this command.

=head1 CONTACT

Feedback welcome via plain-text mail to L<mailto:meta@public-inbox.org>

The mail archives are hosted at L<https://public-inbox.org/meta/> and
L<http://4uok3hntl7oi7b4uf4rtfwefqeexfzil2w6kgk2jn5z2f764irre7byd.onion/meta/>

=head1 COPYRIGHT

Copyright all contributors L<mailto:meta@public-inbox.org>

License: AGPL-3.0+ L<https://www.gnu.org/licenses/agpl-3.0.txt>

=head1 SEE ALSO

L<Search::Xapian>, L<DBD::SQLite>, L<public-inbox-extindex-format(5)>

git clone https://public-inbox.org/public-inbox.git
git clone http://7fh6tueqddpjyxjmgtdiueylzoqt6pt7hec3pukyptlmohoowvhde4yd.onion/public-inbox.git