author | Eric Wong <e@yhbt.net> | 2020-06-21 00:21:31 +0000 |
---|---|---|
committer | Eric Wong <e@yhbt.net> | 2020-06-23 00:22:14 +0000 |
commit | b0372059b451c01fba1fdfe8a1879fbd5c7ca53d (patch) | |
tree | 442baa8adaac3d590f3d1e85f4936a2f97600f1c /Documentation/public-inbox-init.pod | |
parent | a5c21c6e800be4755848621ba223594b0bde4d95 (diff) | |
download | public-inbox-b0372059b451c01fba1fdfe8a1879fbd5c7ca53d.tar.gz |
On a powerful (by my standards) machine with 16GB RAM and a 7200 RPM HDD marketed for "enterprise" use, indexing an 8.1G (in git) LKML snapshot from Sep 2019 did not finish after 7 days with the default number (3) of Xapian shards (`--jobs=4') and `--batch-size=10m'.

Indexing started off fast, but got progressively slower as the contents of the inbox (including Xapian + SQLite DBs) could no longer be cached by the kernel. Once the on-disk size increased, HDD seek contention between the Xapian shard workers slowed the process down to a crawl.

With a single shard, it still took around 3.5 days to index on the HDD. That's not good, but it's far better than not finishing after 7 days. So allow unfortunate HDD users to easily specify a single shard on public-inbox-init.

For reference, a freshly TRIM-ed low-end TLC SSD on the SATA II bus on the same machine indexes that same snapshot of LKML in ~7 hours with 3 shards and the same 10m batch size. In the past, a higher-end consumer-grade MLC SSD on similar hardware indexed a similarly sized data set in ~4 hours.
Diffstat (limited to 'Documentation/public-inbox-init.pod')
-rw-r--r-- | Documentation/public-inbox-init.pod | 14 |
1 files changed, 14 insertions, 0 deletions
diff --git a/Documentation/public-inbox-init.pod b/Documentation/public-inbox-init.pod
index 4744da96..495a258f 100644
--- a/Documentation/public-inbox-init.pod
+++ b/Documentation/public-inbox-init.pod
@@ -48,6 +48,20 @@ added-after-the-fact (without affecting "git clone" followers).
 
 Default: unset, no epochs are skipped
 
+=item -j, --jobs=JOBS
+
+Control the number of Xapian index shards in a
+C<-V2> (L<public-inbox-v2-format(5)>) inbox.
+
+It is useful to use a single shard (C<-j1>) for inboxes on
+high-latency storage (e.g. rotational HDD) unless the system has
+enough RAM to cache 5-10x the size of the git repository.
+
+It is generally not useful to specify higher values than the
+default due to contention in the top-level producer process.
+
+Default: the number of online CPUs, up to 4
+
 =back
 
 =head1 ENVIRONMENT
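
The guidance in the new documentation can be read as a small decision rule. The following is only a rough sketch of that rule, not code from public-inbox: the function name `suggest_jobs`, its parameters, and the choice of the upper end (10x) of the documented 5-10x range as the caching threshold are all assumptions made for illustration.

```python
import os

GIB = 1024 ** 3


def suggest_jobs(git_bytes, ram_bytes, rotational, cpus=None, cache_factor=10):
    """Suggest a value for -j/--jobs (hypothetical helper, not part of public-inbox).

    cache_factor is an assumption: the documentation says RAM should be able
    to cache 5-10x the size of the git repository, and this sketch uses the
    upper end of that range by default.
    """
    if cpus is None:
        # os.cpu_count() reports the CPUs in the system, which may differ
        # from the number of online CPUs that the documentation refers to.
        cpus = os.cpu_count() or 1

    # Documented default: the number of online CPUs, up to 4.
    default_jobs = max(1, min(cpus, 4))

    # High-latency storage without enough RAM to cache the data: one shard.
    if rotational and ram_bytes < cache_factor * git_bytes:
        return 1

    # Values above the default are described as generally not useful.
    return default_jobs


if __name__ == "__main__":
    # Figures from the commit message: 8.1G in git, 16GB RAM, 7200 RPM HDD.
    print(suggest_jobs(int(8.1 * GIB), 16 * GIB, rotational=True, cpus=8))
```

With the figures from the commit message, 16GB of RAM is well short of ten times an 8.1G repository, so the sketch falls back to a single shard, which corresponds to passing `-j1` to public-inbox-init.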