author    Eric Wong <e@yhbt.net> 2020-06-21 00:21:31 +0000
committer Eric Wong <e@yhbt.net> 2020-06-23 00:22:14 +0000
commit    b0372059b451c01fba1fdfe8a1879fbd5c7ca53d
tree      442baa8adaac3d590f3d1e85f4936a2f97600f1c /script
parent    a5c21c6e800be4755848621ba223594b0bde4d95

On a powerful (by my standards) machine with 16GB RAM and a
7200 RPM HDD marketed for "enterprise" use, indexing an 8.1G (in
git) LKML snapshot from Sep 2019 did not finish after 7 days
with the default number (3) of Xapian shards (`--jobs=4') and
`--batch-size=10m'.

Indexing started off fast, but got progressively slower as
contents of the inbox (including Xapian + SQLite DBs) could no
longer be cached by the kernel.  Once the on-disk size
increased, HDD seek contention between the Xapian shard workers
slowed the process down to a crawl.

With a single shard, it still took around 3.5 days to index on
the HDD.  That's not good, but it's far better than not
finishing after 7 days.  So allow unfortunate HDD users to
easily specify a single shard on public-inbox-init.

For reference, a freshly TRIM-ed low-end TLC SSD on the SATA II
bus on the same machine indexes that same snapshot of LKML in
~7 hours with 3 shards and the same 10m batch size.  In the past,
a higher-end consumer-grade MLC SSD on similar hardware indexed
a similarly sized data set in ~4 hours.
Diffstat (limited to 'script')
-rwxr-xr-x script/public-inbox-init | 8
1 files changed, 8 insertions, 0 deletions
diff --git a/script/public-inbox-init b/script/public-inbox-init
index 10d3ad45..00147db5 100755
--- a/script/public-inbox-init
+++ b/script/public-inbox-init
@@ -27,10 +27,12 @@ use Cwd qw/abs_path/;
 my $version = undef;
 my $indexlevel = undef;
 my $skip_epoch;
+my $jobs;
 my %opts = (
         'V|version=i' => \$version,
         'L|indexlevel=s' => \$indexlevel,
         'S|skip|skip-epoch=i' => \$skip_epoch,
+        'j|jobs=i' => \$jobs,
 );
 GetOptions(%opts) or usage();
 PublicInbox::Admin::indexlevel_ok_or_die($indexlevel) if defined $indexlevel;
@@ -144,6 +146,12 @@ my $ibx = PublicInbox::Inbox->new({
 });
 
 my $creat_opt = {};
+if (defined $jobs) {
+        die "--jobs is only supported for -V2 inboxes\n" if $version == 1;
+        die "--jobs=$jobs must be >= 1\n" if $jobs <= 0;
+        $creat_opt->{nproc} = $jobs;
+}
+
 PublicInbox::InboxWritable->new($ibx, $creat_opt)->init_inbox(0, $skip_epoch);
 
 # needed for git prior to v2.1.0
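
The option handling added by this patch can be tried in isolation.
What follows is a minimal standalone sketch, not the real
public-inbox-init: `parse_jobs' is a hypothetical helper, and
defaulting to -V2 when -V is omitted is an assumption made only so
the example is self-contained.  It parses -j/--jobs with Getopt::Long
and applies the same two checks as the hunk above.

```perl
#!/usr/bin/perl
# Standalone sketch of the -j/--jobs handling; NOT the real script.
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# parse_jobs(\@argv): hypothetical helper for illustration only.
# Returns { nproc => N } when --jobs is given, an empty hashref
# otherwise, and dies on the same conditions as the patch.
sub parse_jobs {
        my ($argv) = @_;
        my ($version, $jobs);
        GetOptionsFromArray($argv,
                'V|version=i' => \$version,
                'j|jobs=i' => \$jobs,
        ) or die "bad command-line arguments\n";
        $version //= 2; # ASSUMPTION: treat a missing -V as v2
        my $creat_opt = {};
        if (defined $jobs) {
                die "--jobs is only supported for -V2 inboxes\n"
                        if $version == 1;
                die "--jobs=$jobs must be >= 1\n" if $jobs <= 0;
                $creat_opt->{nproc} = $jobs;
        }
        $creat_opt;
}

# usage: request the smallest possible job count
my $opt = parse_jobs(['--jobs=1']);
print "nproc=$opt->{nproc}\n";
```

As in the patch, the value is only validated and passed through as
`nproc'; how that number maps to a Xapian shard count is decided
elsewhere and is not shown here.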