user/dev discussion of public-inbox itself
* [PATCH 1/3] init: add -j / --jobs parameter
From: Eric Wong @ 2020-06-21  0:21 UTC (permalink / raw)
  To: meta

On a powerful (by my standards) machine with 16GB RAM and a
7200 RPM HDD marketed for "enterprise" use, indexing an 8.1G (in
git) LKML snapshot from Sep 2019 did not finish after 7 days
with the default number (3) of Xapian shards (`--jobs=4') and
`--batch-size=10m'.

Indexing started off fast, but got progressively slower as the
contents of the inbox (including Xapian + SQLite DBs) could no
longer be cached by the kernel.  Once the on-disk size
increased, HDD seek contention between the Xapian shard workers
slowed the process down to a crawl.

With a single shard, it still took around 3.5 days to index on
the HDD.  That's not good, but it's far better than not
finishing after 7 days.  So allow unfortunate HDD users to
easily specify a single shard on public-inbox-init.

For reference, a freshly TRIM-ed low-end TLC SSD on the SATA II
bus on the same machine indexes that same snapshot of LKML in
~7 hours with 3 shards and the same 10m batch size.  In the past,
a higher-end consumer-grade MLC SSD on similar hardware indexed
a similarly-sized data set in ~4 hours.
---
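A quick way to confirm the effect of -j1 after init, sketched in
POSIX sh.  It mirrors the glob("$tmpdir/m/xap*/?") check added to
t/v2mirror.t in this patch.  count_shards is a hypothetical helper
(not part of public-inbox), and it assumes shard directories have
single-character names, as that glob does.

```shell
#!/bin/sh
# count_shards INBOX_DIR (hypothetical helper, not in public-inbox):
# print how many directories match INBOX_DIR/xap*/?, the same
# pattern the t/v2mirror.t check in this patch uses.
count_shards() {
	n=0
	for d in "$1"/xap*/?; do
		# an unmatched glob stays literal, so test that it exists
		if [ -d "$d" ]; then
			n=$((n + 1))
		fi
	done
	echo "$n"
}
```

For example, after running
`public-inbox-init -j1 -V2 m /path/to/m http://example.com/m alt@example.com'
(arguments as in the test, with /path/to/m as a placeholder),
`count_shards /path/to/m' should print 1.
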
 Documentation/public-inbox-init.pod | 14 ++++++++++++++
 script/public-inbox-init            |  8 ++++++++
 t/v2mirror.t                        |  4 +++-
 3 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/Documentation/public-inbox-init.pod b/Documentation/public-inbox-init.pod
index 4744da96..495a258f 100644
--- a/Documentation/public-inbox-init.pod
+++ b/Documentation/public-inbox-init.pod
@@ -48,6 +48,20 @@ added-after-the-fact (without affecting "git clone" followers).
 
 Default: unset, no epochs are skipped
 
+=item -j, --jobs=JOBS
+
+Control the number of Xapian index shards in a
+C<-V2> (L<public-inbox-v2-format(5)>) inbox.
+
+It is useful to use a single shard (C<-j1>) for inboxes on
+high-latency storage (e.g. rotational HDD) unless the system has
+enough RAM to cache 5-10x the size of the git repository.
+
+It is generally not useful to specify higher values than the
+default due to contention in the top-level producer process.
+
+Default: the number of online CPUs, up to 4
+
 =back
 
 =head1 ENVIRONMENT
diff --git a/script/public-inbox-init b/script/public-inbox-init
index 10d3ad45..00147db5 100755
--- a/script/public-inbox-init
+++ b/script/public-inbox-init
@@ -27,10 +27,12 @@ use Cwd qw/abs_path/;
 my $version = undef;
 my $indexlevel = undef;
 my $skip_epoch;
+my $jobs;
 my %opts = (
 	'V|version=i' => \$version,
 	'L|indexlevel=s' => \$indexlevel,
 	'S|skip|skip-epoch=i' => \$skip_epoch,
+	'j|jobs=i' => \$jobs,
 );
 GetOptions(%opts) or usage();
 PublicInbox::Admin::indexlevel_ok_or_die($indexlevel) if defined $indexlevel;
@@ -144,6 +146,12 @@ my $ibx = PublicInbox::Inbox->new({
 });
 
 my $creat_opt = {};
+if (defined $jobs) {
+	die "--jobs is only supported for -V2 inboxes\n" if $version == 1;
+	die "--jobs=$jobs must be >= 1\n" if $jobs <= 0;
+	$creat_opt->{nproc} = $jobs;
+}
+
 PublicInbox::InboxWritable->new($ibx, $creat_opt)->init_inbox(0, $skip_epoch);
 
 # needed for git prior to v2.1.0
diff --git a/t/v2mirror.t b/t/v2mirror.t
index fc03c3d7..b24528fe 100644
--- a/t/v2mirror.t
+++ b/t/v2mirror.t
@@ -80,9 +80,11 @@ foreach my $i (0..$epoch_max) {
 	ok(-d "$tmpdir/m/git/$i.git", "mirror $i OK");
 }
 
-@cmd = ("-init", '-V2', 'm', "$tmpdir/m", 'http://example.com/m',
+@cmd = ("-init", '-j1', '-V2', 'm', "$tmpdir/m", 'http://example.com/m',
 	'alt@example.com');
 ok(run_script(\@cmd), 'initialized public-inbox -V2');
+my @shards = glob("$tmpdir/m/xap*/?");
+is(scalar(@shards), 1, 'got a single shard on init');
 
 ok(run_script([qw(-index -j0), "$tmpdir/m"]), 'indexed');
 


* [PATCH 0/3] -init updates
From: Eric Wong @ 2020-06-21  0:21 UTC (permalink / raw)
  To: meta

Eric Wong (3):
  init: add -j / --jobs parameter
  init: refer to inboxes as "inbox" or "inboxes" in errors
  init: add --skip-artnum parameter

 Documentation/public-inbox-init.pod | 28 ++++++++++++++++++++++++++++
 lib/PublicInbox/InboxWritable.pm    | 13 ++++++++++++-
 lib/PublicInbox/Msgmap.pm           | 26 ++++++++++++++++++++++++++
 lib/PublicInbox/SearchIdx.pm        |  1 +
 lib/PublicInbox/V2Writable.pm       |  3 ++-
 script/public-inbox-init            | 21 ++++++++++++++-------
 t/init.t                            | 28 +++++++++++++++++++++++++++-
 t/v2mirror.t                        |  4 +++-
 8 files changed, 113 insertions(+), 11 deletions(-)



Code repositories for project(s) associated with this public inbox

	https://80x24.org/public-inbox.git