user/dev discussion of public-inbox itself
 help / color / Atom feed
* [PATCH] index: add --batch-size=SIZE option
@ 2020-05-17 19:37 Eric Wong
  2020-05-17 23:54 ` Kyle Meyer
  2020-05-18  3:04 ` Eric Wong
  0 siblings, 2 replies; 4+ messages in thread
From: Eric Wong @ 2020-05-17 19:37 UTC (permalink / raw)
  To: meta

On powerful systems, having this option is preferable to
XAPIAN_FLUSH_THRESHOLD due to lock granularity and contention
with other processes (-learn, -mda, -watch).

Setting XAPIAN_FLUSH_THRESHOLD can cause -learn, -mda, and
-watch to get stuck until an epoch is completely processed.
---
 Documentation/public-inbox-index.pod | 42 +++++++++++++++++++++++++---
 lib/PublicInbox/SearchIdx.pm         | 12 ++++----
 lib/PublicInbox/V2Writable.pm        |  2 +-
 script/public-inbox-index            | 11 ++++++--
 4 files changed, 53 insertions(+), 14 deletions(-)

diff --git a/Documentation/public-inbox-index.pod b/Documentation/public-inbox-index.pod
index 8a37580c..2719b39e 100644
--- a/Documentation/public-inbox-index.pod
+++ b/Documentation/public-inbox-index.pod
@@ -56,6 +56,10 @@ Xapian database.  Using this with C<--compact> or running
 L<public-inbox-compact(1)> afterwards is recommended to
 release free space.
 
+public-inbox protects writes to various indices with L<flock(2)>,
+so it is safe to reindex while L<public-inbox-watch(1)>,
+L<public-inbox-mda(1)> or L<public-inbox-learn(1)> run.
+
 This does not touch the NNTP article number database or
 affect threading.
 
@@ -72,6 +76,12 @@ Sets or overrides L</publicinbox.indexMaxSize> on a
 per-invocation basis.  See L</publicinbox.indexMaxSize>
 below.
 
+=item --batch-size SIZE
+
+Sets or overrides L</publicinbox.indexBatchSize> on a
+per-invocation basis.  See L</publicinbox.indexBatchSize>
+below.
+
 =back
 
 =head1 FILES
@@ -98,6 +108,27 @@ This option is only available in public-inbox 1.5 or later.
 
 Default: none
 
+=item publicinbox.indexBatchSize
+
+Flushes changes to the filesystem and releases locks after
+indexing the given number of bytes.  The default value of C<1m>
+(one megabyte) is low to minimize memory use and reduce
+contention with parallel invocations of L<public-inbox-mda(1)>,
+L<public-inbox-learn(1)>, and L<public-inbox-watch(1)>.
+
+Increase this value on powerful systems improve throughput at
+the expense of memory use.  The reduction of lock granularity
+may not be noticeable on fast systems.
+
+This option is available in public-inbox 1.6 or later.
+public-inbox 1.5 and earlier used the current default, C<1m>.
+
+For L<public-inbox-v2-format(5)> inboxes, this value is
+multiplied by the number of Xapian shards.  Thus a typical v2
+inbox with 3 shards will flush every 3 megabytes by default.
+
+Default: 1m (one megabyte)
+
 =back
 
 =head1 ENVIRONMENT
@@ -114,10 +145,13 @@ The number of documents to update before committing changes to
 disk.  This environment is handled directly by Xapian, refer to
 Xapian API documentation for more details.
 
-Default: our indexing code flushes every megabyte of mail seen
-to keep memory usage low.  Setting this environment variable to
-any positive value will switch to a document count-based
-threshold in Xapian.
+For public-inbox 1.6 and later, use C<publicinbox.indexBatchSize>
+instead.  Setting C<XAPIAN_FLUSH_THRESHOLD> for a large C<--reindex>
+may cause L<public-inbox-mda(1)>, L<public-inbox-learn(1)> and
+L<public-inbox-watch(1)> tasks to wait long periods of time
+during C<--reindex>.
+
+Default: none, uses C<publicinbox.indexBatchSize>
 
 =back
 
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 5f5ae895..b4088933 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -22,11 +22,9 @@ use PublicInbox::Git qw(git_unquote);
 use PublicInbox::MsgTime qw(msg_timestamp msg_datestamp);
 my $X = \%PublicInbox::Search::X;
 my ($DB_CREATE_OR_OPEN, $DB_OPEN);
-use constant {
-	BATCH_BYTES => defined($ENV{XAPIAN_FLUSH_THRESHOLD}) ?
-			0x7fffffff : 1_000_000,
-	DEBUG => !!$ENV{DEBUG},
-};
+our $BATCH_BYTES = defined($ENV{XAPIAN_FLUSH_THRESHOLD}) ?
+			0x7fffffff : 1_000_000;
+use constant DEBUG => !!$ENV{DEBUG};
 
 my $xapianlevels = qr/\A(?:full|medium)\z/;
 
@@ -585,7 +583,7 @@ sub batch_adjust ($$$$$) {
 	my ($max, $bytes, $batch_cb, $latest, $nr) = @_;
 	$$max -= $bytes;
 	if ($$max <= 0) {
-		$$max = BATCH_BYTES;
+		$$max = $BATCH_BYTES;
 		$batch_cb->($nr, $latest);
 	}
 }
@@ -610,7 +608,7 @@ sub read_log {
 	my $git = $self->{git};
 	my $latest;
 	my $bytes;
-	my $max = BATCH_BYTES;
+	my $max = $BATCH_BYTES;
 	local $/ = "\n";
 	my %D;
 	my $line;
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index bf5a0df9..c732b98a 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -159,7 +159,7 @@ sub do_idx ($$$$) {
 	my $idx = idx_shard($self, $smsg->{num} % $self->{shards});
 	$idx->index_raw($msgref, $mime, $smsg);
 	my $n = $self->{transact_bytes} += $smsg->{bytes};
-	$n >= (PublicInbox::SearchIdx::BATCH_BYTES * $self->{shards});
+	$n >= ($PublicInbox::SearchIdx::BATCH_BYTES * $self->{shards});
 }
 
 sub _add {
diff --git a/script/public-inbox-index b/script/public-inbox-index
index 2d0f0eca..0018668e 100755
--- a/script/public-inbox-index
+++ b/script/public-inbox-index
@@ -16,7 +16,7 @@ use PublicInbox::Xapcmd;
 my $compact_opt;
 my $opt = { quiet => -1, compact => 0, maxsize => undef };
 GetOptions($opt, qw(verbose|v+ reindex compact|c+ jobs|j=i prune
-		indexlevel|L=s maxsize|max-size=s))
+		indexlevel|L=s maxsize|max-size=s batchsize|batch-size=s))
 	or die "bad command-line args\n$usage";
 die "--jobs must be positive\n" if defined $opt->{jobs} && $opt->{jobs} <= 0;
 
@@ -30,13 +30,20 @@ my $cfg = PublicInbox::Config->new;
 my @ibxs = PublicInbox::Admin::resolve_inboxes(\@ARGV, undef, $cfg);
 PublicInbox::Admin::require_or_die('-index');
 unless (@ibxs) { print STDERR "Usage: $usage\n"; exit 1 }
-my $mods = {};
+
 my $max_size = $opt->{maxsize} // $cfg->{lc('publicInbox.indexMaxSize')};
 if (defined $max_size) {
 	PublicInbox::Admin::parse_unsigned(\$max_size) or
 		die "`publicInbox.indexMaxSize=$max_size' not parsed\n";
 }
 
+if (my $bs = $opt->{batchsize} // $cfg->{lc('publicInbox.indexBatchSize')}) {
+	PublicInbox::Admin::parse_unsigned(\$bs) or
+		die "`publicInbox.indexBatchSize=$bs' not parsed\n";
+	$PublicInbox::SearchIdx::BATCH_BYTES = $bs;
+}
+
+my $mods = {};
 foreach my $ibx (@ibxs) {
 	# XXX: users can shoot themselves in the foot, with opt->{indexlevel}
 	$ibx->{indexlevel} //= $opt->{indexlevel} //

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] index: add --batch-size=SIZE option
  2020-05-17 19:37 [PATCH] index: add --batch-size=SIZE option Eric Wong
@ 2020-05-17 23:54 ` Kyle Meyer
  2020-05-18  2:39   ` Eric Wong
  2020-05-18  3:04 ` Eric Wong
  1 sibling, 1 reply; 4+ messages in thread
From: Kyle Meyer @ 2020-05-17 23:54 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

Eric Wong writes:

> +Increase this value on powerful systems improve throughput at
> +the expense of memory use.  The reduction of lock granularity

I think this is missing a "to" in front of "improve".

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] index: add --batch-size=SIZE option
  2020-05-17 23:54 ` Kyle Meyer
@ 2020-05-18  2:39   ` Eric Wong
  0 siblings, 0 replies; 4+ messages in thread
From: Eric Wong @ 2020-05-18  2:39 UTC (permalink / raw)
  To: Kyle Meyer; +Cc: meta

Kyle Meyer <kyle@kyleam.com> wrote:
> Eric Wong writes:
> 
> > +Increase this value on powerful systems improve throughput at
> > +the expense of memory use.  The reduction of lock granularity
> 
> I think this is missing a "to" in front of "improve".

Thanks, fix queued for pushing.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] index: add --batch-size=SIZE option
  2020-05-17 19:37 [PATCH] index: add --batch-size=SIZE option Eric Wong
  2020-05-17 23:54 ` Kyle Meyer
@ 2020-05-18  3:04 ` Eric Wong
  1 sibling, 0 replies; 4+ messages in thread
From: Eric Wong @ 2020-05-18  3:04 UTC (permalink / raw)
  To: meta

Eric Wong <e@yhbt.net> wrote:
> On powerful systems, having this option is preferable to
> XAPIAN_FLUSH_THRESHOLD due to lock granularity and contention
> with other processes (-learn, -mda, -watch).
> 
> Setting XAPIAN_FLUSH_THRESHOLD can cause -learn, -mda, and
> -watch to get stuck until an epoch is completely processed.

Fwiw, I've debated with myself on-and-off for a number of years
about adding this knob.

I don't like tuning knobs in general because it's more stuff to
document and support, and more stuff for new users to learn.

However, it's tough to tell how much memory a user is willing to
spend or how long they're willing to wait for -mda/-watch
deliveries while --reindex is happening.  So the default is to
remain conservative and lean towards suboptimal throughput
rather than trigger a swap storm or OOM.

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, back to index

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-17 19:37 [PATCH] index: add --batch-size=SIZE option Eric Wong
2020-05-17 23:54 ` Kyle Meyer
2020-05-18  2:39   ` Eric Wong
2020-05-18  3:04 ` Eric Wong

user/dev discussion of public-inbox itself

Archives are clonable:
	git clone --mirror http://public-inbox.org/meta
	git clone --mirror http://czquwvybam4bgbro.onion/meta
	git clone --mirror http://hjrcffqmbrq6wope.onion/meta
	git clone --mirror http://ou63pmih66umazou.onion/meta

Example config snippet for mirrors

Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.mail.public-inbox.meta
	nntp://ou63pmih66umazou.onion/inbox.comp.mail.public-inbox.meta
	nntp://czquwvybam4bgbro.onion/inbox.comp.mail.public-inbox.meta
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.mail.public-inbox.meta
	nntp://news.gmane.io/gmane.mail.public-inbox.general

 note: .onion URLs require Tor: https://www.torproject.org/

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git