user/dev discussion of public-inbox itself
 help / color / mirror / Atom feed
* [PATCH 00/23] indexing: --skip-docdata + speedups
@ 2020-08-20 20:24 Eric Wong
  2020-08-20 20:24 ` [PATCH 01/23] doc: note -compact and -xcpdb are rarely used Eric Wong
                   ` (22 more replies)
  0 siblings, 23 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
  To: meta

Some miscellaneous help and cleanup things, too.

Document data is no longer read from Xapian by read-only
daemons; that data is redundant given over.sqlite3 always
exists.  This should should improve page cache hit rates for
over.sqlite3 by a small bit.  Being able to mass load a bunch of
rows from SQLite speeds up the default search summary view by
~10%, too.

--skip-docdata option to -init and -index can avoid writing
Xapian document data, saving ~1.5% in Xapian space overhead
(and associated I/O and page cache overheads).  It breaks
rollbacks to old versions, though, so it won't be the default.

Eric Wong (23):
  doc: note -compact and -xcpdb are rarely used
  admin: progress shows the inbox being indexed
  compact: support --help/-? and perform lazy loading
  init: support --help and -?
  init: support --newsgroup option
  init: drop -N alias for --skip-artnum
  search: v2: ensure shards are numerically sorted
  xapcmd: simplify {reindex} parameter passing
  www: reduce long-lived PublicInbox::Search references
  search: improve comments around constants
  search: export mdocid subroutine
  searchquery: split off from searchview
  search: make qparse_new an internal function
  smsg: reduce utf8::decode call sites
  searchview: use over.sqlite3 instead of Xapian docdata
  searchview: speed up search summary by ~10%
  searchview: convert nested and Atom display to over.sqlite3
  extmsg: avoid using Xapian docdata
  mbox: avoid Xapian docdata in search results
  smsg: remove from_mitem
  t/nntpd-v2: set PI_TEST_VERSION=2 properly
  init+index: support --skip-docdata for Xapian
  search: add mset_to_artnums method

 Documentation/public-inbox-compact.pod |   5 +
 Documentation/public-inbox-config.pod  |   2 +-
 Documentation/public-inbox-index.pod   |   8 ++
 Documentation/public-inbox-init.pod    |  37 +++++++-
 Documentation/public-inbox-xcpdb.pod   |   3 +
 MANIFEST                               |   1 +
 lib/PublicInbox/Admin.pm               |  15 ++-
 lib/PublicInbox/ExtMsg.pm              |  19 ++--
 lib/PublicInbox/IMAP.pm                |   5 +-
 lib/PublicInbox/Inbox.pm               |  11 ++-
 lib/PublicInbox/Mbox.pm                |  24 +++--
 lib/PublicInbox/Over.pm                |  28 +++---
 lib/PublicInbox/Search.pm              | 125 ++++++++++++++-----------
 lib/PublicInbox/SearchIdx.pm           |  35 +++++--
 lib/PublicInbox/SearchIdxShard.pm      |   2 +-
 lib/PublicInbox/SearchQuery.pm         |  53 +++++++++++
 lib/PublicInbox/SearchView.pm          |  90 ++++--------------
 lib/PublicInbox/Smsg.pm                |  10 +-
 lib/PublicInbox/Xapcmd.pm              |  20 ++--
 script/public-inbox-compact            |  39 ++++++--
 script/public-inbox-convert            |   3 +-
 script/public-inbox-index              |   7 +-
 script/public-inbox-init               | 101 +++++++++++++-------
 t/imapd.t                              |   6 +-
 t/inbox_idle.t                         |   2 +-
 t/index-git-times.t                    |  11 ++-
 t/init.t                               |  17 +++-
 t/nntpd-v2.t                           |   2 +-
 t/nntpd.t                              |   9 +-
 t/search.t                             |  34 ++++---
 30 files changed, 437 insertions(+), 287 deletions(-)
 create mode 100644 lib/PublicInbox/SearchQuery.pm

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 01/23] doc: note -compact and -xcpdb are rarely used
  2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
  2020-08-20 20:24 ` [PATCH 02/23] admin: progress shows the inbox being indexed Eric Wong
                   ` (21 subsequent siblings)
  22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
  To: meta

Slowly improving the learning curve...
---
 Documentation/public-inbox-compact.pod | 5 +++++
 Documentation/public-inbox-xcpdb.pod   | 3 +++
 2 files changed, 8 insertions(+)

diff --git a/Documentation/public-inbox-compact.pod b/Documentation/public-inbox-compact.pod
index 8e463ab1..4e9b6d9f 100644
--- a/Documentation/public-inbox-compact.pod
+++ b/Documentation/public-inbox-compact.pod
@@ -19,6 +19,11 @@ It enforces the use of the C<--no-renumber> option of
 L<xapian-compact(1)> which is required to work with the
 rest of the public-inbox search code.
 
+This command is rarely needed for active inboxes.
+
+Using the C<--compact> option of L<public-inbox-index(1)>
+is recommended, instead, and only when doing a C<--reindex>.
+
 =head1 OPTIONS
 
 =over
diff --git a/Documentation/public-inbox-xcpdb.pod b/Documentation/public-inbox-xcpdb.pod
index 2b38f1da..52939894 100644
--- a/Documentation/public-inbox-xcpdb.pod
+++ b/Documentation/public-inbox-xcpdb.pod
@@ -17,6 +17,9 @@ accept parallel Xapian database modifications from
 L<public-inbox-watch(1)>, L<public-inbox-mda(1)>,
 L<public-inbox-learn(1)>, and L<public-inbox-index(1)>.
 
+This command is rarely used, as Xapian DB formats rarely
+change.
+
 =head1 OPTIONS
 
 =over

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 02/23] admin: progress shows the inbox being indexed
  2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
  2020-08-20 20:24 ` [PATCH 01/23] doc: note -compact and -xcpdb are rarely used Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
  2020-08-20 20:24 ` [PATCH 03/23] compact: support --help/-? and perform lazy loading Eric Wong
                   ` (20 subsequent siblings)
  22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
  To: meta

This is helpful with --all, or when multiple inboxes
are being indexed.
---
 lib/PublicInbox/Admin.pm | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/lib/PublicInbox/Admin.pm b/lib/PublicInbox/Admin.pm
index d99a00b4..f5427af7 100644
--- a/lib/PublicInbox/Admin.pm
+++ b/lib/PublicInbox/Admin.pm
@@ -207,6 +207,9 @@ sub index_terminate {
 sub index_inbox {
 	my ($ibx, $im, $opt) = @_;
 	my $jobs = delete $opt->{jobs} if $opt;
+	if (my $pr = $opt->{-progress}) {
+		$pr->("indexing $ibx->{inboxdir} ...\n");
+	}
 	local %SIG = %SIG;
 	setup_signals(\&index_terminate, $ibx);
 	if (ref($ibx) && $ibx->version == 2) {

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 03/23] compact: support --help/-? and perform lazy loading
  2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
  2020-08-20 20:24 ` [PATCH 01/23] doc: note -compact and -xcpdb are rarely used Eric Wong
  2020-08-20 20:24 ` [PATCH 02/23] admin: progress shows the inbox being indexed Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
  2020-08-20 20:24 ` [PATCH 04/23] init: support --help and -? Eric Wong
                   ` (19 subsequent siblings)
  22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
  To: meta

This probably won't be used much, but --help can still
make sense.
---
 script/public-inbox-compact | 39 +++++++++++++++++++++++++++----------
 1 file changed, 29 insertions(+), 10 deletions(-)

diff --git a/script/public-inbox-compact b/script/public-inbox-compact
index b5fa0086..a6bb62bd 100755
--- a/script/public-inbox-compact
+++ b/script/public-inbox-compact
@@ -1,18 +1,37 @@
-#!/usr/bin/perl -w
+#!perl -w
 # Copyright (C) 2018-2020 all contributors <meta@public-inbox.org>
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
 use strict;
-use warnings;
+use v5.10.1;
 use Getopt::Long qw(:config gnu_getopt no_ignore_case auto_abbrev);
-use PublicInbox::InboxWritable;
-use PublicInbox::Xapcmd;
-use PublicInbox::Admin;
-PublicInbox::Admin::require_or_die('-index');
-my $usage = "Usage: public-inbox-compact INBOX_DIR\n";
+my $usage = 'public-inbox-compact INBOX_DIR';
 my $opt = { compact => 1, -coarse_lock => 1 };
-GetOptions($opt, qw(all), @PublicInbox::Xapcmd::COMPACT_OPT) or
-	die "bad command-line args\n$usage";
-my @ibxs = PublicInbox::Admin::resolve_inboxes(\@ARGV, $opt) or die $usage;
+my $help = <<EOF; # the following should fit w/o scrolling in 80x24 term:
+Usage: $usage
+
+  Compact Xapian DBs in an inbox
+
+options:
+
+  --all               index all configured inboxes
+  --jobs=NUM          control parallelization
+
+See public-inbox-compact(1) man page for full documentation.
+EOF
+GetOptions($opt, qw(all help|?),
+	# compact options:
+	qw(jobs|j=i quiet|q blocksize|b=s no-full|n fuller|F),
+) or die "bad command-line args\n$usage\n";
+if ($opt->{help}) { print $help; exit 0 };
+
+require PublicInbox::Admin;
+PublicInbox::Admin::require_or_die('-index');
+PublicInbox::Admin::progress_prepare($opt);
+
+require PublicInbox::InboxWritable;
+require PublicInbox::Xapcmd;
+my @ibxs = PublicInbox::Admin::resolve_inboxes(\@ARGV, $opt);
+unless (@ibxs) { print STDERR "Usage: $usage\n"; exit 1 }
 foreach (@ibxs) {
 	my $ibx = PublicInbox::InboxWritable->new($_);
 	PublicInbox::Xapcmd::run($ibx, 'compact', $opt);

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 04/23] init: support --help and -?
  2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
                   ` (2 preceding siblings ...)
  2020-08-20 20:24 ` [PATCH 03/23] compact: support --help/-? and perform lazy loading Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
  2020-08-20 20:24 ` [PATCH 05/23] init: support --newsgroup option Eric Wong
                   ` (18 subsequent siblings)
  22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
  To: meta

And speed those up with some lazy loading, too.
---
 script/public-inbox-init | 79 +++++++++++++++++++++++++---------------
 1 file changed, 50 insertions(+), 29 deletions(-)

diff --git a/script/public-inbox-init b/script/public-inbox-init
index 1c8066df..6852f64a 100755
--- a/script/public-inbox-init
+++ b/script/public-inbox-init
@@ -1,57 +1,76 @@
-#!/usr/bin/perl -w
+#!perl -w
 # Copyright (C) 2014-2020 all contributors <meta@public-inbox.org>
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
-#
-# Initializes a public-inbox, basically a wrapper for git-init(1)
 use strict;
-use warnings;
-sub usage {
-	print STDERR <<EOF;
-Usage: public-inbox-init NAME INBOX_DIR HTTP_URL ADDRESS [ADDRESS..]
-EOF
-	exit 1;
-}
+use v5.10.1;
 use Getopt::Long qw/:config gnu_getopt no_ignore_case auto_abbrev/;
-use PublicInbox::Admin;
-PublicInbox::Admin::require_or_die('-base');
-use PublicInbox::Config;
-use PublicInbox::InboxWritable;
-use PublicInbox::Import;
-use File::Temp;
-use PublicInbox::Lock;
-use File::Basename qw/dirname/;
-use File::Path qw/mkpath/;
 use Fcntl qw(:DEFAULT);
-use Cwd qw/abs_path/;
+my $usage = 'public-inbox-init NAME INBOX_DIR HTTP_URL ADDRESS [ADDRESS..]';
+my $help = <<EOF; # the following should fit w/o scrolling in 80x24 term:
+usage: $usage
+
+  Initialize a public-inbox
+
+required arguments:
+
+  NAME                the name of the inbox
+  INBOX_DIR           pathname the inbox
+  HTTP_URL            HTTP (or HTTPS) URL
+  ADDRESS             email address(es), may be specified multiple times
+
+options:
 
-my ($version, $indexlevel, $skip_epoch, $skip_artnum, $jobs);
+  -V2                 use scalable public-inbox-v2-format(5)
+  -L LEVEL            index level `basic', `medium', or `full' (default: full)
+  --skip-artnum=NUM   NNTP article numbers to skip
+  --skip-epoch=NUM    epochs to skip (-V2 only)
+  -J JOBS             number of indexing jobs (-V2 only), (default: 4)
+
+See public-inbox-init(1) man page for full documentation.
+EOF
+
+require PublicInbox::Admin;
+PublicInbox::Admin::require_or_die('-base');
+
+my ($version, $indexlevel, $skip_epoch, $skip_artnum, $jobs, $show_help);
 my %opts = (
 	'V|version=i' => \$version,
 	'L|index-level|indexlevel=s' => \$indexlevel,
 	'S|skip|skip-epoch=i' => \$skip_epoch,
 	'N|skip-artnum=i' => \$skip_artnum,
 	'j|jobs=i' => \$jobs,
+	'help|?' => \$show_help,
 );
-GetOptions(%opts) or usage();
+my $usage_cb = sub {
+	print STDERR "Usage: $usage\n";
+	exit 1;
+};
+GetOptions(%opts) or $usage_cb->();
+if ($show_help) { print $help; exit 0 };
 PublicInbox::Admin::indexlevel_ok_or_die($indexlevel) if defined $indexlevel;
-my $name = shift @ARGV or usage();
-my $inboxdir = shift @ARGV or usage();
-my $http_url = shift @ARGV or usage();
+my $name = shift @ARGV or $usage_cb->();
+my $inboxdir = shift @ARGV or $usage_cb->();
+my $http_url = shift @ARGV or $usage_cb->();
 my (@address) = @ARGV;
-@address or usage();
+@address or $usage_cb->();
 my %seen;
 
+require PublicInbox::Config;
 my $pi_config = PublicInbox::Config->default_file;
-my $dir = dirname($pi_config);
-mkpath($dir); # will croak on fatal errors
+require File::Basename;
+my $dir = File::Basename::dirname($pi_config);
+require File::Path;
+File::Path::mkpath($dir); # will croak on fatal errors
 
 # first, we grab a flock to prevent simultaneous public-inbox-init
 # processes from trampling over each other, or exiting with 255 on
 # O_EXCL failure below.  This gets unlocked automatically on exit:
+require PublicInbox::Lock;
 my $lock_obj = { lock_path => "$pi_config.flock" };
 PublicInbox::Lock::lock_acquire($lock_obj);
 
 # git-config will operate on this (and rename on success):
+require File::Temp;
 my $fh = File::Temp->new(TEMPLATE => 'pi-init-XXXXXXXX', DIR => $dir);
 
 # Now, we grab another lock to use git-config(1) locking, so it won't
@@ -111,7 +130,8 @@ close($fh) or die "failed to close $pi_config_tmp: $!\n";
 my $pfx = "publicinbox.$name";
 my @x = (qw/git config/, "--file=$pi_config_tmp");
 
-$inboxdir = abs_path($inboxdir);
+require Cwd;
+$inboxdir = Cwd::abs_path($inboxdir);
 die "`\\n' not allowed in `$inboxdir'\n" if $inboxdir =~ /\n/s;
 if (-f "$inboxdir/inbox.lock") {
 	if (!defined $version) {
@@ -148,6 +168,7 @@ if (defined $jobs) {
 	$creat_opt->{nproc} = $jobs;
 }
 
+require PublicInbox::InboxWritable;
 $ibx = PublicInbox::InboxWritable->new($ibx, $creat_opt);
 $ibx->init_inbox(0, $skip_epoch, $skip_artnum);
 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 05/23] init: support --newsgroup option
  2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
                   ` (3 preceding siblings ...)
  2020-08-20 20:24 ` [PATCH 04/23] init: support --help and -? Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
  2020-08-20 21:10   ` Eric Wong
  2020-08-20 20:24 ` [PATCH 06/23] init: drop -N alias for --skip-artnum Eric Wong
                   ` (17 subsequent siblings)
  22 siblings, 1 reply; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
  To: meta

We can reduce the need to edit the config file for NNTP group names
this way.
---
 Documentation/public-inbox-config.pod |  2 +-
 Documentation/public-inbox-init.pod   | 25 +++++++++++++++++++++----
 script/public-inbox-init              | 12 ++++++++++--
 t/imapd.t                             |  6 ++----
 t/nntpd.t                             |  9 +++------
 5 files changed, 37 insertions(+), 17 deletions(-)

diff --git a/Documentation/public-inbox-config.pod b/Documentation/public-inbox-config.pod
index 05b84819..1dfb926e 100644
--- a/Documentation/public-inbox-config.pod
+++ b/Documentation/public-inbox-config.pod
@@ -63,7 +63,7 @@ Default: none, optional
 =item publicinbox.<name>.newsgroup
 
 The NNTP group name for use with L<public-inbox-nntpd(8)>.  This
-may be any newsgroup name with hierarchies delimited by '.'.
+may be any newsgroup name with hierarchies delimited by C<.>.
 For example, the newsgroup for L<mailto:meta@public-inbox.org>
 is: C<inbox.comp.mail.public-inbox.meta>
 
diff --git a/Documentation/public-inbox-init.pod b/Documentation/public-inbox-init.pod
index d0c87563..240959eb 100644
--- a/Documentation/public-inbox-init.pod
+++ b/Documentation/public-inbox-init.pod
@@ -39,6 +39,21 @@ See L<public-inbox-config(5)> for more information.
 
 Default: C<full>
 
+=item --ng, --newsgroup NEWSGROUP
+
+The NNTP group name for use with L<public-inbox-nntpd(8)>.  This
+may be any newsgroup name with hierarchies delimited by C<.>.
+For example, the newsgroup for L<mailto:meta@public-inbox.org>
+is: C<inbox.comp.mail.public-inbox.meta>
+
+This may be set after-the-fact via C<publicinbox.$NAME.newsgroup>
+in the configuration file.  See L<public-inbox-config(5)> for more
+info.
+
+Available since public-inbox 1.6.0 (PENDING).
+
+Default: none.
+
 =item -N, --skip-artnum
 
 This option allows archivists to publish incomplete archives
@@ -94,10 +109,12 @@ Used to override the default C<~/.public-inbox/config> value.
 
 =head1 LIMITATIONS
 
-This tool predates NNTP support in public-inbox and is missing
-C<newsgroup> and many of the options documented in
-L<public-inbox-config(5)>.  See L<public-inbox-config(5)> for all the
-options which may be applied to a given inbox.
+Some of the options documented in L<public-inbox-config(5)>
+require editing the config file.  Old versions lack the
+C<-n>/C<--newsgroup> parameter
+
+See L<public-inbox-config(5)> for all the options which may be applied
+to a given inbox.
 
 =head1 CONTACT
 
diff --git a/script/public-inbox-init b/script/public-inbox-init
index 6852f64a..90b32be8 100755
--- a/script/public-inbox-init
+++ b/script/public-inbox-init
@@ -22,6 +22,7 @@ options:
 
   -V2                 use scalable public-inbox-v2-format(5)
   -L LEVEL            index level `basic', `medium', or `full' (default: full)
+  --ng NEWSGROUP      set NNTP newsgroup name
   --skip-artnum=NUM   NNTP article numbers to skip
   --skip-epoch=NUM    epochs to skip (-V2 only)
   -J JOBS             number of indexing jobs (-V2 only), (default: 4)
@@ -33,12 +34,14 @@ require PublicInbox::Admin;
 PublicInbox::Admin::require_or_die('-base');
 
 my ($version, $indexlevel, $skip_epoch, $skip_artnum, $jobs, $show_help);
+my $ng = '';
 my %opts = (
 	'V|version=i' => \$version,
 	'L|index-level|indexlevel=s' => \$indexlevel,
 	'S|skip|skip-epoch=i' => \$skip_epoch,
 	'N|skip-artnum=i' => \$skip_artnum,
 	'j|jobs=i' => \$jobs,
+	'ng|newsgroup=s' => \$ng,
 	'help|?' => \$show_help,
 );
 my $usage_cb = sub {
@@ -53,7 +56,11 @@ my $inboxdir = shift @ARGV or $usage_cb->();
 my $http_url = shift @ARGV or $usage_cb->();
 my (@address) = @ARGV;
 @address or $usage_cb->();
-my %seen;
+
+$ng =~ m![^A-Za-z0-9/_\.\-\~\@\+\=:]! and
+	die "--newsgroup `$ng' is not valid\n";
+($ng =~ m!\A\.! || $ng =~ m!\.\z!) and
+	die "--newsgroup `$ng' must not start or end with `.'\n";
 
 require PublicInbox::Config;
 my $pi_config = PublicInbox::Config->default_file;
@@ -84,7 +91,7 @@ sysopen($lockfh, $lockfile, O_RDWR|O_CREAT|O_EXCL) or do {
 	exit(255);
 };
 my $auto_unlink = UnlinkMe->new($lockfile);
-my $perm;
+my ($perm, %seen);
 if (-e $pi_config) {
 	open(my $oh, '<', $pi_config) or die "unable to read $pi_config: $!\n";
 	my @st = stat($oh);
@@ -185,6 +192,7 @@ PublicInbox::Import::run_die([@x, "$pfx.inboxdir", $inboxdir]);
 if (defined($indexlevel)) {
 	PublicInbox::Import::run_die([@x, "$pfx.indexlevel", $indexlevel]);
 }
+PublicInbox::Import::run_die([@x, "$pfx.newsgroup", $ng]) if $ng ne '';
 
 # needed for git prior to v2.1.0
 if (defined $perm) {
diff --git a/t/imapd.t b/t/imapd.t
index 6cfced41..4d627af7 100644
--- a/t/imapd.t
+++ b/t/imapd.t
@@ -39,11 +39,9 @@ for my $V (@V) {
 	my $url = "http://example.com/i$V";
 	my $inboxdir = "$tmpdir/$name";
 	my $folder = "inbox.i$V";
-	my $cmd = ['-init', "-V$V", "-L$level", $name, $inboxdir, $url, $addr];
+	my $cmd = ['-init', "-V$V", "-L$level", "--ng=$folder",
+		$name, $inboxdir, $url, $addr];
 	run_script($cmd) or BAIL_OUT("init $name");
-	xsys(qw(git config), "--file=$ENV{HOME}/.public-inbox/config",
-			"publicinbox.$name.newsgroup", $folder) == 0 or
-			BAIL_OUT("setting newsgroup $V");
 	if ($V == 1) {
 		xsys(qw(git config), "--file=$ENV{HOME}/.public-inbox/config",
 			'publicinboxmda.spamcheck', 'none') == 0 or
diff --git a/t/nntpd.t b/t/nntpd.t
index b9b9a63d..74e21a41 100644
--- a/t/nntpd.t
+++ b/t/nntpd.t
@@ -46,14 +46,11 @@ my $ibx = {
 $ibx = PublicInbox::Inbox->new($ibx);
 {
 	local $ENV{HOME} = $home;
-	my @cmd = ('-init', $group, $inboxdir, 'http://example.com/', $addr);
-	push @cmd, "-V$version", '-Lbasic';
+	my @cmd = ('-init', $group, $inboxdir, 'http://example.com/', $addr,
+		"-V$version", '-Lbasic', '--newsgroup', $group);
 	ok(run_script(\@cmd), 'init OK');
-	is(xsys(qw(git config), "--file=$home/.public-inbox/config",
-			"publicinbox.$group.newsgroup", $group),
-		0, 'enabled newsgroup');
-	my $len;
 
+	my $len;
 	$ibx = PublicInbox::InboxWritable->new($ibx);
 	my $im = $ibx->importer(0);
 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 06/23] init: drop -N alias for --skip-artnum
  2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
                   ` (4 preceding siblings ...)
  2020-08-20 20:24 ` [PATCH 05/23] init: support --newsgroup option Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
  2020-08-20 20:24 ` [PATCH 07/23] search: v2: ensure shards are numerically sorted Eric Wong
                   ` (16 subsequent siblings)
  22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
  To: meta

It may be too easily confused for --newsgroup or --ng.  This is
too rarely used and never made it into a release, so it should
be fine.
---
 Documentation/public-inbox-init.pod | 2 +-
 script/public-inbox-init            | 2 +-
 t/init.t                            | 4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/Documentation/public-inbox-init.pod b/Documentation/public-inbox-init.pod
index 240959eb..4cc7e29f 100644
--- a/Documentation/public-inbox-init.pod
+++ b/Documentation/public-inbox-init.pod
@@ -54,7 +54,7 @@ Available since public-inbox 1.6.0 (PENDING).
 
 Default: none.
 
-=item -N, --skip-artnum
+=item --skip-artnum
 
 This option allows archivists to publish incomplete archives
 with only new mail while allowing NNTP article numbers
diff --git a/script/public-inbox-init b/script/public-inbox-init
index 90b32be8..b19c2321 100755
--- a/script/public-inbox-init
+++ b/script/public-inbox-init
@@ -39,7 +39,7 @@ my %opts = (
 	'V|version=i' => \$version,
 	'L|index-level|indexlevel=s' => \$indexlevel,
 	'S|skip|skip-epoch=i' => \$skip_epoch,
-	'N|skip-artnum=i' => \$skip_artnum,
+	'skip-artnum=i' => \$skip_artnum,
 	'j|jobs=i' => \$jobs,
 	'ng|newsgroup=s' => \$ng,
 	'help|?' => \$show_help,
diff --git a/t/init.t b/t/init.t
index 6211bb58..4d2c5049 100644
--- a/t/init.t
+++ b/t/init.t
@@ -116,7 +116,7 @@ SKIP: {
 			'publicinboxmda.spamcheck', 'none') == 0 or
 			BAIL_OUT "git config $?";
 	my $addr = 'skip3@example.com';
-	$cmd = [ qw(-init -V2 -Lbasic -N12 skip3), "$tmpdir/skip3",
+	$cmd = [ qw(-init -V2 -Lbasic --skip-artnum=12 skip3), "$tmpdir/skip3",
 		   qw(http://example.com/skip3), $addr ];
 	ok(run_script($cmd), '--skip-artnum -V2');
 	my $env = { ORIGINAL_RECIPIENT => $addr };
@@ -131,7 +131,7 @@ SKIP: {
 
 	$addr = 'skip4@example.com';
 	$env = { ORIGINAL_RECIPIENT => $addr };
-	$cmd = [ qw(-init -V1 -N12 -Lmedium skip4), "$tmpdir/skip4",
+	$cmd = [ qw(-init -V1 --skip-artnum 12 -Lmedium skip4), "$tmpdir/skip4",
 		   qw(http://example.com/skip4), $addr ];
 	ok(run_script($cmd), '--skip-artnum -V1');
 	$err = '';

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 07/23] search: v2: ensure shards are numerically sorted
  2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
                   ` (5 preceding siblings ...)
  2020-08-20 20:24 ` [PATCH 06/23] init: drop -N alias for --skip-artnum Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
  2020-08-20 20:24 ` [PATCH 08/23] xapcmd: simplify {reindex} parameter passing Eric Wong
                   ` (15 subsequent siblings)
  22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
  To: meta

This seems required to correctly get the NNTP article number
from Xapian docid on combined Xapian DBs.  The default
(ASCII-betical) sorting was only acceptable for -imapd users
until somebody hit 11 (or more) shards, which is a rare case.
---
 lib/PublicInbox/Search.pm | 27 ++++++++++++++++-----------
 1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 4e08aed7..4d02a7c1 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -177,19 +177,24 @@ sub _xdb ($) {
 	my ($xdb, $slow_phrase);
 	my $qpf = \($self->{qp_flags} ||= $QP_FLAGS);
 	if ($self->{ibx_ver} >= 2) {
-		my $n = 0;
-		foreach my $shard (<$dir/*>) {
-			-d $shard && $shard =~ m!/[0-9]+\z! or next;
-			my $sub = $X{Database}->new($shard);
-			if ($xdb) {
-				$xdb->add_database($sub);
-			} else {
-				$xdb = $sub;
+		my @xdb;
+		opendir(my $dh, $dir) or return; # not initialized yet
+
+		# We need numeric sorting so shard[0] is first for reading
+		# Xapian metadata, if needed
+		for (sort { $a <=> $b } grep(/\A[0-9]+\z/, readdir($dh))) {
+			my $shard_dir = "$dir/$_";
+			if (-d $shard_dir && -r _) {
+				push @xdb, $X{Database}->new($shard_dir);
+				$slow_phrase ||= -f "$shard_dir/iamchert";
+			} else { # gaps from missing epochs throw off mdocid()
+				warn "E: $shard_dir missing or unreadable\n";
+				return;
 			}
-			$slow_phrase ||= -f "$shard/iamchert";
-			++$n;
 		}
-		$self->{nshard} = $n;
+		$self->{nshard} = scalar(@xdb);
+		$xdb = shift @xdb;
+		$xdb->add_database($_) for @xdb;
 	} else {
 		$slow_phrase = -f "$dir/iamchert";
 		$xdb = $X{Database}->new($dir);

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 08/23] xapcmd: simplify {reindex} parameter passing
  2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
                   ` (6 preceding siblings ...)
  2020-08-20 20:24 ` [PATCH 07/23] search: v2: ensure shards are numerically sorted Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
  2020-08-20 20:24 ` [PATCH 09/23] www: reduce long-lived PublicInbox::Search references Eric Wong
                   ` (14 subsequent siblings)
  22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
  To: meta

No need to localize it, here, since we can just refer to it
in the `$opt' hashref.  Hopefully this improves readability
for others like it does for me.

I sometimes wonder if the concept of a stack in high-level
languages is even necessary...
---
 lib/PublicInbox/Xapcmd.pm | 20 +++++++++-----------
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/lib/PublicInbox/Xapcmd.pm b/lib/PublicInbox/Xapcmd.pm
index 46548a94..fffac99c 100644
--- a/lib/PublicInbox/Xapcmd.pm
+++ b/lib/PublicInbox/Xapcmd.pm
@@ -19,7 +19,6 @@ our @COMPACT_OPT = qw(jobs|j=i quiet|q blocksize|b=s no-full|n fuller|F);
 sub commit_changes ($$$$) {
 	my ($ibx, $im, $tmp, $opt) = @_;
 	my $reshard = $opt->{reshard};
-	my $reindex = $opt->{reindex};
 
 	$SIG{INT} or die 'BUG: $SIG{INT} not handled';
 	my @old_shard;
@@ -102,17 +101,17 @@ sub runnable_or_die ($) {
 }
 
 sub prepare_reindex ($$$) {
-	my ($ibx, $im, $reindex) = @_;
+	my ($ibx, $im, $opt) = @_;
 	if ($ibx->version == 1) {
 		my $dir = $ibx->search->xdir(1);
 		my $xdb = $PublicInbox::Search::X{Database}->new($dir);
 		if (my $lc = $xdb->get_metadata('last_commit')) {
-			$reindex->{from} = $lc;
+			$opt->{reindex}->{from} = $lc;
 		}
 	} else { # v2
 		my $max;
 		$im->git_dir_latest(\$max) or return;
-		my $from = $reindex->{from};
+		my $from = $opt->{reindex}->{from};
 		my $mm = $ibx->mm;
 		my $v = PublicInbox::Search::SCHEMA_VERSION();
 		foreach my $i (0..$max) {
@@ -238,14 +237,14 @@ sub prepare_run {
 sub check_compact () { runnable_or_die($XAPIAN_COMPACT) }
 
 sub _run {
-	my ($ibx, $cb, $opt, $reindex) = @_;
+	my ($ibx, $cb, $opt) = @_;
 	my $im = $ibx->importer(0);
 	$im->lock_acquire;
 	my ($tmp, $queue) = prepare_run($ibx, $opt);
 
 	# fine-grained locking if we prepare for reindex
 	if (!$opt->{-coarse_lock}) {
-		prepare_reindex($ibx, $im, $reindex);
+		prepare_reindex($ibx, $im, $opt);
 		$im->lock_release;
 	}
 
@@ -262,19 +261,18 @@ sub run {
 	defined(my $dir = $ibx->{inboxdir}) or die "no inboxdir defined\n";
 	-d $dir or die "inboxdir=$dir does not exist\n";
 	check_compact() if $opt->{compact} && $ibx->search;
-	my $reindex; # v1:{ from => $x40 }, v2:{ from => [ $x40, $x40, .. ] } }
 
 	if (!$opt->{-coarse_lock}) {
-		$reindex = $opt->{reindex} = { # per-epoch ranges for v2
-			from => $ibx->version == 1 ? '' : [],
-		};
+		# per-epoch ranges for v2
+		# v1:{ from => $OID }, v2:{ from => [ $OID, $OID, $OID ] } }
+		$opt->{reindex} = { from => $ibx->version == 1 ? '' : [] };
 		PublicInbox::SearchIdx::load_xapian_writable();
 	}
 
 	local %SIG = %SIG;
 	setup_signals();
 	$ibx->umask_prepare;
-	$ibx->with_umask(\&_run, $ibx, $cb, $opt, $reindex);
+	$ibx->with_umask(\&_run, $ibx, $cb, $opt);
 }
 
 sub cpdb_retryable ($$) {

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 09/23] www: reduce long-lived PublicInbox::Search references
  2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
                   ` (7 preceding siblings ...)
  2020-08-20 20:24 ` [PATCH 08/23] xapcmd: simplify {reindex} parameter passing Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
  2020-08-20 20:24 ` [PATCH 10/23] search: improve comments around constants Eric Wong
                   ` (13 subsequent siblings)
  22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
  To: meta

While this is unlikely to be a problem in current practice,
keeping Xapian DBs open for long responses can interfere with
free space recovery after -compact.

In the future, it will interfere with inbox search grouping
and lead to unexpected results.
---
 lib/PublicInbox/Inbox.pm      | 11 ++++++++---
 lib/PublicInbox/Mbox.pm       |  6 +++---
 lib/PublicInbox/SearchView.pm |  4 ++--
 3 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/lib/PublicInbox/Inbox.pm b/lib/PublicInbox/Inbox.pm
index 267be4e3..55e546e1 100644
--- a/lib/PublicInbox/Inbox.pm
+++ b/lib/PublicInbox/Inbox.pm
@@ -191,14 +191,19 @@ sub mm {
 	};
 }
 
-sub search ($;$) {
-	my ($self, $over_only) = @_;
+sub search ($;$$) {
+	my ($self, $over_only, $ctx) = @_;
 	my $srch = $self->{search} ||= eval {
 		_cleanup_later($self);
 		require PublicInbox::Search;
 		PublicInbox::Search->new($self);
 	};
-	($over_only || eval { $srch->xdb }) ? $srch : undef;
+	($over_only || eval { $srch->xdb }) ? $srch : do {
+		$ctx and $ctx->{env}->{'psgi.errors'}->print(<<EOF);
+`$self->{name}' search went away unexpectedly
+EOF
+		undef;
+	};
 }
 
 sub over ($) {
diff --git a/lib/PublicInbox/Mbox.pm b/lib/PublicInbox/Mbox.pm
index fc83a893..a83c0356 100644
--- a/lib/PublicInbox/Mbox.pm
+++ b/lib/PublicInbox/Mbox.pm
@@ -205,8 +205,8 @@ sub mbox_all_ids {
 
 sub results_cb {
 	my ($ctx) = @_;
+	my $srch = $ctx->{-inbox}->search(undef, $ctx) or return;
 	my $mset = $ctx->{mset};
-	my $srch = $ctx->{srch};
 	while (1) {
 		while (my $mi = (($mset->items)[$ctx->{iter}++])) {
 			my $smsg = PublicInbox::Smsg::from_mitem($mi,
@@ -227,8 +227,8 @@ sub mbox_all {
 
 	return mbox_all_ids($ctx) if $query eq '';
 	my $qopts = $ctx->{qopts} = { mset => 2 };
-	my $srch = $ctx->{srch} = $ctx->{-inbox}->search or
-		return PublicInbox::WWW::need($ctx, 'Search');;
+	my $srch = $ctx->{-inbox}->search or
+		return PublicInbox::WWW::need($ctx, 'Search');
 	my $mset = $ctx->{mset} = $srch->query($query, $qopts);
 	$qopts->{offset} = $mset->size or
 			return [404, [qw(Content-Type text/plain)],
diff --git a/lib/PublicInbox/SearchView.pm b/lib/PublicInbox/SearchView.pm
index 84c04c6c..5d77469e 100644
--- a/lib/PublicInbox/SearchView.pm
+++ b/lib/PublicInbox/SearchView.pm
@@ -321,7 +321,6 @@ sub adump {
 	my ($cb, $mset, $q, $ctx) = @_;
 	$ctx->{items} = [ $mset->items ];
 	$ctx->{search_query} = $q; # used by WwwAtomStream::atom_header
-	$ctx->{srch} = $ctx->{-inbox}->search;
 	PublicInbox::WwwAtomStream->response($ctx, 200, \&adump_i);
 }
 
@@ -329,8 +328,9 @@ sub adump {
 sub adump_i {
 	my ($ctx) = @_;
 	while (my $mi = shift @{$ctx->{items}}) {
+		my $srch = $ctx->{-inbox}->search(undef, $ctx) or return;
 		my $smsg = eval {
-			PublicInbox::Smsg::from_mitem($mi, $ctx->{srch});
+			PublicInbox::Smsg::from_mitem($mi, $srch);
 		} or next;
 		return $smsg;
 	}

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 10/23] search: improve comments around constants
  2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
                   ` (8 preceding siblings ...)
  2020-08-20 20:24 ` [PATCH 09/23] www: reduce long-lived PublicInbox::Search references Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
  2020-08-20 20:24 ` [PATCH 11/23] search: export mdocid subroutine Eric Wong
                   ` (12 subsequent siblings)
  22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
  To: meta

We'll probably be adding more value columns like THREADID to sort
on.
---
 lib/PublicInbox/Search.pm | 63 +++++++++++++++++++++------------------
 1 file changed, 34 insertions(+), 29 deletions(-)

diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 4d02a7c1..593040a8 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -6,15 +6,47 @@
 package PublicInbox::Search;
 use strict;
 
-# values for searching
+# values for searching, changing the numeric value breaks
+# compatibility with old indices (so don't change them it)
 use constant {
 	TS => 0, # Received: header in Unix time (IMAP INTERNALDATE)
 	YYYYMMDD => 1, # Date: header for searching in the WWW UI
 	DT => 2, # Date: YYYYMMDDHHMMSS
+
+	# added for public-inbox 1.6.0+
 	BYTES => 3, # IMAP RFC822.SIZE
 	UID => 4, # IMAP UID == NNTP article number == Xapian docid
+
 	# TODO
-	# REPLYCNT => 4, # IMAP ANSWERED
+	# THREADID => ?
+	# REPLYCNT => ?, # IMAP ANSWERED
+
+	# SCHEMA_VERSION history
+	# 0 - initial
+	# 1 - subject_path is lower-cased
+	# 2 - subject_path is id_compress in the index, only
+	# 3 - message-ID is compressed if it includes '%' (hack!)
+	# 4 - change "Re: " normalization, avoid circular Reference ghosts
+	# 5 - subject_path drops trailing '.'
+	# 6 - preserve References: order in document data
+	# 7 - remove references and inreplyto terms
+	# 8 - remove redundant/unneeded document data
+	# 9 - disable Message-ID compression (SHA-1)
+	# 10 - optimize doc for NNTP overviews
+	# 11 - merge threads when vivifying ghosts
+	# 12 - change YYYYMMDD value column to numeric
+	# 13 - fix threading for empty References/In-Reply-To
+	#      (commit 83425ef12e4b65cdcecd11ddcb38175d4a91d5a0)
+	# 14 - fix ghost root vivification
+	# 15 - see public-inbox-v2-format(5)
+	#      further bumps likely unnecessary, we'll suggest in-place
+	#      "--reindex" use for further fixes and tweaks:
+	#
+	#      public-inbox v1.5.0 adds (still SCHEMA_VERSION=15):
+	#      * "lid:" and "l:" for List-Id searches
+	#
+	#      v1.6.0 adds BYTES and UID values
+	SCHEMA_VERSION => 15,
 };
 
 use PublicInbox::Smsg;
@@ -61,33 +93,6 @@ sub load_xapian () {
 # a prefix common in patch emails
 our $LANG = 'english';
 
-use constant {
-	# SCHEMA_VERSION history
-	# 0 - initial
-	# 1 - subject_path is lower-cased
-	# 2 - subject_path is id_compress in the index, only
-	# 3 - message-ID is compressed if it includes '%' (hack!)
-	# 4 - change "Re: " normalization, avoid circular Reference ghosts
-	# 5 - subject_path drops trailing '.'
-	# 6 - preserve References: order in document data
-	# 7 - remove references and inreplyto terms
-	# 8 - remove redundant/unneeded document data
-	# 9 - disable Message-ID compression (SHA-1)
-	# 10 - optimize doc for NNTP overviews
-	# 11 - merge threads when vivifying ghosts
-	# 12 - change YYYYMMDD value column to numeric
-	# 13 - fix threading for empty References/In-Reply-To
-	#      (commit 83425ef12e4b65cdcecd11ddcb38175d4a91d5a0)
-	# 14 - fix ghost root vivification
-	# 15 - see public-inbox-v2-format(5)
-	#      further bumps likely unnecessary, we'll suggest in-place
-	#      "--reindex" use for further fixes and tweaks
-	#
-	#      public-inbox v1.5.0 adds (still SCHEMA_VERSION=15):
-	#      * "lid:" and "l:" for List-Id searches
-	SCHEMA_VERSION => 15,
-};
-
 # note: the non-X term prefix allocations are shared with
 # Xapian omega, see xapian-applications/omega/docs/termprefixes.rst
 my %bool_pfx_external = (

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 11/23] search: export mdocid subroutine
  2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
                   ` (9 preceding siblings ...)
  2020-08-20 20:24 ` [PATCH 10/23] search: improve comments around constants Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
  2020-08-20 20:24 ` [PATCH 12/23] searchquery: split off from searchview Eric Wong
                   ` (11 subsequent siblings)
  22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
  To: meta

No need to have awkward globrefs for this.
---
 lib/PublicInbox/IMAP.pm   | 3 +--
 lib/PublicInbox/Search.pm | 2 ++
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/lib/PublicInbox/IMAP.pm b/lib/PublicInbox/IMAP.pm
index 3d66f930..562c59d4 100644
--- a/lib/PublicInbox/IMAP.pm
+++ b/lib/PublicInbox/IMAP.pm
@@ -40,9 +40,8 @@ use PublicInbox::Syscall qw(EPOLLIN EPOLLONESHOT);
 use PublicInbox::GitAsyncCat;
 use Text::ParseWords qw(parse_line);
 use Errno qw(EAGAIN);
-use PublicInbox::Search;
+use PublicInbox::Search qw(mdocid);
 use PublicInbox::IMAPsearchqp;
-*mdocid = \&PublicInbox::Search::mdocid;
 
 my $Address;
 for my $mod (qw(Email::Address::XS Mail::Address)) {
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 593040a8..f98513d3 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -5,6 +5,8 @@
 # Read-only search interface for use by the web and NNTP interfaces
 package PublicInbox::Search;
 use strict;
+use parent qw(Exporter);
+our @EXPORT_OK = qw(mdocid);
 
 # values for searching, changing the numeric value breaks
 # compatibility with old indices (so don't change them it)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 12/23] searchquery: split off from searchview
  2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
                   ` (10 preceding siblings ...)
  2020-08-20 20:24 ` [PATCH 11/23] search: export mdocid subroutine Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
  2020-08-20 20:24 ` [PATCH 13/23] search: make qparse_new an internal function Eric Wong
                   ` (10 subsequent siblings)
  22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
  To: meta

Since this was already a separate package, split it off
into its own file since SearchView may not handle inbox
groups.
---
 MANIFEST                       |  1 +
 lib/PublicInbox/SearchQuery.pm | 53 ++++++++++++++++++++++++++++++++++
 lib/PublicInbox/SearchView.pm  | 53 ++--------------------------------
 3 files changed, 57 insertions(+), 50 deletions(-)
 create mode 100644 lib/PublicInbox/SearchQuery.pm

diff --git a/MANIFEST b/MANIFEST
index 6cb5f6bf..d86d3b15 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -173,6 +173,7 @@ lib/PublicInbox/SaPlugin/ListMirror.pod
 lib/PublicInbox/Search.pm
 lib/PublicInbox/SearchIdx.pm
 lib/PublicInbox/SearchIdxShard.pm
+lib/PublicInbox/SearchQuery.pm
 lib/PublicInbox/SearchThread.pm
 lib/PublicInbox/SearchView.pm
 lib/PublicInbox/Sigfd.pm
diff --git a/lib/PublicInbox/SearchQuery.pm b/lib/PublicInbox/SearchQuery.pm
new file mode 100644
index 00000000..ce1eae12
--- /dev/null
+++ b/lib/PublicInbox/SearchQuery.pm
@@ -0,0 +1,53 @@
+# Copyright (C) 2015-2020 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+
+# used by PublicInbox::SearchView
+package PublicInbox::SearchQuery;
+use strict;
+use v5.10.1;
+use URI::Escape qw(uri_escape);
+use PublicInbox::MID qw(MID_ESC);
+our $LIM = 200;
+
+sub new {
+	my ($class, $qp) = @_;
+
+	my $r = $qp->{r};
+	my ($l) = (($qp->{l} || '') =~ /([0-9]+)/);
+	$l = $LIM if !$l || $l > $LIM;
+	bless {
+		q => $qp->{'q'},
+		x => $qp->{x} || '',
+		o => (($qp->{o} || '0') =~ /(-?[0-9]+)/),
+		l => $l,
+		r => (defined $r && $r ne '0'),
+	}, $class;
+}
+
+sub qs_html {
+	my ($self, %override) = @_;
+
+	if (scalar(keys(%override))) {
+		$self = bless { (%$self, %override) }, ref($self);
+	}
+
+	my $q = uri_escape($self->{'q'}, MID_ESC);
+	$q =~ s/%20/+/g; # improve URL readability
+	my $qs = "q=$q";
+
+	if (my $o = $self->{o}) { # ignore o == 0
+		$qs .= "&amp;o=$o";
+	}
+	if (my $l = $self->{l}) {
+		$qs .= "&amp;l=$l" unless $l == $LIM;
+	}
+	if (my $r = $self->{r}) {
+		$qs .= "&amp;r";
+	}
+	if (my $x = $self->{x}) {
+		$qs .= "&amp;x=$x" if ($x eq 't' || $x eq 'A' || $x eq 'm');
+	}
+	$qs;
+}
+
+1;
diff --git a/lib/PublicInbox/SearchView.pm b/lib/PublicInbox/SearchView.pm
index 5d77469e..28d9ce5d 100644
--- a/lib/PublicInbox/SearchView.pm
+++ b/lib/PublicInbox/SearchView.pm
@@ -4,15 +4,15 @@
 # Displays search results for the web interface
 package PublicInbox::SearchView;
 use strict;
-use warnings;
-use URI::Escape qw(uri_unescape uri_escape);
+use v5.10.1;
+use URI::Escape qw(uri_unescape);
 use PublicInbox::Smsg;
 use PublicInbox::Hval qw(ascii_html obfuscate_addrs mid_href);
 use PublicInbox::View;
 use PublicInbox::WwwAtomStream;
 use PublicInbox::WwwStream qw(html_oneshot);
 use PublicInbox::SearchThread;
-our $LIM = 200;
+use PublicInbox::SearchQuery;
 my %rmap_inc;
 
 sub mbox_results {
@@ -336,51 +336,4 @@ sub adump_i {
 	}
 }
 
-package PublicInbox::SearchQuery;
-use strict;
-use warnings;
-use URI::Escape qw(uri_escape);
-use PublicInbox::MID qw(MID_ESC);
-
-sub new {
-	my ($class, $qp) = @_;
-
-	my $r = $qp->{r};
-	my ($l) = (($qp->{l} || '') =~ /([0-9]+)/);
-	$l = $LIM if !$l || $l > $LIM;
-	bless {
-		q => $qp->{'q'},
-		x => $qp->{x} || '',
-		o => (($qp->{o} || '0') =~ /(-?[0-9]+)/),
-		l => $l,
-		r => (defined $r && $r ne '0'),
-	}, $class;
-}
-
-sub qs_html {
-	my ($self, %override) = @_;
-
-	if (scalar(keys(%override))) {
-		$self = bless { (%$self, %override) }, ref($self);
-	}
-
-	my $q = uri_escape($self->{'q'}, MID_ESC);
-	$q =~ s/%20/+/g; # improve URL readability
-	my $qs = "q=$q";
-
-	if (my $o = $self->{o}) { # ignore o == 0
-		$qs .= "&amp;o=$o";
-	}
-	if (my $l = $self->{l}) {
-		$qs .= "&amp;l=$l" unless $l == $LIM;
-	}
-	if (my $r = $self->{r}) {
-		$qs .= "&amp;r";
-	}
-	if (my $x = $self->{x}) {
-		$qs .= "&amp;x=$x" if ($x eq 't' || $x eq 'A' || $x eq 'm');
-	}
-	$qs;
-}
-
 1;

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 13/23] search: make qparse_new an internal function
  2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
                   ` (11 preceding siblings ...)
  2020-08-20 20:24 ` [PATCH 12/23] searchquery: split off from searchview Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
  2020-08-20 20:24 ` [PATCH 14/23] smsg: reduce utf8::decode call sites Eric Wong
                   ` (9 subsequent siblings)
  22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
  To: meta

We'll probably be reusing it from another package in a future commit.
---
 lib/PublicInbox/Search.pm | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index f98513d3..e6200bfb 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -265,7 +265,7 @@ sub query {
 	if ($query_string eq '' && !$opts->{mset}) {
 		$self->{over_ro}->recent($opts);
 	} else {
-		my $qp = qp($self);
+		my $qp = $self->{qp} //= qparse_new($self);
 		my $qp_flags = $self->{qp_flags};
 		my $query = $qp->parse_query($query_string, $qp_flags);
 		$opts->{relevance} = 1 unless exists $opts->{relevance};
@@ -334,17 +334,14 @@ sub _enquire_once { # retry_reopen callback
 sub stemmer { $X{Stem}->new($LANG) }
 
 # read-only
-sub qp {
+sub qparse_new ($) {
 	my ($self) = @_;
 
-	my $qp = $self->{query_parser};
-	return $qp if $qp;
 	my $xdb = xdb($self);
-	# new parser
-	$qp = $X{QueryParser}->new;
+	my $qp = $X{QueryParser}->new;
 	$qp->set_default_op(OP_AND());
 	$qp->set_database($xdb);
-	$qp->set_stemmer($self->stemmer);
+	$qp->set_stemmer(stemmer($self));
 	$qp->set_stemming_strategy(STEM_SOME());
 	$qp->set_max_wildcard_expansion(100);
 	my $nvrp = $X{NumberValueRangeProcessor};
@@ -382,13 +379,12 @@ EOF
 	while (my ($name, $prefix) = each %prob_prefix) {
 		$qp->add_prefix($name, $_) foreach split(/ /, $prefix);
 	}
-
-	$self->{query_parser} = $qp;
+	$qp;
 }
 
 sub help {
 	my ($self) = @_;
-	$self->qp; # parse altids
+	$self->{qp} //= qparse_new($self); # parse altids
 	my @ret = @HELP;
 	if (my $user_pfx = $self->{-user_pfx}) {
 		push @ret, @$user_pfx;

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 14/23] smsg: reduce utf8::decode call sites
  2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
                   ` (12 preceding siblings ...)
  2020-08-20 20:24 ` [PATCH 13/23] search: make qparse_new an internal function Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
  2020-08-20 20:24 ` [PATCH 15/23] searchview: use over.sqlite3 instead of Xapian docdata Eric Wong
                   ` (8 subsequent siblings)
  22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
  To: meta

Both callers of load_from_data call utf8::decode, so just
do utf8::decode in load_from_data.
---
 lib/PublicInbox/Over.pm | 1 -
 lib/PublicInbox/Smsg.pm | 2 +-
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/lib/PublicInbox/Over.pm b/lib/PublicInbox/Over.pm
index 2b314882..81b9fca7 100644
--- a/lib/PublicInbox/Over.pm
+++ b/lib/PublicInbox/Over.pm
@@ -68,7 +68,6 @@ sub load_from_row ($;$) {
 	bless $smsg, 'PublicInbox::Smsg';
 	if (defined(my $data = delete $smsg->{ddd})) {
 		$data = uncompress($data);
-		utf8::decode($data);
 		PublicInbox::Smsg::load_from_data($smsg, $data);
 
 		# saves over 600K for 1000+ message threads
diff --git a/lib/PublicInbox/Smsg.pm b/lib/PublicInbox/Smsg.pm
index 62cb951e..f22cd43e 100644
--- a/lib/PublicInbox/Smsg.pm
+++ b/lib/PublicInbox/Smsg.pm
@@ -40,6 +40,7 @@ sub to_doc_data {
 
 sub load_from_data ($$) {
 	my ($self) = $_[0]; # data = $_[1]
+	utf8::decode($_[1]);
 	(
 		$self->{subject},
 		$self->{from},
@@ -67,7 +68,6 @@ sub load_expand {
 	my $dt = get_val($doc, PublicInbox::Search::DT());
 	my ($yyyy, $mon, $dd, $hh, $mm, $ss) = unpack('A4A2A2A2A2A2', $dt);
 	$self->{ds} = timegm($ss, $mm, $hh, $dd, $mon - 1, $yyyy);
-	utf8::decode($data);
 	load_from_data($self, $data);
 	$self;
 }

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 15/23] searchview: use over.sqlite3 instead of Xapian docdata
  2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
                   ` (13 preceding siblings ...)
  2020-08-20 20:24 ` [PATCH 14/23] smsg: reduce utf8::decode call sites Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
  2020-08-20 20:24 ` [PATCH 16/23] searchview: speed up search summary by ~10% Eric Wong
                   ` (7 subsequent siblings)
  22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
  To: meta

This is a step towards improving kernel page cache hit rates by
relying on over.sqlite3 for document data instead of Xapian.
Some micro-optimization to over->get_art was required to
maintain performance.
---
 lib/PublicInbox/Over.pm       | 18 +++++-------------
 lib/PublicInbox/SearchView.pm | 10 +++++++---
 2 files changed, 12 insertions(+), 16 deletions(-)

diff --git a/lib/PublicInbox/Over.pm b/lib/PublicInbox/Over.pm
index 81b9fca7..80e57e62 100644
--- a/lib/PublicInbox/Over.pm
+++ b/lib/PublicInbox/Over.pm
@@ -57,6 +57,7 @@ sub new {
 sub disconnect {
 	my ($self) = @_;
 	if (my $dbh = delete $self->{dbh}) {
+		delete $self->{-get_art};
 		$self->{filename} = $dbh->sqlite_db_filename;
 	}
 }
@@ -201,8 +202,8 @@ SELECT COUNT(num) FROM over WHERE num > 0
 
 sub get_art {
 	my ($self, $num) = @_;
-	my $dbh = $self->connect;
-	my $sth = $dbh->prepare_cached(<<'', undef, 1);
+	# caching $sth ourselves is faster than prepare_cached
+	my $sth = $self->{-get_art} //= $self->connect->prepare(<<'');
 SELECT num,ds,ts,ddd FROM over WHERE num = ? LIMIT 1
 
 	$sth->execute($num);
@@ -230,13 +231,7 @@ ORDER BY num ASC LIMIT 1
 	$sth->execute($$id, $$prev);
 	my $num = $sth->fetchrow_array or return;
 	$$prev = $num;
-
-	$sth = $dbh->prepare_cached(<<"", undef, 1);
-SELECT num,ts,ds,ddd FROM over WHERE num = ? LIMIT 1
-
-	$sth->execute($num);
-	my $smsg = $sth->fetchrow_hashref or return;
-	load_from_row($smsg);
+	get_art($self, $num);
 }
 
 # IMAP search, this is limited by callers to UID_SLICE size (50K)
@@ -278,10 +273,7 @@ sub check_inodes {
 		my $st = pack('dd', $st[0], $st[1]);
 
 		# don't actually reopen, just let {dbh} be recreated later
-		if ($st ne ($self->{st} // $st)) {
-			delete($self->{dbh});
-			$self->{filename} = $f;
-		}
+		disconnect($self) if $st ne ($self->{st} // $st);
 	} else {
 		warn "W: stat $f: $!\n";
 	}
diff --git a/lib/PublicInbox/SearchView.pm b/lib/PublicInbox/SearchView.pm
index 28d9ce5d..61534c25 100644
--- a/lib/PublicInbox/SearchView.pm
+++ b/lib/PublicInbox/SearchView.pm
@@ -13,6 +13,7 @@ use PublicInbox::WwwAtomStream;
 use PublicInbox::WwwStream qw(html_oneshot);
 use PublicInbox::SearchThread;
 use PublicInbox::SearchQuery;
+use PublicInbox::Search qw(mdocid);
 my %rmap_inc;
 
 sub mbox_results {
@@ -90,19 +91,22 @@ sub mset_summary {
 	my $pfx = ' ' x $pad;
 	my $res = \($ctx->{-html_tip});
 	my $ibx = $ctx->{-inbox};
-	my $srch = $ibx->search;
+	my $over = $ibx->over;
+	my $nshard = $ibx->search->{nshard} // 1;
 	my $obfs_ibx = $ibx->{obfuscate} ? $ibx : undef;
 	foreach my $m ($mset->items) {
 		my $rank = sprintf("%${pad}d", $m->get_rank + 1);
 		my $pct = get_pct($m);
-		my $smsg = PublicInbox::Smsg::from_mitem($m, $srch);
+		my $num = mdocid($nshard, $m);
+		my $smsg = $over->get_art($num, 1);
 		unless ($smsg) {
 			eval {
-				$m = "$m ".$m->get_docid . " expired\n";
+				$m = "$m $num expired\n";
 				$ctx->{env}->{'psgi.errors'}->print($m);
 			};
 			next;
 		}
+		PublicInbox::Smsg::psgi_cull($smsg);
 		my $s = ascii_html($smsg->{subject});
 		my $f = ascii_html($smsg->{from_name});
 		if ($obfs_ibx) {

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 16/23] searchview: speed up search summary by ~10%
  2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
                   ` (14 preceding siblings ...)
  2020-08-20 20:24 ` [PATCH 15/23] searchview: use over.sqlite3 instead of Xapian docdata Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
  2020-08-20 20:24 ` [PATCH 17/23] searchview: convert nested and Atom display to over.sqlite3 Eric Wong
                   ` (6 subsequent siblings)
  22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
  To: meta

Instead of loading one article at-a-time from over.sqlite3, we
can use SQL to mass-load IN (?,?, ...) all results with a single
SQLite query.  Despite SQLite being in-process and having no
network latency, the reduction in SQL query executions from
loading multiple rows at once speeds things up significantly.

We'll keep the over->get_art optimizations from the previous
commit, since it still speeds up long-lived responses, slightly.
---
 lib/PublicInbox/Over.pm       |  9 +++++++++
 lib/PublicInbox/SearchView.pm | 12 ++++++------
 2 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/lib/PublicInbox/Over.pm b/lib/PublicInbox/Over.pm
index 80e57e62..a055b4cd 100644
--- a/lib/PublicInbox/Over.pm
+++ b/lib/PublicInbox/Over.pm
@@ -104,6 +104,15 @@ ORDER BY num ASC
 
 }
 
+sub get_all {
+	my $self = shift;
+	my $nr = scalar(@_) or return [];
+	my $in = '?' . (',?' x ($nr - 1));
+	do_get($self, <<"", { cull => 1, limit => $nr }, @_);
+SELECT num,ds,ddd FROM over WHERE num IN ($in)
+
+}
+
 sub nothing () { wantarray ? (0, []) : [] };
 
 sub get_thread {
diff --git a/lib/PublicInbox/SearchView.pm b/lib/PublicInbox/SearchView.pm
index 61534c25..a3527057 100644
--- a/lib/PublicInbox/SearchView.pm
+++ b/lib/PublicInbox/SearchView.pm
@@ -91,22 +91,22 @@ sub mset_summary {
 	my $pfx = ' ' x $pad;
 	my $res = \($ctx->{-html_tip});
 	my $ibx = $ctx->{-inbox};
-	my $over = $ibx->over;
 	my $nshard = $ibx->search->{nshard} // 1;
 	my $obfs_ibx = $ibx->{obfuscate} ? $ibx : undef;
+	my @nums = map { mdocid($nshard, $_) } $mset->items;
+	my %num2msg = map { $_->{num} => $_ } @{$ibx->over->get_all(@nums)};
+
 	foreach my $m ($mset->items) {
 		my $rank = sprintf("%${pad}d", $m->get_rank + 1);
 		my $pct = get_pct($m);
-		my $num = mdocid($nshard, $m);
-		my $smsg = $over->get_art($num, 1);
-		unless ($smsg) {
+		my $num = shift @nums;
+		my $smsg = delete($num2msg{$num}) or do {
 			eval {
 				$m = "$m $num expired\n";
 				$ctx->{env}->{'psgi.errors'}->print($m);
 			};
 			next;
-		}
-		PublicInbox::Smsg::psgi_cull($smsg);
+		};
 		my $s = ascii_html($smsg->{subject});
 		my $f = ascii_html($smsg->{from_name});
 		if ($obfs_ibx) {

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 17/23] searchview: convert nested and Atom display to over.sqlite3
  2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
                   ` (15 preceding siblings ...)
  2020-08-20 20:24 ` [PATCH 16/23] searchview: speed up search summary by ~10% Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
  2020-08-20 20:24 ` [PATCH 18/23] extmsg: avoid using Xapian docdata Eric Wong
                   ` (5 subsequent siblings)
  22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
  To: meta

git blob retrieval dominates on these, "&x=t" (nested) is
roughly the same due to increased overhead for ->get_percent
storage balancing out the mass-loading from SQLite.

Atom "&x=A" is sped up slightly and uses less memory in the
long-lived response.
---
 lib/PublicInbox/SearchView.pm | 25 ++++++++-----------------
 1 file changed, 8 insertions(+), 17 deletions(-)

diff --git a/lib/PublicInbox/SearchView.pm b/lib/PublicInbox/SearchView.pm
index a3527057..ef1b9767 100644
--- a/lib/PublicInbox/SearchView.pm
+++ b/lib/PublicInbox/SearchView.pm
@@ -255,20 +255,13 @@ sub get_pct ($) {
 	$n > 99 ? 99 : $n;
 }
 
-sub load_msgs {
-	my ($mset) = @_;
-	[ map {
-		my $mi = $_;
-		my $smsg = PublicInbox::Smsg::from_mitem($mi);
-		$smsg->{pct} = get_pct($mi);
-		$smsg;
-	} ($mset->items) ]
-}
-
 sub mset_thread {
 	my ($ctx, $mset, $q) = @_;
 	my $ibx = $ctx->{-inbox};
-	my $msgs = $ibx->search->retry_reopen(\&load_msgs, $mset);
+	my $nshard = $ibx->search->{nshard} // 1;
+	my %pct = map { mdocid($nshard, $_) => get_pct($_) } $mset->items;
+	my $msgs = $ibx->over->get_all(keys %pct);
+	$_->{pct} = $pct{$_->{num}} for @$msgs;
 	my $r = $q->{r};
 	my $rootset = PublicInbox::SearchThread::thread($msgs,
 		$r ? \&sort_relevance : \&PublicInbox::View::sort_ds,
@@ -323,7 +316,8 @@ sub ctx_prepare {
 
 sub adump {
 	my ($cb, $mset, $q, $ctx) = @_;
-	$ctx->{items} = [ $mset->items ];
+	my $nshard = $ctx->{-inbox}->search->{nshard} // 1;
+	$ctx->{ids} = [ map { mdocid($nshard, $_) } $mset->items ];
 	$ctx->{search_query} = $q; # used by WwwAtomStream::atom_header
 	PublicInbox::WwwAtomStream->response($ctx, 200, \&adump_i);
 }
@@ -331,11 +325,8 @@ sub adump {
 # callback for PublicInbox::WwwAtomStream::getline
 sub adump_i {
 	my ($ctx) = @_;
-	while (my $mi = shift @{$ctx->{items}}) {
-		my $srch = $ctx->{-inbox}->search(undef, $ctx) or return;
-		my $smsg = eval {
-			PublicInbox::Smsg::from_mitem($mi, $srch);
-		} or next;
+	while (my $num = shift @{$ctx->{ids}}) {
+		my $smsg = eval { $ctx->{-inbox}->over->get_art($num) } or next;
 		return $smsg;
 	}
 }

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 18/23] extmsg: avoid using Xapian docdata
  2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
                   ` (16 preceding siblings ...)
  2020-08-20 20:24 ` [PATCH 17/23] searchview: convert nested and Atom display to over.sqlite3 Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
  2020-08-20 20:24 ` [PATCH 19/23] mbox: avoid Xapian docdata in search results Eric Wong
                   ` (4 subsequent siblings)
  22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
  To: meta

Once again, over.sqlite3 contains everything necessary for
Message-ID resolution.  Also, Xapian may be completely
unnecessary with the advent of over.sqlite3, but that's for
another time.
---
 lib/PublicInbox/ExtMsg.pm | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/lib/PublicInbox/ExtMsg.pm b/lib/PublicInbox/ExtMsg.pm
index d7917b34..5c23dc31 100644
--- a/lib/PublicInbox/ExtMsg.pm
+++ b/lib/PublicInbox/ExtMsg.pm
@@ -11,6 +11,7 @@ use warnings;
 use PublicInbox::Hval qw(ascii_html prurl mid_href);
 use PublicInbox::WwwStream qw(html_oneshot);
 use PublicInbox::Smsg;
+use PublicInbox::Search qw(mdocid);
 our $MIN_PARTIAL_LEN = 16;
 
 # TODO: user-configurable
@@ -29,13 +30,10 @@ our @EXT_URL = map { ascii_html($_) } (
 
 sub PARTIAL_MAX () { 100 }
 
-sub mids_from_mset { # Search::retry_reopen callback
-	[ map { PublicInbox::Smsg::from_mitem($_)->{mid} } $_[0]->items ];
-}
-
 sub search_partial ($$) {
-	my ($srch, $mid) = @_;
+	my ($ibx, $mid) = @_;
 	return if length($mid) < $MIN_PARTIAL_LEN;
+	my $srch = $ibx->search or return;
 	my $opt = { limit => PARTIAL_MAX, mset => 2 };
 	my @try = ("m:$mid*");
 	my $chop = $mid;
@@ -63,14 +61,17 @@ sub search_partial ($$) {
 		}
 	}
 
+	my $n = $srch->{nshard} // 1;
 	foreach my $m (@try) {
 		# If Xapian can't handle the wildcard since it
 		# has too many results.  $@ can be
 		# Search::Xapian::QueryParserError or even:
 		# "something terrible happened at ../Search/Xapian/Enquire.pm"
 		my $mset = eval { $srch->query($m, $opt) } or next;
-		my $mids = $srch->retry_reopen(\&mids_from_mset, $mset);
-		return $mids if scalar(@$mids);
+		my @mids = map {
+			$_->{mid}
+		} @{$ibx->over->get_all(map { mdocid($n, $_) } $mset->items)};
+		return \@mids if scalar(@mids);
 	}
 }
 
@@ -111,8 +112,7 @@ sub ext_msg {
 	# fall back to partial MID matching
 	my @partial;
 	my $n_partial = 0;
-	my $srch = $cur->search;
-	my $mids = search_partial($srch, $mid) if $srch;
+	my $mids = search_partial($cur, $mid);
 	if ($mids) {
 		$n_partial = scalar(@$mids);
 		push @partial, [ $cur, $mids ];
@@ -121,8 +121,7 @@ sub ext_msg {
 	# can't find a partial match in current inbox, try the others:
 	if (!$n_partial && length($mid) >= $MIN_PARTIAL_LEN) {
 		foreach my $ibx (@$ibxs) {
-			$srch = $ibx->search or next;
-			$mids = search_partial($srch, $mid) or next;
+			$mids = search_partial($ibx, $mid) or next;
 			$n_partial += scalar(@$mids);
 			push @partial, [ $ibx, $mids];
 			last if $n_partial >= PARTIAL_MAX;

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 19/23] mbox: avoid Xapian docdata in search results
  2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
                   ` (17 preceding siblings ...)
  2020-08-20 20:24 ` [PATCH 18/23] extmsg: avoid using Xapian docdata Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
  2020-08-20 20:24 ` [PATCH 20/23] smsg: remove from_mitem Eric Wong
                   ` (3 subsequent siblings)
  22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
  To: meta

Another place where we can reduce kernel page cache overhead
by hitting over.sqlite3 instead of docdata.glass.
---
 lib/PublicInbox/Mbox.pm | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/lib/PublicInbox/Mbox.pm b/lib/PublicInbox/Mbox.pm
index a83c0356..0fa9a38d 100644
--- a/lib/PublicInbox/Mbox.pm
+++ b/lib/PublicInbox/Mbox.pm
@@ -10,6 +10,7 @@ use PublicInbox::MID qw/mid_escape/;
 use PublicInbox::Hval qw/to_filename/;
 use PublicInbox::Smsg;
 use PublicInbox::Eml;
+use PublicInbox::Search qw(mdocid);
 
 # called by PSGI server as body response
 # this gets called twice for every message, once to return the header,
@@ -205,20 +206,19 @@ sub mbox_all_ids {
 
 sub results_cb {
 	my ($ctx) = @_;
-	my $srch = $ctx->{-inbox}->search(undef, $ctx) or return;
-	my $mset = $ctx->{mset};
+	my $over = $ctx->{-inbox}->over or return;
 	while (1) {
-		while (my $mi = (($mset->items)[$ctx->{iter}++])) {
-			my $smsg = PublicInbox::Smsg::from_mitem($mi,
-								$srch) or next;
+		while (defined(my $num = shift(@{$ctx->{ids}}))) {
+			my $smsg = $over->get_art($num) or next;
 			return $smsg;
 		}
 		# refill result set
-		$mset = $ctx->{mset} = $srch->query($ctx->{query},
-							$ctx->{qopts});
+		my $srch = $ctx->{-inbox}->search(undef, $ctx) or return;
+		my $mset = $srch->query($ctx->{query}, $ctx->{qopts});
 		my $size = $mset->size or return;
 		$ctx->{qopts}->{offset} += $size;
-		$ctx->{iter} = 0;
+		my $nshard = $srch->{nshard} // 1;
+		$ctx->{ids} = [ map { mdocid($nshard, $_) } $mset->items ];
 	}
 }
 
@@ -226,15 +226,16 @@ sub mbox_all {
 	my ($ctx, $query) = @_;
 
 	return mbox_all_ids($ctx) if $query eq '';
-	my $qopts = $ctx->{qopts} = { mset => 2 };
+	my $qopts = $ctx->{qopts} = { mset => 2 }; # order by docid
 	my $srch = $ctx->{-inbox}->search or
 		return PublicInbox::WWW::need($ctx, 'Search');
-	my $mset = $ctx->{mset} = $srch->query($query, $qopts);
+	my $mset = $srch->query($query, $qopts);
 	$qopts->{offset} = $mset->size or
 			return [404, [qw(Content-Type text/plain)],
 				["No results found\n"]];
-	$ctx->{iter} = 0;
 	$ctx->{query} = $query;
+	my $nshard = $srch->{nshard} // 1;
+	$ctx->{ids} = [ map { mdocid($nshard, $_) } $mset->items ];
 	require PublicInbox::MboxGz;
 	PublicInbox::MboxGz::mbox_gz($ctx, \&results_cb, 'results-'.$query);
 }

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 20/23] smsg: remove from_mitem
  2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
                   ` (18 preceding siblings ...)
  2020-08-20 20:24 ` [PATCH 19/23] mbox: avoid Xapian docdata in search results Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
  2020-08-20 20:24 ` [PATCH 21/23] t/nntpd-v2: set PI_TEST_VERSION=2 properly Eric Wong
                   ` (2 subsequent siblings)
  22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
  To: meta

We no longer read docdata.glass from anywhere in our code base.

Some adjustments were needed to t/search.t to deal with the
Xapian::WritableDatabase committing at different times, since
our ->query is avoided from PublicInbox::SearchIdx to avoid
needing a {over_ro} field.
---
 lib/PublicInbox/Search.pm | 11 +++++++----
 lib/PublicInbox/Smsg.pm   |  8 --------
 t/search.t                | 34 +++++++++++++++++++---------------
 3 files changed, 26 insertions(+), 27 deletions(-)

diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index e6200bfb..1c648299 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -324,10 +324,13 @@ sub _enquire_once { # retry_reopen callback
 	my $limit = $opts->{limit} || 50;
 	my $mset = $enquire->get_mset($offset, $limit);
 	return $mset if $opts->{mset};
-	my @msgs = map { PublicInbox::Smsg::from_mitem($_) } $mset->items;
-	return \@msgs unless wantarray;
-
-	($mset->get_matches_estimated, \@msgs)
+	my $nshard = $self->{nshard} // 1;
+	my $i = 0;
+	my %order = map { mdocid($nshard, $_) => ++$i } $mset->items;
+	my @msgs = sort {
+		$order{$a->{num}} <=> $order{$b->{num}}
+	} @{$self->{over_ro}->get_all(keys %order)};
+	wantarray ? ($mset->get_matches_estimated, \@msgs) : \@msgs;
 }
 
 # read-write
diff --git a/lib/PublicInbox/Smsg.pm b/lib/PublicInbox/Smsg.pm
index f22cd43e..51226b8e 100644
--- a/lib/PublicInbox/Smsg.pm
+++ b/lib/PublicInbox/Smsg.pm
@@ -86,14 +86,6 @@ sub psgi_cull ($) {
 	$self;
 }
 
-# Only called by PSGI interface, not NNTP
-sub from_mitem {
-	my ($mitem, $srch) = @_;
-	return $srch->retry_reopen(\&from_mitem, $mitem) if $srch;
-	my $self = bless {}, __PACKAGE__;
-	psgi_cull(load_expand($self, $mitem->get_document));
-}
-
 # for Import and v1 non-SQLite WWW code paths
 sub populate {
 	my ($self, $hdr, $sync) = @_;
diff --git a/t/search.t b/t/search.t
index 299f57c8..e2290ecd 100644
--- a/t/search.t
+++ b/t/search.t
@@ -263,13 +263,14 @@ To: list@example.com
 theatre
 fade
 EOF
-	my $res = $rw->query("theatre");
+	$rw_commit->();
+	my $res = $ro->reopen->query("theatre");
 	is(scalar(@$res), 2, "got both matches");
 	if (@$res == 2) {
 		is($res->[0]->{mid}, 'nquote@a', 'non-quoted scores higher');
 		is($res->[1]->{mid}, 'quote@a', 'quoted result still returned');
 	}
-	$res = $rw->query("illusions");
+	$res = $ro->query("illusions");
 	is(scalar(@$res), 1, "got a match for quoted text");
 	is($res->[0]->{mid}, 'quote@a',
 		"quoted result returned if nothing else") if scalar(@$res);
@@ -290,7 +291,8 @@ To: list\@example.com
 LOOP!
 EOF
 	ok($doc_id > 0, "doc_id defined with circular reference");
-	my $smsg = $rw->query('m:circle@a', {limit=>1})->[0];
+	$rw_commit->();
+	my $smsg = $ro->reopen->query('m:circle@a', {limit=>1})->[0];
 	is(defined($smsg), 1, 'found m:circl@a');
 	if (defined $smsg) {
 		is($smsg->{references}, '', "no references created");
@@ -298,11 +300,22 @@ EOF
 	}
 });
 
+{
+	my $msgs = $ro->query('d:19931002..20101002');
+	ok(scalar(@$msgs) > 0, 'got results within range');
+	$msgs = $ro->query('d:20101003..');
+	is(scalar(@$msgs), 0, 'nothing after 20101003');
+	$msgs = $ro->query('d:..19931001');
+	is(scalar(@$msgs), 0, 'nothing before 19931001');
+}
+
 $ibx->with_umask(sub {
 	my $mime = eml_load 't/utf8.eml';
 	my $doc_id = $rw->add_message($mime);
 	ok($doc_id > 0, 'message indexed doc_id with UTF-8');
-	my $msg = $rw->query('m:testmessage@example.com', {limit => 1})->[0];
+	$rw_commit->();
+	my $msg = $ro->reopen->
+		query('m:testmessage@example.com', {limit => 1})->[0];
 	is(defined($msg), 1, 'found testmessage@example.com');
 	if (defined $msg) {
 		is($mime->header('Subject'), $msg->{subject},
@@ -310,19 +323,10 @@ $ibx->with_umask(sub {
 	}
 });
 
-{
-	my $msgs = $ro->query('d:19931002..20101002');
-	ok(scalar(@$msgs) > 0, 'got results within range');
-	$msgs = $ro->query('d:20101003..');
-	is(scalar(@$msgs), 0, 'nothing after 20101003');
-	$msgs = $ro->query('d:..19931001');
-	is(scalar(@$msgs), 0, 'nothing before 19931001');
-}
-
 # names and addresses
 {
 	my $mset = $ro->query('t:list@example.com', {mset => 1});
-	is($mset->size, 6, 'searched To: successfully');
+	is($mset->size, 9, 'searched To: successfully');
 	foreach my $m ($mset->items) {
 		my $smsg = $ro->{over_ro}->get_art($m->get_docid);
 		like($smsg->{to}, qr/\blist\@example\.com\b/, 'to appears');
@@ -340,7 +344,7 @@ $ibx->with_umask(sub {
 	}
 
 	$mset = $ro->query('tc:list@example.com', {mset => 1});
-	is($mset->size, 6, 'searched To+Cc: successfully');
+	is($mset->size, 9, 'searched To+Cc: successfully');
 	foreach my $m ($mset->items) {
 		my $smsg = $ro->{over_ro}->get_art($m->get_docid);
 		my $tocc = join("\n", $smsg->{to}, $smsg->{cc});

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 21/23] t/nntpd-v2: set PI_TEST_VERSION=2 properly
  2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
                   ` (19 preceding siblings ...)
  2020-08-20 20:24 ` [PATCH 20/23] smsg: remove from_mitem Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
  2020-08-20 20:24 ` [PATCH 22/23] init+index: support --skip-docdata for Xapian Eric Wong
  2020-08-20 20:24 ` [PATCH 23/23] search: add mset_to_artnums method Eric Wong
  22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
  To: meta

Numbers are hard :<
---
 t/nntpd-v2.t | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/t/nntpd-v2.t b/t/nntpd-v2.t
index 7fc3447e..1dd992a0 100644
--- a/t/nntpd-v2.t
+++ b/t/nntpd-v2.t
@@ -1,4 +1,4 @@
 # Copyright (C) 2019-2020 all contributors <meta@public-inbox.org>
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
-local $ENV{PI_TEST_VERSION} = 1;
+local $ENV{PI_TEST_VERSION} = 2;
 require './t/nntpd.t';

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 22/23] init+index: support --skip-docdata for Xapian
  2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
                   ` (20 preceding siblings ...)
  2020-08-20 20:24 ` [PATCH 21/23] t/nntpd-v2: set PI_TEST_VERSION=2 properly Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
  2020-08-20 20:24 ` [PATCH 23/23] search: add mset_to_artnums method Eric Wong
  22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
  To: meta

Since we no longer read document data from Xapian, allow users
to opt-out of storing it.

This breaks compatibility with previous releases of
public-inbox, but gives us a ~1.5% space savings on Xapian
storage (and associated I/O and page cache pressure reduction).
---
 Documentation/public-inbox-index.pod |  8 +++++++
 Documentation/public-inbox-init.pod  | 10 ++++++++
 lib/PublicInbox/Admin.pm             | 12 ++++++----
 lib/PublicInbox/SearchIdx.pm         | 35 +++++++++++++++++++++-------
 lib/PublicInbox/SearchIdxShard.pm    |  2 +-
 script/public-inbox-convert          |  3 ++-
 script/public-inbox-index            |  7 ++++--
 script/public-inbox-init             |  8 +++++++
 t/inbox_idle.t                       |  2 +-
 t/index-git-times.t                  | 11 ++++++++-
 t/init.t                             | 13 +++++++++++
 11 files changed, 91 insertions(+), 20 deletions(-)

diff --git a/Documentation/public-inbox-index.pod b/Documentation/public-inbox-index.pod
index 1ed9f5e7..46a53825 100644
--- a/Documentation/public-inbox-index.pod
+++ b/Documentation/public-inbox-index.pod
@@ -145,6 +145,14 @@ below.
 
 Available in public-inbox 1.6.0 (PENDING).
 
+=item --skip-docdata
+
+Stop storing document data in Xapian on an existing inbox.
+
+See L<public-inbox-init(1)/--skip-docdata> for description and caveats.
+
+Available in public-inbox 1.6.0 (PENDING).
+
 =back
 
 =head1 FILES
diff --git a/Documentation/public-inbox-init.pod b/Documentation/public-inbox-init.pod
index 4cc7e29f..3f98807a 100644
--- a/Documentation/public-inbox-init.pod
+++ b/Documentation/public-inbox-init.pod
@@ -95,6 +95,16 @@ default due to contention in the top-level producer process.
 
 Default: the number of online CPUs, up to 4
 
+=item --skip-docdata
+
+Do not store document data in Xapian, reducing Xapian storage
+overhead by around 1.5%.
+
+Warning: this option prevents rollbacks to public-inbox 1.5.0
+and earlier.
+
+Available since public-inbox 1.6.0 (PENDING).
+
 =back
 
 =head1 ENVIRONMENT
diff --git a/lib/PublicInbox/Admin.pm b/lib/PublicInbox/Admin.pm
index f5427af7..b8ead6f7 100644
--- a/lib/PublicInbox/Admin.pm
+++ b/lib/PublicInbox/Admin.pm
@@ -48,13 +48,14 @@ sub resolve_repo_dir {
 sub detect_indexlevel ($) {
 	my ($ibx) = @_;
 
-	# brand new or never before indexed inboxes default to full
-	return 'full' unless $ibx->over;
-	delete $ibx->{over}; # don't leave open FD lying around
+	my $over = $ibx->over;
+	my $srch = $ibx->search;
+	delete @$ibx{qw(over search)}; # don't leave open FDs lying around
 
+	# brand new or never before indexed inboxes default to full
+	return 'full' unless $over;
 	my $l = 'basic';
-	my $srch = $ibx->search or return $l;
-	delete $ibx->{search}; # don't leave open FD lying around
+	return $l unless $srch;
 	if (my $xdb = $srch->xdb) {
 		$l = 'full';
 		my $m = $xdb->get_metadata('indexlevel');
@@ -65,6 +66,7 @@ sub detect_indexlevel ($) {
 $ibx->{inboxdir} has unexpected indexlevel in Xapian: $m
 
 		}
+		$ibx->{-skip_docdata} = 1 if $xdb->get_metadata('skip_docdata');
 	}
 	$l;
 }
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 5c39f3d6..be46b2b9 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -61,6 +61,10 @@ sub new {
 	}, $class;
 	$self->xpfx_init;
 	$self->{-set_indexlevel_once} = 1 if $indexlevel eq 'medium';
+	if ($ibx->{-skip_docdata}) {
+		$self->{-set_skip_docdata_once} = 1;
+		$self->{-skip_docdata} = 1;
+	}
 	$ibx->umask_prepare;
 	if ($version == 1) {
 		$self->{lock_path} = "$inboxdir/ssoma.lock";
@@ -359,10 +363,18 @@ sub add_xapian ($$$$) {
 
 	msg_iter($eml, \&index_xapian, [ $self, $doc ]);
 	index_ids($self, $doc, $eml, $mids);
-	$smsg->{to} = $smsg->{cc} = ''; # WWW doesn't need these, only NNTP
-	PublicInbox::OverIdx::parse_references($smsg, $eml, $mids);
-	my $data = $smsg->to_doc_data;
-	$doc->set_data($data);
+
+	# by default, we maintain compatibility with v1.5.0 and earlier
+	# by writing to docdata.glass, users who never exect to downgrade can
+	# use --skip-docdata
+	if (!$self->{-skip_docdata}) {
+		# WWW doesn't need {to} or {cc}, only NNTP
+		$smsg->{to} = $smsg->{cc} = '';
+		PublicInbox::OverIdx::parse_references($smsg, $eml, $mids);
+		my $data = $smsg->to_doc_data;
+		$doc->set_data($data);
+	}
+
 	if (my $altid = $self->{-altid}) {
 		foreach my $alt (@$altid) {
 			my $pfx = $alt->{xprefix};
@@ -831,23 +843,28 @@ sub begin_txn_lazy {
 
 # store 'indexlevel=medium' in v2 shard=0 and v1 (only one shard)
 # This metadata is read by Admin::detect_indexlevel:
-sub set_indexlevel {
+sub set_metadata_once {
 	my ($self) = @_;
 
-	if (!$self->{shard} && # undef or 0, not >0
-			delete($self->{-set_indexlevel_once})) {
-		my $xdb = $self->{xdb};
+	return if $self->{shard}; # only continue if undef or 0, not >0
+	my $xdb = $self->{xdb};
+
+	if (delete($self->{-set_indexlevel_once})) {
 		my $level = $xdb->get_metadata('indexlevel');
 		if (!$level || $level ne 'medium') {
 			$xdb->set_metadata('indexlevel', 'medium');
 		}
 	}
+	if (delete($self->{-set_skip_docdata_once})) {
+		$xdb->get_metadata('skip_docdata') or
+			$xdb->set_metadata('skip_docdata', '1');
+	}
 }
 
 sub _commit_txn {
 	my ($self) = @_;
 	if (my $xdb = $self->{xdb}) {
-		set_indexlevel($self);
+		set_metadata_once($self);
 		$xdb->commit_transaction;
 	}
 	$self->{over}->commit_lazy if $self->{over};
diff --git a/lib/PublicInbox/SearchIdxShard.pm b/lib/PublicInbox/SearchIdxShard.pm
index 59b36087..20077e08 100644
--- a/lib/PublicInbox/SearchIdxShard.pm
+++ b/lib/PublicInbox/SearchIdxShard.pm
@@ -16,7 +16,7 @@ sub new {
 	my $self = $class->SUPER::new($ibx, 1, $shard);
 	# create the DB before forking:
 	$self->idx_acquire;
-	$self->set_indexlevel;
+	$self->set_metadata_once;
 	$self->idx_release;
 	$self->spawn_worker($v2w, $shard) if $v2w->{parallel};
 	$self;
diff --git a/script/public-inbox-convert b/script/public-inbox-convert
index d655dcc6..4ff198d1 100755
--- a/script/public-inbox-convert
+++ b/script/public-inbox-convert
@@ -77,7 +77,8 @@ if ($old) {
 die "Only conversion from v1 inboxes is supported\n" if $old->version >= 2;
 
 require PublicInbox::Admin;
-$old->{indexlevel} //= PublicInbox::Admin::detect_indexlevel($old);
+my $detected = PublicInbox::Admin::detect_indexlevel($old);
+$old->{indexlevel} //= $detected;
 my $env;
 if ($opt->{'index'}) {
 	my $mods = {};
diff --git a/script/public-inbox-index b/script/public-inbox-index
index 30d24838..9855c67d 100755
--- a/script/public-inbox-index
+++ b/script/public-inbox-index
@@ -39,7 +39,7 @@ GetOptions($opt, qw(verbose|v+ reindex rethread compact|c+ jobs|j=i prune
 		indexlevel|index-level|L=s max_size|max-size=s
 		batch_size|batch-size=s
 		sequential_shard|seq-shard|sequential-shard
-		all help|?))
+		skip-docdata all help|?))
 	or die "bad command-line args\n$usage";
 if ($opt->{help}) { print $help; exit 0 };
 die "--jobs must be >= 0\n" if defined $opt->{jobs} && $opt->{jobs} < 0;
@@ -58,9 +58,11 @@ unless (@ibxs) { print STDERR "Usage: $usage\n"; exit 1 }
 
 my $mods = {};
 foreach my $ibx (@ibxs) {
+	# detect_indexlevel may also set $ibx->{-skip_docdata}
+	my $detected = PublicInbox::Admin::detect_indexlevel($ibx);
 	# XXX: users can shoot themselves in the foot, with opt->{indexlevel}
 	$ibx->{indexlevel} //= $opt->{indexlevel} // ($opt->{xapian_only} ?
-			'full' : PublicInbox::Admin::detect_indexlevel($ibx));
+			'full' : $detected);
 	PublicInbox::Admin::scan_ibx_modules($mods, $ibx);
 }
 
@@ -75,6 +77,7 @@ for my $ibx (@ibxs) {
 		PublicInbox::Xapcmd::run($ibx, 'compact', $opt->{compact_opt});
 	}
 	$ibx->{-no_fsync} = 1 if !$opt->{fsync};
+	$ibx->{-skip_docdata} //= $opt->{'skip-docdata'};
 
 	my $ibx_opt = $opt;
 	if (defined(my $s = $ibx->{lc('indexSequentialShard')})) {
diff --git a/script/public-inbox-init b/script/public-inbox-init
index b19c2321..037e8e56 100755
--- a/script/public-inbox-init
+++ b/script/public-inbox-init
@@ -34,6 +34,7 @@ require PublicInbox::Admin;
 PublicInbox::Admin::require_or_die('-base');
 
 my ($version, $indexlevel, $skip_epoch, $skip_artnum, $jobs, $show_help);
+my $skip_docdata;
 my $ng = '';
 my %opts = (
 	'V|version=i' => \$version,
@@ -42,6 +43,7 @@ my %opts = (
 	'skip-artnum=i' => \$skip_artnum,
 	'j|jobs=i' => \$jobs,
 	'ng|newsgroup=s' => \$ng,
+	'skip-docdata' => \$skip_docdata,
 	'help|?' => \$show_help,
 );
 my $usage_cb = sub {
@@ -177,6 +179,12 @@ if (defined $jobs) {
 
 require PublicInbox::InboxWritable;
 $ibx = PublicInbox::InboxWritable->new($ibx, $creat_opt);
+if ($skip_docdata) {
+	$ibx->{indexlevel} //= 'full'; # ensure init_inbox writes xdb
+	$ibx->{indexlevel} eq 'basic' and
+		die "--skip-docdata ignored with --indexlevel=basic\n";
+	$ibx->{-skip_docdata} = $skip_docdata;
+}
 $ibx->init_inbox(0, $skip_epoch, $skip_artnum);
 
 # needed for git prior to v2.1.0
diff --git a/t/inbox_idle.t b/t/inbox_idle.t
index 61287200..e16ee11b 100644
--- a/t/inbox_idle.t
+++ b/t/inbox_idle.t
@@ -29,7 +29,7 @@ for my $V (1, 2) {
 	if ($V == 1) {
 		my $sidx = PublicInbox::SearchIdx->new($ibx, 1);
 		$sidx->idx_acquire;
-		$sidx->set_indexlevel;
+		$sidx->set_metadata_once;
 		$sidx->idx_release; # allow watching on lockfile
 	}
 	my $pi_config = PublicInbox::Config->new(\<<EOF);
diff --git a/t/index-git-times.t b/t/index-git-times.t
index 2e9e88e8..8f80c866 100644
--- a/t/index-git-times.t
+++ b/t/index-git-times.t
@@ -4,6 +4,7 @@ use Test::More;
 use PublicInbox::TestCommon;
 use PublicInbox::Import;
 use PublicInbox::Config;
+use PublicInbox::Admin;
 use File::Path qw(remove_tree);
 
 require_mods(qw(DBD::SQLite Search::Xapian));
@@ -47,11 +48,15 @@ EOF
 	PublicInbox::Import::run_die($cmd, undef, { 0 => $r });
 }
 
-run_script(['-index', $v1dir]) or die 'v1 index failed';
+run_script(['-index', '--skip-docdata', $v1dir]) or die 'v1 index failed';
+
 my $smsg;
 {
 	my $cfg = PublicInbox::Config->new;
 	my $ibx = $cfg->lookup($addr);
+	my $lvl = PublicInbox::Admin::detect_indexlevel($ibx);
+	is($lvl, 'medium', 'indexlevel detected');
+	is($ibx->{-skip_docdata}, 1, '--skip-docdata flag set on -index');
 	$smsg = $ibx->over->get_art(1);
 	is($smsg->{ds}, 749520000, 'datestamp from git author time');
 	is($smsg->{ts}, 1285977600, 'timestamp from git committer time');
@@ -70,6 +75,10 @@ SKIP: {
 	my $check_v2 = sub {
 		my $ibx = PublicInbox::Inbox->new({inboxdir => $v2dir,
 				address => $addr});
+		my $lvl = PublicInbox::Admin::detect_indexlevel($ibx);
+		is($lvl, 'medium', 'indexlevel detected after convert');
+		is($ibx->{-skip_docdata}, 1,
+			'--skip-docdata preserved after convert');
 		my $v2smsg = $ibx->over->get_art(1);
 		is($v2smsg->{ds}, $smsg->{ds},
 			'v2 datestamp from git author time');
diff --git a/t/init.t b/t/init.t
index 4d2c5049..dad09435 100644
--- a/t/init.t
+++ b/t/init.t
@@ -95,6 +95,19 @@ SKIP: {
 		my $ibx = PublicInbox::Inbox->new({ inboxdir => $dir });
 		is(PublicInbox::Admin::detect_indexlevel($ibx), $lvl,
 			'detected expected level w/o config');
+		ok(!$ibx->{-skip_docdata}, 'docdata written by default');
+	}
+	for my $v (1, 2) {
+		my $name = "v$v-skip-docdata";
+		my $dir = "$tmpdir/$name";
+		$cmd = [ '-init', $name, "-V$v", '--skip-docdata',
+			$dir, "http://example.com/$name",
+			"$name\@example.com" ];
+		ok(run_script($cmd), "-init -V$v --skip-docdata");
+		my $ibx = PublicInbox::Inbox->new({ inboxdir => $dir });
+		is(PublicInbox::Admin::detect_indexlevel($ibx), 'full',
+			"detected default indexlevel -V$v");
+		ok($ibx->{-skip_docdata}, "docdata skip set -V$v");
 	}
 
 	# loop for idempotency

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 23/23] search: add mset_to_artnums method
  2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
                   ` (21 preceding siblings ...)
  2020-08-20 20:24 ` [PATCH 22/23] init+index: support --skip-docdata for Xapian Eric Wong
@ 2020-08-20 20:24 ` Eric Wong
  22 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
  To: meta

We can avoid importing mdocid() in several places by using
this method, simplifying callers.
---
 lib/PublicInbox/ExtMsg.pm     | 4 +---
 lib/PublicInbox/IMAP.pm       | 4 +---
 lib/PublicInbox/Mbox.pm       | 7 ++-----
 lib/PublicInbox/Search.pm     | 6 ++++++
 lib/PublicInbox/SearchView.pm | 6 ++----
 5 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/lib/PublicInbox/ExtMsg.pm b/lib/PublicInbox/ExtMsg.pm
index 5c23dc31..65892161 100644
--- a/lib/PublicInbox/ExtMsg.pm
+++ b/lib/PublicInbox/ExtMsg.pm
@@ -11,7 +11,6 @@ use warnings;
 use PublicInbox::Hval qw(ascii_html prurl mid_href);
 use PublicInbox::WwwStream qw(html_oneshot);
 use PublicInbox::Smsg;
-use PublicInbox::Search qw(mdocid);
 our $MIN_PARTIAL_LEN = 16;
 
 # TODO: user-configurable
@@ -61,7 +60,6 @@ sub search_partial ($$) {
 		}
 	}
 
-	my $n = $srch->{nshard} // 1;
 	foreach my $m (@try) {
 		# If Xapian can't handle the wildcard since it
 		# has too many results.  $@ can be
@@ -70,7 +68,7 @@ sub search_partial ($$) {
 		my $mset = eval { $srch->query($m, $opt) } or next;
 		my @mids = map {
 			$_->{mid}
-		} @{$ibx->over->get_all(map { mdocid($n, $_) } $mset->items)};
+		} @{$ibx->over->get_all(@{$srch->mset_to_artnums($mset)})};
 		return \@mids if scalar(@mids);
 	}
 }
diff --git a/lib/PublicInbox/IMAP.pm b/lib/PublicInbox/IMAP.pm
index 562c59d4..abdb8fec 100644
--- a/lib/PublicInbox/IMAP.pm
+++ b/lib/PublicInbox/IMAP.pm
@@ -40,7 +40,6 @@ use PublicInbox::Syscall qw(EPOLLIN EPOLLONESHOT);
 use PublicInbox::GitAsyncCat;
 use Text::ParseWords qw(parse_line);
 use Errno qw(EAGAIN);
-use PublicInbox::Search qw(mdocid);
 use PublicInbox::IMAPsearchqp;
 
 my $Address;
@@ -1188,9 +1187,8 @@ sub refill_xap ($$$$) {
 	my ($beg, $end) = @$range_info;
 	my $srch = $self->{ibx}->search;
 	my $opt = { mset => 2, limit => 1000 };
-	my $nshard = $srch->{nshard} // 1;
 	my $mset = $srch->query("$q uid:$beg..$end", $opt);
-	@$uids = map { mdocid($nshard, $_) } $mset->items;
+	@$uids = @{$srch->mset_to_artnums($mset)};
 	if (@$uids) {
 		$range_info->[0] = $uids->[-1] + 1; # update $beg
 		return; # possibly more
diff --git a/lib/PublicInbox/Mbox.pm b/lib/PublicInbox/Mbox.pm
index 0fa9a38d..873ff7be 100644
--- a/lib/PublicInbox/Mbox.pm
+++ b/lib/PublicInbox/Mbox.pm
@@ -10,7 +10,6 @@ use PublicInbox::MID qw/mid_escape/;
 use PublicInbox::Hval qw/to_filename/;
 use PublicInbox::Smsg;
 use PublicInbox::Eml;
-use PublicInbox::Search qw(mdocid);
 
 # called by PSGI server as body response
 # this gets called twice for every message, once to return the header,
@@ -217,8 +216,7 @@ sub results_cb {
 		my $mset = $srch->query($ctx->{query}, $ctx->{qopts});
 		my $size = $mset->size or return;
 		$ctx->{qopts}->{offset} += $size;
-		my $nshard = $srch->{nshard} // 1;
-		$ctx->{ids} = [ map { mdocid($nshard, $_) } $mset->items ];
+		$ctx->{ids} = $srch->mset_to_artnums($mset);
 	}
 }
 
@@ -234,8 +232,7 @@ sub mbox_all {
 			return [404, [qw(Content-Type text/plain)],
 				["No results found\n"]];
 	$ctx->{query} = $query;
-	my $nshard = $srch->{nshard} // 1;
-	$ctx->{ids} = [ map { mdocid($nshard, $_) } $mset->items ];
+	$ctx->{ids} = $srch->mset_to_artnums($mset);
 	require PublicInbox::MboxGz;
 	PublicInbox::MboxGz::mbox_gz($ctx, \&results_cb, 'results-'.$query);
 }
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 1c648299..c18e19d4 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -219,6 +219,12 @@ sub mdocid {
 	int(($docid - 1) / $nshard) + 1;
 }
 
+sub mset_to_artnums {
+	my ($self, $mset) = @_;
+	my $nshard = $self->{nshard} // 1;
+	[ map { mdocid($nshard, $_) } $mset->items ];
+}
+
 sub xdb ($) {
 	my ($self) = @_;
 	$self->{xdb} ||= do {
diff --git a/lib/PublicInbox/SearchView.pm b/lib/PublicInbox/SearchView.pm
index ef1b9767..aa8fa037 100644
--- a/lib/PublicInbox/SearchView.pm
+++ b/lib/PublicInbox/SearchView.pm
@@ -91,9 +91,8 @@ sub mset_summary {
 	my $pfx = ' ' x $pad;
 	my $res = \($ctx->{-html_tip});
 	my $ibx = $ctx->{-inbox};
-	my $nshard = $ibx->search->{nshard} // 1;
 	my $obfs_ibx = $ibx->{obfuscate} ? $ibx : undef;
-	my @nums = map { mdocid($nshard, $_) } $mset->items;
+	my @nums = @{$ibx->search->mset_to_artnums($mset)};
 	my %num2msg = map { $_->{num} => $_ } @{$ibx->over->get_all(@nums)};
 
 	foreach my $m ($mset->items) {
@@ -316,8 +315,7 @@ sub ctx_prepare {
 
 sub adump {
 	my ($cb, $mset, $q, $ctx) = @_;
-	my $nshard = $ctx->{-inbox}->search->{nshard} // 1;
-	$ctx->{ids} = [ map { mdocid($nshard, $_) } $mset->items ];
+	$ctx->{ids} = $ctx->{-inbox}->search->mset_to_artnums($mset);
 	$ctx->{search_query} = $q; # used by WwwAtomStream::atom_header
 	PublicInbox::WwwAtomStream->response($ctx, 200, \&adump_i);
 }

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 05/23] init: support --newsgroup option
  2020-08-20 20:24 ` [PATCH 05/23] init: support --newsgroup option Eric Wong
@ 2020-08-20 21:10   ` Eric Wong
  0 siblings, 0 replies; 25+ messages in thread
From: Eric Wong @ 2020-08-20 21:10 UTC (permalink / raw)
  To: meta

Eric Wong <e@yhbt.net> wrote:
> +Some of the options documented in L<public-inbox-config(5)>
> +require editing the config file.  Old versions lack the
> +C<-n>/C<--newsgroup> parameter

While working on this, I realized -n vs -N could be confusing,
so I made the abbreviation --ng instead.  So I'll squash
this in before pushing:

diff --git a/Documentation/public-inbox-init.pod b/Documentation/public-inbox-init.pod
index 240959eb..ea2b38dd 100644
--- a/Documentation/public-inbox-init.pod
+++ b/Documentation/public-inbox-init.pod
@@ -111,7 +111,7 @@ Used to override the default C<~/.public-inbox/config> value.
 
 Some of the options documented in L<public-inbox-config(5)>
 require editing the config file.  Old versions lack the
-C<-n>/C<--newsgroup> parameter
+C<--ng>/C<--newsgroup> parameter
 
 See L<public-inbox-config(5)> for all the options which may be applied
 to a given inbox.

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2020-08-20 21:10 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-20 20:24 [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
2020-08-20 20:24 ` [PATCH 01/23] doc: note -compact and -xcpdb are rarely used Eric Wong
2020-08-20 20:24 ` [PATCH 02/23] admin: progress shows the inbox being indexed Eric Wong
2020-08-20 20:24 ` [PATCH 03/23] compact: support --help/-? and perform lazy loading Eric Wong
2020-08-20 20:24 ` [PATCH 04/23] init: support --help and -? Eric Wong
2020-08-20 20:24 ` [PATCH 05/23] init: support --newsgroup option Eric Wong
2020-08-20 21:10   ` Eric Wong
2020-08-20 20:24 ` [PATCH 06/23] init: drop -N alias for --skip-artnum Eric Wong
2020-08-20 20:24 ` [PATCH 07/23] search: v2: ensure shards are numerically sorted Eric Wong
2020-08-20 20:24 ` [PATCH 08/23] xapcmd: simplify {reindex} parameter passing Eric Wong
2020-08-20 20:24 ` [PATCH 09/23] www: reduce long-lived PublicInbox::Search references Eric Wong
2020-08-20 20:24 ` [PATCH 10/23] search: improve comments around constants Eric Wong
2020-08-20 20:24 ` [PATCH 11/23] search: export mdocid subroutine Eric Wong
2020-08-20 20:24 ` [PATCH 12/23] searchquery: split off from searchview Eric Wong
2020-08-20 20:24 ` [PATCH 13/23] search: make qparse_new an internal function Eric Wong
2020-08-20 20:24 ` [PATCH 14/23] smsg: reduce utf8::decode call sites Eric Wong
2020-08-20 20:24 ` [PATCH 15/23] searchview: use over.sqlite3 instead of Xapian docdata Eric Wong
2020-08-20 20:24 ` [PATCH 16/23] searchview: speed up search summary by ~10% Eric Wong
2020-08-20 20:24 ` [PATCH 17/23] searchview: convert nested and Atom display to over.sqlite3 Eric Wong
2020-08-20 20:24 ` [PATCH 18/23] extmsg: avoid using Xapian docdata Eric Wong
2020-08-20 20:24 ` [PATCH 19/23] mbox: avoid Xapian docdata in search results Eric Wong
2020-08-20 20:24 ` [PATCH 20/23] smsg: remove from_mitem Eric Wong
2020-08-20 20:24 ` [PATCH 21/23] t/nntpd-v2: set PI_TEST_VERSION=2 properly Eric Wong
2020-08-20 20:24 ` [PATCH 22/23] init+index: support --skip-docdata for Xapian Eric Wong
2020-08-20 20:24 ` [PATCH 23/23] search: add mset_to_artnums method Eric Wong

user/dev discussion of public-inbox itself

This inbox may be cloned and mirrored by anyone:

	git clone --mirror https://public-inbox.org/meta
	git clone --mirror http://czquwvybam4bgbro.onion/meta
	git clone --mirror http://hjrcffqmbrq6wope.onion/meta
	git clone --mirror http://ou63pmih66umazou.onion/meta

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V1 meta meta/ https://public-inbox.org/meta \
		meta@public-inbox.org
	public-inbox-index meta

Example config snippet for mirrors.
Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.mail.public-inbox.meta
	nntp://ou63pmih66umazou.onion/inbox.comp.mail.public-inbox.meta
	nntp://czquwvybam4bgbro.onion/inbox.comp.mail.public-inbox.meta
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.mail.public-inbox.meta
	nntp://news.gmane.io/gmane.mail.public-inbox.general
 note: .onion URLs require Tor: https://www.torproject.org/

code repositories for the project(s) associated with this inbox:

	https://80x24.org/public-inbox.git

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git