user/dev discussion of public-inbox itself
 help / color / mirror / code / Atom feed
* [PATCH v2 0/21] UI bits and v2 import fixes
@ 2018-02-28 23:41 Eric Wong (Contractor, The Linux Foundation)
  2018-02-28 23:41 ` [PATCH 01/21] v2writable: warn on duplicate Message-IDs Eric Wong (Contractor, The Linux Foundation)
                   ` (21 more replies)
  0 siblings, 22 replies; 23+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-28 23:41 UTC (permalink / raw)
  To: meta

The most important fix was fixing a long-standing bug (also in v1)
with "searchidx: do not modify Xapian DB while iterating".
It turns out my initial v2 import was missing a bit of data
and got threading wrong when Subjects were mismatched (every
patch series).  Unfortunately that means the import times I
initially reported were too  optimistic and real import times
may take 30-40% longer :<  (More optimizations are planned, however)

Fortunately, old optimizations made to avoid git tree lookups
during the v1 era has made v2 UI work much easier and I was able
to spot some errors and bugs quickly in the PSGI interface.

For sorting, relying on the Date: header seems unreliable as
kernel developers seem more prone to having bad clocks than
other lists I've imported.  I'll probably switch the internal
timestamps to use the Received: date as a result.

Eric Wong (Contractor, The Linux Foundation) (21):
  v2writable: warn on duplicate Message-IDs
  v2/ui: some hacky things to get the PSGI UI to show up
  v2/ui: retry DB reopens in a few more places
  v2writable: cleanup unused pipes in partitions
  searchidxpart: binmode
  use PublicInbox::MIME consistently
  searchidxpart: chomp line before splitting
  searchidx*: name child subprocesses
  searchidx: get rid of pointless index_blob wrapper
  view: remove X-PI-TS reference
  searchidxthread: load doc data for references
  searchidxpart: force integers into add_message
  search: reopen skeleton DB as well
  searchidx: index values in the threader
  search: use different Enquire object for skeleton queries
  rename SearchIdxThread to SearchIdxSkeleton
  v2writable: commit to skeleton via remote partitions
  searchidxskeleton: extra error checking
  searchidx: do not modify Xapian DB while iterating
  search: query_xover uses skeleton DB iff available
  v2/ui: get nntpd and init tests running on v2

 MANIFEST                                           |  2 +-
 lib/PublicInbox/Import.pm                          |  7 ++
 lib/PublicInbox/Inbox.pm                           | 23 +++++--
 lib/PublicInbox/MIME.pm                            |  2 +
 lib/PublicInbox/Search.pm                          | 74 ++++++++++++++++++++--
 lib/PublicInbox/SearchIdx.pm                       | 72 +++++++++++----------
 lib/PublicInbox/SearchIdxPart.pm                   | 27 +++++---
 .../{SearchIdxThread.pm => SearchIdxSkeleton.pm}   | 50 +++++++++------
 lib/PublicInbox/V2Writable.pm                      | 50 +++++++++------
 lib/PublicInbox/View.pm                            |  3 +-
 lib/PublicInbox/WatchMaildir.pm                    |  2 -
 lib/PublicInbox/WwwAttach.pm                       |  3 +-
 script/public-inbox-init                           | 48 ++++++++++----
 script/public-inbox-learn                          |  2 -
 script/public-inbox-mda                            |  4 +-
 scripts/import_slrnspool                           |  4 +-
 scripts/import_vger_from_mbox                      |  3 +-
 t/init.t                                           | 15 +++++
 t/nntpd.t                                          | 36 +++++++++--
 19 files changed, 302 insertions(+), 125 deletions(-)
 rename lib/PublicInbox/{SearchIdxThread.pm => SearchIdxSkeleton.pm} (63%)


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 01/21] v2writable: warn on duplicate Message-IDs
  2018-02-28 23:41 [PATCH v2 0/21] UI bits and v2 import fixes Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-28 23:41 ` Eric Wong (Contractor, The Linux Foundation)
  2018-02-28 23:41 ` [PATCH 02/21] v2/ui: some hacky things to get the PSGI UI to show up Eric Wong (Contractor, The Linux Foundation)
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-28 23:41 UTC (permalink / raw)
  To: meta

This should give us an idea of how much a problem deduplication
will be.
---
 lib/PublicInbox/SearchIdx.pm  | 6 ++++--
 lib/PublicInbox/V2Writable.pm | 2 +-
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index cc7e7ec..f9207e9 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -515,13 +515,15 @@ sub unindex_blob {
 }
 
 sub index_mm {
-	my ($self, $mime) = @_;
+	my ($self, $mime, $warn_existing) = @_;
 	my $mid = mid_clean(mid_mime($mime));
 	my $mm = $self->{mm};
 	my $num = $mm->mid_insert($mid);
+	return $num if defined $num;
 
+	warn "<$mid> reused\n" if $warn_existing;
 	# fallback to num_for since filters like RubyLang set the number
-	defined $num ? $num : $mm->num_for($mid);
+	$mm->num_for($mid);
 }
 
 sub unindex_mm {
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index cf19c76..29ed23c 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -63,7 +63,7 @@ sub add {
 	my ($len, $msgref) = @{$im->{last_object}};
 
 	$self->idx_init;
-	my $num = $self->{all}->index_mm($mime);
+	my $num = $self->{all}->index_mm($mime, 1);
 	my $nparts = $self->{partitions};
 	my $part = $num % $nparts;
 	my $idx = $self->idx_part($part);
-- 
EW


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 02/21] v2/ui: some hacky things to get the PSGI UI to show up
  2018-02-28 23:41 [PATCH v2 0/21] UI bits and v2 import fixes Eric Wong (Contractor, The Linux Foundation)
  2018-02-28 23:41 ` [PATCH 01/21] v2writable: warn on duplicate Message-IDs Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-28 23:41 ` Eric Wong (Contractor, The Linux Foundation)
  2018-02-28 23:41 ` [PATCH 03/21] v2/ui: retry DB reopens in a few more places Eric Wong (Contractor, The Linux Foundation)
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-28 23:41 UTC (permalink / raw)
  To: meta

Fortunately, Xapian multiple database support makes things
easier but we still need to handle the skeleton DB separately.
---
 lib/PublicInbox/Inbox.pm  | 21 +++++++++++++++++----
 lib/PublicInbox/Search.pm | 42 ++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 57 insertions(+), 6 deletions(-)

diff --git a/lib/PublicInbox/Inbox.pm b/lib/PublicInbox/Inbox.pm
index e7856e3..f000a95 100644
--- a/lib/PublicInbox/Inbox.pm
+++ b/lib/PublicInbox/Inbox.pm
@@ -73,6 +73,10 @@ sub new {
 	_set_limiter($opts, $pi_config, 'httpbackend');
 	_set_uint($opts, 'feedmax', 25);
 	$opts->{nntpserver} ||= $pi_config->{'publicinbox.nntpserver'};
+	my $dir = $opts->{mainrepo};
+	if (defined $dir && -f "$dir/msgmap.sqlite3") { # XXX DIRTY
+		$opts->{version} = 2;
+	}
 	bless $opts, $class;
 }
 
@@ -92,7 +96,12 @@ sub mm {
 	my ($self) = @_;
 	$self->{mm} ||= eval {
 		_cleanup_later($self);
-		PublicInbox::Msgmap->new($self->{mainrepo});
+		my $dir = $self->{mainrepo};
+		if (($self->{version} || 1) >= 2) {
+			PublicInbox::Msgmap->new_file("$dir/msgmap.sqlite3");
+		} else {
+			PublicInbox::Msgmap->new($dir);
+		}
 	};
 }
 
@@ -100,7 +109,7 @@ sub search {
 	my ($self) = @_;
 	$self->{search} ||= eval {
 		_cleanup_later($self);
-		PublicInbox::Search->new($self->{mainrepo}, $self->{altid});
+		PublicInbox::Search->new($self, $self->{altid});
 	};
 }
 
@@ -229,7 +238,7 @@ sub msg_by_smsg ($$;$) {
 	# backwards compat to fallback to msg_by_mid
 	# TODO: remove if we bump SCHEMA_VERSION in Search.pm:
 	defined(my $blob = $smsg->{blob}) or
-			return msg_by_mid($self, $smsg->mid);
+			return msg_by_path($self, mid2path($smsg->mid), $ref);
 
 	my $str = git($self)->cat_file($blob, $ref);
 	$$str =~ s/\A[\r\n]*From [^\r\n]*\r?\n//s if $str;
@@ -243,7 +252,11 @@ sub path_check {
 
 sub msg_by_mid ($$;$) {
 	my ($self, $mid, $ref) = @_;
-	msg_by_path($self, mid2path($mid), $ref);
+	my $srch = search($self) or
+			return msg_by_path($self, mid2path($mid), $ref);
+	my $smsg = $srch->lookup_skeleton($mid) or return;
+	$smsg->load_expand;
+	msg_by_smsg($self, $smsg, $ref);
 }
 
 1;
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 3b28059..b20b2cc 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -144,7 +144,26 @@ sub new {
 		altid => $altid,
 		version => $version,
 	}, $class;
-	$self->{xdb} = Search::Xapian::Database->new($self->xdir);
+	if ($version >= 2) {
+		my $dir = "$self->{mainrepo}/xap" . SCHEMA_VERSION;
+		my $xdb;
+		my $parts = 0;
+		foreach my $part (<$dir/*>) {
+			-d $part && $part =~ m!/\d+\z! or next;
+			$parts++;
+			my $sub = Search::Xapian::Database->new($part);
+			if ($xdb) {
+				$xdb->add_database($sub);
+			} else {
+				$xdb = $sub;
+			}
+		}
+		warn "v2 repo with $parts found in $dir\n";
+		$self->{xdb} = $xdb;
+		$self->{skel} = Search::Xapian::Database->new("$dir/all");
+	} else {
+		$self->{xdb} = Search::Xapian::Database->new($self->xdir);
+	}
 	$self;
 }
 
@@ -166,7 +185,7 @@ sub query {
 
 sub get_thread {
 	my ($self, $mid, $opts) = @_;
-	my $smsg = eval { $self->lookup_message($mid) };
+	my $smsg = eval { $self->lookup_skeleton($mid) };
 
 	return { total => 0, msgs => [] } unless $smsg;
 	my $qtid = Search::Xapian::Query->new('G' . $smsg->thread_id);
@@ -298,6 +317,25 @@ sub query_xover {
 	_do_enquire($self, $query, {num => 1, limit => 200, offset => $offset});
 }
 
+sub lookup_skeleton {
+	my ($self, $mid) = @_;
+	my $skel = $self->{skel} or return lookup_message($self, $mid);
+	$mid = mid_clean($mid);
+	my $term = 'XMID' . $mid;
+	my $smsg;
+	my $beg = $skel->postlist_begin($term);
+	if ($beg != $skel->postlist_end($term)) {
+		my $doc_id = $beg->get_docid;
+		if (defined $doc_id) {
+			# raises on error:
+			my $doc = $skel->get_document($doc_id);
+			$smsg = PublicInbox::SearchMsg->wrap($doc, $mid);
+			$smsg->{doc_id} = $doc_id;
+		}
+	}
+	$smsg;
+}
+
 sub lookup_message {
 	my ($self, $mid) = @_;
 	$mid = mid_clean($mid);
-- 
EW


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 03/21] v2/ui: retry DB reopens in a few more places
  2018-02-28 23:41 [PATCH v2 0/21] UI bits and v2 import fixes Eric Wong (Contractor, The Linux Foundation)
  2018-02-28 23:41 ` [PATCH 01/21] v2writable: warn on duplicate Message-IDs Eric Wong (Contractor, The Linux Foundation)
  2018-02-28 23:41 ` [PATCH 02/21] v2/ui: some hacky things to get the PSGI UI to show up Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-28 23:41 ` Eric Wong (Contractor, The Linux Foundation)
  2018-02-28 23:41 ` [PATCH 04/21] v2writable: cleanup unused pipes in partitions Eric Wong (Contractor, The Linux Foundation)
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-28 23:41 UTC (permalink / raw)
  To: meta

Relying more on Xapian requires retrying reopens in more
places to ensure it does not fall down and show errors to
the user.
---
 lib/PublicInbox/Inbox.pm  | 8 +++++---
 lib/PublicInbox/Search.pm | 3 ++-
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/lib/PublicInbox/Inbox.pm b/lib/PublicInbox/Inbox.pm
index f000a95..54a6eb3 100644
--- a/lib/PublicInbox/Inbox.pm
+++ b/lib/PublicInbox/Inbox.pm
@@ -254,9 +254,11 @@ sub msg_by_mid ($$;$) {
 	my ($self, $mid, $ref) = @_;
 	my $srch = search($self) or
 			return msg_by_path($self, mid2path($mid), $ref);
-	my $smsg = $srch->lookup_skeleton($mid) or return;
-	$smsg->load_expand;
-	msg_by_smsg($self, $smsg, $ref);
+	my $smsg;
+	$srch->retry_reopen(sub {
+		$smsg = $srch->lookup_skeleton($mid) and $smsg->load_expand;
+	});
+	$smsg ? msg_by_smsg($self, $smsg, $ref) : undef;
 }
 
 1;
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index b20b2cc..1df87d0 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -185,7 +185,7 @@ sub query {
 
 sub get_thread {
 	my ($self, $mid, $opts) = @_;
-	my $smsg = eval { $self->lookup_skeleton($mid) };
+	my $smsg = retry_reopen($self, sub { lookup_skeleton($self, $mid) });
 
 	return { total => 0, msgs => [] } unless $smsg;
 	my $qtid = Search::Xapian::Query->new('G' . $smsg->thread_id);
@@ -216,6 +216,7 @@ sub retry_reopen {
 		if (ref($@) eq 'Search::Xapian::DatabaseModifiedError') {
 			reopen($self);
 		} else {
+			warn "ref: ", ref($@), "\n";
 			die;
 		}
 	}
-- 
EW


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 04/21] v2writable: cleanup unused pipes in partitions
  2018-02-28 23:41 [PATCH v2 0/21] UI bits and v2 import fixes Eric Wong (Contractor, The Linux Foundation)
                   ` (2 preceding siblings ...)
  2018-02-28 23:41 ` [PATCH 03/21] v2/ui: retry DB reopens in a few more places Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-28 23:41 ` Eric Wong (Contractor, The Linux Foundation)
  2018-02-28 23:41 ` [PATCH 05/21] searchidxpart: binmode Eric Wong (Contractor, The Linux Foundation)
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-28 23:41 UTC (permalink / raw)
  To: meta

Leaking these pipes to child processes wasn't harmful, but
made determining relationships and dataflow between processes
more confusing.
---
 lib/PublicInbox/Import.pm          |  7 +++++++
 lib/PublicInbox/SearchIdxPart.pm   |  9 +++++----
 lib/PublicInbox/SearchIdxThread.pm |  6 ++++--
 lib/PublicInbox/V2Writable.pm      | 18 +++++++++++++++---
 4 files changed, 31 insertions(+), 9 deletions(-)

diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index b650e4e..ac46c0c 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -372,6 +372,13 @@ sub done {
 	close $lockfh or die "close lock failed: $!";
 }
 
+sub atfork_child {
+	my ($self) = @_;
+	foreach my $f (qw(in out)) {
+		close $self->{$f} or die "failed to close import[$f]: $!\n";
+	}
+}
+
 1;
 __END__
 =pod
diff --git a/lib/PublicInbox/SearchIdxPart.pm b/lib/PublicInbox/SearchIdxPart.pm
index 5582d67..64e5263 100644
--- a/lib/PublicInbox/SearchIdxPart.pm
+++ b/lib/PublicInbox/SearchIdxPart.pm
@@ -14,10 +14,7 @@ sub new {
 	my $pid = fork;
 	defined $pid or die "fork failed: $!\n";
 	if ($pid == 0) {
-		foreach my $other (@{$v2writable->{idx_parts}}) {
-			my $other_w = $other->{w} or next;
-			close $other_w or die "close other failed: $!\n";
-		}
+		$v2writable->atfork_child;
 		$v2writable = undef;
 		close $w;
 
@@ -74,4 +71,8 @@ sub index_raw {
 	$w->flush or die "failed to flush: $!\n";
 }
 
+sub atfork_child {
+	close $_[0]->{w} or die "failed to close write pipe: $!\n";
+}
+
 1;
diff --git a/lib/PublicInbox/SearchIdxThread.pm b/lib/PublicInbox/SearchIdxThread.pm
index 6471309..7df07a6 100644
--- a/lib/PublicInbox/SearchIdxThread.pm
+++ b/lib/PublicInbox/SearchIdxThread.pm
@@ -7,8 +7,8 @@ use base qw(PublicInbox::SearchIdx);
 use Storable qw(freeze thaw);
 
 sub new {
-	my ($class, $v2ibx) = @_;
-	my $self = $class->SUPER::new($v2ibx, 1, 'all');
+	my ($class, $v2writable) = @_;
+	my $self = $class->SUPER::new($v2writable->{-inbox}, 1, 'all');
 	# create the DB:
 	$self->_xdb_acquire;
 	$self->_xdb_release;
@@ -20,6 +20,8 @@ sub new {
 	my $pid = fork;
 	defined $pid or die "fork failed: $!\n";
 	if ($pid == 0) {
+		$v2writable->atfork_child;
+		$v2writable = undef;
 		close $w;
 		eval { thread_worker_loop($self, $r) };
 		die "thread worker died: $@\n" if $@;
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 29ed23c..3451261 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -84,9 +84,9 @@ sub idx_part {
 sub idx_init {
 	my ($self) = @_;
 	return if $self->{idx_parts};
-	# first time initialization:
-	my $all = $self->{all} =
-		PublicInbox::SearchIdxThread->new($self->{-inbox});
+
+	# first time initialization, first we create the threader pipe:
+	my $all = $self->{all} = PublicInbox::SearchIdxThread->new($self);
 
 	# need to create all parts before initializing msgmap FD
 	my $max = $self->{partitions} - 1;
@@ -94,6 +94,8 @@ sub idx_init {
 	for my $i (0..$max) {
 		push @$idx, PublicInbox::SearchIdxPart->new($self, $i, $all);
 	}
+
+	# Now that all subprocesses are up, we can open the FD for SQLite:
 	$all->_msgmap_init->{dbh}->begin_work;
 }
 
@@ -242,4 +244,14 @@ sub lookup_content {
 	undef # TODO
 }
 
+sub atfork_child {
+	my ($self) = @_;
+	if (my $parts = $self->{idx_parts}) {
+		$_->atfork_child foreach @$parts;
+	}
+	if (my $im = $self->{im}) {
+		$im->atfork_child;
+	}
+}
+
 1;
-- 
EW


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 05/21] searchidxpart: binmode
  2018-02-28 23:41 [PATCH v2 0/21] UI bits and v2 import fixes Eric Wong (Contractor, The Linux Foundation)
                   ` (3 preceding siblings ...)
  2018-02-28 23:41 ` [PATCH 04/21] v2writable: cleanup unused pipes in partitions Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-28 23:41 ` Eric Wong (Contractor, The Linux Foundation)
  2018-02-28 23:41 ` [PATCH 06/21] use PublicInbox::MIME consistently Eric Wong (Contractor, The Linux Foundation)
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-28 23:41 UTC (permalink / raw)
  To: meta

Probably unnecessary, but set binmode for consistency across
platforms.
---
 lib/PublicInbox/SearchIdxPart.pm | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/lib/PublicInbox/SearchIdxPart.pm b/lib/PublicInbox/SearchIdxPart.pm
index 64e5263..63fc704 100644
--- a/lib/PublicInbox/SearchIdxPart.pm
+++ b/lib/PublicInbox/SearchIdxPart.pm
@@ -11,6 +11,8 @@ sub new {
 	$self->{threader} = $threader;
 	my ($r, $w);
 	pipe($r, $w) or die "pipe failed: $!\n";
+	binmode $r, ':raw';
+	binmode $w, ':raw';
 	my $pid = fork;
 	defined $pid or die "fork failed: $!\n";
 	if ($pid == 0) {
-- 
EW


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 06/21] use PublicInbox::MIME consistently
  2018-02-28 23:41 [PATCH v2 0/21] UI bits and v2 import fixes Eric Wong (Contractor, The Linux Foundation)
                   ` (4 preceding siblings ...)
  2018-02-28 23:41 ` [PATCH 05/21] searchidxpart: binmode Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-28 23:41 ` Eric Wong (Contractor, The Linux Foundation)
  2018-02-28 23:41 ` [PATCH 07/21] searchidxpart: chomp line before splitting Eric Wong (Contractor, The Linux Foundation)
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-28 23:41 UTC (permalink / raw)
  To: meta

It works around some bugs in older Email::MIME which we'll
find useful.
---
 lib/PublicInbox/MIME.pm         | 2 ++
 lib/PublicInbox/SearchIdx.pm    | 2 --
 lib/PublicInbox/V2Writable.pm   | 2 --
 lib/PublicInbox/WatchMaildir.pm | 2 --
 lib/PublicInbox/WwwAttach.pm    | 3 +--
 script/public-inbox-learn       | 2 --
 script/public-inbox-mda         | 4 +---
 scripts/import_slrnspool        | 4 ++--
 scripts/import_vger_from_mbox   | 3 +--
 9 files changed, 7 insertions(+), 17 deletions(-)

diff --git a/lib/PublicInbox/MIME.pm b/lib/PublicInbox/MIME.pm
index 54925a8..456eed6 100644
--- a/lib/PublicInbox/MIME.pm
+++ b/lib/PublicInbox/MIME.pm
@@ -23,6 +23,8 @@ package PublicInbox::MIME;
 use strict;
 use warnings;
 use base qw(Email::MIME);
+use Email::MIME::ContentType;
+$Email::MIME::ContentType::STRICT_PARAMS = 0;
 
 if ($Email::MIME::VERSION <= 1.937) {
 sub parts_multipart {
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index f9207e9..0c3445d 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -11,8 +11,6 @@ use strict;
 use warnings;
 use Fcntl qw(:flock :DEFAULT);
 use PublicInbox::MIME;
-use Email::MIME::ContentType;
-$Email::MIME::ContentType::STRICT_PARAMS = 0;
 use base qw(PublicInbox::Search);
 use PublicInbox::MID qw/mid_clean id_compress mid_mime/;
 use PublicInbox::MsgIter;
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 3451261..5e819da 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -11,8 +11,6 @@ use PublicInbox::SearchIdxThread;
 use PublicInbox::MIME;
 use PublicInbox::Git;
 use PublicInbox::Import;
-use Email::MIME::ContentType;
-$Email::MIME::ContentType::STRICT_PARAMS = 0;
 
 # an estimate of the post-packed size to the raw uncompressed size
 my $PACKING_FACTOR = 0.4;
diff --git a/lib/PublicInbox/WatchMaildir.pm b/lib/PublicInbox/WatchMaildir.pm
index 403b6cf..3da6b27 100644
--- a/lib/PublicInbox/WatchMaildir.pm
+++ b/lib/PublicInbox/WatchMaildir.pm
@@ -7,8 +7,6 @@ package PublicInbox::WatchMaildir;
 use strict;
 use warnings;
 use PublicInbox::MIME;
-use Email::MIME::ContentType;
-$Email::MIME::ContentType::STRICT_PARAMS = 0; # user input is imperfect
 use PublicInbox::Git;
 use PublicInbox::Import;
 use PublicInbox::MDA;
diff --git a/lib/PublicInbox/WwwAttach.pm b/lib/PublicInbox/WwwAttach.pm
index 98cf9f7..b1504f5 100644
--- a/lib/PublicInbox/WwwAttach.pm
+++ b/lib/PublicInbox/WwwAttach.pm
@@ -5,9 +5,8 @@
 package PublicInbox::WwwAttach; # internal package
 use strict;
 use warnings;
-use PublicInbox::MIME;
 use Email::MIME::ContentType qw(parse_content_type);
-$Email::MIME::ContentType::STRICT_PARAMS = 0;
+use PublicInbox::MIME;
 use PublicInbox::MsgIter;
 
 # /$LISTNAME/$MESSAGE_ID/$IDX-$FILENAME
diff --git a/script/public-inbox-learn b/script/public-inbox-learn
index bdc72e0..c51f958 100755
--- a/script/public-inbox-learn
+++ b/script/public-inbox-learn
@@ -11,8 +11,6 @@ use PublicInbox::Config;
 use PublicInbox::Git;
 use PublicInbox::Import;
 use PublicInbox::MIME;
-use Email::MIME::ContentType;
-$Email::MIME::ContentType::STRICT_PARAMS = 0; # user input is imperfect
 use PublicInbox::Address;
 use PublicInbox::Spamcheck::Spamc;
 my $train = shift or die "usage: $usage\n";
diff --git a/script/public-inbox-mda b/script/public-inbox-mda
index 8cf4419..f1eaf62 100755
--- a/script/public-inbox-mda
+++ b/script/public-inbox-mda
@@ -15,9 +15,7 @@ sub do_exit {
 }
 
 use Email::Simple;
-use Email::MIME;
-use Email::MIME::ContentType;
-$Email::MIME::ContentType::STRICT_PARAMS = 0; # user input is imperfect
+use PublicInbox::MIME;
 use PublicInbox::MDA;
 use PublicInbox::Config;
 use PublicInbox::Import;
diff --git a/scripts/import_slrnspool b/scripts/import_slrnspool
index 5158460..7b6c9ab 100755
--- a/scripts/import_slrnspool
+++ b/scripts/import_slrnspool
@@ -11,7 +11,7 @@
 use strict;
 use warnings;
 use PublicInbox::Config;
-use Email::MIME;
+use PublicInbox::MIME;
 use PublicInbox::Import;
 use PublicInbox::Git;
 sub usage { "Usage:\n".join('',grep(/\t/, `head -n 10 $0`)) }
@@ -58,7 +58,7 @@ for (; $exit == 0 && $n < $max; $n++) {
 	open(my $fh, '<', $fn) or next;
 	$max = $n + $max_gap;
 
-	my $mime = Email::MIME->new(eval { local $/; <$fh> });
+	my $mime = PublicInbox::MIME->new(eval { local $/; <$fh> });
 	my $hdr = $mime->header_obj;
 
 	# gmane rewrites Received headers, which increases spamminess
diff --git a/scripts/import_vger_from_mbox b/scripts/import_vger_from_mbox
index 1308483..8f0ec7c 100644
--- a/scripts/import_vger_from_mbox
+++ b/scripts/import_vger_from_mbox
@@ -5,8 +5,7 @@ use strict;
 use warnings;
 use Getopt::Long qw/:config gnu_getopt no_ignore_case auto_abbrev/;
 use Date::Parse qw/str2time/;
-use Email::MIME;
-$Email::MIME::ContentType::STRICT_PARAMS = 0; # user input is imperfect
+use PublicInbox::MIME;
 use PublicInbox::Inbox;
 use PublicInbox::V2Writable;
 use PublicInbox::Import;
-- 
EW


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 07/21] searchidxpart: chomp line before splitting
  2018-02-28 23:41 [PATCH v2 0/21] UI bits and v2 import fixes Eric Wong (Contractor, The Linux Foundation)
                   ` (5 preceding siblings ...)
  2018-02-28 23:41 ` [PATCH 06/21] use PublicInbox::MIME consistently Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-28 23:41 ` Eric Wong (Contractor, The Linux Foundation)
  2018-02-28 23:41 ` [PATCH 08/21] searchidx*: name child subprocesses Eric Wong (Contractor, The Linux Foundation)
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-28 23:41 UTC (permalink / raw)
  To: meta

This was adding a needless newline into doc_data
---
 lib/PublicInbox/SearchIdxPart.pm | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/PublicInbox/SearchIdxPart.pm b/lib/PublicInbox/SearchIdxPart.pm
index 63fc704..e37887a 100644
--- a/lib/PublicInbox/SearchIdxPart.pm
+++ b/lib/PublicInbox/SearchIdxPart.pm
@@ -48,6 +48,7 @@ sub partition_worker_loop ($$) {
 			$self->_xdb_release;
 			$xdb = $txn = undef;
 		} else {
+			chomp $line;
 			my ($len, $artnum, $object_id) = split(/ /, $line);
 			$xdb ||= $self->_xdb_acquire;
 			if (!$txn) {
-- 
EW


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 08/21] searchidx*: name child subprocesses
  2018-02-28 23:41 [PATCH v2 0/21] UI bits and v2 import fixes Eric Wong (Contractor, The Linux Foundation)
                   ` (6 preceding siblings ...)
  2018-02-28 23:41 ` [PATCH 07/21] searchidxpart: chomp line before splitting Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-28 23:41 ` Eric Wong (Contractor, The Linux Foundation)
  2018-02-28 23:41 ` [PATCH 09/21] searchidx: get rid of pointless index_blob wrapper Eric Wong (Contractor, The Linux Foundation)
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-28 23:41 UTC (permalink / raw)
  To: meta

This makes viewing "ps" output nicer.
---
 lib/PublicInbox/SearchIdxPart.pm   | 7 ++++---
 lib/PublicInbox/SearchIdxThread.pm | 1 +
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/lib/PublicInbox/SearchIdxPart.pm b/lib/PublicInbox/SearchIdxPart.pm
index e37887a..bffa532 100644
--- a/lib/PublicInbox/SearchIdxPart.pm
+++ b/lib/PublicInbox/SearchIdxPart.pm
@@ -24,7 +24,7 @@ sub new {
 		# speeds V2Writable batch imports across 8 cores by nearly 20%
 		fcntl($r, 1031, 1048576) if $^O eq 'linux';
 
-		eval { partition_worker_loop($self, $r) };
+		eval { partition_worker_loop($self, $r, $part) };
 		die "worker $part died: $@\n" if $@;
 		die "unexpected MM $self->{mm}" if $self->{mm};
 		exit;
@@ -35,8 +35,9 @@ sub new {
 	$self;
 }
 
-sub partition_worker_loop ($$) {
-	my ($self, $r) = @_;
+sub partition_worker_loop ($$$) {
+	my ($self, $r, $part) = @_;
+	$0 = "pi-v2-partition[$part]";
 	my $xdb = $self->_xdb_acquire;
 	$xdb->begin_transaction;
 	my $txn = 1;
diff --git a/lib/PublicInbox/SearchIdxThread.pm b/lib/PublicInbox/SearchIdxThread.pm
index 7df07a6..fd133d1 100644
--- a/lib/PublicInbox/SearchIdxThread.pm
+++ b/lib/PublicInbox/SearchIdxThread.pm
@@ -41,6 +41,7 @@ sub new {
 
 sub thread_worker_loop {
 	my ($self, $r) = @_;
+	$0 = 'pi-v2-threader';
 	my $msg;
 	my $xdb = $self->_xdb_acquire;
 	$xdb->begin_transaction;
-- 
EW


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 09/21] searchidx: get rid of pointless index_blob wrapper
  2018-02-28 23:41 [PATCH v2 0/21] UI bits and v2 import fixes Eric Wong (Contractor, The Linux Foundation)
                   ` (7 preceding siblings ...)
  2018-02-28 23:41 ` [PATCH 08/21] searchidx*: name child subprocesses Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-28 23:41 ` Eric Wong (Contractor, The Linux Foundation)
  2018-02-28 23:41 ` [PATCH 10/21] view: remove X-PI-TS reference Eric Wong (Contractor, The Linux Foundation)
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-28 23:41 UTC (permalink / raw)
  To: meta

This used to lookup the message in git, but no longer, so
remove a needless indirection layer and call add_message
directly.
---
 lib/PublicInbox/SearchIdx.pm     | 11 +++--------
 lib/PublicInbox/SearchIdxPart.pm |  2 +-
 2 files changed, 4 insertions(+), 9 deletions(-)

diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 0c3445d..00b24d6 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -491,11 +491,6 @@ sub link_message {
 	$smsg->{doc}->add_term('G' . $tid);
 }
 
-sub index_blob {
-	my ($self, $mime, $bytes, $num, $blob) = @_;
-	$self->add_message($mime, $bytes, $num, $blob);
-}
-
 sub index_git_blob_id {
 	my ($doc, $pfx, $objid) = @_;
 
@@ -532,7 +527,7 @@ sub unindex_mm {
 sub index_mm2 {
 	my ($self, $mime, $bytes, $blob) = @_;
 	my $num = $self->{mm}->num_for(mid_clean(mid_mime($mime)));
-	index_blob($self, $mime, $bytes, $num, $blob);
+	add_message($self, $mime, $bytes, $num, $blob);
 }
 
 sub unindex_mm2 {
@@ -544,7 +539,7 @@ sub unindex_mm2 {
 sub index_both {
 	my ($self, $mime, $bytes, $blob) = @_;
 	my $num = index_mm($self, $mime);
-	index_blob($self, $mime, $bytes, $num, $blob);
+	add_message($self, $mime, $bytes, $num, $blob);
 }
 
 sub unindex_both {
@@ -711,7 +706,7 @@ sub _index_sync {
 		}
 	} else {
 		# user didn't install DBD::SQLite and DBI
-		rlog($self, $xlog, *index_blob, *unindex_blob, $cb);
+		rlog($self, $xlog, *add_message, *unindex_blob, $cb);
 	}
 }
 
diff --git a/lib/PublicInbox/SearchIdxPart.pm b/lib/PublicInbox/SearchIdxPart.pm
index bffa532..ee79e08 100644
--- a/lib/PublicInbox/SearchIdxPart.pm
+++ b/lib/PublicInbox/SearchIdxPart.pm
@@ -59,7 +59,7 @@ sub partition_worker_loop ($$$) {
 			my $n = read($r, my $msg, $len) or die "read: $!\n";
 			$n == $len or die "short read: $n != $len\n";
 			my $mime = PublicInbox::MIME->new(\$msg);
-			$self->index_blob($mime, $len, $artnum, $object_id);
+			$self->add_message($mime, $len, $artnum, $object_id);
 		}
 	}
 	warn "$$ still in transaction\n" if $txn;
-- 
EW


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 10/21] view: remove X-PI-TS reference
  2018-02-28 23:41 [PATCH v2 0/21] UI bits and v2 import fixes Eric Wong (Contractor, The Linux Foundation)
                   ` (8 preceding siblings ...)
  2018-02-28 23:41 ` [PATCH 09/21] searchidx: get rid of pointless index_blob wrapper Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-28 23:41 ` Eric Wong (Contractor, The Linux Foundation)
  2018-02-28 23:41 ` [PATCH 11/21] searchidxthread: load doc data for references Eric Wong (Contractor, The Linux Foundation)
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-28 23:41 UTC (permalink / raw)
  To: meta

We haven't needed this since we integrated threading
and dropped Email::Abstract and Mail::Thread usage.
---
 lib/PublicInbox/View.pm | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/lib/PublicInbox/View.pm b/lib/PublicInbox/View.pm
index 590a76a..aad6748 100644
--- a/lib/PublicInbox/View.pm
+++ b/lib/PublicInbox/View.pm
@@ -752,8 +752,7 @@ sub missing_thread {
 
 sub _msg_date {
 	my ($hdr) = @_;
-	my $ts = $hdr->header('X-PI-TS') || msg_timestamp($hdr);
-	fmt_ts($ts);
+	fmt_ts(msg_timestamp($hdr));
 }
 
 sub fmt_ts { POSIX::strftime('%Y-%m-%d %k:%M', gmtime($_[0])) }
-- 
EW


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 11/21] searchidxthread: load doc data for references
  2018-02-28 23:41 [PATCH v2 0/21] UI bits and v2 import fixes Eric Wong (Contractor, The Linux Foundation)
                   ` (9 preceding siblings ...)
  2018-02-28 23:41 ` [PATCH 10/21] view: remove X-PI-TS reference Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-28 23:41 ` Eric Wong (Contractor, The Linux Foundation)
  2018-02-28 23:41 ` [PATCH 12/21] searchidxpart: force integers into add_message Eric Wong (Contractor, The Linux Foundation)
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-28 23:41 UTC (permalink / raw)
  To: meta

Otherwise, references and thread linking doesn't happen
across subject mismatches.  Oops, this is important.
---
 lib/PublicInbox/SearchIdxThread.pm | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/PublicInbox/SearchIdxThread.pm b/lib/PublicInbox/SearchIdxThread.pm
index fd133d1..57bb293 100644
--- a/lib/PublicInbox/SearchIdxThread.pm
+++ b/lib/PublicInbox/SearchIdxThread.pm
@@ -101,6 +101,7 @@ sub thread_msg_real {
 	$doc->add_term('XMID' . $mid);
 	$doc->set_data($doc_data);
 	$smsg->{ts} = $ts;
+	$smsg->load_from_data($doc_data);
 	my @refs = ($smsg->references =~ /<([^>]+)>/g);
 	$self->link_message($smsg, \@refs, $old_tid);
 	my $db = $self->{xdb};
-- 
EW


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 12/21] searchidxpart: force integers into add_message
  2018-02-28 23:41 [PATCH v2 0/21] UI bits and v2 import fixes Eric Wong (Contractor, The Linux Foundation)
                   ` (10 preceding siblings ...)
  2018-02-28 23:41 ` [PATCH 11/21] searchidxthread: load doc data for references Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-28 23:41 ` Eric Wong (Contractor, The Linux Foundation)
  2018-02-28 23:41 ` [PATCH 13/21] search: reopen skeleton DB as well Eric Wong (Contractor, The Linux Foundation)
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-28 23:41 UTC (permalink / raw)
  To: meta

Make data passed via Storable to the skeleton worker
a little neater.
---
 lib/PublicInbox/SearchIdxPart.pm | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/lib/PublicInbox/SearchIdxPart.pm b/lib/PublicInbox/SearchIdxPart.pm
index ee79e08..477a4f9 100644
--- a/lib/PublicInbox/SearchIdxPart.pm
+++ b/lib/PublicInbox/SearchIdxPart.pm
@@ -59,7 +59,8 @@ sub partition_worker_loop ($$$) {
 			my $n = read($r, my $msg, $len) or die "read: $!\n";
 			$n == $len or die "short read: $n != $len\n";
 			my $mime = PublicInbox::MIME->new(\$msg);
-			$self->add_message($mime, $len, $artnum, $object_id);
+			$artnum = int($artnum);
+			$self->add_message($mime, $n, $artnum, $object_id);
 		}
 	}
 	warn "$$ still in transaction\n" if $txn;
-- 
EW


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 13/21] search: reopen skeleton DB as well
  2018-02-28 23:41 [PATCH v2 0/21] UI bits and v2 import fixes Eric Wong (Contractor, The Linux Foundation)
                   ` (11 preceding siblings ...)
  2018-02-28 23:41 ` [PATCH 12/21] searchidxpart: force integers into add_message Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-28 23:41 ` Eric Wong (Contractor, The Linux Foundation)
  2018-02-28 23:41 ` [PATCH 14/21] searchidx: index values in the threader Eric Wong (Contractor, The Linux Foundation)
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-28 23:41 UTC (permalink / raw)
  To: meta

Any Xapian DB is subject to the same errors and retries.
Perhaps in the future this can made more granular to avoid
unnecessary reopens.
---
 lib/PublicInbox/Search.pm | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 1df87d0..3a27512 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -167,7 +167,13 @@ sub new {
 	$self;
 }
 
-sub reopen { $_[0]->{xdb}->reopen }
+sub reopen {
+	my ($self) = @_;
+	$self->{xdb}->reopen;
+	if (my $skel = $self->{skel}) {
+		$skel->reopen;
+	}
+}
 
 # read-only
 sub query {
-- 
EW


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 14/21] searchidx: index values in the threader
  2018-02-28 23:41 [PATCH v2 0/21] UI bits and v2 import fixes Eric Wong (Contractor, The Linux Foundation)
                   ` (12 preceding siblings ...)
  2018-02-28 23:41 ` [PATCH 13/21] search: reopen skeleton DB as well Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-28 23:41 ` Eric Wong (Contractor, The Linux Foundation)
  2018-02-28 23:41 ` [PATCH 15/21] search: use different Enquire object for skeleton queries Eric Wong (Contractor, The Linux Foundation)
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-28 23:41 UTC (permalink / raw)
  To: meta

We will need timestamp, YYYYMMDD, article number, and line count
for querying thread information (including XOVER for NNTP).
---
 lib/PublicInbox/SearchIdx.pm       | 24 ++++++++++++++----------
 lib/PublicInbox/SearchIdxThread.pm | 19 ++++++++++++-------
 2 files changed, 26 insertions(+), 17 deletions(-)

diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 00b24d6..b5d43d1 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -141,18 +141,20 @@ sub add_val ($$$) {
 	$doc->add_value($col, $num);
 }
 
-sub add_values ($$$$) {
-	my ($smsg, $bytes, $num, $lines) = @_;
+sub add_values ($$) {
+	my ($doc, $values) = @_;
 
-	my $ts = $smsg->ts;
-	my $doc = $smsg->{doc};
-	add_val($doc, &PublicInbox::Search::TS, $ts);
+	my $ts = $values->[PublicInbox::Search::TS];
+	add_val($doc, PublicInbox::Search::TS, $ts);
 
-	defined($num) and add_val($doc, &PublicInbox::Search::NUM, $num);
+	my $num = $values->[PublicInbox::Search::NUM];
+	defined($num) and add_val($doc, PublicInbox::Search::NUM, $num);
 
-	defined($bytes) and add_val($doc, &PublicInbox::Search::BYTES, $bytes);
+	my $bytes = $values->[PublicInbox::Search::BYTES];
+	defined($bytes) and add_val($doc, PublicInbox::Search::BYTES, $bytes);
 
-	add_val($doc, &PublicInbox::Search::LINES, $lines);
+	my $lines = $values->[PublicInbox::Search::LINES];
+	add_val($doc, PublicInbox::Search::LINES, $lines);
 
 	my $yyyymmdd = strftime('%Y%m%d', gmtime($ts));
 	add_val($doc, PublicInbox::Search::YYYYMMDD, $yyyymmdd);
@@ -307,7 +309,8 @@ sub add_message {
 		}
 
 		my $lines = $mime->body_raw =~ tr!\n!\n!;
-		add_values($smsg, $bytes, $num, $lines);
+		my @values = ($smsg->ts, $num, $bytes, $lines);
+		add_values($doc, \@values);
 
 		my $tg = $self->term_generator;
 
@@ -360,7 +363,8 @@ sub add_message {
 		my $refs = parse_references($smsg);
 		my $data = $smsg->to_doc_data($blob);
 		if ($threader) {
-			$threader->thread_msg($mid, $smsg->ts, $xpath, $data);
+			push @values, $mid, $xpath, $data;
+			$threader->thread_msg(\@values);
 		} else {
 			link_message($self, $smsg, $refs, $old_tid);
 		}
diff --git a/lib/PublicInbox/SearchIdxThread.pm b/lib/PublicInbox/SearchIdxThread.pm
index 57bb293..6b50eb0 100644
--- a/lib/PublicInbox/SearchIdxThread.pm
+++ b/lib/PublicInbox/SearchIdxThread.pm
@@ -61,30 +61,34 @@ sub thread_worker_loop {
 				$xdb->begin_transaction;
 				$txn = 1;
 			}
-			eval { $self->thread_msg_real(@$msg) };
-			warn "failed to index message <$msg->[0]>: $@\n" if $@;
+			eval { $self->thread_msg_real($msg) };
+			warn "failed to index message <$msg->[-1]>: $@\n" if $@;
 		}
 	}
 }
 
 # called by a partition worker
 sub thread_msg {
-	my ($self, $mid, $ts, $xpath, $doc_data) = @_;
+	my ($self, $values) = @_;
 	my $w = $self->{w};
 	my $err;
-	my $str = freeze([ $mid, $ts, $xpath, $doc_data ]);
-	my $len = length($str) . "\n";
+	my $str = freeze($values);
+	$str = length($str) . "\n" . $str;
 
 	# multiple processes write to the same pipe, so use flock
 	$self->_lock_acquire;
-	print $w $len, $str or $err = $!;
+	print $w $str or $err = $!;
 	$self->_lock_release;
 
 	die "print failed: $err\n" if $err;
 }
 
 sub thread_msg_real {
-	my ($self, $mid, $ts, $xpath, $doc_data) = @_;
+	my ($self, $values) = @_;
+	my $doc_data = pop @$values;
+	my $xpath = pop @$values;
+	my $mid = pop @$values;
+	my $ts = $values->[PublicInbox::Search::TS];
 	my $smsg = $self->lookup_message($mid);
 	my ($old_tid, $doc_id);
 	if ($smsg) {
@@ -99,6 +103,7 @@ sub thread_msg_real {
 	my $doc = $smsg->{doc};
 	$doc->add_term('XPATH' . $xpath) if defined $xpath;
 	$doc->add_term('XMID' . $mid);
+	PublicInbox::SearchIdx::add_values($doc, $values);
 	$doc->set_data($doc_data);
 	$smsg->{ts} = $ts;
 	$smsg->load_from_data($doc_data);
-- 
EW


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 15/21] search: use different Enquire object for skeleton queries
  2018-02-28 23:41 [PATCH v2 0/21] UI bits and v2 import fixes Eric Wong (Contractor, The Linux Foundation)
                   ` (13 preceding siblings ...)
  2018-02-28 23:41 ` [PATCH 14/21] searchidx: index values in the threader Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-28 23:41 ` Eric Wong (Contractor, The Linux Foundation)
  2018-02-28 23:41 ` [PATCH 16/21] rename SearchIdxThread to SearchIdxSkeleton Eric Wong (Contractor, The Linux Foundation)
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-28 23:41 UTC (permalink / raw)
  To: meta

A different Xapian DB requires the use of a different Enquire
object.  This is necessary for get_thread and thread skeleton
to work in the PSGI UI.
---
 lib/PublicInbox/Search.pm | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 3a27512..0f102da 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -207,7 +207,7 @@ sub get_thread {
 	# always sort threads by timestamp, this makes life easier
 	# for the threading algorithm (in SearchThread.pm)
 	$opts->{asc} = 1;
-
+	$opts->{enquire} = enquire_skel($self);
 	_do_enquire($self, $qtid, $opts);
 }
 
@@ -235,7 +235,7 @@ sub _do_enquire {
 
 sub _enquire_once {
 	my ($self, $query, $opts) = @_;
-	my $enquire = $self->enquire;
+	my $enquire = $opts->{enquire} || enquire($self);
 	if (defined $query) {
 		$query = Search::Xapian::Query->new(OP_AND,$query,$mail_query);
 	} else {
@@ -423,6 +423,15 @@ sub enquire {
 	$self->{enquire} ||= Search::Xapian::Enquire->new($self->{xdb});
 }
 
+sub enquire_skel {
+	my ($self) = @_;
+	if (my $skel = $self->{skel}) {
+		$self->{enquire_skel} ||= Search::Xapian::Enquire->new($skel);
+	} else {
+		enquire($self);
+	}
+}
+
 sub help {
 	my ($self) = @_;
 	$self->qp; # parse altids
-- 
EW


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 16/21] rename SearchIdxThread to SearchIdxSkeleton
  2018-02-28 23:41 [PATCH v2 0/21] UI bits and v2 import fixes Eric Wong (Contractor, The Linux Foundation)
                   ` (14 preceding siblings ...)
  2018-02-28 23:41 ` [PATCH 15/21] search: use different Enquire object for skeleton queries Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-28 23:41 ` Eric Wong (Contractor, The Linux Foundation)
  2018-02-28 23:41 ` [PATCH 17/21] v2writable: commit to skeleton via remote partitions Eric Wong (Contractor, The Linux Foundation)
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-28 23:41 UTC (permalink / raw)
  To: meta

Interchangably using "all", "skel", "threader", etc. were
confusing.  Standardize on the "skeleton" term to describe
this class since it's also used for retrieval of basic headers.
---
 MANIFEST                                           |  2 +-
 lib/PublicInbox/Search.pm                          |  2 +-
 lib/PublicInbox/SearchIdx.pm                       | 10 ++++----
 lib/PublicInbox/SearchIdxPart.pm                   |  4 +--
 .../{SearchIdxThread.pm => SearchIdxSkeleton.pm}   | 21 +++++++--------
 lib/PublicInbox/V2Writable.pm                      | 30 ++++++++++++----------
 6 files changed, 36 insertions(+), 33 deletions(-)
 rename lib/PublicInbox/{SearchIdxThread.pm => SearchIdxSkeleton.pm} (85%)

diff --git a/MANIFEST b/MANIFEST
index 2a6f6f0..1aaf8ff 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -85,7 +85,7 @@ lib/PublicInbox/SaPlugin/ListMirror.pm
 lib/PublicInbox/Search.pm
 lib/PublicInbox/SearchIdx.pm
 lib/PublicInbox/SearchIdxPart.pm
-lib/PublicInbox/SearchIdxThread.pm
+lib/PublicInbox/SearchIdxSkeleton.pm
 lib/PublicInbox/SearchMsg.pm
 lib/PublicInbox/SearchThread.pm
 lib/PublicInbox/SearchView.pm
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 0f102da..6b14942 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -160,7 +160,7 @@ sub new {
 		}
 		warn "v2 repo with $parts found in $dir\n";
 		$self->{xdb} = $xdb;
-		$self->{skel} = Search::Xapian::Database->new("$dir/all");
+		$self->{skel} = Search::Xapian::Database->new("$dir/skel");
 	} else {
 		$self->{xdb} = Search::Xapian::Database->new($self->xdir);
 	}
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index b5d43d1..3259413 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -285,7 +285,7 @@ sub add_message {
 
 	my ($doc_id, $old_tid);
 	my $mid = mid_clean(mid_mime($mime));
-	my $threader = $self->{threader};
+	my $skel = $self->{skeleton};
 
 	eval {
 		die 'Message-ID too long' if length($mid) > MAX_MID_SIZE;
@@ -294,7 +294,7 @@ sub add_message {
 			# convert a ghost to a regular message
 			# it will also clobber any existing regular message
 			$doc_id = $smsg->{doc_id};
-			$old_tid = $smsg->thread_id unless $threader;
+			$old_tid = $smsg->thread_id unless $skel;
 		}
 		$smsg = PublicInbox::SearchMsg->new($mime);
 		my $doc = $smsg->{doc};
@@ -362,9 +362,9 @@ sub add_message {
 		# populates smsg->references for smsg->to_doc_data
 		my $refs = parse_references($smsg);
 		my $data = $smsg->to_doc_data($blob);
-		if ($threader) {
+		if ($skel) {
 			push @values, $mid, $xpath, $data;
-			$threader->thread_msg(\@values);
+			$skel->index_skeleton(\@values);
 		} else {
 			link_message($self, $smsg, $refs, $old_tid);
 		}
@@ -817,7 +817,7 @@ sub DESTROY {
 	$_[0]->{lockfh} = undef;
 }
 
-# remote_* subs are only used by SearchIdxPart and SearchIdxThread:
+# remote_* subs are only used by SearchIdxPart and SearchIdxSkeleton
 sub remote_commit {
 	my ($self) = @_;
 	print { $self->{w} } "commit\n" or die "failed to write commit: $!";
diff --git a/lib/PublicInbox/SearchIdxPart.pm b/lib/PublicInbox/SearchIdxPart.pm
index 477a4f9..6025fc4 100644
--- a/lib/PublicInbox/SearchIdxPart.pm
+++ b/lib/PublicInbox/SearchIdxPart.pm
@@ -6,9 +6,9 @@ use warnings;
 use base qw(PublicInbox::SearchIdx);
 
 sub new {
-	my ($class, $v2writable, $part, $threader) = @_;
+	my ($class, $v2writable, $part, $skel) = @_;
 	my $self = $class->SUPER::new($v2writable->{-inbox}, 1, $part);
-	$self->{threader} = $threader;
+	$self->{skeleton} = $skel;
 	my ($r, $w);
 	pipe($r, $w) or die "pipe failed: $!\n";
 	binmode $r, ':raw';
diff --git a/lib/PublicInbox/SearchIdxThread.pm b/lib/PublicInbox/SearchIdxSkeleton.pm
similarity index 85%
rename from lib/PublicInbox/SearchIdxThread.pm
rename to lib/PublicInbox/SearchIdxSkeleton.pm
index 6b50eb0..0016f89 100644
--- a/lib/PublicInbox/SearchIdxThread.pm
+++ b/lib/PublicInbox/SearchIdxSkeleton.pm
@@ -1,6 +1,6 @@
 # Copyright (C) 2018 all contributors <meta@public-inbox.org>
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
-package PublicInbox::SearchIdxThread;
+package PublicInbox::SearchIdxSkeleton;
 use strict;
 use warnings;
 use base qw(PublicInbox::SearchIdx);
@@ -8,7 +8,7 @@ use Storable qw(freeze thaw);
 
 sub new {
 	my ($class, $v2writable) = @_;
-	my $self = $class->SUPER::new($v2writable->{-inbox}, 1, 'all');
+	my $self = $class->SUPER::new($v2writable->{-inbox}, 1, 'skel');
 	# create the DB:
 	$self->_xdb_acquire;
 	$self->_xdb_release;
@@ -23,8 +23,8 @@ sub new {
 		$v2writable->atfork_child;
 		$v2writable = undef;
 		close $w;
-		eval { thread_worker_loop($self, $r) };
-		die "thread worker died: $@\n" if $@;
+		eval { skeleton_worker_loop($self, $r) };
+		die "skeleton worker died: $@\n" if $@;
 		exit;
 	}
 	$self->{w} = $w;
@@ -34,14 +34,14 @@ sub new {
 	$w->autoflush(1);
 
 	# lock on only exists in parent, not in worker
-	my $l = $self->{lock_path} = $self->xdir . '/thread.lock';
+	my $l = $self->{lock_path} = $self->xdir . '/pi-v2-skeleton.lock';
 	open my $fh, '>>', $l or die "failed to create $l: $!\n";
 	$self;
 }
 
-sub thread_worker_loop {
+sub skeleton_worker_loop {
 	my ($self, $r) = @_;
-	$0 = 'pi-v2-threader';
+	$0 = 'pi-v2-skeleton';
 	my $msg;
 	my $xdb = $self->_xdb_acquire;
 	$xdb->begin_transaction;
@@ -61,14 +61,14 @@ sub thread_worker_loop {
 				$xdb->begin_transaction;
 				$txn = 1;
 			}
-			eval { $self->thread_msg_real($msg) };
+			eval { index_skeleton_real($self, $msg) };
 			warn "failed to index message <$msg->[-1]>: $@\n" if $@;
 		}
 	}
 }
 
 # called by a partition worker
-sub thread_msg {
+sub index_skeleton {
 	my ($self, $values) = @_;
 	my $w = $self->{w};
 	my $err;
@@ -83,7 +83,8 @@ sub thread_msg {
 	die "print failed: $err\n" if $err;
 }
 
-sub thread_msg_real {
+# values: [ TS, NUM, BYTES, LINES, MID, XPATH, doc_data ]
+sub index_skeleton_real ($$) {
 	my ($self, $values) = @_;
 	my $doc_data = pop @$values;
 	my $xpath = pop @$values;
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 5e819da..ff3b657 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -7,7 +7,7 @@ use strict;
 use warnings;
 use Fcntl qw(:flock :DEFAULT);
 use PublicInbox::SearchIdxPart;
-use PublicInbox::SearchIdxThread;
+use PublicInbox::SearchIdxSkeleton;
 use PublicInbox::MIME;
 use PublicInbox::Git;
 use PublicInbox::Import;
@@ -61,7 +61,7 @@ sub add {
 	my ($len, $msgref) = @{$im->{last_object}};
 
 	$self->idx_init;
-	my $num = $self->{all}->index_mm($mime, 1);
+	my $num = $self->{skel}->index_mm($mime, 1);
 	my $nparts = $self->{partitions};
 	my $part = $num % $nparts;
 	my $idx = $self->idx_part($part);
@@ -83,18 +83,18 @@ sub idx_init {
 	my ($self) = @_;
 	return if $self->{idx_parts};
 
-	# first time initialization, first we create the threader pipe:
-	my $all = $self->{all} = PublicInbox::SearchIdxThread->new($self);
+	# first time initialization, first we create the skeleton pipe:
+	my $skel = $self->{skel} = PublicInbox::SearchIdxSkeleton->new($self);
 
 	# need to create all parts before initializing msgmap FD
 	my $max = $self->{partitions} - 1;
 	my $idx = $self->{idx_parts} = [];
 	for my $i (0..$max) {
-		push @$idx, PublicInbox::SearchIdxPart->new($self, $i, $all);
+		push @$idx, PublicInbox::SearchIdxPart->new($self, $i, $skel);
 	}
 
 	# Now that all subprocesses are up, we can open the FD for SQLite:
-	$all->_msgmap_init->{dbh}->begin_work;
+	$skel->_msgmap_init->{dbh}->begin_work;
 }
 
 sub remove {
@@ -129,8 +129,8 @@ sub checkpoint {
 sub searchidx_checkpoint {
 	my ($self, $more) = @_;
 
-	# order matters, we can only close {all} after all partitions
-	# are done because the partitions also write to {all}
+	# order matters, we can only close {skel} after all partitions
+	# are done because the partitions also write to {skel}
 
 	if (my $parts = $self->{idx_parts}) {
 		foreach my $idx (@$parts) {
@@ -140,14 +140,16 @@ sub searchidx_checkpoint {
 		delete $self->{idx_parts} unless $more;
 	}
 
-	if (my $all = $self->{all}) {
-		$all->{mm}->{dbh}->commit;
+	if (my $skel = $self->{skel}) {
+		$skel->{mm}->{dbh}->commit;
 		if ($more) {
-			$all->{mm}->{dbh}->begin_work;
+			$skel->{mm}->{dbh}->begin_work;
+		}
+		$skel->remote_commit;
+		unless ($more) {
+			$skel->remote_close;
+			delete $self->{skel};
 		}
-		$all->remote_commit;
-		$all->remote_close unless $more;
-		delete $self->{all} unless $more;
 	}
 	$self->{transact_bytes} = 0;
 }
-- 
EW


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 17/21] v2writable: commit to skeleton via remote partitions
  2018-02-28 23:41 [PATCH v2 0/21] UI bits and v2 import fixes Eric Wong (Contractor, The Linux Foundation)
                   ` (15 preceding siblings ...)
  2018-02-28 23:41 ` [PATCH 16/21] rename SearchIdxThread to SearchIdxSkeleton Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-28 23:41 ` Eric Wong (Contractor, The Linux Foundation)
  2018-02-28 23:41 ` [PATCH 18/21] searchidxskeleton: extra error checking Eric Wong (Contractor, The Linux Foundation)
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-28 23:41 UTC (permalink / raw)
  To: meta

We need to ensure Xapian transaction commits are made to remote
partitions before associated commits hit the skeleton DB.

This causes unnecessary commits to be made to the skeleton DB;
but they're mostly harmless.  Further work will be necessary
to ensure proper ordering and avoidance of unnecessary commits.
---
 lib/PublicInbox/SearchIdxPart.pm |  1 +
 lib/PublicInbox/V2Writable.pm    | 13 ++++++-------
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/lib/PublicInbox/SearchIdxPart.pm b/lib/PublicInbox/SearchIdxPart.pm
index 6025fc4..2f577ec 100644
--- a/lib/PublicInbox/SearchIdxPart.pm
+++ b/lib/PublicInbox/SearchIdxPart.pm
@@ -45,6 +45,7 @@ sub partition_worker_loop ($$$) {
 		if ($line eq "commit\n") {
 			$xdb->commit_transaction if $txn;
 			$txn = undef;
+			$self->{skeleton}->remote_commit;
 		} elsif ($line eq "close\n") {
 			$self->_xdb_release;
 			$xdb = $txn = undef;
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index ff3b657..73110ff 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -131,22 +131,21 @@ sub searchidx_checkpoint {
 
 	# order matters, we can only close {skel} after all partitions
 	# are done because the partitions also write to {skel}
-
 	if (my $parts = $self->{idx_parts}) {
 		foreach my $idx (@$parts) {
-			$idx->remote_commit;
+			$idx->remote_commit; # propagates commit to skel
 			$idx->remote_close unless $more;
 		}
 		delete $self->{idx_parts} unless $more;
 	}
 
 	if (my $skel = $self->{skel}) {
-		$skel->{mm}->{dbh}->commit;
+		my $dbh = $skel->{mm}->{dbh};
+		$dbh->commit;
 		if ($more) {
-			$skel->{mm}->{dbh}->begin_work;
-		}
-		$skel->remote_commit;
-		unless ($more) {
+			$dbh->begin_work;
+		} else {
+			$skel->remote_commit; # XXX should be unnecessary...
 			$skel->remote_close;
 			delete $self->{skel};
 		}
-- 
EW


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 18/21] searchidxskeleton: extra error checking
  2018-02-28 23:41 [PATCH v2 0/21] UI bits and v2 import fixes Eric Wong (Contractor, The Linux Foundation)
                   ` (16 preceding siblings ...)
  2018-02-28 23:41 ` [PATCH 17/21] v2writable: commit to skeleton via remote partitions Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-28 23:41 ` Eric Wong (Contractor, The Linux Foundation)
  2018-02-28 23:42 ` [PATCH 19/21] searchidx: do not modify Xapian DB while iterating Eric Wong (Contractor, The Linux Foundation)
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-28 23:41 UTC (permalink / raw)
  To: meta

I added these while chasing down the DatabaseCorruptError
exceptions which turned out to be caused by Xapian DB
modifications during iteration.
---
 lib/PublicInbox/SearchIdxSkeleton.pm | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/lib/PublicInbox/SearchIdxSkeleton.pm b/lib/PublicInbox/SearchIdxSkeleton.pm
index 0016f89..aa2713f 100644
--- a/lib/PublicInbox/SearchIdxSkeleton.pm
+++ b/lib/PublicInbox/SearchIdxSkeleton.pm
@@ -42,7 +42,6 @@ sub new {
 sub skeleton_worker_loop {
 	my ($self, $r) = @_;
 	$0 = 'pi-v2-skeleton';
-	my $msg;
 	my $xdb = $self->_xdb_acquire;
 	$xdb->begin_transaction;
 	my $txn = 1;
@@ -54,9 +53,12 @@ sub skeleton_worker_loop {
 			$self->_xdb_release;
 			$xdb = $txn = undef;
 		} else {
-			read($r, $msg, $line) or die "read failed: $!\n";
+			my $len = int($line);
+			my $n = read($r, my $msg, $len) or die "read: $!\n";
+			$n == $len or die "short read: $n != $len\n";
 			$msg = thaw($msg); # should raise on error
 			defined $msg or die "failed to thaw buffer\n";
+			$xdb ||= $self->_xdb_acquire;
 			if (!$txn) {
 				$xdb->begin_transaction;
 				$txn = 1;
@@ -65,6 +67,8 @@ sub skeleton_worker_loop {
 			warn "failed to index message <$msg->[-1]>: $@\n" if $@;
 		}
 	}
+	die "xdb not released\n" if $xdb;
+	die "in transaction\n" if $txn;
 }
 
 # called by a partition worker
-- 
EW


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 19/21] searchidx: do not modify Xapian DB while iterating
  2018-02-28 23:41 [PATCH v2 0/21] UI bits and v2 import fixes Eric Wong (Contractor, The Linux Foundation)
                   ` (17 preceding siblings ...)
  2018-02-28 23:41 ` [PATCH 18/21] searchidxskeleton: extra error checking Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-28 23:42 ` Eric Wong (Contractor, The Linux Foundation)
  2018-02-28 23:42 ` [PATCH 20/21] search: query_xover uses skeleton DB iff available Eric Wong (Contractor, The Linux Foundation)
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-28 23:42 UTC (permalink / raw)
  To: meta

Iterating through a list of documents while modifying them does
not seem to be supported in Xapian and it can trigger
DatabaseCorruptError exceptions.  This only worked with past
datasets out of dumb luck.  With the work-in-progress "v2"
public-inbox layout, this problem might become more visible
as the "thread skeleton" is partitioned out to a separate,
smaller Xapian database.

I've reproduced the problem on both Debian 8.x and 9.x with
Xapian 1.2.19 (chert backend) and 1.4.3 (glass backend)
respectively.
---
 lib/PublicInbox/SearchIdx.pm | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 3259413..f4238fe 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -740,15 +740,22 @@ sub create_ghost {
 sub merge_threads {
 	my ($self, $winner_tid, $loser_tid) = @_;
 	return if $winner_tid == $loser_tid;
-	my ($head, $tail) = $self->find_doc_ids('G' . $loser_tid);
 	my $db = $self->{xdb};
 
-	for (; $head != $tail; $head->inc) {
-		my $docid = $head->get_docid;
-		my $doc = $db->get_document($docid);
-		$doc->remove_term('G' . $loser_tid);
-		$doc->add_term('G' . $winner_tid);
-		$db->replace_document($docid, $doc);
+	my $batch_size = 1000; # don't let @ids grow too large to avoid OOM
+	while (1) {
+		my ($head, $tail) = $self->find_doc_ids('G' . $loser_tid);
+		return if $head == $tail;
+		my @ids;
+		for (; $head != $tail && @ids < $batch_size; $head->inc) {
+			push @ids, $head->get_docid;
+		}
+		foreach my $docid (@ids) {
+			my $doc = $db->get_document($docid);
+			$doc->remove_term('G' . $loser_tid);
+			$doc->add_term('G' . $winner_tid);
+			$db->replace_document($docid, $doc);
+		}
 	}
 }
 
-- 
EW


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 20/21] search: query_xover uses skeleton DB iff available
  2018-02-28 23:41 [PATCH v2 0/21] UI bits and v2 import fixes Eric Wong (Contractor, The Linux Foundation)
                   ` (18 preceding siblings ...)
  2018-02-28 23:42 ` [PATCH 19/21] searchidx: do not modify Xapian DB while iterating Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-28 23:42 ` Eric Wong (Contractor, The Linux Foundation)
  2018-02-28 23:42 ` [PATCH 21/21] v2/ui: get nntpd and init tests running on v2 Eric Wong (Contractor, The Linux Foundation)
  2018-03-01 23:40 ` [PATCH v2 0/21] UI bits and v2 import fixes Eric Wong
  21 siblings, 0 replies; 23+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-28 23:42 UTC (permalink / raw)
  To: meta

The skeleton DB is where we store all the information needed
for NNTP overviews via XOVER.  This seems to be the only change
necessary (besides eventually handling duplicates) necessary
to support our nntpd interface for v2 repositories.
---
 lib/PublicInbox/Search.pm | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 6b14942..a796cf6 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -317,11 +317,17 @@ sub num_range_processor {
 sub query_xover {
 	my ($self, $beg, $end, $offset) = @_;
 	my $qp = Search::Xapian::QueryParser->new;
-	$qp->set_database($self->{xdb});
+	$qp->set_database($self->{skel} || $self->{xdb});
 	$qp->add_valuerangeprocessor($self->num_range_processor);
 	my $query = $qp->parse_query("$beg..$end", QP_FLAGS);
 
-	_do_enquire($self, $query, {num => 1, limit => 200, offset => $offset});
+	my $opts = {
+		enquire => enquire_skel($self),
+		num => 1,
+		limit => 200,
+		offset => $offset,
+	};
+	_do_enquire($self, $query, $opts);
 }
 
 sub lookup_skeleton {
-- 
EW


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 21/21] v2/ui: get nntpd and init tests running on v2
  2018-02-28 23:41 [PATCH v2 0/21] UI bits and v2 import fixes Eric Wong (Contractor, The Linux Foundation)
                   ` (19 preceding siblings ...)
  2018-02-28 23:42 ` [PATCH 20/21] search: query_xover uses skeleton DB iff available Eric Wong (Contractor, The Linux Foundation)
@ 2018-02-28 23:42 ` Eric Wong (Contractor, The Linux Foundation)
  2018-03-01 23:40 ` [PATCH v2 0/21] UI bits and v2 import fixes Eric Wong
  21 siblings, 0 replies; 23+ messages in thread
From: Eric Wong (Contractor, The Linux Foundation) @ 2018-02-28 23:42 UTC (permalink / raw)
  To: meta

A work-in-progress, but it appears the v2 UI pieces do
will not require a lot of work to do.
---
 lib/PublicInbox/V2Writable.pm |  1 +
 script/public-inbox-init      | 48 ++++++++++++++++++++++++++++++++-----------
 t/init.t                      | 15 ++++++++++++++
 t/nntpd.t                     | 36 ++++++++++++++++++++++++++------
 4 files changed, 82 insertions(+), 18 deletions(-)

diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 73110ff..81c7a4d 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -79,6 +79,7 @@ sub idx_part {
 	$self->{idx_parts}->[$part];
 }
 
+# idempotent
 sub idx_init {
 	my ($self) = @_;
 	return if $self->{idx_parts};
diff --git a/script/public-inbox-init b/script/public-inbox-init
index 2f33c9e..f7a60fb 100755
--- a/script/public-inbox-init
+++ b/script/public-inbox-init
@@ -6,6 +6,7 @@
 use strict;
 use warnings;
 my $usage = "public-inbox-init NAME GIT_DIR HTTP_URL ADDRESS [ADDRESS..]";
+use Getopt::Long qw/:config gnu_getopt no_ignore_case auto_abbrev/;
 use PublicInbox::Config;
 use File::Temp qw/tempfile/;
 use File::Basename qw/dirname/;
@@ -14,9 +15,11 @@ use Cwd qw/abs_path/;
 
 sub x { system(@_) and die join(' ', @_). " failed: $?\n" }
 sub usage { print STDERR "Usage: $usage\n"; exit 1 }
-
+my $version = 1;
+my %opts = ( 'V|version=i' => \$version );
+GetOptions(%opts) or usage();
 my $name = shift @ARGV or usage();
-my $git_dir = shift @ARGV or usage();
+my $mainrepo = shift @ARGV or usage();
 my $http_url = shift @ARGV or usage();
 my (@address) = @ARGV;
 @address or usage();
@@ -25,7 +28,7 @@ my %seen;
 my $pi_config = PublicInbox::Config->default_file;
 my $dir = dirname($pi_config);
 mkpath($dir); # will croak on fatal errors
-my ($fh, $filename) = tempfile('pi-init-XXXXXXXX', DIR => $dir);
+my ($fh, $pi_config_tmp) = tempfile('pi-init-XXXXXXXX', DIR => $dir);
 if (-e $pi_config) {
 	open(my $oh, '<', $pi_config) or die "unable to read $pi_config: $!\n";
 	my @st = stat($oh);
@@ -62,22 +65,43 @@ if (-e $pi_config) {
 
 	exit(1) if $conflict;
 }
-close $fh or die "failed to close $filename: $!\n";
+close $fh or die "failed to close $pi_config_tmp: $!\n";
 
 my $pfx = "publicinbox.$name";
-my @x = (qw/git config/, "--file=$filename");
-$git_dir = abs_path($git_dir);
-x(qw(git init -q --bare), $git_dir);
+my @x = (qw/git config/, "--file=$pi_config_tmp");
+
+$mainrepo = abs_path($mainrepo);
 
-# set a reasonable default:
-x(qw/git config/, "--file=$git_dir/config", 'repack.writeBitmaps', 'true');
+if ($version >= 2) {
+	require PublicInbox::V2Writable;
+	require PublicInbox::Inbox;
+	my $ibx = {
+		mainrepo => $mainrepo,
+		name => $name,
+		version => $version,
+		-primary_address => $address[0],
+	};
+	$ibx = PublicInbox::Inbox->new($ibx);
+	my $v2w = PublicInbox::V2Writable->new($ibx, 1);
+	$v2w->git_init(0);
+	$v2w->idx_init(0);
+	$v2w->done;
+} elsif ($version == 1) {
+	x(qw(git init -q --bare), $mainrepo);
+
+	# set a reasonable default:
+	x(qw/git config/, "--file=$mainrepo/config",
+		'repack.writeBitmaps', 'true');
+} else {
+	die "Unsupported -V/--version: $version\n";
+}
 
 foreach my $addr (@address) {
 	next if $seen{lc($addr)};
 	x(@x, "--add", "$pfx.address", $addr);
 }
 x(@x, "$pfx.url", $http_url);
-x(@x, "$pfx.mainrepo", $git_dir);
+x(@x, "$pfx.mainrepo", $mainrepo);
 
-rename $filename, $pi_config or
-	die "failed to rename `$filename' to `$pi_config': $!\n";
+rename $pi_config_tmp, $pi_config or
+	die "failed to rename `$pi_config_tmp' to `$pi_config': $!\n";
diff --git a/t/init.t b/t/init.t
index 864f1ab..54b90ec 100644
--- a/t/init.t
+++ b/t/init.t
@@ -25,4 +25,19 @@ use constant pi_init => 'blib/script/public-inbox-init';
 	is((stat($cfgfile))[2] & 07777, 0666, "permissions preserved");
 }
 
+SKIP: {
+	foreach my $mod (qw(DBD::SQLite Search::Xapian::WritableDatabase)) {
+		eval "require $mod";
+		skip "$mod missing for v2", 2 if $@;
+	}
+	local $ENV{PI_DIR} = "$tmpdir/.public-inbox/";
+	my $cfgfile = "$ENV{PI_DIR}/config";
+	my @cmd = (pi_init, '-V2', 'v2list', "$tmpdir/v2list",
+		   qw(http://example.com/v2list v2list@example.com));
+	is(system(@cmd), 0, 'public-inbox-init -V2 OK');
+	ok(-d "$tmpdir/v2list", 'v2list directory exists');
+	ok(-f "$tmpdir/v2list/msgmap.sqlite3", 'msgmap exists');
+	ok(-d "$tmpdir/v2list/all.git", 'catch-all.git directory exists');
+}
+
 done_testing();
diff --git a/t/nntpd.t b/t/nntpd.t
index 56b1d60..ea0d293 100644
--- a/t/nntpd.t
+++ b/t/nntpd.t
@@ -21,14 +21,18 @@ my $tmpdir = tempdir('pi-nntpd-XXXXXX', TMPDIR => 1, CLEANUP => 1);
 my $home = "$tmpdir/pi-home";
 my $err = "$tmpdir/stderr.log";
 my $out = "$tmpdir/stdout.log";
-my $maindir = "$tmpdir/main.git";
+my $mainrepo = "$tmpdir/main.git";
 my $group = 'test-nntpd';
 my $addr = $group . '@example.com';
 my $nntpd = 'blib/script/public-inbox-nntpd';
 my $init = 'blib/script/public-inbox-init';
 use_ok 'PublicInbox::Import';
+use_ok 'PublicInbox::Inbox';
 use_ok 'PublicInbox::Git';
+use_ok 'PublicInbox::V2Writable';
 
+# XXX FIXME: make it easier to test both versions
+my $version = int($ENV{PI_VERSION} || 1);
 my %opts = (
 	LocalAddr => '127.0.0.1',
 	ReuseAddr => 1,
@@ -40,14 +44,34 @@ my $sock = IO::Socket::INET->new(%opts);
 my $pid;
 my $len;
 END { kill 'TERM', $pid if defined $pid };
+
+my $ibx = {
+	mainrepo => $mainrepo,
+	name => $group,
+	version => $version,
+	-primary_address => $addr,
+};
+$ibx = PublicInbox::Inbox->new($ibx);
 {
 	local $ENV{HOME} = $home;
-	system($init, $group, $maindir, 'http://example.com/', $addr);
+	my @cmd = ($init, $group, $mainrepo, 'http://example.com/', $addr);
+	push @cmd, "-V$version";
+	is(system(@cmd), 0, 'init OK');
 	is(system(qw(git config), "--file=$home/.public-inbox/config",
 			"publicinbox.$group.newsgroup", $group),
 		0, 'enabled newsgroup');
 	my $len;
 
+	my $im;
+	if ($version == 2) {
+		$im = PublicInbox::V2Writable->new($ibx);
+	} elsif ($version == 1) {
+		my $git = PublicInbox::Git->new($mainrepo);
+		$im = PublicInbox::Import->new($git, 'test', $addr);
+	} else {
+		die "unsupported version: $version";
+	}
+
 	# ensure successful message delivery
 	{
 		my $mime = Email::MIME->new(<<EOF);
@@ -66,12 +90,12 @@ EOF
 		$list_id =~ s/@/./;
 		$mime->header_set('List-Id', "<$list_id>");
 		$len = length($mime->as_string);
-		my $git = PublicInbox::Git->new($maindir);
-		my $im = PublicInbox::Import->new($git, 'test', $addr);
 		$im->add($mime);
 		$im->done;
-		my $s = PublicInbox::SearchIdx->new($maindir, 1);
-		$s->index_sync;
+		if ($version == 1) {
+			my $s = PublicInbox::SearchIdx->new($mainrepo, 1);
+			$s->index_sync;
+		}
 	}
 
 	ok($sock, 'sock created');
-- 
EW


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 0/21] UI bits and v2 import fixes
  2018-02-28 23:41 [PATCH v2 0/21] UI bits and v2 import fixes Eric Wong (Contractor, The Linux Foundation)
                   ` (20 preceding siblings ...)
  2018-02-28 23:42 ` [PATCH 21/21] v2/ui: get nntpd and init tests running on v2 Eric Wong (Contractor, The Linux Foundation)
@ 2018-03-01 23:40 ` Eric Wong
  21 siblings, 0 replies; 23+ messages in thread
From: Eric Wong @ 2018-03-01 23:40 UTC (permalink / raw)
  To: meta

I wrote:
> patch series).  Unfortunately that means the import times I
> initially reported were too  optimistic and real import times
> may take 30-40% longer :<  (More optimizations are planned, however)

I could be wrong on that...  I'm used to having fstrim(8) in a
cronjob on my own machines and did not have it on the machine
I was testing on...

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2018-03-01 23:40 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-28 23:41 [PATCH v2 0/21] UI bits and v2 import fixes Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:41 ` [PATCH 01/21] v2writable: warn on duplicate Message-IDs Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:41 ` [PATCH 02/21] v2/ui: some hacky things to get the PSGI UI to show up Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:41 ` [PATCH 03/21] v2/ui: retry DB reopens in a few more places Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:41 ` [PATCH 04/21] v2writable: cleanup unused pipes in partitions Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:41 ` [PATCH 05/21] searchidxpart: binmode Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:41 ` [PATCH 06/21] use PublicInbox::MIME consistently Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:41 ` [PATCH 07/21] searchidxpart: chomp line before splitting Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:41 ` [PATCH 08/21] searchidx*: name child subprocesses Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:41 ` [PATCH 09/21] searchidx: get rid of pointless index_blob wrapper Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:41 ` [PATCH 10/21] view: remove X-PI-TS reference Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:41 ` [PATCH 11/21] searchidxthread: load doc data for references Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:41 ` [PATCH 12/21] searchidxpart: force integers into add_message Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:41 ` [PATCH 13/21] search: reopen skeleton DB as well Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:41 ` [PATCH 14/21] searchidx: index values in the threader Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:41 ` [PATCH 15/21] search: use different Enquire object for skeleton queries Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:41 ` [PATCH 16/21] rename SearchIdxThread to SearchIdxSkeleton Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:41 ` [PATCH 17/21] v2writable: commit to skeleton via remote partitions Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:41 ` [PATCH 18/21] searchidxskeleton: extra error checking Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:42 ` [PATCH 19/21] searchidx: do not modify Xapian DB while iterating Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:42 ` [PATCH 20/21] search: query_xover uses skeleton DB iff available Eric Wong (Contractor, The Linux Foundation)
2018-02-28 23:42 ` [PATCH 21/21] v2/ui: get nntpd and init tests running on v2 Eric Wong (Contractor, The Linux Foundation)
2018-03-01 23:40 ` [PATCH v2 0/21] UI bits and v2 import fixes Eric Wong

Code repositories for project(s) associated with this public inbox

	https://80x24.org/public-inbox.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).