user/dev discussion of public-inbox itself
 help / color / Atom feed
* [PATCH 0/8] indexing cleanup and code reduction
@ 2020-07-17  6:31 Eric Wong
  2020-07-17  6:31 ` [PATCH 1/8] v2: use v5.10.1, parent.pm, drop warnings Eric Wong
                   ` (8 more replies)
  0 siblings, 9 replies; 10+ messages in thread
From: Eric Wong @ 2020-07-17  6:31 UTC (permalink / raw)
  To: meta

Some bigger indexing updates coming, but first we can
reduce allocations and get rid of some overly complicated
unindexing logic.

Eric Wong (8):
  v2: use v5.10.1, parent.pm, drop warnings
  drop binmode usage
  import: use common capitalization for filtering headers
  with_umask: pass args to callback
  overidx: each_by_mid: pass self and args to callbacks
  overidx: favor non-OO sub dispatch for internal subs
  searchidx: use v5.10.1, parent.pm, drop warnings
  search: simplify unindexing

 lib/PublicInbox/Import.pm         |   4 +-
 lib/PublicInbox/InboxWritable.pm  |  42 ++++---
 lib/PublicInbox/OverIdx.pm        | 126 +++++++++++----------
 lib/PublicInbox/SearchIdx.pm      | 178 ++++++++++++------------------
 lib/PublicInbox/SearchIdxShard.pm |  12 +-
 lib/PublicInbox/V2Writable.pm     | 116 +++++++++----------
 lib/PublicInbox/Xapcmd.pm         |  35 +++---
 t/search.t                        |   6 +-
 8 files changed, 246 insertions(+), 273 deletions(-)

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/8] v2: use v5.10.1, parent.pm, drop warnings
  2020-07-17  6:31 [PATCH 0/8] indexing cleanup and code reduction Eric Wong
@ 2020-07-17  6:31 ` Eric Wong
  2020-07-17  6:31 ` [PATCH 2/8] drop binmode usage Eric Wong
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Eric Wong @ 2020-07-17  6:31 UTC (permalink / raw)
  To: meta

The "5.010_001" form was for Perl 5.6, which I doubt anybody
would attempt; so favor "v5.10.1" as it is more readable to
humans.  Prefer "parent" to "base" since the former is lighter.
We'll also rely on warnings from "-w" globally (or not) instead
of via "use".

We'll also update "use" statements to reflect what's actually
used by V2Writable.
---
 lib/PublicInbox/SearchIdxShard.pm | 4 ++--
 lib/PublicInbox/V2Writable.pm     | 7 +++----
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/lib/PublicInbox/SearchIdxShard.pm b/lib/PublicInbox/SearchIdxShard.pm
index f7ba293f..baf7352a 100644
--- a/lib/PublicInbox/SearchIdxShard.pm
+++ b/lib/PublicInbox/SearchIdxShard.pm
@@ -5,8 +5,8 @@
 # See L<public-inbox-v2-format(5)> for more info on how we shard Xapian
 package PublicInbox::SearchIdxShard;
 use strict;
-use warnings;
-use base qw(PublicInbox::SearchIdx);
+use v5.10.1;
+use parent qw(PublicInbox::SearchIdx);
 use IO::Handle (); # autoflush
 use PublicInbox::Eml;
 
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 528f5e9a..0119ea76 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -5,16 +5,15 @@
 # Used to write to V2 inboxes (see L<public-inbox-v2-format(5)>).
 package PublicInbox::V2Writable;
 use strict;
-use warnings;
-use base qw(PublicInbox::Lock);
-use 5.010_001;
+use v5.10.1;
+use parent qw(PublicInbox::Lock);
 use PublicInbox::SearchIdxShard;
 use PublicInbox::Eml;
 use PublicInbox::Git;
 use PublicInbox::Import;
 use PublicInbox::MID qw(mids references);
 use PublicInbox::ContentHash qw(content_hash content_digest);
-use PublicInbox::Inbox;
+use PublicInbox::InboxWritable;
 use PublicInbox::OverIdx;
 use PublicInbox::Msgmap;
 use PublicInbox::Spawn qw(spawn popen_rd);

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 2/8] drop binmode usage
  2020-07-17  6:31 [PATCH 0/8] indexing cleanup and code reduction Eric Wong
  2020-07-17  6:31 ` [PATCH 1/8] v2: use v5.10.1, parent.pm, drop warnings Eric Wong
@ 2020-07-17  6:31 ` Eric Wong
  2020-07-17  6:31 ` [PATCH 3/8] import: use common capitalization for filtering headers Eric Wong
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Eric Wong @ 2020-07-17  6:31 UTC (permalink / raw)
  To: meta

We only support Unix-like platforms where binmode (":raw") is
the default anyways, and v5.10 semantics means it won't do
unicode_strings (unlike v5.12).  So save some lines of code.
---
 lib/PublicInbox/Import.pm         | 2 --
 lib/PublicInbox/SearchIdxShard.pm | 2 --
 2 files changed, 4 deletions(-)

diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index fb813159..b61d4b31 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -73,8 +73,6 @@ sub gfi_start {
 	$self->{out} = $out_w;
 	$self->{pid} = $pid;
 	$self->{nchg} = 0;
-	binmode $out_w, ':raw' or die "binmode :raw failed: $!";
-	binmode $in_r, ':raw' or die "binmode :raw failed: $!";
 	($in_r, $out_w);
 }
 
diff --git a/lib/PublicInbox/SearchIdxShard.pm b/lib/PublicInbox/SearchIdxShard.pm
index baf7352a..b51d148b 100644
--- a/lib/PublicInbox/SearchIdxShard.pm
+++ b/lib/PublicInbox/SearchIdxShard.pm
@@ -26,8 +26,6 @@ sub spawn_worker {
 	my ($self, $v2w, $shard) = @_;
 	my ($r, $w);
 	pipe($r, $w) or die "pipe failed: $!\n";
-	binmode $r, ':raw';
-	binmode $w, ':raw';
 	$w->autoflush(1);
 	my $pid = fork;
 	defined $pid or die "fork failed: $!\n";

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 3/8] import: use common capitalization for filtering headers
  2020-07-17  6:31 [PATCH 0/8] indexing cleanup and code reduction Eric Wong
  2020-07-17  6:31 ` [PATCH 1/8] v2: use v5.10.1, parent.pm, drop warnings Eric Wong
  2020-07-17  6:31 ` [PATCH 2/8] drop binmode usage Eric Wong
@ 2020-07-17  6:31 ` Eric Wong
  2020-07-17  6:31 ` [PATCH 4/8] with_umask: pass args to callback Eric Wong
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Eric Wong @ 2020-07-17  6:31 UTC (permalink / raw)
  To: meta

In case this ends up in the same process as Mbox::msg_hdr,
it can reduce memory use by sharing the cache key in
PublicInbox::Eml::re_memo
---
 lib/PublicInbox/Import.pm | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index b61d4b31..d565b0a0 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -326,7 +326,7 @@ sub extract_cmt_info ($;$) {
 sub drop_unwanted_headers ($) {
 	my ($mime) = @_;
 
-	$mime->header_set($_) for qw(bytes lines content-length status);
+	$mime->header_set($_) for qw(Bytes Lines Content-Length Status);
 	$mime->header_set($_) for @PublicInbox::MDA::BAD_HEADERS;
 }
 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 4/8] with_umask: pass args to callback
  2020-07-17  6:31 [PATCH 0/8] indexing cleanup and code reduction Eric Wong
                   ` (2 preceding siblings ...)
  2020-07-17  6:31 ` [PATCH 3/8] import: use common capitalization for filtering headers Eric Wong
@ 2020-07-17  6:31 ` Eric Wong
  2020-07-17  6:31 ` [PATCH 5/8] overidx: each_by_mid: pass self and args to callbacks Eric Wong
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Eric Wong @ 2020-07-17  6:31 UTC (permalink / raw)
  To: meta

While it makes the code flow slightly less well in some places,
it saves us runtime allocations and indentation.
---
 lib/PublicInbox/InboxWritable.pm |  42 +++++++------
 lib/PublicInbox/SearchIdx.pm     |  40 ++++++------
 lib/PublicInbox/V2Writable.pm    | 102 ++++++++++++++-----------------
 lib/PublicInbox/Xapcmd.pm        |  35 ++++++-----
 4 files changed, 111 insertions(+), 108 deletions(-)

diff --git a/lib/PublicInbox/InboxWritable.pm b/lib/PublicInbox/InboxWritable.pm
index 875dcce2..1f3f6672 100644
--- a/lib/PublicInbox/InboxWritable.pm
+++ b/lib/PublicInbox/InboxWritable.pm
@@ -37,27 +37,33 @@ sub assert_usable_dir {
 	die "no inboxdir defined for $self->{name}\n";
 }
 
+sub _init_v1 {
+	my ($self, $skip_artnum) = @_;
+	if (defined($self->{indexlevel}) || defined($skip_artnum)) {
+		require PublicInbox::SearchIdx;
+		require PublicInbox::Msgmap;
+		my $sidx = PublicInbox::SearchIdx->new($self, 1); # just create
+		$sidx->begin_txn_lazy;
+		if (defined $skip_artnum) {
+			my $mm = PublicInbox::Msgmap->new($self->{inboxdir}, 1);
+			$mm->{dbh}->begin_work;
+			$mm->skip_artnum($skip_artnum);
+			$mm->{dbh}->commit;
+		}
+		$sidx->commit_txn_lazy;
+	} else {
+		open my $fh, '>>', "$self->{inboxdir}/ssoma.lock" or
+			die "$self->{inboxdir}/ssoma.lock: $!\n";
+	}
+}
+
 sub init_inbox {
 	my ($self, $shards, $skip_epoch, $skip_artnum) = @_;
 	if ($self->version == 1) {
 		my $dir = assert_usable_dir($self);
 		PublicInbox::Import::init_bare($dir);
-		if (defined($self->{indexlevel}) || defined($skip_artnum)) {
-			require PublicInbox::SearchIdx;
-			require PublicInbox::Msgmap;
-			my $sidx = PublicInbox::SearchIdx->new($self, 1); # just create
-			$sidx->begin_txn_lazy;
-			$self->with_umask(sub {
-				my $mm = PublicInbox::Msgmap->new($dir, 1);
-				$mm->{dbh}->begin_work;
-				$mm->skip_artnum($skip_artnum);
-				$mm->{dbh}->commit;
-			}) if defined($skip_artnum);
-			$sidx->commit_txn_lazy;
-		} else {
-			open my $fh, '>>', "$dir/ssoma.lock" or
-				die "$dir/ssoma.lock: $!\n";
-		}
+		$self->umask_prepare;
+		$self->with_umask(\&_init_v1, $self, $skip_artnum);
 	} else {
 		my $v2w = importer($self);
 		$v2w->init_inbox($shards, $skip_epoch, $skip_artnum);
@@ -255,9 +261,9 @@ sub _umask_for {
 }
 
 sub with_umask {
-	my ($self, $cb) = @_;
+	my ($self, $cb, @arg) = @_;
 	my $old = umask $self->{umask};
-	my $rv = eval { $cb->() };
+	my $rv = eval { $cb->(@arg) };
 	my $err = $@;
 	umask $old;
 	die $err if $err;
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 4caa66d3..c93c9034 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -585,7 +585,7 @@ sub unindex_both { # git->cat_async callback
 sub index_sync {
 	my ($self, $opts) = @_;
 	delete $self->{lock_path} if $opts->{-skip_lock};
-	$self->{-inbox}->with_umask(sub { $self->_index_sync($opts) })
+	$self->{-inbox}->with_umask(\&_index_sync, $self, $opts);
 }
 
 sub too_big ($$$) {
@@ -854,17 +854,18 @@ sub remote_remove {
 	}
 }
 
-sub begin_txn_lazy {
+sub _begin_txn {
 	my ($self) = @_;
-	return if $self->{txn};
+	my $xdb = $self->{xdb} || $self->_xdb_acquire;
+	$self->{over}->begin_lazy if $self->{over};
+	$xdb->begin_transaction if $xdb;
+	$self->{txn} = 1;
+	$xdb;
+}
 
-	$self->{-inbox}->with_umask(sub {
-		my $xdb = $self->{xdb} || $self->_xdb_acquire;
-		$self->{over}->begin_lazy if $self->{over};
-		$xdb->begin_transaction if $xdb;
-		$self->{txn} = 1;
-		$xdb;
-	});
+sub begin_txn_lazy {
+	my ($self) = @_;
+	$self->{-inbox}->with_umask(\&_begin_txn, $self) if !$self->{txn};
 }
 
 # store 'indexlevel=medium' in v2 shard=0 and v1 (only one shard)
@@ -882,16 +883,19 @@ sub set_indexlevel {
 	}
 }
 
+sub _commit_txn {
+	my ($self) = @_;
+	if (my $xdb = $self->{xdb}) {
+		set_indexlevel($self);
+		$xdb->commit_transaction;
+	}
+	$self->{over}->commit_lazy if $self->{over};
+}
+
 sub commit_txn_lazy {
 	my ($self) = @_;
-	delete $self->{txn} or return;
-	$self->{-inbox}->with_umask(sub {
-		if (my $xdb = $self->{xdb}) {
-			set_indexlevel($self);
-			$xdb->commit_transaction;
-		}
-		$self->{over}->commit_lazy if $self->{over};
-	});
+	delete($self->{txn}) and
+		$self->{-inbox}->with_umask(\&_commit_txn, $self);
 }
 
 sub worker_done {
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 0119ea76..b51c8525 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -147,10 +147,8 @@ sub init_inbox {
 # returns undef on duplicate or spam
 # mimics Import::add and wraps it for v2
 sub add {
-	my ($self, $mime, $check_cb) = @_;
-	$self->{-inbox}->with_umask(sub {
-		_add($self, $mime, $check_cb)
-	});
+	my ($self, $eml, $check_cb) = @_;
+	$self->{-inbox}->with_umask(\&_add, $self, $eml, $check_cb);
 }
 
 # indexes a message, returns true if checkpointing is needed
@@ -276,6 +274,28 @@ sub idx_shard {
 	$self->{idx_shards}->[$shard_i];
 }
 
+sub _idx_init { # with_umask callback
+	my ($self, $opt) = @_;
+	$self->lock_acquire unless $opt && $opt->{-skip_lock};
+	$self->{over}->create;
+
+	# xcpdb can change shard count while -watch is idle
+	my $nshards = count_shards($self);
+	$self->{shards} = $nshards if $nshards && $nshards != $self->{shards};
+
+	# need to create all shards before initializing msgmap FD
+	# idx_shards must be visible to all forked processes
+	my $max = $self->{shards} - 1;
+	my $idx = $self->{idx_shards} = [];
+	push @$idx, PublicInbox::SearchIdxShard->new($self, $_) for (0..$max);
+
+	# Now that all subprocesses are up, we can open the FDs
+	# for SQLite:
+	my $mm = $self->{mm} = PublicInbox::Msgmap->new_file(
+		"$self->{-inbox}->{inboxdir}/msgmap.sqlite3", 1);
+	$mm->{dbh}->begin_work;
+}
+
 # idempotent
 sub idx_init {
 	my ($self, $opt) = @_;
@@ -285,13 +305,10 @@ sub idx_init {
 	# do not leak read-only FDs to child processes, we only have these
 	# FDs for duplicate detection so they should not be
 	# frequently activated.
+	# delete @$ibx{qw(git mm search)};
 	delete $ibx->{$_} foreach (qw(git mm search));
 
-	my $indexlevel = $ibx->{indexlevel};
-	if ($indexlevel && $indexlevel eq 'basic') {
-		$self->{parallel} = 0;
-	}
-
+	$self->{parallel} = 0 if ($ibx->{indexlevel}//'') eq 'basic';
 	if ($self->{parallel}) {
 		pipe(my ($r, $w)) or die "pipe failed: $!";
 		# pipe for barrier notifications doesn't need to be big,
@@ -301,33 +318,8 @@ sub idx_init {
 		$w->autoflush(1);
 	}
 
-	my $over = $self->{over};
 	$ibx->umask_prepare;
-	$ibx->with_umask(sub {
-		$self->lock_acquire unless ($opt && $opt->{-skip_lock});
-		$over->create;
-
-		# xcpdb can change shard count while -watch is idle
-		my $nshards = count_shards($self);
-		if ($nshards && $nshards != $self->{shards}) {
-			$self->{shards} = $nshards;
-		}
-
-		# need to create all shards before initializing msgmap FD
-		my $max = $self->{shards} - 1;
-
-		# idx_shards must be visible to all forked processes
-		my $idx = $self->{idx_shards} = [];
-		for my $i (0..$max) {
-			push @$idx, PublicInbox::SearchIdxShard->new($self, $i);
-		}
-
-		# Now that all subprocesses are up, we can open the FDs
-		# for SQLite:
-		my $mm = $self->{mm} = PublicInbox::Msgmap->new_file(
-			"$self->{-inbox}->{inboxdir}/msgmap.sqlite3", 1);
-		$mm->{dbh}->begin_work;
-	});
+	$ibx->with_umask(\&_idx_init, $self, $opt);
 }
 
 # returns an array mapping [ epoch => latest_commit ]
@@ -379,24 +371,24 @@ sub content_matches ($$) {
 
 # used for removing or replacing (purging)
 sub rewrite_internal ($$;$$$) {
-	my ($self, $old_mime, $cmt_msg, $new_mime, $sref) = @_;
+	my ($self, $old_eml, $cmt_msg, $new_eml, $sref) = @_;
 	$self->idx_init;
 	my ($im, $need_reindex, $replace_map);
 	if ($sref) {
 		$replace_map = {}; # oid => sref
-		$need_reindex = [] if $new_mime;
+		$need_reindex = [] if $new_eml;
 	} else {
 		$im = $self->importer;
 	}
 	my $over = $self->{over};
-	my $chashes = content_hashes($old_mime);
-	my @removed;
-	my $mids = mids($old_mime->header_obj);
+	my $chashes = content_hashes($old_eml);
+	my $removed = [];
+	my $mids = mids($old_eml->header_obj);
 
 	# We avoid introducing new blobs into git since the raw content
 	# can be slightly different, so we do not need the user-supplied
 	# message now that we have the mids and content_hash
-	$old_mime = undef;
+	$old_eml = undef;
 	my $mark;
 
 	foreach my $mid (@$mids) {
@@ -422,15 +414,15 @@ sub rewrite_internal ($$;$$$) {
 		}
 		foreach my $num (keys %gone) {
 			my ($smsg, $mime, $orig) = @{$gone{$num}};
-			# @removed should only be set once assuming
+			# $removed should only be set once assuming
 			# no bugs in our deduplication code:
-			@removed = (undef, $mime, $smsg);
+			$removed = [ undef, $mime, $smsg ];
 			my $oid = $smsg->{blob};
 			if ($replace_map) {
 				$replace_map->{$oid} = $sref;
 			} else {
 				($mark, undef) = $im->remove($orig, $cmt_msg);
-				$removed[0] = $mark;
+				$removed->[0] = $mark;
 			}
 			$orig = undef;
 			if ($need_reindex) { # ->replace
@@ -447,28 +439,26 @@ sub rewrite_internal ($$;$$$) {
 		$self->{last_commit}->[$self->{epoch_max}] = $cmt;
 	}
 	if ($replace_map && scalar keys %$replace_map) {
-		my $rewrites = _replace_oids($self, $new_mime, $replace_map);
+		my $rewrites = _replace_oids($self, $new_eml, $replace_map);
 		return { rewrites => $rewrites, need_reindex => $need_reindex };
 	}
-	defined($mark) ? @removed : undef;
+	defined($mark) ? $removed : undef;
 }
 
 # public (see PublicInbox::Import->remove), but note the 3rd element
 # (retval[2]) is not part of the stable API shared with Import->remove
 sub remove {
-	my ($self, $mime, $cmt_msg) = @_;
-	my @ret;
-	$self->{-inbox}->with_umask(sub {
-		@ret = rewrite_internal($self, $mime, $cmt_msg);
-	});
-	defined($ret[0]) ? @ret : undef;
+	my ($self, $eml, $cmt_msg) = @_;
+	my $r = $self->{-inbox}->with_umask(\&rewrite_internal,
+						$self, $eml, $cmt_msg);
+	defined($r) && defined($r->[0]) ? @$r: undef;
 }
 
 sub _replace ($$;$$) {
-	my ($self, $old_mime, $new_mime, $sref) = @_;
-	my $rewritten = $self->{-inbox}->with_umask(sub {
-		rewrite_internal($self, $old_mime, undef, $new_mime, $sref);
-	}) or return;
+	my ($self, $old_eml, $new_eml, $sref) = @_;
+	my $arg = [ $self, $old_eml, undef, $new_eml, $sref ];
+	my $rewritten = $self->{-inbox}->with_umask(\&rewrite_internal,
+			$self, $old_eml, undef, $new_eml, $sref) or return;
 
 	my $rewrites = $rewritten->{rewrites};
 	# ->done is called if there are rewrites since we gc+prune from git
diff --git a/lib/PublicInbox/Xapcmd.pm b/lib/PublicInbox/Xapcmd.pm
index a57fa559..c04f935c 100644
--- a/lib/PublicInbox/Xapcmd.pm
+++ b/lib/PublicInbox/Xapcmd.pm
@@ -229,6 +229,24 @@ sub prepare_run {
 
 sub check_compact () { runnable_or_die($XAPIAN_COMPACT) }
 
+sub _run {
+	my ($ibx, $cb, $opt, $reindex) = @_;
+	my $im = $ibx->importer(0);
+	$im->lock_acquire;
+	my ($tmp, $queue) = prepare_run($ibx, $opt);
+
+	# fine-grained locking if we prepare for reindex
+	if (!$opt->{-coarse_lock}) {
+		prepare_reindex($ibx, $im, $reindex);
+		$im->lock_release;
+	}
+
+	$ibx->cleanup;
+	process_queue($queue, $cb, $opt);
+	$im->lock_acquire if !$opt->{-coarse_lock};
+	commit_changes($ibx, $im, $tmp, $opt);
+}
+
 sub run {
 	my ($ibx, $task, $opt) = @_; # task = 'cpdb' or 'compact'
 	my $cb = \&${\"PublicInbox::Xapcmd::$task"};
@@ -248,22 +266,7 @@ sub run {
 	local %SIG = %SIG;
 	setup_signals();
 	$ibx->umask_prepare;
-	$ibx->with_umask(sub {
-		my $im = $ibx->importer(0);
-		$im->lock_acquire;
-		my ($tmp, $queue) = prepare_run($ibx, $opt);
-
-		# fine-grained locking if we prepare for reindex
-		if (!$opt->{-coarse_lock}) {
-			prepare_reindex($ibx, $im, $reindex);
-			$im->lock_release;
-		}
-
-		$ibx->cleanup;
-		process_queue($queue, $cb, $opt);
-		$im->lock_acquire if !$opt->{-coarse_lock};
-		commit_changes($ibx, $im, $tmp, $opt);
-	});
+	$ibx->with_umask(\&_run, $ibx, $cb, $opt, $reindex);
 }
 
 sub cpdb_retryable ($$) {

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 5/8] overidx: each_by_mid: pass self and args to callbacks
  2020-07-17  6:31 [PATCH 0/8] indexing cleanup and code reduction Eric Wong
                   ` (3 preceding siblings ...)
  2020-07-17  6:31 ` [PATCH 4/8] with_umask: pass args to callback Eric Wong
@ 2020-07-17  6:31 ` Eric Wong
  2020-07-17  6:31 ` [PATCH 6/8] overidx: favor non-OO sub dispatch for internal subs Eric Wong
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Eric Wong @ 2020-07-17  6:31 UTC (permalink / raw)
  To: meta

This saves runtime allocations and reduces the likelyhood of
memory leaks either from cycles or buggy old Perl versions.
---
 lib/PublicInbox/OverIdx.pm | 99 +++++++++++++++++++++-----------------
 1 file changed, 54 insertions(+), 45 deletions(-)

diff --git a/lib/PublicInbox/OverIdx.pm b/lib/PublicInbox/OverIdx.pm
index ea8da723..52f6328e 100644
--- a/lib/PublicInbox/OverIdx.pm
+++ b/lib/PublicInbox/OverIdx.pm
@@ -107,7 +107,7 @@ DELETE FROM $_ WHERE num = ?
 
 # this includes ghosts
 sub each_by_mid {
-	my ($self, $mid, $cols, $cb) = @_;
+	my ($self, $mid, $cols, $cb, @arg) = @_;
 	my $dbh = $self->{dbh};
 
 =over
@@ -152,27 +152,29 @@ SELECT $cols FROM over WHERE over.num = ? LIMIT 1
 		foreach (@$nums) {
 			$sth->execute($_->[0]);
 			my $smsg = $sth->fetchrow_hashref;
-			$cb->(PublicInbox::Over::load_from_row($smsg)) or
-				return;
+			$smsg = PublicInbox::Over::load_from_row($smsg);
+			$cb->($self, $smsg, @arg) or return;
 		}
 		return if $nr != $lim;
 	}
 }
 
+sub _resolve_mid_to_tid {
+	my ($self, $smsg, $tid) = @_;
+	my $cur_tid = $smsg->{tid};
+	if (defined $$tid) {
+		merge_threads($self, $$tid, $cur_tid);
+	} else {
+		$$tid = $cur_tid;
+	}
+	1;
+}
+
 # this will create a ghost as necessary
 sub resolve_mid_to_tid {
 	my ($self, $mid) = @_;
 	my $tid;
-	each_by_mid($self, $mid, ['tid'], sub {
-		my ($smsg) = @_;
-		my $cur_tid = $smsg->{tid};
-		if (defined $tid) {
-			merge_threads($self, $tid, $cur_tid);
-		} else {
-			$tid = $cur_tid;
-		}
-		1;
-	});
+	each_by_mid($self, $mid, ['tid'], \&_resolve_mid_to_tid, \$tid);
 	defined $tid ? $tid : create_ghost($self, $mid);
 }
 
@@ -271,6 +273,22 @@ sub add_overview {
 	add_over($self, [ @$smsg{qw(ts ds num)}, $mids, $refs, $xpath, $dd ]);
 }
 
+sub _add_over {
+	my ($self, $smsg, $mid, $refs, $old_tid, $v) = @_;
+	my $cur_tid = $smsg->{tid};
+	my $n = $smsg->{num};
+	die "num must not be zero for $mid" if !$n;
+	$$old_tid = $cur_tid unless defined $$old_tid;
+	if ($n > 0) { # regular mail
+		merge_threads($self, $$old_tid, $cur_tid);
+	} elsif ($n < 0) { # ghost
+		link_refs($self, $refs, $$old_tid);
+		$self->delete_by_num($n);
+		$$v++;
+	}
+	1;
+}
+
 sub add_over {
 	my ($self, $values) = @_;
 	my ($ts, $ds, $num, $mids, $refs, $xpath, $ddd) = @$values;
@@ -281,21 +299,8 @@ sub add_over {
 	$self->delete_by_num($num, \$old_tid);
 	foreach my $mid (@$mids) {
 		my $v = 0;
-		each_by_mid($self, $mid, ['tid'], sub {
-			my ($cur) = @_;
-			my $cur_tid = $cur->{tid};
-			my $n = $cur->{num};
-			die "num must not be zero for $mid" if !$n;
-			$old_tid = $cur_tid unless defined $old_tid;
-			if ($n > 0) { # regular mail
-				merge_threads($self, $old_tid, $cur_tid);
-			} elsif ($n < 0) { # ghost
-				link_refs($self, $refs, $old_tid);
-				$self->delete_by_num($n);
-				$v++;
-			}
-			1;
-		});
+		each_by_mid($self, $mid, ['tid'], \&_add_over,
+				$mid, $refs, \$old_tid, \$v);
 		$v > 1 and warn "BUG: vivified multiple ($v) ghosts for $mid\n";
 		$vivified += $v;
 	}
@@ -320,35 +325,39 @@ INSERT INTO id2num (id, num) VALUES (?,?)
 	}
 }
 
+sub _remove_oid {
+	my ($self, $smsg, $oid, $nr) = @_;
+	if (!defined($oid) || $smsg->{blob} eq $oid) {
+		$self->delete_by_num($smsg->{num});
+		$$nr++;
+	}
+	1;
+}
+
 # returns number of removed messages
 # $oid may be undef to match only on $mid
 sub remove_oid {
 	my ($self, $oid, $mid) = @_;
 	my $nr = 0;
 	$self->begin_lazy;
-	each_by_mid($self, $mid, ['ddd'], sub {
-		my ($smsg) = @_;
-		if (!defined($oid) || $smsg->{blob} eq $oid) {
-			$self->delete_by_num($smsg->{num});
-			$nr++;
-		}
-		1;
-	});
+	each_by_mid($self, $mid, ['ddd'], \&_remove_oid, $oid, \$nr);
 	$nr;
 }
 
+sub _num_mid0_for_oid {
+	my ($self, $smsg, $oid, $res) = @_;
+	my $blob = $smsg->{blob};
+	return 1 if (!defined($blob) || $blob ne $oid); # continue;
+	@$res = ($smsg->{num}, $smsg->{mid});
+	0; # done
+}
+
 sub num_mid0_for_oid {
 	my ($self, $oid, $mid) = @_;
-	my ($num, $mid0);
+	my $res = [];
 	$self->begin_lazy;
-	each_by_mid($self, $mid, ['ddd'], sub {
-		my ($smsg) = @_;
-		my $blob = $smsg->{blob};
-		return 1 if (!defined($blob) || $blob ne $oid); # continue;
-		($num, $mid0) = ($smsg->{num}, $smsg->{mid});
-		0; # done
-	});
-	($num, $mid0);
+	each_by_mid($self, $mid, ['ddd'], \&_num_mid0_for_oid, $oid, $res);
+	@$res, # ($num, $mid0);
 }
 
 sub create_tables {

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 6/8] overidx: favor non-OO sub dispatch for internal subs
  2020-07-17  6:31 [PATCH 0/8] indexing cleanup and code reduction Eric Wong
                   ` (4 preceding siblings ...)
  2020-07-17  6:31 ` [PATCH 5/8] overidx: each_by_mid: pass self and args to callbacks Eric Wong
@ 2020-07-17  6:31 ` Eric Wong
  2020-07-17  6:31 ` [PATCH 7/8] searchidx: use v5.10.1, parent.pm, drop warnings Eric Wong
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 10+ messages in thread
From: Eric Wong @ 2020-07-17  6:31 UTC (permalink / raw)
  To: meta

OO method dispatch was 10-15% slower when I was implementing the
NNTP server.  It also serves as a helpful reminder to the reader
at the callsite as to whether a sub is likely in the same
package as the caller or not.
---
 lib/PublicInbox/OverIdx.pm | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/lib/PublicInbox/OverIdx.pm b/lib/PublicInbox/OverIdx.pm
index 52f6328e..29c6e0b9 100644
--- a/lib/PublicInbox/OverIdx.pm
+++ b/lib/PublicInbox/OverIdx.pm
@@ -180,10 +180,10 @@ sub resolve_mid_to_tid {
 
 sub create_ghost {
 	my ($self, $mid) = @_;
-	my $id = $self->mid2id($mid);
-	my $num = $self->next_ghost_num;
+	my $id = mid2id($self, $mid);
+	my $num = next_ghost_num($self);
 	$num < 0 or die "ghost num is non-negative: $num\n";
-	my $tid = $self->next_tid;
+	my $tid = next_tid($self);
 	my $dbh = $self->{dbh};
 	$dbh->prepare_cached(<<'')->execute($num, $tid);
 INSERT INTO over (num, tid) VALUES (?,?)
@@ -221,7 +221,7 @@ sub link_refs {
 			merge_threads($self, $tid, $ptid);
 		}
 	} else {
-		$tid = defined $old_tid ? $old_tid : $self->next_tid;
+		$tid = defined $old_tid ? $old_tid : next_tid($self);
 	}
 	$tid;
 }
@@ -283,7 +283,7 @@ sub _add_over {
 		merge_threads($self, $$old_tid, $cur_tid);
 	} elsif ($n < 0) { # ghost
 		link_refs($self, $refs, $$old_tid);
-		$self->delete_by_num($n);
+		delete_by_num($self, $n);
 		$$v++;
 	}
 	1;
@@ -295,8 +295,8 @@ sub add_over {
 	my $old_tid;
 	my $vivified = 0;
 
-	$self->begin_lazy;
-	$self->delete_by_num($num, \$old_tid);
+	begin_lazy($self);
+	delete_by_num($self, $num, \$old_tid);
 	foreach my $mid (@$mids) {
 		my $v = 0;
 		each_by_mid($self, $mid, ['tid'], \&_add_over,
@@ -305,7 +305,7 @@ sub add_over {
 		$vivified += $v;
 	}
 	my $tid = $vivified ? $old_tid : link_refs($self, $refs, $old_tid);
-	my $sid = $self->sid($xpath);
+	my $sid = sid($self, $xpath);
 	my $dbh = $self->{dbh};
 	my $sth = $dbh->prepare_cached(<<'');
 INSERT INTO over (num, tid, sid, ts, ds, ddd)
@@ -320,7 +320,7 @@ VALUES (?,?,?,?,?,?)
 INSERT INTO id2num (id, num) VALUES (?,?)
 
 	foreach my $mid (@$mids) {
-		my $id = $self->mid2id($mid);
+		my $id = mid2id($self, $mid);
 		$sth->execute($id, $num);
 	}
 }
@@ -328,7 +328,7 @@ INSERT INTO id2num (id, num) VALUES (?,?)
 sub _remove_oid {
 	my ($self, $smsg, $oid, $nr) = @_;
 	if (!defined($oid) || $smsg->{blob} eq $oid) {
-		$self->delete_by_num($smsg->{num});
+		delete_by_num($self, $smsg->{num});
 		$$nr++;
 	}
 	1;
@@ -339,7 +339,7 @@ sub _remove_oid {
 sub remove_oid {
 	my ($self, $oid, $mid) = @_;
 	my $nr = 0;
-	$self->begin_lazy;
+	begin_lazy($self);
 	each_by_mid($self, $mid, ['ddd'], \&_remove_oid, $oid, \$nr);
 	$nr;
 }
@@ -355,7 +355,7 @@ sub _num_mid0_for_oid {
 sub num_mid0_for_oid {
 	my ($self, $oid, $mid) = @_;
 	my $res = [];
-	$self->begin_lazy;
+	begin_lazy($self);
 	each_by_mid($self, $mid, ['ddd'], \&_num_mid0_for_oid, $oid, $res);
 	@$res, # ($num, $mid0);
 }

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 7/8] searchidx: use v5.10.1, parent.pm, drop warnings
  2020-07-17  6:31 [PATCH 0/8] indexing cleanup and code reduction Eric Wong
                   ` (5 preceding siblings ...)
  2020-07-17  6:31 ` [PATCH 6/8] overidx: favor non-OO sub dispatch for internal subs Eric Wong
@ 2020-07-17  6:31 ` Eric Wong
  2020-07-17  6:31 ` [PATCH 8/8] search: simplify unindexing Eric Wong
  2020-07-17  7:25 ` [9/8 PATCH] v2writable: git_hash_raw: avoid $TMPDIR write Eric Wong
  8 siblings, 0 replies; 10+ messages in thread
From: Eric Wong @ 2020-07-17  6:31 UTC (permalink / raw)
  To: meta

Prefer "parent" to "base" since the former is lighter and part
of Perl 5.10+.  We'll also rely on warnings from "-w" globally
(or not) instead of via "use".
---
 lib/PublicInbox/SearchIdx.pm | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index c93c9034..9550847b 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -8,8 +8,8 @@
 # This writes to the search index.
 package PublicInbox::SearchIdx;
 use strict;
-use warnings;
-use base qw(PublicInbox::Search PublicInbox::Lock);
+use v5.10.1;
+use parent qw(PublicInbox::Search PublicInbox::Lock);
 use PublicInbox::Eml;
 use PublicInbox::InboxWritable;
 use PublicInbox::MID qw/mid_clean mid_mime mids_for_index/;

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 8/8] search: simplify unindexing
  2020-07-17  6:31 [PATCH 0/8] indexing cleanup and code reduction Eric Wong
                   ` (6 preceding siblings ...)
  2020-07-17  6:31 ` [PATCH 7/8] searchidx: use v5.10.1, parent.pm, drop warnings Eric Wong
@ 2020-07-17  6:31 ` Eric Wong
  2020-07-17  7:25 ` [9/8 PATCH] v2writable: git_hash_raw: avoid $TMPDIR write Eric Wong
  8 siblings, 0 replies; 10+ messages in thread
From: Eric Wong @ 2020-07-17  6:31 UTC (permalink / raw)
  To: meta

Since over.sqlite3 seems here to stay, we no longer need to do
Message-ID lookups against Xapian and can simply rely on the
docid <=> NNTP article number equivalancy SCHEMA_VERSION=15
gave us.

This rids us of the closure-using batch_do sub in the v1
code path and vastly simplifies both v1 and v2 unindexing.
---
 lib/PublicInbox/OverIdx.pm        |  13 +--
 lib/PublicInbox/SearchIdx.pm      | 134 +++++++++++-------------------
 lib/PublicInbox/SearchIdxShard.pm |   6 +-
 lib/PublicInbox/V2Writable.pm     |   7 +-
 t/search.t                        |   6 +-
 5 files changed, 66 insertions(+), 100 deletions(-)

diff --git a/lib/PublicInbox/OverIdx.pm b/lib/PublicInbox/OverIdx.pm
index 29c6e0b9..5601e602 100644
--- a/lib/PublicInbox/OverIdx.pm
+++ b/lib/PublicInbox/OverIdx.pm
@@ -326,22 +326,23 @@ INSERT INTO id2num (id, num) VALUES (?,?)
 }
 
 sub _remove_oid {
-	my ($self, $smsg, $oid, $nr) = @_;
+	my ($self, $smsg, $oid, $removed) = @_;
 	if (!defined($oid) || $smsg->{blob} eq $oid) {
 		delete_by_num($self, $smsg->{num});
-		$$nr++;
+		push @$removed, $smsg->{num};
 	}
 	1;
 }
 
-# returns number of removed messages
+# returns number of removed messages in scalar context,
+# array of removed article numbers in array context.
 # $oid may be undef to match only on $mid
 sub remove_oid {
 	my ($self, $oid, $mid) = @_;
-	my $nr = 0;
+	my $removed = [];
 	begin_lazy($self);
-	each_by_mid($self, $mid, ['ddd'], \&_remove_oid, $oid, \$nr);
-	$nr;
+	each_by_mid($self, $mid, ['ddd'], \&_remove_oid, $oid, $removed);
+	@$removed;
 }
 
 sub _num_mid0_for_oid {
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 9550847b..83162509 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -12,7 +12,7 @@ use v5.10.1;
 use parent qw(PublicInbox::Search PublicInbox::Lock);
 use PublicInbox::Eml;
 use PublicInbox::InboxWritable;
-use PublicInbox::MID qw/mid_clean mid_mime mids_for_index/;
+use PublicInbox::MID qw(mid_mime mids_for_index mids);
 use PublicInbox::MsgIter;
 use Carp qw(croak);
 use POSIX qw(strftime);
@@ -413,88 +413,32 @@ sub add_message {
 	$smsg->{num};
 }
 
-# returns begin and end PostingIterator
-sub find_doc_ids {
-	my ($self, $termval) = @_;
-	my $db = $self->{xdb};
-
-	($db->postlist_begin($termval), $db->postlist_end($termval));
-}
-
-# v1 only
-sub batch_do {
-	my ($self, $termval, $cb) = @_;
-	my $batch_size = 1000; # don't let @ids grow too large to avoid OOM
-	while (1) {
-		my ($head, $tail) = $self->find_doc_ids($termval);
-		return if $head == $tail;
-		my @ids;
-		for (; $head != $tail && @ids < $batch_size; $head++) {
-			push @ids, $head->get_docid;
+sub xdb_remove {
+	my ($self, $oid, @removed) = @_;
+	my $xdb = $self->{xdb} or return;
+	for my $num (@removed) {
+		my $doc = eval { $xdb->get_document($num) };
+		unless ($doc) {
+			warn "E: $@\n" if $@;
+			warn "E: #$num $oid missing in Xapian\n";
+			next;
 		}
-		$cb->(\@ids);
-	}
-}
-
-# v1 only, where $mid is unique
-sub remove_message {
-	my ($self, $mid) = @_;
-	$mid = mid_clean($mid);
-
-	if (my $over = $self->{over}) {
-		my $nr = eval { $over->remove_oid(undef, $mid) };
-		if ($@) {
-			warn "failed to remove <$mid> from overview: $@\n";
-		} elsif ($nr == 0) {
-			warn "<$mid> missing for removal from overview\n";
+		my $smsg = bless {}, 'PublicInbox::Smsg';
+		$smsg->load_expand($doc);
+		my $blob = $smsg->{blob} // '(unset)';
+		if ($blob eq $oid) {
+			$xdb->delete_document($num);
+		} else {
+			warn "E: #$num $oid != $blob in Xapian\n";
 		}
 	}
-	return unless need_xapian($self);
-	my $db = $self->{xdb};
-	my $nr = 0;
-	eval {
-		batch_do($self, 'Q' . $mid, sub {
-			my ($ids) = @_;
-			$db->delete_document($_) for @$ids;
-			$nr += scalar @$ids;
-		});
-	};
-	if ($@) {
-		warn "failed to remove <$mid> from Xapian: $@\n";
-	} elsif ($nr == 0) {
-		warn "<$mid> missing for removal from Xapian\n";
-	}
 }
 
-# MID is a hint in V2
 sub remove_by_oid {
-	my ($self, $oid, $mid) = @_;
-
-	$self->{over}->remove_oid($oid, $mid) if $self->{over};
-
-	return unless need_xapian($self);
-	my $db = $self->{xdb};
-
-	# XXX careful, we cannot use batch_do here since we conditionally
-	# delete documents based on other factors, so we cannot call
-	# find_doc_ids twice.
-	my ($head, $tail) = $self->find_doc_ids('Q' . $mid);
-	return if $head == $tail;
-
-	# there is only ONE element in @delete unless we
-	# have bugs in our v2writable deduplication check
-	my @delete;
-	for (; $head != $tail; $head++) {
-		my $docid = $head->get_docid;
-		my $doc = $db->get_document($docid);
-		my $smsg = bless { mid => $mid }, 'PublicInbox::Smsg';
-		$smsg->load_expand($doc);
-		if ($smsg->{blob} eq $oid) {
-			push(@delete, $docid);
-		}
-	}
-	$db->delete_document($_) foreach @delete;
-	scalar(@delete);
+	my ($self, $oid, $num) = @_;
+	die "BUG: remove_by_oid is v2-only\n" if $self->{over};
+	$self->begin_txn_lazy;
+	xdb_remove($self, $oid, $num) if need_xapian($self);
 }
 
 sub index_git_blob_id {
@@ -507,10 +451,29 @@ sub index_git_blob_id {
 	}
 }
 
-sub unindex_blob {
-	my ($self, $mime) = @_;
-	my $mid = eval { mid_mime($mime) };
-	$self->remove_message($mid) if defined $mid;
+# v1 only
+sub unindex_eml {
+	my ($self, $oid, $eml) = @_;
+	my $mids = mids($eml);
+	my $nr = 0;
+	my %tmp;
+	for my $mid (@$mids) {
+		my @removed = eval { $self->{over}->remove_oid($oid, $mid) };
+		if ($@) {
+			warn "E: failed to remove <$mid> from overview: $@\n";
+		} else {
+			$nr += scalar @removed;
+			$tmp{$_}++ for @removed;
+		}
+	}
+	if (!$nr) {
+		$mids = join('> <', @$mids);
+		warn "W: <$mids> missing for removal from overview\n";
+	}
+	while (my ($num, $nr) = each %tmp) {
+		warn "BUG: $num appears >1 times ($nr) for $oid\n" if $nr != 1;
+	}
+	xdb_remove($self, $oid, keys %tmp) if need_xapian($self);
 }
 
 sub index_mm {
@@ -577,7 +540,7 @@ sub index_both { # git->cat_async callback
 sub unindex_both { # git->cat_async callback
 	my ($bref, $oid, $type, $size, $self) = @_;
 	my $eml = PublicInbox::Eml->new($bref);
-	unindex_blob($self, $eml);
+	unindex_eml($self, $oid, $eml);
 	unindex_mm($self, $eml);
 }
 
@@ -844,13 +807,12 @@ sub remote_close {
 }
 
 sub remote_remove {
-	my ($self, $oid, $mid) = @_;
+	my ($self, $oid, $num) = @_;
 	if (my $w = $self->{w}) {
 		# triggers remove_by_oid in a shard
-		print $w "D $oid $mid\n" or die "failed to write remove $!";
+		print $w "D $oid $num\n" or die "failed to write remove $!";
 	} else {
-		$self->begin_txn_lazy;
-		$self->remove_by_oid($oid, $mid);
+		$self->remove_by_oid($oid, $num);
 	}
 }
 
diff --git a/lib/PublicInbox/SearchIdxShard.pm b/lib/PublicInbox/SearchIdxShard.pm
index b51d148b..54426881 100644
--- a/lib/PublicInbox/SearchIdxShard.pm
+++ b/lib/PublicInbox/SearchIdxShard.pm
@@ -62,10 +62,8 @@ sub shard_worker_loop ($$$$$) {
 			# no need to lock < 512 bytes is atomic under POSIX
 			print $bnote "barrier $shard\n" or
 					die "write failed for barrier $!\n";
-		} elsif ($line =~ /\AD ([a-f0-9]{40,}) (.+)\n\z/s) {
-			my ($oid, $mid) = ($1, $2);
-			$self->begin_txn_lazy;
-			$self->remove_by_oid($oid, $mid);
+		} elsif ($line =~ /\AD ([a-f0-9]{40,}) ([0-9]+)\n\z/s) {
+			$self->remove_by_oid($1, $2 + 0);
 		} else {
 			chomp $line;
 			# n.b. $mid may contain spaces(!)
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index b51c8525..dffe90d8 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -1185,8 +1185,11 @@ sub sync_prepare ($$$) {
 
 sub unindex_oid_remote ($$$) {
 	my ($self, $oid, $mid) = @_;
-	$_->remote_remove($oid, $mid) foreach @{$self->{idx_shards}};
-	$self->{over}->remove_oid($oid, $mid);
+	my @removed = $self->{over}->remove_oid($oid, $mid);
+	for my $num (@removed) {
+		my $idx = idx_shard($self, $num % $self->{shards});
+		$idx->remote_remove($oid, $num);
+	}
 }
 
 sub unindex_oid ($$$;$) {
diff --git a/t/search.t b/t/search.t
index 82caf9e4..aa6f94bf 100644
--- a/t/search.t
+++ b/t/search.t
@@ -397,7 +397,9 @@ $ibx->with_umask(sub {
 
 $ibx->with_umask(sub {
 	my $amsg = eml_load 't/search-amsg.eml';
-	ok($rw->add_message($amsg), 'added attachment');
+	my $oid = ('0'x40);
+	my $smsg = bless { blob => $oid }, 'PublicInbox::Smsg';
+	ok($rw->add_message($amsg, $smsg), 'added attachment');
 	$rw_commit->();
 	$ro->reopen;
 	my $n = $ro->query('n:attached_fart.txt');
@@ -418,7 +420,7 @@ $ibx->with_umask(sub {
 		$art = $ro->{over_ro}->next_by_mid($mid, \$id, \$prev);
 		ok($art, 'article exists in OVER DB');
 	}
-	$rw->unindex_blob($amsg);
+	$rw->unindex_eml($oid, $amsg);
 	$rw->commit_txn_lazy;
 	SKIP: {
 		skip('$art not defined', 1) unless defined $art;

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [9/8 PATCH] v2writable: git_hash_raw: avoid $TMPDIR write
  2020-07-17  6:31 [PATCH 0/8] indexing cleanup and code reduction Eric Wong
                   ` (7 preceding siblings ...)
  2020-07-17  6:31 ` [PATCH 8/8] search: simplify unindexing Eric Wong
@ 2020-07-17  7:25 ` Eric Wong
  8 siblings, 0 replies; 10+ messages in thread
From: Eric Wong @ 2020-07-17  7:25 UTC (permalink / raw)
  To: meta

We can rely on FD_CLOEXEC being set by default (since Perl 5.6+)
on pipes to avoid FS/page-cache traffic, here.  We also know
"git hash-object" won't output anything until it's consumed all
of its standard input; so there's no danger of a deadlock even
in the the unlikely case git uses a hash that can't fit into
PIPE_BUF :P
---
 lib/PublicInbox/V2Writable.pm | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index dffe90d8c..0582dd5e3 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -482,14 +482,12 @@ sub purge {
 sub git_hash_raw ($$) {
 	my ($self, $raw) = @_;
 	# grab the expected OID we have to reindex:
-	open my $tmp_fh, '+>', undef or die "failed to open tmp: $!";
-	$tmp_fh->autoflush(1);
-	print $tmp_fh $$raw or die "print \$tmp_fh: $!";
-	sysseek($tmp_fh, 0, 0) or die "seek failed: $!";
-
+	pipe(my($in, $w)) or die "pipe: $!";
 	my $git_dir = $self->{-inbox}->git->{git_dir};
 	my $cmd = ['git', "--git-dir=$git_dir", qw(hash-object --stdin)];
-	my $r = popen_rd($cmd, undef, { 0 => $tmp_fh });
+	my $r = popen_rd($cmd, undef, { 0 => $in });
+	print $w $$raw or die "print \$w: $!";
+	close $w or die "close \$w: $!";
 	local $/ = "\n";
 	chomp(my $oid = <$r>);
 	close $r or die "git hash-object failed: $?";

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, back to index

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-17  6:31 [PATCH 0/8] indexing cleanup and code reduction Eric Wong
2020-07-17  6:31 ` [PATCH 1/8] v2: use v5.10.1, parent.pm, drop warnings Eric Wong
2020-07-17  6:31 ` [PATCH 2/8] drop binmode usage Eric Wong
2020-07-17  6:31 ` [PATCH 3/8] import: use common capitalization for filtering headers Eric Wong
2020-07-17  6:31 ` [PATCH 4/8] with_umask: pass args to callback Eric Wong
2020-07-17  6:31 ` [PATCH 5/8] overidx: each_by_mid: pass self and args to callbacks Eric Wong
2020-07-17  6:31 ` [PATCH 6/8] overidx: favor non-OO sub dispatch for internal subs Eric Wong
2020-07-17  6:31 ` [PATCH 7/8] searchidx: use v5.10.1, parent.pm, drop warnings Eric Wong
2020-07-17  6:31 ` [PATCH 8/8] search: simplify unindexing Eric Wong
2020-07-17  7:25 ` [9/8 PATCH] v2writable: git_hash_raw: avoid $TMPDIR write Eric Wong

user/dev discussion of public-inbox itself

Archives are clonable:
	git clone --mirror http://public-inbox.org/meta
	git clone --mirror http://czquwvybam4bgbro.onion/meta
	git clone --mirror http://hjrcffqmbrq6wope.onion/meta
	git clone --mirror http://ou63pmih66umazou.onion/meta

Example config snippet for mirrors

Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.mail.public-inbox.meta
	nntp://ou63pmih66umazou.onion/inbox.comp.mail.public-inbox.meta
	nntp://czquwvybam4bgbro.onion/inbox.comp.mail.public-inbox.meta
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.mail.public-inbox.meta
	nntp://news.gmane.io/gmane.mail.public-inbox.general

 note: .onion URLs require Tor: https://www.torproject.org/

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git