user/dev discussion of public-inbox itself
* [PATCH 0/8] extindex and then some...
@ 2021-10-10 14:25 Eric Wong
  2021-10-10 14:25 ` [PATCH 1/8] lei_to_mail: show --output on augment progress failure Eric Wong
                   ` (7 more replies)
  0 siblings, 8 replies; 9+ messages in thread
From: Eric Wong @ 2021-10-10 14:25 UTC (permalink / raw)
  To: meta

One notable fix for -extindex --gc, a couple of minor things
here and there.  Still need to speed up --reindex...

Eric Wong (8):
  lei_to_mail: show --output on augment progress failure
  admin: add '# ' prefix for progress messages
  set nodatacow on more SQLite files
  extindex: speed up Xapian cleanup in --gc
  extindex: minor cost reductions
  extindex: --gc doesn't touch ghost entries
  lei/store: keep ".err-XXXX" in stderr tmpfile
  extindex: sync each inbox before checking for missed messages

 lib/PublicInbox/Admin.pm        |  2 +-
 lib/PublicInbox/ExtSearchIdx.pm | 51 +++++++++++++++++++++------------
 lib/PublicInbox/LeiStore.pm     |  2 +-
 lib/PublicInbox/LeiToMail.pm    |  2 +-
 lib/PublicInbox/Over.pm         |  4 ++-
 lib/PublicInbox/SearchIdx.pm    |  3 ++
 lib/PublicInbox/SharedKV.pm     |  3 +-
 7 files changed, 43 insertions(+), 24 deletions(-)

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/8] lei_to_mail: show --output on augment progress failure
  2021-10-10 14:25 [PATCH 0/8] extindex and then some Eric Wong
@ 2021-10-10 14:25 ` Eric Wong
  2021-10-10 14:25 ` [PATCH 2/8] admin: add '# ' prefix for progress messages Eric Wong
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Eric Wong @ 2021-10-10 14:25 UTC (permalink / raw)
  To: meta

Just in case it fails when there are many parallel invocations.
---
 lib/PublicInbox/LeiToMail.pm | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/PublicInbox/LeiToMail.pm b/lib/PublicInbox/LeiToMail.pm
index d42759cf..5a220ba3 100644
--- a/lib/PublicInbox/LeiToMail.pm
+++ b/lib/PublicInbox/LeiToMail.pm
@@ -796,7 +796,7 @@ sub augment_inprogress {
 				"scanning old contents of $dst for dedupe" :
 				"removing old contents of $dst")." ...\n";
 	};
-	warn "E: $@" if $@;
+	warn "E: $@ ($dst)" if $@;
 }
 
 # called in top-level lei-daemon when LeiAuth is done


* [PATCH 2/8] admin: add '# ' prefix for progress messages
  2021-10-10 14:25 [PATCH 0/8] extindex and then some Eric Wong
  2021-10-10 14:25 ` [PATCH 1/8] lei_to_mail: show --output on augment progress failure Eric Wong
@ 2021-10-10 14:25 ` Eric Wong
  2021-10-10 14:25 ` [PATCH 3/8] set nodatacow on more SQLite files Eric Wong
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Eric Wong @ 2021-10-10 14:25 UTC (permalink / raw)
  To: meta

It's more consistent with TAP output and hopefully puts
users at ease in case they don't understand the meaning
of a message.
---
 lib/PublicInbox/Admin.pm | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/PublicInbox/Admin.pm b/lib/PublicInbox/Admin.pm
index a17a632c..11ea8f83 100644
--- a/lib/PublicInbox/Admin.pm
+++ b/lib/PublicInbox/Admin.pm
@@ -320,7 +320,7 @@ sub progress_prepare ($;$) {
 	} else {
 		$opt->{verbose} ||= 1;
 		$dst //= *STDERR{GLOB};
-		$opt->{-progress} = sub { print $dst @_ };
+		$opt->{-progress} = sub { print $dst '# ', @_ };
 	}
 }
 

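The effect of the '# ' prefix in patch 2/8 can be sketched as follows. This is only an illustration: `make_progress` is a hypothetical stand-in for the closure that progress_prepare() installs, and the output is captured in an in-memory filehandle rather than written to STDERR.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the closure installed by progress_prepare():
# every progress message is prefixed with '# ', like a TAP diagnostic.
sub make_progress {
	my ($dst) = @_;
	return sub { print $dst '# ', @_ };
}

# Capture output in an in-memory filehandle instead of STDERR.
my $buf = '';
open(my $fh, '>', \$buf) or die "open: $!";
my $progress = make_progress($fh);
$progress->("indexing 1/3\n");
$progress->('checking ', "inbox\n");
close $fh;
print $buf;
```

Each call emits one prefix followed by all of its arguments, so multi-argument messages still get a single '# '.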

* [PATCH 3/8] set nodatacow on more SQLite files
  2021-10-10 14:25 [PATCH 0/8] extindex and then some Eric Wong
  2021-10-10 14:25 ` [PATCH 1/8] lei_to_mail: show --output on augment progress failure Eric Wong
  2021-10-10 14:25 ` [PATCH 2/8] admin: add '# ' prefix for progress messages Eric Wong
@ 2021-10-10 14:25 ` Eric Wong
  2021-10-10 14:25 ` [PATCH 4/8] extindex: speed up Xapian cleanup in --gc Eric Wong
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Eric Wong @ 2021-10-10 14:25 UTC (permalink / raw)
  To: meta

We'll set nodatacow when detecting existing but empty
files, and also their directories in more cases (for
auxiliary -wal, -journal, -shm files).  Hopefully
this keeps performance reasonable on CoW FSes.
---
 lib/PublicInbox/Over.pm     | 4 +++-
 lib/PublicInbox/SharedKV.pm | 3 ++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/lib/PublicInbox/Over.pm b/lib/PublicInbox/Over.pm
index 19da056a..98de82c0 100644
--- a/lib/PublicInbox/Over.pm
+++ b/lib/PublicInbox/Over.pm
@@ -16,9 +16,11 @@ use constant DEFAULT_LIMIT => 1000;
 sub dbh_new {
 	my ($self, $rw) = @_;
 	my $f = delete $self->{filename};
-	if (!-f $f) { # SQLite defaults mode to 0644, we want 0666
+	if (!-s $f) { # SQLite defaults mode to 0644, we want 0666
 		if ($rw) {
 			require PublicInbox::Spawn;
+			my ($dir) = ($f =~ m!(.+)/[^/]+\z!);
+			PublicInbox::Spawn::nodatacow_dir($dir);
 			open my $fh, '+>>', $f or die "failed to open $f: $!";
 			PublicInbox::Spawn::nodatacow_fd(fileno($fh));
 		} else {
diff --git a/lib/PublicInbox/SharedKV.pm b/lib/PublicInbox/SharedKV.pm
index 645bb57c..398f4ca8 100644
--- a/lib/PublicInbox/SharedKV.pm
+++ b/lib/PublicInbox/SharedKV.pm
@@ -51,7 +51,8 @@ sub new {
 	$base //= '';
 	my $f = $self->{filename} = "$dir/$base.sqlite3";
 	$self->{lock_path} = $opt->{lock_path} // "$dir/$base.flock";
-	unless (-f $f) {
+	unless (-s $f) {
+		PublicInbox::Spawn::nodatacow_dir($dir); # for journal/shm/wal
 		open my $fh, '+>>', $f or die "failed to open $f: $!";
 		PublicInbox::Spawn::nodatacow_fd(fileno($fh));
 	}

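The switch from -f to -s in patch 3/8 matters for files that exist but are still empty. A minimal sketch of the difference, using a throwaway directory and a hypothetical file name (nothing here touches nodatacow itself):

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);
my $f = "$dir/over.sqlite3"; # hypothetical file name

# Create an empty file, as an earlier '+>>' open would leave behind.
open(my $fh, '+>>', $f) or die "failed to open $f: $!";
close $fh;

my $exists = -f $f ? 1 : 0;   # true: a regular file is present
my $has_data = -s $f ? 1 : 0; # false: the file is zero bytes
my ($parent) = ($f =~ m!(.+)/[^/]+\z!); # same regexp as the patch
print "exists=$exists has_data=$has_data parent_ok=",
	($parent eq $dir ? 1 : 0), "\n";
```

With the old -f test, the empty file would have been skipped. With -s it is treated like a missing one, so it gets the same setup.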

* [PATCH 4/8] extindex: speed up Xapian cleanup in --gc
  2021-10-10 14:25 [PATCH 0/8] extindex and then some Eric Wong
                   ` (2 preceding siblings ...)
  2021-10-10 14:25 ` [PATCH 3/8] set nodatacow on more SQLite files Eric Wong
@ 2021-10-10 14:25 ` Eric Wong
  2021-10-10 14:25 ` [PATCH 5/8] extindex: minor cost reductions Eric Wong
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Eric Wong @ 2021-10-10 14:25 UTC (permalink / raw)
  To: meta

Avoiding repeated SQL statements brings --gc down to 2-3 minutes
from around 10.  We'll also add some checkpoints around over and
xref3 cleanups.
---
 lib/PublicInbox/ExtSearchIdx.pm | 37 ++++++++++++++++++++-------------
 lib/PublicInbox/SearchIdx.pm    |  3 +++
 2 files changed, 26 insertions(+), 14 deletions(-)

diff --git a/lib/PublicInbox/ExtSearchIdx.pm b/lib/PublicInbox/ExtSearchIdx.pm
index 20c4cf78..04948b8b 100644
--- a/lib/PublicInbox/ExtSearchIdx.pm
+++ b/lib/PublicInbox/ExtSearchIdx.pm
@@ -421,34 +421,43 @@ sub eidx_gc_scan_shards ($$) { # TODO: use for lei/store
 DELETE FROM xref3 WHERE docid NOT IN (SELECT num FROM over)
 
 	warn "I: eliminated $nr stale xref3 entries\n" if $nr != 0;
+	reindex_checkpoint($self, $sync) if checkpoint_due($sync);
 
 	# fixup from old bugs:
 	$nr = $self->{oidx}->dbh->do(<<'');
 DELETE FROM over WHERE num NOT IN (SELECT docid FROM xref3)
 
 	warn "I: eliminated $nr stale over entries\n" if $nr != 0;
+	reindex_checkpoint($self, $sync) if checkpoint_due($sync);
 
 	my ($cur) = $self->{oidx}->dbh->selectrow_array(<<EOM);
 SELECT MIN(num) FROM over
 EOM
-	my ($max) = $self->{oidx}->dbh->selectrow_array(<<EOM);
-SELECT MAX(num) FROM over
-EOM
-	my $exists;
-restart:
-	$exists = $self->{oidx}->dbh->prepare(<<EOM);
-SELECT COUNT(num) FROM over WHERE num = ?
-EOM
-	for (; $cur <= $max; $cur++) {
-		$exists->execute($cur);
-		next if $exists->fetchrow_array != 0;
-		$self->idx_shard($cur)->ipc_do('xdb_remove_quiet', $cur);
+	$cur // return; # empty
+	my ($r, $n, %active);
+	$nr = 0;
+	while (1) {
+		$r = $self->{oidx}->dbh->selectcol_arrayref(<<"", undef, $cur);
+SELECT num FROM over WHERE num >= ? ORDER BY num ASC LIMIT 10000
+
+		last unless scalar(@$r);
+		while (defined($n = shift @$r)) {
+			for my $i ($cur..($n - 1)) {
+				my $idx = idx_shard($self, $i);
+				$idx->ipc_do('xdb_remove_quiet', $i);
+				$active{$idx} = $idx;
+			}
+			$cur = $n + 1;
+		}
 		if (checkpoint_due($sync)) {
-			$exists = undef;
+			for my $idx (values %active) {
+				$nr += $idx->ipc_do('nr_quiet_rm')
+			}
+			%active = ();
 			reindex_checkpoint($self, $sync);
-			goto restart;
 		}
 	}
+	warn "I: eliminated $nr stale Xapian documents\n" if $nr != 0;
 }
 
 sub eidx_gc {
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 78db329d..bebe904b 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -650,8 +650,11 @@ sub xdb_remove_quiet {
 	begin_txn_lazy($self);
 	my $xdb = $self->{xdb} // die 'BUG: missing {xdb}';
 	eval { $xdb->delete_document($docid) };
+	++$self->{-quiet_rm} unless $@;
 }
 
+sub nr_quiet_rm { delete($_[0]->{-quiet_rm}) // 0 }
+
 sub index_git_blob_id {
 	my ($doc, $pfx, $objid) = @_;
 

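The batched gap scan in patch 4/8 can be sketched without SQLite or Xapian. In this simplified model, `@over` is a hypothetical in-memory stand-in for the over table, `next_batch` emulates the LIMIT query, and `@removed` records the docids that would be passed to xdb_remove_quiet. None of these names come from the patch.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the `over` table: the sorted list of docids
# that still exist.  Anything missing between them is stale.
my @over = (1, 2, 5, 6, 9);
my $limit = 2; # the patch uses LIMIT 10000

# Emulates: SELECT num FROM over WHERE num >= ? ORDER BY num ASC LIMIT ?
sub next_batch {
	my ($cur) = @_;
	my @r = grep { $_ >= $cur } @over;
	splice(@r, $limit) if @r > $limit;
	return \@r;
}

my @removed; # stands in for ipc_do('xdb_remove_quiet', $i)
my $cur = $over[0];
while (1) {
	my $r = next_batch($cur);
	last unless @$r;
	for my $n (@$r) {
		push @removed, $cur .. ($n - 1); # ids in the gap before $n
		$cur = $n + 1;
	}
}
print "removed: @removed\n";
```

One query now covers a whole batch of docids. The previous loop executed one COUNT statement per docid.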

* [PATCH 5/8] extindex: minor cost reductions
  2021-10-10 14:25 [PATCH 0/8] extindex and then some Eric Wong
                   ` (3 preceding siblings ...)
  2021-10-10 14:25 ` [PATCH 4/8] extindex: speed up Xapian cleanup in --gc Eric Wong
@ 2021-10-10 14:25 ` Eric Wong
  2021-10-10 14:25 ` [PATCH 6/8] extindex: --gc doesn't touch ghost entries Eric Wong
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 9+ messages in thread
From: Eric Wong @ 2021-10-10 14:25 UTC (permalink / raw)
  To: meta

Don't bother decoding the 20-byte SHA-1 to a 40-byte hex value
since we don't read it, anyways.  We can also use the on-stack
ibx->eidx_key value instead of dispatching the method again.
---
 lib/PublicInbox/ExtSearchIdx.pm | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/lib/PublicInbox/ExtSearchIdx.pm b/lib/PublicInbox/ExtSearchIdx.pm
index 04948b8b..42488e12 100644
--- a/lib/PublicInbox/ExtSearchIdx.pm
+++ b/lib/PublicInbox/ExtSearchIdx.pm
@@ -902,15 +902,14 @@ DELETE FROM xref3 WHERE ibx_id = ? AND xnum = ? AND oidbin = ?
 			$del->execute;
 
 			# get_xref3 over-fetches, but this is a rare path:
-			my $xr3 = $self->{oidx}->get_xref3($docid);
+			my $xr3 = $self->{oidx}->get_xref3($docid, 1);
 			my $idx = $self->idx_shard($docid);
 			if (scalar(@$xr3) == 0) { # all gone
 				$self->{oidx}->delete_by_num($docid);
 				$self->{oidx}->eidxq_del($docid);
 				$idx->ipc_do('xdb_remove', $docid);
 			} else { # enqueue for reindex of remaining messages
-				$idx->ipc_do('remove_eidx_info',
-						$docid, $ibx->eidx_key);
+				$idx->ipc_do('remove_eidx_info', $docid, $ekey);
 				$self->{oidx}->eidxq_add($docid); # yes, add
 			}
 		}

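For a sense of the conversion that patch 5/8 skips, here is a small sketch of a raw 20-byte SHA-1 and its 40-character hex form. It uses the core Digest::SHA module purely for illustration. That the extra argument to get_xref3 leaves the value in raw form is an assumption based on the commit message, not something shown in the diff.

```perl
use strict;
use warnings;
use Digest::SHA qw(sha1 sha1_hex);

my $bin = sha1('hello');      # raw digest, 20 bytes
my $hex = unpack('H*', $bin); # the hex decoding a caller can skip
print length($bin), " bytes -> ", length($hex), " hex chars\n";
```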

* [PATCH 6/8] extindex: --gc doesn't touch ghost entries
  2021-10-10 14:25 [PATCH 0/8] extindex and then some Eric Wong
                   ` (4 preceding siblings ...)
  2021-10-10 14:25 ` [PATCH 5/8] extindex: minor cost reductions Eric Wong
@ 2021-10-10 14:25 ` Eric Wong
  2021-10-10 14:25 ` [PATCH 7/8] lei/store: keep ".err-XXXX" in stderr tmpfile Eric Wong
  2021-10-10 14:25 ` [PATCH 8/8] extindex: sync each inbox before checking for missed messages Eric Wong
  7 siblings, 0 replies; 9+ messages in thread
From: Eric Wong @ 2021-10-10 14:25 UTC (permalink / raw)
  To: meta

We were deleting ghost entries.  This was usually harmless since
other messages could fill in the blanks, but it could cause
misthreading in odd cases where a big chunk of a thread is
missing and the latest messages only referenced ghosts.

We'll also save some cycles when scanning Xapian shards since
docids won't be <= 0.
---
 lib/PublicInbox/ExtSearchIdx.pm | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/PublicInbox/ExtSearchIdx.pm b/lib/PublicInbox/ExtSearchIdx.pm
index 42488e12..acf35e3d 100644
--- a/lib/PublicInbox/ExtSearchIdx.pm
+++ b/lib/PublicInbox/ExtSearchIdx.pm
@@ -425,13 +425,13 @@ DELETE FROM xref3 WHERE docid NOT IN (SELECT num FROM over)
 
 	# fixup from old bugs:
 	$nr = $self->{oidx}->dbh->do(<<'');
-DELETE FROM over WHERE num NOT IN (SELECT docid FROM xref3)
+DELETE FROM over WHERE num > 0 AND num NOT IN (SELECT docid FROM xref3)
 
 	warn "I: eliminated $nr stale over entries\n" if $nr != 0;
 	reindex_checkpoint($self, $sync) if checkpoint_due($sync);
 
 	my ($cur) = $self->{oidx}->dbh->selectrow_array(<<EOM);
-SELECT MIN(num) FROM over
+SELECT MIN(num) FROM over WHERE num > 0
 EOM
 	$cur // return; # empty
 	my ($r, $n, %active);

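The `num > 0` condition added in patch 6/8 can be modelled with plain hashes. This sketch assumes, as that condition suggests, that ghost rows are the ones with non-positive nums. The `%over` and `%xref3` contents are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical rows keyed by num.  Non-positive nums stand in for ghosts
# (an assumption drawn from the `num > 0` condition in the patch).
my %over = (-2 => 'ghost', -1 => 'ghost', 1 => 'msg', 2 => 'msg', 3 => 'msg');
my %xref3 = map { $_ => 1 } (1, 3); # docids that still have an xref3 row

# Emulates: DELETE FROM over WHERE num > 0 AND num NOT IN (SELECT docid ...)
my @stale = sort { $a <=> $b } grep { $_ > 0 && !$xref3{$_} } keys %over;
delete @over{@stale};

my @left = sort { $a <=> $b } keys %over;
print "deleted: @stale; kept: @left\n";
```

Without the `$_ > 0` test, the two ghost rows would also count as stale because they have no xref3 entry.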

* [PATCH 7/8] lei/store: keep ".err-XXXX" in stderr tmpfile
  2021-10-10 14:25 [PATCH 0/8] extindex and then some Eric Wong
                   ` (5 preceding siblings ...)
  2021-10-10 14:25 ` [PATCH 6/8] extindex: --gc doesn't touch ghost entries Eric Wong
@ 2021-10-10 14:25 ` Eric Wong
  2021-10-10 14:25 ` [PATCH 8/8] extindex: sync each inbox before checking for missed messages Eric Wong
  7 siblings, 0 replies; 9+ messages in thread
From: Eric Wong @ 2021-10-10 14:25 UTC (permalink / raw)
  To: meta

This is slightly more meaningful since the file is already
in ~/.local/share/lei/store, so "lei_store" was redundant
(the "XXXX" is a placeholder that File::Temp replaces with
random characters).
---
 lib/PublicInbox/LeiStore.pm | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/PublicInbox/LeiStore.pm b/lib/PublicInbox/LeiStore.pm
index 52a1456f..613d1d31 100644
--- a/lib/PublicInbox/LeiStore.pm
+++ b/lib/PublicInbox/LeiStore.pm
@@ -512,7 +512,7 @@ sub xchg_stderr {
 	return unless -e $dir;
 	my $old = delete $self->{-tmp_err};
 	my $pfx = POSIX::strftime('%Y%m%d%H%M%S', gmtime(time));
-	my $err = File::Temp->new(TEMPLATE => "$pfx.$$.lei_storeXXXX",
+	my $err = File::Temp->new(TEMPLATE => "$pfx.$$.err-XXXX",
 				SUFFIX => '.err', DIR => $dir);
 	open STDERR, '>>', $err->filename or die "dup2: $!";
 	STDERR->autoflush(1); # shared with shard subprocesses

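To see the kind of file name the new template in patch 7/8 produces, the File::Temp call can be run against a scratch directory. The temporary directory below is only a stand-in for the lei store directory.

```perl
use strict;
use warnings;
use POSIX ();
use File::Temp qw(tempdir);
use File::Basename qw(basename);

my $dir = tempdir(CLEANUP => 1); # stand-in for the lei store directory
my $pfx = POSIX::strftime('%Y%m%d%H%M%S', gmtime(time));
my $err = File::Temp->new(TEMPLATE => "$pfx.$$.err-XXXX",
			SUFFIX => '.err', DIR => $dir);
my $name = basename($err->filename);
print "$name\n";
```

The printed name has the form TIMESTAMP.PID.err-XXXX.err, with the four X characters replaced by File::Temp.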

* [PATCH 8/8] extindex: sync each inbox before checking for missed messages
  2021-10-10 14:25 [PATCH 0/8] extindex and then some Eric Wong
                   ` (6 preceding siblings ...)
  2021-10-10 14:25 ` [PATCH 7/8] lei/store: keep ".err-XXXX" in stderr tmpfile Eric Wong
@ 2021-10-10 14:25 ` Eric Wong
  7 siblings, 0 replies; 9+ messages in thread
From: Eric Wong @ 2021-10-10 14:25 UTC (permalink / raw)
  To: meta

Otherwise, it gets too noisy and we repeat some work
when we do an actual sync, since the last_commit info
will be out-of-date.
---
 lib/PublicInbox/ExtSearchIdx.pm | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/lib/PublicInbox/ExtSearchIdx.pm b/lib/PublicInbox/ExtSearchIdx.pm
index acf35e3d..d589d2c0 100644
--- a/lib/PublicInbox/ExtSearchIdx.pm
+++ b/lib/PublicInbox/ExtSearchIdx.pm
@@ -812,6 +812,9 @@ sub _reindex_check_unseen ($$$) {
 	my $ibx_id = $ibx->{-ibx_id};
 	my $slice = 1000;
 	my ($beg, $end) = (1, $slice);
+	my $err = sync_inbox($self, $sync, $ibx) and return;
+	my $max = $ibx->over->max;
+	$end = $max if $end > $max;
 
 	# first, check if we missed any messages in target $ibx
 	my $msgs;
@@ -825,6 +828,7 @@ sub _reindex_check_unseen ($$$) {
 		${$sync->{nr}} = $beg;
 		$beg = $msgs->[-1]->{num} + 1;
 		$end = $beg + $slice;
+		$end = $max if $end > $max;
 		if (checkpoint_due($sync)) {
 			reindex_checkpoint($self, $sync); # release lock
 		}
@@ -952,6 +956,7 @@ sub sync_inbox {
 	my $err = _sync_inbox($self, $sync, $ibx);
 	delete @$ibx{qw(mm over)};
 	warn $err, "\n" if defined($err);
+	$err;
 }
 
 sub dd_smsg { # git->cat_async callback

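The clamping of $end to $max in patch 8/8 can be illustrated with a simplified window generator. `slices` is a hypothetical helper. Unlike the real loop, it advances $beg from the previous $end rather than from the last message number seen.

```perl
use strict;
use warnings;

# Hypothetical helper: list the [$beg, $end] windows a scan would visit,
# never letting $end run past $max (the highest article number).
sub slices {
	my ($max, $slice) = @_;
	my @out;
	my ($beg, $end) = (1, $slice);
	$end = $max if $end > $max;
	while ($beg <= $max) {
		push @out, [ $beg, $end ];
		$beg = $end + 1; # simplification; the patch uses the last num seen
		$end = $beg + $slice;
		$end = $max if $end > $max;
	}
	return @out;
}

my @s = slices(2500, 1000);
print join(' ', map { "$_->[0]-$_->[1]" } @s), "\n";
```

Since sync_inbox now runs first, $max reflects the freshly synced state of the inbox.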

Thread overview: 9+ messages
2021-10-10 14:25 [PATCH 0/8] extindex and then some Eric Wong
2021-10-10 14:25 ` [PATCH 1/8] lei_to_mail: show --output on augment progress failure Eric Wong
2021-10-10 14:25 ` [PATCH 2/8] admin: add '# ' prefix for progress messages Eric Wong
2021-10-10 14:25 ` [PATCH 3/8] set nodatacow on more SQLite files Eric Wong
2021-10-10 14:25 ` [PATCH 4/8] extindex: speed up Xapian cleanup in --gc Eric Wong
2021-10-10 14:25 ` [PATCH 5/8] extindex: minor cost reductions Eric Wong
2021-10-10 14:25 ` [PATCH 6/8] extindex: --gc doesn't touch ghost entries Eric Wong
2021-10-10 14:25 ` [PATCH 7/8] lei/store: keep ".err-XXXX" in stderr tmpfile Eric Wong
2021-10-10 14:25 ` [PATCH 8/8] extindex: sync each inbox before checking for missed messages Eric Wong

Code repositories for project(s) associated with this inbox:

	https://80x24.org/public-inbox.git
