user/dev discussion of public-inbox itself
From: Eric Wong <e@80x24.org>
To: meta@public-inbox.org
Subject: [PATCH 4/8] extindex: speed up Xapian cleanup in --gc
Date: Sun, 10 Oct 2021 14:25:14 +0000
Message-ID: <20211010142518.7012-5-e@80x24.org>
In-Reply-To: <20211010142518.7012-1-e@80x24.org>

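Fetching existing over.num values in batches of 10000 instead of
executing one SELECT per docid brings --gc down to 2-3 minutes
from around 10 minutes.  We'll also add checkpoints after the
xref3 and over cleanups, and report how many stale Xapian
documents were removed.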
---
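A minimal sketch of the batched gap scan, for illustration only.
It is written in Python with the stdlib sqlite3 module rather than
the Perl/DBI code in this patch, and the name find_gaps and its
batch parameter are hypothetical.  It assumes a table "over" with
an integer "num" column, as in the queries below, and yields each
number missing between the smallest and largest num; the patch
passes those numbers to xdb_remove_quiet.

```python
import sqlite3


def find_gaps(dbh, batch=10000):
    """Yield each integer missing from over.num between MIN(num) and MAX(num).

    Hypothetical helper: one query per `batch` existing rows replaces
    one query per candidate docid.
    """
    (cur,) = dbh.execute("SELECT MIN(num) FROM over").fetchone()
    if cur is None:  # empty table, nothing to scan
        return
    while True:
        rows = dbh.execute(
            "SELECT num FROM over WHERE num >= ? ORDER BY num ASC LIMIT ?",
            (cur, batch),
        ).fetchall()
        if not rows:
            break
        for (n,) in rows:
            # everything in [cur, n) has no row in over
            yield from range(cur, n)
            cur = n + 1
```

With rows 1, 2, 5, 9 and 10 in over, this yields 3, 4, 6, 7, 8
regardless of the batch size.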
 lib/PublicInbox/ExtSearchIdx.pm | 37 ++++++++++++++++++++-------------
 lib/PublicInbox/SearchIdx.pm    |  3 +++
 2 files changed, 26 insertions(+), 14 deletions(-)

diff --git a/lib/PublicInbox/ExtSearchIdx.pm b/lib/PublicInbox/ExtSearchIdx.pm
index 20c4cf78..04948b8b 100644
--- a/lib/PublicInbox/ExtSearchIdx.pm
+++ b/lib/PublicInbox/ExtSearchIdx.pm
@@ -421,34 +421,43 @@ sub eidx_gc_scan_shards ($$) { # TODO: use for lei/store
 DELETE FROM xref3 WHERE docid NOT IN (SELECT num FROM over)
 
 	warn "I: eliminated $nr stale xref3 entries\n" if $nr != 0;
+	reindex_checkpoint($self, $sync) if checkpoint_due($sync);
 
 	# fixup from old bugs:
 	$nr = $self->{oidx}->dbh->do(<<'');
 DELETE FROM over WHERE num NOT IN (SELECT docid FROM xref3)
 
 	warn "I: eliminated $nr stale over entries\n" if $nr != 0;
+	reindex_checkpoint($self, $sync) if checkpoint_due($sync);
 
 	my ($cur) = $self->{oidx}->dbh->selectrow_array(<<EOM);
 SELECT MIN(num) FROM over
 EOM
-	my ($max) = $self->{oidx}->dbh->selectrow_array(<<EOM);
-SELECT MAX(num) FROM over
-EOM
-	my $exists;
-restart:
-	$exists = $self->{oidx}->dbh->prepare(<<EOM);
-SELECT COUNT(num) FROM over WHERE num = ?
-EOM
-	for (; $cur <= $max; $cur++) {
-		$exists->execute($cur);
-		next if $exists->fetchrow_array != 0;
-		$self->idx_shard($cur)->ipc_do('xdb_remove_quiet', $cur);
+	$cur // return; # empty
+	my ($r, $n, %active);
+	$nr = 0;
+	while (1) {
+		$r = $self->{oidx}->dbh->selectcol_arrayref(<<"", undef, $cur);
+SELECT num FROM over WHERE num >= ? ORDER BY num ASC LIMIT 10000
+
+		last unless scalar(@$r);
+		while (defined($n = shift @$r)) {
+			for my $i ($cur..($n - 1)) {
+				my $idx = idx_shard($self, $i);
+				$idx->ipc_do('xdb_remove_quiet', $i);
+				$active{$idx} = $idx;
+			}
+			$cur = $n + 1;
+		}
 		if (checkpoint_due($sync)) {
-			$exists = undef;
+			for my $idx (values %active) {
+				$nr += $idx->ipc_do('nr_quiet_rm')
+			}
+			%active = ();
 			reindex_checkpoint($self, $sync);
-			goto restart;
 		}
 	}
+	warn "I: eliminated $nr stale Xapian documents\n" if $nr != 0;
 }
 
 sub eidx_gc {
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 78db329d..bebe904b 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -650,8 +650,11 @@ sub xdb_remove_quiet {
 	begin_txn_lazy($self);
 	my $xdb = $self->{xdb} // die 'BUG: missing {xdb}';
 	eval { $xdb->delete_document($docid) };
+	++$self->{-quiet_rm} unless $@;
 }
 
+sub nr_quiet_rm { delete($_[0]->{-quiet_rm}) // 0 }
+
 sub index_git_blob_id {
 	my ($doc, $pfx, $objid) = @_;
 

Thread overview: 9+ messages
2021-10-10 14:25 [PATCH 0/8] extindex and then some Eric Wong
2021-10-10 14:25 ` [PATCH 1/8] lei_to_mail: show --output on augment progress failure Eric Wong
2021-10-10 14:25 ` [PATCH 2/8] admin: add '# ' prefix for progress messages Eric Wong
2021-10-10 14:25 ` [PATCH 3/8] set nodatacow on more SQLite files Eric Wong
2021-10-10 14:25 ` Eric Wong [this message]
2021-10-10 14:25 ` [PATCH 5/8] extindex: minor cost reductions Eric Wong
2021-10-10 14:25 ` [PATCH 6/8] extindex: --gc doesn't touch ghost entries Eric Wong
2021-10-10 14:25 ` [PATCH 7/8] lei/store: keep ".err-XXXX" in stderr tmpfile Eric Wong
2021-10-10 14:25 ` [PATCH 8/8] extindex: sync each inbox before checking for missed messages Eric Wong

Code repositories for project(s) associated with this public inbox

	https://80x24.org/public-inbox.git