* [PATCH 4/8] extindex: speed up Xapian cleanup in --gc
From: Eric Wong @ 2021-10-10 14:25 UTC (permalink / raw)
To: meta
Avoiding repeated SQL statements brings --gc down to 2-3 minutes
from around 10 minutes. We'll also add some checkpoints around the
over and xref3 cleanups.
---
lib/PublicInbox/ExtSearchIdx.pm | 37 ++++++++++++++++++++-------------
lib/PublicInbox/SearchIdx.pm | 3 +++
2 files changed, 26 insertions(+), 14 deletions(-)
diff --git a/lib/PublicInbox/ExtSearchIdx.pm b/lib/PublicInbox/ExtSearchIdx.pm
index 20c4cf78..04948b8b 100644
--- a/lib/PublicInbox/ExtSearchIdx.pm
+++ b/lib/PublicInbox/ExtSearchIdx.pm
@@ -421,34 +421,43 @@ sub eidx_gc_scan_shards ($$) { # TODO: use for lei/store
DELETE FROM xref3 WHERE docid NOT IN (SELECT num FROM over)
warn "I: eliminated $nr stale xref3 entries\n" if $nr != 0;
+ reindex_checkpoint($self, $sync) if checkpoint_due($sync);
# fixup from old bugs:
$nr = $self->{oidx}->dbh->do(<<'');
DELETE FROM over WHERE num NOT IN (SELECT docid FROM xref3)
warn "I: eliminated $nr stale over entries\n" if $nr != 0;
+ reindex_checkpoint($self, $sync) if checkpoint_due($sync);
my ($cur) = $self->{oidx}->dbh->selectrow_array(<<EOM);
SELECT MIN(num) FROM over
EOM
- my ($max) = $self->{oidx}->dbh->selectrow_array(<<EOM);
-SELECT MAX(num) FROM over
-EOM
- my $exists;
-restart:
- $exists = $self->{oidx}->dbh->prepare(<<EOM);
-SELECT COUNT(num) FROM over WHERE num = ?
-EOM
- for (; $cur <= $max; $cur++) {
- $exists->execute($cur);
- next if $exists->fetchrow_array != 0;
- $self->idx_shard($cur)->ipc_do('xdb_remove_quiet', $cur);
+ $cur // return; # empty
+ my ($r, $n, %active);
+ $nr = 0;
+ while (1) {
+ $r = $self->{oidx}->dbh->selectcol_arrayref(<<"", undef, $cur);
+SELECT num FROM over WHERE num >= ? ORDER BY num ASC LIMIT 10000
+
+ last unless scalar(@$r);
+ while (defined($n = shift @$r)) {
+ for my $i ($cur..($n - 1)) {
+ my $idx = idx_shard($self, $i);
+ $idx->ipc_do('xdb_remove_quiet', $i);
+ $active{$idx} = $idx;
+ }
+ $cur = $n + 1;
+ }
if (checkpoint_due($sync)) {
- $exists = undef;
+ for my $idx (values %active) {
+ $nr += $idx->ipc_do('nr_quiet_rm')
+ }
+ %active = ();
reindex_checkpoint($self, $sync);
- goto restart;
}
}
+ warn "I: eliminated $nr stale Xapian documents\n" if $nr != 0;
}
sub eidx_gc {
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 78db329d..bebe904b 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -650,8 +650,11 @@ sub xdb_remove_quiet {
begin_txn_lazy($self);
my $xdb = $self->{xdb} // die 'BUG: missing {xdb}';
eval { $xdb->delete_document($docid) };
+ ++$self->{-quiet_rm} unless $@;
}
+sub nr_quiet_rm { delete($_[0]->{-quiet_rm}) // 0 }
+
sub index_git_blob_id {
my ($doc, $pfx, $objid) = @_;
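
The gap scan introduced in eidx_gc_scan_shards can be illustrated outside of Perl. What follows is a minimal Python sketch, not the patch's implementation: it assumes a SQLite table named `over` with an integer `num` column, as in the SQL above, while `find_stale_docids` and `batch_size` are hypothetical names chosen for this example. Instead of one `SELECT COUNT(num)` per candidate docid, it fetches existing docids in sorted batches and treats every integer between consecutive results as stale. Checkpointing and the per-shard Xapian removal are left out.

```python
import sqlite3


def find_stale_docids(dbh, batch_size=10000):
    """Yield integers missing from over.num, between MIN(num) and MAX(num).

    Hypothetical helper for illustration only.
    """
    # "over" is quoted defensively since OVER is also an SQL keyword.
    (cur,) = dbh.execute('SELECT MIN(num) FROM "over"').fetchone()
    if cur is None:  # empty table, nothing to scan
        return
    while True:
        rows = dbh.execute(
            'SELECT num FROM "over" WHERE num >= ? ORDER BY num ASC LIMIT ?',
            (cur, batch_size),
        ).fetchall()
        if not rows:
            break
        for (n,) in rows:
            # Every integer in [cur, n) has no row in "over".
            yield from range(cur, n)
            cur = n + 1


# Usage with an in-memory database (sample data is made up):
dbh = sqlite3.connect(":memory:")
dbh.execute('CREATE TABLE "over" (num INTEGER PRIMARY KEY)')
dbh.executemany('INSERT INTO "over" (num) VALUES (?)',
                [(n,) for n in (3, 4, 7, 8, 12)])
stale = list(find_stale_docids(dbh, batch_size=2))
print(stale)
```

In this sketch, the number of SQL statements scales with the number of existing rows divided by the batch size, rather than with the width of the docid range.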
* [PATCH 0/8] extindex and then some...
From: Eric Wong @ 2021-10-10 14:25 UTC (permalink / raw)
To: meta
One notable fix for -extindex --gc, a couple of minor things
here and there. Still need to speed up --reindex...
Eric Wong (8):
lei_to_mail: show --output on augment progress failure
admin: add '# ' prefix for progress messages
set nodatacow on more SQLite files
extindex: speed up Xapian cleanup in --gc
extindex: minor cost reductions
extindex: --gc doesn't touch ghost entries
lei/store: keep ".err-XXXX" in stderr tmpfile
extindex: sync each inbox before checking for missed messages
lib/PublicInbox/Admin.pm | 2 +-
lib/PublicInbox/ExtSearchIdx.pm | 51 +++++++++++++++++++++------------
lib/PublicInbox/LeiStore.pm | 2 +-
lib/PublicInbox/LeiToMail.pm | 2 +-
lib/PublicInbox/Over.pm | 4 ++-
lib/PublicInbox/SearchIdx.pm | 3 ++
lib/PublicInbox/SharedKV.pm | 3 +-
7 files changed, 43 insertions(+), 24 deletions(-)
Code repositories for project(s) associated with this public inbox
https://80x24.org/public-inbox.git