* [PATCH 00/20] indexing changes and new features
@ 2020-07-24 5:55 7% Eric Wong
2020-07-24 5:55 4% ` [PATCH 01/20] index: support --rethread switch to fix old indices Eric Wong
0 siblings, 1 reply; 2+ results
From: Eric Wong @ 2020-07-24 5:55 UTC (permalink / raw)
To: meta
--rethread and --no-sync options are now supported in
public-inbox-index. --no-sync should be nice for users
of FSes with poor fsync(2) performance.
Now: I also wonder if --no-sync is a bad name since we
also use it for to mean synchronising indices. Perhaps
--no-fsync would be a better name, though technically
SQLite and Xapian use fdatasync(2), nowadays.
Some of this is prep work for exposing THREADID via IMAP (and
JMAP) to aid in searching.
Since THREADID (`over.tid') will be exposed in a user-visible
way, I'm finally giving up on using the default (reverse
chronological) log order for indexing to ensure THREADID
ascends for newer threads.
This also simplifies the indexing code significantly.
To avoid pinning huge amounts of RAM, the working space is held
in a IdxStack temporary file. This further simplifies our code
since we no longer have to worry about old that did not use
Xapian w/o FD_CLOEXEC.
There's still more work on the horizon, here...
Eric Wong (20):
index: support --rethread switch to fix old indices
v2: index forwards (via `git log --reverse')
v2writable: introduce idx_stack
v2writable: index_sync: reduce fill_alternates calls
v2writable: move {autime} and {cotime} into $sync state
v2writable: allow >= 40 byte git object IDs
v2writable: drop "EPOCH.git indexing $RANGE" progress message
use consistent {ibx} field for writable code paths
search: avoid copying {inboxdir}
v2writable: use read-only PublicInbox::Git for cat_file
v2writable: get rid of {reindex_pipe} field
v2writable: clarify "epoch" for {last_commits}
xapcmd: set {from} properly for v1 inboxes
searchidx: rename _xdb_{acquire,release} => idx_
searchidx: make v1 indexing closer to v2
index+xcpdb: support --no-sync flag
v2writable: share log2stack code with v1
searchidx: support async git check
searchidx: $batch_cb => v1_checkpoint
v2writable: {unindexed} belongs in $sync state
Documentation/public-inbox-index.pod | 30 +-
Documentation/public-inbox-xcpdb.pod | 6 +
MANIFEST | 3 +-
lib/PublicInbox/Git.pm | 72 ++++-
lib/PublicInbox/IdxStack.pm | 52 ++++
lib/PublicInbox/Import.pm | 6 +-
lib/PublicInbox/Msgmap.pm | 21 +-
lib/PublicInbox/MultiMidQueue.pm | 62 ----
lib/PublicInbox/Over.pm | 1 +
lib/PublicInbox/OverIdx.pm | 78 ++++-
lib/PublicInbox/Search.pm | 25 +-
lib/PublicInbox/SearchIdx.pm | 384 ++++++++++++------------
lib/PublicInbox/SearchIdxShard.pm | 12 +-
lib/PublicInbox/Smsg.pm | 8 +-
lib/PublicInbox/V2Writable.pm | 427 +++++++++------------------
lib/PublicInbox/Xapcmd.pm | 10 +-
script/public-inbox-index | 5 +-
script/public-inbox-xcpdb | 4 +-
t/idx_stack.t | 56 ++++
t/inbox_idle.t | 4 +-
t/search.t | 4 +-
t/v1reindex.t | 36 ++-
t/v2reindex.t | 45 +++
23 files changed, 744 insertions(+), 607 deletions(-)
create mode 100644 lib/PublicInbox/IdxStack.pm
delete mode 100644 lib/PublicInbox/MultiMidQueue.pm
create mode 100644 t/idx_stack.t
^ permalink raw reply [relevance 7%]
* [PATCH 01/20] index: support --rethread switch to fix old indices
2020-07-24 5:55 7% [PATCH 00/20] indexing changes and new features Eric Wong
@ 2020-07-24 5:55 4% ` Eric Wong
0 siblings, 0 replies; 2+ results
From: Eric Wong @ 2020-07-24 5:55 UTC (permalink / raw)
To: meta
Older versions of public-inbox < 1.3.0 had subtly
different semantics around threading in some corner
cases. This switch (when combined with --reindex)
allows us to fix them by regenerating associations.
---
Documentation/public-inbox-index.pod | 23 +++++++--
lib/PublicInbox/OverIdx.pm | 76 ++++++++++++++++++++++++++--
lib/PublicInbox/SearchIdx.pm | 7 ++-
lib/PublicInbox/V2Writable.pm | 4 +-
script/public-inbox-index | 2 +-
t/v1reindex.t | 34 +++++++++++++
t/v2reindex.t | 45 ++++++++++++++++
7 files changed, 177 insertions(+), 14 deletions(-)
diff --git a/Documentation/public-inbox-index.pod b/Documentation/public-inbox-index.pod
index ff2e54867..08f2fbf45 100644
--- a/Documentation/public-inbox-index.pod
+++ b/Documentation/public-inbox-index.pod
@@ -68,12 +68,25 @@ Xapian database. Using this with C<--compact> or running
L<public-inbox-compact(1)> afterwards is recommended to
release free space.
-public-inbox protects writes to various indices with L<flock(2)>,
-so it is safe to reindex while L<public-inbox-watch(1)>,
-L<public-inbox-mda(1)> or L<public-inbox-learn(1)> run.
+public-inbox protects writes to various indices with
+L<flock(2)>, so it is safe to reindex (and rethread) while
+L<public-inbox-watch(1)>, L<public-inbox-mda(1)> or
+L<public-inbox-learn(1)> run.
-This does not touch the NNTP article number database or
-affect threading.
+This does not touch the NNTP article number database.
+It does not affect threading unless C<--rethread> is
+used.
+
+=item --rethread
+
+Regenerate internal THREADID and message thread associations
+when reindexing.
+
+This fixes some bugs in older versions of public-inbox. While
+it is possible to use this without C<--reindex>, it makes little
+sense to do so.
+
+Available in public-inbox 1.6.0 (PENDING).
=item --prune
diff --git a/lib/PublicInbox/OverIdx.pm b/lib/PublicInbox/OverIdx.pm
index 5601e602c..c57be7243 100644
--- a/lib/PublicInbox/OverIdx.pm
+++ b/lib/PublicInbox/OverIdx.pm
@@ -17,6 +17,7 @@ use PublicInbox::MID qw/id_compress mids_for_index references/;
use PublicInbox::Smsg qw(subject_normalized);
use Compress::Zlib qw(compress);
use PublicInbox::Search;
+use Carp qw(croak);
sub dbh_new {
my ($self) = @_;
@@ -37,6 +38,13 @@ sub dbh_new {
$dbh;
}
+sub new {
+ my ($class, $f) = @_;
+ my $self = $class->SUPER::new($f);
+ $self->{min_tid} = 0;
+ $self;
+}
+
sub get_counter ($$) {
my ($dbh, $key) = @_;
my $sth = $dbh->prepare_cached(<<'', undef, 1);
@@ -164,8 +172,12 @@ sub _resolve_mid_to_tid {
my $cur_tid = $smsg->{tid};
if (defined $$tid) {
merge_threads($self, $$tid, $cur_tid);
- } else {
+ } elsif ($cur_tid > $self->{min_tid}) {
$$tid = $cur_tid;
+ } else { # rethreading, queue up dead ghosts
+ $$tid = next_tid($self);
+ my $num = $smsg->{num};
+ push(@{$self->{-ghosts_to_delete}}, $num) if $num < 0;
}
1;
}
@@ -175,7 +187,10 @@ sub resolve_mid_to_tid {
my ($self, $mid) = @_;
my $tid;
each_by_mid($self, $mid, ['tid'], \&_resolve_mid_to_tid, \$tid);
- defined $tid ? $tid : create_ghost($self, $mid);
+ if (my $del = delete $self->{-ghosts_to_delete}) {
+ delete_by_num($self, $_) for @$del;
+ }
+ $tid // create_ghost($self, $mid);
}
sub create_ghost {
@@ -221,7 +236,7 @@ sub link_refs {
merge_threads($self, $tid, $ptid);
}
} else {
- $tid = defined $old_tid ? $old_tid : next_tid($self);
+ $tid = $old_tid // next_tid($self);
}
$tid;
}
@@ -278,10 +293,17 @@ sub _add_over {
my $cur_tid = $smsg->{tid};
my $n = $smsg->{num};
die "num must not be zero for $mid" if !$n;
- $$old_tid = $cur_tid unless defined $$old_tid;
+ my $cur_valid = $cur_tid > $self->{min_tid};
+
if ($n > 0) { # regular mail
- merge_threads($self, $$old_tid, $cur_tid);
+ if ($cur_valid) {
+ $$old_tid //= $cur_tid;
+ merge_threads($self, $$old_tid, $cur_tid);
+ } else {
+ $$old_tid //= next_tid($self);
+ }
} elsif ($n < 0) { # ghost
+ $$old_tid //= $cur_valid ? $cur_tid : next_tid($self);
link_refs($self, $refs, $$old_tid);
delete_by_num($self, $n);
$$v++;
@@ -297,6 +319,7 @@ sub add_over {
begin_lazy($self);
delete_by_num($self, $num, \$old_tid);
+ $old_tid = undef if ($old_tid // 0) <= $self->{min_tid};
foreach my $mid (@$mids) {
my $v = 0;
each_by_mid($self, $mid, ['tid'], \&_add_over,
@@ -456,4 +479,47 @@ sub create {
$self->disconnect;
}
+sub rethread_prepare {
+ my ($self, $opt) = @_;
+ return unless $opt->{rethread};
+ begin_lazy($self);
+ my $min = $self->{min_tid} = get_counter($self->{dbh}, 'thread') // 0;
+ my $pr = $opt->{-progress};
+ $pr->("rethread min THREADID ".($min + 1)."\n") if $pr && $min;
+}
+
+sub rethread_done {
+ my ($self, $opt) = @_;
+ return unless $opt->{rethread} && $self->{txn};
+ defined(my $min = $self->{min_tid}) or croak('BUG: no min_tid');
+ my $dbh = $self->{dbh} or croak('BUG: no dbh');
+ my $rows = $dbh->selectall_arrayref(<<'', { Slice => {} }, $min);
+SELECT num,tid FROM over WHERE num < 0 AND tid < ?
+
+ my $show_id = $dbh->prepare('SELECT id FROM id2num WHERE num = ?');
+ my $show_mid = $dbh->prepare('SELECT mid FROM msgid WHERE id = ?');
+ my $pr = $opt->{-progress};
+ my $total = 0;
+ for my $r (@$rows) {
+ my $exp = 0;
+ $show_id->execute($r->{num});
+ while (defined(my $id = $show_id->fetchrow_array)) {
+ ++$exp;
+ $show_mid->execute($id);
+ my $mid = $show_mid->fetchrow_array;
+ if (!defined($mid)) {
+ warn <<EOF;
+E: ghost NUM=$r->{num} ID=$id THREADID=$r->{tid} has no Message-ID
+EOF
+ next;
+ }
+ $pr->(<<EOM) if $pr;
+I: ghost $r->{num} <$mid> THREADID=$r->{tid} culled
+EOM
+ }
+ delete_by_num($self, $r->{num});
+ }
+ $pr->("I: rethread culled $total ghosts\n") if $pr && $total;
+}
+
1;
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 831625090..e641ffd43 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -723,6 +723,7 @@ sub _index_sync {
my $pr = $opts->{-progress};
my $xdb = $self->begin_txn_lazy;
+ $self->{over}->rethread_prepare($opts);
my $mm = _msgmap_init($self);
do {
$xlog = undef; # stop previous git-log via SIGPIPE
@@ -761,12 +762,14 @@ sub _index_sync {
$xdb->set_metadata('last_commit', $newest);
}
}
+
+ $self->{over}->rethread_done($opts) if $newest; # all done
$self->commit_txn_lazy;
$git->cleanup;
$xdb = _xdb_release($self, $nr);
- # let another process do some work... <
+ # let another process do some work...
$pr->("indexed $nr/$self->{ntodo}\n") if $pr && $nr;
- if (!$newest) {
+ if (!$newest) { # more to come
$xdb = $self->begin_txn_lazy;
$dbh->begin_work if $dbh;
}
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 0582dd5e3..16556ddc2 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -1308,6 +1308,7 @@ sub index_sync {
my $latest = git_dir_latest($self, \$epoch_max);
return unless defined $latest;
$self->idx_init($opt); # acquire lock
+ $self->{over}->rethread_prepare($opt);
my $sync = {
D => {}, # "$mid\0$chash" => $oid
unindex_range => {}, # EPOCH => oid_old..oid_new
@@ -1370,12 +1371,13 @@ sub index_sync {
my $pr = $sync->{-opt}->{-progress};
$pr->('all.git '.sprintf($sync->{-regen_fmt}, $nr)) if $pr;
}
+ $self->{over}->rethread_done($opt);
# reindex does not pick up new changes, so we rerun w/o it:
if ($opt->{reindex}) {
my %again = %$opt;
$sync = undef;
- delete @again{qw(reindex -skip_lock)};
+ delete @again{qw(rethread reindex -skip_lock)};
index_sync($self, \%again);
}
}
diff --git a/script/public-inbox-index b/script/public-inbox-index
index 6217fb86c..2e1934b08 100755
--- a/script/public-inbox-index
+++ b/script/public-inbox-index
@@ -15,7 +15,7 @@ use PublicInbox::Xapcmd;
my $compact_opt;
my $opt = { quiet => -1, compact => 0, maxsize => undef };
-GetOptions($opt, qw(verbose|v+ reindex compact|c+ jobs|j=i prune
+GetOptions($opt, qw(verbose|v+ reindex rethread compact|c+ jobs|j=i prune
indexlevel|L=s maxsize|max-size=s batchsize|batch-size=s))
or die "bad command-line args\n$usage";
die "--jobs must be >= 0\n" if defined $opt->{jobs} && $opt->{jobs} < 0;
diff --git a/t/v1reindex.t b/t/v1reindex.t
index 9f23ef01e..8cb751881 100644
--- a/t/v1reindex.t
+++ b/t/v1reindex.t
@@ -11,6 +11,7 @@ require_git(2.6);
require_mods(qw(DBD::SQLite Search::Xapian));
use_ok 'PublicInbox::SearchIdx';
use_ok 'PublicInbox::Import';
+use_ok 'PublicInbox::OverIdx';
my ($inboxdir, $for_destroy) = tmpdir();
my $ibx_config = {
inboxdir => $inboxdir,
@@ -427,5 +428,38 @@ ok(!-d $xap, 'Xapian directories removed again');
], 'msgmap as expected' );
}
+{
+ my @warn;
+ local $SIG{__WARN__} = sub { push @warn, @_ };
+ my $ibx = PublicInbox::Inbox->new({ %$ibx_config });
+ my $f = $ibx->over->{dbh}->sqlite_db_filename;
+ my $over = PublicInbox::OverIdx->new($f);
+ my $dbh = $over->connect;
+ my $non_ghost_tids = sub {
+ $dbh->selectall_arrayref(<<'');
+SELECT tid FROM over WHERE num > 0 ORDER BY tid ASC
+
+ };
+ my $before = $non_ghost_tids->();
+
+ # mess up threading:
+ my $tid = PublicInbox::OverIdx::get_counter($dbh, 'thread');
+ my $nr = $dbh->do('UPDATE over SET tid = ?', undef, $tid);
+
+ my $rw = PublicInbox::SearchIdx->new($ibx, 1);
+ my @pr;
+ my $pr = sub { push @pr, @_ };
+ $rw->index_sync({reindex => 1, rethread => 1, -progress => $pr });
+ my @n = $dbh->selectrow_array(<<EOS, undef, $tid);
+SELECT COUNT(*) FROM over WHERE tid <= ?
+EOS
+ is_deeply(\@n, [ 0 ], 'rethread dropped old threadids');
+ my $after = $non_ghost_tids->();
+ ok($after->[0]->[0] > $before->[-1]->[0],
+ 'all tids greater than before');
+ is(scalar @$after, scalar @$before, 'thread count unchanged');
+ is_deeply([], \@warn, 'no warnings');
+ # diag "@pr"; # XXX do we care?
+}
done_testing();
diff --git a/t/v2reindex.t b/t/v2reindex.t
index 77deffb4b..ea2b24e59 100644
--- a/t/v2reindex.t
+++ b/t/v2reindex.t
@@ -10,6 +10,7 @@ use PublicInbox::TestCommon;
require_git(2.6);
require_mods(qw(DBD::SQLite Search::Xapian));
use_ok 'PublicInbox::V2Writable';
+use_ok 'PublicInbox::OverIdx';
my ($inboxdir, $for_destroy) = tmpdir();
my $ibx_config = {
inboxdir => $inboxdir,
@@ -423,6 +424,46 @@ ok(!-d $xap, 'Xapian directories removed again');
], 'msgmap as expected' );
}
+my $check_rethread = sub {
+ my ($desc) = @_;
+ my @warn;
+ local $SIG{__WARN__} = sub { push @warn, @_ };
+ my %config = %$ibx_config;
+ my $ibx = PublicInbox::Inbox->new(\%config);
+ my $f = $ibx->over->{dbh}->sqlite_db_filename;
+ my $over = PublicInbox::OverIdx->new($f);
+ my $dbh = $over->connect;
+ my $non_ghost_tids = sub {
+ $dbh->selectall_arrayref(<<'');
+SELECT tid FROM over WHERE num > 0 ORDER BY tid ASC
+
+ };
+ my $before = $non_ghost_tids->();
+
+ # mess up threading:
+ my $tid = PublicInbox::OverIdx::get_counter($dbh, 'thread');
+ my $nr = $dbh->do('UPDATE over SET tid = ?', undef, $tid);
+ diag "messing up all threads with tid=$tid";
+
+ my $v2w = PublicInbox::V2Writable->new($ibx);
+ my @pr;
+ my $pr = sub { push @pr, @_ };
+ $v2w->index_sync({reindex => 1, rethread => 1, -progress => $pr});
+ # diag "@pr"; # nobody cares
+ is_deeply(\@warn, [], 'no warnings on reindex + rethread');
+
+ my @n = $dbh->selectrow_array(<<EOS, undef, $tid);
+SELECT COUNT(*) FROM over WHERE tid <= ?
+EOS
+ is_deeply(\@n, [ 0 ], 'rethread dropped old threadids');
+ my $after = $non_ghost_tids->();
+ ok($after->[0]->[0] > $before->[-1]->[0],
+ 'all tids greater than before');
+ is(scalar @$after, scalar @$before, 'thread count unchanged');
+};
+
+$check_rethread->('no-monster');
+
# A real example from linux-renesas-soc on lore where a 3-headed monster
# of a message has 3 sets of common headers. Another normal message
# previously existed with a single Message-ID that conflicts with one
@@ -497,4 +538,8 @@ EOF
is_deeply([values %uniq], [3], 'search on different subjects');
}
+# XXX: not deterministic when dealing with ambiguous messages, oh well
+$check_rethread->('3-headed-monster once');
+$check_rethread->('3-headed-monster twice');
+
done_testing();
^ permalink raw reply related [relevance 4%]
Results 1-2 of 2 | reverse | options above
-- pct% links below jump to the message on this page, permalinks otherwise --
2020-07-24 5:55 7% [PATCH 00/20] indexing changes and new features Eric Wong
2020-07-24 5:55 4% ` [PATCH 01/20] index: support --rethread switch to fix old indices Eric Wong
Code repositories for project(s) associated with this public inbox
https://80x24.org/public-inbox.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).