user/dev discussion of public-inbox itself
 help / color / mirror / code / Atom feed
Search results ordered by [date|relevance]  view[summary|nested|Atom feed]
thread overview below | download mbox.gz: |
* [PATCH 00/20] indexing changes and new features
@ 2020-07-24  5:55  7% Eric Wong
  2020-07-24  5:55  4% ` [PATCH 01/20] index: support --rethread switch to fix old indices Eric Wong
  0 siblings, 1 reply; 2+ results
From: Eric Wong @ 2020-07-24  5:55 UTC (permalink / raw)
  To: meta

--rethread and --no-sync options are now supported in
public-inbox-index.  --no-sync should be nice for users
of FSes with poor fsync(2) performance.

Now: I also wonder if --no-sync is a bad name since we
also use it for to mean synchronising indices.  Perhaps
--no-fsync would be a better name, though technically
SQLite and Xapian use fdatasync(2), nowadays.

Some of this is prep work for exposing THREADID via IMAP (and
JMAP) to aid in searching.

Since THREADID (`over.tid') will be exposed in a user-visible
way, I'm finally giving up on using the default (reverse
chronological) log order for indexing to ensure THREADID
ascends for newer threads.

This also simplifies the indexing code significantly.
To avoid pinning huge amounts of RAM, the working space is held
in a IdxStack temporary file.  This further simplifies our code
since we no longer have to worry about old that did not use
Xapian w/o FD_CLOEXEC.

There's still more work on the horizon, here...

Eric Wong (20):
  index: support --rethread switch to fix old indices
  v2: index forwards (via `git log --reverse')
  v2writable: introduce idx_stack
  v2writable: index_sync: reduce fill_alternates calls
  v2writable: move {autime} and {cotime} into $sync state
  v2writable: allow >= 40 byte git object IDs
  v2writable: drop "EPOCH.git indexing $RANGE" progress message
  use consistent {ibx} field for writable code paths
  search: avoid copying {inboxdir}
  v2writable: use read-only PublicInbox::Git for cat_file
  v2writable: get rid of {reindex_pipe} field
  v2writable: clarify "epoch" for {last_commits}
  xapcmd: set {from} properly for v1 inboxes
  searchidx: rename _xdb_{acquire,release} => idx_
  searchidx: make v1 indexing closer to v2
  index+xcpdb: support --no-sync flag
  v2writable: share log2stack code with v1
  searchidx: support async git check
  searchidx: $batch_cb => v1_checkpoint
  v2writable: {unindexed} belongs in $sync state

 Documentation/public-inbox-index.pod |  30 +-
 Documentation/public-inbox-xcpdb.pod |   6 +
 MANIFEST                             |   3 +-
 lib/PublicInbox/Git.pm               |  72 ++++-
 lib/PublicInbox/IdxStack.pm          |  52 ++++
 lib/PublicInbox/Import.pm            |   6 +-
 lib/PublicInbox/Msgmap.pm            |  21 +-
 lib/PublicInbox/MultiMidQueue.pm     |  62 ----
 lib/PublicInbox/Over.pm              |   1 +
 lib/PublicInbox/OverIdx.pm           |  78 ++++-
 lib/PublicInbox/Search.pm            |  25 +-
 lib/PublicInbox/SearchIdx.pm         | 384 ++++++++++++------------
 lib/PublicInbox/SearchIdxShard.pm    |  12 +-
 lib/PublicInbox/Smsg.pm              |   8 +-
 lib/PublicInbox/V2Writable.pm        | 427 +++++++++------------------
 lib/PublicInbox/Xapcmd.pm            |  10 +-
 script/public-inbox-index            |   5 +-
 script/public-inbox-xcpdb            |   4 +-
 t/idx_stack.t                        |  56 ++++
 t/inbox_idle.t                       |   4 +-
 t/search.t                           |   4 +-
 t/v1reindex.t                        |  36 ++-
 t/v2reindex.t                        |  45 +++
 23 files changed, 744 insertions(+), 607 deletions(-)
 create mode 100644 lib/PublicInbox/IdxStack.pm
 delete mode 100644 lib/PublicInbox/MultiMidQueue.pm
 create mode 100644 t/idx_stack.t

^ permalink raw reply	[relevance 7%]

* [PATCH 01/20] index: support --rethread switch to fix old indices
  2020-07-24  5:55  7% [PATCH 00/20] indexing changes and new features Eric Wong
@ 2020-07-24  5:55  4% ` Eric Wong
  0 siblings, 0 replies; 2+ results
From: Eric Wong @ 2020-07-24  5:55 UTC (permalink / raw)
  To: meta

Older versions of public-inbox < 1.3.0 had subtly
different semantics around threading in some corner
cases.  This switch (when combined with --reindex)
allows us to fix them by regenerating associations.
---
 Documentation/public-inbox-index.pod | 23 +++++++--
 lib/PublicInbox/OverIdx.pm           | 76 ++++++++++++++++++++++++++--
 lib/PublicInbox/SearchIdx.pm         |  7 ++-
 lib/PublicInbox/V2Writable.pm        |  4 +-
 script/public-inbox-index            |  2 +-
 t/v1reindex.t                        | 34 +++++++++++++
 t/v2reindex.t                        | 45 ++++++++++++++++
 7 files changed, 177 insertions(+), 14 deletions(-)

diff --git a/Documentation/public-inbox-index.pod b/Documentation/public-inbox-index.pod
index ff2e54867..08f2fbf45 100644
--- a/Documentation/public-inbox-index.pod
+++ b/Documentation/public-inbox-index.pod
@@ -68,12 +68,25 @@ Xapian database.  Using this with C<--compact> or running
 L<public-inbox-compact(1)> afterwards is recommended to
 release free space.
 
-public-inbox protects writes to various indices with L<flock(2)>,
-so it is safe to reindex while L<public-inbox-watch(1)>,
-L<public-inbox-mda(1)> or L<public-inbox-learn(1)> run.
+public-inbox protects writes to various indices with
+L<flock(2)>, so it is safe to reindex (and rethread) while
+L<public-inbox-watch(1)>, L<public-inbox-mda(1)> or
+L<public-inbox-learn(1)> run.
 
-This does not touch the NNTP article number database or
-affect threading.
+This does not touch the NNTP article number database.
+It does not affect threading unless C<--rethread> is
+used.
+
+=item --rethread
+
+Regenerate internal THREADID and message thread associations
+when reindexing.
+
+This fixes some bugs in older versions of public-inbox.  While
+it is possible to use this without C<--reindex>, it makes little
+sense to do so.
+
+Available in public-inbox 1.6.0 (PENDING).
 
 =item --prune
 
diff --git a/lib/PublicInbox/OverIdx.pm b/lib/PublicInbox/OverIdx.pm
index 5601e602c..c57be7243 100644
--- a/lib/PublicInbox/OverIdx.pm
+++ b/lib/PublicInbox/OverIdx.pm
@@ -17,6 +17,7 @@ use PublicInbox::MID qw/id_compress mids_for_index references/;
 use PublicInbox::Smsg qw(subject_normalized);
 use Compress::Zlib qw(compress);
 use PublicInbox::Search;
+use Carp qw(croak);
 
 sub dbh_new {
 	my ($self) = @_;
@@ -37,6 +38,13 @@ sub dbh_new {
 	$dbh;
 }
 
+sub new {
+	my ($class, $f) = @_;
+	my $self = $class->SUPER::new($f);
+	$self->{min_tid} = 0;
+	$self;
+}
+
 sub get_counter ($$) {
 	my ($dbh, $key) = @_;
 	my $sth = $dbh->prepare_cached(<<'', undef, 1);
@@ -164,8 +172,12 @@ sub _resolve_mid_to_tid {
 	my $cur_tid = $smsg->{tid};
 	if (defined $$tid) {
 		merge_threads($self, $$tid, $cur_tid);
-	} else {
+	} elsif ($cur_tid > $self->{min_tid}) {
 		$$tid = $cur_tid;
+	} else { # rethreading, queue up dead ghosts
+		$$tid = next_tid($self);
+		my $num = $smsg->{num};
+		push(@{$self->{-ghosts_to_delete}}, $num) if $num < 0;
 	}
 	1;
 }
@@ -175,7 +187,10 @@ sub resolve_mid_to_tid {
 	my ($self, $mid) = @_;
 	my $tid;
 	each_by_mid($self, $mid, ['tid'], \&_resolve_mid_to_tid, \$tid);
-	defined $tid ? $tid : create_ghost($self, $mid);
+	if (my $del = delete $self->{-ghosts_to_delete}) {
+		delete_by_num($self, $_) for @$del;
+	}
+	$tid // create_ghost($self, $mid);
 }
 
 sub create_ghost {
@@ -221,7 +236,7 @@ sub link_refs {
 			merge_threads($self, $tid, $ptid);
 		}
 	} else {
-		$tid = defined $old_tid ? $old_tid : next_tid($self);
+		$tid = $old_tid // next_tid($self);
 	}
 	$tid;
 }
@@ -278,10 +293,17 @@ sub _add_over {
 	my $cur_tid = $smsg->{tid};
 	my $n = $smsg->{num};
 	die "num must not be zero for $mid" if !$n;
-	$$old_tid = $cur_tid unless defined $$old_tid;
+	my $cur_valid = $cur_tid > $self->{min_tid};
+
 	if ($n > 0) { # regular mail
-		merge_threads($self, $$old_tid, $cur_tid);
+		if ($cur_valid) {
+			$$old_tid //= $cur_tid;
+			merge_threads($self, $$old_tid, $cur_tid);
+		} else {
+			$$old_tid //= next_tid($self);
+		}
 	} elsif ($n < 0) { # ghost
+		$$old_tid //= $cur_valid ? $cur_tid : next_tid($self);
 		link_refs($self, $refs, $$old_tid);
 		delete_by_num($self, $n);
 		$$v++;
@@ -297,6 +319,7 @@ sub add_over {
 
 	begin_lazy($self);
 	delete_by_num($self, $num, \$old_tid);
+	$old_tid = undef if ($old_tid // 0) <= $self->{min_tid};
 	foreach my $mid (@$mids) {
 		my $v = 0;
 		each_by_mid($self, $mid, ['tid'], \&_add_over,
@@ -456,4 +479,47 @@ sub create {
 	$self->disconnect;
 }
 
+sub rethread_prepare {
+	my ($self, $opt) = @_;
+	return unless $opt->{rethread};
+	begin_lazy($self);
+	my $min = $self->{min_tid} = get_counter($self->{dbh}, 'thread') // 0;
+	my $pr = $opt->{-progress};
+	$pr->("rethread min THREADID ".($min + 1)."\n") if $pr && $min;
+}
+
+sub rethread_done {
+	my ($self, $opt) = @_;
+	return unless $opt->{rethread} && $self->{txn};
+	defined(my $min = $self->{min_tid}) or croak('BUG: no min_tid');
+	my $dbh = $self->{dbh} or croak('BUG: no dbh');
+	my $rows = $dbh->selectall_arrayref(<<'', { Slice => {} }, $min);
+SELECT num,tid FROM over WHERE num < 0 AND tid < ?
+
+	my $show_id = $dbh->prepare('SELECT id FROM id2num WHERE num = ?');
+	my $show_mid = $dbh->prepare('SELECT mid FROM msgid WHERE id = ?');
+	my $pr = $opt->{-progress};
+	my $total = 0;
+	for my $r (@$rows) {
+		my $exp = 0;
+		$show_id->execute($r->{num});
+		while (defined(my $id = $show_id->fetchrow_array)) {
+			++$exp;
+			$show_mid->execute($id);
+			my $mid = $show_mid->fetchrow_array;
+			if (!defined($mid)) {
+				warn <<EOF;
+E: ghost NUM=$r->{num} ID=$id THREADID=$r->{tid} has no Message-ID
+EOF
+				next;
+			}
+			$pr->(<<EOM) if $pr;
+I: ghost $r->{num} <$mid> THREADID=$r->{tid} culled
+EOM
+		}
+		delete_by_num($self, $r->{num});
+	}
+	$pr->("I: rethread culled $total ghosts\n") if $pr && $total;
+}
+
 1;
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 831625090..e641ffd43 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -723,6 +723,7 @@ sub _index_sync {
 	my $pr = $opts->{-progress};
 
 	my $xdb = $self->begin_txn_lazy;
+	$self->{over}->rethread_prepare($opts);
 	my $mm = _msgmap_init($self);
 	do {
 		$xlog = undef; # stop previous git-log via SIGPIPE
@@ -761,12 +762,14 @@ sub _index_sync {
 				$xdb->set_metadata('last_commit', $newest);
 			}
 		}
+
+		$self->{over}->rethread_done($opts) if $newest; # all done
 		$self->commit_txn_lazy;
 		$git->cleanup;
 		$xdb = _xdb_release($self, $nr);
-		# let another process do some work... <
+		# let another process do some work...
 		$pr->("indexed $nr/$self->{ntodo}\n") if $pr && $nr;
-		if (!$newest) {
+		if (!$newest) { # more to come
 			$xdb = $self->begin_txn_lazy;
 			$dbh->begin_work if $dbh;
 		}
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 0582dd5e3..16556ddc2 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -1308,6 +1308,7 @@ sub index_sync {
 	my $latest = git_dir_latest($self, \$epoch_max);
 	return unless defined $latest;
 	$self->idx_init($opt); # acquire lock
+	$self->{over}->rethread_prepare($opt);
 	my $sync = {
 		D => {}, # "$mid\0$chash" => $oid
 		unindex_range => {}, # EPOCH => oid_old..oid_new
@@ -1370,12 +1371,13 @@ sub index_sync {
 		my $pr = $sync->{-opt}->{-progress};
 		$pr->('all.git '.sprintf($sync->{-regen_fmt}, $nr)) if $pr;
 	}
+	$self->{over}->rethread_done($opt);
 
 	# reindex does not pick up new changes, so we rerun w/o it:
 	if ($opt->{reindex}) {
 		my %again = %$opt;
 		$sync = undef;
-		delete @again{qw(reindex -skip_lock)};
+		delete @again{qw(rethread reindex -skip_lock)};
 		index_sync($self, \%again);
 	}
 }
diff --git a/script/public-inbox-index b/script/public-inbox-index
index 6217fb86c..2e1934b08 100755
--- a/script/public-inbox-index
+++ b/script/public-inbox-index
@@ -15,7 +15,7 @@ use PublicInbox::Xapcmd;
 
 my $compact_opt;
 my $opt = { quiet => -1, compact => 0, maxsize => undef };
-GetOptions($opt, qw(verbose|v+ reindex compact|c+ jobs|j=i prune
+GetOptions($opt, qw(verbose|v+ reindex rethread compact|c+ jobs|j=i prune
 		indexlevel|L=s maxsize|max-size=s batchsize|batch-size=s))
 	or die "bad command-line args\n$usage";
 die "--jobs must be >= 0\n" if defined $opt->{jobs} && $opt->{jobs} < 0;
diff --git a/t/v1reindex.t b/t/v1reindex.t
index 9f23ef01e..8cb751881 100644
--- a/t/v1reindex.t
+++ b/t/v1reindex.t
@@ -11,6 +11,7 @@ require_git(2.6);
 require_mods(qw(DBD::SQLite Search::Xapian));
 use_ok 'PublicInbox::SearchIdx';
 use_ok 'PublicInbox::Import';
+use_ok 'PublicInbox::OverIdx';
 my ($inboxdir, $for_destroy) = tmpdir();
 my $ibx_config = {
 	inboxdir => $inboxdir,
@@ -427,5 +428,38 @@ ok(!-d $xap, 'Xapian directories removed again');
 		  ], 'msgmap as expected' );
 }
 
+{
+	my @warn;
+	local $SIG{__WARN__} = sub { push @warn, @_ };
+	my $ibx = PublicInbox::Inbox->new({ %$ibx_config });
+	my $f = $ibx->over->{dbh}->sqlite_db_filename;
+	my $over = PublicInbox::OverIdx->new($f);
+	my $dbh = $over->connect;
+	my $non_ghost_tids = sub {
+		$dbh->selectall_arrayref(<<'');
+SELECT tid FROM over WHERE num > 0 ORDER BY tid ASC
+
+	};
+	my $before = $non_ghost_tids->();
+
+	# mess up threading:
+	my $tid = PublicInbox::OverIdx::get_counter($dbh, 'thread');
+	my $nr = $dbh->do('UPDATE over SET tid = ?', undef, $tid);
+
+	my $rw = PublicInbox::SearchIdx->new($ibx, 1);
+	my @pr;
+	my $pr = sub { push @pr, @_ };
+	$rw->index_sync({reindex => 1, rethread => 1, -progress => $pr });
+	my @n = $dbh->selectrow_array(<<EOS, undef, $tid);
+SELECT COUNT(*) FROM over WHERE tid <= ?
+EOS
+	is_deeply(\@n, [ 0 ], 'rethread dropped old threadids');
+	my $after = $non_ghost_tids->();
+	ok($after->[0]->[0] > $before->[-1]->[0],
+		'all tids greater than before');
+	is(scalar @$after, scalar @$before, 'thread count unchanged');
+	is_deeply([], \@warn, 'no warnings');
+	# diag "@pr"; # XXX do we care?
+}
 
 done_testing();
diff --git a/t/v2reindex.t b/t/v2reindex.t
index 77deffb4b..ea2b24e59 100644
--- a/t/v2reindex.t
+++ b/t/v2reindex.t
@@ -10,6 +10,7 @@ use PublicInbox::TestCommon;
 require_git(2.6);
 require_mods(qw(DBD::SQLite Search::Xapian));
 use_ok 'PublicInbox::V2Writable';
+use_ok 'PublicInbox::OverIdx';
 my ($inboxdir, $for_destroy) = tmpdir();
 my $ibx_config = {
 	inboxdir => $inboxdir,
@@ -423,6 +424,46 @@ ok(!-d $xap, 'Xapian directories removed again');
 		  ], 'msgmap as expected' );
 }
 
+my $check_rethread = sub {
+	my ($desc) = @_;
+	my @warn;
+	local $SIG{__WARN__} = sub { push @warn, @_ };
+	my %config = %$ibx_config;
+	my $ibx = PublicInbox::Inbox->new(\%config);
+	my $f = $ibx->over->{dbh}->sqlite_db_filename;
+	my $over = PublicInbox::OverIdx->new($f);
+	my $dbh = $over->connect;
+	my $non_ghost_tids = sub {
+		$dbh->selectall_arrayref(<<'');
+SELECT tid FROM over WHERE num > 0 ORDER BY tid ASC
+
+	};
+	my $before = $non_ghost_tids->();
+
+	# mess up threading:
+	my $tid = PublicInbox::OverIdx::get_counter($dbh, 'thread');
+	my $nr = $dbh->do('UPDATE over SET tid = ?', undef, $tid);
+	diag "messing up all threads with tid=$tid";
+
+	my $v2w = PublicInbox::V2Writable->new($ibx);
+	my @pr;
+	my $pr = sub { push @pr, @_ };
+	$v2w->index_sync({reindex => 1, rethread => 1, -progress => $pr});
+	# diag "@pr"; # nobody cares
+	is_deeply(\@warn, [], 'no warnings on reindex + rethread');
+
+	my @n = $dbh->selectrow_array(<<EOS, undef, $tid);
+SELECT COUNT(*) FROM over WHERE tid <= ?
+EOS
+	is_deeply(\@n, [ 0 ], 'rethread dropped old threadids');
+	my $after = $non_ghost_tids->();
+	ok($after->[0]->[0] > $before->[-1]->[0],
+		'all tids greater than before');
+	is(scalar @$after, scalar @$before, 'thread count unchanged');
+};
+
+$check_rethread->('no-monster');
+
 # A real example from linux-renesas-soc on lore where a 3-headed monster
 # of a message has 3 sets of common headers.  Another normal message
 # previously existed with a single Message-ID that conflicts with one
@@ -497,4 +538,8 @@ EOF
 	is_deeply([values %uniq], [3], 'search on different subjects');
 }
 
+# XXX: not deterministic when dealing with ambiguous messages, oh well
+$check_rethread->('3-headed-monster once');
+$check_rethread->('3-headed-monster twice');
+
 done_testing();

^ permalink raw reply related	[relevance 4%]

Results 1-2 of 2 | reverse | options above
-- pct% links below jump to the message on this page, permalinks otherwise --
2020-07-24  5:55  7% [PATCH 00/20] indexing changes and new features Eric Wong
2020-07-24  5:55  4% ` [PATCH 01/20] index: support --rethread switch to fix old indices Eric Wong

Code repositories for project(s) associated with this public inbox

	https://80x24.org/public-inbox.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).