user/dev discussion of public-inbox itself
 help / color / mirror / code / Atom feed
* BUG: --reindex broken on multiple Message-IDs reuse
@ 2019-10-16 21:14 Eric Wong
  2019-10-17 11:22 ` [RFC] v2writable: reindex handles 3-headed monsters [was: BUG: --reindex broken on multiple Message-IDs reuse] Eric Wong
  2019-10-21 11:02 ` [PATCH 0/3] fix reindex on multiple + overlapping Message-IDs Eric Wong
  0 siblings, 2 replies; 11+ messages in thread
From: Eric Wong @ 2019-10-16 21:14 UTC (permalink / raw)
  To: meta

Initial indexing went fine, but --reindex fails because it's
trying to use $sync->{regen} when it should not...

So that's a bug in the reindexing logic somewhere that
needs to be fixed w/o disrupting normal indexing...

One of the messages has 3 Message-IDs:
https://lore.kernel.org/linux-renesas-soc/1923946.Jvi0TDUXFC@wasted.cogentembedded.com

:<

Will look into it more later...

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC] v2writable: reindex handles 3-headed monsters [was: BUG: --reindex broken on multiple Message-IDs reuse]
  2019-10-16 21:14 BUG: --reindex broken on multiple Message-IDs reuse Eric Wong
@ 2019-10-17 11:22 ` Eric Wong
  2019-10-22  8:09   ` [RFC/HELP] search: multiple From/To/Cc/Subject (what about Date?) Eric Wong
  2019-10-21 11:02 ` [PATCH 0/3] fix reindex on multiple + overlapping Message-IDs Eric Wong
  1 sibling, 1 reply; 11+ messages in thread
From: Eric Wong @ 2019-10-17 11:22 UTC (permalink / raw)
  To: meta

Eric Wong <e@80x24.org> wrote:
> Initial indexing went fine,

Actually, no.  Initial indexing was broken when dealing with
this horror show.  But I think I can fix that...

> but --reindex fails because it's
> trying to use $sync->{regen} when it should not...

Yes, --reindex can be fixed if initial indexing isn't totally
broken to begin with.  Or, if over.sqlite3 is deleted..

But --reindex won't fix things if initial indexing was
broken....

> So that's a bug in the reindexing logic somewhere that
> needs to be fixed w/o disrupting normal indexing...
> 
> One of the messages has 3 Message-IDs:
> https://lore.kernel.org/linux-renesas-soc/1923946.Jvi0TDUXFC@wasted.cogentembedded.com

So we had broken code trying to workaround broken data :<

Anyways, the following patch is probably OK, but I'm way too
tired and need to do some more testing on existing repos.
Unfortunately, --reindex alone isn't yet strong enough to fix
cases that were already broken...

----------8<------------
Subject: [RFC] v2writable: reindex handles 3-headed monsters

And maybe 8-headed ones, too; but not very gracefully...

I noticed --reindex failing on the linux-renesas-soc mirror due
one 3-headed monster of a message having 3 sets of headers;
while another normal message had a Message-ID that matched one
of the 3 IDs of the 3-headed monster.

Since we do reindexing backwards, we can try to do our best to
preserve NNTP article numbers by matching blob OIDs if the
overview DB exists by re-reindexing.

Some of these may benefit from using List::Util::shuffle
in case we hit infinite loops with reindex_q...

Link: https://public-inbox.org/meta/20191016211415.GA6084@dcvr/
---
 TODO                          |   3 +
 lib/PublicInbox/OverIdx.pm    |  14 +++++
 lib/PublicInbox/V2Writable.pm | 107 ++++++++++++++++++++++++++++------
 t/v2reindex.t                 |  65 +++++++++++++++++++++
 4 files changed, 171 insertions(+), 18 deletions(-)

diff --git a/TODO b/TODO
index a327ca06..61c44a84 100644
--- a/TODO
+++ b/TODO
@@ -133,3 +133,6 @@ all need to be considered for everything we introduce)
   for coderepos
 
 * configurable diff output for solver-generated blobs
+
+* fix search for messages with multiple Subject:/To:/From:/Date:
+  headers (some wacky examples out there...)
diff --git a/lib/PublicInbox/OverIdx.pm b/lib/PublicInbox/OverIdx.pm
index 7fd1905d..1ff435d7 100644
--- a/lib/PublicInbox/OverIdx.pm
+++ b/lib/PublicInbox/OverIdx.pm
@@ -343,6 +343,20 @@ sub remove_oid {
 	$nr;
 }
 
+sub num_for_oid {
+	my ($self, $oid, $mid) = @_;
+	my $num;
+	$self->begin_lazy;
+	each_by_mid($self, $mid, ['ddd'], sub {
+		my ($smsg) = @_;
+		my $blob = $smsg->{blob};
+		return 1 if (!defined($blob) || $blob ne $oid); # continue;
+		$num = $smsg->{num};
+		0; # done
+	});
+	$num;
+}
+
 sub create_tables {
 	my ($dbh) = @_;
 
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 6a88f62a..6e1ac5e3 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -845,6 +845,51 @@ sub mark_deleted ($$$$) {
 	}
 }
 
+sub regen_art_num ($$$$$) {
+	my ($self, $sync, $git, $oid, $mids) = @_;
+	my $num = $sync->{regen}--;
+	my $mm = $self->{mm};
+	if ($num > 0) {
+		foreach my $mid (reverse @$mids) {
+			if ($mm->mid_set($num, $mid) == 1) {
+				return ($num, $mid);
+			}
+		}
+
+		# nope, not using the regen number because either we
+		# have or overlap Message-IDs with another message.
+		$sync->{regen}++;
+		$num = undef;
+	}
+
+	# --reindex was broken when Message-IDs were reused with a
+	# message that already had multiple Message-IDs.  This causes
+	# failures when using --reindex again.  The least bad option is
+	# to show messages twice to NNTP clients, rather than lose/drop
+	# messages entirely.
+	#
+	# create a new article number?
+	for my $mid (@$mids) {
+		$num = $self->{mm}->mid_insert($mid);
+		return ($num, $mid) if defined $num;
+	}
+
+	# swap article numbers with a previously regenerated (newer message)
+	# which has an overlapping Message-ID.  The current $oid is older
+	# than $smsg->{blob} in git history since we regen walks backwards.
+	my $reindex_q = $sync->{reindex_q} //= [];
+	for my $mid (@$mids) {
+		my $n = $self->{mm}->num_for($mid);
+		next unless defined $n;
+		if (my $smsg = $self->{over}->get_art($n)) {
+			push @$reindex_q, $smsg->{blob};
+			return ($n, $mid);
+		}
+	}
+	warn "W: ran out of article numbers on $oid\n";
+	(undef, undef);
+}
+
 sub reindex_oid ($$$$) {
 	my ($self, $sync, $git, $oid) = @_;
 	my $len;
@@ -853,31 +898,43 @@ sub reindex_oid ($$$$) {
 	my $mids = mids($mime->header_obj);
 	my $cid = content_id($mime);
 
-	# get the NNTP article number we used before, highest number wins
-	# and gets deleted from sync->{mm_tmp};
+	# get the NNTP article number we used before, which gets deleted
+	# from sync->{mm_tmp};
 	my $mid0;
 	my $num = -1;
 	my $del = 0;
-	foreach my $mid (@$mids) {
-		$del += delete($sync->{D}->{"$mid\0$cid"}) ? 1 : 0;
-		my $n = $sync->{mm_tmp}->num_for($mid);
-		if (defined $n && $n > $num) {
-			$mid0 = $mid;
-			$num = $n;
-			$self->{mm}->mid_set($num, $mid0);
+	my $reindex = $sync->{reindex};
+	if ($reindex) {
+		# see if it's already in the overview DB, but keep in mind
+		# --reindex may be used blindly w/o overview DB.
+		foreach my $mid (@$mids) {
+			$del += delete($sync->{D}->{"$mid\0$cid"}) ? 1 : 0;
+			my $n = $self->{over}->num_for_oid($oid, $mid);
+			if (defined $n) {
+				($num, $mid0) = ($n, $mid);
+				$self->{mm}->mid_set($num, $mid0);
+				last; # yay! reused
+			}
 		}
 	}
-	if (!defined($mid0) && !$del) {
-		$num = $sync->{regen}--;
-		die "BUG: ran out of article numbers\n" if $num <= 0;
-		my $mm = $self->{mm};
-		foreach my $mid (reverse @$mids) {
-			if ($mm->mid_set($num, $mid) == 1) {
-				$mid0 = $mid;
-				last;
+
+	# not reindexing, or reindexing with broken/incomplete overview DB:
+	if (!defined($mid0)) {
+		# highest number wins for unseen messages
+		foreach my $mid (@$mids) {
+			$del += delete($sync->{D}->{"$mid\0$cid"}) ? 1 : 0;
+			my $n = $sync->{mm_tmp}->num_for($mid);
+			if (defined $n && $n > $num) {
+				($num, $mid0) = ($n, $mid);
+				$self->{mm}->mid_set($num, $mid0);
 			}
 		}
+	}
+
+	if (!defined($mid0) && !$del) {
+		($num, $mid0) = regen_art_num($self, $sync, $git, $oid, $mids);
 		if (!defined($mid0)) {
+			my $mm = $self->{mm};
 			my $id = '<' . join('> <', @$mids) . '>';
 			warn "Message-ID $id unusable for $num\n";
 			foreach my $mid (@$mids) {
@@ -890,7 +947,11 @@ sub reindex_oid ($$$$) {
 	if (!defined($mid0) || $del) {
 		if (!defined($mid0) && $del) { # expected for deletes
 			$num = $sync->{regen}--;
-			$self->{mm}->num_highwater($num) if !$sync->{reindex};
+			if (!$reindex) {
+				($num <= 0) and
+					die "BUG: ran out of article numbers\n";
+				$self->{mm}->num_highwater($num);
+			}
 			return
 		}
 
@@ -1147,6 +1208,16 @@ sub index_epoch ($$$) {
 		} elsif (/\A:\d{6} 100644 $x40 ($x40) [AM]\td$/o) {
 			mark_deleted($self, $sync, $git, $1);
 		}
+
+		while (my $reindex_q = delete $sync->{reindex_q}) {
+			my $all = $self->{-inbox}->git;
+			for my $oid (@$reindex_q) {
+				$self->{current_info} = "$i.git $oid";
+				warn "reindexing to shuffle article numbers\n";
+				reindex_oid($self, $sync, $all, $oid);
+			}
+			$all->cleanup;
+		}
 	}
 	$fh = undef;
 	delete $self->{reindex_pipe};
diff --git a/t/v2reindex.t b/t/v2reindex.t
index 7c5a6b07..5de067c6 100644
--- a/t/v2reindex.t
+++ b/t/v2reindex.t
@@ -431,4 +431,69 @@ ok(!-d $xap, 'Xapian directories removed again');
 		  ], 'msgmap as expected' );
 }
 
+# A real example from linux-renesas-soc on lore where a 3-headed monster
+# of a message has 3 sets of common headers.  Another normal message
+# previously existed with a single Message-ID that conflicts with one
+# of the Message-IDs in the 3-headed monster.
+{
+	my @warn;
+	local $SIG{__WARN__} = sub { push @warn, @_ };
+	my %config = %$ibx_config;
+	$config{indexlevel} = 'basic';
+	my $ibx = PublicInbox::Inbox->new(\%config);
+	my $im = PublicInbox::V2Writable->new($ibx);
+	my $m3 = PublicInbox::MIME->new(<<'EOF');
+Date: Tue, 24 May 2016 14:34:22 -0700 (PDT)
+Message-Id: <20160524.143422.552507610109476444.d@example.com>
+To: t@example.com
+Cc: c@example.com
+Subject: Re: [PATCH v2 2/2]
+From: <f@example.com>
+In-Reply-To: <1463825855-7363-2-git-send-email-y@example.com>
+References: <1463825855-7363-1-git-send-email-y@example.com>
+	<1463825855-7363-2-git-send-email-y@example.com>
+Date: Wed, 25 May 2016 10:01:51 +0900
+From: h@example.com
+To: g@example.com
+Cc: m@example.com
+Subject: Re: [PATCH]
+Message-ID: <20160525010150.GD7292@example.com>
+References: <1463498133-23918-1-git-send-email-g+r@example.com>
+In-Reply-To: <1463498133-23918-1-git-send-email-g+r@example.com>
+From: s@example.com
+To: h@example.com
+Cc: m@example.com
+Subject: [PATCH 12/13]
+Date: Wed, 01 Jun 2016 01:32:35 +0300
+Message-ID: <1923946.Jvi0TDUXFC@wasted.example.com>
+In-Reply-To: <13205049.n7pM8utpHF@wasted.example.com>
+References: <13205049.n7pM8utpHF@wasted.example.com>
+
+Somehow we got a message with 3 sets of headers into one
+message, could've been something broken on the archiver side.
+EOF
+
+	my $m1 = PublicInbox::MIME->new(<<'EOF');
+From: a@example.com
+To: t@example.com
+Subject: [PATCH 12/13]
+Date: Wed, 01 Jun 2016 01:32:35 +0300
+Message-ID: <1923946.Jvi0TDUXFC@wasted.example.com>
+In-Reply-To: <13205049.n7pM8utpHF@wasted.example.com>
+References: <13205049.n7pM8utpHF@wasted.example.com>
+
+This is probably one of the original messages
+
+EOF
+	$im->add($m1);
+	$im->add($m3);
+	$im->done;
+	remove_tree($xap);
+	eval { $im->index_sync() };
+	is($@, '', 'no error from initial indexing');
+	eval { $im->index_sync({reindex=>1}) };
+	is($@, '', 'no error from reindexing after reused Message-ID (x3)');
+	is_deeply(\@warn, [], 'no warnings');
+}
+
 done_testing();

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 0/3] fix reindex on multiple + overlapping Message-IDs
  2019-10-16 21:14 BUG: --reindex broken on multiple Message-IDs reuse Eric Wong
  2019-10-17 11:22 ` [RFC] v2writable: reindex handles 3-headed monsters [was: BUG: --reindex broken on multiple Message-IDs reuse] Eric Wong
@ 2019-10-21 11:02 ` Eric Wong
  2019-10-21 11:02   ` [PATCH 1/3] v2writable: set unindexed article number Eric Wong
                     ` (3 more replies)
  1 sibling, 4 replies; 11+ messages in thread
From: Eric Wong @ 2019-10-21 11:02 UTC (permalink / raw)
  To: meta

This fixes my initial problems around --reindex not picking up
previously-skipped messages, and without having to add extra
command-line options.

It took several iterations and I considered relying on
"git log --reverse" since v2 epochs are limited in size
compared to v1.  But I'm glad we won't have to rely on the
expensive --reverse option and only internally reverse a
few rare messages which have multiple Message-IDs.

I'm actually satisfied with this because it doesn't involve
adding new switches to workaround buggy behavior, since
I considered (and rejected) a "--renumber" option.

original RFC: https://public-inbox.org/meta/20191017112215.GA13175@dcvr/

I'm still testing this with various repos, but the
linux-renesas-soc mirror from lore seems good when doing initial
indexing with an old version of this and using --reindex with
this patch, as all previously-missed messages are now indexed.

Eric Wong (3):
  v2writable: set unindexed article number
  v2writable: improve "num_for" API and disambiguate
  v2writable: reindex handles 3-headered monsters

 TODO                          |   3 +
 lib/PublicInbox/OverIdx.pm    |  14 ++
 lib/PublicInbox/V2Writable.pm | 257 +++++++++++++++++++++++-----------
 t/v2reindex.t                 |  66 +++++++++
 4 files changed, 256 insertions(+), 84 deletions(-)

Interdiff:
diff --git a/lib/PublicInbox/OverIdx.pm b/lib/PublicInbox/OverIdx.pm
index 1ff435d7..64277342 100644
--- a/lib/PublicInbox/OverIdx.pm
+++ b/lib/PublicInbox/OverIdx.pm
@@ -343,18 +343,18 @@ sub remove_oid {
 	$nr;
 }
 
-sub num_for_oid {
+sub num_mid0_for_oid {
 	my ($self, $oid, $mid) = @_;
-	my $num;
+	my ($num, $mid0);
 	$self->begin_lazy;
 	each_by_mid($self, $mid, ['ddd'], sub {
 		my ($smsg) = @_;
 		my $blob = $smsg->{blob};
 		return 1 if (!defined($blob) || $blob ne $oid); # continue;
-		$num = $smsg->{num};
+		($num, $mid0) = ($smsg->{num}, $smsg->{mid});
 		0; # done
 	});
-	$num;
+	($num, $mid0);
 }
 
 sub create_tables {
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 6e1ac5e3..7ece6b01 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -155,8 +155,7 @@ sub _add {
 	# leaking FDs to it...
 	$self->idx_init;
 
-	my $mid0;
-	my $num = num_for($self, $mime, \$mid0);
+	my ($num, $mid0) = v2_num_for($self, $mime);
 	defined $num or return; # duplicate
 	defined $mid0 or die "BUG: $mid0 undefined\n";
 	my $im = $self->importer;
@@ -172,16 +171,15 @@ sub _add {
 	$cmt;
 }
 
-sub num_for {
-	my ($self, $mime, $mid0) = @_;
+sub v2_num_for {
+	my ($self, $mime) = @_;
 	my $mids = mids($mime->header_obj);
 	if (@$mids) {
 		my $mid = $mids->[0];
 		my $num = $self->{mm}->mid_insert($mid);
 		if (defined $num) { # common case
-			$$mid0 = $mid;
-			return $num;
-		};
+			return ($num, $mid);
+		}
 
 		# crap, Message-ID is already known, hope somebody just resent:
 		foreach my $m (@$mids) {
@@ -190,7 +188,7 @@ sub num_for {
 			# easy, don't store duplicates
 			# note: do not add more diagnostic info here since
 			# it gets noisy on public-inbox-watch restarts
-			return if $existing;
+			return () if $existing;
 		}
 
 		# AltId may pre-populate article numbers (e.g. X-Mail-Count
@@ -201,8 +199,7 @@ sub num_for {
 			my $num = $self->{mm}->num_for($mid);
 
 			if (defined $num && !$self->{over}->get_art($num)) {
-				$$mid0 = $mid;
-				return $num;
+				return ($num, $mid);
 			}
 		}
 
@@ -215,39 +212,38 @@ sub num_for {
 			$num = $self->{mm}->mid_insert($m);
 			if (defined $num) {
 				warn "alternative <$m> for <$mid> found\n";
-				$$mid0 = $m;
-				return $num;
+				return ($num, $m);
 			}
 		}
 	}
 	# none of the existing Message-IDs are good, generate a new one:
-	num_for_harder($self, $mime, $mid0);
+	v2_num_for_harder($self, $mime);
 }
 
-sub num_for_harder {
-	my ($self, $mime, $mid0) = @_;
+sub v2_num_for_harder {
+	my ($self, $mime) = @_;
 
 	my $hdr = $mime->header_obj;
 	my $dig = content_digest($mime);
-	$$mid0 = PublicInbox::Import::digest2mid($dig, $hdr);
-	my $num = $self->{mm}->mid_insert($$mid0);
+	my $mid0 = PublicInbox::Import::digest2mid($dig, $hdr);
+	my $num = $self->{mm}->mid_insert($mid0);
 	unless (defined $num) {
 		# it's hard to spoof the last Received: header
 		my @recvd = $hdr->header_raw('Received');
 		$dig->add("Received: $_") foreach (@recvd);
-		$$mid0 = PublicInbox::Import::digest2mid($dig, $hdr);
-		$num = $self->{mm}->mid_insert($$mid0);
+		$mid0 = PublicInbox::Import::digest2mid($dig, $hdr);
+		$num = $self->{mm}->mid_insert($mid0);
 
 		# fall back to a random Message-ID and give up determinism:
 		until (defined($num)) {
 			$dig->add(rand);
-			$$mid0 = PublicInbox::Import::digest2mid($dig, $hdr);
-			warn "using random Message-ID <$$mid0> as fallback\n";
-			$num = $self->{mm}->mid_insert($$mid0);
+			$mid0 = PublicInbox::Import::digest2mid($dig, $hdr);
+			warn "using random Message-ID <$mid0> as fallback\n";
+			$num = $self->{mm}->mid_insert($mid0);
 		}
 	}
-	PublicInbox::Import::append_mid($hdr, $$mid0);
-	$num;
+	PublicInbox::Import::append_mid($hdr, $mid0);
+	($num, $mid0);
 }
 
 sub idx_shard {
@@ -845,142 +841,157 @@ sub mark_deleted ($$$$) {
 	}
 }
 
-sub regen_art_num ($$$$$) {
-	my ($self, $sync, $git, $oid, $mids) = @_;
-	my $num = $sync->{regen}--;
-	my $mm = $self->{mm};
-	if ($num > 0) {
-		foreach my $mid (reverse @$mids) {
-			if ($mm->mid_set($num, $mid) == 1) {
-				return ($num, $mid);
-			}
-		}
+sub reindex_checkpoint ($$$) {
+	my ($self, $sync, $git) = @_;
 
-		# nope, not using the regen number because either we
-		# have or overlap Message-IDs with another message.
-		$sync->{regen}++;
-		$num = undef;
-	}
-
-	# --reindex was broken when Message-IDs were reused with a
-	# message that already had multiple Message-IDs.  This causes
-	# failures when using --reindex again.  The least bad option is
-	# to show messages twice to NNTP clients, rather than lose/drop
-	# messages entirely.
-	#
-	# create a new article number?
-	for my $mid (@$mids) {
-		$num = $self->{mm}->mid_insert($mid);
-		return ($num, $mid) if defined $num;
-	}
-
-	# swap article numbers with a previously regenerated (newer message)
-	# which has an overlapping Message-ID.  The current $oid is older
-	# than $smsg->{blob} in git history since we regen walks backwards.
-	my $reindex_q = $sync->{reindex_q} //= [];
-	for my $mid (@$mids) {
-		my $n = $self->{mm}->num_for($mid);
-		next unless defined $n;
-		if (my $smsg = $self->{over}->get_art($n)) {
-			push @$reindex_q, $smsg->{blob};
-			return ($n, $mid);
-		}
+	$git->cleanup;
+	$sync->{mm_tmp}->atfork_prepare;
+	$self->done; # release lock
+
+	if (my $pr = $sync->{-opt}->{-progress}) {
+		my ($bn) = (split('/', $git->{git_dir}))[-1];
+		$pr->("$bn ".sprintf($sync->{-regen_fmt}, $sync->{nr}));
 	}
-	warn "W: ran out of article numbers on $oid\n";
-	(undef, undef);
+
+	# allow -watch or -mda to write...
+	$self->idx_init; # reacquire lock
+	$sync->{mm_tmp}->atfork_parent;
 }
 
-sub reindex_oid ($$$$) {
+# only for a few odd messages with multiple Message-IDs
+sub reindex_oid_m ($$$$) {
 	my ($self, $sync, $git, $oid) = @_;
-	my $len;
+	my ($num, $mid0, $len);
 	my $msgref = $git->cat_file($oid, \$len);
 	my $mime = PublicInbox::MIME->new($$msgref);
 	my $mids = mids($mime->header_obj);
 	my $cid = content_id($mime);
+	die "BUG: reindex_oid_m called for <=1 mids" if scalar(@$mids) <= 1;
 
-	# get the NNTP article number we used before, which gets deleted
-	# from sync->{mm_tmp};
-	my $mid0;
-	my $num = -1;
-	my $del = 0;
-	my $reindex = $sync->{reindex};
-	if ($reindex) {
-		# see if it's already in the overview DB, but keep in mind
-		# --reindex may be used blindly w/o overview DB.
-		foreach my $mid (@$mids) {
-			$del += delete($sync->{D}->{"$mid\0$cid"}) ? 1 : 0;
-			my $n = $self->{over}->num_for_oid($oid, $mid);
-			if (defined $n) {
-				($num, $mid0) = ($n, $mid);
-				$self->{mm}->mid_set($num, $mid0);
-				last; # yay! reused
-			}
+	for my $mid (reverse @$mids) {
+		delete($sync->{D}->{"$mid\0$cid"}) and
+			die "BUG: reindex_oid should handle <$mid> delete";
+	}
+	for my $mid (reverse @$mids) {
+		($num, $mid0) = $self->{over}->num_mid0_for_oid($oid, $mid);
+		last if defined $num;
+	}
+	unless (defined($num)) {
+		for my $mid (reverse @$mids) {
+			# is this a number we got before?
+			$num = $sync->{mm_tmp}->num_for($mid);
+			next unless defined $num;
+			$mid0 = $mid;
+			last;
 		}
 	}
-
-	# not reindexing, or reindexing with broken/incomplete overview DB:
-	if (!defined($mid0)) {
-		# highest number wins for unseen messages
-		foreach my $mid (@$mids) {
-			$del += delete($sync->{D}->{"$mid\0$cid"}) ? 1 : 0;
-			my $n = $sync->{mm_tmp}->num_for($mid);
-			if (defined $n && $n > $num) {
-				($num, $mid0) = ($n, $mid);
-				$self->{mm}->mid_set($num, $mid0);
+	if (defined($num)) {
+		$sync->{mm_tmp}->num_delete($num);
+	} else {
+		$num = $sync->{regen}--;
+		if ($num <= 0) {
+			# fixup bugs in old mirrors on reindex
+			for my $mid (reverse @$mids) {
+				$num = $self->{mm}->mid_insert($mid);
+				next unless defined $num;
+				$mid0 = $mid;
+				last;
+			}
+			if (defined $mid0) {
+				if ($sync->{reindex}) {
+					warn "reindex added #$num <$mid0>\n";
+				}
+			} else {
+				warn "E: cannot find article #\n";
+				return;
+			}
+		} else { # $num > 0, use the new article number
+			for my $mid (reverse @$mids) {
+				$self->{mm}->mid_set($num, $mid) == 1 or next;
+				$mid0 = $mid;
+				last;
+			}
+			unless (defined $mid0) {
+				warn "E: cannot regen #$num\n";
+				return;
 			}
 		}
 	}
+	$sync->{nr}++;
+	if (do_idx($self, $msgref, $mime, $len, $num, $oid, $mid0)) {
+		reindex_checkpoint($self, $sync, $git);
+	}
+}
 
-	if (!defined($mid0) && !$del) {
-		($num, $mid0) = regen_art_num($self, $sync, $git, $oid, $mids);
-		if (!defined($mid0)) {
-			my $mm = $self->{mm};
-			my $id = '<' . join('> <', @$mids) . '>';
-			warn "Message-ID $id unusable for $num\n";
-			foreach my $mid (@$mids) {
-				defined(my $n = $mm->num_for($mid)) or next;
-				warn "#$n previously mapped for <$mid>\n";
-			}
-		}
+sub check_unindexed ($$$) {
+	my ($self, $num, $mid0) = @_;
+	my $unindexed = $self->{unindexed} // {};
+	my $n = delete($unindexed->{$mid0});
+	defined $n or return;
+	if ($n != $num) {
+		die "BUG: unindexed $n != $num <$mid0>\n";
+	} else {
+		$self->{mm}->mid_set($num, $mid0);
 	}
+}
 
-	if (!defined($mid0) || $del) {
-		if (!defined($mid0) && $del) { # expected for deletes
-			$num = $sync->{regen}--;
-			if (!$reindex) {
-				($num <= 0) and
-					die "BUG: ran out of article numbers\n";
+sub reindex_oid ($$$$) {
+	my ($self, $sync, $git, $oid) = @_;
+	my ($num, $mid0, $len);
+	my $msgref = $git->cat_file($oid, \$len);
+	return if $len == 0; # purged
+	my $mime = PublicInbox::MIME->new($$msgref);
+	my $mids = mids($mime->header_obj);
+	my $cid = content_id($mime);
+
+	if (scalar(@$mids) == 0) {
+		warn "E: $oid has no Message-ID, skipping\n";
+		return;
+	} elsif (scalar(@$mids) == 1) {
+		my $mid = $mids->[0];
+
+		# was the file previously marked as deleted?, skip if so
+		if (delete($sync->{D}->{"$mid\0$cid"})) {
+			if (!$sync->{reindex}) {
+				$num = $sync->{regen}--;
 				$self->{mm}->num_highwater($num);
 			}
-			return
+			return;
 		}
 
-		my $id = '<' . join('> <', @$mids) . '>';
-		defined($mid0) or
-			warn "Skipping $id, no article number found\n";
-		if ($del && defined($mid0)) {
-			warn "$id was deleted $del " .
-				"time(s) but mapped to article #$num\n";
+		# is this a number we got before?
+		$num = $sync->{mm_tmp}->num_for($mid);
+		if (defined $num) {
+			$mid0 = $mid;
+			check_unindexed($self, $num, $mid0);
+		} else {
+			$num = $sync->{regen}--;
+			die "BUG: ran out of article numbers\n" if $num <= 0;
+			if ($self->{mm}->mid_set($num, $mid) != 1) {
+				warn "E: unable to assign $num => <$mid>\n";
+				return;
+			}
+			$mid0 = $mid;
+		}
+	} else { # multiple MIDs are a weird case:
+		my $del = 0;
+		for (@$mids) {
+			$del += delete($sync->{D}->{"$_\0$cid"}) // 0;
+		}
+		if ($del) {
+			unindex_oid_remote($self, $oid, $_) for @$mids;
+			# do not delete from {mm_tmp}, since another
+			# single-MID message may use it.
+		} else { # handle them at the end:
+			push @{$sync->{multi_mid} //= []}, $oid;
 		}
 		return;
-
 	}
 	$sync->{mm_tmp}->mid_delete($mid0) or
 		die "failed to delete <$mid0> for article #$num\n";
 	$sync->{nr}++;
 	if (do_idx($self, $msgref, $mime, $len, $num, $oid, $mid0)) {
-		$git->cleanup;
-		$sync->{mm_tmp}->atfork_prepare;
-		$self->done; # release lock
-
-		if (my $pr = $sync->{-opt}->{-progress}) {
-			my ($bn) = (split('/', $git->{git_dir}))[-1];
-			$pr->("$bn ".sprintf($sync->{-regen_fmt}, $sync->{nr}));
-		}
-
-		# allow -watch or -mda to write...
-		$self->idx_init; # reacquire lock
-		$sync->{mm_tmp}->atfork_parent;
+		reindex_checkpoint($self, $sync, $git);
 	}
 }
 
@@ -1112,8 +1123,9 @@ sub unindex_oid_remote ($$$) {
 	$self->{over}->remove_oid($oid, $mid);
 }
 
-sub unindex_oid ($$$) {
-	my ($self, $git, $oid) = @_;
+sub unindex_oid ($$$;$) {
+	my ($self, $git, $oid, $unindexed) = @_;
+	my $mm = $self->{mm};
 	my $msgref = $git->cat_file($oid);
 	my $mime = PublicInbox::MIME->new($msgref);
 	my $mids = mids($mime->header_obj);
@@ -1133,8 +1145,11 @@ sub unindex_oid ($$$) {
 				join(',',sort keys %gone), "\n";
 		}
 		foreach my $num (keys %gone) {
-			$self->{unindexed}->{$_}++;
-			$self->{mm}->num_delete($num);
+			if ($unindexed) {
+				my $mid0 = $mm->mid_for($num);
+				$unindexed->{$mid0} = $num;
+			}
+			$mm->num_delete($num);
 		}
 		unindex_oid_remote($self, $oid, $mid);
 	}
@@ -1143,20 +1158,21 @@ sub unindex_oid ($$$) {
 my $x40 = qr/[a-f0-9]{40}/;
 sub unindex ($$$$) {
 	my ($self, $sync, $git, $unindex_range) = @_;
-	my $un = $self->{unindexed} ||= {}; # num => removal count
-	my $before = scalar keys %$un;
+	my $unindexed = $self->{unindexed} ||= {}; # $mid0 => $num
+	my $before = scalar keys %$unindexed;
+	# order does not matter, here:
 	my @cmd = qw(log --raw -r
 			--no-notes --no-color --no-abbrev --no-renames);
 	my $fh = $self->{reindex_pipe} = $git->popen(@cmd, $unindex_range);
 	while (<$fh>) {
 		/\A:\d{6} 100644 $x40 ($x40) [AM]\tm$/o or next;
-		unindex_oid($self, $git, $1);
+		unindex_oid($self, $git, $1, $unindexed);
 	}
 	delete $self->{reindex_pipe};
 	$fh = undef;
 
 	return unless $sync->{-opt}->{prune};
-	my $after = scalar keys %$un;
+	my $after = scalar keys %$unindexed;
 	return if $before == $after;
 
 	# ensure any blob can not longer be accessed via dumb HTTP
@@ -1208,16 +1224,6 @@ sub index_epoch ($$$) {
 		} elsif (/\A:\d{6} 100644 $x40 ($x40) [AM]\td$/o) {
 			mark_deleted($self, $sync, $git, $1);
 		}
-
-		while (my $reindex_q = delete $sync->{reindex_q}) {
-			my $all = $self->{-inbox}->git;
-			for my $oid (@$reindex_q) {
-				$self->{current_info} = "$i.git $oid";
-				warn "reindexing to shuffle article numbers\n";
-				reindex_oid($self, $sync, $all, $oid);
-			}
-			$all->cleanup;
-		}
 	}
 	$fh = undef;
 	delete $self->{reindex_pipe};
@@ -1259,10 +1265,22 @@ sub index_sync {
 
 	# unindex is required for leftovers if "deletes" affect messages
 	# in a previous fetch+index window:
+	my $git;
 	if (my @leftovers = values %{delete $sync->{D}}) {
-		my $git = $self->{-inbox}->git;
-		unindex_oid($self, $git, $_) for @leftovers;
-		$git->cleanup;
+		$git = $self->{-inbox}->git;
+		for my $oid (@leftovers) {
+			$self->{current_info} = "leftover $oid";
+			unindex_oid($self, $git, $oid);
+		}
+	}
+	if (my $multi_mid = delete $sync->{multi_mid}) {
+		$git //= $self->{-inbox}->git;
+
+		while (defined(my $oid = pop(@$multi_mid))) {
+			$self->{current_info} = "multi_mid $oid";
+			reindex_oid_m($self, $sync, $git, $oid);
+		}
+		$git->cleanup if $git;
 	}
 	$self->done;
 
diff --git a/t/v2reindex.t b/t/v2reindex.t
index 5de067c6..52711f8f 100644
--- a/t/v2reindex.t
+++ b/t/v2reindex.t
@@ -491,9 +491,10 @@ EOF
 	remove_tree($xap);
 	eval { $im->index_sync() };
 	is($@, '', 'no error from initial indexing');
+	is_deeply(\@warn, [], 'no warnings from initial index');
 	eval { $im->index_sync({reindex=>1}) };
 	is($@, '', 'no error from reindexing after reused Message-ID (x3)');
-	is_deeply(\@warn, [], 'no warnings');
+	is_deeply(\@warn, [], 'no warnings on reindex');
 }
 
 done_testing();

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 1/3] v2writable: set unindexed article number
  2019-10-21 11:02 ` [PATCH 0/3] fix reindex on multiple + overlapping Message-IDs Eric Wong
@ 2019-10-21 11:02   ` Eric Wong
  2019-10-21 11:02   ` [PATCH 2/3] v2writable: improve "num_for" API and disambiguate Eric Wong
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 11+ messages in thread
From: Eric Wong @ 2019-10-21 11:02 UTC (permalink / raw)
  To: meta

We'll actually use the keys of this hash in future commits.
---
 lib/PublicInbox/V2Writable.pm | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 6a88f62a..dcbbbc77 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -1072,7 +1072,7 @@ sub unindex_oid ($$$) {
 				join(',',sort keys %gone), "\n";
 		}
 		foreach my $num (keys %gone) {
-			$self->{unindexed}->{$_}++;
+			$self->{unindexed}->{$num}++;
 			$self->{mm}->num_delete($num);
 		}
 		unindex_oid_remote($self, $oid, $mid);

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 2/3] v2writable: improve "num_for" API and disambiguate
  2019-10-21 11:02 ` [PATCH 0/3] fix reindex on multiple + overlapping Message-IDs Eric Wong
  2019-10-21 11:02   ` [PATCH 1/3] v2writable: set unindexed article number Eric Wong
@ 2019-10-21 11:02   ` Eric Wong
  2019-10-21 11:02   ` [PATCH 3/3] v2writable: reindex handles 3-headered monsters Eric Wong
  2019-10-23 18:19   ` [PUSHED] fix reindex on multiple + overlapping Message-IDs Eric Wong
  3 siblings, 0 replies; 11+ messages in thread
From: Eric Wong @ 2019-10-21 11:02 UTC (permalink / raw)
  To: meta

Make it obvious that we're not the Msgmap sub and return an
array because it's less awkward than providing a modifiable ref
to a function to write to.
---
 lib/PublicInbox/V2Writable.pm | 44 ++++++++++++++++-------------------
 1 file changed, 20 insertions(+), 24 deletions(-)

diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index dcbbbc77..9d507828 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -155,8 +155,7 @@ sub _add {
 	# leaking FDs to it...
 	$self->idx_init;
 
-	my $mid0;
-	my $num = num_for($self, $mime, \$mid0);
+	my ($num, $mid0) = v2_num_for($self, $mime);
 	defined $num or return; # duplicate
 	defined $mid0 or die "BUG: $mid0 undefined\n";
 	my $im = $self->importer;
@@ -172,16 +171,15 @@ sub _add {
 	$cmt;
 }
 
-sub num_for {
-	my ($self, $mime, $mid0) = @_;
+sub v2_num_for {
+	my ($self, $mime) = @_;
 	my $mids = mids($mime->header_obj);
 	if (@$mids) {
 		my $mid = $mids->[0];
 		my $num = $self->{mm}->mid_insert($mid);
 		if (defined $num) { # common case
-			$$mid0 = $mid;
-			return $num;
-		};
+			return ($num, $mid);
+		}
 
 		# crap, Message-ID is already known, hope somebody just resent:
 		foreach my $m (@$mids) {
@@ -190,7 +188,7 @@ sub num_for {
 			# easy, don't store duplicates
 			# note: do not add more diagnostic info here since
 			# it gets noisy on public-inbox-watch restarts
-			return if $existing;
+			return () if $existing;
 		}
 
 		# AltId may pre-populate article numbers (e.g. X-Mail-Count
@@ -201,8 +199,7 @@ sub num_for {
 			my $num = $self->{mm}->num_for($mid);
 
 			if (defined $num && !$self->{over}->get_art($num)) {
-				$$mid0 = $mid;
-				return $num;
+				return ($num, $mid);
 			}
 		}
 
@@ -215,39 +212,38 @@ sub num_for {
 			$num = $self->{mm}->mid_insert($m);
 			if (defined $num) {
 				warn "alternative <$m> for <$mid> found\n";
-				$$mid0 = $m;
-				return $num;
+				return ($num, $m);
 			}
 		}
 	}
 	# none of the existing Message-IDs are good, generate a new one:
-	num_for_harder($self, $mime, $mid0);
+	v2_num_for_harder($self, $mime);
 }
 
-sub num_for_harder {
-	my ($self, $mime, $mid0) = @_;
+sub v2_num_for_harder {
+	my ($self, $mime) = @_;
 
 	my $hdr = $mime->header_obj;
 	my $dig = content_digest($mime);
-	$$mid0 = PublicInbox::Import::digest2mid($dig, $hdr);
-	my $num = $self->{mm}->mid_insert($$mid0);
+	my $mid0 = PublicInbox::Import::digest2mid($dig, $hdr);
+	my $num = $self->{mm}->mid_insert($mid0);
 	unless (defined $num) {
 		# it's hard to spoof the last Received: header
 		my @recvd = $hdr->header_raw('Received');
 		$dig->add("Received: $_") foreach (@recvd);
-		$$mid0 = PublicInbox::Import::digest2mid($dig, $hdr);
-		$num = $self->{mm}->mid_insert($$mid0);
+		$mid0 = PublicInbox::Import::digest2mid($dig, $hdr);
+		$num = $self->{mm}->mid_insert($mid0);
 
 		# fall back to a random Message-ID and give up determinism:
 		until (defined($num)) {
 			$dig->add(rand);
-			$$mid0 = PublicInbox::Import::digest2mid($dig, $hdr);
-			warn "using random Message-ID <$$mid0> as fallback\n";
-			$num = $self->{mm}->mid_insert($$mid0);
+			$mid0 = PublicInbox::Import::digest2mid($dig, $hdr);
+			warn "using random Message-ID <$mid0> as fallback\n";
+			$num = $self->{mm}->mid_insert($mid0);
 		}
 	}
-	PublicInbox::Import::append_mid($hdr, $$mid0);
-	$num;
+	PublicInbox::Import::append_mid($hdr, $mid0);
+	($num, $mid0);
 }
 
 sub idx_shard {

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 3/3] v2writable: reindex handles 3-headered monsters
  2019-10-21 11:02 ` [PATCH 0/3] fix reindex on multiple + overlapping Message-IDs Eric Wong
  2019-10-21 11:02   ` [PATCH 1/3] v2writable: set unindexed article number Eric Wong
  2019-10-21 11:02   ` [PATCH 2/3] v2writable: improve "num_for" API and disambiguate Eric Wong
@ 2019-10-21 11:02   ` Eric Wong
  2019-10-21 11:34     ` Eric Wong
  2019-10-22  1:28     ` [PATCH 4/3] v2writable: move git->cleanup to the correct place Eric Wong
  2019-10-23 18:19   ` [PUSHED] fix reindex on multiple + overlapping Message-IDs Eric Wong
  3 siblings, 2 replies; 11+ messages in thread
From: Eric Wong @ 2019-10-21 11:02 UTC (permalink / raw)
  To: meta

And maybe 8-headered ones, too...

I noticed --reindex failing on the linux-renesas-soc mirror due
one 3-headed monster of a message having 3 sets of headers;
while another normal message had a Message-ID that matched one
of the 3 IDs of the 3-headed monster.

We still try to do the majority of indexing backwards, but we
defer indexing multi-Message-ID'd messages until the end to
ensure we get all the "good" messages in before we process the
multi-headered ones.

Link: https://public-inbox.org/meta/20191016211415.GA6084@dcvr/
---
 TODO                          |   3 +
 lib/PublicInbox/OverIdx.pm    |  14 +++
 lib/PublicInbox/V2Writable.pm | 213 ++++++++++++++++++++++++----------
 t/v2reindex.t                 |  66 +++++++++++
 4 files changed, 236 insertions(+), 60 deletions(-)

diff --git a/TODO b/TODO
index a327ca06..61c44a84 100644
--- a/TODO
+++ b/TODO
@@ -133,3 +133,6 @@ all need to be considered for everything we introduce)
   for coderepos
 
 * configurable diff output for solver-generated blobs
+
+* fix search for messages with multiple Subject:/To:/From:/Date:
+  headers (some wacky examples out there...)
diff --git a/lib/PublicInbox/OverIdx.pm b/lib/PublicInbox/OverIdx.pm
index 7fd1905d..64277342 100644
--- a/lib/PublicInbox/OverIdx.pm
+++ b/lib/PublicInbox/OverIdx.pm
@@ -343,6 +343,20 @@ sub remove_oid {
 	$nr;
 }
 
+sub num_mid0_for_oid {
+	my ($self, $oid, $mid) = @_;
+	my ($num, $mid0);
+	$self->begin_lazy;
+	each_by_mid($self, $mid, ['ddd'], sub {
+		my ($smsg) = @_;
+		my $blob = $smsg->{blob};
+		return 1 if (!defined($blob) || $blob ne $oid); # continue;
+		($num, $mid0) = ($smsg->{num}, $smsg->{mid});
+		0; # done
+	});
+	($num, $mid0);
+}
+
 sub create_tables {
 	my ($dbh) = @_;
 
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 9d507828..7ece6b01 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -841,81 +841,157 @@ sub mark_deleted ($$$$) {
 	}
 }
 
-sub reindex_oid ($$$$) {
+sub reindex_checkpoint ($$$) {
+	my ($self, $sync, $git) = @_;
+
+	$git->cleanup;
+	$sync->{mm_tmp}->atfork_prepare;
+	$self->done; # release lock
+
+	if (my $pr = $sync->{-opt}->{-progress}) {
+		my ($bn) = (split('/', $git->{git_dir}))[-1];
+		$pr->("$bn ".sprintf($sync->{-regen_fmt}, $sync->{nr}));
+	}
+
+	# allow -watch or -mda to write...
+	$self->idx_init; # reacquire lock
+	$sync->{mm_tmp}->atfork_parent;
+}
+
+# only for a few odd messages with multiple Message-IDs
+sub reindex_oid_m ($$$$) {
 	my ($self, $sync, $git, $oid) = @_;
-	my $len;
+	my ($num, $mid0, $len);
 	my $msgref = $git->cat_file($oid, \$len);
 	my $mime = PublicInbox::MIME->new($$msgref);
 	my $mids = mids($mime->header_obj);
 	my $cid = content_id($mime);
+	die "BUG: reindex_oid_m called for <=1 mids" if scalar(@$mids) <= 1;
 
-	# get the NNTP article number we used before, highest number wins
-	# and gets deleted from sync->{mm_tmp};
-	my $mid0;
-	my $num = -1;
-	my $del = 0;
-	foreach my $mid (@$mids) {
-		$del += delete($sync->{D}->{"$mid\0$cid"}) ? 1 : 0;
-		my $n = $sync->{mm_tmp}->num_for($mid);
-		if (defined $n && $n > $num) {
+	for my $mid (reverse @$mids) {
+		delete($sync->{D}->{"$mid\0$cid"}) and
+			die "BUG: reindex_oid should handle <$mid> delete";
+	}
+	for my $mid (reverse @$mids) {
+		($num, $mid0) = $self->{over}->num_mid0_for_oid($oid, $mid);
+		last if defined $num;
+	}
+	unless (defined($num)) {
+		for my $mid (reverse @$mids) {
+			# is this a number we got before?
+			$num = $sync->{mm_tmp}->num_for($mid);
+			next unless defined $num;
 			$mid0 = $mid;
-			$num = $n;
-			$self->{mm}->mid_set($num, $mid0);
+			last;
 		}
 	}
-	if (!defined($mid0) && !$del) {
+	if (defined($num)) {
+		$sync->{mm_tmp}->num_delete($num);
+	} else {
 		$num = $sync->{regen}--;
-		die "BUG: ran out of article numbers\n" if $num <= 0;
-		my $mm = $self->{mm};
-		foreach my $mid (reverse @$mids) {
-			if ($mm->mid_set($num, $mid) == 1) {
+		if ($num <= 0) {
+			# fixup bugs in old mirrors on reindex
+			for my $mid (reverse @$mids) {
+				$num = $self->{mm}->mid_insert($mid);
+				next unless defined $num;
 				$mid0 = $mid;
 				last;
 			}
-		}
-		if (!defined($mid0)) {
-			my $id = '<' . join('> <', @$mids) . '>';
-			warn "Message-ID $id unusable for $num\n";
-			foreach my $mid (@$mids) {
-				defined(my $n = $mm->num_for($mid)) or next;
-				warn "#$n previously mapped for <$mid>\n";
+			if (defined $mid0) {
+				if ($sync->{reindex}) {
+					warn "reindex added #$num <$mid0>\n";
+				}
+			} else {
+				warn "E: cannot find article #\n";
+				return;
+			}
+		} else { # $num > 0, use the new article number
+			for my $mid (reverse @$mids) {
+				$self->{mm}->mid_set($num, $mid) == 1 or next;
+				$mid0 = $mid;
+				last;
+			}
+			unless (defined $mid0) {
+				warn "E: cannot regen #$num\n";
+				return;
 			}
 		}
 	}
+	$sync->{nr}++;
+	if (do_idx($self, $msgref, $mime, $len, $num, $oid, $mid0)) {
+		reindex_checkpoint($self, $sync, $git);
+	}
+}
 
-	if (!defined($mid0) || $del) {
-		if (!defined($mid0) && $del) { # expected for deletes
-			$num = $sync->{regen}--;
-			$self->{mm}->num_highwater($num) if !$sync->{reindex};
-			return
+sub check_unindexed ($$$) {
+	my ($self, $num, $mid0) = @_;
+	my $unindexed = $self->{unindexed} // {};
+	my $n = delete($unindexed->{$mid0});
+	defined $n or return;
+	if ($n != $num) {
+		die "BUG: unindexed $n != $num <$mid0>\n";
+	} else {
+		$self->{mm}->mid_set($num, $mid0);
+	}
+}
+
+sub reindex_oid ($$$$) {
+	my ($self, $sync, $git, $oid) = @_;
+	my ($num, $mid0, $len);
+	my $msgref = $git->cat_file($oid, \$len);
+	return if $len == 0; # purged
+	my $mime = PublicInbox::MIME->new($$msgref);
+	my $mids = mids($mime->header_obj);
+	my $cid = content_id($mime);
+
+	if (scalar(@$mids) == 0) {
+		warn "E: $oid has no Message-ID, skipping\n";
+		return;
+	} elsif (scalar(@$mids) == 1) {
+		my $mid = $mids->[0];
+
+		# was the file previously marked as deleted?, skip if so
+		if (delete($sync->{D}->{"$mid\0$cid"})) {
+			if (!$sync->{reindex}) {
+				$num = $sync->{regen}--;
+				$self->{mm}->num_highwater($num);
+			}
+			return;
 		}
 
-		my $id = '<' . join('> <', @$mids) . '>';
-		defined($mid0) or
-			warn "Skipping $id, no article number found\n";
-		if ($del && defined($mid0)) {
-			warn "$id was deleted $del " .
-				"time(s) but mapped to article #$num\n";
+		# is this a number we got before?
+		$num = $sync->{mm_tmp}->num_for($mid);
+		if (defined $num) {
+			$mid0 = $mid;
+			check_unindexed($self, $num, $mid0);
+		} else {
+			$num = $sync->{regen}--;
+			die "BUG: ran out of article numbers\n" if $num <= 0;
+			if ($self->{mm}->mid_set($num, $mid) != 1) {
+				warn "E: unable to assign $num => <$mid>\n";
+				return;
+			}
+			$mid0 = $mid;
+		}
+	} else { # multiple MIDs are a weird case:
+		my $del = 0;
+		for (@$mids) {
+			$del += delete($sync->{D}->{"$_\0$cid"}) // 0;
+		}
+		if ($del) {
+			unindex_oid_remote($self, $oid, $_) for @$mids;
+			# do not delete from {mm_tmp}, since another
+			# single-MID message may use it.
+		} else { # handle them at the end:
+			push @{$sync->{multi_mid} //= []}, $oid;
 		}
 		return;
-
 	}
 	$sync->{mm_tmp}->mid_delete($mid0) or
 		die "failed to delete <$mid0> for article #$num\n";
 	$sync->{nr}++;
 	if (do_idx($self, $msgref, $mime, $len, $num, $oid, $mid0)) {
-		$git->cleanup;
-		$sync->{mm_tmp}->atfork_prepare;
-		$self->done; # release lock
-
-		if (my $pr = $sync->{-opt}->{-progress}) {
-			my ($bn) = (split('/', $git->{git_dir}))[-1];
-			$pr->("$bn ".sprintf($sync->{-regen_fmt}, $sync->{nr}));
-		}
-
-		# allow -watch or -mda to write...
-		$self->idx_init; # reacquire lock
-		$sync->{mm_tmp}->atfork_parent;
+		reindex_checkpoint($self, $sync, $git);
 	}
 }
 
@@ -1047,8 +1123,9 @@ sub unindex_oid_remote ($$$) {
 	$self->{over}->remove_oid($oid, $mid);
 }
 
-sub unindex_oid ($$$) {
-	my ($self, $git, $oid) = @_;
+sub unindex_oid ($$$;$) {
+	my ($self, $git, $oid, $unindexed) = @_;
+	my $mm = $self->{mm};
 	my $msgref = $git->cat_file($oid);
 	my $mime = PublicInbox::MIME->new($msgref);
 	my $mids = mids($mime->header_obj);
@@ -1068,8 +1145,11 @@ sub unindex_oid ($$$) {
 				join(',',sort keys %gone), "\n";
 		}
 		foreach my $num (keys %gone) {
-			$self->{unindexed}->{$num}++;
-			$self->{mm}->num_delete($num);
+			if ($unindexed) {
+				my $mid0 = $mm->mid_for($num);
+				$unindexed->{$mid0} = $num;
+			}
+			$mm->num_delete($num);
 		}
 		unindex_oid_remote($self, $oid, $mid);
 	}
@@ -1078,20 +1158,21 @@ sub unindex_oid ($$$) {
 my $x40 = qr/[a-f0-9]{40}/;
 sub unindex ($$$$) {
 	my ($self, $sync, $git, $unindex_range) = @_;
-	my $un = $self->{unindexed} ||= {}; # num => removal count
-	my $before = scalar keys %$un;
+	my $unindexed = $self->{unindexed} ||= {}; # $mid0 => $num
+	my $before = scalar keys %$unindexed;
+	# order does not matter, here:
 	my @cmd = qw(log --raw -r
 			--no-notes --no-color --no-abbrev --no-renames);
 	my $fh = $self->{reindex_pipe} = $git->popen(@cmd, $unindex_range);
 	while (<$fh>) {
 		/\A:\d{6} 100644 $x40 ($x40) [AM]\tm$/o or next;
-		unindex_oid($self, $git, $1);
+		unindex_oid($self, $git, $1, $unindexed);
 	}
 	delete $self->{reindex_pipe};
 	$fh = undef;
 
 	return unless $sync->{-opt}->{prune};
-	my $after = scalar keys %$un;
+	my $after = scalar keys %$unindexed;
 	return if $before == $after;
 
 	# ensure any blob can not longer be accessed via dumb HTTP
@@ -1184,10 +1265,22 @@ sub index_sync {
 
 	# unindex is required for leftovers if "deletes" affect messages
 	# in a previous fetch+index window:
+	my $git;
 	if (my @leftovers = values %{delete $sync->{D}}) {
-		my $git = $self->{-inbox}->git;
-		unindex_oid($self, $git, $_) for @leftovers;
-		$git->cleanup;
+		$git = $self->{-inbox}->git;
+		for my $oid (@leftovers) {
+			$self->{current_info} = "leftover $oid";
+			unindex_oid($self, $git, $oid);
+		}
+	}
+	if (my $multi_mid = delete $sync->{multi_mid}) {
+		$git //= $self->{-inbox}->git;
+
+		while (defined(my $oid = pop(@$multi_mid))) {
+			$self->{current_info} = "multi_mid $oid";
+			reindex_oid_m($self, $sync, $git, $oid);
+		}
+		$git->cleanup if $git;
 	}
 	$self->done;
 
diff --git a/t/v2reindex.t b/t/v2reindex.t
index 7c5a6b07..52711f8f 100644
--- a/t/v2reindex.t
+++ b/t/v2reindex.t
@@ -431,4 +431,70 @@ ok(!-d $xap, 'Xapian directories removed again');
 		  ], 'msgmap as expected' );
 }
 
+# A real example from linux-renesas-soc on lore where a 3-headed monster
+# of a message has 3 sets of common headers.  Another normal message
+# previously existed with a single Message-ID that conflicts with one
+# of the Message-IDs in the 3-headed monster.
+{
+	my @warn;
+	local $SIG{__WARN__} = sub { push @warn, @_ };
+	my %config = %$ibx_config;
+	$config{indexlevel} = 'basic';
+	my $ibx = PublicInbox::Inbox->new(\%config);
+	my $im = PublicInbox::V2Writable->new($ibx);
+	my $m3 = PublicInbox::MIME->new(<<'EOF');
+Date: Tue, 24 May 2016 14:34:22 -0700 (PDT)
+Message-Id: <20160524.143422.552507610109476444.d@example.com>
+To: t@example.com
+Cc: c@example.com
+Subject: Re: [PATCH v2 2/2]
+From: <f@example.com>
+In-Reply-To: <1463825855-7363-2-git-send-email-y@example.com>
+References: <1463825855-7363-1-git-send-email-y@example.com>
+	<1463825855-7363-2-git-send-email-y@example.com>
+Date: Wed, 25 May 2016 10:01:51 +0900
+From: h@example.com
+To: g@example.com
+Cc: m@example.com
+Subject: Re: [PATCH]
+Message-ID: <20160525010150.GD7292@example.com>
+References: <1463498133-23918-1-git-send-email-g+r@example.com>
+In-Reply-To: <1463498133-23918-1-git-send-email-g+r@example.com>
+From: s@example.com
+To: h@example.com
+Cc: m@example.com
+Subject: [PATCH 12/13]
+Date: Wed, 01 Jun 2016 01:32:35 +0300
+Message-ID: <1923946.Jvi0TDUXFC@wasted.example.com>
+In-Reply-To: <13205049.n7pM8utpHF@wasted.example.com>
+References: <13205049.n7pM8utpHF@wasted.example.com>
+
+Somehow we got a message with 3 sets of headers into one
+message, could've been something broken on the archiver side.
+EOF
+
+	my $m1 = PublicInbox::MIME->new(<<'EOF');
+From: a@example.com
+To: t@example.com
+Subject: [PATCH 12/13]
+Date: Wed, 01 Jun 2016 01:32:35 +0300
+Message-ID: <1923946.Jvi0TDUXFC@wasted.example.com>
+In-Reply-To: <13205049.n7pM8utpHF@wasted.example.com>
+References: <13205049.n7pM8utpHF@wasted.example.com>
+
+This is probably one of the original messages
+
+EOF
+	$im->add($m1);
+	$im->add($m3);
+	$im->done;
+	remove_tree($xap);
+	eval { $im->index_sync() };
+	is($@, '', 'no error from initial indexing');
+	is_deeply(\@warn, [], 'no warnings from initial index');
+	eval { $im->index_sync({reindex=>1}) };
+	is($@, '', 'no error from reindexing after reused Message-ID (x3)');
+	is_deeply(\@warn, [], 'no warnings on reindex');
+}
+
 done_testing();

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 3/3] v2writable: reindex handles 3-headered monsters
  2019-10-21 11:02   ` [PATCH 3/3] v2writable: reindex handles 3-headered monsters Eric Wong
@ 2019-10-21 11:34     ` Eric Wong
  2019-10-22  1:28     ` [PATCH 4/3] v2writable: move git->cleanup to the correct place Eric Wong
  1 sibling, 0 replies; 11+ messages in thread
From: Eric Wong @ 2019-10-21 11:34 UTC (permalink / raw)
  To: meta

Eric Wong <e@80x24.org> wrote:
> +++ b/lib/PublicInbox/V2Writable.pm
> +sub reindex_oid ($$$$) {

<snip>

> +	} else { # multiple MIDs are a weird case:
> +		my $del = 0;
> +		for (@$mids) {
> +			$del += delete($sync->{D}->{"$_\0$cid"}) // 0;
> +		}
> +		if ($del) {
> +			unindex_oid_remote($self, $oid, $_) for @$mids;
> +			# do not delete from {mm_tmp}, since another
> +			# single-MID message may use it.
> +		} else { # handle them at the end:
> +			push @{$sync->{multi_mid} //= []}, $oid;

Part of me worres @$multi_mid can be abused here by people
trying to OOM the indexing process...

<snip>

> @@ -1184,10 +1265,22 @@ sub index_sync {
>  
>  	# unindex is required for leftovers if "deletes" affect messages
>  	# in a previous fetch+index window:
> +	my $git;
>  	if (my @leftovers = values %{delete $sync->{D}}) {
> -		my $git = $self->{-inbox}->git;
> -		unindex_oid($self, $git, $_) for @leftovers;
> -		$git->cleanup;
> +		$git = $self->{-inbox}->git;
> +		for my $oid (@leftovers) {
> +			$self->{current_info} = "leftover $oid";
> +			unindex_oid($self, $git, $oid);
> +		}
> +	}
> +	if (my $multi_mid = delete $sync->{multi_mid}) {
> +		$git //= $self->{-inbox}->git;
> +
> +		while (defined(my $oid = pop(@$multi_mid))) {
> +			$self->{current_info} = "multi_mid $oid";
> +			reindex_oid_m($self, $sync, $git, $oid);
> +		}
> +		$git->cleanup if $git;
>  	}
>  	$self->done;

I suppose we could easily write to a temporary file and use
fixed offsets, too; but fixed offsets would make git + SHA-256
more difficult.  So maybe Tie::File (stdlib) or a SQLite-based
stack could work, too...

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 4/3] v2writable: move git->cleanup to the correct place
  2019-10-21 11:02   ` [PATCH 3/3] v2writable: reindex handles 3-headered monsters Eric Wong
  2019-10-21 11:34     ` Eric Wong
@ 2019-10-22  1:28     ` Eric Wong
  2019-10-22  1:29       ` [PATCH 5/3] v2writable: use msgmap as multi_mid queue Eric Wong
  1 sibling, 1 reply; 11+ messages in thread
From: Eric Wong @ 2019-10-22  1:28 UTC (permalink / raw)
  To: meta

We need to stop the git process to avoid leaking FDs
to Xapian if we recurse ->index_sync on reindex.
---
 lib/PublicInbox/V2Writable.pm | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 7ece6b01..33c0038d 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -1280,8 +1280,8 @@ sub index_sync {
 			$self->{current_info} = "multi_mid $oid";
 			reindex_oid_m($self, $sync, $git, $oid);
 		}
-		$git->cleanup if $git;
 	}
+	$git->cleanup if $git;
 	$self->done;
 
 	if (my $nr = $sync->{nr}) {

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 5/3] v2writable: use msgmap as multi_mid queue
  2019-10-22  1:28     ` [PATCH 4/3] v2writable: move git->cleanup to the correct place Eric Wong
@ 2019-10-22  1:29       ` Eric Wong
  0 siblings, 0 replies; 11+ messages in thread
From: Eric Wong @ 2019-10-22  1:29 UTC (permalink / raw)
  To: meta

Instead of storing Message-IDs in the Msgmap object, we can
store the blob OID.

For initial indexing of mirrors, this lets us preserve
$sync->{regen} by storing the intended article number in
the queue.

On --reindex, the article number we store in Msgmap is ignored
but only used for ordering purposes.

This also allows us to avoid ENOMEM errors if somebody abuses
our system by reusing Message-IDs; but we now risk ENOSPC
instead (but systems tend to have more FS storage than RAM).
---
 lib/PublicInbox/V2Writable.pm | 120 ++++++++++++++++++++++------------
 1 file changed, 79 insertions(+), 41 deletions(-)

diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 33c0038d..ad2e8e62 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -19,6 +19,7 @@ use PublicInbox::Msgmap;
 use PublicInbox::Spawn qw(spawn);
 use PublicInbox::SearchIdx;
 use IO::Handle;
+use File::Temp qw(tempfile);
 
 # an estimate of the post-packed size to the raw uncompressed size
 my $PACKING_FACTOR = 0.4;
@@ -763,7 +764,6 @@ sub import_init {
 # XXX experimental
 sub diff ($$$) {
 	my ($mid, $cur, $new) = @_;
-	use File::Temp qw(tempfile);
 
 	my ($ah, $an) = tempfile('email-cur-XXXXXXXX', TMPDIR => 1);
 	print $ah $cur->as_string or die "print: $!";
@@ -859,8 +859,9 @@ sub reindex_checkpoint ($$$) {
 }
 
 # only for a few odd messages with multiple Message-IDs
-sub reindex_oid_m ($$$$) {
-	my ($self, $sync, $git, $oid) = @_;
+sub reindex_oid_m ($$$$;$) {
+	my ($self, $sync, $git, $oid, $regen_num) = @_;
+	$self->{current_info} = "multi_mid $oid";
 	my ($num, $mid0, $len);
 	my $msgref = $git->cat_file($oid, \$len);
 	my $mime = PublicInbox::MIME->new($$msgref);
@@ -872,49 +873,51 @@ sub reindex_oid_m ($$$$) {
 		delete($sync->{D}->{"$mid\0$cid"}) and
 			die "BUG: reindex_oid should handle <$mid> delete";
 	}
+	my $over = $self->{over};
 	for my $mid (reverse @$mids) {
-		($num, $mid0) = $self->{over}->num_mid0_for_oid($oid, $mid);
-		last if defined $num;
+		($num, $mid0) = $over->num_mid0_for_oid($oid, $mid);
+		next unless defined $num;
+		if (defined($regen_num) && $regen_num != $num) {
+			die "BUG: regen(#$regen_num) != over(#$num)";
+		}
 	}
 	unless (defined($num)) {
 		for my $mid (reverse @$mids) {
 			# is this a number we got before?
-			$num = $sync->{mm_tmp}->num_for($mid);
-			next unless defined $num;
-			$mid0 = $mid;
+			my $n = $sync->{mm_tmp}->num_for($mid);
+			next unless defined $n;
+			next if defined($regen_num) && $regen_num != $n;
+			($num, $mid0) = ($n, $mid);
 			last;
 		}
 	}
 	if (defined($num)) {
 		$sync->{mm_tmp}->num_delete($num);
-	} else {
-		$num = $sync->{regen}--;
-		if ($num <= 0) {
-			# fixup bugs in old mirrors on reindex
-			for my $mid (reverse @$mids) {
-				$num = $self->{mm}->mid_insert($mid);
-				next unless defined $num;
-				$mid0 = $mid;
-				last;
-			}
-			if (defined $mid0) {
-				if ($sync->{reindex}) {
-					warn "reindex added #$num <$mid0>\n";
-				}
-			} else {
-				warn "E: cannot find article #\n";
-				return;
-			}
-		} else { # $num > 0, use the new article number
-			for my $mid (reverse @$mids) {
-				$self->{mm}->mid_set($num, $mid) == 1 or next;
-				$mid0 = $mid;
-				last;
-			}
-			unless (defined $mid0) {
-				warn "E: cannot regen #$num\n";
-				return;
+	} elsif (defined $regen_num) {
+		$num = $regen_num;
+		for my $mid (reverse @$mids) {
+			$self->{mm}->mid_set($num, $mid) == 1 or next;
+			$mid0 = $mid;
+			last;
+		}
+		unless (defined $mid0) {
+			warn "E: cannot regen #$num\n";
+			return;
+		}
+	} else { # fixup bugs in old mirrors on reindex
+		for my $mid (reverse @$mids) {
+			$num = $self->{mm}->mid_insert($mid);
+			next unless defined $num;
+			$mid0 = $mid;
+			last;
+		}
+		if (defined $mid0) {
+			if ($sync->{reindex}) {
+				warn "reindex added #$num <$mid0>\n";
 			}
+		} else {
+			warn "E: cannot find article #\n";
+			return;
 		}
 	}
 	$sync->{nr}++;
@@ -935,6 +938,30 @@ sub check_unindexed ($$$) {
 	}
 }
 
+# reuse Msgmap to store num => oid mapping (rather than num => mid)
+sub multi_mid_q_new () {
+	my ($fh, $fn) = tempfile('multi_mid-XXXXXXX', EXLOCK => 0, TMPDIR => 1);
+	my $multi_mid = PublicInbox::Msgmap->new_file($fn, 1);
+	$multi_mid->{dbh}->do('PRAGMA synchronous = OFF');
+	# for Msgmap->DESTROY:
+	$multi_mid->{tmp_name} = $fn;
+	$multi_mid->{pid} = $$;
+	close $fh or die "failed to close $fn: $!";
+	$multi_mid
+}
+
+sub multi_mid_q_push ($$) {
+	my ($sync, $oid) = @_;
+	my $multi_mid = $sync->{multi_mid} //= multi_mid_q_new();
+	if ($sync->{reindex}) { # no regen on reindex
+		$multi_mid->mid_insert($oid);
+	} else {
+		my $num = $sync->{regen}--;
+		die "BUG: ran out of article numbers" if $num <= 0;
+		$multi_mid->mid_set($num, $oid);
+	}
+}
+
 sub reindex_oid ($$$$) {
 	my ($self, $sync, $git, $oid) = @_;
 	my ($num, $mid0, $len);
@@ -966,7 +993,7 @@ sub reindex_oid ($$$$) {
 			check_unindexed($self, $num, $mid0);
 		} else {
 			$num = $sync->{regen}--;
-			die "BUG: ran out of article numbers\n" if $num <= 0;
+			die "BUG: ran out of article numbers" if $num <= 0;
 			if ($self->{mm}->mid_set($num, $mid) != 1) {
 				warn "E: unable to assign $num => <$mid>\n";
 				return;
@@ -983,7 +1010,7 @@ sub reindex_oid ($$$$) {
 			# do not delete from {mm_tmp}, since another
 			# single-MID message may use it.
 		} else { # handle them at the end:
-			push @{$sync->{multi_mid} //= []}, $oid;
+			multi_mid_q_push($sync, $oid);
 		}
 		return;
 	}
@@ -1275,10 +1302,21 @@ sub index_sync {
 	}
 	if (my $multi_mid = delete $sync->{multi_mid}) {
 		$git //= $self->{-inbox}->git;
-
-		while (defined(my $oid = pop(@$multi_mid))) {
-			$self->{current_info} = "multi_mid $oid";
-			reindex_oid_m($self, $sync, $git, $oid);
+		my ($min, $max) = $multi_mid->minmax;
+		if ($sync->{reindex}) {
+			# we may need to create new Message-IDs if mirrors
+			# were initially indexed with old versions
+			for (my $i = $max; $i >= $min; $i--) {
+				my $oid = $multi_mid->mid_for($i);
+				next unless defined $oid;
+				reindex_oid_m($self, $sync, $git, $oid);
+			}
+		} else { # regen on initial index
+			for my $num ($min..$max) {
+				my $oid = $multi_mid->mid_for($num);
+				next unless defined $oid;
+				reindex_oid_m($self, $sync, $git, $oid, $num);
+			}
 		}
 	}
 	$git->cleanup if $git;

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [RFC/HELP] search: multiple From/To/Cc/Subject (what about Date?)
  2019-10-17 11:22 ` [RFC] v2writable: reindex handles 3-headed monsters [was: BUG: --reindex broken on multiple Message-IDs reuse] Eric Wong
@ 2019-10-22  8:09   ` Eric Wong
  0 siblings, 0 replies; 11+ messages in thread
From: Eric Wong @ 2019-10-22  8:09 UTC (permalink / raw)
  To: meta

We can easily support searching on messages with
multiple From/To/Cc/Subject headers.

Now, the display part in the thread skeleton could be
trickier, but not impossible...

Now, how are we supposed to support searching date ranges
when messages have multiple Date: headers? e.g:

https://lore.kernel.org/linux-renesas-soc/20160524.143422.552507610109476444.davem@davemloft.net/raw

OTOH, I'm pretty sure that's just a mangled message somebody
passed on because all 3 Message-IDs appear in standalone forms
elsewhere (e.g. lkml)

But, I also fully expect future messages will contain
horribleness in headers unless anti-spam/abuse rules start
blocking that before it hits public-inbox...
---
 lib/PublicInbox/SearchMsg.pm |  4 ++--
 t/v2reindex.t                | 16 ++++++++++++----
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/lib/PublicInbox/SearchMsg.pm b/lib/PublicInbox/SearchMsg.pm
index adadf92e..7561e7f2 100644
--- a/lib/PublicInbox/SearchMsg.pm
+++ b/lib/PublicInbox/SearchMsg.pm
@@ -107,8 +107,8 @@ sub __hdr ($$) {
 	return $val if defined $val;
 
 	my $mime = $self->{mime} or return;
-	$val = $mime->header($field);
-	$val = '' unless defined $val;
+	my @raw = $mime->header($field);
+	$val = join(', ', @raw);
 	$val =~ tr/\t\n/  /;
 	$val =~ tr/\r//d;
 	$self->{$field} = $val;
diff --git a/t/v2reindex.t b/t/v2reindex.t
index 52711f8f..3e56ddfa 100644
--- a/t/v2reindex.t
+++ b/t/v2reindex.t
@@ -439,7 +439,7 @@ ok(!-d $xap, 'Xapian directories removed again');
 	my @warn;
 	local $SIG{__WARN__} = sub { push @warn, @_ };
 	my %config = %$ibx_config;
-	$config{indexlevel} = 'basic';
+	$config{indexlevel} = 'medium';
 	my $ibx = PublicInbox::Inbox->new(\%config);
 	my $im = PublicInbox::V2Writable->new($ibx);
 	my $m3 = PublicInbox::MIME->new(<<'EOF');
@@ -447,7 +447,7 @@ Date: Tue, 24 May 2016 14:34:22 -0700 (PDT)
 Message-Id: <20160524.143422.552507610109476444.d@example.com>
 To: t@example.com
 Cc: c@example.com
-Subject: Re: [PATCH v2 2/2]
+Subject: Re: [PATCH v2 2/2] uno
 From: <f@example.com>
 In-Reply-To: <1463825855-7363-2-git-send-email-y@example.com>
 References: <1463825855-7363-1-git-send-email-y@example.com>
@@ -456,14 +456,14 @@ Date: Wed, 25 May 2016 10:01:51 +0900
 From: h@example.com
 To: g@example.com
 Cc: m@example.com
-Subject: Re: [PATCH]
+Subject: Re: [PATCH] dos
 Message-ID: <20160525010150.GD7292@example.com>
 References: <1463498133-23918-1-git-send-email-g+r@example.com>
 In-Reply-To: <1463498133-23918-1-git-send-email-g+r@example.com>
 From: s@example.com
 To: h@example.com
 Cc: m@example.com
-Subject: [PATCH 12/13]
+Subject: [PATCH 12/13] tres
 Date: Wed, 01 Jun 2016 01:32:35 +0300
 Message-ID: <1923946.Jvi0TDUXFC@wasted.example.com>
 In-Reply-To: <13205049.n7pM8utpHF@wasted.example.com>
@@ -495,6 +495,14 @@ EOF
 	eval { $im->index_sync({reindex=>1}) };
 	is($@, '', 'no error from reindexing after reused Message-ID (x3)');
 	is_deeply(\@warn, [], 'no warnings on reindex');
+
+	my %uniq;
+	for my $s (qw(uno dos tres)) {
+		my $msgs = $ibx->search->query("s:$s");
+		is(scalar(@$msgs), 1, "only one result for `$s'");
+		$uniq{$msgs->[0]->{num}}++;
+	}
+	is_deeply([values %uniq], [3], 'search on different subjects');
 }
 
 done_testing();

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PUSHED] fix reindex on multiple + overlapping Message-IDs
  2019-10-21 11:02 ` [PATCH 0/3] fix reindex on multiple + overlapping Message-IDs Eric Wong
                     ` (2 preceding siblings ...)
  2019-10-21 11:02   ` [PATCH 3/3] v2writable: reindex handles 3-headered monsters Eric Wong
@ 2019-10-23 18:19   ` Eric Wong
  3 siblings, 0 replies; 11+ messages in thread
From: Eric Wong @ 2019-10-23 18:19 UTC (permalink / raw)
  To: meta

Eric Wong <e@80x24.org> wrote:
> I'm still testing this with various repos, but the
> linux-renesas-soc mirror from lore seems good when doing initial
> indexing with an old version of this and using --reindex with
> this patch, as all previously-missed messages are now indexed.

Seems fine and pushed all 5 patches.  However, display issues
still need to be fixed in the HTML...

> Eric Wong (3):
>   v2writable: set unindexed article number
>   v2writable: improve "num_for" API and disambiguate
>   v2writable: reindex handles 3-headered monsters

      v2writable: move git->cleanup to the correct place
      v2writable: use msgmap as multi_mid queue

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2019-10-23 18:19 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-16 21:14 BUG: --reindex broken on multiple Message-IDs reuse Eric Wong
2019-10-17 11:22 ` [RFC] v2writable: reindex handles 3-headed monsters [was: BUG: --reindex broken on multiple Message-IDs reuse] Eric Wong
2019-10-22  8:09   ` [RFC/HELP] search: multiple From/To/Cc/Subject (what about Date?) Eric Wong
2019-10-21 11:02 ` [PATCH 0/3] fix reindex on multiple + overlapping Message-IDs Eric Wong
2019-10-21 11:02   ` [PATCH 1/3] v2writable: set unindexed article number Eric Wong
2019-10-21 11:02   ` [PATCH 2/3] v2writable: improve "num_for" API and disambiguate Eric Wong
2019-10-21 11:02   ` [PATCH 3/3] v2writable: reindex handles 3-headered monsters Eric Wong
2019-10-21 11:34     ` Eric Wong
2019-10-22  1:28     ` [PATCH 4/3] v2writable: move git->cleanup to the correct place Eric Wong
2019-10-22  1:29       ` [PATCH 5/3] v2writable: use msgmap as multi_mid queue Eric Wong
2019-10-23 18:19   ` [PUSHED] fix reindex on multiple + overlapping Message-IDs Eric Wong

Code repositories for project(s) associated with this public inbox

	https://80x24.org/public-inbox.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).