user/dev discussion of public-inbox itself
 help / color / mirror / code / Atom feed
From: Eric Wong <e@80x24.org>
To: meta@public-inbox.org
Subject: [PATCH 2/2] www: fix ref cycle from threading w/ extindex
Date: Sun,  3 Oct 2021 19:07:17 -0500	[thread overview]
Message-ID: <20211004000717.18965-3-e@80x24.org> (raw)
In-Reply-To: <20211004000717.18965-1-e@80x24.org>

Unlike v1 inboxes (which don't accept duplicate Message-IDs at
all), and v2 inboxes (which generate a new Message-ID for
duplicates), extindex must accept duplicate Message-IDs as-is.

This was fine for storage, but prevented the reference-cycle
mechanism of our message threading display algorithm from working
reliably.  It could no longer delete the ->{parent} field from
clobbered entries in the %id_table.

So we now take into account reused Message-IDs and never clobber
entries in %id_table.  Instead, we mark reused Message-IDs as
"imposters" and special-case them by injecting them as children
after all other threading is complete.

This cycle was noticed using a pre-release of Devel::Mwrap::PSGI:
  https://80x24.org/mwrap-perl.git
---
 lib/PublicInbox/SearchThread.pm | 104 +++++++++++++++++---------------
 t/thread-cycle.t                |  21 ++++++-
 2 files changed, 74 insertions(+), 51 deletions(-)

diff --git a/lib/PublicInbox/SearchThread.pm b/lib/PublicInbox/SearchThread.pm
index 8fb3a030aa72..507f25baab0e 100644
--- a/lib/PublicInbox/SearchThread.pm
+++ b/lib/PublicInbox/SearchThread.pm
@@ -24,70 +24,74 @@ use PublicInbox::MID qw($MID_EXTRACT);
 
 sub thread {
 	my ($msgs, $ordersub, $ctx) = @_;
+	my (%id_table, @imposters);
+	keys(%id_table) = scalar @$msgs; # pre-size
 
-	# A. put all current $msgs (non-ghosts) into %id_table
-	my %id_table = map {;
+	# A. put all current non-imposter $msgs (non-ghosts) into %id_table
+	# (imposters are messages with reused Message-IDs)
+	# Sadly, we sort here anyways since the fill-in-the-blanks References:
+	# can be shakier if somebody used In-Reply-To with multiple, disparate
+	# messages.  So, take the client Date: into account since we can't
+	# always determine ordering when somebody uses multiple In-Reply-To.
+	my @kids = sort { $a->{ds} <=> $b->{ds} } grep {
 		# this delete saves around 4K across 1K messages
 		# TODO: move this to a more appropriate place, breaks tests
 		# if we do it during psgi_cull
 		delete $_->{num};
 
-		$_->{mid} => PublicInbox::SearchThread::Msg::cast($_);
+		PublicInbox::SearchThread::Msg::cast($_);
+		if (exists $id_table{$_->{mid}}) {
+			$_->{children} = [];
+			push @imposters, $_; # we'll deal with them later
+			undef;
+		} else {
+			$id_table{$_->{mid}} = $_;
+			defined($_->{references});
+		}
 	} @$msgs;
+	for my $smsg (@kids) {
+		# This loop exists to help fill in gaps left from missing
+		# messages.  It is not needed in a perfect world where
+		# everything is perfectly referenced, only the last ref
+		# matters.
+		my $prev;
+		for my $ref ($smsg->{references} =~ m/$MID_EXTRACT/go) {
+			# Find a Container object for the given Message-ID
+			my $cont = $id_table{$ref} //=
+				PublicInbox::SearchThread::Msg::ghost($ref);
+
+			# Link the References field's Containers together in
+			# the order implied by the References header
+			#
+			# * If they are already linked don't change the
+			#   existing links
+			# * Do not add a link if adding that link would
+			#   introduce a loop...
+			if ($prev &&
+				!$cont->{parent} &&  # already linked
+				!$cont->has_descendent($prev) # would loop
+			   ) {
+				$prev->add_child($cont);
+			}
+			$prev = $cont;
+		}
 
-	# Sadly, we sort here anyways since the fill-in-the-blanks References:
-	# can be shakier if somebody used In-Reply-To with multiple, disparate
-	# messages.  So, take the client Date: into account since we can't
-	# always determine ordering when somebody uses multiple In-Reply-To.
-	# We'll trust the client Date: header here instead of the Received:
-	# time since this is for display (and not retrieval)
-	_set_parent(\%id_table, $_) for sort { $a->{ds} <=> $b->{ds} } @$msgs;
+		# C. Set the parent of this message to be the last element in
+		# References.
+		if (defined $prev && !$smsg->has_descendent($prev)) {
+			$prev->add_child($smsg);
+		}
+	}
 	my $ibx = $ctx->{ibx};
-	my $rootset = [ grep {
+	my $rootset = [ grep { # n.b.: delete prevents cyclic refs
 			!delete($_->{parent}) && $_->visible($ibx)
 		} values %id_table ];
 	$rootset = $ordersub->($rootset);
 	$_->order_children($ordersub, $ctx) for @$rootset;
-	$rootset;
-}
 
-sub _set_parent ($$) {
-	my ($id_table, $this) = @_;
-
-	# B. For each element in the message's References field:
-	defined(my $refs = $this->{references}) or return;
-
-	# This loop exists to help fill in gaps left from missing
-	# messages.  It is not needed in a perfect world where
-	# everything is perfectly referenced, only the last ref
-	# matters.
-	my $prev;
-	foreach my $ref ($refs =~ m/$MID_EXTRACT/go) {
-		# Find a Container object for the given Message-ID
-		my $cont = $id_table->{$ref} //=
-			PublicInbox::SearchThread::Msg::ghost($ref);
-
-		# Link the References field's Containers together in
-		# the order implied by the References header
-		#
-		# * If they are already linked don't change the
-		#   existing links
-		# * Do not add a link if adding that link would
-		#   introduce a loop...
-		if ($prev &&
-			!$cont->{parent} &&  # already linked
-			!$cont->has_descendent($prev) # would loop
-		   ) {
-			$prev->add_child($cont);
-		}
-		$prev = $cont;
-	}
-
-	# C. Set the parent of this message to be the last element in
-	# References.
-	if (defined $prev && !$this->has_descendent($prev)) { # would loop
-		$prev->add_child($this);
-	}
+	# parent imposter messages with reused Message-IDs
+	unshift(@{$id_table{$_->{mid}}->{children}}, $_) for @imposters;
+	$rootset;
 }
 
 package PublicInbox::SearchThread::Msg;
diff --git a/t/thread-cycle.t b/t/thread-cycle.t
index 4b47c01c37c1..e89b18464a5f 100644
--- a/t/thread-cycle.t
+++ b/t/thread-cycle.t
@@ -96,7 +96,26 @@ if ('sorting by Date') {
 	is("\n".$backward, "\n".$forward, 'forward and backward matches');
 }
 
-done_testing();
+SKIP: {
+	require_mods 'Devel::Cycle', 1;
+	Devel::Cycle->import('find_cycle');
+	my @dup = (
+		{ mid => 5, references => '<6>' },
+		{ mid => 5, references => '<6> <1>' },
+	);
+	open my $fh, '+>', \(my $out = '') or xbail "open: $!";
+	(undef, $smsgs) = $make_objs->(@dup);
+	eval 'package EmptyInbox; sub smsg_by_mid { undef }';
+	my $ctx = { ibx => bless {}, 'EmptyInbox' };
+	my $rootset = PublicInbox::SearchThread::thread($smsgs, sub {
+		[ sort { $a->{mid} cmp $b->{mid} } @{$_[0]} ] }, $ctx);
+	my $oldout = select $fh;
+	find_cycle($rootset);
+	select $oldout;
+	is($out, '', 'nothing from find_cycle');
+} # Devel::Cycle check
+
+done_testing;
 
 sub thread_to_s {
 	my ($msgs) = @_;

  parent reply	other threads:[~2021-10-04  0:07 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-09-04 23:53 httpd memory usage? Eric Wong
2021-09-27  7:10 ` Eric Wong
2021-10-04  0:07 ` [PATCH 0/2] www: fix ref cycles when threading extindex Eric Wong
2021-10-04  0:07   ` [PATCH 1/2] t/thread-cycle: make Email::Simple optional Eric Wong
2021-10-04  0:07   ` Eric Wong [this message]
2021-10-04 22:51   ` [PATCH 0/2] www: fix ref cycles when threading extindex Eric Wong
2021-10-05 11:33     ` Encode.pm leak Eric Wong
2021-10-12 10:59       ` Encode.pm leak in v2.87..v3.12 Eric Wong
2021-10-13 10:16         ` [PATCH 0/7] workaround Encode leak, several test fixes Eric Wong
2021-10-13 10:16           ` [PATCH 1/7] xt/perf-msgview: drop unnecessary use_ok Eric Wong
2021-10-13 10:16           ` [PATCH 2/7] test_common: hoist out tail_f sub Eric Wong
2021-10-13 10:16           ` [PATCH 3/7] t/www_listing: require opt-in for grokmirror tests Eric Wong
2021-10-13 10:16           ` [PATCH 4/7] eml: avoid Encode 2.87..3.12 leak Eric Wong
2021-10-13 10:16           ` [PATCH 5/7] t/lei-mirror: avoid reading ~/.public-inbox/config in test Eric Wong
2021-10-13 10:16           ` [PATCH 6/7] t/git: avoid "once" warning for async_warn Eric Wong
2021-10-13 10:16           ` [PATCH 7/7] t/nntpd-tls: change diag() to like() assertion Eric Wong
2021-11-04  0:17 ` httpd memory usage? Eric Wong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://public-inbox.org/README

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20211004000717.18965-3-e@80x24.org \
    --to=e@80x24.org \
    --cc=meta@public-inbox.org \
    --subject='Re: [PATCH 2/2] www: fix ref cycle from threading w/ extindex' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Code repositories for project(s) associated with this inbox:

	https://80x24.org/public-inbox.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).