From: Eric Wong <e@80x24.org>
To: meta@public-inbox.org
Subject: [PATCH 2/2] www: fix ref cycle from threading w/ extindex
Date: Sun, 3 Oct 2021 19:07:17 -0500 [thread overview]
Message-ID: <20211004000717.18965-3-e@80x24.org> (raw)
In-Reply-To: <20211004000717.18965-1-e@80x24.org>
Unlike v1 inboxes (which don't accept duplicate Message-IDs at
all), and v2 inboxes (which generate a new Message-ID for
duplicates), extindex must accept duplicate Message-IDs as-is.
This was fine for storage, but prevented the reference-cycle
mechanism of our message threading display algorithm from working
reliably. It could no longer delete the ->{parent} field from
clobbered entries in the %id_table.
So we now take into account reused Message-IDs and never clobber
entries in %id_table. Instead, we mark reused Message-IDs as
"imposters" and special-case them by injecting them as children
after all other threading is complete.
This cycle was noticed using a pre-release of Devel::Mwrap::PSGI:
https://80x24.org/mwrap-perl.git
---
lib/PublicInbox/SearchThread.pm | 104 +++++++++++++++++---------------
t/thread-cycle.t | 21 ++++++-
2 files changed, 74 insertions(+), 51 deletions(-)
diff --git a/lib/PublicInbox/SearchThread.pm b/lib/PublicInbox/SearchThread.pm
index 8fb3a030aa72..507f25baab0e 100644
--- a/lib/PublicInbox/SearchThread.pm
+++ b/lib/PublicInbox/SearchThread.pm
@@ -24,70 +24,74 @@ use PublicInbox::MID qw($MID_EXTRACT);
sub thread {
my ($msgs, $ordersub, $ctx) = @_;
+ my (%id_table, @imposters);
+ keys(%id_table) = scalar @$msgs; # pre-size
- # A. put all current $msgs (non-ghosts) into %id_table
- my %id_table = map {;
+ # A. put all current non-imposter $msgs (non-ghosts) into %id_table
+ # (imposters are messages with reused Message-IDs)
+ # Sadly, we sort here anyways since the fill-in-the-blanks References:
+ # can be shakier if somebody used In-Reply-To with multiple, disparate
+ # messages. So, take the client Date: into account since we can't
+ # always determine ordering when somebody uses multiple In-Reply-To.
+ my @kids = sort { $a->{ds} <=> $b->{ds} } grep {
# this delete saves around 4K across 1K messages
# TODO: move this to a more appropriate place, breaks tests
# if we do it during psgi_cull
delete $_->{num};
- $_->{mid} => PublicInbox::SearchThread::Msg::cast($_);
+ PublicInbox::SearchThread::Msg::cast($_);
+ if (exists $id_table{$_->{mid}}) {
+ $_->{children} = [];
+ push @imposters, $_; # we'll deal with them later
+ undef;
+ } else {
+ $id_table{$_->{mid}} = $_;
+ defined($_->{references});
+ }
} @$msgs;
+ for my $smsg (@kids) {
+ # This loop exists to help fill in gaps left from missing
+ # messages. It is not needed in a perfect world where
+ # everything is perfectly referenced, only the last ref
+ # matters.
+ my $prev;
+ for my $ref ($smsg->{references} =~ m/$MID_EXTRACT/go) {
+ # Find a Container object for the given Message-ID
+ my $cont = $id_table{$ref} //=
+ PublicInbox::SearchThread::Msg::ghost($ref);
+
+ # Link the References field's Containers together in
+ # the order implied by the References header
+ #
+ # * If they are already linked don't change the
+ # existing links
+ # * Do not add a link if adding that link would
+ # introduce a loop...
+ if ($prev &&
+ !$cont->{parent} && # already linked
+ !$cont->has_descendent($prev) # would loop
+ ) {
+ $prev->add_child($cont);
+ }
+ $prev = $cont;
+ }
- # Sadly, we sort here anyways since the fill-in-the-blanks References:
- # can be shakier if somebody used In-Reply-To with multiple, disparate
- # messages. So, take the client Date: into account since we can't
- # always determine ordering when somebody uses multiple In-Reply-To.
- # We'll trust the client Date: header here instead of the Received:
- # time since this is for display (and not retrieval)
- _set_parent(\%id_table, $_) for sort { $a->{ds} <=> $b->{ds} } @$msgs;
+ # C. Set the parent of this message to be the last element in
+ # References.
+ if (defined $prev && !$smsg->has_descendent($prev)) {
+ $prev->add_child($smsg);
+ }
+ }
my $ibx = $ctx->{ibx};
- my $rootset = [ grep {
+ my $rootset = [ grep { # n.b.: delete prevents cyclic refs
!delete($_->{parent}) && $_->visible($ibx)
} values %id_table ];
$rootset = $ordersub->($rootset);
$_->order_children($ordersub, $ctx) for @$rootset;
- $rootset;
-}
-sub _set_parent ($$) {
- my ($id_table, $this) = @_;
-
- # B. For each element in the message's References field:
- defined(my $refs = $this->{references}) or return;
-
- # This loop exists to help fill in gaps left from missing
- # messages. It is not needed in a perfect world where
- # everything is perfectly referenced, only the last ref
- # matters.
- my $prev;
- foreach my $ref ($refs =~ m/$MID_EXTRACT/go) {
- # Find a Container object for the given Message-ID
- my $cont = $id_table->{$ref} //=
- PublicInbox::SearchThread::Msg::ghost($ref);
-
- # Link the References field's Containers together in
- # the order implied by the References header
- #
- # * If they are already linked don't change the
- # existing links
- # * Do not add a link if adding that link would
- # introduce a loop...
- if ($prev &&
- !$cont->{parent} && # already linked
- !$cont->has_descendent($prev) # would loop
- ) {
- $prev->add_child($cont);
- }
- $prev = $cont;
- }
-
- # C. Set the parent of this message to be the last element in
- # References.
- if (defined $prev && !$this->has_descendent($prev)) { # would loop
- $prev->add_child($this);
- }
+ # parent imposter messages with reused Message-IDs
+ unshift(@{$id_table{$_->{mid}}->{children}}, $_) for @imposters;
+ $rootset;
}
package PublicInbox::SearchThread::Msg;
diff --git a/t/thread-cycle.t b/t/thread-cycle.t
index 4b47c01c37c1..e89b18464a5f 100644
--- a/t/thread-cycle.t
+++ b/t/thread-cycle.t
@@ -96,7 +96,26 @@ if ('sorting by Date') {
is("\n".$backward, "\n".$forward, 'forward and backward matches');
}
-done_testing();
+SKIP: {
+ require_mods 'Devel::Cycle', 1;
+ Devel::Cycle->import('find_cycle');
+ my @dup = (
+ { mid => 5, references => '<6>' },
+ { mid => 5, references => '<6> <1>' },
+ );
+ open my $fh, '+>', \(my $out = '') or xbail "open: $!";
+ (undef, $smsgs) = $make_objs->(@dup);
+ eval 'package EmptyInbox; sub smsg_by_mid { undef }';
+ my $ctx = { ibx => bless {}, 'EmptyInbox' };
+ my $rootset = PublicInbox::SearchThread::thread($smsgs, sub {
+ [ sort { $a->{mid} cmp $b->{mid} } @{$_[0]} ] }, $ctx);
+ my $oldout = select $fh;
+ find_cycle($rootset);
+ select $oldout;
+ is($out, '', 'nothing from find_cycle');
+} # Devel::Cycle check
+
+done_testing;
sub thread_to_s {
my ($msgs) = @_;
next prev parent reply other threads:[~2021-10-04 0:07 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-09-04 23:53 httpd memory usage? Eric Wong
2021-09-27 7:10 ` Eric Wong
2021-10-04 0:07 ` [PATCH 0/2] www: fix ref cycles when threading extindex Eric Wong
2021-10-04 0:07 ` [PATCH 1/2] t/thread-cycle: make Email::Simple optional Eric Wong
2021-10-04 0:07 ` Eric Wong [this message]
2021-10-04 22:51 ` [PATCH 0/2] www: fix ref cycles when threading extindex Eric Wong
2021-10-05 11:33 ` Encode.pm leak Eric Wong
2021-10-12 10:59 ` Encode.pm leak in v2.87..v3.12 Eric Wong
2021-10-13 10:16 ` [PATCH 0/7] workaround Encode leak, several test fixes Eric Wong
2021-10-13 10:16 ` [PATCH 1/7] xt/perf-msgview: drop unnecessary use_ok Eric Wong
2021-10-13 10:16 ` [PATCH 2/7] test_common: hoist out tail_f sub Eric Wong
2021-10-13 10:16 ` [PATCH 3/7] t/www_listing: require opt-in for grokmirror tests Eric Wong
2021-10-13 10:16 ` [PATCH 4/7] eml: avoid Encode 2.87..3.12 leak Eric Wong
2021-10-13 10:16 ` [PATCH 5/7] t/lei-mirror: avoid reading ~/.public-inbox/config in test Eric Wong
2021-10-13 10:16 ` [PATCH 6/7] t/git: avoid "once" warning for async_warn Eric Wong
2021-10-13 10:16 ` [PATCH 7/7] t/nntpd-tls: change diag() to like() assertion Eric Wong
2021-11-04 0:17 ` httpd memory usage? Eric Wong
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: https://public-inbox.org/README
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20211004000717.18965-3-e@80x24.org \
--to=e@80x24.org \
--cc=meta@public-inbox.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/public-inbox.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).