user/dev discussion of public-inbox itself
 help / color / mirror / code / Atom feed
* [PATCH 0/7] message threading fixes for WWW UI
@ 2016-12-10  3:42 Eric Wong
  2016-12-10  3:42 ` [PATCH 1/7] search: favor In-Reply-To over last References iff IRT exists Eric Wong
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: Eric Wong @ 2016-12-10  3:42 UTC (permalink / raw)
  To: meta

This series improves thread handling in several oddball
cases.

In the Xapian search indexing phase, the In-Reply-To header
is always considered the last (direct) parent of a message.
This is necessary in cases where a MUA specifies References
in an invalid order.  This is also what our View.pm display
has done for generating "reply" links.

Not many repos are affected by this, but
"public-inbox-index --reindex" will make those consistent
(there is no incompatible Xapian DB version bump).

We will now prune ghosts without children before display, as
they are sometimes the result of buggy (or malicious) MUAs
inserting spaces or otherwise mangling References: headers.
Ghosts with valid children remain shown, as they are likely to
be legitimate (but lost) messages.

Deploying over the next few hours, .onions first!

  Currently reindexing git@vger mirror:

    http://czquwvybam4bgbro.onion/meta

  Up next:

    http://hjrcffqmbrq6wope.onion/meta

  Last: (also public-inbox.org)

    http://ou63pmih66umazou.onion/meta


Eric Wong (7):
  search: favor In-Reply-To over last References iff IRT exists
  view: favor SearchMsg for In-Reply-To over Email::MIME
  thread: fix comment describing its existence
  view: reduce indentation for skeleton generation
  view: skip ghosts with no direct children
  thread: last Reference always wins
  search: always sort thread results in ascending time order

 lib/PublicInbox/Mbox.pm         |  2 +-
 lib/PublicInbox/Search.pm       |  5 ++++
 lib/PublicInbox/SearchIdx.pm    | 22 ++++++++++++---
 lib/PublicInbox/SearchThread.pm | 30 ++++++++++++++------
 lib/PublicInbox/View.pm         | 61 +++++++++++++++++++++--------------------
 t/thread-cycle.t                |  8 ------
 6 files changed, 76 insertions(+), 52 deletions(-)

-- 
EW

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/7] search: favor In-Reply-To over last References iff IRT exists
  2016-12-10  3:42 [PATCH 0/7] message threading fixes for WWW UI Eric Wong
@ 2016-12-10  3:42 ` Eric Wong
  2016-12-10  3:43 ` [PATCH 2/7] view: favor SearchMsg for In-Reply-To over Email::MIME Eric Wong
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Eric Wong @ 2016-12-10  3:42 UTC (permalink / raw)
  To: meta

Some email clients set the References headers backwards, so
trust the In-Reply-To header if (and only if) it exists and
is parseable as direct parent of the current message.

For affected repos, this will require reindexing (via
"public-inbox-index --reindex"), but there will be no
version bump for this bugfix.
---
 lib/PublicInbox/SearchIdx.pm | 22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 4aac028..832d1cb 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -293,10 +293,10 @@ sub link_message {
 	my $hdr = $mime->header_obj;
 	my $refs = $hdr->header_raw('References');
 	my @refs = $refs ? ($refs =~ /<([^>]+)>/g) : ();
-	if (my $irt = $hdr->header_raw('In-Reply-To')) {
-		# last References should be $irt
-		# we will de-dupe later
-		push @refs, mid_clean($irt);
+	my $irt = $hdr->header_raw('In-Reply-To');
+	if (defined $irt) {
+		$irt = mid_clean($irt);
+		$irt = undef if $mid eq $irt;
 	}
 
 	my $tid;
@@ -305,6 +305,15 @@ sub link_message {
 		my @orig_refs = @refs;
 		@refs = ();
 
+		if (defined $irt) {
+			# to check MAX_MID_SIZE
+			push @orig_refs, $irt;
+
+			# below, we will ensure IRT (if specified)
+			# is the last References
+			$uniq{$irt} = 1;
+		}
+
 		# prevent circular references via References: here:
 		foreach my $ref (@orig_refs) {
 			if (length($ref) > MAX_MID_SIZE) {
@@ -315,6 +324,11 @@ sub link_message {
 			push @refs, $ref;
 		}
 	}
+
+	# last References should be IRT, but some mail clients do things
+	# out of order, so trust IRT over References iff IRT exists
+	push @refs, $irt if defined $irt;
+
 	if (@refs) {
 		$smsg->{references} = '<'.join('> <', @refs).'>';
 
-- 
EW


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 2/7] view: favor SearchMsg for In-Reply-To over Email::MIME
  2016-12-10  3:42 [PATCH 0/7] message threading fixes for WWW UI Eric Wong
  2016-12-10  3:42 ` [PATCH 1/7] search: favor In-Reply-To over last References iff IRT exists Eric Wong
@ 2016-12-10  3:43 ` Eric Wong
  2016-12-10  3:43 ` [PATCH 3/7] thread: fix comment describing its existence Eric Wong
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Eric Wong @ 2016-12-10  3:43 UTC (permalink / raw)
  To: meta

This should avoid warnings during thread skeleton generation if
ever the Xapian database disagrees with View.pm about which is
the proper direct parent of a message.  We will treat the data
in Xapian as the truth (if Xapian is available).
---
 lib/PublicInbox/View.pm | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/lib/PublicInbox/View.pm b/lib/PublicInbox/View.pm
index 2bfbb80..feac601 100644
--- a/lib/PublicInbox/View.pm
+++ b/lib/PublicInbox/View.pm
@@ -127,13 +127,13 @@ sub index_entry {
 	my $id_m = 'm'.$id;
 
 	my $root_anchor = $ctx->{root_anchor} || '';
-	my $irt = in_reply_to($hdr);
+	my $irt;
 
 	my $rv = "<a\nhref=#e$id\nid=m$id>*</a> ";
 	$subj = '<b>'.ascii_html($subj).'</b>';
 	$subj = "<u\nid=u>$subj</u>" if $root_anchor eq $id_m;
 	$rv .= $subj . "\n";
-	$rv .= _th_index_lite($mid_raw, $irt, $id, $ctx);
+	$rv .= _th_index_lite($mid_raw, \$irt, $id, $ctx);
 	my @tocc;
 	foreach my $f (qw(To Cc)) {
 		my $dst = _hdr_names($hdr, $f);
@@ -147,7 +147,7 @@ sub index_entry {
 	$rv .= '  '.join('; +', @tocc) . "\n" if @tocc;
 
 	my $mapping = $ctx->{mapping};
-	if (!$mapping && $irt) {
+	if (!$mapping && (defined($irt) || defined($irt = in_reply_to($hdr)))) {
 		my $mirt = PublicInbox::Hval->new_msgid($irt);
 		my $href = $upfx . $mirt->{href}. '/';
 		my $html = $mirt->as_html;
@@ -206,7 +206,10 @@ sub _th_index_lite {
 	my $nr_c = scalar @$children;
 	my $nr_s = 0;
 	my $siblings;
-	my $irt_map = $mapping->{$irt} if defined $irt;
+	if (my $smsg = $node->{smsg}) {
+		($$irt) = (($smsg->{references} || '') =~ m/<([^>]+)>\z/);
+	}
+	my $irt_map = $mapping->{$$irt} if defined $$irt;
 	if (defined $irt_map) {
 		$siblings = $irt_map->[1]->{children};
 		$nr_s = scalar(@$siblings) - 1;
-- 
EW


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 3/7] thread: fix comment describing its existence
  2016-12-10  3:42 [PATCH 0/7] message threading fixes for WWW UI Eric Wong
  2016-12-10  3:42 ` [PATCH 1/7] search: favor In-Reply-To over last References iff IRT exists Eric Wong
  2016-12-10  3:43 ` [PATCH 2/7] view: favor SearchMsg for In-Reply-To over Email::MIME Eric Wong
@ 2016-12-10  3:43 ` Eric Wong
  2016-12-10  3:43 ` [PATCH 4/7] view: reduce indentation for skeleton generation Eric Wong
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Eric Wong @ 2016-12-10  3:43 UTC (permalink / raw)
  To: meta

Mail::Thread is UNavailable on many distros, meaning ordinary
users will have to rely on CPAN, a Perl-specific packaging tool.
---
 lib/PublicInbox/SearchThread.pm | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/PublicInbox/SearchThread.pm b/lib/PublicInbox/SearchThread.pm
index fe70406..5774a95 100644
--- a/lib/PublicInbox/SearchThread.pm
+++ b/lib/PublicInbox/SearchThread.pm
@@ -4,7 +4,7 @@
 # This license differs from the rest of public-inbox
 #
 # Our own jwz-style threading class based on Mail::Thread from CPAN.
-# Mail::Thread is unmaintained and available on some distros.
+# Mail::Thread is unmaintained and unavailable on some distros.
 # We also do not want pruning or subject grouping, since we want
 # to encourage strict threading and hopefully encourage people
 # to use proper In-Reply-To.
-- 
EW


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 4/7] view: reduce indentation for skeleton generation
  2016-12-10  3:42 [PATCH 0/7] message threading fixes for WWW UI Eric Wong
                   ` (2 preceding siblings ...)
  2016-12-10  3:43 ` [PATCH 3/7] thread: fix comment describing its existence Eric Wong
@ 2016-12-10  3:43 ` Eric Wong
  2016-12-10  3:43 ` [PATCH 5/7] view: skip ghosts with no direct children Eric Wong
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: Eric Wong @ 2016-12-10  3:43 UTC (permalink / raw)
  To: meta

This should reduce the number of subroutine calls needed
for the common case of real (non-ghost) messages as well
as shortening code.
---
 lib/PublicInbox/View.pm | 48 +++++++++++++++++++++++-------------------------
 1 file changed, 23 insertions(+), 25 deletions(-)

diff --git a/lib/PublicInbox/View.pm b/lib/PublicInbox/View.pm
index feac601..c2e1ae7 100644
--- a/lib/PublicInbox/View.pm
+++ b/lib/PublicInbox/View.pm
@@ -767,8 +767,9 @@ sub _msg_date {
 
 sub fmt_ts { POSIX::strftime('%Y-%m-%d %k:%M', gmtime($_[0])) }
 
-sub _skel_header {
-	my ($ctx, $smsg, $level) = @_;
+sub skel_dump {
+	my ($ctx, $level, $node) = @_;
+	my $smsg = $node->{smsg} or return _skel_ghost($ctx, $level, $node);
 
 	my $dst = $ctx->{dst};
 	my $cur = $ctx->{cur};
@@ -824,32 +825,29 @@ sub _skel_header {
 	$$dst .=  $d . "<a\nhref=\"$m\"$id>" . $end;
 }
 
-sub skel_dump {
+sub _skel_ghost {
 	my ($ctx, $level, $node) = @_;
-	if (my $smsg = $node->{smsg}) {
-		_skel_header($ctx, $smsg, $level);
+
+	my $mid = $node->{id};
+	my $d = $ctx->{pct} ? '    [irrelevant] ' # search result
+			    : '     [not found] ';
+	$d .= indent_for($level) . th_pfx($level);
+	my $upfx = $ctx->{-upfx};
+	my $m = PublicInbox::Hval->new_msgid($mid);
+	my $href = $upfx . $m->{href} . '/';
+	my $html = $m->as_html;
+
+	my $mapping = $ctx->{mapping};
+	my $map = $mapping->{$mid} if $mapping;
+	if ($map) {
+		my $id = id_compress($mid, 1);
+		$map->[0] = $d . qq{&lt;<a\nhref=#r$id>$html</a>&gt;\n};
+		$d .= qq{&lt;<a\nhref="$href"\nid=r$id>$html</a>&gt;\n};
 	} else {
-		my $mid = $node->{id};
-		my $dst = $ctx->{dst};
-		my $d = $ctx->{pct} ? '    [irrelevant] ' # search result
-				    : '     [not found] ';
-		$d .= indent_for($level) . th_pfx($level);
-		my $upfx = $ctx->{-upfx};
-		my $m = PublicInbox::Hval->new_msgid($mid);
-		my $href = $upfx . $m->{href} . '/';
-		my $html = $m->as_html;
-
-		my $mapping = $ctx->{mapping};
-		my $map = $mapping->{$mid} if $mapping;
-		if ($map) {
-			my $id = id_compress($mid, 1);
-			$map->[0] = $d . qq{&lt;<a\nhref=#r$id>$html</a>&gt;\n};
-			$d .= qq{&lt;<a\nhref="$href"\nid=r$id>$html</a>&gt;\n};
-		} else {
-			$d .= qq{&lt;<a\nhref="$href">$html</a>&gt;\n};
-		}
-		$$dst .= $d;
+		$d .= qq{&lt;<a\nhref="$href">$html</a>&gt;\n};
 	}
+	my $dst = $ctx->{dst};
+	$$dst .= $d;
 }
 
 sub sort_ts {
-- 
EW


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 5/7] view: skip ghosts with no direct children
  2016-12-10  3:42 [PATCH 0/7] message threading fixes for WWW UI Eric Wong
                   ` (3 preceding siblings ...)
  2016-12-10  3:43 ` [PATCH 4/7] view: reduce indentation for skeleton generation Eric Wong
@ 2016-12-10  3:43 ` Eric Wong
  2016-12-10  3:43 ` [PATCH 6/7] thread: last Reference always wins Eric Wong
  2016-12-10  3:43 ` [PATCH 7/7] search: always sort thread results in ascending time order Eric Wong
  6 siblings, 0 replies; 8+ messages in thread
From: Eric Wong @ 2016-12-10  3:43 UTC (permalink / raw)
  To: meta

Otherwise, a malicious or broken client could populate the
thread skeleton with invalid References.  We only care about
ghosts which messages correctly refer to, not totally bogus ones
which may be the result of long line or token truncation +
wrapping in MUA headers.
---
 lib/PublicInbox/SearchThread.pm | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/lib/PublicInbox/SearchThread.pm b/lib/PublicInbox/SearchThread.pm
index 5774a95..ee35f0d 100644
--- a/lib/PublicInbox/SearchThread.pm
+++ b/lib/PublicInbox/SearchThread.pm
@@ -33,7 +33,8 @@ sub thread {
 	my $self = shift;
 	_add_message($self, $_) foreach @{$self->{messages}};
 	my $id_table = delete $self->{id_table};
-	$self->{rootset} = [ grep { !delete $_->{parent} } values %$id_table ];
+	$self->{rootset} = [ grep {
+		!delete($_->{parent}) && $_->visible } values %$id_table ];
 }
 
 sub _get_cont_for_id ($$) {
@@ -133,15 +134,23 @@ sub has_descendent {
 	0;
 }
 
+# Do not show/keep ghosts iff they have no children.  Sometimes
+# a ghost Message-ID is the result of a long header line
+# being folded/mangled by a MUA, and not a missing message.
+sub visible ($) {
+	my ($self) = @_;
+	$self->{smsg} || scalar values %{$self->{children}};
+}
+
 sub order_children {
 	my ($cur, $ordersub) = @_;
 
-	my %seen = ($cur => 1);
+	my %seen = ($cur => 1); # self-referential loop prevention
 	my @q = ($cur);
 	while (defined($cur = shift @q)) {
 		my $c = $cur->{children}; # The hashref here...
 
-		$c = [ grep { !$seen{$_}++ } values %$c ]; # spot/break loops
+		$c = [ grep { !$seen{$_}++ && visible($_) } values %$c ];
 		$c = $ordersub->($c) if scalar @$c > 1;
 		$cur->{children} = $c; # ...becomes an arrayref
 		push @q, @$c;
-- 
EW


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 6/7] thread: last Reference always wins
  2016-12-10  3:42 [PATCH 0/7] message threading fixes for WWW UI Eric Wong
                   ` (4 preceding siblings ...)
  2016-12-10  3:43 ` [PATCH 5/7] view: skip ghosts with no direct children Eric Wong
@ 2016-12-10  3:43 ` Eric Wong
  2016-12-10  3:43 ` [PATCH 7/7] search: always sort thread results in ascending time order Eric Wong
  6 siblings, 0 replies; 8+ messages in thread
From: Eric Wong @ 2016-12-10  3:43 UTC (permalink / raw)
  To: meta

Since we use SearchMsg from Xapian data, we can be
assured we do not get self-referential {references}
field.

However, we may need to be more careful when checking
has_descendent for loops, as blindly calling add_child
could open us up to that possibility...
---
 lib/PublicInbox/SearchThread.pm | 13 ++++++++-----
 t/thread-cycle.t                |  8 --------
 2 files changed, 8 insertions(+), 13 deletions(-)

diff --git a/lib/PublicInbox/SearchThread.pm b/lib/PublicInbox/SearchThread.pm
index ee35f0d..601a84b 100644
--- a/lib/PublicInbox/SearchThread.pm
+++ b/lib/PublicInbox/SearchThread.pm
@@ -52,6 +52,10 @@ sub _add_message ($$) {
 	# B. For each element in the message's References field:
 	defined(my $refs = $smsg->{references}) or return;
 
+	# This loop exists to help fill in gaps left from missing
+	# messages.  It is not needed in a perfect world where
+	# everything is perfectly referenced, only the last ref
+	# matters.
 	my $prev;
 	foreach my $ref ($refs =~ m/<([^>]+)>/g) {
 		# Find a Container object for the given Message-ID
@@ -74,10 +78,8 @@ sub _add_message ($$) {
 	}
 
 	# C. Set the parent of this message to be the last element in
-	# References...
-	if ($prev && !$this->has_descendent($prev)) { # would loop
-		$prev->add_child($this)
-	}
+	# References.
+	$prev->add_child($this) if defined $prev;
 }
 
 sub order {
@@ -127,8 +129,9 @@ sub add_child {
 
 sub has_descendent {
 	my ($self, $child) = @_;
+	my %seen; # loop prevention XXX may not be necessary
 	while ($child) {
-		return 1 if $self == $child;
+		return 1 if $self == $child || $seen{$child}++;
 		$child = $child->{parent};
 	}
 	0;
diff --git a/t/thread-cycle.t b/t/thread-cycle.t
index 0e1ecfe..b084449 100644
--- a/t/thread-cycle.t
+++ b/t/thread-cycle.t
@@ -70,14 +70,6 @@ SKIP: {
 	is($check, $st, 'Mail::Thread output matches');
 }
 
-@msgs = map { bless $_, 'PublicInbox::SearchMsg' } (
-	{ mid => 'a@b' },
-	{ mid => 'b@c', references => '<a@b> <b@c>' },
-	{ mid => 'd@e', references => '<d@e>' },
-);
-
-is(thread_to_s(\@msgs), "a\@b\n b\@c\nd\@e\n", 'ok with self-references');
-
 done_testing();
 
 sub thread_to_s {
-- 
EW


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 7/7] search: always sort thread results in ascending time order
  2016-12-10  3:42 [PATCH 0/7] message threading fixes for WWW UI Eric Wong
                   ` (5 preceding siblings ...)
  2016-12-10  3:43 ` [PATCH 6/7] thread: last Reference always wins Eric Wong
@ 2016-12-10  3:43 ` Eric Wong
  6 siblings, 0 replies; 8+ messages in thread
From: Eric Wong @ 2016-12-10  3:43 UTC (permalink / raw)
  To: meta

This makes life easier for the threading algorithm, as we can
use the implied ordering of timestamps to avoid temporary ghosts
and resulting container vivication.

This would've also allowed us to hide the bug (in most cases)
fixed by the patch titled "thread: last Reference always wins",
in case that needs to be reverted due to infinite looping.
---
 lib/PublicInbox/Mbox.pm   | 2 +-
 lib/PublicInbox/Search.pm | 5 +++++
 lib/PublicInbox/View.pm   | 2 +-
 3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/lib/PublicInbox/Mbox.pm b/lib/PublicInbox/Mbox.pm
index fd623f6..2565ea5 100644
--- a/lib/PublicInbox/Mbox.pm
+++ b/lib/PublicInbox/Mbox.pm
@@ -115,7 +115,7 @@ sub new {
 		cb => $cb,
 		ctx => $ctx,
 		msgs => [],
-		opts => { asc => 1, offset => 0 },
+		opts => { offset => 0 },
 	}, $class;
 }
 
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 8da30c1..5e6bfc6 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -158,6 +158,11 @@ sub get_thread {
 	}
 	$opts ||= {};
 	$opts->{limit} ||= 1000;
+
+	# always sort threads by timestamp, this makes life easier
+	# for the threading algorithm (in SearchThread.pm)
+	$opts->{asc} = 1;
+
 	_do_enquire($self, $qtid, $opts);
 }
 
diff --git a/lib/PublicInbox/View.pm b/lib/PublicInbox/View.pm
index c2e1ae7..ec5f7e0 100644
--- a/lib/PublicInbox/View.pm
+++ b/lib/PublicInbox/View.pm
@@ -327,7 +327,7 @@ sub stream_thread ($$) {
 sub thread_html {
 	my ($ctx) = @_;
 	my $mid = $ctx->{mid};
-	my $sres = $ctx->{srch}->get_thread($mid, { asc => 1 });
+	my $sres = $ctx->{srch}->get_thread($mid);
 	my $msgs = load_results($sres);
 	my $nr = $sres->{total};
 	return missing_thread($ctx) if $nr == 0;
-- 
EW


^ permalink raw reply related	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2016-12-10  3:43 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-12-10  3:42 [PATCH 0/7] message threading fixes for WWW UI Eric Wong
2016-12-10  3:42 ` [PATCH 1/7] search: favor In-Reply-To over last References iff IRT exists Eric Wong
2016-12-10  3:43 ` [PATCH 2/7] view: favor SearchMsg for In-Reply-To over Email::MIME Eric Wong
2016-12-10  3:43 ` [PATCH 3/7] thread: fix comment describing its existence Eric Wong
2016-12-10  3:43 ` [PATCH 4/7] view: reduce indentation for skeleton generation Eric Wong
2016-12-10  3:43 ` [PATCH 5/7] view: skip ghosts with no direct children Eric Wong
2016-12-10  3:43 ` [PATCH 6/7] thread: last Reference always wins Eric Wong
2016-12-10  3:43 ` [PATCH 7/7] search: always sort thread results in ascending time order Eric Wong

Code repositories for project(s) associated with this public inbox

	https://80x24.org/public-inbox.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).