user/dev discussion of public-inbox itself
From: Eric Wong <e@80x24.org>
To: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
Cc: meta@public-inbox.org
Subject: [PATCH] eml: header_raw converts octets to Perl UTF-8
Date: Thu, 24 Nov 2022 21:31:55 +0000
Message-ID: <20221124213155.M736847@dcvr>
In-Reply-To: <20221124153715.3nenjpjzj43vqxr2@meerkat.local>

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> Hello:
> 
> There's a bit of inconsistency handling messages with utf8 content in the
> headers:
> 
> https://lore.kernel.org/b4-sent/20221122-gud-shadow-plane-v1-0-9de3afa3383e@tronnes.org/
> 
> You can see that the name in the From: line is mangled, but in the thread
> overview it is displayed correctly.

Thanks, the overview and Xapian/SQLite DBs were correct because
PublicInbox::Smsg->populate had a special case.

The below patch generalizes it to all HTML displays and removes
the special case.
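
To illustrate the difference between header octets and decoded
characters, here is a minimal standalone sketch.  It is not the
actual Eml.pm code, and decoded_copy is a hypothetical helper made
up for this example.  The 195 vs. 198 code points are the same ones
the psgi_search.t checks look for as &#195; and &#198;.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of public-inbox): return a decoded
# copy of raw header octets, leaving the caller's string untouched.
sub decoded_copy {
	my ($octets) = @_;
	my $chars = $octets;
	# utf8::decode works in place; when the octets are not
	# well-formed UTF-8 it returns false and changes nothing
	utf8::decode($chars);
	$chars;
}

my $raw = "\xc3\x86var"; # "Ævar" as stored in the mail: 5 octets
my $chars = decoded_copy($raw);
printf("octets=%d chars=%d\n", length($raw), length($chars));
printf("first raw=%d first decoded=%d\n", ord($raw), ord($chars));
# prints:
# octets=5 chars=4
# first raw=195 first decoded=198
```

Escaping $raw for HTML one octet at a time is what yields a mangled
name, while $chars escapes to a single entity for the first letter.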

> I know older SMTP standards still require 7bit escaping in the headers, but
> with SMTPUTF8 being very widely available, it should be possible to store and
> properly display messages with 8bit unicode in the headers.

Oops, I barely knew about it :x

Anyways, the below should fix it and I've deployed it to:
https://yhbt.net/lore/b4-sent/20221122-gud-shadow-plane-v1-0-9de3afa3383e@tronnes.org/

I'm pretty sure it's safe, but my HW is problematic[1] and I'm
not sure if it can finish a full reindex.

---------8<--------
Subject: [PATCH] eml: header_raw converts octets to Perl UTF-8

This fixes the display of raw (non-RFC 2047) names and subjects
in HTML message views.

SMTPUTF8 (RFC 6531) allows raw UTF-8 in headers without RFC 2047
encoding, so let Perl handle it as a character sequence for the
rest of our consumers.  Thus, the old special case in
PublicInbox::Smsg->populate is no longer necessary and gone.

The one regression noticed so far (and fixed here) is that
compressed IMAP envelope responses still need raw bytes, since
the zlib wrapper is designed for octets, not Perl UTF-8
characters.  Thus we reverse utf8::decode with utf8::encode in
PublicInbox::IMAP::_esc.
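
As a rough illustration of why the encode has to happen before the
length is taken: an IMAP literal announces its size in octets, so
counting characters would under-report any non-ASCII value.  The
sketch below is a simplified stand-in, not the real _esc, and
imap_literal is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the literal branch of an IMAP string
# escaper; it only shows that the size is counted after encoding.
sub imap_literal {
	my ($v) = @_; # copy, so the caller's string is not modified
	utf8::encode($v); # characters -> UTF-8 octets
	'{' . length($v) . "}\r\n" . $v;
}

# decoded characters, as ->header_raw returns them after this change
my $name = "\x{c6}var Arnfj\x{f6}r\x{f0} Bjarmason";
my $lit = imap_literal($name);
my ($announced) = ($lit =~ /\A\{(\d+)\}/);
printf("chars=%d announced=%d\n", length($name), $announced);
# prints: chars=23 announced=26
```

The three non-ASCII letters take two octets each in UTF-8, which
accounts for the difference between 23 and 26.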

->header_set also forces encoding to bytes, since all existing
callers would either be dealing with ->header_raw results or
be RFC-2047-encoded anyways.

Reindexing is not necessary with this change due to the prior
PublicInbox::Smsg->populate special case.

Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
Link: https://public-inbox.org/meta/20221124153715.3nenjpjzj43vqxr2@meerkat.local/
---
 lib/PublicInbox/Eml.pm  |  8 +++++---
 lib/PublicInbox/IMAP.pm |  2 ++
 lib/PublicInbox/Smsg.pm |  3 ---
 t/imapd.t               | 28 ++++++++++++++++++++++++++++
 t/psgi_search.t         |  7 ++++++-
 5 files changed, 41 insertions(+), 7 deletions(-)

diff --git a/lib/PublicInbox/Eml.pm b/lib/PublicInbox/Eml.pm
index 485f637a..8b999e1a 100644
--- a/lib/PublicInbox/Eml.pm
+++ b/lib/PublicInbox/Eml.pm
@@ -1,4 +1,4 @@
-# Copyright (C) 2020-2021 all contributors <meta@public-inbox.org>
+# Copyright (C) all contributors <meta@public-inbox.org>
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
 #
 # Lazy MIME parser, it still slurps the full message but keeps short
@@ -144,6 +144,7 @@ sub header_raw {
 	my $re = re_memo($_[1]);
 	my @v = (${ $_[0]->{hdr} } =~ /$re/g);
 	for (@v) {
+		utf8::decode($_); # SMTPUTF8
 		# for compatibility w/ Email::Simple::Header,
 		s/\s+\z//s;
 		s/\A\s+//s;
@@ -359,14 +360,15 @@ sub header_set {
 	$pfx .= ': ';
 	my $len = 78 - length($pfx);
 	@vals = map {;
+		utf8::encode(my $v = $_); # to bytes, support SMTPUTF8
 		# folding differs from Email::Simple::Header,
 		# we favor tabs for visibility (and space savings :P)
 		if (length($_) >= $len && (/\n[^ \t]/s || !/\n/s)) {
 			local $Text::Wrap::columns = $len;
 			local $Text::Wrap::huge = 'overflow';
-			$pfx . wrap('', "\t", $_) . $self->{crlf};
+			$pfx . wrap('', "\t", $v) . $self->{crlf};
 		} else {
-			$pfx . $_ . $self->{crlf};
+			$pfx . $v . $self->{crlf};
 		}
 	} @vals;
 	$$hdr =~ s!$re!shift(@vals) // ''!ge; # replace current headers, first
diff --git a/lib/PublicInbox/IMAP.pm b/lib/PublicInbox/IMAP.pm
index 1f65aa65..37317948 100644
--- a/lib/PublicInbox/IMAP.pm
+++ b/lib/PublicInbox/IMAP.pm
@@ -426,8 +426,10 @@ sub _esc ($) {
 	if (!defined($v)) {
 		'NIL';
 	} elsif ($v =~ /[{"\r\n%*\\\[]/) { # literal string
+		utf8::encode($v);
 		'{' . length($v) . "}\r\n" . $v;
 	} else { # quoted string
+		utf8::encode($v);
 		qq{"$v"}
 	}
 }
diff --git a/lib/PublicInbox/Smsg.pm b/lib/PublicInbox/Smsg.pm
index 2026c7d9..b132381b 100644
--- a/lib/PublicInbox/Smsg.pm
+++ b/lib/PublicInbox/Smsg.pm
@@ -99,9 +99,6 @@ sub populate {
 		# to protect git and NNTP clients
 		$val =~ tr/\0\t\n/   /;
 
-		# rare: in case headers have wide chars (not RFC2047-encoded)
-		utf8::decode($val);
-
 		# lower-case fields for read-only stuff
 		$self->{lc($f)} = $val;
 
diff --git a/t/imapd.t b/t/imapd.t
index 3c74aefd..cbd6c1b9 100644
--- a/t/imapd.t
+++ b/t/imapd.t
@@ -534,6 +534,34 @@ SKIP: {
 	}
 }
 
+{
+	ok(my $ic = $imap_client->new(%mic_opt), 'logged in');
+	my $mb = "$ibx[0]->{newsgroup}.$first_range";
+	ok($ic->examine($mb), "EXAMINE $mb");
+	my $uidnext = $ic->uidnext($mb); # we'll fetch BODYSTRUCTURE on this
+	my $im = $ibx[0]->importer(0);
+	$im->add(PublicInbox::Eml->new(<<EOF)) or BAIL_OUT;
+Subject: test Ævar
+Message-ID: <smtputf8-delivered-mess\@age>
+From: Ævar Arnfjörð Bjarmason <avarab\@example>
+To: git\@vger.kernel.org
+
+EOF
+	$im->done;
+	my $envl = $ic->get_envelope($uidnext);
+	is($envl->{subject}, 'test Ævar', 'UTF-8 subject');
+	is($envl->{sender}->[0]->{personalname}, 'Ævar Arnfjörð Bjarmason',
+		'UTF-8 sender[0].personalname');
+	SKIP: {
+		skip 'need compress for comparisons', 1 if !$can_compress;
+		ok($ic = $imap_client->new(%mic_opt), 'uncompressed logged in');
+		ok($ic && $ic->compress, 'compress enabled');
+		ok($ic->examine($mb), "EXAMINE $mb");
+		my $raw = $ic->get_envelope($uidnext);
+		is_deeply($envl, $raw, 'raw and compressed match');
+	}
+}
+
 $td->kill;
 $td->join;
 is($?, 0, 'no error in exited process') if !$ENV{TEST_KILL_IMAPD};
diff --git a/t/psgi_search.t b/t/psgi_search.t
index 3da93eda..8868f67e 100644
--- a/t/psgi_search.t
+++ b/t/psgi_search.t
@@ -1,5 +1,5 @@
 #!perl -w
-# Copyright (C) 2017-2021 all contributors <meta@public-inbox.org>
+# Copyright (C) all contributors <meta@public-inbox.org>
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
 use strict;
 use v5.10.1;
@@ -103,6 +103,11 @@ test_psgi(sub { $www->call(@_) }, sub {
 		like($res->content, $mid_re, 'found mid in response');
 		chop($digits);
 	}
+	$res = $cb->(GET("/test/$mid/"));
+	$html = $res->content;
+	like($html, qr/\bFrom: &#198;var /,
+		"displayed Ævar's name properly in permalink From:");
+	unlike($html, qr/&#195;/, 'no raw octets in permalink HTML');
 
 	$res = $cb->(GET('/test/'));
 	$html = $res->content;
-- 
[1] getting weird SATA errors and disconnects and SMART is not
    telling me anything useful.  I suspect a wrong BIOS setting
    after replacing a CMOS battery, but so many knobs and it
    takes hours/days to reproduce :<

Thread overview: 5+ messages
2022-11-24 15:37 handling unquoted utf8 in the headers Konstantin Ryabitsev
2022-11-24 21:31 ` Eric Wong [this message]
2022-11-25 18:14   ` [PATCH] eml: header_raw converts octets to Perl UTF-8 Konstantin Ryabitsev
2022-11-27  9:15     ` [PATCH] content_hash: handle References as octets Eric Wong
2022-11-26  9:05   ` [PATCH] eml: header_raw converts octets to Perl UTF-8 Eric Wong
