user/dev discussion of public-inbox itself
* Re: Add "generator" information to HTML pages
  2023-01-08 20:58  8%     ` Eric Wong
@ 2023-01-08 21:54  0%       ` Thomas Weißschuh
  0 siblings, 0 replies; 6+ results
From: Thomas Weißschuh @ 2023-01-08 21:54 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

On Sun, Jan 08, 2023 at 08:58:04PM +0000, Eric Wong wrote:
> Thomas Weißschuh <thomas@t-8ch.de> wrote:
> > On Sun, Jan 08, 2023 at 07:47:38PM +0000, Eric Wong wrote:
> > > Thomas Weißschuh <thomas@t-8ch.de> wrote:
> > > > it would be nice if public-inbox could extend the HTML pages it
> > > > generates with the "generator" meta tag [0].
> > > > Especially the version would be useful.
> > > > 
> > > > This would help users during debugging to see the specific version of
> > > > public-inbox they are looking at.
> > > 
> > > What would users be debugging?
> > > Admins would be the only ones who care, I think...
> > 
> > Since recently my mails to linux-kernel@vger.kernel.org that should end
> > up on public-inbox on https://lore.kernel.org/lkml/ don't do so.
> > They are accepted by the mail server on vger.kernel.org but never end up
> > in the archives.
> > I suspect some interactions between b4 which is used to generate the
> > mails, the unicode characters in my name and public-inbox to be the
> > culprit.
> 
> Your mail seems fine to my server, but coming from an IPv6
> address has caused problems with some other servers in the past.
> Another potential thing might be your use of utf-8 in the From:
> header, while your Content-Type: is iso-8859-1 for the body.

I think I found the culprit. And it is indeed the b4 tool, or rather the
Python email library it is using.
Posting it here because you might know whether this conforms to the
standards or whether it would be reasonable to carry a workaround inside
public-inbox.

When b4 passes the message to Python's email.message.EmailMessage, the
'To' header is just a long, unencoded string containing all recipients
and their Unicode names.
EmailMessage then makes sure that this string is a legal email
header value. It performs line wrapping and the special utf-8
header encoding/escaping.

However, if a header line contains a Unicode character and the first
character of a wrapped line is a comma (,), then that comma will also
be utf-8 escaped.

Example input:
01234567890123456789012345678901234567890123456789012345678901234567890123, ä

Example output:
01234567890123456789012345678901234567890123456789012345678901234567890123
 =?utf-8?q?=2C?= =?utf-8?q?=C3=A4?=

I expect this to be a bug in the Python library, but maybe it is
correct.
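
A minimal way to poke at this locally, assuming Python 3 with only the
stdlib email package.  The recipient addresses are made up, and whether
the comma actually gets encoded depends on the Python version and on
where the fold lands, so no particular folded output is claimed here.
The decode_words helper is hypothetical; it only shows what the second
line of the example output decodes to.

```python
from email.header import decode_header
from email.message import EmailMessage

# Hypothetical recipients; only the last display name is non-ASCII.
recipients = ", ".join([
    "Alice Example <alice@example.com>",
    "Bob Example <bob@example.com>",
    "Carol Example <carol@example.com>",
    "Thomas Weißschuh <thomas@example.com>",
])

msg = EmailMessage()
msg["To"] = recipients
# Inspect the folded header by eye and look for "=?utf-8?q?=2C?=".
folded = msg.as_string()
print(folded)


def decode_words(value):
    """Decode RFC 2047 encoded-words in value and return a str."""
    return "".join(
        chunk.decode(charset or "ascii") if isinstance(chunk, bytes) else chunk
        for chunk, charset in decode_header(value)
    )


# The continuation line from the example output decodes to a comma
# followed by the non-ASCII character.
decoded = decode_words("=?utf-8?q?=2C?= =?utf-8?q?=C3=A4?=")
print(repr(decoded))
```

If a receiver splits the header on literal commas before decoding, an
encoded comma would no longer separate the recipients.  That is one
plausible way for such a message to get lost, not a confirmed diagnosis.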

> > This is what I wanted to reproduce locally, for which exact versions
> > would have been nice.
> 
> I remember Konstantin has cherry-picked some commits from
> public-inbox.git in the past, and I suspect he already
> has https://public-inbox.org/meta/20221124213155.M736847@dcvr/
> ("eml: header_raw converts octets to Perl UTF-8") for SMTPUTF8
> 
> One thing I wouldn't be opposed to doing is adding a way to
> download all loaded files in a tarball as a means for AGPL
> enforcement.  The tricky thing is those files may change on disk
> after loading (and often do in my case :x), so they'd need to
> be copied into stable storage at startup (and updated if there's
> lazy-loading).  Same security caveats apply, though.
> 
> > > I also don't like wasting memory+bandwidth on things most users
> > > won't see or care about.  This is especially true for stuff at
> > > the beginning of the output since that's most likely to succeed
> > > in being transferred.
> > 
> > Fair enough.
> > The loading speed of public-inbox is really great, let's keep it that
> > way.
> 
> Good to know it's great for you.  It's still too slow for me,
> but I'm anti-consumerist and refuse to follow Moore's law :x

^ permalink raw reply	[relevance 0%]

* Re: Add "generator" information to HTML pages
  @ 2023-01-08 20:58  8%     ` Eric Wong
  2023-01-08 21:54  0%       ` Thomas Weißschuh
  0 siblings, 1 reply; 6+ results
From: Eric Wong @ 2023-01-08 20:58 UTC (permalink / raw)
  To: Thomas Weißschuh; +Cc: meta

Thomas Weißschuh <thomas@t-8ch.de> wrote:
> On Sun, Jan 08, 2023 at 07:47:38PM +0000, Eric Wong wrote:
> > Thomas Weißschuh <thomas@t-8ch.de> wrote:
> > > Hi,
> > > 
> > > it would be nice if public-inbox could extend the HTML pages it
> > > generates with the "generator" meta tag [0].
> > > Especially the version would be useful.
> > > 
> > > This would help users during debugging to see the specific version of
> > > public-inbox they are looking at.
> > 
> > What would users be debugging?
> > Admins would be the only ones who care, I think...
> 
> Since recently my mails to linux-kernel@vger.kernel.org that should end
> up on public-inbox on https://lore.kernel.org/lkml/ don't do so.
> They are accepted by the mail server on vger.kernel.org but never end up
> in the archives.
> I suspect some interactions between b4 which is used to generate the
> mails, the unicode characters in my name and public-inbox to be the
> culprit.

Your mail seems fine to my server, but coming from an IPv6
address has caused problems with some other servers in the past.
Another potential thing might be your use of utf-8 in the From:
header, while your Content-Type: is iso-8859-1 for the body.

> This is what I wanted to reproduce locally, for which exact versions
> would have been nice.

I remember Konstantin has cherry-picked some commits from
public-inbox.git in the past, and I suspect he already
has https://public-inbox.org/meta/20221124213155.M736847@dcvr/
("eml: header_raw converts octets to Perl UTF-8") for SMTPUTF8

One thing I wouldn't be opposed to doing is adding a way to
download all loaded files in a tarball as a means for AGPL
enforcement.  The tricky thing is those files may change on disk
after loading (and often do in my case :x), so they'd need to
be copied into stable storage at startup (and updated if there's
lazy-loading).  Same security caveats apply, though.

> > I also don't like wasting memory+bandwidth on things most users
> > won't see or care about.  This is especially true for stuff at
> > the beginning of the output since that's most likely to succeed
> > in being transferred.
> 
> Fair enough.
> The loading speed of public-inbox is really great, let's keep it that
> way.

Good to know it's great for you.  It's still too slow for me,
but I'm anti-consumerist and refuse to follow Moore's law :x

^ permalink raw reply	[relevance 8%]

* [PATCH] content_hash: handle References as octets
  2022-11-25 18:14 10%   ` Konstantin Ryabitsev
@ 2022-11-27  9:15  8%     ` Eric Wong
  0 siblings, 0 replies; 6+ results
From: Eric Wong @ 2022-11-27  9:15 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Thu, Nov 24, 2022 at 09:31:55PM +0000, Eric Wong wrote:
> > The below case generalizes it to all HTML displays and removes
> > the special case.
> 
> It looks good to me in some cursory tests, thank you!

I just noticed this so far:
-------8<------
Subject: [PATCH] content_hash: handle References as octets

The alsa-devel archives on lore have some UTF-8 References:
headers, so we need to treat them as octets, again, otherwise
(re)indexing triggers cascading failures.

Fixes: 5198c976ce8b "eml: header_raw converts octets to Perl UTF-8"
---
 lib/PublicInbox/ContentHash.pm |  7 ++++---
 t/v2writable.t                 | 16 ++++++++++++++++
 2 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/lib/PublicInbox/ContentHash.pm b/lib/PublicInbox/ContentHash.pm
index bacc9cdd..1afbb413 100644
--- a/lib/PublicInbox/ContentHash.pm
+++ b/lib/PublicInbox/ContentHash.pm
@@ -1,4 +1,4 @@
-# Copyright (C) 2018-2021 all contributors <meta@public-inbox.org>
+# Copyright (C) all contributors <meta@public-inbox.org>
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
 
 # Unstable internal API.
@@ -63,8 +63,9 @@ sub content_digest ($;$) {
 	# do NOT consider the Message-ID as part of the content_hash
 	# if we got here, we've already got Message-ID reuse
 	my %seen = map { $_ => 1 } @{mids($eml)};
-	foreach my $mid (@{references($eml)}) {
-		$dig->add("ref\0$mid\0") unless $seen{$mid}++;
+	for (grep { !$seen{$_}++ } @{references($eml)}) {
+		utf8::encode($_);
+		$dig->add("ref\0$_\0");
 	}
 
 	# Only use Sender: if From is not present
diff --git a/t/v2writable.t b/t/v2writable.t
index ad946338..0d102204 100644
--- a/t/v2writable.t
+++ b/t/v2writable.t
@@ -283,6 +283,22 @@ EOF
 	is($msgs->[1]->{mid}, 'y'x244, 'stored truncated mid(2)');
 }
 
+if ('UTF-8 References') {
+	my @w;
+	local $SIG{__WARN__} = sub { push @w, @_ };
+	my $msg = <<EOM;
+From: a\@example.com
+Subject: b
+Message-ID: <horrible\@example>
+References: <\xc4\x80\@example>
+
+EOM
+	ok($im->add(PublicInbox::Eml->new($msg."a\n")), 'UTF-8 References 1');
+	ok($im->add(PublicInbox::Eml->new($msg."b\n")), 'UTF-8 References 2');
+	$im->done;
+	ok(!grep(/Wide character/, @w), 'no wide characters') or xbail(\@w);
+}
+
 my $tmp = {
 	inboxdir => "$inboxdir/non-existent/subdir",
 	name => 'nope',

^ permalink raw reply related	[relevance 8%]
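
The fix above boils down to encoding each References value back to
octets before feeding it to the digest.  A rough Python analogue follows
(it is not the Perl code from the patch): content_digest_refs, its
arguments, and the choice of SHA-256 are assumptions for illustration.
Python's hashlib rejects character strings outright, which plays the
role of the "Wide character" complaint the new test guards against.

```python
import hashlib


def content_digest_refs(references, seen_mids):
    """Hash unseen References values as UTF-8 octets (illustrative only)."""
    dig = hashlib.sha256()
    seen = set(seen_mids)
    for ref in references:
        if ref in seen:
            continue
        seen.add(ref)
        # Encode characters back to octets; hashlib only accepts bytes.
        dig.update(b"ref\0" + ref.encode("utf-8") + b"\0")
    return dig.hexdigest()


# U+0100 is the character whose UTF-8 encoding is \xc4\x80.
refs = ["\u0100@example", "horrible@example", "\u0100@example"]
digest = content_digest_refs(refs, seen_mids=["horrible@example"])
print(digest)

# Feeding characters instead of octets fails immediately.
try:
    hashlib.sha256().update("ref\0\u0100@example\0")
    rejected = False
except TypeError:
    rejected = True
print(rejected)
```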

* Re: [PATCH] eml: header_raw converts octets to Perl UTF-8
  2022-11-24 21:31 14% ` [PATCH] eml: header_raw converts octets to Perl UTF-8 Eric Wong
  2022-11-25 18:14 10%   ` Konstantin Ryabitsev
@ 2022-11-26  9:05 10%   ` Eric Wong
  1 sibling, 0 replies; 6+ results
From: Eric Wong @ 2022-11-26  9:05 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: meta

Eric Wong <e@80x24.org> wrote:
> IMAP envelope responses still need raw bytes since the zlib
> wrapper is designed for octets, Perl UTF-8 bytes.

That should say:
   wrapper is designed for octets, not Perl UTF-8 chars.

Fixed and pushed.  Thanks for testing.

^ permalink raw reply	[relevance 10%]
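
To illustrate the corrected sentence: both the IMAP literal length and
the zlib wrapper work on octets, so the value has to be encoded before
either is used.  A small Python sketch of the same idea, loosely modeled
on _esc from the patch; imap_esc is a hypothetical name and this is not
public-inbox's code.

```python
import zlib


def imap_esc(value):
    """Return value as an IMAP NIL, literal, or quoted string, in bytes."""
    if value is None:
        return b"NIL"
    octets = value.encode("utf-8")  # characters -> octets
    if any(c in value for c in '{"\r\n%*\\['):
        # The literal length prefix counts octets, not characters.
        return b"{" + str(len(octets)).encode("ascii") + b"}\r\n" + octets
    return b'"' + octets + b'"'


name = "\u00c6var [test]"  # 11 characters, 12 octets in UTF-8
wire = imap_esc(name)
print(wire)

# zlib also wants octets; compress and decompress the wire bytes.
packed = zlib.compress(wire)
restored = zlib.decompress(packed)
print(restored == wire)
```

Computing the length on the character string instead would announce one
octet too few for this value.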

* Re: [PATCH] eml: header_raw converts octets to Perl UTF-8
  2022-11-24 21:31 14% ` [PATCH] eml: header_raw converts octets to Perl UTF-8 Eric Wong
@ 2022-11-25 18:14 10%   ` Konstantin Ryabitsev
  2022-11-27  9:15  8%     ` [PATCH] content_hash: handle References as octets Eric Wong
  2022-11-26  9:05 10%   ` [PATCH] eml: header_raw converts octets to Perl UTF-8 Eric Wong
  1 sibling, 1 reply; 6+ results
From: Konstantin Ryabitsev @ 2022-11-25 18:14 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

On Thu, Nov 24, 2022 at 09:31:55PM +0000, Eric Wong wrote:
> The below case generalizes it to all HTML displays and removes
> the special case.

It looks good to me in some cursory tests, thank you!

-K

^ permalink raw reply	[relevance 10%]

* [PATCH] eml: header_raw converts octets to Perl UTF-8
  @ 2022-11-24 21:31 14% ` Eric Wong
  2022-11-25 18:14 10%   ` Konstantin Ryabitsev
  2022-11-26  9:05 10%   ` [PATCH] eml: header_raw converts octets to Perl UTF-8 Eric Wong
  0 siblings, 2 replies; 6+ results
From: Eric Wong @ 2022-11-24 21:31 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> Hello:
> 
> There's a bit of inconsistency handling messages with utf8 content in the
> headers:
> 
> https://lore.kernel.org/b4-sent/20221122-gud-shadow-plane-v1-0-9de3afa3383e@tronnes.org/
> 
> You can see that the name in the From: line is mangled, but in the thread
> overview it is displayed correctly.

Thanks, the overview and Xapian/SQLite DBs were correct because
PublicInbox::Smsg->populate had a special case.

The below case generalizes it to all HTML displays and removes
the special case.

> I know older SMTP standards still require 7bit escaping in the headers, but
> with SMTPUTF8 being very widely available, it should be possible to store and
> properly display messages with 8bit unicode in the headers.

Oops, I barely knew about it :x

Anyways, the below should fix it and I've deployed it to:
https://yhbt.net/lore/b4-sent/20221122-gud-shadow-plane-v1-0-9de3afa3383e@tronnes.org/

I'm pretty sure it's safe, but my HW is problematic [1] and I'm not
sure if it can finish a full reindex.

---------8<--------
Subject: [PATCH] eml: header_raw converts octets to Perl UTF-8

This fixes the display of raw (non-RFC 2047) names and subjects
in HTML message views.

SMTPUTF8 (RFC 6531) allows raw UTF-8 in headers without RFC 2047
encoding, so let Perl handle it as a character sequence for the
rest of our consumers.  Thus, the old special case in
PublicInbox::Smsg->populate is no longer necessary and gone.

The one regression noticed so far (and fixed here) is that compressed
IMAP envelope responses still need raw bytes since the zlib
wrapper is designed for octets, Perl UTF-8 bytes.  Thus we
reverse utf8::decode with utf8::encode in PublicInbox::IMAP::_esc.

->header_set also forces encoding to bytes, since all existing
callers would either be dealing with ->header_raw results or
be RFC-2047-encoded anyways.

Reindexing is not necessary with this change due to the prior
PublicInbox::Smsg->populate special case.

Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
Link: https://public-inbox.org/meta/20221124153715.3nenjpjzj43vqxr2@meerkat.local/
---
 lib/PublicInbox/Eml.pm  |  8 +++++---
 lib/PublicInbox/IMAP.pm |  2 ++
 lib/PublicInbox/Smsg.pm |  3 ---
 t/imapd.t               | 28 ++++++++++++++++++++++++++++
 t/psgi_search.t         |  7 ++++++-
 5 files changed, 41 insertions(+), 7 deletions(-)

diff --git a/lib/PublicInbox/Eml.pm b/lib/PublicInbox/Eml.pm
index 485f637a..8b999e1a 100644
--- a/lib/PublicInbox/Eml.pm
+++ b/lib/PublicInbox/Eml.pm
@@ -1,4 +1,4 @@
-# Copyright (C) 2020-2021 all contributors <meta@public-inbox.org>
+# Copyright (C) all contributors <meta@public-inbox.org>
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
 #
 # Lazy MIME parser, it still slurps the full message but keeps short
@@ -144,6 +144,7 @@ sub header_raw {
 	my $re = re_memo($_[1]);
 	my @v = (${ $_[0]->{hdr} } =~ /$re/g);
 	for (@v) {
+		utf8::decode($_); # SMTPUTF8
 		# for compatibility w/ Email::Simple::Header,
 		s/\s+\z//s;
 		s/\A\s+//s;
@@ -359,14 +360,15 @@ sub header_set {
 	$pfx .= ': ';
 	my $len = 78 - length($pfx);
 	@vals = map {;
+		utf8::encode(my $v = $_); # to bytes, support SMTPUTF8
 		# folding differs from Email::Simple::Header,
 		# we favor tabs for visibility (and space savings :P)
 		if (length($_) >= $len && (/\n[^ \t]/s || !/\n/s)) {
 			local $Text::Wrap::columns = $len;
 			local $Text::Wrap::huge = 'overflow';
-			$pfx . wrap('', "\t", $_) . $self->{crlf};
+			$pfx . wrap('', "\t", $v) . $self->{crlf};
 		} else {
-			$pfx . $_ . $self->{crlf};
+			$pfx . $v . $self->{crlf};
 		}
 	} @vals;
 	$$hdr =~ s!$re!shift(@vals) // ''!ge; # replace current headers, first
diff --git a/lib/PublicInbox/IMAP.pm b/lib/PublicInbox/IMAP.pm
index 1f65aa65..37317948 100644
--- a/lib/PublicInbox/IMAP.pm
+++ b/lib/PublicInbox/IMAP.pm
@@ -426,8 +426,10 @@ sub _esc ($) {
 	if (!defined($v)) {
 		'NIL';
 	} elsif ($v =~ /[{"\r\n%*\\\[]/) { # literal string
+		utf8::encode($v);
 		'{' . length($v) . "}\r\n" . $v;
 	} else { # quoted string
+		utf8::encode($v);
 		qq{"$v"}
 	}
 }
diff --git a/lib/PublicInbox/Smsg.pm b/lib/PublicInbox/Smsg.pm
index 2026c7d9..b132381b 100644
--- a/lib/PublicInbox/Smsg.pm
+++ b/lib/PublicInbox/Smsg.pm
@@ -99,9 +99,6 @@ sub populate {
 		# to protect git and NNTP clients
 		$val =~ tr/\0\t\n/   /;
 
-		# rare: in case headers have wide chars (not RFC2047-encoded)
-		utf8::decode($val);
-
 		# lower-case fields for read-only stuff
 		$self->{lc($f)} = $val;
 
diff --git a/t/imapd.t b/t/imapd.t
index 3c74aefd..cbd6c1b9 100644
--- a/t/imapd.t
+++ b/t/imapd.t
@@ -534,6 +534,34 @@ SKIP: {
 	}
 }
 
+{
+	ok(my $ic = $imap_client->new(%mic_opt), 'logged in');
+	my $mb = "$ibx[0]->{newsgroup}.$first_range";
+	ok($ic->examine($mb), "EXAMINE $mb");
+	my $uidnext = $ic->uidnext($mb); # we'll fetch BODYSTRUCTURE on this
+	my $im = $ibx[0]->importer(0);
+	$im->add(PublicInbox::Eml->new(<<EOF)) or BAIL_OUT;
+Subject: test Ævar
+Message-ID: <smtputf8-delivered-mess\@age>
+From: Ævar Arnfjörð Bjarmason <avarab\@example>
+To: git\@vger.kernel.org
+
+EOF
+	$im->done;
+	my $envl = $ic->get_envelope($uidnext);
+	is($envl->{subject}, 'test Ævar', 'UTF-8 subject');
+	is($envl->{sender}->[0]->{personalname}, 'Ævar Arnfjörð Bjarmason',
+		'UTF-8 sender[0].personalname');
+	SKIP: {
+		skip 'need compress for comparisons', 1 if !$can_compress;
+		ok($ic = $imap_client->new(%mic_opt), 'uncompressed logged in');
+		ok($ic && $ic->compress, 'compress enabled');
+		ok($ic->examine($mb), "EXAMINE $mb");
+		my $raw = $ic->get_envelope($uidnext);
+		is_deeply($envl, $raw, 'raw and compressed match');
+	}
+}
+
 $td->kill;
 $td->join;
 is($?, 0, 'no error in exited process') if !$ENV{TEST_KILL_IMAPD};
diff --git a/t/psgi_search.t b/t/psgi_search.t
index 3da93eda..8868f67e 100644
--- a/t/psgi_search.t
+++ b/t/psgi_search.t
@@ -1,5 +1,5 @@
 #!perl -w
-# Copyright (C) 2017-2021 all contributors <meta@public-inbox.org>
+# Copyright (C) all contributors <meta@public-inbox.org>
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
 use strict;
 use v5.10.1;
@@ -103,6 +103,11 @@ test_psgi(sub { $www->call(@_) }, sub {
 		like($res->content, $mid_re, 'found mid in response');
 		chop($digits);
 	}
+	$res = $cb->(GET("/test/$mid/"));
+	$html = $res->content;
+	like($html, qr/\bFrom: &#198;var /,
+		"displayed Ævar's name properly in permalink From:");
+	unlike($html, qr/&#195;/, 'no raw octets in permalink HTML');
 
 	$res = $cb->(GET('/test/'));
 	$html = $res->content;
-- 
[1] getting weird SATA errors and disconnects and SMART is not
    telling me anything useful.  I suspect a wrong BIOS setting
    after replacing a CMOS battery, but so many knobs and it
    takes hours/days to reproduce :<

^ permalink raw reply related	[relevance 14%]
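
The psgi_search.t check in the patch above expects &#198; (Æ as a single
character) and no &#195; (the first octet of its UTF-8 encoding shown as
a character of its own).  A short Python sketch of why decoding the
header octets first makes that difference; html_ascii is a hypothetical
helper, and Latin-1 stands in for "treat each octet as a character".

```python
def html_ascii(text):
    """Render non-ASCII characters as decimal HTML character references.

    Only non-ASCII is handled; '<', '>' and '&' are left alone here.
    """
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


raw = b"From: \xc3\x86var <avarab@example>"  # header octets as stored

# Without UTF-8 decoding, every octet is shown as its own character.
mangled = html_ascii(raw.decode("latin-1"))
# With UTF-8 decoding (the role utf8::decode plays in header_raw),
# the two octets become one character.
fixed = html_ascii(raw.decode("utf-8"))

print(mangled)
print(fixed)
```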

Results 1-6 of 6
2022-11-24 15:37     handling unquoted utf8 in the headers Konstantin Ryabitsev
2022-11-24 21:31 14% ` [PATCH] eml: header_raw converts octets to Perl UTF-8 Eric Wong
2022-11-25 18:14 10%   ` Konstantin Ryabitsev
2022-11-27  9:15  8%     ` [PATCH] content_hash: handle References as octets Eric Wong
2022-11-26  9:05 10%   ` [PATCH] eml: header_raw converts octets to Perl UTF-8 Eric Wong
2023-01-08 19:04     Add "generator" information to HTML pages Thomas Weißschuh
2023-01-08 19:47     ` Eric Wong
2023-01-08 20:02       ` Thomas Weißschuh
2023-01-08 20:58  8%     ` Eric Wong
2023-01-08 21:54  0%       ` Thomas Weißschuh

Code repositories for project(s) associated with this public inbox

	https://80x24.org/public-inbox.git
