user/dev discussion of public-inbox itself
 help / color / mirror / code / Atom feed
* Extra newline when retrieving messages
@ 2020-12-10 20:21 Konstantin Ryabitsev
  2020-12-10 20:55 ` Eric Wong
  0 siblings, 1 reply; 6+ messages in thread
From: Konstantin Ryabitsev @ 2020-12-10 20:21 UTC (permalink / raw)
  To: meta

Hello:

While investigating why some of the messages retrieved via 
lore.kernel.org were failing DKIM checks, I realized that 
public-inbox-httpd appends an extra newline to message bodies. This 
newline isn't present in git backends, just in messages retrieved via 
(at least) public-inbox-httpd. 

E.g.:

$ git clone --mirror https://lore.kernel.org/linux-modules/git/0.git
$ cd 0.git
$ git show 8c61c2a9ec6c4737d2da7258a0ff565d87d83d0b:m > /tmp/1
$ curl -s https://lore.kernel.org/linux-modules/20201129164737.135866-1-yauheni.kaliuta@redhat.com/raw > /tmp/2
$ diff -u /tmp/1 /tmp/2
--- /tmp/1      2020-12-10 15:06:15.815341700 -0500
+++ /tmp/2      2020-12-10 15:06:20.647345538 -0500
@@ -1,3 +1,4 @@
+From mboxrd@z Thu Jan  1 00:00:00 1970
 Return-Path: <linux-modules-owner@kernel.org>
 X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
        aws-us-west-2-korg-lkml-1.web.codeaurora.org
@@ -56,6 +57,9 @@
 X-Scanned-By: MIMEDefang 2.79 on 10.5.11.14
 Precedence: bulk
 List-ID: <linux-modules.vger.kernel.org>
+Archived-At: <https://lore.kernel.org/linux-modules/20201129164737.135866-1-yauheni.kaliuta@redhat.com/>
+List-Archive: <https://lore.kernel.org/linux-modules/>
+List-Post: <mailto:linux-modules@vger.kernel.org>

 The function allocates array but on building it if get_string()
 fails it returns the error leaving the array allocated. The caller
@@ -83,3 +87,4 @@
 --
 2.29.2

+

^^^
This extra newline seems to be present whether a message is retrieved 
via /raw or via /t.mbox.gz. It's benign in almost all cases except when 
the message is signed using DKIM's "simple" mode, in which case it 
causes DKIM signature to no longer verify (as "simple" doesn't do any 
whitespace normalization before hashing body contents).

This appears to be present in the public-inbox version running on 
https://public-inbox.org as well.

Best regards,
-K

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Extra newline when retrieving messages
  2020-12-10 20:21 Extra newline when retrieving messages Konstantin Ryabitsev
@ 2020-12-10 20:55 ` Eric Wong
  2020-12-10 21:43   ` Konstantin Ryabitsev
  0 siblings, 1 reply; 6+ messages in thread
From: Eric Wong @ 2020-12-10 20:55 UTC (permalink / raw)
  To: Konstantin Ryabitsev; +Cc: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> Hello:
> 
> While investigating why some of the messages retrieved via 
> lore.kernel.org were failing DKIM checks, I realized that 
> public-inbox-httpd appends an extra newline to message bodies. This 
> newline isn't present in git backends, just in messages retrieved via 
> (at least) public-inbox-httpd. 

It looks like a regression from commit dbdc7a42dd885523 (2016-04-11);
which now doesn't make sense, though very little is making sense
to me nowadays :<

The following should undo it, unless somebody DKIM signed a
message which lacked any trailing "\n" at all.  I'm not sure
if it's possible to get a message lacking a trailing "\n"
through any MTA, though.

So I think the following should fix it, but I'm having even more
trouble than usual trusting anything I do:

---------8<----------
Subject: [PATCH] mbox: only add a trailing newline if one does not exist

In retrospect, commit dbdc7a42dd885523 from 2016-04-11
did not make sense since the CR would've been added before
the "From " line anyways.

Fixes: dbdc7a42dd885523 ("mbox: unconditionally add trailing newline")
Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
Link: https://public-inbox.org/meta/20201210202145.7agtcmrtl5jec42d@chatter.i7.local/
---
 lib/PublicInbox/Mbox.pm | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/lib/PublicInbox/Mbox.pm b/lib/PublicInbox/Mbox.pm
index c88350c9..1895d7e6 100644
--- a/lib/PublicInbox/Mbox.pm
+++ b/lib/PublicInbox/Mbox.pm
@@ -133,7 +133,11 @@ sub msg_body ($) {
 	# https://www.loc.gov/preservation/digital/formats/fdd/fdd000385.shtml
 	# https://web.archive.org/http://www.qmail.org/man/man5/mbox.html
 	$$bdy =~ s/^(>*From )/>$1/gm;
-	$$bdy .= "\n";
+
+	# some messages lack a trailing newline, there needs to be one
+	# before the "From " line for the next message
+	$$bdy .= "\n" if substr($$bdy, -1, 1) ne "\n";
+	$$bdy;
 }
 
 sub thread_cb {

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: Extra newline when retrieving messages
  2020-12-10 20:55 ` Eric Wong
@ 2020-12-10 21:43   ` Konstantin Ryabitsev
  2020-12-10 22:29     ` Konstantin Ryabitsev
  2020-12-10 22:38     ` Eric Wong
  0 siblings, 2 replies; 6+ messages in thread
From: Konstantin Ryabitsev @ 2020-12-10 21:43 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

On Thu, Dec 10, 2020 at 08:55:40PM +0000, Eric Wong wrote:
> Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> > Hello:
> > 
> > While investigating why some of the messages retrieved via 
> > lore.kernel.org were failing DKIM checks, I realized that 
> > public-inbox-httpd appends an extra newline to message bodies. This 
> > newline isn't present in git backends, just in messages retrieved via 
> > (at least) public-inbox-httpd. 
> 
> It looks like a regression from commit dbdc7a42dd885523 (2016-04-11);
> which now doesn't make sense, though very little is making sense
> to me nowadays :<

Actually, my conclusions were wrong and the problem is not with the 
newline -- it's the extra inserted headers. So you're not the only 
person confused here. :)

The message that tripped me up is this one:

https://lore.kernel.org/alsa-devel/20201210152541.191728-1-amadeuszx.slawinski@linux.intel.com/raw

The DKIM signature in that message is:

DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=alsa-project.org;
	s=default; t=1607605851;
	bh=RloCOZ1mS9qShbBGvPnUerOnMg14SHidcYi1OxvdtE0=;
	h=From:To:Subject:Date:Cc:List-Id:List-Unsubscribe:List-Archive:
	 List-Post:List-Help:List-Subscribe:From;
	b=nBOdKYbMY44uVJMlw++UIMx03JWn334I/F5uyH5hmhU9h9/cFQTfZL0WAwPzzG7nL
	 pT9M4ElwsSqu8isrhJd7QV8q8DNvu+cRmGKbGOVLIEMhlYU87iHvIATKQmchLQv3xR
	 OixHf1955oAoTFU/n5eqjaZdLhyfUtSo5oCdxg7Y=

The problem here is that List-Archive is included in the list of signed 
headers, but when we retrieve the message via lore.kernel.org, it 
inserts an additional couple of headers:

Archived-At: <https://lore.kernel.org/alsa-devel/20201210152541.191728-1-amadeuszx.slawinski@linux.intel.com/>
List-Archive: <https://lore.kernel.org/alsa-devel/>

This is what causes DKIM verification to fail, and NOT the newline:

$ curl -s \
  https://lore.kernel.org/alsa-devel/20201210152541.191728-1-amadeuszx.slawinski@linux.intel.com/raw \
  | dkimverify
signature verification failed

$ curl -s \
  https://lore.kernel.org/alsa-devel/20201210152541.191728-1-amadeuszx.slawinski@linux.intel.com/raw \
  | grep -v '<https://lore' | dkimverify
signature ok

So, I was confused as well and there is no need to fix the newline.

That said, should public-inbox consider this case when generating the 
/raw and /t.mbox.gz messages? If the Archived-At and List-Archive 
headers are listed in the DKIM-Signature header, skip inserting them 
into the generated message?

-K

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Extra newline when retrieving messages
  2020-12-10 21:43   ` Konstantin Ryabitsev
@ 2020-12-10 22:29     ` Konstantin Ryabitsev
  2020-12-10 22:38     ` Eric Wong
  1 sibling, 0 replies; 6+ messages in thread
From: Konstantin Ryabitsev @ 2020-12-10 22:29 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

On Thu, 10 Dec 2020 at 16:43, Konstantin Ryabitsev
<konstantin@linuxfoundation.org> wrote:
> This is what causes DKIM verification to fail, and NOT the newline:

for the record, the DKIM RFC specifically deals with extra trailing newlines:

   The "simple" body canonicalization algorithm ignores all empty lines
   at the end of the message body.

(from https://tools.ietf.org/html/rfc6376#section-3.4.3)

I'll make it a point to double-check next time before jumping to conclusions. :)

-K

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Extra newline when retrieving messages
  2020-12-10 21:43   ` Konstantin Ryabitsev
  2020-12-10 22:29     ` Konstantin Ryabitsev
@ 2020-12-10 22:38     ` Eric Wong
  2020-12-10 22:41       ` Konstantin Ryabitsev
  1 sibling, 1 reply; 6+ messages in thread
From: Eric Wong @ 2020-12-10 22:38 UTC (permalink / raw)
  To: meta

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
<snip>

> That said, should public-inbox consider this case when generating the 
> /raw and /t.mbox.gz messages? If the Archived-At and List-Archive 
> headers are listed in the DKIM-Signature header, skip inserting them 
> into the generated message?

Fwiw, I've been wondering if they are even necessary.  I don't
think anybody misses them with -imapd.  So I prefer to just drop
them and have less code:

---------8<---------
Subject: [PATCH] nntp+www: drop List-* and Archived-At headers

These headers can conflict with headers in the DKIM signature;
and parsing the DKIM-Signature header to determine whether or
not we can safely add a header would be more code and CPU
cycles.

Since IMAP seems fine without these headers (and JMAP will
likely be, too), there's likely no need to continue appending
these to every message.  Nowadays, developers seem sufficiently
trained to use URLs with Message-IDs in them.  So drop the
headers and save some cycles and bandwidth all around.
---
 lib/PublicInbox/Mbox.pm   | 32 ++++++--------------------------
 lib/PublicInbox/MboxGz.pm |  2 +-
 lib/PublicInbox/NNTP.pm   |  8 --------
 t/nntp.t                  |  9 ---------
 t/psgi_mount.t            | 14 ++++----------
 5 files changed, 11 insertions(+), 54 deletions(-)

diff --git a/lib/PublicInbox/Mbox.pm b/lib/PublicInbox/Mbox.pm
index c88350c9..c8e4b406 100644
--- a/lib/PublicInbox/Mbox.pm
+++ b/lib/PublicInbox/Mbox.pm
@@ -20,7 +20,7 @@ sub getline {
 	my $ibx = $ctx->{ibx};
 	my $eml = $ibx->smsg_eml($smsg) or return;
 	my $n = $ctx->{smsg} = $ibx->over->next_by_mid(@{$ctx->{next_arg}});
-	$ctx->zmore(msg_hdr($ctx, $eml, $smsg->{mid}));
+	$ctx->zmore(msg_hdr($ctx, $eml));
 	if ($n) {
 		$ctx->translate(msg_body($eml));
 	} else { # last message
@@ -46,7 +46,7 @@ sub async_eml { # for async_blob_cb
 	# next message
 	$ctx->{smsg} = $ctx->{ibx}->over->next_by_mid(@{$ctx->{next_arg}});
 
-	$ctx->zmore(msg_hdr($ctx, $eml, $smsg->{mid}));
+	$ctx->zmore(msg_hdr($ctx, $eml));
 	$ctx->{http_out}->write($ctx->translate(msg_body($eml)));
 }
 
@@ -74,7 +74,7 @@ sub no_over_raw ($) {
 	my $mref = $ctx->{ibx}->msg_by_mid($ctx->{mid}) or return;
 	my $eml = PublicInbox::Eml->new($mref);
 	[ 200, res_hdr($ctx, $eml->header_str('Subject')),
-		[ msg_hdr($ctx, $eml, $ctx->{mid}) . msg_body($eml) ] ]
+		[ msg_hdr($ctx, $eml) . msg_body($eml) ] ]
 }
 
 # /$INBOX/$MESSAGE_ID/raw
@@ -90,8 +90,8 @@ sub emit_raw {
 	$ctx->psgi_response(200, $res_hdr);
 }
 
-sub msg_hdr ($$;$) {
-	my ($ctx, $eml, $mid) = @_;
+sub msg_hdr ($$) {
+	my ($ctx, $eml) = @_;
 	my $header_obj = $eml->header_obj;
 
 	# drop potentially confusing headers, ssoma already should've dropped
@@ -99,31 +99,11 @@ sub msg_hdr ($$;$) {
 	foreach my $d (qw(Lines Bytes Content-Length Status)) {
 		$header_obj->header_set($d);
 	}
-	my $ibx = $ctx->{ibx};
-	my $base = $ctx->{base_url};
-	$mid = $ctx->{mid} unless defined $mid;
-	$mid = mid_escape($mid);
-	my @append = (
-		'Archived-At', "<$base$mid/>",
-		'List-Archive', "<$base>",
-	);
 	my $crlf = $header_obj->crlf;
 	my $buf = $header_obj->as_string;
 	# fixup old bug from import (pre-a0c07cba0e5d8b6a)
 	$buf =~ s/\A[\r\n]*From [^\r\n]*\r?\n//s;
-	$buf = "From mboxrd\@z Thu Jan  1 00:00:00 1970" . $crlf . $buf;
-
-	for (my $i = 0; $i < @append; $i += 2) {
-		my $k = $append[$i];
-		my $v = $append[$i + 1];
-		my @v = $header_obj->header_raw($k);
-		$buf .= "$k: $v$crlf" if !grep(/\A\Q$v\E\z/, @v);
-	}
-	my $post_addr = $ibx->{-primary_address};
-	if ($post_addr && $header_obj->header_raw('List-Post')) {
-		$buf .= "List-Post: <mailto:$post_addr>$crlf";
-	}
-	$buf .= $crlf;
+	"From mboxrd\@z Thu Jan  1 00:00:00 1970" . $crlf . $buf . $crlf;
 }
 
 sub msg_body ($) {
diff --git a/lib/PublicInbox/MboxGz.pm b/lib/PublicInbox/MboxGz.pm
index ab3c9770..7b054845 100644
--- a/lib/PublicInbox/MboxGz.pm
+++ b/lib/PublicInbox/MboxGz.pm
@@ -38,7 +38,7 @@ sub getline {
 	my $cb = $self->{cb} or return;
 	while (my $smsg = $cb->($self)) {
 		my $eml = $self->{ibx}->smsg_eml($smsg) or next;
-		$self->zmore(msg_hdr($self, $eml, $smsg->{mid}));
+		$self->zmore(msg_hdr($self, $eml));
 		return $self->translate(msg_body($eml));
 	}
 	# signal that we're done and can return undef next call:
diff --git a/lib/PublicInbox/NNTP.pm b/lib/PublicInbox/NNTP.pm
index 097fdb84..11a7ffb8 100644
--- a/lib/PublicInbox/NNTP.pm
+++ b/lib/PublicInbox/NNTP.pm
@@ -487,14 +487,6 @@ sub set_nntp_headers ($$) {
 	# *something* here is required for leafnode, try to follow
 	# RFC 5536 3.1.5...
 	$hdr->header_set('Path', $server_name . '!not-for-mail');
-	if (my $post_addr = $ibx->{-primary_address}) {
-		header_append($hdr, 'List-Post', "<mailto:$post_addr>");
-	}
-	if (my $url = $ibx->base_url) {
-		$mid = mid_escape($mid);
-		header_append($hdr, 'Archived-At', "<$url$mid/>");
-		header_append($hdr, 'List-Archive', "<$url>");
-	}
 }
 
 sub art_lookup ($$$) {
diff --git a/t/nntp.t b/t/nntp.t
index b745886d..36eb6945 100644
--- a/t/nntp.t
+++ b/t/nntp.t
@@ -121,12 +121,6 @@ use PublicInbox::Config;
 	PublicInbox::NNTP::set_nntp_headers($hdr, $smsg);
 	is_deeply([ $mime->header('Message-ID') ], [ "<$mid>" ],
 		'Message-ID unchanged');
-	is_deeply([ $mime->header('Archived-At') ], [ "<${u}a\@b/>" ],
-		'Archived-At: set');
-	is_deeply([ $mime->header('List-Archive') ], [ "<$u>" ],
-		'List-Archive: set');
-	is_deeply([ $mime->header('List-Post') ], [ '<mailto:a@example.com>' ],
-		'List-Post: set');
 	is_deeply([ $mime->header('Newsgroups') ], [ 'test' ],
 		'Newsgroups: set');
 	is_deeply([ $mime->header('Xref') ], [ 'example.com test:1' ],
@@ -137,9 +131,6 @@ use PublicInbox::Config;
 	PublicInbox::NNTP::set_nntp_headers($hdr, $smsg);
 	is_deeply([ $mime->header('Message-ID') ], [ "<$mid>" ],
 		'Message-ID unchanged');
-	is_deeply([ $mime->header('Archived-At') ],
-		[ "<${u}a\@b/>", '<http://mirror.example.com/m/a@b/>' ],
-		'Archived-At: appended');
 	is_deeply([ $mime->header('Xref') ], [ 'example.com test:2' ],
 		'Old Xref: clobbered');
 }
diff --git a/t/psgi_mount.t b/t/psgi_mount.t
index dac62c1a..48d8e5c0 100644
--- a/t/psgi_mount.t
+++ b/t/psgi_mount.t
@@ -67,11 +67,9 @@ test_psgi($app, sub {
 
 	$res = $cb->(GET('/a/test/blah%40example.com/raw'));
 	is($res->code, 200, 'OK with URLMap mount');
-	like($res->content, qr!^List-Archive: <http://[^/]+/a/test/>!m,
-		'List-Archive set in /raw mboxrd');
 	like($res->content,
-		qr!^Archived-At: <http://[^/]+/a/test/blah\@example\.com/>!m,
-		'Archived-At set in /raw mboxrd');
+		qr/^Message-Id: <blah\@example\.com>\n/sm,
+		'headers appear in /raw');
 
 	# redirects
 	$res = $cb->(GET('/a/test/m/blah%40example.com.html'));
@@ -94,12 +92,8 @@ SKIP: {
 		my $gz = $res->content;
 		my $raw;
 		IO::Uncompress::Gunzip::gunzip(\$gz => \$raw);
-		like($raw, qr!^List-Archive: <http://[^/]+/a/test/>!m,
-			'List-Archive set in /t.mbox.gz mboxrd');
-		like($raw,
-			qr!^Archived-At:\x20
-				<http://[^/]+/a/test/blah\@example\.com/>!mx,
-			'Archived-At set in /t.mbox.gz mboxrd');
+		like($raw, qr!^Message-Id:\x20<blah\@example\.com>\n!sm,
+			'headers appear in /t.mbox.gz mboxrd');
 	});
 }
 

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: Extra newline when retrieving messages
  2020-12-10 22:38     ` Eric Wong
@ 2020-12-10 22:41       ` Konstantin Ryabitsev
  0 siblings, 0 replies; 6+ messages in thread
From: Konstantin Ryabitsev @ 2020-12-10 22:41 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

On Thu, Dec 10, 2020 at 10:38:47PM +0000, Eric Wong wrote:
> Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> <snip>
> 
> > That said, should public-inbox consider this case when generating the 
> > /raw and /t.mbox.gz messages? If the Archived-At and List-Archive 
> > headers are listed in the DKIM-Signature header, skip inserting them 
> > into the generated message?
> 
> Fwiw, I've been wondering if they are even necessary.  I don't
> think anybody misses them with -imapd.  So I prefer to just drop
> them and have less code:

Right on, I'm good with that.

-K

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2020-12-10 22:41 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-10 20:21 Extra newline when retrieving messages Konstantin Ryabitsev
2020-12-10 20:55 ` Eric Wong
2020-12-10 21:43   ` Konstantin Ryabitsev
2020-12-10 22:29     ` Konstantin Ryabitsev
2020-12-10 22:38     ` Eric Wong
2020-12-10 22:41       ` Konstantin Ryabitsev

Code repositories for project(s) associated with this public inbox

	https://80x24.org/public-inbox.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).