user/dev discussion of public-inbox itself
* [PATCH 0/2] www: respect charset in $MSGID/raw display
@ 2021-10-25  2:45 Eric Wong
  2021-10-25  2:45 ` [PATCH 1/2] gzip_filter: delay async wcb call Eric Wong
  2021-10-25  2:45 ` [PATCH 2/2] www: $MSGID/raw: set charset in HTTP response Eric Wong
  0 siblings, 2 replies; 4+ messages in thread
From: Eric Wong @ 2021-10-25  2:45 UTC
  To: meta; +Cc: Thomas Weißschuh

Only lightly-tested via new additions to the test suite...

Eric Wong (2):
  gzip_filter: delay async wcb call
  www: $MSGID/raw: set charset in HTTP response

 lib/PublicInbox/GzipFilter.pm    | 34 ++++++++++++++++++++++----------
 lib/PublicInbox/Mbox.pm          | 26 +++++++++++++-----------
 lib/PublicInbox/WwwAtomStream.pm |  4 ++--
 lib/PublicInbox/WwwStream.pm     |  5 ++---
 t/plack.t                        | 26 +++++++++++++++++++++---
 t/psgi_v2.t                      |  5 ++++-
 6 files changed, 69 insertions(+), 31 deletions(-)

* [PATCH 1/2] gzip_filter: delay async wcb call
  2021-10-25  2:45 [PATCH 0/2] www: respect charset in $MSGID/raw display Eric Wong
@ 2021-10-25  2:45 ` Eric Wong
  2021-10-25  2:45 ` [PATCH 2/2] www: $MSGID/raw: set charset in HTTP response Eric Wong
  1 sibling, 0 replies; 4+ messages in thread
From: Eric Wong @ 2021-10-25  2:45 UTC
  To: meta; +Cc: Thomas Weißschuh

This will let us modify the response header later to set
a proper charset for Content-Type when displaying raw
messages.
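
To illustrate the idea, here is a minimal sketch (not the actual
GzipFilter code): the response object stashes the write callback
together with its arguments and only invokes the callback on the
first write, so the header array can still be changed in between.
LazyResponse and RecordingWriter are hypothetical names made up for
this example; only the wcb_args and http_out field names mirror the
diff below.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a streaming response; not PublicInbox code.
package LazyResponse;

sub new {
	my ($class, $code, $hdr, $wcb) = @_;
	# stash the arguments instead of calling $wcb right away
	bless { wcb_args => [ $code, $hdr, $wcb ] }, $class;
}

# Invoke $wcb on first use; headers stay editable until then.
sub http_out {
	my ($self) = @_;
	$self->{http_out} //= do {
		my $args = delete $self->{wcb_args} // return undef;
		my $wcb = pop @$args;
		$wcb->($args); # $wcb->([ $code, $hdr ])
	};
}

sub write { $_[0]->http_out->write($_[1]) }

# Hypothetical writer which records the headers and body it is given.
package RecordingWriter;

sub new { bless { hdr => $_[1], buf => '' }, $_[0] }
sub write { $_[0]->{buf} .= $_[1] }

package main;

my @hdr = ('Content-Type', 'text/plain');
my $writer;
my $wcb = sub {
	my ($res) = @_; # [ $code, $hdr ]
	$writer = RecordingWriter->new([ @{$res->[1]} ]);
};
my $res = LazyResponse->new(200, \@hdr, $wcb);
$hdr[1] .= '; charset=iso-8859-1'; # still possible: $wcb has not run
$res->write("hello\n"); # first write triggers $wcb
print "$writer->{hdr}[1]\n"; # prints "text/plain; charset=iso-8859-1"
```

In this sketch the callback sees the header as it stands at the time
of the first write, which is the property the next patch relies on.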

Cc: Thomas Weißschuh <thomas@t-8ch.de>
---
 lib/PublicInbox/GzipFilter.pm    | 19 +++++++++++++------
 lib/PublicInbox/Mbox.pm          |  2 +-
 lib/PublicInbox/WwwAtomStream.pm |  4 ++--
 lib/PublicInbox/WwwStream.pm     |  5 ++---
 4 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/lib/PublicInbox/GzipFilter.pm b/lib/PublicInbox/GzipFilter.pm
index c62161710725..c4858a971495 100644
--- a/lib/PublicInbox/GzipFilter.pm
+++ b/lib/PublicInbox/GzipFilter.pm
@@ -54,7 +54,7 @@ sub psgi_response {
 		$http->{forward} = $self;
 		sub {
 			my ($wcb) = @_; # -httpd provided write callback
-			$self->{http_out} = $wcb->([$code, $res_hdr]);
+			$self->{wcb_args} = [ $code, $res_hdr, $wcb ];
 			$self->can('async_next')->($http); # start stepping
 		};
 	} else { # generic PSGI code path
@@ -114,9 +114,17 @@ sub translate ($$) {
 	}
 }
 
+sub http_out ($) {
+	my ($self) = @_;
+	$self->{http_out} //= do {
+		my $args = delete $self->{wcb_args} // return undef;
+		pop(@$args)->($args); # $wcb->([$code, $hdr_ary])
+	};
+}
+
 sub write {
 	# my $ret = bytes::length($_[1]); # XXX does anybody care?
-	$_[0]->{http_out}->write(translate($_[0], $_[1]));
+	http_out($_[0])->write(translate($_[0], $_[1]));
 }
 
 # similar to ->translate; use this when we're sure we know we have
@@ -145,10 +153,9 @@ sub zflush ($;$) {
 
 sub close {
 	my ($self) = @_;
-	if (my $http_out = delete $self->{http_out}) {
-		$http_out->write(zflush($self));
-		$http_out->close;
-	}
+	my $http_out = http_out($self) // return;
+	$http_out->write(zflush($self));
+	delete($self->{http_out})->close;
 }
 
 sub bail  {
diff --git a/lib/PublicInbox/Mbox.pm b/lib/PublicInbox/Mbox.pm
index dede4825ff13..4f84eea6745d 100644
--- a/lib/PublicInbox/Mbox.pm
+++ b/lib/PublicInbox/Mbox.pm
@@ -47,7 +47,7 @@ sub async_eml { # for async_blob_cb
 	$ctx->{smsg} = $ctx->{ibx}->over->next_by_mid(@{$ctx->{next_arg}});
 
 	$ctx->zmore(msg_hdr($ctx, $eml));
-	$ctx->{http_out}->write($ctx->translate(msg_body($eml)));
+	$ctx->write(msg_body($eml));
 }
 
 sub res_hdr ($$) {
diff --git a/lib/PublicInbox/WwwAtomStream.pm b/lib/PublicInbox/WwwAtomStream.pm
index 5d32294eec15..82895db6373e 100644
--- a/lib/PublicInbox/WwwAtomStream.pm
+++ b/lib/PublicInbox/WwwAtomStream.pm
@@ -28,7 +28,7 @@ sub async_next ($) {
 		if (my $smsg = $ctx->{smsg} = $ctx->{cb}->($ctx)) {
 			$ctx->smsg_blob($smsg);
 		} else {
-			$ctx->{http_out}->write($ctx->translate('</feed>'));
+			$ctx->write('</feed>');
 			$ctx->close;
 		}
 	};
@@ -38,7 +38,7 @@ sub async_next ($) {
 sub async_eml { # for async_blob_cb
 	my ($ctx, $eml) = @_;
 	my $smsg = delete $ctx->{smsg};
-	$ctx->{http_out}->write($ctx->translate(feed_entry($ctx, $smsg, $eml)))
+	$ctx->write(feed_entry($ctx, $smsg, $eml));
 }
 
 sub response {
diff --git a/lib/PublicInbox/WwwStream.pm b/lib/PublicInbox/WwwStream.pm
index 5be5ed0cad59..6d7c447fe6a2 100644
--- a/lib/PublicInbox/WwwStream.pm
+++ b/lib/PublicInbox/WwwStream.pm
@@ -32,7 +32,7 @@ sub init {
 
 sub async_eml { # for async_blob_cb
 	my ($ctx, $eml) = @_;
-	$ctx->{http_out}->write($ctx->translate($ctx->{cb}->($ctx, $eml)));
+	$ctx->write($ctx->{cb}->($ctx, $eml));
 }
 
 sub html_top ($) {
@@ -187,8 +187,7 @@ sub async_next ($) {
 		if (my $smsg = $ctx->{smsg} = $ctx->{cb}->($ctx)) {
 			$ctx->smsg_blob($smsg);
 		} else {
-			$ctx->{http_out}->write(
-					$ctx->translate(_html_end($ctx)));
+			$ctx->write(_html_end($ctx));
 			$ctx->close; # GzipFilter->close
 		}
 	};

* [PATCH 2/2] www: $MSGID/raw: set charset in HTTP response
  2021-10-25  2:45 [PATCH 0/2] www: respect charset in $MSGID/raw display Eric Wong
  2021-10-25  2:45 ` [PATCH 1/2] gzip_filter: delay async wcb call Eric Wong
@ 2021-10-25  2:45 ` Eric Wong
  2021-10-25  6:32   ` Thomas Weißschuh
  1 sibling, 1 reply; 4+ messages in thread
From: Eric Wong @ 2021-10-25  2:45 UTC
  To: meta; +Cc: Thomas Weißschuh

By using the charset specified in the message, web browsers are
more likely to display the raw text properly for human readers.
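
As a rough sketch of the charset handling added to mbox_hdr in the
diff below: the charset from the message is used when it consists
only of letters, digits, "-" and "_", and anything else (including a
missing charset) falls back to UTF-8 so that control characters
cannot end up in the response header. content_type_with_charset is
a hypothetical helper name used only for this illustration; the
patch itself does this inline.

```perl
use strict;
use warnings;

# Hypothetical helper; name and signature are illustrative only.
sub content_type_with_charset {
	my ($type, $cs) = @_;
	$cs //= 'UTF-8'; # message declared no charset
	# anything outside [A-Za-z0-9_-] (CR, LF, NUL, ...) is rejected
	$cs = 'UTF-8' if $cs =~ /[^a-zA-Z0-9\-_]/;
	"$type; charset=$cs";
}

# prints "text/plain; charset=iso-8859-1"
print content_type_with_charset('text/plain', 'iso-8859-1'), "\n";
# prints "text/plain; charset=UTF-8" (CR/LF and spaces are rejected)
print content_type_with_charset('text/plain', "x\r\nSet-Cookie: a=b"), "\n";
# prints "text/plain; charset=UTF-8" (no charset given)
print content_type_with_charset('text/plain', undef), "\n";
```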

Inspired by a patch by Thomas Weißschuh:
  https://public-inbox.org/meta/20211024214337.161779-3-thomas@t-8ch.de/

Cc: Thomas Weißschuh <thomas@t-8ch.de>
---
 lib/PublicInbox/GzipFilter.pm | 19 +++++++++++++------
 lib/PublicInbox/Mbox.pm       | 24 +++++++++++++-----------
 t/plack.t                     | 26 +++++++++++++++++++++++---
 t/psgi_v2.t                   |  5 ++++-
 4 files changed, 53 insertions(+), 21 deletions(-)

diff --git a/lib/PublicInbox/GzipFilter.pm b/lib/PublicInbox/GzipFilter.pm
index c4858a971495..e37f1f76bd4a 100644
--- a/lib/PublicInbox/GzipFilter.pm
+++ b/lib/PublicInbox/GzipFilter.pm
@@ -46,11 +46,10 @@ sub gz_or_noop {
 sub gzf_maybe ($$) { bless { gz => gz_or_noop(@_) }, __PACKAGE__ }
 
 sub psgi_response {
+	# $code may be an HTTP response code (e.g. 200) or a CODE ref (mbox_hdr)
 	my ($self, $code, $res_hdr) = @_;
-	my $env = $self->{env};
-	$self->{gz} //= gz_or_noop($res_hdr, $env);
-	if ($env->{'pi-httpd.async'}) {
-		my $http = $env->{'psgix.io'}; # PublicInbox::HTTP
+	if ($self->{env}->{'pi-httpd.async'}) {
+		my $http = $self->{env}->{'psgix.io'}; # PublicInbox::HTTP
 		$http->{forward} = $self;
 		sub {
 			my ($wcb) = @_; # -httpd provided write callback
@@ -58,6 +57,9 @@ sub psgi_response {
 			$self->can('async_next')->($http); # start stepping
 		};
 	} else { # generic PSGI code path
+		ref($code) eq 'CODE' and
+			($code, $res_hdr) = @{$code->($self)};
+		$self->{gz} //= gz_or_noop($res_hdr, $self->{env});
 		[ $code, $res_hdr, $self ];
 	}
 }
@@ -116,9 +118,13 @@ sub translate ($$) {
 
 sub http_out ($) {
 	my ($self) = @_;
-	$self->{http_out} //= do {
+	$self->{http_out} // do {
 		my $args = delete $self->{wcb_args} // return undef;
-		pop(@$args)->($args); # $wcb->([$code, $hdr_ary])
+		my $wcb = pop @$args; # from PublicInbox::HTTP async
+		# $args->[0] may be \&mbox_hdr or similar
+		$args = $args->[0]->($self) if ref($args->[0]) eq 'CODE';
+		$self->{gz} //= gz_or_noop($args->[1], $self->{env});
+		$self->{http_out} = $wcb->($args); # $wcb->([$code, $hdr_ary])
 	};
 }
 
@@ -131,6 +137,7 @@ sub write {
 # more data to buffer after this
 sub zmore {
 	my $self = $_[0]; # $_[1] => input
+	http_out($self);
 	my $err = $self->{gz}->deflate($_[1], $self->{zbuf});
 	die "gzip->deflate: $err" if $err != Z_OK;
 	undef;
diff --git a/lib/PublicInbox/Mbox.pm b/lib/PublicInbox/Mbox.pm
index 4f84eea6745d..b977308d0541 100644
--- a/lib/PublicInbox/Mbox.pm
+++ b/lib/PublicInbox/Mbox.pm
@@ -18,7 +18,7 @@ sub getline {
 	my ($ctx) = @_; # ctx
 	my $smsg = $ctx->{smsg} or return;
 	my $ibx = $ctx->{ibx};
-	my $eml = $ibx->smsg_eml($smsg) or return;
+	my $eml = delete($ctx->{eml}) // $ibx->smsg_eml($smsg) // return;
 	my $n = $ctx->{smsg} = $ibx->over->next_by_mid(@{$ctx->{next_arg}});
 	$ctx->zmore(msg_hdr($ctx, $eml));
 	if ($n) {
@@ -45,14 +45,15 @@ sub async_eml { # for async_blob_cb
 	my $smsg = delete $ctx->{smsg};
 	# next message
 	$ctx->{smsg} = $ctx->{ibx}->over->next_by_mid(@{$ctx->{next_arg}});
-
+	local $ctx->{eml} = $eml; # for mbox_hdr
 	$ctx->zmore(msg_hdr($ctx, $eml));
 	$ctx->write(msg_body($eml));
 }
 
-sub res_hdr ($$) {
-	my ($ctx, $subject) = @_;
-	my $fn = $subject // '';
+sub mbox_hdr ($) {
+	my ($ctx) = @_;
+	my $eml = $ctx->{eml} //= $ctx->{ibx}->smsg_eml($ctx->{smsg});
+	my $fn = $eml->header_str('Subject') // '';
 	$fn =~ s/^re:\s+//i;
 	$fn = to_filename($fn) // 'no-subject';
 	my @hdr = ('Content-Type');
@@ -64,17 +65,19 @@ sub res_hdr ($$) {
 		push @hdr, 'text/plain';
 		$fn .= '.txt';
 	}
+	my $cs = $ctx->{eml}->ct->{attributes}->{charset} // 'UTF-8';
+	$cs = 'UTF-8' if $cs =~ /[^a-zA-Z0-9\-\_]/; # avoid header injection
+	$hdr[-1] .= "; charset=$cs";
 	push @hdr, 'Content-Disposition', "inline; filename=$fn";
-	\@hdr;
+	[ 200, \@hdr ];
 }
 
 # for rare cases where v1 inboxes aren't indexed w/ ->over at all
 sub no_over_raw ($) {
 	my ($ctx) = @_;
 	my $mref = $ctx->{ibx}->msg_by_mid($ctx->{mid}) or return;
-	my $eml = PublicInbox::Eml->new($mref);
-	[ 200, res_hdr($ctx, $eml->header_str('Subject')),
-		[ msg_hdr($ctx, $eml) . msg_body($eml) ] ]
+	my $eml = $ctx->{eml} = PublicInbox::Eml->new($mref);
+	[ @{mbox_hdr($ctx)}, [ msg_hdr($ctx, $eml) . msg_body($eml) ] ]
 }
 
 # /$INBOX/$MESSAGE_ID/raw
@@ -85,9 +88,8 @@ sub emit_raw {
 	my ($id, $prev);
 	my $mip = $ctx->{next_arg} = [ $ctx->{mid}, \$id, \$prev ];
 	my $smsg = $ctx->{smsg} = $over->next_by_mid(@$mip) or return;
-	my $res_hdr = res_hdr($ctx, $smsg->{subject});
 	bless $ctx, __PACKAGE__;
-	$ctx->psgi_response(200, $res_hdr);
+	$ctx->psgi_response(\&mbox_hdr);
 }
 
 sub msg_hdr ($$) {
diff --git a/t/plack.t b/t/plack.t
index 40ff2baa7273..e4dedce6a844 100644
--- a/t/plack.t
+++ b/t/plack.t
@@ -10,17 +10,24 @@ require_mods(@mods);
 foreach my $mod (@mods) { use_ok $mod; }
 ok(-f $psgi, "psgi example file found");
 my $pfx = 'http://example.com/test';
-# ensure successful message delivery
-my $ibx = create_inbox('test', sub {
+my $eml = eml_load('t/iso-2202-jp.eml');
+# ensure successful message deliveries
+my $ibx = create_inbox('test-1', sub {
 	my ($im, $ibx) = @_;
 	my $addr = $ibx->{-primary_address};
-	$im->add(PublicInbox::Eml->new(<<EOF)) or BAIL_OUT '->add';
+	$im->add($eml) or xbail '->add';
+	$eml->header_set('Content-Type',
+		"text/plain; charset=\rso\rb\0gus\rithurts");
+	$eml->header_set('Message-ID', '<broken@example.com>');
+	$im->add($eml) or xbail '->add';
+	$im->add(PublicInbox::Eml->new(<<EOF)) or xbail '->add';
 From: Me <me\@example.com>
 To: You <you\@example.com>
 Cc: $addr
 Message-Id: <blah\@example.com>
 Subject: hihi
 Date: Fri, 02 Oct 1993 00:00:00 +0000
+Content-Type: text/plain; charset=iso-8859-1
 
 > quoted text
 zzzzzz
@@ -195,6 +202,19 @@ test_psgi($app, sub {
 	my $res = $cb->(GET($pfx . '/blah@example.com/raw'));
 	is(200, $res->code, 'success response received for /*/raw');
 	like($res->content, qr!^From !sm, "mbox returned");
+	is($res->header('Content-Type'), 'text/plain; charset=iso-8859-1',
+		'charset from message used');
+
+	$res = $cb->(GET($pfx . '/broken@example.com/raw'));
+	is($res->header('Content-Type'), 'text/plain; charset=UTF-8',
+		'broken charset ignored');
+
+	$res = $cb->(GET($pfx . '/199707281508.AAA24167@hoyogw.example/raw'));
+	is($res->header('Content-Type'), 'text/plain; charset=ISO-2022-JP',
+		'ISO-2022-JP returned');
+	chomp(my $body = $res->content);
+	my $raw = PublicInbox::Eml->new(\$body);
+	is($raw->body_raw, $eml->body_raw, 'ISO-2022-JP body unmodified');
 
 	$res = $cb->(GET($pfx . '/blah@example.com/t.mbox.gz'));
 	is(501, $res->code, '501 when overview missing');
diff --git a/t/psgi_v2.t b/t/psgi_v2.t
index 64c1a8d38a0a..7d73b606dbef 100644
--- a/t/psgi_v2.t
+++ b/t/psgi_v2.t
@@ -20,11 +20,12 @@ To: test@example.com
 Subject: this is a subject
 Message-ID: <a-mid@b>
 Date: Fri, 02 Oct 1993 00:00:00 +0000
+Content-Type: text/plain; charset=iso-8859-1
 
 hello world
 EOF
 my $new_mid;
-my $ibx = create_inbox 'v2', version => 2, indexlevel => 'medium',
+my $ibx = create_inbox 'v2-1', version => 2, indexlevel => 'medium',
 			tmpdir => "$tmpdir/v2", sub {
 	my ($im, $ibx) = @_;
 	$im->add($eml) or BAIL_OUT;
@@ -68,6 +69,8 @@ my $client0 = sub {
 	like($res->content, qr!\$INBOX_DIR/description missing!,
 		'got v2 description missing message');
 	$res = $cb->(GET('/v2test/a-mid@b/raw'));
+	is($res->header('Content-Type'), 'text/plain; charset=iso-8859-1',
+		'charset from message used');
 	$raw = $res->content;
 	unlike($raw, qr/^From oldbug/sm, 'buggy "From_" line omitted');
 	like($raw, qr/^hello world$/m, 'got first message');

* Re: [PATCH 2/2] www: $MSGID/raw: set charset in HTTP response
  2021-10-25  2:45 ` [PATCH 2/2] www: $MSGID/raw: set charset in HTTP response Eric Wong
@ 2021-10-25  6:32   ` Thomas Weißschuh
  0 siblings, 0 replies; 4+ messages in thread
From: Thomas Weißschuh @ 2021-10-25  6:32 UTC
  To: Eric Wong; +Cc: meta

On 2021-10-25 02:45+0000, Eric Wong wrote:
> By using the charset specified in the message, web browsers are
> more likely to display the raw text properly for human readers.
> 
> Inspired by a patch by Thomas Weißschuh:
>   https://public-inbox.org/meta/20211024214337.161779-3-thomas@t-8ch.de/
> 
> Cc: Thomas Weißschuh <thomas@t-8ch.de>

Tested-by: Thomas Weißschuh <thomas@t-8ch.de>

> ---
>  lib/PublicInbox/GzipFilter.pm | 19 +++++++++++++------
>  lib/PublicInbox/Mbox.pm       | 24 +++++++++++++-----------
>  t/plack.t                     | 26 +++++++++++++++++++++++---
>  t/psgi_v2.t                   |  5 ++++-
>  4 files changed, 53 insertions(+), 21 deletions(-)

Thanks!
