From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-Status: No, score=-4.0 required=3.0 tests=ALL_TRUSTED,BAYES_00
	shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
	by dcvr.yhbt.net (Postfix) with ESMTP id 34E181F66E
	for ; Wed, 19 Aug 2020 08:15:49 +0000 (UTC)
From: Eric Wong
To: meta@public-inbox.org
Subject: [PATCH] smsg: handle wide characters in raw mail headers
Date: Wed, 19 Aug 2020 08:15:49 +0000
Message-Id: <20200819081549.24617-1-e@yhbt.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
List-Id:

There may be messages in the wild with wide characters in
headers which aren't RFC2047-encoded.  Assume UTF-8 so those
fields can round trip through the `ddd' (doc-data-deflated)
column of over.sqlite3.

This doesn't affect docdata.glass in Xapian (at least not with
Search::Xapian), but it does affect how over.sqlite3 stores the
same data via Compress::Zlib::compress().

Noticed while working on patches to remove docdata storage from
Xapian in favor of using over.sqlite3.
---
 lib/PublicInbox/Smsg.pm | 3 +++
 t/psgi_search.t         | 6 +++++-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/lib/PublicInbox/Smsg.pm b/lib/PublicInbox/Smsg.pm
index aaf88f35..62cb951e 100644
--- a/lib/PublicInbox/Smsg.pm
+++ b/lib/PublicInbox/Smsg.pm
@@ -105,6 +105,9 @@ sub populate {
 		# to protect git and NNTP clients
 		$val =~ tr/\0\t\n/ /;
 
+		# rare: in case headers have wide chars (not RFC2047-encoded)
+		utf8::decode($val);
+
 		# lower-case fields for read-only stuff
 		$self->{lc($f)} = $val;
 
diff --git a/t/psgi_search.t b/t/psgi_search.t
index 2d12ba6a..5d537363 100644
--- a/t/psgi_search.t
+++ b/t/psgi_search.t
@@ -28,8 +28,10 @@ my $im = $ibx->importer(0);
 my $digits = '10010260936330';
 my $ua = 'Pine.LNX.4.10';
 my $mid = "$ua.$digits.2460-100000\@penguin.transmeta.com";
+
+# n.b. these headers are not properly RFC2047-encoded
 my $mime = PublicInbox::Eml->new(<
 From: Ævar Arnfjörð Bjarmason
 To: git\@vger.kernel.org
@@ -102,6 +104,8 @@ test_psgi(sub { $www->call(@_) }, sub {
 'subject-less message linked from "/$INBOX/"');
 like($html, qr/\bhref="blank-subject[^>]+>\(no subject\)(GET('/test/?q=tc:git'));
 like($html, qr/\bhref="no-subject-at-all[^>]+>\(no subject\)