From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-Status: No, score=-4.0 required=3.0 tests=ALL_TRUSTED,BAYES_00
	shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
	by dcvr.yhbt.net (Postfix) with ESMTP id BE4C71F9FD;
	Fri, 12 Mar 2021 00:31:23 +0000 (UTC)
Date: Fri, 12 Mar 2021 02:31:23 +0200
From: Eric Wong
To: meta@public-inbox.org
Subject: [SQUASH] msg_part_text: discover text in application/octet-stream
Message-ID: <20210312003123.GA30304@dcvr>
References: <20210311014539.19756-1-e@80x24.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <20210311014539.19756-1-e@80x24.org>
List-Id:

This simplifies the check and ensures the returned text is Perl
"utf8" text (that is, Perl's internal "utf8" and not the strict
"UTF-8").

diff --git a/lib/PublicInbox/MsgIter.pm b/lib/PublicInbox/MsgIter.pm
index e2819523..9c6581cc 100644
--- a/lib/PublicInbox/MsgIter.pm
+++ b/lib/PublicInbox/MsgIter.pm
@@ -90,12 +90,8 @@ sub msg_part_text ($$) {
 		# Try to see if it's printable text that we can index
 		# and display:
 		$s = $part->body;
-		if ($s =~ /[^\p{XPosixPrint}\s]/s) {
-			utf8::decode($s);
-			$s =~ /[^\p{XPosixPrint}\s]/s ? undef($s) : undef($err);
-		} else {
-			undef($err);
-		}
+		utf8::decode($s);
+		undef($s =~ /[^\p{XPosixPrint}\s]/s ? $s : $err);
 	}
 	($s, $err);
 }
diff --git a/t/msg_iter.t b/t/msg_iter.t
index 6c52eec8..ae3594da 100644
--- a/t/msg_iter.t
+++ b/t/msg_iter.t
@@ -121,6 +121,7 @@ EOM
 		push @parts, $s;
 	});
 	$expect =~ s/\n/\r\n/sg;
+	utf8::decode($expect); # aka "bytes2str"
 	is_deeply(\@parts, [ "blah\r\n", $expect ],
 		'fallback to application/octet-stream as UTF-8 text');
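
For illustration only, the simplified check can be tried outside of
PublicInbox::MsgIter with a standalone sketch like the one below.
text_or_undef() and its 'not text' error string are hypothetical
stand-ins (the real $err is set earlier in msg_part_text), and the
branch is spelled out with if/else instead of the ternary inside
undef() used in the patch:

```perl
use strict;
use warnings;

# Hypothetical helper, not part of public-inbox: decode first,
# then decide whether the result looks like printable text.
sub text_or_undef {
	my ($s) = @_;		# raw bytes, e.g. from an octet-stream part
	my $err = 'not text';	# placeholder error value (assumption)

	# bytes -> Perl characters when $s is valid UTF-8;
	# $s is left untouched when decoding fails
	utf8::decode($s);

	if ($s =~ /[^\p{XPosixPrint}\s]/s) {
		undef $s;	# something non-printable remains: not text
	} else {
		undef $err;	# printable characters and whitespace only
	}
	($s, $err);
}

my ($s, $err) = text_or_undef("caf\xc3\xa9\n");	# UTF-8 encoded bytes
print defined($s) ? "text\n" : "binary\n";

($s, $err) = text_or_undef("\x00\x01\x02");	# control bytes
print defined($s) ? "text\n" : "binary\n";
```

Decoding unconditionally means the caller gets character strings
back whenever $s is returned, which is also why the test has to
decode $expect before comparing.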