From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-ASN:
X-Spam-Status: No, score=-4.0 required=3.0 tests=ALL_TRUSTED,BAYES_00
	shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
	by dcvr.yhbt.net (Postfix) with ESMTP id 3D3AB1F463
	for ; Wed, 18 Dec 2019 09:14:43 +0000 (UTC)
From: Eric Wong
To: meta@public-inbox.org
Subject: [PATCH] msgiter: msg_part_text returns undef on text/html
Date: Wed, 18 Dec 2019 09:14:43 +0000
Message-Id: <20191218091443.12551-1-e@80x24.org>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
List-Id:

We want HTML parts to be downloadable, but not displayed as
unreadable (but injection-safe) HTML source in our own web and
Atom interfaces.

This affects indexing, too, as HTML tags/comments won't be
indexed anymore, but existing indices are only cleaned after
--reindex.

HTML-only mail won't be indexed at all, but we won't cross that
bridge until somebody cares about that crap.  We'll continue to
actively discourage such waste of CPU cycles, bandwidth, cache
and storage.

Fixes: 7d82a8bc04ce2e68 ('handle "multipart/mixed" messages which are not multipart')
---
 lib/PublicInbox/MsgIter.pm | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/lib/PublicInbox/MsgIter.pm b/lib/PublicInbox/MsgIter.pm
index d9df32ab..6453d9f1 100644
--- a/lib/PublicInbox/MsgIter.pm
+++ b/lib/PublicInbox/MsgIter.pm
@@ -38,6 +38,11 @@ sub msg_iter ($$) {
 sub msg_part_text ($$) {
 	my ($part, $ct) = @_;
 
+	# TODO: we may offer a separate sub for people who need to index
+	# HTML-only mail, but the majority of HTML mail is multipart/alternative
+	# with a text part which we don't have to waste cycles decoding
+	return if $ct =~ m!\btext/x?html\b!;
+
 	my $s = eval { $part->body_str };
 	my $err = $@;