From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-ASN:
X-Spam-Status: No, score=-3.5 required=3.0 tests=ALL_TRUSTED,AWL,BAYES_00,
 DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF shortcircuit=no
 autolearn=ham autolearn_force=no version=3.4.2
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
 by dcvr.yhbt.net (Postfix) with ESMTP id 53BF31F910;
 Sun, 27 Nov 2022 09:15:47 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=80x24.org;
 s=selector1; t=1669540547;
 bh=wj2l5LIdkQY+WSZdUV4UXJDutqZPme4j8QnTaZHiCXQ=;
 h=Date:From:To:Cc:Subject:References:In-Reply-To:From;
 b=LmeFdTQMmvm4I/0f5ohegP4ZTjIbdMMlg3A8D3Z5+2yNgGy5NoW5w9aU+3QXPL1bF
 ogpEpqwwwFaDMQWqyRMqRkMurzkG9i4fb3+HjUOWvpv9Cf315MQfOXJ/2lrfFkq4zd
 qaf2vpeMZzHw4ZTDQbsagSBUlk1F2K6C6wC8guwo=
Date: Sun, 27 Nov 2022 09:15:47 +0000
From: Eric Wong
To: Konstantin Ryabitsev
Cc: meta@public-inbox.org
Subject: [PATCH] content_hash: handle References as octets
Message-ID: <20221127091547.M264128@dcvr>
References: <20221124153715.3nenjpjzj43vqxr2@meerkat.local>
 <20221124213155.M736847@dcvr>
 <20221125181403.azktm5zkhg7lgnp7@meerkat.local>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <20221125181403.azktm5zkhg7lgnp7@meerkat.local>
List-Id:

Konstantin Ryabitsev wrote:
> On Thu, Nov 24, 2022 at 09:31:55PM +0000, Eric Wong wrote:
> > The below case generalizes it to all HTML displays and removes
> > the special case.
>
> It looks good to me in some cursory tests, thank you!

I just noticed this so far:

-------8<------
Subject: [PATCH] content_hash: handle References as octets

The alsa-devel archives on lore have some UTF-8 References:
headers, so we need to treat them as octets, again, otherwise
(re)indexing triggers cascading failures.

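For anyone reading along outside the codebase, here is a minimal
standalone sketch of why the encode step matters. It assumes only the
core Digest::SHA module; digest_refs is a hypothetical helper written
for illustration and is not a public-inbox function. Digest::SHA croaks
when given a string containing a character above 0xFF, so a References
value that has been decoded into Perl characters has to be turned back
into UTF-8 octets before it is hashed:

```perl
use strict;
use warnings;
use Digest::SHA;

# Hypothetical helper (illustration only, not public-inbox API):
# hash each distinct reference once, as UTF-8 octets.
sub digest_refs {
	my (@refs) = @_;
	my $dig = Digest::SHA->new(256);
	my %seen;
	for my $ref (grep { !$seen{$_}++ } @refs) {
		utf8::encode($ref); # characters -> UTF-8 octets, in place
		$dig->add("ref\0$ref\0");
	}
	$dig->hexdigest;
}

# "\x{100}" is one character above 0xFF, i.e. a "wide" character
my $wide = "\x{100}\@example";

# Without encoding, Digest::SHA rejects the wide character
my $err = eval { Digest::SHA->new(256)->add("ref\0$wide\0"); 1 } ? '' : $@;
print "unencoded add failed: $err" if $err;

# With encoding, the duplicate is skipped and the digest is computed
my $hex = digest_refs($wide, $wide);
print "$hex\n";
```

U+0100 encodes to the octets C4 80, the same bytes the test case uses in
its References: header, so the digest is computed over the header exactly
as it appears on the wire.
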
Fixes: 5198c976ce8b "eml: header_raw converts octets to Perl UTF-8"
---
 lib/PublicInbox/ContentHash.pm |  7 ++++---
 t/v2writable.t                 | 16 ++++++++++++++++
 2 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/lib/PublicInbox/ContentHash.pm b/lib/PublicInbox/ContentHash.pm
index bacc9cdd..1afbb413 100644
--- a/lib/PublicInbox/ContentHash.pm
+++ b/lib/PublicInbox/ContentHash.pm
@@ -1,4 +1,4 @@
-# Copyright (C) 2018-2021 all contributors
+# Copyright (C) all contributors
 # License: AGPL-3.0+
 
 # Unstable internal API.
@@ -63,8 +63,9 @@ sub content_digest ($;$) {
 	# do NOT consider the Message-ID as part of the content_hash
 	# if we got here, we've already got Message-ID reuse
 	my %seen = map { $_ => 1 } @{mids($eml)};
-	foreach my $mid (@{references($eml)}) {
-		$dig->add("ref\0$mid\0") unless $seen{$mid}++;
+	for (grep { !$seen{$_}++ } @{references($eml)}) {
+		utf8::encode($_);
+		$dig->add("ref\0$_\0");
 	}
 
 	# Only use Sender: if From is not present
diff --git a/t/v2writable.t b/t/v2writable.t
index ad946338..0d102204 100644
--- a/t/v2writable.t
+++ b/t/v2writable.t
@@ -283,6 +283,22 @@ EOF
 	is($msgs->[1]->{mid}, 'y'x244, 'stored truncated mid(2)');
 }
 
+if ('UTF-8 References') {
+	my @w;
+	local $SIG{__WARN__} = sub { push @w, @_ };
+	my $msg = <
+References: <\xc4\x80\@example>
+
+EOM
+	ok($im->add(PublicInbox::Eml->new($msg."a\n")), 'UTF-8 References 1');
+	ok($im->add(PublicInbox::Eml->new($msg."b\n")), 'UTF-8 References 2');
+	$im->done;
+	ok(!grep(/Wide character/, @w), 'no wide characters') or xbail(\@w);
+}
+
 my $tmp = {
 	inboxdir => "$inboxdir/non-existent/subdir",
 	name => 'nope',