From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-ASN:
X-Spam-Status: No, score=-4.0 required=3.0 tests=ALL_TRUSTED,BAYES_00
	shortcircuit=no autolearn=ham autolearn_force=no version=3.4.0
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
	by dcvr.yhbt.net (Postfix) with ESMTP id 7F6F91FAE5
	for ; Tue, 6 Mar 2018 08:42:42 +0000 (UTC)
From: "Eric Wong (Contractor, The Linux Foundation)"
To: meta@public-inbox.org
Subject: [PATCH 04/34] content_id: special treatment for Message-Id headers
Date: Tue, 6 Mar 2018 08:42:12 +0000
Message-Id: <20180306084242.19988-5-e@80x24.org>
In-Reply-To: <20180306084242.19988-1-e@80x24.org>
References: <20180306084242.19988-1-e@80x24.org>
List-Id:

Some emails in LKML archives are identical with the only
difference being s/References:/In-Reply-To:/ in the headers.
Since this difference doesn't affect how we handle message
threading, we will treat them the same way for the purposes
of deduplication.

There may be more changes to how we do content_id along these
lines (e.g. using msg_iter to walk the message).
---
 lib/PublicInbox/ContentId.pm | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/lib/PublicInbox/ContentId.pm b/lib/PublicInbox/ContentId.pm
index 65d5a76..7ec638c 100644
--- a/lib/PublicInbox/ContentId.pm
+++ b/lib/PublicInbox/ContentId.pm
@@ -11,7 +11,7 @@ our @EXPORT_OK = qw/content_id/;
 use Digest::SHA;
 
 # Content-* headers are often no-ops, so maybe we don't need them
-my @ID_HEADERS = qw(Subject From Date Message-ID References To Cc In-Reply-To);
+my @ID_HEADERS = qw(Subject From Date To Cc);
 
 sub content_id ($;$) {
 	my ($mime, $alg) = @_;
@@ -19,6 +19,20 @@ sub content_id ($;$) {
 	my $dig = Digest::SHA->new($alg);
 	my $hdr = $mime->header_obj;
 
+	# References: and In-Reply-To: get used interchangeably
+	# in some "duplicates" in LKML. We treat them the same
+	# in SearchIdx, so treat them the same for this:
+	my @mid = $hdr->header_raw('Message-ID');
+	@mid = (join(' ', @mid) =~ /<([^>]+)>/g);
+	my $refs = join(' ', $hdr->header_raw('References'),
+			$hdr->header_raw('In-Reply-To'));
+	my @refs = ($refs =~ /<([^>]+)>/g);
+	my %seen;
+	foreach my $mid (@mid, @refs) {
+		next if $seen{$mid};
+		$dig->add($mid);
+		$seen{$mid} = 1;
+	}
 	foreach my $h (@ID_HEADERS) {
 		my @v = $hdr->header_raw($h);
 		$dig->add($_) foreach @v;
-- 
EW
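
For illustration only: a minimal Python sketch of the idea in the hunk above, not the Perl code from the patch. It pulls the bare IDs out of Message-ID, References and In-Reply-To, feeds each distinct ID to the digest once, then hashes the remaining ID headers. The name content_id_sketch, the sha256 default and the sample messages are assumptions made for this example. The message body is left out because the hunk does not show how it is handled.

```python
import hashlib
import re
from email import message_from_string

# Same reduced header list as the patched @ID_HEADERS.
ID_HEADERS = ("Subject", "From", "Date", "To", "Cc")
# Captures the text between angle brackets, like /<([^>]+)>/g in the patch.
MID_RE = re.compile(r"<([^>]+)>")


def content_id_sketch(raw, alg="sha256"):
    """Hash the headers of a raw message; hypothetical helper, headers only."""
    msg = message_from_string(raw)
    dig = hashlib.new(alg)

    # Message-ID first, then References and In-Reply-To pooled together.
    mids = MID_RE.findall(" ".join(msg.get_all("Message-ID", [])))
    refs = MID_RE.findall(" ".join(
        msg.get_all("References", []) + msg.get_all("In-Reply-To", [])))

    # Each distinct ID is added once, in first-seen order.
    seen = set()
    for mid in mids + refs:
        if mid in seen:
            continue
        seen.add(mid)
        dig.update(mid.encode("utf-8"))

    # Remaining headers are hashed in a fixed order.
    for name in ID_HEADERS:
        for value in msg.get_all(name, []):
            dig.update(value.encode("utf-8"))
    return dig.hexdigest()


# Two made-up messages differing only in s/References:/In-Reply-To:/.
COMMON = "Message-ID: <m1@example.org>\nSubject: hi\nFrom: a@example.org\n"
MSG_REFS = COMMON + "References: <parent@example.org>\n\nbody\n"
MSG_IRT = COMMON + "In-Reply-To: <parent@example.org>\n\nbody\n"

print(content_id_sketch(MSG_REFS) == content_id_sketch(MSG_IRT))  # True
```

Pooling the two headers before hashing is what makes the pair above collide; a message carrying the same parent ID in both headers also hashes the same, since repeated IDs are skipped.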