From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-ASN:
X-Spam-Status: No, score=-4.0 required=3.0 tests=ALL_TRUSTED,BAYES_00
	shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
	by dcvr.yhbt.net (Postfix) with ESMTP id 1CBFE1F9F4
	for ; Sat, 2 Oct 2021 11:18:36 +0000 (UTC)
From: Eric Wong
To: meta@public-inbox.org
Subject: [PATCH 3/4] content_hash: normalize whitespace before hashing addresses
Date: Sat, 2 Oct 2021 11:18:34 +0000
Message-Id: <20211002111835.19220-4-e@80x24.org>
In-Reply-To: <20211002111835.19220-1-e@80x24.org>
References: <20211002111835.19220-1-e@80x24.org>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
List-Id:

This should prevent some false duplicates.  I noticed this while
implementing "lei mail-diff", and only noticed it when I
implemented the ContentDigestDbg wrapper for mail-diff.
---
 lib/PublicInbox/ContentHash.pm | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/PublicInbox/ContentHash.pm b/lib/PublicInbox/ContentHash.pm
index f6ae9011c1bf..bacc9cdda124 100644
--- a/lib/PublicInbox/ContentHash.pm
+++ b/lib/PublicInbox/ContentHash.pm
@@ -20,6 +20,7 @@ use Digest::SHA;
 sub digest_addr ($$$) {
 	my ($dig, $h, $v) = @_;
 	$v =~ tr/"//d;
+	$v =~ tr/\r\n\t / /s;
 	$v =~ s/@([a-z0-9\_\.\-\(\)]*([A-Z])\S*)/'@'.lc($1)/ge;
 	utf8::encode($v);
 	$dig->add("$h\0$v\0");
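
To illustrate the effect of the added line, here is a minimal standalone sketch. The helpers normalize_addr and addr_digest_hex are hypothetical names that do not exist in ContentHash.pm; they only repeat the steps visible in the hunk above. The use of a fresh SHA-256 object per call is an assumption made to keep the example self-contained. The point is that a folded address header and its single-line equivalent should feed identical bytes to the digest once runs of CR, LF, tab and space are squeezed into one space.

```perl
use strict;
use warnings;
use Digest::SHA;

# Hypothetical helper (not part of ContentHash.pm): applies the same
# string normalization steps as digest_addr in the hunk above.
sub normalize_addr {
	my ($v) = @_;
	$v =~ tr/"//d; # drop double quotes
	# map CR, LF, tab and space to a space, squeezing each run to one
	$v =~ tr/\r\n\t / /s;
	# lowercase a domain part that contains an uppercase letter
	$v =~ s/@([a-z0-9\_\.\-\(\)]*([A-Z])\S*)/'@'.lc($1)/ge;
	$v;
}

# Hypothetical helper: hashes "$h\0$v\0" as digest_addr does, but on a
# fresh digest object (SHA-256 is an assumption of this sketch).
sub addr_digest_hex {
	my ($h, $v) = @_;
	$v = normalize_addr($v);
	utf8::encode($v);
	Digest::SHA->new(256)->add("$h\0$v\0")->hexdigest;
}

# The same recipients, once folded across lines and once on one line
my $folded = "Alice <alice\@example.com>,\r\n\tBob <bob\@example.com>";
my $flat = 'Alice <alice@example.com>, Bob <bob@example.com>';

print normalize_addr($folded), "\n";
print addr_digest_hex('To', $folded) eq addr_digest_hex('To', $flat)
	? "same digest\n" : "different digest\n";
```

Without the tr/\r\n\t / /s step, the two values above would differ in their whitespace bytes and therefore hash differently, even though they name the same recipients.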