v2: improve deduplication checks

First, decode the text portions of messages, since some archived mail I received had been converted from quoted-printable or base64 to 8bit by the original recipient. Merging it with my own archives (which had no conversion done) produced unnecessary duplicates. Then, normalize CRLF line endings in text portions to LF. In the headers, relax the content_id hashing to ignore quotes and domain-name case in the To, Cc, and From headers, since some mail processors alter them. Finally, I've discovered that Email::MIME->new($mime->as_string) does not always round-trip reliably, so we calculate the content_id twice on user-supplied messages.
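The header and line-ending normalization described above can be illustrated with a minimal sketch. This is not the content_id() implementation from this commit: the helper names (normalize_addr_header, normalize_body, sketch_content_id), the plain hash of headers, and the choice of SHA-256 are all assumptions made for illustration only.

```perl
use strict;
use warnings;
use Digest::SHA;

# Hypothetical helper: drop double quotes and lower-case the domain part
# of each address, so cosmetic rewrites by mail processors do not change
# the resulting hash.
sub normalize_addr_header {
	my ($val) = @_;
	$val =~ tr/"//d;                                # ignore quotes
	$val =~ s/\@([A-Za-z0-9.\-]+)/'@' . lc($1)/ge;  # lower-case domain only
	return $val;
}

# Hypothetical helper: normalize CRLF line endings to LF.
sub normalize_body {
	my ($body) = @_;
	$body =~ s/\r\n/\n/g;
	return $body;
}

# Assumed shape: $headers is a hashref of header name => raw value,
# $body is already-decoded text.
sub sketch_content_id {
	my ($headers, $body) = @_;
	my $dig = Digest::SHA->new(256);
	foreach my $h (qw(From To Cc)) {
		my $v = $headers->{$h};
		next unless defined $v;
		$dig->add("$h\0" . normalize_addr_header($v) . "\0");
	}
	$dig->add(normalize_body($body));
	return $dig->hexdigest;
}

# Both variants should hash identically under these assumptions.
my $quoted = sketch_content_id(
	{ From => '"Quoted N\'Ame" <foo@EXAMPLE.com>' }, "hello\r\n");
my $plain = sketch_content_id(
	{ From => 'Quoted N\'Ame <foo@example.com>' }, "hello\n");
print $quoted eq $plain ? "same id\n" : "different id\n";
```

Only the domain is lower-cased here because the local part of an address may be case-sensitive; the sketch leaves it untouched.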
author: Eric Wong (Contractor, The Linux Foundation) <e@80x24.org> 2018-04-18 09:13:11 +0000
committer: Eric Wong (Contractor, The Linux Foundation) <e@80x24.org> 2018-04-18 09:14:15 +0000
commit: f0ef0a56a8957d6f3095b1a24798e54b0b815d04 (patch)
tree: fcab14a29eaf1ec68564aa2163e31751f7e9936d /t
parent: 69329215485cf2ab9d8cd1fa7faf65d8ec42dc0b (diff)
download: public-inbox-f0ef0a56a8957d6f3095b1a24798e54b0b815d04.tar.gz
1 files changed, 10 insertions, 0 deletions
diff --git a/t/content_id.t b/t/content_id.t
index adcdb6c1..01ce65e5 100644
--- a/t/content_id.t
+++ b/t/content_id.t
@@ -22,4 +22,14 @@ my $orig = content_id($mime);
  my $reload = content_id(Email::MIME->new($mime->as_string));
  is($orig, $reload, 'content_id matches after serialization');
  
+foreach my $h (qw(From To Cc)) {
+        my $n = '"Quoted N\'Ame" <foo@EXAMPLE.com>';
+        $mime->header_str_set($h, "$n");
+        my $q = content_id($mime);
+        is($n, $mime->header($h), "content_id does not mutate $h:");
+        $mime->header_str_set($h, 'Quoted N\'Ame <foo@example.com>');
+        my $nq = content_id($mime);
+        is($nq, $q, "quotes ignored in $h:");
+}
+
  done_testing();