path: root/lib/PublicInbox/ContentId.pm
Date Commit message
2020-01-24 contentid: ignore duplicate References: headers
OverIdx::parse_references already skips duplicate References (which we use in SearchThread for rendering), so there's no reason for our content deduplication logic to care if a Message-Id in the References: header is mentioned twice.
2020-01-24 contentid: use map to generate %seen for Message-Ids
This use of map {} is a common idiom as we no longer consider the Message-ID as part of the digest.
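The two contentid changes above come down to seeding a "seen" table from the message's own Message-IDs and then skipping any reference that is already in it. A minimal Python sketch of that idea follows; the module itself is Perl, `unique_references` and its arguments are hypothetical names rather than the project's API, and skipping references that match the message's own IDs is an assumption drawn from the commit subjects.

```python
def unique_references(mids, references):
    """Return references in order, dropping repeats and any ID that
    is already one of the message's own Message-IDs (assumption)."""
    # Rough Python counterpart of the Perl idiom:
    #   my %seen = map { $_ => 1 } @mids;
    seen = {mid: True for mid in mids}
    unique = []
    for ref in references:
        if ref in seen:
            continue  # repeated reference, or the message's own ID
        seen[ref] = True
        unique.append(ref)
    return unique


if __name__ == "__main__":
    refs = ["a@example", "b@example", "a@example", "c@example"]
    print(unique_references(["c@example"], refs))
```

Only the deduplicated list would then be fed to the digest, so a Message-Id mentioned twice in References: contributes once.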
2019-12-27 contentid: no anonymous sub
msg_iter now passes a user specified arg into the supplied callback, so we can use that to pass the Digest object into the \&content_dig_i callback.
2019-09-09 run update-copyrights from gnulib for 2019
2019-01-09 doc: various overview-level module comments
Hopefully this helps people familiarize themselves with the source code.
2018-12-30 handle "multipart/mixed" messages which are not multipart
I've found two examples on https://lore.kernel.org/lkml/ where the messages declared themselves to be "multipart/mixed" but were actually plain text: <87llgalspt.fsf@free.fr> <200308111450.h7BEoOu20077@mail.osdl.org> With the mboxrd downloaded, mutt is able to view them without difficulty. Note: this change would require reindexing of Xapian to pick up the changes. But it's only two ancient messages, the first was resent by the original sender and the second is too old to be relevant.
2018-04-18 v2: improve deduplication checks
First off, decode text portions of messages since some archived mail I got was converted from quoted-printable or base-64 to 8bit by the original recipient. Attempting to merge them with my own archives (which had no conversion done) led to unnecessary duplicates showing up. Then, normalize CRLF line endings in text portions to LF. In the headers, we relax the content_id hashing to ignore quotes and lower-case domain names in To, Cc, and From headers since some mail processors will alter them. Finally, I've discovered Email::MIME->new($mime->as_string) does not always round-trip reliably, so we calculate the content_id twice on user-supplied messages.
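The header and line-ending normalization described above can be sketched roughly as follows. This is an illustrative Python sketch, not the module's Perl code: the function names, the choice of SHA-256, the NUL separators, and the exact quote-stripping rule are all assumptions, and the transfer-encoding decode step (quoted-printable/base-64) is left out for brevity.

```python
import hashlib
import re


def normalize_body(text):
    """Normalize CRLF line endings to LF in a decoded text part."""
    return text.replace("\r\n", "\n")


def normalize_address_header(value):
    """Drop quote characters and lower-case the domain of each address.

    Stripping both double and single quotes is an assumption here.
    """
    value = value.replace('"', "").replace("'", "")
    # Lower-case only what follows "@"; the local part is left untouched.
    return re.sub(r"@([^\s>,;]+)", lambda m: "@" + m.group(1).lower(), value)


def content_digest(headers, body_text):
    """Hash normalized From/To/Cc values plus the normalized body.

    `headers` maps a header name to a list of raw values (hypothetical shape).
    """
    dig = hashlib.sha256()
    for name in ("From", "To", "Cc"):
        for value in headers.get(name, []):
            normalized = normalize_address_header(value)
            dig.update(name.encode() + b"\0" + normalized.encode("utf-8") + b"\0")
    dig.update(normalize_body(body_text).encode("utf-8"))
    return dig.hexdigest()
```

With this sketch, a copy whose From: header was rewritten from `"A User" <a@Example.COM>` to `A User <a@example.com>` and whose body was converted from CRLF to LF hashes to the same value as the original.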
2018-03-20 content_id: do not take Message-Id into account
If we need to use content_id, we've already lost hope in relying on Message-Id as a differentiator. This prevents duplicates from showing up repeatedly with -watch when Message-Ids are reused and we generate new Message-Ids to disambiguate.
2018-03-19 content_id: use Sender header if From is not available
We will be using Sender: in more places if the From: header is not available; this is one of them. Followup-to: ("import: fall back to Sender for extracting name and email")
2018-03-02 content_id: no need to be human-friendly
We merely use this for internal comparisons and do not store this in Xapian. So using a shorter, non-human readable digest is enough. Furthermore, introduce "content_digest" which returns the Digest::SHA object for extra changes.
2018-03-02 content_id: use `mids' and `references' for MID extraction
These already take care of deduping internally, so we'll save ourselves at least some of the trouble while using a more consistent API. While we're at it, hash the header name as well, since we need to distinguish which header a certain value came from.
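To illustrate why hashing the header name matters, here is a small Python sketch (not the project's Perl code). The tags "mid" and "ref", the function name `digest_ids`, and the NUL separators are hypothetical choices made for this example.

```python
import hashlib


def digest_ids(tagged_ids):
    """Hash (header_tag, value) pairs so the tag is part of the digest."""
    dig = hashlib.sha256()
    for header_tag, value in tagged_ids:
        # NUL separators keep ("ab", "c") distinct from ("a", "bc").
        dig.update(header_tag.encode() + b"\0" + value.encode() + b"\0")
    return dig.hexdigest()


if __name__ == "__main__":
    as_mid = digest_ids([("mid", "a@example")])
    as_ref = digest_ids([("ref", "a@example")])
    print(as_mid != as_ref)
```

The same ID value produces different digests depending on the tag it was hashed under, which is the distinction the commit message asks for.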
2018-03-02 content_id: special treatment for Message-Id headers
Some emails in LKML archives are identical with the only difference being s/References:/In-Reply-To:/ in the headers. Since this difference doesn't affect how we handle message threading, we will treat them the same way for the purposes of deduplication. There may be more changes to how we do content_id along these lines (e.g. using msg_iter to walk the message).
2018-02-12 import: initial handling for v2
Call order will need to change a bit since this is going to be tied to Xapian