Date | Commit message (Collapse) |
|
OverIdx::parse_references already skips duplicate
References (which we use in SearchThread for rendering).
So there's no reason for our content deduplication logic
to care if a Message-Id in the Reference header is mentioned
twice.
|
|
This use of map {} is a common idiom as we no longer consider
the Message-ID as part of the digest.
|
|
msg_iter now passes a user specified arg into the supplied
callback, so we can use that to pass the Digest object into
the \&content_dig_i callback.
|
|
|
|
Hopefully this helps people familiarize themselves with
the source code.
|
|
I've found two examples on https://lore.kernel.org/lkml/
where the messages declared themselves to be "multipart/mixed"
but were actually plain text:
<87llgalspt.fsf@free.fr>
<200308111450.h7BEoOu20077@mail.osdl.org>
With the mboxrd downloaded, mutt is able to view them without
difficulty.
Note: this change would require reindexing of Xapian to pick up
the changes. But it's only two ancient messages, the first was
resent by the original sender and the second is too old to be
relevant.
|
|
First off, decode text portions of messages since some archived
mail I got was converted from quoted-printable or base-64 to
8bit by the original recipient. Attempting to merge them with
my own archives (which had no conversion done) led to
unnecessary duplicates showing up.
Then, normalize CRLF line endings in text portions to LF.
In the headers, we relax the content_id hashing to ignore quotes
and lower-case domain names in To, Cc, and From headers since
some mail processors will alter them.
Finally, I've discovered Email::MIME->new($mime->as_string)
does not always round-trip reliably, so we calculate the
content_id twice on user-supplied messages.
|
|
If we need to use content_id, we've already lost hope
in relying on Message-Id as a differentiator. This
prevents duplicates from showing up repeatedly with
-watch when Message-Ids are reused and we generate
new Message-Ids to disambiguate.
|
|
We will be using Sender: in more places if the From: header
is not available, this is one of them.
Followup-to: ("import: fall back to Sender for extracting name and email")
|
|
We merely use this for internal comparisons and do not store
this in Xapian. So using a shorter, non-human readable digest
is enough. Furthermore, introduce "content_digest" which
returns the Digest::SHA object for extra changes.
|
|
These already take care of deduping internally, so we'll save
ourselves at least some of the trouble while using a more
consistent API. While we're at it, hash the header name as
well, since we need to distinguish which header a certain value
came from.
|
|
Some emails in LKML archives are identical with the only
difference being s/References:/In-Reply-To:/ in the headers.
Since this difference doesn't affect how we handle message
threading, we will treat them the same way for the purposes
of deduplication.
There may be more changes to how we do content_id along these
lines (e.g. using msg_iter to walk the message).
|
|
Call order will need to change a bit since this is going to be
tied to Xapian
|