Date | Commit message |
|
The content dedupe logic was originally designed for v2 public
inboxes as a fallback for when the importer sees identical
Message-IDs. Thus it did not account for Message-ID(s) in
the message itself.
This change doesn't affect saved searches (the default when
writing to a pathname or IMAP). It affects --no-save, and
outputs to stdout (even if stdout is redirected to a file).
Prior to this change, lei reused the v2 logic as-is without
accounting for Message-IDs anywhere with `--dedupe=content'
(the default). This could cause messages to be skipped when
the content matches despite Message-IDs being different.
So with this change, `lei q --dedupe=content' will hash the
Message-ID(s) in the message to ensure messages with different
Message-IDs are NOT deduplicated.
Whether this change is a bug fix or introduces a regression
is debatable. In my mind, it is better to err on the
side of showing too many messages rather than too few, even if
the actual contents of the message are identical. Making saved
searches deduplicate without accounting for Message-IDs would be
more difficult, too.
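To illustrate the idea, here is a minimal Python sketch (not lei's actual Perl implementation). The digest covers the Message-ID(s) in addition to a few headers and the body, so two otherwise identical messages with different Message-IDs produce different digests. The function name `content_digest`, the `include_mids` switch, and the set of hashed headers are assumptions made for this example only.

```python
import hashlib
from email import message_from_bytes


def content_digest(raw: bytes, include_mids: bool = True) -> str:
    """Hypothetical dedupe key; the hashed header set is illustrative only."""
    msg = message_from_bytes(raw)
    h = hashlib.sha256()

    def feed(name: str, value) -> None:
        # NUL separators keep adjacent fields from running together.
        data = str(value).strip().encode("utf-8", "surrogateescape")
        h.update(name.encode("ascii") + b"\0" + data + b"\0")

    if include_mids:
        # Hash every Message-ID so messages with differing IDs never collide.
        for mid in msg.get_all("Message-ID", []):
            feed("Message-ID", mid)
    for name in ("Subject", "From"):
        for value in msg.get_all(name, []):
            feed(name, value)
    # get_payload(decode=True) returns None for multipart messages.
    body = msg.get_payload(decode=True) or b""
    h.update(b"BODY\0" + body)
    return h.hexdigest()
```

With `include_mids=False` the sketch behaves like the older content-only comparison, which treats two such messages as duplicates.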
|
|
ContentHash currently doesn't convert CRCRLF to LF. Perhaps it
should, but for now, have diff behavior match the actual
comparison behavior used for dedupe and omit all trailing
whitespace for diff.
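A minimal Python sketch of the principle: the diff and the equality check share one normalization, so the diff shows exactly what the comparison sees. The normalization itself (CRLF to LF, then dropping whitespace at the end of the text) is an assumption for illustration, as are the names `normalize`, `same_content`, and `content_diff`. A stray CR before CRLF is deliberately left alone.

```python
import difflib


def normalize(text: str) -> str:
    # Assumed normalization: CRLF -> LF, then drop trailing whitespace.
    # CRCRLF becomes CR LF, so a stray CR still counts as a difference.
    return text.replace("\r\n", "\n").rstrip()


def same_content(a: str, b: str) -> bool:
    return normalize(a) == normalize(b)


def content_diff(a: str, b: str) -> list:
    # Diff the normalized forms so the output mirrors same_content().
    return list(difflib.unified_diff(
        normalize(a).split("\n"),
        normalize(b).split("\n"),
        "a", "b", lineterm=""))
```

Under these assumptions, `content_diff()` returns an empty list exactly when `same_content()` is true.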
|
|
It's possible in theory that Perl could be smarter and free
memory a tad sooner this way. Regardless, fewer lines of code
are easier to navigate and read, and can shrink the optree and
reduce parsing times.
|
|
On my x86-64 machine, OpenSSL SHA-256 is nearly twice as fast as
the Digest::SHA implementation from Perl, most likely due to an
optimized assembly implementation. SHA-1 is a few percent
faster, too.
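Readers who want to try a similar measurement on their own machine could use a small throughput helper like the Python sketch below. It times Python's `hashlib` constructors, not the Perl modules compared above, so the absolute numbers will differ. The helper name `throughput_mb_s` and its parameters are hypothetical.

```python
import hashlib
import time


def throughput_mb_s(new_hash, payload: bytes, rounds: int = 50) -> float:
    """Rough throughput in MB/s for a hashlib-style constructor."""
    start = time.perf_counter()
    for _ in range(rounds):
        new_hash(payload).digest()
    # Guard against a zero interval on very coarse clocks.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return len(payload) * rounds / elapsed / 1e6


if __name__ == "__main__":
    data = b"\0" * (1 << 20)  # 1 MiB of input per round
    for name in ("sha1", "sha256"):
        rate = throughput_mb_s(getattr(hashlib, name), data)
        print(f"{name}: {rate:.0f} MB/s")
```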
|
|
The alsa-devel archives on lore have some UTF-8 References:
headers, so we need to treat them as octets, again, otherwise
(re)indexing triggers cascading failures.
Fixes: 5198c976ce8b "eml: header_raw converts octets to Perl UTF-8"
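As a rough illustration of treating such headers as octets, the Python sketch below extracts Message-IDs from a raw header value without ever decoding it, so non-ASCII or even invalid UTF-8 bytes cannot raise a decoding error. It is not the project's Perl code; `MID_RE` and `references_octets` are hypothetical names.

```python
import re

# Bytes pattern: anything between angle brackets, with no whitespace inside.
MID_RE = re.compile(rb"<([^<>\s]+)>")


def references_octets(raw_header: bytes) -> list:
    """Return the Message-IDs in a raw References: value as bytes."""
    return MID_RE.findall(raw_header)
```

Because the result stays as bytes, the caller decides if and when to decode it for display.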
|
|
This should prevent some false duplicates. I noticed this
while implementing "lei mail-diff", and only noticed it when
I implemented the ContentDigestDbg wrapper for mail-diff.
|
|
This is useful in finding the cause of deduplication bugs,
and possibly the cause of missing threads reported by
Konstantin in <20211001130527.z7eivotlgqbgetzz@meerkat.local>
Usage:
    u=https://yhbt.net/lore/all/87czop5j33.fsf@tynnyri.adurom.net/raw
    lei mail-diff $u
|
|
This will be convenient to avoid the overhead of
PublicInbox::Eml for verifying synchronization in lei.
|
|
This will let us tie keywords from remote externals
to those which only exist in local externals.
|
|
This regression was introduced long ago; the fix matches the
behavior originally specified in the comments. It makes a noticeable
improvement with search results using -extindex ("all") and
lei results with multiple inboxes.
Update some style bits at the top of the test case while
we're at it.
Fixes: f0ef0a56a8957d6f ("v2: improve deduplication checks")
|
|
Using "make update-copyrights" after setting GNULIB_PATH in my
config.mak
|
|
We used ->header_obj in the past as an optimization with
Email::MIME. That optimization is no longer necessary
with PublicInbox::Eml.
This doesn't make any functional difference even if we were to
go back to Email::MIME. However, it reduces the amount of code
we have and slightly reduces allocations with PublicInbox::Eml.
|
|
The old name may be confused with "Content-ID" as described in
RFC 2392, so use an alternate name to avoid confusing future
readers.
|