path: root/lib/PublicInbox/Smsg.pm
Date Commit message
2022-11-26 eml: header_raw converts octets to Perl UTF-8
This fixes the display of raw (non-RFC 2047) names and subjects in HTML message views. SMTPUTF8 (RFC 6531) allows raw UTF-8 in headers without RFC 2047 encoding, so let Perl handle it as a character sequence for the rest of our consumers. Thus, the old special case in PublicInbox::Smsg->populate is no longer necessary and is gone. The one regression noticed so far (and fixed here) is that compressed IMAP envelope responses still need raw bytes, since the zlib wrapper is designed for octets, not Perl UTF-8 chars. Thus we reverse utf8::decode with utf8::encode in PublicInbox::IMAP::_esc. ->header_set also forces encoding to bytes, since all existing callers would either be dealing with ->header_raw results or be RFC-2047-encoded anyways. Reindexing is not necessary with this change due to the prior PublicInbox::Smsg->populate special case. Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org> Link: https://public-inbox.org/meta/20221124153715.3nenjpjzj43vqxr2@meerkat.local/
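The decode/encode reversal described above can be sketched roughly as follows. This is only an illustration of the two core utf8:: functions; the helper names octets_to_chars and chars_to_octets are hypothetical and are not taken from PublicInbox::Eml or PublicInbox::IMAP.

```perl
use strict;
use warnings;

# Hypothetical helper: turn raw header octets into Perl characters.
# utf8::decode works in place on its argument.
sub octets_to_chars {
	my ($v) = @_;
	utf8::decode($v);
	$v;
}

# Hypothetical helper: reverse the above before handing the value to
# an octet-oriented consumer (such as a zlib wrapper).
sub chars_to_octets {
	my ($v) = @_;
	utf8::encode($v);
	$v;
}

my $raw = "Subject: caf\xc3\xa9";      # raw UTF-8 octets, no RFC 2047
my $chars = octets_to_chars($raw);     # one character per code point
my $octets = chars_to_octets($chars);  # back to the original octets
```

The character form is what HTML views want, while the octet form is what a compressing writer expects.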
2022-08-19 smsg: ->populate falls back to old {ds}/{ts} values
This will be useful for re-indexing external messages which were over-indexed in lei/store.
2021-10-16 smsg: add ->oidbin method
This makes some of our code less noisy by reducing the amount of pack('H*', ...) use.
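A minimal sketch of such a method, assuming the hexadecimal object ID lives in the {blob} field. The package name My::Smsg is a hypothetical stand-in, not the real class.

```perl
use strict;
use warnings;

package My::Smsg; # hypothetical stand-in for illustration

# Convert the hex object ID into its packed binary form.
sub oidbin { pack('H*', $_[0]->{blob}) }

package main;

my $smsg = bless { blob => 'deadbeef' x 5 }, 'My::Smsg';
my $bin = $smsg->oidbin; # 20 octets for a 40-digit hex OID
```

Callers can then write $smsg->oidbin instead of repeating the pack call at each site.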
2021-10-09 view: save memory by dropping smsg->{from_name} on use
We'll also save a few LoC when generating it. $smsg objects can linger a while when rendering large threads, so saving a few bytes here can add up to several hundred KB saved. I noticed this while chasing the ref cycle leak in commit b28e74c9dc0a (www: fix ref cycle from threading w/ extindex, 2021-10-03). While there's no longer a leak, releasing memory earlier can allow it to be reused sooner and reduce both memory traffic and memory pressure.
2021-10-06 overidx: subject_path: allow non-ASCII char in subject matches
This should bring us closer to the "Base subject" definition in IMAP ORDEREDSUBJECT (RFC 5256 2.1). Larger changes may cause some breakage (until --reindex). But for now, a reindex will prevent the non-ASCII subjects from being normalized to the same fuzzy "thread" in the thread view.
2021-04-03 lei: improve handling of Message-ID-less draft messages
We need a stable fallback time for digest2mid in the presence of messages without Received/Date headers. Furthermore, we must avoid using uninitialized smsg->{mid} when parsing References for draft replies.
2021-01-24 smsg: parse_references: micro-optimization
With Perl 5.10+, we can rely on the defined-or-assignment (//=) operator to avoid repeatedly rewriting an SV. This may not provide a measurable difference here, but it's more consistent with current style where things like commit a05445fb400108e60ede7d377cf3b26a0392eb24 ("config: config_fh_parse: micro-optimize") provide a measurable improvement.
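As a small illustration of the operator itself (not of the actual parse_references change), the sketch below shows that //= only evaluates and assigns its right-hand side when the target is undefined. The expensive_default sub and the hash contents are hypothetical.

```perl
use 5.010;
use strict;
use warnings;

my $calls = 0;
# Hypothetical fallback; counts how often it is actually invoked.
sub expensive_default { $calls++; 'fallback' }

my %smsg = (mid => 'a@example', references => undef);

$smsg{mid} //= expensive_default();        # defined: right side skipped
$smsg{references} //= expensive_default(); # undef: right side runs once
```

The defined value is left alone, so no store happens for it at all.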
2021-01-24 smsg: make parse_references an object method
Having parse_references in OverIdx was awkward and Smsg is a better place for it.
2021-01-03 use Eml (or MIME) objects for all indexing paths
We don't need to be keeping the raw message around after it hits git. Shard work now relies on Storable (or Sereal) and all of the indexing code relies on the Email::MIME-like API of Eml to access interesting parts of the message. Similarly, smsg->{raw_bytes} is no longer carried around and we do the CRLF adjustment when setting smsg->{bytes}. There's also a small simplification to t/import.t while we're in the area to use xqx instead of spawn/popen_rd.
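One way to picture a CRLF adjustment is sketched below, under the assumption that it means reporting the size the message would have if every bare LF were a CRLF. The sub name crlf_adjusted_bytes is hypothetical and the real code may compute this differently.

```perl
use strict;
use warnings;

# Hypothetical helper: size of $raw as if all line endings were CRLF.
sub crlf_adjusted_bytes {
	my ($raw) = @_;
	# count LFs that are not already preceded by CR
	my $bare_lf = () = $raw =~ /(?<!\r)\n/g;
	length($raw) + $bare_lf;
}

my $bytes = crlf_adjusted_bytes("a\nb\r\nc\n"); # 7 octets, 2 bare LFs
```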
2021-01-01 update copyrights for 2021
Using "make update-copyrights" after setting GNULIB_PATH in my config.mak
2021-01-01 lei_store: handle messages without Message-ID at all
For personal mail, unsent draft messages are a common source of messages without Message-IDs.
2020-11-07 searchidx: remove xref3 support for Xapian
It doesn't seem worth storing xref3 data in Xapian now that the same info is in over.sqlite3.
2020-11-07 searchidx: introduce "xref3" concept
This will be used to track cross-posted messages in the external/detached index.
2020-09-24 searchidx: fix (undocumented) --skip-docdata handling
This switch is still undocumented, but we can reduce the scope of our Xapian docdata dependency by moving its only caller to SearchIdx. This reduces the amount of code loaded by read-only code paths.
2020-08-28 www: improve navigation around contemporary threads
Sometimes it's useful to quickly get to threads and messages which are contemporaries of the current thread/message being focused on. This hopefully improves navigation by: a) making the top line (where $INBOX_DIR/description is shown) a link to the latest topics in search results and per-thread/per-message views; b) providing a link to contemporaries ("~YYYY-MM-DD") around the thread overview skeleton area for per-thread and per-message views.
2020-08-23 searchidx: index THREADID in Xapian
This is the `tid' column from over.sqlite3; and will be used for IMAP and JMAP search (among other things).
2020-08-20 smsg: remove from_mitem
We no longer read docdata.glass from anywhere in our code base. Some adjustments were needed to t/search.t to deal with the Xapian::WritableDatabase committing at different times, since our ->query is avoided from PublicInbox::SearchIdx to avoid needing a {over_ro} field.
2020-08-20 smsg: reduce utf8::decode call sites
Both callers of load_from_data call utf8::decode, so just do utf8::decode in load_from_data.
2020-08-19 smsg: handle wide characters in raw mail headers
There may be messages in the wild with wide characters in headers which aren't RFC 2047-encoded. Assume UTF-8 so those fields can round trip through over.sqlite3. This doesn't affect docdata.glass in Xapian, but it does affect how over.sqlite3 stores the same deflated info.
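"Assume UTF-8" can be sketched with the core utf8::decode function, which converts well-formed UTF-8 octets to characters in place and leaves malformed input untouched. The helper name assume_utf8 is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: treat header octets as UTF-8 where possible.
sub assume_utf8 {
	my ($v) = @_;
	utf8::decode($v); # returns false and leaves $v alone if malformed
	$v;
}

my $wide = assume_utf8("\xe2\x98\x83"); # UTF-8 octets for U+2603
my $latin1 = assume_utf8("caf\xe9");    # not valid UTF-8, kept as-is
```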
2020-07-25 v2writable: move {autime} and {cotime} into $sync state
The V2Writable object may be long-lived, so it makes more sense to put the {autime} and {cotime} fields into the shorter-lived index_sync state.
2020-06-13 preliminary imap server implementation
It shares a bit of code with NNTP. It's copy+pasted for now since this provides new ground to experiment with APIs for dealing with slow storage and many inboxes.
2020-06-03 smsg: remove remaining accessor methods
We'll continue to favor simpler data models that can be used directly rather than wasting time and memory with accessor APIs. The ->from, ->to, ->cc, ->mid, ->subject, ->references methods can all be trivially replaced by hash lookups since all their values are stored in doc_data. Most remaining callers of those methods were test cases, anyways. ->from_name is only used in the PSGI code, so we can just use ->psgi_cull to take care of populating the {from_name} field.
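The trade-off can be illustrated with a toy class. My::Smsg and its single accessor are hypothetical; the point is only that a trivial accessor returns the same value as a direct hash lookup.

```perl
use strict;
use warnings;

package My::Smsg; # hypothetical stand-in for illustration

# The kind of trivial accessor being removed.
sub subject { $_[0]->{subject} }

package main;

my $smsg = bless { subject => 'hello', mid => 'x@example' }, 'My::Smsg';

my $via_method = $smsg->subject;   # method dispatch plus a sub call
my $via_hash = $smsg->{subject};   # plain hash lookup, same value
```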
2020-06-03 smsg: remove ->bytes and ->lines methods
They're stored directly in Xapian and SQLite document data. NNTP accesses those fields directly to avoid method invocation overhead so there's no reason to waste several kilobytes for each sub.
2020-06-03 smsg: get rid of remaining {mime} users
We'll let $smsg->populate take care of everything all at once without hanging onto the header object for too long.
2020-06-03 www: remove smsg_mime API and adjust callers
To further simplify callers and avoid embarrassing memory explosions[1], we can finally eliminate this method in favor of smsg_eml. [1] commit 7d02b9e64455831d3bda20cd2e64e0c15dc07df5 ("view: stop storing all MIME objects on large threads") fixed a huge memory blowup.
2020-06-03 smsg: get rid of ->wrap initializer, too
We'll just use `bless' like most current PublicInbox::Smsg callers.
2020-06-03 smsg: introduce ->populate method
This will eventually replace the __hdr() calling methods and eradicate {mime} usage from Smsg. For now, we can eliminate PublicInbox::Smsg->new since most callers already rely on an open `bless' to avoid the old {mime} arg.
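A rough sketch of the pattern, assuming populate copies selected header values into lowercase hash fields in a single pass. The My::Smsg package, the plain hash used as a header source, and the chosen field list are all hypothetical; the real method takes a header object and does more.

```perl
use strict;
use warnings;

package My::Smsg; # hypothetical stand-in for illustration

# Fill fields from a header source in one pass, so the caller can
# drop the header object right afterwards.
sub populate {
	my ($self, $hdr) = @_;
	for my $f (qw(From To Cc Subject)) {
		my $v = $hdr->{$f};
		next unless defined $v;
		$self->{lc $f} = $v; # field names stay lowercase
	}
	$self;
}

package main;

# An open `bless' in place of a constructor.
my $smsg = bless {}, 'My::Smsg';
$smsg->populate({ From => 'a@example', Subject => 'hi' });
```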
2020-05-09 smsg: use capitalization for header retrieval
PublicInbox::Eml will have case-sensitive memoization to avoid the need to call `lc' to retrieve common headers, so ensure we call $mime->header() with the common capitalization. Unfortunately, we need to continue using lowercase for field names for smsg, since NNTP requires case-insensitivity when matching headers and method dispatch is expensive.
2020-04-02 smsg: inline _extract_mid functionality
No need to keep an extra sub which isn't called anywhere else, and the mid_clean call is redundant since mid_mime already plucks the msgid out of the angle brackets.
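Plucking a Message-ID out of its angle brackets can be sketched as below. The extract_mid sub and its regular expression are hypothetical simplifications and do not reproduce mid_mime.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first <...> token without brackets,
# or undef if the header value contains none.
sub extract_mid {
	my ($hdr) = @_;
	return unless defined $hdr;
	$hdr =~ /<([^>]+)>/ ? $1 : undef;
}

my $mid = extract_mid('<20200402.1234@example.com>');
```

Once the brackets are already stripped, a second cleaning pass over the same value has nothing left to do.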
2020-03-22 smsg: to_doc_data: use existing fields
No need to pass extra parameters to this method, since smsg has universal meanings for {blob} and {mid}.
2020-03-22 rename PublicInbox::SearchMsg => PublicInbox::Smsg
Since the introduction of over.sqlite3, SearchMsg is not tied to our search functionality in any way, so stop confusing ourselves and future hackers by just calling it "PublicInbox::Smsg". Add a missing "use" in ExtMsg while we're at it.