about summary refs log tree commit homepage
path: root/lib/PublicInbox/Mbox.pm
DateCommit message (Collapse)
2020-08-23mbox: disable "&t" on existing Xapian until full reindex
Expanding threads via over.sqlite3 for mbox.gz downloads without Xapian effectively collapsing on the THREADID column leads to repeated messages getting downloaded. To avoid that situation, use a "has_threadid" Xapian metadata flag that's only set on --reindex (and brand new Xapian DBs). This allows admins to upgrade WWW or do --reindex in any order; without worrying about users eating up bandwidth and CPU cycles.
2020-08-23search: support downloading mboxes results with full thread
Finally, the addition of THREADID for collapsing results in Xapian lets us emulate the "mairix --threads" feature. That is, instead of returning only the matching messages, the entire thread is included in the downloaded mbox.gz This requires a "public-inbox-index --reindex" to be usable.
2020-08-20search: add mset_to_artnums method
We can avoid importing mdocid() in several places by using this method, simplifying callers.
2020-08-20mbox: avoid Xapian docdata in search results
Another place where we can reduce kernel page cache overhead by hitting over.sqlite3 instead of docdata.glass.
2020-08-20www: reduce long-lived PublicInbox::Search references
While this is unlikely to be a problem in current practice, keeping Xapian DBs open for long responses can interfere with free space recovery after -compact. In the future, it will interfere with inbox search grouping and lead to unexpected results.
2020-08-01www: rework async_* to use method table
Although the ->async_next method does not take $self as a receiver, but rather a PublicInbox::HTTP object, we may still retrieve it to be called with the HTTP object via UNIVERSAL->can.
2020-07-10hval: to_filename: return `undef' instead of empty string
Returning an empty string for a filename makes no sense, so instead return `undef' so the caller can setup a fallback using the "//" operator. This fixes uninitialized variable warnings because split() on an empty string returns `undef', which caused to_filename to warn on s// and tr// ops.
2020-07-06www: update internal docs
We no longer favor getline+close for streaming PSGI responses when using public-inbox-httpd. We still support it for other PSGI servers, though.
2020-07-06www: start making gzipfilter the parent response class
Virtually all of our responses are going to be gzipped, anyways. This will allow us to utilize zlib as a buffering layer and share common code for async blob retrieval responses. To streamline this and allow GzipFilter to be a parent class, we'll replace the NoopFilter with a similar CompressNoop class which emulates the two Compress::Raw::Zlib::Deflate methods we use. This drops a bunch of redundant code and will hopefully make upcoming WwwStream changes easier to reason about.
2020-07-06mbox: async blob fetch for "single message" raw mboxrd
This restores gzip-by-default behavior for /$INBOX/$MSGID/raw endpoints for all indexed inboxes. Unindexed v1 inboxes will remain uncompressed, for now.
2020-07-06mbox: remove html_oneshot import
It's no longer needed, we no longer show a runtime error for zlib being missing, as zlib is a hard requirement. Fixes: a318e758129d616b ("make zlib-related modules a hard dependency")
2020-06-03smsg: remove remaining accessor methods
We'll continue to favor simpler data models that can be used directly rather than wasting time and memory with accessor APIs. The ->from, ->to, -cc, ->mid, ->subject, >references methods can all be trivially replaced by hash lookups since all their values are stored in doc_data. Most remaining callers of those methods were test cases, anyways. ->from_name is only used in the PSGI code, so we can just use ->psgi_cull to take care of populating the {from_name} field.
2020-05-09switch read-only Email::Simple users to Eml
Since PublicInbox::Eml doesn't parse MIME subparts up front, it can replace most uses of Email::Simple without performance penalty. This will eventually allow us to lower overall internal API footprint by not having to keep the MIME vs Simple distinction.
2020-04-22make zlib-related modules a hard dependency
This allows us to simplify some of our existing code and make future changes easier. I doubt anybody goes through the trouble to have a Perl installation without zlib support. The zlib source code is even bundled with Perl since 5.9.3 for systems without existing zlib development headers and libraries. Of course, zlib is also a requirement of git, too; and we're not going to stop using git :) [squashed: "wwwaltid: use gzipfilter up front"]
2020-04-19reduce scope of mbox From_ line removal
It's unnecessary overhead for anything which does Email::MIME parsing. It was never done for v2 indexing, even though v1->v2 conversions did NOT remove those From_ lines. There was never a need to remote From_ lines the v1 SearchIdx paths, either. Hitting a /$INBOX_URL/$MSGID/T/ endpoint with an 18 message thread reveals a ~0.5% speed improvement. This will become more apparent when we have a faster MIME parser.
2020-04-19mbox: use per-message line-ending for From_ line
Email::Simple preserves the message line ending in headers, so make the From_ line consistent with the rest of the headers.
2020-04-05mbox: halve ->getline "context switches"
We don't need to take extra trips through the event loop for a single message (in the common case of Message-IDs being unique). In fact, holding the body reference left behind by Email::Simple could be harmful to memory usage, though in practice it's not a big problem since code paths which use Email::MIME take far more.
2020-03-30wwwstream::oneshot => html_oneshot
And use Exporter to make our life easier, since WwwAltId was using a non-existent PublicInbox::WwwResponse namespace in error paths which doesn't get noticed by `perl -c' or exercised by tests on normal systems. Fixes: 6512b1245ebc6fe3 ("www: add endpoint to retrieve altid dumps")
2020-03-25mbox: need_gzip uses WwwStream::oneshot
This makes the error page more consistent. Not that it really matters since Compress::Raw::Zlib and IO::Compress packages have been distributed with Perl since 5.10.x. Of course, zlib itself is also a dependency of git.
2020-03-22rename PublicInbox::SearchMsg => PublicInbox::Smsg
Since the introduction of over.sqlite3, SearchMsg is not tied to our search functionality in any way, so stop confusing ourselves and future hackers by just calling it "PublicInbox::Smsg". Add a missing "use" in ExtMsg while we're at it.
2020-02-06treewide: run update-copyrights from gnulib for 2019
I didn't wait until September to do it, this year!
2020-01-25mbox: handle empty subjects after dropping "Re:" prefix
We can't pass empty strings to `to_filename' without triggering warnings, and `to_filename' on an empty string makes no sense.
2019-12-28search: retry_reopen passes user arg to callback
This allows callers to pass named (not anonymous) subs. Update all retry_reopen callers to use this feature, and fix some places where we failed to use retry_reopen :x
2019-12-27mboxgz: pass $ctx to callback to avoid anon subs
Another place where we can rid ourselves of most anonymous subs by passing the $ctx arg to the callback.
2019-12-12mbox: do not try to create filename from empty string
This was causing warnings to pop up in syslogs for messages with empty Subject headers.
2019-11-16mboxgz: use Compress::Raw::Zlib instead of IO::Compress::Gzip
IO::Compress::Gzip is a wrapper around Compress::Raw::Zlib, anyways, and being able to easily detach buffers to return them via ->getline is nice. This results in a 1-2% performance improvement when fetching giant mboxes.
2019-11-16mbox: split mboxgz out into a separate file
It'll make using Compress::Raw::Zlib easier, since we can use that and import constants more easily.
2019-11-16mbox: unused mid_clean import
We're gradually phasing mid_clean out (in favor of mids()).
2019-10-01www: fix absolute URLs when mounted under a subdir
While we avoid generating absolute URLs in most cases, our "git clone" instructions and URL headers in mboxrd files contain full URLs. So do the same thing we do for WwwAtomStream and pre-generate the full URL before Plack::App::URLMap changes $env->{PATH_INFO} and $env->{SCRIPT_NAME} back to their original values. Reported-by: edef <edef@edef.eu> Link: https://public-inbox.org/meta/cover.0f97c47bb88db8b875be7497289d8fedd3b11991.1569296942.git-series.edef@edef.eu/
2019-09-27mbox: update URL for mboxrd info
qmail.org seems unavailable.
2019-09-09run update-copyrights from gnulib for 2019
2019-06-27mbox: split header and body processing
When dealing with ~30MB messages, we can save another ~30MB by splitting the header and body processing and not appending the body string back to the header. We'll rely on buffering in gzip or kernel (via MSG_MORE) to prevent silly packet sizes.
2019-06-27mbox: use Email::Simple->new to do in-place modifications
Email::Simple->new will split the head from the body in-place, and we can avoid using Email::Simple::body. This saves us from holding an extra copy of the message in memory, and saves us around ~30MB when operating on ~30MB messages.
2019-05-15www: use Inbox->over where appropriate
We don't need to rely on Xapian search functionality for the majority of the WWW code, even. subject_normalized is moved to SearchMsg, where it (probably) makes more sense, anyways.
2019-01-09doc: various overview-level module comments
Hopefully this helps people familiarize themselves with the source code.
2018-04-07psgi: ensure /$INBOX/$MESSAGE_ID/T/ endpoint is chronological
We only need to call get_thread beyond 1000 messages for fetching entire mboxes. It's probably too much for the HTML display otherwise.
2018-04-06www: favor reading more from SQLite, and less from Xapian
Favor simpler internal APIs this time around, this cuts a fair amount of code out and takes another step towards removing Xapian as a dependency for v2 repos.
2018-04-05mbox: do not sort search results
Sorting large msets is a waste when it comes to mboxes since MUAs should thread and sort them as the user desires. This forces us to rework each of the mbox download mechanisms to be more independent of each other, but might make things easier to reason about.
2018-04-03msgmap: replace id_batch with ids_after
id_batch had a an overly complicated interface, replace it with id_batch which is simpler and takes advantage of selectcol_arrayref in DBI. This allows simplification of callers and the diffstat agrees with me.
2018-04-03mbox: remove remaining OFFSET usage in SQLite
We can use id_batch in the common case to speed up full mbox retrievals. Gigantic msets are still a problem, but will be fixed in future commits.
2018-04-02www: rework query responses to avoid COUNT in SQLite
In many cases, we do not care about the total number of messages. It's a rather expensive operation in SQLite (Xapian only provides an estimate). For LKML, this brings top-level /$INBOX/ loading time from ~375ms to around 60ms on my system. Days ago, this operation was taking 800-900ms(!) for me before introducing the SQLite overview DB.
2018-03-29mbox: avoid extracting Message-ID for linkification
We can avoid a small amount of overhead and use the "preferred" Message-ID based on what is in the SearchMsg object.
2018-03-29www: remove unnecessary ghost checks
We do not need to care about ghosts at multiple call sites; they cannot have a {blob} field and we've stored the blob field in Xapian since SCHEMA_VERSION=13.
2018-03-27view: permalink (per-message) view shows multiple messages
This needs tests and further refinement, but current tests pass.
2018-03-23www: $MESSAGE_ID/raw endpoint supports "duplicates"
Since v2 supports duplicate messages, we need to support looking up different messages with the same Message-Id. Fortunately, our "raw" endpoint has always been mboxrd, so users won't need to change their parsing tools.
2018-02-07update copyrights for 2018
Using update-copyrights from gnulib While we're at it, use the SPDX identifier for AGPL-3.0+ to ease mechanical processing.
2017-12-01search: allow downloading search results as mbox
Allowing downloading of all search results as an gzipped mboxrd file can be convenient for some users.
2017-10-04mbox: support inline filename via Content-Disposition header
This is hopefully more sensical than "raw" files from resulting downloads.
2017-06-23mbox: show application/mbox for obfuscated inboxes
Sigh, yet another place to handle obfuscation for misguided people who expect it. Maybe this will do something to prevent spammers from getting addresses, while still allowing the "curl $URL | git am" use case to work.
2016-12-10search: always sort thread results in ascending time order
This makes life easier for the threading algorithm, as we can use the implied ordering of timestamps to avoid temporary ghosts and resulting container vivication. This would've also allowed us to hide the bug (in most cases) fixed by the patch titled "thread: last Reference always wins", in case that needs to be reverted due to infinite looping.