about summary refs log tree commit homepage
path: root/lib/PublicInbox/Mbox.pm
DateCommit message (Collapse)
2020-04-05mbox: halve ->getline "context switches"
We don't need to take extra trips through the event loop for a single message (in the common case of Message-IDs being unique). In fact, holding the body reference left behind by Email::Simple could be harmful to memory usage, though in practice it's not a big problem since code paths which use Email::MIME take far more.
2020-03-30wwwstream::oneshot => html_oneshot
And use Exporter to make our life easier, since WwwAltId was using a non-existent PublicInbox::WwwResponse namespace in error paths which doesn't get noticed by `perl -c' or exercised by tests on normal systems. Fixes: 6512b1245ebc6fe3 ("www: add endpoint to retrieve altid dumps")
2020-03-25mbox: need_gzip uses WwwStream::oneshot
This makes the error page more consistent. Not that it really matters since Compress::Raw::Zlib and IO::Compress packages have been distributed with Perl since 5.10.x. Of course, zlib itself is also a dependency of git.
2020-03-22rename PublicInbox::SearchMsg => PublicInbox::Smsg
Since the introduction of over.sqlite3, SearchMsg is not tied to our search functionality in any way, so stop confusing ourselves and future hackers by just calling it "PublicInbox::Smsg". Add a missing "use" in ExtMsg while we're at it.
2020-02-06treewide: run update-copyrights from gnulib for 2019
I didn't wait until September to do it, this year!
2020-01-25mbox: handle empty subjects after dropping "Re:" prefix
We can't pass empty strings to `to_filename' without triggering warnings, and `to_filename' on an empty string makes no sense.
2019-12-28search: retry_reopen passes user arg to callback
This allows callers to pass named (not anonymous) subs. Update all retry_reopen callers to use this feature, and fix some places where we failed to use retry_reopen :x
2019-12-27mboxgz: pass $ctx to callback to avoid anon subs
Another place where we can rid ourselves of most anonymous subs by passing the $ctx arg to the callback.
2019-12-12mbox: do not try to create filename from empty string
This was causing warnings to pop up in syslogs for messages with empty Subject headers.
2019-11-16mboxgz: use Compress::Raw::Zlib instead of IO::Compress::Gzip
IO::Compress::Gzip is a wrapper around Compress::Raw::Zlib, anyways, and being able to easily detach buffers to return them via ->getline is nice. This results in a 1-2% performance improvement when fetching giant mboxes.
2019-11-16mbox: split mboxgz out into a separate file
It'll make using Compress::Raw::Zlib easier, since we can use that and import constants more easily.
2019-11-16mbox: unused mid_clean import
We're gradually phasing mid_clean out (in favor of mids()).
2019-10-01www: fix absolute URLs when mounted under a subdir
While we avoid generating absolute URLs in most cases, our "git clone" instructions and URL headers in mboxrd files contain full URLs. So do the same thing we do for WwwAtomStream and pre-generate the full URL before Plack::App::URLMap changes $env->{PATH_INFO} and $env->{SCRIPT_NAME} back to their original values. Reported-by: edef <edef@edef.eu> Link: https://public-inbox.org/meta/cover.0f97c47bb88db8b875be7497289d8fedd3b11991.1569296942.git-series.edef@edef.eu/
2019-09-27mbox: update URL for mboxrd info
qmail.org seems unavailable.
2019-09-09run update-copyrights from gnulib for 2019
2019-06-27mbox: split header and body processing
When dealing with ~30MB messages, we can save another ~30MB by splitting the header and body processing and not appending the body string back to the header. We'll rely on buffering in gzip or kernel (via MSG_MORE) to prevent silly packet sizes.
2019-06-27mbox: use Email::Simple->new to do in-place modifications
Email::Simple->new will split the head from the body in-place, and we can avoid using Email::Simple::body. This saves us from holding an extra copy of the message in memory, and saves us around ~30MB when operating on ~30MB messages.
2019-05-15www: use Inbox->over where appropriate
We don't need to rely on Xapian search functionality for the majority of the WWW code, even. subject_normalized is moved to SearchMsg, where it (probably) makes more sense, anyways.
2019-01-09doc: various overview-level module comments
Hopefully this helps people familiarize themselves with the source code.
2018-04-07psgi: ensure /$INBOX/$MESSAGE_ID/T/ endpoint is chronological
We only need to call get_thread beyond 1000 messages for fetching entire mboxes. It's probably too much for the HTML display otherwise.
2018-04-06www: favor reading more from SQLite, and less from Xapian
Favor simpler internal APIs this time around, this cuts a fair amount of code out and takes another step towards removing Xapian as a dependency for v2 repos.
2018-04-05mbox: do not sort search results
Sorting large msets is a waste when it comes to mboxes since MUAs should thread and sort them as the user desires. This forces us to rework each of the mbox download mechanisms to be more independent of each other, but might make things easier to reason about.
2018-04-03msgmap: replace id_batch with ids_after
id_batch had a an overly complicated interface, replace it with id_batch which is simpler and takes advantage of selectcol_arrayref in DBI. This allows simplification of callers and the diffstat agrees with me.
2018-04-03mbox: remove remaining OFFSET usage in SQLite
We can use id_batch in the common case to speed up full mbox retrievals. Gigantic msets are still a problem, but will be fixed in future commits.
2018-04-02www: rework query responses to avoid COUNT in SQLite
In many cases, we do not care about the total number of messages. It's a rather expensive operation in SQLite (Xapian only provides an estimate). For LKML, this brings top-level /$INBOX/ loading time from ~375ms to around 60ms on my system. Days ago, this operation was taking 800-900ms(!) for me before introducing the SQLite overview DB.
2018-03-29mbox: avoid extracting Message-ID for linkification
We can avoid a small amount of overhead and use the "preferred" Message-ID based on what is in the SearchMsg object.
2018-03-29www: remove unnecessary ghost checks
We do not need to care about ghosts at multiple call sites; they cannot have a {blob} field and we've stored the blob field in Xapian since SCHEMA_VERSION=13.
2018-03-27view: permalink (per-message) view shows multiple messages
This needs tests and further refinement, but current tests pass.
2018-03-23www: $MESSAGE_ID/raw endpoint supports "duplicates"
Since v2 supports duplicate messages, we need to support looking up different messages with the same Message-Id. Fortunately, our "raw" endpoint has always been mboxrd, so users won't need to change their parsing tools.
2018-02-07update copyrights for 2018
Using update-copyrights from gnulib While we're at it, use the SPDX identifier for AGPL-3.0+ to ease mechanical processing.
2017-12-01search: allow downloading search results as mbox
Allowing downloading of all search results as an gzipped mboxrd file can be convenient for some users.
2017-10-04mbox: support inline filename via Content-Disposition header
This is hopefully more sensical than "raw" files from resulting downloads.
2017-06-23mbox: show application/mbox for obfuscated inboxes
Sigh, yet another place to handle obfuscation for misguided people who expect it. Maybe this will do something to prevent spammers from getting addresses, while still allowing the "curl $URL | git am" use case to work.
2016-12-10search: always sort thread results in ascending time order
This makes life easier for the threading algorithm, as we can use the implied ordering of timestamps to avoid temporary ghosts and resulting container vivication. This would've also allowed us to hide the bug (in most cases) fixed by the patch titled "thread: last Reference always wins", in case that needs to be reverted due to infinite looping.
2016-08-14www: do not unecessarily escape some chars in paths
Based on reading RFC 3986, it seems '@', ':', '!', '$', '&', "'", '; '(', ')', '*', '+', ',', ';', '=' are all allowed in path-absolute where we have the Message-ID. In any case, it seems '@' is fairly common in path components nowadays and too common in Message-IDs.
2016-08-06mbox: be fair to other HTTP clients
At least for public-inbox-httpd, this allows us to avoid having a client monopolize one event loop tick of the server for too long. It hurts throughput for the /all.mbox.gz endpoint, but I doubt anybody cares and the latency improvement for other clients would be appreciated. We already do the same fairness thing for HTML pages.
2016-08-04searchmsg: add git object ID to doc_data
Doing git tree lookups based on the SHA-1 of the Message-ID is expensive as trees get larger, instead, use the SHA-1 object ID directly. This drastically reduces the amount of time spent in the "git cat-file --batch" process for fetching the /$INBOX/all.mbox.gz endpoint on the ~800MB git@vger.kernel.org mirror This retains backwards compatibility and allows existing indices to be transparently upgraded without performance degradation.
2016-07-09cleanup some unnecessary use/requires
Hopefully this can reduce memory overhead for people that use one-shot CGI.
2016-07-02inbox: base_url method takes PSGI env hashref instead
This is lighter and we can work further towards eliminating our Plack::Request dependency entirely.
2016-06-24mbox: reduce small packets for gzipped mboxes
We want to avoid sending 10 or 20-byte gzip headers as separate TCP packets to reduce syscalls and avoid wasting bandwidth.
2016-06-20feed: various object-orientation cleanups
Favor Inbox objects as our primary source of truth to simplify our code. This increases our coupling with PSGI to make it easier to write tests in the future. A lot of this code was originally designed to be usable standalone without PSGI or CGI at all; but that might increase development effort.
2016-06-20mbox: avoid write dependency for streaming
Prefer to return strings instead, so Content-Length can be calculated for caching and such.
2016-06-20mbox: remove feed dependency
We do not need feed options there (or anywhere, hopefully).
2016-06-19mbox: set gzip timestamp to the Unix epoch
This allows consistency between different invocations from roughly the same period and is no worse for caching any any of our existing HTML and Atom feeds. We cannot set the timestamp to the end date since messages may be added to the repository while we are iterating (and this streaming mechanism will pick them up).
2016-05-21mbox: switch generation over to pull model
This allows us to easily provide gigantic inboxes with proper backpressure handling for slow clients. It also eliminates public-inbox-httpd and Danga::Socket-specific knowledge from this class, making it easier to follow for those used to generic PSGI applications.
2016-05-15mbox: support /$INBOX/all.mbox.gz endpoint
Allows easily downloading the entire archive without special tools. In any case, it's not yet advertised to via HTML until we can test it better. It'll also support range queries in the future to avoid wasting bandwidth.
2016-05-15mbox: consistent header order when decompressed
This should make validating the output easier when testing between different servers.
2016-05-06mbox: sort messages by ascending date
This allows messages to be read in chronological order when read without a mail client (e.g. with "zcat t.mbox.gz | less")
2016-04-12mbox: do not clobber existing archive headers in WWW
When serving archives, it's more robust to keep existing archive links in one server goes down.
2016-04-11mbox: unconditionally add trailing newline
This may be necessary for compatibility with non-mboxrd aware parsers which expect "\nFrom " for everything but the first record.