path: root/lib/PublicInbox/Mbox.pm
Date Commit message
2019-05-15www: use Inbox->over where appropriate
We don't need to rely on Xapian search functionality for the majority of the WWW code. subject_normalized is moved to SearchMsg, where it (probably) makes more sense anyway.
2019-01-09doc: various overview-level module comments
Hopefully this helps people familiarize themselves with the source code.
2018-04-07psgi: ensure /$INBOX/$MESSAGE_ID/T/ endpoint is chronological
We only need to call get_thread beyond 1000 messages for fetching entire mboxes. It's probably too much for the HTML display otherwise.
2018-04-06www: favor reading more from SQLite, and less from Xapian
Favor simpler internal APIs this time around, this cuts a fair amount of code out and takes another step towards removing Xapian as a dependency for v2 repos.
2018-04-05mbox: do not sort search results
Sorting large msets is a waste when it comes to mboxes since MUAs should thread and sort them as the user desires. This forces us to rework each of the mbox download mechanisms to be more independent of each other, but might make things easier to reason about.
2018-04-03msgmap: replace id_batch with ids_after
id_batch had an overly complicated interface; replace it with ids_after, which is simpler and takes advantage of selectcol_arrayref in DBI. This allows simplification of callers, and the diffstat agrees with me.
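A minimal sketch of the idea behind an ids_after-style call, written in Python for illustration rather than the project's Perl. The table name `msgmap`, its columns, and the function signatures are assumptions; the point is only that resuming from the last seen number avoids OFFSET.

```python
import sqlite3

def ids_after(conn, last_num, limit=1000):
    """Return up to `limit` article numbers greater than `last_num`."""
    rows = conn.execute(
        "SELECT num FROM msgmap WHERE num > ? ORDER BY num ASC LIMIT ?",
        (last_num, limit),
    )
    return [num for (num,) in rows]

def all_ids(conn, batch=1000):
    """Walk the whole table in batches, resuming after the last number seen."""
    last = 0
    while True:
        ids = ids_after(conn, last, batch)
        if not ids:
            return
        yield from ids
        last = ids[-1]

# Hypothetical schema and data, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE msgmap (num INTEGER PRIMARY KEY, mid TEXT)")
conn.executemany(
    "INSERT INTO msgmap (num, mid) VALUES (?, ?)",
    [(n, f"{n}@example.com") for n in range(1, 11)],
)
```

Because each batch starts from an indexed key, the cost of a batch does not grow with how far into the table the caller already is.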
2018-04-03mbox: remove remaining OFFSET usage in SQLite
We can use id_batch in the common case to speed up full mbox retrievals. Gigantic msets are still a problem, but will be fixed in future commits.
2018-04-02www: rework query responses to avoid COUNT in SQLite
In many cases, we do not care about the total number of messages. It's a rather expensive operation in SQLite (Xapian only provides an estimate). For LKML, this brings top-level /$INBOX/ loading time from ~375ms to around 60ms on my system. Days ago, this operation was taking 800-900ms(!) for me before introducing the SQLite overview DB.
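One common way to page without a total count is to fetch one row more than the page size and use its presence as a "more results" flag. The sketch below is in Python and is not taken from the commit; the `overview` table and the `recent_page` function are hypothetical.

```python
import sqlite3

def recent_page(conn, limit):
    """Return (rows, has_more) for the newest `limit` messages, without COUNT."""
    rows = conn.execute(
        "SELECT num, subject FROM overview ORDER BY ts DESC LIMIT ?",
        (limit + 1,),  # one extra row tells us whether another page exists
    ).fetchall()
    return rows[:limit], len(rows) > limit

# Hypothetical schema and data, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE overview (num INTEGER PRIMARY KEY, ts INTEGER, subject TEXT)"
)
conn.executemany(
    "INSERT INTO overview VALUES (?, ?, ?)",
    [(n, 1000 + n, f"msg {n}") for n in range(1, 6)],
)
```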
2018-03-29mbox: avoid extracting Message-ID for linkification
We can avoid a small amount of overhead and use the "preferred" Message-ID based on what is in the SearchMsg object.
2018-03-29www: remove unnecessary ghost checks
We do not need to care about ghosts at multiple call sites; they cannot have a {blob} field and we've stored the blob field in Xapian since SCHEMA_VERSION=13.
2018-03-27view: permalink (per-message) view shows multiple messages
This needs tests and further refinement, but current tests pass.
2018-03-23www: $MESSAGE_ID/raw endpoint supports "duplicates"
Since v2 supports duplicate messages, we need to support looking up different messages with the same Message-Id. Fortunately, our "raw" endpoint has always been mboxrd, so users won't need to change their parsing tools.
2018-02-07update copyrights for 2018
Using update-copyrights from gnulib. While we're at it, use the SPDX identifier for AGPL-3.0+ to ease mechanical processing.
2017-12-01search: allow downloading search results as mbox
Allowing downloading of all search results as a gzipped mboxrd file can be convenient for some users.
2017-10-04mbox: support inline filename via Content-Disposition header
This is hopefully more sensical than "raw" files from resulting downloads.
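As a rough illustration in Python (not the project's Perl), a Content-Disposition value with an inline filename could be built as below. The sanitising rule, the `.txt` suffix, and the function name are assumptions made for this sketch.

```python
import re

def content_disposition(subject, fallback="no-subject"):
    """Build an inline Content-Disposition value from a message subject."""
    # Assumed rule: collapse anything outside a conservative set into "-".
    name = re.sub(r"[^A-Za-z0-9_.-]+", "-", subject).strip("-.") or fallback
    return f'inline; filename="{name}.txt"'
```

With `inline`, browsers may still display the text, but a "save as" picks up the suggested name instead of "raw".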
2017-06-23mbox: show application/mbox for obfuscated inboxes
Sigh, yet another place to handle obfuscation for misguided people who expect it. Maybe this will do something to prevent spammers from getting addresses, while still allowing the "curl $URL | git am" use case to work.
2016-12-10search: always sort thread results in ascending time order
This makes life easier for the threading algorithm, as we can use the implied ordering of timestamps to avoid temporary ghosts and resulting container vivification. This would've also allowed us to hide the bug (in most cases) fixed by the patch titled "thread: last Reference always wins", in case that needs to be reverted due to infinite looping.
2016-08-14www: do not unnecessarily escape some chars in paths
Based on reading RFC 3986, it seems '@', ':', '!', '$', '&', "'", '(', ')', '*', '+', ',', ';', '=' are all allowed in path-absolute where we have the Message-ID. In any case, it seems '@' is fairly common in path components nowadays and too common in Message-IDs.
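The effect can be sketched in Python with `urllib.parse.quote`, which leaves the characters in `safe` untouched. The name `mid_escape` here is a hypothetical helper for this example; the safe set is the RFC 3986 sub-delims plus ':' and '@'.

```python
from urllib.parse import quote

# RFC 3986 sub-delims plus ":" and "@", all permitted in a path segment.
PATH_SAFE = "!$&'()*+,;=:@"

def mid_escape(mid):
    """Percent-encode a Message-ID for use as one URL path component."""
    # "/" is deliberately absent from PATH_SAFE so it is still encoded.
    return quote(mid, safe=PATH_SAFE)
```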
2016-08-06mbox: be fair to other HTTP clients
At least for public-inbox-httpd, this allows us to avoid having a client monopolize one event loop tick of the server for too long. It hurts throughput for the /all.mbox.gz endpoint, but I doubt anybody cares and the latency improvement for other clients would be appreciated. We already do the same fairness thing for HTML pages.
2016-08-04searchmsg: add git object ID to doc_data
Doing git tree lookups based on the SHA-1 of the Message-ID is expensive as trees get larger; instead, use the SHA-1 object ID directly. This drastically reduces the amount of time spent in the "git cat-file --batch" process for fetching the /$INBOX/all.mbox.gz endpoint on the ~800MB git@vger.kernel.org mirror. This retains backwards compatibility and allows existing indices to be transparently upgraded without performance degradation.
2016-07-09cleanup some unnecessary use/requires
Hopefully this can reduce memory overhead for people that use one-shot CGI.
2016-07-02inbox: base_url method takes PSGI env hashref instead
This is lighter and we can work further towards eliminating our Plack::Request dependency entirely.
2016-06-24mbox: reduce small packets for gzipped mboxes
We want to avoid sending 10 or 20-byte gzip headers as separate TCP packets to reduce syscalls and avoid wasting bandwidth.
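A sketch of the buffering idea in Python, using `zlib` with gzip framing. The class name, the `min_size` threshold, and the method names are assumptions; the commit only states the goal of not emitting tiny gzip header chunks on their own.

```python
import zlib

class GzipBuffer:
    """Hold back compressed output until a reasonably sized chunk exists."""

    def __init__(self, min_size=8192):
        self._z = zlib.compressobj(wbits=31)  # wbits=31 selects gzip framing
        self._buf = bytearray()
        self._min = min_size

    def feed(self, data):
        """Compress data; return bytes only once enough has accumulated."""
        self._buf += self._z.compress(data)
        if len(self._buf) < self._min:
            return b""  # nothing worth a write() yet
        out, self._buf = bytes(self._buf), bytearray()
        return out

    def finish(self):
        """Flush the compressor and return everything still buffered."""
        out = bytes(self._buf) + self._z.flush()
        self._buf = bytearray()
        return out
```

A caller would skip the write when `feed` returns an empty string, so the gzip header travels together with the first real payload.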
2016-06-20feed: various object-orientation cleanups
Favor Inbox objects as our primary source of truth to simplify our code. This increases our coupling with PSGI to make it easier to write tests in the future. A lot of this code was originally designed to be usable standalone without PSGI or CGI at all; but that might increase development effort.
2016-06-20mbox: avoid write dependency for streaming
Prefer to return strings instead, so Content-Length can be calculated for caching and such.
2016-06-20mbox: remove feed dependency
We do not need feed options there (or anywhere, hopefully).
2016-06-19mbox: set gzip timestamp to the Unix epoch
This allows consistency between different invocations from roughly the same period and is no worse for caching than any of our existing HTML and Atom feeds. We cannot set the timestamp to the end date since messages may be added to the repository while we are iterating (and this streaming mechanism will pick them up).
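For illustration, Python's `gzip.GzipFile` exposes the same header field through its `mtime` argument. The helper name below is hypothetical; the sketch shows that a zero timestamp makes repeated runs over the same input byte-identical.

```python
import gzip
import io

def gzip_bytes(data):
    """Gzip `data` with the header timestamp fixed at the Unix epoch."""
    buf = io.BytesIO()
    # mtime=0 writes four zero bytes into the MTIME field of the gzip header.
    with gzip.GzipFile(fileobj=buf, mode="wb", mtime=0) as gz:
        gz.write(data)
    return buf.getvalue()
```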
2016-05-21mbox: switch generation over to pull model
This allows us to easily provide gigantic inboxes with proper backpressure handling for slow clients. It also eliminates public-inbox-httpd and Danga::Socket-specific knowledge from this class, making it easier to follow for those used to generic PSGI applications.
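The pull model can be sketched in Python as an iterator that the server advances only when it is ready to send more. This is an analogy to a PSGI body object, not the module's actual code; `MboxBody` and the `fetch` callable are hypothetical.

```python
class MboxBody:
    """Response body the server pulls from; one message per call."""

    def __init__(self, message_ids, fetch):
        self._ids = iter(message_ids)
        self._fetch = fetch  # hypothetical callable: message id -> bytes

    def __iter__(self):
        return self

    def __next__(self):
        # Nothing is loaded until the server asks for the next chunk;
        # StopIteration from the id iterator ends the response.
        mid = next(self._ids)
        return self._fetch(mid)
```

Since the server decides when to call `next`, a slow client simply causes fewer calls, and at most one message is held in memory at a time.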
2016-05-15mbox: support /$INBOX/all.mbox.gz endpoint
Allows easily downloading the entire archive without special tools. In any case, it's not yet advertised via HTML until we can test it better. It'll also support range queries in the future to avoid wasting bandwidth.
2016-05-15mbox: consistent header order when decompressed
This should make validating the output easier when testing between different servers.
2016-05-06mbox: sort messages by ascending date
This allows messages to be read in chronological order when read without a mail client (e.g. with "zcat t.mbox.gz | less")
2016-04-12mbox: do not clobber existing archive headers in WWW
When serving archives, it's more robust to keep existing archive links in case one server goes down.
2016-04-11mbox: unconditionally add trailing newline
This may be necessary for compatibility with non-mboxrd aware parsers which expect "\nFrom " for everything but the first record.
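A tiny Python sketch of why the unconditional newline helps; the function name and arguments are assumptions for the example.

```python
def mbox_record(from_line, message):
    """Join a "From " line and a raw message into one mbox record."""
    # The trailing newline is added even if `message` already ends with one,
    # so the next record's "From " is always preceded by "\n".
    return from_line + b"\n" + message + b"\n"
```

Concatenating records built this way guarantees that every record after the first begins at a `"\nFrom "` boundary.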
2015-12-22rename 'GitCatFile' package to 'Git'
We'll be using it for more than just cat-file. Adding a `popen' API for internal use allows us to save a bunch of code in other places.
2015-11-20various internal documentation updates
Hopefully this gives new hackers a better overview of how the components relate to each other.
2015-10-04mbox: generate Archived-At, List-Post, List-Archive headers
Downloaded mboxen can be archived/stored indefinitely; try to make it easy for future archaeologists to find the online archive location.
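As an illustration in Python, the three headers could be attached like this. The URL layout (`base_url` followed by the Message-ID and a slash) and the function name are assumptions; the angle-bracket forms follow the usual conventions for these headers.

```python
from email.message import EmailMessage

def add_archive_headers(msg, base_url, mid, list_addr):
    """Attach headers pointing back at the online archive (assumed URL layout)."""
    msg["Archived-At"] = f"<{base_url}{mid}/>"
    msg["List-Archive"] = f"<{base_url}>"
    msg["List-Post"] = f"<mailto:{list_addr}>"
    return msg

# Example values are placeholders, not real addresses.
msg = add_archive_headers(
    EmailMessage(), "https://example.com/inbox/", "abc@example.com",
    "inbox@example.com",
)
```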
2015-10-04mbox: kill Bytes meta-header, too
It may be present in messages imported from NNTP.
2015-09-30remove unnecessary fields usage
It doesn't actually give performance improvements unless we use types with "my", but we don't do that. We'll only continue using fields with Danga::Socket-derived classes where they're required.
2015-09-06update copyright headers and email addresses
In the future, it should be possible to use this:
    git ls-files | UPDATE_COPYRIGHT_HOLDER='all contributors' \
        UPDATE_COPYRIGHT_USE_INTERVALS=2 \
        xargs /path/to/gnulib/build-aux/update-copyright
2015-09-03get rid of Message-ID compression entirely
Provide a fallback for legacy SHA-1 messages, but do not advertise shorter URLs anymore for data portability concerns. This fixes a regression introduced in commit 81a9c1b476987d845b340ab9013d26cf4487cb9a ("search: disable Message-ID compression in Xapian") which ended up breaking thread-related endpoints for large Message-IDs, as lookups on the SHA-1 message no longer worked.
2015-08-26mbox: close file handle for single mbox
This doesn't seem needed for actual server use, but Plack tests complain about it.
2015-08-25mid: mid_compressed => mid_compress
Consistently name mid_* functions as verbs.
2015-08-23cleanup calls to header_obj
Dereference header_obj only once when performance may be critical, or simplify our code by calling "header" directly on the Email::{Simple,MIME} object if not.
2015-08-23mbox: clarify our use of the mboxrd variant
Commenting it in the From: line seems appropriate and reduces compatibility problems in case a MUA cannot handle trailing comments after the timestamp.
2015-08-23mbox: use mboxrd quoting rules
This redundantly quotes lines that already start with ">From " to prevent losing information, as described by qmail.
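The mboxrd rule is that any line consisting of zero or more ">" followed by "From " gains one more ">", which makes the quoting reversible. A minimal Python sketch, with hypothetical function names:

```python
import re

# Matches "From " at the start of a line, with any number of leading ">".
_FROM_RE = re.compile(rb"^(>*From )", re.MULTILINE)

def mboxrd_quote(body):
    """Prefix every (possibly already quoted) "From " line with one ">"."""
    return _FROM_RE.sub(rb">\1", body)

def mboxrd_unquote(body):
    """Strip exactly one level of ">" quoting from "From " lines."""
    return re.sub(rb"^>(>*From )", rb"\1", body, flags=re.MULTILINE)
```

Because already-quoted lines are quoted again, unquoting restores the original text exactly, which is not the case when only bare "From " lines are escaped.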
2015-08-23.txt links return an mbox instead
This improves compatibility and allows individual messages to be concatenated into an existing mbox without further modifications. "git format-patch" does something similar (but does not do "From " line escaping(!))
2015-08-22mbox: support uncompressed mbox
Some folks may want to view the mbox inline as a string of raw text, when guessing URLs. Let them do this...
2015-08-22stream HTML views as much as possible
This should allow progressive rendering on the client and reduce memory usage on the server. Unfortunately XML::Atom::SimpleFeed does not yet support streaming, so we may not use it in the future.
2015-08-21mbox: drop unnecessary imports
These are no longer necessary.
2015-08-21switch to gzipped mboxes
Mboxes may be huge, so only support downloading gzipped mboxes to save bandwidth and to get free checksumming. Streaming output means we should not be wasting too much memory on this unless the chosen server sucks.