about summary refs log tree commit homepage
path: root/lib/PublicInbox/Mbox.pm
DateCommit message (Collapse)
2023-10-11treewide: consolidate "From " line removal
Aside from our prior import bugs (fixed in a0c07cba0e5d8b6a (mda: drop leading "From " lines again, 2016-06-26)), we'll always have to be dealing with mutt piping messages to us and `git format-patch' output. So just share the regexp so we can use it everywhere. In may be desirable to allow importing messages with a leading "From " line for FUSE, even. Additionally, some instances of this regexp needlessly added optional `\r?' (CR) checks ahead of the `\n' (LF) element; but they're pointless anyways since [^\n]* is enough to exclude all non-LF bytes.
2023-06-17www: use correct threadid for per-thread search
For individual public-inboxes relying on extindex for per-inbox search, we must use the threadid from the extindex over.sqlite3 rather than the per-inbox over.sqlite3 file. Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org> Link: https://public-inbox.org/meta/20230616-rudy-comedy-vision-2b9f92@meerkat/
2023-03-31www: support POST /$INBOX/$MSGID/?x=m&q=
This allows filtering the contents of any existing thread using a search query. It uses the existing THREADID column in Xapian so we can internally add a Xapian OP_FILTER to the results. This new functionality is orthogonal to the existing `t=1' parameter which gives mairix-style thread expansion. It doesn't make sense to use `t=1' with this functionality, but it's not disallowed, either. The indentation change in Over->next_by_mid is to ensure DBI->prepare_cached can share across both ->next_by_mid and ->mid2tid. I also noticed the existing regex for `POST /$INBOX/?x=m&q=' was allowing extra characters. With an added \z, it's now as strict was originally intended and AFAIK nothing was generating invalid URLs for it Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org> Link: https://public-inbox.org/meta/aaniyhk7wfm4e6m5mbukcrhevzoc6ftctyrfwvmz4fkykwwtlj@mverfng6ytas/T/
2022-09-10mbox*: use multi-arg ->translate and ->write
No need to make multiple method calls from here, now that ->translate and GzipFilter->write both support multiple args.
2022-09-10www: use PerlIO::scalar (zfh) for buffering
Calling Compress::Raw::Zlib::deflate is fairly expensive. Relying on the `.=' (concat) operator inside ->zadd operator is faster, but the method dispatch overhead is noticeable compared to the original code where we had bare `.=' littered throughout. Fortunately, `print' and `say' with the PerlIO::scalar IO layer appears to offer better performance without high method dispatch overhead. This doesn't allow us to save as much memory as I originally hoped, but does allow us to rely less on concat operators in other places and just pass a list of args to `print' and `say' as a appropriate. This does reduce scratchpad use, however, allowing for large memory savings, and we still ->deflate every single $eml.
2022-09-10www: switch to zadd for the majority of buffering
This allows us to focus string concatenations in one place to allow Perl internal scratchpad optimizations to reuse memory. Calling Compress::Raw::Zlib::deflate repeatedly proves too expensive in terms of CPU cycles.
2022-08-20www: mbox* drop unneeded {base_url} memoizations
That field is not needed since List-* and Archived-At headers are no longer appended as of commit: 1bf653ad139bf7bb (nntp+www: drop List-* and Archived-At headers, 2020-12-10)
2022-08-03www: simplify GzipFilter->zflush callers
->zflush can take a buffer arg, so there's no need to make a separate call to ->translate in some cases.
2021-10-25www: $MSGID/raw: set charset in HTTP response
By using the charset specified in the message, web browsers are more likely to display the raw text properly for human readers. Inspired by a patch by Thomas Weißschuh: https://public-inbox.org/meta/20211024214337.161779-3-thomas@t-8ch.de/ Cc: Thomas Weißschuh <thomas@t-8ch.de>
2021-10-25gzip_filter: delay async wcb call
This will let us modify the response header later to set a proper charset for Content-Type when displaying raw messages. Cc: Thomas Weißschuh <thomas@t-8ch.de>
2021-09-29www: do not bump {over} refcnt on long responses
SQLite files may be replaced or removed by admins while generating a large threads or mailbox responses. Ensure we don't hold onto DBI handles and associated file descriptors past their cleanup.
2021-09-28www+httpd: lower priority of large mbox downloads
While each git blob request is treated fairly w.r.t other git blob requests, responses triggering thousands of git blob requests can still noticeably increase latency for less-expensive responses. Move large mbox results and the nasty all.mbox endpoint to a low priority queue which only fires once per-event loop iteration. This reduces the response time of short HTTP responses while many gigantic mboxes are being downloaded simultaneously, but still maximizes use of available I/O when there's no inexpensive HTTP responses happening. This only affects PublicInbox::WWW users who use public-inbox-httpd, not generic PSGI servers.
2021-08-28move ->ids_after from mm to over
Since we favor ->over in WWW and IMAP, move this method to ->over to reduce open files in common cases. This fixes the /$EXTINDEX_NAME/all.mbox.gz endpoint for extindex entries (which may get expensive...).
2021-02-11search: use git approxidate in WWW and "lei q --stdin"
This greatly improves the usability of d:, dt:, and rt: search prefixes for users already familiar git's "approxidate" feature. That is, users familiar with the --(since|after|until|before)= options in git-log(1) and similar commands will be able to use those dates in the WWW UI.
2021-02-09www: stream mboxrd in descending docid order
Order doesn't matter when users are completely downloading mboxrds onto the FS and then opening them with an MUA. The MUA is expected to sort the results in the user's preferred order. However, lei can start streaming the results to its destination Maildir (or eventually IMAP/JMAP mailbox) with an MUA already open. This will let users see recent results sooner in their MUA, as those tend to have a higher docid. This matches the behavior of the HTML results, as well. As a bonus, this is around ~5% faster in a one-off, informal test case with 66k results. I expect this to hold true in all all cases since git has always optimized storage to favor recent objects.
2021-02-07lei: replace --thread with --threads
Nobody is expected to use long options, but for consistency with mairix(1), we'll use the pluralized option throughout (including existing PublicInbox::{Search,SearchView}). Link: https://public-inbox.org/meta/20210206090119.GA14519@dcvr/
2021-01-01update copyrights for 2021
Using "make update-copyrights" after setting GNULIB_PATH in my config.mak
2020-12-28search: remove {mset} option for ->mset method
The ->mset method always returns a Xapian mset nowadays, so naming a parameter {mset} is too confusing. As it does with MiscSearch, setting the {relevance} parameter to -1 now sorts by ascending docid order. -2 is now supported for descending docid order, too, since it may be useful for lei users.
2020-12-11nntp+www: drop List-* and Archived-At headers
These headers can conflict with headers in the DKIM signature; and parsing the DKIM-Signature header to determine whether or not we can safely add a header would be more code and CPU cycles. Since IMAP seems fine without these headers (and JMAP will likely be, too), there's likely no need to continue appending these to every message. Nowadays, developers seem sufficiently trained to use URLs with Message-IDs in them. So drop the headers and save some cycles and bandwidth all around.
2020-12-10www+nntp: deal with lack of addresses for ->ALL
Since extindex is an amalgamation of several inboxes, discerning an appropriate address for List-Post: would be expensive and most likely unnecessary. Some legacy/historical inboxes may have no active address, either, so don't attempt to set the List-Post header if no addresses are configured.
2020-12-09treewide: replace {-inbox} with {ibx} for consistency
{ibx} is shorter and is the most prevalent abbreviation in indexing and IMAP code, and the `$ibx' local variable is already prevalent throughout. In general, the codebase favors removal of vowels in variable and field names to denote non-references (because references are "lighter" than non-references). So update WWW and Filter users to use the same code since it reduces confusion and may allow easier code sharing.
2020-12-05imap: support isearch and reduce Xapian queries
Since IMAP search (either with Isearch or traditional per-Inbox search) only returns UIDs, we can safely set the limit to the UID slice size(*). With isearch, we can also trust the Xapian result to fit any docid range we specify. Limiting Xapian results to 1000 was making ->ALL docid <=> per-Inbox UID impossible since results could overlap between ranges unpredictably. Finally, we can map the ->ALL docids into per-Inbox UIDs and show them to the client in the UID order of the Inbox, not the docid order of the ->ALL extindex. This also lets us get rid of the "uid:" query parser prefix and use the Xapian::Query API directly to reduce our search prefix footprint. For mbox.gz downloads in WWW, we'll also make a best effort to preserve the order from the Inbox, not the order of extindex; though it's possible large result sets can have non-overlapping windows. (*) by definition, UID slice size is a "safe" value which shouldn't OOM either the server or clients.
2020-12-05isearch: emulate per-inbox search with ->ALL
Using "eidx_key:" boolean prefix to limit results to a given inbox, we can use ->ALL to emulate and replace per-Inbox xap15/[0-9] search indices. With this change, the presence of "extindex.all.topdir" in the $PI_CONFIG will cause the WWW code to use that extindex and ignore per-inbox Xapian DBs in xap15/[0-9]. Unfortunately IMAP search still requires old per-inbox indices, for now. Mapping extindex Xapian docids to per-Inbox UIDs and vice-versa is proving tricky. Fortunately, IMAP search is rarely used and optional. The RFCs don't specify expensive phrase search, either, so `indexlevel=medium' can be used in per-inbox Xapian indices to save space. For primarily WWW (and future JMAP) users; this should result in significant disk space, FD, and page cache footprint savings for large instances with many inboxes and many cross-posted messages.
2020-12-05inbox: simplify ->search and callers
Stop leaking WWW/PSGI-specific logic into classes like PublicInbox::Inbox, which is used universally. We'll also decouple $ibx->over from $ibx->search and just deal with duplicate the code inside ->over to reduce argument complexity in ->search. This is also a step in moving away from using {psgi.errors} to ease code sharing between IMAP, NNTP, and command-line interfaces. Perl's built-in `warn' and `local $SIG{__WARN__}' provides all the flexibility we need to control warning output and should be universally understood by Perl hackers who may be unfamiliar with PSGI.
2020-09-03search: replace ->query with ->mset
Nearly all of the search uses in the production code rely on a Xapian mset iterator being returned (instead of an array of $smsg objects). So default to returning the mset and move the burden of smsg array conversion into the test cases.
2020-08-23mbox: disable "&t" on existing Xapian until full reindex
Expanding threads via over.sqlite3 for mbox.gz downloads without Xapian effectively collapsing on the THREADID column leads to repeated messages getting downloaded. To avoid that situation, use a "has_threadid" Xapian metadata flag that's only set on --reindex (and brand new Xapian DBs). This allows admins to upgrade WWW or do --reindex in any order; without worrying about users eating up bandwidth and CPU cycles.
2020-08-23search: support downloading mboxes results with full thread
Finally, the addition of THREADID for collapsing results in Xapian lets us emulate the "mairix --threads" feature. That is, instead of returning only the matching messages, the entire thread is included in the downloaded mbox.gz This requires a "public-inbox-index --reindex" to be usable.
2020-08-20search: add mset_to_artnums method
We can avoid importing mdocid() in several places by using this method, simplifying callers.
2020-08-20mbox: avoid Xapian docdata in search results
Another place where we can reduce kernel page cache overhead by hitting over.sqlite3 instead of docdata.glass.
2020-08-20www: reduce long-lived PublicInbox::Search references
While this is unlikely to be a problem in current practice, keeping Xapian DBs open for long responses can interfere with free space recovery after -compact. In the future, it will interfere with inbox search grouping and lead to unexpected results.
2020-08-01www: rework async_* to use method table
Although the ->async_next method does not take $self as a receiver, but rather a PublicInbox::HTTP object, we may still retrieve it to be called with the HTTP object via UNIVERSAL->can.
2020-07-10hval: to_filename: return `undef' instead of empty string
Returning an empty string for a filename makes no sense, so instead return `undef' so the caller can setup a fallback using the "//" operator. This fixes uninitialized variable warnings because split() on an empty string returns `undef', which caused to_filename to warn on s// and tr// ops.
2020-07-06www: update internal docs
We no longer favor getline+close for streaming PSGI responses when using public-inbox-httpd. We still support it for other PSGI servers, though.
2020-07-06www: start making gzipfilter the parent response class
Virtually all of our responses are going to be gzipped, anyways. This will allow us to utilize zlib as a buffering layer and share common code for async blob retrieval responses. To streamline this and allow GzipFilter to be a parent class, we'll replace the NoopFilter with a similar CompressNoop class which emulates the two Compress::Raw::Zlib::Deflate methods we use. This drops a bunch of redundant code and will hopefully make upcoming WwwStream changes easier to reason about.
2020-07-06mbox: async blob fetch for "single message" raw mboxrd
This restores gzip-by-default behavior for /$INBOX/$MSGID/raw endpoints for all indexed inboxes. Unindexed v1 inboxes will remain uncompressed, for now.
2020-07-06mbox: remove html_oneshot import
It's no longer needed, we no longer show a runtime error for zlib being missing, as zlib is a hard requirement. Fixes: a318e758129d616b ("make zlib-related modules a hard dependency")
2020-06-03smsg: remove remaining accessor methods
We'll continue to favor simpler data models that can be used directly rather than wasting time and memory with accessor APIs. The ->from, ->to, -cc, ->mid, ->subject, >references methods can all be trivially replaced by hash lookups since all their values are stored in doc_data. Most remaining callers of those methods were test cases, anyways. ->from_name is only used in the PSGI code, so we can just use ->psgi_cull to take care of populating the {from_name} field.
2020-05-09switch read-only Email::Simple users to Eml
Since PublicInbox::Eml doesn't parse MIME subparts up front, it can replace most uses of Email::Simple without performance penalty. This will eventually allow us to lower overall internal API footprint by not having to keep the MIME vs Simple distinction.
2020-04-22make zlib-related modules a hard dependency
This allows us to simplify some of our existing code and make future changes easier. I doubt anybody goes through the trouble to have a Perl installation without zlib support. The zlib source code is even bundled with Perl since 5.9.3 for systems without existing zlib development headers and libraries. Of course, zlib is also a requirement of git, too; and we're not going to stop using git :) [squashed: "wwwaltid: use gzipfilter up front"]
2020-04-19reduce scope of mbox From_ line removal
It's unnecessary overhead for anything which does Email::MIME parsing. It was never done for v2 indexing, even though v1->v2 conversions did NOT remove those From_ lines. There was never a need to remote From_ lines the v1 SearchIdx paths, either. Hitting a /$INBOX_URL/$MSGID/T/ endpoint with an 18 message thread reveals a ~0.5% speed improvement. This will become more apparent when we have a faster MIME parser.
2020-04-19mbox: use per-message line-ending for From_ line
Email::Simple preserves the message line ending in headers, so make the From_ line consistent with the rest of the headers.
2020-04-05mbox: halve ->getline "context switches"
We don't need to take extra trips through the event loop for a single message (in the common case of Message-IDs being unique). In fact, holding the body reference left behind by Email::Simple could be harmful to memory usage, though in practice it's not a big problem since code paths which use Email::MIME take far more.
2020-03-30wwwstream::oneshot => html_oneshot
And use Exporter to make our life easier, since WwwAltId was using a non-existent PublicInbox::WwwResponse namespace in error paths which doesn't get noticed by `perl -c' or exercised by tests on normal systems. Fixes: 6512b1245ebc6fe3 ("www: add endpoint to retrieve altid dumps")
2020-03-25mbox: need_gzip uses WwwStream::oneshot
This makes the error page more consistent. Not that it really matters since Compress::Raw::Zlib and IO::Compress packages have been distributed with Perl since 5.10.x. Of course, zlib itself is also a dependency of git.
2020-03-22rename PublicInbox::SearchMsg => PublicInbox::Smsg
Since the introduction of over.sqlite3, SearchMsg is not tied to our search functionality in any way, so stop confusing ourselves and future hackers by just calling it "PublicInbox::Smsg". Add a missing "use" in ExtMsg while we're at it.
2020-02-06treewide: run update-copyrights from gnulib for 2019
I didn't wait until September to do it, this year!
2020-01-25mbox: handle empty subjects after dropping "Re:" prefix
We can't pass empty strings to `to_filename' without triggering warnings, and `to_filename' on an empty string makes no sense.
2019-12-28search: retry_reopen passes user arg to callback
This allows callers to pass named (not anonymous) subs. Update all retry_reopen callers to use this feature, and fix some places where we failed to use retry_reopen :x
2019-12-27mboxgz: pass $ctx to callback to avoid anon subs
Another place where we can rid ourselves of most anonymous subs by passing the $ctx arg to the callback.
2019-12-12mbox: do not try to create filename from empty string
This was causing warnings to pop up in syslogs for messages with empty Subject headers.