about summary refs log tree commit homepage
path: root/lib/PublicInbox/ExtMsg.pm
DateCommit message (Collapse)
2022-09-02extmsg: shorten partial Message-IDs minimum to 14
Gnus seems to start Message-IDs with 10 random characters followed by ".fsf@$DOMAIN". In case of mis-linkification or mis-selection from stopping at the `@', ensuring the first 14 characters are accepted as a search parameter for the truncated Message-ID improves usability.
2021-10-01ds: simplify signalfd use
Since signalfd is often combined with our event loop, give it a convenient API and reduce the code duplication required to use it. EventLoop is replaced with ::event_loop to allow consistent parameter passing and avoid needlessly passing the package name on stack. We also avoid exporting SFD_NONBLOCK since it's the only flag we support. There's no sense in having the memory overhead of a constant function when it's in cold code.
2021-09-26extmsg: search_partial: use ->isrch if available
This allows us to avoid creating ibx->{search}->{xdb} at this spot by using an `undef' value. This is a step towards eliminating the innocuous "/path/to/inboxdir/xap15 has no shards" messages introduced in commit 63d7b8ce (daemons: revamp periodic cleanup task, 2021-09-23)
2021-09-02www: handle name-only publicinbox.*.url entries
Apparently URLs can be configured relatively for HTTP(S) setups, attempt to support them when linking to cross-posted messages. This also fixes the top-row (mirror/help/color/Atom feed) links in /$INBOX_URL/$EXTMSG_MSGID/T/ (and /t/) URLs. Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org> Link: https://public-inbox.org/meta/20210902191239.cmbxlmjqcsmdzmqp@meerkat.local/
2021-01-01update copyrights for 2021
Using "make update-copyrights" after setting GNULIB_PATH in my config.mak
2020-12-28search: remove {mset} option for ->mset method
The ->mset method always returns a Xapian mset nowadays, so naming a parameter {mset} is too confusing. As it does with MiscSearch, setting the {relevance} parameter to -1 now sorts by ascending docid order. -2 is now supported for descending docid order, too, since it may be useful for lei users.
2020-12-11extmsg: avoid exceptions when /all/$MSGID/ fails
If a message can't be found in ->ALL, we shouldn't attempt to enter code paths which iterate normal inboxes or attempt to access non-existent fields (e.g. {name}, {newsgroup}, {inboxdir}) in the ExtSearch object.
2020-12-09rename {pi_config} fields to {pi_cfg}
{pi_config} may be confused with the documented `PI_CONFIG' environment variable, and we'll favor vowel-removal to be consistent with our usage of object references. The `pi_' prefix may stay in some places, for now; since a separate namespace may come into this codebase for local/private client-tooling. For InboxIdle, we'll also remove an invalid comment about holding a reference to the PublicInbox::Config object, too.
2020-12-09treewide: replace {-inbox} with {ibx} for consistency
{ibx} is shorter and is the most prevalent abbreviation in indexing and IMAP code, and the `$ibx' local variable is already prevalent throughout. In general, the codebase favors removal of vowels in variable and field names to denote non-references (because references are "lighter" than non-references). So update WWW and Filter users to use the same code since it reduces confusion and may allow easier code sharing.
2020-12-05isearch: emulate per-inbox search with ->ALL
Using "eidx_key:" boolean prefix to limit results to a given inbox, we can use ->ALL to emulate and replace per-Inbox xap15/[0-9] search indices. With this change, the presence of "extindex.all.topdir" in the $PI_CONFIG will cause the WWW code to use that extindex and ignore per-inbox Xapian DBs in xap15/[0-9]. Unfortunately IMAP search still requires old per-inbox indices, for now. Mapping extindex Xapian docids to per-Inbox UIDs and vice-versa is proving tricky. Fortunately, IMAP search is rarely used and optional. The RFCs don't specify expensive phrase search, either, so `indexlevel=medium' can be used in per-inbox Xapian indices to save space. For primarily WWW (and future JMAP) users; this should result in significant disk space, FD, and page cache footprint savings for large instances with many inboxes and many cross-posted messages.
2020-12-05extmsg: use ->ALL for "global" MID lookups
As with NewsWWW and NNTP, we can use ->ALL to completely avoid trying SQLite/Xapian lookups across hundreds/thousands of inboxes.
2020-09-12treewide: avoid `goto &NAME' for tail recursion
While Perl implements tail recursion via `goto' which allows avoiding warnings on deep recursion. It doesn't (as of 5.28) optimize the speed of such dispatches, though it may reduce ephemeral memory usage. Make the code less alien to hackers coming from other languages by using normal subroutine dispatch. It's actually slightly faster in micro benchmarks due to the complexity of `goto &NAME'.
2020-09-10extmsg: prevent cross-inbox matches from hogging event loop
With many inboxes, checking multiple SQLite repos will be slow and time-consuming, so ensure we can schedule it fairly between multiple inboxes.
2020-09-10config: flatten each_inbox and iterate_start args
In Perl, we can simplify callers by passing a single array all the way down the stack instead of a single array ref which needs to be expanded every call.
2020-09-03search: replace ->query with ->mset
Nearly all of the search uses in the production code rely on a Xapian mset iterator being returned (instead of an array of $smsg objects). So default to returning the mset and move the burden of smsg array conversion into the test cases.
2020-08-20search: add mset_to_artnums method
We can avoid importing mdocid() in several places by using this method, simplifying callers.
2020-08-20extmsg: avoid using Xapian docdata
Once again, over.sqlite3 contains everything necessary for Message-ID resolution. Also, Xapian may be completely unnecessary with the advent of over.sqlite3, but that's for another time.
2020-06-03smsg: remove remaining accessor methods
We'll continue to favor simpler data models that can be used directly rather than wasting time and memory with accessor APIs. The ->from, ->to, -cc, ->mid, ->subject, >references methods can all be trivially replaced by hash lookups since all their values are stored in doc_data. Most remaining callers of those methods were test cases, anyways. ->from_name is only used in the PSGI code, so we can just use ->psgi_cull to take care of populating the {from_name} field.
2020-03-30wwwstream::oneshot => html_oneshot
And use Exporter to make our life easier, since WwwAltId was using a non-existent PublicInbox::WwwResponse namespace in error paths which doesn't get noticed by `perl -c' or exercised by tests on normal systems. Fixes: 6512b1245ebc6fe3 ("www: add endpoint to retrieve altid dumps")
2020-03-25extmsg: use WwwResponse::oneshot
No reason to use the ->getline interface for small responses.
2020-03-22rename PublicInbox::SearchMsg => PublicInbox::Smsg
Since the introduction of over.sqlite3, SearchMsg is not tied to our search functionality in any way, so stop confusing ourselves and future hackers by just calling it "PublicInbox::Smsg". Add a missing "use" in ExtMsg while we're at it.
2020-02-16view: escape ampersand in Message-IDs
We need to escape ampersands (and some other characters for href attributes), so introduce a `mid_href' sub to do just that. '<', '>' and '"' were always escaped, so there's no risk of tag or attribute injection, but creative Message-IDs could cause confusion for some parsers and generate invalid URLs. Start getting rid of the bloated, over-engineered OO Hval API while we're at it, I only noticed this bug because I started killing off Hval->new* callers.
2020-02-06treewide: run update-copyrights from gnulib for 2019
I didn't wait until September to do it, this year!
2020-01-25s/news.gmane.org/news.gmane.io/
gmane still has a NNTP server, so update links to point to it. cf. https://lars.ingebrigtsen.no/2020/01/06/whatever-happened-to-news-gmane-org/
2020-01-06treewide: "require" + "use" cleanup and docs
There's a bunch of leftover "require" and "use" statements we no longer need and can get rid of, along with some excessive imports via "use". IO::Handle usage isn't always obvious, so add comments describing why a package loads it. Along the same lines, document the tmpdir support as the reason we depend on File::Temp 0.19, even though every Perl 5.10.1+ user has it. While we're at it, favor "use" over "require", since it it gives us extra compile-time checking.
2020-01-06hval: export prurl and add prototype
This allows to do some compile-time checking and fills in a missing "use" in PublicInbox::NewsWWW, allowing it to be used standalone and independently of PublicInbox::WWW
2019-12-28search: retry_reopen passes user arg to callback
This allows callers to pass named (not anonymous) subs. Update all retry_reopen callers to use this feature, and fix some places where we failed to use retry_reopen :x
2019-12-27config: each_inbox: pass user arg to callback
Another place where we can replace anonymous subs with named subs by passing a user-supplied arg.
2019-10-09extmsg: drop unused $have_mm variable
We rely on Inbox::mm nowadays.
2019-09-09run update-copyrights from gnulib for 2019
2019-04-27extmsg: escape ampersands in @EXT_URL array
We already escape the user-provided Message-IDs (so there's no security problem AFAIK), but the URL templates which exist in our source code were not escaped properly. This quiets down tidy(1).
2019-01-20extmsg: don't bother partial matching with <16 chars
It's not worth it, and attempts to wildcard off single-character Message-IDs(*) causes Xapian to error out in unpredictable ways: something terrible happened at /usr/lib/x86_64-linux-gnu/perl5/5.24/Search/Xapian/Enquire.pm line 54. ...propagated at lib/PublicInbox/Search.pm line 209. So don't bother. (*) because people blindly hit 'y' or 'n' when git-send-email prompted them for In-Reply-To.
2018-04-22extmsg: use Xapian only for partial matches
"LIKE" in SQLite (and other SQL implementations I've seen) is expensive with nearly 3 million messages in the archives. This caused some partial Message-ID lookups to take over 600ms on my workstation (~300ms on a faster Xeon). Cut that to below under 30ms on average on my workstation by relying exclusively on Xapian for partial Message-ID lookups as we have in the past. Unlike in the past when we tried using Xapian to match partial Message-IDs; we now optimize our indexing of Message-IDs to break apart "words" in Message-IDs for searching, yielding (hopefully) "good enough" accuracy for folks who get long URLs broken across lines when copy+pasting. We'll also drop the (in retrospect) pointless stripping of "/[tTf]" suffixes for the partial match, since anybody who hits that codepath would be hitting an invalid message ID. Finally, limit wildcard expansion to prevent easy DoS vectors on short terms. And blame Pine and alpine for generating Message-IDs with low-entropy prefixes :P
2018-04-18Merge remote-tracking branch 'origin/master' into v2
* origin/master: nntp: allow and ignore empty commands mbox: do not barf on queries which return no results nntp: fix NEWNEWS command searchview: fix non-numeric comparison Allow specification of the number of search results to return githttpbackend: avoid infinite loop on generic PSGI servers http: fix modification of read-only value extmsg: use news.gmane.org for Message-ID lookups extmsg: rework partial MID matching to favor current inbox Update the installation instructions with Fedora package names nntp: do not drain rbuf if there is a command pending nntp: improve fairness during XOVER and similar commands searchidx: do not modify Xapian DB while iterating Don't use LIMIT in UPDATE statements
2018-04-18extmsg: remove expensive git path checks
Searching across different inboxes is expensive without SQLite (or Xapian) installed, so avoid doing expensive tree lookups in git. Since SQLite is required for Xapian support anyways, we won't need to check Xapian, either. Sites without SQLite installed will simply 404 if somebody requests a message which isn't in the current inbox.
2018-03-27extmsg: use news.gmane.org for Message-ID lookups
http://mid.gmane.org/ has not worked for a while, but their NNTP server continues to work. Use that and perhaps give NNTP more exposure. Reported-by: Jonathan Corbet <corbet@lwn.net>
2018-03-19extmsg: rework partial MID matching to favor current inbox
The current inbox is more important for partial Message-ID matching, so we try harder on that to fix common errors before moving onto other inboxes. Then, prevent expensive scanning of other inboxes by requiring a Message-ID length of at least 16 bytes. Finally, we limit the overall partial responses to 200 when scanning other inboxes to avoid excessive memory usage.
2018-03-19extmsg: rework partial MID matching to favor current inbox
The current inbox is more important for partial Message-ID matching, so we try harder on that to fix common errors before moving onto other inboxes. Then, prevent expensive scanning of other inboxes by requiring a Message-ID length of at least 16 bytes. Finally, we limit the overall partial responses to 200 when scanning other inboxes to avoid excessive memory usage.
2018-03-02search: revert to using 'Q' as a uniQue id per-Xapian conventions
'Q' is merely a convention in the Xapian world, and is close enough to unique for practical purposes, so stop using XMID and gain a little more term length as a result.
2018-02-16search: stop assuming Message-ID is unique
In general, they are, but there's no way for or general purpose mail server to enforce that. This is a step in allowing us to handle more corner cases which existing lists throw at us.
2018-02-16extmsg: fix broken Xapian MID lookup
This likely has no real world implications, though, as we fall back to Msgmap lookups anyways. Broken since commit 7eeadcb62729b0efbcb53cd9b7b181897c92cf9a ("search: remove unnecessary abstractions and functionality")
2018-02-07update copyrights for 2018
Using update-copyrights from gnulib While we're at it, use the SPDX identifier for AGPL-3.0+ to ease mechanical processing.
2017-03-22extmsg: use updated mail-archive.com URL
Apparently mid.mail-archive.com does not support HTTPS, and the HTTP version redirects to the search query, anyways.
2016-08-14www: do not unecessarily escape some chars in paths
Based on reading RFC 3986, it seems '@', ':', '!', '$', '&', "'", '; '(', ')', '*', '+', ',', ';', '=' are all allowed in path-absolute where we have the Message-ID. In any case, it seems '@' is fairly common in path components nowadays and too common in Message-IDs.
2016-08-13extmsg: reorder and add a more Message-ID lookup services
gmane is down at the moment, so lower that in priority (hopefully it will be brought back up, again). Wikipedia also lists a few more project-specific list providers, so include those as well: https://en.wikipedia.org/wiki/Message-ID
2016-07-17extmsg: favor user-provided URL on partial matches
While an inbox may have multiple URLs, we will favor the existing URL for the current inbox on partial matches to avoid confusing users or slowing them down by requiring a new TCP connection.
2016-07-09www: cleanup parameter passing
Reduce the size of hashes a bit and drops some unneeded hash lookups for uncommon paths.
2016-07-06extmsg: switch to wwwstream for partial match, too
Another step towards a consistent WWW UI...
2016-07-06extmsg: disable automatic inbox switching
Automatic inbox switching was a potentially deceptive pattern and surprises readers who do not check the URL bar closely. Furthermore, a message could be cross-posted to multiple lists, too.
2016-07-06hval: get rid of unused parameter for new_msgid
Exposing compressed Message-IDs in URLs was a mistake, remove a remnant of it.