path: root/lib/PublicInbox/WWW.pm
Date Commit message
2020-12-05 isearch: emulate per-inbox search with ->ALL
Using the "eidx_key:" boolean prefix to limit results to a given inbox, we can use ->ALL to emulate and replace per-Inbox xap15/[0-9] search indices. With this change, the presence of "extindex.all.topdir" in the $PI_CONFIG will cause the WWW code to use that extindex and ignore per-inbox Xapian DBs in xap15/[0-9]. Unfortunately, IMAP search still requires old per-inbox indices, for now. Mapping extindex Xapian docids to per-Inbox UIDs and vice-versa is proving tricky. Fortunately, IMAP search is rarely used and optional. The RFCs don't specify expensive phrase search, either, so `indexlevel=medium' can be used in per-inbox Xapian indices to save space. For primarily WWW (and future JMAP) users, this should result in significant disk space, FD, and page cache footprint savings for large instances with many inboxes and many cross-posted messages.
2020-11-07 extsearch: wire up remaining Inbox-like methods for WWW
This lets us pretend an ExtSearch object is an Inbox object in most of the existing WWW code.
2020-09-16 treewide: relax allow >=40 chars for git OID
This will help with eventual git SHA-256 transitions.
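A minimal sketch of the relaxed check, written in Python for illustration rather than the project's Perl. The pattern and the helper name `looks_like_oid` are assumptions; the commit only says that 40 or more characters are now allowed, which covers both 40-digit SHA-1 and 64-digit SHA-256 hex object IDs.

```python
import re

# Assumed pattern: at least 40 lowercase hex digits, with no upper bound,
# so both SHA-1 (40) and SHA-256 (64) object IDs are accepted.
OID_RE = re.compile(r"[a-f0-9]{40,}")


def looks_like_oid(s: str) -> bool:
    """Return True if the whole string is 40 or more lowercase hex digits."""
    return OID_RE.fullmatch(s) is not None
```

Leaving the upper bound open avoids hard-coding a second length when the hash algorithm changes.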
2020-09-10 wwwlisting: avoid hogging event loop
By using the just-introduced ConfigIter class. And make ManifestJsGz a subclass of it to reduce duplication.
2020-09-10 www: manifest.js.gz generation no longer hogs event loop
It's still as slow as before with hundreds/thousands of inboxes, but at least it's fair. Future changes will allow it to be cached and memoized with persistent HTTP servers.
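The fairness idea in the two entries above can be sketched generically. This is not the ConfigIter interface, which the log does not show; `each_inbox` and `run_round_robin` are hypothetical names, and Python generators stand in for whatever the event loop actually schedules. Each task handles one inbox and then yields, so a long listing cannot monopolize the loop.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List


def each_inbox(names: Iterable[str],
               visit: Callable[[str], None]) -> Iterator[None]:
    """Process one inbox per step, yielding control after each one."""
    for name in names:
        visit(name)
        yield  # hand control back to the scheduler


def run_round_robin(tasks: List[Iterator[None]]) -> None:
    """Advance each task one step at a time until all are exhausted."""
    queue: Deque[Iterator[None]] = deque(tasks)
    while queue:
        task = queue.popleft()
        try:
            next(task)
        except StopIteration:
            continue  # task finished, do not requeue it
        queue.append(task)
```

The total work is unchanged, which matches the note that generation is "still as slow as before"; only the interleaving differs.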
2020-07-10 viewvcs: allow "0" as a path name
This means we need to filter out "" from query parameters. While we're at it, update comments for the WWW endpoint.
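In Perl, both the empty string and the string "0" evaluate as false, so a plain truthiness test cannot tell a missing parameter from a path named "0". A rough illustration of the explicit comparison, in Python and with a hypothetical helper name and parameter names:

```python
from typing import Dict


def drop_empty_params(params: Dict[str, str]) -> Dict[str, str]:
    """Remove parameters whose value is exactly "", keeping "0"."""
    return {key: value for key, value in params.items() if value != ""}
```

For example, `drop_empty_params({"b": "0", "q": ""})` keeps `b` and drops `q`.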
2020-07-06 www: need: use WwwStream::html_oneshot
It'll give us a nicer HTML header and footer.
2020-05-09 replace most uses of PublicInbox::MIME with Eml
PublicInbox::Eml has enough functionality to replace the Email::MIME-based PublicInbox::MIME.
2020-05-09 switch read-only Email::Simple users to Eml
Since PublicInbox::Eml doesn't parse MIME subparts up front, it can replace most uses of Email::Simple without performance penalty. This will eventually allow us to lower overall internal API footprint by not having to keep the MIME vs Simple distinction.
2020-05-09 www: preload: load all encodings at startup
Encode lazy-loads encodings on an as-needed basis. This is great for short-lived programs, but leads to fragmentation in long-lived daemons where immortal allocations can get interleaved with short-lived, per-request allocations. Since we have no idea which encodings will be needed when there's a constant flow of incoming mail, just preload everything available at startup.
2020-03-26 wwwaltid: inform users to use POST instead of GET
Seeing the example config linkified, some users may inevitably try to follow it in a browser with a GET request. Provide a helpful message to inform users to use POST instead of attempting to treat /$INBOX/$ALTID.sql.gz as a Message-Id.
2020-03-26 inbox: altid_map becomes a method
We want to be able to preload that, as well as to access it in WwwText for a config comment in the config example.
2020-03-25 www: add endpoint to retrieve altid dumps
This ensures all our indexed data, including data from altid searches (e.g. "gmane:$ARTNUM") is retrievable. It uses a "POST" request to avoid wasting cycles when invoked by crawlers, since it could potentially be several megabytes of data not indexable by search engines.
2020-03-20 daemon: do more immortal allocations up front
Doing immortal allocations late can cause those allocations to end up in places where it fragments the heap. So do more things up front for long-lived daemons.
2020-03-20 www: update ->preload for newer modules
We'll also avoid explicitly loading standard library modules like POSIX and Digest::SHA, here; instead we load our own modules and let those load whatever non-PublicInbox:: modules they need.
2020-02-06 treewide: run update-copyrights from gnulib for 2019
I didn't wait until September to do it, this year!
2020-02-04 www: serve $INBOX_DIR/description as $INBOX_URL/description
Instead of serving $INBOX_DIR/all.git/description, since $INBOX_DIR/all.git/description is not described in the default message when it's missing.
2020-02-04 www: stricter regexp for 405 errors
We want to match "GET" and "HEAD" exactly, not requests which start with "GET" or end with "HEAD". This doesn't seem like a real problem for public-inboxes which are actually public data anyways.
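The symptom described above ("start with GET or end with HEAD") is what an ungrouped alternation produces, since `^GET|HEAD$` parses as `(^GET)|(HEAD$)`. The following Python sketch shows that pitfall and an anchored, grouped alternative; the old pattern is inferred from the description and both pattern names are assumptions, not the project's code.

```python
import re

# Assumed shape of the loose pattern: the alternation binds more loosely
# than the anchors, so this means "starts with GET" OR "ends with HEAD".
LOOSE = re.compile(r"^GET|HEAD$")

# Grouping the alternation makes both anchors apply to each branch.
STRICT = re.compile(r"\A(?:GET|HEAD)\Z")


def is_read_method(method: str) -> bool:
    """Return True only for exactly "GET" or exactly "HEAD"."""
    return STRICT.match(method) is not None
```

With these definitions, `LOOSE.search("GETX")` and `LOOSE.search("XHEAD")` both succeed, while `is_read_method` rejects both.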
2020-01-01 wwwstatic: add directory listing + index.html support
It's now possible to use WwwStatic as a standalone PSGI app to serve static files and recreate the award-winning web design of https://public-inbox.org/ :>
2020-01-01 wwwstatic: move r(...) functions here
Remove redundant "r" functions for generating short error responses. These responses will no longer be cached by clients, which is probably a good thing since most errors ought to be transient, anyways. This also fixes error responses for our cgit wrapper when static files are missing.
2020-01-01 www: move more logic into path_info_raw
It'll be easier to reuse in future code.
2019-12-27 www: lazy load Plack::Util
cgit users won't need Plack::Util, here.
2019-10-22 www: remove unused ctx_get sub
This hasn't been used since commit 48b21cb662c1e17b7 in 2016: ("declare Inbox object for reusability")
2019-09-09 run update-copyrights from gnulib for 2019
2019-06-14 rename reference to git epochs as "partitions"
Try to remain consistent with our own documentation regarding v2 git "epochs", first.
2019-06-09 www: support $INBOX/git/$EPOCH.git for v2 cloning
And use it in manifest.js. To ease maintaining mirrors with grokmirror(1), we can accept a "git/" directory prefix before the epoch, and ".git" suffix after the epoch number. We maintain compatibility with "$INBOX/$EPOCH" cloning, of course, and it's still easier-to-type on the command-line.
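As a rough sketch of accepting both URL forms, the Python below parses "/$INBOX/git/$EPOCH.git" and "/$INBOX/$EPOCH" into the same result. The function name, the group names, and the character class used for inbox names are assumptions for illustration; the real routing lives in the Perl WWW code.

```python
import re
from typing import Optional, Tuple

# Assumed inbox-name characters; the project's actual rules are stricter
# and are not reproduced here.
EPOCH_PATH = re.compile(
    r"\A/(?P<inbox>[A-Za-z0-9_.-]+)/"
    r"(?:git/(?P<prefixed>[0-9]+)\.git|(?P<bare>[0-9]+))\Z"
)


def parse_epoch_path(path: str) -> Optional[Tuple[str, int]]:
    """Return (inbox, epoch) for either accepted form, else None."""
    m = EPOCH_PATH.match(path)
    if m is None:
        return None
    epoch = m.group("prefixed")
    if epoch is None:
        epoch = m.group("bare")
    return m.group("inbox"), int(epoch)
```

Both `parse_epoch_path("/meta/git/0.git")` and `parse_epoch_path("/meta/0")` yield `("meta", 0)`, so the grokmirror-friendly form and the short command-line form resolve to the same epoch.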
2019-06-09 www: wire up /$INBOX/manifest.js.gz, too
I can imagine myself just wanting to clone a single v2 inbox and all its epochs without thinking about include/exclude rules in a grokmirror config file.
2019-06-09 wwwlisting: generate grokmirror-compatible manifest.js.gz
Support on-demand generation of "/manifest.js.gz" for inboxes. By default, this matches inboxes with URLs matching the given request hostname. This makes it easier to create full mirrors of several inboxes without needing to configure static file serving. cf. https://git.kernel.org/pub/scm/utils/grokmirror/grokmirror.git
2019-06-04 www: require ASCII word characters for CSS filenames
Allowing admins to set non-ASCII CSS filenames could cause unnecessary problems for clients and proxies.
2019-06-04 www: require ASCII range for mbox downloads
We do not support many mboxrd download range specifications at the moment; but parsing non-ASCII characters isn't planned. This makes no difference aside from being able to return 404 slightly earlier than we would've in the past.
2019-06-04 www: require ASCII digit for git epoch
Don't inadvertently serve git repos containing non-ASCII digit characters.
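The underlying pitfall is that `\d` can match decimal digits outside ASCII when applied to Unicode strings, in Perl as well as in Python. A small Python illustration, with hypothetical helper names, contrasting `\d` with an explicit `[0-9]` class:

```python
import re


def is_ascii_epoch(s: str) -> bool:
    """Accept only one or more ASCII digits 0-9."""
    return re.fullmatch(r"[0-9]+", s) is not None


def is_unicode_digits(s: str) -> bool:
    r"""Accept any Unicode decimal digits, which is what \d matches on str."""
    return re.fullmatch(r"\d+", s) is not None
```

Here `is_unicode_digits("\u0661\u0662")` (Arabic-Indic digits) is true while `is_ascii_epoch` rejects the same string, which is the distinction the ASCII-only commits in this series rely on.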
2019-06-04 www: require ASCII filenames in git blob downloads
Our Hval::to_filename sub has always been strict about emitting ASCII-only characters for ViewVCS "raw" links. However, somebody could manually generate a filename with non-ASCII words for somebody else to download (we have no cheap and fast way of mapping filenames back to blobs for validation).
2019-06-04 www: only emit ASCII chars in attachment filenames
We don't want to emit funky URLs which can be lost in translation or cause problems with non-Unicode-aware clients. Then, don't accept non-ASCII filenames in URLs, since a manually-generated URL/filename in attachment downloads could be used for Unicode homographs to confuse folks who download the attachment.
2019-05-21 Merge remote-tracking branch 'origin/xap-optional' into master
* origin/xap-optional:
  admin: improve warnings and errors for missing modules
  searchidx: do not create empty Xapian partitions for basic
  lazy load Xapian and make it optional for v2
  www: use Inbox->over where appropriate
  nntp: use Inbox->over directly
  inbox: add ->over method to ease access
2019-05-16 www: unescape '+' => ' ' before general URI unescape
This allows searching for terms with "+" in them properly.
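The order matters because a literal "+" arrives percent-encoded as "%2B", while an unencoded "+" stands for a space. A minimal Python sketch of both orders, using the standard library's `urllib.parse.unquote`; the function names are hypothetical:

```python
from urllib.parse import unquote


def decode_query_value(raw: str) -> str:
    """Turn '+' into a space first, then percent-decode the rest."""
    return unquote(raw.replace("+", " "))


def decode_wrong_order(raw: str) -> str:
    """Percent-decode first: a decoded %2B is then mistaken for a space."""
    return unquote(raw).replace("+", " ")
```

For the raw value `c%2B%2B+tips`, the first function returns `c++ tips`, while the second loses the plus signs.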
2019-05-15 lazy load Xapian and make it optional for v2
More tests work without Search::Xapian, now. Usability issues still need to be fixed.
2019-05-15 www: use Inbox->over where appropriate
We don't need to rely on Xapian search functionality for the majority of the WWW code, even. subject_normalized is moved to SearchMsg, where it (probably) makes more sense, anyways.
2019-04-19 www: support listing of inboxes
We will still return a 404 by default to '/' for compatibility with users of Plack::App::Cascade or similar. Inboxes are sorted by modification times to help users detect activity (similar to the /$INBOX/ topic view).
New configuration options:
* publicinbox.wwwlisting - configure the listing type
* publicinbox.<name>.hide - hide a particular inbox from the listing
See changes to public-inbox-config.pod for full descriptions of the new options.
Requested-by: Leah Neukirchen <leah@vuxu.org>
https://public-inbox.org/meta/871sdfzy80.fsf@gmail.com/
2019-04-19 start depending on Perl 5.10.1+
I mainly want to start using the '//' (defined-or) operator to simplify code, and Perl 5.10.1 is roughly a decade old at this point. "given/when" would've been nice, but its future is in doubt AFAIK. I also started using the 'parent' module in WwwHighlight, and 'autodie' in UserContent.pm, both of which were only distributed with Perl since 5.10.1; and testing with ancient versions/distros is time-consuming. Anyways, I think this is a small-enough jump to not break any existing installations, given we already depend on fairly recent versions of git and Xapian. Maybe we can use more newish Perl features in the future...
2019-04-16 cleanup: use '$ibx' consistently when referring to Inbox refs
'$inbox' is more human-readable, so that is for the more human-readable name in most cases. Making our variable naming more consistent should make the code easier-to-review and harder to screw up.
2019-04-16 www: remove unnecessary Git object reference
We access the Git object via the Inbox object nowadays, so there's no point in having a shortcut to it, anymore.
2019-04-04 www: fix missing cgit fallback after legacy redirects
We need to instate our cgit handler everywhere we use NewsWWW to catch wildcard requests which our normal endpoints do not handle.
2019-04-04 www: wire up cgit as a 404 handler if cgitrc is configured
Requests intended for cgit are unlikely to conflict with requests to inboxes. So we can safely hand those requests off to cgit.cgi.
2019-02-23 www: prevent '!important' in BOFH-specified CSS
CSS specified by the BOFH must never take precedence over what a user sets in userContent.css.
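One possible way to enforce this is to strip the annotation from admin-supplied CSS before serving it. The Python sketch below is an assumption about the approach, not a description of what the commit does (it may reject or rewrite differently), and it deliberately ignores CSS comments placed between "!" and "important".

```python
import re

# Matches "!important" with optional whitespace after "!", any letter case.
IMPORTANT_RE = re.compile(r"!\s*important\b", re.IGNORECASE)


def strip_important(css: str) -> str:
    """Remove every '!important' annotation from a stylesheet string."""
    return IMPORTANT_RE.sub("", css)
```

Without the annotation, admin rules fall back to normal cascade precedence, so a user's userContent.css can still override them.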
2019-02-13 ensure bytes::length is available to callers
We were relying on Danga::Socket using the "bytes" pragma, previously. Nowadays, the "bytes" pragma is not recommended in general, but bytes::length remains acceptable for getting the byte-size of a scalar.
2019-01-20 $INBOX/_/text/color/ and sample user-side CSS
Since we now support more CSS classes for coloring, give this feature more visibility.
2019-01-20 www: admin-configurable CSS via "publicinbox.css"
Maybe we'll default to a dark theme to promote energy savings... See contrib/css/README for details.
2019-01-20 view: enforce trailing slash for /$INBOX/$OID/s/ endpoints
As with our use of the trailing slash in '$MESSAGE_ID/T/' and '$MESSAGE_ID/t/' endpoints, this is for 'wget -r --mirror' compatibility as well as allowing sysadmins to quickly stand up a static directory with "index.html" in it to reduce load.
2019-01-19 view: wire up diff and vcs viewers with solver
2019-01-15 config: inbox name checking matches git.git more closely
Actually, it turns out git.git/remote.c::valid_remote_nick rules alone are insufficient. More checking is performed as part of the refname in git.git/refs.c::check_refname_component. I also considered rejecting URL-unfriendly inbox names entirely, but realized some users may intentionally configure names not handled by our WWW endpoint for archives they don't want accessible over HTTP.