path: root/lib/PublicInbox/WWW.pm
Date Commit message
2023-11-29 www: load and use cindex join data
This is a major step in solving the problem of having to manually associate hundreds/thousands of coderepos with hundreds/thousands of public-inboxes to power solver (and more).
2023-11-10 www: add topics_(new|active).(html|atom) endpoints
This seems like an easy (but WWW-specific) way to get recently created and recently active topics as suggested by Konstantin. To do this with Xapian will require new columns and reindexing; and I'm not sure if the current lei handling of search results by dumping results to a format readable by common MUAs would work well with this. A new TUI may be required... Suggested-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org> Link: https://public-inbox.org/meta/20231107-skilled-cobra-of-swiftness-a6ff26@meerkat/
2023-11-03 move read_all, try_cat, and poll_in to PublicInbox::IO
The IO package seems like a better home for I/O subs than the Git package. We lose the 60 second read timeout for `git cat-file --batch-*' processes since it's probably not necessary given how reliable the code has proven and things would fall over hard in other ways if the storage device were completely hosed.
2023-10-23 www: fully qualify read_all subroutine call
IMHO, this is not worth importing here since it's only called at startup. Fixes: 19b791f4894e (use read_all in more places to improve safety)
2023-10-18 use read_all in more places to improve safety
`readline' ops may not detect errors on partial reads. This saves us some code to reduce cognitive overhead for readers. We'll also support reusing a destination buffer so it can work more nicely with existing code.
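A minimal sketch of the idea, assuming a hypothetical `slurp_all' helper (this is not the signature of the real read_all): read in a loop until EOF, die on any read error, and optionally append to a buffer the caller already owns.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration; not PublicInbox::IO::read_all itself.
# Reads until EOF, appending to the caller's buffer ($$bref) if one is given,
# and dies on a read error instead of returning a silently truncated result.
sub slurp_all {
	my ($fh, $bref) = @_;
	my $buf = '';
	$bref //= \$buf;
	while (1) {
		my $r = read($fh, $$bref, 65536, length($$bref));
		die "read failed: $!" unless defined $r;
		last if $r == 0; # EOF
	}
	$$bref;
}

my $src = "hello\nworld\n";
open(my $fh, '<', \$src) or die "open: $!";
my $dst = 'prefix:';
slurp_all($fh, \$dst); # appends to the existing buffer
```

Passing a scalar reference avoids an extra copy when the caller already has a buffer to fill.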
2023-05-31 www: more restrictive query string parsing
Only allow single-character query keys to prevent clients from wasting memory in Perl's hash tables. We'll also perform the utf8::decode and tr/+/ / calls once on the whole query string at once to reduce op calls. This also avoids creating an empty hash in the common case when the QUERY_STRING is empty and instead relies on auto-vivification of Perl.
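The key restriction can be sketched roughly as follows. `parse_qs' is a hypothetical name, and unlike the commit this sketch decodes each value separately for simplicity; only the tr/+/ / pass runs once over the whole string.

```perl
use strict;
use warnings;

# Hypothetical parser for illustration; not the code in WWW.pm.
sub parse_qs {
	my ($qs) = @_;
	my %qp;
	return \%qp unless defined($qs) && length($qs);
	$qs =~ tr/+/ /; # one pass over the whole string, not one per value
	for my $pair (split(/[&;]+/, $qs)) {
		my ($k, $v) = split(/=/, $pair, 2);
		# ignore anything but single-character keys
		next unless defined($k) && length($k) == 1;
		$v //= '';
		$v =~ s/%([0-9a-fA-F]{2})/chr(hex($1))/ge; # percent-decoding
		utf8::decode($v); # octets to characters, if valid UTF-8
		$qp{$k} = $v;
	}
	\%qp;
}

my $qp = parse_qs('q=a+b&x=m&verylongkey=1&t=1');
```

Here `verylongkey' never reaches the hash, so a client cannot grow it with arbitrary keys.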
2023-03-31 www: support POST /$INBOX/$MSGID/?x=m&q=
This allows filtering the contents of any existing thread using a search query. It uses the existing THREADID column in Xapian so we can internally add a Xapian OP_FILTER to the results. This new functionality is orthogonal to the existing `t=1' parameter which gives mairix-style thread expansion. It doesn't make sense to use `t=1' with this functionality, but it's not disallowed, either. The indentation change in Over->next_by_mid is to ensure DBI->prepare_cached can share across both ->next_by_mid and ->mid2tid. I also noticed the existing regex for `POST /$INBOX/?x=m&q=' was allowing extra characters. With an added \z, it's now as strict as originally intended, and AFAIK nothing was generating invalid URLs for it. Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org> Link: https://public-inbox.org/meta/aaniyhk7wfm4e6m5mbukcrhevzoc6ftctyrfwvmz4fkykwwtlj@mverfng6ytas/T/
2023-01-11 www: /$INBOX/$MSGID/d/ to diff reused Message-IDs
To ensure users aren't abusing the ability to reuse Message-IDs, provide a convenient front-end to `lei mail-diff' from WWW. Most of the time it's just list-appended signatures, so I expect this to be useful for /all/ users.
2022-10-07 www: support publicinbox.cgit knob
For backwards-compatibility, this defaults to `first'. When set to `fallback', PublicInbox::WwwCoderepo is favored and cgit is only used as a fallback. Eventually, `rewrite' will also be supported to rewrite cgit URLs to WwwCoderepo ones. Of course, WwwCoderepo is still missing search and other key features, but that's being worked on...
2022-10-07 www: do not call ->coderepo->srv on sub ref
The PublicInbox::Cgit wrapper will return a sub-ref for most responses, so ensure we don't try to treat it as an array-ref.
2022-10-05 www_coderepo: an alternative to cgit
This will allow it to easily map a single coderepo to multiple inboxes (or multiple coderepos to any number of inboxes). For now, this is just a summary, but $REPO/$OID/s/ support will be added, along with archive downloads. Indexing of coderepos will probably be supported via -extindex, only.
2022-08-29 www: allow html_oneshot to take an array arg
Another step towards making our internal APIs more writev-like and reducing the copies needed for `join' or `.=' concatenation.
2022-08-23 www: /s/: 404 for unconfigured coderepos
The $r404 variable is unset if we have a valid inbox, but no coderepos configured for that inbox, thus we must `r(404)' explicitly.
2022-08-22 www: support `+' in inbox names
`+' already seemed to work for IMAP mailboxes and NNTP newsgroup names and git-config doesn't complain, either. So allow it as the path components of WWW URLs so projects like `libstdc++' can use it. Reported-by: Mark Wielaard <mark@klomp.org> Tested-by: Mark Wielaard <mark@klomp.org> Link: https://public-inbox.org/meta/YwKnFCvganW7ErXU@wildebeest.org/
2022-03-22 www: loosen deep-linking prevention
Apparently some browsers can set a Referer: header which fails to match. I'm not certain why, but matching "$schema://$HOST_PORT" case-insensitively seems more correct regardless. In case that doesn't work, we'll also allow bypassing deep-link prevention via a POST form button. Reported-by: Vlastimil Babka <vbabka@suse.cz> Link: https://public-inbox.org/meta/93ebfbd1-9924-481c-4edc-9b232d1e995c@suse.cz/
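A rough sketch of a case-insensitive Referer check, assuming a PSGI-style environment hash. The function name and the exact matching rules are illustrative assumptions, not the code in WWW.pm.

```perl
use strict;
use warnings;

# Hypothetical check for illustration; $env is a PSGI environment hash.
sub referer_is_local {
	my ($env) = @_;
	my $ref = $env->{HTTP_REFERER} // return 0;
	my $base = "$env->{'psgi.url_scheme'}://$env->{HTTP_HOST}";
	# /i makes the scheme and host:port comparison case-insensitive
	$ref =~ m!\A\Q$base\E(?:/|\z)!i ? 1 : 0;
}

my %env = ('psgi.url_scheme' => 'https', HTTP_HOST => 'Example.org:8443');
my $ok = referer_is_local({ %env, HTTP_REFERER => 'HTTPS://example.ORG:8443/meta/' });
my $no = referer_is_local({ %env, HTTP_REFERER => 'https://example.org.evil/' });
```

Requiring `/' or end-of-string after the base keeps a look-alike host such as `example.org.evil' from matching.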
2021-11-01 treewide: kill problematic "$h->{k} //= do {" assignments
As stated in the previous change, conditional hash assignments which trigger other hash assignments seem problematic, at times. So replace: $h->{k} //= do { $h->{x} = ...; $val }; with: $h->{k} // do { $h->{x} = ...; $h->{k} = $val }; "||=" is affected the same way, and some instances of "||=" are replaced with "//=" or "// do {", now.
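The replacement pattern, written out as a small runnable sketch. `get_k' and the values are hypothetical; {k} and {x} mirror the placeholders used in the message.

```perl
use strict;
use warnings;

# Hypothetical lookup; {k} and {x} mirror the placeholders in the commit message.
sub get_k {
	my ($h, $val) = @_;
	# old form: $h->{k} //= do { $h->{x} = 'set'; $val };
	$h->{k} // do {
		$h->{x} = 'set';   # the side-effect assignment
		$h->{k} = $val;    # explicit assignment, also the block's value
	};
}

my $h = {};
my $first = get_k($h, 42);  # populates {k} and {x}
my $second = get_k($h, 99); # {k} is already defined, block is skipped
```

The result is the same as with `//=', but the store into {k} is now an ordinary statement inside the block rather than a conditional assignment wrapped around it.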
2021-10-13 www: preload: load ExtSearch via ->ALL
This ought to give us more CoW savings and fragmentation avoidance in -httpd.
2021-09-29 inbox: drop memoization/preload, cleanup expires caches
cloneurl, description, and base_url are no longer memoized. The non-$env form of base_url is rare in WWW, and is fast enough to not require memoization. cloneurl and description are now expired during cleanup, allowing admins to change these files without restarting (or SIGHUP). -altid_map is no longer cached nor memoized at all, since the endpoint(s) which hit it seem rarely accessed. nntp_url and imap_url are now cached (instead of memoized) in case an inbox is unvisited for a long time. They remain cached since the truthiness check gets called in every per-inbox HTML page, which can potentially be expensive.
2021-09-28 www+httpd: lower priority of large mbox downloads
While each git blob request is treated fairly w.r.t other git blob requests, responses triggering thousands of git blob requests can still noticeably increase latency for less-expensive responses. Move large mbox results and the nasty all.mbox endpoint to a low priority queue which only fires once per-event loop iteration. This reduces the response time of short HTTP responses while many gigantic mboxes are being downloaded simultaneously, but still maximizes use of available I/O when there are no inexpensive HTTP responses happening. This only affects PublicInbox::WWW users who use public-inbox-httpd, not generic PSGI servers.
2021-08-28 get rid of unnecessary bytes::length usage
The only place where we could return wide characters with -httpd was the raw $INBOX_DIR/description text, which is now converted to octets. All daemon (HTTP/NNTP/IMAP) sockets are opened in binary mode, so length() and bytes::length() are equivalent on reads. For socket writes, any non-octet data would warn about wide characters and we are strict in warnings with test_httpd. All gzipped buffers are also octets, as is PublicInbox::Eml->body, and anything from PerlIO objects ("git cat-file --batch" output, filesystems), so bytes::length was unnecessary in all those places.
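To illustrate why the two are equivalent once data has been converted to octets, here is a small sketch with a made-up description string:

```perl
use strict;
use warnings;
use bytes (); # makes bytes::length callable without enabling byte semantics

my $desc = "caf\x{e9} \x{263a}"; # character string containing a wide character
my $nchars = length($desc);      # counts characters
my $octets = $desc;
utf8::encode($octets);           # in-place conversion to UTF-8 octets
my $nbytes = length($octets);    # counts octets, same as bytes::length
```

On the character string, length() and bytes::length() can disagree; on the encoded copy they always agree, which is the situation described above for socket I/O.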
2021-08-26 wwwlisting: support global CSS in HTML view
Since CSS can be overridden by a static webserver on a per-inbox basis, we need a similar pattern to deal with the instance-wide WwwListing HTML. "/+/" probably won't conflict with any current nor future public inbox names. I don't think it'll cause problems with common linkifiers or URL extractors, either (and it's unlikely anybody would want to share URLs of just the CSS in a plain text(-like) format).
2021-06-23 www: do not warn on blank query parameters
Sometimes users (or bots) may lead queries with '&' and trigger uninitialized variable warnings; just ignore them and give consumers a $ctx->{qp}->{''} entry. While we're in the area, pass a regexp rather than scalar string to the `split' perlop to prevent Perl from recompiling the regexp on every call.
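A sketch of how a leading '&' produces a blank field and how defaulting to '' avoids the warning. The query string is made up, and this is not necessarily the exact code in WWW.pm.

```perl
use strict;
use warnings;

# Hypothetical query string with a leading '&'
my $qs = '&q=foo;x=t';
my %qp = map {
	my ($k, $v) = split(/=/, $_, 2);
	($k // '', $v // ''); # blank fields become '' instead of undef
} split(/[&;]+/, $qs); # regexp literal, compiled once
```

The leading separator yields one empty field, which ends up as the '' key rather than as an undefined value.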
2021-04-26 www: missing /$INBOX/$MSGID/raw returns 404
Don't attempt to return HTTP 300 via Extmsg on it, since whoever uses /raw is likely piping it to some other command.
2021-03-17 config: lazy-load coderepos, support extindex
Extsearch objects are duck-types of Inbox objects, and are capable of supporting code repos all the same.
2021-01-01 update copyrights for 2021
Using "make update-copyrights" after setting GNULIB_PATH in my config.mak
2020-12-21 manifest.js.gz: fix per-inbox /$INBOX/manifest.js.gz
/$INBOX/manifest.js.gz should not attempt to match every inbox in the domain (or every inbox); that is for /manifest.js.gz (without a /$INBOX prefix). Fixes: f303b4add8ea1883 ("wwwlisting: avoid hogging event loop")
2020-12-09 rename {pi_config} fields to {pi_cfg}
{pi_config} may be confused with the documented `PI_CONFIG' environment variable, and we'll favor vowel-removal to be consistent with our usage of object references. The `pi_' prefix may stay in some places, for now; since a separate namespace may come into this codebase for local/private client-tooling. For InboxIdle, we'll also remove an invalid comment about holding a reference to the PublicInbox::Config object, too.
2020-12-09 treewide: replace {-inbox} with {ibx} for consistency
{ibx} is shorter and is the most prevalent abbreviation in indexing and IMAP code, and the `$ibx' local variable is already prevalent throughout. In general, the codebase favors removal of vowels in variable and field names to denote non-references (because references are "lighter" than non-references). So update WWW and Filter users to use the same code since it reduces confusion and may allow easier code sharing.
2020-12-05 isearch: emulate per-inbox search with ->ALL
Using "eidx_key:" boolean prefix to limit results to a given inbox, we can use ->ALL to emulate and replace per-Inbox xap15/[0-9] search indices. With this change, the presence of "extindex.all.topdir" in the $PI_CONFIG will cause the WWW code to use that extindex and ignore per-inbox Xapian DBs in xap15/[0-9]. Unfortunately IMAP search still requires old per-inbox indices, for now. Mapping extindex Xapian docids to per-Inbox UIDs and vice-versa is proving tricky. Fortunately, IMAP search is rarely used and optional. The RFCs don't specify expensive phrase search, either, so `indexlevel=medium' can be used in per-inbox Xapian indices to save space. For primarily WWW (and future JMAP) users, this should result in significant disk space, FD, and page cache footprint savings for large instances with many inboxes and many cross-posted messages.
2020-11-07 extsearch: wire up remaining Inbox-like methods for WWW
This lets us pretend an ExtSearch object is an Inbox object in most of the existing WWW code.
2020-09-16 treewide: relax allow >=40 chars for git OID
This will help with eventual git SHA-256 transitions.
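As a sketch of what such a relaxed check might look like (the function name is hypothetical and the actual patterns in the codebase may differ), a lower bound of 40 hex characters accepts both SHA-1 and SHA-256 object IDs:

```perl
use strict;
use warnings;

# Hypothetical validator; the actual patterns in the codebase may differ.
sub looks_like_oid {
	my ($s) = @_;
	$s =~ /\A[a-f0-9]{40,}\z/ ? 1 : 0;
}

my $sha1 = 'a' x 40;   # SHA-1 hex length
my $sha256 = 'b' x 64; # SHA-256 hex length
```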
2020-09-10 wwwlisting: avoid hogging event loop
By using the just-introduced ConfigIter class. And make ManifestJsGz a subclass of it to reduce duplication.
2020-09-10 www: manifest.js.gz generation no longer hogs event loop
It's still as slow as before with hundreds/thousands of inboxes, but at least it's fair. Future changes will allow it to be cached and memoized with persistent HTTP servers.
2020-07-10 viewvcs: allow "0" as a path name
This means we need to filter out "" from query parameters. While we're at it, update comments for the WWW endpoint.
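The underlying Perl pitfall is that "0" is false in boolean context, so a plain truthiness test cannot tell it from a missing value. A minimal sketch, with a hypothetical accessor and the `b' key used only for illustration:

```perl
use strict;
use warnings;

# Hypothetical accessor; the `b' key stands in for any path-carrying parameter.
sub path_param {
	my ($qp) = @_;
	my $fn = $qp->{b};
	# test definedness and emptiness explicitly: "0" is false but valid
	(defined($fn) && $fn ne '') ? $fn : undef;
}
```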
2020-07-06 www: need: use WwwStream::html_oneshot
It'll give us a nicer HTML header and footer.
2020-05-09 replace most uses of PublicInbox::MIME with Eml
PublicInbox::Eml has enough functionality to replace the Email::MIME-based PublicInbox::MIME.
2020-05-09 switch read-only Email::Simple users to Eml
Since PublicInbox::Eml doesn't parse MIME subparts up front, it can replace most uses of Email::Simple without performance penalty. This will eventually allow us to lower overall internal API footprint by not having to keep the MIME vs Simple distinction.
2020-05-09 www: preload: load all encodings at startup
Encode lazy-loads encodings on an as-needed basis. This is great for short-lived programs, but leads to fragmentation in long-lived daemons where immortal allocations can get interleaved with short-lived, per-request allocations. Since we have no idea which encodings will be needed when there's a constant flow of incoming mail, just preload everything available at startup.
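One way to force every available encoding to load early is sketched below, using Encode's documented `encodings' and `find_encoding' calls. This only illustrates the idea and is not necessarily how ->preload does it.

```perl
use strict;
use warnings;
use Encode ();

# Sketch only: walk every encoding Encode knows about and load it now.
my @loaded;
for my $name (Encode->encodings(':all')) {
	my $enc = Encode::find_encoding($name) or next; # loads on demand
	push @loaded, $enc->name;
}
```

Doing this before the daemon starts serving requests places those long-lived allocations together instead of interleaving them with per-request ones.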
2020-03-26 wwwaltid: inform users to use POST instead of GET
Seeing the example config linkified, some users may inevitably try to follow it in a browser with a GET request. Provide a helpful message to inform users to use POST instead of attempting to treat /$INBOX/$ALTID.sql.gz as a Message-Id.
2020-03-26 inbox: altid_map becomes a method
We want to be able to preload that, as well as to access it in WwwText for a config comment in the config example.
2020-03-25 www: add endpoint to retrieve altid dumps
This ensures all our indexed data, including data from altid searches (e.g. "gmane:$ARTNUM") is retrievable. It uses a "POST" request to avoid wasting cycles when invoked by crawlers, since it could potentially be several megabytes of data not indexable by search engines.
2020-03-20 daemon: do more immortal allocations up front
Doing immortal allocations late can cause those allocations to end up in places where it fragments the heap. So do more things up front for long-lived daemons.
2020-03-20 www: update ->preload for newer modules
We'll also avoid explicitly loading standard library modules like POSIX and Digest::SHA, here; instead we load our own modules and let those load whatever non-PublicInbox:: modules they need.
2020-02-06 treewide: run update-copyrights from gnulib for 2019
I didn't wait until September to do it, this year!
2020-02-04 www: serve $INBOX_DIR/description as $INBOX_URL/description
Instead of serving $INBOX_DIR/all.git/description, since $INBOX_DIR/all.git/description is not described in the default message when it's missing.
2020-02-04 www: stricter regexp for 405 errors
We want to match "GET" and "HEAD" exactly, not requests which start with "GET" or end with "HEAD". This doesn't seem like a real problem for public-inboxes which are actually public data anyways.
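The description suggests an alternation without grouping, where the anchors bind to only one branch each. A hedged sketch of the exact-match form (the helper name is hypothetical):

```perl
use strict;
use warnings;

# Hypothetical helper for illustration.
sub method_allowed {
	my ($method) = @_;
	# Without the (?:...) group, /\AGET|HEAD\z/ would mean
	# "starts with GET" or "ends with HEAD".
	$method =~ /\A(?:GET|HEAD)\z/ ? 1 : 0;
}
```

Grouping the alternation makes both anchors apply to both method names.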
2020-01-01 wwwstatic: add directory listing + index.html support
It's now possible to use WwwStatic as a standalone PSGI app to serve static files and recreate the award-winning web design of https://public-inbox.org/ :>
2020-01-01 wwwstatic: move r(...) functions here
Remove redundant "r" functions for generating short error responses. These responses will no longer be cached by clients, which is probably a good thing since most errors ought to be transient, anyways. This also fixes error responses for our cgit wrapper when static files are missing.
2020-01-01 www: move more logic into path_info_raw
It'll be easier to reuse in future code.
2019-12-27 www: lazy load Plack::Util
cgit users won't need Plack::Util, here.