path: root/lib
Date Commit message
2020-04-15 dskqxs: ignore EV_SET errors on EVFILT_WRITE
Just like the EPOLL_CTL_ADD emulation path, the EPOLL_CTL_MOD and EPOLL_CTL_DEL emulation paths can fail if attempting to install an EVFILT_WRITE for a read-only pipe. I've only observed this on the EPOLL_CTL_DEL emulation path, but I suspect it could happen on the EPOLL_CTL_MOD path as well. Increasing the number of read-only pipes we rely on with altid exports via sqlite3 made this old bug more apparent and reproducible while looping the test suite. This may be adjusted in the future to deal with write-only pipes, but we currently don't have any of those watched by kqueue.
2020-04-15 testcommon: DESTROY: wait for killed daemon
Otherwise, the waitpid(-1, 0) call in Xapcmd::process_queue() may reap it in a subsequent test when using t/run.perl to reuse processes for testing. While we're at it, make Xapcmd::process_queue warn about unknown PIDs in case other PIDs leak through to us in the future.
2020-04-09 triewyde: ficks soem speling errrors
Dikshunarees R gude!
2020-04-09 tests: document run_mode=1 as not implemented
It was implemented at some point, but it was more things to support and the worst of both worlds: both unrealistic compared to real-world use and slower than run_mode=2. Noticed while looking for speling erorrs.
2020-04-07 view: do not redundantly obfuscate addresses
We shouldn't rerun the address obfuscator on data we've already run through. Instead, run through the unescaped text part and apply the UTF-8 "\x{2022}" substitution before it hits HTML escaping. Fixes: 9bdd81dc16ba6511 ("view: msg_iter calls add_body_text directly")
2020-04-07 portability: constants for NetBSD
NetBSD implements O_CLOEXEC, so let us use it to avoid inadvertent FD sharing. It also has the same value for SIGWINCH as Linux and the other BSDs we support.
2020-04-05 git: reduce stat buffer storage overhead
The stat() array is a whopping 480 bytes (on x86-64, Perl 5.28), while the new packed representation of two 64-bit doubles as a scalar is "only" 56 bytes. This can add up when there are many inboxes. Just use a string comparison on the packed representation. Some 32-bit Perl builds (IIRC OpenBSD) lack quad support, so doubles were chosen for pack() portability.
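The idea of packing two stat fields into one string and comparing strings can be sketched as follows. This is an illustration in Python, not the project's Perl code; the choice of ctime and size as the two packed fields, and the name `stat_key`, are assumptions made for the example.

```python
import os
import struct
import tempfile


def stat_key(path):
    """Pack two stat fields into one 16-byte string for cheap change detection."""
    st = os.stat(path)
    # "dd" = two native 64-bit doubles; the fields chosen here are an assumption
    return struct.pack("dd", st.st_ctime, float(st.st_size))


# Demo: the key changes once the file grows
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
    demo_path = tmp.name

key_before = stat_key(demo_path)
with open(demo_path, "ab") as fh:
    fh.write(b" world")  # size goes from 5 to 11 bytes
key_after = stat_key(demo_path)
os.unlink(demo_path)

# A single bytewise comparison replaces a field-by-field check
file_changed = key_before != key_after
```

Only the packed key has to be kept per file, which is where the storage saving described above would come from.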
2020-04-05 mbox: halve ->getline "context switches"
We don't need to take extra trips through the event loop for a single message (in the common case of Message-IDs being unique). In fact, holding the body reference left behind by Email::Simple could be harmful to memory usage, though in practice it's not a big problem since code paths which use Email::MIME take far more.
2020-04-05 release large (non ref) scalars using `undef $sv'
Using `undef EXPR' like a function call actually frees the heap memory associated with the scalar, whereas `$sv = undef' or `$sv = ""' will hold the buffer around until $sv goes out of scope. The `sv_set_undef' documentation in the perlapi(1) manpage explicitly states this: The perl equivalent is "$sv = undef;". Note that it doesn't free any string buffer, unlike "undef $sv". And I've confirmed by reading Dump() output from Devel::Peek. We'll also inline the old index_body sub in SearchIdx.pm to make the scope of the scalar more obvious. This change saves several hundred kB RSS on both -index and -httpd when hitting large emails with thousands of lines.
2020-04-05 wwwstatic: set "Vary: Accept-Encoding" in static gzip response
We don't want to confuse intermediate caches into serving gzipped content to any clients which can't handle it. It probably doesn't matter in practice, though, since every HTTP client seems to handle "Content-Encoding: gzip" regardless of whether it was requested or not, though I could expect some nc/socat/telnet/s_client users being annoyed. This also matches the behavior of Plack::Middleware::Deflater and other deflater implementations.
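A minimal sketch of the header logic, in Python rather than the project's Perl. The function name `gzip_response_headers` is hypothetical, and the Accept-Encoding parsing is deliberately simplified (q-values are ignored).

```python
def gzip_response_headers(accept_encoding):
    """Return extra headers for a gzipped static response, or [] if gzip was not requested."""
    # Simplified parse: split on commas and drop any ";q=..." parameters
    tokens = [t.split(";")[0].strip().lower() for t in accept_encoding.split(",")]
    if "gzip" not in tokens:
        return []
    return [
        ("Content-Encoding", "gzip"),
        # Tells caches the body depends on the request's Accept-Encoding
        ("Vary", "Accept-Encoding"),
    ]
```

With the Vary header present, a shared cache should key its stored copy on the request's Accept-Encoding value instead of handing the gzipped body to every client.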
2020-04-04 view: inline flush_quote sub
No point in having an extra sub for a short, commonly called function in the same file.
2020-04-04 viewdiff: reduce sub parameter count
We're slowly moving towards doing all of our output buffering into a single buffer, so passing that around on the stack as a dedicated parameter is confusing.
2020-04-04 view: dedupe_subject: allow "0" as a valid Subject
While rare in practice (even by spammers), a single "0" could theoretically be the entire contents of a Subject line. So use the Perl 5.10+ defined-or operator to improve correctness of subject deduplication.
2020-04-04 view: use defined-or operator to simplify checks
We depend on Perl 5.10 features in other places. Shorten the lifetime of the `$desc' scalar while we're at it.
2020-04-04 view: note we assume UTF-8 on unknown encodings
Clarify that we're assuming the text is UTF-8, since users may have no idea how it's mangled.
2020-04-04 inboxwritable: fix From_ line unescaping
We can't rely on Email::MIME noticing the change to our scalar ref after calling `PublicInbox::MIME->new'. This is because Email::MIME::body_set (unlike Email::Simple::body_set) will copy the contents of the body into `->{body_raw}' as a new scalar. Furthermore, we need to escape multiple From lines in the body, not just the first one, using the `g' modifier to `s//'. Reported-by: Kyle Meyer <kyle@kyleam.com>
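The need for a global substitution can be illustrated with a small Python sketch (not the project's Perl). It assumes mboxrd-style quoting, where every body line matching `^>*From ` gains or loses one leading `>`; all function names are hypothetical.

```python
import re

# Assumed mboxrd-style rule: quote any line that starts with zero or more ">" then "From "
_ESCAPE = re.compile(r"^(>*From )", re.MULTILINE)
_UNESCAPE = re.compile(r"^>(>*From )", re.MULTILINE)


def escape_from(body):
    """Add one ">" to every From_-like line (re.sub replaces all matches by default)."""
    return _ESCAPE.sub(r">\1", body)


def unescape_from(body):
    """Remove one ">" from every quoted From_-like line."""
    return _UNESCAPE.sub(r"\1", body)


def unescape_first_only(body):
    """Buggy variant: stops after the first match, like `s///' without the `g' modifier."""
    return _UNESCAPE.sub(r"\1", body, count=1)
```

Round-tripping a body with several From_ lines through `escape_from` and `unescape_from` restores it exactly, while `unescape_first_only` leaves the later lines quoted.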
2020-04-03 quiet "Complex regular subexpression recursion limit" warnings
These seem mostly harmless since Perl will just truncate the match and start a new one on a newline boundary in our case. The only downside is we'd end up with redundant <span> tags in HTML. Limiting the number of lines matched ourselves with `{1,$NUM}' doesn't seem prudent since lines vary in length, so we continue to defer the job of limiting matches to the Perl regexp engine. I've noticed this warning in practice on 100K+ line patches to locale data.
2020-04-03 view: handle the topic-free case properly
There may be no topics for a given timestamp range, so don't attempt to treat `undef' as an arrayref.
2020-04-02 nntp: allow multiple spaces or tabs to delimit args
While this is not a known problem in practice, RFC 3977 section 3.1 states: Keywords and arguments MUST each be separated by one or more space or TAB characters.
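The quoted rule amounts to splitting on runs of spaces and tabs. A minimal Python sketch follows (not the project's Perl parser); `split_command` is a hypothetical name, and stripping leading whitespace is a leniency added for the example.

```python
import re


def split_command(line):
    """Split an NNTP command line into (KEYWORD, [args]) on runs of spaces or tabs."""
    # Drop the CRLF terminator and any stray surrounding blanks (lenient by assumption)
    line = line.strip(" \t\r\n")
    if not line:
        return "", []
    keyword, *args = re.split(r"[ \t]+", line)
    # Keywords are case-insensitive in NNTP, so normalize for dispatch
    return keyword.upper(), args
```

For instance, `split_command("article \t  <m@example.com>\r\n")` yields the keyword `ARTICLE` and a single argument, however many delimiters the client sent.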
2020-04-02 mid: add $MID_EXTRACT regexp for export
This allows us to consistently enforce the same Message-ID extraction rules everywhere and makes it easier for us to make changes in the future. Update scripts/ssoma-replay, as well, but don't rely on PublicInbox::* modules in that since it's legacy and public-inbox was never a dependency of ssoma.
2020-04-02 searchidx: v1: skip mid_clean on mid_mime results
We do not need to run mid_clean() since mid_mime() uses mids() to extract the msgid from inside the angle brackets.
2020-04-02 smsg: inline _extract_mid functionality
No need to keep an extra sub which isn't called anywhere else, and the mid_clean call is redundant since mid_mime already plucks the msgid out of the angle brackets.
2020-03-31 v2writable: index Message-IDs w/ spaces properly
Message-IDs can apparently contain spaces and other weird characters. Ensure we pass those properly to shard subprocesses when importing messages in parallel mode. Our NNTP request parser does not deal with spaces in the Message-ID, yet, and I don't expect most NNTP clients to, either. Nor does the Net::NNTP client handle them in responses.
2020-03-30 viewvcs: stream_blob_parse_hdr: fix BIN_DETECT retries
git-cat-file(1) may return less than the $BIN_DETECT value for some blobs, so ensure we repopulate the values in $ctx for retries in that case, otherwise we'll lose `$ctx->{-res}' and die when attempting to use `undef' as an array ref.
2020-03-30 qspawn: capture errors from parse_hdr callback
User-supplied callbacks may fail, so capture the error instead of propagating it up the stack into the public-inbox-httpd event loop.
2020-03-30 wwwstream::oneshot => html_oneshot
And use Exporter to make our life easier, since WwwAltId was using a non-existent PublicInbox::WwwResponse namespace in error paths which doesn't get noticed by `perl -c' or exercised by tests on normal systems. Fixes: 6512b1245ebc6fe3 ("www: add endpoint to retrieve altid dumps")
2020-03-29 index: support --compact / -c on command-line
It's more convenient to specify `-c' / `--compact' on the command-line when reindexing than it is to invoke public-inbox-compact(1) separately. This is especially convenient in low-space situations when public-inbox-index is operating on multiple inboxes sequentially, as compaction can happen immediately after indexing each inbox, instead of waiting until all inboxes are indexed.
2020-03-29 searchidxshard: ensure we set indexlevel on shard[0]
For sharded v2 repositories with few-enough messages, it is possible for shard[0] to go unused and never trigger the ->commit_txn_lazy to set the indexlevel field in Xapian metadata. So set it immediately at initialization and avoid this case. While we're at it, avoid triggering needless pwrite syscalls from ->set_metadata by checking with ->get_metadata, first.
2020-03-29 config: Honor gitconfig includes
This allows for a setup where a central config file for the web server includes per-user config files.
2020-03-26 wwwaltid: inform users to use POST instead of GET
Seeing the example config linkified, some users may inevitably try to follow it in a browser with a GET request. Provide a helpful message to inform users to use POST instead of attempting to treat /$INBOX/$ALTID.sql.gz as a Message-Id.
2020-03-26 wwwtext: show altid instructions in config
Exposing altid dumps will help and ensure total reproducibility of existing instances. AFAIK, sqlite3(1) can't execute arbitrary code, so it's not quite as fashionable as the "curl | bash" stuff the cool people are doing, these days :P
2020-03-26 inbox: altid_map becomes a method
We want to be able to preload that, as well as to access it in WwwText for a config comment in the config example.
2020-03-25 www: add endpoint to retrieve altid dumps
This ensures all our indexed data, including data from altid searches (e.g. "gmane:$ARTNUM") is retrievable. It uses a "POST" request to avoid wasting cycles when invoked by crawlers, since it could potentially be several megabytes of data not indexable by search engines.
2020-03-25 altid: warn about non-word prefixes
We only support searching on prefixes matching /\A\w+\z/ because Xapian requires ':' to delimit the prefix and splits on spaces without quotes. I've also verified Xapian supports multibyte UTF-8 characters, underscores, and bare numbers as search prefixes, so there's no need to restrict it beyond what Perl's UTF-8 aware \w character class offers.
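The prefix check can be approximated in Python (not the project's Perl) as below. `is_searchable_prefix` is a hypothetical name, and Python's Unicode-aware `\w` is only assumed to be close to Perl's for this purpose.

```python
import re

# fullmatch() plays the role of the \A ... \z anchors in the Perl pattern
_WORD_ONLY = re.compile(r"\w+")


def is_searchable_prefix(prefix):
    """True if the prefix consists solely of word characters (letters, digits, underscore)."""
    return _WORD_ONLY.fullmatch(prefix) is not None
```

Prefixes such as `gmane`, `list_2020`, or `123` pass, while anything containing a colon, hyphen, or space is rejected and would trigger the warning.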
2020-03-25 wwwtext: show thread endpoint w/ indexlevel=basic
And show contact info when there's no indexing, at all. Installations where Xapian is too expensive can still support threading since it only depends on SQLite, so we need to inform users of what's available.
2020-03-25 search: clobber -user_pfx on query parser initialization
While we don't currently reinitialize the query parser for the lifetime of a PublicInbox::Search object and have no plans to, it's incorrect to be appending to an existing array in case we reinitialize the query parser in the future.
2020-03-25 qspawn: handle ENOENT (and other errors on exec)
As sqlite3(1) and other executables may become unavailable or uninstalled while a daemon runs, we need to gracefully handle errors in those cases.
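A rough Python illustration of failing gracefully when an executable has gone missing (the project itself is Perl, and its spawn code is not shown here). `run_tool` and its return convention are hypothetical.

```python
import subprocess


def run_tool(argv):
    """Return (ok, payload): stdout bytes on success, an error string otherwise."""
    try:
        proc = subprocess.run(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    except OSError as exc:
        # Covers ENOENT (binary missing), EACCES, and similar failures at exec time
        return False, "{}: {}".format(argv[0], exc.strerror)
    if proc.returncode != 0:
        return False, "{} exited with status {}".format(argv[0], proc.returncode)
    return True, proc.stdout
```

A long-running daemon can turn the `(False, message)` case into an error response for that one request instead of letting the exception unwind its event loop.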
2020-03-25 mbox: need_gzip uses WwwStream::oneshot
This makes the error page more consistent. Not that it really matters since Compress::Raw::Zlib and IO::Compress packages have been distributed with Perl since 5.10.x. Of course, zlib itself is also a dependency of git.
2020-03-25 wwwstream: oneshot sets content-length
PublicInbox::HTTP will chunk, otherwise, and that's extra overhead which isn't needed.
2020-03-25 extmsg: use WwwResponse::oneshot
No reason to use the ->getline interface for small responses.
2020-03-25 wwwstream: introduce oneshot API to avoid ->getline
The ->getline API is only useful for limiting memory use when streaming responses containing multiple emails or log messages. However it's unnecessary complexity and overhead for callers (PublicInbox::HTTP) when there's only a single message.
2020-03-25 gzipfilter: lazy allocate the deflate context
zlib contexts are memory-intensive, particularly when used for compression. Since the gzip filter may be sitting in a limiter queue for a long period, delay the allocation until we actually have data to translate, and not a moment sooner.
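Lazy allocation of the compression state might look like this in Python (a sketch, not the project's Perl GzipFilter). The class name and the `translate`/`finish` method names are assumptions for the example.

```python
import gzip
import zlib


class LazyGzipFilter:
    """Gzip filter that allocates its zlib state only when the first data arrives."""

    def __init__(self):
        self._deflater = None  # nothing allocated while waiting in a queue

    def translate(self, data):
        if not data:
            return b""
        if self._deflater is None:
            # wbits = MAX_WBITS | 16 selects the gzip container format
            self._deflater = zlib.compressobj(
                zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, zlib.MAX_WBITS | 16
            )
        return self._deflater.compress(data)

    def finish(self):
        # If no data ever arrived, nothing was allocated and nothing is emitted
        if self._deflater is None:
            return b""
        tail = self._deflater.flush()
        self._deflater = None
        return tail
```

Concatenating the chunks returned by `translate` and `finish` gives a complete gzip stream that `gzip.decompress` can read back.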
2020-03-25 qspawn: reinstate filter support, add gzip filter
We'll be supporting gzipped sqlite3(1) dumps for altid files in future commits. In the future (and if we survive), we may replace Plack::Middleware::Deflater with our own GzipFilter to work better with asynchronous responses without relying on memory-intensive anonymous subs.
2020-03-24 daemon: unlink .oldbin PID file correctly
We need to track the PID file having ".oldbin" appended to it while a SIGUSR2 upgrade is in progress and ensure it is unlinked on SIGQUIT.
2020-03-24 daemon: fix SIGUSR2 upgrade with -W0 (no workers)
Disabling workers via `-W0' blesses the contents of the @listeners array, so we need to ensure we call fcntl on the GLOB ref in ->{sock}. Add tests to ensure USR2 works regardless of whether workers are enabled or not.
2020-03-22 v2: SDBM-based multi Message-ID queue
This lets us store author and committer times for deferred indexing messages with ambiguous Message-IDs. This allows us to reproducibly reindex messages with the git commit and author times when a rare message lacks Received and/or Date headers while having ambiguous Message-IDs.
2020-03-22 *idx: pass smsg in even more places
We can finally get rid of the awkward, ad-hoc use of V2Writable, SearchIdx, and OverIdx args for passing {cotime} and {autime} between classes. We'll still use those git time fields internally within V2Writable and SearchIdx for (re)indexing, but that's not worth avoiding as a fallback.
2020-03-22 v2: pass smsg in more places
We can pass fewer order-dependent args to V2Writable::do_idx and SearchIdxShard::index_raw by passing the smsg object, instead.
2020-03-22 *idx: pass $smsg in more places instead of many args
We can pass blessed PublicInbox::Smsg objects to internal indexing APIs instead of having long parameter lists in some places. The end goal is to avoid parsing redundant information each step of the way and hopefully make things more understandable.
2020-03-22 overidx: parse_references: less error-prone args
Favor `$smsg->{mid}' instead of `$mid0' to reduce parameters down-the-line, but favor passing the Email::MIME::Header object around instead of relying on the bloat-prone `$smsg->{mime}' and calling ->header_obj on it.