about summary refs log tree commit homepage
path: root/lib
DateCommit message (Collapse)
2019-12-19tests: move t/common.perl to PublicInbox::TestCommon
We want to be able to use run_script with *.t files, so t/common.perl putting subs into the top-level "main" namespace won't work. Instead, make it a module which uses Exporter like other libraries.
2019-12-19msgiter: msg_part_text returns undef on text/html
We want HTML parts to be downloadable, but not displayed as unreadable (but injection-safe) HTML source in our own web and Atom interfaces. This affects indexing, too, as HTML tags/comments won't be indexed anymore, but existing indices are only cleaned after --reindex. HTML-only mail won't be indexed at all, but we won't cross that bridge until somebody cares about that crap. We'll continue to actively discourage such waste of CPU cycles, bandwidth, cache and storage. Fixes: 7d82a8bc04ce2e68 (handle "multipart/mixed" messages which are not multipart')
2019-12-18viewvcs: flesh out some functionality and test
Expose MAX_SIZE via "our" will make it possible to use in tests, and configure, later. Additionally, returning HTTP 500 code for big files is not an Internal Server Error, just a memory limit... Some browsers won't show our HTML response with the link to the raw file in case of errors, either, so we'll return 200 to ensure users can use the link to access the raw blob. Finally, throw in some tests to the existing solver_git testcase, since that was incomplete and was pointlessly loading Plack modules without testing PSGI.
2019-12-16daemon: drop listeners early in master on graceful shutdown
For users not relying on socket activation via systemd (or similar), we want to drop listeners ASAP so another process can bind to their address. While we're at it, disable TTIN and HUP handlers since we have no chance of starting usable workers without listeners.
2019-12-16daemon: shorten lifetime of listener_names mapping
Keeping a ref to the IO::Socket handle was preventing close(2) from being invoked on graceful shutdown of worker.
2019-12-15address: explicitly reject local-only addresses
Apparently, neither our previous address parsing code nor Email::Address::XS recognizes local, username-only addresses in the form of <username> (without "@host"). Without this change, Email::Address::XS->address would return "undef", so we need to filter it out via "grep { defined }" It seems the cases where users email each other on the same machine is small and public-inbox won't be able to index addresses for those cases... Oh well :/
2019-12-15address: use Email::Address::XS if available
Email::Address::XS is a dependency of modern versions of Email::MIME, so it's likely loaded and installed on newer systems, already; and capable of handling more corner-cases than our pure-Perl fallback. We still fallback to the imperfect-but-good-enough-in-practice pure-Perl code while avoiding the non-XS Email::Address (which was susceptible to DoS attacks (CVE-2015-7686)). We just need to keep "git fast-import" happy.
2019-12-15address: use comment as name if no phrase available
Some users will set their From: headers in the form of: "<user@example.com> (A U Thor)", where their name is in the parenthesized comment. Use that instead of the email address, if available.
2019-12-15inbox: fix periodic git process cleanup
We need to use $PublicInbox::DS::in_loop instead of ::running(). The latter is not valid for systems with signalfd or kqueue and is now gone, completely. Not needing periodic cleanups at all to deal with unlinked pack indices will be a tougher task...
2019-12-15searchidx: do not modify read-only $1 via git_unquote
git_unquote works in-place, and we sometimes see strange filenames, or badly munged diffs with terminal escape characters (for colorization) end up in emails.
2019-12-14daemon: use DESTROY for unlinking --pid-file
This gets rid of the last "END{}" block in our code and cleans up a (temporary) circular reference. Furthermore, ensure the cleanup code still works in all configurations by adding tests and testing both the -W1 (default, 1 worker) and -W0 (no workers) code paths.
2019-12-14ds: move NNTP-only expiration code into DS
We'll be supporting idle timeout for the HTTP code in the future to deal directly with Internet-exposed clients w/o Varnish or nginx.
2019-12-14ds: move EvCleanup code into DS
EvCleanup only existed since Danga::Socket was a separate component, and cleanup code belongs with the event loop.
2019-12-12mbox: do not try to create filename from empty string
This was causing warnings to pop up in syslogs for messages with empty Subject headers.
2019-12-12msgtime: avoid obviously out-of-range dates (for now)
Wacky dates show up in lore for valid messages. Lets ignore them and let future generations deal with Y10K and time-travel problems.
2019-12-12Date::Parse is now optional
-mda should not be dealing with broken Date: headers nowadays, and deprioritize it in our documentation and internal checks.
2019-12-12msgtime: drop Date::Parse for RFC2822
Date::Parse is not optimized for RFC2822 dates and isn't packaged on OpenBSD. It's still useful for historical email when email clients were less conformant, but is less relevant for new emails.
2019-12-12git: async batch interface
This is a transitionary interface which does NOT require an event loop. It can be plugged into in current synchronous code without major surgery. It allows HTTP/1.1 pipelining-like functionality by taking advantage of predictable and well-specified POSIX pipe semantics by stuffing multiple git cat-file requests into the --batch pipe With xt/git_async_cmp.t and GIANT_GIT_DIR=git.git, the async interface is 10-25% faster than the synchronous interface since it can keep the "git cat-file" process busier. This is expected to improve performance on systems with slower storage (but multiple cores).
2019-12-11import: (cleanup) drop redundant env arg to run_die
run_die() doesn't require an $env arg, so there's no point passing "undef" to it.
2019-12-11spawn: remove support for clearing the env
It's unnecessary code which I'm not sure we ever used. In retrospect, completely clearing the environment doesn't make sense for the processes we spawn. We don't need to clobber individual environment variables in our code, either (and if we did for tests, we can use 'local').
2019-12-11ds: ->Reset initializes $nextq
I haven't noticed this being a problem in practice, but be consistent with the rest of the singleton stuff. Since we always call Reset() at load time, only do initialization in that sub and not at declaration.
2019-11-29replace: quiet "git gc" invocation
Since we give users no indication or control of how "git gc" runs, showing its progress is confusing.
2019-11-27httpd|nntpd: avoid missed signal wakeups
Our attempt at using a self-pipe in signal handlers was ineffective, since pure Perl code execution is deferred and Perl doesn't use an internal self-pipe/eventfd. In retrospect, I actually prefer the simplicity of Perl in this regard... We can use sigprocmask() from Perl, so we can introduce signalfd(2) and EVFILT_SIGNAL support on Linux and *BSD-based systems, respectively. These OS primitives allow us to avoid a race where Perl checks for signals right before epoll_wait() or kevent() puts the process to sleep. The (few) systems nowadays without signalfd(2) or IO::KQueue will now see wakeups every second to avoid missed signals.
2019-11-27dskqxs: fix missing EV_DISPATCH define
Oops, IO::KQueue support was broken due to this missing constant. Add a new ds-kqxs.t test case to ensure we test the IO::KQueue path if IO::KQueue is available.
2019-11-27msgtime: deal with strange minutes in TZ offsets
I'm not sure if TZ minute offsets aside from '00' or '30' exist, but lets just deal with them properly when negative. Examples taken from various inboxes on lore.kernel.org. These are mostly message from spammers, but some are legitimate messages.
2019-11-24xapcmd: replace Xtmpdirs with File::Temp->newdir
Since we're using Perl 5.10.1 and File::Temp 0.19+, we don't need Xtmpdirs at all for cleaning up tempdirs on failure and can just rely on the DESTROY handler provided by File::Temp.
2019-11-24daemon: avoid race when quitting workers
While the master process has a self-pipe to avoid missing signals, worker processes lack that aside from a pipe to detect master death. That pipe doesn't exist when there's no master process, so it's possible DS::close never finishes because it never woke up from epoll_wait. So create a pipe on the worker_quit signal and force it into epoll/kevent so it wakes up right away.
2019-11-24daemon: use sigprocmask when respawning workers
We need to block signals in workers during respawns until they're ready to receive signals.
2019-11-24daemon: use sigprocmask to block signals at startup
`$SIG{FOO} = "IGNORE"' will cause the daemon to miss signals entirely. Instead, we can use sigprocmask to block signal delivery until we have our signal handlers setup. This closes a race where a PID file can be written for an init script and a signal to be dropped via "IGNORE".
2019-11-24mboxgz: fix compiler parse error under under Perl 5.16.3
Perl 5.16.3 (and possibly older versions) fails with the following errors (from CentOS7): Use of ?PATTERN? without explicit operator is deprecated Search pattern not terminated
2019-11-24check for File::Temp 0.19 for ->newdir method
This is distributed with Perl 5.10.1 and onwards, so it should not be an installation burden for any users. I'm planning to move away from tempdir() entirely and use File::Temp->newdir to remove dependencies on END{} blocks.
2019-11-16spawn: which: allow embedded slash for relative path
This makes the subroutine behave more like which(1) command and will make using spawn() in tests easier.
2019-11-16xapcmd: do not fire END and DESTROY handlers in child
We need to bypass whatever Test::More does with END/DESTROY handlers for use in lon-lived process. This doesn't affect any of our normal code since we don't use END/DESTROY for Xapcmd and its callers.
2019-11-16import: only pass Inbox object to SearchIdx->new
SearchIdx->new no longer accepts a GIT_DIR path as its argument since commit 585314673236d664729fe3ab2d4fb229d1c0f2d5 ("searchidx: require PublicInbox::Inbox (or InboxWritable) ref")
2019-11-16inboxwritable: add ->cleanup method
We've been using this in -edit, and will be using it in some more scripts and tests to optimize for run_mode=2 with run_script. Keeping this in the *Writable modules since I don't see it being useful for the WWW and NNTP read-only interfaces which use PublicInbox::Inbox.
2019-11-16admin: get rid of singleton $CFG var
PublicInbox::Admin::config() just adds an extra layer of indirection which we barely rely on. So get rid of this global variable and make it easier to run tests in the future without relying on global state.
2019-11-16mboxgz: use Compress::Raw::Zlib instead of IO::Compress::Gzip
IO::Compress::Gzip is a wrapper around Compress::Raw::Zlib, anyways, and being able to easily detach buffers to return them via ->getline is nice. This results in a 1-2% performance improvement when fetching giant mboxes.
2019-11-16mbox: split mboxgz out into a separate file
It'll make using Compress::Raw::Zlib easier, since we can use that and import constants more easily.
2019-11-16mbox: unused mid_clean import
We're gradually phasing mid_clean out (in favor of mids()).
2019-11-14inboxwritable: drop {-importer} cyclic reference
InboxWritable caching the result of ->importer leads to a circular references with returned (V2Writable|Import) object holds onto the calling InboxWritable object. With public-inbox-watch, this leads to a memory leak if a user is reloading via SIGHUP after a message is imported (it would only become noticeable with SIGHUPs after every message imported). I would not expect anybody to to notice this in real-world usage. I only noticed this since I was making -xcpdb suitable for long-lived process use (e.g. "mod_perl style") and a flock remained unreleased on v1 inboxes after resharding. WatchMaildir (used by -watch) already handles caching of the importer object itself, and all of our other real-world uses of ->importer are short-lived or designed for batch scripts, so there's no need to cache the importer result internally.
2019-11-14xapcmd: localize %SIG changes using "local"
Perl's "local" allows changes to %SIG (and %ENV) to be limited to its enclosing block. This allows us to get rid of a global variable and ad-hoc method for restoring signal handlers.
2019-11-14solvergit: use --unidiff-zero with git-apply(1)
I sometimes post context-free documentation patches generated with "-U0" to reduce size and bandwidth overhead when replacing URLs or updating copyright notices. git-apply(1) needs the --unidiff-zero switch to work properly with context-free patches. Given our search looks for blob OIDs, and we're never going to be running the code we regenerate, "--unidiff-zero" ought to be safe.
2019-11-04index: "git log" failures are fatal
While I've never seen "git log" fail on its own, it could happen one day and we should be prepared to abort indexing when it happens. Beef up tests for t/spawn.t to ensure close() behaves on popen_rd the way we expect it to.
2019-11-03searchidxshard: reuse $SIG{__WARN__} callback from Admin
We don't want to define $SIG{__WARN__} in the worker to call an existing non-default callback. Instead update ->{current_info} the same way the V2Writable master process does. I noticed this while reindexing with a large XAPIAN_FLUSH_THRESHOLD and seeing the wrong epoch on my terminal from a shard because the shard worker was spawned while reindexing a higher-numbered epoch.
2019-10-31hval: replace "&apos;" with "&#39;" for compatibility
While testing 216light.css changes, I managed to hit some cases where dillo failed to render &apos; correctly, but I also can't reproduce it reliably. Anyways, it's definitely a problem with some old browsers and newer versions of highlight already work around it, but Debian 10.x has 3.41, so use "&#39;" to maximize compatibility.
2019-10-31qspawn: psgi_qx: delay callback until waitpid returns
We need to detect "git apply" failures reliably when patches fail. This is necessary for solving for blob 81c1164ae5 in https://public-inbox.org/git/ when at least two messages can solve for it (and one of them fails): 1. https://public-inbox.org/git/b9fb52b8-8168-6bf0-9a72-1e6c44a281a5@oracle.com/ 2. https://public-inbox.org/git/56664222-6c29-09dc-ef78-7b380b113c4a@oracle.com/
2019-10-31solvergit: deal with false-positive dfpost: results
When solving for blob 81c1164ae5 in https://public-inbox.org/git/, at least two messages get indexed with the dfpost result for that blob (after fixing MsgIter to decode all text/* parts): 1. https://public-inbox.org/git/b9fb52b8-8168-6bf0-9a72-1e6c44a281a5@oracle.com/ 2. https://public-inbox.org/git/56664222-6c29-09dc-ef78-7b380b113c4a@oracle.com/ However, only the first message contains a usable patch. So we must adjust SolverGit to account for multiple messages hitting the same "dfpost:" search result and attempt "git apply" on all results, not just the first. In the future, changes to SearchIdx.pm may rid us of invalid search results and speed up performance (at the expense of developer/indexing time); but we need to account for old search indices, first.
2019-10-31msgiter: attempt to decode all text/* bodies
We want to index text/x-patch and text/x-diff, at least, since "git format-patch" can generate a patch series as attachments using --attach.
2019-10-31msgiter: do not assume UTF-8 if Email::MIME->body_str succeeds
ISO-2202-JP and other non-UTF-8 messages need to be displayed correctly. Fixes: 7d82a8bc04ce ('handle "multipart/mixed" messages which are not multipart')
2019-10-30search: add note about SCHEMA_VERSION 15
--reindex has gotten better over the years, and having parallel Xapian DB directories would exceed all available disk space for some users with giant inboxes.