about summary refs log tree commit homepage
path: root/t
DateCommit message (Collapse)
2020-05-09remove most internal Email::MIME usage
We no longer load or use Email::MIME outside of comparison tests.
2020-05-09EmlContentFoo: Email::MIME::ContentType replacement
Since we're getting rid of Email::MIME, get rid of Email::MIME::ContentType, too; since we may introduce speedups down the line specific to our codebase.
2020-05-09replace most uses of PublicInbox::MIME with Eml
PublicInbox::Eml has enough functionality to replace the Email::MIME-based PublicInbox::MIME.
2020-05-09eml: pure-Perl replacement for Email::MIME
Email::MIME eats memory, wastes time parsing out all the headers, and some problems can't be fixed without breaking compatibility for other projects which depend on it. Informal benchmarks show a ~2x improvement in general stats gathering scripts and ~10% improvement in HTML view rendering. We also don't need the ability to create MIME messages, just parse them and maybe drop an attachment. While this isn't the zero-copy or streaming MIME parser of my dreams; it's still an improvement in that it doesn't keep a scalar copy of the raw body around along with subparts. It also doesn't parse subparts up front, so it can also replace our uses of Email::Simple.
2020-05-09msg_iter: pass $idx as a scalar, not array
This doesn't make any difference for most multipart messages (or any single part messages). However, this starts having space savings when parts start nesting. It also slightly simplifies callers.
2020-05-09search: support searching on List-Id
We'll support both probabilistic matches via `l:' and boolean matches via `lid:' for exact matches, similar to how both `m:' and `mid:' are supported. Only text inside angle braces (`<' and `>') are supported, since I'm not sure if there's value in searching on the optional phrases (which would require decoding with ->header_str instead of ->header_raw).
2020-05-03t/convert-compact: avoid warning on `scalar(split(...))'
Perl 5.10.1 would warn about implicit assignment to @_ by split(). So favor the documented method of using `tr' to count lines. Fixes: b5ddcb3352ef31ae ("index: support --compact / -c on command-line")
2020-05-03t/httpd-corner.t: fix uninitialized warning
Current versions of Perl don't warn when vec() is given `undef' as its first arg, but Perl 5.10.1 does, at least. Fixes: c7b4cbdadf3116a0 ("t/httpd-corner: improve reliability and diagnostics")
2020-04-30t/precheck: remove Email::Simple->create from tests
It's likely we'll replace Email::Simple using our Email::MIME alternative/replacement, as well. So reduce the API surface we interact with and make it easier to swap implementations.
2020-04-26tests: replace mime_from_path with mime_load
mime_from_path is designed to fail gracefully in busy Maildirs whereas mime_load was made for loading files from a work tree.
2020-04-26tests: remove Email::MIME->create use entirely
Replace them with .eml files generated with the help of Email::MIME, but without some extraneous and unnecessary headers, and strip mime_load down to just loading files. This will give us more freedom to experiment with other mail libraries which may be more correct, better maintained, use less memory and/or be faster than Email::MIME.
2020-04-26testcommon: introduce mime_load sub
We'll use this to create, memoize, and reuse .eml files. This will be used to reduce (and eventually eliminate) our dependency on Email::MIME in tests.
2020-04-22t/mda.t: avoid needless use of Email::Simple
Totally pointless to create an object only to convert it back to a raw string for -mda input.
2020-04-22t/*.t: reduce dependency on Email::MIME APIs
Instead, favor PublicInbox::MIME->new for non-attachment emails. We may support alternatives to Email::MIME down the line. We'll still keep Email::MIME->create to deal with attachments, for now, but there's also a fair amount of test duplication we should eliminate, later.
2020-04-22t/*.t: use Email::MIME->create over PublicInbox::MIME->create
PublicInbox::MIME only supports ->new, and is only different from Email::MIME for old versions of Email::MIME. In the future, PublicInbox::MIME may not be a subclass of Email::MIME at all.
2020-04-22t/feed: remove useless $ENV{GIT_DIR} assignment
I don't think this has been useful since we stopped supporting ssoma in this test.
2020-04-21t/nntpd: die if we can't open stderr output
We need to detect FS errors and bail out on the test if we can't open a file -nntpd was just writing to.
2020-04-21t/nntpd: reduce dependencies on internal API
Since the advent of run_script(), we can rely on it to simplify our test code. Changes like this will let us evolve the internal API more easily while preserving stable CLI interfaces, especially since we test the v2 path by default, now.
2020-04-21t/nntpd: fix lsof check w/ TEST_RUN_MODE=0
The `xqx' sub requires an absolute path for optional commands. Fixes: 6e07def560b211d9 ("testcommon: spawn-aware system() and qx[] workalikes")
2020-04-21index: support --max-size / publicinbox.indexMaxSize
In normal mail paths, we can rely on MTAs being configured with reasonable limits in the -watch and -mda mail injection paths. However, the MTA is bypassed in a git-only delivery path, a BOFH could inject a large message and DoS users attempting to mirror a public-inbox. This doesn't protect unindexed WWW interfaces from Email::MIME memory explosions on v1 inboxes. Probably nobody cares about unindexed WWW interfaces anymore, especially now that Xapian is optional for indexing.
2020-04-20testcommon: spawn-aware system() and qx[] workalikes
Barely noticeable on Linux, but this gives a 1-2% speedup on a FreeBSD 11.3 VM and lets us use built-in redirects rather than relying on /bin/sh.
2020-04-20t/ds-leak: use BSD::Resource
We use BSD::Resource in other places, so there's no sense in avoiding it, here.
2020-04-20import: init_bare: allow use as method, use in tests
Allowing ->init_bare to be used as a method saves some keystrokes, and we can save a little bit of time on systems with our vfork(2)-enabled spawn(). This also sets us up for future improvements where we can avoid spawning a process at all.
2020-04-20watchmaildir: support multiple watchheader values
The watchheader key supports only a single value. Supporting multiple watchheader values was mentioned in discussion [1] of 8d3e3bd8 (doc: explain publicinbox.<name>.watchheader, 2019-10-09), and it wasn't clear if there was a need. One scenario in which matching multiple headers would be convenient is when someone wants to set up public-inbox archives for some small projects but does _not_ want to run mailing lists for them, instead allowing others to follow the project by any of the pull mechanisms. Using a common underlying address, an address alias for each project is configured via a third-party email provider, with messages for each alias being exposed as a separate public-inbox archive. In this setup, messages for an inbox cannot be selected by a List-ID header but can be identified by the inbox's address in either the To or Cc header. To support such a use case, update the watchheader handling to consider multiple values, accepting a message if it matches any value. While selecting a message based on matching _any_ rather than _all_ values is motivated by the above scenario, it's worth noting that the "any" behavior is consistent with how multiple listid config values are handled. [1] https://public-inbox.org/meta/20191010085118.r3amey4cayazfycb@dcvr/
2020-04-19t/v*-add-remove-add: fix typo in description of 'removed' check
2020-04-19reduce scope of mbox From_ line removal
It's unnecessary overhead for anything which does Email::MIME parsing. It was never done for v2 indexing, even though v1->v2 conversions did NOT remove those From_ lines. There was never a need to remote From_ lines the v1 SearchIdx paths, either. Hitting a /$INBOX_URL/$MSGID/T/ endpoint with an 18 message thread reveals a ~0.5% speed improvement. This will become more apparent when we have a faster MIME parser.
2020-04-19favor `do {}' over `eval {}' for localized slurp
I did not know to use the return value of `do' back in the day. There's probably no practical difference in these cases, but `eval' is overkill for these uses and may hide actual errors. We can get rid of a few redundant `scalar' ops and pass scalar refs to Email::MIME->new to avoid copies in a few more places, too.
2020-04-19inbox: don't memoize missing description|cloneurl
It's probably common to have inboxes initially setup without these files properly configured, so don't memoize at that stage.
2020-04-19inboxwritable: mime_from_path: reuse in more places
There's nothing Maildir-specific about the function, so `maildir_path_load' was a bad name. So give it a more appropriate name and use it in our tests. This save ourselves some code and inconsistency by reusing an existing internal library routine in more places. We can drop the "From_" line in some of our (formerly) mbox sample files.
2020-04-17searchthread: reduce indirection by removing container
We can rid ourselves of a layer of indirection by subclassing PublicInbox::Smsg instead of using a container object to hold each $smsg. Furthermore, the `{id}' vs. `{mid}' field name confusion is eliminated. This reduces the size of the $rootset passed to walk_thread by around 15%, that is over 50K memory when rendering a /$INBOX/ landing page.
2020-04-17t/httpd-unix: skip some tests w/o signalfd|EVFILT_SIGNAL
Some of these tests just don't seem reliable enough with the way we or Perl do portable signal handling.
2020-04-16t/httpd-corner: improve reliability and diagnostics
The graceful-shutdown-on-PUT test is unreliable because we can't rely on a FIFO as we do with the GET tests. So increase the delay to 100ms since that seems enough on my system even with CONFIG_HZ=100. Add a timeout and backtrace to the $check_self sub to help with further diagnostics while we're at it, too. It would be nice if there were a portable syscall tracing mechanism we could attach to the -httpd process to make the test more determistic...
2020-04-15t/httpd-corner.t: relax read-after-failed-write handling
I've observed FreeBSD 11.2 read(2) having one of three behaviors after a failed write(2) on a socket: 1) returning number of bytes read 2) failing with ECONNRESET 3) returning with EOF 1) is the most common, and I've only seen 1) on Linux. It may be possible to use SO_LINGER or shutdown(2) to ensure 1) always happens, but SO_LINGER behavior seems inconsistent across OSes, especially with non-blocking sockets. Since these tests are corner-cases where we're dealing with broken/malicious clients, lets continue spending the least amount of syscalls protecting ourselves in the daemon and instead make the client-side test code tolerate more socket implementations.
2020-04-15t/*.t: localize $SIG{__WARN__} changes
We don't want to propagate %SIG changes to other tests when running multiple tests within the same process via t/run.perl.
2020-04-09t/httpd-unix: improve test reliability
Net::Server::Daemonize::create_pid_file does not write the PID file atomically, so we need to barf if it's incomplete.
2020-04-09triewyde: ficks soem speling errrors
Dikshunarees R gude!
2020-04-09tests: document run_mode=1 as not implemented
It was implemented at some point, but it was more things to support and the worst of both worlds: both unrealistic compared to real-world use and slower than run_mode=2. Noticed while looking for speling erorrs.
2020-04-03quiet "Complex regular subexpression recursion limit" warnings
These seem mostly harmless since Perl will just truncate the match and start a new one on a newline boundary in our case. The only downside is we'd end up with redundant <span> tags in HTML. Limiting the number of line matched ourselves with `{1,$NUM}' doesn't seem prudent since lines vary in length, so we continue to defer the job of limiting matches to the Perl regexp engine. I've noticed this warning in practice on 100K+ line patches to locale data.
2020-04-03view: handle the topic-free case properly
There may be no topics for a given timestamp range, so don't attempt to treat `undef' as an arrayref.
2020-03-31v2writable: index Message-IDs w/ spaces properly
Message-IDs can apparently contain spaces and other weird characters. Ensure we pass those properly to shard subprocesses when importing messages in parallel mode. Our NNTP request parser does not deal with spaces in the Message-ID, yet, and I don't expect most NNTP clients to, either. Nor does the Net::NNTP client handle them in responses.
2020-03-30t/multi-mid: allow test to run w/o Xapian
While the v1 inbox in this test is created without Xapian, the v2 inbox in this test defaults to having Xapian enabled regardless of whether it's installed or not. Fixes: c7acdfe78bda5bf3 ("v2: SDBM-based multi Message-ID queue")
2020-03-30t/filter_rubylang.t: avoid warning for non-word prefix
The "-" was never supported by Xapian in the prefix, but it could still be used to make documentation and URLs more readable in certain cases. Fixes: 7909c5f7439777e3 ("altid: warn about non-word prefixes")
2020-03-29index: support --compact / -c on command-line
It's more convenient to specify `-c' / `--compact' on the command-line when reindexing than it is to invoke public-inbox-compact(1) separately. This is especially convenient in low-space situations when public-inbox-index is operating on multiple inboxes sequentially, as compaction can happen immediately after indexing each inbox, instead of waiting until all inboxes are indexed.
2020-03-29searchidxshard: ensure we set indexlevel on shard[0]
For sharded v2 repositories with few-enough messages, it is possible for shard[0] to go unused and never trigger the ->commit_txn_lazy to set the indexlevel field in Xapian metadata. So set it immediately at initialization and avoid this case. While we're at it, avoid triggering needless pwrite syscalls from ->set_metadata by checking with ->get_metadata, first.
2020-03-25www: add endpoint to retrieve altid dumps
This ensures all our indexed data, including data from altid searches (e.g. "gmane:$ARTNUM") is retrievable. It uses a "POST" request to avoid wasting cycles when invoked by crawlers, since it could potentially be several megabytes of data not indexable by search engines.
2020-03-25qspawn: handle ENOENT (and other errors on exec)
As sqlite3(1) and other executables may become unavailable or uninstalled while a daemon runs, we need to gracefully handle errors in those cases.
2020-03-25qspawn: reinstate filter support, add gzip filter
We'll be supporting gzipped from sqlite3(1) dumps for altid files in future commits. In the future (and if we survive), we may replace Plack::Middleware::Deflater with our own GzipFilter to work better with asynchronous responses without relying on memory-intensive anonymous subs.
2020-03-24daemon: unlink .oldbin PID file correctly
We need to track the PID file having ".oldbin" appended to it while a SIGUSR2 upgrade is in progress and ensure it is unlinked on SIGQUIT.
2020-03-24daemon: fix SIGUSR2 upgrade with -W0 (no workers)
Disabling workers via `-W0' blesses the contents of the @listeners array, so we need to ensure we call fcntl on the GLOB ref in ->{sock}. Add tests to ensure USR2 works regardless of whether workers are enabled or not.
2020-03-22v2: SDBM-based multi Message-ID queue
This lets us store author and committer times for deferred indexing messages with ambiguous Message-IDs. This allows us to reproducibly reindex messages with the git commit and author times when a rare message lacks Received and/or Date headers while having ambiguous Message-IDs.