path: root/t
Date Commit message
2020-04-17 t/httpd-unix: skip some tests w/o signalfd|EVFILT_SIGNAL
Some of these tests just don't seem reliable enough with the way we or Perl do portable signal handling.
2020-04-16 t/httpd-corner: improve reliability and diagnostics
The graceful-shutdown-on-PUT test is unreliable because we can't rely on a FIFO as we do with the GET tests. So increase the delay to 100ms since that seems enough on my system even with CONFIG_HZ=100. Add a timeout and backtrace to the $check_self sub to help with further diagnostics while we're at it, too. It would be nice if there were a portable syscall tracing mechanism we could attach to the -httpd process to make the test more deterministic...
2020-04-15 t/httpd-corner.t: relax read-after-failed-write handling
I've observed FreeBSD 11.2 read(2) having one of three behaviors after a failed write(2) on a socket: 1) returning the number of bytes read, 2) failing with ECONNRESET, or 3) returning EOF. Behavior 1) is the most common, and it is the only one I've seen on Linux. It may be possible to use SO_LINGER or shutdown(2) to ensure 1) always happens, but SO_LINGER behavior seems inconsistent across OSes, especially with non-blocking sockets. Since these tests are corner-cases where we're dealing with broken/malicious clients, let's continue spending the fewest syscalls protecting ourselves in the daemon and instead make the client-side test code tolerate more socket implementations.
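A minimal sketch of what "tolerating more socket implementations" can look like on the client side. It is written in Python for illustration and is not the project's Perl test code; the helper name `tolerant_read` is hypothetical. It folds ECONNRESET into the same result as EOF, so all three observed behaviors are accepted.

```python
import socket


def tolerant_read(sock: socket.socket, nbytes: int = 4096) -> bytes:
    """Read from sock, treating a connection reset the same as EOF.

    Returns the bytes read, or b"" when the peer is gone (EOF or ECONNRESET).
    """
    try:
        return sock.recv(nbytes)
    except ConnectionResetError:
        # ECONNRESET after a failed write: the peer is gone, same as EOF
        # for the purposes of the test.
        return b""
```

A test built on this helper only needs to check "got a response or the connection ended", rather than depending on which of the three outcomes a given kernel produces.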
2020-04-15 t/*.t: localize $SIG{__WARN__} changes
We don't want to propagate %SIG changes to other tests when running multiple tests within the same process via t/run.perl.
2020-04-09 t/httpd-unix: improve test reliability
Net::Server::Daemonize::create_pid_file does not write the PID file atomically, so we need to barf if it's incomplete.
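The check described above can be sketched roughly as follows, in Python rather than the project's Perl. The function name `read_pid_file` is hypothetical, and the rule that a complete PID file is decimal digits followed by a newline is an assumption made for this sketch, not something the commit message states.

```python
import re
from pathlib import Path


def read_pid_file(path: str) -> int:
    """Return the PID stored in path, or raise if the file looks incomplete."""
    data = Path(path).read_text()
    # Assumption: a complete PID file is decimal digits plus a trailing newline.
    # Anything else (empty file, digits without newline) is a partial write.
    if not re.fullmatch(r"[0-9]+\n", data):
        raise ValueError(f"incomplete or malformed PID file: {data!r}")
    return int(data)
```

Failing loudly on a partial file keeps a test from signaling the wrong process when it races with a non-atomic writer.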
2020-04-09 triewyde: ficks soem speling errrors
Dikshunarees R gude!
2020-04-09 tests: document run_mode=1 as not implemented
It was implemented at some point, but it was more things to support and the worst of both worlds: both unrealistic compared to real-world use and slower than run_mode=2. Noticed while looking for speling erorrs.
2020-04-03 quiet "Complex regular subexpression recursion limit" warnings
These seem mostly harmless since Perl will just truncate the match and start a new one on a newline boundary in our case. The only downside is we'd end up with redundant <span> tags in HTML. Limiting the number of lines matched ourselves with `{1,$NUM}' doesn't seem prudent since lines vary in length, so we continue to defer the job of limiting matches to the Perl regexp engine. I've noticed this warning in practice on 100K+ line patches to locale data.
2020-04-03 view: handle the topic-free case properly
There may be no topics for a given timestamp range, so don't attempt to treat `undef' as an arrayref.
2020-03-31 v2writable: index Message-IDs w/ spaces properly
Message-IDs can apparently contain spaces and other weird characters. Ensure we pass those properly to shard subprocesses when importing messages in parallel mode. Our NNTP request parser does not deal with spaces in the Message-ID, yet, and I don't expect most NNTP clients to, either. Nor does the Net::NNTP client handle them in responses.
2020-03-30 t/multi-mid: allow test to run w/o Xapian
While the v1 inbox in this test is created without Xapian, the v2 inbox in this test defaults to having Xapian enabled regardless of whether it's installed or not. Fixes: c7acdfe78bda5bf3 ("v2: SDBM-based multi Message-ID queue")
2020-03-30 t/filter_rubylang.t: avoid warning for non-word prefix
The "-" was never supported by Xapian in the prefix, but it could still be used to make documentation and URLs more readable in certain cases. Fixes: 7909c5f7439777e3 ("altid: warn about non-word prefixes")
2020-03-29 index: support --compact / -c on command-line
It's more convenient to specify `-c' / `--compact' on the command-line when reindexing than it is to invoke public-inbox-compact(1) separately. This is especially convenient in low-space situations when public-inbox-index is operating on multiple inboxes sequentially, as compaction can happen immediately after indexing each inbox, instead of waiting until all inboxes are indexed.
2020-03-29 searchidxshard: ensure we set indexlevel on shard[0]
For sharded v2 repositories with few-enough messages, it is possible for shard[0] to go unused and never trigger the ->commit_txn_lazy to set the indexlevel field in Xapian metadata. So set it immediately at initialization and avoid this case. While we're at it, avoid triggering needless pwrite syscalls from ->set_metadata by checking with ->get_metadata, first.
2020-03-25 www: add endpoint to retrieve altid dumps
This ensures all our indexed data, including data from altid searches (e.g. "gmane:$ARTNUM") is retrievable. It uses a "POST" request to avoid wasting cycles when invoked by crawlers, since it could potentially be several megabytes of data not indexable by search engines.
2020-03-25 qspawn: handle ENOENT (and other errors on exec)
As sqlite3(1) and other executables may become unavailable or uninstalled while a daemon runs, we need to gracefully handle errors in those cases.
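As an illustration only (Python, not the Perl qspawn code), handling a missing executable gracefully can look like the sketch below. The helper name `spawn` and the path used in the usage note are hypothetical.

```python
import subprocess


def spawn(cmd):
    """Start cmd; return (process, None) on success or (None, error) on failure.

    A long-running daemon can turn the error into an error response
    instead of dying when a helper binary has been uninstalled.
    """
    try:
        return subprocess.Popen(cmd, stdout=subprocess.PIPE), None
    except OSError as e:
        # Covers FileNotFoundError (ENOENT) as well as other exec failures
        # such as EACCES.
        return None, e
```

For example, `spawn(["/nonexistent/hypothetical-sqlite3"])` returns `(None, err)` with `err.errno` set to ENOENT rather than raising into the caller.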
2020-03-25 qspawn: reinstate filter support, add gzip filter
We'll be supporting gzipped dumps from sqlite3(1) for altid files in future commits. In the future (and if we survive), we may replace Plack::Middleware::Deflater with our own GzipFilter to work better with asynchronous responses without relying on memory-intensive anonymous subs.
2020-03-24 daemon: unlink .oldbin PID file correctly
We need to track the PID file having ".oldbin" appended to it while a SIGUSR2 upgrade is in progress and ensure it is unlinked on SIGQUIT.
2020-03-24 daemon: fix SIGUSR2 upgrade with -W0 (no workers)
Disabling workers via `-W0' blesses the contents of the @listeners array, so we need to ensure we call fcntl on the GLOB ref in ->{sock}. Add tests to ensure USR2 works regardless of whether workers are enabled or not.
2020-03-22 v2: SDBM-based multi Message-ID queue
This lets us store author and committer times for deferred indexing messages with ambiguous Message-IDs. This allows us to reproducibly reindex messages with the git commit and author times when a rare message lacks Received and/or Date headers while having ambiguous Message-IDs.
2020-03-22 *idx: pass smsg in even more places
We can finally get rid of the awkward, ad-hoc use of V2Writable, SearchIdx, and OverIdx args for passing {cotime} and {autime} between classes. We'll still use those git time fields internally within V2Writable and SearchIdx for (re)indexing, but that's not worth avoiding as a fallback.
2020-03-22 *idx: pass $smsg in more places instead of many args
We can pass blessed PublicInbox::Smsg objects to internal indexing APIs instead of having long parameter lists in some places. The end goal is to avoid parsing redundant information each step of the way and hopefully make things more understandable.
2020-03-22 rename PublicInbox::SearchMsg => PublicInbox::Smsg
Since the introduction of over.sqlite3, SearchMsg is not tied to our search functionality in any way, so stop confusing ourselves and future hackers by just calling it "PublicInbox::Smsg". Add a missing "use" in ExtMsg while we're at it.
2020-03-22 index: use git commit times on missing Date/Received
When indexing messages without Date: and/or Received: headers, fall back to using timestamps originally recorded by git in the commit object. This allows git mirrors to preserve the import datestamp and timestamp of a message according to what was fed into git, instead of blindly falling back to the current time.
2020-03-21 t/msgtime: skip test if timezone isn't UTC
Date::Parse falls back to using the local timezone when it's missing from an email, so only test in a reasonable TZ (UTC) for server software.
2020-03-21 t/www_listing: avoid 'once' warnings
We reach into the WwwListing package directly to retrieve that JSON encoder/decoder object, and we can't rely on `use' since WwwListing loading may fail if Plack is missing.
2020-03-20 wwwlisting: avoid lazy loading JSON module
We already lazy-load WwwListing for the CGI script, and hiding another layer of lazy-loading makes it difficult to do WWW->preload. We want long-lived processes to do all long-lived allocations up front to avoid fragmentation in the allocator, but we'll still support short-lived processes by lazy-loading individual modules in the PublicInbox::* namespace. Mixing up allocation lifetimes (e.g. doing immortal allocations while a large amount of space is taken by short-lived objects) will cause fragmentation in any allocator which favors large contiguous regions for performance reasons. This includes any malloc implementation which relies on sbrk() for the primary heap, including glibc malloc.
2020-03-19 http: fix RFC conformance w.r.t. message length
We need to favor "Transfer-Encoding: chunked" over the value of the Content-Length header. We should also reject bogus, duplicate and/or unreasonable values for both these, since they can trigger unexpected behavior when combined with other HTTP parsers in proxies such as varnish, nginx, haproxy, etc... See RFC 7230 (and RFC 2616) for more details: https://tools.ietf.org/html/rfc7230 https://www.rfc-editor.org/errata_search.php?rfc=7230
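The precedence rule can be sketched as below. This is a Python illustration under stated assumptions, not the project's Perl HTTP parser: the function name `body_framing` and the `MAX_BODY` limit are hypothetical, and rejecting every duplicate Content-Length (even with identical values) is a deliberately strict choice for the sketch.

```python
MAX_BODY = 1 << 30  # hypothetical upper bound on an acceptable body size


def body_framing(headers):
    """Decide how a request body is delimited.

    headers is a list of (name, value) pairs as received.
    Returns ("chunked", None), ("length", n), or ("none", 0).
    Raises ValueError for ambiguous or malformed framing headers.
    """
    te = [v.strip().lower() for k, v in headers
          if k.lower() == "transfer-encoding"]
    cl = [v.strip() for k, v in headers if k.lower() == "content-length"]
    if te:
        if te != ["chunked"]:
            raise ValueError("unsupported or duplicate Transfer-Encoding")
        # Transfer-Encoding wins; any Content-Length is ignored.
        return ("chunked", None)
    if not cl:
        return ("none", 0)
    if len(cl) > 1:
        raise ValueError("duplicate Content-Length")
    if not (cl[0].isascii() and cl[0].isdigit()):
        raise ValueError("invalid Content-Length")
    n = int(cl[0])
    if n > MAX_BODY:
        raise ValueError("unreasonable Content-Length")
    return ("length", n)
```

Rejecting ambiguous input outright, rather than guessing, matters most behind a proxy: if the proxy and the backend disagree about where a body ends, one request can be smuggled inside another.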
2020-03-07 searchmsg: allow lines (and bytes) to be zero
We will occasionally see legit messages with zero lines; be sure we index that count for NNTP clients. I'm not sure about bytes being zero (aside from purged messages), but we should've dealt with that earlier up the stack.
2020-03-01 msgtime: assume +0000 if TZ missing when using Date::Parse
Some old emails don't have timezone offsets. Since our Date::Parse code path takes a liberal interpretation of dates, fall back to using "+0000" as the timezone offset, since it's closer to the actual date of the message than whatever the current date is. Reported-by: Leah Neukirchen <leah@vuxu.org> Link: https://public-inbox.org/meta/87h7zfemur.fsf@vuxu.org/ Fixes: ae80a3fdb53d7014 ("MsgTime.pm: Use strptime to compute the time zone")
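The same fallback can be illustrated with Python's standard library, as an analogue of the Date::Parse code path and not a copy of it. The function name `msg_epoch` is hypothetical. `email.utils.parsedate_tz` reports a missing zone as `None`, and the sketch treats that as an offset of zero instead of the local timezone.

```python
import calendar
from email.utils import parsedate_tz


def msg_epoch(date_hdr: str) -> int:
    """Convert a Date: header value to seconds since the Unix epoch.

    A date with no timezone is interpreted as +0000.
    """
    t = parsedate_tz(date_hdr)
    if t is None:
        raise ValueError(f"unparseable date: {date_hdr!r}")
    # t[9] is the UTC offset in seconds, or None when the zone is missing.
    offset = t[9] if t[9] is not None else 0
    # timegm() interprets the broken-down time as UTC; subtract the offset
    # to get the real instant.
    return calendar.timegm(t[:6]) - offset
```

With this, a zone-less date and the same date with an explicit "+0000" yield the same epoch value, independent of the TZ the process runs in.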
2020-03-01 import: drop '<' and '>' characters in addresses
Some strange "From:" lines will cause Email::Address::XS to leave '<' (and presumably '>') in the address which git-fast-import won't accept even if quoted. Work around this problem by deleting '<' and '>' the same way we delete them for the ident name. Reported-by: Leah Neukirchen <leah@vuxu.org> Link: https://public-inbox.org/meta/87h7zfemur.fsf@vuxu.org/
2020-02-24 v2writable: make remove return-compatible w/ Import::remove
Import::remove is a documented interface, and the return value of the V2Writable work-alike should try to be compatible with what Import implements.
2020-02-24 hval: ascii_html: drop CRLF => LF conversion
Instead, we add CRLF conversion to the only remaining place which needs it, ViewVCS. This saves many redundant ops in many places. The only other place where this mattered was in View::add_text_body, but we already started doing CRLF conversions when we added diff parsing and link generation for ViewVCS. Otherwise, all other places we used this were for header viewing and Email::MIME doesn't preserve CRLF in headers.
2020-02-16 view: escape ampersand in Message-IDs
We need to escape ampersands (and some other characters for href attributes), so introduce a `mid_href' sub to do just that. '<', '>' and '"' were always escaped, so there's no risk of tag or attribute injection, but creative Message-IDs could cause confusion for some parsers and generate invalid URLs. Start getting rid of the bloated, over-engineered OO Hval API while we're at it; I only noticed this bug because I started killing off Hval->new* callers.
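A rough Python sketch of the idea follows. Only the name `mid_href` comes from the commit message; the body is a guess at the technique (percent-encode for the URL, then HTML-escape for the attribute), and the exact set of characters the real Perl sub leaves unescaped is not shown in this log.

```python
import html
from urllib.parse import quote


def mid_href(mid: str) -> str:
    """Return a Message-ID made safe for use inside an href attribute."""
    # Percent-encode everything except unreserved characters and "@".
    # Keeping "@" readable is an assumption of this sketch.
    url_part = quote(mid, safe="@")
    # HTML-escape as a second layer. After quote() no HTML-special
    # characters remain, so this is a no-op kept for defense in depth.
    return html.escape(url_part, quote=True)
```

The order matters: escaping for the URL first means an ampersand becomes `%26` and can no longer be misread as a query-string separator or the start of an HTML entity.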
2020-02-15 t/msg_iter: test for X-UNKNOWN charset from Alpine
A long overdue test for behavior established in 2016. Fixes: 1b28cc7f00a866cb ("view: try assuming UTF-8 for bogus charsets")
2020-02-08 t/multi-mid: skip properly w/o DBD::SQLite
SearchIdx always requires DBD::SQLite, so only require it after we've passed `require_mods(qw(DBD::SQLite))'.
2020-02-07 tests: switch to XML::TreePP for testing Atom feeds
XML::Feed pulls in a lot of dependencies, some of which are XS. That makes testing with blead or any non-OS-supplied Perl installations more time consuming and more difficult because of the need to have development headers and libraries for libexpat1 or libxml2. Performance from libexpat1 or libxml2 for our small test cases isn't relevant, either, and the pure Perl XML::TreePP seems up to the task. It's also available in CentOS 7.x, FreeBSD 11.x, and Debian, at least.
2020-02-07 syscall: support Linux x32 ABI
The x32 ABI allows users to take advantage of the extra registers on x86-64 without the bloat of 64-bit pointers and longs. This ought to be significant since Perl was designed when 32-bit was prevalent; and the common structs for ops, hashes, scalars, and arrays use longs (SSize_t/Size_t) for things which should never need 64 bits when processing emails. Debian's x32 port seems to work quite nicely under a chroot on an amd64 Linux system. All tests pass under x32, now.
2020-02-06 treewide: run update-copyrights from gnulib for 2019
I didn't wait until September to do it, this year!
2020-02-06 t/multi-mid: don't access ~/.public-inbox/config
It can cause unpredictable behavior and also slow things down. Followup-to: e4d3be19612b2082 ("t: localize the PI_CONFIG env")
2020-02-04 over: simplify read-only vs read-write checking
No need to call ref() and do a string comparison. Add some extra tests using the {ReadOnly} attribute in DBI.pm.
2020-02-04 www: serve $INBOX_DIR/description as $INBOX_URL/description
Instead of serving $INBOX_DIR/all.git/description, since $INBOX_DIR/all.git/description is not described in the default message when it's missing.
2020-02-04 www: stricter regexp for 405 errors
We want to match "GET" and "HEAD" exactly, not requests which start with "GET" or end with "HEAD". This doesn't seem like a real problem for public-inboxes which are actually public data anyways.
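The symptom described ("start with GET or end with HEAD") is what an alternation anchored only at its outer ends produces, because the anchors bind to one branch each. The Python sketch below shows that pitfall and a strict alternative. The loose pattern is an assumption chosen to reproduce the symptom, not a quote of the project's original regexp, and the function names are hypothetical.

```python
import re

# Parses as (^GET) | (HEAD$): each anchor applies to only one branch.
LOOSE = re.compile(r"^GET|HEAD$")
# Grouping the alternation makes both anchors apply to both branches.
STRICT = re.compile(r"\A(?:GET|HEAD)\Z")


def allowed_loose(method: str) -> bool:
    """True when the loosely anchored pattern accepts the method."""
    return LOOSE.search(method) is not None


def allowed_strict(method: str) -> bool:
    """True only for exactly "GET" or exactly "HEAD"."""
    return STRICT.search(method) is not None
```

With these definitions, `allowed_loose("GETX")` and `allowed_loose("XHEAD")` are both true, while `allowed_strict` accepts only the two exact method names.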
2020-02-02 convert: fix --no-index switch
The (currently undocumented) "--no-index" flag did not trigger the V2Writable->done call necessary to make the import successful. Fixes: eea47b676127bcdb ("convert: preserve highwater mark from v1 msgmap")
2020-02-02 v2writable: nproc_shards: subtract 1 from given value
This is to be consistent with the `nproc(1)' code path. It also quiets down a warning from Admin when "-j $JOBS" is specified, since the master process (which distributes work to shards and handles OverIdx and Msgmap) is considered a job on its own.
2020-02-02 t/multi-mid.t: extra test for -convert highwater mark
This is derived from a real-world test case where I encountered multiple Message-IDs in a v1 inbox causing regen problems. Fixes: eea47b676127bcdb ("convert: preserve highwater mark from v1 msgmap")
2020-01-31 convert: preserve highwater mark from v1 msgmap
If we're reusing the msgmap from a v1 inbox, we also need to ensure the highwater mark doesn't get doubled in the v1->v2 conversion by internally triggering the equivalent of "--reindex" on a fresh v2 inbox. This was needed to convert an indexed v1 inbox which featured messages with multiple Message-IDs in it. Fresh, unindexed clones of v1 inboxes would not have been affected by this.
2020-01-31 mboxgz: ensure gzipped mboxes always have filenames
Let's always have Content-Disposition for files intended to be downloaded for consumption by non-browsers, such as pigz, zcat, "git am". This is also to be consistent with the non-gzipped mbox $MESSAGE_ID/raw endpoint.
2020-01-31 t/psgi_search: test for subject-free messages
Apparently I fixed this bug a while back in commit f94c3a195a25a31d0215cd175938008fca473378 but did not write tests.
2020-01-28 v2writable: newest epochs go first in alternates
New epochs are the most likely to have loose objects. git won't be able to take advantage of pack indices and needs to scan every alternate for the loose object via open/openat syscalls. Those syscalls will add up some day when we've got hundreds or thousands of epochs.
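The ordering can be sketched in a few lines of Python (an illustration, not the V2Writable code). The function name `alternates_lines` is hypothetical, and the relative path layout `../../git/$EPOCH.git/objects` is an assumption about how a v2 inbox lays out its epochs.

```python
def alternates_lines(epochs):
    """Return alternates entries with the newest (highest-numbered) epoch first.

    epochs is an iterable of integer epoch numbers in any order.
    """
    # Assumption: each epoch lives at ../../git/<n>.git/objects relative
    # to the objects directory that holds the alternates file.
    return [f"../../git/{n}.git/objects" for n in sorted(epochs, reverse=True)]
```

Listing the newest epoch first lets a loose-object lookup that targets a recent message succeed on the first alternate instead of probing every older one.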