public-inbox.git - an "archives first" approach to mailing lists

Date	Commit message
2020-04-22	t/*.t: reduce dependency on Email::MIME APIs
	Instead, favor PublicInbox::MIME->new for non-attachment emails. We may support alternatives to Email::MIME down the line. We'll still keep Email::MIME->create to deal with attachments, for now, but there's also a fair amount of test duplication we should eliminate, later.
2020-04-22	t/*.t: use Email::MIME->create over PublicInbox::MIME->create
	PublicInbox::MIME only supports ->new, and is only different from Email::MIME for old versions of Email::MIME. In the future, PublicInbox::MIME may not be a subclass of Email::MIME at all.
2020-04-22	t/feed: remove useless $ENV{GIT_DIR} assignment
	I don't think this has been useful since we stopped supporting ssoma in this test.
2020-04-21	t/nntpd: die if we can't open stderr output
	We need to detect FS errors and bail out on the test if we can't open a file -nntpd was just writing to.
2020-04-21	t/nntpd: reduce dependencies on internal API
	Since the advent of run_script(), we can rely on it to simplify our test code. Changes like this will let us evolve the internal API more easily while preserving stable CLI interfaces, especially since we test the v2 path by default, now.
2020-04-21	t/nntpd: fix lsof check w/ TEST_RUN_MODE=0
	The `xqx' sub requires an absolute path for optional commands. Fixes: 6e07def560b211d9 ("testcommon: spawn-aware system() and qx[] workalikes")
2020-04-21	index: support --max-size / publicinbox.indexMaxSize
	In normal mail paths, we can rely on MTAs being configured with reasonable limits in the -watch and -mda mail injection paths. However, the MTA is bypassed in a git-only delivery path, so a BOFH could inject a large message and DoS users attempting to mirror a public-inbox. This doesn't protect unindexed WWW interfaces from Email::MIME memory explosions on v1 inboxes. Probably nobody cares about unindexed WWW interfaces anymore, especially now that Xapian is optional for indexing.
2020-04-20	testcommon: spawn-aware system() and qx[] workalikes
	Barely noticeable on Linux, but this gives a 1-2% speedup on a FreeBSD 11.3 VM and lets us use built-in redirects rather than relying on /bin/sh.
2020-04-20	t/ds-leak: use BSD::Resource
	We use BSD::Resource in other places, so there's no sense in avoiding it, here.
2020-04-20	import: init_bare: allow use as method, use in tests
	Allowing ->init_bare to be used as a method saves some keystrokes, and we can save a little bit of time on systems with our vfork(2)-enabled spawn(). This also sets us up for future improvements where we can avoid spawning a process at all.
2020-04-20	watchmaildir: support multiple watchheader values
	The watchheader key supports only a single value. Supporting multiple watchheader values was mentioned in discussion [1] of 8d3e3bd8 (doc: explain publicinbox.<name>.watchheader, 2019-10-09), and it wasn't clear if there was a need. One scenario in which matching multiple headers would be convenient is when someone wants to set up public-inbox archives for some small projects but does _not_ want to run mailing lists for them, instead allowing others to follow the project by any of the pull mechanisms. Using a common underlying address, an address alias for each project is configured via a third-party email provider, with messages for each alias being exposed as a separate public-inbox archive. In this setup, messages for an inbox cannot be selected by a List-ID header but can be identified by the inbox's address in either the To or Cc header. To support such a use case, update the watchheader handling to consider multiple values, accepting a message if it matches any value. While selecting a message based on matching _any_ rather than _all_ values is motivated by the above scenario, it's worth noting that the "any" behavior is consistent with how multiple listid config values are handled. [1] https://public-inbox.org/meta/20191010085118.r3amey4cayazfycb@dcvr/
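A minimal sketch of the "accept on any match" rule described above, written in Python rather than the project's Perl. The function name `matches_any`, the (header, substring) rule shape, and the sample addresses are all hypothetical; the real watchheader syntax and matching rules may differ.

```python
from email import message_from_string


def matches_any(msg, rules):
    """Return True if any (header_name, needle) rule matches the message.

    Assumption for this sketch: a rule matches when the needle occurs,
    case-insensitively, in any instance of the named header.
    """
    for name, needle in rules:
        for value in msg.get_all(name, []):
            if needle.lower() in value.lower():
                return True
    return False


raw = (
    "From: someone@example.com\n"
    "To: dev@example.com\n"
    "Cc: proj-a@example.com\n"
    "Subject: hello\n"
    "\n"
    "body\n"
)
msg = message_from_string(raw)

# The same alias is watched in both To and Cc; either one is enough.
rules = [("To", "proj-a@example.com"), ("Cc", "proj-a@example.com")]
print(matches_any(msg, rules))
```

Requiring _all_ rules to match would reject this message, since the alias only appears in Cc.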
2020-04-19	t/v*-add-remove-add: fix typo in description of 'removed' check

2020-04-19	reduce scope of mbox From_ line removal
	It's unnecessary overhead for anything which does Email::MIME parsing. It was never done for v2 indexing, even though v1->v2 conversions did NOT remove those From_ lines. There was never a need to remove From_ lines in the v1 SearchIdx paths, either. Hitting a /$INBOX_URL/$MSGID/T/ endpoint with an 18 message thread reveals a ~0.5% speed improvement. This will become more apparent when we have a faster MIME parser.
2020-04-19	favor `do {}' over `eval {}' for localized slurp
	I did not know to use the return value of `do' back in the day. There's probably no practical difference in these cases, but `eval' is overkill for these uses and may hide actual errors. We can get rid of a few redundant `scalar' ops and pass scalar refs to Email::MIME->new to avoid copies in a few more places, too.
2020-04-19	inbox: don't memoize missing description|cloneurl
	It's probably common to have inboxes initially set up without these files properly configured, so don't memoize at that stage.
2020-04-19	inboxwritable: mime_from_path: reuse in more places
	There's nothing Maildir-specific about the function, so `maildir_path_load' was a bad name. So give it a more appropriate name and use it in our tests. This saves us some code and inconsistency by reusing an existing internal library routine in more places. We can drop the "From_" line in some of our (formerly) mbox sample files.
2020-04-17	searchthread: reduce indirection by removing container
	We can rid ourselves of a layer of indirection by subclassing PublicInbox::Smsg instead of using a container object to hold each $smsg. Furthermore, the `{id}' vs. `{mid}' field name confusion is eliminated. This reduces the size of the $rootset passed to walk_thread by around 15%, which is over 50K of memory when rendering a /$INBOX/ landing page.
2020-04-17	t/httpd-unix: skip some tests w/o signalfd|EVFILT_SIGNAL
	Some of these tests just don't seem reliable enough with the way we or Perl do portable signal handling.
2020-04-16	t/httpd-corner: improve reliability and diagnostics
	The graceful-shutdown-on-PUT test is unreliable because we can't rely on a FIFO as we do with the GET tests. So increase the delay to 100ms since that seems enough on my system even with CONFIG_HZ=100. Add a timeout and backtrace to the $check_self sub to help with further diagnostics while we're at it, too. It would be nice if there were a portable syscall tracing mechanism we could attach to the -httpd process to make the test more deterministic...
2020-04-15	t/httpd-corner.t: relax read-after-failed-write handling
	I've observed FreeBSD 11.2 read(2) having one of three behaviors after a failed write(2) on a socket: 1) returning number of bytes read, 2) failing with ECONNRESET, or 3) returning with EOF. Behavior 1) is the most common, and it is the only one I've seen on Linux. It may be possible to use SO_LINGER or shutdown(2) to ensure 1) always happens, but SO_LINGER behavior seems inconsistent across OSes, especially with non-blocking sockets. Since these tests are corner-cases where we're dealing with broken/malicious clients, let's continue spending the least amount of syscalls protecting ourselves in the daemon and instead make the client-side test code tolerate more socket implementations.
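A rough sketch of what "tolerating more socket implementations" can look like on the client side, in Python rather than the Perl used by the test suite. The helper name `read_after_failed_write` and its return labels are hypothetical; it only shows that all three observed outcomes are accepted instead of treating two of them as test failures.

```python
import socket


def read_after_failed_write(sock, bufsize=4096):
    """Classify what a read returns once the peer has gone away.

    Returns "data" (bytes were read), "reset" (ECONNRESET) or "eof".
    A tolerant test accepts any of the three.
    """
    try:
        data = sock.recv(bufsize)
    except ConnectionResetError:
        return "reset"
    return "data" if data else "eof"


# Demonstration with a local socket pair: the peer sends one byte and closes.
a, b = socket.socketpair()
b.sendall(b"x")
b.close()
first = read_after_failed_write(a)
second = read_after_failed_write(a)
a.close()
print(first, second)
```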
2020-04-15	t/*.t: localize $SIG{__WARN__} changes
	We don't want to propagate %SIG changes to other tests when running multiple tests within the same process via t/run.perl.
2020-04-09	t/httpd-unix: improve test reliability
	Net::Server::Daemonize::create_pid_file does not write the PID file atomically, so we need to barf if it's incomplete.
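A small sketch of detecting an incomplete PID file, in Python rather than the project's Perl. It assumes, for illustration only, that a complete file is a decimal PID followed by a newline; the helper name `read_pid_file` is hypothetical and the actual check in the test may differ.

```python
import os
import tempfile


def read_pid_file(path):
    """Return the PID stored in path, or raise if the file looks half-written.

    Assumption: the writer finishes with a trailing newline, so a file
    without one was caught mid-write.
    """
    with open(path) as fh:
        content = fh.read()
    if not content.endswith("\n"):
        raise ValueError(f"incomplete PID file {path}: {content!r}")
    return int(content)


tmpdir = tempfile.mkdtemp()
complete = os.path.join(tmpdir, "complete.pid")
partial = os.path.join(tmpdir, "partial.pid")
with open(complete, "w") as fh:
    fh.write("1234\n")
with open(partial, "w") as fh:
    fh.write("12")  # simulates a reader racing a non-atomic writer

print(read_pid_file(complete))
```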
2020-04-09	triewyde: ficks soem speling errrors
	Dikshunarees R gude!
2020-04-09	tests: document run_mode=1 as not implemented
	It was implemented at some point, but it was more things to support and the worst of both worlds: both unrealistic compared to real-world use and slower than run_mode=2. Noticed while looking for speling erorrs.
2020-04-03	quiet "Complex regular subexpression recursion limit" warnings
	These seem mostly harmless since Perl will just truncate the match and start a new one on a newline boundary in our case. The only downside is we'd end up with redundant <span> tags in HTML. Limiting the number of lines matched ourselves with `{1,$NUM}' doesn't seem prudent since lines vary in length, so we continue to defer the job of limiting matches to the Perl regexp engine. I've noticed this warning in practice on 100K+ line patches to locale data.
2020-04-03	view: handle the topic-free case properly
	There may be no topics for a given timestamp range, so don't attempt to treat `undef' as an arrayref.
2020-03-31	v2writable: index Message-IDs w/ spaces properly
	Message-IDs can apparently contain spaces and other weird characters. Ensure we pass those properly to shard subprocesses when importing messages in parallel mode. Our NNTP request parser does not deal with spaces in the Message-ID, yet, and I don't expect most NNTP clients to, either. Nor does the Net::NNTP client handle them in responses.
2020-03-30	t/multi-mid: allow test to run w/o Xapian
	While the v1 inbox in this test is created without Xapian, the v2 inbox in this test defaults to having Xapian enabled regardless of whether it's installed or not. Fixes: c7acdfe78bda5bf3 ("v2: SDBM-based multi Message-ID queue")
2020-03-30	t/filter_rubylang.t: avoid warning for non-word prefix
	The "-" was never supported by Xapian in the prefix, but it could still be used to make documentation and URLs more readable in certain cases. Fixes: 7909c5f7439777e3 ("altid: warn about non-word prefixes")
2020-03-29	index: support --compact / -c on command-line
	It's more convenient to specify `-c' / `--compact' on the command-line when reindexing than it is to invoke public-inbox-compact(1) separately. This is especially convenient in low-space situations when public-inbox-index is operating on multiple inboxes sequentially, as compaction can happen immediately after indexing each inbox, instead of waiting until all inboxes are indexed.
2020-03-29	searchidxshard: ensure we set indexlevel on shard[0]
	For sharded v2 repositories with few-enough messages, it is possible for shard[0] to go unused and never trigger the ->commit_txn_lazy to set the indexlevel field in Xapian metadata. So set it immediately at initialization and avoid this case. While we're at it, avoid triggering needless pwrite syscalls from ->set_metadata by checking with ->get_metadata, first.
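The "check with ->get_metadata first" idea can be sketched as follows, in Python rather than Perl. `MetaStore` is a hypothetical in-memory stand-in for a metadata store, not the Xapian API; its write counter exists only to show that a redundant set is skipped.

```python
class MetaStore:
    """Hypothetical stand-in for a database with string metadata."""

    def __init__(self):
        self.data = {}
        self.writes = 0  # counts how often we actually wrote

    def get_metadata(self, key):
        # Unset keys read back as an empty string in this sketch.
        return self.data.get(key, "")

    def set_metadata(self, key, value):
        self.data[key] = value
        self.writes += 1


def set_if_changed(db, key, value):
    """Write only when the stored value differs, avoiding needless writes."""
    if db.get_metadata(key) != value:
        db.set_metadata(key, value)


db = MetaStore()
set_if_changed(db, "indexlevel", "medium")  # first call writes
set_if_changed(db, "indexlevel", "medium")  # second call is a no-op
print(db.writes)
```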
2020-03-25	www: add endpoint to retrieve altid dumps
	This ensures all our indexed data, including data from altid searches (e.g. "gmane:$ARTNUM") is retrievable. It uses a "POST" request to avoid wasting cycles when invoked by crawlers, since it could potentially be several megabytes of data not indexable by search engines.
2020-03-25	qspawn: handle ENOENT (and other errors on exec)
	As sqlite3(1) and other executables may become unavailable or uninstalled while a daemon runs, we need to gracefully handle errors in those cases.
2020-03-25	qspawn: reinstate filter support, add gzip filter
	We'll be supporting gzipped output from sqlite3(1) dumps for altid files in future commits. In the future (and if we survive), we may replace Plack::Middleware::Deflater with our own GzipFilter to work better with asynchronous responses without relying on memory-intensive anonymous subs.
2020-03-24	daemon: unlink .oldbin PID file correctly
	We need to track the PID file having ".oldbin" appended to it while a SIGUSR2 upgrade is in progress and ensure it is unlinked on SIGQUIT.
2020-03-24	daemon: fix SIGUSR2 upgrade with -W0 (no workers)
	Disabling workers via `-W0' blesses the contents of the @listeners array, so we need to ensure we call fcntl on the GLOB ref in ->{sock}. Add tests to ensure USR2 works regardless of whether workers are enabled or not.
2020-03-22	v2: SDBM-based multi Message-ID queue
	This lets us store author and committer times for deferred indexing messages with ambiguous Message-IDs. This allows us to reproducibly reindex messages with the git commit and author times when a rare message lacks Received and/or Date headers while having ambiguous Message-IDs.
2020-03-22	*idx: pass smsg in even more places
	We can finally get rid of the awkward, ad-hoc use of V2Writable, SearchIdx, and OverIdx args for passing {cotime} and {autime} between classes. We'll still use those git time fields internally within V2Writable and SearchIdx for (re)indexing, but that's not worth avoiding as a fallback.
2020-03-22	*idx: pass $smsg in more places instead of many args
	We can pass blessed PublicInbox::Smsg objects to internal indexing APIs instead of having long parameter lists in some places. The end goal is to avoid parsing redundant information each step of the way and hopefully make things more understandable.
2020-03-22	rename PublicInbox::SearchMsg => PublicInbox::Smsg
	Since the introduction of over.sqlite3, SearchMsg is not tied to our search functionality in any way, so stop confusing ourselves and future hackers by just calling it "PublicInbox::Smsg". Add a missing "use" in ExtMsg while we're at it.
2020-03-22	index: use git commit times on missing Date/Received
	When indexing messages without Date: and/or Received: headers, fall back to using timestamps originally recorded by git in the commit object. This allows git mirrors to preserve the import datestamp and timestamp of a message according to what was fed into git, instead of blindly falling back to the current time.
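A simplified sketch of the fallback order, in Python rather than the project's Perl. It only looks at the Date: header (Received: handling is omitted), and the function name `message_time` and its `git_commit_time` parameter are hypothetical.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def message_time(msg, git_commit_time):
    """Prefer the Date: header; otherwise use the time git recorded.

    git_commit_time is an epoch timestamp taken from the commit object,
    so reindexing the same repository always yields the same result.
    """
    raw = msg.get("Date")
    if raw:
        try:
            return parsedate_to_datetime(raw).timestamp()
        except (TypeError, ValueError):
            pass  # unparsable Date: header, fall through to git's time
    return git_commit_time


with_date = message_from_string(
    "Date: Thu, 01 Jan 1970 00:01:40 +0000\nSubject: a\n\nbody\n"
)
without_date = message_from_string("Subject: b\n\nbody\n")

print(message_time(with_date, 1584835200))
print(message_time(without_date, 1584835200))
```

Falling back to the current time instead would give every mirror a different timestamp for the same message.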
2020-03-21	t/msgtime: skip test if timezone isn't UTC
	Date::Parse falls back to using the local timezone when it's missing from an email, so only test in a reasonable TZ (UTC) for server software.
2020-03-21	t/www_listing: avoid 'once' warnings
	We reach into the WwwListing package directly to retrieve that JSON encoder/decoder object, and we can't rely on `use' since WwwListing loading may fail if Plack is missing.
2020-03-20	wwwlisting: avoid lazy loading JSON module
	We already lazy-load WwwListing for the CGI script, and hiding another layer of lazy-loading makes it difficult to do WWW->preload. We want long-lived processes to do all long-lived allocations up front to avoid fragmentation in the allocator, but we'll still support short-lived processes by lazy-loading individual modules in the PublicInbox::* namespace. Mixing up allocation lifetimes (e.g. doing immortal allocations while a large amount of space is taken by short-lived objects) will cause fragmentation in any allocator which favors large contiguous regions for performance reasons. This includes any malloc implementation which relies on sbrk() for the primary heap, including glibc malloc.
2020-03-19	http: fix RFC conformance w.r.t. message length
	We need to favor "Transfer-Encoding: chunked" over the value of the Content-Length header. We should also reject bogus, duplicate and/or unreasonable values for both these, since they can trigger unexpected behavior when combined with other HTTP parsers in proxies such as varnish, nginx, haproxy, etc... See RFC 7230 (and RFC 2616) for more details: https://tools.ietf.org/html/rfc7230 https://www.rfc-editor.org/errata_search.php?rfc=7230
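A rough sketch of the framing decision described above, in Python rather than the project's Perl. The function name `body_framing`, the `MAX_BODY` cap, and the choice to reject any repeated framing header are assumptions made for illustration; they are stricter and simpler than what RFC 7230 permits.

```python
import re

MAX_BODY = 2**31  # hypothetical upper bound, chosen only for this sketch


def body_framing(headers):
    """Decide how a request body is delimited.

    headers is a list of (name, value) pairs as received.
    Returns ("chunked", None), ("length", n) or ("none", 0);
    raises ValueError for requests that should be rejected.
    """
    te = [v.strip().lower() for k, v in headers
          if k.lower() == "transfer-encoding"]
    cl = [v.strip() for k, v in headers if k.lower() == "content-length"]
    if len(te) > 1 or len(cl) > 1:
        raise ValueError("duplicate framing header")
    if te:
        if te[0] != "chunked":
            raise ValueError("unsupported Transfer-Encoding")
        # Chunked wins even when a Content-Length is also present.
        return ("chunked", None)
    if cl:
        if not re.fullmatch(r"[0-9]+", cl[0]):
            raise ValueError("bogus Content-Length")
        length = int(cl[0])
        if length > MAX_BODY:
            raise ValueError("unreasonable Content-Length")
        return ("length", length)
    return ("none", 0)


print(body_framing([("Transfer-Encoding", "chunked"),
                    ("Content-Length", "5")]))
```

A proxy in front of the server that trusted Content-Length here would disagree about where the request ends, which is the kind of mismatch the commit guards against.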
2020-03-07	searchmsg: allow lines (and bytes) to be zero
	We will occasionally see legit messages with zero lines, so be sure we index that count for NNTP clients. I'm not sure about bytes being zero (aside from purged messages), but we should've dealt with that earlier up the stack.
2020-03-01	msgtime: assume +0000 if TZ missing when using Date::Parse
	Some old emails don't have timezone offsets. Since our Date::Parse code path takes a liberal interpretation of dates, fall back to using "+0000" as the timezone offset since it's closer to the actual date of the message than whatever the current date is. Reported-by: Leah Neukirchen <leah@vuxu.org> Link: https://public-inbox.org/meta/87h7zfemur.fsf@vuxu.org/ Fixes: ae80a3fdb53d7014 ("MsgTime.pm: Use strptime to compute the time zone")
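The same "assume +0000" fallback can be sketched in Python with the standard library (the project itself uses Perl and Date::Parse). The function name `parse_date_assume_utc` is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def parse_date_assume_utc(raw):
    """Parse an email date, treating a missing offset as +0000.

    parsedate_to_datetime returns a naive datetime when no offset is
    given; we pin that to UTC instead of the machine's local timezone.
    """
    dt = parsedate_to_datetime(raw)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt


no_tz = parse_date_assume_utc("Sat, 29 Feb 2020 12:00:00")
with_tz = parse_date_assume_utc("Sat, 29 Feb 2020 12:00:00 +0100")
print(no_tz.isoformat(), with_tz.isoformat())
```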
2020-03-01	import: drop '<' and '>' characters in addresses
	Some strange "From:" lines will cause Email::Address::XS to leave '<' (and presumably '>') in the address which git-fast-import won't accept even if quoted. Work around this problem by deleting '<' and '>' the same way we delete them for the ident name. Reported-by: Leah Neukirchen <leah@vuxu.org> Link: https://public-inbox.org/meta/87h7zfemur.fsf@vuxu.org/
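A minimal sketch of the workaround, in Python rather than the project's Perl. The function name `sanitize_ident` and the sample values are hypothetical; the sketch only shows stray angle brackets being deleted from both the name and the address before the `Name <address>` ident is assembled.

```python
# Translation table that deletes '<' and '>' wherever they appear.
STRIP_ANGLES = str.maketrans("", "", "<>")


def sanitize_ident(name, addr):
    """Build a 'Name <addr>' ident with no stray angle brackets inside."""
    clean_name = name.translate(STRIP_ANGLES)
    clean_addr = addr.translate(STRIP_ANGLES)
    return f"{clean_name} <{clean_addr}>"


print(sanitize_ident("A <B>", "<user@example.com"))
```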
2020-02-24	v2writable: make remove return-compatible w/ Import::remove
	Import::remove is a documented interface, and the return value of the V2Writable work-alike should try to be compatible with what Import implements.
2020-02-24	hval: ascii_html: drop CRLF => LF conversion
	Instead, we add CRLF conversion to the only remaining place which needs it, ViewVCS. This saves many redundant ops in many places. The only other place where this mattered was in View::add_text_body, but we already started doing CRLF conversions when we added diff parsing and link generation for ViewVCS. Otherwise, all other places we used this were for header viewing, and Email::MIME doesn't preserve CRLF in headers.