public-inbox.git - an "archives first" approach to mailing lists

Date	Commit message
2020-06-08	index: v2: parallel by default
	InboxWritable should only set $v2w->{parallel} if the $parallel flag is defined to 0 or 1. We want indexing a new inbox to utilize SMP, just like --reindex. -index once again allows -j0/--jobs=0 to force single-process use, and we'll be ensuring that works in tests to maintain performance on small systems. Fixes: 61a2fff5b34a3e32 ("admin: move index_inbox over")
2020-06-03	smsg: remove remaining accessor methods
	We'll continue to favor simpler data models that can be used directly rather than wasting time and memory with accessor APIs. The ->from, ->to, ->cc, ->mid, ->subject, ->references methods can all be trivially replaced by hash lookups since all their values are stored in doc_data. Most remaining callers of those methods were test cases, anyways. ->from_name is only used in the PSGI code, so we can just use ->psgi_cull to take care of populating the {from_name} field.
2020-06-03	www: remove smsg_mime API and adjust callers
	To further simplify callers and avoid embarrassing memory explosions[1], we can finally eliminate this method in favor of smsg_eml. [1] commit 7d02b9e64455831d3bda20cd2e64e0c15dc07df5 ("view: stop storing all MIME objects on large threads") fixed a huge memory blowup.
2020-06-03	smsg: introduce ->populate method
	This will eventually replace the __hdr() calling methods and eradicate {mime} usage from Smsg. For now, we can eliminate PublicInbox::Smsg->new since most callers already rely on an open `bless' to avoid the old {mime} arg.
2020-05-29	treat $INBOX_DIR/description and gitweb.owner as UTF-8
	gitweb does the same with $GIT_DIR/description and gitweb.owner. Allowing UTF-8 description should not cause problems when used in responses to the NNTP "LIST NEWSGROUPS" request, either, since RFC 3977 section 7.6.6 recommends the description be UTF-8 (but does not require it). Link: https://public-inbox.org/meta/20200528151216.l7vmnmrs4ojw372g@sourcephile.fr/
2020-05-27	learn: fix buggy typo on List-ID mapping
	There is obviously a typo here, so fix it and add a test case to guard against future regressions. Fixes: 74a3206babe0572a ("mda: support multiple List-ID matches")
2020-05-26	view: do not offer links to 0-byte multipart attachments
	Offering links to download 0-byte files is useless. We could waste memory by preserving $eml->{bdy} during iteration, but offering attachments of type "multipart" is not very useful, as users are usually interested in decoded attachments or the entire raw message. Fixes: e60231148eb604a3 ("descend into message/(rfc822|news|global) parts")
2020-05-24	t/eml.t: favor ->header over ->header_str
	This test may still run against ancient versions of Email::MIME for comparisons.
2020-05-20	t/edit: use eml_load here, too
	I missed this instance of file slurping into an Email::MIME-like object the other week when tearing Email::MIME usage out.
2020-05-17	confine Email::MIME use even further
	To avoid confusing future readers and users, recommend PublicInbox::Eml in our Import POD and refer to PublicInbox::Eml comments at the top of PublicInbox::MIME. mime_load() is now confined to t/eml.t, since we won't be using it anywhere else in our tests.
2020-05-17	descend into message/(rfc822|news|global) parts
	Email::MIME never supported this properly, but there are real instances of forwarded messages as message/rfc822 attachments. message/news is a legacy thing which we'll see in archives, and message/global appears to be the new thing. gmime also supports message/rfc2822, so we'll support it anyways despite lacking other evidence of its existence. Existing attachments remain downloadable as a whole message, but individual attachments of subparts are now downloadable and can be displayed in HTML, too. Furthermore, ensure Xapian can now search for common headers inside those messages as well as the message bodies.
2020-05-17	t/psgi_attach: assert message/* parts are downloadable
	We'll be adding support to descend into message/rfc822 (and legacy message/news) attachments. First, we must ensure existing message/rfc822 attachments can be downloaded and remain downloadable in future commits.
2020-05-12	rename "ContentId" to "ContentHash"
	The old name may be confused with "Content-ID" as described in RFC 2392, so use an alternate name to avoid confusing future readers.
2020-05-10	emlcontentfoo: drop the {discrete} and {composite} fields
	We don't have to worry about compatibility with old installations of Email::MIME::ContentType any longer, so save some space.
2020-05-10	t/mime: fix test to work w/o Email::MIME
	Although the lazy loading changes were correct, the code was still using PublicInbox::MIME as a fixed class. Use the `$cls' variable from the loop. Favor ->subparts over ->parts, too, since ->parts is discouraged by the Email::MIME manpage and not implemented for Eml.
2020-05-10	eml: rename limits to match postfix names
	They're still part of our internal API at this point, but reusing the same names as those used by postfix makes sense for now to reduce cognitive overheads of learning new things. There's no "mime_parts_limit", but the name is consistent with "mime_nesting_limit".
2020-05-10	eml: enforce a maximum header length
	While our header processing is more efficient than Email::*::Header, capping the maximum size for a `m//g' match still limits memory growth on a header we care for. Use the same limit as postfix (header_size_limit=102400), since messages fetched via git/HTTP/NNTP/etc can bypass MTA limits.
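The idea of bounding header text before a global match can be sketched as follows. This is an illustration in Python, not the project's Perl: `parse_headers` and `_HEADER_RE` are hypothetical names, and truncating (rather than rejecting) an oversized header is an assumption. Only the 102400 value comes from the commit message.

```python
import re

# Same value as postfix's header_size_limit, as quoted in the commit message.
HEADER_SIZE_LIMIT = 102400

# Hypothetical pattern: "Name: value", including folded continuation lines.
_HEADER_RE = re.compile(r"^([^\s:]+):[ \t]*(.*(?:\n[ \t].*)*)", re.MULTILINE)


def parse_headers(raw, limit=HEADER_SIZE_LIMIT):
    """Return (name, value) pairs from at most `limit` characters of header text."""
    capped = raw[:limit]  # bound the input before the global match runs
    return [(m.group(1), m.group(2)) for m in _HEADER_RE.finditer(capped)]
```

Because the cap is applied to the input rather than to each match, the work done by the regular expression is bounded no matter how the message reached the parser.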
2020-05-09	remove most internal Email::MIME usage
	We no longer load or use Email::MIME outside of comparison tests.
2020-05-09	EmlContentFoo: Email::MIME::ContentType replacement
	Since we're getting rid of Email::MIME, get rid of Email::MIME::ContentType, too, since we may introduce speedups down the line specific to our codebase.
2020-05-09	replace most uses of PublicInbox::MIME with Eml
	PublicInbox::Eml has enough functionality to replace the Email::MIME-based PublicInbox::MIME.
2020-05-09	eml: pure-Perl replacement for Email::MIME
	Email::MIME eats memory, wastes time parsing out all the headers, and some problems can't be fixed without breaking compatibility for other projects which depend on it. Informal benchmarks show a ~2x improvement in general stats gathering scripts and ~10% improvement in HTML view rendering. We also don't need the ability to create MIME messages, just parse them and maybe drop an attachment. While this isn't the zero-copy or streaming MIME parser of my dreams, it's still an improvement in that it doesn't keep a scalar copy of the raw body around along with subparts. It also doesn't parse subparts up front, so it can also replace our uses of Email::Simple.
2020-05-09	msg_iter: pass $idx as a scalar, not array
	This doesn't make any difference for most multipart messages (or any single part messages). However, this starts having space savings when parts start nesting. It also slightly simplifies callers.
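A rough sketch of the scalar-index idea, in Python rather than the project's Perl. The dict-based message tree, the name `iter_parts`, and the dotted-string format ("2.1") are all assumptions made for illustration. The commit only states that the index is passed as a scalar instead of an array.

```python
def iter_parts(part, depth=0, idx=""):
    """Yield (leaf, depth, idx) for every leaf part of a nested message tree.

    `idx` is a single dotted string such as "2.1" instead of a list like [2, 1].
    """
    children = part.get("parts", [])
    if not children:
        # A message without subparts is reported as part "1".
        yield part, depth, idx or "1"
        return
    for n, child in enumerate(children, start=1):
        # Extend the scalar; no per-level list needs to be built or copied.
        child_idx = f"{idx}.{n}" if idx else str(n)
        yield from iter_parts(child, depth + 1, child_idx)
```

A callback receiving the index this way can use it directly (for example, in a link or a log line) without joining a list first.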
2020-05-09	search: support searching on List-Id
	We'll support both probabilistic matches via `l:' and boolean matches via `lid:' for exact matches, similar to how both `m:' and `mid:' are supported. Only text inside angle brackets (`<' and `>') is supported, since I'm not sure if there's value in searching on the optional phrases (which would require decoding with ->header_str instead of ->header_raw).
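Extracting only the bracketed part of a raw List-Id value might look like the sketch below. It is written in Python for illustration; `list_ids` is a hypothetical helper and the sample header values are made up. The optional phrase before the brackets is ignored, matching the limitation described above.

```python
import re


def list_ids(header_value):
    """Return the identifiers found inside <...> in a raw List-Id header value."""
    # The optional descriptive phrase outside the brackets is deliberately skipped.
    return re.findall(r"<([^<>\s]+)>", header_value)
```

For a header such as `List-Id: Example project <list.example.org>`, this yields `list.example.org`, which is the kind of value a `lid:` or `l:` query would be matched against.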
2020-05-03	t/convert-compact: avoid warning on `scalar(split(...))'
	Perl 5.10.1 would warn about implicit assignment to @_ by split(). So favor the documented method of using `tr' to count lines. Fixes: b5ddcb3352ef31ae ("index: support --compact / -c on command-line")
2020-05-03	t/httpd-corner.t: fix uninitialized warning
	Current versions of Perl don't warn when vec() is given `undef' as its first arg, but Perl 5.10.1 does, at least. Fixes: c7b4cbdadf3116a0 ("t/httpd-corner: improve reliability and diagnostics")
2020-04-30	t/precheck: remove Email::Simple->create from tests
	It's likely we'll replace Email::Simple using our Email::MIME alternative/replacement, as well. So reduce the API surface we interact with and make it easier to swap implementations.
2020-04-26	tests: replace mime_from_path with mime_load
	mime_from_path is designed to fail gracefully in busy Maildirs whereas mime_load was made for loading files from a work tree.
2020-04-26	tests: remove Email::MIME->create use entirely
	Replace them with .eml files generated with the help of Email::MIME, but without some extraneous and unnecessary headers, and strip mime_load down to just loading files. This will give us more freedom to experiment with other mail libraries which may be more correct, better maintained, use less memory and/or be faster than Email::MIME.
2020-04-26	testcommon: introduce mime_load sub
	We'll use this to create, memoize, and reuse .eml files. This will be used to reduce (and eventually eliminate) our dependency on Email::MIME in tests.
2020-04-22	t/mda.t: avoid needless use of Email::Simple
	Totally pointless to create an object only to convert it back to a raw string for -mda input.
2020-04-22	t/*.t: reduce dependency on Email::MIME APIs
	Instead, favor PublicInbox::MIME->new for non-attachment emails. We may support alternatives to Email::MIME down the line. We'll still keep Email::MIME->create to deal with attachments, for now, but there's also a fair amount of test duplication we should eliminate, later.
2020-04-22	t/*.t: use Email::MIME->create over PublicInbox::MIME->create
	PublicInbox::MIME only supports ->new, and is only different from Email::MIME for old versions of Email::MIME. In the future, PublicInbox::MIME may not be a subclass of Email::MIME at all.
2020-04-22	t/feed: remove useless $ENV{GIT_DIR} assignment
	I don't think this has been useful since we stopped supporting ssoma in this test.
2020-04-21	t/nntpd: die if we can't open stderr output
	We need to detect FS errors and bail out on the test if we can't open a file -nntpd was just writing to.
2020-04-21	t/nntpd: reduce dependencies on internal API
	Since the advent of run_script(), we can rely on it to simplify our test code. Changes like this will let us evolve the internal API more easily while preserving stable CLI interfaces, especially since we test the v2 path by default, now.
2020-04-21	t/nntpd: fix lsof check w/ TEST_RUN_MODE=0
	The `xqx' sub requires an absolute path for optional commands. Fixes: 6e07def560b211d9 ("testcommon: spawn-aware system() and qx[] workalikes")
2020-04-21	index: support --max-size / publicinbox.indexMaxSize
	In normal mail paths, we can rely on MTAs being configured with reasonable limits in the -watch and -mda mail injection paths. However, the MTA is bypassed in a git-only delivery path, so a BOFH could inject a large message and DoS users attempting to mirror a public-inbox. This doesn't protect unindexed WWW interfaces from Email::MIME memory explosions on v1 inboxes. Probably nobody cares about unindexed WWW interfaces anymore, especially now that Xapian is optional for indexing.
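The size check itself is simple. A minimal sketch in Python, assuming the limit is a byte count and that an oversized message is skipped rather than aborting the run (`should_index` is a hypothetical name, and the inclusive comparison is a guess):

```python
def should_index(raw, max_size=None):
    """Return True when a raw message is small enough to index.

    `max_size` is a byte count; None means no limit is configured.
    """
    if max_size is None:
        return True
    return len(raw) <= max_size
```

The point of checking at index time is that it applies to every delivery path, including messages that arrive by git fetch and never pass through an MTA.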
2020-04-20	testcommon: spawn-aware system() and qx[] workalikes
	Barely noticeable on Linux, but this gives a 1-2% speedup on a FreeBSD 11.3 VM and lets us use built-in redirects rather than relying on /bin/sh.
2020-04-20	t/ds-leak: use BSD::Resource
	We use BSD::Resource in other places, so there's no sense in avoiding it, here.
2020-04-20	import: init_bare: allow use as method, use in tests
	Allowing ->init_bare to be used as a method saves some keystrokes, and we can save a little bit of time on systems with our vfork(2)-enabled spawn(). This also sets us up for future improvements where we can avoid spawning a process at all.
2020-04-20	watchmaildir: support multiple watchheader values
	The watchheader key supports only a single value. Supporting multiple watchheader values was mentioned in discussion [1] of 8d3e3bd8 (doc: explain publicinbox.<name>.watchheader, 2019-10-09), and it wasn't clear if there was a need. One scenario in which matching multiple headers would be convenient is when someone wants to set up public-inbox archives for some small projects but does _not_ want to run mailing lists for them, instead allowing others to follow the project by any of the pull mechanisms. Using a common underlying address, an address alias for each project is configured via a third-party email provider, with messages for each alias being exposed as a separate public-inbox archive. In this setup, messages for an inbox cannot be selected by a List-ID header but can be identified by the inbox's address in either the To or Cc header. To support such a use case, update the watchheader handling to consider multiple values, accepting a message if it matches any value. While selecting a message based on matching _any_ rather than _all_ values is motivated by the above scenario, it's worth noting that the "any" behavior is consistent with how multiple listid config values are handled. [1] https://public-inbox.org/meta/20191010085118.r3amey4cayazfycb@dcvr/
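The "any value matches" rule can be sketched as below. This is Python for illustration, not the watchmaildir code: `matches_any` is a hypothetical helper, the addresses are made up, and representing each watchheader value as a (header name, pattern) pair is an assumption about how the config is parsed.

```python
import re
from email import message_from_string


def matches_any(msg, watch_headers):
    """Return True if any (header_name, pattern) pair matches the message.

    A header may occur more than once, so every occurrence is checked.
    """
    for name, pattern in watch_headers:
        for value in msg.get_all(name, []):
            if pattern.search(value):
                return True
    return False


# Made-up message: the project alias appears in Cc only.
msg = message_from_string(
    "To: someone@example.com\nCc: proj-a@example.com\n\nbody\n"
)
alias = re.compile(r"proj-a@example\.com")
accepted = matches_any(msg, [("To", alias), ("Cc", alias)])
```

With a single watchheader value, the message above would be missed whenever the alias lands in the other header. Listing both covers the To-or-Cc scenario described in the commit.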
2020-04-19	t/v*-add-remove-add: fix typo in description of 'removed' check

2020-04-19	reduce scope of mbox From_ line removal
	It's unnecessary overhead for anything which does Email::MIME parsing. It was never done for v2 indexing, even though v1->v2 conversions did NOT remove those From_ lines. There was never a need to remove From_ lines in the v1 SearchIdx paths, either. Hitting a /$INBOX_URL/$MSGID/T/ endpoint with an 18 message thread reveals a ~0.5% speed improvement. This will become more apparent when we have a faster MIME parser.
2020-04-19	favor `do {}' over `eval {}' for localized slurp
	I did not know to use the return value of `do' back in the day. There's probably no practical difference in these cases, but `eval' is overkill for these uses and may hide actual errors. We can get rid of a few redundant `scalar' ops and pass scalar refs to Email::MIME->new to avoid copies in a few more places, too.
2020-04-19	inbox: don't memoize missing description|cloneurl
	It's probably common to have inboxes initially setup without these files properly configured, so don't memoize at that stage.
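A sketch of memoizing only a successful lookup, written in Python for illustration. The `Inbox` class, its `_cache` field, and the placeholder text are hypothetical. The only behavior taken from the commit is that a missing file must not be cached.

```python
import os


class Inbox:
    """Toy stand-in for an inbox object that reads $INBOX_DIR/description."""

    PLACEHOLDER = "(description missing)"  # hypothetical fallback text

    def __init__(self, inbox_dir):
        self.inbox_dir = inbox_dir
        self._cache = {}

    def description(self):
        if "description" in self._cache:
            return self._cache["description"]
        path = os.path.join(self.inbox_dir, "description")
        try:
            with open(path, encoding="utf-8") as fh:
                text = fh.read().strip()
        except FileNotFoundError:
            text = ""
        if not text:
            # Do not memoize: the admin may create the file later.
            return self.PLACEHOLDER
        self._cache["description"] = text
        return text
```

A long-running process can therefore pick up a description added after startup, while still avoiding repeated file reads once a real value exists.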
2020-04-19	inboxwritable: mime_from_path: reuse in more places
	There's nothing Maildir-specific about the function, so `maildir_path_load' was a bad name. So give it a more appropriate name and use it in our tests. This saves us some code and inconsistency by reusing an existing internal library routine in more places. We can drop the "From_" line in some of our (formerly) mbox sample files.
2020-04-17	searchthread: reduce indirection by removing container
	We can rid ourselves of a layer of indirection by subclassing PublicInbox::Smsg instead of using a container object to hold each $smsg. Furthermore, the `{id}' vs. `{mid}' field name confusion is eliminated. This reduces the size of the $rootset passed to walk_thread by around 15%, that is over 50K memory when rendering a /$INBOX/ landing page.
2020-04-17	t/httpd-unix: skip some tests w/o signalfd|EVFILT_SIGNAL
	Some of these tests just don't seem reliable enough with the way we or Perl do portable signal handling.
2020-04-16	t/httpd-corner: improve reliability and diagnostics
	The graceful-shutdown-on-PUT test is unreliable because we can't rely on a FIFO as we do with the GET tests. So increase the delay to 100ms since that seems enough on my system even with CONFIG_HZ=100. Add a timeout and backtrace to the $check_self sub to help with further diagnostics while we're at it, too. It would be nice if there were a portable syscall tracing mechanism we could attach to the -httpd process to make the test more deterministic...
2020-04-15	t/httpd-corner.t: relax read-after-failed-write handling
	I've observed FreeBSD 11.2 read(2) having one of three behaviors after a failed write(2) on a socket: 1) returning number of bytes read; 2) failing with ECONNRESET; 3) returning with EOF. Behavior 1) is the most common, and I've only seen 1) on Linux. It may be possible to use SO_LINGER or shutdown(2) to ensure 1) always happens, but SO_LINGER behavior seems inconsistent across OSes, especially with non-blocking sockets. Since these tests are corner-cases where we're dealing with broken/malicious clients, let's continue spending the least amount of syscalls protecting ourselves in the daemon and instead make the client-side test code tolerate more socket implementations.
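Client-side test code that tolerates all three outcomes could be structured like this sketch. It is Python for illustration, not the actual t/httpd-corner.t code, and `read_after_failed_write` is a hypothetical helper.

```python
def read_after_failed_write(sock, nbytes=4096):
    """Classify what a read returns after an earlier write on `sock` failed.

    Returns "data", "reset", or "eof"; a tolerant test accepts any of them.
    """
    try:
        data = sock.recv(nbytes)
    except ConnectionResetError:  # ECONNRESET
        return "reset"
    return "data" if data else "eof"
```

A test built on this would assert that the result is one of the three values instead of requiring a specific one, which keeps the daemon free of extra syscalls such as shutdown(2).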