public-inbox.git - an "archives first" approach to mailing lists

Date	Commit message
2020-05-12	rename "ContentId" to "ContentHash"
	The old name may be confused with "Content-ID" as described in RFC 2392, so use an alternate name to avoid confusing future readers.
2020-05-12	xt/eml_check_limits: check limits against an inbox
	This allows maintainers to easily check limits against the contents of existing inboxes. This script covers most of the new limits enforced by PublicInbox::Eml. Usage is similar to most xt/*.t scripts: `GIANT_INBOX_DIR=/path/to/inbox prove -bvw xt/eml_check_limits.t'. Setting `TEST_CLASS=PublicInbox::MIME' allows us to check performance and memory use against the old subclass of Email::MIME.
2020-05-09	xt: eml comparison tests
	While our codebase can still work with either MIME implementation, add comparison tests to ensure we handle corner cases in existing archives.
2020-05-09	EmlContentFoo: Email::MIME::ContentType replacement
	Since we're getting rid of Email::MIME, get rid of Email::MIME::ContentType, too, since we may introduce speedups down the line specific to our codebase.
2020-05-09	eml: pure-Perl replacement for Email::MIME
	Email::MIME eats memory, wastes time parsing out all the headers, and some problems can't be fixed without breaking compatibility for other projects which depend on it. Informal benchmarks show a ~2x improvement in general stats gathering scripts and ~10% improvement in HTML view rendering. We also don't need the ability to create MIME messages, just parse them and maybe drop an attachment. While this isn't the zero-copy or streaming MIME parser of my dreams, it's still an improvement in that it doesn't keep a scalar copy of the raw body around along with subparts. It also doesn't parse subparts up front, so it can also replace our uses of Email::Simple.
2020-04-27	doc: add clients.txt
	Since some client tools exist for dealing with public-inbox specifically, it seems like a good idea to list some of them. Cc: Danh Doan <congdanhqx@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Konstantin Ryabitsev <konstantin@linuxfoundation.org> Cc: Leah Neukirchen <leah@vuxu.org>
2020-04-26	tests: remove Email::MIME->create use entirely
	Replace them with .eml files generated with the help of Email::MIME, but without some extraneous and unnecessary headers, and strip mime_load down to just loading files. This will give us more freedom to experiment with other mail libraries which may be more correct, better maintained, use less memory and/or be faster than Email::MIME.
2020-04-20	watchmaildir: support multiple watchheader values
	The watchheader key supports only a single value. Supporting multiple watchheader values was mentioned in discussion [1] of 8d3e3bd8 (doc: explain publicinbox.<name>.watchheader, 2019-10-09), and it wasn't clear if there was a need. One scenario in which matching multiple headers would be convenient is when someone wants to set up public-inbox archives for some small projects but does _not_ want to run mailing lists for them, instead allowing others to follow the project by any of the pull mechanisms. Using a common underlying address, an address alias for each project is configured via a third-party email provider, with messages for each alias being exposed as a separate public-inbox archive. In this setup, messages for an inbox cannot be selected by a List-ID header but can be identified by the inbox's address in either the To or Cc header. To support such a use case, update the watchheader handling to consider multiple values, accepting a message if it matches any value. While selecting a message based on matching _any_ rather than _all_ values is motivated by the above scenario, it's worth noting that the "any" behavior is consistent with how multiple listid config values are handled. [1] https://public-inbox.org/meta/20191010085118.r3amey4cayazfycb@dcvr/
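The "any value matches" rule described above can be illustrated with a minimal sketch. This is not the project's Perl implementation: the `Header:value` entry format, the case-insensitive substring comparison, and the name `matches_any` are assumptions made for illustration only.

```python
from email import message_from_string


def matches_any(msg, watch_values):
    """Return True if the message matches at least one "Header:value" entry.

    Hypothetical helper: the entry format and the substring comparison
    are assumptions, not the project's actual matching rules.
    """
    for entry in watch_values:
        name, _, wanted = entry.partition(":")
        wanted = wanted.strip().lower()
        # A header may appear more than once, so check every occurrence.
        for raw in msg.get_all(name.strip(), []):
            if wanted in raw.lower():
                return True
    return False


raw = (
    "From: a@example.com\n"
    "To: other@example.com\n"
    "Cc: proj1@example.com\n"
    "Subject: hi\n"
    "\n"
    "body\n"
)
msg = message_from_string(raw)
```

With this sample message, an entry list containing both a To and a Cc value for `proj1@example.com` matches through the Cc header, while a list containing only the To value does not.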
2020-04-19	doc: start writeup on semi-automatic memory management
	I don't consider Perl's memory management "automatic". Instead, having an extra bit of control as a hacker is nice and there's no need to burden ordinary users with GC tuning knobs.
2020-04-19	inboxwritable: mime_from_path: reuse in more places
	There's nothing Maildir-specific about the function, so `maildir_path_load' was a bad name. So give it a more appropriate name and use it in our tests. This saves us some code and inconsistency by reusing an existing internal library routine in more places. We can drop the "From_" line in some of our (formerly) mbox sample files.
2020-04-17	doc: update 1.4.0 relnotes with date, start 1.5.0

2020-04-15	MANIFEST update

2020-03-25	www: add endpoint to retrieve altid dumps
	This ensures all our indexed data, including data from altid searches (e.g. "gmane:$ARTNUM") is retrievable. It uses a "POST" request to avoid wasting cycles when invoked by crawlers, since it could potentially be several megabytes of data not indexable by search engines.
2020-03-25	qspawn: reinstate filter support, add gzip filter
	We'll be supporting gzipped dumps from sqlite3(1) for altid files in future commits. In the future (and if we survive), we may replace Plack::Middleware::Deflater with our own GzipFilter to work better with asynchronous responses without relying on memory-intensive anonymous subs.
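As a rough illustration of what a streaming gzip filter does, the sketch below compresses chunks as they arrive instead of buffering the whole dump. It uses Python's standard `zlib` module, not the project's Perl code; the name `gzip_chunks` and the sample dump lines are hypothetical.

```python
import gzip
import zlib


def gzip_chunks(chunks, level=6):
    """Yield gzip-framed output for an iterable of byte chunks."""
    # wbits=31 (16 + 15) selects the gzip container rather than raw zlib.
    comp = zlib.compressobj(level, zlib.DEFLATED, 31)
    for chunk in chunks:
        out = comp.compress(chunk)
        if out:  # the compressor may buffer and return nothing yet
            yield out
    # flush() emits any buffered data plus the gzip trailer
    yield comp.flush()


# Hypothetical stand-in for lines of an sqlite3(1) dump.
dump = [b"CREATE TABLE t (n INTEGER);\n", b"INSERT INTO t VALUES (1);\n"]
compressed = b"".join(gzip_chunks(dump))

# Round trip: a standard gzip reader recovers the original bytes.
assert gzip.decompress(compressed) == b"".join(dump)
```

Because the filter is a generator, a server can write each compressed piece to the client as soon as it is produced.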
2020-03-22	v2: SDBM-based multi Message-ID queue
	This lets us store author and committer times for deferred indexing messages with ambiguous Message-IDs. This allows us to reproducibly reindex messages with the git commit and author times when a rare message lacks Received and/or Date headers while having ambiguous Message-IDs.
2020-03-22	rename PublicInbox::SearchMsg => PublicInbox::Smsg
	Since the introduction of over.sqlite3, SearchMsg is not tied to our search functionality in any way, so stop confusing ourselves and future hackers by just calling it "PublicInbox::Smsg". Add a missing "use" in ExtMsg while we're at it.
2020-03-22	index: use git commit times on missing Date/Received
	When indexing messages without Date: and/or Received: headers, fall back to using timestamps originally recorded by git in the commit object. This allows git mirrors to preserve the import datestamp and timestamp of a message according to what was fed into git, instead of blindly falling back to the current time.
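The fallback described above can be sketched as follows, under simplifying assumptions: only the Date header is consulted (the commit also considers Received), and the git commit time is passed in as a plain epoch integer. The function name `message_time` is hypothetical.

```python
from datetime import timezone
from email import message_from_string
from email.utils import parsedate_to_datetime


def message_time(msg, commit_time):
    """Return an epoch timestamp for msg, preferring its Date header.

    commit_time is the epoch time recorded by git for the commit that
    added the message; it replaces "now" as the fallback value.
    """
    raw = msg.get("Date")
    if raw:
        try:
            dt = parsedate_to_datetime(raw)
            if dt.tzinfo is None:
                # "-0000" dates parse as naive datetimes; treat them as UTC
                dt = dt.replace(tzinfo=timezone.utc)
            return int(dt.timestamp())
        except (TypeError, ValueError):
            pass  # unparseable Date: fall through to the commit time
    return commit_time


dated = message_from_string("Date: Thu, 01 Jan 2004 00:00:00 +0000\n\nhi\n")
undated = message_from_string("Subject: no date\n\nhi\n")
```

Reindexing the same repository then yields the same timestamps on every run, since neither branch depends on the current time.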
2020-02-24	doc: technical: document data structures
	Can't code without data structures, and we emphasize data over code just about everywhere.
2020-02-15	t/msg_iter: test for X-UNKNOWN charset from Alpine
	A long overdue test for behavior established in 2016. Fixes: 1b28cc7f00a866cb ("view: try assuming UTF-8 for bogus charsets")
2020-02-09	doc: update v1.3.0.eml with actual headers, start v1.4.0
	Bigger changes coming :>
2020-02-07	syscall: support Linux x32 ABI
	The x32 ABI allows users to take advantage of the extra registers on x86-64 without the bloat of 64-bit pointers and longs. This ought to be significant since Perl was designed when 32-bit was prevalent; and the common structs for ops, hashes, scalars, and arrays use longs (SSize_t/Size_t) for things which should never need 64-bits when processing emails. Debian's x32 port seems to work quite nicely under a chroot on an amd64 Linux system. All tests pass under x32, now.
2020-02-06	MANIFEST: add flow.{ge,txt}
	Oops :x
2020-02-02	t/multi-mid.t: extra test for -convert highwater mark
	This is derived from a real-world test case where I encountered multiple Message-IDs in a v1 inbox causing regen problems. Fixes: eea47b676127bcdb ("convert: preserve highwater mark from v1 msgmap")
2020-01-11	doc: technical/ds.txt: describe PublicInbox::DS divergences
	Danga::Socket 1.62 was released a few months back and the maintainer indicated it would be the last release. We've diverged significantly in incompatible ways... While most of this should've already been documented in commit messages, putting it all into one document could make it easier-to-digest. It's also a strange design for anybody used to conventional event loops. Maybe this is an unconventional project :P
2020-01-05	view: msg_html: reduce memory use on reused MIDs
	In rare cases where Message-IDs get reused, we do not want to hold onto the large Email::MIME objects in memory after showing the first message. So discard each message as soon as we're done using it so we can save memory for the next message. The new and expensive xt/mem-msgview.t test shows a nearly 14MB reduction for two ~7MB messages. run_script() also gets upgraded to make it easier to pass large inputs via IO GLOBs.
2020-01-04	xt/solver.t: real-world regression tests
	There's a lot of test cases which we should probably make self-contained at some point, but right now it's easier to just mark them off in a maintainer test.
2020-01-03	examples: add empty "lib" dir to placate plackup
	This is necessary for Filesys::Notify::Simple 0.13 using Linux::Inotify2, since 0.13 started croaking on inotify_add_watch failures.
2020-01-02	doc: release notes: set Date for 1.2.0, start 1.3.0
	Seems like a lot's happened since 1.2, but it's mostly internal stuff...
2020-01-01	wwwstatic: add directory listing + index.html support
	It's now possible to use WwwStatic as a standalone PSGI app to serve static files and recreate the award-winning web design of https://public-inbox.org/ :>
2019-12-27	githttpbackend: split out wwwstatic
	Make it easier to share code between our GitHTTPBackend and Cgit packages, for now, and possibly other packages in the future. We can avoid inline_object and anonymous subs at the same time, reducing per-request memory overhead.
2019-12-19	t/run.perl: to avoid repeated process spawning for *.t
	Spawning a new Perl interpreter for every test case means Perl has to reparse and recompile every single file it needs, costing us performance and development time. Now that we've modified our code to avoid global state, we can preload everything we need. The new "check-run" test target is now 20-30% faster than the original "check" target.
2019-12-19	tests: move t/common.perl to PublicInbox::TestCommon
	We want to be able to use run_script with *.t files, so t/common.perl putting subs into the top-level "main" namespace won't work. Instead, make it a module which uses Exporter like other libraries.
2019-12-15	address: use Email::Address::XS if available
	Email::Address::XS is a dependency of modern versions of Email::MIME, so it's likely loaded and installed on newer systems, already; and capable of handling more corner-cases than our pure-Perl fallback. We still fall back to the imperfect-but-good-enough-in-practice pure-Perl code while avoiding the non-XS Email::Address (which was susceptible to DoS attacks (CVE-2015-7686)). We just need to keep "git fast-import" happy.
2019-12-14	ds: move EvCleanup code into DS
	EvCleanup only existed because Danga::Socket was a separate component, and cleanup code belongs with the event loop.
2019-12-12	add msgtime_cmp maintainer test
	Changes will be coming for MsgTime to stop depending on Date::Parse due to lack of package availability on OpenBSD and suboptimal performance on RFC822 dates.
2019-12-12	git: async batch interface
	This is a transitionary interface which does NOT require an event loop. It can be plugged into current synchronous code without major surgery. It allows HTTP/1.1 pipelining-like functionality by taking advantage of predictable and well-specified POSIX pipe semantics by stuffing multiple git cat-file requests into the --batch pipe. With xt/git_async_cmp.t and GIANT_GIT_DIR=git.git, the async interface is 10-25% faster than the synchronous interface since it can keep the "git cat-file" process busier. This is expected to improve performance on systems with slower storage (but multiple cores).
2019-11-27	httpd|nntpd: avoid missed signal wakeups
	Our attempt at using a self-pipe in signal handlers was ineffective, since pure Perl code execution is deferred and Perl doesn't use an internal self-pipe/eventfd. In retrospect, I actually prefer the simplicity of Perl in this regard... We can use sigprocmask() from Perl, so we can introduce signalfd(2) and EVFILT_SIGNAL support on Linux and *BSD-based systems, respectively. These OS primitives allow us to avoid a race where Perl checks for signals right before epoll_wait() or kevent() puts the process to sleep. The (few) systems nowadays without signalfd(2) or IO::KQueue will now see wakeups every second to avoid missed signals.
2019-11-27	dskqxs: fix missing EV_DISPATCH define
	Oops, IO::KQueue support was broken due to this missing constant. Add a new ds-kqxs.t test case to ensure we test the IO::KQueue path if IO::KQueue is available.
2019-11-24	tests: move giant inbox/git dependent tests to xt/
	xt/ is typically reserved for "eXtended tests" intended for the maintainers and not ordinary users. Since these require special configuration and do nothing but waste cycles during startup, they qualify.
2019-11-24	tests: quiet down commit graph
	Newer versions of git enable the commit graph by default. Since we blow away our temporary directories every test, generating graphs is a waste and clutters stderr with "Computing commit graph generation numbers" messages.
2019-11-16	mbox: split mboxgz out into a separate file
	It'll make using Compress::Raw::Zlib easier, since we can use that and import constants more easily.
2019-11-03	public-inbox v1.2.0

2019-11-03	doc: add public-inbox.cgi(1) manpage
	Yet another case of documenting things which should NOT be used :>
2019-11-02	doc: add public-inbox-purge(1) manpage
	Tools intended for end users need manpages, and doubly so to convince potential users NOT to use them :)
2019-10-31	msgiter: do not assume UTF-8 if Email::MIME->body_str succeeds
	ISO-2022-JP and other non-UTF-8 messages need to be displayed correctly. Fixes: 7d82a8bc04ce ('handle "multipart/mixed" messages which are not multipart')
2019-10-30	doc: add public-inbox-learn(1) manpage
	Tools intended for end users need manpages.
2019-10-09	doc: PublicInbox::SaPlugin::ListMirror manpage
	This is a plugin for SpamAssassin that happens to be quite useful in keeping spam off lists I mirror. Hopefully more people can find it useful now that it has a manpage.
2019-10-07	examples: add grok-pull post_update_hook example
	This requires the latest (to be in 1.2) -init changes for synchronization and has no dependencies on GNU or bash-isms so it should run on *BSD systems without GNU tools. It does attempt to use curl on <$INBOX_URL/_/text/config/raw>, but curl is fairly standard nowadays, and falls back to using an invalid address to initialize.
2019-10-07	doc: generate NEWS, NEWS.atom, and NEWS.html
	We'll use our Documentation/RelNotes directory and internal APIs to generate these files for website use (the website should be completely reproducible).
2019-10-05	doc: add manpage for public-inbox-init(1)
	This old command was lacking a manpage, so (finally) create one.