public-inbox.git - an "archives first" approach to mailing lists

Date	Commit message
2020-12-05	isearch: emulate per-inbox search with ->ALL
	Using "eidx_key:" boolean prefix to limit results to a given inbox, we can use ->ALL to emulate and replace per-Inbox xap15/[0-9] search indices. With this change, the presence of "extindex.all.topdir" in the $PI_CONFIG will cause the WWW code to use that extindex and ignore per-inbox Xapian DBs in xap15/[0-9]. Unfortunately IMAP search still requires old per-inbox indices, for now. Mapping extindex Xapian docids to per-Inbox UIDs and vice-versa is proving tricky. Fortunately, IMAP search is rarely used and optional. The RFCs don't specify expensive phrase search, either, so `indexlevel=medium' can be used in per-inbox Xapian indices to save space. For primarily WWW (and future JMAP) users; this should result in significant disk space, FD, and page cache footprint savings for large instances with many inboxes and many cross-posted messages.
2020-12-05	newswww: use ->ALL to avoid O(n) inbox scan
	We can avoid doing a Message-ID lookup on every single inbox by using ->ALL to scan its over.sqlite3 DB. This mimics NNTP behavior and picks the first message indexed, though redirecting to /all/$MESSAGE_ID/ could be done. With the current lore.kernel.org set of inboxes (~140), this provides a 10-40% speedup depending on inbox ordering.
2020-12-01	nntpd: move {newsgroup} name check to config
	With 50K newsgroups in the config file, this doesn't slow down `PublicInbox::Config->new->fill_all' by any measurable amount on my busy old workstation. This should prevent invalid newsgroup names from getting into extindex and catch user errors sooner, rather than later.
	v2:
	- delete {newsgroup} if invalid to avoid ->nntp_url link
	- simplify -imapd and explain remaining check
2020-11-28	nntpd: share {groups} hash with {-by_newsgroup} in Config
	There's no need to duplicate a potentially large hash, but we can keep the inexpensive shortcut to it. We may eventually drop the {groups} shortcut if it's no longer useful.
2020-11-24	manifest: support faster generation via [extindex "all"]
	For a mirror of lore.kernel.org with >140 inboxes, this speeds up manifest.js.gz generation from ~1s to 40ms on my HW. This is still unacceptable when dealing with thousands of inboxes, but gets us closer to where we need to be.
2020-11-24	move JSON module portability into PublicInbox::Config
	We'll be using JSON in MiscIdx and MiscSearch, and PublicInbox::Config seems like an appropriate place to put it.
2020-11-08	extsearch: rename -eindex to -extindex
	"eindex" rhymes with "reindex", which could be confusing; so rename the command and config prefix to "extindex", which is hopefully less confusing.
2020-11-07	extsearch: wire up remaining Inbox-like methods for WWW
	This lets us pretend an ExtSearch object is an Inbox object in most of the existing WWW code.
2020-09-22	mda: match List-Id insensitively
	This follows -watch commit b70473ab8296d31ebb600adb4fa8fe0ac5935ca8 to match List-Id headers case-insensitively. Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org> Link: https://public-inbox.org/meta/20200921180152.uyqluod7qxbwqubo@chatter.i7.local/
2020-09-20	config: warn on multiple values for some fields
	Our code doesn't support multi-values for these, and having unexpected arrays leads to unexpected results (e.g. showing stuff like "ARRAY(0xDEADBEEFADD12E55)" in user interfaces). So warn and only use the last value (matching git-config(1) behavior without `--get-all').
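The "warn and use the last value" rule could be sketched roughly as follows. This is a minimal Python illustration, not the project's Perl code; the function name `last_value` and the key used in the usage note are hypothetical.

```python
import warnings


def last_value(key, values):
    """Return a single value for `key`.

    `values` may be a plain string or a list of strings (what a config
    parser might produce when a key appears more than once).  With more
    than one value, warn and keep only the last, which is what
    git-config(1) does without `--get-all'.
    """
    if isinstance(values, list):
        if not values:
            return None
        if len(values) > 1:
            warnings.warn(
                f"{key} has multiple values, only using the last: {values[-1]!r}"
            )
        return values[-1]
    return values
```

For example, `last_value("publicinbox.test.somefield", ["/a", "/b"])` would emit one warning and return `"/b"`.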
2020-09-10	config: split out iterator into separate object
	We will need to allow simultaneous iterators on the same config object, since we'll need this for ExtMsg, NNTPD, WwwListing, NewsWWW, and other places.
2020-09-10	config: flatten each_inbox and iterate_start args
	In Perl, we can simplify callers by passing a single array all the way down the stack instead of a single array ref which needs to be expanded every call.
2020-09-02	config: use defined-or (//) in a few places
	Just some golfing to reduce scrolling and hopefully improve readability.
2020-08-07	index: v2: --sequential-shard option
	This gives better page cache utilization for Xapian indexing on slow storage by improving locality for random I/O activity on the Xapian DB. Instead of doing a single-pass to index both SQLite and Xapian; this indexes them separately. The first pass is identical to indexlevel=basic: it indexes both over.sqlite3 and msgmap.sqlite3. Subsequent passes only operate on a single Xapian shard for documents belonging to that shard. Given enough shards, each individual shard can be made small enough to fit into the kernel page cache and avoid HDD seeks for read activity. Doing rough tests with a busy system with a 7200 RPM HDD with ext4, full indexing of LKML (9 epochs) goes from ~80 hours (-j0) to ~30 hours (-j8) with 16GB RAM with 7 shards configured and fsync(2) disabled (--no-sync) and `--batch-size=10m'.
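The pass structure described above might look like the sketch below. It is a toy Python model under stated assumptions: dictionaries stand in for the SQLite and Xapian databases, and assigning a document to a shard by `docid % nshards` is an assumption made for illustration, not something the commit message states.

```python
def sequential_shard_index(messages, nshards):
    """Index `messages` (an iterable of (docid, text)) in separate passes.

    Returns (sqlite_rows, shards) where `shards` is a list of per-shard
    dicts mapping docid -> list of terms.
    """
    messages = list(messages)

    # Pass 1: SQLite-only work, comparable to indexlevel=basic.
    sqlite_rows = {docid: text for docid, text in messages}

    # Passes 2..N+1: touch exactly one shard per pass, so the working
    # set stays small enough to remain in the page cache.
    shards = [dict() for _ in range(nshards)]
    for shard_no in range(nshards):
        for docid, text in messages:
            if docid % nshards == shard_no:  # assumed shard assignment
                shards[shard_no][docid] = text.lower().split()
    return sqlite_rows, shards
```

The point of the design is locality: each pass writes to a single shard instead of scattering random I/O across all of them.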
2020-07-17	config: reject `\n' in `inboxdir'
	"\n" and other characters requiring quoting and/or escaping in $GIT_DIR/objects/info/alternates were not supported in git 2.11 and earlier; nor do they seem supported at all in libgit2. This will allow us to support sharing git-cat-file or similar endpoints across multiple inboxes via alternates. This breaks an existing use case for anybody wacky enough to put `\n' in the `inboxdir' pathname; but I doubt this affects anybody.
2020-06-28	config: support ->urlmatch method for -watch
	Since we have IMAP client support in -watch; make sure per-URL settings are familiar to git users by taking advantage of git's URL matching abilities. This requires git 1.8.5+, which most users ought to have (though base CentOS 7 is on 1.8.3).
2020-06-13	imap: require ".$UID_MIN-$UID_END" suffix
	Finish up the IMAP-only portion of iterative config reloading, which allows us to create all sub-ranges of an inbox up front. The InboxIdler still uses ->each_inbox which will struggle with 100K inboxes. Having messages in the top-level newsgroup name of an inbox will still waste bandwidth for clients which want to do full syncs once there's a rollover to a new 50K range. So instead, make every inbox accessible exclusively via 50K slices in the form of "$NEWSGROUP.$UID_MIN-$UID_END". This introduces the DummyInbox, which makes $NEWSGROUP and every parent component a selectable, empty inbox. This aids navigation with mutt and possibly other MUAs. Finally, the xt/perf-imap-list maintainer test is broken, now, so remove it. The grep perlfunc is already proven effective, and we'll have separate tests for mocking out ~100k inboxes.
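The naming scheme can be illustrated with a short Python sketch. The exact UID bounds are an assumption here (1-based and inclusive); the commit message only gives the "$NEWSGROUP.$UID_MIN-$UID_END" form and the 50K range size. Both function names are hypothetical.

```python
SLICE_SIZE = 50_000  # range size named in the commit message


def slice_names(newsgroup, max_uid, size=SLICE_SIZE):
    """Mailbox names covering UIDs 1..max_uid in fixed-size slices.

    Assumes 1-based, inclusive bounds; the real bounds may differ.
    """
    names = []
    lo = 1
    while lo <= max_uid:
        hi = lo + size - 1
        names.append(f"{newsgroup}.{lo}-{hi}")
        lo = hi + 1
    return names


def dummy_parents(newsgroup):
    """The newsgroup and every parent component, each of which would be
    exposed as a selectable but empty mailbox."""
    parts = newsgroup.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]
```

Under these assumptions, `slice_names("inbox.comp.test", 120000)` yields three names ending in `.1-50000`, `.50001-100000`, and `.100001-150000`, and `dummy_parents("inbox.comp.test")` yields `inbox`, `inbox.comp`, and `inbox.comp.test`.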
2020-06-13	imap: start doing iterative config reloading
	This will be used to prevent reloading a giant config with tens/hundreds of thousands of inboxes from blocking the event loop.
2020-04-20	watchmaildir: support multiple watchheader values
	The watchheader key supports only a single value. Supporting multiple watchheader values was mentioned in discussion [1] of 8d3e3bd8 (doc: explain publicinbox.<name>.watchheader, 2019-10-09), and it wasn't clear if there was a need. One scenario in which matching multiple headers would be convenient is when someone wants to set up public-inbox archives for some small projects but does _not_ want to run mailing lists for them, instead allowing others to follow the project by any of the pull mechanisms. Using a common underlying address, an address alias for each project is configured via a third-party email provider, with messages for each alias being exposed as a separate public-inbox archive. In this setup, messages for an inbox cannot be selected by a List-ID header but can be identified by the inbox's address in either the To or Cc header. To support such a use case, update the watchheader handling to consider multiple values, accepting a message if it matches any value. While selecting a message based on matching _any_ rather than _all_ values is motivated by the above scenario, it's worth noting that the "any" behavior is consistent with how multiple listid config values are handled. [1] https://public-inbox.org/meta/20191010085118.r3amey4cayazfycb@dcvr/
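The "any value matches" rule might be sketched like this in Python. The "Header:value" spec format and the case-insensitive substring comparison are assumptions for illustration; the real matching rules may be stricter.

```python
from email import message_from_string


def matches_any(msg, watchheaders):
    """Return True if `msg` matches at least one watchheader spec.

    Each spec is assumed to look like "Header:value".  A spec matches
    when any occurrence of that header contains the value.
    """
    for spec in watchheaders:
        name, _, want = spec.partition(":")
        want = want.strip().lower()
        for got in msg.get_all(name, []):
            if want in got.lower():
                return True
    return False


# The aliased-address scenario: the inbox address appears in Cc, not To.
msg = message_from_string("To: a@example.com\nCc: proj@example.com\n\nbody\n")
accepted = matches_any(msg, ["To:proj@example.com", "Cc:proj@example.com"])
```

Here `accepted` is true because the Cc spec matches even though the To spec does not.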
2020-03-29	config: Honor gitconfig includes
	This allows for a setup where a central config file for the web server includes per-user config files.
2020-02-06	treewide: run update-copyrights from gnulib for 2019
	I didn't wait until September to do it, this year!
2020-02-01	config: assume multiple cgit URLs, too
	Since we support inboxes with multiple URLs and multiple infourls to reduce reliance on SPOFs, we'll do the same with cgit URLs.
2020-01-13	config: do not slurp entire cgitrc at once
	cgitrc files can have hundreds or thousands of lines in them and slurping them into memory is a waste. "while (<$fh>)" only reads one line at a time, whereas "for (<$fh>)" reads the entire contents of the file into a temporary array.
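Python has a similar distinction, shown here only as an analogy to the Perl idiom: iterating over a file handle reads lazily, while `readlines()` builds a list of every line first. The parser below is a simplified sketch that handles only `name=value` lines and `#` comments.

```python
import io


def parse_cgitrc(fh):
    """Yield (key, value) pairs from a cgitrc-like file handle."""
    for line in fh:  # lazy: one line at a time, no temporary list
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            yield key.strip(), value.strip()


sample = io.StringIO("# comment\nrepo.url=foo\nrepo.path=/srv/foo.git\n")
pairs = list(parse_cgitrc(sample))
```

Writing `for line in fh.readlines():` instead would be the slurping form that the commit avoids.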
2020-01-11	spawn (and thus popen_rd) die on failure
	Most spawn and popen_rd callers die on failure to spawn, anyways, and some are missing checks entirely. This saves us a bunch of verbose error-checking code in callers. This also makes popen_rd more consistent, since it already dies on pipe creation failures.
2020-01-06	admin: do not lazy-load Inbox or Config packages
	No point in lazy-loading these, since they're always loaded anyways and would not have portability problems on systems with minimal dependencies.
2020-01-02	config: support multi-value inbox..url
	Since the beginning of this project, we've implicitly supported inboxes with multiple URLs by relying on the Host: header sent by the client ($env->{HTTP_HOST}). We now offer the option to explicitly configure multiple URLs for every inbox along with the ability to do a best-effort match for matching hostnames.
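A best-effort hostname match could be sketched as below. This is an assumption about the general idea (prefer the configured URL whose host equals the client's Host: header, else fall back to the first), not a description of the actual matching rules; `best_url` is a hypothetical name.

```python
from urllib.parse import urlsplit


def best_url(urls, http_host):
    """Pick the configured URL whose host matches `http_host`.

    Falls back to the first configured URL when nothing matches.
    """
    host = http_host.lower()
    for url in urls:
        if urlsplit(url).netloc.lower() == host:
            return url
    return urls[0]
```

With `urls = ["https://a.example/test/", "https://b.example/test/"]`, a request carrying `Host: b.example` would get the second URL and any unknown host would get the first.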
2019-12-27	config: each_inbox: pass user arg to callback
	Another place where we can replace anonymous subs with named subs by passing a user-supplied arg.
2019-10-16	config: remove redundant inboxdir check
	This was causing compatibility problems for old configs when using public-inbox-nntpd.
2019-10-16	config: support "inboxdir" in addition to "mainrepo"
	"mainrepo" was a bad name and artifact from the early days when I intended for there to be a "spamrepo" (now just the ENV{PI_EMERGENCY} Maildir). With v2, "mainrepo" can be especially confusing, since v2 needs at least two git repositories (epoch + all.git) to function and we shouldn't confuse users by having them point to a git repository for v2. Much of our documentation already references "INBOX_DIR" for command-line arguments, so use "inboxdir" as the git-config(1)-friendly variant for that. "mainrepo" remains supported indefinitely for compatibility. Users may need to revert to old versions, or may be referring to old documentation and must not be forced to change config files to account for this change. So if you're using "mainrepo" today, I do NOT recommend changing it right away because other bugs can lurk. Link: https://public-inbox.org/meta/874l0ice8v.fsf@alyssa.is/
2019-10-15	config: allow "0" as a valid mainrepo path
	It's probably wrong to use relative path names, but things are all relative these days anyways with shared and networked FSes.
2019-10-15	config: avoid unnecessary '||' use
	'//' is available in Perl 5.10+ which allows `0' and `""' (empty string) to remain unclobbered. We also don't need '||=' for initializing our internal caches.
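The difference can be shown with a Python analogy, where `or` plays the role of Perl's '||' and an explicit `is None` test plays the role of '//'. Truthiness rules differ between the two languages (the string "0" is false in Perl but true in Python), so this is only an illustration of the clobbering problem.

```python
def with_or(value, default):
    """Like Perl's `$value || $default`: any false value is replaced."""
    return value or default


def with_defined_or(value, default):
    """Like Perl's `$value // $default`: only an undefined value is replaced."""
    return default if value is None else value
```

So `with_or(0, "fallback")` returns `"fallback"`, while `with_defined_or(0, "fallback")` keeps the `0`; both return the default for `None`.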
2019-10-15	config: simplify lookup* methods
	This ensures we always process inboxes in section order and reduces the amount of code we have to maintain for each lookup. Avoiding the cost of inboxes object creation is not worth the code overhead; and we can implement a config cache via Storable easily for large configs and -mda users.
2019-10-15	config: we always have {-section_order}
	Rewrite a bunch of tests to use ordered input (emulating "git config -l" output) so we can always walk sections in the order they were given in the config file.
2019-10-15	Config.pm: Add support for mailing list information
	The world has turned since I first started following mailing lists and to my surprise every mailing list that I am subscribed to properly sets the "List-ID:" mailing list header. So instead of doing something clever and flexible I am adding support for looking up public inbox mailing lists by their mailing list name. That makes the work needed for each email trivial and easy to understand.
	- Parse the "List-ID:" header.
	- Lookup in the configuration which mailbox is connected to that "List-ID:"
	- Deliver the mail to that mailbox.
	To that end this change enhances PublicInbox to have an additional mailbox configuration parameter "listid" that holds the mailing list name. A method is added to the PublicInbox config object called lookup_list_id that given a mailing list name will return the PublicInbox in the configuration that is configured to handle that mailing list.
	Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
	[ew: avoid autovivification of $ibx->{listid} for t/config.t]
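The three steps above (parse, look up, deliver) can be sketched in Python. Only the name `lookup_list_id` comes from the commit message; the dict-based config, the `deliver` helper, and the angle-bracket parsing are simplifying assumptions.

```python
import re
from email import message_from_string


def list_id_of(msg):
    """Step 1: extract the identifier between <...> in the List-Id header."""
    match = re.search(r"<([^>]+)>", msg.get("List-Id", ""))
    return match.group(1) if match else None


def lookup_list_id(config, list_id):
    """Step 2: map a list identifier to an inbox name (dict stands in for config)."""
    return config.get(list_id)


def deliver(msg, config, mailboxes):
    """Step 3: append the message to the matching inbox, if any."""
    inbox = lookup_list_id(config, list_id_of(msg))
    if inbox is None:
        return None
    mailboxes.setdefault(inbox, []).append(msg)
    return inbox


msg = message_from_string("List-Id: Test list <test.example.com>\n\nhi\n")
mailboxes = {}
delivered_to = deliver(msg, {"test.example.com": "test"}, mailboxes)
```

A message without a List-Id header, or with one that is not configured, is simply not delivered in this sketch.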
2019-09-30	config: use NUL-delimited git-config(1) output
	This allows us to deal with newlines in config values, since git-config(1) acquired "-z" support in git v1.5.3. I'm not sure if it's actually useful in our case, but maybe some multi-line texts could be added. And newlines in path names are super useful!
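With "-z", git-config(1) separates each key from its value with a newline and terminates each record with NUL, so values may themselves contain newlines. A minimal Python parser for that format might look like this; the key names in the sample are hypothetical.

```python
def parse_config_z(data):
    """Parse `git config -z -l` style output into {key: [values...]}.

    Each record is "key\\nvalue" terminated by NUL.  A key that was set
    without a value has no newline at all; it is recorded as None.
    """
    cfg = {}
    for record in data.split("\0"):
        if not record:
            continue  # trailing NUL leaves an empty final piece
        key, sep, value = record.partition("\n")
        cfg.setdefault(key, []).append(value if sep else None)
    return cfg


sample = "foo.path\n/srv/test\0foo.text\nline one\nline two\0foo.flag\0"
parsed = parse_config_z(sample)
```

Only the first newline in a record separates key from value, which is why the two-line value survives intact.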
2019-09-18	config: boolean handling matches git-config(1)
	We need to handle arbitrary integers and case-insensitive variations of human words to match git-config(1) behavior, since that's what users would expect given we use config files parseable by git-config(1).
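The git-config(1) boolean rules can be approximated with the Python sketch below. It covers the documented words and plain integers; git's k/m/g integer suffixes are deliberately left out, and returning `None` for unparseable input is a choice made for this sketch.

```python
def git_bool(value):
    """Interpret a config value the way git-config(1) treats booleans.

    None means the key was present with no "= value", which git
    treats as true.  Returns None when the value is not a boolean.
    """
    if value is None:
        return True
    v = value.strip().lower()
    if v in ("true", "yes", "on"):
        return True
    if v in ("", "false", "no", "off"):
        return False
    try:
        return int(v) != 0  # any non-zero integer is true
    except ValueError:
        return None
```

For instance, `git_bool("YES")` and `git_bool("2")` are true, while `git_bool("Off")`, `git_bool("0")`, and `git_bool("")` are false.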
2019-09-09	run update-copyrights from gnulib for 2019

2019-06-04	config: do not accept non-ASCII digits in cgitrc params
	cgit uses atoi(3), and now we can retain compatibility.
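The underlying pitfall exists in Python as well, which makes it easy to demonstrate: Unicode-aware digit tests accept characters that atoi(3) does not understand. The strict parser below is a sketch, and `cgit_int` is a hypothetical name.

```python
import re

_ASCII_INT = re.compile(r"[0-9]+")  # explicit range, not \d


def cgit_int(value):
    """Return an int only if `value` consists solely of ASCII digits."""
    return int(value) if _ASCII_INT.fullmatch(value) else None


arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE
unicode_says_digit = arabic_three.isdigit()
```

Here `unicode_says_digit` is true, yet `cgit_int(arabic_three)` is rejected, matching what a C program using atoi(3) could actually parse.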
2019-04-19	www: support listing of inboxes
	We will still return a 404 by default to '/' for compatibility with users of Plack::App::Cascade or similar. Inboxes are sorted by modification times to help users detect activity (similar to the /$INBOX/ topic view).
	New configuration options:
	* publicinbox.wwwlisting - configure the listing type
	* publicinbox.<name>.hide - hide a particular inbox from the listing
	See changes to public-inbox-config.pod for full descriptions of the new options.
	Requested-by: Leah Neukirchen <leah@vuxu.org> https://public-inbox.org/meta/871sdfzy80.fsf@gmail.com/
2019-04-18	config: use '$ibx' instead of '$rv' to denote Inbox objects
	Followup-to: 6e6f7999361925e4 ("cleanup: use '$ibx' consistently when referring to Inbox refs")
2019-04-16	cleanup: use '$ibx' consistently when referring to Inbox refs
	'$inbox' is more human-readable, so reserve it for the human-readable name in most cases. Making our variable naming more consistent should make the code easier-to-review and harder to screw up.
2019-04-15	config: fix regression in repo.path => coderepo.dir mapping
	We parse cgitrc for "repo.path", while we use "coderepo.dir" to mean the same thing for non-cgit users. So I ended up confusing myself, here. But then again, git uses "--git-dir" and "GIT_DIR", so I suspect "dir" is the better choice than "path", here.
2019-04-15	config: support more cgit directives for project lists
	Hopefully this gets us closer to matching cgit upstream behavior (which also lacks tests). We'll still need to support macro expansion at some point for compatibility...
2019-04-15	cgit: serve static css, logo, favicon directly
	We can reduce the configuration needed to run cgit by reusing the static file handling logic of the dumb git HTTP protocol. I hate logos and icons, so don't expect public-inbox.org or 80x24.org to ever have those to waste users' bandwidth with :P But I expect other users to find this useful.
2019-04-15	config: support cgit scan-path and scan-hidden-path
	project_list support still needs to be done. And tests need to be written... :<
2019-04-04	cgit: use a dedicated named limiter
	I mainly need this to enforce RLIMIT_CPU (and RLIMIT_CORE) when requests come which generate giant, unrealistic diffs. Per-coderepo limiters may be added in the future. But for now, I need to prevent cgit from monopolizing resources on my dinky server.
2019-04-04	qspawn: wire up RLIMIT_* handling to limiters
	This allows users to configure RLIMIT_{CORE,CPU,DATA} using our "limiter" config directive when spawning external processes.
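The general technique of applying resource limits to a spawned process can be sketched with Python's standard `resource` and `subprocess` modules (Unix only). This illustrates the idea rather than the project's qspawn code; `spawn_limited` is a hypothetical helper.

```python
import resource
import subprocess
import sys


def spawn_limited(cmd, rlimits):
    """Run `cmd` with the given {RLIMIT_*: (soft, hard)} limits applied.

    The limits are set in the child between fork and exec, so the
    parent process keeps its own limits.
    """
    def apply_limits():
        for res, limits in rlimits.items():
            resource.setrlimit(res, limits)

    return subprocess.run(
        cmd, preexec_fn=apply_limits, capture_output=True, text=True
    )


# Child reports its own core-file limit so the effect is visible.
child = [
    sys.executable,
    "-c",
    "import resource; print(resource.getrlimit(resource.RLIMIT_CORE)[0])",
]
result = spawn_limited(child, {resource.RLIMIT_CORE: (0, 0)})
```

Passing `resource.RLIMIT_CPU` in the same dict would cap CPU seconds for a child that generates an unreasonably large diff.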
2019-04-04	www: wire up cgit as a 404 handler if cgitrc is configured
	Requests intended for cgit are unlikely to conflict with requests to inboxes. So we can safely hand those requests off to cgit.cgi.
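The hand-off can be pictured as a small cascade: try the inbox handler first and pass the request on only when it answers 404. The Python sketch below uses toy handlers that take a path and return `(status, body)`; it is not PSGI, and all names in it are hypothetical.

```python
def cascade(primary, fallback):
    """Return a handler that falls back only when `primary` gives 404."""
    def app(path):
        status, body = primary(path)
        if status == 404:
            return fallback(path)
        return status, body
    return app


def inbox_app(path):
    # Toy stand-in for the inbox WWW code.
    if path.startswith("/test/"):
        return 200, "inbox page"
    return 404, "not found"


def cgit_app(path):
    # Toy stand-in for handing the request to cgit.cgi.
    return 200, "cgit page for " + path


www = cascade(inbox_app, cgit_app)
```

With this wiring, `www("/test/")` is served by the inbox handler and `www("/project.git/")` falls through to the cgit stand-in.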
2019-04-04	support publicinbox.cgitrc directive
	We can save admins the trouble of declaring [coderepo "..."] sections in the public-inbox config by parsing the cgitrc directly. Macro expansion (e.g. $HTTP_HOST) is not supported, yet; but may be in the future.
2019-03-12	config: ignore missing config files
	There's no reason for us to have git-config(1) warn users when a config file is entirely missing.