public-inbox.git - an "archives first" approach to mailing lists

Date	Commit message
2020-08-27	watch: imap: only remove \Seen spam
	This matches the behavior of Maildir `watchspam' handling in not removing unseen messages. NNTP can't match this behavior, since NNTP servers don't store flags, clients do.
2020-08-27	overidx: inline create_ghost sub
	There's no need for this to be a separate sub since there's only a single caller. This saves a few kilobytes at least in short-lived processes.
2020-08-27	imaptracker: preserve WAL journal_mode if set by user
	It's no problem for most users to enable WAL, here, since there's only a single process doing both reading and writing (unlike the read-only daemons). However, WAL doesn't work on network filesystems, so it can't be enabled by default.
2020-08-27	watchmaildir: ensure I:/W:/E: prefixes in warnings
	For consistency in output, any URL/path-context-dependent prefixes should have the same prefix as the actual warning which triggered it.
2020-08-27	git: show more context info on failures
	I'm seeing "read: Connection timed out" in my syslog from -httpd. The fail() calls in PublicInbox::Git seem to be the only code path of ours which could trigger it... ETIMEDOUT shouldn't happen on pipes, only sockets; and all of our socket operations are non-blocking. So this could be cgit-wwwhighlight-filter.lua, but that's connecting over localhost, though on fairly loaded HW.
2020-08-27	search: allow testing with current xapian.git and 1.5.x
	A `PI_XAPIAN' environment variable is now exposed for testing purposes. We'll also deal with the removal of `NumberValueRangeProcessor' and use `NumberRangeProcessor' in its place, but continue favoring the old Search::Xapian since that's all that's packaged for Debian 10.x stable.
2020-08-27	msgmap: use v5.10.1
	We use the defined-or (`//', `//=') operators in 5.10, so require 5.10.1 like the rest of our codebase. Update an outdated comment while we're at it.
2020-08-27	over*: use v5.10.1, drop warnings
	v5.10.1 lets us use the lighter parent.pm instead of base.pm, and we'll rely on the shebang to enable warnings (or not). While we're in the area, drop a no-longer-necessary import for PublicInbox::Search, since OverIdx doesn't require search.
2020-08-27	over: recent: remove expensive COUNT query
	As noted in commit 87dca6d8d5988c5eb54019cca342450b0b7dd6b7 ("www: rework query responses to avoid COUNT in SQLite"), COUNT on many rows is expensive on big SQLite DBs. We've already stopped using that code path long ago in WWW while -imapd and -nntpd never used it. So we'll adjust our remaining test cases to not need it, either.
2020-08-27	over: rename ->disconnect to ->dbh_close
	Since we got rid of over->connect, `disconnect' no longer pairs with it. So name it after the `close(2)' syscall it ultimately issues.
2020-08-27	over: rename ->connect method to ->dbh
	`->connect' is confused with the perlfunc for the `connect(2)' syscall, and also `DBI->connect'. Since SQLite doesn't use sockets, the word "connect" needlessly confuses me. Give it a short name to match the field name we use for it, which also matches the variable name used by the DBI(3pm) and DBD::SQLite(3pm) manpages.
2020-08-26	v2writable: compatibility with SWIG Xapian binding
	The SWIG binding won't auto-convert IV/UV to PV like the XS Search::Xapian binding would, so workaround that shortcoming for now. Fixes: a367ec1b15a2458 ("mbox: disable "&t" on existing Xapian until full reindex")
2020-08-26	over+msgmap: respect WAL journal_mode if set
	WAL actually seems to have ideal locking characteristics given concurrency problems I'm experiencing with --reindex running in parallel with expensive read-only SQLite queries: <https://public-inbox.org/meta/20200825001204.GA840@dcvr/> Unfortunately, we cannot blindly use WAL while preserving compatibility with existing setups nor our guarantees that read-only daemons are indeed "read-only". However, respect a user's choice to set WAL on their own if they're comfortable with giving -nntpd/-httpd/-imapd processes write permission to the directory storing SQLite DBs.
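	A minimal sketch of the "respect WAL if already set" idea, written in Python for illustration rather than taken from the project's Perl code. The helper name `open_db` and the TRUNCATE fallback are assumptions; the only point shown is that the journal mode is queried first and left alone when it is already WAL.

```python
import os
import sqlite3
import tempfile


def open_db(path: str, default_mode: str = "TRUNCATE") -> sqlite3.Connection:
    """Open `path`; keep WAL if a user enabled it, else apply `default_mode`.

    `default_mode` is an assumed fallback chosen for this sketch.
    """
    conn = sqlite3.connect(path)
    # With no argument, the pragma only reports the current mode.
    current = conn.execute("PRAGMA journal_mode").fetchone()[0]
    if current.lower() != "wal":
        conn.execute(f"PRAGMA journal_mode = {default_mode}")
    return conn


def demo_modes():
    """Return the journal modes seen for a WAL-enabled DB and a fresh DB."""
    tmpdir = tempfile.mkdtemp()
    wal_path = os.path.join(tmpdir, "user-wal.sqlite3")
    # Simulate a user opting in to WAL; the setting persists in the DB file.
    user = sqlite3.connect(wal_path)
    user.execute("PRAGMA journal_mode = WAL")
    user.execute("CREATE TABLE t (x)")
    user.commit()
    user.close()

    kept = open_db(wal_path)
    fresh = open_db(os.path.join(tmpdir, "fresh.sqlite3"))
    modes = (
        kept.execute("PRAGMA journal_mode").fetchone()[0],
        fresh.execute("PRAGMA journal_mode").fetchone()[0],
    )
    kept.close()
    fresh.close()
    return modes
```

	WAL is stored in the database file itself, which is why a later open can detect it without any extra configuration.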
2020-08-26	msgmap: use "CREATE TABLE IF NOT EXISTS"
	It's fewer queries and matches what we do in OverIdx.
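	As a rough illustration (Python, not the project's Perl), the statement below creates the table only when it is missing, so no separate existence check is needed. The column layout is a simplified, hypothetical stand-in for the real schema.

```python
import sqlite3


def ensure_schema(conn: sqlite3.Connection) -> None:
    """Create the table if absent; a no-op when it already exists."""
    # Hypothetical, simplified columns for illustration only.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS msgmap ("
        "num INTEGER PRIMARY KEY, "
        "mid TEXT NOT NULL UNIQUE)"
    )


def demo_schema():
    """Call ensure_schema twice and report how many msgmap tables exist."""
    conn = sqlite3.connect(":memory:")
    ensure_schema(conn)
    ensure_schema(conn)  # second call must not raise
    count = conn.execute(
        "SELECT COUNT(*) FROM sqlite_master "
        "WHERE type = 'table' AND name = 'msgmap'"
    ).fetchone()[0]
    conn.close()
    return count
```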
2020-08-26	over: skip nodatacow on the journal
	This file gets truncated anyhow, so it won't fragment.
2020-08-25	searchidx: croak for Xapian DB open failure
	croak() can give more context on the failure, and setting `PERL5OPT=-MCarp=verbose' can force a stacktrace.
2020-08-23	index: --sequential-shard checkpoints after each shard
	There's no reason we'd want Xapian to defer flushing once we've indexed everything belonging to a particular shard.
2020-08-23	mbox: disable "&t" on existing Xapian until full reindex
	Expanding threads via over.sqlite3 for mbox.gz downloads without Xapian effectively collapsing on the THREADID column leads to repeated messages getting downloaded. To avoid that situation, use a "has_threadid" Xapian metadata flag that's only set on --reindex (and brand new Xapian DBs). This allows admins to upgrade WWW or do --reindex in any order, without worrying about users eating up bandwidth and CPU cycles.
2020-08-23	search: support downloading mboxes results with full thread
	Finally, the addition of THREADID for collapsing results in Xapian lets us emulate the "mairix --threads" feature. That is, instead of returning only the matching messages, the entire thread is included in the downloaded mbox.gz. This requires a "public-inbox-index --reindex" to be usable.
2020-08-23	searchidx: index THREADID in Xapian
	This is the `tid' column from over.sqlite3; and will be used for IMAP and JMAP search (among other things).
2020-08-23	searchidx: put all shard-related stuff in SearchIdxShard.pm
	We'll also rename the /^remote_/ prefix to "shard_", since remote implies the process is on a different host. These methods only pass messages to a child process on the same host OR perform operations within the same process.
2020-08-23	searchidxshard: clear $msgref buffer properly
	Merely assigning `undef' to a scalar does not free the underlying buffer memory of a scalar.
2020-08-22	searchview: fix mbox.gz downloads for lynx users
	Unlike w3m and links, the lynx browser seems to require a `name' attribute for `<input type=submit>' elements. Maybe some other browsers do, too. The `name' attribute for submit elements doesn't seem to cause any harm for w3m or links users, either, despite not (AFAIK) being part of historical or current HTML specs.
2020-08-20	search: add mset_to_artnums method
	We can avoid importing mdocid() in several places by using this method, simplifying callers.
2020-08-20	init+index: support --skip-docdata for Xapian
	Since we no longer read document data from Xapian, allow users to opt-out of storing it. This breaks compatibility with previous releases of public-inbox, but gives us a ~1.5% space savings on Xapian storage (and associated I/O and page cache pressure reduction).
2020-08-20	smsg: remove from_mitem
	We no longer read docdata.glass from anywhere in our code base. Some adjustments were needed to t/search.t to deal with the Xapian::WritableDatabase committing at different times, since our ->query is avoided from PublicInbox::SearchIdx to avoid needing a {over_ro} field.
2020-08-20	mbox: avoid Xapian docdata in search results
	Another place where we can reduce kernel page cache overhead by hitting over.sqlite3 instead of docdata.glass.
2020-08-20	extmsg: avoid using Xapian docdata
	Once again, over.sqlite3 contains everything necessary for Message-ID resolution. Also, Xapian may be completely unnecessary with the advent of over.sqlite3, but that's for another time.
2020-08-20	searchview: convert nested and Atom display to over.sqlite3
	git blob retrieval dominates on these. "&x=t" (nested) is roughly the same, since the increased overhead for ->get_percent storage balances out the mass-loading from SQLite. Atom "&x=A" is sped up slightly and uses less memory in the long-lived response.
2020-08-20	searchview: speed up search summary by ~10%
	Instead of loading one article at-a-time from over.sqlite3, we can use SQL to mass-load IN (?,?, ...) all results with a single SQLite query. Despite SQLite being in-process and having no network latency, the reduction in SQL query executions from loading multiple rows at once speeds things up significantly. We'll keep the over->get_art optimizations from the previous commit, since it still speeds up long-lived responses, slightly.
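	The mass-loading idea can be sketched as follows, in Python for illustration only. The table and column names (`over`, `num`, `subject`) and the helper `load_many` are hypothetical; the technique shown is building one `IN (?,?,...)` query for all result numbers instead of issuing one query per article.

```python
import sqlite3


def load_many(conn, nums):
    """Fetch rows for every number in `nums` with a single query.

    Returns (num, subject) pairs in the same order as `nums`, since
    an IN (...) query does not promise any particular row order.
    """
    if not nums:
        return []
    placeholders = ",".join("?" * len(nums))
    rows = conn.execute(
        f"SELECT num, subject FROM over WHERE num IN ({placeholders})",
        list(nums),
    ).fetchall()
    by_num = dict(rows)
    # Restore the caller's ordering (e.g. search relevance).
    return [(n, by_num[n]) for n in nums if n in by_num]


def demo_load():
    """Build a tiny in-memory table and mass-load two of its rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE over (num INTEGER PRIMARY KEY, subject TEXT)")
    conn.executemany(
        "INSERT INTO over (num, subject) VALUES (?, ?)",
        [(1, "a"), (2, "b"), (3, "c")],
    )
    result = load_many(conn, [3, 1])
    conn.close()
    return result
```

	SQLite caps the number of bound parameters per statement, so a very large result set would need to be loaded in chunks.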
2020-08-20	searchview: use over.sqlite3 instead of Xapian docdata
	This is a step towards improving kernel page cache hit rates by relying on over.sqlite3 for document data instead of Xapian. Some micro-optimization to over->get_art was required to maintain performance.
2020-08-20	smsg: reduce utf8::decode call sites
	Both callers of load_from_data call utf8::decode, so just do utf8::decode in load_from_data.
2020-08-20	search: make qparse_new an internal function
	We'll probably be reusing it from another package in a future commit.
2020-08-20	searchquery: split off from searchview
	Since this was already a separate package, split it off into its own file since SearchView may not handle inbox groups.
2020-08-20	search: export mdocid subroutine
	No need to have awkward globrefs for this.
2020-08-20	search: improve comments around constants
	We'll probably be adding more value columns like THREADID to sort on.
2020-08-20	www: reduce long-lived PublicInbox::Search references
	While this is unlikely to be a problem in current practice, keeping Xapian DBs open for long responses can interfere with free space recovery after -compact. In the future, it will interfere with inbox search grouping and lead to unexpected results.
2020-08-20	xapcmd: simplify {reindex} parameter passing
	No need to localize it, here, since we can just refer to it in the `$opt' hashref. Hopefully this improves readability for others like it does for me. I sometimes wonder if the concept of a stack in high-level languages is even necessary...
2020-08-20	search: v2: ensure shards are numerically sorted
	This seems required to correctly get the NNTP article number from Xapian docid on combined Xapian DBs. The default (ASCII-betical) sorting was only acceptable for -imapd users until somebody hit 11 (or more) shards, which is a rare case.
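	A small Python illustration of why the sort order matters once shard directory names reach two digits. The helper `sort_shards` is hypothetical and assumes shards are named with plain decimal integers.

```python
def sort_shards(names):
    """Sort shard names numerically, ignoring non-numeric entries."""
    return sorted((n for n in names if n.isdigit()), key=int)


def demo_sort():
    """Compare default string sorting with numeric sorting for 11 shards."""
    names = [str(i) for i in range(11)]  # "0" through "10"
    ascii_order = sorted(names)          # string comparison
    numeric_order = sort_shards(names)   # integer comparison
    return ascii_order, numeric_order
```

	With string comparison, "10" lands between "1" and "2", so any mapping that depends on shard position is thrown off from that point on.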
2020-08-20	admin: progress shows the inbox being indexed
	This is helpful with --all, or when multiple inboxes are being indexed.
2020-08-19	v2writable: show newline after "indexing all of .. " message
	Otherwise things get very confusing when verbosity is enabled :x
2020-08-19	smsg: handle wide characters in raw mail headers
	There may be messages in the wild with wide characters in headers which aren't RFC 2047-encoded. Assume UTF-8 so those fields can round trip through over.sqlite3. This doesn't affect docdata.glass in Xapian, but it does affect how over.sqlite3 stores the same deflated info.
2020-08-13	v2writable: remove IdxStack import
	We use IdxStack via log2stack() from SearchIdx, now.
2020-08-13	xcpdb: wire up new index options and --help
	--sequential-shard also disables the copy parallelism (--jobs), so it can be useful for systems that are unable to handle parallel random I/O but still want many shards. There was a missing "use strict", too, which is fixed.
2020-08-13	admin: don't warn when --jobs exceeds shards
	Established tools like make(1), prove(1) and xargs(1) don't warn when the desired parallelism level can't be met, either.
2020-08-13	xapcmd: reduce CPU idling when shards exceeds job count
	In case there are unbalanced shards AND we're limiting parallelism while using many shards, spawn the next task in the queue ASAP once a task is done, instead of waiting for all tasks to finish before spawning the next batch. Unbalanced shards probably aren't a big issue for most users; however, many smaller shards with few jobs can be useful for HDD users to reduce the effect of random writes.
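	The scheduling change can be sketched in Python as below. This is an illustration using threads, not the project's process-spawning code, and `run_queue` is a hypothetical helper. It keeps at most `jobs` tasks in flight and starts a queued task as soon as any running one finishes, rather than waiting for a whole batch.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_queue(tasks, jobs):
    """Run callables with at most `jobs` in flight, refilling eagerly.

    `jobs` must be at least 1. Results come back in completion order.
    """
    pending = list(tasks)
    results = []
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        running = set()
        while pending or running:
            # Top up to the job limit before waiting on anything.
            while pending and len(running) < jobs:
                running.add(pool.submit(pending.pop(0)))
            # Wake up as soon as ONE task is done, not the whole batch.
            done, running = wait(running, return_when=FIRST_COMPLETED)
            results.extend(f.result() for f in done)
    return results


def demo_queue():
    """Run five trivial tasks with a limit of two concurrent jobs."""
    tasks = [lambda i=i: i * i for i in range(5)]
    return sorted(run_queue(tasks, jobs=2))
```

	With unbalanced work, a batch-at-a-time loop leaves workers idle until the slowest task in each batch ends; refilling on first completion avoids that idle time.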
2020-08-13	xapcmd: simplify sub reference
	We don't need to fully-qualify when referring to subs in the same namespace, nor do we need to make a SCALAR ref only to dereference it (Yes, still learning Perl :x)
2020-08-10	convert: support new -index options
	Converting v1 inboxes to v2 can be a painful experience on HDD. Some of the new options in the CLI or config file make it less painful.
2020-08-10	searchidx: use singular `$opt' for consistency with v2
	The rest of our indexing code uses `$opt' instead of `$opts'.
2020-08-10	index: cleanup internal variables
	Move away from hard-to-read alllowercase naming and favor snake_case or separated-by-dashes. We'll keep `--indexlevel' as-is for now, since it's been around for several releases; but we'll support `--index-level' in the CLI and update our documentation in a few months. We'll also clarify that publicInbox.indexMaxSize is only intended for -index, and not -watch or -mda.