public-inbox.git - an "archives first" approach to mailing lists

Date	Commit message
2021-10-13	fetch: support --try-remote/-T for alternate remote names
	This allows -fetch to work out-of-the-box with the grokmirror 2.x default remote name of "_grokmirror".
2021-10-13	eml: avoid Encode 2.87..3.12 leak
	Encode::FB_CROAK leaks memory in old versions of Encode: <https://rt.cpan.org/Public/Bug/Display.html?id=139622> Since I expect there are still many users on old systems and old Perls, we can use "$SIG{__WARN__} = \&croak" here with Encode::FB_WARN to emulate Encode::FB_CROAK behavior.
2021-10-13	test_common: hoist out tail_f sub
	We'll be reusing this in more places. While we're at it, allow it to tail all run_script() users, including lei() in TestCommon.
2021-10-13	www: preload: load ExtSearch via ->ALL
	This ought to give us more CoW savings and fragmentation avoidance in -httpd.
2021-10-13	extindex: set {current_info} in eidxq processing
	This gives context as to where warnings are coming from.
2021-10-13	treewide: use warn() or carp() instead of env->{psgi.errors}
	Large chunks of our codebase and 3rd-party dependencies do not use ->{psgi.errors}, so trying to standardize on it was a fruitless endeavor. Since warn() and carp() are standard mechanisms within Perl, just use them instead and simplify a bunch of existing code.
2021-10-13	lei: use standard warn() in more places
	warn() is easier to augment with context information, and frankly unavoidable in the presence of 3rd-party libraries we don't control.
2021-10-13	extindex: show OID on bad blob failure
	AFAIK I've never hit these messages, but I might be glad if I ever do.
2021-10-13	daemon: set $SIG{__WARN__} properly
	Eml->warn_ignore_cb itself returns a callback, so creating a reference to it was wrong when assigning it to $SIG{__WARN__}. Fixes: 176cd51f9aa81b74 ("daemon: quiet down Eml-related warnings")
2021-10-13	lei up --all: show output for warnings
	This helps users make sense of which saved searches some warnings were coming from. Since I often create and discard externals, some warnings from saved searches were confusing to me without output context: "`$FOO' is unknown" "$FOO not indexed by Xapian"
2021-10-13	index: optimize after all SQLite DB commits
	This covers v1 inboxes, as well. We also guard the execution since "PRAGMA optimize" was only introduced in SQLite 3.18.0 (2017-03-30)
2021-10-13	lei/store: use remove_doc to save some LoC

2021-10-13	extindex: flush pending reindex before unref
	This prevents unnecessary message renumbering and I/O. Without this change, there is a small window for long-running WWW streaming requests to miss a message that was unref-ed before reindexing. If we expose an "All Mail" mailbox via IMAP/JMAP, this will save client traffic.
2021-10-12	www: _/text/config/raw Last-Modified: is mm->created_at
	This allows IMAP mirrors to keep UIDVALIDITY synchronized (and "LIST ACTIVE.TIMES" in NNTP). "lei add-external --mirror" will automatically set it, as will the combination of public-inbox-clone + public-inbox-index. This avoids the need for extra endpoints or config entries, at least...
2021-10-12	msgmap: ->new_file to supports $ibx arg, drop ->new
	The original Msgmap->new API was v1-specific and not necessary. The ->new_file API now supports an $ibx object being passed to it, simplifying -no_fsync use. It will also make an upcoming change easier...
2021-10-12	daemon: unconditionally close Xapian shards on cleanup
	The cost of opening a Xapian DB (even with shards) isn't high, so save some FDs and just close it. We hit Xapian far less than over.sqlite3 and we discard the MSet ASAP even when streaming large responses. This simplifies our code a bit and hopefully helps reduce fragmentation by increasing mortality of late allocations.
2021-10-12	msgmap: share most of check_inodes w/ over
	We still need to account for msgmap being open all the time and not having separate read-only vs. read-write packages.
2021-10-12	msgmap: use DBI->prepare_cached
	msgmap is not performance-critical enough to justify doing our own prepared statement caching. Just rely on the functionality of DBI here so future changes will be easier. There are also minor style changes to avoid dirtying cache lines with refcount bumps: we repeat hash lookups rather than attempting to store the results as locals.
2021-10-12	nntp: use defined-OR from Perl 5.10 for msgid check
	"<0>" could be a valid Message-ID, maybe...
2021-10-12	search: delete QueryParser along with DB handle
	Xapian::QueryParser is attached to the Xapian::Database, so holding onto the QueryParser was preventing us from releasing DB handles if a query was performed.
2021-10-12	daemon: quiet down Eml-related warnings
	Email::Address::XS is quite noisy and there's nothing we can really do about messages we're serving from read-only daemons.
2021-10-12	daemon: use v5.10.1, disable local warnings
	We're moving towards relying on "perl -w" for warnings and v5.12 for strict.
2021-10-12	isearch: do not access Extsearch->{over} directly
	It may not exist due to periodic cleanup to avoid excessive FD use.
2021-10-12	extindex: avoid invalid blobs after unref
	When unref-ing a blob from xref3, make sure the "preferred" smsg->{blob} doesn't point to the blob we just unrefed. This is necessary because we periodically checkpoint our extindex process to allow -watch and -mda processes to run. This also gets rid of a lot of redundant code for ->remove_xref3, since it's all handled in ExtSearchIdx, now.
2021-10-12	extindex: more consistent doc removal
	We need to ensure a message is consistently removed from eidxq, over and Xapian in all cases. Removing from eidxq saves users from some noisy error messages.
2021-10-12	extindex: share unref logic in more places
	We can use the same logic for --gc and --reindex and 'd' log entries. They're similar enough and the actual need to unref should be fairly rare. We could go a lot faster if we didn't show progress for --gc and --reindex, actually.
2021-10-12	extindex: rename var: active => active_shards
	We also have the idea of active inboxes, too, so "active shards" ought to make the purpose of the data structure more obvious.
2021-10-12	sqlite: PRAGMA optimize on close
	As recommended by SQLite documentation[1]: To achieve the best long-term query performance without the need to do a detailed engineering analysis of the application schema and SQL, it is recommended that applications run "PRAGMA optimize" (with no arguments) just before closing each database connection. Hopefully that works for our use cases and can make things faster for us. [1] https://www.sqlite.org/pragma.html#pragma_optimize
2021-10-12	extindex: speed up --reindex --fast
	This required some tweaking of xref3 indices in over.sqlite3, but the end result is it brings no-op "--reindex --fast --all" checks down to roughly 20 minutes (from 30-40 minutes) on lore/all. This is faster because a bunch of small SQLite queries are still slower en masse than a bunch of perlops. Despite the lack of IPC overhead, crossing .so boundaries and repeating lookups over btrees is still slower than doing the same with Perl hash tables.
2021-10-10	extindex: sync each inbox before checking for missed messages
	Otherwise, it gets too noisy and we repeat some work when we do an actual sync, since the last_commit info will be out-of-date.
2021-10-10	lei/store: keep ".err-XXXX" in stderr tmpfile
	This is slightly more meaningful since the file is already in ~/.local/share/lei/store, so "lei_store" was redundant (and the "XXXX" are random characters replaced by File::Temp).
2021-10-10	extindex: --gc doesn't touch ghost entries
	We were deleting ghost entries. This was usually harmless since other messages could fill-in-the-blanks, but could cause misthreading in odd cases where a big chunk of a thread is missing and the latest messages only referenced ghosts. We'll also save some cycles when scanning Xapian shards since docids won't be <= 0.
2021-10-10	extindex: minor cost reductions
	Don't bother decoding the 20-byte SHA-1 to a 40-byte hex value since we don't read it, anyways. We can also use the on-stack ibx->eidx_key value instead of dispatching the method again.
2021-10-10	extindex: speed up Xapian cleanup in --gc
	Avoiding repeated SQL statements brings --gc down to 2-3 minutes from around 10. We'll also add some checkpoints around over and xref3 cleanups.
2021-10-10	set nodatacow on more SQLite files
	We'll set nodatacow when detecting existing but empty files, and also their directories in more cases (for auxiliary -wal, -journal, -shm files). Hopefully this keeps performance reasonable on CoW FSes.
2021-10-10	admin: add '# ' prefix for progress messages
	It's more consistent with TAP output and hopefully puts users at ease in case they don't understand the meaning of a message.
2021-10-10	lei_to_mail: show --output on augment progress failure
	Just in case it fails when there are many parallel invocations.
2021-10-09	extindex: support --reindex --fast
	This mode only checks history for missed/stale messages and doesn't attempt to reindex messages which are already indexed.
2021-10-09	view: save memory by dropping smsg->{from_name} on use
	We'll also save a few LoC when generating it. $smsg objects can linger a while when rendering large threads, so saving a few bytes here can add up to several hundred KB saved. I noticed this while chasing the ref cycle leak in commit b28e74c9dc0a (www: fix ref cycle from threading w/ extindex, 2021-10-03). While there's no longer a leak, releasing memory earlier can allow it to be reused sooner and reduce both memory traffic and memory pressure.
2021-10-09	http: avoid Perl target cache for psgi.input
	This uses syswrite to populate env->{psgi.input}. The substr() call in IO::Handle->write will trigger Perl's target/scratchpad and result in a permanent allocation. Since this is a cold path, that allocation is pointless, and syswrite() can already write a substring. Allowing Perl to cache a large allocation in a cold path only results in fragmentation and wasted RAM. write(2) on a regular file won't result in short writes unless the FS quotas or free space limits are hit, or the buffer is close to overflowing (e.g. the 0x7ffff000-byte Linux limit). Since our HTTP server will never buffer that much in RAM, there's no need to retry syswrite nor rely on the retrying implicit in IO::Handle->write and the "print" perlop.
2021-10-09	view: discard Eml->{bdy} when done using
	We can release the raw body buffer once we've obtained a copy of the decoded buffer. This reduces memory pressure ahead of some expensive diff processing.
2021-10-09	solver_git: shorten scalar lifetimes
	Some of these scalar buffers may be large patches, so try to keep them as short-lived as possible to reduce memory pressure.
2021-10-09	net_reader: hoist out _imap_fetch_bodies
	We'll be supporting pipelining in a future commit, since Tor is too slow and increasing batch size can use too much memory.
2021-10-08	git: fatalize async callback errors by default
	This should help us catch BUG: errors (and then some) in -extindex and other read-write code paths. Only read-only daemons should warn on async callback failures, since those aren't capable of causing data loss.
2021-10-08	git: async_abort includes --batch-check requests
	We need to abort both check-only and cat requests when aborting, since we'll be aborting more aggressively in read-write paths.
2021-10-08	git: use async_wait_all everywhere
	Some code paths may use maximum size checks, so ensure any checks are waited on, too.
2021-10-08	overidx: each_by_mid: account for messages being deleted
	This may fix some extindex problems and should get rid of the "Can't bless non-reference value" errors.
2021-10-06	ds: tmpio: avoid Perl target cache
	The use of `substr' here as an argument to `print' was causing Perl to internally cache its target buffer. Since `syswrite()' already offers a buffer offset arg and length limits, just use `syswrite' directly. We were using autoflush anyways, so the lack of buffering was of no concern performance-wise. The target buffer could get to roughly 10MB under some loads, but it was usually a cold path and using memory which cannot be released nor reused in other places. Note: IO::Handle::write uses `substr' internally, too; so nothing would be gained using IO::Handle::write.
2021-10-06	msg_iter: split_quotes adds trailing "\n"
	The regexp in split_quotes relies on the presence of a final "\n", so add it wherever we need to instead of making it the responsibility of every caller. This probably doesn't matter in practice since every email seems to have a "\n" as the final byte (due to the way SMTP works), but maybe there's some odd ones that'll get imported via lei.
2021-10-06	overidx: subject_path: allow non-ASCII char in subject matches
	This should bring us closer to the "Base subject" definition in IMAP ORDEREDSUBJECT (RFC 5256 2.1). Larger changes may cause some breakage (until --reindex). But for now, a reindex will prevent the non-ASCII subjects from being normalized to the same fuzzy "thread" in the thread view.
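
As a rough illustration of the guarded "PRAGMA optimize" described in the 2021-10-13 entry "index: optimize after all SQLite DB commits", here is a minimal sketch in Python rather than the project's Perl. The function names and the demo table are hypothetical and are not taken from public-inbox; only the 3.18.0 version threshold comes from the commit message.

```python
import sqlite3

# "PRAGMA optimize" first appeared in SQLite 3.18.0 (2017-03-30).
MIN_OPTIMIZE_VERSION = (3, 18, 0)


def optimize_supported(version_info=sqlite3.sqlite_version_info):
    """Return True if the given SQLite library version has PRAGMA optimize."""
    return tuple(version_info) >= MIN_OPTIMIZE_VERSION


def commit_and_optimize(conn):
    """Commit pending writes, then run PRAGMA optimize when it is available.

    Returns True if the pragma was executed, False if it was skipped.
    (Hypothetical helper for illustration only.)
    """
    conn.commit()
    if not optimize_supported():
        return False
    conn.execute("PRAGMA optimize")
    return True


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Illustrative table; not the real public-inbox schema.
    conn.execute("CREATE TABLE demo (num INTEGER PRIMARY KEY, mid TEXT)")
    conn.execute("INSERT INTO demo (mid) VALUES (?)", ("<0>",))
    print("optimized:", commit_and_optimize(conn))
    conn.close()
```

Checking the library version once and skipping the pragma on older libraries keeps the commit path identical everywhere; the optimize step is just an optional extra after the commit.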