public-inbox.git - an "archives first" approach to mailing lists

Date	Commit message
2021-02-03	lei q: support --jobs [SEARCHERS],[WRITERS]
	This comma-delimited parameter allows controlling the number of lei_xsearch and lei2mail worker processes. With the change to make IPC wq_* work use the event loop, it's now safe to run fewer worker processes for searching with no risk of deadlocks. MAX_PER_HOST isn't configurable yet for remote hosts, and maybe it shouldn't be due to potential for abuse.
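	As a rough illustration of the comma-delimited form described above, here is a minimal Python sketch (not the Perl code from the commit). The function name `parse_jobs` is hypothetical, and treating an empty or missing field as "keep the default" is an assumption of this sketch rather than documented lei behavior.

```python
from typing import Optional, Tuple


def parse_jobs(value: str) -> Tuple[Optional[int], Optional[int]]:
    """Split "SEARCHERS,WRITERS" into two optional positive integers.

    An empty or missing field yields None, meaning "keep the default"
    (an assumption of this sketch, not documented lei behavior).
    """
    searchers, _, writers = value.partition(",")

    def to_count(field: str) -> Optional[int]:
        field = field.strip()
        if not field:
            return None
        count = int(field)
        if count < 1:
            raise ValueError(f"job count must be positive: {field!r}")
        return count

    return to_count(searchers), to_count(writers)
```

	Under these assumptions, "2,4" would request two searchers and four writers, while ",4" would only change the writer count.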
2021-01-29	v2writable: nproc: use sysconf() on Linux and FreeBSD
	No need to fork a process on platforms I use daily, at least.
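	The same idea can be sketched in Python (the commit itself is Perl): ask sysconf() for the online CPU count instead of forking a helper. The function name `nproc` and the fallback to `os.cpu_count()` are choices of this sketch, not taken from the commit.

```python
import os


def nproc() -> int:
    """Return the number of online CPUs without forking a helper process."""
    try:
        # SC_NPROCESSORS_ONLN is available on Linux and FreeBSD.
        count = os.sysconf("SC_NPROCESSORS_ONLN")
    except (AttributeError, ValueError, OSError):
        count = -1  # sysconf or the name is unavailable on this platform
    if count < 1:
        # Portable fallback; os.cpu_count() may return None.
        count = os.cpu_count() or 1
    return count
```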
2021-01-26	miscidx: switch to lazy transactions
	This fixes a sporadic failure on a 1/2 core VM where "git cat-file --batch" hasn't started up by the time $cleanup->() destroys the ALL.git directory in t/lei.t (but not t/lei-oneshot.t). This happens because dwaitpid() runs inside the event loop asynchronously and we were able to return to the client before the cat-file process could even start. I could not reproduce this failure on my usual 4-core workstation via "schedtool -a 0x1" to force the entire test to use a single core. Lazy transactions match OverIdx and SearchIdx behavior, and I've verified this lets us avoid problems with old Xapian versions (on CentOS 7.x) which failed to set FD_CLOEXEC.
2021-01-18	extindex: fix w/ Xapian 1.2.21..1.2.24
	Xapian v1.2.21..v1.2.24 failed to set the close-on-exec flag on the flintlock FD, causing "git cat-file" processes to hold onto the lock and prevent subsequent Xapian::WritableDatabase from locking the DB. So cleanup git processes after committing the miscidx transaction.
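	For readers unfamiliar with the close-on-exec flag mentioned above, here is a small Python sketch of setting and checking it on a file descriptor. It only illustrates the flag that the affected Xapian versions did not set on the lock FD; it is not the fix in this commit, which cleans up git processes instead. The helper names are hypothetical.

```python
import fcntl


def set_cloexec(fd: int) -> None:
    """Set FD_CLOEXEC so the descriptor is closed across exec()."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)


def has_cloexec(fd: int) -> bool:
    """Report whether a spawned program would NOT inherit this descriptor."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)
```

	A descriptor without the flag stays open in any exec()ed child, which is how a long-lived "git cat-file" process could end up holding a lock it never asked for.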
2021-01-09	v2writable: exact discontiguous history handling
	We've always temporarily unindexed messages before reindexing them again if there's discontiguous history. This change improves the mechanism we use to prevent NNTP and IMAP clients from seeing duplicate messages. Previously, we relied on mapping Message-IDs to NNTP article numbers to ensure clients would not see the same message twice. This worked for most messages, but not for messages with reused or duplicate Message-IDs. Instead of relying on Message-IDs as a key, we now rely on the git blob object ID for exact content matching. This allows truly different messages to show up for NNTP|IMAP clients, while still preventing those clients from seeing the same message again.
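	To illustrate why a blob object ID works as an exact-content key, here is a Python sketch that computes the SHA-1 blob ID the way git does (the hash of "blob <size>\0" followed by the content) and uses it for duplicate detection. The helper names and the in-memory set are assumptions of this sketch; the commit's actual bookkeeping is not shown here, and SHA-256 repositories would produce different IDs.

```python
import hashlib


def git_blob_oid(raw: bytes) -> str:
    """Compute the SHA-1 object ID git assigns to a blob with this content."""
    header = b"blob %d\0" % len(raw)
    return hashlib.sha1(header + raw).hexdigest()


def is_duplicate(seen_oids: set, raw: bytes) -> bool:
    """Return True if this exact content was seen before; record it otherwise."""
    oid = git_blob_oid(raw)
    if oid in seen_oids:
        return True
    seen_oids.add(oid)
    return False
```

	Two messages sharing a Message-ID but differing in content get different object IDs, so both remain visible, while a byte-for-byte repeat is caught.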
2021-01-03	searchidxshard: use add_xapian directly for v2
	This lets us distinguish more clearly between v1 and v2-only code paths, and may save a few cycles.
2021-01-03	ipc: switch to one-way pipes
	This fixes a performance regression in multi-process v2 indexing due to the switch to PublicInbox::IPC. While Unix sockets are fewer FDs to manage, pipes allow unprivileged processes to use larger buffers (up to 1M) on out-of-the-box Linux instances. A larger buffer via F_SETPIPE_SZ afforded by pipes was proven valuable during v2 development in 2018 and continues to be valuable when we get significant amounts of one-way traffic from the producer parent to worker children. Compression may be an option for systems without F_SETPIPE_SZ; but it increases CPU usage with no memory bandwidth savings on hosts where larger buffers are available.
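	A Linux-only Python sketch of the larger-buffer technique mentioned above: request a 1M pipe buffer with F_SETPIPE_SZ and fall back to the default size if the request is refused. The function name is hypothetical, and the numeric fallbacks are the Linux values of the two fcntl commands for Python versions that do not export them.

```python
import fcntl
from typing import Optional

# Linux-specific fcntl commands; the numeric fallbacks are the Linux values
# for Python versions older than 3.10 that lack these constants.
F_SETPIPE_SZ = getattr(fcntl, "F_SETPIPE_SZ", 1031)
F_GETPIPE_SZ = getattr(fcntl, "F_GETPIPE_SZ", 1032)


def enlarge_pipe(fd: int, want: int = 1 << 20) -> Optional[int]:
    """Try to grow a pipe buffer to `want` bytes; return the resulting size.

    Returns None where the pipe size cannot be queried (non-Linux systems).
    """
    try:
        fcntl.fcntl(fd, F_SETPIPE_SZ, want)
    except OSError:
        pass  # over the pipe-max-size limit, or unsupported: keep the default
    try:
        return fcntl.fcntl(fd, F_GETPIPE_SZ)
    except OSError:
        return None
```

	The producer would call this once on the write end of each one-way pipe right after creating it.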
2021-01-03	use Eml (or MIME) objects for all indexing paths
	We don't need to be keeping the raw message around after it hits git. Shard work now relies on Storable (or Sereal) and all of the indexing code relies on the Email::MIME-like API of Eml to access interesting parts of the message. Similarly, smsg->{raw_bytes} is no longer carried around and we do the CRLF adjustment when setting smsg->{bytes}. There's also a small simplification to t/import.t while we're in the area to use xqx instead of spawn/popen_rd.
2021-01-03	searchidxshard: replace index_raw with index_eml
	Since Storable and Sereal are designed for lossless serialization, we'll just pass $eml objects to whatever process is running SearchIdx.
2021-01-03	searchidxshard: IPC conversion, part 2
	We can remove some now-pointless wrapper functions by using ->ipc_do in even more places.
2021-01-03	searchidxshard: use PublicInbox::IPC to kill lots of code
	It's nice to prove the new code works by swapping it into the current V2Writable / SearchIdxShard packages. This is only the first step for the core bits, and we'll be able to delete more code in a subsequent patch.
2021-01-01	update copyrights for 2021
	Using "make update-copyrights" after setting GNULIB_PATH in my config.mak
2021-01-01	spawn: move run_die here from PublicInbox::Import
	It seems like a more logical place for it, but we'll favor the newly-added xsys_e() in tests for BAIL_OUT use.
2020-12-31	Merge remote-tracking branch 'origin/master' into lorelei
	* origin/master: (58 commits)
	  ds: flatten + reuse @events, epoll_wait style fixes
	  ds: simplify EventLoop implementation
	  check defined return value for localized slurp errors
	  import: check for git->qx errors, clearer return values
	  git: qx: avoid extra "local" for scalar context case
	  search: remove {mset} option for ->mset method
	  search: remove pointless {relevance} setting
	  miscsearch: take reopen from Search and use it
	  extsearch: unconditionally reopen on access
	  extindex: allow using --all without EXTINDEX_DIR
	  extindex: add undocumented --no-scan switch
	  extindex: enable autoflush on STDOUT/STDERR
	  extindex: various --watch signal handling fixes
	  extindex: --watch for inotify-based updates
	  eml: fix undefined vars on <Perl 5.28
	  t/config: test --get-urlmatch for git <2.26
	  default to CORE::warn in $SIG{__WARN__} handlers
	  inbox: name variable for values loop iterator
	  inboxidle: avoid needless syscalls on refresh
	  inboxidle: clue users into resolving ENOSPC from inotify
	  ...
2020-12-27	extindex: --watch for inotify-based updates
	This reuses existing InboxIdle infrastructure to update external indices based on per-inbox updates. This is an alternative to auto-updating external indices via the -index command and also works with existing uses of -mda and public-inbox-watch. Using inotify (or EVFILT_VNODE) allows watching thousands of inboxes without having to scan every single one at every invocation. This is especially beneficial in cases where an external index is not writable to the users writing to per-inbox indices.
2020-12-26	v2writable: don't verify tip if reindexing
	We only rely on git-rev-parse to resolve symbolic names ("HEAD") to a SHA-* git commit ID. We'll assume any git commit IDs we get from SQLite DBs are valid and let "git-log" fail if it isn't.
2020-12-26	index: do not attach inbox to extindex unless updated
	We'll count the number of log changes (regardless of index or unindex) and only attach inboxes to ExtSearchIdx objects when they get new work. We'll also reduce lock bouncing and only update external indices after all per-inbox indexing is done. This also updates existing v2 indexing/unindexing callers to be more consistent and ensures unindex log entries update per-inbox last commit information.
2020-12-25	index: support --fast-noop / -F switch
	Note: I'm not sure if it's worth documenting and supporting this long-term. We can avoid taking locks for invocations of "index --all" and rely on high-resolution ctime (struct timespec st_ctim) comparisons of msgmap.sqlite3 and the packed-refs + refs/heads directory of the newest epoch. This cuts public-inbox-index invocations with "--all --no-update-extindex -L basic" down from 0.92s to 0.31s. The change with "-L medium" or "-L full" and (default) non-zero jobs is even more drastic, reducing a 12-13s no-op invocation down to the same 0.31s.
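	The lock-free check described above can be sketched in Python as a nanosecond-resolution ctime comparison. This is only an approximation of the idea: the function name is hypothetical, and both the direction of the comparison and the treatment of equal timestamps as "current" are assumptions of this sketch.

```python
import os
from typing import Iterable


def index_is_current(index_path: str, ref_paths: Iterable[str]) -> bool:
    """Return True if no ref path changed after the index file did.

    Compares nanosecond-resolution ctime; any missing file means the
    caller should fall back to the normal, locking code path.
    """
    try:
        index_ctime = os.stat(index_path).st_ctime_ns
        newest_ref = max(os.stat(p).st_ctime_ns for p in ref_paths)
    except (OSError, ValueError):  # ValueError: ref_paths was empty
        return False
    return newest_ref <= index_ctime
```

	A caller would pass msgmap.sqlite3 as the index path and the newest epoch's packed-refs file and refs/heads directory as the ref paths, skipping the inbox entirely when the result is true.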
2020-12-25	inboxwritable: delay umask_prepare calls
	This simplifies all ->with_umask callers and opens the door for further optimizations to delay/elide process spawning.
2020-12-19	search: simplify initialization, add ->xdb_shards_flat
	This reduces differences between v1 and v2 code, and introduces ->xdb_shards_flat to provide read-only access to shards without using Xapian::MultiDatabase. This will allow us to combine shards of several inboxes AND extindexes for lei.
2020-12-19	lei_store: local storage for Local Email Interface
	Still unstable, this builds off the equally unstable extindex :P This will be used for caching/memoization of traditional mail stores (IMAP, Maildir, etc) while providing indexing via Xapian, along with compression, and checksumming from git. Most notably, this adds the ability to add/remove per-message keywords (draft, seen, flagged, answered) as described in the JMAP specification (RFC 8621 section 4.1.1). We'll use `.' (a single period) as an $eidx_key since it's an invalid {inboxdir} or {newsgroup} name.
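	As one example of where such per-message keywords could come from, here is a Python sketch that derives them from the standard Maildir flag letters (D, S, F, R). Whether and how lei performs this mapping is an assumption of this sketch; only the flag letters themselves and the four keyword names are taken as given. The names `MAILDIR_FLAG_TO_KEYWORD` and `maildir_keywords` are hypothetical.

```python
# Maildir "info" flag letters and the keywords they correspond to.
MAILDIR_FLAG_TO_KEYWORD = {
    "D": "draft",
    "S": "seen",
    "F": "flagged",
    "R": "answered",
}


def maildir_keywords(filename: str) -> set:
    """Extract keywords from a Maildir filename such as "1608.host:2,RS"."""
    _, sep, flags = filename.rpartition(":2,")
    if not sep:
        return set()  # no info section, e.g. a message still in new/
    # Flags without a keyword counterpart (e.g. "T" for trashed) are skipped.
    return {MAILDIR_FLAG_TO_KEYWORD[f] for f in flags if f in MAILDIR_FLAG_TO_KEYWORD}
```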
2020-12-17	inboxwritable: drop git_dir_n sub
	There's only one caller and there are unlikely to be any more, so it should be harmless to open-code.
2020-12-17	inbox: simplify v2 epoch counting
	Perl readdir detects list context and can return an array suitable for the grep op. From there, we can rely on substr to remove the ".git" suffix and integerize the value to save a few bytes before letting List::Util::max return the value. This is how we detect Xapian shards nowadays, too, and we'll also use defined-or (//) to simplify the return value there. We'll also simplify InboxWritable->git_dir_latest, remove some callers, and consider removing it entirely.
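	The readdir, grep, substr and max pipeline described above translates to Python roughly as follows. This is a sketch, not the Perl from the commit; the function name and the choice to return None for an empty or missing directory are assumptions.

```python
import os
import re
from typing import Optional

EPOCH_DIR = re.compile(r"[0-9]+\.git")


def max_epoch(git_dir: str) -> Optional[int]:
    """Return the highest N among N.git entries in git_dir, or None."""
    try:
        names = os.listdir(git_dir)
    except FileNotFoundError:
        return None
    # Strip the ".git" suffix and integerize, mirroring the substr step.
    epochs = [int(name[:-4]) for name in names if EPOCH_DIR.fullmatch(name)]
    return max(epochs, default=None)
```

	Comparing integers rather than names matters once there are ten or more epochs, since "10.git" sorts before "9.git" as a string.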
2020-12-17	extsearchidx: reindex releases over.sqlite3 handles properly
	When checkpointing and yielding the lock to other processes, we need to ensure any open DB statement handles are closed, since they reference and prevent DB FDs from being closed and unlocked. And clean up some progress reporting while we're at it.
2020-12-17	extsearchidx: checkpoint releases locks
	--reindex can take many hours or days, ensure we release locks according to --batch-size so automated fetch+index jobs can write new data to indices while we update old data.
2020-12-17	extindex: preliminary --reindex support
	--reindex allows us to catch missed and stale messages due to -extindex vs -index races prior to commit 02b2fcc46f364b51 ("extsearchidx: enforce -index before -extindex"). We'll also rely on reindex to internally deal with v1/v2 inbox removals and partial-unindexing of messages which are only removed from one inbox out of many. This reindex design is completely different than how normal v1/v2 inbox reindex operates due to extindex having multiple histories to work with. Instead of scanning git history, this relies exclusively on comparing over.sqlite3 contents between the v1/v2 inboxes and the extindex. Changes to Xapian behavior also get picked up, now. Xapian indexing is handled by workers with minimal IPC to the parent process. This results in more read I/O but fewer writes when dealing with cross-posted messages. Changes to $smsg->populate and --rethread still need further work.
2020-12-10	extsearchidx: enforce -index before -extindex
	We cannot set xref3 data without the `xnum' column to tie it to the per-inbox over.sqlite3 DB. So ensure we don't read brand-new history that only exists in git, but instead rely on last_commit and last_xap15-$EPOCH metadata in msgmap to decide how far we can index. Before this change, it was possible to miss messages in the extindex if -index did not run (which will be fixable by upcoming --reindex support in -extindex).
2020-12-10	searchidx: all indexers check for bad blobs
	This should help us detect bugs in our code or storage synchronization problems more easily. This probably won't detect corrupted git storage, but can detect corrupted SQLite files. "Bad blobs, bad blobs, whatcha gonna do when they come for you?"
2020-12-08	searchidx: remove $oid parameter from most calls
	Xapian docids have been tied to the over {num} column for nearly 3 years, now; and OIDs are no longer stored in Xapian document data. There's no need to increase code and IPC complexity by passing the OID around.
2020-11-29	v2writable: detect shard count for ExtSearchIdx properly
	Otherwise, any explicitly set shard counts were ignored and we'd be counting CPUs every single time.
2020-11-28	*index: more consistent graceful shutdown checks
	v1 and v2 inbox indexing now supports graceful shutdown checks just like ExtSearchIdx. Additionally, we'll consistently perform quit checks at the top of loops. Interaction with the --xapian-only and --sequential-shard options is a bit lacking, and will warn the user to use "--reindex --xapian-only" to fix.
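	The pattern of checking for a quit request at the top of each loop iteration can be sketched in Python as follows. All names here are hypothetical and the real indexers are Perl; the point is only that the signal handler records the request and the loop decides when it is safe to stop.

```python
import signal

quit_requested = False


def _request_quit(signum, frame):
    """Signal handler: only record the request; the loop acts on it."""
    global quit_requested
    quit_requested = True


def install_quit_handlers():
    """Route QUIT/INT/TERM to the flag instead of killing the process."""
    for sig in (signal.SIGQUIT, signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, _request_quit)


def index_all(items, index_one):
    """Index items in order, stopping cleanly once a quit is requested."""
    done = []
    for item in items:
        if quit_requested:  # quit check at the top of the loop
            break
        index_one(item)
        done.append(item)
    return done
```

	Because the check happens before work on an item starts, an item that is already being indexed always finishes, and a later run can resume from the first item that was not done.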
2020-11-28	mm: min/max: return 0 instead of undef
	This simplifies callers and allows empty newsgroups to be represented (the WWW UI may be insufficient there, too).
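	In SQL terms, MIN() and MAX() return NULL on an empty table, which is where an undefined value would come from. A Python sketch of returning 0 instead, using a simplified, assumed `msgmap` table layout and a hypothetical function name:

```python
import sqlite3


def num_range(db: sqlite3.Connection) -> tuple:
    """Return (min, max) of msgmap.num, using 0 for an empty table."""
    # MIN()/MAX() yield NULL on an empty table; COALESCE maps that to 0.
    row = db.execute(
        "SELECT COALESCE(MIN(num), 0), COALESCE(MAX(num), 0) FROM msgmap"
    ).fetchone()
    return row[0], row[1]
```

	Callers can then do arithmetic on the result directly, without a separate check for "no messages yet".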
2020-11-24	miscsearch: a new Xapian sub-DB for extindex
	This will be used to index and search Inbox objects and perhaps individual git repositories/epochs for grokmirror manifest.js.gz generation. There is no sharding planned for this at the moment since inbox count should remain low (~100K to 1M) compared to message count. Folding this into the existing sharded DBs could be possible; but would likely increase query and maintenance costs, as well as development complexity. So we'll use a few more inodes and FDs at runtime, instead.
2020-11-17	v2writable: avoid initiating leftover unindex if interrupted
	We can also avoid a needless progress message on log2stack interruptions.
2020-11-15	extindex: support graceful shutdown via QUIT/INT/TERM
	Just like the daemon processes, -extindex now supports graceful shutdown via the same signals. This lets users avoid having to repeat indexing messages when a power outage strikes during a long (multi-hour/day) indexing run. Per-inbox (v1/v2) -index graceful shutdowns are not supported yet, but are planned for later.
2020-11-15	*index: discard sync->{todo} on iteration
	There's no need to continuously append to {todo} when indexing multiple inboxes. They're not redundantly indexed (because the IdxStack is discarded, making it a noop), but it's still a waste of memory keeping the $unit hashrefs around.
2020-11-15	*index: avoid per-epoch --batch-check processes
	Since all.git (v2) and ALL.git (extindex) encompass every single epoch or indexed inbox; and is_ancestor() only uses hexadecimal OIDs; there is no good reason to use $unit->{git} for an epoch-local $git->check. This prevents dozens/hundreds of --batch-check processes from being left running after indexing and can improve locality if size checks are being done (since that uses --batch-check, too). Theoretically several epochs may have conflicting OIDs, but we're screwed in those cases, anyways, so we might as well detect it earlier (though I'm not sure what the behavior would be :x).
2020-11-15	*index: checkpoints write last_commit metadata
	This will set us up for supporting graceful shutdown on -index without repeating any work.
2020-11-08	v2writable: more accurate {current_info} warnings/progress
	With async git blob retrievals, the OID being enqueued and the OID being processed can be totally unrelated and misleading. We'll also prefix $INBOX_DIR for v2, and not just the epoch since we could be indexing multiple inboxes via both -index and -extindex.
2020-11-08	v2writable: less expensive checkpoint for extindex
	Since extindex holds no locks on parallel inbox writers, we can simply use "barrier" IPC shard commands to checkpoint and avoid respawning shard or git processes.
2020-11-07	extsearchidx: handle edits
	We can now handle cases where messages are edited in one inbox but not another, bifurcating the message. V2Writable::log_range handles some edge-cases which could happen in v2-only code paths, as well, but weren't usually triggered due to default git-gc knobs not pruning immediately.
2020-11-07	v2writable: pass oid to uindex_oid
	We'll be validating against this in the future to stop bugs from creeping in.
2020-11-07	v2writable: reduce scope of epoch-aware code
	And clearly label it. We may try to reuse some of this for v1 indexing code paths.
2020-11-07	v2writable: move size check init to sync_prepare
	This will let us use it from ExtSearchIdx.
2020-11-07	v2writable: make *last_commits and sync_prepare OO methods
	This will allow ExtSearchIdx to override or reuse them more easily. Unfortunately we lose prototype validation, but that seems to be discouraged anyways given the 'signatures' feature in Perl 5.20+.
2020-11-07	v2writable: rename {v2w} field to {self}
	This will make it easier to reuse some indexing code for ExtSearchIdx.
2020-11-07	v2writable: allow OO method references
	Using `->can(method)' allows subclasses to override `index_oid' and `unindex_oid' methods.
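	A loose Python analogue of resolving a method reference by name, so that a subclass override is picked up, is shown below. Perl's `->can` and Python's `getattr` are not identical, and the class names are hypothetical stand-ins rather than the real packages.

```python
class Writer:
    def index_oid(self, oid: str) -> str:
        return f"index {oid}"


class ExtWriter(Writer):
    def index_oid(self, oid: str) -> str:  # subclass override
        return f"ext-index {oid}"


def method_ref(obj, name: str):
    """Resolve a method by name at run time, like Perl's ->can(name)."""
    method = getattr(obj, name, None)
    return method if callable(method) else None
```

	Looking the method up on the object, rather than hard-coding a function reference, is what lets the shared sync code call the subclass's version.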
2020-11-07	v2writable: more generic sync setup code
	We want to reuse this code for ExtSearchIdx, eventually.
2020-11-07	searchidx: log2stack: simplify callers
	Since we store {ibx} in $sync state, we no longer have to pass it as an argument to log2stack.
2020-11-07	v2writable: checkpoint: account for lack of {mm}
	ExtSearchIdx will not have Msgmap, since it may index non email blobs in the future (it'll still be usable with IMAP, but not NNTP).