public-inbox.git - an "archives first" approach to mailing lists

Date	Commit message
2020-11-29	nntpd: remove redundant {groups} shortcut
	It's not worth confusing hackers reading the source to have two ways to access the same (large) hash table. So just go through PublicInbox::Config objects for now since the extra hash lookup isn't going to be noticeable. I've also started favoring "for" instead of "foreach" since they're the equivalent perlop and less wear on my fingers + keyboard.
2020-11-29	nntp: XPATH uses ->ALL extindex, too
	Another 30-40% speedup when testing against a local lore.kernel.org mirror. In either case, we'll consistently sort the response for ease-of-testing and client-side cache-friendliness.
2020-11-29	nntp: art_lookup: use mid_lookup and simplify
	This lets us take advantage of mid_lookup speedup from the previous commit. While we're at it, start moving towards using `$ibx' as the abbreviation for PublicInbox::Inbox objects even in the NNTP code, since they've been shared with the WWW code for several years, now.
2020-11-29	nntp: speed up mid_lookup() using ->ALL extindex
	We can reuse "xref3" information in extindex to quickly find messages matching a given Message-ID across hundreds or thousands of newsgroups with a few SQL statements. "XHDR Xref $MESSAGE_ID" is around 40% faster, on top of previous speedups.
2020-11-29	nntp: NEWGROUPS uses long_response
	We can amortize the cost of NEWGROUPS time filtering using the long_response API. This lets us handle hundreds/thousands of inboxes without monopolizing the event loop for this command. Further speedup is possible using MiscSearch, but that requires not-yet-done indexing changes to MiscIdx.
2020-11-29	extindex: fix delete (`d') handling
	We need to completely remove a message from over.sqlite3 and Xapian when no references remain, otherwise users will still see the removed messages in NNTP overviews and WWW search results/summaries. References to messages are now solely handled by the `xref3' table of over.sqlite3. We can also trust `xref3' when deciding whether to remove only the "O$eidx_key" and "G$lid" terms from a document in Xapian or to remove the entire Xapian document.
2020-11-28	searchidxshard: chomp $eidx_key from pipe
	We were accidentally adding "\n" to terms (which Xapian happily accepts), causing incompatibilities when enabling parallel sharding in some invocations of -extindex but not others. This is an extindex incompatibility and starting a new extindex will be required to take advantage of in-development features, so it's not urgent to start another one, either. (other incompatible things may happen before a 1.7 release)
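	A minimal sketch of the failure mode described above, written in Python rather than the project's Perl: a line read back from a pipe keeps its trailing newline unless it is stripped, so the term built from it differs from one built in-process. The "O" prefix mirrors the "O$eidx_key" term named elsewhere in this log; the function names and the key value are hypothetical.

```python
import os

def send_key(key):
    """Write one newline-terminated key into a pipe; return the read end."""
    r, w = os.pipe()
    with os.fdopen(w, "w") as wr:
        wr.write(key + "\n")
    return r

def read_key_raw(fd):
    """Read one line WITHOUT stripping the newline (the buggy variant)."""
    with os.fdopen(fd, "r") as rd:
        return rd.readline()

def read_key_chomped(fd):
    """Read one line and drop the trailing newline, like Perl's chomp."""
    with os.fdopen(fd, "r") as rd:
        return rd.readline().rstrip("\n")

eidx_key = "example.inbox"  # hypothetical key, for illustration only
buggy_term = "O" + read_key_raw(send_key(eidx_key))
fixed_term = "O" + read_key_chomped(send_key(eidx_key))
```

	The two terms differ only by the trailing "\n", which is why an index written through the pipe path would not match one written without it.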
2020-11-28	*index: more consistent graceful shutdown checks
	v1 and v2 inbox indexing now supports graceful shutdown checks just like ExtSearchIdx. Additionally, we'll perform quit checks at the top of loops for consistency. Interaction with the --xapian-only and --sequential-shard options is a bit lacking, and will warn the user to use "--reindex --xapian-only" to fix.
2020-11-28	nntp: xref: use ->ALL extindex if available
	Getting Xref for cross-posted messages is an O(n) operation where `n' is the number of newsgroups on the server. This works acceptably when there are dozens of groups, but would be unacceptable when there are tens of thousands of newsgroups. With ~140 newsgroups, a lore.kernel.org mirror already handles "XHDR Xref $MESSAGE_ID" requests around 30% faster after creating the xref3.idx_nntp index. The SQL additions to ExtSearch.pm may be a bit strange and seem more appropriate for Over.pm; however, it currently makes sense to me since those bits of over.sqlite3 access are exclusive to ExtSearch and can't be used by traditional v1/v2 inboxes...
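	To illustrate why an indexed lookup avoids the per-newsgroup scan, here is a small Python/sqlite3 sketch. The table and index names come from the message above, but the column layout, the sample rows, and the hostname are assumptions made for this example and not the actual over.sqlite3 schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical, simplified layout: one row per (blob OID, newsgroup, article number).
db.execute(
    "CREATE TABLE xref3 (oid TEXT NOT NULL, newsgroup TEXT NOT NULL, num INTEGER NOT NULL)"
)
# An index led by the OID lets one query find every cross-post of a message.
db.execute("CREATE INDEX idx_nntp ON xref3 (oid, newsgroup, num)")
db.executemany(
    "INSERT INTO xref3 VALUES (?, ?, ?)",
    [
        ("abc123", "inbox.comp.b", 42),
        ("abc123", "inbox.comp.a", 7),
        ("def456", "inbox.comp.a", 8),
    ],
)

def xref(oid, hostname="news.example.com"):
    """Build an Xref-style value from a single indexed query."""
    cur = db.execute(
        "SELECT newsgroup, num FROM xref3 WHERE oid = ? ORDER BY newsgroup",
        (oid,),
    )
    return hostname + " " + " ".join(f"{ng}:{num}" for ng, num in cur)
```

	The cost of the query depends on how many groups carry the message, not on how many groups the server hosts.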
2020-11-28	nntp: xref: simplify sub signature
	We'll be using the `xref3' table in extindex to speed up xref(), and that'll require comparisons against $smsg->{blob}. So pass the entire $smsg through.
2020-11-28	nntp: some minor golfing
	Reduce screen real estate usage to reduce human attention span requirements.
2020-11-28	t/extsearch: show a more realistic case
	Different messages to different public Inboxes are likely to have different List-IDs, so show that we can deduplicate based on content (but per-mailing-list trailers need to go through a PublicInbox::Filter::* or be disabled by mailing list admins).
2020-11-28	nntp: move LIST iterators to long_response
	Iterating through many newsgroups can hog the event loop if many random seeks are required. Avoid monopolizing the event loop in that case by using the long_response API. For now, we can still rely on grep() since it seems to work reasonably well with 50K test newsgroup names.
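	The idea of not monopolizing the event loop can be sketched with cooperative tasks that do a small slice of work and then yield. This is a generic Python illustration under that assumption, not the long_response API itself; `list_groups`, `run_fairly`, and the batch size are hypothetical names and values.

```python
from collections import deque

def list_groups(names, out, batch=2):
    """Append `names` to `out` one slice at a time, yielding between slices."""
    for i in range(0, len(names), batch):
        out.extend(names[i:i + batch])
        yield  # hand control back to the scheduler

def run_fairly(tasks):
    """Round-robin scheduler: every task advances one slice per turn."""
    queue = deque(tasks)
    order = []
    while queue:
        name, task = queue.popleft()
        try:
            next(task)
        except StopIteration:
            continue  # task finished, do not requeue it
        order.append(name)
        queue.append((name, task))
    return order

out_a, out_b = [], []
order = run_fairly([
    ("a", list_groups(["g1", "g2", "g3", "g4"], out_a)),
    ("b", list_groups(["h1"], out_b)),
])
```

	The short request "b" gets its turn after the first slice of the long request "a" instead of waiting for "a" to finish.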
2020-11-28	nntp: LIST ACTIVE.TIMES use angle brackets around address
	This matches the example shown in RFC 3977, section 7.6.1.3
2020-11-28	miscsearch: implement ->newsgroup_matches
	This may be used to speed up newsgroup searches down-the-line, but the grep perlop isn't too shabby, at the moment.
2020-11-28	nntp: NEWNEWS: speed up filtering
	With 50K newsgroups, the filtering phase goes from ~2000 seconds to ~90 MILLISECONDS by relying on the grep perlop. This moves ->over checking out of the main dispatch and amortizes the cost via long_response. (Fairly scheduled) long_response time in newnews_i now takes ~360 seconds as opposed to ~30 seconds before this change; however, eliminating ~2000s from the initial filtering is more than worth it.
2020-11-28	nntp: use grep operation for wildmat matching
	Based on experiences with the IMAP server, this ought to be significantly faster (as to be demonstrated in the next commit).
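	A rough Python analogue of the approach: compile the wildmat into a single regular expression once, then filter the whole list of names in one pass (the counterpart of Perl's grep). This sketch handles only `*`, `?`, and comma-separated alternatives; negation with `!` from RFC 3977 is deliberately left out, and the function names are hypothetical.

```python
import re

def wildmat_to_regex(wildmat):
    """Translate a simplified wildmat into one compiled regex."""
    alternatives = []
    for pat in wildmat.split(","):
        parts = []
        for ch in pat:
            if ch == "*":
                parts.append(".*")   # any run of characters
            elif ch == "?":
                parts.append(".")    # exactly one character
            else:
                parts.append(re.escape(ch))
        alternatives.append("".join(parts))
    return re.compile("(?:" + "|".join(alternatives) + ")")

def matching_groups(wildmat, names):
    """Filter `names` in a single pass with the precompiled pattern."""
    rx = wildmat_to_regex(wildmat)
    return [n for n in names if rx.fullmatch(n)]

names = ["inbox.comp.lang.perl", "inbox.comp.mail", "inbox.test"]
comp_groups = matching_groups("inbox.comp.*", names)
```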
2020-11-28	mm: min/max: return 0 instead of undef
	This simplifies callers and allows empty newsgroups to be represented (the WWW UI may be insufficient there, too).
2020-11-28	nntpd: share {groups} hash with {-by_newsgroup} in Config
	There's no need to duplicate a potentially large hash, but we can keep the inexpensive shortcut to it. We may eventually drop the {groups} shortcut if it's no longer useful.
2020-11-28	nntp: use Inbox->uidvalidity instead of ->mm->created_at
	This is memoized, and may allow us some future flexibility w.r.t PublicInbox::Inbox-like objects. While we're at it, use defined-or ("//") in case somebody really set a public-inbox creation time to the Unix epoch.
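	The difference between a truthiness fallback and defined-or matters exactly when the value is 0, as with a creation time at the Unix epoch. A small Python analogue, with hypothetical function names:

```python
def created_at_truthy(value, fallback):
    """Like Perl's `||`: a legitimate 0 is replaced by the fallback."""
    return value or fallback

def created_at_defined_or(value, fallback):
    """Like Perl's `//`: only a missing (None) value is replaced."""
    return value if value is not None else fallback
```

	With a stored value of 0, the first function returns the fallback while the second keeps the 0.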
2020-11-24	extsearchidx: deduplicate alternates based on st_dev + st_ino
	This allows us to filter out duplicate alternates entries in case there are symlinks or bind mounts in play, as I (and perhaps some other users) tend to use symlinks and/or bind mounts heavily.
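	A minimal Python sketch of deduplicating paths by device and inode number, assuming a POSIX system where symlinks are available. `dedupe_by_inode` and the directory names are hypothetical and only illustrate the technique.

```python
import os
import tempfile

def dedupe_by_inode(paths):
    """Keep the first path seen for each distinct (st_dev, st_ino) pair."""
    seen = set()
    unique = []
    for path in paths:
        st = os.stat(path)  # os.stat follows symlinks
        key = (st.st_dev, st.st_ino)
        if key not in seen:
            seen.add(key)
            unique.append(path)
    return unique

with tempfile.TemporaryDirectory() as tmp:
    objects = os.path.join(tmp, "objects")
    os.mkdir(objects)
    alias = os.path.join(tmp, "alias")
    os.symlink(objects, alias)  # a second name for the same directory
    result = dedupe_by_inode([objects, alias, objects + "/"])
    kept = [os.path.basename(p) for p in result]
```

	Comparing path strings would treat all three entries as different; comparing (st_dev, st_ino) recognizes them as one directory.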
2020-11-24	wwwattach: prevent deep-linking via Referer match
	This prevents `<img src=' tags from being used to deep-link image attachments from HTML outside of the current host and reduces potential for abuse. Some browsers (e.g. Firefox) favor content detection and will display images irrespective of the Content-Type header being "application/octet-stream", and "Content-Disposition: attachment" doesn't stop them, either. Tested with dillo and Firefox. Reported-by: Leah Neukirchen <leah@vuxu.org>
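	One way such a check could look, sketched in Python: compare the host in the Referer header with the host serving the request. Treating a missing Referer as a mismatch is an assumption of this sketch, and what the server sends on a mismatch is out of scope here; `referer_matches_host` is a hypothetical name.

```python
from urllib.parse import urlsplit

def referer_matches_host(referer, host):
    """Return True only when the Referer header names the serving host."""
    if not referer:
        return False  # assumption: no Referer is treated as a mismatch
    return urlsplit(referer).netloc.lower() == host.lower()
```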
2020-11-24	gcf2: workaround libgit2 alternates bug for extindex
	While libgit2 handles alternates with relative paths properly for v2 epochs; nesting them another layer with extindex uses the wrong relative path expansion (and is inconsistent with git(1) behavior). Fortunately, it's possible to work around this libgit2 bug entirely within Gcf2 and avoid further special cases throughout the rest of our code to support extindex. Link: https://bugs.debian.org/975607
2020-11-24	*search: simplify retry_reopen users
	Every callback uses `$self', and creating short-lived array references is not necessary when it's just as easy to copy the array in Perl (unlike C).
2020-11-24	manifest: support faster generation via [extindex "all"]
	For a mirror of lore.kernel.org with >140 inboxes, this speeds up manifest.js.gz generation from ~1s to 40ms on my HW. This is still unacceptable when dealing with thousands of inboxes, but gets us closer to where we need to be.
2020-11-24	extsearchidx: do not short-circuit MiscIdx on no-op v2 prepare
	This was intended to make development easier; but also allows description, URL, and address changes to be picked up independently of message history.
2020-11-24	miscidx: store absolute git_dir of each epoch in docdata
	This will make it possible to map reference repos in case somebody uses the feature.
2020-11-24	miscidx: cleanup git processes after manifest indexing
	We shouldn't leave "cat-file --batch" processes around when we're done with an epoch or inbox, since there could be many thousands.
2020-11-24	extsearch: fix remaining "eindex" references
	We'll replace "$EINDEX" => "$EXTINDEX" in a user-visible line and also some hacker-only tests. "eindex" is no longer used because it rhymes with "reindex", so remove the last instance of it. Fixes: 6b0fed3b03263ba2 ("extsearch: rename -eindex to -extindex")
2020-11-24	miscidx: put grokmirror manifest entries in Xapian docdata
	This should make it possible for us to quickly generate manifest.js.gz files with less random I/O and process spawning in the WWW code.
2020-11-24	inbox: git_epoch: remove ->version check
	If $epoch is supplied to this method, there are already epochs and an extra method call for ->version is a pointless waste of CPU cycles.
2020-11-24	manifest: use ibx->git_epoch method for v2
	We can slightly reduce the amount of version-specific logic, here.
2020-11-24	git: add manifest_entry method
	We'll be using this for MiscIdx and pre-generating the necessary JSON for manifest.js.gz, so make it easier to share code for generating per-repo JSON entries for grokmirror.
2020-11-24	move JSON module portability into PublicInbox::Config
	We'll be using JSON in MiscIdx and MiscSearch, and PublicInbox::Config seems like an appropriate place to put it.
2020-11-24	miscsearch: a new Xapian sub-DB for extindex
	This will be used to index and search Inbox objects and perhaps individual git repositories/epochs for grokmirror manifest.js.gz generation. There is no sharding planned for this at the moment since inbox count should remain low (~100K to 1M) compared to message count. Folding this into the existing sharded DBs could be possible; but would likely increase query and maintenance costs, as well as development complexity. So we'll use a few more inodes and FDs at runtime, instead.
2020-11-19	extindex: remove skip-docdata option
	Since extindex is entirely new, it doesn't have backwards compatibility concerns and never stored docdata, anyways.
2020-11-17	v2writable: avoid initiating leftover unindex if interrupted
	We can also avoid a needless progress message on log2stack interruptions.
2020-11-15	searchidx: check for graceful shutdown in log2stack
	The initial "git log" invocation for a git epoch can be time consuming, so check for graceful shutdown at each line to ensure timely shutdowns and avoid SSD/HDD wear.
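	A per-line shutdown check can be sketched as a flag that signal handlers set and that the loop tests before handling each line. This Python sketch is an illustration only; `QuitFlag` and `stack_commits` are hypothetical names, and the signal set is an assumption borrowed from the QUIT/INT/TERM handling described elsewhere in this log.

```python
import signal

class QuitFlag:
    """Records that a graceful shutdown was requested."""

    def __init__(self):
        self.requested = False

    def request(self, signum=None, frame=None):
        # Usable directly as a signal handler.
        self.requested = True

    def install(self):
        # Assumed signal set; call this once at startup on a POSIX system.
        for sig in (signal.SIGQUIT, signal.SIGINT, signal.SIGTERM):
            signal.signal(sig, self.request)

def stack_commits(log_lines, quit_flag):
    """Collect lines, checking for shutdown at the top of every iteration."""
    stack = []
    for line in log_lines:
        if quit_flag.requested:
            break  # stop promptly instead of draining the whole log
        stack.append(line.strip())
    return stack
```

	Because the handler only sets a flag, the loop stops at a well-defined point between lines rather than in the middle of processing one.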
2020-11-15	t/eml.t: workaround newer Email::MIME* behavior
	Recent (2020) versions of Email::MIME (and/or dependencies) have different behavior than historical versions which seem to be less DWIM and perhaps technically more correct. We'll retain historical behavior for now, since it doesn't seem to cause real problems and DWIM-ness is often required to make sense of historical mail. Tested on a FreeBSD 11.4 VM with the following packages: p5-Email-MIME-1.949 p5-Email-MIME-ContentType-1.024_1 p5-Email-MIME-Encodings-1.315_2
2020-11-15	extindex: support graceful shutdown via QUIT/INT/TERM
	Just like the daemon processes, -extindex now supports graceful shutdown via the same signals. This lets users avoid having to repeat indexing messages when a power outage strikes during a long (multi-hour/day) indexing run. Per-inbox (v1/v2) -index graceful shutdowns are not supported yet, but are planned for later.
2020-11-15	*index: discard sync->{todo} on iteration
	There's no need to continuously append to {todo} when indexing multiple inboxes. They're not redundantly indexed (because the IdxStack is discarded, making it a noop), but it's still a waste of memory keeping the $unit hashrefs around.
2020-11-15	*index: avoid per-epoch --batch-check processes
	Since all.git (v2) and ALL.git (extindex) encompass every single epoch or indexed inbox; and is_ancestor() only uses hexadecimal OIDs; there is no good reason to use $unit->{git} for an epoch-local $git->check. This prevents dozens/hundreds of --batch-check processes from being left running after indexing and can improve locality if size checks are being done (since that uses --batch-check, too). Theoretically several epochs may have conflicting OIDs, but we're screwed in those cases, anyways, so we might as well detect it earlier (though I'm not sure what the behavior would be :x).
2020-11-15	*index: checkpoints write last_commit metadata
	This will set us up for supporting graceful shutdown on -index without repeating any work.
2020-11-10	searchidx: fix fallback on unindex miss
	In case of other bugs or intentional corruption of over.sqlite3, we don't want to attempt dereferencing a non-ref scalar when calling ->mid_delete in the fallback code path. Noticed while chasing another bug in extindex development...
2020-11-08	extindex: fix --batch-size support
	Calling PublicInbox::Admin::index_prepare is required for --batch-size (k|m|g) modifiers and indexBatchSize in the config file. Otherwise, the default 1m batch size stuck and led to unexpectedly bad performance on a machine which could index v2 inboxes faster with larger batch sizes.
2020-11-08	extindex: SIGUSR1 supports checkpoint
	Matching the behavior of git-fast-import(1), we'll allow a user to send SIGUSR1 to checkpoint over.sqlite3 and Xapian.
2020-11-08	v2writable: more accurate {current_info} warnings/progress
	With async git blob retrievals, the OID being enqueued and the OID being processed can be totally unrelated and misleading. We'll also prefix $INBOX_DIR for v2, and not just the epoch since we could be indexing multiple inboxes via both -index and -extindex.
2020-11-08	extsearch: canonicalize topdir
	This makes `ps' output look a bit nicer if there's trailing slashes involved from the command-line.
2020-11-08	extsearchidx: quiet warning for unindexed `d' messages
	"deleted" messages (via -learn <spam|rm>) in the source inboxes are likely to already be unindexed, so avoid triggering needless warnings about the spam message being missing.
2020-11-08	v2writable: less expensive checkpoint for extindex
	Since extindex holds no locks on parallel inbox writers, we can simply use "barrier" IPC shard commands to checkpoint and avoid respawning shard or git processes.