path: root/t
Date Commit message
2020-11-24 miscsearch: a new Xapian sub-DB for extindex
This will be used to index and search Inbox objects and perhaps individual git repositories/epochs for grokmirror manifest.js.gz generation. There is no sharding planned for this at the moment since inbox count should remain low (~100K to 1M) compared to message count. Folding this into the existing sharded DBs could be possible; but would likely increase query and maintenance costs, as well as development complexity. So we'll use a few more inodes and FDs at runtime, instead.
2020-11-15 t/eml.t: workaround newer Email::MIME* behavior
Recent (2020) versions of Email::MIME (and/or its dependencies) behave differently from historical versions; the newer versions seem to be less DWIM and perhaps technically more correct. We'll retain historical behavior for now, since it doesn't seem to cause real problems and DWIM-ness is often required to make sense of historical mail. Tested on a FreeBSD 11.4 VM with the following packages: p5-Email-MIME-1.949, p5-Email-MIME-ContentType-1.024_1, p5-Email-MIME-Encodings-1.315_2
2020-11-15 *index: checkpoints write last_commit metadata
This will set us up for supporting graceful shutdown on -index without repeating any work.
2020-11-08 extsearch: rename -eindex to -extindex
"eindex" rhymes with "reindex", which could be confusing; so rename the command and config prefix to "extindex", which is hopefully less confusing.
2020-11-07 extsearchidx: handle edits
We can now handle cases where messages are edited in one inbox but not another, bifurcating the message. V2Writable::log_range handles some edge cases which could also happen in v2-only code paths, but weren't usually triggered because the default git-gc knobs don't prune immediately.
2020-11-07 t/v2writable: remove pointless ->barrier call
We don't actually use it anywhere, and may not need it in the future.
2020-11-07 t/extsearch.t: verify results and xref3 ordering
We want NNTP clients to see consistent Xref: headers to ensure client-side caches don't get confused.
2020-11-07 searchidx: remove xref3 support for Xapian
It doesn't seem worth storing xref3 data in Xapian now that the same info is in over.sqlite3.
2020-11-07 over: store xref3 data in over.sqlite3
We may not end up storing xref3 data in Xapian, actually. This will make indexlevel=basic possible, along with --sequential-shard indexing support for slow storage. Making oidmap a separate table seems unnecessary, too, so fold it into the xref3 table since it's unlikely a git blob will be responsible for multiple xref3 rows.
2020-11-07 script: add preliminary eindex implementation
Not documented, yet, but it runs...
2020-11-07 extsearchidx: initial implementation
It compiles...
2020-11-07 overidx: introduce changes for external index
Since external indices won't have msgmap.sqlite3, we'll need to store last_commit-* metadata in over.sqlite3 instead. This has longer limits to account for path names or newsgroup names stored in keys. We'll also rely on built-in counters for Xapian document IDs, since msgmap.sqlite3 no longer provides an AUTOINCREMENT column.
2020-11-07 searchidx: introduce "xref3" concept
This will be used to track cross-posted messages in the external/detached index.
2020-11-07 extsearch: start mocking out
This will provide a similar API to PublicInbox::Inbox for read-only WWW, -imapd, and -nntpd interfaces.
2020-09-24 searchidx: fix (undocumented) --skip-docdata handling
This switch is still undocumented, but we can reduce the scope of our Xapian docdata dependency by moving its only caller to SearchIdx. This reduces the amount of code loaded by read-only code paths.
2020-09-22 mda: match List-Id insensitively
This follows -watch commit b70473ab8296d31ebb600adb4fa8fe0ac5935ca8 to match List-Id headers case-insensitively. Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org> Link: https://public-inbox.org/meta/20200921180152.uyqluod7qxbwqubo@chatter.i7.local/
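Case-insensitive List-Id matching can be sketched roughly as follows. This is an illustrative Python sketch, not the project's Perl code; the function name `list_id_matches` and the angle-bracket extraction are assumptions made for the example.

```python
import re

def list_id_matches(header_value: str, configured: str) -> bool:
    """Return True if the List-Id header names `configured`, ignoring case."""
    # The identifier normally sits between angle brackets, e.g.
    # "Example list <Foo.Example.Org>"; fall back to the whole value.
    m = re.search(r"<([^>]+)>", header_value)
    found = m.group(1) if m else header_value
    return found.strip().lower() == configured.strip().lower()
```

With this, a header such as `Test list <Foo.Example.ORG>` would match a configured value of `foo.example.org`.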
2020-09-19 gcf2: wire up read-only daemons and rm -gcf2 script
It seems easiest to have a singleton Gcf2Client client object per daemon worker for all inboxes to use. This reduces overall FD usage from pipes. The `public-inbox-gcf2' command + manpage are gone and a `$^X' one-liner is used, instead. This saves inodes for internal commands and hopefully makes it easier to avoid mismatched PERL5LIB include paths (as noticed during development :x). We'll also make the existing cat-file process management infrastructure more resilient to BOFHs on process killing sprees (or in case our libgit2-based code fails on us). (Rare) PublicInbox::WWW PSGI users NOT using public-inbox-httpd won't automatically benefit from this change, and extra configuration will be required (to be documented later).
2020-09-19 gcf2: require git dir with OID
This amortizes the cost of recreating PublicInbox::Gcf2 objects when alternates change in v2 all.git.
2020-09-19 gcf2: transparently retry on missing OID
Since we only get OIDs from trusted local data sources (over.sqlite3), we can safely retry within the -gcf2 process without worry about clients spamming us with requests for invalid OIDs and triggering reopens.
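The retry-on-miss idea might look something like the sketch below. It is a hypothetical Python model, not the actual libgit2-based implementation: `ReopeningStore` and its `load_objects` callback are invented names standing in for reopening the object DB.

```python
class ReopeningStore:
    """Hypothetical object store that reloads its view of the object DB on a miss."""

    def __init__(self, load_objects):
        self._load = load_objects        # callable returning {oid: bytes}
        self._objects = load_objects()
        self.reopens = 0

    def cat(self, oid):
        blob = self._objects.get(oid)
        if blob is None:
            # OIDs come from a trusted local source, so a miss most likely
            # means our view is stale: reopen once and retry.
            self._objects = self._load()
            self.reopens += 1
            blob = self._objects.get(oid)
        return blob
```

Because requests are trusted, a single reopen per miss is acceptable here; with untrusted clients the same logic would let invalid OIDs trigger a reopen on every request.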
2020-09-19 add gcf2 client and executable script
This should be able to replace multiple `git cat-file' for blob retrieval, but adjustments may be needed.
2020-09-19 t/gcf2: test changes to alternates
Calling ->add_alternate won't pick up new additions to $OBJDIR/info/alternates, unfortunately. Thus v2 inboxes will need to do something to invalidate Gcf2 objects.
2020-09-19 gcf2: libgit2-based git cat-file alternative
Having tens of thousands of inboxes and associated git processes won't work well, so we'll use libgit2 to access the object DB directly. We only care about OID lookups and won't need to rely on per-repo revision names or paths. The Git::Raw XS package won't be used since its manpages don't promise a stable API. Since we already use Inline::C and have experience with I::C when it comes to compatibility, this only introduces libgit2 itself as a source of new incompatibilities. This also provides an excuse for me to writev(2) to reduce syscalls, but liburing is on the horizon for next year.
2020-09-16 t/indexlevels-mirror: fix improperly skipped test
Oops :x
2020-09-16 treewide: relax allow >=40 chars for git OID
This will help with eventual git SHA-256 transitions.
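A relaxed OID check of this kind could be sketched as below. This is an illustrative Python sketch; the pattern and the name `looks_like_oid` are assumptions for the example, not the project's actual regexp.

```python
import re

# Accept 40 hex digits (SHA-1) or more (e.g. 64 for SHA-256).
OID_RE = re.compile(r"\A[0-9a-f]{40,}\Z")

def looks_like_oid(s: str) -> bool:
    """Return True if `s` is a lowercase hex object ID of at least 40 chars."""
    return OID_RE.match(s) is not None
```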
2020-09-15 imap: quiet uninitialized variable warning on FETCH
This was triggered by blindly trying to FETCH an MSN (not "UID FETCH") on an empty dummy inbox. It's harmless, and probably triggered by a wayward client or misbehaving bot.
2020-09-15 t/imapd.t: skip dependent test on failure
We don't want to cascade failures/warnings when something else breaks. There's likely more of these to be fixed as we encounter them.
2020-09-14 tests: consistently check for xapian-compact
We may need to test against development versions of Xapian, which may rely on setting `XAPIAN_COMPACT=xapian-compact-1.5'. Ensure it's possible to do that. And add a missing check in t/xcpdb-reshard.t, too.
2020-09-12 t/nntpd: add test for the XPATH command
It's only in RFC 2980 (not 977 or 3977), but Net::NNTP has supported it since 2001, at least. We'll be making changes to avoid pathological behavior, so test it, first.
2020-09-10 solver: check one git coderepo and inbox at a time
With public-inbox-httpd, this mitigates the effect of slow git blob storage with multiple coderepos configured for an inbox. It's still synchronous for now (and may need to remain that way for ->last_check_err), but no longer monopolizes the event loop when checking multiple coderepos. We don't support multi-inbox scanning yet, but this also prepares us for a future where we do. We'll also support >=40 char blob OIDs in preparation for future git SHA-256 support.
2020-09-10 t/cgi.t: show stderr on failures
This helped me diagnose an error I would've introduced in the next commit.
2020-09-10 www: manifest.js.gz generation no longer hogs event loop
It's still as slow as before with hundreds/thousands of inboxes, but at least it's fair. Future changes will allow it to be cached and memoized with persistent HTTP servers.
2020-09-10 use "\&" where possible when referring to subroutines
"*foo" is ambiguous in that it may refer to a bareword file handle; so we'll use "\&foo" where we can without triggering warnings. PublicInbox::TestCommon::run_script_exit required dropping the prototype, however. We'll also future-proof by dropping "use warnings" in Cgit.pm and use the less-ambiguous "//=" in Inbox.pm while we're in the area.
2020-09-10 nntp: fix cross-newsgroup Message-ID lookups
We cannot blindly use the selected newsgroup for HEAD/ARTICLE/BODY requests using Message-ID, since those commands look across all newsgroups; not just the selected one (if any). So stuff a reference to the Inbox object into $smsg. We can reduce args passed into set_nntp_headers() and msg_hdr_write(), too. Fixes: 0e6ceff37fc38f28 ("nntp: support slow blob retrievals")
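The shape of the fix can be sketched as follows: look across every newsgroup and keep a reference to the group that actually holds the article. This is a hypothetical Python model; `lookup_mid` and the dict-of-dicts layout are assumptions, not the -nntpd data structures.

```python
def lookup_mid(newsgroups, mid):
    """Search every newsgroup for `mid`; return (group_name, article) or None.

    `newsgroups` maps a group name to a {message_id: article} dict.
    """
    for name, articles in newsgroups.items():
        article = articles.get(mid)
        if article is not None:
            # Carry the owning group with the result, so later header
            # generation does not have to guess from the selected group.
            return name, article
    return None
```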
2020-09-03 search: replace ->query with ->mset
Nearly all of the search uses in the production code rely on a Xapian mset iterator being returned (instead of an array of $smsg objects). So default to returning the mset and move the burden of smsg array conversion into the test cases.
2020-09-03 tests: add "use strict" and declare v5.10.1 compatibility
strict.pm helped me find a typo in an upcoming change, so ensure we use it since it does more good than harm. We'll also take the opportunity here to declare v5.10.1 compatibility level to future-proof against Perl incompatibilities.
2020-09-03 search: remove special case for blank query
The special case (if any) belongs at a higher-level, and this is another step towards removing {over_ro}-dependence in our Search object.
2020-09-03 use more idiomatic internal API for ->over access
{over_ro} being a part of the Search object is a historical oddity which will go away, soon. Let's start removing its use in tests and rarely-used helper scripts.
2020-09-03 disambiguate OverIdx and Over by field name
We'll use {oidx} as the common field name for the read-write OverIdx, here, to disambiguate it from the read-only {over} field. This hopefully makes it clearer which code paths are read-only and which are read-write.
2020-09-02 t/run: Perl future proofing
Bareword file handles outside of STD(IN|OUT|ERR) seem to be on the chopping block for Perl 8. We'll also "use v5.10.1" to guard against future incompatibilities.
2020-09-02 init+convert: create non-existing directory hierarchies
Following "git init" as an example, we'll create every parent path up to the one specified, instead of attempting to continue on when Cwd::abs_path returns `undef'.
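Creating every missing parent, in the spirit of "git init", can be sketched like this. It is an illustrative Python sketch rather than the project's Perl code, and `ensure_dir_hierarchy` is a hypothetical helper name.

```python
import os

def ensure_dir_hierarchy(path: str) -> str:
    """Create `path` and any missing parents; return its absolute form."""
    abs_path = os.path.abspath(path)
    # exist_ok=True makes repeated calls harmless when the path exists.
    os.makedirs(abs_path, exist_ok=True)
    return abs_path
```

Resolving the absolute path only after the directories are guaranteed to exist avoids the case where path resolution fails on a non-existent parent.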
2020-09-02 t/v2dupindex: test indexing mirrors with duplicate messages
While it's not a known problem, our deduplicating logic may change in the future; or a BOFH could be manually injecting duplicate messages directly into the git epoch repositories. Ensure indexing in mirrors doesn't break when there are duplicates. This is in preparation for detached indices for multi-inbox search.
2020-09-01 rename WatchMaildir => Watch
This is no longer limited to Maildirs now that IMAP and NNTP support exist; so give it a shorter name.
2020-08-30 imapd: filter out unusable flags from search
Quiet down logs from -imapd when clients are blindly sending some unsupported flag conditions (e.g. "DRAFT", "DELETED") specified in RFC 3501.
2020-08-29 tests: check-run: fixup un-squashed simplification
Link: https://public-inbox.org/meta/20200828221803.GA89978@dcvr/
2020-08-28 tests: check-run: show skipped tests
We'll deduplicate redundant lines and show counts of skipped tests to ensure it's easy to notice if something is unexpectedly skipped.
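Deduplicating skip messages and counting them might be sketched as below. This is an illustrative Python sketch; `summarize_skips` and the one-message-per-line input format are assumptions, not the check-run implementation.

```python
from collections import Counter

def summarize_skips(lines):
    """Collapse identical skip messages into a message -> count mapping."""
    # Blank lines are ignored; surrounding whitespace is not significant.
    return Counter(line.strip() for line in lines if line.strip())
```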
2020-08-27 overidx: inline create_ghost sub
There's no need for this to be a separate sub since there's only a single caller. This saves a few kilobytes at least in short-lived processes.
2020-08-27 over: recent: remove expensive COUNT query
As noted in commit 87dca6d8d5988c5eb54019cca342450b0b7dd6b7 ("www: rework query responses to avoid COUNT in SQLite"), COUNT on many rows is expensive on big SQLite DBs. We've already stopped using that code path long ago in WWW while -imapd and -nntpd never used it. So we'll adjust our remaining test cases to not need it, either.
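One common way to page through results without COUNT is to fetch one row more than requested and use its presence as the "more results" signal. The sketch below illustrates that idea in Python; the `msg` table, its columns, and `recent_page` are hypothetical and simplified, and this is not necessarily how the referenced commit does it.

```python
import sqlite3

def recent_page(dbh, limit, offset=0):
    """Return (rows, has_more) without issuing SELECT COUNT(*)."""
    cur = dbh.execute(
        "SELECT num, ts FROM msg ORDER BY ts DESC LIMIT ? OFFSET ?",
        (limit + 1, offset),  # one extra row tells us whether more exist
    )
    rows = cur.fetchall()
    return rows[:limit], len(rows) > limit
```

The cost stays proportional to the page size rather than to the number of rows in the table.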
2020-08-27 over: rename ->disconnect to ->dbh_close
Since we got rid of over->connect, `disconnect' no longer pairs with it. So name it after the `close(2)' syscall it ultimately issues.
2020-08-27 over: rename ->connect method to ->dbh
`->connect' is confused with the perlfunc for the `connect(2)' syscall, and also `DBI->connect'. Since SQLite doesn't use sockets, the word "connect" needlessly confuses me. Give it a short name to match the field name we use for it, which also matches the variable name used by the DBI(3pm) and DBD::SQLite(3pm) manpages.
2020-08-26 over+msgmap: respect WAL journal_mode if set
WAL actually seems to have ideal locking characteristics given concurrency problems I'm experiencing with --reindex running in parallel with expensive read-only SQLite queries: <https://public-inbox.org/meta/20200825001204.GA840@dcvr/> Unfortunately, we cannot blindly use WAL while preserving compatibility with existing setups nor our guarantees that read-only daemons are indeed "read-only". However, respect a user's choice to set WAL on their own if they're comfortable with giving -nntpd/-httpd/-imapd processes write permission to the directory storing SQLite DBs.
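Respecting a user-set WAL mode can be sketched as follows. This is an illustrative Python sketch; `open_respecting_wal` is a hypothetical name, and the choice of TRUNCATE as the non-WAL journal mode is an assumption made for the example.

```python
import sqlite3

def open_respecting_wal(path):
    """Open a SQLite DB, leaving WAL alone if the user already enabled it."""
    dbh = sqlite3.connect(path)
    (mode,) = dbh.execute("PRAGMA journal_mode").fetchone()
    if mode.lower() != "wal":
        # Not WAL: use a rollback-journal mode (assumed here to be TRUNCATE).
        dbh.execute("PRAGMA journal_mode = TRUNCATE")
    return dbh
```

WAL is recorded persistently in the database file, so a user can opt in once with `PRAGMA journal_mode = WAL` and later opens through this function will leave that choice in place.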