about summary refs log tree commit homepage
path: root/t
DateCommit message (Collapse)
2020-12-17extindex: support --rethread and content bifurcation
--rethread is useful for dealing with bugs and behaves just like it does with current inboxes. This is in case our content deduplication logic changes for whatever reason and causes previously merged messages to be considered "different". As with v2, this won't allow us to merge messages in a way that allows deduplicating messages which were previously considered different, but v2 inboxes do not allow that, either. In other words, this makes the --reindex and --rethread switches of -extindex match the behavior of v2 -index.
2020-12-17extindex: delete stale messages from over.sqlite3
In addition to removing stale messages from Xapian, we must also remove them from over.sqlite3.
2020-12-17extindex: preliminary --reindex support
--reindex allows us to catch missed and stale messages due to -extindex vs -index races prior to commit 02b2fcc46f364b51 ("extsearchidx: enforce -index before -extindex"). We'll also rely on reindex to internally deal with v1/v2 inbox removals and partial-unindexing of messages which are only removed from one inbox out of many. This reindex design is completely different than how normal v1/v2 inbox reindex operates due to extindex having multiple histories to work with. Instead of scanning git history, this relies exclusively on comparing over.sqlite3 contents between the v1/v2 inboxes and the extindex. Changes to Xapian behavior also get picked up, now. Xapian indexing is handled by workers with minimal IPC to the parent process. This results in more read I/O but fewer writes when dealing with cross-posted messages. Changes to $smsg->populate and --rethread still need further work.
2020-12-17t/psgi_v2: ignore warnings on missing P::M::ReverseProxy
Plack::Test::ExternalServer doesn't depend on Plack::Middleware::ReverseProxy, so we need to account for some warnings in stderr if P::M::RP is missing.
2020-12-14PublicInbox::Feed owns `feedmax' default value
There's no need to have extra code in the Inbox package for this or to waste dozens of bytes for every Inbox object which uses the default value. This makes our code more flexible w.r.t Inbox-like ExtSearch objects and fixes uninitialized value warnings with ->ALL.
2020-12-11nntp+www: drop List-* and Archived-At headers
These headers can conflict with headers in the DKIM signature; and parsing the DKIM-Signature header to determine whether or not we can safely add a header would be more code and CPU cycles. Since IMAP seems fine without these headers (and JMAP will likely be, too), there's likely no need to continue appending these to every message. Nowadays, developers seem sufficiently trained to use URLs with Message-IDs in them. So drop the headers and save some cycles and bandwidth all around.
2020-12-10extsearchidx: enforce -index before -extindex
We cannot set xref3 data without the `xnum' column to tie it to the per-inbox over.sqlite3 DB. So ensure we don't read brand-new history that only exists in git, but instead rely on last_commit and last_xap15-$EPOCH metadata in msgmap to decide how far we can index. Before this change, it was possible to miss messages in the extindex if -index did not run (which will be fixable by upcoming --reindex support in -extindex).
2020-12-10t/extsearch: use indexlevel=basic in inboxes
There's no need for per-inbox Xapian DBs when using extindex, so reduce wear on the poor systems this test runs on.
2020-12-09admin: resolve_repo_dir => resolve_inboxdir
We've stopped referring to inboxdirs as "repos" a while ago since v2 inboxes have multiple git repos associated with them. So update the name to reflect that and avoid an unnecessary export that's only used by a test case.
2020-12-09rename {pi_config} fields to {pi_cfg}
{pi_config} may be confused with the documented `PI_CONFIG' environment variable, and we'll favor vowel-removal to be consistent with our usage of object references. The `pi_' prefix may stay in some places, for now; since a separate namespace may come into this codebase for local/private client-tooling. For InboxIdle, we'll also remove an invalid comment about holding a reference to the PublicInbox::Config object, too.
2020-12-09nntp: replace {ng} with {ibx} for consistency
They're PublicInbox::Inbox objects just like the rest of the non-NNTP code. So rename the NNTP code for consistency with the rest of the codebase. Furthermore, {ng} and $ng may be confused with the `--ng' switch for -init, and that's a non-ref scalar string.
2020-12-09treewide: replace {-inbox} with {ibx} for consistency
{ibx} is shorter and is the most prevalent abbreviation in indexing and IMAP code, and the `$ibx' local variable is already prevalent throughout. In general, the codebase favors removal of vowels in variable and field names to denote non-references (because references are "lighter" than non-references). So update WWW and Filter users to use the same code since it reduces confusion and may allow easier code sharing.
2020-12-05isearch: emulate per-inbox search with ->ALL
Using "eidx_key:" boolean prefix to limit results to a given inbox, we can use ->ALL to emulate and replace per-Inbox xap15/[0-9] search indices. With this change, the presence of "extindex.all.topdir" in the $PI_CONFIG will cause the WWW code to use that extindex and ignore per-inbox Xapian DBs in xap15/[0-9]. Unfortunately IMAP search still requires old per-inbox indices, for now. Mapping extindex Xapian docids to per-Inbox UIDs and vice-versa is proving tricky. Fortunately, IMAP search is rarely used and optional. The RFCs don't specify expensive phrase search, either, so `indexlevel=medium' can be used in per-inbox Xapian indices to save space. For primarily WWW (and future JMAP) users; this should result in significant disk space, FD, and page cache footprint savings for large instances with many inboxes and many cross-posted messages.
2020-12-05over: ensure old, merged {tid} is really gone
We must use the result of link_refs() since it can trigger merge_threads() and invalidate $old_tid. In case merge_threads() isn't triggered, link_refs() will return $old_tid anyways. When rethreading and allocating new {tid}, we also must update the row where the now-expired {tid} came from to ensure only the new {tid} is seen when reindexing subsequent messages in history. Otherwise, every subsequently reindexed+rethreaded message could end up getting a new {tid}. Reported-by: Kyle Meyer <kyle@kyleam.com> Link: https://public-inbox.org/meta/87360nlc44.fsf@kyleam.com/
2020-12-01nntp: make ->ALL Xref generation more fuzzy
For ->ALL users, this mitigates the regression introduced by commit 811b8d3cbaa790f59b7b107140b86248da16499b ("nntp: xref: use ->ALL extindex if available"), since it's common to cross post messages to some mailing lists with per-list trailers for unsubscribe information. We won't bother dealing with Bcc-ed messages since those are nearly all spam when it comes to public mailing lists. Fixes: 811b8d3cbaa790f5 ("nntp: xref: use ->ALL extindex if available") Link: https://public-inbox.org/meta/20201130194201.GA6687@dcvr/
2020-11-30t/extsearch: test ->has_threadid
Since has_threadid predates the existence of ExtSearch, we can be certain the Xapian DB can collapse by threadid.
2020-11-29extindex: support `--gc' to remove dead inboxes
Inboxes may be removed or newsgroups renamed over time. Introduce a switch to do garbage collection and eliminate stale search and xref3 results based on inboxes which remain in the config file. This may also fixup stale results leftover from any bugs which may leave stale data around. This is also useful in case a clumsy BOFH (me :P) is swapping between several PI_CONFIGs and accidentally indexed a bunch of inboxes they didn't intend to.
2020-11-29nntp: XPATH uses ->ALL extindex, too
Another 30-40% speedup when testing against a local lore.kernel.org mirror. In either case, we'll consistently sort the response for ease-of-testing and client-side cache-friendliness.
2020-11-29nntp: NEWGROUPS uses long_response
We can amortize the cost of NEWGROUPS time filtering using the long_response API. This lets us handle hundreds/thousands of inboxes without monopolizing the event loop for this command. Further speedup is possible using MiscSearch, but that requires not-yet-done indexing changes to MiscIdx.
2020-11-29extindex: fix delete (`d') handling
We need to completely remove a message from over.sqlite3 and Xapian when no references remain, otherwise users will still see the removed messages in NNTP overviews and WWW search results/summaries. References to messages are now solely handled by the `xref3' table of over.sqlite3. We can also trust `xref3' when deciding whether to remove only the "O$eidx_key" and "G$lid" terms from a document in Xapian or to remove the entire Xapian document.
2020-11-28nntp: xref: use ->ALL extindex if available
Getting Xref for cross-posted messages is an O(n) operation where `n' is the number of newsgroups on the server. This works acceptably when there are dozens of groups, but would be unnacceptable when there's tens of thousands of newsgroups. With ~140 newsgroups, a lore.kernel.org mirror already handles "XHDR Xref $MESSAGE_ID" requests around 30% faster after creating the xref3.idx_nntp index. The SQL additions to ExtSearch.pm may be a bit strange and seem more appropriate for Over.pm; however it currently makes sense to me since those bits of over.sqlite3 access are exclusive to ExtSearch and can't be used by traditional v1/v2 inboxes...
2020-11-28t/extsearch: show a more realistic case
Different messages to different public Inboxes are likely to have different List-IDs, so show that we can deduplicate based on content (but per-mailing-list trailers need to go through a PublicInbox::Filter::* or be disabled by mailing list admins).
2020-11-28miscsearch: implement ->newsgroup_matches
This may be used to speed up newsgroup searches down-the-line, but the grep perlop isn't too shabby, at the moment.
2020-11-28mm: min/max: return 0 instead of undef
This simplifies callers and allows empty newsgroups to be represented (the WWW UI may be insufficient there, too).
2020-11-24extsearch: fix remaining "eindex" references
We'll replace "$EINDEX" => "$EXTINDEX" in a user-visible line and also some hacker-only tests. "eindex" is no longer used because it rhymes with "reindex", so remove the last instance of it. Fixes: 6b0fed3b03263ba2 ("extsearch: rename -eindex to -extindex")
2020-11-24miscidx: put grokmirror manifest entries in Xapian docdata
This should make it possible for us quickly generate manifest.js.gz files with less random I/O and process spawning in the WWW code.
2020-11-24git: add manifest_entry method
We'll be using this for MiscIdx and pre-generating the necessary JSON for manifest.js.gz, so make it easier to share code for generating per-repo JSON entries for grokmirror.
2020-11-24miscsearch: a new Xapian sub-DB for extindex
This will be used to index and search Inbox objects and perhaps individual git repositories/epochs for grokmirror manifest.js.gz generation. There is no sharding planned for this at the moment since inbox count should remain low (~100K to 1M) compared to message count. Folding this into the existing sharded DBs could be possible; but would likely increase query and maintenance costs, as well as development complexity. So we'll use a few more inodes and FDs at runtime, instead.
2020-11-15t/eml.t: workaround newer Email::MIME* behavior
Recent (2020) versions of Email::MIME (and/or dependencies) have different behavior than historical versions which seem to be less DWIM and perhaps technically more correct. We'll retain historical behavior for now, since it doesn't seem to cause real problems and DWIM-ness is often required to make sense of historical mail. Tested on a FreeBSD 11.4 VM with the following packages: p5-Email-MIME-1.949 p5-Email-MIME-ContentType-1.024_1 p5-Email-MIME-Encodings-1.315_2
2020-11-15*index: checkpoints write last_commit metadata
This will set us up for supporting graceful shutdown on -index without repeating any work.
2020-11-08extsearch: rename -eindex to -extindex
Upon "eindex" rhymes with "reindex", which could be confusing; so name the command and config prefix to use "extindex" which is hopefully less confusing.
2020-11-07extsearchidx: handle edits
We can now handle cases where messages are edited in one inbox but not another, bifurcating the message. V2Writable::log_range handles some edge-cases which could happen in v2-only code paths, as well, but weren't usually triggered due to default git-gc knobs not pruning immediately
2020-11-07t/v2writable: remove pointless ->barrier call
We don't actually use it anywhere, and may not need it in the future.
2020-11-07t/extsearch.t: verify results and xref3 ordering
We want NNTP clients to see consistent Xref: headers to ensure client-side caches don't get confused.
2020-11-07searchidx: remove xref3 support for Xapian
It doesn't seem worth storing xref3 data in Xapian now that the same info is in over.sqlite3.
2020-11-07over: store xref3 data in over.sqlite3
We may not end up storing xref3 data in Xapian, actually. This will make indexlevel=basic possible, and along with --sequential-shard indexing support for slow storage. Making oidmap a separate table seems unnecessary, too, so fold it into the xref3 table since it's unlikely a git blob will be responsible for multiple xref3 rows.
2020-11-07script: add preliminary eindex implementation
Not documented, yet, but it runs...
2020-11-07extsearchidx: initial implementation
It compiles...
2020-11-07overidx: introduce changes for external index
Since external indices won't have msgmap.sqlite3, we'll need to store last_commit-* metadata in over.sqlite3 instead. This has a longer limits to account for path names or newsgroup names stored in keys. We'll also rely on built-in counters for Xapian document IDs, since msgmap.sqlite3 no longer provides an AUTOINCREMENT column.
2020-11-07searchidx: introduce "xref3" concept
This will be used to track cross-posted messages in the external/detached index.
2020-11-07extsearch: start mocking out
This will provide a similar API to PublicInbox::Inbox for read-only WWW, -imapd, and -nntpd interfaces.
2020-09-24searchidx: fix (undocumented) --skip-docdata handling
This switch is still undocumented, but we can reduce the scope of our Xapian docdata dependency by moving its only caller to SearchIdx. This reduces the amount of code loaded by read-only code paths.
2020-09-22mda: match List-Id insensitively
This follows -watch commit b70473ab8296d31ebb600adb4fa8fe0ac5935ca8 to match List-Id headers case-insensitively. Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org> Link: https://public-inbox.org/meta/20200921180152.uyqluod7qxbwqubo@chatter.i7.local/
2020-09-19gcf2: wire up read-only daemons and rm -gcf2 script
It seems easiest to have a singleton Gcf2Client client object per daemon worker for all inboxes to use. This reduces overall FD usage from pipes. The `public-inbox-gcf2' command + manpage are gone and a `$^X' one-liner is used, instead. This saves inodes for internal commands and hopefully makes it easier to avoid mismatched PERL5LIB include paths (as noticed during development :x). We'll also make the existing cat-file process management infrastructure more resilient to BOFHs on process killing sprees (or in case our libgit2-based code fails on us). (Rare) PublicInbox::WWW PSGI users NOT using public-inbox-httpd won't automatically benefit from this change, and extra configuration will be required (to be documented later).
2020-09-19gcf2: require git dir with OID
This amortizes the cost of recreating PublicInbox::Gcf2 objects when alternates change in v2 all.git.
2020-09-19gcf2: transparently retry on missing OID
Since we only get OIDs from trusted local data sources (over.sqlite3), we can safely retry within the -gcf2 process without worry about clients spamming us with requests for invalid OIDs and triggering reopens.
2020-09-19add gcf2 client and executable script
This should be able to replace multiple `git cat-file' for blob retrieval, but adjustments may be needed.
2020-09-19t/gcf2: test changes to alternates
Calling ->add_alternate won't pick up new additions to $OBJDIR/info/alternates, unfornately. Thus v2 inboxes will need to do something to invalidate Gcf2 objects.
2020-09-19gcf2: libgit2-based git cat-file alternative
Having tens of thousands of inboxes and associated git processes won't work well, so we'll use libgit2 to access the object DB directly. We only care about OID lookups and won't need to rely on per-repo revision names or paths. The Git::Raw XS package won't be used since its manpages don't promise a stable API. Since we already use Inline::C and have experience with I::C when it comes to compatibility, this only introduces libgit2 itself as a source of new incompatibilities. This also provides an excuse for me to writev(2) to reduce syscalls, but liburing is on the horizon for next year.
2020-09-16t/indexlevels-mirror: fix improperly skipped test
Oops :x