Date Commit message
2020-12-08 searchidx: remove $oid parameter from most calls
Xapian docids have been tied to the over {num} column for nearly 3 years, now; and OIDs are no longer stored in Xapian document data. There's no need to increase code and IPC complexity by passing the OID around.
2020-12-08 extsearchidx: remove needless SHA-1 check
There is no need to verify checksums of data already stored in git. Doing this ourselves also limits flexibility in moving to other hashes.
2020-12-08 overidx: wrap eidx_key => ibx_id mapping
This makes things a little less noisy and will be called by ExtSearchIdx.
2020-12-08 over: gracefully show invalid ibx_id
While "public-inbox-extindex --gc" invocations try to ensure proper ordering, it is still possible for users to change the `inboxes' tables via sqlite3(1) or similar means. So show a "missing://ibx_id=$ibx_id" placeholder to avoid undefined variable warnings. URLs such as "imaps://..." will eventually be supported as eidx_keys, so having a URL-like "missing://" as a placeholder probably makes sense.
2020-12-07 overidx: {num} column is INTEGER PRIMARY KEY
INTEGER PRIMARY KEY can be an alias for ROWID in SQLite and is already unique, so there's no need for a separate UNIQUE(num) index. With a smallish ~3K, freshly indexed v2 inbox, this results in a ~40K space savings, reducing over.sqlite3 from 1.375M to 1.335M (post-VACUUM). This only affects newly-indexed inboxes; existing DBs will require manual intervention to take advantage of space savings. Link: https://www.sqlite.org/rowidtable.html
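The space saving comes from SQLite not building a separate index for an INTEGER PRIMARY KEY column. A minimal sketch in Python (not the project's Perl) using the standard `sqlite3` module; the table names and the `ddd` column layout are hypothetical, and the "before" table assumes the old schema declared `num` as a plain INTEGER with `UNIQUE (num)`:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Hypothetical tables for illustration; the real over.sqlite3 schema has more columns.
con.execute("CREATE TABLE with_unique (num INTEGER NOT NULL, ddd BLOB, UNIQUE (num))")
con.execute("CREATE TABLE with_ipk (num INTEGER PRIMARY KEY, ddd BLOB)")


def index_names(table):
    """List the indexes SQLite keeps for a table."""
    rows = con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = ?",
        (table,),
    )
    return [name for (name,) in rows]


# UNIQUE (num) is backed by an automatic index stored alongside the table.
unique_indexes = index_names("with_unique")
# INTEGER PRIMARY KEY needs no index: the column is the rowid itself.
ipk_indexes = index_names("with_ipk")

con.execute("INSERT INTO with_ipk (num, ddd) VALUES (42, x'00')")
rowid, num = con.execute("SELECT rowid, num FROM with_ipk").fetchone()
print(unique_indexes, ipk_indexes, rowid == num)
```

The extra automatic index in the first table is the kind of overhead the commit removes for newly indexed inboxes.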
2020-12-05 imap: support isearch and reduce Xapian queries
Since IMAP search (either with Isearch or traditional per-Inbox search) only returns UIDs, we can safely set the limit to the UID slice size(*). With isearch, we can also trust the Xapian result to fit any docid range we specify. Limiting Xapian results to 1000 was making ->ALL docid <=> per-Inbox UID mapping impossible since results could overlap between ranges unpredictably. Finally, we can map the ->ALL docids into per-Inbox UIDs and show them to the client in the UID order of the Inbox, not the docid order of the ->ALL extindex. This also lets us get rid of the "uid:" query parser prefix and use the Xapian::Query API directly to reduce our search prefix footprint. For mbox.gz downloads in WWW, we'll also make a best effort to preserve the order from the Inbox, not the order of extindex, though it's possible large result sets can have non-overlapping windows. (*) By definition, UID slice size is a "safe" value which shouldn't OOM either the server or clients.
2020-12-05 isearch: emulate per-inbox search with ->ALL
Using "eidx_key:" boolean prefix to limit results to a given inbox, we can use ->ALL to emulate and replace per-Inbox xap15/[0-9] search indices. With this change, the presence of "extindex.all.topdir" in the $PI_CONFIG will cause the WWW code to use that extindex and ignore per-inbox Xapian DBs in xap15/[0-9]. Unfortunately, IMAP search still requires old per-inbox indices, for now. Mapping extindex Xapian docids to per-Inbox UIDs and vice-versa is proving tricky. Fortunately, IMAP search is rarely used and optional. The RFCs don't specify expensive phrase search, either, so `indexlevel=medium' can be used in per-inbox Xapian indices to save space. For primarily WWW (and future JMAP) users, this should result in significant disk space, FD, and page cache footprint savings for large instances with many inboxes and many cross-posted messages.
2020-12-05 inbox: simplify ->search and callers
Stop leaking WWW/PSGI-specific logic into classes like PublicInbox::Inbox, which is used universally. We'll also decouple $ibx->over from $ibx->search and just deal with duplicating the code inside ->over to reduce argument complexity in ->search. This is also a step in moving away from using {psgi.errors} to ease code sharing between IMAP, NNTP, and command-line interfaces. Perl's built-in `warn' and `local $SIG{__WARN__}' provide all the flexibility we need to control warning output and should be universally understood by Perl hackers who may be unfamiliar with PSGI.
2020-12-05 extmsg: use ->ALL for "global" MID lookups
As with NewsWWW and NNTP, we can use ->ALL to completely avoid trying SQLite/Xapian lookups across hundreds/thousands of inboxes.
2020-12-05 newswww: use ->ALL to avoid O(n) inbox scan
We can avoid doing a Message-ID lookup on every single inbox by using ->ALL to scan its over.sqlite3 DB. This mimics NNTP behavior and picks the first message indexed, though redirecting to /all/$MESSAGE_ID/ could be done. With the current lore.kernel.org set of inboxes (~140), this provides a 10-40% speedup depending on inbox ordering.
2020-12-05 search: remove mdocid export
There's no need to export it, as shown by the change to SearchView. This should pave the way to making search more flexible and allow per-Inbox search to reuse ->ALL.
2020-12-05 nntp: small speed up for multi-line responses
Using a non-zero-length separator for `join' requires extra work inside Perl. We can shove the cost of appending "\r\n" into the `map' loop, instead. This speeds up the `join' operation. The "deferred" log entry for a "LISTGROUP org.kernel.vger.linux-kernel" command (with nearly 3.8 million messages) goes from ~3.96s to ~3.86s on my workstation.
2020-12-05 nntp: xref_by_tc: simplify slightly
We can invalidate ibx->{newsgroup} at config load-time to avoid having to check ibx->{newsgroup} validity in To/Cc: matching. This saves us some hash lookups in all cases.
2020-12-05 over: ensure old, merged {tid} is really gone
We must use the result of link_refs() since it can trigger merge_threads() and invalidate $old_tid. In case merge_threads() isn't triggered, link_refs() will return $old_tid anyways. When rethreading and allocating new {tid}, we also must update the row where the now-expired {tid} came from to ensure only the new {tid} is seen when reindexing subsequent messages in history. Otherwise, every subsequently reindexed+rethreaded message could end up getting a new {tid}. Reported-by: Kyle Meyer <kyle@kyleam.com> Link: https://public-inbox.org/meta/87360nlc44.fsf@kyleam.com/
2020-12-01 nntp: make ->ALL Xref generation more fuzzy
For ->ALL users, this mitigates the regression introduced by commit 811b8d3cbaa790f59b7b107140b86248da16499b ("nntp: xref: use ->ALL extindex if available"), since it's common to cross post messages to some mailing lists with per-list trailers for unsubscribe information. We won't bother dealing with Bcc-ed messages since those are nearly all spam when it comes to public mailing lists. Fixes: 811b8d3cbaa790f5 ("nntp: xref: use ->ALL extindex if available") Link: https://public-inbox.org/meta/20201130194201.GA6687@dcvr/
2020-12-01 nntpd: move {newsgroup} name check to config
With 50K newsgroups in the config file, this doesn't slow down `PublicInbox::Config->new->fill_all' any measurable amount on my busy old workstation. This should prevent invalid newsgroup names from getting into extindex and catch user errors sooner, rather than later. v2: delete {newsgroup} if invalid to avoid ->nntp_url link; simplify -imapd and explain remaining check.
2020-11-30 t/extsearch: test ->has_threadid
Since has_threadid predates the existence of ExtSearch, we can be certain the Xapian DB can collapse by threadid.
2020-11-30 git: ensure subclassed ->fail gets called
Some of these changes may not be strictly necessary, but they make the code easier to maintain and change. Hackers using/modifying this code will no longer wonder if a particular callsite needs to care about subclasses or not.
2020-11-30 git: set non-blocking flag in case of other bugs
This makes GitAsyncCat more resilient to bugs in Gcf2 or even git-cat-file itself. I noticed -imapd stuck on read(2) from the Gcf2 pipe, so there may be a bug somewhere in Gcf2 or PublicInbox::Git. This should make us more resilient to them and hopefully help us notice and fix them.
2020-11-29 extindex: support `--gc' to remove dead inboxes
Inboxes may be removed or newsgroups renamed over time. Introduce a switch to do garbage collection and eliminate stale search and xref3 results based on inboxes which remain in the config file. This may also fix up stale results left over from any bugs which may leave stale data around. This is also useful in case a clumsy BOFH (me :P) is swapping between several PI_CONFIGs and accidentally indexes a bunch of inboxes they didn't intend to.
2020-11-29 v2writable: detect shard count for ExtSearchIdx properly
Otherwise, any explicitly set shard counts were ignored and we'd be counting CPUs every single time.
2020-11-29 nntpd: remove redundant {groups} shortcut
It's not worth confusing hackers reading the source to have two ways to access the same (large) hash table. So just go through PublicInbox::Config objects for now since the extra hash lookup isn't going to be noticeable. I've also started favoring "for" instead of "foreach" since they're the equivalent perlop and less wear on my fingers + keyboard.
2020-11-29 nntp: XPATH uses ->ALL extindex, too
Another 30-40% speedup when testing against a local lore.kernel.org mirror. In either case, we'll consistently sort the response for ease-of-testing and client-side cache-friendliness.
2020-11-29 nntp: art_lookup: use mid_lookup and simplify
This lets us take advantage of mid_lookup speedup from the previous commit. While we're at it, start moving towards using `$ibx' as the abbreviation for PublicInbox::Inbox objects even in the NNTP code, since they've been shared with the WWW code for several years, now.
2020-11-29 nntp: speed up mid_lookup() using ->ALL extindex
We can reuse "xref3" information in extindex to quickly match messages matching a given Message-ID across hundreds or thousands of newsgroups with a few SQL statements. "XHDR Xref $MESSAGE_ID" is around 40% faster, on top of previous speedups.
2020-11-29 nntp: NEWGROUPS uses long_response
We can amortize the cost of NEWGROUPS time filtering using the long_response API. This lets us handle hundreds/thousands of inboxes without monopolizing the event loop for this command. Further speedup is possible using MiscSearch, but that requires not-yet-done indexing changes to MiscIdx.
2020-11-29 extindex: fix delete (`d') handling
We need to completely remove a message from over.sqlite3 and Xapian when no references remain, otherwise users will still see the removed messages in NNTP overviews and WWW search results/summaries. References to messages are now solely handled by the `xref3' table of over.sqlite3. We can also trust `xref3' when deciding whether to remove only the "O$eidx_key" and "G$lid" terms from a document in Xapian or to remove the entire Xapian document.
2020-11-28 searchidxshard: chomp $eidx_key from pipe
We were accidentally adding "\n" to terms (which Xapian happily accepts), causing incompatibilities when enabling parallel sharding in some invocations of -extindex but not others. This is an extindex incompatibility and starting a new extindex will be required to take advantage of in-development features, so it's not urgent to start another one, either. (other incompatible things may happen before a 1.7 release)
2020-11-28 *index: more consistent graceful shutdown checks
v1 and v2 inbox indexing now supports graceful shutdown checks just like ExtSearchIdx. Additionally, we'll consistently perform quit checks at the top of loops. Interaction with the --xapian-only and --sequential-shard options is a bit lacking, and will warn the user to use "--reindex --xapian-only" to fix.
2020-11-28 nntp: xref: use ->ALL extindex if available
Getting Xref for cross-posted messages is an O(n) operation where `n' is the number of newsgroups on the server. This works acceptably when there are dozens of groups, but would be unacceptable when there are tens of thousands of newsgroups. With ~140 newsgroups, a lore.kernel.org mirror already handles "XHDR Xref $MESSAGE_ID" requests around 30% faster after creating the xref3.idx_nntp index. The SQL additions to ExtSearch.pm may be a bit strange and seem more appropriate for Over.pm; however it currently makes sense to me since those bits of over.sqlite3 access are exclusive to ExtSearch and can't be used by traditional v1/v2 inboxes...
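The idea of replacing a per-newsgroup scan with one indexed lookup can be sketched as follows. This is Python with the standard `sqlite3` module rather than the project's Perl, and the `inboxes`/`xref` tables, their columns, and the index name are hypothetical simplifications, not the actual `xref3` schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Hypothetical, simplified schema for illustration only.
con.executescript("""
CREATE TABLE inboxes (ibx_id INTEGER PRIMARY KEY, newsgroup TEXT NOT NULL);
CREATE TABLE xref (docid INTEGER NOT NULL, ibx_id INTEGER NOT NULL, xnum INTEGER NOT NULL);
CREATE INDEX idx_xref_docid ON xref (docid);
""")
con.executemany(
    "INSERT INTO inboxes VALUES (?, ?)",
    [(1, "inbox.a"), (2, "inbox.b"), (3, "inbox.c")],
)
con.executemany(
    "INSERT INTO xref VALUES (?, ?, ?)",
    [(10, 1, 5), (10, 3, 7), (11, 2, 1)],
)


def xref_for(docid):
    """Build "group:num" pairs for one message with a single indexed query,
    instead of asking every newsgroup whether it holds the message."""
    rows = con.execute(
        "SELECT i.newsgroup, x.xnum FROM xref x "
        "JOIN inboxes i ON i.ibx_id = x.ibx_id "
        "WHERE x.docid = ? ORDER BY i.newsgroup",
        (docid,),
    )
    return " ".join(f"{group}:{num}" for group, num in rows)


header = xref_for(10)
print(header)
```

The cost of the query depends on how many groups carry the message, not on how many groups the server hosts.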
2020-11-28 nntp: xref: simplify sub signature
We'll be using the `xref3' table in extindex to speed up xref(), and that'll require comparisons against $smsg->{blob}. So pass the entire $smsg through.
2020-11-28 nntp: some minor golfing
Reduce screen real estate usage to reduce human attention span requirements.
2020-11-28 t/extsearch: show a more realistic case
Different messages to different public Inboxes are likely to have different List-IDs, so show that we can deduplicate based on content (but per-mailing-list trailers need to go through a PublicInbox::Filter::* or be disabled by mailing list admins).
2020-11-28 nntp: move LIST iterators to long_response
Iterating through many newsgroups can hog the event loop if many random seeks are required. Avoid monopolizing the event loop in that case by using the long_response API. For now, we can still rely on grep() since it seems to work reasonably well with 50K test newsgroup names.
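The general technique of slicing long work so one client cannot hog the event loop can be sketched with Python generators. This is not the project's long_response API (which is Perl and callback-based); the function names, the slice size, and the round-robin scheduler here are all hypothetical stand-ins:

```python
from collections import deque


def long_response(items, slice_size, handle):
    """Handle `slice_size` items at a time, yielding control between slices."""
    for start in range(0, len(items), slice_size):
        for item in items[start:start + slice_size]:
            handle(item)
        yield  # let other clients take a turn


def run_fairly(tasks):
    """Round-robin scheduler: advance each task by one slice per pass."""
    queue = deque(tasks)
    while queue:
        task = queue.popleft()
        try:
            next(task)
        except StopIteration:
            continue  # task finished, do not requeue it
        queue.append(task)


log = []
big = long_response(["a1", "a2", "a3", "a4"], 2, log.append)
small = long_response(["b1"], 2, log.append)
run_fairly([big, small])
print(log)
```

The short request is served after the first slice of the long one rather than after all of it, which is the fairness property the commit is after.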
2020-11-28 nntp: LIST ACTIVE.TIMES use angle brackets around address
This matches the example shown in RFC 3977, section 7.6.1.3
2020-11-28 miscsearch: implement ->newsgroup_matches
This may be used to speed up newsgroup searches down-the-line, but the grep perlop isn't too shabby, at the moment.
2020-11-28 nntp: NEWNEWS: speed up filtering
With 50K newsgroups, the filtering phase goes from ~2000 seconds to ~90 MILLISECONDS by relying on the grep perlop. This moves ->over checking out of the main dispatch and amortizes the cost via long_response. (Fairly scheduled) long_response time in newnews_i now takes ~360 seconds as opposed to ~30 seconds before this change; however, eliminating ~2000s from the initial filtering phase is more than worth it.
2020-11-28 nntp: use grep operation for wildmat matching
Based on experiences with the IMAP server, this ought to be significantly faster (as to be demonstrated in the next commit).
2020-11-28 mm: min/max: return 0 instead of undef
This simplifies callers and allows empty newsgroups to be represented (the WWW UI may be insufficient there, too).
2020-11-28 nntpd: share {groups} hash with {-by_newsgroup} in Config
There's no need to duplicate a potentially large hash, but we can keep the inexpensive shortcut to it. We may eventually drop the {groups} shortcut if it's no longer useful.
2020-11-28 nntp: use Inbox->uidvalidity instead of ->mm->created_at
This is memoized, and may allow us some future flexibility w.r.t PublicInbox::Inbox-like objects. While we're at it, use defined-or ("//") in case somebody really set a public-inbox creation time to the Unix epoch.
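The difference between a truthiness fallback and a defined-or fallback matters exactly when the value is 0, as with a creation time at the Unix epoch. A small Python analogue of Perl's `||` versus `//` (the function names and the fallback value are hypothetical):

```python
def created_at_or(value, default):
    # Truthiness test, like Perl's `||`: 0 (the Unix epoch) counts as "not set".
    return value or default


def created_at_defined_or(value, default):
    # Like Perl's `//`: only a missing value (None here) falls back.
    return default if value is None else value


fallback = 1606521600  # arbitrary illustrative timestamp
print(created_at_or(0, fallback))             # the epoch is silently replaced
print(created_at_defined_or(0, fallback))     # the epoch is preserved
print(created_at_defined_or(None, fallback))  # a missing value falls back
```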
2020-11-24 extsearchidx: deduplicate alternates based on st_dev + st_ino
This allows us to filter out duplicate alternates entries in case there are symlinks or bind mounts in play, as I (and perhaps some other users) tend to use symlinks and/or bind mounts heavily.
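Deduplicating by device and inode number catches paths that differ as strings but name the same directory. A minimal sketch in Python (not the project's Perl); `dedupe_by_inode` and the directory names are hypothetical, and the demo assumes a platform where symlinks can be created:

```python
import os
import tempfile


def dedupe_by_inode(paths):
    """Keep the first path seen for each (st_dev, st_ino) pair."""
    seen = set()
    unique = []
    for path in paths:
        st = os.stat(path)  # os.stat follows symlinks to the real target
        key = (st.st_dev, st.st_ino)
        if key not in seen:
            seen.add(key)
            unique.append(path)
    return unique


with tempfile.TemporaryDirectory() as tmp:
    real = os.path.join(tmp, "objects")
    link = os.path.join(tmp, "objects-link")
    os.mkdir(real)
    os.symlink(real, link)
    result = dedupe_by_inode([real, link, real])
    only_real = result == [real]
    print(only_real)
```

Comparing path strings alone would have kept both `objects` and `objects-link`.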
2020-11-24 wwwattach: prevent deep-linking via Referer match
This prevents `<img src=' tags from being used to deep-link image attachments from HTML outside of the current host and reduces potential for abuse. Some browsers (e.g. Firefox) favor content detection and will display images irrespective of the Content-Type header being "application/octet-stream", and "Content-Disposition: attachment" doesn't stop them, either. Tested with dillo and Firefox. Reported-by: Leah Neukirchen <leah@vuxu.org>
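A Referer check of this kind can be sketched in Python as below. This is an illustration, not the project's Perl implementation: the function name is hypothetical, comparing only the host part is a simplification, and allowing requests that send no Referer at all is an assumption made here so that command-line clients keep working:

```python
from urllib.parse import urlsplit


def referer_matches_host(referer, host):
    """Return True when the Referer header is absent or points at `host`."""
    if not referer:
        # Assumption: clients that send no Referer are allowed through.
        return True
    return urlsplit(referer).netloc.lower() == host.lower()


print(referer_matches_host(None, "example.org"))
print(referer_matches_host("https://example.org/inbox/msg/", "example.org"))
print(referer_matches_host("https://elsewhere.example/page.html", "example.org"))
```

A request that fails the check would then be served something other than the raw image, so an off-site `<img src=` tag no longer renders it.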
2020-11-24 gcf2: workaround libgit2 alternates bug for extindex
While libgit2 handles alternates with relative paths properly for v2 epochs, nesting them another layer with extindex uses the wrong relative path expansion (and is inconsistent with git(1) behavior). Fortunately, it's possible to work around this libgit2 bug entirely within Gcf2 and avoid further special cases throughout the rest of our code to support extindex. Link: https://bugs.debian.org/975607
2020-11-24 *search: simplify retry_reopen users
Every callback uses `$self', and creating short-lived array references is not necessary when it's just as easy to copy the array in Perl (unlike C).
2020-11-24 manifest: support faster generation via [extindex "all"]
For a mirror of lore.kernel.org with >140 inboxes, this speeds up manifest.js.gz generation from ~1s to 40ms on my HW. This is still unacceptable when dealing with thousands of inboxes, but gets us closer to where we need to be.
2020-11-24 extsearchidx: do not short-circuit MiscIdx on no-op v2 prepare
This was intended to make development easier, but it also allows description, URL, and address changes to be picked up independently of message history.
2020-11-24 miscidx: store absolute git_dir of each epoch in docdata
This will make it possible to map reference repos in case somebody uses the feature.
2020-11-24 miscidx: cleanup git processes after manifest indexing
We shouldn't leave "cat-file --batch" processes around when we're done with an epoch or inbox, since there could be many thousands.
2020-11-24 extsearch: fix remaining "eindex" references
We'll replace "$EINDEX" => "$EXTINDEX" in a user-visible line and also some hacker-only tests. "eindex" is no longer used because it rhymes with "reindex", so remove the last instance of it. Fixes: 6b0fed3b03263ba2 ("extsearch: rename -eindex to -extindex")