path: root/lib/PublicInbox/Search.pm
Date  Commit message
2021-01-02 search: do not use $QP_FLAGS until Xapian is loaded
The default $QP_FLAGS won't be set until after Xapian is loaded, duh... This fixes t/imapd.t with TEST_RUN_MODE=0
2021-01-01 update copyrights for 2021
Using "make update-copyrights" after setting GNULIB_PATH in my config.mak
2020-12-31 Merge remote-tracking branch 'origin/master' into lorelei
* origin/master: (58 commits)
  ds: flatten + reuse @events, epoll_wait style fixes
  ds: simplify EventLoop implementation
  check defined return value for localized slurp errors
  import: check for git->qx errors, clearer return values
  git: qx: avoid extra "local" for scalar context case
  search: remove {mset} option for ->mset method
  search: remove pointless {relevance} setting
  miscsearch: take reopen from Search and use it
  extsearch: unconditionally reopen on access
  extindex: allow using --all without EXTINDEX_DIR
  extindex: add undocumented --no-scan switch
  extindex: enable autoflush on STDOUT/STDERR
  extindex: various --watch signal handling fixes
  extindex: --watch for inotify-based updates
  eml: fix undefined vars on <Perl 5.28
  t/config: test --get-urlmatch for git <2.26
  default to CORE::warn in $SIG{__WARN__} handlers
  inbox: name variable for values loop iterator
  inboxidle: avoid needless syscalls on refresh
  inboxidle: clue users into resolving ENOSPC from inotify
  ...
2020-12-31 lei_xsearch: cross-(inbox|extindex) search
While a single extindex combines multiple inboxes into a single search index, extindex still requires up-front indexing on items which can be searched. XSearch has no on-disk footprint itself and uses Xapian DBs of existing publicinbox and extindex ("extinbox") exclusively. XSearch still suffers from the multi-shard Xapian scalability problems which led to the creation of extindex, but I expect the number of shards to remain relatively low. I envision users hosting public-inbox instances on their workstations will only have two extindex combined by this, one read-only extindex for serving public archives, and one read-write extindex managed by LeiStore for private mail.
2020-12-28 search: remove {mset} option for ->mset method
The ->mset method always returns a Xapian mset nowadays, so naming a parameter {mset} is too confusing. As it does with MiscSearch, setting the {relevance} parameter to -1 now sorts by ascending docid order. -2 is now supported for descending docid order, too, since it may be useful for lei users.
2020-12-28 search: remove pointless {relevance} setting
SearchView will set it to `undef', others will set the 'mset' option (for the ->mset method :P) to 2 which causes {relevance} to be ignored. And the 'mset' option is poorly named now that the method is named ->mset...
2020-12-23 miscsearch: index UIDVALIDITY, use as startup cache
This brings -nntpd startup time down from ~35s to ~5s with 50K inboxes. Further improvements ought to be possible with deeper changes to MiscIdx, since -mda having to load every inbox seems unreasonable; but this general change is fairly unintrusive.
2020-12-19 search: simplify initialization, add ->xdb_shards_flat
This reduces differences between v1 and v2 code, and introduces ->xdb_shards_flat to provide read-only access to shards without using Xapian::MultiDatabase. This will allow us to combine shards of several inboxes AND extindexes for lei.
2020-12-17 inbox: simplify v2 epoch counting
Perl readdir detects list context and can return an array suitable for the grep op. From there, we can rely on substr to remove the ".git" suffix and integerize the value to save a few bytes before letting List::Util::max return the value. This is how we detect Xapian shards nowadays, too, and we'll also use defined-or (//) to simplify the return value there. We'll also simplify InboxWritable->git_dir_latest, remove some callers, and consider removing it entirely.
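The epoch-counting approach described above (list the directory, keep only numeric "N.git" names, strip the suffix, integerize, take the maximum) can be sketched roughly as follows. This is a Python illustration, not the project's Perl code; the function name `max_epoch` and the demo directory contents are hypothetical.

```python
import os
import re
import tempfile

# Assumption: epoch directories are named "<digits>.git".
EPOCH_RE = re.compile(r"[0-9]+\.git")


def max_epoch(git_dir):
    """Return the highest epoch number found in git_dir, or None if there is none."""
    epochs = [
        int(name[: -len(".git")])      # strip the ".git" suffix and integerize
        for name in os.listdir(git_dir)
        if EPOCH_RE.fullmatch(name)    # ignore anything that is not "<digits>.git"
    ]
    return max(epochs, default=None)


# Small demonstration with throwaway directories.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("0.git", "1.git", "10.git", "notes.txt", "x.git"):
        os.mkdir(os.path.join(tmp, name))
    latest = max_epoch(tmp)

with tempfile.TemporaryDirectory() as empty:
    none_found = max_epoch(empty)
```

Returning `None` for an empty directory plays the role that defined-or (//) plays for the return value in the Perl description.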
2020-12-09 search: reinstate "uid:" internal search prefix
User-supplied queries (via PublicInbox::IMAPsearchqp) may restrict messages to certain UID ranges in addition to the limits we impose ourselves for mailbox slices. So we'll continue to ask Xapian::QueryParser to handle "uid:" numeric ranges. Fixes: 4b551c884a648b45 ("imap: support isearch and reduce Xapian queries")
2020-12-05 imap: support isearch and reduce Xapian queries
Since IMAP search (either with Isearch or traditional per-Inbox search) only returns UIDs, we can safely set the limit to the UID slice size(*). With isearch, we can also trust the Xapian result to fit any docid range we specify. Limiting Xapian results to 1000 was making ->ALL docid <=> per-Inbox UID impossible since results could overlap between ranges unpredictably. Finally, we can map the ->ALL docids into per-Inbox UIDs and show them to the client in the UID order of the Inbox, not the docid order of the ->ALL extindex. This also lets us get rid of the "uid:" query parser prefix and use the Xapian::Query API directly to reduce our search prefix footprint. For mbox.gz downloads in WWW, we'll also make a best effort to preserve the order from the Inbox, not the order of extindex; though it's possible large result sets can have non-overlapping windows. (*) by definition, UID slice size is a "safe" value which shouldn't OOM either the server or clients.
2020-12-05 isearch: emulate per-inbox search with ->ALL
Using the "eidx_key:" boolean prefix to limit results to a given inbox, we can use ->ALL to emulate and replace per-Inbox xap15/[0-9] search indices. With this change, the presence of "extindex.all.topdir" in the $PI_CONFIG will cause the WWW code to use that extindex and ignore per-inbox Xapian DBs in xap15/[0-9]. Unfortunately IMAP search still requires old per-inbox indices, for now. Mapping extindex Xapian docids to per-Inbox UIDs and vice-versa is proving tricky. Fortunately, IMAP search is rarely used and optional. The RFCs don't specify expensive phrase search, either, so `indexlevel=medium' can be used in per-inbox Xapian indices to save space. For primarily WWW (and future JMAP) users, this should result in significant disk space, FD, and page cache footprint savings for large instances with many inboxes and many cross-posted messages.
2020-12-05 search: remove mdocid export
There's no need to export it, as shown by the change to SearchView. This should pave the way to making search more flexible and allow per-Inbox search to reuse ->ALL.
2020-11-24 *search: simplify retry_reopen users
Every callback uses `$self', and creating short-lived array references is not necessary when it's just as easy to copy the array in Perl (unlike C).
2020-11-24 miscsearch: a new Xapian sub-DB for extindex
This will be used to index and search Inbox objects and perhaps individual git repositories/epochs for grokmirror manifest.js.gz generation. There is no sharding planned for this at the moment since inbox count should remain low (~100K to 1M) compared to message count. Folding this into the existing sharded DBs could be possible; but would likely increase query and maintenance costs, as well as development complexity. So we'll use a few more inodes and FDs at runtime, instead.
2020-11-07 search: xdb_sharded: make this a public method for ExtSearch
We can simplify callers by using $self->{xpfx} instead of passing another arg on the stack.
2020-11-07 extsearch: start mocking out
This will provide a similar API to PublicInbox::Inbox for read-only WWW, -imapd, and -nntpd interfaces.
2020-11-07 search: hoist out _xdb_sharded for v2 inboxes
We'll be using this in detached (ext) Xapian indexes in cross inbox search.
2020-09-24 searchidx: fix (undocumented) --skip-docdata handling
This switch is still undocumented, but we can reduce the scope of our Xapian docdata dependency by moving its only caller to SearchIdx. This reduces the amount of code loaded by read-only code paths.
2020-09-03 v2writable: reuse read-only shard counting code
We'll also fix the read-only code to ensure we notice missing Xapian shards, since gaps would throw off our expectation that Xapian document IDs and NNTP article numbers are interchangeable.
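As a rough illustration of why gaps matter, a check for missing shards could look like the Python sketch below. It assumes shards are numbered consecutively from 0; the function name `missing_shards` is hypothetical and this is not the project's actual detection logic.

```python
def missing_shards(shard_numbers):
    """Return the shard numbers absent from the range 0..max(shard_numbers)."""
    present = set(shard_numbers)
    if not present:
        return []
    # Any number below the highest shard that is not present is a gap.
    return [n for n in range(max(present) + 1) if n not in present]


complete = missing_shards([0, 1, 2])
with_gap = missing_shards([3, 0, 2])
no_shards = missing_shards([])
```

A non-empty result would signal that document IDs computed over the combined shards can no longer be trusted to line up with article numbers.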
2020-09-03 search: remove {over_ro} field
Only inbox accesses the read-only {over}, now, instead of going through ->search. This simplifies our object graph and avoids potentially redundant FDs and DB handles pointing to the same over.sqlite3 file.
2020-09-03 search: replace ->query with ->mset
Nearly all of the search uses in the production code rely on a Xapian mset iterator being returned (instead of an array of $smsg objects). So default to returning the mset and move the burden of smsg array conversion into the test cases.
2020-09-03 search: remove special case for blank query
The special case (if any) belongs at a higher-level, and this is another step towards removing {over_ro}-dependence in our Search object.
2020-08-27 search: allow testing with current xapian.git and 1.5.x
A `PI_XAPIAN' environment variable is now exposed for testing purposes. We'll also deal with the removal of `NumberValueRangeProcessor' and use `NumberRangeProcessor' in its place, but continue favoring the old Search::Xapian since that's all that's packaged for Debian 10.x stable.
2020-08-23 mbox: disable "&t" on existing Xapian until full reindex
Expanding threads via over.sqlite3 for mbox.gz downloads without Xapian effectively collapsing on the THREADID column leads to repeated messages getting downloaded. To avoid that situation, use a "has_threadid" Xapian metadata flag that's only set on --reindex (and brand new Xapian DBs). This allows admins to upgrade WWW or do --reindex in any order; without worrying about users eating up bandwidth and CPU cycles.
2020-08-23 search: support downloading mboxes results with full thread
Finally, the addition of THREADID for collapsing results in Xapian lets us emulate the "mairix --threads" feature. That is, instead of returning only the matching messages, the entire thread is included in the downloaded mbox.gz. This requires a "public-inbox-index --reindex" to be usable.
2020-08-23 searchidx: index THREADID in Xapian
This is the `tid' column from over.sqlite3; and will be used for IMAP and JMAP search (among other things).
2020-08-20 search: add mset_to_artnums method
We can avoid importing mdocid() in several places by using this method, simplifying callers.
2020-08-20 smsg: remove from_mitem
We no longer read docdata.glass from anywhere in our code base. Some adjustments were needed to t/search.t to deal with the Xapian::WritableDatabase committing at different times, since our ->query is avoided from PublicInbox::SearchIdx to avoid needing a {over_ro} field.
2020-08-20 search: make qparse_new an internal function
We'll probably be reusing it from another package in a future commit.
2020-08-20 search: export mdocid subroutine
No need to have awkward globrefs for this.
2020-08-20 search: improve comments around constants
We'll probably be adding more value columns like THREADID to sort on.
2020-08-20 search: v2: ensure shards are numerically sorted
This seems required to correctly get the NNTP article number from Xapian docid on combined Xapian DBs. The default (ASCII-betical) sorting was only acceptable for -imapd users until somebody hit 11 (or more) shards, which is a rare case.
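The difference between ASCII-betical and numeric ordering only shows up once shard numbers reach two digits. A minimal Python sketch, assuming shard directories are named by bare integers (the helper name `sort_shards` is hypothetical):

```python
import re


def sort_shards(names):
    """Return numeric shard names in numeric order, skipping other entries."""
    return sorted((n for n in names if re.fullmatch(r"[0-9]+", n)), key=int)


shards = [str(i) for i in range(12)]   # "0" .. "11", i.e. 12 shards
ascii_order = sorted(shards)           # plain string comparison
numeric_order = sort_shards(shards)    # compares the integer values instead
```

With string comparison, "10" and "11" sort between "1" and "2", so the position of every later shard in the combined database shifts.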
2020-07-25 search: avoid copying {inboxdir}
Instead, storing {xdir} will allow us to avoid string concatenation in the read-only path and save us a little hash entry space.
2020-06-13 imap: wire up Xapian, MSN SEARCH and multi sequence-sets
Simple queries work, more complex queries involving parentheses, "OR", "NOT" don't work, yet. Tested with "=b", "=B", and "=H" search and limits in mutt on both v1 and v2 with multiple Xapian shards.
2020-06-13 search: index UID for IMAP search, too
We'll need to support searching UID ranges for IMAP, so make sure it's indexed, too.
2020-06-13 search: index byte size of a message for IMAP search
Searching for messages smaller than a certain size is allowed by offlineimap(1), mbsync(1), and possibly other tools. Maybe public-inbox-watch will support it, too. I don't see a reason to expose searching by size via WWW search right now (but maybe in the future, I could be convinced to). Note: we only store the byte-size of the message in git, this is typically LF-only and we won't have the correct size after CRLF conversion for NNTP or IMAP.
2020-05-10 search: remove documentation for "lid:"
I'm not sure it's necessary, since "mid:" is similarly undocumented. Also, "t:", "c:", "f:" don't offer boolean analogues for exact matches on To/Cc/From headers, despite having similar tokens as List-Id inside angle brackets.
2020-05-09 search: support searching on List-Id
We'll support both probabilistic matches via `l:' and boolean matches via `lid:' for exact matches, similar to how both `m:' and `mid:' are supported. Only text inside angle brackets (`<' and `>') is supported, since I'm not sure if there's value in searching on the optional phrases (which would require decoding with ->header_str instead of ->header_raw).
2020-03-25 altid: warn about non-word prefixes
We only support searching on prefixes matching /\A\w+\z/ because Xapian requires ':' to delimit the prefix and splits on spaces without quotes. I've also verified Xapian supports multibyte UTF-8 characters, underscores, and bare numbers as search prefixes, so there's no need to restrict it beyond what Perl's UTF-8 aware \w character class offers.
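The prefix rule above can be approximated in Python, whose `\w` is also Unicode-aware for `str` patterns, and where `re.fullmatch` stands in for the `\A...\z` anchors. The function name `is_searchable_prefix` is hypothetical, and Perl's and Python's `\w` classes may differ on edge-case characters.

```python
import re


def is_searchable_prefix(prefix):
    """True if prefix consists solely of one or more word characters."""
    return re.fullmatch(r"\w+", prefix) is not None


accepted = [p for p in ("serial", "my_prefix", "123", "h\u00e9llo")
            if is_searchable_prefix(p)]
rejected = [p for p in ("a:b", "two words", "", "dash-ed")
            if not is_searchable_prefix(p)]
```

Colons and spaces are rejected because they would collide with the prefix delimiter and the whitespace splitting described above.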
2020-03-25 search: clobber -user_pfx on query parser initialization
While we don't currently reinitialize the query parser for the lifetime of a PublicInbox::Search object and have no plans to, it's incorrect to be appending to an existing array in case we reinitialize the query parser in the future.
2020-03-22 rename PublicInbox::SearchMsg => PublicInbox::Smsg
Since the introduction of over.sqlite3, SearchMsg is not tied to our search functionality in any way, so stop confusing ourselves and future hackers by just calling it "PublicInbox::Smsg". Add a missing "use" in ExtMsg while we're at it.
2020-02-06 treewide: run update-copyrights from gnulib for 2019
I didn't wait until September to do it, this year!
2020-01-27 search: {version} => {ibx_ver}
This avoids confusing human readers with the Xapian schema version. We also want to make it obvious this is the version of the inbox we're indexing; these are Search or SearchIdx objects, not Inbox objects.
2020-01-27 inbox: add ->version method
This allows us to simplify version checking by avoiding "//" or "||" operators sprinkled around.
2020-01-06 treewide: "require" + "use" cleanup and docs
There's a bunch of leftover "require" and "use" statements we no longer need and can get rid of, along with some excessive imports via "use". IO::Handle usage isn't always obvious, so add comments describing why a package loads it. Along the same lines, document the tmpdir support as the reason we depend on File::Temp 0.19, even though every Perl 5.10.1+ user has it. While we're at it, favor "use" over "require", since it gives us extra compile-time checking.
2020-01-05 search: remove lookup_article
It was no longer used outside of tests, so don't penalize regular users with the extra function. Just inline it for t/search.t.
2019-12-29 search: load_xapian: return true on success
This was causing -xcpdb and other admin modules to fail outside of tests (or when testing with the slow TEST_RUN_MODE=0).
2019-12-28 search: retry_reopen passes user arg to callback
This allows callers to pass named (not anonymous) subs. Update all retry_reopen callers to use this feature, and fix some places where we failed to use retry_reopen :x
2019-12-24 search: support SWIG-generated Xapian.pm
Xapian upstream is slowly phasing out the XS-based Search::Xapian in favor of the SWIG-generated "Xapian" package. While both Debian and FreeBSD have Search::Xapian, OpenBSD only includes the "Xapian" binding. More information about the status of the "Xapian" Perl module here: https://trac.xapian.org/ticket/523