path: root/lib/PublicInbox/Search.pm
Date Commit message
2021-10-16 inbox + search: use 5.10.1 and do some golfing
Some yak-shaving while I try to track down other bugs...
2021-10-15 www: various help text updates
`dt:' documentation is redundant with `d:' approxidate support; so drop `dt:' since mairix uses `d:'. We'll also document `rt:' since there are legit messages from senders with broken clocks. Reduce indentation level of help texts to be in 2-space increments to avoid using too much horizontal space. We'll always place IMAP ahead of NNTP since it's alphabetical and there are likely more IMAP clients out there. Add "--ng NEWSGROUP" to -init instructions if configured. There are also some minor wording changes throughout.
2021-10-14 lei inspect: account for non-extindex inboxes
Inbox->xdb does not exist, but this code path was apparently never tested :x I noticed this on a basic v2 inbox, but it could happen with any v1/v2 inbox. Move ->num2docid into Search so it's less awkward to use.
2021-10-12 daemon: unconditionally close Xapian shards on cleanup
The cost of opening a Xapian DB (even with shards) isn't high, so save some FDs and just close it. We hit Xapian far less than over.sqlite3 and we discard the MSet ASAP even when streaming large responses. This simplifies our code a bit and hopefully helps reduce fragmentation by increasing mortality of late allocations.
2021-10-12 search: delete QueryParser along with DB handle
Xapian::QueryParser is attached to the Xapian::Database, so holding onto the QueryParser was preventing us from releasing DB handles if a query was performed.
2021-09-26 search: avoid setting undef hashtable entries
`undef' entries still take up a slot in the hash table, and cause the `exists' check to false-positive in ->cleanup_shards. This should fully fix the (innocuous) messages introduced in commit 63d7b8ce (daemons: revamp periodic cleanup task, 2021-09-23)
2021-09-23 daemons: revamp periodic cleanup task
Neither Inboxes nor ExtSearch objects were retrying correctly when there are live git processes, but the inboxes were getting rescanned for search or other reasons. Ensure the scan retries eventually if there's live processes. We also need to update the cleanup task to detect Xapian shard count changes, since Xapian ->reopen is enough to detect any other Xapian changes. Otherwise, we just issue an inexpensive ->reopen call and let Xapian check whether there's anything worth reopening. This also lets us eliminate the Devel::Peek dependency.
2021-09-21 search: drop reopen retry message
It's needless noise in syslogs for daemons and unnecessarily alarming to users on the command-line.
2021-09-17 search: fix rt: w/ approxidate when TZ != UTC
While git respects a user's local timezone and returns seconds-since-the-Epoch, we were unnecessarily and incorrectly calling gmtime+strftime on its result. So ignore calling gmtime+strftime when the strftime format is "%s", just feed the output time from git directly to Xapian. This is mainly for lei, which will likely run in a variety of timezones. While we're at it, add a recommendation to use TZ=UTC in public-inbox-httpd, in case there are (misguided :P) sysadmins who set a non-UTC TZ.
2021-07-31 extindex: -xcpdb and -compact support
Since extindex uses Xapian shards in a similar way to v2 inboxes, we'll support -xcpdb (reshard+upgrade) and -compact all the same to give admins tuning+upgrade options.
2021-06-23 search: make xap_terms easier-to-use and use it more
This allows us to simplify callers throughout, and exceptions can no longer be silently hidden. MiscSearch now uses xap_terms for looking up eidx_key terms for a code reduction. We also simplify LeiStore->_msg_kw for runtime use by moving the MsetIterator handling into t/lei_store.t test case.
2021-05-28 lei: retry_reopen on read-only Xapian access
Xapian DBs may be modified by a parallel process while we're reading it, and Xapian's MVCC model places the burden on readers to retry operations. We'll also have retry_reopen croak instead of die on errors, which ought to help us track down some "Document not found" errors I've occasionally seen when using "lei <q|up>".
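The reader-side retry described above can be sketched roughly as follows. This is a hypothetical Python illustration, not the project's Perl code: the `DatabaseModifiedError` stand-in class, the signature of `retry_reopen` and the `max_tries` limit are all assumptions made for the example. Only the idea that a reader reopens the database and retries comes from the commit message.

```python
class DatabaseModifiedError(Exception):
    """Hypothetical stand-in for the error a reader sees when a
    writer changed the database underneath it."""


def retry_reopen(db, operation, max_tries=10):
    """Run operation(db); if the database was modified concurrently,
    reopen it and try again, up to max_tries attempts (assumed limit)."""
    for _ in range(max_tries):
        try:
            return operation(db)
        except DatabaseModifiedError:
            # Pick up the latest revision before the next attempt.
            db.reopen()
    # Raise toward the caller so the failing call site is visible.
    raise RuntimeError(f"gave up after {max_tries} reopen attempts")
```

Any other exception raised by `operation` propagates unchanged, so unrelated failures are not hidden by the retry loop.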
2021-04-16 search: expand "d:" to "dt:" for precision with approxidate
If a user specifies "d:" with a higher precision than it was traditionally able to handle, switch transparently to "dt:". This lowers the learning curve and improves DWIM-ness. v2: fix "d:YYYYMMDD..$NEEDS_APPROXIDATE" case
2021-03-26 lei: add some labels support
"lei q" now displays labels in JSON output, "lei mark" can add or remove labels for any messages. "lei ls-label" is supported, too. Unfortunately, "lei q" won't handle "kw:" or "L:" for external messages, they must be imported, first.
2021-03-11 doc: glossary: add information for dates and timestamps
These have been confusing to me in the past, too.
2021-03-05 search: use "z:" instead of "bytes:" prefix
So far, searching by size has never been publicly documented, and IMHO, of questionable utility. In any case, "z:" is what mairix(1) uses, so it may be familiar to existing mairix users (I've never used this prefix myself). So far, this prefix is only used internally in tests and in auto-translated queries from IMAP; thus this incompatible change is unlikely to affect anyone.
2021-02-12 search: query_approxidate: cleanup regexp, more tests
The cleanup doesn't seem to matter; I initially thought I needed to handle "" (two double quotes) explicitly because that's what Xapian does to escape a double quote inside a double-quoted phrase. It turns out we only need to be able to pass phrases through to Xapian unmodified, and the existing group of ["\x{201c}\x{201d}] is sufficient for our purposes.
2021-02-11 search: disallow spaces in argv approxidate queries
This is for consistency with --stdin and WWW front ends which can't distinguish between phrase searches and prefix ranges used for d:/dt:/rt:. In any case, I expect users on the lei command-line are more likely to use `5.days.ago' instead of `"5 days ago"'
2021-02-11 search: use git approxidate in WWW and "lei q --stdin"
This greatly improves the usability of d:, dt:, and rt: search prefixes for users already familiar with git's "approxidate" feature. That is, users familiar with the --(since|after|until|before)= options in git-log(1) and similar commands will be able to use those dates in the WWW UI.
2021-02-10 search: fix argv handling of quoted phrases
This fixes both an old bug in "lei q" argv handling and one recent regression introduced with the change to use approxidate. Field prefixes are also handled correctly inside parenthesized statements when the field follows "(" without a separation character. Fixes: fbb7ccabbf54a405 ("lei q: use git approxidate with d:, dt: and rt: ranges")
2021-02-09 www: stream mboxrd in descending docid order
Order doesn't matter when users are completely downloading mboxrds onto the FS and then opening them with an MUA. The MUA is expected to sort the results in the user's preferred order. However, lei can start streaming the results to its destination Maildir (or eventually IMAP/JMAP mailbox) with an MUA already open. This will let users see recent results sooner in their MUA, as those tend to have a higher docid. This matches the behavior of the HTML results, as well. As a bonus, this is ~5% faster in a one-off, informal test case with 66k results. I expect this to hold true in all cases since git has always optimized storage to favor recent objects.
2021-02-08 search: use one git-rev-parse process for all dates
This is necessary to avoid slowdowns with pathological cases with many dates in the query, since each rev-parse invocation takes ~5ms. This is not measurably slower with one open-ended range, but is already faster with any closed range featuring two dates which require parsing via git.
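A minimal Python sketch of the batching idea, for illustration only (the project itself is written in Perl). It relies on the documented behavior of `git rev-parse --since=<date>`, which prints a `--max-age=<epoch>` line for each such argument. The function names `build_rev_parse_argv`, `parse_max_age_lines` and `approxidates_to_epochs` are hypothetical, and the sketch assumes a `git` executable is on the PATH.

```python
import subprocess


def build_rev_parse_argv(dates):
    """One argv for every date, so a single process parses them all."""
    return ["git", "rev-parse"] + [f"--since={d}" for d in dates]


def parse_max_age_lines(output):
    """Turn '--max-age=<epoch>' lines back into integer timestamps."""
    prefix = "--max-age="
    epochs = []
    for line in output.splitlines():
        if not line.startswith(prefix):
            raise ValueError(f"unexpected rev-parse output: {line!r}")
        epochs.append(int(line[len(prefix):]))
    return epochs


def approxidates_to_epochs(dates):
    """Spawn git once, regardless of how many dates the query has."""
    proc = subprocess.run(build_rev_parse_argv(dates), check=True,
                          capture_output=True, text=True)
    epochs = parse_max_age_lines(proc.stdout)
    if len(epochs) != len(dates):
        raise ValueError("git returned a different number of dates")
    return epochs
```

With this shape, a closed range such as `d:2.weeks.ago..yesterday` costs one process spawn instead of two.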
2021-02-08 lei q: use git approxidate with d:, dt: and rt: ranges
Instead of having --(sent|received)-(before|after)=s command-line switches, we'll just try to make sense of argv so it's usable within parenthesized statements and such. Given the negligible performance penalty with Inline::C process spawning, we'll probably wire this up to the WWW interface, too. "d:" is for mairix compatibility. I don't know if "dt:" and "rt:" will be too useful, but they exist because of IMAP (and JMAP).
2021-02-07 lei: replace --thread with --threads
Nobody is expected to use long options, but for consistency with mairix(1), we'll use the pluralized option throughout (including existing PublicInbox::{Search,SearchView}). Link: https://public-inbox.org/meta/20210206090119.GA14519@dcvr/
2021-01-22 lei q: retrieve keywords for local, non-external messages
This isn't tested for now, so maybe it works.
2021-01-14 search: rename "ts:" prefix to "rt:"
Meaning "Received time", as it is the best description of the value we use from the "Received:" header, if present. JMAP calls it "receivedAt", but "rt:" seems like a better abbreviation being in line with "dt:" for the "Date" header. "Timestamp" ("ts") was potentially ambiguous given the presence of the "Date" header.
2021-01-12 lei query + pagination sorta working
Parallelism and interactivity with pager + SIGPIPE needs work; but results are shown and phrase search works without shell users having to apply Xapian quoting rules on top of standard shell quoting.
2021-01-02 search: do not use $QP_FLAGS until Xapian is loaded
The default $QP_FLAGS won't be set until after Xapian is loaded, duh... This fixes t/imapd.t with TEST_RUN_MODE=0
2021-01-01 update copyrights for 2021
Using "make update-copyrights" after setting GNULIB_PATH in my config.mak
2020-12-31 Merge remote-tracking branch 'origin/master' into lorelei
* origin/master: (58 commits)
  ds: flatten + reuse @events, epoll_wait style fixes
  ds: simplify EventLoop implementation
  check defined return value for localized slurp errors
  import: check for git->qx errors, clearer return values
  git: qx: avoid extra "local" for scalar context case
  search: remove {mset} option for ->mset method
  search: remove pointless {relevance} setting
  miscsearch: take reopen from Search and use it
  extsearch: unconditionally reopen on access
  extindex: allow using --all without EXTINDEX_DIR
  extindex: add undocumented --no-scan switch
  extindex: enable autoflush on STDOUT/STDERR
  extindex: various --watch signal handling fixes
  extindex: --watch for inotify-based updates
  eml: fix undefined vars on <Perl 5.28
  t/config: test --get-urlmatch for git <2.26
  default to CORE::warn in $SIG{__WARN__} handlers
  inbox: name variable for values loop iterator
  inboxidle: avoid needless syscalls on refresh
  inboxidle: clue users into resolving ENOSPC from inotify
  ...
2020-12-31 lei_xsearch: cross-(inbox|extindex) search
While a single extindex combines multiple inboxes into a single search index, extindex still requires up-front indexing on items which can be searched. XSearch has no on-disk footprint itself and uses Xapian DBs of existing publicinbox and extindex ("extinbox") exclusively. XSearch still suffers from the multi-shard Xapian scalability problems which led to the creation of extindex, but I expect the number of shards to remain relatively low. I envision users hosting public-inbox instances on their workstations will only have two extindex combined by this, one read-only extindex for serving public archives, and one read-write extindex managed by LeiStore for private mail.
2020-12-28 search: remove {mset} option for ->mset method
The ->mset method always returns a Xapian mset nowadays, so naming a parameter {mset} is too confusing. As it does with MiscSearch, setting the {relevance} parameter to -1 now sorts by ascending docid order. -2 is now supported for descending docid order, too, since it may be useful for lei users.
2020-12-28 search: remove pointless {relevance} setting
SearchView will set it to `undef', others will set the 'mset' option (for the ->mset method :P) to 2 which causes {relevance} to be ignored. And the 'mset' option is poorly named now that the method is named ->mset...
2020-12-23 miscsearch: index UIDVALIDITY, use as startup cache
This brings -nntpd startup time down from ~35s to ~5s with 50K inboxes. Further improvements ought to be possible with deeper changes to MiscIdx, since -mda having to load every inbox seems unreasonable; but this general change is fairly unintrusive.
2020-12-19 search: simplify initialization, add ->xdb_shards_flat
This reduces differences between v1 and v2 code, and introduces ->xdb_shards_flat to provide read-only access to shards without using Xapian::MultiDatabase. This will allow us to combine shards of several inboxes AND extindexes for lei.
2020-12-17 inbox: simplify v2 epoch counting
Perl readdir detects list context and can return an array suitable for the grep op. From there, we can rely on substr to remove the ".git" suffix and integerize the value to save a few bytes before letting List::Util::max return the value. This is how we detect Xapian shards nowadays, too, and we'll also use defined-or (//) to simplify the return value there. We'll also simplify InboxWritable->git_dir_latest, remove some callers, and consider removing it entirely.
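The directory-listing approach described above (list entries, keep the numbered ".git" names, strip the suffix, take the maximum) can be sketched in Python as below. This is an illustration and not the project's Perl implementation: the function name `max_epoch`, the exact filename pattern, and returning `None` when nothing matches are assumptions made for the example.

```python
import os
import re

# Assumed naming scheme: epoch directories look like "0.git", "1.git", ...
EPOCH_DIR = re.compile(r"\A([0-9]+)\.git\Z")


def max_epoch(git_dir):
    """Return the highest numbered epoch under git_dir, or None."""
    try:
        names = os.listdir(git_dir)
    except FileNotFoundError:
        return None  # no directory at all means no epochs yet
    # Keep matching names only, dropping ".git" and integerizing the rest.
    epochs = [int(m.group(1)) for m in map(EPOCH_DIR.match, names) if m]
    return max(epochs, default=None)
```

Comparing integers rather than names matters once there are ten or more epochs, since "10.git" sorts before "9.git" as a string.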
2020-12-09 search: reinstate "uid:" internal search prefix
User-supplied queries (via PublicInbox::IMAPsearchqp) may restrict messages to certain UID ranges in addition to the limits we impose ourselves for mailbox slices. So we'll continue to ask Xapian::QueryParser to parse "uid:" numeric ranges. Fixes: 4b551c884a648b45 ("imap: support isearch and reduce Xapian queries")
2020-12-05 imap: support isearch and reduce Xapian queries
Since IMAP search (either with Isearch or traditional per-Inbox search) only returns UIDs, we can safely set the limit to the UID slice size(*). With isearch, we can also trust the Xapian result to fit any docid range we specify. Limiting Xapian results to 1000 was making ->ALL docid <=> per-Inbox UID impossible since results could overlap between ranges unpredictably. Finally, we can map the ->ALL docids into per-Inbox UIDs and show them to the client in the UID order of the Inbox, not the docid order of the ->ALL extindex. This also lets us get rid of the "uid:" query parser prefix and use the Xapian::Query API directly to reduce our search prefix footprint. For mbox.gz downloads in WWW, we'll also make a best effort to preserve the order from the Inbox, not the order of extindex; though it's possible large result sets can have non-overlapping windows. (*) by definition, UID slice size is a "safe" value which shouldn't OOM either the server or clients.
2020-12-05 isearch: emulate per-inbox search with ->ALL
Using "eidx_key:" boolean prefix to limit results to a given inbox, we can use ->ALL to emulate and replace per-Inbox xap15/[0-9] search indices. With this change, the presence of "extindex.all.topdir" in the $PI_CONFIG will cause the WWW code to use that extindex and ignore per-inbox Xapian DBs in xap15/[0-9]. Unfortunately IMAP search still requires old per-inbox indices, for now. Mapping extindex Xapian docids to per-Inbox UIDs and vice-versa is proving tricky. Fortunately, IMAP search is rarely used and optional. The RFCs don't specify expensive phrase search, either, so `indexlevel=medium' can be used in per-inbox Xapian indices to save space. For primarily WWW (and future JMAP) users; this should result in significant disk space, FD, and page cache footprint savings for large instances with many inboxes and many cross-posted messages.
2020-12-05 search: remove mdocid export
There's no need to export it, as shown by the change to SearchView. This should pave the way to making search more flexible and allow per-Inbox search to reuse ->ALL.
2020-11-24 *search: simplify retry_reopen users
Every callback uses `$self', and creating short-lived array references is not necessary when it's just as easy to copy the array in Perl (unlike C).
2020-11-24 miscsearch: a new Xapian sub-DB for extindex
This will be used to index and search Inbox objects and perhaps individual git repositories/epochs for grokmirror manifest.js.gz generation. There is no sharding planned for this at the moment since inbox count should remain low (~100K to 1M) compared to message count. Folding this into the existing sharded DBs could be possible; but would likely increase query and maintenance costs, as well as development complexity. So we'll use a few more inodes and FDs at runtime, instead.
2020-11-07 search: xdb_sharded: make this a public method for ExtSearch
We can simplify callers by using $self->{xpfx} instead of passing another arg on the stack.
2020-11-07 extsearch: start mocking out
This will provide a similar API to PublicInbox::Inbox for read-only WWW, -imapd, and -nntpd interfaces.
2020-11-07 search: hoist out _xdb_sharded for v2 inboxes
We'll be using this in detached (ext) Xapian indexes in cross inbox search.
2020-09-24 searchidx: fix (undocumented) --skip-docdata handling
This switch is still undocumented, but we can reduce the scope of our Xapian docdata dependency by moving its only caller to SearchIdx. This reduces the amount of code loaded by read-only code paths.
2020-09-03 v2writable: reuse read-only shard counting code
We'll also fix the read-only code to ensure we notice missing Xapian shards, since gaps would throw off our expectation that Xapian document IDs and NNTP article numbers are interchangeable.
2020-09-03 search: remove {over_ro} field
Only inbox accesses the read-only {over}, now, instead of going through ->search. This simplifies our object graph and avoids potentially redundant FDs and DB handles pointing to the same over.sqlite3 file.
2020-09-03 search: replace ->query with ->mset
Nearly all of the search uses in the production code rely on a Xapian mset iterator being returned (instead of an array of $smsg objects). So default to returning the mset and move the burden of smsg array conversion into the test cases.
2020-09-03 search: remove special case for blank query
The special case (if any) belongs at a higher-level, and this is another step towards removing {over_ro}-dependence in our Search object.