path: root/lib/PublicInbox/Search.pm
Date Commit message
2024-05-06 search: fix altid search with XapHelper process
External Xapian helper processes need to support non-standard QueryParser prefixes. The only way to do this is to specify these prefixes in every `mset' request since we have no idea if the XH worker servicing the request has initialized the extra prefixes, yet.
2024-04-28 search: remove auto-start for async_mset
Only public-facing daemons use it, currently, and all public-facing daemons will pre-spawn it as early as feasible. lei will need it eventually to handle queries requiring C++, but I'm not certain what path to take with lei, yet...
2024-04-28 daemon: share and allow configuring Xapian helpers
Xapian helper processes are disabled by default once again. However, they can be enabled via the new `-X INTEGER' parameter. One big positive is the Xapian helpers being spawned by the top-level daemon means they can be shared freely across all workers for improved load balancing and memory reduction.
2024-04-28 search: async_mset: pass resource errors to callback
We need to be able to handle resource limitation errors in public-facing daemons.
2024-04-24 www: wire up search to use async xap_helper
The C++ version of xap_helper will allow more complex and expensive queries. Both the Perl and C++-only version will allow offloading search into a separate process which can be killed via ITIMER_REAL or RLIMIT_CPU in the face of overload. The xap_helper `mset' command wrapper is simplified to unconditionally return rank, percentage, and estimated matches information. This may slightly penalize mbox retrievals and lei users, but perhaps that can be a different command entirely.
2024-04-24 xap_helper: drop terms+data from `mset' command
Retrieving Xapian document terms, data (and possibly values) and transferring them to the Perl side would increase complexity and I/O on both the Perl and C++ sides. It would require more I/O in C++ and transient memory use on the Perl side, where slow mset iteration gives an opportunity to dictate memory release rate. So let's ignore the document-related stuff here for now for ease-of-development. We can reconsider this change if dropping Xapian Perl bindings entirely and relying on JAOT C++ ever becomes a possibility.
2023-12-09 *search: simplify handling of Xapian term iterators
Xapian has always sorted termlist iterators, so we now:
1) break out of the iterator loop early on non-matches
2) avoid doing sorting ourselves
As a result, we'll also favor the wantarray forms of xap_terms and all_terms to preserve sort order in most cases.
Confirmed by the Xapian maintainer: <20231201184844.GO4059@survex.com>
Link: https://lists.xapian.org/pipermail/xapian-discuss/2023-December/010013.html
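The early-exit idea can be sketched in a few lines. This is an illustration in Python, not the project's Perl code: `terms_with_prefix` and the sample terms are hypothetical, and a plain sorted list stands in for a Xapian term iterator.

```python
from bisect import bisect_left


def terms_with_prefix(sorted_terms, prefix):
    """Collect terms starting with `prefix` from an already-sorted list."""
    matches = []
    # Jump to the first candidate (roughly what an iterator skip would do).
    start = bisect_left(sorted_terms, prefix)
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # input is sorted, so no later term can match
        matches.append(term)  # already in sorted order, no extra sort needed
    return matches


# Hypothetical prefixed terms, kept in sorted order.
terms = ["Aalice", "Gfoo", "Kflagged", "Kseen", "Qmid"]
print(terms_with_prefix(terms, "K"))
```

Because the input is sorted, all terms sharing a prefix are contiguous, which is what makes both the early `break` and skipping the final sort safe.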
2023-11-29 www: load and use cindex join data
This is a major step in solving the problem of having to manually associate hundreds/thousands of coderepos with hundreds/thousands of public-inboxes to power solver (and more).
2023-11-29 xap_helper: implement mset endpoint for WWW, IMAP, etc...
The C++ version will allow us to take full advantage of Xapian's APIs for better queries, and the Perl bindings version can still be advantageous in the future since we'll be able to support timeouts effectively.
2023-10-17 search: rectify comment (rt: documented since 738c4a65719e)
Fixes: 738c4a65719e ("www: various help text updates")
2023-09-11 treewide: favor Xapian (SWIG binding) over Search::Xapian
The Xapian SWIG bindings are favored by Xapian upstream for ease-of-maintenance compared to the XS version. While Debian lags on this front, the SWIG bindings are widely available on all *BSDs.
2023-08-24 introduce optional C++ xap_helper
This allows us to perform the expensive "dump_ibx" operations in native C++ code using the Xapian C++ library. This provides the majority of the speedup with the -cindex --associate switch. Eventually this may be expanded to cover all uses of Xapian within the project to ensure we have access to Xapian APIs which aren't available in XS|SWIG bindings; and also for ease-of-installation on systems which don't provide pre-packaged Perl Xapian bindings (e.g. OpenBSD 7.3) but do provide Xapian development libraries. Most of the C++ code is still C, as I'm not remotely familiar with C++ compared to C. I suspect many users and potential hackers being from git, Linux kernel, and glibc world are in the same boat.
2023-08-24 cindex: read-only association dump
This will eventually allow associating coderepos with inboxes and vice-versa; avoiding the need for manual configuration via tedious publicinbox.*.coderepo directives. I'm not sure how this should be stored for WWW, yet, but it's required since it takes about 8 hours to do this fully across lore and git.kernel.org.
2023-08-24 search: hoist out shards_dir for future use
This will be useful for internal tooling and APIs.
2023-08-17 search: all_terms: remove needless prefix check
The ->allterms_{begin,end} methods of Xapian::Database already filter matches by prefix natively. Thus there's no need to do filtering ourselves (unlike per-document ->termlist_{begin,end}).
2023-06-09 search: hoist out do_enquire for codesearch
Reusing this bit seems to make sense as mail and code search are similar enough w.r.t. setting up sort options. This deduplication will become more useful as -cindex will likely combine code and mail search to generate associations between inboxes and code repos.
2023-06-09 search: add comments wrt codesearch, reduce ops
Add some comments about various usages of xdb_shards_flat and mset since the addition of CodeSearch (and other search things) subclassing it may become confusing. Since we're in the area, we can also avoid extra hash lookups/initializations and reduce Perl ops in various places.
2023-03-31 www: support POST /$INBOX/$MSGID/?x=m&q=
This allows filtering the contents of any existing thread using a search query. It uses the existing THREADID column in Xapian so we can internally add a Xapian OP_FILTER to the results. This new functionality is orthogonal to the existing `t=1' parameter which gives mairix-style thread expansion. It doesn't make sense to use `t=1' with this functionality, but it's not disallowed, either. The indentation change in Over->next_by_mid is to ensure DBI->prepare_cached can share across both ->next_by_mid and ->mid2tid. I also noticed the existing regex for `POST /$INBOX/?x=m&q=' was allowing extra characters. With an added \z, it's now as strict as originally intended, and AFAIK nothing was generating invalid URLs for it.
Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
Link: https://public-inbox.org/meta/aaniyhk7wfm4e6m5mbukcrhevzoc6ftctyrfwvmz4fkykwwtlj@mverfng6ytas/T/
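The effect of the added `\z` can be shown with a small sketch. This is Python rather than Perl, and the route pattern below is a made-up stand-in (the real regex is not shown in this log); Python's `\Z` plays the role of Perl's `\z` here, matching only at the very end of the string.

```python
import re

# Hypothetical route patterns: an inbox name followed by a trailing slash.
LOOSE = re.compile(r"\A/(?P<inbox>[\w.-]+)/")      # no end anchor
STRICT = re.compile(r"\A/(?P<inbox>[\w.-]+)/\Z")   # anchored at end of string


def accepts(pattern, path):
    """Return True if `path` is accepted by `pattern`."""
    return pattern.match(path) is not None


for path in ("/meta/", "/meta/unexpected"):
    print(path, accepts(LOOSE, path), accepts(STRICT, path))
```

Without the end anchor, any path that merely starts with a valid route is accepted; with it, trailing characters cause a rejection.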
2023-03-25 codesearch: initial cut w/ -cindex tool
It seems relying on root commits is a reasonable way to deduplicate and handle repositories with common history. I initially wanted to shoehorn this into extindex, but decided a separate Xapian index layout capable of being EITHER external to handle many forks or internal (in $GIT_DIR/public-inbox-cindex) for small projects is the right way to go. Unlike most existing parts of public-inbox, this relies on absolute paths of $GIT_DIR stored in the Xapian DB and does not rely on the config file. We'll be relying on the config file to map absolute paths to public URL paths for WWW.
2023-03-25 search: relocate all_terms from lei_search
This will be used for code_search, too.
2023-02-19 search: translate d: to dt: in query
dt: is higher resolution and the YYYYMMDD column will be dropped if there's ever another SCHEMA_VERSION update. While the upcoming code repo index is independent of the mail schemas, it'll use similar query prefixes and likely use d:/dt: for Author Date of git commits.
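A rough sketch of such a translation follows, in Python rather than the project's Perl. It assumes `d:` ranges use YYYYMMDD and `dt:` ranges use YYYYMMDDhhmmss, and that both ends are padded with `000000`; the padding choice and the `d_to_dt` helper are assumptions for illustration, not details stated in this log.

```python
import re

# Matches d:YYYYMMDD..YYYYMMDD with either end optional.
_D_RANGE = re.compile(r"\bd:(\d{8})?\.\.(\d{8})?")


def d_to_dt(query):
    """Rewrite d: date ranges into dt: date-time ranges (assumed padding)."""
    def repl(match):
        begin, end = match.group(1), match.group(2)
        begin = begin + "000000" if begin else ""
        end = end + "000000" if end else ""
        return f"dt:{begin}..{end}"
    return _D_RANGE.sub(repl, query)


print(d_to_dt("s:patch d:20230101..20230201"))
```

Queries that already use `dt:` are left untouched, since the pattern only matches the literal `d:` prefix.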
2023-02-17 search: move query transform + enquire setup out of retry loop
The Xapian query transformation and Enquire object setup aren't subject to MVCC and retries, so move it outside the retry loop to save some cycles in case we need to retry on a busy DB.
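The shape of this change can be sketched as follows. Everything here is a hypothetical Python stand-in: `DatabaseModifiedError`, `FakeDB`, `retry_reopen` and `search` only mimic the pattern of doing one-time setup before the loop and retrying just the part that reads from the database.

```python
class DatabaseModifiedError(Exception):
    """Stand-in for the error a reader gets when a writer changed the DB."""


class FakeDB:
    """Toy database that fails a fixed number of times before succeeding."""

    def __init__(self, failures):
        self.failures = failures
        self.reopens = 0

    def reopen(self):
        self.reopens += 1

    def run(self, query):
        if self.failures > 0:
            self.failures -= 1
            raise DatabaseModifiedError("stale revision")
        return [f"hit for {query}"]


def retry_reopen(db, work, max_tries=10):
    """Call work(db), reopening and retrying when the DB was modified."""
    for _ in range(max_tries):
        try:
            return work(db)
        except DatabaseModifiedError:
            db.reopen()
    raise RuntimeError("too many retries on a busy DB")


def search(db, query_string):
    # Query transformation does not read the DB, so do it once, up front.
    query = " ".join(query_string.split())
    # Only the DB read sits inside the retry loop.
    return retry_reopen(db, lambda d: d.run(query))


db = FakeDB(failures=2)
print(search(db, "  s:patch   d:1.week.ago.. "), db.reopens)
```

On a busy database each retry now repeats only the read, not the string processing that produced the query.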
2023-02-15 lei q: do not collapse threads with `-tt'
While having Xapian collapse threads is an easy way to reduce the amount of deduplication work we need to do when writing out threads, we can't rely on it when using `lei q -tt` since that needs to flag all hits.
Reported-by: Maxim Mikityanskiy <maxtram95@gmail.com>
Link: https://public-inbox.org/git/Y+pgBmj0jxR+cVkD@mail.gmail.com/
2022-08-04 miscidx: index inbox min/max article numbers
This will be used to speed up NNTP group listings and IMAP startup with thousands of inboxes.
2022-06-21 search: add help for patchid: prefix
Noticed-by: Kyle Meyer <kyle@kyleam.com>
2022-06-21 search: support "patchid:" prefix (git patch-id --stable)
This allows easy searching via patch-id from a git commit. Currently, abbreviations are not supported, and it seems needless to support them since AFAIK (git) doesn't generate nor resolve abbreviated patch-ids anywhere.
2021-11-03 doc: lei-q: document SEARCH TERMS prefixes
The new Documentation/common.perl file will be used for all manpages in the future.
2021-10-16 inbox + search: use 5.10.1 and do some golfing
Some yak-shaving while I try to track down other bugs...
2021-10-15 www: various help text updates
`dt:' documentation is redundant with `d:' approxidate support; so drop `dt:' since mairix uses `d:'. We'll also document `rt:' since there are legit messages from senders with broken clocks. Reduce indentation level of help texts to be in 2-space increments to avoid using too much horizontal space. We'll always place IMAP ahead of NNTP since it's alphabetical and there are likely more IMAP clients out there. Add "--ng NEWSGROUP" to -init instructions if configured. There are also some minor wording changes throughout.
2021-10-14 lei inspect: account for non-extindex inboxes
Inbox->xdb does not exist, but this code path was apparently never tested :x I noticed this on basic v2 inbox, but it could happen with any v1/v2 inbox. Move ->num2docid into Search so it's less awkward to use.
2021-10-12 daemon: unconditionally close Xapian shards on cleanup
The cost of opening a Xapian DB (even with shards) isn't high, so save some FDs and just close it. We hit Xapian far less than over.sqlite3 and we discard the MSet ASAP even when streaming large responses. This simplifies our code a bit and hopefully helps reduce fragmentation by increasing mortality of late allocations.
2021-10-12 search: delete QueryParser along with DB handle
Xapian::QueryParser is attached to the Xapian::Database, so holding onto the QueryParser was preventing us from releasing DB handles if a query was performed.
2021-09-26 search: avoid setting undef hashtable entries
`undef' entries still take up a slot in the hash table, and cause the `exists' check to false-positive in ->cleanup_shards. This should fully fix the (innocuous) messages introduced in commit 63d7b8ce (daemons: revamp periodic cleanup task, 2021-09-23)
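The pitfall has a close analogue in other languages. The sketch below uses a Python dict with `None` in place of a Perl hash with `undef`; the function names and the `xdb` key are hypothetical and only illustrate why an existence check differs from a truthiness check.

```python
def needs_cleanup(obj, key="xdb"):
    """Existence check, analogous to Perl's `exists $obj->{xdb}`."""
    return key in obj


def release_by_assigning_none(obj, key="xdb"):
    obj[key] = None  # the slot remains, so the key still "exists"


def release_by_deleting(obj, key="xdb"):
    obj.pop(key, None)  # the slot is removed entirely


a = {"xdb": object()}
release_by_assigning_none(a)
b = {"xdb": object()}
release_by_deleting(b)
print(needs_cleanup(a), needs_cleanup(b))
```

Assigning an empty value keeps the key present and trips the existence check, while deleting the entry does not.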
2021-09-23 daemons: revamp periodic cleanup task
Neither Inboxes nor ExtSearch objects were retrying correctly when there are live git processes, but the inboxes were getting rescanned for search or other reasons. Ensure the scan retries eventually if there are live processes. We also need to update the cleanup task to detect Xapian shard count changes, since Xapian ->reopen is enough to detect any other Xapian changes. Otherwise, we just issue an inexpensive ->reopen call and let Xapian check whether there's anything worth reopening. This also lets us eliminate the Devel::Peek dependency.
2021-09-21 search: drop reopen retry message
It's needless noise in syslogs for daemons and unnecessarily alarming to users on the command-line.
2021-09-17 search: fix rt: w/ approxidate when TZ != UTC
While git respects a user's local timezone and returns seconds-since-the-Epoch, we were unnecessarily and incorrectly calling gmtime+strftime on its result. So ignore calling gmtime+strftime when the strftime format is "%s", just feed the output time from git directly to Xapian. This is mainly for lei, which will likely run in a variety of timezones. While we're at it, add a recommendation to use TZ=UTC in public-inbox-httpd, in case there are (misguided :P) sysadmins who set a non-UTC TZ.
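The bug described here can be reproduced arithmetically. The sketch below is Python, not the project's code: `buggy_epoch` simulates breaking an epoch value down as UTC and then formatting those fields back into seconds as if they were local time, using a fixed UTC-5 offset as an assumed example timezone.

```python
from datetime import datetime, timedelta, timezone


def buggy_epoch(epoch, local_tz):
    """Break `epoch` down as UTC, then reinterpret the fields as local time."""
    utc_fields = datetime.fromtimestamp(epoch, tz=timezone.utc)
    # replace() swaps the timezone label without converting the clock fields.
    return int(utc_fields.replace(tzinfo=local_tz).timestamp())


def fixed_epoch(epoch, local_tz):
    """Epoch seconds are timezone-independent, so pass them through as-is."""
    return epoch


UTC_MINUS_5 = timezone(timedelta(hours=-5))  # assumed example offset
t = 1631836800
print(buggy_epoch(t, UTC_MINUS_5) - t, fixed_epoch(t, UTC_MINUS_5) - t)
```

The round trip is harmless when the local timezone is UTC, which is why the error only shows up with a non-UTC TZ.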
2021-07-31 extindex: -xcpdb and -compact support
Since extindex uses Xapian shards in a similar way to v2 inboxes, we'll support -xcpdb (reshard+upgrade) and -compact all the same to give admins tuning+upgrade options.
2021-06-23 search: make xap_terms easier-to-use and use it more
This allows us to simplify callers throughout, and exceptions can no longer be silently hidden. MiscSearch now uses xap_terms for looking up eidx_key terms for a code reduction. We also simplify LeiStore->_msg_kw for runtime use by moving the MsetIterator handling into t/lei_store.t test case.
2021-05-28 lei: retry_reopen on read-only Xapian access
Xapian DBs may be modified by a parallel process while we're reading it, and Xapian's MVCC model places the burden on readers to retry operations. We'll also have retry_reopen croak instead of die on errors, which ought to help us track down some "Document not found" errors I've occasionally seen when using "lei <q|up>".
2021-04-16 search: expand "d:" to "dt:" for precision with approxidate
If a user specifies "d:" with a higher precision than it was traditionally able to handle, switch transparently to "dt:". This lowers the learning curve and improves DWIM-ness. v2: fix "d:YYYYMMDD..$NEEDS_APPROXIDATE" case
2021-03-26 lei: add some labels support
"lei q" now displays labels in JSON output, "lei mark" can add or remove labels for any messages. "lei ls-label" is supported, too. Unfortunately, "lei q" won't handle "kw:" or "L:" for external messages, they must be imported, first.
2021-03-11 doc: glossary: add information for dates and timestamps
These have been confusing to me in the past, too.
2021-03-05 search: use "z:" instead of "bytes:" prefix
So far, searching by size has never been publicly documented, and IMHO, of questionable utility. In any case, "z:" is what mairix(1) uses, so it may be familiar to existing mairix users (I've never used this prefix myself). So far, this prefix is only used internally in tests and in auto-translated queries from IMAP; thus this incompatible change is unlikely to affect anyone.
2021-02-12 search: query_approxidate: cleanup regexp, more tests
The cleanup doesn't seem to matter, I initially thought I needed to handle "" (two double quotes) explicitly because that's what Xapian does to escape a double quote inside a double-quoted phrase. It turns out we only need to be able to pass phrases through to Xapian unmodified, and the existing group of ["\x{201c}\x{201d}] is sufficient for our purposes.
2021-02-11 search: disallow spaces in argv approxidate queries
This is for consistency with --stdin and WWW front ends which can't distinguish between phrase searches and prefix ranges used for d:/dt:/rt:. In any case, I expect users on the lei command-line are more likely to use `5.days.ago' instead of `"5 days ago"'
2021-02-11 search: use git approxidate in WWW and "lei q --stdin"
This greatly improves the usability of d:, dt:, and rt: search prefixes for users already familiar with git's "approxidate" feature. That is, users familiar with the --(since|after|until|before)= options in git-log(1) and similar commands will be able to use those dates in the WWW UI.
2021-02-10 search: fix argv handling of quoted phrases
This fixes both an old bug in "lei q" argv handling and one recent regression introduced with the change to use approxidate. Field prefixes are also handled correctly inside parenthesized statements when the field follows "(" without a separation character.
Fixes: fbb7ccabbf54a405 ("lei q: use git approxidate with d:, dt: and rt: ranges")
2021-02-09 www: stream mboxrd in descending docid order
Order doesn't matter when users are completely downloading mboxrds onto the FS and then opening them with an MUA. The MUA is expected to sort the results in the user's preferred order. However, lei can start streaming the results to its destination Maildir (or eventually IMAP/JMAP mailbox) with an MUA already open. This will let users see recent results sooner in their MUA, as those tend to have a higher docid. This matches the behavior of the HTML results, as well. As a bonus, this is around 5% faster in a one-off, informal test case with 66k results. I expect this to hold true in all cases since git has always optimized storage to favor recent objects.
2021-02-08 search: use one git-rev-parse process for all dates
This is necessary to avoid slowdowns with pathological cases with many dates in the query, since each rev-parse invocation takes ~5ms. This is immeasurably slower with one open-ended range, but already faster with any closed range featuring two dates which require parsing via git.
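The batching idea can be sketched as below, in Python rather than the project's Perl. It relies on the documented behavior of `git rev-parse --since=<date>`, which prints a `--max-age=<epoch>` line per date; the helper names are hypothetical, and `resolve_dates` assumes a `git` binary on PATH and one output line per input date.

```python
import subprocess


def rev_parse_argv(dates):
    """Build one git invocation covering every date string."""
    return ["git", "rev-parse"] + [f"--since={d}" for d in dates]


def parse_max_ages(output):
    """Extract epoch seconds from `--max-age=<epoch>` output lines."""
    prefix = "--max-age="
    return [int(line[len(prefix):])
            for line in output.splitlines()
            if line.startswith(prefix)]


def resolve_dates(dates):
    """Resolve all dates with a single process (requires git on PATH)."""
    proc = subprocess.run(rev_parse_argv(dates), capture_output=True,
                          text=True, check=True)
    return dict(zip(dates, parse_max_ages(proc.stdout)))


print(rev_parse_argv(["2.weeks.ago", "yesterday"]))
```

Spawning once keeps the fixed per-process cost constant no matter how many dates the query contains.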
2021-02-08 lei q: use git approxidate with d:, dt: and rt: ranges
Instead of having --(sent|received)-(before|after)=s command-line switches, we'll just try to make sense of argv so it's usable within parenthesized statements and such. Given the negligible performance penalty with Inline::C process spawning, we'll probably wire this up to the WWW interface, too. "d:" is for mairix compatibility. I don't know if "dt:" and "rt:" will be too useful, but they exist because of IMAP (and JMAP).