about summary refs log tree commit homepage
path: root/lib/PublicInbox/Search.pm
DateCommit message (Collapse)
2019-09-09run update-copyrights from gnulib for 2019
2019-06-14search: use "shard" for local variable
Another small step towards terminology consistency with Xapian.
2019-06-14search*: rename {partition} => {shard}
Another step towards keeping our internal data structures consistent with Xapian naming.
2019-06-14search: require PublicInbox::Inbox ref here
No sense in supporting multiple methods of initialization for an internal class.
2019-06-04require ASCII digits for local FS items
In case some BOFH decides to randomly create directories using non-ASCII digits all over the place.
2019-05-24search: don't log all warnings on retry_reopen
Some users (or bots :P) can trigger horrible queries which the caller can choose to either log or ignore. This prevents horrible queries from ExtMsg from logging confusing "ref: " messages when $@ is not a Perl reference.
2019-05-23search: reenable phrase search on non-chert Xapian
This is assuming nobody uses flint or earlier, anymore; as flint predates the existence of this project.
2019-05-21Merge remote-tracking branch 'origin/xap-optional' into master
* origin/xap-optional: admin: improve warnings and errors for missing modules searchidx: do not create empty Xapian partitions for basic lazy load Xapian and make it optional for v2 www: use Inbox->over where appropriate nntp: use Inbox->over directly inbox: add ->over method to ease access
2019-05-16search: disable phrase searching, for now
There probably needs to be an option to enable this independently of indexlevel; but for now this is the safest option. And, as I discovered during the development of the indexlevel option, Xapian does a pretty good job of finding phrases without position data, anyways.
2019-05-15lazy load Xapian and make it optional for v2
More tests work without Search::Xapian, now. Usability issues still need to be fixed
2019-05-15www: use Inbox->over where appropriate
We don't need to rely on Xapian search functionality for the majority of the WWW code, even. subject_normalized is moved to SearchMsg, where it (probably) makes more sense, anyways.
2019-05-15nntp: use Inbox->over directly
None of the NNTP code actually relies on Xapian, anymore.
2018-07-20search: use boolean prefixes for git blob queries
I've hit some case where probabilistic searches don't work when using dfpre:/dfpost:/dfblob: search prefixes because stemming in the query parser interferes. In any case, our indexing code indexes longer/unabbreviated blob names down to its 7 character abbreviation, so there should be no need to do wildcard searches on git blob names.
2018-04-23search: avoid repeated mbox results from search
Previous search queries already set sort order on the Enquire object, altering the ordering of results and was causing messages to be redundantly downloaded via POST /$INBOX/?q=$QUERY&x=m So stop caching the Search::Xapian::Enquire object since it wasn't providing any measurable performance improvement.
2018-04-22extmsg: use Xapian only for partial matches
"LIKE" in SQLite (and other SQL implementations I've seen) is expensive with nearly 3 million messages in the archives. This caused some partial Message-ID lookups to take over 600ms on my workstation (~300ms on a faster Xeon). Cut that to below under 30ms on average on my workstation by relying exclusively on Xapian for partial Message-ID lookups as we have in the past. Unlike in the past when we tried using Xapian to match partial Message-IDs; we now optimize our indexing of Message-IDs to break apart "words" in Message-IDs for searching, yielding (hopefully) "good enough" accuracy for folks who get long URLs broken across lines when copy+pasting. We'll also drop the (in retrospect) pointless stripping of "/[tTf]" suffixes for the partial match, since anybody who hits that codepath would be hitting an invalid message ID. Finally, limit wildcard expansion to prevent easy DoS vectors on short terms. And blame Pine and alpine for generating Message-IDs with low-entropy prefixes :P
2018-04-06www: favor reading more from SQLite, and less from Xapian
Favor simpler internal APIs this time around, this cuts a fair amount of code out and takes another step towards removing Xapian as a dependency for v2 repos.
2018-04-06search: index and allow searching by date-time
Dscho found this useful for finding matching git commits based on AuthorDate in git. Add it to the overview DB format, too; so in the future we can support v2 repos without Xapian. https://public-inbox.org/git/nycvar.QRO.7.76.6.1804041821420.55@ZVAVAG-6OXH6DA.rhebcr.pbec.zvpebfbsg.pbz https://public-inbox.org/git/alpine.DEB.2.20.1702041206130.3496@virtualbox/
2018-04-05mbox: do not sort search results
Sorting large msets is a waste when it comes to mboxes since MUAs should thread and sort them as the user desires. This forces us to rework each of the mbox download mechanisms to be more independent of each other, but might make things easier to reason about.
2018-04-05search: remove unnecessary OP_AND of query
This was vestigial code from the switch to the overview DB
2018-04-03mbox: remove remaining OFFSET usage in SQLite
We can use id_batch in the common case to speed up full mbox retrievals. Gigantic msets are still a problem, but will be fixed in future commits.
2018-04-03nntp: make XOVER, XHDR, OVER, HDR and NEWNEWS faster
While SQLite is faster than Xapian for some queries we use, it sucks at handling OFFSET. Fortunately, we do not need offsets when retrieving sorted results and can bake it into the query. For inbox.comp.version-control.git (v1 Xapian), XOVER and XHDR are over 20x faster.
2018-04-02www: rework query responses to avoid COUNT in SQLite
In many cases, we do not care about the total number of messages. It's a rather expensive operation in SQLite (Xapian only provides an estimate). For LKML, this brings top-level /$INBOX/ loading time from ~375ms to around 60ms on my system. Days ago, this operation was taking 800-900ms(!) for me before introducing the SQLite overview DB.
2018-04-02replace Xapian skeleton with SQLite overview DB
This ought to provide better performance and scalability which is less dependent on inbox size. Xapian does not seem optimized for some queries used by the WWW homepage, Atom feeds, XOVER and NEWNEWS NNTP commands. This can actually make Xapian optional for NNTP usage, and allow more functionality to work without Xapian installed. Indexing performance was extremely bad at first, but DBI::Profile helped me optimize away problematic queries.
2018-04-01search: reduce columns stored in Xapian
We can store :bytes and :lines in doc_data since we never sort or search by them. We don't have much use for the Date: stamp at the moment, either.
2018-03-30search: warn on reopens and die on total failure
-watch on a busy/giant Maildir caused too many Xapian errors while attempting to browse.
2018-03-29search: retry_reopen on first_smsg_by_mid
This was causing errors while attempting to load messages via the WWW interface while mass-importing LKML. While we're at it, remove unnecessary eval from lookup_article.
2018-03-29search: move find_doc_ids to searchidx
We do not need this subroutine for read-only use in Search.pm
2018-03-29search: get rid of most lookup_* subroutines
Too many similar functions doing the same basic thing was redundant and misleading, especially since Message-ID is no longer treated as a truly unique identifier. For displaying threads in the HTML, this makes it clear that we favor the primary Message-ID mapped to an NNTP article number if a message cannot be found.
2018-03-29search: cleanup uniqueness checking
The only Xapian term which should be unique is the NNTP article number; so we no longer need find_unique_doc_id.
2018-03-23search: reopen DB if each_smsg_by_mid fails
This gives more-up-to-date data in case and allows us to avoid reopening in more places ourselves.
2018-03-23www: $MESSAGE_ID/raw endpoint supports "duplicates"
Since v2 supports duplicate messages, we need to support looking up different messages with the same Message-Id. Fortunately, our "raw" endpoint has always been mboxrd, so users won't need to change their parsing tools.
2018-03-22use both Date: and Received: times
We want to rely on Date: to sort messages within individual threads since it keeps messages from git-send-email(1) sorted. However, since developers occasionally have the clock set wrong on their machines, sort overall messages by the newest date in a Received: header so the landing page isn't forever polluted by messages from the future. This also gives us determinism for commit times in most cases, as we'll used the Received: timestamp there, as well.
2018-03-19search: allow ->reopen to be chainable
Makes life a little easier for V2Writable...
2018-03-06search: each_smsg_by_mid uses skeleton if available
We do not need the large DBs for MID scans.
2018-03-05search: favor skeleton DB for lookup_mail
The skeleton DB is smaller and hit more frequently given the homepage and per-message/thread views; so it will be hotter in the page cache.
2018-03-03nntp: fix NEWNEWS command
I guess nobody uses this command (slrnpull does not), and the breakage was not noticed until I started writing new tests for multi-MID handling. Fixes: 3fc411c772a21d8f ("search: drop pointless range processors for Unix timestamp")
2018-03-03nntp: use NNTP article numbers for lookups
Since Message-IDs are no longer unique within Xapian (but are within the SQLite Msgmap); favor NNTP article numbers for internal lookups. This will prevent us from finding the "wrong" internal Message-ID.
2018-03-03searchidx: avoid excessive XNQ indexing with diffs
When indexing diffs, we can avoid indexing the diff parts under XNQ and instead combine the parts in the read-only search interface. This results in better indexing performance and 10-15% smaller Xapian indices.
2018-03-03searchidx: support indexing multiple MIDs
It's possible to have a message handle multiple terms; so use this feature to ensure messages with multiple MIDs can be found by either one.
2018-03-02search: revert to using 'Q' as a uniQue id per-Xapian conventions
'Q' is merely a convention in the Xapian world, and is close enough to unique for practical purposes, so stop using XMID and gain a little more term length as a result.
2018-03-02v2writable: deduplicate detection on add
This is a bit expensive in a multi-process situation because we need to make our indices and packs visible to the read-only pieces.
2018-03-02search: remove informational "warning" message
It was making imports too noisy.
2018-02-28search: query_xover uses skeleton DB iff available
The skeleton DB is where we store all the information needed for NNTP overviews via XOVER. This seems to be the only change necessary (besides eventually handling duplicates) necessary to support our nntpd interface for v2 repositories.
2018-02-28rename SearchIdxThread to SearchIdxSkeleton
Interchangably using "all", "skel", "threader", etc. were confusing. Standardize on the "skeleton" term to describe this class since it's also used for retrieval of basic headers.
2018-02-28search: use different Enquire object for skeleton queries
A different Xapian DB requires the use of a different Enquire object. This is necessary for get_thread and thread skeleton to work in the PSGI UI.
2018-02-28search: reopen skeleton DB as well
Any Xapian DB is subject to the same errors and retries. Perhaps in the future this can made more granular to avoid unnecessary reopens.
2018-02-28v2/ui: retry DB reopens in a few more places
Relying more on Xapian requires retrying reopens in more places to ensure it does not fall down and show errors to the user.
2018-02-28v2/ui: some hacky things to get the PSGI UI to show up
Fortunately, Xapian multiple database support makes things easier but we still need to handle the skeleton DB separately.
2018-02-22v2: parallelize Xapian indexing
The parallelization requires splitting Msgmap, text+term indexing, and thread-linking out into separate processes. git-fast-import is fast, so we don't bother parallelizing it. Msgmap (SQLite) and thread-linking (Xapian) must be serialized because they rely on monotonically increasing numbers (NNTP article number and internal thread_id, respectively). We handle msgmap in the main process which drives fast-import. When the article number is retrieved/generated, we write the entire message to per-partition subprocesses via pipes for expensive text+term indexing. When these per-partition subprocesses are done with the expensive text+term indexing, they write SearchMsg (small data) to a shared pipe (inherited from the main V2Writable process) back to the threader, which runs its own subprocess. The number of text+term Xapian partitions is chosen at import and can be made equal to the number of cores in a machine. V2Writable --> Import -> git-fast-import \-> SearchIdxThread -> Msgmap (synchronous) \-> SearchIdxPart[n] -> SearchIdx[*] \-> SearchIdxThread -> SearchIdx ("threader", a subprocess) [* ] each subprocess writes to threader
2018-02-20v2: support Xapian + SQLite indexing
This is too slow, currently. Working with only 2017 LKML archives: git-only: ~1 minute git + SQLite: ~12 minutes git+Xapian+SQlite: ~45 minutes So yes, it looks like we'll need to parallelize Xapian indexing, at least.