about summary refs log tree commit homepage
path: root/lib/PublicInbox/Search.pm
DateCommit message (Collapse)
2018-07-20search: use boolean prefixes for git blob queries
I've hit some case where probabilistic searches don't work when using dfpre:/dfpost:/dfblob: search prefixes because stemming in the query parser interferes. In any case, our indexing code indexes longer/unabbreviated blob names down to its 7 character abbreviation, so there should be no need to do wildcard searches on git blob names.
2018-04-23search: avoid repeated mbox results from search
Previous search queries already set sort order on the Enquire object, altering the ordering of results and was causing messages to be redundantly downloaded via POST /$INBOX/?q=$QUERY&x=m So stop caching the Search::Xapian::Enquire object since it wasn't providing any measurable performance improvement.
2018-04-22extmsg: use Xapian only for partial matches
"LIKE" in SQLite (and other SQL implementations I've seen) is expensive with nearly 3 million messages in the archives. This caused some partial Message-ID lookups to take over 600ms on my workstation (~300ms on a faster Xeon). Cut that to below under 30ms on average on my workstation by relying exclusively on Xapian for partial Message-ID lookups as we have in the past. Unlike in the past when we tried using Xapian to match partial Message-IDs; we now optimize our indexing of Message-IDs to break apart "words" in Message-IDs for searching, yielding (hopefully) "good enough" accuracy for folks who get long URLs broken across lines when copy+pasting. We'll also drop the (in retrospect) pointless stripping of "/[tTf]" suffixes for the partial match, since anybody who hits that codepath would be hitting an invalid message ID. Finally, limit wildcard expansion to prevent easy DoS vectors on short terms. And blame Pine and alpine for generating Message-IDs with low-entropy prefixes :P
2018-04-06www: favor reading more from SQLite, and less from Xapian
Favor simpler internal APIs this time around, this cuts a fair amount of code out and takes another step towards removing Xapian as a dependency for v2 repos.
2018-04-06search: index and allow searching by date-time
Dscho found this useful for finding matching git commits based on AuthorDate in git. Add it to the overview DB format, too; so in the future we can support v2 repos without Xapian. https://public-inbox.org/git/nycvar.QRO.7.76.6.1804041821420.55@ZVAVAG-6OXH6DA.rhebcr.pbec.zvpebfbsg.pbz https://public-inbox.org/git/alpine.DEB.2.20.1702041206130.3496@virtualbox/
2018-04-05mbox: do not sort search results
Sorting large msets is a waste when it comes to mboxes since MUAs should thread and sort them as the user desires. This forces us to rework each of the mbox download mechanisms to be more independent of each other, but might make things easier to reason about.
2018-04-05search: remove unnecessary OP_AND of query
This was vestigial code from the switch to the overview DB
2018-04-03mbox: remove remaining OFFSET usage in SQLite
We can use id_batch in the common case to speed up full mbox retrievals. Gigantic msets are still a problem, but will be fixed in future commits.
2018-04-03nntp: make XOVER, XHDR, OVER, HDR and NEWNEWS faster
While SQLite is faster than Xapian for some queries we use, it sucks at handling OFFSET. Fortunately, we do not need offsets when retrieving sorted results and can bake it into the query. For inbox.comp.version-control.git (v1 Xapian), XOVER and XHDR are over 20x faster.
2018-04-02www: rework query responses to avoid COUNT in SQLite
In many cases, we do not care about the total number of messages. It's a rather expensive operation in SQLite (Xapian only provides an estimate). For LKML, this brings top-level /$INBOX/ loading time from ~375ms to around 60ms on my system. Days ago, this operation was taking 800-900ms(!) for me before introducing the SQLite overview DB.
2018-04-02replace Xapian skeleton with SQLite overview DB
This ought to provide better performance and scalability which is less dependent on inbox size. Xapian does not seem optimized for some queries used by the WWW homepage, Atom feeds, XOVER and NEWNEWS NNTP commands. This can actually make Xapian optional for NNTP usage, and allow more functionality to work without Xapian installed. Indexing performance was extremely bad at first, but DBI::Profile helped me optimize away problematic queries.
2018-04-01search: reduce columns stored in Xapian
We can store :bytes and :lines in doc_data since we never sort or search by them. We don't have much use for the Date: stamp at the moment, either.
2018-03-30search: warn on reopens and die on total failure
-watch on a busy/giant Maildir caused too many Xapian errors while attempting to browse.
2018-03-29search: retry_reopen on first_smsg_by_mid
This was causing errors while attempting to load messages via the WWW interface while mass-importing LKML. While we're at it, remove unnecessary eval from lookup_article.
2018-03-29search: move find_doc_ids to searchidx
We do not need this subroutine for read-only use in Search.pm
2018-03-29search: get rid of most lookup_* subroutines
Too many similar functions doing the same basic thing was redundant and misleading, especially since Message-ID is no longer treated as a truly unique identifier. For displaying threads in the HTML, this makes it clear that we favor the primary Message-ID mapped to an NNTP article number if a message cannot be found.
2018-03-29search: cleanup uniqueness checking
The only Xapian term which should be unique is the NNTP article number; so we no longer need find_unique_doc_id.
2018-03-23search: reopen DB if each_smsg_by_mid fails
This gives more-up-to-date data in case and allows us to avoid reopening in more places ourselves.
2018-03-23www: $MESSAGE_ID/raw endpoint supports "duplicates"
Since v2 supports duplicate messages, we need to support looking up different messages with the same Message-Id. Fortunately, our "raw" endpoint has always been mboxrd, so users won't need to change their parsing tools.
2018-03-22use both Date: and Received: times
We want to rely on Date: to sort messages within individual threads since it keeps messages from git-send-email(1) sorted. However, since developers occasionally have the clock set wrong on their machines, sort overall messages by the newest date in a Received: header so the landing page isn't forever polluted by messages from the future. This also gives us determinism for commit times in most cases, as we'll used the Received: timestamp there, as well.
2018-03-19search: allow ->reopen to be chainable
Makes life a little easier for V2Writable...
2018-03-06search: each_smsg_by_mid uses skeleton if available
We do not need the large DBs for MID scans.
2018-03-05search: favor skeleton DB for lookup_mail
The skeleton DB is smaller and hit more frequently given the homepage and per-message/thread views; so it will be hotter in the page cache.
2018-03-03nntp: fix NEWNEWS command
I guess nobody uses this command (slrnpull does not), and the breakage was not noticed until I started writing new tests for multi-MID handling. Fixes: 3fc411c772a21d8f ("search: drop pointless range processors for Unix timestamp")
2018-03-03nntp: use NNTP article numbers for lookups
Since Message-IDs are no longer unique within Xapian (but are within the SQLite Msgmap); favor NNTP article numbers for internal lookups. This will prevent us from finding the "wrong" internal Message-ID.
2018-03-03searchidx: avoid excessive XNQ indexing with diffs
When indexing diffs, we can avoid indexing the diff parts under XNQ and instead combine the parts in the read-only search interface. This results in better indexing performance and 10-15% smaller Xapian indices.
2018-03-03searchidx: support indexing multiple MIDs
It's possible to have a message handle multiple terms; so use this feature to ensure messages with multiple MIDs can be found by either one.
2018-03-02search: revert to using 'Q' as a uniQue id per-Xapian conventions
'Q' is merely a convention in the Xapian world, and is close enough to unique for practical purposes, so stop using XMID and gain a little more term length as a result.
2018-03-02v2writable: deduplicate detection on add
This is a bit expensive in a multi-process situation because we need to make our indices and packs visible to the read-only pieces.
2018-03-02search: remove informational "warning" message
It was making imports too noisy.
2018-02-28search: query_xover uses skeleton DB iff available
The skeleton DB is where we store all the information needed for NNTP overviews via XOVER. This seems to be the only change necessary (besides eventually handling duplicates) necessary to support our nntpd interface for v2 repositories.
2018-02-28rename SearchIdxThread to SearchIdxSkeleton
Interchangably using "all", "skel", "threader", etc. were confusing. Standardize on the "skeleton" term to describe this class since it's also used for retrieval of basic headers.
2018-02-28search: use different Enquire object for skeleton queries
A different Xapian DB requires the use of a different Enquire object. This is necessary for get_thread and thread skeleton to work in the PSGI UI.
2018-02-28search: reopen skeleton DB as well
Any Xapian DB is subject to the same errors and retries. Perhaps in the future this can made more granular to avoid unnecessary reopens.
2018-02-28v2/ui: retry DB reopens in a few more places
Relying more on Xapian requires retrying reopens in more places to ensure it does not fall down and show errors to the user.
2018-02-28v2/ui: some hacky things to get the PSGI UI to show up
Fortunately, Xapian multiple database support makes things easier but we still need to handle the skeleton DB separately.
2018-02-22v2: parallelize Xapian indexing
The parallelization requires splitting Msgmap, text+term indexing, and thread-linking out into separate processes. git-fast-import is fast, so we don't bother parallelizing it. Msgmap (SQLite) and thread-linking (Xapian) must be serialized because they rely on monotonically increasing numbers (NNTP article number and internal thread_id, respectively). We handle msgmap in the main process which drives fast-import. When the article number is retrieved/generated, we write the entire message to per-partition subprocesses via pipes for expensive text+term indexing. When these per-partition subprocesses are done with the expensive text+term indexing, they write SearchMsg (small data) to a shared pipe (inherited from the main V2Writable process) back to the threader, which runs its own subprocess. The number of text+term Xapian partitions is chosen at import and can be made equal to the number of cores in a machine. V2Writable --> Import -> git-fast-import \-> SearchIdxThread -> Msgmap (synchronous) \-> SearchIdxPart[n] -> SearchIdx[*] \-> SearchIdxThread -> SearchIdx ("threader", a subprocess) [* ] each subprocess writes to threader
2018-02-20v2: support Xapian + SQLite indexing
This is too slow, currently. Working with only 2017 LKML archives: git-only: ~1 minute git + SQLite: ~12 minutes git+Xapian+SQlite: ~45 minutes So yes, it looks like we'll need to parallelize Xapian indexing, at least.
2018-02-16search: stop assuming Message-ID is unique
In general, they are, but there's no way for or general purpose mail server to enforce that. This is a step in allowing us to handle more corner cases which existing lists throw at us.
2018-02-14search: free up 'Q' prefix for a real unique identifier
This will allow easier-compatibility with v2 code which will introduce content_id as the unique identifier. The old "XMID" becomes "XM" as a free text searchable term. "Q" becomes "XMID" as a boolean prefix. There's no user-visible changes in this, but there needs to be a schema version bump later on... (more changes planned which can affect v1)
2018-02-07update copyrights for 2018
Using update-copyrights from gnulib While we're at it, use the SPDX identifier for AGPL-3.0+ to ease mechanical processing.
2017-10-03search: try to fill in ghosts when generating thread skeleton
Since we attempt to fill in threads by Subject, our thread skeletons can cross actual thread IDs, leading to the possibility of false ghosts showing up in the skeleton. Try to fill in the ghosts as well as possible by performing a message lookup.
2017-06-14search: allow searching within mail diffs
This can be tied into a repository browser to browse in-flight topics on a mailing list.
2017-06-14search: remove unnecessary abstractions and functionality
This simplifies the code a bit and reduces the translation overhead for looking directly at data from tools shipped with Xapian. While we're at it, fix thread-all.t :)
2017-05-07searchidx: fix ghost root vivification
Due to the asynchronous nature of SMTP, it is possible for the root message of a thread (with no References/In-Reply-To) to arrive last in a series. We must preserve the thread_id of the ghost message in this case, as we do when vivifiying non-root ghosts. Otherwise, this causes threads to be broken when the root arrives last.
2017-04-11search: fix help message for searching within quotes
I'm not sure if people use either and it's not in mairix (where we base our abbreviations off of). Lets go with the shorter prefix since it's easier-to-type.
2017-02-06search: schema version bump for empty References/In-Reply-To
We cannot distinguish between legitimate ghosts and mis-threaded messages before commit 83425ef12e4b65cdcecd11ddcb38175d4a91d5a0 ("searchidx: deal with empty In-Reply-To and References headers") so we must rebuild the index in parallel to fix it.
2017-01-10introduce PublicInbox::MIME wrapper class
This should fix problems with multipart messages where text/plain parts lack a header. cf. git clone --mirror https://github.com/rjbs/Email-MIME.git refs/pull/28/head In the future, we may still introduce as streaming interface to reduce memory usage on large emails.
2017-01-07search: remove subject_summary
Apparently it never actually got used, and the world seems fine without it, so we can drop it. While we're at it, consider removing our subject_path usage from existence, too. We are not using fancy subject-line based URLs, here.
2017-01-07searchmsg: favor direct hash access over accessor methods
This is faster, smaller, and more straighforward to me with fewer layers of indirection.