path: root/lib/PublicInbox/SearchIdx.pm
Date Commit message
2019-01-05 index: quiet down git-log error messages on new inboxes
The new t/*filter_rubylang.t tests call -index immediately after -init, which causes confusing messages to show up to the end user. Check the validity of the ref before calling "git-log".
2018-12-30 handle "multipart/mixed" messages which are not multipart
I've found two examples on https://lore.kernel.org/lkml/ where the messages declared themselves to be "multipart/mixed" but were actually plain text: <87llgalspt.fsf@free.fr> <200308111450.h7BEoOu20077@mail.osdl.org> With the mboxrd downloaded, mutt is able to view them without difficulty. Note: this change would require reindexing of Xapian to pick up the changes. But it's only two ancient messages, the first was resent by the original sender and the second is too old to be relevant.
2018-08-03 Merge branch 'eb/index-incremental'
Incremental indexing fixes from Eric W. Biederman. These prevent the highest message number in msgmap from being reassigned after deletes in rare cases and ensure messages are deleted from msgmap in v2.

* eb/index-incremental:
  V2Writeable.pm: In unindex_oid delete the message from msgmap
  V2Writeable.pm: Ensure that a found message number is in the msgmap
  SearchIdx,V2Writeable: Update num_highwater on optimized deletes
  t/v[12]reindex.t: Verify the num highwater is as expected
  t/v[12]reindex.t Verify num_highwater
  Msgmap.pm: Track the largest value of num ever assigned
  SearchIdx.pm: Always assign numbers backwards during incremental indexing
  t/v[12]reindex.t: Test incremental indexing works
  t/v[12]reindex.t: Test that the resulting msgmap is as expected
  t/v[12]reindex.t: Place expected second in Xapian tests
  t/v2reindex.t: Isolate the test cases more
  t/v1reindex.t: Isolate the test cases
  Import.pm: Don't assume {in} and {out} always exist
2018-08-03 SearchIdx,V2Writeable: Update num_highwater on optimized deletes
When performing an incremental index update with index_sync, if a message is seen to be both added and deleted, update the num_highwater mark even though the message is not otherwise indexed. This ensures index_sync generates the same msgmap no matter which commit it stops at during incremental syncs. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-08-03 Msgmap.pm: Track the largest value of num ever assigned
Today the only thing that prevents public-inbox from reusing the message numbers of deleted messages is the sqlite autoincrement magic, and that only works part of the time. The new incremental indexing test has revealed areas where today public-inbox does try to reuse numbers of deleted messages. Reusing the message numbers of existing messages is a problem because if a client ever sees messages that are subsequently deleted, the client will not see the new messages with their old numbers. In practice this is difficult to trigger because it requires the most recently added message to be removed and have the removal show up in a separate pull request. Still, it can happen and it should be handled. Instead of inferring the highest number ever used by finding the maximum number in the message map, track the largest number ever assigned directly. Update Msgmap to track this value and update the indexers to use this value. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
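The difference between inferring the highest number from the map and tracking it directly can be illustrated with a small sketch. This is not the Msgmap.pm code: the table layout, the `meta` row, and the helper names below are assumptions made for illustration; only the num_highwater idea comes from the message above.

```python
import sqlite3


def open_msgmap():
    """Create a throwaway in-memory message map (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE msgmap (num INTEGER PRIMARY KEY, mid TEXT UNIQUE)")
    db.execute("CREATE TABLE meta (key TEXT PRIMARY KEY, val INTEGER)")
    db.execute("INSERT INTO meta VALUES ('num_highwater', 0)")
    return db


def num_highwater(db):
    """Return the largest number ever assigned, even if it was deleted since."""
    row = db.execute("SELECT val FROM meta WHERE key = 'num_highwater'").fetchone()
    return row[0]


def mid_insert(db, mid):
    """Assign the next number above the high-water mark and record the new mark."""
    num = num_highwater(db) + 1
    db.execute("INSERT INTO msgmap (num, mid) VALUES (?, ?)", (num, mid))
    db.execute("UPDATE meta SET val = ? WHERE key = 'num_highwater'", (num,))
    return num


def mid_delete(db, mid):
    """Remove a message; the high-water mark is deliberately left untouched."""
    db.execute("DELETE FROM msgmap WHERE mid = ?", (mid,))


db = open_msgmap()
mid_insert(db, "a@example")
mid_insert(db, "b@example")
mid_delete(db, "b@example")  # the newest message is removed

# Inferring from the map would hand out 2 again; the tracked mark does not.
max_in_map = db.execute("SELECT MAX(num) FROM msgmap").fetchone()[0]
new_num = mid_insert(db, "c@example")
```

Here `max_in_map` falls back to the number of the older message once the newest one is deleted, while the tracked mark keeps the next assignment above every number a client may already have seen.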
2018-08-03 search: (really) match the behavior of WWW for indexing text
Not sure what was going through my mind when I made my first attempt at this, but we really want to make sure we index all the text we display in the web view (and presumably anything a reasonable mail client can display). Followup-to: 0cf6196025d4e4880cd1ed859257ce21dd3cdcf6 ("search: match the behavior of WWW for indexing text")
2018-08-02 SearchIdx.pm: Always assign numbers backwards during incremental indexing
When walking messages newest to oldest, assigning the larger numbers before smaller numbers ensures older messages get smaller numbers. This leads to the possibility of a msgmap that can be regenerated when needed. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
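A minimal sketch of the numbering scheme described above, assuming the number of new messages is known before the walk starts. The function name and inputs are hypothetical; `regen_down` borrows the counter name used elsewhere in this log.

```python
def assign_backwards(new_blobs_newest_first, highwater):
    """Number messages while walking them newest to oldest.

    The newest message takes the largest number, so older messages end up
    with smaller numbers regardless of the walking direction.
    """
    regen_down = highwater + len(new_blobs_newest_first)
    numbers = {}
    for blob in new_blobs_newest_first:
        numbers[blob] = regen_down
        regen_down -= 1
    return numbers


# Three new messages arrive after number 10 was the last one assigned.
numbers = assign_backwards(["blob_c", "blob_b", "blob_a"], highwater=10)
```

Because the result depends only on the count and the order of messages, rebuilding the map from the same history yields the same numbers.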
2018-07-20 v1: allow upgrading indexlevel=basic to 'medium' or 'full'
For v1 repos, we don't need to write any metadata to Xapian and changing from 'basic' to 'medium' or 'full' will work. For v2, the metadata for indexing is stored in msgmap (because the Xapian databases are partitioned for parallelism), so a reindex is required.
2018-07-19 searchidx: respect XAPIAN_FLUSH_THRESHOLD env if set
Xapian documents and respects XAPIAN_FLUSH_THRESHOLD to define the interval in documents to flush, so don't override it with our own BATCH_BYTES. This is helpful for initial indexing for those on slower storage but with enough RAM. It is unnecessary for -watch and frequent incremental indexing; and it increases transaction times if -watch is playing "catch-up" after it was stopped for a while. The original BATCH_BYTES was tuned for a machine with little memory as the default XAPIAN_FLUSH_THRESHOLD of 10000 documents was causing swap storms. Using document counts also proved an inaccurate estimator of RAM usage compared to the actual bytes processed.
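The decision described above can be sketched as follows. The helper name is hypothetical, and the BATCH_BYTES value is taken from the 2018-04-18 entry further down this log as an illustrative default, not as a statement of what the code used at this commit.

```python
import os

# Illustrative default only; see the lead-in for where this value comes from.
BATCH_BYTES = 1_000_000


def batch_limit(env=os.environ):
    """Return the byte budget per batch, or None to defer to Xapian.

    When XAPIAN_FLUSH_THRESHOLD is set, the user has asked Xapian to flush
    by document count, so no byte-based limit is imposed on top of it.
    """
    if env.get("XAPIAN_FLUSH_THRESHOLD"):
        return None
    return BATCH_BYTES


default_limit = batch_limit({})
user_limit = batch_limit({"XAPIAN_FLUSH_THRESHOLD": "50000"})
```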
2018-07-19 SearchIdx: Allow the amount of indexing be configured
This adds a new inbox configuration option 'indexlevel' that can take the values 'full', 'medium', and 'basic'. When set to 'full', everything is indexed including the positions of all terms. When set to 'medium', everything except the positions of terms is indexed. When set to 'basic', terms and positions are not indexed; only the Overview database for NNTP is created. That is still quite good and allows searching for messages by Message-ID, but there are no indexes to support searching inside the email messages themselves. Update the reindex tests to exercise the full, medium, and basic code paths. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
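The three levels can be summarized as a lookup table. This is a sketch of the description above, not the SearchIdx.pm implementation; the dictionary keys and the function name are hypothetical.

```python
# What each indexlevel builds, per the commit message above.
INDEX_LEVELS = {
    "full": {"overview": True, "terms": True, "positions": True},
    "medium": {"overview": True, "terms": True, "positions": False},
    "basic": {"overview": True, "terms": False, "positions": False},
}


def index_plan(level):
    """Return which indexes a given indexlevel builds, rejecting unknown levels."""
    try:
        return INDEX_LEVELS[level]
    except KeyError:
        raise ValueError(f"invalid indexlevel: {level!r}") from None


medium = index_plan("medium")
basic = index_plan("basic")
```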
2018-07-19 SearchIdx: Add the mechanism for making all Xapian indexing optional
Create a new method add_xapian that holds all of the code to create Xapian indexes. The creation of this method simply involved identifying the relevant code and moving it from add_message. A call is added to add_xapian from add_message to keep everything working as it currently does. The new call is made conditional upon index levels of 'full' and 'medium'. These are the index levels that index positions and terms, the two things public-inbox uses Xapian to index. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-07-19 SearchIdx.pm: Make indexing search positions optional
About half the size of the Xapian search index turns out to be search positions. The search positions are only used in a very narrow set of queries. Make the search positions optional so people don't need to pay the cost of queries they will never make. This also makes public-inbox more approachable for light hacking as generating all of the indexes is time consuming. The way this is done is to add a method to SearchIdx called index_text that wraps the call of the term generator method index_text. The new index_text method takes care of calling both index_text and increase_termpos (the two functions that are responsible for position data). Then index_users, index_diff_inc, index_old_diff_fn, index_diff, and index_body are made proper methods that call the new index_text. Callers of the new index_text are slightly simplified as they don't need to call increase_termpos as well. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
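The wrapper described above might look roughly like the sketch below. It uses a recording stand-in rather than a real Xapian TermGenerator so that it runs without Xapian installed. The class names and the `index_positions` flag are hypothetical, and falling back to index_text_without_positions when positions are disabled is an assumption, not something the message states.

```python
class RecordingTermGenerator:
    """Stand-in for a Xapian TermGenerator that only records calls."""

    def __init__(self):
        self.calls = []

    def index_text(self, text, wdf_inc=1, prefix=""):
        self.calls.append(("index_text", text, prefix))

    def index_text_without_positions(self, text, wdf_inc=1, prefix=""):
        self.calls.append(("index_text_without_positions", text, prefix))

    def increase_termpos(self, delta=100):
        self.calls.append(("increase_termpos", delta))


class SearchIdxSketch:
    """Holds the single place that decides whether positions are indexed."""

    def __init__(self, term_generator, index_positions=True):
        self.tg = term_generator
        self.index_positions = index_positions

    def index_text(self, text, wdf_inc, prefix):
        if self.index_positions:
            # Both position-related calls live here, so callers no longer
            # need to remember increase_termpos themselves.
            self.tg.index_text(text, wdf_inc, prefix)
            self.tg.increase_termpos()
        else:
            self.tg.index_text_without_positions(text, wdf_inc, prefix)


full = RecordingTermGenerator()
SearchIdxSketch(full, index_positions=True).index_text("hello world", 1, "XNQ")

light = RecordingTermGenerator()
SearchIdxSketch(light, index_positions=False).index_text("hello world", 1, "XNQ")
```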
2018-07-18 SearchIdx: Decrement regen_down even for added messages that are later deleted.
Decrement regen_down when visiting messages that appear in %D that we know will later be deleted. This ensures consistent message numbers are generated no matter which commit number is on top, allowing deletes to propagate separately from the messages they delete without causing problems. The v2 trees already do this, and when the indexes are deleted and rebuilt they maintain their commit numbers. Add a v1 version of the v2reindex test to verify that reindexing is working properly on v1 as well as v2. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-05-24 workaround Xapian OFD locks w/o close-on-exec
Xapian v1.2.21..v1.2.24 (inclusive) use OFD locks but failed to set the close-on-exec flag on those locks. So we must continue to work around those old versions by ensuring Xapian file descriptors aren't held any longer than necessary when in long-running git processes. Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
2018-05-01 searchidx: preserve umask when starting/committing transactions
Xapian will replace files upon committing, so non-parallel V2Writable users need to have umask preserved this way.
2018-04-22 extmsg: use Xapian only for partial matches
"LIKE" in SQLite (and other SQL implementations I've seen) is expensive with nearly 3 million messages in the archives. This caused some partial Message-ID lookups to take over 600ms on my workstation (~300ms on a faster Xeon). Cut that to under 30ms on average on my workstation by relying exclusively on Xapian for partial Message-ID lookups as we have in the past. Unlike in the past when we tried using Xapian to match partial Message-IDs, we now optimize our indexing of Message-IDs to break apart "words" in Message-IDs for searching, yielding (hopefully) "good enough" accuracy for folks who get long URLs broken across lines when copy+pasting. We'll also drop the (in retrospect) pointless stripping of "/[tTf]" suffixes for the partial match, since anybody who hits that codepath would be hitting an invalid message ID. Finally, limit wildcard expansion to prevent easy DoS vectors on short terms. And blame Pine and alpine for generating Message-IDs with low-entropy prefixes :P
2018-04-20 searchidx: remove leftover debugging code
I was using this to trace the path of brian's message. Fixes: 017fed7bc4d33ac4 ("searchidx: regenerate and avoid article number gaps on full index")
2018-04-20 searchidx: release lock again during v1 batch callback
Relaxing this lock during a v1 --reindex is important to keep messages showing up in -watch process in a timely manner. Looks like I deleted an extra line when doing the following for v2: s/xdb->commit_transaction/self->commit_txn_lazy/ Fixes: 35ff6bb106909b1c ("replace Xapian skeleton with SQLite overview DB")
2018-04-18 searchidx: revert default BATCH_BYTES to 1_000_000
This increases indexing time by around 10% but roughly halves memory usage of an -index process. We will probably make this tunable in the future for people with bigger/smaller machines.
2018-04-18 searchidx: increase term positions for all text terms
We do not want phrase searches to cross between independent fields (filenames/Message-ID vs bodies)
2018-04-18 searchidx: regenerate and avoid article number gaps on full index
Some messages to git@vger went missing from Msgmap from old bugs and became inaccessible via NNTP. Forcing NNTP article numbers when the overview DB came about made the problem more visible when reindexing old (v1) repositories as all removed spam messages took up AUTOINCREMENT numbers again before they were removed. Having large gaps in NNTP article numbers is not good since it throws off NNTP clients. This does NOT prevent NNTP clients from seeing some messages twice, but is better than having them miss several messages entirely. We also avoid depending on --reverse in git-log, as git requires storing an entire commit list in memory for --reverse, so it's cheaper to store only deleted blobs in the %D hash since they do not live long.
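The newest-to-oldest walk with a set of pending deletions can be sketched as below. The log entry format and the function name are hypothetical; the `deleted` set plays the role of the %D hash mentioned above.

```python
def blobs_to_index(log_newest_first):
    """Pick the blobs worth indexing from a newest-first history.

    Each entry is ("A", blob) for an addition or ("D", blob) for a deletion.
    A deletion is seen before the addition it cancels, so only blobs with a
    pending deletion need to be remembered, never the whole commit list.
    """
    deleted = set()
    keep = []
    for op, blob in log_newest_first:
        if op == "D":
            deleted.add(blob)
        elif blob in deleted:
            # This addition was removed by a newer commit: skip it and
            # forget the entry so the set stays small.
            deleted.discard(blob)
        else:
            keep.append(blob)
    return keep


# Chronologically: "a" added, "spam" added, "spam" deleted, "c" added.
history = [("A", "c"), ("D", "spam"), ("A", "spam"), ("A", "a")]
kept = blobs_to_index(history)
```

Entries leave the set as soon as their matching addition is reached, which is why the message describes them as short-lived.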
2018-04-18 search: preserve References in Xapian smsg for x=t view
I'm not sure how useful this view is, but it exists for now.
2018-04-18 v1: remove articles from overview DB
Otherwise articles show up again...
2018-04-07 store less data in the Xapian document
Since we only query the SQLite over DB for OVER/XOVER, we do not need to waste space storing the To/Cc/:bytes/:lines fields or the XNUM term. We only use From/Subject/References/Message-ID/:blob in various places of the PSGI code. For reindexing, we will take advantage of docid stability in "xapian-compact --no-renumber" to ensure duplicates do not show up in search results. Since the PSGI interface is the only consumer of Xapian at the moment, it has no need to search based on NNTP article number.
2018-04-07 over: remove forked subprocess
Since the overview stuff is a synchronization point anyways, move it into the main V2Writable process and allow us to drop a bunch of code. This is another step towards making Xapian optional for v2. In other words, the fan-out point is moved and the Xapian partitions no longer need to synchronize against each other:

Before:
                     /-------->\
                    /---------->\
     v2writable -->+----parts----> over
                    \---------->/
                     \-------->/

After:
                           /---------->
                          /----------->
    v2writable --> over-->+----parts--->
                          \----------->
                           \---------->

Since the overview/threading logic needs to run on the same core that feeds git-fast-import, it's slower for small repos but is not noticeable in large imports where I/O wait in the partitions dominates.
2018-04-06 www: favor reading more from SQLite, and less from Xapian
Favor simpler internal APIs this time around, this cuts a fair amount of code out and takes another step towards removing Xapian as a dependency for v2 repos.
2018-04-06 search: index and allow searching by date-time
Dscho found this useful for finding matching git commits based on AuthorDate in git. Add it to the overview DB format, too; so in the future we can support v2 repos without Xapian. https://public-inbox.org/git/nycvar.QRO.7.76.6.1804041821420.55@ZVAVAG-6OXH6DA.rhebcr.pbec.zvpebfbsg.pbz https://public-inbox.org/git/alpine.DEB.2.20.1702041206130.3496@virtualbox/
2018-04-04 searchidx: ensure duplicated Message-IDs can be linked together
This allows us to emulate the display of thread-aware MUAs when multiple messages share the same Message-ID. This also is a place where "public-inbox-index --reindex" is useful to fix existing messages and no schema version bump is necessary.
2018-04-02 replace Xapian skeleton with SQLite overview DB
This ought to provide better performance and scalability which is less dependent on inbox size. Xapian does not seem optimized for some queries used by the WWW homepage, Atom feeds, XOVER and NEWNEWS NNTP commands. This can actually make Xapian optional for NNTP usage, and allow more functionality to work without Xapian installed. Indexing performance was extremely bad at first, but DBI::Profile helped me optimize away problematic queries.
2018-04-01 search: reduce columns stored in Xapian
We can store :bytes and :lines in doc_data since we never sort or search by them. We don't have much use for the Date: stamp at the moment, either.
2018-03-30 searchidx: correct warning for over-vivification
We will vivify multiple ghosts if a message has multiple Message-IDs.
2018-03-30 search: move permissions handling to InboxWritable
We'll be making sure V2Writable uses this.
2018-03-29 search: move find_doc_ids to searchidx
We do not need this subroutine for read-only use in Search.pm
2018-03-27 searchidx: warn about vivifying multiple ghosts
This should help us detect bugs sooner in case we have space waste problems.
2018-03-22 v2writable: support reindexing Xapian
This still requires a msgmap.sqlite3 file to exist, but it allows us to tweak Xapian indexing rules and reindex the Xapian database online while -watch is running.
2018-03-22 use both Date: and Received: times
We want to rely on Date: to sort messages within individual threads since it keeps messages from git-send-email(1) sorted. However, since developers occasionally have the clock set wrong on their machines, sort overall messages by the newest date in a Received: header so the landing page isn't forever polluted by messages from the future. This also gives us determinism for commit times in most cases, as we'll use the Received: timestamp there, as well.
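A rough sketch of extracting the two timestamps described above, using only the Python standard library. The function name and the fallback to Date: when no Received: header parses are assumptions for illustration, not the public-inbox implementation.

```python
from datetime import timezone
from email import message_from_string
from email.utils import parsedate_to_datetime


def _to_epoch(stamp):
    """Parse an RFC 2822 date string into epoch seconds, treating naive as UTC."""
    dt = parsedate_to_datetime(stamp.strip())
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.timestamp()


def msg_times(raw):
    """Return (thread_sort_ts, overall_sort_ts) for a raw message.

    Date: orders messages within a thread; the newest Received: stamp orders
    the overall listing, so a sender's wrong clock cannot pin a message to
    the top of the landing page.
    """
    msg = message_from_string(raw)
    date_ts = _to_epoch(msg["Date"])
    received = []
    for hdr in msg.get_all("Received", []):
        # The timestamp follows the last ';' of a Received: header.
        _, sep, stamp = hdr.rpartition(";")
        if not sep:
            continue
        try:
            received.append(_to_epoch(stamp))
        except (TypeError, ValueError):
            continue  # unparseable stamp: ignore this hop
    overall_ts = max(received) if received else date_ts
    return date_ts, overall_ts


raw = (
    "Received: from a by b; Mon, 19 Mar 2018 10:00:00 +0000\n"
    "Received: from c by a; Mon, 19 Mar 2018 09:59:00 +0000\n"
    "Date: Tue, 19 Mar 2030 09:58:00 +0000\n"
    "Subject: from the future\n"
    "\n"
    "body\n"
)
thread_ts, overall_ts = msg_times(raw)
```

In this sample the Date: header claims the year 2030, but the overall sort key comes from the newest Received: header in 2018.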
2018-03-19 v2writable: allow disabling parallelization
While parallel processes improve import speed for initial imports, they are probably not necessary for daily mail imports via WatchMaildir and certainly not for public-inbox-init. Disabling them saves some memory for daily use and even helps improve readability of some subroutines by showing which methods they call remotely.
2018-03-19 Lock: new base class for writable lockers
This reduces code duplication needed for locking and hopefully makes things easier to understand.
2018-03-19 v2writable: implement remove correctly
We need to hide removals from anybody hitting the search engine.
2018-03-19 searchidx: do not delete documents while iterating
Followup-to: ebb59815035b42c2 ("searchidx: do not modify Xapian DB while iterating")
2018-03-03 v2: avoid redundant/repeated configs for git partition repos
We'll let the config of all.git dictate every other subrepo to ease maintenance and configuration. The "include" directive has been supported since git 1.7.10, so it's safe to depend on as v2 requires git 2.6.0+ anyways for "get-mark" in fast-import.
2018-03-03 searchidx: store the primary MID in doc data for NNTP
We can't rely on header order for Message-ID after all since we fall back to existing MIDs if they exist and are unseen. This lets us use SearchMsg->mid to get the MID we associated with the NNTP article number to ensure all NNTP article lookups roundtrip correctly.
2018-03-03 mid: truncate excessively long MIDs early
Since we support duplicate MIDs in v2, we can safely truncate long MID terms in the database and let other normal duplicate resolution sort it out. It seems only spammers use excessively long MIDs, and there'll always be abuse/misuse vectors for causing mis-threaded messages, so it's not worth worrying about excessively long MIDs.
2018-03-03 searchidx: add NNTP article number as a searchable term
Since we support duplicate MIDs in v2, the NNTP article number becomes the true unique identifier and we want a way to do fast lookups on it. While we're at it, stop putting XPATH in the term partitions since we only need it in the skeleton DB.
2018-03-03 searchidx: use add_boolean_term for internal terms
Aside from the Message-Id ('Q'), these terms do not appear in content and thus have no business contributing to the Xapian document length. Thanks-to Olly Betts for the tip on xapian-discuss <20180228004400.GU12724@survex.com>
2018-03-03 searchidx: avoid excessive XNQ indexing with diffs
When indexing diffs, we can avoid indexing the diff parts under XNQ and instead combine the parts in the read-only search interface. This results in better indexing performance and 10-15% smaller Xapian indices.
2018-03-03 searchidx: support indexing multiple MIDs
It's possible to have a message handle multiple terms; so use this feature to ensure messages with multiple MIDs can be found by either one.
2018-03-02 search: revert to using 'Q' as a uniQue id per-Xapian conventions
'Q' is merely a convention in the Xapian world, and is close enough to unique for practical purposes, so stop using XMID and gain a little more term length as a result.
2018-03-02 searchidx: use new `references' method for parsing References
It's shorter and more convenient, here.
2018-03-02 v2writable: deduplicate detection on add
This is a bit expensive in a multi-process situation because we need to make our indices and packs visible to the read-only pieces.