path: root/lib/PublicInbox/SearchIdx.pm
Date Commit message
2020-01-04 searchidx: remove_message: pedantic fix for v1
It shouldn't be possible for v1 inboxes to have multiple matches for a given Message-ID, so the sub would only get called once, but strange things could happen in 2112 :>
2020-01-04 searchidx: index_text: use Xapian parameter names
Use the parameter names from the Search::Xapian::TermGenerator manpage for our local variables instead of confusing names...
2020-01-04 searchidx: simplify quote-splitting in index_body
We now use the same regexp View::add_text_body uses.
2020-01-04 searchidx: add_message: fix and make use of prototypes
Procedural function calls allow prototype checking, and our add_message prototype was totally wrong to begin with. Convert most of the "$self->index_*" calls to "index_*($self". While we're at it, use "//=" to avoid some "unless" statements.
2020-01-04 searchidx: split off index_xapian for msg_iter
This ought to save some memory, but it's probably lost in the noise given the cost of indexing. Regardless it still reduces the indentation level and makes future changes easier to read.
2020-01-04 searchidx: index_diff: allow /^$/ line as diff context
As discovered by solver bug hunting, "git apply" also handles the case where blank lines w/o leading space are treated as diff context, apparently because GNU diff once did it: https://public-inbox.org/git/b507b465f7831612b9d9fc643e3e5218b64e5bfa/s/
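As a rough illustration of the rule this commit describes, the Python sketch below classifies lines from inside a unified-diff hunk and accepts a completely empty line as context. The function name and the returned labels are hypothetical; this is not the SearchIdx.pm code (which is Perl), only a minimal restatement of the idea.

```python
def classify_hunk_line(line: str) -> str:
    """Classify one line taken from inside a unified-diff hunk body.

    `line` is expected without its trailing newline. An empty line is
    treated as context, mirroring the behavior described for "git apply".
    """
    if line == "" or line.startswith(" "):
        return "context"
    if line.startswith("+"):
        return "add"
    if line.startswith("-"):
        return "del"
    if line.startswith("\\"):
        # e.g. "\ No newline at end of file"
        return "meta"
    return "other"
```

With this rule, a hunk whose blank context lines lost their leading space (as some mail clients or old diff tools may produce) is still walked as a single hunk instead of ending at the first empty line.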
2019-12-24 search: support SWIG-generated Xapian.pm
Xapian upstream is slowly phasing out the XS-based Search::Xapian in favor of the SWIG-generated "Xapian" package. While both Debian and FreeBSD have Search::Xapian, OpenBSD only includes the "Xapian" binding. More information about the status of the "Xapian" Perl module here: https://trac.xapian.org/ticket/523
2019-12-24 searchidx: call "++" on PostingIterator instead of "->inc"
The "->inc" method is not yet available in the SWIG-based "Xapian.pm" Perl bindings, so use "++", which is supported in both the XS (Search::Xapian) and SWIG-based Xapian bindings.
2019-12-15 searchidx: do not modify read-only $1 via git_unquote
git_unquote works in-place, and we sometimes see strange filenames, or badly munged diffs with terminal escape characters (for colorization) end up in emails.
2019-11-04 index: "git log" failures are fatal
While I've never seen "git log" fail on its own, it could happen one day and we should be prepared to abort indexing when it happens. Beef up tests for t/spawn.t to ensure close() behaves on popen_rd the way we expect it to.
2019-10-28 index: allow search/lookups on X-Alt-Message-ID
Since we replace extra Message-ID headers with X-Alt-Message-ID to placate NNTP clients, we should allow searching and indexing on X-Alt-Message-ID just like we do with Message-ID.
2019-10-16 config: support "inboxdir" in addition to "mainrepo"
"mainrepo" was a bad name and artifact from the early days when I intended for there to be a "spamrepo" (now just the ENV{PI_EMERGENCY} Maildir). With v2, "mainrepo" can be especially confusing, since v2 needs at least two git repositories (epoch + all.git) to function and we shouldn't confuse users by having them point to a git repository for v2. Much of our documentation already references "INBOX_DIR" for command-line arguments, so use "inboxdir" as the git-config(1)-friendly variant for that. "mainrepo" remains supported indefinitely for compatibility. Users may need to revert to old versions, or may be referring to old documentation and must not be forced to change config files to account for this change. So if you're using "mainrepo" today, I do NOT recommend changing it right away because other bugs can lurk. Link: https://public-inbox.org/meta/874l0ice8v.fsf@alyssa.is/
2019-09-09 run update-copyrights from gnulib for 2019
2019-06-15 comments: replace "partition" with "shard"
Now that the code matches Xapian terminology, ensure our comments match, too.
2019-06-14 search*: rename {partition} => {shard}
Another step towards keeping our internal data structures consistent with Xapian naming.
2019-06-14 searchidx: require PublicInbox::Inbox (or InboxWritable) ref
PublicInbox::Inbox objects have minimal dependencies, so drop code to support old tests which existed before the PublicInbox::Inbox object came into existence.
2019-06-12 searchidx: improve error message when Xapian fails
Make it easier to detect if a partition is corrupt.
2019-05-29 searchidx: store indexlevel=medium as metadata
And use it from Admin. It's easy to tell what indexlevel=basic is from unconfigured inboxes, but distinguishing between 'medium' and 'full' would require stat()-ing position.* files which is fragile and Xapian-implementation-dependent. So use the metadata facility of Xapian and store it in the main partition so Admin tools can deal better with unconfigured inboxes copied using generic tools like cp(1) or rsync(1).
2019-05-27 searchidx: fix obvious typo
We can't pass an empty string to `git merge-base --is-ancestor'. AFAIK, this did NOT present issues in the current test suite.
2019-05-23 xcpdb: show re-indexing progress
Emit information about reindexing git revision ranges when used with xcpdb. Additionally, distinguish Xapian copy output from v2 git epoch counting by increasing directory context info. For now, v1 batches are emitted. v2 indexing is still missing progress reporting for batches, as the data structures for reindexing would benefit from a refactoring, first. This does not currently affect the use of public-inbox-index, but may in the future.
2019-05-23 xcpdb: use fine-grained locking
Copying an entire Xapian DB takes a long time, so update our reindexing code to support partial reindexing, snapshot the pre-copydatabase git revisions, perform the lengthy copy, and do a partial reindex when the copy + renames are done.
2019-05-21 Merge remote-tracking branch 'origin/xap-optional' into master
* origin/xap-optional:
  admin: improve warnings and errors for missing modules
  searchidx: do not create empty Xapian partitions for basic
  lazy load Xapian and make it optional for v2
  www: use Inbox->over where appropriate
  nntp: use Inbox->over directly
  inbox: add ->over method to ease access
2019-05-21 searchidx: remove unused Compress::Zlib import
We only compress in OverIdx now, since we no longer do overview stuff in Xapian (and Xapian compresses document data, anyways).
2019-05-15 searchidx: do not create empty Xapian partitions for basic
No point in leaving a mess of empty directories when Xapian doesn't load.
2019-05-15 lazy load Xapian and make it optional for v2
More tests work without Search::Xapian, now. Usability issues still need to be fixed.
2019-05-14 searchidx: fix incremental index with indexlevel=basic on v1
We were reindexing the full history every invocation of -index when Xapian was not used because we were incorrectly relying on 'last_commit' metadata stored in Xapian. Rewrite the indexing logic to be less confusing while we're at it, since we rely on `git merge-base --is-ancestor' nowadays. Furthermore, we need to handle message removals from the overview index correctly when Xapian is not in use. Co-authored-by: Eric W. Biederman <ebiederm@xmission.com>
2019-01-15 searchidx: move git_unquote to PublicInbox::Git
We'll be using it outside of searchidx...
2019-01-10 Merge commit 'mem'
* commit 'mem':
  view: more culling for search threads
  over: cull unneeded fields for get_thread
  searchmsg: remove unused fields for PSGI in Xapian results
  searchview: drop unused {seen} hashref
  searchmsg: remove Xapian::Document field
  searchmsg: get rid of termlist scanning for mid
  httpd: remove psgix.harakiri reference
2019-01-10 check git version requirements
This allows v1 tests to continue working on git 1.8.0 for now. This allows git 2.1.4 packaged with Debian 8 ("jessie") to run old tests, at least. I suppose it's safe to drop Debian 7 ("wheezy") due to our dependency on git 1.8.0 for "merge-base --is-ancestor". Writing V2 repositories requires git 2.6 for "get-mark" support, so mask out tests for older gits.
2019-01-08 searchmsg: remove Xapian::Document field
We don't need to be carrying this around with the many SearchMsg objects we have. This saves about 20K from a large SearchView "&x=t" response.
2019-01-05 index: quiet down git-log error messages on new inboxes
The new t/*filter_rubylang.t tests call -index immediately after -init, which causes confusing messages to show up to the end user. Check the validity of the ref before calling "git-log".
2018-12-30 handle "multipart/mixed" messages which are not multipart
I've found two examples on https://lore.kernel.org/lkml/ where the messages declared themselves to be "multipart/mixed" but were actually plain text: <87llgalspt.fsf@free.fr> <200308111450.h7BEoOu20077@mail.osdl.org> With the mboxrd downloaded, mutt is able to view them without difficulty. Note: this change would require reindexing of Xapian to pick up the changes. But it's only two ancient messages, the first was resent by the original sender and the second is too old to be relevant.
2018-08-03 Merge branch 'eb/index-incremental'
Incremental indexing fixes from Eric W. Biederman. These prevent the highest message number in msgmap from being reassigned after deletes in rare cases and ensure messages are deleted from msgmap in v2.
* eb/index-incremental:
  V2Writeable.pm: In unindex_oid delete the message from msgmap
  V2Writeable.pm: Ensure that a found message number is in the msgmap
  SearchIdx,V2Writeable: Update num_highwater on optimized deletes
  t/v[12]reindex.t: Verify the num highwater is as expected
  t/v[12]reindex.t Verify num_highwater
  Msgmap.pm: Track the largest value of num ever assigned
  SearchIdx.pm: Always assign numbers backwards during incremental indexing
  t/v[12]reindex.t: Test incremental indexing works
  t/v[12]reindex.t: Test that the resulting msgmap is as expected
  t/v[12]reindex.t: Place expected second in Xapian tests
  t/v2reindex.t: Isolate the test cases more
  t/v1reindex.t: Isolate the test cases
  Import.pm: Don't assume {in} and {out} always exist
2018-08-03 SearchIdx,V2Writeable: Update num_highwater on optimized deletes
When performing an incremental index update with index_sync, if a message is seen to be both added and deleted, update the num_highwater mark even though the message is not otherwise indexed. This ensures index_sync generates the same msgmap no matter which commit it stops at during incremental syncs. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-08-03 Msgmap.pm: Track the largest value of num ever assigned
Today the only thing that prevents public-inbox from reusing the message numbers of deleted messages is the sqlite autoincrement magic, and that only works part of the time. The new incremental indexing test has revealed areas where today public-inbox does try to reuse numbers of deleted messages. Reusing the message numbers of existing messages is a problem because if a client ever sees messages that are subsequently deleted the client will not see the new messages with their old numbers. In practice this is difficult to trigger because it requires the most recently added message to be removed and have the removal show up in a separate pull request. Still it can happen and it should be handled. Instead of inferring the highest number ever used by finding the maximum number in the message map, track the largest number ever assigned directly. Update Msgmap to track this value and update the indexers to use this value. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
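To make the difference between inferring and tracking concrete, here is a small Python sketch. The class and its method names are hypothetical and use an in-memory dict rather than SQLite; it only contrasts "max of what is currently stored" with "largest number ever assigned", which is the distinction this commit message draws.

```python
class TinyMsgmap:
    """Toy message map: Message-ID -> article number (illustration only)."""

    def __init__(self):
        self.num_by_mid = {}
        self.highwater = 0  # largest number ever assigned, never decreases

    def add(self, mid):
        # Assign from the tracked high-water mark, not from current contents.
        self.highwater += 1
        self.num_by_mid[mid] = self.highwater
        return self.highwater

    def delete(self, mid):
        # Deleting a message does not lower the high-water mark.
        self.num_by_mid.pop(mid, None)

    def inferred_next(self):
        # The fragile approach: derive the next number from what is stored.
        return max(self.num_by_mid.values(), default=0) + 1
```

If the most recently added message is deleted, `inferred_next()` hands out that message's old number again, while `add()` keeps counting upward from the high-water mark.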
2018-08-03 search: (really) match the behavior of WWW for indexing text
Not sure what was going through my mind when I made my first attempt at this, but we really want to make sure we index all the text we display in the web view (and presumably anything a reasonable mail client can display). Followup-to: 0cf6196025d4e4880cd1ed859257ce21dd3cdcf6 ("search: match the behavior of WWW for indexing text")
2018-08-02 SearchIdx.pm: Always assign numbers backwards during incremental indexing
When walking messages newest to oldest, assigning the larger numbers before smaller numbers ensures older messages get smaller numbers. This leads to the possibility of a msgmap that can be regenerated when needed. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
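A minimal Python sketch of the numbering scheme described here, assuming the caller already knows the previous high-water mark and has the new messages in newest-first order. The function and parameter names are hypothetical, not taken from SearchIdx.pm.

```python
def assign_numbers_backwards(newest_first, prev_highwater):
    """Number messages while walking history newest to oldest.

    The newest message receives the largest number and the counter
    decrements from there, so older messages end up with smaller numbers.
    """
    num = prev_highwater + len(newest_first)
    assigned = {}
    for mid in newest_first:
        assigned[mid] = num
        num -= 1
    return assigned
```

Because the result depends only on the order of messages in history and the starting high-water mark, running the same walk again yields the same mapping, which is what makes the msgmap reproducible.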
2018-07-20 v1: allow upgrading indexlevel=basic to 'medium' or 'full'
For v1 repos, we don't need to write any metadata to Xapian and changing from 'basic' to 'medium' or 'full' will work. For v2, the metadata for indexing is stored in msgmap (because the Xapian databases are partitioned for parallelism), so a reindex is required.
2018-07-19 searchidx: respect XAPIAN_FLUSH_THRESHOLD env if set
Xapian documents and respects XAPIAN_FLUSH_THRESHOLD to define the interval in documents to flush, so don't override it with our own BATCH_BYTES. This is helpful for initial indexing for those on slower storage but enough RAM. It is unnecessary for -watch and frequent incremental indexing; and it increases transaction times if -watch is playing "catch-up" if it was stopped for a while. The original BATCH_BYTES was tuned for a machine with little memory as the default XAPIAN_FLUSH_THRESHOLD of 10000 documents was causing swap storms. Using document counts also proved an inaccurate estimator of RAM usage compared to the actual bytes processed.
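The decision described above can be sketched in a few lines of Python. This is an assumption-laden illustration: the helper name and the use of `None` to mean "no byte-based limit" are hypothetical, and the 1_000_000 default is simply the BATCH_BYTES value mentioned elsewhere in this log.

```python
import os

BATCH_BYTES = 1_000_000  # byte-based batch size used when the env var is unset


def batch_bytes_limit(env=os.environ):
    """Return the byte limit per batch, or None to defer to Xapian.

    When XAPIAN_FLUSH_THRESHOLD is set, Xapian itself flushes after that
    many documents, so the indexer's own byte-based batching steps aside.
    """
    if "XAPIAN_FLUSH_THRESHOLD" in env:
        return None
    return BATCH_BYTES
```

A user with plenty of RAM and slow storage could then export XAPIAN_FLUSH_THRESHOLD before an initial index run and leave it unset for -watch or frequent incremental runs.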
2018-07-19 SearchIdx: Allow the amount of indexing be configured
This adds a new inbox configuration option 'indexlevel' that can take the values 'full', 'medium', and 'basic'. When set to 'full' everything is indexed including the positions of all terms. When set to 'medium' everything except the positions of terms is indexed. When set to 'basic' terms and positions are not indexed. Just the Overview database for NNTP is created. Which is still quite good and allows searching for messages by Message-ID. But there are no indexes to support searching inside the email messages themselves. Update the reindex tests to exercise the full, medium, and basic code paths. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
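The three levels can be summarized as a small lookup table. The Python below is only a restatement of the paragraph above; the dictionary layout and function name are hypothetical and do not reflect how the Perl code stores this.

```python
# What each indexlevel asks Xapian to index, per the commit message.
INDEX_LEVELS = {
    "full": {"terms": True, "positions": True},
    "medium": {"terms": True, "positions": False},
    "basic": {"terms": False, "positions": False},  # Overview DB only
}


def xapian_work_for(level):
    """Return which kinds of Xapian data a given indexlevel produces."""
    try:
        return INDEX_LEVELS[level]
    except KeyError:
        raise ValueError(f"unknown indexlevel: {level!r}") from None
```

Under 'basic', lookups by Message-ID still work through the Overview database, but nothing in this table is indexed for searching inside message text.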
2018-07-19 SearchIdx: Add the mechanism for making all Xapian indexing optional
Create a new method add_xapian that holds all of the code to create Xapian indexes. The creation of this method simply involved identifying the relevant code and moving it from add_message. A call is added to add_xapian from add_message to keep everything working as it currently does. The new call is made conditional upon index levels of 'full' and 'medium', the index levels that index positions and terms, the two things public-inbox uses Xapian to index. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-07-19 SearchIdx.pm: Make indexing search positions optional
About half the size of the Xapian search index turns out to be search positions. The search positions are only used in a very narrow set of queries. Make the search positions optional so people don't need to pay the cost of queries they will never make. This also makes public-inbox more approachable for light hacking as generating all of the indexes is time consuming. The way this is done is to add a method to SearchIdx called index_text that wraps the call of the term generator method index_text. The new index_text method takes care of calling both index_text and increase_termpos (the two functions that are responsible for position data). Then index_users, index_diff_inc, index_old_diff_fn, index_diff, index_body are made proper methods that call the new index_text. Callers of the new index_text are slightly simplified as they don't need to call increase_termpos as well. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
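The wrapper idea can be sketched as follows. To stay runnable without Xapian installed, the sketch uses a stand-in class that merely records calls; its three method names mirror Xapian's TermGenerator methods (index_text, index_text_without_positions, increase_termpos). The wrapper function, the `with_positions` flag, and the choice of index_text_without_positions for the no-positions path are assumptions for illustration, not the SearchIdx.pm implementation.

```python
class RecordingTermGenerator:
    """Stand-in for a Xapian TermGenerator; it only records method calls."""

    def __init__(self):
        self.calls = []

    def index_text(self, text, wdf_inc=1, prefix=""):
        self.calls.append(("index_text", text, wdf_inc, prefix))

    def index_text_without_positions(self, text, wdf_inc=1, prefix=""):
        self.calls.append(("index_text_without_positions", text, wdf_inc, prefix))

    def increase_termpos(self, delta=100):
        self.calls.append(("increase_termpos", delta))


def index_text(tg, text, wdf_inc, prefix, with_positions=True):
    """Index text, handling position bookkeeping in one place for callers."""
    if with_positions:
        tg.index_text(text, wdf_inc, prefix)
        # Leave a gap so phrase searches do not span separate fields.
        tg.increase_termpos()
    else:
        tg.index_text_without_positions(text, wdf_inc, prefix)
```

Callers pass the text, weight, and prefix once and never touch increase_termpos themselves, which is the simplification the commit message mentions.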
2018-07-18 SearchIdx: Decrement regen_down even for added messages that are later deleted.
Decrement regen_down when visiting messages that appear in %D that we know will later be deleted. This ensures consistent message numbers are generated no matter which commit number is on top. Allowing deletes to propagate separately from the messages they delete without causing problems. The v2 trees already do this and when the indexes are deleted and rebuilt they maintain their commit numbers. Add a v1 version of the v2reindex test to verify that reindexing is working properly on v1 as well as v2. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-05-24 workaround Xapian OFD locks w/o close-on-exec
Xapian v1.2.21..v1.2.24 (inclusive) use OFD locks but failed to set the close-on-exec flag on those locks. So we must continue to work around those old versions by ensuring Xapian file descriptors aren't held any longer than necessary when in long-running git processes. Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
2018-05-01 searchidx: preserve umask when starting/committing transactions
Xapian will replace files upon committing, so non-parallel V2Writable users need to have umask preserved this way.
2018-04-22 extmsg: use Xapian only for partial matches
"LIKE" in SQLite (and other SQL implementations I've seen) is expensive with nearly 3 million messages in the archives. This caused some partial Message-ID lookups to take over 600ms on my workstation (~300ms on a faster Xeon). Cut that to under 30ms on average on my workstation by relying exclusively on Xapian for partial Message-ID lookups as we have in the past. Unlike in the past when we tried using Xapian to match partial Message-IDs; we now optimize our indexing of Message-IDs to break apart "words" in Message-IDs for searching, yielding (hopefully) "good enough" accuracy for folks who get long URLs broken across lines when copy+pasting. We'll also drop the (in retrospect) pointless stripping of "/[tTf]" suffixes for the partial match, since anybody who hits that codepath would be hitting an invalid message ID. Finally, limit wildcard expansion to prevent easy DoS vectors on short terms. And blame Pine and alpine for generating Message-IDs with low-entropy prefixes :P
2018-04-20 searchidx: remove leftover debugging code
I was using this to trace the path of brian's message. Fixes: 017fed7bc4d33ac4 ("searchidx: regenerate and avoid article number gaps on full index")
2018-04-20 searchidx: release lock again during v1 batch callback
Relaxing this lock during a v1 --reindex is important to keep messages showing up in -watch process in a timely manner. Looks like I deleted an extra line when doing the following for v2: s/xdb->commit_transaction/self->commit_txn_lazy/ Fixes: 35ff6bb106909b1c ("replace Xapian skeleton with SQLite overview DB")
2018-04-18 searchidx: revert default BATCH_BYTES to 1_000_000
This increases indexing time by around 10% but roughly halves memory usage of an -index process. We will probably make this tunable in the future for people with bigger/smaller machines.
2018-04-18 searchidx: increase term positions for all text terms
We do not want phrase searches to cross between independent fields (filenames/Message-ID vs bodies).