path: root/lib/PublicInbox/SearchIdx.pm
Date Commit message
2020-03-22 index: use git commit times on missing Date/Received
When indexing messages without Date: and/or Received: headers, fall back to using timestamps originally recorded by git in the commit object. This allows git mirrors to preserve the import datestamp and timestamp of a message according to what was fed into git, instead of blindly falling back to the current time.
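The fallback described above can be sketched roughly as follows. This is a minimal illustration in Python, not the SearchIdx.pm code: the function name `message_timestamp` and the `commit_time` parameter are hypothetical, and only the Date: header is consulted here, whereas the commit also considers Received:.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def message_timestamp(raw_message: str, commit_time: int) -> int:
    """Return a Unix timestamp for a message, preferring its own Date: header.

    commit_time is assumed to be the time git recorded in the commit object
    that introduced the message (hypothetical parameter for this sketch).
    """
    msg = message_from_string(raw_message)
    date_hdr = msg.get("Date")
    if date_hdr:
        try:
            return int(parsedate_to_datetime(date_hdr).timestamp())
        except (TypeError, ValueError):
            pass  # unparsable Date: fall through to the commit time
    # No usable header: use what git recorded rather than the current time.
    return commit_time
```

The point of the design is that the result depends only on the message and the commit, so re-indexing a mirror yields the same timestamps each time.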
2020-02-06 treewide: run update-copyrights from gnulib for 2019
I didn't wait until September to do it, this year!
2020-01-27 searchidx: don't assume "a/" and "b/" as prefixes
Some people use "--{src,dst}-prefix=", try to deal with those since git-apply can handle them when called by solver.
2020-01-27 searchidx: skip filenames on "diff --git ..."
We already capture filenames on the lines beginning with "---" and "+++", so it's redundant work to capture filenames from "diff --git ..." lines.
2020-01-27 search: {version} => {ibx_ver}
We don't want to confuse human readers with the Xapian schema version. We also want to make it obvious this is the version of the inbox we're indexing, since these are Search or SearchIdx objects, not Inbox objects.
2020-01-27 inbox: add ->version method
This allows us to simplify version checking by avoiding "//" or "||" operators sprinkled around.
2020-01-11 spawn (and thus popen_rd) die on failure
Most spawn and popen_rd callers die on failure to spawn, anyways, and some are missing checks entirely. This saves us a bunch of verbose error-checking code in callers. This also makes popen_rd more consistent, since it already dies on pipe creation failures.
2020-01-06 treewide: "require" + "use" cleanup and docs
There's a bunch of leftover "require" and "use" statements we no longer need and can get rid of, along with some excessive imports via "use". IO::Handle usage isn't always obvious, so add comments describing why a package loads it. Along the same lines, document the tmpdir support as the reason we depend on File::Temp 0.19, even though every Perl 5.10.1+ user has it. While we're at it, favor "use" over "require", since it gives us extra compile-time checking.
2020-01-04 searchidx: remove_message: pedantic fix for v1
It shouldn't be possible for v1 inboxes to have multiple matches for a given Message-ID, so the sub would only get called once, but strange things could happen in 2112 :>
2020-01-04 searchidx: index_text: use Xapian parameter names
Use the parameter names from the Search::Xapian::TermGenerator manpage for our local variables instead of confusing names...
2020-01-04 searchidx: simplify quote-splitting in index_body
We now use the same regexp View::add_text_body uses.
2020-01-04 searchidx: add_message: fix and make use of prototypes
Procedural function calls allow prototype checking, and our add_message prototype was totally wrong to begin with. Convert most of the "$self->index_*" calls to "index_*($self". While we're at it, use "//=" to avoid some "unless" statements.
2020-01-04 searchidx: split off index_xapian for msg_iter
This ought to save some memory, but it's probably lost in the noise given the cost of indexing. Regardless it still reduces the indentation level and makes future changes easier to read.
2020-01-04 searchidx: index_diff: allow /^$/ line as diff context
As discovered by solver bug hunting, "git apply" also handles the case where blank lines w/o leading space are treated as diff context, apparently because GNU diff once did it: https://public-inbox.org/git/b507b465f7831612b9d9fc643e3e5218b64e5bfa/s/
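A rough Python sketch of that rule, under the assumption that hunk lines are classified one at a time (the function name and the returned labels are made up for illustration and do not come from SearchIdx.pm):

```python
def classify_hunk_line(line: str) -> str:
    """Classify one line inside a unified-diff hunk.

    A completely empty line is treated as context, matching the leniency
    described above for "git apply".
    """
    line = line.rstrip("\n")
    if line == "" or line.startswith(" "):
        return "context"
    if line.startswith("+"):
        return "add"
    if line.startswith("-"):
        return "del"
    if line.startswith("\\"):
        return "meta"  # e.g. "\ No newline at end of file"
    return "other"  # anything else ends the hunk
```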
2019-12-24 search: support SWIG-generated Xapian.pm
Xapian upstream is slowly phasing out the XS-based Search::Xapian in favor of the SWIG-generated "Xapian" package. While both Debian and FreeBSD have Search::Xapian, OpenBSD only includes the "Xapian" binding. More information about the status of the "Xapian" Perl module here: https://trac.xapian.org/ticket/523
2019-12-24 searchidx: call "++" on PostingIterator instead of "->inc"
The "->inc" method is not yet available in the SWIG-based "Xapian.pm" Perl bindings, so use "++", which is supported in both the XS (Search::Xapian) and SWIG-based Xapian bindings.
2019-12-15 searchidx: do not modify read-only $1 via git_unquote
git_unquote works in-place, and we sometimes see strange filenames, or badly munged diffs with terminal escape characters (for colorization) end up in emails.
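For readers unfamiliar with git's C-style path quoting, here is a hedged Python sketch of an unquoting helper. It is not a port of the Perl git_unquote: Python bytes are immutable, so the function always returns a new value, which sidesteps the "modify a read-only capture" hazard by construction. The escape table below is an assumption covering the common cases.

```python
import re

# Single-character escapes assumed for this sketch.
_SIMPLE = {
    b"a": b"\a", b"b": b"\b", b"t": b"\t", b"n": b"\n",
    b"v": b"\v", b"f": b"\f", b"r": b"\r", b'"': b'"', b"\\": b"\\",
}
_ESCAPE = re.compile(rb'\\([0-7]{1,3}|[abtnvfr"\\])')


def git_unquote(name: bytes) -> bytes:
    """Return the unquoted form of a possibly double-quoted path name."""
    if len(name) < 2 or not (name.startswith(b'"') and name.endswith(b'"')):
        return name  # not quoted: nothing to do

    def _replace(match):
        token = match.group(1)
        if token[:1].isdigit():
            return bytes([int(token, 8) & 0xFF])  # octal escape, e.g. \033
        return _SIMPLE[token]

    return _ESCAPE.sub(_replace, name[1:-1])
```

A caller would pass in a copy of the regexp capture and index the returned value, leaving the capture itself untouched.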
2019-11-04 index: "git log" failures are fatal
While I've never seen "git log" fail on its own, it could happen one day and we should be prepared to abort indexing when it happens. Beef up tests for t/spawn.t to ensure close() behaves on popen_rd the way we expect it to.
2019-10-28 index: allow search/lookups on X-Alt-Message-ID
Since we replace extra Message-ID headers with X-Alt-Message-ID to placate NNTP clients, we should allow searching and indexing on X-Alt-Message-ID just like we do with Message-ID.
2019-10-16 config: support "inboxdir" in addition to "mainrepo"
"mainrepo" was a bad name and an artifact from the early days when I intended for there to be a "spamrepo" (now just the ENV{PI_EMERGENCY} Maildir). With v2, "mainrepo" can be especially confusing, since v2 needs at least two git repositories (epoch + all.git) to function and we shouldn't confuse users by having them point to a git repository for v2. Much of our documentation already references "INBOX_DIR" for command-line arguments, so use "inboxdir" as the git-config(1)-friendly variant for that. "mainrepo" remains supported indefinitely for compatibility. Users may need to revert to old versions, or may be referring to old documentation and must not be forced to change config files to account for this change. So if you're using "mainrepo" today, I do NOT recommend changing it right away because other bugs can lurk. Link: https://public-inbox.org/meta/874l0ice8v.fsf@alyssa.is/
2019-09-09 run update-copyrights from gnulib for 2019
2019-06-15 comments: replace "partition" with "shard"
Now that the code matches Xapian terminology, ensure our comments match, too.
2019-06-14 search*: rename {partition} => {shard}
Another step towards keeping our internal data structures consistent with Xapian naming.
2019-06-14 searchidx: require PublicInbox::Inbox (or InboxWritable) ref
PublicInbox::Inbox objects have minimal dependencies, so drop code to support old tests which existed before the PublicInbox::Inbox object came into existence.
2019-06-12 searchidx: improve error message when Xapian fails
Make it easier to detect if a partition is corrupt.
2019-05-29 searchidx: store indexlevel=medium as metadata
And use it from Admin. It's easy to tell what indexlevel=basic is from unconfigured inboxes, but distinguishing between 'medium' and 'full' would require stat()-ing position.* files which is fragile and Xapian-implementation-dependent. So use the metadata facility of Xapian and store it in the main partition so Admin tools can deal better with unconfigured inboxes copied using generic tools like cp(1) or rsync(1).
2019-05-27 searchidx: fix obvious typo
We can't pass an empty string to `git merge-base --is-ancestor'. AFAIK, this did NOT present issues in the current test suite.
2019-05-23 xcpdb: show re-indexing progress
Emit information about reindexing git revision ranges when used with xcpdb. Additionally, distinguish Xapian copy output from v2 git epoch counting by increasing directory context info. For now, v1 batches are emitted. v2 indexing is still missing progress reporting for batches, as the data structures for reindexing would benefit from a refactoring, first. This does not currently affect the use of public-inbox-index, but may in the future.
2019-05-23 xcpdb: use fine-grained locking
Copying an entire Xapian DB takes a long time, so update our reindexing code to support partial reindexing, snapshot the pre-copydatabase git revisions, perform the lengthy copy, and do a partial reindex when the copy + renames are done.
2019-05-21 Merge remote-tracking branch 'origin/xap-optional' into master
* origin/xap-optional:
  admin: improve warnings and errors for missing modules
  searchidx: do not create empty Xapian partitions for basic
  lazy load Xapian and make it optional for v2
  www: use Inbox->over where appropriate
  nntp: use Inbox->over directly
  inbox: add ->over method to ease access
2019-05-21 searchidx: remove unused Compress::Zlib import
We only compress in OverIdx, now; since we no longer do overview stuff in Xapian (and Xapian compresses document data, anyways).
2019-05-15 searchidx: do not create empty Xapian partitions for basic
No point in leaving a mess of empty directories when Xapian doesn't load.
2019-05-15 lazy load Xapian and make it optional for v2
More tests work without Search::Xapian, now. Usability issues still need to be fixed.
2019-05-14 searchidx: fix incremental index with indexlevel=basic on v1
We were reindexing the full history every invocation of -index when Xapian was not used because we were incorrectly relying on 'last_commit' metadata stored in Xapian. Rewrite the indexing logic to be less confusing while we're at it, since we rely on `git merge-base --is-ancestor' nowadays. Furthermore, we need to handle message removals from the overview index correctly when Xapian is not in use. Co-authored-by: Eric W. Biederman <ebiederm@xmission.com>
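The role of `git merge-base --is-ancestor' in incremental indexing can be sketched as below. This is an illustrative Python outline, not the rewritten Perl logic: `log_range`, `last_commit`, and the injectable `run` parameter are hypothetical names, and where the last indexed commit is stored is left out. The exit-status convention (0 for "is an ancestor", 1 for "is not") is git's documented behavior.

```python
import subprocess


def is_ancestor(git_dir, old, new, run=subprocess.run) -> bool:
    """True if commit `old` is an ancestor of `new` in the given repository."""
    if not old:
        return False  # never hand an empty string to git
    cmd = ["git", f"--git-dir={git_dir}",
           "merge-base", "--is-ancestor", old, new]
    status = run(cmd).returncode
    if status == 0:
        return True
    if status == 1:
        return False
    raise RuntimeError(f"git merge-base failed with status {status}")


def log_range(git_dir, last_commit, tip, run=subprocess.run) -> str:
    """Pick the revision range to walk for this indexing run."""
    if is_ancestor(git_dir, last_commit, tip, run=run):
        return f"{last_commit}..{tip}"  # incremental: only new commits
    return tip  # history was rewritten or nothing indexed yet: walk it all
```

Storing `last_commit` somewhere that exists at every index level is what keeps the incremental path available when Xapian is not in use.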
2019-01-15 searchidx: move git_unquote to PublicInbox::Git
We'll be using it outside of searchidx...
2019-01-10 Merge commit 'mem'
* commit 'mem':
  view: more culling for search threads
  over: cull unneeded fields for get_thread
  searchmsg: remove unused fields for PSGI in Xapian results
  searchview: drop unused {seen} hashref
  searchmsg: remove Xapian::Document field
  searchmsg: get rid of termlist scanning for mid
  httpd: remove psgix.harakiri reference
2019-01-10 check git version requirements
This allows v1 tests to continue working on git 1.8.0 for now. This allows git 2.1.4 packaged with Debian 8 ("jessie") to run old tests, at least. I suppose it's safe to drop Debian 7 ("wheezy") due to our dependency on git 1.8.0 for "merge-base --is-ancestor". Writing V2 repositories requires git 2.6 for "get-mark" support, so mask out tests for older gits.
2019-01-08 searchmsg: remove Xapian::Document field
We don't need to be carrying this around with the many SearchMsg objects we have. This saves about 20K from a large SearchView "&x=t" response.
2019-01-05 index: quiet down git-log error messages on new inboxes
The new t/*filter_rubylang.t tests call -index immediately after -init, which causes confusing messages to show up to the end user. Check the validity of the ref before calling "git-log".
2018-12-30 handle "multipart/mixed" messages which are not multipart
I've found two examples on https://lore.kernel.org/lkml/ where the messages declared themselves to be "multipart/mixed" but were actually plain text: <87llgalspt.fsf@free.fr> <200308111450.h7BEoOu20077@mail.osdl.org> With the mboxrd downloaded, mutt is able to view them without difficulty. Note: this change would require reindexing of Xapian to pick up the changes. But it's only two ancient messages, the first was resent by the original sender and the second is too old to be relevant.
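To make the failure mode concrete, here is a small Python sketch using the standard library's email package (not the Perl MIME handling public-inbox uses). It shows one way to tolerate a message whose Content-Type claims "multipart/mixed" while the body is plain text. The function name `displayable_text` is hypothetical.

```python
from email import message_from_string
from typing import List


def displayable_text(raw: str) -> List[str]:
    """Return the text bodies of a message, tolerating bogus multipart claims."""
    msg = message_from_string(raw)
    if msg.get_content_maintype() == "multipart" and not msg.is_multipart():
        # Declared multipart, but the parser found no parts:
        # fall back to treating the whole body as text.
        return [msg.get_payload()]
    return [
        part.get_payload()
        for part in msg.walk()
        if part.get_content_maintype() == "text"
    ]
```

For example, a message with `Content-Type: multipart/mixed; boundary="xyz"` whose body never contains that boundary still yields its body text instead of nothing.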
2018-08-03 Merge branch 'eb/index-incremental'
Incremental indexing fixes from Eric W. Biederman. These prevent the highest message number in msgmap from being reassigned after deletes in rare cases and ensure messages are deleted from msgmap in v2.
* eb/index-incremental:
  V2Writeable.pm: In unindex_oid delete the message from msgmap
  V2Writeable.pm: Ensure that a found message number is in the msgmap
  SearchIdx,V2Writeable: Update num_highwater on optimized deletes
  t/v[12]reindex.t: Verify the num highwater is as expected
  t/v[12]reindex.t Verify num_highwater
  Msgmap.pm: Track the largest value of num ever assigned
  SearchIdx.pm: Always assign numbers backwards during incremental indexing
  t/v[12]reindex.t: Test incremental indexing works
  t/v[12]reindex.t: Test that the resulting msgmap is as expected
  t/v[12]reindex.t: Place expected second in Xapian tests
  t/v2reindex.t: Isolate the test cases more
  t/v1reindex.t: Isolate the test cases
  Import.pm: Don't assume {in} and {out} always exist
2018-08-03 SearchIdx,V2Writeable: Update num_highwater on optimized deletes
When performing an incremental index update with index_sync, if a message is seen to be both added and deleted, update the num_highwater mark even though the message is not otherwise indexed. This ensures index_sync generates the same msgmap no matter which commit it stops at during incremental syncs. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-08-03 Msgmap.pm: Track the largest value of num ever assigned
Today the only thing that prevents public-inbox from reusing the message numbers of deleted messages is the sqlite autoincrement magic, and that only works part of the time. The new incremental indexing test has revealed areas where today public-inbox does try to reuse numbers of deleted messages. Reusing the message numbers of existing messages is a problem because if a client ever sees messages that are subsequently deleted, the client will not see the new messages with their old numbers. In practice this is difficult to trigger because it requires the most recently added message to be removed and have the removal show up in a separate pull request. Still, it can happen and it should be handled. Instead of inferring the highest number ever used by finding the maximum number in the message map, track the largest number ever assigned directly. Update Msgmap to track this value and update the indexers to use this value. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
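A minimal sketch of the idea, using Python's sqlite3 module. The schema, table names, and class below are invented for illustration and are not the Msgmap.pm schema. The only point is that the high-water mark is stored separately, so deleting the newest row cannot cause its number to be handed out again.

```python
import sqlite3


class Msgmap:
    """Toy message-number map that never reuses a number (hypothetical schema)."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS msgmap "
            "(num INTEGER PRIMARY KEY, mid TEXT UNIQUE NOT NULL)")
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS meta "
            "(key TEXT PRIMARY KEY, val INTEGER NOT NULL)")

    def num_highwater(self) -> int:
        row = self.db.execute(
            "SELECT val FROM meta WHERE key = 'num_highwater'").fetchone()
        return row[0] if row else 0

    def add(self, mid: str) -> int:
        num = self.num_highwater() + 1
        with self.db:  # one transaction: the row and the mark move together
            self.db.execute(
                "INSERT INTO msgmap (num, mid) VALUES (?, ?)", (num, mid))
            self.db.execute(
                "INSERT OR REPLACE INTO meta (key, val) "
                "VALUES ('num_highwater', ?)", (num,))
        return num

    def delete(self, num: int) -> None:
        with self.db:
            self.db.execute("DELETE FROM msgmap WHERE num = ?", (num,))
```

With a MAX(num)-based approach, deleting message 2 and then adding another would hand out 2 again; with the stored mark, the next number is 3.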
2018-08-03 search: (really) match the behavior of WWW for indexing text
Not sure what was going through my mind when I made my first attempt at this, but we really want to make sure we index all the text we display in the web view (and presumably anything a reasonable mail client can display). Followup-to: 0cf6196025d4e4880cd1ed859257ce21dd3cdcf6 ("search: match the behavior of WWW for indexing text")
2018-08-02 SearchIdx.pm: Always assign numbers backwards during incremental indexing
When walking messages newest to oldest, assigning the larger numbers before smaller numbers ensures older messages get smaller numbers. This leads to the possibility of a msgmap that can be regenerated when needed. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
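The numbering scheme can be illustrated with a short Python sketch. It assumes the number of new messages is known before the walk starts and that a high-water mark is available. Both assumptions, and all names here, are illustrative rather than taken from SearchIdx.pm.

```python
def assign_numbers(new_oids_newest_first, highwater):
    """Assign message numbers while walking history newest to oldest.

    The newest message gets the largest number and the oldest gets
    highwater + 1, so the result matches what an oldest-first walk
    would have produced.
    Returns (mapping of oid -> num, new high-water mark).
    """
    count = len(new_oids_newest_first)
    nums = {}
    num = highwater + count
    for oid in new_oids_newest_first:
        nums[oid] = num
        num -= 1
    return nums, highwater + count
```

Because the mapping depends only on the order of history and the starting mark, regenerating the map from scratch yields the same numbers.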
2018-07-20 v1: allow upgrading indexlevel=basic to 'medium' or 'full'
For v1 repos, we don't need to write any metadata to Xapian and changing from 'basic' to 'medium' or 'full' will work. For v2, the metadata for indexing is stored in msgmap (because the Xapian databases are partitioned for parallelism), so a reindex is required.
2018-07-19 searchidx: respect XAPIAN_FLUSH_THRESHOLD env if set
Xapian documents and respects XAPIAN_FLUSH_THRESHOLD to define the interval in documents to flush, so don't override it with our own BATCH_BYTES. This is helpful for initial indexing for those on slower storage but enough RAM. It is unnecessary for -watch and frequent incremental indexing; and it increases transaction times if -watch is playing "catch-up" if it was stopped for a while. The original BATCH_BYTES was tuned for a machine with little memory as the default XAPIAN_FLUSH_THRESHOLD of 10000 documents was causing swap storms. Using document counts also proved an inaccurate estimator of RAM usage compared to the actual bytes processed.
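The decision described above amounts to a small policy switch, sketched here in Python. The `BATCH_BYTES` value is a placeholder (the log does not state the real one), and the helper names are hypothetical.

```python
import os

BATCH_BYTES = 1_000_000  # placeholder value; the real default is not given here


def batch_bytes(environ=os.environ):
    """Return the byte budget per batch, or None to leave flushing to Xapian.

    If the user set XAPIAN_FLUSH_THRESHOLD, Xapian's own document-count
    threshold governs flushing and no byte-based limit is applied.
    """
    if environ.get("XAPIAN_FLUSH_THRESHOLD"):
        return None
    return BATCH_BYTES


def should_flush(bytes_since_flush, limit):
    """True when the byte-based batching policy calls for a flush."""
    return limit is not None and bytes_since_flush >= limit
```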
2018-07-19 SearchIdx: Allow the amount of indexing be configured
This adds a new inbox configuration option 'indexlevel' that can take the values 'full', 'medium', and 'basic'. When set to 'full', everything is indexed including the positions of all terms. When set to 'medium', everything except the positions of terms is indexed. When set to 'basic', terms and positions are not indexed; just the Overview database for NNTP is created. That is still quite good and allows searching for messages by Message-ID, but there are no indexes to support searching inside the email messages themselves. Update the reindex tests to exercise the full, medium, and basic code paths. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
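The three levels can be summarized as a lookup table. The Python below is only a restatement of the description above; `IndexPlan` and `plan_for` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexPlan:
    overview: bool   # Overview database (NNTP, Message-ID lookups)
    terms: bool      # Xapian terms, for searching inside messages
    positions: bool  # Xapian term positions


_PLANS = {
    "full": IndexPlan(overview=True, terms=True, positions=True),
    "medium": IndexPlan(overview=True, terms=True, positions=False),
    "basic": IndexPlan(overview=True, terms=False, positions=False),
}


def plan_for(indexlevel: str) -> IndexPlan:
    """Map an 'indexlevel' setting to what gets indexed."""
    try:
        return _PLANS[indexlevel]
    except KeyError:
        raise ValueError(f"unknown indexlevel: {indexlevel!r}") from None
```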
2018-07-19 SearchIdx: Add the mechanism for making all Xapian indexing optional
Create a new method add_xapian that holds all of the code to create Xapian indexes. The creation of this method simply involved identifying the relevant code and moving it from add_message. A call is added to add_xapian from add_message to keep everything working as it currently does. The new call is made conditional upon index levels of 'full' and 'medium'. These are the index levels that index positions and terms, the two things public-inbox uses Xapian to index. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-07-19 SearchIdx.pm: Make indexing search positions optional
About half the size of the Xapian search index turns out to be search positions. The search positions are only used in a very narrow set of queries. Make the search positions optional so people don't need to pay the cost of queries they will never make. This also makes public-inbox more approachable for light hacking as generating all of the indexes is time consuming. The way this is done is to add a method to SearchIdx called index_text that wraps the call of the term generator method index_text. The new index_text method takes care of calling both index_text and increase_termpos (the two functions that are responsible for position data). Then index_users, index_diff_inc, index_old_diff_fn, index_diff, index_body are made proper methods that call the new index_text. Callers of the new index_text are slightly simplified as they don't need to call increase_termpos as well. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
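The wrapper pattern can be sketched in Python as follows. Two things are assumptions here: that the positions-free path uses Xapian's `index_text_without_positions` TermGenerator method (the commit message does not say which call is used), and the `positions` flag name. `RecordingTermGenerator` is a stand-in so the sketch runs without Xapian installed.

```python
class RecordingTermGenerator:
    """Stand-in for a Xapian TermGenerator that just records calls."""

    def __init__(self):
        self.calls = []

    def index_text(self, text, wdf_inc, prefix):
        self.calls.append(("index_text", text, wdf_inc, prefix))

    def index_text_without_positions(self, text, wdf_inc, prefix):
        self.calls.append(("index_text_without_positions", text, wdf_inc, prefix))

    def increase_termpos(self):
        self.calls.append(("increase_termpos",))


def index_text(tg, text, wdf_inc, prefix, positions=True):
    """Index text via the term generator, with or without position data."""
    if positions:
        tg.index_text(text, wdf_inc, prefix)
        tg.increase_termpos()  # keep phrases from matching across fields
    else:
        tg.index_text_without_positions(text, wdf_inc, prefix)
```

Callers then make a single call per piece of text and never touch `increase_termpos` themselves, which is the simplification the commit describes.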