path: root/lib/PublicInbox/SearchIdx.pm
Date Commit message
2018-03-19 Lock: new base class for writable lockers
This reduces code duplication needed for locking and hopefully makes things easier to understand.
2018-03-19 v2writable: implement remove correctly
We need to hide removals from anybody hitting the search engine.
2018-03-19 searchidx: do not delete documents while iterating
Followup-to: ebb59815035b42c2 ("searchidx: do not modify Xapian DB while iterating")
2018-03-03 v2: avoid redundant/repeated configs for git partition repos
We'll let the config of all.git dictate every other subrepo to ease maintenance and configuration. The "include" directive has been supported since git 1.7.10, so it's safe to depend on as v2 requires git 2.6.0+ anyways for "get-mark" in fast-import.
2018-03-03 searchidx: store the primary MID in doc data for NNTP
We can't rely on header order for Message-ID after all since we fall back to existing MIDs if they exist and are unseen. This lets us use SearchMsg->mid to get the MID we associated with the NNTP article number to ensure all NNTP article lookups roundtrip correctly.
2018-03-03 mid: truncate excessively long MIDs early
Since we support duplicate MIDs in v2, we can safely truncate long MID terms in the database and let other normal duplicate resolution sort it out. It seems only spammers use excessively long MIDs, and there'll always be abuse/misuse vectors for causing mis-threaded messages, so it's not worth worrying about excessively long MIDs.
2018-03-03 searchidx: add NNTP article number as a searchable term
Since we support duplicate MIDs in v2, the NNTP article number becomes the true unique identifier and we want a way to do fast lookups on it. While we're at it, stop putting XPATH in the term partitions since we only need it in the skeleton DB.
2018-03-03 searchidx: use add_boolean_term for internal terms
Aside from the Message-Id ('Q'), these terms do not appear in content and thus have no business contributing to the Xapian document length. Thanks-to Olly Betts for the tip on xapian-discuss <20180228004400.GU12724@survex.com>
2018-03-03 searchidx: avoid excessive XNQ indexing with diffs
When indexing diffs, we can avoid indexing the diff parts under XNQ and instead combine the parts in the read-only search interface. This results in better indexing performance and 10-15% smaller Xapian indices.
2018-03-03 searchidx: support indexing multiple MIDs
It's possible to have a message handle multiple terms; so use this feature to ensure messages with multiple MIDs can be found by either one.
2018-03-02 search: revert to using 'Q' as a uniQue id per-Xapian conventions
'Q' is merely a convention in the Xapian world, and is close enough to unique for practical purposes, so stop using XMID and gain a little more term length as a result.
2018-03-02 searchidx: use new `references' method for parsing References
It's shorter and more convenient, here.
2018-03-02 v2writable: deduplicate detection on add
This is a bit expensive in a multi-process situation because we need to make our indices and packs visible to the read-only pieces.
2018-03-02 searchidx: add PID to error message when die-ing
2018-02-28 searchidx: do not modify Xapian DB while iterating
Iterating through a list of documents while modifying them does not seem to be supported in Xapian and it can trigger DatabaseCorruptError exceptions. This only worked with past datasets out of dumb luck. With the work-in-progress "v2" public-inbox layout, this problem might become more visible as the "thread skeleton" is partitioned out to a separate, smaller Xapian database. I've reproduced the problem on both Debian 8.x and 9.x with Xapian 1.2.19 (chert backend) and 1.4.3 (glass backend) respectively.
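The general fix pattern for this class of bug can be sketched as a two-pass "collect, then modify" loop. This is a minimal illustration in Python, not the project's Perl code: a plain dict stands in for the Xapian database, and `collect_then_delete` and `should_delete` are hypothetical names.

```python
def collect_then_delete(db, should_delete):
    """Delete matching documents without mutating `db` during iteration.

    `db` is a dict standing in for a document store (an assumption for
    this sketch); `should_delete` is a caller-supplied predicate.
    """
    # Pass 1: read-only scan that only remembers which IDs to act on.
    doomed = [docid for docid, doc in db.items() if should_delete(doc)]
    # Pass 2: the iterator is finished, so modifying the store is safe.
    for docid in doomed:
        del db[docid]
    return doomed


db = {1: "keep", 2: "spam", 3: "keep", 4: "spam"}
removed = collect_then_delete(db, lambda doc: doc == "spam")
```

The cost is holding the list of affected document IDs in memory, which is usually small compared to the documents themselves.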
2018-02-28 rename SearchIdxThread to SearchIdxSkeleton
Interchangeably using "all", "skel", "threader", etc. was confusing. Standardize on the "skeleton" term to describe this class since it's also used for retrieval of basic headers.
2018-02-28 searchidx: index values in the threader
We will need timestamp, YYYYMMDD, article number, and line count for querying thread information (including XOVER for NNTP).
2018-02-28 searchidx: get rid of pointless index_blob wrapper
This used to lookup the message in git, but no longer, so remove a needless indirection layer and call add_message directly.
2018-02-28 use PublicInbox::MIME consistently
It works around some bugs in older Email::MIME which we'll find useful.
2018-02-22 v2writable: warn on duplicate Message-IDs
This should give us an idea of how much a problem deduplication will be.
2018-02-22 v2: parallelize Xapian indexing
The parallelization requires splitting Msgmap, text+term indexing, and thread-linking out into separate processes. git-fast-import is fast, so we don't bother parallelizing it. Msgmap (SQLite) and thread-linking (Xapian) must be serialized because they rely on monotonically increasing numbers (NNTP article number and internal thread_id, respectively). We handle msgmap in the main process which drives fast-import. When the article number is retrieved/generated, we write the entire message to per-partition subprocesses via pipes for expensive text+term indexing. When these per-partition subprocesses are done with the expensive text+term indexing, they write SearchMsg (small data) to a shared pipe (inherited from the main V2Writable process) back to the threader, which runs its own subprocess. The number of text+term Xapian partitions is chosen at import and can be made equal to the number of cores in a machine.
  V2Writable --> Import -> git-fast-import
             \-> SearchIdxThread -> Msgmap (synchronous)
             \-> SearchIdxPart[n] -> SearchIdx[*]
             \-> SearchIdxThread -> SearchIdx ("threader", a subprocess)
  [*] each subprocess writes to threader
2018-02-20 v2: support Xapian + SQLite indexing
This is too slow, currently. Working with only 2017 LKML archives:
  git-only: ~1 minute
  git + SQLite: ~12 minutes
  git + Xapian + SQLite: ~45 minutes
So yes, it looks like we'll need to parallelize Xapian indexing, at least.
2018-02-16 search: stop assuming Message-ID is unique
In general, they are, but there's no way for a general-purpose mail server to enforce that. This is a step in allowing us to handle more corner cases which existing lists throw at us.
2018-02-14 searchidx: fix comment around next_thread_id
I decided not to copy the notmuch implementation regarding serialization of integers to Xapian metadata.
2018-02-14 search: free up 'Q' prefix for a real unique identifier
This will allow easier compatibility with v2 code which will introduce content_id as the unique identifier. The old "XMID" becomes "XM" as a free text searchable term. "Q" becomes "XMID" as a boolean prefix. There are no user-visible changes in this, but there needs to be a schema version bump later on... (more changes planned which can affect v1)
2018-02-07 update copyrights for 2018
Using update-copyrights from gnulib. While we're at it, use the SPDX identifier for AGPL-3.0+ to ease mechanical processing.
2017-10-03 threading: deal with improperly-terminated References headers
We should not blindly join References and In-Reply-To headers as a single string, because some messages can have an open angle brace '<' in References: without a corresponding '>'.
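A small Python sketch of why joining the headers is risky, and of parsing each header value on its own instead. The regular expressions and the `extract_mids` helper are assumptions made for illustration; they are not the parser public-inbox uses.

```python
import re

# Strict pattern: a Message-ID may not contain '<', '>' or whitespace,
# so an unterminated '<' can never swallow the next ID.
MID_RE = re.compile(r"<([^<>\s]+)>")


def extract_mids(*header_values):
    """Return unique Message-IDs from each header value, in order seen."""
    seen, mids = set(), []
    for value in header_values:
        if not value:
            continue  # skip missing or empty headers
        for mid in MID_RE.findall(value):
            if mid not in seen:
                seen.add(mid)
                mids.append(mid)
    return mids


refs = "<a@example.com> <b@example.com"  # final '>' is missing
irt = "<c@example.com>"

# Naive approach: join both headers, then match loosely.
naive = re.findall(r"<([^>]+)>", refs + " " + irt)
# Per-header approach with the strict pattern.
safe = extract_mids(refs, irt)
```

In the naive result the unterminated ID absorbs the In-Reply-To value (`'b@example.com <c@example.com'`), while the per-header result keeps `a@example.com` and `c@example.com` and drops the malformed entry.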
2017-06-23 searchidx: fallback to lookup on pre-set article numbers
Yet another hiccup from reusing pre-set article numbers on various ruby-lang.org mailing lists. This was causing messages to not appear to NNTP readers which use XOVER.
2017-06-15 searchidx: remove messages correctly from Xapian index
This fixes a bug introduced in commit 7eeadcb62729b0efbcb53cd9b7b181897c92cf9a ("search: remove unnecessary abstractions and functionality")
2017-06-14 search: allow searching within mail diffs
This can be tied into a repository browser to browse in-flight topics on a mailing list.
2017-06-14 searchidx: switch to accounting by message bytes
Xapian memory usage is tied to the size of the indexed text, so take the raw message size into account when deciding when to flush Xapian data. More importantly, we now flush Xapian before we have it buffer beyond our maximum; and we do it unconditionally to prevent even high priority processes from OOM-ing.
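The "flush before exceeding the budget" idea can be sketched as follows. This is an illustrative Python model only: `ByteBudgetIndexer`, `max_bytes`, and the `flush` callback are hypothetical names, and the real indexer's thresholds are not stated in this log.

```python
class ByteBudgetIndexer:
    """Track pending raw message bytes and flush before overshooting."""

    def __init__(self, max_bytes, flush):
        self.max_bytes = max_bytes  # hypothetical memory budget
        self.flush = flush          # callback that commits buffered data
        self.pending = 0            # bytes buffered since the last flush

    def add(self, raw_message: bytes):
        size = len(raw_message)
        # Flush *before* buffering if this message would exceed the budget,
        # so the buffer never grows past max_bytes (unless one message
        # alone is larger than the budget).
        if self.pending and self.pending + size > self.max_bytes:
            self.flush()
            self.pending = 0
        self.pending += size
        # ... the message would be indexed into the buffer here ...


flushes = []
idx = ByteBudgetIndexer(max_bytes=10, flush=lambda: flushes.append("flush"))
for msg in (b"123456", b"123456", b"123", b"12"):
    idx.add(msg)
```

Checking before the add, rather than after, is what keeps the buffered amount under the limit instead of merely reacting once it has been crossed.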
2017-06-14 search: remove unnecessary abstractions and functionality
This simplifies the code a bit and reduces the translation overhead for looking directly at data from tools shipped with Xapian. While we're at it, fix thread-all.t :)
2017-05-09 searchidx: use cached local $@ copy
umask should never fail and set $@, but use the cached local to be more explicit just in case.
2017-05-07 searchidx: fix ghost root vivification
Due to the asynchronous nature of SMTP, it is possible for the root message of a thread (with no References/In-Reply-To) to arrive last in a series. We must preserve the thread_id of the ghost message in this case, as we do when vivifying non-root ghosts. Otherwise, this causes threads to be broken when the root arrives last.
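A toy model of ghost vivification, assuming an in-memory dict in place of the Xapian index. The `add_message` function and the record layout are hypothetical and only illustrate why the ghost's thread_id has to survive when the real root arrives.

```python
import itertools

_next_tid = itertools.count(1)  # stand-in for a persistent thread_id counter


def add_message(index, mid, parent_mid=None):
    """index maps Message-ID -> {"thread_id": int, "ghost": bool}."""
    if parent_mid is not None and parent_mid not in index:
        # Parent not seen yet: record a ghost placeholder for it.
        index[parent_mid] = {"thread_id": next(_next_tid), "ghost": True}
    existing = index.get(mid)
    if existing is not None and existing["ghost"]:
        # Vivify: the real message arrived. Keep the ghost's thread_id,
        # whether or not this message is the thread root.
        existing["ghost"] = False
        return existing["thread_id"]
    if parent_mid is not None:
        tid = index[parent_mid]["thread_id"]
    else:
        tid = next(_next_tid)
    index[mid] = {"thread_id": tid, "ghost": False}
    return tid


index = {}
reply_tid = add_message(index, "reply@example.com", "root@example.com")
root_tid = add_message(index, "root@example.com")  # root arrives last
```

If the root were instead given a fresh thread_id on arrival, the reply and the root would end up in different threads.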
2017-02-11 handle repeated References and In-Reply-To headers
It seems possible for git-send-email(1) to generate repeated instances of References and In-Reply-To headers, as evidenced in <https://public-inbox.org/git/20161111124541.8216-17-vascomalmeida@sapo.pt/raw>. This causes a mismatch between how our search indexer threads and how our HTML view handles threading. In the future, View.pm will use the smsg-parsed {references} field and avoid redoing Email::MIME header parsing. We will still need to figure out a way to deal with messages with repeated Message-IDs, at some point, too.
2017-02-06 Revert "searchidx: reindex clobbers old thread IDs"
Oops, that's broken, too. I guess the only way to reindex after fixing the thread detection is to start from scratch. This reverts commit 5d91adedf5f33ef1cb87df2a86306ddf370b4f8d.
2017-02-06 searchidx: reindex clobbers old thread IDs
We cannot always reuse thread IDs since our threading logic may change as bugs are fixed.
2017-02-06 searchidx: deal with empty In-Reply-To and References headers
In some messages, these headers exist, but have empty values. Do not let empty values throw off our search indexer to tie threads together, as it can make nonsensical threads grouped to a Message-Id of "" (empty string). See <https://public-inbox.org/git/11340844841342-git-send-email-mailing-lists.git@rawuncut.elitemail.org/raw> for an example of such a message. Thanks-to: Johannes Schindelin <Johannes.Schindelin@gmx.de> <https://public-inbox.org/git/alpine.DEB.2.20.1702041206130.3496@virtualbox/>
2017-01-10 introduce PublicInbox::MIME wrapper class
This should fix problems with multipart messages where text/plain parts lack a header. cf. git clone --mirror https://github.com/rjbs/Email-MIME.git refs/pull/28/head In the future, we may still introduce a streaming interface to reduce memory usage on large emails.
2017-01-07 searchmsg: favor direct hash access over accessor methods
This is faster, smaller, and more straightforward to me with fewer layers of indirection.
2016-12-10 search: favor In-Reply-To over last References iff IRT exists
Some email clients set the References headers backwards, so trust the In-Reply-To header if (and only if) it exists and is parseable as direct parent of the current message. For affected repos, this will require reindexing (via "public-inbox-index --reindex"), but there will be no version bump for this bugfix.
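The parent-selection rule described above can be sketched like this. It is a hedged Python illustration that assumes the headers were already parsed into Message-IDs; `choose_parent` is a hypothetical helper, not a function from the codebase.

```python
def choose_parent(in_reply_to, references):
    """Pick the direct parent Message-ID for threading.

    in_reply_to: a parsed Message-ID string, or None if the header is
                 missing or unparseable (assumed to be decided upstream).
    references:  list of parsed Message-IDs in header order.
    """
    if in_reply_to:
        # Trust In-Reply-To if (and only if) it exists and parsed cleanly.
        return in_reply_to
    if references:
        # Otherwise fall back to the last References entry.
        return references[-1]
    return None  # no parent: this message starts a thread


# A client that wrote References backwards: the true parent is listed first.
backwards = ["parent@example.com", "grandparent@example.com"]
with_irt = choose_parent("parent@example.com", backwards)
without_irt = choose_parent(None, backwards)
```

With In-Reply-To present the true parent is chosen even though References is reversed; without it, the fallback still picks the last References entry.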
2016-10-05 thread: remove Mail::Thread dependency
Introduce our own SearchThread class for threading messages. This should allow us to specialize and optimize away objects in future commits.
2016-09-09 search: index attachment filenames
And while we're at it, ensure searching inside displayable attachment bodies works.
2016-09-09 search: match the behavior of WWW for indexing text
The basic rule is that if it is displayable via our WWW interface, it should be indexable text for Xapian search.
2016-09-09 search: avoid mindlessly calling body_set
It's not worth entering a complex codepath in Email::MIME to save some (probably immeasurable amount of) memory, here. We've already stopped doing this in our WWW code a while back, too. If we really cared enough about it, we'd prioritize work on a streaming replacement for Email::MIME.
2016-09-09 search: fix compatibility with Debian wheezy
Specifying the "d:" field only worked for NumberValueRangeProcessor in older versions of Xapian, such as the one in Debian wheezy (libsearch-xapian-perl=1.2.10.0-1). This slipped through since I rarely use wheezy anymore, and perhaps nobody else does, either. Perhaps wheezy support may be dropped soon. Unfortunately, this requires a schema version bump.
2016-09-09 search: increase term positions for each quoted hunk
We pay a storage cost for storing positional information in Xapian, so make good use of it by attempting to preserve it for (hopefully) better search results.
2016-09-09 search: match quote detection behavior of view
This is stricter than the mutt quote_regexp default ("^([ \t]*[|>:}#])+" on Debian jessie), but matches what we have in View.pm. I prefer the stricter quote detection since it is less ambiguous and less likely to hide/obscure important details.
2016-09-09 search: fix space regressions from recent changes
As of Xapian 1.0.4 (from 2007), it is possible to use Search::Xapian::QueryParser::add_prefix multiple times with the same user field name but different term prefixes. This brings my current git@vger mirror from 6.5GB to 2.1GB (both sizes are after xapian-compact).
2016-09-09 search: more granular message body searching
"bs:" and "b:" are adapted from mairix(1). We will also support searching explicitly for quoted vs non-quoted text via "q:" and "nq:" prefixes since sometimes readers will not care for quoted text. In the future, we will support parsing diffs (perhaps when repobrowse integration is complete). Note: this roughly doubles the size of the Xapian database due to the additional information, so this change may not be worth it.