path: root/lib/PublicInbox/SearchIdx.pm
Date Commit message
2018-03-02 searchidx: add PID to error message when die-ing
2018-02-28 searchidx: do not modify Xapian DB while iterating
Iterating through a list of documents while modifying them does not seem to be supported in Xapian and it can trigger DatabaseCorruptError exceptions. This only worked with past datasets out of dumb luck. With the work-in-progress "v2" public-inbox layout, this problem might become more visible as the "thread skeleton" is partitioned out to a separate, smaller Xapian database. I've reproduced the problem on both Debian 8.x and 9.x with Xapian 1.2.19 (chert backend) and 1.4.3 (glass backend) respectively.
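The general pattern behind this fix can be sketched as follows. This is only an illustration, not the actual SearchIdx.pm change: a plain dict stands in for the Xapian database, and `reindex_all` and `transform` are hypothetical names. The point is to take a snapshot of the document IDs first and only then modify the store.

```python
def reindex_all(db, transform):
    """Replace every document in `db` (a dict standing in for a database)."""
    # Snapshot the IDs up front, so the loop below never walks a live
    # iterator over the structure it is changing.
    doc_ids = list(db.keys())
    for doc_id in doc_ids:
        new_id, new_doc = transform(doc_id, db[doc_id])
        del db[doc_id]
        db[new_id] = new_doc
    return db


db = {1: "a", 2: "b", 3: "c"}
reindex_all(db, lambda doc_id, doc: (doc_id + 10, doc.upper()))
print(db)
```

Iterating over `db` directly while deleting and inserting keys would raise a RuntimeError in Python; the commit message describes the analogous Xapian failure as a DatabaseCorruptError.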
2018-02-28 rename SearchIdxThread to SearchIdxSkeleton
Interchangeably using "all", "skel", "threader", etc. was confusing. Standardize on the "skeleton" term to describe this class since it's also used for retrieval of basic headers.
2018-02-28 searchidx: index values in the threader
We will need timestamp, YYYYMMDD, article number, and line count for querying thread information (including XOVER for NNTP).
2018-02-28 searchidx: get rid of pointless index_blob wrapper
This used to look up the message in git, but no longer does, so remove a needless indirection layer and call add_message directly.
2018-02-28 use PublicInbox::MIME consistently
It works around some bugs in older Email::MIME which we'll find useful.
2018-02-22 v2writable: warn on duplicate Message-IDs
This should give us an idea of how much of a problem deduplication will be.
2018-02-22 v2: parallelize Xapian indexing
The parallelization requires splitting Msgmap, text+term indexing, and thread-linking out into separate processes.

git-fast-import is fast, so we don't bother parallelizing it.

Msgmap (SQLite) and thread-linking (Xapian) must be serialized because they rely on monotonically increasing numbers (NNTP article number and internal thread_id, respectively).

We handle msgmap in the main process which drives fast-import. When the article number is retrieved/generated, we write the entire message to per-partition subprocesses via pipes for expensive text+term indexing.

When these per-partition subprocesses are done with the expensive text+term indexing, they write SearchMsg (small data) to a shared pipe (inherited from the main V2Writable process) back to the threader, which runs its own subprocess.

The number of text+term Xapian partitions is chosen at import and can be made equal to the number of cores in a machine.

  V2Writable --> Import -> git-fast-import
             \-> SearchIdxThread -> Msgmap (synchronous)
             \-> SearchIdxPart[n] -> SearchIdx[*]
             \-> SearchIdxThread -> SearchIdx ("threader", a subprocess)

  [* ] each subprocess writes to threader
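The data flow described above can be approximated in a single process. This is a rough sketch and not the V2Writable implementation. It uses no pipes or subprocesses, the modulo partition assignment is an assumption made for illustration, and `simulate_import` is a hypothetical name. It only shows which step is serialized (article numbering) and which data is handed to whom.

```python
from collections import defaultdict


def simulate_import(messages, nparts):
    """Single-process stand-in for the main process, partitions and threader."""
    next_num = 0                    # article numbers are serialized in the main process
    partitions = defaultdict(list)  # each partition would do the expensive text+term indexing
    threader = []                   # small per-message records used for thread-linking
    for raw in messages:
        next_num += 1
        part = next_num % nparts    # assumption: simple modulo assignment
        partitions[part].append((next_num, raw))
        # In the real design this small record travels over a shared pipe
        # and may arrive in a different order; here it is appended directly.
        threader.append({"num": next_num, "bytes": len(raw)})
    return partitions, threader


parts, skeleton = simulate_import(["a", "bb", "ccc", "dddd"], nparts=2)
```

Only the full message goes to a partition; the threader sees just the small record, which is what keeps the serialized step cheap.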
2018-02-20 v2: support Xapian + SQLite indexing
This is too slow, currently. Working with only 2017 LKML archives:

  git-only: ~1 minute
  git + SQLite: ~12 minutes
  git + Xapian + SQLite: ~45 minutes

So yes, it looks like we'll need to parallelize Xapian indexing, at least.
2018-02-16 search: stop assuming Message-ID is unique
In general, they are, but there's no way for a general-purpose mail server to enforce that. This is a step in allowing us to handle more corner cases which existing lists throw at us.
2018-02-14 searchidx: fix comment around next_thread_id
I decided not to copy the notmuch implementation regarding serialization of integers to Xapian metadata.
2018-02-14 search: free up 'Q' prefix for a real unique identifier
This will allow easier compatibility with v2 code which will introduce content_id as the unique identifier. The old "XMID" becomes "XM" as a free text searchable term. "Q" becomes "XMID" as a boolean prefix. There are no user-visible changes in this, but there needs to be a schema version bump later on... (more changes planned which can affect v1)
2018-02-07 update copyrights for 2018
Using update-copyrights from gnulib. While we're at it, use the SPDX identifier for AGPL-3.0+ to ease mechanical processing.
2017-10-03 threading: deal with improperly-terminated References headers
We should not blindly join References and In-Reply-To headers as a single string, because some messages can have an open angle brace '<' in References: without a corresponding '>'.
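A small sketch of why joining the headers is risky. The regular expression and the helper names `mids_joined` and `mids_separate` are illustrative assumptions, not the code in SearchIdx.pm. With an unterminated '<' in References, the joined string lets one match run across the header boundary; parsing each header on its own avoids that.

```python
import re

# Illustrative pattern: anything between '<' and the next '>'.
MID_RE = re.compile(r"<([^>]+)>")


def mids_joined(references, in_reply_to):
    """Naive approach: join both headers, then scan once."""
    return MID_RE.findall(references + " " + in_reply_to)


def mids_separate(references, in_reply_to):
    """Scan each header on its own so a stray '<' cannot leak across."""
    mids = []
    for value in (references, in_reply_to):
        for mid in MID_RE.findall(value):
            if mid not in mids:
                mids.append(mid)
    return mids


references = "<a@example.com> <b@example.com"  # note the missing '>'
in_reply_to = "<c@example.com>"

print(mids_joined(references, in_reply_to))
print(mids_separate(references, in_reply_to))
```

The joined scan yields a bogus ID that swallows the In-Reply-To value, while the separate scan still recovers the valid IDs from both headers.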
2017-06-23 searchidx: fallback to lookup on pre-set article numbers
Yet another hiccup from reusing pre-set article numbers on various ruby-lang.org mailing lists. This was causing messages to not appear to NNTP readers which use XOVER.
2017-06-15 searchidx: remove messages correctly from Xapian index
This fixes a bug introduced in commit 7eeadcb62729b0efbcb53cd9b7b181897c92cf9a ("search: remove unnecessary abstractions and functionality")
2017-06-14 search: allow searching within mail diffs
This can be tied into a repository browser to browse in-flight topics on a mailing list.
2017-06-14 searchidx: switch to accounting by message bytes
Xapian memory usage is tied to the size of the indexed text, so take the raw message size into account when deciding when to flush Xapian data. More importantly, we now flush Xapian before we have it buffer beyond our maximum; and we do it unconditionally to prevent even high priority processes from OOM-ing.
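The accounting described here might look roughly like the following. This is a sketch under assumptions: `BatchFlusher`, `max_bytes` and `on_flush` are hypothetical names, and the real code flushes a Xapian writable database rather than calling a callback. The key detail is that the flush happens before the next message would push the buffered total past the limit.

```python
class BatchFlusher:
    """Track buffered message bytes and flush before exceeding a limit."""

    def __init__(self, max_bytes, on_flush):
        self.max_bytes = max_bytes
        self.on_flush = on_flush    # hypothetical callback standing in for a DB flush
        self.pending = 0

    def add(self, raw_message):
        size = len(raw_message)
        # Flush first if buffering this message would exceed the maximum.
        if self.pending and self.pending + size > self.max_bytes:
            self.on_flush()
            self.pending = 0
        self.pending += size


flushes = []
flusher = BatchFlusher(max_bytes=10, on_flush=lambda: flushes.append("flush"))
for message in (b"x" * 6, b"x" * 6, b"x" * 4):
    flusher.add(message)
```

Counting raw bytes rather than messages keeps a batch of a few very large mails from being treated the same as a batch of many small ones.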
2017-06-14 search: remove unnecessary abstractions and functionality
This simplifies the code a bit and reduces the translation overhead for looking directly at data from tools shipped with Xapian. While we're at it, fix thread-all.t :)
2017-05-09 searchidx: use cached local $@ copy
umask should never fail and set $@, but use the cached local to be more explicit just in case.
2017-05-07 searchidx: fix ghost root vivification
Due to the asynchronous nature of SMTP, it is possible for the root message of a thread (with no References/In-Reply-To) to arrive last in a series. We must preserve the thread_id of the ghost message in this case, as we do when vivifying non-root ghosts. Otherwise, this causes threads to be broken when the root arrives last.
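The idea can be sketched with a toy in-memory index. This is an assumption-laden illustration and not the SearchIdx.pm logic: `ThreadIndex` is a hypothetical class, thread IDs are simple counters, and merging of two existing threads is deliberately left out. What it shows is that a late-arriving root inherits the thread_id of its ghost instead of getting a fresh one.

```python
import itertools


class ThreadIndex:
    """Toy index mapping Message-ID to a thread_id, with ghost placeholders."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self.docs = {}  # Message-ID -> {"thread_id": int, "ghost": bool}

    def add(self, mid, refs=()):
        known = [self.docs[r]["thread_id"] for r in refs if r in self.docs]
        old = self.docs.get(mid)
        if old is not None and old["ghost"]:
            thread_id = old["thread_id"]     # vivify: keep the ghost's thread
        elif known:
            thread_id = known[0]
        else:
            thread_id = next(self._next_id)
        self.docs[mid] = {"thread_id": thread_id, "ghost": False}
        # Referenced messages we have not seen yet become ghosts in this thread.
        for r in refs:
            if r not in self.docs:
                self.docs[r] = {"thread_id": thread_id, "ghost": True}
        return thread_id


idx = ThreadIndex()
t1 = idx.add("reply1@example.com", refs=["root@example.com"])
t2 = idx.add("reply2@example.com", refs=["root@example.com"])
t3 = idx.add("root@example.com")  # the root arrives last, with no references
```

Without the ghost check, the root would fall through to the "new thread" branch and the thread would be split in two.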
2017-02-11 handle repeated References and In-Reply-To headers
It seems possible for git-send-email(1) to generate repeated instances of References and In-Reply-To headers, as evidenced in: https://public-inbox.org/git/20161111124541.8216-17-vascomalmeida@sapo.pt/raw This causes a mismatch between how our search indexer threads and how our HTML view handles threading. In the future, View.pm will use the smsg-parsed {references} field and avoid redoing Email::MIME header parsing. We will still need to figure out a way to deal with messages with repeated Message-IDs, at some point, too.
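For illustration, here is how repeated header instances can be collected with Python's standard `email` package. The project itself uses Email::MIME in Perl, so this is an analogy and not its code. `header_mids` is a hypothetical helper and the message is made up.

```python
import re
from email import message_from_string


def header_mids(msg, name):
    """Gather Message-IDs from every instance of a header, without duplicates."""
    mids = []
    for value in msg.get_all(name, []):
        for mid in re.findall(r"<([^>]+)>", value):
            if mid not in mids:
                mids.append(mid)
    return mids


raw = (
    "Message-ID: <3@example.com>\n"
    "References: <1@example.com>\n"
    "References: <1@example.com> <2@example.com>\n"
    "In-Reply-To: <2@example.com>\n"
    "In-Reply-To: <2@example.com>\n"
    "\n"
    "body\n"
)
msg = message_from_string(raw)
print(header_mids(msg, "References"))
print(header_mids(msg, "In-Reply-To"))
```

Reading only the first instance (`msg["References"]`) would miss the second Message-ID here, which is the kind of mismatch between two parsers that the commit describes.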
2017-02-06 Revert "searchidx: reindex clobbers old thread IDs"
Oops, that's broken, too. I guess the only way to reindex after fixing the thread detection is to start from scratch. This reverts commit 5d91adedf5f33ef1cb87df2a86306ddf370b4f8d.
2017-02-06 searchidx: reindex clobbers old thread IDs
We cannot always reuse thread IDs since our threading logic may change as bugs are fixed.
2017-02-06 searchidx: deal with empty In-Reply-To and References headers
In some messages, these headers exist, but have empty values. Do not let empty values throw off our search indexer to tie threads together, as it can make non-sensical threads grouped to a Message-Id of "" (empty string). See <https://public-inbox.org/git/11340844841342-git-send-email-mailing-lists.git@rawuncut.elitemail.org/raw> for an example of such a message. Thanks-to: Johannes Schindelin <Johannes.Schindelin@gmx.de> <https://public-inbox.org/git/alpine.DEB.2.20.1702041206130.3496@virtualbox/>
2017-01-10 introduce PublicInbox::MIME wrapper class
This should fix problems with multipart messages where text/plain parts lack a header. cf. git clone --mirror https://github.com/rjbs/Email-MIME.git refs/pull/28/head In the future, we may still introduce a streaming interface to reduce memory usage on large emails.
2017-01-07 searchmsg: favor direct hash access over accessor methods
This is faster, smaller, and more straightforward to me with fewer layers of indirection.
2016-12-10 search: favor In-Reply-To over last References iff IRT exists
Some email clients set the References headers backwards, so trust the In-Reply-To header if (and only if) it exists and is parseable as direct parent of the current message. For affected repos, this will require reindexing (via "public-inbox-index --reindex"), but there will be no version bump for this bugfix.
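The selection rule can be sketched as below. This is a minimal sketch: `pick_parent` is a hypothetical function, both arguments are assumed to be lists of Message-IDs that have already been parsed, and taking the first In-Reply-To ID when several are present is an assumption.

```python
def pick_parent(in_reply_to_mids, references_mids):
    """Choose the direct parent of a message for threading."""
    # Trust In-Reply-To if (and only if) it yielded a usable Message-ID.
    if in_reply_to_mids:
        return in_reply_to_mids[0]
    # Otherwise fall back to the last entry in References.
    if references_mids:
        return references_mids[-1]
    return None


# A client that wrote References backwards: the real parent is listed first.
parent = pick_parent(["parent@example.com"],
                     ["parent@example.com", "root@example.com"])
```

With the fallback alone, the backwards References header above would have picked the thread root as the parent.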
2016-10-05 thread: remove Mail::Thread dependency
Introduce our own SearchThread class for threading messages. This should allow us to specialize and optimize away objects in future commits.
2016-09-09 search: index attachment filenames
And while we're at it, ensure searching inside displayable attachment bodies works.
2016-09-09 search: match the behavior of WWW for indexing text
The basic rule is that if it is displayable via our WWW interface, it should be indexable text for Xapian search.
2016-09-09 search: avoid mindlessly calling body_set
It's not worth entering a complex codepath in Email::MIME to save some (probably immeasurable amount of) memory, here. We've already stopped doing this in our WWW code a while back, too. If we really cared enough about it, we'd prioritize work on a streaming replacement for Email::MIME.
2016-09-09 search: fix compatibility with Debian wheezy
Specifying the "d:" field only worked for NumberValueRangeProcessor in older versions of Xapian, such as the one in Debian wheezy (libsearch-xapian-perl=1.2.10.0-1). This slipped through since I rarely use wheezy, anymore, and perhaps nobody else does, either. Perhaps wheezy support may be dropped, soon. Unfortunately, this requires a schema version bump.
2016-09-09 search: increase term positions for each quoted hunk
We pay a storage cost for storing positional information in Xapian, so make good use of it by attempting to preserve it for (hopefully) better search results.
2016-09-09 search: match quote detection behavior of view
This is stricter than the mutt quote_regexp default ("^([ \t]*[|>:}#])+" on Debian jessie), but matches what we have in View.pm. I prefer the stricter quote detection since it is less ambiguous and less likely to hide/obscure important details.
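To make the difference concrete, the sketch below compares the mutt default quoted above with a stricter rule. The strict pattern (a line starting with '>') is an assumption chosen for illustration; the commit message does not spell out the exact rule used in View.pm.

```python
import re

# mutt's default quote_regexp, as quoted in the commit message.
MUTT_QUOTE = re.compile(r"^([ \t]*[|>:}#])+")

# Hypothetical stricter rule: only a leading '>' marks a quoted line.
STRICT_QUOTE = re.compile(r"^>")

lines = [
    "> quoted text from the parent message",
    "#include <stdio.h>",
    ": a shell no-op, not a quote",
]
for line in lines:
    print(bool(MUTT_QUOTE.match(line)), bool(STRICT_QUOTE.match(line)), line)
```

The looser pattern treats the C include and the shell line as quoted text, which is the sort of hidden detail the stricter rule avoids.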
2016-09-09 search: fix space regressions from recent changes
As of Xapian 1.0.4 (from 2007), it is possible to use Search::Xapian::QueryParser::add_prefix multiple times with the same user field name but different term prefixes. This brings my current git@vger mirror from 6.5GB to 2.1GB (both sizes are after xapian-compact).
2016-09-09 search: more granular message body searching
"bs:" and "b:" are adapted from mairix(1). We will also support searching explicitly for quoted vs non-quoted text via "q:" and "nq:" prefixes since sometimes readers will not care for quoted text. In the future, we will support parsing diffs (perhaps when repobrowse integration is complete). Note: this roughly doubles the size of the Xapian database due to the additional information; so this change may not be worth it.
2016-09-09 search: allow searching user fields (To/Cc/From)
Sometimes it can be useful to search based on who the message was sent to, sent by, or Cc:-ed. Of course, headers can be faked, but they usually are not... Anyways this mostly matches the behavior of mairix(1).
2016-08-16 search: add YYYYMMDD search range via "d:" prefix
This is similar to mairix in that it uses a "d:" prefix; but only takes YYYYMMDD, for now. Using custom date/time parsers via Perl will be much more work: nntp://news.gmane.org/20151005222157.GE5880@survex.com Anyhow, this ought to be more human-friendly than searching by Unix timestamps, but it requires reindexing to take advantage of.
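As a rough illustration of the value being indexed, a Unix timestamp can be reduced to a YYYYMMDD string and compared against a range. The helper names are hypothetical and the use of UTC is an assumption; the commit message does not say which time zone the indexer uses.

```python
import time


def yyyymmdd(timestamp):
    """Format a Unix timestamp as YYYYMMDD (UTC assumed for this sketch)."""
    return time.strftime("%Y%m%d", time.gmtime(timestamp))


def in_date_range(timestamp, start, end):
    """True if the message date falls within [start, end], both YYYYMMDD strings."""
    # Fixed-width digit strings sort the same way as the dates they encode.
    return start <= yyyymmdd(timestamp) <= end


ts = 1471305600  # 2016-08-16 00:00:00 UTC
print(yyyymmdd(ts), in_date_range(ts, "20160801", "20160831"))
```

Because the value has to be stored with each document, existing indices only gain the date range after a reindex, as the commit message notes.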
2016-08-14 searchidx: do not release Xapian lock while (only) Msgmap is indexing
SQLite might index quickly, so we hold the lock used by Xapian for the duration. This probably needs to be reworked entirely, actually.
2016-08-11 search: support alt-ID for mapping legacy serial numbers
For some existing mailing list archives, messages are identified by serial number (such as NNTP article numbers in gmane). Those links may become inaccessible (as is the current case for gmane), so ensure users can still search based on old serial numbers.

Now, I run the following periodically to get article numbers from gmane (while news.gmane.org remains):

	NNTPSERVER=news.gmane.org
	export NNTPSERVER
	GROUP=gmane.comp.version-control.git
	perl -I lib scripts/xhdr-num2mid $GROUP --msgmap=/path/to/gmane.sqlite3

(I might integrate this further with public-inbox-* scripts one day).

My ~/.public-inbox/config has an added "altid" snippet which now looks like this:

	[publicinbox "git"]
		address = git@vger.kernel.org
		mainrepo = /path/to/git.vger.git
		newsgroup = inbox.comp.version-control.git

		; relative pathnames expand to $mainrepo/public-inbox/$file
		altid = serial:gmane:file=gmane.sqlite3

And run "public-inbox-index --reindex /path/to/git.vger.git" periodically.

This ought to allow searching for "gmane:12345" to work for Xapian-enabled instances.

Disclaimer: while public-inbox supports NNTP and stable article serial numbers, use of those for public links is discouraged since it encourages centralization.
2016-08-10 searchidx: allow searching Message-IDs in free-form text
It is not unheard of for users to attempt finding messages by entering Message-IDs into the "Search" box instead of using the existing URL structure. So make it possible for them. Fwiw, I've definitely encountered users who enter entire URLs into generic search engines.
2016-08-09 searchidx: avoid holding Xapian lock in cat-file
We must ensure cat-file process is launched before Xapian grabs lock, too. Our use of "git cat-file --batch" has the same problem as "git log" did, (which was fixed in commit 3713c727cda431a0dc2865a7878c13ecf9f21851) "searchidx: release Xapian FDs before spawning git log"
2016-08-09 searchidx: release Xapian FDs before spawning git log
This will allow us to release and re-acquire Xapian locks due to the lack of FD_CLOEXEC on some FDs.
2016-08-09 searchidx: persist the PublicInbox::Git object
We can cheaply keep the object around nowadays since it spawns expensive processes only on an as-needed basis.
2016-08-09 searchidx: remove unused $git parameters
We do not need to pass the PublicInbox::Git object to various callbacks.
2016-08-05 search: disable batching in newer versions of Xapian, for now
This warrants further investigation, but it appears we cannot release Xapian reliably after forking "git log" due to the lack of a close-on-exec flag on the Xapian flintlock FD
2016-08-04 searchmsg: add git object ID to doc_data
Doing git tree lookups based on the SHA-1 of the Message-ID is expensive as trees get larger; instead, use the SHA-1 object ID directly. This drastically reduces the amount of time spent in the "git cat-file --batch" process for fetching the /$INBOX/all.mbox.gz endpoint on the ~800MB git@vger.kernel.org mirror. This retains backwards compatibility and allows existing indices to be transparently upgraded without performance degradation.
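For context, a path derived from the SHA-1 of a Message-ID could be computed as below. This is a sketch under assumptions: the two-character fan-out directory layout and the `mid_to_path` name are illustrative and are not taken from this commit. Resolving such a path means walking the git tree on every fetch, whereas a stored object ID can be handed to "git cat-file --batch" as-is.

```python
import hashlib


def mid_to_path(message_id):
    """Derive a tree path from the SHA-1 of a Message-ID (assumed 2/38 split)."""
    digest = hashlib.sha1(message_id.encode("utf-8")).hexdigest()
    return digest[:2] + "/" + digest[2:]


path = mid_to_path("20161111124541.8216-17-vascomalmeida@sapo.pt")
print(path)
```

The path lookup grows more expensive as the tree fills up, which is the cost the commit removes by recording the blob's object ID in doc_data.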
2016-08-02 search: improve reindexing behavior
For reindexing, fresh Xapian DBs do not count as a reindex, allowing users to blindly use --reindex on the first run on a clean repo. While we're at it, allow indexing to override HEAD ref for multi-head git repos.
2016-07-31 search: support reindexing existing search indices
This should make tweaking the way we search more efficient by allowing us to avoid doubling destroying the index every time we want to change something. We also give priority to incremental indexing via public-inbox-{watch,mda} and have manual invocations of public-inbox-index perform batch updates while releasing ssoma.lock.