path: root/lib/PublicInbox/SearchIdx.pm
Date Commit message
2017-02-14 Merge remote-tracking branch 'origin/master' into repobrowse
* origin/master:
  www: do not unescape PATH_INFO twice
  t/mime: quiet warnings for old versions of Email::Simple
  handle repeated References and In-Reply-To headers
2017-02-14 searchidx: switch to accounting by message bytes
Xapian memory usage is tied to the size of the indexed text, so take the raw message size into account when deciding when to flush Xapian data. More importantly, we now flush Xapian before we have it buffer beyond our maximum; and we do it unconditionally to prevent even high priority processes from OOM-ing.
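The flush-before-overflow idea can be sketched as follows. This is a minimal illustration in Python, not the code in SearchIdx.pm; the `BatchIndexer` class, its attributes, and the byte budget are hypothetical, and a real `flush` would commit buffered Xapian data.

```python
class BatchIndexer:
    """Hypothetical stand-in for an indexer that buffers documents in memory."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes  # assumed budget for buffered message bytes
        self.pending_bytes = 0
        self.pending = []
        self.flushes = 0

    def flush(self):
        # A real indexer would commit its buffered data to disk here.
        self.pending.clear()
        self.pending_bytes = 0
        self.flushes += 1

    def add(self, raw_message):
        size = len(raw_message)
        # Flush *before* buffering would push us past the budget,
        # rather than after the limit has already been exceeded.
        if self.pending and self.pending_bytes + size > self.max_bytes:
            self.flush()
        self.pending.append(raw_message)
        self.pending_bytes += size


indexer = BatchIndexer(max_bytes=10)
for msg in (b"aaaa", b"bbbb", b"cccc", b"dddd"):
    indexer.add(msg)
```

Accounting by raw bytes rather than by message count keeps one very large message from being treated the same as one tiny message.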
2017-02-11 handle repeated References and In-Reply-To headers
It seems possible for git-send-email(1) to generate repeated instances of References and In-Reply-To headers, as evidenced in: https://public-inbox.org/git/20161111124541.8216-17-vascomalmeida@sapo.pt/raw This causes a mismatch between how our search indexer threads and how our HTML view handles threading. In the future, View.pm will use the smsg-parsed {references} field and avoid redoing Email::MIME header parsing. We will still need to figure out a way to deal with messages with repeated Message-IDs, at some point, too.
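As a rough illustration of the problem, the sketch below collects Message-IDs from every instance of both headers and drops duplicates. It uses Python's standard `email` module rather than Email::MIME, and the `referenced_ids` helper and `MID_RE` pattern are hypothetical simplifications.

```python
import re
from email import message_from_string

# Simplified pattern for "<id>" tokens; real Message-ID parsing is more lenient.
MID_RE = re.compile(r"<([^<>\s]+)>")


def referenced_ids(raw):
    """Return unique Message-IDs from all References/In-Reply-To headers, in order."""
    msg = message_from_string(raw)
    seen, ordered = set(), []
    for name in ("References", "In-Reply-To"):
        # get_all() returns every instance of a header, not just the first one.
        for value in msg.get_all(name, []):
            for mid in MID_RE.findall(value):
                if mid not in seen:
                    seen.add(mid)
                    ordered.append(mid)
    return ordered


raw = (
    "From: a@example.com\n"
    "References: <1@example.com>\n"
    "References: <1@example.com> <2@example.com>\n"
    "In-Reply-To: <2@example.com>\n"
    "In-Reply-To: <2@example.com>\n"
    "\n"
    "body\n"
)
ids = referenced_ids(raw)
```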
2017-02-10 search: remove unnecessary abstractions and functionality
This simplifies the code a bit and reduces the translation overhead for looking directly at data from tools shipped with Xapian. While we're at it, fix thread-all.t :)
2017-02-09 Merge remote-tracking branch 'origin/master' into repobrowse
* origin/master:
  config: do not slurp lines into memory
  TODO: several updates
  search: schema version bump for empty References/In-Reply-To
  Revert "searchidx: reindex clobbers old thread IDs"
  searchidx: reindex clobbers old thread IDs
  searchidx: deal with empty In-Reply-To and References headers
  searchview: increase limit for displaying search results
  searchview: clarify numeric summary at bottom
  add filter for Subject: tags
  watchmaildir: allow arguments for filters
  watchmaildir: limit live importer processes
  learn: implement "rm" only functionality
  mime: avoid SUPER usage in Email::MIME subclass
  inbox: reinstate periodic cleanup of Xapian and SQLite objects
  introduce PublicInbox::MIME wrapper class
2017-02-07 search: hoist out git directory search index helper
We will be reusing this for indexing normal (code) repositories using git and Xapian, too.
2017-02-06 Revert "searchidx: reindex clobbers old thread IDs"
Oops, that's broken, too. I guess the only way to reindex after fixing the thread detection is to start from scratch. This reverts commit 5d91adedf5f33ef1cb87df2a86306ddf370b4f8d.
2017-02-06 searchidx: reindex clobbers old thread IDs
We cannot always reuse thread IDs since our threading logic may change as bugs are fixed.
2017-02-06 searchidx: deal with empty In-Reply-To and References headers
In some messages, these headers exist, but have empty values. Do not let empty values throw off our search indexer to tie threads together, as it can make non-sensical threads grouped to a Message-Id of "" (empty string). See <https://public-inbox.org/git/11340844841342-git-send-email-mailing-lists.git@rawuncut.elitemail.org/raw> for an example of such a message. Thanks-to: Johannes Schindelin <Johannes.Schindelin@gmx.de> <https://public-inbox.org/git/alpine.DEB.2.20.1702041206130.3496@virtualbox/>
2017-01-10 introduce PublicInbox::MIME wrapper class
This should fix problems with multipart messages where text/plain parts lack a header. cf. git clone --mirror https://github.com/rjbs/Email-MIME.git refs/pull/28/head In the future, we may still introduce a streaming interface to reduce memory usage on large emails.
2017-01-07 searchmsg: favor direct hash access over accessor methods
This is faster, smaller, and more straightforward to me with fewer layers of indirection.
2016-12-10 search: favor In-Reply-To over last References iff IRT exists
Some email clients set the References headers backwards, so trust the In-Reply-To header if (and only if) it exists and is parseable as direct parent of the current message. For affected repos, this will require reindexing (via "public-inbox-index --reindex"), but there will be no version bump for this bugfix.
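The parent-selection rule described here might look roughly like the sketch below. It is written in Python for illustration only; `parent_id` and `MID_RE` are hypothetical names, and the pattern is a simplification of real Message-ID parsing.

```python
import re

# Simplified pattern for "<id>" tokens.
MID_RE = re.compile(r"<([^<>\s]+)>")


def parent_id(in_reply_to, references):
    """Pick the direct parent: In-Reply-To if parseable, else the last Reference."""
    irt = MID_RE.findall(in_reply_to or "")
    if irt:
        # Trust In-Reply-To if (and only if) it yields a parseable Message-ID.
        return irt[0]
    refs = MID_RE.findall(references or "")
    # Fall back to the last References entry; empty headers give no parent.
    return refs[-1] if refs else None
```

An empty or missing header simply yields no candidates here, so it cannot tie a message to a parent with an empty Message-ID.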
2016-10-05 thread: remove Mail::Thread dependency
Introduce our own SearchThread class for threading messages. This should allow us to specialize and optimize away objects in future commits.
2016-09-09 search: index attachment filenames
And while we're at it, ensure searching inside displayable attachment bodies works.
2016-09-09 search: match the behavior of WWW for indexing text
The basic rule is that if it is displayable via our WWW interface, it should be indexable text for Xapian search.
2016-09-09 search: avoid mindlessly calling body_set
It's not worth entering a complex codepath in Email::MIME to save some (probably immeasurable amount of) memory, here. We've already stopped doing this in our WWW code a while back, too. If we really cared enough about it, we'd prioritize work on a streaming replacement for Email::MIME.
2016-09-09 search: fix compatibility with Debian wheezy
Specifying the "d:" field only worked for NumberValueRangeProcessor in older versions of Xapian, such as the one in Debian wheezy (libsearch-xapian-perl=1.2.10.0-1) This slipped through since I rarely use wheezy, anymore, and perhaps nobody else does, either. Perhaps wheezy support may be dropped, soon. Unfortunately, this requires a schema version bump.
2016-09-09 search: increase term positions for each quoted hunk
We pay a storage cost for storing positional information in Xapian, so make good use of it by attempting to preserve it for (hopefully) better search results.
2016-09-09 search: match quote detection behavior of view
This is stricter than the mutt quote_regexp default ("^([ \t]*[|>:}#])+" on Debian jessie), but matches what we have in View.pm. I prefer the stricter quote detection since it is less ambiguous and less likely to hide/obscure important details.
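To see why a stricter rule is less ambiguous, the sketch below compares the mutt default quoted above with a hypothetical strict rule. The strict pattern (a line that begins with ">") is an assumption made for illustration; the commit message does not spell out the exact expression used in View.pm.

```python
import re

# mutt's default quote_regexp on Debian jessie, as quoted in the commit message.
MUTT_QUOTE = re.compile(r"^([ \t]*[|>:}#])+")

# Assumed stricter rule for this sketch: only a leading ">" marks quoted text.
STRICT_QUOTE = re.compile(r"^>")

lines = ["> quoted reply", "#include <stdio.h>", ":wq", "  > indented", "plain text"]

# Collect which lines each rule would treat as quoted.
mutt_hits = [ln for ln in lines if MUTT_QUOTE.match(ln)]
strict_hits = [ln for ln in lines if STRICT_QUOTE.match(ln)]
```

Under the looser rule, a C preprocessor line or an editor command would be classified as quoted text, which is the kind of detail a stricter rule avoids hiding.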
2016-09-09 search: fix space regressions from recent changes
As of Xapian 1.0.4 (from 2007), it is possible to use Search::Xapian::QueryParser::add_prefix multiple times with the same user field name but different term prefixes. This brings my current git@vger mirror from 6.5GB to 2.1GB (both sizes are after xapian-compact).
2016-09-09 search: more granular message body searching
"bs:" and "b:" are adapted from mairix(1) We will also support searching explicitly for quoted vs non-quoted text via "q:" and "nq:" prefixes since sometimes readers will not care for quoted text. In the future, we will support parsing diffs (perhaps when repobrowse integration is complete). Note: this roughly doubles the size of the Xapian database due to the additional information; so this change may not be worth it.
2016-09-09 search: allow searching user fields (To/Cc/From)
Sometimes it can be useful to search based on who the message was sent to, sent by, or Cc:-ed. Of course, headers can be faked, but they usually are not... Anyways this mostly matches the behavior of mairix(1).
2016-08-16 search: add YYYYMMDD search range via "d:" prefix
This is similar to mairix in that it uses a "d:" prefix; but only takes YYYYMMDD, for now. Using custom date/time parsers via Perl will be much more work: nntp://news.gmane.org/20151005222157.GE5880@survex.com Anyhow, this ought to be more human-friendly than searching by Unix timestamps, but it requires reindexing to take advantage of.
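A YYYYMMDD range term could be interpreted along these lines. This is a sketch under stated assumptions: the `..` separator follows Xapian's range syntax, open-ended ranges are assumed to be allowed, and `parse_date_range` and `in_range` are hypothetical helpers rather than public-inbox functions.

```python
from datetime import date, datetime


def parse_ymd(text):
    """Parse a YYYYMMDD string into a date."""
    return datetime.strptime(text, "%Y%m%d").date()


def parse_date_range(term):
    """Turn 'd:YYYYMMDD..YYYYMMDD' into (start, end); either side may be empty."""
    if not term.startswith("d:"):
        raise ValueError("not a d: term")
    start, sep, end = term[2:].partition("..")
    if not sep:
        raise ValueError("expected a '..' range")
    return (parse_ymd(start) if start else None,
            parse_ymd(end) if end else None)


def in_range(day, start, end):
    """Check whether a date falls inside an inclusive, possibly open-ended range."""
    return (start is None or start <= day) and (end is None or day <= end)


start, end = parse_date_range("d:20160101..20160816")
```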
2016-08-14 searchidx: do not release Xapian lock while (only) Msgmap is indexing
SQLite might index quickly, so we hold the lock used by Xapian for the duration. This probably needs to be reworked entirely, actually.
2016-08-11 search: support alt-ID for mapping legacy serial numbers
For some existing mailing list archives, messages are identified by serial number (such as NNTP article numbers in gmane). Those links may become inaccessible (as is the current case for gmane), so ensure users can still search based on old serial numbers.

Now, I run the following periodically to get article numbers from gmane (while news.gmane.org remains):

	NNTPSERVER=news.gmane.org
	export NNTPSERVER
	GROUP=gmane.comp.version-control.git
	perl -I lib scripts/xhdr-num2mid $GROUP --msgmap=/path/to/gmane.sqlite3

(I might integrate this further with public-inbox-* scripts one day).

My ~/.public-inbox/config has an added "altid" snippet which now looks like this:

	[publicinbox "git"]
		address = git@vger.kernel.org
		mainrepo = /path/to/git.vger.git
		newsgroup = inbox.comp.version-control.git
		; relative pathnames expand to $mainrepo/public-inbox/$file
		altid = serial:gmane:file=gmane.sqlite3

And run "public-inbox-index --reindex /path/to/git.vger.git" periodically. This ought to allow searching for "gmane:12345" to work for Xapian-enabled instances.

Disclaimer: while public-inbox supports NNTP and stable article serial numbers, use of those for public links is discouraged since it encourages centralization.
2016-08-10 searchidx: allow searching Message-IDs in free-form text
It is not unheard of for users to attempt finding messages by entering Message-IDs into the "Search" box instead of using the existing URL structure. So make it possible for them. Fwiw, I've definitely encountered users who enter entire URLs into generic search engines.
2016-08-09 searchidx: avoid holding Xapian lock in cat-file
We must ensure the cat-file process is launched before Xapian grabs the lock, too. Our use of "git cat-file --batch" has the same problem as "git log" did, which was fixed in commit 3713c727cda431a0dc2865a7878c13ecf9f21851 ("searchidx: release Xapian FDs before spawning git log").
2016-08-09 searchidx: release Xapian FDs before spawning git log
This will allow us to release and re-acquire Xapian locks due to the lack of FD_CLOEXEC on some FDs.
2016-08-09 searchidx: persist the PublicInbox::Git object
We can cheaply keep the object around nowadays since it spawns expensive processes only on an as-needed basis.
2016-08-09 searchidx: remove unused $git parameters
We do not need to pass the PublicInbox::Git object to various callbacks.
2016-08-05 search: disable batching in newer versions of Xapian, for now
This warrants further investigation, but it appears we cannot release Xapian reliably after forking "git log" due to the lack of a close-on-exec flag on the Xapian flintlock FD.
2016-08-04 searchmsg: add git object ID to doc_data
Doing git tree lookups based on the SHA-1 of the Message-ID is expensive as trees get larger; instead, use the SHA-1 object ID directly. This drastically reduces the amount of time spent in the "git cat-file --batch" process for fetching the /$INBOX/all.mbox.gz endpoint on the ~800MB git@vger.kernel.org mirror. This retains backwards compatibility and allows existing indices to be transparently upgraded without performance degradation.
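The two kinds of identifier can be contrasted in a short sketch. The 2/38 path split in `mid_path` is an assumption about the on-disk layout made for illustration, and both function names are hypothetical; `git_blob_oid` follows git's documented blob hashing (SHA-1 over a `blob <size>\0` header plus the content).

```python
import hashlib


def mid_path(message_id):
    """Tree path derived from the Message-ID (assumed 2/38 hex split)."""
    digest = hashlib.sha1(message_id.encode("utf-8")).hexdigest()
    # Resolving this path means walking the tree, which grows with the archive.
    return digest[:2] + "/" + digest[2:]


def git_blob_oid(content):
    """Object ID git assigns to a blob holding the given bytes."""
    header = b"blob %d\0" % len(content)
    # With the object ID stored in the search document, the blob can be
    # requested directly without any tree lookup.
    return hashlib.sha1(header + content).hexdigest()


path = mid_path("20161111124541.8216-17-vascomalmeida@sapo.pt")
empty_oid = git_blob_oid(b"")
```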
2016-08-02 search: improve reindexing behavior
For reindexing, fresh Xapian DBs do not count as a reindex, allowing users to blindly use --reindex on the first run on a clean repo. While we're at it, allow indexing to override HEAD ref for multi-head git repos.
2016-07-31 search: support reindexing existing search indices
This should make tweaking the way we search more efficient by allowing us to avoid destroying the index every time we want to change something. We also give priority to incremental indexing via public-inbox-{watch,mda} and have manual invocations of public-inbox-index perform batch updates while releasing ssoma.lock.
2016-07-31 msgmap: fix use of transactions
We want transactions to be the responsibility of the caller when possible; this fixes the potential for the msgmap to internally become inconsistent when using it from inside searchidx.
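The caller-owned transaction pattern can be sketched with Python's `sqlite3` module. The table layout and the `mid_insert` helper are hypothetical and are not the Msgmap.pm schema; the point is only that the helper never commits on its own.

```python
import sqlite3


def mid_insert(conn, mid):
    """Insert one Message-ID; deliberately does NOT commit."""
    cur = conn.execute("INSERT INTO msgmap (mid) VALUES (?)", (mid,))
    return cur.lastrowid


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE msgmap ("
    "num INTEGER PRIMARY KEY AUTOINCREMENT, mid TEXT UNIQUE NOT NULL)"
)
conn.commit()

# The caller owns the transaction: either the whole batch lands or none of it.
try:
    with conn:  # commits on success, rolls back if an exception escapes
        mid_insert(conn, "a@example.com")
        mid_insert(conn, "b@example.com")
        mid_insert(conn, "a@example.com")  # duplicate, raises IntegrityError
except sqlite3.IntegrityError:
    pass

count_after_failure = conn.execute("SELECT COUNT(*) FROM msgmap").fetchone()[0]

with conn:
    mid_insert(conn, "a@example.com")
    mid_insert(conn, "b@example.com")

count_after_success = conn.execute("SELECT COUNT(*) FROM msgmap").fetchone()[0]
```

Had the helper committed after each insert, the failed batch would have left two rows behind, which is the kind of internal inconsistency the commit message describes.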
2016-06-26 inbox: ensure we do not show leading "From " lines
Some messages will be misimported due to an old bug; clean them up and ensure we do not propagate the mistake. Followup-to: a0c07cba0e5d ("mda: drop leading "From " lines again")
2016-06-21 searchidx: merge old thread id from ghosts
We failed to discard old thread IDs when vivifying ghosts due to out-of-order message arrival. This rectifies the failure and will trigger a re-index.
2016-06-21 searchidx: simplify ghost creation
Remove some worthless parameters and redundant no-ops to make the next (important) patch easier-to-review.
2016-06-17 searchidx: disable Email::MIME::ContentType::STRICT_PARAMS
Disable this since we handle imperfect data from an imperfect world.
2016-05-21 localize $/ in more places to avoid potential problems
This hopefully makes the intent of the code clearer, too. The HTTP use of the numeric reference for getline caused problems in Git.pm, already.
2016-05-19 switch read-only uses of walk_parts to msg_iter
msg_iter lets us know the index of the attachment, allowing us to make more sensible labels and, in a future commit, hyperlinks to download attachments.
2016-03-03 use raw header for Message-ID
Message-IDs should not be MIME encoded, but in case they are, use the raw form for compatibility with ssoma and possibly other tools. This prevents a potential problem where a malicious client could confuse our storage layer into indexing incorrect contents.
2016-02-28 reduce calls to close unless error checks are needed
We can rely on timely auto-destruction based on reference counting, reducing the chance of redundant close(2) calls which may hit the wrong FD. We do care about certain close calls (e.g. writing to a buffered IO handle) if we require error-checking for write-integrity. In other cases, let things go out-of-scope so they can be freed automatically after use.
2016-02-28 searchidx: use defined for checking EOF behavior
While empty or "0" should never appear, this allows the reviewer to think and know less about the context in which this check is done.
2015-12-22 rename 'GitCatFile' package to 'Git'
We'll be using it for more than just cat-file. Adding a `popen' API for internal use allows us to save a bunch of code in other places.
2015-11-20 various internal documentation updates
Hopefully this gives new hackers a better overview of how the components relate to each other.
2015-10-03 drop Message-IDs longer than 244 bytes
Xapian has this limit for terms, and there are likely no legitimate Message-IDs (or single header lines) this long; so there's no need to work around this limit.
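A length check of this kind has to count bytes, not characters. The sketch below is illustrative only; `MAX_MID_SIZE` and `indexable_mid` are hypothetical names, and UTF-8 is assumed as the encoding.

```python
MAX_MID_SIZE = 244  # bytes, the limit named in the commit message


def indexable_mid(mid):
    """Return True if the Message-ID fits within the byte limit."""
    # Measure the encoded length, since the limit is on bytes.
    return len(mid.encode("utf-8")) <= MAX_MID_SIZE


short_ok = indexable_mid("a" * 244)
too_long = indexable_mid("a" * 245)
# 123 two-byte characters are 246 bytes, so this is dropped despite being short.
multibyte = indexable_mid("\u00e9" * 123)
```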
2015-10-02 rename mid_compress to id_compress
We use it as a general compressor for identifiers such as subject paths, so using the "mid_" prefix probably is not appropriate.
2015-10-01 searchidx: subject is not a term
Sometimes subjects are excessively long and hit Xapian's 245-byte term limit. We can still perform subject-only searches with a probabilistic prefix.
2015-09-30 nntp: implement OVER/XOVER summary in search document
The document data of a search message already contains a good chunk of the information needed to respond to OVER/XOVER commands quickly. Expand on that and use the document data to implement OVER/XOVER quickly. This adds a dependency on Xapian being available for nntpd usage, but is probably alright since nntpd is esoteric enough that anybody willing to run nntpd will also want search functionality offered by Xapian. This also speeds up XHDR/HDR with the To: and Cc: headers and :bytes/:lines article metadata used by some clients for header displays and marking messages as read/unread.
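As an illustration of why stored document data makes OVER/XOVER cheap, the sketch below builds one overview line from a dictionary of already-parsed fields. The field order follows the default overview format of RFC 3977; the dictionary keys and the `over_line` helper are hypothetical, and collapsing tabs and line breaks to a single space is a simplification of the RFC's rules.

```python
import re


def over_line(num, doc):
    """Format one tab-separated overview line from stored fields."""
    fields = [
        str(num),
        doc["subject"],
        doc["from"],
        doc["date"],
        "<%s>" % doc["mid"],
        doc["references"],
        str(doc["bytes"]),
        str(doc["lines"]),
    ]
    # Tabs separate fields, so none may survive inside a field value.
    return "\t".join(re.sub(r"[\t\r\n]+", " ", f) for f in fields)


doc = {
    "subject": "searchidx:\tsubject is not a term",
    "from": "E <e@example.com>",
    "date": "Thu, 01 Oct 2015 00:00:00 +0000",
    "mid": "1@example.com",
    "references": "",
    "bytes": 1234,
    "lines": 20,
}
line = over_line(42, doc)
```

No message needs to be fetched from git or re-parsed to answer the command, since every field comes from the search document.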