path: root/lib/PublicInbox/Search.pm
Date Commit message
2017-02-10 search: remove unnecessary abstractions and functionality
This simplifies the code a bit and reduces the translation overhead for looking directly at data from tools shipped with Xapian. While we're at it, fix thread-all.t :)
2017-02-06 search: schema version bump for empty References/In-Reply-To
We cannot distinguish between legitimate ghosts and mis-threaded messages before commit 83425ef12e4b65cdcecd11ddcb38175d4a91d5a0 ("searchidx: deal with empty In-Reply-To and References headers") so we must rebuild the index in parallel to fix it.
2017-01-10 introduce PublicInbox::MIME wrapper class
This should fix problems with multipart messages where text/plain parts lack a header. cf. git clone --mirror https://github.com/rjbs/Email-MIME.git refs/pull/28/head In the future, we may still introduce a streaming interface to reduce memory usage on large emails.
2017-01-07 search: remove subject_summary
Apparently it never actually got used, and the world seems fine without it, so we can drop it. While we're at it, consider removing our subject_path usage from existence, too. We are not using fancy subject-line based URLs, here.
2017-01-07 searchmsg: favor direct hash access over accessor methods
This is faster, smaller, and more straightforward to me with fewer layers of indirection.
2016-12-22 search: lookup_mail handles modified DBs
We call lookup_mail all over the place, so be sure we can handle database modifications in those cases.
2016-12-20 searchmsg: remove ensure_metadata
Instead, only preload the ->mid field for threading, as we only need ->thread and ->path once in Search->get_thread (but we will need the ->mid field repeatedly). This more than doubles View->load_results performance according to thread-all on an inbox with over 300K messages.
2016-12-10 search: retry document loading from Xapian
In addition to needing to retry enquire queries, we also need to protect document loading from the Xapian DB and retry on modification, as it seems to throw the same errors. Checking the $@ ref for Search::Xapian::DatabaseModifiedError is actually in the test suite for both the XS and SWIG Xapian bindings, so we should be good as far as forward/backwards compatibility.
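The retry-on-modification pattern described here can be sketched as follows. This is a minimal Python illustration, not the Perl code from the commit: `DatabaseModifiedError`, `retry_reopen`, the `reopen` callback, and the `max_tries` limit are all hypothetical stand-ins for the Xapian exception and the database handle's reopen step.

```python
class DatabaseModifiedError(Exception):
    """Hypothetical stand-in for Search::Xapian::DatabaseModifiedError."""


def retry_reopen(operation, reopen, max_tries=10):
    """Run operation(); on DatabaseModifiedError, call reopen() and retry."""
    for attempt in range(max_tries):
        try:
            return operation()
        except DatabaseModifiedError:
            if attempt == max_tries - 1:
                raise  # give up after the last attempt
            reopen()   # pick up the latest revision before retrying


# Usage: a document load that fails twice because a writer committed.
state = {"failures_left": 2, "reopened": 0}


def load_document():
    if state["failures_left"] > 0:
        state["failures_left"] -= 1
        raise DatabaseModifiedError("database modified")
    return "document data"


def reopen():
    state["reopened"] += 1


result = retry_reopen(load_document, reopen)
```

Wrapping both queries and document loads in one helper keeps the retry policy in a single place, which is the point of the change above.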
2016-12-10 search: always sort thread results in ascending time order
This makes life easier for the threading algorithm, as we can use the implied ordering of timestamps to avoid temporary ghosts and resulting container vivification. This would've also allowed us to hide the bug (in most cases) fixed by the patch titled "thread: last Reference always wins", in case that needs to be reverted due to infinite looping.
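Why ascending time order avoids temporary ghosts can be illustrated with a small sketch. This is a simplified Python model, not the threading code from the commit: the message dictionaries, the `in_reply_to` field, and `count_temporary_ghosts` are hypothetical, and it assumes timestamps reflect reply order.

```python
def count_temporary_ghosts(messages):
    """Count parents that were referenced before their message was seen."""
    seen = set()    # Message-IDs whose message has been processed
    ghosts = set()  # Message-IDs referenced before their message arrived
    for msg in messages:
        parent = msg.get("in_reply_to")
        if parent is not None and parent not in seen:
            ghosts.add(parent)  # a placeholder container would be created
        seen.add(msg["mid"])
    # A ghost is temporary if its real message shows up later in the input.
    return len(ghosts & seen)


thread = [
    {"mid": "c@example.com", "ts": 3, "in_reply_to": "b@example.com"},
    {"mid": "b@example.com", "ts": 2, "in_reply_to": "a@example.com"},
    {"mid": "a@example.com", "ts": 1, "in_reply_to": None},
]

unsorted_ghosts = count_temporary_ghosts(thread)
sorted_ghosts = count_temporary_ghosts(sorted(thread, key=lambda m: m["ts"]))
```

With the input sorted by timestamp, every parent in this example is seen before its reply, so no placeholder has to be created and later filled in.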
2016-09-13 help: document new search prefixes
Support (and document) 'a:' after all, as "mairix -h" uses it, so this should reduce the learning curve for mairix users.
2016-09-09 search: index attachment filenames
And while we're at it, ensure searching inside displayable attachment bodies works.
2016-09-09 search: fix compatibility with Debian wheezy
Specifying the "d:" field only worked for NumberValueRangeProcessor in older versions of Xapian, such as the one in Debian wheezy (libsearch-xapian-perl=1.2.10.0-1). This slipped through since I rarely use wheezy anymore, and perhaps nobody else does, either. Perhaps wheezy support may be dropped soon. Unfortunately, this requires a schema version bump.
2016-09-09 search: fix space regressions from recent changes
As of Xapian 1.0.4 (from 2007), it is possible to use Search::Xapian::QueryParser::add_prefix multiple times with the same user field name but different term prefixes. This brings my current git@vger mirror from 6.5GB to 2.1GB (both sizes are after xapian-compact).
2016-09-09 search: more granular message body searching
"bs:" and "b:" are adapted from mairix(1). We will also support searching explicitly for quoted vs non-quoted text via "q:" and "nq:" prefixes since sometimes readers will not care for quoted text. In the future, we will support parsing diffs (perhaps when repobrowse integration is complete). Note: this roughly doubles the size of the Xapian database due to the additional information, so this change may not be worth it.
2016-09-09 search: drop longer subject: prefix for search
We only document the "s:" anyways. While the long name is more descriptive, the ambiguity makes agnostic caching (by Varnish or similar) slightly harder and longer URLs are more likely to be accidentally truncated when shared.
2016-09-09 search: allow searching user fields (To/Cc/From)
Sometimes it can be useful to search based on who the message was sent to, sent by, or Cc:-ed. Of course, headers can be faked, but they usually are not... Anyways this mostly matches the behavior of mairix(1).
2016-08-18 www: implement generic help text
Begin documenting some basic help functionality. I may tweak the anchor names of the various HTML endpoints to be more consistent with each other (old ones will be supported for a short while), so I'm not documenting those, for now. This may become part of a builtin key-value store for basic texts, but this probably shouldn't become a wiki engine, either.
2016-08-16 search: add YYYYMMDD search range via "d:" prefix
This is similar to mairix in that it uses a "d:" prefix; but only takes YYYYMMDD, for now. Using custom date/time parsers via Perl will be much more work: nntp://news.gmane.org/20151005222157.GE5880@survex.com Anyhow, this ought to be more human-friendly than searching by Unix timestamps, but it requires reindexing to take advantage of.
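A fixed-width YYYYMMDD stamp compares correctly as a plain string, which is what makes a string-based range check workable. The following Python sketch shows the idea only; `yyyymmdd` and `in_date_range` are hypothetical helpers, and the use of UTC is an assumption rather than something stated in the commit.

```python
import time


def yyyymmdd(ts):
    """Format a Unix timestamp as a YYYYMMDD string (UTC assumed)."""
    return time.strftime("%Y%m%d", time.gmtime(ts))


def in_date_range(stamp, start, end):
    """Inclusive range check; works because the stamps are fixed-width digits."""
    return start <= stamp <= end


stamp = yyyymmdd(1471305600)
matches = in_date_range(stamp, "20160801", "20160831")
```

Because every stamp has the same width, lexicographic order and chronological order agree, so no numeric conversion is needed at query time.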
2016-08-16 search: drop pointless range processors for Unix timestamp
The Unix timestamp isn't meaningful for users searching; we will start indexing the YYYYMMDD date stamp, which may use StringValueRangeProcessor, instead.
2016-08-14 search: gracefully handle lookup_message failure
We can't blindly assume a ghost even exists in the DB, as the rules can change internally for some corner-case Message-IDs.
2016-08-11 search: support alt-ID for mapping legacy serial numbers
For some existing mailing list archives, messages are identified by serial number (such as NNTP article numbers in gmane). Those links may become inaccessible (as is the current case for gmane), so ensure users can still search based on old serial numbers.

Now, I run the following periodically to get article numbers from gmane (while news.gmane.org remains):

    NNTPSERVER=news.gmane.org
    export NNTPSERVER
    GROUP=gmane.comp.version-control.git
    perl -I lib scripts/xhdr-num2mid $GROUP --msgmap=/path/to/gmane.sqlite3

(I might integrate this further with public-inbox-* scripts one day).

My ~/.public-inbox/config has an added "altid" snippet which now looks like this:

    [publicinbox "git"]
        address = git@vger.kernel.org
        mainrepo = /path/to/git.vger.git
        newsgroup = inbox.comp.version-control.git
        ; relative pathnames expand to $mainrepo/public-inbox/$file
        altid = serial:gmane:file=gmane.sqlite3

And run "public-inbox-index --reindex /path/to/git.vger.git" periodically. This ought to allow searching for "gmane:12345" to work for Xapian-enabled instances.

Disclaimer: while public-inbox supports NNTP and stable article serial numbers, use of those for public links is discouraged since it encourages centralization.
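The shape of the altid value can be read as type, search prefix, and a file parameter. The Python sketch below only illustrates that reading: `parse_altid` is a hypothetical helper, it handles just the single key=value form shown in the config, and it is not the parser public-inbox uses.

```python
import posixpath


def parse_altid(spec, mainrepo):
    """Split 'type:prefix:key=value' and resolve a relative file path."""
    kind, prefix, param = spec.split(":", 2)
    key, value = param.split("=", 1)
    if key == "file" and not posixpath.isabs(value):
        # Relative pathnames expand to $mainrepo/public-inbox/$file.
        value = posixpath.join(mainrepo, "public-inbox", value)
    return {"type": kind, "prefix": prefix, key: value}


altid = parse_altid("serial:gmane:file=gmane.sqlite3", "/path/to/git.vger.git")
```

Here the `prefix` field is what a user would type in a query such as "gmane:12345".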
2016-06-21 search: support Subject:-less messages
Some mailing lists allow empty Subject headers and we shall support searching and indexing them.
2016-06-21 searchidx: merge old thread id from ghosts
We failed to discard old thread IDs when vivifying ghosts due to out-of-order message arrival. This rectifies the failure and will trigger a re-index.
2016-06-20 www: improve topic view by scanning for ghosts
This should help avoid having too many fake top-level messages in the topic view since we only have a partial window for threading results.
2016-06-19 search: reopen and retry on updated databases
This seems like a nasty thing which breaks downloads of large mailboxes.
2016-06-17 search: increase limit for thread search
Some threads are easily over 100 messages, so the 50 limit is not enough. It is likely that 1000 messages is not enough, either, and we will need to tune our threading to handle more messages and supply options for configurability.
2015-11-20 various internal documentation updates
Hopefully this gives new hackers a better overview of how the components relate to each other.
2015-10-02 rename mid_compress to id_compress
We use it as a general compressor for identifiers such as subject paths, so using the "mid_" prefix probably is not appropriate.
2015-09-30 nntp: implement OVER/XOVER summary in search document
The document data of a search message already contains a good chunk of the information needed to respond to OVER/XOVER commands quickly. Expand on that and use the document data to implement OVER/XOVER quickly. This adds a dependency on Xapian being available for nntpd usage, but is probably alright since nntpd is esoteric enough that anybody willing to run nntpd will also want search functionality offered by Xapian. This also speeds up XHDR/HDR with the To: and Cc: headers and :bytes/:lines article metadata used by some clients for header displays and marking messages as read/unread.
2015-09-30 search: remove get_subject_path
We probably won't be supporting this in the public API.
2015-09-18 read-only NNTP server
Implementing NEWNEWS, XHDR, XOVER efficiently will require additional caching on top of msgmap. This seems to work with lynx and slrnpull, haven't tried clients. DO NOT run in production, yet, denial-of-service vulnerabilities await!
2015-09-15 extmsg: wire up to use msgmap for prefixes
DBI + DBD::SQLite has much better handling of prefix lookups than Xapian. While we're at it, avoid linking blatantly wrong Message-IDs to external services.
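A prefix lookup against an SQLite table can be sketched as follows. This Python example uses the standard `sqlite3` module rather than DBI; the `msgmap` table layout, the `mids_with_prefix` helper, and the sample Message-IDs are assumptions made for illustration, not the schema or code from the commit.

```python
import sqlite3


def mids_with_prefix(conn, prefix, limit=100):
    """Return Message-IDs that start with the given prefix."""
    # Escape LIKE wildcards so the prefix is matched literally.
    escaped = (prefix.replace("\\", "\\\\")
                     .replace("%", "\\%")
                     .replace("_", "\\_"))
    rows = conn.execute(
        "SELECT mid FROM msgmap WHERE mid LIKE ? ESCAPE '\\' "
        "ORDER BY mid LIMIT ?",
        (escaped + "%", limit),
    )
    return [mid for (mid,) in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE msgmap (num INTEGER PRIMARY KEY, mid TEXT UNIQUE)")
conn.executemany(
    "INSERT INTO msgmap (mid) VALUES (?)",
    [
        ("20150915-abc@example.com",),
        ("20150915-abd@example.com",),
        ("20150916-xyz@example.com",),
        ("2015_other@example.com",),
    ],
)

truncated = mids_with_prefix(conn, "20150915-ab")
literal_underscore = mids_with_prefix(conn, "2015_")
```

Note that SQLite's LIKE is case-insensitive for ASCII characters by default, which may or may not be wanted for Message-IDs.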
2015-09-06 update copyright headers and email addresses
In the future, it should be possible to use this:

    git ls-files | UPDATE_COPYRIGHT_HOLDER='all contributors' \
        UPDATE_COPYRIGHT_USE_INTERVALS=2 \
        xargs /path/to/gnulib/build-aux/update-copyright
2015-09-05 extmsg: fall back to partial Message-ID matching
In case a URL gets truncated (as is common with long URLs), we can rely on Xapian for partial matches and bring the user to their destination.
2015-09-05 search: tweak parsing for internal queries
We should not need to use QueryParser for internal queries, but rather for external ones. We'll also be exposing searching Message-IDs with the "mid:" prefix for broken mids on some servers, and enabling partial searching with 'm' to help with URL truncations. Since thread IDs may be volatile, they cannot be exposed to the public, there's no reason to expose them to the query parser, either. Also, add 's:' as an alternative probabilistic prefix to 'subject' as it is shorter.
2015-09-05 search: note why we do not support FLAG_PURE_NOT
Perhaps this can be optionally enabled in the future for smaller sites.
2015-09-05 search: use relevance as secondary sort by default
Might as well give relevance some weight if the timestamp is tied.
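The effect of a secondary sort key can be shown with a small Python sketch. The result dictionaries, the `relevance` scores, and `sort_results` are hypothetical; in Xapian this ordering would be configured on the enquire object rather than done in application code.

```python
def sort_results(results):
    """Newest first; relevance breaks ties between equal timestamps."""
    return sorted(results, key=lambda r: (-r["ts"], -r["relevance"]))


results = [
    {"mid": "a@example.com", "ts": 100, "relevance": 0.2},
    {"mid": "b@example.com", "ts": 200, "relevance": 0.1},
    {"mid": "c@example.com", "ts": 100, "relevance": 0.9},
]

order = [r["mid"] for r in sort_results(results)]
```

The two messages with the same timestamp are ordered by relevance, while the newer message still comes first regardless of its score.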
2015-09-05 view: preliminary HTML search interface
This hopefully makes it easier to find things without resorting to proprietary external services.
2015-09-03 search: disable Message-ID compression in Xapian
We'll continue to compress long Message-IDs in URLs (which we know about), but we will store entire Message-IDs in the Xapian database to facilitate ease-of-lookups in external databases.
2015-09-01 search: show newest results first
Like revision control history, older stuff is less relevant, so favor newer stuff, first.
2015-09-01 search: allow querying all mail with ''
This makes dumping recent topics easier, hopefully.
2015-09-01 search: reduce redundant doc data
Redundant document data increases our database size; pull the smsg->mid off the unique term, the smsg->ts off the value, and only generate the formatted display date off smsg->ts.
2015-08-30 search: do not index references and inreplyto terms
We no longer need them, as we can rely on index-time thread resolution and thread merging. This allows us to index less data and hopefully increase efficiency.
2015-08-29 avoid length in boolean context
Perl does not currently optimize for this. ref (from p5p): http://mid.gmane.org/D5C27970-9176-4C7A-8B99-7D78360E67A2@pobox.com
2015-08-28 search: do not load type into metadata
Our search query already filters out ghost messages, so it's wasteful to have type information loaded.
2015-08-25 search: implement subject summarization
We ought to summarize subjects to avoid exploding line lengths in the web interface.
2015-08-25 mid: mid_compressed => mid_compress
Consistently name mid_* functions as verbs.
2015-08-25 search: only sort by relevance if requested
Many of our internal search queries do not care about relevance, but is used for proper thread displays.
2015-08-22 search: consistently pass options and flags
Most of our special query functions require exact matches, so none of the flags we normally use are necessary for query parsing.
2015-08-22 search: split search indexing to a separate file
This makes organization easier and reduces the amount of code loaded for a PSGI, mod_perl or CGI instance.