about summary refs log tree commit homepage
path: root/lib/PublicInbox/SearchMsg.pm
DateCommit message (Collapse)
2017-02-10search: remove unnecessary abstractions and functionality
This simplifies the code a bit and reduces the translation overhead for looking directly at data from tools shipped with Xapian. While we're at it, fix thread-all.t :)
2017-01-07search: remove subject_summary
Apparently it never actually got used, and the world seems fine without it, so we can drop it. While we're at it, consider removing our subject_path usage from existence, too. We are not using fancy subject-line based URLs, here.
2017-01-07searchmsg: favor direct hash access over accessor methods
This is faster, smaller, and more straighforward to me with fewer layers of indirection.
2017-01-07remove incorrect comment about strftime + locales
We only need strftime to be locale-independent when generating dates for email and HTTP headers. Purely numeric dates can use strftime for ease-of-readability.
2016-12-20searchmsg: remove ensure_metadata
Instead, only preload the ->mid field for threading, as we only need ->thread and ->path once in Search->get_thread (but we will need the ->mid field repeatedly). This more than doubles View->load_results performance on according to thread-all on an inbox with over 300K messages.
2016-12-17searchmsg: do not memoize {date} field
We only generate the ->date once in NNTP, so creating the hash entry is a waste.
2016-12-17searchmsg: remove locale-dependency for ->date
strftime is locale-dependent, which can cause surprising failures for some users.
2016-12-13searchmsg: remove unused EPOCH_822 constant
This hasn't been needed since our Email::Abstract removal for message threading.
2016-10-05thread: remove Email::Abstract wrapping
This roughly doubles performance due to the reduction in object creation and abstraction layers.
2016-08-04searchmsg: add git object ID to doc_data
Doing git tree lookups based on the SHA-1 of the Message-ID is expensive as trees get larger, instead, use the SHA-1 object ID directly. This drastically reduces the amount of time spent in the "git cat-file --batch" process for fetching the /$INBOX/all.mbox.gz endpoint on the ~800MB git@vger.kernel.org mirror This retains backwards compatibility and allows existing indices to be transparently upgraded without performance degradation.
2016-06-25address: remove Address::from_name
Address::names is sufficient to handle what from_name did.
2016-06-20www: improve topic view by scanning for ghosts
This should help avoid having too many fake top-level messages in the topic view since we only have a partial window for threading results.
2016-05-30use utf8::{encode,decode} for in-place transforms
No need to duplicate the string when transforming it; learned from studying SpamAssassin 3.4.1
2016-05-29searchmsg: all timestamps stored in Xapian are UTC
We cannot have strftime using the local timezone for %z. This fixes output when a server is not running UTC.
2016-05-25remove Email::Address dependency
git has stricter requirements for ident names (no '<>') which Email::Address allows. Even in 1.908, Email::Address also has an incomplete fix for CVE-2015-7686 with a DoS-able regexp for comments. Since we don't care for or need all the RFC compliance of Email::Address, avoiding it entirely may be preferable. Email::Address will still be installed as a requirement for Email::MIME, but it is only used by the Email::MIME::header_str_set which we do not use
2016-04-30searchmsg: ensure long subject lines are not broken
Noticed when using a long URL in the subject.
2016-03-12searchmsg: preserve hard tabs, but drop CR (\r)
Hard tabs *may* be searchable, so preserve them since they do not take up any more space than a normal space. However, CR (carriage return) is worthless and likely a sign of a buggy mail (or spam) client anyways.
2016-03-03use raw header for Message-ID
Message-IDs should not be MIME encoded, but in case they are, use the raw form for compatibility with ssoma and possibly other tools. This prevents a potential problem where a malicious client could confuse our storage layer into indexing incorrect contents.
2016-02-28searchmsg: update + fix license header
Not sure how, but this should've always been AGPL-3.0+ like the rest of the code, not GPL-3.0+
2015-11-20various internal documentation updates
Hopefully this gives new hackers a better overview of how the components relate to each other.
2015-09-30nntp: implement OVER/XOVER summary in search document
The document data of a search message already contains a good chunk of the information needed to respond to OVER/XOVER commands quickly. Expand on that and use the document data to implement OVER/XOVER quickly. This adds a dependency on Xapian being available for nntpd usage, but is probably alright since nntpd is esoteric enough that anybody willing to run nntpd will also want search functionality offered by Xapian. This also speeds up XHDR/HDR with the To: and Cc: headers and :bytes/:lines article metadata used by some clients for header displays and marking messages as read/unread.
2015-09-19nntp: implement XROVER, speed up XHDR for some cases
Using Xapian allows us to implement XROVER without forking new processes.
2015-09-06update copyright headers and email addresses
In the future, it should be possible to use this: git ls-files | UPDATE_COPYRIGHT_HOLDER='all contributors' \ UPDATE_COPYRIGHT_USE_INTERVALS=2 \ xargs /path/to/gnulib/build-aux/update-copyright
2015-09-04SearchMsg: avoid encoding Message-IDs
Spaces may be added when using header_str with Email::MIME->create, so use the normal "header" parameter when setting Message-IDs and References.
2015-09-03search: disable Message-ID compression in Xapian
We'll continue to compress long Message-IDs in URLs (which we know about), but we will store entire Message-IDs in the Xapian database to facilitate ease-of-lookups in external databases.
2015-09-01search: reduce redundant doc data
Redundant document data increases our database size, pull the smsg->mid off the unique term, the smsg->ts off the value, and only generate the formatted display date off smsg->ts.
2015-08-28search: do not iterate through entire termlist
A document may have many terms, so this hurts performance if we blindly iterate. Unfortunately, we can't rely on the order of the termlist just yet, either, so we must repeatedly restart the search for now until we're ready to bump schema versions.
2015-08-25search: implement subject summarization
We ought to summarize subjects to avoid exploding line lengths in the web interface.
2015-08-25mid: mid_compressed => mid_compress
Consistently name mid_* functions as verbs.
2015-08-20search: preserve References: order in document data
We need proper ordering of References to thread messages correctly. We would lose this order if we load the terms from the database, so set it directly document data. Do not bother with a separate In-Reply-To, since Mail::Thread just merges the IRT into References. This bumps our schema version once again.
2015-08-16SearchMsg: ensure metadata for ghost messages mid
Ghosts have no document data in them. Perhaps we should just rely on terms for Message-ID and avoid storing that in the document data...
2015-08-15view: display replies in per-message view
This can be used to quickly scan for replies in a message without displaying an entire thread.
2015-08-15search: make search results more OO
This will relieve callers of the need to decode the data we store internally in Xapian
2015-08-13initial search backend implementation
This shall allow us to search for replies/threads more easily.