path: root/lib/PublicInbox/SearchIdx.pm
Date        Commit message
2016-07-31 msgmap: fix use of transactions
We want transactions to be the responsibility of the caller when possible; this fixes the potential for the msgmap to internally become inconsistent when using it from inside searchidx.
2016-06-26 inbox: ensure we do not show leading "From " lines
Some messages will be misimported due to an old bug; clean them up and ensure we do not propagate the mistake. Followup-to: a0c07cba0e5d ("mda: drop leading "From " lines again")
2016-06-21 searchidx: merge old thread id from ghosts
We failed to discard old thread IDs when vivifying ghosts due to out-of-order message arrival. This rectifies the failure and will trigger a re-index.
2016-06-21 searchidx: simplify ghost creation
Remove some worthless parameters and redundant no-ops to make the next (important) patch easier to review.
2016-06-17 searchidx: disable Email::MIME::ContentType::STRICT_PARAMS
Disable this since we handle imperfect data from an imperfect world.
2016-05-21 localize $/ in more places to avoid potential problems
This hopefully makes the intent of the code clearer, too. The HTTP use of the numeric reference for getline caused problems in Git.pm, already.
2016-05-19 switch read-only uses of walk_parts to msg_iter
msg_iter lets us know the index of the attachment, allowing us to make more sensible labels and, in a future commit, hyperlinks to download attachments.
2016-03-03 use raw header for Message-ID
Message-IDs should not be MIME encoded, but in case they are, use the raw form for compatibility with ssoma and possibly other tools. This prevents a potential problem where a malicious client could confuse our storage layer into indexing incorrect contents.
2016-02-28 reduce calls to close unless error checks are needed
We can rely on timely auto-destruction based on reference counting, reducing the chance of redundant close(2) calls which may hit the wrong FD. We do care about certain close calls (e.g. writing to a buffered IO handle) if we require error-checking for write-integrity. In other cases, let things go out-of-scope so they can be freed automatically after use.
2016-02-28 searchidx: use defined for checking EOF behavior
While empty or "0" should never appear, this allows the reviewer to think and know less about the context in which this check is done.
2015-12-22 rename 'GitCatFile' package to 'Git'
We'll be using it for more than just cat-file. Adding a `popen' API for internal use allows us to save a bunch of code in other places.
2015-11-20 various internal documentation updates
Hopefully this gives new hackers a better overview of how the components relate to each other.
2015-10-03 drop Message-IDs longer than 244 bytes
Xapian has this limit for terms, and there are likely no legitimate Message-IDs (or single header lines) this long; so there's no need to work around this limit.
2015-10-02 rename mid_compress to id_compress
We use it as a general compressor for identifiers such as subject paths, so using the "mid_" prefix probably is not appropriate.
2015-10-01 searchidx: subject is not a term
Sometimes subjects are excessively long and hit Xapian's 245-byte term limit. We can still perform subject-only searches with a probabilistic prefix.
2015-09-30 nntp: implement OVER/XOVER summary in search document
The document data of a search message already contains a good chunk of the information needed to respond to OVER/XOVER commands quickly. Expand on that and use the document data to implement OVER/XOVER quickly. This adds a dependency on Xapian being available for nntpd usage, but is probably alright since nntpd is esoteric enough that anybody willing to run nntpd will also want search functionality offered by Xapian. This also speeds up XHDR/HDR with the To: and Cc: headers and :bytes/:lines article metadata used by some clients for header displays and marking messages as read/unread.
2015-09-25 searchidx: remove unused sub: next_doc_id
It seems like it was never used.
2015-09-15 searchidx: sync Msgmap database along with Xapian
We can avoid duplicating work of extracting messages from git if we tie this to Xapian. Of course, this ties the two features together, but it's probably reasonable to expect that anybody who wants to use public-inbox to serve messages to front-end users will have both.
2015-09-15 searchidx: hoist out rlog code
We'll be reusing this for loading msgmap.
2015-09-06 update copyright headers and email addresses
In the future, it should be possible to use this:
    git ls-files | UPDATE_COPYRIGHT_HOLDER='all contributors' \
      UPDATE_COPYRIGHT_USE_INTERVALS=2 \
      xargs /path/to/gnulib/build-aux/update-copyright
2015-09-03 search: disable Message-ID compression in Xapian
We'll continue to compress long Message-IDs in URLs (which we know about), but we will store entire Message-IDs in the Xapian database to facilitate ease-of-lookups in external databases.
2015-09-01 search: reduce redundant doc data
Redundant document data increases our database size; pull the smsg->mid off the unique term, the smsg->ts off the value, and only generate the formatted display date off smsg->ts.
2015-08-30 search: do not index references and inreplyto terms
We no longer need them, as we can rely on index-time thread resolution and thread merging. This allows us to index less data and hopefully increase efficiency.
2015-08-29 avoid length in boolean context
Perl does not currently optimize for this. ref (from p5p): http://mid.gmane.org/D5C27970-9176-4C7A-8B99-7D78360E67A2@pobox.com
2015-08-25 mid: mid_compressed => mid_compress
Consistently name mid_* functions as verbs.
2015-08-23 cleanup calls to header_obj
Dereference header_obj only once when performance may be critical, or simplify our code by calling "header" directly on the Email::{Simple,MIME} object if not.
2015-08-23 hopefully fix broken permissions for search
We must preserve the umask for the entirety of the indexing operation, as Xapian transactions replace entire files atomically instead of writing them in place.
2015-08-23 search: respect core.sharedRepository in for Xapian DB
Extend the purpose of core.sharedRepository to apply to the $GIT_DIR/public-inbox/xapian* directory.
2015-08-22 search: split search indexing to a separate file
This makes organization easier and reduces the amount of code loaded for a PSGI, mod_perl or CGI instance.