path: root/lib
Date Commit message
2016-10-05 thread: remove Mail::Thread dependency
Introduce our own SearchThread class for threading messages. This should allow us to specialize and optimize away objects in future commits.
2016-10-05 view: remove "subject dummy" references
We will not care for inexact threading by subject or pruning.
2016-09-13 help: document new search prefixes
Support (and document) 'a:' after all, as "mairix -h" uses it, so this should reduce the learning curve for mairix users.
2016-09-09 nntp: cleanup: move use statements out of sub scope
This clarifies the code somewhat, and we don't care to lazy-load in NNTP.pm anyways since this is only used for a long-lived daemon.
2016-09-09 search: index attachment filenames
And while we're at it, ensure searching inside displayable attachment bodies works.
2016-09-09 search: match the behavior of WWW for indexing text
The basic rule is that if it is displayable via our WWW interface, it should be indexable text for Xapian search.
2016-09-09 search: avoid mindlessly calling body_set
It's not worth entering a complex codepath in Email::MIME to save some (probably immeasurable amount of) memory, here. We've already stopped doing this in our WWW code a while back, too. If we really cared enough about it, we'd prioritize work on a streaming replacement for Email::MIME.
2016-09-09 search: fix compatibility with Debian wheezy
Specifying the "d:" field only worked for NumberValueRangeProcessor in older versions of Xapian, such as the one in Debian wheezy (libsearch-xapian-perl=1.2.10.0-1). This slipped through since I rarely use wheezy anymore, and perhaps nobody else does, either. Perhaps wheezy support may be dropped soon. Unfortunately, this requires a schema version bump.
2016-09-09 search: increase term positions for each quoted hunk
We pay a storage cost for storing positional information in Xapian; make good use of it by attempting to preserve it for (hopefully) better search results.
2016-09-09 search: match quote detection behavior of view
This is stricter than the mutt quote_regexp default ("^([ \t]*[|>:}#])+" on Debian jessie), but matches what we have in View.pm. I prefer the stricter quote detection since it is less ambiguous and less likely to hide/obscure important details.
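To illustrate why the looser pattern is more ambiguous, the sketch below compares the mutt default quoted above with a stricter rule. The stricter pattern (a '>' in the first column) is an assumption for illustration only; the commit message does not spell out the exact pattern used in View.pm.

```python
import re

# mutt's default quote_regexp on Debian jessie, as quoted in the commit message.
MUTT_QUOTE = re.compile(r"^([ \t]*[|>:}#])+")

# Assumed stricter rule (hypothetical, not taken from View.pm):
# a line counts as quoted only if it starts with '>' in the first column.
STRICT_QUOTE = re.compile(r"^>")


def is_quoted(line, pattern):
    """Return True if the line is considered quoted text under `pattern`."""
    return pattern.search(line) is not None


samples = [
    "> quoted reply",
    "# a shell comment",
    ": a shell no-op",
    "plain text",
]

mutt_hits = [s for s in samples if is_quoted(s, MUTT_QUOTE)]
strict_hits = [s for s in samples if is_quoted(s, STRICT_QUOTE)]
```

Under the looser pattern, shell comments and similar lines in a message body would be treated as quoted text, which is the kind of ambiguity the stricter rule avoids.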
2016-09-09 search: fix space regressions from recent changes
As of Xapian 1.0.4 (from 2007) it is possible to use Search::Xapian::QueryParser::add_prefix multiple times with the same user field name but different term prefixes. This brings my current git@vger mirror from 6.5GB to 2.1GB (both sizes are after xapian-compact).
2016-09-09 search: more granular message body searching
"bs:" and "b:" are adapted from mairix(1). We will also support searching explicitly for quoted vs non-quoted text via "q:" and "nq:" prefixes since sometimes readers will not care for quoted text. In the future, we will support parsing diffs (perhaps when repobrowse integration is complete). Note: this roughly doubles the size of the Xapian database due to the additional information, so this change may not be worth it.
2016-09-09 search: drop longer subject: prefix for search
We only document the "s:" prefix anyways. While the long name is more descriptive, the ambiguity makes agnostic caching (by Varnish or similar) slightly harder and longer URLs are more likely to be accidentally truncated when shared.
2016-09-09 search: allow searching user fields (To/Cc/From)
Sometimes it can be useful to search based on who the message was sent to, sent by, or Cc:-ed. Of course, headers can be faked, but they usually are not... Anyways this mostly matches the behavior of mairix(1).
2016-09-08 import: run "git gc --auto" when done
We need to prevent excessive repository growth for public-inbox-watch and public-inbox-mda users.
2016-09-08 import: hoist out common run_die subroutine
We will be reusing this in the next commit, too.
2016-09-08 import: hoist out _check_path function
This reduces duplication, slightly. We may be using it yet again in a to-be-introduced function (or we may not introduce it).
2016-09-08 view: handle missing Content-Type in message
Email::MIME internally assumes "text/plain" for messages missing a Content-Type, but does not expose that in the Email::MIME::content_type API method. We must assume it ourselves to avoid uninitialized value warnings for the rare (nowadays) MUAs which do not set it.
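The fallback described above can be sketched with Python's standard email package standing in for Email::MIME. The helper name content_type_or_default is hypothetical, and this shows only the general idea of supplying the default ourselves, not the actual View.pm code.

```python
from email import message_from_string


def content_type_or_default(msg, default="text/plain"):
    """Return the bare media type, assuming `default` when the header is absent.

    Hypothetical helper: a missing Content-Type header comes back as None,
    so we substitute the default instead of operating on an undefined value.
    """
    raw = msg.get("Content-Type")
    if raw is None:
        return default
    # Drop parameters such as "; charset=utf-8" and normalize case.
    return raw.split(";", 1)[0].strip().lower()


plain = message_from_string("Subject: hi\n\nbody\n")
html = message_from_string("Content-Type: text/HTML; charset=utf-8\n\n<p>x</p>\n")
```

Here content_type_or_default(plain) falls back to "text/plain", while content_type_or_default(html) yields "text/html".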
2016-09-07 doc: new docs for user-level commands
Hopefully more folks can download and run public-inbox, nowadays.
2016-09-02 config: use "publicinboxlimiter" prefix
Just having "limiter" in the prefix may confuse it with something else. Use the full prefix to avoid this confusion.
2016-09-01 watch: use "publicinboxwatch" namespace
We'll keep supporting "publicinboxlearn" indefinitely, but "publicinboxwatch" is probably more appropriate at the moment. Noticed while writing documentation.
2016-08-23 www: give tor2web some exposure, too
Not everybody can run Tor; hopefully more can use Tor2web even if it compromises their privacy. This should help make the system more resilient for users unable to use Tor.
2016-08-18 searchview: link to internal help text
The internal help text links to the Xapian query parser documentation anyways, but also provides information on which prefixes exist.
2016-08-18 www: implement generic help text
Begin documenting some basic help functionality. I may tweak the anchor names of the various HTML endpoints to be more consistent with each other (old ones will be supported for a short while), so I'm not documenting those, for now. This may become part of a builtin key-value store for basic texts, but this probably shouldn't become a wiki engine, either.
2016-08-18 linkify: be stricter about matching RFC 3986
We're not to-the-letter about percent-encoding, but we should allow all the characters. This is mainly so we can effectively use the link to some Wikipedia pages with parentheses in them: https://en.wikipedia.org/wiki/Atom_(standard) https://en.wikipedia.org/wiki/Git_(software)
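A minimal sketch of a link matcher that accepts every character RFC 3986 permits in a URI (unreserved, gen-delims, sub-delims, and '%'), which is what lets the parenthesized Wikipedia URLs above match in full. The pattern and the find_links name are illustrative assumptions, not the regexp used in the linkify code.

```python
import re

# Characters allowed by RFC 3986: unreserved, gen-delims, sub-delims, plus '%'
# for percent-encoding. Percent-encoding itself is not validated here.
URL_RE = re.compile(r"""\bhttps?://[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+""")


def find_links(text):
    """Return every http(s) URL found in `text` (hypothetical helper)."""
    return URL_RE.findall(text)


text = (
    "see https://en.wikipedia.org/wiki/Atom_(standard) and "
    "https://en.wikipedia.org/wiki/Git_(software) for details"
)
links = find_links(text)
```

One trade-off of being this permissive is that trailing punctuation directly after a URL (a closing parenthesis or a period ending a sentence) is also matched, so a real linkifier would likely need extra handling for that case.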
2016-08-18 view: try assuming UTF-8 for bogus charsets
For some reason, Alpine will set X-UNKNOWN for valid UTF-8. Since we favor UTF-8 HTML anyways, try forcing Email::MIME to handle text/plain as UTF-8 which might show up better. At least this change renders <alpine.DEB.2.20.1608131214070.4924@virtualbox> properly by showing "•" (&#8226;) instead of "â ¢" (&#226;&#128;&#162;) Reported-by: Thomas Ferris Nicolaisen <tfnico@gmail.com>
2016-08-18 view: try to display bogus charsets for text/plain
Alpine seems to set charset=X-UNKNOWN for valid UTF-8 text, which causes Email::MIME::body_str to fail as X-UNKNOWN is not a valid encoding. So, blindly display the body as plain-text but warn users about possibly mangled text. Reported-by: Thomas Ferris Nicolaisen <tfnico@gmail.com>
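The two charset changes above amount to a decode-with-fallback strategy. The sketch below shows one way to express it in Python; the decode_body name, the warning strings, and the final Latin-1 fallback are assumptions for illustration and do not reflect how View.pm or Email::MIME implement it.

```python
import codecs


def decode_body(raw, charset):
    """Decode `raw` bytes, returning (text, warning).

    Hypothetical helper: try the declared charset first, then assume UTF-8,
    then fall back to a lossless byte-to-character mapping with a warning.
    """
    try:
        codecs.lookup(charset)  # raises LookupError for e.g. "X-UNKNOWN"
        return raw.decode(charset), None
    except (LookupError, UnicodeDecodeError):
        pass
    try:
        return raw.decode("utf-8"), "unknown charset %r, assumed UTF-8" % charset
    except UnicodeDecodeError:
        # Latin-1 never fails to decode, but the result may be mangled.
        return raw.decode("latin-1"), "unknown charset %r, text may be mangled" % charset


text, warning = decode_body("\u2022".encode("utf-8"), "X-UNKNOWN")
```

With the Alpine case described above, the bullet character decodes correctly and a warning is still returned so the reader knows a guess was made.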
2016-08-18 view: attach_link uses string concatenation
There is no point in using an array to join on an empty string (my original intention was probably to join on "\n"). This is only preparation for the next change to show a warning in the attachment link.
2016-08-16 search: add YYYYMMDD search range via "d:" prefix
This is similar to mairix in that it uses a "d:" prefix; but only takes YYYYMMDD, for now. Using custom date/time parsers via Perl will be much more work: nntp://news.gmane.org/20151005222157.GE5880@survex.com Anyhow, this ought to be more human-friendly than searching by Unix timestamps, but it requires reindexing to take advantage of.
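One reason YYYYMMDD stamps suit a string-based range processor is that fixed-width date strings sort in the same order as the dates themselves. The sketch below parses a range term and checks membership with plain string comparison. It does not use Xapian; the "d:START..END" form, the open-ended defaults, and the function names are assumptions for illustration.

```python
import re
from datetime import datetime

# Assumed term syntax: "d:YYYYMMDD..YYYYMMDD", with either end optional.
RANGE_RE = re.compile(r"^d:(\d{8})?\.\.(\d{8})?$")


def parse_date_range(term):
    """Return (lo, hi) as YYYYMMDD strings; raise ValueError on bad input."""
    m = RANGE_RE.match(term)
    if m is None:
        raise ValueError("not a d: range: %r" % term)
    for stamp in m.groups():
        if stamp is not None:
            datetime.strptime(stamp, "%Y%m%d")  # rejects e.g. month 13
    lo = m.group(1) or "00000000"
    hi = m.group(2) or "99999999"
    return lo, hi


def in_range(stamp, lo, hi):
    """Fixed-width YYYYMMDD strings compare lexically in date order."""
    return lo <= stamp <= hi


lo, hi = parse_date_range("d:20160101..20160816")
```

For example, in_range("20160816", lo, hi) is true while in_range("20160817", lo, hi) is false, with no timestamp arithmetic involved.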
2016-08-16 search: drop pointless range processors for Unix timestamp
The Unix timestamp isn't meaningful for users searching; we will start indexing the YYYYMMDD date stamp which may use StringValueRangeProcessor, instead.
2016-08-15 import: use common address parsing to drop unnecessary quotes
Not sure why or how I missed this before; but the common address parsing routine we have should be more correct. Add a test to ensure excessively quoted names don't make it through, either.
2016-08-14 www: do not double-clean Message-IDs from internal DBs
Ensure we usually strip one level of '<>' from Message-IDs, since our internal SQLite, Xapian, and SHA-1 storage all assume that. Realistically, we screw up if somebody has '<<' or '>>', but those are screwed up mail clients and we can deal with it another time. Currently, this means some messages with '>>' in References or Message-Id are not handled correctly, yet, but we match the behavior of Mail::Thread in keeping the extra '>'.
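The hazard of cleaning twice can be shown with a small sketch. The mid_clean function here is a hypothetical stand-in that strips exactly one enclosing '<' and '>' pair; it is not the project's implementation, and its handling of unbalanced brackets is an assumption.

```python
def mid_clean(mid):
    """Strip one enclosing level of '<' and '>' from a Message-ID.

    Hypothetical helper for illustration only.
    """
    mid = mid.strip()
    if mid.startswith("<") and mid.endswith(">"):
        return mid[1:-1]
    return mid


stored = mid_clean("<<a@b>>")      # what the internal DB would hold
double_cleaned = mid_clean(stored)  # cleaning a value read back from the DB
```

Because values read back from the internal databases are already cleaned once, cleaning them again turns "<a@b>" into "a@b", which no longer matches the stored key.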
2016-08-14 www: do not unnecessarily escape some chars in paths
Based on reading RFC 3986, it seems '@', ':', '!', '$', '&', "'", '(', ')', '*', '+', ',', ';', '=' are all allowed in path-absolute where we have the Message-ID. In any case, it seems '@' is fairly common in path components nowadays and too common in Message-IDs.
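A minimal sketch of escaping a Message-ID for use as a single path component while leaving the characters listed above untouched, using Python's urllib.parse.quote. The mid_to_path name and the sample Message-IDs are hypothetical.

```python
from urllib.parse import quote

# Characters from the list above that may appear unescaped in a path segment.
# '/' is deliberately absent so it is still percent-encoded.
PATH_SAFE = "@:!$&'()*+,;="


def mid_to_path(mid):
    """Percent-encode `mid` for one URL path component (hypothetical helper)."""
    return quote(mid, safe=PATH_SAFE)


readable = mid_to_path("20160814.GA1234@example.com")
escaped = mid_to_path("a/b c@x")
```

The common case stays readable because '@' is preserved, while characters that would change the path structure, such as '/' and spaces, are still escaped.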
2016-08-14 www: ensure XML validity for some odd ASCII chars
I've seen 0x1b (\e) in at least one message and some other possibly non-printable chars. In any case, make sure they're valid XML with us-ascii encoding as far as xmlstarlet(1) is concerned.
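XML 1.0 forbids most C0 control characters (everything below 0x20 except tab, newline, and carriage return), so text containing 0x1b cannot be emitted as-is. The sketch below shows one way to produce us-ascii output that an XML parser accepts. Replacing the offending characters with '?' is an assumption for illustration; the commit message does not say whether they are replaced, dropped, or encoded.

```python
import re
from xml.sax.saxutils import escape

# C0 controls that XML 1.0 does not allow (tab, LF, and CR are permitted).
_XML_INVALID = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def xml_ascii(text, replacement="?"):
    """Return `text` as us-ascii character data safe to embed in XML.

    Hypothetical helper: invalid controls are replaced, markup characters
    are escaped, and non-ASCII characters become numeric references.
    """
    cleaned = _XML_INVALID.sub(replacement, text)
    return escape(cleaned).encode("ascii", "xmlcharrefreplace").decode("ascii")


safe = xml_ascii("a\x1b<b>\u2022")
```

The result contains only ASCII bytes, with the escape character replaced and the bullet written as a numeric character reference.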
2016-08-14 mid: no wide characters for sha1_hex
Apparently there are some really screwed up In-Reply-To fields out there.
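Hash functions operate on bytes, so a string holding characters outside the byte range has to be encoded before it is hashed. The Python sketch below shows the general idea using hashlib; the mid_sha1 name and the choice of UTF-8 are assumptions, and this is not the Perl code from the commit.

```python
import hashlib


def mid_sha1(mid):
    """Return the hex SHA-1 of `mid`, encoding to UTF-8 bytes first.

    Hypothetical helper: encoding up front avoids errors when the string
    contains characters that do not fit in a single byte.
    """
    return hashlib.sha1(mid.encode("utf-8")).hexdigest()


ascii_digest = mid_sha1("abc")
wide_digest = mid_sha1("\u2022@example")
```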
2016-08-14 search: gracefully handle lookup_message failure
We can't blindly assume a ghost even exists in the DB, as the rules can change internally for some corner-case Message-IDs.
2016-08-14 view: remove redundant pre closing tag
2016-08-14 view: allow for missing In-Reply-To mapping
Because buggy mail clients exist and generate invalid In-Reply-To headers we cannot handle across the board...
2016-08-14 searchidx: do not release Xapian lock while (only) Msgmap is indexing
SQLite might index quickly, so we hold the lock used by Xapian for the duration. This probably needs to be reworked entirely, actually.
2016-08-14 newswww: include body text in 404 response
Some browsers do not give any indication of the HTTP error code on errors, so show the error text to the user like we do in the top-level WWW module.
2016-08-13 extmsg: reorder and add more Message-ID lookup services
gmane is down at the moment, so lower that in priority (hopefully it will be brought back up, again). Wikipedia also lists a few more project-specific list providers, so include those as well: https://en.wikipedia.org/wiki/Message-ID
2016-08-12 watch: respect altid for incremental watch changes
We need to pass the Inbox object to SearchIdx to get altid mappings properly for incremental imports. TODO: use the Inbox object in more places where it makes sense to do so.
2016-08-12 www: allow including links to NNTP sites in HTML footer
Improve the discoverability of NNTP endpoints for users who still know what NNTP is.
==> ~/.public-inbox/config <==
    ; aliases for the locally-run nntpd can be specified in
    ; the "publicinbox" section:
    [publicinbox]
        nntpserver = nntp://ou63pmih66umazou.onion/
        nntpserver = news.public-inbox.org
        ; NNTPS is not supported natively, yet,
        ; but one can use haproxy or similar
        ; nntpserver = nntps://news.public-inbox.invalid/
    ; mirrors for specific inboxes may be specified either as full
    ; NNTP (or NNTPS) URLs, or with the server name only if the
    ; newsgroup name is specified for a local NNTP server
    [publicinbox "git"]
        ...
        newsgroup = inbox.a.b.c
        nntpmirror = nntp://czquwvybam4bgbro.onion/
        nntpmirror = hjrcffqmbrq6wope.onion
        ; there may be a mirror on a different server with a
        ; different name:
        nntpmirror = nntp://news.example.com/differently.named.group
    ; (And I really need to write manpages for all this...)
2016-08-12 config: do not nest multi-value altid arrays
Oops. We will inevitably need to support multiple altids for a public-inbox one day.
2016-08-11 search: support alt-ID for mapping legacy serial numbers
For some existing mailing list archives, messages are identified by serial number (such as NNTP article numbers in gmane). Those links may become inaccessible (as is the current case for gmane), so ensure users can still search based on old serial numbers.
Now, I run the following periodically to get article numbers from gmane (while news.gmane.org remains):
    NNTPSERVER=news.gmane.org
    export NNTPSERVER
    GROUP=gmane.comp.version-control.git
    perl -I lib scripts/xhdr-num2mid $GROUP --msgmap=/path/to/gmane.sqlite3
(I might integrate this further with public-inbox-* scripts one day).
My ~/.public-inbox/config has an added "altid" snippet which now looks like this:
    [publicinbox "git"]
        address = git@vger.kernel.org
        mainrepo = /path/to/git.vger.git
        newsgroup = inbox.comp.version-control.git
        ; relative pathnames expand to $mainrepo/public-inbox/$file
        altid = serial:gmane:file=gmane.sqlite3
I then run "public-inbox-index --reindex /path/to/git.vger.git" periodically. This ought to allow searching for "gmane:12345" to work for Xapian-enabled instances.
Disclaimer: while public-inbox supports NNTP and stable article serial numbers, use of those for public links is discouraged since it encourages centralization.
2016-08-10 searchidx: allow searching Message-IDs in free-form text
It is not unheard of for users to attempt finding messages by entering Message-IDs into the "Search" box instead of using the existing URL structure. So make it possible for them. Fwiw, I've definitely encountered users who enter entire URLs into generic search engines.
2016-08-09 www: avoid misinterpreting '&' and ';' in query parameters
Oops, we must unescape each key=value pair in a QUERY_STRING individually; otherwise we cannot interpret '&' or ';' in query parameter values.
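The ordering issue can be sketched as follows: split the raw QUERY_STRING on the separators first, then unescape each key and value on its own. The parse_query name is hypothetical and this is a simplified illustration (repeated keys simply overwrite), not the WWW module's parser.

```python
import re
from urllib.parse import unquote_plus


def parse_query(query_string):
    """Parse a raw QUERY_STRING into a dict (hypothetical helper).

    Splitting happens before unescaping, so an encoded '&' (%26) or
    ';' (%3B) inside a value is never mistaken for a separator.
    """
    params = {}
    for pair in re.split(r"[&;]", query_string):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        params[unquote_plus(key)] = unquote_plus(value)
    return params


parsed = parse_query("q=a%26b&x=1%3B2")
```

Unescaping the whole string first would instead produce "q=a&b&x=1;2", which then splits into four bogus pieces.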
2016-08-09 searchidx: avoid holding Xapian lock in cat-file
We must ensure the cat-file process is launched before Xapian grabs the lock, too. Our use of "git cat-file --batch" has the same problem as "git log" did (which was fixed in commit 3713c727cda431a0dc2865a7878c13ecf9f21851, "searchidx: release Xapian FDs before spawning git log").
2016-08-09 searchidx: release Xapian FDs before spawning git log
Some FDs lack FD_CLOEXEC, so a spawned git log would inherit them. Releasing them before spawning will allow us to release and re-acquire Xapian locks.
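For background on the flag mentioned above: a file descriptor without FD_CLOEXEC stays open in any child process that is spawned, so a lock tied to that descriptor can outlive the parent's attempt to release it. The sketch below only shows how the flag can be inspected and set on a descriptor we own, using Python's fcntl module on a Unix-like system. It is not the approach taken in the commit, which releases the descriptors before spawning, and the helper names are hypothetical.

```python
import fcntl
import os


def is_cloexec(fd):
    """Return True if `fd` will be closed automatically on exec."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)


def set_cloexec(fd):
    """Set FD_CLOEXEC on `fd` so spawned programs do not inherit it."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)


# Create a pipe and mark the read end inheritable to mimic a descriptor
# that was opened without FD_CLOEXEC.
read_fd, write_fd = os.pipe()
os.set_inheritable(read_fd, True)
before = is_cloexec(read_fd)
set_cloexec(read_fd)
after = is_cloexec(read_fd)
os.close(read_fd)
os.close(write_fd)
```

When a library opens descriptors internally and does not set the flag, the caller cannot easily apply this fix, which is presumably why closing them before spawning is the simpler option.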
2016-08-09 searchidx: persist the PublicInbox::Git object
We can cheaply keep the object around nowadays since it spawns expensive processes only on an as-needed basis.