public-inbox.git - an "archives first" approach to mailing lists

Date	Commit message (Collapse)
2016-10-05	thread: remove Mail::Thread dependency
	Introduce our own SearchThread class for threading messages. This should allow us to specialize and optimize away objects in future commits.
2016-10-05	view: remove "subject dummy" references
	We will not care for inexact threading by subject or pruning.
2016-09-13	help: document new search prefixes
	Support (and document) 'a:' after all, as "mairix -h" uses it, so this should reduce the learning curve for mairix users.
2016-09-09	nntp: cleanup: move use statements out of sub scope
	This clarifies the code somewhat, and we don't care to lazy-load in NNTP.pm anyways since this is only used for a long-lived daemon.
2016-09-09	search: index attachment filenames
	And while we're at it, ensure searching inside displayable attachment bodies works.
2016-09-09	search: match the behavior of WWW for indexing text
	The basic rule is that if it is displayable via our WWW interface, it should be indexable text for Xapian search.
2016-09-09	search: avoid mindlessly calling body_set
	It's not worth entering a complex codepath in Email::MIME to save some (probably immeasurable amount of) memory, here. We've already stopped doing this in our WWW code a while back, too. If we really cared enough about it, we'd prioritize work on a streaming replacement for Email::MIME.
2016-09-09	search: fix compatibility with Debian wheezy
	Specifying the "d:" field only worked for NumberValueRangeProcessor in older versions of Xapian, such as the one in Debian wheezy (libsearch-xapian-perl=1.2.10.0-1) This slipped through since I rarely use wheezy, anymore, and perhaps nobody else does, either. Perhaps wheezy support may be dropped, soon. Unfortunately, this requires a schema version bump.
2016-09-09	search: increase term positions for each quoted hunk
	We pay a storage cost for storing positional information in Xapian, make good use of it by attempting to preserve it for (hopefully) better search results.
2016-09-09	search: match quote detection behavior of view
	This is stricter than the mutt quote_regexp default ("^([ \t]*[\|>:}#])+" on Debian jessie), but matches what we have in View.pm. I prefer the stricter quote detection since it is less ambiguous and less likely to hide/obscure important details.
2016-09-09	search: fix space regressions from recent changes
	As of Xapian 1.0.4 (from 2007) is possible to use Search::Xapian::QueryParser::add_prefix multiple times with the same user field name but different term prefixes. This brings my current git@vger mirror from 6.5GB to 2.1GB (both sizes are after xapian-compact).
2016-09-09	search: more granular message body searching
	"bs:" and "b:" are adapted from mairix(1) We will also support searching explicitly for quoted vs non-quoted text via "q:" and "nq:" prefixes since sometimes readers will not care for quoted text. In the future, we will support parsing diffs (perhaps when repobrowse integration is complete). Note: this roughly doubles the size of the Xapian database due to the additional information; so this change may not be worth it.
2016-09-09	search: drop longer subject: prefix for search
	We only document the "s:" anyways. While the long name is more descriptive, the ambiguity makes agnostic caching (by Varnish or similar) slightly harder and longer URLs are more likely to be accidentally truncated when shared.
2016-09-09	search: allow searching user fields (To/Cc/From)
	Sometimes it can be useful to search based on who the message was sent to, sent by, or Cc:-ed. Of course, headers can be faked, but they usually are not... Anyways this mostly matches the behavior of mairix(1).
2016-09-08	import: run "git gc --auto" when done
	We need to prevent excessive repository growth for public-inbox-watch and public-inbox-mda users.
2016-09-08	import: hoist out common run_die subroutine
	We will be reusing this in the next commit, too.
2016-09-08	import: hoist out _check_path function
	This reduces duplication, slightly. We may be using it yet again in a to-be-introduced function (or we may not introduce it).
2016-09-08	view: handle missing Content-Type in message
	Email::MIME internally assumes "text/plain" for messages missing a Content-Type, but does not expose that in the Email::MIME::content_type API method. We must assume it ourselves to avoid uninitialized value warnings for the rare (nowadays) MUAs which do not set it.
2016-09-07	doc: new docs for user-level commands
	Hopefully more folks can download and run public-inbox, nowadays.
2016-09-02	config: use "publicinboxlimiter" prefix
	Just having "limiter" in the prefix may confuse it with something else. Use the full prefix to avoid this confusion.
2016-09-01	watch: use "publicinboxwatch" namespace
	We'll keep supporting "publicinboxlearn" indefinitely, but "publicinboxwatch" is probably more appropriate at the moment. Noticed while writing documentation.
2016-08-23	www: give tor2web some exposure, too
	Not everybody can run Tor, hopefully more can use Tor2web even if it compromises their privacy. This should help make system more resilient for users unable to use Tor.
2016-08-18	searchview: link to internal help text
	The internal help text links to the Xapian query parser documentation anyways, but also provides information on which prefixes exist.
2016-08-18	www: implement generic help text
	Begin documenting some basic help functionality. I may tweak the anchor names of the various HTML endpoints to be more consistent with each other (old ones will be supported for a short while), so I'm not documenting those, for now. This may become part of a builtin key-value store for basic texts, but this probably shouldn't become a wiki engine, either.
2016-08-18	linkify: be stricter about matching RFC 3986
	We're not to-the-letter about percent-encoding, but we should allow all the characters. This is mainly so we can effectively use the link to some Wikipedia pages with parentheses in them: https://en.wikipedia.org/wiki/Atom_(standard) https://en.wikipedia.org/wiki/Git_(software)
2016-08-18	view: try assuming UTF-8 for bogus charsets
	For some reason, Alpine will set X-UNKNOWN for valid UTF-8. Since we favor UTF-8 HTML anyways, try forcing Email::MIME to handle text/plain as UTF-8 which might show up better. At least this change renders <alpine.DEB.2.20.1608131214070.4924@virtualbox> properly by showing "•" (•) instead of "â ¢" (â¢) Reported-by: Thomas Ferris Nicolaisen <tfnico@gmail.com>
2016-08-18	view: try to display bogus charsets for text/plain
	Alpine seems to set charset=X-UNKNOWN for valid UTF-8 text, which causes Email::MIME::body_str to fail as X-UNKNOWN is not a valid encoding. So, blindly display the body as plain-text but warn users about possibly mangled text. Reported-by: Thomas Ferris Nicolaisen <tfnico@gmail.com>
2016-08-18	view: attach_link uses string concatentation
	There is no point in using an array to join on an empty string (my original intention was probably to join on "\n"). This is only preparation for the next change to show a warning to in the attachment link.
2016-08-16	search: add YYYYMMDD search range via "d:" prefix
	This is similar to mairix in that it uses a "d:" prefix; but only takes YYYYMMDD, for now. Using custom date/time parsers via Perl will be much more work: nntp://news.gmane.org/20151005222157.GE5880@survex.com Anyhow, this ought to be more human-friendly than searching by Unix timestamps, but it requires reindexing to take advantage of.
2016-08-16	search: drop pointless range processors for Unix timestamp
	The Unix timestamp isn't meaningful for users searching, we will start indexing the YYYYMMDD date stamp which may use StringValueRangeProcessor, instead.
2016-08-15	import: use common address parsing to drop unnecessary quotes
	Not sure why or how I missed this before; but the common address parsing routine we have should be more correct. Add a test to ensure excessively quoted names don't make it through, either.
2016-08-14	www: do not double-clean Message-IDs from internal DBs
	Ensure we usually strip one level of '<>' from Message-IDs, since our internal SQLite, Xapian, and SHA-1 storage all assume that. Realistically, we screw up if somebody has '<<' or '>>', but those are screwed up mail clients and we can deal with it another time. Currently, this means some messages with '>>' in References or Message-Id are not handled correctly, yet, but we match the behavior of Mail::Thread in keeping the extra '>'.
2016-08-14	www: do not unecessarily escape some chars in paths
	Based on reading RFC 3986, it seems '@', ':', '!', '$', '&', "'", '; '(', ')', '*', '+', ',', ';', '=' are all allowed in path-absolute where we have the Message-ID. In any case, it seems '@' is fairly common in path components nowadays and too common in Message-IDs.
2016-08-14	www: ensure XML validity for some odd ASCII chars
	I've seen 0x1b (\e) in at least one message and some other possibly non-printable chars. In any case, make sure they're valid XML with us-ascii encoding as far as xmlstarlet(1) thinks so.
2016-08-14	mid: no wide characters for sha1_hex
	Apparently there are some really screwed up In-Reply-To fields out there.
2016-08-14	search: gracefully handle lookup_message failure
	We can't blindly assume a ghost even exists in the DB, as the rules can change internally for some corner-case Message-IDs.
2016-08-14	view: remove redundant pre closing tag

2016-08-14	view: allow for missing In-Reply-To mapping
	Because buggy mail clients exist and generate invalid In-Reply-To headers we cannot handle across the board...
2016-08-14	searchidx: do not release Xapian lock while (only) Msgmap is indexing
	SQLite might index quickly, so we hold the lock used by Xapian for the duration. This probably needs to be reworked entirely, actually.
2016-08-14	newswww: include body text in 404 response
	Some browsers do not give any indication of the HTTP error code on errors, so show the error text to the user like we do in the top-level WWW module.
2016-08-13	extmsg: reorder and add a more Message-ID lookup services
	gmane is down at the moment, so lower that in priority (hopefully it will be brought back up, again). Wikipedia also lists a few more project-specific list providers, so include those as well: https://en.wikipedia.org/wiki/Message-ID
2016-08-12	watch: respect altid for incremental watch changes
	We need to pass the Inbox object to SearchIdx to get altid mappings properly for incremental imports. TODO: use the Inbox object in more places where it makes sense to do so.
2016-08-12	www: allow including links to NNTP sites in HTML footer
	Improve the discoverability of NNTP endpoints for users who still know what NNTP is. ==> ~/.public-inbox/config <== ; aliases for the locally-run nntpd can be specified in ; the "publicinbox" section: [publicinbox] nntpserver = nntp://ou63pmih66umazou.onion/ nntpserver = news.public-inbox.org ; NNTPS is not supported natively, yet, ; but one can use haproxy or similar ; nntpserver = nntps://news.public-inbox.invalid/ ; mirrors for specific inboxes may be specified either as full ; NNTP (or NNTPS) URLs, or with the server name only if the ; newsgroup name is specfied for a local NNTP server [publicinbox "git"] ... newsgroup = inbox.a.b.c nntpmirror = nntp://czquwvybam4bgbro.onion/ nntpmirror = hjrcffqmbrq6wope.onion ; there may be a mirror on a different server with a ; different name: nntpmirror = nntp://news.example.com/differently.named.group ; (And I really need to write manpages for all this...)
2016-08-12	config: do not nest multi-value altid arrays
	Oops. We will inevitably need to support multiple altids for a public-inbox one day.
2016-08-11	search: support alt-ID for mapping legacy serial numbers
	For some existing mailing list archives, messages are identified by serial number (such as NNTP article numbers in gmane). Those links may become inaccessible (as is the current case for gmane), so ensure users can still search based on old serial numbers. Now, I run the following periodically to get article numbers from gmane (while news.gmane.org remains): NNTPSERVER=news.gmane.org export NNTPSERVER GROUP=gmane.comp.version-control.git perl -I lib scripts/xhdr-num2mid $GROUP --msgmap=/path/to/gmane.sqlite3 (I might integrate this further with public-inbox-* scripts one day). My ~/.public-inbox/config as an added "altid" snippet which now looks like this: [publicinbox "git"] address = git@vger.kernel.org mainrepo = /path/to/git.vger.git newsgroup = inbox.comp.version-control.git ; relative pathnames expand to $mainrepo/public-inbox/$file altid = serial:gmane:file=gmane.sqlite3 And run "public-inbox-index --reindex /path/to/git.vger.git" periodically. This ought to allow searching for "gmane:12345" to work for Xapian-enabled instances. Disclaimer: while public-inbox supports NNTP and stable article serial numbers, use of those for public links is discouraged since it encourages centralization.
2016-08-10	searchidx: allow searching Message-IDs in free-form text
	It is not unheard of for users to attempt finding messages by entering Message-IDs into the "Search" box instead of using the existing URL structure. So make it possible for them. Fwiw, I've definitely encountered users who enter entire URLs into generic search engines.
2016-08-09	www: avoid misinterpreting '&' and ';' in query parameters
	Oops, we must unescape each key=value pair in a QUERY_STRING individually; otherwise we cannot interpret '&' or ';' in query parameter values.
2016-08-09	searchidx: avoid holding Xapian lock in cat-file
	We must ensure cat-file process is launched before Xapian grabs lock, too. Our use of "git cat-file --batch" has the same problem as "git log" did, (which was fixed in commit 3713c727cda431a0dc2865a7878c13ecf9f21851) "searchidx: release Xapian FDs before spawning git log"
2016-08-09	searchidx: release Xapian FDs before spawning git log
	This will allow us to release and re-acquire Xapian locks due to the lack of FD_CLOEXEC on some FDs.
2016-08-09	searchidx: persist the PublicInbox::Git object
	We can cheaply keep the object around nowadays since it spawns expensive processes only on an as-needed basis.