public-inbox.git - an "archives first" approach to mailing lists

Date	Commit message
2020-01-27	inbox: add ->version method
	This allows us to simplify version checking by avoiding "//" or "||" operators sprinkled around.
2020-01-06	treewide: "require" + "use" cleanup and docs
	There's a bunch of leftover "require" and "use" statements we no longer need and can get rid of, along with some excessive imports via "use". IO::Handle usage isn't always obvious, so add comments describing why a package loads it. Along the same lines, document the tmpdir support as the reason we depend on File::Temp 0.19, even though every Perl 5.10.1+ user has it. While we're at it, favor "use" over "require", since it gives us extra compile-time checking.
2020-01-05	search: remove lookup_article
	It was no longer used outside of tests, so don't penalize regular users with the extra function. Just inline it for t/search.t.
2019-12-29	search: load_xapian: return true on success
	This was causing -xcpdb and other admin modules to fail outside of tests (or when testing with the slow TEST_RUN_MODE=0).
2019-12-28	search: retry_reopen passes user arg to callback
	This allows callers to pass named (not anonymous) subs. Update all retry_reopen callers to use this feature, and fix some places where we failed to use retry_reopen :x
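The calling convention described above can be illustrated with a minimal Python sketch (the project itself is Perl). Everything here is hypothetical: `FakeDB`, `ReopenNeeded`, `find_by_key`, and this `retry_reopen` signature are stand-ins chosen for illustration, not the project's actual code.

```python
class ReopenNeeded(Exception):
    """Hypothetical stand-in for a 'database changed, reopen and retry' error."""


class FakeDB:
    """Hypothetical database handle that fails until it has been reopened once."""

    def __init__(self):
        self.reopened = 0

    def reopen(self):
        self.reopened += 1

    def lookup(self, key):
        if self.reopened == 0:
            raise ReopenNeeded()
        return f"doc:{key}"


def retry_reopen(db, callback, arg, max_tries=10):
    """Call callback(db, arg); on ReopenNeeded, reopen db and try again."""
    for _ in range(max_tries):
        try:
            return callback(db, arg)
        except ReopenNeeded:
            db.reopen()
    raise RuntimeError("too many reopen attempts")


def find_by_key(db, key):
    # A named function: no closure is needed because `key` arrives as an argument.
    return db.lookup(key)


db = FakeDB()
result = retry_reopen(db, find_by_key, "abc")
```

Because the user argument is forwarded, `find_by_key` can be defined once at module level instead of being rebuilt as an anonymous closure at each call site.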
2019-12-24	search: support SWIG-generated Xapian.pm
	Xapian upstream is slowly phasing out the XS-based Search::Xapian in favor of the SWIG-generated "Xapian" package. While Debian and FreeBSD both have Search::Xapian, OpenBSD only includes the "Xapian" binding. More information about the status of the "Xapian" Perl module here: https://trac.xapian.org/ticket/523
2019-10-30	search: add note about SCHEMA_VERSION 15
	--reindex has gotten better over the years, and having parallel Xapian DB directories would exceed all available disk space for some users with giant inboxes.
2019-10-16	config: support "inboxdir" in addition to "mainrepo"
	"mainrepo" was a bad name and an artifact from the early days when I intended for there to be a "spamrepo" (now just the ENV{PI_EMERGENCY} Maildir). With v2, "mainrepo" can be especially confusing, since v2 needs at least two git repositories (epoch + all.git) to function and we shouldn't confuse users by having them point to a git repository for v2. Much of our documentation already references "INBOX_DIR" for command-line arguments, so use "inboxdir" as the git-config(1)-friendly variant for that. "mainrepo" remains supported indefinitely for compatibility. Users may need to revert to old versions, or may be referring to old documentation and must not be forced to change config files to account for this change. So if you're using "mainrepo" today, I do NOT recommend changing it right away because other bugs can lurk. Link: https://public-inbox.org/meta/874l0ice8v.fsf@alyssa.is/
2019-09-09	run update-copyrights from gnulib for 2019

2019-06-14	search: use "shard" for local variable
	Another small step towards terminology consistency with Xapian.
2019-06-14	search*: rename {partition} => {shard}
	Another step towards keeping our internal data structures consistent with Xapian naming.
2019-06-14	search: require PublicInbox::Inbox ref here
	No sense in supporting multiple methods of initialization for an internal class.
2019-06-04	require ASCII digits for local FS items
	In case some BOFH decides to randomly create directories using non-ASCII digits all over the place.
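The distinction matters because many regex engines let `\d` match any Unicode decimal digit. The following Python sketch shows the difference; `is_ascii_number` is a hypothetical helper name, and the check is only an illustration of the idea, not the project's Perl code.

```python
import re

# Only the ten ASCII digits, regardless of the regex engine's Unicode settings.
ASCII_DIGITS = re.compile(r"[0-9]+")


def is_ascii_number(name: str) -> bool:
    """Return True if `name` consists solely of ASCII digits 0-9."""
    return ASCII_DIGITS.fullmatch(name) is not None


# ARABIC-INDIC DIGITS ONE, TWO, THREE: digits to Unicode, but not ASCII.
arabic_indic = "\u0661\u0662\u0663"

# Python's default \d accepts them, which is the trap an explicit
# [0-9] class avoids when scanning directory names.
unicode_match = re.fullmatch(r"\d+", arabic_indic) is not None
```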
2019-05-24	search: don't log all warnings on retry_reopen
	Some users (or bots :P) can trigger horrible queries which the caller can choose to either log or ignore. This prevents horrible queries from ExtMsg from logging confusing "ref: " messages when $@ is not a Perl reference.
2019-05-23	search: reenable phrase search on non-chert Xapian
	This assumes nobody uses flint or earlier anymore, as flint predates the existence of this project.
2019-05-21	Merge remote-tracking branch 'origin/xap-optional' into master
	* origin/xap-optional:
	  admin: improve warnings and errors for missing modules
	  searchidx: do not create empty Xapian partitions for basic
	  lazy load Xapian and make it optional for v2
	  www: use Inbox->over where appropriate
	  nntp: use Inbox->over directly
	  inbox: add ->over method to ease access
2019-05-16	search: disable phrase searching, for now
	There probably needs to be an option to enable this independently of indexlevel; but for now this is the safest option. And, as I discovered during the development of the indexlevel option, Xapian does a pretty good job of finding phrases without position data, anyways.
2019-05-15	lazy load Xapian and make it optional for v2
	More tests work without Search::Xapian, now. Usability issues still need to be fixed.
2019-05-15	www: use Inbox->over where appropriate
	We don't need to rely on Xapian search functionality for the majority of the WWW code, even. subject_normalized is moved to SearchMsg, where it (probably) makes more sense, anyways.
2019-05-15	nntp: use Inbox->over directly
	None of the NNTP code actually relies on Xapian, anymore.
2018-07-20	search: use boolean prefixes for git blob queries
	I've hit some cases where probabilistic searches don't work when using dfpre:/dfpost:/dfblob: search prefixes because stemming in the query parser interferes. In any case, our indexing code indexes longer/unabbreviated blob names down to their 7-character abbreviation, so there should be no need to do wildcard searches on git blob names.
2018-04-23	search: avoid repeated mbox results from search
	Previous search queries already set the sort order on the Enquire object, which altered the ordering of results and caused messages to be redundantly downloaded via POST /$INBOX/?q=$QUERY&x=m. So stop caching the Search::Xapian::Enquire object since it wasn't providing any measurable performance improvement.
2018-04-22	extmsg: use Xapian only for partial matches
	"LIKE" in SQLite (and other SQL implementations I've seen) is expensive with nearly 3 million messages in the archives. This caused some partial Message-ID lookups to take over 600ms on my workstation (~300ms on a faster Xeon). Cut that to under 30ms on average on my workstation by relying exclusively on Xapian for partial Message-ID lookups as we have in the past. Unlike in the past when we tried using Xapian to match partial Message-IDs, we now optimize our indexing of Message-IDs to break apart "words" in Message-IDs for searching, yielding (hopefully) "good enough" accuracy for folks who get long URLs broken across lines when copy+pasting. We'll also drop the (in retrospect) pointless stripping of "/[tTf]" suffixes for the partial match, since anybody who hits that codepath would be hitting an invalid message ID. Finally, limit wildcard expansion to prevent easy DoS vectors on short terms. And blame Pine and alpine for generating Message-IDs with low-entropy prefixes :P
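A rough Python sketch of the two ideas above: splitting a Message-ID into "words", and refusing to wildcard very short terms. The splitting rule, the `MIN_WILDCARD_LEN` threshold, and both function names are assumptions made for illustration; the actual indexing code and the Xapian-side expansion limit may work differently.

```python
import re

# Assumed threshold: terms shorter than this are never turned into wildcards,
# since a short prefix could expand to a huge number of indexed terms.
MIN_WILDCARD_LEN = 4


def mid_words(mid: str) -> list:
    """Split a Message-ID into alphanumeric words (assumed tokenization)."""
    return [w for w in re.split(r"[^A-Za-z0-9]+", mid) if w]


def partial_terms(partial_mid: str, min_wildcard_len: int = MIN_WILDCARD_LEN) -> list:
    """Build query terms for a possibly truncated Message-ID."""
    words = mid_words(partial_mid)
    if not words:
        return []
    *complete, last = words
    terms = list(complete)
    # The last word may have been cut off mid-way, so it is the only
    # wildcard candidate, and only if it is long enough to be selective.
    if len(last) >= min_wildcard_len:
        terms.append(last + "*")
    return terms
```

For a Message-ID cut off by a line break, such as `20180422.abc-def@exam`, this yields three exact words plus the prefix term `exam*`.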
2018-04-06	www: favor reading more from SQLite, and less from Xapian
	Favor simpler internal APIs this time around, this cuts a fair amount of code out and takes another step towards removing Xapian as a dependency for v2 repos.
2018-04-06	search: index and allow searching by date-time
	Dscho found this useful for finding matching git commits based on AuthorDate in git. Add it to the overview DB format, too; so in the future we can support v2 repos without Xapian. https://public-inbox.org/git/nycvar.QRO.7.76.6.1804041821420.55@ZVAVAG-6OXH6DA.rhebcr.pbec.zvpebfbsg.pbz https://public-inbox.org/git/alpine.DEB.2.20.1702041206130.3496@virtualbox/
2018-04-05	mbox: do not sort search results
	Sorting large msets is a waste when it comes to mboxes since MUAs should thread and sort them as the user desires. This forces us to rework each of the mbox download mechanisms to be more independent of each other, but might make things easier to reason about.
2018-04-05	search: remove unnecessary OP_AND of query
	This was vestigial code from the switch to the overview DB
2018-04-03	mbox: remove remaining OFFSET usage in SQLite
	We can use id_batch in the common case to speed up full mbox retrievals. Gigantic msets are still a problem, but will be fixed in future commits.
2018-04-03	nntp: make XOVER, XHDR, OVER, HDR and NEWNEWS faster
	While SQLite is faster than Xapian for some queries we use, it sucks at handling OFFSET. Fortunately, we do not need offsets when retrieving sorted results and can bake it into the query. For inbox.comp.version-control.git (v1 Xapian), XOVER and XHDR are over 20x faster.
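The general technique of replacing OFFSET with a condition on the sort key can be sketched with Python's built-in sqlite3 module. The `overview` table, its columns, and the function names are hypothetical and do not reflect the project's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE overview (num INTEGER PRIMARY KEY, subject TEXT)")
conn.executemany(
    "INSERT INTO overview (num, subject) VALUES (?, ?)",
    [(n, f"subject {n}") for n in range(1, 1001)],
)


def page_with_offset(conn, offset, limit):
    # SQLite must step over `offset` rows before returning anything.
    return conn.execute(
        "SELECT num, subject FROM overview ORDER BY num LIMIT ? OFFSET ?",
        (limit, offset),
    ).fetchall()


def page_after(conn, last_num, limit):
    # The starting point is part of the WHERE clause, so the primary key
    # index can seek straight to it.
    return conn.execute(
        "SELECT num, subject FROM overview WHERE num > ? ORDER BY num LIMIT ?",
        (last_num, limit),
    ).fetchall()


def iter_all(conn, limit=100):
    """Walk the whole table in batches without ever using OFFSET."""
    last = 0
    while True:
        rows = page_after(conn, last, limit)
        if not rows:
            return
        yield from rows
        last = rows[-1][0]
```

Both page functions return the same rows when article numbers are contiguous, but the cost of `page_after` does not grow with how deep into the table the page is.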
2018-04-02	www: rework query responses to avoid COUNT in SQLite
	In many cases, we do not care about the total number of messages. It's a rather expensive operation in SQLite (Xapian only provides an estimate). For LKML, this brings top-level /$INBOX/ loading time from ~375ms to around 60ms on my system. Days ago, this operation was taking 800-900ms(!) for me before introducing the SQLite overview DB.
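One common way to page through results without COUNT is to fetch one row more than the page size and use its presence as the "more results" signal. This Python sketch assumes a hypothetical `overview` table and is not necessarily how the commit reworked the responses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE overview (num INTEGER PRIMARY KEY, ts INTEGER)")
conn.executemany(
    "INSERT INTO overview (num, ts) VALUES (?, ?)",
    [(n, 1522627200 + n) for n in range(1, 6)],
)


def newest_page(conn, limit):
    """Return (rows, has_more) for the newest messages, without COUNT(*)."""
    rows = conn.execute(
        "SELECT num, ts FROM overview ORDER BY ts DESC LIMIT ?",
        (limit + 1,),  # one extra row tells us whether another page exists
    ).fetchall()
    return rows[:limit], len(rows) > limit


page, has_more = newest_page(conn, 3)
```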
2018-04-02	replace Xapian skeleton with SQLite overview DB
	This ought to provide better performance and scalability which is less dependent on inbox size. Xapian does not seem optimized for some queries used by the WWW homepage, Atom feeds, XOVER and NEWNEWS NNTP commands. This can actually make Xapian optional for NNTP usage, and allow more functionality to work without Xapian installed. Indexing performance was extremely bad at first, but DBI::Profile helped me optimize away problematic queries.
2018-04-01	search: reduce columns stored in Xapian
	We can store :bytes and :lines in doc_data since we never sort or search by them. We don't have much use for the Date: stamp at the moment, either.
2018-03-30	search: warn on reopens and die on total failure
	-watch on a busy/giant Maildir caused too many Xapian errors while attempting to browse.
2018-03-29	search: retry_reopen on first_smsg_by_mid
	This was causing errors while attempting to load messages via the WWW interface while mass-importing LKML. While we're at it, remove unnecessary eval from lookup_article.
2018-03-29	search: move find_doc_ids to searchidx
	We do not need this subroutine for read-only use in Search.pm
2018-03-29	search: get rid of most lookup_* subroutines
	Too many similar functions doing the same basic thing was redundant and misleading, especially since Message-ID is no longer treated as a truly unique identifier. For displaying threads in the HTML, this makes it clear that we favor the primary Message-ID mapped to an NNTP article number if a message cannot be found.
2018-03-29	search: cleanup uniqueness checking
	The only Xapian term which should be unique is the NNTP article number; so we no longer need find_unique_doc_id.
2018-03-23	search: reopen DB if each_smsg_by_mid fails
	This gives more up-to-date data in that case and allows us to avoid reopening in more places ourselves.
2018-03-23	www: $MESSAGE_ID/raw endpoint supports "duplicates"
	Since v2 supports duplicate messages, we need to support looking up different messages with the same Message-Id. Fortunately, our "raw" endpoint has always been mboxrd, so users won't need to change their parsing tools.
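An mboxrd stream can hold any number of messages because body lines that look like "From " separators are escaped reversibly. The Python sketch below illustrates that escaping rule; the `SEPARATOR` line is a placeholder assumption and the function names are hypothetical.

```python
import re

# mboxrd rule: any line matching ^>*From<space> gets one more ">" prepended.
_FROM_LINE = re.compile(r"^(>*From )", re.MULTILINE)
_ESCAPED_FROM_LINE = re.compile(r"^>(>*From )", re.MULTILINE)

# Placeholder "From " separator line (an assumption for this sketch).
SEPARATOR = "From mboxrd@z Thu Jan  1 00:00:00 1970\n"


def mboxrd_escape(raw: str) -> str:
    return _FROM_LINE.sub(r">\1", raw)


def mboxrd_unescape(escaped: str) -> str:
    return _ESCAPED_FROM_LINE.sub(r"\1", escaped)


def to_mboxrd(messages) -> str:
    """Concatenate raw messages into a single mboxrd stream."""
    return "".join(SEPARATOR + mboxrd_escape(m) + "\n" for m in messages)


# Two different messages sharing one Message-ID, as v2 permits.
duplicates = [
    "Message-ID: <a@example>\n\nFrom the first copy\n",
    "Message-ID: <a@example>\n\n>From the second copy\n",
]
stream = to_mboxrd(duplicates)
```

A parser that splits on unescaped "From " lines sees two messages in `stream`, the same way it would see one message before duplicates were supported.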
2018-03-22	use both Date: and Received: times
	We want to rely on Date: to sort messages within individual threads since it keeps messages from git-send-email(1) sorted. However, since developers occasionally have the clock set wrong on their machines, sort overall messages by the newest date in a Received: header so the landing page isn't forever polluted by messages from the future. This also gives us determinism for commit times in most cases, as we'll use the Received: timestamp there, as well.
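The two timestamps can be extracted as in the following Python sketch, which uses only the standard email package. The sample message, the function names, and the fallback to Date: when no Received: header parses are assumptions for illustration, not the project's implementation.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Sample message whose sender's clock is decades ahead (made-up data).
RAW = """\
Received: from a.example by mx.example; 22 Mar 2018 10:05:00 +0000
Received: from laptop by a.example; 22 Mar 2018 10:00:00 +0000
Date: 22 Mar 2058 10:00:00 +0000
Subject: hello from the future

body
"""


def date_ts(msg) -> float:
    """Timestamp from the sender-controlled Date: header (per-thread sorting)."""
    return parsedate_to_datetime(msg["Date"]).timestamp()


def received_ts(msg) -> float:
    """Newest timestamp found in any Received: header (overall sorting)."""
    stamps = []
    for hdr in msg.get_all("Received", []):
        # The date-time follows the last ";" in a Received: header.
        _, sep, when = hdr.rpartition(";")
        if sep:
            stamps.append(parsedate_to_datetime(when.strip()).timestamp())
    # Assumed fallback: use Date: if no Received: timestamp is available.
    return max(stamps) if stamps else date_ts(msg)


msg = message_from_string(RAW)
```

Sorting a landing page by `received_ts` keeps this message in March 2018, while `date_ts` still preserves the sender's intended order within a thread.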
2018-03-19	search: allow ->reopen to be chainable
	Makes life a little easier for V2Writable...
2018-03-06	search: each_smsg_by_mid uses skeleton if available
	We do not need the large DBs for MID scans.
2018-03-05	search: favor skeleton DB for lookup_mail
	The skeleton DB is smaller and hit more frequently given the homepage and per-message/thread views; so it will be hotter in the page cache.
2018-03-03	nntp: fix NEWNEWS command
	I guess nobody uses this command (slrnpull does not), and the breakage was not noticed until I started writing new tests for multi-MID handling. Fixes: 3fc411c772a21d8f ("search: drop pointless range processors for Unix timestamp")
2018-03-03	nntp: use NNTP article numbers for lookups
	Since Message-IDs are no longer unique within Xapian (but are within the SQLite Msgmap); favor NNTP article numbers for internal lookups. This will prevent us from finding the "wrong" internal Message-ID.
2018-03-03	searchidx: avoid excessive XNQ indexing with diffs
	When indexing diffs, we can avoid indexing the diff parts under XNQ and instead combine the parts in the read-only search interface. This results in better indexing performance and 10-15% smaller Xapian indices.
2018-03-03	searchidx: support indexing multiple MIDs
	It's possible to have a message handle multiple terms; so use this feature to ensure messages with multiple MIDs can be found by either one.
2018-03-02	search: revert to using 'Q' as a uniQue id per-Xapian conventions
	'Q' is merely a convention in the Xapian world, and is close enough to unique for practical purposes, so stop using XMID and gain a little more term length as a result.
2018-03-02	v2writable: deduplicate detection on add
	This is a bit expensive in a multi-process situation because we need to make our indices and packs visible to the read-only pieces.
2018-03-02	search: remove informational "warning" message
	It was making imports too noisy.