public-inbox.git - an "archives first" approach to mailing lists

Date	Commit message
2019-01-08	searchmsg: remove unused fields for PSGI in Xapian results
	These fields are only necessary in NNTP and not even stored in Xapian; so keeping them around for the PSGI web UI search results wastes nearly 80K when loading large result sets.
2018-07-18	t/search.t t/v2writable.t: Teach search tests to fail more cleanly.
	Now that some of the indexes are optional, these tests might fail, so teach them to fail more cleanly. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-05-30	respect umask if core.sharedRepository is not set
	This is consistent with git itself; the previous behavior resulted from a misunderstanding of how git interprets this setting. Also adjust tests slightly to match the new behavior. Reported-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org> <38873789-ab42-65a1-20c9-12c30b171f4f@linuxfoundation.org>
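As a rough illustration of the behavior described above, here is a minimal Python sketch of how a core.sharedRepository value could map to file permission bits, based on my reading of git-config(1). The helper name `shared_repo_mode` is hypothetical and this is not the project's actual (Perl) implementation; only the documented named values and octal modes are handled.

```python
def shared_repo_mode(value, umask):
    """Return permission bits for a newly created index file.

    `value` is the core.sharedRepository setting (None if unset) and
    `umask` is the process umask. Hypothetical helper for illustration.
    """
    base = 0o666 & ~umask  # what a plain file creation would yield
    if value is None:
        return base  # unset: defer entirely to the umask
    v = value.lower()
    if v in ("umask", "false"):
        return base
    if v in ("group", "true"):
        return base | 0o660  # add group read/write
    if v in ("all", "world", "everybody"):
        return base | 0o664  # group read/write plus world-readable
    # Anything else is assumed to be an octal mode such as "0640",
    # which overrides the umask instead of adding to it.
    return int(value, 8) & 0o666
```

With a umask of 077, an unset value gives 0600 here, while "group" gives 0660.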
2018-05-11	t/search: quiet warning from Encode.pm
	This was probably a typo on my part, and quiets a warning: "Argument contains empty address at .../Email/MIME/Encode.pm line 70". Tested with Email::MIME 1.946.
2018-04-22	extmsg: use Xapian only for partial matches
	"LIKE" in SQLite (and other SQL implementations I've seen) is expensive with nearly 3 million messages in the archives. This caused some partial Message-ID lookups to take over 600ms on my workstation (~300ms on a faster Xeon). Cut that to under 30ms on average on my workstation by relying exclusively on Xapian for partial Message-ID lookups, as we have in the past. Unlike in the past when we tried using Xapian to match partial Message-IDs, we now optimize our indexing of Message-IDs to break apart "words" in Message-IDs for searching, yielding (hopefully) "good enough" accuracy for folks who get long URLs broken across lines when copy+pasting. We'll also drop the (in retrospect) pointless stripping of "/[tTf]" suffixes for the partial match, since anybody who hits that codepath would be hitting an invalid message ID. Finally, limit wildcard expansion to prevent easy DoS vectors on short terms. And blame Pine and alpine for generating Message-IDs with low-entropy prefixes :P
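The idea of breaking Message-IDs into "words" and refusing wildcards on short terms can be sketched as follows. This is an illustration under stated assumptions, not the indexing code from the commit: the function names, the split on non-alphanumeric characters, and the `MIN_WILDCARD_LEN` threshold are all hypothetical, and the sketch assumes the fragment was truncated at its end.

```python
import re

MIN_WILDCARD_LEN = 4  # assumed threshold, not taken from the commit

def mid_words(message_id):
    """Split a Message-ID into lowercase alphanumeric 'words'."""
    return [w.lower() for w in re.split(r"[^A-Za-z0-9]+", message_id) if w]

def partial_query_terms(fragment):
    """Turn a possibly truncated Message-ID into search terms.

    Only the last word can be incomplete, so only it gets a trailing
    wildcard, and only when it is long enough to keep the wildcard
    expansion small. A too-short last word is dropped.
    """
    words = mid_words(fragment)
    if not words:
        return []
    *complete, last = words
    terms = list(complete)
    if len(last) >= MIN_WILDCARD_LEN:
        terms.append(last + "*")
    return terms
```

Dropping a very short trailing word trades a little precision for a bounded number of expanded terms, which is the DoS concern the commit message raises.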
2018-04-18	ensure SQLite and Xapian files respect core.sharedRepository
	We can't have files with permissions inconsistent with what's in git objects.
2018-04-18	v1: remove articles from overview DB
	Otherwise articles show up again...
2018-04-07	store less data in the Xapian document
	Since we only query the SQLite over DB for OVER/XOVER, we do not need to waste space storing the To/Cc/:bytes/:lines fields or the XNUM term. We only use From/Subject/References/Message-ID/:blob in various places of the PSGI code. For reindexing, we will take advantage of docid stability in "xapian-compact --no-renumber" to ensure duplicates do not show up in search results. Since the PSGI interface is the only consumer of Xapian at the moment, it has no need to search based on NNTP article number.
2018-04-06	www: favor reading more from SQLite, and less from Xapian
	Favor simpler internal APIs this time around; this cuts a fair amount of code out and takes another step towards removing Xapian as a dependency for v2 repos.
2018-04-06	search: index and allow searching by date-time
	Dscho found this useful for finding matching git commits based on AuthorDate in git. Add it to the overview DB format, too; so in the future we can support v2 repos without Xapian. https://public-inbox.org/git/nycvar.QRO.7.76.6.1804041821420.55@ZVAVAG-6OXH6DA.rhebcr.pbec.zvpebfbsg.pbz https://public-inbox.org/git/alpine.DEB.2.20.1702041206130.3496@virtualbox/
2018-04-02	www: rework query responses to avoid COUNT in SQLite
	In many cases, we do not care about the total number of messages. It's a rather expensive operation in SQLite (Xapian only provides an estimate). For LKML, this brings top-level /$INBOX/ loading time from ~375ms to around 60ms on my system. Days ago, this operation was taking 800-900ms(!) for me before introducing the SQLite overview DB.
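One common way to page through results without a COUNT query is to fetch one row more than the page size and use the extra row only as a "there is more" signal. The Python sketch below shows that technique with the standard sqlite3 module. The `msgs` table, its columns, and both function names are hypothetical, and the commit may achieve the same goal differently. OFFSET is used here only to keep the example short.

```python
import sqlite3

def make_demo_db(n):
    """Build an in-memory table with n rows (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE msgs (num INTEGER PRIMARY KEY, subject TEXT)")
    db.executemany(
        "INSERT INTO msgs (num, subject) VALUES (?, ?)",
        [(i, f"message {i}") for i in range(1, n + 1)],
    )
    return db

def fetch_page(db, limit, offset=0):
    """Return (rows, has_more) for one page, newest first, without COUNT."""
    rows = db.execute(
        "SELECT num, subject FROM msgs ORDER BY num DESC LIMIT ? OFFSET ?",
        (limit + 1, offset),  # ask for one extra row
    ).fetchall()
    has_more = len(rows) > limit  # the extra row means another page exists
    return rows[:limit], has_more
```

A caller would render a "next" link only when `has_more` is true, so the total number of messages never has to be computed.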
2018-04-02	replace Xapian skeleton with SQLite overview DB
	This ought to provide better performance and scalability which is less dependent on inbox size. Xapian does not seem optimized for some queries used by the WWW homepage, Atom feeds, XOVER and NEWNEWS NNTP commands. This can actually make Xapian optional for NNTP usage, and allow more functionality to work without Xapian installed. Indexing performance was extremely bad at first, but DBI::Profile helped me optimize away problematic queries.
2018-03-30	search: move permissions handling to InboxWritable
	We'll be making sure V2Writable uses this.
2018-03-29	search: get rid of most lookup_* subroutines
	Too many similar functions doing the same basic thing was redundant and misleading, especially since Message-ID is no longer treated as a truly unique identifier. For displaying threads in the HTML, this makes it clear that we favor the primary Message-ID mapped to an NNTP article number if a message cannot be found.
2018-02-07	update copyrights for 2018
	Using update-copyrights from gnulib. While we're at it, use the SPDX identifier for AGPL-3.0+ to ease mechanical processing.
2017-06-14	search: remove unnecessary abstractions and functionality
	This simplifies the code a bit and reduces the translation overhead for looking directly at data from tools shipped with Xapian. While we're at it, fix thread-all.t :)
2017-01-07	search: remove subject_summary
	Apparently it never actually got used, and the world seems fine without it, so we can drop it. While we're at it, consider removing our subject_path usage from existence, too. We are not using fancy subject-line based URLs, here.
2016-12-20	searchmsg: remove ensure_metadata
	Instead, only preload the ->mid field for threading, as we only need ->thread and ->path once in Search->get_thread (but we will need the ->mid field repeatedly). This more than doubles View->load_results performance according to thread-all on an inbox with over 300K messages.
2016-10-05	thread: remove Email::Abstract wrapping
	This roughly doubles performance due to the reduction in object creation and abstraction layers.
2016-09-09	search: index attachment filenames
	And while we're at it, ensure searching inside displayable attachment bodies works.
2016-09-09	search: more granular message body searching
	"bs:" and "b:" are adapted from mairix(1). We will also support searching explicitly for quoted vs non-quoted text via "q:" and "nq:" prefixes since sometimes readers will not care for quoted text. In the future, we will support parsing diffs (perhaps when repobrowse integration is complete). Note: this roughly doubles the size of the Xapian database due to the additional information; so this change may not be worth it.
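To make the quoted vs non-quoted distinction concrete, here is a minimal Python sketch that separates a message body into the two kinds of text before indexing. It assumes the usual convention that quoted lines start with ">". The function name `split_quoted` is hypothetical and the real indexer may classify lines differently.

```python
def split_quoted(body):
    """Split a message body into (quoted_text, non_quoted_text).

    A line counts as quoted when it starts with '>' (an assumption
    about the quoting convention, not taken from the commit).
    """
    quoted, non_quoted = [], []
    for line in body.splitlines():
        if line.startswith(">"):
            quoted.append(line)
        else:
            non_quoted.append(line)
    return "\n".join(quoted), "\n".join(non_quoted)
```

Each half could then be indexed under its own prefix, which is also why the commit message warns that the database roughly doubles in size.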
2016-09-09	search: drop longer subject: prefix for search
	We only document the "s:" anyways. While the long name is more descriptive, the ambiguity makes agnostic caching (by Varnish or similar) slightly harder and longer URLs are more likely to be accidentally truncated when shared.
2016-09-09	search: allow searching user fields (To/Cc/From)
	Sometimes it can be useful to search based on who the message was sent to, sent by, or Cc:-ed. Of course, headers can be faked, but they usually are not... Anyways this mostly matches the behavior of mairix(1).
2016-08-16	search: add YYYYMMDD search range via "d:" prefix
	This is similar to mairix in that it uses a "d:" prefix; but only takes YYYYMMDD, for now. Using custom date/time parsers via Perl will be much more work: nntp://news.gmane.org/20151005222157.GE5880@survex.com Anyhow, this ought to be more human-friendly than searching by Unix timestamps, but it requires reindexing to take advantage of.
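A small Python sketch of parsing such a term into a date range follows. It assumes the range is written as "d:START..END" with either end optional, which is how Xapian-style ranges are usually spelled. The function name `parse_d_range` is hypothetical and this is not the project's parser.

```python
from datetime import datetime

def parse_d_range(term):
    """Parse 'd:YYYYMMDD..YYYYMMDD' into (start, end) dates.

    Either end may be empty, in which case None is returned for it.
    The '..' separator is an assumption about the query syntax.
    """
    if not term.startswith("d:"):
        raise ValueError("not a d: term")
    start_s, sep, end_s = term[2:].partition("..")
    if not sep:
        raise ValueError("expected START..END")

    def parse(s):
        if not s:
            return None
        if len(s) != 8 or not s.isdigit():
            raise ValueError("expected YYYYMMDD")
        return datetime.strptime(s, "%Y%m%d").date()

    return parse(start_s), parse(end_s)
```

For example, `parse_d_range("d:20160101..20160816")` yields the pair of dates 2016-01-01 and 2016-08-16.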
2016-08-09	searchidx: release Xapian FDs before spawning git log
	This will allow us to release and re-acquire Xapian locks due to the lack of FD_CLOEXEC on some FDs.
2016-04-30	searchmsg: ensure long subject lines are not broken
	Noticed when using a long URL in the subject.
2016-03-03	t/*.t: use identifiable tempdir names
	This should make identifying leftover directories due to SIGKILL-ed tests easier.
2016-02-29	t/search.t: use transactions to reduce I/O load
	In case folks do not use eatmydata or tmpfs for testing, use transactions to reduce the number of fsync calls made and hopefully prevent drives from wearing out.
2016-02-28	t/: remove unnecessary Dumper use
	No point in loading Data::Dumper if we do not use it in the tests.
2015-09-30	nntp: implement OVER/XOVER summary in search document
	The document data of a search message already contains a good chunk of the information needed to respond to OVER/XOVER commands quickly. Expand on that and use the document data to implement OVER/XOVER quickly. This adds a dependency on Xapian being available for nntpd usage, but is probably alright since nntpd is esoteric enough that anybody willing to run nntpd will also want search functionality offered by Xapian. This also speeds up XHDR/HDR with the To: and Cc: headers and :bytes/:lines article metadata used by some clients for header displays and marking messages as read/unread.
2015-09-06	update copyright headers and email addresses
	In the future, it should be possible to use this: git ls-files | UPDATE_COPYRIGHT_HOLDER='all contributors' UPDATE_COPYRIGHT_USE_INTERVALS=2 xargs /path/to/gnulib/build-aux/update-copyright
2015-09-03	search: disable Message-ID compression in Xapian
	We'll continue to compress long Message-IDs in URLs (which we know about), but we will store entire Message-IDs in the Xapian database to facilitate ease-of-lookups in external databases.
2015-08-30	search: do not index references and inreplyto terms
	We no longer need them, as we can rely on index-time thread resolution and thread merging. This allows us to index less data and hopefully increase efficiency.
2015-08-25	search: implement subject summarization
	We ought to summarize subjects to avoid exploding line lengths in the web interface.
2015-08-23	search: respect core.sharedRepository for Xapian DB
	Extend the purpose of core.sharedRepository to apply to the $GIT_DIR/public-inbox/xapian* directory.
2015-08-22	search: split search indexing to a separate file
	This makes organization easier and reduces the amount of code loaded for a PSGI, mod_perl or CGI instance.
2015-08-21	search: s/count/total/ for results
	This is hopefully less ambiguous, as the word "count" confused me, too.
2015-08-18	search: avoid creating ghosts for circular References
	Some mail software incorrectly creates circular references and causes us to create ghosts before the actual mail doc is created.
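One way to guard against this is to sanitize the References list before threading, dropping any entry that names the message itself as well as repeats. The Python sketch below illustrates that idea only. The function name `clean_references` is hypothetical and the commit may handle circular references differently.

```python
def clean_references(mid, references):
    """Return `references` without self-references or duplicates.

    `mid` is the Message-ID of the message being indexed. Order is
    preserved so that the last entry remains the direct parent.
    """
    seen = {mid}  # treat our own Message-ID as already seen
    cleaned = []
    for ref in references:
        if ref in seen:
            continue  # circular or repeated reference: skip it
        seen.add(ref)
        cleaned.append(ref)
    return cleaned
```

With the self-reference filtered out, no placeholder ("ghost") needs to be created for a message that is about to be indexed anyway.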
2015-08-17	search: simplify indexing operation
	There's no need to make a transaction for each message when doing incremental indexing against a git repository. While we're at it, simplify the interface for callers, too, and do not auto-create the Xapian database if it was not explicitly enabled.
2015-08-17	terminology: replies => followups
	Replies are only direct replies, but followups could be any message further down the thread. The latter is more useful.
2015-08-17	skip search test if search support is missing
	We will not require Search::Xapian to be installed.
2015-08-16	implement /s/$SUBJECT_PATH.html lookups
	Quick-and-dirty wiring up of Subject: paths. This may prove more memorizable and easier-to-share than /t/$MESSAGE_ID.html links, but less strict. This changes our schema version to 1, since we now use lower-case subject paths.
2015-08-15	search: make search results more OO
	This will relieve callers of the need to decode the data we store internally in Xapian
2015-08-13	initial search backend implementation
	This shall allow us to search for replies/threads more easily.