public-inbox.git - an "archives first" approach to mailing lists

Date	Commit message
2020-08-07	index+xcpdb: rename `--no-sync' to `--no-fsync'
	We'll continue supporting `--no-sync' even though it has yet to make it into a release, but the term `sync' is overloaded in our codebase, which may be confusing to new hackers and users. Neither our code nor our dependencies issue the sync(2) syscall, either, only fsync(2) and fdatasync(2).
2020-08-02	remove unnecessary ->header_obj calls
	We used ->header_obj in the past as an optimization with Email::MIME. That optimization is no longer necessary with PublicInbox::Eml. This doesn't make any functional difference even if we were to go back to Email::MIME. However, it reduces the amount of code we have and slightly reduces allocations with PublicInbox::Eml.
2020-07-26	overidx: fix compatibility with current versions
	We still need to use SQL_BLOB to ensure existing versions of public-inbox can read over.sqlite3 because they're still using {sqlite_unicode}. This partially reverts commit e9fc1290ead44e06d20ff58e0a6acb5306d4fbe2.
	Fixes: e9fc1290ead44e06 ("over: unset sqlite_unicode attribute")
2020-07-25	index+xcpdb: support --no-sync flag
	This allows us to speed up indexing operations to SQLite and Xapian. Unfortunately, it doesn't affect operations using `xapian-compact' and the compactor API, since that doesn't seem to support Xapian::DB_NO_SYNC, yet.
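	As a rough illustration of what skipping fsync means on the SQLite side, the sketch below toggles SQLite's `synchronous' PRAGMA from Python. It is not taken from public-inbox: the `open_index_db' and `synchronous_level' helpers are hypothetical, and the assumption that the flag maps onto this PRAGMA is ours, not the commit's.

```python
import os
import sqlite3
import tempfile


def open_index_db(path, fsync=True):
    """Open a SQLite DB, optionally asking SQLite not to fsync on commit.

    Hypothetical helper for illustration only.
    """
    conn = sqlite3.connect(path)
    # synchronous=OFF hands writes to the OS without waiting for them to
    # reach stable storage: faster, but a crash or power loss may corrupt
    # the database.
    level = "FULL" if fsync else "OFF"
    conn.execute("PRAGMA synchronous = %s" % level)
    return conn


def synchronous_level(conn):
    """Return the current setting: 0=OFF, 1=NORMAL, 2=FULL, 3=EXTRA."""
    return conn.execute("PRAGMA synchronous").fetchone()[0]


if __name__ == "__main__":
    tmpdir = tempfile.mkdtemp()
    fast = open_index_db(os.path.join(tmpdir, "fast.sqlite3"), fsync=False)
    safe = open_index_db(os.path.join(tmpdir, "safe.sqlite3"), fsync=True)
    print(synchronous_level(fast), synchronous_level(safe))
```

	The trade-off is durability for speed, which is usually acceptable for an index that can be rebuilt from git.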
2020-07-25	index: support --rethread switch to fix old indices
	Older versions of public-inbox < 1.3.0 had subtly different semantics around threading in some corner cases. This switch (when combined with --reindex) allows us to fix them by regenerating associations.
2020-07-17	search: simplify unindexing
	Since over.sqlite3 seems here to stay, we no longer need to do Message-ID lookups against Xapian and can simply rely on the docid <=> NNTP article number equivalency SCHEMA_VERSION=15 gave us. This rids us of the closure-using batch_do sub in the v1 code path and vastly simplifies both v1 and v2 unindexing.
2020-07-17	overidx: favor non-OO sub dispatch for internal subs
	OO method dispatch was 10-15% slower when I was implementing the NNTP server. It also serves as a helpful reminder to the reader at the callsite as to whether a sub is likely in the same package as the caller or not.
2020-07-17	overidx: each_by_mid: pass self and args to callbacks
	This saves runtime allocations and reduces the likelihood of memory leaks either from cycles or buggy old Perl versions.
2020-07-14	over+msgmap: do not store filename after DBI->connect
	SQLite already knows the filename internally, so avoid having it as a long-lived Perl SV to save some bytes when there's many inboxes and open DBs.
2020-07-14	over: unset sqlite_unicode attribute
	None of the human-readable strings stored in over.sqlite3 require UTF-8. Message-IDs do not, nor do the compressed Subject IDs (sid) we use for Subject-based threading. And the `ddd' (doc-data-deflated) column is of course binary data. This frees us of having to use SQL_BLOB for the `ddd' column, and will open the door for us to use dbh_new for Msgmap, too.
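	To make the "binary data" point concrete, here is a minimal Python sketch of storing deflated document data in a BLOB column and reading it back. The `overview' table, its columns, and the helper names are hypothetical stand-ins, not the real over.sqlite3 schema; the Perl-specific SQL_BLOB binding has no direct Python equivalent, since Python's sqlite3 module stores `bytes' as BLOB automatically.

```python
import sqlite3
import zlib


def make_db():
    """Create an in-memory DB with a hypothetical overview-like table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE overview (num INTEGER PRIMARY KEY, ddd BLOB)")
    return conn


def store_doc_data(conn, num, doc_data):
    """Deflate the document data and store it as raw bytes."""
    deflated = zlib.compress(doc_data.encode("utf-8"))
    conn.execute("INSERT INTO overview (num, ddd) VALUES (?, ?)",
                 (num, deflated))


def load_doc_data(conn, num):
    """Fetch and inflate the document data, or return None if missing."""
    row = conn.execute("SELECT ddd FROM overview WHERE num = ?",
                       (num,)).fetchone()
    if row is None:
        return None
    return zlib.decompress(row[0]).decode("utf-8")


if __name__ == "__main__":
    db = make_db()
    store_doc_data(db, 1, "Subject line\nFrom line\n")
    print(repr(load_doc_data(db, 1)))
```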
2020-07-02	overidx: document why we don't use SQLite WAL
	I was wondering about this myself the other day and had to read up on it. So make a note of it for future readers.
2020-06-03	smsg: remove remaining accessor methods
	We'll continue to favor simpler data models that can be used directly rather than wasting time and memory with accessor APIs. The ->from, ->to, ->cc, ->mid, ->subject, ->references methods can all be trivially replaced by hash lookups since all their values are stored in doc_data. Most remaining callers of those methods were test cases, anyways. ->from_name is only used in the PSGI code, so we can just use ->psgi_cull to take care of populating the {from_name} field.
2020-06-03	smsg: get rid of remaining {mime} users
	We'll let $smsg->populate take care of everything all at once without hanging onto the header object for too long.
2020-05-12	overidx: document the SQLite PRAGMA we use
	This ought to prevent cargo-culting the cache_size PRAGMA into smaller SQLite DBs we might use.
2020-03-22	*idx: pass smsg in even more places
	We can finally get rid of the awkward, ad-hoc use of V2Writable, SearchIdx, and OverIdx args for passing {cotime} and {autime} between classes. We'll still use those git time fields internally within V2Writable and SearchIdx for (re)indexing, but that's not worth avoiding as a fallback.
2020-03-22	*idx: pass $smsg in more places instead of many args
	We can pass blessed PublicInbox::Smsg objects to internal indexing APIs instead of having long parameter lists in some places. The end goal is to avoid parsing redundant information each step of the way and hopefully make things more understandable.
2020-03-22	overidx: parse_references: less error-prone args
	Favor `$smsg->{mid}' instead of `$mid0' to reduce parameters down-the-line, but favor passing the Email::MIME::Header object around instead of relying on the bloat-prone `$smsg->{mime}' and calling ->header_obj on it.
2020-03-22	smsg: to_doc_data: use existing fields
	No need to pass extra parameters to this method, since smsg has universal meanings for {blob} and {mid}.
2020-03-22	rename PublicInbox::SearchMsg => PublicInbox::Smsg
	Since the introduction of over.sqlite3, SearchMsg is not tied to our search functionality in any way, so stop confusing ourselves and future hackers by just calling it "PublicInbox::Smsg". Add a missing "use" in ExtMsg while we're at it.
2020-03-22	index: use git commit times on missing Date/Received
	When indexing messages without Date: and/or Received: headers, fall back to using timestamps originally recorded by git in the commit object. This allows git mirrors to preserve the import datestamp and timestamp of a message according to what was fed into git, instead of blindly falling back to the current time.
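	The fallback can be sketched in a few lines of Python. This is a simplified illustration, not the project's Perl code: it only looks at the Date: header (the commit also covers Received:), and `message_timestamp' and its `git_time' argument are hypothetical names.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def message_timestamp(raw, git_time):
    """Return the Date: header as a Unix timestamp, else git_time.

    git_time stands for a timestamp recorded in the git commit object
    (hypothetical parameter for this sketch).
    """
    msg = message_from_string(raw)
    date = msg.get("Date")
    if date:
        try:
            return int(parsedate_to_datetime(date).timestamp())
        except (TypeError, ValueError):
            # Unparseable Date: header; treat it like a missing one.
            pass
    return git_time


if __name__ == "__main__":
    dated = "Date: Thu, 01 Jan 1970 00:01:40 +0000\nSubject: hi\n\nbody\n"
    undated = "Subject: hi\n\nbody\n"
    print(message_timestamp(dated, 12345))
    print(message_timestamp(undated, 12345))
```

	The point of the design is that a mirror reindexing the same git history gets the same result every time, which "now" can never provide.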
2020-02-06	treewide: run update-copyrights from gnulib for 2019
	I didn't wait until September to do it, this year!
2020-02-04	over: simplify read-only vs read-write checking
	No need to call ref() and do a string comparison. Add some extra tests using the {ReadOnly} attribute in DBI.pm.
2020-01-24	contentid: ignore duplicate References: headers
	OverIdx::parse_references already skips duplicate References (which we use in SearchThread for rendering). So there's no reason for our content deduplication logic to care if a Message-Id in the Reference header is mentioned twice.
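	Skipping duplicates while keeping first-seen order can be sketched as follows. The regular expression and the `unique_references' name are assumptions made for this example; the real parsing in OverIdx::parse_references may differ.

```python
import re

# Simplified Message-ID matcher: anything between angle brackets.
MID_RE = re.compile(r"<([^>]+)>")


def unique_references(header_values):
    """Return Message-IDs from References: values, first occurrence only."""
    seen = set()
    ordered = []
    for value in header_values:
        for mid in MID_RE.findall(value):
            if mid not in seen:
                seen.add(mid)
                ordered.append(mid)
    return ordered


if __name__ == "__main__":
    print(unique_references(["<a@example> <b@example>", "<a@example>"]))
```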
2019-10-28	index: allow search/lookups on X-Alt-Message-ID
	Since we replace extra Message-ID headers with X-Alt-Message-ID to placate NNTP clients, we should allow searching and indexing on X-Alt-Message-ID just like we do with Message-ID.
2019-10-23	Merge branch 'regen'
	* regen:
	  v2writable: use msgmap as multi_mid queue
	  v2writable: move git->cleanup to the correct place
	  v2writable: reindex handles 3-headered monsters
	  v2writable: improve "num_for" API and disambiguate
	  v2writable: set unindexed article number
2019-10-22	overidx: remove unused delete_articles sub
	This hasn't been used since commit 1b7e935ab1690e28 ("searchidx: fix incremental index with indexlevel=basic on v1")
2019-10-21	v2writable: reindex handles 3-headered monsters
	And maybe 8-headered ones, too... I noticed --reindex failing on the linux-renesas-soc mirror due to one 3-headed monster of a message having 3 sets of headers, while another normal message had a Message-ID that matched one of the 3 IDs of the 3-headed monster. We still try to do the majority of indexing backwards, but we defer indexing multi-Message-ID'd messages until the end to ensure we get all the "good" messages in before we process the multi-headered ones.
	Link: https://public-inbox.org/meta/20191016211415.GA6084@dcvr/
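	The deferral idea can be sketched as a simple two-pass ordering. This is only an illustration with a hypothetical `index_order' function and in-memory lists; it does not reflect how v2writable actually queues the deferred messages.

```python
def index_order(messages):
    """Order messages so multi-Message-ID ones are indexed last.

    messages: iterable of (blob_id, [message_ids]) in the order they are
    encountered while walking history. Relative order is preserved within
    each group.
    """
    single = []
    deferred = []
    for blob_id, mids in messages:
        if len(set(mids)) > 1:
            deferred.append((blob_id, mids))
        else:
            single.append((blob_id, mids))
    return single + deferred


if __name__ == "__main__":
    history = [
        ("blob3", ["a@x", "b@x", "c@x"]),  # the 3-headed monster
        ("blob2", ["b@x"]),                # normal message sharing an ID
        ("blob1", ["d@x"]),
    ]
    for blob_id, mids in index_order(history):
        print(blob_id, mids)
```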
2019-09-09	run update-copyrights from gnulib for 2019

2019-05-15	www: use Inbox->over where appropriate
	We don't need to rely on Xapian search functionality for the majority of the WWW code, even. subject_normalized is moved to SearchMsg, where it (probably) makes more sense, anyways.
2019-05-14	searchidx: fix incremental index with indexlevel=basic on v1
	We were reindexing the full history every invocation of -index when Xapian was not used because we were incorrectly relying on 'last_commit' metadata stored in Xapian. Rewrite the indexing logic to be less confusing while we're at it, since we rely on `git merge-base --is-ancestor' nowadays. Furthermore, we need to handle message removals from the overview index correctly when Xapian is not in use.
	Co-authored-by: Eric W. Biederman <ebiederm@xmission.com>
2018-08-05	overidx: preserve `tid' column on re-indexing
	Otherwise, walking backwards through history could mean the root message in a thread forgets its `tid' and it prevents messages from being looked up by it. This bug was hidden by the fact that `sid' matches were often good enough to link threads together.
2018-04-07	store less data in the Xapian document
	Since we only query the SQLite over DB for OVER/XOVER, we do not need to waste space storing fields To/Cc/:bytes/:lines or the XNUM term. We only use From/Subject/References/Message-ID/:blob in various places of the PSGI code. For reindexing, we will take advantage of docid stability in "xapian-compact --no-renumber" to ensure duplicates do not show up in search results. Since the PSGI interface is the only consumer of Xapian at the moment, it has no need to search based on NNTP article number.
2018-04-07	over: remove forked subprocess
	Since the overview stuff is a synchronization point anyways, move it into the main V2Writable process and allow us to drop a bunch of code. This is another step towards making Xapian optional for v2.

	In other words, the fan-out point is moved and the Xapian partitions no longer need to synchronize against each other:

	Before:
	                   /-------->\
	                  /---------->\
	   v2writable -->+----parts----> over
	                  \---------->/
	                   \-------->/

	After:
	                        /---------->
	                       /----------->
	v2writable --> over-->+----parts--->
	                       \----------->
	                        \---------->

	Since the overview/threading logic needs to run on the same core that feeds git-fast-import, it's slower for small repos but is not noticeable in large imports where I/O wait in the partitions dominates.
2018-04-06	search: index and allow searching by date-time
	Dscho found this useful for finding matching git commits based on AuthorDate in git. Add it to the overview DB format, too; so in the future we can support v2 repos without Xapian.
	https://public-inbox.org/git/nycvar.QRO.7.76.6.1804041821420.55@ZVAVAG-6OXH6DA.rhebcr.pbec.zvpebfbsg.pbz
	https://public-inbox.org/git/alpine.DEB.2.20.1702041206130.3496@virtualbox/
2018-04-06	over: use only supported and safe SQLite APIs
	Some of this jankiness was from early performance problems and they turned out to be unnecessary measures.
2018-04-02	replace Xapian skeleton with SQLite overview DB
	This ought to provide better performance and scalability which is less dependent on inbox size. Xapian does not seem optimized for some queries used by the WWW homepage, Atom feeds, XOVER and NEWNEWS NNTP commands. This can actually make Xapian optional for NNTP usage, and allow more functionality to work without Xapian installed. Indexing performance was extremely bad at first, but DBI::Profile helped me optimize away problematic queries.
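	The kind of query an overview DB makes cheap can be sketched with Python's sqlite3 module. The `overview' table, its columns, and the function names below are hypothetical and far simpler than the real over.sqlite3 schema; the sketch only shows why an article-number range lookup (as XOVER needs) suits an integer primary key.

```python
import sqlite3


def make_overview_db(subjects):
    """Build an in-memory DB; article numbers start at 1 (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE overview (num INTEGER PRIMARY KEY, subject TEXT)")
    conn.executemany(
        "INSERT INTO overview (num, subject) VALUES (?, ?)",
        list(enumerate(subjects, start=1)))
    return conn


def xover_range(conn, first, last):
    """Return (num, subject) rows for article numbers first..last inclusive."""
    # A range scan on the INTEGER PRIMARY KEY walks the table b-tree in
    # order, so its cost depends on the range size, not the inbox size.
    return conn.execute(
        "SELECT num, subject FROM overview "
        "WHERE num BETWEEN ? AND ? ORDER BY num",
        (first, last)).fetchall()


if __name__ == "__main__":
    db = make_overview_db(["one", "two", "three", "four", "five"])
    for row in xover_range(db, 2, 4):
        print(row)
```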