public-inbox.git - an "archives first" approach to mailing lists

Date	Commit message
2020-08-23	searchidx: index THREADID in Xapian
	This is the `tid' column from over.sqlite3; and will be used for IMAP and JMAP search (among other things).
2020-08-23	searchidx: put all shard-related stuff in SearchIdxShard.pm
	We'll also rename the /^remote_/ prefix to "shard_", since remote implies the process is on a different host. These methods only pass messages to a child process on the same host OR perform operations within the same process.
2020-08-23	searchidxshard: clear $msgref buffer properly
	Merely assigning `undef' to a scalar does not free the underlying buffer memory of a scalar.
2020-08-20	init+index: support --skip-docdata for Xapian
	Since we no longer read document data from Xapian, allow users to opt out of storing it. This breaks compatibility with previous releases of public-inbox, but gives us a ~1.5% space savings on Xapian storage (and associated I/O and page cache pressure reduction).
2020-08-07	index: support --xapian-only switch
	This is useful for speeding up indexing runs when only Xapian rules change but SQLite indexing doesn't change. This mostly implies `--reindex', but does NOT pick up new messages (because SQLite indexing needs to occur for that). I'm leaving this undocumented in the manpage for now since it's mainly to speed up development and testing. Users upgrading to 1.6.0 will be advised to `--reindex --rethread', anyways, due to the threading improvements since 1.1.0-pre1. It may make sense to document for 1.7+ when there are Xapian-only indexing changes, though.
2020-07-25	searchidx: rename _xdb_{acquire,release} => idx_
	The "xdb" prefix was inaccurate since it's used by indexlevel=basic, which is Xapian-free. The '_' (underscore) prefix was also wrong for a method which is called across package boundaries.
2020-07-25	use consistent {ibx} field for writable code paths
	This is a step which makes our use of abbreviations more consistent when referring to PublicInbox::Inbox objects. We'll also be reducing the number of redundant fields in SearchIdx and V2Writable code paths to make the object graph easier-to-follow.
2020-07-17	search: simplify unindexing
	Since over.sqlite3 seems here to stay, we no longer need to do Message-ID lookups against Xapian and can simply rely on the docid <=> NNTP article number equivalence SCHEMA_VERSION=15 gave us. This rids us of the closure-using batch_do sub in the v1 code path and vastly simplifies both v1 and v2 unindexing.
2020-07-17	drop binmode usage
	We only support Unix-like platforms where binmode (":raw") is the default anyways, and v5.10 semantics means it won't do unicode_strings (unlike v5.12). So save some lines of code.
2020-07-17	v2: use v5.10.1, parent.pm, drop warnings
	The "5.010_001" form was for Perl 5.6, which I doubt anybody would attempt; so favor "v5.10.1" as it is more readable to humans. Prefer "parent" to "base" since the former is lighter. We'll also rely on warnings from "-w" globally (or not) instead of via "use". We'll also update "use" statements to reflect what's actually used by V2Writable.
2020-06-13	index: account for CRLF conversion when storing bytes
	NNTP and IMAP both require CRLF conversions on the wire. They're also the only components which care about $smsg->{bytes}, so store the CRLF-adjusted value in over.sqlite3 and Xapian DBs. This will allow us to optimize the RFC822.SIZE fetch item in IMAP without triggering the size mismatch errors that some clients raise in their default configurations (e.g. Mail::IMAPClient; most others do not). It could also fix hypothetical problems with NNTP clients that report discrepancies between overview and article data.
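The CRLF adjustment described above can be illustrated with a minimal Python sketch. This is not the project's Perl code: the function names `crlf_adjusted_size` and `to_wire` are hypothetical, and the sketch assumes that only bare LF line endings (those not already preceded by CR) grow by one byte on the wire.

```python
import re

# Matches an LF that is not already preceded by CR (a "bare" LF).
_BARE_LF = re.compile(rb"(?<!\r)\n")


def crlf_adjusted_size(raw: bytes) -> int:
    """Size the message would have after LF -> CRLF conversion.

    Hypothetical helper: each bare LF gains one CR byte on the wire.
    """
    return len(raw) + len(_BARE_LF.findall(raw))


def to_wire(raw: bytes) -> bytes:
    """Convert bare LFs to CRLF, as NNTP/IMAP require on the wire."""
    return _BARE_LF.sub(b"\r\n", raw)


# The stored size matches what a client would actually receive.
msg = b"Subject: hi\n\nbody\r\nmore\n"
assert crlf_adjusted_size(msg) == len(to_wire(msg))
```

Storing the adjusted value once at index time means a size query can be answered from the index alone, without re-reading and converting the message.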
2020-05-19	favor readline() and print() as functions
	In our inbox-writing code paths, ->getline as an OO method may be confused with the various definitions of `getline' used by the PSGI interface. It's also easier to do: "perldoc -f readline" than to figure out which class "->getline" belongs to (IO::Handle) and lookup documentation for that. ->print is less confusing than the "readline" vs "getline" mismatch, but we can still make it clear we're using a real file handle and not a mock interface. Finally, functions are a bit faster than their OO counterparts.
2020-05-09	replace most uses of PublicInbox::MIME with Eml
	PublicInbox::Eml has enough functionality to replace the Email::MIME-based PublicInbox::MIME.
2020-03-31	v2writable: index Message-IDs w/ spaces properly
	Message-IDs can apparently contain spaces and other weird characters. Ensure we pass those properly to shard subprocesses when importing messages in parallel mode. Our NNTP request parser does not deal with spaces in the Message-ID, yet, and I don't expect most NNTP clients to, either. Nor does the Net::NNTP client handle them in responses.
2020-03-29	searchidxshard: ensure we set indexlevel on shard[0]
	For sharded v2 repositories with few-enough messages, it is possible for shard[0] to go unused and never trigger the ->commit_txn_lazy to set the indexlevel field in Xapian metadata. So set it immediately at initialization and avoid this case. While we're at it, avoid triggering needless pwrite syscalls from ->set_metadata by checking with ->get_metadata, first.
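The "check before writing" idea above can be sketched in Python as follows. `FakeShard` is a hypothetical in-memory stand-in for a Xapian shard, not the real binding; it only mimics `get_metadata` returning an empty string for unset keys and counts calls to `set_metadata`. The helper name `ensure_metadata` and the value "medium" are likewise illustrative.

```python
class FakeShard:
    """Hypothetical in-memory stand-in for a writable Xapian shard."""

    def __init__(self):
        self.meta = {}
        self.writes = 0  # counts set_metadata calls (stand-in for disk writes)

    def get_metadata(self, key):
        # Unset keys read back as the empty string.
        return self.meta.get(key, "")

    def set_metadata(self, key, value):
        self.meta[key] = value
        self.writes += 1


def ensure_metadata(db, key, value):
    """Write the value only if it differs from what is already stored."""
    if db.get_metadata(key) == value:
        return False  # already up to date, skip the write
    db.set_metadata(key, value)
    return True


shard = FakeShard()
ensure_metadata(shard, "indexlevel", "medium")  # first call writes
ensure_metadata(shard, "indexlevel", "medium")  # second call is a no-op
assert shard.writes == 1
```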
2020-03-22	*idx: pass smsg in even more places
	We can finally get rid of the awkward, ad-hoc use of V2Writable, SearchIdx, and OverIdx args for passing {cotime} and {autime} between classes. We'll still use those git time fields internally within V2Writable and SearchIdx for (re)indexing, but that's not worth avoiding as a fallback.
2020-03-22	v2: pass smsg in more places
	We can pass fewer order-dependent args to V2Writable::do_idx and SearchIdxShard::index_raw by passing the smsg object, instead.
2020-03-22	*idx: pass $smsg in more places instead of many args
	We can pass blessed PublicInbox::Smsg objects to internal indexing APIs instead of having long parameter lists in some places. The end goal is to avoid parsing redundant information each step of the way and hopefully make things more understandable.
2020-03-22	index: use git commit times on missing Date/Received
	When indexing messages without Date: and/or Received: headers, fall back to using timestamps originally recorded by git in the commit object. This allows git mirrors to preserve the import datestamp and timestamp of a message according to what was fed into git, instead of blindly falling back to the current time.
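A rough Python sketch of this fallback, not the project's Perl implementation: the function name `message_times` is hypothetical. It also assumes, without confirmation from the log, that the author time backs up a missing Date: header and the committer time backs up a missing Received: header.

```python
from email import message_from_bytes
from email.utils import parsedate_to_datetime


def _parse(value):
    """Return an epoch timestamp for an RFC 2822 date string, or None."""
    if not value:
        return None
    try:
        return int(parsedate_to_datetime(value.strip()).timestamp())
    except (TypeError, ValueError):
        return None  # unparsable header: treat it as missing


def message_times(raw, autime, cotime):
    """Return (datestamp, timestamp) for a raw message.

    autime/cotime are the git author/committer times of the commit that
    introduced the message (assumed mapping; see the lead-in).
    """
    msg = message_from_bytes(raw)
    ds = _parse(msg.get("Date"))
    received = msg.get("Received")
    # The date in a Received: header follows the last ';'.
    ts = _parse(received.rsplit(";", 1)[-1]) if received else None
    return (ds if ds is not None else autime,
            ts if ts is not None else cotime)


# No Date: or Received: header, so both values come from git.
assert message_times(b"Subject: x\n\nbody\n", 100, 200) == (100, 200)
```

The point of the fallback is that a mirror re-indexing the same git history arrives at the same times as the origin, rather than stamping messages with whenever the mirror happened to run.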
2020-02-06	treewide: run update-copyrights from gnulib for 2019
	I didn't wait until September to do it, this year!
2020-02-02	searchidxshard: rely on autoflush instead of ->flush
	It reduces the number of ops and simplifies the code, slightly. Add a missing IO::Handle import while we're at it, to be explicit about which methods we use.
2019-11-03	searchidxshard: reuse $SIG{__WARN__} callback from Admin
	We don't want to define $SIG{__WARN__} in the worker to call an existing non-default callback. Instead update ->{current_info} the same way the V2Writable master process does. I noticed this while reindexing with a large XAPIAN_FLUSH_THRESHOLD and seeing the wrong epoch on my terminal from a shard because the shard worker was spawned while reindexing a higher-numbered epoch.
2019-09-09	run update-copyrights from gnulib for 2019

2019-06-14	v2: rename SearchIdxPart => SearchIdxShard
	Another step towards keeping our file and package names consistent with Xapian terminology.