public-inbox.git - an "archives first" approach to mailing lists

Date	Commit message
2018-03-19	v2writable: support "barrier" operation to avoid reforking
	Stopping and starting a bunch of processes to look up duplicates or removals is inefficient. Take advantage of checkpointing in "git fast-import" and transactions in Xapian and SQLite.
2018-03-06	import: fall back to Sender for extracting name and email
	This seems like a reasonable course of action for old messages. Cc: Nicolás Ojeda Bär <n.oje.bar@gmail.com>
2018-03-06	favor Received: date over Date: header globally
	The first Received: header is believable since it typically hits the user's mail server and can be treated as relatively trustworthy. We still show the Date: in per-message (permalink) views, which may expose users as having incorrect Date: headers, but all the ISO YYYY-MM-DD dates we display will match what we see.
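A minimal Python sketch of the preference order described above. It assumes "first" means the topmost Received: header in the message and that the timestamp follows the last ";" in that header. The names `best_timestamp` and `_parse` are hypothetical, and this is not the project's Perl code.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def _parse(value):
    """Return a datetime for an RFC 2822 date string, or None if unparseable."""
    try:
        return parsedate_to_datetime(value.strip())
    except (TypeError, ValueError, IndexError):
        return None


def best_timestamp(raw_message):
    """Prefer the topmost Received: timestamp, then fall back to Date:."""
    msg = message_from_string(raw_message)
    received = msg.get_all("Received") or []
    if received:
        # Assumption: the date is whatever follows the last ';'.
        _, sep, date_part = received[0].rpartition(";")
        if sep:
            ts = _parse(date_part)
            if ts is not None:
                return ts
    date_hdr = msg.get("Date")
    return _parse(date_hdr) if date_hdr else None
```

A message with both headers yields the Received: time; a message with only Date: yields the Date: time; a message with neither yields None.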
2018-03-03	import: consolidate object info for v2 imports
	It's easier to store everything in one array ref similar to what our Git->check routine returns
2018-02-28	v2writable: cleanup unused pipes in partitions
	Leaking these pipes to child processes wasn't harmful, but made determining relationships and dataflow between processes more confusing.
2018-02-22	v2: parallelize Xapian indexing
	The parallelization requires splitting Msgmap, text+term indexing, and thread-linking out into separate processes.

	git-fast-import is fast, so we don't bother parallelizing it.

	Msgmap (SQLite) and thread-linking (Xapian) must be serialized because they rely on monotonically increasing numbers (NNTP article number and internal thread_id, respectively).

	We handle msgmap in the main process which drives fast-import. When the article number is retrieved/generated, we write the entire message to per-partition subprocesses via pipes for expensive text+term indexing.

	When these per-partition subprocesses are done with the expensive text+term indexing, they write SearchMsg (small data) to a shared pipe (inherited from the main V2Writable process) back to the threader, which runs its own subprocess.

	The number of text+term Xapian partitions is chosen at import and can be made equal to the number of cores in a machine.

	V2Writable --> Import -> git-fast-import
	           \-> SearchIdxThread -> Msgmap (synchronous)
	           \-> SearchIdxPart[n] -> SearchIdx[*]
	           \-> SearchIdxThread -> SearchIdx ("threader", a subprocess)

	[*] each subprocess writes to threader
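The dataflow above can be illustrated with a small Python sketch: article numbers are allocated serially in one place, then each message is handed to one of n partitions. The commit message does not say how a partition is picked; the modulo rule here is an assumption, the `Dispatcher` class is hypothetical, and plain lists stand in for the per-partition pipes.

```python
import itertools


class Dispatcher:
    """Serial article-number allocation plus fan-out to n partitions."""

    def __init__(self, nparts):
        self.nparts = nparts
        # Monotonically increasing article numbers, allocated in one place.
        self._next_num = itertools.count(1)
        # Lists stand in for the pipes to per-partition subprocesses.
        self.queues = [[] for _ in range(nparts)]

    def add(self, raw):
        num = next(self._next_num)   # serialized step
        part = num % self.nparts     # assumed partitioning rule
        self.queues[part].append((num, raw))
        return num, part
```

With four partitions, articles 1 and 5 land in partition 1 and article 4 lands in partition 0, so the expensive indexing work spreads evenly while numbering stays strictly ordered.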
2018-02-20	v2: support Xapian + SQLite indexing
	This is too slow, currently. Working with only 2017 LKML archives:

	  git-only: ~1 minute
	  git + SQLite: ~12 minutes
	  git+Xapian+SQLite: ~45 minutes

	So yes, it looks like we'll need to parallelize Xapian indexing, at least.
2018-02-19	v2writable: initial cut for repo-rotation
	Wrap the old Import package to enable creating new repos based on size thresholds. This is better than relying on time-based rotation as LKML traffic seems to be increasing.
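A rough Python sketch of size-based rotation, as opposed to time-based rotation. The function name, its parameters, and the "start a new repo once the threshold would be exceeded" rule are all assumptions for illustration; the commit message does not give the actual threshold or logic.

```python
def next_repo_index(current_index, current_size, incoming_size, max_size):
    """Return the index of the repo that should receive the next message.

    Rotation depends only on bytes stored, so a busier list simply
    rotates sooner instead of producing ever-larger repos.
    """
    if current_size + incoming_size > max_size:
        return current_index + 1
    return current_index
```

For example, with a hypothetical 1000-byte threshold, a 100-byte message goes to repo 0 when 900 bytes are stored and to repo 1 when 950 bytes are stored.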
2018-02-15	import: allow the epoch (0s) as a valid time
	Although email did not exist until 1971, "Jan 1, 1970 00:00:00" seems like a common default timestamp for some test emails to use as a Date: header.
2018-02-15	import: quiet down warnings from bogus From: lines
	There's a lot of crap in archives and git-fast-import accepts empty names and email addresses for authors just fine.
2018-02-15	import: pass "raw" dates to git-fast-import(1)
	For LKML, it appears we need an even more liberal parser than the RFC 2822 date parser in git. I have not validated that Date::Parse parses dates correctly, but this at least prevents git-fast-import(1) from choking.
2018-02-14	import: APIs to support v2 use
	Wrap "get-mark" and "checkpoint" commands for git-fast-import while documenting/cementing parts of the API.
2018-02-12	import: initial handling for v2
	Call order will need to change a bit since this is going to be tied to Xapian
2018-02-09	import: begin supporting this without ssoma.lock
	We'll reuse this class in v2, but won't be utilizing per-git-repository ssoma.lock files. Meanwhile, stop treating ::Inbox objects as an afterthought and allow importing name and email into them.
2018-02-08	import: stop writing legacy ssoma.index by default
	For machines which have never seen ssoma, they don't need the index so stop creating it.
2018-02-07	update copyrights for 2018
	Using update-copyrights from gnulib. While we're at it, use the SPDX identifier for AGPL-3.0+ to ease mechanical processing.
2017-11-16	learn: use "spam" as subject for removal commits
	Sometimes an email is an innocent removal "rm" for a misdirected, off-topic post, while most removed messages are "spam". Allow anybody to look at history and easily distinguish the reason for removing the message.
2017-06-20	import: fix encoding issues from weird "raw" emails
	This seems to allow weirdly-encoded "raw" emails in blade.nagaokaut.ac.jp/ruby/ruby-core/* to be handled without difficulties.
2017-05-25	import: reset :raw mode for commit title (subject)
	This was necessary for the presence of the 0xa0 byte in the Subject: of the message at: http://blade.nagaokaut.ac.jp/ruby/ruby-core/3220 That is 0xa0, not 0x0a ("\n"), so I wonder if the nibbles got swapped somehow.
2017-01-10	introduce PublicInbox::MIME wrapper class
	This should fix problems with multipart messages where text/plain parts lack a header. cf. git clone --mirror https://github.com/rjbs/Email-MIME.git refs/pull/28/head In the future, we may still introduce a streaming interface to reduce memory usage on large emails.
2016-10-16	import: failed GC runs are non-fatal
	We should not completely kill a process if "git gc --auto" errors out due to a warning or whatnot.
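A minimal Python sketch of treating a maintenance command as non-fatal: a failure is reported as a warning and the caller carries on. The helper names `run_nonfatal` and `gc_auto` are hypothetical; `git --git-dir=<path> gc --auto` is the only real command line used.

```python
import subprocess
import sys


def run_nonfatal(argv):
    """Run a command; warn and return False on failure instead of raising."""
    try:
        rc = subprocess.run(argv).returncode
    except OSError as e:
        # The command could not be started at all (e.g. not installed).
        print(f"warning: could not run {argv[0]}: {e}", file=sys.stderr)
        return False
    if rc != 0:
        print(f"warning: {' '.join(argv)} exited with {rc}", file=sys.stderr)
        return False
    return True


def gc_auto(git_dir):
    """Best-effort repository maintenance; never aborts the import."""
    return run_nonfatal(["git", f"--git-dir={git_dir}", "gc", "--auto"])
```

The import result does not depend on the return value; it is only there so a caller can log or retry if it wants to.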
2016-09-08	import: run "git gc --auto" when done
	We need to prevent excessive repository growth for public-inbox-watch and public-inbox-mda users.
2016-09-08	import: hoist out common run_die subroutine
	We will be reusing this in the next commit, too.
2016-09-08	import: hoist out _check_path function
	This reduces duplication, slightly. We may be using it yet again in a to-be-introduced function (or we may not introduce it).
2016-08-15	import: use common address parsing to drop unnecessary quotes
	Not sure why or how I missed this before, but the common address parsing routine we have should be more correct. Add a test to ensure excessively quoted names don't make it through, either.
2016-08-12	watch: respect altid for incremental watch changes
	We need to pass the Inbox object to SearchIdx to get altid mappings properly for incremental imports. TODO: use the Inbox object in more places where it makes sense to do so.
2016-08-09	searchidx: release Xapian FDs before spawning git log
	This will allow us to release and re-acquire Xapian locks due to the lack of FD_CLOEXEC on some FDs.
2016-08-02	search: improve reindexing behavior
	For reindexing, fresh Xapian DBs do not count as a reindex, allowing users to blindly use --reindex on the first run on a clean repo. While we're at it, allow indexing to override HEAD ref for multi-head git repos.
2016-07-27	localize $/ when using chomp
	Callers may have localized $/ to something else, so make sure we chomp the expected character(s) when calling chomp.
2016-06-24	watch_maildir: implement optional spam checking
	Mailing lists I watch and mirror may not have the best spam filtering, and an extra layer should not hurt.
2016-06-19	import: allow messages without subject
	Because our WatchMaildir module is liberal about what it accepts, we can potentially have messages without a subject.
2016-06-17	import: auto-update index when done
	This prevents multiple update processes from stepping over each other while called under the lock, and also allows the new -watch process to update the index iff indexing was desired.
2016-05-25	remove Email::Address dependency
	git has stricter requirements for ident names (no '<>') which Email::Address allows. Even in 1.908, Email::Address also has an incomplete fix for CVE-2015-7686 with a DoS-able regexp for comments. Since we don't care for or need all the RFC compliance of Email::Address, avoiding it entirely may be preferable. Email::Address will still be installed as a requirement for Email::MIME, but it is only used by Email::MIME::header_str_set, which we do not use.
2016-05-21	import: avoid needless git update-server-info
	We don't need to update-server-info (or read-tree) if fast import was spawned for removals and no changes were made.
2016-05-12	import: fallback to email if '<>' exists in author name
	git doesn't handle '<' and '>' characters in the author name at all regardless of quoting, not just matched pairs. So fall back to using the email as the author name since the commit info isn't critical, anyways (shallow clones are fine).
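The fallback described above, sketched in Python. The function name `git_author_name` is hypothetical, and the sketch assumes the email address itself contains no angle brackets.

```python
def git_author_name(name, email):
    """Return a name that is safe to use in a git ident.

    Any '<' or '>' in the name, matched or not, triggers the fallback
    to the email address.
    """
    if "<" in name or ">" in name:
        return email
    return name
```

A name such as `Alice <3` is replaced by the address, while `Alice` passes through unchanged.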
2016-05-12	import: normalize body by stripping trailing newlines
	Mbox formatters may add extra newlines at the end of the message, and that's not relevant for comparing messages for deletion.
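As an illustration in Python, comparing bodies after stripping trailing newlines makes a message with extra blank lines at the end match the stored copy. The helper name `same_body` is hypothetical; only "\n" is stripped here, which is an assumption about what counts as a trailing newline.

```python
def same_body(a, b):
    """Compare two message bodies (bytes), ignoring trailing newlines."""
    return a.rstrip(b"\n") == b.rstrip(b"\n")
```

Newlines inside the body still count, so only differences at the very end are ignored.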
2016-04-28	import: run git-update-server-info when done
	We should update $GIT_DIR/info/refs for dumb HTTP clients whenever we make changes to the repository. The best place to update is immediately after making commits. This fixes a bug where public-inbox-learn did not properly update $GIT_DIR/info/refs after inserting or removing messages.
2016-04-27	import: document API for public consumption
	This is probably trivial enough to be final?
2016-04-25	remove ssoma dependency
	By converting to our git-fast-import-based Import module. This should allow us to be more easily installed.
2016-04-25	import: extra check for final byte read
	The read could fail entirely and leave $lf undefined.
2016-04-12	import: filter out [<>] from user names
	It confuses the git ident parser and may not be a great idea to fix in git since it could break interoperability with older versions.
2016-04-11	import: use bytes::length for true data length in bytes
	git is byte-oriented and fast-import will not tolerate miscalculations. This is necessary for wide characters in commit messages (email Subjects).
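To show why the byte count matters, here is a short Python sketch that builds a fast-import `data` command with an exact byte count. The helper name `data_command` is hypothetical; the `data <count>` framing with an optional trailing newline is the documented fast-import input format.

```python
def data_command(text):
    """Frame text for git fast-import using its length in bytes."""
    payload = text.encode("utf-8")
    # len(text) would count characters and undercount non-ASCII subjects;
    # len(payload) is the true length in bytes.
    return b"data %d\n" % len(payload) + payload + b"\n"
```

For the five-character subject "héllo", the count written is 6, because "é" occupies two bytes in UTF-8.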
2016-04-11	import: set binmode before printing author names
	Author names may have wide characters in them, so avoid warnings as git favors UTF-8 for names and fast-import even requires it for commit messages.
2016-04-11	import: initial module + test case
	This will allow us to write fast importers for existing archives as well as eventually removing the ssoma dependency for performance and ease-of-installation.