public-inbox.git - an "archives first" approach to mailing lists

Date	Commit message
2019-01-05	filter/rubylang: fix SQLite DB lifetime problems
	Clearly the AltId stuff was never tested for v2. Ensure this tricky filter (which reuses Msgmap to avoid introducing new serial numbers) doesn't trigger SQLite deadlocks due to opening a DB for writing multiple times. I went through several iterations of this change before going with this one, which is the least intrusive I could find.
2019-01-02	update and add documentation for repository formats
	Remove confusing documentation around ssoma now that we have NNTP and downloadable mbox support. Only lightly-checked for grammar and speling, and not yet formatting. Edits, corrections and addendums expected :>
2018-12-30	handle "multipart/mixed" messages which are not multipart
	I've found two examples on https://lore.kernel.org/lkml/ where the messages declared themselves to be "multipart/mixed" but were actually plain text: <87llgalspt.fsf@free.fr> <200308111450.h7BEoOu20077@mail.osdl.org> With the mboxrd downloaded, mutt is able to view them without difficulty. Note: this change would require reindexing of Xapian to pick up the changes. But it's only two ancient messages, the first was resent by the original sender and the second is too old to be relevant.
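A minimal sketch of the failure mode, written with Python's standard `email` package rather than the project's Perl code: a message declares "multipart/mixed" but carries only a plain-text body. The helper name `body_text` and the sample message are hypothetical and not part of public-inbox.

```python
from email import message_from_string

# Hypothetical sample: claims multipart/mixed but has no boundary or parts.
RAW = (
    "From: a@example.com\n"
    "Subject: not really multipart\n"
    "Content-Type: multipart/mixed\n"
    "\n"
    "just plain text\n"
)


def body_text(raw: str) -> str:
    """Return the text body, tolerating a bogus multipart declaration."""
    msg = message_from_string(raw)
    if msg.is_multipart():
        # Genuine multipart: join the text/plain leaf parts.
        parts = [p.get_payload() for p in msg.walk()
                 if p.get_content_type() == "text/plain"]
        return "".join(parts)
    # No parts were parsed, whatever the header claims:
    # fall back to treating the payload as plain text.
    return msg.get_payload()
```

The point is to trust the parsed structure (`is_multipart()`) over the declared Content-Type when the two disagree.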
2018-12-28	add filter for gmane archives
	Extracted from import_slrnspool, since some spools get converted to mbox or what not.
2018-07-29	mda: allow configuring globally without spamc support
	This reuses some of the configuration from -watch, but remains independent since some configurations will use -watch for some inboxes and -mda for others. The default remains "spamc" for -mda users so nothing changes without explicit configuration. Per-inbox configurations may also be supported in the future.
2018-07-29	mda: v2: ensure message bodies are indexed
	We must not clobber the original message string, as Email::MIME (*) still needs it for iterating through parts in SearchIdx (but not when handing it as a raw string to git-fast-import). I've noticed message bodies (especially dfpre/dpost) were not getting indexed when going through -mda (no problems with -watch). This also did not affect v1 repos, since indexing is a separate process for v1 and requires re-reading the data from git. (*) tested Email::MIME 1.937 on Debian stretch
2018-07-18	SearchIdx: Decrement regen_down even for added messages that are later deleted.
	Decrement regen_down when visiting messages that appear in %D that we know will later be deleted. This ensures consistent message numbers are generated no matter which commit number is on top, allowing deletes to propagate separately from the messages they delete without causing problems. The v2 trees already do this and when the indexes are deleted and rebuilt they maintain their commit numbers. Add a v1 version of the v2reindex test to verify that reindexing is working properly on v1 as well as v2. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-07-07	Import: Don't copy nulls from emails into git
	Recently I ran git --git-dir=lkml/git/1.git fsck and it reported: > warning in commit 299dbd50b6995c6debe2275f0df984ce697fb4cc: nulInCommit: NULL byte in the commit object body Which I found quite scary. Nulls in the wrong place have a bad tendency to make programs misbehave. It turns out someone had placed "=?iso-8859-1?q?=00?=" at the end of their subject line, which is the MIME encoding for NULL. Email::MIME had correctly decoded the header, and then public-inbox had simply copied the contents of the header into the subject line of the git commit. To prevent that from causing problems, replace nulls in such subject lines with spaces. Signed-off-by: Eric Biederman <ebiederm@xmission.com>
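The fix described above can be sketched as follows. This is an illustration in Python, not the project's Perl code, and `clean_subject` is a hypothetical name: decode the MIME-encoded header, then replace any NUL characters with spaces before the text is used as a commit subject.

```python
from email.header import decode_header


def clean_subject(raw_subject: str) -> str:
    """Decode a MIME-encoded Subject and replace NUL characters with spaces."""
    parts = []
    for chunk, charset in decode_header(raw_subject):
        if isinstance(chunk, bytes):
            # Encoded words and mixed headers come back as bytes.
            chunk = chunk.decode(charset or "ascii", errors="replace")
        parts.append(chunk)
    return "".join(parts).replace("\x00", " ")
```

For example, `clean_subject("hello =?iso-8859-1?q?=00?=")` returns a string with no NUL in it.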
2018-07-06	MsgTime.pm: Use strptime to compute the time zone
	Recently I had trouble cloning lkml/git/0.git because git fsck on receive was failing. The output of git fsck was:
	> Checking object directories: 100% (256/256), done.
	> warning in commit 59173dc1fe67b113ace4ce83e7f522414b3e0404: badTimezone: invalid author/committer line - bad time zone
	> warning in commit ff22aaff22eb4479e49e93f697e385f76db51c55: badTimezone: invalid author/committer line - bad time zone
	> warning in commit 609b744909693f5f00aff5ed9928beeeee9ded2e: badTimezone: invalid author/committer line - bad time zone
	> warning in commit 084572141db8e0d879428afb278bd338f2dbb053: badTimezone: invalid author/committer line - bad time zone
	> warning in commit 789d204de27cd12c6da693d903390a241a1a4bca: badTimezone: invalid author/committer line - bad time zone
	> warning in commit 0d9a65948b0c957007ca387cd56b690f9bab9c08: badTimezone: invalid author/committer line - bad time zone
	> warning in commit f7468c42b4196ee6323afb373ab9323971c38d69: badTimezone: invalid author/committer line - bad time zone
	> warning in commit 85e0cd6dd527cd55ad0440f14384529b83818228: badTimezone: invalid author/committer line - bad time zone
	> warning in commit f31e19a2e772c9ed00728ef142af9c550ea5de6a: badTimezone: invalid author/committer line - bad time zone
	> warning in commit 56eb7384443ef84e17e29504a304a071b189ae67: badTimezone: invalid author/committer line - bad time zone
	> warning in commit e4470030471e6810414b9de5e3b52e16f2245d12: badTimezone: invalid author/committer line - bad time zone
	> warning in commit f913b48caa097c3b2cb3f491707944f88d52d89f: badTimezone: invalid author/committer line - bad time zone
	> warning in commit 4390f26923d572c6dab6cce8282c7cad5520d785: badTimezone: invalid author/committer line - bad time zone
	> warning in commit 0f66db71a06bd7d651a0cd80877d8043b70fda20: badTimezone: invalid author/committer line - bad time zone
	> warning in commit d71472c40b36dcdf0396afc9778f6137eea45887: badTimezone: invalid author/committer line - bad time zone
	> warning in commit e8d3b19a91a2d86b6a91bd19dc811e851398b519: badTimezone: invalid author/committer line - bad time zone
	> warning in commit afd9fc0cc87e56ed7736d633e17d0ef77817b3cc: badTimezone: invalid author/committer line - bad time zone
	> warning in commit 811b3217708358cf1b75fba4602a64a426fce0f5: badTimezone: invalid author/committer line - bad time zone
	> warning in commit e7a751a597c6f5e4770c61bdee6220d55a37cba9: badTimezone: invalid author/committer line - bad time zone
	> warning in commit 3e32ad6192fe093e03e6b9346c3a90b16d9905c0: badTimezone: invalid author/committer line - bad time zone
	> warning in commit 5e66b47528e79d3bbb769e137f036a1fa99cccf9: badTimezone: invalid author/committer line - bad time zone
	> warning in commit d90d67d94ca47142670dff13fcb81ab7afab07bb: badTimezone: invalid author/committer line - bad time zone
	> Checking objects: 100% (1711464/1711464), done.
	> Checking connectivity: 1711464, done.
	Upon examination with git show --pretty=raw all of the problem commits had a time zone that was not 4 digits long. This time zone had been passed straight from the Date line in the email into the author line of the commit. Looking into that I discovered that str2time takes into account the time zone, and was actually able to process these weird time zones. So get the normalized time zone with strptime and convert it from seconds from gmt to hours and minutes from gmt. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
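The conversion described above, from an offset in seconds to the sign plus four digits that git expects, can be sketched like this. It uses Python's `email.utils.parsedate_tz` as a stand-in for the Perl strptime call; the function names are hypothetical.

```python
from email.utils import parsedate_tz


def format_git_tz(offset_seconds: int) -> str:
    """Format a UTC offset in seconds as the +HHMM / -HHMM form git expects."""
    sign = "-" if offset_seconds < 0 else "+"
    hours, remainder = divmod(abs(offset_seconds), 3600)
    return f"{sign}{hours:02d}{remainder // 60:02d}"


def git_tz_from_date(date_header: str) -> str:
    """Parse a Date: header and return a normalized git time zone string."""
    parsed = parsedate_tz(date_header)
    # Element 9 of the tuple is the offset in seconds; it may be None.
    if parsed is None or parsed[9] is None:
        return "+0000"
    return format_git_tz(parsed[9])
```

Formatting from the parsed numeric offset, rather than copying the zone text from the header, is what guarantees the four-digit form regardless of what the sender's mail client wrote.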
2018-06-26	additional tests for bad Message-IDs in URLs
	Followup-to: 73cfed86d8a8287a ("www: use undecoded paths for Message-ID extraction") Reported-by: Leah Neukirchen <leah@vuxu.org> https://public-inbox.org/meta/8736xsb5s5.fsf@vuxu.org/
2018-06-16	Contribute SELinux policy for EL7
	This adds a SELinux policy suitable for RHEL/CentOS 7. It assumes the following:
	- public-inbox-httpd and public-inbox-nntpd are running via systemd on sane ports (119 and 80/8080)
	- /var/lib/public-inbox is the location for mainrepos
	- /var/run/public-inbox is the location for PERL_INLINE_DIRECTORY
	- /var/log/public-inbox is the location for logs
	- mail delivery is done via postfix-pipe or public-inbox-watch via the provided example systemd service
	Signed-off-by: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
2018-05-30	examples: add systemd example for public-inbox-watch
	I guess I forgot to include this, but I've been running public-inbox-watch as a systemd service for nearly two years, now.
2018-04-18	searchidx: regenerate and avoid article number gaps on full index
	Some messages to git@vger went missing from Msgmap from old bugs and became inaccessible via NNTP. Forcing NNTP article numbers when the overview DB came about made the problem more visible when reindexing old (v1) repositories as all removed spam messages took up AUTOINCREMENT numbers again before they were removed. Having large gaps in NNTP article numbers is not good since it throws off NNTP clients. This does NOT prevent NNTP clients from seeing some messages twice, but is better than having them miss several messages entirely. We also avoid depending on --reverse in git-log, as git requires storing an entire commit list in memory for --reverse, so it's cheaper to store only deleted blobs in the %D hash since they do not live long.
2018-04-18	v2: generate better Message-IDs for duplicates
	While hunting duplicates, I noticed a leading '-' in some Message-IDs as a result of RFC4648 encoding. While '-' seems allowed by RFC5322 and URL-friendly (RFC4648), they are uncommon and make using Message-IDs as arguments for command-line tools more difficult. So prefix them with a datestamp to at least give readers some sense of the age. And shorten the "localhost" hostname to "z" to save space.
2018-04-07	over: remove forked subprocess
	Since the overview stuff is a synchronization point anyways, move it into the main V2Writable process and allow us to drop a bunch of code. This is another step towards making Xapian optional for v2. In other words, the fan-out point is moved and the Xapian partitions no longer need to synchronize against each other:
	Before:
	                    /-------->\
	                   /---------->\
	    v2writable -->+----parts----> over
	                   \---------->/
	                    \-------->/
	After:
	                            /---------->
	                           /----------->
	    v2writable --> over-->+----parts--->
	                           \----------->
	                            \---------->
	Since the overview/threading logic needs to run on the same core that feeds git-fast-import, it's slower for small repos but is not noticeable in large imports where I/O wait in the partitions dominates.
2018-04-05	support altid mechanism for v2
	There's enough gmane links out there in the wild that it makes sense to maintain support for these mappings.
2018-04-04	v2: support incremental indexing + purge
	This is important for people running mirrors via "git fetch", as they need to be kept up-to-date. Purging is also now supported in mirrors. The short-lived "--regenerate" option is gone and is now implicitly enabled as a result. It's still cheap when article number regeneration is unnecessary, as we track the range for each git repository.
2018-04-03	nntp: make XOVER, XHDR, OVER, HDR and NEWNEWS faster
	While SQLite is faster than Xapian for some queries we use, it sucks at handling OFFSET. Fortunately, we do not need offsets when retrieving sorted results and can bake it into the query. For inbox.comp.version-control.git (v1 Xapian), XOVER and XHDR are over 20x faster.
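One common way to avoid OFFSET on sorted results is to filter on the last key seen. The sketch below shows that technique with Python's `sqlite3` module. The table name `overview`, its columns, and the function names are assumptions for illustration and do not reflect the project's actual schema or queries.

```python
import sqlite3


def make_db(count: int) -> sqlite3.Connection:
    """Build a throwaway in-memory table keyed by article number."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE overview (num INTEGER PRIMARY KEY, subject TEXT)")
    db.executemany("INSERT INTO overview VALUES (?, ?)",
                   [(i, f"msg {i}") for i in range(1, count + 1)])
    return db


def page_with_offset(db, offset: int, limit: int):
    """Slow form: SQLite must step over `offset` rows before returning any."""
    return db.execute(
        "SELECT num, subject FROM overview ORDER BY num LIMIT ? OFFSET ?",
        (limit, offset)).fetchall()


def page_after(db, last_num: int, limit: int):
    """Fast form: the position is baked into the WHERE clause instead."""
    return db.execute(
        "SELECT num, subject FROM overview WHERE num > ? ORDER BY num LIMIT ?",
        (last_num, limit)).fetchall()
```

With contiguous article numbers starting at 1, `page_after(db, 10, 5)` returns the same rows as `page_with_offset(db, 10, 5)`, but it seeks directly on the primary key instead of scanning the skipped rows.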
2018-04-03	rename+rewrite test using Benchmark module
	There'll be more performance-related tests in the future.
2018-04-02	replace Xapian skeleton with SQLite overview DB
	This ought to provide better performance and scalability which is less dependent on inbox size. Xapian does not seem optimized for some queries used by the WWW homepage, Atom feeds, XOVER and NEWNEWS NNTP commands. This can actually make Xapian optional for NNTP usage, and allow more functionality to work without Xapian installed. Indexing performance was extremely bad at first, but DBI::Profile helped me optimize away problematic queries.
2018-04-01	v2: one file, really
	We need to ensure there is only one file in the top-level tree at any commit so the "add; remove; add;" sequence on the same message is detected properly. Otherwise, git will not detect the second "add" unless a second message is added to history. Deletes are now stored in "d" (and not "D" or "_/D") at the top-level, now. There's no need to have a "_" to reduce churn as "m" and "d" should never co-exist. It's now lowercased to make it easier-to-distinguish from "D" in git-log output.
2018-03-30	msgtime: parse 3-digit years properly
	Some folks had bad mail clients which generated 3-digit years around Y2K...
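A sketch of one plausible normalization, assuming the 3-digit years come from clients that printed "years since 1900" directly (so 2000 appeared as 100). That reading of the bug is an assumption, and `normalize_year` is a hypothetical name; the actual MsgTime logic may differ.

```python
def normalize_year(year: int) -> int:
    """Map 3-digit years such as 100 or 103 to 2000 or 2003.

    Assumes the sender emitted years-since-1900; other values pass through.
    """
    if 100 <= year <= 999:
        return year + 1900
    return year
```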
2018-03-29	mda: support v2 inboxes
	I mainly focus on -watch for mirroring busy mailing lists, but using -mda should remain an option.
2018-03-29	public-inbox-compact: new tool for driving xapian-compact
	Having multiple Xapian partitions is mostly pointless after the initial import. We can compact all the partitions into one while keeping the skeleton separate.
2018-03-29	public-inbox-convert: tool for converting old to new inboxes
	This should make it easier to let users perform comparisons and migrate to v2 if needed.
2018-03-23	www: $MESSAGE_ID/raw endpoint supports "duplicates"
	Since v2 supports duplicate messages, we need to support looking up different messages with the same Message-Id. Fortunately, our "raw" endpoint has always been mboxrd, so users won't need to change their parsing tools.
2018-03-22	v2writable: add NNTP article number regeneration support
	Allow best-effort regeneration of NNTP article numbers from cloned git repositories in addition to indexing Xapian. Article numbers will not remain consistent when we add purge support, though.
2018-03-20	introduce InboxWritable class
	This code will be shared with future mass-import tools.
2018-03-19	watchmaildir: support v2 repositories
	Unfortunately this gives up some minor performance tweaks we made to avoid reforking import processes.
2018-03-19	Lock: new base class for writable lockers
	This reduces code duplication needed for locking and hopefully makes things easier to understand.
2018-03-06	favor Received: date over Date: header globally
	The first Received: header is believable since it typically hits the user's mail server and can be treated as relatively trustworthy. We still show the Date: in per-message (permalink) views, which may expose users with incorrect Date: headers, but all the ISO YYYY-MM-DD dates we display will match what we see.
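A rough sketch of the preference order, using Python's standard `email` package instead of the project's Perl code. It assumes "first" means the first Received: header in header order and that the timestamp follows the last semicolon of that header. `message_datetime` is a hypothetical name.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def message_datetime(raw: str):
    """Prefer the timestamp of the first Received: header, else use Date:."""
    msg = message_from_string(raw)
    for received in msg.get_all("Received", []):
        # The timestamp conventionally follows the final ';'.
        _, sep, stamp = received.rpartition(";")
        if not sep:
            continue
        try:
            return parsedate_to_datetime(stamp.strip())
        except (TypeError, ValueError):
            continue  # unparseable stamp: try the next header
    date = msg.get("Date")
    return parsedate_to_datetime(date) if date else None
```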
2018-03-02	v2writable: inject new Message-IDs on true duplicates
	Since we'll need to support multiple Message-IDs anyways, inject a new one if we hit a duplicate (or don't get one at all). Try to use a deterministic Message-Id for consistency, but give up determinism and use a random Message-Id if an "attacker" wants to prevent their message from being archived.
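The idea of trying deterministic IDs first and falling back to a random one can be sketched as below. The hash choice, the retry count, the `example.invalid` host, and the function name are all assumptions made for illustration; they are not the scheme public-inbox actually uses.

```python
import hashlib
import secrets


def injected_message_id(raw: bytes, seen: set,
                        host: str = "example.invalid", tries: int = 10) -> str:
    """Derive a Message-ID from the content; go random if collisions persist."""
    digest = hashlib.sha256(raw)
    for _ in range(tries):
        candidate = f"<{digest.hexdigest()}@{host}>"
        if candidate not in seen:
            return candidate
        # Collision: fold the rejected candidate back in and try again,
        # which keeps the sequence deterministic for a given `seen` set.
        digest.update(candidate.encode("ascii"))
    # Someone may be pre-registering our deterministic IDs on purpose,
    # so give up determinism rather than drop the message.
    return f"<{secrets.token_hex(16)}@{host}>"
```

Deterministic IDs let independent importers agree on the same injected value, while the random fallback ensures a message can still be archived when every deterministic candidate is already taken.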
2018-02-28	rename SearchIdxThread to SearchIdxSkeleton
	Interchangeably using "all", "skel", "threader", etc. was confusing. Standardize on the "skeleton" term to describe this class since it's also used for retrieval of basic headers.
2018-02-22	v2: parallelize Xapian indexing
	The parallelization requires splitting Msgmap, text+term indexing, and thread-linking out into separate processes. git-fast-import is fast, so we don't bother parallelizing it. Msgmap (SQLite) and thread-linking (Xapian) must be serialized because they rely on monotonically increasing numbers (NNTP article number and internal thread_id, respectively). We handle msgmap in the main process which drives fast-import. When the article number is retrieved/generated, we write the entire message to per-partition subprocesses via pipes for expensive text+term indexing. When these per-partition subprocesses are done with the expensive text+term indexing, they write SearchMsg (small data) to a shared pipe (inherited from the main V2Writable process) back to the threader, which runs its own subprocess. The number of text+term Xapian partitions is chosen at import and can be made equal to the number of cores in a machine.
	    V2Writable --> Import -> git-fast-import
	               \-> SearchIdxThread -> Msgmap (synchronous)
	               \-> SearchIdxPart[n] -> SearchIdx[*]
	               \-> SearchIdxThread -> SearchIdx ("threader", a subprocess)
	    [*] each subprocess writes to threader
2018-02-19	v2writable: initial cut for repo-rotation
	Wrap the old Import package to enable creating new repos based on size thresholds. This is better than relying on time-based rotation as LKML traffic seems to be increasing.
2018-02-12	content_id: add test case

2018-02-12	import: initial handling for v2
	Call order will need to change a bit since this is going to be tied to Xapian
2018-02-08	MANIFEST: add AUTHORS file

2017-07-13	MANIFEST: add hosted list

2017-06-23	allow admins to configure non-obfuscated addresses/domains
	We will also treat all known list addresses as non-obfuscated. By setting publicinbox.noObfuscate in ~/.public-inbox/config, this will allow users to disable address obfuscation on a per-domain or per-address basis.
2017-06-22	test for PublicInbox::Filter::RubyLang
	This will make it easier to prevent breakage in the future.
2017-06-22	add filter for RubyLang lists
	Unfortunately, it appears we have to reject this and instead add support for filtering at View time (*), due to DKIM signatures in messages from ruby-lang.org. (*) which may not be worth it
2017-06-15	replyto parameter support
	This allows us to support centralized mailing lists (which suck, but better than no mailing list at all).
2017-06-15	view: split out reply logic into its own module
	We'll be adding more reply options for centralized mailing lists. So split out the logic so it's easy-to-find. Organizing code is hard :<
2017-05-23	www: do not mangle characters from search queries
	Reported-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> https://public-inbox.org/meta/CACBZZX5Gnow08r=0A1J_kt3a=zpGyMfvsqu8nAN7kacNnDm+dg@mail.gmail.com/
2017-05-07	searchidx: fix ghost root vivification
	Due to the asynchronous nature of SMTP, it is possible for the root message of a thread (with no References/In-Reply-To) to arrive last in a series. We must preserve the thread_id of the ghost message in this case, as we do when vivifying non-root ghosts. Otherwise, this causes threads to be broken when the root arrives last.
2017-01-26	add filter for Subject: tags
	Some mailing lists add annoying tags into the Subject line which discourages readers from doing proper mail organization on the client side. They also waste precious screen space and attention span. Remove them from our archives to reduce clutter.
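A small sketch of what such a filter might do, assuming the tag to remove is configured as a literal string such as "[foo]". `strip_tag` is a hypothetical helper written in Python, not the project's filter class.

```python
import re


def strip_tag(subject: str, tag: str) -> str:
    """Remove every occurrence of a literal list tag from a Subject line."""
    # Swallow the whitespace around the tag and leave one space behind,
    # so "Re: [foo] bar" becomes "Re: bar".
    pattern = re.compile(r"\s*" + re.escape(tag) + r"\s*", re.IGNORECASE)
    return pattern.sub(" ", subject).strip()
```

For example, `strip_tag("Re: [foo] bar", "[foo]")` gives `"Re: bar"`, and a subject without the tag is returned unchanged.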
2017-01-10	introduce PublicInbox::MIME wrapper class
	This should fix problems with multipart messages where text/plain parts lack a header. cf. git clone --mirror https://github.com/rjbs/Email-MIME.git refs/pull/28/head In the future, we may still introduce a streaming interface to reduce memory usage on large emails.
2016-12-20	tests: add thread-all testing for benchmarking
	I'll be using this to improve message threading performance.
2016-12-03	atom: switch to getline/close for response bodies
	This will let us stream larger Atom document bodies without wasting too much memory and reduce the amount of round-trip requests needed to get necessary information. Hopefully clients are using streaming (SAX) parsers, too. This is the final transition in the core public-inbox code to allow migrating to a "pull"-based body streaming scheme which allows an HTTP server to respond appropriately to backpressure from slow clients.