public-inbox.git - an "archives first" approach to mailing lists

Date	Commit message
2018-03-22	v2writable: add NNTP article number regeneration support
	Allow best-effort regeneration of NNTP article numbers from cloned git repositories in addition to indexing Xapian. Article numbers will not remain consistent when we add purge support, though.
2018-03-22	t/altid.t: extra tests for mid_set
	I'll be relying on some of this behavior for regenerating NNTP article numbers off fresh clones.
2018-03-22	msgmap: add tmp_clone to create an anonymous copy
	This will be used to keep track of Message-ID <-> NNTP Article numbers to prevent article number reuse when reindexing.
2018-03-20	content_id: do not take Message-Id into account
	If we need to use content_id, we've already lost hope in relying on Message-Id as a differentiator. This prevents duplicates from showing up repeatedly with -watch when Message-Ids are reused and we generate new Message-Ids to disambiguate.
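	A minimal sketch of the idea, assuming a digest over a few headers plus the body. The header list, the SHA-256 choice, and the function names are illustrative assumptions and not the project's actual content_id definition; the only point shown is that Message-Id is left out of the comparison.

```python
import hashlib
from email import message_from_string

# Hypothetical set of headers to compare; Message-Id is deliberately absent.
COMPARED_HEADERS = ("Subject", "From", "Date", "To", "Cc")


def content_digest(raw: str):
    """Return a hashlib object covering selected headers and the body."""
    msg = message_from_string(raw)
    digest = hashlib.sha256()
    for name in COMPARED_HEADERS:
        for value in msg.get_all(name, []):
            digest.update(f"{name}: {value}\0".encode("utf-8"))
    # Sketch only: handles simple, non-multipart messages.
    body = msg.get_payload(decode=True) or b""
    digest.update(body)
    return digest


def content_id(raw: str) -> str:
    """Hex form of the digest, used only for equality comparisons."""
    return content_digest(raw).hexdigest()
```

	With this shape, two copies of a message that differ only in Message-Id compare equal, so a regenerated Message-Id does not make the copy look new.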
2018-03-19	v2writable: remove "resent" message for duplicate Message-IDs
	public-inbox-watch gets restarted on reboots and whatnot, so it could get pointlessly noisy. This message was only useful during initial development and imports.
2018-03-19	v2writable: allow disabling parallelization
	While parallel processes improve import speed for initial imports, they are probably not necessary for daily mail imports via WatchMaildir and certainly not for public-inbox-init. Disabling them saves some memory for daily use and even helps improve readability of some subroutines by showing which methods they call remotely.
2018-03-19	watchmaildir: support v2 repositories
	Unfortunately this gives up some minor performance tweaks we made to avoid reforking import processes.
2018-03-19	v2writable: ensure ->done is idempotent
	This matches Import::done behavior.
2018-03-19	t/watch_maildir: note the reason for FIFO creation
	I had to dig through commit history for this and we should better document our tests (along with everything else).
2018-03-19	v2writable: test for idempotent removals
	This will make reindexing easier.
2018-03-19	import: switch to URL-safe Base64 for Message-IDs
	Hexdigests are too long and shorter Message-IDs are easier to deal with.
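	To illustrate the length difference, here is a small sketch comparing a hex digest with unpadded URL-safe Base64 of the same bytes. SHA-1 and the stripping of "=" padding are assumptions made for the example, not details stated in this commit.

```python
import base64
import hashlib


def digest_forms(data: bytes):
    """Return (hex, URL-safe Base64) renderings of one SHA-1 digest."""
    raw = hashlib.sha1(data).digest()  # 20 raw bytes
    hex_form = raw.hex()
    # urlsafe_b64encode uses "-" and "_" instead of "+" and "/".
    b64_form = base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")
    return hex_form, b64_form


hex_form, b64_form = digest_forms(b"example message body")
print(len(hex_form), len(b64_form))  # prints: 40 27
```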
2018-03-19	import: (v2): write deletes to a separate '_' subdirectory
	In the future, we may store "purged" content IDs or other uncommon stuff under "_/" of the git tree. This keeps the top-level tree small and more amenable to deltafication. This helps the common case where "m" is the most commonly changed file at the top level. Also, use 'D' instead of 'd' since it matches git's '--raw' output format.
2018-03-19	import: (v2) delete writes the blob into history in subdir
	This makes it easier to audit deletes with "git log -p" and prevents an unstable specification of "content_id" from being stored in history. This should be cost-free if done in the same partition (and even cheaper than before as it introduces no new blobs). It does have a higher cost across partitions, but is probably irrelevant given the typical ham:spam ratio.
2018-03-19	v2writable: implement remove correctly
	We need to hide removals from anybody hitting the search engine.
2018-03-19	v2writable: support "barrier" operation to avoid reforking
	Stopping and starting a bunch of processes to look up duplicates or removals is inefficient. Take advantage of checkpointing in "git fast-import" and transactions in Xapian and SQLite.
2018-03-06	v2writable: detect and use previous partition count
	We need to detect the number of partitions the repository was created with to ensure Xapian DBs can work across different machines (or even CPU affinity changes) without leaving messages unaffected by search.
2018-03-03	v2: avoid redundant/repeated configs for git partition repos
	We'll let the config of all.git dictate every other subrepo to ease maintenance and configuration. The "include" directive has been supported since git 1.7.10, so it's safe to depend on as v2 requires git 2.6.0+ anyways for "get-mark" in fast-import.
2018-03-03	import: consolidate object info for v2 imports
	It's easier to store everything in one array ref, similar to what our Git->check routine returns.
2018-03-03	searchidx: store the primary MID in doc data for NNTP
	We can't rely on header order for Message-ID after all since we fall back to existing MIDs if they exist and are unseen. This lets us use SearchMsg->mid to get the MID we associated with the NNTP article number to ensure all NNTP article lookups roundtrip correctly.
2018-03-03	nntp: fix NEWNEWS command
	I guess nobody uses this command (slrnpull does not), and the breakage was not noticed until I started writing new tests for multi-MID handling. Fixes: 3fc411c772a21d8f ("search: drop pointless range processors for Unix timestamp")
2018-03-03	v2writable: generated Message-ID goes first
	This is to make SearchMsg behave more sanely under NNTP.
2018-03-03	searchidx: support indexing multiple MIDs
	It's possible to have a message handle multiple terms; so use this feature to ensure messages with multiple MIDs can be found by either one.
2018-03-02	v2writable: inject new Message-IDs on true duplicates
	Since we'll need to support multiple Message-IDs anyways, inject a new one if we hit a duplicate (or don't get one at all). Try to use a deterministic Message-Id for consistency, but give up determinism and use a random Message-Id if an "attacker" wants to prevent their message from being archived.
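	The fallback described above can be sketched roughly as follows. The `seen` set, the host name, and the use of SHA-1 are hypothetical stand-ins for whatever lookup and digest the importer really uses; only the "deterministic first, random if that collides" order is taken from the commit message.

```python
import base64
import hashlib
import os


def _token(raw: bytes) -> str:
    """Unpadded URL-safe Base64 rendering of raw bytes."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def pick_message_id(raw_message: bytes, seen: set,
                    host: str = "example.invalid") -> str:
    """Choose a Message-ID to inject for a duplicate or missing one."""
    # First choice: derived from the message, so re-imports agree.
    candidate = f"<{_token(hashlib.sha1(raw_message).digest())}@{host}>"
    if candidate not in seen:
        return candidate
    # The deterministic ID is already taken (possibly on purpose),
    # so give up determinism and retry with random IDs.
    while True:
        candidate = f"<{_token(os.urandom(20))}@{host}>"
        if candidate not in seen:
            return candidate
```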
2018-03-02	content_id: no need to be human-friendly
	We merely use this for internal comparisons and do not store this in Xapian. So using a shorter, non-human readable digest is enough. Furthermore, introduce "content_digest" which returns the Digest::SHA object for extra changes.
2018-03-02	mid: add `mids' and `references' methods for extraction
	We'll be using a more consistent API for extracting Message-IDs from various headers.
2018-02-28	v2/ui: get nntpd and init tests running on v2
	A work-in-progress, but it appears the v2 UI pieces will not require a lot of work.
2018-02-19	git: reload alternates file on missing blob
	Since we'll be adding new repositories to the `alternates' file in git, we must restart the `git cat-file --batch' process as git currently does not detect changes to the alternates file in long-running cat-file processes. Don't bother with the `--batch-check' process since we won't be using it with v2.
2018-02-19	v2writable: initial cut for repo-rotation
	Wrap the old Import package to enable creating new repos based on size thresholds. This is better than relying on time-based rotation as LKML traffic seems to be increasing.
2018-02-15	address: extract more characters from email addresses
	There are a lot of weird characters which show up in LKML archives which we did not support before. Furthermore, allow spaces before the '>' in the From: line as at least some non-spam poster used it.
2018-02-14	import: APIs to support v2 use
	Wrap "get-mark" and "checkpoint" commands for git-fast-import while documenting/cementing parts of the API.
2018-02-12	content_id: add test case

2018-02-12	t/import: test for last_object_id insertion
	Check for this before doing the Xapian-based v2 importer.
2018-02-07	update copyrights for 2018
	Using update-copyrights from gnulib. While we're at it, use the SPDX identifier for AGPL-3.0+ to ease mechanical processing.
2018-01-16	hval: only allow domain obfuscation in address
	Obfuscating username portions of the email address leads to having subsequent parts of the address not being obfuscated, which could mean we show someone else's email entirely. In other words, obfuscating "john.doe@example.com" might mean "doe@example.com" is picked up by scanners. In other news, email address obfuscation is still a horrible usability issue and only exists to appease misguided people.
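	A rough sketch of domain-only obfuscation, assuming a simplified address pattern and a bullet as the substitute character (both are illustrative choices, not the project's actual regexp or replacement). The local part is emitted untouched and exactly one "." in the domain is replaced, so no shorter valid address is left behind.

```python
import re

# Simplified, hypothetical address pattern: local part, "@", first
# domain label, ".", and the rest of the domain.
ADDR_RE = re.compile(r"([\w.+=-]+)@([\w-]+)\.([\w.-]+)")


def obfuscate_domains(text: str, sub: str = "\u2022") -> str:
    """Replace only the first "." of each address's domain with `sub`."""
    return ADDR_RE.sub(
        lambda m: f"{m.group(1)}@{m.group(2)}{sub}{m.group(3)}", text)


print(obfuscate_domains("write to john.doe@example.com for details"))
```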
2017-10-04	mbox: support inline filename via Content-Disposition header
	This is hopefully more sensical than "raw" files from resulting downloads.
2017-06-29	hval: only perform one substitution when obfuscating
	Only one substitution character is necessary when obfuscating email addresses.
2017-06-26	tests: deal with the removal of '.' from @INC in newer Perl
	Oops, this is needed for Perl 5.22 (tested 5.24.1) since '.' was removed due to security problems. Fwiw, I consider this change to Perl an overreaction and do not agree with it.
2017-06-26	watch: improve fairness during full rescans
	We need to ensure new messages are being processed fairly during full rescans, so have the ->scan subroutine yield and reschedule itself. Additionally, having a long-running task inside the signal handler is dangerous and subject to reentrancy bugs. Due to the limitations of the Filesys::Notify::Simple interface, we cannot rely on multiplexing I/O interfaces (select, IO::Poll, Danga::Socket, etc...) for this. Forking a separate process was considered, but it is more expensive for a mostly-idle process. So, we use a variant of the "self-pipe trick" via inotify (or whatever Filesys::Notify::Simple gives us). Instead of writing to our own pipe, we write to a file in our own temporary directory watched by Filesys::Notify::Simple to trigger events in signal handlers.
2017-06-26	mda: set List-ID correctly according to RFC2919
	Oops, due to an old mistake, List-ID was set incorrectly in the MDA. This could cause some breakage w.r.t. mail filters.
2017-06-23	linkify: handle URLs in parenthesized statements
	Sometimes, URLs exist at the end of parenthesized statements, and we shouldn't unnecessarily capture that. (example: https://public-inbox.org/ruby-core/20170623032722.GA8124@dcvr/)
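	One way to get this behavior, sketched under the assumption of a deliberately loose URL pattern (not the one the linkifier actually uses): match greedily, then drop trailing ")" characters that have no matching "(" inside the URL itself.

```python
import re

# Loose, hypothetical URL matcher used only for this illustration.
URL_RE = re.compile(r"https?://[^\s<>\"]+")


def find_urls(text: str) -> list:
    """Return URLs in text, excluding an enclosing statement's ")"."""
    urls = []
    for match in URL_RE.finditer(text):
        url = match.group(0)
        # A ")" with no "(" earlier in the URL belongs to the prose.
        while url.endswith(")") and url.count(")") > url.count("("):
            url = url[:-1]
        urls.append(url)
    return urls
```

	Balanced parentheses inside a URL are kept, so only the closing parenthesis of the surrounding statement is excluded.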
2017-06-23	allow admins to configure non-obfuscated addresses/domains
	We will also treat all known list addresses as non-obfuscated. By setting publicinbox.noObfuscate in ~/.public-inbox/config, this will allow users to disable address obfuscation on a per-domain or per-address basis.
2017-06-23	config: assume lists have multiple addresses
	This should simplify the rest of our code for handling the do-not-obfuscate list.
2017-06-23	reply: handle address obfuscation :<
	We can show users a lightly-obfuscated Bourne shell command for invoking "git send-email" for address obfuscation. However, I'm not sure if the mailto: arg will work effectively since URL encoding is probably too well-known to be effective.
2017-06-22	test for PublicInbox::Filter::RubyLang
	This will make it easier to prevent breakage in the future.
2017-06-22	msgmap: mid_insert ignores duplicates instead of die-ing
	This will allow smoother imports as occasional Message-ID duplicates happen and the best we can do is ignore the second one.
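	A minimal sketch of "ignore the second one" using SQLite's INSERT OR IGNORE. The table name, schema, and return convention here are assumptions for illustration and may not match the real msgmap.

```python
import sqlite3


def open_map() -> sqlite3.Connection:
    """Create an in-memory Message-ID to article-number map."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE msgmap ("
        " num INTEGER PRIMARY KEY AUTOINCREMENT,"
        " mid TEXT NOT NULL UNIQUE)")
    return db


def mid_insert(db: sqlite3.Connection, mid: str):
    """Insert mid; return its new number, or None if already present."""
    cur = db.execute("INSERT OR IGNORE INTO msgmap (mid) VALUES (?)", (mid,))
    db.commit()
    # rowcount is 0 when the UNIQUE constraint made SQLite skip the row.
    return cur.lastrowid if cur.rowcount == 1 else None
```

	The caller can then treat None as "already seen" and continue the import instead of aborting.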
2017-06-15	replyto parameter support
	This allows us to support centralized mailing lists (which suck, but better than no mailing list at all).
2017-06-15	view: split out reply logic into its own module
	We'll be adding more reply options for centralized mailing lists. So split out the logic so it's easy-to-find. Organizing code is hard :<
2017-06-14	search: remove unnecessary abstractions and functionality
	This simplifies the code a bit and reduces the translation overhead for looking directly at data from tools shipped with Xapian. While we're at it, fix thread-all.t :)
2017-05-23	www: do not mangle characters from search queries
	Reported-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> https://public-inbox.org/meta/CACBZZX5Gnow08r=0A1J_kt3a=zpGyMfvsqu8nAN7kacNnDm+dg@mail.gmail.com/
2017-05-07	searchidx: fix ghost root vivification
	Due to the asynchronous nature of SMTP, it is possible for the root message of a thread (with no References/In-Reply-To) to arrive last in a series. We must preserve the thread_id of the ghost message in this case, as we do when vivifying non-root ghosts. Otherwise, this causes threads to be broken when the root arrives last.