public-inbox.git - an "archives first" approach to mailing lists

Date	Commit message
2018-03-29	mda: support v2 inboxes
	I mainly focus on -watch for mirroring busy mailing lists, but using -mda should remain an option.
2018-03-29	public-inbox-compact: new tool for driving xapian-compact
	Having multiple Xapian partitions is mostly pointless after the initial import. We can compact all the partitions into one while keeping the skeleton separate.
2018-03-29	v2writable: initializing an existing inbox is idempotent
	And we do not want to start making confused repos if somebody leaves out "-V2" the second time around.
2018-03-29	www: cleanup expensive fallback for legacy URLs
	Back in the day, we compressed long Message-IDs to SHA-1 hexdigests for the URL. These URLs now get a 301 redirect in the hopes we can remove these checks some day to reduce overhead.
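A minimal sketch of the fallback described above, in Python rather than the project's own code. The `legacy_redirect` helper, the `mids` list and the URL layout are assumptions made for illustration; only the idea (a 40-character SHA-1 hexdigest standing in for a long Message-ID, answered with a 301) comes from the commit message.

```python
import hashlib
import re
from urllib.parse import quote

# Assumption: a legacy URL component is a 40-character SHA-1 hexdigest.
LEGACY_RE = re.compile(r"\A[0-9a-f]{40}\Z")

def legacy_redirect(component, mids):
    """Return (status, location) for a legacy component, or None.

    `mids` is a hypothetical iterable of known Message-IDs; a real
    implementation would consult an index instead of scanning.
    """
    if not LEGACY_RE.match(component):
        return None  # not a legacy URL, so skip the expensive fallback
    for mid in mids:
        if hashlib.sha1(mid.encode("utf-8")).hexdigest() == component:
            # permanent redirect to the URL built from the full Message-ID
            return 301, "/" + quote(mid, safe="@") + "/"
    return None

known = ["20180329000000.12345-1-someone@example.com"]
digest = hashlib.sha1(known[0].encode("utf-8")).hexdigest()
print(legacy_redirect(digest, known))
```

The cheap regular-expression check comes first so that ordinary requests never pay for the digest comparison.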
2018-03-29	search: get rid of most lookup_* subroutines
	Too many similar functions doing the same basic thing was redundant and misleading, especially since Message-ID is no longer treated as a truly unique identifier. For displaying threads in the HTML, this makes it clear that we favor the primary Message-ID mapped to an NNTP article number if a message cannot be found.
2018-03-29	v2writable: support purging messages from git entirely
	Purging existing messages is fairly straightforward since we can take advantage of Xapian and lookup the git object_id with it. Unfortunately, purging an already "removed" message (which is no longer in Xapian) is not as easy and we'll need to expose ->purge_oids to purge by the git object_id (currently SHA-1). Furthermore, we expire reflogs and prune in hopes a dumb HTTP client won't get the object.
2018-03-29	www: fix attachment downloads for conflicted Message-IDs
	By using the "primary" Message-ID in WwwAttach, we can avoid conflicts in the links we use for downloading attachments.
2018-03-29	v2writable: append, instead of prepending generated Message-ID
	The original Message-ID is still the most important when discussing with other recipients who do not rely on a message flowing through public-inbox. So whatever Message-ID we use to deduplicate internally will be secondary and less important. All of our front-end v2 code is order-independent, so we won't let the message count against us, that way.
2018-03-27	www: support cloning individual v2 git partitions
	This will require multiple client invocations, but should reduce load on the server and make it easier for readers to only clone the latest data. Unfortunately, supporting a cloneurl file for externally-hosted repos will be more difficult as we cannot easily know if the clones use v1 or v2 repositories, or how many git partitions they have.
2018-03-27	view: depend on SearchMsg for Message-ID
	Since we need to handle messages with multiple and duplicate Message-ID headers, our thread skeleton display must account for that. Since we have a "preferred" Message-ID in case of conflicts, use it as the UUID in an Atom feed so readers do not get confused by conflicts.
2018-03-27	view: permalink (per-message) view shows multiple messages
	This needs tests and further refinement, but current tests pass.
2018-03-23	feed: fix new.html for v2
	I forget this endpoint is still accessible (even if not linked). This also simplifies new.html all around and removes some unused clutter from the old days while we're at it.
2018-03-23	t/psgi_v2: minimal test for Atom feed and t.mbox.gz
	Some test coverage is better than none, here.
2018-03-23	search: reopen DB if each_smsg_by_mid fails
	This gives more up-to-date data in that case and allows us to avoid reopening in more places ourselves.
2018-03-23	www: $MESSAGE_ID/raw endpoint supports "duplicates"
	Since v2 supports duplicate messages, we need to support looking up different messages with the same Message-Id. Fortunately, our "raw" endpoint has always been mboxrd, so users won't need to change their parsing tools.
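A rough sketch of why an mboxrd response can carry several messages with the same Message-Id without changing client parsing. The `SEPARATOR` line and both helper names are placeholders chosen for this example; the escaping rule (prefix any line matching `^>*From ` with one more `>`) is the usual mboxrd convention.

```python
import re

# Placeholder "From " separator line; the exact line a server emits
# is an assumption here.
SEPARATOR = "From mboxrd@z Thu Jan  1 00:00:00 1970\n"

# mboxrd quoting: any line starting with zero or more '>' then "From ".
FROM_RE = re.compile(r"^(>*From )", re.MULTILINE)

def mboxrd_escape(raw):
    """Add one '>' in front of every (possibly already quoted) From line."""
    return FROM_RE.sub(r">\1", raw)

def to_mboxrd(messages):
    """Concatenate raw messages, duplicates included, into one mboxrd string."""
    parts = []
    for raw in messages:
        if not raw.endswith("\n"):
            raw += "\n"
        # separator, escaped message, then a blank line between messages
        parts.append(SEPARATOR + mboxrd_escape(raw) + "\n")
    return "".join(parts)

dup1 = "Message-ID: <dup@example.com>\n\nFrom the top\n"
dup2 = "Message-ID: <dup@example.com>\n\nsecond copy\n"
print(to_mboxrd([dup1, dup2]))
```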
2018-03-22	v2writable: add NNTP article number regeneration support
	Allow best-effort regeneration of NNTP article numbers from cloned git repositories in addition to indexing Xapian. Article numbers will not remain consistent when we add purge support, though.
2018-03-22	t/altid.t: extra tests for mid_set
	I'll be relying on some of this behavior for regenerating NNTP article numbers off fresh clones.
2018-03-22	msgmap: add tmp_clone to create an anonymous copy
	This will be used to keep track of Message-ID <-> NNTP Article numbers to prevent article number reuse when reindexing.
2018-03-20	content_id: do not take Message-Id into account
	If we need to use content_id, we've already lost hope in relying on Message-Id as a differentiator. This prevents duplicates from showing up repeatedly with -watch when Message-Ids are reused and we generate new Message-Ids to disambiguate.
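To illustrate the idea, here is a hedged sketch of a content identifier that ignores Message-Id. The header list in `HASHED_HEADERS`, the use of SHA-256 and the function name are assumptions for this example and are not taken from the actual implementation.

```python
import hashlib
from email import message_from_string

# Headers hashed in this sketch; the real list is not given in the log.
# Message-ID is deliberately absent.
HASHED_HEADERS = ("Subject", "From", "Date", "To", "Cc")

def content_id(raw):
    """Digest of selected headers plus the body, ignoring Message-ID."""
    msg = message_from_string(raw)
    dig = hashlib.sha256()
    for name in HASHED_HEADERS:
        for value in msg.get_all(name, []):
            dig.update(f"{name}\0{value}\0".encode("utf-8"))
    # everything after the first blank line is treated as the body
    body = raw.partition("\n\n")[2]
    dig.update(b"\0body\0" + body.encode("utf-8"))
    return dig.hexdigest()

def sample(mid, body="hello\n"):
    return f"From: a@example.com\nSubject: test\nMessage-ID: <{mid}>\n\n{body}"

# Same content under two different Message-IDs compares equal.
print(content_id(sample("1@example.com")) == content_id(sample("2@example.com")))
```

With this shape, a message re-delivered under a freshly generated Message-Id still compares equal to the copy already stored.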
2018-03-19	v2writable: remove "resent" message for duplicate Message-IDs
	public-inbox-watch gets restarted on reboots and whatnot, so it could get pointlessly noisy. This message was only useful during initial development and imports.
2018-03-19	v2writable: allow disabling parallelization
	While parallel processes improve import speed for initial imports, they are probably not necessary for daily mail imports via WatchMaildir and certainly not for public-inbox-init. Disabling them saves some memory for daily use and even helps improve readability of some subroutines by showing which methods they call remotely.
2018-03-19	watchmaildir: support v2 repositories
	Unfortunately this gives up some minor performance tweaks we made to avoid reforking import processes.
2018-03-19	v2writable: ensure ->done is idempotent
	This matches Import::done behavior
2018-03-19	t/watch_maildir: note the reason for FIFO creation
	I had to dig through commit history for this and we should better document our tests (along with everything else).
2018-03-19	v2writable: test for idempotent removals
	This will make reindexing easier.
2018-03-19	import: switch to URL-safe Base64 for Message-IDs
	Hexdigests are too long and shorter Message-IDs are easier to deal with.
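A quick comparison of the two encodings, assuming the digest is SHA-1 (the log does not say which digest is used here). The function names are made up for the example.

```python
import base64
import hashlib

def hex_id(data):
    """Hexadecimal form of a SHA-1 digest: 40 characters."""
    return hashlib.sha1(data).hexdigest()

def b64url_id(data):
    """URL-safe Base64 form of the same 20-byte digest, padding removed."""
    digest = hashlib.sha1(data).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")

sample = b"example message content"
print(len(hex_id(sample)), len(b64url_id(sample)))
```

The URL-safe alphabet uses `-` and `_` in place of `+` and `/`, so the result needs no further escaping in a URL path.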
2018-03-19	import: (v2): write deletes to a separate '_' subdirectory
	In the future, we may store "purged" content IDs or other uncommon stuff under "_/" of the git tree. This keeps the top-level tree small and more amenable to deltafication. This helps the common case where "m" is the most commonly changed file at the top level. Also, use 'D' instead of 'd' since it matches git's '--raw' output format.
2018-03-19	import: (v2) delete writes the blob into history in subdir
	This makes it easier to audit deletes with "git log -p" and prevents an unstable specification of "content_id" from being stored in history. This should be cost-free if done in the same partition (and even cheaper than before as it introduces no new blobs). It does have a higher cost across partitions, but is probably irrelevant given the typical ham:spam ratio.
2018-03-19	v2writable: implement remove correctly
	We need to hide removals from anybody hitting the search engine.
2018-03-19	v2writable: support "barrier" operation to avoid reforking
	Stopping and starting a bunch of processes to look up duplicates or removals is inefficient. Take advantage of checkpointing in "git fast-import" and transactions in Xapian and SQLite.
2018-03-06	v2writable: detect and use previous partition count
	We need to detect the number of partitions the repository was created with to ensure Xapian DBs can work across different machines (or even CPU affinity changes) without leaving messages unaffected by search.
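One way such detection could look, as a sketch only. The on-disk layout assumed here (one numerically named subdirectory per partition, next to a non-numeric skeleton directory) and both function names are hypothetical.

```python
import os
import tempfile

def detect_partitions(xapian_dir):
    """Count numerically named subdirectories; None if nothing is there."""
    try:
        names = os.listdir(xapian_dir)
    except FileNotFoundError:
        return None
    count = sum(
        1 for name in names
        if name.isdigit() and os.path.isdir(os.path.join(xapian_dir, name))
    )
    return count or None

def choose_partitions(xapian_dir, default):
    """Prefer the count the inbox was created with over the local default."""
    return detect_partitions(xapian_dir) or default

with tempfile.TemporaryDirectory() as tmp:
    for name in ("0", "1", "2", "skel"):
        os.mkdir(os.path.join(tmp, name))
    existing = choose_partitions(tmp, default=8)
    fresh = choose_partitions(os.path.join(tmp, "missing"), default=8)
print(existing, fresh)
```

The local default (for instance one derived from the CPU count) only applies to a fresh inbox, which keeps an existing inbox searchable across machines.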
2018-03-03	v2: avoid redundant/repeated configs for git partition repos
	We'll let the config of all.git dictate every other subrepo to ease maintenance and configuration. The "include" directive has been supported since git 1.7.10, so it's safe to depend on as v2 requires git 2.6.0+ anyways for "get-mark" in fast-import.
2018-03-03	import: consolidate object info for v2 imports
	It's easier to store everything in one array ref similar to what our Git->check routine returns
2018-03-03	searchidx: store the primary MID in doc data for NNTP
	We can't rely on header order for Message-ID after all since we fall back to existing MIDs if they exist and are unseen. This lets us use SearchMsg->mid to get the MID we associated with the NNTP article number to ensure all NNTP article lookups roundtrip correctly.
2018-03-03	nntp: fix NEWNEWS command
	I guess nobody uses this command (slrnpull does not), and the breakage was not noticed until I started writing new tests for multi-MID handling. Fixes: 3fc411c772a21d8f ("search: drop pointless range processors for Unix timestamp")
2018-03-03	v2writable: generated Message-ID goes first
	This is to make SearchMsg behave more sanely under NNTP.
2018-03-03	searchidx: support indexing multiple MIDs
	It's possible to have a message handle multiple terms; so use this feature to ensure messages with multiple MIDs can be found by either one.
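The lookup behavior can be sketched with a plain dictionary standing in for Xapian terms. `TinyIndex` and `extract_mids` are illustrative names and not part of the codebase.

```python
import re
from collections import defaultdict
from email import message_from_string

MID_RE = re.compile(r"<([^>]+)>")

def extract_mids(raw):
    """Return every distinct Message-ID in header order."""
    msg = message_from_string(raw)
    mids = []
    for value in msg.get_all("Message-ID", []):
        for mid in MID_RE.findall(value):
            if mid not in mids:
                mids.append(mid)
    return mids

class TinyIndex:
    """Maps each Message-ID "term" to the documents carrying it."""

    def __init__(self):
        self.by_mid = defaultdict(set)

    def add(self, docid, raw):
        for mid in extract_mids(raw):
            self.by_mid[mid].add(docid)

    def lookup(self, mid):
        return sorted(self.by_mid.get(mid, ()))

raw = ("Message-ID: <a@example.com>\n"
       "Message-ID: <b@example.com>\n"
       "Subject: hi\n\nbody\n")
idx = TinyIndex()
idx.add(1, raw)
print(idx.lookup("a@example.com"), idx.lookup("b@example.com"))
```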
2018-03-02	v2writable: inject new Message-IDs on true duplicates
	Since we'll need to support multiple Message-IDs anyways, inject a new one if we hit a duplicate (or don't get one at all). Try to use a deterministic Message-Id for consistency, but give up determinism and use a random Message-Id if an "attacker" wants to prevent their message from being archived.
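A sketch of the "deterministic first, random as a last resort" approach. The domain `generated.invalid`, the SHA-1 plus Base64 derivation and the `seen` set are assumptions for illustration.

```python
import base64
import hashlib
import secrets

def generated_mid(raw, seen, domain="generated.invalid"):
    """Pick a Message-ID for `raw` that is not already in `seen`."""
    digest = hashlib.sha1(raw.encode("utf-8")).digest()
    tag = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    mid = f"{tag}@{domain}"  # deterministic: same content, same ID
    while mid in seen:
        # Someone already used the predictable ID, so give up
        # determinism and fall back to a random one.
        mid = f"{secrets.token_urlsafe(20)}@{domain}"
    return mid

first = generated_mid("some message", set())
again = generated_mid("some message", set())
forced = generated_mid("some message", {first})
print(first == again, forced != first)
```

The random fallback covers a sender who pre-registers the predictable ID in an attempt to keep a message out of the archive.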
2018-03-02	content_id: no need to be human-friendly
	We merely use this for internal comparisons and do not store this in Xapian. So using a shorter, non-human readable digest is enough. Furthermore, introduce "content_digest" which returns the Digest::SHA object for extra changes.
2018-03-02	mid: add `mids' and `references' methods for extraction
	We'll be using a more consistent API for extracting Message-IDs from various headers.
2018-02-28	v2/ui: get nntpd and init tests running on v2
	A work-in-progress, but it appears the v2 UI pieces will not require a lot of work.
2018-02-19	git: reload alternates file on missing blob
	Since we'll be adding new repositories to the `alternates' file in git, we must restart the `git cat-file --batch' process as git currently does not detect changes to the alternates file in long-running cat-file processes. Don't bother with the `--batch-check' process since we won't be using it with v2.
2018-02-19	v2writable: initial cut for repo-rotation
	Wrap the old Import package to enable creating new repos based on size thresholds. This is better than relying on time-based rotation as LKML traffic seems to be increasing.
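Size-based rotation can be sketched as a thin wrapper that counts bytes and moves to a new repository once a threshold would be exceeded. `RotatingWriter` and its fields are hypothetical; nothing is written to git in this example.

```python
class RotatingWriter:
    """Tracks which partition each message would land in."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.partition = 0
        self.bytes_in_partition = 0

    def add(self, raw):
        """Record `raw` (bytes) and return the partition it belongs to."""
        size = len(raw)
        # Rotate only if the current partition already holds data;
        # an oversized first message still goes somewhere.
        if self.bytes_in_partition and self.bytes_in_partition + size > self.max_bytes:
            self.partition += 1
            self.bytes_in_partition = 0
        self.bytes_in_partition += size
        return self.partition

writer = RotatingWriter(max_bytes=10)
placements = [writer.add(b"x" * n) for n in (6, 6, 4, 1)]
print(placements)
```

Because the trigger is size, a sudden rise in traffic creates new repositories sooner instead of making any single one larger.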
2018-02-15	address: extract more characters from email addresses
	There are a lot of weird characters which show up in LKML archives which we did not support before. Furthermore, allow spaces before the '>' in the From: line as at least some non-spam poster used it.
2018-02-14	import: APIs to support v2 use
	Wrap "get-mark" and "checkpoint" commands for git-fast-import while documenting/cementing parts of the API.
2018-02-12	content_id: add test case

2018-02-12	t/import: test for last_object_id insertion
	Check for this before doing the Xapian-based v2 importer.
2018-02-07	update copyrights for 2018
	Using update-copyrights from gnulib. While we're at it, use the SPDX identifier for AGPL-3.0+ to ease mechanical processing.
2018-01-16	hval: only allow domain obfuscation in address
	Obfuscating username portions of the email address leads to having subsequent parts of the address not being obfuscated, which could mean we show someone else's email entirely. In other words, obfuscating "john.doe@example.com" might mean "doe@example.com" is picked up by scanners. In other news, email address obfuscation is still a horrible usability issue and only exists to appease misguided people.
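The problem and the fix can be shown with a small sketch. The address pattern `ADDR_RE`, the bullet replacement character and the function name are assumptions for illustration, not the project's actual obfuscation code.

```python
import re

# Loose address pattern for this example only.
ADDR_RE = re.compile(r"([\w.+=-]+)@([\w-]+(?:\.[\w-]+)+)")

def obfuscate_domain(text, repl="\u2022"):
    """Replace only the last dot of the domain, leaving the username alone."""
    def fix(match):
        user, domain = match.group(1), match.group(2)
        head, _, tail = domain.rpartition(".")
        return f"{user}@{head}{repl}{tail}"
    return ADDR_RE.sub(fix, text)

# Breaking up the username still leaves a scannable address behind:
leaky = "john\u2022doe@example.com"
print(ADDR_RE.search(leaky).group(0))

# Breaking up the domain leaves nothing address-shaped:
safe = obfuscate_domain("john.doe@example.com")
print(safe, ADDR_RE.search(safe))
```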
2017-10-04	mbox: support inline filename via Content-Disposition header
	This is hopefully more sensical than "raw" files from resulting downloads.