public-inbox.git - an "archives first" approach to mailing lists

Date	Commit message
2018-03-20	introduce InboxWritable class
	This code will be shared with future mass-import tools.
2018-03-19	watchmaildir: support v2 repositories
	Unfortunately this gives up some minor performance tweaks we made to avoid reforking import processes.
2018-03-19	Lock: new base class for writable lockers
	This reduces the code duplication needed for locking and hopefully makes things easier to understand.
2018-03-06	favor Received: date over Date: header globally
	The first Received: header is believable since it typically hits the user's mail server and can be treated as relatively trustworthy. We still show the Date: in per-message (permalink) views, which may expose users who send incorrect Date: headers, but all the ISO YYYY-MM-DD dates we display will match what we see.
2018-03-02	v2writable: inject new Message-IDs on true duplicates
	Since we'll need to support multiple Message-IDs anyways, inject a new one if we hit a duplicate (or don't get one at all). Try to use a deterministic Message-Id for consistency, but give up determinism and use a random Message-Id if an "attacker" wants to prevent their message from being archived.
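	The fallback order described above can be sketched as follows. This is a minimal Python illustration, not the project's code: the function names, the SHA-1 choice, and the "localhost" domain are all assumptions, and the caller is assumed to have already decided that a clash is a true duplicate (same Message-ID, different content).

```python
import hashlib
import uuid


def deterministic_mid(raw: bytes, domain: str = "localhost") -> str:
    """Derive a Message-ID from the message bytes so repeated imports agree.

    SHA-1 and the default domain are illustrative assumptions.
    """
    return f"<{hashlib.sha1(raw).hexdigest()}@{domain}>"


def random_mid(domain: str = "localhost") -> str:
    """Last resort: give up determinism so the message is still archived."""
    return f"<{uuid.uuid4().hex}@{domain}>"


def choose_mid(raw: bytes, claimed, seen: set) -> str:
    """Return a Message-ID that is not in `seen`, and record it there."""
    candidates = []
    if claimed:                       # the message's own Message-ID, if any
        candidates.append(claimed)
    candidates.append(deterministic_mid(raw))
    for mid in candidates:
        if mid not in seen:
            seen.add(mid)
            return mid
    while True:                       # even the deterministic ID is taken
        mid = random_mid()
        if mid not in seen:
            seen.add(mid)
            return mid
```

	A message that reuses a known Message-ID first gets the content-derived ID; only if that is taken as well does it fall through to a random one.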
2018-02-28	rename SearchIdxThread to SearchIdxSkeleton
	Interchangeably using "all", "skel", "threader", etc. was confusing. Standardize on the "skeleton" term to describe this class since it's also used for retrieval of basic headers.
2018-02-22	v2: parallelize Xapian indexing
	The parallelization requires splitting Msgmap, text+term indexing, and thread-linking out into separate processes.
	git-fast-import is fast, so we don't bother parallelizing it.
	Msgmap (SQLite) and thread-linking (Xapian) must be serialized because they rely on monotonically increasing numbers (NNTP article number and internal thread_id, respectively).
	We handle msgmap in the main process which drives fast-import. When the article number is retrieved/generated, we write the entire message to per-partition subprocesses via pipes for expensive text+term indexing.
	When these per-partition subprocesses are done with the expensive text+term indexing, they write SearchMsg (small data) to a shared pipe (inherited from the main V2Writable process) back to the threader, which runs its own subprocess.
	The number of text+term Xapian partitions is chosen at import and can be made equal to the number of cores in a machine.
		V2Writable --> Import -> git-fast-import
		           \-> SearchIdxThread -> Msgmap (synchronous)
		           \-> SearchIdxPart[n] -> SearchIdx[*]
		           \-> SearchIdxThread -> SearchIdx ("threader", a subprocess)
		[*] each subprocess writes to threader
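	The data flow described above can be mimicked in a few lines. The sketch below is a thread-based stand-in written in Python, not the real multi-process implementation: queues play the role of pipes, the `num % nparts` partition choice is an assumption, and the "indexing" is a placeholder for the expensive text+term work.

```python
import queue
import threading


def index_messages(messages, nparts=2):
    """Return (article_numbers, linked) for an in-memory batch of bodies."""
    part_queues = [queue.Queue() for _ in range(nparts)]  # one "pipe" per partition
    to_threader = queue.Queue()   # shared "pipe" back to the single threader
    linked = []                   # (article number, sequence) as seen by the threader

    def partition_worker(q):
        while True:
            item = q.get()
            if item is None:      # sentinel: no more messages for this partition
                break
            num, body = item
            terms = set(body.lower().split())    # placeholder for expensive indexing
            to_threader.put((num, len(terms)))   # only a small summary goes back

    def threader():
        seq = 0                   # monotonic counter owned by the threader alone
        while True:
            item = to_threader.get()
            if item is None:
                break
            seq += 1
            linked.append((item[0], seq))

    workers = [threading.Thread(target=partition_worker, args=(q,))
               for q in part_queues]
    thr = threading.Thread(target=threader)
    for w in workers:
        w.start()
    thr.start()

    numbers = []
    for num, body in enumerate(messages, start=1):  # numbering stays serialized
        numbers.append(num)
        part_queues[num % nparts].put((num, body))  # full message to one partition

    for q in part_queues:
        q.put(None)
    for w in workers:
        w.join()
    to_threader.put(None)         # workers are done, so the threader may stop
    thr.join()
    return numbers, linked
```

	Article numbers are handed out in one place and the threader's counter in another, so both stay monotonic even though the partitions finish in arbitrary order.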
2018-02-19	v2writable: initial cut for repo-rotation
	Wrap the old Import package to enable creating new repos based on size thresholds. This is better than relying on time-based rotation as LKML traffic seems to be increasing.
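	Size-based rotation can be sketched as below. This is an illustrative Python outline only: the class name, the "N.git" naming, and counting raw message bytes (rather than actual repository size) are assumptions.

```python
class RotatingWriter:
    """Pick a repository for each message, starting a new one past a size limit."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.epoch = 0              # index of the repository currently written to
        self.bytes_in_epoch = 0

    def repo_name(self):
        # Hypothetical naming scheme, for illustration only.
        return f"{self.epoch}.git"

    def add(self, raw: bytes) -> str:
        """Account for one message and return the repository it belongs in."""
        would_overflow = self.bytes_in_epoch + len(raw) > self.max_bytes
        if self.bytes_in_epoch and would_overflow:
            self.epoch += 1         # rotate: open a fresh repository
            self.bytes_in_epoch = 0
        self.bytes_in_epoch += len(raw)
        return self.repo_name()
```

	Because rotation depends only on accumulated size, a busier list simply rotates more often, with no schedule to retune.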
2018-02-12	content_id: add test case

2018-02-12	import: initial handling for v2
	Call order will need to change a bit since this is going to be tied to Xapian
2018-02-08	MANIFEST: add AUTHORS file

2017-07-13	MANIFEST: add hosted list

2017-06-23	allow admins to configure non-obfuscated addresses/domains
	We will also treat all known list addresses as non-obfuscated. Setting publicinbox.noObfuscate in ~/.public-inbox/config allows admins to disable address obfuscation on a per-domain or per-address basis.
2017-06-22	test for PublicInbox::Filter::RubyLang
	This will make it easier to prevent breakage in the future.
2017-06-22	add filter for RubyLang lists
	Unfortunately, it appears we have to reject this and instead add support for filtering at view time (which may not be worth it), due to DKIM signatures in messages from ruby-lang.org.
2017-06-15	replyto parameter support
	This allows us to support centralized mailing lists (which suck, but better than no mailing list at all).
2017-06-15	view: split out reply logic into its own module
	We'll be adding more reply options for centralized mailing lists. So split out the logic so it's easy-to-find. Organizing code is hard :<
2017-05-23	www: do not mangle characters from search queries
	Reported-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> https://public-inbox.org/meta/CACBZZX5Gnow08r=0A1J_kt3a=zpGyMfvsqu8nAN7kacNnDm+dg@mail.gmail.com/
2017-05-07	searchidx: fix ghost root vivification
	Due to the asynchronous nature of SMTP, it is possible for the root message of a thread (with no References/In-Reply-To) to arrive last in a series. We must preserve the thread_id of the ghost message in this case, as we do when vivifying non-root ghosts. Otherwise, this causes threads to be broken when the root arrives last.
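	The ghost handling can be illustrated with a toy index. This Python sketch is an assumption-laden simplification, not the Xapian-backed code: the class and field names are invented, and merging of two already-distinct threads is deliberately left out.

```python
import itertools


class ThreadIndex:
    """Toy Message-ID -> thread_id map with ghost placeholders."""

    def __init__(self):
        self._next_tid = itertools.count(1)
        self.docs = {}   # Message-ID -> {"thread_id": int, "ghost": bool}

    def _lookup_or_ghost(self, mid, tid=None):
        """Return the entry for `mid`, creating a ghost if it is unknown."""
        doc = self.docs.get(mid)
        if doc is None:
            if tid is None:
                tid = next(self._next_tid)
            doc = {"thread_id": tid, "ghost": True}
            self.docs[mid] = doc
        return doc

    def add(self, mid, references=()):
        """Index a real message and return its thread_id."""
        tid = None
        for ref in references:
            if tid is None:
                tid = self._lookup_or_ghost(ref)["thread_id"]
            else:
                self._lookup_or_ghost(ref, tid)
        existing = self.docs.get(mid)
        if existing is not None and existing["ghost"]:
            # Vivify: keep the ghost's thread_id, even when this message is a
            # root with no references of its own.
            existing["ghost"] = False
            return existing["thread_id"]
        if tid is None:
            tid = next(self._next_tid)
        self.docs[mid] = {"thread_id": tid, "ghost": False}
        return tid
```

	Without the vivify branch, a late-arriving root would be assigned a fresh thread_id and its replies would be left in a separate thread.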
2017-01-26	add filter for Subject: tags
	Some mailing lists add annoying tags into the Subject line which discourage readers from doing proper mail organization on the client side. They also waste precious screen space and attention span. Remove them from our archives to reduce clutter.
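	A tag filter of this kind might look like the following Python sketch. The function name and the choice to match tags case-insensitively are assumptions; only tags explicitly listed are removed, so markers such as [PATCH] survive.

```python
import re


def strip_subject_tags(subject, tags):
    """Remove each "[tag]" named in `tags` from a Subject line."""
    for tag in tags:
        pattern = r"\s*\[" + re.escape(tag) + r"\]\s*"
        # Replace the tag and the whitespace around it with one space.
        subject = re.sub(pattern, " ", subject, flags=re.IGNORECASE)
    return subject.strip()
```

	For example, `strip_subject_tags("[foo-list] Re: [foo-list] [PATCH] fix bug", ["foo-list"])` leaves `Re: [PATCH] fix bug`.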
2017-01-10	introduce PublicInbox::MIME wrapper class
	This should fix problems with multipart messages where text/plain parts lack a header. cf. git clone --mirror https://github.com/rjbs/Email-MIME.git refs/pull/28/head In the future, we may still introduce a streaming interface to reduce memory usage on large emails.
2016-12-20	tests: add thread-all testing for benchmarking
	I'll be using this to improve message threading performance.
2016-12-03	atom: switch to getline/close for response bodies
	This will let us stream larger Atom document bodies without wasting too much memory and reduce the amount of round-trip requests needed to get necessary information. Hopefully clients are using streaming (SAX) parsers, too. This is the final transition in the core public-inbox code to allow migrating to a "pull"-based body streaming scheme which allows an HTTP server to respond appropriately to backpressure from slow clients.
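	The "pull" model can be sketched in Python as an object exposing getline and close, mirroring the shape of a PSGI response body. Everything else here (the class name, the feed markup, the `serve` loop) is a simplified assumption rather than the project's code.

```python
from xml.sax.saxutils import escape


class AtomBody:
    """Response body that produces one chunk each time the server asks."""

    def __init__(self, titles):
        self._titles = iter(titles)   # may be a lazy generator
        self._state = "head"
        self.closed = False

    def getline(self):
        """Return the next chunk, or None once the document is complete."""
        if self._state == "head":
            self._state = "entries"
            return '<feed xmlns="http://www.w3.org/2005/Atom">\n'
        if self._state == "entries":
            for title in self._titles:   # take at most one entry per call
                return f"<entry><title>{escape(title)}</title></entry>\n"
            self._state = "done"
            return "</feed>\n"
        return None

    def close(self):
        self.closed = True


def serve(body, write):
    """Stand-in server loop: pull a chunk only when ready to write it."""
    try:
        while True:
            chunk = body.getline()
            if chunk is None:
                break
            write(chunk)
    finally:
        body.close()
```

	Since the server decides when to call getline, a slow client just delays the next call and nothing is buffered in the meantime.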
2016-10-05	thread: use hash + array instead of hand-rolled linked list
	This starts to show noticeable performance improvements when attempting to thread over 400 messages, but the improvement may not be measurable with fewer. However, the resulting code is much shorter and (IMHO) much easier to understand.
2016-10-05	thread: remove Mail::Thread dependency
	Introduce our own SearchThread class for threading messages. This should allow us to specialize and optimize away objects in future commits.
2016-09-07	doc: new docs for user-level commands
	Hopefully more folks can download and run public-inbox, nowadays.
2016-08-18	www: implement generic help text
	Begin documenting some basic help functionality. I may tweak the anchor names of the various HTML endpoints to be more consistent with each other (old ones will be supported for a short while), so I'm not documenting those, for now. This may become part of a builtin key-value store for basic texts, but this probably shouldn't become a wiki engine, either.
2016-08-14	www: do not unnecessarily escape some chars in paths
	Based on reading RFC 3986, it seems '@', ':', '!', '$', '&', "'", '(', ')', '*', '+', ',', ';', '=' are all allowed in path-absolute where we have the Message-ID. In any case, it seems '@' is fairly common in path components nowadays and too common in Message-IDs.
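	The same escaping rule can be expressed with Python's standard library as a quick illustration. The `mid_to_path` name is hypothetical; `urllib.parse.quote` always leaves unreserved characters alone, and the `safe` argument lists the extra characters from the message above.

```python
from urllib.parse import quote

# sub-delims plus ':' and '@', which RFC 3986 allows in a path segment
PATH_SAFE = "@:!$&'()*+,;="


def mid_to_path(mid):
    """Percent-encode a Message-ID for use as a single path component."""
    return quote(mid, safe=PATH_SAFE)
```

	A '/' or a space inside a Message-ID is still escaped, since neither is in the safe list.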
2016-08-11	search: support alt-ID for mapping legacy serial numbers
	For some existing mailing list archives, messages are identified by serial number (such as NNTP article numbers in gmane). Those links may become inaccessible (as is the current case for gmane), so ensure users can still search based on old serial numbers.
	Now, I run the following periodically to get article numbers from gmane (while news.gmane.org remains):
		NNTPSERVER=news.gmane.org
		export NNTPSERVER
		GROUP=gmane.comp.version-control.git
		perl -I lib scripts/xhdr-num2mid $GROUP --msgmap=/path/to/gmane.sqlite3
	(I might integrate this further with public-inbox-* scripts one day).
	My ~/.public-inbox/config has an added "altid" snippet which now looks like this:
		[publicinbox "git"]
			address = git@vger.kernel.org
			mainrepo = /path/to/git.vger.git
			newsgroup = inbox.comp.version-control.git
			; relative pathnames expand to $mainrepo/public-inbox/$file
			altid = serial:gmane:file=gmane.sqlite3
	And run "public-inbox-index --reindex /path/to/git.vger.git" periodically. This ought to allow searching for "gmane:12345" to work for Xapian-enabled instances.
	Disclaimer: while public-inbox supports NNTP and stable article serial numbers, use of those for public links is discouraged since it encourages centralization.
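	How the altid value breaks down can be shown with a small parser. This Python sketch handles only the single `file=` parameter that appears in the example; the function name and returned dictionary are assumptions for illustration.

```python
import posixpath


def parse_altid(spec, mainrepo):
    """Split "TYPE:PREFIX:file=NAME" and resolve NAME as the config comment says."""
    type_, prefix, param = spec.split(":", 2)
    key, value = param.split("=", 1)
    if key != "file":
        raise ValueError(f"unsupported altid parameter: {key}")
    if not posixpath.isabs(value):
        # relative pathnames expand to $mainrepo/public-inbox/$file
        value = posixpath.join(mainrepo, "public-inbox", value)
    return {"type": type_, "prefix": prefix, "file": value}
```

	With the config shown, the prefix "gmane" is what makes a query such as "gmane:12345" meaningful, and the SQLite file is looked up under the inbox's own public-inbox directory.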
2016-07-28	add scripts/xhdr-num2mid example
	This is used to quickly generate an article number to Message-ID mapping. Usage: NNTPSERVER=news.example.org ./scripts/xhdr-num2mid GROUP >file
2016-07-28	fix manifest

2016-07-09	www: add configurable limiters
	Currently only for git-http-backend use, this allows limiting the number of spawned processes per-inbox or by group, if there are multiple large inboxes amidst a sea of small ones.
	For example, a "big" limiter could be shared between multiple big inboxes:
		[limiter "big"]
			max = 4
		[publicinbox "git"]
			address = git@vger.kernel.org
			mainrepo = /path/to/git.git
			; shared limiter with giant:
			httpbackendmax = big
		[publicinbox "giant"]
			address = giant@project.org
			mainrepo = /path/to/giant.git
			; shared limiter with git:
			httpbackendmax = big
		; This is a tiny inbox, use the default limiter with 32 slots:
		[publicinbox "meta"]
			address = meta@public-inbox.org
			mainrepo = /path/to/meta.git
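	The sharing behaviour can be modelled with named semaphores. The Python sketch below is an assumption about the general shape only: the class name is invented, semaphores stand in for whatever accounting the server really uses, and only the named-limiter form of httpbackendmax from the example is handled.

```python
import threading

DEFAULT_MAX = 32   # the default limiter size mentioned in the example


class Limiters:
    """Hand out one shared slot counter per limiter name."""

    def __init__(self, limiter_max):
        # e.g. {"big": 4}; every inbox naming "big" shares the same 4 slots
        self._named = {
            name: threading.BoundedSemaphore(n)
            for name, n in limiter_max.items()
        }
        self._default = threading.BoundedSemaphore(DEFAULT_MAX)

    def for_inbox(self, inbox_cfg):
        """Return the limiter an inbox should acquire before spawning."""
        name = inbox_cfg.get("httpbackendmax")
        if name is None:
            return self._default
        return self._named[name]
```

	A caller would acquire a slot before spawning git-http-backend and release it when the process exits, so "git" and "giant" together never exceed four processes.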
2016-07-08	examples: add logrotate sample to show USR1 reopening
	Same as nginx :>
2016-07-01	MANIFEST: update with new varnish-4 vcl example

2016-06-25	address: beef up the module with name list extraction
	We may remove from_name in the future. ...And disallow quotes in email addresses. Technically I believe they're allowed, but they're definitely uncommon and unlikely to show up in legitimate mail.
2016-06-24	split out spamcheck/spamc to its own module.
	This should hopefully make it easier to try other anti-spam systems (or none at all) in the future.
2016-06-24	implement ListMirror SpamAssassin plugin
	When running mailing list mirrors, one needs to be careful spammers do not try to sidestep the list server we want to mirror from and inject email into our mail directly by setting the appropriate list headers (e.g. "X-Mailing-List" or "List-Id"). We trust the top-most Received: header is the one our own mail server got the mail from.
	Bcc:-ing a public mailing list is a very likely indicator of spam in my experience, so throw in an extra rule to mark it. While public-inbox-mda rejects Bcc: entirely, public-inbox-watch needs to mirror lists which allow Bcc.
	==> list_mirror.cf <==
		loadplugin PublicInbox::SaPlugin::ListMirror
		ifplugin PublicInbox::SaPlugin::ListMirror
		  header LIST_MIRROR_RECEIVED eval:check_list_mirror_received()
		  describe LIST_MIRROR_RECEIVED Received does not match trusted list server
		  score LIST_MIRROR_RECEIVED 10
		  header LIST_MIRROR_BCC eval:check_list_mirror_bcc()
		  describe LIST_MIRROR_BCC Mailing list was Bcc-ed
		  score LIST_MIRROR_BCC 1
		endif
	==> ~/.spamassassin/user_prefs <==
		ifplugin PublicInbox::SaPlugin::ListMirror
		  list_mirror X-Mailing-List git@vger.kernel.org *.kernel.org git@vger.kernel.org
		endif
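	The two checks can be approximated in Python to show the idea. This is not the plugin's logic: the function names are invented, and the reading of the list_mirror arguments (header name, header value, trusted host glob, list address) as well as the "from HOST" parsing of Received: are assumptions.

```python
import re
from email import message_from_string
from email.utils import getaddresses
from fnmatch import fnmatch


def received_mismatch(msg, header, value, host_glob):
    """True if msg claims the list header but the top Received host differs."""
    if value not in (msg.get(header) or ""):
        return False                  # not claiming to be from this list
    received = msg.get_all("Received") or []
    if not received:
        return True
    m = re.match(r"\s*from\s+(\S+)", received[0])   # top-most Received only
    return not (m and fnmatch(m.group(1).lower(), host_glob.lower()))


def list_was_bcced(msg, header, value, list_addr):
    """True if msg claims the list header but the list is not in To: or Cc:."""
    if value not in (msg.get(header) or ""):
        return False
    rcpts = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    return list_addr.lower() not in {addr.lower() for _, addr in rcpts}
```

	A message parsed with `message_from_string` that carries the X-Mailing-List value but arrived from an unrelated host would trip the first check, which is the case the heavier score is meant for.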
2016-06-20	MANIFEST: update with recent changes
	And add a check-manifest target to the Makefile to ensure we're up-to-date with git (but do not depend on git).
2016-06-17	watch: introduce watch directive
	This will allow users to run importers off existing mail accounts where they may not have access to run -mda. Currently, we only support Maildirs, but IMAP ought to be doable.
2016-06-15	MANIFEST: update
	Oops, maybe this could be auto-maintained somehow...
2016-05-28	remove redundant NewsGroup class
	Most of its functionality is in the PublicInbox::Inbox class. While we're at it, we no longer auto-create newsgroup names based on the inbox name, since newsgroup names probably deserve some thought when it comes to hierarchy.
2016-03-01	linkify: do not capture trailing '.' or ';' in URLs
	It seems common for users to end statements with URLs, while it is rare for a URL itself to end with a '.' or ';'. So make a guess and assume the URL was intended to not include the trailing '.' or ';'
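	The guess described above amounts to trimming trailing '.' and ';' after matching. A minimal Python sketch follows; the URL pattern is a deliberately simple assumption and not the project's actual regular expression.

```python
import re

# Simplified pattern: scheme, then anything up to whitespace, quotes or angle brackets.
URL_RE = re.compile(r"""\bhttps?://[^\s<>"']+""")


def find_urls(text):
    """Return URLs found in text, without trailing '.' or ';' characters."""
    return [m.group(0).rstrip(".;") for m in URL_RE.finditer(text)]
```

	A ';' in the middle of a URL is kept; only punctuation at the very end is assumed to belong to the surrounding sentence.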
2016-03-01	extract linkification code to a separate package
	This will allow us to more easily reuse it elsewhere.
2016-03-01	MANIFEST: add examples/apache2_perl_old.conf
	Ugh, I wonder if we can/should generate this automatically...
2016-02-28	MANIFEST: update (generate via "git ls-files")
	It's been a while...
2016-01-04	use Perl POD instead of pandoc-flavored Markdown
	This project is currently implemented in Perl, and pod2man is probably more common among potential users and developers of this project.
2015-12-22	rename 'GitCatFile' package to 'Git'
	We'll be using it for more than just cat-file. Adding a `popen' API for internal use allows us to save a bunch of code in other places.
2014-05-01	workaround Mail::Thread memory leak
	Thanks to Ask for the patch in https://rt.cpan.org/Public/Bug/Display.html?id=22817
2014-05-01	split out WWW package and CGI/PSGI-specific parts
	This should allow us to more-easily test with Plack.
2014-04-29	implement our own cat-file --batch wrapper
	We use --git-dir=... instead of $ENV{GIT_DIR} because ENV changes do not propagate easily with mod_perl.