public-inbox.git - an "archives first" approach to mailing lists

Date	Commit message
2023-11-28	disallow NUL characters in Message-ID and List-Id
	While MTAs seem to stop '\0' from appearing in headers, users fetching archives via git remain susceptible to having '\0' land in archives. So we'll filter them out of Xapian and SQLite DBs to avoid interoperability problems with CLI tools, since there are no known messages in lore or any of my archives which feature them. Avoiding '\0' will ensure all indexed Message-IDs and List-Ids can be specified from the command-line (although some characters will still require $(printf) contortions). As with Message-ID, List-Id fields with /\n\t\r/ characters will also be stripped for indexing. I will assume whatever went wrong with the References: header in <https://public-inbox.org/git/656C30A1EFC89F6B2082D9B6@localhost/raw> could also happen to the List-Id header. This is inspired by commit aca47e05a6026c12c768753c87e6ff769ef6bee4 (Import: Don't copy nulls from emails into git, 2018-07-07)
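A minimal Python sketch of the filtering described above (the project itself is Perl). The function name `clean_id_for_index` is hypothetical; only the character set (NUL, newline, tab, carriage return) comes from the commit message.

```python
import re

# Characters removed before indexing: NUL, newline, tab, carriage return.
_UNSAFE_ID_CHARS = re.compile(r"[\x00\n\t\r]")


def clean_id_for_index(value):
    """Return `value` with NUL, newline, tab, and CR characters deleted.

    Hypothetical helper for illustration; applies equally to a
    Message-ID or a List-Id string.
    """
    return _UNSAFE_ID_CHARS.sub("", value)
```

With this, an ID such as "a\0b@example.com" would be indexed as "ab@example.com" and could be typed on a command line without NUL handling.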
2023-11-27	www: qs_html: fix escaping of `q' param
	Our use of MID_ESC characters was only intended for the pathname component of URIs and not appropriate for the query string component. So use a different $unsafe parameter list for uri_escape to make the result appropriate for query strings by disallowing [\&\'\+=] characters. Most notably, this change also allows us to accept `/' (slash) unescaped to make dfn: queries nicer to look at. Finally, we'll also add an ascii_html call on the URI-escaped result as an extra safety measure even though it's not really needed. As far as I can tell, the code without this fix didn't result in an HTML injection since all our uses of uri_escape did escape angle brackets.
	Reported-by: Ricardo Cañuelo <ricardo.canuelo@collabora.com>
	Link: https://public-inbox.org/meta/87o7ff4nlk.fsf@collabora.com/
	Tested-by: Ricardo Cañuelo <ricardo.canuelo@collabora.com>
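The idea can be sketched in Python with the standard library, as an illustration rather than the Perl code from this commit. The name `qs_html` is borrowed from the subject line; the exact escaping set below (everything outside RFC 3986 unreserved characters except `/`) is an assumption that happens to cover the [\&\'\+=] characters named above.

```python
from html import escape
from urllib.parse import quote


def qs_html(query):
    """Escape a search query for use as a `q=` value inside HTML.

    Illustrative only: percent-encodes everything except unreserved
    characters and '/', so '&', "'", '+' and '=' never appear raw.
    """
    uri_escaped = quote(query, safe="/")
    # Extra safety measure: HTML-escape the already URI-escaped result.
    return escape(uri_escaped, quote=True)
```

Keeping `/` unescaped is what lets a `dfn:` query with a file path stay readable in the resulting link.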
2023-04-25	mid+contenthash: eliminate needless local variable captures
	It's possible in theory that Perl could be smarter and free memory a tad sooner this way. Regardless, fewer lines of code are easier to navigate and read, and can save optree size and reduce parsing times.
2023-01-30	use Net::SSLeay (OpenSSL) for SHA-(1|256) if installed
	On my x86-64 machine, OpenSSL SHA-256 is nearly twice as fast as the Digest::SHA implementation from Perl, most likely due to an optimized assembly implementation. SHA-1 is a few percent faster, too.
2021-01-01	update copyrights for 2021
	Using "make update-copyrights" after setting GNULIB_PATH in my config.mak
2021-01-01	mid: hoist out mids_in sub
	We'll be using it for Resent-Message-ID with lei, and possibly other places.
2021-01-01	mid: use defined-or with `push' for uniqueness check
	As shown recently in commit a05445fb400108e60ede7d377cf3b26a0392eb24 ("config: config_fh_parse: micro-optimize"), relying on the return value of `push' and the defined-or operator can avoid modifying the hash value scalar with an increment.
2020-09-20	mid: drop repeated ';' in mid_escape() regular expression

2020-09-16	mid: rename MID_MAX to ID_MAX
	It's only used for HTML anchors which we will need indefinitely.
2020-08-02	searchidx: remove v1-only msg_mime sub
	We can rely on the newer mids() sub directly and use faster numeric comparisons for Msgmap unindexing in v1.
2020-04-30	mid: capitalize "ID" in "Message-ID"
	Prefer the "ID" capitalization since it seems to be the preferred capitalization in RFC 5322. In theory, this allows the interpreter to deduplicate the string internally (I haven't checked if it does). Unfortunately, there are too many instances of "Message-Id" in the tests to be worth changing at this point.
2020-04-02	mid: add $MID_EXTRACT regexp for export
	This allows us to consistently enforce the same Message-ID extraction rules everywhere and makes it easier for us to make changes in the future. Update scripts/ssoma-replay, as well, but don't rely on PublicInbox::* modules in that since it's legacy and public-inbox was never a dependency of ssoma.
2020-02-06	treewide: run update-copyrights from gnulib for 2019
	I didn't wait until September to do it, this year!
2020-01-24	mid: shorten uniq_mids logic
	We won't be able to use List::Util::uniq here, but we can still shorten our logic and make it more consistent with the rest of our code which does similar things.
2019-10-28	index: allow search/lookups on X-Alt-Message-ID
	Since we replace extra Message-ID headers with X-Alt-Message-ID to placate NNTP clients, we should allow searching and indexing on X-Alt-Message-ID just like we do with Message-ID.
2019-09-09	run update-copyrights from gnulib for 2019

2019-06-04	mid: id_compress requires ASCII-clean words
	Its result is used for HTML anchors and such.
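A rough Python sketch of what "compressing" an identifier for anchor use might look like. The length limit, the allowed character class, and the `force` parameter are assumptions for illustration; the commit message only says the result must be ASCII-clean.

```python
import hashlib
import re

ID_MAX = 40  # assumed limit; matches the length of a SHA-1 hex digest

# Anything outside ASCII letters, digits, '_' and '-' counts as unsafe here.
_NOT_ASCII_WORD = re.compile(r"[^A-Za-z0-9_-]")


def id_compress(ident, force=False):
    """Return `ident` unchanged if it is short and ASCII-clean,
    otherwise its SHA-1 hex digest (always 40 ASCII characters)."""
    if force or len(ident) > ID_MAX or _NOT_ASCII_WORD.search(ident):
        return hashlib.sha1(ident.encode("utf-8")).hexdigest()
    return ident
```

Using an explicit ASCII character class rather than a Unicode-aware word class is the point: a non-ASCII letter must trigger hashing rather than pass through into an anchor.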
2019-01-29	mid: filter out 'y', 'n', and email addresses from references()
	Looking at git@vger history, several emails had broken References and In-Reply-To headers that used <y>, <n>, and email addresses as Message-IDs. This was causing too many unrelated messages to be linked together in the same thread.
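The filtering can be sketched in Python roughly as follows. This is an illustration, not the Perl implementation: the extraction regexp and the choice to collect addresses from To, From, and Cc are assumptions.

```python
import re
from email.message import Message
from email.utils import getaddresses

# Assumed extraction rule: whatever sits inside a pair of angle brackets.
_MID_EXTRACT = re.compile(r"<[ \t]*([^>]+?)[ \t]*>")


def references(msg):
    """Return unique Message-IDs from References and In-Reply-To,
    skipping 'y', 'n', and any address found in To/From/Cc."""
    mids = []
    for name in ("References", "In-Reply-To"):
        for value in msg.get_all(name, []):
            mids.extend(_MID_EXTRACT.findall(value))

    ignore = {"y", "n"}
    for name in ("To", "From", "Cc"):
        for _realname, addr in getaddresses(msg.get_all(name, [])):
            if addr:
                ignore.add(addr)

    seen = set()
    result = []
    for mid in mids:
        if mid in ignore or mid in seen:
            continue
        seen.add(mid)
        result.append(mid)
    return result
```

A message whose In-Reply-To is `<y>` and whose References repeat the sender's own address would then contribute no bogus thread links.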
2018-04-20	disallow "\t" and "\n" in OVER headers
	For Subject/To/Cc/From headers, we squeeze them to a space (' '). For Message-IDs (including References/In-Reply-To), '\t', '\n', '\r' are deleted since some MUAs might screw them up: https://public-inbox.org/git/656C30A1EFC89F6B2082D9B6@localhost/raw
2018-04-01	truncate Message-IDs and References consistently
	We need to stop ghost messages from generating longer Message-IDs than Xapian can handle with terms.
2018-03-19	mid: mid_mime uses v2-compatible mids function
	This allows us to be more consistent in dealing with completely empty Message-Ids.
2018-03-03	mid: truncate excessively long MIDs early
	Since we support duplicate MIDs in v2, we can safely truncate long MID terms in the database and let other normal duplicate resolution sort it out. It seems only spammers use excessively long MIDs, and there'll always be abuse/misuse vectors for causing mis-threaded messages, so it's not worth worrying about excessively long MIDs.
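A small Python sketch of early truncation, for illustration only. The limit of 244 is an assumed value chosen to stay under Xapian's term length limit, and `truncate_mid` is a hypothetical name.

```python
MAX_MID_SIZE = 244  # assumed limit, not taken from the commit message


def truncate_mid(mid):
    """Cut an excessively long Message-ID down to MAX_MID_SIZE characters.

    Two different long IDs may collide after truncation; the sketch
    relies on the caller treating them like any other duplicate.
    """
    if len(mid) > MAX_MID_SIZE:
        return mid[:MAX_MID_SIZE]
    return mid
```

The collision case is the trade-off the commit accepts: two spam messages sharing a long common prefix end up with the same term and are handled by normal duplicate resolution.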
2018-03-03	mid: be strict with References, but loose on Message-Id
	Traditionally we've been more lax on parsing Message-Id and allow it without the angle brackets. We've always been strict on References and can't have it be pointlessly large when some MUA decides to use HTML-escaped angle brackets ("&lt;", "&gt;").
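The strict/loose split might look like this in Python. Both function names and the regexp are hypothetical; the sketch only illustrates the policy described above.

```python
import re

# Assumed extraction rule: the text between '<' and '>'.
_MID_EXTRACT = re.compile(r"<[ \t]*([^>]+?)[ \t]*>")


def extract_strict(value):
    """For References: accept only IDs enclosed in real angle brackets."""
    return _MID_EXTRACT.findall(value)


def extract_loose(value):
    """For Message-ID: prefer bracketed IDs, but fall back to the bare
    header value when no brackets are present."""
    found = _MID_EXTRACT.findall(value)
    if found:
        return found
    bare = value.strip()
    return [bare] if bare else []
```

Under the strict rule, a References value written with HTML entities instead of real brackets yields nothing, so it cannot inflate the stored reference list.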
2018-03-02	searchidx: use new `references' method for parsing References
	It's shorter and more convenient, here.
2018-03-02	mid: add `mids' and `references' methods for extraction
	We'll be using a more consistent API for extracting Message-IDs from various headers.
2018-02-07	update copyrights for 2018
	Using update-copyrights from gnulib. While we're at it, use the SPDX identifier for AGPL-3.0+ to ease mechanical processing.
2017-05-23	www: do not mangle characters from search queries
	Reported-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> https://public-inbox.org/meta/CACBZZX5Gnow08r=0A1J_kt3a=zpGyMfvsqu8nAN7kacNnDm+dg@mail.gmail.com/
2016-08-14	www: do not unnecessarily escape some chars in paths
	Based on reading RFC 3986, it seems '@', ':', '!', '$', '&', "'", '(', ')', '*', '+', ',', ';', '=' are all allowed in path-absolute where we have the Message-ID. In any case, it seems '@' is fairly common in path components nowadays and too common in Message-IDs.
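For illustration, the same character set can be expressed in Python with `urllib.parse.quote`. The function name `mid_escape_path` is hypothetical; the safe set is RFC 3986's sub-delims plus ':' and '@'.

```python
from urllib.parse import quote

# RFC 3986 sub-delims plus ':' and '@', all legal in a path segment.
_PATH_SAFE = "@:!$&'()*+,;="


def mid_escape_path(mid):
    """Percent-encode a Message-ID for use as a single URL path segment,
    leaving the characters in _PATH_SAFE untouched."""
    return quote(mid, safe=_PATH_SAFE)
```

Note that '/' is deliberately left out of the safe set, so a slash inside a Message-ID is still encoded and cannot be mistaken for a path separator.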
2016-08-14	mid: no wide characters for sha1_hex
	Apparently there are some really screwed up In-Reply-To fields out there.
2016-03-03	use raw header for Message-ID
	Message-IDs should not be MIME encoded, but in case they are, use the raw form for compatibility with ssoma and possibly other tools. This prevents a potential problem where a malicious client could confuse our storage layer into indexing incorrect contents.
2015-11-20	various internal documentation updates
	Hopefully this gives new hackers a better overview of how the components relate to each other.
2015-10-02	rename mid_compress to id_compress
	We use it as a general compressor for identifiers such as subject paths, so using the "mid_" prefix probably is not appropriate.
2015-09-06	update copyright headers and email addresses
	In the future, it should be possible to use this:
	git ls-files | UPDATE_COPYRIGHT_HOLDER='all contributors' \
		UPDATE_COPYRIGHT_USE_INTERVALS=2 \
		xargs /path/to/gnulib/build-aux/update-copyright
2015-08-30	mid2path: clean MID of angle brackets '<>'
	We screwed up and needed to fix URL generation with '<>' in them. Regardless, users may attempt to copy and paste URLs with '<>' in them; do not punish them for that.
2015-08-27	mid: extract Message-ID from inside '<>'
	This is necessary for some mailers which include comment text in the In-Reply-To header. Merely assuming there is nothing outside of '<>', as we were doing, is not enough.
2015-08-25	mid: mid_compressed => mid_compress
	Consistently name mid_* functions as verbs.
2015-08-17	view: always compress Message-IDs for anchors
	Valid URLs do not make valid anchor ids.
2015-08-17	mid: compress Message-IDs with '%' in them
	Some HTTP servers (apache2 2.2.22-13+deb7u5) on my system apparently do not handle "%25" correctly. I'm not yet sure if it's something weird with my rewrite rules or what....
2015-08-16	view: deduplicate common code for loading search results
	More to come later.
2015-08-15	extract redundant Message-ID handling code
	Quit repeating ourselves and use a common MID module instead.