path: root/lib/PublicInbox/MID.pm
Date Commit message
2021-01-01 update copyrights for 2021
Using "make update-copyrights" after setting GNULIB_PATH in my config.mak
2021-01-01 mid: hoist out mids_in sub
We'll be using it for Resent-Message-ID with lei, and possibly other places.
2021-01-01 mid: use defined-or with `push' for uniqueness check
As shown recently in commit a05445fb400108e60ede7d377cf3b26a0392eb24 ("config: config_fh_parse: micro-optimize"), relying on the return value of `push' and the defined-or operator can avoid modifying the hash value scalar with an increment.
2020-09-20 mid: drop repeated ';' in mid_escape() regular expression
2020-09-16 mid: rename MID_MAX to ID_MAX
It's only used for HTML anchors which we will need indefinitely.
2020-08-02 searchidx: remove v1-only msg_mime sub
We can rely on the newer mids() sub directly and use faster numeric comparisons for Msgmap unindexing in v1.
2020-04-30 mid: capitalize "ID" in "Message-ID"
Prefer the "ID" capitalization since it seems to be the preferred capitalization in RFC 5322. In theory, this allows the interpreter to deduplicate the string internally (I haven't checked if it does). Unfortunately, there are too many instances of "Message-Id" in the tests to be worth changing at this point.
2020-04-02 mid: add $MID_EXTRACT regexp for export
This allows us to consistently enforce the same Message-ID extraction rules everywhere and makes it easier for us to make changes in the future. Update scripts/ssoma-replay, as well, but don't rely on PublicInbox::* modules in that since it's legacy and public-inbox was never a dependency of ssoma.
2020-02-06 treewide: run update-copyrights from gnulib for 2019
I didn't wait until September to do it, this year!
2020-01-24 mid: shorten uniq_mids logic
We won't be able to use List::Util::uniq here, but we can still shorten our logic and make it more consistent with the rest of our code which does similar things.
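The order-preserving uniqueness check discussed in this entry (and in the 2021-01-01 `push' entry) can be sketched as follows. This is an illustration in Python, not the module's Perl; only the name uniq_mids comes from the log, and the body is an assumption about the intended behaviour: keep the first occurrence of each Message-ID and preserve input order.

```python
def uniq_mids(mids):
    """Return mids with duplicates removed, keeping first-seen order.

    Assumed behaviour only; the real sub is Perl and may differ in detail.
    """
    # dict preserves insertion order, so the keys are the unique
    # Message-IDs in the order they first appeared.
    return list(dict.fromkeys(mids))
```

For example, uniq_mids(["a@x", "b@x", "a@x"]) keeps "a@x" and "b@x" in that order.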
2019-10-28 index: allow search/lookups on X-Alt-Message-ID
Since we replace extra Message-ID headers with X-Alt-Message-ID to placate NNTP clients, we should allow searching and indexing on X-Alt-Message-ID just like we do with Message-ID.
2019-09-09 run update-copyrights from gnulib for 2019
2019-06-04 mid: id_compress requires ASCII-clean words
Its result is used for HTML anchors and such.
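A rough sketch of what "ASCII-clean" compression could look like, written in Python rather than the module's Perl. The names id_compress and ID_MAX appear in this log; the value 40, the allowed character class, and the SHA-1 fallback are assumptions made for illustration (the 2016-08-14 entry mentions sha1_hex, but the exact rules are not stated here).

```python
import hashlib
import re

ID_MAX = 40  # assumed value; the log only names the constant

# Assumed definition of "ASCII-clean": ASCII letters, digits, '_' and '-'.
_CLEAN = re.compile(r"[A-Za-z0-9_-]+")

def id_compress(ident):
    """Return ident unchanged if it is short and ASCII-clean,
    otherwise a SHA-1 hex digest that is safe as an HTML anchor."""
    if len(ident) > ID_MAX or not _CLEAN.fullmatch(ident):
        # Encode first so non-ASCII input never reaches the hash as text.
        return hashlib.sha1(ident.encode("utf-8")).hexdigest()
    return ident
```

Under these assumptions a Message-ID containing '@' or '%' is always hashed, which matches the goal of producing identifiers usable as anchors.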
2019-01-29 mid: filter out 'y', 'n', and email addresses from references()
Looking at git@vger history, several emails had broken References and In-Reply-To headers that used <y>, <n>, and email addresses as Message-IDs. This was causing too many unrelated messages to be linked together in the same thread.
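The filtering described above might look like this minimal Python sketch. The function name filter_references and its parameters are hypothetical; it assumes the caller has already extracted the candidate IDs and collected the addresses found in the message's own address headers.

```python
def filter_references(refs, addresses):
    """Drop bogus 'y'/'n' entries and anything that is really one of
    the message's email addresses (hypothetical helper)."""
    # Compare addresses case-insensitively (an assumption of this sketch).
    addr = {a.lower() for a in addresses}
    bogus = {"y", "n"}
    return [r for r in refs if r not in bogus and r.lower() not in addr]
```

Dropping these entries keeps unrelated messages that share a bogus <y> or <n> reference from being threaded together.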
2018-04-20 disallow "\t" and "\n" in OVER headers
For Subject/To/Cc/From headers, we squeeze them to a space (' '). For Message-IDs (including References/In-Reply-To), '\t', '\n', '\r' are deleted since some MUAs might screw them up: https://public-inbox.org/git/656C30A1EFC89F6B2082D9B6@localhost/raw
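The two rules above can be sketched in Python as follows. The helper names squeeze_header and clean_mid are hypothetical, and collapsing a run of control characters into a single space is this sketch's reading of "squeeze".

```python
import re

def squeeze_header(value):
    """Subject/To/Cc/From: squeeze tab/newline/CR runs to one space."""
    return re.sub(r"[\t\n\r]+", " ", value)

def clean_mid(mid):
    """Message-ID, References, In-Reply-To: delete tab, newline, CR."""
    return re.sub(r"[\t\n\r]", "", mid)
```

Here squeeze_header("a\n\tb") gives "a b", while clean_mid("a\t@b\r\n") gives "a@b".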
2018-04-01 truncate Message-IDs and References consistently
We need to stop ghost messages from generating longer Message-IDs than Xapian can handle with terms.
2018-03-19 mid: mid_mime uses v2-compatible mids function
This allows us to be more consistent in dealing with completely empty Message-Ids.
2018-03-03 mid: truncate excessively long MIDs early
Since we support duplicate MIDs in v2, we can safely truncate long MID terms in the database and let other normal duplicate resolution sort it out. It seems only spammers use excessively long MIDs, and there'll always be abuse/misuse vectors for causing mis-threaded messages, so it's not worth worrying about excessively long MIDs.
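A minimal sketch of early truncation, in Python for illustration only. The constant MAX_MID_SIZE and its value 244 are assumptions (chosen to stay under a typical Xapian term-length limit), as is truncating by UTF-8 bytes; the log does not state the exact limit or method.

```python
MAX_MID_SIZE = 244  # assumed byte limit, not taken from the log

def truncate_mid(mid):
    """Cut an over-long Message-ID down to MAX_MID_SIZE bytes.

    Collisions between truncated IDs are accepted and left to normal
    duplicate handling, as the entry above describes.
    """
    raw = mid.encode("utf-8")
    if len(raw) <= MAX_MID_SIZE:
        return mid
    # errors="ignore" drops a multi-byte character split by the cut.
    return raw[:MAX_MID_SIZE].decode("utf-8", errors="ignore")
```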
2018-03-03 mid: be strict with References, but loose on Message-Id
Traditionally we've been more lax on parsing Message-Id and allow it without the angle brackets. We've always been strict on References and can't have it be pointlessly large when some MUA decides to use HTML-escaped angle brackets ("&lt;", "&gt;").
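The strict/loose distinction can be illustrated with a short Python sketch. The pattern below is an assumed stand-in for the $MID_EXTRACT regexp mentioned in the 2020-04-02 entry, and the function names are hypothetical.

```python
import re

# Assumption: a Message-ID is whatever sits inside one pair of '<' and '>'.
MID_EXTRACT = re.compile(r"<([^>]+)>")

def message_id_loose(value):
    """Message-Id: prefer the bracketed form, else accept the bare value."""
    m = MID_EXTRACT.search(value)
    return m.group(1) if m else value.strip()

def references_strict(value):
    """References: accept only properly bracketed IDs, ignore the rest."""
    return MID_EXTRACT.findall(value)
```

With this sketch, a References value of "&lt;a@b&gt;" yields no IDs at all, while a bare Message-Id such as "a@b" is still accepted. Text outside the brackets, such as mailer comments, is ignored in both cases.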
2018-03-02 searchidx: use new `references' method for parsing References
It's shorter and more convenient, here.
2018-03-02 mid: add `mids' and `references' methods for extraction
We'll be using a more consistent API for extracting Message-IDs from various headers.
2018-02-07 update copyrights for 2018
Using update-copyrights from gnulib. While we're at it, use the SPDX identifier for AGPL-3.0+ to ease mechanical processing.
2017-05-23 www: do not mangle characters from search queries
Reported-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> https://public-inbox.org/meta/CACBZZX5Gnow08r=0A1J_kt3a=zpGyMfvsqu8nAN7kacNnDm+dg@mail.gmail.com/
2016-08-14 www: do not unecessarily escape some chars in paths
Based on reading RFC 3986, it seems '@', ':', '!', '$', '&', "'", '(', ')', '*', '+', ',', ';', '=' are all allowed in path-absolute where we have the Message-ID. In any case, it seems '@' is fairly common in path components nowadays and too common in Message-IDs.
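As an illustration of the escaping rule, here is a Python sketch using the standard urllib.parse.quote. The function name mid_escape appears elsewhere in this log, but this body is not the module's implementation; the safe set is simply the RFC 3986 sub-delims plus ':' and '@'.

```python
from urllib.parse import quote

# RFC 3986 sub-delims plus ':' and '@', all legal in a path segment.
PATH_SAFE = "@:!$&'()*+,;="

def mid_escape(mid):
    """Percent-encode a Message-ID for use as a single URL path segment.

    '/' is deliberately absent from the safe set, so it is encoded too.
    """
    return quote(mid, safe=PATH_SAFE)
```

So mid_escape("a+b@example.com") is returned unchanged, while a '/' or '%' inside the Message-ID is still percent-encoded.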
2016-08-14 mid: no wide characters for sha1_hex
Apparently there are some really screwed up In-Reply-To fields out there.
2016-03-03 use raw header for Message-ID
Message-IDs should not be MIME encoded, but in case they are, use the raw form for compatibility with ssoma and possibly other tools. This prevents a potential problem where a malicious client could confuse our storage layer into indexing incorrect contents.
2015-11-20 various internal documentation updates
Hopefully this gives new hackers a better overview of how the components relate to each other.
2015-10-02 rename mid_compress to id_compress
We use it as a general compressor for identifiers such as subject paths, so using the "mid_" prefix probably is not appropriate.
2015-09-06 update copyright headers and email addresses
In the future, it should be possible to use this:

    git ls-files | UPDATE_COPYRIGHT_HOLDER='all contributors' \
        UPDATE_COPYRIGHT_USE_INTERVALS=2 \
        xargs /path/to/gnulib/build-aux/update-copyright
2015-08-30 mid2path: clean MID of angle brackets '<>'
We screwed up and needed to fix URL generation with '<>' in them. Regardless, users may attempt to copy and paste URLs with '<>' in them; do not punish them for that.
2015-08-27 mid: extract Message-ID from inside '<>'
This is necessary for some mailers which include comment text in the In-Reply-To header; merely assuming there is nothing outside of '<>', as we were doing, is not enough.
2015-08-25 mid: mid_compressed => mid_compress
Consistently name mid_* functions as verbs.
2015-08-17 view: always compress Message-IDs for anchors
Valid URLs do not make valid anchor ids.
2015-08-17 mid: compress Message-IDs with '%' in them
Some HTTP servers (apache2 2.2.22-13+deb7u5) on my system apparently do not handle "%25" correctly. I'm not yet sure if it's something weird with my rewrite rules or what....
2015-08-16 view: deduplicate common code for loading search results
More to come later.
2015-08-15 extract redundant Message-ID handling code
Quit repeating ourselves and use a common MID module instead.