Date | Commit message |
|
While MTAs seem to stop '\0' from appearing in headers, users
fetching archives via git remain susceptible to having '\0' land
in archives. So we'll filter them out of Xapian and SQLite DBs
to avoid interoperability problems with CLI tools since there are no
known messages in lore or any of my archives which feature them.
Avoiding '\0' will ensure all indexed Message-IDs and List-Ids
can be specified from the command-line (although some characters
will still require $(printf) contortions).
As with Message-ID, List-Id fields with /\n\t\r/ characters will
also be stripped for indexing. I will assume whatever went wrong
with the References: header in
<https://public-inbox.org/git/656C30A1EFC89F6B2082D9B6@localhost/raw>
could also happen to the List-Id header.
This is inspired by commit aca47e05a6026c12c768753c87e6ff769ef6bee4
(Import: Don't copy nulls from emails into git, 2018-07-07)
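
A minimal sketch of the filtering described above, written in Python purely
for illustration (the project itself is Perl). The helper name
`clean_id_for_index` and the exact character set are assumptions, not the
project's actual code.

```python
# Hypothetical helper: delete NUL, tab, LF and CR from an ID before indexing,
# so the stored value can always be typed as a command-line argument.
_STRIP = str.maketrans("", "", "\0\t\n\r")


def clean_id_for_index(value: str) -> str:
    """Return value with every NUL, tab, LF and CR removed."""
    return value.translate(_STRIP)


if __name__ == "__main__":
    # A List-Id that picked up a NUL and a folded line on the way in.
    print(clean_id_for_index("list\0.example\r\n.org"))
```

Deleting rather than replacing keeps the ID comparable with copies of the
same header that were never damaged.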
|
|
Our use of MID_ESC characters was only intended for the pathname
component of URIs and not appropriate for the query string
component. So use a different $unsafe parameter list for
uri_escape to make the result appropriate for query strings by
disallowing [\&\'\+=] characters. Most notably, this change
also allows us to accept `/' (slash) unescaped to make dfn: queries
nicer to look at.
Finally, we'll also add an ascii_html call on the URI-escaped
result as an extra safety measure even though it's not really
needed.
As far as I can tell, the code without this fix didn't result in
an HTML injection since all our uses of uri_escape did escape
angle brackets.
Reported-by: Ricardo Cañuelo <ricardo.canuelo@collabora.com>
Link: https://public-inbox.org/meta/87o7ff4nlk.fsf@collabora.com/
Tested-by: Ricardo Cañuelo <ricardo.canuelo@collabora.com>
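
The idea can be sketched as follows, in Python for illustration only. The
function name `query_href` and the set of characters left unescaped are
assumptions; the real code passes its own $unsafe list to Perl's uri_escape.

```python
import html
from urllib.parse import quote

# Assumed set of characters left literal in a query string value.
# '&', "'", '+' and '=' are deliberately absent, so they get %-escaped.
QUERY_SAFE = "/:"


def query_href(q: str) -> str:
    """Escape q for a query string, then HTML-escape the result."""
    escaped = quote(q, safe=QUERY_SAFE)
    # Extra safety measure: a no-op whenever the URI escaping already
    # removed every HTML-special character.
    return html.escape(escaped, quote=True)


if __name__ == "__main__":
    print(query_href("dfn:lib/PublicInbox/MID.pm"))
    print(query_href("a&b=c+d'e"))
```

With this safe set a dfn: query keeps its slashes, while characters that
would change the meaning of the query string are always escaped.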
|
|
It's possible in theory that Perl could be smarter and free
memory a tad sooner this way. Regardless, fewer lines of code
are easier to navigate and read, and can save optree size and
reduce parsing times.
|
|
On my x86-64 machine, OpenSSL SHA-256 is nearly twice as fast as
the Digest::SHA implementation from Perl, most likely due to an
optimized assembly implementation. SHA-1 is a few percent
faster, too.
|
|
Using "make update-copyrights" after setting GNULIB_PATH in my
config.mak
|
|
We'll be using it for Resent-Message-ID with lei, and possibly
other places.
|
|
As shown recently in commit a05445fb400108e60ede7d377cf3b26a0392eb24
("config: config_fh_parse: micro-optimize"), relying on
the return value of `push' and defined-or operators can avoid
modifying the hash value scalar with an increment.
|
|
|
|
It's only used for HTML anchors which we will need indefinitely.
|
|
We can rely on the newer mids() sub directly and use faster
numeric comparisons for Msgmap unindexing in v1.
|
|
Prefer the "ID" capitalization since it seems to be the
preferred capitalization in RFC 5322.
In theory, this allows the interpreter to deduplicate the string
internally (I haven't checked if it does).
Unfortunately, there are too many instances of "Message-Id" in the
tests to be worth changing at this point.
|
|
This allows us to consistently enforce the same Message-ID
extraction rules everywhere and makes it easier for us to
make changes in the future.
Update scripts/ssoma-replay, as well, but don't rely on
PublicInbox::* modules in that since it's legacy and
public-inbox was never a dependency of ssoma.
|
|
I didn't wait until September to do it, this year!
|
|
We won't be able to use List::Util::uniq here, but we can still
shorten our logic and make it more consistent with the rest of
our code which does similar things.
|
|
Since we replace extra Message-ID headers with X-Alt-Message-ID
to placate NNTP clients, we should allow searching and indexing
on X-Alt-Message-ID just like we do with Message-ID.
|
|
|
|
Its result is used for HTML anchors and such.
|
|
Looking at git@vger history, several emails had broken
References and In-Reply-To headers containing <y>, <n> and
email addresses in place of Message-IDs.
This was causing too many unrelated messages to be linked
together in the same thread.
|
|
For Subject/To/Cc/From headers, we squeeze '\t', '\n', '\r' to a space (' ').
For Message-IDs (including References/In-Reply-To), '\t', '\n', '\r'
are deleted since some MUAs might screw them up:
https://public-inbox.org/git/656C30A1EFC89F6B2082D9B6@localhost/raw
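
The two treatments can be contrasted with a short sketch, in Python for
illustration only. The names `squeeze_ws` and `delete_ws` are hypothetical,
and treating a whole run of tab/CR/LF as one space is an assumption about
what "squeeze" means here.

```python
import re


def squeeze_ws(text: str) -> str:
    """For Subject/To/Cc/From: turn each run of tab/LF/CR into one space."""
    return re.sub(r"[\t\n\r]+", " ", text)


def delete_ws(mid: str) -> str:
    """For Message-IDs: drop tab/LF/CR entirely, leaving no replacement."""
    return re.sub(r"[\t\n\r]", "", mid)


if __name__ == "__main__":
    print(squeeze_ws("Re:\r\n\tsome patch"))
    print(delete_ws("<656C30A1\r\n\tEFC89F6B@localhost>"))
```

A space is harmless inside free-form text, but an ID has to match byte for
byte, so for IDs the characters are removed rather than replaced.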
|
|
We need to stop ghost messages from generating longer
Message-IDs than Xapian can handle with terms.
|
|
This allows us to be more consistent in dealing with completely
empty Message-Ids.
|
|
Since we support duplicate MIDs in v2, we can safely truncate
long MID terms in the database and let other normal duplicate
resolution sort it out. It seems only spammers use excessively
long MIDs, and there'll always be abuse/misuse vectors for causing
mis-threaded messages, so it's not worth worrying about
excessively long MIDs.
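
A rough sketch of the truncation, in Python for illustration only. The
244-byte limit and the name `truncate_mid` are assumptions made for the
example; the commit message does not state the actual limit.

```python
MAX_TERM_BYTES = 244  # assumed limit, chosen only for this example


def truncate_mid(mid: str, limit: int = MAX_TERM_BYTES) -> str:
    """Cut mid down to at most limit bytes of UTF-8."""
    raw = mid.encode("utf-8")
    if len(raw) <= limit:
        return mid
    # errors="ignore" drops a multi-byte character split by the cut.
    return raw[:limit].decode("utf-8", errors="ignore")


if __name__ == "__main__":
    a = truncate_mid("x" * 300 + "-first@example.com")
    b = truncate_mid("x" * 300 + "-second@example.com")
    # Both collapse to the same term; duplicate-MID handling then applies.
    print(a == b, len(a))
```

Two distinct long IDs can collide after truncation, which is why this
relies on the duplicate resolution mentioned above.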
|
|
Traditionally we've been more lax on parsing Message-Id
and allow it without the angle brackets. We've always been
strict on References and can't have it be pointlessly
large when some MUA decides to use HTML-escaped angle
brackets ("&lt;", "&gt;").
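
The difference between the lax and strict rules can be sketched like this,
in Python for illustration only. The function names and the regular
expression are assumptions and not the project's actual parser.

```python
import re

# An ID is whatever sits between a literal '<' and the next literal '>'.
_BRACKETED = re.compile(r"<([^<>]+)>")


def mids_strict(header: str) -> list:
    """References-style parsing: only accept IDs inside real angle brackets."""
    return _BRACKETED.findall(header)


def mid_lax(header: str) -> str:
    """Message-Id-style parsing: fall back to the bare value without brackets."""
    found = _BRACKETED.findall(header)
    return found[0] if found else header.strip()


if __name__ == "__main__":
    # The HTML-escaped entry is ignored; only the real one is returned.
    print(mids_strict("&lt;a@example.com&gt; <b@example.com>"))
    print(mid_lax("bare@example.com"))
```

The strict form also skips any comment text outside the brackets, so a
header such as `Your message of Monday <x@example.com>` yields only the ID.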
|
|
It's shorter and more convenient, here.
|
|
We'll be using a more consistent API for extracting Message-IDs
from various headers.
|
|
Using update-copyrights from gnulib
While we're at it, use the SPDX identifier for AGPL-3.0+ to
ease mechanical processing.
|
|
Reported-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
https://public-inbox.org/meta/CACBZZX5Gnow08r=0A1J_kt3a=zpGyMfvsqu8nAN7kacNnDm+dg@mail.gmail.com/
|
|
Based on reading RFC 3986, it seems '@', ':', '!', '$', '&',
"'", '(', ')', '*', '+', ',', ';', '=' are all allowed
in path-absolute where we have the Message-ID.
In any case, it seems '@' is fairly common in path components
nowadays and too common in Message-IDs.
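
A small sketch of escaping a Message-ID for use as a path component while
leaving those characters alone, in Python for illustration only. The name
`mid_to_path` is hypothetical; the safe set is simply the list above.

```python
from urllib.parse import quote

# Characters from the list above, left unescaped in a path component.
# '/' is not included, so a slash inside a Message-ID is still escaped.
PATH_SAFE = "@:!$&'()*+,;="


def mid_to_path(mid: str) -> str:
    """Percent-escape mid for a URL path, keeping '@' and friends readable."""
    return quote(mid, safe=PATH_SAFE)


if __name__ == "__main__":
    print(mid_to_path("87o7ff4nlk.fsf@collabora.com"))
    print(mid_to_path("a/b%c@example.com"))
```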
|
|
Apparently there are some really screwed up In-Reply-To
fields out there.
|
|
Message-IDs should not be MIME encoded, but in case they are,
use the raw form for compatibility with ssoma and possibly
other tools. This prevents a potential problem where a
malicious client could confuse our storage layer into indexing
incorrect contents.
|
|
Hopefully this gives new hackers a better overview of
how the components relate to each other.
|
|
We use it as a general compressor for identifiers such as
subject paths, so using the "mid_" prefix probably is not
appropriate.
|
|
In the future, it should be possible to use this:
git ls-files | UPDATE_COPYRIGHT_HOLDER='all contributors' \
UPDATE_COPYRIGHT_USE_INTERVALS=2 \
xargs /path/to/gnulib/build-aux/update-copyright
|
|
We screwed up and needed to fix the generation of URLs with '<>'
in them. Regardless, users may attempt to copy and paste
URLs with '<>' in them; do not punish them for that.
|
|
This is necessary for some mailers which include comment text
in the In-Reply-To header; merely assuming there is nothing
outside of '<>' as we were doing is not enough.
|
|
Consistently name mid_* functions as verbs.
|
|
Valid URLs do not make valid anchor ids.
|
|
Some HTTP servers (apache2 2.2.22-13+deb7u5) on my system
apparently do not handle "%25" correctly. I'm not yet sure if
it's something weird with my rewrite rules or what....
|
|
More to come later.
|
|
Quit repeating ourselves and use a common MID module
instead.
|