Date | Commit message |
|
This ought to prevent cargo-culting the cache_size PRAGMA
into smaller SQLite DBs we might use.
|
|
The old name may be confused with "Content-ID" as described in
RFC 2392, so use an alternate name to avoid confusing future
readers.
|
|
This allows maintainers to easily check limits against the
contents of existing inboxes. This script covers most of
the new limits enforced by PublicInbox::Eml.
Usage is similar to most xt/*.t scripts:
GIANT_INBOX_DIR=/path/to/inbox prove -bvw xt/eml_check_limits.t
Setting `TEST_CLASS=PublicInbox::MIME' allows us to check
performance and memory use against the old subclass of
Email::MIME.
|
|
Despite several memory reductions and pure Perl performance
improvements, Inline::C spawn() still gives us a noticeable
performance boost.
More user-oriented command-line programs are likely coming,
setting PERL_INLINE_DIRECTORY is annoying to users, and so is
poor performance. So allow users to opt-in to using our
Inline::C code once by creating a `~/.cache/public-inbox/inline-c'
directory.
XDG_CACHE_HOME is respected to override the location of ~/.cache
independent of HOME, according to
https://specifications.freedesktop.org/basedir-spec/0.6/ar01s03.html
v2: use "/nonexistent" if HOME is undefined, since that's
the home of the "nobody" user on both FreeBSD and Debian.
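
A minimal Python sketch of the directory lookup described above (the
project itself is Perl; the function names are hypothetical, and
treating an empty HOME the same as an undefined one is an assumption):

```python
import os
import posixpath

def inline_c_dir(env):
    """Compute the opt-in directory path from an environment mapping."""
    # XDG_CACHE_HOME overrides ~/.cache when set and non-empty
    cache_home = env.get("XDG_CACHE_HOME")
    if not cache_home:
        # fall back to "/nonexistent" when HOME is missing (assumed: or empty)
        home = env.get("HOME") or "/nonexistent"
        cache_home = posixpath.join(home, ".cache")
    return posixpath.join(cache_home, "public-inbox", "inline-c")

def inline_c_enabled(env):
    """The user opts in simply by creating the directory."""
    return os.path.isdir(inline_c_dir(env))
```

With neither variable set, the path resolves under "/nonexistent", so
the opt-in check fails on a normal system.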
|
|
We don't have to worry about compatibility with old
installations of Email::MIME::ContentType any longer,
so save some space.
|
|
Although the lazy loading changes were correct, the code
was still using PublicInbox::MIME as a fixed class. Use
the `$cls' variable from the loop.
Favor ->subparts over ->parts, too, since ->parts is
discouraged by the Email::MIME manpage and not implemented for
Eml.
|
|
And just treat it as a non-fatal nag when checking the rest of the
codebase. Calling it "check-manifest" as a `make' target
preserves the old behavior, which causes the check to fail
if a file were added to the worktree without changing the
MANIFEST.
|
|
They're still part of our internal API at this point, but
reusing the same names as those used by postfix makes sense for
now to reduce the cognitive overhead of learning new things.
There's no "mime_parts_limit", but the name is consistent
with "mime_nesting_limit".
|
|
While our header processing is more efficient than
Email::*::Header, capping the maximum size for a `m//g' match
still limits memory growth on a header we care about.
Use the same limit as postfix (header_size_limit=102400), since
messages fetched via git/HTTP/NNTP/etc can bypass MTA limits.
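
The idea of capping input before a global match can be sketched in
Python as follows. Only the 102400 limit comes from the text above;
the function name and the Message-ID-style pattern are hypothetical:

```python
import re

# same value as postfix's header_size_limit, per the commit message
HEADER_SIZE_LIMIT = 102400

def ids_in_header(raw_value):
    """Extract <...> tokens, scanning at most HEADER_SIZE_LIMIT characters."""
    # slicing first bounds the work (and memory) of the global match
    capped = raw_value[:HEADER_SIZE_LIMIT]
    return re.findall(r"<([^<>]+)>", capped)
```

Anything past the limit is ignored rather than treated as an error.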
|
|
I'm not sure it's necessary, since "mid:" is similarly
undocumented. Also, "t:", "c:", "f:" don't offer boolean
analogues for exact matches on To/Cc/From headers, despite
having similar tokens as List-Id inside angle brackets.
|
|
This bug was also present in Email::MIME::ContentType:
commit ae081fb576d8507efca4928116ad81efa756c723 (refs/pull/pull/9/head)
in https://github.com/rjbs/Email-MIME-ContentType.git
Our fix is shorter, but dependent on 5.10+ as our codebase
relies on Perl 5.10 features, anyways.
|
|
Emails from a *nix MTA are typically LF-only, so we don't need the
complexity of the RE engine when a simple index() works. We
still need to ensure there's no "\r\n\r\n" before the first
"\n\n", but two calls to index() are still faster than a RE
match.
This gives a 2-5% speedup in some informal tests and saves ~30MB
when scanning a 30MB spam message on newer versions of Perl.
I'll have to diagnose why Perl wastes so much memory doing
RE matches on giant strings, though.
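
A rough Python analogue of the two-search approach, using bytes.find()
in place of Perl's index(). The function name and the handling of a
message with no blank line are assumptions, not the actual code:

```python
def split_header_body(raw):
    """Split a raw message (bytes) into (header, body) without a regex."""
    lf = raw.find(b"\n\n")
    crlf = raw.find(b"\r\n\r\n")
    # a CRLF blank line wins only if it comes before the first LF blank line
    if crlf >= 0 and (lf < 0 or crlf < lf):
        return raw[:crlf + 2], raw[crlf + 4:]
    if lf >= 0:
        return raw[:lf + 1], raw[lf + 2:]
    # assumed: no blank line means the whole message is header
    return raw, b""
```

The header keeps its final line terminator; the blank line itself is
dropped from both halves.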
|
|
Since Perl 5.6, the `@-' (aka @LAST_MATCH_START) and `@+' (aka
@LAST_MATCH_END) arrays provide integer offsets for every match
as documented in perlvar(1), regardless of regexp modifiers.
We can avoid relying on $1 in the epilogue scan, entirely.
So use these instead of relying on m//g and pos(), since the `g'
modifier can be affected by m//g matches performed in other
places.
Unrelated, but while we're in the area: remove some unnecessary
use of (?:...), too.
|
|
For a diff hunk starting at line N, diff_hunk() constructs the link
with "#n(N + 1)". This sends the viewer one line below the first
context line. Although this is minor and may not even be noticed,
there's not an obvious reason to increment the line number, so switch
to using the reported value as is.
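
For illustration, a small Python sketch of building the anchor from
the hunk header as reported. Only the "#n" prefix comes from the text
above; the regex and function name are hypothetical:

```python
import re

# matches unified diff hunk headers such as "@@ -10,7 +12,8 @@ context"
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,\d+)? \+(\d+)(?:,\d+)? @@")

def hunk_anchors(hunk_header):
    """Return ("#nOLD", "#nNEW") anchors, or None if this is not a hunk header."""
    m = HUNK_RE.match(hunk_header)
    if m is None:
        return None
    # use the reported line numbers as-is, with no "+ 1"
    return "#n" + m.group(1), "#n" + m.group(2)
```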
|
|
This improves Email::MIME compatibility when running
xt/cmp-msgview.t on some GPG-signed messages.
Its usefulness is dubious in the long term and this patch
may be reverted down the line.
|
|
We no longer load or use Email::MIME outside of comparison
tests.
|
|
While our codebase can still work with either MIME
implementation, add comparison tests to ensure we
handle corner cases in existing archives.
|
|
Since Email::MIME usage is going away, Email::MIME::Encodings
might as well go away, too. We can also use fewer branches
and just rely on hash lookups, unlike E::M::E.
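
The hash-lookup approach might look like this in Python, as a sketch
only: the table contents, the function name and the pass-through for
unknown encodings are assumptions, not the project's Perl code:

```python
import base64
import quopri

# one dictionary lookup replaces a chain of if/elsif branches
DECODERS = {
    "base64": base64.b64decode,
    "quoted-printable": quopri.decodestring,
}

def decode_body(cte, body):
    """Decode body (bytes) according to a Content-Transfer-Encoding value."""
    decoder = DECODERS.get(cte.strip().lower())
    # assumed: 7bit, 8bit, binary and unknown values pass through untouched
    return decoder(body) if decoder else body
```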
|
|
We want to support Perl v5.10.1 out-of-the-box with minimal
download/installation time. Installing Encode from CPAN
requires a compiler and lengthy build+install time.
So mimic find_mime_encoding() using what Perl v5.10.1 provides
out-of-the-box.
|
|
Since we're getting rid of Email::MIME, get rid of
Email::MIME::ContentType, too, since we may introduce
speedups down the line specific to our codebase.
|
|
PublicInbox::Eml has enough functionality to replace the
Email::MIME-based PublicInbox::MIME.
|
|
Since PublicInbox::Eml doesn't parse MIME subparts
up front, it can replace most uses of Email::Simple
without performance penalty.
This will eventually allow us to lower overall internal
API footprint by not having to keep the MIME vs Simple
distinction.
|
|
Email::MIME eats memory, wastes time parsing out all the
headers, and some problems can't be fixed without breaking
compatibility for other projects which depend on it.
Informal benchmarks show a ~2x improvement in general
stats gathering scripts and ~10% improvement in HTML
view rendering.
We also don't need the ability to create MIME messages, just
parse them and maybe drop an attachment.
While this isn't the zero-copy or streaming MIME parser of my
dreams, it's still an improvement in that it doesn't keep a
scalar copy of the raw body around along with subparts. It also
doesn't parse subparts up front, so it can also replace our uses
of Email::Simple.
|
|
PublicInbox::Eml will have case-sensitive memoization to
avoid the need to call `lc' to retrieve common headers,
so ensure we call $mime->header() with the common
capitalization.
Unfortunately, we need to continue using lowercase for field
names for smsg, since NNTP requires case-insensitivity when
matching headers and method dispatch is expensive.
|
|
Mailman only seems to add trailers (or signatures) as
attachments at the top-level of MIME messages. So don't bother
recursing with ->walk_parts since ->walk_parts is non-trivial to
recreate in the Email::MIME replacement I'm working on.
|
|
This doesn't make any difference for most multipart
messages (or any single part messages). However,
this starts having space savings when parts start
nesting.
It also slightly simplifies callers.
|
|
The reliance on Email::MIME->subparts is a tad inefficient with
a work-in-progress module to replace Email::MIME. So move
towards using ->each_part as a class-specific iterator which can
take advantage of more class-specific optimizations in the
yet-to-be-revealed PublicInbox::Eml and PublicInbox::Gmime
classes.
The msg_iter() sub remains for compatibility with existing
3rd-party scripts/modules which use our small public Perl API
and Email::MIME.
|
|
Encode lazy-loads encodings on an as-needed basis. This is
great for short-lived programs, but leads to fragmentation in
long-lived daemons where immortal allocations can get
interleaved with short-lived, per-request allocations.
Since we have no idea which encodings will be needed when
there's a constant flow of incoming mail, just preload
everything available at startup.
|
|
We'll support both probabilistic matches via `l:' and boolean
matches via `lid:' for exact matches, similar to how both `m:'
and `mid:' are supported. Only text inside angle brackets (`<'
and `>') is supported, since I'm not sure if there's value in
searching on the optional phrases (which would require decoding
with ->header_str instead of ->header_raw).
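
A hedged Python sketch of pulling the searchable part out of a raw
List-Id value. The function name is hypothetical, and the lowercasing
reflects RFC 2919's case-insensitive equality rather than anything
stated in this message:

```python
import re

def list_ids(header_raw):
    """Return the text inside each <...> pair, lowercased for exact matching."""
    # the optional phrase before the angle brackets is ignored entirely
    return [m.lower() for m in re.findall(r"<([^<>]+)>", header_raw)]
```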
|
|
Sometimes senders draw ASCII tables and such, which fool us
into attempting diff highlighting and diffstat anchoring.
We now require 3 consecutive diff header lines:
/^--- /, /^\Q+++\E /, and /^@@ /
to enable diff highlighting (whether generated with git or not).
The presence of a line matching /^diff / is not sufficient or
even useful to us for highlighting diffs, since that could just
be part of a line-wrapped sentence.
However, we'll now check for the presence of a line matching
/^diff --git / before enabling diffstat anchors. Otherwise
cover letters for a patch series may fool us into creating
anchors for diffstats.
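
The two checks above can be sketched in Python like this. The function
name and return shape are hypothetical, and how the two flags are
combined in the real code is not shown here:

```python
def scan_diff_markers(lines):
    """Return (has_diff, has_git_diff) for a list of message body lines."""
    has_diff = False
    has_git_diff = False
    for i, line in enumerate(lines):
        if line.startswith("diff --git "):
            has_git_diff = True
        # require "--- ", "+++ " and "@@ " on three consecutive lines
        if (line.startswith("--- ")
                and i + 2 < len(lines)
                and lines[i + 1].startswith("+++ ")
                and lines[i + 2].startswith("@@ ")):
            has_diff = True
    return has_diff, has_git_diff
```

An ASCII table that happens to start a line with "--- " no longer
triggers highlighting, since the next two lines will not match.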
|
|
For non-malicious messages, we can assume the diffstat and actual
diff appear in the same order. Thus we can store {-long_paths} as
an arrayref and only compare the first element when we encounter
a truncated path.
This should make HTML rendering stable when there are basename
conflicts in messages such as
https://lore.kernel.org/backports/1393202754-12919-13-git-send-email-hauke@hauke-m.de/
This diffstat anchor linkification can still be defeated by
users who make actual path names beginning with "...", but we
won't waste CPU cycles on it, either.
|
|
This will help us track down bugs in our own code when
it comes to missing error checking.
|
|
glob() sorts alphabetically by default, which doesn't have
a useful meaning with many articles. Stop wasting CPU cycles
and memory.
|
|
Perl 5.10.1 would warn about implicit assignment to @_ by
split(). So favor the documented method of using `tr'
to count lines.
Fixes: b5ddcb3352ef31ae ("index: support --compact / -c on command-line")
|
|
Current versions of Perl don't warn when vec() is given `undef'
as its first arg, but Perl 5.10.1 does, at least.
Fixes: c7b4cbdadf3116a0 ("t/httpd-corner: improve reliability and diagnostics")
|
|
We don't call any Email::MIME or any PublicInbox::MIME-specific
functions in here.
|
|
It's likely we'll replace Email::Simple using our Email::MIME
alternative/replacement, as well. So reduce the API surface we
interact with and make it easier to swap implementations.
|
|
Prefer the "ID" capitalization since it seems to be the
preferred capitalization in RFC 5322.
In theory, this allows the interpreter to deduplicate the string
internally (I haven't checked if it does).
Unfortunately, there are too many instances of "Message-Id" in the
tests to be worth changing at this point.
|
|
While testing performance improvements elsewhere, I noticed some
micro-optimizations could give a small ~2-3% speedup in my test
using the git async API to parse a large inbox.
The `read' perlfunc already has read-in-full behavior (unless
git is killed unexpectedly), so there's no point in using a
loop. SearchIdxShard in the parallel v2 indexing code path
never looped on `read', either.
Furthermore, we can avoid method dispatch overhead on ->getline
and ->print by using `readline' and `print' as ops which can be
resolved during the Perl compilation phase.
Finally, avoid passing the IO handle around as a parameter,
since avoiding hash lookups with a local variable has its own
costs in stack and refcount bumping.
Best of all, there's less code :>
|
|
Since some client tools exist for dealing with public-inbox
specifically, it seems like a good idea to list some of them.
Cc: Danh Doan <congdanhqx@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
Cc: Leah Neukirchen <leah@vuxu.org>
|
|
We don't need the callback arg, anymore.
|
|
mime_from_path is designed to fail gracefully in busy Maildirs
whereas mime_load was made for loading files from a work tree.
|
|
Replace them with .eml files generated with the help of
Email::MIME, but without some extraneous and unnecessary
headers, and strip mime_load down to just loading files.
This will give us more freedom to experiment with other mail
libraries which may be more correct, better maintained, use
less memory and/or be faster than Email::MIME.
|
|
We'll use this to create, memoize, and reuse .eml files. This
will be used to reduce (and eventually eliminate) our dependency
on Email::MIME in tests.
|
|
We don't need to be checking inbox versions in parts of the WWW
code. Checking the presence of $ibx->over is enough, everywhere.
|
|
As an established project (:P), it's important to document when
new features appear in manpages. Users may be reading new
documentation online which doesn't reflect an older version they
have installed.
|
|
RFC 2919 section 6 states the following:
There is only one operation defined for list identifiers,
that of case insensitive equality.
So no arguing with that. Now, the other headers are
open to interpretation, so put a note about them.
|
|
Some headers may appear more than once in a message, so it's
probably best to ensure we attempt matches on all of them.
This ought to allow matching on Received: or similar because a
list lacks List-IDs :P
|