path: root/lib
Date Commit message
2020-05-09 smsg: use capitalization for header retrieval
PublicInbox::Eml will have case-sensitive memoization to avoid the need to call `lc' to retrieve common headers, so ensure we call $mime->header() with the common capitalization. Unfortunately, we need to continue using lowercase for field names for smsg, since NNTP requires case-insensitivity when matching headers and method dispatch is expensive.
2020-05-09 filter/rubylang: avoid recursing subparts to strip trailers
Mailman only seems to add trailers (or signatures) as attachments at the top-level of MIME messages. So don't bother recursing with ->walk_parts since ->walk_parts is non-trivial to recreate in the Email::MIME replacement I'm working on.
2020-05-09 msg_iter: pass $idx as a scalar, not array
This doesn't make any difference for most multipart messages (or any single part messages). However, this starts having space savings when parts start nesting. It also slightly simplifies callers.
2020-05-09 msg_iter: make ->each_part method for PublicInbox::MIME
The reliance on Email::MIME->subparts is a tad inefficient with a work-in-progress module to replace Email::MIME. So move towards using ->each_part as a class-specific iterator which can take advantage of more class-specific optimizations in the yet-to-be-revealed PublicInbox::Eml and PublicInbox::Gmime classes. The msg_iter() sub remains for compatibility with existing 3rd-party scripts/modules which use our small public Perl API and Email::MIME.
2020-05-09 www: preload: load all encodings at startup
Encode lazy-loads encodings on an as-needed basis. This is great for short-lived programs, but leads to fragmentation in long-lived daemons where immortal allocations can get interleaved with short-lived, per-request allocations. Since we have no idea which encodings will be needed when there's a constant flow of incoming mail, just preload everything available at startup.
2020-05-09 search: support searching on List-Id
We'll support both probabilistic matches via `l:' and boolean matches via `lid:' for exact matches, similar to how both `m:' and `mid:' are supported. Only text inside angle brackets (`<' and `>') is supported, since I'm not sure if there's value in searching on the optional phrases (which would require decoding with ->header_str instead of ->header_raw).
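A minimal sketch of pulling the searchable part out of a raw List-Id value, as described above. The helper name `list_id_terms` and the sample header value are hypothetical; this is not the indexing code itself.

```perl
use strict;
use warnings;

# Hypothetical helper: return every string found inside <...> in a raw
# List-Id header value, ignoring any optional phrase in front of it.
sub list_id_terms {
	my ($raw) = @_;
	return ($raw =~ /<([^<>]+)>/g);
}

my @terms = list_id_terms('Example list <test.example.com>');
print "$_\n" for @terms;
```

The optional phrase ("Example list" here) is dropped, so no header decoding is needed for this part.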
2020-05-07 viewdiff: stricter highlighting and linkification check
Sometimes senders draw ASCII tables and such, which fool us into attempting highlighting and diffstat anchoring. We now require 3 consecutive diff header lines: /^--- /, /^\Q+++\E /, and /^@@ / to enable diff highlighting (whether generated with git or not). The presence of a line matching /^diff / is not sufficient or even useful to us for highlighting diffs, since that could just be part of a line-wrapped sentence. However, we'll now check for the presence of a line matching /^diff --git / before enabling diffstat anchors. Otherwise cover letters for a patch series may fool us into creating anchors for diffstats.
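The two checks described above could be sketched roughly as follows. This is an illustration built from the three patterns named in the message, not the actual viewdiff code; the sub names `has_diff_header` and `has_git_diff` are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $text contains three consecutive lines
# starting with "--- ", "+++ " and "@@ ", in that order.
sub has_diff_header {
	my ($text) = @_;
	return $text =~ /^--- [^\n]*\n\Q+++\E [^\n]*\n@@ /m ? 1 : 0;
}

# Hypothetical helper: diffstat anchors additionally require a line
# starting with "diff --git ", not merely "diff ".
sub has_git_diff {
	my ($text) = @_;
	return $text =~ /^diff --git /m ? 1 : 0;
}
```

An ASCII table that happens to start a line with "--- " fails the first check because the next two lines do not match.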
2020-05-07 viewdiff: assume diffstat and diff order are identical
For non-malicious messages, we can assume the diffstat and actual diff appear in the same order. Thus we can store {-long_paths} as an arrayref and only compare the first element when we encounter a truncated path. This should make HTML rendering stable when there are basename conflicts in a message such as:
    https://lore.kernel.org/backports/1393202754-12919-13-git-send-email-hauke@hauke-m.de/
This diffstat anchor linkification can still be defeated by users who make actual path names beginning with "...", but we won't waste CPU cycles on it, either.
2020-05-06 git: warn on ->cat_async callback errors
This will help us track down bugs in our own code when it comes to missing error checking.
2020-05-01 feed: remove PublicInbox::MIME module load
We don't call any Email::MIME or any PublicInbox::MIME-specific functions in here.
2020-04-30 mid: capitalize "ID" in "Message-ID"
Prefer the "ID" capitalization since it seems to be the preferred capitalization in RFC 5322. In theory, this allows the interpreter to deduplicate the string internally (I haven't checked if it does). Unfortunately, there are too many instances of "Message-Id" in the tests to be worth changing at this point.
2020-04-29 git: various minor speedups
While testing performance improvements elsewhere, I noticed some micro-optimizations could give a small ~2-3% speedup in my test using the git async API to parse a large inbox. The `read' perlfunc already has read-in-full behavior (unless git is killed unexpectedly), so there's no point in using a loop. SearchIdxShard in the parallel v2 indexing code path never looped on `read', either. Furthermore, we can avoid method dispatch overhead on ->getline and ->print by using `readline' and `print' as ops which can be resolved during the Perl compilation phase. Finally, avoid passing the IO handle around as a parameter, since avoiding hash lookups with a local variable has its own costs in stack and refcount bumping. Best of all, there's less code :>
2020-04-26 testcommon: mime_load: drop extra $cb arg
We don't need the callback arg anymore.
2020-04-26 tests: remove Email::MIME->create use entirely
Replace them with .eml files generated with the help of Email::MIME, but without some extraneous and unnecessary headers, and strip mime_load down to just loading files. This will give us more freedom to experiment with other mail libraries which may be more correct, better maintained, use less memory and/or be faster than Email::MIME.
2020-04-26 testcommon: introduce mime_load sub
We'll use this to create, memoize, and reuse .eml files. This will be used to reduce (and eventually eliminate) our dependency on Email::MIME in tests.
2020-04-25 feed: drop needless version check
We don't need to be checking inbox versions in parts of the WWW code. Checking the presence of $ibx->over is enough, everywhere.
2020-04-25 watchmaildir: match List-ID case-insensitively
RFC 2919 section 6 states the following:
    There is only one operation defined for list identifiers, that of case insensitive equality.
So no arguing with that. Now, the other headers are open to interpretation, so put a note about them.
2020-04-25 watchmaildir: scan all matching headers
Some headers may appear more than once in a message, so it's probably best to ensure we attempt matches on all of them. This ought to allow matching on Received: or similar because a list lacks List-IDs :P
2020-04-22 make zlib-related modules a hard dependency
This allows us to simplify some of our existing code and make future changes easier. I doubt anybody goes through the trouble to have a Perl installation without zlib support. The zlib source code is even bundled with Perl since 5.9.3 for systems without existing zlib development headers and libraries. Of course, zlib is also a requirement of git, too; and we're not going to stop using git :) [squashed: "wwwaltid: use gzipfilter up front"]
2020-04-22 view: actually omit subject text when dumping topics
Despite dump_topics() calling dedupe_subject() on the subject, the index shows partly duplicated subjects, for example:
    ` [PATCH 2/2] t/www_listing: avoid 'once' warnings
    ` [PATCH v2] t/www_listing: avoid 'once' warnings "
In the second line, the omission character " is appended, but the entire subject is shown. To display the subject with duplicated parts omitted, regenerate it from the array that is modified by dedupe_subject().
2020-04-22 view: strip omission character from current message in thread view
In the thread view shown at the top of a message, the subject for the current message is dropped, leaving just the sender's name. However, if skel_dump() omitted part of the subject because it was duplicated, the omission character is still displayed:
    * [PATCH v2] t/www_listing: avoid 'once' warnings
      2020-03-21 1:10 ` [PATCH 2/2] t/www_listing: avoid 'once' warnings Eric Wong
    @ 2020-03-21 5:24 ` " Eric Wong
Note the " on the last line. Adjust the regular expression in _th_index_lite() to account for the omission character.
[ew: avoid capturing $1, keep under 80 cols]
2020-04-21 index: support --max-size / publicinbox.indexMaxSize
In normal mail paths, we can rely on MTAs being configured with reasonable limits in the -watch and -mda mail injection paths. However, since the MTA is bypassed in a git-only delivery path, a BOFH could inject a large message and DoS users attempting to mirror a public-inbox. This doesn't protect unindexed WWW interfaces from Email::MIME memory explosions on v1 inboxes. Probably nobody cares about unindexed WWW interfaces anymore, especially now that Xapian is optional for indexing.
2020-04-21 qspawn: remove Perl 5.16.x leak workaround
It seems no longer necessary to work around this Perl 5.16.3 bug after the removal of anonymous subs from all of our internal code in https://public-inbox.org/meta/20191225075104.22184-1-e@80x24.org/
Tested with repeated clones (both aborted and completed) in a CentOS 7.x VM which was once able to reproduce leaks before the workaround appeared in 2fc42236f72ad16a ("qspawn: workaround Perl 5.16.3 leak, re-enable Deflater")
Cc: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
2020-04-20 v2writable: drop SQLite-based multi_mid_q_new
We switched to the SDBM-based queue to store author/committer info last month.
Fixes: c7acdfe78bda5bf3 ("v2: SDBM-based multi Message-ID queue")
2020-04-20 drop needless `eval {}' around Config->new
It hasn't been needed since commit 089cca37fa036411 ("config: ignore missing config files"). And we actually want to propagate errors when we can't start new processes or if git(1) is missing.
2020-04-20 testcommon: spawn-aware system() and qx[] workalikes
Barely noticeable on Linux, but this gives a 1-2% speedup on a FreeBSD 11.3 VM and lets us use built-in redirects rather than relying on /bin/sh.
2020-04-20 import: init_bare: use pure Perl
Even on systems with Inline::C spawn(), this cuts a primed "make check-run" time by 2-3% on Linux, and roughly 5-7% on FreeBSD when using vfork-enabled spawn. I doubt anybody cares: this omits the sample hooks and some empty and useless-for-us or obsolete directories created by git-init(1).
2020-04-20 import: init_bare: allow use as method, use in tests
Allowing ->init_bare to be used as a method saves some keystrokes, and we can save a little bit of time on systems with our vfork(2)-enabled spawn(). This also sets us up for future improvements where we can avoid spawning a process at all.
2020-04-20 watchmaildir: support multiple watchheader values
The watchheader key supports only a single value. Supporting multiple watchheader values was mentioned in discussion [1] of 8d3e3bd8 (doc: explain publicinbox.<name>.watchheader, 2019-10-09), and it wasn't clear if there was a need.
One scenario in which matching multiple headers would be convenient is when someone wants to set up public-inbox archives for some small projects but does _not_ want to run mailing lists for them, instead allowing others to follow the project by any of the pull mechanisms. Using a common underlying address, an address alias for each project is configured via a third-party email provider, with messages for each alias being exposed as a separate public-inbox archive. In this setup, messages for an inbox cannot be selected by a List-ID header but can be identified by the inbox's address in either the To or Cc header.
To support such a use case, update the watchheader handling to consider multiple values, accepting a message if it matches any value. While selecting a message based on matching _any_ rather than _all_ values is motivated by the above scenario, it's worth noting that the "any" behavior is consistent with how multiple listid config values are handled.
[1] https://public-inbox.org/meta/20191010085118.r3amey4cayazfycb@dcvr/
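The "accept if any value matches" behavior can be sketched as below. The data layout (a hashref of lowercased header names and a list of name/regexp pairs) and the sub name `watch_match` are assumptions made for illustration; they are not taken from the watchmaildir code.

```perl
use strict;
use warnings;

# $headers: hashref of lowercase header name => arrayref of raw values
# $watch:   arrayref of [ header name, regexp ] pairs (hypothetical layout)
# Returns 1 if any watched pattern matches any instance of its header,
# 0 otherwise.
sub watch_match {
	my ($headers, $watch) = @_;
	for my $w (@$watch) {
		my ($name, $re) = @$w;
		for my $val (@{ $headers->{lc($name)} || [] }) {
			return 1 if $val =~ $re;
		}
	}
	return 0;
}

my $headers = { to => ['proj-a@example.com'], cc => ['other@example.org'] };
my $watch = [
	[ 'To', qr/\bproj-a\@example\.com\b/i ],
	[ 'Cc', qr/\bproj-a\@example\.com\b/i ],
];
print watch_match($headers, $watch) ? "accept\n" : "skip\n";
```

Because every instance of a header is checked, this shape also handles headers that appear more than once in a message.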
2020-04-19 reduce scope of mbox From_ line removal
It's unnecessary overhead for anything which does Email::MIME parsing. It was never done for v2 indexing, even though v1->v2 conversions did NOT remove those From_ lines. There was never a need to remove From_ lines in the v1 SearchIdx paths, either. Hitting a /$INBOX_URL/$MSGID/T/ endpoint with an 18 message thread reveals a ~0.5% speed improvement. This will become more apparent when we have a faster MIME parser.
2020-04-19 mbox: use per-message line-ending for From_ line
Email::Simple preserves the message line ending in headers, so make the From_ line consistent with the rest of the headers.
2020-04-19 wwwatomstream: move {emit_header} field to $self
There's no need to pollute the cross-package $ctx with it.
2020-04-19 inbox: replace `eval {}' with `do {}' where appropriate
-Git->new and -Limiter->new will never fail unless there's an OOM, so using `eval' is incorrect.
2020-04-19 inbox: don't memoize missing description|cloneurl
It's probably common to have inboxes initially set up without these files properly configured, so don't memoize at that stage.
2020-04-19 searchidx: die on cat-file failures
We always use the object ID from "git <log|rev-list>" for retrieving blobs, so fail loudly if the git repository is corrupt instead of silently continuing.
2020-04-19 inboxwritable: mime_from_path: reuse in more places
There's nothing Maildir-specific about the function, so `maildir_path_load' was a bad name. So give it a more appropriate name and use it in our tests. This saves us some code and inconsistency by reusing an existing internal library routine in more places. We can drop the "From_" line in some of our (formerly) mbox sample files.
2020-04-17 searchthread: reduce indirection by removing container
We can rid ourselves of a layer of indirection by subclassing PublicInbox::Smsg instead of using a container object to hold each $smsg. Furthermore, the `{id}' vs. `{mid}' field name confusion is eliminated. This reduces the size of the $rootset passed to walk_thread by around 15%, that is over 50K memory when rendering a /$INBOX/ landing page.
2020-04-15 dskqxs: ignore EV_SET errors on EVFILT_WRITE
Just like the EPOLL_CTL_ADD emulation path, the EPOLL_CTL_MOD and EPOLL_CTL_DEL emulation paths can fail if attempting to install an EVFILT_WRITE for a read-only pipe. I've only observed this on the EPOLL_CTL_DEL emulation path, but I suspect it could happen on the EPOLL_CTL_MOD path as well. Increasing the number of read-only pipes we rely on with altid exports via sqlite3 made this old bug more apparent and reproducible while looping the test suite. This may be adjusted in the future to deal with write-only pipes, but we currently don't have any of those watched by kqueue.
2020-04-15 testcommon: DESTROY: wait for killed daemon
Otherwise, the waitpid(-1, 0) call in Xapcmd::process_queue() may reap it in a subsequent test when using t/run.perl to reuse processes for testing. While we're at it, make Xapcmd::process_queue warn about unknown PIDs in case other PIDs leak through to us in the future.
2020-04-09 triewyde: ficks soem speling errrors
Dikshunarees R gude!
2020-04-09 tests: document run_mode=1 as not implemented
It was implemented at some point, but it was more things to support and the worst of both worlds: both unrealistic compared to real-world use and slower than run_mode=2. Noticed while looking for speling erorrs.
2020-04-07 view: do not redundantly obfuscate addresses
We shouldn't rerun the address obfuscator on data we've already run through. Instead, run through the unescaped text part and perform the UTF-8 "\x{2022}" substitution before it hits HTML escaping.
Fixes: 9bdd81dc16ba6511 ("view: msg_iter calls add_body_text directly")
2020-04-07 portability: constants for NetBSD
NetBSD implements O_CLOEXEC, so let us use it to avoid inadvertent FD sharing. It also has the same value for SIGWINCH as Linux and the other BSDs we support.
2020-04-05 git: reduce stat buffer storage overhead
The stat() array is a whopping 480 bytes (on x86-64, Perl 5.28), while the new packed representation of two 64-bit doubles as a scalar is "only" 56 bytes. This can add up when there's many inboxes. Just use a string comparison on the packed representation. Some 32-bit Perl builds (IIRC OpenBSD) lack quad support, so doubles were chosen for pack() portability.
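A rough sketch of the packed representation follows. The message does not say which two stat fields are stored, so mtime and size are picked here purely for illustration, and the helper name `stat_key` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: summarize a file's state as one short string.
# Returns undef if the file cannot be stat()-ed.
sub stat_key {
	my ($path) = @_;
	my @st = stat($path) or return undef;
	# 'd' is a native double, so no 64-bit integer (quad) support is needed
	return pack('dd', $st[9], $st[7]); # mtime, size (illustrative choice)
}

# Change detection is then a plain string comparison:
#   my $old = stat_key($path);
#   ... later ...
#   my $new = stat_key($path);
#   reload() if !defined($new) || $new ne $old; # reload() is hypothetical
```

Keeping one packed scalar per file instead of the full stat() list is what makes the saving add up across many inboxes.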
2020-04-05 mbox: halve ->getline "context switches"
We don't need to take extra trips through the event loop for a single message (in the common case of Message-IDs being unique). In fact, holding the body reference left behind by Email::Simple could be harmful to memory usage, though in practice it's not a big problem since code paths which use Email::MIME take far more.
2020-04-05 release large (non ref) scalars using `undef $sv'
Using `undef EXPR' like a function call actually frees the heap memory associated with the scalar, whereas `$sv = undef' or `$sv = ""' will hold the buffer around until $sv goes out of scope. The `sv_set_undef' documentation in the perlapi(1) manpage explicitly states this:
    The perl equivalent is "$sv = undef;". Note that it doesn't free any string buffer, unlike "undef $sv".
And I've confirmed by reading Dump() output from Devel::Peek. We'll also inline the old index_body sub in SearchIdx.pm to make the scope of the scalar more obvious. This change saves several hundred kB RSS on both -index and -httpd when hitting large emails with thousands of lines.
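As a usage sketch of the idiom described above: the helper below is hypothetical and unrelated to the SearchIdx.pm code; it only shows where `undef $sv' would be placed once a large buffer is no longer needed.

```perl
use strict;
use warnings;

# Hypothetical helper: count lines in a large in-memory message, then
# release the caller's buffer before any further work is done.
sub count_lines_and_release {
	my ($bufref) = @_;
	my $n = () = $$bufref =~ /\n/g; # number of newline matches
	undef $$bufref; # releases the string buffer (see perlapi quote above)
	# `$$bufref = undef' would only mark it undefined and keep the buffer
	return $n;
}

my $msg = "line 1\nline 2\nline 3\n";
my $lines = count_lines_and_release(\$msg);
print "lines=$lines\n";
```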
2020-04-05 wwwstatic: set "Vary: Accept-Encoding" in static gzip response
We don't want to confuse intermediate caches into serving gzipped content to any clients which can't handle it. It probably doesn't matter in practice, though, since every HTTP client seems to handle "Content-Encoding: gzip" regardless of whether it was requested or not, though I could expect some nc/socat/telnet/s_client users being annoyed. This also matches the behavior of Plack::Middleware::Deflater and other deflater implementations.
2020-04-04 view: inline flush_quote sub
No point in having an extra sub for a short, commonly called function in the same file.
2020-04-04 viewdiff: reduce sub parameter count
We're slowly moving towards doing all of our output buffering into a single buffer, so passing that around on the stack as a dedicated parameter is confusing.
2020-04-04 view: dedupe_subject: allow "0" as a valid Subject
While rare in practice (even by spammers), a single "0" could theoretically be the entire contents of a Subject line. So use the Perl 5.10+ defined-or operator to improve correctness of subject deduplication.
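The difference between `||' and the defined-or operator `//' for a "0" subject can be shown with a small sketch. Both subs and the '(no subject)' placeholder are hypothetical stand-ins, not the dedupe_subject() code.

```perl
use strict;
use warnings;

# Hypothetical stand-in: `||' treats the string "0" as false.
sub subject_or_default_buggy {
	my ($subj) = @_;
	return $subj || '(no subject)'; # wrongly replaces "0"
}

# Hypothetical stand-in: `//' only falls back when $subj is undef.
sub subject_or_default {
	my ($subj) = @_;
	return $subj // '(no subject)';
}

print subject_or_default_buggy('0'), "\n";
print subject_or_default('0'), "\n";
```

Note that `//' also keeps an empty string as-is, so callers that want a fallback for "" still need an explicit length check.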