about summary refs log tree commit homepage
path: root/script/public-inbox-learn
DateCommit message (Collapse)
2023-11-13treewide: update read_all to avoid eof|close checks
read_all can be expanded to support FIFOs/pipes/sockets where read-until-EOF behavior is desired. We can also rely on wantarray to support splitting on EOL markers, but it's hard-coded to support only `$/ eq "\n"' since (AFAIK) it's the only way we use the wantarray form `readline'.
2023-11-11mda|learn|watch: support dropUniqueUnsubscribe config
List-Unsubscribe headers with unique identifiers (such as those generated by our examples/unsubscribe.milter) should not end up in public archives. Add a new config knob to strip List-Unsubscribe headers if they have the `List-Unsubscribe-Post: List-Unsubscribe=One-Click' header. Unfortunately, this breaks DKIM signatures if the signature covers either of these List-Unsubscribe* headers. However, breaking DKIM is the lesser evil compared to any archive reader being able to stop archival by an independent archivist. As much as I would like this to be the default, it probably affects few users at the moment since very few mailing lists use unique identifiers in List-Unsubscribe (but that number has grown, recently).
2023-11-11learn: fix redundant ham import on dual matches
When learning and injecting new messages ham, we want to avoid wasting cycles importing the same message into an inbox twice (once for the To/Cc match and once for the List-Id match). Our existing %seen hash turned out to be ineffective since PublicInbox::Inbox refs get re-blessed to PublicInbox::InboxWritable. So we stop letting class name influence the hash key for tracking by using the reference address instead. We can get the reference address by performing an arithmetic operation (+ 0) instead of having to pay the cost of importing Scalar::Util::refaddr.
2023-10-15learn: respect indexlevel for v1 inboxes
v2 never suffered from this bug, apparently, but -learn didn't seem able to handle indexlevel=basic (nor respect `medium') for v1 inboxes. I only noticed this bug because I converted some ancient v1 inboxes to `basic' to save space.
2023-10-11treewide: consolidate "From " line removal
Aside from our prior import bugs (fixed in a0c07cba0e5d8b6a (mda: drop leading "From " lines again, 2016-06-26)), we'll always have to be dealing with mutt piping messages to us and `git format-patch' output. So just share the regexp so we can use it everywhere. In may be desirable to allow importing messages with a leading "From " line for FUSE, even. Additionally, some instances of this regexp needlessly added optional `\r?' (CR) checks ahead of the `\n' (LF) element; but they're pointless anyways since [^\n]* is enough to exclude all non-LF bytes.
2021-01-01update copyrights for 2021
Using "make update-copyrights" after setting GNULIB_PATH in my config.mak
2020-12-28check defined return value for localized slurp errors
Reading from regular files (even on STDIN) can fail when dealing with flakey storage.
2020-12-09rename {pi_config} fields to {pi_cfg}
{pi_config} may be confused with the documented `PI_CONFIG' environment variable, and we'll favor vowel-removal to be consistent with our usage of object references. The `pi_' prefix may stay in some places, for now; since a separate namespace may come into this codebase for local/private client-tooling. For InboxIdle, we'll also remove an invalid comment about holding a reference to the PublicInbox::Config object, too.
2020-09-02mda+learn: add --help / -h support
"use Getopt::Long" doesn't seem too slow on a hot page cache, and it's probably used frequently enough to be in cache. We'll also start reducing the amount of markup in the .pod and favoring verbatim text in documentation for readability in source form, since the bold text seems excessive.
2020-09-02script/*: set executable bit on -learn and -imapd
It's useful to mark they're meant to be executable, even if the shebang is useless.
2020-05-27learn: support --all with `rm'
I found myself wanting to remove a message from all inboxes while working on a test case in another branch. I figure this could also be useful for globally removing messages which are in the grey area or too big for spamc.
2020-05-27learn: fix buggy typo on List-ID mapping
There is obviously a typo here, so fix it and add a test case to guard against future regressions. Fixes: 74a3206babe0572a ("mda: support multiple List-ID matches")
2020-05-09replace most uses of PublicInbox::MIME with Eml
PublicInbox::Eml has enough functionality to replace the Email::MIME-based PublicInbox::MIME.
2020-04-19favor `do {}' over `eval {}' for localized slurp
I did not know to use the return value of `do' back in the day. There's probably no practical difference in these cases, but `eval' is overkill for these uses and may hide actual errors. We can get rid of a few redundant `scalar' ops and pass scalar refs to Email::MIME->new to avoid copies in a few more places, too.
2020-02-06treewide: run update-copyrights from gnulib for 2019
I didn't wait until September to do it, this year!
2019-11-16learn: pass global variables into subs
Avoid 'Variable "%s" will not stay shared' warnings when the contents of this script eval'ed into a sub.
2019-10-30mda: support multiple List-ID matches
While it's not RFC2919-conformant, mail software can theoretically set multiple List-ID headers. Deliver to all inboxes which match a given List-ID since that's likely the intended. Cc: Eric W. Biederman <ebiederm@xmission.com> Link: https://public-inbox.org/meta/87pniltscf.fsf@x220.int.ebiederm.org/
2019-10-30mda: hoist out List-ID handling and reuse in -learn
It's now possible to inject false-positive ham into an inbox the same way -mda does via List-ID.
2019-10-30learn: hoist out remove_or_add subroutine
We'll be reusing it for List-ID processing in the next commit.
2019-10-30learn: GIT_COMMITTER_<NAME|EMAIL> may be "" or "0"
Users may be zeroes or blanks.
2019-10-30learn: update usage statement
Use <foo|bar> since that seems to be the favored notation for required command args (taking a hint from git(1) manpage). While we're at it, remove the space after '<' for the redirect to match git.git coding style.
2019-10-30learn: only map recipient list on "ham" or "rm"
It's assumed that "spam" can end up anywhere due to Bcc:, so we need to scan every single inbox. However, "rm" is usually more targeted and and "ham" obviously only belongs in some inboxes.
2019-10-30learn: support multiple To/Cc headers
It's possible to specify these headers multiple times, and PublicInbox::MDA->precheck takes that into account, so -learn should, too.
2019-09-09run update-copyrights from gnulib for 2019
2018-05-17learn: support for v2 repos
Oops, I mainly rely on public-inbox-watch for spam training and completely forgot this tool existed :x
2018-02-28use PublicInbox::MIME consistently
It works around some bugs in older Email::MIME which we'll find useful.
2018-02-07update copyrights for 2018
Using update-copyrights from gnulib While we're at it, use the SPDX identifier for AGPL-3.0+ to ease mechanical processing.
2017-11-16learn: use "spam" as subject for removal commits (part #2)
We need to use the correct subject when doing global scanning, too. In fact, the per-recipient spam training path is entirely redundant at this point.
2017-11-16learn: use "spam" as subject for removal commits
Sometimes an email is an innocent removal "rm" for a misdirected, off-topic post, while most removed messages are "spam". Allow anybody to look at history and easily distinguish the reason for removing the message.
2017-04-05learn: scan all inboxes when learning spam
This matches the behavior of the -watch daemon since 6d534038285ddd760709ba76ea007f9108200097 ("watch: watchspam affects all configured inboxes")
2017-01-19learn: implement "rm" only functionality
Do not consider this interface stable, but I just needed a way to remove mis-imported multipart messages so public-inbox-watch could pick them up again from my Maildir.
2017-01-10introduce PublicInbox::MIME wrapper class
This should fix problems with multipart messages where text/plain parts lack a header. cf. git clone --mirror https://github.com/rjbs/Email-MIME.git refs/pull/28/head In the future, we may still introduce as streaming interface to reduce memory usage on large emails.
2016-07-26learn: fix uninitialized variable
Oops :x
2016-06-26mda: drop leading "From " lines again
Oops... While we're at it, drop blank lines before the "From ", too, since it could happen.
2016-06-24split out spamcheck/spamc to its own module.
This should hopefully make it easier to try other anti-spam systems (or none at all) in the future.
2016-06-17import: auto-update index when done
This prevents multiple update processes from stepping over each other while called under the lock, and also allows the new -watch process to update the index iff indexing was desired.
2016-06-15mda: hook up new filter functionality
This removes the Email::Filter dependency as well as the signature-breaking scrubber code. We now prefer to reject unacceptable messages and grudgingly (and blindly) mirror messages we're not the primary endpoint for.
2016-06-15learn: remove IPC::Run dependency
We'll be relying on our spawn implementation, for now; since it'll be consistent with the rest of our code and can optionally take advantage of vfork.
2016-05-30script/*{mda,learn}: no strict params for Email::MIME::ContentType
User input is imperfect, do not pollute our mail logs with warnings we cannot fix. This is documented in the Email::MIME::ContentType manpage so it should remain supported.
2016-05-25remove Email::Address dependency
git has stricter requirements for ident names (no '<>') which Email::Address allows. Even in 1.908, Email::Address also has an incomplete fix for CVE-2015-7686 with a DoS-able regexp for comments. Since we don't care for or need all the RFC compliance of Email::Address, avoiding it entirely may be preferable. Email::Address will still be installed as a requirement for Email::MIME, but it is only used by the Email::MIME::header_str_set which we do not use
2016-05-14rename most instances of "list" to "inbox"
A public-inbox is NOT necessarily a mailing list, but it could serve as an input point for zero, one, or infinite mailing lists :D
2016-04-25remove ssoma dependency
By converting to using ourt git-fast-import-based Import module. This should allow us to be more easily installed.
2016-04-09public-inbox-learn: drop leading "From " line from mboxes
It can confuse Email::MIME if we have it.
2016-02-27move executables to script/ directory
This seems to match more closely with what is expected of Perl packages based on how blib is used. Hopefully makes the top-level source tree less cluttered and things easier-to-find.