user/dev discussion of public-inbox itself
* Archiving HTML mail
@ 2019-11-12 13:37 Florian Weimer
  2019-11-12 21:09 ` Eric Wong
  0 siblings, 1 reply; 9+ messages in thread
From: Florian Weimer @ 2019-11-12 13:37 UTC
  To: meta

New contributors tend to send text/html.  We are currently rejecting
such email, which is proving more and more problematic.  I think a
change would be easier to justify if I can show that this will not
break our mailing list archives (in the sense that they become
incomplete).  We currently use mhonarc, and I don't think it copes
well with such mail.  It certainly doesn't do the subdomain split.

Is it possible to archive such mail as well, possibly under separate
subdomains to avoid XSS issues?

* Re: Archiving HTML mail
  2019-11-12 13:37 Archiving HTML mail Florian Weimer
@ 2019-11-12 21:09 ` Eric Wong
  2019-11-12 21:17   ` Florian Weimer
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Wong @ 2019-11-12 21:09 UTC
  To: Florian Weimer; +Cc: meta

Florian Weimer <fw@deneb.enyo.de> wrote:
> New contributors tend to send text/html.  We are currently rejecting
> such email, which is proving more and more problematic.  I think a
> change would be easier to justify if I can show that this will not
> break our mailing list archives (in the sense that they become
> incomplete).  We currently use mhonarc, and I don't think it copes
> well with such mail.  It certainly doesn't do the subdomain split.
> 
> Is it possible to archive such mail as well, possibly under separate
> subdomains to avoid XSS issues?

You can use "publicinbox.$NAME.filter = PublicInbox::Filter::Mirror"
in the config to blindly mirror everything, which I use for
public-inbox-watch.

I also added "--no-precheck" to public-inbox-mda recently which
disables the last of the mda-specific checks:
https://public-inbox.org/meta/20191016003956.13269-1-e@80x24.org/


text/html is currently shown inline as raw HTML since
https://public-inbox.org/meta/20191031031220.21048-2-e@80x24.org/
But maybe the HTML part shouldn't be shown inline at all in
multipart parents.
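
One way to keep HTML out of the inline view for multipart messages is to prefer the text/plain alternative whenever one exists. What follows is only a sketch in Python using the standard email package. It is not how public-inbox (which is written in Perl) renders messages, and displayable_body and the sample message are made up for illustration.

```python
from email import message_from_string, policy

# Hypothetical sample: a multipart/alternative message with both variants.
RAW = """\
From: sender@example.com
To: list@example.com
Subject: demo
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="b1"

--b1
Content-Type: text/plain; charset="utf-8"

Hello in plain text.
--b1
Content-Type: text/html; charset="utf-8"

<p>Hello in <b>HTML</b>.</p>
--b1--
"""


def displayable_body(raw):
    """Return (content_type, text), preferring text/plain over text/html."""
    msg = message_from_string(raw, policy=policy.default)
    # Ask for the plain-text body first.
    part = msg.get_body(preferencelist=("plain",))
    if part is not None:
        return "text/plain", part.get_content()
    # Fall back to HTML only when the sender provided nothing else.
    part = msg.get_body(preferencelist=("html",))
    if part is not None:
        return "text/html", part.get_content()
    return None, ""


if __name__ == "__main__":
    ctype, text = displayable_body(RAW)
    print(ctype)
    print(text)
```

With this approach an HTML-only message still needs separate handling, which is where the options discussed in this thread come in.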

Optionally piping HTML to lynx(1) or similar could be considered,
too (but definitely as an option that is off by default).

FWIW, I suggest keeping your lists text-only so contributors can
flow between different projects more easily and not get blocked
by spam filters.  It's significantly more expensive to do spam
processing on HTML mail and less accurate IME.  Better to teach
contributors to optimize for low-end computers and limited
bandwidth situations :)

Also, public-inbox-watch is designed to work in parallel with
existing mailing lists.  I archive several lists (including
libc-alpha@sourceware and git@vger) this way with no special
permissions or access aside from being a regular subscriber.

* Re: Archiving HTML mail
  2019-11-12 21:09 ` Eric Wong
@ 2019-11-12 21:17   ` Florian Weimer
  2019-11-12 21:53     ` Eric Wong
  0 siblings, 1 reply; 9+ messages in thread
From: Florian Weimer @ 2019-11-12 21:17 UTC
  To: Eric Wong; +Cc: meta

* Eric Wong:

> You can use "publicinbox.$NAME.filter = PublicInbox::Filter::Mirror"
> in the config to blindly mirror everything, which I use for
> public-inbox-watch.
> 
> I also added "--no-precheck" to public-inbox-mda recently which
> disables the last of the mda-specific checks:
> https://public-inbox.org/meta/20191016003956.13269-1-e@80x24.org/

Thanks for the pointers.

> text/html is currently shown inline as raw HTML since
> https://public-inbox.org/meta/20191031031220.21048-2-e@80x24.org/
> But maybe the HTML part shouldn't be shown inline at all in
> multipart parents.

Yeah, there's some concern that this could be used to host phishing
forms.  I've seen this occasionally in the Debian mailing list archive
(where anti-phishing companies tend to report them several years later
to security@).

My feeling is that it would need some post-processing, maybe stripping
image links and forms (and Javascript of course).  Plus the separate
domain thing for additional XSS protection (like bugzilla.mozilla.org
does, IIRC).  But presumably you could put the entire list archive
under its own domain to avoid having to write code for that.
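
As a rough illustration of the post-processing described above, here is a Python sketch built on the standard html.parser module. It drops forms, scripts, and images and removes event-handler attributes and javascript: URLs. The tag lists and the sanitize helper are assumptions made for this example. A production sanitizer would more likely use an allowlist of known-safe tags than a blocklist like this one.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical blocklists chosen for this example only.
DROP_WITH_CONTENT = {"script", "style", "form", "iframe", "object"}
DROP_TAG_ONLY = {"img", "input", "link", "meta", "embed"}


class Sanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag in DROP_TAG_ONLY:
            return
        kept = []
        for name, value in attrs:
            value = value or ""
            if name.startswith("on"):  # onclick, onload, ...
                continue
            if name in ("href", "src") and \
                    value.strip().lower().startswith("javascript:"):
                continue
            kept.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<%s%s>" % (tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            if self.skip_depth:
                self.skip_depth -= 1
            return
        if self.skip_depth or tag in DROP_TAG_ONLY:
            return
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(html):
    """Return html with the blocklisted elements and attributes removed."""
    parser = Sanitizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(sanitize('<p onclick="x()">Hi <img src="http://t.example/p.png">'
                   '<form action="/x"><input name="pw"></form></p>'))
```

Stripping markup this way does not replace serving the archive from a separate domain; the two measures address different risks.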

> FWIW, I suggest keeping your lists text-only so contributors can
> flow between different projects more easily and not get blocked
> by spam filters.  It's significantly more expensive to do spam
> processing on HTML mail and less accurate IME.  Better to teach
> contributors to optimize for low-end computers and limited
> bandwidth situations :)

While this is true, it is also a bad experience for those who send
their first email (which may be a huge step for some, I completely
lack perspective there), and then it gets rejected with an obscure
message.  It's also very confusing if Cc:s are involved and everyone
but the mailing list gets the message.

In some clients, it's now impossible to switch off HTML mail (but I
don't know which variant, whether that's HTML-only, or whether there's
still a client-generated text/plain alternative).

> Also, public-inbox-watch is designed to work in parallel with
> existing mailing lists.  I archive several lists (including
> libc-alpha@sourceware and git@vger) this way with no special
> permissions or access aside from being a regular subscriber.

I feel we need to change libc-alpha to accept text/html email.

* Re: Archiving HTML mail
  2019-11-12 21:17   ` Florian Weimer
@ 2019-11-12 21:53     ` Eric Wong
  2019-11-12 22:07       ` Florian Weimer
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Wong @ 2019-11-12 21:53 UTC
  To: Florian Weimer; +Cc: meta

Florian Weimer <fw@deneb.enyo.de> wrote:
> * Eric Wong:
> > text/html is currently shown inline as raw HTML since
> > https://public-inbox.org/meta/20191031031220.21048-2-e@80x24.org/
> > But maybe the HTML part shouldn't be shown inline at all in
> > multipart parents.
> 
> Yeah, there's some concern that this could be used to host phishing
> forms.  I've seen this occasionally in the Debian mailing list archive
> (where anti-phishing companies tend to report them several years later
> to security@).

Those should've been caught by spam filters, first; but if they
weren't, public-inbox-learn can be used to remove them from the
WWW/NNTP viewers (w/o breaking git history) and train SpamAssassin.

> My feeling is that it would need some post-processing, maybe stripping
> image links and forms (and Javascript of course).  Plus the separate
> domain thing for additional XSS protection (like bugzilla.mozilla.org
> does, IIRC).  But presumably you could put the entire list archive
> under its own domain to avoid having to write code for that.

That would mess up DKIM verifications if somebody is trying to
verify archives.

Having separate domains seems to work all right, depending on how
nginx/varnish (or similar) is set up.  I host
http://ou63pmih66umazou.onion/ and several other non-onion
domains on the same -httpd process as https://public-inbox.org/
(and I have some plans for better multi-domain support).

> > FWIW, I suggest keeping your lists text-only so contributors can
> > flow between different projects more easily and not get blocked
> > by spam filters.  It's significantly more expensive to do spam
> > processing on HTML mail and less accurate IME.  Better to teach
> > contributors to optimize for low-end computers and limited
> > bandwidth situations :)
> 
> While this is true, it is also a bad experience for those who send
> their first email (which may be a huge step for some, I completely
> lack perspective there), and then it gets rejected with an obscure
> message.  It's also very confusing if Cc:s are involved and everyone
> but the mailing list gets the message.

The MTA could be made to show a better message.  At least
PublicInbox::Filter::Base tries to with:

	*** We only accept plain-text mail, No HTML ***

At least postfix puts the above in the rejection message.

> In some clients, it's now impossible to switch off HTML mail (but I
> don't know which variant, whether that's HTML-only, or whether there's
> still a client-generated text/plain alternative).

AFAIK, the Android Gmail client was one.  But I'm really against
corporations dictating formats and complexity like that.  Free
software hackers should keep fighting for simple, inexpensive
formats and pressure Google et al. into supporting them.

> > Also, public-inbox-watch is designed to work in parallel with
> > existing mailing lists.  I archive several lists (including
> > libc-alpha@sourceware and git@vger) this way with no special
> > permissions or access aside from being a regular subscriber.
> 
> I feel we need to change libc-alpha to accept text/html email.

Given there's some cross-posting to vger lists which reject HTML,
that could do more harm than good.

My goal is not just to get hackers into using plain-text mail,
but having them influence non-hackers into using plain-text
mail, too.

* Re: Archiving HTML mail
  2019-11-12 21:53     ` Eric Wong
@ 2019-11-12 22:07       ` Florian Weimer
  2019-11-12 22:29         ` Eric Wong
  0 siblings, 1 reply; 9+ messages in thread
From: Florian Weimer @ 2019-11-12 22:07 UTC
  To: Eric Wong; +Cc: meta

* Eric Wong:

>> My feeling is that it would need some post-processing, maybe stripping
>> image links and forms (and Javascript of course).  Plus the separate
>> domain thing for additional XSS protection (like bugzilla.mozilla.org
>> does, IIRC).  But presumably you could put the entire list archive
>> under its own domain to avoid having to write code for that.
>
> That would mess up DKIM verifications if somebody is trying to
> verify archives.

You have to rewrite the HTML parts anyway, to resolve RFC 2392 cid:
links, prior to handing them to web browsers.  I don't think web
browsers support them.  Neither over HTTP, nor browsing locally.
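
To make the cid: rewriting concrete, here is a small Python sketch. It assumes the archive has already built a mapping from each attachment's Content-ID (without the angle brackets) to some URL under which that part is served. The rewrite_cid_links helper, the mapping, and the /att/ URL layout are all hypothetical. A regular expression is used for brevity; a real implementation would more likely work on parsed HTML.

```python
import re
from html import escape
from urllib.parse import unquote

# Matches src="cid:..." or href='cid:...' attribute values.
CID_ATTR = re.compile(
    r'''(\b(?:src|href)\s*=\s*)(["'])cid:([^"']+)\2''',
    re.IGNORECASE,
)


def rewrite_cid_links(html, cid_to_url):
    """Replace cid: references with URLs taken from cid_to_url.

    cid_to_url maps a Content-ID (no angle brackets) to a URL.
    References with no known target are left untouched.
    """
    def repl(match):
        prefix, quote_char, raw_cid = match.groups()
        # RFC 2392 cid: URLs carry the Content-ID in URL-encoded form.
        url = cid_to_url.get(unquote(raw_cid))
        if url is None:
            return match.group(0)
        return prefix + quote_char + escape(url, quote=True) + quote_char

    return CID_ATTR.sub(repl, html)


if __name__ == "__main__":
    page = '<img src="cid:part1%40example.com">'
    print(rewrite_cid_links(page, {"part1@example.com": "/att/1.png"}))
```

Because the rewrite happens on the way out to the browser, the stored message itself stays byte-for-byte unchanged.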

>> > Also, public-inbox-watch is designed to work in parallel with
>> > existing mailing lists.  I archive several lists (including
>> > libc-alpha@sourceware and git@vger) this way with no special
>> > permissions or access aside from being a regular subscriber.
>> 
>> I feel we need to change libc-alpha to accept text/html email.
>
> Given there's some cross-posting to vger lists which reject HTML,
> that could do more harm than good.

Maybe.  But do newcomers tend to cross-post that heavily?  If they do,
that's probably another problem.

> My goal is not just to get hackers into using plain-text mail,
> but having them influence non-hackers into using plain-text
> mail, too.

On the other hand, if we reject their email, we lose a chance to
interact with them directly and influence them.

* Re: Archiving HTML mail
  2019-11-12 22:07       ` Florian Weimer
@ 2019-11-12 22:29         ` Eric Wong
  2019-11-12 22:44           ` Konstantin Ryabitsev
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Wong @ 2019-11-12 22:29 UTC
  To: Florian Weimer; +Cc: meta

Florian Weimer <fw@deneb.enyo.de> wrote:
> * Eric Wong:
> 
> >> My feeling is that it would need some post-processing, maybe stripping
> >> image links and forms (and Javascript of course).  Plus the separate
> >> domain thing for additional XSS protection (like bugzilla.mozilla.org
> >> does, IIRC).  But presumably you could put the entire list archive
> >> under its own domain to avoid having to write code for that.
> >
> > That would mess up DKIM verifications if somebody is trying to
> > verify archives.
> 
> You have to rewrite the HTML parts anyway, to resolve RFC 2392 cid:
> links, prior to handing them to web browsers.  I don't think web
> browsers support them.  Neither over HTTP, nor browsing locally.

Yeah.  I guess it could be done on-the-fly at the WWW layer.
Parsing HTML is crazy expensive, though :<

> >> > Also, public-inbox-watch is designed to work in parallel with
> >> > existing mailing lists.  I archive several lists (including
> >> > libc-alpha@sourceware and git@vger) this way with no special
> >> > permissions or access aside from being a regular subscriber.
> >> 
> >> I feel we need to change libc-alpha to accept text/html email.
> >
> > Given there's some cross-posting to vger lists which reject HTML,
> > that could do more harm than good.
> 
> Maybe.  But do newcomers tend to cross-post that heavily?  If they do,
> that's probably another problem.

*shrug*  But I do wish it were easier to work and share ideas
across different projects and loop in folks as needed.

> > My goal is not just to get hackers into using plain-text mail,
> > but having them influence non-hackers into using plain-text
> > mail, too.
> 
> On the other hand, if we reject their email, we lose a chance to
> interact with them directly and influence them.

Fwiw, the admins of that server do get the original HTML messages
in ~/.public-inbox/emergency/ (or whatever PI_EMERGENCY is).

emergency/ could be considered a "moderation queue" so the
admins could send personalized replies to legitimate senders who
got rejected.  Such a message could be easier-to-digest than
whatever postfix sends, even with the PublicInbox::Filter::Base
rejection message.

The emergency/ for public-inbox.org is 99.9% spam and I have a
cronjob that removes messages after a few days.
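
A cleanup job of that kind could look roughly like the Python sketch below. It assumes the emergency directory is a Maildir with new/ and cur/ subdirectories. The expire_maildir name and the three-day cutoff are invented for the example and do not describe the actual cronjob.

```python
import os
import time


def expire_maildir(maildir, max_age_days, now=None):
    """Delete messages older than max_age_days; return the removed paths."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for sub in ("new", "cur"):
        directory = os.path.join(maildir, sub)
        if not os.path.isdir(directory):
            continue
        for name in sorted(os.listdir(directory)):
            path = os.path.join(directory, name)
            # Use the file's modification time as the delivery time.
            if os.path.isfile(path) and os.stat(path).st_mtime < cutoff:
                os.unlink(path)
                removed.append(path)
    return removed


if __name__ == "__main__":
    # Hypothetical default location taken from the text above.
    target = os.path.expanduser("~/.public-inbox/emergency")
    for path in expire_maildir(target, max_age_days=3):
        print("removed", path)
```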

When somebody does send an HTML message to meta or
test@public-inbox.org or another one of the lists I run, they
usually figure out HTML is rejected and follow up with a text
message after a few minutes.

That said, I don't attract a lot of users to any of my projects
(I hate marketing and evangelism), so the folks that show up
tend to be like-minded and willing to look past things like the
"homepage" :>

* Re: Archiving HTML mail
  2019-11-12 22:29         ` Eric Wong
@ 2019-11-12 22:44           ` Konstantin Ryabitsev
  2019-11-12 23:10             ` Eric Wong
  0 siblings, 1 reply; 9+ messages in thread
From: Konstantin Ryabitsev @ 2019-11-12 22:44 UTC
  To: Eric Wong; +Cc: Florian Weimer, meta

On Tue, Nov 12, 2019 at 10:29:32PM +0000, Eric Wong wrote:
> > You have to rewrite the HTML parts anyway, to resolve RFC 2392 cid:
> > links, prior to handing them to web browsers.  I don't think web
> > browsers support them.  Neither over HTTP, nor browsing locally.
> 
> Yeah.  I guess it could be done on-the-fly at the WWW layer.
> Parsing HTML is crazy expensive, though :<

Someone I spoke with in the recent past lamented that there is no
mechanism to properly render markdown-formatted emails. I wonder if
that's something that can be snuck in on the public-inbox level. :)
Most email is already properly formatted markdown (paragraphs and
blockquotes), so it's not *that* crazy of an idea.

Just an off-the-cuff remark.

> Fwiw, the admins of that server do get the original HTML messages
> in ~/.public-inbox/emergency/ (or whatever PI_EMERGENCY is).
> 
> emergency/ could be considered a "moderation queue" so the
> admins could send personalized replies to legitimate senders who
> got rejected.  Such a message could be easier-to-digest than
> whatever postfix sends, even with the PublicInbox::Filter::Base
> rejection message.

Now that public-inbox-mda supports list-id (THANK YOU!), my life
moderating PI_EMERGENCY is much easier. For lore.kernel.org, emergency
collects about a thousand messages a week. My Friday afternoon routine
is usually to fire up mutt, delete spam, and re-feed the remainder to
public-inbox-mda with --no-precheck.
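
The re-feed step could be scripted along these lines. This Python sketch assumes the emergency directory is a Maildir, that spam has already been deleted by hand, and that the command reads one message on standard input and can work out the destination inbox from the message itself (for example via List-Id). The refeed helper is hypothetical. The command is a parameter so a different one can be substituted.

```python
import os
import subprocess


def refeed(maildir, command=("public-inbox-mda", "--no-precheck")):
    """Pipe each remaining message to command; return delivered paths."""
    delivered = []
    for sub in ("new", "cur"):
        directory = os.path.join(maildir, sub)
        if not os.path.isdir(directory):
            continue
        for name in sorted(os.listdir(directory)):
            path = os.path.join(directory, name)
            # The message is passed on stdin, as for any MDA.
            with open(path, "rb") as fh:
                result = subprocess.run(list(command), stdin=fh)
            if result.returncode == 0:
                delivered.append(path)
    return delivered


if __name__ == "__main__":
    # Hypothetical default location; adjust to wherever PI_EMERGENCY points.
    target = os.path.expanduser("~/.public-inbox/emergency")
    for path in refeed(target):
        print("delivered", path)
```

The sketch leaves delivered files in place so that removing them stays a deliberate, separate step.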

-K

* Re: Archiving HTML mail
  2019-11-12 22:44           ` Konstantin Ryabitsev
@ 2019-11-12 23:10             ` Eric Wong
  2019-11-13 21:38               ` Konstantin Ryabitsev
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Wong @ 2019-11-12 23:10 UTC
  To: Konstantin Ryabitsev; +Cc: meta, Florian Weimer

Konstantin Ryabitsev <konstantin@linuxfoundation.org> wrote:
> On Tue, Nov 12, 2019 at 10:29:32PM +0000, Eric Wong wrote:
> > > You have to rewrite the HTML parts anyway, to resolve RFC 2392 cid:
> > > links, prior to handing them to web browsers.  I don't think web
> > > browsers support them.  Neither over HTTP, nor browsing locally.
> > 
> > Yeah.  I guess it could be done on-the-fly at the WWW layer.
> > Parsing HTML is crazy expensive, though :<
> 
> Someone I spoke with in the recent past lamented that there is no mechanism
> to properly render markdown-formatted emails. I wonder if that's 
> something that can be snuck in on the public-inbox level. :) Most email 
> is already properly formatted markdown (paragraphs and blockquotes), so 
> it's not *that* crazy of an idea.
> 
> Just an off-the-cuff remark.

I don't want public-inbox to be leading the charge on that
(especially given all the flavors of Markdown to choose from).
More MUAs (and "git <log|show>") would have to start supporting
it first.

And I do value syntax highlighting, so I have nothing against
adding syntax highlighting support for Markdown, HTML, Perl,
Make or any attached source files the same way(*) it's currently
done for git blobs.

Perhaps the biggest problem with phishing in HTML (and AFAIK
Markdown) is being able to obscure the URL from users who don't
check URLs before following them.  e.g.:

  <a href="https://scam.example.com/">https://legit.example.com/</a>

Not being able to obscure URLs is a big reason I favor plain-text
and MUA-level linkification.
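
For archives that do render HTML, the mismatch described above can at least be detected mechanically. The Python sketch below flags anchors whose visible text looks like a URL pointing at a different host than the real href. LinkAuditor and find_deceptive_links are names invented for this example, and the check only covers link text that is itself a full URL.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


def _host(url):
    """Return the hostname of url, or None if it cannot be parsed."""
    try:
        return urlsplit(url).hostname
    except ValueError:
        return None


class LinkAuditor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.href = None      # href of the anchor being read, if any
        self.text = []        # visible text collected inside that anchor
        self.mismatches = []  # (shown text, real href) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.href = dict(attrs).get("href")
            self.text = []

    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            shown = "".join(self.text).strip()
            shown_host = _host(shown) if "://" in shown else None
            # Flag the link when the displayed host differs from the target.
            if shown_host and shown_host != _host(self.href):
                self.mismatches.append((shown, self.href))
            self.href = None


def find_deceptive_links(html):
    """Return (shown text, real href) for links that display another host."""
    auditor = LinkAuditor()
    auditor.feed(html)
    auditor.close()
    return auditor.mismatches


if __name__ == "__main__":
    sample = '<a href="https://scam.example.com/">https://legit.example.com/</a>'
    print(find_deceptive_links(sample))
```

A check like this catches only the crudest cases, which is part of why plain text with MUA-level linkification is the simpler defense.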

> > Fwiw, the admins of that server do get the original HTML messages
> > in ~/.public-inbox/emergency/ (or whatever PI_EMERGENCY is).
> > 
> > emergency/ could be considered a "moderation queue" so the
> > admins could send personalized replies to legitimate senders who
> > got rejected.  Such a message could be easier-to-digest than
> > whatever postfix sends, even with the PublicInbox::Filter::Base
> > rejection message.
> 
> Now that public-inbox-mda supports list-id (THANK YOU!), my life
> moderating PI_EMERGENCY is much easier. For lore.kernel.org, emergency
> collects about a thousand messages a week. My Friday afternoon routine
> is usually to fire up mutt, delete spam, and re-feed the remainder to
> public-inbox-mda with --no-precheck.

Good to know :>

Btw, "public-inbox-learn ham" could be better for your case than
"public-inbox-mda --no-precheck" in that it also trains
SpamAssassin so future messages are less likely to end up in
emergency.

(*) and supporting pygments via subprocess and/or GNU
    source-highlight in addition to the not-in-CentOS
    highlight.pm

* Re: Archiving HTML mail
  2019-11-12 23:10             ` Eric Wong
@ 2019-11-13 21:38               ` Konstantin Ryabitsev
  0 siblings, 0 replies; 9+ messages in thread
From: Konstantin Ryabitsev @ 2019-11-13 21:38 UTC
  To: Eric Wong; +Cc: meta, Florian Weimer

On Tue, Nov 12, 2019 at 11:10:36PM +0000, Eric Wong wrote:
> > Now that public-inbox-mda supports list-id (THANK YOU!), my life
> > moderating PI_EMERGENCY is much easier. For lore.kernel.org,
> > emergency collects about a thousand messages a week. My Friday
> > afternoon routine is usually to fire up mutt, delete spam, and re-feed
> > the remainder to public-inbox-mda with --no-precheck.
> 
> Good to know :>
> 
> Btw, "public-inbox-learn ham" could be better for your case than
> "public-inbox-mda --no-precheck" in that it also trains
> SpamAssassin so future messages are less likely to end up in
> emergency.

Good point, I'll switch to that -- I see that it handles figuring out 
where to put things by list-id as well, nice.

-K
