user/dev discussion of public-inbox itself
* About header filtering
@ 2020-12-22  7:37 Uwe Kleine-König
  2020-12-22 16:28 ` Konstantin Ryabitsev
  0 siblings, 1 reply; 5+ messages in thread
From: Uwe Kleine-König @ 2020-12-22  7:37 UTC (permalink / raw)
  To: meta; +Cc: Konstantin Ryabitsev


Hello,

I'm trying to set up a public-inbox instance to archive some mailing
lists using a regular subscription (so I'm not collecting the mails
directly at the mailing list address, but rely on the mailing list
software (here: mailman) to forward to the archiver).

One thing I want is to filter out headers that are relevant only for the
path between the mailing list host and the subscribed mail account.
That's things like:

	Received: from $mailinglistserver ([2001:....]) by
		$publicinboxmachine with esmtp (Exim 4.92)
		(envelope-from <listname-bounces+something@mailinglistdomain>) id
		23487275432 for $publicinboxaccount; Fri, 18 Dec 2020 15:48:54 +0100
	Envelope-to: $publicinboxaccount
	Return-path: <listname-bounces+something@mailinglistdomain>
	Errors-To: listname-bounces+something@mailinglistdomain

I found that Konstantin Ryabitsev's tool to prepare an initial archive
from an already existing mailing list[1] filters some of these out, but
the instance on kernel.org has some of these details, too. (See for
example
https://lore.kernel.org/lkml/20201013082132.661993-1-u.kleine-koenig@pengutronix.de/raw;
there are Return-Path: and also some Received: headers that I consider
not-so-nice as they were added after the mail was processed by the
mailing list tool on vger.kernel.org.)

Is it considered bad to filter these out? Or is it just that nobody
wanted this kind of cleanliness before in such a setup?

I could handcraft a preprocessor[2] but I assume that a solution in
public-inbox itself would find some users?!

Best regards
Uwe

[1] https://git.kernel.org/pub/scm/linux/kernel/git/mricon/korg-helpers.git/plain/list-archive-maker.py
[2] something like
	formail -I Envelope-to -I Return-path -I Errors-To
    but filtering Received: is a bit harder if you want to keep the lines
    describing the path from the sender to the mailing list.
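
A byte-preserving equivalent of the formail call in [2] could look
roughly like the Python sketch below. It is only an illustration: the
names DROP, split_fields and drop_headers are made up for this example,
and it assumes the whole message is available as raw bytes with the
header block ending at the first empty line.

```python
# Header names to remove; compared case-insensitively.
# (Hypothetical set, taken from the examples in this mail.)
DROP = {b"envelope-to", b"return-path", b"errors-to"}


def split_fields(raw):
    """Split raw message bytes into (header fields, remainder).

    Each field keeps its folded continuation lines and line endings.
    The remainder starts at the empty line that ends the header block.
    """
    lines = raw.splitlines(keepends=True)
    fields = []
    i = 0
    while i < len(lines) and lines[i].strip(b"\r\n"):
        if lines[i][:1] in (b" ", b"\t") and fields:
            fields[-1] += lines[i]  # continuation of the previous field
        else:
            fields.append(lines[i])
        i += 1
    return fields, b"".join(lines[i:])


def drop_headers(raw, names=DROP):
    """Return raw with every header field whose name is in names removed."""
    fields, rest = split_fields(raw)
    kept = [f for f in fields
            if f.split(b":", 1)[0].strip().lower() not in names]
    return b"".join(kept) + rest
```

To run this as a delivery filter, one would read the message from stdin
and write the result to stdout, for example with
sys.stdout.buffer.write(drop_headers(sys.stdin.buffer.read())).
Everything that is kept is passed through byte for byte.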
-- 
Pengutronix e.K.                           | Uwe Kleine-König            |
Industrial Linux Solutions                 | https://www.pengutronix.de/ |


* Re: About header filtering
  2020-12-22  7:37 About header filtering Uwe Kleine-König
@ 2020-12-22 16:28 ` Konstantin Ryabitsev
  2020-12-22 22:21   ` Uwe Kleine-König
  0 siblings, 1 reply; 5+ messages in thread
From: Konstantin Ryabitsev @ 2020-12-22 16:28 UTC (permalink / raw)
  To: Uwe Kleine-König; +Cc: meta


On Tue, Dec 22, 2020 at 08:37:04AM +0100, Uwe Kleine-König wrote:
> I found that Konstantin Ryabitsev's tool to prepare an initial archive
> from an already existing mailing list[1] filters some of these out, but
> the instance on kernel.org has some of these details, too. (See for
> example
> https://lore.kernel.org/lkml/20201013082132.661993-1-u.kleine-koenig@pengutronix.de/raw;
> there are Return-Path: and also some Received: headers that I consider
> not-so-nice as they were added after the mail was processed by the
> mailing list tool on vger.kernel.org.)
> 
> Is it considered bad to filter these out? Or is it just that nobody
> wanted this kind of cleanliness before in such a setup?

The reason we don't do any filtering after receiving the mail on the archiver
system is two-fold:

1. we don't know if any of the Received: lines are part of any DKIM/ARC
   signatures (they shouldn't be -- it's wrong to include them, but I've seen
   this happen).
2. the goal of lore.kernel.org is maximum transparency, so we include
   everything that our own systems add to the headers in an attempt to show
   that "there's nothing up our sleeves"
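
Whether point 1 applies to a given message can be checked before
filtering. The Python sketch below collects the header names listed in
the h= tag of any DKIM-Signature or ARC-Message-Signature header and
reports whether Received is among them. The function names are invented
for this example, and it only inspects the h= tag; it does not verify
any signature.

```python
from email import message_from_bytes


def signed_header_names(raw):
    """Return the lower-cased header names found in the h= tags of all
    DKIM-Signature and ARC-Message-Signature headers of a raw message."""
    msg = message_from_bytes(raw)
    names = set()
    for hdr in ("DKIM-Signature", "ARC-Message-Signature"):
        for value in msg.get_all(hdr, []):
            # A signature header is a ";"-separated list of tag=value
            # pairs; folding whitespace may appear around any of them.
            for tag in str(value).split(";"):
                key, sep, val = tag.partition("=")
                if sep and key.strip().lower() == "h":
                    names.update(n.strip().lower()
                                 for n in val.split(":") if n.strip())
    return names


def received_is_signed(raw):
    """True if any signature lists Received among its signed headers."""
    return "received" in signed_header_names(raw)
```

Even when Received is listed, RFC 6376 has signers select header
instances from the bottom of the header block upwards, so lines added
later at the top are normally outside the signed set. A message for
which received_is_signed() returns True is still one to leave untouched
if in doubt.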

> I could handcraft a preprocessor[2] but I assume that a solution in
> public-inbox itself would find some users?!

I don't know if this should be part of public-inbox -- a simple procmail
script would work. I know procmail isn't very actively developed these days,
but it's also extremely robust and handles almost anything you can throw at
it, which is an important advantage when it comes to a format like email.

-K


* Re: About header filtering
  2020-12-22 16:28 ` Konstantin Ryabitsev
@ 2020-12-22 22:21   ` Uwe Kleine-König
  2020-12-22 23:11     ` Eric Wong
  2020-12-23 17:57     ` Konstantin Ryabitsev
  0 siblings, 2 replies; 5+ messages in thread
From: Uwe Kleine-König @ 2020-12-22 22:21 UTC (permalink / raw)
  To: meta


Hello Konstantin,

On Tue, Dec 22, 2020 at 11:28:28AM -0500, Konstantin Ryabitsev wrote:
> On Tue, Dec 22, 2020 at 08:37:04AM +0100, Uwe Kleine-König wrote:
> > I found that Konstantin Ryabitsev's tool to prepare an initial archive
> > from an already existing mailing list[1] filters some of these out, but
> > the instance on kernel.org has some of these details, too. (See for
> > example
> > https://lore.kernel.org/lkml/20201013082132.661993-1-u.kleine-koenig@pengutronix.de/raw;
> > there are Return-Path: and also some Received: headers that I consider
> > not-so-nice as they were added after the mail was processed by the
> > mailing list tool on vger.kernel.org.)
> > 
> > Is it considered bad to filter these out? Or is it just that nobody
> > wanted this kind of cleanliness before in such a setup?
> 
> The reason we don't do any filtering after receiving the mail on the archiver
> system is two-fold:
> 
> 1. we don't know if any of the Received: lines are part of any DKIM/ARC
>    signatures (they shouldn't be -- it's wrong to include them, but I've seen
>    this happen).

Note I don't intend to throw away all Received lines, only the ones
concerning the hops after the mailing list server. These cannot be
signed using DKIM unless the mailing list subscription goes to an
address that is forwarded and the forwarding server signs the Received
lines.

> 2. the goal of lore.kernel.org is maximum transparency, so we include
>    everything that our own systems add to the headers in an attempt to show
>    that "there's nothing up our sleeves"
> 
> > I could handcraft a preprocessor[2] but I assume that a solution in
> > public-inbox itself would find some users?!
> 
> I don't know if this should be part of public-inbox -- a simple procmail
> script would work. I know procmail isn't very actively developed these days,
> but it's also extremely robust and handles almost anything you can throw at
> it, which is an important advantage when it comes to a format like email.

Procmail doesn't help here (unless I'm missing something). Well, it allows
calling a filter, but doesn't filter itself. Currently I'm experimenting with
formail (which is called by procmail), but formail cannot throw away
only selected Received lines.
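
What I have in mind could be sketched in Python as below: drop every
Received field above the first one whose "by" clause names the mailing
list server, and keep that one and everything after it. The names
by_host and strip_post_list_hops are made up for this sketch, and the
"by" matching is a plain regular expression heuristic, not a full
Received parser. If no Received field names the list host, the message
is returned unchanged.

```python
import re


def split_fields(raw):
    """Split raw message bytes into (header fields, remainder); folded
    continuation lines stay attached to their field."""
    lines = raw.splitlines(keepends=True)
    fields = []
    i = 0
    while i < len(lines) and lines[i].strip(b"\r\n"):
        if lines[i][:1] in (b" ", b"\t") and fields:
            fields[-1] += lines[i]
        else:
            fields.append(lines[i])
        i += 1
    return fields, b"".join(lines[i:])


def by_host(field):
    """Return the lower-cased host after the first 'by' keyword of a
    Received field, or None (heuristic, ignores comments)."""
    m = re.search(rb"\bby\s+([^\s;()]+)", field, re.IGNORECASE)
    return m.group(1).lower() if m else None


def strip_post_list_hops(raw, list_host):
    """Remove the Received fields added after list_host handled the mail."""
    fields, rest = split_fields(raw)
    target = list_host.lower().encode("ascii")
    is_received = [f.split(b":", 1)[0].strip().lower() == b"received"
                   for f in fields]
    # Topmost Received field written by the list server itself.
    cut = next((i for i, f in enumerate(fields)
                if is_received[i] and by_host(f) == target), None)
    if cut is None:
        return raw  # hop not recognised: leave the message untouched
    kept = [f for i, f in enumerate(fields)
            if i >= cut or not is_received[i]]
    return b"".join(kept) + rest
```

Other headers above the cut (Envelope-to, Return-path and so on) are
deliberately left alone here; they would need a separate step such as
formail -I.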

Best regards and thanks for your input,
Uwe

-- 
Pengutronix e.K.                           | Uwe Kleine-König            |
Industrial Linux Solutions                 | https://www.pengutronix.de/ |


* Re: About header filtering
  2020-12-22 22:21   ` Uwe Kleine-König
@ 2020-12-22 23:11     ` Eric Wong
  2020-12-23 17:57     ` Konstantin Ryabitsev
  1 sibling, 0 replies; 5+ messages in thread
From: Eric Wong @ 2020-12-22 23:11 UTC (permalink / raw)
  To: Uwe Kleine-König; +Cc: meta

Uwe Kleine-König <u.kleine-koenig@pengutronix.de> wrote:
> Hello Konstantin,
> 
> On Tue, Dec 22, 2020 at 11:28:28AM -0500, Konstantin Ryabitsev wrote:
> > On Tue, Dec 22, 2020 at 08:37:04AM +0100, Uwe Kleine-König wrote:
> > > I found that Konstantin Ryabitsev's tool to prepare an initial archive
> > > from an already existing mailing list[1] filters some of these out, but
> > > the instance on kernel.org has some of these details, too. (See for
> > > example
> > > https://lore.kernel.org/lkml/20201013082132.661993-1-u.kleine-koenig@pengutronix.de/raw;
> > > there are Return-Path: and also some Received: headers that I consider
> > > not-so-nice as they were added after the mail was processed by the
> > > mailing list tool on vger.kernel.org.)
> > > 
> > > Is it considered bad to filter these out? Or is it just that nobody
> > > wanted this kind of cleanliness before in such a setup?
> > 
> > The reason we don't do any filtering after receiving the mail on the archiver
> > system is two-fold:
> > 
> > 1. we don't know if any of the Received: lines are part of any DKIM/ARC
> >    signatures (they shouldn't be -- it's wrong to include them, but I've seen
> >    this happen).
> 
> Note I don't intend to throw away all Received lines, only the ones
> concerning the hops after the mailing list server. These cannot be
> signed using DKIM unless the mailing list subscription goes to an
> address that is forwarded and the forwarding server signs the Received
> lines.

Fwiw, you should be able to use either Email::MIME or
PublicInbox::Eml to shift off the latest (topmost) Received
header:

----8<----
#!/usr/bin/perl -w
use strict;
use PublicInbox::Eml;
my $eml = PublicInbox::Eml->new(do { local $/; <STDIN> });
my @rcvd = $eml->header_raw('Received'); # array context for all instances
shift @rcvd; # remove topmost
$eml->header_set('Received', @rcvd); # set to keep remaining
print $eml->as_string;
----8<----

s/PublicInbox::Eml/Email::MIME/ works, too, but PublicInbox::Eml
won't endlessly recurse multipart mails like Email::MIME does.
Otherwise the header_raw, header_set, as_string APIs should
behave the same.


* Re: About header filtering
  2020-12-22 22:21   ` Uwe Kleine-König
  2020-12-22 23:11     ` Eric Wong
@ 2020-12-23 17:57     ` Konstantin Ryabitsev
  1 sibling, 0 replies; 5+ messages in thread
From: Konstantin Ryabitsev @ 2020-12-23 17:57 UTC (permalink / raw)
  To: Uwe Kleine-König; +Cc: meta


On Tue, Dec 22, 2020 at 11:21:18PM +0100, Uwe Kleine-König wrote:
> > 2. the goal of lore.kernel.org is maximum transparency, so we include
> >    everything that our own systems add to the headers in an attempt to show
> >    that "there's nothing up our sleeves"
> > 
> > > I could handcraft a preprocessor[2] but I assume that a solution in
> > > public-inbox itself would find some users?!
> > 
> > I don't know if this should be part of public-inbox -- a simple procmail
> > script would work. I know procmail isn't very actively developed these days,
> > but it's also extremely robust and handles almost anything you can throw at
> > it, which is an important advantage when it comes to a format like email.
> 
> Procmail doesn't help here (unless I'm missing something). Well, it allows
> calling a filter, but doesn't filter itself. Currently I'm experimenting with
> formail (which is called by procmail), but formail cannot throw away
> only selected Received lines.

Right, that's what I meant -- you pipe through procmail to define any
additional filtering to be done based on match logic (whether with formail or
with any other command written for this purpose).

BTW, my hope is that, eventually, we'll stop doing archiving on our own and
will merely mirror public-inbox archives made available by actual
mailing list providers. I know vger folks have been looking at this, but I'm
not sure where this fits in their priorities. This will accomplish the same
thing -- SMTP headers will end at the mailing list host.

-K

