From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net
X-Spam-Level:
X-Spam-Status: No, score=-3.8 required=3.0 tests=ALL_TRUSTED,AWL,BAYES_00
	shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2
Received: from localhost (dcvr.yhbt.net [127.0.0.1])
	by dcvr.yhbt.net (Postfix) with ESMTP id CE0211F4B4;
	Tue, 22 Dec 2020 23:11:26 +0000 (UTC)
Date: Tue, 22 Dec 2020 23:11:26 +0000
From: Eric Wong
To: Uwe =?utf-8?Q?Kleine-K=C3=B6nig?=
Cc: meta@public-inbox.org
Subject: Re: About header filtering
Message-ID: <20201222231126.GA14850@dcvr>
References: <20201222073704.u7hacjk5m7mpuc52@pengutronix.de>
 <20201222162828.wir7sfelqmy2mzrr@chatter.i7.local>
 <20201222222118.i4bioeo7l6iuf3pk@pengutronix.de>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <20201222222118.i4bioeo7l6iuf3pk@pengutronix.de>
List-Id:

Uwe Kleine-König wrote:
> Hello Konstantin,
>
> On Tue, Dec 22, 2020 at 11:28:28AM -0500, Konstantin Ryabitsev wrote:
> > On Tue, Dec 22, 2020 at 08:37:04AM +0100, Uwe Kleine-König wrote:
> > > I found that Konstantin Ryabitsev's tool to prepare an initial archive
> > > from an already existing mailing list[1] filters some of these out, but
> > > the instance on kernel.org has some of these details, too. (See for
> > > example
> > > https://lore.kernel.org/lkml/20201013082132.661993-1-u.kleine-koenig@pengutronix.de/raw;
> > > there are Return-Path: and also some Received: headers that I consider
> > > not-so-nice as they were added after the mail was processed by the
> > > mailing list tool on vger.kernel.org.)
> > >
> > > Is it considered bad to filter these out? Or is it just that nobody
> > > wanted this kind of cleanliness before in such a setup?
> >
> > The reason we don't do any filtering after receiving the mail on the archiver
> > system is two-fold:
> >
> > 1. we don't know if any of the Received: lines are part of any DKIM/ARC
> >    signatures (they shouldn't be -- it's wrong to include them, but I've seen
> >    this happen).
>
> Note I don't intend to throw away all Received lines, only the ones
> concerning the hops after the mailing list server. These cannot be
> signed using DKIM unless the mailing list subscription goes to an
> address that is forwarded and the forwarding server signs the Received
> lines.

Fwiw, you should be able to use either Email::MIME or
PublicInbox::Eml to shift off the latest (topmost) Received
header:

----8<----
#!/usr/bin/perl -w
use strict;
use PublicInbox::Eml;
# slurp the whole message from stdin
my $eml = PublicInbox::Eml->new(do { local $/; <STDIN> });
my @rcvd = $eml->header_raw('Received'); # list context for all instances
shift @rcvd; # remove topmost
$eml->header_set('Received', @rcvd); # set to keep remaining
print $eml->as_string;
----8<----

s/PublicInbox::Eml/Email::MIME/ works, too, but
PublicInbox::Eml won't endlessly recurse multipart mails like
Email::MIME does.  Otherwise the header_raw, header_set, and
as_string APIs should behave the same.
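
If more than one hop was added after the list server, shifting a
single header won't be enough.  Below is a rough, untested-in-production
sketch of one way to drop every Received value stamped after the list
server.  drop_post_list_hops is a hypothetical helper (it is not part of
PublicInbox::Eml or Email::MIME), the "by HOST" match is only a
heuristic, and the example.* hosts and sample values are made up for
illustration:

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper, not part of PublicInbox::Eml or Email::MIME.
# Takes the list server's hostname and the Received values in header
# order (newest first); returns only the values from the newest hop
# stamped "by $list_host" downwards.
sub drop_post_list_hops {
    my ($list_host, @rcvd) = @_;
    for my $i (0..$#rcvd) {
        # heuristic: the first value (from the top) that was written
        # "by" the list server is the newest one we want to keep
        return @rcvd[$i..$#rcvd]
            if $rcvd[$i] =~ /\bby\s+\Q$list_host\E(?![\w.-])/i;
    }
    return @rcvd; # list server not found: leave everything untouched
}

# made-up sample values, newest first
my @rcvd = (
    'from list.example.org by archive.example.net; Tue, 22 Dec 2020 23:11:30 +0000',
    'from mx.example.com by list.example.org; Tue, 22 Dec 2020 23:11:28 +0000',
    'from laptop.example.com by mx.example.com; Tue, 22 Dec 2020 23:11:26 +0000',
);
print "$_\n" for drop_post_list_hops('list.example.org', @rcvd);
```

In the script above, the result would take the place of the
"shift @rcvd" step, e.g. by passing
drop_post_list_hops('vger.kernel.org', $eml->header_raw('Received'))
to $eml->header_set('Received', ...).  Leaving the headers untouched
when the list server isn't found is a deliberately conservative choice;
it does nothing about the DKIM/ARC concern quoted above.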