From: Nicolás Ojeda Bär
Date: Mon, 5 Mar 2018 19:06:06 +0100
Subject: Re: Relationship between public-inbox and ssoma?
To: Eric Wong
Cc: meta@public-inbox.org
In-Reply-To: <20180305175007.GA19007@whir>
References: <20180305020754.GA11496@dcvr> <20180305175007.GA19007@whir>

Hi Eric,

Thanks for the quick reply.

On Mon, Mar 5, 2018 at 6:50 PM, Eric Wong wrote:
> Nicolás Ojeda Bär wrote:
>> Hello Eric,
>>
>> Thanks for the prompt reply. I am trying to migrate a long-lived
>> mailing list (65k messages over 26 years), below are some
>> troubles/questions I am having;
>> any suggestions would be greatly appreciated.
>>
>> - public-inbox-watch seems to struggle with very big maildirs; for now
>> I am moving the data into the maildir a little at a time and that
>> seems to work. Is there a particular obstacle
>> to making the importing process more incremental?
>
> Do you know if it's SpamAssassin being slow?
>
> I disable network checks for large imports in ~/.spamassassin/user_prefs
> (if I'm using SA at all during the imports):
> # uncomment the following for importing archives:
> # dns_available no
> # skip_rbl_checks 1
> # skip_uribl_checks 1

I don't think it is even installed and I have not set it up at all, so
probably not.

> Fwiw, large directories are a performance killer in any
> application.
> Seek times and cache overheads are two problems,
> at least, so an SSD will definitely help; and maybe even shorter
> filenames.

OK.

> I usually prefer one-off scripts like
> scripts/import_vger_from_mbox for initial imports and store
> large archives in compressed mboxes instead of Maildir. Lack of
> mbox support is one reason I never used notmuch despite studying
> it.

Thanks for the pointer, I will take a look, hopefully it will nudge me
in the right direction.

>> - Trouble due to missing/malformed headers (mostly on very old
>> messages). For example, here is the header of a message that trips
>> public-inbox-watch:
>>
>> From weis@margaux Fri Nov 27 16:24:50 1992
>> Received: by margaux.inria.fr, Fri, 27 Nov 92 16:24:50 +0100
>> Message-ID: <9211271524.AA29971@margaux.inria.fr>
>> To: caml-list@margaux
>> Sender: weis@margaux
>> Status: O
>>
>> The error is: fatal: Invalid rfc2822 date "" in ident: <> (I guess
>> due to the lack of a Date: field). I added a Date: field just to test
>> and
>> noticed that Author: in the git commit was empty, I guess due to the
>> use of Sender: rather than From: header.
>
> I have a patch in the wings to use the Received: date:
>
> https://public-inbox.org/meta/20180215110840.30413-16-e@80x24.org/raw
>
> And I'm thinking about favoring Received: over Date: if both
> exist, since Date: headers are more often wrong...

Great, I will try your patch to see if I can get my messages past
public-inbox-watch.

>> Do you think it is feasible to improve public-inbox-watch to try to
>> extract the date from some other header like above?
>> and to use Sender: when From: is not found?
>
> Sure, I suppose falling back to Sender is correct if From is
> missing.

OK, I will see if I can patch this on my own since I am keen on getting
this mailing list imported.

>> - There are some messages that do not have Message-Id, but
>> public-inbox-watch seems to be able to handle them.
>
> Yes, we generate a Message-Id if one is missing
>
>> Is it the case that Date: is the only header that is absolutely
>> necessary for public-inbox-watch to process the message?
>
> Probably none of them are, actually.

Currently, public-inbox-watch refuses to process the message with the
header quoted above due to a missing Date: header.

>> - Does public-inbox-watch ever modify the message data?
>
> Message-ID generation is one that's generated.
> Status, Lines, Bytes, Content-Length, and @BAD_HEADERS in
> lib/PublicInbox/MDA.pm are all dropped:
>
> our @BAD_HEADERS = (
>     # postfix
>     qw(delivered-to x-original-to), # prevent training loops
>
>     # The rest are taken from Mailman 2.1.15:
>     # could contain passwords:
>     qw(approved approve x-approved x-approve urgent),
>     # could be used phishing:
>     qw(return-receipt-to disposition-notification-to x-confirm-reading-to),
>     # Pegasus mail:
>     qw(x-pmrqc)
> );
>
> Email::MIME might modify invalid characters in the headers (or
> if there's bugs in Email::MIME). I don't think bodies are
> modified outside of the not-really-documented
> PublicInbox::Filter API. You can check out some filters at
> lib/PublicInbox/Filter/*.pm (some commit messages document them,
> but I don't think there's manpages, yet)

OK, will take a look.

>> - In general public-inbox-watch prints very little about what it is
>> doing, which makes it hard(er) to trace problems; a verbose flag would
>> be a nice
>> addition, I think.
>
> I usually use strace on Linux to track down problems. I'm not
> sure it's worth the effort to introduce new options/features
> if generic tracing utilities are more detailed and accurate.
>

Makes sense. Thanks for the suggestion.

> Also, I'm going to be mostly offline for about a week starting
> tomorrow; so don't expect prompt replies for a bit.

Sure, thanks for the heads-up.

Best wishes,
Nicolás
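
A rough sketch of the header fallback discussed above (take the date
from a Received: header when Date: is missing, and use Sender: when
From: is missing), written as a standalone Python pre-processing step
that could be run over each message before it is handed to
public-inbox-watch. It is not the patch linked above and not part of
public-inbox; the function name fill_missing_headers and the date
regular expression are assumptions made for illustration, and the
regular expression only covers the simple "27 Nov 92 16:24:50 +0100"
style seen in the 1992 header quoted earlier.

```python
import re
from email import message_from_string
from email.utils import format_datetime, parsedate_to_datetime

# Hypothetical pattern: an RFC 822 style date somewhere in a Received: value,
# e.g. "Fri, 27 Nov 92 16:24:50 +0100" (weekday and zone are optional).
DATE_RE = re.compile(
    r"(?:[A-Z][a-z]{2},\s+)?\d{1,2}\s+[A-Z][a-z]{2}\s+\d{2,4}\s+"
    r"\d{1,2}:\d{2}(?::\d{2})?(?:\s+[+-]\d{4})?"
)


def fill_missing_headers(raw):
    """Return the message text with Date: and From: filled in if absent."""
    msg = message_from_string(raw)

    if not msg["Date"]:
        # Use the first Received: header that contains a parseable date.
        for received in msg.get_all("Received", []):
            match = DATE_RE.search(received)
            if match is None:
                continue
            try:
                when = parsedate_to_datetime(match.group(0))
            except (TypeError, ValueError):
                continue  # unparseable date, try the next Received: header
            msg["Date"] = format_datetime(when)
            break

    if not msg["From"] and msg["Sender"]:
        # Fall back to Sender: so the commit author is not left empty.
        msg["From"] = msg["Sender"]

    return msg.as_string()
```

Messages that already carry Date: and From: are left with those
headers unchanged, so the step should be safe to run over a whole
archive.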