From: Eric Wong
To: Nicolás Ojeda Bär
Cc: meta@public-inbox.org
Subject: Re: Relationship between public-inbox and ssoma?
Date: Mon, 5 Mar 2018 17:50:07 +0000
Message-ID: <20180305175007.GA19007@whir>
References: <20180305020754.GA11496@dcvr>

Nicolás Ojeda Bär wrote:
> Hello Eric,
>
> Thanks for the prompt reply. I am trying to migrate a long-lived
> mailing list (65k messages over 26 years), below are some
> troubles/questions I am having;
> any suggestions would be greatly appreciated.
>
> - public-inbox-watch seems to struggle with very big maildirs; for now
> I am moving the data into the maildir a little at a time and that
> seems to work. Is there a particular obstacle
> to making the importing process more incremental?

Do you know if it's SpamAssassin being slow?  I disable network
checks for large imports in ~/.spamassassin/user_prefs
(if I'm using SA at all during the imports):

	# uncomment the following for importing archives:
	# dns_available no
	# skip_rbl_checks 1
	# skip_uribl_checks 1

Fwiw, large directories are a performance killer in any
application.  Seek times and cache overheads are two problems, at
least, so an SSD will definitely help; and maybe even shorter
filenames.

I usually prefer one-off scripts like
scripts/import_vger_from_mbox for initial imports and store large
archives in compressed mboxes instead of Maildir.
Lack of mbox support is one reason I never used notmuch despite
studying it.

> - Trouble due to missing/malformed headers (mostly on very old
> messages). For example, here is the header of a message that trips
> public-inbox-watch:
>
> From weis@margaux Fri Nov 27 16:24:50 1992
> Received: by margaux.inria.fr, Fri, 27 Nov 92 16:24:50 +0100
> Message-ID: <9211271524.AA29971@margaux.inria.fr>
> To: caml-list@margaux
> Sender: weis@margaux
> Status: O
>
> The error is: fatal: Invalid rfc2822 date "" in ident: <> (I guess
> due to the lack of a Date: field). I added a Date: field just to test
> and noticed that Author: in the git commit was empty, I guess due to
> the use of Sender: rather than From: header.

I have a patch in the wings to use the Received: date:

  https://public-inbox.org/meta/20180215110840.30413-16-e@80x24.org/raw

And I'm thinking about favoring Received: over Date: if both
exist, since Date: headers are more often wrong...

> Do you think it is feasible to improve public-inbox-watch to try to
> extract the date from some other header like above?
> and to use Sender: when From: is not found?

Sure, I suppose falling back to Sender is correct if From is
missing.

> - There are some messages that do not have Message-Id, but
> public-inbox-watch seems to be able to handle them.

Yes, we generate a Message-Id if one is missing.

> Is it the case that Date: is the only header that is absolutely
> necessary for public-inbox-watch to process the message?

Probably none of them are, actually.

> - Does public-inbox-watch ever modify the message data?

Yes, in a few ways.  A Message-ID is generated if one is missing.
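
For illustration only, here is a rough Python sketch of the fallbacks
discussed above: take the date from a Received: header when Date: is
missing (or, optionally, prefer Received: when both exist), and use
Sender: when From: is missing.  This is not the patch linked above and
not public-inbox code.  The function names, the regular expression, and
the choice of the topmost Received: header with a parsable date are all
assumptions made for the example.

```python
import re
from email.utils import parsedate_to_datetime

# Loose, hypothetical pattern for an RFC 822/2822-style date embedded
# in a Received: header, e.g. "Fri, 27 Nov 92 16:24:50 +0100".
_DATE_RE = re.compile(
    r"(?:[A-Z][a-z]{2},\s+)?\d{1,2}\s+[A-Z][a-z]{2}\s+\d{2,4}\s+"
    r"\d{1,2}:\d{2}(?::\d{2})?\s+[+-]\d{4}"
)


def _parse(value):
    """Return a datetime for a date string, or None if it is unusable."""
    if not value:
        return None
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None


def received_date(msg):
    """Date from the topmost Received: header that has a parsable date."""
    for hdr in msg.get_all("Received", []):
        match = _DATE_RE.search(hdr)
        if match:
            parsed = _parse(match.group(0))
            if parsed is not None:
                return parsed
    return None


def pick_date(msg, prefer_received=False):
    """Choose between Date: and Received:, falling back to the other."""
    date = _parse(msg["Date"])
    recv = received_date(msg)
    if prefer_received:
        return recv or date
    return date or recv


def pick_author(msg):
    """Use From: if present, otherwise fall back to Sender:."""
    return msg["From"] or msg["Sender"]
```

Feeding the 1992 header quoted above through
email.message_from_string() and then pick_date() and pick_author()
should give a usable date and author even though the message has
neither Date: nor From:.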
Status, Lines, Bytes, Content-Length, and @BAD_HEADERS in
lib/PublicInbox/MDA.pm are all dropped:

	our @BAD_HEADERS = (
		# postfix
		qw(delivered-to x-original-to), # prevent training loops

		# The rest are taken from Mailman 2.1.15:
		# could contain passwords:
		qw(approved approve x-approved x-approve urgent),
		# could be used for phishing:
		qw(return-receipt-to disposition-notification-to x-confirm-reading-to),
		# Pegasus mail:
		qw(x-pmrqc)
	);

Email::MIME might modify invalid characters in the headers (or
other things, if there are bugs in Email::MIME).

I don't think bodies are modified outside of the
not-really-documented PublicInbox::Filter API.  You can check
out some filters at lib/PublicInbox/Filter/*.pm
(some commit messages document them, but I don't think there
are manpages yet).

> - In general public-inbox-watch prints very little about what it is
> doing, which makes it hard(er) to trace problems; a verbose flag would
> be a nice addition, I think.

I usually use strace on Linux to track down problems.  I'm not
sure it's worth the effort to introduce new options/features if
generic tracing utilities are more detailed and accurate.

Also, I'm going to be mostly offline for about a week starting
tomorrow; so don't expect prompt replies for a bit.
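
P.S. To make the header dropping described above concrete, here is a
small Python sketch of the same idea.  It is an illustration only and
not the Perl in lib/PublicInbox/MDA.pm: the header names are copied from
the list above, but the names BAD_HEADERS, MBOX_HEADERS and
strip_headers are made up for this example.

```python
from email import message_from_string

# Header names copied from the @BAD_HEADERS list quoted above.
BAD_HEADERS = (
    "delivered-to", "x-original-to",
    "approved", "approve", "x-approved", "x-approve", "urgent",
    "return-receipt-to", "disposition-notification-to",
    "x-confirm-reading-to",
    "x-pmrqc",
)

# mbox/MUA bookkeeping headers that are also dropped.
MBOX_HEADERS = ("status", "lines", "bytes", "content-length")


def strip_headers(msg):
    """Remove unwanted headers in place and return the message."""
    for name in BAD_HEADERS + MBOX_HEADERS:
        # Deleting is case-insensitive, removes every occurrence, and
        # does nothing if the header is absent.
        del msg[name]
    return msg


sample = message_from_string(
    "Delivered-To: someone@example.com\n"
    "Subject: hello\n"
    "Status: O\n"
    "\n"
    "body\n"
)
strip_headers(sample)
```

After the call, sample keeps Subject: and the body, while Delivered-To:
and Status: are gone.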