user/dev discussion of public-inbox itself
From: "Nicolás Ojeda Bär" <n.oje.bar@gmail.com>
To: Eric Wong <e@80x24.org>
Cc: meta@public-inbox.org
Subject: Re: Relationship between public-inbox and ssoma?
Date: Mon, 5 Mar 2018 19:06:06 +0100
Message-ID: <CAPunWhCKu_8xWYP_PB6Z2A5mxL10o-e1vFBYhiVKXSHBXYJMug@mail.gmail.com>
In-Reply-To: <20180305175007.GA19007@whir>

Hi Eric,

Thanks for the quick reply.

On Mon, Mar 5, 2018 at 6:50 PM, Eric Wong <e@80x24.org> wrote:
> Nicolás Ojeda Bär <n.oje.bar@gmail.com> wrote:
>> Hello Eric,
>>
>> Thanks for the prompt reply.  I am trying to migrate a long-lived
>> mailing list (65k messages over 26 years). Below are some
>> troubles/questions I am having; any suggestions would be greatly
>> appreciated.
>>
>> - public-inbox-watch seems to struggle with very big maildirs; for now
>>   I am moving the data into the maildir a little at a time and that
>>   seems to work. Is there a particular obstacle to making the
>>   importing process more incremental?
>
> Do you know if it's SpamAssassin being slow?
>
> I disable network checks for large imports in ~/.spamassassin/user_prefs
> (if I'm using SA at all during the imports):
> # uncomment the following for importing archives:
> # dns_available no
> # skip_rbl_checks 1
> # skip_uribl_checks 1

I don't think it is even installed and I have not set it up at all, so
probably not.

> Fwiw, large directories are a performance killer in any
> application.  Seek times and cache overheads are two problems,
> at least, so an SSD will definitely help; and maybe even shorter
> filenames.

OK.

> I usually prefer one-off scripts like
> scripts/import_vger_from_mbox for initial imports and store
> large archives in compressed mboxes instead of Maildir.  Lack of
> mbox support is one reason I never used notmuch despite studying
> it.

Thanks for the pointer, I will take a look, hopefully it will nudge me
in the right direction.
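
For readers following the thread, a batched import from an mbox can be sketched roughly as below. This is only an illustration in Python (public-inbox itself is Perl), and it is not how scripts/import_vger_from_mbox works; the function name `iter_batches` and the batch size are assumptions made for the example.

```python
import mailbox
from itertools import islice


def iter_batches(mbox_path, batch_size=1000):
    """Yield lists of at most `batch_size` messages from an mbox file.

    Hypothetical helper: it only shows the idea of feeding an importer
    a little at a time instead of one huge directory at once.
    """
    # create=False makes a missing file an error instead of creating it
    box = mailbox.mbox(mbox_path, create=False)
    try:
        messages = iter(box)
        while True:
            batch = list(islice(messages, batch_size))
            if not batch:
                break
            yield batch
    finally:
        box.close()
```

Each yielded batch could then be delivered into the watched Maildir (for instance with `mailbox.Maildir(...).add(msg)`) before the next one is read, which keeps the directory small at any given moment.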

>> - Trouble due to missing/malformed headers (mostly on very old
>> messages). For example, here is the header of a message that trips
>> public-inbox-watch:
>>
>> From weis@margaux  Fri Nov 27 16:24:50 1992
>> Received: by margaux.inria.fr, Fri, 27 Nov 92 16:24:50 +0100
>> Message-ID: <9211271524.AA29971@margaux.inria.fr>
>> To: caml-list@margaux
>> Sender: weis@margaux
>> Status: O
>>
>> The error is: fatal: Invalid rfc2822 date "" in ident:  <> (I guess
>> due to the lack of a Date: field). I added a Date: field just to
>> test and noticed that Author: in the git commit was empty, I guess
>> due to the use of Sender: rather than From: header.
>
> I have a patch in the wings to use the Received: date:
>
>  https://public-inbox.org/meta/20180215110840.30413-16-e@80x24.org/raw
>
> And I'm thinking about favoring Received: over Date: if both
> exist, since Date: headers are more often wrong...

Great, I will try your patch to see if I can get my messages past
public-inbox-watch.

>> Do you think it is feasible to improve public-inbox-watch to try to
>> extract the date from some other header like above?
>> and to use Sender: when From: is not found?
>
> Sure, I suppose falling back to Sender is correct if From is
> missing.

OK, I will see if I can patch this on my own, since I am keen on
getting this mailing list imported.
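
The two fallbacks discussed above (a date taken from Received: when Date: is missing or unparsable, and Sender: when From: is missing) could look roughly like the following. This is a Python sketch under stated assumptions, not the patch linked above and not public-inbox code; `guess_date`, `guess_author`, and the date regex are made up for illustration. Note that the Received: header in the example message separates the date with a comma rather than the usual semicolon, so the sketch searches for a date-shaped substring instead of splitting on ";".

```python
import re
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

# Loose pattern for an RFC 822-style date such as
# "Fri, 27 Nov 92 16:24:50 +0100" (an assumption, not a full grammar).
_DATE_RE = re.compile(
    r'(?:[A-Z][a-z]{2},\s*)?\d{1,2}\s+[A-Z][a-z]{2}\s+\d{2,4}\s+'
    r'\d{1,2}:\d{2}(?::\d{2})?(?:\s+[+-]\d{4})?'
)


def _parse(raw):
    """Return a datetime, or None if `raw` cannot be parsed."""
    try:
        return parsedate_to_datetime(raw)
    except (TypeError, ValueError):
        return None


def guess_date(msg):
    """Prefer Date:, then fall back to the first parsable Received: date."""
    raw = msg.get('Date')
    if raw:
        parsed = _parse(raw)
        if parsed is not None:
            return parsed
    for received in msg.get_all('Received', []):
        match = _DATE_RE.search(received)
        if match:
            parsed = _parse(match.group(0))
            if parsed is not None:
                return parsed
    return None


def guess_author(msg):
    """Return the address from From:, falling back to Sender:."""
    raw = msg.get('From') or msg.get('Sender') or ''
    return parseaddr(raw)[1]


sample = message_from_string(
    "Received: by margaux.inria.fr, Fri, 27 Nov 92 16:24:50 +0100\n"
    "Message-ID: <9211271524.AA29971@margaux.inria.fr>\n"
    "To: caml-list@margaux\n"
    "Sender: weis@margaux\n"
    "Status: O\n"
    "\n"
    "body\n"
)
print(guess_date(sample), guess_author(sample))
```

Whether Received: should instead take priority over Date: when both exist, as suggested above, is a policy choice; swapping the two steps in `guess_date` would implement that variant.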

>> - There are some messages that do not have Message-Id, but
>> public-inbox-watch seems to be able to handle them.
>
> Yes, we generate a Message-Id if one is missing
>
>>   Is it the case that Date: is the only header that is absolutely
>> necessary for public-inbox-watch to process the message?
>
> Probably none of them are, actually.

Currently, public-inbox-watch refuses to process the message with the
header quoted above due to a missing Date: header.

>> - Does public-inbox-watch ever modify the message data?
>
> Message-ID generation is one modification (when it is missing).
> Status, Lines, Bytes, Content-Length, and @BAD_HEADERS in
> lib/PublicInbox/MDA.pm are all dropped:
>
> our @BAD_HEADERS = (
>         # postfix
>         qw(delivered-to x-original-to), # prevent training loops
>
>         # The rest are taken from Mailman 2.1.15:
>         # could contain passwords:
>         qw(approved approve x-approved x-approve urgent),
>         # could be used for phishing:
>         qw(return-receipt-to disposition-notification-to x-confirm-reading-to),
>         # Pegasus mail:
>         qw(x-pmrqc)
> );
>
> Email::MIME might modify invalid characters in the headers (or
> if there are bugs in Email::MIME).  I don't think bodies are
> modified outside of the not-really-documented
> PublicInbox::Filter API.  You can check out some filters at
> lib/PublicInbox/Filter/*.pm (some commit messages document them,
> but I don't think there are manpages, yet)

OK, will take a look.
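
To make the header dropping described above concrete, here is a rough Python equivalent. It is only a sketch: the real logic lives in lib/PublicInbox/MDA.pm and is written in Perl, and the names `DROPPED_HEADERS` and `strip_headers` are assumptions for this example. The header names themselves are the ones listed in the quoted reply.

```python
from email import message_from_string

# Headers named in the reply above: Status, Lines, Bytes, Content-Length,
# plus the entries of @BAD_HEADERS.
DROPPED_HEADERS = (
    'status', 'lines', 'bytes', 'content-length',
    'delivered-to', 'x-original-to',
    'approved', 'approve', 'x-approved', 'x-approve', 'urgent',
    'return-receipt-to', 'disposition-notification-to',
    'x-confirm-reading-to',
    'x-pmrqc',
)


def strip_headers(msg, names=DROPPED_HEADERS):
    """Remove every occurrence of the listed headers from `msg` in place."""
    for name in names:
        # Deleting is case-insensitive and a no-op when the header is absent.
        del msg[name]
    return msg
```

Running this over a copy of a message and comparing the result with what ends up in the archive is one way to check which parts of the original data were changed during import.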

>> - In general public-inbox-watch prints very little about what it is
>>   doing, which makes it hard(er) to trace problems; a verbose flag
>>   would be a nice addition, I think.
>
> I usually use strace on Linux to track down problems.  I'm not
> sure it's worth the effort to introduce new options/features
> if generic tracing utilities are more detailed and accurate.
>

Makes sense. Thanks for the suggestion.

> Also, I'm going to be mostly offline for about a week starting
> tomorrow; so don't expect prompt replies for a bit.

Sure, thanks for the heads-up.

Best wishes,
Nicolás

Thread overview: 12+ messages
2018-03-05  0:54 Relationship between public-inbox and ssoma? Nicolás Ojeda Bär
2018-03-05  2:07 ` Eric Wong
2018-03-05 11:45   ` Nicolás Ojeda Bär
2018-03-05 17:50     ` Eric Wong
2018-03-05 18:06       ` Nicolás Ojeda Bär [this message]
2018-03-19  7:43         ` watch performance [was: Relationship between public-inbox and ssoma?] Eric Wong
2018-03-15 15:30   ` internal format (was: Relationship between public-inbox and ssoma?) Stefan Monnier
2018-03-15 16:40     ` Eric Wong
2018-03-15 18:49       ` internal format Stefan Monnier
2018-03-15 20:14         ` Eric Wong
2018-03-15 21:05           ` Stefan Monnier
2018-03-15 21:21             ` Eric Wong
