user/dev discussion of public-inbox itself
From: Eric Wong <e@80x24.org>
To: Ralf Ramsauer <ralf.ramsauer@oth-regensburg.de>
Cc: meta@public-inbox.org, Lukas Bulwahn <lukas.bulwahn@gmail.com>
Subject: Re: Usage of public-inbox with maildirs
Date: Thu, 21 Mar 2019 03:35:03 +0000
Message-ID: <20190321033503.uukzpw7o7viobfgo@dcvr>
In-Reply-To: <745d6a8e-7e7c-8c61-336b-105cf9570ab7@oth-regensburg.de>

Ralf Ramsauer <ralf.ramsauer@oth-regensburg.de> wrote:
> Hi,
> 
> we want to archive a fair amount of mailing lists (~160 lists) with
> public-inbox.
> 
> Therefore, we subscribed to all of those lists with a single email
> address. Mails are periodically fetched and stored in a local maildir
> via IMAP. Mails are currently not pre-filtered or sorted, all of them
> are bunched in a single maildir.
> 
> So every [publicinbox] config entry has the same 'watch' entry for the
> maildir, but all have their own watchheader to be sensitive on different
> lists.
> 
> Is this the intended way to use public-inbox, or should we rather place
> mails from different lists in different maildirs before processing them
> with public-inbox?

Yes, it's supported since this year:

	commit ed3b90b7a203fe5513894d01d478f6104cdff897
	Date:   Sat Jan 5 00:35:42 2019 +0000

	("watchmaildir: support multiple inboxes in the same Maildir")

Sorry, haven't hit a good point to make a release, and I'm not too
good at release management :<

> Secondly, I wrote a script that automatically creates the
> public-inbox config together with empty, bare git repositories for every
> list.
> 
> A config entry looks like:
> 
>     [publicinbox "listid"]
>         address = post@listid.org
>         mainrepo = /path/to/repo
>         watch = maildir:/path/to/maildir
>         watchheader = List-Id:<listid>

All looks fine to me.
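
For illustration, a minimal Python sketch of how such a generator
might render one stanza per list. The function names and the input
dictionaries are hypothetical; only the keys (address, mainrepo,
watch, watchheader) are taken from the entry quoted above, and every
inbox points at the same Maildir.

```python
def render_inbox_stanza(name, address, repo, maildir, list_id):
    """Render one [publicinbox] section in git-config syntax."""
    return (
        f'[publicinbox "{name}"]\n'
        f"\taddress = {address}\n"
        f"\tmainrepo = {repo}\n"
        f"\twatch = maildir:{maildir}\n"
        f"\twatchheader = List-Id:<{list_id}>\n"
    )


def render_config(lists, maildir):
    """Join the stanzas of many lists that all watch the same Maildir.

    `lists` is a hypothetical iterable of dicts with the keys
    name, address, repo and list_id.
    """
    return "\n".join(
        render_inbox_stanza(
            entry["name"], entry["address"], entry["repo"],
            maildir, entry["list_id"],
        )
        for entry in lists
    )
```

The output can be appended to the public-inbox config file; creating
the repositories themselves is left out of this sketch.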

> Our maildir currently contains ~120k mails for the initial import, and
> this raised some new questions:
> 
> 1. It appears that the initial import with public-inbox-watch is very
>    slow. After stracing the perl script, it looks like
>    public-inbox-watch lstats every single mail. After an hour of not
>    inserting any mail into a repo, I canceled the process and restarted
>    it on a smaller initial subset. This works better, but is still slow.
>    (~4k mails in 10 minutes, feels like constantly getting slower)

v1 gets slower as repositories get bigger.  v2 is barely
affected by that.  Are you sure it wasn't importing?  The
fast-import processes may not be writing out frequently enough.

>    If public-inbox-watch is restarted for some reason (e.g., system
>    reboot), will it stat every single mail again on startup?

Yes.  However, the scan runs at a low priority compared to
freshly-arrived mail if you have the Linux::Inotify2 module
installed for Filesys::Notify::Simple to use.

>    IOW, should old mails be removed from the maildir and/or will they
>    cause performance impacts? Is there a way to automatically delete
>    processed mails?

Yes, old mails should be removed.
I have a cronjob doing something like:

  find "$MAILDIR" -type f -ctime +"$AGE_DAYS" -print0 | xargs -0 rm -f

AGE_DAYS can be whatever you're comfortable with.

Fwiw, I run public-inbox-watch and the find|rm cronjob as
different users, so public-inbox-watch can rely on read-only
access to a Maildir while rm(1) (obviously) needs write access
to the Maildir.
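
The same cleanup can be sketched in Python, which makes a dry run easy
before anything is deleted. This is only an approximation of the
find(1) invocation: find rounds file ages down to whole days, while
this sketch compares against an exact cutoff. The function names and
the `dry_run` and `now` parameters are hypothetical.

```python
import os
import stat
import time


def expired_files(maildir, age_days, now=None):
    """Yield regular files under maildir whose inode change time
    (ctime) lies more than age_days days in the past."""
    now = time.time() if now is None else now
    cutoff = now - age_days * 86400
    for root, _dirs, files in os.walk(maildir):
        for name in files:
            path = os.path.join(root, name)
            try:
                st = os.lstat(path)
            except FileNotFoundError:
                # The file was renamed or removed while we were scanning.
                continue
            if stat.S_ISREG(st.st_mode) and st.st_ctime < cutoff:
                yield path


def purge(maildir, age_days, dry_run=True, now=None):
    """Return the expired paths; delete them unless dry_run is set."""
    paths = list(expired_files(maildir, age_days, now=now))
    if not dry_run:
        for path in paths:
            try:
                os.remove(path)
            except FileNotFoundError:
                pass
    return paths
```

Calling purge() with dry_run=True first shows what a real run would
remove.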

> 2. public-inbox-watch seems to fill the repositories with the 'old' v1
>    layout, and I don't know how to switch to v2. Is there a config
>    parameter for that?
> 
>    I found the v1-v2 convert script, but I'd like to directly initialise
>    it with the newer version, if possible.

Use "-V2" with public-inbox-init.

Perhaps it could become the default iff SQLite+Xapian are
installed.
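
A generator script could run the init step once per list. The sketch
below only builds and runs the command line; it assumes the positional
order NAME, inbox directory, HTTP URL, address described in
public-inbox-init(1), so check the manpage of your installed version.
The function names and the `http_url` values are hypothetical.

```python
import subprocess


def init_v2_command(name, inbox_dir, http_url, address):
    """Build the argv for initialising one inbox in the v2 layout.

    The argument order is an assumption based on public-inbox-init(1).
    """
    return ["public-inbox-init", "-V2", name, inbox_dir, http_url, address]


def init_all(lists, run=subprocess.check_call):
    """Initialise every inbox in `lists` (dicts with the keys name,
    inbox_dir, http_url and address); `run` is injectable for testing."""
    for entry in lists:
        run(init_v2_command(
            entry["name"], entry["inbox_dir"],
            entry["http_url"], entry["address"],
        ))
```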

> 3. On the initial import, public-inbox-watch seems to randomly insert
>    mails into repositories. In the end, coverage matters more than
>    hierarchy, but is there a way to do the initial import sorted by
>    date?

You can use (or derive from) scripts/import_vger_from_mbox if you
have sorted mboxes.

The main benefit of sorting would be to ensure NNTP article
numbers roughly match the dates.  Otherwise, the HTTP interface
won't care about ordering.

I suppose you could import the first time into a throwaway inbox,
fetch http://$HOST/$INBOX/all.mbox.gz
and zcat the result of that to scripts/import_vger_from_mbox
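
If no sorted mbox is at hand, one could also be produced straight from
the Maildir. The following is a minimal Python sketch using only the
standard library; it sorts by the Date: header and treats mails with a
missing or unparsable date as the Unix epoch, so they come first. The
function names are hypothetical, and whether the resulting mbox is
accepted as-is by scripts/import_vger_from_mbox is an assumption.

```python
import mailbox
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def message_date(msg):
    """Return an aware datetime parsed from the Date: header,
    or EPOCH if the header is missing or cannot be parsed."""
    raw = msg["Date"]
    if raw is None:
        return EPOCH
    try:
        parsed = parsedate_to_datetime(str(raw))
    except (TypeError, ValueError):
        return EPOCH
    if parsed.tzinfo is None:
        # Dates without a usable zone are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def maildir_to_sorted_mbox(maildir_path, mbox_path):
    """Append all mails of a Maildir to an mbox file, oldest first.
    Returns the number of mails written."""
    src = mailbox.Maildir(maildir_path, create=False)
    keyed = sorted((message_date(src[key]), key) for key in src.keys())
    out = mailbox.mbox(mbox_path)
    try:
        for _date, key in keyed:
            out.add(src[key])
        out.flush()
    finally:
        out.close()
    return len(keyed)
```

For ~120k mails this reads every message twice (once for the date,
once for the copy), which keeps memory use low at the cost of extra
I/O.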

> Thanks a lot!

no prob :>

