From: Eric Wong
To: Ralf Ramsauer
Cc: meta@public-inbox.org, Lukas Bulwahn
Subject: Re: Usage of public-inbox with maildirs
Date: Thu, 21 Mar 2019 03:35:03 +0000
Message-ID: <20190321033503.uukzpw7o7viobfgo@dcvr>
In-Reply-To: <745d6a8e-7e7c-8c61-336b-105cf9570ab7@oth-regensburg.de>
References: <745d6a8e-7e7c-8c61-336b-105cf9570ab7@oth-regensburg.de>

Ralf Ramsauer wrote:
> Hi,
>
> we want to archive a fair amount of mailing lists (~160 lists) with
> public-inbox.
>
> Therefore, we subscribed to all of those lists with a single email
> address. Mails are periodically fetched and stored in a local maildir
> via IMAP. Mails are currently not pre-filtered or sorted, all of them
> are bunched in a single maildir.
>
> So every [publicinbox] config entry has the same 'watch' entry for the
> maildir, but all have their own watchheader to be sensitive on different
> lists.
>
> Is this the intended way to use public-inbox, or should we rather place
> mails from different lists in different maildirs before processing them
> with public-inbox?

Yes, it's supported since this year:

  commit ed3b90b7a203fe5513894d01d478f6104cdff897
  Date:   Sat Jan 5 00:35:42 2019 +0000
  ("watchmaildir: support multiple inboxes in the same Maildir")

Sorry, haven't hit a good point to make a release, and I'm not
too good at release management :<

> Secondly, I wrote a script that automatically creates the
> public-inbox config together with empty, bare git repositories for every
> list.
>
> A config entry looks like:
>
> [publicinbox "listid"]
>         address = post@listid.org
>         mainrepo = /path/to/repo
>         watch = maildir:/path/to/maildir
>         watchheader = List-Id:

All looks fine to me.

> Our maildir currently contains ~120k mails for the initial import, and
> this raised some new questions:
>
> 1. It appears that the initial import with public-inbox-watch is very
>    slow. After stracing the perl script, it looks like
>    public-inbox-watch lstats every single mail. After an hour of not
>    inserting any mail into a repo, I canceled the process and restarted
>    it on a smaller initial subset. This works better, but is still slow.
>    (~4k mails in 10 minutes, feels like constantly getting slower)

v1 gets slower as repositories get bigger. v2 is barely
affected by that.

Are you sure it wasn't importing? The fast-import processes
may not be writing out frequently enough.

>    If public-inbox-watch is restarted for some reason (e.g., system
>    reboot), will it stat every single mail again on startup?

Yes. However, the scan is at a low priority compared to
freshly-arrived mail if you have the Linux::Inotify2 module
installed for Filesys::Notify::Simple to use.

>    IOW, should old mails be removed from the maildir and/or will they
>    cause performance impacts? Is there a way to automatically delete
>    processed mails?

Yes, old mails should be removed. I have a cronjob doing
something like:

	find $MAILDIR -ctime +$AGE_DAYS -type f | xargs rm -f

AGE_DAYS can be whatever you're comfortable with.

Fwiw, I run public-inbox-watch and the find|rm cronjob as
different users, so public-inbox-watch can rely on read-only
access to a Maildir while rm(1) (obviously) needs write access
to the Maildir.

> 2. public-inbox-watch seems to fill the repositories with the 'old' v1
>    layout, and I don't know how to switch to v2. Is there a config
>    parameter for that?
>
>    I found the v1-v2 convert script, but I'd like to directly initialise
>    it with the newer version, if possible.
Use "-V2" with public-inbox-init. Perhaps it could become the
default iff SQLite+Xapian are installed.

> 3. On the initial import, public-inbox-watch seems to randomly insert
>    mails into repositories. In the end, coverage matters more than
>    hierarchy, but is there a way to do the initial import sorted by
>    date?

You can use (or derive from) scripts/import_vger_from_mbox
if you have sorted mboxes.

The main benefit of sorting would be to ensure NNTP article
numbers roughly match the dates. Otherwise, the HTTP interface
won't care about ordering.

I suppose you could import the first time into a throwaway inbox,
fetch http://$HOST/$INBOX/all.mbox.gz and zcat the result of that
to scripts/import_vger_from_mbox

> Thanks a lot!

no prob :>
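
For reference, the shared-Maildir layout discussed above (one Maildir, one
[publicinbox] section per list, each with its own watchheader) could be
generated with a small shell sketch like the one below. It only prints
config text and does not call any public-inbox tool. The helper name
emit_stanza, the example.org addresses, the List-Id values and the
REPO_ROOT variable are all made up for illustration, and writing the
List-Id value in angle brackets is an assumption about how the lists
identify themselves.

```shell
#!/bin/sh
# Sketch only: print one [publicinbox] section per list, all watching
# the same Maildir.  Every name, address and path here is a placeholder.

MAILDIR=${MAILDIR:-/path/to/maildir}    # shared Maildir (placeholder)
REPO_ROOT=${REPO_ROOT:-/path/to/repos}  # parent of per-list repos (placeholder)

# emit_stanza NAME ADDRESS LIST_ID
# Uses the same keys as the config entry quoted in this thread.
emit_stanza() {
	name=$1
	address=$2
	list_id=$3
	printf '[publicinbox "%s"]\n' "$name"
	printf '\taddress = %s\n' "$address"
	printf '\tmainrepo = %s/%s.git\n' "$REPO_ROOT" "$name"
	printf '\twatch = maildir:%s\n' "$MAILDIR"
	# assumption: the List-Id header value is written as <list-id>
	printf '\twatchheader = List-Id:<%s>\n' "$list_id"
}

# Two hypothetical lists sharing one Maildir
emit_stanza foo foo@example.org foo.example.org
emit_stanza bar bar@example.org bar.example.org
```

Redirecting the output into the public-inbox config file (after checking it
by hand) would give every list the same watch entry and a distinct
watchheader, which is the arrangement described at the top of this thread.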