user/dev discussion of public-inbox itself
* Re: Usage of public-inbox with maildirs
  @ 2019-03-21  3:35  5% ` Eric Wong
  0 siblings, 0 replies; 3+ results
From: Eric Wong @ 2019-03-21  3:35 UTC (permalink / raw)
  To: Ralf Ramsauer; +Cc: meta, Lukas Bulwahn

Ralf Ramsauer <ralf.ramsauer@oth-regensburg.de> wrote:
> Hi,
> 
> we want to archive a fair amount of mailing lists (~160 lists) with
> public-inbox.
> 
> Therefore, we subscribed to all of those lists with a single email
> address. Mails are periodically fetched and stored in a local maildir
> via IMAP. Mails are currently not pre-filtered or sorted, all of them
> are bunched in a single maildir.
> 
> So every [publicinbox] config entry has the same 'watch' entry for the
> maildir, but all have their own watchheader to be sensitive on different
> lists.
> 
> Is this the intended way to use public-inbox, or should we rather place
> mails from different lists in different maildirs before processing them
> with public-inbox?

Yes, it's supported since this year:

	commit ed3b90b7a203fe5513894d01d478f6104cdff897
	Date:   Sat Jan 5 00:35:42 2019 +0000

	("watchmaildir: support multiple inboxes in the same Maildir")

Sorry, haven't hit a good point to make a release, and I'm not too
good at release management :<
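To picture how one shared Maildir can feed several inboxes, here is a rough shell sketch of the per-inbox header check. It is not public-inbox code: list_id_of and matches_watchheader are hypothetical helper names, and it assumes the watchheader value is compared against the List-Id header as a literal substring.

```shell
#!/bin/sh
# Hypothetical helpers, not part of public-inbox.
# Assumption: the watchheader value is matched as a literal substring.

# Print the first List-Id header value of FILE, with folded lines joined.
list_id_of() {
	awk '
		/^$/ { exit }                            # blank line ends the headers
		grab && /^[ \t]/ { val = val $0; next }  # folded continuation line
		grab { exit }                            # next header: stop reading
		tolower($0) ~ /^list-id:/ {
			sub(/^[^:]*:[ \t]*/, "")             # drop the "List-Id:" prefix
			val = $0
			grab = 1
		}
		END { if (grab) print val }
	' "$1"
}

# Succeed if the List-Id of FILE contains WANT, e.g. "<listid>".
matches_watchheader() {
	case "$(list_id_of "$1")" in
	*"$2"*) return 0 ;;
	*) return 1 ;;
	esac
}
```

With something like this, `matches_watchheader "$file" '<listid>'` could be used to pre-check which lists a given file in the Maildir belongs to before pointing -watch at it.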

> Secondly, I wrote a script that automatically creates the
> public-inbox config together with empty, bare git repositories for every
> list.
> 
> A config entry looks like:
> 
>     [publicinbox "listid"]
>         address = post@listid.org
>         mainrepo = /path/to/repo
>         watch = maildir:/path/to/maildir
>         watchheader = List-Id:<listid>

All looks fine to me.
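For anybody else generating entries for many lists, a minimal sketch of such a generator could look like the following. The function name emit_inbox_stanza and its argument order are made up for illustration; the keys are only the ones shown in the quoted entry above.

```shell
#!/bin/sh
# Hypothetical generator, not shipped with public-inbox.
# Arguments: NAME ADDRESS REPO MAILDIR LIST_ID
# Prints one [publicinbox] stanza using the keys from the entry above.
emit_inbox_stanza() {
	printf '[publicinbox "%s"]\n' "$1"
	printf '\taddress = %s\n' "$2"
	printf '\tmainrepo = %s\n' "$3"
	printf '\twatch = maildir:%s\n' "$4"
	printf '\twatchheader = List-Id:<%s>\n' "$5"
}
```

Calling `emit_inbox_stanza listid post@listid.org /path/to/repo /path/to/maildir listid` once per list and appending the output to the config file should reproduce entries of the shape quoted above.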

> Our maildir currently contains ~120k mails for the initial import, and
> this raised some new questions:
> 
> 1. It appears that the initial import with public-inbox-watch is very
>    slow. After stracing the perl script, it looks like
>    public-inbox-watch lstats every single mail. After an hour of not
>    inserting any mail into a repo, I canceled the process and restarted
>    it on a smaller initial subset. This works better, but is still slow.
>    (~4k mails in 10 minutes, feels like constantly getting slower)

v1 gets slower as repositories get bigger.  v2 is barely
affected by that.  Are you sure it wasn't importing?  The
fast-import processes may not be writing out frequently enough.

>    If public-inbox-watch is restarted for some reason (e.g., system
>    reboot), will it stat every single mail again on startup?

Yes.  However, the scan is at a low priority compared to
freshly-arrived mail if you have Linux::Inotify2 module
installed for Filesys::Notify::Simple to use.

>    IOW, should old mails be removed from the maildir and/or will they
>    cause performance impacts? Is there a way to automatically delete
>    processed mails?

Yes, old mails should be removed.
I have a cronjob doing something like:

  find $MAILDIR -ctime +$AGE_DAYS -type f | xargs rm -f

AGE_DAYS can be whatever you're comfortable with.

Fwiw, I run public-inbox-watch and the find|rm cronjob as
different users, so public-inbox-watch can rely on read-only
access to a Maildir while rm(1) (obviously) needs write access
to the Maildir.
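A slightly more defensive variant of that cronjob, as a sketch only: prune_maildir is a hypothetical wrapper name, and it uses find's own -exec instead of a pipe to xargs so that odd characters in filenames cannot split an argument.

```shell
#!/bin/sh
# Hypothetical wrapper around the find+rm cronjob above.
# Arguments: MAILDIR AGE_DAYS
# Removes regular files whose ctime is more than AGE_DAYS days old;
# directories, FIFOs and anything newer are left alone.
prune_maildir() {
	maildir=$1 age_days=$2
	[ -d "$maildir" ] || { echo "no such Maildir: $maildir" >&2; return 1; }
	find "$maildir" -type f -ctime "+$age_days" -exec rm -f -- {} +
}
```

A crontab line would then call something like `prune_maildir /path/to/maildir 30`, run as the user with write access to the Maildir.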

> 2. public-inbox-watch seems to fill the repositories with the 'old' v1
>    layout, and I don't know how to switch to v2. Is there a config
>    parameter for that?
> 
>    I found the v1-v2 convert script, but I'd like to directly initialise
>    it with the newer version, if possible.

Use "-V2" with public-inbox-init.

Perhaps it could become the default iff SQLite+Xapian are
installed.
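For a large batch of lists, a dry-run sketch that only prints the init commands might help for review before running anything. Assumptions here: the argument order NAME INBOX_DIR HTTP_URL ADDRESS is my reading of the public-inbox-init manpage, so check it against your installed version; print_v2_init_cmds, the base directory and the base URL are hypothetical.

```shell
#!/bin/sh
# Hypothetical dry-run helper: reads "NAME ADDRESS" pairs on stdin and
# prints one "public-inbox-init -V2" command per list without running it.
# Arguments: BASE_DIR BASE_URL (both are placeholders chosen by the caller)
# Assumed argument order for -init: NAME INBOX_DIR HTTP_URL ADDRESS
print_v2_init_cmds() {
	base_dir=$1 base_url=$2
	while read -r name address; do
		[ -n "$name" ] || continue   # skip blank lines
		printf 'public-inbox-init -V2 %s %s/%s %s/%s %s\n' \
			"$name" "$base_dir" "$name" "$base_url" "$name" "$address"
	done
}
```

Feeding it a file of "NAME ADDRESS" lines and reading the output over before piping it to sh keeps the batch init reviewable.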

> 3. On the initial import, public-inbox-watch seems to randomly insert
>    mails into repositories. In the end, coverage matters more than
>    hierarchy, but is there a way to do the initial import sorted by
>    date?

You can use (or derive from) scripts/import_vger_from_mbox if you
have sorted mboxes.

The main benefit for sorting would be to ensure NNTP article
numbers roughly match the dates.  Otherwise, the HTTP interface
won't care about ordering.

I suppose you could import the first time into a throwaway inbox,
fetch http://$HOST/$INBOX/all.mbox.gz
and zcat the result of that to scripts/import_vger_from_mbox

> Thanks a lot!

no prob :>


* [PATCH 1/3] watchmaildir: support multiple inboxes in the same Maildir
  2019-01-05  8:36  6% [PATCH 0/3] some watch fixes and improvements Eric Wong
@ 2019-01-05  8:36  7% ` Eric Wong
  0 siblings, 0 replies; 3+ results
From: Eric Wong @ 2019-01-05  8:36 UTC (permalink / raw)
  To: meta

Not sure what I was smoking when I originally wrote this code.

cf. https://public-inbox.org/meta/874li887mp.fsf@vuxu.org/
---
 lib/PublicInbox/WatchMaildir.pm | 54 +++++++++++++++++++++------------
 t/watch_maildir_v2.t            | 38 +++++++++++++++++++++--
 2 files changed, 69 insertions(+), 23 deletions(-)

diff --git a/lib/PublicInbox/WatchMaildir.pm b/lib/PublicInbox/WatchMaildir.pm
index b558cda..064cedf 100644
--- a/lib/PublicInbox/WatchMaildir.pm
+++ b/lib/PublicInbox/WatchMaildir.pm
@@ -20,6 +20,7 @@ package PublicInbox::WatchMaildir;
 sub new {
 	my ($class, $config) = @_;
 	my (%mdmap, @mdir, $spamc, $spamdir);
+	my %uniq;
 
 	# "publicinboxwatch" is the documented namespace
 	# "publicinboxlearn" is legacy but may be supported
@@ -32,7 +33,17 @@ sub new {
 				# skip "new", no MUA has seen it, yet.
 				my $cur = "$dir/cur";
 				$spamdir = $cur;
+				my $old = $mdmap{$cur};
+				if (ref($old)) {
+					foreach my $ibx (@$old) {
+						warn <<"";
+"$cur already watched for `$ibx->{name}'
+
+					}
+					die;
+				}
 				push @mdir, $cur;
+				$uniq{$cur}++;
 				$mdmap{$cur} = 'watchspam';
 			} else {
 				warn "unsupported $k=$dir\n";
@@ -58,10 +69,11 @@ sub new {
 			}
 			my $new = "$watch/new";
 			my $cur = "$watch/cur";
-			push @mdir, $new, $cur;
-			die "$new already in use\n" if $mdmap{$new};
-			die "$cur already in use\n" if $mdmap{$cur};
-			$mdmap{$new} = $mdmap{$cur} = $ibx;
+			push @mdir, $new unless $uniq{$new}++;
+			push @mdir, $cur unless $uniq{$cur}++;
+
+			push @{$mdmap{$new} ||= []}, $ibx;
+			push @{$mdmap{$cur} ||= []}, $ibx;
 		} else {
 			warn "watch unsupported: $k=$watch\n";
 		}
@@ -134,28 +146,30 @@ sub _try_path {
 		warn "unrecognized path: $path\n";
 		return;
 	}
-	my $inbox = $self->{mdmap}->{$1};
-	unless ($inbox) {
+	my $inboxes = $self->{mdmap}->{$1};
+	unless ($inboxes) {
 		warn "unmappable dir: $1\n";
 		return;
 	}
-	if (!ref($inbox) && $inbox eq 'watchspam') {
+	if (!ref($inboxes) && $inboxes eq 'watchspam') {
 		return _remove_spam($self, $path);
 	}
-	my $im = _importer_for($self, $inbox);
-	my $mime = _path_to_mime($path) or return;
-	my $wm = $inbox->{-watchheader};
-	if ($wm) {
-		my $v = $mime->header_obj->header_raw($wm->[0]);
-		return unless ($v && $v =~ $wm->[1]);
-	}
-	if (my $scrub = $inbox->filter) {
-		my $ret = $scrub->scrub($mime) or return;
-		$ret == REJECT() and return;
-		$mime = $ret;
-	}
+	foreach my $ibx (@$inboxes) {
+		my $mime = _path_to_mime($path) or next;
+		my $im = _importer_for($self, $ibx);
 
-	$im->add($mime, $self->{spamcheck});
+		my $wm = $ibx->{-watchheader};
+		if ($wm) {
+			my $v = $mime->header_obj->header_raw($wm->[0]);
+			next unless ($v && $v =~ $wm->[1]);
+		}
+		if (my $scrub = $ibx->filter) {
+			my $ret = $scrub->scrub($mime) or next;
+			$ret == REJECT() and next;
+			$mime = $ret;
+		}
+		$im->add($mime, $self->{spamcheck});
+	}
 }
 
 sub quit { trigger_scan($_[0], 'quit') }
diff --git a/t/watch_maildir_v2.t b/t/watch_maildir_v2.t
index fc002dc..3b5d2b8 100644
--- a/t/watch_maildir_v2.t
+++ b/t/watch_maildir_v2.t
@@ -38,13 +38,14 @@ ok(POSIX::mkfifo("$maildir/cur/fifo", 0777),
 	'create FIFO to ensure we do not get stuck on it :P');
 my $sem = PublicInbox::Emergency->new($spamdir); # create dirs
 
-my $config = PublicInbox::Config->new({
+my %orig = (
 	"$cfgpfx.address" => $addr,
 	"$cfgpfx.mainrepo" => $mainrepo,
 	"$cfgpfx.watch" => "maildir:$maildir",
 	"$cfgpfx.filter" => 'PublicInbox::Filter::Vger',
-	"publicinboxlearn.watchspam" => "maildir:$spamdir",
-});
+	"publicinboxlearn.watchspam" => "maildir:$spamdir"
+);
+my $config = PublicInbox::Config->new({%orig});
 my $ibx = $config->lookup_name('test');
 ok($ibx, 'found inbox by name');
 my $srch = $ibx->search;
@@ -137,4 +138,35 @@ More majordomo info at  http://vger.kernel.org/majordomo-info.html\n);
 	is($post->{blob}, $msgs->[0]->{blob}, 'same message');
 }
 
+# multiple inboxes in the same maildir
+{
+	my $v1repo = "$tmpdir/v1";
+	my $v1pfx = "publicinbox.v1";
+	my $v1addr = 'v1-public@example.com';
+	is(system(qw(git init -q --bare), $v1repo), 0, 'v1 init OK');
+	my $config = PublicInbox::Config->new({
+		%orig,
+		"$v1pfx.address" => $v1addr,
+		"$v1pfx.mainrepo" => $v1repo,
+		"$v1pfx.watch" => "maildir:$maildir",
+	});
+	my $both = <<EOF;
+From: user\@example.com
+To: $addr, $v1addr
+Subject: both
+Message-Id: <both\@b.com>
+Date: Sat, 18 Jun 2016 00:00:00 +0000
+
+both
+EOF
+	PublicInbox::Emergency->new($maildir)->prepare(\$both);
+	PublicInbox::WatchMaildir->new($config)->scan('full');
+	my ($total, $msgs) = $srch->reopen->query('m:both@b.com');
+	my $v1 = $config->lookup_name('v1');
+	my $msg = $v1->git->cat_file($msgs->[0]->{blob});
+	is($both, $$msg, 'got original message back from v1');
+	$msg = $ibx->git->cat_file($msgs->[0]->{blob});
+	is($both, $$msg, 'got original message back from v2');
+}
+
 done_testing;
-- 
EW



* [PATCH 0/3] some watch fixes and improvements
@ 2019-01-05  8:36  6% Eric Wong
  2019-01-05  8:36  7% ` [PATCH 1/3] watchmaildir: support multiple inboxes in the same Maildir Eric Wong
  0 siblings, 1 reply; 3+ results
From: Eric Wong @ 2019-01-05  8:36 UTC (permalink / raw)
  To: meta

Most notably, -watch now supports a Maildir with multiple lists.
(works better with the "watchheader" directive).

It should probably resolve symlinks in the future, too;
since I seem to recall some weirdness in that + Filesys::Notify::Simple

Eric Wong (3):
  watchmaildir: support multiple inboxes in the same Maildir
  watchmaildir: get rid of unused spamdir field
  watchmaildir: normalize Maildir pathnames consistently

 lib/PublicInbox/WatchMaildir.pm | 71 ++++++++++++++++++++-------------
 t/watch_maildir.t               |  9 +++++
 t/watch_maildir_v2.t            | 38 ++++++++++++++++--
 3 files changed, 88 insertions(+), 30 deletions(-)

-- 
EW


2019-01-05  8:36  6% [PATCH 0/3] some watch fixes and improvements Eric Wong
2019-01-05  8:36  7% ` [PATCH 1/3] watchmaildir: support multiple inboxes in the same Maildir Eric Wong
2019-03-20 22:28     Usage of public-inbox with maildirs Ralf Ramsauer
2019-03-21  3:35  5% ` Eric Wong

Code repositories for project(s) associated with this public inbox

	https://80x24.org/public-inbox.git
