user/dev discussion of public-inbox itself
 help / color / mirror / code / Atom feed
From: Eric Wong <e@yhbt.net>
To: meta@public-inbox.org
Subject: [PATCH 17/34] watch: use UID SEARCH to avoid empty UID FETCH
Date: Sat, 27 Jun 2020 10:03:43 +0000	[thread overview]
Message-ID: <20200627100400.9871-18-e@yhbt.net> (raw)
In-Reply-To: <20200627100400.9871-1-e@yhbt.net>

For mailboxes with many gaps in the UID sequence,
performing a UID SEARCH beforehand can reduce the
number of articles to fetch.

However, the downside to this is we may end up with
an arbitrarly large list of UIDs from the server.
---
 lib/PublicInbox/WatchMaildir.pm | 88 ++++++++++++++++++++-------------
 1 file changed, 54 insertions(+), 34 deletions(-)

diff --git a/lib/PublicInbox/WatchMaildir.pm b/lib/PublicInbox/WatchMaildir.pm
index 24989130979..b82b51025e6 100644
--- a/lib/PublicInbox/WatchMaildir.pm
+++ b/lib/PublicInbox/WatchMaildir.pm
@@ -335,6 +335,27 @@ sub mic_for ($$$) { # mic = Mail::IMAPClient
 	$mic;
 }
 
+sub imap_import_msg ($$$$$$) {
+	my ($self, $itrk, $url, $r_uidval, $uid, $raw) = @_;
+	# our target audience expects LF-only, save storage
+	$$raw =~ s/\r\n/\n/sg;
+
+	my $inboxes = $self->{imap}->{$url};
+	if (ref($inboxes)) {
+		for my $ibx (@$inboxes) {
+			my $eml = PublicInbox::Eml->new($$raw);
+			my $x = import_eml($self, $ibx, $eml);
+		}
+	} elsif ($inboxes eq 'watchspam') {
+		my $eml = PublicInbox::Eml->new($raw);
+		my $arg = [ $self, $eml, "$url UID:$uid" ];
+		$self->{config}->each_inbox(\&remove_eml_i, $arg);
+	} else {
+		die "BUG: destination unknown $inboxes";
+	}
+	$itrk->update_last($url, $r_uidval, $uid);
+}
+
 sub imap_fetch_all ($$$) {
 	my ($self, $mic, $uri) = @_;
 	my $sec = imap_section($uri);
@@ -367,52 +388,51 @@ sub imap_fetch_all ($$$) {
 	}
 	return if $l_uid >= $r_uid; # nothing to do
 
+	warn "I: $url fetching UID $l_uid:$r_uid\n";
 	$mic->Uid(1); # the default, we hope
+	my $uids;
 	my $req = $mic->imap4rev1 ? 'BODY.PEEK[]' : 'RFC822.PEEK';
 	my $key = $req;
 	$key =~ s/\.PEEK//;
-	my $inboxes = $self->{imap}->{$url};
-	warn "I: $url fetching $l_uid..$r_uid\n";
-	my $uid = -1;
+	my $uid;
 	my $warn_cb = $SIG{__WARN__} || sub { print STDERR @_ };
 	local $SIG{__WARN__} = sub {
+		$uid //= -1;
 		$warn_cb->("$url UID:$uid\n");
 		$warn_cb->(@_);
 	};
 	my $err;
-	$itrk->{dbh}->begin_work;
-	for my $u ($l_uid..$r_uid) {
-		$uid = $u;
-		local $0 = "UID:$uid $mbx $sec";
-		my $r = $mic->fetch_hash($uid, $req);
-		unless ($r) { # network error?
-			$err = "E: $url UID FETCH $uid error: $!\n";
-			last;
-		}
-
-		# messages get deleted, so holes appear
-		defined(my $raw = delete $r->{$uid}->{$key}) or next;
-
-		# our target audience expects LF-only, save storage
-		$raw =~ s/\r\n/\n/sg;
-
-		if (ref($inboxes)) {
-			for my $ibx (@$inboxes) {
-				my $eml = PublicInbox::Eml->new($raw);
-				my $x = import_eml($self, $ibx, $eml);
+	do {
+		$uids = $mic->search("UID $l_uid:*") or
+			return "E: $url UID SEARCH $l_uid:* error: $!";
+		return if scalar(@$uids) == 0;
+
+		# RFC 3501 doesn't seem to indicate order of UID SEARCH
+		# responses, so sort it ourselves
+		@$uids = sort { $a <=> $b } @$uids;
+
+		# Did we actually get new messages?
+		return if $uids->[0] < $l_uid;
+
+		$l_uid = $uids->[-1] + 1; # for next search
+
+		$itrk->{dbh}->begin_work;
+		while (defined(($uid = shift(@$uids)))) {
+			local $0 = "UID:$uid $mbx $sec";
+			my $r = $mic->fetch_hash($uid, $req);
+			unless ($r) { # network error?
+				$err = "E: $url UID FETCH $uid error: $!";
+				last;
 			}
-		} elsif ($inboxes eq 'watchspam') {
-			my $eml = PublicInbox::Eml->new($raw);
-			my $arg = [ $self, $eml, "$uri UID:$uid" ];
-			$self->{config}->each_inbox(\&remove_eml_i, $arg);
-		} else {
-			die "BUG: destination unknown $inboxes";
+			# messages get deleted, so holes appear
+			defined(my $raw = delete $r->{$uid}->{$key}) or next;
+			imap_import_msg($self, $itrk, $url, $r_uidval, $uid,
+					\$raw);
+			last if $self->{quit};
 		}
-		$itrk->update_last($url, $r_uidval, $uid);
-		last if $self->{quit};
-	}
-	_done_for_now($self);
-	$itrk->{dbh}->commit;
+		_done_for_now($self);
+		$itrk->{dbh}->commit;
+	} until ($err || $self->{quit});
 	$err;
 }
 

  parent reply	other threads:[~2020-06-27 10:04 UTC|newest]

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-06-27 10:03 [PATCH 00/34] watch: add IMAP and NNTP support Eric Wong
2020-06-27 10:03 ` [PATCH 01/34] inboxwritable: ensure ssoma.lock exists on init Eric Wong
2020-06-27 10:03 ` [PATCH 02/34] inbox: warn on ->on_inbox_unlock exception Eric Wong
2020-06-27 10:03 ` [PATCH 03/34] IMAPTracker: Add a helper to track our place in reading imap mailboxes Eric Wong
2020-06-27 10:03 ` [PATCH 04/34] imaptracker: use ~/.local/share/public-inbox/imap.sqlite3 Eric Wong
2020-06-27 10:03 ` [PATCH 05/34] watchmaildir: hoist out compile_watchheaders Eric Wong
2020-06-27 10:03 ` [PATCH 06/34] watchmaildir: fix check for spam vs ham inbox conflicts Eric Wong
2020-06-27 10:03 ` [PATCH 07/34] URI IMAP support Eric Wong
2020-06-27 10:03 ` [PATCH 08/34] watch: preliminary " Eric Wong
2020-06-27 10:03 ` [PATCH 09/34] kqnotify|fake_inotify: detect Maildir write ops Eric Wong
2020-06-27 10:03 ` [PATCH 10/34] watch: remove Filesys::Notify::Simple dependency Eric Wong
2020-06-27 10:03 ` [PATCH 11/34] watch: use signalfd for Maildir watching Eric Wong
2020-06-27 19:05   ` Kyle Meyer
2020-06-27 22:32     ` Eric Wong
2020-06-27 10:03 ` [PATCH 12/34] ds: remove fields.pm usage Eric Wong
2020-06-27 10:03 ` [PATCH 13/34] watch: wire up IMAP IDLE reapers to DS Eric Wong
2020-06-27 10:03 ` [PATCH 14/34] watch: support IMAP polling Eric Wong
2020-06-27 10:03 ` [PATCH 15/34] config: support ->urlmatch method for -watch Eric Wong
2020-06-27 10:03 ` [PATCH 16/34] watch: stop importers before forking Eric Wong
2020-06-27 10:03 ` Eric Wong [this message]
2020-06-27 10:03 ` [PATCH 18/34] ds: add_timer: allow passing arg to callback Eric Wong
2020-06-27 10:03 ` [PATCH 19/34] imaptracker: add {url} field to reduce args Eric Wong
2020-06-27 10:03 ` [PATCH 20/34] imaptracker: drop {dbname} field Eric Wong
2020-06-27 10:03 ` [PATCH 21/34] watch: avoid long transaction to IMAPTracker Eric Wong
2020-06-27 10:03 ` [PATCH 22/34] watch: support imap.fetchBatchSize parameter Eric Wong
2020-06-27 10:03 ` [PATCH 23/34] watch: imap: be quiet about disconnecting on quit Eric Wong
2020-06-27 10:03 ` [PATCH 24/34] watch: support multiple watch: directives per-inbox Eric Wong
2020-06-27 10:03 ` [PATCH 25/34] watch: remove {mdir} array Eric Wong
2020-06-27 10:03 ` [PATCH 26/34] watch: just use ->urlmatch Eric Wong
2020-06-27 10:03 ` [PATCH 27/34] testcommon: $ENV{TAIL} supports non-@ARGV redirects Eric Wong
2020-06-27 10:03 ` [PATCH 28/34] watch: add NNTP support Eric Wong
2020-06-27 19:06   ` Kyle Meyer
2020-06-27 10:03 ` [PATCH 29/34] watch: show user-specified URL consistently Eric Wong
2020-06-27 10:03 ` [PATCH 30/34] watch: enable autoflush for STDOUT and STDERR Eric Wong
2020-06-27 10:03 ` [PATCH 31/34] watch: use our own "git credential" wrapper Eric Wong
2020-06-27 10:03 ` [PATCH 32/34] watch: support ~/.netrc via Net::Netrc Eric Wong
2020-06-27 10:03 ` [PATCH 33/34] imaptracker: use flock(2) around writes Eric Wong
2020-06-27 10:04 ` [PATCH 34/34] watch: simplify internal structures Eric Wong
2020-06-29 10:34 ` [PATCH 0/5] watch: Maildir fixes Eric Wong
2020-06-29 10:34   ` [PATCH 1/5] watch: check for duplicates in ->over before spamcheck Eric Wong
2020-06-29 10:34   ` [PATCH 2/5] watch: show path for warnings from spam messages Eric Wong
2020-06-29 10:34   ` [PATCH 3/5] watch: ensure SIGCHLD works in forked children Eric Wong
2020-06-29 10:34   ` [PATCH 4/5] spawn: unblock SIGCHLD in subprocess Eric Wong
2020-07-07  6:17     ` [PATCH 6/5] t/spawn: fix test reliability Eric Wong
2020-06-29 10:34   ` [PATCH 5/5] watch: make waitpid() synchronous for Maildir scans Eric Wong
2020-06-29 10:37     ` Eric Wong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://public-inbox.org/README

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200627100400.9871-18-e@yhbt.net \
    --to=e@yhbt.net \
    --cc=meta@public-inbox.org \
    --subject='Re: [PATCH 17/34] watch: use UID SEARCH to avoid empty UID FETCH' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

user/dev discussion of public-inbox itself

This inbox may be cloned and mirrored by anyone:

	git clone --mirror https://public-inbox.org/meta
	git clone --mirror http://czquwvybam4bgbro.onion/meta
	git clone --mirror http://hjrcffqmbrq6wope.onion/meta
	git clone --mirror http://ou63pmih66umazou.onion/meta

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V1 meta meta/ https://public-inbox.org/meta \
		meta@public-inbox.org
	public-inbox-index meta

Example config snippet for mirrors.
Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.mail.public-inbox.meta
	nntp://7fh6tueqddpjyxjmgtdiueylzoqt6pt7hec3pukyptlmohoowvhde4yd.onion/inbox.comp.mail.public-inbox.meta
	nntp://ie5yzdi7fg72h7s4sdcztq5evakq23rdt33mfyfcddc5u3ndnw24ogqd.onion/inbox.comp.mail.public-inbox.meta
	nntp://4uok3hntl7oi7b4uf4rtfwefqeexfzil2w6kgk2jn5z2f764irre7byd.onion/inbox.comp.mail.public-inbox.meta
	nntp://news.gmane.io/gmane.mail.public-inbox.general
 note: .onion URLs require Tor: https://www.torproject.org/

code repositories for project(s) associated with this inbox:

	https://80x24.org/public-inbox.git

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git