user/dev discussion of public-inbox itself
* [PATCH 2/4] mda|learn|watch: support dropUniqueUnsubscribe config
  @ 2023-11-11  9:04  3% ` Eric Wong
  0 siblings, 0 replies; 13+ results
From: Eric Wong @ 2023-11-11  9:04 UTC (permalink / raw)
  To: meta

List-Unsubscribe headers with unique identifiers (such as those
generated by our examples/unsubscribe.milter) should not
end up in public archives.  Add a new config knob to strip
List-Unsubscribe headers if they have the
`List-Unsubscribe-Post: List-Unsubscribe=One-Click'
header.

Unfortunately, this breaks DKIM signatures if the signature
covers either of these List-Unsubscribe* headers.  However,
breaking DKIM is the lesser evil compared to any archive reader
being able to stop archival by an independent archivist.

As much as I would like this to be the default, it probably
affects few users at the moment since very few mailing lists
use unique identifiers in List-Unsubscribe (though that number
has grown recently).
---
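
For readers skimming the diff, the stripping rule can be sketched
in standalone Perl roughly as follows.  This is an illustration
under stated assumptions, not the code in this patch:
drop_unique_unsub and the [name, value] pair representation are
hypothetical; the actual change works on PublicInbox::Eml objects
via header_raw and header_set.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of public-inbox): takes a reference to
# an array of [name, value] header pairs and returns the pairs to keep.
sub drop_unique_unsub {
	my ($hdrs) = @_;

	# Count List-Unsubscribe-Post headers that signal one-click
	# unsubscribe (RFC 8058); the value must match exactly.
	my $one_click = grep {
		lc($_->[0]) eq 'list-unsubscribe-post' &&
			$_->[1] =~ /\AList-Unsubscribe=One-Click\z/
	} @$hdrs;

	# No one-click signal: leave every header alone.
	return @$hdrs unless $one_click;

	# Otherwise drop both List-Unsubscribe and List-Unsubscribe-Post.
	grep { lc($_->[0]) !~ /\Alist-unsubscribe(?:-post)?\z/ } @$hdrs;
}
```

A plain List-Unsubscribe header without the One-Click companion
is kept in this sketch, matching the intent described above.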
 Documentation/public-inbox-config.pod | 17 ++++++++++
 Documentation/public-inbox-learn.pod  | 19 +++++++++++
 Documentation/public-inbox-mda.pod    | 18 +++++++++-
 Documentation/public-inbox-watch.pod  |  6 +++-
 lib/PublicInbox/Import.pm             | 27 +++++++++++++++
 lib/PublicInbox/LeiToMail.pm          |  6 ++++
 lib/PublicInbox/Watch.pm              |  1 +
 script/public-inbox-learn             |  3 ++
 script/public-inbox-mda               |  4 +++
 script/public-inbox-watch             |  2 ++
 t/lei-import.t                        | 48 ++++++++++++++++++++++++++-
 t/mda.t                               | 41 ++++++++++++++++++++---
 t/watch_maildir.t                     | 30 +++++++++++++++--
 13 files changed, 212 insertions(+), 10 deletions(-)

diff --git a/Documentation/public-inbox-config.pod b/Documentation/public-inbox-config.pod
index 871ac6c5..1ef7f46f 100644
--- a/Documentation/public-inbox-config.pod
+++ b/Documentation/public-inbox-config.pod
@@ -196,6 +196,23 @@ and the path may be "/dev/null" or any empty file.
 Multiple files may be specified and will be included in the
 order specified.
 
+=item publicinboxImport.dropUniqueUnsubscribe
+
+Drop C<List-Unsubscribe> headers if the message also includes
+the C<List-Unsubscribe-Post: List-Unsubscribe=One-Click> header
+to signal MUAs to support an instantaneous unsubscribe.  This
+is strongly recommended for users creating their own public
+archives of mailing lists they subscribe to, otherwise any
+archive reader can unsubscribe the archivist.
+
+This may break DKIM signatures if the C<List-Unsubscribe*>
+headers are signed, but breaking DKIM signatures is the
+lesser evil compared to allowing any reader to unsubscribe
+the archivist.
+
+This affects L<public-inbox-mda(1)>, L<public-inbox-watch(1)>,
+and L<public-inbox-learn(1)>.
+
 =item publicinboxmda.spamcheck
 
 This may be set to C<none> to disable the use of SpamAssassin
diff --git a/Documentation/public-inbox-learn.pod b/Documentation/public-inbox-learn.pod
index f776df6b..b08e4bc8 100644
--- a/Documentation/public-inbox-learn.pod
+++ b/Documentation/public-inbox-learn.pod
@@ -73,6 +73,25 @@ Default: ~/.public-inbox/config
 
 =back
 
+=head1 CONFIGURATION
+
+These configuration knobs should be used in the
+L<public-inbox-config(5)> file.
+
+=over 8
+
+=item publicinboxImport.dropUniqueUnsubscribe
+
+=item publicinbox.<name>.address
+
+=item publicinbox.<name>.listid
+
+=item publicinboxmda.spamcheck
+
+See L<public-inbox-config(5)> for descriptions of these options.
+
+=back
+
 =head1 CONTACT
 
 Feedback welcome via plain-text mail to L<mailto:meta@public-inbox.org>
diff --git a/Documentation/public-inbox-mda.pod b/Documentation/public-inbox-mda.pod
index 93cb0e9c..edc90287 100644
--- a/Documentation/public-inbox-mda.pod
+++ b/Documentation/public-inbox-mda.pod
@@ -68,6 +68,22 @@ Default: ~/.public-inbox/emergency/
 
 =back
 
+=head1 CONFIGURATION
+
+Various configuration knobs should be used in the
+L<public-inbox-config(5)> file.
+
+=over 8
+
+=item publicinboxImport.dropUniqueUnsubscribe
+
+=item publicinbox.<name>.address
+
+=item publicinbox.<name>.listid
+
+See L<public-inbox-config(5)> for descriptions of these options.
+
+=back
 
 =head1 CONTACT
 
@@ -78,7 +94,7 @@ L<http://4uok3hntl7oi7b4uf4rtfwefqeexfzil2w6kgk2jn5z2f764irre7byd.onion/meta/>
 
 =head1 COPYRIGHT
 
-Copyright 2013-2021 all contributors L<mailto:meta@public-inbox.org>
+Copyright all contributors L<mailto:meta@public-inbox.org>
 
 License: AGPL-3.0+ L<https://www.gnu.org/licenses/agpl-3.0.txt>
 
diff --git a/Documentation/public-inbox-watch.pod b/Documentation/public-inbox-watch.pod
index febda0b1..7c21f7ce 100644
--- a/Documentation/public-inbox-watch.pod
+++ b/Documentation/public-inbox-watch.pod
@@ -66,6 +66,10 @@ L<public-inbox-config(5)> file.
 
 =over 8
 
+=item publicinboxImport.dropUniqueUnsubscribe
+
+See L<public-inbox-config(5)/publicinboxImport.dropUniqueUnsubscribe>
+
 =item publicinbox.<name>.watch
 
 A location to watch.  public-inbox 1.5.0 and earlier only supported
@@ -201,7 +205,7 @@ L<http://4uok3hntl7oi7b4uf4rtfwefqeexfzil2w6kgk2jn5z2f764irre7byd.onion/meta/>
 
 =head1 COPYRIGHT
 
-Copyright 2016-2021 all contributors L<mailto:meta@public-inbox.org>
+Copyright all contributors L<mailto:meta@public-inbox.org>
 
 License: AGPL-3.0+ L<https://www.gnu.org/licenses/agpl-3.0.txt>
 
diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index 2d60db55..e4f8615e 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -321,11 +321,38 @@ sub extract_cmt_info ($;$) {
 # kill potentially confusing/misleading headers
 our @UNWANTED_HEADERS = (qw(Bytes Lines Content-Length),
 			qw(Status X-Status));
+our $DROP_UNIQUE_UNSUB;
 sub drop_unwanted_headers ($) {
 	my ($eml) = @_;
 	for (@UNWANTED_HEADERS, @PublicInbox::MDA::BAD_HEADERS) {
 		$eml->header_set($_);
 	}
+
+	# We don't want public-inbox readers to be able to unsubscribe the
+	# address which does archiving.  WARNING: this breaks DKIM if the
+	# mailing list sender follows RFC 8058, section 4; but breaking DKIM
+	# (or having senders ignore RFC 8058 sec. 4) is preferable to having
+	# saboteurs unsubscribing independent archivists:
+	if ($DROP_UNIQUE_UNSUB && grep(/\AList-Unsubscribe=One-Click\z/,
+				$eml->header_raw('List-Unsubscribe-Post'))) {
+		for (qw(List-Unsubscribe-Post List-Unsubscribe)) {
+			$eml->header_set($_)
+		}
+	}
+}
+
+sub load_config ($;$) {
+	my ($cfg, $do_exit) = @_;
+	my $v = $cfg->{lc 'publicinboxImport.dropUniqueUnsubscribe'};
+	if (defined $v) {
+		$DROP_UNIQUE_UNSUB = $cfg->git_bool($v) // do {
+			warn <<EOM;
+E: publicinboxImport.dropUniqueUnsubscribe=$v in $cfg->{-f} is not boolean
+EOM
+			$do_exit //= \&CORE::exit;
+			$do_exit->(78); # EX_CONFIG
+		};
+	}
 }
 
 # used by V2Writable, too
diff --git a/lib/PublicInbox/LeiToMail.pm b/lib/PublicInbox/LeiToMail.pm
index b73af68a..0d2f586a 100644
--- a/lib/PublicInbox/LeiToMail.pm
+++ b/lib/PublicInbox/LeiToMail.pm
@@ -10,6 +10,7 @@ use PublicInbox::Eml;
 use PublicInbox::IO;
 use PublicInbox::Git;
 use PublicInbox::Spawn qw(spawn);
+use PublicInbox::Import;
 use IO::Handle; # ->autoflush
 use Fcntl qw(SEEK_SET SEEK_END O_CREAT O_EXCL O_WRONLY);
 use PublicInbox::Syscall qw(rename_noreplace);
@@ -672,6 +673,11 @@ sub _pre_augment_v2 {
 		});
 	}
 	PublicInbox::InboxWritable->new($ibx, @creat);
+	local $PublicInbox::Import::DROP_UNIQUE_UNSUB; # only for workers
+	PublicInbox::Import::load_config(PublicInbox::Config->new, sub {
+		$lei->x_it(shift);
+		die "E: can't write v2 inbox with broken config\n";
+	});
 	$ibx->init_inbox if @creat;
 	my $v2w = $ibx->importer;
 	$v2w->wq_workers_start("lei/v2w $dir", 1, $lei->oldset, {lei => $lei},
diff --git a/lib/PublicInbox/Watch.pm b/lib/PublicInbox/Watch.pm
index 1cdf12a5..5253ec94 100644
--- a/lib/PublicInbox/Watch.pm
+++ b/lib/PublicInbox/Watch.pm
@@ -45,6 +45,7 @@ sub new {
 	my (%mdmap);
 	my (%imap, %nntp); # url => [inbox objects] or 'watchspam'
 	my (@imap, @nntp);
+	PublicInbox::Import::load_config($cfg);
 
 	# "publicinboxwatch" is the documented namespace
 	# "publicinboxlearn" is legacy but may be supported
diff --git a/script/public-inbox-learn b/script/public-inbox-learn
index 8069d919..6a1bc890 100755
--- a/script/public-inbox-learn
+++ b/script/public-inbox-learn
@@ -28,6 +28,7 @@ use PublicInbox::Spamcheck::Spamc;
 use Getopt::Long qw(:config gnu_getopt no_ignore_case auto_abbrev);
 my %opt = (all => 0);
 GetOptions(\%opt, qw(all help|h)) or die $help;
+use PublicInbox::Import;
 
 my $train = shift or die $help;
 if ($train !~ /\A(?:ham|spam|rm)\z/) {
@@ -37,6 +38,8 @@ die "--all only works with `rm'\n" if $opt{all} && $train ne 'rm';
 
 my $spamc = PublicInbox::Spamcheck::Spamc->new;
 my $pi_cfg = PublicInbox::Config->new;
+local $PublicInbox::Import::DROP_UNIQUE_UNSUB;
+PublicInbox::Import::load_config($pi_cfg);
 my $err;
 my $mime = PublicInbox::Eml->new(do{
 	defined(my $data = do { local $/; <STDIN> }) or die "read STDIN: $!\n";
diff --git a/script/public-inbox-mda b/script/public-inbox-mda
index cac819ac..04fd8aad 100755
--- a/script/public-inbox-mda
+++ b/script/public-inbox-mda
@@ -16,6 +16,8 @@ use strict;
 use Getopt::Long qw(:config gnu_getopt no_ignore_case auto_abbrev);
 my ($ems, $emm, $show_help);
 my $precheck = 1;
+use PublicInbox::Import;
+local $PublicInbox::Import::DROP_UNIQUE_UNSUB; # does this need a CLI switch?
 GetOptions('precheck!' => \$precheck, 'help|h' => \$show_help) or
 	do { print STDERR $help; exit 1 };
 
@@ -47,6 +49,8 @@ my $key = 'publicinboxmda.spamcheck';
 my $default = 'PublicInbox::Spamcheck::Spamc';
 my $spamc = PublicInbox::Spamcheck::get($cfg, $key, $default);
 my $dests = [];
+PublicInbox::Import::load_config($cfg, $do_exit);
+
 my $recipient = $ENV{ORIGINAL_RECIPIENT};
 if (defined $recipient) {
 	my $ibx = $cfg->lookup($recipient); # first check
diff --git a/script/public-inbox-watch b/script/public-inbox-watch
index d9215de9..9bcd42ed 100755
--- a/script/public-inbox-watch
+++ b/script/public-inbox-watch
@@ -11,6 +11,8 @@ use strict;
 use Getopt::Long qw(:config gnu_getopt no_ignore_case auto_abbrev);
 use IO::Handle; # ->autoflush
 use PublicInbox::Watch;
+use PublicInbox::Import;
+local $PublicInbox::Import::DROP_UNIQUE_UNSUB;
 use PublicInbox::Config;
 use PublicInbox::DS;
 my $do_scan = 1;
diff --git a/t/lei-import.t b/t/lei-import.t
index 1edd607d..bd562617 100644
--- a/t/lei-import.t
+++ b/t/lei-import.t
@@ -3,7 +3,8 @@
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
 use v5.12; use PublicInbox::TestCommon;
 use PublicInbox::DS qw(now);
-use autodie qw(open close);
+use PublicInbox::IO qw(write_file);
+use autodie qw(open close truncate);
 test_lei(sub {
 ok(!lei(qw(import -F bogus), 't/plack-qp.eml'), 'fails with bogus format');
 like($lei_err, qr/\bis `eml', not --in-format/, 'gave error message');
@@ -180,6 +181,51 @@ SKIP: {
 		'EIO noted in stderr');
 }
 
+{
+	local $ENV{PI_CONFIG} = "$ENV{HOME}/pi_config";
+	write_file '>', $ENV{PI_CONFIG}, <<EOM;
+[publicinboxImport]
+	dropUniqueUnsubscribe
+EOM
+	my $in = <<EOM;
+List-Unsubscribe: <https://example.com/some-UUID-here/test>
+List-Unsubscribe-Post: List-Unsubscribe=One-Click
+Message-ID: <unsubscribe-1\@example>
+Subject: unsubscribe-1 example
+From: u\@example.com
+To: 2\@example.com
+Date: Fri, 02 Oct 1993 00:00:00 +0000
+
+EOM
+	lei_ok [qw(import -F eml +L:unsub)], undef, { %$lei_opt, 0 => \$in },
+		'import succeeds w/ List-Unsubscribe';
+	lei_ok qw(q L:unsub -f mboxrd);
+	like $lei_out, qr/some-UUID-here/,
+		'Unsubscribe header preserved despite PI_CONFIG dropping';
+	lei_ok qw(q L:unsub -o), "v2:$ENV{HOME}/v2-1";
+	lei_ok qw(q s:unsubscribe -f mboxrd --only), "$ENV{HOME}/v2-1";
+	unlike $lei_out, qr/some-UUID-here/,
+		'Unsubscribe header dropped w/ dropUniqueUnsubscribe';
+	like $lei_out, qr/Message-ID: <unsubscribe-1\@example>/,
+		'wrote expected message to v2 output';
+
+	# the default for compatibility:
+	truncate $ENV{PI_CONFIG}, 0;
+	lei_ok qw(q L:unsub -o), "v2:$ENV{HOME}/v2-2";
+	lei_ok qw(q s:unsubscribe -f mboxrd --only), "$ENV{HOME}/v2-2";
+	like $lei_out, qr/some-UUID-here/,
+		'Unsubscribe header preserved by default :<';
+
+	# ensure we can fail
+	write_file '>', $ENV{PI_CONFIG}, <<EOM;
+[publicinboxImport]
+	dropUniqueUnsubscribe = bogus
+EOM
+	ok(!lei(qw(q L:unsub -o), "v2:$ENV{HOME}/v2-3"), 'bad config fails');
+	like $lei_err, qr/is not boolean/, 'non-booleaness noted in stderr';
+	ok !-d "$ENV{HOME}/v2-3", 'v2 directory not created';
+}
+
 # see t/lei_to_mail.t for "import -F mbox*"
 });
 done_testing;
diff --git a/t/mda.t b/t/mda.t
index 83b0b33a..5144f3ca 100644
--- a/t/mda.t
+++ b/t/mda.t
@@ -8,6 +8,7 @@ use PublicInbox::Git;
 use PublicInbox::InboxWritable;
 use PublicInbox::TestCommon;
 use PublicInbox::Import;
+use PublicInbox::IO qw(write_file);
 use File::Path qw(remove_tree);
 my ($tmpdir, $for_destroy) = tmpdir();
 my $home = "$tmpdir/pi-home";
@@ -49,13 +50,11 @@ my $fail_bad_header = sub ($$$) {
 	is(1, mkdir($pi_home, 0755), "setup ~/.public-inbox");
 	PublicInbox::Import::init_bare($maindir);
 
-	open my $fh, '>>', $pi_config or die;
-	print $fh <<EOF or die;
+	write_file '>>', $pi_config, <<EOF;
 [publicinbox "test"]
 	address = $addr
 	inboxdir = $maindir
 EOF
-	close $fh or die;
 }
 
 local $ENV{GIT_COMMITTER_NAME} = eval {
@@ -306,10 +305,44 @@ EOF
 	# ensure -learn rm works after inbox address is updated
 	($out, $err) = ('', '');
 	xsys(qw(git config --file), $pi_config, "$cfgpfx.address",
-		'updated-address@example.com');
+		$addr = 'updated-address@example.com');
 	ok(run_script(['-learn', 'rm'], undef, $rdr), 'rm-ed via -learn');
 	$cur = $git->qx(qw(diff HEAD~1..HEAD));
 	like($cur, qr/^-Message-ID: <2lids\@example>/sm, 'changed in git');
+
+	# ensure we can strip List-Unsubscribe
+	$in = <<EOF;
+To: You <you\@example.com>
+List-Id: <$list_id>
+Message-ID: <unsubscribe-1\@example>
+Subject: unsubscribe-1
+From: user <user\@example.com>
+To: $addr
+Date: Fri, 02 Oct 1993 00:00:00 +0000
+List-Unsubscribe: <https://example.com/some-UUID-here/listname>
+List-Unsubscribe-Post: List-Unsubscribe=One-Click
+
+List-Unsubscribe should be stripped
+EOF
+	write_file '>>', $pi_config, <<EOM;
+[publicinboxImport]
+	dropUniqueUnsubscribe
+EOM
+	$out = $err = '';
+	ok(run_script([qw(-mda)], undef, $rdr), 'mda w/ dropUniqueUnsubscribe');
+	$cur = join('', grep(/^\+/, $git->qx(qw(diff HEAD~1..HEAD))));
+	like $cur, qr/Message-ID: <unsubscribe-1/, 'imported new message';
+	unlike $cur, qr/some-UUID-here/, 'List-Unsubscribe gone';
+	unlike $cur, qr/List-Unsubscribe-Post/i, 'List-Unsubscribe-Post gone';
+
+	$in =~ s/unsubscribe-1/unsubscribe-2/g or xbail 'BUG: s// fail';
+	ok(run_script([qw(-learn ham)], undef, $rdr),
+			'learn ham w/ dropUniqueUnsubscribe');
+	$cur = join('', grep(/^\+/, $git->qx(qw(diff HEAD~1..HEAD))));
+	like $cur, qr/Message-ID: <unsubscribe-2/, 'learn ham';
+	unlike $cur, qr/some-UUID-here/, 'List-Unsubscribe gone on learn ham';
+	unlike $cur, qr/List-Unsubscribe-Post/i,
+		'List-Unsubscribe-Post gone on learn ham';
 }
 
 SKIP: {
diff --git a/t/watch_maildir.t b/t/watch_maildir.t
index 29e9bdc5..69a5e1f3 100644
--- a/t/watch_maildir.t
+++ b/t/watch_maildir.t
@@ -6,6 +6,7 @@ use PublicInbox::Eml;
 use Cwd;
 use PublicInbox::TestCommon;
 use PublicInbox::Import;
+use PublicInbox::IO qw(write_file);
 my ($tmpdir, $for_destroy) = tmpdir();
 my $git_dir = "$tmpdir/test.git";
 my $maildir = "$tmpdir/md";
@@ -143,6 +144,10 @@ More majordomo info at  http://vger.kernel.org/majordomo-info.html\n);
 	my $env = { PI_CONFIG => $cfg_path };
 	$git->cleanup;
 
+	write_file '>>', $cfg_path, <<EOM;
+[publicinboxImport]
+	dropUniqueUnsubscribe
+EOM
 	# n.b. --no-scan is only intended for testing atm
 	my $wm = start_script([qw(-watch --no-scan)], $env);
 	no_pollerfd($wm->{pid});
@@ -194,13 +199,32 @@ More majordomo info at  http://vger.kernel.org/majordomo-info.html\n);
 	$em->commit; # wake -watch up
 	diag 'waiting for -watch to import new message';
 	PublicInbox::DS::event_loop();
+
+	my $head = $git->qx(qw(cat-file commit HEAD));
+	my $subj = $eml->header('Subject');
+	like($head, qr/^\Q$subj\E/sm, 'new commit made');
+
+	# try dropUniqueUnsubscribe
+	$delivered = 0;
+	$eml->header_set('Message-ID', '<unsubscribe@example>');
+	$eml->header_set('List-Unsubscribe',
+			'<https://example.com/some-UUID-here/test>');
+	$eml->header_set('List-Unsubscribe-Post', 'List-Unsubscribe=One-Click');
+	$em = PublicInbox::Emergency->new($maildir);
+	$em->prepare(\($eml->as_string));
+	$em->commit; # wake -watch up
+	diag 'waiting for -watch to import dropUniqueUnsubscribe message';
+	PublicInbox::DS::event_loop();
+	my $cur = $git->qx(qw(diff HEAD~1..HEAD));
+	like $cur, qr/Message-ID: <unsubscribe\@example>/,
+		'unsubscribe@example imported';
+	unlike $cur, qr/List-Unsubscribe\b/,
+		'List-Unsubscribe-* headers gone w/ dropUniqueUnsubscribe';
+
 	$wm->kill;
 	$wm->join;
 	$ii->close;
 	PublicInbox::DS->Reset;
-	my $head = $git->qx(qw(cat-file commit HEAD));
-	my $subj = $eml->header('Subject');
-	like($head, qr/^\Q$subj\E/sm, 'new commit made');
 }
 
 sub is_maildir {


* [PATCH 2/4] www: gzip_filter: update a few comments
  @ 2022-08-03  7:59  6% ` Eric Wong
  0 siblings, 0 replies; 13+ results
From: Eric Wong @ 2022-08-03  7:59 UTC (permalink / raw)
  To: meta

A few things I noticed while reviewing and evaluating
the PSGI code for JMAP support.
---
 lib/PublicInbox/GzipFilter.pm | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/lib/PublicInbox/GzipFilter.pm b/lib/PublicInbox/GzipFilter.pm
index c586d2f8..d41748c4 100644
--- a/lib/PublicInbox/GzipFilter.pm
+++ b/lib/PublicInbox/GzipFilter.pm
@@ -1,4 +1,4 @@
-# Copyright (C) 2020-2021 all contributors <meta@public-inbox.org>
+# Copyright (C) all contributors <meta@public-inbox.org>
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
 #
 # In public-inbox <=1.5.0, public-inbox-httpd favored "getline"
@@ -116,6 +116,7 @@ sub translate ($$) {
 	}
 }
 
+# returns PublicInbox::HTTP::{Chunked,Identity}
 sub http_out ($) {
 	my ($self) = @_;
 	$self->{http_out} // do {
@@ -183,7 +184,7 @@ sub bail  {
 # this is public-inbox-httpd-specific
 sub async_blob_cb { # git->cat_async callback
 	my ($bref, $oid, $type, $size, $self) = @_;
-	my $http = $self->{env}->{'psgix.io'};
+	my $http = $self->{env}->{'psgix.io'}; # PublicInbox::HTTP
 	$http->{forward} or return; # client aborted
 	my $smsg = $self->{smsg} or bail($self, 'BUG: no smsg');
 	if (!defined($oid)) {
@@ -195,7 +196,7 @@ sub async_blob_cb { # git->cat_async callback
 	$smsg->{blob} eq $oid or bail($self, "BUG: $smsg->{blob} != $oid");
 	eval { $self->async_eml(PublicInbox::Eml->new($bref)) };
 	bail($self, "E: async_eml: $@") if $@;
-	if ($self->{-low_prio}) {
+	if ($self->{-low_prio}) { # run via PublicInbox::WWW::event_step
 		push(@{$self->{www}->{-low_prio_q}}, $self) == 1 and
 				PublicInbox::DS::requeue($self->{www});
 	} else {


* [PATCH] doc: post-1.6 updates, start 1.7
@ 2020-09-19 21:42  4% Eric Wong
  0 siblings, 0 replies; 13+ results
From: Eric Wong @ 2020-09-19 21:42 UTC (permalink / raw)
  To: meta

I should've dropped "PENDING" notes before the 1.6 release;
they're dropped now, and a note is added to remind my future
self to drop them before 1.7.
---
 Documentation/RelNotes/v1.7.0.wip    | 16 ++++++++++++++++
 Documentation/public-inbox-index.pod | 14 +++++++-------
 Documentation/public-inbox-init.pod  |  6 +++---
 Documentation/public-inbox-learn.pod |  4 ++--
 Documentation/public-inbox-watch.pod |  2 +-
 Documentation/public-inbox-xcpdb.pod |  2 +-
 MANIFEST                             |  1 +
 Makefile.PL                          |  5 ++++-
 TODO                                 |  2 --
 9 files changed, 35 insertions(+), 17 deletions(-)
 create mode 100644 Documentation/RelNotes/v1.7.0.wip

diff --git a/Documentation/RelNotes/v1.7.0.wip b/Documentation/RelNotes/v1.7.0.wip
new file mode 100644
index 00000000..a35ff227
--- /dev/null
+++ b/Documentation/RelNotes/v1.7.0.wip
@@ -0,0 +1,16 @@
+To: meta@public-inbox.org
+Subject: [WIP] public-inbox 1.7.0
+MIME-Version: 1.0
+Content-Type: text/plain; charset=utf-8
+Content-Disposition: inline
+
+TODO: gcf2, detached indices, JMAP, ...
+
+Compatibility:
+
+* Rollbacks all the way to public-inbox 1.2.0 remain supported
+
+Please report bugs via plain-text mail to: meta@public-inbox.org
+
+See archives at https://public-inbox.org/meta/ for all history.
+See https://public-inbox.org/TODO for what the future holds.
diff --git a/Documentation/public-inbox-index.pod b/Documentation/public-inbox-index.pod
index 936516f8..0848e860 100644
--- a/Documentation/public-inbox-index.pod
+++ b/Documentation/public-inbox-index.pod
@@ -42,7 +42,7 @@ Influences the number of Xapian indexing shards in a
 See L<public-inbox-init(1)/--jobs> for a full description
 of sharding.
 
-C<--jobs=0> is accepted as of public-inbox 1.6.0 (PENDING)
+C<--jobs=0> is accepted as of public-inbox 1.6.0
 to disable parallel indexing regardless of the number of
 pre-existing shards.
 
@@ -102,7 +102,7 @@ This fixes some bugs in older versions of public-inbox.  While
 it is possible to use this without C<--reindex>, it makes little
 sense to do so.
 
-Available in public-inbox 1.6.0 (PENDING).
+Available in public-inbox 1.6.0+.
 
 =item --prune
 
@@ -133,7 +133,7 @@ significantly speed up and reduce fragmentation during the
 initial index and full C<--reindex> invocations (but not
 incremental updates).
 
-Available in public-inbox 1.6.0 (PENDING).
+Available in public-inbox 1.6.0+.
 
 =item --no-fsync
 
@@ -144,7 +144,7 @@ primarily intended for systems with low RAM and the small
 may even find disabling L<fdatasync(2)> causes too much dirty
 data to accumulate, resulting on latency spikes from writeback.
 
-Available in public-inbox 1.6.0 (PENDING).
+Available in public-inbox 1.6.0+.
 
 =item --sequential-shard
 
@@ -152,7 +152,7 @@ Sets or overrides L</publicinbox.indexSequentialShard> on a
 per-invocation basis.  See L</publicinbox.indexSequentialShard>
 below.
 
-Available in public-inbox 1.6.0 (PENDING).
+Available in public-inbox 1.6.0+.
 
 =item --skip-docdata
 
@@ -160,7 +160,7 @@ Stop storing document data in Xapian on an existing inbox.
 
 See L<public-inbox-init(1)/--skip-docdata> for description and caveats.
 
-Available in public-inbox 1.6.0 (PENDING).
+Available in public-inbox 1.6.0+.
 
 =back
 
@@ -237,7 +237,7 @@ to SQLite databases.  WWW and IMAP users may notice incomplete
 search results, but it is otherwise non-fatal.  Using C<--reindex>
 will bring everything back up-to-date.
 
-Available in public-inbox 1.6.0 (PENDING).
+Available in public-inbox 1.6.0+.
 
 This is ignored on L<public-inbox-v1-format(5)> inboxes.
 
diff --git a/Documentation/public-inbox-init.pod b/Documentation/public-inbox-init.pod
index 24645045..f1ec05de 100644
--- a/Documentation/public-inbox-init.pod
+++ b/Documentation/public-inbox-init.pod
@@ -50,7 +50,7 @@ This may be set after-the-fact via C<publicinbox.$NAME.newsgroup>
 in the configuration file.  See L<public-inbox-config(5)> for more
 info.
 
-Available since public-inbox 1.6.0 (PENDING).
+Available in public-inbox 1.6.0+.
 
 Default: none.
 
@@ -66,7 +66,7 @@ but may be of use to L<public-inbox-v1-format(5)> users.
 There is no automatic way to use reserved NNTP article numbers
 when old mail is found, yet.
 
-Available since public-inbox 1.6.0 (PENDING).
+Available in public-inbox 1.6.0+.
 
 Default: unset, no NNTP article numbers are skipped
 
@@ -110,7 +110,7 @@ overhead by around 1.5%.
 Warning: this option prevents rollbacks to public-inbox 1.5.0
 and earlier.
 
-Available since public-inbox 1.6.0 (PENDING).
+Available in public-inbox 1.6.0+.
 
 =back
 
diff --git a/Documentation/public-inbox-learn.pod b/Documentation/public-inbox-learn.pod
index 94c96fd5..498c5092 100644
--- a/Documentation/public-inbox-learn.pod
+++ b/Documentation/public-inbox-learn.pod
@@ -55,8 +55,8 @@ not feed the message to L<spamc(1)> and only removes messages
 which match on any of the C<To:>, C<Cc:>, and C<List-ID:> headers.
 
 The C<--all> option may be used match C<spam> semantics in removing
-the message from all configured inboxes.  C<--all> will be
-available in public-inbox 1.6.0 (PENDING).
+the message from all configured inboxes.  C<--all> is only
+available in public-inbox 1.6.0+.
 
 =back
 
diff --git a/Documentation/public-inbox-watch.pod b/Documentation/public-inbox-watch.pod
index 73340ec4..38686645 100644
--- a/Documentation/public-inbox-watch.pod
+++ b/Documentation/public-inbox-watch.pod
@@ -120,7 +120,7 @@ Messages without the (S)een flag are not considered for hiding.
 This hiding affects all configured public-inboxes in PI_CONFIG.
 
 As with C<publicinbox.$NAME.watch>, C<imap://> and C<imaps://> URLs
-are supported in public-inbox 1.6.0.
+are supported in public-inbox 1.6.0+.
 
 Default: none; only for L<public-inbox-watch(1)> users
 
diff --git a/Documentation/public-inbox-xcpdb.pod b/Documentation/public-inbox-xcpdb.pod
index 1397a7f4..1bc1b1df 100644
--- a/Documentation/public-inbox-xcpdb.pod
+++ b/Documentation/public-inbox-xcpdb.pod
@@ -62,7 +62,7 @@ used with C<--compact>.
 Disable L<fsync(2)> and L<fdatasync(2)>.
 See L<public-inbox-index(1)/--no-fsync> for caveats.
 
-Available in public-inbox 1.6.0 (PENDING).
+Available in public-inbox 1.6.0+.
 
 =item --sequential-shard
 
diff --git a/MANIFEST b/MANIFEST
index 04a3744f..f3620de4 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -10,6 +10,7 @@ Documentation/RelNotes/v1.3.0.eml
 Documentation/RelNotes/v1.4.0.eml
 Documentation/RelNotes/v1.5.0.eml
 Documentation/RelNotes/v1.6.0.eml
+Documentation/RelNotes/v1.7.0.wip
 Documentation/clients.txt
 Documentation/dc-dlvr-spam-flow.txt
 Documentation/design_notes.txt
diff --git a/Makefile.PL b/Makefile.PL
index 3fe9acf8..f6b7abb6 100644
--- a/Makefile.PL
+++ b/Makefile.PL
@@ -111,8 +111,11 @@ my %man3 = map {; # semi-colon tells Perl this is a BLOCK (and not EXPR)
 } qw(Git.pm Import.pm WWW.pod SaPlugin/ListMirror.pod);
 
 WriteMakefile(
-	NAME => 'PublicInbox',
+	NAME => 'PublicInbox', # n.b. camel-case is not our choice
+
+	# XXX drop "PENDING" in .pod before updating this!
 	VERSION => '1.6.0',
+
 	AUTHOR => 'Eric Wong <e@80x24.org>',
 	ABSTRACT => 'public-inbox server infrastructure',
 	EXE_FILES => \@EXE_FILES,
diff --git a/TODO b/TODO
index 467f047f..8e1f4eaf 100644
--- a/TODO
+++ b/TODO
@@ -112,8 +112,6 @@ all need to be considered for everything we introduce)
 * imperfect scraper importers for obfuscated list archives
   (e.g. obfuscated Mailman stuff, Google Groups, etc...)
 
-* extend public-inbox-watch to support IMAP, NNTP
-
 * improve performance and avoid head-of-line blocking on slow storage
   (done for most git blob retrievals, Xapian needs work)
 


* [PATCH 7/8] doc: move watch config docs to -watch manpage
  @ 2020-08-27 12:17  5% ` Eric Wong
  0 siblings, 0 replies; 13+ results
From: Eric Wong @ 2020-08-27 12:17 UTC (permalink / raw)
  To: meta

The -config manpage is a bit long and the -watch stuff is
isolated from the rest of it while we start documenting NNTP and
IMAP support.

While I'm not entirely happy with the way IMAP and NNTP are
configured, it's still good enough for small setups.

This also fixes a long-standing misplaced comment about
`publicinboxwatch.spamcheck' affecting all configured inboxes;
that comment was actually for `publicinboxwatch.watchspam'.

We'll omit documenting NNTP for `watchspam', for now, given the
lack of \Seen flags in NNTP and I'm not sure if it's even
useful.  There may not be any newsgroups for sharing confirmed
spam, either...
---
 Documentation/public-inbox-config.pod | 38 ++---------------
 Documentation/public-inbox-watch.pod  | 61 ++++++++++++++++++++++-----
 2 files changed, 55 insertions(+), 44 deletions(-)

diff --git a/Documentation/public-inbox-config.pod b/Documentation/public-inbox-config.pod
index 1dfb926e..2d845f16 100644
--- a/Documentation/public-inbox-config.pod
+++ b/Documentation/public-inbox-config.pod
@@ -74,26 +74,11 @@ Default: none, optional
 
 =item publicinbox.<name>.watch
 
-A location for L<public-inbox-watch(1)> to watch.  Currently,
-only C<maildir:> paths are supported:
-
-	[publicinbox "test"]
-		watch = maildir:/path/to/maildirs/.INBOX.test/
-
-Default: none; only for L<public-inbox-watch(1)> users
+See L<public-inbox-watch(1)>
 
 =item publicinbox.<name>.watchheader
 
-	[publicinbox "test"]
-		watchheader = List-Id:<test.example.com>
-
-If specified, L<public-inbox-watch(1)> will only process mail
-matching the given header.  If specified multiple times in
-public-inbox 1.5 or later, mail will be processed if it matches
-any of the values.  Only the last value was used in public-inbox
-1.4 and earlier.
-
-Default: none; only for L<public-inbox-watch(1)> users
+See L<public-inbox-watch(1)>
 
 =item publicinbox.<name>.listid
 
@@ -204,26 +189,11 @@ Default: spamc
 
 =item publicinboxwatch.spamcheck
 
-This may be set to C<spamc> to enable the use of SpamAssassin
-L<spamc(1)> for filtering spam before it is imported into git
-history.  Other spam filtering backends may be supported in
-the future.
-
-This requires L<public-inbox-watch(1)>, but affects all configured
-public-inboxes in PI_CONFIG.
-
-Default: none
+See L<public-inbox-watch(1)>
 
 =item publicinboxwatch.watchspam
 
-A Maildir to watch for confirmed spam messages to appear in.
-Messages which appear in this folder with the (S)een Maildir flag
-will be hidden from all configured inboxes based on Message-ID
-and content matching.
-
-Messages without the (S)een Maildir flag are not considered for hiding.
-
-Default: none; only for L<public-inbox-watch(1)> users
+See L<public-inbox-watch(1)>
 
 =item publicinbox.nntpserver
 
diff --git a/Documentation/public-inbox-watch.pod b/Documentation/public-inbox-watch.pod
index 34e8c4f2..b07d0fb5 100644
--- a/Documentation/public-inbox-watch.pod
+++ b/Documentation/public-inbox-watch.pod
@@ -35,8 +35,8 @@ In ~/.public-inbox/config:
 
 =head1 DESCRIPTION
 
-public-inbox-watch allows watching a mailbox (currently only
-Maildir) for the arrival of new messages and automatically
+public-inbox-watch allows watching a mailbox or newsgroup
+for the arrival of new messages and automatically
 importing them into public-inbox git repositories and indices.
 public-inbox-watch is useful in situations when a user wishes to
 mirror an existing mailing list, but has no access to run
@@ -48,11 +48,9 @@ of large Maildirs.
 Upon startup, it scans the mailbox for new messages to be
 imported while it was not running.
 
-Currently, only Maildirs are supported.
-
-For now, IMAP users should use tools such as L<mbsync(1)>
-or L<offlineimap(1)> to bidirectionally sync their IMAP
-folders to Maildirs for public-inbox-watch.
+As of public-inbox 1.6.0, Maildirs, IMAP folders, and NNTP
+newsgroups are supported.  Previous versions of public-inbox
+only supported Maildirs.
 
 public-inbox-watch should be run inside a L<screen(1)> session
 or as a L<systemd(1)> service.  Errors are emitted to stderr.
@@ -64,21 +62,64 @@ public-inbox-watch takes no command-line options.
 =head1 CONFIGURATION
 
 These configuration knobs should be used in the
-L<public-inbox-config(5)>
+L<public-inbox-config(5)> file
 
 =over 8
 
 =item publicinbox.<name>.watch
 
+A location to watch.  public-inbox 1.5.0 and earlier only supported
+C<maildir:> paths:
+
+	[publicinbox "test"]
+		watch = maildir:/path/to/maildirs/.INBOX.test/
+
+public-inbox 1.6.0 supports C<nntp://>, C<nntps://>,
+C<imap://> and C<imaps://> URLs:
+
+		watch = nntp://news.example.com/inbox.test.group
+		watch = imaps://mail.example.com/INBOX.test.foo
+
+Default: none
+
 =item publicinbox.<name>.watchheader
 
+	[publicinbox "test"]
+		watchheader = List-Id:<test.example.com>
+
+If specified, L<public-inbox-watch(1)> will only process mail
+matching the given header.  If specified multiple times in
+public-inbox 1.5 or later, mail will be processed if it matches
+any of the values.  Only the last value was used in public-inbox
+1.4 and earlier.
+
+Default: none
+
 =item publicinboxwatch.spamcheck
 
+This may be set to C<spamc> to enable the use of SpamAssassin
+L<spamc(1)> for filtering spam before it is imported into git
+history.  Other spam filtering backends may be supported in
+the future.
+
+Default: none
+
 =item publicinboxwatch.watchspam
 
-=back
+A Maildir to watch for confirmed spam messages to appear in.
+Messages which appear in this folder with the (S)een flag
+will be hidden from all configured inboxes based on Message-ID
+and content matching.
+
+Messages without the (S)een flag are not considered for hiding.
+This hiding affects all configured public-inboxes in PI_CONFIG.
+
+As with C<publicinbox.$NAME.watch>, C<imap://> and C<imaps://> URLs
+are supported in public-inbox 1.6.0.
 
-See L<public-inbox-config(5)> for documentation on them.
+Default: none; only for L<public-inbox-watch(1)> users
+
+=back
 
 =head1 SIGNALS
 

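For reference, the knobs documented in the patch above might be
combined like this in ~/.public-inbox/config.  This is only a sketch:
the inbox name, host name, paths and List-Id value are made up, the
C<maildir:> prefix on watchspam is assumed to mirror the C<watch>
syntax, and other settings an inbox needs are left out.

```ini
[publicinbox "test"]
	# nntp://, nntps://, imap:// and imaps:// need public-inbox 1.6.0
	watch = imaps://mail.example.com/INBOX.test.foo
	# only import mail carrying this header
	watchheader = List-Id:<test.example.com>
[publicinboxwatch]
	# filter through SpamAssassin before importing
	spamcheck = spamc
	# messages marked (S)een in this folder get hidden from all inboxes
	watchspam = maildir:/path/to/maildirs/.INBOX.spam/
```
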

* [PATCH 22/23] init+index: support --skip-docdata for Xapian
  @ 2020-08-20 20:24  3% ` Eric Wong
  0 siblings, 0 replies; 13+ results
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
  To: meta

Since we no longer read document data from Xapian, allow users
to opt-out of storing it.

This breaks compatibility with previous releases of
public-inbox, but gives us a ~1.5% space savings on Xapian
storage (and associated I/O and page cache pressure reduction).
---
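
Not part of the commit: a minimal standalone sketch of the one-shot
metadata idiom used by set_metadata_once in the diff below.  FakeXdb
is a hypothetical in-memory stand-in for a Xapian writable database
(only get_metadata and set_metadata are mimicked) so the sketch runs
without Xapian; it is not public-inbox code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a Xapian DB handle; counts metadata writes.
package FakeXdb;
sub new { bless { meta => {}, writes => 0 }, shift }
sub get_metadata { $_[0]->{meta}->{$_[1]} // '' }
sub set_metadata {
	my ($self, $k, $v) = @_;
	$self->{meta}->{$k} = $v;
	$self->{writes}++;
}

package main;

sub set_metadata_once {
	my ($self) = @_;
	return if $self->{shard}; # only continue if undef or 0, not >0
	my $xdb = $self->{xdb};

	# delete() returns the old value and clears the flag, so this
	# branch is taken at most once per object
	if (delete($self->{-set_skip_docdata_once})) {
		$xdb->get_metadata('skip_docdata') or
			$xdb->set_metadata('skip_docdata', '1');
	}
}

my $idx = { xdb => FakeXdb->new, -set_skip_docdata_once => 1 };
set_metadata_once($idx) for 1..3; # e.g. three commits in a row
print "writes: $idx->{xdb}->{writes}\n"; # writes: 1
```

Clearing the flag with delete() keeps repeated commits from touching
the metadata again, which is the point of the "_once" naming.
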
 Documentation/public-inbox-index.pod |  8 +++++++
 Documentation/public-inbox-init.pod  | 10 ++++++++
 lib/PublicInbox/Admin.pm             | 12 ++++++----
 lib/PublicInbox/SearchIdx.pm         | 35 +++++++++++++++++++++-------
 lib/PublicInbox/SearchIdxShard.pm    |  2 +-
 script/public-inbox-convert          |  3 ++-
 script/public-inbox-index            |  7 ++++--
 script/public-inbox-init             |  8 +++++++
 t/inbox_idle.t                       |  2 +-
 t/index-git-times.t                  | 11 ++++++++-
 t/init.t                             | 13 +++++++++++
 11 files changed, 91 insertions(+), 20 deletions(-)

diff --git a/Documentation/public-inbox-index.pod b/Documentation/public-inbox-index.pod
index 1ed9f5e7..46a53825 100644
--- a/Documentation/public-inbox-index.pod
+++ b/Documentation/public-inbox-index.pod
@@ -145,6 +145,14 @@ below.
 
 Available in public-inbox 1.6.0 (PENDING).
 
+=item --skip-docdata
+
+Stop storing document data in Xapian on an existing inbox.
+
+See L<public-inbox-init(1)/--skip-docdata> for description and caveats.
+
+Available in public-inbox 1.6.0 (PENDING).
+
 =back
 
 =head1 FILES
diff --git a/Documentation/public-inbox-init.pod b/Documentation/public-inbox-init.pod
index 4cc7e29f..3f98807a 100644
--- a/Documentation/public-inbox-init.pod
+++ b/Documentation/public-inbox-init.pod
@@ -95,6 +95,16 @@ default due to contention in the top-level producer process.
 
 Default: the number of online CPUs, up to 4
 
+=item --skip-docdata
+
+Do not store document data in Xapian, reducing Xapian storage
+overhead by around 1.5%.
+
+Warning: this option prevents rollbacks to public-inbox 1.5.0
+and earlier.
+
+Available since public-inbox 1.6.0 (PENDING).
+
 =back
 
 =head1 ENVIRONMENT
diff --git a/lib/PublicInbox/Admin.pm b/lib/PublicInbox/Admin.pm
index f5427af7..b8ead6f7 100644
--- a/lib/PublicInbox/Admin.pm
+++ b/lib/PublicInbox/Admin.pm
@@ -48,13 +48,14 @@ sub resolve_repo_dir {
 sub detect_indexlevel ($) {
 	my ($ibx) = @_;
 
-	# brand new or never before indexed inboxes default to full
-	return 'full' unless $ibx->over;
-	delete $ibx->{over}; # don't leave open FD lying around
+	my $over = $ibx->over;
+	my $srch = $ibx->search;
+	delete @$ibx{qw(over search)}; # don't leave open FDs lying around
 
+	# brand new or never before indexed inboxes default to full
+	return 'full' unless $over;
 	my $l = 'basic';
-	my $srch = $ibx->search or return $l;
-	delete $ibx->{search}; # don't leave open FD lying around
+	return $l unless $srch;
 	if (my $xdb = $srch->xdb) {
 		$l = 'full';
 		my $m = $xdb->get_metadata('indexlevel');
@@ -65,6 +66,7 @@ sub detect_indexlevel ($) {
 $ibx->{inboxdir} has unexpected indexlevel in Xapian: $m
 
 		}
+		$ibx->{-skip_docdata} = 1 if $xdb->get_metadata('skip_docdata');
 	}
 	$l;
 }
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 5c39f3d6..be46b2b9 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -61,6 +61,10 @@ sub new {
 	}, $class;
 	$self->xpfx_init;
 	$self->{-set_indexlevel_once} = 1 if $indexlevel eq 'medium';
+	if ($ibx->{-skip_docdata}) {
+		$self->{-set_skip_docdata_once} = 1;
+		$self->{-skip_docdata} = 1;
+	}
 	$ibx->umask_prepare;
 	if ($version == 1) {
 		$self->{lock_path} = "$inboxdir/ssoma.lock";
@@ -359,10 +363,18 @@ sub add_xapian ($$$$) {
 
 	msg_iter($eml, \&index_xapian, [ $self, $doc ]);
 	index_ids($self, $doc, $eml, $mids);
-	$smsg->{to} = $smsg->{cc} = ''; # WWW doesn't need these, only NNTP
-	PublicInbox::OverIdx::parse_references($smsg, $eml, $mids);
-	my $data = $smsg->to_doc_data;
-	$doc->set_data($data);
+
+	# by default, we maintain compatibility with v1.5.0 and earlier
+	# by writing to docdata.glass; users who never expect to downgrade can
+	# use --skip-docdata
+	if (!$self->{-skip_docdata}) {
+		# WWW doesn't need {to} or {cc}, only NNTP
+		$smsg->{to} = $smsg->{cc} = '';
+		PublicInbox::OverIdx::parse_references($smsg, $eml, $mids);
+		my $data = $smsg->to_doc_data;
+		$doc->set_data($data);
+	}
+
 	if (my $altid = $self->{-altid}) {
 		foreach my $alt (@$altid) {
 			my $pfx = $alt->{xprefix};
@@ -831,23 +843,28 @@ sub begin_txn_lazy {
 
 # store 'indexlevel=medium' in v2 shard=0 and v1 (only one shard)
 # This metadata is read by Admin::detect_indexlevel:
-sub set_indexlevel {
+sub set_metadata_once {
 	my ($self) = @_;
 
-	if (!$self->{shard} && # undef or 0, not >0
-			delete($self->{-set_indexlevel_once})) {
-		my $xdb = $self->{xdb};
+	return if $self->{shard}; # only continue if undef or 0, not >0
+	my $xdb = $self->{xdb};
+
+	if (delete($self->{-set_indexlevel_once})) {
 		my $level = $xdb->get_metadata('indexlevel');
 		if (!$level || $level ne 'medium') {
 			$xdb->set_metadata('indexlevel', 'medium');
 		}
 	}
+	if (delete($self->{-set_skip_docdata_once})) {
+		$xdb->get_metadata('skip_docdata') or
+			$xdb->set_metadata('skip_docdata', '1');
+	}
 }
 
 sub _commit_txn {
 	my ($self) = @_;
 	if (my $xdb = $self->{xdb}) {
-		set_indexlevel($self);
+		set_metadata_once($self);
 		$xdb->commit_transaction;
 	}
 	$self->{over}->commit_lazy if $self->{over};
diff --git a/lib/PublicInbox/SearchIdxShard.pm b/lib/PublicInbox/SearchIdxShard.pm
index 59b36087..20077e08 100644
--- a/lib/PublicInbox/SearchIdxShard.pm
+++ b/lib/PublicInbox/SearchIdxShard.pm
@@ -16,7 +16,7 @@ sub new {
 	my $self = $class->SUPER::new($ibx, 1, $shard);
 	# create the DB before forking:
 	$self->idx_acquire;
-	$self->set_indexlevel;
+	$self->set_metadata_once;
 	$self->idx_release;
 	$self->spawn_worker($v2w, $shard) if $v2w->{parallel};
 	$self;
diff --git a/script/public-inbox-convert b/script/public-inbox-convert
index d655dcc6..4ff198d1 100755
--- a/script/public-inbox-convert
+++ b/script/public-inbox-convert
@@ -77,7 +77,8 @@ if ($old) {
 die "Only conversion from v1 inboxes is supported\n" if $old->version >= 2;
 
 require PublicInbox::Admin;
-$old->{indexlevel} //= PublicInbox::Admin::detect_indexlevel($old);
+my $detected = PublicInbox::Admin::detect_indexlevel($old);
+$old->{indexlevel} //= $detected;
 my $env;
 if ($opt->{'index'}) {
 	my $mods = {};
diff --git a/script/public-inbox-index b/script/public-inbox-index
index 30d24838..9855c67d 100755
--- a/script/public-inbox-index
+++ b/script/public-inbox-index
@@ -39,7 +39,7 @@ GetOptions($opt, qw(verbose|v+ reindex rethread compact|c+ jobs|j=i prune
 		indexlevel|index-level|L=s max_size|max-size=s
 		batch_size|batch-size=s
 		sequential_shard|seq-shard|sequential-shard
-		all help|?))
+		skip-docdata all help|?))
 	or die "bad command-line args\n$usage";
 if ($opt->{help}) { print $help; exit 0 };
 die "--jobs must be >= 0\n" if defined $opt->{jobs} && $opt->{jobs} < 0;
@@ -58,9 +58,11 @@ unless (@ibxs) { print STDERR "Usage: $usage\n"; exit 1 }
 
 my $mods = {};
 foreach my $ibx (@ibxs) {
+	# detect_indexlevel may also set $ibx->{-skip_docdata}
+	my $detected = PublicInbox::Admin::detect_indexlevel($ibx);
 	# XXX: users can shoot themselves in the foot, with opt->{indexlevel}
 	$ibx->{indexlevel} //= $opt->{indexlevel} // ($opt->{xapian_only} ?
-			'full' : PublicInbox::Admin::detect_indexlevel($ibx));
+			'full' : $detected);
 	PublicInbox::Admin::scan_ibx_modules($mods, $ibx);
 }
 
@@ -75,6 +77,7 @@ for my $ibx (@ibxs) {
 		PublicInbox::Xapcmd::run($ibx, 'compact', $opt->{compact_opt});
 	}
 	$ibx->{-no_fsync} = 1 if !$opt->{fsync};
+	$ibx->{-skip_docdata} //= $opt->{'skip-docdata'};
 
 	my $ibx_opt = $opt;
 	if (defined(my $s = $ibx->{lc('indexSequentialShard')})) {
diff --git a/script/public-inbox-init b/script/public-inbox-init
index b19c2321..037e8e56 100755
--- a/script/public-inbox-init
+++ b/script/public-inbox-init
@@ -34,6 +34,7 @@ require PublicInbox::Admin;
 PublicInbox::Admin::require_or_die('-base');
 
 my ($version, $indexlevel, $skip_epoch, $skip_artnum, $jobs, $show_help);
+my $skip_docdata;
 my $ng = '';
 my %opts = (
 	'V|version=i' => \$version,
@@ -42,6 +43,7 @@ my %opts = (
 	'skip-artnum=i' => \$skip_artnum,
 	'j|jobs=i' => \$jobs,
 	'ng|newsgroup=s' => \$ng,
+	'skip-docdata' => \$skip_docdata,
 	'help|?' => \$show_help,
 );
 my $usage_cb = sub {
@@ -177,6 +179,12 @@ if (defined $jobs) {
 
 require PublicInbox::InboxWritable;
 $ibx = PublicInbox::InboxWritable->new($ibx, $creat_opt);
+if ($skip_docdata) {
+	$ibx->{indexlevel} //= 'full'; # ensure init_inbox writes xdb
+	$ibx->{indexlevel} eq 'basic' and
+		die "--skip-docdata ignored with --indexlevel=basic\n";
+	$ibx->{-skip_docdata} = $skip_docdata;
+}
 $ibx->init_inbox(0, $skip_epoch, $skip_artnum);
 
 # needed for git prior to v2.1.0
diff --git a/t/inbox_idle.t b/t/inbox_idle.t
index 61287200..e16ee11b 100644
--- a/t/inbox_idle.t
+++ b/t/inbox_idle.t
@@ -29,7 +29,7 @@ for my $V (1, 2) {
 	if ($V == 1) {
 		my $sidx = PublicInbox::SearchIdx->new($ibx, 1);
 		$sidx->idx_acquire;
-		$sidx->set_indexlevel;
+		$sidx->set_metadata_once;
 		$sidx->idx_release; # allow watching on lockfile
 	}
 	my $pi_config = PublicInbox::Config->new(\<<EOF);
diff --git a/t/index-git-times.t b/t/index-git-times.t
index 2e9e88e8..8f80c866 100644
--- a/t/index-git-times.t
+++ b/t/index-git-times.t
@@ -4,6 +4,7 @@ use Test::More;
 use PublicInbox::TestCommon;
 use PublicInbox::Import;
 use PublicInbox::Config;
+use PublicInbox::Admin;
 use File::Path qw(remove_tree);
 
 require_mods(qw(DBD::SQLite Search::Xapian));
@@ -47,11 +48,15 @@ EOF
 	PublicInbox::Import::run_die($cmd, undef, { 0 => $r });
 }
 
-run_script(['-index', $v1dir]) or die 'v1 index failed';
+run_script(['-index', '--skip-docdata', $v1dir]) or die 'v1 index failed';
+
 my $smsg;
 {
 	my $cfg = PublicInbox::Config->new;
 	my $ibx = $cfg->lookup($addr);
+	my $lvl = PublicInbox::Admin::detect_indexlevel($ibx);
+	is($lvl, 'medium', 'indexlevel detected');
+	is($ibx->{-skip_docdata}, 1, '--skip-docdata flag set on -index');
 	$smsg = $ibx->over->get_art(1);
 	is($smsg->{ds}, 749520000, 'datestamp from git author time');
 	is($smsg->{ts}, 1285977600, 'timestamp from git committer time');
@@ -70,6 +75,10 @@ SKIP: {
 	my $check_v2 = sub {
 		my $ibx = PublicInbox::Inbox->new({inboxdir => $v2dir,
 				address => $addr});
+		my $lvl = PublicInbox::Admin::detect_indexlevel($ibx);
+		is($lvl, 'medium', 'indexlevel detected after convert');
+		is($ibx->{-skip_docdata}, 1,
+			'--skip-docdata preserved after convert');
 		my $v2smsg = $ibx->over->get_art(1);
 		is($v2smsg->{ds}, $smsg->{ds},
 			'v2 datestamp from git author time');
diff --git a/t/init.t b/t/init.t
index 4d2c5049..dad09435 100644
--- a/t/init.t
+++ b/t/init.t
@@ -95,6 +95,19 @@ SKIP: {
 		my $ibx = PublicInbox::Inbox->new({ inboxdir => $dir });
 		is(PublicInbox::Admin::detect_indexlevel($ibx), $lvl,
 			'detected expected level w/o config');
+		ok(!$ibx->{-skip_docdata}, 'docdata written by default');
+	}
+	for my $v (1, 2) {
+		my $name = "v$v-skip-docdata";
+		my $dir = "$tmpdir/$name";
+		$cmd = [ '-init', $name, "-V$v", '--skip-docdata',
+			$dir, "http://example.com/$name",
+			"$name\@example.com" ];
+		ok(run_script($cmd), "-init -V$v --skip-docdata");
+		my $ibx = PublicInbox::Inbox->new({ inboxdir => $dir });
+		is(PublicInbox::Admin::detect_indexlevel($ibx), 'full',
+			"detected default indexlevel -V$v");
+		ok($ibx->{-skip_docdata}, "docdata skip set -V$v");
 	}
 
 	# loop for idempotency


* [PATCH] doc: release notes and version info updates
@ 2020-07-14 10:06  5% Eric Wong
  0 siblings, 0 replies; 13+ results
From: Eric Wong @ 2020-07-14 10:06 UTC (permalink / raw)
  To: meta

Update release notes with some features in the 1.6 timeline.

We'll note the version availability of some command-line
options; this may help users who are reading the latest
documentation online but running older versions.
---
 Documentation/RelNotes/v1.6.0.eml    | 36 ++++++++++++++++++++++++++--
 Documentation/public-inbox-index.pod |  8 +++++++
 Documentation/public-inbox-init.pod  |  4 ++++
 Documentation/public-inbox-learn.pod |  3 ++-
 4 files changed, 48 insertions(+), 3 deletions(-)

diff --git a/Documentation/RelNotes/v1.6.0.eml b/Documentation/RelNotes/v1.6.0.eml
index 283f42c89..862e1c681 100644
--- a/Documentation/RelNotes/v1.6.0.eml
+++ b/Documentation/RelNotes/v1.6.0.eml
@@ -18,17 +18,37 @@ Content-Disposition: inline
     and indexed for search.  Use `public-inbox-index --reindex' to
     ensure these attachments are indexed in old messages.
 
+  - inbox.lock (v2) and ssoma.lock (v1) files are written to
+    on message delivery (or spam removal) to wake up read-only
+    daemons via inotify or kqueue.
+
 * public-inbox-index
 
   - --batch-size=BYTES or publicinbox.indexBatchSize parameter
 
-  - parallelize updates by default, "-j0" is (once again) allowed
-    parallelization
+  - parallelize v2 updates by default, "-j0" is (once again) allowed
+    to disable parallelization
+
+  - v1 (re-)indexing parallelizes blob reads from git
 
 * public-inbox-learn
 
   - `rm' supports `--all' to remove from all configured inboxes
 
+* public-inbox-imapd
+
+  - new read-only IMAP daemon similar to public-inbox-nntpd
+
+* public-inbox-nntpd
+
+  - blob reads from git are handled asynchronously
+
+* public-inbox-httpd
+
+  - Plack::Middleware::Deflater is no longer loaded by default
+    when no .psgi file is specified; PublicInbox::WWW gzips
+    natively (see below)
+
 * PublicInbox::WWW
 
   - use consistent blank line around attachment links
@@ -39,6 +59,18 @@ Content-Disposition: inline
 
   - $INBOX_DIR/description is treated as UTF-8
 
+  - HTML, Atom, and text/plain responses are gzipped without
+    relying on Plack::Middleware::Deflater
+
+  - Multi-message endpoints (/t.mbox.gz, /T/, /t/, etc) are ~10% faster
+    when running under public-inbox-httpd with asynchronous blob
+    retrieval
+
+* public-inbox-watch
+
+  - Linux::Inotify2 or IO::KQueue is used directly,
+    Filesys::Notify::Simple is no longer required
+
 Please report bugs via plain-text mail to: meta@public-inbox.org
 
 See archives at https://public-inbox.org/meta/ for all history.
diff --git a/Documentation/public-inbox-index.pod b/Documentation/public-inbox-index.pod
index 5be3c897b..b1b24917b 100644
--- a/Documentation/public-inbox-index.pod
+++ b/Documentation/public-inbox-index.pod
@@ -46,6 +46,8 @@ This switch may be specified twice, in which case compaction
 happens both before and after indexing to minimize the temporal
 footprint of the (re)indexing operation.
 
+Available since public-inbox 1.4.0.
+
 =item --reindex
 
 Forces a re-index of all messages in the inbox.
@@ -70,18 +72,24 @@ is detected.  This is intended to be used in mirrors after running
 L<public-inbox-edit(1)> or L<public-inbox-purge(1)> to ensure data
 is expunged from mirrors.
 
+Available since public-inbox 1.2.0.
+
 =item --max-size SIZE
 
 Sets or overrides L</publicinbox.indexMaxSize> on a
 per-invocation basis.  See L</publicinbox.indexMaxSize>
 below.
 
+Available since public-inbox 1.5.0.
+
 =item --batch-size SIZE
 
 Sets or overrides L</publicinbox.indexBatchSize> on a
 per-invocation basis.  See L</publicinbox.indexBatchSize>
 below.
 
+Available in public-inbox 1.6.0 (PENDING).
+
 =back
 
 =head1 FILES
diff --git a/Documentation/public-inbox-init.pod b/Documentation/public-inbox-init.pod
index 5714828d9..fd9fc6379 100644
--- a/Documentation/public-inbox-init.pod
+++ b/Documentation/public-inbox-init.pod
@@ -51,6 +51,8 @@ but may be of use to L<public-inbox-v1-format(5)> users.
 There is no automatic way to use reserved NNTP article numbers
 when old mail is found, yet.
 
+Available since public-inbox 1.6.0 (PENDING).
+
 Default: unset, no NNTP article numbers are skipped
 
 =item -S, --skip-epoch
@@ -60,6 +62,8 @@ allows archivists to publish incomplete archives with newer
 mail while allowing "0.git" (or "1.git" and so on) epochs to be
 added-after-the-fact (without affecting "git clone" followers).
 
+Available since public-inbox 1.2.0.
+
 Default: unset, no epochs are skipped
 
 =item -j, --jobs=JOBS
diff --git a/Documentation/public-inbox-learn.pod b/Documentation/public-inbox-learn.pod
index 9c6b261b3..cd9bf2782 100644
--- a/Documentation/public-inbox-learn.pod
+++ b/Documentation/public-inbox-learn.pod
@@ -55,7 +55,8 @@ not feed the message to L<spamc(1)> and only removes messages
 which match on any of the C<To:>, C<Cc:>, and C<List-ID:> headers.
 
 The C<--all> option may be used match C<spam> semantics in removing
-the message from all configured inboxes.
+the message from all configured inboxes.  C<--all> will be
+available in public-inbox 1.6.0 (PENDING).
 
 =back
 


* [PATCH 37/43] www: update internal docs
  @ 2020-07-05 23:27  4% ` Eric Wong
  0 siblings, 0 replies; 13+ results
From: Eric Wong @ 2020-07-05 23:27 UTC (permalink / raw)
  To: meta

We no longer favor getline+close for streaming PSGI responses
when using public-inbox-httpd.  We still support it for other
PSGI servers, though.
---
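
Not part of the commit: for readers unfamiliar with the getline+close
interface mentioned above, here is a minimal sketch of a PSGI response
body which a generic server pulls from.  LineBody and the driving loop
are hypothetical illustrations, not public-inbox code.

```perl
use strict;
use warnings;

# Hypothetical body object; PSGI only requires getline and close.
package LineBody;
sub new {
	my ($class, @chunks) = @_;
	bless { chunks => [ @chunks ], closed => 0 }, $class;
}
# return the next chunk, or undef at EOF
sub getline { shift @{$_[0]->{chunks}} }
sub close { $_[0]->{closed} = 1 }

package main;

my $app = sub {
	my ($env) = @_;
	[ 200, [ 'Content-Type', 'text/plain' ],
		LineBody->new("hello\n", "world\n") ];
};

# roughly what a generic PSGI server does: pull until undef, then close
my $res = $app->({});
my $body = $res->[2];
my $out = '';
while (defined(my $buf = $body->getline)) {
	$out .= $buf;
}
$body->close;
print $out;
```

The server decides when to call getline, so a slow client naturally
throttles how fast the body is generated.
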
 Documentation/technical/ds.txt   |  4 ++--
 lib/PublicInbox/GetlineBody.pm   |  4 +---
 lib/PublicInbox/GzipFilter.pm    | 17 +++++++++++++----
 lib/PublicInbox/HTTPD.pm         |  5 ++---
 lib/PublicInbox/Mbox.pm          |  8 ++------
 lib/PublicInbox/View.pm          |  2 +-
 lib/PublicInbox/WwwAtomStream.pm |  6 ++----
 lib/PublicInbox/WwwStream.pm     |  7 +++----
 8 files changed, 26 insertions(+), 27 deletions(-)

diff --git a/Documentation/technical/ds.txt b/Documentation/technical/ds.txt
index cbd06cfb4..a0793ca23 100644
--- a/Documentation/technical/ds.txt
+++ b/Documentation/technical/ds.txt
@@ -64,8 +64,8 @@ Augmented features:
 * ->requeue support.  An optimization of the AddTimer(0, ...) idiom
   for immediately dispatching code at the next event loop iteration.
   public-inbox uses this for fairly generating large responses
-  iteratively (see PublicInbox::NNTP::long_response or the use of
-  ->getline callbacks for generating gigantic gzipped mboxes).
+  iteratively (see PublicInbox::NNTP::long_response or git_async_cat
+  for blob retrievals).
 
 New features
 
diff --git a/lib/PublicInbox/GetlineBody.pm b/lib/PublicInbox/GetlineBody.pm
index 6becaaf5f..988bc63f4 100644
--- a/lib/PublicInbox/GetlineBody.pm
+++ b/lib/PublicInbox/GetlineBody.pm
@@ -5,9 +5,7 @@
 # end callback when the object goes out-of-scope.
 # This depends on rpipe being _blocking_ on getline.
 #
-# public-inbox-httpd favors "getline" response bodies to take a
-# "pull"-based approach to feeding slow clients (as opposed to a
-# more common "push" model)
+# This is only used by generic PSGI servers and not public-inbox-httpd
 package PublicInbox::GetlineBody;
 use strict;
 use warnings;
diff --git a/lib/PublicInbox/GzipFilter.pm b/lib/PublicInbox/GzipFilter.pm
index 6380f50e9..d72ad3c88 100644
--- a/lib/PublicInbox/GzipFilter.pm
+++ b/lib/PublicInbox/GzipFilter.pm
@@ -1,7 +1,16 @@
 # Copyright (C) 2020 all contributors <meta@public-inbox.org>
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
-
-# Qspawn filter
+#
+# In public-inbox <=1.5.0, public-inbox-httpd favored "getline"
+# response bodies to take a "pull"-based approach to feeding
+# slow clients (as opposed to a more common "push" model).
+#
+# In newer versions, public-inbox-httpd supports a backpressure-aware
+# pull/push model which also accounts for slow git blob storage.
+# {async_next} callbacks only run when the DS {wbuf} is drained
+# {async_eml} callbacks only run when a blob arrives from git.
+#
+# We continue to support getline+close for generic PSGI servers.
 package PublicInbox::GzipFilter;
 use strict;
 use parent qw(Exporter);
@@ -14,12 +23,12 @@ our @EXPORT_OK = qw(gzf_maybe);
 my %OPT = (-WindowBits => 15 + 16, -AppendOutput => 1);
 my @GZIP_HDRS = qw(Vary Accept-Encoding Content-Encoding gzip);
 
-sub new { bless {}, shift }
+sub new { bless {}, shift } # qspawn filter
 
 # for Qspawn if using $env->{'pi-httpd.async'}
 sub attach {
 	my ($self, $http_out) = @_;
-	$self->{http_out} = $http_out;
+	$self->{http_out} = $http_out; # PublicInbox::HTTP::{Chunked,Identity}
 	$self
 }
 
diff --git a/lib/PublicInbox/HTTPD.pm b/lib/PublicInbox/HTTPD.pm
index 331939699..a9f55ff61 100644
--- a/lib/PublicInbox/HTTPD.pm
+++ b/lib/PublicInbox/HTTPD.pm
@@ -36,9 +36,8 @@ sub new {
 
 		# XXX unstable API!, only GitHTTPBackend needs
 		# this to limit git-http-backend(1) parallelism.
-		# The rest of our PSGI code is generic, relying
-		# on "pull" model using "getline" to prevent
-		# over-buffering.
+		# We also check for the truthiness of this to
+		# detect when to use git_async_cat for slow blobs
 		'pi-httpd.async' => \&pi_httpd_async
 	);
 	bless {
diff --git a/lib/PublicInbox/Mbox.pm b/lib/PublicInbox/Mbox.pm
index abdf43c93..8726b9f64 100644
--- a/lib/PublicInbox/Mbox.pm
+++ b/lib/PublicInbox/Mbox.pm
@@ -1,12 +1,8 @@
 # Copyright (C) 2015-2020 all contributors <meta@public-inbox.org>
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
 
-# Streaming (via getline) interface for formatting messages as an mboxrd.
-# Used by the PSGI web interface.
-#
-# public-inbox-httpd favors "getline" response bodies to take a
-# "pull"-based approach to feeding slow clients (as opposed to a
-# more common "push" model)
+# Streaming interface for mboxrd HTTP responses
+# See PublicInbox::GzipFilter for details.
 package PublicInbox::Mbox;
 use strict;
 use parent 'PublicInbox::GzipFilter';
diff --git a/lib/PublicInbox/View.pm b/lib/PublicInbox/View.pm
index 895e4f278..60dad6bac 100644
--- a/lib/PublicInbox/View.pm
+++ b/lib/PublicInbox/View.pm
@@ -415,7 +415,7 @@ sub stream_thread ($$) {
 	PublicInbox::WwwStream::aresponse($ctx, 200, \&stream_thread_i);
 }
 
-# /$INBOX/$MESSAGE_ID/t/
+# /$INBOX/$MSGID/t/ and /$INBOX/$MSGID/T/
 sub thread_html {
 	my ($ctx) = @_;
 	my $mid = $ctx->{mid};
diff --git a/lib/PublicInbox/WwwAtomStream.pm b/lib/PublicInbox/WwwAtomStream.pm
index 073df1dfa..3b5b133a5 100644
--- a/lib/PublicInbox/WwwAtomStream.pm
+++ b/lib/PublicInbox/WwwAtomStream.pm
@@ -1,10 +1,8 @@
 # Copyright (C) 2016-2020 all contributors <meta@public-inbox.org>
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
 #
-# Atom body stream for which yields getline+close methods
-# public-inbox-httpd favors "getline" response bodies to take a
-# "pull"-based approach to feeding slow clients (as opposed to a
-# more common "push" model)
+# Atom body stream for HTTP responses
+# See PublicInbox::GzipFilter for details.
 package PublicInbox::WwwAtomStream;
 use strict;
 use parent 'PublicInbox::GzipFilter';
diff --git a/lib/PublicInbox/WwwStream.pm b/lib/PublicInbox/WwwStream.pm
index 7d257a191..23b03f0e8 100644
--- a/lib/PublicInbox/WwwStream.pm
+++ b/lib/PublicInbox/WwwStream.pm
@@ -1,11 +1,10 @@
 # Copyright (C) 2016-2020 all contributors <meta@public-inbox.org>
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
 #
-# HTML body stream for which yields getline+close methods
+# HTML body stream which yields getline+close methods for
+# generic PSGI servers and callbacks for public-inbox-httpd.
 #
-# public-inbox-httpd favors "getline" response bodies to take a
-# "pull"-based approach to feeding slow clients (as opposed to a
-# more common "push" model)
+# See PublicInbox::GzipFilter parent class for more info.
 package PublicInbox::WwwStream;
 use strict;
 use parent qw(Exporter PublicInbox::GzipFilter);


* bug: httpd: incorrect Unicode output of $INBOX_DIR/description
@ 2020-05-28 15:12  4% Julien Moutinho
  0 siblings, 0 replies; 13+ results
From: Julien Moutinho @ 2020-05-28 15:12 UTC (permalink / raw)
  To: meta

Description
-----------
public-inbox-httpd does not output $INBOX_DIR/description
using the expected Unicode code points.

Reproducing
-----------
$ cat /var/lib/public-inbox/inboxes/equipage/description
Équipage

$ file $(readlink -e description)
/nix/store/a7m2gqmj417dlqzjq1arizm7gxxrdqqm-description: UTF-8 Unicode text, with no line terminators

Is rendered by public-inbox-httpd as:
$ curl -s http://example.org/lists/archives/ | grep quipage'$'
&#195;&#137;quipage

My setup: public-inbox-1.5.0, or public-inbox-1.2.0, on NixOS.

Expecting
---------
$ curl -s http://example.org/lists/archives/ | grep quipage'$'
Équipage

Or:
$ curl -s http://example.org/lists/archives/ | grep quipage'$'
&#201;quipage

Debugging
---------
This may be due to using: ascii_html($ibx->description);
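
A small standalone sketch of the suspected double-encoding, assuming
the description is read as raw UTF-8 octets and escaped without being
decoded first.  to_ascii_html is a hypothetical stand-in which only
escapes non-ASCII characters; it is not the real ascii_html.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an ASCII-only HTML escaper
sub to_ascii_html {
	my ($s) = @_;
	$s =~ s/([^\x00-\x7f])/sprintf('&#%u;', ord($1))/ge;
	$s;
}

# UTF-8 octets for "Équipage", as read from a file without decoding
my $raw = "\xc3\x89quipage";

# the same data decoded from octets to characters
my $decoded = $raw;
utf8::decode($decoded);

print to_ascii_html($raw), "\n"; # &#195;&#137;quipage
print to_ascii_html($decoded), "\n"; # &#201;quipage
```

The first line matches the output seen above and the second matches
the expected output, which suggests a missing decode step.
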

Thanks a lot for developing public-inbox,
Julien.


* [PATCH] confine Email::MIME use even further
  @ 2020-05-16 22:53  4%                   ` Eric Wong
  0 siblings, 0 replies; 13+ results
From: Eric Wong @ 2020-05-16 22:53 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: meta

"Eric W. Biederman" <ebiederm@xmission.com> wrote:
> Eric Wong <e@yhbt.net> writes:
> 
> > "Eric W. Biederman" <ebiederm@xmission.com> wrote:
> >> Eric Wong <e@yhbt.net> writes:
> >> > "Eric W. Biederman" <ebiederm@xmission.com> wrote:
> >> >> > The email messages are placed without modification into the public
> >> >> > inbox repository to minimize chances of corruption or of losing
> >> >> > valuable information.  I use the command imap_fetch for all of my
> >> >> > email and not just a mailing list mirror so I don't want automation
> >> >> > to accidentally cause something important to be lost.
> >> >
> >> > Btw, Email::MIME usage is gone from 1.5.0 due to nasty
> >> > performance problems and replaced by PublicInbox::Eml.  Eml
> >> > should be completely non-destructive unless somebody sends an
> >> > abusive message which exceeds the new safety limits; in which
> >> > case it won't OOM or burn CPU like E::M did.
> >> >
> >> > That said, {-public_inbox_raw} still works and Eml looks
> >> > like a drop-in replacement as far as imap_fetch is concerned.
> >> 
> >> I almost did that. But I looked and saw PublicInbox::MIME still present
> >> and a number of other references to Email::MIME so I wasn't certain
> >> exactly how that was being handled.  But since Email::MIME still
> >> worked I didn't mess with that.
> >
> > I think the Import .pod documentation is the only place aside
> > from some random comments and maintainer tests in xt/*, right?
> 
> I am looking at 1.5.0 so you may have made a bit more progress
> but Import.pm still uses Email::MIME,
> and PublicInbox::MIME still uses Email::MIME as a base class.

Yeah, PublicInbox::MIME only existed to workaround old bugs in
Email::MIME, so we used it everywhere for years and will
keep it in old tests.

Below is a patch to remove most references to Email::MIME;
but I guess PublicInbox::Eml will need POD docs at some point...

> >> > Btw, any reason you create the SSLSocket yourself instead of
> >> > passing (Ssl => \@SSL_Socket_options) to IMAPClient->new?
> >> 
> >> When I read the documentation it looked like that was the way to do
> >> things.  Even now when I reread the documentation that looks like the
> >> way to go.  Especially if I wanted to be certain the connection was
> >> encrypted.
> >
> > There seems more than one way to do it, but `Starttls' and `Ssl'
> > are just as documented from what I can tell (in v3.38).
> > Socket/RawSocket seem useful for using an external command to
> > connect/launch an IMAP tunnel or server; so it'll be used to
> > mimic the `imap.tunnel' support of git-imap-send.
> 
> Now that you point it out I can see it.  Commands like starttls
> are a bit dangerous as they are subject to man in the middle attacks.
> 
> But I think that is the difference of just tossing something together
> for yourself versus making something that works with everyone's setup.
> 
> The one challenge I ran into was getting ssl verification to work on
> RHEL7.  Apparently IO::Socket::SSL::default_ca() does not exist in
> the old version of perl that comes with RHEL7.  Which is why I have
> the %ca and the eval.

Ouch.  Yes, I remember that being a problem for testing NNTPS,
too.  Net::NNTP doesn't support old IO::Socket::SSL, either.

Don't feel obligated to figure this out; but how did
IO::Socket::SSL work before it got default_ca()?

Did it force the user to configure that on their own,
set it behind-the-scenes as a default, or did it (*gasp*)
skip verification?

-----------8<-----------
Subject: [PATCH] confine Email::MIME use even further

To avoid confusing future readers and users, recommend
PublicInbox::Eml in our Import POD and refer to PublicInbox::Eml
comments at the top of PublicInbox::MIME.

mime_load() is now confined to t/eml.t, since we won't be using
it anywhere else in our tests.
---
 lib/PublicInbox/Import.pm     | 22 +++++++++++++---------
 lib/PublicInbox/MIME.pm       |  4 +++-
 lib/PublicInbox/TestCommon.pm | 10 +---------
 t/eml.t                       |  6 ++++++
 4 files changed, 23 insertions(+), 19 deletions(-)

diff --git a/lib/PublicInbox/Import.pm b/lib/PublicInbox/Import.pm
index fc61d062..792570c8 100644
--- a/lib/PublicInbox/Import.pm
+++ b/lib/PublicInbox/Import.pm
@@ -648,7 +648,10 @@ version 1.0
 
 =head1 SYNOPSIS
 
-	use Email::MIME;
+	use PublicInbox::Eml;
+	# PublicInbox::Eml exists as of public-inbox 1.5.0,
+	# Email::MIME was used in older versions
+
 	use PublicInbox::Git;
 	use PublicInbox::Import;
 
@@ -664,7 +667,7 @@ version 1.0
 		"Date: Thu, 01 Jan 1970 00:00:00 +0000\n" .
 		"Message-ID: <m\@example.org>\n".
 		"\ntest message";
-	my $parsed = Email::MIME->new($message);
+	my $parsed = PublicInbox::Eml->new($message);
 	my $ret = $im->add($parsed);
 	if (!defined $ret) {
 		warn "duplicate: ",
@@ -675,7 +678,7 @@ version 1.0
 	$im->done;
 
 	# to remove a message
-	my $junk = Email::MIME->new($message);
+	my $junk = PublicInbox::Eml->new($message);
 	my ($mark, $orig) = $im->remove($junk);
 	if ($mark eq 'MISSING') {
 		print "not found\n";
@@ -690,8 +693,8 @@ version 1.0
 
 =head1 DESCRIPTION
 
-An importer and remover for public-inboxes which takes L<Email::MIME>
-messages as input and stores them in a git repository as
+An importer and remover for public-inboxes which takes C<PublicInbox::Eml>
+or L<Email::MIME> messages as input and stores them in a git repository as
 documented in L<https://public-inbox.org/public-inbox-v1-format.txt>,
 except it does not allow duplicate Message-IDs.
 
@@ -709,7 +712,7 @@ Initialize a new PublicInbox::Import object.
 
 =head2 add
 
-	my $parsed = Email::MIME->new($message);
+	my $parsed = PublicInbox::Eml->new($message);
 	$im->add($parsed);
 
 Adds a message to to the git repository.  This will acquire
@@ -720,12 +723,13 @@ is called, but L</remove> may be called on them.
 
 =head2 remove
 
-	my $junk = Email::MIME->new($message);
+	my $junk = PublicInbox::Eml->new($message);
 	my ($code, $orig) = $im->remove($junk);
 
 Removes a message from the repository.  On success, it returns
 a ':'-prefixed numeric code representing the git-fast-import
-mark and the original messages as an Email::MIME object.
+mark and the original messages as a PublicInbox::Eml
+(or Email::MIME) object.
 If the message could not be found, the code is "MISSING"
 and the original message is undef.  If there is a mismatch where
 the "Message-ID" is matched but the subject and body do not match,
@@ -749,7 +753,7 @@ The mail archives are hosted at L<https://public-inbox.org/meta/>
 
 =head1 COPYRIGHT
 
-Copyright (C) 2016 all contributors L<mailto:meta@public-inbox.org>
+Copyright (C) 2016-2020 all contributors L<mailto:meta@public-inbox.org>
 
 License: AGPL-3.0+ L<http://www.gnu.org/licenses/agpl-3.0.txt>
 
diff --git a/lib/PublicInbox/MIME.pm b/lib/PublicInbox/MIME.pm
index 9077386a..831a3d19 100644
--- a/lib/PublicInbox/MIME.pm
+++ b/lib/PublicInbox/MIME.pm
@@ -4,7 +4,9 @@
 # The license for this file differs from the rest of public-inbox.
 #
 # We no longer load this in any of our code outside of maintainer
-# tests for compatibility.
+# tests for compatibility.  PublicInbox::Eml is favored throughout
+# our codebase for performance and safety reasons, though we maintain
+# Email::MIME-compatibility in mail injection and indexing code paths.
 #
 # It monkey patches the "parts_multipart" subroutine with patches
 # from Matthew Horsfall <wolfsage@gmail.com> at:
diff --git a/lib/PublicInbox/TestCommon.pm b/lib/PublicInbox/TestCommon.pm
index d952ee6d..79e597f5 100644
--- a/lib/PublicInbox/TestCommon.pm
+++ b/lib/PublicInbox/TestCommon.pm
@@ -9,15 +9,7 @@ use Fcntl qw(FD_CLOEXEC F_SETFD F_GETFD :seek);
 use POSIX qw(dup2);
 use IO::Socket::INET;
 our @EXPORT = qw(tmpdir tcp_server tcp_connect require_git require_mods
-	run_script start_script key2sub xsys xqx mime_load eml_load);
-
-sub mime_load ($) {
-	my ($path) = @_;
-	open(my $fh, '<', $path) or die "open $path: $!";
-	# test should've called: require_mods('Email::MIME')
-	require PublicInbox::MIME;
-	PublicInbox::MIME->new(\(do { local $/; <$fh> }));
-}
+	run_script start_script key2sub xsys xqx eml_load);
 
 sub eml_load ($) {
 	my ($path, $cb) = @_;
diff --git a/t/eml.t b/t/eml.t
index b7f58ac7..1892b001 100644
--- a/t/eml.t
+++ b/t/eml.t
@@ -12,6 +12,12 @@ SKIP: {
 };
 use_ok $_ for @classes;
 
+sub mime_load ($) {
+	my ($path) = @_;
+	open(my $fh, '<', $path) or die "open $path: $!";
+	PublicInbox::MIME->new(\(do { local $/; <$fh> }));
+}
+
 {
 	my $eml = PublicInbox::Eml->new(\(my $str = "a: b\n\nhi\n"));
 	is($str, "hi\n", '->new modified body like Email::Simple');


* [PATCH 2/2] descend into message/(rfc822|news|global) parts
  @ 2020-05-16 10:03  2% ` Eric Wong
  0 siblings, 0 replies; 13+ results
From: Eric Wong @ 2020-05-16 10:03 UTC (permalink / raw)
  To: meta

Email::MIME never supported this properly, but there are real
instances of forwarded messages as message/rfc822 attachments.
message/news is a legacy thing which we'll see in archives, and
message/global appears to be the new thing.

gmime also supports message/rfc2822, so we'll support it anyway
despite lacking other evidence of its existence.

Existing attachments remain downloadable as a whole message,
but individual attachments of subparts are now downloadable
and can be displayed in HTML, too.

Furthermore, ensure Xapian can now search for common headers
inside those messages as well as the message bodies.
---
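Note: the descend decision described above can be sketched in
plain Perl as follows.  This is only an illustration under stated
assumptions: should_descend() is a hypothetical helper which does
not exist in public-inbox (the patch makes this decision inside
mp_descend() using the parsed Content-Type), and the boundary
checks for multipart messages are left out.

```perl
use strict;
use warnings;

# same subtypes as %MESSAGE_DESCEND in the patch below
my %MESSAGE_DESCEND = map { $_ => 1 } qw(news rfc822 rfc2822 global);

# hypothetical helper, for illustration only: returns 1 if a part
# with the given Content-Type value may contain sub-parts
sub should_descend {
	my ($content_type) = @_;
	my ($type, $subtype) =
		(($content_type // '') =~ m!\A\s*([^/\s;]+)/([^;\s]+)!)
		or return 0;
	($type, $subtype) = (lc($type), lc($subtype));

	# message/* parts we know how to treat as whole messages
	return 1 if $type eq 'message' && $MESSAGE_DESCEND{$subtype};

	# multipart still needs a usable boundary, not checked here
	return $type eq 'multipart' ? 1 : 0;
}

for my $ct ('message/rfc822', 'Message/Global; charset=utf-8',
		'message/partial', 'text/plain') {
	printf("%-32s => %d\n", $ct, should_descend($ct));
}
```

Keeping the subtypes in a hash makes it a one-line change to
recognize another message/* subtype later.
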
 lib/PublicInbox/Eml.pm       | 37 +++++++++++++++++++++++-----
 lib/PublicInbox/MsgIter.pm   |  6 ++++-
 lib/PublicInbox/SearchIdx.pm | 47 ++++++++++++++++++++++--------------
 lib/PublicInbox/View.pm      | 30 ++++++++++++++++++++---
 t/eml.t                      | 28 +++++++++++++++++++++
 t/psgi_attach.t              |  9 +++++++
 t/search.t                   | 25 +++++++++++++++++++
 7 files changed, 154 insertions(+), 28 deletions(-)

diff --git a/lib/PublicInbox/Eml.pm b/lib/PublicInbox/Eml.pm
index ef401141c13..6f6874cd237 100644
--- a/lib/PublicInbox/Eml.pm
+++ b/lib/PublicInbox/Eml.pm
@@ -60,6 +60,14 @@ my %DECODE_FULL = (
 our %STR_TYPE = (text => 1);
 our %STR_SUBTYPE = (plain => 1, html => 1);
 
+# message/* subtypes we descend into
+our %MESSAGE_DESCEND = (
+	news => 1, # RFC 1849 (obsolete, but archives are forever)
+	rfc822 => 1, # RFC 2046
+	rfc2822 => 1, # gmime handles this (but not rfc5322)
+	global => 1, # RFC 6532
+);
+
 my %re_memo;
 sub re_memo ($) {
 	my ($k) = @_;
@@ -149,13 +157,25 @@ sub ct ($) {
 }
 
 # returns a queue of sub-parts iff it's worth descending into
-# TODO: descend into message/rfc822 parts (Email::MIME didn't)
 sub mp_descend ($$) {
 	my ($self, $nr) = @_; # or $once for top-level
-	my $bnd = ct($self)->{attributes}->{boundary} // return; # single-part
+	my $ct = ct($self);
+	my $type = lc($ct->{type});
+	if ($type eq 'message' && $MESSAGE_DESCEND{lc($ct->{subtype})}) {
+		my $nxt = new(undef, body_raw($self));
+		$self->{-call_cb} = $nxt->{is_submsg} = 1;
+		return [ $nxt ];
+	}
+	return if $type ne 'multipart';
+	my $bnd = $ct->{attributes}->{boundary} // return; # single-part
 	return if $bnd eq '' || length($bnd) >= $mime_boundary_length_limit;
 	$bnd = quotemeta($bnd);
 
+	# this is a multipart message that didn't get descended into in
+	# public-inbox <= 1.5.0, so ensure we call the user callback for
+	# this part to not break PSGI downloads.
+	$self->{-call_cb} = $self->{is_submsg};
+
 	# "multipart" messages can exist w/o a body
 	my $bdy = ($nr ? delete($self->{bdy}) : \(body_raw($self))) or return;
 
@@ -189,14 +209,15 @@ sub mp_descend ($$) {
 		# compatibility with Email::MIME
 		$parts[-1] =~ s/\n\r?\n\z/\n/s if $epilogue_missing;
 
-		@parts = grep /[^ \t\r\n]/s, @parts; # ignore empty parts
+		# ignore empty parts
+		@parts = map { new_sub(undef, \$_) } grep /[^ \t\r\n]/s, @parts;
 
 		# Keep "From: someone..." from preamble in old,
 		# buggy versions of git-send-email, otherwise drop it
 		# There's also a case where quoted text showed up in the
 		# preamble
 		# <20060515162817.65F0F1BBAE@citi.umich.edu>
-		unshift(@parts, $pre) if $pre =~ /:/s;
+		unshift(@parts, new_sub(undef, \$pre)) if $pre =~ /:/s;
 		return \@parts;
 	}
 	# "multipart", but no boundary found, treat as single part
@@ -217,6 +238,9 @@ sub each_part {
 	my ($self, $cb, $arg, $once) = @_;
 	my $p = mp_descend($self, $once // 0) or
 					return $cb->([$self, 0, 0], $arg);
+
+	$cb->([$self, 0, 0], $arg) if $self->{-call_cb}; # rare
+
 	$p = [ $p, 0 ];
 	my @s; # our virtual stack
 	my $nr = 0;
@@ -226,11 +250,12 @@ sub each_part {
 		my (undef, @idx) = @$p;
 		@idx = (join('.', @idx));
 		my $depth = ($idx[0] =~ tr/././) + 1;
-		my $sub = new_sub(undef, \(shift @{$p->[0]}));
+		my $sub = shift @{$p->[0]};
 		if ($depth < $mime_nesting_limit &&
 				(my $nxt = mp_descend($sub, $nr))) {
 			push(@s, $p) if scalar @{$p->[0]};
 			$p = [ $nxt, @idx, 0 ];
+			$cb->([$sub, $depth, @idx], $arg) if $sub->{-call_cb};
 		} else { # a leaf node
 			$cb->([$sub, $depth, @idx], $arg);
 		}
@@ -270,7 +295,7 @@ sub subparts {
 	if ($$bdy =~ /^--\Q$bnd\E--[ \t]*\r?\n(.+)\z/sm) {
 		$self->{epilogue} = $1;
 	}
-	map { new_sub(undef, \$_) } @$parts;
+	@$parts;
 }
 
 sub parts_set {
diff --git a/lib/PublicInbox/MsgIter.pm b/lib/PublicInbox/MsgIter.pm
index 7c28d019abc..5ec2a4d9c7f 100644
--- a/lib/PublicInbox/MsgIter.pm
+++ b/lib/PublicInbox/MsgIter.pm
@@ -64,8 +64,12 @@ sub msg_part_text ($$) {
 	# times when it should not have been:
 	#   <87llgalspt.fsf@free.fr>
 	#   <200308111450.h7BEoOu20077@mail.osdl.org>
+	# But also do not try this with ->{is_submsg} (message/rfc822),
+	# since a broken multipart/mixed inside a message/rfc822 part
+	# has not been seen in the wild, yet...
 	if ($err && ($ct =~ m!\btext/\b!i ||
-			$ct =~ m!\bmultipart/mixed\b!i)) {
+			(!$part->{is_submsg} &&
+				$ct =~ m!\bmultipart/mixed\b!i) ) ) {
 		my $cte = $part->header_raw('Content-Transfer-Encoding');
 		if (defined($cte) && $cte =~ /\b7bit\b/i) {
 			$s = $part->body;
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 4bdd69f540b..5f5ae895e43 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -284,6 +284,13 @@ sub index_xapian { # msg_iter callback
 	if (defined $fn && $fn ne '') {
 		index_text($self, $fn, 1, 'XFN');
 	}
+	if ($part->{is_submsg}) {
+		my $mids = mids_for_index($part);
+		index_ids($self, $doc, $part, $mids);
+		my $smsg = PublicInbox::Smsg->new($part);
+		index_users($self, $smsg);
+		index_text($self, $smsg->subject, 1, 'S') if $smsg->subject;
+	}
 
 	my ($s, undef) = msg_part_text($part, $ct);
 	defined $s or return;
@@ -307,6 +314,27 @@ sub index_xapian { # msg_iter callback
 	}
 }
 
+sub index_ids ($$$$) {
+	my ($self, $doc, $hdr, $mids) = @_;
+	for my $mid (@$mids) {
+		index_text($self, $mid, 1, 'XM');
+
+		# because too many Message-IDs are prefixed with
+		# "Pine.LNX."...
+		if ($mid =~ /\w{12,}/) {
+			my @long = ($mid =~ /(\w{3,}+)/g);
+			index_text($self, join(' ', @long), 1, 'XM');
+		}
+	}
+	$doc->add_boolean_term('Q' . $_) for @$mids;
+	for my $l ($hdr->header_raw('List-Id')) {
+		$l =~ /<([^>]+)>/ or next;
+		my $lid = $1;
+		$doc->add_boolean_term('G' . $lid);
+		index_text($self, $lid, 1, 'XL'); # probabilistic
+	}
+}
+
 sub add_xapian ($$$$) {
 	my ($self, $mime, $smsg, $mids) = @_;
 	$smsg->{mime} = $mime; # XXX dangerous
@@ -321,22 +349,12 @@ sub add_xapian ($$$$) {
 	add_val($doc, PublicInbox::Search::DT(), $dt);
 
 	my $tg = term_generator($self);
-
 	$tg->set_document($doc);
 	index_text($self, $subj, 1, 'S') if $subj;
 	index_users($self, $smsg);
 
 	msg_iter($mime, \&index_xapian, [ $self, $doc ]);
-	foreach my $mid (@$mids) {
-		index_text($self, $mid, 1, 'XM');
-
-		# because too many Message-IDs are prefixed with
-		# "Pine.LNX."...
-		if ($mid =~ /\w{12,}/) {
-			my @long = ($mid =~ /(\w{3,}+)/g);
-			index_text($self, join(' ', @long), 1, 'XM');
-		}
-	}
+	index_ids($self, $doc, $hdr, $mids);
 	$smsg->{to} = $smsg->{cc} = ''; # WWW doesn't need these, only NNTP
 	PublicInbox::OverIdx::parse_references($smsg, $hdr, $mids);
 	my $data = $smsg->to_doc_data;
@@ -351,13 +369,6 @@ sub add_xapian ($$$$) {
 			}
 		}
 	}
-	$doc->add_boolean_term('Q' . $_) foreach @$mids;
-	for my $l ($hdr->header_raw('List-Id')) {
-		$l =~ /<([^>]+)>/ or next;
-		my $lid = $1;
-		$doc->add_boolean_term('G' . $lid);
-		index_text($self, $lid, 1, 'XL'); # probabilistic
-	}
 	$self->{xdb}->replace_document($smsg->{num}, $doc);
 }
 
diff --git a/lib/PublicInbox/View.pm b/lib/PublicInbox/View.pm
index ef5f4b3a25e..a1920212194 100644
--- a/lib/PublicInbox/View.pm
+++ b/lib/PublicInbox/View.pm
@@ -17,6 +17,7 @@ use PublicInbox::Address;
 use PublicInbox::WwwStream;
 use PublicInbox::Reply;
 use PublicInbox::ViewDiff qw(flush_diff);
+use PublicInbox::Eml;
 use POSIX qw(strftime);
 use Time::Local qw(timegm);
 use PublicInbox::Smsg qw(subject_normalized);
@@ -480,6 +481,21 @@ sub multipart_text_as_html {
 	$_[0]->each_part(\&add_text_body, $_[1], 1);
 }
 
+sub submsg_hdr ($$) {
+	my ($ctx, $eml) = @_;
+	my $obfs_ibx = $ctx->{-obfs_ibx};
+	my $rv = $ctx->{obuf};
+	$$rv .= "\n";
+	for my $h (qw(From To Cc Subject Date Message-ID X-Alt-Message-ID)) {
+		my @v = $eml->header($h);
+		for my $v (@v) {
+			obfuscate_addrs($obfs_ibx, $v) if $obfs_ibx;
+			$v = ascii_html($v);
+			$$rv .= "$h: $v\n";
+		}
+	}
+}
+
 sub attach_link ($$$$;$) {
 	my ($ctx, $ct, $p, $fn, $err) = @_;
 	my ($part, $depth, $idx) = @$p;
@@ -511,6 +527,9 @@ EOF
 	$desc = ascii_html($desc);
 	$$rv .= ($desc eq '') ? "$ts --]" : "$desc --]\n[-- $ts --]";
 	$$rv .= "</a>\n";
+
+	submsg_hdr($ctx, $part) if $part->{is_submsg};
+
 	undef;
 }
 
@@ -518,6 +537,7 @@ sub add_text_body { # callback for each_part
 	my ($p, $ctx) = @_;
 	my $upfx = $ctx->{mhref};
 	my $ibx = $ctx->{-inbox};
+	my $l = $ctx->{-linkify} //= PublicInbox::Linkify->new;
 	# $p - from each_part: [ Email::MIME-like, depth, $idx ]
 	my ($part, $depth, $idx) = @$p;
 	my $ct = $part->content_type || 'text/plain';
@@ -525,6 +545,12 @@ sub add_text_body { # callback for each_part
 	my ($s, $err) = msg_part_text($part, $ct);
 	return attach_link($ctx, $ct, $p, $fn) unless defined $s;
 
+	my $rv = $ctx->{obuf};
+	if ($part->{is_submsg}) {
+		submsg_hdr($ctx, $part);
+		$$rv .= "\n";
+	}
+
 	# makes no difference to browsers, and don't screw up filename
 	# link generation in diffs with the extra '%0D'
 	$s =~ s/\r\n/\n/sg;
@@ -571,13 +597,11 @@ sub add_text_body { # callback for each_part
 	# split off quoted and unquoted blocks:
 	my @sections = PublicInbox::MsgIter::split_quotes($s);
 	undef $s; # free memory
-	my $rv = $ctx->{obuf};
-	if (defined($fn) || $depth > 0 || $err) {
+	if (defined($fn) || ($depth > 0 && !$part->{is_submsg}) || $err) {
 		# badly-encoded message with $err? tell the world about it!
 		attach_link($ctx, $ct, $p, $fn, $err);
 		$$rv .= "\n";
 	}
-	my $l = $ctx->{-linkify} //= PublicInbox::Linkify->new;
 	foreach my $cur (@sections) {
 		if ($cur =~ /\A>/) {
 			# we use a <span> here to allow users to specify
diff --git a/t/eml.t b/t/eml.t
index c91deb3ab29..b7f58ac7069 100644
--- a/t/eml.t
+++ b/t/eml.t
@@ -117,6 +117,34 @@ EOF
 		'', 'each_part can clobber body');
 }
 
+if ('descend into message/rfc822') {
+	my $eml = eml_load 't/data/message_embed.eml';
+	my @parts;
+	$eml->each_part(sub {
+		my ($part, $level, @ex) = @{$_[0]};
+		push @parts, [ $part, $level, @ex ];
+	});
+	is(scalar(@parts), 6, 'got all parts');
+	like($parts[0]->[0]->body, qr/^testing embedded message harder\n/sm,
+		'first part found');
+	is_deeply([ @{$parts[0]}[1..2] ], [ 1, '1' ],
+		'got expected depth and level for part #0');
+	is($parts[1]->[0]->filename, 'embed2x.eml',
+		'attachment filename found');
+	is_deeply([ @{$parts[1]}[1..2] ], [ 1, '2' ],
+		'got expected depth and level for part #1');
+	is_deeply([ @{$parts[2]}[1..2] ], [ 2, '2.1' ],
+		'got expected depth and level for part #2');
+	is_deeply([ @{$parts[3]}[1..2] ], [ 3, '2.1.1' ],
+		'got expected depth and level for part #3');
+	is_deeply([ @{$parts[4]}[1..2] ], [ 3, '2.1.2' ],
+		'got expected depth and level for part #4');
+	is($parts[4]->[0]->filename, 'test.eml',
+		'another attachment filename found');
+	is_deeply([ @{$parts[5]}[1..2] ], [ 4, '2.1.2.1' ],
+		'got expected depth and level for part #5');
+}
+
 # body-less, boundary-less
 for my $cls (@classes) {
 	my $call = 0;
diff --git a/t/psgi_attach.t b/t/psgi_attach.t
index 12f9e6eeecd..c6f8072ff9a 100644
--- a/t/psgi_attach.t
+++ b/t/psgi_attach.t
@@ -75,6 +75,9 @@ $im->init_bare;
 		$res = $cb->(GET("/test/$mid/"));
 		like($res->content, qr/\bhref="2-embed2x\.eml"/s,
 			'href to message/rfc822 attachment visible');
+		like($res->content, qr/\bhref="2\.1\.2-test\.eml"/s,
+			'href to nested message/rfc822 attachment visible');
+
 		$res = $cb->(GET("/test/$mid/2-embed2x.eml"));
 		my $eml = PublicInbox::Eml->new(\($res->content));
 		is_deeply([ $eml->header_raw('Message-ID') ], [ "<$irt>" ],
@@ -85,6 +88,12 @@ $im->init_bare;
 			'1st attachment is as expected');
 		is($subs[1]->header('Content-Type'), 'message/rfc822',
 			'2nd attachment is as expected');
+
+		$res = $cb->(GET("/test/$mid/2.1.2-test.eml"));
+		$eml = PublicInbox::Eml->new(\($res->content));
+		is_deeply([ $eml->header_raw('Message-ID') ],
+			[ '<20200418214114.7575-1-e@yhbt.net>' ],
+			'nested eml retrieved');
 	});
 }
 done_testing();
diff --git a/t/search.t b/t/search.t
index 6dd5047454a..9d74f5e0532 100644
--- a/t/search.t
+++ b/t/search.t
@@ -479,6 +479,31 @@ EOF
 	is_deeply($found, [], 'matched on phrase with l:');
 }
 
+$ibx->with_umask(sub {
+	$rw_commit->();
+	my $doc_id = $rw->add_message(eml_load('t/data/message_embed.eml'));
+	ok($doc_id > 0, 'messages within messages');
+	$rw->commit_txn_lazy;
+	$ro->reopen;
+	my $n_test_eml = $ro->query('n:test.eml');
+	is(scalar(@$n_test_eml), 1, 'got a result');
+	my $n_embed2x_eml = $ro->query('n:embed2x.eml');
+	is_deeply($n_test_eml, $n_embed2x_eml, '.eml filenames searchable');
+	for my $m (qw(20200418222508.GA13918@dcvr 20200418222020.GA2745@dcvr
+			20200418214114.7575-1-e@yhbt.net)) {
+		is($ro->query("m:$m")->[0]->{mid},
+			'20200418222508.GA13918@dcvr', 'probabilistic m:'.$m);
+		is($ro->query("mid:$m")->[0]->{mid},
+			'20200418222508.GA13918@dcvr', 'boolean mid:'.$m);
+	}
+	is($ro->query('dfpost:4dc62c50')->[0]->{mid},
+		'20200418222508.GA13918@dcvr',
+		'diff search reaches inside message/rfc822');
+	is($ro->query('s:"mail header experiments"')->[0]->{mid},
+		'20200418222508.GA13918@dcvr',
+		'Subject search reaches inside message/rfc822');
+});
+
 done_testing();
 
 1;


* [ANNOUNCE] public-inbox 1.5.0
@ 2020-05-10  7:04 21% Eric Wong
  0 siblings, 0 replies; 13+ results
From: Eric Wong @ 2020-05-10  7:04 UTC (permalink / raw)
  To: meta

This release introduces a new pure-Perl lazy email parser,
PublicInbox::Eml, which uses roughly 10% less memory and
is up to 2x faster than Email::MIME.  This is a major
internal change.

Limits commonly enforced by MTAs are also enforced in the
new parser, as messages may bypass MTA transports.

Email::MIME and other Email::* modules are no longer
dependencies nor used at all outside of maintainer validation
tests.
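
To give a rough idea of what "lazy" means for a parser, here is a
minimal sketch in plain Perl.  It is not the PublicInbox::Eml
implementation: the LazyMsg package is hypothetical, and it assumes
the whole message is already in memory as one string.  The
constructor only splits headers from the body, and individual
headers are extracted when a caller asks for them.

```perl
use strict;
use warnings;

package LazyMsg; # hypothetical name, not part of public-inbox

# new() only splits the header block from the body; no header
# is parsed until a caller asks for one
sub new {
	my ($class, $raw) = @_;
	my ($hdr, $body) = split(/\r?\n\r?\n/, $raw, 2);
	bless { hdr => $hdr // '', body => $body // '' }, $class;
}

# returns every value of header $name (matched case-insensitively)
# with continuation lines unfolded; the header block is scanned on
# each call instead of being parsed up front
sub header_raw {
	my ($self, $name) = @_;
	my @v = ($self->{hdr} =~
		/^\Q$name\E[ \t]*:[ \t]*(.*(?:\n[ \t].*)*)/mgi);
	for (@v) {
		s/\r?\n[ \t]+/ /g; # unfold continuation lines
		s/\r\z//; # drop a trailing CR from CRLF input
	}
	wantarray ? @v : $v[0];
}

sub body { $_[0]->{body} }

package main;

my $msg = LazyMsg->new("Subject: hello\nList-Id: <test.example.org>\n\nhi\n");
print scalar($msg->header_raw('subject')), "\n";
print $msg->body;
```

Callers which never look at headers pay nothing for parsing them;
the tradeoff is that repeated lookups rescan the header block.
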

* public-inbox-index

  - `--max-size=SIZE' CLI switch and `publicinbox.indexMaxSize'
    config file option added to prevent indexing of overly
    large messages.

  - List-Id headers are indexed in new messages; old messages
    can be found after `--reindex'.

* public-inbox-watch

  - multiple values of `publicinbox.<name>.watchheader' are
    now supported, thanks to Kyle Meyer

  - List-Id headers are matched case-insensitively as specified
    by RFC 2919

* PublicInbox::WWW

  - $INBOX_DIR/description and $INBOX_DIR/cloneurl are not
    memoized if missing

  - improved display of threads, thanks to Kyle Meyer

  - search for List-Id is available via `l:' prefix if indexed

  - all encodings are preloaded at startup to reduce fragmentation

  - diffstat linkification and highlighting are stricter and
    less likely to linkify tables in cover letters

  - fix hunk header links to solver which were off-by-one line,
    thanks again to Kyle Meyer

Release tarball available for download over HTTPS or Tor .onion:

https://yhbt.net/public-inbox.git/snapshot/public-inbox-1.5.0.tar.gz
http://ou63pmih66umazou.onion/public-inbox.git/snapshot/public-inbox-1.5.0.tar.gz

Please report bugs via plain-text mail to: meta@public-inbox.org

See archives at https://public-inbox.org/meta/ for all history.
See https://public-inbox.org/TODO for what the future holds.


* [PATCH] various doc updates ahead of 1.5.0
@ 2020-05-10  6:59  5% Eric Wong
  0 siblings, 0 replies; 13+ results
From: Eric Wong @ 2020-05-10  6:59 UTC (permalink / raw)
  To: meta

---
 Documentation/RelNotes/v1.5.0.eml           | 40 +++++++++++++++++++--
 Documentation/technical/data_structures.txt | 17 +++++----
 TODO                                        |  7 ++--
 3 files changed, 53 insertions(+), 11 deletions(-)

diff --git a/Documentation/RelNotes/v1.5.0.eml b/Documentation/RelNotes/v1.5.0.eml
index c9108c15..a9d8b241 100644
--- a/Documentation/RelNotes/v1.5.0.eml
+++ b/Documentation/RelNotes/v1.5.0.eml
@@ -5,21 +5,57 @@ MIME-Version: 1.0
 Content-Type: text/plain; charset=utf-8
 Content-Disposition: inline
 
+This release introduces a new pure-Perl lazy email parser,
+PublicInbox::Eml, which uses roughly 10% less memory and
+is up to 2x faster than Email::MIME.   This is a major
+internal change
+
+Limits commonly enforced by MTAs are also enforced in the
+new parser, as messages may bypass MTA transports.
+
+Email::MIME and other Email::* modules are no longer
+dependencies nor used at all outside of maintainer validation
+tests.
+
 * public-inbox-index
 
   - `--max-size=SIZE' CLI switch and `publicinbox.indexMaxSize'
-     config file option added
+    config file option added to prevent indexing of overly
+    large messages.
+
+  - List-Id headers are indexed in new messages, old messages
+    can be found after `--reindex'.
 
 * public-inbox-watch
 
   - multiple values of `publicinbox.<name>.watchheader' are
-    supported, thanks to Kyle Meyer
+    now supported, thanks to Kyle Meyer
+
+  - List-Id headers are matched case-insensitively as specified
+    by RFC 2919
 
 * PublicInbox::WWW
 
   - $INBOX_DIR/description and $INBOX_DIR/cloneurl are not
     memoized if missing
 
+  - improved display of threads, thanks to Kyle Meyer
+
+  - search for List-Id is available via `l:' prefix if indexed
+
+  - all encodings are preloaded at startup to reduce fragmentation
+
+  - diffstat linkification and highlighting are stricter and
+    less likely to linkify tables in cover letters
+
+  - fix hunk header links to solver which were off-by-one line,
+    thanks again to Kyle Meyer
+
+Release tarball available for download over HTTPS or Tor .onion:
+
+https://yhbt.net/public-inbox.git/snapshot/public-inbox-1.5.0.tar.gz
+http://ou63pmih66umazou.onion/public-inbox.git/snapshot/public-inbox-1.5.0.tar.gz
+
 Please report bugs via plain-text mail to: meta@public-inbox.org
 
 See archives at https://public-inbox.org/meta/ for all history.
diff --git a/Documentation/technical/data_structures.txt b/Documentation/technical/data_structures.txt
index 46d5acff..8776a67b 100644
--- a/Documentation/technical/data_structures.txt
+++ b/Documentation/technical/data_structures.txt
@@ -28,14 +28,13 @@ Outside of tests, this is typically a singleton.
 Per-message classes
 -------------------
 
-* PublicInbox::MIME - Email::MIME subclass
-  Common abbreviation: $mime
+* PublicInbox::Eml - Email::MIME-like class
+  Common abbreviation: $mime, $eml
   Used by: PublicInbox::WWW, PublicInbox::SearchIdx
 
-  An representation of an entire email, multipart or not.  It's
-  a subclass of Email::MIME to workaround bugs in old
-  Email::MIME versions.  An option to use libgmime or libmailutils
-  may be supported in the future for performance and memory use.
+  An representation of an entire email, multipart or not.
+  An option to use libgmime or libmailutils may be supported
+  in the future for performance and memory use.
 
   This can be a memory hog with big messages and giant
   attachments, so our PublicInbox::WWW interface only keeps
@@ -47,6 +46,12 @@ Per-message classes
   Our PublicInbox::V2Writable class may have two objects of this
   type in memory at-a-time for deduplication.
 
+  In public-inbox 1.4 and earlier, Email::MIME and its subclass,
+  PublicInbox::MIME were used.  Despite still slurping,
+  PublicInbox::Eml is faster and uses less memory due to
+  lazy header parsing and lazy subpart instantiation with
+  shorter object lifetimes.
+
 * PublicInbox::Smsg - small message skeleton
   Used by: PublicInbox::{NNTP,WWW,SearchIdx}
   Common abbreviation: $smsg
diff --git a/TODO b/TODO
index 4c4e8e00..16de36bf 100644
--- a/TODO
+++ b/TODO
@@ -42,6 +42,7 @@ all need to be considered for everything we introduce)
   while retaining compatibility with old versions.
 
 * Support more of RFC 3977 (NNTP)
+  Is there anything left for read-only support?
 
 * Combined "super server" for NNTP/HTTP/POP3 to reduce memory overhead
 
@@ -75,9 +76,9 @@ all need to be considered for everything we introduce)
 * linkify thread skeletons better
   https://public-inbox.org/git/6E3699DEA672430CAEA6DEFEDE6918F4@PhilipOakley/
 
-* low-memory Email::MIME replacement: currently we generate many
-  allocations/strings for headers we never look at and slurp
-  entire message bodies into memory.  GMime+Inline::C could work.
+* Further lower mail parser memory usage.  We still slurp entire
+  message bodies into memory and incur 2-3x overhead on
+  multipart messages.  Inline::C (and maybe gmime) could work.
 
 * use REQUEST_URI properly for CGI / mod_perl2 compatibility
   with Message-IDs which include '%' (done?)


* [PATCH 2/2] doc: update 1.4.0 relnotes with date, start 1.5.0
  @ 2020-04-17  9:11  5% ` Eric Wong
  0 siblings, 0 replies; 13+ results
From: Eric Wong @ 2020-04-17  9:11 UTC (permalink / raw)
  To: meta

---
 Documentation/RelNotes/v1.4.0.eml |  4 +++-
 Documentation/RelNotes/v1.5.0.eml | 13 +++++++++++++
 MANIFEST                          |  1 +
 3 files changed, 17 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/RelNotes/v1.5.0.eml

diff --git a/Documentation/RelNotes/v1.4.0.eml b/Documentation/RelNotes/v1.4.0.eml
index ae7c1457..845895b5 100644
--- a/Documentation/RelNotes/v1.4.0.eml
+++ b/Documentation/RelNotes/v1.4.0.eml
@@ -1,9 +1,11 @@
+Date: Fri, 17 Apr 2020 08:48:59 +0000
 From: Eric Wong <e@yhbt.net>
 To: meta@public-inbox.org
 Subject: [ANNOUNCE] public-inbox 1.4.0
-Message-Id: <20200417084800.public-inbox-1.4.0-rele@sed>
+Message-ID: <20200417084800.public-inbox-1.4.0-rele@sed>
 MIME-Version: 1.0
 Content-Type: text/plain; charset=utf-8
+Content-Disposition: inline
 
 This release focuses on reproducibility improvements and
 bugfixes for corner-cases.  Busy instances of PublicInbox::WWW
diff --git a/Documentation/RelNotes/v1.5.0.eml b/Documentation/RelNotes/v1.5.0.eml
new file mode 100644
index 00000000..4b01eef2
--- /dev/null
+++ b/Documentation/RelNotes/v1.5.0.eml
@@ -0,0 +1,13 @@
+From: Eric Wong <e@yhbt.net>
+To: meta@public-inbox.org
+Subject: [WIP] public-inbox 1.5.0
+MIME-Version: 1.0
+Content-Type: text/plain; charset=utf-8
+Content-Disposition: inline
+
+TBD
+
+Please report bugs via plain-text mail to: meta@public-inbox.org
+
+See archives at https://public-inbox.org/meta/ for all history.
+See https://public-inbox.org/TODO for what the future holds.
diff --git a/MANIFEST b/MANIFEST
index cb7d52a7..ba5cc6a4 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -8,6 +8,7 @@ Documentation/RelNotes/v1.1.0-pre1.eml
 Documentation/RelNotes/v1.2.0.eml
 Documentation/RelNotes/v1.3.0.eml
 Documentation/RelNotes/v1.4.0.eml
+Documentation/RelNotes/v1.5.0.eml
 Documentation/dc-dlvr-spam-flow.txt
 Documentation/design_notes.txt
 Documentation/design_www.txt


2020-04-17  9:11     [PUSHED 0/2] relnotes and some notes about the future Eric Wong
2020-04-17  9:11  5% ` [PATCH 2/2] doc: update 1.4.0 relnotes with date, start 1.5.0 Eric Wong
2020-05-10  6:59  5% [PATCH] various doc updates ahead of 1.5.0 Eric Wong
2020-05-10  7:04 21% [ANNOUNCE] public-inbox 1.5.0 Eric Wong
2020-05-13 21:48     I have figured out IMAP IDLE Eric W. Biederman
2020-05-13 22:17     ` Eric Wong
2020-05-14 12:32       ` Eric W. Biederman
2020-05-15 21:00         ` [PATCH 1/2] IMAPTracker: Add a helper to track our place in reading imap mailboxes Eric W. Biederman
2020-05-15 21:02           ` [PATCH 2/2] imap_fetch: Add a command to continuously fetch from an imap mailbox Eric W. Biederman
2020-05-15 21:26             ` Eric W. Biederman
2020-05-15 22:56               ` Eric Wong
2020-05-16 10:47                 ` Eric W. Biederman
2020-05-16 19:12                   ` Eric Wong
2020-05-16 20:09                     ` Eric W. Biederman
2020-05-16 22:53  4%                   ` [PATCH] confine Email::MIME use even further Eric Wong
2020-05-16 10:03     [PATCH/RFC 0/2] recurse into message/rfc822 parts Eric Wong
2020-05-16 10:03  2% ` [PATCH 2/2] descend into message/(rfc822|news|global) parts Eric Wong
2020-05-28 15:12  4% bug: httpd: incorrect Unicode output of $INBOX_DIR/description Julien Moutinho
2020-07-05 23:27     [PATCH 00/43] www: async git cat-file w/ -httpd Eric Wong
2020-07-05 23:27  4% ` [PATCH 37/43] www: update internal docs Eric Wong
2020-07-14 10:06  5% [PATCH] doc: release notes and version info updates Eric Wong
2020-08-20 20:24     [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
2020-08-20 20:24  3% ` [PATCH 22/23] init+index: support --skip-docdata for Xapian Eric Wong
2020-08-27 12:16     [PATCH 0/8] mostly watch-related odds and ends Eric Wong
2020-08-27 12:17  5% ` [PATCH 7/8] doc: move watch config docs to -watch manpage Eric Wong
2020-09-19 21:42  4% [PATCH] doc: post-1.6 updates, start 1.7 Eric Wong
2022-08-03  7:59     [PATCH 0/4] compression-related stuff Eric Wong
2022-08-03  7:59  6% ` [PATCH 2/4] www: gzip_filter: update a few comments Eric Wong
2023-11-11  9:04     [PATCH 0/4] support publicinboxImport.dropUniqueUnsubscribe Eric Wong
2023-11-11  9:04  3% ` [PATCH 2/4] mda|learn|watch: support dropUniqueUnsubscribe config Eric Wong

Code repositories for project(s) associated with this public inbox

	https://80x24.org/public-inbox.git
