user/dev discussion of public-inbox itself
* [PATCH 1/2] watch: support incremental updates from MH
  @ 2024-01-30  6:31  4% ` Eric Wong
  0 siblings, 0 replies; 31+ results
From: Eric Wong @ 2024-01-30  6:31 UTC (permalink / raw)
  To: meta

The good news (compared to lei) is we only have to worry about
imports and don't care about the filename nor keywords, so it's
immune to .mh_sequences writing inconsistencies across MH
implementations and sequence number packing.

We still assume the writer will write the mail file with one of:
* rename(2) to create the final sequence number filename
* a single write(2) if not relying on rename(2)

mlmmj and mutt satisfy these requirements.  Python's Lib/mailbox.py
may, I'm not sure...
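
For illustration only, the rename(2) style of delivery described above
could look like the following Python sketch.  It is not part of this
patch (which is Perl); `deliver_mh' is a hypothetical helper, and picking
the next sequence number this way is not safe against concurrent writers.

```python
import os
import re
import tempfile


def deliver_mh(mh_dir: str, raw: bytes) -> str:
    """Write `raw` into mh_dir under the next unused sequence number.

    Hypothetical helper: the message is written in full under a
    non-numeric temporary name, then rename(2)d to its final name, so a
    watcher never sees a partially written numeric file.
    """
    seqs = [int(n) for n in os.listdir(mh_dir) if re.fullmatch(r"[0-9]+", n)]
    final = os.path.join(mh_dir, str(max(seqs, default=0) + 1))
    # the temporary file lives in the same directory so rename stays atomic
    fd, tmp = tempfile.mkstemp(prefix=".tmp-", dir=mh_dir)
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(raw)
        os.rename(tmp, final)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    return final
```

A watcher that only accepts all-digit basenames ignores the `.tmp-*'
name, so it only reacts once the rename has happened.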
---
 Documentation/public-inbox-watch.pod |  16 ++--
 MANIFEST                             |   1 +
 lib/PublicInbox/Watch.pm             |  92 ++++++++++++++------
 t/watch_maildir.t                    |   2 +-
 t/watch_mh.t                         | 120 +++++++++++++++++++++++++++
 5 files changed, 198 insertions(+), 33 deletions(-)
 create mode 100644 t/watch_mh.t

diff --git a/Documentation/public-inbox-watch.pod b/Documentation/public-inbox-watch.pod
index 6f812966..6e2142fe 100644
--- a/Documentation/public-inbox-watch.pod
+++ b/Documentation/public-inbox-watch.pod
@@ -48,9 +48,11 @@ of large Maildirs.
 Upon startup, it scans the mailbox for new messages to be
 imported while it was not running.
 
-As of public-inbox 1.6.0, Maildirs, IMAP folders, and NNTP
-newsgroups are supported.  Previous versions of public-inbox
-only supported Maildirs.
+All versions of public-inbox-watch support Maildirs.  public-inbox
+1.6.0 added support for IMAP folders and NNTP newsgroups.
+public-inbox 2.0 adds support for MH directories.  There are no
+plans to support the mbox family since new messages are expensive
+to detect in large mboxes.
 
 public-inbox-watch should be run inside a L<screen(1)> session
 or as a L<systemd(1)> service.  Errors are emitted to stderr.
@@ -84,12 +86,16 @@ C<imap://> and C<imaps://> URLs:
 		watch = nntp://news.example.com/inbox.test.group
 		watch = imaps://user@mail.example.com/INBOX.test
 
+2.0+ supports MH:
+
+		watch = mh:/path/to/MH/inbox.test
+
 This may be specified multiple times to combine several mailboxes
 into a single public-inbox.  URLs requiring authentication
 will require L<netrc(5)> and/or L<git-credential(1)> (preferred) to fill
 in the username and password.
 
-public-inbox 2.0+ supports boolean C<false> to prevent the global
+public-inbox 2.0+ also supports boolean C<false> to prevent the global
 L</publicinboxwatch.watchspam> directive from writing to the inbox.
 
 Default: none
@@ -127,7 +133,7 @@ Messages without the (S)een flag are not considered for hiding.
 This hiding affects all configured public-inboxes in PI_CONFIG.
 
 As with C<publicinbox.$NAME.watch>, C<imap://> and C<imaps://> URLs
-are supported in public-inbox 1.6.0+.
+are supported in public-inbox 1.6.0+, and C<MH> in 2.0+.
 
 Default: none; only for L<public-inbox-watch(1)> users
 
diff --git a/MANIFEST b/MANIFEST
index 051cd6f9..2223cfb4 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -627,6 +627,7 @@ t/watch_filter_rubylang.t
 t/watch_imap.t
 t/watch_maildir.t
 t/watch_maildir_v2.t
+t/watch_mh.t
 t/watch_multiple_headers.t
 t/www_altid.t
 t/www_listing.t
diff --git a/lib/PublicInbox/Watch.pm b/lib/PublicInbox/Watch.pm
index b83a77eb..1ec574ea 100644
--- a/lib/PublicInbox/Watch.pm
+++ b/lib/PublicInbox/Watch.pm
@@ -16,6 +16,7 @@ use PublicInbox::DS qw(now add_timer awaitpid);
 use PublicInbox::MID qw(mids);
 use PublicInbox::ContentHash qw(content_hash);
 use POSIX qw(_exit WNOHANG);
+use constant { D_MAILDIR => 1, D_MH => 2 };
 
 sub compile_watchheaders ($) {
 	my ($ibx) = @_;
@@ -40,9 +41,22 @@ sub compile_watchheaders ($) {
 	$ibx->{-watchheaders} = $watch_hdrs if scalar @$watch_hdrs;
 }
 
+sub d_type_set ($$$) {
+	my ($d_type, $dir, $is) = @_;
+	my $isnt = D_MAILDIR;
+	if ($is == D_MAILDIR) {
+		$isnt = D_MH;
+		$d_type->{"$dir/cur"} |= $is;
+		$d_type->{"$dir/new"} |= $is;
+	}
+	warn <<EOM if ($d_type->{$dir} |= $is) & $isnt;
+W: `$dir' is both Maildir and MH (non-fatal)
+EOM
+}
+
 sub new {
 	my ($class, $cfg) = @_;
-	my (%mdmap);
+	my (%d_map, %d_type);
 	my (%imap, %nntp); # url => [inbox objects] or 'watchspam'
 	my (@imap, @nntp);
 	PublicInbox::Import::load_config($cfg);
@@ -57,7 +71,11 @@ sub new {
 			my $uri;
 			if (is_maildir($dir)) {
 				# skip "new", no MUA has seen it, yet.
-				$mdmap{"$dir/cur"} = 'watchspam';
+				$d_map{"$dir/cur"} = 'watchspam';
+				d_type_set \%d_type, $dir, D_MAILDIR;
+			} elsif (is_mh($dir)) {
+				$d_map{$dir} = 'watchspam';
+				d_type_set \%d_type, $dir, D_MH;
 			} elsif ($uri = imap_uri($dir)) {
 				$imap{$$uri} = 'watchspam';
 				push @imap, $uri;
@@ -69,7 +87,6 @@ sub new {
 			}
 		}
 	}
-
 	my $k = 'publicinboxwatch.spamcheck';
 	my $default = undef;
 	my $spamcheck = PublicInbox::Spamcheck::get($cfg, $k, $default);
@@ -91,10 +108,17 @@ sub new {
 			} elsif (is_maildir($watch)) {
 				compile_watchheaders($ibx);
 				my ($new, $cur) = ("$watch/new", "$watch/cur");
-				my $cur_dst = $mdmap{$cur} //= [];
+				my $cur_dst = $d_map{$cur} //= [];
 				return if is_watchspam($cur, $cur_dst, $ibx);
-				push @{$mdmap{$new} //= []}, $ibx;
+				push @{$d_map{$new} //= []}, $ibx;
 				push @$cur_dst, $ibx;
+				d_type_set \%d_type, $watch, D_MAILDIR;
+			} elsif (is_mh($watch)) {
+				my $cur_dst = $d_map{$watch} //= [];
+				return if is_watchspam($watch, $cur_dst, $ibx);
+				compile_watchheaders($ibx);
+				push(@$cur_dst, $ibx);
+				d_type_set \%d_type, $watch, D_MH;
 			} elsif ($uri = imap_uri($watch)) {
 				my $cur_dst = $imap{$$uri} //= [];
 				return if is_watchspam($uri, $cur_dst, $ibx);
@@ -111,18 +135,19 @@ sub new {
 		}
 	});
 
-	my $mdre;
-	if (scalar keys %mdmap) {
-		$mdre = join('|', map { quotemeta($_) } keys %mdmap);
-		$mdre = qr!\A($mdre)/!;
+	my $d_re;
+	if (scalar keys %d_map) {
+		$d_re = join('|', map quotemeta, keys %d_map);
+		$d_re = qr!\A($d_re)/!;
 	}
-	return unless $mdre || scalar(keys %imap) || scalar(keys %nntp);
+	return unless $d_re || scalar(keys %imap) || scalar(keys %nntp);
 
 	bless {
 		max_batch => 10, # avoid hogging locks for too long
 		spamcheck => $spamcheck,
-		mdmap => \%mdmap,
-		mdre => $mdre,
+		d_map => \%d_map,
+		d_re => $d_re,
+		d_type => \%d_type,
 		pi_cfg => $cfg,
 		imap => scalar keys %imap ? \%imap : undef,
 		nntp => scalar keys %nntp? \%nntp : undef,
@@ -220,17 +245,23 @@ sub import_eml ($$$) {
 
 sub _try_path {
 	my ($self, $path) = @_;
-	my $fl = PublicInbox::MdirReader::maildir_path_flags($path) // return;
-	return if $fl =~ /[DT]/; # no Drafts or Trash
-	if ($path !~ $self->{mdre}) {
-		warn "unrecognized path: $path\n";
-		return;
-	}
-	my $inboxes = $self->{mdmap}->{$1};
-	unless ($inboxes) {
-		warn "unmappable dir: $1\n";
-		return;
-	}
+	$path =~ $self->{d_re} or
+		return warn("BUG? unrecognized path: $path\n");
+	my $dir = $1;
+	my $inboxes = $self->{d_map}->{$dir} //
+		return warn("W: unmappable dir: $dir\n");
+	my ($md_fl, $mh_seq);
+	if ($self->{d_type}->{$dir} & D_MH) {
+		$path =~ m!/([0-9]+)\z! ? ($mh_seq = $1) : return;
+	}
+	$self->{d_type}->{$dir} & D_MAILDIR and
+		$md_fl = PublicInbox::MdirReader::maildir_path_flags($path);
+	$md_fl // $mh_seq // return;
+	return if ($md_fl // '') =~ /[DT]/; # no Drafts or Trash
+	# n.b. none of the MH keywords are relevant for public mail,
+	# mh_seq is only used to validate we're reading an email
+	# and not treating .mh_sequences as an email
+
 	my $warn_cb = $SIG{__WARN__} || \&CORE::warn;
 	local $SIG{__WARN__} = sub {
 		my $pfx = ($_[0] // '') =~ /^([A-Z]: )/g ? $1 : '';
@@ -288,7 +319,7 @@ sub watch_fs_init ($) {
 	require PublicInbox::DirIdle;
 	# inotify_create + EPOLL_CTL_ADD
 	my $dir_idle = $self->{dir_idle} = PublicInbox::DirIdle->new($cb);
-	$dir_idle->add_watches([keys %{$self->{mdmap}}]);
+	$dir_idle->add_watches([keys %{$self->{d_map}}]);
 }
 
 sub net_cb { # NetReader::(nntp|imap)_each callback
@@ -437,7 +468,7 @@ sub event_step {
 		};
 		die $@ if $@;
 	}
-	fs_scan_step($self) if $self->{mdre};
+	fs_scan_step($self) if $self->{d_re};
 }
 
 sub watch_imap_fetch_all ($$) {
@@ -541,7 +572,7 @@ sub watch { # main entry point
 		# poll all URIs for a given interval sequentially
 		add_timer(0, \&poll_fetch_fork, $self, $intvl, $uris);
 	}
-	watch_fs_init($self) if $self->{mdre};
+	watch_fs_init($self) if $self->{d_re};
 	local @PublicInbox::DS::post_loop_do = (\&quit_inprogress, $self);
 	PublicInbox::DS::event_loop($first_sig); # calls ->event_step
 	_done_for_now($self);
@@ -572,7 +603,7 @@ sub fs_scan_step {
 		$opendirs->{$dir} = $dh if $n < 0;
 	}
 	if ($op && $op eq 'full') {
-		foreach my $dir (keys %{$self->{mdmap}}) {
+		foreach my $dir (keys %{$self->{d_map}}) {
 			next if $opendirs->{$dir}; # already in progress
 			my $ok = opendir(my $dh, $dir);
 			unless ($ok) {
@@ -647,6 +678,13 @@ sub is_maildir {
 	$_[0];
 }
 
+sub is_mh {
+	$_[0] =~ s!\Amh:!!i or return;
+	$_[0] =~ tr!/!/!s;
+	$_[0] =~ s!/\z!!;
+	$_[0];
+}
+
 sub is_watchspam {
 	my ($cur, $ws, $ibx) = @_;
 	if ($ws && !ref($ws) && $ws eq 'watchspam') {
diff --git a/t/watch_maildir.t b/t/watch_maildir.t
index d7f01b1a..a12ceefd 100644
--- a/t/watch_maildir.t
+++ b/t/watch_maildir.t
@@ -46,7 +46,7 @@ my $sem = PublicInbox::Emergency->new($spamdir); # create dirs
 EOF
 	my $wm = PublicInbox::Watch->new($cfg);
 	is(scalar grep(/is a spam folder/, @w), 1, 'got warning about spam');
-	is_deeply($wm->{mdmap}, { "$spamdir/cur" => 'watchspam' },
+	is_deeply($wm->{d_map}, { "$spamdir/cur" => 'watchspam' },
 		'only got the spam folder to watch');
 }
 
diff --git a/t/watch_mh.t b/t/watch_mh.t
new file mode 100644
index 00000000..04793750
--- /dev/null
+++ b/t/watch_mh.t
@@ -0,0 +1,120 @@
+#!perl -w
+# Copyright (C) all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+use v5.12;
+use PublicInbox::Eml;
+use PublicInbox::TestCommon;
+use PublicInbox::Import;
+use PublicInbox::IO qw(write_file);
+use POSIX qw(mkfifo);
+use File::Copy qw(cp);
+use autodie qw(rename mkdir);
+
+my $tmpdir = tmpdir;
+my $git_dir = "$tmpdir/test.git";
+my $mh = "$tmpdir/mh";
+my $spamdir = "$tmpdir/mh-spam";
+mkdir $_ for ($mh, $spamdir);
+use_ok 'PublicInbox::Watch';
+my $addr = 'test-public@example.com';
+my $default_branch = PublicInbox::Import::default_branch;
+PublicInbox::Import::init_bare($git_dir);
+my $msg = <<EOF;
+From: user\@example.com
+To: $addr
+Subject: spam
+Message-ID: <a\@b.com>
+Date: Sat, 18 Jun 2016 00:00:00 +0000
+
+something
+EOF
+
+cp 't/plack-qp.eml', "$mh/1";
+mkfifo("$mh/5", 0777) or xbail "mkfifo: $!"; # FIFO to ensure no stuckage
+my $cfg = cfg_new $tmpdir, <<EOF;
+[publicinbox "test"]
+	address = $addr
+	inboxdir = $git_dir
+	watch = mh:$mh
+[publicinboxlearn]
+	watchspam = mh:$spamdir
+EOF
+PublicInbox::Watch->new($cfg)->scan('full');
+my $git = PublicInbox::Git->new($git_dir);
+{
+	my @list = $git->qx('rev-list', $default_branch);
+	is(scalar @list, 1, 'one revision in rev-list');
+	$git->cleanup;
+}
+
+# end-to-end test which actually uses inotify/kevent
+{
+	my $env = { PI_CONFIG => $cfg->{-f} };
+	# n.b. --no-scan is only intended for testing atm
+	my $wm = start_script([qw(-watch --no-scan)], $env);
+	no_pollerfd($wm->{pid});
+
+	my $eml = eml_load 't/data/binary.patch';
+	$eml->header_set('Cc', $addr);
+	write_file '>', "$mh/2.tmp", $eml->as_string;
+
+	use_ok 'PublicInbox::InboxIdle';
+	use_ok 'PublicInbox::DS';
+	my $delivered = 0;
+	my $cb = sub {
+		my ($ibx) = @_;
+		diag "message delivered to `$ibx->{name}'";
+		$delivered++;
+	};
+	PublicInbox::DS->Reset;
+	my $ii = PublicInbox::InboxIdle->new($cfg);
+	my $obj = bless \$cb, 'PublicInbox::TestCommon::InboxWakeup';
+	$cfg->each_inbox(sub { $_[0]->subscribe_unlock('ident', $obj) });
+	local @PublicInbox::DS::post_loop_do = (sub { $delivered == 0 });
+
+	# wait for -watch to setup inotify watches
+	my $sleep = 1;
+	if (eval { require PublicInbox::Inotify } && -d "/proc/$wm->{pid}/fd") {
+		my $end = time + 2;
+		my (@ino, @ino_info);
+		do {
+			@ino = grep {
+				(readlink($_)//'') =~ /\binotify\b/
+			} glob("/proc/$wm->{pid}/fd/*");
+		} until (@ino || time > $end || !tick);
+		if (scalar(@ino) == 1) {
+			my $ino_fd = (split(m'/', $ino[0]))[-1];
+			my $ino_fdinfo = "/proc/$wm->{pid}/fdinfo/$ino_fd";
+			while (time < $end && open(my $fh, '<', $ino_fdinfo)) {
+				@ino_info = grep(/^inotify wd:/, <$fh>);
+				last if @ino_info >= 2;
+				tick;
+			}
+			$sleep = undef if @ino_info >= 2;
+		}
+	}
+	if ($sleep) {
+		diag "waiting ${sleep}s for -watch to start up";
+		sleep $sleep;
+	}
+	rename "$mh/2.tmp", "$mh/2";
+	diag 'waiting for -watch to import new message';
+	PublicInbox::DS::event_loop();
+
+	my $subj = $eml->header_raw('Subject');
+	my $head = $git->qx(qw(cat-file commit HEAD));
+	like $head, qr/^\Q$subj\E/sm, 'new commit made';
+
+	$wm->kill;
+	$wm->join;
+	$ii->close;
+	PublicInbox::DS->Reset;
+}
+
+my $is_mh = sub { PublicInbox::Watch::is_mh(my $val = shift) };
+
+is $is_mh->('mh:/hello//world'), '/hello/world', 'extra slash gone';
+is $is_mh->('MH:/hello/world/'), '/hello/world', 'trailing slash gone';
+is $is_mh->('maildir:/hello/world/'), undef, 'non-MH rejected';
+
+done_testing;
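
The path handling in the patch above (`is_mh' and the sequence-number
check in `_try_path') can be restated as a small Python sketch.  This is
a stand-in for readers who do not read Perl, not code from the patch;
`mh_seq' is a hypothetical name for the inline regexp check.

```python
import re


def is_mh(spec):
    """Return the normalized directory for an "mh:" spec, else None."""
    m = re.match(r"mh:", spec, re.IGNORECASE)
    if m is None:
        return None
    path = re.sub(r"/+", "/", spec[m.end():])  # squeeze repeated slashes
    return re.sub(r"/\Z", "", path)            # drop one trailing slash


def mh_seq(path):
    """Return the sequence number if the basename is all digits, else None."""
    m = re.search(r"/([0-9]+)\Z", path)
    return m.group(1) if m else None
```

Only all-digit basenames count as messages, which is what keeps
`.mh_sequences' and temporary names such as `2.tmp' from being imported.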


* [PATCH] watch: support `watch=false' to negate watchspam
@ 2023-11-22  1:04  5% Eric Wong
  0 siblings, 0 replies; 31+ results
From: Eric Wong @ 2023-11-22  1:04 UTC (permalink / raw)
  To: meta

This is for users who host read-only mirrors (via clone|fetch) while
feeding other inboxes via -watch: `watch=false' keeps the global
watchspam directive from writing to the mirrors.
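
As a rough Python illustration of the tri-state check (not the Perl in
this patch): only a value that parses as boolean false disables the
inbox, while anything that is not a boolean falls through to the usual
location handling.  The `git_bool' below is a simplified stand-in that
assumes only the common git boolean spellings.

```python
def git_bool(value):
    """Return True or False for common git boolean spellings, else None."""
    v = value.strip().lower()
    if v in ("false", "no", "off", "0"):
        return False
    if v in ("true", "yes", "on", "1"):
        return True
    return None  # not a boolean, e.g. "maildir:/path"


def watch_disabled(watch_values):
    """True if any configured watch value is an explicit boolean false."""
    return any(git_bool(w) is False for w in watch_values)
```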
---
 I'm also considering a `fetchonly' directive for -learn/-mda,
 too; but I think overloading watch can coexist with that...

 Documentation/public-inbox-watch.pod |  5 ++++-
 lib/PublicInbox/Watch.pm             |  6 +++++-
 t/watch_maildir.t                    | 12 +++++++++++-
 3 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/Documentation/public-inbox-watch.pod b/Documentation/public-inbox-watch.pod
index 7c21f7ce..6f812966 100644
--- a/Documentation/public-inbox-watch.pod
+++ b/Documentation/public-inbox-watch.pod
@@ -78,7 +78,7 @@ C<maildir:> paths:
 	[publicinbox "test"]
 		watch = maildir:/path/to/maildirs/.INBOX.test/
 
-public-inbox 1.6.0 supports C<nntp://>, C<nntps://>,
+public-inbox 1.6.0+ supports C<nntp://>, C<nntps://>,
 C<imap://> and C<imaps://> URLs:
 
 		watch = nntp://news.example.com/inbox.test.group
@@ -89,6 +89,9 @@ into a single public-inbox.  URLs requiring authentication
 will require L<netrc(5)> and/or L<git-credential(1)> (preferred) to fill
 in the username and password.
 
+public-inbox 2.0+ supports boolean C<false> to prevent the global
+L</publicinboxwatch.watchspam> directive from writing to the inbox.
+
 Default: none
 
 =item publicinbox.<name>.watchheader
diff --git a/lib/PublicInbox/Watch.pm b/lib/PublicInbox/Watch.pm
index 5253ec94..b83a77eb 100644
--- a/lib/PublicInbox/Watch.pm
+++ b/lib/PublicInbox/Watch.pm
@@ -85,7 +85,10 @@ sub new {
 		$watches = PublicInbox::Config::_array($watches);
 		for my $watch (@$watches) {
 			my $uri;
-			if (is_maildir($watch)) {
+			my $bool = $cfg->git_bool($watch);
+			if (defined $bool && !$bool) {
+				$ibx->{-watch_disabled} = 1;
+			} elsif (is_maildir($watch)) {
 				compile_watchheaders($ibx);
 				my ($new, $cur) = ("$watch/new", "$watch/cur");
 				my $cur_dst = $mdmap{$cur} //= [];
@@ -143,6 +146,7 @@ sub _done_for_now {
 
 sub remove_eml_i { # each_inbox callback
 	my ($ibx, $self, $eml, $loc) = @_;
+	return if $ibx->{-watch_disabled};
 
 	eval {
 		# try to avoid taking a lock or unnecessary spawning
diff --git a/t/watch_maildir.t b/t/watch_maildir.t
index 69a5e1f3..07ebeef6 100644
--- a/t/watch_maildir.t
+++ b/t/watch_maildir.t
@@ -16,7 +16,6 @@ use_ok 'PublicInbox::Emergency';
 my $addr = 'test-public@example.com';
 my $default_branch = PublicInbox::Import::default_branch;
 PublicInbox::Import::init_bare($git_dir);
-
 my $msg = <<EOF;
 From: user\@example.com
 To: $addr
@@ -26,6 +25,9 @@ Date: Sat, 18 Jun 2016 00:00:00 +0000
 
 something
 EOF
+
+my $ibx_ro = create_inbox 'ro', sub { $_[0]->add(PublicInbox::Eml->new($msg)) };
+
 PublicInbox::Emergency->new($maildir)->prepare(\$msg);
 ok(POSIX::mkfifo("$maildir/cur/fifo", 0777),
 	'create FIFO to ensure we do not get stuck on it :P');
@@ -56,6 +58,10 @@ my $cfg = cfg_new $tmpdir, <<EOF;
 	filter = PublicInbox::Filter::Vger
 [publicinboxlearn]
 	watchspam = maildir:$spamdir
+[publicinbox "test-ro"]
+	watch = false
+	inboxdir = $ibx_ro->{inboxdir}
+	address = ro-test\@example.com
 EOF
 my $cfg_path = $cfg->{-f};
 PublicInbox::Watch->new($cfg)->scan('full');
@@ -82,6 +88,10 @@ is(scalar @list, 2, 'two revisions in rev-list');
 is(scalar @list, 0, 'tree is empty');
 is(unlink(glob("$spamdir/cur/*")), 1, 'unlinked trained spam');
 
+@list = $ibx_ro->git->qx(qw(ls-tree -r --name-only), $default_branch);
+undef $ibx_ro;
+is scalar(@list), 1, 'read-only inbox is unchanged';
+
 # check with scrubbing
 {
 	$msg .= qq(--


* [PATCH 4/7] doc: update 1.7 release notes, tuning, TODO
  @ 2021-03-11 10:45  4% ` Eric Wong
  0 siblings, 0 replies; 31+ results
From: Eric Wong @ 2021-03-11 10:45 UTC (permalink / raw)
  To: meta

Some stuff done, some stuff still needs doing.
---
 Documentation/RelNotes/v1.7.0.wip     | 59 ++++++++++++++++++++++++++-
 Documentation/public-inbox-tuning.pod |  9 ++++
 TODO                                  | 16 +++-----
 3 files changed, 72 insertions(+), 12 deletions(-)

diff --git a/Documentation/RelNotes/v1.7.0.wip b/Documentation/RelNotes/v1.7.0.wip
index a35ff227..f71f447f 100644
--- a/Documentation/RelNotes/v1.7.0.wip
+++ b/Documentation/RelNotes/v1.7.0.wip
@@ -4,12 +4,69 @@ MIME-Version: 1.0
 Content-Type: text/plain; charset=utf-8
 Content-Disposition: inline
 
-TODO: gcf2, detached indices, JMAP, ...
+Another big release focused on multi-inbox search and scalability.
+
+* general changes
+
+  config file parsing is 2x faster with 50K inboxes
+
+* read-only public-inbox-daemon (-httpd, -nntpd, -imapd):
+
+  libgit2 may be used via Inline::C to avoid hitting system pipe
+  and process limits.  See public-inbox-tuning(7) manpage
+  for more details.
+
+* public-inbox-extindex
+
+  A new Xapian + SQLite index able to search across several inboxes.
+  This may be configured to replace per-inbox Xapian DBs,
+  (but not per-inbox SQLite indices) and speed up manifest.js.gz
+  generation.
+
+  See public-inbox-extindex-format(5) and
+  public-inbox-extindex(1) manpages for more details.
+
+* public-inbox-nntpd
+
+  - startup is 6x faster with 50K inboxes if using -extindex
+
+* PublicInbox::WWW
+
+  - mboxrd search results are returned in reverse Xapian docid order,
+    so more recent results are more likely to show up first
+
+  - d: and dt: search prefixes allow "approxidate" formats supported
+    by "git log --since="
+
+  - manifest.js.gz generation is ~25x faster with -extindex
+
+* lei - local email interface
+
+  An experimental, subject-to-change, likely-to-eat-your-mail tool for
+  personal mail as well as interacting with public-inboxes on the local
+  filesystem or over HTTP(S).  See lei(1), lei-overview(7), and other
+  lei-* manpages for details.
+
+* public-inbox-watch
+
+  - IMAP and NNTP code shared with lei, fixing an off-by-one error
+    in IMAP synchronization for single-message IMAP folders.
+
+  - \Deleted and \Draft messages ignored for IMAP, as they are for
+    Maildir.
+
+  - IMAP and NNTP connection establishment (including git-credential
+    prompts) ordering is now tied to config file order.
 
 Compatibility:
 
 * Rollbacks all the way to public-inbox 1.2.0 remain supported
 
+Internal changes
+
+* public-inbox-index switched to new internal IPC code shared
+  with lei
+
 Please report bugs via plain-text mail to: meta@public-inbox.org
 
 See archives at https://public-inbox.org/meta/ for all history.
diff --git a/Documentation/public-inbox-tuning.pod b/Documentation/public-inbox-tuning.pod
index e9702416..b3a2b411 100644
--- a/Documentation/public-inbox-tuning.pod
+++ b/Documentation/public-inbox-tuning.pod
@@ -55,6 +55,15 @@ public-inbox processes.
 More (optional) L<Inline::C> use will be introduced in the future
 to lower memory use and improve scalability.
 
+=head2 libgit2 usage via Inline::C
+
+If libgit2 development files are installed and L<Inline::C>
+is enabled (described above), per-inbox C<git cat-file --batch>
+processes are replaced with a single L<perl(1)> process running
+C<PublicInbox::Gcf2::loop> in read-only daemons.
+
+Available as of public-inbox 1.7.0.
+
 =head2 Performance on rotational hard disk drives
 
 Random I/O performance is poor on rotational HDDs.  Xapian indexing
diff --git a/TODO b/TODO
index 53907efd..4993b02c 100644
--- a/TODO
+++ b/TODO
@@ -86,7 +86,10 @@ all need to be considered for everything we introduce)
 
 * more and better test cases (use git fast-import to speed up creation)
 
-* large mbox/Maildir/MH/NNTP spool import (see PublicInbox::Import)
+* large mbox/Maildir/MH/NNTP spool import (in lei, but not
+  for public-facing inboxes)
+
+* MH import support (read-only, at least)
 
 * Read-only WebDAV interface to the git repo so it can be mounted
   via davfs2 or fusedav to avoid full clones.
@@ -133,18 +136,9 @@ all need to be considered for everything we introduce)
 
   - inotify-based manifest.js.gz updates
 
-  - process/FD reduction (needs to be slow-storage friendly)
-
   ...
 
-* command-line tool (similar to mairix/notmuch, but solver+git-aware)
-
-* consider removing doc_data from Xapian, redundant with over.sqlite3
-  It's no longer read as of public-inbox 1.6.0, but still written for
-  compatibility.
-
-* share "git cat-file --batch" processes across inboxes to avoid
-  bumping into /proc/sys/fs/pipe-user-pages-* limits
+* lei - see %CMD in lib/PublicInbox/LEI.pm
 
 * make "git cat-file --batch" detect unlinked packfiles so we don't
   have to restart processes (very long-term)


* Re: MIME types for image attachments
  2020-11-07 20:39  0% ` Eric Wong
@ 2020-11-08  0:05  0%   ` Leah Neukirchen
  0 siblings, 0 replies; 31+ results
From: Leah Neukirchen @ 2020-11-08  0:05 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

Eric Wong <e@80x24.org> writes:

> Leah Neukirchen <leah@vuxu.org> wrote:
>> Hi,
>> 
>> I just noticed this on a plain public-inbox 1.6.0 installation:
>> 
>> https://inbox.vuxu.org/9fans/8F5F1B4BCF0E2F1DA17BDFBF06430DC7@abbatoir.fios-router.home/T/#u
>> > [-- Attachment #2: Type: image/png, Size: 56860 bytes --]
>> 
>> However, when I click on it:
>> 
>> % curl -I
>> https://inbox.vuxu.org/9fans/8F5F1B4BCF0E2F1DA17BDFBF06430DC7@abbatoir.fios-router.home/2-a.bin
>> HTTP/1.1 200 OK
>> Server: nginx/1.18.0
>> Date: Sat, 07 Nov 2020 19:08:48 GMT
>> Content-Type: application/octet-stream
>> Content-Length: 56860
>> Connection: keep-alive
>> 
>> Any reason this is not served as image/png?  I don't think serving
>> image/* types is particularly dangerous, and it easily allows looking
>> at attached images from the browser.
>
> Several reasons off the top of my head (there may be more):
>
> 1) Image rendering libraries and complex graphics stacks increase
>    attack surface.  IIRC libpng/libjpeg have both had problems with
>    malicious data in the past, and could be in the future.
>
>    From what I can tell, text-only stacks seem barely capable of
>    displaying text without arbitrary code execution.  I'm not
>    optimistic about something as complex as image rendering from
>    untrusted sources.

Well, that's what probably literally every other website on the web does. ;)

> 2) Risk of illegal or objectionable content being viewed by
>    readers and bystanders, especially when in public (libraries,
>    coffee shops, planes, etc).  The risk of accidental clicks
>    seems higher in public due to spills/bumps, too, especially
>    in unfamiliar environments.
>
>    The current practice of linkifying URLs poses that problem,
>    too, but the public-inbox admin isn't responsible for hosting
>    the content in those URLs, bringing us to...

Yes, I don't want to inline the data but just display it on click.
It's the same as any other Mailman or Google Groups archive.

> 3) (Probably) risk to admins hosting public-inbox instances if
>    there's illegal content.  Right now, the data is still there,
>    but having it less obviously visible probably helps reduce
>    exposure when combined with 2).

Notice that deep-linking this attachment (e.g. with <img src= />)
will already display the image, as it triggers content autodetection.

> I am not a lawyer, and laws vary wildly by jurisdiction;
> so I think it's prudent to err on the side of paranoia when
> dealing in untrusted data sources.
>
> That said, a patch + options to allow the server to pass through
> certain content types could be accepted.
> It needs to also require a secondary option visible to the client
> (via opt-in cookie or POST), to avoid surprising differences
> between differently-configured server instances.

I think it would be more correct to send the real MIME type
and "Content-Disposition: attachment" (or "inline" then, when asked for).

(However this does not prevent hotlinking either...)

> Risks will need to be documented for the admin, and the current
> behavior needs to remain the default.

-- 
Leah Neukirchen  <leah@vuxu.org>  https://leahneukirchen.org


* Re: MIME types for image attachments
  2020-11-07 19:10  4% MIME types for image attachments Leah Neukirchen
@ 2020-11-07 20:39  0% ` Eric Wong
  2020-11-08  0:05  0%   ` Leah Neukirchen
  0 siblings, 1 reply; 31+ results
From: Eric Wong @ 2020-11-07 20:39 UTC (permalink / raw)
  To: Leah Neukirchen; +Cc: meta

Leah Neukirchen <leah@vuxu.org> wrote:
> Hi,
> 
> I just noticed this on a plain public-inbox 1.6.0 installation:
> 
> https://inbox.vuxu.org/9fans/8F5F1B4BCF0E2F1DA17BDFBF06430DC7@abbatoir.fios-router.home/T/#u
> > [-- Attachment #2: Type: image/png, Size: 56860 bytes --]
> 
> However, when I click on it:
> 
> % curl -I https://inbox.vuxu.org/9fans/8F5F1B4BCF0E2F1DA17BDFBF06430DC7@abbatoir.fios-router.home/2-a.bin
> HTTP/1.1 200 OK
> Server: nginx/1.18.0
> Date: Sat, 07 Nov 2020 19:08:48 GMT
> Content-Type: application/octet-stream
> Content-Length: 56860
> Connection: keep-alive
> 
> Any reason this is not served as image/png?  I don't think serving
> image/* types is particularly dangerous, and it easily allows looking
> at attached images from the browser.

Several reasons off the top of my head (there may be more):

1) Image rendering libraries and complex graphics stacks increase
   attack surface.  IIRC libpng/libjpeg have both had problems with
   malicious data in the past, and could be in the future.

   From what I can tell, text-only stacks seem barely capable of
   displaying text without arbitrary code execution.  I'm not
   optimistic about something as complex as image rendering from
   untrusted sources.

2) Risk of illegal or objectionable content being viewed by
   readers and bystanders, especially when in public (libraries,
   coffee shops, planes, etc).  The risk of accidental clicks
   seems higher in public due to spills/bumps, too, especially
   in unfamiliar environments.

   The current practice of linkifying URLs poses that problem,
   too, but the public-inbox admin isn't responsible for hosting
   the content in those URLs, bringing us to...

3) (Probably) risk to admins hosting public-inbox instances if
   there's illegal content.  Right now, the data is still there,
   but having it less obviously visible probably helps reduce
   exposure when combined with 2).

I am not a lawyer, and laws vary wildly by jurisdiction;
so I think it's prudent to err on the side of paranoia when
dealing in untrusted data sources.

That said, a patch + options to allow the server to pass through
certain content types could be accepted.
It needs to also require a secondary option visible to the client
(via opt-in cookie or POST), to avoid surprising differences
between differently-configured server instances.

Risks will need to be documented for the admin, and the current
behavior needs to remain the default.
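
One possible reading of the opt-in scheme above, as a Python sketch.
Nothing here exists in public-inbox: `attachment_headers', `passthrough'
and `client_opt_in' are hypothetical names, and sending
"Content-Disposition: inline" for allowed types is an assumption.

```python
SAFE_DEFAULT = "application/octet-stream"


def attachment_headers(declared_type, passthrough=frozenset(),
                       client_opt_in=False):
    """Pick response headers for an attachment.

    The declared type is passed through only if the admin allow-listed
    it AND the client opted in; otherwise the current default applies.
    """
    ctype = declared_type.split(";", 1)[0].strip().lower()
    if ctype in passthrough and client_opt_in:
        return {"Content-Type": ctype, "Content-Disposition": "inline"}
    return {"Content-Type": SAFE_DEFAULT}
```

With an empty allow-list every attachment is still served as
application/octet-stream, so the current behavior remains the default.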


* MIME types for image attachments
@ 2020-11-07 19:10  4% Leah Neukirchen
  2020-11-07 20:39  0% ` Eric Wong
  0 siblings, 1 reply; 31+ results
From: Leah Neukirchen @ 2020-11-07 19:10 UTC (permalink / raw)
  To: meta

Hi,

I just noticed this on a plain public-inbox 1.6.0 installation:

https://inbox.vuxu.org/9fans/8F5F1B4BCF0E2F1DA17BDFBF06430DC7@abbatoir.fios-router.home/T/#u
> [-- Attachment #2: Type: image/png, Size: 56860 bytes --]

However, when I click on it:

% curl -I https://inbox.vuxu.org/9fans/8F5F1B4BCF0E2F1DA17BDFBF06430DC7@abbatoir.fios-router.home/2-a.bin
HTTP/1.1 200 OK
Server: nginx/1.18.0
Date: Sat, 07 Nov 2020 19:08:48 GMT
Content-Type: application/octet-stream
Content-Length: 56860
Connection: keep-alive

Any reason this is not served as image/png?  I don't think serving
image/* types is particularly dangerous, and it easily allows looking
at attached images from the browser.

Thanks,
-- 
Leah Neukirchen  <leah@vuxu.org>  https://leahneukirchen.org/


* [PATCH] doc: post-1.6 updates, start 1.7
@ 2020-09-19 21:42 14% Eric Wong
  0 siblings, 0 replies; 31+ results
From: Eric Wong @ 2020-09-19 21:42 UTC (permalink / raw)
  To: meta

I should've dropped "PENDING" notes before the 1.6 release;
they're dropped now, and a note is added to remind my future
self to drop them before 1.7.
---
 Documentation/RelNotes/v1.7.0.wip    | 16 ++++++++++++++++
 Documentation/public-inbox-index.pod | 14 +++++++-------
 Documentation/public-inbox-init.pod  |  6 +++---
 Documentation/public-inbox-learn.pod |  4 ++--
 Documentation/public-inbox-watch.pod |  2 +-
 Documentation/public-inbox-xcpdb.pod |  2 +-
 MANIFEST                             |  1 +
 Makefile.PL                          |  5 ++++-
 TODO                                 |  2 --
 9 files changed, 35 insertions(+), 17 deletions(-)
 create mode 100644 Documentation/RelNotes/v1.7.0.wip

diff --git a/Documentation/RelNotes/v1.7.0.wip b/Documentation/RelNotes/v1.7.0.wip
new file mode 100644
index 00000000..a35ff227
--- /dev/null
+++ b/Documentation/RelNotes/v1.7.0.wip
@@ -0,0 +1,16 @@
+To: meta@public-inbox.org
+Subject: [WIP] public-inbox 1.7.0
+MIME-Version: 1.0
+Content-Type: text/plain; charset=utf-8
+Content-Disposition: inline
+
+TODO: gcf2, detached indices, JMAP, ...
+
+Compatibility:
+
+* Rollbacks all the way to public-inbox 1.2.0 remain supported
+
+Please report bugs via plain-text mail to: meta@public-inbox.org
+
+See archives at https://public-inbox.org/meta/ for all history.
+See https://public-inbox.org/TODO for what the future holds.
diff --git a/Documentation/public-inbox-index.pod b/Documentation/public-inbox-index.pod
index 936516f8..0848e860 100644
--- a/Documentation/public-inbox-index.pod
+++ b/Documentation/public-inbox-index.pod
@@ -42,7 +42,7 @@ Influences the number of Xapian indexing shards in a
 See L<public-inbox-init(1)/--jobs> for a full description
 of sharding.
 
-C<--jobs=0> is accepted as of public-inbox 1.6.0 (PENDING)
+C<--jobs=0> is accepted as of public-inbox 1.6.0
 to disable parallel indexing regardless of the number of
 pre-existing shards.
 
@@ -102,7 +102,7 @@ This fixes some bugs in older versions of public-inbox.  While
 it is possible to use this without C<--reindex>, it makes little
 sense to do so.
 
-Available in public-inbox 1.6.0 (PENDING).
+Available in public-inbox 1.6.0+.
 
 =item --prune
 
@@ -133,7 +133,7 @@ significantly speed up and reduce fragmentation during the
 initial index and full C<--reindex> invocations (but not
 incremental updates).
 
-Available in public-inbox 1.6.0 (PENDING).
+Available in public-inbox 1.6.0+.
 
 =item --no-fsync
 
@@ -144,7 +144,7 @@ primarily intended for systems with low RAM and the small
 may even find disabling L<fdatasync(2)> causes too much dirty
 data to accumulate, resulting on latency spikes from writeback.
 
-Available in public-inbox 1.6.0 (PENDING).
+Available in public-inbox 1.6.0+.
 
 =item --sequential-shard
 
@@ -152,7 +152,7 @@ Sets or overrides L</publicinbox.indexSequentialShard> on a
 per-invocation basis.  See L</publicinbox.indexSequentialShard>
 below.
 
-Available in public-inbox 1.6.0 (PENDING).
+Available in public-inbox 1.6.0+.
 
 =item --skip-docdata
 
@@ -160,7 +160,7 @@ Stop storing document data in Xapian on an existing inbox.
 
 See L<public-inbox-init(1)/--skip-docdata> for description and caveats.
 
-Available in public-inbox 1.6.0 (PENDING).
+Available in public-inbox 1.6.0+.
 
 =back
 
@@ -237,7 +237,7 @@ to SQLite databases.  WWW and IMAP users may notice incomplete
 search results, but it is otherwise non-fatal.  Using C<--reindex>
 will bring everything back up-to-date.
 
-Available in public-inbox 1.6.0 (PENDING).
+Available in public-inbox 1.6.0+.
 
 This is ignored on L<public-inbox-v1-format(5)> inboxes.
 
diff --git a/Documentation/public-inbox-init.pod b/Documentation/public-inbox-init.pod
index 24645045..f1ec05de 100644
--- a/Documentation/public-inbox-init.pod
+++ b/Documentation/public-inbox-init.pod
@@ -50,7 +50,7 @@ This may be set after-the-fact via C<publicinbox.$NAME.newsgroup>
 in the configuration file.  See L<public-inbox-config(5)> for more
 info.
 
-Available since public-inbox 1.6.0 (PENDING).
+Available in public-inbox 1.6.0+.
 
 Default: none.
 
@@ -66,7 +66,7 @@ but may be of use to L<public-inbox-v1-format(5)> users.
 There is no automatic way to use reserved NNTP article numbers
 when old mail is found, yet.
 
-Available since public-inbox 1.6.0 (PENDING).
+Available in public-inbox 1.6.0+.
 
 Default: unset, no NNTP article numbers are skipped
 
@@ -110,7 +110,7 @@ overhead by around 1.5%.
 Warning: this option prevents rollbacks to public-inbox 1.5.0
 and earlier.
 
-Available since public-inbox 1.6.0 (PENDING).
+Available in public-inbox 1.6.0+.
 
 =back
 
diff --git a/Documentation/public-inbox-learn.pod b/Documentation/public-inbox-learn.pod
index 94c96fd5..498c5092 100644
--- a/Documentation/public-inbox-learn.pod
+++ b/Documentation/public-inbox-learn.pod
@@ -55,8 +55,8 @@ not feed the message to L<spamc(1)> and only removes messages
 which match on any of the C<To:>, C<Cc:>, and C<List-ID:> headers.
 
 The C<--all> option may be used match C<spam> semantics in removing
-the message from all configured inboxes.  C<--all> will be
-available in public-inbox 1.6.0 (PENDING).
+the message from all configured inboxes.  C<--all> is only
+available in public-inbox 1.6.0+.
 
 =back
 
diff --git a/Documentation/public-inbox-watch.pod b/Documentation/public-inbox-watch.pod
index 73340ec4..38686645 100644
--- a/Documentation/public-inbox-watch.pod
+++ b/Documentation/public-inbox-watch.pod
@@ -120,7 +120,7 @@ Messages without the (S)een flag are not considered for hiding.
 This hiding affects all configured public-inboxes in PI_CONFIG.
 
 As with C<publicinbox.$NAME.watch>, C<imap://> and C<imaps://> URLs
-are supported in public-inbox 1.6.0.
+are supported in public-inbox 1.6.0+.
 
 Default: none; only for L<public-inbox-watch(1)> users
 
diff --git a/Documentation/public-inbox-xcpdb.pod b/Documentation/public-inbox-xcpdb.pod
index 1397a7f4..1bc1b1df 100644
--- a/Documentation/public-inbox-xcpdb.pod
+++ b/Documentation/public-inbox-xcpdb.pod
@@ -62,7 +62,7 @@ used with C<--compact>.
 Disable L<fsync(2)> and L<fdatasync(2)>.
 See L<public-inbox-index(1)/--no-fsync> for caveats.
 
-Available in public-inbox 1.6.0 (PENDING).
+Available in public-inbox 1.6.0+.
 
 =item --sequential-shard
 
diff --git a/MANIFEST b/MANIFEST
index 04a3744f..f3620de4 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -10,6 +10,7 @@ Documentation/RelNotes/v1.3.0.eml
 Documentation/RelNotes/v1.4.0.eml
 Documentation/RelNotes/v1.5.0.eml
 Documentation/RelNotes/v1.6.0.eml
+Documentation/RelNotes/v1.7.0.wip
 Documentation/clients.txt
 Documentation/dc-dlvr-spam-flow.txt
 Documentation/design_notes.txt
diff --git a/Makefile.PL b/Makefile.PL
index 3fe9acf8..f6b7abb6 100644
--- a/Makefile.PL
+++ b/Makefile.PL
@@ -111,8 +111,11 @@ my %man3 = map {; # semi-colon tells Perl this is a BLOCK (and not EXPR)
 } qw(Git.pm Import.pm WWW.pod SaPlugin/ListMirror.pod);
 
 WriteMakefile(
-	NAME => 'PublicInbox',
+	NAME => 'PublicInbox', # n.b. camel-case is not our choice
+
+	# XXX drop "PENDING" in .pod before updating this!
 	VERSION => '1.6.0',
+
 	AUTHOR => 'Eric Wong <e@80x24.org>',
 	ABSTRACT => 'public-inbox server infrastructure',
 	EXE_FILES => \@EXE_FILES,
diff --git a/TODO b/TODO
index 467f047f..8e1f4eaf 100644
--- a/TODO
+++ b/TODO
@@ -112,8 +112,6 @@ all need to be considered for everything we introduce)
 * imperfect scraper importers for obfuscated list archives
   (e.g. obfuscated Mailman stuff, Google Groups, etc...)
 
-* extend public-inbox-watch to support IMAP, NNTP
-
 * improve performance and avoid head-of-line blocking on slow storage
   (done for most git blob retrievals, Xapian needs work)
 


* Re: [ANNOUNCE] public-inbox 1.6.0
  2020-09-19 21:17  6%   ` Eric Wong
@ 2020-09-19 21:24  6%     ` Leah Neukirchen
  0 siblings, 0 replies; 31+ results
From: Leah Neukirchen @ 2020-09-19 21:24 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

Eric Wong <e@80x24.org> writes:

> Leah Neukirchen <leah@vuxu.org> wrote:
>> Hi,
>> 
>> thanks for the release!
>
> You're welcome!
>
>> > * Upgrading for new features in 1.6
>
> <snip>
>
>> I did all these steps in this order, NNTP works fine but IMAP shows
>> all folders as empty.  Any ideas how to debug this?
>
> Any chance you're hitting "$NEWSGROUP" and not "$NEWSGROUP.0"?
> (or ".1", ".2" ...)?
>
> I had to split the mailboxes into slices (".0"-".$N") to deal
> with client-side limitations; so "$NEWSGROUP" is just an empty
> folder with sub-folders containing messages.

Of course, that was it.  Works fine now. :)

-- 
Leah Neukirchen  <leah@vuxu.org>  https://leahneukirchen.org


* Re: [ANNOUNCE] public-inbox 1.6.0
  2020-09-19 20:01  6% ` Leah Neukirchen
@ 2020-09-19 21:17  6%   ` Eric Wong
  2020-09-19 21:24  6%     ` Leah Neukirchen
  0 siblings, 1 reply; 31+ results
From: Eric Wong @ 2020-09-19 21:17 UTC (permalink / raw)
  To: Leah Neukirchen; +Cc: meta

Leah Neukirchen <leah@vuxu.org> wrote:
> Hi,
> 
> thanks for the release!

You're welcome!

> > * Upgrading for new features in 1.6

<snip>

> I did all these steps in this order, NNTP works fine but IMAP shows
> all folders as empty.  Any ideas how to debug this?

Any chance you're hitting "$NEWSGROUP" and not "$NEWSGROUP.0"?
(or ".1", ".2" ...)?

I had to split the mailboxes into slices (".0"-".$N") to deal
with client-side limitations; so "$NEWSGROUP" is just an empty
folder with sub-folders containing messages.

Otherwise, stdout should have a trace of commands issued by the
client, and strace/truss/tcpdump/etc should show more...

It's been a while since I've really touched this part of the
IMAP code (and it's been a long year :x), but I don't think I
ever saw something like this while working on it.
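
For illustration only, the slice naming described above can be sketched
in shell.  `slice_names' and its arguments are hypothetical and not part
of public-inbox; the highest slice index N depends on the inbox and is
assumed to be known already (e.g. from an IMAP LIST response):

```shell
#!/bin/sh
# Print the IMAP mailbox names a client should open for one newsgroup.
# slice_names is a hypothetical helper; only the ".0" through ".$N"
# suffix scheme comes from the message above.
slice_names() {
	newsgroup=$1
	last=$2	# highest slice index N (assumed known)
	i=0
	while [ "$i" -le "$last" ]; do
		echo "$newsgroup.$i"
		i=$((i + 1))
	done
}

slice_names inbox.test.group 2
```

The bare "$NEWSGROUP" name is deliberately never printed, since that
folder only holds the sub-folders and no messages.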


* Re: [ANNOUNCE] public-inbox 1.6.0
  2020-09-16 20:03 13% [ANNOUNCE] public-inbox 1.6.0 Eric Wong
@ 2020-09-19 20:01  6% ` Leah Neukirchen
  2020-09-19 21:17  6%   ` Eric Wong
  0 siblings, 1 reply; 31+ results
From: Leah Neukirchen @ 2020-09-19 20:01 UTC (permalink / raw)
  To: Eric Wong; +Cc: meta

Hi,

thanks for the release!

Eric Wong <e@80x24.org> writes:

> * Upgrading for new features in 1.6
>
>   The ordering of these steps is only necessary if you intend to
>   use some new features in 1.6.  Future releases may have
>   different instructions (or be entirely transparent).
>
>   0. install (use your OS package manager, or "make install")
>
>   1. restart public-inbox-watch instances if you have any
>
>   2. Optional: remove Plack::Middleware::Deflater if you're using
>      a custom .psgi file for PublicInbox::WWW.  This only saves
>      some memory and CPU cycles, and you may also skip this step
>      if you expect to roll back to 1.5.0 for any reason.
>
>   Steps 3a and 3b may happen in any order, 3b is optional
>   and is only required to use new WWW and IMAP features.
>
>   3a. restart existing read-only daemons if you have them
>       (public-inbox-nntpd, public-inbox-httpd)
>
>   3b. run "public-inbox-index -c --reindex --rethread --all"
>       to reindex all configured inboxes
>
>   4. configure and start the new public-inbox-imapd.  This
>      requires reindexing in 3b, but there's no obligation to
>      run an IMAP server, either.

I did all these steps in this order, NNTP works fine but IMAP shows
all folders as empty.  Any ideas how to debug this?

thx,
-- 
Leah Neukirchen  <leah@vuxu.org>  https://leahneukirchen.org/


* [ANNOUNCE] public-inbox 1.6.0
@ 2020-09-16 20:03 13% Eric Wong
  2020-09-19 20:01  6% ` Leah Neukirchen
  0 siblings, 1 reply; 31+ results
From: Eric Wong @ 2020-09-16 20:03 UTC (permalink / raw)
  To: meta

A big release containing several performance optimizations, a
new anonymous IMAP server, and more.  It represents an
incremental improvement over 1.5 in several areas with more to
come in 1.7.

The read-only httpd and nntpd daemons no longer block the event
loop when retrieving blobs from git, making better use of SMP
systems while accommodating slow storage.

Indexing can now be tuned to give somewhat usable performance
on HDD storage, though we can't defy the laws of physics, either.

* General changes:

  - ~/.cache/public-inbox/inline-c is automatically used for Inline::C
    if it exists.  PERL_INLINE_DIRECTORY in env remains supported
    and prioritized to support `nobody'-type users without HOME.

  - msgmap.sqlite3 uses journal_mode=TRUNCATE, matching over.sqlite3
    behavior for a minor reduction in VFS traffic

  - public-inbox-tuning(7) - new manpage containing pointers to
    various tuning options and tips for certain HW and OS setups.

  - Copy-on-write is disabled on BTRFS for new indices to avoid
    fragmentation.  See the new public-inbox-tuning(7) manpage.

  - message/{rfc822,news,global} attachments are decoded recursively
    and indexed for search.  Reindexing (see below) is required
    to ensure these attachments are indexed in old messages.

  - inbox.lock (v2) and ssoma.lock (v1) files are written to
    on message delivery (or spam removal) to wake up read-only
    daemons via inotify or kqueue.

  - `--help' switch supported by command-line tools

* Upgrading for new features in 1.6

  The ordering of these steps is only necessary if you intend to
  use some new features in 1.6.  Future releases may have
  different instructions (or be entirely transparent).

  0. install (use your OS package manager, or "make install")

  1. restart public-inbox-watch instances if you have any

  2. Optional: remove Plack::Middleware::Deflater if you're using
     a custom .psgi file for PublicInbox::WWW.  This only saves
     some memory and CPU cycles, and you may also skip this step
     if you expect to roll back to 1.5.0 for any reason.

  Steps 3a and 3b may happen in any order, 3b is optional
  and is only required to use new WWW and IMAP features.

  3a. restart existing read-only daemons if you have them
      (public-inbox-nntpd, public-inbox-httpd)

  3b. run "public-inbox-index -c --reindex --rethread --all"
      to reindex all configured inboxes

  4. configure and start the new public-inbox-imapd.  This
     requires reindexing in 3b, but there's no obligation to
     run an IMAP server, either.
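
  The ordering above can be sketched as a dry-run plan.  This is
  only an illustration, not an official upgrade script:
  `plan_upgrade' is a hypothetical helper which prints the steps
  and executes nothing.  How daemons are actually restarted depends
  on the init system, so those steps are left as plain text.

```shell
#!/bin/sh
# Dry-run sketch of the 1.6 upgrade order: prints steps, runs nothing.
# plan_upgrade is a hypothetical helper, not part of public-inbox.
plan_upgrade() {
	new_features=$1	# "yes" if new WWW or IMAP features are wanted
	echo '0: install'
	echo '1: restart public-inbox-watch'
	echo '2: (optional) drop Plack::Middleware::Deflater from custom .psgi'
	echo '3a: restart public-inbox-nntpd public-inbox-httpd'
	if [ "$new_features" = yes ]; then
		# 3b is only required for the new WWW and IMAP features
		echo '3b: public-inbox-index -c --reindex --rethread --all'
		echo '4: configure and start public-inbox-imapd'
	fi
}

plan_upgrade yes
```

  Calling `plan_upgrade no' stops after step 3a, which matches an
  upgrade that does not need the reindex.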

* public-inbox-index

  There are several new options to improve usability on slow,
  rotational storage.

  - `--batch-size=BYTES' or publicinbox.indexBatchSize parameter
    to reduce frequency of random writes on HDDs

  - `--sequential-shard' or publicinbox.indexSequentialShard parameter
    to improve OS page cache utilization on HDDs.

  - `--no-fsync' when combined with Xapian 1.4+ can be used to
    speed up indexing on SSDs and small (default) `--batch-size'

  - `--rethread' option to go with `--reindex' (use sparingly,
    see manpage)

  - parallelize v2 updates by default, `--sequential-shard' and
    `-j0' is (once again) allowed to disable parallelization

  - (re-)indexing parallelizes blob reads from git

  - `--all' may be specified to index all configured inboxes
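
  As a hedged sketch of how these switches might be combined,
  the helper below prints a command line per storage type.
  `index_cmd' is hypothetical, and the `500m' batch size is only
  the example value from the public-inbox-index manpage for
  systems with abundant RAM, not a recommendation.

```shell
#!/bin/sh
# Print (not run) a public-inbox-index command for a storage type.
# index_cmd is a hypothetical helper; tune the values for your system.
index_cmd() {
	case $1 in
	hdd)
		# fewer random writes and better page cache use on HDDs
		echo 'public-inbox-index --batch-size=500m --sequential-shard --all'
		;;
	ssd)
		# default (small) batch size, skipping fsync; needs Xapian 1.4+
		echo 'public-inbox-index --no-fsync --all'
		;;
	*)
		echo 'public-inbox-index --all'
		;;
	esac
}

index_cmd hdd
```

  Printing instead of executing keeps the sketch safe to try; see
  the manpage for the caveats of each switch before running them.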

* public-inbox-learn

  - `rm' supports `--all' to remove from all configured inboxes

* public-inbox-imapd

  - new read-only IMAP daemon similar to public-inbox-nntpd.
    `AUTH=ANONYMOUS' is supported, and any username and password
    is accepted for clients without `AUTH=ANONYMOUS' support.

* public-inbox-nntpd

  - blob reads from git are handled asynchronously

* public-inbox-httpd

  - Plack::Middleware::Deflater is no longer loaded by default
    when no .psgi file is specified; PublicInbox::WWW can rely
    on gzip for buffering (see below)

* PublicInbox::WWW

  - use consistent blank line around attachment links

  - Attachments in message/{rfc822,news,global} messages can be
    individually downloaded.  Downloading the entire message/rfc822
    file in full remains supported

  - $INBOX_DIR/description is treated as UTF-8

  - HTML, Atom, and text/plain responses are gzipped without
    relying on Plack::Middleware::Deflater

  - Multi-message endpoints (/t.mbox.gz, /T/, /t/, etc) are ~10% faster
    when running under public-inbox-httpd with asynchronous blob
    retrieval

  - mbox search results may now include all messages pertaining to that
    thread.  Needs `--reindex' mentioned above in
    `Upgrading for new features in 1.6'.

  - fix mbox.gz search results downloads for lynx users

  - small navigation tweaks, more prominent mirroring instructions

* public-inbox-watch

  - Linux::Inotify2 or IO::KQueue is used directly,
    Filesys::Notify::Simple is no longer required

  - NNTP groups and IMAP mailboxes may be watched in addition
    to Maildirs (lightly tested).

* Ongoing internal changes

  - reduce event loop hogging for many-inbox support

  - use more Perl v5.10-isms, future-proof against Perl 8

  - more consistent variable and field naming, improve internal
    documentation and comments

  - start supporting >=40 char git identifiers for SHA-256

  - test -httpd-specific code paths via Plack::Test::ExternalServer
    in addition to generic PSGI paths.

Please report bugs via plain-text mail to: meta@public-inbox.org

See archives at https://public-inbox.org/meta/ for all history.
See https://public-inbox.org/TODO for what the future holds.


* [PATCH] doc: TODO and release notes updates ahead of 1.6
@ 2020-09-14  6:39  4% Eric Wong
  0 siblings, 0 replies; 31+ results
From: Eric Wong @ 2020-09-14  6:39 UTC (permalink / raw)
  To: meta

Some more things have happened...

And drop some items which are too expensive to support,
such as automatic mirroring.
---
 Documentation/RelNotes/v1.6.0.eml | 31 ++++++++++++++++++++++++++++---
 TODO                              | 20 ++++++++++----------
 2 files changed, 38 insertions(+), 13 deletions(-)

diff --git a/Documentation/RelNotes/v1.6.0.eml b/Documentation/RelNotes/v1.6.0.eml
index 4f72f352..153b1f13 100644
--- a/Documentation/RelNotes/v1.6.0.eml
+++ b/Documentation/RelNotes/v1.6.0.eml
@@ -40,10 +40,13 @@ on HDD storage, though we can't defy the laws of physics, either.
     on message delivery (or spam removal) to wake up read-only
     daemons via inotify or kqueue.
 
-* Upgrading
+  - `--help' switch supported by command-line tools
+
+* Upgrading for new features in 1.6
 
   The ordering of these steps is only necessary if you intend to
-  use some new features of this release.
+  use some new features in 1.6.  Future releases may have
+  different instructions (or be entirely transparent).
 
   0. install (use your OS package manager, or "make install")
 
@@ -129,13 +132,35 @@ on HDD storage, though we can't defy the laws of physics, either.
     retrieval
 
   - mbox search results may now include all messages pertaining to that
-    thread (requires `--reindex' mentioned in `Upgrading').
+    thread.  Needs `--reindex' mentioned above in
+    `Upgrading for new features in 1.6'.
+
+  - fix mbox.gz search results downloads for lynx users
+
+  - small navigation tweaks, more prominent mirroring instructions
 
 * public-inbox-watch
 
   - Linux::Inotify2 or IO::KQueue is used directly,
     Filesys::Notify::Simple is no longer required
 
+  - NNTP groups and IMAP mailboxes may be watched in addition
+    to Maildirs (lightly tested).
+
+* Ongoing internal changes
+
+  - reduce event loop hogging for many-inbox support
+
+  - use more Perl v5.10-isms, future-proof against Perl 8
+
+  - more consistent variable and field naming, improve internal
+    documentation and comments
+
+  - start supporting >=40 char git identifiers for SHA-256
+
+  - test -httpd-specific code paths via Plack::Test::ExternalServer
+    in addition to generic PSGI paths.
+
 Please report bugs via plain-text mail to: meta@public-inbox.org
 
 See archives at https://public-inbox.org/meta/ for all history.
diff --git a/TODO b/TODO
index 9396f661..467f047f 100644
--- a/TODO
+++ b/TODO
@@ -115,6 +115,7 @@ all need to be considered for everything we introduce)
 * extend public-inbox-watch to support IMAP, NNTP
 
 * improve performance and avoid head-of-line blocking on slow storage
+  (done for most git blob retrievals, Xapian needs work)
 
 * HTTP(S) search API (likely JMAP, but GraphQL could be an option)
   It should support git-specific prefixes (dfpre:, dfpost:, dfn:, etc)
@@ -123,6 +124,11 @@ all need to be considered for everything we introduce)
 
 * search across multiple inboxes, or admin-definable groups of inboxes
 
+  This will require a new detached Xapian index that can be used in
+  parallel with existing per-inbox indices.  Using ->add_database
+  with hundreds of shards is unusable in current Xapian as of
+  August 2020 (acknowledged by Xapian upstream).
+
 * scalability to tens/hundreds of thousands of inboxes
 
   - pagination for WwwListing
@@ -136,6 +142,8 @@ all need to be considered for everything we introduce)
 * command-line tool (similar to mairix/notmuch, but solver+git-aware)
 
 * consider removing doc_data from Xapian, redundant with over.sqlite3
+  It's no longer read as of public-inbox 1.6.0, but still written for
+  compatibility.
 
 * share "git cat-file --batch" processes across inboxes to avoid
   bumping into /proc/sys/fs/pipe-user-pages-* limits
@@ -157,15 +165,7 @@ all need to be considered for everything we introduce)
 
 * highlighting + linkification for "git format-patch --interdiff" output
 
-* highlighting + linkification for "git format-patch --range-diff" output
-  (requires mirroring of git repos)
-
-* parse and allow (semi)automatic-mirroring of "git request-pull" output
-  for coderepos
-
-* configurable diff output for solver-generated blobs
-
-* figure out how search for messages with multiple Date: headers
-  should work (some wacky examples out there...)
+* highlighting for "git format-patch --range-diff" output
+  (linkification is too expensive, as it requires mirroring)
 
 * support UUCP addresses for legacy archives


* [PATCH] doc: expand on indexBatchSize regarding fragmentation
@ 2020-08-31  4:33  4% Eric Wong
  0 siblings, 0 replies; 31+ results
From: Eric Wong @ 2020-08-31  4:33 UTC (permalink / raw)
  To: meta

And change the documentation reference in -tuning to
point to the -index manpage while we're at it.
---
 Documentation/public-inbox-index.pod  | 5 +++--
 Documentation/public-inbox-tuning.pod | 6 ++++--
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/Documentation/public-inbox-index.pod b/Documentation/public-inbox-index.pod
index 207b2ed8..936516f8 100644
--- a/Documentation/public-inbox-index.pod
+++ b/Documentation/public-inbox-index.pod
@@ -129,8 +129,9 @@ below.
 
 When using rotational storage but abundant RAM, using a large
 value (e.g. C<500m>) with C<--sequential-shard> can
-significantly speed up the initial index and full C<--reindex>
-invocations (but not incremental updates).
+significantly speed up and reduce fragmentation during the
+initial index and full C<--reindex> invocations (but not
+incremental updates).
 
 Available in public-inbox 1.6.0 (PENDING).
 
diff --git a/Documentation/public-inbox-tuning.pod b/Documentation/public-inbox-tuning.pod
index b4e7698b..f5a25676 100644
--- a/Documentation/public-inbox-tuning.pod
+++ b/Documentation/public-inbox-tuning.pod
@@ -74,7 +74,7 @@ sharding imposes a performance penalty for read-only queries.
 
 Users with large amounts of RAM are advised to set a large value
 for C<publicinbox.indexBatchSize> as documented in
-L<public-inbox-config(5)>.
+L<public-inbox-index(1)>.
 
 C<dm-crypt> users on Linux 4.0+ are advised to try the
 C<--perf-same_cpu_crypt> C<--perf-submit_from_crypt_cpus>
@@ -95,7 +95,9 @@ Disabling copy-on-write also disables checksumming, thus C<raid1>
 Fortunately, these SQLite and Xapian indices are designed to
 recoverable from git if missing.
 
-Disabling CoW does not prevent all fragmentation.
+Disabling CoW does not prevent all fragmentation.  Large values
+of C<publicInbox.indexBatchSize> also limit fragmentation during
+the initial index.
 
 Avoid snapshotting subvolumes containing Xapian and/or SQLite indices.
 Snapshots use CoW despite our efforts to disable it, resulting


* Re: [PATCH 8/8] doc: watch: expand on NNTP and IMAP-specific knobs
  2020-08-27 12:17  8% ` [PATCH 8/8] doc: watch: expand on NNTP and IMAP-specific knobs Eric Wong
@ 2020-08-28  4:22  6%   ` Eric Wong
  0 siblings, 0 replies; 31+ results
From: Eric Wong @ 2020-08-28  4:22 UTC (permalink / raw)
  To: meta

Eric Wong <e@yhbt.net> wrote:
> --- a/Documentation/public-inbox-watch.pod
> +++ b/Documentation/public-inbox-watch.pod
> @@ -78,7 +78,12 @@ public-inbox 1.6.0 supports C<nntp://>, C<nntps://>,
>  C<imap://> and C<imaps://> URLs:
>  
>  		watch = nntp://news.example.com/inbox.test.group
> -		watch = imaps://mail.example.com/INBOX.test.foo
> +		watch = imaps://user@mail.example.com/INBOX.test.foo
> +

That exceeds 80 columns (I only ran "make check-run", not
"make check" :x).

Will squash this in:

diff --git a/Documentation/public-inbox-watch.pod b/Documentation/public-inbox-watch.pod
index 39b8ac06..f3e622b0 100644
--- a/Documentation/public-inbox-watch.pod
+++ b/Documentation/public-inbox-watch.pod
@@ -78,7 +78,7 @@ public-inbox 1.6.0 supports C<nntp://>, C<nntps://>,
 C<imap://> and C<imaps://> URLs:
 
 		watch = nntp://news.example.com/inbox.test.group
-		watch = imaps://user@mail.example.com/INBOX.test.foo
+		watch = imaps://user@mail.example.com/INBOX.test
 
 This may be specified multiple times to combine several mailboxes
 into a single public-inbox.  URLs requiring authentication


* [PATCH 8/8] doc: watch: expand on NNTP and IMAP-specific knobs
    2020-08-27 12:17  6% ` [PATCH 7/8] doc: move watch config docs to -watch manpage Eric Wong
@ 2020-08-27 12:17  8% ` Eric Wong
  2020-08-28  4:22  6%   ` Eric Wong
  1 sibling, 1 reply; 31+ results
From: Eric Wong @ 2020-08-27 12:17 UTC (permalink / raw)
  To: meta

There are a few more, but maybe they're too esoteric
to be worth documenting at the moment (batch sizes, timeouts, etc).
---
 Documentation/public-inbox-watch.pod | 36 +++++++++++++++++++++++++++-
 1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/Documentation/public-inbox-watch.pod b/Documentation/public-inbox-watch.pod
index b07d0fb5..39b8ac06 100644
--- a/Documentation/public-inbox-watch.pod
+++ b/Documentation/public-inbox-watch.pod
@@ -78,7 +78,12 @@ public-inbox 1.6.0 supports C<nntp://>, C<nntps://>,
 C<imap://> and C<imaps://> URLs:
 
 		watch = nntp://news.example.com/inbox.test.group
-		watch = imaps://mail.example.com/INBOX.test.foo
+		watch = imaps://user@mail.example.com/INBOX.test.foo
+
+This may be specified multiple times to combine several mailboxes
+into a single public-inbox.  URLs requiring authentication
+will require L<netrc(5)> and/or L<git-credential(1)> to fill
+in the username and password.
 
 Default: none
 
@@ -119,6 +124,35 @@ are supported in public-inbox 1.6.0.
 
 Default: none; only for L<public-inbox-watch(1)> users
 
+=item imap.Starttls / imap.$URL.Starttls
+
+Whether or not to use C<STARTTLS> on plain C<imap://> connections.
+
+May be specified for certain URLs via L<git-config(1)/--get-urlmatch>
+in C<git(1)> 1.8.5+.
+
+Default: C<true>
+
+=item imap.Compress / imap.$URL.Compress
+
+Whether or not to use the IMAP COMPRESS (RFC4978) extension to
+save bandwidth.  This is not supported by all IMAP servers and
+some advertising this feature may not implement it correctly.
+
+May be specified only for certain URLs if L<git(1)> 1.8.5+ is
+installed to use L<git-config(1)/--get-urlmatch>
+
+Default: C<false>
+
+=item nntp.Starttls / nntp.$URL.Starttls
+
+Whether or not to use C<STARTTLS> on plain C<nntp://> connections.
+
+May be specified for certain URLs via L<git-config(1)/--get-urlmatch>
+in C<git(1)> 1.8.5+.
+
+Default: C<false> if the hostname is a Tor C<.onion>, C<true> otherwise
+
 =back
 
 =head1 SIGNALS
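
A minimal sketch of the per-URL matching documented in this patch,
assuming git 1.8.5+ is installed.  The temporary config file, the
hostnames and the `imap_compress' helper are made up for
illustration; public-inbox-watch reads PI_CONFIG itself, so this only
shows how `--get-urlmatch' picks a value:

```shell
#!/bin/sh
# Resolve imap.Compress for a given URL via git-config URL matching.
# Config path, hostnames and helper name are illustrative assumptions.
cfg=$(mktemp)
trap 'rm -f "$cfg"' EXIT

# default for every IMAP server
git config -f "$cfg" imap.Compress false
# override for one server only (imap.$URL.Compress)
git config -f "$cfg" imap.imaps://mail.example.com.Compress true

# print the most specific value matching the URL being watched
imap_compress() {
	git config -f "$cfg" --get-urlmatch imap.compress "$1"
}

imap_compress imaps://mail.example.com/INBOX.test
imap_compress imaps://other.example.org/INBOX
```

The same pattern should apply to imap.Starttls and nntp.Starttls,
since all three knobs are documented as using `--get-urlmatch'.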


* [PATCH 7/8] doc: move watch config docs to -watch manpage
  @ 2020-08-27 12:17  6% ` Eric Wong
  2020-08-27 12:17  8% ` [PATCH 8/8] doc: watch: expand on NNTP and IMAP-specific knobs Eric Wong
  1 sibling, 0 replies; 31+ results
From: Eric Wong @ 2020-08-27 12:17 UTC (permalink / raw)
  To: meta

The -config manpage is a bit long and the -watch stuff is
isolated from the rest of it while we start documenting NNTP and
IMAP support.

I'm not entirely happy with the way IMAP and NNTP are
configured, but it's still good enough for small setups.

This also fixes a long-standing misplaced comment about
`publicinboxwatch.spamcheck' affecting all configured inboxes;
that comment was actually for `publicinboxwatch.watchspam'.

We'll omit documenting NNTP for `watchspam', for now, given the
lack of \Seen flags in NNTP and I'm not sure if it's even
useful.  There may not be any newsgroups for sharing confirmed
spam, either...
---
 Documentation/public-inbox-config.pod | 38 ++---------------
 Documentation/public-inbox-watch.pod  | 61 ++++++++++++++++++++++-----
 2 files changed, 55 insertions(+), 44 deletions(-)

diff --git a/Documentation/public-inbox-config.pod b/Documentation/public-inbox-config.pod
index 1dfb926e..2d845f16 100644
--- a/Documentation/public-inbox-config.pod
+++ b/Documentation/public-inbox-config.pod
@@ -74,26 +74,11 @@ Default: none, optional
 
 =item publicinbox.<name>.watch
 
-A location for L<public-inbox-watch(1)> to watch.  Currently,
-only C<maildir:> paths are supported:
-
-	[publicinbox "test"]
-		watch = maildir:/path/to/maildirs/.INBOX.test/
-
-Default: none; only for L<public-inbox-watch(1)> users
+See L<public-inbox-watch(1)>
 
 =item publicinbox.<name>.watchheader
 
-	[publicinbox "test"]
-		watchheader = List-Id:<test.example.com>
-
-If specified, L<public-inbox-watch(1)> will only process mail
-matching the given header.  If specified multiple times in
-public-inbox 1.5 or later, mail will be processed if it matches
-any of the values.  Only the last value was used in public-inbox
-1.4 and earlier.
-
-Default: none; only for L<public-inbox-watch(1)> users
+See L<public-inbox-watch(1)>
 
 =item publicinbox.<name>.listid
 
@@ -204,26 +189,11 @@ Default: spamc
 
 =item publicinboxwatch.spamcheck
 
-This may be set to C<spamc> to enable the use of SpamAssassin
-L<spamc(1)> for filtering spam before it is imported into git
-history.  Other spam filtering backends may be supported in
-the future.
-
-This requires L<public-inbox-watch(1)>, but affects all configured
-public-inboxes in PI_CONFIG.
-
-Default: none
+See L<public-inbox-watch(1)>
 
 =item publicinboxwatch.watchspam
 
-A Maildir to watch for confirmed spam messages to appear in.
-Messages which appear in this folder with the (S)een Maildir flag
-will be hidden from all configured inboxes based on Message-ID
-and content matching.
-
-Messages without the (S)een Maildir flag are not considered for hiding.
-
-Default: none; only for L<public-inbox-watch(1)> users
+See L<public-inbox-watch(1)>
 
 =item publicinbox.nntpserver
 
diff --git a/Documentation/public-inbox-watch.pod b/Documentation/public-inbox-watch.pod
index 34e8c4f2..b07d0fb5 100644
--- a/Documentation/public-inbox-watch.pod
+++ b/Documentation/public-inbox-watch.pod
@@ -35,8 +35,8 @@ In ~/.public-inbox/config:
 
 =head1 DESCRIPTION
 
-public-inbox-watch allows watching a mailbox (currently only
-Maildir) for the arrival of new messages and automatically
+public-inbox-watch allows watching a mailbox or newsgroup
+for the arrival of new messages and automatically
 importing them into public-inbox git repositories and indices.
 public-inbox-watch is useful in situations when a user wishes to
 mirror an existing mailing list, but has no access to run
@@ -48,11 +48,9 @@ of large Maildirs.
 Upon startup, it scans the mailbox for new messages to be
 imported while it was not running.
 
-Currently, only Maildirs are supported.
-
-For now, IMAP users should use tools such as L<mbsync(1)>
-or L<offlineimap(1)> to bidirectionally sync their IMAP
-folders to Maildirs for public-inbox-watch.
+As of public-inbox 1.6.0, Maildirs, IMAP folders, and NNTP
+newsgroups are supported.  Previous versions of public-inbox
+only supported Maildirs.
 
 public-inbox-watch should be run inside a L<screen(1)> session
 or as a L<systemd(1)> service.  Errors are emitted to stderr.
@@ -64,21 +62,64 @@ public-inbox-watch takes no command-line options.
 =head1 CONFIGURATION
 
 These configuration knobs should be used in the
-L<public-inbox-config(5)>
+L<public-inbox-config(5)> file
 
 =over 8
 
 =item publicinbox.<name>.watch
 
+A location to watch.  public-inbox 1.5.0 and earlier only supported
+C<maildir:> paths:
+
+	[publicinbox "test"]
+		watch = maildir:/path/to/maildirs/.INBOX.test/
+
+public-inbox 1.6.0 supports C<nntp://>, C<nntps://>,
+C<imap://> and C<imaps://> URLs:
+
+		watch = nntp://news.example.com/inbox.test.group
+		watch = imaps://mail.example.com/INBOX.test.foo
+
+Default: none
+
 =item publicinbox.<name>.watchheader
 
+	[publicinbox "test"]
+		watchheader = List-Id:<test.example.com>
+
+If specified, L<public-inbox-watch(1)> will only process mail
+matching the given header.  If specified multiple times in
+public-inbox 1.5 or later, mail will be processed if it matches
+any of the values.  Only the last value was used in public-inbox
+1.4 and earlier.
+
+Default: none
+
 =item publicinboxwatch.spamcheck
 
+This may be set to C<spamc> to enable the use of SpamAssassin
+L<spamc(1)> for filtering spam before it is imported into git
+history.  Other spam filtering backends may be supported in
+the future.
+
+Default: none
+
 =item publicinboxwatch.watchspam
 
-=back
+A Maildir to watch for confirmed spam messages to appear in.
+Messages which appear in this folder with the (S)een flag
+will be hidden from all configured inboxes based on Message-ID
+and content matching.
+
+Messages without the (S)een flag are not considered for hiding.
+This hiding affects all configured public-inboxes in PI_CONFIG.
+
+As with C<publicinbox.$NAME.watch>, C<imap://> and C<imaps://> URLs
+are supported in public-inbox 1.6.0.
 
-See L<public-inbox-config(5)> for documentation on them.
+Default: none; only for L<public-inbox-watch(1)> users
+
+=back
 
 =head1 SIGNALS
 

^ permalink raw reply related	[relevance 6%]

* [PATCH] doc: add some more tuning notes
@ 2020-08-25 10:51 11% Eric Wong
  0 siblings, 0 replies; 31+ results
From: Eric Wong @ 2020-08-25 10:51 UTC (permalink / raw)
  To: meta

I've learned a thing or three about btrfs in the past few
weeks and remembered some old HDD things, too.

The Xapian MultiDatabase problem will need to be addressed
for 1.7...
---
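Not part of the commit: a minimal Perl sketch of the C<--jobs> default
described in the public-inbox-init.pod hunk below (online CPUs capped
at 4, with one job reserved for the producer).  C<default_jobs> and
C<shard_count> are hypothetical helpers, not public-inbox functions.
The CPU count is passed in rather than detected, and the floor of one
shard is an assumption drawn from the "single shard (C<-j1>)" wording.

```perl
use strict;
use warnings;

# hypothetical helper: documented default is the online CPU count, capped at 4
sub default_jobs {
	my ($ncpu) = @_;
	$ncpu = 1 if $ncpu < 1;
	$ncpu > 4 ? 4 : $ncpu;
}

# hypothetical helper: one job is the producer, the rest are shard workers
# (assumption: never fewer than one shard)
sub shard_count {
	my ($jobs) = @_;
	my $n = $jobs - 1;
	$n < 1 ? 1 : $n;
}

for my $ncpu (1, 2, 8) {
	my $jobs = default_jobs($ncpu);
	printf "%d CPUs: --jobs=%d, %d shard(s)\n",
		$ncpu, $jobs, shard_count($jobs);
}
```

The real scripts may clamp or round differently; treat the numbers as
an illustration of the documented default only.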
 Documentation/public-inbox-index.pod  | 12 ++++++++++--
 Documentation/public-inbox-init.pod   | 15 +++++++++++----
 Documentation/public-inbox-tuning.pod | 21 ++++++++++++++++++---
 Documentation/public-inbox-xcpdb.pod  |  1 +
 4 files changed, 40 insertions(+), 9 deletions(-)

diff --git a/Documentation/public-inbox-index.pod b/Documentation/public-inbox-index.pod
index 46a53825..207b2ed8 100644
--- a/Documentation/public-inbox-index.pod
+++ b/Documentation/public-inbox-index.pod
@@ -39,8 +39,12 @@ normal search functionality.
 Influences the number of Xapian indexing shards in a
 (L<public-inbox-v2-format(5)>) inbox.
 
+See L<public-inbox-init(1)/--jobs> for a full description
+of sharding.
+
 C<--jobs=0> is accepted as of public-inbox 1.6.0 (PENDING)
-to disable parallel indexing.
+to disable parallel indexing regardless of the number of
+pre-existing shards.
 
 If the inbox has not been indexed or initialized, C<JOBS - 1>
 shards will be created (one job is always needed for indexing
@@ -133,7 +137,11 @@ Available in public-inbox 1.6.0 (PENDING).
 =item --no-fsync
 
 Disables L<fsync(2)> and L<fdatasync(2)> operations on SQLite
-and Xapian.  This is only effective with Xapian 1.4+.
+and Xapian.  This is only effective with Xapian 1.4+.  This is
+primarily intended for systems with low RAM and the small
+(default) C<--batch-size=1m>.  Users of large C<--batch-size>
+may even find disabling L<fdatasync(2)> causes too much dirty
+data to accumulate, resulting in latency spikes from writeback.
 
 Available in public-inbox 1.6.0 (PENDING).
 
diff --git a/Documentation/public-inbox-init.pod b/Documentation/public-inbox-init.pod
index b25dd1e4..24645045 100644
--- a/Documentation/public-inbox-init.pod
+++ b/Documentation/public-inbox-init.pod
@@ -86,14 +86,21 @@ Default: unset, no epochs are skipped
 Control the number of Xapian index shards in a
 C<-V2> (L<public-inbox-v2-format(5)>) inbox.
 
-It is useful to use a single shard (C<-j1>) for inboxes on
+It can be useful to use a single shard (C<-j1>) for inboxes on
 high-latency storage (e.g. rotational HDD) unless the system has
 enough RAM to cache 5-10x the size of the git repository.
 
-It is generally not useful to specify higher values than the
-default due to contention in the top-level producer process.
+Another approach for HDDs is to use the
+L<public-inbox-index(1)/publicInbox.indexSequentialShard> option
+and many shards, so each shard may fit into the kernel page
+cache.  Unfortunately, excessive sharding slows down read-only
+query performance.
 
-Default: the number of online CPUs, up to 4
+For fast storage, it is generally not useful to specify higher
+values than the default due to the top-level producer process
+being a bottleneck.
+
+Default: the number of online CPUs, up to 4 (3 shard workers, 1 producer)
 
 =item --skip-docdata
 
diff --git a/Documentation/public-inbox-tuning.pod b/Documentation/public-inbox-tuning.pod
index abc53d1e..e3f2899b 100644
--- a/Documentation/public-inbox-tuning.pod
+++ b/Documentation/public-inbox-tuning.pod
@@ -69,7 +69,8 @@ footprint when indexing on HDDs.
 
 Initializing a mirror with a high C<--jobs> count to create more
 shards (in C<-V2> inboxes) will keep each shard smaller and
-reduce its kernel page cache footprint.
+reduce its kernel page cache footprint.  Keep in mind excessive
+sharding imposes a performance penalty for read-only queries.
 
 Users with large amounts of RAM are advised to set a large value
 for C<publicinbox.indexBatchSize> as documented in
@@ -88,12 +89,21 @@ used by public-inbox are no exception to that.
 
 public-inbox 1.6.0+ disables copy-on-write (CoW) on Xapian and SQLite
 indices on btrfs to achieve acceptable performance (even on SSD).
-Disabling copy-on-write also disables checksumming, thus raid1
-(or higher) configurations may corrupt on unsafe shutdowns.
+Disabling copy-on-write also disables checksumming, thus C<raid1>
+(or higher) configurations may be corrupt after unsafe shutdowns.
 
 Fortunately, these SQLite and Xapian indices are designed to be
 recoverable from git if missing.
 
+Disabling CoW does not prevent all fragmentation.
+
+Avoid snapshotting subvolumes containing Xapian and/or SQLite indices.
+Snapshots use CoW despite our efforts to disable it, resulting
+in fragmentation.
+
+L<filefrag(8)> can be used to monitor fragmentation, and
+C<btrfs filesystem defragment -fr $INBOX_DIR> may be necessary.
+
 Large filesystems benefit significantly from the C<space_cache=v2>
 mount option documented in L<btrfs(5)>.
 
@@ -106,6 +116,11 @@ While SSD read performance is generally good, SSD write performance
 degrades as the drive ages and/or gets full.  Issuing C<TRIM> commands
 via L<fstrim(8)> or similar is required to sustain write performance.
 
+Users of the Flash-Friendly File System
+L<F2FS|https://en.wikipedia.org/wiki/F2FS> may benefit from
+optimizations found in SQLite 3.21.0+.  Benchmarks are greatly
+appreciated.
+
 =head2 Read-only daemons
 
 L<public-inbox-httpd(1)>, L<public-inbox-imapd(1)>, and
diff --git a/Documentation/public-inbox-xcpdb.pod b/Documentation/public-inbox-xcpdb.pod
index 52939894..1397a7f4 100644
--- a/Documentation/public-inbox-xcpdb.pod
+++ b/Documentation/public-inbox-xcpdb.pod
@@ -60,6 +60,7 @@ used with C<--compact>.
 =item --no-fsync
 
 Disable L<fsync(2)> and L<fdatasync(2)>.
+See L<public-inbox-index(1)/--no-fsync> for caveats.
 
 Available in public-inbox 1.6.0 (PENDING).
 

^ permalink raw reply related	[relevance 11%]

* [PATCH 22/23] init+index: support --skip-docdata for Xapian
    2020-08-20 20:24  4% ` [PATCH 05/23] init: support --newsgroup option Eric Wong
  2020-08-20 20:24  6% ` [PATCH 06/23] init: drop -N alias for --skip-artnum Eric Wong
@ 2020-08-20 20:24  7% ` Eric Wong
  2 siblings, 0 replies; 31+ results
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
  To: meta

Since we no longer read document data from Xapian, allow users
to opt-out of storing it.

This breaks compatibility with previous releases of
public-inbox, but gives us a ~1.5% space savings on Xapian
storage (and associated I/O and page cache pressure reduction).
---
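Not part of the commit: a self-contained sketch of the write-once
metadata idiom used by C<set_metadata_once> in the SearchIdx.pm hunk
below.  C<MockXdb> is a hypothetical stand-in for the Xapian database
handle (it only counts writes), and the indexlevel branch is left out
for brevity.

```perl
use strict;
use warnings;

# stand-in for a Xapian WritableDatabase; only records metadata writes
package MockXdb;
sub new { bless { meta => {}, writes => 0 }, shift }
sub get_metadata { $_[0]->{meta}->{$_[1]} // '' }
sub set_metadata {
	my ($self, $k, $v) = @_;
	$self->{meta}->{$k} = $v;
	$self->{writes}++;
}

package main;

# same shape as set_metadata_once in the diff, minus the indexlevel branch
sub set_metadata_once {
	my ($self) = @_;
	return if $self->{shard}; # only shard 0 (or v1, undef) stores metadata
	my $xdb = $self->{xdb};
	# delete() returns the old value and clears the flag in one step
	if (delete($self->{-set_skip_docdata_once})) {
		$xdb->get_metadata('skip_docdata') or
			$xdb->set_metadata('skip_docdata', '1');
	}
}

my $idx = { xdb => MockXdb->new, -set_skip_docdata_once => 1 };
set_metadata_once($idx) for 1..3; # e.g. once per committed transaction
print "writes: $idx->{xdb}->{writes}\n"; # prints "writes: 1"
```

Because the flag is consumed by C<delete>, repeated commits do not
rewrite the metadata, and shards other than 0 never touch it.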
 Documentation/public-inbox-index.pod |  8 +++++++
 Documentation/public-inbox-init.pod  | 10 ++++++++
 lib/PublicInbox/Admin.pm             | 12 ++++++----
 lib/PublicInbox/SearchIdx.pm         | 35 +++++++++++++++++++++-------
 lib/PublicInbox/SearchIdxShard.pm    |  2 +-
 script/public-inbox-convert          |  3 ++-
 script/public-inbox-index            |  7 ++++--
 script/public-inbox-init             |  8 +++++++
 t/inbox_idle.t                       |  2 +-
 t/index-git-times.t                  | 11 ++++++++-
 t/init.t                             | 13 +++++++++++
 11 files changed, 91 insertions(+), 20 deletions(-)

diff --git a/Documentation/public-inbox-index.pod b/Documentation/public-inbox-index.pod
index 1ed9f5e7..46a53825 100644
--- a/Documentation/public-inbox-index.pod
+++ b/Documentation/public-inbox-index.pod
@@ -145,6 +145,14 @@ below.
 
 Available in public-inbox 1.6.0 (PENDING).
 
+=item --skip-docdata
+
+Stop storing document data in Xapian on an existing inbox.
+
+See L<public-inbox-init(1)/--skip-docdata> for description and caveats.
+
+Available in public-inbox 1.6.0 (PENDING).
+
 =back
 
 =head1 FILES
diff --git a/Documentation/public-inbox-init.pod b/Documentation/public-inbox-init.pod
index 4cc7e29f..3f98807a 100644
--- a/Documentation/public-inbox-init.pod
+++ b/Documentation/public-inbox-init.pod
@@ -95,6 +95,16 @@ default due to contention in the top-level producer process.
 
 Default: the number of online CPUs, up to 4
 
+=item --skip-docdata
+
+Do not store document data in Xapian, reducing Xapian storage
+overhead by around 1.5%.
+
+Warning: this option prevents rollbacks to public-inbox 1.5.0
+and earlier.
+
+Available since public-inbox 1.6.0 (PENDING).
+
 =back
 
 =head1 ENVIRONMENT
diff --git a/lib/PublicInbox/Admin.pm b/lib/PublicInbox/Admin.pm
index f5427af7..b8ead6f7 100644
--- a/lib/PublicInbox/Admin.pm
+++ b/lib/PublicInbox/Admin.pm
@@ -48,13 +48,14 @@ sub resolve_repo_dir {
 sub detect_indexlevel ($) {
 	my ($ibx) = @_;
 
-	# brand new or never before indexed inboxes default to full
-	return 'full' unless $ibx->over;
-	delete $ibx->{over}; # don't leave open FD lying around
+	my $over = $ibx->over;
+	my $srch = $ibx->search;
+	delete @$ibx{qw(over search)}; # don't leave open FDs lying around
 
+	# brand new or never before indexed inboxes default to full
+	return 'full' unless $over;
 	my $l = 'basic';
-	my $srch = $ibx->search or return $l;
-	delete $ibx->{search}; # don't leave open FD lying around
+	return $l unless $srch;
 	if (my $xdb = $srch->xdb) {
 		$l = 'full';
 		my $m = $xdb->get_metadata('indexlevel');
@@ -65,6 +66,7 @@ sub detect_indexlevel ($) {
 $ibx->{inboxdir} has unexpected indexlevel in Xapian: $m
 
 		}
+		$ibx->{-skip_docdata} = 1 if $xdb->get_metadata('skip_docdata');
 	}
 	$l;
 }
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 5c39f3d6..be46b2b9 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -61,6 +61,10 @@ sub new {
 	}, $class;
 	$self->xpfx_init;
 	$self->{-set_indexlevel_once} = 1 if $indexlevel eq 'medium';
+	if ($ibx->{-skip_docdata}) {
+		$self->{-set_skip_docdata_once} = 1;
+		$self->{-skip_docdata} = 1;
+	}
 	$ibx->umask_prepare;
 	if ($version == 1) {
 		$self->{lock_path} = "$inboxdir/ssoma.lock";
@@ -359,10 +363,18 @@ sub add_xapian ($$$$) {
 
 	msg_iter($eml, \&index_xapian, [ $self, $doc ]);
 	index_ids($self, $doc, $eml, $mids);
-	$smsg->{to} = $smsg->{cc} = ''; # WWW doesn't need these, only NNTP
-	PublicInbox::OverIdx::parse_references($smsg, $eml, $mids);
-	my $data = $smsg->to_doc_data;
-	$doc->set_data($data);
+
+	# by default, we maintain compatibility with v1.5.0 and earlier
+	# by writing to docdata.glass, users who never expect to downgrade can
+	# use --skip-docdata
+	if (!$self->{-skip_docdata}) {
+		# WWW doesn't need {to} or {cc}, only NNTP
+		$smsg->{to} = $smsg->{cc} = '';
+		PublicInbox::OverIdx::parse_references($smsg, $eml, $mids);
+		my $data = $smsg->to_doc_data;
+		$doc->set_data($data);
+	}
+
 	if (my $altid = $self->{-altid}) {
 		foreach my $alt (@$altid) {
 			my $pfx = $alt->{xprefix};
@@ -831,23 +843,28 @@ sub begin_txn_lazy {
 
 # store 'indexlevel=medium' in v2 shard=0 and v1 (only one shard)
 # This metadata is read by Admin::detect_indexlevel:
-sub set_indexlevel {
+sub set_metadata_once {
 	my ($self) = @_;
 
-	if (!$self->{shard} && # undef or 0, not >0
-			delete($self->{-set_indexlevel_once})) {
-		my $xdb = $self->{xdb};
+	return if $self->{shard}; # only continue if undef or 0, not >0
+	my $xdb = $self->{xdb};
+
+	if (delete($self->{-set_indexlevel_once})) {
 		my $level = $xdb->get_metadata('indexlevel');
 		if (!$level || $level ne 'medium') {
 			$xdb->set_metadata('indexlevel', 'medium');
 		}
 	}
+	if (delete($self->{-set_skip_docdata_once})) {
+		$xdb->get_metadata('skip_docdata') or
+			$xdb->set_metadata('skip_docdata', '1');
+	}
 }
 
 sub _commit_txn {
 	my ($self) = @_;
 	if (my $xdb = $self->{xdb}) {
-		set_indexlevel($self);
+		set_metadata_once($self);
 		$xdb->commit_transaction;
 	}
 	$self->{over}->commit_lazy if $self->{over};
diff --git a/lib/PublicInbox/SearchIdxShard.pm b/lib/PublicInbox/SearchIdxShard.pm
index 59b36087..20077e08 100644
--- a/lib/PublicInbox/SearchIdxShard.pm
+++ b/lib/PublicInbox/SearchIdxShard.pm
@@ -16,7 +16,7 @@ sub new {
 	my $self = $class->SUPER::new($ibx, 1, $shard);
 	# create the DB before forking:
 	$self->idx_acquire;
-	$self->set_indexlevel;
+	$self->set_metadata_once;
 	$self->idx_release;
 	$self->spawn_worker($v2w, $shard) if $v2w->{parallel};
 	$self;
diff --git a/script/public-inbox-convert b/script/public-inbox-convert
index d655dcc6..4ff198d1 100755
--- a/script/public-inbox-convert
+++ b/script/public-inbox-convert
@@ -77,7 +77,8 @@ if ($old) {
 die "Only conversion from v1 inboxes is supported\n" if $old->version >= 2;
 
 require PublicInbox::Admin;
-$old->{indexlevel} //= PublicInbox::Admin::detect_indexlevel($old);
+my $detected = PublicInbox::Admin::detect_indexlevel($old);
+$old->{indexlevel} //= $detected;
 my $env;
 if ($opt->{'index'}) {
 	my $mods = {};
diff --git a/script/public-inbox-index b/script/public-inbox-index
index 30d24838..9855c67d 100755
--- a/script/public-inbox-index
+++ b/script/public-inbox-index
@@ -39,7 +39,7 @@ GetOptions($opt, qw(verbose|v+ reindex rethread compact|c+ jobs|j=i prune
 		indexlevel|index-level|L=s max_size|max-size=s
 		batch_size|batch-size=s
 		sequential_shard|seq-shard|sequential-shard
-		all help|?))
+		skip-docdata all help|?))
 	or die "bad command-line args\n$usage";
 if ($opt->{help}) { print $help; exit 0 };
 die "--jobs must be >= 0\n" if defined $opt->{jobs} && $opt->{jobs} < 0;
@@ -58,9 +58,11 @@ unless (@ibxs) { print STDERR "Usage: $usage\n"; exit 1 }
 
 my $mods = {};
 foreach my $ibx (@ibxs) {
+	# detect_indexlevel may also set $ibx->{-skip_docdata}
+	my $detected = PublicInbox::Admin::detect_indexlevel($ibx);
 	# XXX: users can shoot themselves in the foot, with opt->{indexlevel}
 	$ibx->{indexlevel} //= $opt->{indexlevel} // ($opt->{xapian_only} ?
-			'full' : PublicInbox::Admin::detect_indexlevel($ibx));
+			'full' : $detected);
 	PublicInbox::Admin::scan_ibx_modules($mods, $ibx);
 }
 
@@ -75,6 +77,7 @@ for my $ibx (@ibxs) {
 		PublicInbox::Xapcmd::run($ibx, 'compact', $opt->{compact_opt});
 	}
 	$ibx->{-no_fsync} = 1 if !$opt->{fsync};
+	$ibx->{-skip_docdata} //= $opt->{'skip-docdata'};
 
 	my $ibx_opt = $opt;
 	if (defined(my $s = $ibx->{lc('indexSequentialShard')})) {
diff --git a/script/public-inbox-init b/script/public-inbox-init
index b19c2321..037e8e56 100755
--- a/script/public-inbox-init
+++ b/script/public-inbox-init
@@ -34,6 +34,7 @@ require PublicInbox::Admin;
 PublicInbox::Admin::require_or_die('-base');
 
 my ($version, $indexlevel, $skip_epoch, $skip_artnum, $jobs, $show_help);
+my $skip_docdata;
 my $ng = '';
 my %opts = (
 	'V|version=i' => \$version,
@@ -42,6 +43,7 @@ my %opts = (
 	'skip-artnum=i' => \$skip_artnum,
 	'j|jobs=i' => \$jobs,
 	'ng|newsgroup=s' => \$ng,
+	'skip-docdata' => \$skip_docdata,
 	'help|?' => \$show_help,
 );
 my $usage_cb = sub {
@@ -177,6 +179,12 @@ if (defined $jobs) {
 
 require PublicInbox::InboxWritable;
 $ibx = PublicInbox::InboxWritable->new($ibx, $creat_opt);
+if ($skip_docdata) {
+	$ibx->{indexlevel} //= 'full'; # ensure init_inbox writes xdb
+	$ibx->{indexlevel} eq 'basic' and
+		die "--skip-docdata ignored with --indexlevel=basic\n";
+	$ibx->{-skip_docdata} = $skip_docdata;
+}
 $ibx->init_inbox(0, $skip_epoch, $skip_artnum);
 
 # needed for git prior to v2.1.0
diff --git a/t/inbox_idle.t b/t/inbox_idle.t
index 61287200..e16ee11b 100644
--- a/t/inbox_idle.t
+++ b/t/inbox_idle.t
@@ -29,7 +29,7 @@ for my $V (1, 2) {
 	if ($V == 1) {
 		my $sidx = PublicInbox::SearchIdx->new($ibx, 1);
 		$sidx->idx_acquire;
-		$sidx->set_indexlevel;
+		$sidx->set_metadata_once;
 		$sidx->idx_release; # allow watching on lockfile
 	}
 	my $pi_config = PublicInbox::Config->new(\<<EOF);
diff --git a/t/index-git-times.t b/t/index-git-times.t
index 2e9e88e8..8f80c866 100644
--- a/t/index-git-times.t
+++ b/t/index-git-times.t
@@ -4,6 +4,7 @@ use Test::More;
 use PublicInbox::TestCommon;
 use PublicInbox::Import;
 use PublicInbox::Config;
+use PublicInbox::Admin;
 use File::Path qw(remove_tree);
 
 require_mods(qw(DBD::SQLite Search::Xapian));
@@ -47,11 +48,15 @@ EOF
 	PublicInbox::Import::run_die($cmd, undef, { 0 => $r });
 }
 
-run_script(['-index', $v1dir]) or die 'v1 index failed';
+run_script(['-index', '--skip-docdata', $v1dir]) or die 'v1 index failed';
+
 my $smsg;
 {
 	my $cfg = PublicInbox::Config->new;
 	my $ibx = $cfg->lookup($addr);
+	my $lvl = PublicInbox::Admin::detect_indexlevel($ibx);
+	is($lvl, 'medium', 'indexlevel detected');
+	is($ibx->{-skip_docdata}, 1, '--skip-docdata flag set on -index');
 	$smsg = $ibx->over->get_art(1);
 	is($smsg->{ds}, 749520000, 'datestamp from git author time');
 	is($smsg->{ts}, 1285977600, 'timestamp from git committer time');
@@ -70,6 +75,10 @@ SKIP: {
 	my $check_v2 = sub {
 		my $ibx = PublicInbox::Inbox->new({inboxdir => $v2dir,
 				address => $addr});
+		my $lvl = PublicInbox::Admin::detect_indexlevel($ibx);
+		is($lvl, 'medium', 'indexlevel detected after convert');
+		is($ibx->{-skip_docdata}, 1,
+			'--skip-docdata preserved after convert');
 		my $v2smsg = $ibx->over->get_art(1);
 		is($v2smsg->{ds}, $smsg->{ds},
 			'v2 datestamp from git author time');
diff --git a/t/init.t b/t/init.t
index 4d2c5049..dad09435 100644
--- a/t/init.t
+++ b/t/init.t
@@ -95,6 +95,19 @@ SKIP: {
 		my $ibx = PublicInbox::Inbox->new({ inboxdir => $dir });
 		is(PublicInbox::Admin::detect_indexlevel($ibx), $lvl,
 			'detected expected level w/o config');
+		ok(!$ibx->{-skip_docdata}, 'docdata written by default');
+	}
+	for my $v (1, 2) {
+		my $name = "v$v-skip-docdata";
+		my $dir = "$tmpdir/$name";
+		$cmd = [ '-init', $name, "-V$v", '--skip-docdata',
+			$dir, "http://example.com/$name",
+			"$name\@example.com" ];
+		ok(run_script($cmd), "-init -V$v --skip-docdata");
+		my $ibx = PublicInbox::Inbox->new({ inboxdir => $dir });
+		is(PublicInbox::Admin::detect_indexlevel($ibx), 'full',
+			"detected default indexlevel -V$v");
+		ok($ibx->{-skip_docdata}, "docdata skip set -V$v");
 	}
 
 	# loop for idempotency

^ permalink raw reply related	[relevance 7%]

* [PATCH 06/23] init: drop -N alias for --skip-artnum
    2020-08-20 20:24  4% ` [PATCH 05/23] init: support --newsgroup option Eric Wong
@ 2020-08-20 20:24  6% ` Eric Wong
  2020-08-20 20:24  7% ` [PATCH 22/23] init+index: support --skip-docdata for Xapian Eric Wong
  2 siblings, 0 replies; 31+ results
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
  To: meta

It may be too easily confused for --newsgroup or --ng.  This is
too rarely used and never made it into a release, so it should
be fine.
---
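Not part of the commit: a small sketch of how the option spec in the
script/public-inbox-init hunk below behaves once the C<N> alias is
gone.  C<parse_init_args> is a hypothetical helper, and the
C<gnu_getopt no_ignore_case> configuration is an assumption about how
the script sets up L<Getopt::Long>.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray :config gnu_getopt no_ignore_case);

# hypothetical helper: parse @args using only the post-patch option spec
sub parse_init_args {
	my (@args) = @_;
	my ($skip_artnum, @warnings);
	# Getopt::Long reports unknown options via warn(); capture them
	local $SIG{__WARN__} = sub { push @warnings, @_ };
	my $ok = GetOptionsFromArray(\@args, 'skip-artnum=i' => \$skip_artnum);
	($ok, $skip_artnum, \@warnings);
}

for my $argv ([qw(--skip-artnum=12)], [qw(--skip-artnum 12)], [qw(-N12)]) {
	my ($ok, $num, $warn) = parse_init_args(@$argv);
	printf "%-20s ok=%d skip_artnum=%s\n",
		"@$argv", $ok ? 1 : 0, $num // 'undef';
}
```

Both long forms used in the updated t/init.t parse to 12, while the
dropped C<-N12> spelling is rejected as an unknown option.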
 Documentation/public-inbox-init.pod | 2 +-
 script/public-inbox-init            | 2 +-
 t/init.t                            | 4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/Documentation/public-inbox-init.pod b/Documentation/public-inbox-init.pod
index 240959eb..4cc7e29f 100644
--- a/Documentation/public-inbox-init.pod
+++ b/Documentation/public-inbox-init.pod
@@ -54,7 +54,7 @@ Available since public-inbox 1.6.0 (PENDING).
 
 Default: none.
 
-=item -N, --skip-artnum
+=item --skip-artnum
 
 This option allows archivists to publish incomplete archives
 with only new mail while allowing NNTP article numbers
diff --git a/script/public-inbox-init b/script/public-inbox-init
index 90b32be8..b19c2321 100755
--- a/script/public-inbox-init
+++ b/script/public-inbox-init
@@ -39,7 +39,7 @@ my %opts = (
 	'V|version=i' => \$version,
 	'L|index-level|indexlevel=s' => \$indexlevel,
 	'S|skip|skip-epoch=i' => \$skip_epoch,
-	'N|skip-artnum=i' => \$skip_artnum,
+	'skip-artnum=i' => \$skip_artnum,
 	'j|jobs=i' => \$jobs,
 	'ng|newsgroup=s' => \$ng,
 	'help|?' => \$show_help,
diff --git a/t/init.t b/t/init.t
index 6211bb58..4d2c5049 100644
--- a/t/init.t
+++ b/t/init.t
@@ -116,7 +116,7 @@ SKIP: {
 			'publicinboxmda.spamcheck', 'none') == 0 or
 			BAIL_OUT "git config $?";
 	my $addr = 'skip3@example.com';
-	$cmd = [ qw(-init -V2 -Lbasic -N12 skip3), "$tmpdir/skip3",
+	$cmd = [ qw(-init -V2 -Lbasic --skip-artnum=12 skip3), "$tmpdir/skip3",
 		   qw(http://example.com/skip3), $addr ];
 	ok(run_script($cmd), '--skip-artnum -V2');
 	my $env = { ORIGINAL_RECIPIENT => $addr };
@@ -131,7 +131,7 @@ SKIP: {
 
 	$addr = 'skip4@example.com';
 	$env = { ORIGINAL_RECIPIENT => $addr };
-	$cmd = [ qw(-init -V1 -N12 -Lmedium skip4), "$tmpdir/skip4",
+	$cmd = [ qw(-init -V1 --skip-artnum 12 -Lmedium skip4), "$tmpdir/skip4",
 		   qw(http://example.com/skip4), $addr ];
 	ok(run_script($cmd), '--skip-artnum -V1');
 	$err = '';

^ permalink raw reply related	[relevance 6%]

* [PATCH 05/23] init: support --newsgroup option
  @ 2020-08-20 20:24  4% ` Eric Wong
  2020-08-20 20:24  6% ` [PATCH 06/23] init: drop -N alias for --skip-artnum Eric Wong
  2020-08-20 20:24  7% ` [PATCH 22/23] init+index: support --skip-docdata for Xapian Eric Wong
  2 siblings, 0 replies; 31+ results
From: Eric Wong @ 2020-08-20 20:24 UTC (permalink / raw)
  To: meta

We can reduce the need to edit the config file for NNTP group names
this way.
---
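Not part of the commit: the two newsgroup checks from the
script/public-inbox-init hunk below, pulled into a standalone sub so
they can be tried in isolation.  C<newsgroup_error> is a hypothetical
name; the regular expressions are copied from the diff, and an empty
string is treated as "no newsgroup given", matching the C<$ng = ''>
default.

```perl
use strict;
use warnings;

# hypothetical helper: returns an error string, or undef if $ng is acceptable
sub newsgroup_error {
	my ($ng) = @_;
	$ng =~ m![^A-Za-z0-9/_\.\-\~\@\+\=:]! and
		return "--newsgroup `$ng' is not valid";
	($ng =~ m!\A\.! || $ng =~ m!\.\z!) and
		return "--newsgroup `$ng' must not start or end with `.'";
	undef;
}

for my $ng ('inbox.comp.mail.public-inbox.meta', '.inbox', 'inbox.', 'in box') {
	my $err = newsgroup_error($ng);
	print defined($err) ? "$err\n" : "`$ng' OK\n";
}
```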
 Documentation/public-inbox-config.pod |  2 +-
 Documentation/public-inbox-init.pod   | 25 +++++++++++++++++++++----
 script/public-inbox-init              | 12 ++++++++++--
 t/imapd.t                             |  6 ++----
 t/nntpd.t                             |  9 +++------
 5 files changed, 37 insertions(+), 17 deletions(-)

diff --git a/Documentation/public-inbox-config.pod b/Documentation/public-inbox-config.pod
index 05b84819..1dfb926e 100644
--- a/Documentation/public-inbox-config.pod
+++ b/Documentation/public-inbox-config.pod
@@ -63,7 +63,7 @@ Default: none, optional
 =item publicinbox.<name>.newsgroup
 
 The NNTP group name for use with L<public-inbox-nntpd(8)>.  This
-may be any newsgroup name with hierarchies delimited by '.'.
+may be any newsgroup name with hierarchies delimited by C<.>.
 For example, the newsgroup for L<mailto:meta@public-inbox.org>
 is: C<inbox.comp.mail.public-inbox.meta>
 
diff --git a/Documentation/public-inbox-init.pod b/Documentation/public-inbox-init.pod
index d0c87563..240959eb 100644
--- a/Documentation/public-inbox-init.pod
+++ b/Documentation/public-inbox-init.pod
@@ -39,6 +39,21 @@ See L<public-inbox-config(5)> for more information.
 
 Default: C<full>
 
+=item --ng, --newsgroup NEWSGROUP
+
+The NNTP group name for use with L<public-inbox-nntpd(8)>.  This
+may be any newsgroup name with hierarchies delimited by C<.>.
+For example, the newsgroup for L<mailto:meta@public-inbox.org>
+is: C<inbox.comp.mail.public-inbox.meta>
+
+This may be set after-the-fact via C<publicinbox.$NAME.newsgroup>
+in the configuration file.  See L<public-inbox-config(5)> for more
+info.
+
+Available since public-inbox 1.6.0 (PENDING).
+
+Default: none.
+
 =item -N, --skip-artnum
 
 This option allows archivists to publish incomplete archives
@@ -94,10 +109,12 @@ Used to override the default C<~/.public-inbox/config> value.
 
 =head1 LIMITATIONS
 
-This tool predates NNTP support in public-inbox and is missing
-C<newsgroup> and many of the options documented in
-L<public-inbox-config(5)>.  See L<public-inbox-config(5)> for all the
-options which may be applied to a given inbox.
+Some of the options documented in L<public-inbox-config(5)>
+require editing the config file.  Old versions lack the
+C<--ng>/C<--newsgroup> parameter.
+
+See L<public-inbox-config(5)> for all the options which may be applied
+to a given inbox.
 
 =head1 CONTACT
 
diff --git a/script/public-inbox-init b/script/public-inbox-init
index 6852f64a..90b32be8 100755
--- a/script/public-inbox-init
+++ b/script/public-inbox-init
@@ -22,6 +22,7 @@ options:
 
   -V2                 use scalable public-inbox-v2-format(5)
   -L LEVEL            index level `basic', `medium', or `full' (default: full)
+  --ng NEWSGROUP      set NNTP newsgroup name
   --skip-artnum=NUM   NNTP article numbers to skip
   --skip-epoch=NUM    epochs to skip (-V2 only)
   -J JOBS             number of indexing jobs (-V2 only), (default: 4)
@@ -33,12 +34,14 @@ require PublicInbox::Admin;
 PublicInbox::Admin::require_or_die('-base');
 
 my ($version, $indexlevel, $skip_epoch, $skip_artnum, $jobs, $show_help);
+my $ng = '';
 my %opts = (
 	'V|version=i' => \$version,
 	'L|index-level|indexlevel=s' => \$indexlevel,
 	'S|skip|skip-epoch=i' => \$skip_epoch,
 	'N|skip-artnum=i' => \$skip_artnum,
 	'j|jobs=i' => \$jobs,
+	'ng|newsgroup=s' => \$ng,
 	'help|?' => \$show_help,
 );
 my $usage_cb = sub {
@@ -53,7 +56,11 @@ my $inboxdir = shift @ARGV or $usage_cb->();
 my $http_url = shift @ARGV or $usage_cb->();
 my (@address) = @ARGV;
 @address or $usage_cb->();
-my %seen;
+
+$ng =~ m![^A-Za-z0-9/_\.\-\~\@\+\=:]! and
+	die "--newsgroup `$ng' is not valid\n";
+($ng =~ m!\A\.! || $ng =~ m!\.\z!) and
+	die "--newsgroup `$ng' must not start or end with `.'\n";
 
 require PublicInbox::Config;
 my $pi_config = PublicInbox::Config->default_file;
@@ -84,7 +91,7 @@ sysopen($lockfh, $lockfile, O_RDWR|O_CREAT|O_EXCL) or do {
 	exit(255);
 };
 my $auto_unlink = UnlinkMe->new($lockfile);
-my $perm;
+my ($perm, %seen);
 if (-e $pi_config) {
 	open(my $oh, '<', $pi_config) or die "unable to read $pi_config: $!\n";
 	my @st = stat($oh);
@@ -185,6 +192,7 @@ PublicInbox::Import::run_die([@x, "$pfx.inboxdir", $inboxdir]);
 if (defined($indexlevel)) {
 	PublicInbox::Import::run_die([@x, "$pfx.indexlevel", $indexlevel]);
 }
+PublicInbox::Import::run_die([@x, "$pfx.newsgroup", $ng]) if $ng ne '';
 
 # needed for git prior to v2.1.0
 if (defined $perm) {
diff --git a/t/imapd.t b/t/imapd.t
index 6cfced41..4d627af7 100644
--- a/t/imapd.t
+++ b/t/imapd.t
@@ -39,11 +39,9 @@ for my $V (@V) {
 	my $url = "http://example.com/i$V";
 	my $inboxdir = "$tmpdir/$name";
 	my $folder = "inbox.i$V";
-	my $cmd = ['-init', "-V$V", "-L$level", $name, $inboxdir, $url, $addr];
+	my $cmd = ['-init', "-V$V", "-L$level", "--ng=$folder",
+		$name, $inboxdir, $url, $addr];
 	run_script($cmd) or BAIL_OUT("init $name");
-	xsys(qw(git config), "--file=$ENV{HOME}/.public-inbox/config",
-			"publicinbox.$name.newsgroup", $folder) == 0 or
-			BAIL_OUT("setting newsgroup $V");
 	if ($V == 1) {
 		xsys(qw(git config), "--file=$ENV{HOME}/.public-inbox/config",
 			'publicinboxmda.spamcheck', 'none') == 0 or
diff --git a/t/nntpd.t b/t/nntpd.t
index b9b9a63d..74e21a41 100644
--- a/t/nntpd.t
+++ b/t/nntpd.t
@@ -46,14 +46,11 @@ my $ibx = {
 $ibx = PublicInbox::Inbox->new($ibx);
 {
 	local $ENV{HOME} = $home;
-	my @cmd = ('-init', $group, $inboxdir, 'http://example.com/', $addr);
-	push @cmd, "-V$version", '-Lbasic';
+	my @cmd = ('-init', $group, $inboxdir, 'http://example.com/', $addr,
+		"-V$version", '-Lbasic', '--newsgroup', $group);
 	ok(run_script(\@cmd), 'init OK');
-	is(xsys(qw(git config), "--file=$home/.public-inbox/config",
-			"publicinbox.$group.newsgroup", $group),
-		0, 'enabled newsgroup');
-	my $len;
 
+	my $len;
 	$ibx = PublicInbox::InboxWritable->new($ibx);
 	my $im = $ibx->importer(0);
 

^ permalink raw reply related	[relevance 4%]

* [PATCH] doc: add public-inbox-tuning(7) manpage
@ 2020-08-15  5:21  9% Eric Wong
  0 siblings, 0 replies; 31+ results
From: Eric Wong @ 2020-08-15  5:21 UTC (permalink / raw)
  To: meta

Determining storage device speed and latencies doesn't
seem portable or even possible with the wide variety
of storage layers in use.

This means we need to write a tuning document and hope
users read and improve on it :P
---
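Not part of the commit: a sketch of the administrator step described
under "Process spawning" in the new manpage below, i.e. making sure a
writable directory exists for L<Inline::C>.  C<prepare_inline_dir> is
a hypothetical helper and does not claim to mirror public-inbox's own
lookup logic; the demo runs against a temporary directory instead of
the real C<$HOME>.

```perl
use strict;
use warnings;
use File::Path qw(make_path);
use File::Temp qw(tempdir);

# hypothetical helper: returns a writable directory for Inline::C builds
sub prepare_inline_dir {
	my ($env) = @_; # hashref such as \%ENV
	my $dir = $env->{PERL_INLINE_DIRECTORY};
	return $dir if defined($dir) && -d $dir && -w _;
	# fall back to the per-user cache directory named in the manpage
	$dir = "$env->{HOME}/.cache/public-inbox/inline-c";
	make_path($dir) unless -d $dir;
	(-d $dir && -w _) ? $dir : undef;
}

my $fake_home = tempdir(CLEANUP => 1);
my $dir = prepare_inline_dir({ HOME => $fake_home });
print defined($dir) ? "Inline::C may use $dir\n" : "no writable directory\n";
```

Passing C<\%ENV> instead of the fake environment would create the
directory under the invoking user's home.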
 Documentation/public-inbox-tuning.pod    | 139 +++++++++++++++++++++++
 Documentation/public-inbox-v2-format.pod |   6 +-
 MANIFEST                                 |   1 +
 Makefile.PL                              |   2 +-
 4 files changed, 144 insertions(+), 4 deletions(-)
 create mode 100644 Documentation/public-inbox-tuning.pod

diff --git a/Documentation/public-inbox-tuning.pod b/Documentation/public-inbox-tuning.pod
new file mode 100644
index 00000000..abc53d1e
--- /dev/null
+++ b/Documentation/public-inbox-tuning.pod
@@ -0,0 +1,139 @@
+=head1 NAME
+
+public-inbox-tuning - tuning public-inbox
+
+=head1 DESCRIPTION
+
+public-inbox intends to support a wide variety of hardware.  While
+we strive to provide the best out-of-the-box performance possible,
+tuning knobs are an unfortunate necessity in some cases.
+
+=over 4
+
+=item 1
+
+New inboxes: public-inbox-init -V2
+
+=item 2
+
+Process spawning
+
+=item 3
+
+Performance on rotational hard disk drives
+
+=item 4
+
+Btrfs (and possibly other copy-on-write filesystems)
+
+=item 5
+
+Performance on solid state drives
+
+=item 6
+
+Read-only daemons
+
+=back
+
+=head2 New inboxes: public-inbox-init -V2
+
+If you're starting a new inbox (and not mirroring an existing one),
+the L<-V2|public-inbox-v2-format(5)> format requires L<DBD::SQLite>, but is
+orders of magnitude more scalable than the original C<-V1> format.
+
+=head2 Process spawning
+
+Our optional use of L<Inline::C> speeds up subprocess spawning from
+large daemon processes.
+
+To enable L<Inline::C>, either set the C<PERL_INLINE_DIRECTORY>
+environment variable to point to a writable directory, or create
+C<~/.cache/public-inbox/inline-c> for any user(s) running
+public-inbox processes.
+
+More (optional) L<Inline::C> use will be introduced in the future
+to lower memory use and improve scalability.
+
+=head2 Performance on rotational hard disk drives
+
+Random I/O performance is poor on rotational HDDs.  Xapian indexing
+performance degrades significantly as DBs grow larger than available
+RAM.  Attempts to parallelize random I/O on HDDs lead to pathological
+slowdowns as inboxes grow.
+
+While C<-V2> introduced Xapian shards as a parallelization
+mechanism for SSDs, enabling C<publicInbox.indexSequentialShard>
+repurposes sharding as a mechanism to reduce the kernel page cache
+footprint when indexing on HDDs.
+
+Initializing a mirror with a high C<--jobs> count to create more
+shards (in C<-V2> inboxes) will keep each shard smaller and
+reduce its kernel page cache footprint.
+
+Users with large amounts of RAM are advised to set a large value
+for C<publicinbox.indexBatchSize> as documented in
+L<public-inbox-config(5)>.
+
+C<dm-crypt> users on Linux 4.0+ are advised to try the
+C<--perf-same_cpu_crypt> C<--perf-submit_from_crypt_cpus>
+switches of L<cryptsetup(8)> to reduce I/O contention from
+kernel workqueue threads.
+
+=head2 Btrfs (and possibly other copy-on-write filesystems)
+
+L<btrfs(5)> performance degrades from fragmentation when using
+large databases and random writes.  The Xapian + SQLite indices
+used by public-inbox are no exception to that.
+
+public-inbox 1.6.0+ disables copy-on-write (CoW) on Xapian and SQLite
+indices on btrfs to achieve acceptable performance (even on SSD).
+Disabling copy-on-write also disables checksumming, thus raid1
+(or higher) configurations may corrupt on unsafe shutdowns.
+
+Fortunately, these SQLite and Xapian indices are designed to be
+recoverable from git if missing.
+
+Large filesystems benefit significantly from the C<space_cache=v2>
+mount option documented in L<btrfs(5)>.
+
+Older, non-CoW filesystems generally work well out-of-the-box
+for our Xapian and SQLite indices.
+
+=head2 Performance on solid state drives
+
+While SSD read performance is generally good, SSD write performance
+degrades as the drive ages and/or gets full.  Issuing C<TRIM> commands
+via L<fstrim(8)> or similar is required to sustain write performance.
+
+=head2 Read-only daemons
+
+L<public-inbox-httpd(1)>, L<public-inbox-imapd(1)>, and
+L<public-inbox-nntpd(1)> are all designed for C10K (or higher)
+levels of concurrency from a single process.  SMP systems may
+use C<--worker-processes=NUM> as documented in L<public-inbox-daemon(8)>
+for parallelism.
+
+The open file descriptor limit (C<RLIMIT_NOFILE>, C<ulimit -n> in L<sh(1)>,
+C<LimitNOFILE=> in L<systemd.exec(5)>) may need to be raised to
+accommodate many concurrent clients.
+
+Transport Layer Security (IMAPS, NNTPS, or via STARTTLS) significantly
+increases memory use of client sockets; be sure to account for that in
+capacity planning.
+
+=head1 CONTACT
+
+Feedback encouraged via plain-text mail to L<mailto:meta@public-inbox.org>
+
+Information for *BSDs and non-traditional filesystems especially
+welcome.
+
+Our archives are hosted at L<https://public-inbox.org/meta/>,
+L<http://hjrcffqmbrq6wope.onion/meta/>, and other places
+
+=head1 COPYRIGHT
+
+Copyright 2020 all contributors L<mailto:meta@public-inbox.org>
+
+License: AGPL-3.0+ L<https://www.gnu.org/licenses/agpl-3.0.txt>
diff --git a/Documentation/public-inbox-v2-format.pod b/Documentation/public-inbox-v2-format.pod
index 6876989c..86a9b8f2 100644
--- a/Documentation/public-inbox-v2-format.pod
+++ b/Documentation/public-inbox-v2-format.pod
@@ -117,9 +117,9 @@ Rotational storage devices perform significantly worse than
 solid state storage for indexing of large mail archives; but are
 fine for backup and usable for small instances.
 
-As of public-inbox 1.6.0, the C<--sequential-shard> option of
-L<public-inbox-index(1)> may be used with a high shard count
-to ensure individual shards fit into page cache when the entire
+As of public-inbox 1.6.0, the C<publicInbox.indexSequentialShard>
+option of L<public-inbox-index(1)> may be used with a high shard
+count to ensure individual shards fit into page cache when the entire
 Xapian DB cannot.
 
 Our use of the L</OVERVIEW DB> requires Xapian document IDs to
diff --git a/MANIFEST b/MANIFEST
index 3d690177..6cb5f6bf 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -35,6 +35,7 @@ Documentation/public-inbox-mda.pod
 Documentation/public-inbox-nntpd.pod
 Documentation/public-inbox-overview.pod
 Documentation/public-inbox-purge.pod
+Documentation/public-inbox-tuning.pod
 Documentation/public-inbox-v1-format.pod
 Documentation/public-inbox-v2-format.pod
 Documentation/public-inbox-watch.pod
diff --git a/Makefile.PL b/Makefile.PL
index 831649f9..88da5b45 100644
--- a/Makefile.PL
+++ b/Makefile.PL
@@ -34,7 +34,7 @@ $v->{my_syntax} = [map { "$_.syntax" } @syn];
 $v->{-m1} = [ map { (split('/'))[-1] } @EXE_FILES ];
 $v->{-m5} = [ qw(public-inbox-config public-inbox-v1-format
 		public-inbox-v2-format) ];
-$v->{-m7} = [ qw(public-inbox-overview) ];
+$v->{-m7} = [ qw(public-inbox-overview public-inbox-tuning) ];
 $v->{-m8} = [ qw(public-inbox-daemon) ];
 my @sections = (1, 5, 7, 8);
 $v->{check_80} = [];

^ permalink raw reply related	[relevance 9%]

* [PATCH 5/6] xcpdb: wire up new index options and --help
  @ 2020-08-12  9:17  4% ` Eric Wong
  0 siblings, 0 replies; 31+ results
From: Eric Wong @ 2020-08-12  9:17 UTC (permalink / raw)
  To: meta

--sequential-shard also disables the copy parallelism (--jobs),
so it can be useful on systems which cannot handle parallel
random I/O but still want many shards.

There was a missing "use strict", too, which is fixed.
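
A minimal sketch of the override described above, assuming a
hypothetical `parse_xcpdb_args' helper (not part of public-inbox)
which mirrors the option names in the script below:

```perl
use strict;
use warnings;
use v5.10.1;
use Getopt::Long qw(GetOptionsFromArray :config gnu_getopt no_ignore_case);

# hypothetical helper for illustration only
sub parse_xcpdb_args {
	my (@argv) = @_;
	my $opt = {};
	GetOptionsFromArray(\@argv, $opt,
		'sequential_shard|seq-shard|sequential-shard',
		'jobs|j=i') or die "bad command-line args\n";

	# --sequential-shard wins: warn and drop any parallel --jobs value
	if ($opt->{sequential_shard} && ($opt->{jobs} // 1) > 1) {
		warn "W: --jobs=$opt->{jobs} ignored with --sequential-shard\n";
		$opt->{jobs} = 0;
	}
	$opt;
}
```

Passing both `--sequential-shard' and `--jobs=4' should yield
jobs => 0 along with a single warning, while `--jobs=4' alone is
left untouched.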
---
 Documentation/public-inbox-xcpdb.pod | 19 +++++++-
 lib/PublicInbox/Xapcmd.pm            |  3 +-
 script/public-inbox-xcpdb            | 66 +++++++++++++++++++++++-----
 3 files changed, 75 insertions(+), 13 deletions(-)

diff --git a/Documentation/public-inbox-xcpdb.pod b/Documentation/public-inbox-xcpdb.pod
index 2ed4c5821..62a28c0a1 100644
--- a/Documentation/public-inbox-xcpdb.pod
+++ b/Documentation/public-inbox-xcpdb.pod
@@ -19,7 +19,7 @@ L<public-inbox-learn(1)>, and L<public-inbox-index(1)>.
 
 =over
 
-=item --compact
+=item -c, --compact
 
 In addition to performing the copy operation, run L<xapian-compact(1)>
 on each Xapian shard after copying but before finalizing it.
@@ -52,6 +52,23 @@ Disable L<fsync(2)> and L<fdatasync(2)>.
 
 Available in public-inbox 1.6.0 (PENDING).
 
+=item --sequential-shard
+
+Copy each shard sequentially, ignoring C<--jobs>.  This also
+affects indexing done at the end of a run.
+
+=item --batch-size=BYTES
+
+=item --max-size=BYTES
+
+See L<public-inbox-index(1)> for a description of these options.
+
+These indexing options affect indexing at the end of a run.
+C<public-inbox-xcpdb> may run in parallel with
+L<public-inbox-index(1)>, and C<public-inbox-xcpdb> needs to
+reindex changes made to the old Xapian DBs by
+L<public-inbox-index(1)> while it was running.
+
 =back
 
 =head1 ENVIRONMENT
diff --git a/lib/PublicInbox/Xapcmd.pm b/lib/PublicInbox/Xapcmd.pm
index b6279218c..46548a948 100644
--- a/lib/PublicInbox/Xapcmd.pm
+++ b/lib/PublicInbox/Xapcmd.pm
@@ -82,7 +82,8 @@ sub commit_changes ($$$$) {
 				$im->{shards} = $n;
 			}
 		}
-
+		my $env = $opt->{-idx_env};
+		local %ENV = (%ENV, %$env) if $env;
 		PublicInbox::Admin::index_inbox($ibx, $im, $opt);
 	}
 }
diff --git a/script/public-inbox-xcpdb b/script/public-inbox-xcpdb
index 2c91598cb..718a34b77 100755
--- a/script/public-inbox-xcpdb
+++ b/script/public-inbox-xcpdb
@@ -1,20 +1,64 @@
-#!/usr/bin/perl -w
+#!perl -w
 # Copyright (C) 2019-2020 all contributors <meta@public-inbox.org>
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
-# xcpdb: Xapian copy database, a wrapper around Xapian's copydatabase(1)
+use strict;
+use v5.10.1;
 use Getopt::Long qw(:config gnu_getopt no_ignore_case auto_abbrev);
-use PublicInbox::InboxWritable;
-use PublicInbox::Xapcmd;
+my $usage = 'Usage: public-inbox-xcpdb [options] INBOX_DIR';
+my $help = <<EOF; # the following should fit w/o scrolling in 80x24 term:
+usage: $usage
+
+  upgrade or reshard Xapian DB(s) used by public-inbox
+
+options:
+
+  --compact | -c      run public-inbox-compact(1) after indexing
+  --reshard=NUM       change the number of shards
+  --jobs=NUM          limit parallelism to NUM jobs
+  --verbose | -v      increase verbosity (may be repeated)
+  --sequential-shard  copy+index Xapian shards sequentially (for slow HDD)
+  --help | -?         show this help
+
+index options (see public-inbox-index(1) man page for full description):
+
+  --no-fsync          speed up indexing, risk corruption on power outage
+  --batch-size=BYTES  flush changes to OS after a given number of bytes
+  --max-size=BYTES    do not index messages larger than the given size
+
+See public-inbox-xcpdb(1) man page for full documentation.
+EOF
+my $opt = { quiet => -1, compact => 0, fsync => 1 };
+GetOptions($opt, qw(
+	fsync|sync! compact|c reshard|R=i
+	max_size|max-size=s batch_size|batch-size=s
+	sequential_shard|seq-shard|sequential-shard
+	jobs|j=i quiet|q verbose|v
+	blocksize|b=s no-full|n fuller|F
+	help|?)) or die "bad command-line args\n$usage";
+if ($opt->{help}) { print $help; exit 0 };
+
 use PublicInbox::Admin;
 PublicInbox::Admin::require_or_die('-search');
-my $usage = "Usage: public-inbox-xcpdb [--compact] INBOX_DIR\n";
-my $opt = { fsync => 1 };
-my @opt = (qw(fsync|sync! compact reshard|R=i),
-	@PublicInbox::Xapcmd::COMPACT_OPT);
-GetOptions($opt, @opt) or die "bad command-line args\n$usage";
-my @ibxs = PublicInbox::Admin::resolve_inboxes(\@ARGV) or die $usage;
+
+require PublicInbox::Config;
+my $cfg = PublicInbox::Config->new;
+my @ibxs = PublicInbox::Admin::resolve_inboxes(\@ARGV, undef, $cfg) or
+	die $usage;
+my $idx_env = PublicInbox::Admin::index_prepare($opt, $cfg);
+
+# we only set XAPIAN_FLUSH_THRESHOLD for index, since cpdb doesn't
+# know sizes, only doccounts
+$opt->{-idx_env} = $idx_env;
+
+if ($opt->{sequential_shard} && ($opt->{jobs} // 1) > 1) {
+	warn "W: --jobs=$opt->{jobs} ignored with --sequential-shard\n";
+	$opt->{jobs} = 0;
+}
+
+require PublicInbox::InboxWritable;
+require PublicInbox::Xapcmd;
 foreach (@ibxs) {
 	my $ibx = PublicInbox::InboxWritable->new($_);
-	# we rely on --no-renumber to keep docids synched to NNTP
+	# we rely on --no-renumber to keep docids synched for NNTP
 	PublicInbox::Xapcmd::run($ibx, 'cpdb', $opt);
 }

^ permalink raw reply related	[relevance 4%]

* [PATCH 03/14] doc: index: more notes about latest changes
  @ 2020-08-10  2:11 12% ` Eric Wong
  0 siblings, 0 replies; 31+ results
From: Eric Wong @ 2020-08-10  2:11 UTC (permalink / raw)
  To: meta

With LKML on an HDD, a giant --batch-size of 500m ends up being
pretty useful.  I was able to index LKML in ~16 hours on a
system that had other activity on it.  The big downside was it
was eating up over 5g of RAM :x.

We'll also fix up a duplicated indexBatchSize section, fix
formatting around global vs per-inbox indexSequentialShard,
and ensure section 5 manpages are linked correctly.
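
To make the batch size arithmetic concrete, here is a rough sketch.
`parse_bytes' and `flush_interval' are hypothetical names, and the
binary multiples (1k = 1024 bytes) are an assumption rather than
something confirmed by this patch:

```perl
use strict;
use warnings;
use v5.10.1;

# hypothetical helper: convert `500m', `10k', `1g' or plain digits
# to bytes, assuming binary multiples; returns undef if unparsable
sub parse_bytes {
	my ($val) = @_;
	$val =~ /\A([0-9]+)([kmg])?\z/i or return;
	my %mult = (k => 1024, m => 1024 ** 2, g => 1024 ** 3);
	$1 * ($2 ? $mult{lc($2)} : 1);
}

# bytes indexed between flushes: the documentation states the batch
# size is multiplied by the shard count unless parallelism is disabled
sub flush_interval {
	my ($batch, $shards, $parallel) = @_;
	my $bytes = parse_bytes($batch) // return;
	$parallel ? $bytes * $shards : $bytes;
}
```

With the default `1m' and 3 parallel shards this works out to 3
megabytes between flushes, whereas `500m' with --sequential-shard
stays at 500 megabytes regardless of the shard count.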
---
 Documentation/public-inbox-index.pod | 62 +++++++++++++++-------------
 1 file changed, 33 insertions(+), 29 deletions(-)

diff --git a/Documentation/public-inbox-index.pod b/Documentation/public-inbox-index.pod
index 56dec993..3ae3b008 100644
--- a/Documentation/public-inbox-index.pod
+++ b/Documentation/public-inbox-index.pod
@@ -115,6 +115,11 @@ Sets or overrides L</publicinbox.indexBatchSize> on a
 per-invocation basis.  See L</publicinbox.indexBatchSize>
 below.
 
+When using rotational storage but abundant RAM, using a large
+value (e.g. C<500m>) with C<--sequential-shard> can
+significantly speed up the initial index and full C<--reindex>
+invocations (but not incremental updates).
+
 Available in public-inbox 1.6.0 (PENDING).
 
 =item --no-fsync
@@ -136,11 +141,11 @@ Available in public-inbox 1.6.0 (PENDING).
 
 =head1 FILES
 
-For v1 (ssoma) repositories described in L<public-inbox-v1-format>.
+For v1 (ssoma) repositories described in L<public-inbox-v1-format(5)>.
 All public-inbox-specific files are contained within the
 C<$GIT_DIR/public-inbox/> directory.
 
-v2 inboxes are described in L<public-inbox-v2-format>.
+v2 inboxes are described in L<public-inbox-v2-format(5)>.
 
 =head1 CONFIGURATION
 
@@ -168,40 +173,25 @@ L<public-inbox-learn(1)>, and L<public-inbox-watch(1)>.
 
 Increase this value on powerful systems to improve throughput at
 the expense of memory use.  The reduction of lock granularity
-may not be noticeable on fast systems.
-
-This option is available in public-inbox 1.6 or later.
-public-inbox 1.5 and earlier used the current default, C<1m>.
+may not be noticeable on fast systems.  With SSDs, values above
+C<4m> have little benefit.
 
 For L<public-inbox-v2-format(5)> inboxes, this value is
 multiplied by the number of Xapian shards.  Thus a typical v2
-inbox with 3 shards will flush every 3 megabytes by default.
-
-Default: 1m (one megabyte)
+inbox with 3 shards will flush every 3 megabytes by default
+unless parallelism is disabled via C<--sequential-shard>
+or C<--jobs=0>.
 
-=item publicinbox.indexBatchSize
-
-Flushes changes to the filesystem and releases locks after
-indexing the given number of bytes.  The default value of C<1m>
-(one megabyte) is low to minimize memory use and reduce
-contention with parallel invocations of L<public-inbox-mda(1)>,
-L<public-inbox-learn(1)>, and L<public-inbox-watch(1)>.
-
-Increase this value on powerful systems to improve throughput at
-the expense of memory use.  The reduction of lock granularity
-may not be noticeable on fast systems.
+This influences memory usage of Xapian, but it is not exact.
+The actual memory used by Xapian and Perl has been observed
+in excess of 10x this value.
 
 This option is available in public-inbox 1.6 or later.
 public-inbox 1.5 and earlier used the current default, C<1m>.
 
-For L<public-inbox-v2-format(5)> inboxes, this value is
-multiplied by the number of Xapian shards.  Thus a typical v2
-inbox with 3 shards will flush every 3 megabytes by default.
-
 Default: 1m (one megabyte)
 
 =item publicinbox.indexSequentialShard
-=item publicinbox.<inbox_name>.indexSequentialShard
 
 For L<public-inbox-v2-format(5)> inboxes, setting this to C<true>
 allows indexing Xapian shards in multiple passes.  This speeds up
@@ -212,12 +202,23 @@ Using a higher-than-normal number of C<--jobs> with
 L<public-inbox-init(1)> may be required to ensure individual
 shards are small enough to fit into cache.
 
+Warning: interrupting C<public-inbox-index(1)> while this option
+is in use may leave the search indices out-of-date with respect
+to SQLite databases.  WWW and IMAP users may notice incomplete
+search results, but it is otherwise non-fatal.  Using C<--reindex>
+will bring everything back up-to-date.
+
 Available in public-inbox 1.6.0 (PENDING).
 
 This is ignored on L<public-inbox-v1-format(5)> inboxes.
 
 Default: false, shards are indexed in parallel
 
+=item publicinbox.<name>.indexSequentialShard
+
+Identical to L</publicinbox.indexSequentialShard>,
+but only affects the inbox matching E<lt>nameE<gt>.
+
 =back
 
 =head1 ENVIRONMENT
@@ -235,10 +236,13 @@ disk.  This environment is handled directly by Xapian, refer to
 Xapian API documentation for more details.
 
 For public-inbox 1.6 and later, use C<publicinbox.indexBatchSize>
-instead.  Setting C<XAPIAN_FLUSH_THRESHOLD> for a large C<--reindex>
-may cause L<public-inbox-mda(1)>, L<public-inbox-learn(1)> and
-L<public-inbox-watch(1)> tasks to wait long periods of time
-during C<--reindex>.
+instead.
+
+Setting C<XAPIAN_FLUSH_THRESHOLD> or
+C<publicinbox.indexBatchSize> for a large C<--reindex> may cause
+L<public-inbox-mda(1)>, L<public-inbox-learn(1)> and
+L<public-inbox-watch(1)> tasks to wait long and unpredictable
+periods of time during C<--reindex>.
 
 Default: none, uses C<publicinbox.indexBatchSize>
 

^ permalink raw reply related	[relevance 12%]

* [PATCH 5/5] index: add built-in --help / -?
  @ 2020-08-07 10:52  4% ` Eric Wong
  0 siblings, 0 replies; 31+ results
From: Eric Wong @ 2020-08-07 10:52 UTC (permalink / raw)
  To: meta

Eventually, commonly-used commands run by the user will all
support --help / -? for user-friendliness.  The changes from
up-front `use' to lazy `require' speed up `--help' by 3x or so.
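
The lazy loading pattern can be sketched as follows. This is only an
illustration: `run' and its usage text are hypothetical, and
Digest::SHA stands in for the heavier PublicInbox modules:

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

my $help = "usage: example [options] DIR\n"; # hypothetical usage text

# returns the help text without loading anything heavy, or a hex
# digest of the remaining arguments after loading Digest::SHA on demand
sub run {
	my (@argv) = @_;
	my $opt = {};
	GetOptionsFromArray(\@argv, $opt, 'help|?') or die "bad args\n";
	return $help if $opt->{help};

	# `require' is evaluated at runtime, so --help never pays for it;
	# `use' would have loaded the module at compile time instead
	require Digest::SHA;
	Digest::SHA::sha1_hex(join(' ', @argv));
}
```

Since `use' runs at compile time, every module named by `use' is
loaded before option parsing even begins; `require' defers that cost
to the code paths which actually need the module.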
---
 Documentation/public-inbox-index.pod |  4 +--
 script/public-inbox-index            | 44 +++++++++++++++++++++++-----
 2 files changed, 38 insertions(+), 10 deletions(-)

diff --git a/Documentation/public-inbox-index.pod b/Documentation/public-inbox-index.pod
index a4edc57a..56dec993 100644
--- a/Documentation/public-inbox-index.pod
+++ b/Documentation/public-inbox-index.pod
@@ -40,8 +40,8 @@ Influences the number of Xapian indexing shards in a
 C<--jobs=0> is accepted as of public-inbox 1.6.0 (PENDING)
 to disable parallel indexing.
 
-If the inbox has not been indexed, C<JOBS - 1> shards
-will be created (one job is always needed for indexing
+If the inbox has not been indexed or initialized, C<JOBS - 1>
+shards will be created (one job is always needed for indexing
 the overview and article number mapping).
 
 Default: the number of existing Xapian shards
diff --git a/script/public-inbox-index b/script/public-inbox-index
index e2bca16e..73ca2953 100755
--- a/script/public-inbox-index
+++ b/script/public-inbox-index
@@ -1,4 +1,4 @@
-#!/usr/bin/perl -w
+#!perl -w
 # Copyright (C) 2015-2020 all contributors <meta@public-inbox.org>
 # License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
 # Basic tool to create a Xapian search index for a public-inbox.
@@ -6,22 +6,47 @@
 # highly recommended: eatmydata public-inbox-index INBOX_DIR
 
 use strict;
-use warnings;
+use v5.10.1;
 use Getopt::Long qw(:config gnu_getopt no_ignore_case auto_abbrev);
-my $usage = "public-inbox-index INBOX_DIR";
-use PublicInbox::Admin;
-PublicInbox::Admin::require_or_die('-index');
-use PublicInbox::Xapcmd;
+my $usage = 'public-inbox-index [options] INBOX_DIR';
+my $help = <<EOF; # the following should fit w/o scrolling in 80x24 term:
+usage: $usage
+
+  Create and update search indices
+
+options:
 
+  --no-fsync          speed up indexing, risk corruption on power outage
+  --indexlevel=LEVEL  `basic', `medium', or `full' (default: full)
+  --compact | -c      run public-inbox-compact(1) after indexing
+  --sequential-shard  index Xapian shards sequentially for slow storage
+  --jobs=NUM          set or disable parallelization (NUM=0)
+  --batch-size=BYTES  flush changes to OS after a given number of bytes
+  --max-size=BYTES    do not index messages larger than the given size
+  --reindex           index previously indexed data (if upgrading)
+  --rethread          regenerate thread IDs (if upgrading, use sparingly)
+  --prune             prune git storage on discontiguous history
+  --verbose | -v      increase verbosity (may be repeated)
+  --help | -?         show this help
+
+BYTES may use `k', `m', and `g' suffixes (e.g. `10m' for 10 megabytes)
+See public-inbox-index(1) man page for full documentation.
+EOF
 my $compact_opt;
 my $opt = { quiet => -1, compact => 0, maxsize => undef, fsync => 1 };
 GetOptions($opt, qw(verbose|v+ reindex rethread compact|c+ jobs|j=i prune
 		fsync|sync! xapianonly|xapian-only
 		indexlevel|L=s maxsize|max-size=s batchsize|batch-size=s
-		sequentialshard|seq-shard|sequential-shard))
+		sequentialshard|seq-shard|sequential-shard
+		help|?))
 	or die "bad command-line args\n$usage";
+if ($opt->{help}) { print $help; exit 0 };
 die "--jobs must be >= 0\n" if defined $opt->{jobs} && $opt->{jobs} < 0;
 
+# require lazily to speed up --help
+require PublicInbox::Admin;
+PublicInbox::Admin::require_or_die('-index');
+
 if ($opt->{compact}) {
 	require PublicInbox::Xapcmd;
 	PublicInbox::Xapcmd::check_compact();
@@ -31,7 +56,7 @@ if ($opt->{compact}) {
 	}
 }
 
-my $cfg = PublicInbox::Config->new;
+my $cfg = PublicInbox::Config->new; # Config is loaded by Admin
 my @ibxs = PublicInbox::Admin::resolve_inboxes(\@ARGV, undef, $cfg);
 PublicInbox::Admin::require_or_die('-index');
 unless (@ibxs) { print STDERR "Usage: $usage\n"; exit 1 }
@@ -47,7 +72,9 @@ if (defined $bs) {
 	PublicInbox::Admin::parse_unsigned(\$bs) or
 		die "`publicInbox.indexBatchSize=$bs' not parsed\n";
 }
+no warnings 'once';
 local $PublicInbox::SearchIdx::BATCH_BYTES = $bs if defined($bs);
+use warnings 'once';
 
 # out-of-the-box builds of Xapian 1.4.x are still limited to 32-bit
 # https://getting-started-with-xapian.readthedocs.io/en/latest/concepts/indexing/limitations.html
@@ -72,6 +99,7 @@ foreach my $ibx (@ibxs) {
 }
 
 PublicInbox::Admin::require_or_die(keys %$mods);
+require PublicInbox::InboxWritable;
 PublicInbox::Admin::progress_prepare($opt);
 for my $ibx (@ibxs) {
 	$ibx = PublicInbox::InboxWritable->new($ibx);

^ permalink raw reply related	[relevance 4%]

* [PATCH 5/7] index: v2: --sequential-shard option
  @ 2020-08-07  1:14  9% ` Eric Wong
  2020-08-07  1:14  3% ` [PATCH 7/7] index+xcpdb: rename `--no-sync' to `--no-fsync' Eric Wong
  1 sibling, 0 replies; 31+ results
From: Eric Wong @ 2020-08-07  1:14 UTC (permalink / raw)
  To: meta

This gives better page cache utilization for Xapian indexing on
slow storage by improving locality for random I/O activity on
the Xapian DB.

Instead of doing a single-pass to index both SQLite and Xapian;
this indexes them separately.  The first pass is identical to
indexlevel=basic: it indexes both over.sqlite3 and msgmap.sqlite3.

Subsequent passes only operate on a single Xapian shard for
documents belonging to that shard.  Given enough shards, each
individual shard can be made small enough to fit into the kernel
page cache and avoid HDD seeks for read activity.

Doing rough tests with a busy system with a 7200 RPM HDD with ext4,
full indexing of LKML (9 epochs) goes from ~80 hours (-j0) to
~30 hours (-j8) with 16GB RAM with 7 shards configured and fsync(2)
disabled (--no-sync) and `--batch-size=10m'.
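
The per-shard passes can be pictured with a small standalone sketch.
`shard_passes' is a hypothetical name; it only mimics the
`$num += $shards' stepping of index_seq_shard below and assumes
article numbers start at 1:

```perl
use strict;
use warnings;

# hypothetical sketch: list the article numbers visited in each pass,
# one pass per Xapian shard
sub shard_passes {
	my ($max, $shards) = @_;
	my @passes;
	for my $off (0 .. $shards - 1) {
		my @nums;
		# every number in this pass satisfies $num % $shards == $off
		for (my $num = $off; $num <= $max; $num += $shards) {
			push @nums, $num if $num > 0; # assumed: no article 0
		}
		push @passes, \@nums;
	}
	\@passes;
}
```

Each pass only touches documents belonging to a single shard, which
is what lets that shard stay resident in the page cache while it is
being written.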
---
 Documentation/public-inbox-config.pod    |  6 +++
 Documentation/public-inbox-index.pod     | 53 ++++++++++++++++++++++-
 Documentation/public-inbox-v2-format.pod | 11 +++--
 lib/PublicInbox/Config.pm                |  9 ++--
 lib/PublicInbox/V2Writable.pm            | 55 ++++++++++++++++++++++--
 lib/PublicInbox/WatchMaildir.pm          |  2 +-
 script/public-inbox-index                | 22 +++++++++-
 t/config.t                               |  6 +--
 t/v2mirror.t                             | 14 ++++++
 9 files changed, 161 insertions(+), 17 deletions(-)

diff --git a/Documentation/public-inbox-config.pod b/Documentation/public-inbox-config.pod
index e6108c35..05b84819 100644
--- a/Documentation/public-inbox-config.pod
+++ b/Documentation/public-inbox-config.pod
@@ -139,6 +139,10 @@ allow for searching for phrases using quoted text.
 
 Default: C<full>
 
+=item publicinbox.<name>.indexSequentialShard
+
+See L<public-inbox-index(1)/publicInbox.indexSequentialShard>
+
 =item publicinbox.<name>.httpbackendmax
 
 If a digit, the maximum number of parallel
@@ -291,6 +295,8 @@ or /usr/share/cgit/
 See L<public-inbox-edit(1)>
 
 =item publicinbox.indexMaxSize
+=item publicinbox.indexBatchSize
+=item publicinbox.indexSequentialShard
 
 See L<public-inbox-index(1)>
 
diff --git a/Documentation/public-inbox-index.pod b/Documentation/public-inbox-index.pod
index aeb1b3a3..f525ba54 100644
--- a/Documentation/public-inbox-index.pod
+++ b/Documentation/public-inbox-index.pod
@@ -34,12 +34,16 @@ normal search functionality.
 
 =item --jobs=JOBS, -j
 
-Control the number of Xapian indexing jobs in a
+Influences the number of Xapian indexing shards in a
 (L<public-inbox-v2-format(5)>) inbox.
 
 C<--jobs=0> is accepted as of public-inbox 1.6.0 (PENDING)
 to disable parallel indexing.
 
+If the inbox has not been indexed, C<JOBS - 1> shards
+will be created (one job is always needed for indexing
+the overview and article number mapping).
+
 Default: the number of existing Xapian shards
 
 =item --compact / -c
@@ -120,6 +124,14 @@ and Xapian.  This is only effective with Xapian 1.4+.
 
 Available in public-inbox 1.6.0 (PENDING).
 
+=item --sequential-shard
+
+Sets or overrides L</publicinbox.indexSequentialShard> on a
+per-invocation basis.  See L</publicinbox.indexSequentialShard>
+below.
+
+Available in public-inbox 1.6.0 (PENDING).
+
 =back
 
 =head1 FILES
@@ -167,6 +179,45 @@ inbox with 3 shards will flush every 3 megabytes by default.
 
 Default: 1m (one megabyte)
 
+=item publicinbox.indexBatchSize
+
+Flushes changes to the filesystem and releases locks after
+indexing the given number of bytes.  The default value of C<1m>
+(one megabyte) is low to minimize memory use and reduce
+contention with parallel invocations of L<public-inbox-mda(1)>,
+L<public-inbox-learn(1)>, and L<public-inbox-watch(1)>.
+
+Increase this value on powerful systems to improve throughput at
+the expense of memory use.  The reduction of lock granularity
+may not be noticeable on fast systems.
+
+This option is available in public-inbox 1.6 or later.
+public-inbox 1.5 and earlier used the current default, C<1m>.
+
+For L<public-inbox-v2-format(5)> inboxes, this value is
+multiplied by the number of Xapian shards.  Thus a typical v2
+inbox with 3 shards will flush every 3 megabytes by default.
+
+Default: 1m (one megabyte)
+
+=item publicinbox.indexSequentialShard
+=item publicinbox.<inbox_name>.indexSequentialShard
+
+For L<public-inbox-v2-format(5)> inboxes, setting this to C<true>
+allows indexing Xapian shards in multiple passes.  This speeds up
+indexing on rotational storage with high seek latency by allowing
+individual shards to fit into the kernel page cache.
+
+Using a higher-than-normal number of C<--jobs> with
+L<public-inbox-init(1)> may be required to ensure individual
+shards are small enough to fit into cache.
+
+Available in public-inbox 1.6.0 (PENDING).
+
+This is ignored on L<public-inbox-v1-format(5)> inboxes.
+
+Default: false, shards are indexed in parallel
+
 =back
 
 =head1 ENVIRONMENT
diff --git a/Documentation/public-inbox-v2-format.pod b/Documentation/public-inbox-v2-format.pod
index 9e284a75..6876989c 100644
--- a/Documentation/public-inbox-v2-format.pod
+++ b/Documentation/public-inbox-v2-format.pod
@@ -113,9 +113,14 @@ improved with high-quality and high-quantity solid-state storage.
 Issuing TRIM commands with L<fstrim(8)> was necessary to maintain
 consistent performance while developing this feature.
 
-Rotational storage devices are NOT recommended for indexing of
-large mail archives; but are fine for backup and usable for
-small instances.
+Rotational storage devices perform significantly worse than
+solid state storage for indexing of large mail archives; but are
+fine for backup and usable for small instances.
+
+As of public-inbox 1.6.0, the C<--sequential-shard> option of
+L<public-inbox-index(1)> may be used with a high shard count
+to ensure individual shards fit into page cache when the entire
+Xapian DB cannot.
 
 Our use of the L</OVERVIEW DB> requires Xapian document IDs to
 remain stable.  Using L<public-inbox-compact(1)> and
diff --git a/lib/PublicInbox/Config.pm b/lib/PublicInbox/Config.pm
index 67199bb3..f9184bd2 100644
--- a/lib/PublicInbox/Config.pm
+++ b/lib/PublicInbox/Config.pm
@@ -369,8 +369,8 @@ sub _fill_code_repo {
 	$git;
 }
 
-sub _git_config_bool ($) {
-	my ($val) = @_;
+sub git_bool {
+	my ($val) = $_[-1]; # $_[0] may be $self, or $val
 	if ($val =~ /\A(?:false|no|off|[\-\+]?(?:0x)?0+)\z/i) {
 		0;
 	} elsif ($val =~ /\A(?:true|yes|on|[\-\+]?(?:0x)?[0-9]+)\z/i) {
@@ -386,7 +386,8 @@ sub _fill {
 
 	foreach my $k (qw(inboxdir filter newsgroup
 			watch httpbackendmax
-			replyto feedmax nntpserver indexlevel)) {
+			replyto feedmax nntpserver
+			indexlevel indexsequentialshard)) {
 		my $v = $self->{"$pfx.$k"};
 		$ibx->{$k} = $v if defined $v;
 	}
@@ -400,7 +401,7 @@ sub _fill {
 	foreach my $k (qw(obfuscate)) {
 		my $v = $self->{"$pfx.$k"};
 		defined $v or next;
-		if (defined(my $bval = _git_config_bool($v))) {
+		if (defined(my $bval = git_bool($v))) {
 			$ibx->{$k} = $bval;
 		} else {
 			warn "Ignoring $pfx.$k=$v in config, not boolean\n";
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index f98afa61..7bc24592 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -875,7 +875,8 @@ sub reindex_checkpoint ($$) {
 
 	$self->{ibx}->git->cleanup; # *async_wait
 	${$sync->{need_checkpoint}} = 0;
-	$sync->{mm_tmp}->atfork_prepare;
+	my $mm_tmp = $sync->{mm_tmp};
+	$mm_tmp->atfork_prepare if $mm_tmp;
 	$self->done; # release lock
 
 	if (my $pr = $sync->{-opt}->{-progress}) {
@@ -884,7 +885,7 @@ sub reindex_checkpoint ($$) {
 
 	# allow -watch or -mda to write...
 	$self->idx_init; # reacquire lock
-	$sync->{mm_tmp}->atfork_parent;
+	$mm_tmp->atfork_parent if $mm_tmp;
 }
 
 sub index_oid { # cat_async callback
@@ -1085,7 +1086,10 @@ sub sync_prepare ($$$) {
 		}
 		$all->cat_async_wait;
 	}
-	return 0 if (!$regen_max && !keys(%{$self->{unindex_range}}));
+	if (!$regen_max && !keys(%{$self->{unindex_range}})) {
+		$sync->{-regen_fmt} = "%u/?\n";
+		return 0;
+	}
 
 	# reindex should NOT see new commits anymore, if we do,
 	# it's a problem and we need to notice it via die()
@@ -1177,6 +1181,36 @@ sub sync_ranges ($$$) {
 	$ranges;
 }
 
+sub index_xap_only { # git->cat_async callback
+	my ($bref, $oid, $type, $size, $smsg) = @_;
+	my $self = $smsg->{v2w};
+	my $idx = idx_shard($self, $smsg->{num} % $self->{shards});
+	$idx->begin_txn_lazy;
+	$idx->add_message(PublicInbox::Eml->new($bref), $smsg);
+	$self->{transact_bytes} += $size;
+}
+
+sub index_seq_shard ($$$) {
+	my ($self, $sync, $off) = @_;
+	my $ibx = $self->{ibx};
+	my $max = $ibx->mm->max or return;
+	my $all = $ibx->git;
+	my $over = $ibx->over;
+	my $batch_bytes = $PublicInbox::SearchIdx::BATCH_BYTES;
+	if (my $pr = $sync->{-opt}->{-progress}) {
+		$pr->("Xapian indexlevel=$ibx->{indexlevel} % $off\n");
+	}
+	for (my $num = $off; $num <= $max; $num += $self->{shards}) {
+		my $smsg = $over->get_art($num) or next;
+		$smsg->{v2w} = $self;
+		$all->cat_async($smsg->{blob}, \&index_xap_only, $smsg);
+		if ($self->{transact_bytes} >= $batch_bytes) {
+			${$sync->{nr}} = $num;
+			reindex_checkpoint($self, $sync);
+		}
+	}
+}
+
 sub index_epoch ($$$) {
 	my ($self, $sync, $i) = @_;
 
@@ -1218,6 +1252,11 @@ sub index_sync {
 	my $epoch_max;
 	my $latest = git_dir_latest($self, \$epoch_max);
 	return unless defined $latest;
+
+	my $seq = $opt->{sequentialshard};
+	my $idxlevel = $self->{ibx}->{indexlevel};
+	local $self->{ibx}->{indexlevel} = 'basic' if $seq;
+
 	$self->idx_init($opt); # acquire lock
 	fill_alternates($self, $epoch_max);
 	$self->{over}->rethread_prepare($opt);
@@ -1252,6 +1291,16 @@ sub index_sync {
 		$pr->('all.git '.sprintf($sync->{-regen_fmt}, $$nr)) if $pr;
 	}
 
+	if ($seq) { # deal with Xapian shards sequentially
+		my $end = $self->{shards} - 1;
+		$self->{ibx}->{indexlevel} = $idxlevel;
+		delete $sync->{mm_tmp};
+		$self->idx_init($opt); # re-acquire lock
+		index_seq_shard($self, $sync, $_) for (0..$end);
+		$self->{ibx}->git->cat_async_wait;
+		$self->done;
+	}
+
 	# reindex does not pick up new changes, so we rerun w/o it:
 	if ($opt->{reindex}) {
 		my %again = %$opt;
diff --git a/lib/PublicInbox/WatchMaildir.pm b/lib/PublicInbox/WatchMaildir.pm
index 142118bd..2ba10a9e 100644
--- a/lib/PublicInbox/WatchMaildir.pm
+++ b/lib/PublicInbox/WatchMaildir.pm
@@ -285,7 +285,7 @@ sub cfg_intvl ($$$) {
 sub cfg_bool ($$$) {
 	my ($cfg, $key, $url) = @_;
 	my $orig = $cfg->urlmatch($key, $url) // return;
-	my $bool = PublicInbox::Config::_git_config_bool($orig);
+	my $bool = $cfg->git_bool($orig);
 	warn "W: $key=$orig for $url is not boolean\n" unless defined($bool);
 	$bool;
 }
diff --git a/script/public-inbox-index b/script/public-inbox-index
index 5a0ceab7..be518134 100755
--- a/script/public-inbox-index
+++ b/script/public-inbox-index
@@ -16,7 +16,8 @@ use PublicInbox::Xapcmd;
 my $compact_opt;
 my $opt = { quiet => -1, compact => 0, maxsize => undef, sync => 1 };
 GetOptions($opt, qw(verbose|v+ reindex rethread compact|c+ jobs|j=i prune sync!
-		indexlevel|L=s maxsize|max-size=s batchsize|batch-size=s))
+		indexlevel|L=s maxsize|max-size=s batchsize|batch-size=s
+		sequentialshard|seq-shard|sequential-shard))
 	or die "bad command-line args\n$usage";
 die "--jobs must be >= 0\n" if defined $opt->{jobs} && $opt->{jobs} < 0;
 
@@ -46,6 +47,15 @@ if (my $bs = $opt->{batchsize} // $cfg->{lc('publicInbox.indexBatchSize')}) {
 	$PublicInbox::SearchIdx::BATCH_BYTES = $bs;
 }
 
+my $s = $opt->{sequentialshard} //
+			$cfg->{lc('publicInbox.indexSequentialShard')};
+if (defined $s) {
+	my $v = $cfg->git_bool($s);
+	defined($v) or
+		die "`publicInbox.indexSequentialShard=$s' not boolean\n";
+	$opt->{sequentialshard} = $v;
+}
+
 my $mods = {};
 foreach my $ibx (@ibxs) {
 	# XXX: users can shoot themselves in the foot, with opt->{indexlevel}
@@ -63,6 +73,14 @@ for my $ibx (@ibxs) {
 		PublicInbox::Xapcmd::run($ibx, 'compact', $compact_opt);
 	}
 	$ibx->{-no_sync} = 1 if !$opt->{sync};
-	PublicInbox::Admin::index_inbox($ibx, undef, $opt);
+
+	my $ibx_opt = $opt;
+	if (defined(my $s = $ibx->{indexsequentialshard})) {
+		defined(my $v = $cfg->git_bool($s)) or die <<EOL;
+publicInbox.$ibx->{name}.indexSequentialShard not boolean
+EOL
+		$ibx_opt = { %$opt, sequentialshard => $v };
+	}
+	PublicInbox::Admin::index_inbox($ibx, undef, $ibx_opt);
 	PublicInbox::Xapcmd::run($ibx, 'compact', $compact_opt) if $compact_opt;
 }
diff --git a/t/config.t b/t/config.t
index d7fd9446..ee51c6cc 100644
--- a/t/config.t
+++ b/t/config.t
@@ -220,18 +220,18 @@ EOF
 
 {
 	for my $t (qw(TRUE true yes on 1 +1 -1 13 0x1 0x12 0X5)) {
-		is(PublicInbox::Config::_git_config_bool($t), 1, "$t is true");
+		is(PublicInbox::Config::git_bool($t), 1, "$t is true");
 		is(xqx([qw(git -c), "test.val=$t",
 			qw(config --bool test.val)]),
 			"true\n", "$t matches git-config behavior");
 	}
 	for my $f (qw(FALSE false no off 0 +0 +000 00 0x00 0X0)) {
-		is(PublicInbox::Config::_git_config_bool($f), 0, "$f is false");
+		is(PublicInbox::Config::git_bool($f), 0, "$f is false");
 		is(xqx([qw(git -c), "test.val=$f",
 			qw(config --bool test.val)]),
 			"false\n", "$f matches git-config behavior");
 	}
-	is(PublicInbox::Config::_git_config_bool('bogus'), undef,
+	is(PublicInbox::Config::git_bool('bogus'), undef,
 		'bogus is undef');
 }
 
diff --git a/t/v2mirror.t b/t/v2mirror.t
index b24528fe..a4ac682d 100644
--- a/t/v2mirror.t
+++ b/t/v2mirror.t
@@ -4,6 +4,7 @@ use strict;
 use warnings;
 use Test::More;
 use PublicInbox::TestCommon;
+use File::Path qw(remove_tree);
 use Cwd qw(abs_path);
 require_git(2.6);
 local $ENV{HOME} = abs_path('t');
@@ -189,6 +190,19 @@ is($mibx->git->check($to_purge), undef, 'unindex+prune successful in mirror');
 	is(scalar($mset->items), 0, '1@example.com no longer visible in mirror');
 }
 
+if ('sequential-shard') {
+	$mset = $mibx->search->query('m:15@example.com', {mset => 1});
+	is(scalar($mset->items), 1, 'large message not indexed');
+	remove_tree(glob("$tmpdir/m/xap*"), glob("$tmpdir/m/msgmap.*"));
+	my $cmd = [ qw(-index -j9 --sequential-shard), "$tmpdir/m" ];
+	ok(run_script($cmd), '--sequential-shard works');
+	my @shards = glob("$tmpdir/m/xap*/?");
+	is(scalar(@shards), 8, 'got expected shard count');
+	PublicInbox::InboxWritable::cleanup($mibx);
+	$mset = $mibx->search->query('m:15@example.com', {mset => 1});
+	is(scalar($mset->items), 1, 'search works after --sequential-shard');
+}
+
 if ('max size') {
 	$mime->header_set('Message-ID', '<2big@a>');
 	my $max = '2k';

^ permalink raw reply related	[relevance 9%]

* [PATCH 7/7] index+xcpdb: rename `--no-sync' to `--no-fsync'
    2020-08-07  1:14  9% ` [PATCH 5/7] index: v2: --sequential-shard option Eric Wong
@ 2020-08-07  1:14  3% ` Eric Wong
  1 sibling, 0 replies; 31+ results
From: Eric Wong @ 2020-08-07  1:14 UTC (permalink / raw)
  To: meta

We'll continue supporting `--no-sync' even though it has yet to
make it into a release, but the term `sync' is overloaded in our
codebase, which may be confusing to new hackers and users.

Neither our code nor our dependencies issue the sync(2) syscall,
either, only fsync(2) and fdatasync(2).
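
A minimal sketch of how the old spelling can keep working, assuming
Getopt::Long's negatable alias syntax (`fsync|sync!', the same spec
this patch adds to script/public-inbox-index).  parse_fsync_opt is a
hypothetical helper for illustration only and is not part of
public-inbox:

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# hypothetical helper, not part of public-inbox
sub parse_fsync_opt {
	my (@argv) = @_;
	my $opt = { fsync => 1 }; # fsync stays enabled unless negated
	# "fsync|sync!" makes `sync' an alias of `fsync' and allows the
	# negated forms of both names to clear the same {fsync} key
	GetOptionsFromArray(\@argv, $opt, 'fsync|sync!')
		or die "bad command-line args\n";
	$opt->{fsync};
}

for my $arg (qw(--no-fsync --no-sync --fsync)) {
	print "$arg => ", (parse_fsync_opt($arg) ? 'on' : 'off'), "\n";
}
```

Both negated spellings land in the same hash key, so the rest of the
code only ever needs to check {fsync}.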
---
 Documentation/public-inbox-index.pod | 2 +-
 Documentation/public-inbox-xcpdb.pod | 2 +-
 lib/PublicInbox/OverIdx.pm           | 2 +-
 lib/PublicInbox/SearchIdx.pm         | 6 +++---
 lib/PublicInbox/V2Writable.pm        | 4 ++--
 lib/PublicInbox/Xapcmd.pm            | 2 +-
 script/public-inbox-index            | 8 ++++----
 7 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/Documentation/public-inbox-index.pod b/Documentation/public-inbox-index.pod
index f525ba54..a4edc57a 100644
--- a/Documentation/public-inbox-index.pod
+++ b/Documentation/public-inbox-index.pod
@@ -117,7 +117,7 @@ below.
 
 Available in public-inbox 1.6.0 (PENDING).
 
-=item --no-sync
+=item --no-fsync
 
 Disables L<fsync(2)> and L<fdatasync(2)> operations on SQLite
 and Xapian.  This is only effective with Xapian 1.4+.
diff --git a/Documentation/public-inbox-xcpdb.pod b/Documentation/public-inbox-xcpdb.pod
index 7fe1e5fe..89eed079 100644
--- a/Documentation/public-inbox-xcpdb.pod
+++ b/Documentation/public-inbox-xcpdb.pod
@@ -45,7 +45,7 @@ too many shards given the capabilities of the current hardware.
 These options are passed directly to L<xapian-compact(1)> when
 used with C<--compact>.
 
-=item --no-sync
+=item --no-fsync
 
 Disable L<fsync(2)> and L<fdatasync(2)>.
 
diff --git a/lib/PublicInbox/OverIdx.pm b/lib/PublicInbox/OverIdx.pm
index c8f61e01..4543bfa1 100644
--- a/lib/PublicInbox/OverIdx.pm
+++ b/lib/PublicInbox/OverIdx.pm
@@ -21,7 +21,7 @@ use Carp qw(croak);
 
 sub dbh_new {
 	my ($self) = @_;
-	my $dbh = $self->SUPER::dbh_new($self->{-no_sync} ? 2 : 1);
+	my $dbh = $self->SUPER::dbh_new($self->{-no_fsync} ? 2 : 1);
 
 	# TRUNCATE reduces I/O compared to the default (DELETE)
 	# We do not use WAL since we're optimized for read-only ops,
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index a1baa65b..22489731 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -67,7 +67,7 @@ sub new {
 		$self->{lock_path} = "$inboxdir/ssoma.lock";
 		my $dir = $self->xdir;
 		$self->{over} = PublicInbox::OverIdx->new("$dir/over.sqlite3");
-		$self->{over}->{-no_sync} = 1 if $ibx->{-no_sync};
+		$self->{over}->{-no_fsync} = 1 if $ibx->{-no_fsync};
 		$self->{index_max_size} = $ibx->{index_max_size};
 	} elsif ($version == 2) {
 		defined $shard or die "shard is required for v2\n";
@@ -138,7 +138,7 @@ sub idx_acquire {
 		}
 	}
 	return unless defined $flag;
-	$flag |= $DB_NO_SYNC if $self->{ibx}->{-no_sync};
+	$flag |= $DB_NO_SYNC if $self->{ibx}->{-no_fsync};
 	my $xdb = eval { ($X->{WritableDatabase})->new($dir, $flag) };
 	if ($@) {
 		die "Failed opening $dir: ", $@;
@@ -389,7 +389,7 @@ sub _msgmap_init ($) {
 	die "BUG: _msgmap_init is only for v1\n" if $self->{ibx_ver} != 1;
 	$self->{mm} //= eval {
 		require PublicInbox::Msgmap;
-		my $rw = $self->{ibx}->{-no_sync} ? 2 : 1;
+		my $rw = $self->{ibx}->{-no_fsync} ? 2 : 1;
 		PublicInbox::Msgmap->new($self->{ibx}->{inboxdir}, $rw);
 	};
 }
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 6b1effe5..a029fe4c 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -122,7 +122,7 @@ sub new {
 		rotate_bytes => int((1024 * 1024 * 1024) / $PACKING_FACTOR),
 		last_commit => [], # git epoch -> commit
 	};
-	$self->{over}->{-no_sync} = 1 if $v2ibx->{-no_sync};
+	$self->{over}->{-no_fsync} = 1 if $v2ibx->{-no_fsync};
 	$self->{shards} = count_shards($self) || nproc_shards($creat);
 	bless $self, $class;
 }
@@ -292,7 +292,7 @@ sub _idx_init { # with_umask callback
 	# for SQLite:
 	my $mm = $self->{mm} = PublicInbox::Msgmap->new_file(
 				"$self->{ibx}->{inboxdir}/msgmap.sqlite3",
-				$self->{ibx}->{-no_sync} ? 2 : 1);
+				$self->{ibx}->{-no_fsync} ? 2 : 1);
 	$mm->{dbh}->begin_work;
 }
 
diff --git a/lib/PublicInbox/Xapcmd.pm b/lib/PublicInbox/Xapcmd.pm
index 97a51d1b..8423194f 100644
--- a/lib/PublicInbox/Xapcmd.pm
+++ b/lib/PublicInbox/Xapcmd.pm
@@ -418,7 +418,7 @@ sub cpdb ($$) {
 	my $flag = eval($PublicInbox::Search::Xap.'::DB_CREATE()');
 	die if $@;
 	my $XapianWritableDatabase = $PublicInbox::Search::X{WritableDatabase};
-	$flag |= $PublicInbox::SearchIdx::DB_NO_SYNC if !$opt->{sync};
+	$flag |= $PublicInbox::SearchIdx::DB_NO_SYNC if !$opt->{fsync};
 	my $dst = $XapianWritableDatabase->new($tmp, $flag);
 	my $pr = $opt->{-progress};
 	my $pfx = $opt->{-progress_pfx} = progress_pfx($new);
diff --git a/script/public-inbox-index b/script/public-inbox-index
index a52fb1bf..dc9bdde1 100755
--- a/script/public-inbox-index
+++ b/script/public-inbox-index
@@ -14,9 +14,9 @@ PublicInbox::Admin::require_or_die('-index');
 use PublicInbox::Xapcmd;
 
 my $compact_opt;
-my $opt = { quiet => -1, compact => 0, maxsize => undef, sync => 1 };
-GetOptions($opt, qw(verbose|v+ reindex rethread compact|c+ jobs|j=i prune sync!
-		xapianonly|xapian-only
+my $opt = { quiet => -1, compact => 0, maxsize => undef, fsync => 1 };
+GetOptions($opt, qw(verbose|v+ reindex rethread compact|c+ jobs|j=i prune
+		fsync|sync! xapianonly|xapian-only
 		indexlevel|L=s maxsize|max-size=s batchsize|batch-size=s
 		sequentialshard|seq-shard|sequential-shard))
 	or die "bad command-line args\n$usage";
@@ -73,7 +73,7 @@ for my $ibx (@ibxs) {
 	if ($opt->{compact} >= 2) {
 		PublicInbox::Xapcmd::run($ibx, 'compact', $compact_opt);
 	}
-	$ibx->{-no_sync} = 1 if !$opt->{sync};
+	$ibx->{-no_fsync} = 1 if !$opt->{fsync};
 
 	my $ibx_opt = $opt;
 	if (defined(my $s = $ibx->{indexsequentialshard})) {

^ permalink raw reply related	[relevance 3%]

* [PATCH 16/20] index+xcpdb: support --no-sync flag
    2020-07-24  5:55  3% ` [PATCH 01/20] index: support --rethread switch to fix old indices Eric Wong
@ 2020-07-24  5:56  7% ` Eric Wong
  1 sibling, 0 replies; 31+ results
From: Eric Wong @ 2020-07-24  5:56 UTC (permalink / raw)
  To: meta

This allows us to speed up indexing operations to SQLite
and Xapian.

Unfortunately, it doesn't affect operations using
`xapian-compact' and the compactor API, since that doesn't seem
to support Xapian::DB_NO_SYNC, yet.
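
On the Xapian side, the flag is only expected to take effect with
Xapian 1.4+.  A rough sketch of that version gate, modeled on the
check this patch adds to load_xapian_writable; db_no_sync_flag is a
hypothetical name, and the major/minor numbers are passed in by hand
here rather than read from the Xapian bindings:

```perl
use strict;
use warnings;

# hypothetical helper: returns the bit to OR into the open flags,
# or 0 when the Xapian version is too old to honor it
sub db_no_sync_flag {
	my ($major, $minor) = @_;
	# pack major.minor into one integer so versions compare numerically
	my $ver = ($major << 16) | ($minor << 8);
	# 0x4 is the value this patch assigns to $DB_NO_SYNC
	$ver >= 0x10400 ? 0x4 : 0;
}

printf("Xapian 1.2 => 0x%x\n", db_no_sync_flag(1, 2));
printf("Xapian 1.4 => 0x%x\n", db_no_sync_flag(1, 4));
```

Returning 0 on older versions means callers can OR the result in
unconditionally without needing a separate version check.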
---
 Documentation/public-inbox-index.pod |  7 +++++++
 Documentation/public-inbox-xcpdb.pod |  6 ++++++
 lib/PublicInbox/Msgmap.pm            | 21 ++++++++++++---------
 lib/PublicInbox/Over.pm              |  1 +
 lib/PublicInbox/OverIdx.pm           |  2 +-
 lib/PublicInbox/SearchIdx.pm         |  9 ++++++++-
 lib/PublicInbox/V2Writable.pm        |  6 ++++--
 lib/PublicInbox/Xapcmd.pm            |  5 +++--
 script/public-inbox-index            |  5 +++--
 script/public-inbox-xcpdb            |  4 ++--
 10 files changed, 47 insertions(+), 19 deletions(-)

diff --git a/Documentation/public-inbox-index.pod b/Documentation/public-inbox-index.pod
index 08f2fbf45..aeb1b3a39 100644
--- a/Documentation/public-inbox-index.pod
+++ b/Documentation/public-inbox-index.pod
@@ -113,6 +113,13 @@ below.
 
 Available in public-inbox 1.6.0 (PENDING).
 
+=item --no-sync
+
+Disables L<fsync(2)> and L<fdatasync(2)> operations on SQLite
+and Xapian.  This is only effective with Xapian 1.4+.
+
+Available in public-inbox 1.6.0 (PENDING).
+
 =back
 
 =head1 FILES
diff --git a/Documentation/public-inbox-xcpdb.pod b/Documentation/public-inbox-xcpdb.pod
index 149c8f78c..7fe1e5fe2 100644
--- a/Documentation/public-inbox-xcpdb.pod
+++ b/Documentation/public-inbox-xcpdb.pod
@@ -45,6 +45,12 @@ too many shards given the capabilities of the current hardware.
 These options are passed directly to L<xapian-compact(1)> when
 used with C<--compact>.
 
+=item --no-sync
+
+Disable L<fsync(2)> and L<fdatasync(2)>.
+
+Available in public-inbox 1.6.0 (PENDING).
+
 =back
 
 =head1 ENVIRONMENT
diff --git a/lib/PublicInbox/Msgmap.pm b/lib/PublicInbox/Msgmap.pm
index 9d2ef0dc5..839ddf7ca 100644
--- a/lib/PublicInbox/Msgmap.pm
+++ b/lib/PublicInbox/Msgmap.pm
@@ -32,12 +32,11 @@ sub new_file {
 	my $self = bless { filename => $f }, $class;
 	my $dbh = $self->{dbh} = PublicInbox::Over::dbh_new($self, $rw);
 	if ($rw) {
-		create_tables($dbh);
-
 		# TRUNCATE reduces I/O compared to the default (DELETE)
 		$dbh->do('PRAGMA journal_mode = TRUNCATE');
 
 		$dbh->begin_work;
+		create_tables($dbh);
 		$self->created_at(time) unless $self->created_at;
 
 		my $max = $self->max // 0;
@@ -51,12 +50,17 @@ sub new_file {
 sub tmp_clone {
 	my ($self) = @_;
 	my ($fh, $fn) = tempfile('msgmap-XXXXXXXX', EXLOCK => 0, TMPDIR => 1);
-	$self->{dbh}->sqlite_backup_to_file($fn);
-	my $tmp = ref($self)->new_file($fn, 1);
-	$tmp->{dbh}->do('PRAGMA synchronous = OFF');
-	$tmp->{dbh}->do('PRAGMA journal_mode = MEMORY');
+	my $tmp;
+	if ($self->{dbh}->can('sqlite_backup_to_dbh')) {
+		$tmp = ref($self)->new_file($fn, 2);
+		$tmp->{dbh}->do('PRAGMA journal_mode = MEMORY');
+		$self->{dbh}->sqlite_backup_to_dbh($tmp->{dbh});
+	} else { # DBD::SQLite <= 1.61_01
+		$self->{dbh}->sqlite_backup_to_file($fn);
+		$tmp = ref($self)->new_file($fn, 2);
+		$tmp->{dbh}->do('PRAGMA journal_mode = MEMORY');
+	}
 	$tmp->{pid} = $$;
-	close $fh or die "failed to close $fn: $!";
 	$tmp;
 }
 
@@ -241,8 +245,7 @@ sub atfork_parent {
 	$self->{pid} or die 'BUG: not a temporary clone';
 	$self->{dbh} and die 'BUG: tmp_clone dbh not prepared for parent';
 	defined($self->{filename}) or die 'BUG: {filename} not defined';
-	my $dbh = $self->{dbh} = PublicInbox::Over::dbh_new($self, 1);
-	$dbh->do('PRAGMA synchronous = OFF');
+	$self->{dbh} = PublicInbox::Over::dbh_new($self, 2);
 }
 
 sub atfork_prepare {
diff --git a/lib/PublicInbox/Over.pm b/lib/PublicInbox/Over.pm
index e3f264564..f32743c05 100644
--- a/lib/PublicInbox/Over.pm
+++ b/lib/PublicInbox/Over.pm
@@ -40,6 +40,7 @@ sub dbh_new {
 		$st = pack('dd', $st[0], $st[1]);
 	} while ($st ne $self->{st} && $tries++ < 3);
 	warn "W: $f: .st_dev, .st_ino unstable\n" if $st ne $self->{st};
+	$dbh->do('PRAGMA synchronous = OFF') if ($rw // 0) > 1;
 	$dbh;
 }
 
diff --git a/lib/PublicInbox/OverIdx.pm b/lib/PublicInbox/OverIdx.pm
index c57be7243..fcb450794 100644
--- a/lib/PublicInbox/OverIdx.pm
+++ b/lib/PublicInbox/OverIdx.pm
@@ -21,7 +21,7 @@ use Carp qw(croak);
 
 sub dbh_new {
 	my ($self) = @_;
-	my $dbh = $self->SUPER::dbh_new(1);
+	my $dbh = $self->SUPER::dbh_new($self->{-no_sync} ? 2 : 1);
 
 	# TRUNCATE reduces I/O compared to the default (DELETE)
 	# We do not use WAL since we're optimized for read-only ops,
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index c57a7e164..764257432 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -23,6 +23,7 @@ use PublicInbox::Git qw(git_unquote);
 use PublicInbox::MsgTime qw(msg_timestamp msg_datestamp);
 my $X = \%PublicInbox::Search::X;
 my ($DB_CREATE_OR_OPEN, $DB_OPEN);
+our $DB_NO_SYNC = 0;
 our $BATCH_BYTES = defined($ENV{XAPIAN_FLUSH_THRESHOLD}) ?
 			0x7fffffff : 1_000_000;
 use constant DEBUG => !!$ENV{DEBUG};
@@ -67,6 +68,7 @@ sub new {
 		$self->{lock_path} = "$inboxdir/ssoma.lock";
 		my $dir = $self->xdir;
 		$self->{over} = PublicInbox::OverIdx->new("$dir/over.sqlite3");
+		$self->{over}->{-no_sync} = 1 if $ibx->{-no_sync};
 		$self->{index_max_size} = $ibx->{index_max_size};
 	} elsif ($version == 2) {
 		defined $shard or die "shard is required for v2\n";
@@ -103,6 +105,9 @@ sub load_xapian_writable () {
 	*sortable_serialise = $xap.'::sortable_serialise';
 	$DB_CREATE_OR_OPEN = eval($xap.'::DB_CREATE_OR_OPEN()');
 	$DB_OPEN = eval($xap.'::DB_OPEN()');
+	my $ver = (eval($xap.'::major_version()') << 16) |
+		(eval($xap.'::minor_version()') << 8);
+	$DB_NO_SYNC = 0x4 if $ver >= 0x10400;
 	1;
 }
 
@@ -126,6 +131,7 @@ sub idx_acquire {
 		}
 	}
 	return unless defined $flag;
+	$flag |= $DB_NO_SYNC if $self->{ibx}->{-no_sync};
 	my $xdb = eval { ($X->{WritableDatabase})->new($dir, $flag) };
 	if ($@) {
 		die "Failed opening $dir: ", $@;
@@ -377,7 +383,8 @@ sub _msgmap_init ($) {
 	die "BUG: _msgmap_init is only for v1\n" if $self->{ibx_ver} != 1;
 	$self->{mm} //= eval {
 		require PublicInbox::Msgmap;
-		PublicInbox::Msgmap->new($self->{ibx}->{inboxdir}, 1);
+		my $rw = $self->{ibx}->{-no_sync} ? 2 : 1;
+		PublicInbox::Msgmap->new($self->{ibx}->{inboxdir}, $rw);
 	};
 }
 
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 13c1ad6f8..3dc200956 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -116,12 +116,13 @@ sub new {
 		total_bytes => 0,
 		current_info => '',
 		xpfx => $xpfx,
-		over => PublicInbox::OverIdx->new("$xpfx/over.sqlite3", 1),
+		over => PublicInbox::OverIdx->new("$xpfx/over.sqlite3"),
 		lock_path => "$dir/inbox.lock",
 		# limit each git repo (epoch) to 1GB or so
 		rotate_bytes => int((1024 * 1024 * 1024) / $PACKING_FACTOR),
 		last_commit => [], # git epoch -> commit
 	};
+	$self->{over}->{-no_sync} = 1 if $v2ibx->{-no_sync};
 	$self->{shards} = count_shards($self) || nproc_shards($creat);
 	$self->{index_max_size} = $v2ibx->{index_max_size};
 	bless $self, $class;
@@ -293,7 +294,8 @@ sub _idx_init { # with_umask callback
 	# Now that all subprocesses are up, we can open the FDs
 	# for SQLite:
 	my $mm = $self->{mm} = PublicInbox::Msgmap->new_file(
-		"$self->{ibx}->{inboxdir}/msgmap.sqlite3", 1);
+				"$self->{ibx}->{inboxdir}/msgmap.sqlite3",
+				$self->{ibx}->{-no_sync} ? 2 : 1);
 	$mm->{dbh}->begin_work;
 }
 
diff --git a/lib/PublicInbox/Xapcmd.pm b/lib/PublicInbox/Xapcmd.pm
index 4ee3fc791..d6c069d75 100644
--- a/lib/PublicInbox/Xapcmd.pm
+++ b/lib/PublicInbox/Xapcmd.pm
@@ -412,10 +412,11 @@ sub cpdb ($$) {
 
 	# like copydatabase(1), be sure we don't overwrite anything in case
 	# of other bugs:
-	my $creat = eval($PublicInbox::Search::Xap.'::DB_CREATE()');
+	my $flag = eval($PublicInbox::Search::Xap.'::DB_CREATE()');
 	die if $@;
 	my $XapianWritableDatabase = $PublicInbox::Search::X{WritableDatabase};
-	my $dst = $XapianWritableDatabase->new($tmp, $creat);
+	$flag |= $PublicInbox::SearchIdx::DB_NO_SYNC if !$opt->{sync};
+	my $dst = $XapianWritableDatabase->new($tmp, $flag);
 	my $pr = $opt->{-progress};
 	my $pfx = $opt->{-progress_pfx} = progress_pfx($new);
 	my $pr_data = { pr => $pr, pfx => $pfx, nr => 0 } if $pr;
diff --git a/script/public-inbox-index b/script/public-inbox-index
index 2e1934b08..d5c7cae2b 100755
--- a/script/public-inbox-index
+++ b/script/public-inbox-index
@@ -14,8 +14,8 @@ PublicInbox::Admin::require_or_die('-index');
 use PublicInbox::Xapcmd;
 
 my $compact_opt;
-my $opt = { quiet => -1, compact => 0, maxsize => undef };
-GetOptions($opt, qw(verbose|v+ reindex rethread compact|c+ jobs|j=i prune
+my $opt = { quiet => -1, compact => 0, maxsize => undef, sync => 1 };
+GetOptions($opt, qw(verbose|v+ reindex rethread compact|c+ jobs|j=i prune sync!
 		indexlevel|L=s maxsize|max-size=s batchsize|batch-size=s))
 	or die "bad command-line args\n$usage";
 die "--jobs must be >= 0\n" if defined $opt->{jobs} && $opt->{jobs} < 0;
@@ -59,6 +59,7 @@ for my $ibx (@ibxs) {
 	if ($opt->{compact} >= 2) {
 		PublicInbox::Xapcmd::run($ibx, 'compact', $compact_opt);
 	}
+	$ibx->{-no_sync} = 1 if !$opt->{sync};
 	PublicInbox::Admin::index_inbox($ibx, undef, $opt);
 	PublicInbox::Xapcmd::run($ibx, 'compact', $compact_opt) if $compact_opt;
 }
diff --git a/script/public-inbox-xcpdb b/script/public-inbox-xcpdb
index 2b9f032c5..fcd961488 100755
--- a/script/public-inbox-xcpdb
+++ b/script/public-inbox-xcpdb
@@ -8,8 +8,8 @@ use PublicInbox::Xapcmd;
 use PublicInbox::Admin;
 PublicInbox::Admin::require_or_die('-search');
 my $usage = "Usage: public-inbox-xcpdb [--compact] INBOX_DIR\n";
-my $opt = {};
-my @opt = (qw(compact reshard|R=i), @PublicInbox::Xapcmd::COMPACT_OPT);
+my $opt = { sync => 1 };
+my @opt = (qw(sync! compact reshard|R=i), @PublicInbox::Xapcmd::COMPACT_OPT);
 GetOptions($opt, @opt) or die "bad command-line args\n$usage";
 my @ibxs = PublicInbox::Admin::resolve_inboxes(\@ARGV) or die $usage;
 foreach (@ibxs) {

^ permalink raw reply related	[relevance 7%]

* [PATCH 01/20] index: support --rethread switch to fix old indices
  @ 2020-07-24  5:55  3% ` Eric Wong
  2020-07-24  5:56  7% ` [PATCH 16/20] index+xcpdb: support --no-sync flag Eric Wong
  1 sibling, 0 replies; 31+ results
From: Eric Wong @ 2020-07-24  5:55 UTC (permalink / raw)
  To: meta

Versions of public-inbox before 1.3.0 had subtly
different semantics around threading in some corner
cases.  This switch (when combined with --reindex)
allows us to fix them by regenerating associations.
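
A toy model of the idea, not the actual OverIdx.pm logic: remember
the highest THREADID handed out before rethreading, treat every
THREADID at or below it as stale, and assign fresh ones as messages
are revisited.  toy_rethread and its data layout are hypothetical,
and it assumes each message is seen after the messages it references:

```perl
use strict;
use warnings;

# Toy model only: each message is { mid => ..., refs => [...], tid => ... }
sub toy_rethread {
	my ($msgs, $counter) = @_; # $counter: ref to last THREADID handed out
	my $min_tid = $$counter;   # every THREADID <= this is considered stale
	my %tid_for;               # Message-ID => freshly assigned THREADID
	for my $m (@$msgs) {
		# reuse the fresh THREADID of the first already-seen reference
		my ($tid) = grep { defined } map { $tid_for{$_} } @{$m->{refs}};
		$tid //= ++$$counter; # no known parent: start a new thread
		$tid_for{$m->{mid}} = $m->{tid} = $tid;
	}
	$min_tid;
}

# "messed up" threading: every message shares one stale THREADID
my @msgs = (
	{ mid => 'a@example.com', refs => [], tid => 7 },
	{ mid => 'b@example.com', refs => [ 'a@example.com' ], tid => 7 },
	{ mid => 'c@example.com', refs => [], tid => 7 },
);
my $counter = 7;
my $min = toy_rethread(\@msgs, \$counter);
print "$_->{mid} THREADID=$_->{tid} (min was $min)\n" for @msgs;
```

The real code also has to deal with ghosts and merging threads,
which this sketch ignores entirely.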
---
 Documentation/public-inbox-index.pod | 23 +++++++--
 lib/PublicInbox/OverIdx.pm           | 76 ++++++++++++++++++++++++++--
 lib/PublicInbox/SearchIdx.pm         |  7 ++-
 lib/PublicInbox/V2Writable.pm        |  4 +-
 script/public-inbox-index            |  2 +-
 t/v1reindex.t                        | 34 +++++++++++++
 t/v2reindex.t                        | 45 ++++++++++++++++
 7 files changed, 177 insertions(+), 14 deletions(-)

diff --git a/Documentation/public-inbox-index.pod b/Documentation/public-inbox-index.pod
index ff2e54867..08f2fbf45 100644
--- a/Documentation/public-inbox-index.pod
+++ b/Documentation/public-inbox-index.pod
@@ -68,12 +68,25 @@ Xapian database.  Using this with C<--compact> or running
 L<public-inbox-compact(1)> afterwards is recommended to
 release free space.
 
-public-inbox protects writes to various indices with L<flock(2)>,
-so it is safe to reindex while L<public-inbox-watch(1)>,
-L<public-inbox-mda(1)> or L<public-inbox-learn(1)> run.
+public-inbox protects writes to various indices with
+L<flock(2)>, so it is safe to reindex (and rethread) while
+L<public-inbox-watch(1)>, L<public-inbox-mda(1)> or
+L<public-inbox-learn(1)> run.
 
-This does not touch the NNTP article number database or
-affect threading.
+This does not touch the NNTP article number database.
+It does not affect threading unless C<--rethread> is
+used.
+
+=item --rethread
+
+Regenerate internal THREADID and message thread associations
+when reindexing.
+
+This fixes some bugs in older versions of public-inbox.  While
+it is possible to use this without C<--reindex>, it makes little
+sense to do so.
+
+Available in public-inbox 1.6.0 (PENDING).
 
 =item --prune
 
diff --git a/lib/PublicInbox/OverIdx.pm b/lib/PublicInbox/OverIdx.pm
index 5601e602c..c57be7243 100644
--- a/lib/PublicInbox/OverIdx.pm
+++ b/lib/PublicInbox/OverIdx.pm
@@ -17,6 +17,7 @@ use PublicInbox::MID qw/id_compress mids_for_index references/;
 use PublicInbox::Smsg qw(subject_normalized);
 use Compress::Zlib qw(compress);
 use PublicInbox::Search;
+use Carp qw(croak);
 
 sub dbh_new {
 	my ($self) = @_;
@@ -37,6 +38,13 @@ sub dbh_new {
 	$dbh;
 }
 
+sub new {
+	my ($class, $f) = @_;
+	my $self = $class->SUPER::new($f);
+	$self->{min_tid} = 0;
+	$self;
+}
+
 sub get_counter ($$) {
 	my ($dbh, $key) = @_;
 	my $sth = $dbh->prepare_cached(<<'', undef, 1);
@@ -164,8 +172,12 @@ sub _resolve_mid_to_tid {
 	my $cur_tid = $smsg->{tid};
 	if (defined $$tid) {
 		merge_threads($self, $$tid, $cur_tid);
-	} else {
+	} elsif ($cur_tid > $self->{min_tid}) {
 		$$tid = $cur_tid;
+	} else { # rethreading, queue up dead ghosts
+		$$tid = next_tid($self);
+		my $num = $smsg->{num};
+		push(@{$self->{-ghosts_to_delete}}, $num) if $num < 0;
 	}
 	1;
 }
@@ -175,7 +187,10 @@ sub resolve_mid_to_tid {
 	my ($self, $mid) = @_;
 	my $tid;
 	each_by_mid($self, $mid, ['tid'], \&_resolve_mid_to_tid, \$tid);
-	defined $tid ? $tid : create_ghost($self, $mid);
+	if (my $del = delete $self->{-ghosts_to_delete}) {
+		delete_by_num($self, $_) for @$del;
+	}
+	$tid // create_ghost($self, $mid);
 }
 
 sub create_ghost {
@@ -221,7 +236,7 @@ sub link_refs {
 			merge_threads($self, $tid, $ptid);
 		}
 	} else {
-		$tid = defined $old_tid ? $old_tid : next_tid($self);
+		$tid = $old_tid // next_tid($self);
 	}
 	$tid;
 }
@@ -278,10 +293,17 @@ sub _add_over {
 	my $cur_tid = $smsg->{tid};
 	my $n = $smsg->{num};
 	die "num must not be zero for $mid" if !$n;
-	$$old_tid = $cur_tid unless defined $$old_tid;
+	my $cur_valid = $cur_tid > $self->{min_tid};
+
 	if ($n > 0) { # regular mail
-		merge_threads($self, $$old_tid, $cur_tid);
+		if ($cur_valid) {
+			$$old_tid //= $cur_tid;
+			merge_threads($self, $$old_tid, $cur_tid);
+		} else {
+			$$old_tid //= next_tid($self);
+		}
 	} elsif ($n < 0) { # ghost
+		$$old_tid //= $cur_valid ? $cur_tid : next_tid($self);
 		link_refs($self, $refs, $$old_tid);
 		delete_by_num($self, $n);
 		$$v++;
@@ -297,6 +319,7 @@ sub add_over {
 
 	begin_lazy($self);
 	delete_by_num($self, $num, \$old_tid);
+	$old_tid = undef if ($old_tid // 0) <= $self->{min_tid};
 	foreach my $mid (@$mids) {
 		my $v = 0;
 		each_by_mid($self, $mid, ['tid'], \&_add_over,
@@ -456,4 +479,47 @@ sub create {
 	$self->disconnect;
 }
 
+sub rethread_prepare {
+	my ($self, $opt) = @_;
+	return unless $opt->{rethread};
+	begin_lazy($self);
+	my $min = $self->{min_tid} = get_counter($self->{dbh}, 'thread') // 0;
+	my $pr = $opt->{-progress};
+	$pr->("rethread min THREADID ".($min + 1)."\n") if $pr && $min;
+}
+
+sub rethread_done {
+	my ($self, $opt) = @_;
+	return unless $opt->{rethread} && $self->{txn};
+	defined(my $min = $self->{min_tid}) or croak('BUG: no min_tid');
+	my $dbh = $self->{dbh} or croak('BUG: no dbh');
+	my $rows = $dbh->selectall_arrayref(<<'', { Slice => {} }, $min);
+SELECT num,tid FROM over WHERE num < 0 AND tid < ?
+
+	my $show_id = $dbh->prepare('SELECT id FROM id2num WHERE num = ?');
+	my $show_mid = $dbh->prepare('SELECT mid FROM msgid WHERE id = ?');
+	my $pr = $opt->{-progress};
+	my $total = 0;
+	for my $r (@$rows) {
+		my $exp = 0;
+		$show_id->execute($r->{num});
+		while (defined(my $id = $show_id->fetchrow_array)) {
+			++$exp;
+			$show_mid->execute($id);
+			my $mid = $show_mid->fetchrow_array;
+			if (!defined($mid)) {
+				warn <<EOF;
+E: ghost NUM=$r->{num} ID=$id THREADID=$r->{tid} has no Message-ID
+EOF
+				next;
+			}
+			$pr->(<<EOM) if $pr;
+I: ghost $r->{num} <$mid> THREADID=$r->{tid} culled
+EOM
+		}
+		delete_by_num($self, $r->{num});
+	}
+	$pr->("I: rethread culled $total ghosts\n") if $pr && $total;
+}
+
 1;
diff --git a/lib/PublicInbox/SearchIdx.pm b/lib/PublicInbox/SearchIdx.pm
index 831625090..e641ffd43 100644
--- a/lib/PublicInbox/SearchIdx.pm
+++ b/lib/PublicInbox/SearchIdx.pm
@@ -723,6 +723,7 @@ sub _index_sync {
 	my $pr = $opts->{-progress};
 
 	my $xdb = $self->begin_txn_lazy;
+	$self->{over}->rethread_prepare($opts);
 	my $mm = _msgmap_init($self);
 	do {
 		$xlog = undef; # stop previous git-log via SIGPIPE
@@ -761,12 +762,14 @@ sub _index_sync {
 				$xdb->set_metadata('last_commit', $newest);
 			}
 		}
+
+		$self->{over}->rethread_done($opts) if $newest; # all done
 		$self->commit_txn_lazy;
 		$git->cleanup;
 		$xdb = _xdb_release($self, $nr);
-		# let another process do some work... <
+		# let another process do some work...
 		$pr->("indexed $nr/$self->{ntodo}\n") if $pr && $nr;
-		if (!$newest) {
+		if (!$newest) { # more to come
 			$xdb = $self->begin_txn_lazy;
 			$dbh->begin_work if $dbh;
 		}
diff --git a/lib/PublicInbox/V2Writable.pm b/lib/PublicInbox/V2Writable.pm
index 0582dd5e3..16556ddc2 100644
--- a/lib/PublicInbox/V2Writable.pm
+++ b/lib/PublicInbox/V2Writable.pm
@@ -1308,6 +1308,7 @@ sub index_sync {
 	my $latest = git_dir_latest($self, \$epoch_max);
 	return unless defined $latest;
 	$self->idx_init($opt); # acquire lock
+	$self->{over}->rethread_prepare($opt);
 	my $sync = {
 		D => {}, # "$mid\0$chash" => $oid
 		unindex_range => {}, # EPOCH => oid_old..oid_new
@@ -1370,12 +1371,13 @@ sub index_sync {
 		my $pr = $sync->{-opt}->{-progress};
 		$pr->('all.git '.sprintf($sync->{-regen_fmt}, $nr)) if $pr;
 	}
+	$self->{over}->rethread_done($opt);
 
 	# reindex does not pick up new changes, so we rerun w/o it:
 	if ($opt->{reindex}) {
 		my %again = %$opt;
 		$sync = undef;
-		delete @again{qw(reindex -skip_lock)};
+		delete @again{qw(rethread reindex -skip_lock)};
 		index_sync($self, \%again);
 	}
 }
diff --git a/script/public-inbox-index b/script/public-inbox-index
index 6217fb86c..2e1934b08 100755
--- a/script/public-inbox-index
+++ b/script/public-inbox-index
@@ -15,7 +15,7 @@ use PublicInbox::Xapcmd;
 
 my $compact_opt;
 my $opt = { quiet => -1, compact => 0, maxsize => undef };
-GetOptions($opt, qw(verbose|v+ reindex compact|c+ jobs|j=i prune
+GetOptions($opt, qw(verbose|v+ reindex rethread compact|c+ jobs|j=i prune
 		indexlevel|L=s maxsize|max-size=s batchsize|batch-size=s))
 	or die "bad command-line args\n$usage";
 die "--jobs must be >= 0\n" if defined $opt->{jobs} && $opt->{jobs} < 0;
diff --git a/t/v1reindex.t b/t/v1reindex.t
index 9f23ef01e..8cb751881 100644
--- a/t/v1reindex.t
+++ b/t/v1reindex.t
@@ -11,6 +11,7 @@ require_git(2.6);
 require_mods(qw(DBD::SQLite Search::Xapian));
 use_ok 'PublicInbox::SearchIdx';
 use_ok 'PublicInbox::Import';
+use_ok 'PublicInbox::OverIdx';
 my ($inboxdir, $for_destroy) = tmpdir();
 my $ibx_config = {
 	inboxdir => $inboxdir,
@@ -427,5 +428,38 @@ ok(!-d $xap, 'Xapian directories removed again');
 		  ], 'msgmap as expected' );
 }
 
+{
+	my @warn;
+	local $SIG{__WARN__} = sub { push @warn, @_ };
+	my $ibx = PublicInbox::Inbox->new({ %$ibx_config });
+	my $f = $ibx->over->{dbh}->sqlite_db_filename;
+	my $over = PublicInbox::OverIdx->new($f);
+	my $dbh = $over->connect;
+	my $non_ghost_tids = sub {
+		$dbh->selectall_arrayref(<<'');
+SELECT tid FROM over WHERE num > 0 ORDER BY tid ASC
+
+	};
+	my $before = $non_ghost_tids->();
+
+	# mess up threading:
+	my $tid = PublicInbox::OverIdx::get_counter($dbh, 'thread');
+	my $nr = $dbh->do('UPDATE over SET tid = ?', undef, $tid);
+
+	my $rw = PublicInbox::SearchIdx->new($ibx, 1);
+	my @pr;
+	my $pr = sub { push @pr, @_ };
+	$rw->index_sync({reindex => 1, rethread => 1, -progress => $pr });
+	my @n = $dbh->selectrow_array(<<EOS, undef, $tid);
+SELECT COUNT(*) FROM over WHERE tid <= ?
+EOS
+	is_deeply(\@n, [ 0 ], 'rethread dropped old threadids');
+	my $after = $non_ghost_tids->();
+	ok($after->[0]->[0] > $before->[-1]->[0],
+		'all tids greater than before');
+	is(scalar @$after, scalar @$before, 'thread count unchanged');
+	is_deeply([], \@warn, 'no warnings');
+	# diag "@pr"; # XXX do we care?
+}
 
 done_testing();
diff --git a/t/v2reindex.t b/t/v2reindex.t
index 77deffb4b..ea2b24e59 100644
--- a/t/v2reindex.t
+++ b/t/v2reindex.t
@@ -10,6 +10,7 @@ use PublicInbox::TestCommon;
 require_git(2.6);
 require_mods(qw(DBD::SQLite Search::Xapian));
 use_ok 'PublicInbox::V2Writable';
+use_ok 'PublicInbox::OverIdx';
 my ($inboxdir, $for_destroy) = tmpdir();
 my $ibx_config = {
 	inboxdir => $inboxdir,
@@ -423,6 +424,46 @@ ok(!-d $xap, 'Xapian directories removed again');
 		  ], 'msgmap as expected' );
 }
 
+my $check_rethread = sub {
+	my ($desc) = @_;
+	my @warn;
+	local $SIG{__WARN__} = sub { push @warn, @_ };
+	my %config = %$ibx_config;
+	my $ibx = PublicInbox::Inbox->new(\%config);
+	my $f = $ibx->over->{dbh}->sqlite_db_filename;
+	my $over = PublicInbox::OverIdx->new($f);
+	my $dbh = $over->connect;
+	my $non_ghost_tids = sub {
+		$dbh->selectall_arrayref(<<'');
+SELECT tid FROM over WHERE num > 0 ORDER BY tid ASC
+
+	};
+	my $before = $non_ghost_tids->();
+
+	# mess up threading:
+	my $tid = PublicInbox::OverIdx::get_counter($dbh, 'thread');
+	my $nr = $dbh->do('UPDATE over SET tid = ?', undef, $tid);
+	diag "messing up all threads with tid=$tid";
+
+	my $v2w = PublicInbox::V2Writable->new($ibx);
+	my @pr;
+	my $pr = sub { push @pr, @_ };
+	$v2w->index_sync({reindex => 1, rethread => 1, -progress => $pr});
+	# diag "@pr"; # nobody cares
+	is_deeply(\@warn, [], 'no warnings on reindex + rethread');
+
+	my @n = $dbh->selectrow_array(<<EOS, undef, $tid);
+SELECT COUNT(*) FROM over WHERE tid <= ?
+EOS
+	is_deeply(\@n, [ 0 ], 'rethread dropped old threadids');
+	my $after = $non_ghost_tids->();
+	ok($after->[0]->[0] > $before->[-1]->[0],
+		'all tids greater than before');
+	is(scalar @$after, scalar @$before, 'thread count unchanged');
+};
+
+$check_rethread->('no-monster');
+
 # A real example from linux-renesas-soc on lore where a 3-headed monster
 # of a message has 3 sets of common headers.  Another normal message
 # previously existed with a single Message-ID that conflicts with one
@@ -497,4 +538,8 @@ EOF
 	is_deeply([values %uniq], [3], 'search on different subjects');
 }
 
+# XXX: not deterministic when dealing with ambiguous messages, oh well
+$check_rethread->('3-headed-monster once');
+$check_rethread->('3-headed-monster twice');
+
 done_testing();

^ permalink raw reply related	[relevance 3%]

* [PATCH] doc: add some recommendations around slow HDDs
@ 2020-07-17  3:57  5% Eric Wong
  0 siblings, 0 replies; 31+ results
From: Eric Wong @ 2020-07-17  3:57 UTC (permalink / raw)
  To: meta

grok-pull is still painful with serialization on an old USB 2.0
HDD, but at least it can finish with flock(1) and disabling
parallelization.  While parallel "git fetch" doesn't seem so
bad, slow seeks are exacerbated by parallel reads in Xapian.
That means some updates can take days instead of hours.  The
same updates take only seconds or minutes on an SSD.
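
The example hook below uses flock(1) from util-linux for this.  The
same serialization can be sketched in Perl with the flock builtin
instead; with_flock is a hypothetical helper and the lock file path
is whatever the caller chooses:

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

# hypothetical helper: run $cb while holding an exclusive lock on $path,
# so concurrent callers using the same $path run one at a time
sub with_flock {
	my ($path, $cb) = @_;
	open(my $fh, '<', $path) or die "open $path: $!";
	flock($fh, LOCK_EX) or die "flock $path: $!"; # blocks until free
	my $ret = $cb->();
	close($fh) or die "close $path: $!"; # closing releases the lock
	$ret;
}

# serialize on this script itself, as the shell hook does with "$0"
with_flock($0, sub { print "only one of us runs at a time\n"; 1 });
```

Locking the script file itself avoids needing a separate lock file,
at the cost of every invocation of that script being serialized.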
---
 Documentation/public-inbox-index.pod   | 10 ++++++++++
 examples/grok-pull.post_update_hook.sh |  6 ++++++
 2 files changed, 16 insertions(+)

diff --git a/Documentation/public-inbox-index.pod b/Documentation/public-inbox-index.pod
index b1b24917b..ff2e54867 100644
--- a/Documentation/public-inbox-index.pod
+++ b/Documentation/public-inbox-index.pod
@@ -32,6 +32,16 @@ normal search functionality.
 
 =over
 
+=item --jobs=JOBS, -j
+
+Control the number of Xapian indexing jobs in a
+(L<public-inbox-v2-format(5)>) inbox.
+
+C<--jobs=0> is accepted as of public-inbox 1.6.0 (PENDING)
+to disable parallel indexing.
+
+Default: the number of existing Xapian shards
+
 =item --compact / -c
 
 Compacts the Xapian DBs after indexing.  This is recommended
diff --git a/examples/grok-pull.post_update_hook.sh b/examples/grok-pull.post_update_hook.sh
index 3ead39440..ec4ae93e8 100755
--- a/examples/grok-pull.post_update_hook.sh
+++ b/examples/grok-pull.post_update_hook.sh
@@ -1,4 +1,9 @@
 #!/bin/sh
+
+# use flock(1) from util-linux to avoid seek contention on slow HDDs
+# when using multiple `pull_threads' with grok-pull:
+# [ "${FLOCKER}" != "$0" ] && exec env FLOCKER="$0" flock "$0" "$0" "$@" || :
+
 # post_update_hook for repos.conf as used by grok-pull, takes a full
 # git repo path as it's first and only arg.
 full_git_dir="$1"
@@ -119,6 +124,7 @@ then
 		: v2 inboxes may be init-ed with an empty msgmap
 		;;
 	*)
+		# if on HDD and limited RAM, add `-j0' w/ public-inbox 1.6.0+
 		$EATMYDATA public-inbox-index -v "$inbox_dir"
 		;;
 	esac

^ permalink raw reply related	[relevance 5%]

* [PATCH] doc: release notes and version info updates
@ 2020-07-14 10:06  6% Eric Wong
  0 siblings, 0 replies; 31+ results
From: Eric Wong @ 2020-07-14 10:06 UTC (permalink / raw)
  To: meta

Update release notes with some features in the 1.6 timeline.

We'll note the version availability of some command-line
options; it may help users who are reading the latest
documentation online but running older versions.
---
 Documentation/RelNotes/v1.6.0.eml    | 36 ++++++++++++++++++++++++++--
 Documentation/public-inbox-index.pod |  8 +++++++
 Documentation/public-inbox-init.pod  |  4 ++++
 Documentation/public-inbox-learn.pod |  3 ++-
 4 files changed, 48 insertions(+), 3 deletions(-)

diff --git a/Documentation/RelNotes/v1.6.0.eml b/Documentation/RelNotes/v1.6.0.eml
index 283f42c89..862e1c681 100644
--- a/Documentation/RelNotes/v1.6.0.eml
+++ b/Documentation/RelNotes/v1.6.0.eml
@@ -18,17 +18,37 @@ Content-Disposition: inline
     and indexed for search.  Use `public-inbox-index --reindex' to
     ensure these attachments are indexed in old messages.
 
+  - inbox.lock (v2) and ssoma.lock (v1) files are written to
+    on message delivery (or spam removal) to wake up read-only
+    daemons via inotify or kqueue.
+
 * public-inbox-index
 
   - --batch-size=BYTES or publicinbox.indexBatchSize parameter
 
-  - parallelize updates by default, "-j0" is (once again) allowed
-    parallelization
+  - parallelize v2 updates by default, "-j0" is (once again) allowed
+    to disable parallelization
+
+  - v1 (re-)indexing parallelizes blob reads from git
 
 * public-inbox-learn
 
   - `rm' supports `--all' to remove from all configured inboxes
 
+* public-inbox-imapd
+
+  - new read-only IMAP daemon similar to public-inbox-nntpd
+
+* public-inbox-nntpd
+
+  - blob reads from git are handled asynchronously
+
+* public-inbox-httpd
+
+  - Plack::Middleware::Deflater is no longer loaded by default
+    when no .psgi file is specified; PublicInbox::WWW gzips
+    natively (see below)
+
 * PublicInbox::WWW
 
   - use consistent blank line around attachment links
@@ -39,6 +59,18 @@ Content-Disposition: inline
 
   - $INBOX_DIR/description is treated as UTF-8
 
+  - HTML, Atom, and text/plain responses are gzipped without
+    relying on Plack::Middleware::Deflater
+
+  - Multi-message endpoints (/t.mbox.gz, /T/, /t/, etc) are ~10% faster
+    when running under public-inbox-httpd with asynchronous blob
+    retrieval
+
+* public-inbox-watch
+
+  - Linux::Inotify2 or IO::KQueue is used directly,
+    Filesys::Notify::Simple is no longer required
+
 Please report bugs via plain-text mail to: meta@public-inbox.org
 
 See archives at https://public-inbox.org/meta/ for all history.
diff --git a/Documentation/public-inbox-index.pod b/Documentation/public-inbox-index.pod
index 5be3c897b..b1b24917b 100644
--- a/Documentation/public-inbox-index.pod
+++ b/Documentation/public-inbox-index.pod
@@ -46,6 +46,8 @@ This switch may be specified twice, in which case compaction
 happens both before and after indexing to minimize the temporal
 footprint of the (re)indexing operation.
 
+Available since public-inbox 1.4.0.
+
 =item --reindex
 
 Forces a re-index of all messages in the inbox.
@@ -70,18 +72,24 @@ is detected.  This is intended to be used in mirrors after running
 L<public-inbox-edit(1)> or L<public-inbox-purge(1)> to ensure data
 is expunged from mirrors.
 
+Available since public-inbox 1.2.0.
+
 =item --max-size SIZE
 
 Sets or overrides L</publicinbox.indexMaxSize> on a
 per-invocation basis.  See L</publicinbox.indexMaxSize>
 below.
 
+Available since public-inbox 1.5.0.
+
 =item --batch-size SIZE
 
 Sets or overrides L</publicinbox.indexBatchSize> on a
 per-invocation basis.  See L</publicinbox.indexBatchSize>
 below.
 
+Available in public-inbox 1.6.0 (PENDING).
+
 =back
 
 =head1 FILES
diff --git a/Documentation/public-inbox-init.pod b/Documentation/public-inbox-init.pod
index 5714828d9..fd9fc6379 100644
--- a/Documentation/public-inbox-init.pod
+++ b/Documentation/public-inbox-init.pod
@@ -51,6 +51,8 @@ but may be of use to L<public-inbox-v1-format(5)> users.
 There is no automatic way to use reserved NNTP article numbers
 when old mail is found, yet.
 
+Available since public-inbox 1.6.0 (PENDING).
+
 Default: unset, no NNTP article numbers are skipped
 
 =item -S, --skip-epoch
@@ -60,6 +62,8 @@ allows archivists to publish incomplete archives with newer
 mail while allowing "0.git" (or "1.git" and so on) epochs to be
 added-after-the-fact (without affecting "git clone" followers).
 
+Available since public-inbox 1.2.0.
+
 Default: unset, no epochs are skipped
 
 =item -j, --jobs=JOBS
diff --git a/Documentation/public-inbox-learn.pod b/Documentation/public-inbox-learn.pod
index 9c6b261b3..cd9bf2782 100644
--- a/Documentation/public-inbox-learn.pod
+++ b/Documentation/public-inbox-learn.pod
@@ -55,7 +55,8 @@ not feed the message to L<spamc(1)> and only removes messages
 which match on any of the C<To:>, C<Cc:>, and C<List-ID:> headers.
 
 The C<--all> option may be used match C<spam> semantics in removing
-the message from all configured inboxes.
+the message from all configured inboxes.  C<--all> will be
+available in public-inbox 1.6.0 (PENDING).
 
 =back
 

^ permalink raw reply related	[relevance 6%]
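
The "Available since" notes added above could be checked in a script
before passing an option to an older install.  This is a rough
sketch, not part of public-inbox: `version_ge', `min_version_for' and
`can_use' are hypothetical helpers, the installed version is assumed
to be supplied by the caller, and only plain dotted numeric versions
are handled.  The 1.6.0 entries were still marked PENDING when the
patch was posted.

```shell
#!/bin/sh
# Hypothetical helpers; the version table is copied from the notes above.

# succeed if dotted version $1 >= $2 (numeric fields only, up to 3)
version_ge() {
	a=$1 b=$2
	for i in 1 2 3
	do
		x=${a%%.*} y=${b%%.*}
		[ "${x:-0}" -gt "${y:-0}" ] && return 0
		[ "${x:-0}" -lt "${y:-0}" ] && return 1
		# drop the leading field; empty when no fields remain
		case $a in *.*) a=${a#*.} ;; *) a= ;; esac
		case $b in *.*) b=${b#*.} ;; *) b= ;; esac
	done
	return 0
}

# print the first version documented as supporting a -index option
min_version_for() {
	case $1 in
	--compact|-c) echo 1.4.0 ;;
	--max-size) echo 1.5.0 ;;
	--batch-size|--jobs=0|-j0) echo 1.6.0 ;;
	*) return 1 ;;
	esac
}

# can_use INSTALLED_VERSION OPTION; returns 2 for options not in the table
can_use() {
	need=$(min_version_for "$2") || return 2
	version_ge "$1" "$need"
}

installed=1.5.0 # assumed value for this demo
for opt in --compact --max-size --batch-size
do
	if can_use "$installed" "$opt"
	then
		echo "$opt: available in $installed"
	else
		echo "$opt: needs a newer public-inbox"
	fi
done
```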

* [PATCH] doc: update TODO and WIP 1.6.0 release notes
@ 2020-06-10 18:39  5% Eric Wong
  0 siblings, 0 replies; 31+ results
From: Eric Wong @ 2020-06-10 18:39 UTC (permalink / raw)
  To: meta

Lots of big changes coming.  Thanks to The Linux Foundation for
sponsoring me to hack on this in 2020 :)
---
 Documentation/RelNotes/v1.6.0.eml | 45 +++++++++++++++++++++++++++++++
 TODO                              | 33 ++++++++++++++++++++---
 2 files changed, 74 insertions(+), 4 deletions(-)
 create mode 100644 Documentation/RelNotes/v1.6.0.eml

diff --git a/Documentation/RelNotes/v1.6.0.eml b/Documentation/RelNotes/v1.6.0.eml
new file mode 100644
index 00000000000..283f42c898f
--- /dev/null
+++ b/Documentation/RelNotes/v1.6.0.eml
@@ -0,0 +1,45 @@
+From: Eric Wong <e@yhbt.net>
+To: meta@public-inbox.org
+Subject: [WIP] public-inbox 1.6.0
+MIME-Version: 1.0
+Content-Type: text/plain; charset=utf-8
+Content-Disposition: inline
+
+* General changes:
+
+  - ~/.cache/public-inbox/inline-c is automatically used for Inline::C
+    if it exists.  PERL_INLINE_DIRECTORY in env remains supported
+    and prioritized to support `nobody'-type users without HOME.
+
+  - msgmap.sqlite3 uses journal_mode=TRUNCATE, matching over.sqlite3
+    behavior for a minor reduction in VFS traffic
+
+  - message/{rfc822,news,global} attachments are decoded recursively
+    and indexed for search.  Use `public-inbox-index --reindex' to
+    ensure these attachments are indexed in old messages.
+
+* public-inbox-index
+
+  - --batch-size=BYTES or publicinbox.indexBatchSize parameter
+
+  - parallelize updates by default, "-j0" is (once again) allowed
+    parallelization
+
+* public-inbox-learn
+
+  - `rm' supports `--all' to remove from all configured inboxes
+
+* PublicInbox::WWW
+
+  - use consistent blank line around attachment links
+
+  - Attachments in message/{rfc822,news,global} messages can be
+    individually downloaded.  Downloading the entire message/rfc822
+    file in full remains supported
+
+  - $INBOX_DIR/description is treated as UTF-8
+
+Please report bugs via plain-text mail to: meta@public-inbox.org
+
+See archives at https://public-inbox.org/meta/ for all history.
+See https://public-inbox.org/TODO for what the future holds.
diff --git a/TODO b/TODO
index 16de36bf200..9396f661137 100644
--- a/TODO
+++ b/TODO
@@ -19,7 +19,7 @@ all need to be considered for everything we introduce)
   Meaning users can run this without needing a full copy of the
   archives in git repositories.
 
-* HTTP and NNTP proxy support.  Allow us to be a frontend for
+* HTTP, IMAP and NNTP proxy support.  Allow us to be a frontend for
   firewalled off (or Tor-exclusive) instances.  The use case is
   for offering a publicly accessible IP with a cheap VPS,
   yet storing large amounts of data on computers without a
@@ -32,7 +32,7 @@ all need to be considered for everything we introduce)
   archive locations to avoid SPOF.
 
 * optional Cache::FastMmap support so production deployments won't
-  need Varnish (Varnish doesn't protect NNTP, either)
+  need Varnish (Varnish doesn't protect NNTP or IMAP, either)
 
 * dogfood and take advantage of new kernel APIs (while maintaining
   portability to older Linux, free BSDs and maybe Hurd).
@@ -44,7 +44,8 @@ all need to be considered for everything we introduce)
 * Support more of RFC 3977 (NNTP)
   Is there anything left for read-only support?
 
-* Combined "super server" for NNTP/HTTP/POP3 to reduce memory overhead
+* Combined "super server" for NNTP/HTTP/POP3/IMAP to reduce memory,
+  process, and FD overhead
 
 * Configurable linkification for per-inbox shorthands:
   "$gmane/123456" could be configured to expand to the
@@ -111,8 +112,31 @@ all need to be considered for everything we introduce)
 * imperfect scraper importers for obfuscated list archives
   (e.g. obfuscated Mailman stuff, Google Groups, etc...)
 
+* extend public-inbox-watch to support IMAP, NNTP
+
 * improve performance and avoid head-of-line blocking on slow storage
 
+* HTTP(S) search API (likely JMAP, but GraphQL could be an option)
+  It should support git-specific prefixes (dfpre:, dfpost:, dfn:, etc)
+  as extensions.  If JMAP, it should have HTTP(S) analogues to
+  various IMAP extensions.
+
+* search across multiple inboxes, or admin-definable groups of inboxes
+
+* scalability to tens/hundreds of thousands of inboxes
+
+  - pagination for WwwListing
+
+  - inotify-based manifest.js.gz updates
+
+  - process/FD reduction (needs to be slow-storage friendly)
+
+  ...
+
+* command-line tool (similar to mairix/notmuch, but solver+git-aware)
+
+* consider removing doc_data from Xapian, redundant with over.sqlite3
+
 * share "git cat-file --batch" processes across inboxes to avoid
   bumping into /proc/sys/fs/pipe-user-pages-* limits
 
@@ -125,7 +149,8 @@ all need to be considered for everything we introduce)
 * linter to check validity of config file
 
 * linter option and WWW endpoint to graph relationships and flows
-  between inboxes, addresses maildirs, coderepos, etc...
+  between inboxes, addresses, Maildirs, coderepos, newsgroups,
+  IMAP mailboxes, etc...
 
 * pygments support - via Python script similar to `git cat-file --batch'
   to avoid startup penalty.  pygments.rb (Ruby) can be inspiration, too.

^ permalink raw reply related	[relevance 5%]
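
The Inline::C directory lookup described in the release notes above
can be read as a two-step order: PERL_INLINE_DIRECTORY first, then
~/.cache/public-inbox/inline-c if it already exists.  The sketch
below only illustrates that reading; it is not public-inbox's actual
code, and `inline_c_dir' is a hypothetical helper.

```shell
#!/bin/sh
# Illustration of the lookup order in the notes above, not the real code.
inline_c_dir() {
	if [ -n "${PERL_INLINE_DIRECTORY:-}" ]
	then
		# an explicit environment setting takes priority; this also
		# covers `nobody'-type users without HOME
		echo "$PERL_INLINE_DIRECTORY"
	elif [ -n "${HOME:-}" ] && [ -d "$HOME/.cache/public-inbox/inline-c" ]
	then
		# used automatically, but only if the directory already exists
		echo "$HOME/.cache/public-inbox/inline-c"
	else
		return 1
	fi
}

inline_c_dir || echo "no Inline::C directory would be picked up"
```

Under this reading, a user opts in by creating the directory once,
e.g. with `mkdir -p ~/.cache/public-inbox/inline-c'.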

Results 1-31 of 31
2020-06-10 18:39  5% [PATCH] doc: update TODO and WIP 1.6.0 release notes Eric Wong
2020-07-14 10:06  6% [PATCH] doc: release notes and version info updates Eric Wong
2020-07-17  3:57  5% [PATCH] doc: add some recommendations around slow HDDs Eric Wong
2020-07-24  5:55     [PATCH 00/20] indexing changes and new features Eric Wong
2020-07-24  5:55  3% ` [PATCH 01/20] index: support --rethread switch to fix old indices Eric Wong
2020-07-24  5:56  7% ` [PATCH 16/20] index+xcpdb: support --no-sync flag Eric Wong
2020-08-07  1:13     [PATCH 0/7] index: --sequential-shard and other stuff Eric Wong
2020-08-07  1:14  9% ` [PATCH 5/7] index: v2: --sequential-shard option Eric Wong
2020-08-07  1:14  3% ` [PATCH 7/7] index+xcpdb: rename `--no-sync' to `--no-fsync' Eric Wong
2020-08-07 10:52     [PATCH 0/5] more indexing improvements Eric Wong
2020-08-07 10:52  4% ` [PATCH 5/5] index: add built-in --help / -? Eric Wong
2020-08-10  2:11     [PATCH 00/14] more indexing related improvements Eric Wong
2020-08-10  2:11 12% ` [PATCH 03/14] doc: index: more notes about latest changes Eric Wong
2020-08-12  9:17     [PATCH 0/6] xcpdb -index improvements Eric Wong
2020-08-12  9:17  4% ` [PATCH 5/6] xcpdb: wire up new index options and --help Eric Wong
2020-08-15  5:21  9% [PATCH] doc: add public-inbox-tuning(7) manpage Eric Wong
2020-08-20 20:24     [PATCH 00/23] indexing: --skip-docdata + speedups Eric Wong
2020-08-20 20:24  4% ` [PATCH 05/23] init: support --newsgroup option Eric Wong
2020-08-20 20:24  6% ` [PATCH 06/23] init: drop -N alias for --skip-artnum Eric Wong
2020-08-20 20:24  7% ` [PATCH 22/23] init+index: support --skip-docdata for Xapian Eric Wong
2020-08-25 10:51 11% [PATCH] doc: add some more tuning notes Eric Wong
2020-08-27 12:16     [PATCH 0/8] mostly watch-related odds and ends Eric Wong
2020-08-27 12:17  6% ` [PATCH 7/8] doc: move watch config docs to -watch manpage Eric Wong
2020-08-27 12:17  8% ` [PATCH 8/8] doc: watch: expand on NNTP and IMAP-specific knobs Eric Wong
2020-08-28  4:22  6%   ` Eric Wong
2020-08-31  4:33  4% [PATCH] doc: expand on indexBatchSize regarding fragementation Eric Wong
2020-09-14  6:39  4% [PATCH] doc: TODO and release notes updates ahead of 1.6 Eric Wong
2020-09-16 20:03 13% [ANNOUNCE] public-inbox 1.6.0 Eric Wong
2020-09-19 20:01  6% ` Leah Neukirchen
2020-09-19 21:17  6%   ` Eric Wong
2020-09-19 21:24  6%     ` Leah Neukirchen
2020-09-19 21:42 14% [PATCH] doc: post-1.6 updates, start 1.7 Eric Wong
2020-11-07 19:10  4% MIME types for image attachments Leah Neukirchen
2020-11-07 20:39  0% ` Eric Wong
2020-11-08  0:05  0%   ` Leah Neukirchen
2021-03-11 10:45     [PATCH 0/7] doc updates, fixups, and more Eric Wong
2021-03-11 10:45  4% ` [PATCH 4/7] doc: update 1.7 release notes, tuning, TODO Eric Wong
2023-11-22  1:04  5% [PATCH] watch: support `watch=false' to negate watchspam Eric Wong
2024-01-30  6:31     [PATCH 0/2] watch: add MH support + lei doc Eric Wong
2024-01-30  6:31  4% ` [PATCH 1/2] watch: support incremental updates from MH Eric Wong

Code repositories for project(s) associated with this public inbox

	https://80x24.org/public-inbox.git