user/dev discussion of public-inbox itself
* [PATCH 00/11] www: export SQLite altid dumps
@ 2020-03-21  2:03 Eric Wong
  2020-03-21  2:03 ` [PATCH 01/11] qspawn: reinstate filter support, add gzip filter Eric Wong
                   ` (10 more replies)
  0 siblings, 11 replies; 12+ messages in thread
From: Eric Wong @ 2020-03-21  2:03 UTC (permalink / raw)
  To: meta

To improve reproducibility in mirrors, altid dumps can be
exported via "POST /$INBOX_URL/$prefix.sql.gz".  $prefix is
something like "gmane" (though the search prefix is "gmane:"
with a colon).
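
As a rough illustration of how a mirror might retrieve such a dump,
the sketch below builds the endpoint URL and issues the POST with
HTTP::Tiny.  The helper names (altid_dump_url, fetch_altid_dump), the
example inbox URL and the \w+ check on the prefix are assumptions made
for illustration; they are not part of this series.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Hypothetical helper: build the dump URL for an altid prefix.
# The prefix is given without the trailing colon used in searches.
sub altid_dump_url {
	my ($inbox_url, $prefix) = @_;
	$prefix =~ /\A\w+\z/ or die "bad prefix: $prefix";
	$inbox_url =~ s!/+\z!!; # avoid a double slash
	"$inbox_url/$prefix.sql.gz";
}

# Hypothetical helper: POST to the endpoint and return the
# gzipped SQL dump as a string of bytes.
sub fetch_altid_dump {
	my ($inbox_url, $prefix) = @_;
	my $url = altid_dump_url($inbox_url, $prefix);
	my $res = HTTP::Tiny->new->post($url);
	$res->{success} or die "POST failed: $res->{status} $res->{reason}";
	$res->{content};
}
```

A caller would then do something like
fetch_altid_dump('https://example.com/inbox', 'gmane') and write the
result to gmane.sql.gz.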

Eric Wong (11):
  qspawn: reinstate filter support, add gzip filter
  gzipfilter: lazy allocate the deflate context
  wwwstream: introduce oneshot API to avoid ->getline
  extmsg: use WwwResponse::oneshot
  wwwstream: oneshot sets content-length
  mbox: need_gzip uses WwwStream::oneshot
  qspawn: handle ENOENT (and other errors on exec)
  search: clobber -user_pfx on query parser initialization
  wwwtext: show thread endpoints info w/ indexlevel=basic
  altid: warn about non-word prefixes
  www: add endpoint to retrieve altid dumps

 MANIFEST                       |  4 ++
 lib/PublicInbox/AltId.pm       |  3 +-
 lib/PublicInbox/ExtMsg.pm      |  4 +-
 lib/PublicInbox/GetlineBody.pm | 21 ++++----
 lib/PublicInbox/GzipFilter.pm  | 59 +++++++++++++++++++++
 lib/PublicInbox/Mbox.pm        | 16 +++---
 lib/PublicInbox/Qspawn.pm      | 66 ++++++++++++++----------
 lib/PublicInbox/Search.pm      |  4 +-
 lib/PublicInbox/ViewVCS.pm     |  8 +--
 lib/PublicInbox/WWW.pm         | 14 ++++-
 lib/PublicInbox/WwwAltId.pm    | 94 ++++++++++++++++++++++++++++++++++
 lib/PublicInbox/WwwStream.pm   | 29 +++++++++--
 lib/PublicInbox/WwwText.pm     | 10 +++-
 t/gzip_filter.t                | 37 +++++++++++++
 t/httpd-corner.psgi            | 16 ++++++
 t/httpd-corner.t               | 48 +++++++++++++++++
 t/www_altid.t                  | 83 ++++++++++++++++++++++++++++++
 17 files changed, 452 insertions(+), 64 deletions(-)
 create mode 100644 lib/PublicInbox/GzipFilter.pm
 create mode 100644 lib/PublicInbox/WwwAltId.pm
 create mode 100644 t/gzip_filter.t
 create mode 100644 t/www_altid.t

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 01/11] qspawn: reinstate filter support, add gzip filter
  2020-03-21  2:03 [PATCH 00/11] www: export SQLite altid dumps Eric Wong
@ 2020-03-21  2:03 ` Eric Wong
  2020-03-21  2:03 ` [PATCH 02/11] gzipfilter: lazy allocate the deflate context Eric Wong
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Eric Wong @ 2020-03-21  2:03 UTC (permalink / raw)
  To: meta

We'll be supporting gzipped output from sqlite3(1) dumps
for altid files in future commits.

In the future (and if we survive), we may replace
Plack::Middleware::Deflater with our own GzipFilter to work
better with asynchronous responses without relying on
memory-intensive anonymous subs.
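
For readers unfamiliar with Compress::Raw::Zlib, here is a minimal
standalone sketch of the technique the filter relies on: a deflate
context with WindowBits of 15 + 16 emits a gzip stream, chunks are
appended to one output buffer, and a final flush with Z_FINISH
completes the stream.  The chunk contents are made up, and this is not
the GzipFilter code itself.

```perl
use strict;
use warnings;
use Compress::Raw::Zlib qw(Z_FINISH Z_OK);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# 15 + 16 asks zlib for a gzip header and trailer;
# AppendOutput makes every call append to the same buffer
my ($gz, $err) = Compress::Raw::Zlib::Deflate->new(
	-WindowBits => 15 + 16, -AppendOutput => 1);
$err == Z_OK or die "Deflate->new failed: $err";

my @chunks = ('hello ', "world\n"); # made-up input
my $zbuf = '';
for my $chunk (@chunks) {
	$err = $gz->deflate($chunk, $zbuf);
	$err == Z_OK or die "deflate failed: $err";
}

# EOF: write out whatever zlib still holds, plus the gzip trailer
$err = $gz->flush($zbuf, Z_FINISH);
$err == Z_OK or die "flush failed: $err";

# round-trip to check that the result is a valid gzip stream
gunzip(\$zbuf => \(my $out)) or die "gunzip failed: $GunzipError";
```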
---
 MANIFEST                       |  2 ++
 lib/PublicInbox/GetlineBody.pm | 21 +++++++------
 lib/PublicInbox/GzipFilter.pm  | 54 ++++++++++++++++++++++++++++++++++
 lib/PublicInbox/Qspawn.pm      |  8 ++++-
 t/gzip_filter.t                | 37 +++++++++++++++++++++++
 t/httpd-corner.psgi            |  9 ++++++
 t/httpd-corner.t               | 25 ++++++++++++++++
 7 files changed, 144 insertions(+), 12 deletions(-)
 create mode 100644 lib/PublicInbox/GzipFilter.pm
 create mode 100644 t/gzip_filter.t

diff --git a/MANIFEST b/MANIFEST
index 265ad909..be1c4ab5 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -111,6 +111,7 @@ lib/PublicInbox/Filter/Vger.pm
 lib/PublicInbox/GetlineBody.pm
 lib/PublicInbox/Git.pm
 lib/PublicInbox/GitHTTPBackend.pm
+lib/PublicInbox/GzipFilter.pm
 lib/PublicInbox/HTTP.pm
 lib/PublicInbox/HTTPD.pm
 lib/PublicInbox/HTTPD/Async.pm
@@ -232,6 +233,7 @@ t/filter_vger.t
 t/git-http-backend.psgi
 t/git.fast-import-data
 t/git.t
+t/gzip_filter.t
 t/hl_mod.t
 t/html_index.t
 t/httpd-corner.psgi
diff --git a/lib/PublicInbox/GetlineBody.pm b/lib/PublicInbox/GetlineBody.pm
index 92719a82..6becaaf5 100644
--- a/lib/PublicInbox/GetlineBody.pm
+++ b/lib/PublicInbox/GetlineBody.pm
@@ -13,13 +13,13 @@ use strict;
 use warnings;
 
 sub new {
-	my ($class, $rpipe, $end, $end_arg, $buf) = @_;
+	my ($class, $rpipe, $end, $end_arg, $buf, $filter) = @_;
 	bless {
 		rpipe => $rpipe,
 		end => $end,
 		end_arg => $end_arg,
-		buf => $buf,
-		filter => 0,
+		initial_buf => $buf,
+		filter => $filter,
 	}, $class;
 }
 
@@ -30,19 +30,18 @@ sub DESTROY { $_[0]->close }
 
 sub getline {
 	my ($self) = @_;
-	my $filter = $self->{filter};
-	return if $filter == -1; # last call was EOF
-
-	my $buf = delete $self->{buf}; # initial buffer
-	$buf = $self->{rpipe}->getline unless defined $buf;
-	$self->{filter} = -1 unless defined $buf; # set EOF for next call
+	my $rpipe = $self->{rpipe} or return; # EOF was set on previous call
+	my $buf = delete($self->{initial_buf}) // $rpipe->getline;
+	delete($self->{rpipe}) unless defined $buf; # set EOF for next call
+	if (my $filter = $self->{filter}) {
+		$buf = $filter->translate($buf);
+	}
 	$buf;
 }
 
 sub close {
 	my ($self) = @_;
-	my ($rpipe, $end, $end_arg) = delete @$self{qw(rpipe end end_arg)};
-	close $rpipe if $rpipe;
+	my ($end, $end_arg) = delete @$self{qw(end end_arg)};
 	$end->($end_arg) if $end;
 }
 
diff --git a/lib/PublicInbox/GzipFilter.pm b/lib/PublicInbox/GzipFilter.pm
new file mode 100644
index 00000000..d883130f
--- /dev/null
+++ b/lib/PublicInbox/GzipFilter.pm
@@ -0,0 +1,54 @@
+# Copyright (C) 2020 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+
+# Qspawn filter
+package PublicInbox::GzipFilter;
+use strict;
+use bytes (); # length
+use Compress::Raw::Zlib qw(Z_FINISH Z_OK);
+my %OPT = (-WindowBits => 15 + 16, -AppendOutput => 1);
+
+sub new {
+	my ($gz, $err) = Compress::Raw::Zlib::Deflate->new(%OPT);
+	$err == Z_OK or die "Deflate->new failed: $err";
+	bless { gz => $gz }, shift;
+}
+
+# for Qspawn if using $env->{'pi-httpd.async'}
+sub attach {
+	my ($self, $fh) = @_;
+	$self->{fh} = $fh;
+	$self
+}
+
+# for GetlineBody (via Qspawn) when NOT using $env->{'pi-httpd.async'}
+sub translate ($$) {
+	my $self = $_[0];
+	my $zbuf = delete($self->{zbuf});
+	if (defined $_[1]) { # my $buf = $_[1];
+		my $err = $self->{gz}->deflate($_[1], $zbuf);
+		die "gzip->deflate: $err" if $err != Z_OK;
+		return $zbuf if length($zbuf) >= 8192;
+
+		$self->{zbuf} = $zbuf;
+		'';
+	} else { # undef == EOF
+		my $err = $self->{gz}->flush($zbuf, Z_FINISH);
+		die "gzip->flush: $err" if $err != Z_OK;
+		$zbuf;
+	}
+}
+
+sub write {
+	# my $ret = bytes::length($_[1]); # XXX does anybody care?
+	$_[0]->{fh}->write(translate($_[0], $_[1]));
+}
+
+sub close {
+	my ($self) = @_;
+	my $fh = delete $self->{fh};
+	$fh->write(translate($self, undef));
+	$fh->close;
+}
+
+1;
diff --git a/lib/PublicInbox/Qspawn.pm b/lib/PublicInbox/Qspawn.pm
index 63ec3648..52aea3eb 100644
--- a/lib/PublicInbox/Qspawn.pm
+++ b/lib/PublicInbox/Qspawn.pm
@@ -243,6 +243,7 @@ sub psgi_return_init_cb {
 	my ($self) = @_;
 	my $r = rd_hdr($self) or return;
 	my $env = $self->{psgi_env};
+	my $filter = delete $env->{'qspawn.filter'};
 	my $wcb = delete $env->{'qspawn.wcb'};
 	my $async = delete $self->{async};
 	if (scalar(@$r) == 3) { # error
@@ -257,6 +258,7 @@ sub psgi_return_init_cb {
 	} elsif ($async) {
 		# done reading headers, handoff to read body
 		my $fh = $wcb->($r); # scalar @$r == 2
+		$fh = $filter->attach($fh) if $filter;
 		$self->{fh} = $fh;
 		$async->async_pass($env->{'psgix.io'}, $fh,
 					delete($self->{hdr_buf}));
@@ -264,7 +266,7 @@ sub psgi_return_init_cb {
 		require PublicInbox::GetlineBody;
 		$r->[2] = PublicInbox::GetlineBody->new($self->{rpipe},
 					\&event_step, $self,
-					${$self->{hdr_buf}});
+					${$self->{hdr_buf}}, $filter);
 		$wcb->($r);
 	}
 
@@ -294,6 +296,10 @@ sub psgi_return_start { # may run later, much later...
 #                          psgi_return will return an anonymous
 #                          sub for the PSGI server to call
 #
+#   $env->{'qspawn.filter'} - filter object, responds to ->attach for
+#                             pi-httpd.async and ->translate for generic
+#                             PSGI servers
+#
 # $limiter - the Limiter object to use (uses the def_limiter if not given)
 #
 # $parse_hdr - Initial read function; often for parsing CGI header output.
diff --git a/t/gzip_filter.t b/t/gzip_filter.t
new file mode 100644
index 00000000..400214e6
--- /dev/null
+++ b/t/gzip_filter.t
@@ -0,0 +1,37 @@
+# Copyright (C) 2020 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+use strict;
+use Test::More;
+use IO::Handle (); # autoflush
+use Fcntl qw(SEEK_SET);
+use PublicInbox::TestCommon;
+require_mods(qw(Compress::Zlib IO::Uncompress::Gunzip));
+require_ok 'PublicInbox::GzipFilter';
+
+{
+	open my $fh, '+>', undef or die "open: $!";
+	open my $dup, '>&', $fh or die "dup $!";
+	$dup->autoflush(1);
+	my $filter = PublicInbox::GzipFilter->new->attach($dup);
+	ok($filter->write("hello"), 'wrote something');
+	ok($filter->write("world"), 'wrote more');
+	$filter->close;
+	seek($fh, 0, SEEK_SET) or die;
+	IO::Uncompress::Gunzip::gunzip($fh => \(my $buf));
+	is($buf, 'helloworld', 'buffer matches');
+}
+
+{
+	pipe(my ($r, $w)) or die "pipe: $!";
+	$w->autoflush(1);
+	close $r or die;
+	my $filter = PublicInbox::GzipFilter->new->attach($w);
+	my $sigpipe;
+	local $SIG{PIPE} = sub { $sigpipe = 1 };
+	open my $fh, '<', 'COPYING' or die "open(COPYING): $!";
+	my $buf = do { local $/; <$fh> };
+	while ($filter->write($buf .= rand)) {}
+	ok($sigpipe, 'got SIGPIPE');
+	close $w;
+}
+done_testing;
diff --git a/t/httpd-corner.psgi b/t/httpd-corner.psgi
index 35d1216e..f2427234 100644
--- a/t/httpd-corner.psgi
+++ b/t/httpd-corner.psgi
@@ -85,6 +85,15 @@ my $app = sub {
 			close $null;
 			[ 200, [ qw(Content-Type application/octet-stream) ]];
 		});
+	} elsif ($path eq '/psgi-return-gzip') {
+		require PublicInbox::Qspawn;
+		require PublicInbox::GzipFilter;
+		my $cmd = [qw(echo hello world)];
+		my $qsp = PublicInbox::Qspawn->new($cmd);
+		$env->{'qspawn.filter'} = PublicInbox::GzipFilter->new;
+		return $qsp->psgi_return($env, undef, sub {
+			[ 200, [ qw(Content-Type application/octet-stream)]]
+		});
 	} elsif ($path eq '/pid') {
 		$code = 200;
 		push @$body, "$$\n";
diff --git a/t/httpd-corner.t b/t/httpd-corner.t
index c99e5ec7..e50aa436 100644
--- a/t/httpd-corner.t
+++ b/t/httpd-corner.t
@@ -22,6 +22,7 @@ my $err = "$tmpdir/stderr.log";
 my $out = "$tmpdir/stdout.log";
 my $psgi = "./t/httpd-corner.psgi";
 my $sock = tcp_server() or die;
+my @zmods = qw(PublicInbox::GzipFilter IO::Uncompress::Gunzip);
 
 # make sure stdin is not a pipe for lsof test to check for leaking pipes
 open(STDIN, '<', '/dev/null') or die 'no /dev/null: $!';
@@ -324,6 +325,14 @@ SKIP: {
 	close $fh or die "curl errored out \$?=$?";
 	is($n, 30 * 1024 * 1024, 'got expected output from curl');
 	is($non_zero, 0, 'read all zeros');
+
+	require_mods(@zmods, 1);
+	open $fh, '-|', qw(curl -sS), "$base/psgi-return-gzip" or die;
+	binmode $fh;
+	my $buf = do { local $/; <$fh> };
+	close $fh or die "curl errored out \$?=$?";
+	IO::Uncompress::Gunzip::gunzip(\$buf => \(my $out));
+	is($out, "hello world\n");
 }
 
 {
@@ -596,6 +605,22 @@ SKIP: {
 	is_deeply([], [keys %child], 'no extra pipes with -W0');
 };
 
+# ensure compatibility with other PSGI servers
+SKIP: {
+	require_mods(@zmods, qw(Plack::Test HTTP::Request::Common), 3);
+	use_ok 'HTTP::Request::Common';
+	use_ok 'Plack::Test';
+	my $app = require $psgi;
+	test_psgi($app, sub {
+		my ($cb) = @_;
+		my $req = GET('http://example.com/psgi-return-gzip');
+		my $res = $cb->($req);
+		my $buf = $res->content;
+		IO::Uncompress::Gunzip::gunzip(\$buf => \(my $out));
+		is($out, "hello world\n");
+	});
+}
+
 done_testing();
 
 sub capture {

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 02/11] gzipfilter: lazy allocate the deflate context
  2020-03-21  2:03 [PATCH 00/11] www: export SQLite altid dumps Eric Wong
  2020-03-21  2:03 ` [PATCH 01/11] qspawn: reinstate filter support, add gzip filter Eric Wong
@ 2020-03-21  2:03 ` Eric Wong
  2020-03-21  2:03 ` [PATCH 03/11] wwwstream: introduce oneshot API to avoid ->getline Eric Wong
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Eric Wong @ 2020-03-21  2:03 UTC (permalink / raw)
  To: meta

zlib contexts are memory-intensive, particularly when used for
compression.  Since the gzip filter may be sitting in a limiter
queue for a long period, delay the allocation until we actually
have data to translate, and not a moment sooner.
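
The lazy-allocation idiom can be sketched in isolation as follows.
LazyCtx and its counter are hypothetical names used only to make the
effect observable; a plain hash stands in for the zlib deflate context.

```perl
use strict;
use warnings;

package LazyCtx; # hypothetical class, for illustration only

my $allocations = 0; # how many contexts were really built

sub new { bless {}, shift } # cheap: nothing allocated yet

sub ctx {
	my ($self) = @_;
	# build the expensive context on first use, reuse it afterwards
	$self->{ctx} ||= do {
		$allocations++;
		+{ buf => '' }; # stand-in for a zlib deflate context
	};
}

sub allocations { $allocations }

package main;

my $queued = LazyCtx->new; # may sit in a queue for a long time
my $before = LazyCtx::allocations();
$queued->ctx; # first real use allocates
$queued->ctx; # later uses do not
my $after = LazyCtx::allocations();
```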
---
 lib/PublicInbox/GzipFilter.pm | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/lib/PublicInbox/GzipFilter.pm b/lib/PublicInbox/GzipFilter.pm
index d883130f..86409586 100644
--- a/lib/PublicInbox/GzipFilter.pm
+++ b/lib/PublicInbox/GzipFilter.pm
@@ -8,11 +8,7 @@ use bytes (); # length
 use Compress::Raw::Zlib qw(Z_FINISH Z_OK);
 my %OPT = (-WindowBits => 15 + 16, -AppendOutput => 1);
 
-sub new {
-	my ($gz, $err) = Compress::Raw::Zlib::Deflate->new(%OPT);
-	$err == Z_OK or die "Deflate->new failed: $err";
-	bless { gz => $gz }, shift;
-}
+sub new { bless {}, shift }
 
 # for Qspawn if using $env->{'pi-httpd.async'}
 sub attach {
@@ -24,6 +20,15 @@ sub attach {
 # for GetlineBody (via Qspawn) when NOT using $env->{'pi-httpd.async'}
 sub translate ($$) {
 	my $self = $_[0];
+
+	# allocate the zlib context lazily here, instead of in ->new.
+	# Deflate contexts are memory-intensive and this object may
+	# be sitting in the Qspawn limiter queue for a while.
+	my $gz = $self->{gz} ||= do {
+		my ($g, $err) = Compress::Raw::Zlib::Deflate->new(%OPT);
+		$err == Z_OK or die "Deflate->new failed: $err";
+		$g;
+	};
 	my $zbuf = delete($self->{zbuf});
 	if (defined $_[1]) { # my $buf = $_[1];
 		my $err = $self->{gz}->deflate($_[1], $zbuf);

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 03/11] wwwstream: introduce oneshot API to avoid ->getline
  2020-03-21  2:03 [PATCH 00/11] www: export SQLite altid dumps Eric Wong
  2020-03-21  2:03 ` [PATCH 01/11] qspawn: reinstate filter support, add gzip filter Eric Wong
  2020-03-21  2:03 ` [PATCH 02/11] gzipfilter: lazy allocate the deflate context Eric Wong
@ 2020-03-21  2:03 ` Eric Wong
  2020-03-21  2:03 ` [PATCH 04/11] extmsg: use WwwResponse::oneshot Eric Wong
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Eric Wong @ 2020-03-21  2:03 UTC (permalink / raw)
  To: meta

The ->getline API is only useful for limiting memory use when
streaming responses containing multiple emails or log messages.
However, it's unnecessary complexity and overhead for callers
(PublicInbox::HTTP) when there's only a single message.
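
To make the difference concrete, the sketch below shows the two body
forms a PSGI response may carry: an object the server must drain via
->getline, and a plain array ref it can use directly.  StreamBody and
drain_body are hypothetical names for illustration; they only mimic
what a PSGI server does with each form.

```perl
use strict;
use warnings;

package StreamBody; # hypothetical streaming body, illustration only

sub new {
	my ($class, @chunks) = @_;
	bless { chunks => [ @chunks ] }, $class;
}

# return the next chunk, or undef at EOF
sub getline { shift @{$_[0]->{chunks}} }

sub close {} # nothing to release in this sketch

package main;

# roughly what a PSGI server does with either kind of body
sub drain_body {
	my ($body) = @_;
	return join('', @$body) if ref($body) eq 'ARRAY';
	my $out = '';
	while (defined(my $chunk = $body->getline)) {
		$out .= $chunk;
	}
	$body->close;
	$out;
}

my @hdr = ('Content-Type', 'text/html; charset=UTF-8');
my $streamed = [ 200, [ @hdr ], StreamBody->new('<pre>', 'hi', '</pre>') ];
my $oneshot = [ 200, [ @hdr ], [ '<pre>', 'hi', '</pre>' ] ];
```

Both responses carry the same bytes; the array form simply spares the
server one method call per chunk when everything is already in memory.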
---
 lib/PublicInbox/ViewVCS.pm   |  8 +-------
 lib/PublicInbox/WwwStream.pm | 21 ++++++++++++++++++---
 2 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/lib/PublicInbox/ViewVCS.pm b/lib/PublicInbox/ViewVCS.pm
index 2f8e1c4f..6714e67c 100644
--- a/lib/PublicInbox/ViewVCS.pm
+++ b/lib/PublicInbox/ViewVCS.pm
@@ -31,17 +31,11 @@ my %QP_MAP = ( A => 'oid_a', B => 'oid_b', a => 'path_a', b => 'path_b' );
 our $MAX_SIZE = 1024 * 1024; # TODO: configurable
 my $BIN_DETECT = 8000; # same as git
 
-sub html_i { # WwwStream::getline callback
-	my ($nr, $ctx) =  @_;
-	$nr == 1 ? ${delete $ctx->{obuf}} : undef;
-}
-
 sub html_page ($$$) {
 	my ($ctx, $code, $strref) = @_;
 	my $wcb = delete $ctx->{-wcb};
 	$ctx->{-upfx} = '../../'; # from "/$INBOX/$OID/s/"
-	$ctx->{obuf} = $strref;
-	my $res = PublicInbox::WwwStream->response($ctx, $code, \&html_i);
+	my $res = PublicInbox::WwwStream::oneshot($ctx, $code, $strref);
 	$wcb ? $wcb->($res) : $res;
 }
 
diff --git a/lib/PublicInbox/WwwStream.pm b/lib/PublicInbox/WwwStream.pm
index 3a867ec3..2dd8b157 100644
--- a/lib/PublicInbox/WwwStream.pm
+++ b/lib/PublicInbox/WwwStream.pm
@@ -16,16 +16,21 @@ our $CODE_URL = 'https://public-inbox.org/public-inbox.git';
 # noop for HTTP.pm (and any other PSGI servers)
 sub close {}
 
+sub base_url ($) {
+	my $ctx = shift;
+	my $base_url = $ctx->{-inbox}->base_url($ctx->{env});
+	chop $base_url; # no trailing slash for clone
+	$base_url;
+}
+
 sub new {
 	my ($class, $ctx, $cb) = @_;
 
-	my $base_url = $ctx->{-inbox}->base_url($ctx->{env});
-	chop $base_url; # no trailing slash for clone
 	bless {
 		nr => 0,
 		cb => $cb || \&close,
 		ctx => $ctx,
-		base_url => $base_url,
+		base_url => base_url($ctx),
 	}, $class;
 }
 
@@ -164,4 +169,14 @@ sub getline {
 	delete $self->{cb} ? _html_end($self) : undef;
 }
 
+sub oneshot {
+	my ($ctx, $code, $strref) = @_;
+	my $self = bless {
+		ctx => $ctx,
+		base_url => base_url($ctx),
+	}, __PACKAGE__;
+	[ $code, [ 'Content-Type', 'text/html; charset=UTF-8' ],
+		[ _html_top($self), $$strref, _html_end($self) ] ]
+}
+
 1;

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 04/11] extmsg: use WwwResponse::oneshot
  2020-03-21  2:03 [PATCH 00/11] www: export SQLite altid dumps Eric Wong
                   ` (2 preceding siblings ...)
  2020-03-21  2:03 ` [PATCH 03/11] wwwstream: introduce oneshot API to avoid ->getline Eric Wong
@ 2020-03-21  2:03 ` Eric Wong
  2020-03-21  2:03 ` [PATCH 05/11] wwwstream: oneshot sets content-length Eric Wong
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Eric Wong @ 2020-03-21  2:03 UTC (permalink / raw)
  To: meta

No reason to use the ->getline interface for small responses.
---
 lib/PublicInbox/ExtMsg.pm    | 4 ++--
 lib/PublicInbox/WwwStream.pm | 7 ++++---
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/lib/PublicInbox/ExtMsg.pm b/lib/PublicInbox/ExtMsg.pm
index 44884ad2..74a95cf9 100644
--- a/lib/PublicInbox/ExtMsg.pm
+++ b/lib/PublicInbox/ExtMsg.pm
@@ -158,7 +158,7 @@ sub ext_msg {
 	$ctx->{-html_tip} = $s .= '</pre>';
 	$ctx->{-title_html} = $title;
 	$ctx->{-upfx} = '../';
-	PublicInbox::WwwStream->response($ctx, $code);
+	PublicInbox::WwwStream::oneshot($ctx, $code);
 }
 
 sub ext_urls {
@@ -196,7 +196,7 @@ sub exact {
 					qq(<a\nhref="$u$href/">$u$html/</a>\n)
 				} @$found),
 			$ext_urls, '</pre>');
-	PublicInbox::WwwStream->response($ctx, $code);
+	PublicInbox::WwwStream::oneshot($ctx, $code);
 }
 
 1;
diff --git a/lib/PublicInbox/WwwStream.pm b/lib/PublicInbox/WwwStream.pm
index 2dd8b157..fceef745 100644
--- a/lib/PublicInbox/WwwStream.pm
+++ b/lib/PublicInbox/WwwStream.pm
@@ -28,7 +28,7 @@ sub new {
 
 	bless {
 		nr => 0,
-		cb => $cb || \&close,
+		cb => $cb,
 		ctx => $ctx,
 		base_url => base_url($ctx),
 	}, $class;
@@ -175,8 +175,9 @@ sub oneshot {
 		ctx => $ctx,
 		base_url => base_url($ctx),
 	}, __PACKAGE__;
-	[ $code, [ 'Content-Type', 'text/html; charset=UTF-8' ],
-		[ _html_top($self), $$strref, _html_end($self) ] ]
+	[ $code, [ 'Content-Type', 'text/html; charset=UTF-8' ], [
+		_html_top($self), $strref ? $$strref : (), _html_end($self)
+	] ]
 }
 
 1;

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 05/11] wwwstream: oneshot sets content-length
  2020-03-21  2:03 [PATCH 00/11] www: export SQLite altid dumps Eric Wong
                   ` (3 preceding siblings ...)
  2020-03-21  2:03 ` [PATCH 04/11] extmsg: use WwwResponse::oneshot Eric Wong
@ 2020-03-21  2:03 ` Eric Wong
  2020-03-21  2:03 ` [PATCH 06/11] mbox: need_gzip uses WwwStream::oneshot Eric Wong
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Eric Wong @ 2020-03-21  2:03 UTC (permalink / raw)
  To: meta

Otherwise, PublicInbox::HTTP will use chunked encoding, and
that's extra overhead which isn't needed.
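
A small standalone sketch of the calculation: sum the byte length of
every body chunk and send the total as Content-Length.  The helper name
with_content_length and the sample chunks are assumptions for
illustration; the chunks are taken to be already-encoded bytes.

```perl
use strict;
use warnings;
use bytes (); # only for bytes::length, nothing is imported

# hypothetical helper: wrap already-encoded chunks in a PSGI response
sub with_content_length {
	my ($code, @chunks) = @_;
	my $len = 0;
	$len += bytes::length($_) for @chunks;
	[ $code, [
		'Content-Type' => 'text/html; charset=UTF-8',
		'Content-Length' => $len,
	], \@chunks ];
}

# "caf\xc3\xa9" is the UTF-8 encoding of a 4-character word: 5 bytes
my $res = with_content_length(200, '<pre>', "caf\xc3\xa9", '</pre>');
```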
---
 lib/PublicInbox/WwwStream.pm | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/lib/PublicInbox/WwwStream.pm b/lib/PublicInbox/WwwStream.pm
index fceef745..985e0262 100644
--- a/lib/PublicInbox/WwwStream.pm
+++ b/lib/PublicInbox/WwwStream.pm
@@ -9,6 +9,7 @@
 package PublicInbox::WwwStream;
 use strict;
 use warnings;
+use bytes (); # length
 use PublicInbox::Hval qw(ascii_html prurl);
 our $TOR_URL = 'https://www.torproject.org/';
 our $CODE_URL = 'https://public-inbox.org/public-inbox.git';
@@ -170,14 +171,18 @@ sub getline {
 }
 
 sub oneshot {
-	my ($ctx, $code, $strref) = @_;
+	my ($ctx, $code, $sref) = @_;
 	my $self = bless {
 		ctx => $ctx,
 		base_url => base_url($ctx),
 	}, __PACKAGE__;
-	[ $code, [ 'Content-Type', 'text/html; charset=UTF-8' ], [
-		_html_top($self), $strref ? $$strref : (), _html_end($self)
-	] ]
+	my @x = (_html_top($self), $sref ? $$sref : (), _html_end($self));
+	my $len = 0;
+	$len += bytes::length($_) for @x;
+	[ $code, [
+		'Content-Type' => 'text/html; charset=UTF-8',
+		'Content-Length' => $len
+	], \@x ];
 }
 
 1;

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 06/11] mbox: need_gzip uses WwwStream::oneshot
  2020-03-21  2:03 [PATCH 00/11] www: export SQLite altid dumps Eric Wong
                   ` (4 preceding siblings ...)
  2020-03-21  2:03 ` [PATCH 05/11] wwwstream: oneshot sets content-length Eric Wong
@ 2020-03-21  2:03 ` Eric Wong
  2020-03-21  2:03 ` [PATCH 07/11] qspawn: handle ENOENT (and other errors on exec) Eric Wong
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Eric Wong @ 2020-03-21  2:03 UTC (permalink / raw)
  To: meta

This makes the error page more consistent.

Not that it really matters since Compress::Raw::Zlib and
IO::Compress packages have been distributed with Perl since
5.10.x.  Of course, zlib itself is also a dependency of git.
---
 lib/PublicInbox/Mbox.pm | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/lib/PublicInbox/Mbox.pm b/lib/PublicInbox/Mbox.pm
index 5693d30b..f329653a 100644
--- a/lib/PublicInbox/Mbox.pm
+++ b/lib/PublicInbox/Mbox.pm
@@ -152,7 +152,7 @@ sub thread_cb {
 sub thread_mbox {
 	my ($ctx, $over, $sfx) = @_;
 	eval { require PublicInbox::MboxGz };
-	return need_gzip() if $@;
+	return need_gzip($ctx) if $@;
 	my $msgs = $ctx->{msgs} = $over->get_thread($ctx->{mid}, {});
 	return [404, [qw(Content-Type text/plain)], []] if !@$msgs;
 	$ctx->{prev} = $msgs->[-1];
@@ -221,7 +221,7 @@ sub mbox_all {
 	my ($ctx, $query) = @_;
 
 	eval { require PublicInbox::MboxGz };
-	return need_gzip() if $@;
+	return need_gzip($ctx) if $@;
 	return mbox_all_ids($ctx) if $query eq '';
 	my $qopts = $ctx->{qopts} = { mset => 2 };
 	my $srch = $ctx->{srch} = $ctx->{-inbox}->search or
@@ -236,16 +236,14 @@ sub mbox_all {
 }
 
 sub need_gzip {
-	my $title = 'gzipped mbox not available';
-	my $body = <<EOF;
-<html><head><title>$title</title><body><pre>$title
+	PublicInbox::WwwStream::oneshot($_[0], 501, \<<EOF);
+<pre>gzipped mbox not available
+
 The administrator needs to install the Compress::Raw::Zlib Perl module
 to support gzipped mboxes.
-<a href="../">Return to index</a></pre></body></html>
-EOF
 
-	[501,[qw(Content-Type text/html Content-Length), bytes::length($body)],
-	[ $body ] ];
+<a href="../">Return to index</a></pre>
+EOF
 }
 
 1;

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 07/11] qspawn: handle ENOENT (and other errors on exec)
  2020-03-21  2:03 [PATCH 00/11] www: export SQLite altid dumps Eric Wong
                   ` (5 preceding siblings ...)
  2020-03-21  2:03 ` [PATCH 06/11] mbox: need_gzip uses WwwStream::oneshot Eric Wong
@ 2020-03-21  2:03 ` Eric Wong
  2020-03-21  2:03 ` [PATCH 08/11] search: clobber -user_pfx on query parser initialization Eric Wong
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Eric Wong @ 2020-03-21  2:03 UTC (permalink / raw)
  To: meta

As sqlite3(1) and other executables may become unavailable or
uninstalled while a daemon runs, we need to gracefully handle
errors in those cases.
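
The general idea, outside of Qspawn, can be sketched with plain
system(): a return value of -1 means the program could not be started
at all, and that case is mapped to a 500 response instead of being
allowed to kill the daemon.  run_or_500 and the command names are
hypothetical; Qspawn itself uses popen_rd and callbacks rather than
this synchronous form.

```perl
use strict;
use warnings;

# hypothetical helper: run @cmd, map a failure to start it to a 500
sub run_or_500 {
	my (@cmd) = @_;
	my $rc = do {
		no warnings 'exec'; # we report the failure ourselves
		system { $cmd[0] } @cmd;
	};
	if ($rc == -1) { # could not start the program, $! says why
		warn "E: $cmd[0]: $!\n"; # log it, do not expose it
		return [ 500, [ 'Content-Type', 'text/plain' ],
			[ "Internal Server Error\n" ] ];
	}
	my $code = $rc == 0 ? 200 : 500;
	[ $code, [ 'Content-Type', 'text/plain' ], [ "wait status=$rc\n" ] ];
}

my $missing = run_or_500('this-better-not-exist-in-PATH-' . $$);
my $ok = run_or_500($^X, '-e', '1'); # the running perl should exist
```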
---
 lib/PublicInbox/Qspawn.pm | 58 ++++++++++++++++++++++-----------------
 t/httpd-corner.psgi       |  7 +++++
 t/httpd-corner.t          | 25 ++++++++++++++++-
 3 files changed, 64 insertions(+), 26 deletions(-)

diff --git a/lib/PublicInbox/Qspawn.pm b/lib/PublicInbox/Qspawn.pm
index 52aea3eb..34b6912f 100644
--- a/lib/PublicInbox/Qspawn.pm
+++ b/lib/PublicInbox/Qspawn.pm
@@ -45,7 +45,7 @@ sub new ($$$;) {
 sub _do_spawn {
 	my ($self, $start_cb, $limiter) = @_;
 	my $err;
-	my ($cmd, $cmd_env, $opt) = @{$self->{args}};
+	my ($cmd, $cmd_env, $opt) = @{delete $self->{args}};
 	my %o = %{$opt || {}};
 	$self->{limiter} = $limiter;
 	foreach my $k (PublicInbox::Spawn::RLIMITS()) {
@@ -53,20 +53,17 @@ sub _do_spawn {
 			$o{$k} = $rlimit;
 		}
 	}
+	$self->{cmd} = $o{quiet} ? undef : $cmd;
 	eval {
 		# popen_rd may die on EMFILE, ENFILE
 		($self->{rpipe}, $self->{pid}) = popen_rd($cmd, $cmd_env, \%o);
-		$self->{args} = $o{quiet} ? undef : $cmd;
 
 		die "E: $!" unless defined($self->{pid});
 
 		$limiter->{running}++;
 		$start_cb->($self); # EPOLL_CTL_ADD may ENOSPC/ENOMEM
 	};
-	if ($@) {
-		$self->{err} = $@;
-		finish($self);
-	}
+	finish($self, $@) if $@;
 }
 
 sub child_err ($) {
@@ -83,16 +80,8 @@ sub log_err ($$) {
 	$env->{'psgi.errors'}->print($msg, "\n");
 }
 
-# callback for dwaitpid
-sub waitpid_err ($$) {
-	my ($self, $pid) = @_;
-	my $xpid = delete $self->{pid};
-	my $err;
-	if ($pid > 0) { # success!
-		$err = child_err($?);
-	} elsif ($pid < 0) { # ??? does this happen in our case?
-		$err = "W: waitpid($xpid, 0) => $pid: $!";
-	} # else should not be called with pid == 0
+sub finalize ($$) {
+	my ($self, $err) = @_;
 
 	my ($env, $qx_cb, $qx_arg, $qx_buf) =
 		delete @$self{qw(psgi_env qx_cb qx_arg qx_buf)};
@@ -108,16 +97,37 @@ sub waitpid_err ($$) {
 	}
 
 	if ($err) {
-		if ($self->{err}) {
+		if (defined $self->{err}) {
 			$self->{err} .= "; $err";
 		} else {
 			$self->{err} = $err;
 		}
-		if ($env && $self->{args}) {
-			log_err($env, join(' ', @{$self->{args}}) . ": $err");
+		if ($env && $self->{cmd}) {
+			log_err($env, join(' ', @{$self->{cmd}}) . ": $err");
 		}
 	}
-	eval { $qx_cb->($qx_buf, $qx_arg) } if $qx_cb;
+	if ($qx_cb) {
+		eval { $qx_cb->($qx_buf, $qx_arg) };
+	} elsif (my $wcb = delete $env->{'qspawn.wcb'}) {
+		# have we started writing, yet?
+		require PublicInbox::WwwStatic;
+		$wcb->(PublicInbox::WwwStatic::r(500));
+	}
+}
+
+# callback for dwaitpid
+sub waitpid_err ($$) {
+	my ($self, $pid) = @_;
+	my $xpid = delete $self->{pid};
+	my $err;
+	if (defined $pid) {
+		if ($pid > 0) { # success!
+			$err = child_err($?);
+		} elsif ($pid < 0) { # ??? does this happen in our case?
+			$err = "W: waitpid($xpid, 0) => $pid: $!";
+		} # else should not be called with pid == 0
+	}
+	finalize($self, $err);
 }
 
 sub do_waitpid ($) {
@@ -133,14 +143,12 @@ sub do_waitpid ($) {
 	}
 }
 
-sub finish ($) {
-	my ($self) = @_;
+sub finish ($;$) {
+	my ($self, $err) = @_;
 	if (delete $self->{rpipe}) {
 		do_waitpid($self);
 	} else {
-		my ($env, $qx_cb, $qx_arg, $qx_buf) =
-			delete @$self{qw(psgi_env qx_cb qx_arg qx_buf)};
-		eval { $qx_cb->($qx_buf, $qx_arg) } if $qx_cb;
+		finalize($self, $err);
 	}
 }
 
diff --git a/t/httpd-corner.psgi b/t/httpd-corner.psgi
index f2427234..44629620 100644
--- a/t/httpd-corner.psgi
+++ b/t/httpd-corner.psgi
@@ -94,6 +94,13 @@ my $app = sub {
 		return $qsp->psgi_return($env, undef, sub {
 			[ 200, [ qw(Content-Type application/octet-stream)]]
 		});
+	} elsif ($path eq '/psgi-return-enoent') {
+		require PublicInbox::Qspawn;
+		my $cmd = [ 'this-better-not-exist-in-PATH'.rand ];
+		my $qsp = PublicInbox::Qspawn->new($cmd);
+		return $qsp->psgi_return($env, undef, sub {
+			[ 200, [ qw(Content-Type application/octet-stream)]]
+		});
 	} elsif ($path eq '/pid') {
 		$code = 200;
 		push @$body, "$$\n";
diff --git a/t/httpd-corner.t b/t/httpd-corner.t
index e50aa436..cbfc8332 100644
--- a/t/httpd-corner.t
+++ b/t/httpd-corner.t
@@ -10,6 +10,7 @@ use PublicInbox::Spawn qw(which spawn);
 use PublicInbox::TestCommon;
 require_mods(qw(Plack::Util Plack::Builder HTTP::Date HTTP::Status));
 use Digest::SHA qw(sha1_hex);
+use IO::Handle ();
 use IO::Socket;
 use IO::Socket::UNIX;
 use Fcntl qw(:seek);
@@ -335,6 +336,14 @@ SKIP: {
 	is($out, "hello world\n");
 }
 
+{
+	my $conn = conn_for($sock, 'psgi_return ENOENT');
+	print $conn "GET /psgi-return-enoent HTTP/1.1\r\n\r\n" or die;
+	my $buf = '';
+	sysread($conn, $buf, 16384, length($buf)) until $buf =~ /\r\n\r\n/;
+	like($buf, qr!HTTP/1\.[01] 500\b!, 'got 500 error on ENOENT');
+}
+
 {
 	my $conn = conn_for($sock, '1.1 pipeline together');
 	$conn->write("PUT /sha1 HTTP/1.1\r\nUser-agent: hello\r\n\r\n" .
@@ -610,6 +619,11 @@ SKIP: {
 	require_mods(@zmods, qw(Plack::Test HTTP::Request::Common), 3);
 	use_ok 'HTTP::Request::Common';
 	use_ok 'Plack::Test';
+	STDERR->flush;
+	open my $olderr, '>&', \*STDERR or die "dup stderr: $!";
+	open my $tmperr, '+>', undef or die;
+	open STDERR, '>&', $tmperr or die;
+	STDERR->autoflush(1);
 	my $app = require $psgi;
 	test_psgi($app, sub {
 		my ($cb) = @_;
@@ -617,8 +631,17 @@ SKIP: {
 		my $res = $cb->($req);
 		my $buf = $res->content;
 		IO::Uncompress::Gunzip::gunzip(\$buf => \(my $out));
-		is($out, "hello world\n");
+		is($out, "hello world\n", 'got expected output');
+
+		$req = GET('http://example.com/psgi-return-enoent');
+		$res = $cb->($req);
+		is($res->code, 500, 'got error on ENOENT');
+		seek($tmperr, 0, SEEK_SET) or die;
+		my $errbuf = do { local $/; <$tmperr> };
+		like($errbuf, qr/this-better-not-exist/,
+			'error logged about missing command');
 	});
+	open STDERR, '>&', $olderr or die "restore stderr: $!";
 }
 
 done_testing();

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 08/11] search: clobber -user_pfx on query parser initialization
  2020-03-21  2:03 [PATCH 00/11] www: export SQLite altid dumps Eric Wong
                   ` (6 preceding siblings ...)
  2020-03-21  2:03 ` [PATCH 07/11] qspawn: handle ENOENT (and other errors on exec) Eric Wong
@ 2020-03-21  2:03 ` Eric Wong
  2020-03-21  2:03 ` [PATCH 09/11] wwwtext: show thread endpoint w/ indexlevel=basic Eric Wong
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Eric Wong @ 2020-03-21  2:03 UTC (permalink / raw)
  To: meta

While we don't currently reinitialize the query parser for
the lifetime of a PublicInbox::Search object and have no plans
to, it's incorrect to be appending to an existing array in
case we reinitialize the query parser in the future.
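
The hazard is easy to reproduce in isolation.  In the sketch below,
init_append and init_clobber are hypothetical stand-ins for the two
variants of the initialization, and what gets pushed onto the array is
an assumption, since the rest of qp() is not shown here.

```perl
use strict;
use warnings;

my @altid = ('serial:gmane:/path/to/gmane.msgmap.sqlite3');

# old behavior: reuse any existing array, so every run appends again
sub init_append {
	my ($self) = @_;
	my $user_pfx = $self->{-user_pfx} ||= [];
	for (@altid) {
		/\Aserial:(\w+):/ or next;
		push @$user_pfx, "$1:";
	}
}

# new behavior: start from an empty array on every initialization
sub init_clobber {
	my ($self) = @_;
	my $user_pfx = $self->{-user_pfx} = [];
	for (@altid) {
		/\Aserial:(\w+):/ or next;
		push @$user_pfx, "$1:";
	}
}

my ($appended, $clobbered) = ({}, {});
init_append($appended) for 1..2; # simulate a reinitialization
init_clobber($clobbered) for 1..2;
```

After two runs, the appending variant holds the "gmane:" entry twice
while the clobbering variant holds it once.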
---
 lib/PublicInbox/Search.pm | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 7f901125..372dc5a7 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -313,7 +313,7 @@ sub qp {
 	# we do not actually create AltId objects,
 	# just parse the spec to avoid the extra DB handles for now.
 	if (my $altid = $self->{altid}) {
-		my $user_pfx = $self->{-user_pfx} ||= [];
+		my $user_pfx = $self->{-user_pfx} = [];
 		for (@$altid) {
 			# $_ = 'serial:gmane:/path/to/gmane.msgmap.sqlite3'
 			/\Aserial:(\w+):/ or next;

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 09/11] wwwtext: show thread endpoint w/ indexlevel=basic
  2020-03-21  2:03 [PATCH 00/11] www: export SQLite altid dumps Eric Wong
                   ` (7 preceding siblings ...)
  2020-03-21  2:03 ` [PATCH 08/11] search: clobber -user_pfx on query parser initialization Eric Wong
@ 2020-03-21  2:03 ` Eric Wong
  2020-03-21  2:03 ` [PATCH 10/11] altid: warn about non-word prefixes Eric Wong
  2020-03-21  2:03 ` [PATCH 11/11] www: add endpoint to retrieve altid dumps Eric Wong
  10 siblings, 0 replies; 12+ messages in thread
From: Eric Wong @ 2020-03-21  2:03 UTC (permalink / raw)
  To: meta

And show contact info when there's no indexing, at all.
Installations where Xapian is too expensive can still support
threading since it only depends on SQLite, so we need to inform
users of what's available.
---
 lib/PublicInbox/WwwText.pm | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/lib/PublicInbox/WwwText.pm b/lib/PublicInbox/WwwText.pm
index f6d831f6..cbe82b73 100644
--- a/lib/PublicInbox/WwwText.pm
+++ b/lib/PublicInbox/WwwText.pm
@@ -256,6 +256,11 @@ EOF
 
 	$QP_URL
 
+EOF
+	} # $srch
+	my $over = $ibx->over;
+	if ($over) {
+		$$txt .= <<EOF;
 message threading
 -----------------
 
@@ -301,6 +306,10 @@ message threading
 
 	$WIKI_URL/Mbox
 
+EOF
+	} # $over
+
+	$$txt .= <<EOF;
 contact
 -------
 
@@ -309,7 +318,6 @@ contact
 
 EOF
 	# TODO: support admin contact info in ~/.public-inbox/config
-	}
 	1;
 }
 


* [PATCH 10/11] altid: warn about non-word prefixes
  2020-03-21  2:03 [PATCH 00/11] www: export SQLite altid dumps Eric Wong
                   ` (8 preceding siblings ...)
  2020-03-21  2:03 ` [PATCH 09/11] wwwtext: show thread endpoint w/ indexlevel=basic Eric Wong
@ 2020-03-21  2:03 ` Eric Wong
  2020-03-21  2:03 ` [PATCH 11/11] www: add endpoint to retrieve altid dumps Eric Wong
  10 siblings, 0 replies; 12+ messages in thread
From: Eric Wong @ 2020-03-21  2:03 UTC (permalink / raw)
  To: meta

We only support searching on prefixes matching /\A\w+\z/ because
Xapian requires ':' to delimit the prefix and splits on spaces
without quotes.

I've also verified Xapian supports multibyte UTF-8 characters,
underscores, and bare numbers as search prefixes, so there's
no need to restrict it beyond what Perl's UTF-8 aware \w
character class offers.
---
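A rough illustration of which prefixes pass the check, assuming the prefix
is a decoded character string (how the spec is decoded is not shown here).
searchable_prefix() is a hypothetical helper wrapping the same pattern.

```perl
use strict;
use warnings;

# Same pattern as the check added to AltId.pm, wrapped in a
# hypothetical helper so it can be exercised on its own.
sub searchable_prefix { $_[0] =~ /\A\w+\z/ ? 1 : 0 }

my @cases = (
    [ 'ascii word',  'gmane' ],
    [ 'underscore',  'list_2' ],
    [ 'bare number', '123' ],
    [ 'multibyte',   "\x{65e5}\x{672c}" ], # two CJK ideographs
    [ 'colon',       'a:b' ],              # ':' delimits the prefix
    [ 'space',       'a b' ],              # query parser splits on spaces
    [ 'empty',       '' ],
);
for my $c (@cases) {
    my ($label, $pfx) = @$c;
    printf "%-12s %s\n", $label,
        searchable_prefix($pfx) ? 'searchable' : 'would warn';
}
```

The first four cases match; the colon, space, and empty cases would
trigger the new warning.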
 lib/PublicInbox/AltId.pm  | 2 +-
 lib/PublicInbox/Search.pm | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/lib/PublicInbox/AltId.pm b/lib/PublicInbox/AltId.pm
index 8ce70e46..3be6c73c 100644
--- a/lib/PublicInbox/AltId.pm
+++ b/lib/PublicInbox/AltId.pm
@@ -22,7 +22,7 @@ sub new {
 	my ($class, $ibx, $spec, $writable) = @_;
 	my ($type, $prefix, $query) = split(/:/, $spec, 3);
 	$type eq 'serial' or die "non-serial not supported, yet\n";
-
+	$prefix =~ /\A\w+\z/ or warn "non-word prefix not searchable\n";
 	my %params = map {
 		my ($k, $v) = split(/=/, uri_unescape($_), 2);
 		$v = '' unless defined $v;
diff --git a/lib/PublicInbox/Search.pm b/lib/PublicInbox/Search.pm
index 372dc5a7..00dddc6b 100644
--- a/lib/PublicInbox/Search.pm
+++ b/lib/PublicInbox/Search.pm
@@ -316,6 +316,8 @@ sub qp {
 		my $user_pfx = $self->{-user_pfx} = [];
 		for (@$altid) {
 			# $_ = 'serial:gmane:/path/to/gmane.msgmap.sqlite3'
+			# note: Xapian supports multibyte UTF-8, /^[0-9]+$/,
+			# and '_' with prefixes matching \w+
 			/\Aserial:(\w+):/ or next;
 			my $pfx = $1;
 			push @$user_pfx, "$pfx:", <<EOF;


* [PATCH 11/11] www: add endpoint to retrieve altid dumps
  2020-03-21  2:03 [PATCH 00/11] www: export SQLite altid dumps Eric Wong
                   ` (9 preceding siblings ...)
  2020-03-21  2:03 ` [PATCH 10/11] altid: warn about non-word prefixes Eric Wong
@ 2020-03-21  2:03 ` Eric Wong
  10 siblings, 0 replies; 12+ messages in thread
From: Eric Wong @ 2020-03-21  2:03 UTC (permalink / raw)
  To: meta

This ensures all our indexed data, including data from altid
searches (e.g. "gmane:$ARTNUM") is retrievable.

It uses a "POST" request to avoid wasting cycles when invoked by
crawlers, since it could potentially be several megabytes of
data not indexable by search engines.
---
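The sqldump() code in this patch hands sqlite3(1) its ".dump" command
through a pipe that is written and closed before the child is spawned.
A standalone sketch of that trick follows; preloaded_stdin() is a
hypothetical name, and the read at the end only simulates what the child
would see on its stdin.

```perl
use strict;
use warnings;

# Write a short command into a pipe and close the write end before
# any reader exists.  The bytes stay in the kernel pipe buffer, so a
# small write like this does not block waiting for a reader.
sub preloaded_stdin {
    my ($cmd) = @_;
    pipe(my ($r, $w)) or die "pipe: $!";
    my $n = syswrite($w, $cmd);
    defined($n) && $n == length($cmd) or die "write: $!";
    close($w) or die "close: $!";
    $r; # the caller passes this to the child as fd 0
}

my $r = preloaded_stdin(".dump\n");
my $buf = do { local $/; <$r> }; # reader gets the command, then EOF
print "child would read: $buf";
```

Closing the write end up front means the parent never has to watch the
pipe again: the child reads the command and then sees EOF.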
 MANIFEST                    |  2 +
 lib/PublicInbox/AltId.pm    |  1 +
 lib/PublicInbox/WWW.pm      | 14 +++++-
 lib/PublicInbox/WwwAltId.pm | 94 +++++++++++++++++++++++++++++++++++++
 t/www_altid.t               | 83 ++++++++++++++++++++++++++++++++
 5 files changed, 192 insertions(+), 2 deletions(-)
 create mode 100644 lib/PublicInbox/WwwAltId.pm
 create mode 100644 t/www_altid.t

diff --git a/MANIFEST b/MANIFEST
index be1c4ab5..84872561 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -168,6 +168,7 @@ lib/PublicInbox/ViewVCS.pm
 lib/PublicInbox/WWW.pm
 lib/PublicInbox/WWW.pod
 lib/PublicInbox/WatchMaildir.pm
+lib/PublicInbox/WwwAltId.pm
 lib/PublicInbox/WwwAtomStream.pm
 lib/PublicInbox/WwwAttach.pm
 lib/PublicInbox/WwwHighlight.pm
@@ -300,6 +301,7 @@ t/view.t
 t/watch_filter_rubylang.t
 t/watch_maildir.t
 t/watch_maildir_v2.t
+t/www_altid.t
 t/www_listing.t
 t/www_static.t
 t/x-unknown-alpine.eml
diff --git a/lib/PublicInbox/AltId.pm b/lib/PublicInbox/AltId.pm
index 3be6c73c..6d16242a 100644
--- a/lib/PublicInbox/AltId.pm
+++ b/lib/PublicInbox/AltId.pm
@@ -39,6 +39,7 @@ sub new {
 	bless {
 		filename => $f,
 		writable => $writable,
+		prefix => $prefix,
 		xprefix => 'X'.uc($prefix),
 	}, $class;
 }
diff --git a/lib/PublicInbox/WWW.pm b/lib/PublicInbox/WWW.pm
index 2434f2f5..5017f572 100644
--- a/lib/PublicInbox/WWW.pm
+++ b/lib/PublicInbox/WWW.pm
@@ -65,6 +65,8 @@ sub call {
 			my ($epoch, $path) = ($2, $3);
 			return invalid_inbox($ctx, $1) ||
 				serve_git($ctx, $epoch, $path);
+		} elsif ($path_info =~ m!$INBOX_RE/(\w+)\.sql\.gz\z!o) {
+			return get_altid_dump($ctx, $1, $2);
 		} elsif ($path_info =~ m!$INBOX_RE/!o) {
 			return invalid_inbox($ctx, $1) || mbox_results($ctx);
 		}
@@ -150,8 +152,8 @@ sub preload {
 		require PublicInbox::Search;
 		PublicInbox::Search::load_xapian();
 	};
-	foreach (qw(PublicInbox::SearchView PublicInbox::MboxGz)) {
-		eval "require $_;";
+	for (qw(SearchView MboxGz WwwAltId)) {
+		eval "require PublicInbox::$_;";
 	}
 	if (ref($self)) {
 		my $pi_config = $self->{pi_config};
@@ -301,6 +303,14 @@ sub get_vcs_object ($$$;$) {
 	PublicInbox::ViewVCS::show($ctx, $oid, $filename);
 }
 
+sub get_altid_dump {
+	my ($ctx, $inbox, $altid_pfx) =@_;
+	my $r404 = invalid_inbox($ctx, $inbox);
+	return $r404 if $r404;
+	eval { require PublicInbox::WwwAltId } or return need($ctx, 'sqlite3');
+	PublicInbox::WwwAltId::sqldump($ctx, $altid_pfx);
+}
+
 sub need {
 	my ($ctx, $extra) = @_;
 	my $msg = <<EOF;
diff --git a/lib/PublicInbox/WwwAltId.pm b/lib/PublicInbox/WwwAltId.pm
new file mode 100644
index 00000000..34641a92
--- /dev/null
+++ b/lib/PublicInbox/WwwAltId.pm
@@ -0,0 +1,94 @@
+# Copyright (C) 2020 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+
+# dumps using the ".dump" command of sqlite3(1)
+package PublicInbox::WwwAltId;
+use strict;
+use PublicInbox::Qspawn;
+use PublicInbox::WwwStream;
+use PublicInbox::AltId;
+use PublicInbox::Spawn qw(which);
+our $sqlite3 = $ENV{SQLITE3};
+
+# returns prefix => pathname mapping
+# (pathname is NOT public, but prefix is used for Xapian queries)
+sub altid_map ($) {
+	my ($ibx) = @_;
+	my $altid = $ibx->{altid} or return {};
+	my %h = map {;
+		my $x = PublicInbox::AltId->new($ibx, $_);
+		"$x->{prefix}" => $x->{filename}
+	} @$altid;
+	\%h;
+}
+
+sub sqlite3_missing ($) {
+	PublicInbox::WwwStream::oneshot($_[0], 501, \<<EOF);
+<pre>sqlite3 not available
+
+The administrator needs to install the sqlite3(1) binary
+to support gzipped sqlite3 dumps.</pre>
+</pre>
+EOF
+}
+
+sub check_output {
+	my ($r, $bref, $ctx) = @_;
+	return PublicInbox::WwwStream::oneshot($ctx, 500) if !defined($r);
+	if ($r == 0) {
+		my $err = eval { $ctx->{env}->{'psgi.errors'} } // \*STDERR;
+		$err->print("unexpected EOF from sqlite3\n");
+		return PublicInbox::WwwStream::oneshot($ctx, 501);
+	}
+	[200, [ qw(Content-Type application/gzip), 'Content-Disposition',
+		"inline; filename=$ctx->{altid_pfx}.sql.gz" ] ]
+}
+
+# POST $INBOX/$prefix.sql.gz
+# we use the sqlite3(1) binary here since that's where the ".dump"
+# command is implemented, not (AFAIK) in the libsqlite3 library
+# and thus not usable from DBD::SQLite.
+sub sqldump ($$) {
+	my ($ctx, $altid_pfx) = @_;
+	my $ibx = $ctx->{-inbox};
+	my $altid_map = $ibx->{-altid_map} //= altid_map($ibx);
+	my $fn = $altid_map->{$altid_pfx};
+	unless (defined $fn) {
+		return PublicInbox::WwwStream::oneshot($ctx, 404, \<<EOF);
+<pre>`$altid_pfx' is not a valid altid for this inbox</pre>
+EOF
+	}
+
+	eval { require PublicInbox::GzipFilter } or
+		return PublicInbox::WwwStream::oneshot($ctx, 501, \<<EOF);
+<pre>gzip output not available
+
+The administrator needs to install the Compress::Raw::Zlib Perl module
+to support gzipped sqlite3 dumps.</pre>
+EOF
+	$sqlite3 //= which('sqlite3');
+	if (!defined($sqlite3)) {
+		return PublicInbox::WwwStream::oneshot($ctx, 501, \<<EOF);
+<pre>sqlite3 not available
+
+The administrator needs to install the sqlite3(1) binary
+to support gzipped sqlite3 dumps.</pre>
+</pre>
+EOF
+	}
+
+	# setup stdin, POSIX requires writes <= 512 bytes to succeed so
+	# we can close the pipe right away.
+	pipe(my ($r, $w)) or die "pipe: $!";
+	syswrite($w, ".dump\n") == 6 or die "write: $!";
+	close($w) or die "close: $!";
+
+	# TODO: use -readonly if available with newer sqlite3(1)
+	my $qsp = PublicInbox::Qspawn->new([$sqlite3, $fn], undef, { 0 => $r });
+	my $env = $ctx->{env};
+	$ctx->{altid_pfx} = $altid_pfx;
+	$env->{'qspawn.filter'} = PublicInbox::GzipFilter->new;
+	$qsp->psgi_return($env, undef, \&check_output, $ctx);
+}
+
+1;
diff --git a/t/www_altid.t b/t/www_altid.t
new file mode 100644
index 00000000..a885c389
--- /dev/null
+++ b/t/www_altid.t
@@ -0,0 +1,83 @@
+# Copyright (C) 2020 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+use strict;
+use Test::More;
+use PublicInbox::TestCommon;
+use PublicInbox::Inbox;
+use PublicInbox::InboxWritable;
+use PublicInbox::Config;
+use PublicInbox::Spawn qw(which spawn);
+which('sqlite3') or plan skip_all => 'sqlite3 binary missing';
+require_mods(qw(DBD::SQLite HTTP::Request::Common Plack::Test URI::Escape
+	Plack::Builder IO::Uncompress::Gunzip));
+use_ok($_) for qw(Plack::Test HTTP::Request::Common);
+require_ok 'PublicInbox::Msgmap';
+require_ok 'PublicInbox::AltId';
+require_ok 'PublicInbox::WWW';
+my ($inboxdir, $for_destroy) = tmpdir();
+my $aid = 'xyz';
+my $spec = "serial:$aid:file=blah.sqlite3";
+if ('setup') {
+	my $opts = {
+		inboxdir => $inboxdir,
+		name => 'test',
+		-primary_address => 'test@example.com',
+	};
+	my $ibx = PublicInbox::Inbox->new($opts);
+	$ibx = PublicInbox::InboxWritable->new($ibx, 1);
+	my $im = $ibx->importer(0);
+	my $mime = PublicInbox::MIME->new(<<'EOF');
+From: a@example.com
+Message-Id: <a@example.com>
+
+EOF
+	$im->add($mime);
+	$im->done;
+	mkdir "$inboxdir/public-inbox" or die;
+	my $altid = PublicInbox::AltId->new($ibx, $spec, 1);
+	$altid->mm_alt->mid_set(1, 'a@example.com');
+}
+
+my $cfgpath = "$inboxdir/cfg";
+open my $fh, '>', $cfgpath or die;
+print $fh <<EOF or die;
+[publicinbox "test"]
+	inboxdir = $inboxdir
+	address = test\@example.com
+	altid = $spec
+	url = http://example.com/test
+EOF
+close $fh or die;
+my $cfg = PublicInbox::Config->new($cfgpath);
+my $www = PublicInbox::WWW->new($cfg);
+my $cmpfile = "$inboxdir/cmp.sqlite3";
+my $client = sub {
+	my ($cb) = @_;
+	my $res = $cb->(POST("/test/$aid.sql.gz"));
+	is($res->code, 200, 'retrieved gzipped dump');
+	IO::Uncompress::Gunzip::gunzip(\($res->content) => \(my $buf));
+	pipe(my ($r, $w)) or die;
+	my $cmd = ['sqlite3', $cmpfile];
+	my $pid = spawn($cmd, undef, { 0 => $r });
+	print $w $buf or die;
+	close $w or die;
+	is(waitpid($pid, 0), $pid, 'sqlite3 exited');
+	is($?, 0, 'sqlite3 loaded dump');
+	my $mm_cmp = PublicInbox::Msgmap->new_file($cmpfile);
+	is($mm_cmp->mid_for(1), 'a@example.com', 'sqlite3 dump valid');
+	$mm_cmp = undef;
+	unlink $cmpfile or die;
+};
+test_psgi(sub { $www->call(@_) }, $client);
+SKIP: {
+	require_mods(qw(Plack::Test::ExternalServer), 4);
+	my $env = { PI_CONFIG => $cfgpath };
+	my $sock = tcp_server() or die;
+	my ($out, $err) = map { "$inboxdir/std$_.log" } qw(out err);
+	my $cmd = [ qw(-httpd -W0), "--stdout=$out", "--stderr=$err" ];
+	my $td = start_script($cmd, $env, { 3 => $sock });
+	my ($h, $p) = ($sock->sockhost, $sock->sockport);
+	local $ENV{PLACK_TEST_EXTERNALSERVER_URI} = "http://$h:$p";
+	Plack::Test::ExternalServer::test_psgi(client => $client);
+}
+done_testing;


Thread overview: 12+ messages
2020-03-21  2:03 [PATCH 00/11] www: export SQLite altid dumps Eric Wong
2020-03-21  2:03 ` [PATCH 01/11] qspawn: reinstate filter support, add gzip filter Eric Wong
2020-03-21  2:03 ` [PATCH 02/11] gzipfilter: lazy allocate the deflate context Eric Wong
2020-03-21  2:03 ` [PATCH 03/11] wwwstream: introduce oneshot API to avoid ->getline Eric Wong
2020-03-21  2:03 ` [PATCH 04/11] extmsg: use WwwResponse::oneshot Eric Wong
2020-03-21  2:03 ` [PATCH 05/11] wwwstream: oneshot sets content-length Eric Wong
2020-03-21  2:03 ` [PATCH 06/11] mbox: need_gzip uses WwwStream::oneshot Eric Wong
2020-03-21  2:03 ` [PATCH 07/11] qspawn: handle ENOENT (and other errors on exec) Eric Wong
2020-03-21  2:03 ` [PATCH 08/11] search: clobber -user_pfx on query parser initialization Eric Wong
2020-03-21  2:03 ` [PATCH 09/11] wwwtext: show thread endpoint w/ indexlevel=basic Eric Wong
2020-03-21  2:03 ` [PATCH 10/11] altid: warn about non-word prefixes Eric Wong
2020-03-21  2:03 ` [PATCH 11/11] www: add endpoint to retrieve altid dumps Eric Wong
