* [PATCH 00/43] www: async git cat-file w/ -httpd
@ 2020-07-05 23:27 7% Eric Wong
2020-07-05 23:27 5% ` [PATCH 14/43] mboxgz: do asynchronous git blob retrievals Eric Wong
0 siblings, 1 reply; 2+ results
From: Eric Wong @ 2020-07-05 23:27 UTC (permalink / raw)
To: meta
This allows -httpd to make better use of time it spends waiting
on git-cat-file to respond. It allows us to deal with
high-latency HDD storage without a client monopolizing the event
loop. Even on a mid-range consumer-grade SSD, this seems to
give a 10+% speed improvement for HTTP responses requiring many
blobs, including all /T/, /t/, and /t.mbox.gz endpoints.
This only benefits indexed inboxes (both v1 and v2); I'm not
sure if anybody still uses unindexed v1 inboxes nowadays.
A new xt/httpd-async-stream.t maintainer test ensures checksums
for responses before and after this series match exactly as
before.
This builds off a branch I started several months ago (but never
published here) to integrate gzip responses into our codebase
and remove our optional dependency on Plack::Middleware::Deflater.
We already gzip a bunch of things independent of
Plack::Middleware::Deflater: manifest.js.gz, altid SQLite3 dumps
and all the *.mbox.gz endpoints; so being able to use gzip on
all of our responses without an extra dependency seemed logical.
Being able to consistently use our GzipFilter API to perform
buffering via ->zmore made it significantly easier to reason
about small response chunks for ghost messages interspersed with
large ones when streaming /$INBOX/$MSGID/t/ endpoints.
I'm not yet maximizing use of ->zmore for all buffering of HTTP
responses, yet; measurements need to happen, first. That may
happen in the 1.7 time frame. In particular, we would need to
ensure the Perl method dispatch and DSO overhead to Zlib.so and
libz.so of making many ->zmore calls doesn't cause performance
regressions compared to the current `.=' use and calling
->zmore/->translate fewer times.
Eric Wong (43):
gzipfilter: minor cleanups
wwwstream: oneshot: perform gzip without middleware
www*stream: gzip ->getline responses
wwwtext: gzip text/plain responses, as well
wwwtext: switch to html_oneshot
www: need: use WwwStream::html_oneshot
wwwlisting: use GzipFilter for HTML
gzipfilter: replace Compress::Raw::Deflate usages
{gzip,noop}filter: ->zmore returns undef, always
mbox: remove html_oneshot import
wwwstatic: support gzipped response directory listings
qspawn: learn to gzip streaming responses
stop auto-loading Plack::Middleware::Deflater
mboxgz: do asynchronous git blob retrievals
mboxgz: reduce object hash depth
mbox: async blob retrieval for "single message" raw mboxrd
wwwatomstream: simplify feed_update callers
wwwatomstream: use PublicInbox::Inbox->modified for feed_updated
wwwatomstream: reuse $ctx as $self
xt/httpd-async-stream: allow more options
wwwatomstream: support asynchronous blob retrievals
wwwstream: reduce object graph depth
wwwstream: reduce blob retrieval paths for ->getline
www: start making gzipfilter the parent response class
remove unused/redundant zlib-related imports
wwwstream: use parent.pm and no warnings
wwwstream: subclass off GzipFilter
view: wire up /$INBOX/$MESSAGE_ID/ permalink to async
view: /$INBOX/$MSGID/t/ reads blobs asynchronously
view: update /$INBOX/$MSGID/T/ to be async
feed: generate_i: eliminate pointless loop
feed: /$INBOX/new.html retrieves blobs asynchronously
ssearchview: /$INBOX/?q=$QUERY&x=t uses async blobs
view: eml_entry: reduce parameters
view: /$INBOX/$MSGID/t/: avoid extra hash lookup in eml case
wwwstream: eliminate ::response, use html_oneshot
www: update internal docs
view: simplify eml_entry callers further
wwwtext: simplify gzf_maybe use
wwwattach: support async blob retrievals
gzipfilter: drop HTTP connection on bugs or data corruption
daemon: warn on missing blobs
gzipfilter: check http->{forward} for client disconnects
Documentation/mknews.perl | 20 +--
Documentation/public-inbox-httpd.pod | 1 -
Documentation/technical/ds.txt | 4 +-
INSTALL | 5 -
MANIFEST | 2 +
ci/deps.perl | 1 -
examples/cgit.psgi | 8 -
examples/newswww.psgi | 8 -
examples/public-inbox.psgi | 9 --
examples/unsubscribe.psgi | 1 -
lib/PublicInbox/CompressNoop.pm | 22 +++
lib/PublicInbox/Feed.pm | 22 ++-
lib/PublicInbox/GetlineBody.pm | 4 +-
lib/PublicInbox/GzipFilter.pm | 168 +++++++++++++++++---
lib/PublicInbox/HTTP.pm | 7 +
lib/PublicInbox/HTTPD.pm | 5 +-
lib/PublicInbox/IMAP.pm | 1 +
lib/PublicInbox/Mbox.pm | 137 +++++++++--------
lib/PublicInbox/MboxGz.pm | 81 ++++------
lib/PublicInbox/NNTP.pm | 1 +
lib/PublicInbox/Qspawn.pm | 6 +-
lib/PublicInbox/SearchView.pm | 40 ++---
lib/PublicInbox/View.pm | 219 ++++++++++++++-------------
lib/PublicInbox/WWW.pm | 9 +-
lib/PublicInbox/WwwAtomStream.pm | 66 ++++----
lib/PublicInbox/WwwAttach.pm | 63 ++++++--
lib/PublicInbox/WwwListing.pm | 24 +--
lib/PublicInbox/WwwStatic.pm | 14 +-
lib/PublicInbox/WwwStream.pm | 110 ++++++++------
lib/PublicInbox/WwwText.pm | 26 ++--
script/public-inbox-httpd | 9 --
script/public-inbox.cgi | 7 -
t/httpd-corner.psgi | 7 +
t/httpd-corner.t | 9 +-
t/plack.t | 4 +
t/psgi_attach.t | 162 +++++++++++---------
t/psgi_text.t | 33 +++-
t/psgi_v2.t | 80 ++++++++--
t/www_listing.t | 8 +-
t/www_static.t | 11 +-
xt/httpd-async-stream.t | 104 +++++++++++++
41 files changed, 964 insertions(+), 554 deletions(-)
create mode 100644 lib/PublicInbox/CompressNoop.pm
create mode 100644 xt/httpd-async-stream.t
^ permalink raw reply [relevance 7%]
* [PATCH 14/43] mboxgz: do asynchronous git blob retrievals
2020-07-05 23:27 7% [PATCH 00/43] www: async git cat-file w/ -httpd Eric Wong
@ 2020-07-05 23:27 5% ` Eric Wong
0 siblings, 0 replies; 2+ results
From: Eric Wong @ 2020-07-05 23:27 UTC (permalink / raw)
To: meta
This lets the -httpd worker process make better use of time
instead of waiting for git-cat-file to respond. With 4 jobs in
the new test case against a clone of
<https://public-inbox.org/meta/>, a speedup of 10-12% is shown.
Even a single job shows a 2-5% improvement on an SSD.
---
MANIFEST | 1 +
lib/PublicInbox/HTTP.pm | 7 +++
lib/PublicInbox/MboxGz.pm | 69 +++++++++++++++++++++++----
xt/httpd-async-stream.t | 99 +++++++++++++++++++++++++++++++++++++++
4 files changed, 167 insertions(+), 9 deletions(-)
create mode 100644 xt/httpd-async-stream.t
diff --git a/MANIFEST b/MANIFEST
index dcd7a7e5f..9b0f50203 100644
--- a/MANIFEST
+++ b/MANIFEST
@@ -368,6 +368,7 @@ xt/cmp-msgview.t
xt/eml_check_limits.t
xt/git-http-backend.t
xt/git_async_cmp.t
+xt/httpd-async-stream.t
xt/imapd-mbsync-oimap.t
xt/imapd-validate.t
xt/mem-imapd-tls.t
diff --git a/lib/PublicInbox/HTTP.pm b/lib/PublicInbox/HTTP.pm
index 828174653..5844ef440 100644
--- a/lib/PublicInbox/HTTP.pm
+++ b/lib/PublicInbox/HTTP.pm
@@ -488,6 +488,13 @@ sub busy () {
($self->{rbuf} || exists($self->{env}) || $self->{wbuf});
}
+# runs $cb on the next iteration of the event loop at earliest
+sub next_step {
+ my ($self, $cb) = @_;
+ return unless exists $self->{sock};
+ $self->requeue if 1 == push(@{$self->{wbuf}}, $cb);
+}
+
# Chunked and Identity packages are used for writing responses.
# They may be exposed to the PSGI application when the PSGI app
# returns a CODE ref for "push"-based responses
diff --git a/lib/PublicInbox/MboxGz.pm b/lib/PublicInbox/MboxGz.pm
index 535ef96c9..8c9010afb 100644
--- a/lib/PublicInbox/MboxGz.pm
+++ b/lib/PublicInbox/MboxGz.pm
@@ -6,6 +6,9 @@ use parent 'PublicInbox::GzipFilter';
use PublicInbox::Eml;
use PublicInbox::Hval qw/to_filename/;
use PublicInbox::Mbox;
+use PublicInbox::GitAsyncCat;
+*msg_hdr = \&PublicInbox::Mbox::msg_hdr;
+*msg_body = \&PublicInbox::Mbox::msg_body;
sub new {
my ($class, $ctx, $cb) = @_;
@@ -17,33 +20,81 @@ sub new {
}, $class;
}
+# this is public-inbox-httpd-specific
+sub mboxgz_blob_cb { # git->cat_async callback
+ my ($bref, $oid, $type, $size, $self) = @_;
+ my $http = $self->{ctx}->{env}->{'psgix.io'} or return; # client abort
+ my $smsg = delete $self->{smsg} or die 'BUG: no smsg';
+ if (!defined($oid)) {
+ # it's possible to have TOCTOU if an admin runs
+ # public-inbox-(edit|purge), just move onto the next message
+ return $http->next_step(\&async_next);
+ } else {
+ $smsg->{blob} eq $oid or die "BUG: $smsg->{blob} != $oid";
+ }
+ $self->zmore(msg_hdr($self->{ctx},
+ PublicInbox::Eml->new($bref)->header_obj,
+ $smsg->{mid}));
+
+ # PublicInbox::HTTP::{Chunked,Identity}::write
+ $self->{http_out}->write($self->translate(msg_body($$bref)));
+
+ $http->next_step(\&async_next);
+}
+
+# this is public-inbox-httpd-specific
+sub async_step ($) {
+ my ($self) = @_;
+ if (my $smsg = $self->{smsg} = $self->{cb}->($self->{ctx})) {
+ git_async_cat($self->{ctx}->{-inbox}->git, $smsg->{blob},
+ \&mboxgz_blob_cb, $self);
+ } elsif (my $out = delete $self->{http_out}) {
+ $out->write($self->zflush);
+ $out->close;
+ }
+}
+
+# called by PublicInbox::DS::write
+sub async_next {
+ my ($http) = @_; # PublicInbox::HTTP
+ async_step($http->{forward});
+}
+
+# called by PublicInbox::HTTP::close, or any other PSGI server
+sub close { !!delete($_[0]->{http_out}) }
+
sub response {
my ($class, $ctx, $cb, $fn) = @_;
- my $body = $class->new($ctx, $cb);
+ my $self = $class->new($ctx, $cb);
# http://www.iana.org/assignments/media-types/application/gzip
$fn = defined($fn) && $fn ne '' ? to_filename($fn) : 'no-subject';
my $h = [ qw(Content-Type application/gzip),
'Content-Disposition', "inline; filename=$fn.mbox.gz" ];
- [ 200, $h, $body ];
+ if ($ctx->{env}->{'pi-httpd.async'}) {
+ sub {
+ my ($wcb) = @_; # -httpd provided write callback
+ $self->{http_out} = $wcb->([200, $h]);
+ $self->{ctx}->{env}->{'psgix.io'}->{forward} = $self;
+ async_step($self); # start stepping
+ };
+ } else { # generic PSGI
+ [ 200, $h, $self ];
+ }
}
-# called by Plack::Util::foreach or similar
+# called by Plack::Util::foreach or similar (generic PSGI)
sub getline {
my ($self) = @_;
my $ctx = $self->{ctx} or return;
while (my $smsg = $self->{cb}->($ctx)) {
my $mref = $ctx->{-inbox}->msg_by_smsg($smsg) or next;
my $h = PublicInbox::Eml->new($mref)->header_obj;
- $self->zmore(
- PublicInbox::Mbox::msg_hdr($ctx, $h, $smsg->{mid})
- );
- return $self->translate(PublicInbox::Mbox::msg_body($$mref));
+ $self->zmore(msg_hdr($ctx, $h, $smsg->{mid}));
+ return $self->translate(msg_body($$mref));
}
# signal that we're done and can return undef next call:
delete $self->{ctx};
$self->zflush;
}
-sub close {} # noop
-
1;
diff --git a/xt/httpd-async-stream.t b/xt/httpd-async-stream.t
new file mode 100644
index 000000000..29bcb6125
--- /dev/null
+++ b/xt/httpd-async-stream.t
@@ -0,0 +1,99 @@
+#!perl -w
+# Copyright (C) 2020 all contributors <meta@public-inbox.org>
+# License: AGPL-3.0+ <https://www.gnu.org/licenses/agpl-3.0.txt>
+# Expensive test to validate compression and TLS.
+use strict;
+use Test::More;
+use PublicInbox::TestCommon;
+use PublicInbox::DS qw(now);
+use PublicInbox::Spawn qw(which popen_rd);
+use Digest::MD5;
+use POSIX qw(_exit);
+my $inboxdir = $ENV{GIANT_INBOX_DIR};
+plan skip_all => "GIANT_INBOX_DIR not defined for $0" unless $inboxdir;
+my $curl = which('curl') or plan skip_all => "curl(1) missing for $0";
+my ($tmpdir, $for_destroy) = tmpdir();
+require_mods(qw(DBD::SQLite));
+my $JOBS = $ENV{TEST_JOBS} // 4;
+diag "TEST_JOBS=$JOBS";
+
+my $make_local_server = sub {
+ my $pi_config = "$tmpdir/config";
+ open my $fh, '>', $pi_config or die "open($pi_config): $!";
+ print $fh <<"" or die "print $pi_config: $!";
+[publicinbox "test"]
+inboxdir = $inboxdir
+address = test\@example.com
+
+ close $fh or die "close($pi_config): $!";
+ my ($out, $err) = ("$tmpdir/out", "$tmpdir/err");
+ for ($out, $err) {
+ open my $fh, '>', $_ or die "truncate: $!";
+ }
+ my $http = tcp_server();
+ my $rdr = { 3 => $http };
+
+ # not using multiple workers, here, since we want to increase
+ # the chance of tripping concurrency bugs within PublicInbox/HTTP*.pm
+ my $cmd = [ '-httpd', "--stdout=$out", "--stderr=$err", '-W0' ];
+ my $host_port = $http->sockhost.':'.$http->sockport;
+ push @$cmd, "-lhttp://$host_port";
+ my $url = "$host_port/test/all.mbox.gz";
+ print STDERR "# CMD ". join(' ', @$cmd). "\n";
+ my $env = { PI_CONFIG => $pi_config };
+ (start_script($cmd, $env, $rdr), $url);
+};
+
+my ($td, $url) = $make_local_server->();
+
+my $do_get_all = sub {
+ my ($job) = @_;
+ local $SIG{__DIE__} = sub { print STDERR $job, ': ', @_; _exit(1) };
+ my $dig = Digest::MD5->new;
+ my ($buf, $nr);
+ my $bytes = 0;
+ my $t0 = now();
+ my ($rd, $pid) = popen_rd([$curl, qw(-HHost:example.com -sSf), $url]);
+ while (1) {
+ $nr = sysread($rd, $buf, 65536);
+ last if !$nr;
+ $dig->add($buf);
+ $bytes += $nr;
+ }
+ my $res = $dig->hexdigest;
+ my $elapsed = sprintf('%0.3f', now() - $t0);
+ close $rd or die "close curl failed: $!\n";
+ waitpid($pid, 0) == $pid or die "waitpid failed: $!\n";
+ $? == 0 or die "curl failed: $?\n";
+ print STDERR "# $job $$ ($?) $res (${elapsed}s) $bytes bytes\n";
+ $res;
+};
+
+my (%pids, %res);
+for my $job (1..$JOBS) {
+ pipe(my ($r, $w)) or die;
+ my $pid = fork;
+ if ($pid == 0) {
+ close $r or die;
+ my $res = $do_get_all->($job);
+ print $w $res or die;
+ close $w or die;
+ _exit(0);
+ }
+ close $w or die;
+ $pids{$pid} = [ $job, $r ];
+}
+
+while (scalar keys %pids) {
+ my $pid = waitpid(-1, 0) or next;
+ my $child = delete $pids{$pid} or next;
+ my ($job, $rpipe) = @$child;
+ is($?, 0, "$job done");
+ my $sum = do { local $/; <$rpipe> };
+ push @{$res{$sum}}, $job;
+}
+is(scalar keys %res, 1, 'all got the same result');
+$td->kill;
+$td->join;
+is($?, 0, 'no error on -httpd exit');
+done_testing;
^ permalink raw reply related [relevance 5%]
Results 1-2 of 2 | reverse | options above
-- pct% links below jump to the message on this page, permalinks otherwise --
2020-07-05 23:27 7% [PATCH 00/43] www: async git cat-file w/ -httpd Eric Wong
2020-07-05 23:27 5% ` [PATCH 14/43] mboxgz: do asynchronous git blob retrievals Eric Wong
Code repositories for project(s) associated with this public inbox
https://80x24.org/public-inbox.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).