user/dev discussion of public-inbox itself
* sample robots.txt to reduce WWW load
From: Eric Wong @ 2024-04-01 13:21 UTC (permalink / raw)
  To: meta

Performance is still poor, and crawler traffic patterns tend to
do bad things to caches at all levels, so I've regretfully had
to experiment with robots.txt to mitigate performance problems.

The /s/ solver endpoint remains expensive but commit
8d6a50ff2a44 (www: use a dedicated limiter for blob solver, 2024-03-11)
seems to have helped significantly.

All the multi-message endpoints (/[Tt]*) are of course expensive
and always have been.  git blob access over a SATA 2 SSD isn't too
fast, and HTML rendering is quite expensive in Perl.  Keeping
multiple zlib contexts alive for HTTP gzip also hurts memory usage,
so we want to minimize how long clients hold onto longer-lived
allocations.

Anyways, this is the robots.txt I've been experimenting with,
and (after a few days, once bots picked it up) it seems to have
cut load on my system significantly, so I can actually work on
performance problems[1] as they show up.

==> robots.txt <==
User-Agent: *
Disallow: /*/s/
Disallow: /*/T/
Disallow: /*/t/
Disallow: /*/t.atom
Disallow: /*/t.mbox.gz
Allow: /

I also disable git-archive snapshots for cgit || WwwCoderepo:

Disallow: /*/snapshot/*
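For reference, the `*' and `$' wildcards in those Disallow rules are
the de-facto extension honored by most large crawlers; plain rules are
prefix matches.  A minimal matcher to sanity-check the rules against
sample paths (illustrative only; the paths below are hypothetical):

```perl
use strict;
use warnings;

# Match a request path against one robots.txt rule.  '*' matches any
# run of characters, a trailing '$' anchors the match at end-of-path,
# and everything else is a literal prefix match.
sub rule_matches {
	my ($pattern, $path) = @_;
	my $anchored = $pattern =~ s/\$\z//;
	my $re = join '', map { $_ eq '*' ? '.*' : quotemeta } split //, $pattern;
	$re .= '.*' unless $anchored; # plain rules match any suffix
	$path =~ /\A$re\z/ ? 1 : 0;
}

# e.g. solver and thread views are blocked, a single message is not:
# rule_matches('/*/s/', '/meta/s/8d6a50ff2a44/') => 1
# rule_matches('/*/t.mbox.gz', '/meta/t.mbox.gz') => 1
# rule_matches('/*/s/', '/meta/some-message@example/') => 0
```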


[1] I'm testing a glibc patch which hopefully reduces fragmentation.
    I've temporarily gotten rid of many of the Disallow: entries
    since then.

^ permalink raw reply	[relevance 7%]

* [PATCH 0/4] memory reductions for WWW + solver
From: Eric Wong @ 2024-03-11 19:40 UTC (permalink / raw)
  To: meta

1/4 gets rid of some overload caused by parallel solver
invocations under heavy (likely bot) traffic crawling
yhbt.net/lore with many coderepos enabled and joined
to inboxes.

2/4 is a large reduction in allocations from loading
coderepo <=> inbox associations; 4/4 is a smaller one.
I found 2/4 with Devel::Mwrap and noticed 4/4 while
working on 2/4.
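The deduplication in 2/4 and 4/4 boils down to interning: returning
one canonical scalar per distinct string so identical values share
storage (copy-on-write on Perl 5.20+) instead of each record
allocating its own copy.  A minimal sketch with a hypothetical helper
name, not the actual CodeSearch code:

```perl
use strict;
use warnings;

my %interned; # string value => canonical copy
sub intern_str { # return the shared canonical scalar for a string
	$interned{$_[0]} //= $_[0];
}

# thousands of records naming a handful of repos now share buffers:
my @records = map {
	{ nick => intern_str('public-inbox.git'), score => $_ }
} 1 .. 1000;
```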

3/4 is just a doc update, but I've been successfully using
jemalloc on my lore+gko mirror for a week or two now
(and I plan to experiment with making glibc||dlmalloc more
resistant to fragmentation).

Eric Wong (4):
  www: use a dedicated limiter for blob solver
  codesearch: deduplicate {ibx_score} name pairs
  doc: tuning: note reduced fragmentation w/ jemalloc
  codesearch: deduplicate $git->{nick} field

 Documentation/public-inbox-tuning.pod |  5 +++
 examples/public-inbox-netd@.service   |  2 ++
 lib/PublicInbox/CodeSearch.pm         | 14 ++++++--
 lib/PublicInbox/SolverGit.pm          | 15 +++++----
 lib/PublicInbox/ViewVCS.pm            | 48 ++++++++++++++++++++++-----
 5 files changed, 66 insertions(+), 18 deletions(-)
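The jemalloc setup 3/4 documents is typically just an LD_PRELOAD in
the systemd unit; the library path below is an assumption (Debian
x86_64) and varies by distro:

```ini
# examples/public-inbox-netd@.service (path varies by distro)
[Service]
Environment=LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2
```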


* [PATCH 1/4] www: use a dedicated limiter for blob solver
From: Eric Wong @ 2024-03-11 19:40 UTC (permalink / raw)
  To: meta

Wrap the entire solver command chain with a dedicated limiter.
The normal limiter is designed for longer-lived commands, or ones
which serve a single HTTP request (e.g. git-http-backend or
cgit), and is not effective for the short, memory- and CPU-intensive
commands used by solver.

Each overall solver request is both memory + CPU intensive: it
spawns several short-lived git processes(*) in addition to a
longer-lived `git cat-file --batch' process.

Thus running parallel solvers from a single -netd/-httpd worker
(which has its own parallelization) results in excessive
parallelism that is both memory- and CPU-bound (not network-bound)
and cascades into slowdowns in handling simpler memory/CPU-bound
requests.  Parallel solvers were also responsible for the
increased lifetime and frequency of zombies, since the event loop
was too saturated to reap them.

We'll also return 503 on excessive solver queueing, since each
queued request holds onto an FD for its client HTTP(S) socket.

(*) git (update-index|apply|ls-files) are all run by solver and
    short-lived
---
 lib/PublicInbox/SolverGit.pm | 15 ++++++-----
 lib/PublicInbox/ViewVCS.pm   | 48 +++++++++++++++++++++++++++++-------
 2 files changed, 48 insertions(+), 15 deletions(-)

diff --git a/lib/PublicInbox/SolverGit.pm b/lib/PublicInbox/SolverGit.pm
index 4e79f750..296e7d17 100644
--- a/lib/PublicInbox/SolverGit.pm
+++ b/lib/PublicInbox/SolverGit.pm
@@ -256,6 +256,12 @@ sub update_index_result ($$) {
 	next_step($self); # onto do_git_apply
 }
 
+sub qsp_qx ($$$) {
+	my ($self, $qsp, $cb) = @_;
+	$qsp->{qsp_err} = \($self->{-qsp_err} = '');
+	$qsp->psgi_qx($self->{psgi_env}, $self->{limiter}, $cb, $self);
+}
+
 sub prepare_index ($) {
 	my ($self) = @_;
 	my $patches = $self->{patches};
@@ -284,9 +290,8 @@ sub prepare_index ($) {
 	my $cmd = [ qw(git update-index -z --index-info) ];
 	my $qsp = PublicInbox::Qspawn->new($cmd, $self->{git_env}, $rdr);
 	$path_a = git_quote($path_a);
-	$qsp->{qsp_err} = \($self->{-qsp_err} = '');
 	$self->{-msg} = "index prepared:\n$mode_a $oid_full\t$path_a";
-	$qsp->psgi_qx($self->{psgi_env}, undef, \&update_index_result, $self);
+	qsp_qx $self, $qsp, \&update_index_result;
 }
 
 # pure Perl "git init"
@@ -465,8 +470,7 @@ sub apply_result ($$) { # qx_cb
 	my @cmd = qw(git ls-files -s -z);
 	my $qsp = PublicInbox::Qspawn->new(\@cmd, $self->{git_env});
 	$self->{-cur_di} = $di;
-	$qsp->{qsp_err} = \($self->{-qsp_err} = '');
-	$qsp->psgi_qx($self->{psgi_env}, undef, \&ls_files_result, $self);
+	qsp_qx $self, $qsp, \&ls_files_result;
 }
 
 sub do_git_apply ($) {
@@ -495,8 +499,7 @@ sub do_git_apply ($) {
 	my $opt = { 2 => 1, -C => _tmp($self)->dirname, quiet => 1 };
 	my $qsp = PublicInbox::Qspawn->new(\@cmd, $self->{git_env}, $opt);
 	$self->{-cur_di} = $di;
-	$qsp->{qsp_err} = \($self->{-qsp_err} = '');
-	$qsp->psgi_qx($self->{psgi_env}, undef, \&apply_result, $self);
+	qsp_qx $self, $qsp, \&apply_result;
 }
 
 sub di_url ($$) {
diff --git a/lib/PublicInbox/ViewVCS.pm b/lib/PublicInbox/ViewVCS.pm
index 61329db6..790b9a2c 100644
--- a/lib/PublicInbox/ViewVCS.pm
+++ b/lib/PublicInbox/ViewVCS.pm
@@ -49,6 +49,10 @@ my %GIT_MODE = (
 	'160000' => 'g', # commit (gitlink)
 );
 
+# TODO: not fork safe, but we don't fork w/o exec in PublicInbox::WWW
+my (@solver_q, $solver_lim);
+my $solver_nr = 0;
+
 sub html_page ($$;@) {
 	my ($ctx, $code) = @_[0, 1];
 	my $wcb = delete $ctx->{-wcb};
@@ -614,26 +618,52 @@ sub show_blob { # git->cat_async callback
 		'</code></pre></td></tr></table>'.dbg_log($ctx), @def);
 }
 
-# GET /$INBOX/$GIT_OBJECT_ID/s/
-# GET /$INBOX/$GIT_OBJECT_ID/s/$FILENAME
-sub show ($$;$) {
-	my ($ctx, $oid_b, $fn) = @_;
-	my $hints = $ctx->{hints} = {};
+sub start_solver ($) {
+	my ($ctx) = @_;
 	while (my ($from, $to) = each %QP_MAP) {
 		my $v = $ctx->{qp}->{$from} // next;
-		$hints->{$to} = $v if $v ne '';
+		$ctx->{hints}->{$to} = $v if $v ne '';
 	}
-	$ctx->{fn} = $fn;
-	$ctx->{-tmp} = File::Temp->newdir("solver.$oid_b-XXXX", TMPDIR => 1);
+	$ctx->{-next_solver} = PublicInbox::OnDestroy->new($$, \&next_solver);
+	++$solver_nr;
+	$ctx->{-tmp} = File::Temp->newdir("solver.$ctx->{oid_b}-XXXX",
+						TMPDIR => 1);
 	$ctx->{lh} or open $ctx->{lh}, '+>>', "$ctx->{-tmp}/solve.log";
 	my $solver = PublicInbox::SolverGit->new($ctx->{ibx},
 						\&solve_result, $ctx);
+	$solver->{limiter} = $solver_lim;
 	$solver->{gits} //= [ $ctx->{git} ];
 	$solver->{tmp} = $ctx->{-tmp}; # share tmpdir
 	# PSGI server will call this immediately and give us a callback (-wcb)
+	$solver->solve(@$ctx{qw(env lh oid_b hints)});
+}
+
+# run the next solver job when done and DESTROY-ed
+sub next_solver {
+	--$solver_nr;
+	# XXX FIXME: client may've disconnected if it waited a long while
+	start_solver(shift(@solver_q) // return);
+}
+
+sub may_start_solver ($) {
+	my ($ctx) = @_;
+	$solver_lim //= $ctx->{www}->{pi_cfg}->limiter('codeblob');
+	if ($solver_nr >= $solver_lim->{max}) {
+		@solver_q > 128 ? html_page($ctx, 503, 'too busy')
+				: push(@solver_q, $ctx);
+	} else {
+		start_solver($ctx);
+	}
+}
+
+# GET /$INBOX/$GIT_OBJECT_ID/s/
+# GET /$INBOX/$GIT_OBJECT_ID/s/$FILENAME
+sub show ($$;$) {
+	my ($ctx, $oid_b, $fn) = @_;
+	@$ctx{qw(oid_b fn)} = ($oid_b, $fn);
 	sub {
 		$ctx->{-wcb} = $_[0]; # HTTP write callback
-		$solver->solve($ctx->{env}, $ctx->{lh}, $oid_b, $hints);
+		may_start_solver $ctx;
 	};
 }
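The admission-control pattern the patch introduces (run up to the
limiter's max jobs, queue up to 128 more, 503 beyond that, and start
the next queued job when one finishes) can be sketched standalone;
the names here are hypothetical, the real logic lives in
may_start_solver/next_solver above:

```perl
use strict;
use warnings;

my $MAX_PARALLEL = 4;   # stands in for $solver_lim->{max}
my $MAX_QUEUED   = 128; # beyond this, reply 503 instead of queueing
my @queue;
my $running = 0;

sub submit { # returns 'started', 'queued' or 'busy' (=> HTTP 503)
	my ($job) = @_;
	if ($running >= $MAX_PARALLEL) {
		return 'busy' if @queue >= $MAX_QUEUED;
		push @queue, $job;
		return 'queued';
	}
	++$running;
	$job->();
	'started';
}

sub job_done { # analogous to next_solver firing via OnDestroy
	--$running;
	my $next = shift @queue or return;
	++$running;
	$next->();
}
```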
 



Code repositories for project(s) associated with this public inbox

	https://80x24.org/public-inbox.git
