From: Jakub Narebski <jnareb@gmail.com>
To: git@vger.kernel.org
Cc: "John 'Warthog9' Hawley" <warthog9@kernel.org>,
"John 'Warthog' Hawley" <warthog19@eaglescrag.net>,
Petr Baudis <pasky@ucw.cz>, Petr Baudis <pasky@suse.cz>,
admin@repo.or.cz
Subject: [RFD] Possible improvements for output caching in gitweb
Date: Sun, 10 Oct 2010 22:32:07 +0200
Message-ID: <201010102232.08924.jnareb@gmail.com>
In-Reply-To: <1286402526-13143-1-git-send-email-jnareb@gmail.com>
On Thu, 7 Oct 2010, Jakub Narebski wrote:
> TODO list and areas of possible improvements would be send in separate
> email.
Here they are. What do you think about them: which are needed, which
would be nice to have, which are not worth the trouble, and, perhaps
most important, which ones are missing? Also, in what order of
importance should they be worked on?
Note that the list below deliberately omits improvements to the code
itself (those were already commented on; thanks again).
New features and improvements related directly to cache or capture:
* Ajax-y progress indicator (perhaps inside skeleton of page)
see 9982b6f (gitweb: Ajax-y "Generating..." page when regenerating
cache (WIP), 2010-01-24) on 'gitweb/cache-kernel' branch in the
http://repo.or.cz/w/git/jnareb-git.git repository.
Instead of relying on the http-equiv refresh trick (which uses the fact
that web browsers render an incomplete page, and that the refresh is
done only after the page is received in full), use XMLHttpRequest to
get the (re)generated version of the page, displaying progress info
while at it, and redraw the page when the data is received in full.
All of this works only when JavaScript is enabled, so I guess the old
trick should be kept as a fallback.
This of course assumes that the progress indicator is important...
* error handler, like in CHI
Instead of using 'die <message>' and relying on CGI.pm and gitweb
catching the exception and displaying it (set_message from CGI.pm),
pass an error handler (a wrapped die_error) to the cache constructor.
In the case of the CHI caching framework, there are the 'on_get_error'
and 'on_set_error' options.
In the original patches by J.H., subroutines from cache.pm used
die_error directly; this was possible only because the file was
loaded using "do 'cache.pm';" as a kind of mixin / role into gitweb
code.
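A minimal sketch of passing an error handler to the cache constructor.
The package name My::SketchCache and its single 'on_error' option are
made up for illustration (CHI itself has separate 'on_get_error' and
'on_set_error' options); in gitweb the handler would presumably wrap
die_error.

```perl
use strict;
use warnings;

# Hypothetical toy cache, for illustration only (not GitwebCache or CHI).
package My::SketchCache;

sub new {
    my ($class, %opts) = @_;
    my $self = {
        # default handler simply rethrows, like a plain 'die <message>'
        on_error => $opts{on_error} || sub { die @_ },
        store    => {},
    };
    return bless $self, $class;
}

sub set {
    my ($self, $key, $value) = @_;
    my $ok = eval {
        die "undefined value for key '$key'\n" unless defined $value;
        $self->{store}{$key} = $value;
        1;
    };
    # the cache never calls die_error itself; it only reports to the handler
    $self->{on_error}->("cache set failed: $@") unless $ok;
    return $ok ? $value : undef;
}

sub get {
    my ($self, $key) = @_;
    return $self->{store}{$key};
}

package main;

# In gitweb the handler could be something like
#   sub { die_error(500, "Cache error", @_) }
# here it just collects the messages.
my @errors;
my $cache = My::SketchCache->new(on_error => sub { push @errors, $_[0] });
$cache->set(summary => '<html>...</html>');
$cache->set(broken  => undef);    # reported through the handler
```

This keeps the cache module free of any dependency on gitweb
internals, which is what loading cache.pm as a mixin was working around.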
* $capture option to cache_output()
Currently in GitwebCache::CacheOutput the capture engine used to
capture gitweb output for caching is hardcoded; mind you, one would
need to change code in only two places to use a different compatible
capture engine, but it would still require changing code. It would
be better to pass $capture as a parameter to cache_output(), just like
$cache is.
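A rough sketch of what passing the capture engine as a parameter could
look like. Everything here is an assumption made for illustration: the
My::SelectCapture package, the capture() method name, the argument
order of cache_output_sketch(), and the plain hash standing in for a
cache object.

```perl
use strict;
use warnings;

# Hypothetical capture engine: anything with a capture($code) method
# that returns what $code printed would do.
package My::SelectCapture;

sub new { return bless {}, shift }

sub capture {
    my ($self, $code) = @_;
    my $data = '';
    open my $fh, '>', \$data or die "cannot open in-memory file: $!";
    my $old_fh = select($fh);       # redirect default output
    my $ok  = eval { $code->(); 1 };
    my $err = $@;
    select($old_fh);                # always restore it
    close $fh;
    die $err unless $ok;
    return $data;
}

package main;

# Sketch only: $cache is a plain hash reference standing in for a real
# cache object.
sub cache_output_sketch {
    my ($cache, $capture, $key, $code) = @_;
    my $data = $cache->{$key};
    if (!defined $data) {
        $data = $capture->capture($code);
        $cache->{$key} = $data;
    }
    print $data;
    return $data;
}

my %store;
my $runs = 0;
my $page = sub { $runs++; print "Content-Type: text/plain\n\nhello\n" };
cache_output_sketch(\%store, My::SelectCapture->new, 'a=summary', $page);
cache_output_sketch(\%store, My::SelectCapture->new, 'a=summary', $page);
# the second call is served from %store, so $page ran only once
```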
* POD documentation instead of comments + make doc
Currently the gitweb caching modules are documented (like the original
cache.pm by J.H.) only in comments, and a bit in gitweb/README.
Though those modules are used only internally, it might be better
to use POD (perlpod) for documentation, like in Git.pm.
We might also want to add a 'doc' target to gitweb/Makefile, though
it might be difficult (see perl/Makefile.PL and the generated perl.mak).
* cache expire variation a la CHI
CHI (caching interface in Perl) supports 'expires_variance' parameter,
which according to documentation:
"Controls the variable expiration feature, which allows items to
expire a little earlier than the stated expiration time to help
prevent cache miss stampedes.
Value is between 0.0 and 1.0, with 0.0 meaning that items expire
exactly when specified (feature is disabled), and 1.0 meaning that
items might expire anytime from now til the stated expiration time.
The default is 0.0. A setting of 0.10 to 0.25 would introduce a
small amount of variation without interfering too much with intended
expiration times."
See http://p3rl.org/CHI (or CHI manpage, if you have it installed).
This feature is about *avoiding* cache miss stampede, while locking
is used to ensure that only one process is regenerating cache for
a given entry.
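To make the quoted description concrete, here is one possible way to
implement such a probabilistic early expiration. It is a sketch under
assumptions, not CHI's code: the helper name and its arguments are
hypothetical, and the linear ramp of the expiry probability inside the
variance window is just one reasonable choice.

```perl
use strict;
use warnings;

# Hypothetical helper. $variance is between 0.0 and 1.0; $rand lets the
# caller (or a test) supply the random number instead of rand().
sub is_expired_with_variance {
    my ($created_at, $expires_in, $variance, $now, $rand) = @_;
    $rand = rand() unless defined $rand;

    my $expires_at = $created_at + $expires_in;
    # start of the window in which an entry *may* already expire
    my $early_at   = $expires_at - $expires_in * $variance;

    return 1 if $now >= $expires_at;    # definitely expired
    return 0 if $now <  $early_at;      # definitely fresh
    # inside the window: probability grows linearly from 0 to 1
    return $rand < ($now - $early_at) / ($expires_at - $early_at) ? 1 : 0;
}

# entry created at t=0, lifetime 100s, variance 0.25: window is 75..100
my $fresh  = is_expired_with_variance(0, 100, 0.25, 50);        # 0
my $maybe  = is_expired_with_variance(0, 100, 0.25, 80, 0.1);   # 1 (0.1 < 0.2)
my $stale  = is_expired_with_variance(0, 100, 0.25, 100);       # 1
```

With a variance of 0.0 the window is empty, so entries expire exactly
at the stated time; different processes drawing different random
numbers is what spreads regeneration out in time.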
* benchmarks for different caches under light and under heavy load;
profiling of gitweb with caching using Devel::NYTProf.
The difficulty is both preparing repositories and generating traffic
(or IO pressure) that represents a real-life situation, where gitweb
is supposedly IO bound rather than CPU bound.
-------------------------------------------------------------
Below are cache-related improvements that require
GitwebCache::CacheOutput to be aware that it caches an HTTP response,
which consists of HTTP headers (text) separated by an empty line
from the body of the response (which can be binary).
This can be done either by parsing the response or the retrieved cache
entry, or by storing headers and body separately, or by using some Perl
data structure (like for example the one used by PSGI) and storing it
serialized (though serialization can affect performance).
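The "parsing" variant can be sketched as below. The helper name is
hypothetical, and the sketch assumes the captured output is a complete
CGI response with one header per line (no folded headers).

```perl
use strict;
use warnings;

# Hypothetical helper: split a captured CGI response into parsed
# headers and the untouched (possibly binary) body.
sub split_http_response {
    my ($response) = @_;
    # headers end at the first empty line; CGI.pm normally emits CRLF,
    # but accept bare LF as well
    my ($head, $body) = split /\r?\n\r?\n/, $response, 2;
    $head = '' unless defined $head;
    $body = '' unless defined $body;

    my @headers;    # list of [name, value] pairs, order preserved
    for my $line (split /\r?\n/, $head) {
        my ($name, $value) = split /:\s*/, $line, 2;
        push @headers, [ $name, defined $value ? $value : '' ];
    }
    return (\@headers, $body);
}

my $response = "Status: 200 OK\r\n"
             . "Content-Type: application/x-gzip\r\n"
             . "\r\n"
             . "\x1f\x8b\x08\x00binary\n\ndata";
my ($headers, $body) = split_http_response($response);
# $headers->[1] is ['Content-Type', 'application/x-gzip'];
# $body keeps its embedded empty line, because only the first one splits
```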
* X-Sendfile (or equivalent) support
A web server encountering such an HTTP header will discard all output
and send the file specified by that header instead, using web server
internals including all optimizations like caching headers and
sendfile or mmap if configured. For Apache it requires the mod_xsendfile
module (https://tn123.org/mod_xsendfile/); lighttpd has it built in
(at least for FastCGI) but disabled by default; in Nginx a similar
feature is called X-Accel-Redirect.
The idea is to use the cache file for X-Sendfile contents; though this
might require storing headers and body of the response separately, and
might not be much of a speedup.
* compressed cache entries (transfer-encoding) (?)
To reduce the size taken by the cache, and also reduce the bandwidth
taken by serving gitweb requests, save the body of the response
compressed. Then, if the browser supports it, send compressed data with
the HTTP header 'Transfer-Encoding:' set to the appropriate value.
The complication which, I think, we have to take into account is
that some (hopefully few) web browsers and net downloaders
don't support the transfer-encoding we plan to use (gzip or deflate).
Also, gitweb should not compress files which it knows do not compress
well, like already compressed snapshots (zip, tar.gz, tar.bz2) or images.
There was a patch "gitweb: Enable transparent compression for HTTP
output" sent to the git mailing list (using PerlIO::gzip), but in the
cached case we pay the CPU cost only *once*.
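A sketch of the "compress once, decompress only for clients that need
it" idea, using the core IO::Compress::Gzip and IO::Uncompress::Gunzip
modules. All subroutine names are hypothetical, the list of
incompressible types is a coarse guess, and $accepted stands for the
value of whichever request header lists the codings the client accepts
(wildcards are not handled).

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip   $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Coarse guess: snapshots and images are already compressed.
sub is_compressible {
    my ($content_type) = @_;
    return 0 if $content_type =~ m{^image/}i;
    return 0 if $content_type =~ m{^application/(?:zip|gzip|x-gzip|x-bzip2)\b}i;
    return 1;
}

# Does a comma-separated list of codings allow gzip (q-value above 0)?
sub accepts_gzip {
    my ($accepted) = @_;
    return 0 unless defined $accepted;
    $accepted =~ s/^\s+|\s+$//g;
    for my $item (split /\s*,\s*/, $accepted) {
        my ($coding, @params) = split /\s*;\s*/, $item;
        next unless defined $coding && lc($coding) eq 'gzip';
        my $q = 1;
        for my $p (@params) { $q = $1 if $p =~ /^q=(\d*\.?\d+)$/i; }
        return $q > 0 ? 1 : 0;
    }
    return 0;
}

# Called when filling the cache: the CPU cost is paid here, once.
sub store_body {
    my ($body, $content_type) = @_;
    return { gzipped => 0, data => $body } unless is_compressible($content_type);
    my $packed;
    gzip(\$body => \$packed) or die "gzip failed: $GzipError\n";
    return { gzipped => 1, data => $packed };
}

# Called when serving: returns (data, is_gzipped).
sub body_for_client {
    my ($entry, $accepted) = @_;
    return ($entry->{data}, $entry->{gzipped})
        if !$entry->{gzipped} || accepts_gzip($accepted);
    my $plain;
    gunzip(\$entry->{data} => \$plain) or die "gunzip failed: $GunzipError\n";
    return ($plain, 0);
}

my $entry = store_body('x' x 1000, 'text/html; charset=utf-8');
my ($data, $is_gzipped) = body_for_client($entry, 'deflate, gzip;q=0.5');
```

Clients that cannot take gzip still cost a decompression per request,
but that is much cheaper than regenerating the page.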
* Replace text/html with application/xhtml+xml in header
when reading from cache.
In the non-cached case, gitweb serves the page using either the plain
'text/html' content type or, if the web browser accepts it, the more
advanced 'application/xhtml+xml' content type. When caching is enabled,
we have to always use 'text/html', because a web browser (e.g. lynx)
might not accept the other... but with the cache being HTTP-aware, we
can replace 'text/html' with 'application/xhtml+xml' in the
'Content-Type:' HTTP header.
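A sketch of rewriting the cached header block at serve time. Both
subroutine names are hypothetical, and the Accept parsing is a
simplification (it looks only for an explicit application/xhtml+xml
entry with a non-zero q-value and ignores wildcards), so it is not
necessarily the same test gitweb uses in the non-cached case.

```perl
use strict;
use warnings;

# Does the Accept request header explicitly allow application/xhtml+xml?
sub accepts_xhtml {
    my ($accept) = @_;
    return 0 unless defined $accept;
    $accept =~ s/^\s+|\s+$//g;
    for my $item (split /\s*,\s*/, $accept) {
        my ($type, @params) = split /\s*;\s*/, $item;
        next unless defined $type && lc($type) eq 'application/xhtml+xml';
        my $q = 1;
        for my $p (@params) { $q = $1 if $p =~ /^q=(\d*\.?\d+)$/i; }
        return $q > 0 ? 1 : 0;
    }
    return 0;
}

# $headers is the cached header block (always stored as text/html).
sub fix_content_type {
    my ($headers, $accept) = @_;
    return $headers unless accepts_xhtml($accept);
    $headers =~ s{^(Content-Type:\s*)text/html\b}{${1}application/xhtml+xml}mi;
    return $headers;
}

my $cached = "Status: 200 OK\r\n"
           . "Content-Type: text/html; charset=utf-8\r\n";
my $for_firefox = fix_content_type($cached, 'text/html,application/xhtml+xml;q=0.9');
my $for_lynx    = fix_content_type($cached, 'text/html, text/plain');
```

Only the header is rewritten; the cached body is shared by both kinds
of clients, which works as long as the page is well-formed XHTML.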
* Expires / max-age synchronized with cache lifetime,
Last-Modified synchronized with cache entry creation time.
Currently all cache entries have the same global (per cache instance)
expiration time. The Expires header is not correlated with it.
There are two issues. First, when storing data in the cache, we can set
the Expires header (or the max-age directive in the Cache-Control
header) to the expiration time of the cache entry and set Last-Modified
to the time the cache entry was (re)generated (unless it is already set
by gitweb).
The other issue is that some data doesn't change, ever. We set the
expire time to '+1d' (one day) in such a case. If we could mark those
cache entries as having a longer / infinite lifetime to not regenerate
them...
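A sketch of deriving those headers from the cache entry's creation time
and lifetime. The subroutine names are hypothetical; the date is
formatted by hand so that the result does not depend on the locale.

```perl
use strict;
use warnings;

my @DAY = qw(Sun Mon Tue Wed Thu Fri Sat);
my @MON = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);

# Format an epoch time as an HTTP date, always in GMT.
sub http_date {
    my ($epoch) = @_;
    my ($sec, $min, $hour, $mday, $mon, $year, $wday) = gmtime($epoch);
    return sprintf('%s, %02d %s %04d %02d:%02d:%02d GMT',
        $DAY[$wday], $mday, $MON[$mon], $year + 1900, $hour, $min, $sec);
}

# $created_at: when the entry was (re)generated (epoch seconds)
# $expires_in: lifetime of the entry in seconds
# $now:        current time, passed in to keep the sketch testable
sub cache_validity_headers {
    my ($created_at, $expires_in, $now) = @_;
    my $expires_at = $created_at + $expires_in;
    my $max_age    = $expires_at - $now;
    $max_age = 0 if $max_age < 0;
    return (
        'Last-Modified' => http_date($created_at),
        'Expires'       => http_date($expires_at),
        'Cache-Control' => "max-age=$max_age",
    );
}

# entry generated at 08:49:37, lifetime 20 minutes, served 5 minutes later
my %h = cache_validity_headers(784111777, 1200, 784111777 + 300);
# Last-Modified: Sun, 06 Nov 1994 08:49:37 GMT
# Expires:       Sun, 06 Nov 1994 09:09:37 GMT
# Cache-Control: max-age=900
```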
* support for If-Modified-Since (external/browser caches)
When caching is enabled, we know when the page was created. We could
check for the If-Modified-Since conditional request header, and return
a '304 Not Modified' HTTP response if we would serve from the same
cache entry. It would save bandwidth, and a bit of I/O.
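A sketch of the check, assuming the creation time of the cache entry is
available as epoch seconds. The subroutine names are hypothetical, and
only the common fixed-length date format is parsed; anything else is
treated as "no usable date", which simply means a full response.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

my %MONTH = (Jan => 0, Feb => 1, Mar => 2, Apr => 3, May => 4,  Jun => 5,
             Jul => 6, Aug => 7, Sep => 8, Oct => 9, Nov => 10, Dec => 11);

# Parse e.g. "Sun, 06 Nov 1994 08:49:37 GMT"; returns undef on failure.
sub parse_http_date {
    my ($str) = @_;
    return unless defined $str;
    return unless $str =~ /^\w{3}, (\d{2}) (\w{3}) (\d{4}) (\d{2}):(\d{2}):(\d{2}) GMT$/;
    my ($mday, $mon, $year, $hour, $min, $sec) = ($1, $MONTH{$2}, $3, $4, $5, $6);
    return unless defined $mon;
    # timegm() croaks on out-of-range values, hence the eval
    my $epoch = eval { timegm($sec, $min, $hour, $mday, $mon, $year) };
    return $epoch;
}

# True if the client's copy is at least as new as the cache entry,
# i.e. we can answer '304 Not Modified' without sending the body.
sub not_modified {
    my ($if_modified_since, $entry_created_at) = @_;
    my $since = parse_http_date($if_modified_since);
    return (defined $since && $entry_created_at <= $since) ? 1 : 0;
}

my $ims = 'Sun, 06 Nov 1994 08:49:37 GMT';
my $same_entry  = not_modified($ims, 784111777);    # 1: send 304
my $regenerated = not_modified($ims, 784111778);    # 0: send full page
```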
* ETag support - gitweb version + cache key hash, possibly also Range
requests.
We can compose a strong ETag validator from the cache key hash and the
gitweb version string. Maybe it would make it possible to respond to
Range requests for resuming download of e.g. a large snapshot file...
But it may be that those features are unrelated...
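A sketch of composing and checking such a validator with the core
Digest::MD5 module. The subroutine names, the example version string
and the example cache key are all made up, and the choice of MD5 is an
assumption; note that a tag built only from the key and the version
does not change when the entry is regenerated with different content.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Quoted entity tag built from gitweb version and cache key.
# Both are assumed to be byte strings.
sub make_etag {
    my ($version, $cache_key) = @_;
    return '"' . md5_hex("$version\0$cache_key") . '"';
}

# Does the If-None-Match request header list our tag (or "*")?
sub etag_matches {
    my ($if_none_match, $etag) = @_;
    return 0 unless defined $if_none_match;
    return 1 if $if_none_match =~ /^\s*\*\s*$/;
    for my $candidate (split /\s*,\s*/, $if_none_match) {
        $candidate =~ s/^\s+|\s+$//g;
        $candidate =~ s{^W/}{};     # If-None-Match compares weakly
        return 1 if $candidate eq $etag;
    }
    return 0;
}

# hypothetical version string and cache key
my $etag = make_etag('1.7.3', 'p=git.git;a=summary');
my $hit  = etag_matches(qq{"something-else", $etag}, $etag);   # 1
```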
----------------------------------------------------------------------
Below there are proposed gitweb improvements and features, which would
also improve caching support in gitweb:
* Time::HiRes is in core + simplify progress indicator
Time::HiRes was first released with perl 5.007003 (5.7.3). Because
gitweb requires at least Perl 5.8 for its Unicode / UTF-8 support,
we can assume that it is present.
This would simplify the code in git_generating_data_html().
* $per_request_config = 0/1 (default)/coderef
or just $per_request_config = coderef
If it were possible to have the config read only once in persistent
environments such as mod_perl and FastCGI, and not once per request,
it would improve performance when the caching engine used has slow
initialization / creation time, like the Moose-based CHI.
The basic idea is to put the parts of the config that change per
request (like e.g. what gitosis or gitolite uses) in a coderef in the
$per_request_config variable; this coderef would be invoked once per
request. Example configuration:
our $per_request_config = sub {
    $ENV{GL_USER} = ($cgi && $cgi->remote_user) || "gitweb";
};
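How the three proposed settings (0, 1, coderef) could be handled in a
persistent process is sketched below. The subroutine names and the
counters are hypothetical; load_full_config() stands in for reading and
evaluating the gitweb config file.

```perl
use strict;
use warnings;

our $per_request_config = 1;    # 0, 1 (default) or code reference
my $config_loaded = 0;
my %count = (full => 0, per_request => 0);

# stand-in for reading and evaluating the gitweb config file
sub load_full_config { $count{full}++ }

sub configure_for_request {
    if (ref($per_request_config) eq 'CODE') {
        load_full_config() unless $config_loaded;
        $per_request_config->();    # only the per-request parts
    } elsif ($per_request_config || !$config_loaded) {
        load_full_config();         # true: every request; false: first only
    }
    $config_loaded = 1;
}

# simulate three requests handled by one persistent process
$per_request_config = sub {
    $count{per_request}++;
    $ENV{GL_USER} = 'gitweb';
};
configure_for_request() for 1 .. 3;
# the full config was read once, the coderef ran for each request
```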
* authentication / authorization for admin stuff
Some kind of authentication support would be needed for edit / write
support in gitweb, and also for controlling access to the cache
administration page. We don't want just anyone to be able to clear the
cache.
In the current proof-of-concept patch the cache administration page
is restricted to people accessing gitweb pages from localhost, or
running gitweb as a standalone script.
* mod_perl handler
It might be possible, by modifying gitweb only slightly, to
make it work *also* as a native mod_perl handler, and not only via
ModPerl::Registry.
This would make it possible to initialize the cache once per process
lifetime, and not once per request.
--
Jakub Narebski
Poland