From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jakub Narebski Subject: Re: [PATCH] gitweb: use Git.pm, and use its parse_rev method for git_get_head_hash Date: Mon, 2 Jun 2008 23:32:27 +0200 Message-ID: <200806022332.29853.jnareb@gmail.com> References: <1212188412-20479-1-git-send-email-LeWiemann@gmail.com> <200806020019.23858.jnareb@gmail.com> <20080602092926.GJ18781@machine.or.cz> Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Cc: Lea Wiemann , git@vger.kernel.org, Lars Hjemli , John Hawley To: Petr Baudis X-From: git-owner@vger.kernel.org Mon Jun 02 23:33:38 2008 Return-path: Envelope-to: gcvg-git-2@gmane.org Received: from vger.kernel.org ([209.132.176.167]) by lo.gmane.org with esmtp (Exim 4.50) id 1K3Heq-0000bp-IY for gcvg-git-2@gmane.org; Mon, 02 Jun 2008 23:33:37 +0200 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751769AbYFBVcl (ORCPT ); Mon, 2 Jun 2008 17:32:41 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1752221AbYFBVcl (ORCPT ); Mon, 2 Jun 2008 17:32:41 -0400 Received: from ug-out-1314.google.com ([66.249.92.174]:36645 "EHLO ug-out-1314.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751650AbYFBVcj (ORCPT ); Mon, 2 Jun 2008 17:32:39 -0400 Received: by ug-out-1314.google.com with SMTP id h2so364010ugf.16 for ; Mon, 02 Jun 2008 14:32:37 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:received:received:from:to:subject:date :user-agent:cc:references:in-reply-to:mime-version:content-type :content-transfer-encoding:content-disposition:message-id; bh=kV+XcXB9E2K+lhTkpC3ZoaA9IY5EaG1E8/D4FnMcemU=; b=whL9JiOnqmNxNSrNFPhLVqklXgcIBfl6JTOF5WWeUH3d+8QBv3U6sTIQtORIJvYEt0 UAK6jk4LwHxK6neLflhL1dAvw/cc1B5/fOqaOojS4/qVFs7CTt0OwBXy4Pak7utigOel 43Q9Xx5KhGmKNlS3BeMHcVH3gQ5LdLk06nECE= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=from:to:subject:date:user-agent:cc:references:in-reply-to :mime-version:content-type:content-transfer-encoding :content-disposition:message-id; b=GD/O+EUCXKq/ModE7Qx0vQhLJQIpWCxHV6VFCkwEUx5m4Dn9lxieEDYot5Uw5YOWjL SPYcRgIplr+qsrIn+JRCuY4GQD9raSVpV3l2US7BqAaFGwGPl00TDWyKrVmn/M1nPM4g 0su5fgfGf4Vul2YDZmRalzEKc7IsYDKtS9/hc= Received: by 10.210.65.2 with SMTP id n2mr7814403eba.48.1212442357535; Mon, 02 Jun 2008 14:32:37 -0700 (PDT) Received: from ?192.168.1.15? ( [83.8.244.52]) by mx.google.com with ESMTPS id k5sm33360212nfh.39.2008.06.02.14.32.34 (version=SSLv3 cipher=RC4-MD5); Mon, 02 Jun 2008 14:32:35 -0700 (PDT) User-Agent: KMail/1.9.3 In-Reply-To: <20080602092926.GJ18781@machine.or.cz> Content-Disposition: inline Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Archived-At: On Mon, 2 June 2008, Petr Baudis wrote: > On Mon, Jun 02, 2008 at 12:19:23AM +0200, Jakub Narebski wrote: > > On Sat, 31 May 2008, Lea Wiemann wrote: > > > > > > We may need to have an explicitly implemented layer 0 (front-end > > > caching) in Gitweb for at least a subset of pages (like project pages), > > > but we'll see if that's necessary. > > > > I think that front-end caching (HTML/RSS/Atom output caching) has sense > > for large web pages (large size and large number of items), such as > > projects list page if it is unpaginated (and perhaps even if it is > > divided into pages, although I'm sure not for project search page), > > commonly requested snapshots (if they are enabled), large trees like > > SPECS directory with all the package *.spec files for distribution > > repository, perhaps summary/feed for most popular projects. If (when) > > syntax highlighting would got implemented, probably also caching > > blob view (as CPU can become issue). > > > > Front-end (HTML output) caching has the advantages of easy to calculate > > strong ETag, and web server support for If-Match: and If-None-Match: > > HTTP/1.1 headers. You can easy support Range: request, needed for > > resuming download (e.g. for downloading snapshots, if this feature is > > enabled in gitweb). > > Caching snapshots would definitely make sense, sure. That reminds me to finally implement nicer (git-describe like) [proposed] snapshot filenames. For example for snapshot of state given by some tag (snapshot of tagged release [1]), don't use generic -<40-chars sha1>. git-b2a42f55bc419352b848751b0763b0a2d1198479.tar.gz but -. git-v1.5.5.3.tar.gz (well, currently tags don't have 'snapshot' link, but this is easily fixed). What do you think about (ab)using 'fp' (file_parent) parameter to pass proposed snapshot file name? [1] Would it be good feature to add support for limiting snapshots to snapshots only of tagged releases (which would be I guess more important when gitweb caching gets implemented). > > You can even compress the output, and serve it to clients which > > support proper transparent compression (Content-Encoding). > > What does this have to do with caching? Well, only in a sense that with front-end caching (to choose if CPU matters most) this can be done "for free", without incurring extra CPU, at the cost of little more disk space. Of course, if most clients understand (accept) Content-Encoding (transfer encoding), you can store compressed output, with a little CPU cost to decompress for non-conformant clients; this way frontend caching can have cache size comparable to [parsed] data caching. > > And of course it has the advantage of actually been written and tested > > in work, in the case of kernel.org gitweb. Although caching parsed > > data was implemented in repo.or.cz gitweb, it was done only for > > projects list view, and it is quite new and not so thoroughly tested > > http://article.gmane.org/gmane.comp.version-control.git/77469 > > This argument does have some value, but I don't think it matters too > much, since as far as I understood, it is going to get largely > reimplemented anyway. What I'd like to see in a bit of time is some estimate how much time would take implementing data caching almost from scratch (a bit of code in repo's gitweb), compared to merging in kernel.org's gitweb caching code... > > It would be nice for front-end caching to have an option to use absolute > > time for all time/dates, and to (optionally) not use adaptive > > Content-Type... > > I'd hate to have to do unnecessary compromises in order to get sensible > caching. > > Even in your excellent series on Gitweb caching series, I didn't spot > any arguments that would put frontend caching in front of the > intermediate data caching option; yes, it is the simplest solution > implementation-wise, but also the least flexible one. My gut feeling is > still to go with data caching instead of HTML caching. As Lars Hjemli wrote in "[RFC/PATCH] gitweb: Paginate project list" thread (unfortunately not all articles got to git mailing list) http://thread.gmane.org/gmane.comp.version-control.git/81838/focus=81875 In cgit I've chosen "projectlist in a single file" and "cache html output". This makes it cheap (in terms of cpu and io) to both generate and serve the cached page (and the cache works for all pages). While I agree that caching search result output almost never makes sense, I think it's more important that cache hits requires minimal processing. This is why I've chosen to cache the final result instead of an intermediate state, but both solutions obviously got some pros and cons. caching final output is important if you want to minimize processing (CPU time). I'd say also if you want to implement Range: for resumable downloads (snapshots), because otherwise I think it would be quote hard to do reasonably (with caching only data). So I guess best solution would be mixed one: use output cache for large or CPU intensive pages, use data caching to limit cache size and for maximum flexibility (relative dates, sorting by columns: athough that would be best solved using some DHTML/JavaScript, paginated output, projects search etc.) By the way, we have your (Petr 'Pasky' Baudis, based on repo.or.cz) and John 'Warthog9' Hawley (based on kernel.org) statements that gitweb performance is I/O bound, but I don't remember any hard data. I have said wrongly that one can use 'fio' tool to check I/O performance; it is not true, this tool can be used to test _filesystem_ by generating specified pattern of I/O load. I don't know of any tool which allow to measure if I/O is bottleneck for given application; 'iogrind' measures cold cache start, and it requires I think compiled program, as it uses Valgrind. You can measure CPU load, time to response and memory usage using 'time' (running gitweb as script), ApacheBench and top; you can measure latency using LatencyTOP. Is there some iotop tool, and can it be used to measure performance bottlenecks of web scripts (web applications)? -- Jakub Narebski Poland