From: "Jon Smirl"
To: "Linus Torvalds", jnareb@gmail.com
Cc: "Nicolas Pitre", "Shawn O. Pearce", "Git Mailing List"
Subject: Re: git-daemon on NSLU2
Date: Sat, 25 Aug 2007 11:44:07 -0400
Message-ID: <9e4733910708250844n7074cb8coa5844fa6c46b40f0@mail.gmail.com>

On 8/24/07, Linus Torvalds wrote:
> > I can clone the tree in five minutes using the http protocol. Using the
> > git protocol would take 24hrs if I let it finish.
>
> The http side doesn't actually do any global verification, the way
> git-daemon does. So to it, everything is just temporary buffers, and you
> don't need any memory at all, really.
>
> git-daemon will create a packfile. That means that it has to generate the
> *global* object reachability, and will then optimize the object packing
> etc etc. That's a minimum of something like 48 bytes per object for just
> the object chains, and the kernel has a *lot* of objects (over half a
> million).

A large, repeating workload is created in this process: you take a 200MB
pack, repack it to add a few loose objects, and then don't save the
results. This model makes the NSLU2 unusable, but I also see it at my
shared hosting provider. Initial clones of a repo that take 3min from
kernel.org take 25min on a shared host, since the RAM is not dedicated.

There are three categories of fetches:
 1) initial clone, fetch all
 2) fetch recent
 3) I haven't fetched in three months

99% of fetches fall into the first two categories.

A very simple solution is to sendfile() existing packs if they contain any
objects the client wants, and let the client deal with the unwanted
objects. Yes, this does send extra traffic over the net, but the only group
significantly impacted is #3, which is the most infrequent group. Loose
objects are handled as they are currently. To optimize this scheme you need
to let the loose objects build up at the server and then periodically sweep
only the older ones into a pack; packing the entire repo into a single pack
would cause recent fetches to retrieve the entire pack.

Initial clone can be optimized further by recognizing that the receiving
repository is empty and simply sending everything; there is no need to
compute which objects are missing on the server. This will speed up initial
clones, since the existing packs can be sent immediately instead of waiting
for a pack file to be built. The pack of loose objects can be built in
parallel with sending the existing packs.

I recognize that when cloning a single branch, or when using --reference,
too many objects will be transmitted, but I believe the benefit of reducing
the server load outweighs the overhead of transmitting extra objects in
this case. You can always remove the extra objects on the client side.

On 8/24/07, Jakub Narebski wrote:
> There was an idea to special-case clone (just concatenate the packs; the
> receiving side, as someone said, can detect pack boundaries; do not
> forget to pack loose objects first), instead of using the generic fetch
> --all for clone, but no code. Code speaks louder than words (although if
> someone would provide details of pack boundary detection...)

Write the file name and length into the socket before sending each pack.
Use sendfile(), or its current incarnation, to actually send the pack.
Insert these header lines between packs.
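To make that concrete, here is a minimal sketch of what the sending side
could look like. None of this is existing git code: send_pack_file() is a
hypothetical helper, and the "<name> <length>\n" header line is only an
assumed encoding of the boundary markers described above.

    /*
     * Hypothetical sketch: stream one existing pack to the client,
     * preceded by a one-line "<name> <length>" header so the receiver
     * can split the byte stream back into individual packs.
     */
    #include <fcntl.h>
    #include <limits.h>
    #include <stdio.h>
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static int send_pack_file(int sock, const char *path)
    {
        struct stat st;
        char header[PATH_MAX + 32];
        off_t offset = 0;
        int fd, len;

        fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;
        if (fstat(fd, &st) < 0) {
            close(fd);
            return -1;
        }

        /* the header line marks the pack boundary for the receiver */
        len = snprintf(header, sizeof(header), "%s %llu\n",
                       path, (unsigned long long)st.st_size);
        if (write(sock, header, len) != len) {
            close(fd);
            return -1;
        }

        /* zero-copy transfer of the pack itself */
        while (offset < st.st_size) {
            ssize_t n = sendfile(sock, fd, &offset, st.st_size - offset);
            if (n <= 0) {
                close(fd);
                return -1;
            }
        }
        close(fd);
        return 0;
    }

The receiving side reads one header line, copies exactly that many bytes
into a local pack file, and repeats until the stream ends, so it never has
to parse the pack format to find the boundaries. Detecting the initial-clone
case is just as cheap: if the client sends no "have" lines, it has nothing,
and every pack can be streamed this way without computing reachability.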
> In addition to the object chains themselves, the native protocol will also
> obviously have to actually *look* at and parse all the tree and commit
> objects while it does all this, so while it doesn't necessarily keep all
> of those in memory all the time, it will need to access them, and if you
> don't have enough memory to cache them, that will add its own set of IO.
>
> So I haven't checked exactly how much memory you really want to have to
> serve big projects, but with some handwavy guesstimate, if you actually
> want to do a good job I'd guess that you really want to have at least as
> much memory as the size of the largest project you are serving, and
> probably add at least 10-20% on top of that.
>
> So for the kernel, at a guess, you'd probably want to have at least 256MB
> of RAM to do a half-way good job. 512MB is likely nicer and allows you to
> actually cache the stuff over multiple accesses.
>
> But I haven't actually tested. Maybe it might be bearable at 128M.
>
> Linus

--
Jon Smirl
jonsmirl@gmail.com