From: Jeff King
Subject: Re: remove_duplicates() in builtin/fetch-pack.c is O(N^2)
Date: Tue, 22 May 2012 13:46:48 -0400
Message-ID: <20120522174648.GE11600@sigill.intra.peff.net>
In-Reply-To: <7vaa105dah.fsf@alter.siamese.dyndns.org>
References: <4FB9F92D.8000305@alum.mit.edu>
 <20120521174525.GA22643@sigill.intra.peff.net>
 <20120522170101.GA11600@sigill.intra.peff.net>
 <7vaa105dah.fsf@alter.siamese.dyndns.org>
To: Junio C Hamano
Cc: Nguyen Thai Ngoc Duy, Michael Haggerty, git discussion list

On Tue, May 22, 2012 at 10:35:50AM -0700, Junio C Hamano wrote:

> > In this case, the packed-refs file is 30MB. Even just gzipping it
> > takes it down to 2MB. As far as I know, we don't ever do random
> > access on the file, but instead just stream it into memory.
>
> True.
>
> The current code reads the whole thing in upon first use of _any_
> element in the file, just like the index codepath does for the index
> file.
>
> But the calling pattern to the refs machinery is fairly well isolated
> and all happens in the refs.c file. Especially thanks to the recent
> work by Michael Haggerty, for "I am about to create a new branch
> 'frotz'; do I have 'refs/heads/frotz' or anything that begins with
> 'refs/heads/frotz/'?" kinds of callers, it is reasonably easy to
> design a better structured packed-refs file format that allows us to
> read only a subtree portion of the refs/ hierarchy, and to plug that
> logic into the lazy ref population code. Such a "design a better
> packed-refs format for scalability to 400k refs" project is very well
> isolated and has a high chance of succeeding without breaking things.
> No code outside refs.c assumes that there is a flat array of refs
> that records what was read from the packed-refs file and that it can
> walk over linearly, unlike the in-core index.

Yeah. As gross as I find a 30MB packed-refs file, I don't actually
think that is the bottleneck at all. Sure, it wastes some disk cache,
but in the grand scheme of things, anybody with 400K refs probably has
a much, much bigger object database. So there probably isn't much
point in clever tricks to make it smaller on disk, especially if they
would make further partial-read optimizations harder to do. But...

> If you do "for_each_ref()" for everything (e.g. sending 'have' during
> the object transfer, or repacking the whole repository), you would
> end up needing to read the whole thing for obvious reasons.

I think Michael's work is really great for the loose refs, where
hitting each directory is very expensive. But for a repo with packed
refs that is just going to call for_each_ref(), I'm worried that
breaking apart the ref list is going to cause noticeable overhead.
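
To make the two access patterns concrete, here is a minimal standalone
sketch (this is not git's actual refs code; struct ref_entry,
for_each_ref_flat(), and name_or_subtree_exists() are made-up names for
illustration). It models the packed-refs list as one flat array sorted
by name: the full enumeration that for_each_ref()-style callers need is
a single tight loop, while the "do I have 'refs/heads/frotz' or
anything under 'refs/heads/frotz/'?" check Junio describes only needs a
binary search into that same array:

    #include <stdio.h>
    #include <string.h>

    /* Toy stand-in for a packed-refs entry; git's real data structures differ. */
    struct ref_entry {
            const char *name;
            const char *sha1;       /* abbreviated hex, for illustration only */
    };

    /* A flat array, sorted by name, as the packed-refs file is laid out. */
    static struct ref_entry refs[] = {
            { "refs/heads/frotz/nitfol", "2222" },
            { "refs/heads/master",       "3333" },
            { "refs/tags/v1.0",          "4444" },
    };
    static const size_t nr_refs = sizeof(refs) / sizeof(refs[0]);

    /* Full enumeration: what for_each_ref()-style callers effectively do. */
    static void for_each_ref_flat(void (*fn)(const struct ref_entry *))
    {
            for (size_t i = 0; i < nr_refs; i++)
                    fn(&refs[i]);
    }

    /*
     * "Do I have <name> itself, or anything under <name>/ ?"
     * On a sorted array, binary-search for the first entry >= name and
     * look at what we find; no need to touch the rest of the list.
     */
    static int name_or_subtree_exists(const char *name)
    {
            size_t lo = 0, hi = nr_refs, len = strlen(name);

            while (lo < hi) {
                    size_t mid = lo + (hi - lo) / 2;
                    if (strcmp(refs[mid].name, name) < 0)
                            lo = mid + 1;
                    else
                            hi = mid;
            }
            if (lo == nr_refs)
                    return 0;
            return !strcmp(refs[lo].name, name) ||
                   (!strncmp(refs[lo].name, name, len) &&
                    refs[lo].name[len] == '/');
    }

    static void print_ref(const struct ref_entry *e)
    {
            printf("%s %s\n", e->sha1, e->name);
    }

    int main(void)
    {
            for_each_ref_flat(print_ref);
            printf("frotz taken? %s\n",
                   name_or_subtree_exists("refs/heads/frotz") ? "yes" : "no");
            return 0;
    }

The point of the sketch is only that a flat sorted list is already
cheap to query once it is in memory; the open questions are how much of
it has to be read from disk and what a fragmented in-core
representation costs for full iterations.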
But still, the regression I'm seeing seems way too big for that
overhead, so I think something else is going on. We just need to find
it.

-Peff
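
For context on the thread's subject line, here is a minimal standalone
sketch of the usual way to avoid a quadratic remove_duplicates(): sort
the names once and drop adjacent duplicates in a single pass, O(n log n)
overall. This is not the actual builtin/fetch-pack.c code, which works
on lists of struct ref and has its own constraints; the array and the
function name below are purely illustrative:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* qsort comparator for an array of C strings. */
    static int cmp_str(const void *a, const void *b)
    {
            return strcmp(*(const char *const *)a, *(const char *const *)b);
    }

    /*
     * Sort the names, then keep only the first of each run of equal
     * entries, instead of comparing every entry against every other
     * one (which is what makes the naive approach O(n^2)).
     */
    static size_t remove_duplicates_sorted(const char **names, size_t nr)
    {
            size_t src, dst;

            if (!nr)
                    return 0;
            qsort(names, nr, sizeof(*names), cmp_str);
            for (src = 1, dst = 1; src < nr; src++) {
                    if (strcmp(names[src], names[dst - 1]))
                            names[dst++] = names[src];
            }
            return dst;     /* new number of entries */
    }

    int main(void)
    {
            const char *names[] = {
                    "refs/heads/master",
                    "refs/tags/v1.0",
                    "refs/heads/master",
                    "refs/heads/topic",
            };
            size_t nr = remove_duplicates_sorted(names,
                            sizeof(names) / sizeof(names[0]));

            for (size_t i = 0; i < nr; i++)
                    printf("%s\n", names[i]);
            return 0;
    }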