From: Linus Torvalds
Subject: Re: Handling large files with GIT
Date: Mon, 13 Feb 2006 08:19:10 -0800 (PST)
To: Jeff Garzik
Cc: Ben Clifford, Martin Langhoff, Florian Weimer, git@vger.kernel.org
In-Reply-To: <43F01F5A.5020808@pobox.com>

On Mon, 13 Feb 2006, Jeff Garzik wrote:
>
> Linus Torvalds wrote:
> > I've never used
> > maildir layout, but if it is a couple of large _flat_
> > subdirectories,
>
> That's what it is :/  One directory per mail folder, with each email an
> individual file in that dir.

Ok.

Anyway, I double-checked, and I'm wrong anyway. While the "static
directories" thing is a huge performance optimization for doing many
things (diffing trees, file history in git-rev-list, etc etc), for merging
it doesn't help. We always end up expanding the whole tree.

Which is kind of sad. It's inevitable in one sense: we do the merge in the
index, after all, and the index - unlike the tree structures - is a flat
file (like the "manifest" in mercurial or monotone). It's also represented
that way in memory.

However, it is a total and complete waste in other cases.

Thinking more about it, this is also why merging causes all the horrible
index performance: not only do we (unnecessarily) read the same trees over
and over again only to collapse them back to stage0 later when they are
the same, but because we keep the index in a linear format, when we read
the other trees, we'll have to move things around with memmove() (just the
pointers, but still).

We'd actually be a _lot_ better off if we split "git-read-tree" up into
two phases: one that did the recursive tree operation (which can optimize
the "same tree everywhere" case), and the second stage that actually
populated the index.

I'll have to think about this. It would be an absolutely _huge_
optimization for merging in certain patterns, it just doesn't matter for
something like the kernel with "just" 18,000 files and not a lot of
strange merging going on.

In contrast, I can see a mail archive easily having hundreds of thousands
of individual emails. At which time it's horribly stupid to read them all
in three times (for a merge - base, origin, new) and do so in a pretty
inefficient manner.

Ho humm.
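
[Editorial sketch, not part of the original mail.] The first of the two
phases proposed above could look roughly like the C below: a three-way walk
that compares object ids and refuses to descend into anything that is
identical in base, ours and theirs, so that only the differing paths ever
reach the flat index. Everything here is hypothetical and for illustration
only: struct node, struct result, same(), find(), walk() and demo() are
invented names, the short id strings stand in for real SHA-1s, and none of
this is git's actual read-tree code. As a simplification it only visits
names present in "ours"; entries that exist only in base or theirs are
ignored.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_PATHS 16
#define MAX_PATH_LEN 128

/* Hypothetical in-memory tree entry; real git tree objects differ. */
struct node {
    const char *name;
    const char *id;           /* stands in for the object's SHA-1 */
    const struct node *kids;  /* NULL for a blob (file) */
    int nkids;
};

/* Output of phase one: the only paths phase two must put in the index. */
struct result {
    char paths[MAX_PATHS][MAX_PATH_LEN];
    int npaths;
    int skipped;  /* entries (files or whole subtrees) never expanded */
};

static int same(const struct node *a, const struct node *b)
{
    return a && b && strcmp(a->id, b->id) == 0;
}

static const struct node *find(const struct node *t, const char *name)
{
    if (!t || !t->kids)
        return NULL;
    for (int i = 0; i < t->nkids; i++)
        if (strcmp(t->kids[i].name, name) == 0)
            return &t->kids[i];
    return NULL;
}

/* Phase one: recursive three-way walk with "same everywhere" pruning. */
static void walk(const struct node *base, const struct node *ours,
                 const struct node *theirs, const char *prefix,
                 struct result *out)
{
    if (same(base, ours) && same(ours, theirs)) {
        out->skipped++;  /* identical on all three sides: do not descend */
        return;
    }
    if (ours->kids) {
        for (int i = 0; i < ours->nkids; i++) {
            const struct node *o = &ours->kids[i];
            char path[MAX_PATH_LEN];
            snprintf(path, sizeof(path), "%s%s%s",
                     prefix, *prefix ? "/" : "", o->name);
            walk(find(base, o->name), o, find(theirs, o->name), path, out);
        }
        return;
    }
    /* A file that differs on at least one side: leave it for phase two. */
    assert(out->npaths < MAX_PATHS);
    snprintf(out->paths[out->npaths++], MAX_PATH_LEN, "%s", prefix);
}

/* Tiny made-up archive: "inbox" untouched, "drafts/b" changed in theirs. */
struct result demo(void)
{
    static const struct node inbox[] = {
        { "1", "m1", NULL, 0 }, { "2", "m2", NULL, 0 } };
    static const struct node drafts_base[] = {
        { "a", "d1", NULL, 0 }, { "b", "d2", NULL, 0 } };
    static const struct node drafts_theirs[] = {
        { "a", "d1", NULL, 0 }, { "b", "d3", NULL, 0 } };
    static const struct node base_kids[] = {
        { "inbox", "t1", inbox, 2 }, { "drafts", "t2", drafts_base, 2 } };
    static const struct node theirs_kids[] = {
        { "inbox", "t1", inbox, 2 }, { "drafts", "t3", drafts_theirs, 2 } };
    static const struct node base = { "", "r1", base_kids, 2 };
    static const struct node theirs = { "", "r2", theirs_kids, 2 };
    struct result r = { 0 };

    walk(&base, &base, &theirs, "", &r);  /* "ours" is unchanged from base */
    return r;
}
```

Calling demo() walks a toy archive in which the whole "inbox" directory is
pruned with a single id comparison, however many mails it holds, and only
"drafts/b" is handed on. A second phase would then insert just those paths
into the flat index, which is also where the memmove() cost mentioned above
would shrink, since far fewer entries get shuffled around.
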
It doesn't look _hard_ per se, and I think the two-stage
git-read-tree is actually also what the recursive merge strategy wants
anyway (it can't use the index - it really just wants to get a list of
conflict information).

So this definitely sounds like the RightThing(tm) to do anyway, and it
fits the git data structures really well. So no downsides. Except that
this is some rather core code, and you can't afford to get it wrong.

And the fact that I'm a lazy bastard, of course.

		Linus