From: Derek Fawcus
Subject: Re: space compression (again)
Date: Fri, 15 Apr 2005 19:50:38 +0100
Message-ID: <20050415195038.E6735@mrwint.cisco.com>
To: git@vger.kernel.org

On Fri, Apr 15, 2005 at 01:19:30PM -0400, C. Scott Ananian wrote:
> Why are blobs per-file?  [After all, Linus insists that files are an
> illusion.]  Why not just have 'chunks', and assemble *these*
> into blobs (read, 'files')?  A good chunk size would fit evenly into some
> number of disk blocks (no wasted space!).

[ I've only been earwigging, not paying a lot of attention, however ...]

Funny, I was just thinking of this, having read Linus' discourse on
"files don't matter". The obvious chunking factor would be, say, a
function. The problem is that this tends towards very small files - I
know I tend to prefer small functions. Hmm - an underlying filesystem
that efficiently stores small files - why does that ring a bell :-)

However, the simple answer is to have a preparser for a file / tree
checkin which splits, say, a .c file into its associated chunks, and
represents it in git as a signed/hashed object, i.e. an automatically
created extra level of indirection (as I seem to recall was added
somewhere else?).

So say fred.c:

    /*
     * File boiler
     */
    #include
    #include

    /*
     * Fn a boiler
     */
    int fn_a(args)
    {
    }

    /*
     * Fn b boiler
     */
    long fn_b(args)
    {
    }

would be split into 4 parts within git: the 'file object', which simply
points to the content objects, and 3 content objects, being the stuff
before 'Fn a boiler', fn_a and its boiler, and fn_b and its boiler.

The interesting bit is needing a preprocessor which can roughly parse
the code, i.e. detect where to place the boiler blocks.

You would then do most of your tree operations upon the file objects,
but get the space savings from the content objects being shared.

I suspect that, simply to prevent pathological conditions, you'd have to
arrange that the content objects have a minimum size, irrespective of
the number of desired chunks (functions) they would naturally contain,
i.e. for compression efficiency you may choose something like 2K as the
minimum pre-compression content object size.

DF
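
A rough sketch of the preparser idea above, in Python for brevity. It is
only an illustration under stated assumptions, not anything git does:
the split heuristic (a new chunk at every block comment starting in
column 0), the "chunk" and "file" object kinds, and the names
split_chunks, store_file and load_file are all made up here, and a plain
dict stands in for the object store. Objects are named by a SHA-1 over a
small type/length header plus the content.

```python
import hashlib
import re

MIN_CHUNK = 2048  # roughly the 2K minimum pre-compression size suggested above


def split_chunks(source, min_size=MIN_CHUNK):
    """Split source text at block comments that start in column 0.

    Hypothetical heuristic: every '/*' at the start of a line begins a new
    chunk, so each function travels with its preceding boiler comment.
    Pieces smaller than min_size are merged with a neighbour.
    """
    starts = [m.start() for m in re.finditer(r"^/\*", source, flags=re.M)]
    bounds = sorted(set([0] + starts + [len(source)]))
    pieces = [source[a:b] for a, b in zip(bounds, bounds[1:])]

    chunks = []
    for piece in pieces:
        if chunks and len(chunks[-1]) < min_size:
            chunks[-1] += piece  # previous chunk is still too small: grow it
        else:
            chunks.append(piece)
    if len(chunks) > 1 and len(chunks[-1]) < min_size:
        tail = chunks.pop()  # undersized final chunk: fold into the previous one
        chunks[-1] += tail
    return chunks


def object_id(kind, payload):
    """Name an object by SHA-1 over an assumed '<kind> <length>\\0' header plus payload."""
    header = f"{kind} {len(payload)}\0".encode()
    return hashlib.sha1(header + payload).hexdigest()


def store_file(source, store, min_size=MIN_CHUNK):
    """Store the content objects, then a 'file object' that lists their ids."""
    ids = []
    for chunk in split_chunks(source, min_size):
        data = chunk.encode()
        oid = object_id("chunk", data)
        store[oid] = data  # identical chunks land on the same key, so they are shared
        ids.append(oid)
    listing = "\n".join(ids).encode()
    file_id = object_id("file", listing)
    store[file_id] = listing
    return file_id


def load_file(file_id, store):
    """Reassemble the original text by following the file object's chunk list."""
    ids = [oid for oid in store[file_id].decode().split("\n") if oid]
    return b"".join(store[oid] for oid in ids).decode()
```

With the minimum size set to 0, a file laid out like fred.c yields three
content objects plus one file object. Checking in a second version that
only touches fn_b adds one new content object and one new file object,
while the other two content objects are reused. With the 2K default, a
file that small collapses into a single content object, which is the
minimum-size trade-off described above.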