From mboxrd@z Thu Jan 1 00:00:00 1970 From: Davide Libenzi Subject: Re: RFC: adding xdelta compression to git Date: Tue, 3 May 2005 11:10:32 -0700 (PDT) Message-ID: References: <200505030657.38309.alonz@nolaviz.org> Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Cc: "C. Scott Ananian" , Alon Ziv , git@vger.kernel.org X-From: git-owner@vger.kernel.org Tue May 03 20:05:26 2005 Return-path: Received: from vger.kernel.org ([12.107.209.244]) by ciao.gmane.org with esmtp (Exim 4.43) id 1DT1lC-0003Pg-Vp for gcvg-git@gmane.org; Tue, 03 May 2005 20:04:43 +0200 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S261499AbVECSKx (ORCPT ); Tue, 3 May 2005 14:10:53 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S261500AbVECSKx (ORCPT ); Tue, 3 May 2005 14:10:53 -0400 Received: from x35.xmailserver.org ([69.30.125.51]:57043 "EHLO x35.xmailserver.org") by vger.kernel.org with ESMTP id S261499AbVECSKi (ORCPT ); Tue, 3 May 2005 14:10:38 -0400 X-AuthUser: davidel@xmailserver.org Received: from bigblue.dev.mdolabs.com by xmailserver.org with [XMail 1.21 ESMTP Server] id for from ; Tue, 3 May 2005 11:13:12 -0700 X-X-Sender: davide@bigblue.dev.mdolabs.com To: Linus Torvalds In-Reply-To: X-GPG-FINGRPRINT: CFAE 5BEE FD36 F65E E640 56FE 0974 BF23 270F 474E X-GPG-PUBLIC_KEY: http://www.xmailserver.org/davidel.asc Sender: git-owner@vger.kernel.org Precedence: bulk X-Mailing-List: git@vger.kernel.org On Tue, 3 May 2005, Linus Torvalds wrote: > On Tue, 3 May 2005, C. Scott Ananian wrote: > > > > Linus knows this. His point is just to be sure you actually *code* that > > walk in fsck, and (hopefully) do so w/o complicating the fsck too much. > > Indeed. It's also a performance issue. > > If you do xdelta objects, and don't tell fsck about it, then fsck will > just check every object as a blob. Why is that bad? > > Think about it: let's say that you have a series of xdelta objects, and a > fsck that is xdelta-unaware. It will unpack each object independently, > which means that it will keep on doing the same early xdelta work over and > over and over again. Instead of just applying them in order, and checking > the sha1 of the result at each point. > > Now, You probably want to limit the length of the chains to some firly > small number anyway, so maybe that's not a big deal. Who knows. And I'm > actually still so anal that I don't think I'd use this for _my_ tree, just > because I'm a worry-wart (and I still think disk is incredibly cheap ;) If you use a "full tip" metadata format with reverse deltas, you drop a "full" version "time to time" along the chain, and you keep a small index file, you have: 1) No matter how big it becomes the xdelta collection object, you are only touching very limited regions of it (due the small index file, that can be less than 20+8 bytes per entry in the xdelta blob) 2) Checkout happens w/out even doing xpatching (since the tip is full) 3) Checkins requires only one xdelta operation (since the tip is full), and zero if it is the time to store a full version along the chain (I use to drop one every 10-16 xdeltas, depending on the progressive size of the delta operations) 4) Worst case performance in reconstructing histories are bound by the longest xdelta chain (10-16) In some way I tend to agree (strangely ;) with you about the disk-cheap mantra, but network bandwidth matter IMO. So, if you do not want (being a real worry-wart) to use xdelta leverage on the FS trees, you can have way smarter network protocols using xdelta plus the knowledge of the git history structure. The rsync algo uses xdelta, but the poor guy is not able to leverage from the knowledge of the history that only git knows. So, if Larry and Greg shares a common object A, Larry changes A and makes a new git object B, rsync will transfer the whole object B, because it does not have any idea of the git structure. Git though, has this knowledge, and it can say to the remote fetcher: Look, I have this new thing called B, that is basically your thing A plus this very small xdelta (B-A). And typical xdelta diffs are really small (1/7 to 1/10 of classical 'diff -u' ones). - Davide