From mboxrd@z Thu Jan 1 00:00:00 1970 From: Eric Montellese Subject: Fwd: Git and Large Binaries: A Proposed Solution Date: Fri, 21 Jan 2011 13:57:21 -0500 Message-ID: References: Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: QUOTED-PRINTABLE To: git@vger.kernel.org X-From: git-owner@vger.kernel.org Fri Jan 21 19:57:52 2011 Return-path: Envelope-to: gcvg-git-2@lo.gmane.org Received: from vger.kernel.org ([209.132.180.67]) by lo.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1PgMBC-0000Jm-HM for gcvg-git-2@lo.gmane.org; Fri, 21 Jan 2011 19:57:51 +0100 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754848Ab1AUS5p convert rfc822-to-quoted-printable (ORCPT ); Fri, 21 Jan 2011 13:57:45 -0500 Received: from mail-bw0-f46.google.com ([209.85.214.46]:62387 "EHLO mail-bw0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752566Ab1AUS5o convert rfc822-to-8bit (ORCPT ); Fri, 21 Jan 2011 13:57:44 -0500 Received: by bwz15 with SMTP id 15so1871586bwz.19 for ; Fri, 21 Jan 2011 10:57:42 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:in-reply-to:references:from:date :message-id:subject:to:content-type:content-transfer-encoding; bh=VUmZYCKj9WXI7AA+gxkhPLV7hEdLJ8/AdNEjTApB4BU=; b=T4UsajgZUqa2nxZV483nVeW/rcoylJxUCWAuNnglMWoBM8DPL87YB9CcyOXoGgCLIt 5BRmo2lw6qlOLIBEhpSr0bkdcmFqU6xYtexv3oahYM04HClHTSGsFTzGlMzCGQbJsio0 6Uh6qCtRcHoqmzvTk4LfaIXm4XDs83Orq2JXs= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :content-type:content-transfer-encoding; b=RV65ctCBCZLP0E6WcGjxv3N4koGavTrSxUR/gr1iXqSD64jL1Tr60kVJEAbNbysHmv ojp8RzYDZeosAC1EHaOBIGrQQWaLi4QUNLyHGmwPfYweOVQidI8AANJtI9dqQ/d4nkEe z7YHZWOmKoolrmDO192KfIwvwkIRh2a6JNdCY= Received: by 10.204.70.137 with SMTP id d9mr1052537bkj.141.1295636262013; Fri, 21 Jan 2011 10:57:42 -0800 (PST) Received: by 10.204.60.11 with HTTP; Fri, 21 Jan 2011 10:57:21 -0800 (PST) In-Reply-To: Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Archived-At: I did a search for this issue before posting, but if I am touching on an old topic that already has a solution in progress, I apologize. =A0A= s far as I know, this is still an open issue (last i saw in the kernel-trap archives was a "git and binary files" thread from Jan 2008, and there are a couple of promising related works (git annex and git bigfiles) -- but nothing that solves the complete problem). I'm interested in hearing your thoughts and suggestions, and interested if there is community interest in adding this feature to git. =A0I would be happy to be involved in making the changes, but I have very limited time, so would prefer help and would like to know that it has a strong chance of joining the main line before starting... To whet your appetite to read all of the below (I know it's long), this is the root of the solution: ---=A0=A0 =A0 =A0 Don't track binaries in git. =A0Track their hashes. = --- Problem Background: I work on embedded system software, with code for these products delivered from multiple customers and in multiple formats, for example: 1. source code -- works great and is what git is designed for 2. zipped tarballs of source code (that I will never need to modify) -- I could unpack these and then use git to track the source code. However, I prefer to track these deliverables as the tarballs themselves because it makes my customer happier to see the exact tarball that they delivered being used when I repackage updates. (Let's not discuss problems with this model - I understand that this is non-ideal). 3. large and/or many binaries. =A0(could be pictures, short videos, pre-compiled binaries, etc) The problem of course is that, as you know, git is not ideal for handling large, or many, binaries. =A0It's better at just about everything else, of course, but not this, for largely these two reasons: 1. git cannot diff the binaries to their previous iterations effectively to save space. =A0(and neither can any tool) 2. git requires that all clones of that repository must therefore download all versions of all binaries -- which, if the binaries are large, or many, and are poorly compressed together (as stated in 1), this will be a very expensive operation. Problem Statement: We (the git user) want and "need" to be able to track large binaries from within our repository. =A0But putting them into git slows down git unnecessarily. The only current alternative is to *not* check the large binaries into git -- but now they are no longer tracked, which is unacceptable. =A0If I want to jump back in git to a point in the tree from 6 months ago, I do not have any way to tell which version of the large binaries I need. =A0I could keep track of this manually, of course, but that's wha= t git is for... Solution: The short version: ***Don't track binaries in git. =A0Track their hashes.*** Solution: The long version: =46or my current project, I have this (the "store the hashes" idea) implemented outside of git. =A0I am posting to this list because I woul= d like to see this functionality (well, something even better) become native to git, and believe that it would remove one of the few remaining arguments that some projects have against adopting git. Here is how I have it implemented: =46irst the layout: my_git_project/binrepo/ -- binaries/ -- hashes/ -- symlink_to_hashes_file -- symlink_to_another_hashes_file within the "binrepo" (binary repository) there is a subdirectory for binaries, and a subdirectory for hashes. =A0In the root of the 'binrepo= ' all of the files stored have a symlink to the current version of the hash. The "binaries" directory is .gitignore'd -- the hashes directory and the symlinks to the current hashes are maintained by git. Whenever I receive a new version of a large binary file from a customer, I put it into "binaries" and I create a new hash for that file in "hashes" and update the symlink to point to that hash. =A0I 'gi= t commit' and 'git push' those changes (this is fast since there is no large binary in the git repository). The other important factor is that I must put this large binary file somewhere accessible for others to download it. =A0In this example, it is: =A0my_git_server.net:/binrepo/ Then I have a bash script (some psuedocode here to save space): for (all BINFILE in binrepo) ; do =A0=A0HASHFILE=3D$BINFILE".md5" =A0=A0# check if the binary exists =A0=A0if [[ -e binaries/$BINFILE ]] ; then =A0=A0 =A0echo " =A0$BINFILE available" =A0=A0else =A0=A0 =A0echo " =A0$BINFILE not available. Downloading..." =A0=A0 =A0wget http://my_git_server.net:/binrepo/$BINFILE =A0=A0fi =A0=A0# check md5sum =A0=A0md5sum $BINFILE > temp.md5 =A0=A0if ! diff -q ../hashes/$HASHFILE temp.md5 >/dev/null ; then =A0=A0 =A0echo "ERROR! $BINFILE md5 does not match!" =A0=A0 =A0exit and/or redownload =A0=A0fi done This confirms that I have the right version of all of the binaries -- my git repository is effectively tracking the large binaries, but without actually storing them internally to the git repo. =A0If someone else updates the "binrepo" I will know it when I do a "git pull" and I will automatically get the right version of the binary file so that my sandbox is up-to-date. =A0Now let's say I want to revert the version of the large binary file to the previous version -- all I need to do is to to edit the symlink in "binrepo", commit, and push. =A0Other users will automatically use the old version of the file as well after they do their pull (and without needing to re-download that file) Summary of Big Advantages: 1. Repository is unpolluted by large binary files. =A0git clone stays f= ast. 2. User has access to any version of any binary file, but does not need to store every version locally if they do not want to. 3. Git does not need to worry about the big binaries - there are no slow attempts to calculate binary deltas or pack and unpack under the hood. Improvements: I imagine these features (among others): 1. In my current setup, each large binary file has a different name (a revision number). =A0This could be easily solved, however, by generatin= g unique names under the hood and tracking this within git. 2. A lot of the steps in my current setup are manual. =A0When I want to add a new binary file, I need to manually create the hash and manually upload the binary to the joint server. =A0If done within git, this woul= d be automatic. 3. In my setup, all of the binary files are in a single "binrepo" directory. =A0If done from within git, we would need a non-kludgey way to allow large binaries to exist anywhere within the git tree. =A0If gi= t handles the "binrepo" under the hood though, the user would never need to know about it -- instead git would just handle all binaries by checking the internal "binrepo" =A0Instead of tracking symlinks, git would track the file versions in the normal way -- it just wouldn't store the binaries the same way (instead it would store the hash) 4. User option to download all versions of all binaries, or only the version necessary for the position on the current branch. =A0If you wan= t to be able to run all versions of the repository when offline, you can download all versions of all binaries. =A0If you don't need to do this, you can just download the versions you need. Or perhaps have the option to download all binaries smaller than X-bytes, but skip the big ones. 5. Command to purge all binaries in your "binrepo" that are not needed for the current revision (if you're running out of disk space locally). 6. Automatically upload new versions of files to the "binrepo" (rather than needing to do this manually) Rock on! Eric