From: Jeff King <peff@peff.net>
To: Eric Montellese <emontellese@gmail.com>
Cc: git@vger.kernel.org
Subject: Re: Fwd: Git and Large Binaries: A Proposed Solution
Date: Fri, 21 Jan 2011 17:24:42 -0500
Message-ID: <20110121222440.GA1837@sigill.intra.peff.net>
In-Reply-To: <AANLkTimPua_kz2w33BRPeTtOEWOKDCsJzf0sqxm=db68@mail.gmail.com>

On Fri, Jan 21, 2011 at 01:57:21PM -0500, Eric Montellese wrote:

> I did a search for this issue before posting, but if I am touching on
> an old topic that already has a solution in progress, I apologize.

It's been talked about a lot, but there is not exactly a solution in
progress. One promising direction is not very different from what you're
doing, though:

> Solution:
> The short version:
> ***Don't track binaries in git.  Track their hashes.***

Yes, exactly. But what your solution lacks, I think, is more integration
into git. Specifically, using clean/smudge filters you can have git take
care of tracking the file contents automatically.

At the very simplest, it would look like:

-- >8 --
cat >$HOME/local/bin/huge-clean <<'EOF'
#!/bin/sh

# In an ideal world, we could access the original file directly
# instead of having to cat stdin to a temporary file.
temp="$(git rev-parse --git-dir)"/huge.$$
cat >"$temp"
sha1=$(sha1sum "$temp" | cut -d' ' -f1)

# now move it to wherever your permanent storage is, e.g.:
# scp "$temp" host:/path/to/big_storage/"$sha1"
cp "$temp" /tmp/big_storage/"$sha1"
rm -f "$temp"

# the hash is all that git actually stores in the blob
echo "$sha1"
EOF
chmod +x $HOME/local/bin/huge-clean

cat >$HOME/local/bin/huge-smudge <<'EOF'
#!/bin/sh

# the stored blob contains only the hash; read it from stdin
read sha1

# now retrieve the real content; we could optionally do some caching here
# ssh host cat /path/to/big_storage/"$sha1"
cat /tmp/big_storage/"$sha1"
EOF
chmod +x $HOME/local/bin/huge-smudge
-- 8< --

Obviously our storage mechanism (throwing things in /tmp) is simplistic,
but you could just as easily store and retrieve via ssh, http, s3, or
whatever.

You can try it out like this:

  # set up our filter config and fake storage area
  # (this assumes $HOME/local/bin is in your PATH)
  mkdir /tmp/big_storage
  git config --global filter.huge.clean huge-clean
  git config --global filter.huge.smudge huge-smudge

  # now make a repo, and make sure we mark *.bin files as huge
  mkdir repo && cd repo && git init
  echo '*.bin filter=huge' >.gitattributes
  git add .gitattributes
  git commit -m 'add attributes'

  # let's do a moderate 20M file
  perl -e 'print "foo\n" for (1 .. 5000000)' >foo.bin
  git add foo.bin
  git commit -m 'add huge file (foo)'

  # and then another revision
  perl -e 'print "bar\n" for (1 .. 5000000)' >foo.bin
  git commit -a -m 'revise huge file (bar)'

Notice that we just add and commit as normal.  And we can check that the
space usage is what you expect:

  $ du -sh repo/.git
  196K    repo/.git
  $ du -sh /tmp/big_storage
  39M     /tmp/big_storage

Diffs obviously are going to be less interesting, as we just see the
hash:

  $ git log --oneline -p foo.bin
  39e549c revise huge file (bar)
  diff --git a/foo.bin b/foo.bin
  index 281fd03..70874bd 100644
  --- a/foo.bin
  +++ b/foo.bin
  @@ -1 +1 @@
  -50a1ee265f4562721346566701fce1d06f54dd9e
  +bbc2f7f191ad398fe3fcb57d885e1feacb4eae4e
  845836e add huge file (foo)
  diff --git a/foo.bin b/foo.bin
  new file mode 100644
  index 0000000..281fd03
  --- /dev/null
  +++ b/foo.bin
  @@ -0,0 +1 @@
  +50a1ee265f4562721346566701fce1d06f54dd9e

but if you wanted to, you could write a custom diff driver that does
something more meaningful with your particular binary format (it would
have to grab from big_storage, though).
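
For example, git's textconv facility will feed a text-converted version
of each blob to the diff machinery. A minimal sketch (the huge-textconv
name is made up here, and for a real binary format you'd pipe the payload
through some dumper that produces text):

  cat >$HOME/local/bin/huge-textconv <<'EOF'
  #!/bin/sh
  # git hands us a file whose content is just the hash; resolve it
  # to the real payload so the diff sees actual content
  read sha1 <"$1"
  cat /tmp/big_storage/"$sha1"
  EOF
  chmod +x $HOME/local/bin/huge-textconv

  git config --global diff.huge.textconv huge-textconv
  echo '*.bin filter=huge diff=huge' >.gitattributes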

Checking out other revisions works without extra action:

  $ head -n 1 foo.bin
  bar
  $ git checkout HEAD^
  HEAD is now at 845836e... add huge file (foo)
  $ head -n 1 foo.bin
  foo

And since you have the filter config in your ~/.gitconfig, clones will
just work:

  $ git clone repo other
  $ du -sh other/.git
  204K    other/.git
  $ du -sh other/foo.bin
  20M     other/foo.bin

So conceptually it's pretty similar to yours, but the filter integration
means that git takes care of putting the right files in place at the
right time.

It would probably benefit a lot from caching the large binary files
instead of hitting big_storage all the time. And probably the
putting/getting from storage should be factored out so you can plug in
different storage. And it should all be configurable. Different users of
the same repo might want different caching policies, or to access the
binary assets by different mechanisms or URLs.
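
A cache-aware smudge filter might look something like this (the cache
location is arbitrary, and the cp stands in for scp/curl/whatever your
transport ends up being):

  cat >$HOME/local/bin/huge-smudge-cached <<'EOF'
  #!/bin/sh
  # check a local cache first; fill it from big_storage on a miss
  cache=$HOME/.cache/huge-blobs
  read sha1
  if ! test -f "$cache/$sha1"; then
    mkdir -p "$cache"
    cp /tmp/big_storage/"$sha1" "$cache/$sha1.tmp" &&
    mv "$cache/$sha1.tmp" "$cache/$sha1"
  fi
  cat "$cache/$sha1"
  EOF

Fetching to a temp file and renaming means a half-fetched blob never
looks like a valid cache entry.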

> I imagine these features (among others):
> 
> 1. In my current setup, each large binary file has a different name (a
> revision number).  This could be easily solved, however, by generating
> unique names under the hood and tracking this within git.

In the scheme above, we just index by their hash. So you can easily fsck
your big_storage by making sure everything matches its hash (but you
can't know that you have _all_ of the blobs needed unless you
cross-reference with the history).
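
Both checks are easy to script. A rough sketch, assuming the
/tmp/big_storage layout and the *.bin attribute from the example:

  # integrity: each stored file should match the hash it is named after
  for f in /tmp/big_storage/*; do
    test "$(sha1sum <"$f" | cut -d' ' -f1)" = "$(basename "$f")" ||
      echo "corrupt: $f"
  done

  # completeness: every hash referenced by a *.bin blob in history
  # should exist in storage
  git rev-list --objects --all |
  awk '$2 ~ /\.bin$/ { print $1 }' | sort -u |
  while read blob; do
    sha1=$(git cat-file blob "$blob")
    test -f /tmp/big_storage/"$sha1" || echo "missing: $sha1"
  done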

> 2. A lot of the steps in my current setup are manual.  When I want to
> add a new binary file, I need to manually create the hash and manually
> upload the binary to the joint server.  If done within git, this would
> be automatic.

I think the scheme above takes care of the manual bits.

> 3. In my setup, all of the binary files are in a single "binrepo"
> directory.  If done from within git, we would need a non-kludgey way
> to allow large binaries to exist anywhere within the git tree.

Any scheme, whether it uses clean/smudge filters or not, should probably
tie in via gitattributes.

> 4. User option to download all versions of all binaries, or only the
> version necessary for the position on the current branch.  If you want
> to be able to run all versions of the repository when offline, you can
> download all versions of all binaries.  If you don't need to do this,
> you can just download the versions you need.  Or perhaps have the
> option to download all binaries smaller than X-bytes, but skip the big
> ones.

The scheme above will download on an as-needed basis. If caching were
implemented, you could just make the cache infinitely big and do a "git
log -p" with a textconv driver like the one above, which would download
everything. :)

Probably you would also want the smudge filter to return "blob not
available" when operating in some kind of offline mode.

> 5. Command to purge all binaries in your "binrepo" that are not needed
> for the current revision (if you're running out of disk space
> locally).

In my scheme, just rm your cache directory (once it exists).
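
With the hypothetical cache layout sketched earlier, that's just:

  rm -rf $HOME/.cache/huge-blobs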

> 6. Automatically upload new versions of files to the "binrepo" (rather
> than needing to do this manually)

Handled by the clean filter above.


So obviously this is not very complete. And there are a few changes to
git that could make it more efficient (e.g., letting the clean filter
touch the file directly instead of having to make a copy via stdin). But
the general idea is there, and it just needs somebody to make a nice
polished script that is configurable, does caching, etc. I'll get to it
eventually, but if you'd like to work on it, be my guest.

-Peff
