git@vger.kernel.org mailing list mirror (one of many)
* Fwd: Git and Large Binaries: A Proposed Solution
       [not found] <AANLkTin=UySutWLS0Y7OmuvkE=T=+YB8G8aUCxLH=GKa@mail.gmail.com>
@ 2011-01-21 18:57 ` Eric Montellese
  2011-01-21 21:36   ` Wesley J. Landaker
                     ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Eric Montellese @ 2011-01-21 18:57 UTC (permalink / raw)
  To: git

I did a search for this issue before posting, but if I am touching on
an old topic that already has a solution in progress, I apologize.  As
far as I know this is still an open issue: the last discussion I saw in
the kernel-trap archives was a "git and binary files" thread from Jan
2008, and there are a couple of promising related projects (git-annex
and git-bigfiles), but nothing that solves the complete problem.

I'm interested in hearing your thoughts and suggestions, and in
whether there is community interest in adding this feature to git.  I
would be happy to be involved in making the changes, but I have very
limited time, so I would prefer help and would like to know that the
feature has a strong chance of joining the mainline before
starting...
To whet your appetite to read all of the below (I know it's long),
this is the root of the solution:

---       Don't track binaries in git.  Track their hashes.       ---


Problem Background:
I work on embedded system software, with code for these products
delivered from multiple customers and in multiple formats, for
example:

1. source code -- works great and is what git is designed for
2. zipped tarballs of source code (that I will never need to modify)
-- I could unpack these and then use git to track the source code.
However, I prefer to track these deliverables as the tarballs
themselves because it makes my customer happier to see the exact
tarball that they delivered being used when I repackage updates.
(Let's not discuss problems with this model - I understand that this
is non-ideal).
3. large and/or many binaries.  (could be pictures, short videos,
pre-compiled binaries, etc)

The problem, of course, is that, as you know, git is not ideal for
handling large, or many, binaries.  It's better at just about
everything else, but not this, largely for these two reasons:

1. git cannot diff binaries against their previous iterations
effectively to save space (and neither can any other tool).
2. git requires that every clone of the repository download every
version of every binary -- and if the binaries are large, or many,
and compress poorly together (as stated in 1), that is a very
expensive operation.


Problem Statement:
We (the git user) want and "need" to be able to track large binaries
from within our repository.  But putting them into git slows down git
unnecessarily.
The only current alternative is to *not* check the large binaries into
git -- but now they are no longer tracked, which is unacceptable.  If
I want to jump back in git to a point in the tree from 6 months ago, I
do not have any way to tell which version of the large binaries I
need.  I could keep track of this manually, of course, but that's what
git is for...


Solution:
The short version:
***Don't track binaries in git.  Track their hashes.***

Solution:
The long version:
For my current project, I have this (the "store the hashes" idea)
implemented outside of git.  I am posting to this list because I would
like to see this functionality (well, something even better) become
native to git, and believe that it would remove one of the few
remaining arguments that some projects have against adopting git.
Here is how I have it implemented:

First the layout:
my_git_project/binrepo/
-- binaries/
-- hashes/
-- symlink_to_hashes_file
-- symlink_to_another_hashes_file
within the "binrepo" (binary repository) there is a subdirectory for
binaries, and a subdirectory for hashes.  In the root of the 'binrepo'
all of the files stored have a symlink to the current version of the
hash.
The "binaries" directory is .gitignore'd -- the hashes directory and
the symlinks to the current hashes are maintained by git.
Whenever I receive a new version of a large binary file from a
customer, I put it into "binaries" and I create a new hash for that
file in "hashes" and update the symlink to point to that hash.  I 'git
commit' and 'git push' those changes (this is fast since there is no
large binary in the git repository).
The other important factor is that I must put this large binary file
somewhere accessible for others to download it.  In this example, it
is:  my_git_server.net:/binrepo/
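
Concretely, adding a new drop looks something like this (file names are
made up for illustration; the hash is generated from inside binaries/
so that the name recorded in the .md5 file matches what the check
script below expects):

  cd my_git_project/binrepo/binaries
  cp ~/customer_drop/firmware-1.2.bin .
  md5sum firmware-1.2.bin > ../hashes/firmware-1.2.bin.md5
  cd ..
  ln -sf hashes/firmware-1.2.bin.md5 firmware_current.md5
  git add hashes/firmware-1.2.bin.md5 firmware_current.md5
  git commit -m "track firmware 1.2 drop"
  git push
  # and make the binary itself available to everyone else
  scp binaries/firmware-1.2.bin my_git_server.net:/binrepo/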

Then I have a bash script along these lines (simplified here; assume it
runs from inside binrepo/binaries):

for HASHFILE in ../hashes/*.md5 ; do
  BINFILE=$(basename "$HASHFILE" .md5)
  # check if the binary exists locally
  if [[ -e $BINFILE ]] ; then
    echo "  $BINFILE available"
  else
    echo "  $BINFILE not available. Downloading..."
    wget "http://my_git_server.net/binrepo/$BINFILE"
  fi
  # check the md5sum against the committed hash
  md5sum "$BINFILE" > temp.md5
  if ! diff -q "$HASHFILE" temp.md5 >/dev/null ; then
    echo "ERROR! $BINFILE md5 does not match!"
    exit 1   # or delete the bad copy and re-download it
  fi
done

This confirms that I have the right version of all of the binaries --
my git repository is effectively tracking the large binaries without
actually storing them inside the git repo.  If someone else updates
the "binrepo" I will know it when I do a "git pull", and I will
automatically get the right version of the binary file so that my
sandbox is up-to-date.  Now let's say I want to revert the large
binary file to the previous version -- all I need to do is edit the
symlink in "binrepo", commit, and push.  Other users will
automatically use the old version of the file as well after they do
their pull (and without needing to re-download that file).


Summary of Big Advantages:

1. Repository is unpolluted by large binary files.  git clone stays fast.
2. User has access to any version of any binary file, but does not
need to store every version locally if they do not want to.
3. Git does not need to worry about the big binaries - there are no
slow attempts to calculate binary deltas or pack and unpack under the
hood.


Improvements:

I imagine these features (among others):

1. In my current setup, each large binary file has a different name (a
revision number).  This could be easily solved, however, by generating
unique names under the hood and tracking this within git.
2. A lot of the steps in my current setup are manual.  When I want to
add a new binary file, I need to manually create the hash and manually
upload the binary to the joint server.  If done within git, this would
be automatic.
3. In my setup, all of the binary files are in a single "binrepo"
directory.  If done from within git, we would need a non-kludgey way
to allow large binaries to exist anywhere within the git tree.  If git
handles the "binrepo" under the hood though, the user would never need
to know about it -- instead git would just handle all binaries by
checking the internal "binrepo"  Instead of tracking symlinks, git
would track the file versions in the normal way -- it just wouldn't
store the binaries the same way (instead it would store the hash)
4. User option to download all versions of all binaries, or only the
version necessary for the position on the current branch.  If you want
to be able to run all versions of the repository when offline, you can
download all versions of all binaries.  If you don't need to do this,
you can just download the versions you need.  Or perhaps have the
option to download all binaries smaller than X-bytes, but skip the big
ones.
5. Command to purge all binaries in your "binrepo" that are not needed
for the current revision (if you're running out of disk space
locally).
6. Automatically upload new versions of files to the "binrepo" (rather
than needing to do this manually)


Rock on!
Eric

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Fwd: Git and Large Binaries: A Proposed Solution
  2011-01-21 18:57 ` Fwd: Git and Large Binaries: A Proposed Solution Eric Montellese
@ 2011-01-21 21:36   ` Wesley J. Landaker
  2011-01-21 22:00     ` Eric Montellese
  2011-01-21 22:24   ` Jeff King
  2011-01-22  0:07   ` Joey Hess
  2 siblings, 1 reply; 20+ messages in thread
From: Wesley J. Landaker @ 2011-01-21 21:36 UTC (permalink / raw)
  To: Eric Montellese; +Cc: git

On Friday, January 21, 2011 11:57:21 Eric Montellese wrote:
> To whet your appetite to read all of the below (I know it's long),
> this is the root of the solution:
> 
> ---       Don't track binaries in git.  Track their hashes.       ---

Comment from the peanut gallery:

I haven't read your approach in great detail, but just in case you are not 
aware, there is a project called git-annex <http://git-annex.branchable.com/>
by Joey Hess that I believe takes a similar approach.

Since you've obviously given this a lot of thought, you might want to take a 
peek at that and see if it already does what you want, or if your proposal 
does something significantly different/better.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Fwd: Git and Large Binaries: A Proposed Solution
  2011-01-21 21:36   ` Wesley J. Landaker
@ 2011-01-21 22:00     ` Eric Montellese
  0 siblings, 0 replies; 20+ messages in thread
From: Eric Montellese @ 2011-01-21 22:00 UTC (permalink / raw)
  To: Wesley J. Landaker; +Cc: git

Thanks Wesley,

I did take a look at git-annex -- it looks to me as though that
project is more of a special case, allowing users to use git to track
things like music and movies.  While it's possible this might be
usable for the use case I described, what I'm really looking for is a
true extension of git which allows binaries to be treated differently
(if the user desires) when using git as a source management tool.

The major difference I see with git-annex is that the user must
specifically tell git-annex to download certain files.  Instead, I
want the user to always automatically have *all* of the files (both
source and binaries) for the current revision -- but not necessarily
for hundreds (thousands?) of past revisions (which, as git is
currently implemented, would take up many gigabytes).

git-annex does look like a neat piece of software, but I don't think
it quite fits here -- thank you again for the comment though!

Eric


On Fri, Jan 21, 2011 at 4:36 PM, Wesley J. Landaker <wjl@icecavern.net> wrote:
> On Friday, January 21, 2011 11:57:21 Eric Montellese wrote:
>> To whet your appetite to read all of the below (I know it's long),
>> this is the root of the solution:
>>
>> ---       Don't track binaries in git.  Track their hashes.       ---
>
> Comment from the peanut gallery:
>
> I haven't read your approach in great detail, but just in case you are not
> aware, there is a project call git-annex <http://git-annex.branchable.com/>
> by Joey Hess that I believe takes a similar approach.
>
> Since you've obviously given this a lot of thought, you might want to take a
> peek at that and see if it already does what you want, or if your proposal
> does something significantly different/better.
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Fwd: Git and Large Binaries: A Proposed Solution
  2011-01-21 18:57 ` Fwd: Git and Large Binaries: A Proposed Solution Eric Montellese
  2011-01-21 21:36   ` Wesley J. Landaker
@ 2011-01-21 22:24   ` Jeff King
  2011-01-21 23:15     ` Eric Montellese
  2011-01-23 14:14     ` Pete Wyckoff
  2011-01-22  0:07   ` Joey Hess
  2 siblings, 2 replies; 20+ messages in thread
From: Jeff King @ 2011-01-21 22:24 UTC (permalink / raw)
  To: Eric Montellese; +Cc: git

On Fri, Jan 21, 2011 at 01:57:21PM -0500, Eric Montellese wrote:

> I did a search for this issue before posting, but if I am touching on
> an old topic that already has a solution in progress, I apologize.  As

It's been talked about a lot, but there is not exactly a solution in
progress. One promising direction is not very different from what you're
doing, though:

> Solution:
> The short version:
> ***Don't track binaries in git.  Track their hashes.***

Yes, exactly. But what your solution lacks, I think, is more integration
into git. Specifically, using clean/smudge filters you can have git take
care of tracking the file contents automatically.

At the very simplest, it would look like:

-- >8 --
cat >$HOME/local/bin/huge-clean <<'EOF'
#!/bin/sh

# In an ideal world, we could actually
# access the original file directly instead of
# having to cat it to a new file.
temp="$(git rev-parse --git-dir)"/huge.$$
cat >"$temp"
sha1=`sha1sum "$temp" | cut -d' ' -f1`

# now move it to wherever your permanent storage is
# scp "$root/$sha1" host:/path/to/big_storage/$sha1
cp "$temp" /tmp/big_storage/$sha1
rm -f "$temp"

echo $sha1
EOF

cat >$HOME/local/bin/huge-smudge <<'EOF'
#!/bin/sh

# Get sha1 from stored blob via stdin
read sha1

# Now retrieve blob. We could optionally do some caching here.
# ssh host cat /path/to/big/storage/$sha1
cat /tmp/big_storage/$sha1
EOF
-- 8< --

Our storage mechanism (throwing things in /tmp) is obviously
simplistic, but you could store and retrieve via ssh, http, s3, or
whatever.

You can try it out like this:

  # set up our filter config and fake storage area
  mkdir /tmp/big_storage
  git config --global filter.huge.clean huge-clean
  git config --global filter.huge.smudge huge-smudge

  # now make a repo, and make sure we mark *.bin files as huge
  mkdir repo && cd repo && git init
  echo '*.bin filter=huge' >.gitattributes
  git add .gitattributes
  git commit -m 'add attributes'

  # let's do a moderate 20M file
  perl -e 'print "foo\n" for (1 .. 5000000)' >foo.bin
  git add foo.bin
  git commit -m 'add huge file (foo)'

  # and then another revision
  perl -e 'print "bar\n" for (1 .. 5000000)' >foo.bin
  git commit -a -m 'revise huge file (bar)'

Notice that we just add and commit as normal.  And we can check that the
space usage is what you expect:

  $ du -sh repo/.git
  196K    repo/.git
  $ du -sh /tmp/big_storage
  39M     /tmp/big_storage

Diffs obviously are going to be less interesting, as we just see the
hash:

  $ git log --oneline -p foo.bin
  39e549c revise huge file (bar)
  diff --git a/foo.bin b/foo.bin
  index 281fd03..70874bd 100644
  --- a/foo.bin
  +++ b/foo.bin
  @@ -1 +1 @@
  -50a1ee265f4562721346566701fce1d06f54dd9e
  +bbc2f7f191ad398fe3fcb57d885e1feacb4eae4e
  845836e add huge file (foo)
  diff --git a/foo.bin b/foo.bin
  new file mode 100644
  index 0000000..281fd03
  --- /dev/null
  +++ b/foo.bin
  @@ -0,0 +1 @@
  +50a1ee265f4562721346566701fce1d06f54dd9e

but if you wanted to, you could write a custom diff driver that does
something more meaningful with your particular binary format (it would
have to grab from big_storage, though).

Checking out other revisions works without extra action:

  $ head -n 1 foo.bin
  bar
  $ git checkout HEAD^
  HEAD is now at 845836e... add huge file (foo)
  $ head -n 1 foo.bin
  foo

And since you have the filter config in your ~/.gitconfig, clones will
just work:

  $ git clone repo other
  $ du -sh other/.git
  204K    other/.git
  $ du -sh other/foo.bin
  20M

So conceptually it's pretty similar to yours, but the filter integration
means that git takes care of putting the right files in place at the
right time.

It would probably benefit a lot from caching the large binary files
instead of hitting big_storage all the time. And probably the
putting/getting from storage should be factored out so you can plug in
different storage. And it should all be configurable. Different users of
the same repo might want different caching policies, or to access the
binary assets by different mechanisms or URLs.
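
For example, a cached version of the huge-smudge filter above might
look something like this (just a sketch; the cache directory is an
arbitrary choice, and the cp stands in for whatever remote storage you
actually use):

cat >$HOME/local/bin/huge-smudge <<'EOF'
#!/bin/sh

read sha1

# keep a local cache of blobs we have already fetched
cache="$HOME/.cache/huge-blobs"
mkdir -p "$cache"

if ! test -f "$cache/$sha1"; then
  # fetch from permanent storage; swap this cp for scp/curl/s3/etc.
  cp /tmp/big_storage/"$sha1" "$cache/$sha1"
fi

cat "$cache/$sha1"
EOF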

> I imagine these features (among others):
> 
> 1. In my current setup, each large binary file has a different name (a
> revision number).  This could be easily solved, however, by generating
> unique names under the hood and tracking this within git.

In the scheme above, we just index by their hash. So you can easily fsck
your big_storage by making sure everything matches its hash (but you
can't know that you have _all_ of the blobs needed unless you
cross-reference with the history).
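
Such a check could be as simple as (a sketch, using the
/tmp/big_storage layout from above):

  for f in /tmp/big_storage/*; do
    test "$(sha1sum "$f" | cut -d' ' -f1)" = "$(basename "$f")" ||
      echo "corrupt or misnamed: $f"
  done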

> 2. A lot of the steps in my current setup are manual.  When I want to
> add a new binary file, I need to manually create the hash and manually
> upload the binary to the joint server.  If done within git, this would
> be automatic.

I think the scheme above takes care of the manual bits.

> 3. In my setup, all of the binary files are in a single "binrepo"
> directory.  If done from within git, we would need a non-kludgey way
> to allow large binaries to exist anywhere within the git tree.  If git

Any scheme, whether it uses clean/smudge filters or not, should probably
tie in via gitattributes.

> 4. User option to download all versions of all binaries, or only the
> version necessary for the position on the current branch.  If you want
> to be able to run all versions of the repository when offline, you can
> download all versions of all binaries.  If you don't need to do this,
> you can just download the versions you need.  Or perhaps have the
> option to download all binaries smaller than X-bytes, but skip the big
> ones.

The scheme above will download on an as-needed basis. If caching were
implemented, you could just make the cache infinitely big and do a "git
log -p" which would download everything. :)

Probably you would also want the smudge filter to return "blob not
available" when operating in some kind of offline mode.

> 5. Command to purge all binaries in your "binrepo" that are not needed
> for the current revision (if you're running out of disk space
> locally).

In my scheme, just rm your cache directory (once it exists).

> 6. Automatically upload new versions of files to the "binrepo" (rather
> than needing to do this manually)

Handled by the clean filter above.


So obviously this is not very complete. And there are a few changes to
git that could make it more efficient (e.g., letting the clean filter
touch the file directly instead of having to make a copy via stdin). But
the general idea is there, and it just needs somebody to make a nice
polished script that is configurable, does caching, etc. I'll get to it
eventually, but if you'd like to work on it, be my guest.

-Peff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Fwd: Git and Large Binaries: A Proposed Solution
  2011-01-21 22:24   ` Jeff King
@ 2011-01-21 23:15     ` Eric Montellese
  2011-01-22  3:05       ` Sverre Rabbelier
  2011-01-23 14:14     ` Pete Wyckoff
  1 sibling, 1 reply; 20+ messages in thread
From: Eric Montellese @ 2011-01-21 23:15 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Peff,

Thanks for your insight -- this looks great.

Once something like this is available and more polished, what's the
process to request that it join the main line of git development?  (I
know functionally there's "no main line" in git... but you know what I
mean)

Has there already been discussion to this effect?  I do think that a
fix like this would improve git adoption among certain groups.  (I
know I've heard the "big binaries" problem mentioned at least a few
times)


I haven't dug around in git code yet, so while I can get the gist of
your code, I'm unable to get the complete picture.  You wouldn't
happen to have a git patch, or a public repo somewhere that I can take
a look at?  Does there happen to be a git developers guide hidden away
anywhere?  Though I have very limited time, I'd be happy to help out
as much as I can.


Eric







On Fri, Jan 21, 2011 at 5:24 PM, Jeff King <peff@peff.net> wrote:
> On Fri, Jan 21, 2011 at 01:57:21PM -0500, Eric Montellese wrote:
>
>> I did a search for this issue before posting, but if I am touching on
>> an old topic that already has a solution in progress, I apologize.  As
>
> It's been talked about a lot, but there is not exactly a solution in
> progress. One promising direction is not very different from what you're
> doing, though:
>
>> Solution:
>> The short version:
>> ***Don't track binaries in git.  Track their hashes.***
>
> Yes, exactly. But what your solution lacks, I think, is more integration
> into git. Specifically, using clean/smudge filters you can have git take
> care of tracking the file contents automatically.
>
> At the very simplest, it would look like:
>
> -- >8 --
> cat >$HOME/local/bin/huge-clean <<'EOF'
> #!/bin/sh
>
> # In an ideal world, we could actually
> # access the original file directly instead of
> # having to cat it to a new file.
> temp="$(git rev-parse --git-dir)"/huge.$$
> cat >"$temp"
> sha1=`sha1sum "$temp" | cut -d' ' -f1`
>
> # now move it to wherever your permanent storage is
> # scp "$root/$sha1" host:/path/to/big_storage/$sha1
> cp "$temp" /tmp/big_storage/$sha1
> rm -f "$temp"
>
> echo $sha1
> EOF
>
> cat >$HOME/local/bin/huge-smudge <<'EOF'
> #!/bin/sh
>
> # Get sha1 from stored blob via stdin
> read sha1
>
> # Now retrieve blob. We could optionally do some caching here.
> # ssh host cat /path/to/big/storage/$sha1
> cat /tmp/big_storage/$sha1
> EOF
> -- 8< --
>
> Obviously our storage mechanism (throwing things in /tmp) is simplistic,
> but obviously you could store and retrieve via ssh, http, s3, or
> whatever.
>
> You can try it out like this:
>
>  # set up our filter config and fake storage area
>  mkdir /tmp/big_storage
>  git config --global filter.huge.clean huge-clean
>  git config --global filter.huge.smudge huge-smudge
>
>  # now make a repo, and make sure we mark *.bin files as huge
>  mkdir repo && cd repo && git init
>  echo '*.bin filter=huge' >.gitattributes
>  git add .gitattributes
>  git commit -m 'add attributes'
>
>  # let's do a moderate 20M file
>  perl -e 'print "foo\n" for (1 .. 5000000)' >foo.bin
>  git add foo.bin
>  git commit -m 'add huge file (foo)'
>
>  # and then another revision
>  perl -e 'print "bar\n" for (1 .. 5000000)' >foo.bin
>  git commit -a -m 'revise huge file (bar)'
>
> Notice that we just add and commit as normal.  And we can check that the
> space usage is what you expect:
>
>  $ du -sh repo/.git
>  196K    repo/.git
>  $ du -sh /tmp/big_storage
>  39M     /tmp/big_storage
>
> Diffs obviously are going to be less interesting, as we just see the
> hash:
>
>  $ git log --oneline -p foo.bin
>  39e549c revise huge file (bar)
>  diff --git a/foo.bin b/foo.bin
>  index 281fd03..70874bd 100644
>  --- a/foo.bin
>  +++ b/foo.bin
>  @@ -1 +1 @@
>  -50a1ee265f4562721346566701fce1d06f54dd9e
>  +bbc2f7f191ad398fe3fcb57d885e1feacb4eae4e
>  845836e add huge file (foo)
>  diff --git a/foo.bin b/foo.bin
>  new file mode 100644
>  index 0000000..281fd03
>  --- /dev/null
>  +++ b/foo.bin
>  @@ -0,0 +1 @@
>  +50a1ee265f4562721346566701fce1d06f54dd9e
>
> but if you wanted to, you could write a custom diff driver that does
> something more meaningful with your particular binary format (it would
> have to grab from big_storage, though).
>
> Checking out other revisions works without extra action:
>
>  $ head -n 1 foo.bin
>  bar
>  $ git checkout HEAD^
>  HEAD is now at 845836e... add huge file (foo)
>  $ head -n 1 foo.bin
>  foo
>
> And since you have the filter config in your ~/.gitconfig, clones will
> just work:
>
>  $ git clone repo other
>  $ du -sh other/.git
>  204K    other/.git
>  $ du -sh other/foo.bin
>  20M
>
> So conceptually it's pretty similar to yours, but the filter integration
> means that git takes care of putting the right files in place at the
> right time.
>
> It would probably benefit a lot from caching the large binary files
> instead of hitting big_storage all the time. And probably the
> putting/getting from storage should be factored out so you can plug in
> different storage. And it should all be configurable. Different users of
> the same repo might want different caching policies, or to access the
> binary assets by different mechanisms or URLs.
>
>> I imagine these features (among others):
>>
>> 1. In my current setup, each large binary file has a different name (a
>> revision number).  This could be easily solved, however, by generating
>> unique names under the hood and tracking this within git.
>
> In the scheme above, we just index by their hash. So you can easily fsck
> your big_storage by making sure everything matches its hash (but you
> can't know that you have _all_ of the blobs needed unless you
> cross-reference with the history).
>
>> 2. A lot of the steps in my current setup are manual.  When I want to
>> add a new binary file, I need to manually create the hash and manually
>> upload the binary to the joint server.  If done within git, this would
>> be automatic.
>
> I think the scheme above takes care of the manual bits.
>
>> 3. In my setup, all of the binary files are in a single "binrepo"
>> directory.  If done from within git, we would need a non-kludgey way
>> to allow large binaries to exist anywhere within the git tree.  If git
>
> Any scheme, whether it uses clean/smudge filters or not, should probably
> tie in via gitattributes.
>
>> 4. User option to download all versions of all binaries, or only the
>> version necessary for the position on the current branch.  If you want
>> to be able to run all versions of the repository when offline, you can
>> download all versions of all binaries.  If you don't need to do this,
>> you can just download the versions you need.  Or perhaps have the
>> option to download all binaries smaller than X-bytes, but skip the big
>> ones.
>
> The scheme above will download on an as-needed basis. If caching were
> implemented, you could just make the cache infinitely big and do a "git
> log -p" which would download everything. :)
>
> Probably you would also want the smudge filter to return "blob not
> available" when operating in some kind of offline mode.
>
>> 5. Command to purge all binaries in your "binrepo" that are not needed
>> for the current revision (if you're running out of disk space
>> locally).
>
> In my scheme, just rm your cache directory (once it exists).
>
>> 6. Automatically upload new versions of files to the "binrepo" (rather
>> than needing to do this manually)
>
> Handled by the clean filter above.
>
>
> So obviously this is not very complete. And there are a few changes to
> git that could make it more efficient (e.g., letting the clean filter
> touch the file directly instead of having to make a copy via stdin). But
> the general idea is there, and it just needs somebody to make a nice
> polished script that is configurable, does caching, etc. I'll get to it
> eventually, but if you'd like to work on it, be my guest.
>
> -Peff
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Fwd: Git and Large Binaries: A Proposed Solution
  2011-01-21 18:57 ` Fwd: Git and Large Binaries: A Proposed Solution Eric Montellese
  2011-01-21 21:36   ` Wesley J. Landaker
  2011-01-21 22:24   ` Jeff King
@ 2011-01-22  0:07   ` Joey Hess
  2 siblings, 0 replies; 20+ messages in thread
From: Joey Hess @ 2011-01-22  0:07 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 4113 bytes --]

Hi, I wrote git-annex, and pristine-tar, and etckeeper. I enjoy making
git do things that I'm told it shouldn't be used for. :) I should have
probably talked more about git-annex here, before.

Eric Montellese wrote:
> 2. zipped tarballs of source code (that I will never need to modify)
> -- I could unpack these and then use git to track the source code.
> However, I prefer to track these deliverables as the tarballs
> themselves because it makes my customer happier to see the exact
> tarball that they delivered being used when I repackage updates.
> (Let's not discuss problems with this model - I understand that this
> is non-ideal).

In this specific case, you can use pristine-tar to recreate the
original, exact tarballs from unpacked source files that you check into
git. It accomplishes this without the overhead of duplicating compressed
data in tarballs. I feel this is a better approach here than
generic large file support, since it stores all the data in git, just in a
much more compressed form, and so fits in nicely with standard git-based
source code management.
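
The workflow is roughly along these lines (simplified; see the
pristine-tar documentation for the exact invocations):

  # with the unpacked source committed, record the small delta needed
  # to reproduce the exact upstream tarball
  pristine-tar commit ../foo-1.0.tar.gz <tree-ish-of-unpacked-source>

  # later, regenerate the byte-identical tarball from the repository
  pristine-tar checkout ../foo-1.0.tar.gz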

> The short version:
> ***Don't track binaries in git.  Track their hashes.***

That was my principle with git-annex. Although slightly generalized to:
"Don't track large file contents in git. Track unique keys that
an arbitrary backend can use to obtain the file contents."

Now, you mention in a followup that git-annex does not default to keeping
a local copy of every binary referenced by a file in master.
This is true, for the simple reason that a copy of every file in the
master branch of some of my git repos would sum to multiple terabytes
of data. :) I think that, practically, anything that supports large
files in git needs to support partial checkouts too.

But git-annex can be run in, e.g., a post-merge hook, and asked to
retrieve all current file contents and drop outdated contents.
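
For example, a post-merge hook along these lines (a simplified sketch):

  $ cat .git/hooks/post-merge
  #!/bin/sh
  # fetch the content of every annexed file in the newly merged tree
  git annex get .
  # dropping no-longer-needed content could be added here, e.g. after
  # reviewing what "git annex unused" reports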

> First the layout:
> my_git_project/binrepo/
> -- binaries/
> -- hashes/
> -- symlink_to_hashes_file
> -- symlink_to_another_hashes_file
> within the "binrepo" (binary repository) there is a subdirectory for
> binaries, and a subdirectory for hashes.  In the root of the 'binrepo'
> all of the files stored have a symlink to the current version of the
> hash.

Very similar to git-annex in the use of versioned symlinks here.
It stores the binaries in .git/annex/objects to avoid needing to
gitignore them.

> 3. In my setup, all of the binary files are in a single "binrepo"
> directory.  If done from within git, we would need a non-kludgey way
> to allow large binaries to exist anywhere within the git tree.

git-annex allows the symlinks to be mixed with regular git managed
content throughout the repository. (This means that when symlinks
are moved, they may need to be fixed, which is done at commit time.)

> 5. Command to purge all binaries in your "binrepo" that are not needed
> for the current revision (if you're running out of disk space
> locally).

Safely dropping data is really one of the complexities of this
approach. Git-annex stores location tracking information in git,
so it can know where it can retrieve file data *from*. I chose to make
it very cautious about removing data, as location tracking data can 
fall out of date (if for example, a remote had the data, had dropped it,
and has not pushed that information out). So it actively confirms that
enough other copies of the data currently exist before dropping it.
(Of course, these checks can be disabled.)

> 6. Automatically upload new versions of files to the "binrepo" (rather
> than needing to do this manually)

In git-annex, data transfer is done using rsync, so that interrupted
transfers of large files can be resumed. I recently added a git-annex-shell
to support locked-down access, similar to git-shell.


BTW, I have been meaning to look into using smudge filters with git-annex.
I'm a bit worried about some of the potential overhead associated with
smudge filters, and I'm not sure how a partial checkout would work with
them.

-- 
see shy jo

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Fwd: Git and Large Binaries: A Proposed Solution
  2011-01-21 23:15     ` Eric Montellese
@ 2011-01-22  3:05       ` Sverre Rabbelier
  0 siblings, 0 replies; 20+ messages in thread
From: Sverre Rabbelier @ 2011-01-22  3:05 UTC (permalink / raw)
  To: Eric Montellese, Avery Pennarun, Jonathan Leto; +Cc: Jeff King, git, Joey Hess

Heya,

[More or less separate from the ongoing discussion, so no text quoted]

Eric, at the last GitTogether Avery presented his tool, bup, which
implements a number of solutions to the problem of large binary files.
I think I remember that Jonathan is also interested in the topic.
Avery, Jonathan, you can read up on the ongoing conversation at [0] if
you like :).

[0] http://thread.gmane.org/gmane.comp.version-control.git/165389/focus=165401

-- 
Cheers,

Sverre Rabbelier

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Fwd: Git and Large Binaries: A Proposed Solution
  2011-01-21 22:24   ` Jeff King
  2011-01-21 23:15     ` Eric Montellese
@ 2011-01-23 14:14     ` Pete Wyckoff
  2011-01-26  3:42       ` Scott Chacon
  2011-03-10 21:02       ` Alexander Miseler
  1 sibling, 2 replies; 20+ messages in thread
From: Pete Wyckoff @ 2011-01-23 14:14 UTC (permalink / raw)
  To: Jeff King; +Cc: git

peff@peff.net wrote on Fri, 21 Jan 2011 17:24 -0500:
> cat >$HOME/local/bin/huge-clean <<'EOF'
> #!/bin/sh
> 
> # In an ideal world, we could actually
> # access the original file directly instead of
> # having to cat it to a new file.
> temp="$(git rev-parse --git-dir)"/huge.$$
> cat >"$temp"

Just a quick aside.  Since (a2b665d, 2011-01-05) you can provide
the filename as an argument to the filter script:

    git config --global filter.huge.clean huge-clean %f

then use it in place:

    $ cat >huge-clean 
    #!/bin/sh
    f="$1"
    echo orig file is "$f" >&2
    sha1=`sha1sum "$f" | cut -d' ' -f1`
    cp "$f" /tmp/big_storage/$sha1
    rm -f "$f"
    echo $sha1

		-- Pete

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Fwd: Git and Large Binaries: A Proposed Solution
  2011-01-23 14:14     ` Pete Wyckoff
@ 2011-01-26  3:42       ` Scott Chacon
  2011-01-26 16:23         ` Eric Montellese
                           ` (2 more replies)
  2011-03-10 21:02       ` Alexander Miseler
  1 sibling, 3 replies; 20+ messages in thread
From: Scott Chacon @ 2011-01-26  3:42 UTC (permalink / raw)
  To: Pete Wyckoff; +Cc: Eric Montellese, Jeff King, git list

Hey,

Sorry to come in a bit late to this, but in addition to git-annex, I
wrote something called 'git-media' a long time ago that works in a
similar manner to what you both are discussing.

Much like what peff was talking about, it uses the smudge and clean
filters to automatically redirect content into a .git/media directory
instead of into Git itself while keeping the SHA in Git.  One of the
cool things is that it can use S3, scp, or a local directory to transfer
the big files to and from.

Check it out if interested:

https://github.com/schacon/git-media

On Sun, Jan 23, 2011 at 6:14 AM, Pete Wyckoff <pw@padd.com> wrote:
> peff@peff.net wrote on Fri, 21 Jan 2011 17:24 -0500:
>
> Just a quick aside.  Since (a2b665d, 2011-01-05) you can provide
> the filename as an argument to the filter script:
>
>    git config --global filter.huge.clean huge-clean %f
>

This is amazing.  I absolutely did not know you could do this, and it
would make parts of git-media way better if I re-implemented it using
this.  Thanks for pointing this out.

Scott

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Fwd: Git and Large Binaries: A Proposed Solution
  2011-01-26  3:42       ` Scott Chacon
@ 2011-01-26 16:23         ` Eric Montellese
  2011-01-26 17:42         ` Joey Hess
  2011-01-26 21:40         ` Jakub Narebski
  2 siblings, 0 replies; 20+ messages in thread
From: Eric Montellese @ 2011-01-26 16:23 UTC (permalink / raw)
  To: Scott Chacon; +Cc: Pete Wyckoff, Jeff King, git list

Good stuff!

So, it seems like there are at least a few decent ways to work around
the git binaries problem -- but my question is, will something like
this become part of mainline git?  (and how is such a decision made
and by whom?)

It does seem that there is a real need for a solution like this, and a
lot of the core code to handle it has already been written (perhaps
even by multiple folks); it's just in need (as Peff said) of a
polished and configurable script.  If one were to exist, would it
become part of mainline git?

Thanks,
Eric






On Tue, Jan 25, 2011 at 10:42 PM, Scott Chacon <schacon@gmail.com> wrote:
> Hey,
>
> Sorry to come in a bit late to this, but in addition to git-annex, I
> wrote something called 'git-media' a long time ago that works in a
> similar manner to what you both are discussing.
>
> Much like what peff was talking about, it uses the smudge and clean
> filters to automatically redirect content into a .git/media directory
> instead of into Git itself while keeping the SHA in Git.  One of the
> cool thing is that it can use S3, scp or a local directory to transfer
> the big files to and from.
>
> Check it out if interested:
>
> https://github.com/schacon/git-media
>
> On Sun, Jan 23, 2011 at 6:14 AM, Pete Wyckoff <pw@padd.com> wrote:
>> peff@peff.net wrote on Fri, 21 Jan 2011 17:24 -0500:
>>
>> Just a quick aside.  Since (a2b665d, 2011-01-05) you can provide
>> the filename as an argument to the filter script:
>>
>>    git config --global filter.huge.clean huge-clean %f
>>
>
> This is amazing.  I absolutely did not know you could do this, and it
> would make parts of git-media way better if I re-implemented it using
> this.  Thanks for pointing this out.
>
> Scott
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Fwd: Git and Large Binaries: A Proposed Solution
  2011-01-26  3:42       ` Scott Chacon
  2011-01-26 16:23         ` Eric Montellese
@ 2011-01-26 17:42         ` Joey Hess
  2011-01-26 21:40         ` Jakub Narebski
  2 siblings, 0 replies; 20+ messages in thread
From: Joey Hess @ 2011-01-26 17:42 UTC (permalink / raw)
  To: Scott Chacon; +Cc: git list

[-- Attachment #1: Type: text/plain, Size: 881 bytes --]

Scott Chacon wrote:
> Sorry to come in a bit late to this, but in addition to git-annex, I
> wrote something called 'git-media' a long time ago that works in a
> similar manner to what you both are discussing.

Huh, if I had known about that I might not have written git-annex.
Although probably I would have still, I needed a more distributed
approach to storing data than git-media seems to support, and the
ability to partially check out only some files.

> >    git config --global filter.huge.clean huge-clean %f
> 
> This is amazing.  I absolutely did not know you could do this, and it
> would make parts of git-media way better if I re-implemented it using
> this.  Thanks for pointing this out.

Yeah, that's great, it should allow a smart clean filter to not waste
time reprocessing a known large file on every git status/git commit.

-- 
see shy jo

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Fwd: Git and Large Binaries: A Proposed Solution
  2011-01-26  3:42       ` Scott Chacon
  2011-01-26 16:23         ` Eric Montellese
  2011-01-26 17:42         ` Joey Hess
@ 2011-01-26 21:40         ` Jakub Narebski
  2 siblings, 0 replies; 20+ messages in thread
From: Jakub Narebski @ 2011-01-26 21:40 UTC (permalink / raw)
  To: Scott Chacon; +Cc: Pete Wyckoff, Eric Montellese, Jeff King, git list

Scott Chacon <schacon@gmail.com> writes:

> Sorry to come in a bit late to this, but in addition to git-annex, I
> wrote something called 'git-media' a long time ago that works in a
> similar manner to what you both are discussing.
> 
> Much like what peff was talking about, it uses the smudge and clean
> filters to automatically redirect content into a .git/media directory
> instead of into Git itself while keeping the SHA in Git.  One of the
> cool thing is that it can use S3, scp or a local directory to transfer
> the big files to and from.
> 
> Check it out if interested:
> 
> https://github.com/schacon/git-media

Could you please add short information about this project to the
https://git.wiki.kernel.org/index.php/InterfacesFrontendsAndTools
page, in the "Backups, metadata, and large files" subsection?
git-annex is there...

Thanks in advance.

-- 
Jakub Narebski
Poland
ShadeHawk on #git

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Fwd: Git and Large Binaries: A Proposed Solution
  2011-01-23 14:14     ` Pete Wyckoff
  2011-01-26  3:42       ` Scott Chacon
@ 2011-03-10 21:02       ` Alexander Miseler
  2011-03-10 22:24         ` Jeff King
  1 sibling, 1 reply; 20+ messages in thread
From: Alexander Miseler @ 2011-03-10 21:02 UTC (permalink / raw)
  To: Pete Wyckoff; +Cc: Jeff King, git, emontellese, schacon, joey

I've been debating whether to resurrect this thread, but since it has been referenced by the SoC2011Ideas wiki article I will just go ahead.
I've spent a few hours trying to make this approach work, in order to make git with big files usable under Windows.

> Just a quick aside.  Since (a2b665d, 2011-01-05) you can provide
> the filename as an argument to the filter script:
> 
>     git config --global filter.huge.clean huge-clean %f
> 
> then use it in place:
> 
>     $ cat >huge-clean 
>     #!/bin/sh
>     f="$1"
>     echo orig file is "$f" >&2
>     sha1=`sha1sum "$f" | cut -d' ' -f1`
>     cp "$f" /tmp/big_storage/$sha1
>     rm -f "$f"
>     echo $sha1
> 
> 		-- Pete

First off, the commit mentioned here is no help at all. This commit changes nothing about the input and output of filters. The file is still loaded completely into memory, still streamed to the filter via stdin, still streamed from the filter via stdout into yet another memory buffer, the two of which, IIRC, exist simultaneously for at least some time, thus doubling the memory requirements. This change only additionally provides the file name to the filter and nothing else. If one carefully rereads the commit message, this apparently was the intention.

After this I started digging into the git source code. To change the filter input would be extremely trivial. However, the function that returns the filter output in a memory buffer is called from 8 places (all details from wetware memory and therefore unreliable). Most, maybe all, of the callers just dump the buffer into a file, which could easily be relocated into the filter calling function itself. But two callers detached the buffer from the strbuf and kept it beyond writing the file. I didn't track it any further since I decided to spend my time on improving big file handling in git itself rather than targeting a workaround. Though of course a completely big-file-ready git should also provide a sane way to feed big files to and from filters.

If the two detached buffers are no complication, this might be a trivial project. If they are, it might become demanding though.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Fwd: Git and Large Binaries: A Proposed Solution
  2011-03-10 21:02       ` Alexander Miseler
@ 2011-03-10 22:24         ` Jeff King
  2011-03-13  1:53           ` Eric Montellese
  0 siblings, 1 reply; 20+ messages in thread
From: Jeff King @ 2011-03-10 22:24 UTC (permalink / raw)
  To: Alexander Miseler; +Cc: Pete Wyckoff, git, emontellese, schacon, joey

On Thu, Mar 10, 2011 at 10:02:53PM +0100, Alexander Miseler wrote:

> I've been debating whether to resurrect this thread, but since it has
> been referenced by the SoC2011Ideas wiki article I will just go ahead.
> I've spent a few hours trying to make this work to make git with big
> files usable under Windows.
> 
> > Just a quick aside.  Since (a2b665d, 2011-01-05) you can provide
> > the filename as an argument to the filter script:
> > 
> >     git config --global filter.huge.clean huge-clean %f
> > 
> > then use it in place:
> > 
> >     $ cat >huge-clean 
> >     #!/bin/sh
> >     f="$1"
> >     echo orig file is "$f" >&2
> >     sha1=`sha1sum "$f" | cut -d' ' -f1`
> >     cp "$f" /tmp/big_storage/$sha1
> >     rm -f "$f"
> >     echo $sha1
> > 
> > 		-- Pete

After thinking about this strategy more (the "convert big binary files
into a hash via clean/smudge filter" strategy), it feels like a hack.
That is, I don't see any reason that git can't give you the equivalent
behavior without having to resort to bolted-on scripts.

For example, with this strategy you are giving up meaningful diffs in
favor of just showing a diff of the hashes. But git _already_ can do
this for binary diffs.  The problem is that git unnecessarily uses a
bunch of memory to come up with that answer because of assumptions in
the diff code. So we should be fixing those assumptions. Any place that
this smudge/clean filter solution could avoid looking at the blobs, we
should be able to do the same inside git.

Of course that leaves the storage question; Scott's git-media script has
pluggable storage that is backed by http, s3, or whatever. But again,
that is a feature that might be worth putting into git (even if it is
just a pluggable script at the object-db level).

-Peff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Fwd: Git and Large Binaries: A Proposed Solution
  2011-03-10 22:24         ` Jeff King
@ 2011-03-13  1:53           ` Eric Montellese
  2011-03-13  2:52             ` Jeff King
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Montellese @ 2011-03-13  1:53 UTC (permalink / raw)
  To: Jeff King; +Cc: Alexander Miseler, Pete Wyckoff, git, schacon, joey

This is a good point.

The best solution, it seems, has two parts:

1. Clean up the way in which git considers, diffs, and stores binaries
to cut down on the overhead of dealing with these files.
  1.1 Perhaps a "binaries" directory, or structure of directories, within .git
  1.2 Perhaps configurable options for when and how to try a binary
diff?  (allow user to decide if storage or speed is more important)
2. Once (1) is accomplished, add an option to avoid copying binaries
from all but the tip when doing a "git clone."
  2.1 The default behavior would be to copy everything, as users
currently expect.
  2.2 Core code would have hooks to allow a script to use a central
location for the binary storage. (ssh, http, gmail-fs, whatever)

(of course, the implementation of (1) should be friendly to the addition of (2))

Obviously, the major drawback to (2) without (2.2) is that if there is
truly distributed work going on, some clone-of-a-clone may not know
where to get the binaries.

But print a warning when turning off the default behavior (2.1), and
then it's a user problem :-)

Eric




On Thu, Mar 10, 2011 at 5:24 PM, Jeff King <peff@peff.net> wrote:
>
> On Thu, Mar 10, 2011 at 10:02:53PM +0100, Alexander Miseler wrote:
>
> > I've been debating whether to resurrect this thread, but since it has
> > been referenced by the SoC2011Ideas wiki article I will just go ahead.
> > I've spent a few hours trying to make this work to make git with big
> > files usable under Windows.
> >
> > > Just a quick aside.  Since (a2b665d, 2011-01-05) you can provide
> > > the filename as an argument to the filter script:
> > >
> > >     git config --global filter.huge.clean huge-clean %f
> > >
> > > then use it in place:
> > >
> > >     $ cat >huge-clean
> > >     #!/bin/sh
> > >     f="$1"
> > >     echo orig file is "$f" >&2
> > >     sha1=`sha1sum "$f" | cut -d' ' -f1`
> > >     cp "$f" /tmp/big_storage/$sha1
> > >     rm -f "$f"
> > >     echo $sha1
> > >
> > >             -- Pete
>
> After thinking about this strategy more (the "convert big binary files
> into a hash via clean/smudge filter" strategy), it feels like a hack.
> That is, I don't see any reason that git can't give you the equivalent
> behavior without having to resort to bolted-on scripts.
>
> For example, with this strategy you are giving up meaningful diffs in
> favor of just showing a diff of the hashes. But git _already_ can do
> this for binary diffs.  The problem is that git unnecessarily uses a
> bunch of memory to come up with that answer because of assumptions in
> the diff code. So we should be fixing those assumptions. Any place that
> this smudge/clean filter solution could avoid looking at the blobs, we
> should be able to do the same inside git.
>
> Of course that leaves the storage question; Scott's git-media script has
> pluggable storage that is backed by http, s3, or whatever. But again,
> that is a feature that might be worth putting into git (even if it is
> just a pluggable script at the object-db level).
>
> -Peff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Fwd: Git and Large Binaries: A Proposed Solution
  2011-03-13  1:53           ` Eric Montellese
@ 2011-03-13  2:52             ` Jeff King
  2011-03-13 19:33               ` Alexander Miseler
  0 siblings, 1 reply; 20+ messages in thread
From: Jeff King @ 2011-03-13  2:52 UTC (permalink / raw)
  To: Eric Montellese; +Cc: Alexander Miseler, Pete Wyckoff, git, schacon, joey

On Sat, Mar 12, 2011 at 08:53:53PM -0500, Eric Montellese wrote:

> The best solution, it seems, has two parts:
> 
> 1. Clean up the way in which git considers, diffs, and stores binaries
> to cut down on the overhead of dealing with these files.

This is the easier half, I think.

>   1.1 Perhaps a "binaries" directory, or structure of directories, within .git

I'd rather not do something so drastic. We already have ways of marking
files as binary and un-diffable within the tree. So you can already do
pretty well with marking them with gitattributes. I think we can do
better by making them the binaryness auto-detection less expensive
(right now we pull in the whole blob to check the first 1K or so for
NULs or other patterns; this is fine in the common text case, where
we'll want the whole blob in a minute anyway, but for large files it's
obviously wasteful). There may also be code-paths for binary files where
we accidentally load them (I just fixed one last week where we
unnecessarily loaded them in the diffstat code path). Somebody will need
to do some experimenting to shake out those code paths.

For packing, we have core.bigFileThreshold to turn off delta compression
for large files, but according to the documentation, it is only honored
for fast-import. I think we would want something similar to say "for
some subset of files (indicated either by name or by minimum size),
don't bother with zlib-compression either, and always keep them loose".

Those are the two major ones, I think. There are probably a handful of
other cases (like git-add, which really should be able to have a fixed
memory size). Again, the first step is figuring out where all of the
problems are (and I'm happy to just fix them one by one as they come up,
but I am also thinking of this in terms of a GSoC project).

>   1.2 Perhaps configurable options for when and how to try a binary
> diff?  (allow user to decide if storage or speed is more important)

We can already do that with gitattributes. But it would be nice to have
it be fast in the binary auto-detection case.
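
For example, something like this already keeps git from attempting
content diffs on a class of files and skips delta compression above a
threshold (the pattern and size are arbitrary examples, and as noted
above, how widely the threshold is honored is a separate question):

  # .gitattributes: treat these as opaque binaries, never diff them
  *.bin binary

  # don't bother delta-compressing anything bigger than this
  git config core.bigFileThreshold 100m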

> 2. Once (1) is accomplished, add an option to avoid copying binaries
> from all but the tip when doing a "git clone."

This is much harder. :)

>   2.1 The default behavior would be to copy everything, as users
> currently expect.
>   2.2 Core code would have hooks to allow a script to use a central
> location for the binary storage. (ssh, http, gmail-fs, whatever)

I think we would need a protocol extension for the fetching client to
say "please don't bother sending me anything larger than N bytes; I will
get it via alternate storage". Although there are situations more
complicated than that. Your alternate storage might have up to commit X,
and you don't want large objects in X or its ancestors. But you _do_
want large objects in descendants of X, since you have no other way to
get them.

So you need some way of saying which sets of large objects you need and
which you don't. One implementation is that you could fetch from
alternate storage (which would then need to be not just large-blob
storage, but actually have a full repo), and then afterwards fetch from
the remote (which would then send you all binaries, because by
definition anything you are fetching is not something the alternate
storage has). That feels a bit hack-ish. Doing something more clever
would require a pretty major protocol extension, though.

I haven't been paying attention to any sparse clone proposals. I know it
has come up but I don't know how mature the idea is. But this is
potentially related.

-Peff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Fwd: Git and Large Binaries: A Proposed Solution
  2011-03-13  2:52             ` Jeff King
@ 2011-03-13 19:33               ` Alexander Miseler
  2011-03-14 19:32                 ` Jeff King
  0 siblings, 1 reply; 20+ messages in thread
From: Alexander Miseler @ 2011-03-13 19:33 UTC (permalink / raw)
  To: Jeff King; +Cc: Eric Montellese, Pete Wyckoff, git, schacon, joey


My thoughts on big file storage:

We want to store them as flat as possible. Ideally if we have a temp file with the content (e.g. the output of some filter) it should be possible to store it by simply doing a move/rename and updating some meta data external to the actual file.

Options:

1.) The loose file format is inherently unsuited for this. It has a header before the actual content and the whole file (header + content) is always compressed. Even if one changes this to compressing/decompressing header and content independently, it is still unsuited because a) it has the header within the same file and b) the header has no flags or other means to indicate a different behavior (e.g. no compression) for the content. We could extend the header format or introduce a new object type (e.g. flatblob) but both would probably cause more trouble than other solutions. Another idea would be to keep the metadata in an external file (e.g. 84d7.header for the object 84d7). This would probably have bad performance though, since every object lookup would first need to check for the existence of a header file. A smarter variant would be to optionally keep the meta data directly in the filename (e.g. saving the object as 84d7.object_type.size.flag instead of just 84d7).
This would only require special handling for cases where the normal lookup for 84d7 fails.

2.) The pack format fares a lot better. Content and meta data are already separated, with the meta data describing how the content is stored. We would need a flag to mark the content as flat and that would pretty much be it. We would still need to include a virtual header when calculating the sha1, so it is guaranteed that the same content always has the same id.
Thus I think we should simply forgo the loose object phase when storing big files and drop each big file flat as an individual pack file, with the idx file describing it as a pack file with one entry which is stored flat.

3.) Do some completely different handling for big files, as suggested by Eric:
>>   1.1 Perhaps a "binaries" directory, or structure of directories, within .git
> 
> I'd rather not do something so drastic.
My main issue with this approach (apart from the 'drastic' ^_^) is that the definition of a big file may change at any time, e.g. by changing a config value like core.bigFileThreshold. What has been stored as a big file may suddenly be considered a normal blob and vice versa. Thus any storage variant that isn't well integrated into the normal object storage will probably be troublesome.




> There may also be code-paths for binary files where
> we accidentally load them (I just fixed one last week where we
> unnecessarily loaded them in the diffstat code path). Somebody will need
> to do some experimenting to shake out those code paths.

This is my main focus for now. They are easy to detect when your memory is small enough :D

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Fwd: Git and Large Binaries: A Proposed Solution
  2011-03-13 19:33               ` Alexander Miseler
@ 2011-03-14 19:32                 ` Jeff King
  2011-03-16  0:35                   ` Eric Montellese
  2011-03-16 14:40                   ` Nguyen Thai Ngoc Duy
  0 siblings, 2 replies; 20+ messages in thread
From: Jeff King @ 2011-03-14 19:32 UTC (permalink / raw)
  To: Alexander Miseler; +Cc: Eric Montellese, Pete Wyckoff, git, schacon, joey

On Sun, Mar 13, 2011 at 08:33:18PM +0100, Alexander Miseler wrote:

> We want to store them as flat as possible. Ideally if we have a temp
> file with the content (e.g. the output of some filter) it should be
> possible to store it by simply doing a move/rename and updating some
> meta data external to the actual file.

Yeah, that would be a nice optimization.  But I'd rather do the easy
stuff first and see if more advanced stuff is still worth doing.

For example, I spent some time a while back designing a faster textconv
interface (the current interface spools the blob to a tempfile, whereas
in some cases a filter needs to only access the first couple kilobytes
of the file to get metadata). But what I found was that an even better
scheme was to cache textconv output in git-notes. Then it speeds up the
slow case _and_ the already-fast case.

Now after this, would my new textconv interface still speed up the
initial non-cached textconv? Absolutely. But I didn't really care
anymore, because the small speed up on the first run was not worth the
trouble of maintaining two interfaces (at least for my datasets).

And this may fall into the same category. Accessing big blobs is
expensive. One solution is to make it a bit faster. Another solution is
to just do it less. So we may find that once we are doing it less, it is
not worth the complexity to make it faster.

And note that I am not saying "it definitely won't be worth it"; only
that it is worth making the easy, big optimizations first and then
seeing what's left to do.

> 1.) The loose file format is inherently unsuited for this. It has a
> header before the actual content and the whole file (header + content)
> is always compressed. Even if one changes this to
> compressing/decompressing header and content independently it is still
> unsuited by a) having the header within the same file and b) because
> the header has no flags or other means to indicate a different
> behavior (e.g. no compression) for the content. We could extend the
> header format or introduce a new object type (e.g. flatblob) but both
> would probably cause more trouble than other solutions. Another idea
> would be to keep the metadata in an external file (e.g. 84d7.header
> for the object 84d7). This would probably have a bad performance
> though since every object lookup would first need to check for the
> existence of a header file. A smarter variant would be to optionally
> keep the meta data directly in the filename (e.g. saving the object as
> 84d7.object_type.size.flag instead of just 84d7).
> This would only require special handling for cases where the normal lookup for 84d7 fails.

A new object type is definitely a bad idea. It changes the sha1 of the
resulting object, which means that two otherwise identical trees which
differ only in the use of "flatblob" versus a regular blob will have
different sha1s.

So I think the right place to insert this would be at the object db
layer. The header just has the type and size. But I don't think anybody
is having a problem with large objects that are _not_ blobs. So the
simplest implementation would be a special blob-only object db
containing pristine files. We implicitly know that objects in this db
are blobs, and we can get the size from the filesystem via stat().
Checking their sha1 would involve prepending "blob <size>\0" to the file
data. It does introduce an extra stat() into object lookup, so probably
we would have the lookup order of pack, regular loose object, flat blob
object. Then you pay the extra stat() only in the less-common case of
accessing either a large blob or a non-existent object.
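
In code, something like this is what I have in mind for the lookup
order (just a sketch; the helper names and the flat-blob directory
layout are made up, and the real thing would of course reuse the
existing pack and loose object lookups):

#include <stdio.h>
#include <sys/stat.h>

/* stand-ins for the existing lookups; made-up names */
static int find_in_packs(const char *hex)
{
        (void)hex;
        return 0;
}

static int find_loose_object(const char *hex)
{
        (void)hex;
        return 0;
}

/*
 * The proposed third step: a blob-only object db of pristine files.
 * The type is implicitly "blob"; the size comes from stat(). This is
 * the one extra stat() mentioned above.
 */
static int find_flat_blob(const char *hex, unsigned long *size)
{
        struct stat st;
        char path[256];

        snprintf(path, sizeof(path), ".git/objects/flat/%.2s/%s",
                 hex, hex + 2);
        if (stat(path, &st))
                return 0;
        *size = (unsigned long)st.st_size;
        return 1;
}

static int lookup_object(const char *hex)
{
        unsigned long size;

        if (find_in_packs(hex))
                return 1;
        if (find_loose_object(hex))
                return 1;
        /* only large blobs and missing objects pay for this */
        return find_flat_blob(hex, &size);
}

int main(void)
{
        printf("%d\n",
               lookup_object("84d7c0ffee000000000000000000000000000000"));
        return 0;
}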

That being said, I'm not sure how much this optimization will buy us.
There are times when being able to mmap() the file directly, or point an
external program directly at the original blob will be helpful. But we
will still have to copy, for example on checkout. It would be nice if
there was a way to make a copy-on-write link from the working tree to
the original file. But I don't think there is a portable way to do so,
and we can't allow the user to accidentally munge the contents of the
object db, which are supposed to be immutable.
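
For completeness: some filesystems do offer a copy-on-write clone,
just not portably. On Linux, btrfs exposes a clone ioctl (FICLONE in
current kernel headers, formerly the btrfs-specific BTRFS_IOC_CLONE)
that shares blocks between the object file and the working tree file
until one of them is written to, so the object db would stay pristine.
A Linux-only sketch, which is exactly why it could never be more than
an optional fast path:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* FICLONE */

/*
 * Linux-only sketch: clone src into dst without copying the data.
 * Later writes to dst get their own blocks, so the source (here: the
 * object db file) is never touched. Fails with EOPNOTSUPP on
 * filesystems without reflink support.
 */
int main(int argc, char **argv)
{
        int src, dst;

        if (argc != 3)
                return 1;
        src = open(argv[1], O_RDONLY);
        dst = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (src < 0 || dst < 0)
                return 1;
        if (ioctl(dst, FICLONE, src)) {
                perror("clone ioctl");
                return 1;
        }
        close(src);
        close(dst);
        return 0;
}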

-Peff

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Fwd: Git and Large Binaries: A Proposed Solution
  2011-03-14 19:32                 ` Jeff King
@ 2011-03-16  0:35                   ` Eric Montellese
  2011-03-16 14:40                   ` Nguyen Thai Ngoc Duy
  1 sibling, 0 replies; 20+ messages in thread
From: Eric Montellese @ 2011-03-16  0:35 UTC (permalink / raw)
  To: Jeff King; +Cc: Alexander Miseler, Pete Wyckoff, git, schacon, joey

Makes a lot of sense --

As you said, the "sparse clone" idea and this one (not downloading all
binaries) probably have a similar or related solution... In fact, I'd
imagine that most of the reasons for needing a sparse clone come down
to large binaries (since text files compress so nicely in git).

And actually, if the sparse-clone idea is limited to only binaries
being "sparse" (i.e. not copied), that probably simplifies the
sparse-clone logic quite a bit, since you don't need to split up the
bits of patches to generate the resulting changesets you need? (This
is based on my very loose understanding of how git works.)


So, if we simplify our requirements a bit (at least as a first cut),
perhaps we are down to these tasks (similar to the earlier list, but
modified):

1. Clean up git's handling of binaries to improve efficiency. In
doing so, see if it makes sense to somewhat separate the way that
binaries are stored (particularly because this would help with (2)).

2. Allow full clones to be "sparsely cloned" (that is, cloned with
the exception of the large binary files).
   2.1 As a corollary, no clones of any kind can be made from a sparse
clone (sparse clones are "leaf" nodes on a tree of descendants) --
that cuts down the complexity quite a bit, since the "remote" you
cloned from will always have the files if you need 'em.


Doing so limits some of the possible applications of "binary sparse"
clones, but it might yield a cleaner final solution -- thoughts?


Eric




On Mon, Mar 14, 2011 at 3:32 PM, Jeff King <peff@peff.net> wrote:
> On Sun, Mar 13, 2011 at 08:33:18PM +0100, Alexander Miseler wrote:
>
>> We want to store them as flat as possible. Ideally if we have a temp
>> file with the content (e.g. the output of some filter) it should be
>> possible to store it by simply doing a move/rename and updating some
>> meta data external to the actual file.
>
> Yeah, that would be a nice optimization.  But I'd rather do the easy
> stuff first and see if more advanced stuff is still worth doing.
>
> For example, I spent some time a while back designing a faster textconv
> interface (the current interface spools the blob to a tempfile, whereas
> in some cases a filter needs to only access the first couple kilobytes
> of the file to get metadata). But what I found was that an even better
> scheme was to cache textconv output in git-notes. Then it speeds up the
> slow case _and_ the already-fast case.
>
> Now after this, would my new textconv interface still speed up the
> initial non-cached textconv? Absolutely. But I didn't really care
> anymore, because the small speed up on the first run was not worth the
> trouble of maintaining two interfaces (at least for my datasets).
>
> And this may fall into the same category. Accessing big blobs is
> expensive. One solution is to make it a bit faster. Another solution is
> to just do it less. So we may find that once we are doing it less, it is
> not worth the complexity to make it faster.
>
> And note that I am not saying "it definitely won't be worth it"; only
> that it is worth making the easy, big optimizations first and then
> seeing what's left to do.
>
>> 1.) The loose file format is inherently unsuited for this. It has a
>> header before the actual content and the whole file (header + content)
>> is always compressed. Even if one changes this to
>> compressing/decompressing header and content independently it is still
>> unsuited by a) having the header within the same file and b) because
>> the header has no flags or other means to indicate a different
>> behavior (e.g. no compression) for the content. We could extend the
>> header format or introduce a new object type (e.g. flatblob) but both
>> would probably cause more trouble than other solutions. Another idea
>> would be to keep the metadata in an external file (e.g. 84d7.header
>> for the object 84d7). This would probably have a bad performance
>> though since every object lookup would first need to check for the
>> existence of a header file. A smarter variant would be to optionally
>> keep the meta data directly in the filename (e.g. saving the object as
>> 84d7.object_type.size.flag instead of just 84d7).
>> This would only require special handling for cases where the normal lookup for 84d7 fails.
>
> A new object type is definitely a bad idea. It changes the sha1 of the
> resulting object, which means that two otherwise identical trees which
> differ only in the use of "flatblob" versus a regular blob will have
> different sha1s.
>
> So I think the right place to insert this would be at the object db
> layer. The header just has the type and size. But I don't think anybody
> is having a problem with large objects that are _not_ blobs. So the
> simplest implementation would be a special blob-only object db
> containing pristine files. We implicitly know that objects in this db
> are blobs, and we can get the size from the filesystem via stat().
> Checking their sha1 would involve prepending "blob <size>\0" to the file
> data. It does introduce an extra stat() into object lookup, so probably
> we would have the lookup order of pack, regular loose object, flat blob
> object. Then you pay the extra stat() only in the less-common case of
> accessing either a large blob or a non-existent object.
>
> That being said, I'm not sure how much this optimization will buy us.
> There are times when being able to mmap() the file directly, or point an
> external program directly at the original blob will be helpful. But we
> will still have to copy, for example on checkout. It would be nice if
> there was a way to make a copy-on-write link from the working tree to
> the original file. But I don't think there is a portable way to do so,
> and we can't allow the user to accidentally munge the contents of the
> object db, which are supposed to be immutable.
>
> -Peff
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Fwd: Git and Large Binaries: A Proposed Solution
  2011-03-14 19:32                 ` Jeff King
  2011-03-16  0:35                   ` Eric Montellese
@ 2011-03-16 14:40                   ` Nguyen Thai Ngoc Duy
  1 sibling, 0 replies; 20+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2011-03-16 14:40 UTC (permalink / raw)
  To: Jeff King
  Cc: Alexander Miseler, Eric Montellese, Pete Wyckoff, git, schacon,
	joey

On Tue, Mar 15, 2011 at 2:32 AM, Jeff King <peff@peff.net> wrote:
> That being said, I'm not sure how much this optimization will buy us.
> There are times when being able to mmap() the file directly, or point an
> external program directly at the original blob will be helpful. But we
> will still have to copy, for example on checkout.

Sparse checkout code may help. If those large files are not always
needed, they can be marked skip-checkout based on
core.bigFileThreshold and won't be checked out until explicitly
requested. This use may conflict with sparse checkout, though, because
sparse checkout sets the skip-checkout bits automatically from
$GIT_DIR/info/sparsecheckout.
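
Roughly like this (a sketch with made-up names, just to show where the
threshold check would go; the real change would hook into the checkout
code and whatever bit sparse checkout already uses):

#include <stdio.h>

/* would come from core.bigFileThreshold; 512MB is just an example */
static unsigned long big_file_threshold = 512UL * 1024 * 1024;

struct fake_entry {
        const char *name;
        unsigned long size;
        int skip_checkout;
};

/*
 * Sketch: mark entries above the threshold so they are not checked
 * out until explicitly requested. A real version would have to
 * coexist with the bits derived from $GIT_DIR/info/sparsecheckout,
 * which is the conflict mentioned above.
 */
static void mark_big_files(struct fake_entry *entries, int nr)
{
        int i;

        for (i = 0; i < nr; i++)
                if (entries[i].size > big_file_threshold)
                        entries[i].skip_checkout = 1;
}

int main(void)
{
        struct fake_entry entries[] = {
                { "Makefile", 40 * 1024, 0 },
                { "firmware.bin", 700UL * 1024 * 1024, 0 },
        };

        mark_big_files(entries, 2);
        printf("%s: skip=%d\n", entries[1].name, entries[1].skip_checkout);
        return 0;
}
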
-- 
Duy

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2011-03-16 14:41 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <AANLkTin=UySutWLS0Y7OmuvkE=T=+YB8G8aUCxLH=GKa@mail.gmail.com>
2011-01-21 18:57 ` Fwd: Git and Large Binaries: A Proposed Solution Eric Montellese
2011-01-21 21:36   ` Wesley J. Landaker
2011-01-21 22:00     ` Eric Montellese
2011-01-21 22:24   ` Jeff King
2011-01-21 23:15     ` Eric Montellese
2011-01-22  3:05       ` Sverre Rabbelier
2011-01-23 14:14     ` Pete Wyckoff
2011-01-26  3:42       ` Scott Chacon
2011-01-26 16:23         ` Eric Montellese
2011-01-26 17:42         ` Joey Hess
2011-01-26 21:40         ` Jakub Narebski
2011-03-10 21:02       ` Alexander Miseler
2011-03-10 22:24         ` Jeff King
2011-03-13  1:53           ` Eric Montellese
2011-03-13  2:52             ` Jeff King
2011-03-13 19:33               ` Alexander Miseler
2011-03-14 19:32                 ` Jeff King
2011-03-16  0:35                   ` Eric Montellese
2011-03-16 14:40                   ` Nguyen Thai Ngoc Duy
2011-01-22  0:07   ` Joey Hess

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git
