git@vger.kernel.org mailing list mirror (one of many)
From: Joey Hess <joey@kitenet.net>
To: git@vger.kernel.org
Subject: Re: Fwd: Git and Large Binaries: A Proposed Solution
Date: Fri, 21 Jan 2011 20:07:12 -0400
Message-ID: <20110122000712.GA7931@gnu.kitenet.net>
In-Reply-To: <AANLkTimPua_kz2w33BRPeTtOEWOKDCsJzf0sqxm=db68@mail.gmail.com>


Hi, I wrote git-annex, and pristine-tar, and etckeeper. I enjoy making
git do things that I'm told it shouldn't be used for. :) I probably
should have talked more about git-annex here before.

Eric Montellese wrote:
> 2. zipped tarballs of source code (that I will never need to modify)
> -- I could unpack these and then use git to track the source code.
> However, I prefer to track these deliverables as the tarballs
> themselves because it makes my customer happier to see the exact
> tarball that they delivered being used when I repackage updates.
> (Let's not discuss problems with this model - I understand that this
> is non-ideal).

In this specific case, you can use pristine-tar to recreate the
original, exact tarballs from unpacked source files that you check into
git. It accomplishes this without the overhead of duplicating compressed
data in tarballs. I feel that in this case this is a better approach than
generic large file support, since it stores all the data in git, just in a
much more compact form, and so fits in nicely with standard git-based
source code management.

> The short version:
> ***Don't track binaries in git.  Track their hashes.***

That was my principle with git-annex. Although slightly generalized to:
"Don't track large file contents in git. Track unique keys that
an arbitrary backend can use to obtain the file contents."
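
A minimal Python sketch of that principle, for illustration only. It is
not how git-annex is implemented; make_key, dict_backend and fetch are
hypothetical names, and using SHA-256 to derive the key is an assumption.

```python
import hashlib
from typing import Callable, Dict, List, Optional

# Hypothetical illustration: a backend maps a key to file contents,
# or returns None when it does not hold that key.
Backend = Callable[[str], Optional[bytes]]


def make_key(content: bytes) -> str:
    """Derive a unique key from file contents (assumed: SHA-256)."""
    return "SHA256-" + hashlib.sha256(content).hexdigest()


def dict_backend(store: Dict[str, bytes]) -> Backend:
    """A toy backend that looks keys up in an in-memory dict."""
    return store.get


def fetch(key: str, backends: List[Backend]) -> bytes:
    """Ask each backend in turn until one supplies the contents."""
    for backend in backends:
        data = backend(key)
        if data is not None:
            return data
    raise KeyError(f"no backend has {key}")


# Only the short key would be tracked in git, never the payload.
payload = b"pretend this is a multi-gigabyte binary"
key = make_key(payload)
empty = dict_backend({})
full = dict_backend({key: payload})
restored = fetch(key, [empty, full])
```

The point of the indirection is that the key is small and stable, while
where the contents actually live can vary per backend.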

Now, you mention in a followup that git-annex does not default to keeping
a local copy of every binary referenced by a file in master.
This is true, for the simple reason that a copy of every file in master
in some of my git repos would sum to multiple terabytes of data. :) I
think that, practically, anything that supports large files in git needs
to support partial checkouts too.

But git-annex can be run in, e.g., a post-merge hook, and asked to
retrieve all current file contents and drop outdated contents.

> First the layout:
> my_git_project/binrepo/
> -- binaries/
> -- hashes/
> -- symlink_to_hashes_file
> -- symlink_to_another_hashes_file
> within the "binrepo" (binary repository) there is a subdirectory for
> binaries, and a subdirectory for hashes.  In the root of the 'binrepo'
> all of the files stored have a symlink to the current version of the
> hash.

Very similar to git-annex in the use of versioned symlinks here.
git-annex stores the binaries in .git/annex/objects to avoid needing to
gitignore them.
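
As a rough sketch of the symlink idea in Python: the file's contents are
moved under an object directory and a relative symlink is left in their
place. The flat, hash-named layout and the annex_file helper below are
simplifying assumptions for illustration, not git-annex's actual on-disk
format, and the example assumes a platform with POSIX symlinks.

```python
import hashlib
import os
import tempfile


def annex_file(repo: str, rel_path: str) -> str:
    """Move a file into the object directory, leaving a symlink behind.

    Hypothetical helper; returns the key the contents were stored under.
    """
    path = os.path.join(repo, rel_path)
    with open(path, "rb") as f:
        key = hashlib.sha256(f.read()).hexdigest()
    objects = os.path.join(repo, ".git", "annex", "objects")
    os.makedirs(objects, exist_ok=True)
    target = os.path.join(objects, key)
    os.replace(path, target)
    # The link is relative, so the repository can be moved as a whole.
    link = os.path.relpath(target, start=os.path.dirname(path))
    os.symlink(link, path)
    return key


repo = tempfile.mkdtemp()
os.makedirs(os.path.join(repo, "media"))
link_path = os.path.join(repo, "media", "big.bin")
with open(link_path, "wb") as f:
    f.write(b"large binary contents")
key = annex_file(repo, os.path.join("media", "big.bin"))
```

Git then only has to track the small symlink, and since the contents
live under .git they never show up as untracked files.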

> 3. In my setup, all of the binary files are in a single "binrepo"
> directory.  If done from within git, we would need a non-kludgey way
> to allow large binaries to exist anywhere within the git tree.

git-annex allows the symlinks to be mixed with regular git managed
content throughout the repository. (This means that when symlinks
are moved, they may need to be fixed, which is done at commit time.)
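
To illustrate why a moved symlink needs fixing, here is a small Python
sketch. A relative link that pointed into .git/annex/objects from the
top of the tree dangles once it is moved into a subdirectory, and has to
be recomputed. fix_symlink is a hypothetical helper and the flat object
layout is an assumption; this is not git-annex's own code.

```python
import os
import tempfile


def fix_symlink(repo: str, link_path: str) -> bool:
    """Repoint a moved symlink at its object; True if it was changed."""
    current = os.readlink(link_path)
    key = os.path.basename(current)
    target = os.path.join(repo, ".git", "annex", "objects", key)
    wanted = os.path.relpath(target, start=os.path.dirname(link_path))
    if current == wanted:
        return False
    os.remove(link_path)
    os.symlink(wanted, link_path)
    return True


repo = tempfile.mkdtemp()
objects = os.path.join(repo, ".git", "annex", "objects")
os.makedirs(objects)
with open(os.path.join(objects, "KEY1"), "wb") as f:
    f.write(b"data")

# A link at the top of the tree, valid where it was created.
old = os.path.join(repo, "big.bin")
os.symlink(os.path.join(".git", "annex", "objects", "KEY1"), old)

# Moving it two directories down leaves the relative target dangling.
os.makedirs(os.path.join(repo, "a", "b"))
new = os.path.join(repo, "a", "b", "big.bin")
os.rename(old, new)
broken_before = not os.path.exists(new)

changed = fix_symlink(repo, new)
```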

> 5. Command to purge all binaries in your "binrepo" that are not needed
> for the current revision (if you're running out of disk space
> locally).

Safely dropping data is really one of the complexities of this
approach. Git-annex stores location tracking information in git,
so it can know where it can retrieve file data *from*. I chose to make
it very cautious about removing data, as location tracking data can
fall out of date (if, for example, a remote had the data, has since
dropped it, and has not pushed that information out). So it actively
confirms that enough other copies of the data currently exist before
dropping it.
(Of course, these checks can be disabled.)
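
The check described above might be sketched in Python roughly as
follows. The names safe_to_drop, has_key and min_copies are
hypothetical, and the has_key callback stands in for whatever live
query a real tool would make to a remote; this is a sketch of the idea,
not git-annex's implementation.

```python
from typing import Callable, Iterable


def safe_to_drop(
    key: str,
    believed_remotes: Iterable[str],
    has_key: Callable[[str, str], bool],
    min_copies: int = 1,
) -> bool:
    """True only if enough remotes confirm, right now, that they hold key.

    believed_remotes comes from the (possibly stale) location log, so
    each one is actively asked instead of being trusted.
    """
    if min_copies <= 0:
        return True
    confirmed = 0
    for remote in believed_remotes:
        if has_key(remote, key):
            confirmed += 1
            if confirmed >= min_copies:
                return True
    return False


# The location log claims two copies, but "laptop" dropped its copy
# and never pushed that fact out.
location_log = ["server", "laptop"]
actual = {"server": {"KEY1"}, "laptop": set()}


def has_key(remote: str, key: str) -> bool:
    return key in actual[remote]


ok_one = safe_to_drop("KEY1", location_log, has_key, min_copies=1)
ok_two = safe_to_drop("KEY1", location_log, has_key, min_copies=2)
```

Here ok_one is True and ok_two is False: the stale log entry for
"laptop" does not count toward the required number of copies.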

> 6. Automatically upload new versions of files to the "binrepo" (rather
> than needing to do this manually)

In git-annex, data transfer is done using rsync, so that interrupted
transfers of large files can be resumed. I recently added a git-annex-shell
to support locked-down access, similar to git-shell.


BTW, I have been meaning to look into using smudge filters with git-annex.
I'm a bit worried about some of the potential overhead associated with
smudge filters, and I'm not sure how a partial checkout would work with
them.

-- 
see shy jo
