git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Eric Montellese <emontellese@gmail.com>
To: Jeff King <peff@peff.net>
Cc: Alexander Miseler <alexander@miseler.de>,
	Pete Wyckoff <pw@padd.com>,
	git@vger.kernel.org, schacon@gmail.com, joey@kitenet.net
Subject: Re: Fwd: Git and Large Binaries: A Proposed Solution
Date: Sat, 12 Mar 2011 20:53:53 -0500	[thread overview]
Message-ID: <AANLkTimpbhaGEfxW1wwRc14tpV6qnPDiZYnXp_tvA3Ft@mail.gmail.com> (raw)
In-Reply-To: <20110310222443.GC15828@sigill.intra.peff.net>

This is a good point.

The best solution, it seems, has two parts:

1. Clean up the way in which git considers, diffs, and stores binaries
to cut down on the overhead of dealing with these files.
  1.1 Perhaps a "binaries" directory, or structure of directories, within .git
  1.2 Perhaps configurable options for when and how to try a binary
diff?  (allow user to decide if storage or speed is more important)
2. Once (1) is accomplished, add an option to avoid copying binaries
from all but the tip when doing a "git clone."
  2.1 The default behavior would be to copy everything, as users
currently expect.
  2.2 Core code would have hooks to allow a script to use a central
location for the binary storage. (ssh, http, gmail-fs, whatever)

(of course, the implementation of (1) should be friendly to the addition of (2))

Obviously, the major drawback to (2) without (2.2) is that if there is
truly distributed work going on, some clone-of-a-clone may not know
where to get the binaries.

But, print a warning when turning on the non-default behavior (2.1),
then it's a user problem :-)

Eric




On Thu, Mar 10, 2011 at 5:24 PM, Jeff King <peff@peff.net> wrote:
>
> On Thu, Mar 10, 2011 at 10:02:53PM +0100, Alexander Miseler wrote:
>
> > I've been debating whether to resurrect this thread, but since it has
> > been referenced by the SoC2011Ideas wiki article I will just go ahead.
> > I've spent a few hours trying to make this work to make git with big
> > files usable under Windows.
> >
> > > Just a quick aside.  Since (a2b665d, 2011-01-05) you can provide
> > > the filename as an argument to the filter script:
> > >
> > >     git config --global filter.huge.clean huge-clean %f
> > >
> > > then use it in place:
> > >
> > >     $ cat >huge-clean
> > >     #!/bin/sh
> > >     f="$1"
> > >     echo orig file is "$f" >&2
> > >     sha1=`sha1sum "$f" | cut -d' ' -f1`
> > >     cp "$f" /tmp/big_storage/$sha1
> > >     rm -f "$f"
> > >     echo $sha1
> > >
> > >             -- Pete
>
> After thinking about this strategy more (the "convert big binary files
> into a hash via clean/smudge filter" strategy), it feels like a hack.
> That is, I don't see any reason that git can't give you the equivalent
> behavior without having to resort to bolted-on scripts.
>
> For example, with this strategy you are giving up meaningful diffs in
> favor of just showing a diff of the hashes. But git _already_ can do
> this for binary diffs.  The problem is that git unnecessarily uses a
> bunch of memory to come up with that answer because of assumptions in
> the diff code. So we should be fixing those assumptions. Any place that
> this smudge/clean filter solution could avoid looking at the blobs, we
> should be able to do the same inside git.
>
> Of course that leaves the storage question; Scott's git-media script has
> pluggable storage that is backed by http, s3, or whatever. But again,
> that is a feature that might be worth putting into git (even if it is
> just a pluggable script at the object-db level).
>
> -Peff

  reply	other threads:[~2011-03-13  1:54 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <AANLkTin=UySutWLS0Y7OmuvkE=T=+YB8G8aUCxLH=GKa@mail.gmail.com>
2011-01-21 18:57 ` Fwd: Git and Large Binaries: A Proposed Solution Eric Montellese
2011-01-21 21:36   ` Wesley J. Landaker
2011-01-21 22:00     ` Eric Montellese
2011-01-21 22:24   ` Jeff King
2011-01-21 23:15     ` Eric Montellese
2011-01-22  3:05       ` Sverre Rabbelier
2011-01-23 14:14     ` Pete Wyckoff
2011-01-26  3:42       ` Scott Chacon
2011-01-26 16:23         ` Eric Montellese
2011-01-26 17:42         ` Joey Hess
2011-01-26 21:40         ` Jakub Narebski
2011-03-10 21:02       ` Alexander Miseler
2011-03-10 22:24         ` Jeff King
2011-03-13  1:53           ` Eric Montellese [this message]
2011-03-13  2:52             ` Jeff King
2011-03-13 19:33               ` Alexander Miseler
2011-03-14 19:32                 ` Jeff King
2011-03-16  0:35                   ` Eric Montellese
2011-03-16 14:40                   ` Nguyen Thai Ngoc Duy
2011-01-22  0:07   ` Joey Hess

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=AANLkTimpbhaGEfxW1wwRc14tpV6qnPDiZYnXp_tvA3Ft@mail.gmail.com \
    --to=emontellese@gmail.com \
    --cc=alexander@miseler.de \
    --cc=git@vger.kernel.org \
    --cc=joey@kitenet.net \
    --cc=peff@peff.net \
    --cc=pw@padd.com \
    --cc=schacon@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).