git@vger.kernel.org mailing list mirror (one of many)
From: Jeff King <peff@peff.net>
To: Eric Montellese <emontellese@gmail.com>
Cc: Alexander Miseler <alexander@miseler.de>,
	Pete Wyckoff <pw@padd.com>,
	git@vger.kernel.org, schacon@gmail.com, joey@kitenet.net
Subject: Re: Fwd: Git and Large Binaries: A Proposed Solution
Date: Sat, 12 Mar 2011 21:52:58 -0500
Message-ID: <20110313025258.GA10452@sigill.intra.peff.net>
In-Reply-To: <AANLkTimpbhaGEfxW1wwRc14tpV6qnPDiZYnXp_tvA3Ft@mail.gmail.com>

On Sat, Mar 12, 2011 at 08:53:53PM -0500, Eric Montellese wrote:

> The best solution, it seems, has two parts:
> 
> 1. Clean up the way in which git considers, diffs, and stores binaries
> to cut down on the overhead of dealing with these files.

This is the easier half, I think.

>   1.1 Perhaps a "binaries" directory, or structure of directories, within .git

I'd rather not do something so drastic. We already have ways of marking
files as binary and un-diffable within the tree. So you can already do
pretty well with marking them with gitattributes. I think we can do
better by making the binaryness auto-detection less expensive
(right now we pull in the whole blob to check the first 1K or so for
NULs or other patterns; this is fine in the common text case, where
we'll want the whole blob in a minute anyway, but for large files it's
obviously wasteful). There may also be code-paths for binary files where
we accidentally load them (I just fixed one last week where we
unnecessarily loaded them in the diffstat code path). Somebody will need
to do some experimenting to shake out those code paths.
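
A minimal sketch of the cheaper check described above, written in Python
rather than git's own C: read only a bounded prefix and look for NULs,
instead of pulling in the whole blob. The function name and the 1024-byte
default are assumptions for illustration (the text above only says "the
first 1K or so"), and this checks NULs only, not the "other patterns".

```python
import os
import tempfile


def looks_binary(path, probe_bytes=1024):
    """Guess binaryness from a bounded prefix of the file.

    Hypothetical helper: probe_bytes is an illustrative default, not
    a value taken from git itself.
    """
    with open(path, "rb") as f:
        head = f.read(probe_bytes)  # reads at most probe_bytes bytes
    return b"\0" in head


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        text_path = os.path.join(d, "notes.txt")
        blob_path = os.path.join(d, "image.bin")
        with open(text_path, "wb") as f:
            f.write(b"hello\n" * 1000)
        with open(blob_path, "wb") as f:
            # A NUL near the start, followed by a large payload that
            # the check never has to read.
            f.write(b"\x89PNG\0" + b"x" * 1_000_000)
        print(looks_binary(text_path), looks_binary(blob_path))
```

The cost is then bounded by the probe size rather than the file size; the
trade-off is that a NUL appearing only after the probe window goes
unnoticed.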

For packing, we have core.bigFileThreshold to turn off delta compression
for large files, but according to the documentation, it is only honored
for fast-import. I think we would want something similar to say "for
some subset of files (indicated either by name or by minimum size),
don't bother with zlib-compression either, and always keep them loose".
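
As a rough sketch of what such a rule could look like, the Python below
picks a storage treatment from a path and a size. Everything in it is
hypothetical: the function name, the pattern list, and the threshold
value are illustrative only, and nothing here corresponds to an existing
git option beyond the idea in the paragraph above.

```python
import fnmatch

# Hypothetical defaults, chosen only for the example.
RAW_PATTERNS = ("*.iso", "*.mp4")
MIN_RAW_SIZE = 100 * 1024 * 1024  # bytes


def storage_plan(path, size, patterns=RAW_PATTERNS, min_size=MIN_RAW_SIZE):
    """Decide how a blob should be stored, by name or by minimum size."""
    by_name = any(fnmatch.fnmatchcase(path, p) for p in patterns)
    by_size = size >= min_size
    if by_name or by_size:
        # Large or explicitly listed: skip delta and zlib, keep loose.
        return {"delta": False, "zlib": False, "loose": True}
    # Everything else gets the usual treatment.
    return {"delta": True, "zlib": True, "loose": False}


if __name__ == "__main__":
    print(storage_plan("src/main.c", 12_000))
    print(storage_plan("media/intro.mp4", 12_000))
    print(storage_plan("data/dump.sql", 500 * 1024 * 1024))
```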

Those are the two major ones, I think. There are probably a handful of
other cases (like git-add, which really should be able to have a fixed
memory size). Again, the first step is figuring out where all of the
problems are (and I'm happy to just fix them one by one as they come up,
but I am also thinking of this in terms of a GSoC project).

>   1.2 Perhaps configurable options for when and how to try a binary
> diff?  (allow user to decide if storage or speed is more important)

We can already do that with gitattributes. But it would be nice to have
it be fast in the binary auto-detection case.
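
For illustration, a .gitattributes fragment along these lines covers the
existing knobs; the file patterns are made-up examples, not taken from
this thread.

```
# Treat as binary: no textual diff, no merge, no end-of-line conversion
*.iso binary

# Only suppress textual diffs for these paths
*.mp4 -diff

# Do not attempt delta compression for these paths when packing
*.bin -delta
```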

> 2. Once (1) is accomplished, add an option to avoid copying binaries
> from all but the tip when doing a "git clone."

This is much harder. :)

>   2.1 The default behavior would be to copy everything, as users
> currently expect.
>   2.2 Core code would have hooks to allow a script to use a central
> location for the binary storage. (ssh, http, gmail-fs, whatever)

I think we would need a protocol extension for the fetching client to
say "please don't bother sending me anything larger than N bytes; I will
get it via alternate storage". Although there are situations more
complicated than that. Your alternate storage might have up to commit X,
and you don't want large objects in X or its ancestors. But you _do_
want large objects in descendants of X, since you have no other way to
get them.

So you need some way of saying which sets of large objects you need and
which you don't. One implementation is that you could fetch from
alternate storage (which would then need to be not just large-blob
storage, but actually have a full repo), and then afterwards fetch from
the remote (which would then send you all binaries, because by
definition anything you are fetching is not something the alternate
storage has). That feels a bit hack-ish. Doing something more clever
would require a pretty major protocol extension, though.
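
To make the "which sets of large objects" question concrete, here is a
small Python sketch of the selection rule, assuming the alternate storage
holds everything up to some commit X. The data model is hypothetical:
commits are a dict of parent lists, each commit maps to the set of blobs
it references, and the function names are invented for the example. It
is not a protocol design.

```python
def reachable(parents, tip):
    """Return tip and all of its ancestors in the commit graph."""
    seen, stack = set(), [tip]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, ()))
    return seen


def large_blobs_to_send(parents, blobs_by_commit, sizes, want, alt_tip, limit):
    """Large blobs the remote must still send.

    Large blobs referenced by alt_tip or its ancestors are assumed to
    come from alternate storage; large blobs that first appear in
    descendants of alt_tip have no other source and must be sent.
    """
    covered = reachable(parents, alt_tip)
    have = {b for c in covered for b in blobs_by_commit.get(c, ())}
    send = set()
    for commit in reachable(parents, want) - covered:
        for blob in blobs_by_commit.get(commit, ()):
            if sizes[blob] > limit and blob not in have:
                send.add(blob)
    return send


if __name__ == "__main__":
    parents = {"A": [], "X": ["A"], "B": ["X"]}
    blobs_by_commit = {
        "A": {"big1", "small1"},
        "X": {"big1", "big2"},
        "B": {"big2", "big3", "small2"},
    }
    sizes = {"big1": 10_000_000, "big2": 10_000_000, "big3": 10_000_000,
             "small1": 100, "small2": 100}
    print(large_blobs_to_send(parents, blobs_by_commit, sizes,
                              want="B", alt_tip="X", limit=1_000_000))
```

In the example only big3 is selected: big1 and big2 are covered by the
alternate storage, and the small blobs are left to the ordinary fetch.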

I haven't been paying attention to any sparse clone proposals. I know it
has come up but I don't know how mature the idea is. But this is
potentially related.

-Peff
