From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jeff King Subject: Re: Fwd: Git and Large Binaries: A Proposed Solution Date: Sat, 12 Mar 2011 21:52:58 -0500 Message-ID: <20110313025258.GA10452@sigill.intra.peff.net> References: <20110121222440.GA1837@sigill.intra.peff.net> <20110123141417.GA6133@mew.padd.com> <4D793C7D.1000502@miseler.de> <20110310222443.GC15828@sigill.intra.peff.net> Mime-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: Alexander Miseler , Pete Wyckoff , git@vger.kernel.org, schacon@gmail.com, joey@kitenet.net To: Eric Montellese X-From: git-owner@vger.kernel.org Sun Mar 13 03:53:24 2011 Return-path: Envelope-to: gcvg-git-2@lo.gmane.org Received: from vger.kernel.org ([209.132.180.67]) by lo.gmane.org with esmtp (Exim 4.69) (envelope-from ) id 1PybQn-00030Y-Ut for gcvg-git-2@lo.gmane.org; Sun, 13 Mar 2011 03:53:22 +0100 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756374Ab1CMCxE convert rfc822-to-quoted-printable (ORCPT ); Sat, 12 Mar 2011 21:53:04 -0500 Received: from 99-108-226-0.lightspeed.iplsin.sbcglobal.net ([99.108.226.0]:52442 "EHLO peff.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756196Ab1CMCxB (ORCPT ); Sat, 12 Mar 2011 21:53:01 -0500 Received: (qmail 29353 invoked by uid 107); 13 Mar 2011 02:53:32 -0000 Received: from sigill.intra.peff.net (HELO sigill.intra.peff.net) (10.0.0.7) (smtp-auth username relayok, mechanism cram-md5) by peff.net (qpsmtpd/0.84) with ESMTPA; Sat, 12 Mar 2011 21:53:32 -0500 Received: by sigill.intra.peff.net (sSMTP sendmail emulation); Sat, 12 Mar 2011 21:52:58 -0500 Content-Disposition: inline In-Reply-To: Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org Archived-At: On Sat, Mar 12, 2011 at 08:53:53PM -0500, Eric Montellese wrote: > The best solution, it seems, has two parts: >=20 > 1. Clean up the way in which git considers, diffs, and stores binarie= s > to cut down on the overhead of dealing with these files. This is the easier half, I think. > 1.1 Perhaps a "binaries" directory, or structure of directories, wi= thin .git I'd rather not do something so drastic. We already have ways of marking files as binary and un-diffable within the tree. So you can already do pretty well with marking them with gitattributes. I think we can do better by making them the binaryness auto-detection less expensive (right now we pull in the whole blob to check the first 1K or so for NULs or other patterns; this is fine in the common text case, where we'll want the whole blob in a minute anyway, but for large files it's obviously wasteful). There may also be code-paths for binary files wher= e we accidentally load them (I just fixed one last week where we unnecessarily loaded them in the diffstat code path). Somebody will nee= d to do some experimenting to shake out those code paths. =46or packing, we have core.bigFileThreshold to turn off delta compress= ion for large files, but according to the documentation, it is only honored for fast-import. I think we would want something similar to say "for some subset of files (indicated either by name or by minimum size), don't bother with zlib-compression either, and always keep them loose". Those are the two major ones, I think. There are probably a handful of other cases (like git-add, which really should be able to have a fixed memory size). Again, the first step is figuring out where all of the problems are (and I'm happy to just fix them one by one as they come up= , but I am also thinking of this in terms of a GSoC project). > 1.2 Perhaps configurable options for when and how to try a binary > diff? =C2=A0(allow user to decide if storage or speed is more importa= nt) We can already do that with gitattributes. But it would be nice to have it be fast in the binary auto-detection case. > 2. Once (1) is accomplished, add an option to avoid copying binaries > from all but the tip when doing a "git clone." This is much harder. :) > 2.1 The default behavior would be to copy everything, as users > currently expect. > 2.2 Core code would have hooks to allow a script to use a central > location for the binary storage. (ssh, http, gmail-fs, whatever) I think we would need a protocol extension for the fetching client to say "please don't bother sending me anything larger than N bytes; I wil= l get it via alternate storage". Although there are situations more complicated than that. Your alternate storage might have up to commit X= , and you don't want large objects in X or its ancestors. But you _do_ want large objects in descendants of X, since you have no other way to get them. So you need some way of saying which sets of large objects you need and which you don't. One implementation is that you could fetch from alternate storage (which would then need to be not just large-blob storage, but actually have a full repo), and then afterwards fetch from the remote (which would then send you all binaries, because by definition anything you are fetching is not something the alternate storage has). That feels a bit hack-ish. Doing something more clever would require a pretty major protocol extension, though. I haven't been paying attention to any sparse clone proposals. I know i= t has come up but I don't know how mature the idea is. But this is potentially related. -Peff