Re: Fwd: Git and Large Binaries: A Proposed Solution

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: Alexander Miseler <alexander@miseler.de>
To: Jeff King <peff@peff.net>
Cc: Eric Montellese <emontellese@gmail.com>,
	Pete Wyckoff <pw@padd.com>,
	git@vger.kernel.org, schacon@gmail.com, joey@kitenet.net
Subject: Re: Fwd: Git and Large Binaries: A Proposed Solution
Date: Sun, 13 Mar 2011 20:33:18 +0100	[thread overview]
Message-ID: <4D7D1BFE.2030008@miseler.de> (raw)
In-Reply-To: <20110313025258.GA10452@sigill.intra.peff.net>

My thoughts on big file storage:

We want to store them as flat as possible. Ideally if we have a temp file with the content (e.g. the output of some filter) it should be possible to store it by simply doing a move/rename and updating some meta data external to the actual file.

Options:

1.) The loose file format is inherently unsuited for this. It has a header before the actual content and the whole file (header + content) is always compressed. Even if one changes this to compressing/decompressing header and content independently it is still unsuited by a) having the header within the same file and b) because the header has no flags or other means to indicate a different behavior (e.g. no compression) for the content. We could extend the header format or introduce a new object type (e.g. flatblob) but both would probably cause more trouble than other solutions. Another idea would be to keep the metadata in an external file (e.g. 84d7.header for the object 84d7). This would probably have a bad performance though since every object lookup would first need to check for the e
 xistence of a header file. A smarter variant would be to optionally keep the meta data directly in the filename (e.g. saving the object as 84d7.object_type.size.flag instead of just 84d7). 
This would only require special handling for cases where the normal lookup for 84d7 fails.

2.) The pack format fares a lot better. Content and meta data are already separated with the meta data describing how the content is stored. We would need a flag to mark the content as flat and that would pretty much be it. We would still need to include a virtual header when calculating the sha1 so it is guaranteed that the same content has always the same id.
Thus i think we should simply forgo the loose object phase when storing big files and simply drop each big file flat as a individual pack file, with the idx file describing it as a pack file with one entry which is stored flat.

3.) Do some completely different handling for big files, as suggested by Eric:
>>   1.1 Perhaps a "binaries" directory, or structure of directories, within .git
> 
> I'd rather not do something so drastic.
My main issue with this approach (apart from the 'drastic' ^_^) is that the definition of big file may change at any time by e.g. changing a config value like core.bigFileThreshold. What has been stored as big file may suddenly be considered a normal blob and vice versa. Thus any storage variant that isn't well integrated in the normal object storage will probably be troublesome.

> There may also be code-paths for binary files where
> we accidentally load them (I just fixed one last week where we
> unnecessarily loaded them in the diffstat code path). Somebody will need
> to do some experimenting to shake out those code paths.

This is my main focus for now. They are easy to detect when your memory is small enough :D

next prev parent reply	other threads:[~2011-03-13 19:33 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <AANLkTin=UySutWLS0Y7OmuvkE=T=+YB8G8aUCxLH=GKa@mail.gmail.com>
2011-01-21 18:57 ` Fwd: Git and Large Binaries: A Proposed Solution Eric Montellese
2011-01-21 21:36   ` Wesley J. Landaker
2011-01-21 22:00     ` Eric Montellese
2011-01-21 22:24   ` Jeff King
2011-01-21 23:15     ` Eric Montellese
2011-01-22  3:05       ` Sverre Rabbelier
2011-01-23 14:14     ` Pete Wyckoff
2011-01-26  3:42       ` Scott Chacon
2011-01-26 16:23         ` Eric Montellese
2011-01-26 17:42         ` Joey Hess
2011-01-26 21:40         ` Jakub Narebski
2011-03-10 21:02       ` Alexander Miseler
2011-03-10 22:24         ` Jeff King
2011-03-13  1:53           ` Eric Montellese
2011-03-13  2:52             ` Jeff King
2011-03-13 19:33               ` Alexander Miseler [this message]
2011-03-14 19:32                 ` Jeff King
2011-03-16  0:35                   ` Eric Montellese
2011-03-16 14:40                   ` Nguyen Thai Ngoc Duy
2011-01-22  0:07   ` Joey Hess

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4D7D1BFE.2030008@miseler.de \
    --to=alexander@miseler.de \
    --cc=emontellese@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=joey@kitenet.net \
    --cc=peff@peff.net \
    --cc=pw@padd.com \
    --cc=schacon@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).