git@vger.kernel.org mailing list mirror (one of many)
From: Jeff King <peff@peff.net>
To: Alexander Miseler <alexander@miseler.de>
Cc: Eric Montellese <emontellese@gmail.com>,
	Pete Wyckoff <pw@padd.com>,
	git@vger.kernel.org, schacon@gmail.com, joey@kitenet.net
Subject: Re: Fwd: Git and Large Binaries: A Proposed Solution
Date: Mon, 14 Mar 2011 15:32:54 -0400
Message-ID: <20110314193254.GA21581@sigill.intra.peff.net>
In-Reply-To: <4D7D1BFE.2030008@miseler.de>

On Sun, Mar 13, 2011 at 08:33:18PM +0100, Alexander Miseler wrote:

> We want to store them as flat as possible. Ideally if we have a temp
> file with the content (e.g. the output of some filter) it should be
> possible to store it by simply doing a move/rename and updating some
> metadata external to the actual file.

Yeah, that would be a nice optimization.  But I'd rather do the easy
stuff first and see if more advanced stuff is still worth doing.

For example, I spent some time a while back designing a faster textconv
interface (the current interface spools the blob to a tempfile, whereas
in some cases a filter needs to only access the first couple kilobytes
of the file to get metadata). But what I found was that an even better
scheme was to cache textconv output in git-notes. Then it speeds up the
slow case _and_ the already-fast case.
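
(For concreteness, the caching I mean is what the diff driver's
"cachetextconv" knob does; roughly like this, where the "exif" driver
name and the exiftool command are just placeholders for whatever
filter you actually use:

  # define a textconv filter and let git cache its output in notes
  $ git config diff.exif.textconv exiftool
  $ git config diff.exif.cachetextconv true

The converted output then lands in a notes tree (refs/notes/textconv/exif
in this example), so the filter has to run at most once per blob.)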

Now after this, would my new textconv interface still speed up the
initial non-cached textconv? Absolutely. But I didn't really care
anymore, because the small speedup on the first run was not worth the
trouble of maintaining two interfaces (at least for my datasets).

And this may fall into the same category. Accessing big blobs is
expensive. One solution is to make it a bit faster. Another solution is
to just do it less. So we may find that once we are doing it less, it is
not worth the complexity to make it faster.

And note that I am not saying "it definitely won't be worth it"; only
that it is worth making the easy, big optimizations first and then
seeing what's left to do.

> 1.) The loose file format is inherently unsuited for this. It has a
> header before the actual content and the whole file (header + content)
> is always compressed. Even if one changed this to compress/decompress
> header and content independently, it would still be unsuited because
> a) the header lives within the same file and b) the header has no
> flags or other means to indicate different behavior (e.g. no
> compression) for the content. We could extend the
> header format or introduce a new object type (e.g. flatblob) but both
> would probably cause more trouble than other solutions. Another idea
> would be to keep the metadata in an external file (e.g. 84d7.header
> for the object 84d7). This would probably perform badly, though,
> since every object lookup would first need to check for the
> existence of a header file. A smarter variant would be to optionally
> keep the metadata directly in the filename (e.g. saving the object as
> 84d7.object_type.size.flag instead of just 84d7).
> This would only require special handling for cases where the normal lookup for 84d7 fails.

A new object type is definitely a bad idea. It changes the sha1 of the
resulting object, which means that two otherwise-identical trees that
differ only in the use of "flatblob" versus regular blob will have
different sha1s.

So I think the right place to insert this would be at the object db
layer. The header just has the type and size. But I don't think anybody
is having a problem with large objects that are _not_ blobs. So the
simplest implementation would be a special blob-only object db
containing pristine files. We implicitly know that objects in this db
are blobs, and we can get the size from the filesystem via stat().
Checking their sha1 would involve prepending "blob <size>\0" to the file
data. It does introduce an extra stat() into object lookup, so probably
we would have the lookup order of pack, regular loose object, flat blob
object. Then you pay the extra stat() only in the less-common case of
accessing either a large blob or a non-existent object.
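
As a rough sketch (which also shows why a new type would change object
names: the type string is part of what gets hashed), checking such a
flat file against a blob sha1 is just a matter of feeding the header
plus the contents through sha1; the function name here is made up:

  # verify that a flat (uncompressed, headerless) file matches a blob sha1
  check_flat_blob () {
    file=$1; expect=$2
    size=$(wc -c <"$file")
    actual=$({ printf 'blob %d\0' "$size"; cat "$file"; } |
             sha1sum | cut -d' ' -f1)
    test "$actual" = "$expect"
  }

That is the same computation "git hash-object" does, so in practice we
could just reuse the existing hashing code.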

That being said, I'm not sure how much this optimization will buy us.
There are times when being able to mmap() the file directly, or to
point an external program at the original blob, will be helpful. But we
will still have to copy, for example on checkout. It would be nice if
there were a way to make a copy-on-write link from the working tree to
the original file. But I don't think there is a portable way to do so,
and we can't allow the user to accidentally munge the contents of the
object db, which are supposed to be immutable.
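
(For what it's worth, some filesystems can do that non-portably. A
reflink clone gives exactly this, and because the copy diverges on
write, edits to the working-tree file would not touch the object db,
unlike a hard link. Something like the following, with the two paths
standing in for the stored object and the checkout destination:

  # copy-on-write clone; fails rather than silently falling back to a
  # full copy on filesystems that do not support reflinks
  cp --reflink=always "$object_file" "$worktree_file"

But that is GNU cp on a cooperating filesystem, not something we could
rely on everywhere.)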

-Peff


Thread overview: 20+ messages
     [not found] <AANLkTin=UySutWLS0Y7OmuvkE=T=+YB8G8aUCxLH=GKa@mail.gmail.com>
2011-01-21 18:57 ` Fwd: Git and Large Binaries: A Proposed Solution Eric Montellese
2011-01-21 21:36   ` Wesley J. Landaker
2011-01-21 22:00     ` Eric Montellese
2011-01-21 22:24   ` Jeff King
2011-01-21 23:15     ` Eric Montellese
2011-01-22  3:05       ` Sverre Rabbelier
2011-01-23 14:14     ` Pete Wyckoff
2011-01-26  3:42       ` Scott Chacon
2011-01-26 16:23         ` Eric Montellese
2011-01-26 17:42         ` Joey Hess
2011-01-26 21:40         ` Jakub Narebski
2011-03-10 21:02       ` Alexander Miseler
2011-03-10 22:24         ` Jeff King
2011-03-13  1:53           ` Eric Montellese
2011-03-13  2:52             ` Jeff King
2011-03-13 19:33               ` Alexander Miseler
2011-03-14 19:32                 ` Jeff King [this message]
2011-03-16  0:35                   ` Eric Montellese
2011-03-16 14:40                   ` Nguyen Thai Ngoc Duy
2011-01-22  0:07   ` Joey Hess
