From: Linus Torvalds <torvalds@osdl.org>
To: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Cc: Martin Langhoff <martin.langhoff@gmail.com>,
Git Mailing List <git@vger.kernel.org>
Subject: Re: Handling large files with GIT
Date: Wed, 8 Feb 2006 08:34:22 -0800 (PST)
Message-ID: <Pine.LNX.4.64.0602080815180.2458@g5.osdl.org>
In-Reply-To: <Pine.LNX.4.63.0602081248270.31700@wbgn013.biozentrum.uni-wuerzburg.de>
On Wed, 8 Feb 2006, Johannes Schindelin wrote:
>
> I am uncertain if it is possible to extend git to handle large files
> gracefully, without slowing it down for its main use case.
Indeed. The git architecture simply sucks for big objects. It was
discussed somewhat during the early stages, but a lot of it really is
pretty fundamental. The fact that all the operations work on a full
object, and the deltas are (on purpose) just a very specific and limited
kind of size compression is just very ingrained.
> [thinking] A potentially silly idea just hit me: We could virtually cut
> every file into 256kB chunks. That would not affect source code at all:
> anybody producing a 256kB C file should be shot anyway.
It probably wouldn't help that much, really. And it would probably impact
source code users too: I bet we'd have bugs. It would be a very strange
special case.
It also would only help for things that purely grow at the end. Which
isn't even true for a mailbox: it may or may not be true for your INBOX,
but anybody who _uses_ a mailbox format to read his email will be adding
status flags to the mbox format (or deleting mbox entries etc).
So every time a small change happened that changed the offset, you'd have
an explosion of these 256kB chunk objects, and while the delta would work
(probably slowly - remember how the git deltification algorithm tries to
compare against the ten "nearest" neighbors), at _commit_ time you'd have
to write that 1GB (compressed) out anyway.
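To illustrate the offset problem, here is a minimal Python sketch. It is not git code: `chunk_ids` and `changed_chunks` are hypothetical helper names, and SHA-1 over the raw chunk bytes only stands in for however such chunks would really be named. It counts how many fixed-size chunks of a new version do not exist in the old version.

```python
import hashlib

CHUNK = 256 * 1024  # 256kB, the chunk size from the proposal


def chunk_ids(data, size=CHUNK):
    # Name each fixed-size chunk by a hash of its content.
    return [hashlib.sha1(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]


def changed_chunks(old, new, size=CHUNK):
    # Count chunks of `new` whose id does not appear among the chunks of `old`.
    known = set(chunk_ids(old, size))
    return sum(1 for cid in chunk_ids(new, size) if cid not in known)
```

Appending to the data leaves every full chunk untouched, so only the last chunk is new. Inserting a single byte near the start shifts every boundary, so every chunk from that point on becomes a new object.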
Realistically, I think the answer is that git just doesn't work for his
use case. There are two alternatives:
- convince him to not have big mailboxes (an answer I don't particularly
like: it's a tool limitation, and you shouldn't change your behaviour
just because the tool doesn't work for it - you should just try to find
the right tool).
That said: git should actually work beautifully for email if you
_don't_ keep it as one big mbox. You could probably very reasonably use
git as a database backend, where each email is its own object, and you
can have many different ways of indexing them into trees (by content,
by date, by author, by thread).
But that's very different from what the suggested "home directory" setup
would be.
- try to work around some of the worst git issues. While I don't think
the 256kB blocking thing would help (the git protocol would still
always send the base versions), there _are_ probably things that could
be done. They'd be very invasive, though, and somebody would seriously
have to look at the architectural issues.
For example, right now the decision to send only "self-contained" packs
in the git protocol was a very conscious one: it's much safer, and it
makes the unpacking a lot easier (the unpacking doesn't ever have to
even read any other objects than the stream it gets). It's also (for
packs that we use on-disk) the only sane way to avoid nasty inter-pack
dependencies.
But for the git protocol, the inter-pack dependencies don't matter,
as long as we always unpack the thing on reception when it is not a
self-contained pack. So we _could_ allow deltas that depend on the
receiver already having the objects we delta against.
However, the deltification itself is likely very slow, exactly because
git (again, very much by design) generates the deltas dynamically
rather than depending on things already being in delta format.
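The "each email is its own object" idea from the first alternative can be sketched in Python. This is only an illustration under stated assumptions: the `by-author/` and `by-date/` tree layout and the helper name `index_paths` are hypothetical, and the message is assumed to carry a parseable Date header. The only git-specific fact used is how a blob's object name is computed.

```python
import hashlib
from email import message_from_bytes
from email.utils import parseaddr, parsedate_to_datetime


def blob_id(data):
    # git names a blob by SHA-1 over the header "blob <size>", a NUL byte,
    # and then the content itself.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def index_paths(raw):
    # Hypothetical layout: the same blob is reachable from several trees.
    # Assumes `raw` is one complete message with a valid Date header.
    msg = message_from_bytes(raw)
    oid = blob_id(raw)
    author = parseaddr(msg["From"] or "")[1] or "unknown"
    when = parsedate_to_datetime(msg["Date"])
    return [
        "by-author/%s/%s" % (author, oid),
        "by-date/%04d/%02d/%02d/%s" % (when.year, when.month, when.day, oid),
    ]
```

Because each message is a small, immutable object, indexing it under several paths costs only extra tree entries, and changing one message's status never rewrites the others.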
Personally, I think the answer is "git is good for lots of small files".
It's very much what git was designed for, and the fact that it doesn't
work for everything is a trade-off for the things it _does_ work well for.
Linus