git@vger.kernel.org mailing list mirror (one of many)
From: Linus Torvalds <torvalds@osdl.org>
To: Jeff Garzik <jgarzik@pobox.com>
Cc: Ben Clifford <benc@hawaga.org.uk>,
	Martin Langhoff <martin.langhoff@gmail.com>,
	Florian Weimer <fw@deneb.enyo.de>,
	git@vger.kernel.org
Subject: Re: Handling large files with GIT
Date: Mon, 13 Feb 2006 08:19:10 -0800 (PST)
Message-ID: <Pine.LNX.4.64.0602130806070.3691@g5.osdl.org>
In-Reply-To: <43F01F5A.5020808@pobox.com>



On Mon, 13 Feb 2006, Jeff Garzik wrote:
>
> Linus Torvalds wrote:
> > I've never used maildir layout, but if it is a couple of large _flat_
> > subdirectories,
> 
> That's what it is :/   One directory per mail folder, with each email an
> individual file in that dir.

Ok.

Anyway, I double-checked, and I was wrong. While the "static
directories" thing is a huge performance optimization for doing many
things (diffing trees, file history in git-rev-list, etc etc), for merging
it doesn't help. We always end up expanding the whole tree.

Which is kind of sad.

It's inevitable in one sense: we do the merge in the index, after all, and 
the index - unlike the tree structures - is a flat file (like the 
"manifest" in mercurial or monotone). It's also represented that way in 
memory. 

However, it is a total and complete waste in other cases.

Thinking more about it, this is also why merging causes all the horrible 
index performance: not only do we (unnecessarily) read the same trees over 
and over again only to collapse them back to stage0 later when they are 
the same, but because we keep the index in a linear format, when we read 
the other trees, we'll have to move things around with memmove() (just the 
pointers, but still).
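
To make the memmove() cost concrete, here is a toy sketch in Python. It is
not git's actual C code: `FlatIndex` and `read_tree_flat` are made-up names,
and the index is modelled as one sorted list of (path, stage, blob id)
tuples. The `shifted` counter stands in for the number of pointers a
memmove() would have to move.

```python
import bisect


class FlatIndex:
    """Toy model of a flat index: one list kept sorted by (path, stage)."""

    def __init__(self):
        self.entries = []   # sorted list of (path, stage, blob_id)
        self.shifted = 0    # entries that had to move to make room

    def add(self, path, stage, blob_id):
        entry = (path, stage, blob_id)
        pos = bisect.bisect_left(self.entries, entry)
        # Everything from pos onward moves up one slot; in a C pointer
        # array this is where the memmove() would happen.
        self.shifted += len(self.entries) - pos
        self.entries.insert(pos, entry)


def read_tree_flat(index, tree, stage):
    """Add every path of an already-flattened tree at the given stage."""
    for path in sorted(tree):
        index.add(path, stage, tree[path])


# A hypothetical mail folder with 1000 messages, identical in both trees.
paths = ["mail/%04d" % i for i in range(1000)]
base = {p: "blob-" + p for p in paths}

index = FlatIndex()
read_tree_flat(index, base, 1)   # sorted input is appended: nothing moves
after_first = index.shifted
read_tree_flat(index, base, 2)   # interleaves with stage 1: entries move
after_second = index.shifted
print(after_first, after_second)
```

Reading the first tree costs nothing extra, but every entry of the second
tree lands between two existing entries, so the work grows roughly with the
square of the number of files even though the two trees are identical.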

We'd actually be a _lot_ better off if we split "git-read-tree" up into 
two phases: one that did the recursive tree operation (which can optimize 
the "same tree everywhere" case), and the second stage that actually 
populated the index.
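
A rough sketch of what such a split could look like, in Python rather than
C, and under loud assumptions: a tree is a plain dict mapping a name to
(object id, subtree dict or None), the names `walk_trees` and `resolve` are
invented for this illustration, file/directory conflicts are ignored, and
the resolution rules are the textbook trivial three-way ones rather than the
full git-read-tree rule table.

```python
def walk_trees(base, ours, theirs, prefix="", opened=None):
    """Phase 1: walk three trees in step, yielding only paths that differ.

    Each tree is a dict: name -> (object_id, subtree), where subtree is a
    dict for a directory and None for a file.
    """
    if opened is not None:
        opened.append(prefix)          # record which directories we opened
    for name in sorted(set(base) | set(ours) | set(theirs)):
        entries = [tree.get(name) for tree in (base, ours, theirs)]
        ids = tuple(e[0] if e else None for e in entries)
        if ids[0] == ids[1] == ids[2]:
            # Same object everywhere: skip it without looking inside.
            continue
        path = prefix + name
        if any(e is not None and e[1] is not None for e in entries):
            subtrees = [e[1] if e is not None and e[1] is not None else {}
                        for e in entries]
            yield from walk_trees(*subtrees, prefix=path + "/", opened=opened)
        else:
            yield path, ids


def resolve(changes):
    """Phase 2: turn the differing paths into index updates."""
    resolved, conflicts = {}, {}
    for path, (b, o, t) in changes:
        if o == t:
            resolved[path] = o         # both sides agree
        elif b == o:
            resolved[path] = t         # only theirs changed
        elif b == t:
            resolved[path] = o         # only ours changed
        else:
            conflicts[path] = {1: b, 2: o, 3: t}   # keep all three stages
    return resolved, conflicts


inbox = {"m1": ("a", None), "m2": ("b", None)}
base = {"inbox": ("T1", inbox),
        "lists": ("T2", {"k1": ("c", None), "k3": ("e", None)})}
ours = {"inbox": ("T1", inbox),
        "lists": ("T3", {"k1": ("c", None), "k2": ("d", None),
                         "k3": ("e1", None)})}
theirs = {"inbox": ("T1", inbox),
          "lists": ("T4", {"k1": ("c2", None), "k3": ("e2", None)})}

opened = []
changes = list(walk_trees(base, ours, theirs, opened=opened))
resolved, conflicts = resolve(changes)
print(opened, resolved, conflicts)
```

The "inbox" directory has the same id in all three trees, so phase one never
opens it, no matter how many messages it holds. Phase two only sees the
handful of paths that actually differ; unchanged paths are assumed to be in
the index already.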

I'll have to think about this. It would be an absolutely _huge_
optimization for merging in certain patterns; it just doesn't matter for
something like the kernel with "just" 18,000 files and not a lot of
strange merging going on.

In contrast, I can see a mail archive easily having hundreds of thousands
of individual emails. At which point it's horribly stupid to read them all
in three times (for a merge - base, origin, new) and do so in a pretty
inefficient manner.

Ho humm. It doesn't look _hard_ per se, and I think the two-stage 
git-read-tree is actually also what the recursive merge strategy wants 
anyway (it can't use the index - it really just wants to get a list of 
conflict information). So this definitely sounds like the RightThing(tm) 
to do anyway, and it fits the git data structures really well.

So no downsides. Except that this is some rather core code, and you can't 
afford to get it wrong. And the fact that I'm a lazy bastard, of course.

			Linus
