git@vger.kernel.org mailing list mirror (one of many)
* Handling large files with GIT
@ 2006-02-08  9:14 Martin Langhoff
  2006-02-08 11:54 ` Johannes Schindelin
  2006-02-08 21:20 ` Florian Weimer
  0 siblings, 2 replies; 39+ messages in thread
From: Martin Langhoff @ 2006-02-08  9:14 UTC (permalink / raw)
  To: Git Mailing List

Roland Stigge recently pointed out a use case using very large files
where GIT has some serious limitations. He is one of several Debian
developers keeping their homedir under version control with SVN (blame
Joey Hess for this - http://www.kitenet.net/~joey/svnhome.html ).

SVN does reasonably well tracking his >1GB mbox file. Now, I don't
know if I like the idea of putting my own mbox file under version
control, but it looks like projects with large and slow-changing files
would be in trouble with GIT. Not literally trouble, but gross
inefficiencies.

The problems are twofold. At commit time, a full copy is stored in the
object database until git-repack && git-prune-packed are called. And
during the transfer over the git protocol we send the full object,
even if both ends have objects that are good candidates for a small
delta.

I'm not strong on either aspect of git (packfile format or git
protocol), and I don't personally deal with large files. So feel free
to ignore me for the time being. If it ever itches, you might get a
patch...

cheers,


martin

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: Handling large files with GIT
  2006-02-08  9:14 Handling large files with GIT Martin Langhoff
@ 2006-02-08 11:54 ` Johannes Schindelin
  2006-02-08 16:34   ` Linus Torvalds
  2006-02-08 21:20 ` Florian Weimer
  1 sibling, 1 reply; 39+ messages in thread
From: Johannes Schindelin @ 2006-02-08 11:54 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Git Mailing List

Hi,

On Wed, 8 Feb 2006, Martin Langhoff wrote:

> Roland Stigge recently pointed out a use case using very large files
> where GIT has some serious limitations.

That is intentional: git handles source code very well, where you tend to 
have small files, and it handles branches very well, where you tend to 
have mostly the same files in different branches.

I am uncertain if it is possible to extend git to handle large files 
gracefully, without slowing it down for its main use case.

[thinking] A potentially silly idea just hit me: We could virtually cut 
every file into 256kB chunks. That would not affect source code at all: 
anybody producing a 256kB C file should be shot anyway.

If the files just keep growing, this should help enormously. If the files 
change subtly, the diff algorithm should work quite well on 'em.
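A minimal sketch of the chunking idea (illustrative only - this is not git's object format, and SHA-1 over raw slices stands in for real blob IDs): with fixed 256kB pieces, a file that only grows keeps all of its earlier chunk objects unchanged.

```python
import hashlib

CHUNK = 256 * 1024  # the 256kB block size proposed above

def chunk_ids(data: bytes):
    """Hash each fixed-size slice of the file as its own object."""
    return [hashlib.sha1(data[i:i + CHUNK]).hexdigest()
            for i in range(0, len(data), CHUNK)]

old = b"x" * (3 * CHUNK)          # stand-in for a growing mbox
new = old + b"y" * CHUNK          # mail only gets appended
# The three earlier chunks keep their IDs; only new chunks are added:
assert chunk_ids(new)[:3] == chunk_ids(old)
```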

Comments?

Ciao,
Dscho


* Re: Handling large files with GIT
  2006-02-08 11:54 ` Johannes Schindelin
@ 2006-02-08 16:34   ` Linus Torvalds
  2006-02-08 17:01     ` Linus Torvalds
  0 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2006-02-08 16:34 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Martin Langhoff, Git Mailing List



On Wed, 8 Feb 2006, Johannes Schindelin wrote:
> 
> I am uncertain if it is possible to extend git to handle large files 
> gracefully, without slowing it down for its main use case.

Indeed. The git architecture simply sucks for big objects. It was 
discussed somewhat during the early stages, but a lot of it really is 
pretty fundamental. The fact that all the operations work on a full 
object, and the deltas are (on purpose) just a very specific and limited 
kind of size compression, is just very ingrained.

> [thinking] A potentially silly idea just hit me: We could virtually cut 
> every file into 256kB chunks. That would not affect source code at all: 
> anybody producing a 256kB C file should be shot anyway.

It probably wouldn't help that much, really. And it would probably impact 
source code users too: I bet we'd have bugs. It would be a very strange 
special case.

It also would only help for things that purely grow at the end. Which 
isn't even true for a mailbox: it may or may not be true for your INBOX, 
but anybody who _uses_ a mailbox format to read his email will be adding 
status flags to the mbox format (or deleting mbox entries etc). 

So every time a small change happened that changed the offset, you'd have 
an explosion of these 256kB chunk objects, and while the delta would work 
(probably slowly - remember how the git deltification algorithm tries to 
compare against the ten "nearest" neighbors), at _commit_ time you'd have 
to write that 1GB (compressed) out anyway.
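A quick simulation of that failure mode (a sketch; random bytes stand in for mailbox content): one byte inserted near the front shifts every fixed 256kB boundary, so no chunk object survives.

```python
import hashlib
import os

CHUNK = 256 * 1024

def chunk_ids(data):
    """Fixed-offset chunking: hash each 256kB slice."""
    return [hashlib.sha1(data[i:i + CHUNK]).hexdigest()
            for i in range(0, len(data), CHUNK)]

mbox = os.urandom(4 * CHUNK)   # stand-in for a large, varied mailbox
edited = b"X" + mbox           # e.g. a status flag added near the top
# All boundaries shift, so no chunk ID is shared between the versions:
assert not set(chunk_ids(mbox)) & set(chunk_ids(edited))
```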

Realistically, I think the answer is that git just doesn't work for his 
usage case. There are two alternatives:

 - convince him to not have big mailboxes (an answer I don't particularly 
   like: it's a tool limitation, and you shouldn't change your behaviour 
   just because the tool doesn't work for it - you should just try to find 
   the right tool).

   That said: git should actually work beautifully for email if you 
   _don't_ keep it as one big mbox. You could probably very reasonably use 
   git as a database backend, where each email is its own object, and you 
   can have many different ways of indexing them into trees (by content, 
   by date, by author, by thread).

   But that's very different from what the suggested "home directory" 
   setup would be.

 - try to work around some of the worst git issues. While I don't think 
   the 256kB blocking thing would help (the git protocol would still 
   always send the base versions), there _are_ probably things that could 
   be done. They'd be very invasive, though, and somebody would seriously 
   have to look at the architectural issues.

   For example, right now the decision to send only "self-contained" packs 
   in the git protocol was a very conscious one: it's much safer, and it 
   makes the unpacking a lot easier (the unpacking doesn't ever have to 
   even read any other objects than the stream it gets). It's also (for 
   packs that we use on-disk) the only sane way to avoid nasty inter-pack 
   dependencies.

   But for the git protocol, the inter-pack dependencies don't matter, 
   if we'd always unpack the thing on reception if it is not a 
   self-contained pack. So we _could_ allow deltas that depend on the 
   receiver already having the objects we delta against.

   However, the deltification itself is likely very slow, exactly because 
   git (again, very much by design) generates the deltas dynamically 
   rather than depending on things already being in delta format.
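The "each email is its own object" alternative can be sketched like so (a toy model: the messages are made up, `blob_id` mimics git's blob hashing, and the date index stands in for one of the trees mentioned above):

```python
import email
import email.utils
import hashlib

def blob_id(data: bytes) -> str:
    """Content-address a message the way git hashes a blob."""
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()

# Two made-up messages; each would be stored once as an object, and
# different trees could index the same objects by date, author, thread...
msgs = [
    b"From: a@example.org\nDate: Wed, 08 Feb 2006 09:14:00 +0000\n\nhi\n",
    b"From: b@example.org\nDate: Thu, 09 Feb 2006 04:54:00 +0000\n\nyo\n",
]
by_date = {}
for raw in msgs:
    m = email.message_from_bytes(raw)
    day = email.utils.parsedate_to_datetime(m["Date"]).date().isoformat()
    by_date.setdefault(day, []).append(blob_id(raw))
```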

Personally, I think the answer is "git is good for lots of small files". 
It's very much what git was designed for, and the fact that it doesn't 
work for everything is a trade-off for the things it _does_ work well for.

			Linus


* Re: Handling large files with GIT
  2006-02-08 16:34   ` Linus Torvalds
@ 2006-02-08 17:01     ` Linus Torvalds
  2006-02-08 20:11       ` Junio C Hamano
  0 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2006-02-08 17:01 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Martin Langhoff, Git Mailing List



On Wed, 8 Feb 2006, Linus Torvalds wrote:
>
> The fact that all the operations work on a full object, and the deltas 
> are (on purpose) just a very specific and limited kind of size 
> compression, is just very ingrained.

Side note: the original explicit git "delta" objects by Nicolas Pitre 
would have handled this large-file-case much more gracefully. 

The pack-files had absolutely huge advantages, though, so I think we (I) 
did the right thing there in making the delta code only a very specific 
special case..

It is possible that we could re-introduce the "explicit delta" object, 
though (it's not incompatible with also doing pack-files, it's just that 
pack-files made 99% of all the arguments for an explicit delta go away).

		Linus


* Re: Handling large files with GIT
  2006-02-08 17:01     ` Linus Torvalds
@ 2006-02-08 20:11       ` Junio C Hamano
  0 siblings, 0 replies; 39+ messages in thread
From: Junio C Hamano @ 2006-02-08 20:11 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git, Johannes Schindelin

Linus Torvalds <torvalds@osdl.org> writes:

> Side note: the original explicit git "delta" objects by Nicolas Pitre 
> would have handled this large-file-case much more gracefully. 

True.

> The pack-files had absolutely huge advantages, though, so I think we (I) 
> did the right thing there in making the delta code only a very specific 
> special case..

Well, the blame for ripping that out falls on me, actually...

> It is possible that we could re-introduce the "explicit delta" object, 
> though (it's not incompatible with also doing pack-files, it's just that 
> pack-files made 99% of all the arguments for an explicit delta go away).

I do not remember whether we had 'rev-list --objects' support for Nico's
explicit delta object chains.  If we didn't, that would be a new
development that needs to be done to resurrect it.  I know
pack-objects never had support for it, so obviously that needs to
be added as well.  Probably explicit delta objects should always
be packed in full, without spending cost to find delta candidates.

Personally I feel that post-1.2.0 would be a good time to start
looking at enhancing the pack generation chain, rev-list piped
to pack-objects.  This "large files" use case is helped by
less self-contained packs, while the "shallow clone" use case
we discussed earlier is helped by more self-contained packs (we
had a discussion a long time ago on this and I think we have the
code to do so [*1*]).

An addition to pack-objects is needed to make it capable of reading
a list of objects that we do not want to include in the
resulting pack but that can be used as base objects for deltification.

BTW, as to the "shallow clone", I changed my mind and am
inclined to agree with Johannes that handling cut-offs
differently from grafts is easier for dealing with the later "give
me more history" operation, so I am planning to chuck my jc/clone
topic branch that I have included in the proposed updates so
far.

[Footnote]

*1* http://article.gmane.org/gmane.comp.version-control.git/5779


* Re: Handling large files with GIT
  2006-02-08  9:14 Handling large files with GIT Martin Langhoff
  2006-02-08 11:54 ` Johannes Schindelin
@ 2006-02-08 21:20 ` Florian Weimer
  2006-02-08 22:35   ` Martin Langhoff
  2006-02-09  4:54   ` Greg KH
  1 sibling, 2 replies; 39+ messages in thread
From: Florian Weimer @ 2006-02-08 21:20 UTC (permalink / raw)
  To: git

* Martin Langhoff:

> SVN does reasonably well tracking his >1GB mbox file. Now, I don't
> know if I like the idea of putting my own mbox file under version
> control, but it looks like projects with large and slow-changing files
> would be in trouble with GIT.

To my surprise, it's not that bad.  The Debian testing-security team
uses a single 1.8 MB file (400 KB compressed) to keep vulnerability
data.  Most changes to that file involve just a few lines.  But even
in this extreme case, git doesn't compare too badly against Subversion
if you pack regularly (but not too often).  Disk usage is actually
*below* Subversion FSFS even with --depth=10 (the default,
unfortunately a bit hard to override).

I plan to do another experiment for GCC, which contains marvels such
as:

  35905  126056 1379093 gcc/ChangeLog-2005
  12610   61215  417584 gcc/combine.c

But the outcome will likely be quite similar to the secure-testing
case: comparable disk space usage, not a difference of one or more
orders of magnitude.

But Subversion still has a significant advantage: I can get a
working copy without downloading full history (several gigabytes in
GCC's case).  There's also the slight drawback that you shouldn't pack
too often, otherwise you'll reduce its effectiveness.  You can always
run "git-repack -a -d", but it's rather expensive.  This means that
you need to keep compressed fulltexts from a few dozen revisions, but
I don't think this is a huge burden.  All in all, the compressed
fulltexts/packs model is a pretty good trade-off between disk usage,
end-user usability, and code complexity.

In your mbox case, you should simply try Maildir.  The tree object
(which lists all files in the Maildir folder) will still be rather
large (about 40 to 50 bytes per message stored), though.
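For a sense of scale, the per-entry figure above works out as follows (the 100,000-message mailbox size is an assumption for illustration):

```python
# Tree-object overhead for a Maildir under git: 40-50 bytes per entry
# is the estimate above; the message count is an assumed example.
messages = 100_000
low, high = 40 * messages, 50 * messages
assert (low, high) == (4_000_000, 5_000_000)  # i.e. a 4-5 MB tree listing
```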


* Re: Handling large files with GIT
  2006-02-08 21:20 ` Florian Weimer
@ 2006-02-08 22:35   ` Martin Langhoff
  2006-02-13  1:26     ` Ben Clifford
  2006-02-09  4:54   ` Greg KH
  1 sibling, 1 reply; 39+ messages in thread
From: Martin Langhoff @ 2006-02-08 22:35 UTC (permalink / raw)
  To: Florian Weimer; +Cc: git

On 2/9/06, Florian Weimer <fw@deneb.enyo.de> wrote:
> In your mbox case, you should simply try Maildir.  The tree object
> (which lists all files in the Maildir folder) will still be rather
> large (about 40 to 50 bytes per message stored), though.

I did suggest maildir, where GIT is bound to do well as the content 
of the emails doesn't change but they just move around a lot. Though 
yes, trees are going to be nasty.

But the interesting case, I guess, is the general one of large files
changing slowly. My guess is that supporting delta transfers in the
git protocol would make it a lot more manageable. For local storage
git isn't so bad, and the problem is perhaps harder to solve.

cheers,


martin


* Re: Handling large files with GIT
  2006-02-08 21:20 ` Florian Weimer
  2006-02-08 22:35   ` Martin Langhoff
@ 2006-02-09  4:54   ` Greg KH
  2006-02-09  5:38     ` Martin Langhoff
  1 sibling, 1 reply; 39+ messages in thread
From: Greg KH @ 2006-02-09  4:54 UTC (permalink / raw)
  To: Florian Weimer; +Cc: git

On Wed, Feb 08, 2006 at 10:20:39PM +0100, Florian Weimer wrote:
> * Martin Langhoff:
> 
> > SVN does reasonably well tracking his >1GB mbox file. Now, I don't
> > know if I like the idea of putting my own mbox file under version
> > control, but it looks like projects with large and slow-changing files
> > would be in trouble with GIT.
> 
> To my surprise, it's not that bad.  The Debian testing-security team
> uses a single 1.8 MB file (400 KB compressed) to keep vulnerability
> data.  Most changes to that file involve just a few lines.  But even
> in this extreme case, git doesn't compare too badly against Subversion
> if you pack regularly (but not too often).  Disk usage is actually
> *below* Subversion FSFS even with --depth=10 (the default,
> unfortunately a bit hard to override).

I have a project that has 2.5MB files, and git handles them just fine,
even on my old slow laptop.

But when I tried to use it to back up my old email archive a few months
ago, running about 2GB in about 300 files, it took forever.  Luckily I
only archive stuff off every other month or so, otherwise it would be
unusable.

However, I did notice one problem.  When cloning from one machine to
another, for a project that is already fully packed, it seems that the
whole project is packed again before sending it across the wire.  With
an archive this big, that takes over an hour for my slow old fileserver.
I ended up just rsyncing over the files and pointing the parent back to
the original.  Is there any way to not repack everything if it's not
needed?

thanks,

greg k-h


* Re: Handling large files with GIT
  2006-02-09  4:54   ` Greg KH
@ 2006-02-09  5:38     ` Martin Langhoff
  0 siblings, 0 replies; 39+ messages in thread
From: Martin Langhoff @ 2006-02-09  5:38 UTC (permalink / raw)
  To: Greg KH; +Cc: Florian Weimer, git

On 2/9/06, Greg KH <greg@kroah.com> wrote:
> Is there any way to not repack everything if it's not
> needed?

I often cg-clone large projects over stupid protocols (http/rsync) and
then hand-edit the branches/origin or remotes/origin file to say
"git://" instead.

It's called cheating but it works great ;-)



m


* Re: Handling large files with GIT
  2006-02-08 22:35   ` Martin Langhoff
@ 2006-02-13  1:26     ` Ben Clifford
  2006-02-13  3:42       ` Linus Torvalds
  2006-02-13  4:40       ` Martin Langhoff
  0 siblings, 2 replies; 39+ messages in thread
From: Ben Clifford @ 2006-02-13  1:26 UTC (permalink / raw)
  To: Martin Langhoff; +Cc: Florian Weimer, git


On Thu, 9 Feb 2006, Martin Langhoff wrote:

> I did suggest maildir,  where GIT is bound to do well as the content
> of the emails doesn't change but they just move around a lot. Though
> yes, trees are going to be nasty.

I've been keeping maildir in git for a few months, with mail being 
delivered into a git repo on one (permanently connected) host and me 
merging that branch into a repo on my laptop for reading (the intention 
being that I should be able to sync it back to the permanently connected 
host, as I sometimes read mail there).

Alas, the merge part of this absolutely sucks -- as time goes by, it's 
getting slower and slower (it's taking an hour or so to do the merge, which 
has got to the point of being barely usable -- if it wasn't for the neat 
hack-value, I'd have given up on this by now).

I haven't really probed what's happening when I'm doing the merges in any 
depth, but I see a lot of index manipulation happening (git update-index, 
I think) to add and remove files, where each invocation seems to be 
taking almost a whole second.

I wonder if the present merge algorithms perform especially badly in the 
case of a large number of files with lots of renames (and so lots of 
adds/removes) but no content changes? The merge should be able to happen 
entirely in the index, I think.

Perhaps one way to proceed would be for me to write a move-optimised merge 
strategy where I flip the index round and instead of saying "how has the 
content inside this filename changed?" I instead say "how has the filename 
associated with this content <hash> changed?"

A special-case on top of a move-optimised merge might be some 
maildir-aware filename handling that knows how to resolve conflicts when a 
particular content-hash has been renamed to two different names (eg. when 
one flag is added to a message in one repo and a different flag is added 
to a message in another repo).

Any advice/thoughts/suggestions-that-this-is-a-stupid-thing-to-do would be 
greatly appreciated.

-- 


* Re: Handling large files with GIT
  2006-02-13  1:26     ` Ben Clifford
@ 2006-02-13  3:42       ` Linus Torvalds
  2006-02-13  4:57         ` Linus Torvalds
  2006-02-13  4:40       ` Martin Langhoff
  1 sibling, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2006-02-13  3:42 UTC (permalink / raw)
  To: Ben Clifford; +Cc: Martin Langhoff, Florian Weimer, git



On Mon, 13 Feb 2006, Ben Clifford wrote:
> 
> I've been keeping maildir in git for a few months, with mail being delivered
> into a git repo on one (permanently connected) host and me merging that branch
> into a repo on my laptop for reading (the intention being that I should be
> able to sync it back to the permanently connected host, as I sometimes read
> mail there).
> 
> Alas, the merge part of this absolutely sucks -- as time goes by, it's getting
> slower and slower (it's taking an hour or so to do the merge, which has got to
> the point of being barely usable -- if it wasn't for the neat hack-value, I'd
> have given up on this by now).

If it takes an hour per merge, it _is_ unusable. I consider 15 _seconds_ 
to be pretty unusable.

Can you do a

	git-ls-tree -r -t HEAD
	git-ls-tree -r -t HEAD^1
	git-ls-tree -r -t HEAD^2

after a merge, and put the three resulting files up somewhere public (I 
assume the filenames aren't going to be anything private, I don't know how 
maildir organizes stuff) so that people can get an idea of what ends up 
being involved there...

		Linus


* Re: Handling large files with GIT
  2006-02-13  1:26     ` Ben Clifford
  2006-02-13  3:42       ` Linus Torvalds
@ 2006-02-13  4:40       ` Martin Langhoff
  1 sibling, 0 replies; 39+ messages in thread
From: Martin Langhoff @ 2006-02-13  4:40 UTC (permalink / raw)
  To: Ben Clifford; +Cc: Florian Weimer, git

On 2/13/06, Ben Clifford <benc@hawaga.org.uk> wrote:
> Any advice/thoughts/suggestions-that-this-is-a-stupid-thing-to-do would be
> greatly appreciated.

How many files are you talking about?


m


* Re: Handling large files with GIT
  2006-02-13  3:42       ` Linus Torvalds
@ 2006-02-13  4:57         ` Linus Torvalds
  2006-02-13  5:05           ` Linus Torvalds
  2006-02-13  5:55           ` Jeff Garzik
  0 siblings, 2 replies; 39+ messages in thread
From: Linus Torvalds @ 2006-02-13  4:57 UTC (permalink / raw)
  To: Ben Clifford; +Cc: Martin Langhoff, Florian Weimer, git



On Sun, 12 Feb 2006, Linus Torvalds wrote:
> 
> If it takes an hour per merge, it _is_ unusable. I consider 15 _seconds_ 
> to be pretty unusable.

Btw, one thing to realize is that git is inherently a lot better at 
handling lots of files in _subdirectories_, especially if those 
subdirectories don't change.

I've never used maildir layout, but if it is a couple of large _flat_ 
subdirectories, git will potentially handle that a lot worse than if you 
have a hierarchy of directories.

I say "potentially", because if the directories are all mutable and 
change, then the flat approach is better. But if they tend to have some 
kind of stability, a lot of git operations (diffing and merging in 
particular) are able to see that two subdirectories are 100% equal, and 
will entirely skip them.

This is a large part of why git performs well on the kernel. Most merges 
don't actually touch all - or even a very big percentage - of the over a 
thousand subdirectories in the kernel. Git can quickly see and ignore the 
whole subdirectory when that happens - the SHA1 is exactly the same, so 
git knows that every file under that subdirectory (and every recursive 
directory) is the same.

In contrast, if you have a million files in one directory, and 10 of them 
changed, git will still have to check the SHA1s for matches for the other 
999,990 files. Which is going to be slow.
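A toy model of that pruning (illustrative, not git's actual tree format): subtrees whose hashes match are skipped wholesale, so a nested layout examines far fewer entries than a flat one for the same small change.

```python
import hashlib

def tree_hash(tree):
    """Hash a nested dict of {name: file-bytes or sub-dict}, git-tree style."""
    parts = []
    for name in sorted(tree):
        v = tree[name]
        h = tree_hash(v) if isinstance(v, dict) else hashlib.sha1(v).hexdigest()
        parts.append(name + ":" + h)
    return hashlib.sha1("\n".join(parts).encode()).hexdigest()

def entries_examined(a, b):
    """Count entries a diff must look at, pruning hash-identical subtrees."""
    if tree_hash(a) == tree_hash(b):
        return 0                      # same SHA1: skip the whole subtree
    n = 0
    for name in set(a) | set(b):
        n += 1
        va, vb = a.get(name), b.get(name)
        if isinstance(va, dict) and isinstance(vb, dict):
            n += entries_examined(va, vb)
    return n

flat = {f"m{i}": str(i).encode() for i in range(100)}
flat2 = dict(flat, m7=b"changed")
nested = {f"d{i}": {f"f{j}": str(i * 10 + j).encode() for j in range(10)}
          for i in range(10)}
nested2 = {k: dict(v) for k, v in nested.items()}
nested2["d0"]["f7"] = b"changed"

assert entries_examined(flat, flat2) == 100     # every entry checked
assert entries_examined(nested, nested2) == 20  # 10 dirs + 10 files in d0
```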

That said, I suspect there's space for optimization. 

		Linus


* Re: Handling large files with GIT
  2006-02-13  4:57         ` Linus Torvalds
@ 2006-02-13  5:05           ` Linus Torvalds
  2006-02-13 23:17             ` Ian Molton
  2006-02-13  5:55           ` Jeff Garzik
  1 sibling, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2006-02-13  5:05 UTC (permalink / raw)
  To: Ben Clifford; +Cc: Martin Langhoff, Florian Weimer, git



On Sun, 12 Feb 2006, Linus Torvalds wrote:
> 
> This is a large part of why git performs well on the kernel. Most merges 
> don't actually touch all - or even a very big percentage - of the over a 
> thousand subdirectories in the kernel. Git can quickly see and ignore the 
> whole subdirectory when that happens - the SHA1 is exactly the same, so 
> git knows that every file under that subdirectory (and every recursive 
> directory) is the same.

Final note: this means, for example, that git is relatively bad at 
tracking a "hashed" nested file directory (like the one git itself uses), 
because new files will end up randomly appearing in every directory, and 
no directory is ever "stable".

In contrast, if the directory structure is - for example - something where 
you index files by date, and subdirectories with older dates are thus much 
more naturally likely to be quiescent, the "this tree is the same" 
optimizations work very well.

Basically, a lot of the git speed optimizations depend on "on average, 
things stay the same". We may have 18,000+ files in the kernel, but most 
patches will change maybe five of them. There's a lot of fairly static 
content and the changes have a certain level of "locality". It's normally 
a hundred-line patch to one file, not a hundred files that had one-liners. 
And when 20 files are changed, most of them tend to be in the same 
subdirectory, etc etc.

Taking advantage of those kinds of things is what makes git good at 
handling software projects. But it wouldn't necessarily be how you lay out 
a mail directory, for example. An automated file store might want to 
spread out the changes on purpose.

		Linus


* Re: Handling large files with GIT
  2006-02-13  4:57         ` Linus Torvalds
  2006-02-13  5:05           ` Linus Torvalds
@ 2006-02-13  5:55           ` Jeff Garzik
  2006-02-13  6:07             ` Keith Packard
  2006-02-13 16:19             ` Linus Torvalds
  1 sibling, 2 replies; 39+ messages in thread
From: Jeff Garzik @ 2006-02-13  5:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Ben Clifford, Martin Langhoff, Florian Weimer, git

Linus Torvalds wrote:
> I've never used maildir layout, but if it is a couple of large _flat_ 
> subdirectories,

That's what it is :/   One directory per mail folder, with each email an 
individual file in that dir.

On the side I run a mini-ISP that offers (among other things) 
SMTP/IMAP/POP via exim/dovecot.  I use maildir for my customers, and am 
desperately searching for a better solution.  David Woodhouse and I talk 
off and on about using git for distributed mail storage.

OTOH, I wonder if a P2P-ish, sha1-indexed network service wouldn't be 
better for both a git fallback and email storage.

	Jeff


* Re: Handling large files with GIT
  2006-02-13  5:55           ` Jeff Garzik
@ 2006-02-13  6:07             ` Keith Packard
  2006-02-14  0:07               ` Martin Langhoff
  2006-02-13 16:19             ` Linus Torvalds
  1 sibling, 1 reply; 39+ messages in thread
From: Keith Packard @ 2006-02-13  6:07 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: keithp, Linus Torvalds, Ben Clifford, Martin Langhoff,
	Florian Weimer, git

[-- Attachment #1: Type: text/plain, Size: 458 bytes --]

On Mon, 2006-02-13 at 00:55 -0500, Jeff Garzik wrote:
> Linus Torvalds wrote:
> > I've never used maildir layout, but if it is a couple of large _flat_ 
> > subdirectories,
> 
> That's what it is :/   One directory per mail folder, with each email an 
> individual file in that dir.

and, named to include a hash of the contents so that they are always
unique within a multi-folder mail store (makes refile easier).

-- 
keith.packard@intel.com

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]


* Re: Handling large files with GIT
  2006-02-13  5:55           ` Jeff Garzik
  2006-02-13  6:07             ` Keith Packard
@ 2006-02-13 16:19             ` Linus Torvalds
  1 sibling, 0 replies; 39+ messages in thread
From: Linus Torvalds @ 2006-02-13 16:19 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: Ben Clifford, Martin Langhoff, Florian Weimer, git



On Mon, 13 Feb 2006, Jeff Garzik wrote:
>
> Linus Torvalds wrote:
> > I've never used maildir layout, but if it is a couple of large _flat_
> > subdirectories,
> 
> That's what it is :/   One directory per mail folder, with each email an
> individual file in that dir.

Ok.

Anyway, I double-checked, and I'm wrong anyway. While the "static 
directories" thing is a huge performance optimization for doing many 
things (diffing trees, file history in git-rev-list, etc etc), for merging 
it doesn't help. We always end up expanding the whole tree.

Which is kind of sad.

It's inevitable in one sense: we do the merge in the index, after all, and 
the index - unlike the tree structures - is a flat file (like the 
"manifest" in mercurial or monotone). It's also represented that way in 
memory. 

However, it is a total and complete waste in other cases.

Thinking more about it, this is also why merging causes all the horrible 
index performance: not only do we (unnecessarily) read the same trees over 
and over again only to collapse them back to stage0 later when they are 
the same, but because we keep the index in a linear format, when we read 
the other trees, we'll have to move things around with memmove() (just the 
pointers, but still).

We'd actually be a _lot_ better off if we split "git-read-tree" up into 
two phases: one that did the recursive tree operation (which can optimize 
the "same tree everywhere" case), and the second stage that actually 
populated the index.

I'll have to think about this. It would be an absolutely _huge_ 
optimization for merging in certain patterns, it just doesn't matter for 
something like the kernel with "just" 18,000 files and not a lot of 
strange merging going on.

In contrast, I can see a mail archive easily having hundreds of thousands 
of individual emails. At which time it's horribly stupid to read them all 
in three times (for a merge - base, origin, new) and do so in a pretty 
inefficient manner.

Ho humm. It doesn't look _hard_ per se, and I think the two-stage 
git-read-tree is actually also what the recursive merge strategy wants 
anyway (it can't use the index - it really just wants to get a list of 
conflict information). So this definitely sounds like the RightThing(tm) 
to do anyway, and it fits the git data structures really well.

So no downsides. Except that this is some rather core code, and you can't 
afford to get it wrong. And the fact that I'm a lazy bastard, of course.

			Linus


* Re: Handling large files with GIT
  2006-02-13  5:05           ` Linus Torvalds
@ 2006-02-13 23:17             ` Ian Molton
  2006-02-13 23:19               ` Martin Langhoff
  2006-02-14 18:56               ` Johannes Schindelin
  0 siblings, 2 replies; 39+ messages in thread
From: Ian Molton @ 2006-02-13 23:17 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Ben Clifford, Martin Langhoff, Florian Weimer, git

Linus Torvalds wrote:

> Taking advantage of those kinds of things is what makes git good at 
> handling software projects. But it wouldn't necessarily be how you lay out 
> a mail directory, for example. An automated file store might want to 
> spread out the changes on purpose.

Indeed...

I'm curious as to why anyone would want to use an SCM tool on a maildir 
anyway - surely no-one edits their past mails and needs to keep logs?

Surely incremental backups would be a better way to manage something 
like this?


* Re: Handling large files with GIT
  2006-02-13 23:17             ` Ian Molton
@ 2006-02-13 23:19               ` Martin Langhoff
  2006-02-14 18:56               ` Johannes Schindelin
  1 sibling, 0 replies; 39+ messages in thread
From: Martin Langhoff @ 2006-02-13 23:19 UTC (permalink / raw)
  To: Ian Molton; +Cc: Linus Torvalds, Ben Clifford, Florian Weimer, git

On 2/14/06, Ian Molton <spyro@f2s.com> wrote:
> I'm curious as to why anyone would want to use an SCM tool on a maildir
> anyway - surely no-one edits their past mails and needs to keep logs?
>
> Surely incremental backups would be a better way to manage something
> like this?

Well, the maildir arrangement is friendlier to something content-smart
like git than to something content-stupid like a backup tool. Files
inside a maildir change name/location to reflect changes in status,
but their content tends to remain the same.

git does great in this scenario, except for the "dealing with a
bazillion files" part of it.

cheers,


martin


* Re: Handling large files with GIT
  2006-02-13  6:07             ` Keith Packard
@ 2006-02-14  0:07               ` Martin Langhoff
  0 siblings, 0 replies; 39+ messages in thread
From: Martin Langhoff @ 2006-02-14  0:07 UTC (permalink / raw)
  To: Keith Packard
  Cc: Jeff Garzik, Linus Torvalds, Ben Clifford, Florian Weimer, git

On 2/13/06, Keith Packard <keithp@keithp.com> wrote:
> and, named to include a hash of the contents

really? I thought it was something like md5(hostname+datestamp+random).

cheers,

m


* Re: Handling large files with GIT
  2006-02-13 23:17             ` Ian Molton
  2006-02-13 23:19               ` Martin Langhoff
@ 2006-02-14 18:56               ` Johannes Schindelin
  2006-02-14 19:52                 ` Linus Torvalds
  1 sibling, 1 reply; 39+ messages in thread
From: Johannes Schindelin @ 2006-02-14 18:56 UTC (permalink / raw)
  To: Ian Molton; +Cc: git

Hi,

On Mon, 13 Feb 2006, Ian Molton wrote:

> I'm curious as to why anyone would want to use an SCM tool on a mail dir 
> anyway - surely no-one edits their past mails and needs to keep logs?
> 
> Surely incremental backups would be a better way to manage something 
> like this?

Point is, if you want to read your email on different computers (like one 
desktop and one laptop), you are quite well off managing them with git. Of 
course, you could rsync them from/to the other computer. But rsync is slow 
once you have accumulated enough files, since it has to compare the hashes 
of tons of files (or file chunks). Git knows whether they have changed.

Hth,
Dscho


* Re: Handling large files with GIT
  2006-02-14 18:56               ` Johannes Schindelin
@ 2006-02-14 19:52                 ` Linus Torvalds
  2006-02-14 21:21                   ` Sam Vilain
  0 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2006-02-14 19:52 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Ian Molton, git



On Tue, 14 Feb 2006, Johannes Schindelin wrote:
> 
> Point is, if you want to read your email on different computers (like one 
> desktop and one laptop), you are quite well off managing them with git. Of 
> course, you could rsync them from/to the other computer. But rsync is slow 
> once you accumulated enough files, since it has to compare the hashes of 
> tons of files (or file chunks). Git knows if they have changed.

Yes. I actually think that git would be a _wonderful_ email tracking tool, 
but that may not mean that it's a wonderful tool for tracking all 
particular email layout possibilities. It clearly is _not_ a wonderful 
tool for tracking mbox-style email setups, for example ;)

I suspect we actually could make the "one linear directory" setup perform 
pretty well. It wouldn't be the best possible layout (by far), but I think 
our problems there are just because of some decisions we've (me, mostly) 
made that didn't take that layout into account. I don't think the problems 
are in any way fundamental.

That said, I think git could do much better if the layout was optimized 
for git. For example, in the maildir thing, there's two issues: the flat 
directory structure is sub-optimal, but the other thing seems to be that 
maildir apparently saves metadata in the filename.

Saving meta-data in the filename should actually work wonderfully well 
with git, but both merging and git-diff-tree consider the filename to be 
the "index", so they optimize for that. You could do indexing the other 
way around, and consider the contents to be the index (and the filename is 
the "status"), but that's obviously not sane for a sw project, even if it 
might be exactly what you want to do for mail handling.

		Linus


* Re: Handling large files with GIT
  2006-02-14 19:52                 ` Linus Torvalds
@ 2006-02-14 21:21                   ` Sam Vilain
  2006-02-14 22:01                     ` Linus Torvalds
  0 siblings, 1 reply; 39+ messages in thread
From: Sam Vilain @ 2006-02-14 21:21 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Johannes Schindelin, Ian Molton, git

Linus Torvalds wrote:
> That said, I think git could do much better if the layout was optimized 
> for git. For example, in the maildir thing, there's two issues: the flat 
> directory structure is sub-optimal, but the other thing seems to be that 
> maildir apparently saves metadata in the filename.
> 
> Saving meta-data in the filename should actually work wonderfully well 
> with git, but both merging and git-diff-tree consider the filename to be 
> the "index", so they optimize for that. You could do indexing the other 
> way around, and consider the contents to be the index (and the filename is 
> the "status"), but that's obviously not sane for a sw project, even if it 
> might be exactly what you want to do for mail handling.

This seems to me to be another use case where git could gain orders of
magnitude speed improvement by either explicit ("forensic") history
objects, or a history analysis cache.

Sam.


* Re: Handling large files with GIT
  2006-02-14 21:21                   ` Sam Vilain
@ 2006-02-14 22:01                     ` Linus Torvalds
  2006-02-14 22:30                       ` Junio C Hamano
  0 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2006-02-14 22:01 UTC (permalink / raw)
  To: Sam Vilain; +Cc: Johannes Schindelin, Ian Molton, git



On Wed, 15 Feb 2006, Sam Vilain wrote:
> 
> This seems to me to be another use case where git could gain orders of
> magnitude speed improvement by either explicit ("forensic") history
> objects, or a history analysis cache.

Well, the thing is, it could get that _without_ any cache too.

The problem really isn't that we couldn't make things faster, the problem 
is that at least for _me_ the thing is fast enough.

If somebody is interested in making the "lots of filename changes" case go 
fast, I'd be more than happy to walk them through what they'd need to 
change. I'm just not horribly motivated to do it myself. Hint, hint.

			Linus


* Re: Handling large files with GIT
  2006-02-14 22:01                     ` Linus Torvalds
@ 2006-02-14 22:30                       ` Junio C Hamano
  2006-02-15  0:40                         ` Sam Vilain
  2006-02-15  2:05                         ` Linus Torvalds
  0 siblings, 2 replies; 39+ messages in thread
From: Junio C Hamano @ 2006-02-14 22:30 UTC (permalink / raw)
  To: git; +Cc: Linus Torvalds

Linus Torvalds <torvalds@osdl.org> writes:

> If somebody is interested in making the "lots of filename changes" case go 
> fast, I'd be more than happy to walk them through what they'd need to 
> change. I'm just not horribly motivated to do it myself. Hint, hint.

In case anybody is wondering, I share the same feeling.  I
cannot say I'd be "more than happy to" clean up potential
breakages during the development of such changes, but if the
change eventually would help certain use cases, I can be
persuaded to help debugging such a mess ;-).


* Re: Handling large files with GIT
  2006-02-14 22:30                       ` Junio C Hamano
@ 2006-02-15  0:40                         ` Sam Vilain
  2006-02-15  1:39                           ` Junio C Hamano
  2006-02-15  2:07                           ` Martin Langhoff
  2006-02-15  2:05                         ` Linus Torvalds
  1 sibling, 2 replies; 39+ messages in thread
From: Sam Vilain @ 2006-02-15  0:40 UTC (permalink / raw)
  To: Junio C Hamano, Martin Langhoff; +Cc: git, Linus Torvalds

Junio C Hamano wrote:
> Linus Torvalds <torvalds@osdl.org> writes:
> 
>>If somebody is interested in making the "lots of filename changes" case go 
>>fast, I'd be more than happy to walk them through what they'd need to 
>>change. I'm just not horribly motivated to do it myself. Hint, hint.
> 
> In case anybody is wondering, I share the same feeling.  I
> cannot say I'd be "more than happy to" clean up potential
> breakages during the development of such changes, but if the
> change eventually would help certain use cases, I can be
> persuaded to help debugging such a mess ;-).

Excellent.  Any speculations on where they might fit?  Clearly, it needs
to be out of the "tree".

Dealing with the three cases I mentioned before in my Warnocked post:

   1. caching - I'll consider this an "under the hood" thing, it really
                doesn't matter, so long as the tools all know.

   2. forensic - extra stuff at the end of the commit object?

      eg
         Copied: /new/path from /old/path:commit:c0bb171d..
           (for SVN case where history matters)
         Copied: /new/path from blob:b10b1d..
           (for general pre-caching case)
         Merged: /new/path from /old/path:commit:C0bb171d..
           (for an SVK clone, so we know that subsequent merges on
            /new/path need only merge from /old/path starting at commit
            C0bb171d..)

   3. retrospective - as above, but allow to specify old versions.

      eg
         Copied: /new/path:C0bb171d1 from /old/path:commit:c0bb171d2...
           (for SVN case where history matters)

Martin, is that enough for your CVS case?

Sam.


* Re: Handling large files with GIT
  2006-02-15  0:40                         ` Sam Vilain
@ 2006-02-15  1:39                           ` Junio C Hamano
  2006-02-15  4:03                             ` Sam Vilain
  2006-02-15  2:07                           ` Martin Langhoff
  1 sibling, 1 reply; 39+ messages in thread
From: Junio C Hamano @ 2006-02-15  1:39 UTC (permalink / raw)
  To: Sam Vilain; +Cc: git

Sam Vilain <sam@vilain.net> writes:

> ...  Clearly, it needs to be out of the "tree".

OK.

>   2. forensic - extra stuff at the end of the commit object?

(except "extra at the end of commit", which does not make it out
of the tree).

>      eg
>         Copied: /new/path from /old/path:commit:c0bb171d..
>           (for SVN case where history matters)
>         Copied: /new/path from blob:b10b1d..
>           (for general pre-caching case)
>         Merged: /new/path from /old/path:commit:C0bb171d..
>           (for an SVK clone, so we know that subsequent merges on
>            /new/path need only merge from /old/path starting at commit
>            C0bb171d..)

I am not sure if recording the bare SVN ``copied'' is very
useful.  You would need to infer things from what SVN did to
tell if the copy is a tree copy inside a project (e.g. cp -r
i386 x86_64), tagging (e.g. svn-cp rHEAD trunk tags/v1.2), or
branching, wouldn't you?  SVK merge ticket is a bit more useful
in that sense.

So far, git philosophy is to record things you _know_ about and
defer such guesswork to the future, so limiting what you record
to what you can actually see from the foreign SCM would be more
in line with it.  For the same reason, if you are talking about
maildir managed under git, you should not record anything
other than what git already records: "we used to have these
files, now we have these instead".

But I thought you were talking about caching what earlier
inference declared what happened, so that you do not have to do
the same inference every time.  If that is the case, SVN level
"Copied:" is probably not what you would want to record, I
suspect.  You would do some inference with the given information
("SVN says it copied this tree to that tree, what was it that it
really wanted to do?  Was it a copy, or was it to create a
branch which was implemented as a copy?"), and record that,
hoping that information would help your other operations this
time and later.

So I think the order of questions you should be asking is:

   - what operations are you trying to help?

   - what information you would need to achieve those operations
     better?

   - among the second one, what will be necessary to be set in
     stone (IOW, cannot be computed later), and what are
     computable but expensive to recompute every time?

An example from an ancient thread.

With criss-cross merge between renamed trees, it was conjectured
that recording renames detected earlier would help later merges.
I think you should arrive at the list of "what we should record"
by thinking things in this order:

 (1) currently criss-cross merge between renamed trees does not
     work well (realization of the status quo);

 (2) if we had this kind of information it would work better,
     here are the things we need to record when a new commit is
     made, and here is how to compute other information that can
     be inferred, and here is how to use that information to
     make the merge work better (solution without caching);

 (3) but it is expensive to recompute information we said
     computable in (2) if we were to do so every time.  Let's
     cache it.

I am getting an impression that you are doing only the first
half of (2) without other parts, which somewhat bothers me.


* Re: Handling large files with GIT
  2006-02-14 22:30                       ` Junio C Hamano
  2006-02-15  0:40                         ` Sam Vilain
@ 2006-02-15  2:05                         ` Linus Torvalds
  2006-02-15  2:18                           ` Linus Torvalds
  1 sibling, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2006-02-15  2:05 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git



On Tue, 14 Feb 2006, Junio C Hamano wrote:

> Linus Torvalds <torvalds@osdl.org> writes:
> 
> > If somebody is interested in making the "lots of filename changes" case go 
> > fast, I'd be more than happy to walk them through what they'd need to 
> > change. I'm just not horribly motivated to do it myself. Hint, hint.
> 
> In case anybody is wondering, I share the same feeling.  I
> cannot say I'd be "more than happy to" clean up potential
> breakages during the development of such changes, but if the
> change eventually would help certain use cases, I can be
> persuaded to help debugging such a mess ;-).

Actually, I got interested in seeing how hard this is, and wrote a simple 
first cut at doing a tree-optimized merger.

Let me shout a bit first:

  THIS IS WORKING CODE, BUT BE CAREFUL: IT'S A TECHNOLOGY DEMONSTRATION 
  RATHER THAN THE FINAL PRODUCT!

With that out of the way, let me describe what this does (and then describe 
the missing parts).

This is basically a three-way merge that works entirely on the "tree" 
level, rather than on the index. A lot of the _concepts_ are the same, 
though, and if you're familiar with the results of an index merge, some of 
the output will make more sense.

You give it three trees: the base tree (tree 0), and the two branches to 
be merged (tree 1 and tree 2 respectively). It will then walk these three 
trees, and resolve them as it goes along.

The interesting part is:
 - it can resolve whole sub-directories in one go, without actually even 
   looking recursively at them. A whole subdirectory will resolve the same 
   way as any individual files will (although that may need some 
   modification, see later).
 - if it has a "content conflict", for subdirectories that means "try to 
   do a recursive tree merge", while for non-subdirectories it's just a 
   content conflict and we'll output the stage 1/2/3 information.
 - a successful merge will output a single stage 0 ("merged") entry, 
   potentially for a whole subdirectory.
 - it outputs all the resolve information on stdout, so something like the 
   recursive resolver can pretty easily parse it all.

Now, the caveats:
 - we probably need to be more careful about subdirectory resolves. The 
   trivial case (both branches have the exact same subdirectory) is a 
   trivial resolve, but the other cases ("branch1 matches base, branch2 is 
   different" probably can't be silently just resolved to the "branch2" 
   subdirectory state, since it might involve renames into - or out of - 
   that subdirectory)
 - we do not track the current index file at all, so this does not do the 
   "check that index matches branch1" logic that the three-way merge in 
   git-read-tree does. The theory is that we'd do a full three-way merge 
   (ignoring the index and working directory), and then to update the 
   working tree, we'd do a two-way "git-read-tree branch1->result"
 - I didn't actually make it do all the trivial resolve cases that 
   git-read-tree does. It's a technology demonstration.

Finally (a more serious caveat):
 - doing things through stdout may end up being so expensive that we'd 
   need to do something else. In particular, it's likely that I should 
   not actually output the "merge results", but instead output a "merge 
   results as they _differ_ from branch1"

However, I think this patch is already interesting enough that people who 
are interested in merging trees might want to look at it. Please keep in 
mind that tech _demo_ part, and in particular, keep in mind the final 
"serious caveat" part.

In many ways, the really _interesting_ part of a merge is not the result, 
but how it _changes_ the branch we're merging into. That's particularly 
important as it should hopefully also mean that the output size for any 
reasonable case is minimal (and tracks what we actually need to do to the 
current state to create the final result).

The code very much is organized so that doing the result as a "diff 
against branch1" should be quite easy/possible. I was actually going to do 
it, but I decided that it probably makes the output harder to read. I 
dunno.

Anyway, let's think about this kind of approach.. Note how the code itself 
is actually quite small and short, although it's probably pretty "dense".

As an interesting test-case, I'd suggest this merge in the kernel:

	git-merge-tree $(git-merge-base 4cbf876 7d2babc) 4cbf876 7d2babc

which resolves beautifully (there are no actual file-level conflicts), and 
you can look at the output of that command to start thinking about what 
it does.

The interesting part (perhaps) is that timing that command for me shows 
that it takes all of 0.004 seconds.. (the git-merge-base thing takes 
considerably more ;)

The point is, we _can_ do the actual merge part really really quickly. 

		Linus

PS. Final note: when I say that it is "WORKING CODE", that is obviously by 
my standards. IOW, I tested it once and it gave reasonable results - so it 
must be perfect.

Whether it works for anybody else, or indeed for any other test-case, is 
not my problem ;)

---
diff-tree f0e6b454ff873429237322c846603d2e1fffc867 (from 6a9b87972f27edfe53da4ce016adf4c0cd42f5e6)
Author: Linus Torvalds <torvalds@osdl.org>
Date:   Tue Feb 14 17:39:15 2006 -0800

    Add "git-merge-tree" functionality
    
    This is basically a tree-optimized merge.  Or rather, it is the first
    stages _towards_ such a merge.
    
    Given a base tree and two branches to merge, it will do a trivial merge,
    optimizing away the case of identical subdirectories, and resolving
    trivial merges.  It outputs the list of file/directory resolves.
    
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>

diff --git a/Makefile b/Makefile
index d40aa6a..4d04f49 100644
--- a/Makefile
+++ b/Makefile
@@ -151,7 +151,7 @@ PROGRAMS = \
 	git-upload-pack$X git-verify-pack$X git-write-tree$X \
 	git-update-ref$X git-symbolic-ref$X git-check-ref-format$X \
 	git-name-rev$X git-pack-redundant$X git-repo-config$X git-var$X \
-	git-describe$X
+	git-describe$X git-merge-tree$X
 
 # what 'all' will build and 'install' will install, in gitexecdir
 ALL_PROGRAMS = $(PROGRAMS) $(SIMPLE_PROGRAMS) $(SCRIPTS)
diff --git a/merge-tree.c b/merge-tree.c
new file mode 100644
index 0000000..0d6d434
--- /dev/null
+++ b/merge-tree.c
@@ -0,0 +1,238 @@
+#include "cache.h"
+#include "diff.h"
+
+static const char merge_tree_usage[] = "git-merge-tree <base-tree> <branch1> <branch2>";
+static int resolve_directories = 1;
+
+static void merge_trees(struct tree_desc t[3], const char *base);
+
+static void *fill_tree_descriptor(struct tree_desc *desc, const unsigned char *sha1)
+{
+	unsigned long size = 0;
+	void *buf = NULL;
+
+	if (sha1) {
+		buf = read_object_with_reference(sha1, "tree", &size, NULL);
+		if (!buf)
+			die("unable to read tree %s", sha1_to_hex(sha1));
+	}
+	desc->size = size;
+	desc->buf = buf;
+	return buf;
+}
+
+struct name_entry {
+	const unsigned char *sha1;
+	const char *path;
+	unsigned int mode;
+	int pathlen;
+};
+
+static void entry_clear(struct name_entry *a)
+{
+	memset(a, 0, sizeof(*a));
+}
+
+static int entry_compare(struct name_entry *a, struct name_entry *b)
+{
+	return base_name_compare(
+			a->path, a->pathlen, a->mode,
+			b->path, b->pathlen, b->mode);
+}
+
+static void entry_extract(struct tree_desc *t, struct name_entry *a)
+{
+	a->sha1 = tree_entry_extract(t, &a->path, &a->mode);
+	a->pathlen = strlen(a->path);
+}
+
+/* An empty entry never compares same, not even to another empty entry */
+static int same_entry(struct name_entry *a, struct name_entry *b)
+{
+	return	a->sha1 &&
+		b->sha1 &&
+		!memcmp(a->sha1, b->sha1, 20) &&
+		a->mode == b->mode;
+}
+
+static void resolve(const char *base, struct name_entry *result)
+{
+	printf("0 %06o %s %s%s\n", result->mode, sha1_to_hex(result->sha1), base, result->path);
+}
+
+static int unresolved_directory(const char *base, struct name_entry n[3])
+{
+	int baselen;
+	char *newbase;
+	struct name_entry *p;
+	struct tree_desc t[3];
+	void *buf0, *buf1, *buf2;
+
+	if (!resolve_directories)
+		return 0;
+	p = n;
+	if (!p->mode) {
+		p++;
+		if (!p->mode)
+			p++;
+	}
+	if (!S_ISDIR(p->mode))
+		return 0;
+	baselen = strlen(base);
+	newbase = xmalloc(baselen + p->pathlen + 2);
+	memcpy(newbase, base, baselen);
+	memcpy(newbase + baselen, p->path, p->pathlen);
+	memcpy(newbase + baselen + p->pathlen, "/", 2);
+
+	buf0 = fill_tree_descriptor(t+0, n[0].sha1);
+	buf1 = fill_tree_descriptor(t+1, n[1].sha1);
+	buf2 = fill_tree_descriptor(t+2, n[2].sha1);
+	merge_trees(t, newbase);
+
+	free(buf0);
+	free(buf1);
+	free(buf2);
+	free(newbase);
+	return 1;
+}
+
+static void unresolved(const char *base, struct name_entry n[3])
+{
+	if (unresolved_directory(base, n))
+		return;
+	printf("1 %06o %s %s%s\n", n[0].mode, sha1_to_hex(n[0].sha1), base, n[0].path);
+	printf("2 %06o %s %s%s\n", n[1].mode, sha1_to_hex(n[1].sha1), base, n[1].path);
+	printf("3 %06o %s %s%s\n", n[2].mode, sha1_to_hex(n[2].sha1), base, n[2].path);
+}
+
+/*
+ * Merge two trees together (t[1] and t[2]), using a common base (t[0])
+ * as the origin.
+ *
+ * This walks the (sorted) trees in lock-step, checking every possible
+ * name. Note that directories automatically sort differently from other
+ * files (see "base_name_compare"), so you'll never see file/directory
+ * conflicts, because they won't ever compare the same.
+ *
+ * IOW, if a directory changes to a filename, it will automatically be
+ * seen as the directory going away, and the filename being created.
+ *
+ * Think of this as a three-way diff.
+ *
+ * The output will be either:
+ *  - successful merge
+ *	 "0 mode sha1 filename"
+ *    NOTE NOTE NOTE! FIXME! We really really need to walk the index
+ *    in parallel with this too!
+ * 
+ *  - conflict:
+ *	"1 mode sha1 filename"
+ *	"2 mode sha1 filename"
+ *	"3 mode sha1 filename"
+ *    where not all of the 1/2/3 lines may exist, of course.
+ *
+ * The successful merge rules are the same as for the three-way merge
+ * in git-read-tree.
+ */
+static void merge_trees(struct tree_desc t[3], const char *base)
+{
+	for (;;) {
+		struct name_entry entry[3];
+		unsigned int mask = 0;
+		int i, last;
+
+		last = -1;
+		for (i = 0; i < 3; i++) {
+			if (!t[i].size)
+				continue;
+			entry_extract(t+i, entry+i);
+			if (last >= 0) {
+				int cmp = entry_compare(entry+i, entry+last);
+
+				/*
+				 * Is the new name bigger than the old one?
+				 * Ignore it
+				 */
+				if (cmp > 0)
+					continue;
+				/*
+				 * Is the new name smaller than the old one?
+				 * Ignore all old ones
+				 */
+				if (cmp < 0)
+					mask = 0;
+			}
+			mask |= 1u << i;
+			last = i;
+		}
+		if (!mask)
+			break;
+
+		/*
+		 * Update the tree entries we've walked, and clear
+		 * all the unused name-entries.
+		 */
+		for (i = 0; i < 3; i++) {
+			if (mask & (1u << i)) {
+				update_tree_entry(t+i);
+				continue;
+			}
+			entry_clear(entry + i);
+		}
+
+		/* Same in both? */
+		if (same_entry(entry+1, entry+2)) {
+			if (entry[0].sha1) {
+				resolve(base, entry+1);
+				continue;
+			}
+		}
+
+		if (same_entry(entry+0, entry+1)) {
+			if (entry[2].sha1) {
+				resolve(base, entry+2);
+				continue;
+			}
+		}
+
+		if (same_entry(entry+0, entry+2)) {
+			if (entry[1].sha1) {
+				resolve(base, entry+1);
+				continue;
+			}
+		}
+
+		unresolved(base, entry);
+	}
+}
+
+static void *get_tree_descriptor(struct tree_desc *desc, const char *rev)
+{
+	unsigned char sha1[20];
+	void *buf;
+
+	if (get_sha1(rev, sha1) < 0)
+		die("unknown rev %s", rev);
+	buf = fill_tree_descriptor(desc, sha1);
+	if (!buf)
+		die("%s is not a tree", rev);
+	return buf;
+}
+
+int main(int argc, char **argv)
+{
+	struct tree_desc t[3];
+	void *buf1, *buf2, *buf3;
+
+	if (argc < 4)
+		usage(merge_tree_usage);
+
+	buf1 = get_tree_descriptor(t+0, argv[1]);
+	buf2 = get_tree_descriptor(t+1, argv[2]);
+	buf3 = get_tree_descriptor(t+2, argv[3]);
+	merge_trees(t, "");
+	free(buf1);
+	free(buf2);
+	free(buf3);
+	return 0;
+}


* Re: Handling large files with GIT
  2006-02-15  0:40                         ` Sam Vilain
  2006-02-15  1:39                           ` Junio C Hamano
@ 2006-02-15  2:07                           ` Martin Langhoff
  1 sibling, 0 replies; 39+ messages in thread
From: Martin Langhoff @ 2006-02-15  2:07 UTC (permalink / raw)
  To: Sam Vilain; +Cc: Junio C Hamano, git, Linus Torvalds

On 2/15/06, Sam Vilain <sam@vilain.net> wrote:
> Excellent.  Any speculations on where they might fit?  Clearly, it needs
> to be out of the "tree".

I think Junio & Linus are talking about alternative mergers, something
that can be called instead of git-read-tree -m (which is the way
merges seem to kick off). Or perhaps an additional flag to
git-read-tree to be used in conjunction with -m, something like
--optimize-for-identity that lets git-read-tree know to do a first
pass keying things on file identity rather than file path.

So we are _not_ touching the object database, at all. Only optimising
merges for very large trees where mv is a popular operation. All the
cases you discuss can be tackled very efficiently without making *any*
change to the object database.

> Martin, is that enough for your CVS case?

Oh, I don't need it at all. It's just that there's been some lazy talk
of tracking mboxes and maildirs with git, and look where it's led.
Blame Roland Stigge who got me started down this track.

I'm sure it's because the other optimisations are a lot harder to
tackle ;-) though Linus mentions that it'd be trivial for
git-read-tree -m to detect unchanged directories and perhaps do things
a bit faster. Not as revolutionary as an --optimize-for-identity but
not as risky either.

In any case, don't count on me for any of this git-checkout hacking. I
know better than to start learning C by posting patches to *this* list.

cheers,


martin


* Re: Handling large files with GIT
  2006-02-15  2:05                         ` Linus Torvalds
@ 2006-02-15  2:18                           ` Linus Torvalds
  2006-02-15  2:33                             ` Linus Torvalds
  0 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2006-02-15  2:18 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git



On Tue, 14 Feb 2006, Linus Torvalds wrote:
> 
> Finally (a more serious caveat):
>  - doing things through stdout may end up being so expensive that we'd 
>    need to do something else. In particular, it's likely that I should 
>    not actually output the "merge results", but instead output a "merge 
>    results as they _differ_ from branch1"
> 
> In many ways, the really _interesting_ part of a merge is not the result, 
> but how it _changes_ the branch we're merging into. That's particularly 
> important as it should hopefully also mean that the output size for any 
> reasonable case is minimal (and tracks what we actually need to do to the 
> current state to create the final result).

Here, btw, is the trivial diff to turn my previous "tree-resolve" into a 
"resolve tree relative to the current branch".

In particular, it makes the example merge perhaps even more interesting, 
and makes the "merging directories and merging files should use different 
heuristics" point more obvious. It's quite instructive, I think.

So if you want to test this, the merge I have been testing with is the 
last infiniband merge in the kernel:

	git-merge-tree 3c3b809 4cbf876 7d2babc

and you'll need to spend a few moments on thinking about what the 
"directory merge" thing there means: in particular, we should probably 
make the

	if (entry[2].sha1) {

test be

	if (entry[2].sha1 && !S_ISDIR(entry[2].mode)) {

(and same for "resolve to entry[1]" case for that matter) so that we never 
create a "resolve()" that picks a whole subdirectory from one of the 
branches.

The current logic is "logical", just probably not what we want.

		Linus

----
diff --git a/merge-tree.c b/merge-tree.c
index 0d6d434..0bf871c 100644
--- a/merge-tree.c
+++ b/merge-tree.c
@@ -55,9 +55,19 @@ static int same_entry(struct name_entry 
 		a->mode == b->mode;
 }
 
-static void resolve(const char *base, struct name_entry *result)
+static void resolve(const char *base, struct name_entry *branch1, struct name_entry *result)
 {
-	printf("0 %06o %s %s%s\n", result->mode, sha1_to_hex(result->sha1), base, result->path);
+	char branch1_sha1[50];
+
+	/* If it's already branch1, don't bother showing it */
+	if (!branch1)
+		return;
+	memcpy(branch1_sha1, sha1_to_hex(branch1->sha1), 41);
+
+	printf("0 %06o->%06o %s->%s %s%s\n",
+		branch1->mode, result->mode,
+		branch1_sha1, sha1_to_hex(result->sha1),
+		base, result->path);
 }
 
 static int unresolved_directory(const char *base, struct name_entry n[3])
@@ -183,21 +193,21 @@ static void merge_trees(struct tree_desc
 		/* Same in both? */
 		if (same_entry(entry+1, entry+2)) {
 			if (entry[0].sha1) {
-				resolve(base, entry+1);
+				resolve(base, NULL, entry+1);
 				continue;
 			}
 		}
 
 		if (same_entry(entry+0, entry+1)) {
 			if (entry[2].sha1) {
-				resolve(base, entry+2);
+				resolve(base, entry+1, entry+2);
 				continue;
 			}
 		}
 
 		if (same_entry(entry+0, entry+2)) {
 			if (entry[1].sha1) {
-				resolve(base, entry+1);
+				resolve(base, NULL, entry+1);
 				continue;
 			}
 		}


* Re: Handling large files with GIT
  2006-02-15  2:18                           ` Linus Torvalds
@ 2006-02-15  2:33                             ` Linus Torvalds
  2006-02-15  3:58                               ` Linus Torvalds
  0 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2006-02-15  2:33 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git



On Tue, 14 Feb 2006, Linus Torvalds wrote:
> 
> Here, btw, is the trivial diff to turn my previous "tree-resolve" into a 
> "resolve tree relative to the current branch".

Gaah. It was trivial, and it happened to work fine for my test-case, but 
when I started looking at not doing that extremely aggressive subdirectory 
merging, that showed a few other issues...

So in case people want to try, here's a third patch. Oh, and it's against 
my _original_ path, not incremental to the middle one (ie both patches two 
and three are against patch #1, it's not a nice series).

Now I'm really done, and won't be sending out any more patches today. 
Sorry for the noise.

		Linus
----
diff --git a/merge-tree.c b/merge-tree.c
index 0d6d434..6381118 100644
--- a/merge-tree.c
+++ b/merge-tree.c
@@ -55,9 +55,26 @@ static int same_entry(struct name_entry 
 		a->mode == b->mode;
 }
 
-static void resolve(const char *base, struct name_entry *result)
+static const char *sha1_to_hex_zero(const unsigned char *sha1)
 {
-	printf("0 %06o %s %s%s\n", result->mode, sha1_to_hex(result->sha1), base, result->path);
+	if (sha1)
+		return sha1_to_hex(sha1);
+	return "0000000000000000000000000000000000000000";
+}
+
+static void resolve(const char *base, struct name_entry *branch1, struct name_entry *result)
+{
+	char branch1_sha1[50];
+
+	/* If it's already branch1, don't bother showing it */
+	if (!branch1)
+		return;
+	memcpy(branch1_sha1, sha1_to_hex_zero(branch1->sha1), 41);
+
+	printf("0 %06o->%06o %s->%s %s%s\n",
+		branch1->mode, result->mode,
+		branch1_sha1, sha1_to_hex_zero(result->sha1),
+		base, result->path);
 }
 
 static int unresolved_directory(const char *base, struct name_entry n[3])
@@ -100,9 +117,12 @@ static void unresolved(const char *base,
 {
 	if (unresolved_directory(base, n))
 		return;
-	printf("1 %06o %s %s%s\n", n[0].mode, sha1_to_hex(n[0].sha1), base, n[0].path);
-	printf("2 %06o %s %s%s\n", n[1].mode, sha1_to_hex(n[1].sha1), base, n[1].path);
-	printf("3 %06o %s %s%s\n", n[2].mode, sha1_to_hex(n[2].sha1), base, n[2].path);
+	if (n[0].sha1)
+		printf("1 %06o %s %s%s\n", n[0].mode, sha1_to_hex(n[0].sha1), base, n[0].path);
+	if (n[1].sha1)
+		printf("2 %06o %s %s%s\n", n[1].mode, sha1_to_hex(n[1].sha1), base, n[1].path);
+	if (n[2].sha1)
+		printf("3 %06o %s %s%s\n", n[2].mode, sha1_to_hex(n[2].sha1), base, n[2].path);
 }
 
 /*
@@ -183,21 +203,21 @@ static void merge_trees(struct tree_desc
 		/* Same in both? */
 		if (same_entry(entry+1, entry+2)) {
 			if (entry[0].sha1) {
-				resolve(base, entry+1);
+				resolve(base, NULL, entry+1);
 				continue;
 			}
 		}
 
 		if (same_entry(entry+0, entry+1)) {
-			if (entry[2].sha1) {
-				resolve(base, entry+2);
+			if (entry[2].sha1 && !S_ISDIR(entry[2].mode)) {
+				resolve(base, entry+1, entry+2);
 				continue;
 			}
 		}
 
 		if (same_entry(entry+0, entry+2)) {
-			if (entry[1].sha1) {
-				resolve(base, entry+1);
+			if (entry[1].sha1 && !S_ISDIR(entry[1].mode)) {
+				resolve(base, NULL, entry+1);
 				continue;
 			}
 		}


* Re: Handling large files with GIT
  2006-02-15  2:33                             ` Linus Torvalds
@ 2006-02-15  3:58                               ` Linus Torvalds
  2006-02-15  9:54                                 ` Junio C Hamano
  2006-02-16 20:32                                 ` Fredrik Kuivinen
  0 siblings, 2 replies; 39+ messages in thread
From: Linus Torvalds @ 2006-02-15  3:58 UTC (permalink / raw)
  To: Junio C Hamano, Fredrik Kuivinen; +Cc: Git Mailing List



On Tue, 14 Feb 2006, Linus Torvalds wrote:
> 
> So in case people want to try, here's a third patch. Oh, and it's against 
> my _original_ patch, not incremental to the middle one (ie both patches two 
> and three are against patch #1, it's not a nice series).
> 
> Now I'm really done, and won't be sending out any more patches today. 

Still true. I've just been thinking about the last state.

As far as I can tell, the output from git-merge-tree with that fix to only 
simplify subdirectories that match exactly in all of base/branch1/branch2 
is precisely the output that git-merge-recursive actually wants.

Rather than doing a three-way merge with "git-read-tree", and then doing 
"git-ls-files --unmerged", I think this gives the same result much more 
efficiently.

That said, I can't follow the python code, so maybe I'm missing something. 
Fredrik cc'd, in case he can put me right.

		Linus


* Re: Handling large files with GIT
  2006-02-15  1:39                           ` Junio C Hamano
@ 2006-02-15  4:03                             ` Sam Vilain
  0 siblings, 0 replies; 39+ messages in thread
From: Sam Vilain @ 2006-02-15  4:03 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Junio C Hamano wrote:
> So I think the order of questions you should be asking is:
> 
>   1 - what operations are you trying to help?

Primarily, tracing history when dealing with history/changeset based
revision systems like SVN or darcs, and doing this in a manner that
lets us guarantee we behave in the same way those systems would.

>   2 - what information you would need to achieve those operations
>       better?

Minimally, this tuple:

   ( merge|copy, source_path, source_tree|source_commit,
     destination_path, destination_commit )

It makes sense to record this with commits, as conceptually it is a part
of the intended commit history along with the change comment.

>   3 - among the second one, what will be necessary to be set in
>       stone (IOW, cannot be computed later), and what are
>       computable but expensive to recompute every time?

The only case you cannot detect automatically and with certainty is a
rename combined with a content change, without inserting a dummy commit
between the name change and the content change.  But in a sense this is the same as
my suggestion - using the commit object history to record information
that normally doesn't matter when you are doing content-keyed
operations.

> I am getting an impression that you are doing only the first
> half of (2) without other parts, which somewhat bothers me.

Well, thank you for spending so much time to reply to me given that was
your assessment.  I think the best direction from here would be to start
molding some porcelain, then I can cross this bridge when I come to it
rather than simply speculating and hand-waving.

Besides, I can always prototype it for discussion using the commit
description as a surrogate container for the information.

Sam.

ps I also responded to the rest of your e-mail, but decided that the 
answers to the above questions were more important.

 >>  2. forensic - extra stuff at the end of the commit object?
 > (except "extra at the end of commit", which does not make it out
 > of the tree).

It is a part of the repository, but more a property of the commit itself
- like the commit description.  Like somebody writing "I renamed this
file to that file and changed its contents", but in a parsable form
that can _optionally_ be used to prevent the relevant git-core tools
from having to do content comparison, or perhaps something subtler like
increasing the score of the recorded history branch when scoring
alternatives looking for history.

 >>     eg
 >>        Copied: /new/path from /old/path:commit:c0bb171d..
 >>          (for SVN case where history matters)
 >>        Copied: /new/path from blob:b10b1d..
 >>          (for general pre-caching case)
 >>        Merged: /new/path from /old/path:commit:C0bb171d..
 >>          (for an SVK clone, so we know that subsequent merges on
 >>           /new/path need only merge from /old/path starting at commit
 >>           C0bb171d..)
 > I am not sure if recording the bare SVN ``copied'' is very
 > useful.  You would need to infer things from what SVN did to
 > tell if the copy is a tree copy inside a project (e.g. cp -r
 > i386 x86_64), tagging (e.g. svn-cp rHEAD trunk tags/v1.2), or
 > branching, wouldn't you?  SVK merge ticket is a bit more useful
 > in that sense.

In the SVN model there really is no difference between these cases.  Of
course the actual representation of these in the object does not matter;
the above is the what, not the how.  But in general, SVN only records
copying; it has no repository concept of merge, branch, tag, rename.
SVK adds merging to the picture.

Representing an SVN tree copy as a new sub-tree in a git repository
should still be a "cheap copy"; it's just that all the tools will not
(and probably should not) see it as a branch, but as a copy.

 > So far, git philosophy is to record things you _know_ about and
 > defer such guesswork to the future, so limiting what you record
 > to what you can actually see from the foreign SCM would be more
 > in line with it.

Yes, and if I am mirroring an SVN repository, then I only know that in
that repository, the history /was recorded/ as such.  Not the history
/is/ as such; that's a different question, and that is the guesswork
worth deferring to the future.

 > For the same reason, if you are talking about
 > maildir managed under git, you should not have to record anything
 > other than what git already records: "we used to have these
 > files, now we have these instead".

Ok.  As Martin pointed out, the Maildir situation is actually a simple
case.  In a sense, I hijacked a vaguely related thread to resolve my
Warnock dilemma :)

 > But I thought you were talking about caching what earlier
 > inference declared what happened, so that you do not have to do
 > the same inference every time.  If that is the case, SVN level
 > "Copied:" is probably not what you would want to record, I
 > suspect.  You would do some inference with the given information
 > ("SVN says it copied this tree to that tree, what was it that it
 > really wanted to do?  Was it a copy, or was it to create a
 > branch which was implemented as a copy?"), and record that,
 > hoping that information would help your other operations this
 > time and later.

Well, this is already guesswork deferred to the future that the
Subversion authors inflict on the users of Subversion repositories.  If
you read the Subversion manual you will find recommendations to
studiously record this information and to use a standard repository
layout so that other people will understand what your copies were
intended to be.


* Re: Handling large files with GIT
  2006-02-15  3:58                               ` Linus Torvalds
@ 2006-02-15  9:54                                 ` Junio C Hamano
  2006-02-15 15:44                                   ` Linus Torvalds
  2006-02-16  3:25                                   ` Linus Torvalds
  2006-02-16 20:32                                 ` Fredrik Kuivinen
  1 sibling, 2 replies; 39+ messages in thread
From: Junio C Hamano @ 2006-02-15  9:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Fredrik Kuivinen, Git Mailing List

Linus Torvalds <torvalds@osdl.org> writes:

> As far as I can tell, the output from git-merge-tree with that fix to only 
> simplify subdirectories that match exactly in all of base/branch1/branch2 
> is precisely the output that git-merge-recursive actually wants.

That matches the recollection I had from the last time I mucked with the
code.  Currently it is set up to do one path at a time in both
index and working tree, so it would not be a trivial rewrite,
but merge-tree based approach would speed things up quite a
bit.

I was thinking about implementing mergers as a pipeline:

	git-merge-tree O A B |
        git-merge-renaming A |
        git-merge-aggressive A |
        git-merge-filemerge

git-merge-tree (yours) does not do trivial collapsing, and
produce raw-diff from A.  git-merge-renaming reads it, finds
copied/renamed entries (maybe reusing parts of diffcore), and
writes out the results in the same format as merge-tree output
(that's why I am giving A on the command line -- so it can also
read A if it wanted to. it may need to talk about what a path in
A was even when merge-tree did not say anything about that
path).  Then git-merge-aggressive (bad naming, I know, it only
corresponds to the flag of the same name in read-tree) will do the
stage collapsing equivalent to git-merge-one-file.  The
remainder is fed to file-level merger for postprocessing.
Everything except the last step would work on a data format that
merge-tree outputs.


* Re: Handling large files with GIT
  2006-02-15  9:54                                 ` Junio C Hamano
@ 2006-02-15 15:44                                   ` Linus Torvalds
  2006-02-15 17:16                                     ` Linus Torvalds
  2006-02-16  3:25                                   ` Linus Torvalds
  1 sibling, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2006-02-15 15:44 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Fredrik Kuivinen, Git Mailing List



On Wed, 15 Feb 2006, Junio C Hamano wrote:
> 
> I was thinking about implementing mergers as a pipeline:
> 
> 	git-merge-tree O A B |
>         git-merge-renaming A |
>         git-merge-aggressive A |
>         git-merge-filemerge

Great minds think alike.

> git-merge-tree (yours) does not do trivial collapsing, and
> produce raw-diff from A.

(It does _truly_ trivial collapsing, but I think we both agree: it doesn't 
do anything that we used to run git-merge-one-file on)

> git-merge-renaming reads it, finds
> copied/renamed entries (maybe reusing parts of diffcore), and
> writes out the results in the same format as merge-tree output

I was considering perhaps doing a first cut at that in git-merge-tree 
already. Not sure.

One issue is that I think I may have to change the output format if I do 
that. I should anyway. 

Why?

It's hard to see where "one event" stops, and another starts. I stupidly 
initially thought that you can do it entirely based on looking at the 
numbers, but you can't. Right now you have to look at the pathname too, 
which is kind of sad, and doesn't work after rename detection (since then 
the pathnames won't be sorted any more, and one "event" can have different 
pathnames in different stages).

[ Side note: it doesn't even work for file/directory conflicts, which can 
  have the same name, but are two different "events". So you'd actually 
  have to look at both mode _and_ filename to sort out if two lines that 
  start with "1" and "3" respectively are one event (removal in first 
  branch) or two events ("1" on one file: removal in both branches + "3" 
  on another file: add in second branch) ]

So to do the rename output, you can't use the same format as merge-tree 
uses right _now_. We'd have to add a marker to mark what the event 
boundaries are.

The "mark" could be a running "event number", or even as easy as an 
alternating character ("#" vs " " as the second character in the line or 
similar)

So instead of

	2 100644 ff280e2e1613e808e4d7844376134dfa2bb1fc21 Documentation/cputopology.txt
	2 100644 28c5b7d1eb90f0ccd8e0307c170f89bd7954dc9c Documentation/hwmon/f71805f
	1 100644 b88953dfd58022aef1680c266c7438605b146fc8 Documentation/i2c/busses/i2c-sis69x
	3 100644 b88953dfd58022aef1680c266c7438605b146fc8 Documentation/i2c/busses/i2c-sis69x
	2 100644 00a009b977e92b1a942d1138afdccf1b725df956 Documentation/i2c/busses/i2c-sis96x
	2 100644 90a5e9e5bef1daa9d0f0621e209827f0d180f384 Documentation/unshare.txt
	2 100644 5127f39fa9bf9a384a6529c6d5deb1002e945de5 arch/arm/mach-s3c2410/s3c2400-gpio.c
	2 100644 8b2394e1ed4088c3b8d38e87e58bde2f38152bf7 arch/arm/mach-s3c2410/s3c2400.h
	 ...

it might be

	2#100644 ff280e2e1613e808e4d7844376134dfa2bb1fc21 Documentation/cputopology.txt
	2 100644 28c5b7d1eb90f0ccd8e0307c170f89bd7954dc9c Documentation/hwmon/f71805f
	1#100644 b88953dfd58022aef1680c266c7438605b146fc8 Documentation/i2c/busses/i2c-sis69x
	3#100644 b88953dfd58022aef1680c266c7438605b146fc8 Documentation/i2c/busses/i2c-sis69x
	2 100644 00a009b977e92b1a942d1138afdccf1b725df956 Documentation/i2c/busses/i2c-sis96x
	2#100644 90a5e9e5bef1daa9d0f0621e209827f0d180f384 Documentation/unshare.txt
	2 100644 5127f39fa9bf9a384a6529c6d5deb1002e945de5 arch/arm/mach-s3c2410/s3c2400-gpio.c
	2#100644 8b2394e1ed4088c3b8d38e87e58bde2f38152bf7 arch/arm/mach-s3c2410/s3c2400.h
	 ...

where you can clearly see the "grouping" without having to even look at 
the filename.

(The example I show actually has a rename-with-modifications that was made 
on the first branch: notice that i2c-sis69x vs i2c-sis96x thing?)

I don't know exactly what the "after rename detection" output format would 
be, but it _might_ turn that

	...
	1#b889... i2c-sis69x
	3#b889... i2c-sis69x
	2 00a0... i2c-sis96x
	...

into one event:

	...
	1#b889... i2c-sis69x
	2#00a0... i2c-sis96x
	3#b889... i2c-sis69x
	...

and then the actual file-merge logic would have to merge the names as well 
as the file contents (and in this case, the final name would thus be 
"i2c-sis96x", since one branch hadn't changed it).

Hmm?

		Linus


* Re: Handling large files with GIT
  2006-02-15 15:44                                   ` Linus Torvalds
@ 2006-02-15 17:16                                     ` Linus Torvalds
  0 siblings, 0 replies; 39+ messages in thread
From: Linus Torvalds @ 2006-02-15 17:16 UTC (permalink / raw)
  To: Junio C Hamano, Ben Clifford; +Cc: Git Mailing List



Btw, some actual numbers: I did the recent kernel networking merge (which 
is a trivial in-index merge) with the standard three-way

	git-read-tree -m <base> <branch> <branch>

and with the new git-merge-tree to compare performance.

Doing git-read-tree takes ~0.35s, while git-merge-tree took 0.015s.

Now, that's not a really fair comparison, because the end result is very 
different: the git-read-tree has populated the index, ready for a 
git-write-tree, while the git-merge-tree has not. 

However, the interesting part is that especially for a trivial merge, we 
don't actually _want_ to necessarily populate the index, because doing a 
"git-write-tree" is actually a pretty expensive operation (on the kernel, 
it will try to write 1000+ directory trees, most of which already exist. 
Admittedly we don't actually have to write the objects, since we figure 
out that they already exist, but we have to do the SHA1 calculations to 
do so).

So if we made the git-merge-tree based merge work entirely on trees all 
the way, and never even necessarily populate the index at all (unless it 
has to, due to actual data conflicts that want to be fixed up), that would 
actually be another performance advantage. The only downside there is that 
we would literally have to write the resulting tree objects by hand (ie 
we'd need a new helper for doing that, and another thing to validate).

Anyway, that should almost certainly make it possible to scale up git 
merges to hundreds of thousands of files without huge performance problems 
(still, that depends a bit on layout - again, flat directory structures 
won't scale as well, so it might not be enough for maildir handling).

But just at a guess, I think there's at least an order of magnitude to be 
had there. So if a maildir merge currently takes an hour, at least we 
should be able to get it down to a few minutes.

Ben, are you interested in trying this out in your maildir experiments?

		Linus


* Re: Handling large files with GIT
  2006-02-15  9:54                                 ` Junio C Hamano
  2006-02-15 15:44                                   ` Linus Torvalds
@ 2006-02-16  3:25                                   ` Linus Torvalds
  2006-02-16  3:29                                     ` Junio C Hamano
  1 sibling, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2006-02-16  3:25 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Fredrik Kuivinen, Git Mailing List



Btw, here's one last gasp on this thread: it generalizes the notion of 
traversing several trees in sync, which could be used to do the n-way diff 
for the "-c" and "--cc" style merge diffs a lot more efficiently.

I didn't check, but I'm pretty sure that this would bring the cost of 
doing the 12-way diff down to way under a second. Right now:

	[torvalds@g5 linux]$ time git-diff-tree -c 9fdb62a > /dev/null 

	real    0m1.279s
	user    0m1.272s
	sys     0m0.008s

and that's a bit too much. I'd really have expected us to be able to do 
better.

It should be possible to do this as a 

	traverse_trees(12, &trees, "", combined_diff_callback);

fairly cheaply (and quickly throw away anything where any of the parents 
was the same as the result).

Junio, that "traverse_trees()" logic is totally independent of whether we 
actually do "git-merge-tree" or not, so if you want to, I could split up 
the patches the other way (and merge "traverse_trees()" first as a new 
interface, independently).

		Linus

----
git-merge-tree: generalize the "traverse <n> trees in sync" functionality

It's actually very useful for other things too. Notably, we could do the
combined diff a lot more efficiently with this.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>

diff --git a/merge-tree.c b/merge-tree.c
index 6381118..2a9a013 100644
--- a/merge-tree.c
+++ b/merge-tree.c
@@ -125,44 +125,19 @@ static void unresolved(const char *base,
 		printf("3 %06o %s %s%s\n", n[2].mode, sha1_to_hex(n[2].sha1), base, n[2].path);
 }
 
-/*
- * Merge two trees together (t[1] and t[2]), using a common base (t[0])
- * as the origin.
- *
- * This walks the (sorted) trees in lock-step, checking every possible
- * name. Note that directories automatically sort differently from other
- * files (see "base_name_compare"), so you'll never see file/directory
- * conflicts, because they won't ever compare the same.
- *
- * IOW, if a directory changes to a filename, it will automatically be
- * seen as the directory going away, and the filename being created.
- *
- * Think of this as a three-way diff.
- *
- * The output will be either:
- *  - successful merge
- *	 "0 mode sha1 filename"
- *    NOTE NOTE NOTE! FIXME! We really really need to walk the index
- *    in parallel with this too!
- * 
- *  - conflict:
- *	"1 mode sha1 filename"
- *	"2 mode sha1 filename"
- *	"3 mode sha1 filename"
- *    where not all of the 1/2/3 lines may exist, of course.
- *
- * The successful merge rules are the same as for the three-way merge
- * in git-read-tree.
- */
-static void merge_trees(struct tree_desc t[3], const char *base)
+typedef void (*traverse_callback_t)(int n, unsigned long mask, struct name_entry *entry, const char *base);
+
+static void traverse_trees(int n, struct tree_desc *t, const char *base, traverse_callback_t callback)
 {
+	struct name_entry *entry = xmalloc(n*sizeof(*entry));
+
 	for (;;) {
 		struct name_entry entry[3];
-		unsigned int mask = 0;
+		unsigned long mask = 0;
 		int i, last;
 
 		last = -1;
-		for (i = 0; i < 3; i++) {
+		for (i = 0; i < n; i++) {
 			if (!t[i].size)
 				continue;
 			entry_extract(t+i, entry+i);
@@ -182,7 +157,7 @@ static void merge_trees(struct tree_desc
 				if (cmp < 0)
 					mask = 0;
 			}
-			mask |= 1u << i;
+			mask |= 1ul << i;
 			last = i;
 		}
 		if (!mask)
@@ -192,38 +167,77 @@ static void merge_trees(struct tree_desc
 		 * Update the tree entries we've walked, and clear
 		 * all the unused name-entries.
 		 */
-		for (i = 0; i < 3; i++) {
-			if (mask & (1u << i)) {
+		for (i = 0; i < n; i++) {
+			if (mask & (1ul << i)) {
 				update_tree_entry(t+i);
 				continue;
 			}
 			entry_clear(entry + i);
 		}
+		callback(n, mask, entry, base);
+	}
+	free(entry);
+}
 
-		/* Same in both? */
-		if (same_entry(entry+1, entry+2)) {
-			if (entry[0].sha1) {
-				resolve(base, NULL, entry+1);
-				continue;
-			}
+/*
+ * Merge two trees together (t[1] and t[2]), using a common base (t[0])
+ * as the origin.
+ *
+ * This walks the (sorted) trees in lock-step, checking every possible
+ * name. Note that directories automatically sort differently from other
+ * files (see "base_name_compare"), so you'll never see file/directory
+ * conflicts, because they won't ever compare the same.
+ *
+ * IOW, if a directory changes to a filename, it will automatically be
+ * seen as the directory going away, and the filename being created.
+ *
+ * Think of this as a three-way diff.
+ *
+ * The output will be either:
+ *  - successful merge
+ *	 "0 mode sha1 filename"
+ *    NOTE NOTE NOTE! FIXME! We really really need to walk the index
+ *    in parallel with this too!
+ * 
+ *  - conflict:
+ *	"1 mode sha1 filename"
+ *	"2 mode sha1 filename"
+ *	"3 mode sha1 filename"
+ *    where not all of the 1/2/3 lines may exist, of course.
+ *
+ * The successful merge rules are the same as for the three-way merge
+ * in git-read-tree.
+ */
+static void threeway_callback(int n, unsigned long mask, struct name_entry *entry, const char *base)
+{
+	/* Same in both? */
+	if (same_entry(entry+1, entry+2)) {
+		if (entry[0].sha1) {
+			resolve(base, NULL, entry+1);
+			return;
 		}
+	}
 
-		if (same_entry(entry+0, entry+1)) {
-			if (entry[2].sha1 && !S_ISDIR(entry[2].mode)) {
-				resolve(base, entry+1, entry+2);
-				continue;
-			}
+	if (same_entry(entry+0, entry+1)) {
+		if (entry[2].sha1 && !S_ISDIR(entry[2].mode)) {
+			resolve(base, entry+1, entry+2);
+			return;
 		}
+	}
 
-		if (same_entry(entry+0, entry+2)) {
-			if (entry[1].sha1 && !S_ISDIR(entry[1].mode)) {
-				resolve(base, NULL, entry+1);
-				continue;
-			}
+	if (same_entry(entry+0, entry+2)) {
+		if (entry[1].sha1 && !S_ISDIR(entry[1].mode)) {
+			resolve(base, NULL, entry+1);
+			return;
 		}
-
-		unresolved(base, entry);
 	}
+
+	unresolved(base, entry);
+}
+
+static void merge_trees(struct tree_desc t[3], const char *base)
+{
+	traverse_trees(3, t, base, threeway_callback);
 }
 
 static void *get_tree_descriptor(struct tree_desc *desc, const char *rev)


* Re: Handling large files with GIT
  2006-02-16  3:25                                   ` Linus Torvalds
@ 2006-02-16  3:29                                     ` Junio C Hamano
  0 siblings, 0 replies; 39+ messages in thread
From: Junio C Hamano @ 2006-02-16  3:29 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> Junio, that "traverse_trees()" logic is totally independent of whether we 
> actually do "git-merge-tree" or not, so if you want to, I could split up 
> the patches the other way (and merge "traverse_trees()" first as a new 
> interface, independently).

I won't have time to look at the actual patch tonight but I am
interested.  I think the general idea should work nicely with both
multi-base and octopus merge cases as well ;-).


* Re: Handling large files with GIT
  2006-02-15  3:58                               ` Linus Torvalds
  2006-02-15  9:54                                 ` Junio C Hamano
@ 2006-02-16 20:32                                 ` Fredrik Kuivinen
  1 sibling, 0 replies; 39+ messages in thread
From: Fredrik Kuivinen @ 2006-02-16 20:32 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Fredrik Kuivinen, Git Mailing List

On Tue, Feb 14, 2006 at 07:58:03PM -0800, Linus Torvalds wrote:
> 
> 
> On Tue, 14 Feb 2006, Linus Torvalds wrote:
> > 
> > So in case people want to try, here's a third patch. Oh, and it's against 
> > my _original_ patch, not incremental to the middle one (ie both patches two 
> > and three are against patch #1, it's not a nice series).
> > 
> > Now I'm really done, and won't be sending out any more patches today. 
> 
> Still true. I've just been thinking about the last state.
> 
> As far as I can tell, the output from git-merge-tree with that fix to only 
> simplify subdirectories that match exactly in all of base/branch1/branch2 
> is precisely the output that git-merge-recursive actually wants.
> 
> Rather than doing a three-way merge with "git-read-tree", and then doing 
> "git-ls-files --unmerged", I think this gives the same result much more 
> efficiently.
> 
> That said, I can't follow the python code, so maybe I'm missing something. 
> Fredrik cc'd, in case he can put me right.
> 

I don't think you miss anything. I _think_ (I haven't looked at this
too closely yet) that it shouldn't be too much work to make
git-merge-recursive make use of the git-merge-tree thing.

- Fredrik


end of thread

Thread overview: 39+ messages
2006-02-08  9:14 Handling large files with GIT Martin Langhoff
2006-02-08 11:54 ` Johannes Schindelin
2006-02-08 16:34   ` Linus Torvalds
2006-02-08 17:01     ` Linus Torvalds
2006-02-08 20:11       ` Junio C Hamano
2006-02-08 21:20 ` Florian Weimer
2006-02-08 22:35   ` Martin Langhoff
2006-02-13  1:26     ` Ben Clifford
2006-02-13  3:42       ` Linus Torvalds
2006-02-13  4:57         ` Linus Torvalds
2006-02-13  5:05           ` Linus Torvalds
2006-02-13 23:17             ` Ian Molton
2006-02-13 23:19               ` Martin Langhoff
2006-02-14 18:56               ` Johannes Schindelin
2006-02-14 19:52                 ` Linus Torvalds
2006-02-14 21:21                   ` Sam Vilain
2006-02-14 22:01                     ` Linus Torvalds
2006-02-14 22:30                       ` Junio C Hamano
2006-02-15  0:40                         ` Sam Vilain
2006-02-15  1:39                           ` Junio C Hamano
2006-02-15  4:03                             ` Sam Vilain
2006-02-15  2:07                           ` Martin Langhoff
2006-02-15  2:05                         ` Linus Torvalds
2006-02-15  2:18                           ` Linus Torvalds
2006-02-15  2:33                             ` Linus Torvalds
2006-02-15  3:58                               ` Linus Torvalds
2006-02-15  9:54                                 ` Junio C Hamano
2006-02-15 15:44                                   ` Linus Torvalds
2006-02-15 17:16                                     ` Linus Torvalds
2006-02-16  3:25                                   ` Linus Torvalds
2006-02-16  3:29                                     ` Junio C Hamano
2006-02-16 20:32                                 ` Fredrik Kuivinen
2006-02-13  5:55           ` Jeff Garzik
2006-02-13  6:07             ` Keith Packard
2006-02-14  0:07               ` Martin Langhoff
2006-02-13 16:19             ` Linus Torvalds
2006-02-13  4:40       ` Martin Langhoff
2006-02-09  4:54   ` Greg KH
2006-02-09  5:38     ` Martin Langhoff
