* Re: Merge with git-pasky II. @ 2005-04-26 18:55 Bram Cohen 2005-04-26 19:58 ` Linus Torvalds 0 siblings, 1 reply; 81+ messages in thread From: Bram Cohen @ 2005-04-26 18:55 UTC (permalink / raw) To: git (my apologies for responding to old messages, I only just subscribed to this list) Linus Torvalds wrote: > On Thu, 14 Apr 2005, Junio C Hamano wrote: > > > > You say "merge these two trees" above (I take it that you mean > > "merge these two trees, taking account of this tree as their > > common ancestor", so actually you are dealing with three trees), > > Yes. We're definitely talking three trees. The LCA for different files might be at different points in the history. Forcing them to all come from the same point produces very bad merges. > The fact is, we know how to make tree merges unambiguous, by just > totally ignoring the history between them. Ie we know how to merge > data. I am pretty damn sure that _nobody_ knows how to merge "data over > time". You're incorrect. Codeville does exactly that (history-aware merges which do the right thing even in cases where 3-way merge can't) > > This however opens up another set of can of worms---it would > > involve not just three trees but all the trees in the commit > > chain in between. > > Exactly. I seriously believe that the model is _broken_, simply because > it gets too complicated. At some point it boils down to "keep it simple, > stupid". The Codeville merge algorithm is also quite simple, and is already implemented and mature. > I've not even been convinved that renames are worth it. Nobody has > really given a good reason why. If one person renames a file and another person modifies it then the changes should be applied to the moved file. Also, there's the directory rename case where one person moves a directory and another person adds a file to it, in which case the file should be moved to the new directory location on merge. I gather than BK doesn't support this functionality, but Codeville and Monotone both do. > I think you might as well interpret the whole object thing. Git > _does_ tell you how the objects changed, and I actually believe that a > diff that works in between objects (ie can show "these lines moved from > this file X to tjhat file Y") is a _hell_ of a lot more powerful than > "rename" is. > > So I'd seriously suggest that instead of worryign about renames, > people think about global diffs that aren't per-file. Git is good at > limiting the changes to a set of objects, and it should be entirely > possible to think of diffs as ways of moving lines _between_ objects and > not just within objects. It's quite common to move a function from one > file to another - certainly more so than renaming the whole file. > > In other words, I really believe renames are just a meaningless special > case of a much more interesting problem. Which is just one reason why > I'm not at all interested in bothering with them other than as a "data > moved" thing, which git already handles very well indeed. Nothing, not eveny our beloved BitKeeper, has 'move lines between files' functionality, and for good reason. To begin with, it's behaviorally extremely dubious. It would be not uncommon for the system to erroneously think that some files deleted from one file were added to another, and then further changes down the line would cause random unrelated files to get modified in unpredictable ways when merges happened. Also, it presents a completely unsolved UI problem. If one person moves lines 5-15 of file A to file B, and another person concurrently rewrites lines 10-20 of file A, how on earth is that supposed to be presented to the user? Codeville can support line moves *within* files just fine, but doesn't do it because of the UI problem of presenting all the corner cases. Maybe someday somebody will do a PhD thesis on that topic and we'll add it, but until then we're sticking with the basic functionality. Honestly, that you would think of doing whole-tree three-way merges and even consider moving lines between files shows that you haven't explored the merge problem very deeply. This is a much harder problem than you think it is, and one which has already been solved by other systems. -Bram ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-26 18:55 Merge with git-pasky II Bram Cohen @ 2005-04-26 19:58 ` Linus Torvalds 2005-04-26 20:30 ` Tom Lord ` (2 more replies) 0 siblings, 3 replies; 81+ messages in thread From: Linus Torvalds @ 2005-04-26 19:58 UTC (permalink / raw) To: Bram Cohen; +Cc: git On Tue, 26 Apr 2005, Bram Cohen wrote: > > If one person renames a file and another person modifies it then the > changes should be applied to the moved file. Bzzt. Wrong answer. The _right_ answer is "if one person moves a function, and another person modifies the function, the changes should be applied to the moved function". Which is clearly a _much_ more common case than file renames. In other words, if your algorithm doesn't handle the latter, then there is no point in handling the former either. And _if_ your algorithm handles the latter, then there's no point in handling file renames specially, since the algorithm will have done that too, as a small part of it. Linus ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-26 19:58 ` Linus Torvalds @ 2005-04-26 20:30 ` Tom Lord 2005-04-26 20:31 ` Bram Cohen 2005-04-26 20:31 ` Daniel Barkalow 2 siblings, 0 replies; 81+ messages in thread From: Tom Lord @ 2005-04-26 20:30 UTC (permalink / raw) To: torvalds; +Cc: bram, git From: Linus Torvalds <torvalds@osdl.org> On Tue, 26 Apr 2005, Bram Cohen wrote: > > If one person renames a file and another person modifies it then the > changes should be applied to the moved file. Bzzt. Wrong answer. You're a little bit nuts, guy. -t ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-26 19:58 ` Linus Torvalds 2005-04-26 20:30 ` Tom Lord @ 2005-04-26 20:31 ` Bram Cohen 2005-04-26 20:39 ` Tom Lord ` (2 more replies) 2005-04-26 20:31 ` Daniel Barkalow 2 siblings, 3 replies; 81+ messages in thread From: Bram Cohen @ 2005-04-26 20:31 UTC (permalink / raw) To: Linus Torvalds; +Cc: git Linus Torvalds wrote: > On Tue, 26 Apr 2005, Bram Cohen wrote: > > > > If one person renames a file and another person modifies it then the > > changes should be applied to the moved file. > > Bzzt. Wrong answer. I'm trying to be polite. You're not making that easy. > The _right_ answer is "if one person moves a function, and another person > modifies the function, the changes should be applied to the moved > function". Now that you're done being dismissive, could you either (a) rebut my quite detailed explanation of exactly why that functionality is both a dubious idea and difficult to implement, or (b) admit that you have no plans whatsoever for supporting any of this stuff? You can't have it both ways. What I'd really like to hear is some explanation of why git is reimplementing all of this stuff from scratch. Your implicit claims that git will do more things than the other systems without having to reinvent all of their functionality first are, honestly, naive, ill-informed arrogance. I'd like to reiterate that *nothing* out there supports moving lines between files, and further predict, with total confidence, that if git tries to support such functionality it will simply fail, either by giving up or creating a system which can behave horribly. Before you get all dismissive about this claim, please remember that I've spent years thinking about merge algorithms, and have actually designed and implemented them, and have spoken at length with other people who have done the same, while you've merely thought about them for a few weeks. > Which is clearly a _much_ more common case than file renames. Even if we pretend that these are comparable features, that's far from clearly true. Function moves within a file occur more frequently, but a file rename moves *all* the functions within that file. > In other words, if your algorithm doesn't handle the latter, then there is > no point in handling the former either. If someone offers you a dollar, no strings attached, do you turn them down because they didn't offer you ten? > And _if_ your algorithm handles the latter, then there's no point in > handling file renames specially, since the algorithm will have done that > too, as a small part of it. In case these concepts got conflated, I'd like to point out that Codeville merge both supports renames *and* does better than three-way merge can do at merging a single, non-renamed file. In most cases three-way and codeville merge give the same answer, but there are some cases where there isn't a single appropriate LCA available, and in those cases codeville will do the right thing while three-way can't. -Bram ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-26 20:31 ` Bram Cohen @ 2005-04-26 20:39 ` Tom Lord 2005-04-26 20:58 ` Linus Torvalds 2005-04-26 21:26 ` Diego Calleja 2 siblings, 0 replies; 81+ messages in thread From: Tom Lord @ 2005-04-26 20:39 UTC (permalink / raw) To: bram; +Cc: git > What I'd really like to hear is some explanation of why git is > reimplementing all of this stuff from scratch. Whatever Linus' reasons, it's not a bad exercise and it does help robustify the kernel project to roll its own. Not that he seems to be doing *this* part especially well or anything, but that doesn't matter from the perspective of the good reasons to do it. It doesn't matter much -- get stuff into git and people can layer on that pretty gently. The low layers of git are a common idea but context and Linus' nifty code make this instance of the idea a bit of a gem. > please remember that I've spent years > thinking about merge algorithms I'm surprised we haven't met sooner. If we did and I forgot, sorry. -t ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-26 20:31 ` Bram Cohen 2005-04-26 20:39 ` Tom Lord @ 2005-04-26 20:58 ` Linus Torvalds 2005-04-26 21:25 ` Linus Torvalds 2005-04-26 21:28 ` Bram Cohen 2005-04-26 21:26 ` Diego Calleja 2 siblings, 2 replies; 81+ messages in thread From: Linus Torvalds @ 2005-04-26 20:58 UTC (permalink / raw) To: Bram Cohen; +Cc: git On Tue, 26 Apr 2005, Bram Cohen wrote: > > Now that you're done being dismissive, could you either (a) rebut my quite > detailed explanation of exactly why that functionality is both a dubious > idea and difficult to implement, or (b) admit that you have no plans > whatsoever for supporting any of this stuff? You can't have it both ways. I'm absolutely not going to do it myself, you're right about that. I just don't like your notion that you should support the 5% problem with ugly hacks, and then you dismiss the 95% problem with "nothing else does it either". In other words, we're already merging manually for the 95%. Why do you think the 5% is so important? > What I'd really like to hear is some explanation of why git is > reimplementing all of this stuff from scratch. Git does in ~5000 lines and two weeks of work what _I_ think is the right thing to do. You're welcome to disagree, but the fact is, people have whined and moaned about my use of BK FOR THREE YEARS without showing me any better alternatives. So why are you complaining now, when I implement my own version in two weeks? > If someone offers you a dollar, no strings attached, do you turn them down > because they didn't offer you ten? "no strings attached"? There are lots of strings attached to the "follow renames" thing. There's 30 _years_ of strings attached, and they result in people not looking at the _interesting_ problem. Exactly because people follow renames, they think that they have the history of the code, but then they ignore the fact that they don't - because it doesn't follow merging of code or splitting of code. In other words, it's sometimes better to know that you don't know the answer, than it is to _think_ that you know the answer. > In case these concepts got conflated, I'd like to point out that Codeville > merge both supports renames *and* does better than three-way merge can do > at merging a single, non-renamed file. And I'd like to point out (again) that git doesn't actually care what merge strategy the user uses. Me _personally_, I want to have something that is very repeatable and non-clever. Something I understand _or_ tells me that it can't do it. And quite frankly, merging single-file history _without_ taking all the other files' history into account makes me go "ugh". That's why I like the "we do _not_ look for 'nearer' parents on a per-file basis", which is what this discussion started with, afaik. I think the only original source that makes sense is the least common parent for the whole _project_, exactly because other files have done things that may or may not depend on the history of the file you're merging. I (and thus git) really takes a "whole project" approach. Linus ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-26 20:58 ` Linus Torvalds @ 2005-04-26 21:25 ` Linus Torvalds 2005-04-26 21:28 ` Bram Cohen 1 sibling, 0 replies; 81+ messages in thread From: Linus Torvalds @ 2005-04-26 21:25 UTC (permalink / raw) To: Bram Cohen; +Cc: git On Tue, 26 Apr 2005, Linus Torvalds wrote: > > > What I'd really like to hear is some explanation of why git is > > reimplementing all of this stuff from scratch. > > Git does in ~5000 lines and two weeks of work what _I_ think is the right > thing to do. You're welcome to disagree, but the fact is, people have > whined and moaned about my use of BK FOR THREE YEARS without showing me > any better alternatives. Btw, I've also been pretty disgusted by SCM's apparently generally caring about stuff that is totally not relevant. For example, it seems like most SCM people think that merging is about getting the end result of two conflicting patches right. In my opinion, that's the _least_ important part of a merge. Maybe the kernel is very unusual in this, but basically true _conflicts_ are not only rare, but they tend to be things you want a human to look at regardless. The important part of a merge is not how it handles conflicts (which need to be verified by a human anyway if they are at all interesting), but that it should meld the history together right so that you have a new solid base for future merges. In other words, the important part is the _trivial_ part: the naming of the parents, and keeping track of their relationship. Not the clashes. For example, CVS gets this part totally wrong. Sure, it can merge the contents, but it totally ignores the important part, so once you've done a merge, you're pretty much up shit creek wrt any subsequent merges in any other direction. All the other CVS problems pale in comparison. Renames? Just a detail. And it looks like 99% of SCM people seem to think that the solution to that is to be more clever about content merges. Which misses the point entirely. Don't get me wrong: content merges are nice, but they are _gravy_. They are not important. You can do them manually if you have to. What's important is that once you _have_ done them (manually or automatically), the system had better be able to go on, knowing that they've been done. Linus ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-26 20:58 ` Linus Torvalds 2005-04-26 21:25 ` Linus Torvalds @ 2005-04-26 21:28 ` Bram Cohen 2005-04-26 21:36 ` Fabian Franz 2005-04-26 22:25 ` Linus Torvalds 1 sibling, 2 replies; 81+ messages in thread From: Bram Cohen @ 2005-04-26 21:28 UTC (permalink / raw) To: Linus Torvalds; +Cc: git Linus Torvalds wrote: > On Tue, 26 Apr 2005, Bram Cohen wrote: > > > > Now that you're done being dismissive, could you either (a) rebut my quite > > detailed explanation of exactly why that functionality is both a dubious > > idea and difficult to implement, or (b) admit that you have no plans > > whatsoever for supporting any of this stuff? You can't have it both ways. > > I'm absolutely not going to do it myself, you're right about that. Now you're just being an ass. I stated, flatly, that what you're proposing to have done (by you or whoever, for feasibility it doesn't matter which) is not going to happen due to just plain difficulty. You obviously disagree with me, but rather than coming out and saying so you're pretending I didn't make that statement. > > What I'd really like to hear is some explanation of why git is > > reimplementing all of this stuff from scratch. > > Git does in ~5000 lines and two weeks of work what _I_ think is the right > thing to do. So you think that a system which supports snapshots and history but has no merging functionality whatsoever is the right thing? I'm asking this seriously. You have a magic --make-somebody-else-do-merge command, but for everybody else the current state of things is workable as a stopgap measure (the original plan) but very painful for anything more. Codeville is comparable in terms of number of lines of code to Git, by the way. > You're welcome to disagree, but the fact is, people have whined and > moaned about my use of BK FOR THREE YEARS without showing me any better > alternatives. You were happy with BitKeeper, so why should we? Monotone and Codeville are only just about now really mature, and you aren't exactly known as a model customer. > So why are you complaining now, when I implement my own version in two > weeks? I'm trying to tell you that the amount of time between now and when a system as nice as BitKeeper is in use by the kernel can be dramatically reduced by either using an existing system verbatim or basing new efforts on one. If you think that git as it exists right now is at all comparable to Monotone or Codeville you're completely delusional. > > In case these concepts got conflated, I'd like to point out that Codeville > > merge both supports renames *and* does better than three-way merge can do > > at merging a single, non-renamed file. > > And I'd like to point out (again) that git doesn't actually care what > merge strategy the user uses. > > Me _personally_, I want to have something that is very repeatable and > non-clever. Something I understand _or_ tells me that it can't do it. And > quite frankly, merging single-file history _without_ taking all the other > files' history into account makes me go "ugh". Now you've just gone off the deep end. This is an apples-to-apples comparison. Please accept one of thee following two statements: (a) Git doesn't do merging, and none of the related new tools around it do merging. (b) Codeville merge (sans rename functionality) would be superior for the merging which will be done. -Bram ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-26 21:28 ` Bram Cohen @ 2005-04-26 21:36 ` Fabian Franz 2005-04-26 22:30 ` Linus Torvalds 2005-04-26 22:25 ` Linus Torvalds 1 sibling, 1 reply; 81+ messages in thread From: Fabian Franz @ 2005-04-26 21:36 UTC (permalink / raw) To: Bram Cohen; +Cc: git, Linus Torvalds -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Am Dienstag, 26. April 2005 23:28 schrieb Bram Cohen: > Now you've just gone off the deep end. This is an apples-to-apples > comparison. Please accept one of thee following two statements: > > (a) Git doesn't do merging, and none of the related new tools around it do > merging. > > (b) Codeville merge (sans rename functionality) would be superior for the > merging which will be done. I have one very humble question: Why don't you write and contribute some code for git to do good merging? This would resolve all your problems. I think the "magic-merge" command is quite exchangable and if your way works good and is compatible, then people will automatically start using that. cu Fabian -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.4 (GNU/Linux) iD8DBQFCbrRdI0lSH7CXz7MRAkS4AJ9JEka71M0Zc6cizXhrYpHiKHhL0gCcD/3Q j+UnPU/cXafGjGG6Bt9mZE8= =IYk0 -----END PGP SIGNATURE----- ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-26 21:36 ` Fabian Franz @ 2005-04-26 22:30 ` Linus Torvalds 0 siblings, 0 replies; 81+ messages in thread From: Linus Torvalds @ 2005-04-26 22:30 UTC (permalink / raw) To: Fabian Franz; +Cc: Bram Cohen, git On Tue, 26 Apr 2005, Fabian Franz wrote: > > Am Dienstag, 26. April 2005 23:28 schrieb Bram Cohen: > > Now you've just gone off the deep end. This is an apples-to-apples > > comparison. Please accept one of thee following two statements: > > > > (a) Git doesn't do merging, and none of the related new tools around it do > > merging. > > > > (b) Codeville merge (sans rename functionality) would be superior for the > > merging which will be done. > > I have one very humble question: > > Why don't you write and contribute some code for git to do good merging? Don't bother. Bram doesn't know what he's talking about. Linus ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-26 21:28 ` Bram Cohen 2005-04-26 21:36 ` Fabian Franz @ 2005-04-26 22:25 ` Linus Torvalds 2005-04-28 0:42 ` Petr Baudis 1 sibling, 1 reply; 81+ messages in thread From: Linus Torvalds @ 2005-04-26 22:25 UTC (permalink / raw) To: Bram Cohen; +Cc: git On Tue, 26 Apr 2005, Bram Cohen wrote: > > So you think that a system which supports snapshots and history but has no > merging functionality whatsoever is the right thing? You haven't looked at git, have you? Git already merges better than _any_ open-source SCM out there. It just does it so effortlessly that you didn't even realize it does that. Today I've done four (count them) fully automated merges on the kernel tree: serial, networking, usb and arm. And they took a fraction of a second (plus the download of the new objects, which is the real cost). This is something that SVN _still_ cannot do, for example. Linus ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-26 22:25 ` Linus Torvalds @ 2005-04-28 0:42 ` Petr Baudis 0 siblings, 0 replies; 81+ messages in thread From: Petr Baudis @ 2005-04-28 0:42 UTC (permalink / raw) To: Linus Torvalds; +Cc: Bram Cohen, git Dear diary, on Wed, Apr 27, 2005 at 12:25:58AM CEST, I got a letter where Linus Torvalds <torvalds@osdl.org> told me that... > On Tue, 26 Apr 2005, Bram Cohen wrote: > > > > So you think that a system which supports snapshots and history but has no > > merging functionality whatsoever is the right thing? > > You haven't looked at git, have you? > > Git already merges better than _any_ open-source SCM out there. It just > does it so effortlessly that you didn't even realize it does that. Did you (or any other kernel developer reading this) actually try the Codeville merge? (I admit I didn't get time to do anything real with it yet.) SCM people keep praising it (as basically the best (at least open-source) merge out there), so it would be interesting to compare that with the actual real-world experience with it on the kernel. > Today I've done four (count them) fully automated merges on the kernel > tree: serial, networking, usb and arm. > > And they took a fraction of a second (plus the download of the new > objects, which is the real cost). > > This is something that SVN _still_ cannot do, for example. I think SVN is just irrelevant here - it is a completely different league. The contenders here are probably Codeville, Monotone and perhaps GNU Arch offsprings. -- Petr "Pasky" Baudis Stuff: http://pasky.or.cz/ C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-26 20:31 ` Bram Cohen 2005-04-26 20:39 ` Tom Lord 2005-04-26 20:58 ` Linus Torvalds @ 2005-04-26 21:26 ` Diego Calleja 2 siblings, 0 replies; 81+ messages in thread From: Diego Calleja @ 2005-04-26 21:26 UTC (permalink / raw) To: Bram Cohen; +Cc: torvalds, git El Tue, 26 Apr 2005 13:31:31 -0700 (PDT), Bram Cohen <bram@bitconjurer.org> escribió: > Even if we pretend that these are comparable features, that's far from > clearly true. Function moves within a file occur more frequently, but a > file rename moves *all* the functions within that file. Renaming or moving files is _so_ rare and unusual that even not implementing it (like CVS) is hardly a big issue. Even in the linux kernel people moved subsystems around - OSS went from drivers/sound to /sound/oss in 2.6, and a USB subdirectory moved too, I think. The patch got bigger. People wasted 30 seconds more of their life because the .gz file was bigger - who cares? If it's something it's going to happen every 5 years, I'd rather move them like CVS does rather than wasting a single second implementing file renaming/moving... If something so uncommon like file renaming has been implemented, I don't see why people shouldn't implement something really useful like the thing linus proposes, in fact it doesn't looks like a bad idea at all (and you'd get file renaming for free, too). Perhaps it would be hard to implement and get right, but at least it would be _useful_. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-26 19:58 ` Linus Torvalds 2005-04-26 20:30 ` Tom Lord 2005-04-26 20:31 ` Bram Cohen @ 2005-04-26 20:31 ` Daniel Barkalow 2005-04-26 20:44 ` Tom Lord 2 siblings, 1 reply; 81+ messages in thread From: Daniel Barkalow @ 2005-04-26 20:31 UTC (permalink / raw) To: Linus Torvalds; +Cc: Bram Cohen, git On Tue, 26 Apr 2005, Linus Torvalds wrote: > > > On Tue, 26 Apr 2005, Bram Cohen wrote: > > > > If one person renames a file and another person modifies it then the > > changes should be applied to the moved file. > > Bzzt. Wrong answer. > > The _right_ answer is "if one person moves a function, and another person > modifies the function, the changes should be applied to the moved > function". I'd even go so far as to say that we need to have the user sign off on this. The modification is reasonably likely to not work in the new location, due to use of statics that aren't available or something of the sort. I suspect that we want to report a conflict in the new location between the old version and the new version, and let the person merging check whether the change is okay. That is, both sides modified the same section at the same time: one side modified its content, while the other modified its location. We can give a good suggestion, but we need a final ruling. -Daniel *This .sig left intentionally blank* ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-26 20:31 ` Daniel Barkalow @ 2005-04-26 20:44 ` Tom Lord 0 siblings, 0 replies; 81+ messages in thread From: Tom Lord @ 2005-04-26 20:44 UTC (permalink / raw) To: barkalow; +Cc: torvalds, bram, git ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Re: Merge with git-pasky II.
@ 2005-04-17 17:34 Linus Torvalds
2005-04-17 22:12 ` Herbert Xu
2005-04-18 4:16 ` Sanjoy Mahajan
0 siblings, 2 replies; 81+ messages in thread
From: Linus Torvalds @ 2005-04-17 17:34 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Petr Baudis, Simon Fowler, David Lang, git
On Sun, 17 Apr 2005, Ingo Molnar wrote:
>
> in fact, this attack cannot even be proven to be malicious, purely via
> the email from Malice: it could be incredible bad luck that caused that
> good-looking patch to be mistakenly matching a dangerous object.
I really hate theoretical discussions.
The fact is, a lot of _crap_ engineering gets done because of the question
"what if?". It results in over-engineering, often to the point where the
end result is quite a lot measurably worse than the sane results.
You are _literally_ arguing for the equivalent of "what if a meteorite hit
my plane while it was in flight - maybe I should add three inches of
high-tension armored steel around the plane, so that my passengers would
be protected".
That's not engineering. That's five-year-olds discussing building their
imaginary forts ("I want gun-turrets and a mechanical horse one mile high,
and my command center is 5 miles under-ground and totally encased in 5
meters of lead").
I absolutely _hate_ doing engineering on the principle of "this might be
possible in theory", and I'm violently opposed to it. So far, I have not
heard a single argument that I consider even _remotely_ likely.
The thing is, even if you can force a hash collission by sending somebody
a patch, it's really pretty much almost guaranteed that the patch is not
just "a few strange characters", unless sha1 is really broken to the point
where it's not cryptographically secure _at_all_.
In other words, unless somebody finds a way to make sha1 appear as nothing
more than a complicated set of parity bits, all brute-force "get the same
sha1" is likely to be about generating a really strange blob based on the
thing you want to replace - and by "really strange" I mean total binary
crap. And likely _much_ bigger too. And by "much bigger" I mean "possibly
gigabytes of data".
And the thing is, _if_ somebody finds a way to make sha1 act as just a
complex parity bit, and comes up with generating a clashing object that
actually makes sense, then going to sha256 is likely pointless too - I
think the algorithm is basically the same, just with more bits. If you've
broken sha1 to the point where it's _that_ breakable, then you've likely
broken sha256 too. Nobody has ever proven that you couldn't break sha256
with some really clever algorithm...
So if you start playing "what if?" games, dammit, I can play mine.
If we want to have any kind of confidence that the hash is reall
yunbreakable, we should make it not just longer than 160 bits, we should
make sure that it's two or more hashes, and that they are based on totally
different principles.
And we should all digitally sign every single object too, and we should
use 4096-bit PGP keys and unguessable passphrases that are at least 20
words in length. And we should then build a bunker 5 miles underground,
encased in lead, so that somebody cannot flip a few bits with a ray-gun,
and make us believe that the sha1's match when they don't. Oh, and we need
to all wear aluminum propeller beanies to make sure that they don't use
that ray-gun to make us do the modification _outselves_.
And the thing is, that's just crazy talk. The difference between a crazy
person and an intelligent one is that the crazy one doesn't realize what
makes sense in the world. The goal of good engineering is not to ask "what
if?", but to ask "how do I make this work as well as possible".
So please stop with the theoretical sha1 attacks. It is simply NOT TRUE
that you can generate an object that looks halfway sane and still gets you
the sha1 you want. Even the "breakage" doesn't actually do that. And if
it ever _does_ become true, it will quite possibly be thanks to some
technology that breaks other hashes too.
So until proven otherwise, I worry about accidental hashes, and in 160
bits of good hashing, that just isn't an issue either. Anybody who
compares a 128-bit md5-sum to a 160-bit sha1 doesn't understand the math.
It didn't get "slightly less likely" to happen. It got so _unbelievably_
less likely to happen that it's not even funny.
Linus
^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-17 17:34 Linus Torvalds @ 2005-04-17 22:12 ` Herbert Xu 2005-04-17 22:35 ` Linus Torvalds 2005-04-18 4:16 ` Sanjoy Mahajan 1 sibling, 1 reply; 81+ messages in thread From: Herbert Xu @ 2005-04-17 22:12 UTC (permalink / raw) To: Linus Torvalds; +Cc: mingo, pasky, simon, david.lang, git Linus Torvalds <torvalds@osdl.org> wrote: > > If we want to have any kind of confidence that the hash is reall > yunbreakable, we should make it not just longer than 160 bits, we should > make sure that it's two or more hashes, and that they are based on totally > different principles. Sorry, it has already been shown that combining two difference hashes doesn't necessarily provide the security that you would hope. I think what hasn't been discussed here is the cost of actually doing the comparisons. In other words, what is the minimum number of comparisons we can get away and still deal with hash collisions successfully? Once we know what the cost is then we can decide whether it's worthwhile considering the odds involved. -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-17 22:12 ` Herbert Xu @ 2005-04-17 22:35 ` Linus Torvalds 2005-04-17 23:29 ` Herbert Xu 0 siblings, 1 reply; 81+ messages in thread From: Linus Torvalds @ 2005-04-17 22:35 UTC (permalink / raw) To: Herbert Xu; +Cc: mingo, pasky, simon, david.lang, git On Mon, 18 Apr 2005, Herbert Xu wrote: > > Sorry, it has already been shown that combining two difference hashes > doesn't necessarily provide the security that you would hope. Sorry, that's not true. Quite the reverse. Again, you bring up totally theoretical arguments. In _practice_ it has indeed been shown that using two hashes _does_ catch hash colissions. The trivial example is using md5 sums with a length. The "length" is a rally bad "hash" of the file contents too. And the fact is, that simple combination of hashes has proven to be more resistant to attack than the hash itself. It clearly _does_ make a difference in practice. So _please_, can we drop the obviously bogus "in theory" arguments. They do not matter. What matters is practice. And the fact is, in _theory_ we don't know if somebody may be trivially able to break any particular hash. But in practice we do know that it's less likely that you can break a combination of two totally unrelated hashes than you break one particular one. NOTE! I'm not actually arguing that we should do that. I'm actually arguing totally the reverse: I'm arguing that there is a fine line between being "very very careful" and being "crazy to the point of being incompetent". Linus ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-17 22:35 ` Linus Torvalds @ 2005-04-17 23:29 ` Herbert Xu 2005-04-17 23:34 ` Petr Baudis 2005-04-17 23:50 ` Linus Torvalds 0 siblings, 2 replies; 81+ messages in thread From: Herbert Xu @ 2005-04-17 23:29 UTC (permalink / raw) To: Linus Torvalds; +Cc: mingo, pasky, simon, david.lang, git On Sun, Apr 17, 2005 at 03:35:17PM -0700, Linus Torvalds wrote: > > Quite the reverse. Again, you bring up totally theoretical arguments. In > _practice_ it has indeed been shown that using two hashes _does_ catch > hash colissions. > > The trivial example is using md5 sums with a length. The "length" is a > rally bad "hash" of the file contents too. And the fact is, that simple > combination of hashes has proven to be more resistant to attack than the > hash itself. It clearly _does_ make a difference in practice. I wasn't disputing that of course. However, the same effect can be achieved in using a single hash with a bigger length, e.g., sha256 or sha512. > So _please_, can we drop the obviously bogus "in theory" arguments. They > do not matter. What matters is practice. I agree. However, what is the actual cost in practice of detecting collisions? I get the feeling that it isn't that bad. For example, if we did it at the points where the blobs actually entered the tree, then the cost is always proportional to the change size (the number of new blobs). Is this really that bad considering that the average blob isn't very big? Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-17 23:29 ` Herbert Xu @ 2005-04-17 23:34 ` Petr Baudis 2005-04-17 23:53 ` Kenneth Johansson 2005-04-18 0:49 ` Herbert Xu 2005-04-17 23:50 ` Linus Torvalds 1 sibling, 2 replies; 81+ messages in thread From: Petr Baudis @ 2005-04-17 23:34 UTC (permalink / raw) To: Herbert Xu; +Cc: Linus Torvalds, mingo, simon, david.lang, git Dear diary, on Mon, Apr 18, 2005 at 01:29:05AM CEST, I got a letter where Herbert Xu <herbert@gondor.apana.org.au> told me that... > I get the feeling that it isn't that bad. For example, if we did it > at the points where the blobs actually entered the tree, then the cost > is always proportional to the change size (the number of new blobs). No. The collision check is done in the opposite cache - when you want to write a blob and there is already a file of the same hash in the tree. So either the blob is already in the database, or you have a collision. Therefore, the cost is proportional to the size of what stays unchanged. -- Petr "Pasky" Baudis Stuff: http://pasky.or.cz/ C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-17 23:34 ` Petr Baudis @ 2005-04-17 23:53 ` Kenneth Johansson 2005-04-18 0:49 ` Herbert Xu 1 sibling, 0 replies; 81+ messages in thread From: Kenneth Johansson @ 2005-04-17 23:53 UTC (permalink / raw) To: git; +Cc: Linus Torvalds, mingo, simon, david.lang, git Petr Baudis wrote: > Dear diary, on Mon, Apr 18, 2005 at 01:29:05AM CEST, I got a letter > where Herbert Xu <herbert@gondor.apana.org.au> told me that... > >>I get the feeling that it isn't that bad. For example, if we did it >>at the points where the blobs actually entered the tree, then the cost >>is always proportional to the change size (the number of new blobs). > > > No. The collision check is done in the opposite cache - when you want to > write a blob and there is already a file of the same hash in the tree. > So either the blob is already in the database, or you have a collision. > > Therefore, the cost is proportional to the size of what stays unchanged. > ?? now I'm confused. Surly the only cost involved is to never write over a file that already exist in the cache and that is already done NOW as far as I read the code. So there is NO extra cost in detecting an collision. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-17 23:34 ` Petr Baudis 2005-04-17 23:53 ` Kenneth Johansson @ 2005-04-18 0:49 ` Herbert Xu 2005-04-18 0:55 ` Petr Baudis 1 sibling, 1 reply; 81+ messages in thread From: Herbert Xu @ 2005-04-18 0:49 UTC (permalink / raw) To: Petr Baudis; +Cc: Linus Torvalds, mingo, simon, david.lang, git On Mon, Apr 18, 2005 at 01:34:41AM +0200, Petr Baudis wrote: > > No. The collision check is done in the opposite cache - when you want to > write a blob and there is already a file of the same hash in the tree. > So either the blob is already in the database, or you have a collision. > Therefore, the cost is proportional to the size of what stays unchanged. This is only true if we're calling update-cache on all unchanged files. If that's what git is doing then we're in trouble anyway. Remember that prior to the collision check we've already spent the effort in 1) Compressing the file. 2) Computing a SHA1 hash on the result. These two steps together (especially the first one) is much more expensive than a file content comparison of the blob versus what's already in the tree. Somehow I have a hard time seeing how this can be at all efficient if we're compressing all checked out files including those which are unchanged. Therefore the only conclusion I can draw is that we're only calling update-cache on the set of changed files, or at most a small superset of them. In that case, the cost of the collision check *is* proportional to the size of the change. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-18 0:49 ` Herbert Xu @ 2005-04-18 0:55 ` Petr Baudis 0 siblings, 0 replies; 81+ messages in thread From: Petr Baudis @ 2005-04-18 0:55 UTC (permalink / raw) To: Herbert Xu; +Cc: Linus Torvalds, mingo, simon, david.lang, git Dear diary, on Mon, Apr 18, 2005 at 02:49:06AM CEST, I got a letter where Herbert Xu <herbert@gondor.apana.org.au> told me that... > Therefore the only conclusion I can draw is that we're only calling > update-cache on the set of changed files, or at most a small superset > of them. In that case, the cost of the collision check *is* proportional > to the size of the change. Yes, of course, sorry for the confusion. We only consider files you either specify manually or which have their stat metadata changed relative to the directory cache. (That is from the git-pasky perspective; from the plumbing perspective, the user just does update-cache on whatever he picks.) -- Petr "Pasky" Baudis Stuff: http://pasky.or.cz/ C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-17 23:29 ` Herbert Xu 2005-04-17 23:34 ` Petr Baudis @ 2005-04-17 23:50 ` Linus Torvalds 1 sibling, 0 replies; 81+ messages in thread From: Linus Torvalds @ 2005-04-17 23:50 UTC (permalink / raw) To: Herbert Xu; +Cc: mingo, pasky, simon, david.lang, git On Mon, 18 Apr 2005, Herbert Xu wrote: > > I wasn't disputing that of course. However, the same effect can be > achieved in using a single hash with a bigger length, e.g., sha256 > or sha512. No it cannot. If somebody actually literally totally breaks that hash, length won't matter. There are (bad) hashes where you can literally edit the content of the file, and make sure that the end result has the same hash. In that case, when the hash algorithm has actually been broken, the length of the hash ends up being not very relevant. For example, you might "hash" your file by blocking it up in 16-byte blocks, and xoring all blocks together - the result is a 16-byte hash. It's a terrible hash, and obviously trivially breakable, and once broken it does _not_ help to make it use its 32-byte cousin. Not at all. You can just modify the breaking thing to equally cheaply make modifications to a file and get the 32-byte hash "right" again. Is that kind of breakage likely for sha1? Hell no. Is it possible? In your "in theory" world where practice doesn't matter, yes. Linus ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-17 17:34 Linus Torvalds 2005-04-17 22:12 ` Herbert Xu @ 2005-04-18 4:16 ` Sanjoy Mahajan 1 sibling, 0 replies; 81+ messages in thread From: Sanjoy Mahajan @ 2005-04-18 4:16 UTC (permalink / raw) To: Linus Torvalds; +Cc: Ingo Molnar, Petr Baudis, Simon Fowler, David Lang, git > So until proven otherwise, I worry about accidental hashes, and in > 160 bits of good hashing, that just isn't an issue either...[Going > from 128 bits to 160 bits made it] so _unbelievably_ less likely to > happen that it's not even funny. You are right. Here's how I learnt to stop worrying and love the 160 bits. A 160-bit hash requires 2^80=10^24 files before the collision probability is roughly 0.5 (actually 1-e^{-1/2}). Now be very conservative: Instead of tolerating a 0.5 probability, worry about even a 10^-8 probability of a collision anywhere, anytime. The magic number of files for that probability is 10^20 (roughly 10^40 pairs for 2^160=10^48 boxes). Given 10 billion people using git, each producing 1 source file per second -- busy beavers all -- they would need 300 years to produce 10^20 files. And to reach the 10^-8 collision probability, all 10^20 files must belong to the same project, and even OpenOffice will not be that bloated. -Sanjoy ^ permalink raw reply [flat|nested] 81+ messages in thread
[parent not found: <000d01c541ed$32241fd0$6400a8c0@gandalf>]
* RE: Merge with git-pasky II. [not found] <000d01c541ed$32241fd0$6400a8c0@gandalf> @ 2005-04-15 20:31 ` Linus Torvalds 2005-04-15 23:00 ` Barry Silverman 0 siblings, 1 reply; 81+ messages in thread From: Linus Torvalds @ 2005-04-15 20:31 UTC (permalink / raw) To: Barry Silverman; +Cc: git [ I'm cc'ing the git list even though Barry's question wasn't cc'd. Because I think his question is interesting and astute per se, even if I disagree with the proposal ] On Fri, 15 Apr 2005, Barry Silverman wrote: > > If git is totally project based, and each commit represents total state > of the project, then how important is the intermediate commit > information between two states. You need it in order to do further merges. > IE, Area maintainer has A1->A2->A3->A4->A5 in a repository with 5 > commits, and 5 comments. And I last synced with A1. > > A few days later I sync again. Couldn't I just pull the "diff-tree A5 > A1" and then commit to my tree just the record A1->A5. Why does MY > repository need trees A2,A3,A4? Because that second merge needs the first merge to work well. The first merge might have had some small differences that ended up auto-merging (or even needing some manual help from you). The second time you sync, there migth be some more automatic merging. And so on. If you don't keep track of the incremental merges, you end up with one really _difficult_ merge that may not be mergable at all. Not automatically, and perhaps not even with help. So in order to keep things mergable, you need to not diverge. And the intermediate merges are the "anchor-points" for the next merge, keeping the divergences minimal. I'm personally convinced that one of the reasons CVS is a pain to merge is obviously that it doesn't do a good job of finding parents, but also exactly _because_ it makes merges so painful that people wait longer to do them, so you never end up fixing the simple stuff. In contrast, if you have all these small merges going on all the time, the hope is that there's never any really painful nasty "final merge". So you're right - the small merges do end up cluttering up the revision history. But it's a small price to pay if it means that you avoid having the painful ones. > Isn't preserving the A1,A2,A3,A4,A5 a legacy of BK, which required all > the changesets to be loaded in order, and so is a completely "file" > driven concept? Nope. In fact, to some degree git will need this even _more_, since the git merger is likely to be _weaker_ than BK, and thus more easily confused. I do believe that BK has these things for the same reason. Linus ^ permalink raw reply [flat|nested] 81+ messages in thread
* RE: Merge with git-pasky II. 2005-04-15 20:31 ` Linus Torvalds @ 2005-04-15 23:00 ` Barry Silverman 2005-04-16 0:32 ` Linus Torvalds 0 siblings, 1 reply; 81+ messages in thread From: Barry Silverman @ 2005-04-15 23:00 UTC (permalink / raw) To: 'Linus Torvalds'; +Cc: git The issue I am trying to come to grips with in the current design, is that the git repository of a number of interrelated projects will soon become the logical OR of all blobs, commits, and trees in ALL the projects. This will involve horrendous amounts of replication, as developers end interchanging objects that originated from third parties who are not party to the merge (and which happen to be in the repository because of previous merge activity). It would be really nice if the repositories for each project stay distinct (and maybe even living on different servers). Merges should the one of the few points of contact where the state be exchanged between the repositories. >> If you don't keep track of the incremental merges, you end up with one >> really _difficult_ merge that may not be mergable at all. Not >> automatically, and perhaps not even with help. No argument from me.... In my example, I didn't intend A2,A3,A4 to be considered hash's of points where you did a merge - rather they are hashes of individual points where you did a commit BETWEEN the points where you merged. A1 and A5 are the "merge points"! It makes perfect sense to define some subset of commits as "merge-like" commits, and only have those copied over from one repository to the other. You could also only use merge-points in the common ancestor calculation, and not worry about intermediate commits. Only small changes to the existing logic are necessary to do a merge by "distributing" out the merge algorithm to each repository. This involves querying each repository, and communicating the results, followed by copying over only those blob objects necessary for the merge. After the merge, you would create a "merge-point commit" record that has one of the parents pointing to a hash in the other repository! But the BIG issue with this scheme, is that you will not be replicating over any of the intermediate commits, trees, or blobs (not really needed by the merge), but currently being traversed by various plumbing components. Hence my question.... -----Original Message----- From: Linus Torvalds [mailto:torvalds@osdl.org] Sent: Friday, April 15, 2005 4:31 PM To: Barry Silverman Cc: git@vger.kernel.org Subject: RE: Merge with git-pasky II. [ I'm cc'ing the git list even though Barry's question wasn't cc'd. Because I think his question is interesting and astute per se, even if I disagree with the proposal ] On Fri, 15 Apr 2005, Barry Silverman wrote: > > If git is totally project based, and each commit represents total state > of the project, then how important is the intermediate commit > information between two states. You need it in order to do further merges. > IE, Area maintainer has A1->A2->A3->A4->A5 in a repository with 5 > commits, and 5 comments. And I last synced with A1. > > A few days later I sync again. Couldn't I just pull the "diff-tree A5 > A1" and then commit to my tree just the record A1->A5. Why does MY > repository need trees A2,A3,A4? Because that second merge needs the first merge to work well. The first merge might have had some small differences that ended up auto-merging (or even needing some manual help from you). The second time you sync, there migth be some more automatic merging. And so on. If you don't keep track of the incremental merges, you end up with one really _difficult_ merge that may not be mergable at all. Not automatically, and perhaps not even with help. So in order to keep things mergable, you need to not diverge. And the intermediate merges are the "anchor-points" for the next merge, keeping the divergences minimal. I'm personally convinced that one of the reasons CVS is a pain to merge is obviously that it doesn't do a good job of finding parents, but also exactly _because_ it makes merges so painful that people wait longer to do them, so you never end up fixing the simple stuff. In contrast, if you have all these small merges going on all the time, the hope is that there's never any really painful nasty "final merge". So you're right - the small merges do end up cluttering up the revision history. But it's a small price to pay if it means that you avoid having the painful ones. > Isn't preserving the A1,A2,A3,A4,A5 a legacy of BK, which required all > the changesets to be loaded in order, and so is a completely "file" > driven concept? Nope. In fact, to some degree git will need this even _more_, since the git merger is likely to be _weaker_ than BK, and thus more easily confused. I do believe that BK has these things for the same reason. Linus ^ permalink raw reply [flat|nested] 81+ messages in thread
* RE: Merge with git-pasky II. 2005-04-15 23:00 ` Barry Silverman @ 2005-04-16 0:32 ` Linus Torvalds 0 siblings, 0 replies; 81+ messages in thread From: Linus Torvalds @ 2005-04-16 0:32 UTC (permalink / raw) To: Barry Silverman; +Cc: git On Fri, 15 Apr 2005, Barry Silverman wrote: > > The issue I am trying to come to grips with in the current design, is > that the git repository of a number of interrelated projects will soon > become the logical OR of all blobs, commits, and trees in ALL the > projects. Nope. I'm actually against the notion of sharing object directories between projects. The git model _allows_ it, and I think it can be a valid approach, but it's absolutely not an approach that I would personally suggest be used for the kernel. In other words, the default git behaviour - and the one I personally am a proponent of - is to have one object directory per work tree, and really consider all such trees independent. Then you can merge between trees if you want to, and bring in objects that way, but normally you would _not_ have tons of objects from other trees, and _especially_ not from other unrelated projects. The reason git supports shared object archives is that (a) it falls out trivially as part of the design, so not allowing it is silly and (b) it is part of a merge, where you _do_ want to get the objects of the trees you merge, and in particular you need to generate a seperate tree that has all those objects without having to copy them. (Before you do the merge, you need to bring the new objects into your repository of course, but that I consider to be a separate issue, not part of the actual technical merge process). So normally, you'd probably have a totally pruned tree, with only the objects you need (and you might even consider the "commit parent links" less than necessary, especially if you're just a regular user and not a developer who wants to merge). But the ability to have extra objects is wonderful. It makes going backwards in time basically free (while the equivalent "bk undo" in the BK world is a very expensive operation), and it makes it easy to incrementally keep up-to-date with trees that you know you're _eventually_ going to merge with. But it's not an excuse to put just any random crap in that object directory.. Linus ^ permalink raw reply [flat|nested] 81+ messages in thread
* Merge with git-pasky II. @ 2005-04-14 0:29 Petr Baudis 2005-04-13 21:25 ` Christopher Li 2005-04-14 0:30 ` Petr Baudis 0 siblings, 2 replies; 81+ messages in thread From: Petr Baudis @ 2005-04-14 0:29 UTC (permalink / raw) To: torvalds; +Cc: git Hello Linus, I think my tree should be ready for merging with you. It is the final tree and I've already switched my main branch for it, so it's what people doing git pull are getting for some time already. Its main contents are all of my shell scripts. Apart of that, some tiny fixes scattered all around can be found there, as well as some patches which went through the mailing list. My last merge with you concerned your commit 39021759c903a943a33a28cfbd5070d36d851581. It's again rsync://pasky.or.cz/git/ this time my HEAD is fba83970090ef54c6eb86dcc2c2d5087af5ac637. Note that my rsync tree still contains even my old branch; I thought I'd leave it around in the public objects database for some time, shall anyone want to have a look at the history of some of the scripts. But if you want it gone, tell me and I will prune it (and perhaps offer it in /git-old/ or whatever). I'm using the following: fsck-cache --unreachable $(commit-id) | grep unreachable \ | cut -d ' ' -f 2 | sed 's/^\(..\)/.git\/objects\/\1\//' \ | xargs rm Thanks, -- Petr "Pasky" Baudis Stuff: http://pasky.or.cz/ C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-14 0:29 Petr Baudis @ 2005-04-13 21:25 ` Christopher Li 2005-04-14 0:45 ` Petr Baudis 2005-04-14 0:30 ` Petr Baudis 1 sibling, 1 reply; 81+ messages in thread From: Christopher Li @ 2005-04-13 21:25 UTC (permalink / raw) To: Petr Baudis; +Cc: torvalds, git While you are there, do you mind to move the shell script to a sub directory? Let's try how rename works. Chris On Thu, Apr 14, 2005 at 02:29:02AM +0200, Petr Baudis wrote: > Hello Linus, > > I think my tree should be ready for merging with you. It is the final > tree and I've already switched my main branch for it, so it's what > people doing git pull are getting for some time already. > > Its main contents are all of my shell scripts. Apart of that, some > tiny fixes scattered all around can be found there, as well as some > patches which went through the mailing list. My last merge with you > concerned your commit 39021759c903a943a33a28cfbd5070d36d851581. > > It's again > > rsync://pasky.or.cz/git/ > > this time my HEAD is fba83970090ef54c6eb86dcc2c2d5087af5ac637. > > Note that my rsync tree still contains even my old branch; I thought > I'd leave it around in the public objects database for some time, shall > anyone want to have a look at the history of some of the scripts. But if > you want it gone, tell me and I will prune it (and perhaps offer it in > /git-old/ or whatever). I'm using the following: > > fsck-cache --unreachable $(commit-id) | grep unreachable \ > | cut -d ' ' -f 2 | sed 's/^\(..\)/.git\/objects\/\1\//' \ > | xargs rm > > Thanks, > > -- > Petr "Pasky" Baudis > Stuff: http://pasky.or.cz/ > C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor > - > To unsubscribe from this list: send the line "unsubscribe git" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Re: Merge with git-pasky II. 2005-04-13 21:25 ` Christopher Li @ 2005-04-14 0:45 ` Petr Baudis 2005-04-14 3:51 ` Linus Torvalds 0 siblings, 1 reply; 81+ messages in thread From: Petr Baudis @ 2005-04-14 0:45 UTC (permalink / raw) To: Christopher Li; +Cc: torvalds, git Dear diary, on Wed, Apr 13, 2005 at 11:25:46PM CEST, I got a letter where Christopher Li <git@chrisli.org> told me that... > While you are there, do you mind to move the shell script > to a sub directory? Let's try how rename works. Well, unless Linus will want me otherwise, I'd like to postpone this until I'm finally done with the damn merge - enough things already got into my way today, so I would really like to focus on this tomorrow. So I'll be probably merging only (or mostly) bugfixes until I have that finished. P.S.: Just staring at http://www.theregister.co.uk/2005/04/11/torvalds_attack/ ... I'm nothing like a regular reader of (R), but I thought the guys have at least a bit of sense. Duh. :/ Or is April 11 now yet another joke day after April 1? -- Petr "Pasky" Baudis Stuff: http://pasky.or.cz/ C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Re: Merge with git-pasky II. 2005-04-14 0:45 ` Petr Baudis @ 2005-04-14 3:51 ` Linus Torvalds 2005-04-14 1:23 ` Christopher Li 2005-04-14 7:05 ` Junio C Hamano 0 siblings, 2 replies; 81+ messages in thread From: Linus Torvalds @ 2005-04-14 3:51 UTC (permalink / raw) To: Petr Baudis; +Cc: Christopher Li, git On Thu, 14 Apr 2005, Petr Baudis wrote: > > http://www.theregister.co.uk/2005/04/11/torvalds_attack/ ... I'm nothing > like a regular reader of (R), but I thought the guys have at least a bit > of sense. Duh. :/ Or is April 11 now yet another joke day after April 1? I actually _am_ a fairly regular reader, and hey, being opinionated and a bit over the top is what makes the site worthwhile. It's obviously what motivates the people. And then, occasionally, when they bite you, hey, that's the price of having a high profile. I worry more about sometimes not listening to critics than I do about the critics themselves. Thick skin is the name of the game. I'd not get any work done otherwise. On that note - I've been avoiding doing the merge-tree thing, in the hope that somebody else does what I've described. I really do suck at scripting things, yet this is clearly something where using C to do a lot of the stuff is pointless. Almost all the parts do seem to be there, ie Daniel did the "common parent" part, and the rest really does seem to be more about scripting than writing more C plumbing stuff.. Linus ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Re: Merge with git-pasky II. 2005-04-14 3:51 ` Linus Torvalds @ 2005-04-14 1:23 ` Christopher Li 2005-04-14 5:03 ` Paul Jackson 2005-04-14 7:05 ` Junio C Hamano 1 sibling, 1 reply; 81+ messages in thread From: Christopher Li @ 2005-04-14 1:23 UTC (permalink / raw) To: Linus Torvalds; +Cc: Petr Baudis, git On Wed, Apr 13, 2005 at 08:51:50PM -0700, Linus Torvalds wrote: > > > On Thu, 14 Apr 2005, Petr Baudis wrote: > > Thick skin is the name of the game. I'd not get any work done otherwise. > > On that note - I've been avoiding doing the merge-tree thing, in the hope > that somebody else does what I've described. I really do suck at scripting > things, yet this is clearly something where using C to do a lot of the > stuff is pointless. > > Almost all the parts do seem to be there, ie Daniel did the "common > parent" part, and the rest really does seem to be more about scripting > than writing more C plumbing stuff.. Do you have preference about what language of script we used? I actually hesitated to introduce my Python script to git. I can build some script extension for git just like the one I did for sparse, is that some thing you want to see? Chris ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-14 1:23 ` Christopher Li @ 2005-04-14 5:03 ` Paul Jackson 2005-04-14 2:16 ` Christopher Li 0 siblings, 1 reply; 81+ messages in thread From: Paul Jackson @ 2005-04-14 5:03 UTC (permalink / raw) To: Christopher Li; +Cc: torvalds, pasky, git > Do you have preference about what language of script we used? Do you have a thick skin? <grin> Can you easily ignore language wars with an amused wave of the hand and a happy chuckle at the oh so predictable weaknesses of humans? Then I'd wager it will be fine. If you have a thin skin or tend to annoy others with a bit too much attitude or can't pass up a good language war (which is my failing, and why I am responding to a discussion that I've not been involved in for days) then the resulting flamage could be distracting. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401 ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-14 5:03 ` Paul Jackson @ 2005-04-14 2:16 ` Christopher Li 2005-04-14 6:16 ` Paul Jackson 0 siblings, 1 reply; 81+ messages in thread From: Christopher Li @ 2005-04-14 2:16 UTC (permalink / raw) To: Paul Jackson; +Cc: torvalds, pasky, git On Wed, Apr 13, 2005 at 10:03:41PM -0700, Paul Jackson wrote: > > If you have a thin skin or tend to annoy others with a bit too much > attitude or can't pass up a good language war (which is my failing, and > why I am responding to a discussion that I've not been involved in for > days) then the resulting flamage could be distracting. Oh, my bad. I am not trying to start a language war here. That is why I am hesitated about Python. Just try to find out the acceptability. No pushing. Chris ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-14 2:16 ` Christopher Li @ 2005-04-14 6:16 ` Paul Jackson 0 siblings, 0 replies; 81+ messages in thread From: Paul Jackson @ 2005-04-14 6:16 UTC (permalink / raw) To: Christopher Li; +Cc: torvalds, pasky, git > Oh, my bad. I am not trying to start a language war here. Neither am I - no problem what so ever. <chuckle ...> Besides, I think we'd be on the same side. My point was only a gentle one -- as is often the case when dealing with the strange species called human, whether or not you can get away with something is often a simple matter of ones attitude. There is one Python script already in the kernel: scripts/show_delta. But it's too small a sample to mean much. I think first thing is "get it right." Python is good for that in the hands of someone who enjoys coding in it. I wish you well. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401 ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-14 3:51 ` Linus Torvalds 2005-04-14 1:23 ` Christopher Li @ 2005-04-14 7:05 ` Junio C Hamano 2005-04-14 8:06 ` Linus Torvalds 1 sibling, 1 reply; 81+ messages in thread From: Junio C Hamano @ 2005-04-14 7:05 UTC (permalink / raw) To: Linus Torvalds; +Cc: Petr Baudis, Christopher Li, git >>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes: LT> On that note - I've been avoiding doing the merge-tree thing, in the hope LT> that somebody else does what I've described. I now have a Perl script that uses rev-tree, cat-file, diff-tree, show-files (with one modification so that it can deal with pathnames with embedded newlines), update-cache (with one modification so that I can add an entry for a file that does not exist to the dircache) and merge (from RCS). Quick and dirty. The changes to show-files is to give it an optional '-z' flag, which chanegs record terminator to NUL character instead of LF. The script git-merge.perl takes two head commits. It basically follows what you described as I remember ;-): 1. runs rev-tree with --edges to find the common anscestor. 2. creates a temporary directory "./,,merge-temp"; create a symlink ./,,merge-temp/.git/objects that points at .git/objects. 3. sets up dircache there, initially populated with this common ancestor tree. No files are checked out. Just set up .git/index and that's it. 4. runs diff-tree to find what has been changed in each head. 5. for each path involved: 5.0 if neither heads change it, leave it as is; 5.1 if only one head changes a path and the other does not, just get the changed version; 5.2 if both heads change it, check all three out and run merge. It does not currently commit. You can go to ./,,merge-temp/ and see show-diff to see the result of the merge. Files added in one head has already been run "update-cache" when the script ends, but changed and merged files are not---dircache still has the common ancestor view. So show-diff you will be seeing may be enormous and not very useful if two forks were done in the distant past. After reviewing the merge result, you can update-cache, write-tree and commit-tree as usual, but with one caveat: do not run "show-files | xargs update-cache" if you are running git-merge.perl without -f flag! By default, git-merge.perl creates absolute minimum number of files in ./,,merge-temp---only the merged files are left there so that you can inspect them. You will not see unmodified files nor files changed only by one side of the merge. If you give '-o' (oneside checkout) flag to git-merge.perl, then the files only one side of the merge changed are also checked out in ./,,merge-temp. If you give '-f' (full checkout) flag to git-merge.perl, then in addition to what '-o' checks out, unchanged files are checked out in ./,,merge-temp. This default is geared towards a huge tree with small merges (favorite case of Linus, if I understand correctly). Running 'show-diff' in such a sparsely populated merge result tree gives you huge results because recent show-diff shows diffs with empty files. I added a '-r' flag to show-diff, which squelches diffs with empty files. Also to implement 'changed only by one-side' without actually checking the file out, I needed to add one option to 'update-cache'. --cacheinfo flag is used this way: $ update-cache --cacheinfo mode sha1 path and adds the pathname with mode and sha1 to the .git/index without actually requiring you to have such a file there. Signed-off-by: Junio C Hamano <junkio@cox.net> --- show-diff.c | 11 ++- show-files.c | 12 ++- update-cache.c | 25 +++++++ git-merge.perl | 193 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 234 insertions(+), 7 deletions(-) show-diff.c: a531ca4078525d1c8dcf84aae0bfa89fed6e5d96 --- show-diff.c +++ show-diff.c 2005-04-13 22:47:33.000000000 -0700 @@ -58,15 +58,20 @@ int main(int argc, char **argv) { int silent = 0; + int silent_on_nonexisting_files = 0; int entries = read_cache(); int i; while (argc-- > 1) { if (!strcmp(argv[1], "-s")) { - silent = 1; + silent_on_nonexisting_files = silent = 1; continue; } - usage("show-diff [-s]"); + if (!strcmp(argv[1], "-r")) { + silent_on_nonexisting_files = 1; + continue; + } + usage("show-diff [-s] [-r]"); } if (entries < 0) { @@ -83,7 +88,7 @@ if (stat(ce->name, &st) < 0) { printf("%s: %s\n", ce->name, strerror(errno)); - if (errno == ENOENT && !silent) + if (errno == ENOENT && !silent_on_nonexisting_files) show_diff_empty(ce); continue; } show-files.c: a9fa6767a418f870a34b39379f417bf37b17ee18 --- show-files.c +++ show-files.c 2005-04-13 21:18:40.000000000 -0700 @@ -14,6 +14,7 @@ static int show_cached = 0; static int show_others = 0; static int show_ignored = 0; +static int line_terminator = '\n'; static const char **dir; static int nr_dir; @@ -105,12 +106,12 @@ } if (show_others) { for (i = 0; i < nr_dir; i++) - printf("%s\n", dir[i]); + printf("%s%c", dir[i], line_terminator); } if (show_cached) { for (i = 0; i < active_nr; i++) { struct cache_entry *ce = active_cache[i]; - printf("%s\n", ce->name); + printf("%s%c", ce->name, line_terminator); } } if (show_deleted) { @@ -119,7 +120,7 @@ struct stat st; if (!stat(ce->name, &st)) continue; - printf("%s\n", ce->name); + printf("%s%c", ce->name, line_terminator); } } if (show_ignored) { @@ -134,6 +135,11 @@ for (i = 1; i < argc; i++) { char *arg = argv[i]; + if (!strcmp(arg, "-z")) { + line_terminator = 0; + continue; + } + if (!strcmp(arg, "--cached")) { show_cached = 1; continue; update-cache.c: 8f149d5a4ab60e030a0ab19fdb59b8ee2576ee71 --- update-cache.c +++ update-cache.c 2005-04-13 23:27:54.000000000 -0700 @@ -203,6 +203,8 @@ { int i, newfd, entries; int allow_options = 1; + const char *sha1_force = NULL; + const char *mode_force = NULL; newfd = open(".git/index.lock", O_RDWR | O_CREAT | O_EXCL, 0600); if (newfd < 0) @@ -235,14 +237,35 @@ refresh_cache(); continue; } + if (!strcmp(path, "--cacheinfo")) { + mode_force = argv[++i]; + sha1_force = argv[++i]; + continue; + } die("unknown option %s", path); } if (!verify_path(path)) { fprintf(stderr, "Ignoring path %s\n", argv[i]); continue; } - if (add_file_to_cache(path)) + if (sha1_force && mode_force) { + struct cache_entry *ce; + int namelen = strlen(path); + int mode; + int size = cache_entry_size(namelen); + sscanf(mode_force, "%o", &mode); + ce = malloc(size); + memset(ce, 0, size); + memcpy(ce->name, path, namelen); + ce->namelen = namelen; + ce->st_mode = mode; + get_sha1_hex(sha1_force, ce->sha1); + + add_cache_entry(ce, 1); + } + else if (add_file_to_cache(path)) die("Unable to add %s to database", path); + mode_force = sha1_force = NULL; } if (write_cache(newfd, active_cache, active_nr) || rename(".git/index.lock", ".git/index")) --- /dev/null 2005-03-19 15:28:25.000000000 -0800 +++ git-merge.perl 2005-04-13 23:45:23.000000000 -0700 @@ -0,0 +1,193 @@ +#!/usr/bin/perl -w + +use Getopt::Long; + +my $full_checkout = 0; +my $oneside_checkout = 0; +GetOptions("full" => \$full_checkout, + "oneside" => \$oneside_checkout) + or die; + +if ($full_checkout) { + $oneside_checkout = 1; +} + +sub read_rev_tree { + my (@head) = @_; + my ($fhi); + open $fhi, '-|', 'rev-tree', '--edges', @head + or die "$!: rev-tree --edges @head"; + my $common; + while (<$fhi>) { + chomp; + (undef, undef, $common) = split(/ /, $_); + if ($common =~ s/^([a-f0-f]{40}):\d+$/$1/) { + last; + } + } + close $fhi; + return $common; +} + +sub read_commit_tree { + my ($commit) = @_; + my ($fhi); + open $fhi, '-|', 'cat-file', 'commit', $commit + or die "$!: cat-file commit $commit"; + my $tree = <$fhi>; + close $fhi; + $tree =~ s/^tree //; + return $tree; +} + +sub read_diff_tree { + my (@tree) = @_; + my ($fhi); + local ($_, $/); + $/ = "\0"; + my %path; + open $fhi, '-|', 'diff-tree', '-r', @tree + or die "$!: diff-tree -r @tree"; + while (<$fhi>) { + chomp; + if (/^\*[0-7]+->([0-7]+)\tblob\t[0-9a-f]+->([0-9a-f]{40})\t(.*)$/s) { + # mode newsha path + $path{$3} = [$1, $2]; + } + elsif (/^\+([0-7]+)\tblob\t([0-9a-f]{40})\t(.*)$/s) { + # mode newsha path + $path{$3} = [$1, $2]; + } + else { + print STDERR "$_??"; + } + } + close $fhi; + return %path; +} + +sub read_show_files { + my ($fhi); + local ($_, $/); + $/ = "\0"; + open $fhi, '-|', 'show-files', '-z' + or die "$!: show-files -z"; + my (@path) = map { chomp; $_ } <$fhi>; + close $fhi; + return @path; +} + +sub checkout_file { + my ($path, $info) = @_; + my (@elt) = split(/\//, $path); + my $j = ''; + my $tail = pop @elt; + my ($fhi, $fho); + for (@elt) { + mkdir "$j$_"; + $j = "$j$_/"; + } + open $fho, '>', "$path"; + open $fhi, '-|', 'cat-file', 'blob', $info->[1] + or die "$!: cat-file blob $info->[1]"; + while (<$fhi>) { + print $fho $_; + } + close $fhi; + close $fho; + chmod oct("0$info->[0]"), "$path"; +} + +sub record_file { + my ($path, $info) = @_; + system 'update-cache', '--cacheinfo', @$info, $path; +} + +sub merge_tree { + my ($path, $info0, $info1) = @_; + print STDERR "M - $path\n"; + checkout_file(',,merge-0', $info0); + checkout_file(',,merge-1', $info1); + system 'checkout-cache', $path; + my ($fhi, $fho); + open $fhi, '-|', 'merge', '-p', ',,merge-0', $path, ',,merge-1'; + open $fho, '>', "$path+"; + local ($/); + while (<$fhi>) { print $fho $_; } + close $fhi; + close $fho; + unlink ',,merge-0', ',,merge-1'; + rename "$path+", $path; + # There is no reason to prefer info0 over info1 but + # we need to pick one. + chmod oct("0$info0->[0]"), "$path"; +} + +# Find common ancestor of two trees. +my $common = read_rev_tree(@ARGV); +print "Common ancestor: $common\n"; + +# Create a temporary directory and go there. +system 'rm', '-rf', ',,merge-temp'; +for ((',,merge-temp', '.git')) { mkdir $_; chdir $_; } +symlink "../../.git/objects", "objects"; +chdir '..'; + +my $ancestor_tree = read_commit_tree($common); +system 'read-tree', $ancestor_tree; + +my %tree0 = read_diff_tree($ancestor_tree, read_commit_tree($ARGV[0])); +my %tree1 = read_diff_tree($ancestor_tree, read_commit_tree($ARGV[1])); + +my @ancestor_file = read_show_files(); +my %ancestor_file = map { $_ => 1 } @ancestor_file; + +for (@ancestor_file) { + if (! exists $tree0{$_} && ! exists $tree1{$_}) { + if ($full_checkout) { + system 'checkout-cache', $_; + } + print STDERR "O - $_\n"; + } +} + +my %need_merge = (); + +for $path (keys %tree0) { + if (! exists $tree1{$path}) { + # Only changed in tree 0 --- take his version + print STDERR "0 - $path\n"; + if (! exists $ancestor_file{$path}) { + checkout_file($path, $tree0{$path}); + system 'update-cache', '--add', "$path"; + } + elsif ($oneside_checkout) { + checkout_file($path, $tree0{$path}); + } + else { + record_file($path, $tree0{$path}); + } + } + else { + merge_tree($path, $tree0{$path}, $tree1{$path}); + } +} + +for $path (keys %tree1) { + if (! exists $tree0{$path}) { + # Only changed in tree 1 --- take his version + print STDERR "1 - $path\n"; + if (! exists $ancestor_file{$path}) { + checkout_file($path, $tree1{$path}); + system 'update-cache', '--add', "$path"; + } + elsif ($oneside_checkout) { + checkout_file($path, $tree1{$path}); + } + else { + record_file($path, $tree1{$path}); + } + } +} + +# system 'show-diff'; ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-14 7:05 ` Junio C Hamano @ 2005-04-14 8:06 ` Linus Torvalds 2005-04-14 8:39 ` Junio C Hamano 2005-04-15 19:57 ` Junio C Hamano 0 siblings, 2 replies; 81+ messages in thread From: Linus Torvalds @ 2005-04-14 8:06 UTC (permalink / raw) To: Junio C Hamano; +Cc: Petr Baudis, Christopher Li, git On Thu, 14 Apr 2005, Junio C Hamano wrote: > > I now have a Perl script that uses rev-tree, cat-file, > diff-tree, show-files (with one modification so that it can deal > with pathnames with embedded newlines), update-cache (with one > modification so that I can add an entry for a file that does not > exist to the dircache) and merge (from RCS). Quick and dirty. That's exactly what I wanted. Q'n'D is how the ball gets rolling. In the meantime I wrote a very stupid "merge-tree" which does things slightly differently, but I really think your approach (aka my original approach) is actually a lot faster. I was just starting to worry that the ball didn't start, so I wrote an even hackier one. My really hacky one is called "merge-tree", and it really only merges one directory. For each entry in the directory it says either select <mode> <sha1> path or merge <mode>-><mode>,<mode> <sha1>-><sha1>,<sha1> path depending on whether it could directly select the right object or not. It's actually exactly the same algorithm as the first one, but I was afraid the first one would be so abstract that it (a) might not work and (b) wouldn't get people to work it out. This "one directory at a time with very explicit output" thing is much more down-to-earth, but it's also likely slower because it will need script help more often. That said, I don't know. MOST of the time there will be just a single "directory" entry that needs merging, and then the script would just need to recurse into that directory with the new "tree" objects. So it might not be too horrible. But I'm really happy that you seem to have implemented my first suggestion and I seem to have been wasting my time. > 5. for each path involved: > > 5.0 if neither heads change it, leave it as is; > 5.1 if only one head changes a path and the other does not, just > get the changed version; > 5.2 if both heads change it, check all three out and run merge. You missed one case: 5.0.1 if both heads change it to the same thing, take the new thing but maybe you counted that as 5.0 (it _should_ fall out automatically from the fact that "diff-tree" between the two destination trees shows no difference for such a file). Now, arguably, your 5.2 will do things right, but the thing is, it's actually fairly _common_ that both heads have changed something to the same thing. Namely if there was a previous merge that already handled that case, but that previous merge may not be a proper parent of the new commits. So from a performance standpoint you really don't want to consider that to be a merge - you just pick up the new contents directly. See? (My stupid "merge-tree" should show the algorithm in painful obviousity. Of course, my stipid merge-tree may also be painfully buggy. You be the judge). > It does not currently commit. You can go to ./,,merge-temp/ and > see show-diff to see the result of the merge. Files added in > one head has already been run "update-cache" when the script > ends, but changed and merged files are not---dircache still has > the common ancestor view. That sounds good. > Also to implement 'changed only by one-side' without actually > checking the file out, I needed to add one option to > 'update-cache'. --cacheinfo flag is used this way: > > $ update-cache --cacheinfo mode sha1 path Yes. My "merge-tree" needs the exact same thing. Looks good from your explanation, but I'm too tired to look at the code. It's 1AM, and the kids get up at 7. I'm not much of a hacker, I usually crash by 10PM these days ;^) Linus ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-14 8:06 ` Linus Torvalds @ 2005-04-14 8:39 ` Junio C Hamano 2005-04-14 9:10 ` Linus Torvalds 2005-04-15 19:57 ` Junio C Hamano 1 sibling, 1 reply; 81+ messages in thread From: Junio C Hamano @ 2005-04-14 8:39 UTC (permalink / raw) To: Linus Torvalds; +Cc: Petr Baudis, Christopher Li, git >>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes: LT> But I'm really happy that you seem to have implemented my first LT> suggestion and I seem to have been wasting my time. Thanks for the kind words. >> 5. for each path involved: >> >> 5.0 if neither heads change it, leave it as is; >> 5.1 if only one head changes a path and the other does not, just >> get the changed version; >> 5.2 if both heads change it, check all three out and run merge. LT> You missed one case: LT> 5.0.1 if both heads change it to the same thing, take the new thing LT> but maybe you counted that as 5.0 (it _should_ fall out automatically from LT> the fact that "diff-tree" between the two destination trees shows no LT> difference for such a file). Actually I am not handling that. It really is 5.1a---the exact same code path as 5.1 can be used for this case, and as you point out it is really a quite important optimization. I have to handle the following cases. I think I currently do wrong things to them: 5.1a both head modify to the same thing. 5.1b one head removes, the other does not do anything. 5.1c both head remove. 5.3 one head removes, the other head modifies. Handling of 5.1a, 5.1b and 5.1c are obvious. 5.1a Update dircache to the same new thing. Without -f or -o flag do not touch ,,merge-temp/. directory; with -f or -o, leave the new file in ,,merge-temp/. 5.1b Remove the path from dircache and do not have the file in ,,merge-temp/. directory regardless of -f or -o flags. 5.1c Same as 5.1b I am not sure what to do with 5.3. My knee-jerk reaction is to leave the modified result in ,,merge-temp/$path~ without touching dircache. If the merger wants to pick it up, he can rename $path~ to $path temporarily, run show-diff on it (I think giving an option to show-diff to specify paths would be helpful for this workflow), to decide if he wants to keep the file or not. Suggestions? ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-14 8:39 ` Junio C Hamano @ 2005-04-14 9:10 ` Linus Torvalds 2005-04-14 11:14 ` Junio C Hamano 0 siblings, 1 reply; 81+ messages in thread From: Linus Torvalds @ 2005-04-14 9:10 UTC (permalink / raw) To: Junio C Hamano; +Cc: Petr Baudis, Christopher Li, git On Thu, 14 Apr 2005, Junio C Hamano wrote: > > I have to handle the following cases. I think I currently do > wrong things to them: > > 5.1a both head modify to the same thing. > 5.1b one head removes, the other does not do anything. > 5.1c both head remove. > 5.3 one head removes, the other head modifies. There's another interesting set of cases: one side creates a file, and the other one creates a directory. > I am not sure what to do with 5.3. My very _strong_ preference is to just inform the user about a merge that cannot be performed, and not let it be automated. BIG warning, with some way for the user to specify the end result. The thing is, these are pretty rare cases. But in order to make people feel good about the _common_ case, it's important that they feel safe about the rare one. Put another way: if git tells me when it can't do something (with some specificity), I can then fix the situation up and try again. I might curse a while, and maybe it ends up being so common that I might even automate it, but at least I'll be able to trust the end result. In contrast, if git does something that _may_ be nonsensical, then I'll worry all the time, and not trust git. That's much worse than an occasional curse. So the rule should be: only merge when it's "obviously the right thing". If it's not obvious, the merge should _not_ try to guess what the right thing is. It's much better to fail loudly. (That's especially true early on. There may be cases that end up being obvious after some usage. But I'd rather find them by having git be too stupid, than find out the hard way that git lost some data because it thought it was ok to remove a file that had been modified) Linus ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-14 9:10 ` Linus Torvalds @ 2005-04-14 11:14 ` Junio C Hamano 2005-04-14 12:16 ` Petr Baudis 0 siblings, 1 reply; 81+ messages in thread From: Junio C Hamano @ 2005-04-14 11:14 UTC (permalink / raw) To: Linus Torvalds; +Cc: git Here is a diff to update the git-merge.perl script I showed you earlier today ;-). It contains the following updates against your HEAD (bb95843a5a0f397270819462812735ee29796fb4). * git-merge.perl command we talked about on the git list. I've covered the changed-to-the-same case etc. I still haven't done anything about file-vs-directory case yet. It does warn when it needed to run merge to automerge and let merge give a warning message about conflicts if any. In modify/remove cases, modified in one but removed in the other files are left in either $path~A~ or $path~B~ in the merge temporary directory, and the script issues a warning at the end. * show-files and ls-tree updates to add -z flag to NUL terminate records; this is needed for git-merge.perl to work. * show-diff updates to add -r flag to squelch diffs for files not in the working directory. This is mainly useful when verifying the result of an automated merge. * update-cache updates to add "--cacheinfo mode sha1" flag to register a file that is not in the current working directory. Needed for minimum-checkout merging by git-merge.perl. Signed-off-by: Junio C Hamano <junkio@cox.net> --- git-merge.perl | 247 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ls-tree.c | 9 +- show-diff.c | 11 +- show-files.c | 12 ++ update-cache.c | 25 +++++ 5 files changed, 296 insertions(+), 8 deletions(-) diff -x .git -Nru ,,1/git-merge.perl ,,2/git-merge.perl --- ,,1/git-merge.perl 1969-12-31 16:00:00.000000000 -0800 +++ ,,2/git-merge.perl 2005-04-14 04:00:14.000000000 -0700 @@ -0,0 +1,247 @@ +#!/usr/bin/perl -w + +use Getopt::Long; + +my $full_checkout = 0; +my $oneside_checkout = 0; +GetOptions("full" => \$full_checkout, + "oneside" => \$oneside_checkout) + or die; + +if ($full_checkout) { + $oneside_checkout = 1; +} + +sub read_rev_tree { + my (@head) = @_; + my ($fhi); + open $fhi, '-|', 'rev-tree', '--edges', @head + or die "$!: rev-tree --edges @head"; + my %common; + while (<$fhi>) { + chomp; + (undef, undef, my @common) = split(/ /, $_); + for (@common) { + if (s/^([a-f0-f]{40}):3$/$1/) { + $common{$_}++; + } + } + } + close $fhi; + + my @common = (map { $_->[1] } + sort { $b->[0] <=> $a->[0] } + map { [ $common{$_} => $_ ] } + keys %common); + + return $common[0]; +} + +sub read_commit_tree { + my ($commit) = @_; + my ($fhi); + open $fhi, '-|', 'cat-file', 'commit', $commit + or die "$!: cat-file commit $commit"; + my $tree = <$fhi>; + close $fhi; + $tree =~ s/^tree //; + return $tree; +} + +# Reads diff-tree -r output and gives a hash that maps a path +# to 3-tuple (old-mode new-mode new-sha). +# When creating, old-mode is undef. When removing, new-* are undef. +sub read_diff_tree { + my (@tree) = @_; + my ($fhi); + local ($_, $/); + $/ = "\0"; + my %path; + open $fhi, '-|', 'diff-tree', '-r', @tree + or die "$!: diff-tree -r @tree"; + while (<$fhi>) { + chomp; + if (/^\*([0-7]+)->([0-7]+)\tblob\t[0-9a-f]+->([0-9a-f]{40})\t(.*)$/s) { + $path{$4} = [$1, $2, $3]; + } + elsif (/^\+([0-7]+)\tblob\t([0-9a-f]{40})\t(.*)$/s) { + $path{$3} = [undef, $1, $2]; + } + elsif (/^\-([0-7]+)\tblob\t[0-9a-f]{40}\t(.*)$/s) { + $path{$2} = [$1, undef, undef]; + } + else { + die "cannot parse diff-tree output: $_"; + } + } + close $fhi; + return %path; +} + +sub read_show_files { + my ($fhi); + local ($_, $/); + $/ = "\0"; + open $fhi, '-|', 'show-files', '-z' + or die "$!: show-files -z"; + my (@path) = map { chomp; $_ } <$fhi>; + close $fhi; + return @path; +} + +sub checkout_file { + my ($path, $info) = @_; + my (@elt) = split(/\//, $path); + my $j = ''; + my $tail = pop @elt; + my ($fhi, $fho); + for (@elt) { + mkdir "$j$_"; + $j = "$j$_/"; + } + open $fho, '>', "$path"; + open $fhi, '-|', 'cat-file', 'blob', $info->[2] + or die "$!: cat-file blob $info->[2]"; + while (<$fhi>) { + print $fho $_; + } + close $fhi; + close $fho; + chmod oct("0$info->[1]"), "$path"; +} + +sub record_file { + my ($path, $info) = @_; + system ('update-cache', '--add', '--cacheinfo', + $info->[1], $info->[2], $path); +} + +sub merge_tree { + my ($path, $info0, $info1) = @_; + checkout_file(',,merge-0', $info0); + checkout_file(',,merge-1', $info1); + system 'checkout-cache', $path; + my ($fhi, $fho); + open $fhi, '-|', 'merge', '-p', ',,merge-0', $path, ',,merge-1'; + open $fho, '>', "$path+"; + local ($/); + while (<$fhi>) { print $fho $_; } + close $fhi; + close $fho; + unlink ',,merge-0', ',,merge-1'; + rename "$path+", $path; + # There is no reason to prefer info0 over info1 but + # we need to pick one. + chmod oct("0$info0->[1]"), "$path"; +} + +# Find common ancestor of two trees. +my $common = read_rev_tree(@ARGV); +print "Common ancestor: $common\n"; + +# Create a temporary directory and go there. +system 'rm', '-rf', ',,merge-temp'; +for ((',,merge-temp', '.git')) { mkdir $_; chdir $_; } +symlink "../../.git/objects", "objects"; +chdir '..'; + +my $ancestor_tree = read_commit_tree($common); +system 'read-tree', $ancestor_tree; + +my %tree0 = read_diff_tree($ancestor_tree, read_commit_tree($ARGV[0])); +my %tree1 = read_diff_tree($ancestor_tree, read_commit_tree($ARGV[1])); + +my @ancestor_file = read_show_files(); +my %ancestor_file = map { $_ => 1 } @ancestor_file; + +for (@ancestor_file) { + if (! exists $tree0{$_} && ! exists $tree1{$_}) { + if ($full_checkout) { + system 'checkout-cache', $_; + } + print STDERR "O - $_\n"; + } +} + +for my $set ([\%tree0, \%tree1, 'A'], [\%tree1, \%tree0, 'B']) { + my ($treeA, $treeB, $side) = @$set; + while (my ($path, $info) = each %$treeA) { + # In this loop we do not deal with overlaps. + next if (exists $treeB->{$path}); + + if (! defined $info->[1]) { + # deleted in this tree only. + unlink $path; + system 'update-cache', '--remove', $path; + print STDERR "$side D $path\n"; + } + else { + # modified or created in this tree only. + print STDERR "$side M $path\n"; + if ($oneside_checkout) { + checkout_file($path, $info); + system 'update-cache', '--add', "$path"; + } else { + record_file($path, $info); + } + } + } +} + +my @warning = (); + +while (my ($path, $info0) = each %tree0) { + # We need to deal only with overlaps. + next if (!exists $tree1{$path}); + + my $info1 = $tree1{$path}; + if (! defined $info0->[1]) { + # deleted in this tree. + if (! defined $info1->[1]) { + # deleted in both trees. Obvious. + print STDERR "*DD $path\n"; + unlink $path; + system 'update-cache', '--remove', $path; + } + else { + # oops. tree0 wants to remove but tree1 wants to modify it. + print STDERR "*DM $path\n"; + checkout_file("$path~B~", $info1); + push @warning, $path; + } + } + else { + # modified or created in tree0 + if (! defined $info1->[1]) { + # oops. tree0 wants to modify but tree1 wants to remove it. + print STDERR "*MD $path\n"; + checkout_file("$path~A~", $info0); + push @warning, $path; + } + else { + # modified both in tree0 and tree1 + # are they modifying to the same contents? + if ($info0->[2] eq $info1->[2]) { + # just mode changes (or no changes) + # we prefer tree0 over tree1 for no particular reason. + print STDERR "*MM $path\n"; + record_file($path, $info0); + } + else { + # modified in both. Needs merge. + print STDERR "MRG $path\n"; + merge_tree($path, $info0, $info1); + } + } + } +} + +if (@warning) { + print "\nThere are some files that were deleted in one branch and\n" + . "modified in another. Please examine them carefully:\n"; + for (@warning) { + print "$_\n"; + } +} + +# system 'show-diff'; diff -x .git -Nru ,,1/ls-tree.c ,,2/ls-tree.c --- ,,1/ls-tree.c 2005-04-14 03:47:18.000000000 -0700 +++ ,,2/ls-tree.c 2005-04-14 04:00:14.000000000 -0700 @@ -5,6 +5,8 @@ */ #include "cache.h" +int line_termination = '\n'; + static int list(unsigned char *sha1) { void *buffer; @@ -31,7 +33,8 @@ * It seems not worth it to read each file just to get this * and the file size. -- pasky@ucw.cz */ type = S_ISDIR(mode) ? "tree" : "blob"; - printf("%03o\t%s\t%s\t%s\n", mode, type, sha1_to_hex(sha1), path); + printf("%03o\t%s\t%s\t%s%c", mode, type, sha1_to_hex(sha1), + path, line_termination); } return 0; } @@ -40,6 +43,10 @@ { unsigned char sha1[20]; + if (argc == 3 && !strcmp(argv[1], "-z")) { + line_termination = 0; + argc--; argv++; + } if (argc != 2) usage("ls-tree <key>"); if (get_sha1_hex(argv[1], sha1) < 0) diff -x .git -Nru ,,1/show-diff.c ,,2/show-diff.c --- ,,1/show-diff.c 2005-04-14 03:47:18.000000000 -0700 +++ ,,2/show-diff.c 2005-04-14 04:00:14.000000000 -0700 @@ -58,15 +58,20 @@ int main(int argc, char **argv) { int silent = 0; + int silent_on_nonexisting_files = 0; int entries = read_cache(); int i; while (argc-- > 1) { if (!strcmp(argv[1], "-s")) { - silent = 1; + silent_on_nonexisting_files = silent = 1; continue; } - usage("show-diff [-s]"); + if (!strcmp(argv[1], "-r")) { + silent_on_nonexisting_files = 1; + continue; + } + usage("show-diff [-s] [-r]"); } if (entries < 0) { @@ -83,7 +88,7 @@ if (stat(ce->name, &st) < 0) { printf("%s: %s\n", ce->name, strerror(errno)); - if (errno == ENOENT && !silent) + if (errno == ENOENT && !silent_on_nonexisting_files) show_diff_empty(ce); continue; } diff -x .git -Nru ,,1/show-files.c ,,2/show-files.c --- ,,1/show-files.c 2005-04-14 03:47:18.000000000 -0700 +++ ,,2/show-files.c 2005-04-14 04:00:14.000000000 -0700 @@ -14,6 +14,7 @@ static int show_cached = 0; static int show_others = 0; static int show_ignored = 0; +static int line_terminator = '\n'; static const char **dir; static int nr_dir; @@ -105,12 +106,12 @@ } if (show_others) { for (i = 0; i < nr_dir; i++) - printf("%s\n", dir[i]); + printf("%s%c", dir[i], line_terminator); } if (show_cached) { for (i = 0; i < active_nr; i++) { struct cache_entry *ce = active_cache[i]; - printf("%s\n", ce->name); + printf("%s%c", ce->name, line_terminator); } } if (show_deleted) { @@ -119,7 +120,7 @@ struct stat st; if (!stat(ce->name, &st)) continue; - printf("%s\n", ce->name); + printf("%s%c", ce->name, line_terminator); } } if (show_ignored) { @@ -134,6 +135,11 @@ for (i = 1; i < argc; i++) { char *arg = argv[i]; + if (!strcmp(arg, "-z")) { + line_terminator = 0; + continue; + } + if (!strcmp(arg, "--cached")) { show_cached = 1; continue; diff -x .git -Nru ,,1/update-cache.c ,,2/update-cache.c --- ,,1/update-cache.c 2005-04-14 03:47:18.000000000 -0700 +++ ,,2/update-cache.c 2005-04-14 04:00:14.000000000 -0700 @@ -250,6 +250,8 @@ { int i, newfd, entries; int allow_options = 1; + const char *sha1_force = NULL; + const char *mode_force = NULL; newfd = open(".git/index.lock", O_RDWR | O_CREAT | O_EXCL, 0600); if (newfd < 0) @@ -282,14 +284,35 @@ refresh_cache(); continue; } + if (!strcmp(path, "--cacheinfo")) { + mode_force = argv[++i]; + sha1_force = argv[++i]; + continue; + } die("unknown option %s", path); } if (!verify_path(path)) { fprintf(stderr, "Ignoring path %s\n", argv[i]); continue; } - if (add_file_to_cache(path)) + if (sha1_force && mode_force) { + struct cache_entry *ce; + int namelen = strlen(path); + int mode; + int size = cache_entry_size(namelen); + sscanf(mode_force, "%o", &mode); + ce = malloc(size); + memset(ce, 0, size); + memcpy(ce->name, path, namelen); + ce->namelen = namelen; + ce->st_mode = mode; + get_sha1_hex(sha1_force, ce->sha1); + + add_cache_entry(ce, 1); + } + else if (add_file_to_cache(path)) die("Unable to add %s to database", path); + mode_force = sha1_force = NULL; } if (write_cache(newfd, active_cache, active_nr) || rename(".git/index.lock", ".git/index")) ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Re: Merge with git-pasky II. 2005-04-14 11:14 ` Junio C Hamano @ 2005-04-14 12:16 ` Petr Baudis 2005-04-14 18:12 ` Junio C Hamano 0 siblings, 1 reply; 81+ messages in thread From: Petr Baudis @ 2005-04-14 12:16 UTC (permalink / raw) To: Junio C Hamano; +Cc: Linus Torvalds, git Dear diary, on Thu, Apr 14, 2005 at 01:14:13PM CEST, I got a letter where Junio C Hamano <junkio@cox.net> told me that... > Here is a diff to update the git-merge.perl script I showed you > earlier today ;-). It contains the following updates against > your HEAD (bb95843a5a0f397270819462812735ee29796fb4). Bah, you outran me. ;-) > * git-merge.perl command we talked about on the git list. I've > covered the changed-to-the-same case etc. I still haven't done > anything about file-vs-directory case yet. > > It does warn when it needed to run merge to automerge and let > merge give a warning message about conflicts if any. In > modify/remove cases, modified in one but removed in the other > files are left in either $path~A~ or $path~B~ in the merge > temporary directory, and the script issues a warning at the > end. I think I will take it rather my working git merge implementation - it's getting insane in bash. ;-) I'll change it to use the cool git-pasky stuff (commit-id etc) and its style of committing - that is, it will merely record the update-caches to be done upon commit, and it will read-tree the branch we are merging to instead of the ancestor. (So that git diff gives useful output.) > * show-files and ls-tree updates to add -z flag to NUL terminate records; > this is needed for git-merge.perl to work. > > * show-diff updates to add -r flag to squelch diffs for files not in > the working directory. This is mainly useful when verifying the > result of an automated merge. -r traditionally means recursive - what's the reasoning behind the choice of this letter? > * update-cache updates to add "--cacheinfo mode sha1" flag to register > a file that is not in the current working directory. Needed for > minimum-checkout merging by git-merge.perl. > > > Signed-off-by: Junio C Hamano <junkio@cox.net> > > --- > > git-merge.perl | 247 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > ls-tree.c | 9 +- > show-diff.c | 11 +- > show-files.c | 12 ++ > update-cache.c | 25 +++++ > 5 files changed, 296 insertions(+), 8 deletions(-) > > diff -x .git -Nru ,,1/git-merge.perl ,,2/git-merge.perl > --- ,,1/git-merge.perl 1969-12-31 16:00:00.000000000 -0800 > +++ ,,2/git-merge.perl 2005-04-14 04:00:14.000000000 -0700 > @@ -0,0 +1,247 @@ > +#!/usr/bin/perl -w > + > +use Getopt::Long; use strict? > + > +my $full_checkout = 0; > +my $oneside_checkout = 0; > +GetOptions("full" => \$full_checkout, > + "oneside" => \$oneside_checkout) > + or die; > + > +if ($full_checkout) { > + $oneside_checkout = 1; > +} > + > +sub read_rev_tree { > + my (@head) = @_; > + my ($fhi); > + open $fhi, '-|', 'rev-tree', '--edges', @head > + or die "$!: rev-tree --edges @head"; > + my %common; > + while (<$fhi>) { > + chomp; > + (undef, undef, my @common) = split(/ /, $_); > + for (@common) { > + if (s/^([a-f0-f]{40}):3$/$1/) { > + $common{$_}++; > + } > + } > + } > + close $fhi; > + > + my @common = (map { $_->[1] } > + sort { $b->[0] <=> $a->[0] } > + map { [ $common{$_} => $_ ] } > + keys %common); > + > + return $common[0]; > +} It'd be simpler to do just my @common = (map { $common{$_} } sort { $b <=> $a } keys %common) But I really think this is a horrible heuristic. I believe you should take the latest commit in the --edges output, and from that choose the base whose rev-tree --edges the_base merged_branch has the least lines on output. (That is, the path to it is shortest - ideally it's already part of the merged_branch.) > + > +sub read_commit_tree { > + my ($commit) = @_; > + my ($fhi); > + open $fhi, '-|', 'cat-file', 'commit', $commit > + or die "$!: cat-file commit $commit"; > + my $tree = <$fhi>; > + close $fhi; > + $tree =~ s/^tree //; > + return $tree; > +} > + > +# Reads diff-tree -r output and gives a hash that maps a path > +# to 3-tuple (old-mode new-mode new-sha). > +# When creating, old-mode is undef. When removing, new-* are undef. What about sub OLDMODE { 0 } sub NEWMODE { 1 } sub NEWSHA { 2 } and then using that when accessing the tuple? Would make the code much more readable. > +sub read_diff_tree { > + my (@tree) = @_; > + my ($fhi); > + local ($_, $/); > + $/ = "\0"; > + my %path; > + open $fhi, '-|', 'diff-tree', '-r', @tree > + or die "$!: diff-tree -r @tree"; > + while (<$fhi>) { > + chomp; > + if (/^\*([0-7]+)->([0-7]+)\tblob\t[0-9a-f]+->([0-9a-f]{40})\t(.*)$/s) { > + $path{$4} = [$1, $2, $3]; > + } > + elsif (/^\+([0-7]+)\tblob\t([0-9a-f]{40})\t(.*)$/s) { > + $path{$3} = [undef, $1, $2]; > + } > + elsif (/^\-([0-7]+)\tblob\t[0-9a-f]{40}\t(.*)$/s) { > + $path{$2} = [$1, undef, undef]; > + } > + else { > + die "cannot parse diff-tree output: $_"; > + } > + } > + close $fhi; > + return %path; > +} > + > +sub read_show_files { > + my ($fhi); > + local ($_, $/); > + $/ = "\0"; > + open $fhi, '-|', 'show-files', '-z' > + or die "$!: show-files -z"; > + my (@path) = map { chomp; $_ } <$fhi>; > + close $fhi; > + return @path; > +} > + > +sub checkout_file { > + my ($path, $info) = @_; > + my (@elt) = split(/\//, $path); > + my $j = ''; > + my $tail = pop @elt; > + my ($fhi, $fho); > + for (@elt) { > + mkdir "$j$_"; > + $j = "$j$_/"; > + } > + open $fho, '>', "$path"; > + open $fhi, '-|', 'cat-file', 'blob', $info->[2] > + or die "$!: cat-file blob $info->[2]"; > + while (<$fhi>) { > + print $fho $_; > + } > + close $fhi; > + close $fho; > + chmod oct("0$info->[1]"), "$path"; > +} > + > +sub record_file { > + my ($path, $info) = @_; > + system ('update-cache', '--add', '--cacheinfo', > + $info->[1], $info->[2], $path); > +} > + > +sub merge_tree { > + my ($path, $info0, $info1) = @_; > + checkout_file(',,merge-0', $info0); > + checkout_file(',,merge-1', $info1); > + system 'checkout-cache', $path; > + my ($fhi, $fho); > + open $fhi, '-|', 'merge', '-p', ',,merge-0', $path, ',,merge-1'; > + open $fho, '>', "$path+"; > + local ($/); > + while (<$fhi>) { print $fho $_; } > + close $fhi; > + close $fho; > + unlink ',,merge-0', ',,merge-1'; > + rename "$path+", $path; > + # There is no reason to prefer info0 over info1 but > + # we need to pick one. > + chmod oct("0$info0->[1]"), "$path"; > +} It is a good idea to check merge's exit code and give a notice at the end if there were any conflicts. > + > +# Find common ancestor of two trees. > +my $common = read_rev_tree(@ARGV); > +print "Common ancestor: $common\n"; > + > +# Create a temporary directory and go there. > +system 'rm', '-rf', ',,merge-temp'; Can't we call it just ,,merge? > +for ((',,merge-temp', '.git')) { mkdir $_; chdir $_; } > +symlink "../../.git/objects", "objects"; > +chdir '..'; > + > +my $ancestor_tree = read_commit_tree($common); > +system 'read-tree', $ancestor_tree; > + > +my %tree0 = read_diff_tree($ancestor_tree, read_commit_tree($ARGV[0])); > +my %tree1 = read_diff_tree($ancestor_tree, read_commit_tree($ARGV[1])); > + > +my @ancestor_file = read_show_files(); > +my %ancestor_file = map { $_ => 1 } @ancestor_file; > + > +for (@ancestor_file) { > + if (! exists $tree0{$_} && ! exists $tree1{$_}) { > + if ($full_checkout) { > + system 'checkout-cache', $_; > + } > + print STDERR "O - $_\n"; Huh, what are you trying to do here? I think you should just record remove, no? (And I wouldn't do anything with my read-tree. ;-) > + } > +} > + > +for my $set ([\%tree0, \%tree1, 'A'], [\%tree1, \%tree0, 'B']) { > + my ($treeA, $treeB, $side) = @$set; > + while (my ($path, $info) = each %$treeA) { > + # In this loop we do not deal with overlaps. > + next if (exists $treeB->{$path}); > + > + if (! defined $info->[1]) { > + # deleted in this tree only. > + unlink $path; > + system 'update-cache', '--remove', $path; > + print STDERR "$side D $path\n"; > + } > + else { > + # modified or created in this tree only. > + print STDERR "$side M $path\n"; > + if ($oneside_checkout) { > + checkout_file($path, $info); > + system 'update-cache', '--add', "$path"; > + } else { > + record_file($path, $info); > + } > + } > + } > +} ..snip.. Hmm, I think I will just need to play with the script a lot. ;-) -- Petr "Pasky" Baudis Stuff: http://pasky.or.cz/ C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-14 12:16 ` Petr Baudis @ 2005-04-14 18:12 ` Junio C Hamano 2005-04-14 18:36 ` Linus Torvalds ` (2 more replies) 0 siblings, 3 replies; 81+ messages in thread From: Junio C Hamano @ 2005-04-14 18:12 UTC (permalink / raw) To: Petr Baudis; +Cc: Linus Torvalds, git >>>>> "PB" == Petr Baudis <pasky@ucw.cz> writes: PB> Bah, you outran me. ;-) Just being in a different timezone, I guess. PB> I'll change it to use the cool git-pasky stuff (commit-id etc) and its PB> style of committing - that is, it will merely record the update-caches PB> to be done upon commit, and it will read-tree the branch we are merging PB> to instead of the ancestor. (So that git diff gives useful output.) Sorry, I have not seen what you have been doing since pasky 0.3, and I have not even started to understand the mental model of the world your tool is building. That said, my gut feeling is that telling this script about git-pasky's world model might be a mistake. I'd rather see you consider the script as mere "part of the plumbing". Maybe adding an extra parameter to the script to let the user explicitly specify the common ancestor to use would be needed, but I would prefer git-pasky-merge to do its own magic (converting symbolic commit names into raw commit names and such) before calling this low level script. That way people like me who have not migrated to your framework can still keep using it. All the script currently needs is a bare git object database; i.e., nothing other than what is in .git/objects and a couple of commit record SHA1s as its parameters. No .git/heads/, no .git/HEAD.local, no .git/tags, are involved for it to work, and I would prefer to keep things that way if possible. >> * show-diff updates to add -r flag to squelch diffs for files not in >> the working directory. This is mainly useful when verifying the >> result of an automated merge. PB> -r traditionally means recursive - what's the reasoning behind the PB> choice of this letter? Well, '-r' is not necessarily recursive. "ls -r" is reverse, "sort -r" is reverse. "less -r" is raw. "cat -r" is reversible. "nethack -r" is race ;-). You are thinking as an SCM person so it may look that way. "diff -r" is recursive. "darcs add -r" is recursive. But even in the SCM world, "cvs add -r" is not (it means read-only) neither "co -r" (explicit revision) ;-). I would rather pick '-q' if I were doing the patch today, but I was too tired and did not think of a letter when I wrote it. I guess '-r' stood for removed, but I agree it is a bad choice. Any objections to '-q'? PB> use strict? Not in this iteration but eventually yes. PB> It'd be simpler to do just PB> my @common = (map { $common{$_} } PB> sort { $b <=> $a } PB> keys %common) Well, actually you spotted a bug between the implementation and what I wanted to do. It should have been: map { $_->[0] } sort { $b->[1] <=> $a->[1] } map { [ $common{$_} => $_ ] } keys %common That is, sort [anscestor => number of times it appears] tuple by the "number of times it appears" in decreasing order, and project the resulting list to a list of ancestors. It is trying to deal with the following pattern in rev-tree output: TIMESTAMP1 EDGE1:1 ANCESTOR1:3 ANCESTOR2:3 TIMESTAMP2 EDGE2:2 ANCESTOR1:3 and when the above happens I wanted to pick up ANCESTOR1, but that was without no sound reason. PB> But I really think this is a horrible heuristic. I believe you should PB> take the latest commit in the --edges output, and from that choose the PB> base whose rev-tree --edges the_base merged_branch has the least lines PB> on output. (That is, the path to it is shortest - ideally it's already PB> part of the merged_branch.) I'll try something along that line. Honestly the ancestor selection part was what I had most trouble with. Thanks. PB> What about PB> sub OLDMODE { 0 } PB> sub NEWMODE { 1 } PB> sub NEWSHA { 2 } PB> and then using that when accessing the tuple? Would make the code PB> much more readable. Totally agreed; readability cleanup is needed, just as "use strict" you mentioned, before it is ready for public consumption. Remember, however, the primary purpose of the message was to share it with Linus so that I can ask his opinion while the script was still slushy; the contents that array contained was still changing then and was too early for symbolic constants. I'll do that in the next round. PB> It is a good idea to check merge's exit code and give a notice at the PB> end if there were any conflicts. In principle yes, but I noticed that merge already gave me a nice warning message when it found conflicts, so there was no need to do so myself in this case. See sample output: $ perl ./git-merge.perl \ 71796686221a0a56ccc25b02386ed8ea648da14d \ bb95843a5a0f397270819462812735ee29796fb4 Common ancestor: 9f02d4d233223462d3f6217b5837b786e6286ba4 O - COPYING O - README ... O - write-tree.c A M write-blob.c A M show-diff.c ... A M update-cache.c A M git-merge.perl B M merge-tree.c MRG Makefile merge: warning: conflicts during merge $ >> +# Create a temporary directory and go there. >> +system 'rm', '-rf', ',,merge-temp'; PB> Can't we call it just ,,merge? I'd rather have a command line option '-o' (scrapping the current '-o' and renaming it to something else; as you can see I am terrible at picking option names ;-)) to mean "output to this directory". I am not really an Arch person so I do not particulary care about /^,,/. How about "git~merge~$$"? >> +for ((',,merge-temp', '.git')) { mkdir $_; chdir $_; } >> +symlink "../../.git/objects", "objects"; >> +chdir '..'; >> + >> +my $ancestor_tree = read_commit_tree($common); >> +system 'read-tree', $ancestor_tree; >> + >> +my %tree0 = read_diff_tree($ancestor_tree, read_commit_tree($ARGV[0])); >> +my %tree1 = read_diff_tree($ancestor_tree, read_commit_tree($ARGV[1])); >> + >> +my @ancestor_file = read_show_files(); >> +my %ancestor_file = map { $_ => 1 } @ancestor_file; >> + >> +for (@ancestor_file) { >> + if (! exists $tree0{$_} && ! exists $tree1{$_}) { >> + if ($full_checkout) { >> + system 'checkout-cache', $_; >> + } >> + print STDERR "O - $_\n"; PB> Huh, what are you trying to do here? I think you should just record PB> remove, no? (And I wouldn't do anything with my read-tree. ;-) At this moment in the script, we have run "read-tree" the ancestor so the dircache has the original. %tree0 and %tree1 both did not touch the path ($_ here) so it is the same as ancestor. When '-f' is specified we are populating the output working tree with the merge result so that is what that 'checkout-cache' is about. "O - $path" means "we took the original". The idea is to populate the dircache of merge-temp with the merge result and leave uncertain stuff as in the common ancestor state, so that the user can fix them starting from there. Maybe it is a good time for me to summarize the output somewhere in a document. O - $path Tree-A and tree-B did not touch this; the result is taken from the ancestor (O for original). A D $path Only tree-A (or tree-B) deleted this and the other B D $path branch did not touch this; the result is to delete. A M $path Only tree-A (or tree-B) modified this and the other B M $path branch did not touch this; the result is to use one from tree-A (or tree-B). This includes file creation case. *DD $path Both tree-A and tree-B deleted this; the result is to delete. *DM $path Tree-A deleted while tree-B modified this (or *MD $path vice versa), and manual conflict resolution is needed; dircache is left as in the ancestor, and the modified file is saved as $path~A~ in the working directory. The user can rename it to $path and run show-diff to see what Tree-A wanted to do and decide before running update-cache. *MM $path Tree-A and tree-B did the exact same modification; the result is to use that. MRG $path Tree-A and tree-B have different modifications; run "merge" and the merge result is left as $path in the working directory. In cases other than *DM, *MD, and MRG, the result is trivial and is recorded in the dircache. Without '-o' (to be renamed ;-) nor '-f' there will not be a file checked out in the working directory for them. The three merge cases need human attention. The dircache is not touched in these cases and left as the ancestor version, and the working directory gets some file as described above. NOTE NOTE NOTE: I am not dealing with a case where both branches create the same file but with different contents. In such a case the current code falls into MRG path without having a common ancestor, which is nonsense---I can use /dev/null as the common ancestor, I guess. Also NOTE NOTE NOTE I need to detect the case where one branch creates a directory while the other creates a file. There is nothing an automated tool can do in that case but it needs to be detected and be told the user loudly. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-14 18:12 ` Junio C Hamano @ 2005-04-14 18:36 ` Linus Torvalds 2005-04-14 19:59 ` Junio C Hamano 2005-04-15 9:14 ` David Woodhouse 2005-04-14 18:51 ` Christopher Li 2005-04-14 19:35 ` Petr Baudis 2 siblings, 2 replies; 81+ messages in thread From: Linus Torvalds @ 2005-04-14 18:36 UTC (permalink / raw) To: Junio C Hamano; +Cc: Petr Baudis, git On Thu, 14 Apr 2005, Junio C Hamano wrote: > > Sorry, I have not seen what you have been doing since pasky 0.3, > and I have not even started to understand the mental model of > the world your tool is building. That said, my gut feeling is > that telling this script about git-pasky's world model might be > a mistake. I'd rather see you consider the script as mere "part > of the plumbing". I agree. Having separate abstraction layers is good. I'm actually very happy with Pasky's cleaned-up-tree, exactly because unlike the first one, Pasky did a great job of maintaining the abstraction between "plumbing" and user interfaces. The plumbing should take user interface needs into account, but the more conceptually separate it is ("does it makes sense on its own?") the better off we'll be. And "merge these two trees" (which works on a _tree_ level) or "find the common commit" (which works on a _commit_ level) look like plumbing to me - the kind of things I should have written, if I weren't such a lazy slob. Linus ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-14 18:36 ` Linus Torvalds @ 2005-04-14 19:59 ` Junio C Hamano 2005-04-15 0:42 ` Linus Torvalds 2005-04-15 9:14 ` David Woodhouse 1 sibling, 1 reply; 81+ messages in thread From: Junio C Hamano @ 2005-04-14 19:59 UTC (permalink / raw) To: Linus Torvalds; +Cc: Petr Baudis, git >>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes: LT> On Thu, 14 Apr 2005, Junio C Hamano wrote: >> Sorry, I have not seen what you have been doing since pasky 0.3, >> and I have not even started to understand the mental model of >> the world your tool is building. That said, my gut feeling is >> that telling this script about git-pasky's world model might be >> a mistake. I'd rather see you consider the script as mere "part >> of the plumbing". LT> I agree. Having separate abstraction layers is good. I'm actually very LT> happy with Pasky's cleaned-up-tree, exactly because unlike the first one, LT> Pasky did a great job of maintaining the abstraction between "plumbing" LT> and user interfaces. Agreed, not just with your agreeing with me, but with the statement that Pasky did a good job (although I am ashamed to say I have not caught up with the "userland" tools). LT> The plumbing should take user interface needs into account, but the more LT> conceptually separate it is ("does it makes sense on its own?") the better LT> off we'll be. And "merge these two trees" (which works on a _tree_ level) LT> or "find the common commit" (which works on a _commit_ level) look like LT> plumbing to me - the kind of things I should have written, if I weren't LT> such a lazy slob. I am planning drop the ancestor computation from the script, and make it another command line parameter to the script. Dan Barkalow's merge-base program should be used to compute it and his result should drive the merge. That sounds more UNIXy to me. I even may want to make the script take three trees not commits, since the merge script does not need commits (it only needs trees). As plumbing it would be cleaner interface to it to do so. The wrapper SCM scripts can and should make sure it is fed trees when the user gives it commits (or symbolic representation of it like .git/tags/blah, or `cat .git/HEAD`). But one different thing to note here. You say "merge these two trees" above (I take it that you mean "merge these two trees, taking account of this tree as their common ancestor", so actually you are dealing with three trees), and I am tending to agree with the notion of merging trees not commits. However you might get richer context and more sensible resulting merge if you say "merge these two commits". Since commit chaining is part of the fundamental git object model you may as well use it. This however opens up another set of can of worms---it would involve not just three trees but all the trees in the commit chain in between. That's when you start wondering if it would be better to add renames in the git object model, which is the topic of another thread. I have not formed an opinion on that one myself yet. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-14 19:59 ` Junio C Hamano @ 2005-04-15 0:42 ` Linus Torvalds 2005-04-15 2:33 ` Barry Silverman 2005-04-15 10:02 ` David Woodhouse 0 siblings, 2 replies; 81+ messages in thread From: Linus Torvalds @ 2005-04-15 0:42 UTC (permalink / raw) To: Junio C Hamano; +Cc: Petr Baudis, git On Thu, 14 Apr 2005, Junio C Hamano wrote: > > You say "merge these two trees" above (I take it that you mean > "merge these two trees, taking account of this tree as their > common ancestor", so actually you are dealing with three trees), Yes. We're definitely talking three trees. > and I am tending to agree with the notion of merging trees not > commits. However you might get richer context and more sensible > resulting merge if you say "merge these two commits". Since > commit chaining is part of the fundamental git object model you > may as well use it. Yes and no. There are real advantages to using the commit state to just figure out the trees, and then at least have the _option_ to do the merge at a pure tree object. In particular, if you ever find yourself wanting to graft together two different commit histories, that almost certainly is what you'd want to do. Somebody might have arrived at the exact same tree some other way, starting with a 2.6.12 tar.ball or something, and I think we should at least support the notion of saying "these two totally unrelated commits actually have the same base tree, so let's merge them in "space" (ie data) even if we can't really sanely join them in "time" (ie "commits"). I dunno. And it's also a question of sanity. The fact is, we know how to make tree merges unambiguous, by just totally ignoring the history between them. Ie we know how to merge data. I am pretty damn sure that _nobody_ knows how to merge "data over time". Maybe BK does. I'm pretty sure it actually takes the "over time" into account. But My goal is to get something that works, and something that is reliable because it is simple and it has simple rules. As you say: > This however opens up another set of can of worms---it would > involve not just three trees but all the trees in the commit > chain in between. Exactly. I seriously believe that the model is _broken_, simply because it gets too complicated. At some point it boils down to "keep it simple, stupid". > That's when you start wondering if it would > be better to add renames in the git object model, which is the > topic of another thread. I have not formed an opinion on that > one myself yet. I've not even been convinved that renames are worth it. Nobody has really given a good reason why. There are two reasons for renames I can think of: - space efficiency in delta-based trees. This is a total non-issue for git, and trying to explicitly track renames is going to cause _more_ space to be wasted rather than less. - "annotate". Something git doesn't really handle anyway, and it has little to do with renames. You can fake an annotate, but let's face it, it's _always_ going to be depending on interpreting a diff. In fact, that ends up how traditional SCM's do it too - they don't really annotate lines, they just interpret the diff. I think you might as well interpret the whole object thing. Git _does_ tell you how the objects changed, and I actually believe that a diff that works in between objects (ie can show "these lines moved from this file X to tjhat file Y") is a _hell_ of a lot more powerful than "rename" is. So I'd seriously suggest that instead of worryign about renames, people think about global diffs that aren't per-file. Git is good at limiting the changes to a set of objects, and it should be entirely possible to think of diffs as ways of moving lines _between_ objects and not just within objects. It's quite common to move a function from one file to another - certainly more so than renaming the whole file. In other words, I really believe renames are just a meaningless special case of a much more interesting problem. Which is just one reason why I'm not at all interested in bothering with them other than as a "data moved" thing, which git already handles very well indeed. So there, Linus ^ permalink raw reply [flat|nested] 81+ messages in thread
* RE: Merge with git-pasky II. 2005-04-15 0:42 ` Linus Torvalds @ 2005-04-15 2:33 ` Barry Silverman 2005-04-15 10:02 ` David Woodhouse 1 sibling, 0 replies; 81+ messages in thread From: Barry Silverman @ 2005-04-15 2:33 UTC (permalink / raw) To: 'Linus Torvalds', 'Junio C Hamano' Cc: 'Petr Baudis', git >>In particular, if you ever find yourself wanting to graft together two >>different commit histories, that almost certainly is what you'd want to >>do. Somebody might have arrived at the exact same tree some other way, >>starting with a 2.6.12 tar.ball or something, and I think we should at >>least support the notion of saying "these two totally unrelated commits >>actually have the same base tree, so let's merge them in "space" (ie data) >>even if we can't really sanely join them in "time" (ie "commits"). If this is true - then the tree-id's of the two commits would be identical, but the commit-id's wouldn't. Does this imply that common ancestor lookup should work by comparing the tree-id's (space-wise the same) rather than the commit-ids (time-wise the same)? -----Original Message----- From: git-owner@vger.kernel.org [mailto:git-owner@vger.kernel.org] On Behalf Of Linus Torvalds Sent: Thursday, April 14, 2005 8:43 PM To: Junio C Hamano Cc: Petr Baudis; git@vger.kernel.org Subject: Re: Merge with git-pasky II. On Thu, 14 Apr 2005, Junio C Hamano wrote: > > You say "merge these two trees" above (I take it that you mean > "merge these two trees, taking account of this tree as their > common ancestor", so actually you are dealing with three trees), Yes. We're definitely talking three trees. > and I am tending to agree with the notion of merging trees not > commits. However you might get richer context and more sensible > resulting merge if you say "merge these two commits". Since > commit chaining is part of the fundamental git object model you > may as well use it. Yes and no. There are real advantages to using the commit state to just figure out the trees, and then at least have the _option_ to do the merge at a pure tree object. In particular, if you ever find yourself wanting to graft together two different commit histories, that almost certainly is what you'd want to do. Somebody might have arrived at the exact same tree some other way, starting with a 2.6.12 tar.ball or something, and I think we should at least support the notion of saying "these two totally unrelated commits actually have the same base tree, so let's merge them in "space" (ie data) even if we can't really sanely join them in "time" (ie "commits"). I dunno. And it's also a question of sanity. The fact is, we know how to make tree merges unambiguous, by just totally ignoring the history between them. Ie we know how to merge data. I am pretty damn sure that _nobody_ knows how to merge "data over time". Maybe BK does. I'm pretty sure it actually takes the "over time" into account. But My goal is to get something that works, and something that is reliable because it is simple and it has simple rules. As you say: > This however opens up another set of can of worms---it would > involve not just three trees but all the trees in the commit > chain in between. Exactly. I seriously believe that the model is _broken_, simply because it gets too complicated. At some point it boils down to "keep it simple, stupid". > That's when you start wondering if it would > be better to add renames in the git object model, which is the > topic of another thread. I have not formed an opinion on that > one myself yet. I've not even been convinved that renames are worth it. Nobody has really given a good reason why. There are two reasons for renames I can think of: - space efficiency in delta-based trees. This is a total non-issue for git, and trying to explicitly track renames is going to cause _more_ space to be wasted rather than less. - "annotate". Something git doesn't really handle anyway, and it has little to do with renames. You can fake an annotate, but let's face it, it's _always_ going to be depending on interpreting a diff. In fact, that ends up how traditional SCM's do it too - they don't really annotate lines, they just interpret the diff. I think you might as well interpret the whole object thing. Git _does_ tell you how the objects changed, and I actually believe that a diff that works in between objects (ie can show "these lines moved from this file X to tjhat file Y") is a _hell_ of a lot more powerful than "rename" is. So I'd seriously suggest that instead of worryign about renames, people think about global diffs that aren't per-file. Git is good at limiting the changes to a set of objects, and it should be entirely possible to think of diffs as ways of moving lines _between_ objects and not just within objects. It's quite common to move a function from one file to another - certainly more so than renaming the whole file. In other words, I really believe renames are just a meaningless special case of a much more interesting problem. Which is just one reason why I'm not at all interested in bothering with them other than as a "data moved" thing, which git already handles very well indeed. So there, Linus - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-15 0:42 ` Linus Torvalds 2005-04-15 2:33 ` Barry Silverman @ 2005-04-15 10:02 ` David Woodhouse 2005-04-15 15:32 ` Linus Torvalds 1 sibling, 1 reply; 81+ messages in thread From: David Woodhouse @ 2005-04-15 10:02 UTC (permalink / raw) To: Linus Torvalds; +Cc: Junio C Hamano, Petr Baudis, git On Thu, 2005-04-14 at 17:42 -0700, Linus Torvalds wrote: > I've not even been convinved that renames are worth it. Nobody has > really given a good reason why. > > There are two reasons for renames I can think of: > > - space efficiency in delta-based trees. > - "annotate". Neither of those were my motivation for looking at renames. The reasons I wanted to track renames were: - Per-file revision history which doesn't stop dead at a rename. - Merging where files have been renamed in one branch and modified in another. Which is basically a special case of the above; we need to see the per-file revision history. > So I'd seriously suggest that instead of worryign about renames, people > think about global diffs that aren't per-file. Git is good at limiting > the changes to a set of objects, and it should be entirely possible to > think of diffs as ways of moving lines _between_ objects and not just > within objects. It's quite common to move a function from one file to > another - certainly more so than renaming the whole file. > > In other words, I really believe renames are just a meaningless special > case of a much more interesting problem. Which is just one reason why > I'm not at all interested in bothering with them other than as a "data > moved" thing, which git already handles very well indeed. Git doesn't handle 'data moved' except at a whole-tree level. For each commit, it says "these are the old trees; this is the new tree". Git doesn't actually look hard into the contents of tree; certainly it has no business looking at the contents of individual files; that is something that the SCM or possibly only the user should do. The storage of 'rename' information in the commit object is another kind of 'xattr' storage which git would provides but not directly interpret. And you're right; it shouldn't have to be for renames only. There's no need for us to limit it to one "source" and one "destination"; the SCM can use it to track content as it sees fit. As I said, the main aim of this is to track revision history of given content, for displaying to the user and for performing merges. So when a file is split up, or a function is moved from it to another file, a 'rename' xattr can be included to mark that files 'foo' and 'bar' in the new tree are both associated with file 'wibble' in the parent. That's as much as we need to provide for content tracking, and it _does_ handle the general case as well as we should be attempting to. We don't want to get into dealing with file contents ourselves; we just want to store the hint for the SCM or the user that "your data went thataway". -- dwmw2 ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-15 10:02 ` David Woodhouse @ 2005-04-15 15:32 ` Linus Torvalds 2005-04-15 16:01 ` David Woodhouse ` (3 more replies) 0 siblings, 4 replies; 81+ messages in thread From: Linus Torvalds @ 2005-04-15 15:32 UTC (permalink / raw) To: David Woodhouse; +Cc: Junio C Hamano, Petr Baudis, git On Fri, 15 Apr 2005, David Woodhouse wrote: > > And you're right; it shouldn't have to be for renames only. There's no > need for us to limit it to one "source" and one "destination"; the SCM > can use it to track content as it sees fit. Listen to yourself, and think about the problem for a second. First off, let's just posit that "files" do not matter. The only thing that matters is how "content" moved in the tree. Ok? If I copy a function from one fiel to another, the perfect SCM will notice that, and show it as a diff that removes it from one file and adds it to another, and is _still_ able to track authorship past the move. Agreed? Now, you basically propose to put that information in the "commit" log, and that's certainly valid. You can have the commit log say "lines 50-89 in file kernel/sched.c moved to lines 100-139 in kernel/timer.c", and then renames fall out of that as one very small special case. You can even say "lines 50-89 in file kernel/sched.c copied to.." and allow data to be tracked past not just movement, but also duplication. Do you agree that this is kind of what you'd want to aim for? That's a winning SCM concept. How do you think the SCM _gets_ at this information? In particular, how are you proposing that we determine this, especially since 90% of all stuff comes in as patches etc? You propose that we spend time when generating the tree on doing so. I'm telling you that that is wrong, for several reasons: - you're ignoring different paths for the same data. For example, you will make it impossible to merge two trees that have done exactly the same thing, except one did it as a patch (create/delete) and one did it using some other heuristic. - you're doing the work at the wrong point. Doing it _well_ is quite expensive. So if you do it at commit time, you cannot _afford_ to do it well, and you'll always fall back to doing an ass-backwards job that doesn't really get you to the good state, and only gets you to a not-very-interesting easy 1% of the solution (ie full file renames). - you're doing the work at the wrong point for _another_ reason. You're freezing your (crappy) algorithm at tree creation time, and basically making it pointless to ever create something better later, because even if hardware and software improves, you've codified that "we have to have crappy information". Now, look at my proposal: - the actual information tracking tracks _nothing_ but information. You have an SCM that tracks what changed at the only level that really matters, namely the whole project. None of the information actually makes any sense at all at a smaller granularity, since by definition, a "project" depends on the other files, or it wouldn't be a project, it would be _two_ projects or more. - When you're interested in the history of the information, you actually track it, and you try to be _intelligent_ about it. You can actually do a HELL of a lot better than whet you propose if you go the extra mile. For example, let's say that you have a visualization tool that you can use for finding out where a line of code came from. You start out at some arbitrary point in the tree, and you drill down. That's how it works, right? So how do you drill down? You simply go backwards in history for that project, tracking when that file+line changed (a "file+line" thing is actually a "sensible" tracking unit at this point, because it makes sense within the query you're doing - it's _not_ a sensible thing to track at "commit" time, but when you ask yourself "where did this line come from", that _question_ makes it sensible. Also note that "where did this _file_ come from is not a sensible question, since the file may have been the combination (or split) of several files, so there is no _answer_ to that question" So the question then becomes: "how can you reasonably _efficiently_ find the history of one particular line", and in fact it turns out that by asking the question that way, it's pretty obvious: now that you don't have to track the whole repository, you can always try to minimize the thing you're looking for. So what you do is walk back the history, and look at the tree objects (both sides when you hit a merge), eand see if that file ever changes. That's actually a very efficient operation in GIT - it matches _exactly_ how git tracks things anyway. So it's not expensive at all. When that file changes, you need to look if that _line_ changed (and here is where it comes down to usability: from a practical standpoint you probably don't care about a single line, you really _probably_ want to see changes around it too). So you diff the old state and the new state, and you see if you can still find where you were. If you still can, and the line (and a few lines around it) is still the same, you just continue to drill down. So that's not the interesting case. So what happens when you found "ok, that area changed"? Your visualization tool now shows it to the user, AND BECAUSE IT SEES THE WHOLE TREE DIFF, it also shows where it probably came from. At _that_ point, it is actually very trivial to use a modest amount of CPU time, and look for probable sources within that diff. You can do it on modern hardware in basically no time, so your visualization tool can actually notice that "oops, that line didn't even exist in the previous version, BUT I FOUND FIVE PLACES that matched almost perfectly in the same diff, and here they are" and voila, your tool now very efficiently showed the programmer that the source of the line in question was actually that we had merged 5 copies of the same code in different archtiectures into one common helper function. And if you didn't find some source that matched, or if the old file was actually very similar around that line, and that line hadn't been "totally new"? That's the easy case again - you show the programmer the diff at that point in time, and you let him decide whether that diff was what he was looking for, or whether he wants to continue to "zoom down" into the history. The above tool is (a) fairly easy to write for git (if you can do visualization tools and (b) _exactly_ what I think most programmers actually want. Tell me I'm wrong. Honestly.. And notice? My clearly _superior_ algorithm never needed any rename information at all. It would have been a total waste of time. It would also have hidden the _real_ pattern, which was that a piece of code was merged from several other matching pieces of code into one new helper function. But if it _had_ been a pure rename, my superior tool would have trivially found that _too_. So rename infomation really really doesn't matter. So I'm claiming that any SCM that tries to track renames is fundamentally broken unless it does so for internal reasons (ie to allow efficient deltas), exactly because renames do not matter. They don't help you, and they aren't what you were interested in _anyway_. What matters is finding "where did this come from", and the git architecture does that very well indeed - much better than anything else out there. I outlined a simple algorithm that can be fairly trivially coded up by somebody who really cares. Sure, pattern matching isn't trivial, but you start out with just saying "let's find that exact line, and two lines on each side", and then you start improving on that. And that "where did this come from" decision should be done at _search_ time, not commit time. Because at that time it's not only trivial to do, but at that time you can _dynamically_ change your search criteria. For example, you can make the "match" algorithm be dependent on what you are looking at. If it's C source code, it might want to ignore vairable names when it searches for matching code. And if it's a OpenOffice document, you might have some open-office-specific tools to do so. See? Also, the person doing the searches can say whether he is interested in that particular line (or even that particial _identifier_ on a line), or whether he wants to see the changes "around" that line. All of which are very valid things to do, and all of which my world-view supports very well indeed. And all of which your pitiful "files matter" world-view totally doesn't get at all. In other words, I'm right. I'm always right, but sometimes I'm more right than other times. And dammit, when I say "files don't matter", I'm really really Right(tm). Please stop this "track files" crap. Git tracks _exactly_ what matters, namely "collections of files". Nothing else is relevant, and even _thinking_ that it is relevant only limits your world-view. Notice how the notion of CVS "annotate" always inevitably ends up limiting how people use it. I think it's a totally useless piece of crap, and I've described something that I think is a million times more useful, and it all fell out _exactly_ because I'm not limiting my thinking to the wrong model of the world. Linus ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-15 15:32 ` Linus Torvalds @ 2005-04-15 16:01 ` David Woodhouse 2005-04-15 16:31 ` C. Scott Ananian 2005-04-16 15:33 ` Johannes Schindelin 2005-04-15 19:20 ` Paul Jackson ` (2 subsequent siblings) 3 siblings, 2 replies; 81+ messages in thread From: David Woodhouse @ 2005-04-15 16:01 UTC (permalink / raw) To: Linus Torvalds; +Cc: Junio C Hamano, Petr Baudis, git On Fri, 2005-04-15 at 08:32 -0700, Linus Torvalds wrote: > - you're doing the work at the wrong point. Doing it _well_ is quite > expensive. So if you do it at commit time, you cannot _afford_ to do it > well, and you'll always fall back to doing an ass-backwards job that > doesn't really get you to the good state, and only gets you to a > not-very-interesting easy 1% of the solution (ie full file renames). > > - you're doing the work at the wrong point for _another_ reason. You're > freezing your (crappy) algorithm at tree creation time, and basically > making it pointless to ever create something better later, because even > if hardware and software improves, you've codified that "we have to > have crappy information". OK, I'm inclined to agree. The only thing that prevents me from capitulating entirely and resubscribing to the "Torvalds is always right" school is the concern that it _is_ expensive, and that's why I originally wanted to do it at commit time because then it's a one-off cost rather than recurring every time we want to track the history of a given piece of content. Also because we actually have the developer's attention at commit time, and we can get _real_ answers from the user about what she was doing, instead of having to guess. But if it can be done cheaply enough at a later date even though we end up repeating ourselves, and if it can be done _well_ enough that we shouldn't have just asked the user in the first place, then yes, OK I agree. -- dwmw2 ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-15 16:01 ` David Woodhouse @ 2005-04-15 16:31 ` C. Scott Ananian 2005-04-15 17:11 ` Linus Torvalds 2005-04-16 15:33 ` Johannes Schindelin 1 sibling, 1 reply; 81+ messages in thread From: C. Scott Ananian @ 2005-04-15 16:31 UTC (permalink / raw) To: David Woodhouse; +Cc: Linus Torvalds, Junio C Hamano, Petr Baudis, git On Fri, 15 Apr 2005, David Woodhouse wrote: > given piece of content. Also because we actually have the developer's > attention at commit time, and we can get _real_ answers from the user > about what she was doing, instead of having to guess. Yes, but it's still hard to get *accurate* information. And developers tend to use very short commit messages already... > But if it can be done cheaply enough at a later date even though we end > up repeating ourselves, and if it can be done _well_ enough that we > shouldn't have just asked the user in the first place, then yes, OK I > agree. I think examining the rsync algorithms should convince you that finding common chunks can be fairly efficient. (See my next message for a more concrete proposal.) --scott Rijndael AMLASH Moscow Ft. Bragg shotgun HTKEEPER SHERWOOD overthrow Uzi anthrax Yeltsin Indonesia Suharto LITEMPO Dictionary Yakima KUBARK ( http://cscott.net/ ) ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-15 16:31 ` C. Scott Ananian @ 2005-04-15 17:11 ` Linus Torvalds 0 siblings, 0 replies; 81+ messages in thread From: Linus Torvalds @ 2005-04-15 17:11 UTC (permalink / raw) To: C. Scott Ananian; +Cc: David Woodhouse, Junio C Hamano, Petr Baudis, git On Fri, 15 Apr 2005, C. Scott Ananian wrote: > > I think examining the rsync algorithms should convince you that finding > common chunks can be fairly efficient. Note that "efficient" really depends on how good a job you want to do, so you can tune it to how much CPU you can afford to waste on the problem. For example, my example had this thing where we merged five different functions into one function, and it is truly pretty efficient to find things like that _IF_ we only look at the files that changed (since the set of files that change in any one particular commit tends to be small, relative to the whole repository). There are many good algorithms for finding "common code", and with modern hardware that is basically instantaneous if you look at a few tens of files. For example, people wrote efficient things to compare _millions_ of lines of code for the whole SCO saga - you can do quite well. Some googling comes up with for example http://minnie.tuhs.org/Programs and applying those to a smallish set of files is quite efficient. What is _not_ necessarily as easy is the situation where you notice that a new set of lines appeared, but you don't see any place that matches that set of lines in the set of CHANGED files. That's actually quite common, ie let's say that you have a new filesystem or a new driver, and almost always it's based on a template or something, and you _would_ be able to see where that template came from, except it's not in that _changed_ set. And that is still doable, but now you really have to compare against the whole tree if you want to do it. Even _that_ is actually efficient if you cache the hashes - that's how the comparison tools compare two totally independent trees against each other, and it makes it practically possible to do even that expensive O(n**2) operation in reasonable time. It's certainly possible to do exactly the same thing for the "new code got added, does it bear any similarity to old code" case. Note! This is a question that is relevant and actually is in the realm of the "possible to find the answer interactively". It may fairly expensive, but the point is that this is the kind of relevant question that really does depend on the fundamental notion that "data matters more than any local changes". And when you think about the problem in that form, you find these kinds of interesting questions that you _can_ answer. Because the way git identifies data, the example "is there any other relevant code that may actually be similar to the newly added code" is actually not that hard to do in git. Remember: the way to answer that question is to have a cache of hashes of the contents. Guess what git _is_? You can now index your line-based hashes of contents against the _object_ hashes that git keeps track of, and you suddenly have an efficient way to actually look up those hashes. NOTE! All of this is outside the scope of git itself. This is all "visualization and comparison tools" built up on top of git. And I'm not at all interested in writing those tools myself, and I'm absolutely not signing up for that part. All I'm arguing for is that the git architecture is actually a very good architecture for doing these kinds of very very cool tools. Linus ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-15 16:01 ` David Woodhouse 2005-04-15 16:31 ` C. Scott Ananian @ 2005-04-16 15:33 ` Johannes Schindelin 2005-04-17 13:14 ` David Woodhouse 1 sibling, 1 reply; 81+ messages in thread From: Johannes Schindelin @ 2005-04-16 15:33 UTC (permalink / raw) To: David Woodhouse; +Cc: Linus Torvalds, Junio C Hamano, Petr Baudis, git Hi, On Fri, 15 Apr 2005, David Woodhouse wrote: > But if it can be done cheaply enough at a later date even though we end > up repeating ourselves, and if it can be done _well_ enough that we > shouldn't have just asked the user in the first place, then yes, OK I > agree. The repetition could be helped by using a cache. Ciao, Dscho ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-16 15:33 ` Johannes Schindelin @ 2005-04-17 13:14 ` David Woodhouse 0 siblings, 0 replies; 81+ messages in thread From: David Woodhouse @ 2005-04-17 13:14 UTC (permalink / raw) To: Johannes Schindelin; +Cc: Linus Torvalds, Junio C Hamano, Petr Baudis, git On Sat, 2005-04-16 at 17:33 +0200, Johannes Schindelin wrote: > > But if it can be done cheaply enough at a later date even though we end > > up repeating ourselves, and if it can be done _well_ enough that we > > shouldn't have just asked the user in the first place, then yes, OK I > > agree. > > The repetition could be helped by using a cache. Perhaps. Since neither such a cache nor even the commit comments are strictly part of the git data, they probably shouldn't be included in the sha1 hash of the commit object. However, I don't see a fundamental reason why we couldn't store them in the same file but omit them from the hash calculations. That also allows us to retrospectively edit commit comments without completely changing the entire subsequent history. Or is that a little too heretical a suggestion? -- dwmw2 ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-15 15:32 ` Linus Torvalds 2005-04-15 16:01 ` David Woodhouse @ 2005-04-15 19:20 ` Paul Jackson 2005-04-16 1:44 ` Simon Fowler 2005-04-16 20:29 ` Sanjoy Mahajan 3 siblings, 0 replies; 81+ messages in thread From: Paul Jackson @ 2005-04-15 19:20 UTC (permalink / raw) To: Linus Torvalds; +Cc: dwmw2, junkio, pasky, git These notions that one can always best answer questions by looking at the content, and that "Individual files DO NOT EXIST" seem over stated, to me. Granted, overstated for a good reason. A couple sticks of dynamite are needed to shake loose some old SCM thinking habits. === Ingo has a point when he states: > i believe the fundamental thing to think about is not file or line or > namespace, but 'tracking developer intent'. He too overstates - it's not _the_ (as in one and only) thing. But it's useful. Given the traditional terseness of many engineers, it's certainly not the _only_ thing. The code speaks too. === The above two are related in this way. Traditional SCM uses per file (versioned controlled file, as in s.* or *,v files) metadata to track 'developer intent'. I'm afraid we are at risk for confusing baby (developer intent) and bathwater (version controlled file structure of classic SCM's). === But we already have a pretty damn good way of tracking developer intent that needs to fit naturally with whatever we build on top of git. Mr. McGuire: I just want to say one word to you - just one word. Ben: Yes sir. Mr. McGuire: Are you listening? Ben: Yes I am. Mr. McGuire: 'Patches.' # the original word was 'Plastics' - The Graduate (1967) Andrew and the other maintainers do a pretty good job of 'encouraging' developers to provide useful statements of 'intent' in their patch headers. The patch series in something like *-mm, including per-patch commentary, are a valuable part of this project. === I have not looked closely at what is being done here, on top of git, for SCM like capabilities. Hopefully the next two questions are not too stupid: 1) How do we track the patch header commentary? 2) Why can't we have a quilt like SCM, not bk/rcs/cvs/sccs/... like? For (2), anyone publishing a Linux source would periodically announce an <sha1> value, attached to some name suitable for public consumption. For example, sometime in the next month or so, Linus would announce that the <sha1> of 2.6.12 is so-and-so. That would identify the result of applying a specific set of patches, resulting in a specific source tree contents. He would announce a few 2.6.12-rc* <sha1>'s between now and then. Between now and then, Andrew would (if using these tools) have published several <sha1> values, one each for various 2.6.12-rc*-mm* versions. If you explode such a <sha1> all out into a working directory, you get both the source contents in the appropriately named files, and the quilt-style patches subdirectory, of the patch series that gets you here, starting from some Time Zero for that series of published kernel versions. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401 ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-15 15:32 ` Linus Torvalds 2005-04-15 16:01 ` David Woodhouse 2005-04-15 19:20 ` Paul Jackson @ 2005-04-16 1:44 ` Simon Fowler 2005-04-16 12:19 ` David Lang 2005-04-16 20:29 ` Sanjoy Mahajan 3 siblings, 1 reply; 81+ messages in thread From: Simon Fowler @ 2005-04-16 1:44 UTC (permalink / raw) To: Linus Torvalds; +Cc: David Woodhouse, Junio C Hamano, Petr Baudis, git [-- Attachment #1.1: Type: text/plain, Size: 1189 bytes --] On Fri, Apr 15, 2005 at 08:32:46AM -0700, Linus Torvalds wrote: > In other words, I'm right. I'm always right, but sometimes I'm more right > than other times. And dammit, when I say "files don't matter", I'm really > really Right(tm). > You're right, of course (All Hail Linus!), if you can make it work efficiently enough. Just to put something else on the table, here's how I'd go about tracking renames and the like, in another world where Linus /does/ make the odd mistake - it's basically a unique id for files in the repository, added when the file is first recognised and updated when update-cache adds a new version to the cache. Renames copy the id across to the new name, and add it into the cache. This gives you an O(n) way to tell what file was what across renames, and it might even be useful in Linus' world, or if someone wanted to build a traditional SCM on top of a git-a-like. Attached is a patch, and a rename-file.c to use it. Simon -- PGP public key Id 0x144A991C, or http://himi.org/stuff/himi.asc (crappy) Homepage: http://himi.org doe #237 (see http://www.lemuria.org/DeCSS) My DeCSS mirror: ftp://himi.org/pub/mirrors/css/ [-- Attachment #1.2: guid2.patch --] [-- Type: text/plain, Size: 19027 bytes --] COPYING: fe2a4177a760fd110e78788734f167bd633be8de Makefile: ca50293c4f211452d999b81f122e99babb9f2987 --- Makefile +++ Makefile 2005-04-15 22:17:49.000000000 +1000 @@ -14,7 +14,7 @@ PROG= update-cache show-diff init-db write-tree read-tree commit-tree \ cat-file fsck-cache checkout-cache diff-tree rev-tree show-files \ - check-files ls-tree + check-files ls-tree rename-file SCRIPT= parent-id tree-id git gitXnormid.sh gitadd.sh gitaddremote.sh \ gitcommit.sh gitdiff-do gitdiff.sh gitlog.sh gitls.sh gitlsobj.sh \ @@ -73,6 +73,9 @@ ls-tree: ls-tree.o read-cache.o $(CC) $(CFLAGS) -o ls-tree ls-tree.o read-cache.o $(LIBS) +rename-file: rename-file.o read-cache.o + $(CC) $(CFLAGS) -o rename-file rename-file.o read-cache.o $(LIBS) + read-cache.o: cache.h show-diff.o: cache.h README: ded1a3b20e9bbe1f40e487ba5f9361719a1b6b85 VERSION: c27bd67cd632cc15dd520fbfbf807d482efa2dcf cache.h: 4d382549041d3281f8d44aa2e52f9f8ec47dd420 --- cache.h +++ cache.h 2005-04-14 22:35:59.000000000 +1000 @@ -55,6 +55,7 @@ unsigned int st_gid; unsigned int st_size; unsigned char sha1[20]; + unsigned char guid[20]; unsigned short namelen; char name[0]; }; cat-file.c: 45be1badaa8517d4e3a69e0bf1cac2e90191e475 check-files.c: 927b0b9aca742183fc8e7ccd73d73d8d5427e98f checkout-cache.c: f06871cdbc1b18ea93bdf4e17126aeb4cca1373e commit-id: 65c81756c8f10d513d073ecbd741a3244663c4c9 commit-tree.c: 12196c79f31d004dff0df1f50dda67d8204f5568 diff-tree.c: 7dcc9eb7782fa176e27f1677b161ce78ac1d2070 --- diff-tree.c +++ diff-tree.c 2005-04-16 10:46:52.000000000 +1000 @@ -1,33 +1,144 @@ +#include <sys/param.h> #include "cache.h" -static int recursive = 0; +enum diff_type { + REMOVE, + ADD, + RENAME, + MODIFY, +}; + +struct guid_cache_entry { + enum diff_type diff; + unsigned char guid[20]; + unsigned char sha1[20]; + struct guid_cache_entry *old; + unsigned int mode; + unsigned int pathlen; + unsigned char path[0]; +}; + +struct guid_cache { + unsigned int nr; + unsigned int alloc; + struct guid_cache_entry **cache; +}; -static int diff_tree_sha1(const unsigned char *old, const unsigned char *new, const char *base); +struct guid_cache guid_cache; +struct guid_cache *cache = &guid_cache; -static void update_tree_entry(void **bufp, unsigned long *sizep) +int guid_cache_pos(const char *guid) { - void *buf = *bufp; - unsigned long size = *sizep; - int len = strlen(buf) + 1 + 20; + int first, last; - if (size < len) - die("corrupt tree file"); - *bufp = buf + len; - *sizep = size - len; + first = 0; + last = cache->nr; + while (last > first) { + int next = (last + first) >> 1; + struct guid_cache_entry *gce = cache->cache[next]; + int cmp = memcmp(guid, gce->guid, 20); + if (!cmp) + return next; + if (cmp < 0) { + last = next; + continue; + } + first = next + 1; + } + return - first-1; } -static const unsigned char *extract(void *tree, unsigned long size, const char **pathp, unsigned int *modep) +int add_guid_cache_entry(struct guid_cache_entry *gce) +{ + int pos; + + pos = guid_cache_pos(gce->guid); + + /* if this is a rename or modify, the guid will show up a + * second time */ + if (pos >= 0) { + struct guid_cache_entry *old = cache->cache[pos]; + int cmp = cache_name_compare(old->path, old->pathlen, gce->path, gce->pathlen); + + if (!cmp) { + /* pathname matches, so this must be a + * modify. */ + gce->old = old; + gce->diff = MODIFY; + cache->cache[pos] = gce; + } else { + /* the pathnames are different, so the file + * must have been renamed somewhere along the + * line. */ + gce->old = old; + gce->diff = RENAME; + cache->cache[pos] = gce; + } + return 0; + } + pos = -pos-1; + + if (cache->nr == cache->alloc) { + cache->alloc = alloc_nr(cache->alloc); + cache->cache = realloc(cache->cache, cache->alloc * sizeof(struct guid_cache_entry *)); + } + + cache->nr++; + if (cache->nr > pos) + memmove(cache->cache + pos + 1, cache->cache + pos, (cache->nr - pos - 1) * sizeof(struct guid_cache_entry *)); + cache->cache[pos] = gce; + return 0; +} + +static const unsigned char *extract(void *tree, unsigned long size, const char **pathp, unsigned int *modep, const unsigned char **guid) { int len = strlen(tree)+1; const unsigned char *sha1 = tree + len; const char *path = strchr(tree, ' '); - if (!path || size < len + 20 || sscanf(tree, "%o", modep) != 1) + if (!path || size < len + 40 || sscanf(tree, "%o", modep) != 1) die("corrupt tree file"); *pathp = path+1; + *guid = tree + len + 20; return sha1; } +static void guid_cache_tree_entry(void *buf, unsigned int len, const char *base, enum diff_type diff) +{ + unsigned mode; + const char *path; + const unsigned char *guid; + const unsigned char *sha1 = extract(buf, len, &path, &mode, &guid); + struct guid_cache_entry *gce; + int baselen = strlen(base); + + gce = calloc(1, sizeof(struct guid_cache_entry) + baselen + strlen(path) + 1); + memcpy(gce->guid, guid, 20); + memcpy(gce->sha1, sha1, 20); + gce->diff = diff; + gce->mode = mode; + gce->pathlen = snprintf(gce->path, MAXPATHLEN, "%s%s", base, path); + gce->path[gce->pathlen + 1] = '\0'; + + add_guid_cache_entry(gce); +} + +static int recursive = 0; + +static int diff_tree_sha1(const unsigned char *old, const unsigned char *new, const char *base); + +static void update_tree_entry(void **bufp, unsigned long *sizep) +{ + void *buf = *bufp; + unsigned long size = *sizep; + int len = strlen(buf) + 1 + 40; + + if (size < len) + die("corrupt tree file"); + *bufp = buf + len; + *sizep = size - len; +} + static char *malloc_base(const char *base, const char *path, int pathlen) { int baselen = strlen(base); @@ -38,23 +149,24 @@ return newbase; } -static void show_file(const char *prefix, void *tree, unsigned long size, const char *base); +static void changed_file(void *tree, unsigned long size, const char *base, enum diff_type diff); /* A whole sub-tree went away or appeared */ -static void show_tree(const char *prefix, void *tree, unsigned long size, const char *base) +static void changed_tree(void *tree, unsigned long size, const char *base, enum diff_type diff) { while (size) { - show_file(prefix, tree, size, base); + changed_file(tree, size, base, diff); update_tree_entry(&tree, &size); } } /* A file entry went away or appeared */ -static void show_file(const char *prefix, void *tree, unsigned long size, const char *base) +static void changed_file(void *tree, unsigned long size, const char *base, enum diff_type diff) { unsigned mode; const char *path; - const unsigned char *sha1 = extract(tree, size, &path, &mode); + const unsigned char *guid; + const unsigned char *sha1 = extract(tree, size, &path, &mode, &guid); if (recursive && S_ISDIR(mode)) { char type[20]; @@ -66,38 +178,96 @@ if (!tree || strcmp(type, "tree")) die("corrupt tree sha %s", sha1_to_hex(sha1)); - show_tree(prefix, tree, size, newbase); + changed_tree(tree, size, newbase, diff); free(tree); free(newbase); return; } - printf("%s%o\t%s\t%s\t%s%s%c", prefix, mode, - S_ISDIR(mode) ? "tree" : "blob", - sha1_to_hex(sha1), base, path, 0); + guid_cache_tree_entry(tree, size, base, diff); } +static void show_one_file(struct guid_cache_entry *gce) +{ + struct guid_cache_entry *old; + char old_sha1[50]; + char old_sha2[50]; + + switch(gce->diff) { + case REMOVE: + sprintf(old_sha1, "%s", sha1_to_hex(gce->sha1)); + printf("-%o\t%s\t%s\t%s\t%s%c", gce->mode, + S_ISDIR(gce->mode) ? "tree" : "blob", + old_sha1, sha1_to_hex(gce->guid), gce->path, 0); + break; + case ADD: + sprintf(old_sha1, "%s", sha1_to_hex(gce->sha1)); + printf("+%o\t%s\t%s\t%s\t%s%c", gce->mode, + S_ISDIR(gce->mode) ? "tree" : "blob", + old_sha1, sha1_to_hex(gce->guid), gce->path, 0); + break; + case MODIFY: + old = gce->old; + if (old) { + sprintf(old_sha1, "%s", sha1_to_hex(old->sha1)); + sprintf(old_sha2, "%s", sha1_to_hex(gce->sha1)); + + printf("*%o->%o\t%s\t%s->%s\t%s\t%s%c", old->mode, gce->mode, + S_ISDIR(old->mode) ? "tree" : "blob", + old_sha1, old_sha2, sha1_to_hex(gce->guid), gce->path, 0); + } else { + die("diff-tree: internal error"); + } + break; + case RENAME: + old = gce->old; + if (old) { + sprintf(old_sha1, "%s", sha1_to_hex(gce->sha1)); + sprintf(old_sha2, "%s", sha1_to_hex(old->sha1)); + + printf("r%o->%o\t%s\t%s->%s\t%s\t%s%c", gce->mode, old->mode, + S_ISDIR(old->mode) ? "tree" : "blob", + old_sha1, old_sha2, sha1_to_hex(old->guid), old->path, 0); + } else { + die("diff-tree: internal error"); + } + break; + default: + die("diff-tree: internal error"); + } +} + +/* simply iterate over both caches looking for matching guids, + * showing all files in both caches */ +static void show_cache(void) +{ + int i; + + for (i = 0; i < cache->nr; i++) + show_one_file(cache->cache[i]); +} + static int compare_tree_entry(void *tree1, unsigned long size1, void *tree2, unsigned long size2, const char *base) { unsigned mode1, mode2; const char *path1, *path2; const unsigned char *sha1, *sha2; + const unsigned char *guid1, *guid2; int cmp, pathlen1, pathlen2; - char old_sha1_hex[50]; - sha1 = extract(tree1, size1, &path1, &mode1); - sha2 = extract(tree2, size2, &path2, &mode2); + sha1 = extract(tree1, size1, &path1, &mode1, &guid1); + sha2 = extract(tree2, size2, &path2, &mode2, &guid2); pathlen1 = strlen(path1); pathlen2 = strlen(path2); cmp = cache_name_compare(path1, pathlen1, path2, pathlen2); if (cmp < 0) { - show_file("-", tree1, size1, base); + changed_file(tree1, size1, base, REMOVE); return -1; } if (cmp > 0) { - show_file("+", tree2, size2, base); + changed_file(tree2, size2, base, ADD); return 1; } if (!memcmp(sha1, sha2, 20) && mode1 == mode2) @@ -108,8 +278,8 @@ * file, we need to consider it a remove and an add. */ if (S_ISDIR(mode1) != S_ISDIR(mode2)) { - show_file("-", tree1, size1, base); - show_file("+", tree2, size2, base); + changed_file(tree1, size1, base, REMOVE); + changed_file(tree2, size2, base, ADD); return 0; } @@ -121,10 +291,14 @@ return retval; } - strcpy(old_sha1_hex, sha1_to_hex(sha1)); - printf("*%o->%o\t%s\t%s->%s\t%s%s%c", mode1, mode2, - S_ISDIR(mode1) ? "tree" : "blob", - old_sha1_hex, sha1_to_hex(sha2), base, path1, 0); + if (!memcmp(guid1, guid2, 20)) { + changed_file(tree1, size1, base, MODIFY); + changed_file(tree2, size2, base, MODIFY); + return 0; + } + + changed_file(tree1, size1, base, REMOVE); + changed_file(tree2, size2, base, ADD); return 0; } @@ -132,12 +306,12 @@ { while (size1 | size2) { if (!size1) { - show_file("+", tree2, size2, base); + changed_file(tree2, size2, base, ADD); update_tree_entry(&tree2, &size2); continue; } if (!size2) { - show_file("-", tree1, size1, base); + changed_file(tree1, size1, base, REMOVE); update_tree_entry(&tree1, &size1); continue; } @@ -179,6 +353,7 @@ int main(int argc, char **argv) { unsigned char old[20], new[20]; + int retval; while (argc > 3) { char *arg = argv[1]; @@ -193,5 +368,7 @@ if (argc != 3 || get_sha1_hex(argv[1], old) || get_sha1_hex(argv[2], new)) usage("diff-tree <tree sha1> <tree sha1>"); - return diff_tree_sha1(old, new, ""); + retval = diff_tree_sha1(old, new, ""); + show_cache(); + return retval; } fsck-cache.c: 9c900fe458cecd2bdb4c4571a584115b5cf24f22 --- fsck-cache.c +++ fsck-cache.c 2005-04-15 20:39:49.000000000 +1000 @@ -165,9 +165,10 @@ while (size) { int len = 1+strlen(data); unsigned char *file_sha1 = data + len; + unsigned char *guid = file_sha1 + 20; char *path = strchr(data, ' '); unsigned int mode; - if (size < len + 20 || !path || sscanf(data, "%o", &mode) != 1) + if (size < len + 40 || !path || sscanf(data, "%o", &mode) != 1) return -1; /* Warn about trees that don't do the recursive thing.. */ @@ -176,8 +177,8 @@ warn_old_tree = 0; } - data += len + 20; - size -= len + 20; + data += len + 40; + size -= len + 40; mark_needs_sha1(sha1, S_ISDIR(mode) ? "tree" : "blob", file_sha1); } return 0; git: 2c557dcf2032325acc265b577ee104e605fdaede gitXnormid.sh: a5d7a9f4a6e8d4860f35f69500965c2a493d80de gitadd.sh: 3ed93ea0fcb995673ba9ee1982e0e7abdbe35982 gitaddremote.sh: bf1f28823da5b5270aa8fa05b321faa514a57a11 gitapply.sh: d0e3c46e2ce1ee74e1a87ee6137955fa9b35c27b gitcancel.sh: ec58f7444a42cd3cbaae919fc68c70a3866420c0 gitcommit.sh: 3629f67bbd3f171d091552814908b67af7537f4d gitdiff-do: d6174abceab34d22010c36a8453a6c3f3f184fe0 gitdiff.sh: 5e47c4779d73c3f2f39f6be714c0145175933197 gitexport.sh: dad00bf251b38ce522c593ea9631f842d8ccc934 gitlntree.sh: 17c4966ea64aeced96ae4f1b00f3775c1904b0f1 gitlog.sh: 177c6d12dd9fa4b4920b08451ffe4badde544a39 gitls.sh: b6f15d82f16c1e9982c5031f3be22eb5430273af gitlsobj.sh: 128461d3de6a42cfaaa989fc6401bebdfa885b3f gitmerge.sh: 23e4a3ff342c6005928ceea598a2f52de6fb9817 gitpull.sh: 0883898dda579e3fa44944b7b1d909257f6dc63e gitrm.sh: 5c18c38a890c9fd9ad2b866ee7b529539d2f3f8f gittag.sh: c8cb31385d5a9622e95a4e0b2d6a4198038a659c gittrack.sh: 03d6db1fb3a70605ef249c632c04e542457f0808 init-db.c: aa00fbb1b95624f6c30090a17354c9c08a6ac596 ls-tree.c: 3e2a6c7d183a42e41f1073dfec6794e8f8a5e75c --- ls-tree.c +++ ls-tree.c 2005-04-15 15:55:40.000000000 +1000 @@ -10,6 +10,7 @@ void *buffer; unsigned long size; char type[20]; + char old_sha1[50]; buffer = read_sha1_file(sha1, type, &size); if (!buffer) @@ -19,19 +20,21 @@ while (size) { int len = strlen(buffer)+1; unsigned char *sha1 = buffer + len; + unsigned char *guid = buffer + len + 20; char *path = strchr(buffer, ' ')+1; unsigned int mode; unsigned char *type; - if (size < len + 20 || sscanf(buffer, "%o", &mode) != 1) + if (size < len + 40 || sscanf(buffer, "%o", &mode) != 1) die("corrupt 'tree' file"); - buffer = sha1 + 20; - size -= len + 20; + buffer = sha1 + 40; + size -= len + 40; /* XXX: We do some ugly mode heuristics here. * It seems not worth it to read each file just to get this * and the file size. -- pasky@ucw.cz */ type = S_ISDIR(mode) ? "tree" : "blob"; - printf("%03o\t%s\t%s\t%s\n", mode, type, sha1_to_hex(sha1), path); + sprintf(old_sha1, sha1_to_hex(guid)); + printf("%03o\t%s\t%s\t%s\t%s\n", mode, type, sha1_to_hex(sha1), old_sha1, path); } return 0; } parent-id: 1801c6fe426592832e7250f8b760fb9d2e65220f read-cache.c: 7a6ae8b9b489f6b67c82e065dedd5716a6bfc0ef --- read-cache.c +++ read-cache.c 2005-04-16 10:52:51.000000000 +1000 @@ -4,6 +4,8 @@ * Copyright (C) Linus Torvalds, 2005 */ #include <stdarg.h> +#include <time.h> +#include <sys/param.h> #include "cache.h" const char *sha1_file_directory = NULL; @@ -233,6 +235,22 @@ return 0; } +void new_guid(const char *filename, int namelen, unsigned char *returnguid) +{ + size_t size; + time_t now = time(NULL); + char buf[MAXPATHLEN + 20]; + unsigned char guid[20]; + + size = snprintf(buf, MAXPATHLEN + 20, "%ld%s", now, filename) + 1; + + SHA1(buf, size, guid); + + if (returnguid) + memcpy(returnguid, guid, 20); + return; +} + static inline int collision_check(char *filename, void *buf, unsigned int size) { #ifdef COLLISION_CHECK @@ -363,11 +381,14 @@ int add_cache_entry(struct cache_entry *ce, int ok_to_add) { int pos; + unsigned char guid[20]; pos = cache_name_pos(ce->name, ce->namelen); /* existing match? Just replace it */ if (pos >= 0) { + struct cache_entry *old_ce = active_cache[pos]; + memcpy(ce->guid, old_ce->guid, 20); active_cache[pos] = ce; return 0; } @@ -376,6 +397,12 @@ if (!ok_to_add) return -1; + memset(guid, 0, 20); + if (!memcmp(ce->guid, guid, 20)) { + new_guid(ce->name, ce->namelen, guid); + memcpy(ce->guid, guid, 20); + } + /* Make sure the array is big enough .. */ if (active_nr == active_alloc) { active_alloc = alloc_nr(active_alloc); read-tree.c: eb548148aa6d212f05c2c622ffbe62a06cd072f9 --- read-tree.c +++ read-tree.c 2005-04-16 10:41:46.000000000 +1000 @@ -5,7 +5,9 @@ */ #include "cache.h" -static int read_one_entry(unsigned char *sha1, const char *base, int baselen, const char *pathname, unsigned mode) +static int read_one_entry(unsigned char *sha1, unsigned char *guid, + const char *base, int baselen, + const char *pathname, unsigned mode) { int len = strlen(pathname); unsigned int size = cache_entry_size(baselen + len); @@ -18,6 +20,7 @@ memcpy(ce->name, base, baselen); memcpy(ce->name + baselen, pathname, len+1); memcpy(ce->sha1, sha1, 20); + memcpy(ce->guid, guid, 20); return add_cache_entry(ce, 1); } @@ -35,14 +38,15 @@ while (size) { int len = strlen(buffer)+1; unsigned char *sha1 = buffer + len; + unsigned char *guid = buffer + len + 20; char *path = strchr(buffer, ' ')+1; unsigned int mode; - - if (size < len + 20 || sscanf(buffer, "%o", &mode) != 1) + + if (size < len + 40 || sscanf(buffer, "%o", &mode) != 1) return -1; - buffer = sha1 + 20; - size -= len + 20; + buffer = sha1 + 40; + size -= len + 40; if (S_ISDIR(mode)) { int retval; @@ -57,7 +61,7 @@ return -1; continue; } - if (read_one_entry(sha1, base, baselen, path, mode) < 0) + if (read_one_entry(sha1, guid, base, baselen, path, mode) < 0) return -1; } return 0; rev-tree.c: 395b0b3bfadb0537ae0c62744b25ead4b487f3f6 show-diff.c: a531ca4078525d1c8dcf84aae0bfa89fed6e5d96 show-files.c: a9fa6767a418f870a34b39379f417bf37b17ee18 tree-id: cb70e2c508a18107abe305633612ed702aa3ee4f update-cache.c: 62d0a6c41560d40863c44599355af10d9e089312 write-tree.c: 1534477c91169ebddcf953e3f4d2872495477f6b --- write-tree.c +++ write-tree.c 2005-04-15 13:46:05.000000000 +1000 @@ -47,6 +47,7 @@ const char *pathname = ce->name, *filename, *dirname; int pathlen = ce->namelen, entrylen; unsigned char *sha1; + unsigned char *guid; unsigned int mode; /* Did we hit the end of the directory? Return how many we wrote */ @@ -54,6 +55,7 @@ break; sha1 = ce->sha1; + guid = ce->guid; mode = ce->st_mode; /* Do we have _further_ subdirectories? */ @@ -86,6 +88,8 @@ buffer[offset++] = 0; memcpy(buffer + offset, sha1, 20); offset += 20; + memcpy(buffer + offset, guid, 20); + offset += 20; nr++; } while (nr < maxentries); [-- Attachment #1.3: rename-file.c --] [-- Type: text/x-csrc, Size: 1703 bytes --] /* * rename files in a git repository, keeping the guid. * * Copyright Simon Fowler <simon@dreamcraft.com.au>, 2005. */ #include <unistd.h> #include <sys/stat.h> #include <errno.h> #include "cache.h" static int remove_lock = 0; static void remove_lock_file(void) { if (remove_lock) unlink(".git/index.lock"); } int main(int argc, char *argv[]) { struct stat stats; struct cache_entry *ce, *new; int newfd, entries, pos, pos2; if (argc != 3) usage("rename-file <old> <new>"); if (stat(argv[1], &stats)) { perror("rename-file: "); exit(1); } if (!stat(argv[2], &stats)) die("rename-file: destination file already exists"); newfd = open(".git/index.lock", O_RDWR | O_CREAT | O_EXCL, 0600); if (newfd < 0) die("unable to create new cachefile"); atexit(remove_lock_file); remove_lock = 1; entries = read_cache(); if (entries < 0) die("cache corrupted"); pos = cache_name_pos(argv[1], strlen(argv[1])); pos2 = cache_name_pos(argv[2], strlen(argv[2])); if (pos < 0) die("original file not in cache"); if (pos2 >= 0) die("destination file already in cache"); ce = active_cache[pos]; new = malloc(sizeof(struct cache_entry) + strlen(argv[2]) + 1); memcpy(new, ce, sizeof(struct cache_entry)); new->namelen = strlen(argv[2]); memcpy(new->name, argv[2], new->namelen); if (rename(argv[1], argv[2])) { perror("rename-file: "); exit(1); } remove_file_from_cache(argv[1]); add_cache_entry(new, 1); if (write_cache(newfd, active_cache, active_nr) || rename(".git/index.lock", ".git/index")) die("Unable to write new cachefile"); remove_lock = 0; return 0; } [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-16 1:44 ` Simon Fowler @ 2005-04-16 12:19 ` David Lang 2005-04-16 15:55 ` Simon Fowler 0 siblings, 1 reply; 81+ messages in thread From: David Lang @ 2005-04-16 12:19 UTC (permalink / raw) To: simon; +Cc: Linus Torvalds, David Woodhouse, Junio C Hamano, Petr Baudis, git On Fri, Apr 15, 2005 at 08:32:46AM -0700, Linus Torvalds wrote: > In other words, I'm right. I'm always right, but sometimes I'm more right > than other times. And dammit, when I say "files don't matter", I'm really > really Right(tm). > You're right, of course (All Hail Linus!), if you can make it work efficiently enough. Just to put something else on the table, here's how I'd go about tracking renames and the like, in another world where Linus /does/ make the odd mistake - it's basically a unique id for files in the repository, added when the file is first recognised and updated when update-cache adds a new version to the cache. Renames copy the id across to the new name, and add it into the cache. This gives you an O(n) way to tell what file was what across renames, and it might even be useful in Linus' world, or if someone wanted to build a traditional SCM on top of a git-a-like. Attached is a patch, and a rename-file.c to use it. Simon given that you have multiple machines creating files, how do you deal with the idea of the same 'unique id' being assigned to different files by different machines? David Lang -- There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies. -- C.A.R. Hoare ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-16 12:19 ` David Lang @ 2005-04-16 15:55 ` Simon Fowler 0 siblings, 0 replies; 81+ messages in thread From: Simon Fowler @ 2005-04-16 15:55 UTC (permalink / raw) To: David Lang; +Cc: git [-- Attachment #1: Type: text/plain, Size: 797 bytes --] On Sat, Apr 16, 2005 at 05:19:24AM -0700, David Lang wrote: > Simon > > given that you have multiple machines creating files, how do you deal with > the idea of the same 'unique id' being assigned to different files by > different machines? > The id is a sha1 hash of the current time and the full path of the file being added - the chances of that being replicated without malicious intent is extremely small. There are other things that could be used, like the hostname, username of the person running the program, etc, but I don't really see them being necessary. Simon -- PGP public key Id 0x144A991C, or http://himi.org/stuff/himi.asc (crappy) Homepage: http://himi.org doe #237 (see http://www.lemuria.org/DeCSS) My DeCSS mirror: ftp://himi.org/pub/mirrors/css/ [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-15 15:32 ` Linus Torvalds ` (2 preceding siblings ...) 2005-04-16 1:44 ` Simon Fowler @ 2005-04-16 20:29 ` Sanjoy Mahajan 2005-04-16 20:41 ` Linus Torvalds 3 siblings, 1 reply; 81+ messages in thread From: Sanjoy Mahajan @ 2005-04-16 20:29 UTC (permalink / raw) To: Linus Torvalds; +Cc: David Woodhouse, Junio C Hamano, Petr Baudis, git > And that "where did this come from" decision should be done at _search_ > time, not commit time. I like this elegant approach, but clever pattern matching can help even at commit time. Suppose hello.c is simply: printf ("Hello %d\n", year); And then developer A updates hello.c to: printf ("Hello %d\n", year); printf ("And %d\n", year+1); Meanwhile developer B updates hello.c to: printf ("Hello %d\n", yyyy); How to merge these two changes? The psychic solution is printf ("Hello %d\n", yyyy); printf ("And %d\n", yyyy+1); Darcs handles token renames specially, but it's not a general solution so let's leave it aside. The example does not have enough information to make the psychic solution unique or reliable, but imagine that the example were longer to solve that problem. You'd want to describe the delta A(hello.c) as 1. duplicated message line 2. changed 2nd line a bit And B(hello.c) as 1. Changed year to yyyy In that representation, merging the two deltas becomes 1. duplicated message line 2. changed 2nd line a bit 3. Changed year to yyyy in both lines Or, by commuting the merge operations and adjusting for their non-commutativity (in terminology like darcs's -- I'm also a physicist): 1. Changed year to yyyy 2. duplicated message line 3. changed 2nd line a bit So here some of the computation that Linus wants only at question time (e.g. 'how did that line get here??') is also useful at merge time. It's difficult (expensive, unreliable) to describe deltas in the form above or, worse, to merge two such descriptions, but I hope it illustrates the point. And perhaps a robust and easier-to-compute change-description language can be dreamt up, even if the general problem of describing changes compactly is not computable -- it's almost the same, or is the same, as finding the Kolmogorov complexity of a data set. Or have I missed a fundamental point? -Sanjoy `A society of sheep must in time beget a government of wolves.' - Bertrand de Jouvenal ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-16 20:29 ` Sanjoy Mahajan @ 2005-04-16 20:41 ` Linus Torvalds 0 siblings, 0 replies; 81+ messages in thread From: Linus Torvalds @ 2005-04-16 20:41 UTC (permalink / raw) To: Sanjoy Mahajan; +Cc: David Woodhouse, Junio C Hamano, Petr Baudis, git On Sat, 16 Apr 2005, Sanjoy Mahajan wrote: > > I like this elegant approach, but clever pattern matching can help even > at commit time. Suppose hello.c is simply: Here, what you're talking about is not "commit", but "merge". The git model very much separates the two events. You first generate a merged tree, an dyou commit that merge as a separate and largely totally independent phase. And yes, I agree that with merging, you do end up potentially wanting to try different things. When I've done my "git" merges, all I've really done is to make sure that the trivial parts basically merge in zero time, so that you can afford to perhaps spend some effort on handling the _real_ merge conflicts. Many systems seem to be designed around a "clever merge" algorithm, with darcs perhaps being the most extreme example. The problem with that design is that 99.9% of all the work is not at all about being clever, and if you try to base your design around the clever things, your performance will definitely suck. So I think that with git, you can actually really try to be clever, because when you get a merge conflict, you're now only worrying about one file out of 17,000, and then you can go wild on that one and try different merge algorithms (token merge, character-merge, line-based merge, you name it). Of course, I might not actually personally want to depend on any clever merges, but the git infrastructure really doesn't care. My plumbing doesn't merge the conflicts that arise within one single object, or the filename differences - you can do anything you want on that. Linus ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-14 18:36 ` Linus Torvalds 2005-04-14 19:59 ` Junio C Hamano @ 2005-04-15 9:14 ` David Woodhouse 2005-04-15 9:36 ` Ingo Molnar ` (2 more replies) 1 sibling, 3 replies; 81+ messages in thread From: David Woodhouse @ 2005-04-15 9:14 UTC (permalink / raw) To: Linus Torvalds; +Cc: Junio C Hamano, Petr Baudis, git On Thu, 2005-04-14 at 11:36 -0700, Linus Torvalds wrote: > And "merge these two trees" (which works on a _tree_ level) > or "find the common commit" (which works on a _commit_ level) I suspect that finding the common commit is actually a per-file thing; it's not just something you do for the _commit_ graph, then use for merging each file in the two branches you're trying to merge. Consider a simple repository which contains two files A and B. We start off with the first version of each ('A1B1'), and the owner of each file takes a branch and modifies their own file. There is cross-pulling between the two, and then each modifies the _other's_ file as well as their own... (A1B2)--(A2B2)--(A2'B3) / \ / \ / \ / \ (A1B1) X (...) \ / \ / \ / \ / (A2B1)--(A2B2)--(A3B2') Now, we're trying to merge the two branches. It appears that the most useful common ancestor to use for a three-way merge of file A is the version from tree 'A2B1', while the most useful common ancestor for merging file B is that in 'A1B2'. (I think it's a coincidence that in my example the useful files 'A2' and 'B2' actually do end up in a single tree together at some point.) -- dwmw2 ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-15 9:14 ` David Woodhouse @ 2005-04-15 9:36 ` Ingo Molnar 2005-04-15 10:05 ` David Woodhouse 2005-04-15 12:03 ` Johannes Schindelin 2005-04-15 14:53 ` Linus Torvalds 2 siblings, 1 reply; 81+ messages in thread From: Ingo Molnar @ 2005-04-15 9:36 UTC (permalink / raw) To: David Woodhouse; +Cc: Linus Torvalds, Junio C Hamano, Petr Baudis, git * David Woodhouse <dwmw2@infradead.org> wrote: > Consider a simple repository which contains two files A and B. We > start off with the first version of each ('A1B1'), and the owner of > each file takes a branch and modifies their own file. There is > cross-pulling between the two, and then each modifies the _other's_ > file as well as their own... > > (A1B2)--(A2B2)--(A2'B3) > / \ / \ > / \ / \ > (A1B1) X (...) > \ / \ / > \ / \ / > (A2B1)--(A2B2)--(A3B2') > > Now, we're trying to merge the two branches. It appears that the most > useful common ancestor to use for a three-way merge of file A is the > version from tree 'A2B1', while the most useful common ancestor for > merging file B is that in 'A1B2'. do such cases occur frequently? In the kernel at least it's not too typical. Would it be a problem to go for the simple solution of using (A1B1) as the common ancestor (based on the tree graph), and then to do a 3-way merge of all changes from that point on? Ingo ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-15 9:36 ` Ingo Molnar @ 2005-04-15 10:05 ` David Woodhouse 2005-04-15 14:53 ` Ingo Molnar 0 siblings, 1 reply; 81+ messages in thread From: David Woodhouse @ 2005-04-15 10:05 UTC (permalink / raw) To: Ingo Molnar; +Cc: Linus Torvalds, Junio C Hamano, Petr Baudis, git On Fri, 2005-04-15 at 11:36 +0200, Ingo Molnar wrote: > do such cases occur frequently? In the kernel at least it's not too > typical. Isn't it? I thought it was a fairly accurate representation of the process "I make a whole bunch of changes to files I maintain, pulling from Linus while occasionally asking him to pull from my tree. Sometimes my files are changed by someone else in Linus' tree, and sometimes I change files that I don't actually own.". -- dwmw2 ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-15 10:05 ` David Woodhouse @ 2005-04-15 14:53 ` Ingo Molnar 2005-04-15 15:09 ` David Woodhouse 0 siblings, 1 reply; 81+ messages in thread From: Ingo Molnar @ 2005-04-15 14:53 UTC (permalink / raw) To: David Woodhouse; +Cc: Linus Torvalds, Junio C Hamano, Petr Baudis, git * David Woodhouse <dwmw2@infradead.org> wrote: > On Fri, 2005-04-15 at 11:36 +0200, Ingo Molnar wrote: > > do such cases occur frequently? In the kernel at least it's not too > > typical. > > Isn't it? I thought it was a fairly accurate representation of the > process "I make a whole bunch of changes to files I maintain, pulling > from Linus while occasionally asking him to pull from my tree. > Sometimes my files are changed by someone else in Linus' tree, and > sometimes I change files that I don't actually own.". but the specific scenario you described would require _Linus'_ tree to be in limbo for a long time, and have uncommitted half-done edits. I.e.: (A1B2)--(A2B2)--(A2'B3) / \ / \ / \ / \ (A1B1) X (...) \ / \ / \ / \ / (A2B1)--(A2B2)--(A3B2') in the above scenario Linus' tree needs to 'cross' with a maintainer's tree. (maintainer's tree wont cross with another maintainer's tree, as maintainer-to-maintainer merges rare.) but for the scenario to occur, i think there needs to be a prolongued "limbo" period in Linus' tree for a 'crossing' to happen. But Linus' merges are typically almost atomic: they are done then they are pushed out. It's definitely not in the 'days, sometimes weeks' timescale as maintainer trees are. so for the scenario to occur, a maintainer, from whom Linus has just pulled an update and Linus is merging the tree manually without comitting, has to pull a file from the earlier Linus tree, and then Linus has to modify that same file again. This does not seem to be a common scenario. so i think to avoid the scenario, maintainers should not pull from each other - they should only pull/push to/from Linus' tree. Maybe this is an unacceptable limitation? Ingo ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-15 14:53 ` Ingo Molnar @ 2005-04-15 15:09 ` David Woodhouse 0 siblings, 0 replies; 81+ messages in thread From: David Woodhouse @ 2005-04-15 15:09 UTC (permalink / raw) To: Ingo Molnar; +Cc: Linus Torvalds, Junio C Hamano, Petr Baudis, git On Fri, 2005-04-15 at 16:53 +0200, Ingo Molnar wrote: > but the specific scenario you described would require _Linus'_ tree to > be in limbo for a long time, and have uncommitted half-done edits. > I.e.: > > (A1B2)--(A2B2)--(A2'B3) > / \ / \ > / \ / \ > (A1B1) X (...) > \ / \ / > \ / \ / > (A2B1)--(A2B2)--(A3B2') > > in the above scenario Linus' tree needs to 'cross' with a maintainer's > tree. (maintainer's tree wont cross with another maintainer's tree, > as maintainer-to-maintainer merges rare.) Is that true? Consider (A2B1) to be a bugfixes-only tree which I make available for Linus to pull from. I keep doing more experimental stuff in my own private copy of the tree along the bottom branch, while Linus _eventually_ responds to my pull request and moves on, stopping only to add a 'static' to one of my new functions. I move on too but don't pull from Linus again for a little while; the final merge happens when I _do_ pull again. -- dwmw2 ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-15 9:14 ` David Woodhouse 2005-04-15 9:36 ` Ingo Molnar @ 2005-04-15 12:03 ` Johannes Schindelin 2005-04-15 10:22 ` Theodore Ts'o 2005-04-15 14:53 ` Linus Torvalds 2 siblings, 1 reply; 81+ messages in thread From: Johannes Schindelin @ 2005-04-15 12:03 UTC (permalink / raw) To: David Woodhouse; +Cc: Linus Torvalds, Junio C Hamano, Petr Baudis, git Hi, On Fri, 15 Apr 2005, David Woodhouse wrote: > On Thu, 2005-04-14 at 11:36 -0700, Linus Torvalds wrote: > > And "merge these two trees" (which works on a _tree_ level) > > or "find the common commit" (which works on a _commit_ level) > > I suspect that finding the common commit is actually a per-file thing; > it's not just something you do for the _commit_ graph, then use for > merging each file in the two branches you're trying to merge. I disagree. In order to be trusted, this thing has to catch the following scenario: Skywalker and Solo start from the same base. They commit quite a lot to their trees. In between, Skywalker commits a tree, where the function "kazoom()" has been added to the file "deathstar.c", but Solo also added this function, but to the file "moon.c". A file-based merge would have no problem merging each file, such that in the end, "kazoom()" is defined twice. The same problems arise when one tries to merge line-wise, i.e. when for each line a (possibly different) merge-parent is sought. The concept here is a *transaction*: when going from one tree to the next tree via a commit, a sort of integrity is maintained, which is breached when only looking at files and commits. Ciao, Dscho ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-15 12:03 ` Johannes Schindelin @ 2005-04-15 10:22 ` Theodore Ts'o 0 siblings, 0 replies; 81+ messages in thread From: Theodore Ts'o @ 2005-04-15 10:22 UTC (permalink / raw) To: Johannes Schindelin Cc: David Woodhouse, Linus Torvalds, Junio C Hamano, Petr Baudis, git On Fri, Apr 15, 2005 at 02:03:08PM +0200, Johannes Schindelin wrote: > I disagree. In order to be trusted, this thing has to catch the following > scenario: > > Skywalker and Solo start from the same base. They commit quite a lot to > their trees. In between, Skywalker commits a tree, where the function > "kazoom()" has been added to the file "deathstar.c", but Solo also added > this function, but to the file "moon.c". A file-based merge would have no > problem merging each file, such that in the end, "kazoom()" is defined > twice. > > The same problems arise when one tries to merge line-wise, i.e. when for > each line a (possibly different) merge-parent is sought. Be careful. There is a very big tradeoff between 100% perfections in catching these sorts of errors, and usability. There exists SCM's where you are not allowed to do commit such merges until you do a test compile, or run a regression test suite (that being the only way to catch these sorts of problems when we merge two branches like this). BitKeeper never caught this sort of thing, and we trusted it. In practice it was also rarely a problem. I'll also note that BitKeeper doesn't restrict you from doing a committing a changeset when you have modified files that have yet to be checked in to the tree. Same issue; you can accidentally check in changesets that in trees that won't build, but if we added this kind of SCM-by-straightjacket philosophy it would decrease our productivity and people would simply not use such an SCM, thus negating its effectiveness. - Ted ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-15 9:14 ` David Woodhouse 2005-04-15 9:36 ` Ingo Molnar 2005-04-15 12:03 ` Johannes Schindelin @ 2005-04-15 14:53 ` Linus Torvalds 2005-04-15 15:29 ` David Woodhouse 2005-04-15 15:54 ` Paul Jackson 2 siblings, 2 replies; 81+ messages in thread From: Linus Torvalds @ 2005-04-15 14:53 UTC (permalink / raw) To: David Woodhouse; +Cc: Junio C Hamano, Petr Baudis, git On Fri, 15 Apr 2005, David Woodhouse wrote: > > I suspect that finding the common commit is actually a per-file thing; > it's not just something you do for the _commit_ graph, then use for > merging each file in the two branches you're trying to merge. I disagree. Conceptually, you should never do _anything_ on a file level. Why? Because individual files don't matter. You shouldn't merge two files cleanly just because they look fine - they _depend_ on the other files in the archive, and that's quite fundamentally why per-file tracking is really wrong from a project standpoint. So if you can't merge two files cleanly because the "project" history ended up being further back than the "file" history, then that's a _good_ thing. You don't know what the hell happened to the other files that this file depended on. Merging one file independently of the others is WRONG. Also, I suspect that you'll find that if you do cross-merges, you'll basically always end up in: > (I think it's a coincidence that in my example the useful files 'A2' and > 'B2' actually do end up in a single tree together at some point.) nope, I don't think that's coincidence. I think that's the normal case. Your file-based history is the one that can _incorrectly_ and coincidentally happen to have a single file at some point, but since that file doesn't stand alone, that's really not a fundamentally good reason to merge it. Really, this "individual files matter" approach is a _disease_. They don't. Individual files DO NOT EXIST. Files always exist as part of the project, and the _only_ time you track a single file is when the project is a single file (and then that will be very very obvious in a git archive, thank you very much). So the single-file mentality is a disease brought on by decades of _crap_. And by the fact that it ends up limiting the problem scope, so you can do certain things easier. For example, just doing intra-file diffs is a lot _easier_ and less time-consuming than doing inter-file diffs. Bit it is _absolutely_ not better. In fact, it is clearly inferior to anybody who spends even five seconds thinking about it - yet we still do it, because of the historical (and INCORRECT) mindset that "files matter". Files DO NOT matter. Never have. It's an implementation limitation to think they do. You'll screw yourself up, and when somebody comes up with a half-way efficient way to generate inter-fiel diffs, your architecture is totally and utterly unable to handle it. I don't care what you do at an SCM level, and if the crud you put on top of git wants to perpetuate mistakes of yesteryear, that's _your_ issue. But dammit, git is designed to do the right thing, and I will fight tooth and nail against anybody who thinks individual files matter. Linus ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-15 14:53 ` Linus Torvalds @ 2005-04-15 15:29 ` David Woodhouse 2005-04-15 15:51 ` Linus Torvalds 2005-04-15 15:54 ` Paul Jackson 1 sibling, 1 reply; 81+ messages in thread From: David Woodhouse @ 2005-04-15 15:29 UTC (permalink / raw) To: Linus Torvalds; +Cc: Junio C Hamano, Petr Baudis, git On Fri, 2005-04-15 at 07:53 -0700, Linus Torvalds wrote: > Files DO NOT matter. Never have. It's an implementation limitation to > think they do. You'll screw yourself up, and when somebody comes up with a > half-way efficient way to generate inter-fiel diffs, your architecture is > totally and utterly unable to handle it. > > I don't care what you do at an SCM level, and if the crud you put on top > of git wants to perpetuate mistakes of yesteryear, that's _your_ issue. > But dammit, git is designed to do the right thing, and I will fight tooth > and nail against anybody who thinks individual files matter. No, really: individual files _DO_ matter. There's a reason we split stuff up into separate files, and if you look closely you'll find that we don't just randomly put different functions into different files with neither rhyme nor reason -- there's a pattern to it; usually some kind of functional grouping. And when I'm looking for the change that broke something, I can almost always tell which file it's in and go looking in _that_ file. It's a _whole_ lot easier to use the equivalent of 'bk revtool' than it is to sift through all the unrelated commits in the whole tree. If that's an implementation limitation, then it's an implementation limitation in my _brain_ not just in my tools. OK, in fact it shouldn't be 'show me the history of this file'; it's often really 'show me the history of this function' which I want. But that's fine. All I'm suggesting is that we should include the metadata which says "content moved from file XXX to file YYY" along with the commit objects. I'm certainly not suggesting that we should implement jejb's idea of explicit 'file revision history' objects -- the tree-based philosophy is perfectly sane and sufficient. But we do _also_ need a little information which allows us to track content as it moves around within the tree, and the SCM has to have a sane way to filter out the noise when we're looking for what broke. Yes, that's part of the SCM functionality, and can live in an xattr-type field in the commit object -- but it does need to be stored, and in practice I suspect it _will_ be useful for merging too. It's not about ditching the per-tree tracking and doing per-file tracking instead. I agree that would be wrong. It's about storing enough information to track what happened to given content as it moved around within the tree. -- dwmw2 ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-15 15:29 ` David Woodhouse @ 2005-04-15 15:51 ` Linus Torvalds 0 siblings, 0 replies; 81+ messages in thread From: Linus Torvalds @ 2005-04-15 15:51 UTC (permalink / raw) To: David Woodhouse; +Cc: Junio C Hamano, Petr Baudis, git On Fri, 15 Apr 2005, David Woodhouse wrote: > > And when I'm looking for the change that broke something, I can almost > always tell which file it's in and go looking in _that_ file. Read my email about finding "what changed" that I sent out a minute ago. I claim that my algorithm for finding "what changed" handles your "single file" case as a very small (and usually quite uninteresting) special case. I claim (and if you just look at my proposal I think you'll agree) that I can track single functions, and do it efficiently. WITHOUT adding any meta-data at all. The thing is, if the question is "I have this piece of code, and I want to see what changed", you fundamentally _can_ do that efficiently. That's really what git was designed for. It's the whole _point_ of having history in the first place. If git didn't care, it wouldn't have a back-pointer to the tree it came from, and we'd all be just merging pure trees. But you mix that question up with "how do I save that information in the commit", which is a totally unnecessary mix-up, and which makes things MUCH more complicated, for absolutely zero gain. In fact, because you mixed up those two issues, the problem now became so complicated that you can no longer solve it, so you start doing hacks like "the user has to tell us what he did" (aka "bk mv" or "svn rename"), and you start mentally to limit yourself to files, because you realize that you _have_ to limit your intractable problem to make it at all solvable. And I'm telling you that your problem is STUPID. You made it stupid by thinking that every question about the source tree should be answered at commit time. Which just clearly isn't true! If you just drop the tying-together, and accept that "what changed" is a valid question _regardless_ of trying to track it at commit time, now your whole world opens up. Birds sing, the sun is shining on you, and beautiful scantily clad women (or men) dance around you. The world is suddenly a good place, just _filled_ with possibilities. Suddenly you realize that if the question is just "what changed in this piece of code" (and let's face it, that _is_ the question), you can track it afterwards. Trying to tie in "commit time" into the question was what made it hard. If you do _not_ due that (totally unnecessary) tie-in, the question suddenly becomes easy to answer, and several obvious and simple answers spring to mind pretty immediately. > It's not about ditching the per-tree tracking and doing per-file > tracking instead. I agree that would be wrong. It's about storing enough > information to track what happened to given content as it moved around > within the tree. No. Git absolutely does have everything you need already. You just aren't realizing that it's already there - in the data - and that you can do much more intelligent searches for changes if you accept that undeniable fact. The fact that you can NOT do those searches at commit-time (which is a global op), and can only do them if you have a specific question in mind ("what changed _here_"), is the big issue. The thing is, at commit-time you'd need to answer every possible question ("what changed here, and here, and here, and in this function, and in this file, and in this directory and why did this identifier get renamed and why is the sky blue"). AND YOU FUNDAMENTALLY CANNOT DO THAT. It's impossible. But once you _know_ the question (which is the only time when the answer is actually relevant, so why care about if before that time?), you can find out the answer by just automating the job of looking at the _data_. It's easy. The question makes it obvious by its nature. The question is the thing that gives you the specifics that makes the search possible in the first place. And _this_ is why the data matters. Renames and file boundaries do not. And until you accept that, you just limit yourself. Linus ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-15 14:53 ` Linus Torvalds 2005-04-15 15:29 ` David Woodhouse @ 2005-04-15 15:54 ` Paul Jackson 2005-04-15 16:30 ` C. Scott Ananian 1 sibling, 1 reply; 81+ messages in thread From: Paul Jackson @ 2005-04-15 15:54 UTC (permalink / raw) To: Linus Torvalds; +Cc: dwmw2, junkio, pasky, git Linus wrote: > For example, just doing intra-file diffs is a lot _easier_ and less > time-consuming than doing inter-file diffs. Um ah ... could you explain what you mean by inter and intra file diffs? Google found a three year old message by Andrew Morton, discussing inter and intra file fragmentation on ext2/ext3 file systems and the find_group_dir() routine. I don't think that's what you had in mind ;). When I run the 'diff' command, it usually between two files, not between two parts of a file. So I'd have thought inter file diffs were easier. Clearly, I don't git it. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401 ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-15 15:54 ` Paul Jackson @ 2005-04-15 16:30 ` C. Scott Ananian 2005-04-15 18:29 ` Paul Jackson 0 siblings, 1 reply; 81+ messages in thread From: C. Scott Ananian @ 2005-04-15 16:30 UTC (permalink / raw) To: Paul Jackson; +Cc: Linus Torvalds, dwmw2, junkio, pasky, git On Fri, 15 Apr 2005, Paul Jackson wrote: > Um ah ... could you explain what you mean by inter and intra file diffs? intra file diffs: here are two versions of the same file. what changed? inter file diffs: here is a new file, and here are *all the files in the current committed version*. Where did the contents of this new file come from? (Note that the new file is often a slightly changed version of an existing file in the current committed version. But we don't assume that must be true.) --scott supercomputer Pakistan WSHOOFS SECANT LCPANGS SDI assassination ZPSECANT SEQUIN AEBARMAN ESCOBILLA bomb mustard STANDEL ESGAIN Nazi FJDEFLECT ( http://cscott.net/ ) ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-15 16:30 ` C. Scott Ananian @ 2005-04-15 18:29 ` Paul Jackson 0 siblings, 0 replies; 81+ messages in thread From: Paul Jackson @ 2005-04-15 18:29 UTC (permalink / raw) To: C. Scott Ananian; +Cc: torvalds, dwmw2, junkio, pasky, git > intra file diffs: here are two versions of the same file. Ah so. Linus faked me out. I was _sure_ that by "file" he meant "file" -- as in a bucket of bits with a unique identifying <sha1>. In that message, I guess by "file" he meant "a version controlled file, consisting of a series of content versions and meta-data" That's what I get for trusting Linus to always speak as a kernel hacker, not an SCM hacker. -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401 ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-14 18:12 ` Junio C Hamano 2005-04-14 18:36 ` Linus Torvalds @ 2005-04-14 18:51 ` Christopher Li 2005-04-14 19:35 ` Petr Baudis 2 siblings, 0 replies; 81+ messages in thread From: Christopher Li @ 2005-04-14 18:51 UTC (permalink / raw) To: Junio C Hamano; +Cc: Petr Baudis, Linus Torvalds, git On Thu, Apr 14, 2005 at 11:12:35AM -0700, Junio C Hamano wrote: > >>>>> "PB" == Petr Baudis <pasky@ucw.cz> writes: > > At this moment in the script, we have run "read-tree" the > ancestor so the dircache has the original. %tree0 and %tree1 > both did not touch the path ($_ here) so it is the same as > ancestor. When '-f' is specified we are populating the output > working tree with the merge result so that is what that > 'checkout-cache' is about. "O - $path" means "we took the > original". > > The idea is to populate the dircache of merge-temp with the > merge result and leave uncertain stuff as in the common ancestor > state, so that the user can fix them starting from there. > > Maybe it is a good time for me to summarize the output somewhere > in a document. > > O - $path Tree-A and tree-B did not touch this; the result > is taken from the ancestor (O for original). > > A D $path Only tree-A (or tree-B) deleted this and the other > B D $path branch did not touch this; the result is to delete. > > A M $path Only tree-A (or tree-B) modified this and the other > B M $path branch did not touch this; the result is to use one > from tree-A (or tree-B). This includes file > creation case. > > *DD $path Both tree-A and tree-B deleted this; the result > is to delete. > > *DM $path Tree-A deleted while tree-B modified this (or > *MD $path vice versa), and manual conflict resolution is > needed; dircache is left as in the ancestor, and > the modified file is saved as $path~A~ in the > working directory. The user can rename it to $path > and run show-diff to see what Tree-A wanted to do > and decide before running update-cache. > > *MM $path Tree-A and tree-B did the exact same > modification; the result is to use that. > > MRG $path Tree-A and tree-B have different modifications; > run "merge" and the merge result is left as > $path in the working directory. > > In cases other than *DM, *MD, and MRG, the result is trivial and I believe there is simpler way to do it as in my demo python script. I start it easier but you bits me in time. It is a demo script, it only print the action instead of actually going out to do it. change that to corresponding os.system("") call leaves to the reader. Again, this is a demo how it can be done. Not python vs perl thing I did not chose perl only because I am not good at it. #!/usr/bin/env python import re import sys import os from pprint import pprint def get_tree(commit): data = os.popen("cat-file commit %s"%commit).read() return re.findall(r"(?m)^tree (\w+)", data)[0] PREFIX = 0 PATH = -1 SHA = -2 ORIGSHA = -3 def get_difftree(old, new): lines = os.popen("diff-tree %s %s"%(old, new)).read().split("\x00") patterns = (r"(\*)(\d+)->(\d+)\s(\w+)\s(\w+)->(\w+)\s(.*)", r"([+-])(\d+)\s(\w+)\s(\w+)\s(.*)") res = {} for l in lines: if not l: continue for p in patterns: m = re.findall(p, l) if m: m = m[0] res[m[-1]] = m break else: raise "difftree: unknow line", l return res def analyze(diff1, diff2): diff1only = [ diff1[k] for k in diff1 if k not in diff2 ] diff2only = [ diff2[k] for k in diff2 if k not in diff1 ] both = [ (diff1[k],diff2[k]) for k in diff2 if k in diff1 ] action(diff1only) action(diff2only) action_two(both) def action(diffs): for act in diffs: if act[PREFIX] == "*": print "modify", act[PATH], act[SHA] elif act[PREFIX] == '-': print "remove", act[PATH], act[SHA] elif act[PREFIX] == '+': print "add", "remove", act[PATH], act[SHA] else: raise "unknow action" def action_two(diffs): for act1, act2 in diffs: if len(act1) == len(act2): # same kind type if act1[PREFIX] == act2[PREFIX]: if act1[SHA] == act2[SHA] or act1[PREFIX] == '-': return action(act1) if act1[PREFIX]=='*': print "3way-merge", act1[PATH], act1[ORIGSHA], act1[SHA], act2[SHA] return print "unable to handle", act[PATH] print "one side wants", act1[PREFIX] print "the other side wants", act2[PREFIX] args = sys.argv[1:] trees = map(get_tree, args) print "check out tree", trees[0] diff1 = get_difftree(trees[0], trees[1]) diff2 = get_difftree(trees[0], trees[2]) analyze(diff1, diff2) ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Re: Merge with git-pasky II. 2005-04-14 18:12 ` Junio C Hamano 2005-04-14 18:36 ` Linus Torvalds 2005-04-14 18:51 ` Christopher Li @ 2005-04-14 19:35 ` Petr Baudis 2005-04-14 23:12 ` Junio C Hamano 2 siblings, 1 reply; 81+ messages in thread From: Petr Baudis @ 2005-04-14 19:35 UTC (permalink / raw) To: Junio C Hamano; +Cc: Linus Torvalds, git Dear diary, on Thu, Apr 14, 2005 at 08:12:35PM CEST, I got a letter where Junio C Hamano <junkio@cox.net> told me that... > >>>>> "PB" == Petr Baudis <pasky@ucw.cz> writes: > > PB> Bah, you outran me. ;-) > > Just being in a different timezone, I guess. > > PB> I'll change it to use the cool git-pasky stuff (commit-id etc) and its > PB> style of committing - that is, it will merely record the update-caches > PB> to be done upon commit, and it will read-tree the branch we are merging > PB> to instead of the ancestor. (So that git diff gives useful output.) > > Sorry, I have not seen what you have been doing since pasky 0.3, > and I have not even started to understand the mental model of > the world your tool is building. That said, my gut feeling is > that telling this script about git-pasky's world model might be > a mistake. I'd rather see you consider the script as mere "part > of the plumbing". Maybe adding an extra parameter to the script > to let the user explicitly specify the common ancestor to use > would be needed, but I would prefer git-pasky-merge to do its > own magic (converting symbolic commit names into raw commit > names and such) before calling this low level script. > > That way people like me who have not migrated to your framework > can still keep using it. All the script currently needs is a > bare git object database; i.e., nothing other than what is in > .git/objects and a couple of commit record SHA1s as its > parameters. No .git/heads/, no .git/HEAD.local, no .git/tags, > are involved for it to work, and I would prefer to keep things > that way if possible. I see, and I actually agree with it. However, I'll want merge-tree.pl to do a little less than it does now for that, though. The mechanics in "kernel" is fine as long as I can control policy in my "userspace". ;-) BTW, the git* name sorta imply my toilet instead of the core plumbing, and it'd be more consistent with the current plumbnaming; and could we have it with the .pl extension, please? :-) What I would like your script to do is therefore just do the merge in a given already prepared (including built index) directory, with a passed base. The base should be determined by a separate tool (I already saw some patches); most future "science" will probably go to a clever selection of this base, anyway. This will give the tool maximal flexibility. E.g., then someone who wants to can just merge with his working copy (if you don't give checkout-cache -f - but why would you anyway), or do whatever other cleverness he wants. > >> * show-diff updates to add -r flag to squelch diffs for files not in > >> the working directory. This is mainly useful when verifying the > >> result of an automated merge. ..snip.. > was too tired and did not think of a letter when I wrote it. I > guess '-r' stood for removed, but I agree it is a bad choice. > Any objections to '-q'? None here. > >> +# Create a temporary directory and go there. > >> +system 'rm', '-rf', ',,merge-temp'; > > PB> Can't we call it just ,,merge? > > I'd rather have a command line option '-o' (scrapping the > current '-o' and renaming it to something else; as you can see I > am terrible at picking option names ;-)) to mean "output to this > directory". I am not really an Arch person so I do not > particulary care about /^,,/. How about "git~merge~$$"? I'm all for an -o, and I don't mind ,, - I just don't want it uselessly long. I hope "git~merge~$$" was a joke... :-) > >> +for ((',,merge-temp', '.git')) { mkdir $_; chdir $_; } > >> +symlink "../../.git/objects", "objects"; > >> +chdir '..'; > >> + > >> +my $ancestor_tree = read_commit_tree($common); > >> +system 'read-tree', $ancestor_tree; > >> + > >> +my %tree0 = read_diff_tree($ancestor_tree, read_commit_tree($ARGV[0])); > >> +my %tree1 = read_diff_tree($ancestor_tree, read_commit_tree($ARGV[1])); > >> + > >> +my @ancestor_file = read_show_files(); > >> +my %ancestor_file = map { $_ => 1 } @ancestor_file; > >> + > >> +for (@ancestor_file) { > >> + if (! exists $tree0{$_} && ! exists $tree1{$_}) { By the way, what about indentation with tabs? If you have a strong opinion about this, I don't insist - but if you really don't mind/care either way, it'd be great to use tabs as in the rest of the git code. > >> + if ($full_checkout) { > >> + system 'checkout-cache', $_; > >> + } > >> + print STDERR "O - $_\n"; > > PB> Huh, what are you trying to do here? I think you should just record > PB> remove, no? (And I wouldn't do anything with my read-tree. ;-) > > At this moment in the script, we have run "read-tree" the > ancestor so the dircache has the original. %tree0 and %tree1 > both did not touch the path ($_ here) so it is the same as > ancestor. When '-f' is specified we are populating the output > working tree with the merge result so that is what that > 'checkout-cache' is about. "O - $path" means "we took the > original". Aha! Thanks. Is there a fundamental reason why the directory cache contains the ancestor instead of the destination branch? It makes no sense to me and I think the script actually does not fundamentally depend on it. My main motivation is that the user can then trivially see what is he actually going to commit to his destination branch, which would be bought for free by that. > The idea is to populate the dircache of merge-temp with the > merge result and leave uncertain stuff as in the common ancestor > state, so that the user can fix them starting from there. And this is another thing I dislike a lot. I'd like merge-tree.pl to leave my directory cache alone, thank you very much. You know, I see what goes to the directory cache as actually part of the policy part. What you actually do is interfering with my different policy choice, which is to record stuff to index only at the time of commit (I've asked about this and noone replied, so I assume it's an ok choice). show-diff does the right thing for me then, and I don't need to care about losing *any* information when replacing/rebuilding the index for any reason. I have full control, and I like that. :-) I'd be happy with parsing merge-tree.pl output and doing the right thing on my side. Of course I could then blast away the tediously modified index with my one, but I didn't need to do any such hacking before and I'd prefer not to now either. Actually, the only time I need to do explicit update-cache (with my policy) when doing git merge is when deleting stuff or adding new stuff; both of this is not so common as modifying, when I need not to do anything. > Maybe it is a good time for me to summarize the output somewhere > in a document. > > O - $path Tree-A and tree-B did not touch this; the result > is taken from the ancestor (O for original). > > A D $path Only tree-A (or tree-B) deleted this and the other > B D $path branch did not touch this; the result is to delete. > > A M $path Only tree-A (or tree-B) modified this and the other > B M $path branch did not touch this; the result is to use one > from tree-A (or tree-B). This includes file > creation case. Could we please have the file creation case separately? Modification is much more common and creation has pretty different consequences (especially that it can't combine with anything else :-). > *DD $path Both tree-A and tree-B deleted this; the result > is to delete. > > *DM $path Tree-A deleted while tree-B modified this (or > *MD $path vice versa), and manual conflict resolution is > needed; dircache is left as in the ancestor, and > the modified file is saved as $path~A~ in the > working directory. The user can rename it to $path > and run show-diff to see what Tree-A wanted to do > and decide before running update-cache. > > *MM $path Tree-A and tree-B did the exact same > modification; the result is to use that. > > MRG $path Tree-A and tree-B have different modifications; > run "merge" and the merge result is left as > $path in the working directory. Hmm. I actually don't like this naming. I think it's not too consistent, is irregular, therefore parsing it would be ugly. What I propose: 12c\tname <- legend <- original file D <- tree #1 removed file D <- tree #2 removed file DD <- both trees removed file M <- tree #1 modified file M DM* <- conflict, tree #1 removed file, tree #2 modified file MD* MM <- exact same modification MM* <- different modifications, merging This is generic, theoretically scales well even to more trees, is easy to parse trivially, still is human readable (actually the asterisk in the 'conflict' column is there basically only for the humans), is completely regular and consistent. Now that we have the notion of tree A and tree B gone, I'd prefer to use numbers instead of letters for the ~1~ and ~2~ suffixes. Not insisting, though. What do you think? > In cases other than *DM, *MD, and MRG, the result is trivial and > is recorded in the dircache. Without '-o' (to be renamed ;-) > nor '-f' there will not be a file checked out in the working > directory for them. The three merge cases need human attention. > The dircache is not touched in these cases and left as the > ancestor version, and the working directory gets some file as > described above. > > NOTE NOTE NOTE: I am not dealing with a case where both branches > create the same file but with different contents. In such a > case the current code falls into MRG path without having a > common ancestor, which is nonsense---I can use /dev/null as the > common ancestor, I guess. Also NOTE NOTE NOTE I need to detect That might be the best way at least for the start, although I suspect that merge will fail horribly this way even in the case of slightest differences; still better than nothing. > the case where one branch creates a directory while the other > creates a file. There is nothing an automated tool can do in > that case but it needs to be detected and be told the user > loudly. Or when both branches create directories... ;-) -- Petr "Pasky" Baudis Stuff: http://pasky.or.cz/ C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-14 19:35 ` Petr Baudis @ 2005-04-14 23:12 ` Junio C Hamano 2005-04-14 20:24 ` Christopher Li 2005-04-14 23:31 ` Petr Baudis 0 siblings, 2 replies; 81+ messages in thread From: Junio C Hamano @ 2005-04-14 23:12 UTC (permalink / raw) To: Petr Baudis; +Cc: Linus Torvalds, git >>>>> "PB" == Petr Baudis <pasky@ucw.cz> writes: PB> What I would like your script to do is therefore just do the PB> merge in a given already prepared (including built index) PB> directory, with a passed base. The base should be determined PB> by a separate tool (I already saw some patches); most future PB> "science" will probably go to a clever selection of this PB> base, anyway. I think you are contradicting yourself for saying the above after agreeing with me that the script should just work on trees not commits. My understanding is that the tools is just to merge two related trees relative to another ancestor tree, nothing more. Especially, it should not care what is in the working directory---that is SCM person's business. I am just trying to follow my understanding of what Linus wanted. One of the guiding principle is to do as much things as in dircache without ever checking things out or touching working files unnecessarily. PB> This will give the tool maximal flexibility. I suspect it would force me to have a working directory populated with files, just to do a merge. PB> I'm all for an -o, and I don't mind ,, - I just don't want it uselessly PB> long. I hope "git~merge~$$" was a joke... :-) Which part do you object to? PID part? or tilde? Would git~merge do, perhaps? It probably would not matter to you because as an SCM you would always give an explicit --output parameter to the script anyway. PB> By the way, what about indentation with tabs? If you have a PB> strong opinion about this, I don't insist - but if you PB> really don't mind/care either way, it'd be great to use tabs PB> as in the rest of the git code. I do not have a strong opinion, but it is more trouble for me only because I am lazy and am used to the indentation my Emacs gives me. I write code other than git, so changing Perl-mode indentation setting globally for all .pl files is not an option for me. I'll see what I can do when I have time. PB> Is there a fundamental reason why the directory cache PB> contains the ancestor instead of the destination branch? Because you are thinking as an SCM person where there are distinction between tree-A and tree-B, two heads being merged. There is no "destination branch" nor "source branch" in what I am doing. It is a merge of two equals derived from the same ancestor. PB> I think the script actually does not fundamentally depend on it. My main PB> motivation is that the user can then trivially see what is he actually PB> going to commit to his destination branch, which would be bought for PB> free by that. And again the user is *not* commiting to his "destination branch". At the level I am working at, the merge result should be commited with two -p parameters to commit-tree --- tree-A and tree-B, both being equal parents from the POV of git object storage. PB> And this is another thing I dislike a lot. I'd like merge-tree.pl to PB> leave my directory cache alone, thank you very much. You know, I see PB> what goes to the directory cache as actually part of the policy part. Remember I am not touching *your* dircache. It is a dircache in the temporary merge area, specifically set up to help you review the merge. Can't the SCM driver do things along this line, perhaps? - You have your working files and your dircache. They may not match because you have uncommitted changes to your environment. You want to merge with Linus head. You know its SHA1 (call it COMMIT-Linus). Your SCM knows which commit you started with (call it COMMIT-Current). - First you merge the tree associated with COMMIT-Current. Use it and COMMIT-Linus to find the common ancestor to use. - Now use the tree SHA of COMMIT-Current, tree SHA1 of COMMIT-Linus, and tree SHA1 of the common ancestor commit to drive git-merge.perl (to be renamed ;-). You will get a temporary directory. Have your user examine what is in there, and fix the merge and have them tell you they are happy. - You go to that temporary directory, do write-tree and commit-tree with -p parameter of COMMIT-Linus and COMMIT-Current. This will result in a new commit. Call that COMMIT-Merge. - You, as an SCM, should know what your user have done in the working directory relative to COMMIT-Current. Especially you should know the set of paths involved in that change. Go in to the temporary area, checkout-cache those files if you have not done so. Apply the changes you have there. Optionally have the user examine the changes and have him confirm. Lift those files into the user's working directory. - Do your bookkeeping like "echo COMMIT-Merge >.git/Head", to make the user's working files based on COMMIT-Merge, and run read-tree using the COMMIT-Merge in the user's working directory. At this point, show-diff output should show what the changes your user have had made if he had started working based on COMMIT-Merge instead of starting from COMMIT-Current. I think the above would result in what SCM person would call "merge upstream/sidestream changes into my working directory". ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-14 23:12 ` Junio C Hamano @ 2005-04-14 20:24 ` Christopher Li 2005-04-14 23:31 ` Petr Baudis 1 sibling, 0 replies; 81+ messages in thread From: Christopher Li @ 2005-04-14 20:24 UTC (permalink / raw) To: Junio C Hamano; +Cc: Petr Baudis, Linus Torvalds, git Hi Junio, I think if the merge tree belong to plumbing, you can do even less in the merge.perl. You can just print out the instruction for the upper level SCM what to to without actually doing it yourself. So you don't have to do touch anything in the tree. That is the way I use in my previous python script. You just print out some easy to modify e.g. in my python script it prints: (BTW, poor choice of print out name) check out tree 253290af8b9ebc8565dd8de4cda24d0432a92b57 modify pre-process.c 7684c115a87e41a9226ce79478101c746cf22c34 3way-merge check.c dcb970cc1c5a83284dc5986abf07b6da76a8758c f77bfe119c19d928879091e0e3ee6debe3f1e1bf d315b43b025350d0107568a4d42cc2494d38621d Your merge tree can do the smae. Then the supper level SCM can easily follow instruction. Save your effort and make no assumption what SCM module is. Chris On Thu, Apr 14, 2005 at 04:12:34PM -0700, Junio C Hamano wrote: > >>>>> "PB" == Petr Baudis <pasky@ucw.cz> writes: > > I think you are contradicting yourself for saying the above > after agreeing with me that the script should just work on trees > not commits. My understanding is that the tools is just to > merge two related trees relative to another ancestor tree, > nothing more. Especially, it should not care what is in the > working directory---that is SCM person's business. > > I am just trying to follow my understanding of what Linus > wanted. One of the guiding principle is to do as much things as > in dircache without ever checking things out or touching working > files unnecessarily. > > PB> This will give the tool maximal flexibility. > > I suspect it would force me to have a working directory > populated with files, just to do a merge. > > PB> I'm all for an -o, and I don't mind ,, - I just don't want it uselessly > PB> long. I hope "git~merge~$$" was a joke... :-) > > Which part do you object to? PID part? or tilde? Would > git~merge do, perhaps? It probably would not matter to you > because as an SCM you would always give an explicit --output > parameter to the script anyway. > > PB> By the way, what about indentation with tabs? If you have a > PB> strong opinion about this, I don't insist - but if you > PB> really don't mind/care either way, it'd be great to use tabs > PB> as in the rest of the git code. > > I do not have a strong opinion, but it is more trouble for me > only because I am lazy and am used to the indentation my Emacs > gives me. I write code other than git, so changing Perl-mode > indentation setting globally for all .pl files is not an option > for me. I'll see what I can do when I have time. > > PB> Is there a fundamental reason why the directory cache > PB> contains the ancestor instead of the destination branch? > > Because you are thinking as an SCM person where there are > distinction between tree-A and tree-B, two heads being merged. > There is no "destination branch" nor "source branch" in what I > am doing. It is a merge of two equals derived from the same > ancestor. > > PB> I think the script actually does not fundamentally depend on it. My main > PB> motivation is that the user can then trivially see what is he actually > PB> going to commit to his destination branch, which would be bought for > PB> free by that. > > And again the user is *not* commiting to his "destination > branch". At the level I am working at, the merge result should > be commited with two -p parameters to commit-tree --- tree-A and > tree-B, both being equal parents from the POV of git object > storage. > > PB> And this is another thing I dislike a lot. I'd like merge-tree.pl to > PB> leave my directory cache alone, thank you very much. You know, I see > PB> what goes to the directory cache as actually part of the policy part. > > Remember I am not touching *your* dircache. It is a dircache in > the temporary merge area, specifically set up to help you review > the merge. > > Can't the SCM driver do things along this line, perhaps? > > - You have your working files and your dircache. They may not > match because you have uncommitted changes to your > environment. You want to merge with Linus head. You know > its SHA1 (call it COMMIT-Linus). Your SCM knows which commit > you started with (call it COMMIT-Current). > > - First you merge the tree associated with COMMIT-Current. Use > it and COMMIT-Linus to find the common ancestor to use. > > - Now use the tree SHA of COMMIT-Current, tree SHA1 of > COMMIT-Linus, and tree SHA1 of the common ancestor commit to > drive git-merge.perl (to be renamed ;-). You will get a > temporary directory. Have your user examine what is in > there, and fix the merge and have them tell you they are > happy. > > - You go to that temporary directory, do write-tree and > commit-tree with -p parameter of COMMIT-Linus and > COMMIT-Current. This will result in a new commit. Call that > COMMIT-Merge. > > - You, as an SCM, should know what your user have done in the > working directory relative to COMMIT-Current. Especially you > should know the set of paths involved in that change. Go in > to the temporary area, checkout-cache those files if you have > not done so. Apply the changes you have there. Optionally > have the user examine the changes and have him confirm. Lift > those files into the user's working directory. > > - Do your bookkeeping like "echo COMMIT-Merge >.git/Head", to > make the user's working files based on COMMIT-Merge, and run > read-tree using the COMMIT-Merge in the user's working > directory. At this point, show-diff output should show what > the changes your user have had made if he had started working > based on COMMIT-Merge instead of starting from > COMMIT-Current. > > I think the above would result in what SCM person would call > "merge upstream/sidestream changes into my working directory". > ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Re: Merge with git-pasky II. 2005-04-14 23:12 ` Junio C Hamano 2005-04-14 20:24 ` Christopher Li @ 2005-04-14 23:31 ` Petr Baudis 2005-04-15 0:58 ` Junio C Hamano 2005-04-15 10:22 ` Junio C Hamano 1 sibling, 2 replies; 81+ messages in thread From: Petr Baudis @ 2005-04-14 23:31 UTC (permalink / raw) To: Junio C Hamano; +Cc: Linus Torvalds, git Dear diary, on Fri, Apr 15, 2005 at 01:12:34AM CEST, I got a letter where Junio C Hamano <junkio@cox.net> told me that... > >>>>> "PB" == Petr Baudis <pasky@ucw.cz> writes: > > PB> What I would like your script to do is therefore just do the > PB> merge in a given already prepared (including built index) > PB> directory, with a passed base. The base should be determined > PB> by a separate tool (I already saw some patches); most future > PB> "science" will probably go to a clever selection of this > PB> base, anyway. > > I think you are contradicting yourself for saying the above > after agreeing with me that the script should just work on trees > not commits. My understanding is that the tools is just to > merge two related trees relative to another ancestor tree, > nothing more. Especially, it should not care what is in the > working directory---that is SCM person's business. Yes. Isn't this exactly what I'm saying? I'm arguing for doing less in my paragraph, you are arguing for doing less in your paragraph, and we even seem to agree on the direction in which we should do less. > I am just trying to follow my understanding of what Linus > wanted. One of the guiding principle is to do as much things as > in dircache without ever checking things out or touching working > files unnecessarily. I'm just arguing that instead of directly touching the directory cache, you should just list what would you do there - and you already do this, I think. So I'd be happy with a switch which would just do that and not touch the directory cache. I'll parse your output and do the right thing for me. > PB> This will give the tool maximal flexibility. > > I suspect it would force me to have a working directory > populated with files, just to do a merge. Why would that be so? > PB> I'm all for an -o, and I don't mind ,, - I just don't want it uselessly > PB> long. I hope "git~merge~$$" was a joke... :-) > > Which part do you object to? PID part? or tilde? Would > git~merge do, perhaps? It probably would not matter to you > because as an SCM you would always give an explicit --output > parameter to the script anyway. Yes. I'll just override it with ,,merge, I think. So, do whatever you want. ;-)) > PB> By the way, what about indentation with tabs? If you have a > PB> strong opinion about this, I don't insist - but if you > PB> really don't mind/care either way, it'd be great to use tabs > PB> as in the rest of the git code. > > I do not have a strong opinion, but it is more trouble for me > only because I am lazy and am used to the indentation my Emacs > gives me. I write code other than git, so changing Perl-mode > indentation setting globally for all .pl files is not an option > for me. I'll see what I can do when I have time. Doesn't Emacs have something equivalent to ./.vimrc? I've also seen those funny -*- strings. Well, if it would mean a lot of trouble for you, just forget about it. > PB> Is there a fundamental reason why the directory cache > PB> contains the ancestor instead of the destination branch? > > Because you are thinking as an SCM person where there are > distinction between tree-A and tree-B, two heads being merged. > There is no "destination branch" nor "source branch" in what I > am doing. It is a merge of two equals derived from the same > ancestor. That's a valid point of view too. Actually, when you would have a mode in which you would not write to the directory cache, do you need to read from it? You could do just direct cat-files like for the other trees, and it would be even faster. Then, you could do without a directory cache altogether in this mode. > PB> And this is another thing I dislike a lot. I'd like merge-tree.pl to > PB> leave my directory cache alone, thank you very much. You know, I see > PB> what goes to the directory cache as actually part of the policy part. > > Remember I am not touching *your* dircache. It is a dircache in > the temporary merge area, specifically set up to help you review > the merge. Yes, but I want to have a control over its dircache too. :-) That is because I want the user to be able to use the regular git commands like "git diff" there. > Can't the SCM driver do things along this line, perhaps? > > - You have your working files and your dircache. They may not > match because you have uncommitted changes to your > environment. You want to merge with Linus head. You know > its SHA1 (call it COMMIT-Linus). Your SCM knows which commit > you started with (call it COMMIT-Current). > > - First you merge the tree associated with COMMIT-Current. Use > it and COMMIT-Linus to find the common ancestor to use. > > - Now use the tree SHA of COMMIT-Current, tree SHA1 of > COMMIT-Linus, and tree SHA1 of the common ancestor commit to > drive git-merge.perl (to be renamed ;-). You will get a > temporary directory. Have your user examine what is in > there, and fix the merge and have them tell you they are > happy. > > - You go to that temporary directory, do write-tree and > commit-tree with -p parameter of COMMIT-Linus and > COMMIT-Current. This will result in a new commit. Call that > COMMIT-Merge. > > - You, as an SCM, should know what your user have done in the > working directory relative to COMMIT-Current. Especially you > should know the set of paths involved in that change. Go in > to the temporary area, checkout-cache those files if you have > not done so. Apply the changes you have there. Optionally > have the user examine the changes and have him confirm. Lift > those files into the user's working directory. > > - Do your bookkeeping like "echo COMMIT-Merge >.git/Head", to > make the user's working files based on COMMIT-Merge, and run > read-tree using the COMMIT-Merge in the user's working > directory. At this point, show-diff output should show what > the changes your user have had made if he had started working > based on COMMIT-Merge instead of starting from > COMMIT-Current. > > I think the above would result in what SCM person would call > "merge upstream/sidestream changes into my working directory". And that's exactly what I'm doing now with git merge. ;-) In fact, ideally the whole change in my scripts when your script is finished would be replacing checkout-cache `diff-tree` # symbolic git diff $base $merged | git apply with merge-tree.pl -b $base $(tree-id) $merged | parse-your-output -- Petr "Pasky" Baudis Stuff: http://pasky.or.cz/ C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-14 23:31 ` Petr Baudis @ 2005-04-15 0:58 ` Junio C Hamano 2005-04-14 22:30 ` Christopher Li 2005-04-15 10:22 ` Junio C Hamano 1 sibling, 1 reply; 81+ messages in thread From: Junio C Hamano @ 2005-04-15 0:58 UTC (permalink / raw) To: Petr Baudis; +Cc: Linus Torvalds, git >>>>> "PB" == Petr Baudis <pasky@ucw.cz> writes: >> I think the above would result in what SCM person would call >> "merge upstream/sidestream changes into my working directory". PB> And that's exactly what I'm doing now with git merge. ;-) In fact, PB> ideally the whole change in my scripts when your script is finished PB> would be replacing PB> checkout-cache `diff-tree` # symbolic PB> git diff $base $merged | git apply PB> with PB> merge-tree.pl -b $base $(tree-id) $merged | parse-your-output In the above I presume by $merged you mean the tree ID (or commit ID) the user's working directory is based upon? Well, merge-trees (Linus has a single directory merge-tree already) looks at tree IDs (or commit IDs); it would never involve working files in random state that is not recorded as part of a tree (committed or not). Given that constraints I am not sure how well that would pan out. I have to think about this a bit. I do like, however, the idea of separating the step of doing any checkout/merge etc. and actually doing them. So the command set of parse-your-output needs to be defined. Based on what I have done so far, it would consist of the following: - Result is this object $SHA1 with mode $mode at $path (takes one of the trees); you can do update-cache --cacheinfo (if you want to muck with dircache) or cat-file blob (if you want to get the file) or both. - Result is to delete $path. - Result is a merge between object $SHA1-1 and $SHA1-2 with mode $mode-1 or $mode-2 at $path. Would this be a good enough command set? PB> Doesn't Emacs have something equivalent to ./.vimrc? I've also seen PB> those funny -*- strings. The former is global per user (that is me including other Perl files I work outside of git context), which is exactly what I said is unacceptable to me. The latter is per file (applying to everybody else who touch the file), so if it is short and sweet I should use one. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-15 0:58 ` Junio C Hamano @ 2005-04-14 22:30 ` Christopher Li 2005-04-15 7:43 ` Junio C Hamano 0 siblings, 1 reply; 81+ messages in thread From: Christopher Li @ 2005-04-14 22:30 UTC (permalink / raw) To: Junio C Hamano; +Cc: Petr Baudis, Linus Torvalds, git On Thu, Apr 14, 2005 at 05:58:25PM -0700, Junio C Hamano wrote: > > I do like, however, the idea of separating the step of doing any > checkout/merge etc. and actually doing them. So the command set > of parse-your-output needs to be defined. Based on what I have > done so far, it would consist of the following: > > - Result is this object $SHA1 with mode $mode at $path (takes > one of the trees); you can do update-cache --cacheinfo (if > you want to muck with dircache) or cat-file blob (if you want > to get the file) or both. Is that SHA1 for tree or the file object? If it is tree it don't need the $mode any more. If it is file you might need to emit entry for it's parent directory, including the modes of directory. > > - Result is to delete $path. > > - Result is a merge between object $SHA1-1 and $SHA1-2 with > mode $mode-1 or $mode-2 at $path. > > Would this be a good enough command set? And of course error/command for the files that unable to perform auto merge. including information of both revisions. That needs to be defined as well. Chris ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-14 22:30 ` Christopher Li @ 2005-04-15 7:43 ` Junio C Hamano 2005-04-15 6:28 ` Christopher Li 0 siblings, 1 reply; 81+ messages in thread From: Junio C Hamano @ 2005-04-15 7:43 UTC (permalink / raw) To: Christopher Li; +Cc: Petr Baudis, Linus Torvalds, git >>>>> "CL" == Christopher Li <git@chrisli.org> writes: >> - Result is this object $SHA1 with mode $mode at $path (takes >> one of the trees); you can do update-cache --cacheinfo (if >> you want to muck with dircache) or cat-file blob (if you want >> to get the file) or both. CL> Is that SHA1 for tree or the file object? I am talking about a single file here. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-15 7:43 ` Junio C Hamano @ 2005-04-15 6:28 ` Christopher Li 2005-04-15 11:11 ` Junio C Hamano 0 siblings, 1 reply; 81+ messages in thread From: Christopher Li @ 2005-04-15 6:28 UTC (permalink / raw) To: Junio C Hamano; +Cc: Petr Baudis, Linus Torvalds, git On Fri, Apr 15, 2005 at 12:43:47AM -0700, Junio C Hamano wrote: > >>>>> "CL" == Christopher Li <git@chrisli.org> writes: > > CL> Is that SHA1 for tree or the file object? > > I am talking about a single file here. > Then do you emit the entry for it's parents directory? e.g. /foo/bar get created. foo doesn't exists. You have to create foo first. You don't have mode information for foo yet. If it give the top level tree, the SCM can check it out by tree. hopefully have the mode on directory correctly. Well, if they care about those little details. Chris ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-15 6:28 ` Christopher Li @ 2005-04-15 11:11 ` Junio C Hamano 0 siblings, 0 replies; 81+ messages in thread From: Junio C Hamano @ 2005-04-15 11:11 UTC (permalink / raw) To: Christopher Li, Linus Torvalds; +Cc: Petr Baudis, git >>>>> "CL" == Christopher Li <git@chrisli.org> writes: CL> Then do you emit the entry for it's parents directory? In GIT object model, directory modes do not matter. It is not designed to record directories, and running "update-cache --add foo" when foo is a directory fails. The data model of GIT is that it associates file datablob to a string called "pathname" that happen to contain slashes in them. It is kinda wierd. When you externalize it with checkout-cache, these slashes are mapped to hierarchical UNIX filesystem paths, relative to whereever you happened to run checkout-cache. The hierarchical "tree" representation in the GIT database was started as just a space optimization thing. CL> e.g. /foo/bar get created. foo doesn't exists. You have CL> to create foo first. You don't have mode information for CL> foo yet. And you will never have that information, since it is not recorded anywhere. If I say you should have foo/bar (by the way, no leading slashes are placed in the dircache either), and if it so happens that you do not have foo yet, you'd better create one without waiting to be told, because I will never tell you to just create a directory. By the way, Linus, while I was studying how the new hierarchical trees are written out, I think I have found one small funny (I would not call this a *bug*) there. Here is an excerpt from write-tree (around ll. 56; I am basing on pasky-0.4 so your line numbers may have some offsets): sha1 = ce->sha1; mode = ntohl(ce->st_mode); /* Do we have _further_ subdirectories? */ filename = pathname + baselen; dirname = strchr(filename, '/'); if (dirname) { int subdir_written; subdir_written = write_tree(cachep + nr, maxentries - nr, pathname, dirname-pathname+1, subdir_sha1); nr += subdir_written; /* Now we need to write out the directory entry into this tree.. */ mode |= S_IFDIR; pathlen = dirname - pathname; /* ..but the directory entry doesn't count towards the total count */ nr--; sha1 = subdir_sha1; } This code is going through a flat list of cache entries sorted by pathnames. The list is flat in the sense that the pathnames are like "foo/bar" i.e. with slashes inside. The if() statement there, upon seeing "foo/bar", slurps all the entries in foo/ subhierarchy and writes into a separate tree, recursively, to "represent" foo/. Notice what mode the "tree" object gets in this case? File mode for foo/bar (or whatever happens to be sorted the first among the stuff in dircache from foo/ directory) ORed with S_IFDIR. I think this is nonsense, and we should just store constant S_IFDIR. Another option, probably better from the SCM purist's POV, would be to start recording directories in dircaches, so that people can actually keep track of directory modes. Does it matter? --- I would say not. GIT does not have to be tar or cpio. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-14 23:31 ` Petr Baudis 2005-04-15 0:58 ` Junio C Hamano @ 2005-04-15 10:22 ` Junio C Hamano 2005-04-15 20:40 ` Petr Baudis 1 sibling, 1 reply; 81+ messages in thread From: Junio C Hamano @ 2005-04-15 10:22 UTC (permalink / raw) To: Petr Baudis; +Cc: Linus Torvalds, git After I re-read [*R1*], in which Linus talks about dircache, especially this section: - The "current directory cache" describes some baseline. In particular, note the "some" part. It's not tied to any special baseline, and you can change your baseline any way you please. So it does NOT have to track any particular state in either the object database _or_ in your actual current working tree. In fact, all real interactions with "git" are really about updating this staging area one way or the other: you might check out the state from it into your working area (partially or fully), you can push your working area into the staging area (again, partially or fully). And if you want to, you can write the thing that the staging area represents as a "tree" into the object database, or you can merge a tree from the object database into the staging area. In other words: the staging area aka "current directory cache" is really how all interaction takes place. The object database never interacts directly with your working directory contents. ALL interactions go through the current directory cache. I started to have more doubts on the approach of *not* performing the merge in the dircache I set up specifically for merging, which is the direction in which you are pushing if I understand you correctly. Maybe I completely misunderstand what you want. This message is long but I need a clear understanding of what is expected to be useful to you, so please bear with me. PB> merge-tree.pl -b $base $(tree-id) $merged | parse-your-output Please help me understand this example you have given earlier. Here is my understanding of your assumption when the above pipeline takes place. Correct me if I am mistaken. * The user is in a working directory $W. It is controlled by git-tools and there are $W/.git/. directory and $W/.git/index dircache. * The dircache $W/.git/index started its life as a read-tree from some commit. The git-tools is keeping track of which commit it is somewhere, presumably in $W/.git/ directory. Let's call it $C (commit). ? Question. Is the $(tree-id) in your example the same as $C above? * The user have run [*1*] (see Footnote below) checkout-cache on $W/.git/index some time in the past and $W is full of working files. Some of them may or may not have modified. There may be some additions or deletions. So the contents of the working directory may not match the tree associated with $C. * The user may or may not have run [*1*] update-cache in $W. The contents of the dircache $W/.git/index may not match the tree associated with $C. ? Question. Are you forbidding the user to run update-cache by hand, and keeping track of the changes yourself, to be applied all at once at "git commit" time, thereby guaranteeing the $W/.git/index to match the tree associated with $C all times? From the description of The "GIT toolkit" section in README, it is not clear to me which part of his repository an end user is not supposed to muck with himself. * Now the user has some changes in his working directory and notices upstream or a side branch has notable changes desireble to be picked up. So he runs some git-tools command to cause the above quoted pipeline to run. ? Question. Does $merged in your example mean such an upstream or side branch? Is $base in your example the common ancestor between $C and $merged? Assuming that my above understanding of your model is correct, here are my "thinking aloud". - "merge-trees $base $C $merged" looks only at the git object database for those three trees named. The data structure of git object database is optimized to distinguish differences in those recorded trees (and hence recorded blobs they point at) without unpacking most of the files if the changes are small, because all the blobs involved are already hashed. It is not very good at comparing things in git object store and working files in random states, which would involve unpacking blobs and comparing, so "merge-trees" does not bother. - What can come out from merge-trees is therefore one of the following for each path from the union of paths contained in $base, $C, and $merged: (a) Neither $C nor $merged changed it --- merge result is what is in $C. (b) $C changed it but $merged did not --- merge result is what is in $C. (c) Both $C and $merged changed it in the same way --- merge result is what is in $C. (d) $C did not change it but $merged did --- merge result is what is in $merged. (e) Both $C and $merged changed it differently --- merge is needed and automatically succeeds between $C and $merge. (f) Both $C and $merged changed it differently --- merge is needed but have conflicts. - Assuming we are dealing with the case where working files are dirty and do not match what is in $C, among the above, (a)-(c) can be ignored by SCM. What the user has in his working files is exactly what he would have got if he started working from the merge result, although in reality the work was started from $C. Handling (d), (e) and (f) from SCM's point of view would be the same. They all involve 3-way merges between the file in the working directory, and the file from $merged, pivoting on the file from $base. In order to help SCM, merge-trees therefore should output SHA1 of blobs for such a file from $base and $merged and expect SCM to run "cat-file blob" on them and then merge or diff3. Up to the point of giving those two SHA1 out is the business of merge-trees and after that it is up to SCM. That would work. So I should base the design of output from merge-trees on the above analysis, which probably needs to be extended to cover differences between creation, modification, and deletion. - However, the above is quite different from the way Linus envisioned initially, on which my current implementation is based [*3*]. My current implementation is to record the merge outcome in the temporary dircache $W/,,merge/.git/index for cases (a)-(e). The last case (f) is problematic and needs human validation [*2*], so it is not recorded in that temporary dircache, but the files to be merged are left in that temporary directory and merge-trees stops there. It is expected that the end-user or SCM would merge the resulting file and run update-cache to update $W/,,merge/.git/index. After that happens, $W/,,merge/.git/index has the tree representing the desired result of the merge. It is expected that the end-user or SCM would write-tree, commit-tree there in the temporary directory, creating a new commit $C1. Then, it is expected that the SCM would make a patch file between $C and the user working directory, checks out $C1 (either in the user's working directory or another temporary directory; at this point merge-trees does not care because it has already done its job and exited), applies that patch to bring the user edits over to $C1. Then that directory would contain the desired merge of user edits. That is my understanding of how Linus originally wanted the tool to do his kernel work with to work. My hesitation to suggestions from you to change it not to keep its own merge dircache is coming from here. Not doing what I am currently doing to $W/,,merge/.git/index dircache would mean that SCM would have to do more, not less, to arrive at $C1 (the result of the clean $merge and $C merge pivoted at $base), where the real SCM merge begins. Although I suspect I am misunderstanding what you want, your messages so far suggest that what you want might be quite different from what Linus wants. Please do not misunderstand what I mean by saying this. I am not saying that Linus is always right [*4*] and therefore you are wrong for wanting something else. It is just that, if what I started writing needs to support both of those quite different needs, I need to know what they are. I think I understand what Linus wants well enough [*5*], but I am not certain about yours. [Footnotes] *1* By "The user have run" I mean either the user directly used the low-level plumbing command himself, or used git-tools to cause such command to run. *2* Strictly speaking, case (e) needs human validation as well, because successful textual merge does not guarantee sensible semantic merge. *3* See [*R2*] for descriptions on the way Linus wanted merge in git to happen. Especially around "5) At this point you need to MERGE" onwards. The current implementation handles (or attempts to handle) the `your working directory was fully committed' case described there. *4* According to Linus himself, he is always right ;-). [*R3*] *5* I consider [*R1*] and [*R2*] essential read for anybody wanting to understand merging operation in git object model (I am saying this for others; not for Pasky --- it would be like preaching to the choir ;-)). [References] *R1* <Pine.LNX.4.58.0504110928360.1267@ppc970.osdl.org> http://marc.theaimsgroup.com/?i=%3CPine.LNX.4.58.0504110928360.1267%20()%20ppc970%20!%20osdl%20!%20org%3E *R2* <Pine.LNX.4.58.0504121606580.4501@ppc970.osdl.org> http://marc.theaimsgroup.com/?i=%3CPine.LNX.4.58.0504121606580.4501%20()%20ppc970%20!%20osdl%20!%20org%3E *R3* http://www.uwsg.indiana.edu/hypermail/linux/kernel/0008.3/0555.html ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Re: Merge with git-pasky II. 2005-04-15 10:22 ` Junio C Hamano @ 2005-04-15 20:40 ` Petr Baudis 2005-04-15 22:41 ` Junio C Hamano 0 siblings, 1 reply; 81+ messages in thread From: Petr Baudis @ 2005-04-15 20:40 UTC (permalink / raw) To: Junio C Hamano; +Cc: Linus Torvalds, git Dear diary, on Fri, Apr 15, 2005 at 12:22:26PM CEST, I got a letter where Junio C Hamano <junkio@cox.net> told me that... > After I re-read [*R1*], in which Linus talks about dircache, > especially this section: > > - The "current directory cache" describes some baseline. In particular, > note the "some" part. It's not tied to any special baseline, and you > can change your baseline any way you please. > > So it does NOT have to track any particular state in either the object > database _or_ in your actual current working tree. In fact, all real > interactions with "git" are really about updating this staging area one > way or the other: you might check out the state from it into your > working area (partially or fully), you can push your working area into > the staging area (again, partially or fully). > > And if you want to, you can write the thing that the staging area > represents as a "tree" into the object database, or you can merge a > tree from the object database into the staging area. > > In other words: the staging area aka "current directory cache" is > really how all interaction takes place. The object database never > interacts directly with your working directory contents. ALL > interactions go through the current directory cache. > > I started to have more doubts on the approach of *not* > performing the merge in the dircache I set up specifically for > merging, which is the direction in which you are pushing if I > understand you correctly. Maybe I completely misunderstand what > you want. This message is long but I need a clear understanding > of what is expected to be useful to you, so please bear with me. > PB> merge-tree.pl -b $base $(tree-id) $merged | parse-your-output > > Please help me understand this example you have given earlier. > Here is my understanding of your assumption when the above > pipeline takes place. Correct me if I am mistaken. > > * The user is in a working directory $W. It is controlled by > git-tools and there are $W/.git/. directory and $W/.git/index > dircache. > > * The dircache $W/.git/index started its life as a read-tree > from some commit. The git-tools is keeping track of which > commit it is somewhere, presumably in $W/.git/ directory. > Let's call it $C (commit). > > ? Question. Is the $(tree-id) in your example the same as $C > above? Yes. Actually $(tree-id) returns ID of the tree object, not the commit object; but that doesn't matter here, probably - let's ignore that distinction for simplicity. > * The user have run [*1*] (see Footnote below) checkout-cache > on $W/.git/index some time in the past and $W is full of > working files. Some of them may or may not have modified. > There may be some additions or deletions. So the contents of > the working directory may not match the tree associated with > $C. > > * The user may or may not have run [*1*] update-cache in $W. > The contents of the dircache $W/.git/index may not match the > tree associated with $C. > > ? Question. Are you forbidding the user to run update-cache by > hand, and keeping track of the changes yourself, to be > applied all at once at "git commit" time, thereby > guaranteeing the $W/.git/index to match the tree associated > with $C all times? From the description of The "GIT toolkit" > section in README, it is not clear to me which part of his > repository an end user is not supposed to muck with himself. Ideally, he shouldn't be using *any* of the low-level plumbing by now. The only exception is update-cache --refresh, which he can do at will (I'm yet thinking what to do with it :-). The git-tools always assume that index basically contains the state as of the last commit. (Actually the only time when this matters *now* might be git diff - the user would get confused from the results.) > * Now the user has some changes in his working directory and > notices upstream or a side branch has notable changes > desireble to be picked up. So he runs some git-tools command > to cause the above quoted pipeline to run. > > ? Question. Does $merged in your example mean such an upstream > or side branch? Is $base in your example the common ancestor > between $C and $merged? Correct. *HOWEVER* what is not correct is that git-tools would let you merge in your working directory while you have local changes there. In the past, the merge would happen in your working tree, but git-tools wouldn't let you go for it unless your working tree has no local changes. It would complain loudly and refuse to, since it's *not* what you want to do and it was most likely a mistake. Currently, git merge just creates a ,,merge/ subdirectory sharing the object database with your working tree, but with an independent checkout of it; it will do the merge there, and when you commit it there, it will update your working tree with the merged changes. I'm describing both behaviors since I might revert back to the first one, based on what (if anything) will Linus reply to my mail about out-of-tree merges. But either way, when a merge is about to happen upon us, the working tree is "clean". > Assuming that my above understanding of your model is correct, > here are my "thinking aloud". > > - "merge-trees $base $C $merged" looks only at the git object > database for those three trees named. The data structure of > git object database is optimized to distinguish differences > in those recorded trees (and hence recorded blobs they point > at) without unpacking most of the files if the changes are > small, because all the blobs involved are already hashed. It > is not very good at comparing things in git object store and > working files in random states, which would involve unpacking > blobs and comparing, so "merge-trees" does not bother. > > - What can come out from merge-trees is therefore one of the > following for each path from the union of paths contained in > $base, $C, and $merged: > > (a) Neither $C nor $merged changed it --- merge result is what > is in $C. (Or in $base, if you don't want to give $C "unfair advantage", since it does not matter. ;-) > (b) $C changed it but $merged did not --- merge result is what > is in $C. > > (c) Both $C and $merged changed it in the same way --- merge > result is what is in $C. > > (d) $C did not change it but $merged did --- merge result is > what is in $merged. > > (e) Both $C and $merged changed it differently --- merge is > needed and automatically succeeds between $C and $merge. > > (f) Both $C and $merged changed it differently --- merge is > needed but have conflicts. > > - Assuming we are dealing with the case where working files are > dirty and do not match what is in $C, among the above, > (a)-(c) can be ignored by SCM. What the user has in his > working files is exactly what he would have got if he started > working from the merge result, although in reality the work > was started from $C. Yes. Actually they can be ignored by git-tools in any case since what is in the directory cache is $C. So it never needs to do any special action. > Handling (d), (e) and (f) from SCM's point of view would be > the same. They all involve 3-way merges between the file in > the working directory, and the file from $merged, pivoting on > the file from $base. In order to help SCM, merge-trees > therefore should output SHA1 of blobs for such a file from > $base and $merged and expect SCM to run "cat-file blob" on > them and then merge or diff3. Up to the point of giving > those two SHA1 out is the business of merge-trees and after > that it is up to SCM. > > That would work. So I should base the design of output from > merge-trees on the above analysis, which probably needs to be > extended to cover differences between creation, modification, > and deletion. Yes, it sounds sensible. Actually, you don't even need to make $C more special than $merged; I can filter out only the $merged changes on the SCM level. I guess that would add no complexity to your tool and make it usable even for more exotic kinds of merges (like the floating-in-the-void merge of two "equally important" trees). > - However, the above is quite different from the way Linus > envisioned initially, on which my current implementation is > based [*3*]. > > My current implementation is to record the merge outcome in > the temporary dircache $W/,,merge/.git/index for cases > (a)-(e). The last case (f) is problematic and needs human > validation [*2*], so it is not recorded in that temporary > dircache, but the files to be merged are left in that > temporary directory and merge-trees stops there. It is > expected that the end-user or SCM would merge the resulting > file and run update-cache to update $W/,,merge/.git/index. > After that happens, $W/,,merge/.git/index has the tree > representing the desired result of the merge. It is expected > that the end-user or SCM would write-tree, commit-tree there > in the temporary directory, creating a new commit $C1. > > Then, it is expected that the SCM would make a patch file > between $C and the user working directory, checks out $C1 > (either in the user's working directory or another temporary > directory; at this point merge-trees does not care because it > has already done its job and exited), applies that patch to > bring the user edits over to $C1. Then that directory would > contain the desired merge of user edits. > > That is my understanding of how Linus originally wanted the > tool to do his kernel work with to work. My hesitation to > suggestions from you to change it not to keep its own merge > dircache is coming from here. Not doing what I am currently > doing to $W/,,merge/.git/index dircache would mean that SCM > would have to do more, not less, to arrive at $C1 (the result > of the clean $merge and $C merge pivoted at $base), where the > real SCM merge begins. Well. Currently, apart from the directory cache part, I do it like you describe. I create a new directory, after commit I apply the diff back to the original tree etc. The only problem is really the dircache, and that's because it would be done totally differently than in the original tree, and I would unnecessarily have to introduce crowds of special cases to my tools in order for them to be usable in the merge tree (I call the ",,merge" temporary directory a "merge tree"). And the user would still lose the capability of easily seeing the changes being committed. I admit that I'm using this largely as an excuse and there *could* be a tool made which would compare the given tree with the cache, but it would be clumsy to use, violate Linus' "ALL interactions go through the current directory cache" paradigm (whew, the first time in my life I used this word), and we could do just fine with our current tools. > Although I suspect I am misunderstanding what you want, your > messages so far suggest that what you want might be quite > different from what Linus wants. Please do not misunderstand > what I mean by saying this. I am not saying that Linus is > always right [*4*] and therefore you are wrong for wanting > something else. It is just that, if what I started writing > needs to support both of those quite different needs, I need to > know what they are. I think I understand what Linus wants well > enough [*5*], but I am not certain about yours. I can't see the conflicts between what I want and what Linus wants. After all, Linus says that I can use the directory cache in any way I please (well, the user can, but I'm speaking for him ;-). So I'm doing so, and with your tool I would get into problems, since it is suddenly imposing a policy on what should be in the index. ..snip.. > *2* Strictly speaking, case (e) needs human validation as > well, because successful textual merge does not guarantee > sensible semantic merge. ..snip.. Actually, I think _all_ the caches should make the human validation _possible_ (by showing the diff of what would be merged), and it is trivial to do so by having pristine index. -- Petr "Pasky" Baudis Stuff: http://pasky.or.cz/ C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-15 20:40 ` Petr Baudis @ 2005-04-15 22:41 ` Junio C Hamano 0 siblings, 0 replies; 81+ messages in thread From: Junio C Hamano @ 2005-04-15 22:41 UTC (permalink / raw) To: Petr Baudis; +Cc: Linus Torvalds, git >>>>> "PB" == Petr Baudis <pasky@ucw.cz> writes: PB> I can't see the conflicts between what I want and what Linus wants. PB> After all, Linus says that I can use the directory cache in any way I PB> please (well, the user can, but I'm speaking for him ;-). So I'm doing PB> so, and with your tool I would get into problems, since it is suddenly PB> imposing a policy on what should be in the index. I think our misunderstanding is coming from the use of the word "merge tree". I think you have been assuming that I wanted you to run "merge-trees -o ,,merge" --- which would certainly cause me to muck with your dircache there. I totally agree with you that that is a *BAD* *THING*. No question there. However, my assumption has been different. I was assuming that you would run "merge-trees -o merge~tree" (i.e. different from your "merge tree"), so that you can get the merge results in a form parsable by you. And then, using that information, you can make your changes in ,,merge. After you are done with that information, you can remove "merge~trees", of course. The format I chose for the "merge result in a form parsable by you" happens to be a dircache in "merge~tree", with minimum number of files checked out when merge cannot be automatically done safely. In the simplest case of not having any conflicting merge between $C and $merged, Cogito can immediately run write-tree in "merge~tree" (not ,,merge) to obtain its tree-ID $T, so that it can feed it to diff-tree to compare it with whatever tree state Cogito wants to apply the merges between $C and $merged to. I still do not understand what you do in ,,merge directory, but here is one way you can update the user working directory in-place without having a ,,merge directory [*2*]. You can run your "git diff" between $C and $T [*1*]. The result is the diff you need to apply on top of your user's working files. If the user does not like the result of running that diff, it can easily be reversed. If a manual merge were needed between $C and $merged, Cogito could guide the user through that manual edit in "merge~tree", and run update-cache on those hand merged files in "merge~tree", before running write-tree in "merge~tree" to obtain $T; after that, everything else is the same. You make interesting points in other parts of your message I need to regurgitate for a while, so I would not comment on them in this message. [Footnote] *1* I really like the convenience of being able to use tree-ID and commit-ID interchangeably there. Thanks. *2* I understand that this would change the user's "git-tools" experience a bit. The user will not be told to "go to ,,merge and commit there which will reflected back to your working tree" anymore. Instead the merge happens in-place. Committing, not committing, or further hand-fixing the merge is up to the user. I suspect this change might even be for the better. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-14 8:06 ` Linus Torvalds 2005-04-14 8:39 ` Junio C Hamano @ 2005-04-15 19:57 ` Junio C Hamano 2005-04-15 20:45 ` Linus Torvalds 1 sibling, 1 reply; 81+ messages in thread From: Junio C Hamano @ 2005-04-15 19:57 UTC (permalink / raw) To: Linus Torvalds; +Cc: Petr Baudis, Christopher Li, git >>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes: LT> In the meantime I wrote a very stupid "merge-tree" which LT> does things slightly differently, but I really think your LT> approach (aka my original approach) is actually a lot LT> faster. I was just starting to worry that the ball didn't LT> start, so I wrote an even hackier one. LT> ... This "one directory at a time with very explicit output" LT> thing is much more down-to-earth, but it's also likely LT> slower because it will need script help more often. I was looking at merge-tree.c last night to add recursive behaviour (my favorite these days ;-) to it [*1*]. But then I started thinking. LT> ... For each entry in the directory it says either LT> select <mode> <sha1> path LT> or LT> merge <mode>-><mode>,<mode> <sha1>-><sha1>,<sha1> path LT> depending on whether it could directly select the right object or not. Given that the case you are primarily interested in is the one that affects only small parts of a huge tree (i.e. common kernel merge pattern I understand from your previous messages), your "hacky version" [*2*], extended for recursive operation, would spit out 98% select and 2% merge, and probably the origin of these selects are distributed across ancestor=90%, his=4%, my=4%, or something similar. Am I misestimating grossly? Assuming I am correct in the above, this would not scale for a huge project. We need to cut down the number of "90% select" part of the output to make it manageable. I am thinking about: - adding recursive behaviour (I am almost done with this); - adding another command line argument to merge-tree.c, to tell "do not output anything for the path if the resulting merge is the same as what is in this tree"; - adding another output type, "delete" to make the output type repertoire these three: delete path select <mode> <sha1> path merge <mode>-><mode>,<mode> <sha1>-><sha1>,<sha1> path When the user of the output of $ merge-tree <ancestor-sha1> <my-sha1> <his-sha1> <result-base-sha1> want to get a dircache populated with the merged result, he can: 1. read-tree <result-base-sha1> 2. for each output: a) "delete" -- delete path from dircache b) "select" -- register mode-sha1 at path c) "merge" -- do the 3-way merge and register result at path Do you think this is sensible? The reason I have the separate <result-base-sha1> instead of always using <ancestor-sha1> is because the user may be thinking of patching an existing base which is different from "my" or "his" or "ancestor" and doing it in place. That way, probably Pasky's SCM can use it to patch the dircache it creates in its own ,,merge/ directory, which would most likely be initially populated from the dircache in the user's working directory--- which may or may not match "my-sha1" if the user has uncommitted update-cache there. Pasky, do you think this is workable? If so do you think this would make your life easier? [Footnotes] *1* That's how I found the S_IFDIR problem (not in your tree but in the copy I had). *2* I did not find it quite "hacky". It was a pleasant read. Especially I liked "smaller()" part. ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-15 19:57 ` Junio C Hamano @ 2005-04-15 20:45 ` Linus Torvalds 0 siblings, 0 replies; 81+ messages in thread From: Linus Torvalds @ 2005-04-15 20:45 UTC (permalink / raw) To: Junio C Hamano; +Cc: Petr Baudis, Christopher Li, git On Fri, 15 Apr 2005, Junio C Hamano wrote: > > I was looking at merge-tree.c last night to add recursive > behaviour (my favorite these days ;-) to it [*1*]. > > But then I started thinking. Always good. > LT> ... For each entry in the directory it says either > LT> select <mode> <sha1> path > LT> or > LT> merge <mode>-><mode>,<mode> <sha1>-><sha1>,<sha1> path > LT> depending on whether it could directly select the right object or not. > > Given that the case you are primarily interested in is the one > that affects only small parts of a huge tree (i.e. common kernel > merge pattern I understand from your previous messages), your > "hacky version" [*2*], extended for recursive operation, would > spit out 98% select and 2% merge, and probably the origin of > these selects are distributed across ancestor=90%, his=4%, > my=4%, or something similar. Am I misestimating grossly? No. That's _exactly_ right. You do not want a recursive merge-tree. The "diff-tree" thing is different, exactly because it prunes out all the differences early on. > I am thinking about: > > - adding recursive behaviour (I am almost done with this); I think your suggestion sounds perfectly reasonable. Linus ^ permalink raw reply [flat|nested] 81+ messages in thread
* Re: Merge with git-pasky II. 2005-04-14 0:29 Petr Baudis 2005-04-13 21:25 ` Christopher Li @ 2005-04-14 0:30 ` Petr Baudis 1 sibling, 0 replies; 81+ messages in thread From: Petr Baudis @ 2005-04-14 0:30 UTC (permalink / raw) To: torvalds; +Cc: git Dear diary, on Thu, Apr 14, 2005 at 02:29:02AM CEST, I got a letter where Petr Baudis <pasky@ucw.cz> told me that... > Its main contents are all of my shell scripts. Apart of that, some > tiny fixes scattered all around can be found there, as well as some > patches which went through the mailing list. My last merge with you > concerned your commit 39021759c903a943a33a28cfbd5070d36d851581. > > It's again > > rsync://pasky.or.cz/git/ > > this time my HEAD is fba83970090ef54c6eb86dcc2c2d5087af5ac637. I forgot to add that after merging, you will probably want to change the VERSION file (to contain whatever you want). -- Petr "Pasky" Baudis Stuff: http://pasky.or.cz/ C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor ^ permalink raw reply [flat|nested] 81+ messages in thread
end of thread, other threads:[~2005-04-28 0:37 UTC | newest] Thread overview: 81+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2005-04-26 18:55 Merge with git-pasky II Bram Cohen 2005-04-26 19:58 ` Linus Torvalds 2005-04-26 20:30 ` Tom Lord 2005-04-26 20:31 ` Bram Cohen 2005-04-26 20:39 ` Tom Lord 2005-04-26 20:58 ` Linus Torvalds 2005-04-26 21:25 ` Linus Torvalds 2005-04-26 21:28 ` Bram Cohen 2005-04-26 21:36 ` Fabian Franz 2005-04-26 22:30 ` Linus Torvalds 2005-04-26 22:25 ` Linus Torvalds 2005-04-28 0:42 ` Petr Baudis 2005-04-26 21:26 ` Diego Calleja 2005-04-26 20:31 ` Daniel Barkalow 2005-04-26 20:44 ` Tom Lord -- strict thread matches above, loose matches on Subject: below -- 2005-04-17 17:34 Linus Torvalds 2005-04-17 22:12 ` Herbert Xu 2005-04-17 22:35 ` Linus Torvalds 2005-04-17 23:29 ` Herbert Xu 2005-04-17 23:34 ` Petr Baudis 2005-04-17 23:53 ` Kenneth Johansson 2005-04-18 0:49 ` Herbert Xu 2005-04-18 0:55 ` Petr Baudis 2005-04-17 23:50 ` Linus Torvalds 2005-04-18 4:16 ` Sanjoy Mahajan [not found] <000d01c541ed$32241fd0$6400a8c0@gandalf> 2005-04-15 20:31 ` Linus Torvalds 2005-04-15 23:00 ` Barry Silverman 2005-04-16 0:32 ` Linus Torvalds 2005-04-14 0:29 Petr Baudis 2005-04-13 21:25 ` Christopher Li 2005-04-14 0:45 ` Petr Baudis 2005-04-14 3:51 ` Linus Torvalds 2005-04-14 1:23 ` Christopher Li 2005-04-14 5:03 ` Paul Jackson 2005-04-14 2:16 ` Christopher Li 2005-04-14 6:16 ` Paul Jackson 2005-04-14 7:05 ` Junio C Hamano 2005-04-14 8:06 ` Linus Torvalds 2005-04-14 8:39 ` Junio C Hamano 2005-04-14 9:10 ` Linus Torvalds 2005-04-14 11:14 ` Junio C Hamano 2005-04-14 12:16 ` Petr Baudis 2005-04-14 18:12 ` Junio C Hamano 2005-04-14 18:36 ` Linus Torvalds 2005-04-14 19:59 ` Junio C Hamano 2005-04-15 0:42 ` Linus Torvalds 2005-04-15 2:33 ` Barry Silverman 2005-04-15 10:02 ` David Woodhouse 2005-04-15 15:32 ` Linus Torvalds 2005-04-15 16:01 ` David Woodhouse 2005-04-15 16:31 ` C. Scott Ananian 2005-04-15 17:11 ` Linus Torvalds 2005-04-16 15:33 ` Johannes Schindelin 2005-04-17 13:14 ` David Woodhouse 2005-04-15 19:20 ` Paul Jackson 2005-04-16 1:44 ` Simon Fowler 2005-04-16 12:19 ` David Lang 2005-04-16 15:55 ` Simon Fowler 2005-04-16 20:29 ` Sanjoy Mahajan 2005-04-16 20:41 ` Linus Torvalds 2005-04-15 9:14 ` David Woodhouse 2005-04-15 9:36 ` Ingo Molnar 2005-04-15 10:05 ` David Woodhouse 2005-04-15 14:53 ` Ingo Molnar 2005-04-15 15:09 ` David Woodhouse 2005-04-15 12:03 ` Johannes Schindelin 2005-04-15 10:22 ` Theodore Ts'o 2005-04-15 14:53 ` Linus Torvalds 2005-04-15 15:29 ` David Woodhouse 2005-04-15 15:51 ` Linus Torvalds 2005-04-15 15:54 ` Paul Jackson 2005-04-15 16:30 ` C. Scott Ananian 2005-04-15 18:29 ` Paul Jackson 2005-04-14 18:51 ` Christopher Li 2005-04-14 19:35 ` Petr Baudis 2005-04-14 23:12 ` Junio C Hamano 2005-04-14 20:24 ` Christopher Li 2005-04-14 23:31 ` Petr Baudis 2005-04-15 0:58 ` Junio C Hamano 2005-04-14 22:30 ` Christopher Li 2005-04-15 7:43 ` Junio C Hamano 2005-04-15 6:28 ` Christopher Li 2005-04-15 11:11 ` Junio C Hamano 2005-04-15 10:22 ` Junio C Hamano 2005-04-15 20:40 ` Petr Baudis 2005-04-15 22:41 ` Junio C Hamano 2005-04-15 19:57 ` Junio C Hamano 2005-04-15 20:45 ` Linus Torvalds 2005-04-14 0:30 ` Petr Baudis
Code repositories for project(s) associated with this public inbox https://80x24.org/mirrors/git.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).