git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* Re: Merge with git-pasky II.
  2005-04-14  0:29 Petr Baudis
@ 2005-04-13 21:25 ` Christopher Li
  2005-04-14  0:45   ` Petr Baudis
  2005-04-14  0:30 ` Petr Baudis
  1 sibling, 1 reply; 81+ messages in thread
From: Christopher Li @ 2005-04-13 21:25 UTC (permalink / raw)
  To: Petr Baudis; +Cc: torvalds, git

While you are there, do you mind to move the shell script
to a sub directory? Let's try how rename works.

Chris

On Thu, Apr 14, 2005 at 02:29:02AM +0200, Petr Baudis wrote:
>   Hello Linus,
> 
>   I think my tree should be ready for merging with you. It is the final
> tree and I've already switched my main branch for it, so it's what
> people doing git pull are getting for some time already.
> 
>   Its main contents are all of my shell scripts. Apart of that, some
> tiny fixes scattered all around can be found there, as well as some
> patches which went through the mailing list. My last merge with you
> concerned your commit 39021759c903a943a33a28cfbd5070d36d851581.
> 
>   It's again
> 
> 	rsync://pasky.or.cz/git/
> 
> this time my HEAD is fba83970090ef54c6eb86dcc2c2d5087af5ac637.
> 
>   Note that my rsync tree still contains even my old branch; I thought
> I'd leave it around in the public objects database for some time, shall
> anyone want to have a look at the history of some of the scripts. But if
> you want it gone, tell me and I will prune it (and perhaps offer it in
> /git-old/ or whatever). I'm using the following:
> 
> 	fsck-cache --unreachable $(commit-id) | grep unreachable \
> 		| cut -d ' ' -f 2 | sed 's/^\(..\)/.git\/objects\/\1\//' \
> 		| xargs rm
> 
>   Thanks,
> 
> -- 
> 				Petr "Pasky" Baudis
> Stuff: http://pasky.or.cz/
> C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor
> -
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Merge with git-pasky II.
@ 2005-04-14  0:29 Petr Baudis
  2005-04-13 21:25 ` Christopher Li
  2005-04-14  0:30 ` Petr Baudis
  0 siblings, 2 replies; 81+ messages in thread
From: Petr Baudis @ 2005-04-14  0:29 UTC (permalink / raw)
  To: torvalds; +Cc: git

  Hello Linus,

  I think my tree should be ready for merging with you. It is the final
tree and I've already switched my main branch for it, so it's what
people doing git pull are getting for some time already.

  Its main contents are all of my shell scripts. Apart of that, some
tiny fixes scattered all around can be found there, as well as some
patches which went through the mailing list. My last merge with you
concerned your commit 39021759c903a943a33a28cfbd5070d36d851581.

  It's again

	rsync://pasky.or.cz/git/

this time my HEAD is fba83970090ef54c6eb86dcc2c2d5087af5ac637.

  Note that my rsync tree still contains even my old branch; I thought
I'd leave it around in the public objects database for some time, shall
anyone want to have a look at the history of some of the scripts. But if
you want it gone, tell me and I will prune it (and perhaps offer it in
/git-old/ or whatever). I'm using the following:

	fsck-cache --unreachable $(commit-id) | grep unreachable \
		| cut -d ' ' -f 2 | sed 's/^\(..\)/.git\/objects\/\1\//' \
		| xargs rm

  Thanks,

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-14  0:29 Petr Baudis
  2005-04-13 21:25 ` Christopher Li
@ 2005-04-14  0:30 ` Petr Baudis
  1 sibling, 0 replies; 81+ messages in thread
From: Petr Baudis @ 2005-04-14  0:30 UTC (permalink / raw)
  To: torvalds; +Cc: git

Dear diary, on Thu, Apr 14, 2005 at 02:29:02AM CEST, I got a letter
where Petr Baudis <pasky@ucw.cz> told me that...
>   Its main contents are all of my shell scripts. Apart of that, some
> tiny fixes scattered all around can be found there, as well as some
> patches which went through the mailing list. My last merge with you
> concerned your commit 39021759c903a943a33a28cfbd5070d36d851581.
> 
>   It's again
> 
> 	rsync://pasky.or.cz/git/
> 
> this time my HEAD is fba83970090ef54c6eb86dcc2c2d5087af5ac637.

I forgot to add that after merging, you will probably want to change the
VERSION file (to contain whatever you want).

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-14  5:03         ` Paul Jackson
@ 2005-04-14  2:16           ` Christopher Li
  2005-04-14  6:16             ` Paul Jackson
  0 siblings, 1 reply; 81+ messages in thread
From: Christopher Li @ 2005-04-14  2:16 UTC (permalink / raw)
  To: Paul Jackson; +Cc: torvalds, pasky, git

On Wed, Apr 13, 2005 at 10:03:41PM -0700, Paul Jackson wrote:
> 
> If you have a thin skin or tend to annoy others with a bit too much
> attitude or can't pass up a good language war (which is my failing, and
> why I am responding to a discussion that I've not been involved in for
> days) then the resulting flamage could be distracting.

Oh, my bad. I am not trying to start a language war here.
That is why I am hesitated about Python.
Just try to find out the acceptability. No pushing.

Chris 


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-14  1:23       ` Christopher Li
@ 2005-04-14  5:03         ` Paul Jackson
  2005-04-14  2:16           ` Christopher Li
  0 siblings, 1 reply; 81+ messages in thread
From: Paul Jackson @ 2005-04-14  5:03 UTC (permalink / raw)
  To: Christopher Li; +Cc: torvalds, pasky, git

> Do you have preference about what language of script we used? 

Do you have a thick skin? <grin>

Can you easily ignore language wars with an amused wave of the hand
and a happy chuckle at the oh so predictable weaknesses of humans?

Then I'd wager it will be fine.

If you have a thin skin or tend to annoy others with a bit too much
attitude or can't pass up a good language war (which is my failing, and
why I am responding to a discussion that I've not been involved in for
days) then the resulting flamage could be distracting.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-14  2:16           ` Christopher Li
@ 2005-04-14  6:16             ` Paul Jackson
  0 siblings, 0 replies; 81+ messages in thread
From: Paul Jackson @ 2005-04-14  6:16 UTC (permalink / raw)
  To: Christopher Li; +Cc: torvalds, pasky, git

> Oh, my bad. I am not trying to start a language war here.

Neither am I - no problem what so ever. <chuckle ...>

Besides, I think we'd be on the same side.

My point was only a gentle one -- as is often the case when dealing with
the strange species called human, whether or not you can get away with
something is often a simple matter of ones attitude.

There is one Python script already in the kernel: scripts/show_delta.
But it's too small a sample to mean much.

I think first thing is "get it right."  Python is good for that
in the hands of someone who enjoys coding in it.

I wish you well.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-14  3:51     ` Linus Torvalds
  2005-04-14  1:23       ` Christopher Li
@ 2005-04-14  7:05       ` Junio C Hamano
  2005-04-14  8:06         ` Linus Torvalds
  1 sibling, 1 reply; 81+ messages in thread
From: Junio C Hamano @ 2005-04-14  7:05 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Petr Baudis, Christopher Li, git

>>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:

LT> On that note - I've been avoiding doing the merge-tree thing, in the hope 
LT> that somebody else does what I've described.

I now have a Perl script that uses rev-tree, cat-file,
diff-tree, show-files (with one modification so that it can deal
with pathnames with embedded newlines), update-cache (with one
modification so that I can add an entry for a file that does not
exist to the dircache) and merge (from RCS).  Quick and dirty.

The changes to show-files is to give it an optional '-z' flag,
which chanegs record terminator to NUL character instead of LF.

The script git-merge.perl takes two head commits.  It basically
follows what you described as I remember ;-):

 1. runs rev-tree with --edges to find the common anscestor.

 2. creates a temporary directory "./,,merge-temp"; create a
    symlink ./,,merge-temp/.git/objects that points at
    .git/objects.

 3. sets up dircache there, initially populated with this common
    ancestor tree.  No files are checked out.  Just set up
    .git/index and that's it.

 4. runs diff-tree to find what has been changed in each head.

 5. for each path involved:

  5.0 if neither heads change it, leave it as is;
  5.1 if only one head changes a path and the other does not, just
      get the changed version;
  5.2 if both heads change it, check all three out and run merge.

It does not currently commit.  You can go to ./,,merge-temp/ and
see show-diff to see the result of the merge.  Files added in
one head has already been run "update-cache" when the script
ends, but changed and merged files are not---dircache still has
the common ancestor view.  So show-diff you will be seeing may
be enormous and not very useful if two forks were done in the
distant past.  After reviewing the merge result, you can
update-cache, write-tree and commit-tree as usual, but with one
caveat:  do not run "show-files | xargs update-cache" if you are
running git-merge.perl without -f flag!

By default, git-merge.perl creates absolute minimum number of
files in ./,,merge-temp---only the merged files are left there
so that you can inspect them.  You will not see unmodified
files nor files changed only by one side of the merge.

If you give '-o' (oneside checkout) flag to git-merge.perl, then
the files only one side of the merge changed are also checked
out in ./,,merge-temp.  If you give '-f' (full checkout) flag to
git-merge.perl, then in addition to what '-o' checks out,
unchanged files are checked out in ./,,merge-temp.  This default
is geared towards a huge tree with small merges (favorite case
of Linus, if I understand correctly).

Running 'show-diff' in such a sparsely populated merge result
tree gives you huge results because recent show-diff shows diffs
with empty files.  I added a '-r' flag to show-diff, which
squelches diffs with empty files.

Also to implement 'changed only by one-side' without actually
checking the file out, I needed to add one option to
'update-cache'.  --cacheinfo flag is used this way:

    $ update-cache --cacheinfo mode sha1 path

and adds the pathname with mode and sha1 to the .git/index
without actually requiring you to have such a file there.

Signed-off-by: Junio C Hamano <junkio@cox.net>

---

 show-diff.c    |   11 ++-
 show-files.c   |   12 ++-
 update-cache.c |   25 +++++++
 git-merge.perl |  193 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 234 insertions(+), 7 deletions(-)


show-diff.c:  a531ca4078525d1c8dcf84aae0bfa89fed6e5d96
--- show-diff.c
+++ show-diff.c	2005-04-13 22:47:33.000000000 -0700
@@ -58,15 +58,20 @@
 int main(int argc, char **argv)
 {
 	int silent = 0;
+	int silent_on_nonexisting_files = 0;
 	int entries = read_cache();
 	int i;
 
 	while (argc-- > 1) {
 		if (!strcmp(argv[1], "-s")) {
-			silent = 1;
+			silent_on_nonexisting_files = silent = 1;
 			continue;
 		}
-		usage("show-diff [-s]");
+		if (!strcmp(argv[1], "-r")) {
+			silent_on_nonexisting_files = 1;
+			continue;
+		}
+		usage("show-diff [-s] [-r]");
 	}
 
 	if (entries < 0) {
@@ -83,7 +88,7 @@
 
 		if (stat(ce->name, &st) < 0) {
 			printf("%s: %s\n", ce->name, strerror(errno));
-			if (errno == ENOENT && !silent)
+			if (errno == ENOENT && !silent_on_nonexisting_files)
 				show_diff_empty(ce);
 			continue;
 		}
show-files.c:  a9fa6767a418f870a34b39379f417bf37b17ee18
--- show-files.c
+++ show-files.c	2005-04-13 21:18:40.000000000 -0700
@@ -14,6 +14,7 @@
 static int show_cached = 0;
 static int show_others = 0;
 static int show_ignored = 0;
+static int line_terminator = '\n';
 
 static const char **dir;
 static int nr_dir;
@@ -105,12 +106,12 @@
 	}
 	if (show_others) {
 		for (i = 0; i < nr_dir; i++)
-			printf("%s\n", dir[i]);
+			printf("%s%c", dir[i], line_terminator);
 	}
 	if (show_cached) {
 		for (i = 0; i < active_nr; i++) {
 			struct cache_entry *ce = active_cache[i];
-			printf("%s\n", ce->name);
+			printf("%s%c", ce->name, line_terminator);
 		}
 	}
 	if (show_deleted) {
@@ -119,7 +120,7 @@
 			struct stat st;
 			if (!stat(ce->name, &st))
 				continue;
-			printf("%s\n", ce->name);
+			printf("%s%c", ce->name, line_terminator);
 		}
 	}
 	if (show_ignored) {
@@ -134,6 +135,11 @@
 	for (i = 1; i < argc; i++) {
 		char *arg = argv[i];
 
+		if (!strcmp(arg, "-z")) {
+			line_terminator = 0;
+			continue;
+		}
+
 		if (!strcmp(arg, "--cached")) {
 			show_cached = 1;
 			continue;
update-cache.c:  8f149d5a4ab60e030a0ab19fdb59b8ee2576ee71
--- update-cache.c
+++ update-cache.c	2005-04-13 23:27:54.000000000 -0700
@@ -203,6 +203,8 @@
 {
 	int i, newfd, entries;
 	int allow_options = 1;
+	const char *sha1_force = NULL;
+	const char *mode_force = NULL;
 
 	newfd = open(".git/index.lock", O_RDWR | O_CREAT | O_EXCL, 0600);
 	if (newfd < 0)
@@ -235,14 +237,35 @@
 				refresh_cache();
 				continue;
 			}
+			if (!strcmp(path, "--cacheinfo")) {
+				mode_force = argv[++i];
+				sha1_force = argv[++i];
+				continue;
+			}
 			die("unknown option %s", path);
 		}
 		if (!verify_path(path)) {
 			fprintf(stderr, "Ignoring path %s\n", argv[i]);
 			continue;
 		}
-		if (add_file_to_cache(path))
+		if (sha1_force && mode_force) {
+			struct cache_entry *ce;
+			int namelen = strlen(path);
+			int mode;
+			int size = cache_entry_size(namelen);
+			sscanf(mode_force, "%o", &mode);
+			ce = malloc(size);
+			memset(ce, 0, size);
+			memcpy(ce->name, path, namelen);
+			ce->namelen = namelen;
+			ce->st_mode = mode;
+			get_sha1_hex(sha1_force, ce->sha1);
+
+			add_cache_entry(ce, 1);
+		}
+		else if (add_file_to_cache(path))
 			die("Unable to add %s to database", path);
+		mode_force = sha1_force = NULL;
 	}
 	if (write_cache(newfd, active_cache, active_nr) ||
 	    rename(".git/index.lock", ".git/index"))

--- /dev/null	2005-03-19 15:28:25.000000000 -0800
+++ git-merge.perl	2005-04-13 23:45:23.000000000 -0700
@@ -0,0 +1,193 @@
+#!/usr/bin/perl -w
+
+use Getopt::Long;
+
+my $full_checkout = 0;
+my $oneside_checkout = 0;
+GetOptions("full" => \$full_checkout,
+	   "oneside" => \$oneside_checkout)
+    or die;
+
+if ($full_checkout) {
+    $oneside_checkout = 1;
+}
+
+sub read_rev_tree {
+    my (@head) = @_;
+    my ($fhi);
+    open $fhi, '-|', 'rev-tree', '--edges', @head
+	or die "$!: rev-tree --edges @head";
+    my $common;
+    while (<$fhi>) {
+	chomp;
+	(undef, undef, $common) = split(/ /, $_);
+	if ($common =~ s/^([a-f0-f]{40}):\d+$/$1/) {
+	    last;
+	}
+    }
+    close $fhi;
+    return $common;
+}
+
+sub read_commit_tree {
+    my ($commit) = @_;
+    my ($fhi);
+    open $fhi, '-|', 'cat-file', 'commit', $commit
+	or die "$!: cat-file commit $commit";
+    my $tree = <$fhi>;
+    close $fhi;
+    $tree =~ s/^tree //;
+    return $tree;
+}
+
+sub read_diff_tree {
+    my (@tree) = @_;
+    my ($fhi);
+    local ($_, $/);
+    $/ = "\0"; 
+    my %path;
+    open $fhi, '-|', 'diff-tree', '-r', @tree
+	or die "$!: diff-tree -r @tree";
+    while (<$fhi>) {
+	chomp;
+	if (/^\*[0-7]+->([0-7]+)\tblob\t[0-9a-f]+->([0-9a-f]{40})\t(.*)$/s) {
+	    # mode newsha path
+	    $path{$3} = [$1, $2];
+	}
+	elsif (/^\+([0-7]+)\tblob\t([0-9a-f]{40})\t(.*)$/s) {
+	    # mode newsha path
+	    $path{$3} = [$1, $2];
+	}
+	else {
+	    print STDERR "$_??";
+	}
+    }
+    close $fhi;
+    return %path;
+}
+
+sub read_show_files {
+    my ($fhi);
+    local ($_, $/);
+    $/ = "\0"; 
+    open $fhi, '-|', 'show-files', '-z'
+	or die "$!: show-files -z";
+    my (@path) = map { chomp; $_ } <$fhi>;
+    close $fhi;
+    return @path;
+}
+
+sub checkout_file {
+    my ($path, $info) = @_;
+    my (@elt) = split(/\//, $path);
+    my $j = '';
+    my $tail = pop @elt;
+    my ($fhi, $fho);
+    for (@elt) {
+	mkdir "$j$_";
+	$j = "$j$_/";
+    }
+    open $fho, '>', "$path";
+    open $fhi, '-|', 'cat-file', 'blob', $info->[1]
+	or die "$!: cat-file blob $info->[1]";
+    while (<$fhi>) {
+	print $fho $_;
+    }
+    close $fhi;
+    close $fho;
+    chmod oct("0$info->[0]"), "$path";
+}
+
+sub record_file {
+    my ($path, $info) = @_;
+    system 'update-cache', '--cacheinfo', @$info, $path;
+}
+
+sub merge_tree {
+    my ($path, $info0, $info1) = @_;
+    print STDERR "M - $path\n";
+    checkout_file(',,merge-0', $info0);
+    checkout_file(',,merge-1', $info1);
+    system 'checkout-cache', $path;
+    my ($fhi, $fho);
+    open $fhi, '-|', 'merge', '-p', ',,merge-0', $path, ',,merge-1';
+    open $fho, '>', "$path+";
+    local ($/);
+    while (<$fhi>) { print $fho $_; }
+    close $fhi;
+    close $fho;
+    unlink ',,merge-0', ',,merge-1';
+    rename "$path+", $path;
+    # There is no reason to prefer info0 over info1 but
+    # we need to pick one.
+    chmod oct("0$info0->[0]"), "$path";
+}
+
+# Find common ancestor of two trees.
+my $common = read_rev_tree(@ARGV);
+print "Common ancestor: $common\n";
+
+# Create a temporary directory and go there.
+system 'rm', '-rf', ',,merge-temp';
+for ((',,merge-temp', '.git')) { mkdir $_; chdir $_; }
+symlink "../../.git/objects", "objects";
+chdir '..';
+
+my $ancestor_tree = read_commit_tree($common);
+system 'read-tree', $ancestor_tree;
+
+my %tree0 = read_diff_tree($ancestor_tree, read_commit_tree($ARGV[0]));
+my %tree1 = read_diff_tree($ancestor_tree, read_commit_tree($ARGV[1]));
+
+my @ancestor_file = read_show_files();
+my %ancestor_file = map { $_ => 1 } @ancestor_file;
+
+for (@ancestor_file) {
+    if (! exists $tree0{$_} && ! exists $tree1{$_}) {
+	if ($full_checkout) {
+	    system 'checkout-cache', $_;
+	}
+	print STDERR "O - $_\n";
+    }
+}
+
+my %need_merge = ();
+
+for $path (keys %tree0) {
+    if (! exists $tree1{$path}) {
+	# Only changed in tree 0 --- take his version
+	print STDERR "0 - $path\n";
+	if (! exists $ancestor_file{$path}) {
+	    checkout_file($path, $tree0{$path});
+	    system 'update-cache', '--add', "$path";
+	}
+	elsif ($oneside_checkout) {
+	    checkout_file($path, $tree0{$path});
+	}
+	else {
+	    record_file($path, $tree0{$path});
+	}
+    }
+    else {
+	merge_tree($path, $tree0{$path}, $tree1{$path});
+    }
+}
+
+for $path (keys %tree1) {
+    if (! exists $tree0{$path}) {
+	# Only changed in tree 1 --- take his version
+	print STDERR "1 - $path\n";
+	if (! exists $ancestor_file{$path}) {
+	    checkout_file($path, $tree1{$path});
+	    system 'update-cache', '--add', "$path";
+	}
+	elsif ($oneside_checkout) {
+	    checkout_file($path, $tree1{$path});
+	}
+	else {
+	    record_file($path, $tree1{$path});
+	}
+    }
+}
+
+# system 'show-diff';



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-14  7:05       ` Junio C Hamano
@ 2005-04-14  8:06         ` Linus Torvalds
  2005-04-14  8:39           ` Junio C Hamano
  2005-04-15 19:57           ` Junio C Hamano
  0 siblings, 2 replies; 81+ messages in thread
From: Linus Torvalds @ 2005-04-14  8:06 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Petr Baudis, Christopher Li, git



On Thu, 14 Apr 2005, Junio C Hamano wrote:
> 
> I now have a Perl script that uses rev-tree, cat-file,
> diff-tree, show-files (with one modification so that it can deal
> with pathnames with embedded newlines), update-cache (with one
> modification so that I can add an entry for a file that does not
> exist to the dircache) and merge (from RCS).  Quick and dirty.

That's exactly what I wanted. Q'n'D is how the ball gets rolling.

In the meantime I wrote a very stupid "merge-tree" which does things
slightly differently, but I really think your approach (aka my original
approach) is actually a lot faster. I was just starting to worry that the 
ball didn't start, so I wrote an even hackier one.

My really hacky one is called "merge-tree", and it really only merges one 
directory. For each entry in the directory it says either

	select <mode> <sha1> path

or

	merge <mode>-><mode>,<mode> <sha1>-><sha1>,<sha1> path

depending on whether it could directly select the right object or not.

It's actually exactly the same algorithm as the first one, but I was 
afraid the first one would be so abstract that it (a) might not work and 
(b) wouldn't get people to work it out. This "one directory at a time with 
very explicit output" thing is much more down-to-earth, but it's also
likely slower because it will need script help more often.

That said, I don't know. MOST of the time there will be just a single 
"directory" entry that needs merging, and then the script would just need 
to recurse into that directory with the new "tree" objects. So it might 
not be too horrible.

But I'm really happy that you seem to have implemented my first 
suggestion and I seem to have been wasting my time. 

>  5. for each path involved:
> 
>   5.0 if neither heads change it, leave it as is;
>   5.1 if only one head changes a path and the other does not, just
>       get the changed version;
>   5.2 if both heads change it, check all three out and run merge.

You missed one case: 

    5.0.1 if both heads change it to the same thing, take the new thing

but maybe you counted that as 5.0 (it _should_ fall out automatically from
the fact that "diff-tree" between the two destination trees shows no
difference for such a file).

Now, arguably, your 5.2 will do things right, but the thing is, it's 
actually fairly _common_ that both heads have changed something to the 
same thing. Namely if there was a previous merge that already handled that 
case, but that previous merge may not be a proper parent of the new 
commits.  So from a performance standpoint you really don't want to 
consider that to be a merge - you just pick up the new contents directly.

See?

(My stupid "merge-tree" should show the algorithm in painful obviousity. 
Of course, my stipid merge-tree may also be painfully buggy. You be the 
judge).

> It does not currently commit.  You can go to ./,,merge-temp/ and
> see show-diff to see the result of the merge.  Files added in
> one head has already been run "update-cache" when the script
> ends, but changed and merged files are not---dircache still has
> the common ancestor view.

That sounds good.

> Also to implement 'changed only by one-side' without actually
> checking the file out, I needed to add one option to
> 'update-cache'.  --cacheinfo flag is used this way:
> 
>     $ update-cache --cacheinfo mode sha1 path

Yes. My "merge-tree" needs the exact same thing.

Looks good from your explanation, but I'm too tired to look at the code. 
It's 1AM, and the kids get up at 7.

I'm not much of a hacker, I usually crash by 10PM these days ;^)

			Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-14  8:06         ` Linus Torvalds
@ 2005-04-14  8:39           ` Junio C Hamano
  2005-04-14  9:10             ` Linus Torvalds
  2005-04-15 19:57           ` Junio C Hamano
  1 sibling, 1 reply; 81+ messages in thread
From: Junio C Hamano @ 2005-04-14  8:39 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Petr Baudis, Christopher Li, git

>>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:

LT> But I'm really happy that you seem to have implemented my first 
LT> suggestion and I seem to have been wasting my time. 

Thanks for the kind words.

>> 5. for each path involved:
>> 
>> 5.0 if neither heads change it, leave it as is;
>> 5.1 if only one head changes a path and the other does not, just
>> get the changed version;
>> 5.2 if both heads change it, check all three out and run merge.

LT> You missed one case: 

LT>     5.0.1 if both heads change it to the same thing, take the new thing

LT> but maybe you counted that as 5.0 (it _should_ fall out automatically from
LT> the fact that "diff-tree" between the two destination trees shows no
LT> difference for such a file).

Actually I am not handling that.  It really is 5.1a---the exact
same code path as 5.1 can be used for this case, and as you
point out it is really a quite important optimization.

I have to handle the following cases.  I think I currently do
wrong things to them:

  5.1a both head modify to the same thing.
  5.1b one head removes, the other does not do anything.
  5.1c both head remove.
  5.3 one head removes, the other head modifies.

Handling of 5.1a, 5.1b and 5.1c are obvious.

  5.1a Update dircache to the same new thing.  Without -f or -o
       flag do not touch ,,merge-temp/. directory; with -f or
       -o, leave the new file in ,,merge-temp/.

  5.1b Remove the path from dircache and do not have the file in
       ,,merge-temp/. directory regardless of -f or -o flags.

  5.1c Same as 5.1b

I am not sure what to do with 5.3.  My knee-jerk reaction is to
leave the modified result in ,,merge-temp/$path~ without
touching dircache.  If the merger wants to pick it up, he can
rename $path~ to $path temporarily, run show-diff on it (I think
giving an option to show-diff to specify paths would be helpful
for this workflow), to decide if he wants to keep the file or
not.  Suggestions?


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-14  8:39           ` Junio C Hamano
@ 2005-04-14  9:10             ` Linus Torvalds
  2005-04-14 11:14               ` Junio C Hamano
  0 siblings, 1 reply; 81+ messages in thread
From: Linus Torvalds @ 2005-04-14  9:10 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Petr Baudis, Christopher Li, git



On Thu, 14 Apr 2005, Junio C Hamano wrote:
> 
> I have to handle the following cases.  I think I currently do
> wrong things to them:
> 
>   5.1a both head modify to the same thing.
>   5.1b one head removes, the other does not do anything.
>   5.1c both head remove.
>   5.3 one head removes, the other head modifies.

There's another interesting set of cases: one side creates a file, and the
other one creates a directory. 

> I am not sure what to do with 5.3.

My very _strong_ preference is to just inform the user about a merge that
cannot be performed, and not let it be automated. BIG warning, with some 
way for the user to specify the end result.

The thing is, these are pretty rare cases. But in order to make people 
feel good about the _common_ case, it's important that they feel safe 
about the rare one.

Put another way: if git tells me when it can't do something (with some
specificity), I can then fix the situation up and try again. I might curse
a while, and maybe it ends up being so common that I might even automate
it, but at least I'll be able to trust the end result.

In contrast, if git does something that _may_ be nonsensical, then I'll
worry all the time, and not trust git. That's much worse than an 
occasional curse.

So the rule should be: only merge when it's "obviously the right thing".  
If it's not obvious, the merge should _not_ try to guess what the right
thing is. It's much better to fail loudly.

(That's especially true early on. There may be cases that end up being
obvious after some usage. But I'd rather find them by having git be too
stupid, than find out the hard way that git lost some data because it
thought it was ok to remove a file that had been modified)

			Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-14  9:10             ` Linus Torvalds
@ 2005-04-14 11:14               ` Junio C Hamano
  2005-04-14 12:16                 ` Petr Baudis
  0 siblings, 1 reply; 81+ messages in thread
From: Junio C Hamano @ 2005-04-14 11:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

Here is a diff to update the git-merge.perl script I showed you
earlier today ;-).  It contains the following updates against
your HEAD (bb95843a5a0f397270819462812735ee29796fb4).

 * git-merge.perl command we talked about on the git list.  I've
   covered the changed-to-the-same case etc.  I still haven't done
   anything about file-vs-directory case yet.

   It does warn when it needed to run merge to automerge and let
   merge give a warning message about conflicts if any.  In
   modify/remove cases, modified in one but removed in the other
   files are left in either $path~A~ or $path~B~ in the merge
   temporary directory, and the script issues a warning at the
   end.

 * show-files and ls-tree updates to add -z flag to NUL terminate records;
   this is needed for git-merge.perl to work.

 * show-diff updates to add -r flag to squelch diffs for files not in
   the working directory.  This is mainly useful when verifying the
   result of an automated merge.

 * update-cache updates to add "--cacheinfo mode sha1" flag to register
   a file that is not in the current working directory.  Needed for
   minimum-checkout merging by git-merge.perl.


Signed-off-by: Junio C Hamano <junkio@cox.net>

---

 git-merge.perl |  247 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 ls-tree.c      |    9 +-
 show-diff.c    |   11 +-
 show-files.c   |   12 ++
 update-cache.c |   25 +++++
 5 files changed, 296 insertions(+), 8 deletions(-)

diff -x .git -Nru ,,1/git-merge.perl ,,2/git-merge.perl
--- ,,1/git-merge.perl	1969-12-31 16:00:00.000000000 -0800
+++ ,,2/git-merge.perl	2005-04-14 04:00:14.000000000 -0700
@@ -0,0 +1,247 @@
+#!/usr/bin/perl -w
+
+use Getopt::Long;
+
+my $full_checkout = 0;
+my $oneside_checkout = 0;
+GetOptions("full" => \$full_checkout,
+	   "oneside" => \$oneside_checkout)
+    or die;
+
+if ($full_checkout) {
+    $oneside_checkout = 1;
+}
+
+sub read_rev_tree {
+    my (@head) = @_;
+    my ($fhi);
+    open $fhi, '-|', 'rev-tree', '--edges', @head
+	or die "$!: rev-tree --edges @head";
+    my %common;
+    while (<$fhi>) {
+	chomp;
+	(undef, undef, my @common) = split(/ /, $_);
+	for (@common) {
+	    if (s/^([a-f0-f]{40}):3$/$1/) {
+		$common{$_}++;
+	    }
+	}
+    }
+    close $fhi;
+
+    my @common = (map { $_->[1] }
+		  sort { $b->[0] <=> $a->[0] }
+		  map { [ $common{$_} => $_ ] }
+		  keys %common);
+
+    return $common[0];
+}
+
+sub read_commit_tree {
+    my ($commit) = @_;
+    my ($fhi);
+    open $fhi, '-|', 'cat-file', 'commit', $commit
+	or die "$!: cat-file commit $commit";
+    my $tree = <$fhi>;
+    close $fhi;
+    $tree =~ s/^tree //;
+    return $tree;
+}
+
+# Reads diff-tree -r output and gives a hash that maps a path
+# to 3-tuple (old-mode new-mode new-sha).
+# When creating, old-mode is undef.  When removing, new-* are undef.
+sub read_diff_tree {
+    my (@tree) = @_;
+    my ($fhi);
+    local ($_, $/);
+    $/ = "\0"; 
+    my %path;
+    open $fhi, '-|', 'diff-tree', '-r', @tree
+	or die "$!: diff-tree -r @tree";
+    while (<$fhi>) {
+	chomp;
+	if (/^\*([0-7]+)->([0-7]+)\tblob\t[0-9a-f]+->([0-9a-f]{40})\t(.*)$/s) {
+	    $path{$4} = [$1, $2, $3];
+	}
+	elsif (/^\+([0-7]+)\tblob\t([0-9a-f]{40})\t(.*)$/s) {
+	    $path{$3} = [undef, $1, $2];
+	}
+	elsif (/^\-([0-7]+)\tblob\t[0-9a-f]{40}\t(.*)$/s) {
+	    $path{$2} = [$1, undef, undef];
+	}
+	else {
+	    die "cannot parse diff-tree output: $_";
+	}
+    }
+    close $fhi;
+    return %path;
+}
+
+sub read_show_files {
+    my ($fhi);
+    local ($_, $/);
+    $/ = "\0"; 
+    open $fhi, '-|', 'show-files', '-z'
+	or die "$!: show-files -z";
+    my (@path) = map { chomp; $_ } <$fhi>;
+    close $fhi;
+    return @path;
+}
+
+sub checkout_file {
+    my ($path, $info) = @_;
+    my (@elt) = split(/\//, $path);
+    my $j = '';
+    my $tail = pop @elt;
+    my ($fhi, $fho);
+    for (@elt) {
+	mkdir "$j$_";
+	$j = "$j$_/";
+    }
+    open $fho, '>', "$path";
+    open $fhi, '-|', 'cat-file', 'blob', $info->[2]
+	or die "$!: cat-file blob $info->[2]";
+    while (<$fhi>) {
+	print $fho $_;
+    }
+    close $fhi;
+    close $fho;
+    chmod oct("0$info->[1]"), "$path";
+}
+
+sub record_file {
+    my ($path, $info) = @_;
+    system ('update-cache', '--add', '--cacheinfo',
+	    $info->[1], $info->[2], $path);
+}
+
+sub merge_tree {
+    my ($path, $info0, $info1) = @_;
+    checkout_file(',,merge-0', $info0);
+    checkout_file(',,merge-1', $info1);
+    system 'checkout-cache', $path;
+    my ($fhi, $fho);
+    open $fhi, '-|', 'merge', '-p', ',,merge-0', $path, ',,merge-1';
+    open $fho, '>', "$path+";
+    local ($/);
+    while (<$fhi>) { print $fho $_; }
+    close $fhi;
+    close $fho;
+    unlink ',,merge-0', ',,merge-1';
+    rename "$path+", $path;
+    # There is no reason to prefer info0 over info1 but
+    # we need to pick one.
+    chmod oct("0$info0->[1]"), "$path";
+}
+
+# Find common ancestor of two trees.
+my $common = read_rev_tree(@ARGV);
+print "Common ancestor: $common\n";
+
+# Create a temporary directory and go there.
+system 'rm', '-rf', ',,merge-temp';
+for ((',,merge-temp', '.git')) { mkdir $_; chdir $_; }
+symlink "../../.git/objects", "objects";
+chdir '..';
+
+my $ancestor_tree = read_commit_tree($common);
+system 'read-tree', $ancestor_tree;
+
+my %tree0 = read_diff_tree($ancestor_tree, read_commit_tree($ARGV[0]));
+my %tree1 = read_diff_tree($ancestor_tree, read_commit_tree($ARGV[1]));
+
+my @ancestor_file = read_show_files();
+my %ancestor_file = map { $_ => 1 } @ancestor_file;
+
+for (@ancestor_file) {
+    if (! exists $tree0{$_} && ! exists $tree1{$_}) {
+	if ($full_checkout) {
+	    system 'checkout-cache', $_;
+	}
+	print STDERR "O - $_\n";
+    }
+}
+
+for my $set ([\%tree0, \%tree1, 'A'], [\%tree1, \%tree0, 'B']) {
+    my ($treeA, $treeB, $side) = @$set;
+    while (my ($path, $info) = each %$treeA) {
+	# In this loop we do not deal with overlaps.
+	next if (exists $treeB->{$path});
+
+	if (! defined $info->[1]) {
+	    # deleted in this tree only.
+	    unlink $path;
+	    system 'update-cache', '--remove', $path;
+	    print STDERR "$side D $path\n";
+	}
+	else {
+	    # modified or created in this tree only.
+	    print STDERR "$side M $path\n";
+	    if ($oneside_checkout) {
+		checkout_file($path, $info);
+		system 'update-cache', '--add', "$path";
+	    } else {
+		record_file($path, $info);
+	    }
+	}
+    }
+}
+
+my @warning = ();
+
+while (my ($path, $info0) = each %tree0) {
+    # We need to deal only with overlaps.
+    next if (!exists $tree1{$path});
+
+    my $info1 = $tree1{$path};
+    if (! defined $info0->[1]) {
+	# deleted in this tree.
+	if (! defined $info1->[1]) {
+	    # deleted in both trees.  Obvious.
+	    print STDERR "*DD $path\n";
+	    unlink $path;
+	    system 'update-cache', '--remove', $path;
+	}
+	else {
+	    # oops.  tree0 wants to remove but tree1 wants to modify it.
+	    print STDERR "*DM $path\n";
+	    checkout_file("$path~B~", $info1);
+	    push @warning, $path;
+	}
+    }
+    else {
+	# modified or created in tree0
+	if (! defined $info1->[1]) {
+	    # oops.  tree0 wants to modify but tree1 wants to remove it.
+	    print STDERR "*MD $path\n";
+	    checkout_file("$path~A~", $info0);
+	    push @warning, $path;
+	}
+	else {
+	    # modified both in tree0 and tree1
+	    # are they modifying to the same contents?
+	    if ($info0->[2] eq $info1->[2]) {
+		# just mode changes (or no changes)
+		# we prefer tree0 over tree1 for no particular reason.
+		print STDERR "*MM $path\n";
+		record_file($path, $info0);
+	    }
+	    else {
+		# modified in both.  Needs merge.
+		print STDERR "MRG $path\n";
+		merge_tree($path, $info0, $info1);
+	    }
+	}
+    }
+}
+
+if (@warning) {
+    print "\nThere are some files that were deleted in one branch and\n"
+	. "modified in another.  Please examine them carefully:\n";
+    for (@warning) {
+	print "$_\n";
+    }
+}
+
+# system 'show-diff';


diff -x .git -Nru ,,1/ls-tree.c ,,2/ls-tree.c
--- ,,1/ls-tree.c	2005-04-14 03:47:18.000000000 -0700
+++ ,,2/ls-tree.c	2005-04-14 04:00:14.000000000 -0700
@@ -5,6 +5,8 @@
  */
 #include "cache.h"
 
+int line_termination = '\n';
+
 static int list(unsigned char *sha1)
 {
 	void *buffer;
@@ -31,7 +33,8 @@
 		 * It seems not worth it to read each file just to get this
 		 * and the file size. -- pasky@ucw.cz */
 		type = S_ISDIR(mode) ? "tree" : "blob";
-		printf("%03o\t%s\t%s\t%s\n", mode, type, sha1_to_hex(sha1), path);
+		printf("%03o\t%s\t%s\t%s%c", mode, type, sha1_to_hex(sha1),
+		       path, line_termination);
 	}
 	return 0;
 }
@@ -40,6 +43,10 @@
 {
 	unsigned char sha1[20];
 
+	if (argc == 3 && !strcmp(argv[1], "-z")) {
+	  line_termination = 0;
+	  argc--; argv++;
+	}
 	if (argc != 2)
 		usage("ls-tree <key>");
 	if (get_sha1_hex(argv[1], sha1) < 0)


diff -x .git -Nru ,,1/show-diff.c ,,2/show-diff.c
--- ,,1/show-diff.c	2005-04-14 03:47:18.000000000 -0700
+++ ,,2/show-diff.c	2005-04-14 04:00:14.000000000 -0700
@@ -58,15 +58,20 @@
 int main(int argc, char **argv)
 {
 	int silent = 0;
+	int silent_on_nonexisting_files = 0;
 	int entries = read_cache();
 	int i;
 
 	while (argc-- > 1) {
 		if (!strcmp(argv[1], "-s")) {
-			silent = 1;
+			silent_on_nonexisting_files = silent = 1;
 			continue;
 		}
-		usage("show-diff [-s]");
+		if (!strcmp(argv[1], "-r")) {
+			silent_on_nonexisting_files = 1;
+			continue;
+		}
+		usage("show-diff [-s] [-r]");
 	}
 
 	if (entries < 0) {
@@ -83,7 +88,7 @@
 
 		if (stat(ce->name, &st) < 0) {
 			printf("%s: %s\n", ce->name, strerror(errno));
-			if (errno == ENOENT && !silent)
+			if (errno == ENOENT && !silent_on_nonexisting_files)
 				show_diff_empty(ce);
 			continue;
 		}


diff -x .git -Nru ,,1/show-files.c ,,2/show-files.c
--- ,,1/show-files.c	2005-04-14 03:47:18.000000000 -0700
+++ ,,2/show-files.c	2005-04-14 04:00:14.000000000 -0700
@@ -14,6 +14,7 @@
 static int show_cached = 0;
 static int show_others = 0;
 static int show_ignored = 0;
+static int line_terminator = '\n';
 
 static const char **dir;
 static int nr_dir;
@@ -105,12 +106,12 @@
 	}
 	if (show_others) {
 		for (i = 0; i < nr_dir; i++)
-			printf("%s\n", dir[i]);
+			printf("%s%c", dir[i], line_terminator);
 	}
 	if (show_cached) {
 		for (i = 0; i < active_nr; i++) {
 			struct cache_entry *ce = active_cache[i];
-			printf("%s\n", ce->name);
+			printf("%s%c", ce->name, line_terminator);
 		}
 	}
 	if (show_deleted) {
@@ -119,7 +120,7 @@
 			struct stat st;
 			if (!stat(ce->name, &st))
 				continue;
-			printf("%s\n", ce->name);
+			printf("%s%c", ce->name, line_terminator);
 		}
 	}
 	if (show_ignored) {
@@ -134,6 +135,11 @@
 	for (i = 1; i < argc; i++) {
 		char *arg = argv[i];
 
+		if (!strcmp(arg, "-z")) {
+			line_terminator = 0;
+			continue;
+		}
+
 		if (!strcmp(arg, "--cached")) {
 			show_cached = 1;
 			continue;


diff -x .git -Nru ,,1/update-cache.c ,,2/update-cache.c
--- ,,1/update-cache.c	2005-04-14 03:47:18.000000000 -0700
+++ ,,2/update-cache.c	2005-04-14 04:00:14.000000000 -0700
@@ -250,6 +250,8 @@
 {
 	int i, newfd, entries;
 	int allow_options = 1;
+	const char *sha1_force = NULL;
+	const char *mode_force = NULL;
 
 	newfd = open(".git/index.lock", O_RDWR | O_CREAT | O_EXCL, 0600);
 	if (newfd < 0)
@@ -282,14 +284,35 @@
 				refresh_cache();
 				continue;
 			}
+			if (!strcmp(path, "--cacheinfo")) {
+				mode_force = argv[++i];
+				sha1_force = argv[++i];
+				continue;
+			}
 			die("unknown option %s", path);
 		}
 		if (!verify_path(path)) {
 			fprintf(stderr, "Ignoring path %s\n", argv[i]);
 			continue;
 		}
-		if (add_file_to_cache(path))
+		if (sha1_force && mode_force) {
+			struct cache_entry *ce;
+			int namelen = strlen(path);
+			int mode;
+			int size = cache_entry_size(namelen);
+			sscanf(mode_force, "%o", &mode);
+			ce = malloc(size);
+			memset(ce, 0, size);
+			memcpy(ce->name, path, namelen);
+			ce->namelen = namelen;
+			ce->st_mode = mode;
+			get_sha1_hex(sha1_force, ce->sha1);
+
+			add_cache_entry(ce, 1);
+		}
+		else if (add_file_to_cache(path))
 			die("Unable to add %s to database", path);
+		mode_force = sha1_force = NULL;
 	}
 	if (write_cache(newfd, active_cache, active_nr) ||
 	    rename(".git/index.lock", ".git/index"))


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-14 12:16                 ` Petr Baudis
@ 2005-04-14 18:12                   ` Junio C Hamano
  2005-04-14 18:36                     ` Linus Torvalds
                                       ` (2 more replies)
  0 siblings, 3 replies; 81+ messages in thread
From: Junio C Hamano @ 2005-04-14 18:12 UTC (permalink / raw)
  To: Petr Baudis; +Cc: Linus Torvalds, git

>>>>> "PB" == Petr Baudis <pasky@ucw.cz> writes:

PB> Bah, you outran me. ;-)

Just being in a different timezone, I guess.

PB> I'll change it to use the cool git-pasky stuff (commit-id etc) and its
PB> style of committing - that is, it will merely record the update-caches
PB> to be done upon commit, and it will read-tree the branch we are merging
PB> to instead of the ancestor. (So that git diff gives useful output.)

Sorry, I have not seen what you have been doing since pasky 0.3,
and I have not even started to understand the mental model of
the world your tool is building.  That said, my gut feeling is
that telling this script about git-pasky's world model might be
a mistake.  I'd rather see you consider the script as mere "part
of the plumbing".  Maybe adding an extra parameter to the script
to let the user explicitly specify the common ancestor to use
would be needed, but I would prefer git-pasky-merge to do its
own magic (converting symbolic commit names into raw commit
names and such) before calling this low level script.

That way people like me who have not migrated to your framework
can still keep using it.  All the script currently needs is a
bare git object database; i.e., nothing other than what is in
.git/objects and a couple of commit record SHA1s as its
parameters.  No .git/heads/, no .git/HEAD.local, no .git/tags,
are involved for it to work, and I would prefer to keep things
that way if possible.

>> * show-diff updates to add -r flag to squelch diffs for files not in
>> the working directory.  This is mainly useful when verifying the
>> result of an automated merge.

PB> -r traditionally means recursive - what's the reasoning behind the
PB> choice of this letter?

Well, '-r' is not necessarily recursive. "ls -r" is reverse, "sort
-r" is reverse.  "less -r" is raw.  "cat -r" is reversible.
"nethack -r" is race ;-).  You are thinking as an SCM person so
it may look that way.  "diff -r" is recursive.  "darcs add -r"
is recursive.  But even in the SCM world, "cvs add -r" is not
(it means read-only) neither "co -r" (explicit revision) ;-).

I would rather pick '-q' if I were doing the patch today, but I
was too tired and did not think of a letter when I wrote it.  I
guess '-r' stood for removed, but I agree it is a bad choice.
Any objections to '-q'?

PB> use strict?

Not in this iteration but eventually yes.

PB> It'd be simpler to do just

PB> 	my @common = (map { $common{$_} }
PB> 	              sort { $b <=> $a }
PB> 	              keys %common)

Well, actually you spotted a bug between the implementation and
what I wanted to do.  It should have been:

map { $_->[0] }
    sort { $b->[1] <=> $a->[1] }
        map { [ $common{$_} => $_ ] } keys %common

That is, sort [anscestor => number of times it appears] tuple by
the "number of times it appears" in decreasing order, and
project the resulting list to a list of ancestors.  It is trying
to deal with the following pattern in rev-tree output:

TIMESTAMP1 EDGE1:1 ANCESTOR1:3 ANCESTOR2:3
TIMESTAMP2 EDGE2:2 ANCESTOR1:3

and when the above happens I wanted to pick up ANCESTOR1, but
that was without no sound reason.

PB> But I really think this is a horrible heuristic. I believe you should
PB> take the latest commit in the --edges output, and from that choose the
PB> base whose rev-tree --edges the_base merged_branch has the least lines
PB> on output. (That is, the path to it is shortest - ideally it's already
PB> part of the merged_branch.)

I'll try something along that line.  Honestly the ancestor
selection part was what I had most trouble with.  Thanks.

PB> What about

PB> sub OLDMODE { 0 }
PB> sub NEWMODE { 1 }
PB> sub NEWSHA { 2 }

PB> and then using that when accessing the tuple? Would make the code
PB> much more readable.

Totally agreed; readability cleanup is needed, just as "use
strict" you mentioned, before it is ready for public
consumption.  Remember, however, the primary purpose of the
message was to share it with Linus so that I can ask his opinion
while the script was still slushy; the contents that array
contained was still changing then and was too early for symbolic
constants.  I'll do that in the next round.

PB> It is a good idea to check merge's exit code and give a notice at the
PB> end if there were any conflicts.

In principle yes, but I noticed that merge already gave me a
nice warning message when it found conflicts, so there was no
need to do so myself in this case.  See sample output:

    $ perl ./git-merge.perl \
        71796686221a0a56ccc25b02386ed8ea648da14d \
        bb95843a5a0f397270819462812735ee29796fb4 
    Common ancestor: 9f02d4d233223462d3f6217b5837b786e6286ba4
    O - COPYING
    O - README
    ...
    O - write-tree.c
    A M write-blob.c
    A M show-diff.c
    ...
    A M update-cache.c
    A M git-merge.perl
    B M merge-tree.c
    MRG Makefile
    merge: warning: conflicts during merge
    $ 

>> +# Create a temporary directory and go there.
>> +system 'rm', '-rf', ',,merge-temp';

PB> Can't we call it just ,,merge?

I'd rather have a command line option '-o' (scrapping the
current '-o' and renaming it to something else; as you can see I
am terrible at picking option names ;-)) to mean "output to this
directory".  I am not really an Arch person so I do not
particulary care about /^,,/.  How about "git~merge~$$"?

>> +for ((',,merge-temp', '.git')) { mkdir $_; chdir $_; }
>> +symlink "../../.git/objects", "objects";
>> +chdir '..';
>> +
>> +my $ancestor_tree = read_commit_tree($common);
>> +system 'read-tree', $ancestor_tree;
>> +
>> +my %tree0 = read_diff_tree($ancestor_tree, read_commit_tree($ARGV[0]));
>> +my %tree1 = read_diff_tree($ancestor_tree, read_commit_tree($ARGV[1]));
>> +
>> +my @ancestor_file = read_show_files();
>> +my %ancestor_file = map { $_ => 1 } @ancestor_file;
>> +
>> +for (@ancestor_file) {
>> +    if (! exists $tree0{$_} && ! exists $tree1{$_}) {
>> +	if ($full_checkout) {
>> +	    system 'checkout-cache', $_;
>> +	}
>> +	print STDERR "O - $_\n";

PB> Huh, what are you trying to do here? I think you should just record
PB> remove, no? (And I wouldn't do anything with my read-tree. ;-)

At this moment in the script, we have run "read-tree" the
ancestor so the dircache has the original.  %tree0 and %tree1
both did not touch the path ($_ here) so it is the same as
ancestor.  When '-f' is specified we are populating the output
working tree with the merge result so that is what that
'checkout-cache' is about.  "O - $path" means "we took the
original".

The idea is to populate the dircache of merge-temp with the
merge result and leave uncertain stuff as in the common ancestor
state, so that the user can fix them starting from there.

Maybe it is a good time for me to summarize the output somewhere
in a document.

    O - $path	Tree-A and tree-B did not touch this; the result
                is taken from the ancestor (O for original).

    A D $path	Only tree-A (or tree-B) deleted this and the other
    B D $path   branch did not touch this; the result is to delete.

    A M $path	Only tree-A (or tree-B) modified this and the other
    B M $path   branch did not touch this; the result is to use one
                from tree-A (or tree-B).  This includes file
                creation case.

    *DD $path	Both tree-A and tree-B deleted this; the result
                is to delete.

    *DM $path   Tree-A deleted while tree-B modified this (or
    *MD $path   vice versa), and manual conflict resolution is
                needed; dircache is left as in the ancestor, and
                the modified file is saved as $path~A~ in the
                working directory.  The user can rename it to $path
                and run show-diff to see what Tree-A wanted to do
                and decide before running update-cache.

    *MM $path   Tree-A and tree-B did the exact same
                modification; the result is to use that.

    MRG $path   Tree-A and tree-B have different modifications;
                run "merge" and the merge result is left as
                $path in the working directory.

In cases other than *DM, *MD, and MRG, the result is trivial and
is recorded in the dircache.  Without '-o' (to be renamed ;-)
nor '-f' there will not be a file checked out in the working
directory for them.  The three merge cases need human attention.
The dircache is not touched in these cases and left as the
ancestor version, and the working directory gets some file as
described above.

NOTE NOTE NOTE: I am not dealing with a case where both branches
create the same file but with different contents.  In such a
case the current code falls into MRG path without having a
common ancestor, which is nonsense---I can use /dev/null as the
common ancestor, I guess.  Also NOTE NOTE NOTE I need to detect
the case where one branch creates a directory while the other
creates a file.  There is nothing an automated tool can do in
that case but it needs to be detected and be told the user
loudly.


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-14 18:12                   ` Junio C Hamano
@ 2005-04-14 18:36                     ` Linus Torvalds
  2005-04-14 19:59                       ` Junio C Hamano
  2005-04-15  9:14                       ` David Woodhouse
  2005-04-14 18:51                     ` Christopher Li
  2005-04-14 19:35                     ` Petr Baudis
  2 siblings, 2 replies; 81+ messages in thread
From: Linus Torvalds @ 2005-04-14 18:36 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Petr Baudis, git



On Thu, 14 Apr 2005, Junio C Hamano wrote:
> 
> Sorry, I have not seen what you have been doing since pasky 0.3,
> and I have not even started to understand the mental model of
> the world your tool is building.  That said, my gut feeling is
> that telling this script about git-pasky's world model might be
> a mistake.  I'd rather see you consider the script as mere "part
> of the plumbing". 

I agree. Having separate abstraction layers is good.  I'm actually very 
happy with Pasky's cleaned-up-tree, exactly because unlike the first one, 
Pasky did a great job of maintaining the abstraction between "plumbing" 
and user interfaces.

The plumbing should take user interface needs into account, but the more
conceptually separate it is ("does it makes sense on its own?") the better
off we'll be. And "merge these two trees" (which works on a _tree_ level)
or "find the common commit" (which works on a _commit_ level) look like 
plumbing to me - the kind of things I should have written, if I weren't 
such a lazy slob.

		Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-14 18:12                   ` Junio C Hamano
  2005-04-14 18:36                     ` Linus Torvalds
@ 2005-04-14 18:51                     ` Christopher Li
  2005-04-14 19:35                     ` Petr Baudis
  2 siblings, 0 replies; 81+ messages in thread
From: Christopher Li @ 2005-04-14 18:51 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Petr Baudis, Linus Torvalds, git

On Thu, Apr 14, 2005 at 11:12:35AM -0700, Junio C Hamano wrote:
> >>>>> "PB" == Petr Baudis <pasky@ucw.cz> writes:
> 
> At this moment in the script, we have run "read-tree" the
> ancestor so the dircache has the original.  %tree0 and %tree1
> both did not touch the path ($_ here) so it is the same as
> ancestor.  When '-f' is specified we are populating the output
> working tree with the merge result so that is what that
> 'checkout-cache' is about.  "O - $path" means "we took the
> original".
> 
> The idea is to populate the dircache of merge-temp with the
> merge result and leave uncertain stuff as in the common ancestor
> state, so that the user can fix them starting from there.
> 
> Maybe it is a good time for me to summarize the output somewhere
> in a document.
> 
>     O - $path	Tree-A and tree-B did not touch this; the result
>                 is taken from the ancestor (O for original).
> 
>     A D $path	Only tree-A (or tree-B) deleted this and the other
>     B D $path   branch did not touch this; the result is to delete.
> 
>     A M $path	Only tree-A (or tree-B) modified this and the other
>     B M $path   branch did not touch this; the result is to use one
>                 from tree-A (or tree-B).  This includes file
>                 creation case.
> 
>     *DD $path	Both tree-A and tree-B deleted this; the result
>                 is to delete.
> 
>     *DM $path   Tree-A deleted while tree-B modified this (or
>     *MD $path   vice versa), and manual conflict resolution is
>                 needed; dircache is left as in the ancestor, and
>                 the modified file is saved as $path~A~ in the
>                 working directory.  The user can rename it to $path
>                 and run show-diff to see what Tree-A wanted to do
>                 and decide before running update-cache.
> 
>     *MM $path   Tree-A and tree-B did the exact same
>                 modification; the result is to use that.
> 
>     MRG $path   Tree-A and tree-B have different modifications;
>                 run "merge" and the merge result is left as
>                 $path in the working directory.
> 
> In cases other than *DM, *MD, and MRG, the result is trivial and

I believe there is simpler way to do it as in my demo python script.
I start it easier but you bits me in time. It is a demo script, it
only print the action instead of actually going out to do it.
change that to corresponding os.system("") call leaves to the reader.

Again, this is a demo how it can be done. Not python vs perl thing
I did not chose perl only because I am not good at it.

#!/usr/bin/env python

import re
import sys
import os
from pprint import pprint

def get_tree(commit):
    data = os.popen("cat-file commit %s"%commit).read()
    return re.findall(r"(?m)^tree (\w+)", data)[0]

PREFIX = 0
PATH = -1
SHA = -2
ORIGSHA = -3

def get_difftree(old, new):
    lines = os.popen("diff-tree %s %s"%(old, new)).read().split("\x00")
    patterns = (r"(\*)(\d+)->(\d+)\s(\w+)\s(\w+)->(\w+)\s(.*)",
		r"([+-])(\d+)\s(\w+)\s(\w+)\s(.*)")
    res = {}
    for l in lines:
	if not l: continue
	for p in patterns:
	    m = re.findall(p, l)
	    if m:
		m = m[0]
		res[m[-1]] = m
		break
	else:
	    raise "difftree: unknow line", l
    return res

def analyze(diff1, diff2):
    diff1only = [ diff1[k] for k in diff1 if k not in diff2 ]
    diff2only = [ diff2[k] for k in diff2 if k not in diff1 ]
    both = [ (diff1[k],diff2[k]) for k in diff2 if k in diff1 ]

    action(diff1only)
    action(diff2only)
    action_two(both)

def action(diffs):
    for act in diffs:
	if act[PREFIX] == "*":
	    print "modify", act[PATH], act[SHA]
	elif act[PREFIX] == '-':
	    print "remove", act[PATH], act[SHA]
	elif act[PREFIX] == '+':
	    print "add", "remove", act[PATH], act[SHA]
	else:
	    raise "unknow action"

def action_two(diffs):
    for act1, act2 in diffs:
	if len(act1) == len(act2):	# same kind type
	    if act1[PREFIX] == act2[PREFIX]:
		if act1[SHA] == act2[SHA] or act1[PREFIX] == '-': 
		    return action(act1)
	    	if act1[PREFIX]=='*':
		    print "3way-merge", act1[PATH], act1[ORIGSHA], act1[SHA], act2[SHA]
		    return
	print "unable to handle", act[PATH]
	print "one side wants", act1[PREFIX]
	print "the other side wants", act2[PREFIX]
	
    
args = sys.argv[1:]
trees = map(get_tree, args)
print "check out tree", trees[0]
diff1 = get_difftree(trees[0], trees[1])
diff2 = get_difftree(trees[0], trees[2])
analyze(diff1, diff2)


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-14 18:36                     ` Linus Torvalds
@ 2005-04-14 19:59                       ` Junio C Hamano
  2005-04-15  0:42                         ` Linus Torvalds
  2005-04-15  9:14                       ` David Woodhouse
  1 sibling, 1 reply; 81+ messages in thread
From: Junio C Hamano @ 2005-04-14 19:59 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Petr Baudis, git

>>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:

LT> On Thu, 14 Apr 2005, Junio C Hamano wrote:

>> Sorry, I have not seen what you have been doing since pasky 0.3,
>> and I have not even started to understand the mental model of
>> the world your tool is building.  That said, my gut feeling is
>> that telling this script about git-pasky's world model might be
>> a mistake.  I'd rather see you consider the script as mere "part
>> of the plumbing". 

LT> I agree. Having separate abstraction layers is good.  I'm actually very 
LT> happy with Pasky's cleaned-up-tree, exactly because unlike the first one, 
LT> Pasky did a great job of maintaining the abstraction between "plumbing" 
LT> and user interfaces.

Agreed, not just with your agreeing with me, but with the
statement that Pasky did a good job (although I am ashamed to
say I have not caught up with the "userland" tools).

LT> The plumbing should take user interface needs into account, but the more
LT> conceptually separate it is ("does it makes sense on its own?") the better
LT> off we'll be. And "merge these two trees" (which works on a _tree_ level)
LT> or "find the common commit" (which works on a _commit_ level) look like 
LT> plumbing to me - the kind of things I should have written, if I weren't 
LT> such a lazy slob.

I am planning drop the ancestor computation from the script, and
make it another command line parameter to the script.  Dan
Barkalow's merge-base program should be used to compute it and
his result should drive the merge.  That sounds more UNIXy to
me.  I even may want to make the script take three trees not
commits, since the merge script does not need commits (it only
needs trees).  As plumbing it would be cleaner interface to it
to do so.  The wrapper SCM scripts can and should make sure it
is fed trees when the user gives it commits (or symbolic
representation of it like .git/tags/blah, or `cat .git/HEAD`).

But one different thing to note here.

You say "merge these two trees" above (I take it that you mean
"merge these two trees, taking account of this tree as their
common ancestor", so actually you are dealing with three trees),
and I am tending to agree with the notion of merging trees not
commits.  However you might get richer context and more sensible
resulting merge if you say "merge these two commits".  Since
commit chaining is part of the fundamental git object model you
may as well use it.

This however opens up another set of can of worms---it would
involve not just three trees but all the trees in the commit
chain in between.  That's when you start wondering if it would
be better to add renames in the git object model, which is the
topic of another thread.  I have not formed an opinion on that
one myself yet.


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-14 23:12                       ` Junio C Hamano
@ 2005-04-14 20:24                         ` Christopher Li
  2005-04-14 23:31                         ` Petr Baudis
  1 sibling, 0 replies; 81+ messages in thread
From: Christopher Li @ 2005-04-14 20:24 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Petr Baudis, Linus Torvalds, git

Hi Junio,

I think if the merge tree belong to plumbing, you can do
even less in the merge.perl. You can just print out the
instruction for the upper level SCM what to to without
actually doing it yourself.

So you don't have to do touch anything in the tree.
That is the way I use in my previous python script.
You just print out some easy to modify  

e.g. in my python script it prints: (BTW, poor choice of print out name)

check out tree 253290af8b9ebc8565dd8de4cda24d0432a92b57
modify pre-process.c 7684c115a87e41a9226ce79478101c746cf22c34
3way-merge check.c dcb970cc1c5a83284dc5986abf07b6da76a8758c f77bfe119c19d928879091e0e3ee6debe3f1e1bf d315b43b025350d0107568a4d42cc2494d38621d

Your merge tree can do the smae.

Then the supper level SCM can easily follow instruction.
Save your effort and make no assumption what SCM module is.

Chris

On Thu, Apr 14, 2005 at 04:12:34PM -0700, Junio C Hamano wrote:
> >>>>> "PB" == Petr Baudis <pasky@ucw.cz> writes:
> 
> I think you are contradicting yourself for saying the above
> after agreeing with me that the script should just work on trees
> not commits.  My understanding is that the tools is just to
> merge two related trees relative to another ancestor tree,
> nothing more.  Especially, it should not care what is in the
> working directory---that is SCM person's business.
> 
> I am just trying to follow my understanding of what Linus
> wanted.  One of the guiding principle is to do as much things as
> in dircache without ever checking things out or touching working
> files unnecessarily.
> 
> PB> This will give the tool maximal flexibility.
> 
> I suspect it would force me to have a working directory
> populated with files, just to do a merge.
> 
> PB> I'm all for an -o, and I don't mind ,, - I just don't want it uselessly
> PB> long. I hope "git~merge~$$" was a joke... :-)
> 
> Which part do you object to?  PID part?  or tilde?  Would
> git~merge do, perhaps?  It probably would not matter to you
> because as an SCM you would always give an explicit --output
> parameter to the script anyway.
> 
> PB> By the way, what about indentation with tabs? If you have a
> PB> strong opinion about this, I don't insist - but if you
> PB> really don't mind/care either way, it'd be great to use tabs
> PB> as in the rest of the git code.
> 
> I do not have a strong opinion, but it is more trouble for me
> only because I am lazy and am used to the indentation my Emacs
> gives me.  I write code other than git, so changing Perl-mode
> indentation setting globally for all .pl files is not an option
> for me.  I'll see what I can do when I have time.
> 
> PB> Is there a fundamental reason why the directory cache
> PB> contains the ancestor instead of the destination branch?
> 
> Because you are thinking as an SCM person where there are
> distinction between tree-A and tree-B, two heads being merged.
> There is no "destination branch" nor "source branch" in what I
> am doing.  It is a merge of two equals derived from the same
> ancestor.
> 
> PB> I think the script actually does not fundamentally depend on it. My main
> PB> motivation is that the user can then trivially see what is he actually
> PB> going to commit to his destination branch, which would be bought for
> PB> free by that.
> 
> And again the user is *not* commiting to his "destination
> branch".  At the level I am working at, the merge result should
> be commited with two -p parameters to commit-tree --- tree-A and
> tree-B, both being equal parents from the POV of git object
> storage.
> 
> PB> And this is another thing I dislike a lot. I'd like merge-tree.pl to
> PB> leave my directory cache alone, thank you very much. You know, I see
> PB> what goes to the directory cache as actually part of the policy part.
> 
> Remember I am not touching *your* dircache.  It is a dircache in
> the temporary merge area, specifically set up to help you review
> the merge.  
> 
> Can't the SCM driver do things along this line, perhaps?
> 
>  - You have your working files and your dircache.  They may not
>    match because you have uncommitted changes to your
>    environment.  You want to merge with Linus head.  You know
>    its SHA1 (call it COMMIT-Linus).  Your SCM knows which commit
>    you started with (call it COMMIT-Current).
> 
>  - First you merge the tree associated with COMMIT-Current.  Use
>    it and COMMIT-Linus to find the common ancestor to use.
> 
>  - Now use the tree SHA of COMMIT-Current, tree SHA1 of
>    COMMIT-Linus, and tree SHA1 of the common ancestor commit to
>    drive git-merge.perl (to be renamed ;-).  You will get a
>    temporary directory.  Have your user examine what is in
>    there, and fix the merge and have them tell you they are
>    happy.
> 
>  - You go to that temporary directory, do write-tree and
>    commit-tree with -p parameter of COMMIT-Linus and
>    COMMIT-Current.  This will result in a new commit.  Call that
>    COMMIT-Merge.
> 
>  - You, as an SCM, should know what your user have done in the
>    working directory relative to COMMIT-Current.  Especially you
>    should know the set of paths involved in that change.  Go in
>    to the temporary area, checkout-cache those files if you have
>    not done so.  Apply the changes you have there.  Optionally
>    have the user examine the changes and have him confirm.  Lift
>    those files into the user's working directory.
> 
>  - Do your bookkeeping like "echo COMMIT-Merge >.git/Head", to
>    make the user's working files based on COMMIT-Merge, and run
>    read-tree using the COMMIT-Merge in the user's working
>    directory.  At this point, show-diff output should show what
>    the changes your user have had made if he had started working
>    based on COMMIT-Merge instead of starting from
>    COMMIT-Current.
> 
> I think the above would result in what SCM person would call
> "merge upstream/sidestream changes into my working directory".
> 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-15  0:58                           ` Junio C Hamano
@ 2005-04-14 22:30                             ` Christopher Li
  2005-04-15  7:43                               ` Junio C Hamano
  0 siblings, 1 reply; 81+ messages in thread
From: Christopher Li @ 2005-04-14 22:30 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Petr Baudis, Linus Torvalds, git

On Thu, Apr 14, 2005 at 05:58:25PM -0700, Junio C Hamano wrote:
> 
> I do like, however, the idea of separating the step of doing any
> checkout/merge etc. and actually doing them.  So the command set
> of parse-your-output needs to be defined.  Based on what I have
> done so far, it would consist of the following:
> 
>  - Result is this object $SHA1 with mode $mode at $path (takes
>    one of the trees); you can do update-cache --cacheinfo (if
>    you want to muck with dircache) or cat-file blob (if you want
>    to get the file) or both.

Is that SHA1 for tree or the file object? If it is tree it don't
need the $mode any more.  If it is file you might need to emit
entry for it's parent directory, including the modes of directory.

> 
>  - Result is to delete $path.
> 
>  - Result is a merge between object $SHA1-1 and $SHA1-2 with
>    mode $mode-1 or $mode-2 at $path.
>
> Would this be a good enough command set?

And of course error/command for the files that unable to perform
auto merge. including information of both revisions. That needs
to be defined as well.

Chris


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-14 19:35                     ` Petr Baudis
@ 2005-04-14 23:12                       ` Junio C Hamano
  2005-04-14 20:24                         ` Christopher Li
  2005-04-14 23:31                         ` Petr Baudis
  0 siblings, 2 replies; 81+ messages in thread
From: Junio C Hamano @ 2005-04-14 23:12 UTC (permalink / raw)
  To: Petr Baudis; +Cc: Linus Torvalds, git

>>>>> "PB" == Petr Baudis <pasky@ucw.cz> writes:

PB> What I would like your script to do is therefore just do the
PB> merge in a given already prepared (including built index)
PB> directory, with a passed base. The base should be determined
PB> by a separate tool (I already saw some patches); most future
PB> "science" will probably go to a clever selection of this
PB> base, anyway.

I think you are contradicting yourself for saying the above
after agreeing with me that the script should just work on trees
not commits.  My understanding is that the tools is just to
merge two related trees relative to another ancestor tree,
nothing more.  Especially, it should not care what is in the
working directory---that is SCM person's business.

I am just trying to follow my understanding of what Linus
wanted.  One of the guiding principle is to do as much things as
in dircache without ever checking things out or touching working
files unnecessarily.

PB> This will give the tool maximal flexibility.

I suspect it would force me to have a working directory
populated with files, just to do a merge.

PB> I'm all for an -o, and I don't mind ,, - I just don't want it uselessly
PB> long. I hope "git~merge~$$" was a joke... :-)

Which part do you object to?  PID part?  or tilde?  Would
git~merge do, perhaps?  It probably would not matter to you
because as an SCM you would always give an explicit --output
parameter to the script anyway.

PB> By the way, what about indentation with tabs? If you have a
PB> strong opinion about this, I don't insist - but if you
PB> really don't mind/care either way, it'd be great to use tabs
PB> as in the rest of the git code.

I do not have a strong opinion, but it is more trouble for me
only because I am lazy and am used to the indentation my Emacs
gives me.  I write code other than git, so changing Perl-mode
indentation setting globally for all .pl files is not an option
for me.  I'll see what I can do when I have time.

PB> Is there a fundamental reason why the directory cache
PB> contains the ancestor instead of the destination branch?

Because you are thinking as an SCM person where there are
distinction between tree-A and tree-B, two heads being merged.
There is no "destination branch" nor "source branch" in what I
am doing.  It is a merge of two equals derived from the same
ancestor.

PB> I think the script actually does not fundamentally depend on it. My main
PB> motivation is that the user can then trivially see what is he actually
PB> going to commit to his destination branch, which would be bought for
PB> free by that.

And again the user is *not* commiting to his "destination
branch".  At the level I am working at, the merge result should
be commited with two -p parameters to commit-tree --- tree-A and
tree-B, both being equal parents from the POV of git object
storage.

PB> And this is another thing I dislike a lot. I'd like merge-tree.pl to
PB> leave my directory cache alone, thank you very much. You know, I see
PB> what goes to the directory cache as actually part of the policy part.

Remember I am not touching *your* dircache.  It is a dircache in
the temporary merge area, specifically set up to help you review
the merge.  

Can't the SCM driver do things along this line, perhaps?

 - You have your working files and your dircache.  They may not
   match because you have uncommitted changes to your
   environment.  You want to merge with Linus head.  You know
   its SHA1 (call it COMMIT-Linus).  Your SCM knows which commit
   you started with (call it COMMIT-Current).

 - First you merge the tree associated with COMMIT-Current.  Use
   it and COMMIT-Linus to find the common ancestor to use.

 - Now use the tree SHA of COMMIT-Current, tree SHA1 of
   COMMIT-Linus, and tree SHA1 of the common ancestor commit to
   drive git-merge.perl (to be renamed ;-).  You will get a
   temporary directory.  Have your user examine what is in
   there, and fix the merge and have them tell you they are
   happy.

 - You go to that temporary directory, do write-tree and
   commit-tree with -p parameter of COMMIT-Linus and
   COMMIT-Current.  This will result in a new commit.  Call that
   COMMIT-Merge.

 - You, as an SCM, should know what your user have done in the
   working directory relative to COMMIT-Current.  Especially you
   should know the set of paths involved in that change.  Go in
   to the temporary area, checkout-cache those files if you have
   not done so.  Apply the changes you have there.  Optionally
   have the user examine the changes and have him confirm.  Lift
   those files into the user's working directory.

 - Do your bookkeeping like "echo COMMIT-Merge >.git/Head", to
   make the user's working files based on COMMIT-Merge, and run
   read-tree using the COMMIT-Merge in the user's working
   directory.  At this point, show-diff output should show what
   the changes your user have had made if he had started working
   based on COMMIT-Merge instead of starting from
   COMMIT-Current.

I think the above would result in what SCM person would call
"merge upstream/sidestream changes into my working directory".


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-14 19:59                       ` Junio C Hamano
@ 2005-04-15  0:42                         ` Linus Torvalds
  2005-04-15  2:33                           ` Barry Silverman
  2005-04-15 10:02                           ` David Woodhouse
  0 siblings, 2 replies; 81+ messages in thread
From: Linus Torvalds @ 2005-04-15  0:42 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Petr Baudis, git



On Thu, 14 Apr 2005, Junio C Hamano wrote:
>
> You say "merge these two trees" above (I take it that you mean
> "merge these two trees, taking account of this tree as their
> common ancestor", so actually you are dealing with three trees),

Yes. We're definitely talking three trees.

> and I am tending to agree with the notion of merging trees not
> commits.  However you might get richer context and more sensible
> resulting merge if you say "merge these two commits".  Since
> commit chaining is part of the fundamental git object model you
> may as well use it.

Yes and no. There are real advantages to using the commit state to just 
figure out the trees, and then at least have the _option_ to do the merge 
at a pure tree object.

In particular, if you ever find yourself wanting to graft together two
different commit histories, that almost certainly is what you'd want to
do. Somebody might have arrived at the exact same tree some other way,
starting with a 2.6.12 tar.ball or something, and I think we should at
least support the notion of saying "these two totally unrelated commits
actually have the same base tree, so let's merge them in "space" (ie data)
even if we can't really sanely join them in "time" (ie "commits").

I dunno.

And it's also a question of sanity. The fact is, we know how to make tree 
merges unambiguous, by just totally ignoring the history between them. Ie 
we know how to merge data. I am pretty damn sure that _nobody_ knows how 
to merge "data over time". Maybe BK does. I'm pretty sure it actually 
takes the "over time" into account. But My goal is to get something that 
works, and something that is reliable because it is simple and it has 
simple rules.

As you say:

> This however opens up another set of can of worms---it would
> involve not just three trees but all the trees in the commit
> chain in between.

Exactly.  I seriously believe that the model is _broken_, simply because 
it gets too complicated. At some point it boils down to "keep it simple, 
stupid".

>  That's when you start wondering if it would
> be better to add renames in the git object model, which is the
> topic of another thread.  I have not formed an opinion on that
> one myself yet.

I've not even been convinved that renames are worth it. Nobody has really 
given a good reason why.

There are two reasons for renames I can think of:

 - space efficiency in delta-based trees. This is a total non-issue for 
   git, and trying to explicitly track renames is going to cause _more_
   space to be wasted rather than less.

 - "annotate". Something git doesn't really handle anyway, and it has 
   little to do with renames. You can fake an annotate, but let's face it, 
   it's _always_ going to be depending on interpreting a diff. In fact, 
   that ends up how traditional SCM's do it too - they don't really 
   annotate lines, they just interpret the diff.

   I think you might as well interpret the whole object thing. Git _does_ 
   tell you how the objects changed, and I actually believe that a diff 
   that works in between objects (ie can show "these lines moved from this
   file X to tjhat file Y") is a _hell_ of a lot more powerful than
   "rename"  is.

   So I'd seriously suggest that instead of worryign about renames, people 
   think about global diffs that aren't per-file. Git is good at limiting 
   the changes to a set of objects, and it should be entirely possible to 
   think of diffs as ways of moving lines _between_ objects and not just
   within objects. It's quite common to move a function from one file to 
   another - certainly more so than renaming the whole file.

   In other words, I really believe renames are just a meaningless special 
   case of a much more interesting problem. Which is just one reason why 
   I'm not at all interested in bothering with them other than as a "data 
   moved" thing, which git already handles very well indeed.

So there,

		Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-14 23:31                         ` Petr Baudis
@ 2005-04-15  0:58                           ` Junio C Hamano
  2005-04-14 22:30                             ` Christopher Li
  2005-04-15 10:22                           ` Junio C Hamano
  1 sibling, 1 reply; 81+ messages in thread
From: Junio C Hamano @ 2005-04-15  0:58 UTC (permalink / raw)
  To: Petr Baudis; +Cc: Linus Torvalds, git

>>>>> "PB" == Petr Baudis <pasky@ucw.cz> writes:
>> I think the above would result in what SCM person would call
>> "merge upstream/sidestream changes into my working directory".

PB> And that's exactly what I'm doing now with git merge. ;-) In fact,
PB> ideally the whole change in my scripts when your script is finished
PB> would be replacing

PB> 	checkout-cache `diff-tree` # symbolic
PB> 	git diff $base $merged | git apply

PB> with

PB> 	merge-tree.pl -b $base $(tree-id) $merged | parse-your-output

In the above I presume by $merged you mean the tree ID (or
commit ID) the user's working directory is based upon?  Well,
merge-trees (Linus has a single directory merge-tree already)
looks at tree IDs (or commit IDs); it would never involve
working files in random state that is not recorded as part of a
tree (committed or not).  Given that constraints I am not sure
how well that would pan out.  I have to think about this a bit.

I do like, however, the idea of separating the step of doing any
checkout/merge etc. and actually doing them.  So the command set
of parse-your-output needs to be defined.  Based on what I have
done so far, it would consist of the following:

 - Result is this object $SHA1 with mode $mode at $path (takes
   one of the trees); you can do update-cache --cacheinfo (if
   you want to muck with dircache) or cat-file blob (if you want
   to get the file) or both.

 - Result is to delete $path.

 - Result is a merge between object $SHA1-1 and $SHA1-2 with
   mode $mode-1 or $mode-2 at $path.

Would this be a good enough command set?

PB> Doesn't Emacs have something equivalent to ./.vimrc? I've also seen
PB> those funny -*- strings.

The former is global per user (that is me including other Perl
files I work outside of git context), which is exactly what I
said is unacceptable to me.  The latter is per file (applying to
everybody else who touch the file), so if it is short and sweet
I should use one.


^ permalink raw reply	[flat|nested] 81+ messages in thread

* RE: Merge with git-pasky II.
  2005-04-15  0:42                         ` Linus Torvalds
@ 2005-04-15  2:33                           ` Barry Silverman
  2005-04-15 10:02                           ` David Woodhouse
  1 sibling, 0 replies; 81+ messages in thread
From: Barry Silverman @ 2005-04-15  2:33 UTC (permalink / raw)
  To: 'Linus Torvalds', 'Junio C Hamano'
  Cc: 'Petr Baudis', git

>>In particular, if you ever find yourself wanting to graft together two
>>different commit histories, that almost certainly is what you'd want
to >>do. Somebody might have arrived at the exact same tree some other
way, >>starting with a 2.6.12 tar.ball or something, and I think we
should at >>least support the notion of saying "these two totally
unrelated commits >>actually have the same base tree, so let's merge
them in "space" (ie data) >>even if we can't really sanely join them in
"time" (ie "commits").

If this is true - then the tree-id's of the two commits would be
identical, but the commit-id's wouldn't.

Does this imply that common ancestor lookup should work by comparing the
tree-id's (space-wise the same) rather than the commit-ids (time-wise
the same)?

-----Original Message-----
From: git-owner@vger.kernel.org [mailto:git-owner@vger.kernel.org] On
Behalf Of Linus Torvalds
Sent: Thursday, April 14, 2005 8:43 PM
To: Junio C Hamano
Cc: Petr Baudis; git@vger.kernel.org
Subject: Re: Merge with git-pasky II.



On Thu, 14 Apr 2005, Junio C Hamano wrote:
>
> You say "merge these two trees" above (I take it that you mean
> "merge these two trees, taking account of this tree as their
> common ancestor", so actually you are dealing with three trees),

Yes. We're definitely talking three trees.

> and I am tending to agree with the notion of merging trees not
> commits.  However you might get richer context and more sensible
> resulting merge if you say "merge these two commits".  Since
> commit chaining is part of the fundamental git object model you
> may as well use it.

Yes and no. There are real advantages to using the commit state to just 
figure out the trees, and then at least have the _option_ to do the
merge 
at a pure tree object.

In particular, if you ever find yourself wanting to graft together two
different commit histories, that almost certainly is what you'd want to
do. Somebody might have arrived at the exact same tree some other way,
starting with a 2.6.12 tar.ball or something, and I think we should at
least support the notion of saying "these two totally unrelated commits
actually have the same base tree, so let's merge them in "space" (ie
data)
even if we can't really sanely join them in "time" (ie "commits").

I dunno.

And it's also a question of sanity. The fact is, we know how to make
tree 
merges unambiguous, by just totally ignoring the history between them.
Ie 
we know how to merge data. I am pretty damn sure that _nobody_ knows how

to merge "data over time". Maybe BK does. I'm pretty sure it actually 
takes the "over time" into account. But My goal is to get something that

works, and something that is reliable because it is simple and it has 
simple rules.

As you say:

> This however opens up another set of can of worms---it would
> involve not just three trees but all the trees in the commit
> chain in between.

Exactly.  I seriously believe that the model is _broken_, simply because

it gets too complicated. At some point it boils down to "keep it simple,

stupid".

>  That's when you start wondering if it would
> be better to add renames in the git object model, which is the
> topic of another thread.  I have not formed an opinion on that
> one myself yet.

I've not even been convinved that renames are worth it. Nobody has
really 
given a good reason why.

There are two reasons for renames I can think of:

 - space efficiency in delta-based trees. This is a total non-issue for 
   git, and trying to explicitly track renames is going to cause _more_
   space to be wasted rather than less.

 - "annotate". Something git doesn't really handle anyway, and it has 
   little to do with renames. You can fake an annotate, but let's face
it, 
   it's _always_ going to be depending on interpreting a diff. In fact, 
   that ends up how traditional SCM's do it too - they don't really 
   annotate lines, they just interpret the diff.

   I think you might as well interpret the whole object thing. Git
_does_ 
   tell you how the objects changed, and I actually believe that a diff 
   that works in between objects (ie can show "these lines moved from
this
   file X to tjhat file Y") is a _hell_ of a lot more powerful than
   "rename"  is.

   So I'd seriously suggest that instead of worryign about renames,
people 
   think about global diffs that aren't per-file. Git is good at
limiting 
   the changes to a set of objects, and it should be entirely possible
to 
   think of diffs as ways of moving lines _between_ objects and not just
   within objects. It's quite common to move a function from one file to

   another - certainly more so than renaming the whole file.

   In other words, I really believe renames are just a meaningless
special 
   case of a much more interesting problem. Which is just one reason why

   I'm not at all interested in bothering with them other than as a
"data 
   moved" thing, which git already handles very well indeed.

So there,

		Linus
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-15  7:43                               ` Junio C Hamano
@ 2005-04-15  6:28                                 ` Christopher Li
  2005-04-15 11:11                                   ` Junio C Hamano
  0 siblings, 1 reply; 81+ messages in thread
From: Christopher Li @ 2005-04-15  6:28 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Petr Baudis, Linus Torvalds, git

On Fri, Apr 15, 2005 at 12:43:47AM -0700, Junio C Hamano wrote:
> >>>>> "CL" == Christopher Li <git@chrisli.org> writes:
> 
> CL> Is that SHA1 for tree or the file object?
> 
> I am talking about a single file here.
>
Then do you emit the entry for it's parents directory?

e.g. /foo/bar get created. foo doesn't exists. You have
to create foo first. You don't have mode information for
foo yet. If it give the top level tree, the SCM can check it
out by tree. hopefully have the mode on directory correctly.
Well, if they care about those little details.

Chris
 

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-14 22:30                             ` Christopher Li
@ 2005-04-15  7:43                               ` Junio C Hamano
  2005-04-15  6:28                                 ` Christopher Li
  0 siblings, 1 reply; 81+ messages in thread
From: Junio C Hamano @ 2005-04-15  7:43 UTC (permalink / raw)
  To: Christopher Li; +Cc: Petr Baudis, Linus Torvalds, git

>>>>> "CL" == Christopher Li <git@chrisli.org> writes:

>> - Result is this object $SHA1 with mode $mode at $path (takes
>> one of the trees); you can do update-cache --cacheinfo (if
>> you want to muck with dircache) or cat-file blob (if you want
>> to get the file) or both.

CL> Is that SHA1 for tree or the file object?

I am talking about a single file here.



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-14 18:36                     ` Linus Torvalds
  2005-04-14 19:59                       ` Junio C Hamano
@ 2005-04-15  9:14                       ` David Woodhouse
  2005-04-15  9:36                         ` Ingo Molnar
                                           ` (2 more replies)
  1 sibling, 3 replies; 81+ messages in thread
From: David Woodhouse @ 2005-04-15  9:14 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Petr Baudis, git

On Thu, 2005-04-14 at 11:36 -0700, Linus Torvalds wrote:
> And "merge these two trees" (which works on a _tree_ level)
> or "find the common commit" (which works on a _commit_ level)

I suspect that finding the common commit is actually a per-file thing;
it's not just something you do for the _commit_ graph, then use for
merging each file in the two branches you're trying to merge.

Consider a simple repository which contains two files A and B. We start
off with the first version of each ('A1B1'), and the owner of each file
takes a branch and modifies their own file. There is cross-pulling
between the two, and then each modifies the _other's_ file as well as
their own...

   (A1B2)--(A2B2)--(A2'B3)
    /  \   /            \
   /    \ /              \
 (A1B1)  X               (...)
   \    / \              /
    \  /   \            /
   (A2B1)--(A2B2)--(A3B2')

Now, we're trying to merge the two branches. It appears that the most
useful common ancestor to use for a three-way merge of file A is the
version from tree 'A2B1', while the most useful common ancestor for
merging file B is that in 'A1B2'.

(I think it's a coincidence that in my example the useful files 'A2' and
'B2' actually do end up in a single tree together at some point.)

-- 
dwmw2



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-15  9:14                       ` David Woodhouse
@ 2005-04-15  9:36                         ` Ingo Molnar
  2005-04-15 10:05                           ` David Woodhouse
  2005-04-15 12:03                         ` Johannes Schindelin
  2005-04-15 14:53                         ` Linus Torvalds
  2 siblings, 1 reply; 81+ messages in thread
From: Ingo Molnar @ 2005-04-15  9:36 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Linus Torvalds, Junio C Hamano, Petr Baudis, git


* David Woodhouse <dwmw2@infradead.org> wrote:

> Consider a simple repository which contains two files A and B. We 
> start off with the first version of each ('A1B1'), and the owner of 
> each file takes a branch and modifies their own file. There is 
> cross-pulling between the two, and then each modifies the _other's_ 
> file as well as their own...
> 
>    (A1B2)--(A2B2)--(A2'B3)
>     /  \   /            \
>    /    \ /              \
>  (A1B1)  X               (...)
>    \    / \              /
>     \  /   \            /
>    (A2B1)--(A2B2)--(A3B2')
> 
> Now, we're trying to merge the two branches. It appears that the most 
> useful common ancestor to use for a three-way merge of file A is the 
> version from tree 'A2B1', while the most useful common ancestor for 
> merging file B is that in 'A1B2'.

do such cases occur frequently? In the kernel at least it's not too 
typical. Would it be a problem to go for the simple solution of using 
(A1B1) as the common ancestor (based on the tree graph), and then to do 
a 3-way merge of all changes from that point on?

	Ingo

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-15  0:42                         ` Linus Torvalds
  2005-04-15  2:33                           ` Barry Silverman
@ 2005-04-15 10:02                           ` David Woodhouse
  2005-04-15 15:32                             ` Linus Torvalds
  1 sibling, 1 reply; 81+ messages in thread
From: David Woodhouse @ 2005-04-15 10:02 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Petr Baudis, git

On Thu, 2005-04-14 at 17:42 -0700, Linus Torvalds wrote:
> I've not even been convinved that renames are worth it. Nobody has
> really given a good reason why.
> 
> There are two reasons for renames I can think of:
> 
>  - space efficiency in delta-based trees.
>  - "annotate".

Neither of those were my motivation for looking at renames. The reasons
I wanted to track renames were:
   - Per-file revision history which doesn't stop dead at a rename.
   - Merging where files have been renamed in one branch and modified in
     another. Which is basically a special case of the above; we need to
     see the per-file revision history.

>    So I'd seriously suggest that instead of worryign about renames, people 
>    think about global diffs that aren't per-file. Git is good at limiting 
>    the changes to a set of objects, and it should be entirely possible to 
>    think of diffs as ways of moving lines _between_ objects and not just
>    within objects. It's quite common to move a function from one file to 
>    another - certainly more so than renaming the whole file.
>
>    In other words, I really believe renames are just a meaningless special 
>    case of a much more interesting problem. Which is just one reason why 
>    I'm not at all interested in bothering with them other than as a "data 
>    moved" thing, which git already handles very well indeed.

Git doesn't handle 'data moved' except at a whole-tree level. For each
commit, it says "these are the old trees; this is the new tree".

Git doesn't actually look hard into the contents of tree; certainly it
has no business looking at the contents of individual files; that is
something that the SCM or possibly only the user should do. The storage
of 'rename' information in the commit object is another kind of 'xattr'
storage which git would provides but not directly interpret.

And you're right; it shouldn't have to be for renames only. There's no
need for us to limit it to one "source" and one "destination"; the SCM
can use it to track content as it sees fit.

As I said, the main aim of this is to track revision history of given
content, for displaying to the user and for performing merges. So when a
file is split up, or a function is moved from it to another file, a
'rename' xattr can be included to mark that files 'foo' and 'bar' in the
new tree are both associated with file 'wibble' in the parent.

That's as much as we need to provide for content tracking, and it _does_
handle the general case as well as we should be attempting to. We don't
want to get into dealing with file contents ourselves; we just want to
store the hint for the SCM or the user that "your data went thataway".

-- 
dwmw2



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-15  9:36                         ` Ingo Molnar
@ 2005-04-15 10:05                           ` David Woodhouse
  2005-04-15 14:53                             ` Ingo Molnar
  0 siblings, 1 reply; 81+ messages in thread
From: David Woodhouse @ 2005-04-15 10:05 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Linus Torvalds, Junio C Hamano, Petr Baudis, git

On Fri, 2005-04-15 at 11:36 +0200, Ingo Molnar wrote:
> do such cases occur frequently? In the kernel at least it's not too 
> typical. 

Isn't it? I thought it was a fairly accurate representation of the
process "I make a whole bunch of changes to files I maintain, pulling
from Linus while occasionally asking him to pull from my tree. Sometimes
my files are changed by someone else in Linus' tree, and sometimes I
change files that I don't actually own.".

-- 
dwmw2



^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-15 12:03                         ` Johannes Schindelin
@ 2005-04-15 10:22                           ` Theodore Ts'o
  0 siblings, 0 replies; 81+ messages in thread
From: Theodore Ts'o @ 2005-04-15 10:22 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: David Woodhouse, Linus Torvalds, Junio C Hamano, Petr Baudis, git

On Fri, Apr 15, 2005 at 02:03:08PM +0200, Johannes Schindelin wrote:
> I disagree. In order to be trusted, this thing has to catch the following
> scenario:
> 
> Skywalker and Solo start from the same base. They commit quite a lot to
> their trees. In between, Skywalker commits a tree, where the function
> "kazoom()" has been added to the file "deathstar.c", but Solo also added
> this function, but to the file "moon.c". A file-based merge would have no
> problem merging each file, such that in the end, "kazoom()" is defined
> twice.
> 
> The same problems arise when one tries to merge line-wise, i.e. when for
> each line a (possibly different) merge-parent is sought.

Be careful.  There is a very big tradeoff between 100% perfections in
catching these sorts of errors, and usability.  There exists SCM's
where you are not allowed to do commit such merges until you do a test
compile, or run a regression test suite (that being the only way to
catch these sorts of problems when we merge two branches like this).  

BitKeeper never caught this sort of thing, and we trusted it.  In
practice it was also rarely a problem.

I'll also note that BitKeeper doesn't restrict you from doing a
committing a changeset when you have modified files that have yet to
be checked in to the tree.  Same issue; you can accidentally check in
changesets that in trees that won't build, but if we added this kind
of SCM-by-straightjacket philosophy it would decrease our productivity
and people would simply not use such an SCM, thus negating its
effectiveness.

						- Ted

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-14 23:31                         ` Petr Baudis
  2005-04-15  0:58                           ` Junio C Hamano
@ 2005-04-15 10:22                           ` Junio C Hamano
  2005-04-15 20:40                             ` Petr Baudis
  1 sibling, 1 reply; 81+ messages in thread
From: Junio C Hamano @ 2005-04-15 10:22 UTC (permalink / raw)
  To: Petr Baudis; +Cc: Linus Torvalds, git

After I re-read [*R1*], in which Linus talks about dircache,
especially this section:

 - The "current directory cache" describes some baseline. In particular,
   note the "some" part. It's not tied to any special baseline, and you
   can change your baseline any way you please.

   So it does NOT have to track any particular state in either the object 
   database _or_ in your actual current working tree. In fact, all real 
   interactions with "git" are really about updating this staging area one 
   way or the other: you might check out the state from it into your 
   working area (partially or fully), you can push your working area into 
   the staging area (again, partially or fully).

   And if you want to, you can write the thing that the staging area 
   represents as a "tree" into the object database, or you can merge a 
   tree from the object database into the staging area.

   In other words: the staging area aka "current directory cache" is 
   really how all interaction takes place. The object database never 
   interacts directly with your working directory contents. ALL 
   interactions go through the current directory cache.

I started to have more doubts on the approach of *not*
performing the merge in the dircache I set up specifically for
merging, which is the direction in which you are pushing if I
understand you correctly.  Maybe I completely misunderstand what
you want.  This message is long but I need a clear understanding
of what is expected to be useful to you, so please bear with me.

PB> 	merge-tree.pl -b $base $(tree-id) $merged | parse-your-output

Please help me understand this example you have given earlier.
Here is my understanding of your assumption when the above
pipeline takes place.  Correct me if I am mistaken.

 * The user is in a working directory $W.  It is controlled by
   git-tools and there are $W/.git/. directory and $W/.git/index
   dircache.

 * The dircache $W/.git/index started its life as a read-tree
   from some commit.  The git-tools is keeping track of which
   commit it is somewhere, presumably in $W/.git/ directory.
   Let's call it $C (commit).

 ? Question.  Is the $(tree-id) in your example the same as $C
   above?

 * The user have run [*1*] (see Footnote below) checkout-cache
   on $W/.git/index some time in the past and $W is full of
   working files.  Some of them may or may not have modified.
   There may be some additions or deletions.  So the contents of
   the working directory may not match the tree associated with
   $C.

 * The user may or may not have run [*1*] update-cache in $W.
   The contents of the dircache $W/.git/index may not match the
   tree associated with $C.

 ? Question.  Are you forbidding the user to run update-cache by
   hand, and keeping track of the changes yourself, to be
   applied all at once at "git commit" time, thereby
   guaranteeing the $W/.git/index to match the tree associated
   with $C all times?  From the description of The "GIT toolkit"
   section in README, it is not clear to me which part of his
   repository an end user is not supposed to muck with himself.

 * Now the user has some changes in his working directory and
   notices upstream or a side branch has notable changes
   desireble to be picked up.  So he runs some git-tools command
   to cause the above quoted pipeline to run.

 ? Question.  Does $merged in your example mean such an upstream
   or side branch?  Is $base in your example the common ancestor
   between $C and $merged?

Assuming that my above understanding of your model is correct,
here are my "thinking aloud".

 - "merge-trees $base $C $merged" looks only at the git object
   database for those three trees named.  The data structure of
   git object database is optimized to distinguish differences
   in those recorded trees (and hence recorded blobs they point
   at) without unpacking most of the files if the changes are
   small, because all the blobs involved are already hashed.  It
   is not very good at comparing things in git object store and
   working files in random states, which would involve unpacking
   blobs and comparing, so "merge-trees" does not bother.

 - What can come out from merge-trees is therefore one of the
   following for each path from the union of paths contained in
   $base, $C, and $merged:

   (a) Neither $C nor $merged changed it --- merge result is what
       is in $C.

   (b) $C changed it but $merged did not --- merge result is what
       is in $C.

   (c) Both $C and $merged changed it in the same way --- merge
       result is what is in $C.

   (d) $C did not change it but $merged did --- merge result is
       what is in $merged.

   (e) Both $C and $merged changed it differently --- merge is
       needed and automatically succeeds between $C and $merge.

   (f) Both $C and $merged changed it differently --- merge is
       needed but have conflicts.

 - Assuming we are dealing with the case where working files are
   dirty and do not match what is in $C, among the above,
   (a)-(c) can be ignored by SCM.  What the user has in his
   working files is exactly what he would have got if he started
   working from the merge result, although in reality the work
   was started from $C.

   Handling (d), (e) and (f) from SCM's point of view would be
   the same.  They all involve 3-way merges between the file in
   the working directory, and the file from $merged, pivoting on
   the file from $base.  In order to help SCM, merge-trees
   therefore should output SHA1 of blobs for such a file from
   $base and $merged and expect SCM to run "cat-file blob" on
   them and then merge or diff3.  Up to the point of giving
   those two SHA1 out is the business of merge-trees and after
   that it is up to SCM.

   That would work.  So I should base the design of output from
   merge-trees on the above analysis, which probably needs to be
   extended to cover differences between creation, modification,
   and deletion.

 - However, the above is quite different from the way Linus
   envisioned initially, on which my current implementation is
   based [*3*].

   My current implementation is to record the merge outcome in
   the temporary dircache $W/,,merge/.git/index for cases
   (a)-(e).  The last case (f) is problematic and needs human
   validation [*2*], so it is not recorded in that temporary
   dircache, but the files to be merged are left in that
   temporary directory and merge-trees stops there.  It is
   expected that the end-user or SCM would merge the resulting
   file and run update-cache to update $W/,,merge/.git/index.
   After that happens, $W/,,merge/.git/index has the tree
   representing the desired result of the merge.  It is expected
   that the end-user or SCM would write-tree, commit-tree there
   in the temporary directory, creating a new commit $C1.

   Then, it is expected that the SCM would make a patch file
   between $C and the user working directory, checks out $C1
   (either in the user's working directory or another temporary
   directory; at this point merge-trees does not care because it
   has already done its job and exited), applies that patch to
   bring the user edits over to $C1.  Then that directory would
   contain the desired merge of user edits.

   That is my understanding of how Linus originally wanted the
   tool to do his kernel work with to work.  My hesitation to
   suggestions from you to change it not to keep its own merge
   dircache is coming from here.  Not doing what I am currently
   doing to $W/,,merge/.git/index dircache would mean that SCM
   would have to do more, not less, to arrive at $C1 (the result
   of the clean $merge and $C merge pivoted at $base), where the
   real SCM merge begins.

Although I suspect I am misunderstanding what you want, your
messages so far suggest that what you want might be quite
different from what Linus wants.  Please do not misunderstand
what I mean by saying this.  I am not saying that Linus is
always right [*4*] and therefore you are wrong for wanting
something else.  It is just that, if what I started writing
needs to support both of those quite different needs, I need to
know what they are.  I think I understand what Linus wants well
enough [*5*], but I am not certain about yours.


[Footnotes]

*1* By "The user have run" I mean either the user directly used
the low-level plumbing command himself, or used git-tools to
cause such command to run.

*2* Strictly speaking, case (e) needs human validation as
well, because successful textual merge does not guarantee
sensible semantic merge.

*3* See [*R2*] for descriptions on the way Linus wanted merge
in git to happen.  Especially around "5) At this point you need
to MERGE" onwards.  The current implementation handles (or
attempts to handle) the `your working directory was fully
committed' case described there.

*4* According to Linus himself, he is always right ;-). [*R3*]

*5* I consider [*R1*] and [*R2*] essential read for anybody
wanting to understand merging operation in git object model (I
am saying this for others; not for Pasky --- it would be like
preaching to the choir ;-)).


[References]

*R1* <Pine.LNX.4.58.0504110928360.1267@ppc970.osdl.org>
http://marc.theaimsgroup.com/?i=%3CPine.LNX.4.58.0504110928360.1267%20()%20ppc970%20!%20osdl%20!%20org%3E

*R2* <Pine.LNX.4.58.0504121606580.4501@ppc970.osdl.org> 
http://marc.theaimsgroup.com/?i=%3CPine.LNX.4.58.0504121606580.4501%20()%20ppc970%20!%20osdl%20!%20org%3E

*R3*
http://www.uwsg.indiana.edu/hypermail/linux/kernel/0008.3/0555.html


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-15  6:28                                 ` Christopher Li
@ 2005-04-15 11:11                                   ` Junio C Hamano
  0 siblings, 0 replies; 81+ messages in thread
From: Junio C Hamano @ 2005-04-15 11:11 UTC (permalink / raw)
  To: Christopher Li, Linus Torvalds; +Cc: Petr Baudis, git

>>>>> "CL" == Christopher Li <git@chrisli.org> writes:

CL> Then do you emit the entry for it's parents directory?

In GIT object model, directory modes do not matter.  It is not
designed to record directories, and running "update-cache --add
foo" when foo is a directory fails.

The data model of GIT is that it associates file datablob to a
string called "pathname" that happen to contain slashes in them.
It is kinda wierd.  When you externalize it with checkout-cache,
these slashes are mapped to hierarchical UNIX filesystem paths,
relative to whereever you happened to run checkout-cache.  The
hierarchical "tree" representation in the GIT database was
started as just a space optimization thing.

CL> e.g. /foo/bar get created. foo doesn't exists. You have
CL> to create foo first. You don't have mode information for
CL> foo yet.

And you will never have that information, since it is not
recorded anywhere.  If I say you should have foo/bar (by the
way, no leading slashes are placed in the dircache either), and
if it so happens that you do not have foo yet, you'd better
create one without waiting to be told, because I will never tell
you to just create a directory.

By the way, Linus, while I was studying how the new hierarchical
trees are written out, I think I have found one small funny (I
would not call this a *bug*) there.  Here is an excerpt from
write-tree (around ll. 56; I am basing on pasky-0.4 so your line
numbers may have some offsets):

        sha1 = ce->sha1;
        mode = ntohl(ce->st_mode);

        /* Do we have _further_ subdirectories? */
        filename = pathname + baselen;
        dirname = strchr(filename, '/');
        if (dirname) {
                int subdir_written;

                subdir_written = write_tree(cachep + nr, maxentries - nr, pathname, dirname-pathname+1, subdir_sha1);
                nr += subdir_written;

                /* Now we need to write out the directory entry into this tree.. */
                mode |= S_IFDIR;
                pathlen = dirname - pathname;

                /* ..but the directory entry doesn't count towards the total count */
                nr--;
                sha1 = subdir_sha1;
        }

This code is going through a flat list of cache entries sorted
by pathnames.  The list is flat in the sense that the pathnames
are like "foo/bar" i.e. with slashes inside.  The if() statement
there, upon seeing "foo/bar", slurps all the entries in foo/
subhierarchy and writes into a separate tree, recursively, to
"represent" foo/.

Notice what mode the "tree" object gets in this case?  File mode
for foo/bar (or whatever happens to be sorted the first among
the stuff in dircache from foo/ directory) ORed with S_IFDIR.  I
think this is nonsense, and we should just store constant
S_IFDIR.

Another option, probably better from the SCM purist's POV, would
be to start recording directories in dircaches, so that people
can actually keep track of directory modes.  Does it matter? ---
I would say not.  GIT does not have to be tar or cpio.  


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-15  9:14                       ` David Woodhouse
  2005-04-15  9:36                         ` Ingo Molnar
@ 2005-04-15 12:03                         ` Johannes Schindelin
  2005-04-15 10:22                           ` Theodore Ts'o
  2005-04-15 14:53                         ` Linus Torvalds
  2 siblings, 1 reply; 81+ messages in thread
From: Johannes Schindelin @ 2005-04-15 12:03 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Linus Torvalds, Junio C Hamano, Petr Baudis, git

Hi,

On Fri, 15 Apr 2005, David Woodhouse wrote:

> On Thu, 2005-04-14 at 11:36 -0700, Linus Torvalds wrote:
> > And "merge these two trees" (which works on a _tree_ level)
> > or "find the common commit" (which works on a _commit_ level)
>
> I suspect that finding the common commit is actually a per-file thing;
> it's not just something you do for the _commit_ graph, then use for
> merging each file in the two branches you're trying to merge.

I disagree. In order to be trusted, this thing has to catch the following
scenario:

Skywalker and Solo start from the same base. They commit quite a lot to
their trees. In between, Skywalker commits a tree, where the function
"kazoom()" has been added to the file "deathstar.c", but Solo also added
this function, but to the file "moon.c". A file-based merge would have no
problem merging each file, such that in the end, "kazoom()" is defined
twice.

The same problems arise when one tries to merge line-wise, i.e. when for
each line a (possibly different) merge-parent is sought.

The concept here is a *transaction*: when going from one tree to the next
tree via a commit, a sort of integrity is maintained, which is breached
when only looking at files and commits.

Ciao,
Dscho


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-15  9:14                       ` David Woodhouse
  2005-04-15  9:36                         ` Ingo Molnar
  2005-04-15 12:03                         ` Johannes Schindelin
@ 2005-04-15 14:53                         ` Linus Torvalds
  2005-04-15 15:29                           ` David Woodhouse
  2005-04-15 15:54                           ` Paul Jackson
  2 siblings, 2 replies; 81+ messages in thread
From: Linus Torvalds @ 2005-04-15 14:53 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Junio C Hamano, Petr Baudis, git



On Fri, 15 Apr 2005, David Woodhouse wrote:
> 
> I suspect that finding the common commit is actually a per-file thing;
> it's not just something you do for the _commit_ graph, then use for
> merging each file in the two branches you're trying to merge.

I disagree.

Conceptually, you should never do _anything_ on a file level. Why? Because
individual files don't matter. You shouldn't merge two files cleanly just
because they look fine - they _depend_ on the other files in the archive, 
and that's quite fundamentally why per-file tracking is really wrong from 
a project standpoint.

So if you can't merge two files cleanly because the "project" history 
ended up being further back than the "file" history, then that's a _good_ 
thing. You don't know what the hell happened to the other files that this 
file depended on. Merging one file independently of the others is WRONG.

Also, I suspect that you'll find that if you do cross-merges, you'll 
basically always end up in:

> (I think it's a coincidence that in my example the useful files 'A2' and
> 'B2' actually do end up in a single tree together at some point.)

nope, I don't think that's coincidence. I think that's the normal case. 
Your file-based history is the one that can _incorrectly_ and 
coincidentally happen to have a single file at some point, but since that 
file doesn't stand alone, that's really not a fundamentally good reason to 
merge it.

Really, this "individual files matter" approach is a _disease_. They 
don't. Individual files DO NOT EXIST. Files always exist as part of the 
project, and the _only_ time you track a single file is when the project 
is a single file (and then that will be very very obvious in a git 
archive, thank you very much).

So the single-file mentality is a disease brought on by decades of _crap_. 
And by the fact that it ends up limiting the problem scope, so you can do 
certain things easier.

For example, just doing intra-file diffs is a lot _easier_ and less 
time-consuming than doing inter-file diffs. Bit it is _absolutely_ not 
better. In fact, it is clearly inferior to anybody who spends even five 
seconds thinking about it - yet we still do it, because of the historical 
(and INCORRECT) mindset that "files matter".

Files DO NOT matter. Never have. It's an implementation limitation to 
think they do. You'll screw yourself up, and when somebody comes up with a 
half-way efficient way to generate inter-fiel diffs, your architecture is 
totally and utterly unable to handle it.

I don't care what you do at an SCM level, and if the crud you put on top
of git wants to perpetuate mistakes of yesteryear, that's _your_ issue.  
But dammit, git is designed to do the right thing, and I will fight tooth
and nail against anybody who thinks individual files matter.

		Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-15 10:05                           ` David Woodhouse
@ 2005-04-15 14:53                             ` Ingo Molnar
  2005-04-15 15:09                               ` David Woodhouse
  0 siblings, 1 reply; 81+ messages in thread
From: Ingo Molnar @ 2005-04-15 14:53 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Linus Torvalds, Junio C Hamano, Petr Baudis, git


* David Woodhouse <dwmw2@infradead.org> wrote:

> On Fri, 2005-04-15 at 11:36 +0200, Ingo Molnar wrote:
> > do such cases occur frequently? In the kernel at least it's not too 
> > typical. 
> 
> Isn't it? I thought it was a fairly accurate representation of the 
> process "I make a whole bunch of changes to files I maintain, pulling 
> from Linus while occasionally asking him to pull from my tree. 
> Sometimes my files are changed by someone else in Linus' tree, and 
> sometimes I change files that I don't actually own.".

but the specific scenario you described would require _Linus'_ tree to 
be in limbo for a long time, and have uncommitted half-done edits. I.e.:

   (A1B2)--(A2B2)--(A2'B3)
    /  \   /            \
   /    \ /              \
 (A1B1)  X               (...)
   \    / \              /
    \  /   \            /
   (A2B1)--(A2B2)--(A3B2')

in the above scenario Linus' tree needs to 'cross' with a maintainer's 
tree.  (maintainer's tree wont cross with another maintainer's tree, as 
maintainer-to-maintainer merges rare.)

but for the scenario to occur, i think there needs to be a prolongued 
"limbo" period in Linus' tree for a 'crossing' to happen. But Linus' 
merges are typically almost atomic: they are done then they are pushed 
out. It's definitely not in the 'days, sometimes weeks' timescale as 
maintainer trees are.

so for the scenario to occur, a maintainer, from whom Linus has just 
pulled an update and Linus is merging the tree manually without 
comitting, has to pull a file from the earlier Linus tree, and then 
Linus has to modify that same file again. This does not seem to be a 
common scenario.

so i think to avoid the scenario, maintainers should not pull from each 
other - they should only pull/push to/from Linus' tree. Maybe this is an 
unacceptable limitation?

	Ingo

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-15 14:53                             ` Ingo Molnar
@ 2005-04-15 15:09                               ` David Woodhouse
  0 siblings, 0 replies; 81+ messages in thread
From: David Woodhouse @ 2005-04-15 15:09 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Linus Torvalds, Junio C Hamano, Petr Baudis, git

On Fri, 2005-04-15 at 16:53 +0200, Ingo Molnar wrote:
> but the specific scenario you described would require _Linus'_ tree to
> be in limbo for a long time, and have uncommitted half-done edits.
> I.e.:
> 
>    (A1B2)--(A2B2)--(A2'B3)
>     /  \   /            \
>    /    \ /              \
>  (A1B1)  X               (...)
>    \    / \              /
>     \  /   \            /
>    (A2B1)--(A2B2)--(A3B2')
> 
> in the above scenario Linus' tree needs to 'cross' with a maintainer's
> tree.  (maintainer's tree wont cross with another maintainer's tree,
> as maintainer-to-maintainer merges rare.)

Is that true? Consider (A2B1) to be a bugfixes-only tree which I make
available for Linus to pull from. I keep doing more experimental stuff
in my own private copy of the tree along the bottom branch, while Linus
_eventually_ responds to my pull request and moves on, stopping only to
add a 'static' to one of my new functions. I move on too but don't pull
from Linus again for a little while; the final merge happens when I _do_
pull again.

-- 
dwmw2


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-15 14:53                         ` Linus Torvalds
@ 2005-04-15 15:29                           ` David Woodhouse
  2005-04-15 15:51                             ` Linus Torvalds
  2005-04-15 15:54                           ` Paul Jackson
  1 sibling, 1 reply; 81+ messages in thread
From: David Woodhouse @ 2005-04-15 15:29 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Petr Baudis, git

On Fri, 2005-04-15 at 07:53 -0700, Linus Torvalds wrote:
> Files DO NOT matter. Never have. It's an implementation limitation to 
> think they do. You'll screw yourself up, and when somebody comes up with a 
> half-way efficient way to generate inter-fiel diffs, your architecture is 
> totally and utterly unable to handle it.
> 
> I don't care what you do at an SCM level, and if the crud you put on top
> of git wants to perpetuate mistakes of yesteryear, that's _your_ issue.  
> But dammit, git is designed to do the right thing, and I will fight tooth
> and nail against anybody who thinks individual files matter.

No, really: individual files _DO_ matter. There's a reason we split
stuff up into separate files, and if you look closely you'll find that
we don't just randomly put different functions into different files with
neither rhyme nor reason -- there's a pattern to it; usually some kind
of functional grouping.

And when I'm looking for the change that broke something, I can almost
always tell which file it's in and go looking in _that_ file. It's a
_whole_ lot easier to use the equivalent of 'bk revtool' than it is to
sift through all the unrelated commits in the whole tree. If that's an
implementation limitation, then it's an implementation limitation in my
_brain_ not just in my tools.

OK, in fact it shouldn't be 'show me the history of this file'; it's
often really 'show me the history of this function' which I want. But
that's fine. All I'm suggesting is that we should include the metadata
which says "content moved from file XXX to file YYY" along with the
commit objects.

I'm certainly not suggesting that we should implement jejb's idea of
explicit 'file revision history' objects -- the tree-based philosophy is
perfectly sane and sufficient. But we do _also_ need a little
information which allows us to track content as it moves around within
the tree, and the SCM has to have a sane way to filter out the noise
when we're looking for what broke. Yes, that's part of the SCM
functionality, and can live in an xattr-type field in the commit object
-- but it does need to be stored, and in practice I suspect it _will_ be
useful for merging too.

It's not about ditching the per-tree tracking and doing per-file
tracking instead. I agree that would be wrong. It's about storing enough
information to track what happened to given content as it moved around
within the tree.

-- 
dwmw2


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-15 10:02                           ` David Woodhouse
@ 2005-04-15 15:32                             ` Linus Torvalds
  2005-04-15 16:01                               ` David Woodhouse
                                                 ` (3 more replies)
  0 siblings, 4 replies; 81+ messages in thread
From: Linus Torvalds @ 2005-04-15 15:32 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Junio C Hamano, Petr Baudis, git



On Fri, 15 Apr 2005, David Woodhouse wrote:
> 
> And you're right; it shouldn't have to be for renames only. There's no
> need for us to limit it to one "source" and one "destination"; the SCM
> can use it to track content as it sees fit.

Listen to yourself, and think about the problem for a second.

First off, let's just posit that "files" do not matter. The only thing
that matters is how "content" moved in the tree. Ok? If I copy a function
from one fiel to another, the perfect SCM will notice that, and show it as
a diff that removes it from one file and adds it to another, and is
_still_ able to track authorship past the move. Agreed?

Now, you basically propose to put that information in the "commit" log, 
and that's certainly valid. You can have the commit log say "lines 50-89 
in file kernel/sched.c moved to lines 100-139 in kernel/timer.c", and then 
renames fall out of that as one very small special case.

You can even say "lines 50-89 in file kernel/sched.c copied to.." and 
allow data to be tracked past not just movement, but also duplication.

Do you agree that this is kind of what you'd want to aim for? That's a 
winning SCM concept.

How do you think the SCM _gets_ at this information? In particular, how 
are you proposing that we determine this, especially since 90% of all 
stuff comes in as patches etc? 

You propose that we spend time when generating the tree on doing so. I'm 
telling you that that is wrong, for several reasons:

 - you're ignoring different paths for the same data. For example, you 
   will make it impossible to merge two trees that have done exactly the 
   same thing, except one did it as a patch (create/delete) and one did it 
   using some other heuristic.

 - you're doing the work at the wrong point. Doing it _well_ is quite 
   expensive. So if you do it at commit time, you cannot _afford_ to do it 
   well, and you'll always fall back to doing an ass-backwards job that 
   doesn't really get you to the good state, and only gets you to a 
   not-very-interesting easy 1% of the solution (ie full file renames).

 - you're doing the work at the wrong point for _another_ reason. You're 
   freezing your (crappy) algorithm at tree creation time, and basically 
   making it pointless to ever create something better later, because even 
   if hardware and software improves, you've codified that "we have to
   have crappy information".

Now, look at my proposal: 

 - the actual information tracking tracks _nothing_ but information. You 
   have an SCM that tracks what changed at the only level that really 
   matters, namely the whole project. None of the information actually 
   makes any sense at all at a smaller granularity, since by definition, a
   "project" depends on the other files, or it wouldn't be a project, it
   would be _two_ projects or more.

 - When you're interested in the history of the information, you actually 
   track it, and you try to be _intelligent_ about it. You can actually do 
   a HELL of a lot better than whet you propose if you go the extra mile. 
   For example, let's say that you have a visualization tool that you can 
   use for finding out where a line of code came from. You start out at 
   some arbitrary point in the tree, and you drill down. That's how it 
   works, right?

   So how do you drill down? You simply go backwards in history for that 
   project, tracking when that file+line changed (a "file+line" thing is 
   actually a "sensible" tracking unit at this point, because it makes
   sense within the query you're doing - it's _not_ a sensible thing to
   track at "commit" time, but when you ask yourself "where did this line
   come from", that _question_ makes it sensible. Also note that "where 
   did this _file_ come from is not a sensible question, since the file 
   may have been the combination (or split) of several files, so there is
   no _answer_ to that question"

   So the question then becomes: "how can you reasonably _efficiently_
   find the history of one particular line", and in fact it turns out that 
   by asking the question that way, it's pretty obvious: now that you
   don't have to track the whole repository, you can always try to 
   minimize the thing you're looking for.

   So what you do is walk back the history, and look at the tree objects 
   (both sides when you hit a merge), eand see if that file ever changes. 
   That's actually a very efficient operation in GIT - it matches
   _exactly_ how git tracks things anyway. So it's not expensive at all.

   When that file changes, you need to look if that _line_ changed (and 
   here is where it comes down to usability: from a practical standpoint
   you probably don't care about a single line, you really _probably_ want
   to see changes around it too). So you diff the old state and the new 
   state, and you see if you can still find where you were. If you still 
   can, and the line (and a few lines around it) is still the same, you 
   just continue to drill down. So that's not the interesting case.

   So what happens when you found "ok, that area changed"? Your 
   visualization tool now shows it to the user, AND BECAUSE IT SEES THE 
   WHOLE TREE DIFF, it also shows where it probably came from. At _that_ 
   point, it is actually very trivial to use a modest amount of CPU time, 
   and look for probable sources within that diff. You can do it on modern 
   hardware in basically no time, so your visualization tool can actually 
   notice that
 
	"oops, that line didn't even exist in the previous version, BUT I
	 FOUND FIVE PLACES that matched almost perfectly in the same diff,
	 and here they are"

   and voila, your tool now very efficiently showed the programmer that
   the source of the line in question was actually that we had merged 5 
   copies of the same code in different archtiectures into one common
   helper function.

   And if you didn't find some source that matched, or if the old file was
   actually very similar around that line, and that line hadn't been
   "totally new"? That's the easy case again - you show the programmer the
   diff at that point in time, and you let him decide whether that diff 
   was what he was looking for, or whether he wants to continue to "zoom
   down" into the history.

The above tool is (a) fairly easy to write for git (if you can do 
visualization tools and (b) _exactly_ what I think most programmers 
actually want. Tell me I'm wrong. Honestly..

And notice? My clearly _superior_ algorithm never needed any rename
information at all. It would have been a total waste of time. It would
also have hidden the _real_ pattern, which was that a piece of code was
merged from several other matching pieces of code into one new helper
function. But if it _had_ been a pure rename, my superior tool would have
trivially found that _too_. So rename infomation really really doesn't
matter.

So I'm claiming that any SCM that tries to track renames is fundamentally
broken unless it does so for internal reasons (ie to allow efficient
deltas), exactly because renames do not matter. They don't help you, and 
they aren't what you were interested in _anyway_.

What matters is finding "where did this come from", and the git
architecture does that very well indeed - much better than anything else
out there. I outlined a simple algorithm that can be fairly trivially
coded up by somebody who really cares. Sure, pattern matching isn't
trivial, but you start out with just saying "let's find that exact line,
and two lines on each side", and then you start improving on that.

And that "where did this come from" decision should be done at _search_ 
time, not commit time. Because at that time it's not only trivial to do, 
but at that time you can _dynamically_ change your search criteria. For 
example, you can make the "match" algorithm be dependent on what you are 
looking at.

If it's C source code, it might want to ignore vairable names when it
searches for matching code. And if it's a OpenOffice document, you might
have some open-office-specific tools to do so. See? Also, the person doing 
the searches can say whether he is interested in that particular line (or 
even that particial _identifier_ on a line), or whether he wants to see 
the changes "around" that line.

All of which are very valid things to do, and all of which my world-view
supports very well indeed. And all of which your pitiful "files matter" 
world-view totally doesn't get at all.

In other words, I'm right. I'm always right, but sometimes I'm more right 
than other times. And dammit, when I say "files don't matter", I'm really 
really Right(tm).

Please stop this "track files" crap. Git tracks _exactly_ what matters, 
namely "collections of files". Nothing else is relevant, and even 
_thinking_ that it is relevant only limits your world-view. Notice how the 
notion of CVS "annotate" always inevitably ends up limiting how people use 
it. I think it's a totally useless piece of crap, and I've described 
something that I think is a million times more useful, and it all fell out 
_exactly_ because I'm not limiting my thinking to the wrong model of the 
world.

			Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-15 15:29                           ` David Woodhouse
@ 2005-04-15 15:51                             ` Linus Torvalds
  0 siblings, 0 replies; 81+ messages in thread
From: Linus Torvalds @ 2005-04-15 15:51 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Junio C Hamano, Petr Baudis, git



On Fri, 15 Apr 2005, David Woodhouse wrote:
> 
> And when I'm looking for the change that broke something, I can almost
> always tell which file it's in and go looking in _that_ file.

Read my email about finding "what changed" that I sent out a minute ago.

I claim that my algorithm for finding "what changed" handles your "single
file" case as a very small (and usually quite uninteresting) special case.

I claim (and if you just look at my proposal I think you'll agree) that I
can track single functions, and do it efficiently. WITHOUT adding any
meta-data at all.

The thing is, if the question is "I have this piece of code, and I want to 
see what changed", you fundamentally _can_ do that efficiently. That's 
really what git was designed for. It's the whole _point_ of having history 
in the first place. If git didn't care, it wouldn't have a back-pointer to 
the tree it came from, and we'd all be just merging pure trees.

But you mix that question up with "how do I save that information in the 
commit", which is a totally unnecessary mix-up, and which makes things 
MUCH more complicated, for absolutely zero gain.

In fact, because you mixed up those two issues, the problem now became so
complicated that you can no longer solve it, so you start doing hacks like
"the user has to tell us what he did" (aka "bk mv" or "svn rename"), and 
you start mentally to limit yourself to files, because you realize that 
you _have_ to limit your intractable problem to make it at all solvable.

And I'm telling you that your problem is STUPID. You made it stupid by 
thinking that every question about the source tree should be answered at 
commit time. Which just clearly isn't true!

If you just drop the tying-together, and accept that "what changed" is a
valid question _regardless_ of trying to track it at commit time, now your
whole world opens up. Birds sing, the sun is shining on you, and beautiful
scantily clad women (or men) dance around you. The world is suddenly a 
good place, just _filled_ with possibilities.

Suddenly you realize that if the question is just "what changed in this
piece of code" (and let's face it, that _is_ the question), you can track 
it afterwards. Trying to tie in "commit time" into the question was what 
made it hard. If you do _not_ due that (totally unnecessary) tie-in, the 
question suddenly becomes easy to answer, and several obvious and simple 
answers spring to mind pretty immediately.

> It's not about ditching the per-tree tracking and doing per-file
> tracking instead. I agree that would be wrong. It's about storing enough
> information to track what happened to given content as it moved around
> within the tree.

No. Git absolutely does have everything you need already. You just aren't 
realizing that it's already there - in the data - and that you can do much 
more intelligent searches for changes if you accept that undeniable fact.

The fact that you can NOT do those searches at commit-time (which is a 
global op), and can only do them if you have a specific question in mind 
("what changed _here_"), is the big issue.

The thing is, at commit-time you'd need to answer every possible question
("what changed here, and here, and here, and in this function, and in this
file, and in this directory and why did this identifier get renamed and 
why is the sky blue"). AND YOU FUNDAMENTALLY CANNOT DO THAT. It's 
impossible.

But once you _know_ the question (which is the only time when the answer
is actually relevant, so why care about if before that time?), you can
find out the answer by just automating the job of looking at the _data_. 
It's easy. The question makes it obvious by its nature. The question is 
the thing that gives you the specifics that makes the search possible in 
the first place.

And _this_ is why the data matters. Renames and file boundaries do not.
And until you accept that, you just limit yourself.

		Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-15 14:53                         ` Linus Torvalds
  2005-04-15 15:29                           ` David Woodhouse
@ 2005-04-15 15:54                           ` Paul Jackson
  2005-04-15 16:30                             ` C. Scott Ananian
  1 sibling, 1 reply; 81+ messages in thread
From: Paul Jackson @ 2005-04-15 15:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: dwmw2, junkio, pasky, git

Linus wrote:
> For example, just doing intra-file diffs is a lot _easier_ and less 
> time-consuming than doing inter-file diffs. 

Um ah ... could you explain what you mean by inter and intra file diffs?

Google found a three year old message by Andrew Morton, discussing
inter and intra file fragmentation on ext2/ext3 file systems and the
find_group_dir() routine.  I don't think that's what you had in mind ;).

When I run the 'diff' command, it usually between two files, not between
two parts of a file.  So I'd have thought inter file diffs were easier.

Clearly, I don't git it.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-15 15:32                             ` Linus Torvalds
@ 2005-04-15 16:01                               ` David Woodhouse
  2005-04-15 16:31                                 ` C. Scott Ananian
  2005-04-16 15:33                                 ` Johannes Schindelin
  2005-04-15 19:20                               ` Paul Jackson
                                                 ` (2 subsequent siblings)
  3 siblings, 2 replies; 81+ messages in thread
From: David Woodhouse @ 2005-04-15 16:01 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Petr Baudis, git

On Fri, 2005-04-15 at 08:32 -0700, Linus Torvalds wrote:
>  - you're doing the work at the wrong point. Doing it _well_ is quite 
>    expensive. So if you do it at commit time, you cannot _afford_ to do it 
>    well, and you'll always fall back to doing an ass-backwards job that 
>    doesn't really get you to the good state, and only gets you to a 
>    not-very-interesting easy 1% of the solution (ie full file renames).
> 
>  - you're doing the work at the wrong point for _another_ reason. You're 
>    freezing your (crappy) algorithm at tree creation time, and basically 
>    making it pointless to ever create something better later, because even 
>    if hardware and software improves, you've codified that "we have to
>    have crappy information".

OK, I'm inclined to agree. The only thing that prevents me from
capitulating entirely and resubscribing to the "Torvalds is always
right" school is the concern that it _is_ expensive, and that's why I
originally wanted to do it at commit time because then it's a one-off
cost rather than recurring every time we want to track the history of a
given piece of content. Also because we actually have the developer's
attention at commit time, and we can get _real_ answers from the user
about what she was doing, instead of having to guess.

But if it can be done cheaply enough at a later date even though we end
up repeating ourselves, and if it can be done _well_ enough that we
shouldn't have just asked the user in the first place, then yes, OK I
agree.

-- 
dwmw2


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-15 15:54                           ` Paul Jackson
@ 2005-04-15 16:30                             ` C. Scott Ananian
  2005-04-15 18:29                               ` Paul Jackson
  0 siblings, 1 reply; 81+ messages in thread
From: C. Scott Ananian @ 2005-04-15 16:30 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Linus Torvalds, dwmw2, junkio, pasky, git

On Fri, 15 Apr 2005, Paul Jackson wrote:

> Um ah ... could you explain what you mean by inter and intra file diffs?

intra file diffs: here are two versions of the same file.  what changed? 
inter file diffs: here is a new file, and here are *all the files in the 
current committed version*.  Where did the contents of this new file come 
from?  (Note that the new file is often a slightly changed version of an 
existing file in the current committed version.  But we don't assume that 
must be true.)
  --scott

supercomputer Pakistan WSHOOFS SECANT LCPANGS SDI assassination ZPSECANT 
SEQUIN AEBARMAN ESCOBILLA bomb mustard STANDEL ESGAIN Nazi FJDEFLECT
                          ( http://cscott.net/ )

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-15 16:01                               ` David Woodhouse
@ 2005-04-15 16:31                                 ` C. Scott Ananian
  2005-04-15 17:11                                   ` Linus Torvalds
  2005-04-16 15:33                                 ` Johannes Schindelin
  1 sibling, 1 reply; 81+ messages in thread
From: C. Scott Ananian @ 2005-04-15 16:31 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Linus Torvalds, Junio C Hamano, Petr Baudis, git

On Fri, 15 Apr 2005, David Woodhouse wrote:

> given piece of content. Also because we actually have the developer's
> attention at commit time, and we can get _real_ answers from the user
> about what she was doing, instead of having to guess.

Yes, but it's still hard to get *accurate* information.  And developers 
tend to use very short commit messages already...

> But if it can be done cheaply enough at a later date even though we end
> up repeating ourselves, and if it can be done _well_ enough that we
> shouldn't have just asked the user in the first place, then yes, OK I
> agree.

I think examining the rsync algorithms should convince you that finding 
common chunks can be fairly efficient.  (See my next message for a more 
concrete proposal.)
  --scott

Rijndael AMLASH Moscow Ft. Bragg shotgun HTKEEPER SHERWOOD overthrow 
Uzi anthrax Yeltsin Indonesia Suharto LITEMPO Dictionary Yakima KUBARK
                          ( http://cscott.net/ )

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-15 16:31                                 ` C. Scott Ananian
@ 2005-04-15 17:11                                   ` Linus Torvalds
  0 siblings, 0 replies; 81+ messages in thread
From: Linus Torvalds @ 2005-04-15 17:11 UTC (permalink / raw)
  To: C. Scott Ananian; +Cc: David Woodhouse, Junio C Hamano, Petr Baudis, git



On Fri, 15 Apr 2005, C. Scott Ananian wrote:
> 
> I think examining the rsync algorithms should convince you that finding 
> common chunks can be fairly efficient.

Note that "efficient" really depends on how good a job you want to do, so 
you can tune it to how much CPU you can afford to waste on the problem.

For example, my example had this thing where we merged five different
functions into one function, and it is truly pretty efficient to find
things like that _IF_ we only look at the files that changed (since the
set of files that change in any one particular commit tends to be small,
relative to the whole repository). There are many good algorithms for 
finding "common code", and with modern hardware that is basically 
instantaneous if you look at a few tens of files.

For example, people wrote efficient things to compare _millions_ of lines 
of code for the whole SCO saga - you can do quite well. Some googling 
comes up with for example

	http://minnie.tuhs.org/Programs

and applying those to a smallish set of files is quite efficient.

What is _not_ necessarily as easy is the situation where you notice that a 
new set of lines appeared, but you don't see any place that matches that 
set of lines in the set of CHANGED files. That's actually quite common, ie 
let's say that you have a new filesystem or a new driver, and almost 
always it's based on a template or something, and you _would_ be able to 
see where that template came from, except it's not in that _changed_ set.

And that is still doable, but now you really have to compare against the
whole tree if you want to do it. Even _that_ is actually efficient if you
cache the hashes - that's how the comparison tools compare two totally
independent trees against each other, and it makes it practically possible
to do even that expensive O(n**2) operation in reasonable time. It's
certainly possible to do exactly the same thing for the "new code got
added, does it bear any similarity to old code" case.

Note! This is a question that is relevant and actually is in the realm of
the "possible to find the answer interactively".  It may fairly expensive, 
but the point is that this is the kind of relevant question that really 
does depend on the fundamental notion that "data matters more than any 
local changes". And when you think about the problem in that form, you 
find these kinds of interesting questions that you _can_ answer.

Because the way git identifies data, the example "is there any other
relevant code that may actually be similar to the newly added code" is
actually not that hard to do in git. Remember: the way to answer that
question is to have a cache of hashes of the contents. Guess what git
_is_? You can now index your line-based hashes of contents against the
_object_ hashes that git keeps track of, and you suddenly have an
efficient way to actually look up those hashes.

NOTE! All of this is outside the scope of git itself. This is all
"visualization and comparison tools" built up on top of git. And I'm not
at all interested in writing those tools myself, and I'm absolutely not
signing up for that part. All I'm arguing for is that the git architecture
is actually a very good architecture for doing these kinds of very very
cool tools.

			Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-15 16:30                             ` C. Scott Ananian
@ 2005-04-15 18:29                               ` Paul Jackson
  0 siblings, 0 replies; 81+ messages in thread
From: Paul Jackson @ 2005-04-15 18:29 UTC (permalink / raw)
  To: C. Scott Ananian; +Cc: torvalds, dwmw2, junkio, pasky, git

> intra file diffs: here are two versions of the same file. 

Ah so.  Linus faked me out.

I was _sure_ that by "file" he meant "file" -- as in a bucket of bits
with a unique identifying <sha1>.

In that message, I guess by "file" he meant "a version controlled
file, consisting of a series of content versions and meta-data"

That's what I get for trusting Linus to always speak as a kernel
hacker, not an SCM hacker.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-15 15:32                             ` Linus Torvalds
  2005-04-15 16:01                               ` David Woodhouse
@ 2005-04-15 19:20                               ` Paul Jackson
  2005-04-16  1:44                               ` Simon Fowler
  2005-04-16 20:29                               ` Sanjoy Mahajan
  3 siblings, 0 replies; 81+ messages in thread
From: Paul Jackson @ 2005-04-15 19:20 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: dwmw2, junkio, pasky, git

These notions that one can always best answer questions by looking at
the content, and that "Individual files DO NOT EXIST" seem over stated,
to me.

Granted, overstated for a good reason.  A couple sticks of dynamite are
needed to shake loose some old SCM thinking habits.

===

Ingo has a point when he states:

> i believe the fundamental thing to think about is not file or line or 
> namespace, but 'tracking developer intent'.

He too overstates - it's not _the_ (as in one and only) thing.
But it's useful.  Given the traditional terseness of many engineers,
it's certainly not the _only_ thing.  The code speaks too.

===

The above two are related in this way.  Traditional SCM uses per
file (versioned controlled file, as in s.* or *,v files) metadata
to track 'developer intent'.

I'm afraid we are at risk for confusing baby (developer intent)
and bathwater (version controlled file structure of classic SCM's).

===

But we already have a pretty damn good way of tracking developer
intent that needs to fit naturally with whatever we build on top
of git.

     Mr. McGuire: I just want to say one word to you - just one word.
     Ben: Yes sir.
     Mr. McGuire: Are you listening?
     Ben: Yes I am.
     Mr. McGuire: 'Patches.'
     # the original word was 'Plastics' - The Graduate (1967)

Andrew and the other maintainers do a pretty good job of 'encouraging'
developers to provide useful statements of 'intent' in their patch
headers.

The patch series in something like *-mm, including per-patch
commentary, are a valuable part of this project.

===

I have not looked closely at what is being done here, on top of
git, for SCM like capabilities.  Hopefully the next two questions
are not too stupid:

 1) How do we track the patch header commentary?

 2) Why can't we have a quilt like SCM, not bk/rcs/cvs/sccs/... like?

For (2), anyone publishing a Linux source would periodically announce an
<sha1> value, attached to some name suitable for public consumption.

For example, sometime in the next month or so, Linus would announce
that the <sha1> of 2.6.12 is so-and-so.  That would identify the
result of applying a specific set of patches, resulting in a specific
source tree contents.  He would announce a few 2.6.12-rc* <sha1>'s
between now and then.

Between now and then, Andrew would (if using these tools) have published
several <sha1> values, one each for various 2.6.12-rc*-mm* versions.

If you explode such a <sha1> all out into a working directory, you get
both the source contents in the appropriately named files, and the
quilt-style patches subdirectory, of the patch series that gets you
here, starting from some Time Zero for that series of published kernel
versions.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@engr.sgi.com> 1.650.933.1373, 1.925.600.0401

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-14  8:06         ` Linus Torvalds
  2005-04-14  8:39           ` Junio C Hamano
@ 2005-04-15 19:57           ` Junio C Hamano
  2005-04-15 20:45             ` Linus Torvalds
  1 sibling, 1 reply; 81+ messages in thread
From: Junio C Hamano @ 2005-04-15 19:57 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Petr Baudis, Christopher Li, git

>>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:

LT> In the meantime I wrote a very stupid "merge-tree" which
LT> does things slightly differently, but I really think your
LT> approach (aka my original approach) is actually a lot
LT> faster. I was just starting to worry that the ball didn't
LT> start, so I wrote an even hackier one.

LT> ... This "one directory at a time with very explicit output"
LT> thing is much more down-to-earth, but it's also likely
LT> slower because it will need script help more often.

I was looking at merge-tree.c last night to add recursive
behaviour (my favorite these days ;-) to it [*1*].

But then I started thinking.

LT> ... For each entry in the directory it says either
LT> 	select <mode> <sha1> path
LT> or
LT> 	merge <mode>-><mode>,<mode> <sha1>-><sha1>,<sha1> path
LT> depending on whether it could directly select the right object or not.

Given that the case you are primarily interested in is the one
that affects only small parts of a huge tree (i.e. common kernel
merge pattern I understand from your previous messages), your
"hacky version" [*2*], extended for recursive operation, would
spit out 98% select and 2% merge, and probably the origin of
these selects are distributed across ancestor=90%, his=4%,
my=4%, or something similar.  Am I misestimating grossly?

Assuming I am correct in the above, this would not scale for a
huge project.  We need to cut down the number of "90% select"
part of the output to make it manageable.

I am thinking about:

 - adding recursive behaviour (I am almost done with this);

 - adding another command line argument to merge-tree.c, to 
   tell "do not output anything for the path if the resulting
   merge is the same as what is in this tree";

 - adding another output type, "delete" to make the output type
   repertoire these three:

    delete path
    select <mode> <sha1> path
    merge <mode>-><mode>,<mode> <sha1>-><sha1>,<sha1> path

When the user of the output of 

  $ merge-tree <ancestor-sha1> <my-sha1> <his-sha1> <result-base-sha1>

want to get a dircache populated with the merged result, he can:

  1. read-tree <result-base-sha1>
  2. for each output:
     a) "delete" -- delete path from dircache
     b) "select" -- register mode-sha1 at path
     c) "merge"  -- do the 3-way merge and register result at path

Do you think this is sensible?

The reason I have the separate <result-base-sha1> instead of
always using <ancestor-sha1> is because the user may be thinking
of patching an existing base which is different from "my" or
"his" or "ancestor" and doing it in place.  That way, probably
Pasky's SCM can use it to patch the dircache it creates in its
own ,,merge/ directory, which would most likely be initially
populated from the dircache in the user's working directory---
which may or may not match "my-sha1" if the user has uncommitted
update-cache there.

Pasky, do you think this is workable?  If so do you think this
would make your life easier?


[Footnotes]

*1* That's how I found the S_IFDIR problem (not in your tree but
in the copy I had).

*2* I did not find it quite "hacky".  It was a pleasant read.
Especially I liked "smaller()" part.



^ permalink raw reply	[flat|nested] 81+ messages in thread

* RE: Merge with git-pasky II.
       [not found] <000d01c541ed$32241fd0$6400a8c0@gandalf>
@ 2005-04-15 20:31 ` Linus Torvalds
  2005-04-15 23:00   ` Barry Silverman
  0 siblings, 1 reply; 81+ messages in thread
From: Linus Torvalds @ 2005-04-15 20:31 UTC (permalink / raw)
  To: Barry Silverman; +Cc: git


[ I'm cc'ing the git list even though Barry's question wasn't cc'd. 
  Because I think his question is interesting and astute per se, even
  if I disagree with the proposal ]

On Fri, 15 Apr 2005, Barry Silverman wrote:
>
> If git is totally project based, and each commit represents total state
> of the project, then how important is the intermediate commit
> information between two states. 

You need it in order to do further merges.

> IE, Area maintainer has A1->A2->A3->A4->A5 in a repository with 5
> commits, and 5 comments. And I last synced with A1.
> 
> A few days later I sync again. Couldn't I just pull the "diff-tree A5
> A1" and then commit to my tree just the record A1->A5. Why does MY
> repository need trees A2,A3,A4?

Because that second merge needs the first merge to work well. The first 
merge might have had some small differences that ended up auto-merging (or 
even needing some manual help from you). The second time you sync, there 
migth be some more automatic merging. And so on.

If you don't keep track of the incremental merges, you end up with one 
really _difficult_ merge that may not be mergable at all. Not 
automatically, and perhaps not even with help.

So in order to keep things mergable, you need to not diverge. And the 
intermediate merges are the "anchor-points" for the next merge, keeping 
the divergences minimal. 

I'm personally convinced that one of the reasons CVS is a pain to merge is 
obviously that it doesn't do a good job of finding parents, but also 
exactly _because_ it makes merges so painful that people wait longer to do 
them, so you never end up fixing the simple stuff. In contrast, if you
have all these small merges going on all the time, the hope is that 
there's never any really painful nasty "final merge".

So you're right - the small merges do end up cluttering up the revision 
history. But it's a small price to pay if it means that you avoid having 
the painful ones.

> Isn't preserving the A1,A2,A3,A4,A5 a legacy of BK, which required all
> the changesets to be loaded in order, and so is a completely "file"
> driven concept? 

Nope. In fact, to some degree git will need this even _more_, since the
git merger is likely to be _weaker_ than BK, and thus more easily
confused.

I do believe that BK has these things for the same reason.

			Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-15 19:57           ` Junio C Hamano
@ 2005-04-15 20:45             ` Linus Torvalds
  0 siblings, 0 replies; 81+ messages in thread
From: Linus Torvalds @ 2005-04-15 20:45 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Petr Baudis, Christopher Li, git



On Fri, 15 Apr 2005, Junio C Hamano wrote:
> 
> I was looking at merge-tree.c last night to add recursive
> behaviour (my favorite these days ;-) to it [*1*].
> 
> But then I started thinking.

Always good.

> LT> ... For each entry in the directory it says either
> LT> 	select <mode> <sha1> path
> LT> or
> LT> 	merge <mode>-><mode>,<mode> <sha1>-><sha1>,<sha1> path
> LT> depending on whether it could directly select the right object or not.
> 
> Given that the case you are primarily interested in is the one
> that affects only small parts of a huge tree (i.e. common kernel
> merge pattern I understand from your previous messages), your
> "hacky version" [*2*], extended for recursive operation, would
> spit out 98% select and 2% merge, and probably the origin of
> these selects are distributed across ancestor=90%, his=4%,
> my=4%, or something similar.  Am I misestimating grossly?

No. That's _exactly_ right. You do not want a recursive merge-tree. 

The "diff-tree" thing is different, exactly because it prunes out all the 
differences early on.

> I am thinking about:
> 
>  - adding recursive behaviour (I am almost done with this);

I think your suggestion sounds perfectly reasonable.

		Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-15 20:40                             ` Petr Baudis
@ 2005-04-15 22:41                               ` Junio C Hamano
  0 siblings, 0 replies; 81+ messages in thread
From: Junio C Hamano @ 2005-04-15 22:41 UTC (permalink / raw)
  To: Petr Baudis; +Cc: Linus Torvalds, git

>>>>> "PB" == Petr Baudis <pasky@ucw.cz> writes:

PB> I can't see the conflicts between what I want and what Linus wants.
PB> After all, Linus says that I can use the directory cache in any way I
PB> please (well, the user can, but I'm speaking for him ;-). So I'm doing
PB> so, and with your tool I would get into problems, since it is suddenly
PB> imposing a policy on what should be in the index.

I think our misunderstanding is coming from the use of the word
"merge tree".  I think you have been assuming that I wanted you
to run "merge-trees -o ,,merge" --- which would certainly cause
me to muck with your dircache there.  I totally agree with you
that that is a *BAD* *THING*.  No question there.

However, my assumption has been different.  I was assuming that
you would run "merge-trees -o merge~tree" (i.e. different from
your "merge tree"), so that you can get the merge results in a
form parsable by you.  And then, using that information, you can
make your changes in ,,merge.  After you are done with that
information, you can remove "merge~trees", of course.

The format I chose for the "merge result in a form parsable by
you" happens to be a dircache in "merge~tree", with minimum
number of files checked out when merge cannot be automatically
done safely.  In the simplest case of not having any conflicting
merge between $C and $merged, Cogito can immediately run
write-tree in "merge~tree" (not ,,merge) to obtain its tree-ID
$T, so that it can feed it to diff-tree to compare it with
whatever tree state Cogito wants to apply the merges between $C
and $merged to.

I still do not understand what you do in ,,merge directory, but
here is one way you can update the user working directory
in-place without having a ,,merge directory [*2*].  You can run
your "git diff" between $C and $T [*1*].  The result is the diff
you need to apply on top of your user's working files.  If the
user does not like the result of running that diff, it can
easily be reversed.

If a manual merge were needed between $C and $merged, Cogito
could guide the user through that manual edit in "merge~tree",
and run update-cache on those hand merged files in "merge~tree",
before running write-tree in "merge~tree" to obtain $T; after
that, everything else is the same.

You make interesting points in other parts of your message I
need to regurgitate for a while, so I would not comment on them
in this message.

[Footnote]

*1* I really like the convenience of being able to use tree-ID
and commit-ID interchangeably there.  Thanks.

*2* I understand that this would change the user's "git-tools"
experience a bit.  The user will not be told to "go to ,,merge
and commit there which will reflected back to your working tree"
anymore.  Instead the merge happens in-place.  Committing, not
committing, or further hand-fixing the merge is up to the user.
I suspect this change might even be for the better.


^ permalink raw reply	[flat|nested] 81+ messages in thread

* RE: Merge with git-pasky II.
  2005-04-15 20:31 ` Linus Torvalds
@ 2005-04-15 23:00   ` Barry Silverman
  2005-04-16  0:32     ` Linus Torvalds
  0 siblings, 1 reply; 81+ messages in thread
From: Barry Silverman @ 2005-04-15 23:00 UTC (permalink / raw)
  To: 'Linus Torvalds'; +Cc: git

The issue I am trying to come to grips with in the current design, is
that the git repository of a number of interrelated projects will soon
become the logical OR of all blobs, commits, and trees in ALL the
projects. 

This will involve horrendous amounts of replication, as developers end
interchanging objects that originated from third parties who are not
party to the merge (and which happen to be in the repository because of
previous merge activity).

It would be really nice if the repositories for each project stay
distinct (and maybe even living on different servers). Merges should the
one of the few points of contact where the state be exchanged between
the repositories.

>> If you don't keep track of the incremental merges, you end up with
one 
>> really _difficult_ merge that may not be mergable at all. Not 
>> automatically, and perhaps not even with help.

No argument from me.... In my example, I didn't intend A2,A3,A4 to be
considered hash's of points where you did a merge - rather they are
hashes of individual points where you did a commit BETWEEN the points
where you merged. A1 and A5 are the "merge points"! 

It makes perfect sense to define some subset of commits as "merge-like"
commits, and only have those copied over from one repository to the
other. You could also only use merge-points in the common ancestor
calculation, and not worry about intermediate commits. 

Only small changes to the existing logic are necessary to do a merge by
"distributing" out the merge algorithm to each repository. This involves
querying each repository, and communicating the results, followed by
copying over only those blob objects necessary for the merge. 

After the merge, you would create a "merge-point commit" record that has
one of the parents pointing to a hash in the other repository!

But the BIG issue with this scheme, is that you will not be replicating
over any of the intermediate commits, trees, or blobs (not really needed
by the merge), but currently being traversed by various plumbing
components.

Hence my question....

-----Original Message-----
From: Linus Torvalds [mailto:torvalds@osdl.org] 
Sent: Friday, April 15, 2005 4:31 PM
To: Barry Silverman
Cc: git@vger.kernel.org
Subject: RE: Merge with git-pasky II.


[ I'm cc'ing the git list even though Barry's question wasn't cc'd. 
  Because I think his question is interesting and astute per se, even
  if I disagree with the proposal ]

On Fri, 15 Apr 2005, Barry Silverman wrote:
>
> If git is totally project based, and each commit represents total
state
> of the project, then how important is the intermediate commit
> information between two states. 

You need it in order to do further merges.

> IE, Area maintainer has A1->A2->A3->A4->A5 in a repository with 5
> commits, and 5 comments. And I last synced with A1.
> 
> A few days later I sync again. Couldn't I just pull the "diff-tree A5
> A1" and then commit to my tree just the record A1->A5. Why does MY
> repository need trees A2,A3,A4?

Because that second merge needs the first merge to work well. The first 
merge might have had some small differences that ended up auto-merging
(or 
even needing some manual help from you). The second time you sync, there

migth be some more automatic merging. And so on.

If you don't keep track of the incremental merges, you end up with one 
really _difficult_ merge that may not be mergable at all. Not 
automatically, and perhaps not even with help.

So in order to keep things mergable, you need to not diverge. And the 
intermediate merges are the "anchor-points" for the next merge, keeping 
the divergences minimal. 

I'm personally convinced that one of the reasons CVS is a pain to merge
is 
obviously that it doesn't do a good job of finding parents, but also 
exactly _because_ it makes merges so painful that people wait longer to
do 
them, so you never end up fixing the simple stuff. In contrast, if you
have all these small merges going on all the time, the hope is that 
there's never any really painful nasty "final merge".

So you're right - the small merges do end up cluttering up the revision 
history. But it's a small price to pay if it means that you avoid having

the painful ones.

> Isn't preserving the A1,A2,A3,A4,A5 a legacy of BK, which required all
> the changesets to be loaded in order, and so is a completely "file"
> driven concept? 

Nope. In fact, to some degree git will need this even _more_, since the
git merger is likely to be _weaker_ than BK, and thus more easily
confused.

I do believe that BK has these things for the same reason.

			Linus



^ permalink raw reply	[flat|nested] 81+ messages in thread

* RE: Merge with git-pasky II.
  2005-04-15 23:00   ` Barry Silverman
@ 2005-04-16  0:32     ` Linus Torvalds
  0 siblings, 0 replies; 81+ messages in thread
From: Linus Torvalds @ 2005-04-16  0:32 UTC (permalink / raw)
  To: Barry Silverman; +Cc: git



On Fri, 15 Apr 2005, Barry Silverman wrote:
>
> The issue I am trying to come to grips with in the current design, is
> that the git repository of a number of interrelated projects will soon
> become the logical OR of all blobs, commits, and trees in ALL the
> projects. 

Nope. I'm actually against the notion of sharing object directories
between projects. The git model _allows_ it, and I think it can be a valid 
approach, but it's absolutely not an approach that I would personally 
suggest be used for the kernel.

In other words, the default git behaviour - and the one I personally am a 
proponent of - is to have one object directory per work tree, and really 
consider all such trees independent. Then you can merge between trees if 
you want to, and bring in objects that way, but normally you would _not_ 
have tons of objects from other trees, and _especially_ not from other 
unrelated projects.

The reason git supports shared object archives is that (a) it falls out 
trivially as part of the design, so not allowing it is silly and (b) it is 
part of a merge, where you _do_ want to get the objects of the trees you 
merge, and in particular you need to generate a seperate tree that has all 
those objects without having to copy them.

(Before you do the merge, you need to bring the new objects into your 
repository of course, but that I consider to be a separate issue, not 
part of the actual technical merge process).

So normally, you'd probably have a totally pruned tree, with only the 
objects you need (and you might even consider the "commit parent links" 
less than necessary, especially if you're just a regular user and not a 
developer who wants to merge).

But the ability to have extra objects is wonderful. It makes going
backwards in time basically free (while the equivalent "bk undo" in the BK
world is a very expensive operation), and it makes it easy to
incrementally keep up-to-date with trees that you know you're _eventually_
going to merge with. But it's not an excuse to put just any random crap in 
that object directory..

		Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-15 15:32                             ` Linus Torvalds
  2005-04-15 16:01                               ` David Woodhouse
  2005-04-15 19:20                               ` Paul Jackson
@ 2005-04-16  1:44                               ` Simon Fowler
  2005-04-16 12:19                                 ` David Lang
  2005-04-16 20:29                               ` Sanjoy Mahajan
  3 siblings, 1 reply; 81+ messages in thread
From: Simon Fowler @ 2005-04-16  1:44 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: David Woodhouse, Junio C Hamano, Petr Baudis, git


[-- Attachment #1.1: Type: text/plain, Size: 1189 bytes --]

On Fri, Apr 15, 2005 at 08:32:46AM -0700, Linus Torvalds wrote:
> In other words, I'm right. I'm always right, but sometimes I'm more right 
> than other times. And dammit, when I say "files don't matter", I'm really 
> really Right(tm).
> 
You're right, of course (All Hail Linus!), if you can make it work
efficiently enough.

Just to put something else on the table, here's how I'd go about
tracking renames and the like, in another world where Linus /does/
make the odd mistake - it's basically a unique id for files in the
repository, added when the file is first recognised and updated when
update-cache adds a new version to the cache. Renames copy the id
across to the new name, and add it into the cache.

This gives you an O(n) way to tell what file was what across
renames, and it might even be useful in Linus' world, or if someone
wanted to build a traditional SCM on top of a git-a-like.

Attached is a patch, and a rename-file.c to use it.

Simon

-- 
PGP public key Id 0x144A991C, or http://himi.org/stuff/himi.asc
(crappy) Homepage: http://himi.org
doe #237 (see http://www.lemuria.org/DeCSS) 
My DeCSS mirror: ftp://himi.org/pub/mirrors/css/ 

[-- Attachment #1.2: guid2.patch --]
[-- Type: text/plain, Size: 19027 bytes --]

COPYING:  fe2a4177a760fd110e78788734f167bd633be8de
Makefile:  ca50293c4f211452d999b81f122e99babb9f2987
--- Makefile
+++ Makefile	2005-04-15 22:17:49.000000000 +1000
@@ -14,7 +14,7 @@
 
 PROG=   update-cache show-diff init-db write-tree read-tree commit-tree \
 	cat-file fsck-cache checkout-cache diff-tree rev-tree show-files \
-	check-files ls-tree
+	check-files ls-tree rename-file
 
 SCRIPT=	parent-id tree-id git gitXnormid.sh gitadd.sh gitaddremote.sh \
 	gitcommit.sh gitdiff-do gitdiff.sh gitlog.sh gitls.sh gitlsobj.sh \
@@ -73,6 +73,9 @@
 ls-tree: ls-tree.o read-cache.o
 	$(CC) $(CFLAGS) -o ls-tree ls-tree.o read-cache.o $(LIBS)
 
+rename-file: rename-file.o read-cache.o
+	$(CC) $(CFLAGS) -o rename-file rename-file.o read-cache.o $(LIBS)
+
 read-cache.o: cache.h
 show-diff.o: cache.h
 
README:  ded1a3b20e9bbe1f40e487ba5f9361719a1b6b85
VERSION:  c27bd67cd632cc15dd520fbfbf807d482efa2dcf
cache.h:  4d382549041d3281f8d44aa2e52f9f8ec47dd420
--- cache.h
+++ cache.h	2005-04-14 22:35:59.000000000 +1000
@@ -55,6 +55,7 @@
 	unsigned int st_gid;
 	unsigned int st_size;
 	unsigned char sha1[20];
+	unsigned char guid[20];
 	unsigned short namelen;
 	char name[0];
 };
cat-file.c:  45be1badaa8517d4e3a69e0bf1cac2e90191e475
check-files.c:  927b0b9aca742183fc8e7ccd73d73d8d5427e98f
checkout-cache.c:  f06871cdbc1b18ea93bdf4e17126aeb4cca1373e
commit-id:  65c81756c8f10d513d073ecbd741a3244663c4c9
commit-tree.c:  12196c79f31d004dff0df1f50dda67d8204f5568
diff-tree.c:  7dcc9eb7782fa176e27f1677b161ce78ac1d2070
--- diff-tree.c
+++ diff-tree.c	2005-04-16 10:46:52.000000000 +1000
@@ -1,33 +1,144 @@
+#include <sys/param.h>
 #include "cache.h"
 
-static int recursive = 0;
+enum diff_type {
+	REMOVE,
+	ADD,
+	RENAME,
+	MODIFY,
+};
+
+struct guid_cache_entry {
+	enum diff_type diff;
+	unsigned char guid[20];
+	unsigned char sha1[20];
+	struct guid_cache_entry *old;
+	unsigned int mode;
+	unsigned int pathlen;
+	unsigned char path[0];
+};	
+
+struct guid_cache {
+	unsigned int nr;
+	unsigned int alloc;
+	struct guid_cache_entry **cache;
+};
 
-static int diff_tree_sha1(const unsigned char *old, const unsigned char *new, const char *base);
+struct guid_cache guid_cache;
+struct guid_cache *cache = &guid_cache;
 
-static void update_tree_entry(void **bufp, unsigned long *sizep)
+int guid_cache_pos(const char *guid)
 {
-	void *buf = *bufp;
-	unsigned long size = *sizep;
-	int len = strlen(buf) + 1 + 20;
+	int first, last;
 
-	if (size < len)
-		die("corrupt tree file");
-	*bufp = buf + len;
-	*sizep = size - len;
+	first = 0;
+	last = cache->nr;
+	while (last > first) {
+		int next = (last + first) >> 1;
+		struct guid_cache_entry *gce = cache->cache[next];
+		int cmp = memcmp(guid, gce->guid, 20);
+		if (!cmp)
+			return next;
+		if (cmp < 0) {
+			last = next;
+			continue;
+		}
+		first = next + 1;
+	}
+	return - first-1;
 }
 
-static const unsigned char *extract(void *tree, unsigned long size, const char **pathp, unsigned int *modep)
+int add_guid_cache_entry(struct guid_cache_entry *gce)
+{
+	int pos;
+	
+	pos = guid_cache_pos(gce->guid);
+	
+	/* if this is a rename or modify, the guid will show up a
+	 * second time */
+	if (pos >= 0) {
+		struct guid_cache_entry *old = cache->cache[pos];
+		int cmp = cache_name_compare(old->path, old->pathlen, gce->path, gce->pathlen);
+
+		if (!cmp) {
+			/* pathname matches, so this must be a
+			 * modify. */
+			gce->old = old;
+			gce->diff = MODIFY;
+			cache->cache[pos] = gce;
+		} else {
+			/* the pathnames are different, so the file
+			 * must have been renamed somewhere along the
+			 * line.  */
+			gce->old = old;
+			gce->diff = RENAME;
+			cache->cache[pos] = gce;
+		}
+		return 0;
+	}
+	pos = -pos-1;
+
+	if (cache->nr == cache->alloc) {
+		cache->alloc = alloc_nr(cache->alloc);
+		cache->cache = realloc(cache->cache, cache->alloc * sizeof(struct guid_cache_entry *));
+	}
+
+	cache->nr++;
+	if (cache->nr > pos)
+		memmove(cache->cache + pos + 1, cache->cache + pos, (cache->nr - pos - 1) * sizeof(struct guid_cache_entry *));
+	cache->cache[pos] = gce;
+	return 0;
+}
+
+static const unsigned char *extract(void *tree, unsigned long size, const char **pathp, unsigned int *modep, const unsigned char **guid)
 {
 	int len = strlen(tree)+1;
 	const unsigned char *sha1 = tree + len;
 	const char *path = strchr(tree, ' ');
 
-	if (!path || size < len + 20 || sscanf(tree, "%o", modep) != 1)
+	if (!path || size < len + 40 || sscanf(tree, "%o", modep) != 1)
 		die("corrupt tree file");
 	*pathp = path+1;
+	*guid = tree + len + 20;
 	return sha1;
 }
 
+static void guid_cache_tree_entry(void *buf, unsigned int len, const char *base, enum diff_type diff)
+{
+	unsigned mode;
+	const char *path;
+	const unsigned char *guid;
+	const unsigned char *sha1 = extract(buf, len, &path, &mode, &guid);
+	struct guid_cache_entry *gce;
+	int baselen = strlen(base);
+	
+	gce = calloc(1, sizeof(struct guid_cache_entry) + baselen + strlen(path) + 1);
+	memcpy(gce->guid, guid, 20);
+	memcpy(gce->sha1, sha1, 20);
+	gce->diff = diff;
+	gce->mode = mode;
+	gce->pathlen = snprintf(gce->path, MAXPATHLEN, "%s%s", base, path);
+	gce->path[gce->pathlen + 1] = '\0';
+
+	add_guid_cache_entry(gce);
+}
+	
+static int recursive = 0;
+
+static int diff_tree_sha1(const unsigned char *old, const unsigned char *new, const char *base);
+
+static void update_tree_entry(void **bufp, unsigned long *sizep)
+{
+	void *buf = *bufp;
+	unsigned long size = *sizep;
+	int len = strlen(buf) + 1 + 40;
+
+	if (size < len)
+		die("corrupt tree file");
+	*bufp = buf + len;
+	*sizep = size - len;
+}
+
 static char *malloc_base(const char *base, const char *path, int pathlen)
 {
 	int baselen = strlen(base);
@@ -38,23 +149,24 @@
 	return newbase;
 }
 
-static void show_file(const char *prefix, void *tree, unsigned long size, const char *base);
+static void changed_file(void *tree, unsigned long size, const char *base, enum diff_type diff);
 
 /* A whole sub-tree went away or appeared */
-static void show_tree(const char *prefix, void *tree, unsigned long size, const char *base)
+static void changed_tree(void *tree, unsigned long size, const char *base, enum diff_type diff)
 {
 	while (size) {
-		show_file(prefix, tree, size, base);
+		changed_file(tree, size, base, diff);
 		update_tree_entry(&tree, &size);
 	}
 }
 
 /* A file entry went away or appeared */
-static void show_file(const char *prefix, void *tree, unsigned long size, const char *base)
+static void changed_file(void *tree, unsigned long size, const char *base, enum diff_type diff)
 {
 	unsigned mode;
 	const char *path;
-	const unsigned char *sha1 = extract(tree, size, &path, &mode);
+	const unsigned char *guid;
+	const unsigned char *sha1 = extract(tree, size, &path, &mode, &guid);
 
 	if (recursive && S_ISDIR(mode)) {
 		char type[20];
@@ -66,38 +178,96 @@
 		if (!tree || strcmp(type, "tree"))
 			die("corrupt tree sha %s", sha1_to_hex(sha1));
 
-		show_tree(prefix, tree, size, newbase);
+		changed_tree(tree, size, newbase, diff);
 		
 		free(tree);
 		free(newbase);
 		return;
 	}
 
-	printf("%s%o\t%s\t%s\t%s%s%c", prefix, mode,
-	       S_ISDIR(mode) ? "tree" : "blob",
-	       sha1_to_hex(sha1), base, path, 0);
+	guid_cache_tree_entry(tree, size, base, diff);
 }
 
+static void show_one_file(struct guid_cache_entry *gce)
+{
+	struct guid_cache_entry *old;
+	char old_sha1[50];
+	char old_sha2[50];
+
+	switch(gce->diff) {
+	case REMOVE:
+		sprintf(old_sha1, "%s", sha1_to_hex(gce->sha1));
+		printf("-%o\t%s\t%s\t%s\t%s%c", gce->mode,
+		       S_ISDIR(gce->mode) ? "tree" : "blob",
+		       old_sha1, sha1_to_hex(gce->guid), gce->path, 0);
+		break;
+	case ADD:
+		sprintf(old_sha1, "%s", sha1_to_hex(gce->sha1));
+		printf("+%o\t%s\t%s\t%s\t%s%c", gce->mode,
+		       S_ISDIR(gce->mode) ? "tree" : "blob",
+		       old_sha1, sha1_to_hex(gce->guid), gce->path, 0);
+		break;
+	case MODIFY:
+		old = gce->old;
+		if (old) {
+			sprintf(old_sha1, "%s", sha1_to_hex(old->sha1));
+			sprintf(old_sha2, "%s", sha1_to_hex(gce->sha1));
+			
+			printf("*%o->%o\t%s\t%s->%s\t%s\t%s%c", old->mode, gce->mode,
+			       S_ISDIR(old->mode) ? "tree" : "blob",
+			       old_sha1, old_sha2, sha1_to_hex(gce->guid), gce->path, 0);
+		} else {
+			die("diff-tree: internal error");
+		}
+		break;
+	case RENAME:
+		old = gce->old;
+		if (old) {
+			sprintf(old_sha1, "%s", sha1_to_hex(gce->sha1));
+			sprintf(old_sha2, "%s", sha1_to_hex(old->sha1));
+			
+			printf("r%o->%o\t%s\t%s->%s\t%s\t%s%c", gce->mode, old->mode,
+			       S_ISDIR(old->mode) ? "tree" : "blob",
+			       old_sha1, old_sha2, sha1_to_hex(old->guid), old->path, 0);
+		} else {
+			die("diff-tree: internal error");
+		}
+		break;
+	default:
+		die("diff-tree: internal error");
+	}
+}
+
+/* simply iterate over both caches looking for matching guids,
+ * showing all files in both caches */
+static void show_cache(void)
+{
+	int i;
+
+	for (i = 0; i < cache->nr; i++)
+		show_one_file(cache->cache[i]);
+}
+	
 static int compare_tree_entry(void *tree1, unsigned long size1, void *tree2, unsigned long size2, const char *base)
 {
 	unsigned mode1, mode2;
 	const char *path1, *path2;
 	const unsigned char *sha1, *sha2;
+	const unsigned char *guid1, *guid2;
 	int cmp, pathlen1, pathlen2;
-	char old_sha1_hex[50];
 
-	sha1 = extract(tree1, size1, &path1, &mode1);
-	sha2 = extract(tree2, size2, &path2, &mode2);
+	sha1 = extract(tree1, size1, &path1, &mode1, &guid1);
+	sha2 = extract(tree2, size2, &path2, &mode2, &guid2);
 
 	pathlen1 = strlen(path1);
 	pathlen2 = strlen(path2);
 	cmp = cache_name_compare(path1, pathlen1, path2, pathlen2);
 	if (cmp < 0) {
-		show_file("-", tree1, size1, base);
+		changed_file(tree1, size1, base, REMOVE);
 		return -1;
 	}
 	if (cmp > 0) {
-		show_file("+", tree2, size2, base);
+		changed_file(tree2, size2, base, ADD);
 		return 1;
 	}
 	if (!memcmp(sha1, sha2, 20) && mode1 == mode2)
@@ -108,8 +278,8 @@
 	 * file, we need to consider it a remove and an add.
 	 */
 	if (S_ISDIR(mode1) != S_ISDIR(mode2)) {
-		show_file("-", tree1, size1, base);
-		show_file("+", tree2, size2, base);
+		changed_file(tree1, size1, base, REMOVE);
+		changed_file(tree2, size2, base, ADD);
 		return 0;
 	}
 
@@ -121,10 +291,14 @@
 		return retval;
 	}
 
-	strcpy(old_sha1_hex, sha1_to_hex(sha1));
-	printf("*%o->%o\t%s\t%s->%s\t%s%s%c", mode1, mode2,
-	       S_ISDIR(mode1) ? "tree" : "blob",
-	       old_sha1_hex, sha1_to_hex(sha2), base, path1, 0);
+	if (!memcmp(guid1, guid2, 20)) {
+		changed_file(tree1, size1, base, MODIFY);
+		changed_file(tree2, size2, base, MODIFY);
+		return 0;
+	}
+	
+	changed_file(tree1, size1, base, REMOVE);
+	changed_file(tree2, size2, base, ADD);
 	return 0;
 }
 
@@ -132,12 +306,12 @@
 {
 	while (size1 | size2) {
 		if (!size1) {
-			show_file("+", tree2, size2, base);
+			changed_file(tree2, size2, base, ADD);
 			update_tree_entry(&tree2, &size2);
 			continue;
 		}
 		if (!size2) {
-			show_file("-", tree1, size1, base);
+			changed_file(tree1, size1, base, REMOVE);
 			update_tree_entry(&tree1, &size1);
 			continue;
 		}
@@ -179,6 +353,7 @@
 int main(int argc, char **argv)
 {
 	unsigned char old[20], new[20];
+	int retval;
 
 	while (argc > 3) {
 		char *arg = argv[1];
@@ -193,5 +368,7 @@
 
 	if (argc != 3 || get_sha1_hex(argv[1], old) || get_sha1_hex(argv[2], new))
 		usage("diff-tree <tree sha1> <tree sha1>");
-	return diff_tree_sha1(old, new, "");
+	retval =  diff_tree_sha1(old, new, "");
+	show_cache();
+	return retval;
 }
fsck-cache.c:  9c900fe458cecd2bdb4c4571a584115b5cf24f22
--- fsck-cache.c
+++ fsck-cache.c	2005-04-15 20:39:49.000000000 +1000
@@ -165,9 +165,10 @@
 	while (size) {
 		int len = 1+strlen(data);
 		unsigned char *file_sha1 = data + len;
+		unsigned char *guid = file_sha1 + 20;
 		char *path = strchr(data, ' ');
 		unsigned int mode;
-		if (size < len + 20 || !path || sscanf(data, "%o", &mode) != 1)
+		if (size < len + 40 || !path || sscanf(data, "%o", &mode) != 1)
 			return -1;
 
 		/* Warn about trees that don't do the recursive thing.. */
@@ -176,8 +177,8 @@
 			warn_old_tree = 0;
 		}
 
-		data += len + 20;
-		size -= len + 20;
+		data += len + 40;
+		size -= len + 40;
 		mark_needs_sha1(sha1, S_ISDIR(mode) ? "tree" : "blob", file_sha1);
 	}
 	return 0;
git:  2c557dcf2032325acc265b577ee104e605fdaede
gitXnormid.sh:  a5d7a9f4a6e8d4860f35f69500965c2a493d80de
gitadd.sh:  3ed93ea0fcb995673ba9ee1982e0e7abdbe35982
gitaddremote.sh:  bf1f28823da5b5270aa8fa05b321faa514a57a11
gitapply.sh:  d0e3c46e2ce1ee74e1a87ee6137955fa9b35c27b
gitcancel.sh:  ec58f7444a42cd3cbaae919fc68c70a3866420c0
gitcommit.sh:  3629f67bbd3f171d091552814908b67af7537f4d
gitdiff-do:  d6174abceab34d22010c36a8453a6c3f3f184fe0
gitdiff.sh:  5e47c4779d73c3f2f39f6be714c0145175933197
gitexport.sh:  dad00bf251b38ce522c593ea9631f842d8ccc934
gitlntree.sh:  17c4966ea64aeced96ae4f1b00f3775c1904b0f1
gitlog.sh:  177c6d12dd9fa4b4920b08451ffe4badde544a39
gitls.sh:  b6f15d82f16c1e9982c5031f3be22eb5430273af
gitlsobj.sh:  128461d3de6a42cfaaa989fc6401bebdfa885b3f
gitmerge.sh:  23e4a3ff342c6005928ceea598a2f52de6fb9817
gitpull.sh:  0883898dda579e3fa44944b7b1d909257f6dc63e
gitrm.sh:  5c18c38a890c9fd9ad2b866ee7b529539d2f3f8f
gittag.sh:  c8cb31385d5a9622e95a4e0b2d6a4198038a659c
gittrack.sh:  03d6db1fb3a70605ef249c632c04e542457f0808
init-db.c:  aa00fbb1b95624f6c30090a17354c9c08a6ac596
ls-tree.c:  3e2a6c7d183a42e41f1073dfec6794e8f8a5e75c
--- ls-tree.c
+++ ls-tree.c	2005-04-15 15:55:40.000000000 +1000
@@ -10,6 +10,7 @@
 	void *buffer;
 	unsigned long size;
 	char type[20];
+	char old_sha1[50];
 
 	buffer = read_sha1_file(sha1, type, &size);
 	if (!buffer)
@@ -19,19 +20,21 @@
 	while (size) {
 		int len = strlen(buffer)+1;
 		unsigned char *sha1 = buffer + len;
+		unsigned char *guid = buffer + len + 20;
 		char *path = strchr(buffer, ' ')+1;
 		unsigned int mode;
 		unsigned char *type;
 
-		if (size < len + 20 || sscanf(buffer, "%o", &mode) != 1)
+		if (size < len + 40 || sscanf(buffer, "%o", &mode) != 1)
 			die("corrupt 'tree' file");
-		buffer = sha1 + 20;
-		size -= len + 20;
+		buffer = sha1 + 40;
+		size -= len + 40;
 		/* XXX: We do some ugly mode heuristics here.
 		 * It seems not worth it to read each file just to get this
 		 * and the file size. -- pasky@ucw.cz */
 		type = S_ISDIR(mode) ? "tree" : "blob";
-		printf("%03o\t%s\t%s\t%s\n", mode, type, sha1_to_hex(sha1), path);
+		sprintf(old_sha1, sha1_to_hex(guid));
+		printf("%03o\t%s\t%s\t%s\t%s\n", mode, type, sha1_to_hex(sha1), old_sha1, path);
 	}
 	return 0;
 }
parent-id:  1801c6fe426592832e7250f8b760fb9d2e65220f
read-cache.c:  7a6ae8b9b489f6b67c82e065dedd5716a6bfc0ef
--- read-cache.c
+++ read-cache.c	2005-04-16 10:52:51.000000000 +1000
@@ -4,6 +4,8 @@
  * Copyright (C) Linus Torvalds, 2005
  */
 #include <stdarg.h>
+#include <time.h>
+#include <sys/param.h>
 #include "cache.h"
 
 const char *sha1_file_directory = NULL;
@@ -233,6 +235,22 @@
 	return 0;
 }
 
+void new_guid(const char *filename, int namelen, unsigned char *returnguid)
+{
+	size_t size;
+	time_t now = time(NULL);
+	char buf[MAXPATHLEN + 20];
+	unsigned char guid[20];
+	
+	size = snprintf(buf, MAXPATHLEN + 20, "%ld%s", now, filename) + 1;
+	
+	SHA1(buf, size, guid);
+
+	if (returnguid)
+		memcpy(returnguid, guid, 20);
+	return;
+}
+
 static inline int collision_check(char *filename, void *buf, unsigned int size)
 {
 #ifdef COLLISION_CHECK
@@ -363,11 +381,14 @@
 int add_cache_entry(struct cache_entry *ce, int ok_to_add)
 {
 	int pos;
+	unsigned char guid[20];
 
 	pos = cache_name_pos(ce->name, ce->namelen);
 
 	/* existing match? Just replace it */
 	if (pos >= 0) {
+		struct cache_entry *old_ce = active_cache[pos];
+		memcpy(ce->guid, old_ce->guid, 20);
 		active_cache[pos] = ce;
 		return 0;
 	}
@@ -376,6 +397,12 @@
 	if (!ok_to_add)
 		return -1;
 
+	memset(guid, 0, 20);
+	if (!memcmp(ce->guid, guid, 20)) {
+		new_guid(ce->name, ce->namelen, guid);
+		memcpy(ce->guid, guid, 20);
+	}
+
 	/* Make sure the array is big enough .. */
 	if (active_nr == active_alloc) {
 		active_alloc = alloc_nr(active_alloc);
read-tree.c:  eb548148aa6d212f05c2c622ffbe62a06cd072f9
--- read-tree.c
+++ read-tree.c	2005-04-16 10:41:46.000000000 +1000
@@ -5,7 +5,9 @@
  */
 #include "cache.h"
 
-static int read_one_entry(unsigned char *sha1, const char *base, int baselen, const char *pathname, unsigned mode)
+static int read_one_entry(unsigned char *sha1, unsigned char *guid, 
+			  const char *base, int baselen, 
+			  const char *pathname, unsigned mode)
 {
 	int len = strlen(pathname);
 	unsigned int size = cache_entry_size(baselen + len);
@@ -18,6 +20,7 @@
 	memcpy(ce->name, base, baselen);
 	memcpy(ce->name + baselen, pathname, len+1);
 	memcpy(ce->sha1, sha1, 20);
+	memcpy(ce->guid, guid, 20);
 	return add_cache_entry(ce, 1);
 }
 
@@ -35,14 +38,15 @@
 	while (size) {
 		int len = strlen(buffer)+1;
 		unsigned char *sha1 = buffer + len;
+		unsigned char *guid = buffer + len + 20;
 		char *path = strchr(buffer, ' ')+1;
 		unsigned int mode;
-
-		if (size < len + 20 || sscanf(buffer, "%o", &mode) != 1)
+		
+		if (size < len + 40 || sscanf(buffer, "%o", &mode) != 1)
 			return -1;
 
-		buffer = sha1 + 20;
-		size -= len + 20;
+		buffer = sha1 + 40;
+		size -= len + 40;
 
 		if (S_ISDIR(mode)) {
 			int retval;
@@ -57,7 +61,7 @@
 				return -1;
 			continue;
 		}
-		if (read_one_entry(sha1, base, baselen, path, mode) < 0)
+		if (read_one_entry(sha1, guid, base, baselen, path, mode) < 0)
 			return -1;
 	}
 	return 0;
rev-tree.c:  395b0b3bfadb0537ae0c62744b25ead4b487f3f6
show-diff.c:  a531ca4078525d1c8dcf84aae0bfa89fed6e5d96
show-files.c:  a9fa6767a418f870a34b39379f417bf37b17ee18
tree-id:  cb70e2c508a18107abe305633612ed702aa3ee4f
update-cache.c:  62d0a6c41560d40863c44599355af10d9e089312
write-tree.c:  1534477c91169ebddcf953e3f4d2872495477f6b
--- write-tree.c
+++ write-tree.c	2005-04-15 13:46:05.000000000 +1000
@@ -47,6 +47,7 @@
 		const char *pathname = ce->name, *filename, *dirname;
 		int pathlen = ce->namelen, entrylen;
 		unsigned char *sha1;
+		unsigned char *guid;
 		unsigned int mode;
 
 		/* Did we hit the end of the directory? Return how many we wrote */
@@ -54,6 +55,7 @@
 			break;
 
 		sha1 = ce->sha1;
+		guid = ce->guid;
 		mode = ce->st_mode;
 
 		/* Do we have _further_ subdirectories? */
@@ -86,6 +88,8 @@
 		buffer[offset++] = 0;
 		memcpy(buffer + offset, sha1, 20);
 		offset += 20;
+		memcpy(buffer + offset, guid, 20);
+		offset += 20;
 		nr++;
 	} while (nr < maxentries);
 

[-- Attachment #1.3: rename-file.c --]
[-- Type: text/x-csrc, Size: 1703 bytes --]

/*
 * rename files in a git repository, keeping the guid.
 * 
 * Copyright Simon Fowler <simon@dreamcraft.com.au>, 2005.
 */
#include <unistd.h>
#include <sys/stat.h>
#include <errno.h>
#include "cache.h"

static int remove_lock = 0;

static void remove_lock_file(void)
{
	if (remove_lock)
		unlink(".git/index.lock");
}

int main(int argc, char *argv[])
{
	struct stat stats;
	struct cache_entry *ce, *new;
	int newfd, entries, pos, pos2;

	if (argc != 3)
		usage("rename-file <old> <new>");
	if (stat(argv[1], &stats)) {
		perror("rename-file: ");
		exit(1);
	}
	if (!stat(argv[2], &stats))
		die("rename-file: destination file already exists");
	
	newfd = open(".git/index.lock", O_RDWR | O_CREAT | O_EXCL, 0600);
	if (newfd < 0)
		die("unable to create new cachefile");

	atexit(remove_lock_file);
	remove_lock = 1;

	entries = read_cache();
	if (entries < 0)
		die("cache corrupted");

	pos = cache_name_pos(argv[1], strlen(argv[1]));
	pos2 = cache_name_pos(argv[2], strlen(argv[2]));
		
	if (pos < 0) 
		die("original file not in cache");
	if (pos2 >= 0)
		die("destination file already in cache");
	ce = active_cache[pos];
	new = malloc(sizeof(struct cache_entry) + strlen(argv[2]) + 1);
	memcpy(new, ce, sizeof(struct cache_entry));
	new->namelen = strlen(argv[2]);
	memcpy(new->name, argv[2], new->namelen);
	
	if (rename(argv[1], argv[2])) {
		perror("rename-file: ");
		exit(1);
	}

	remove_file_from_cache(argv[1]);
	add_cache_entry(new, 1);

	if (write_cache(newfd, active_cache, active_nr) ||
	    rename(".git/index.lock", ".git/index"))
		die("Unable to write new cachefile");
      
	remove_lock = 0;
	return 0;
}

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-16  1:44                               ` Simon Fowler
@ 2005-04-16 12:19                                 ` David Lang
  2005-04-16 15:55                                   ` Simon Fowler
  0 siblings, 1 reply; 81+ messages in thread
From: David Lang @ 2005-04-16 12:19 UTC (permalink / raw)
  To: simon; +Cc: Linus Torvalds, David Woodhouse, Junio C Hamano, Petr Baudis, git


On Fri, Apr 15, 2005 at 08:32:46AM -0700, Linus Torvalds wrote:
> In other words, I'm right. I'm always right, but sometimes I'm more 
right
> than other times. And dammit, when I say "files don't matter", I'm 
really
> really Right(tm).
>
You're right, of course (All Hail Linus!), if you can make it work
efficiently enough.

Just to put something else on the table, here's how I'd go about
tracking renames and the like, in another world where Linus /does/
make the odd mistake - it's basically a unique id for files in the
repository, added when the file is first recognised and updated when
update-cache adds a new version to the cache. Renames copy the id
across to the new name, and add it into the cache.

This gives you an O(n) way to tell what file was what across
renames, and it might even be useful in Linus' world, or if someone
wanted to build a traditional SCM on top of a git-a-like.

Attached is a patch, and a rename-file.c to use it.

Simon

given that you have multiple machines creating files, how do you deal with 
the idea of the same 'unique id' being assigned to different files by 
different machines?

David Lang



-- 
There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies.
  -- C.A.R. Hoare

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-15 16:01                               ` David Woodhouse
  2005-04-15 16:31                                 ` C. Scott Ananian
@ 2005-04-16 15:33                                 ` Johannes Schindelin
  2005-04-17 13:14                                   ` David Woodhouse
  1 sibling, 1 reply; 81+ messages in thread
From: Johannes Schindelin @ 2005-04-16 15:33 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Linus Torvalds, Junio C Hamano, Petr Baudis, git

Hi,

On Fri, 15 Apr 2005, David Woodhouse wrote:

> But if it can be done cheaply enough at a later date even though we end
> up repeating ourselves, and if it can be done _well_ enough that we
> shouldn't have just asked the user in the first place, then yes, OK I
> agree.

The repetition could be helped by using a cache.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-16 12:19                                 ` David Lang
@ 2005-04-16 15:55                                   ` Simon Fowler
  0 siblings, 0 replies; 81+ messages in thread
From: Simon Fowler @ 2005-04-16 15:55 UTC (permalink / raw)
  To: David Lang; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 797 bytes --]

On Sat, Apr 16, 2005 at 05:19:24AM -0700, David Lang wrote:
> Simon
> 
> given that you have multiple machines creating files, how do you deal with 
> the idea of the same 'unique id' being assigned to different files by 
> different machines?
> 
The id is a sha1 hash of the current time and the full path of the
file being added - the chances of that being replicated without
malicious intent is extremely small. There are other things that
could be used, like the hostname, username of the person running the
program, etc, but I don't really see them being necessary.

Simon

-- 
PGP public key Id 0x144A991C, or http://himi.org/stuff/himi.asc
(crappy) Homepage: http://himi.org
doe #237 (see http://www.lemuria.org/DeCSS) 
My DeCSS mirror: ftp://himi.org/pub/mirrors/css/ 

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-15 15:32                             ` Linus Torvalds
                                                 ` (2 preceding siblings ...)
  2005-04-16  1:44                               ` Simon Fowler
@ 2005-04-16 20:29                               ` Sanjoy Mahajan
  2005-04-16 20:41                                 ` Linus Torvalds
  3 siblings, 1 reply; 81+ messages in thread
From: Sanjoy Mahajan @ 2005-04-16 20:29 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: David Woodhouse, Junio C Hamano, Petr Baudis, git

> And that "where did this come from" decision should be done at _search_
> time, not commit time.

I like this elegant approach, but clever pattern matching can help even
at commit time.  Suppose hello.c is simply:

  printf ("Hello %d\n", year);

And then developer A updates hello.c to:

  printf ("Hello %d\n", year);
  printf ("And   %d\n", year+1);

Meanwhile developer B updates hello.c to:

  printf ("Hello %d\n", yyyy);

How to merge these two changes?  The psychic solution is

  printf ("Hello %d\n", yyyy);
  printf ("And   %d\n", yyyy+1);

Darcs handles token renames specially, but it's not a general solution
so let's leave it aside.  The example does not have enough information
to make the psychic solution unique or reliable, but imagine that the
example were longer to solve that problem.  You'd want to describe the
delta A(hello.c) as
   
  1. duplicated message line
  2. changed 2nd line a bit

And B(hello.c) as

  1. Changed year to yyyy

In that representation, merging the two deltas becomes

  1. duplicated message line
  2. changed 2nd line a bit
  3. Changed year to yyyy in both lines

Or, by commuting the merge operations and adjusting for their
non-commutativity (in terminology like darcs's -- I'm also a physicist):

  1. Changed year to yyyy
  2. duplicated message line
  3. changed 2nd line a bit

So here some of the computation that Linus wants only at question time
(e.g. 'how did that line get here??') is also useful at merge time.
It's difficult (expensive, unreliable) to describe deltas in the form
above or, worse, to merge two such descriptions, but I hope it
illustrates the point.  And perhaps a robust and easier-to-compute
change-description language can be dreamt up, even if the general
problem of describing changes compactly is not computable -- it's almost
the same, or is the same, as finding the Kolmogorov complexity of a data
set.

Or have I missed a fundamental point?

-Sanjoy

`A society of sheep must in time beget a government of wolves.'
   - Bertrand de Jouvenal

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-16 20:29                               ` Sanjoy Mahajan
@ 2005-04-16 20:41                                 ` Linus Torvalds
  0 siblings, 0 replies; 81+ messages in thread
From: Linus Torvalds @ 2005-04-16 20:41 UTC (permalink / raw)
  To: Sanjoy Mahajan; +Cc: David Woodhouse, Junio C Hamano, Petr Baudis, git



On Sat, 16 Apr 2005, Sanjoy Mahajan wrote:
> 
> I like this elegant approach, but clever pattern matching can help even
> at commit time.  Suppose hello.c is simply:

Here, what you're talking about is not "commit", but "merge".

The git model very much separates the two events. You first generate a 
merged tree, an dyou commit that merge as a separate and largely totally 
independent phase.

And yes, I agree that with merging, you do end up potentially wanting to 
try different things. When I've done my "git" merges, all I've really done 
is to make sure that the trivial parts basically merge in zero time, so 
that you can afford to perhaps spend some effort on handling the _real_ 
merge conflicts.

Many systems seem to be designed around a "clever merge" algorithm, with 
darcs perhaps being the most extreme example. The problem with that design 
is that 99.9% of all the work is not at all about being clever, and if you 
try to base your design around the clever things, your performance will 
definitely suck.

So I think that with git, you can actually really try to be clever,
because when you get a merge conflict, you're now only worrying about one
file out of 17,000, and then you can go wild on that one and try different
merge algorithms (token merge, character-merge, line-based merge, you name
it).

Of course, I might not actually personally want to depend on any clever 
merges, but the git infrastructure really doesn't care. My plumbing 
doesn't merge the conflicts that arise within one single object, or the 
filename differences - you can do anything you want on that.

		Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-16 15:33                                 ` Johannes Schindelin
@ 2005-04-17 13:14                                   ` David Woodhouse
  0 siblings, 0 replies; 81+ messages in thread
From: David Woodhouse @ 2005-04-17 13:14 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Linus Torvalds, Junio C Hamano, Petr Baudis, git

On Sat, 2005-04-16 at 17:33 +0200, Johannes Schindelin wrote:
> > But if it can be done cheaply enough at a later date even though we end
> > up repeating ourselves, and if it can be done _well_ enough that we
> > shouldn't have just asked the user in the first place, then yes, OK I
> > agree.
> 
> The repetition could be helped by using a cache.

Perhaps. Since neither such a cache nor even the commit comments are
strictly part of the git data, they probably shouldn't be included in
the sha1 hash of the commit object. However, I don't see a fundamental
reason why we couldn't store them in the same file but omit them from
the hash calculations. That also allows us to retrospectively edit
commit comments without completely changing the entire subsequent
history.

Or is that a little too heretical a suggestion?

-- 
dwmw2


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-17 17:34 Linus Torvalds
@ 2005-04-17 22:12 ` Herbert Xu
  2005-04-17 22:35   ` Linus Torvalds
  2005-04-18  4:16 ` Sanjoy Mahajan
  1 sibling, 1 reply; 81+ messages in thread
From: Herbert Xu @ 2005-04-17 22:12 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: mingo, pasky, simon, david.lang, git

Linus Torvalds <torvalds@osdl.org> wrote:
> 
> If we want to have any kind of confidence that the hash is reall
> yunbreakable, we should make it not just longer than 160 bits, we should
> make sure that it's two or more hashes, and that they are based on totally
> different principles.

Sorry, it has already been shown that combining two difference hashes
doesn't necessarily provide the security that you would hope.

I think what hasn't been discussed here is the cost of actually doing
the comparisons.  In other words, what is the minimum number of
comparisons we can get away and still deal with hash collisions
successfully?

Once we know what the cost is then we can decide whether it's worthwhile
considering the odds involved.
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-17 22:12 ` Herbert Xu
@ 2005-04-17 22:35   ` Linus Torvalds
  2005-04-17 23:29     ` Herbert Xu
  0 siblings, 1 reply; 81+ messages in thread
From: Linus Torvalds @ 2005-04-17 22:35 UTC (permalink / raw)
  To: Herbert Xu; +Cc: mingo, pasky, simon, david.lang, git



On Mon, 18 Apr 2005, Herbert Xu wrote:
> 
> Sorry, it has already been shown that combining two difference hashes
> doesn't necessarily provide the security that you would hope.

Sorry, that's not true.

Quite the reverse. Again, you bring up totally theoretical arguments. In 
_practice_ it has indeed been shown that using two hashes _does_ catch 
hash colissions.

The trivial example is using md5 sums with a length. The "length" is a 
rally bad "hash" of the file contents too. And the fact is, that simple 
combination of hashes has proven to be more resistant to attack than the 
hash itself. It clearly _does_ make a difference in practice.

So _please_, can we drop the obviously bogus "in theory" arguments. They 
do not matter. What matters is practice.

And the fact is, in _theory_ we don't know if somebody may be trivially
able to break any particular hash. But in practice we do know that it's
less likely that you can break a combination of two totally unrelated
hashes than you break one particular one.

NOTE! I'm not actually arguing that we should do that. I'm actually
arguing totally the reverse: I'm arguing that there is a fine line between
being "very very careful" and being "crazy to the point of being
incompetent".

		Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-17 22:35   ` Linus Torvalds
@ 2005-04-17 23:29     ` Herbert Xu
  2005-04-17 23:34       ` Petr Baudis
  2005-04-17 23:50       ` Linus Torvalds
  0 siblings, 2 replies; 81+ messages in thread
From: Herbert Xu @ 2005-04-17 23:29 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: mingo, pasky, simon, david.lang, git

On Sun, Apr 17, 2005 at 03:35:17PM -0700, Linus Torvalds wrote:
> 
> Quite the reverse. Again, you bring up totally theoretical arguments. In 
> _practice_ it has indeed been shown that using two hashes _does_ catch 
> hash colissions.
> 
> The trivial example is using md5 sums with a length. The "length" is a 
> rally bad "hash" of the file contents too. And the fact is, that simple 
> combination of hashes has proven to be more resistant to attack than the 
> hash itself. It clearly _does_ make a difference in practice.

I wasn't disputing that of course.  However, the same effect can be
achieved in using a single hash with a bigger length, e.g., sha256
or sha512.

> So _please_, can we drop the obviously bogus "in theory" arguments. They 
> do not matter. What matters is practice.

I agree.  However, what is the actual cost in practice of detecting
collisions?

I get the feeling that it isn't that bad.  For example, if we did it
at the points where the blobs actually entered the tree, then the cost
is always proportional to the change size (the number of new blobs).

Is this really that bad considering that the average blob isn't very
big?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-17 23:29     ` Herbert Xu
@ 2005-04-17 23:34       ` Petr Baudis
  2005-04-17 23:53         ` Kenneth Johansson
  2005-04-18  0:49         ` Herbert Xu
  2005-04-17 23:50       ` Linus Torvalds
  1 sibling, 2 replies; 81+ messages in thread
From: Petr Baudis @ 2005-04-17 23:34 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Linus Torvalds, mingo, simon, david.lang, git

Dear diary, on Mon, Apr 18, 2005 at 01:29:05AM CEST, I got a letter
where Herbert Xu <herbert@gondor.apana.org.au> told me that...
> I get the feeling that it isn't that bad.  For example, if we did it
> at the points where the blobs actually entered the tree, then the cost
> is always proportional to the change size (the number of new blobs).

No. The collision check is done in the opposite cache - when you want to
write a blob and there is already a file of the same hash in the tree.
So either the blob is already in the database, or you have a collision.

Therefore, the cost is proportional to the size of what stays unchanged.

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-17 23:29     ` Herbert Xu
  2005-04-17 23:34       ` Petr Baudis
@ 2005-04-17 23:50       ` Linus Torvalds
  1 sibling, 0 replies; 81+ messages in thread
From: Linus Torvalds @ 2005-04-17 23:50 UTC (permalink / raw)
  To: Herbert Xu; +Cc: mingo, pasky, simon, david.lang, git



On Mon, 18 Apr 2005, Herbert Xu wrote:
> 
> I wasn't disputing that of course.  However, the same effect can be
> achieved in using a single hash with a bigger length, e.g., sha256
> or sha512.

No it cannot.

If somebody actually literally totally breaks that hash, length won't 
matter. There are (bad) hashes where you can literally edit the content of 
the file, and make sure that the end result has the same hash.

In that case, when the hash algorithm has actually been broken, the length 
of the hash ends up being not very relevant. 

For example, you might "hash" your file by blocking it up in 16-byte
blocks, and xoring all blocks together - the result is a 16-byte hash.  
It's a terrible hash, and obviously trivially breakable, and once broken
it does _not_ help to make it use its 32-byte cousin. Not at all. You can 
just modify the breaking thing to equally cheaply make modifications to a 
file and get the 32-byte hash "right" again.

Is that kind of breakage likely for sha1? Hell no. Is it possible? In your 
"in theory" world where practice doesn't matter, yes.

		Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-17 23:34       ` Petr Baudis
@ 2005-04-17 23:53         ` Kenneth Johansson
  2005-04-18  0:49         ` Herbert Xu
  1 sibling, 0 replies; 81+ messages in thread
From: Kenneth Johansson @ 2005-04-17 23:53 UTC (permalink / raw)
  To: git; +Cc: Linus Torvalds, mingo, simon, david.lang, git

Petr Baudis wrote:
> Dear diary, on Mon, Apr 18, 2005 at 01:29:05AM CEST, I got a letter
> where Herbert Xu <herbert@gondor.apana.org.au> told me that...
> 
>>I get the feeling that it isn't that bad.  For example, if we did it
>>at the points where the blobs actually entered the tree, then the cost
>>is always proportional to the change size (the number of new blobs).
> 
> 
> No. The collision check is done in the opposite cache - when you want to
> write a blob and there is already a file of the same hash in the tree.
> So either the blob is already in the database, or you have a collision.
> 
> Therefore, the cost is proportional to the size of what stays unchanged.
> 

?? now I'm confused. Surly the only cost involved is to never write over 
a file that already exist in the cache and that is already done NOW as 
far as I read the code. So there is NO extra cost in detecting an collision.






^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-17 23:34       ` Petr Baudis
  2005-04-17 23:53         ` Kenneth Johansson
@ 2005-04-18  0:49         ` Herbert Xu
  2005-04-18  0:55           ` Petr Baudis
  1 sibling, 1 reply; 81+ messages in thread
From: Herbert Xu @ 2005-04-18  0:49 UTC (permalink / raw)
  To: Petr Baudis; +Cc: Linus Torvalds, mingo, simon, david.lang, git

On Mon, Apr 18, 2005 at 01:34:41AM +0200, Petr Baudis wrote:
>
> No. The collision check is done in the opposite cache - when you want to
> write a blob and there is already a file of the same hash in the tree.
> So either the blob is already in the database, or you have a collision.
> Therefore, the cost is proportional to the size of what stays unchanged.

This is only true if we're calling update-cache on all unchanged files.
If that's what git is doing then we're in trouble anyway.

Remember that prior to the collision check we've already spent the
effort in

1) Compressing the file.
2) Computing a SHA1 hash on the result.

These two steps together (especially the first one) is much more
expensive than a file content comparison of the blob versus what's
already in the tree.

Somehow I have a hard time seeing how this can be at all efficient if
we're compressing all checked out files including those which are
unchanged.

Therefore the only conclusion I can draw is that we're only calling
update-cache on the set of changed files, or at most a small superset
of them.  In that case, the cost of the collision check *is* proportional
to the size of the change.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-18  0:49         ` Herbert Xu
@ 2005-04-18  0:55           ` Petr Baudis
  0 siblings, 0 replies; 81+ messages in thread
From: Petr Baudis @ 2005-04-18  0:55 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Linus Torvalds, mingo, simon, david.lang, git

Dear diary, on Mon, Apr 18, 2005 at 02:49:06AM CEST, I got a letter
where Herbert Xu <herbert@gondor.apana.org.au> told me that...
> Therefore the only conclusion I can draw is that we're only calling
> update-cache on the set of changed files, or at most a small superset
> of them.  In that case, the cost of the collision check *is* proportional
> to the size of the change.

Yes, of course, sorry for the confusion.  We only consider files you
either specify manually or which have their stat metadata changed
relative to the directory cache. (That is from the git-pasky
perspective; from the plumbing perspective, the user just does
update-cache on whatever he picks.)

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-17 17:34 Linus Torvalds
  2005-04-17 22:12 ` Herbert Xu
@ 2005-04-18  4:16 ` Sanjoy Mahajan
  1 sibling, 0 replies; 81+ messages in thread
From: Sanjoy Mahajan @ 2005-04-18  4:16 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Ingo Molnar, Petr Baudis, Simon Fowler, David Lang, git

> So until proven otherwise, I worry about accidental hashes, and in
> 160 bits of good hashing, that just isn't an issue either...[Going
> from 128 bits to 160 bits made it] so _unbelievably_ less likely to
> happen that it's not even funny.

You are right.  Here's how I learnt to stop worrying and love the 160
bits.

A 160-bit hash requires 2^80=10^24 files before the collision
probability is roughly 0.5 (actually 1-e^{-1/2}).  Now be very
conservative: Instead of tolerating a 0.5 probability, worry about
even a 10^-8 probability of a collision anywhere, anytime.

The magic number of files for that probability is 10^20 (roughly 10^40
pairs for 2^160=10^48 boxes).

Given 10 billion people using git, each producing 1 source file per
second -- busy beavers all -- they would need 300 years to produce
10^20 files.  And to reach the 10^-8 collision probability, all 10^20
files must belong to the same project, and even OpenOffice will not be
that bloated.

-Sanjoy

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
@ 2005-04-26 18:55 Bram Cohen
  2005-04-26 19:58 ` Linus Torvalds
  0 siblings, 1 reply; 81+ messages in thread
From: Bram Cohen @ 2005-04-26 18:55 UTC (permalink / raw)
  To: git

(my apologies for responding to old messages, I only just subscribed to
this list)

Linus Torvalds wrote:
> On Thu, 14 Apr 2005, Junio C Hamano wrote:
> >
> > You say "merge these two trees" above (I take it that you mean
> > "merge these two trees, taking account of this tree as their
> > common ancestor", so actually you are dealing with three trees),
>
> Yes. We're definitely talking three trees.

The LCA for different files might be at different points in the history.
Forcing them to all come from the same point produces very bad merges.

> The fact is, we know how to make tree merges unambiguous, by just
> totally ignoring the history between them.  Ie we know how to merge
> data. I am pretty damn sure that _nobody_ knows how to merge "data over
> time".

You're incorrect. Codeville does exactly that (history-aware merges which
do the right thing even in cases where 3-way merge can't)

> > This however opens up another set of can of worms---it would
> > involve not just three trees but all the trees in the commit
> > chain in between.
>
> Exactly.  I seriously believe that the model is _broken_, simply because
> it gets too complicated. At some point it boils down to "keep it simple,
> stupid".

The Codeville merge algorithm is also quite simple, and is already
implemented and mature.

> I've not even been convinved that renames are worth it. Nobody has
> really given a good reason why.

If one person renames a file and another person modifies it then the
changes should be applied to the moved file.

Also, there's the directory rename case where one person moves a directory
and another person adds a file to it, in which case the file should be
moved to the new directory location on merge. I gather than BK doesn't
support this functionality, but Codeville and Monotone both do.

>    I think you might as well interpret the whole object thing. Git
> _does_ tell you how the objects changed, and I actually believe that a
> diff that works in between objects (ie can show "these lines moved from
> this file X to tjhat file Y") is a _hell_ of a lot more powerful than
> "rename"  is.
>
>    So I'd seriously suggest that instead of worryign about renames,
> people think about global diffs that aren't per-file. Git is good at
> limiting the changes to a set of objects, and it should be entirely
> possible to think of diffs as ways of moving lines _between_ objects and
> not just within objects. It's quite common to move a function from one
> file to another - certainly more so than renaming the whole file.
>
> In other words, I really believe renames are just a meaningless special
> case of a much more interesting problem. Which is just one reason why
> I'm not at all interested in bothering with them other than as a "data
> moved" thing, which git already handles very well indeed.

Nothing, not eveny our beloved BitKeeper, has 'move lines between files'
functionality, and for good reason.

To begin with, it's behaviorally extremely dubious. It would be not
uncommon for the system to erroneously think that some files deleted from
one file were added to another, and then further changes down the line
would cause random unrelated files to get modified in unpredictable ways
when merges happened.

Also, it presents a completely unsolved UI problem. If one person moves
lines 5-15 of file A to file B, and another person concurrently rewrites
lines 10-20 of file A, how on earth is that supposed to be presented to
the user? Codeville can support line moves *within* files just fine, but
doesn't do it because of the UI problem of presenting all the corner
cases. Maybe someday somebody will do a PhD thesis on that topic and we'll
add it, but until then we're sticking with the basic functionality.

Honestly, that you would think of doing whole-tree three-way merges and
even consider moving lines between files shows that you haven't explored
the merge problem very deeply. This is a much harder problem than you
think it is, and one which has already been solved by other systems.

-Bram


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-26 18:55 Merge with git-pasky II Bram Cohen
@ 2005-04-26 19:58 ` Linus Torvalds
  2005-04-26 20:30   ` Tom Lord
                     ` (2 more replies)
  0 siblings, 3 replies; 81+ messages in thread
From: Linus Torvalds @ 2005-04-26 19:58 UTC (permalink / raw)
  To: Bram Cohen; +Cc: git



On Tue, 26 Apr 2005, Bram Cohen wrote:
> 
> If one person renames a file and another person modifies it then the
> changes should be applied to the moved file.

Bzzt. Wrong answer.

The _right_ answer is "if one person moves a function, and another person 
modifies the function, the changes should be applied to the moved 
function".

Which is clearly a _much_ more common case than file renames.

In other words, if your algorithm doesn't handle the latter, then there is 
no point in handling the former either.

And _if_ your algorithm handles the latter, then there's no point in 
handling file renames specially, since the algorithm will have done that 
too, as a small part of it.

		Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-26 19:58 ` Linus Torvalds
@ 2005-04-26 20:30   ` Tom Lord
  2005-04-26 20:31   ` Bram Cohen
  2005-04-26 20:31   ` Daniel Barkalow
  2 siblings, 0 replies; 81+ messages in thread
From: Tom Lord @ 2005-04-26 20:30 UTC (permalink / raw)
  To: torvalds; +Cc: bram, git



   From: Linus Torvalds <torvalds@osdl.org>

   On Tue, 26 Apr 2005, Bram Cohen wrote:
   > 
   > If one person renames a file and another person modifies it then the
   > changes should be applied to the moved file.

   Bzzt. Wrong answer.

You're a little bit nuts, guy.

-t

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-26 19:58 ` Linus Torvalds
  2005-04-26 20:30   ` Tom Lord
@ 2005-04-26 20:31   ` Bram Cohen
  2005-04-26 20:39     ` Tom Lord
                       ` (2 more replies)
  2005-04-26 20:31   ` Daniel Barkalow
  2 siblings, 3 replies; 81+ messages in thread
From: Bram Cohen @ 2005-04-26 20:31 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

Linus Torvalds wrote:

> On Tue, 26 Apr 2005, Bram Cohen wrote:
> >
> > If one person renames a file and another person modifies it then the
> > changes should be applied to the moved file.
>
> Bzzt. Wrong answer.

I'm trying to be polite. You're not making that easy.

> The _right_ answer is "if one person moves a function, and another person
> modifies the function, the changes should be applied to the moved
> function".

Now that you're done being dismissive, could you either (a) rebut my quite
detailed explanation of exactly why that functionality is both a dubious
idea and difficult to implement, or (b) admit that you have no plans
whatsoever for supporting any of this stuff? You can't have it both ways.

What I'd really like to hear is some explanation of why git is
reimplementing all of this stuff from scratch. Your implicit claims that
git will do more things than the other systems without having to reinvent
all of their functionality first are, honestly, naive, ill-informed
arrogance.

I'd like to reiterate that *nothing* out there supports moving lines
between files, and further predict, with total confidence, that if git
tries to support such functionality it will simply fail, either by giving
up or creating a system which can behave horribly. Before you get all
dismissive about this claim, please remember that I've spent years
thinking about merge algorithms, and have actually designed and
implemented them, and have spoken at length with other people who have
done the same, while you've merely thought about them for a few weeks.

> Which is clearly a _much_ more common case than file renames.

Even if we pretend that these are comparable features, that's far from
clearly true. Function moves within a file occur more frequently, but a
file rename moves *all* the functions within that file.

> In other words, if your algorithm doesn't handle the latter, then there is
> no point in handling the former either.

If someone offers you a dollar, no strings attached, do you turn them down
because they didn't offer you ten?

> And _if_ your algorithm handles the latter, then there's no point in
> handling file renames specially, since the algorithm will have done that
> too, as a small part of it.

In case these concepts got conflated, I'd like to point out that Codeville
merge both supports renames *and* does better than three-way merge can do
at merging a single, non-renamed file. In most cases three-way and
codeville merge give the same answer, but there are some cases where there
isn't a single appropriate LCA available, and in those cases codeville
will do the right thing while three-way can't.

-Bram


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-26 19:58 ` Linus Torvalds
  2005-04-26 20:30   ` Tom Lord
  2005-04-26 20:31   ` Bram Cohen
@ 2005-04-26 20:31   ` Daniel Barkalow
  2005-04-26 20:44     ` Tom Lord
  2 siblings, 1 reply; 81+ messages in thread
From: Daniel Barkalow @ 2005-04-26 20:31 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Bram Cohen, git

On Tue, 26 Apr 2005, Linus Torvalds wrote:

> 
> 
> On Tue, 26 Apr 2005, Bram Cohen wrote:
> > 
> > If one person renames a file and another person modifies it then the
> > changes should be applied to the moved file.
> 
> Bzzt. Wrong answer.
> 
> The _right_ answer is "if one person moves a function, and another person 
> modifies the function, the changes should be applied to the moved 
> function".

I'd even go so far as to say that we need to have the user sign off on
this. The modification is reasonably likely to not work in the new
location, due to use of statics that aren't available or something of the
sort.

I suspect that we want to report a conflict in the new location between
the old version and the new version, and let the person merging check
whether the change is okay. That is, both sides modified the same section
at the same time: one side modified its content, while the other modified
its location. We can give a good suggestion, but we need a final ruling.

	-Daniel
*This .sig left intentionally blank*


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-26 20:31   ` Bram Cohen
@ 2005-04-26 20:39     ` Tom Lord
  2005-04-26 20:58     ` Linus Torvalds
  2005-04-26 21:26     ` Diego Calleja
  2 siblings, 0 replies; 81+ messages in thread
From: Tom Lord @ 2005-04-26 20:39 UTC (permalink / raw)
  To: bram; +Cc: git


  > What I'd really like to hear is some explanation of why git is
  > reimplementing all of this stuff from scratch.

Whatever Linus' reasons, it's not a bad exercise and it does help
robustify the kernel project to roll its own.  Not that he seems to be
doing *this* part especially well or anything, but that doesn't matter
from the perspective of the good reasons to do it.

It doesn't matter much -- get stuff into git and people can layer on
that pretty gently.  The low layers of git are a common idea but
context and Linus' nifty code make this instance of the idea a bit
of a gem.



  > please remember that I've spent years
  > thinking about merge algorithms

I'm surprised we haven't met sooner.  If we did and I forgot, sorry.

-t

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-26 20:31   ` Daniel Barkalow
@ 2005-04-26 20:44     ` Tom Lord
  0 siblings, 0 replies; 81+ messages in thread
From: Tom Lord @ 2005-04-26 20:44 UTC (permalink / raw)
  To: barkalow; +Cc: torvalds, bram, git




^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-26 20:31   ` Bram Cohen
  2005-04-26 20:39     ` Tom Lord
@ 2005-04-26 20:58     ` Linus Torvalds
  2005-04-26 21:25       ` Linus Torvalds
  2005-04-26 21:28       ` Bram Cohen
  2005-04-26 21:26     ` Diego Calleja
  2 siblings, 2 replies; 81+ messages in thread
From: Linus Torvalds @ 2005-04-26 20:58 UTC (permalink / raw)
  To: Bram Cohen; +Cc: git



On Tue, 26 Apr 2005, Bram Cohen wrote:
> 
> Now that you're done being dismissive, could you either (a) rebut my quite
> detailed explanation of exactly why that functionality is both a dubious
> idea and difficult to implement, or (b) admit that you have no plans
> whatsoever for supporting any of this stuff? You can't have it both ways.

I'm absolutely not going to do it myself, you're right about that.

I just don't like your notion that you should support the 5% problem with
ugly hacks, and then you dismiss the 95% problem with "nothing else does 
it either".

In other words, we're already merging manually for the 95%. Why do you 
think the 5% is so important?

> What I'd really like to hear is some explanation of why git is
> reimplementing all of this stuff from scratch.

Git does in ~5000 lines and two weeks of work what _I_ think is the right 
thing to do. You're welcome to disagree, but the fact is, people have 
whined and moaned about my use of BK FOR THREE YEARS without showing me 
any better alternatives.

So why are you complaining now, when I implement my own version in two
weeks?

> If someone offers you a dollar, no strings attached, do you turn them down
> because they didn't offer you ten?

"no strings attached"?

There are lots of strings attached to the "follow renames" thing. There's 
30 _years_ of strings attached, and they result in people not looking at 
the _interesting_ problem.

Exactly because people follow renames, they think that they have the 
history of the code, but then they ignore the fact that they don't - 
because it doesn't follow merging of code or splitting of code.

In other words, it's sometimes better to know that you don't know the
answer, than it is to _think_ that you know the answer.

> In case these concepts got conflated, I'd like to point out that Codeville
> merge both supports renames *and* does better than three-way merge can do
> at merging a single, non-renamed file.

And I'd like to point out (again) that git doesn't actually care what 
merge strategy the user uses. 

Me _personally_, I want to have something that is very repeatable and
non-clever. Something I understand _or_ tells me that it can't do it. And
quite frankly, merging single-file history _without_ taking all the other
files' history into account makes me go "ugh".

That's why I like the "we do _not_ look for 'nearer' parents on a per-file
basis", which is what this discussion started with, afaik. I think the 
only original source that makes sense is the least common parent for the 
whole _project_, exactly because other files have done things that may or 
may not depend on the history of the file you're merging.

I (and thus git) really takes a "whole project" approach. 

			Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-26 20:58     ` Linus Torvalds
@ 2005-04-26 21:25       ` Linus Torvalds
  2005-04-26 21:28       ` Bram Cohen
  1 sibling, 0 replies; 81+ messages in thread
From: Linus Torvalds @ 2005-04-26 21:25 UTC (permalink / raw)
  To: Bram Cohen; +Cc: git



On Tue, 26 Apr 2005, Linus Torvalds wrote:
> 
> > What I'd really like to hear is some explanation of why git is
> > reimplementing all of this stuff from scratch.
> 
> Git does in ~5000 lines and two weeks of work what _I_ think is the right 
> thing to do. You're welcome to disagree, but the fact is, people have 
> whined and moaned about my use of BK FOR THREE YEARS without showing me 
> any better alternatives.

Btw, I've also been pretty disgusted by SCM's apparently generally caring 
about stuff that is totally not relevant.

For example, it seems like most SCM people think that merging is about
getting the end result of two conflicting patches right.

In my opinion, that's the _least_ important part of a merge. Maybe the 
kernel is very unusual in this, but basically true _conflicts_ are not 
only rare, but they tend to be things you want a human to look at 
regardless.

The important part of a merge is not how it handles conflicts (which need
to be verified by a human anyway if they are at all interesting), but that
it should meld the history together right so that you have a new solid
base for future merges.

In other words, the important part is the _trivial_ part: the naming of
the parents, and keeping track of their relationship. Not the clashes.

For example, CVS gets this part totally wrong. Sure, it can merge the 
contents, but it totally ignores the important part, so once you've done a 
merge, you're pretty much up shit creek wrt any subsequent merges in any 
other direction. All the other CVS problems pale in comparison. Renames? 
Just a detail.

And it looks like 99% of SCM people seem to think that the solution to 
that is to be more clever about content merges. Which misses the point 
entirely.

Don't get me wrong: content merges are nice, but they are _gravy_. They
are not important. You can do them manually if you have to. What's
important is that once you _have_ done them (manually or automatically),
the system had better be able to go on, knowing that they've been done.

		Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-26 20:31   ` Bram Cohen
  2005-04-26 20:39     ` Tom Lord
  2005-04-26 20:58     ` Linus Torvalds
@ 2005-04-26 21:26     ` Diego Calleja
  2 siblings, 0 replies; 81+ messages in thread
From: Diego Calleja @ 2005-04-26 21:26 UTC (permalink / raw)
  To: Bram Cohen; +Cc: torvalds, git

El Tue, 26 Apr 2005 13:31:31 -0700 (PDT),
Bram Cohen <bram@bitconjurer.org> escribió:

> Even if we pretend that these are comparable features, that's far from
> clearly true. Function moves within a file occur more frequently, but a
> file rename moves *all* the functions within that file.

Renaming or moving files is _so_ rare and unusual that even not implementing
it (like CVS) is hardly a big issue. Even in the linux kernel people moved subsystems
around - OSS went from drivers/sound to /sound/oss in 2.6, and a USB subdirectory
moved too, I think.

The patch got bigger. People wasted 30 seconds more of their life because the .gz
file was bigger - who cares? If it's something it's going to happen every 5 years,
I'd rather move them like CVS does rather than wasting a single second
implementing file renaming/moving...

If something so uncommon like file renaming has been implemented, I don't see
why people shouldn't implement something really useful like the thing linus
proposes, in fact it doesn't looks like a bad idea at all (and you'd get
file renaming for free, too). Perhaps it would be hard to implement and get
right, but at least it would be _useful_.

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-26 20:58     ` Linus Torvalds
  2005-04-26 21:25       ` Linus Torvalds
@ 2005-04-26 21:28       ` Bram Cohen
  2005-04-26 21:36         ` Fabian Franz
  2005-04-26 22:25         ` Linus Torvalds
  1 sibling, 2 replies; 81+ messages in thread
From: Bram Cohen @ 2005-04-26 21:28 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

Linus Torvalds wrote:

> On Tue, 26 Apr 2005, Bram Cohen wrote:
> >
> > Now that you're done being dismissive, could you either (a) rebut my quite
> > detailed explanation of exactly why that functionality is both a dubious
> > idea and difficult to implement, or (b) admit that you have no plans
> > whatsoever for supporting any of this stuff? You can't have it both ways.
>
> I'm absolutely not going to do it myself, you're right about that.

Now you're just being an ass. I stated, flatly, that what you're proposing
to have done (by you or whoever, for feasibility it doesn't matter which)
is not going to happen due to just plain difficulty. You obviously
disagree with me, but rather than coming out and saying so you're
pretending I didn't make that statement.

> > What I'd really like to hear is some explanation of why git is
> > reimplementing all of this stuff from scratch.
>
> Git does in ~5000 lines and two weeks of work what _I_ think is the right
> thing to do.

So you think that a system which supports snapshots and history but has no
merging functionality whatsoever is the right thing? I'm asking this
seriously. You have a magic --make-somebody-else-do-merge command, but for
everybody else the current state of things is workable as a stopgap
measure (the original plan) but very painful for anything more.

Codeville is comparable in terms of number of lines of code to Git, by the
way.

> You're welcome to disagree, but the fact is, people have whined and
> moaned about my use of BK FOR THREE YEARS without showing me any better
> alternatives.

You were happy with BitKeeper, so why should we? Monotone and Codeville
are only just about now really mature, and you aren't exactly known as a
model customer.

> So why are you complaining now, when I implement my own version in two
> weeks?

I'm trying to tell you that the amount of time between now and when a
system as nice as BitKeeper is in use by the kernel can be dramatically
reduced by either using an existing system verbatim or basing new efforts
on one.

If you think that git as it exists right now is at all comparable to
Monotone or Codeville you're completely delusional.

> > In case these concepts got conflated, I'd like to point out that Codeville
> > merge both supports renames *and* does better than three-way merge can do
> > at merging a single, non-renamed file.
>
> And I'd like to point out (again) that git doesn't actually care what
> merge strategy the user uses.
>
> Me _personally_, I want to have something that is very repeatable and
> non-clever. Something I understand _or_ tells me that it can't do it. And
> quite frankly, merging single-file history _without_ taking all the other
> files' history into account makes me go "ugh".

Now you've just gone off the deep end. This is an apples-to-apples
comparison. Please accept one of thee following two statements:

(a) Git doesn't do merging, and none of the related new tools around it do
merging.

(b) Codeville merge (sans rename functionality) would be superior for the
merging which will be done.

-Bram


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-26 21:28       ` Bram Cohen
@ 2005-04-26 21:36         ` Fabian Franz
  2005-04-26 22:30           ` Linus Torvalds
  2005-04-26 22:25         ` Linus Torvalds
  1 sibling, 1 reply; 81+ messages in thread
From: Fabian Franz @ 2005-04-26 21:36 UTC (permalink / raw)
  To: Bram Cohen; +Cc: git, Linus Torvalds

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Am Dienstag, 26. April 2005 23:28 schrieb Bram Cohen:
> Now you've just gone off the deep end. This is an apples-to-apples
> comparison. Please accept one of thee following two statements:
>
> (a) Git doesn't do merging, and none of the related new tools around it do
> merging.
>
> (b) Codeville merge (sans rename functionality) would be superior for the
> merging which will be done.

I have one very humble question:

Why don't you write and contribute some code for git to do good merging?

This would resolve all your problems.

I think the "magic-merge" command is quite exchangable and if your way works
good and is compatible, then people will automatically start using that.

cu

Fabian
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)

iD8DBQFCbrRdI0lSH7CXz7MRAkS4AJ9JEka71M0Zc6cizXhrYpHiKHhL0gCcD/3Q
j+UnPU/cXafGjGG6Bt9mZE8=
=IYk0
-----END PGP SIGNATURE-----


^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-26 21:28       ` Bram Cohen
  2005-04-26 21:36         ` Fabian Franz
@ 2005-04-26 22:25         ` Linus Torvalds
  2005-04-28  0:42           ` Petr Baudis
  1 sibling, 1 reply; 81+ messages in thread
From: Linus Torvalds @ 2005-04-26 22:25 UTC (permalink / raw)
  To: Bram Cohen; +Cc: git



On Tue, 26 Apr 2005, Bram Cohen wrote:
> 
> So you think that a system which supports snapshots and history but has no
> merging functionality whatsoever is the right thing?

You haven't looked at git, have you?

Git already merges better than _any_ open-source SCM out there. It just 
does it so effortlessly that you didn't even realize it does that.

Today I've done four (count them) fully automated merges on the kernel
tree: serial, networking, usb and arm.

And they took a fraction of a second (plus the download of the new
objects, which is the real cost).

This is something that SVN _still_ cannot do, for example. 

		Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-26 21:36         ` Fabian Franz
@ 2005-04-26 22:30           ` Linus Torvalds
  0 siblings, 0 replies; 81+ messages in thread
From: Linus Torvalds @ 2005-04-26 22:30 UTC (permalink / raw)
  To: Fabian Franz; +Cc: Bram Cohen, git



On Tue, 26 Apr 2005, Fabian Franz wrote:
> 
> Am Dienstag, 26. April 2005 23:28 schrieb Bram Cohen:
> > Now you've just gone off the deep end. This is an apples-to-apples
> > comparison. Please accept one of thee following two statements:
> >
> > (a) Git doesn't do merging, and none of the related new tools around it do
> > merging.
> >
> > (b) Codeville merge (sans rename functionality) would be superior for the
> > merging which will be done.
> 
> I have one very humble question:
> 
> Why don't you write and contribute some code for git to do good merging?

Don't bother. Bram doesn't know what he's talking about. 

		Linus

^ permalink raw reply	[flat|nested] 81+ messages in thread

* Re: Merge with git-pasky II.
  2005-04-26 22:25         ` Linus Torvalds
@ 2005-04-28  0:42           ` Petr Baudis
  0 siblings, 0 replies; 81+ messages in thread
From: Petr Baudis @ 2005-04-28  0:42 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Bram Cohen, git

Dear diary, on Wed, Apr 27, 2005 at 12:25:58AM CEST, I got a letter
where Linus Torvalds <torvalds@osdl.org> told me that...
> On Tue, 26 Apr 2005, Bram Cohen wrote:
> > 
> > So you think that a system which supports snapshots and history but has no
> > merging functionality whatsoever is the right thing?
> 
> You haven't looked at git, have you?
> 
> Git already merges better than _any_ open-source SCM out there. It just 
> does it so effortlessly that you didn't even realize it does that.

Did you (or any other kernel developer reading this) actually try the
Codeville merge? (I admit I didn't get time to do anything real with it
yet.) SCM people keep praising it (as basically the best (at least
open-source) merge out there), so it would be interesting to compare
that with the actual real-world experience with it on the kernel.

> Today I've done four (count them) fully automated merges on the kernel
> tree: serial, networking, usb and arm.
> 
> And they took a fraction of a second (plus the download of the new
> objects, which is the real cost).
> 
> This is something that SVN _still_ cannot do, for example. 

I think SVN is just irrelevant here - it is a completely different
league. The contenders here are probably Codeville, Monotone and perhaps
GNU Arch offsprings.

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
C++: an octopus made by nailing extra legs onto a dog. -- Steve Taylor

^ permalink raw reply	[flat|nested] 81+ messages in thread

end of thread, other threads:[~2005-04-28  0:37 UTC | newest]

Thread overview: 81+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-04-26 18:55 Merge with git-pasky II Bram Cohen
2005-04-26 19:58 ` Linus Torvalds
2005-04-26 20:30   ` Tom Lord
2005-04-26 20:31   ` Bram Cohen
2005-04-26 20:39     ` Tom Lord
2005-04-26 20:58     ` Linus Torvalds
2005-04-26 21:25       ` Linus Torvalds
2005-04-26 21:28       ` Bram Cohen
2005-04-26 21:36         ` Fabian Franz
2005-04-26 22:30           ` Linus Torvalds
2005-04-26 22:25         ` Linus Torvalds
2005-04-28  0:42           ` Petr Baudis
2005-04-26 21:26     ` Diego Calleja
2005-04-26 20:31   ` Daniel Barkalow
2005-04-26 20:44     ` Tom Lord
  -- strict thread matches above, loose matches on Subject: below --
2005-04-17 17:34 Linus Torvalds
2005-04-17 22:12 ` Herbert Xu
2005-04-17 22:35   ` Linus Torvalds
2005-04-17 23:29     ` Herbert Xu
2005-04-17 23:34       ` Petr Baudis
2005-04-17 23:53         ` Kenneth Johansson
2005-04-18  0:49         ` Herbert Xu
2005-04-18  0:55           ` Petr Baudis
2005-04-17 23:50       ` Linus Torvalds
2005-04-18  4:16 ` Sanjoy Mahajan
     [not found] <000d01c541ed$32241fd0$6400a8c0@gandalf>
2005-04-15 20:31 ` Linus Torvalds
2005-04-15 23:00   ` Barry Silverman
2005-04-16  0:32     ` Linus Torvalds
2005-04-14  0:29 Petr Baudis
2005-04-13 21:25 ` Christopher Li
2005-04-14  0:45   ` Petr Baudis
2005-04-14  3:51     ` Linus Torvalds
2005-04-14  1:23       ` Christopher Li
2005-04-14  5:03         ` Paul Jackson
2005-04-14  2:16           ` Christopher Li
2005-04-14  6:16             ` Paul Jackson
2005-04-14  7:05       ` Junio C Hamano
2005-04-14  8:06         ` Linus Torvalds
2005-04-14  8:39           ` Junio C Hamano
2005-04-14  9:10             ` Linus Torvalds
2005-04-14 11:14               ` Junio C Hamano
2005-04-14 12:16                 ` Petr Baudis
2005-04-14 18:12                   ` Junio C Hamano
2005-04-14 18:36                     ` Linus Torvalds
2005-04-14 19:59                       ` Junio C Hamano
2005-04-15  0:42                         ` Linus Torvalds
2005-04-15  2:33                           ` Barry Silverman
2005-04-15 10:02                           ` David Woodhouse
2005-04-15 15:32                             ` Linus Torvalds
2005-04-15 16:01                               ` David Woodhouse
2005-04-15 16:31                                 ` C. Scott Ananian
2005-04-15 17:11                                   ` Linus Torvalds
2005-04-16 15:33                                 ` Johannes Schindelin
2005-04-17 13:14                                   ` David Woodhouse
2005-04-15 19:20                               ` Paul Jackson
2005-04-16  1:44                               ` Simon Fowler
2005-04-16 12:19                                 ` David Lang
2005-04-16 15:55                                   ` Simon Fowler
2005-04-16 20:29                               ` Sanjoy Mahajan
2005-04-16 20:41                                 ` Linus Torvalds
2005-04-15  9:14                       ` David Woodhouse
2005-04-15  9:36                         ` Ingo Molnar
2005-04-15 10:05                           ` David Woodhouse
2005-04-15 14:53                             ` Ingo Molnar
2005-04-15 15:09                               ` David Woodhouse
2005-04-15 12:03                         ` Johannes Schindelin
2005-04-15 10:22                           ` Theodore Ts'o
2005-04-15 14:53                         ` Linus Torvalds
2005-04-15 15:29                           ` David Woodhouse
2005-04-15 15:51                             ` Linus Torvalds
2005-04-15 15:54                           ` Paul Jackson
2005-04-15 16:30                             ` C. Scott Ananian
2005-04-15 18:29                               ` Paul Jackson
2005-04-14 18:51                     ` Christopher Li
2005-04-14 19:35                     ` Petr Baudis
2005-04-14 23:12                       ` Junio C Hamano
2005-04-14 20:24                         ` Christopher Li
2005-04-14 23:31                         ` Petr Baudis
2005-04-15  0:58                           ` Junio C Hamano
2005-04-14 22:30                             ` Christopher Li
2005-04-15  7:43                               ` Junio C Hamano
2005-04-15  6:28                                 ` Christopher Li
2005-04-15 11:11                                   ` Junio C Hamano
2005-04-15 10:22                           ` Junio C Hamano
2005-04-15 20:40                             ` Petr Baudis
2005-04-15 22:41                               ` Junio C Hamano
2005-04-15 19:57           ` Junio C Hamano
2005-04-15 20:45             ` Linus Torvalds
2005-04-14  0:30 ` Petr Baudis

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).