git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH 0/8] `log -c` speedup (part 2)
@ 2014-02-03 12:47 Kirill Smelkov
  2014-02-03 12:47 ` [PATCH 1/8] fixup! combine_diff: simplify intersect_paths() further Kirill Smelkov
                   ` (7 more replies)
  0 siblings, 8 replies; 26+ messages in thread
From: Kirill Smelkov @ 2014-02-03 12:47 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git, Kirill Smelkov

Hello up there,

I'm still trying to speedup combined-diff to be not so slow, to be able to
realistically use it in my readonly filesystem for git archives.

In the first part[1], we optimized paths intersections, but combined-diff still
remained slow, because internally it was computing huge diffs, even for small,
or empty output.

This time, here goes paths scanning rework, which results in significant
combine-diff speedup. Please apply.

Thanks beforehand,
Kirill

P.S. the code depends on ks/diff-c-with-diff-order

[1] http://permalink.gmane.org/gmane.comp.version-control.git/240713


Kirill Smelkov (8):
  fixup! combine_diff: simplify intersect_paths() further
  tests: add checking that combine-diff emits only correct paths
  tree-diff: no need to manually verify that there is no mode change for
    a path
  tree-diff: no need to pass match to skip_uninteresting()
  combine-diff: move show_log_first logic/action out of paths scanning
  combine-diff: Move changed-paths scanning logic into its own function
  combine-diff: Fast changed-to-all-parents paths scanning
  combine-diff: bail out early, if num_paths=0

 combine-diff.c                 | 215 +++++++++++++++-----
 diff.c                         |   1 +
 diff.h                         |   6 +
 t/t4057-diff-combined-paths.sh | 106 ++++++++++
 tree-diff.c                    | 442 +++++++++++++++++++++++++++++++++++++++--
 5 files changed, 696 insertions(+), 74 deletions(-)
 create mode 100755 t/t4057-diff-combined-paths.sh

-- 
1.9.rc1.181.g641f458

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/8] fixup! combine_diff: simplify intersect_paths() further
  2014-02-03 12:47 [PATCH 0/8] `log -c` speedup (part 2) Kirill Smelkov
@ 2014-02-03 12:47 ` Kirill Smelkov
  2014-02-03 19:40   ` Junio C Hamano
  2014-02-03 12:47 ` [PATCH 2/8] tests: add checking that combine-diff emits only correct paths Kirill Smelkov
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Kirill Smelkov @ 2014-02-03 12:47 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git, Kirill Smelkov

That cleanup patch is good, but I've found a bug in it. In the item removal
code

> +                      /* p->path not in q->queue[]; drop it */
> +                      struct combine_diff_path *next = p->next;
> +
> +                      if ((*tail = next) != NULL)
> +                              tail = &next->next;
> +                      free(p);
>                        continue;

        *tail = next

is right, but

        tail = &next->next

is wrong, because it will lead to skipping the element after removed
one. Proof:

    tail        p
      |         |
      |         |
      v         v
     +-+       +------+-+       +------+-+       +------+-+
     | |       |      | |       |      | |       |      | |
     |o------->|      |o------->|      |o------->|      |o------->
     |0|       |     1| |       |     2| |       |     3| |
     +-+       +------+-+       +------+-+       +------+-+

suppose, we are removing element 1. p->next points to 2, after

    *tail = next

0 points to 2, which != NULL. After

    tail = &next->next

tail points to &2->next.

On the next cycle iteration, after

    p = *tail

we'll have

    p = *2->next = 3

so 2 was skipped, oops.

I've actually hit the bug - when generating combine-diff, for merges with
should-be empty `log -c` output, every second path was present in the diff.
That, by the way, means we need to extend combine-diff tests somewhat.

Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru>
---
 combine-diff.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/combine-diff.c b/combine-diff.c
index 0809e79..a03147c 100644
--- a/combine-diff.c
+++ b/combine-diff.c
@@ -58,8 +58,7 @@ static struct combine_diff_path *intersect_paths(struct combine_diff_path *curr,
 			/* p->path not in q->queue[]; drop it */
 			struct combine_diff_path *next = p->next;
 
-			if ((*tail = next) != NULL)
-				tail = &next->next;
+			*tail = next;
 			free(p);
 			continue;
 		}
-- 
1.9.rc1.181.g641f458

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 2/8] tests: add checking that combine-diff emits only correct paths
  2014-02-03 12:47 [PATCH 0/8] `log -c` speedup (part 2) Kirill Smelkov
  2014-02-03 12:47 ` [PATCH 1/8] fixup! combine_diff: simplify intersect_paths() further Kirill Smelkov
@ 2014-02-03 12:47 ` Kirill Smelkov
  2014-02-03 23:10   ` Junio C Hamano
  2014-02-03 12:47 ` [PATCH 3/8] tree-diff: no need to manually verify that there is no mode change for a path Kirill Smelkov
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Kirill Smelkov @ 2014-02-03 12:47 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git, Kirill Smelkov

where "correct paths" stands for paths that are different to all
parents.

Up until now, we were testing combined diff only on one file, or on
several files which were all different (t4038-diff-combined.sh).

As recent thinko in "simplify intersect_paths() further" showed, and
also, since we are going to rework code for finding paths different to
all parents, lets write at least basic tests.

Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru>
---
 t/t4057-diff-combined-paths.sh | 106 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 106 insertions(+)
 create mode 100755 t/t4057-diff-combined-paths.sh

diff --git a/t/t4057-diff-combined-paths.sh b/t/t4057-diff-combined-paths.sh
new file mode 100755
index 0000000..e6e457d
--- /dev/null
+++ b/t/t4057-diff-combined-paths.sh
@@ -0,0 +1,106 @@
+#!/bin/sh
+
+test_description='combined diff show only paths that are different to all parents'
+
+. ./test-lib.sh
+
+# verify that diffc.expect matches output of
+# `git diff -c --name-only HEAD HEAD^ HEAD^2`
+diffc_verify() {
+	git diff -c --name-only HEAD HEAD^ HEAD^2 >diffc.actual &&
+	test_cmp diffc.expect diffc.actual
+}
+
+test_expect_success 'trivial merge - combine-diff empty' '
+	for i in `test_seq 1 9`
+	do
+		echo $i >$i.txt	&&
+		git add $i.txt
+	done &&
+	git commit -m "init" &&
+	git checkout -b side &&
+	for i in `test_seq 2 9`
+	do
+		echo $i/2 >>$i.txt
+	done &&
+	git commit -a -m "side 2-9" &&
+	git checkout master &&
+	echo 1/2 >1.txt &&
+	git commit -a -m "master 1" &&
+	git merge side &&
+	>diffc.expect &&
+	diffc_verify
+'
+
+
+test_expect_success 'only one trully conflicting path' '
+	git checkout side &&
+	for i in `test_seq 2 9`
+	do
+		echo $i/3 >>$i.txt
+	done &&
+	echo "4side" >>4.txt &&
+	git commit -a -m "side 2-9 +4" &&
+	git checkout master &&
+	for i in `test_seq 1 9`
+	do
+		echo $i/3 >>$i.txt
+	done &&
+	echo "4master" >>4.txt &&
+	git commit -a -m "master 1-9 +4" &&
+	test_must_fail git merge side &&
+	cat <<-\EOF >4.txt &&
+	4
+	4/2
+	4/3
+	4master
+	4side
+	EOF
+	git add 4.txt &&
+	git commit -m "merge side (2)" &&
+	echo 4.txt >diffc.expect &&
+	diffc_verify
+'
+
+test_expect_success 'merge introduces new file' '
+	git checkout side &&
+	for i in `test_seq 5 9`
+	do
+		echo $i/4 >>$i.txt
+	done &&
+	git commit -a -m "side 5-9" &&
+	git checkout master &&
+	for i in `test_seq 1 3`
+	do
+		echo $i/4 >>$i.txt
+	done &&
+	git commit -a -m "master 1-3 +4hello" &&
+	git merge side &&
+	echo "Hello World" >4hello.txt &&
+	git add 4hello.txt &&
+	git commit --amend &&
+	echo 4hello.txt >diffc.expect &&
+	diffc_verify
+'
+
+test_expect_success 'merge removed a file' '
+	git checkout side &&
+	for i in `test_seq 5 9`
+	do
+		echo $i/5 >>$i.txt
+	done &&
+	git commit -a -m "side 5-9" &&
+	git checkout master &&
+	for i in `test_seq 1 3`
+	do
+		echo $i/4 >>$i.txt
+	done &&
+	git commit -a -m "master 1-3" &&
+	git merge side &&
+	git rm 4.txt &&
+	git commit --amend &&
+	echo 4.txt >diffc.expect &&
+	diffc_verify
+'
+
+test_done
-- 
1.9.rc1.181.g641f458

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 3/8] tree-diff: no need to manually verify that there is no mode change for a path
  2014-02-03 12:47 [PATCH 0/8] `log -c` speedup (part 2) Kirill Smelkov
  2014-02-03 12:47 ` [PATCH 1/8] fixup! combine_diff: simplify intersect_paths() further Kirill Smelkov
  2014-02-03 12:47 ` [PATCH 2/8] tests: add checking that combine-diff emits only correct paths Kirill Smelkov
@ 2014-02-03 12:47 ` Kirill Smelkov
  2014-02-03 23:12   ` Junio C Hamano
  2014-02-03 12:47 ` [PATCH 4/8] tree-diff: no need to pass match to skip_uninteresting() Kirill Smelkov
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Kirill Smelkov @ 2014-02-03 12:47 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git, Kirill Smelkov

Because if there is, such two tree entries would never be compared as
equal - the code in base_name_compare() explicitly compares modes, if
there is a change for dir bit, even for equal paths, entries would
compare as different.

The code I'm removing here is from 2005 April 262e82b4 (Fix diff-tree
recursion), which pre-dates base_name_compare() introduction in 958ba6c9
(Introduce "base_name_compare()" helper function) by a month.

Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru>
---
 tree-diff.c | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/tree-diff.c b/tree-diff.c
index 456660c..c2c67fd 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -23,6 +23,10 @@ static int compare_tree_entry(struct tree_desc *t1, struct tree_desc *t2,
 
 	pathlen1 = tree_entry_len(&t1->entry);
 	pathlen2 = tree_entry_len(&t2->entry);
+
+	/* NOTE files and directories *always* compare differently, event when
+	 * having the same name.
+	 */
 	cmp = base_name_compare(path1, pathlen1, mode1, path2, pathlen2, mode2);
 	if (cmp < 0) {
 		show_entry(opt, "-", t1, base);
@@ -35,16 +39,6 @@ static int compare_tree_entry(struct tree_desc *t1, struct tree_desc *t2,
 	if (!DIFF_OPT_TST(opt, FIND_COPIES_HARDER) && !hashcmp(sha1, sha2) && mode1 == mode2)
 		return 0;
 
-	/*
-	 * If the filemode has changed to/from a directory from/to a regular
-	 * file, we need to consider it a remove and an add.
-	 */
-	if (S_ISDIR(mode1) != S_ISDIR(mode2)) {
-		show_entry(opt, "-", t1, base);
-		show_entry(opt, "+", t2, base);
-		return 0;
-	}
-
 	strbuf_add(base, path1, pathlen1);
 	if (DIFF_OPT_TST(opt, RECURSIVE) && S_ISDIR(mode1)) {
 		if (DIFF_OPT_TST(opt, TREE_IN_RECURSIVE)) {
-- 
1.9.rc1.181.g641f458

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 4/8] tree-diff: no need to pass match to skip_uninteresting()
  2014-02-03 12:47 [PATCH 0/8] `log -c` speedup (part 2) Kirill Smelkov
                   ` (2 preceding siblings ...)
  2014-02-03 12:47 ` [PATCH 3/8] tree-diff: no need to manually verify that there is no mode change for a path Kirill Smelkov
@ 2014-02-03 12:47 ` Kirill Smelkov
  2014-02-03 12:47 ` [PATCH 5/8] combine-diff: move show_log_first logic/action out of paths scanning Kirill Smelkov
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread
From: Kirill Smelkov @ 2014-02-03 12:47 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git, Kirill Smelkov

It is neither used there as input, nor the output written through it, is
used outside.

Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru>
---
 tree-diff.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/tree-diff.c b/tree-diff.c
index c2c67fd..f7b3ade 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -108,13 +108,14 @@ static void show_entry(struct diff_options *opt, const char *prefix,
 }
 
 static void skip_uninteresting(struct tree_desc *t, struct strbuf *base,
-			       struct diff_options *opt,
-			       enum interesting *match)
+			       struct diff_options *opt)
 {
+	enum interesting match;
+
 	while (t->size) {
-		*match = tree_entry_interesting(&t->entry, base, 0, &opt->pathspec);
-		if (*match) {
-			if (*match == all_entries_not_interesting)
+		match = tree_entry_interesting(&t->entry, base, 0, &opt->pathspec);
+		if (match) {
+			if (match == all_entries_not_interesting)
 				t->size = 0;
 			break;
 		}
@@ -127,8 +128,6 @@ int diff_tree(struct tree_desc *t1, struct tree_desc *t2,
 {
 	struct strbuf base;
 	int baselen = strlen(base_str);
-	enum interesting t1_match = entry_not_interesting;
-	enum interesting t2_match = entry_not_interesting;
 
 	/* Enable recursion indefinitely */
 	opt->pathspec.recursive = DIFF_OPT_TST(opt, RECURSIVE);
@@ -140,8 +139,8 @@ int diff_tree(struct tree_desc *t1, struct tree_desc *t2,
 		if (diff_can_quit_early(opt))
 			break;
 		if (opt->pathspec.nr) {
-			skip_uninteresting(t1, &base, opt, &t1_match);
-			skip_uninteresting(t2, &base, opt, &t2_match);
+			skip_uninteresting(t1, &base, opt);
+			skip_uninteresting(t2, &base, opt);
 		}
 		if (!t1->size) {
 			if (!t2->size)
-- 
1.9.rc1.181.g641f458

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 5/8] combine-diff: move show_log_first logic/action out of paths scanning
  2014-02-03 12:47 [PATCH 0/8] `log -c` speedup (part 2) Kirill Smelkov
                   ` (3 preceding siblings ...)
  2014-02-03 12:47 ` [PATCH 4/8] tree-diff: no need to pass match to skip_uninteresting() Kirill Smelkov
@ 2014-02-03 12:47 ` Kirill Smelkov
  2014-02-03 23:21   ` Junio C Hamano
  2014-02-03 12:47 ` [PATCH 6/8] combine-diff: Move changed-paths scanning logic into its own function Kirill Smelkov
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Kirill Smelkov @ 2014-02-03 12:47 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git, Kirill Smelkov

Judging from sample outputs and tests nothing changes in diff -c output,
and this change will help later patches, when we'll be refactoring paths
scanning into its own function with several variants - the
show_log_first logic / code will stay common to all of them.

NOTE: only now we have to take care to explicitly not show anything if
    parents array is empty, as in fact there are some clients in Git code,
    which calls diff_tree_combined() in such a way.

Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru>
---
 combine-diff.c | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/combine-diff.c b/combine-diff.c
index a03147c..272931f 100644
--- a/combine-diff.c
+++ b/combine-diff.c
@@ -1313,6 +1313,20 @@ void diff_tree_combined(const unsigned char *sha1,
 	struct combine_diff_path *p, *paths = NULL;
 	int i, num_paths, needsep, show_log_first, num_parent = parents->nr;
 
+	/* nothing to do, if no parents */
+	if (!num_parent)
+		return;
+
+	show_log_first = !!rev->loginfo && !rev->no_commit_id;
+	needsep = 0;
+	if (show_log_first) {
+		show_log(rev);
+
+		if (rev->verbose_header && opt->output_format)
+			printf("%s%c", diff_line_prefix(opt),
+			       opt->line_termination);
+	}
+
 	diffopts = *opt;
 	copy_pathspec(&diffopts.pathspec, &opt->pathspec);
 	diffopts.output_format = DIFF_FORMAT_NO_OUTPUT;
@@ -1321,8 +1335,6 @@ void diff_tree_combined(const unsigned char *sha1,
 	/* tell diff_tree to emit paths in sorted (=tree) order */
 	diffopts.orderfile = NULL;
 
-	show_log_first = !!rev->loginfo && !rev->no_commit_id;
-	needsep = 0;
 	/* find set of paths that everybody touches */
 	for (i = 0; i < num_parent; i++) {
 		/* show stat against the first parent even
@@ -1338,14 +1350,6 @@ void diff_tree_combined(const unsigned char *sha1,
 		diffcore_std(&diffopts);
 		paths = intersect_paths(paths, i, num_parent);
 
-		if (show_log_first && i == 0) {
-			show_log(rev);
-
-			if (rev->verbose_header && opt->output_format)
-				printf("%s%c", diff_line_prefix(opt),
-				       opt->line_termination);
-		}
-
 		/* if showing diff, show it in requested order */
 		if (diffopts.output_format != DIFF_FORMAT_NO_OUTPUT &&
 		    opt->orderfile) {
-- 
1.9.rc1.181.g641f458

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 6/8] combine-diff: Move changed-paths scanning logic into its own function
  2014-02-03 12:47 [PATCH 0/8] `log -c` speedup (part 2) Kirill Smelkov
                   ` (4 preceding siblings ...)
  2014-02-03 12:47 ` [PATCH 5/8] combine-diff: move show_log_first logic/action out of paths scanning Kirill Smelkov
@ 2014-02-03 12:47 ` Kirill Smelkov
  2014-02-03 12:47 ` [PATCH 7/8] combine-diff: Fast changed-to-all-parents paths scanning Kirill Smelkov
  2014-02-03 12:47 ` [PATCH 8/8] combine-diff: bail out early, if num_paths=0 Kirill Smelkov
  7 siblings, 0 replies; 26+ messages in thread
From: Kirill Smelkov @ 2014-02-03 12:47 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git, Kirill Smelkov

Move code for finding paths for which diff(commit,parent_i) is not-empty
for all parents to separate function - at present we have generic (and
slow) code for this job, which translates 1 n-parent problem to n
1-parent problems and then intersect results, and will be adding another
limited, but faster, paths scanning implementation in the next patch.

Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru>
---
 combine-diff.c | 79 ++++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 52 insertions(+), 27 deletions(-)

diff --git a/combine-diff.c b/combine-diff.c
index 272931f..427a7c1 100644
--- a/combine-diff.c
+++ b/combine-diff.c
@@ -1303,6 +1303,50 @@ static const char *path_path(void *obj)
 	return path->path;
 }
 
+
+/* find set of paths that every parent touches */
+static struct combine_diff_path *find_paths(const unsigned char *sha1,
+	const struct sha1_array *parents, struct diff_options *opt)
+{
+	struct combine_diff_path *paths = NULL;
+	int i, num_parent = parents->nr;
+
+	int output_format = opt->output_format;
+	const char *orderfile = opt->orderfile;
+
+	opt->output_format = DIFF_FORMAT_NO_OUTPUT;
+	/* tell diff_tree to emit paths in sorted (=tree) order */
+	opt->orderfile = NULL;
+
+	for (i = 0; i < num_parent; i++) {
+		/* show stat against the first parent even
+		 * when doing combined diff.
+		 */
+		int stat_opt = (output_format &
+				(DIFF_FORMAT_NUMSTAT|DIFF_FORMAT_DIFFSTAT));
+		if (i == 0 && stat_opt)
+			opt->output_format = stat_opt;
+		else
+			opt->output_format = DIFF_FORMAT_NO_OUTPUT;
+		diff_tree_sha1(parents->sha1[i], sha1, "", opt);
+		diffcore_std(opt);
+		paths = intersect_paths(paths, i, num_parent);
+
+		/* if showing diff, show it in requested order */
+		if (opt->output_format != DIFF_FORMAT_NO_OUTPUT &&
+		    orderfile) {
+			diffcore_order(orderfile);
+		}
+
+		diff_flush(opt);
+	}
+
+	opt->output_format = output_format;
+	opt->orderfile = orderfile;
+	return paths;
+}
+
+
 void diff_tree_combined(const unsigned char *sha1,
 			const struct sha1_array *parents,
 			int dense,
@@ -1310,7 +1354,7 @@ void diff_tree_combined(const unsigned char *sha1,
 {
 	struct diff_options *opt = &rev->diffopt;
 	struct diff_options diffopts;
-	struct combine_diff_path *p, *paths = NULL;
+	struct combine_diff_path *p, *paths;
 	int i, num_paths, needsep, show_log_first, num_parent = parents->nr;
 
 	/* nothing to do, if no parents */
@@ -1329,35 +1373,16 @@ void diff_tree_combined(const unsigned char *sha1,
 
 	diffopts = *opt;
 	copy_pathspec(&diffopts.pathspec, &opt->pathspec);
-	diffopts.output_format = DIFF_FORMAT_NO_OUTPUT;
 	DIFF_OPT_SET(&diffopts, RECURSIVE);
 	DIFF_OPT_CLR(&diffopts, ALLOW_EXTERNAL);
-	/* tell diff_tree to emit paths in sorted (=tree) order */
-	diffopts.orderfile = NULL;
 
-	/* find set of paths that everybody touches */
-	for (i = 0; i < num_parent; i++) {
-		/* show stat against the first parent even
-		 * when doing combined diff.
-		 */
-		int stat_opt = (opt->output_format &
-				(DIFF_FORMAT_NUMSTAT|DIFF_FORMAT_DIFFSTAT));
-		if (i == 0 && stat_opt)
-			diffopts.output_format = stat_opt;
-		else
-			diffopts.output_format = DIFF_FORMAT_NO_OUTPUT;
-		diff_tree_sha1(parents->sha1[i], sha1, "", &diffopts);
-		diffcore_std(&diffopts);
-		paths = intersect_paths(paths, i, num_parent);
-
-		/* if showing diff, show it in requested order */
-		if (diffopts.output_format != DIFF_FORMAT_NO_OUTPUT &&
-		    opt->orderfile) {
-			diffcore_order(opt->orderfile);
-		}
-
-		diff_flush(&diffopts);
-	}
+	/* find set of paths that everybody touches
+	 *
+	 * NOTE find_paths() also handles --stat, as it computes
+	 * diff(sha1,parent_i) for all i to do the job, specifically
+	 * for parent0.
+	 */
+	paths = find_paths(sha1, parents, &diffopts);
 
 	/* find out number of surviving paths */
 	for (num_paths = 0, p = paths; p; p = p->next)
-- 
1.9.rc1.181.g641f458

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 7/8] combine-diff: Fast changed-to-all-parents paths scanning
  2014-02-03 12:47 [PATCH 0/8] `log -c` speedup (part 2) Kirill Smelkov
                   ` (5 preceding siblings ...)
  2014-02-03 12:47 ` [PATCH 6/8] combine-diff: Move changed-paths scanning logic into its own function Kirill Smelkov
@ 2014-02-03 12:47 ` Kirill Smelkov
  2014-02-03 23:26   ` Junio C Hamano
  2014-02-04  0:00   ` Junio C Hamano
  2014-02-03 12:47 ` [PATCH 8/8] combine-diff: bail out early, if num_paths=0 Kirill Smelkov
  7 siblings, 2 replies; 26+ messages in thread
From: Kirill Smelkov @ 2014-02-03 12:47 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git, Kirill Smelkov

As was recently shown (c839f1bd "combine-diff: optimize
combine_diff_path sets intersection"), combine-diff runs very slowly. In
that commit we optimized paths sets intersection, but that accounted
only for ~ 25% of the slowness, and as my tracing showed, for linux.git
v3.10..v3.11, for merges a lot of time is spent computing
diff(commit,commit^2) just to only then intersect that huge diff to
almost small set of files from diff(commit,commit^1).

That's because at present, to compute combine-diff, for first finding
paths, that "every parent touches", we use the following combine-diff
property/definition:

    D(A,P1...Pn) = D(A,P1) ^ ... ^ D(A,Pn)      (w.r.t. paths)

where

    D(A,P1...Pn) is combined diff between commit A, and parents Pi

and

    D(A,Pi) is usual two-tree diff Pi..A

So if any of that D(A,Pi) is huge, tracting 1 n-parent combine-diff as n
1-parent diffs and intersecting results will be slow.

And usually, for linux.git and other topic-based workflows, that
D(A,P2) is huge, because, if merge-base of A and P2, is several dozens
of merges (from A, via first parent) below, that D(A,P2) will be diffing
sum of merges from several subsystems to 1 subsystem.

The solution is to avoid computing n 1-parent diffs, and to find
changed-to-all-parents paths via scanning A's and all Pi's trees
simultaneously, at each step comparing their entries, and based on that
comparison, populate paths result, and deduce we could *skip*
*recursing* into subdirectories, if at least for 1 parent, sha1 of that
dir tree is the same as in A. That would save us from doing significant
amount of needless work.

Such approach is very similar to what diff_tree() does, only there we
deal with scanning only 2 trees simultaneously, and for n+1 tree, the
logic is a bit more complex:

    D(A,X1...Xn) calculation scheme
    -------------------------------

    D(A,X1...Xn) = D(A,X1) ^ ... ^ D(A,Xn)       (regarding resulting paths set)

         D(A,Xj)         - diff between A..Xj
         D(A,X1...Xn)    - combined diff from A to parents X1,...,Xn

    We start from all trees, which are sorted, and compare their entries in
    lock-step:

          A     X1       Xn
          -     -        -
         |a|   |x1|     |xn|
         |-|   |--| ... |--|      i = argmin(x1...xn)
         | |   |  |     |  |
         |-|   |--|     |--|
         |.|   |. |     |. |
          .     .        .
          .     .        .

    at any time there could be 3 cases:

         1)  a < xi;
         2)  a > xi;
         3)  a = xi.

    Schematic deduction of what every case means, and what to do, follows:

    1)  a < xi  ->  ∀j a ∉ Xj  ->  "+a" ∈ D(A,Xj)  ->  D += "+a";  a↓

    2)  a > xi

        2.1) ∃j: xj > xi  ->  "-xi" ∉ D(A,Xj)  ->  D += ø;  ∀ xk=xi  xk↓
        2.2) ∀j  xj = xi  ->  xj ∉ A  ->  "-xj" ∈ D(A,Xj)  ->  D += "-xi";  ∀j xj↓

    3)  a = xi

        3.1) ∃j: xj > xi  ->  "+a" ∈ D(A,Xj)  ->  only xk=xi remains to investigate
        3.2) xj = xi  ->  investigate δ(a,xj)
         |
         |
         v

        3.1+3.2) looking at δ(a,xk) ∀k: xk=xi - if all != ø  ->

                          ⎧δ(a,xk)  - if xk=xi
                 ->  D += ⎨
                          ⎩"+a"     - if xk>xi

        in any case a↓  ∀ xk=xi  xk↓

~

For comparison, here is how diff_tree() works:

    D(A,B) calculation scheme
    -------------------------

        A     B
        -     -
       |a|   |b|    a < b   ->  a ∉ B   ->   D(A,B) +=  +a    a↓
       |-|   |-|    a > b   ->  b ∉ A   ->   D(A,B) +=  -b    b↓
       | |   | |    a = b   ->  investigate δ(a,b)            a↓ b↓
       |-|   |-|
       |.|   |.|
        .     .
        .     .

For now there is a limitation however - for simple scan to work, we have
to know in advance, a user had not asked for finding rename/copies
detection. At present, if he/she has, we fallback to calling old n
1-parent diffs code. I think that restriction, at least in theory,
could be lifted. I propose we have fast code for at least common case,
and move on incrementally.

Another similar limitation, is that if diffcore transforms which
filter-out paths, are active, we also fallback to old code. The reason
here is pure "technical" - all transforms expect to find paths in
diff_filepair's queue, which applies only to 1-parent case. If needed,
The solution here would be to generalize diff_filepair to
combine_diff_path everywhere, and do not differentiate between plain and
combined diffs (all diffs will be combined with diff(A,B) being
combined diff with only 1 parent), but again, let's move on
incrementally.

I consciously placed trees-scanning code in tree-diff.c, foreseeing diff
and combine-diff merge, and because that code is very similar to
diff_tree - to keep similar codes near.

Timings for `git log --raw --no-abbrev --no-renames` without `-c` ("git log")
and with `-c` ("git log -c") and with `-c --merges` ("git log -c --merges")
before and after the patch are as follows:

                linux.git v3.10..v3.11

            log     log -c     log -c --merges

    before  1.9s   16.6s      15.5s
    after   1.9s    2.4s       1.1s(*)

The result stayed the same.

(*) that 15.5/1.1 = ~14.1 ratio, is the most appropriate estimate for
    combine-diff speedup.

Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru>
---
 combine-diff.c |  85 +++++++++++-
 diff.c         |   1 +
 diff.h         |   6 +
 tree-diff.c    | 411 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 498 insertions(+), 5 deletions(-)

diff --git a/combine-diff.c b/combine-diff.c
index 427a7c1..3e3f328 100644
--- a/combine-diff.c
+++ b/combine-diff.c
@@ -1305,7 +1305,7 @@ static const char *path_path(void *obj)
 
 
 /* find set of paths that every parent touches */
-static struct combine_diff_path *find_paths(const unsigned char *sha1,
+static struct combine_diff_path *find_paths_generic(const unsigned char *sha1,
 	const struct sha1_array *parents, struct diff_options *opt)
 {
 	struct combine_diff_path *paths = NULL;
@@ -1318,6 +1318,7 @@ static struct combine_diff_path *find_paths(const unsigned char *sha1,
 	/* tell diff_tree to emit paths in sorted (=tree) order */
 	opt->orderfile = NULL;
 
+	/* D(A,P1...Pn) = D(A,P1) ^ ... ^ D(A,Pn)  (wrt paths) */
 	for (i = 0; i < num_parent; i++) {
 		/* show stat against the first parent even
 		 * when doing combined diff.
@@ -1347,6 +1348,35 @@ static struct combine_diff_path *find_paths(const unsigned char *sha1,
 }
 
 
+/*
+ * find set of paths that everybody touches, assuming diff is run without
+ * rename/copy detection, etc, comparing all trees simultaneously (= faster).
+ */
+static struct combine_diff_path *find_paths_multitree(
+	const unsigned char *sha1, const struct sha1_array *parents,
+	struct diff_options *opt)
+{
+	int i, nparent = parents->nr;
+	const unsigned char **parents_sha1;
+
+	parents_sha1 = xmalloc(nparent * sizeof(parents_sha1[0]));
+	for (i = 0; i < nparent; i++)
+		parents_sha1[i] = parents->sha1[i];
+
+	/* fake list head, so worker can assume it is non-NULL */
+	struct combine_diff_path paths_head;
+	paths_head.next = NULL;
+
+	struct strbuf base;
+	strbuf_init(&base, PATH_MAX);
+
+	diff_tree_combined_paths(&paths_head, sha1, parents_sha1, nparent, &base, opt);
+
+	free(parents_sha1);
+	return paths_head.next;
+}
+
+
 void diff_tree_combined(const unsigned char *sha1,
 			const struct sha1_array *parents,
 			int dense,
@@ -1356,6 +1386,7 @@ void diff_tree_combined(const unsigned char *sha1,
 	struct diff_options diffopts;
 	struct combine_diff_path *p, *paths;
 	int i, num_paths, needsep, show_log_first, num_parent = parents->nr;
+	int need_generic_pathscan;
 
 	/* nothing to do, if no parents */
 	if (!num_parent)
@@ -1378,11 +1409,55 @@ void diff_tree_combined(const unsigned char *sha1,
 
 	/* find set of paths that everybody touches
 	 *
-	 * NOTE find_paths() also handles --stat, as it computes
-	 * diff(sha1,parent_i) for all i to do the job, specifically
-	 * for parent0.
+	 * NOTE
+	 *
+	 * Diffcore transformations are bound to diff_filespec and logic
+	 * comparing two entries - i.e. they do not apply directly to combine
+	 * diff.
+	 *
+	 * If some of such transformations is requested - we launch generic
+	 * path scanning, which works significantly slower compared to
+	 * simultaneous all-trees-in-one-go scan in find_paths_multitree().
+	 *
+	 * TODO some of the filters could be ported to work on
+	 * combine_diff_paths - i.e. all functionality that skips paths, so in
+	 * theory, we could end up having only multitree path scanning.
+	 *
+	 * NOTE please keep this semantically in sync with diffcore_std()
 	 */
-	paths = find_paths(sha1, parents, &diffopts);
+	need_generic_pathscan = opt->skip_stat_unmatch	||
+			DIFF_OPT_TST(opt, FOLLOW_RENAMES)	||
+			opt->break_opt != -1	||
+			opt->detect_rename	||
+			opt->pickaxe		||
+			opt->filter;
+
+
+	if (need_generic_pathscan) {
+		/* NOTE generic case also handles --stat, as it computes
+		 * diff(sha1,parent_i) for all i to do the job, specifically
+		 * for parent0.
+		 */
+		paths = find_paths_generic(sha1, parents, &diffopts);
+	}
+	else {
+		paths = find_paths_multitree(sha1, parents, &diffopts);
+
+		/* show stat against the first parent even
+		 * when doing combined diff.
+		 */
+		int stat_opt = (opt->output_format &
+				(DIFF_FORMAT_NUMSTAT|DIFF_FORMAT_DIFFSTAT));
+		if (stat_opt) {
+			diffopts.output_format = stat_opt;
+
+			diff_tree_sha1(parents->sha1[0], sha1, "", &diffopts);
+			diffcore_std(&diffopts);
+			if (opt->orderfile)
+				diffcore_order(opt->orderfile);
+			diff_flush(&diffopts);
+		}
+	}
 
 	/* find out number of surviving paths */
 	for (num_paths = 0, p = paths; p; p = p->next)
diff --git a/diff.c b/diff.c
index 49e71f4..29ede0b 100644
--- a/diff.c
+++ b/diff.c
@@ -4755,6 +4755,7 @@ void diffcore_fix_diff_index(struct diff_options *options)
 
 void diffcore_std(struct diff_options *options)
 {
+	/* NOTE please keep the following in sync with diff_tree_combined() */
 	if (options->skip_stat_unmatch)
 		diffcore_skip_stat_unmatch(options);
 	if (!options->found_follow) {
diff --git a/diff.h b/diff.h
index a24a767..5496560 100644
--- a/diff.h
+++ b/diff.h
@@ -214,6 +214,12 @@ struct combine_diff_path {
 extern void show_combined_diff(struct combine_diff_path *elem, int num_parent,
 			      int dense, struct rev_info *);
 
+extern
+struct combine_diff_path *diff_tree_combined_paths(
+	struct combine_diff_path *p, const unsigned char *sha1,
+	const unsigned char **parent_sha1, int nparent,
+	struct strbuf *base, struct diff_options *opt);
+
 extern void diff_tree_combined(const unsigned char *sha1, const struct sha1_array *parents, int dense, struct rev_info *rev);
 
 extern void diff_tree_combined_merge(const struct commit *commit, int dense, struct rev_info *rev);
diff --git a/tree-diff.c b/tree-diff.c
index f7b3ade..7f4f917 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -172,6 +172,417 @@ int diff_tree(struct tree_desc *t1, struct tree_desc *t2,
 	return 0;
 }
 
+
+/*
+ * Compare two tree descriptors, taking into account only path/mode,
+ * but not their sha1's.
+ *
+ * NOTE files and directories *always* compare differently, even when having
+ *      the same name - thanks to base_name_compare().
+ *
+ * NOTE empty (=invalid) descriptor(s) take part in comparison as +infty.
+ */
+static int tree_desc_pathcmp(struct tree_desc *t1, struct tree_desc *t2)
+{
+	const unsigned char *t1_sha1, *t2_sha1;
+	const char *path1, *path2;
+	int path1_len, path2_len;
+	unsigned mode1, mode2;
+	int cmp;
+
+	if (!t1->size)
+		return t2->size ? +1 /* +∞ > c */  : 0 /* +∞ = +∞ */;
+	else if (!t2->size)
+		return -1;	/* c < +∞ */
+
+
+	t1_sha1 = tree_entry_extract(t1, &path1, &mode1);
+	t2_sha1 = tree_entry_extract(t2, &path2, &mode2);
+
+	path1_len = tree_entry_len(&t1->entry);
+	path2_len = tree_entry_len(&t2->entry);
+
+	cmp = base_name_compare(path1, path1_len, mode1,
+				path2, path2_len, mode2);
+	return cmp;
+}
+
+
+/*
+ * Make a new combine_diff_path from path/mode/sha1
+ * and append it to paths list tail.
+ *
+ * p->parent[] remains uninitialized.
+ */
+static struct combine_diff_path *__path_appendnew(struct combine_diff_path *last,
+	int nparent, const struct strbuf *base, const char *path, int pathlen,
+	unsigned mode, const unsigned char *sha1)
+{
+	struct combine_diff_path *p;
+	int len = base->len + pathlen;
+
+	p = xmalloc(combine_diff_path_size(nparent, len));
+	p->next = NULL;
+	p->path = (char *)&(p->parent[nparent]);
+	memcpy(p->path, base->buf, base->len);
+	memcpy(p->path + base->len, path, pathlen);
+	p->path[len] = 0;
+	p->mode = mode;
+	hashcpy(p->sha1, sha1);
+
+	last->next = p;
+	return p;
+}
+
+/* new path should be added to combine diff
+ *
+ * 3 cases on how/when it should be called and behaves:
+ *
+ *	 t, !tp		-> path added, all parents lack it
+ *	!t,  tp		-> path removed from all parents
+ *	 t,  tp		-> path modified/added
+ *			   (M for tp[i]=tp[imin], A otherwise)
+ */
+static struct combine_diff_path *path_appendnew(struct combine_diff_path *p,
+	struct strbuf *base, struct diff_options *opt, int nparent,
+	struct tree_desc *t, struct tree_desc *tp,
+	int imin, int *xmin_pathcmpv)
+{
+	int i, isdir, recurse=0, emitthis=1;
+
+	const unsigned char *sha1;
+	const char *path;
+	int pathlen;
+	unsigned mode;
+
+	/* at least something has to be valid */
+	assert(t || tp);
+
+	if (t) {
+		/* path present in resulting tree */
+		sha1 = tree_entry_extract(t, &path, &mode);
+		pathlen = tree_entry_len(&t->entry);
+		isdir = S_ISDIR(mode);
+	}
+	else {
+		/* a path was removed - take path from imin parent. Also take
+		 * mode from that parent, to decide on recursion(1).
+		 *
+		 * 1) all modes for tp[k]=tp[imin] should be the same wrt
+		 *    S_ISDIR, thanks to base_name_compare().
+		 */
+		tree_entry_extract(&tp[imin], &path, &mode);
+		pathlen = tree_entry_len(&tp[imin].entry);
+
+		isdir = S_ISDIR(mode);
+		sha1 = null_sha1;
+		mode = 0;
+	}
+
+
+	if (DIFF_OPT_TST(opt, RECURSIVE) && isdir) {
+		recurse = 1;
+		emitthis = DIFF_OPT_TST(opt, TREE_IN_RECURSIVE);
+	}
+
+	if (emitthis) {
+		p = __path_appendnew(p, nparent, base, path, pathlen, mode, sha1);
+		for (i = 0; i < nparent; ++i) {
+			/* tp[i] is valid, if present and if tp[i]==tp[imin] -
+			 * otherwise, we should ignore it.
+			 */
+			int tpi_valid = tp && !xmin_pathcmpv[i];
+
+			const unsigned char *sha1_i;
+			unsigned mode_i;
+
+			p->parent[i].status =
+				!t ? DIFF_STATUS_DELETED :
+					tpi_valid ?
+						DIFF_STATUS_MODIFIED :
+						DIFF_STATUS_ADDED;
+
+			if (tpi_valid) {
+				sha1_i = tp[i].entry.sha1;
+				mode_i = canon_mode(tp[i].entry.mode);
+			}
+			else {
+				sha1_i = null_sha1;
+				mode_i = 0;
+			}
+
+			p->parent[i].mode = mode_i;
+			hashcpy(p->parent[i].sha1, sha1_i);
+		}
+	}
+
+	if (recurse) {
+		const unsigned char **parents_sha1;
+		int old_baselen = base->len;
+
+		parents_sha1 = xmalloc(nparent * sizeof(parents_sha1[0]));
+		for (i = 0; i < nparent; ++i) {
+			/* same rule as in emitthis */
+			int tpi_valid = tp && !xmin_pathcmpv[i];
+
+			parents_sha1[i] = tpi_valid ? tp[i].entry.sha1
+						    : null_sha1;
+		}
+
+		/* base += path+'/' */
+		strbuf_add(base, path, pathlen);
+		strbuf_addch(base, '/');
+
+		p = diff_tree_combined_paths(p, sha1, parents_sha1, nparent, base, opt);
+
+		/* restore base */
+		strbuf_setlen(base, old_baselen);
+
+		free(parents_sha1);
+	}
+
+	return p;
+}
+
+
+/* Like fill_tree_descriptor, but returns empty tree_desc for null_sha1 */
+static void *xfill_tree_descriptor(struct tree_desc *t, const unsigned char *sha1)
+{
+	if (!is_null_sha1(sha1))
+		return fill_tree_descriptor(t, sha1);
+
+	init_tree_desc(t, NULL, 0);
+	return NULL;
+}
+
+
+/* generate paths for combined diff D(sha1,parents_sha1[])
+ *
+ * The paths are generated scanning new tree and all parents trees
+ * simultaneously, similarly to what diff_tree() does for 2 trees. The theory
+ * behind such scan is as follows:
+ *
+ *
+ * D(A,X1...Xn) calculation scheme
+ * -------------------------------
+ *
+ * D(A,X1...Xn) = D(A,X1) ^ ... ^ D(A,Xn)	(regarding resulting paths set)
+ *
+ *	D(A,Xj)		- diff between A..Xj
+ *	D(A,X1...Xn)	- combined diff from A to parents X1,...,Xn
+ *
+ *
+ * We start from all trees, which are sorted, and compare their entries in
+ * lock-step:
+ *
+ *	 A     X1       Xn
+ *	 -     -        -
+ *	|a|   |x1|     |xn|
+ *	|-|   |--| ... |--|      i = argmin(x1...xn)
+ *	| |   |  |     |  |
+ *	|-|   |--|     |--|
+ *	|.|   |. |     |. |
+ *	 .     .        .
+ *	 .     .        .
+ *
+ * at any time there could be 3 cases:
+ *
+ *	1)  a < xi;
+ *	2)  a > xi;
+ *	3)  a = xi.
+ *
+ * Schematic deduction of what every case means, and what to do, follows:
+ *
+ * 1)  a < xi  ->  ∀j a ∉ Xj  ->  "+a" ∈ D(A,Xj)  ->  D += "+a";  a↓
+ *
+ * 2)  a > xi
+ *
+ *     2.1) ∃j: xj > xi  ->  "-xi" ∉ D(A,Xj)  ->  D += ø;  ∀ xk=xi  xk↓
+ *     2.2) ∀j  xj = xi  ->  xj ∉ A  ->  "-xj" ∈ D(A,Xj)  ->  D += "-xi";  ∀j xj↓
+ *
+ * 3)  a = xi
+ *
+ *     3.1) ∃j: xj > xi  ->  "+a" ∈ D(A,Xj)  ->  only xk=xi remains to investigate
+ *     3.2) xj = xi  ->  investigate δ(a,xj)
+ *      |
+ *      |
+ *      v
+ *
+ *     3.1+3.2) looking at δ(a,xk) ∀k: xk=xi - if all != ø  ->
+ *
+ *                       ⎧δ(a,xk)  - if xk=xi
+ *              ->  D += ⎨
+ *                       ⎩"+a"     - if xk>xi
+ *
+ *
+ *     in any case a↓  ∀ xk=xi  xk↓
+ */
+struct combine_diff_path *diff_tree_combined_paths(
+	struct combine_diff_path *p, const unsigned char *sha1,
+	const unsigned char **parents_sha1, int nparent,
+	struct strbuf *base, struct diff_options *opt)
+{
+	struct tree_desc t, *tp;
+	void *ttree, **tptree;
+	int *xmin_pathcmpv;	/* [i] = pathcmp(tp[i], tp[imin]) */
+	int i;
+
+	tp     = xmalloc(nparent * sizeof(tp[0]));
+	tptree = xmalloc(nparent * sizeof(tptree[0]));
+
+	ttree = xfill_tree_descriptor(&t, sha1);
+	for (i = 0; i < nparent; i++)
+		tptree[i] = xfill_tree_descriptor(&tp[i], parents_sha1[i]);
+
+	xmin_pathcmpv = xmalloc(nparent * sizeof(xmin_pathcmpv[0]));
+
+	/* Enable recursion indefinitely */
+	opt->pathspec.recursive = DIFF_OPT_TST(opt, RECURSIVE);
+
+	for (;;) {
+		int imin, xmin_eqtotal, cmp, alldifferent;
+		int a_next, tp_next;
+		const unsigned char *entry_sha1;
+		const char *path;
+		unsigned mode;
+
+
+		/* TODO Something similiar to if diff_can_quit_early -> break */
+
+		if (opt->pathspec.nr) {
+			skip_uninteresting(&t, base, opt);
+			for (i = 0; i < nparent; i++)
+				skip_uninteresting(&tp[i], base, opt);
+		}
+
+		/* comparing is finished when all trees are done */
+		if (!t.size) {
+			int done = 1;
+			for (i = 0; i < nparent; ++i)
+				if (tp[i].size) {
+					done = 0;
+					break;
+				}
+			if (done)
+				break;
+		}
+
+
+
+		/* lookup imin = argmin(x1...xn),  build X compare vector along the way */
+		imin = 0;		/* argmin(x1...xn)	*/
+		xmin_eqtotal = 1;	/* len([xk: xk=ximin])	*/
+		xmin_pathcmpv[0] = 0;
+
+		for (i = 1; i < nparent; ++i) {
+			cmp = tree_desc_pathcmp(&tp[i], &tp[imin]);
+			if (cmp == -1) {
+				imin = i;
+				xmin_eqtotal = 1;
+				xmin_pathcmpv[i] = 0;
+			}
+			else {
+				xmin_pathcmpv[i] = cmp;
+				if (cmp == 0)
+					xmin_eqtotal++;
+			}
+		}
+
+		/* fixup compare vector for entries before imin */
+		for (i = 0; i < imin; ++i)
+			xmin_pathcmpv[i] = +1;	/* x[i] > x[imin] */
+
+
+		/* what tree entries will we update, in the end of this step? */
+		a_next  = 0;	/* a */
+		tp_next = 0;	/* tp[k: xk=xmin] */
+
+
+		/* compare a vs x[imin] */
+		cmp = tree_desc_pathcmp(&t, &tp[imin]);
+
+		switch (cmp) {
+		/* a < xi */
+		case -1:
+			/* D += "+a" */
+			p = path_appendnew(p, base, opt, nparent,
+					&t, /*tp=*/NULL, -1, NULL);
+
+			/* a↓ */
+			a_next = 1;
+			break;
+
+
+		/* a > xi */
+		case +1:
+			/* ∀j xj=ximin -> D += "-xi" */
+			if (xmin_eqtotal == nparent)
+				p = path_appendnew(p, base, opt, nparent,
+					/*t=*/NULL, tp, imin, xmin_pathcmpv);
+
+			/* ∀ xk=ximin  xk↓ */
+			tp_next = 1;
+			break;
+
+
+		/* a = xi */
+		case 0:
+			/* are either xk > xi or diff(a,xk) != ø ? */
+			alldifferent = 1;
+			entry_sha1 = tree_entry_extract(&t, &path, &mode);
+			for (i = 0; i < nparent; i++) {
+				const unsigned char *sha1_i;
+				const char *path_i;
+				unsigned mode_i;
+
+				/* x[i] > x[imin] */
+				if (xmin_pathcmpv[i])
+					continue;
+
+				/* diff(a,xk) != ø */
+				sha1_i = tree_entry_extract(&tp[i], &path_i, &mode_i);
+				if (hashcmp(entry_sha1, sha1_i) || (mode != mode_i))
+					continue;
+
+				alldifferent = 0;
+				break;
+			}
+
+			/* D += {δ(a,xk) if xk=xi;  "+a" if xk > xi} */
+			if (alldifferent)
+				p = path_appendnew(p, base, opt, nparent,
+						&t, tp, imin, xmin_pathcmpv);
+
+			/* a↓,  ∀ xk=ximin  xk↓ */
+			a_next  = 1;
+			tp_next = 1;
+			break;
+		}
+
+		/* perform scheduled tree entries updates: */
+		if (a_next)
+			/* a↓ */
+			update_tree_entry(&t);
+
+		if (tp_next)
+			/* ∀ xk=xi  xk↓ */
+			for (i = 0; i < nparent; ++i)
+				if (!xmin_pathcmpv[i])
+					update_tree_entry(&tp[i]);
+	}
+
+	free(xmin_pathcmpv);
+	for (i = nparent-1; i >= 0; i--)
+		free(tptree[i]);
+	free(ttree);
+	free(tptree);
+	free(tp);
+
+	return p;
+}
+
+
+
 /*
  * Does it look like the resulting diff might be due to a rename?
  *  - single entry
-- 
1.9.rc1.181.g641f458

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 8/8] combine-diff: bail out early, if num_paths=0
  2014-02-03 12:47 [PATCH 0/8] `log -c` speedup (part 2) Kirill Smelkov
                   ` (6 preceding siblings ...)
  2014-02-03 12:47 ` [PATCH 7/8] combine-diff: Fast changed-to-all-parents paths scanning Kirill Smelkov
@ 2014-02-03 12:47 ` Kirill Smelkov
  7 siblings, 0 replies; 26+ messages in thread
From: Kirill Smelkov @ 2014-02-03 12:47 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git, Kirill Smelkov

That simplifies the code - instead of repeated checking for
num_paths !=0, let's verify it once, and return if it is, and
free following code from repeated ifs.

Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru>
---
 combine-diff.c | 52 +++++++++++++++++++++++++++-------------------------
 1 file changed, 27 insertions(+), 25 deletions(-)

diff --git a/combine-diff.c b/combine-diff.c
index 3e3f328..c1f481f 100644
--- a/combine-diff.c
+++ b/combine-diff.c
@@ -1459,12 +1459,18 @@ void diff_tree_combined(const unsigned char *sha1,
 		}
 	}
 
+	free_pathspec(&diffopts.pathspec);
+
 	/* find out number of surviving paths */
 	for (num_paths = 0, p = paths; p; p = p->next)
 		num_paths++;
 
+	/* nothing to do, if no paths */
+	if (!num_paths)
+		return;
+
 	/* order paths according to diffcore_order */
-	if (opt->orderfile && num_paths) {
+	if (opt->orderfile) {
 		struct obj_order *o;
 
 		o = xmalloc(sizeof(*o) * num_paths);
@@ -1483,28 +1489,26 @@ void diff_tree_combined(const unsigned char *sha1,
 	}
 
 
-	if (num_paths) {
-		if (opt->output_format & (DIFF_FORMAT_RAW |
-					  DIFF_FORMAT_NAME |
-					  DIFF_FORMAT_NAME_STATUS)) {
-			for (p = paths; p; p = p->next)
-				show_raw_diff(p, num_parent, rev);
-			needsep = 1;
-		}
-		else if (opt->output_format &
-			 (DIFF_FORMAT_NUMSTAT|DIFF_FORMAT_DIFFSTAT))
-			needsep = 1;
-		else if (opt->output_format & DIFF_FORMAT_CALLBACK)
-			handle_combined_callback(opt, paths, num_parent, num_paths);
-
-		if (opt->output_format & DIFF_FORMAT_PATCH) {
-			if (needsep)
-				printf("%s%c", diff_line_prefix(opt),
-				       opt->line_termination);
-			for (p = paths; p; p = p->next)
-				show_patch_diff(p, num_parent, dense,
-						0, rev);
-		}
+	if (opt->output_format & (DIFF_FORMAT_RAW |
+				  DIFF_FORMAT_NAME |
+				  DIFF_FORMAT_NAME_STATUS)) {
+		for (p = paths; p; p = p->next)
+			show_raw_diff(p, num_parent, rev);
+		needsep = 1;
+	}
+	else if (opt->output_format &
+		 (DIFF_FORMAT_NUMSTAT|DIFF_FORMAT_DIFFSTAT))
+		needsep = 1;
+	else if (opt->output_format & DIFF_FORMAT_CALLBACK)
+		handle_combined_callback(opt, paths, num_parent, num_paths);
+
+	if (opt->output_format & DIFF_FORMAT_PATCH) {
+		if (needsep)
+			printf("%s%c", diff_line_prefix(opt),
+			       opt->line_termination);
+		for (p = paths; p; p = p->next)
+			show_patch_diff(p, num_parent, dense,
+					0, rev);
 	}
 
 	/* Clean things up */
@@ -1513,8 +1517,6 @@ void diff_tree_combined(const unsigned char *sha1,
 		paths = paths->next;
 		free(tmp);
 	}
-
-	free_pathspec(&diffopts.pathspec);
 }
 
 void diff_tree_combined_merge(const struct commit *commit, int dense,
-- 
1.9.rc1.181.g641f458

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/8] fixup! combine_diff: simplify intersect_paths() further
  2014-02-03 12:47 ` [PATCH 1/8] fixup! combine_diff: simplify intersect_paths() further Kirill Smelkov
@ 2014-02-03 19:40   ` Junio C Hamano
  0 siblings, 0 replies; 26+ messages in thread
From: Junio C Hamano @ 2014-02-03 19:40 UTC (permalink / raw
  To: Kirill Smelkov; +Cc: git

Kirill Smelkov <kirr@mns.spb.ru> writes:

> That cleanup patch is good, but I've found a bug in it. In the item removal
> code
>
>> +                      /* p->path not in q->queue[]; drop it */
>> +                      struct combine_diff_path *next = p->next;
>> +
>> +                      if ((*tail = next) != NULL)
>> +                              tail = &next->next;
>> +                      free(p);
>>                        continue;
>
>         *tail = next
>
> is right, but
>
>         tail = &next->next
>
> is wrong,

Heh, surely.  We just have skipped, and the fact that tail points at
the pointer variable that points at the first of the remaining items
does not change with this skipping of next by assigning it to *tail.  
An extra assignment to tail will skip one more, which is unnecessary.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/8] tests: add checking that combine-diff emits only correct paths
  2014-02-03 12:47 ` [PATCH 2/8] tests: add checking that combine-diff emits only correct paths Kirill Smelkov
@ 2014-02-03 23:10   ` Junio C Hamano
  2014-02-05 10:36     ` Kirill Smelkov
  0 siblings, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2014-02-03 23:10 UTC (permalink / raw
  To: Kirill Smelkov; +Cc: git

Kirill Smelkov <kirr@mns.spb.ru> writes:

> where "correct paths" stands for paths that are different to all
> parents.
>
> Up until now, we were testing combined diff only on one file, or on
> several files which were all different (t4038-diff-combined.sh).
>
> As recent thinko in "simplify intersect_paths() further" showed, and
> also, since we are going to rework code for finding paths different to
> all parents, lets write at least basic tests.

Thanks.  Some nitpicks.

>
> Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru>
> ---
>  t/t4057-diff-combined-paths.sh | 106 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 106 insertions(+)
>  create mode 100755 t/t4057-diff-combined-paths.sh
>
> diff --git a/t/t4057-diff-combined-paths.sh b/t/t4057-diff-combined-paths.sh
> new file mode 100755
> index 0000000..e6e457d
> --- /dev/null
> +++ b/t/t4057-diff-combined-paths.sh
> @@ -0,0 +1,106 @@
> +#!/bin/sh
> +
> +test_description='combined diff show only paths that are different to all parents'
> +
> +. ./test-lib.sh
> +
> +# verify that diffc.expect matches output of
> +# `git diff -c --name-only HEAD HEAD^ HEAD^2`
> +diffc_verify() {

Style: SP before (), i.e.

	diffc_verify () {

> +	git diff -c --name-only HEAD HEAD^ HEAD^2 >diffc.actual &&
> +	test_cmp diffc.expect diffc.actual
> +}
> +
> +test_expect_success 'trivial merge - combine-diff empty' '
> +	for i in `test_seq 1 9`

Style: prefer $() over ``

> +	do
> +		echo $i >$i.txt	&&
> +		git add $i.txt

Quiz.  What happens when this "git add" fails with an earlier value
of $i?

> +	done &&

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/8] tree-diff: no need to manually verify that there is no mode change for a path
  2014-02-03 12:47 ` [PATCH 3/8] tree-diff: no need to manually verify that there is no mode change for a path Kirill Smelkov
@ 2014-02-03 23:12   ` Junio C Hamano
  0 siblings, 0 replies; 26+ messages in thread
From: Junio C Hamano @ 2014-02-03 23:12 UTC (permalink / raw
  To: Kirill Smelkov; +Cc: git

Kirill Smelkov <kirr@mns.spb.ru> writes:

> Because if there is, such two tree entries would never be compared as
> equal - the code in base_name_compare() explicitly compares modes, if
> there is a change for dir bit, even for equal paths, entries would
> compare as different.

OK.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/8] combine-diff: move show_log_first logic/action out of paths scanning
  2014-02-03 12:47 ` [PATCH 5/8] combine-diff: move show_log_first logic/action out of paths scanning Kirill Smelkov
@ 2014-02-03 23:21   ` Junio C Hamano
  0 siblings, 0 replies; 26+ messages in thread
From: Junio C Hamano @ 2014-02-03 23:21 UTC (permalink / raw
  To: Kirill Smelkov; +Cc: git

Kirill Smelkov <kirr@mns.spb.ru> writes:

> Judging from sample outputs and tests nothing changes in diff -c output,

Yuck.

I do not think the processing done inside the loop for the first
path (i.e. i==0) before we call show_log(rev) affects what that
called show_log(rev) does, so it probably is a good readability
change.

Thanks.

> and this change will help later patches, when we'll be refactoring paths
> scanning into its own function with several variants - the
> show_log_first logic / code will stay common to all of them.
>
> NOTE: only now we have to take care to explicitly not show anything if
>     parents array is empty, as in fact there are some clients in Git code,
>     which calls diff_tree_combined() in such a way.
>
> Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru>
> ---
>  combine-diff.c | 24 ++++++++++++++----------
>  1 file changed, 14 insertions(+), 10 deletions(-)
>
> diff --git a/combine-diff.c b/combine-diff.c
> index a03147c..272931f 100644
> --- a/combine-diff.c
> +++ b/combine-diff.c
> @@ -1313,6 +1313,20 @@ void diff_tree_combined(const unsigned char *sha1,
>  	struct combine_diff_path *p, *paths = NULL;
>  	int i, num_paths, needsep, show_log_first, num_parent = parents->nr;
>  
> +	/* nothing to do, if no parents */
> +	if (!num_parent)
> +		return;
> +
> +	show_log_first = !!rev->loginfo && !rev->no_commit_id;
> +	needsep = 0;
> +	if (show_log_first) {
> +		show_log(rev);
> +
> +		if (rev->verbose_header && opt->output_format)
> +			printf("%s%c", diff_line_prefix(opt),
> +			       opt->line_termination);
> +	}
> +
>  	diffopts = *opt;
>  	copy_pathspec(&diffopts.pathspec, &opt->pathspec);
>  	diffopts.output_format = DIFF_FORMAT_NO_OUTPUT;
> @@ -1321,8 +1335,6 @@ void diff_tree_combined(const unsigned char *sha1,
>  	/* tell diff_tree to emit paths in sorted (=tree) order */
>  	diffopts.orderfile = NULL;
>  
> -	show_log_first = !!rev->loginfo && !rev->no_commit_id;
> -	needsep = 0;
>  	/* find set of paths that everybody touches */
>  	for (i = 0; i < num_parent; i++) {
>  		/* show stat against the first parent even
> @@ -1338,14 +1350,6 @@ void diff_tree_combined(const unsigned char *sha1,
>  		diffcore_std(&diffopts);
>  		paths = intersect_paths(paths, i, num_parent);
>  
> -		if (show_log_first && i == 0) {
> -			show_log(rev);
> -
> -			if (rev->verbose_header && opt->output_format)
> -				printf("%s%c", diff_line_prefix(opt),
> -				       opt->line_termination);
> -		}
> -
>  		/* if showing diff, show it in requested order */
>  		if (diffopts.output_format != DIFF_FORMAT_NO_OUTPUT &&
>  		    opt->orderfile) {

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 7/8] combine-diff: Fast changed-to-all-parents paths scanning
  2014-02-03 12:47 ` [PATCH 7/8] combine-diff: Fast changed-to-all-parents paths scanning Kirill Smelkov
@ 2014-02-03 23:26   ` Junio C Hamano
  2014-02-03 23:39     ` Junio C Hamano
  2014-02-04  0:00   ` Junio C Hamano
  1 sibling, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2014-02-03 23:26 UTC (permalink / raw
  To: Kirill Smelkov; +Cc: git

Kirill Smelkov <kirr@mns.spb.ru> writes:

> As was recently shown (c839f1bd "combine-diff: optimize
> combine_diff_path sets intersection"), combine-diff runs very slowly. In
> that commit we optimized paths sets intersection, but that accounted
> only for ~ 25% of the slowness, and as my tracing showed, for linux.git
> v3.10..v3.11, for merges a lot of time is spent computing
> diff(commit,commit^2) just to only then intersect that huge diff to
> almost small set of files from diff(commit,commit^1).
>
> That's because at present, to compute combine-diff, for first finding
> paths, that "every parent touches", we use the following combine-diff
> property/definition:
>
>     D(A,P1...Pn) = D(A,P1) ^ ... ^ D(A,Pn)      (w.r.t. paths)
>
> where
>
>     D(A,P1...Pn) is combined diff between commit A, and parents Pi
>
> and
>
>     D(A,Pi) is usual two-tree diff Pi..A

and A ^ B means what???

I do like the approach of walking the tree entries and stop as
shallowly as possible without recursing.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 7/8] combine-diff: Fast changed-to-all-parents paths scanning
  2014-02-03 23:26   ` Junio C Hamano
@ 2014-02-03 23:39     ` Junio C Hamano
  2014-02-04 16:34       ` Kirill Smelkov
  0 siblings, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2014-02-03 23:39 UTC (permalink / raw
  To: Kirill Smelkov; +Cc: git

Junio C Hamano <gitster@pobox.com> writes:

> Kirill Smelkov <kirr@mns.spb.ru> writes:
>
>> As was recently shown (c839f1bd "combine-diff: optimize
>> combine_diff_path sets intersection"), combine-diff runs very slowly. In
>> that commit we optimized paths sets intersection, but that accounted
>> only for ~ 25% of the slowness, and as my tracing showed, for linux.git
>> v3.10..v3.11, for merges a lot of time is spent computing
>> diff(commit,commit^2) just to only then intersect that huge diff to
>> almost small set of files from diff(commit,commit^1).
>>
>> That's because at present, to compute combine-diff, for first finding
>> paths, that "every parent touches", we use the following combine-diff
>> property/definition:
>>
>>     D(A,P1...Pn) = D(A,P1) ^ ... ^ D(A,Pn)      (w.r.t. paths)
>>
>> where
>>
>>     D(A,P1...Pn) is combined diff between commit A, and parents Pi
>>
>> and
>>
>>     D(A,Pi) is usual two-tree diff Pi..A
>
> and A ^ B means what???

Silly me; obviously it is the "set intersection" operator.

We probably could instead use the "current" set of paths as literal
pathspec to compute subsequent paths, i.e.

	D(A,Pi,PS) is two tree diff P1..A limited to paths PS

	D(A,P1...Pn) = D(A,P1,[]) ^
        	       D(A,P2,D(A,P1,[])) ^
                       ...
        	       D(A,Pn,D(A,P1...Pn-1))

if we did not want to reinvent the whole tree walking thing, which
would add risks for new bugs and burden to maintain this and the
usual two-tree diff tree walking in sync.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 7/8] combine-diff: Fast changed-to-all-parents paths scanning
  2014-02-03 12:47 ` [PATCH 7/8] combine-diff: Fast changed-to-all-parents paths scanning Kirill Smelkov
  2014-02-03 23:26   ` Junio C Hamano
@ 2014-02-04  0:00   ` Junio C Hamano
  1 sibling, 0 replies; 26+ messages in thread
From: Junio C Hamano @ 2014-02-04  0:00 UTC (permalink / raw
  To: Kirill Smelkov; +Cc: git

Kirill Smelkov <kirr@mns.spb.ru> writes:

> +	parents_sha1 = xmalloc(nparent * sizeof(parents_sha1[0]));
> +	for (i = 0; i < nparent; i++)
> +		parents_sha1[i] = parents->sha1[i];
> +
> +	/* fake list head, so worker can assume it is non-NULL */
> +	struct combine_diff_path paths_head;

decl-after-statement; please fix.

> +	paths_head.next = NULL;
> +
> +	struct strbuf base;

likewise.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 7/8] combine-diff: Fast changed-to-all-parents paths scanning
  2014-02-03 23:39     ` Junio C Hamano
@ 2014-02-04 16:34       ` Kirill Smelkov
  2014-02-04 18:37         ` Junio C Hamano
  0 siblings, 1 reply; 26+ messages in thread
From: Kirill Smelkov @ 2014-02-04 16:34 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git

On Mon, Feb 03, 2014 at 03:39:02PM -0800, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
> 
> > Kirill Smelkov <kirr@mns.spb.ru> writes:
> >
> >> As was recently shown (c839f1bd "combine-diff: optimize
> >> combine_diff_path sets intersection"), combine-diff runs very slowly. In
> >> that commit we optimized paths sets intersection, but that accounted
> >> only for ~ 25% of the slowness, and as my tracing showed, for linux.git
> >> v3.10..v3.11, for merges a lot of time is spent computing
> >> diff(commit,commit^2) just to only then intersect that huge diff to
> >> almost small set of files from diff(commit,commit^1).
> >>
> >> That's because at present, to compute combine-diff, for first finding
> >> paths, that "every parent touches", we use the following combine-diff
> >> property/definition:
> >>
> >>     D(A,P1...Pn) = D(A,P1) ^ ... ^ D(A,Pn)      (w.r.t. paths)
> >>
> >> where
> >>
> >>     D(A,P1...Pn) is combined diff between commit A, and parents Pi
> >>
> >> and
> >>
> >>     D(A,Pi) is usual two-tree diff Pi..A
> >
> > and A ^ B means what???
> 
> Silly me; obviously it is the "set intersection" operator.
> 
> We probably could instead use the "current" set of paths as literal
> pathspec to compute subsequent paths, i.e.
> 
> 	D(A,Pi,PS) is two tree diff P1..A limited to paths PS
> 
> 	D(A,P1...Pn) = D(A,P1,[]) ^
>         	       D(A,P2,D(A,P1,[])) ^
>                        ...
>         	       D(A,Pn,D(A,P1...Pn-1))
> 
> if we did not want to reinvent the whole tree walking thing, which
> would add risks for new bugs and burden to maintain this and the
> usual two-tree diff tree walking in sync.

Junio, I understand it is not good to duplicate code and increase
maintenance risks. On the other hand, I don't quite like the approach
with keeping current paths - it could work and be faster compared to
what we had, but to me it is still suboptimal, because if the first diff
D(A,P1) is huge then oops. Also, to implement it rationally, without
delving into unneeded recursion, we would need to do the diffing without
recursion, intersect results, and then recurse into overlapping subtrees,
which means tree-walker rework anyway.


Instead I propose we switch to the new tree walker completely. With the
attached patch which draftly shows it (diff_tree is gone, the work is
done through diff_tree_combined_paths, and then the result is
"exported" to diff_filepairs) all the tests pass. Also, timings for

    git log --raw --no-abbrev --no-renames
    
    ( without -c - it would be the same - we are not touching that code, it
      would only add, irrelevant-to-the-change constant time )

are

    linux.git   v3.10..v3.11    became 0.1% _faster_
    navy.git                    became 1.4% slower


That means, that the new tree-walker is correct and should be ok
performance-wise (I had not optimized it at all, yet).

What would you say?

Thanks,
Kirill

P.S. Thanks for commenting on other patches. I'll try to correct them
tomorrow, as I have no more time today and need to run.


---- 8< ----
From: Kirill Smelkov <kirr@mns.spb.ru>
Date: Tue, 4 Feb 2014 20:11:16 +0400
Subject: [PATCH] X Show new tree-walker could be used instead of the old one

All tests pass. Timings for git log --raw --no-abrev --no-renames

    linux.git   v3.10..v3.11    became 0.1% _faster_
    navy.git                    became 1.4% slower
---
 diff.h      |  2 ++
 line-log.c  | 12 +++++++--
 revision.c  | 16 ++++++-----
 tree-diff.c | 88 +++++++++++++++++++++++++++++++++++++++++++++++--------------
 4 files changed, 90 insertions(+), 28 deletions(-)

diff --git a/diff.h b/diff.h
index 5496560..0a9845a 100644
--- a/diff.h
+++ b/diff.h
@@ -189,8 +189,10 @@ const char *diff_line_prefix(struct diff_options *);
 
 extern const char mime_boundary_leader[];
 
+#if 0
 extern int diff_tree(struct tree_desc *t1, struct tree_desc *t2,
 		     const char *base, struct diff_options *opt);
+#endif
 extern int diff_tree_sha1(const unsigned char *old, const unsigned char *new,
 			  const char *base, struct diff_options *opt);
 extern int diff_root_tree_sha1(const unsigned char *new, const char *base,
diff --git a/line-log.c b/line-log.c
index 717638b..5739bbf 100644
--- a/line-log.c
+++ b/line-log.c
@@ -766,6 +766,7 @@ void line_log_init(struct rev_info *rev, const char *prefix, struct string_list
 	}
 }
 
+#if 0
 static void load_tree_desc(struct tree_desc *desc, void **tree,
 			   const unsigned char *sha1)
 {
@@ -775,6 +776,7 @@ static void load_tree_desc(struct tree_desc *desc, void **tree,
 		die("Unable to read tree (%s)", sha1_to_hex(sha1));
 	init_tree_desc(desc, *tree, size);
 }
+#endif
 
 static int count_parents(struct commit *commit)
 {
@@ -843,17 +845,23 @@ static void queue_diffs(struct line_log_data *range,
 			struct commit *commit, struct commit *parent)
 {
 	void *tree1 = NULL, *tree2 = NULL;
-	struct tree_desc desc1, desc2;
+//	struct tree_desc desc1, desc2;
 
 	assert(commit);
+#if 0
 	load_tree_desc(&desc2, &tree2, commit->tree->object.sha1);
 	if (parent)
 		load_tree_desc(&desc1, &tree1, parent->tree->object.sha1);
 	else
 		init_tree_desc(&desc1, "", 0);
+#endif
 
 	DIFF_QUEUE_CLEAR(&diff_queued_diff);
-	diff_tree(&desc1, &desc2, "", opt);
+//	diff_tree(&desc1, &desc2, "", opt);
+	diff_tree_sha1(parent ? parent->tree->object.sha1 : null_sha1,
+			commit->tree->object.sha1, "", opt);
+
+
 	if (opt->detect_rename) {
 		filter_diffs_for_paths(range, 1);
 		if (diff_might_be_rename())
diff --git a/revision.c b/revision.c
index 18ec339..ecc250b 100644
--- a/revision.c
+++ b/revision.c
@@ -496,24 +496,28 @@ static int rev_compare_tree(struct rev_info *revs,
 static int rev_same_tree_as_empty(struct rev_info *revs, struct commit *commit)
 {
 	int retval;
-	void *tree;
-	unsigned long size;
-	struct tree_desc empty, real;
+//	void *tree;
+//	unsigned long size;
+//	struct tree_desc empty, real;
 	struct tree *t1 = commit->tree;
 
 	if (!t1)
 		return 0;
 
+#if 0
 	tree = read_object_with_reference(t1->object.sha1, tree_type, &size, NULL);
 	if (!tree)
-		return 0;
+		return 0;	// XXX ok?
 	init_tree_desc(&real, tree, size);
 	init_tree_desc(&empty, "", 0);
+#endif
 
 	tree_difference = REV_TREE_SAME;
 	DIFF_OPT_CLR(&revs->pruning, HAS_CHANGES);
-	retval = diff_tree(&empty, &real, "", &revs->pruning);
-	free(tree);
+//	retval = diff_tree(&empty, &real, "", &revs->pruning);
+	retval = diff_tree_sha1(null_sha1, t1->object.sha1, "", &revs->pruning);
+
+//	free(tree);
 
 	return retval >= 0 && (tree_difference == REV_TREE_SAME);
 }
diff --git a/tree-diff.c b/tree-diff.c
index 7f4f917..9f2a69b 100644
--- a/tree-diff.c
+++ b/tree-diff.c
@@ -6,6 +6,7 @@
 #include "diffcore.h"
 #include "tree.h"
 
+#if 0
 static void show_entry(struct diff_options *opt, const char *prefix,
 		       struct tree_desc *desc, struct strbuf *base);
 
@@ -106,6 +107,7 @@ static void show_entry(struct diff_options *opt, const char *prefix,
 
 	strbuf_setlen(base, old_baselen);
 }
+#endif
 
 static void skip_uninteresting(struct tree_desc *t, struct strbuf *base,
 			       struct diff_options *opt)
@@ -123,6 +125,7 @@ static void skip_uninteresting(struct tree_desc *t, struct strbuf *base,
 	}
 }
 
+#if 0
 int diff_tree(struct tree_desc *t1, struct tree_desc *t2,
 	      const char *base_str, struct diff_options *opt)
 {
@@ -171,6 +174,7 @@ int diff_tree(struct tree_desc *t1, struct tree_desc *t2,
 	strbuf_release(&base);
 	return 0;
 }
+#endif
 
 
 /*
@@ -529,23 +533,25 @@ struct combine_diff_path *diff_tree_combined_paths(
 		case 0:
 			/* are either xk > xi or diff(a,xk) != ø ? */
 			alldifferent = 1;
-			entry_sha1 = tree_entry_extract(&t, &path, &mode);
-			for (i = 0; i < nparent; i++) {
-				const unsigned char *sha1_i;
-				const char *path_i;
-				unsigned mode_i;
-
-				/* x[i] > x[imin] */
-				if (xmin_pathcmpv[i])
-					continue;
-
-				/* diff(a,xk) != ø */
-				sha1_i = tree_entry_extract(&tp[i], &path_i, &mode_i);
-				if (hashcmp(entry_sha1, sha1_i) || (mode != mode_i))
-					continue;
-
-				alldifferent = 0;
-				break;
+			if (!DIFF_OPT_TST(opt, FIND_COPIES_HARDER)) {
+				entry_sha1 = tree_entry_extract(&t, &path, &mode);
+				for (i = 0; i < nparent; i++) {
+					const unsigned char *sha1_i;
+					const char *path_i;
+					unsigned mode_i;
+
+					/* x[i] > x[imin] */
+					if (xmin_pathcmpv[i])
+						continue;
+
+					/* diff(a,xk) != ø */
+					sha1_i = tree_entry_extract(&tp[i], &path_i, &mode_i);
+					if (hashcmp(entry_sha1, sha1_i) || (mode != mode_i))
+						continue;
+
+					alldifferent = 0;
+					break;
+				}
 			}
 
 			/* D += {δ(a,xk) if xk=xi;  "+a" if xk > xi} */
@@ -594,7 +600,8 @@ static inline int diff_might_be_rename(void)
 		!DIFF_FILE_VALID(diff_queued_diff.queue[0]->one);
 }
 
-static void try_to_follow_renames(struct tree_desc *t1, struct tree_desc *t2, const char *base, struct diff_options *opt)
+//static void try_to_follow_renames(struct tree_desc *t1, struct tree_desc *t2, const char *base, struct diff_options *opt)
+static void try_to_follow_renames(const unsigned char *sha1_old, const unsigned char *sha1_new, const char *base, struct diff_options *opt)
 {
 	struct diff_options diff_opts;
 	struct diff_queue_struct *q = &diff_queued_diff;
@@ -632,7 +639,8 @@ static void try_to_follow_renames(struct tree_desc *t1, struct tree_desc *t2, co
 	diff_opts.break_opt = opt->break_opt;
 	diff_opts.rename_score = opt->rename_score;
 	diff_setup_done(&diff_opts);
-	diff_tree(t1, t2, base, &diff_opts);
+//	diff_tree(t1, t2, base, &diff_opts);
+	diff_tree_sha1(sha1_old, sha1_new, base, &diff_opts);
 	diffcore_std(&diff_opts);
 	free_pathspec(&diff_opts.pathspec);
 
@@ -693,11 +701,13 @@ static void try_to_follow_renames(struct tree_desc *t1, struct tree_desc *t2, co
 
 int diff_tree_sha1(const unsigned char *old, const unsigned char *new, const char *base, struct diff_options *opt)
 {
+#if 0
 	void *tree1, *tree2;
 	struct tree_desc t1, t2;
 	unsigned long size1, size2;
 	int retval;
 
+	// TODO use xfill_tree_descriptor
 	tree1 = read_object_with_reference(old, tree_type, &size1, NULL);
 	if (!tree1)
 		die("unable to read source tree (%s)", sha1_to_hex(old));
@@ -707,23 +717,60 @@ int diff_tree_sha1(const unsigned char *old, const unsigned char *new, const cha
 	init_tree_desc(&t1, tree1, size1);
 	init_tree_desc(&t2, tree2, size2);
 	retval = diff_tree(&t1, &t2, base, opt);
+#endif
+	struct combine_diff_path phead, *p, *pprev;
+	phead.next = NULL;
+
+	struct strbuf basebuf;
+	strbuf_init(&basebuf, PATH_MAX);
+	strbuf_addstr(&basebuf, base);
+
+	const unsigned char *parents_sha1[1] = {new};
+
+	diff_tree_combined_paths(&phead, old, parents_sha1, 1, &basebuf, opt);
+
+	strbuf_release(&basebuf);
+
+	// paths -> diff_q
+	for (p = phead.next; p;) {
+		struct combine_diff_parent *p0 = &p->parent[0];
+		opt->change(opt, p->mode, p0->mode, p->sha1, p0->sha1,
+			p->mode != 0, p0->mode != 0,
+			p->path, 0, 0);
+
+		pprev = p;
+		p = p->next;
+		free(pprev);
+	}
+
 	if (!*base && DIFF_OPT_TST(opt, FOLLOW_RENAMES) && diff_might_be_rename()) {
+#if 0
 		init_tree_desc(&t1, tree1, size1);
 		init_tree_desc(&t2, tree2, size2);
 		try_to_follow_renames(&t1, &t2, base, opt);
+#endif
+		try_to_follow_renames(old, new, base, opt);
 	}
+#if 0
 	free(tree1);
 	free(tree2);
-	return retval;
+#endif
+//	return retval;
+	return 0;
 }
 
 int diff_root_tree_sha1(const unsigned char *new, const char *base, struct diff_options *opt)
 {
+#if 0
 	int retval;
 	void *tree;
 	unsigned long size;
 	struct tree_desc empty, real;
+#endif
 
+	return diff_tree_sha1(null_sha1, new, base, opt);
+
+#if 0
 	tree = read_object_with_reference(new, tree_type, &size, NULL);
 	if (!tree)
 		die("unable to read root tree (%s)", sha1_to_hex(new));
@@ -733,4 +780,5 @@ int diff_root_tree_sha1(const unsigned char *new, const char *base, struct diff_
 	retval = diff_tree(&empty, &real, base, opt);
 	free(tree);
 	return retval;
+#endif
 }
-- 
1.9.rc1.181.g641f458

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 7/8] combine-diff: Fast changed-to-all-parents paths scanning
  2014-02-04 16:34       ` Kirill Smelkov
@ 2014-02-04 18:37         ` Junio C Hamano
  2014-02-05 16:51           ` Kirill Smelkov
  0 siblings, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2014-02-04 18:37 UTC (permalink / raw
  To: Kirill Smelkov; +Cc: git

Kirill Smelkov <kirr@mns.spb.ru> writes:

>> if we did not want to reinvent the whole tree walking thing, which
>> would add risks for new bugs and burden to maintain this and the
>> usual two-tree diff tree walking in sync.
>
> Junio, I understand it is not good to duplicate code and increase
> maintenance risks....
> Instead I propose we switch to the new tree walker completely.

I am not fundamentally opposed to the idea. We'll see what happens.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/8] tests: add checking that combine-diff emits only correct paths
  2014-02-03 23:10   ` Junio C Hamano
@ 2014-02-05 10:36     ` Kirill Smelkov
  0 siblings, 0 replies; 26+ messages in thread
From: Kirill Smelkov @ 2014-02-05 10:36 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git

On Mon, Feb 03, 2014 at 03:10:08PM -0800, Junio C Hamano wrote:
> Kirill Smelkov <kirr@mns.spb.ru> writes:
> 
> > where "correct paths" stands for paths that are different to all
> > parents.
> >
> > Up until now, we were testing combined diff only on one file, or on
> > several files which were all different (t4038-diff-combined.sh).
> >
> > As recent thinko in "simplify intersect_paths() further" showed, and
> > also, since we are going to rework code for finding paths different to
> > all parents, lets write at least basic tests.
> 
> Thanks.  Some nitpicks.
> 
> >
> > Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru>
> > ---
> >  t/t4057-diff-combined-paths.sh | 106 +++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 106 insertions(+)
> >  create mode 100755 t/t4057-diff-combined-paths.sh
> >
> > diff --git a/t/t4057-diff-combined-paths.sh b/t/t4057-diff-combined-paths.sh
> > new file mode 100755
> > index 0000000..e6e457d
> > --- /dev/null
> > +++ b/t/t4057-diff-combined-paths.sh
> > @@ -0,0 +1,106 @@
> > +#!/bin/sh
> > +
> > +test_description='combined diff show only paths that are different to all parents'
> > +
> > +. ./test-lib.sh
> > +
> > +# verify that diffc.expect matches output of
> > +# `git diff -c --name-only HEAD HEAD^ HEAD^2`
> > +diffc_verify() {
> 
> Style: SP before (), i.e.
> 
> 	diffc_verify () {
> 
> > +	git diff -c --name-only HEAD HEAD^ HEAD^2 >diffc.actual &&
> > +	test_cmp diffc.expect diffc.actual
> > +}
> > +
> > +test_expect_success 'trivial merge - combine-diff empty' '
> > +	for i in `test_seq 1 9`
> 
> Style: prefer $() over ``

Thanks, I've corrected the style.

> 
> > +	do
> > +		echo $i >$i.txt	&&
> > +		git add $i.txt
> 
> Quiz.  What happens when this "git add" fails with an earlier value
> of $i?

We just continue and ignore the error = bad. On the other hand, there
are a lot of places like this one in git's testsuite, and almost almost
nobody cares. In a few places people add

    || exit

to last command, but I wonder, is there a way to omit the boilerplate,
and also || exit is not strictly correct, as sometime we would need to
analyse `for` exit code for good and also for bad, and exit just exits
from shell... What would be the most convenient thing to do here?

Please find patch with style issues corrected below

Thanks,
Kirill

---- 8< -----
From: Kirill Smelkov <kirr@mns.spb.ru>
Date: Mon, 3 Feb 2014 13:08:49 +0400
Subject: [PATCH] tests: add checking that combine-diff emits only correct paths

where "correct paths" stands for paths that are different to all
parents.

Up until now, we were testing combined diff only on one file, or on
several files which were all different (t4038-diff-combined.sh).

As recent thinko in "simplify intersect_paths() further" showed, and
also, since we are going to rework code for finding paths different to
all parents, lets write at least basic tests.

Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru>
---
 t/t4057-diff-combined-paths.sh | 106 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 106 insertions(+)
 create mode 100755 t/t4057-diff-combined-paths.sh

diff --git a/t/t4057-diff-combined-paths.sh b/t/t4057-diff-combined-paths.sh
new file mode 100755
index 0000000..097e632
--- /dev/null
+++ b/t/t4057-diff-combined-paths.sh
@@ -0,0 +1,106 @@
+#!/bin/sh
+
+test_description='combined diff show only paths that are different to all parents'
+
+. ./test-lib.sh
+
+# verify that diffc.expect matches output of
+# `git diff -c --name-only HEAD HEAD^ HEAD^2`
+diffc_verify () {
+	git diff -c --name-only HEAD HEAD^ HEAD^2 >diffc.actual &&
+	test_cmp diffc.expect diffc.actual
+}
+
+test_expect_success 'trivial merge - combine-diff empty' '
+	for i in $(test_seq 1 9)
+	do
+		echo $i >$i.txt	&&
+		git add $i.txt
+	done &&
+	git commit -m "init" &&
+	git checkout -b side &&
+	for i in $(test_seq 2 9)
+	do
+		echo $i/2 >>$i.txt
+	done &&
+	git commit -a -m "side 2-9" &&
+	git checkout master &&
+	echo 1/2 >1.txt &&
+	git commit -a -m "master 1" &&
+	git merge side &&
+	>diffc.expect &&
+	diffc_verify
+'
+
+
+test_expect_success 'only one trully conflicting path' '
+	git checkout side &&
+	for i in $(test_seq 2 9)
+	do
+		echo $i/3 >>$i.txt
+	done &&
+	echo "4side" >>4.txt &&
+	git commit -a -m "side 2-9 +4" &&
+	git checkout master &&
+	for i in $(test_seq 1 9)
+	do
+		echo $i/3 >>$i.txt
+	done &&
+	echo "4master" >>4.txt &&
+	git commit -a -m "master 1-9 +4" &&
+	test_must_fail git merge side &&
+	cat <<-\EOF >4.txt &&
+	4
+	4/2
+	4/3
+	4master
+	4side
+	EOF
+	git add 4.txt &&
+	git commit -m "merge side (2)" &&
+	echo 4.txt >diffc.expect &&
+	diffc_verify
+'
+
+test_expect_success 'merge introduces new file' '
+	git checkout side &&
+	for i in $(test_seq 5 9)
+	do
+		echo $i/4 >>$i.txt
+	done &&
+	git commit -a -m "side 5-9" &&
+	git checkout master &&
+	for i in $(test_seq 1 3)
+	do
+		echo $i/4 >>$i.txt
+	done &&
+	git commit -a -m "master 1-3 +4hello" &&
+	git merge side &&
+	echo "Hello World" >4hello.txt &&
+	git add 4hello.txt &&
+	git commit --amend &&
+	echo 4hello.txt >diffc.expect &&
+	diffc_verify
+'
+
+test_expect_success 'merge removed a file' '
+	git checkout side &&
+	for i in $(test_seq 5 9)
+	do
+		echo $i/5 >>$i.txt
+	done &&
+	git commit -a -m "side 5-9" &&
+	git checkout master &&
+	for i in $(test_seq 1 3)
+	do
+		echo $i/4 >>$i.txt
+	done &&
+	git commit -a -m "master 1-3" &&
+	git merge side &&
+	git rm 4.txt &&
+	git commit --amend &&
+	echo 4.txt >diffc.expect &&
+	diffc_verify
+'
+
+test_done
-- 
1.9.rc1.181.g641f458

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 7/8] combine-diff: Fast changed-to-all-parents paths scanning
  2014-02-04 18:37         ` Junio C Hamano
@ 2014-02-05 16:51           ` Kirill Smelkov
  2014-02-05 17:36             ` Junio C Hamano
  0 siblings, 1 reply; 26+ messages in thread
From: Kirill Smelkov @ 2014-02-05 16:51 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git

On Tue, Feb 04, 2014 at 10:37:24AM -0800, Junio C Hamano wrote:
> Kirill Smelkov <kirr@mns.spb.ru> writes:
> 
> >> if we did not want to reinvent the whole tree walking thing, which
> >> would add risks for new bugs and burden to maintain this and the
> >> usual two-tree diff tree walking in sync.
> >
> > Junio, I understand it is not good to duplicate code and increase
> > maintenance risks....
> > Instead I propose we switch to the new tree walker completely.
> 
> I am not fundamentally opposed to the idea. We'll see what happens.

Thanks. Locally, with draft patches with generalized tree-walker, for
`git log --raw --no-abbrev --no-renames` I was able to get to as fast as
with old two-tree walker for navy.git and 1% faster for linux.git and
all tests pass.

Only, before I clean things up, I'd like to ask - would the following
patch be accepted

---- 8< ---
diff --git a/tree-walk.c b/tree-walk.c
index 79dba1d..4dc86c7 100644
--- a/tree-walk.c
+++ b/tree-walk.c
@@ -37,7 +37,7 @@ static void decode_tree_entry(struct tree_desc *desc, const char *buf, unsigned
 
 	/* Initialize the descriptor entry */
 	desc->entry.path = path;
-	desc->entry.mode = mode;
+	desc->entry.mode = canon_mode(mode);
 	desc->entry.sha1 = (const unsigned char *)(path + len);
 }
 
diff --git a/tree-walk.h b/tree-walk.h
index ae04b64..ae7fb3a 100644
--- a/tree-walk.h
+++ b/tree-walk.h
@@ -16,7 +16,7 @@ struct tree_desc {
 static inline const unsigned char *tree_entry_extract(struct tree_desc *desc, const char **pathp, unsigned int *modep)
 {
 	*pathp = desc->entry.path;
-	*modep = canon_mode(desc->entry.mode);
+	*modep = desc->entry.mode;
 	return desc->entry.sha1;
 }
---- 8< ---
 
?

The reason I ask is because it is more handy to work with tree_desc
structures, compared to splitting it to sha1/path/mode/pathlen, and
if canon_mode() is done only once, clients don't have to pay the
performance price of canon_mode on each tree_desc "extract".

I understand there is fsck, which on invalid tree entry prints the mode,
but maybe somehow fsck should be special, and common interface be tuned
to usual clients?

Thanks,
Kirill

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 7/8] combine-diff: Fast changed-to-all-parents paths scanning
  2014-02-05 16:51           ` Kirill Smelkov
@ 2014-02-05 17:36             ` Junio C Hamano
  2014-02-05 19:14               ` Kirill Smelkov
  0 siblings, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2014-02-05 17:36 UTC (permalink / raw
  To: Kirill Smelkov; +Cc: git

Kirill Smelkov <kirr@mns.spb.ru> writes:

> Only, before I clean things up, I'd like to ask - would the following
> patch be accepted
>
> ---- 8< ---
> diff --git a/tree-walk.c b/tree-walk.c
> index 79dba1d..4dc86c7 100644
> --- a/tree-walk.c
> +++ b/tree-walk.c
> @@ -37,7 +37,7 @@ static void decode_tree_entry(struct tree_desc *desc, const char *buf, unsigned
>  
>  	/* Initialize the descriptor entry */
>  	desc->entry.path = path;
> -	desc->entry.mode = mode;
> +	desc->entry.mode = canon_mode(mode);
>  	desc->entry.sha1 = (const unsigned char *)(path + len);
>  }
>  
> diff --git a/tree-walk.h b/tree-walk.h
> index ae04b64..ae7fb3a 100644
> --- a/tree-walk.h
> +++ b/tree-walk.h
> @@ -16,7 +16,7 @@ struct tree_desc {
>  static inline const unsigned char *tree_entry_extract(struct tree_desc *desc, const char **pathp, unsigned int *modep)
>  {
>  	*pathp = desc->entry.path;
> -	*modep = canon_mode(desc->entry.mode);
> +	*modep = desc->entry.mode;
>  	return desc->entry.sha1;
>  }
> ---- 8< ---
>  
> ?

Doesn't desc point into and walks over the data we read from the
tree object directly?

We try to keep (tree|commit)->buffer intact and that is done pretty
deliberately.  While you are walking a tree or parsing a commit,
somebody else, perhaps called indirectly by a helper function you
call, may call lookup_object() for the same object, get the copy
that has already been read and start using it.  This kind of change
will introduce bugs that are hard to debug unless it is done very
carefully (e.g. starting from making tree.buffer into a const pointer
and propagating constness throughout the system), which might not be
worth the pain.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 7/8] combine-diff: Fast changed-to-all-parents paths scanning
  2014-02-05 17:36             ` Junio C Hamano
@ 2014-02-05 19:14               ` Kirill Smelkov
  2014-02-05 19:42                 ` Junio C Hamano
  0 siblings, 1 reply; 26+ messages in thread
From: Kirill Smelkov @ 2014-02-05 19:14 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git

On Wed, Feb 05, 2014 at 09:36:55AM -0800, Junio C Hamano wrote:
> Kirill Smelkov <kirr@mns.spb.ru> writes:
> 
> > Only, before I clean things up, I'd like to ask - would the following
> > patch be accepted
> >
> > ---- 8< ---
> > diff --git a/tree-walk.c b/tree-walk.c
> > index 79dba1d..4dc86c7 100644
> > --- a/tree-walk.c
> > +++ b/tree-walk.c
> > @@ -37,7 +37,7 @@ static void decode_tree_entry(struct tree_desc *desc, const char *buf, unsigned
> >  
> >  	/* Initialize the descriptor entry */
> >  	desc->entry.path = path;
> > -	desc->entry.mode = mode;
> > +	desc->entry.mode = canon_mode(mode);
> >  	desc->entry.sha1 = (const unsigned char *)(path + len);
> >  }
> >  
> > diff --git a/tree-walk.h b/tree-walk.h
> > index ae04b64..ae7fb3a 100644
> > --- a/tree-walk.h
> > +++ b/tree-walk.h
> > @@ -16,7 +16,7 @@ struct tree_desc {
> >  static inline const unsigned char *tree_entry_extract(struct tree_desc *desc, const char **pathp, unsigned int *modep)
> >  {
> >  	*pathp = desc->entry.path;
> > -	*modep = canon_mode(desc->entry.mode);
> > +	*modep = desc->entry.mode;
> >  	return desc->entry.sha1;
> >  }
> > ---- 8< ---
> >  
> > ?
> 
> Doesn't desc point into and walks over the data we read from the
> tree object directly?
> 
> We try to keep (tree|commit)->buffer intact and that is done pretty
> deliberately.  While you are walking a tree or parsing a commit,
> somebody else, perhaps called indirectly by a helper function you
> call, may call lookup_object() for the same object, get the copy
> that has already been read and start using it.  This kind of change
> will introduce bugs that are hard to debug unless it is done very
> carefully (e.g. starting from making tree.buffer into a const pointer
> and propagating constness throughout the system), which might not be
> worth the pain.

I agree object data should be immutable for good. The only thing I'm talking
about here is mode, which is parsed from a tree buffer and is stored in
separate field:

        ---- 8< ---- tree-walk.h
        struct name_entry {
                const unsigned char *sha1;
                const char *path;
                unsigned int mode;              <---
        };

        struct tree_desc {
                const void *buffer;
                struct name_entry entry;
                unsigned int size;
        };

        ---- 8< ---- tree-walk.c
        const char *get_mode(const char *str, unsigned int *modep)
        { ... }     /* parses symbolic mode from tree buffer to uint */

        void decode_tree_entry(struct tree_desc *desc, const char *buf, unsigned long size)
        {
                const char *path;
                unsigned int mode, len;

                ...
                path = get_mode(buf, &mode);

                /* Initialize the descriptor entry */
                desc->entry.path = path;
                desc->entry.mode = mode;        <---
                desc->entry.sha1 = (const unsigned char *)(path + len);


so we are not talking about modifying object buffers here. All I'm
asking is about canonicalizing _already_ parsed and copied mode on the
fly.

Does that change anything?


Sorry, if I'm maybe misunderstanding something, and thanks,
Kirill

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 7/8] combine-diff: Fast changed-to-all-parents paths scanning
  2014-02-05 19:14               ` Kirill Smelkov
@ 2014-02-05 19:42                 ` Junio C Hamano
  2014-02-05 20:22                   ` Kirill Smelkov
  0 siblings, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2014-02-05 19:42 UTC (permalink / raw
  To: Kirill Smelkov; +Cc: git

Kirill Smelkov <kirr@navytux.spb.ru> writes:

> I agree object data should be immutable for good. The only thing I'm talking
> about here is mode, which is parsed from a tree buffer and is stored in
> separate field:

Ah, I do not see any problem in that case, then.

Thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 7/8] combine-diff: Fast changed-to-all-parents paths scanning
  2014-02-05 19:42                 ` Junio C Hamano
@ 2014-02-05 20:22                   ` Kirill Smelkov
  2014-02-05 22:58                     ` Junio C Hamano
  0 siblings, 1 reply; 26+ messages in thread
From: Kirill Smelkov @ 2014-02-05 20:22 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git, kirr

On Wed, Feb 05, 2014 at 11:42:41AM -0800, Junio C Hamano wrote:
> Kirill Smelkov <kirr@navytux.spb.ru> writes:
> 
> > I agree object data should be immutable for good. The only thing I'm talking
> > about here is mode, which is parsed from a tree buffer and is stored in
> > separate field:
> 
> Ah, I do not see any problem in that case, then.
> 
> Thanks.

Thanks, that simplifies things for me.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 7/8] combine-diff: Fast changed-to-all-parents paths scanning
  2014-02-05 20:22                   ` Kirill Smelkov
@ 2014-02-05 22:58                     ` Junio C Hamano
  2014-02-06 16:22                       ` Kirill Smelkov
  0 siblings, 1 reply; 26+ messages in thread
From: Junio C Hamano @ 2014-02-05 22:58 UTC (permalink / raw
  To: Kirill Smelkov; +Cc: git, kirr

Kirill Smelkov <kirr@navytux.spb.ru> writes:

> On Wed, Feb 05, 2014 at 11:42:41AM -0800, Junio C Hamano wrote:
>> Kirill Smelkov <kirr@navytux.spb.ru> writes:
>> 
>> > I agree object data should be immutable for good. The only thing I'm talking
>> > about here is mode, which is parsed from a tree buffer and is stored in
>> > separate field:
>> 
>> Ah, I do not see any problem in that case, then.
>> 
>> Thanks.
>
> Thanks, that simplifies things for me.

Surely.

Be careful when traversing N-trees in parallel---you may have to
watch out for the entry ordering rules that sorts the following
paths in the order shown:

	a
	a-b
        a/b
        a_b

Inside a single tree, you cannot have 'a' and 'a/b' at the same
time, but one tree may have 'a' (without 'a/b') while another one
may have 'a/b' (without 'a'), and walking them in parallel has
historically been one source of funny bugs.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 7/8] combine-diff: Fast changed-to-all-parents paths scanning
  2014-02-05 22:58                     ` Junio C Hamano
@ 2014-02-06 16:22                       ` Kirill Smelkov
  0 siblings, 0 replies; 26+ messages in thread
From: Kirill Smelkov @ 2014-02-06 16:22 UTC (permalink / raw
  To: Junio C Hamano; +Cc: git

On Wed, Feb 05, 2014 at 02:58:36PM -0800, Junio C Hamano wrote:
> Kirill Smelkov <kirr@navytux.spb.ru> writes:
> 
> > On Wed, Feb 05, 2014 at 11:42:41AM -0800, Junio C Hamano wrote:
> >> Kirill Smelkov <kirr@navytux.spb.ru> writes:
> >> 
> >> > I agree object data should be immutable for good. The only thing I'm talking
> >> > about here is mode, which is parsed from a tree buffer and is stored in
> >> > separate field:
> >> 
> >> Ah, I do not see any problem in that case, then.
> >> 
> >> Thanks.
> >
> > Thanks, that simplifies things for me.

Below is a patch which does it. Please apply, if it is ok.


> Surely.
> 
> Be careful when traversing N-trees in parallel---you may have to
> watch out for the entry ordering rules that sorts the following
> paths in the order shown:
> 
> 	a
> 	a-b
>         a/b
>         a_b
> 
> Inside a single tree, you cannot have 'a' and 'a/b' at the same
> time, but one tree may have 'a' (without 'a/b') while another one
> may have 'a/b' (without 'a'), and walking them in parallel has
> historically been one source of funny bugs.

Thanks for the warning. I'm relying here on base_name_compare() and
ordering it imposes on entries and how it differentiates directories and
files, so let's hope this should be ok.

Actual reworking is still in flux though...

Thanks,
Kirill


---- 8< ----
From: Kirill Smelkov <kirr@mns.spb.ru>
Date: Thu, 6 Feb 2014 15:36:31 +0400
Subject: [PATCH] Finally switch over tree descriptors to contain a pre-parsed entry

This continues 4651ece8 (Switch over tree descriptors to contain a
pre-parsed entry) and moves the only rest computational part

    mode = canon_mode(mode)

from tree_entry_extract() to tree entry decode phase - to
decode_tree_entry().

The reason to do it, is that canon_mode() is at least 2 conditional
jumps for regular files, and that could be noticeable should canon_mode()
be invoked several times.

That does not matter for current Git codebase, where typical tree
traversal is

    while (t->size) {
        sha1 = tree_entry_extract(t, &path, &mode);
        ...
        update_tree_entry(t);
    }

i.e. we do t -> sha1,path.mode "extraction" only once per entry. In such
cases, it does not matter performance-wise, where that mode
canonicalization is done - either once in tree_entry_extract(), or once
in decode_tree_entry() called by update_tree_entry() - it is
approximately the same.

But for future code, which could need to work with several tree_desc's
in parallel, it could be handy to operate on tree_desc descriptors, and
do "extracts" only when needed, or at all, access only relevant part of
it through structure fields directly.

And for such situations, having canon_mode() be done once in decode
phase is better - we won't need to pay the performance price of 2 extra
conditional jumps on every t->mode access.

So let's move mode canonicalization to decode_tree_entry(). That was the
final bit. Now after tree entry is decoded, it is fully ready and could
be accessed either directly via field, or through tree_entry_extract()
which this time got really "totally trivial".

Signed-off-by: Kirill Smelkov <kirr@mns.spb.ru>
---
 tree-walk.c | 2 +-
 tree-walk.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/tree-walk.c b/tree-walk.c
index 79dba1d..4dc86c7 100644
--- a/tree-walk.c
+++ b/tree-walk.c
@@ -37,7 +37,7 @@ static void decode_tree_entry(struct tree_desc *desc, const char *buf, unsigned
 
 	/* Initialize the descriptor entry */
 	desc->entry.path = path;
-	desc->entry.mode = mode;
+	desc->entry.mode = canon_mode(mode);
 	desc->entry.sha1 = (const unsigned char *)(path + len);
 }
 
diff --git a/tree-walk.h b/tree-walk.h
index ae04b64..ae7fb3a 100644
--- a/tree-walk.h
+++ b/tree-walk.h
@@ -16,7 +16,7 @@ struct tree_desc {
 static inline const unsigned char *tree_entry_extract(struct tree_desc *desc, const char **pathp, unsigned int *modep)
 {
 	*pathp = desc->entry.path;
-	*modep = canon_mode(desc->entry.mode);
+	*modep = desc->entry.mode;
 	return desc->entry.sha1;
 }
 
-- 
1.9.rc1.181.g641f458

^ permalink raw reply related	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2014-02-06 16:21 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-02-03 12:47 [PATCH 0/8] `log -c` speedup (part 2) Kirill Smelkov
2014-02-03 12:47 ` [PATCH 1/8] fixup! combine_diff: simplify intersect_paths() further Kirill Smelkov
2014-02-03 19:40   ` Junio C Hamano
2014-02-03 12:47 ` [PATCH 2/8] tests: add checking that combine-diff emits only correct paths Kirill Smelkov
2014-02-03 23:10   ` Junio C Hamano
2014-02-05 10:36     ` Kirill Smelkov
2014-02-03 12:47 ` [PATCH 3/8] tree-diff: no need to manually verify that there is no mode change for a path Kirill Smelkov
2014-02-03 23:12   ` Junio C Hamano
2014-02-03 12:47 ` [PATCH 4/8] tree-diff: no need to pass match to skip_uninteresting() Kirill Smelkov
2014-02-03 12:47 ` [PATCH 5/8] combine-diff: move show_log_first logic/action out of paths scanning Kirill Smelkov
2014-02-03 23:21   ` Junio C Hamano
2014-02-03 12:47 ` [PATCH 6/8] combine-diff: Move changed-paths scanning logic into its own function Kirill Smelkov
2014-02-03 12:47 ` [PATCH 7/8] combine-diff: Fast changed-to-all-parents paths scanning Kirill Smelkov
2014-02-03 23:26   ` Junio C Hamano
2014-02-03 23:39     ` Junio C Hamano
2014-02-04 16:34       ` Kirill Smelkov
2014-02-04 18:37         ` Junio C Hamano
2014-02-05 16:51           ` Kirill Smelkov
2014-02-05 17:36             ` Junio C Hamano
2014-02-05 19:14               ` Kirill Smelkov
2014-02-05 19:42                 ` Junio C Hamano
2014-02-05 20:22                   ` Kirill Smelkov
2014-02-05 22:58                     ` Junio C Hamano
2014-02-06 16:22                       ` Kirill Smelkov
2014-02-04  0:00   ` Junio C Hamano
2014-02-03 12:47 ` [PATCH 8/8] combine-diff: bail out early, if num_paths=0 Kirill Smelkov

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).