[PATCH v10 0/3] read-cache: speed up add_index

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [PATCH v10 0/3] read-cache: speed up add_index_entry
@ 2017-04-14 19:12 git
  2017-04-14 19:12 ` [PATCH v10 1/3] read-cache: add strcmp_offset function git
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: git @ 2017-04-14 19:12 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

Version 10 addresses mailing list comments on coding style in
read-cache.c and in p0006.  I've also adjusted the speed-up
algorithm and more thoroughly documented the cases in the code.

We skip the binary searches whenever we can prove that the new
entry (and every sub-directory prefix of the entry) clearly
FOLLOWS the last entry in the index.  Otherwise, we fall through
to the existing search algorithm -- which handles stages and
CE_REMOVE entries.

Jeff Hostetler (3):
  read-cache: add strcmp_offset function
  p0006-read-tree-checkout: perf test to time read-tree
  read-cache: speed up add_index_entry during checkout

 Makefile                           |   1 +
 cache.h                            |   1 +
 read-cache.c                       | 138 ++++++++++++++++++++++++++++++++++++-
 t/helper/.gitignore                |   1 +
 t/helper/test-strcmp-offset.c      |  22 ++++++
 t/perf/p0006-read-tree-checkout.sh |  67 ++++++++++++++++++
 t/perf/repos/.gitignore            |   1 +
 t/perf/repos/many-files.sh         | 110 +++++++++++++++++++++++++++++
 t/t0065-strcmp-offset.sh           |  21 ++++++
 9 files changed, 360 insertions(+), 2 deletions(-)
 create mode 100644 t/helper/test-strcmp-offset.c
 create mode 100755 t/perf/p0006-read-tree-checkout.sh
 create mode 100644 t/perf/repos/.gitignore
 create mode 100755 t/perf/repos/many-files.sh
 create mode 100755 t/t0065-strcmp-offset.sh

-- 
2.9.3


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH v10 1/3] read-cache: add strcmp_offset function
  2017-04-14 19:12 [PATCH v10 0/3] read-cache: speed up add_index_entry git
@ 2017-04-14 19:12 ` git
  2017-04-14 19:12 ` [PATCH v10 2/3] p0006-read-tree-checkout: perf test to time read-tree git
  2017-04-14 19:12 ` [PATCH v10 3/3] read-cache: speed up add_index_entry during checkout git
  2 siblings, 0 replies; 7+ messages in thread
From: git @ 2017-04-14 19:12 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

Add strcmp_offset() function to also return the offset of the
first change.

Add unit test and helper to verify.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 Makefile                      |  1 +
 cache.h                       |  1 +
 read-cache.c                  | 20 ++++++++++++++++++++
 t/helper/.gitignore           |  1 +
 t/helper/test-strcmp-offset.c | 22 ++++++++++++++++++++++
 t/t0065-strcmp-offset.sh      | 21 +++++++++++++++++++++
 6 files changed, 66 insertions(+)
 create mode 100644 t/helper/test-strcmp-offset.c
 create mode 100755 t/t0065-strcmp-offset.sh

diff --git a/Makefile b/Makefile
index 9ec6065..4c4c246 100644
--- a/Makefile
+++ b/Makefile
@@ -631,6 +631,7 @@ TEST_PROGRAMS_NEED_X += test-scrap-cache-tree
 TEST_PROGRAMS_NEED_X += test-sha1
 TEST_PROGRAMS_NEED_X += test-sha1-array
 TEST_PROGRAMS_NEED_X += test-sigchain
+TEST_PROGRAMS_NEED_X += test-strcmp-offset
 TEST_PROGRAMS_NEED_X += test-string-list
 TEST_PROGRAMS_NEED_X += test-submodule-config
 TEST_PROGRAMS_NEED_X += test-subprocess
diff --git a/cache.h b/cache.h
index 80b6372..3c55047 100644
--- a/cache.h
+++ b/cache.h
@@ -574,6 +574,7 @@ extern int write_locked_index(struct index_state *, struct lock_file *lock, unsi
 extern int discard_index(struct index_state *);
 extern int unmerged_index(const struct index_state *);
 extern int verify_path(const char *path);
+extern int strcmp_offset(const char *s1, const char *s2, size_t *first_change);
 extern int index_dir_exists(struct index_state *istate, const char *name, int namelen);
 extern void adjust_dirname_case(struct index_state *istate, char *name);
 extern struct cache_entry *index_file_exists(struct index_state *istate, const char *name, int namelen, int igncase);
diff --git a/read-cache.c b/read-cache.c
index 9054369..97f13a1 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -887,6 +887,26 @@ static int has_file_name(struct index_state *istate,
 	return retval;
 }
 
+
+/*
+ * Like strcmp(), but also return the offset of the first change.
+ * If strings are equal, return the length.
+ */
+int strcmp_offset(const char *s1, const char *s2, size_t *first_change)
+{
+	size_t k;
+
+	if (!first_change)
+		return strcmp(s1, s2);
+
+	for (k = 0; s1[k] == s2[k]; k++)
+		if (s1[k] == '\0')
+			break;
+
+	*first_change = k;
+	return (unsigned char)s1[k] - (unsigned char)s2[k];
+}
+
 /*
  * Do we have another file with a pathname that is a proper
  * subset of the name we're trying to add?
diff --git a/t/helper/.gitignore b/t/helper/.gitignore
index d6e8b36..0a89531 100644
--- a/t/helper/.gitignore
+++ b/t/helper/.gitignore
@@ -25,6 +25,7 @@
 /test-sha1
 /test-sha1-array
 /test-sigchain
+/test-strcmp-offset
 /test-string-list
 /test-submodule-config
 /test-subprocess
diff --git a/t/helper/test-strcmp-offset.c b/t/helper/test-strcmp-offset.c
new file mode 100644
index 0000000..4a45a54
--- /dev/null
+++ b/t/helper/test-strcmp-offset.c
@@ -0,0 +1,22 @@
+#include "cache.h"
+
+int cmd_main(int argc, const char **argv)
+{
+	int result;
+	size_t offset;
+
+	if (!argv[1] || !argv[2])
+		die("usage: %s <string1> <string2>", argv[0]);
+
+	result = strcmp_offset(argv[1], argv[2], &offset);
+
+	/*
+	 * Because differnt CRTs behave differently, only rely on signs
+	 * of the result values.
+	 */
+	result = (result < 0 ? -1 :
+			  result > 0 ? 1 :
+			  0);
+	printf("%d %"PRIuMAX"\n", result, (uintmax_t)offset);
+	return 0;
+}
diff --git a/t/t0065-strcmp-offset.sh b/t/t0065-strcmp-offset.sh
new file mode 100755
index 0000000..7d6d214
--- /dev/null
+++ b/t/t0065-strcmp-offset.sh
@@ -0,0 +1,21 @@
+#!/bin/sh
+
+test_description='Test strcmp_offset functionality'
+
+. ./test-lib.sh
+
+while read s1 s2 expect
+do
+	test_expect_success "strcmp_offset($s1, $s2)" '
+		echo "$expect" >expect &&
+		test-strcmp-offset "$s1" "$s2" >actual &&
+		test_cmp expect actual
+	'
+done <<-EOF
+abc abc 0 3
+abc def -1 0
+abc abz -1 2
+abc abcdef -1 3
+EOF
+
+test_done
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH v10 2/3] p0006-read-tree-checkout: perf test to time read-tree
  2017-04-14 19:12 [PATCH v10 0/3] read-cache: speed up add_index_entry git
  2017-04-14 19:12 ` [PATCH v10 1/3] read-cache: add strcmp_offset function git
@ 2017-04-14 19:12 ` git
  2017-04-14 19:12 ` [PATCH v10 3/3] read-cache: speed up add_index_entry during checkout git
  2 siblings, 0 replies; 7+ messages in thread
From: git @ 2017-04-14 19:12 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

Created t/perf/repos/many-files.sh to generate large, but
artificial repositories.

Created t/perf/p0006-read-tree-checkout.sh to measure
performance on various read-tree, checkout, and update-index
operations.  This test can run using either artificial repos
described above or normal repos.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 t/perf/p0006-read-tree-checkout.sh |  67 ++++++++++++++++++++++
 t/perf/repos/.gitignore            |   1 +
 t/perf/repos/many-files.sh         | 110 +++++++++++++++++++++++++++++++++++++
 3 files changed, 178 insertions(+)
 create mode 100755 t/perf/p0006-read-tree-checkout.sh
 create mode 100644 t/perf/repos/.gitignore
 create mode 100755 t/perf/repos/many-files.sh

diff --git a/t/perf/p0006-read-tree-checkout.sh b/t/perf/p0006-read-tree-checkout.sh
new file mode 100755
index 0000000..78cc23f
--- /dev/null
+++ b/t/perf/p0006-read-tree-checkout.sh
@@ -0,0 +1,67 @@
+#!/bin/sh
+#
+# This test measures the performance of various read-tree
+# and checkout operations.  It is primarily interested in
+# the algorithmic costs of index operations and recursive
+# tree traversal -- and NOT disk I/O on thousands of files.
+
+test_description="Tests performance of read-tree"
+
+. ./perf-lib.sh
+
+test_perf_default_repo
+
+# If the test repo was generated by ./repos/many-files.sh
+# then we know something about the data shape and branches,
+# so we can isolate testing to the ballast-related commits
+# and setup sparse-checkout so we don't have to populate
+# the ballast files and directories.
+#
+# Otherwise, we make some general assumptions about the
+# repo and consider the entire history of the current
+# branch to be the ballast.
+
+test_expect_success "setup repo" '
+	if git rev-parse --verify refs/heads/p0006-ballast^{commit}
+	then
+		echo Assuming synthetic repo from many-files.sh
+		git branch br_base            master
+		git branch br_ballast         p0006-ballast^
+		git branch br_ballast_alias   p0006-ballast^
+		git branch br_ballast_plus_1  p0006-ballast
+		git config --local core.sparsecheckout 1
+		cat >.git/info/sparse-checkout <<-EOF
+		/*
+		!ballast/*
+		EOF
+	else
+		echo Assuming non-synthetic repo...
+		git branch br_base            $(git rev-list HEAD | tail -n 1)
+		git branch br_ballast         HEAD^ || error "no ancestor commit from current head"
+		git branch br_ballast_alias   HEAD^
+		git branch br_ballast_plus_1  HEAD
+	fi &&
+	git checkout -q br_ballast &&
+	nr_files=$(git ls-files | wc -l)
+'
+
+test_perf "read-tree br_base br_ballast ($nr_files)" '
+	git read-tree -m br_base br_ballast -n
+'
+
+test_perf "switch between br_base br_ballast ($nr_files)" '
+	git checkout -q br_base &&
+	git checkout -q br_ballast
+'
+
+test_perf "switch between br_ballast br_ballast_plus_1 ($nr_files)" '
+	git checkout -q br_ballast_plus_1 &&
+	git checkout -q br_ballast
+'
+
+test_perf "switch between aliases ($nr_files)" '
+	git checkout -q br_ballast_alias &&
+	git checkout -q br_ballast
+'
+
+test_done
diff --git a/t/perf/repos/.gitignore b/t/perf/repos/.gitignore
new file mode 100644
index 0000000..72e3dc3
--- /dev/null
+++ b/t/perf/repos/.gitignore
@@ -0,0 +1 @@
+gen-*/
diff --git a/t/perf/repos/many-files.sh b/t/perf/repos/many-files.sh
new file mode 100755
index 0000000..5a1d25e
--- /dev/null
+++ b/t/perf/repos/many-files.sh
@@ -0,0 +1,110 @@
+#!/bin/sh
+## Generate test data repository using the given parameters.
+## When omitted, we create "gen-many-files-d-w-f.git".
+##
+## Usage: [-r repo] [-d depth] [-w width] [-f files]
+##
+## -r repo: path to the new repo to be generated
+## -d depth: the depth of sub-directories
+## -w width: the number of sub-directories at each level
+## -f files: the number of files created in each directory
+##
+## Note that all files will have the same SHA-1 and each
+## directory at a level will have the same SHA-1, so we
+## will potentially have a large index, but not a large
+## ODB.
+##
+## Ballast will be created under "ballast/".
+
+EMPTY_BLOB=e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
+
+set -e
+
+## (5, 10, 9) will create 999,999 ballast files.
+## (4, 10, 9) will create  99,999 ballast files.
+depth=5
+width=10
+files=9
+
+while test "$#" -ne 0
+do
+    case "$1" in
+	-r)
+	    shift;
+	    test "$#" -ne 0 || { echo 'error: -r requires an argument' >&2; exit 1; }
+	    repo=$1;
+	    shift ;;
+	-d)
+	    shift;
+	    test "$#" -ne 0 || { echo 'error: -d requires an argument' >&2; exit 1; }
+	    depth=$1;
+	    shift ;;
+	-w)
+	    shift;
+	    test "$#" -ne 0 || { echo 'error: -w requires an argument' >&2; exit 1; }
+	    width=$1;
+	    shift ;;
+	-f)
+	    shift;
+	    test "$#" -ne 0 || { echo 'error: -f requires an argument' >&2; exit 1; }
+	    files=$1;
+	    shift ;;
+	*)
+	    echo "error: unknown option '$1'" >&2; exit 1 ;;
+	esac
+done
+
+## Inflate the index with thousands of empty files.
+## usage: dir depth width files
+fill_index() {
+	awk -v arg_dir=$1 -v arg_depth=$2 -v arg_width=$3 -v arg_files=$4 '
+		function make_paths(dir, depth, width, files, f, w) {
+			for (f = 1; f <= files; f++) {
+				print dir "/file" f
+			}
+			if (depth > 0) {
+				for (w = 1; w <= width; w++) {
+					make_paths(dir "/dir" w, depth - 1, width, files)
+				}
+			}
+		}
+		END { make_paths(arg_dir, arg_depth, arg_width, arg_files) }
+		' </dev/null |
+	sed "s/^/100644 $EMPTY_BLOB	/" |
+	git update-index --index-info
+	return 0
+}
+
+[ -z "$repo" ] && repo=gen-many-files-$depth.$width.$files.git
+
+mkdir $repo
+cd $repo
+git init .
+
+## Create an initial commit just to define master.
+touch many-files.empty
+echo "$depth $width $files" >many-files.params
+git add many-files.*
+git commit -q -m params
+
+## Create ballast for p0006 based upon the given params and
+## inflate the index with thousands of empty files and commit.
+git checkout -b p0006-ballast
+fill_index "ballast" $depth $width $files
+git commit -q -m "ballast"
+
+nr_files=$(git ls-files | wc -l)
+
+## Modify 1 file and commit.
+echo "$depth $width $files" >>many-files.params
+git add many-files.params
+git commit -q -m "ballast plus 1"
+
+## Checkout master to put repo in canonical state (because
+## the perf test may need to clone and enable sparse-checkout
+## before attempting to checkout a commit with the ballast
+## (because it may contain 100K directories and 1M files)).
+git checkout master
+
+echo "Repository "$repo" ($depth, $width, $files) created.  Ballast $nr_files."
+exit 0
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH v10 3/3] read-cache: speed up add_index_entry during checkout
  2017-04-14 19:12 [PATCH v10 0/3] read-cache: speed up add_index_entry git
  2017-04-14 19:12 ` [PATCH v10 1/3] read-cache: add strcmp_offset function git
  2017-04-14 19:12 ` [PATCH v10 2/3] p0006-read-tree-checkout: perf test to time read-tree git
@ 2017-04-14 19:12 ` git
  2017-04-15 17:55   ` René Scharfe
  2 siblings, 1 reply; 7+ messages in thread
From: git @ 2017-04-14 19:12 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

Teach add_index_entry_with_check() and has_dir_name()
to see if the path of the new item is greater than the
last path in the index array before attempting to search
for it.

During checkout, merge_working_tree() populates the new
index in sorted order, so this change will save at least 2
binary lookups per file.  This preserves the original
behavior but simply checks the last element before starting
the search.

This helps performance on very large repositories.

This can be seen using p0006-read-tree-checkout.sh and the
artificial repository created by t/perf/repos/many-files.sh
with parameters (5, 10, 9).   (1M files in index.)

Test                                                            HEAD^              HEAD
----------------------------------------------------------------------------------------------------------
0006.2: read-tree br_base br_ballast (1000001)                  4.15(2.72+1.41)    3.21(1.97+1.22) -22.7%
0006.3: switch between br_base br_ballast (1000001)             8.11(5.57+2.28)    6.77(4.36+2.14) -16.5%
0006.4: switch between br_ballast br_ballast_plus_1 (1000001)   13.52(8.68+4.35)   11.80(7.38+3.96) -12.7%
0006.5: switch between aliases (1000001)                        13.59(8.75+4.43)   11.85(7.49+3.87) -12.8%

On linux.git, results are:
Test                                                          HEAD^             HEAD
------------------------------------------------------------------------------------------------------
0006.2: read-tree br_base br_ballast (57994)                  0.24(0.22+0.01)   0.20(0.17+0.02) -16.7%
0006.3: switch between br_base br_ballast (57994)             9.91(5.82+2.79)   10.26(5.92+2.77) +3.5%
0006.4: switch between br_ballast br_ballast_plus_1 (57994)   0.59(0.44+0.16)   0.50(0.34+0.18) -15.3%
0006.5: switch between aliases (57994)                        0.62(0.48+0.16)   0.50(0.35+0.16) -19.4%

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 read-cache.c | 118 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 116 insertions(+), 2 deletions(-)

diff --git a/read-cache.c b/read-cache.c
index 97f13a1..ba95fbb 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -910,6 +910,9 @@ int strcmp_offset(const char *s1, const char *s2, size_t *first_change)
 /*
  * Do we have another file with a pathname that is a proper
  * subset of the name we're trying to add?
+ *
+ * That is, is there another file in the index with a path
+ * that matches a sub-directory in the given entry?
  */
 static int has_dir_name(struct index_state *istate,
 			const struct cache_entry *ce, int pos, int ok_to_replace)
@@ -918,9 +921,51 @@ static int has_dir_name(struct index_state *istate,
 	int stage = ce_stage(ce);
 	const char *name = ce->name;
 	const char *slash = name + ce_namelen(ce);
+	size_t len_eq_last;
+	int cmp_last = 0;
+
+	/*
+	 * We are frequently called during an iteration on a sorted
+	 * list of pathnames and while building a new index.  Therefore,
+	 * there is a high probability that this entry will eventually
+	 * be appended to the index, rather than inserted in the middle.
+	 * If we can confirm that, we can avoid binary searches on the
+	 * components of the pathname.
+	 *
+	 * Compare the entry's full path with the last path in the index.
+	 */
+	if (istate->cache_nr > 0) {
+		cmp_last = strcmp_offset(name,
+			istate->cache[istate->cache_nr - 1]->name,
+			&len_eq_last);
+		if (cmp_last > 0) {
+			if (len_eq_last == 0) {
+				/*
+				 * The entry sorts AFTER the last one in the
+				 * index and their paths have no common prefix,
+				 * so there cannot be a F/D conflict.
+				 */
+				return retval;
+			} else {
+				/*
+				 * The entry sorts AFTER the last one in the
+				 * index, but has a common prefix.  Fall through
+				 * to the loop below to disect the entry's path
+				 * and see where the difference is.
+				 */
+			}
+		} else if (cmp_last == 0) {
+			/*
+			 * The entry exactly matches the last one in the
+			 * index, but because of multiple stage and CE_REMOVE
+			 * items, we fall through and let the regular search
+			 * code handle it.
+			 */
+		}
+	}
 
 	for (;;) {
-		int len;
+		size_t len;
 
 		for (;;) {
 			if (*--slash == '/')
@@ -930,6 +975,66 @@ static int has_dir_name(struct index_state *istate,
 		}
 		len = slash - name;
 
+		if (cmp_last > 0) {
+			/*
+			 * (len + 1) is a directory boundary (including
+			 * the trailing slash).  And since the loop is
+			 * decrementing "slash", the first iteration is
+			 * the longest directory prefix; subsequent
+			 * iterations consider parent directories.
+			 */
+
+			if (len + 1 <= len_eq_last) {
+				/*
+				 * The directory prefix (including the trailing
+				 * slash) also appears as a prefix in the last
+				 * entry, so the remainder cannot collide (because
+				 * strcmp said the whole path was greater).
+				 *
+				 * EQ: last: xxx/A
+				 *     this: xxx/B
+				 *
+				 * LT: last: xxx/file_A
+				 *     this: xxx/file_B
+				 */
+				return retval;
+			}
+
+			if (len > len_eq_last) {
+				/*
+				 * This part of the directory prefix (excluding
+				 * the trailing slash) is longer than the known
+				 * equal portions, so this sub-directory cannot
+				 * collide with a file.
+				 *
+				 * GT: last: xxxA
+				 *     this: xxxB/file
+				 */
+				return retval;
+			}
+
+			if (ce_namelen(istate->cache[istate->cache_nr - 1]) > len) {
+				/*
+				 * The directory prefix lines up with part of
+				 * a longer file or directory name, but sorts
+				 * after it, so this sub-directory cannot
+				 * collide with a file.
+				 *
+				 * last: xxx/yy-file (because '-' sorts before '/')
+				 * this: xxx/yy/abc
+				 */
+				return retval;
+			}
+
+			/*
+			 * This is a possible collision. Fall through and
+			 * let the regular search code handle it.
+			 *
+			 * last: xxx
+			 * this: xxx/file
+			 */
+		}
+
 		pos = index_name_stage_pos(istate, name, len, stage);
 		if (pos >= 0) {
 			/*
@@ -1021,7 +1126,16 @@ static int add_index_entry_with_check(struct index_state *istate, struct cache_e
 
 	if (!(option & ADD_CACHE_KEEP_CACHE_TREE))
 		cache_tree_invalidate_path(istate, ce->name);
-	pos = index_name_stage_pos(istate, ce->name, ce_namelen(ce), ce_stage(ce));
+
+	/*
+	 * If this entry's path sorts after the last entry in the index,
+	 * we can avoid searching for it.
+	 */
+	if (istate->cache_nr > 0 &&
+		strcmp(ce->name, istate->cache[istate->cache_nr - 1]->name) > 0)
+		pos = -istate->cache_nr - 1;
+	else
+		pos = index_name_stage_pos(istate, ce->name, ce_namelen(ce), ce_stage(ce));
 
 	/* existing match? Just replace it. */
 	if (pos >= 0) {
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v10 3/3] read-cache: speed up add_index_entry during checkout
  2017-04-14 19:12 ` [PATCH v10 3/3] read-cache: speed up add_index_entry during checkout git
@ 2017-04-15 17:55   ` René Scharfe
  2017-04-17 14:53     ` Jeff Hostetler
  0 siblings, 1 reply; 7+ messages in thread
From: René Scharfe @ 2017-04-15 17:55 UTC (permalink / raw)
  To: git, git; +Cc: gitster, peff, Jeff Hostetler

Am 14.04.2017 um 21:12 schrieb git@jeffhostetler.com:
> From: Jeff Hostetler <jeffhost@microsoft.com>
> 
> Teach add_index_entry_with_check() and has_dir_name()
> to see if the path of the new item is greater than the
> last path in the index array before attempting to search
> for it.
> 
> During checkout, merge_working_tree() populates the new
> index in sorted order, so this change will save at least 2
> binary lookups per file.  This preserves the original
> behavior but simply checks the last element before starting
> the search.
> 
> This helps performance on very large repositories.
> 
> This can be seen using p0006-read-tree-checkout.sh and the
> artificial repository created by t/perf/repos/many-files.sh
> with parameters (5, 10, 9).   (1M files in index.)
> 
> Test                                                            HEAD^              HEAD
> ----------------------------------------------------------------------------------------------------------
> 0006.2: read-tree br_base br_ballast (1000001)                  4.15(2.72+1.41)    3.21(1.97+1.22) -22.7%
> 0006.3: switch between br_base br_ballast (1000001)             8.11(5.57+2.28)    6.77(4.36+2.14) -16.5%
> 0006.4: switch between br_ballast br_ballast_plus_1 (1000001)   13.52(8.68+4.35)   11.80(7.38+3.96) -12.7%
> 0006.5: switch between aliases (1000001)                        13.59(8.75+4.43)   11.85(7.49+3.87) -12.8%
> 
> On linux.git, results are:
> Test                                                          HEAD^             HEAD
> ------------------------------------------------------------------------------------------------------
> 0006.2: read-tree br_base br_ballast (57994)                  0.24(0.22+0.01)   0.20(0.17+0.02) -16.7%
> 0006.3: switch between br_base br_ballast (57994)             9.91(5.82+2.79)   10.26(5.92+2.77) +3.5%
> 0006.4: switch between br_ballast br_ballast_plus_1 (57994)   0.59(0.44+0.16)   0.50(0.34+0.18) -15.3%
> 0006.5: switch between aliases (57994)                        0.62(0.48+0.16)   0.50(0.35+0.16) -19.4%
> 
> Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
> ---
>   read-cache.c | 118 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 116 insertions(+), 2 deletions(-)

Very nice, especially the perf test!  But can we unbundle the different
optimizations into separate patches with their own performance numbers?
Candidates IMHO: The change in add_index_entry_with_check(), the first
hunk in has_dir_name(), the loop shortcuts.  That might also help find
the reason for the slight slowdown of 0006.3 for the kernel repository.

> diff --git a/read-cache.c b/read-cache.c
> index 97f13a1..ba95fbb 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -910,6 +910,9 @@ int strcmp_offset(const char *s1, const char *s2, size_t *first_change)
>   /*
>    * Do we have another file with a pathname that is a proper
>    * subset of the name we're trying to add?
> + *
> + * That is, is there another file in the index with a path
> + * that matches a sub-directory in the given entry?
>    */
>   static int has_dir_name(struct index_state *istate,
>   			const struct cache_entry *ce, int pos, int ok_to_replace)
> @@ -918,9 +921,51 @@ static int has_dir_name(struct index_state *istate,
>   	int stage = ce_stage(ce);
>   	const char *name = ce->name;
>   	const char *slash = name + ce_namelen(ce);
> +	size_t len_eq_last;
> +	int cmp_last = 0;
> +
> +	/*
> +	 * We are frequently called during an iteration on a sorted
> +	 * list of pathnames and while building a new index.  Therefore,
> +	 * there is a high probability that this entry will eventually
> +	 * be appended to the index, rather than inserted in the middle.
> +	 * If we can confirm that, we can avoid binary searches on the
> +	 * components of the pathname.
> +	 *
> +	 * Compare the entry's full path with the last path in the index.
> +	 */
> +	if (istate->cache_nr > 0) {
> +		cmp_last = strcmp_offset(name,
> +			istate->cache[istate->cache_nr - 1]->name,
> +			&len_eq_last);
> +		if (cmp_last > 0) {
> +			if (len_eq_last == 0) {
> +				/*
> +				 * The entry sorts AFTER the last one in the
> +				 * index and their paths have no common prefix,
> +				 * so there cannot be a F/D conflict.
> +				 */
> +				return retval;
> +			} else {
> +				/*
> +				 * The entry sorts AFTER the last one in the
> +				 * index, but has a common prefix.  Fall through
> +				 * to the loop below to disect the entry's path
> +				 * and see where the difference is.
> +				 */
> +			}
> +		} else if (cmp_last == 0) {
> +			/*
> +			 * The entry exactly matches the last one in the
> +			 * index, but because of multiple stage and CE_REMOVE
> +			 * items, we fall through and let the regular search
> +			 * code handle it.
> +			 */
> +		}
> +	}
>   
>   	for (;;) {
> -		int len;
> +		size_t len;
>   
>   		for (;;) {
>   			if (*--slash == '/')
> @@ -930,6 +975,66 @@ static int has_dir_name(struct index_state *istate,
>   		}
>   		len = slash - name;
>   
> +		if (cmp_last > 0) {
> +			/*
> +			 * (len + 1) is a directory boundary (including
> +			 * the trailing slash).  And since the loop is
> +			 * decrementing "slash", the first iteration is
> +			 * the longest directory prefix; subsequent
> +			 * iterations consider parent directories.
> +			 */
> +
> +			if (len + 1 <= len_eq_last) {
> +				/*
> +				 * The directory prefix (including the trailing
> +				 * slash) also appears as a prefix in the last
> +				 * entry, so the remainder cannot collide (because
> +				 * strcmp said the whole path was greater).
> +				 *
> +				 * EQ: last: xxx/A
> +				 *     this: xxx/B
> +				 *
> +				 * LT: last: xxx/file_A
> +				 *     this: xxx/file_B
> +				 */
> +				return retval;
> +			}
> +
> +			if (len > len_eq_last) {
> +				/*
> +				 * This part of the directory prefix (excluding
> +				 * the trailing slash) is longer than the known
> +				 * equal portions, so this sub-directory cannot
> +				 * collide with a file.
> +				 *
> +				 * GT: last: xxxA
> +				 *     this: xxxB/file
> +				 */
> +				return retval;
> +			}
> +

At this point len and len_eq_last are equal -- otherwise one of the two
previous conditions would have triggered.  Silly question: Is this
necessary for the following shortcut to work?  Removing the two checks
above doesn't seem to affect performance very much, and at least the
test suite still passes..

> +			if (ce_namelen(istate->cache[istate->cache_nr - 1]) > len) {
> +				/*
> +				 * The directory prefix lines up with part of
> +				 * a longer file or directory name, but sorts
> +				 * after it, so this sub-directory cannot
> +				 * collide with a file.
> +				 *
> +				 * last: xxx/yy-file (because '-' sorts before '/')
> +				 * this: xxx/yy/abc
> +				 */
> +				return retval;
> +			}

istate->cache_nr can be zero if remove_index_entry_at() had been called
in a previous iteration, resulting in a segfault.  Checking right here
is probably the easiest way out; not sure if exiting early when the
index becomes empty would be better.

> +
> +			/*
> +			 * This is a possible collision. Fall through and
> +			 * let the regular search code handle it.
> +			 *
> +			 * last: xxx
> +			 * this: xxx/file
> +			 */
> +		}
> +
>   		pos = index_name_stage_pos(istate, name, len, stage);
>   		if (pos >= 0) {
>   			/*
> @@ -1021,7 +1126,16 @@ static int add_index_entry_with_check(struct index_state *istate, struct cache_e
>   
>   	if (!(option & ADD_CACHE_KEEP_CACHE_TREE))
>   		cache_tree_invalidate_path(istate, ce->name);
> -	pos = index_name_stage_pos(istate, ce->name, ce_namelen(ce), ce_stage(ce));
> +
> +	/*
> +	 * If this entry's path sorts after the last entry in the index,
> +	 * we can avoid searching for it.
> +	 */
> +	if (istate->cache_nr > 0 &&
> +		strcmp(ce->name, istate->cache[istate->cache_nr - 1]->name) > 0)
> +		pos = -istate->cache_nr - 1;
> +	else
> +		pos = index_name_stage_pos(istate, ce->name, ce_namelen(ce), ce_stage(ce));
>   
>   	/* existing match? Just replace it. */
>   	if (pos >= 0) {
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v10 3/3] read-cache: speed up add_index_entry during checkout
  2017-04-15 17:55   ` René Scharfe
@ 2017-04-17 14:53     ` Jeff Hostetler
  2017-04-18 16:19       ` Jeff Hostetler
  0 siblings, 1 reply; 7+ messages in thread
From: Jeff Hostetler @ 2017-04-17 14:53 UTC (permalink / raw)
  To: René Scharfe, git; +Cc: gitster, peff, Jeff Hostetler



On 4/15/2017 1:55 PM, René Scharfe wrote:
> Am 14.04.2017 um 21:12 schrieb git@jeffhostetler.com:
>> From: Jeff Hostetler <jeffhost@microsoft.com>
>
> Very nice, especially the perf test!  But can we unbundle the different
> optimizations into separate patches with their own performance numbers?
> Candidates IMHO: The change in add_index_entry_with_check(), the first
> hunk in has_dir_name(), the loop shortcuts.  That might also help find
> the reason for the slight slowdown of 0006.3 for the kernel repository.

Let me take a look at this and see if it helps.

>> diff --git a/read-cache.c b/read-cache.c
>> index 97f13a1..ba95fbb 100644
>> --- a/read-cache.c
>> +++ b/read-cache.c
>> @@ -910,6 +910,9 @@ int strcmp_offset(const char *s1, const char *s2,
>> size_t *first_change)
>>   /*
>>    * Do we have another file with a pathname that is a proper
>>    * subset of the name we're trying to add?
>> + *
>> + * That is, is there another file in the index with a path
>> + * that matches a sub-directory in the given entry?
>>    */
>>   static int has_dir_name(struct index_state *istate,
>>               const struct cache_entry *ce, int pos, int ok_to_replace)
>> @@ -918,9 +921,51 @@ static int has_dir_name(struct index_state *istate,
>>       int stage = ce_stage(ce);
>>       const char *name = ce->name;
>>       const char *slash = name + ce_namelen(ce);
>> +    size_t len_eq_last;
>> +    int cmp_last = 0;
>> +
>> +    /*
>> +     * We are frequently called during an iteration on a sorted
>> +     * list of pathnames and while building a new index.  Therefore,
>> +     * there is a high probability that this entry will eventually
>> +     * be appended to the index, rather than inserted in the middle.
>> +     * If we can confirm that, we can avoid binary searches on the
>> +     * components of the pathname.
>> +     *
>> +     * Compare the entry's full path with the last path in the index.
>> +     */
>> +    if (istate->cache_nr > 0) {
>> +        cmp_last = strcmp_offset(name,
>> +            istate->cache[istate->cache_nr - 1]->name,
>> +            &len_eq_last);
>> +        if (cmp_last > 0) {
>> +            if (len_eq_last == 0) {
>> +                /*
>> +                 * The entry sorts AFTER the last one in the
>> +                 * index and their paths have no common prefix,
>> +                 * so there cannot be a F/D conflict.
>> +                 */
>> +                return retval;
>> +            } else {
>> +                /*
>> +                 * The entry sorts AFTER the last one in the
>> +                 * index, but has a common prefix.  Fall through
>> +                 * to the loop below to disect the entry's path
>> +                 * and see where the difference is.
>> +                 */
>> +            }
>> +        } else if (cmp_last == 0) {
>> +            /*
>> +             * The entry exactly matches the last one in the
>> +             * index, but because of multiple stage and CE_REMOVE
>> +             * items, we fall through and let the regular search
>> +             * code handle it.
>> +             */
>> +        }
>> +    }
>>         for (;;) {
>> -        int len;
>> +        size_t len;
>>             for (;;) {
>>               if (*--slash == '/')
>> @@ -930,6 +975,66 @@ static int has_dir_name(struct index_state *istate,
>>           }
>>           len = slash - name;
>>   +        if (cmp_last > 0) {
>> +            /*
>> +             * (len + 1) is a directory boundary (including
>> +             * the trailing slash).  And since the loop is
>> +             * decrementing "slash", the first iteration is
>> +             * the longest directory prefix; subsequent
>> +             * iterations consider parent directories.
>> +             */
>> +
>> +            if (len + 1 <= len_eq_last) {
>> +                /*
>> +                 * The directory prefix (including the trailing
>> +                 * slash) also appears as a prefix in the last
>> +                 * entry, so the remainder cannot collide (because
>> +                 * strcmp said the whole path was greater).
>> +                 *
>> +                 * EQ: last: xxx/A
>> +                 *     this: xxx/B
>> +                 *
>> +                 * LT: last: xxx/file_A
>> +                 *     this: xxx/file_B
>> +                 */
>> +                return retval;
>> +            }
>> +
>> +            if (len > len_eq_last) {
>> +                /*
>> +                 * This part of the directory prefix (excluding
>> +                 * the trailing slash) is longer than the known
>> +                 * equal portions, so this sub-directory cannot
>> +                 * collide with a file.
>> +                 *
>> +                 * GT: last: xxxA
>> +                 *     this: xxxB/file
>> +                 */
>> +                return retval;
>> +            }
>> +
>
> At this point len and len_eq_last are equal -- otherwise one of the two
> previous conditions would have triggered.  Silly question: Is this
> necessary for the following shortcut to work?  Removing the two checks
> above doesn't seem to affect performance very much, and at least the
> test suite still passes..

Both of these are highly dependent on the shape of the
pathnames in the tree.  Let me try to quantify this.

>
>> +            if (ce_namelen(istate->cache[istate->cache_nr - 1]) > len) {
>> +                /*
>> +                 * The directory prefix lines up with part of
>> +                 * a longer file or directory name, but sorts
>> +                 * after it, so this sub-directory cannot
>> +                 * collide with a file.
>> +                 *
>> +                 * last: xxx/yy-file (because '-' sorts before '/')
>> +                 * this: xxx/yy/abc
>> +                 */
>> +                return retval;
>> +            }
>
> istate->cache_nr can be zero if remove_index_entry_at() had been called
> in a previous iteration, resulting in a segfault.  Checking right here
> is probably the easiest way out; not sure if exiting early when the
> index becomes empty would be better.

yeah, i guess remove_...() could delete them all and this test
would be ill-defined.  Let me fix that.

>
>> +
>> +            /*
>> +             * This is a possible collision. Fall through and
>> +             * let the regular search code handle it.
>> +             *
>> +             * last: xxx
>> +             * this: xxx/file
>> +             */
>> +        }
>> +
>>           pos = index_name_stage_pos(istate, name, len, stage);
>>           if (pos >= 0) {
>>               /*
>> @@ -1021,7 +1126,16 @@ static int add_index_entry_with_check(struct
>> index_state *istate, struct cache_e
>>         if (!(option & ADD_CACHE_KEEP_CACHE_TREE))
>>           cache_tree_invalidate_path(istate, ce->name);
>> -    pos = index_name_stage_pos(istate, ce->name, ce_namelen(ce),
>> ce_stage(ce));
>> +
>> +    /*
>> +     * If this entry's path sorts after the last entry in the index,
>> +     * we can avoid searching for it.
>> +     */
>> +    if (istate->cache_nr > 0 &&
>> +        strcmp(ce->name, istate->cache[istate->cache_nr - 1]->name) > 0)
>> +        pos = -istate->cache_nr - 1;
>> +    else
>> +        pos = index_name_stage_pos(istate, ce->name, ce_namelen(ce),
>> ce_stage(ce));
>>         /* existing match? Just replace it. */
>>       if (pos >= 0) {
>>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v10 3/3] read-cache: speed up add_index_entry during checkout
  2017-04-17 14:53     ` Jeff Hostetler
@ 2017-04-18 16:19       ` Jeff Hostetler
  0 siblings, 0 replies; 7+ messages in thread
From: Jeff Hostetler @ 2017-04-18 16:19 UTC (permalink / raw)
  To: René Scharfe, git; +Cc: gitster, peff, Jeff Hostetler

On 4/17/2017 10:53 AM, Jeff Hostetler wrote:
>
>
> On 4/15/2017 1:55 PM, René Scharfe wrote:
>> Am 14.04.2017 um 21:12 schrieb git@jeffhostetler.com:
>>> From: Jeff Hostetler <jeffhost@microsoft.com>
>>
>> Very nice, especially the perf test!  But can we unbundle the different
>> optimizations into separate patches with their own performance numbers?
>> Candidates IMHO: The change in add_index_entry_with_check(), the first
>> hunk in has_dir_name(), the loop shortcuts.  That might also help find
>> the reason for the slight slowdown of 0006.3 for the kernel repository.
>
> Let me take a look at this and see if it helps.

Last night I pushed up version 11 which has the 3 parts
of read-cache.c in 3 commits (but still in the same patch
series).  This should allow for more experimentation.

The add_index_entry_with_check() shows a gain.  For the
operations in p0006 on linux.git, the short-cut was being
taken 57993 of 57994 times.

The top of has_dir_name() -- by itself -- does not, but
the short-cut only triggers when the paths have no
prefix in common -- which only happens when the top-level
directory changes.  On linux.git, this was 19 of 57993.
However, it does set us up for the next part.

The 3 loop body short-cuts hit 54372, 3509, and 86 (sum
57967) times.  So in p0006, the search was only attempted
7 times (57993 - 19 - 57967) most of the time.

WRT the slowdown of 0006.3 on linux.git, I suspect this is
I/O noise.  In the commit message for part 2 in V11, I
show 2 runs on linux.git that show wide variance in the 0006.3
times.  And given the nature of that test, the speed up in the
lookups is completely hidden by the I/O of the full checkouts.
When I step up to a repo with 4M files, the results are very
clear.

https://public-inbox.org/git/20170417213734.55373-6-git@jeffhostetler.com/

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2017-04-18 16:19 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-14 19:12 [PATCH v10 0/3] read-cache: speed up add_index_entry git
2017-04-14 19:12 ` [PATCH v10 1/3] read-cache: add strcmp_offset function git
2017-04-14 19:12 ` [PATCH v10 2/3] p0006-read-tree-checkout: perf test to time read-tree git
2017-04-14 19:12 ` [PATCH v10 3/3] read-cache: speed up add_index_entry during checkout git
2017-04-15 17:55   ` René Scharfe
2017-04-17 14:53     ` Jeff Hostetler
2017-04-18 16:19       ` Jeff Hostetler

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).