[PATCH v1] diffcore-rename speedup

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [PATCH v1] diffcore-rename speedup
@ 2017-04-18 19:44 git
  2017-04-18 19:44 ` [PATCH v1] diffcore-rename: speed up register_rename_src git
  0 siblings, 1 reply; 13+ messages in thread
From: git @ 2017-04-18 19:44 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

Here is another micro-optimization for very large repositories.
Speed up register_rename_src() in diffcore-rename.c

Jeff Hostetler (1):
  diffcore-rename: speed up register_rename_src

 diffcore-rename.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

-- 
2.9.3


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v1] diffcore-rename: speed up register_rename_src
  2017-04-18 19:44 [PATCH v1] diffcore-rename speedup git
@ 2017-04-18 19:44 ` git
  2017-04-19  1:32   ` Jeff King
  0 siblings, 1 reply; 13+ messages in thread
From: git @ 2017-04-18 19:44 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

Teach register_rename_src() to see if new file pair
can simply be appended to the rename_src[] array before
performing the binary search to find the proper insertion
point.

This is a performance optimization.  This routine is called
during run_diff_files in status and the caller is iterating
over the sorted index, so we should expect to be able to
append in the normal case.  The existing insert logic is
preserved so we don't have to assume that, but simply take
advantage of it if possible.

Using t/perf/p0005-status.h (from a parallel patch series),
we get the following improvement on a 4.2M file repo:

Test                                            HEAD~1               HEAD
0005.2: read-tree status br_ballast (4194305)   59.14(31.85+18.79)   55.48(28.52+20.71) -6.2%

On a 1M file repo:
Test                                            HEAD~1            HEAD
0005.2: read-tree status br_ballast (1000001)   8.20(4.82+3.35)   7.91(4.57+3.27) -3.5%

On a smaller repo, like linux.git (58K files), results are masked
by normal I/O variance.

Test                                          HEAD~1            HEAD
0005.2: read-tree status br_ballast (57994)   0.43(0.30+0.13)   0.42(0.31+0.12) -2.3%
0005.2: read-tree status br_ballast (57994)   0.42(0.32+0.09)   0.43(0.34+0.10) +2.4%
0005.2: read-tree status br_ballast (57994)   0.44(0.33+0.10)   0.42(0.26+0.16) -4.5%

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 diffcore-rename.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index f7444c8..543a409 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -81,6 +81,18 @@ static struct diff_rename_src *register_rename_src(struct diff_filepair *p)
 
 	first = 0;
 	last = rename_src_nr;
+
+	if (last > 0) {
+		struct diff_rename_src *src = &(rename_src[last-1]);
+		int cmp = strcmp(one->path, src->p->one->path);
+		if (!cmp)
+			return src;
+		if (cmp > 0) {
+			first = last;
+			goto append_it;
+		}
+	}
+
 	while (last > first) {
 		int next = (last + first) >> 1;
 		struct diff_rename_src *src = &(rename_src[next]);
@@ -94,6 +106,7 @@ static struct diff_rename_src *register_rename_src(struct diff_filepair *p)
 		first = next+1;
 	}
 
+append_it:
 	/* insert to make it at "first" */
 	ALLOC_GROW(rename_src, rename_src_nr + 1, rename_src_alloc);
 	rename_src_nr++;
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v1] diffcore-rename: speed up register_rename_src
  2017-04-18 19:44 ` [PATCH v1] diffcore-rename: speed up register_rename_src git
@ 2017-04-19  1:32   ` Jeff King
  2017-04-19  2:45     ` Junio C Hamano
  2017-04-20 10:40     ` Johannes Schindelin
  0 siblings, 2 replies; 13+ messages in thread
From: Jeff King @ 2017-04-19  1:32 UTC (permalink / raw)
  To: git; +Cc: git, gitster, Jeff Hostetler

On Tue, Apr 18, 2017 at 07:44:21PM +0000, git@jeffhostetler.com wrote:

> From: Jeff Hostetler <jeffhost@microsoft.com>
> 
> Teach register_rename_src() to see if new file pair
> can simply be appended to the rename_src[] array before
> performing the binary search to find the proper insertion
> point.

I guess your perf results show some minor improvement. But I suspect
this is because your synthetic repo does not resemble the real world
very much. You're saving a few strcmps, but for each of those files
you're potentially going to have actually zlib inflate the object
contents and do similarity analysis.

So "absurd number of files doing 100% exact renames" is the absolute
best case, and it saves a few percent.

I dunno. It is not that much code _here_, but I'm not excited about the
prospect of sprinkling this same "check the last one" optimization all
over the code base. I wonder if there's some way to generalize it.

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v1] diffcore-rename: speed up register_rename_src
  2017-04-19  1:32   ` Jeff King
@ 2017-04-19  2:45     ` Junio C Hamano
  2017-04-19  2:56       ` Jeff King
  2017-04-20 10:40     ` Johannes Schindelin
  1 sibling, 1 reply; 13+ messages in thread
From: Junio C Hamano @ 2017-04-19  2:45 UTC (permalink / raw)
  To: Jeff King; +Cc: git, git, Jeff Hostetler

Jeff King <peff@peff.net> writes:

> On Tue, Apr 18, 2017 at 07:44:21PM +0000, git@jeffhostetler.com wrote:
>
>> From: Jeff Hostetler <jeffhost@microsoft.com>
>> 
>> Teach register_rename_src() to see if new file pair
>> can simply be appended to the rename_src[] array before
>> performing the binary search to find the proper insertion
>> point.
>
> I guess your perf results show some minor improvement. But I suspect
> this is because your synthetic repo does not resemble the real world
> very much. You're saving a few strcmps, but for each of those files
> you're potentially going to have actually zlib inflate the object
> contents and do similarity analysis.
>
> So "absurd number of files doing 100% exact renames" is the absolute
> best case, and it saves a few percent.
>
> I dunno. It is not that much code _here_, but I'm not excited about the
> prospect of sprinkling this same "check the last one" optimization all
> over the code base. I wonder if there's some way to generalize it.

When adding many things, we often just append and then sort at the
end after we finished adding.  I wonder if recent "check the last
one and append" optimization beats that strategy.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v1] diffcore-rename: speed up register_rename_src
  2017-04-19  2:45     ` Junio C Hamano
@ 2017-04-19  2:56       ` Jeff King
  2017-04-19  3:18         ` Jeff King
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff King @ 2017-04-19  2:56 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, git, Jeff Hostetler

On Tue, Apr 18, 2017 at 07:45:05PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > On Tue, Apr 18, 2017 at 07:44:21PM +0000, git@jeffhostetler.com wrote:
> >
> >> From: Jeff Hostetler <jeffhost@microsoft.com>
> >> 
> >> Teach register_rename_src() to see if new file pair
> >> can simply be appended to the rename_src[] array before
> >> performing the binary search to find the proper insertion
> >> point.
> >
> > I guess your perf results show some minor improvement. But I suspect
> > this is because your synthetic repo does not resemble the real world
> > very much. You're saving a few strcmps, but for each of those files
> > you're potentially going to have actually zlib inflate the object
> > contents and do similarity analysis.
> >
> > So "absurd number of files doing 100% exact renames" is the absolute
> > best case, and it saves a few percent.
> >
> > I dunno. It is not that much code _here_, but I'm not excited about the
> > prospect of sprinkling this same "check the last one" optimization all
> > over the code base. I wonder if there's some way to generalize it.
> 
> When adding many things, we often just append and then sort at the
> end after we finished adding.  I wonder if recent "check the last
> one and append" optimization beats that strategy.

The big question is whether we need to detect duplicates while we're
appending to the list, which is hard on an unsorted list.  In this
function, at least, we do detect when the path already exists and return
the existing entry. I'm not sure under what circumstances we would see
such a duplicate, though, as each filename should appear only once in
the tree diff. I would think.

Doing:

diff --git a/diffcore-rename.c b/diffcore-rename.c
index f7444c86b..56a493d97 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -86,7 +86,7 @@ static struct diff_rename_src *register_rename_src(struct diff_filepair *p)
 		struct diff_rename_src *src = &(rename_src[next]);
 		int cmp = strcmp(one->path, src->p->one->path);
 		if (!cmp)
-			return src;
+			die("BUG: duplicate rename src: %s", one->path);
 		if (cmp < 0) {
 			last = next;
 			continue;

passes the test suite, at least. :)

-Peff

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v1] diffcore-rename: speed up register_rename_src
  2017-04-19  2:56       ` Jeff King
@ 2017-04-19  3:18         ` Jeff King
  2017-04-20 14:00           ` Jeff Hostetler
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff King @ 2017-04-19  3:18 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, git, Jeff Hostetler

On Tue, Apr 18, 2017 at 10:56:08PM -0400, Jeff King wrote:

> > When adding many things, we often just append and then sort at the
> > end after we finished adding.  I wonder if recent "check the last
> > one and append" optimization beats that strategy.
> 
> The big question is whether we need to detect duplicates while we're
> appending to the list, which is hard on an unsorted list.  In this
> function, at least, we do detect when the path already exists and return
> the existing entry. I'm not sure under what circumstances we would see
> such a duplicate, though, as each filename should appear only once in
> the tree diff. I would think.
> 
> Doing:
> 
> diff --git a/diffcore-rename.c b/diffcore-rename.c
> index f7444c86b..56a493d97 100644
> --- a/diffcore-rename.c
> +++ b/diffcore-rename.c
> @@ -86,7 +86,7 @@ static struct diff_rename_src *register_rename_src(struct diff_filepair *p)
>  		struct diff_rename_src *src = &(rename_src[next]);
>  		int cmp = strcmp(one->path, src->p->one->path);
>  		if (!cmp)
> -			return src;
> +			die("BUG: duplicate rename src: %s", one->path);
>  		if (cmp < 0) {
>  			last = next;
>  			continue;
> 
> passes the test suite, at least. :)

Maybe relevant: 4d6be03b9 (diffcore-rename: avoid processing duplicate
destinations, 2015-02-26). That's on the dst side, but possibly we
should do something similar on the src side.

BTW, I think the return value from register_rename_src() is
questionable. It points to a "struct diff_rename_src" that may be
reallocated by further calls to the function. Fortunately nobody
actually looks at it, let alone saves it, so there's no bug.

We may want to convert that return value to a void (if not just return
an int for "hey, there's a duplicate", like we do for add_rename_dst()).

Also, presumably that function could learn the same "check the last one"
trick that the src side does. Which leads me back to "surely we can
generalize this". I don't think bsearch() is quite what we want, because
its interface doesn't tell us where to put the item when it isn't found.
But I think we could make a general bsearch-like function that has
similar semantics to index_name_pos(), with its negative-integer return.

And then that would be a general lookup function, and we could easily
build a general "look up and add" function around that. And the "check
the last one" optimization would go in the latter.

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v1] diffcore-rename: speed up register_rename_src
  2017-04-19  3:18         ` Jeff King
@ 2017-04-20 14:00           ` Jeff Hostetler
  2017-04-20 16:13             ` Jeff King
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff Hostetler @ 2017-04-20 14:00 UTC (permalink / raw)
  To: Jeff King, Junio C Hamano; +Cc: git, Jeff Hostetler

On 4/18/2017 11:18 PM, Jeff King wrote:
> On Tue, Apr 18, 2017 at 10:56:08PM -0400, Jeff King wrote:
>
>>> When adding many things, we often just append and then sort at the
>>> end after we finished adding.  I wonder if recent "check the last
>>> one and append" optimization beats that strategy.
>>
>> The big question is whether we need to detect duplicates while we're
>> appending to the list, which is hard on an unsorted list.  In this
>> function, at least, we do detect when the path already exists and return
>> the existing entry. I'm not sure under what circumstances we would see
>> such a duplicate, though, as each filename should appear only once in
>> the tree diff. I would think.
>>
>> Doing:
>>
>> diff --git a/diffcore-rename.c b/diffcore-rename.c
>> index f7444c86b..56a493d97 100644
>> --- a/diffcore-rename.c
>> +++ b/diffcore-rename.c
>> @@ -86,7 +86,7 @@ static struct diff_rename_src *register_rename_src(struct diff_filepair *p)
>>  		struct diff_rename_src *src = &(rename_src[next]);
>>  		int cmp = strcmp(one->path, src->p->one->path);
>>  		if (!cmp)
>> -			return src;
>> +			die("BUG: duplicate rename src: %s", one->path);
>>  		if (cmp < 0) {
>>  			last = next;
>>  			continue;
>>
>> passes the test suite, at least. :)
>
> Maybe relevant: 4d6be03b9 (diffcore-rename: avoid processing duplicate
> destinations, 2015-02-26). That's on the dst side, but possibly we
> should do something similar on the src side.
>
> BTW, I think the return value from register_rename_src() is
> questionable. It points to a "struct diff_rename_src" that may be
> reallocated by further calls to the function. Fortunately nobody
> actually looks at it, let alone saves it, so there's no bug.
>
> We may want to convert that return value to a void (if not just return
> an int for "hey, there's a duplicate", like we do for add_rename_dst()).
>
> Also, presumably that function could learn the same "check the last one"
> trick that the src side does. Which leads me back to "surely we can
> generalize this". I don't think bsearch() is quite what we want, because
> its interface doesn't tell us where to put the item when it isn't found.
> But I think we could make a general bsearch-like function that has
> similar semantics to index_name_pos(), with its negative-integer return.
>
> And then that would be a general lookup function, and we could easily
> build a general "look up and add" function around that. And the "check
> the last one" optimization would go in the latter.

I do see your point.  And I do get tired of littering the code with
"check the last before binary searching" tricks.  It would be nice
if we could better isolate this.  I've tried to follow the path of
least change/damage -- don't change functionality, but just short-cut
when possible, so the patches I've pushed up this month have mostly
taken that form.

And you're right, the gains on this particular patch are relatively
minor and it is a bit of a contrived case (lots of renames??), but
it did come up while testing on the Windows repo.  It doesn't happen
often, but it did happen.

So I'm OK with rejecting this one.

Perhaps the thing to learn from this (and the other ones) is that
we have lots of places where we are building a sorted list by
iterating over a sorted list.  The insert routines are general
purpose and cannot assume this, so they search first.  Perhaps it
would be clearer to have independent _append and _insert functions
and have the caller explicitly call the appropriate one.  The mainline
iterations on the existing index could just call the _append form
and never have to worry about searching or the negative-integer
return trick.  Whereas, the random iterations (such as on the
command's arg list), would always call the _insert form.

Jeff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v1] diffcore-rename: speed up register_rename_src
  2017-04-20 14:00           ` Jeff Hostetler
@ 2017-04-20 16:13             ` Jeff King
  2017-04-20 18:08               ` Jeff Hostetler
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff King @ 2017-04-20 16:13 UTC (permalink / raw)
  To: Jeff Hostetler; +Cc: Junio C Hamano, git, Jeff Hostetler

On Thu, Apr 20, 2017 at 10:00:04AM -0400, Jeff Hostetler wrote:

> Perhaps the thing to learn from this (and the other ones) is that
> we have lots of places where we are building a sorted list by
> iterating over a sorted list.  The insert routines are general
> purpose and cannot assume this, so they search first.  Perhaps it
> would be clearer to have independent _append and _insert functions
> and have the caller explicitly call the appropriate one.  The mainline
> iterations on the existing index could just call the _append form
> and never have to worry about searching or the negative-integer
> return trick.  Whereas, the random iterations (such as on the
> command's arg list), would always call the _insert form.

Yes. I'd be much happier if your patch was flipping between two
general-purpose insertion functions. And if that same trick was used on
the dst side.

Or even, given that this these functions are called from a single
location that has sorted input, the binary search was just replaced
completely with an append combined with a sort-check.

That's not the minimal change you were going for, but I think the end
result is simpler and more consistent.

E.g., something like this:

diff --git a/diffcore-rename.c b/diffcore-rename.c
index f7444c86b..a5c017198 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -43,26 +43,20 @@ static struct diff_rename_dst *locate_rename_dst(struct diff_filespec *two)
 }
 
 /*
- * Returns 0 on success, -1 if we found a duplicate.
+ * Returns 0 on success, -1 if we found a duplicate or a sorting problem.
  */
 static int add_rename_dst(struct diff_filespec *two)
 {
-	int first = find_rename_dst(two);
-
-	if (first >= 0)
+	if (rename_dst_nr > 0 &&
+	    strcmp(two->path, rename_dst[rename_dst_nr - 1].two->path) <= 0)
 		return -1;
-	first = -first - 1;
 
-	/* insert to make it at "first" */
 	ALLOC_GROW(rename_dst, rename_dst_nr + 1, rename_dst_alloc);
+	rename_dst[rename_dst_nr].two = alloc_filespec(two->path);
+	fill_filespec(rename_dst[rename_dst_nr].two,
+		      two->oid.hash, two->oid_valid, two->mode);
+	rename_dst[rename_dst_nr].pair = NULL;
 	rename_dst_nr++;
-	if (first < rename_dst_nr)
-		memmove(rename_dst + first + 1, rename_dst + first,
-			(rename_dst_nr - first - 1) * sizeof(*rename_dst));
-	rename_dst[first].two = alloc_filespec(two->path);
-	fill_filespec(rename_dst[first].two, two->oid.hash, two->oid_valid,
-		      two->mode);
-	rename_dst[first].pair = NULL;
 	return 0;
 }
 
@@ -73,36 +67,17 @@ static struct diff_rename_src {
 } *rename_src;
 static int rename_src_nr, rename_src_alloc;
 
-static struct diff_rename_src *register_rename_src(struct diff_filepair *p)
+static int register_rename_src(struct diff_filepair *p)
 {
-	int first, last;
-	struct diff_filespec *one = p->one;
-	unsigned short score = p->score;
-
-	first = 0;
-	last = rename_src_nr;
-	while (last > first) {
-		int next = (last + first) >> 1;
-		struct diff_rename_src *src = &(rename_src[next]);
-		int cmp = strcmp(one->path, src->p->one->path);
-		if (!cmp)
-			return src;
-		if (cmp < 0) {
-			last = next;
-			continue;
-		}
-		first = next+1;
-	}
+	if (rename_src_nr > 0 &&
+	    strcmp(p->one->path, rename_src[rename_src_nr - 1].p->one->path) <= 0)
+		return -1;
 
-	/* insert to make it at "first" */
 	ALLOC_GROW(rename_src, rename_src_nr + 1, rename_src_alloc);
+	rename_src[rename_src_nr].p = p;
+	rename_src[rename_src_nr].score = p->score;
 	rename_src_nr++;
-	if (first < rename_src_nr)
-		memmove(rename_src + first + 1, rename_src + first,
-			(rename_src_nr - first - 1) * sizeof(*rename_src));
-	rename_src[first].p = p;
-	rename_src[first].score = score;
-	return &(rename_src[first]);
+	return 0;
 }
 
 static int basename_same(struct diff_filespec *src, struct diff_filespec *dst)
@@ -468,7 +443,7 @@ void diffcore_rename(struct diff_options *options)
 				continue;
 			else if (add_rename_dst(p->two) < 0) {
 				warning("skipping rename detection, detected"
-					" duplicate destination '%s'",
+					" duplicate or out-of-order destination '%s'",
 					p->two->path);
 				goto cleanup;
 			}
@@ -486,7 +461,12 @@ void diffcore_rename(struct diff_options *options)
 			 */
 			if (p->broken_pair && !p->score)
 				p->one->rename_used++;
-			register_rename_src(p);
+			if (register_rename_src(p) < 0) {
+				warning("skipping rename detection, detected"
+					" duplicate or out-of-order source '%s'",
+					p->one->path);
+				goto cleanup;
+			}
 		}
 		else if (detect_rename == DIFF_DETECT_COPY) {
 			/*
@@ -494,7 +474,12 @@ void diffcore_rename(struct diff_options *options)
 			 * one, to indicate ourselves as a user.
 			 */
 			p->one->rename_used++;
-			register_rename_src(p);
+			if (register_rename_src(p) < 0) {
+				warning("skipping rename detection, detected"
+					" duplicate or out-of-order source '%s'",
+					p->one->path);
+				goto cleanup;
+			}
 		}
 	}
 	if (rename_dst_nr == 0 || rename_src_nr == 0)

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v1] diffcore-rename: speed up register_rename_src
  2017-04-20 16:13             ` Jeff King
@ 2017-04-20 18:08               ` Jeff Hostetler
  2017-04-20 18:34                 ` Jeff King
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff Hostetler @ 2017-04-20 18:08 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, git, Jeff Hostetler



On 4/20/2017 12:13 PM, Jeff King wrote:
> On Thu, Apr 20, 2017 at 10:00:04AM -0400, Jeff Hostetler wrote:
>
>> Perhaps the thing to learn from this (and the other ones) is that
>> we have lots of places where we are building a sorted list by
>> iterating over a sorted list.  The insert routines are general
>> purpose and cannot assume this, so they search first.  Perhaps it
>> would be clearer to have independent _append and _insert functions
>> and have the caller explicitly call the appropriate one.  The mainline
>> iterations on the existing index could just call the _append form
>> and never have to worry about searching or the negative-integer
>> return trick.  Whereas, the random iterations (such as on the
>> command's arg list), would always call the _insert form.
>
> Yes. I'd be much happier if your patch was flipping between two
> general-purpose insertion functions. And if that same trick was used on
> the dst side.
>
> Or even, given that this these functions are called from a single
> location that has sorted input, the binary search was just replaced
> completely with an append combined with a sort-check.
>
> That's not the minimal change you were going for, but I think the end
> result is simpler and more consistent.

OK, let me take a stab at something like that and
see where it takes me.

WRT your earlier comment about how often we add or delete 4M
files and then run status.  The use case that started this was a
1% sparse-checkout followed by a read-tree (which reset the
skip-worktree bits) and then status (which thought 99% of the
worktree had been deleted or maybe renamed).  There are probably
other ways to get into this state, but that's how this started.
The more subtle point is that -- for these obscenely large
values of n -- any time I see an O(n log n) operation that could
or should be O(n), I want to stop and look at it.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v1] diffcore-rename: speed up register_rename_src
  2017-04-20 18:08               ` Jeff Hostetler
@ 2017-04-20 18:34                 ` Jeff King
  2017-04-21  1:19                   ` Junio C Hamano
  0 siblings, 1 reply; 13+ messages in thread
From: Jeff King @ 2017-04-20 18:34 UTC (permalink / raw)
  To: Jeff Hostetler; +Cc: Junio C Hamano, git, Jeff Hostetler

On Thu, Apr 20, 2017 at 02:08:46PM -0400, Jeff Hostetler wrote:

> > That's not the minimal change you were going for, but I think the end
> > result is simpler and more consistent.
> 
> OK, let me take a stab at something like that and
> see where it takes me.

Thanks.

I set the patch as a lump, but I think there are a few things going on
there:

  - the return value of register_rename_src() is actively dangerous (it
    points to memory which may be reallocated), so it's good that it
    goes away in favor of an "int"

  - we already refuse to do rename detection when there are duplicate
    dsts. This adds the same for srcs. I don't know if the same safety
    rules apply there, but it certainly seems like a reasonable and
    consistent precaution to say "this tree looks broken, let's skip
    rename detection". But it does mean a potential change in
    functionality in that corner case.

  - this patch probably adds "unsorted tree" to the list of breakages
    that would cause us to skip rename detection. I don't know if that's
    actually possible in practice (i.e., do we end up sorting the
    diffq elsewhere anyway?). I also wondered if it might run afoul of
    diffcore_order(), but that is applied after rename detection, so
    we're OK.

> WRT your earlier comment about how often we add or delete 4M
> files and then run status.  The use case that started this was a
> 1% sparse-checkout followed by a read-tree (which reset the
> skip-worktree bits) and then status (which thought 99% of the
> worktree had been deleted or maybe renamed).  There are probably
> other ways to get into this state, but that's how this started.

Right, that sounds plausible. I guess I just wondered if this is
something an average developer runs daily, or something that they would
run into once a year. Shaving 4s of CPU off of a once-a-year operation
is less exciting.

> The more subtle point is that -- for these obscenely large
> values of n -- any time I see an O(n log n) operation that could
> or should be O(n), I want to stop and look at it.

Heh. I spent a fair bit of time in Git's past turning O(n^2) operations
into O(n log n), so I feel your pain. I do think it's important to pay
attention to whole-operation numbers, though. Quite often you have an
O(n log n) with a small constant (like a single strcmp) coupled with
something linear but with a huge constant (like loading blob contents),
and micro-optimizations to the former get drowned out by the latter.

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v1] diffcore-rename: speed up register_rename_src
  2017-04-20 18:34                 ` Jeff King
@ 2017-04-21  1:19                   ` Junio C Hamano
  0 siblings, 0 replies; 13+ messages in thread
From: Junio C Hamano @ 2017-04-21  1:19 UTC (permalink / raw)
  To: Jeff King; +Cc: Jeff Hostetler, git, Jeff Hostetler

Jeff King <peff@peff.net> writes:

>   - this patch probably adds "unsorted tree" to the list of breakages
>     that would cause us to skip rename detection. I don't know if that's
>     actually possible in practice (i.e., do we end up sorting the
>     diffq elsewhere anyway?). I also wondered if it might run afoul of
>     diffcore_order(), but that is applied after rename detection, so
>     we're OK.

One of the frontends (I think it was diff-index) couldn't generate
sorted output (which is input to diffcore-* machinery) but I think
diffq is sorted before getting passed to the diffcore-* machinery in
that codepath, so we should be also OK on that front.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v1] diffcore-rename: speed up register_rename_src
  2017-04-19  1:32   ` Jeff King
  2017-04-19  2:45     ` Junio C Hamano
@ 2017-04-20 10:40     ` Johannes Schindelin
  2017-04-20 15:50       ` Jeff King
  1 sibling, 1 reply; 13+ messages in thread
From: Johannes Schindelin @ 2017-04-20 10:40 UTC (permalink / raw)
  To: Jeff King; +Cc: git, git, gitster, Jeff Hostetler

Hi Peff,

On Tue, 18 Apr 2017, Jeff King wrote:

> On Tue, Apr 18, 2017 at 07:44:21PM +0000, git@jeffhostetler.com wrote:
> 
> > From: Jeff Hostetler <jeffhost@microsoft.com>
> > 
> > Teach register_rename_src() to see if new file pair can simply be
> > appended to the rename_src[] array before performing the binary search
> > to find the proper insertion point.
> 
> I guess your perf results show some minor improvement. But I suspect
> this is because your synthetic repo does not resemble the real world
> very much.

Please note that the synthetic test repo was added *after* coming up with
the patch, *after* performance benchmarking on a certain really big
repository (it is not hard to guess what use case we are optimizing,
right?).

In that light, I would like to register the fact that Jeff's performance
work is trying to improve a very real world, that of more than 2,000
developers in our company [*1*].

Ciao,
Johannes

Footnote *1*: https://twitter.com/GabeAul/status/846189637945110528

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v1] diffcore-rename: speed up register_rename_src
  2017-04-20 10:40     ` Johannes Schindelin
@ 2017-04-20 15:50       ` Jeff King
  0 siblings, 0 replies; 13+ messages in thread
From: Jeff King @ 2017-04-20 15:50 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: git, git, gitster, Jeff Hostetler

On Thu, Apr 20, 2017 at 12:40:52PM +0200, Johannes Schindelin wrote:

> > > Teach register_rename_src() to see if new file pair can simply be
> > > appended to the rename_src[] array before performing the binary search
> > > to find the proper insertion point.
> > 
> > I guess your perf results show some minor improvement. But I suspect
> > this is because your synthetic repo does not resemble the real world
> > very much.
> 
> Please note that the synthetic test repo was added *after* coming up with
> the patch, *after* performance benchmarking on a certain really big
> repository (it is not hard to guess what use case we are optimizing,
> right?).
> 
> In that light, I would like to register the fact that Jeff's performance
> work is trying to improve a very real world, that of more than 2,000
> developers in our company [*1*].

Sure; I didn't think it came out of thin air. What are the benchmarks on
this real-world repository, then?

Specifically, it looks like this optimization isn't really about the
number of files in the repository so much as the number of
additions/deletions in a particular diff (which is what become rename
sources and destinations).

Is it common to add or delete 4 million tiny files and then run "git
status"?

Note that I think the optimization probably _is_ worth doing in the
general case. These "is it sorted" tradeoffs can backfire if we
sometimes get unsorted input, but I don't think that would ever be the
case here. My main complaint is not that it's not worth doing, but that
I'm not excited about sprinkling these checks ad-hoc throughout the code
base.

-Peff

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2017-04-21  1:20 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-04-18 19:44 [PATCH v1] diffcore-rename speedup git
2017-04-18 19:44 ` [PATCH v1] diffcore-rename: speed up register_rename_src git
2017-04-19  1:32   ` Jeff King
2017-04-19  2:45     ` Junio C Hamano
2017-04-19  2:56       ` Jeff King
2017-04-19  3:18         ` Jeff King
2017-04-20 14:00           ` Jeff Hostetler
2017-04-20 16:13             ` Jeff King
2017-04-20 18:08               ` Jeff Hostetler
2017-04-20 18:34                 ` Jeff King
2017-04-21  1:19                   ` Junio C Hamano
2017-04-20 10:40     ` Johannes Schindelin
2017-04-20 15:50       ` Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).