git@vger.kernel.org mailing list mirror (one of many)
* [PATCH v1] WIP unpack-trees: avoid duplicate ODB lookups during checkout
@ 2017-04-06 20:37 git
  2017-04-06 20:37 ` [PATCH v1] " git
  0 siblings, 1 reply; 8+ messages in thread
From: git @ 2017-04-06 20:37 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

Avoid duplicate ODB lookups for trees during traverse_tree_recursive().

I'm marking this WIP so we can talk about the TODO in the
commit message.

Jeff Hostetler (1):
  unpack-trees: avoid duplicate ODB lookups during checkout

 tree-walk.c    |  8 ++++++++
 tree-walk.h    |  1 +
 unpack-trees.c | 13 +++++++++----
 3 files changed, 18 insertions(+), 4 deletions(-)

-- 
2.9.3



* [PATCH v1] unpack-trees: avoid duplicate ODB lookups during checkout
  2017-04-06 20:37 [PATCH v1] WIP unpack-trees: avoid duplicate ODB lookups during checkout git
@ 2017-04-06 20:37 ` git
  2017-04-06 22:48   ` Stefan Beller
  2017-04-07  0:32   ` René Scharfe
  0 siblings, 2 replies; 8+ messages in thread
From: git @ 2017-04-06 20:37 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

Teach traverse_trees_recursive() to not do redundant ODB
lookups when both directories refer to the same OID.

In operations such as read-tree, checkout, and merge when
the differences between the commits are relatively small,
there will likely be many directories that have the same
SHA-1.  In these cases we can avoid hitting the ODB multiple
times for the same SHA-1.

TODO This change is a first attempt to test that by comparing
TODO the hashes of names[i] and names[i-1] and simply copying
TODO the tree-descriptor data.  I was thinking of the n=2
TODO case here.  We may want to extend this to the n=3 case.

================
On the Windows repo (500K trees, 3.1M files, 450MB index),
this reduced the overall time by 0.75 seconds when cycling
between 2 commits with a single file difference.

(avg) before: 22.699
(avg) after:  21.955
================

================
Using the p0004-read-tree test (posted earlier this week)
with 1M files on Linux:

before:
$ ./p0004-read-tree.sh
0004.5: switch work1 work2 (1003037)       11.99(8.12+3.32)
0004.6: switch commit aliases (1003037)    11.95(8.20+3.14)

after:
$ ./p0004-read-tree.sh
0004.5: switch work1 work2 (1003037)       11.17(7.84+2.76)
0004.6: switch commit aliases (1003037)    11.13(7.82+2.72)
================

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 tree-walk.c    |  8 ++++++++
 tree-walk.h    |  1 +
 unpack-trees.c | 13 +++++++++----
 3 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/tree-walk.c b/tree-walk.c
index ff77605..3b82f0e 100644
--- a/tree-walk.c
+++ b/tree-walk.c
@@ -92,6 +92,14 @@ void *fill_tree_descriptor(struct tree_desc *desc, const unsigned char *sha1)
 	return buf;
 }
 
+void *copy_tree_descriptor(struct tree_desc *dest, const struct tree_desc *src)
+{
+	void *buf = xmalloc(src->size);
+	memcpy(buf, src->buffer, src->size);
+	init_tree_desc(dest, buf, src->size);
+	return buf;
+}
+
 static void entry_clear(struct name_entry *a)
 {
 	memset(a, 0, sizeof(*a));
diff --git a/tree-walk.h b/tree-walk.h
index 68bb78b..ca4032b 100644
--- a/tree-walk.h
+++ b/tree-walk.h
@@ -43,6 +43,7 @@ int tree_entry(struct tree_desc *, struct name_entry *);
 int tree_entry_gently(struct tree_desc *, struct name_entry *);
 
 void *fill_tree_descriptor(struct tree_desc *desc, const unsigned char *sha1);
+void *copy_tree_descriptor(struct tree_desc *dest, const struct tree_desc *src);
 
 struct traverse_info;
 typedef int (*traverse_callback_t)(int n, unsigned long mask, unsigned long dirmask, struct name_entry *entry, struct traverse_info *);
diff --git a/unpack-trees.c b/unpack-trees.c
index 3a8ee19..50aacad 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -554,10 +554,15 @@ static int traverse_trees_recursive(int n, unsigned long dirmask,
 	newinfo.df_conflicts |= df_conflicts;
 
 	for (i = 0; i < n; i++, dirmask >>= 1) {
-		const unsigned char *sha1 = NULL;
-		if (dirmask & 1)
-			sha1 = names[i].oid->hash;
-		buf[i] = fill_tree_descriptor(t+i, sha1);
+		if (i > 0 && (dirmask & 1) && names[i].oid && names[i-1].oid &&
+			!hashcmp(names[i].oid->hash, names[i-1].oid->hash)) {
+			buf[i] = copy_tree_descriptor(&t[i], &t[i-1]);
+		} else {
+			const unsigned char *sha1 = NULL;
+			if (dirmask & 1)
+				sha1 = names[i].oid->hash;
+			buf[i] = fill_tree_descriptor(t+i, sha1);
+		}
 	}
 
 	bottom = switch_cache_bottom(&newinfo);
-- 
2.9.3



* Re: [PATCH v1] unpack-trees: avoid duplicate ODB lookups during checkout
  2017-04-06 20:37 ` [PATCH v1] " git
@ 2017-04-06 22:48   ` Stefan Beller
  2017-04-07  5:19     ` Jeff King
  2017-04-07 17:35     ` Jeff Hostetler
  2017-04-07  0:32   ` René Scharfe
  1 sibling, 2 replies; 8+ messages in thread
From: Stefan Beller @ 2017-04-06 22:48 UTC (permalink / raw)
  To: Jeff Hostetler
  Cc: git@vger.kernel.org, Junio C Hamano, Jeff King, Jeff Hostetler

On Thu, Apr 6, 2017 at 1:37 PM,  <git@jeffhostetler.com> wrote:
> From: Jeff Hostetler <jeffhost@microsoft.com>
>
> Teach traverse_trees_recursive() to not do redundant ODB
> lookups when both directories refer to the same OID.

And the reason for this is that omitting the second lookup
saves time, i.e. a lookup in the ODB of a sufficiently large
repo is slow.

My kneejerk line of thinking:
* yes, it sounds good to reduce the number of ODB accesses.
* But if we consider ODB lookups to be slow and we perform
  a structured access, how about a cache in front of the ODB?
* We already have that! (sort of..) 9a414486d9 (lookup_object:
  prioritize recently found objects, 2013-05-01)
* Instead of improving the caching, maybe change the
  size of the problem: We could keep the objects of different types
  in different hash-tables.

object.c has its own hash table, I presume for historical and
performance reasons; this would be split up into multiple hash
tables.

In addition to "object *lookup_object(*sha1)", we'd have
a function

"object *lookup_object(*sha1, enum object_type hint)"
which looks into the correct hash table.

If you were to call just lookup_object with no hint, then you'd
look into all the different tables (I guess there is a preferable
order in which to look, haven't thought about that).

>
> In operations such as read-tree, checkout, and merge when
> the differences between the commits are relatively small,
> there will likely be many directories that have the same
> SHA-1.  In these cases we can avoid hitting the ODB multiple
> times for the same SHA-1.

This would partially explain why there was such a good
performance boost in the referenced commit above, as we
implicitly look up the same object multiple times.

Peff is really into getting this part faster, c.f.
https://public-inbox.org/git/20160914235547.h3n2otje2hec6u7k@sigill.intra.peff.net/

> TODO This change is a first attempt to test that by comparing
> TODO the hashes of names[i] and names[i-1] and simply copying
> TODO the tree-descriptor data.  I was thinking of the n=2
> TODO case here.  We may want to extend this to the n=3 case.

>
> ================
> On the Windows repo (500K trees, 3.1M files, 450MB index),
> this reduced the overall time by 0.75 seconds when cycling
> between 2 commits with a single file difference.
>
> (avg) before: 22.699
> (avg) after:  21.955
> ===============

So it shaves off 4% of the time needed.  It doesn't sound like a
breakthrough, but I guess these small patches add up. :)

>         for (i = 0; i < n; i++, dirmask >>= 1) {
> -               const unsigned char *sha1 = NULL;
> -               if (dirmask & 1)
> -                       sha1 = names[i].oid->hash;
> -               buf[i] = fill_tree_descriptor(t+i, sha1);
> +               if (i > 0 && (dirmask & 1) && names[i].oid && names[i-1].oid &&
> +                       !hashcmp(names[i].oid->hash, names[i-1].oid->hash)) {

Why do we need to check for dirmask & 1 here?
This ought to be covered by the hashcmp already IIUC.
So maybe we can pull
                         if (dirmask & 1)
                                 sha1 = names[i].oid->hash;
out of the else when dropping that dirmask check?


Thanks,
Stefan


* Re: [PATCH v1] unpack-trees: avoid duplicate ODB lookups during checkout
  2017-04-06 20:37 ` [PATCH v1] " git
  2017-04-06 22:48   ` Stefan Beller
@ 2017-04-07  0:32   ` René Scharfe
  2017-04-07 13:57     ` Jeff Hostetler
  1 sibling, 1 reply; 8+ messages in thread
From: René Scharfe @ 2017-04-07  0:32 UTC (permalink / raw)
  To: git, git; +Cc: gitster, peff, Jeff Hostetler

On 06.04.2017 at 22:37, git@jeffhostetler.com wrote:
> From: Jeff Hostetler <jeffhost@microsoft.com>
> 
> Teach traverse_trees_recursive() to not do redundant ODB
> lookups when both directories refer to the same OID.
> 
> In operations such as read-tree, checkout, and merge when
> the differences between the commits are relatively small,
> there will likely be many directories that have the same
> SHA-1.  In these cases we can avoid hitting the ODB multiple
> times for the same SHA-1.
> 
> TODO This change is a first attempt to test that by comparing
> TODO the hashes of names[i] and names[i-1] and simply copying
> TODO the tree-descriptor data.  I was thinking of the n=2
> TODO case here.  We may want to extend this to the n=3 case.
> 
> ================
> On the Windows repo (500K trees, 3.1M files, 450MB index),
> this reduced the overall time by 0.75 seconds when cycling
> between 2 commits with a single file difference.
> 
> (avg) before: 22.699
> (avg) after:  21.955
> ===============
> 
> ================
> Using the p0004-read-tree test (posted earlier this week)
> with 1M files on Linux:
> 
> before:
> $ ./p0004-read-tree.sh
> 0004.5: switch work1 work2 (1003037)       11.99(8.12+3.32)
> 0004.6: switch commit aliases (1003037)    11.95(8.20+3.14)
> 
> after:
> $ ./p0004-read-tree.sh
> 0004.5: switch work1 work2 (1003037)       11.17(7.84+2.76)
> 0004.6: switch commit aliases (1003037)    11.13(7.82+2.72)
> ================
> 
> Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
> ---
>   tree-walk.c    |  8 ++++++++
>   tree-walk.h    |  1 +
>   unpack-trees.c | 13 +++++++++----
>   3 files changed, 18 insertions(+), 4 deletions(-)
> 
> diff --git a/tree-walk.c b/tree-walk.c
> index ff77605..3b82f0e 100644
> --- a/tree-walk.c
> +++ b/tree-walk.c
> @@ -92,6 +92,14 @@ void *fill_tree_descriptor(struct tree_desc *desc, const unsigned char *sha1)
>   	return buf;
>   }
>   
> +void *copy_tree_descriptor(struct tree_desc *dest, const struct tree_desc *src)
> +{
> +	void *buf = xmalloc(src->size);
> +	memcpy(buf, src->buffer, src->size);
> +	init_tree_desc(dest, buf, src->size);
> +	return buf;
> +}
> +
>   static void entry_clear(struct name_entry *a)
>   {
>   	memset(a, 0, sizeof(*a));
> diff --git a/tree-walk.h b/tree-walk.h
> index 68bb78b..ca4032b 100644
> --- a/tree-walk.h
> +++ b/tree-walk.h
> @@ -43,6 +43,7 @@ int tree_entry(struct tree_desc *, struct name_entry *);
>   int tree_entry_gently(struct tree_desc *, struct name_entry *);
>   
>   void *fill_tree_descriptor(struct tree_desc *desc, const unsigned char *sha1);
> +void *copy_tree_descriptor(struct tree_desc *dest, const struct tree_desc *src);
>   
>   struct traverse_info;
>   typedef int (*traverse_callback_t)(int n, unsigned long mask, unsigned long dirmask, struct name_entry *entry, struct traverse_info *);
> diff --git a/unpack-trees.c b/unpack-trees.c
> index 3a8ee19..50aacad 100644
> --- a/unpack-trees.c
> +++ b/unpack-trees.c
> @@ -554,10 +554,15 @@ static int traverse_trees_recursive(int n, unsigned long dirmask,
>   	newinfo.df_conflicts |= df_conflicts;
>   
>   	for (i = 0; i < n; i++, dirmask >>= 1) {
> -		const unsigned char *sha1 = NULL;
> -		if (dirmask & 1)
> -			sha1 = names[i].oid->hash;
> -		buf[i] = fill_tree_descriptor(t+i, sha1);
> +		if (i > 0 && (dirmask & 1) && names[i].oid && names[i-1].oid &&

Can .oid even be NULL?  (I didn't check, but it's dereferenced in the 
sha1 assignment below, so I guess the answer is no and these two checks 
are not needed.)

> +			!hashcmp(names[i].oid->hash, names[i-1].oid->hash)) {

Calling oidcmp would be shorter.

> +			buf[i] = copy_tree_descriptor(&t[i], &t[i-1]);

buf keeps track of the allocations that need to be freed at the end of 
the function.  I assume these buffers are read-only.  Can you use an 
alias here instead of a duplicate by calling init_tree_desc with the 
predecessor's buffer and setting buf[i] to NULL?  Or even just copying 
t[i - 1] to t[i] with an assignment?  That would be shorter and probably 
also quicker.

> +		} else {
> +			const unsigned char *sha1 = NULL;
> +			if (dirmask & 1)
> +				sha1 = names[i].oid->hash;
> +			buf[i] = fill_tree_descriptor(t+i, sha1);
> +		}
>   	}
>   
>   	bottom = switch_cache_bottom(&newinfo);
> 


* Re: [PATCH v1] unpack-trees: avoid duplicate ODB lookups during checkout
  2017-04-06 22:48   ` Stefan Beller
@ 2017-04-07  5:19     ` Jeff King
  2017-04-07 13:51       ` Jeff Hostetler
  2017-04-07 17:35     ` Jeff Hostetler
  1 sibling, 1 reply; 8+ messages in thread
From: Jeff King @ 2017-04-07  5:19 UTC (permalink / raw)
  To: Stefan Beller
  Cc: Jeff Hostetler, git@vger.kernel.org, Junio C Hamano,
	Jeff Hostetler

On Thu, Apr 06, 2017 at 03:48:07PM -0700, Stefan Beller wrote:

> On Thu, Apr 6, 2017 at 1:37 PM,  <git@jeffhostetler.com> wrote:
> > From: Jeff Hostetler <jeffhost@microsoft.com>
> >
> > Teach traverse_trees_recursive() to not do redundant ODB
> > lookups when both directories refer to the same OID.
> 
> And the reason for this is that omitting the second lookup
> saves time, i.e. a lookup in the ODB of a sufficiently large
> repo is slow.
> 
> My kneejerk line of thinking:
> * yes, it sounds good to reduce the number of ODB accesses.
> * But if we consider ODB lookups to be slow and we perform
>   a structured access, how about a cache in front of the ODB?
> * We already have that! (sort of..) 9a414486d9 (lookup_object:
>   prioritize recently found objects, 2013-05-01)
> * Instead of improving the caching, maybe change the
>   size of the problem: We could keep the objects of different types
>   in different hash-tables.

I don't think this is using lookup_object() at all, though. It goes
straight to fill_tree_descriptor(), which will read the object contents
from disk. So our time is going to:

  1. Finding the object in the odb (ideally a binary search in a single
     pack index, but less optimal when there are many packs).

  2. Reconstructing the object. This means zlib-inflating from disk, but
     it may also mean delta reconstruction.

I think there _are_ some caches at play here, though, when you look up
the same tree back to back. The pack-mru means that we'll always look in
the correct pack first. And in theory the delta base cache means that
we'll already have the whole thing reconstructed in memory (though I
have often been confused by exactly when we put items into that cache,
so I might be wrong).

So in theory, this is not all that different than the "just allocate and
copy the bytes" optimization that's happening here (though I'm not
surprised that doing it at a higher level can produce some speedup).

I think the more interesting optimization is "just use the same buffer
without bothering to copy", which is hard for the low-level code to do
(since it doesn't know what lifetime the caller is expecting). 

> object.c has its own hash table, I presume for historical and
> performance reasons, this would be split up to multiple hash
> tables.

So I don't think lookup_object() is really relevant here. But I'm also
not sure that multiple hash tables would really buy us much. In theory
hash tables are O(1), so multiple smaller tables doesn't help (and might
hurt, since now we have four O(1) lookups to do). Of course that's a
minor fiction because of collisions, but we keep the load factor on the
object.c table pretty low (and that's why things like quadratic probing
and cuckoo hashing never showed great improvement).

-Peff


* Re: [PATCH v1] unpack-trees: avoid duplicate ODB lookups during checkout
  2017-04-07  5:19     ` Jeff King
@ 2017-04-07 13:51       ` Jeff Hostetler
  0 siblings, 0 replies; 8+ messages in thread
From: Jeff Hostetler @ 2017-04-07 13:51 UTC (permalink / raw)
  To: Jeff King, Stefan Beller
  Cc: git@vger.kernel.org, Junio C Hamano, Jeff Hostetler



On 4/7/2017 1:19 AM, Jeff King wrote:
> On Thu, Apr 06, 2017 at 03:48:07PM -0700, Stefan Beller wrote:
>
>> On Thu, Apr 6, 2017 at 1:37 PM,  <git@jeffhostetler.com> wrote:
>>> From: Jeff Hostetler <jeffhost@microsoft.com>
>>>
>>> Teach traverse_trees_recursive() to not do redundant ODB
>>> lookups when both directories refer to the same OID.
>>
>> And the reason for this is that omitting the second lookup
>> saves time, i.e. a lookup in the ODB of a sufficiently large
>> repo is slow.
>>
>> My kneejerk line of thinking:
>> * yes, it sounds good to reduce the number of ODB accesses.
>> * But if we consider ODB lookups to be slow and we perform
>>   a structured access, how about a cache in front of the ODB?
>> * We already have that! (sort of..) 9a414486d9 (lookup_object:
>>   prioritize recently found objects, 2013-05-01)
>> * Instead of improving the caching, maybe change the
>>   size of the problem: We could keep the objects of different types
>>   in different hash-tables.
>
> I don't think this is using lookup_object() at all, though. It goes
> straight to fill_tree_descriptor(), which will read the object contents
> from disk. So our time is going to:
>
>   1. Finding the object in the odb (ideally a binary search in a single
>      pack index, but less optimal when there are many packs).
>
>   2. Reconstructing the object. This means zlib-inflating from disk, but
>      it may also mean delta reconstruction.
>
> I think there _are_ some caches at play here, though, when you look up
> the same tree back to back. The pack-mru means that we'll always look in
> the correct pack first. And in theory the delta base cache means that
> we'll already have the whole thing reconstructed in memory (though I
> have often been confused by exactly when we put items into that cache,
> so I might be wrong).

This change shaved off 500K calls to fill_tree_descriptor() (on the Windows
source tree on a "checkout -b" to the same commit).  I was surprised it
didn't give more of a speedup, so some of that caching may be in play
here, but it's hard to tell.

Also, on my repo I have 100GB+ of packfiles, so that may be messing
with things a bit.

>
> So in theory, this is not all that different than the "just allocate and
> copy the bytes" optimization that's happening here (though I'm not
> surprised that doing it at a higher level can produce some speedup).
>
> I think the more interesting optimization is "just use the same buffer
> without bothering to copy", which is hard for the low-level code to do
> (since it doesn't know what lifetime the caller is expecting).

My first draft did just borrow the buffer right there in traverse_
and it seemed to work just fine.  I was being cautious and copied it
properly in the lower layer.  Since the bufs are freed at the bottom
it felt safe, but I didn't want to overlook anything.  I'll switch it
back.

>
>> object.c has its own hash table, I presume for historical and
>> performance reasons, this would be split up to multiple hash
>> tables.
>
> So I don't think lookup_object() is really relevant here. But I'm also
> not sure that multiple hash tables would really buy us much. In theory
> hash tables are O(1), so multiple smaller tables doesn't help (and might
> hurt, since now we have four O(1) lookups to do). Of course that's a
> minor fiction because of collisions, but we keep the load factor on the
> object.c table pretty low (and that's why things like quadratic probing
> and cuckoo hashing never showed great improvement).

I do wonder now about the initial hash table size and any limits
on it, but that is a question for another day.

Thanks
Jeff



* Re: [PATCH v1] unpack-trees: avoid duplicate ODB lookups during checkout
  2017-04-07  0:32   ` René Scharfe
@ 2017-04-07 13:57     ` Jeff Hostetler
  0 siblings, 0 replies; 8+ messages in thread
From: Jeff Hostetler @ 2017-04-07 13:57 UTC (permalink / raw)
  To: René Scharfe, git; +Cc: gitster, peff, Jeff Hostetler



On 4/6/2017 8:32 PM, René Scharfe wrote:
> On 06.04.2017 at 22:37, git@jeffhostetler.com wrote:
>> From: Jeff Hostetler <jeffhost@microsoft.com>
>>
>> Teach traverse_trees_recursive() to not do redundant ODB
>> lookups when both directories refer to the same OID.
>>
>> In operations such as read-tree, checkout, and merge when
>> the differences between the commits are relatively small,
>> there will likely be many directories that have the same
>> SHA-1.  In these cases we can avoid hitting the ODB multiple
>> times for the same SHA-1.
>>
>> TODO This change is a first attempt to test that by comparing
>> TODO the hashes of names[i] and names[i-1] and simply copying
>> TODO the tree-descriptor data.  I was thinking of the n=2
>> TODO case here.  We may want to extend this to the n=3 case.
>>
>> ================
>> On the Windows repo (500K trees, 3.1M files, 450MB index),
>> this reduced the overall time by 0.75 seconds when cycling
>> between 2 commits with a single file difference.
>>
>> (avg) before: 22.699
>> (avg) after:  21.955
>> ===============
>>
>> ================
>> Using the p0004-read-tree test (posted earlier this week)
>> with 1M files on Linux:
>>
>> before:
>> $ ./p0004-read-tree.sh
>> 0004.5: switch work1 work2 (1003037)       11.99(8.12+3.32)
>> 0004.6: switch commit aliases (1003037)    11.95(8.20+3.14)
>>
>> after:
>> $ ./p0004-read-tree.sh
>> 0004.5: switch work1 work2 (1003037)       11.17(7.84+2.76)
>> 0004.6: switch commit aliases (1003037)    11.13(7.82+2.72)
>> ================
>>
>> Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
>> ---
>>   tree-walk.c    |  8 ++++++++
>>   tree-walk.h    |  1 +
>>   unpack-trees.c | 13 +++++++++----
>>   3 files changed, 18 insertions(+), 4 deletions(-)
>>
>> diff --git a/tree-walk.c b/tree-walk.c
>> index ff77605..3b82f0e 100644
>> --- a/tree-walk.c
>> +++ b/tree-walk.c
>> @@ -92,6 +92,14 @@ void *fill_tree_descriptor(struct tree_desc *desc,
>> const unsigned char *sha1)
>>       return buf;
>>   }
>>   +void *copy_tree_descriptor(struct tree_desc *dest, const struct
>> tree_desc *src)
>> +{
>> +    void *buf = xmalloc(src->size);
>> +    memcpy(buf, src->buffer, src->size);
>> +    init_tree_desc(dest, buf, src->size);
>> +    return buf;
>> +}
>> +
>>   static void entry_clear(struct name_entry *a)
>>   {
>>       memset(a, 0, sizeof(*a));
>> diff --git a/tree-walk.h b/tree-walk.h
>> index 68bb78b..ca4032b 100644
>> --- a/tree-walk.h
>> +++ b/tree-walk.h
>> @@ -43,6 +43,7 @@ int tree_entry(struct tree_desc *, struct name_entry
>> *);
>>   int tree_entry_gently(struct tree_desc *, struct name_entry *);
>>     void *fill_tree_descriptor(struct tree_desc *desc, const unsigned
>> char *sha1);
>> +void *copy_tree_descriptor(struct tree_desc *dest, const struct
>> tree_desc *src);
>>     struct traverse_info;
>>   typedef int (*traverse_callback_t)(int n, unsigned long mask,
>> unsigned long dirmask, struct name_entry *entry, struct traverse_info *);
>> diff --git a/unpack-trees.c b/unpack-trees.c
>> index 3a8ee19..50aacad 100644
>> --- a/unpack-trees.c
>> +++ b/unpack-trees.c
>> @@ -554,10 +554,15 @@ static int traverse_trees_recursive(int n,
>> unsigned long dirmask,
>>       newinfo.df_conflicts |= df_conflicts;
>>         for (i = 0; i < n; i++, dirmask >>= 1) {
>> -        const unsigned char *sha1 = NULL;
>> -        if (dirmask & 1)
>> -            sha1 = names[i].oid->hash;
>> -        buf[i] = fill_tree_descriptor(t+i, sha1);
>> +        if (i > 0 && (dirmask & 1) && names[i].oid && names[i-1].oid &&
>
> Can .oid even be NULL?  (I didn't check, but it's dereferenced in the
> sha1 assignment below, so I guess the answer is no and these two checks
> are not needed.)

Yes.  I think the (dirmask & 1) is hiding it for names[i].  I put both in
the code above, but it also worked fine just testing both OIDs (and
without dirmask).

>
>> +            !hashcmp(names[i].oid->hash, names[i-1].oid->hash)) {
>
> Calling oidcmp would be shorter.

Right.

>
>> +            buf[i] = copy_tree_descriptor(&t[i], &t[i-1]);
>
> buf keeps track of the allocations that need to be freed at the end of
> the function.  I assume these buffers are read-only.  Can you use an
> alias here instead of a duplicate by calling init_tree_desc with the
> predecessor's buffer and setting buf[i] to NULL?  Or even just copying
> t[i - 1] to t[i] with an assignment?  That would be shorter and probably
> also quicker.

Yes, my first draft did that.  Just being cautious, but I'll switch it
back since that seems to be the consensus.

>
>> +        } else {
>> +            const unsigned char *sha1 = NULL;
>> +            if (dirmask & 1)
>> +                sha1 = names[i].oid->hash;
>> +            buf[i] = fill_tree_descriptor(t+i, sha1);
>> +        }
>>       }
>>         bottom = switch_cache_bottom(&newinfo);
>>

Thanks
Jeff


* Re: [PATCH v1] unpack-trees: avoid duplicate ODB lookups during checkout
  2017-04-06 22:48   ` Stefan Beller
  2017-04-07  5:19     ` Jeff King
@ 2017-04-07 17:35     ` Jeff Hostetler
  1 sibling, 0 replies; 8+ messages in thread
From: Jeff Hostetler @ 2017-04-07 17:35 UTC (permalink / raw)
  To: Stefan Beller
  Cc: git@vger.kernel.org, Junio C Hamano, Jeff King, Jeff Hostetler



On 4/6/2017 6:48 PM, Stefan Beller wrote:
> On Thu, Apr 6, 2017 at 1:37 PM,  <git@jeffhostetler.com> wrote:
>> From: Jeff Hostetler <jeffhost@microsoft.com>
>>
>> Teach traverse_trees_recursive() to not do redundant ODB
>> lookups when both directories refer to the same OID.
>
> And the reason for this is that omitting the second lookup
> saves time, i.e. a lookup in the ODB of a sufficiently large
> repo is slow.
>
> My kneejerk line of thinking:
> * yes, it sounds good to reduce the number of ODB accesses.
> * But if we consider ODB lookups to be slow and we perform
>   a structured access, how about a cache in front of the ODB?
> * We already have that! (sort of..) 9a414486d9 (lookup_object:
>   prioritize recently found objects, 2013-05-01)
> * Instead of improving the caching, maybe change the
>   size of the problem: We could keep the objects of different types
>   in different hash-tables.
>
> object.c has its own hash table, I presume for historical and
> performance reasons; this would be split up into multiple hash
> tables.
>
> In addition to "object *lookup_object(*sha1)", we'd have
> a function
>
> "object *lookup_object(*sha1, enum object_type hint)"
> which looks into the correct hash table.
>
> If you were to call just lookup_object with no hint, then you'd
> look into all the different tables (I guess there is a preferable
> order in which to look, haven't thought about that).
>
>>
>> In operations such as read-tree, checkout, and merge when
>> the differences between the commits are relatively small,
>> there will likely be many directories that have the same
>> SHA-1.  In these cases we can avoid hitting the ODB multiple
>> times for the same SHA-1.
>
> This would partially explain why there was such a good
> performance boost in the referenced commit above, as we
> implicitly look up the same object multiple times.
>
> Peff is really into getting this part faster, c.f.
> https://public-inbox.org/git/20160914235547.h3n2otje2hec6u7k@sigill.intra.peff.net/

That looks interesting, but I question the probabilities for
my use case here.  When walking the trees and files in a single
commit, I have no expectation that I'll see the same tree OID
twice, so the cache is not really useful and may just add
overhead.  However, in a checkout or merge there is a high
expectation of visiting the same tree OID -- and most of the
time they are peers -- since commits tend to only change
isolated parts of the tree.  (I'm not going to worry about the
case where someone moves an entire sub-tree to somewhere else
in the tree and violates my peer assertion.)

I did notice that we do 2 independent passes during checkout.
First to compare the old and new commits.  Then to compare the
new with the worktree.  So we touch each tree object 3 times.

My patch helps the first, but does nothing for the second.
Hopefully the cache is helping it (but I have not measured that).


>
>> TODO This change is a first attempt to test that by comparing
>> TODO the hashes of names[i] and names[i-1] and simply copying
>> TODO the tree-descriptor data.  I was thinking of the n=2
>> TODO case here.  We may want to extend this to the n=3 case.
>
>>
>> ================
>> On the Windows repo (500K trees, 3.1M files, 450MB index),
>> this reduced the overall time by 0.75 seconds when cycling
>> between 2 commits with a single file difference.
>>
>> (avg) before: 22.699
>> (avg) after:  21.955
>> ===============
>
> So it shaves off 4% of the time needed.  It doesn't sound like a
> breakthrough, but I guess these small patches add up. :)

Agreed, but on the Windows source repo, it can take 30 seconds to
do a "checkout -b" (without changing the actual HEAD commit).
That's just for the housekeeping of ensuring you get a clean
worktree.  If I can knock off 5% here with minimal impact
and without changing any file formats, I'll take it.

And if I can just repeat that n times...  :-)

>
>>         for (i = 0; i < n; i++, dirmask >>= 1) {
>> -               const unsigned char *sha1 = NULL;
>> -               if (dirmask & 1)
>> -                       sha1 = names[i].oid->hash;
>> -               buf[i] = fill_tree_descriptor(t+i, sha1);
>> +               if (i > 0 && (dirmask & 1) && names[i].oid && names[i-1].oid &&
>> +                       !hashcmp(names[i].oid->hash, names[i-1].oid->hash)) {
>
> Why do we need to check for dirmask & 1 here?
> This ought to be covered by the hashcmp already IIUC.
> So maybe we can pull out the
>                          if (dirmask & 1)
>                                  sha1 = names[i].oid->hash;
> out of the else when dropping that dirmask check?

I was wondering about the test in the else clause.
Just now I put a quick assert in there and it went off,
so I'm not going to change this.

Thanks
Jeff

