git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff Hostetler <git@jeffhostetler.com>
To: "René Scharfe" <l.s.r@web.de>, git@vger.kernel.org
Cc: gitster@pobox.com, peff@peff.net,
	Jeff Hostetler <jeffhost@microsoft.com>
Subject: Re: [PATCH v1] unpack-trees: avoid duplicate ODB lookups during checkout
Date: Fri, 7 Apr 2017 09:57:12 -0400	[thread overview]
Message-ID: <b2952bdc-cb93-1317-58b1-a06ea4e14ee8@jeffhostetler.com> (raw)
In-Reply-To: <a1e08e20-7fe6-da3c-cd18-74a8bcd266e0@web.de>



On 4/6/2017 8:32 PM, René Scharfe wrote:
> Am 06.04.2017 um 22:37 schrieb git@jeffhostetler.com:
>> From: Jeff Hostetler <jeffhost@microsoft.com>
>>
>> Teach traverse_trees_recursive() to not do redundant ODB
>> lookups when both directories refer to the same OID.
>>
>> In operations such as read-tree, checkout, and merge when
>> the differences between the commits are relatively small,
>> there will likely be many directories that have the same
>> SHA-1.  In these cases we can avoid hitting the ODB multiple
>> times for the same SHA-1.
>>
>> TODO This change is a first attempt to test that by comparing
>> TODO the hashes of name[i] and name[i-i] and simply copying
>> TODO the tree-descriptor data.  I was thinking of the n=2
>> TODO case here.  We may want to extend this to the n=3 case.
>>
>> ================
>> On the Windows repo (500K trees, 3.1M files, 450MB index),
>> this reduced the overall time by 0.75 seconds when cycling
>> between 2 commits with a single file difference.
>>
>> (avg) before: 22.699
>> (avg) after:  21.955
>> ===============
>>
>> ================
>> Using the p0004-read-tree test (posted earlier this week)
>> with 1M files on Linux:
>>
>> before:
>> $ ./p0004-read-tree.sh
>> 0004.5: switch work1 work2 (1003037)       11.99(8.12+3.32)
>> 0004.6: switch commit aliases (1003037)    11.95(8.20+3.14)
>>
>> after:
>> $ ./p0004-read-tree.sh
>> 0004.5: switch work1 work2 (1003037)       11.17(7.84+2.76)
>> 0004.6: switch commit aliases (1003037)    11.13(7.82+2.72)
>> ================
>>
>> Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
>> ---
>>   tree-walk.c    |  8 ++++++++
>>   tree-walk.h    |  1 +
>>   unpack-trees.c | 13 +++++++++----
>>   3 files changed, 18 insertions(+), 4 deletions(-)
>>
>> diff --git a/tree-walk.c b/tree-walk.c
>> index ff77605..3b82f0e 100644
>> --- a/tree-walk.c
>> +++ b/tree-walk.c
>> @@ -92,6 +92,14 @@ void *fill_tree_descriptor(struct tree_desc *desc,
>> const unsigned char *sha1)
>>       return buf;
>>   }
>>   +void *copy_tree_descriptor(struct tree_desc *dest, const struct
>> tree_desc *src)
>> +{
>> +    void *buf = xmalloc(src->size);
>> +    memcpy(buf, src->buffer, src->size);
>> +    init_tree_desc(dest, buf, src->size);
>> +    return buf;
>> +}
>> +
>>   static void entry_clear(struct name_entry *a)
>>   {
>>       memset(a, 0, sizeof(*a));
>> diff --git a/tree-walk.h b/tree-walk.h
>> index 68bb78b..ca4032b 100644
>> --- a/tree-walk.h
>> +++ b/tree-walk.h
>> @@ -43,6 +43,7 @@ int tree_entry(struct tree_desc *, struct name_entry
>> *);
>>   int tree_entry_gently(struct tree_desc *, struct name_entry *);
>>     void *fill_tree_descriptor(struct tree_desc *desc, const unsigned
>> char *sha1);
>> +void *copy_tree_descriptor(struct tree_desc *dest, const struct
>> tree_desc *src);
>>     struct traverse_info;
>>   typedef int (*traverse_callback_t)(int n, unsigned long mask,
>> unsigned long dirmask, struct name_entry *entry, struct traverse_info *);
>> diff --git a/unpack-trees.c b/unpack-trees.c
>> index 3a8ee19..50aacad 100644
>> --- a/unpack-trees.c
>> +++ b/unpack-trees.c
>> @@ -554,10 +554,15 @@ static int traverse_trees_recursive(int n,
>> unsigned long dirmask,
>>       newinfo.df_conflicts |= df_conflicts;
>>         for (i = 0; i < n; i++, dirmask >>= 1) {
>> -        const unsigned char *sha1 = NULL;
>> -        if (dirmask & 1)
>> -            sha1 = names[i].oid->hash;
>> -        buf[i] = fill_tree_descriptor(t+i, sha1);
>> +        if (i > 0 && (dirmask & 1) && names[i].oid && names[i-1].oid &&
>
> Can .oid even be NULL?  (I didn't check, but it's dereferenced in the
> sha1 assignment below, so I guess the answer is no and these two checks
> are not needed.)

yes.  i think the (dirmask&1) is hiding it for name[i].  i put both in
the code above, but it also worked fine just testing both oid's (and
without dirmask).

>
>> +            !hashcmp(names[i].oid->hash, names[i-1].oid->hash)) {
>
> Calling oidcmp would be shorter.

right.

>
>> +            buf[i] = copy_tree_descriptor(&t[i], &t[i-1]);
>
> buf keeps track of the allocations that need to be freed at the end of
> the function.  I assume these buffers are read-only.  Can you use an
> alias here instead of a duplicate by calling init_tree_desc with the
> predecessor's buffer and setting buf[i] to NULL?  Or even just copying
> t[i - 1] to t[i] with an assignment?  That would be shorter and probably
> also quicker.

Yes, my first draft did that.  Just being cautious, but i'll switch it
back since that seems to be the consensus.

>
>> +        } else {
>> +            const unsigned char *sha1 = NULL;
>> +            if (dirmask & 1)
>> +                sha1 = names[i].oid->hash;
>> +            buf[i] = fill_tree_descriptor(t+i, sha1);
>> +        }
>>       }
>>         bottom = switch_cache_bottom(&newinfo);
>>

Thanks
Jeff

      reply	other threads:[~2017-04-07 13:57 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-04-06 20:37 [PATCH v1] WIP unpack-trees: avoid duplicate ODB lookups during checkout git
2017-04-06 20:37 ` [PATCH v1] " git
2017-04-06 22:48   ` Stefan Beller
2017-04-07  5:19     ` Jeff King
2017-04-07 13:51       ` Jeff Hostetler
2017-04-07 17:35     ` Jeff Hostetler
2017-04-07  0:32   ` René Scharfe
2017-04-07 13:57     ` Jeff Hostetler [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=b2952bdc-cb93-1317-58b1-a06ea4e14ee8@jeffhostetler.com \
    --to=git@jeffhostetler.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jeffhost@microsoft.com \
    --cc=l.s.r@web.de \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).