From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on dcvr.yhbt.net X-Spam-Level: X-Spam-ASN: AS31976 209.132.180.0/23 X-Spam-Status: No, score=-3.5 required=3.0 tests=AWL,BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,RCVD_IN_DNSWL_HI,RP_MATCHES_RCVD shortcircuit=no autolearn=ham autolearn_force=no version=3.4.0 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by dcvr.yhbt.net (Postfix) with ESMTP id ACFD020966 for ; Fri, 7 Apr 2017 17:35:41 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932921AbdDGRfk (ORCPT ); Fri, 7 Apr 2017 13:35:40 -0400 Received: from siwi.pair.com ([209.68.5.199]:33986 "EHLO siwi.pair.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753290AbdDGRfj (ORCPT ); Fri, 7 Apr 2017 13:35:39 -0400 Received: from [10.160.98.126] (unknown [167.220.148.155]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by siwi.pair.com (Postfix) with ESMTPSA id EFC65845DB; Fri, 7 Apr 2017 13:35:37 -0400 (EDT) Subject: Re: [PATCH v1] unpack-trees: avoid duplicate ODB lookups during checkout To: Stefan Beller References: <20170406203732.47714-1-git@jeffhostetler.com> <20170406203732.47714-2-git@jeffhostetler.com> Cc: "git@vger.kernel.org" , Junio C Hamano , Jeff King , Jeff Hostetler From: Jeff Hostetler Message-ID: <6aa9d888-54e6-4d43-db84-ce4ec4dfaca6@jeffhostetler.com> Date: Fri, 7 Apr 2017 13:35:36 -0400 User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Sender: git-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: git@vger.kernel.org On 4/6/2017 6:48 PM, Stefan Beller wrote: > On Thu, Apr 6, 2017 at 1:37 PM, wrote: >> From: Jeff Hostetler >> >> Teach traverse_trees_recursive() to not do redundant ODB >> lookups when both directories refer to the same OID. > > And the reason for this is that omitting the second lookup > saves time, i.e. a lookup in the ODB of a sufficiently large > repo is slow. > > My kneejerk line of thinking: > * yes, it sounds good to reduce the number of ODB accesses. > * But if we consider ODB lookups to be slow and we perform > a structured access, how about a cache in front of the ODB? > * We already have that! (sort of..) 9a414486d9 (lookup_object: > prioritize recently found objects, 2013-05-01) > * Instead of improving the caching, maybe change the > size of the problem: We could keep the objects of different types > in different hash-tables. > > object.c has its own hash table, I presume for historical and > performance reasons, this would be split up to multiple hash > tables. > > Additionally to "object *lookup_object(*sha1)", we'd have > a function > > "object *lookup_object(*sha1, enum object_type hint)" > which looks into the correct the hash table. > > If you were to call just lookup_object with no hint, then you'd > look into all the different tables (I guess there is a preferrable > order in which to look, haven't thought about that). > >> >> In operations such as read-tree, checkout, and merge when >> the differences between the commits are relatively small, >> there will likely be many directories that have the same >> SHA-1. In these cases we can avoid hitting the ODB multiple >> times for the same SHA-1. > > This would explain partially why there was such a good > performance boost in the referenced commit above as we > implicitly lookup the same object multiple times. > > Peff is really into getting this part faster, c.f. > https://public-inbox.org/git/20160914235547.h3n2otje2hec6u7k@sigill.intra.peff.net/ That looks interesting, but I question the probabilities for my use case here. When walking the trees and files in a single commit, I have no expectation that I'll see the same tree OID twice, so the cache is not really useful and may just add overhead. However, in a checkout or merge there is a high expectation of visiting the same tree OID -- and most of the time they are peers -- since commits tend to only change isolated parts of the tree. (I'm not going to worry about the case where someone moves an entire sub-tree to somewhere else in the tree and violates my peer assertion.) I did notice that we do 2 independent passes during checkout. First to compare the old and new commits. Then to compare the new with the worktree. So we touch each tree object 3 times. My patch helps the first, but does nothing for the second. Hopefully the cache is helping it (but I have not measured that). > >> TODO This change is a first attempt to test that by comparing >> TODO the hashes of name[i] and name[i-i] and simply copying >> TODO the tree-descriptor data. I was thinking of the n=2 >> TODO case here. We may want to extend this to the n=3 case. > >> >> ================ >> On the Windows repo (500K trees, 3.1M files, 450MB index), >> this reduced the overall time by 0.75 seconds when cycling >> between 2 commits with a single file difference. >> >> (avg) before: 22.699 >> (avg) after: 21.955 >> =============== > > So it shaves off 4% of the time needed. it doesn't sound like a > break through, but I guess these small patches add up. :) Agreed, but on the Windows source repo, it can take 30 seconds to do a "checkout -b" (without changing the actual HEAD commit). That's just for the housekeeping of ensuring you get a clean worktree. If I can knock off 5% here with minimal impact and without changing any file formats, I'll take it. And if I can just repeat that n times... :-) > >> for (i = 0; i < n; i++, dirmask >>= 1) { >> - const unsigned char *sha1 = NULL; >> - if (dirmask & 1) >> - sha1 = names[i].oid->hash; >> - buf[i] = fill_tree_descriptor(t+i, sha1); >> + if (i > 0 && (dirmask & 1) && names[i].oid && names[i-1].oid && >> + !hashcmp(names[i].oid->hash, names[i-1].oid->hash)) { > > Why do we need to check for dirmask & 1 here? > This ought to be covered by the hashcmp already IIUC. > So maybe we can pull out the > if (dirmask & 1) > sha1 = names[i].oid->hash; > out of the else when dropping that dirmask check? I was wondering about the test in the else clause. Just now I put a quick assert in there and it went off, so I'm not going to change this. Thanks Jeff