From mboxrd@z Thu Jan 1 00:00:00 1970 From: David Kastrup Subject: Re: What's cooking in git.git (topics) Date: Thu, 27 Sep 2007 08:43:14 +0200 Message-ID: <85sl50ehh9.fsf@lola.goethe.zz> References: <7v1wdcch06.fsf@gitster.siamese.dyndns.org> <7v1wd1d0le.fsf@gitster.siamese.dyndns.org> <7vfy11yyxk.fsf@gitster.siamese.dyndns.org> <20070927023633.GA28902@coredump.intra.peff.net> <854phgfxn7.fsf@lola.goethe.zz> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Junio C Hamano , git@vger.kernel.org To: Jeff King X-From: git-owner@vger.kernel.org Thu Sep 27 08:43:33 2007 Return-path: Envelope-to: gcvg-git-2@gmane.org Received: from vger.kernel.org ([209.132.176.167]) by lo.gmane.org with esmtp (Exim 4.50) id 1Ian5w-0000z9-GB for gcvg-git-2@gmane.org; Thu, 27 Sep 2007 08:43:32 +0200 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754763AbXI0GnU (ORCPT ); Thu, 27 Sep 2007 02:43:20 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1754694AbXI0GnS (ORCPT ); Thu, 27 Sep 2007 02:43:18 -0400 Received: from mail-in-01.arcor-online.net ([151.189.21.41]:47275 "EHLO mail-in-01.arcor-online.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754639AbXI0GnR (ORCPT ); Thu, 27 Sep 2007 02:43:17 -0400 Received: from mail-in-05-z2.arcor-online.net (mail-in-05-z2.arcor-online.net [151.189.8.17]) by mail-in-01.arcor-online.net (Postfix) with ESMTP id 49DFA15C3ED; Thu, 27 Sep 2007 08:43:15 +0200 (CEST) Received: from mail-in-01.arcor-online.net (mail-in-01.arcor-online.net [151.189.21.41]) by mail-in-05-z2.arcor-online.net (Postfix) with ESMTP id 35A8F2DAC94; Thu, 27 Sep 2007 08:43:15 +0200 (CEST) Received: from lola.goethe.zz (dslb-084-061-013-116.pools.arcor-ip.net [84.61.13.116]) by mail-in-01.arcor-online.net (Postfix) with ESMTP id F06A91B3262; Thu, 27 Sep 2007 08:43:14 +0200 (CEST) Received: by lola.goethe.zz (Postfix, from userid 1002) id 034701C0039C; Thu, 27 Sep 2007 08:43:14 +0200 (CEST) In-Reply-To: <854phgfxn7.fsf@lola.goethe.zz> (David Kastrup's message of "Thu\, 27 Sep 2007 08\:08\:44 +0200") User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.1.50 (gnu/linux) X-Virus-Scanned: ClamAV 0.91.2/4411/Wed Sep 26 23:43:35 2007 on mail-in-01.arcor-online.net X-Virus-Status: Clean Sender: git-owner@vger.kernel.org Precedence: bulk X-Mailing-List: git@vger.kernel.org Archived-At: David Kastrup writes: > Jeff King writes: > >> On Wed, Sep 26, 2007 at 01:05:59PM -0700, Junio C Hamano wrote: >> >>> * jk/diff-rename (Tue Sep 25 15:29:42 2007 -0400) 1 commit >>> + diffcore-rename: cache file deltas >>> >>> Parked in 'next' for now but is 'master' material. >> >> My tests after this patch show that spanhash_find is responsible for >> a large portion of the processing time in large renames, so I am going >> to look into speeding that up. > > In itself, it does not look like there is all too much room for > optimization. One can remove the temporary pointer "optimization" and > see whether this makes strength reduction possible for the compiler. > Making this an endless loop wrapped around a loop on bucket might also > help the compiler in that effect. > > But there is really not all too much leeway, and it might be better > spent in the caller. For example, the search will take something like > r/(1-r) iterations on average where r is the fill ratio of the hash > array. So one would not want to, say, let r grow above 0.75 or > something like that. Ok, here is some suggestion: Here is the inner loop for this stuff: for (i = 0; i < ssz; i++) { struct spanhash *s = &(src_count->data[i]); struct spanhash *d; unsigned dst_cnt, src_cnt; if (!s->cnt) continue; src_cnt = s->cnt; d = spanhash_find(dst_count, s->hashval); dst_cnt = d ? d->cnt : 0; if (src_cnt < dst_cnt) { la += dst_cnt - src_cnt; sc += src_cnt; } else sc += dst_cnt; } Now here is how one could optimize the data structures: The hash structures are with linear probing, and we try to find any hash matches from source to destination. If we sort all hashes indexed to a given first hash bucket by their full hash value, then one could basically use passes similar to list merges for figuring the 1:1 relations. That cuts down the O(l n) cost (where n is the number of elements and l their average run length) to O(n). Of course, making l close to 1 by keeping the hash utilization reasonably low is much simpler. -- David Kastrup, Kriemhildstr. 15, 44793 Bochum