From mboxrd@z Thu Jan 1 00:00:00 1970 From: Junio C Hamano Subject: Re: [PATCH] diffcore-rename: cache file deltas Date: Tue, 25 Sep 2007 14:29:07 -0700 Message-ID: <7vy7eu4eos.fsf@gitster.siamese.dyndns.org> References: <20070925192941.GA8564@coredump.intra.peff.net> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Linus Torvalds , git@vger.kernel.org To: Jeff King X-From: git-owner@vger.kernel.org Tue Sep 25 23:29:21 2007 Return-path: Envelope-to: gcvg-git-2@gmane.org Received: from vger.kernel.org ([209.132.176.167]) by lo.gmane.org with esmtp (Exim 4.50) id 1IaHy4-0001jK-Id for gcvg-git-2@gmane.org; Tue, 25 Sep 2007 23:29:20 +0200 Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755141AbXIYV3M (ORCPT ); Tue, 25 Sep 2007 17:29:12 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1754957AbXIYV3L (ORCPT ); Tue, 25 Sep 2007 17:29:11 -0400 Received: from fed1rmmtao102.cox.net ([68.230.241.44]:62067 "EHLO fed1rmmtao102.cox.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754466AbXIYV3I (ORCPT ); Tue, 25 Sep 2007 17:29:08 -0400 Received: from fed1rmimpo01.cox.net ([70.169.32.71]) by fed1rmmtao102.cox.net (InterMail vM.7.08.02.01 201-2186-121-102-20070209) with ESMTP id <20070925212909.WPPQ27287.fed1rmmtao102.cox.net@fed1rmimpo01.cox.net>; Tue, 25 Sep 2007 17:29:09 -0400 Received: from localhost ([68.225.240.77]) by fed1rmimpo01.cox.net with bizsmtp id slV71X00X1gtr5g0000000; Tue, 25 Sep 2007 17:29:08 -0400 User-Agent: Gnus/5.110006 (No Gnus v0.6) Emacs/21.4 (gnu/linux) Sender: git-owner@vger.kernel.org Precedence: bulk X-Mailing-List: git@vger.kernel.org Archived-At: Jeff King writes: > We find rename candidates by computing a fingerprint hash of > each file, and then comparing those fingerprints. There are > inherently O(n^2) comparisons, so it pays in CPU time to > hoist the (rather expensive) computation of the fingerprint > out of that loop (or to cache it once we have computed it once). > > Previously, we didn't keep the filespec information around > because then we had the potential to consume a great deal of > memory. However, instead of keeping all of the filespec > data, we can instead just keep the fingerprint. > > This patch implements and uses diff_free_filespec_data_large > to accomplish that goal. We also have to change > estimate_similarity not to needlessly repopulate the > filespec data when we already have the hash. > > Practical tests showed 4.5x speedup for a 10% memory usage > increase. > > Signed-off-by: Jeff King Very nice. > The implementation is a little less nice than I would like, but I was > trying to be non-invasive. Specifically: > > - the name diff_free_filespec_data_large is horrible, but this is based > on the fact that diff_free_filespec_data actually does too much (it > frees the data _and_ some other auxiliary data). And renaming that > would entail changing many callsites. True. But we can rename it to diff_file_filespec_blob() and that would perfectly well describe what it does. Will do so when applying if it is Ok to you. > - It seems that a better place to call diffcore_populate_filespec > (rather than in estimate_similarity) would actually be in > diffcore_count_changes (when we _know_ that we need to populate the > contents data). > > - The hash_chars() should arguably be tied into > diffcore_populate_filespec, which should have more of a "what > information do you want" interface. I.e., the "size_only" parameter > could be extended to a bitfield to say "populate this, and I need the > delta fingerprint, size, actual contents, etc". Then callers could > just use "populate" before looking at the filespec and it would > lazily load whatever they needed. Both good points, but I agree with you that it is wise to do that with a follow-up patch.