From: Tay Ray Chuan
Subject: Re: [RFC/PATCH 0/3] teach --histogram to diff
Date: Thu, 14 Jul 2011 00:34:14 +0800
References: <1310451027-15148-1-git-send-email-rctay89@gmail.com>
To: Shawn Pearce
Cc: Git Mailing List

On Tue, Jul 12, 2011 at 10:19 PM, Shawn Pearce wrote:
> On Mon, Jul 11, 2011 at 23:10, Tay Ray Chuan wrote:
>> [RFC/PATCH 3/3] xdiff/xprepare: use a smaller sample size for histogram
>
> Do we need sampling at all for histogram? Can you skip it?

Sampling is done to guess the number of lines in the file. This guess
is then used to preallocate memory for the list of records. (It is only
a guess; if we find more records, we allocate more memory.) By
preallocating, we save on malloc() calls, which gives a performance
boost.

But sampling has its costs: previously, we ran up to 256 memchr('\n')
calls within an mmfile "block". For histogram diff, we cut the cap down
to 20. (But not for the other diff algorithms; see the relevant patch
text for more.)

I think this gives us a good balance between the time spent guessing
lines and the time gained from preallocating memory.

-- 
Cheers,
Ray Chuan
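
For reference, the sampling and preallocation described above can be
sketched roughly as follows. This is a minimal, standalone illustration
and not the xdiff code: guess_lines(), recs_fill() and struct recs are
hypothetical names, and the doubling growth policy is an assumption. The
sample argument plays the role of the cap (256 before, 20 for histogram).

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not the xdiff API): estimate how many lines buf
 * holds by measuring at most `sample` leading lines, one memchr() each.
 */
static long guess_lines(const char *buf, size_t size, long sample)
{
	const char *cur = buf;
	const char *top = buf + size;
	long nl = 0;
	size_t scanned, avg;

	while (nl < sample && cur < top) {
		const char *eol = memchr(cur, '\n', (size_t)(top - cur));
		nl++;
		cur = eol ? eol + 1 : top;
	}

	scanned = (size_t)(cur - buf);
	if (nl == 0 || scanned == 0)
		return 1;

	/* Extrapolate: total size over the average sampled line length. */
	avg = scanned / (size_t)nl;
	if (avg == 0)
		avg = 1;
	return (long)(size / avg) + 1;
}

/* Hypothetical record list: one pointer per line start. */
struct recs {
	const char **v;
	size_t n, cap;
};

/*
 * Preallocate from the guess, then record every line. If the guess was
 * too low, grow the array (doubling here is an assumed policy).
 * Returns 0 on success, -1 on allocation failure.
 */
static int recs_fill(struct recs *r, const char *buf, size_t size, long sample)
{
	const char *cur = buf;
	const char *top = buf + size;

	r->n = 0;
	r->cap = (size_t)guess_lines(buf, size, sample);
	r->v = malloc(r->cap * sizeof(*r->v));
	if (!r->v)
		return -1;

	while (cur < top) {
		const char *eol = memchr(cur, '\n', (size_t)(top - cur));

		if (r->n == r->cap) {
			size_t ncap = r->cap * 2;
			const char **nv = realloc(r->v, ncap * sizeof(*nv));
			if (!nv) {
				free(r->v);
				r->v = NULL;
				return -1;
			}
			r->v = nv;
			r->cap = ncap;
		}
		r->v[r->n++] = cur;
		cur = eol ? eol + 1 : top;
	}
	return 0;
}
```

A smaller sample means fewer memchr() calls up front, at the risk of a
worse guess and therefore more regrowth later; that is the trade-off the
cap of 20 is trying to balance.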