From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <libc-alpha-return-99814-e=80x24.org@sourceware.org>
Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <libc-alpha.sourceware.org>
List-Unsubscribe: <mailto:libc-alpha-unsubscribe-e=80x24.org@sourceware.org>
List-Subscribe: <mailto:libc-alpha-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/libc-alpha/>
List-Post: <mailto:libc-alpha@sourceware.org>
List-Help: <mailto:libc-alpha-help@sourceware.org>, <http://sourceware.org/ml/#faqs>
Sender: libc-alpha-owner@sourceware.org
To: Wilco Dijkstra <Wilco.Dijkstra@arm.com>,
 'GNU C Library' <libc-alpha@sourceware.org>
Cc: nd <nd@arm.com>
References: <DB5PR08MB10302F67370DDC84C904006383B60@DB5PR08MB1030.eurprd08.prod.outlook.com>
 <49967cf5-a89a-fa17-5c94-556c92705bef@linaro.org>
 <DB5PR08MB1030BB52717C28E0134E5154836E0@DB5PR08MB1030.eurprd08.prod.outlook.com>
 <1dc12364-668c-0216-a569-295a0c1f394f@linaro.org>
 <DB5PR08MB103066A4010D52D641FDD871836F0@DB5PR08MB1030.eurprd08.prod.outlook.com>
From: Adhemerval Zanella <adhemerval.zanella@linaro.org>
Openpgp: preference=signencrypt
Subject: Re: [PATCH] Improve string benchtests
Message-ID: <8d43c338-50a5-9bf0-f16c-7d072a75d741@linaro.org>
Date: Wed, 6 Feb 2019 12:53:26 -0200
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101
 Thunderbird/60.4.0
MIME-Version: 1.0
In-Reply-To: <DB5PR08MB103066A4010D52D641FDD871836F0@DB5PR08MB1030.eurprd08.prod.outlook.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit


On 06/02/2019 12:01, Wilco Dijkstra wrote:
> Hi Adhemerval,
> 
>>>> Same as before for wcpncpy: instead of reimplement the generic implementation
>>>> on benchtests we can just include them. And it also leads to an possible
>>>> optimization on generic implementation for wcpncpy.
>>>
>>> The point is to enable useful comparisons of string implementations. If we include
>>> the generic implementation then we just compare the generic implementation with
>>> itself in many cases. And that isn't useful. If I change a generic implementation I
>>> want to see the difference that makes in the benchmark comparison rather than
>>> showing no difference.
>>
>> My understanding is we have the generic implementation as the baseline
>> where arch-specific optimization might be applied and the idea of the 
>> comparison is to check against it.  I see no point in using a different 
>> implementation on benchtests, it should compare against exactly what 
>> glibc is currently providing.
> 
> I have to disagree, we cannot do an exact comparison unless build the generic
> string functions as part of GLIBC and call them via the PLT. Including source
> files with lots of #define magic is never going to be equivalent.
> 
> The goal here is not an accurate comparison with generic string functions but
> to enable a realistic comparison with an efficient baseline - the existing byte
> oriented implementations provide a baseline but are too slow to be useful.

The idea is not to be equivalent, since benchtests already adds the exported
libc symbol, which will be called through the PLT.  I do agree with you that
the byte-oriented baseline is somewhat useless now that most architectures
implement efficient word-based or vectorized versions of the common symbols,
and the idea is also to provide more efficient generic string implementations
(I have a long-standing patchset, 'Improve generic string routines', to
address this).

So my question is: what exactly should we compare against in benchtests?
Currently we have:

  1. Byte-oriented 'simple' implementations which, as we agree, should not be
     used as a baseline.

  2. Some named 'stupid', which are usually composed implementations that
     might in fact be faster than some 'clever' ones.

  3. Compiler builtins, which also do not represent meaningful data for
     libc optimization (they will either be inlined, call the libc
     implementation, or mix both strategies).

  4. The libc implementations themselves, possibly including all ifunc
     variations.
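
To make the list above concrete, here is a minimal standalone sketch of
comparing several candidates in a single run.  It is not the benchtests
harness: every name ending in _sketch is hypothetical, the timing uses plain
clock (), and the builtin entry assumes GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

typedef size_t (*strlen_fn) (const char *);

/* Item 1: byte-oriented 'simple' variant.  */
static size_t
simple_strlen_sketch (const char *s)
{
  const char *p = s;
  while (*p != '\0')
    ++p;
  return (size_t) (p - s);
}

/* Item 3: compiler builtin (GCC/Clang only).  It may be inlined or may
   end up calling the libc symbol.  */
#ifdef __GNUC__
static size_t
builtin_strlen_sketch (const char *s)
{
  return __builtin_strlen (s);
}
#endif

/* Item 4: the exported libc symbol.  */
static size_t
libc_strlen_sketch (const char *s)
{
  return strlen (s);
}

/* Hypothetical table of candidates; not the benchtests IMPL list.  */
struct impl_sketch
{
  const char *name;
  strlen_fn fn;
};

static const struct impl_sketch impls_sketch[] = {
  { "simple_strlen", simple_strlen_sketch },
#ifdef __GNUC__
  { "builtin_strlen", builtin_strlen_sketch },
#endif
  { "strlen", libc_strlen_sketch },
};

#define N_IMPLS_SKETCH (sizeof impls_sketch / sizeof impls_sketch[0])

/* Check that every candidate returns the same length as the first one.  */
static void
check_all_agree (const char *s)
{
  size_t expected = impls_sketch[0].fn (s);
  for (size_t i = 1; i < N_IMPLS_SKETCH; ++i)
    assert (impls_sketch[i].fn (s) == expected);
}

/* CPU seconds spent on ITERS calls of FN on S; illustrative only.  */
static double
time_impl_sketch (strlen_fn fn, const char *s, size_t iters)
{
  volatile size_t sink = 0;
  clock_t start = clock ();
  for (size_t i = 0; i < iters; ++i)
    sink += fn (s);
  clock_t end = clock ();
  (void) sink;
  return (double) (end - start) / CLOCKS_PER_SEC;
}
```

A driver would loop over impls_sketch, call check_all_agree once per input,
and print time_impl_sketch for each entry.  Which entry counts as the
baseline is exactly the open question here.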

So which of these really gives us meaningful data for future optimization?
Should we keep adding multiple implementations as baselines to compare with?

What about an architecture that uses an arch-specific implementation as its
baseline, one that might use a less optimal strategy than a future generic
implementation?  We have examples in both string and math code, on different
architectures, where the generic implementation ended up performing better
than the arch-specific implementation.

Another example is your recent bench-strlen improvement (5289f1f56b7), which
added memchr_strlen.  The generic implementation of strlen uses a strategy
similar to memchr, with the difference that it does not need to materialize
the magic constant and adds some loop unrolling for the tail comparison.  At
first glance it should be faster than memchr_strlen; however, if the
architecture has an optimized memchr implementation (memchr is a hotspot and
usually a target for arch-specific optimization), memchr_strlen should indeed
be faster (and have a lower i-cache footprint).
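
For reference, a strlen composed from memchr can be sketched as below.  This
is my own minimal version, not copied from the commit: the _sketch names are
hypothetical, and the PTRDIFF_MAX bound is an assumption that relies on
memchr stopping at the first match.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* strlen built from a single memchr call: search for the terminating
   NUL with an effectively unbounded length (assumption: memchr stops
   reading at the first match).  */
static size_t
memchr_strlen_sketch (const char *s)
{
  return (size_t) ((const char *) memchr (s, '\0', PTRDIFF_MAX) - s);
}

/* Sanity helper: the composed version must agree with libc strlen.  */
static void
check_matches_strlen (const char *s)
{
  assert (memchr_strlen_sketch (s) == strlen (s));
}
```

All of the work happens inside memchr, so any arch-specific memchr
optimization is inherited for free.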

My point is that using memchr_strlen as the *generic* implementation, and
also as the *baseline* for performance comparison, shows the developer that
optimizing memchr would in general be a bigger net gain than providing
multiple different optimizations for the multiple symbols that can be built
from memchr calls.

So I still think we should better define what exactly we need to compare
in benchtests, and use the generic implementation, which will be the
default for new ports, as the default baseline.  The file inclusion is just
to avoid code duplication; I don't have a strong opinion on whether to
include the code or just copy-paste it into benchtests.

> 
>> If you want to check if the your changes improves the generic, you can
>> compare against multiples glibc builds.
> 
> That doesn't work so well given it takes a long time to rebuild GLIBC and 
> benchmarks. For all benchmarking I do, I always create a direct comparison of
> old vs new in a single run so it shows the differences and can be run repeatedly
> to confirm. The string bench is setup to do this already, so why remove this
> useful feature?
>  
>>> Maybe the name generic_xxx is confusing? It's meant to be the baseline,
>>> something which you should beat in all cases with the actual implementation.
>>
>> My understanding is the baseline should be the generic implementation which
>> is selected if the architecture does not provide an optimized one.
> 
> That means you never compare the generic implementation against a baseline.
> Given that is what we do today, I don't see why we should stop doing that.
> 
> Cheers,
> Wilco
>     
>