From mboxrd@z Thu Jan  1 00:00:00 1970
List-Id: <libc-alpha.sourceware.org>
From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
To: Adhemerval Zanella <adhemerval.zanella@linaro.org>, 'GNU C Library'
	<libc-alpha@sourceware.org>
CC: nd <nd@arm.com>
Subject: Re: [PATCH] Improve string benchtests
Date: Wed, 6 Mar 2019 18:14:45 +0000
Message-ID:
 <DB5PR08MB1030B618B3C4385F3CEB0C2883730@DB5PR08MB1030.eurprd08.prod.outlook.com>
References:
 <DB5PR08MB10302F67370DDC84C904006383B60@DB5PR08MB1030.eurprd08.prod.outlook.com>
 <49967cf5-a89a-fa17-5c94-556c92705bef@linaro.org>
 <DB5PR08MB1030BB52717C28E0134E5154836E0@DB5PR08MB1030.eurprd08.prod.outlook.com>
 <1dc12364-668c-0216-a569-295a0c1f394f@linaro.org>
 <DB5PR08MB103066A4010D52D641FDD871836F0@DB5PR08MB1030.eurprd08.prod.outlook.com>,<8d43c338-50a5-9bf0-f16c-7d072a75d741@linaro.org>
In-Reply-To: <8d43c338-50a5-9bf0-f16c-7d072a75d741@linaro.org>
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0

Hi Adhemerval,

> So my point is: what exactly should we compare against in benchtests?
> Currently we have:
>
>  1. Byte-oriented 'simple' implementation which, as we agree, should not be
>     used as a baseline.

That is for functions like memcpy where you'd expect any implementation -
including generic - to act on words rather than bytes. So for those, comparing
with a byte-oriented version doesn't add anything useful.

A byte-oriented version does make sense when it is competitive with the
generic implementation. This is true for various wcs functions and strstr,
for example.

>  2. Some named 'stupid', which are usually composed implementations that
>     might in fact be faster than some 'clever' ones.

In all cases I found the stupid ones were slower, mostly because they first
called strlen and then still processed the string one byte at a time instead
of calling memcpy with the known size. So they weren't adding useful data.
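
To make the difference concrete, here is a minimal sketch of the two shapes
of strcpy described above. Both function names are hypothetical and this is
not the actual benchtests code; it only illustrates why the 'stupid' variant
wastes the length it has already computed.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical 'stupid' variant: calls strlen, then still copies one
   byte at a time.  */
char *
stupid_strcpy (char *dst, const char *src)
{
  size_t n = strlen (src);
  for (size_t i = 0; i <= n; i++)   /* <= n also copies the terminating NUL.  */
    dst[i] = src[i];
  return dst;
}

/* Composed variant: once the length is known, hand the copy to memcpy.  */
char *
strlen_memcpy_strcpy (char *dst, const char *src)
{
  return memcpy (dst, src, strlen (src) + 1);
}
```

Both produce the same result; only the second lets an optimized memcpy do
the copy.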

>  3. Compiler builtins, which also do not represent meaningful data for
>     libc optimization (it will either be inlined, call the libc
>     implementation, or mix both strategies).

Agreed - there are few cases where string functions with non-constant inputs
are inlined, so in most cases you're just benchmarking the libc version.

>  4. The libc implementations themselves, possibly including all ifunc
>     variations.
>
> So which really gives us meaningful data for future optimization? Should we
> keep adding multiple implementations as baselines to compare with?

Well, it is useful to check different strategies as well as to compare similar
string functions. For example, I noticed that on some microarchitectures
memchr is faster, but on others strlen is faster. Given that the individual
benchmarks use very different inputs, you only notice this in a direct
comparison.
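
A direct comparison of the kind described above could look roughly like the
sketch below: both functions are run over the same buffer by the same timing
loop. The wrapper names, the time_len helper and the use of clock() are all
assumptions for illustration; the real benchtests use their own timing
macros.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical wrappers so both functions share one signature.  */
size_t
len_strlen (const char *s, size_t max)
{
  (void) max;
  return strlen (s);
}

size_t
len_memchr (const char *s, size_t max)
{
  const char *end = memchr (s, '\0', max);
  return end != NULL ? (size_t) (end - s) : max;
}

typedef size_t (*len_fn) (const char *, size_t);

/* Run FN on BUF for ITERS iterations (ITERS > 0).  Returns the elapsed CPU
   time in seconds and stores the per-call result in *RESULT.  */
double
time_len (len_fn fn, const char *buf, size_t max, int iters, size_t *result)
{
  /* Reading the pointer through a volatile object on every iteration
     stops the compiler from hoisting the call out of the loop.  */
  const char *volatile p = buf;
  size_t sum = 0;
  clock_t start = clock ();
  for (int i = 0; i < iters; i++)
    sum += fn (p, max);
  clock_t stop = clock ();
  *result = sum / (size_t) iters;
  return (double) (stop - start) / CLOCKS_PER_SEC;
}
```

Calling time_len (len_strlen, ...) and time_len (len_memchr, ...) with the
same buffer gives two numbers that are directly comparable, which the
separate per-function benchmarks do not.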

> What about an architecture that uses as baseline an arch-specific
> implementation, which might use a non-optimal strategy that a future generic
> implementation might use? We have examples for both string and math code on
> different architectures where the generic implementation ended up performing
> better than the arch-specific implementation.

So having more than one baseline is often useful since you can compare
different implementations and strategies.

> My point is that using memchr_strlen as the *generic* implementation and
> also using it as the *baseline* for performance comparison shows the
> developer that optimizing memchr would in general be more of a net gain than
> providing multiple different optimizations for multiple symbols that can be
> built from memchr calls.

Well, if you're only adding a target-specific implementation for either strlen
or memchr then it makes sense to do memchr first. And yes, we could make the
generic strlen defer to memchr, as that makes it easier to get good
performance on a new target using just a fast memchr.
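
As a rough sketch of what a strlen that defers to memchr could look like
(this is an illustration, not the existing generic glibc code; the bound
passed to memchr is an assumption that would need checking per target):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical generic strlen built on memchr.  No object can be larger
   than PTRDIFF_MAX bytes, so that is used as the search bound.  This
   relies on memchr stopping at the first match, which C11 requires, and
   on the target's memchr coping with a bound that runs past the end of
   the string.  */
size_t
memchr_strlen (const char *s)
{
  const char *end = memchr (s, '\0', (size_t) PTRDIFF_MAX);
  return (size_t) (end - s);
}
```

With this in place a new port only needs a fast memchr to get a reasonable
strlen as well.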

> So I still think we should define better what exactly we need to compare
> in benchtests and use the generic implementation, which will be used as
> default for new ports, as the default baseline. The file inclusion is just to
> avoid code duplication; I don't have a strong opinion whether to include or
> just copy-paste the code on benchtests.

It's hard to come up with simple rules - what is best depends on the specific
function. Memcpy/memchr/strlen have no obvious "baseline", and given
most targets implement these in assembler (which should beat the generic
version), comparing against the generic implementations isn't really that
useful.

For more complex string functions we've seen cases where an assembler
strcpy/strcat was slower than strlen/memcpy, so it makes sense to
benchmark against that as the baseline. The same is true for strlen, memchr
and strnlen as well as strchr and strlen/memchr.
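
The composed baselines mentioned above can be sketched as follows. The
function names are hypothetical and chosen only for this illustration; the
point is that each one is a few lines on top of strlen, memcpy and memchr.

```c
#include <stddef.h>
#include <string.h>

/* strcat composed from strlen and memcpy.  */
char *
strlen_memcpy_strcat (char *dst, const char *src)
{
  memcpy (dst + strlen (dst), src, strlen (src) + 1);
  return dst;
}

/* strchr composed from strlen and memchr.  The + 1 keeps the terminating
   NUL searchable, as strchr requires.  */
char *
strlen_memchr_strchr (const char *s, int c)
{
  return memchr (s, c, strlen (s) + 1);
}

/* strnlen composed from memchr.  */
size_t
memchr_strnlen (const char *s, size_t maxlen)
{
  const char *end = memchr (s, '\0', maxlen);
  return end != NULL ? (size_t) (end - s) : maxlen;
}
```

If a target's assembler strcat, strchr or strnlen loses against one of these,
that is exactly the kind of unexpected performance flaw the benchmark should
expose.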

My goal is to add the most obvious and useful comparisons to the benchtests
to make it much easier to find these unexpected performance flaws across
targets and microarchitectures.

Cheers,
Wilco