From: Wilco Dijkstra
To: Adhemerval Zanella, 'GNU C Library'
CC: nd
Subject: Re: [PATCH] Improve string benchtests
Date: Wed, 6 Mar 2019 18:14:45 +0000
In-Reply-To: <8d43c338-50a5-9bf0-f16c-7d072a75d741@linaro.org>

Hi Adhemerval,

> So my point is to which exactly should we compare on benchtests? Currently
> we have:
>
>  1. Byte-oriented 'simple' implementation which, as we agree, should not be
>     used as a baseline.

That is for functions like memcpy where you'd expect any implementation -
including generic - to act on words rather than bytes. So for those, comparing
with a byte-oriented version doesn't add anything useful.

A byte-oriented version does make sense when it is competitive with the
generic implementation. This is true for various wcs functions and strstr,
for example.

>  2. Some named 'stupid' which are usually composed implementation that might
>     in fact be a faster implementation than some 'clever' ones.

In all cases I found, the stupid ones were slower, mostly because they first
called strlen and then still processed the string one byte at a time instead
of calling memcpy with the known size. So they weren't adding useful data.

>  3. Compiler builtins, which also does not represent meaningful data for
>     libc optimization (it will either be inline, call libc implementation, or
>     mix both strategies).

Agreed - there are few cases where string functions with non-constant inputs
are inlined, so in most cases you're just benchmarking the libc version.

>  4. The libc implementations themselves, possible including all ifunc
>     variations.
>
> So which really give us meaningful data for future optimization? Should we
> keep add multiple implementation as baselines to compare with?

Well, it is useful to check different strategies as well as to compare similar
string functions. For example, I noticed that on some microarchitectures
memchr was faster, but on others strlen was faster. Given that the individual
benchmarks use very different inputs, you only notice this in a direct
comparison.

> What about an architecture that uses as baseline an arch-specific
> implementation, which might use non optimal strategy that a future generic
> implementation might use? We have examples for both string and math code on
> different architectures where the generic implementation ended up performing
> better than the arch-specific implementation.

So having more than one baseline is often useful, since you can compare
different implementations and strategies.
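
To make the "direct comparison" point concrete, here is a minimal sketch of
running several strlen variants on the same input. It is not the benchtests
code: time_impl, compare_all and the impls table are hypothetical names, the
timing uses plain clock(), and memchr_strlen assumes a memchr that stops at
the first match even when given a very large bound.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Byte-at-a-time reference implementation.  */
static size_t
simple_strlen (const char *s)
{
  const char *p = s;
  while (*p != '\0')
    p++;
  return (size_t) (p - s);
}

/* strlen composed from memchr.  Assumption: memchr stops reading at the
   first match, so the oversized bound is never actually reached.  */
static size_t
memchr_strlen (const char *s)
{
  return (size_t) ((const char *) memchr (s, '\0', PTRDIFF_MAX) - s);
}

/* Thin wrapper so the libc version fits the same table.  */
static size_t
libc_strlen (const char *s)
{
  return strlen (s);
}

typedef size_t (*strlen_fn) (const char *);

/* Hypothetical table of variants to compare on identical input.  */
static const struct
{
  const char *name;
  strlen_fn fn;
} impls[] = {
  { "libc_strlen", libc_strlen },
  { "simple_strlen", simple_strlen },
  { "memchr_strlen", memchr_strlen },
};

/* Call one variant ITERS times on S; return elapsed CPU seconds and
   store the last result in *LEN_OUT.  The volatile function pointer
   keeps the compiler from hoisting the call out of the loop.  */
static double
time_impl (strlen_fn fn_in, const char *s, size_t iters, size_t *len_out)
{
  strlen_fn volatile fn = fn_in;
  size_t len = 0;
  clock_t start = clock ();
  for (size_t i = 0; i < iters; i++)
    len = fn (s);
  clock_t end = clock ();
  *len_out = len;
  return (double) (end - start) / CLOCKS_PER_SEC;
}

/* Run every variant on the same string, print one line per variant,
   and return the number of variants whose result disagrees with strlen.  */
static int
compare_all (const char *s, size_t iters)
{
  size_t expected = strlen (s);
  int mismatches = 0;
  for (size_t i = 0; i < sizeof impls / sizeof impls[0]; i++)
    {
      size_t len;
      double secs = time_impl (impls[i].fn, s, iters, &len);
      printf ("%-14s len=%zu  %.6f s\n", impls[i].name, len, secs);
      mismatches += (len != expected);
    }
  return mismatches;
}
```

Calling compare_all (buf, 1000000) on a NUL-terminated buffer prints one
timing line per variant for the same input, which is the kind of side-by-side
view that separate per-function benchmarks don't give you.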
> My point is using memchr_strlen as the *generic* implementation and also use
> it as the *baseline* for performance comparison shows to the developer that
> optimizing memchr would be a net gain in general than providing multiple
> different optimization for multiple symbols that can be built by memchr
> calls.

Well, if you're only adding a target-specific implementation for either strlen
or memchr, then it makes sense to do memchr first. And yes, we could make the
generic strlen defer to memchr, as that makes it easier to get good
performance on a new target using just a fast memchr.

> So I still think we should define better which exactly we need to compare
> in benchtests and use the generic implementation, which will be used as
> default for new ports, as the default baseline. The file inclusion is just to
> avoid code duplication, I don't have a strong opinion whether to include or
> just copy-paste the code on benchtests.

It's hard to come up with simple rules - what is best depends on the specific
function. Memcpy/memchr/strlen have no obvious "baseline", and given that most
targets implement these in assembler (which should beat the generic version),
comparing against the generic implementations isn't really that useful.

For more complex string functions we've seen cases where an assembler
strcpy/strcat was slower than strlen/memcpy, so it makes sense to benchmark
against that as the baseline. The same applies to strlen and strnlen, which
can be compared against memchr, and to strchr, which can be compared against
strlen/memchr.

My goal is to add the most obvious and useful comparisons to the benchtests to
make it much easier to find these unexpected performance flaws across targets
and microarchitectures.

Cheers,
Wilco
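
For reference, the composed baselines mentioned above could look roughly like
this. It is only a sketch under my own naming: strlen_memcpy_strcpy,
strlen_memcpy_strcat and memchr_strnlen are illustrative names, not
necessarily what ends up in benchtests.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* strcpy composed from strlen + memcpy: one pass to find the length,
   then one block copy that includes the terminating NUL.  */
static char *
strlen_memcpy_strcpy (char *dst, const char *src)
{
  return memcpy (dst, src, strlen (src) + 1);
}

/* strcat composed the same way: locate the end of DST with strlen,
   then block-copy SRC (with its NUL) to that position.  */
static char *
strlen_memcpy_strcat (char *dst, const char *src)
{
  memcpy (dst + strlen (dst), src, strlen (src) + 1);
  return dst;
}

/* strnlen composed from memchr: search at most MAXLEN bytes for NUL and
   fall back to MAXLEN when none is found.  */
static size_t
memchr_strnlen (const char *s, size_t maxlen)
{
  const char *end = memchr (s, '\0', maxlen);
  return end != NULL ? (size_t) (end - s) : maxlen;
}
```

Each of these is a one-liner on top of functions that most targets already
optimize, so an assembler strcpy, strcat or strnlen that loses to them is
worth a closer look.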