From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net X-Spam-Level: X-Spam-Status: No, score=-4.2 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,MAILING_LIST_MULTI, SPF_HELO_PASS,SPF_PASS shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2 Received: from sourceware.org (server2.sourceware.org [8.43.85.97]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by dcvr.yhbt.net (Postfix) with ESMTPS id F07311F55B for ; Fri, 29 May 2020 16:26:27 +0000 (UTC) Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 169F6387084F; Fri, 29 May 2020 16:26:27 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 169F6387084F DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org; s=default; t=1590769587; bh=ARG7PfiQ8Jh5/9MitnO8iE53wQDepGPZYvDC6D9xkxo=; h=Subject:To:References:Date:In-Reply-To:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To:Cc: From; b=uscYW7xgFknFNCeiRYN664E440+nVvZzULgBrcijS45/sj4jkLCd1xXRRPVVKV8nr TEplm5UU8KnlUp3eXtpgmQ1jz1BksAlhWbsk2q4dRloh11TgN0Kh2ro5H8KPfYnwVl u1yHqd1NSz0WyMayPnFEo5fNsAgkejOb1hq+wv7w= Received: from mx0a-001b2d01.pphosted.com (mx0a-001b2d01.pphosted.com [148.163.156.1]) by sourceware.org (Postfix) with ESMTPS id 1CE143851C2C for ; Fri, 29 May 2020 16:26:24 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org 1CE143851C2C Received: from pps.filterd (m0098399.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.16.0.42/8.16.0.42) with SMTP id 04TG2g6l124440; Fri, 29 May 2020 12:26:18 -0400 Received: from ppma01dal.us.ibm.com (83.d6.3fa9.ip4.static.sl-reverse.com [169.63.214.131]) by mx0a-001b2d01.pphosted.com with ESMTP id 31as1c76jb-1 (version=TLSv1.2 
cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 29 May 2020 12:26:17 -0400 Received: from pps.filterd (ppma01dal.us.ibm.com [127.0.0.1]) by ppma01dal.us.ibm.com (8.16.0.42/8.16.0.42) with SMTP id 04TGCaUj000884; Fri, 29 May 2020 16:26:16 GMT Received: from b01cxnp23032.gho.pok.ibm.com (b01cxnp23032.gho.pok.ibm.com [9.57.198.27]) by ppma01dal.us.ibm.com with ESMTP id 31b3njhv8k-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Fri, 29 May 2020 16:26:16 +0000 Received: from b01ledav002.gho.pok.ibm.com (b01ledav002.gho.pok.ibm.com [9.57.199.107]) by b01cxnp23032.gho.pok.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 04TGQGRH34996732 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Fri, 29 May 2020 16:26:16 GMT Received: from b01ledav002.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 21CE9124053; Fri, 29 May 2020 16:26:16 +0000 (GMT) Received: from b01ledav002.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 79CA4124052; Fri, 29 May 2020 16:26:15 +0000 (GMT) Received: from [9.163.92.94] (unknown [9.163.92.94]) by b01ledav002.gho.pok.ibm.com (Postfix) with ESMTP; Fri, 29 May 2020 16:26:15 +0000 (GMT) Subject: Re: [PATCH] powerpc64le: add optimized strlen for P9 To: "Paul A. Clarke" , "Paul E. 
Murphy"
References: <20200521191048.1566568-1-murphyp@linux.vnet.ibm.com> <20200527164554.GA13085@oc3272150783.ibm.com>
Message-ID:
Date: Fri, 29 May 2020 11:26:14 -0500
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.8.0
MIME-Version: 1.0
In-Reply-To: <20200527164554.GA13085@oc3272150783.ibm.com>
Content-Type: multipart/mixed; boundary="------------AE49395CE2B31494E88B0DC7"
Content-Language: en-US
X-TM-AS-GCONF: 00
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.216, 18.0.687 definitions=2020-05-29_07:2020-05-28, 2020-05-29 signatures=0
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 priorityscore=1501 phishscore=0 mlxscore=0 impostorscore=0 spamscore=0 suspectscore=0 bulkscore=0 malwarescore=0 mlxlogscore=999 adultscore=0 cotscore=-2147483648 lowpriorityscore=0 clxscore=1015 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2004280000 definitions=main-2005290123
X-BeenThere: libc-alpha@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Libc-alpha mailing list
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
From: Paul E Murphy via Libc-alpha
Reply-To: Paul E Murphy
Cc: anton@ozlabs.org, libc-alpha@sourceware.org
Errors-To: libc-alpha-bounces@sourceware.org
Sender: "Libc-alpha"

This is a multi-part message in MIME format.
--------------AE49395CE2B31494E88B0DC7
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit

V3 is attached with changes to formatting and a couple of
simplifications as noted below.

On 5/27/20 11:45 AM, Paul A. Clarke wrote:
> On Thu, May 21, 2020 at 02:10:48PM -0500, Paul E. Murphy via Libc-alpha wrote:
>> +/* Implements the function
>> +
>> +   int [r3] strlen (void *s [r3])
>
> const void *s?

Fixed, alongside folding away the mr r3,r4. Likewise, the basic GNU
formatting requests, and removed some of the more redundant ones. Thank
you for the suggested changes.
>> +	.p2align 5
>> +L(loop_64b):
>> +	lxv v1+32, 0(r4)	/* Load 4 quadwords. */
>> +	lxv v2+32, 16(r4)
>> +	lxv v3+32, 32(r4)
>> +	lxv v4+32, 48(r4)
>> +	vminub v5,v1,v2	/* Compare and merge into one VR for speed. */
>> +	vminub v6,v3,v4
>> +	vminub v7,v5,v6
>> +	vcmpequb. v7,v7,v18	/* Check for NULLs. */
>> +	addi r4,r4,64	/* Adjust address for the next iteration. */
>> +	bne cr6,L(vmx_zero)
>> +
>> +	lxv v1+32, 0(r4)	/* Load 4 quadwords. */
>> +	lxv v2+32, 16(r4)
>> +	lxv v3+32, 32(r4)
>> +	lxv v4+32, 48(r4)
>> +	vminub v5,v1,v2	/* Compare and merge into one VR for speed. */
>> +	vminub v6,v3,v4
>> +	vminub v7,v5,v6
>> +	vcmpequb. v7,v7,v18	/* Check for NULLs. */
>> +	addi r4,r4,64	/* Adjust address for the next iteration. */
>> +	bne cr6,L(vmx_zero)
>> +
>> +	lxv v1+32, 0(r4)	/* Load 4 quadwords. */
>> +	lxv v2+32, 16(r4)
>> +	lxv v3+32, 32(r4)
>> +	lxv v4+32, 48(r4)
>> +	vminub v5,v1,v2	/* Compare and merge into one VR for speed. */
>> +	vminub v6,v3,v4
>> +	vminub v7,v5,v6
>> +	vcmpequb. v7,v7,v18	/* Check for NULLs. */
>> +	addi r4,r4,64	/* Adjust address for the next iteration. */
>> +	beq cr6,L(loop_64b)
>
> Curious how much this loop unrolling helps, since it adds a fair bit of
> redundant code?

It does seem to help a little bit, though maybe just an artifact of the
benchsuite.

>
>> +
>> +L(vmx_zero):
>> +	/* OK, we found a null byte. Let's look for it in the current 64-byte
>> +	   block and mark it in its corresponding VR. */
>> +	vcmpequb v1,v1,v18
>> +	vcmpequb v2,v2,v18
>> +	vcmpequb v3,v3,v18
>> +	vcmpequb v4,v4,v18
>> +
>> +	/* We will now 'compress' the result into a single doubleword, so it
>> +	   can be moved to a GPR for the final calculation. First, we
>> +	   generate an appropriate mask for vbpermq, so we can permute bits into
>> +	   the first halfword. */
>
> I'm wondering (without having verified) if you can do something here akin to
> what's done in the "tail" sections below, using "vctzlsbb".

It does not help when the content spans more than 1 VR.
I don't think there is much to improve for a 64b mask reduction.
Though, we can save a couple cycles below using cnttzd (new in ISA 3.0).

--------------AE49395CE2B31494E88B0DC7
Content-Type: text/x-patch; charset=UTF-8;
 name="0001-powerpc64le-add-optimized-strlen-for-P9.patch"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
 filename="0001-powerpc64le-add-optimized-strlen-for-P9.patch"

>From 86decdb4a1bea39cc34bb3320fc9e3ea934042f5 Mon Sep 17 00:00:00 2001
From: "Paul E. Murphy"
Date: Mon, 18 May 2020 11:16:06 -0500
Subject: [PATCH] powerpc64le: add optimized strlen for P9

This started as a trivial change to Anton's rawmemchr. I got
carried away. This is a hybrid of P8's asymptotically faster 64B
checks and extremely efficient small string checks, e.g. <64B (and
sometimes a little bit more depending on alignment).

The second trick is to align to 64B by running a 48B checking loop
16B at a time until we naturally align to 64B (i.e. checking 48/96/144
bytes/iteration based on the alignment after the first 5 comparisons).
This alleviates the need to check page boundaries.

Finally, explicitly use the P7 strlen with the runtime loader when
building P9. We need to be cautious about vector/vsx extensions here
on P9 only builds.
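
For readers following the alignment arithmetic above, here is a rough C
model of it. It is only a sketch: the helper names loops_48b and
advance_to_64b are made up for illustration and are not part of the
patch. It assumes the pointer has already been rounded up to 16B, as r4
is at L(aligned), and it mirrors "rldicl. r5,r4,60,62" by taking bits
4-5 of the address as the pass count.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not in the patch): number of 48B loop passes,
   taken from bits 4-5 of a 16B-aligned address, as the rldicl. does.  */
static unsigned int
loops_48b (uintptr_t addr16)
{
  assert ((addr16 & 15) == 0);	/* Caller passes a 16B-aligned address.  */
  return (unsigned int) ((addr16 >> 4) & 3);
}

/* Hypothetical model of the address walk only (no string checks):
   skip the 64B covered by the four straight-line compares, then step
   48B per loop pass.  */
static uintptr_t
advance_to_64b (uintptr_t addr16)
{
  unsigned int n = loops_48b (addr16);
  uintptr_t p = addr16 + 64;	/* Alignment modulo 64 is unchanged.  */

  while (n-- > 0)
    p += 48;			/* One pass of the 48B loop.  */
  return p;
}
```

Each pass adds 48 to an address that sits 16*k bytes past a 64B
boundary, so after k passes the offset is 16*k + 48*k = 64*k, which is
why 0, 48, 96 or 144 extra bytes always land on a 64B boundary.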
---
 .../powerpc/powerpc64/le/power9/rtld-strlen.S |   1 +
 sysdeps/powerpc/powerpc64/le/power9/strlen.S  | 213 ++++++++++++++++++
 sysdeps/powerpc/powerpc64/multiarch/Makefile  |   2 +-
 .../powerpc64/multiarch/ifunc-impl-list.c     |   4 +
 .../powerpc64/multiarch/strlen-power9.S       |   2 +
 sysdeps/powerpc/powerpc64/multiarch/strlen.c  |   5 +
 6 files changed, 226 insertions(+), 1 deletion(-)
 create mode 100644 sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S
 create mode 100644 sysdeps/powerpc/powerpc64/le/power9/strlen.S
 create mode 100644 sysdeps/powerpc/powerpc64/multiarch/strlen-power9.S

diff --git a/sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S b/sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S
new file mode 100644
index 0000000000..e9d83323ac
--- /dev/null
+++ b/sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S
@@ -0,0 +1 @@
+#include
diff --git a/sysdeps/powerpc/powerpc64/le/power9/strlen.S b/sysdeps/powerpc/powerpc64/le/power9/strlen.S
new file mode 100644
index 0000000000..0b358ff128
--- /dev/null
+++ b/sysdeps/powerpc/powerpc64/le/power9/strlen.S
@@ -0,0 +1,213 @@
+/* Optimized strlen implementation for PowerPC64/POWER9.
+   Copyright (C) 2020 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   . */
+
+#include
+
+#ifndef STRLEN
+# define STRLEN __strlen
+# define DEFINE_STRLEN_HIDDEN_DEF 1
+#endif
+
+/* Implements the function
+
+   int [r3] strlen (const void *s [r3])
+
+   The implementation can load bytes past a matching byte, but only
+   up to the next 64B boundary, so it never crosses a page. */
+
+.machine power9
+ENTRY_TOCLESS (STRLEN, 4)
+	CALL_MCOUNT 2
+
+	vspltisb v18,0
+	vspltisb v19,-1
+
+	neg r5,r3
+	rldicl r9,r5,0,60	/* How many bytes to get source 16B aligned? */
+
+
+	/* Align data and fill bytes not loaded with non matching char. */
+	lvx v0,0,r3
+	lvsr v1,0,r3
+	vperm v0,v19,v0,v1
+
+	vcmpequb. v6,v0,v18
+	beq cr6,L(aligned)
+
+	vctzlsbb r3,v6
+	blr
+
+	/* Test 64B 16B at a time. The 64B vector loop is optimized for
+	   longer strings. Likewise, we check a multiple of 64B to avoid
+	   breaking the alignment calculation below. */
+L(aligned):
+	add r4,r3,r9
+	rldicl. r5,r4,60,62	/* Determine the number of 48B loops needed for
+				   alignment to 64B. And test for zero. */
+
+	lxv v0+32,0(r4)
+	vcmpequb. v6,v0,v18
+	bne cr6,L(tail1)
+
+	lxv v0+32,16(r4)
+	vcmpequb. v6,v0,v18
+	bne cr6,L(tail2)
+
+	lxv v0+32,32(r4)
+	vcmpequb. v6,v0,v18
+	bne cr6,L(tail3)
+
+	lxv v0+32,48(r4)
+	vcmpequb. v6,v0,v18
+	bne cr6,L(tail4)
+	addi r4,r4,64
+
+	/* Prep for weird constant generation of reduction. */
+	li r0,0
+
+	/* Skip the alignment if already 64B aligned. */
+	beq L(loop_64b)
+	mtctr r5
+
+	/* Test 48B per iteration until 64B aligned. */
+	.p2align 5
+L(loop):
+	lxv v0+32,0(r4)
+	vcmpequb. v6,v0,v18
+	bne cr6,L(tail1)
+
+	lxv v0+32,16(r4)
+	vcmpequb. v6,v0,v18
+	bne cr6,L(tail2)
+
+	lxv v0+32,32(r4)
+	vcmpequb. v6,v0,v18
+	bne cr6,L(tail3)
+
+	addi r4,r4,48
+	bdnz L(loop)
+
+	.p2align 5
+L(loop_64b):
+	lxv v1+32,0(r4)	/* Load 4 quadwords. */
+	lxv v2+32,16(r4)
+	lxv v3+32,32(r4)
+	lxv v4+32,48(r4)
+	vminub v5,v1,v2	/* Compare and merge into one VR for speed. */
+	vminub v6,v3,v4
+	vminub v7,v5,v6
+	vcmpequb. v7,v7,v18	/* Check for NULLs. */
+	addi r4,r4,64	/* Adjust address for the next iteration. */
+	bne cr6,L(vmx_zero)
+
+	lxv v1+32,0(r4)	/* Load 4 quadwords. */
+	lxv v2+32,16(r4)
+	lxv v3+32,32(r4)
+	lxv v4+32,48(r4)
+	vminub v5,v1,v2	/* Compare and merge into one VR for speed. */
+	vminub v6,v3,v4
+	vminub v7,v5,v6
+	vcmpequb. v7,v7,v18	/* Check for NULLs. */
+	addi r4,r4,64	/* Adjust address for the next iteration. */
+	bne cr6,L(vmx_zero)
+
+	lxv v1+32,0(r4)	/* Load 4 quadwords. */
+	lxv v2+32,16(r4)
+	lxv v3+32,32(r4)
+	lxv v4+32,48(r4)
+	vminub v5,v1,v2	/* Compare and merge into one VR for speed. */
+	vminub v6,v3,v4
+	vminub v7,v5,v6
+	vcmpequb. v7,v7,v18	/* Check for NULLs. */
+	addi r4,r4,64	/* Adjust address for the next iteration. */
+	beq cr6,L(loop_64b)
+
+L(vmx_zero):
+	/* OK, we found a null byte. Let's look for it in the current 64-byte
+	   block and mark it in its corresponding VR. */
+	vcmpequb v1,v1,v18
+	vcmpequb v2,v2,v18
+	vcmpequb v3,v3,v18
+	vcmpequb v4,v4,v18
+
+	/* We will now 'compress' the result into a single doubleword, so it
+	   can be moved to a GPR for the final calculation. First, we
+	   generate an appropriate mask for vbpermq, so we can permute bits into
+	   the first halfword. */
+	vspltisb v10,3
+	lvsl v11,0,r0
+	vslb v10,v11,v10
+
+	/* Permute the first bit of each byte into bits 48-63. */
+	vbpermq v1,v1,v10
+	vbpermq v2,v2,v10
+	vbpermq v3,v3,v10
+	vbpermq v4,v4,v10
+
+	/* Shift each component into its correct position for merging. */
+	vsldoi v2,v2,v2,2
+	vsldoi v3,v3,v3,4
+	vsldoi v4,v4,v4,6
+
+	/* Merge the results and move to a GPR. */
+	vor v1,v2,v1
+	vor v2,v3,v4
+	vor v4,v1,v2
+	mfvrd r10,v4
+
+	/* Adjust address to the beginning of the current 64-byte block. */
+	addi r4,r4,-64
+
+	cnttzd r0,r10	/* Count trailing zeros before the match. */
+	subf r5,r3,r4
+	add r3,r5,r0	/* Compute final length. */
+	blr
+
+L(tail1):
+	vctzlsbb r0,v6
+	add r4,r4,r0
+	subf r3,r3,r4
+	blr
+
+L(tail2):
+	vctzlsbb r0,v6
+	add r4,r4,r0
+	addi r4,r4,16
+	subf r3,r3,r4
+	blr
+
+L(tail3):
+	vctzlsbb r0,v6
+	add r4,r4,r0
+	addi r4,r4,32
+	subf r3,r3,r4
+	blr
+
+L(tail4):
+	vctzlsbb r0,v6
+	add r4,r4,r0
+	addi r4,r4,48
+	subf r3,r3,r4
+	blr
+
+END (STRLEN)
+
+#ifdef DEFINE_STRLEN_HIDDEN_DEF
+weak_alias (__strlen, strlen)
+libc_hidden_builtin_def (strlen)
+#endif
diff --git a/sysdeps/powerpc/powerpc64/multiarch/Makefile b/sysdeps/powerpc/powerpc64/multiarch/Makefile
index fc2268f6b5..19acb6c64a 100644
--- a/sysdeps/powerpc/powerpc64/multiarch/Makefile
+++ b/sysdeps/powerpc/powerpc64/multiarch/Makefile
@@ -33,7 +33,7 @@ sysdep_routines += memcpy-power8-cached memcpy-power7 memcpy-a2 memcpy-power6 \
 
 ifneq (,$(filter %le,$(config-machine)))
 sysdep_routines += strcmp-power9 strncmp-power9 strcpy-power9 stpcpy-power9 \
-		   rawmemchr-power9
+		   rawmemchr-power9 strlen-power9
 endif
 CFLAGS-strncase-power7.c += -mcpu=power7 -funroll-loops
 CFLAGS-strncase_l-power7.c += -mcpu=power7 -funroll-loops
diff --git a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
index 59a227ee22..ea10b00417 100644
--- a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
@@ -111,6 +111,10 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 
   /* Support sysdeps/powerpc/powerpc64/multiarch/strlen.c. */
   IFUNC_IMPL (i, name, strlen,
+#ifdef __LITTLE_ENDIAN__
+	      IFUNC_IMPL_ADD (array, i, strlen, hwcap2 & PPC_FEATURE2_ARCH_3_00,
+			      __strlen_power9)
+#endif
 	      IFUNC_IMPL_ADD (array, i, strlen, hwcap2 & PPC_FEATURE2_ARCH_2_07,
 			      __strlen_power8)
 	      IFUNC_IMPL_ADD (array, i, strlen, hwcap & PPC_FEATURE_HAS_VSX,
diff --git a/sysdeps/powerpc/powerpc64/multiarch/strlen-power9.S b/sysdeps/powerpc/powerpc64/multiarch/strlen-power9.S
new file mode 100644
index 0000000000..68c8d54b5f
--- /dev/null
+++ b/sysdeps/powerpc/powerpc64/multiarch/strlen-power9.S
@@ -0,0 +1,2 @@
+#define STRLEN __strlen_power9
+#include
diff --git a/sysdeps/powerpc/powerpc64/multiarch/strlen.c b/sysdeps/powerpc/powerpc64/multiarch/strlen.c
index e587554221..cd9dc78a7c 100644
--- a/sysdeps/powerpc/powerpc64/multiarch/strlen.c
+++ b/sysdeps/powerpc/powerpc64/multiarch/strlen.c
@@ -30,8 +30,13 @@ extern __typeof (__redirect_strlen) __libc_strlen;
 extern __typeof (__redirect_strlen) __strlen_ppc attribute_hidden;
 extern __typeof (__redirect_strlen) __strlen_power7 attribute_hidden;
 extern __typeof (__redirect_strlen) __strlen_power8 attribute_hidden;
+extern __typeof (__redirect_strlen) __strlen_power9 attribute_hidden;
 
 libc_ifunc (__libc_strlen,
+# ifdef __LITTLE_ENDIAN__
+	    (hwcap2 & PPC_FEATURE2_ARCH_3_00)
+	    ? __strlen_power9 :
+# endif
 	    (hwcap2 & PPC_FEATURE2_ARCH_2_07)
 	    ? __strlen_power8 :
 	    (hwcap & PPC_FEATURE_HAS_VSX)
-- 
2.26.2

--------------AE49395CE2B31494E88B0DC7--