Date: Wed, 3 Jun 2020 15:44:39 -0500
To: Paul E Murphy
Subject: Re: [PATCH] powerpc64le: add optimized strlen for P9
Message-ID: <20200603204439.GA13031@oc3272150783.ibm.com>
References: <20200521191048.1566568-1-murphyp@linux.vnet.ibm.com>
 <20200527164554.GA13085@oc3272150783.ibm.com>
From: "Paul A. Clarke via Libc-alpha"
Reply-To: "Paul A. Clarke"
Cc: anton@ozlabs.org, libc-alpha@sourceware.org

On Fri, May 29, 2020 at 11:26:14AM -0500, Paul E Murphy wrote:
>
> V3 is attached with changes to formatting and a couple of
> simplifications as noted below.

[snip]

This version LGTM, with a few nits below (and you were going to check
the binutils support for the POWER9 instruction).

> From 86decdb4a1bea39cc34bb3320fc9e3ea934042f5 Mon Sep 17 00:00:00 2001
> From: "Paul E. Murphy"
> Date: Mon, 18 May 2020 11:16:06 -0500
> Subject: [PATCH] powerpc64le: add optimized strlen for P9
>
> This started as a trivial change to Anton's rawmemchr.  I got
> carried away.  This is a hybrid between P8's asymptotically
> faster 64B checks and extremely efficient small-string checks,
> e.g. <64B (and sometimes a little bit more, depending on alignment).
>
> The second trick is to align to 64B by running a 48B checking loop
> 16B at a time until we naturally align to 64B (i.e. checking
> 48/96/144 bytes per iteration based on the alignment after the first
> 5 comparisons).  This alleviates the need to check page boundaries.
>
> Finally, explicitly use the P7 strlen with the runtime loader when
> building for P9.  We need to be cautious about vector/vsx extensions
> here on P9-only builds.
> ---
>  .../powerpc/powerpc64/le/power9/rtld-strlen.S |   1 +
>  sysdeps/powerpc/powerpc64/le/power9/strlen.S  | 213 ++++++++++++++++++
>  sysdeps/powerpc/powerpc64/multiarch/Makefile  |   2 +-
>  .../powerpc64/multiarch/ifunc-impl-list.c     |   4 +
>  .../powerpc64/multiarch/strlen-power9.S       |   2 +
>  sysdeps/powerpc/powerpc64/multiarch/strlen.c  |   5 +
>  6 files changed, 226 insertions(+), 1 deletion(-)
>  create mode 100644 sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S
>  create mode 100644 sysdeps/powerpc/powerpc64/le/power9/strlen.S
>  create mode 100644 sysdeps/powerpc/powerpc64/multiarch/strlen-power9.S
>
> diff --git a/sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S b/sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S
> new file mode 100644
> index 0000000000..e9d83323ac
> --- /dev/null
> +++ b/sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S
> @@ -0,0 +1 @@
> +#include
> diff --git a/sysdeps/powerpc/powerpc64/le/power9/strlen.S b/sysdeps/powerpc/powerpc64/le/power9/strlen.S
> new file mode 100644
> index 0000000000..0b358ff128
> --- /dev/null
> +++ b/sysdeps/powerpc/powerpc64/le/power9/strlen.S
> @@ -0,0 +1,213 @@
> +/* Optimized strlen implementation for PowerPC64/POWER9.
> +   Copyright (C) 2020 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +#include
> +
> +#ifndef STRLEN
> +# define STRLEN __strlen
> +# define DEFINE_STRLEN_HIDDEN_DEF 1
> +#endif
> +
> +/* Implements the function
> +
> +   int [r3] strlen (const void *s [r3])
> +
> +   The implementation can load bytes past a matching byte, but only
> +   up to the next 64B boundary, so it never crosses a page.  */
> +
> +.machine power9
> +ENTRY_TOCLESS (STRLEN, 4)
> +	CALL_MCOUNT 2
> +
> +	vspltisb  v18,0
> +	vspltisb  v19,-1
> +
> +	neg	  r5,r3
> +	rldicl	  r9,r5,0,60	/* How many bytes to get source 16B aligned?  */
> +

Extra blank line here.  (Sorry, didn't see this the first time.)

> +	/* Align data and fill bytes not loaded with non matching char.  */
> +	lvx	  v0,0,r3
> +	lvsr	  v1,0,r3
> +	vperm	  v0,v19,v0,v1
> +
> +	vcmpequb. v6,v0,v18
> +	beq	  cr6,L(aligned)
> +

Consider, for before the next two instructions:

  /* String ends within first cache line.  Compute and return length.  */

> +	vctzlsbb  r3,v6
> +	blr
> +
> +	/* Test 64B 16B at a time.  The 64B vector loop is optimized for
> +	   longer strings.  Likewise, we check a multiple of 64B to avoid
> +	   breaking the alignment calculation below.  */
> +L(aligned):
> +	add	  r4,r3,r9
> +	rldicl.	  r5,r4,60,62	/* Determine the number of 48B loops needed
> +				   for alignment to 64B.  And test for
> +				   zero.  */

Would it be bad to move the "rldicl." down...

> +
> +	lxv	  v0+32,0(r4)
> +	vcmpequb. v6,v0,v18
> +	bne	  cr6,L(tail1)
> +
> +	lxv	  v0+32,16(r4)
> +	vcmpequb. v6,v0,v18
> +	bne	  cr6,L(tail2)
> +
> +	lxv	  v0+32,32(r4)
> +	vcmpequb. v6,v0,v18
> +	bne	  cr6,L(tail3)
> +
> +	lxv	  v0+32,48(r4)
> +	vcmpequb. v6,v0,v18
> +	bne	  cr6,L(tail4)

...to here, to avoid needlessly penalizing the cases above?

> +	addi	  r4,r4,64
> +
> +	/* Prep for weird constant generation of reduction.  */
> +	li	  r0,0

Still need a better comment here.  Consider:

  /* Load a dummy aligned address (0) so that 'lvsl' produces a shift
     vector of 0..15.  */

And this "li" instruction can be moved WAY down...

> +
> +	/* Skip the alignment if already 64B aligned.
> +	 */
> +	beq	  L(loop_64b)
> +	mtctr	  r5
> +
> +	/* Test 48B per iteration until 64B aligned.  */
> +	.p2align 5
> +L(loop):
> +	lxv	  v0+32,0(r4)
> +	vcmpequb. v6,v0,v18
> +	bne	  cr6,L(tail1)
> +
> +	lxv	  v0+32,16(r4)
> +	vcmpequb. v6,v0,v18
> +	bne	  cr6,L(tail2)
> +
> +	lxv	  v0+32,32(r4)
> +	vcmpequb. v6,v0,v18
> +	bne	  cr6,L(tail3)
> +
> +	addi	  r4,r4,48
> +	bdnz	  L(loop)
> +
> +	.p2align 5
> +L(loop_64b):
> +	lxv	  v1+32,0(r4)	/* Load 4 quadwords.  */
> +	lxv	  v2+32,16(r4)
> +	lxv	  v3+32,32(r4)
> +	lxv	  v4+32,48(r4)
> +	vminub	  v5,v1,v2	/* Compare and merge into one VR for speed.  */
> +	vminub	  v6,v3,v4
> +	vminub	  v7,v5,v6
> +	vcmpequb. v7,v7,v18	/* Check for NULLs.  */
> +	addi	  r4,r4,64	/* Adjust address for the next iteration.  */
> +	bne	  cr6,L(vmx_zero)
> +
> +	lxv	  v1+32,0(r4)	/* Load 4 quadwords.  */
> +	lxv	  v2+32,16(r4)
> +	lxv	  v3+32,32(r4)
> +	lxv	  v4+32,48(r4)
> +	vminub	  v5,v1,v2	/* Compare and merge into one VR for speed.  */
> +	vminub	  v6,v3,v4
> +	vminub	  v7,v5,v6
> +	vcmpequb. v7,v7,v18	/* Check for NULLs.  */
> +	addi	  r4,r4,64	/* Adjust address for the next iteration.  */
> +	bne	  cr6,L(vmx_zero)
> +
> +	lxv	  v1+32,0(r4)	/* Load 4 quadwords.  */
> +	lxv	  v2+32,16(r4)
> +	lxv	  v3+32,32(r4)
> +	lxv	  v4+32,48(r4)
> +	vminub	  v5,v1,v2	/* Compare and merge into one VR for speed.  */
> +	vminub	  v6,v3,v4
> +	vminub	  v7,v5,v6
> +	vcmpequb. v7,v7,v18	/* Check for NULLs.  */
> +	addi	  r4,r4,64	/* Adjust address for the next iteration.  */
> +	beq	  cr6,L(loop_64b)
> +
> +L(vmx_zero):

...to here, perhaps, to avoid penalizing shorter strings.  (And be
closer to its use.)

> +	/* OK, we found a null byte.  Let's look for it in the current
> +	   64-byte block and mark it in its corresponding VR.  */
> +	vcmpequb  v1,v1,v18
> +	vcmpequb  v2,v2,v18
> +	vcmpequb  v3,v3,v18
> +	vcmpequb  v4,v4,v18
> +
> +	/* We will now 'compress' the result into a single doubleword, so it
> +	   can be moved to a GPR for the final calculation.
> +	   First, we generate an appropriate mask for vbpermq, so we can
> +	   permute bits into the first halfword.  */
> +	vspltisb  v10,3
> +	lvsl	  v11,0,r0
> +	vslb	  v10,v11,v10
> +
> +	/* Permute the first bit of each byte into bits 48-63.  */
> +	vbpermq	  v1,v1,v10
> +	vbpermq	  v2,v2,v10
> +	vbpermq	  v3,v3,v10
> +	vbpermq	  v4,v4,v10
> +
> +	/* Shift each component into its correct position for merging.  */
> +	vsldoi	  v2,v2,v2,2
> +	vsldoi	  v3,v3,v3,4
> +	vsldoi	  v4,v4,v4,6
> +
> +	/* Merge the results and move to a GPR.  */
> +	vor	  v1,v2,v1
> +	vor	  v2,v3,v4
> +	vor	  v4,v1,v2
> +	mfvrd	  r10,v4
> +
> +	/* Adjust address to the beginning of the current 64-byte block.  */
> +	addi	  r4,r4,-64
> +
> +	cnttzd	  r0,r10	/* Count trailing zeros before the match.  */
> +	subf	  r5,r3,r4
> +	add	  r3,r5,r0	/* Compute final length.  */
> +	blr
> +
> +L(tail1):
> +	vctzlsbb  r0,v6
> +	add	  r4,r4,r0
> +	subf	  r3,r3,r4
> +	blr
> +
> +L(tail2):
> +	vctzlsbb  r0,v6
> +	add	  r4,r4,r0
> +	addi	  r4,r4,16
> +	subf	  r3,r3,r4
> +	blr
> +
> +L(tail3):
> +	vctzlsbb  r0,v6
> +	add	  r4,r4,r0
> +	addi	  r4,r4,32
> +	subf	  r3,r3,r4
> +	blr
> +
> +L(tail4):
> +	vctzlsbb  r0,v6
> +	add	  r4,r4,r0
> +	addi	  r4,r4,48
> +	subf	  r3,r3,r4
> +	blr
> +
> +END (STRLEN)
> +
> +#ifdef DEFINE_STRLEN_HIDDEN_DEF
> +weak_alias (__strlen, strlen)
> +libc_hidden_builtin_def (strlen)
> +#endif

[snip]

PC