From: "Lucas A. M. Magalhaes via Libc-alpha"
Reply-To: "Lucas A. M. Magalhaes"
To: Raoni Fassina Firmino , libc-alpha@sourceware.org
Cc: tuliom@linux.ibm.com, anton@ozlabs.org
Subject: Re: [PATCH] powerpc64le: Optimize memset for POWER10
Date: Wed, 28 Apr 2021 15:48:28 -0300
In-Reply-To: <20210428144048.gyeulahanuzjiotq@work-tp>
References: <20210428144048.gyeulahanuzjiotq@work-tp>
Message-ID: <161963570877.1422733.11894775104498167968@fedora.local>

Thanks Raoni, LGTM with some small comment fixes; all tests pass.

Quoting Raoni Fassina Firmino via Libc-alpha (2021-04-28 11:40:48)
> This implementation is based on __memset_power8 and integrates a lot
> of suggestions from Anton Blanchard.
> 
> The biggest difference is that it makes extensive use of stxvl to
> alignment and tail code to avoid branches and small stores. It has
> three main execution paths:
> 
> a) "Short lengths" for lengths up to 64 bytes, avoiding as many
>    branches as possible.
> 
> b) "General case" for larger lengths, it has an alignment section
>    using stxvl to avoid branches, a 128 bytes loop and then a tail
>    code, again using stxvl with few branches.
> 
> c) "Zeroing cache blocks" for lengths from 256 bytes upwards and set
>    value being zero. It is mostly the __memset_power8 code but the
>    alignment phase was simplified because, at this point, address is
>    already 16-bytes aligned and also changed to use vector stores.
>    The tail code was also simplified to reuse the general case tail.
> 
> All unaligned stores use stxvl instructions that do not generate
> alignment interrupts on POWER10, making it safe to use on
> caching-inhibited memory.
> 
> On average, this implementation provides something around 30%
> improvement when compared to __memset_power8.
> ---
>  sysdeps/powerpc/powerpc64/le/power10/memset.S | 251 ++++++++++++++++++
>  sysdeps/powerpc/powerpc64/multiarch/Makefile  |   3 +-
>  sysdeps/powerpc/powerpc64/multiarch/bzero.c   |   8 +
>  .../powerpc64/multiarch/ifunc-impl-list.c     |  14 +
>  .../powerpc64/multiarch/memset-power10.S      |  27 ++
>  sysdeps/powerpc/powerpc64/multiarch/memset.c  |   8 +
>  6 files changed, 310 insertions(+), 1 deletion(-)
>  create mode 100644 sysdeps/powerpc/powerpc64/le/power10/memset.S
>  create mode 100644 sysdeps/powerpc/powerpc64/multiarch/memset-power10.S
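
To help review the short path: below is a rough C model of what the
stxvl-based code does. It is my own sketch, not part of the patch, and
the function name is mine. stxvl takes the store length from bits 0:7
of the register, which is why the assembly keeps lengths shifted left
by 56. The real code is branch-free, using sub./isellt to saturate each
slot's length at zero instead of the loop and comparisons used here:

    #include <stddef.h>
    #include <string.h>

    /* Model of the n <= 64 path: four 16-byte slots, each covered by
       one length-limited vector store (stxvl stores min(len,16)).  */
    static void
    memset_short_model (unsigned char *dst, int c, size_t n)
    {
      unsigned char vec[16];
      memset (vec, c, 16);              /* mtvsrws + vspltb.  */
      for (size_t i = 0; i < 4; i++)
        {
          /* sub. + isellt: bytes left for this slot, saturated at 0,
             so n = 0 stores nothing and n = 64 fills all four slots.  */
          size_t left = n > i * 16 ? n - i * 16 : 0;
          memcpy (dst + i * 16, vec, left < 16 ? left : 16);
        }
    }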
> 
> diff --git a/sysdeps/powerpc/powerpc64/le/power10/memset.S b/sysdeps/powerpc/powerpc64/le/power10/memset.S
> new file mode 100644
> index 000000000000..c8b77fc64596
> --- /dev/null
> +++ b/sysdeps/powerpc/powerpc64/le/power10/memset.S
> @@ -0,0 +1,251 @@
> +/* Optimized memset implementation for PowerPC64/POWER10.
> +   Copyright (C) 2021 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +#include <sysdep.h>
> +
> +/* void * [r3] memset (void *s [r3], int c [r4], size_t n [r5]));
> +   Returns 's'.  */
> +
> +#ifndef MEMSET
> +# define MEMSET memset
> +#endif
> +
> +	.machine power9
> +ENTRY_TOCLESS (MEMSET, 5)
> +	CALL_MCOUNT 3
> +

Ok.

> +L(_memset):
> +	/* Assume memset of zero length is uncommon, and just let it go
> +	   through the small path below.  */
> +	cmpldi	r5,64
> +
> +	/* Replicate byte to quad word.  */
> +	mtvsrws	v0+32,r4
> +	vspltb	v0,v0,15
> +
> +	li	r7,16
> +	sldi	r8,r7,56
> +
> +	bgt	L(large)
> +
> +	/* For short lengths we want to avoid as many branches as possible.
> +	   We use store VSX vector with length instructions to do this.  */
> +	sldi	r5,r5,56
> +
> +	addi	r10,r3,16
> +
> +	sub.	r11,r5,r8
> +	isellt	r11,0,r11	/* Saturate the subtraction to zero.  */
> +
> +	stxvl	v0+32,r3,r5
> +	stxvl	v0+32,r10,r11
> +
> +	addi	r9,r3,32
> +	addi	r10,r3,48
> +
> +	sub.	r11,r11,r8
> +	isellt	r11,0,r11
> +
> +	sub.	r5,r11,r8
> +	isellt	r5,0,r5
> +
> +	stxvl	v0+32,r9,r11
> +	stxvl	v0+32,r10,r5
> +
> +	blr
> +

Ok.

> +	.balign	16
> +L(large):
> +	mr	r6,r3	/* Don't modify r3 since we need to return it.  */
> +
> +	/* Get dest 16B aligned.  */
> +	neg	r0,r3
> +	clrldi.	r7,r0,(64-4)
> +	beq	L(aligned)
> +	rldic	r9,r0,56,4	/* (~X & 0xf)<<56 "clrlsldi r9,r0,64-4,56".  */
> +
> +	stxvl	v0+32,r6,r9	/* Store up to 15B until aligned address.  */
> +
> +	add	r6,r6,r7
> +	sub	r5,r5,r7
> +

Ok.

> +	/* After alignment, if there is 127B or less left

s/127B/64B/

> +	   go directly to the tail.  */
> +	cmpldi	r5,64
> +	blt	L(tail_64)
> +
> +	.balign	16
> +L(aligned):
> +	srdi.	r0,r5,7
> +	beq	L(tail_128)
> +
> +	cmpldi	cr5,r5,255
> +	cmpldi	cr6,r4,0
> +	crand	27,26,21
> +	bt	27,L(dcbz)

Maybe add a comment to explain this branch.

> +
> +	mtctr	r0
> +
> +	.balign	32
> +L(loop):
> +	stxv	v0+32,0(r6)
> +	stxv	v0+32,16(r6)
> +	stxv	v0+32,32(r6)
> +	stxv	v0+32,48(r6)
> +	stxv	v0+32,64(r6)
> +	stxv	v0+32,80(r6)
> +	stxv	v0+32,96(r6)
> +	stxv	v0+32,112(r6)
> +	addi	r6,r6,128
> +	bdnz	L(loop)
> +

Ok.

> +	.balign	16
> +L(tail):
> +	/* 127B or less left, finish the tail or return.  */
> +	andi.	r5,r5,127
> +	beqlr
> +
> +	cmpldi	r5,64
> +	blt	L(tail_64)
> +
> +	.balign	16
> +L(tail_128):

The label tail_128 made me think that 128 bytes would be copied here.
Maybe add a comment here.

> +	stxv	v0+32,0(r6)
> +	stxv	v0+32,16(r6)
> +	stxv	v0+32,32(r6)
> +	stxv	v0+32,48(r6)
> +	addi	r6,r6,64
> +	andi.	r5,r5,63
> +	beqlr
> +
> +	.balign	16
> +L(tail_64):

Maybe add a comment here to explain this section as well.

> +	sldi	r5,r5,56
> +
> +	addi	r10,r6,16
> +
> +	sub.	r11,r5,r8
> +	isellt	r11,0,r11
> +
> +	stxvl	v0+32,r6,r5
> +	stxvl	v0+32,r10,r11
> +
> +	sub.	r11,r11,r8
> +	blelr
> +
> +	addi	r9,r6,32
> +	addi	r10,r6,48
> +
> +	isellt	r11,0,r11
> +
> +	sub.	r5,r11,r8
> +	isellt	r5,0,r5
> +
> +	stxvl	v0+32,r9,r11
> +	stxvl	v0+32,r10,r5
> +
> +	blr
> +

Ok.
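
For the L(dcbz) branch above, the suggested comment could spell out the
condition. My reading of the CR bits (worth double-checking): bit 21 is
cr5.GT from "cmpldi cr5,r5,255" and bit 26 is cr6.EQ from
"cmpldi cr6,r4,0", so "crand 27,26,21" followed by "bt 27" implements:

    #include <stddef.h>

    /* Equivalent C predicate for the bt 27,L(dcbz) branch; the
       function name is mine, for illustration only.  */
    static int
    use_dcbz_path (size_t n, int c)
    {
      return n > 255 && c == 0;   /* cr5.GT && cr6.EQ  */
    }

i.e. take the cache-block-zeroing path only when zeroing more than 255
bytes.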

> +	.balign	16
> +L(dcbz):
> +	/* Special case when value is 0 and we have a long length to deal
> +	   with.  Use dcbz to zero out a full cacheline of 128 bytes at a time.
> +	   Before using dcbz though, we need to get the destination 128-byte
> +	   aligned.  */
> +	neg	r0,r6
> +	clrldi.	r0,r0,(64-7)
> +	beq	L(dcbz_aligned)
> +
> +	sub	r5,r5,r0
> +	mtocrf	0x2,r0	/* These are the bits 57..59, the ones for sizes 64,
> +			   32 and 16 which are those that need to be checked.  */
> +

Ok.

> +	/* Write 16~128 bytes until DST is aligned to 128 bytes.  */
> +64:	bf	25,32f
> +	stxv	v0+32,0(r6)
> +	stxv	v0+32,16(r6)
> +	stxv	v0+32,32(r6)
> +	stxv	v0+32,48(r6)
> +	addi	r6,r6,64
> +
> +32:	bf	26,16f
> +	stxv	v0+32,0(r6)
> +	stxv	v0+32,16(r6)
> +	addi	r6,r6,32
> +
> +16:	bf	27,L(dcbz_aligned)
> +	stxv	v0+32,0(r6)
> +	addi	r6,r6,16
> +

Ok.

> +	.balign	16
> +L(dcbz_aligned):
> +	/* Setup dcbz unroll offsets and count numbers.  */
> +	srdi.	r0,r5,9
> +	li	r9,128
> +	beq	L(bcdz_tail)
> +	li	r10,256
> +	li	r11,384
> +	mtctr	r0
> +

Ok.

> +	.balign	16
> +L(dcbz_loop):
> +	/* Sets 512 bytes to zero in each iteration, the loop unrolling shows
> +	   a throughput boost for large sizes (2048 bytes or higher).  */
> +	dcbz	0,r6
> +	dcbz	r9,r6
> +	dcbz	r10,r6
> +	dcbz	r11,r6
> +	addi	r6,r6,512
> +	bdnz	L(dcbz_loop)
> +
> +	andi.	r5,r5,511
> +	beqlr
> +

Ok.

> +	.balign	16
> +L(bcdz_tail):
> +	/* We have 1~511 bytes remaining.  */
> +	srdi.	r0,r5,7
> +	beq	L(tail)
> +
> +	mtocrf	0x1,r0
> +
> +256:	bf	30,128f
> +	dcbz	0,r6
> +	dcbz	r9,r6
> +	addi	r6,r6,256
> +
> +128:	bf	31,L(tail)
> +	dcbz	0,r6
> +	addi	r6,r6,128
> +
> +	b	L(tail)
> +

Ok.

> +END_GEN_TB (MEMSET,TB_TOCLESS)
> +libc_hidden_builtin_def (memset)
> +
> +/* Copied from bzero.S to prevent the linker from inserting a stub
> +   between bzero and memset.  */
> +ENTRY_TOCLESS (__bzero)
> +	CALL_MCOUNT 3
> +	mr	r5,r4
> +	li	r4,0
> +	b	L(_memset)
> +END (__bzero)
> +#ifndef __bzero
> +weak_alias (__bzero, bzero)
> +#endif

Ok.

> diff --git a/sysdeps/powerpc/powerpc64/multiarch/Makefile b/sysdeps/powerpc/powerpc64/multiarch/Makefile
> index 8aa46a370270..147ed42b218e 100644
> --- a/sysdeps/powerpc/powerpc64/multiarch/Makefile
> +++ b/sysdeps/powerpc/powerpc64/multiarch/Makefile
> @@ -32,7 +32,8 @@ sysdep_routines += memcpy-power8-cached memcpy-power7 memcpy-a2 memcpy-power6 \
>  		   strncase-power8
>  
>  ifneq (,$(filter %le,$(config-machine)))
> -sysdep_routines += strcmp-power9 strncmp-power9 strcpy-power9 stpcpy-power9 \
> +sysdep_routines += memset-power10 \
> +		   strcmp-power9 strncmp-power9 strcpy-power9 stpcpy-power9 \
>  		   rawmemchr-power9 strlen-power9 strncpy-power9 stpncpy-power9 \
>  		   strlen-power10
>  endif

Ok.

> diff --git a/sysdeps/powerpc/powerpc64/multiarch/bzero.c b/sysdeps/powerpc/powerpc64/multiarch/bzero.c
> index c3f819ff48d6..50a5320c6650 100644
> --- a/sysdeps/powerpc/powerpc64/multiarch/bzero.c
> +++ b/sysdeps/powerpc/powerpc64/multiarch/bzero.c
> @@ -27,8 +27,16 @@ extern __typeof (bzero) __bzero_power4 attribute_hidden;
>  extern __typeof (bzero) __bzero_power6 attribute_hidden;
>  extern __typeof (bzero) __bzero_power7 attribute_hidden;
>  extern __typeof (bzero) __bzero_power8 attribute_hidden;
> +# ifdef __LITTLE_ENDIAN__
> +extern __typeof (bzero) __bzero_power10 attribute_hidden;
> +# endif
>  
>  libc_ifunc (__bzero,
> +# ifdef __LITTLE_ENDIAN__
> +	    (hwcap2 & (PPC_FEATURE2_ARCH_3_1 | PPC_FEATURE2_HAS_ISEL)
> +	     && hwcap & PPC_FEATURE_HAS_VSX)
> +	    ? __bzero_power10 :
> +# endif
>  	    (hwcap2 & PPC_FEATURE2_ARCH_2_07)
>  	    ? __bzero_power8 :
>  	    (hwcap & PPC_FEATURE_HAS_VSX)

Ok.
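
As a side note on the selection logic: the hwcap test can also be
checked from userspace, which is handy when verifying which variant a
given machine will get. A standalone sketch, not glibc code; the
fallback #define values are the ones I recall from the kernel headers,
so verify them before relying on this:

    #include <stdio.h>
    #include <sys/auxv.h>

    #ifndef PPC_FEATURE2_ARCH_3_1
    # define PPC_FEATURE2_ARCH_3_1 0x00040000
    #endif
    #ifndef PPC_FEATURE_HAS_VSX
    # define PPC_FEATURE_HAS_VSX 0x00000080
    #endif

    int
    main (void)
    {
      unsigned long hwcap = getauxval (AT_HWCAP);
      unsigned long hwcap2 = getauxval (AT_HWCAP2);
      /* Mirrors the ARCH_3_1 half of the ifunc condition above.  */
      int power10 = (hwcap2 & PPC_FEATURE2_ARCH_3_1) != 0
                    && (hwcap & PPC_FEATURE_HAS_VSX) != 0;
      printf ("POWER10 memset selectable: %d\n", power10);
      return 0;
    }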

> diff --git a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
> index 1a6993616f2a..cd0d95ed9a94 100644
> --- a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
> +++ b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
> @@ -73,6 +73,13 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
>  
>    /* Support sysdeps/powerpc/powerpc64/multiarch/memset.c.  */
>    IFUNC_IMPL (i, name, memset,
> +#ifdef __LITTLE_ENDIAN__
> +	      IFUNC_IMPL_ADD (array, i, memset,
> +			      hwcap2 & (PPC_FEATURE2_ARCH_3_1 |
> +					PPC_FEATURE2_HAS_ISEL)
> +			      && hwcap & PPC_FEATURE_HAS_VSX,
> +			      __memset_power10)
> +#endif
>  	      IFUNC_IMPL_ADD (array, i, memset, hwcap2 & PPC_FEATURE2_ARCH_2_07,
> 			      __memset_power8)
>  	      IFUNC_IMPL_ADD (array, i, memset, hwcap & PPC_FEATURE_HAS_VSX,
> @@ -174,6 +181,13 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
>  
>    /* Support sysdeps/powerpc/powerpc64/multiarch/bzero.c.  */
>    IFUNC_IMPL (i, name, bzero,
> +#ifdef __LITTLE_ENDIAN__
> +	      IFUNC_IMPL_ADD (array, i, bzero,
> +			      hwcap2 & (PPC_FEATURE2_ARCH_3_1 |
> +					PPC_FEATURE2_HAS_ISEL)
> +			      && hwcap & PPC_FEATURE_HAS_VSX,
> +			      __bzero_power10)
> +#endif
>  	      IFUNC_IMPL_ADD (array, i, bzero, hwcap2 & PPC_FEATURE2_ARCH_2_07,
> 			      __bzero_power8)
>  	      IFUNC_IMPL_ADD (array, i, bzero, hwcap & PPC_FEATURE_HAS_VSX,

Ok.

> diff --git a/sysdeps/powerpc/powerpc64/multiarch/memset-power10.S b/sysdeps/powerpc/powerpc64/multiarch/memset-power10.S
> new file mode 100644
> index 000000000000..53a9535a2401
> --- /dev/null
> +++ b/sysdeps/powerpc/powerpc64/multiarch/memset-power10.S
> @@ -0,0 +1,27 @@
> +/* Optimized memset implementation for PowerPC64/POWER10.
> +   Copyright (C) 2021 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +#define MEMSET __memset_power10
> +
> +#undef libc_hidden_builtin_def
> +#define libc_hidden_builtin_def(name)
> +
> +#undef __bzero
> +#define __bzero __bzero_power10
> +

Ok.

> +#include <sysdeps/powerpc/powerpc64/le/power10/memset.S>
> diff --git a/sysdeps/powerpc/powerpc64/multiarch/memset.c b/sysdeps/powerpc/powerpc64/multiarch/memset.c
> index d483f66f2744..6562646dffcf 100644
> --- a/sysdeps/powerpc/powerpc64/multiarch/memset.c
> +++ b/sysdeps/powerpc/powerpc64/multiarch/memset.c
> @@ -33,10 +33,18 @@ extern __typeof (__redirect_memset) __memset_power4 attribute_hidden;
>  extern __typeof (__redirect_memset) __memset_power6 attribute_hidden;
>  extern __typeof (__redirect_memset) __memset_power7 attribute_hidden;
>  extern __typeof (__redirect_memset) __memset_power8 attribute_hidden;
> +# ifdef __LITTLE_ENDIAN__
> +extern __typeof (__redirect_memset) __memset_power10 attribute_hidden;
> +# endif
>  
>  /* Avoid DWARF definition DIE on ifunc symbol so that GDB can handle
>     ifunc symbol properly.  */
>  libc_ifunc (__libc_memset,
> +# ifdef __LITTLE_ENDIAN__
> +	    (hwcap2 & (PPC_FEATURE2_ARCH_3_1 | PPC_FEATURE2_HAS_ISEL)
> +	     && hwcap & PPC_FEATURE_HAS_VSX)
> +	    ? __memset_power10 :
> +# endif
>  	    (hwcap2 & PPC_FEATURE2_ARCH_2_07)
>  	    ? __memset_power8 :
>  	    (hwcap & PPC_FEATURE_HAS_VSX)

Ok.

> -- 
> 2.26.2
> 
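
For reference, besides the glibc testsuite I ran a quick standalone
sanity check on the boundary lengths where the implementation switches
paths (a sketch of my own, not part of the patch):

    #include <assert.h>
    #include <stdlib.h>
    #include <string.h>

    int
    main (void)
    {
      /* Lengths around the short path (64), the 128B loop, and the
         dcbz cutover (255/256).  */
      static const size_t lens[] = { 0, 1, 15, 16, 63, 64, 65, 127,
                                     128, 255, 256, 257, 511, 512, 4096 };
      static const unsigned char vals[] = { 0x00, 0x5b };
      unsigned char *buf = malloc (8192);
      assert (buf != NULL);
      for (size_t i = 0; i < sizeof lens / sizeof lens[0]; i++)
        for (int v = 0; v < 2; v++)       /* Zero and non-zero byte.  */
          {
            memset (buf, 0xaa, 8192);
            memset (buf + 1, vals[v], lens[i]);   /* Misaligned dst.  */
            for (size_t j = 0; j < lens[i]; j++)
              assert (buf[1 + j] == vals[v]);
            /* No underrun or overrun.  */
            assert (buf[0] == 0xaa && buf[1 + lens[i]] == 0xaa);
          }
      free (buf);
      return 0;
    }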