Subject: Re: [PATCH] powerpc64le: Optimize memset for POWER10
To: libc-alpha@sourceware.org
References: <20210428144048.gyeulahanuzjiotq@work-tp>
Message-ID: <4eb56953-1cda-7c79-1fc3-c8b7057284c6@linux.ibm.com>
Date: Wed, 28 Apr 2021 17:28:20 -0300
In-Reply-To: <20210428144048.gyeulahanuzjiotq@work-tp>
From: Raphael M Zinsly via Libc-alpha
Reply-To: Raphael M Zinsly

Hi Raoni, this patch LGTM.

On 28/04/2021 11:40, Raoni Fassina Firmino via Libc-alpha wrote:
> This implementation is based on __memset_power8 and integrates a lot
> of suggestions from Anton Blanchard.
>
> The biggest difference is that it makes extensive use of stxvl to
> alignment and tail code to avoid branches and small stores.  It has
> three main execution paths:
>
> a) "Short lengths" for lengths up to 64 bytes, avoiding as many
>    branches as possible.
>
> b) "General case" for larger lengths, it has an alignment section
>    using stxvl to avoid branches, a 128 bytes loop and then a tail
>    code, again using stxvl with few branches.
>
> c) "Zeroing cache blocks" for lengths from 256 bytes upwards and set
>    value being zero.  It is mostly the __memset_power8 code but the
>    alignment phase was simplified because, at this point, address is
>    already 16-bytes aligned and also changed to use vector stores.
>    The tail code was also simplified to reuse the general case tail.
>
> All unaligned stores use stxvl instructions that do not generate
> alignment interrupts on POWER10, making it safe to use on
> caching-inhibited memory.
>
> On average, this implementation provides something around 30%
> improvement when compared to __memset_power8.
> ---
>  sysdeps/powerpc/powerpc64/le/power10/memset.S | 251 ++++++++++++++++++
>  sysdeps/powerpc/powerpc64/multiarch/Makefile  |   3 +-
>  sysdeps/powerpc/powerpc64/multiarch/bzero.c   |   8 +
>  .../powerpc64/multiarch/ifunc-impl-list.c     |  14 +
>  .../powerpc64/multiarch/memset-power10.S      |  27 ++
>  sysdeps/powerpc/powerpc64/multiarch/memset.c  |   8 +
>  6 files changed, 310 insertions(+), 1 deletion(-)
>  create mode 100644 sysdeps/powerpc/powerpc64/le/power10/memset.S
>  create mode 100644 sysdeps/powerpc/powerpc64/multiarch/memset-power10.S
>
> diff --git a/sysdeps/powerpc/powerpc64/le/power10/memset.S b/sysdeps/powerpc/powerpc64/le/power10/memset.S
> new file mode 100644
> index 000000000000..c8b77fc64596
> --- /dev/null
> +++ b/sysdeps/powerpc/powerpc64/le/power10/memset.S
> @@ -0,0 +1,251 @@
> +/* Optimized memset implementation for PowerPC64/POWER10.

Could be just POWER10.

> +   Copyright (C) 2021 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +#include <sysdep.h>
> +
> +/* void * [r3] memset (void *s [r3], int c [r4], size_t n [r5]));
> +   Returns 's'.  */
> +
> +#ifndef MEMSET
> +# define MEMSET memset
> +#endif
> +
> +	.machine power9
> +ENTRY_TOCLESS (MEMSET, 5)
> +	CALL_MCOUNT 3
> +
> +L(_memset):
> +	/* Assume memset of zero length is uncommon, and just let it go
> +	   through the small path below.  */
> +	cmpldi	r5,64
> +
> +	/* Replicate byte to quad word.  */
> +	mtvsrws	v0+32,r4
> +	vspltb	v0,v0,15
> +
> +	li	r7,16
> +	sldi	r8,r7,56
> +
> +	bgt	L(large)
> +
> +	/* For short lengths we want to avoid as many branches as possible.
> +	   We use store VSX vector with length instructions to do this.  */
> +	sldi	r5,r5,56
> +
> +	addi	r10,r3,16
> +
> +	sub.	r11,r5,r8
> +	isellt	r11,0,r11	/* Saturate the subtraction to zero.  */
> +
> +	stxvl	v0+32,r3,r5
> +	stxvl	v0+32,r10,r11
> +
> +	addi	r9,r3,32
> +	addi	r10,r3,48
> +
> +	sub.	r11,r11,r8
> +	isellt	r11,0,r11
> +
> +	sub.	r5,r11,r8
> +	isellt	r5,0,r5
> +
> +	stxvl	v0+32,r9,r11
> +	stxvl	v0+32,r10,r5
> +
> +	blr
> +
> +	.balign	16
> +L(large):
> +	mr	r6,r3	/* Don't modify r3 since we need to return it.  */
> +
> +	/* Get dest 16B aligned.  */
> +	neg	r0,r3
> +	clrldi.	r7,r0,(64-4)
> +	beq	L(aligned)
> +	rldic	r9,r0,56,4	/* (~X & 0xf)<<56 "clrlsldi r9,r0,64-4,56".  */
> +
> +	stxvl	v0+32,r6,r9	/* Store up to 15B until aligned address.  */
> +
> +	add	r6,r6,r7
> +	sub	r5,r5,r7
> +
> +	/* After alignment, if there is 127B or less left
> +	   go directly to the tail.  */
> +	cmpldi	r5,64
> +	blt	L(tail_64)
> +
> +	.balign	16
> +L(aligned):
> +	srdi.	r0,r5,7
> +	beq	L(tail_128)
> +
> +	cmpldi	cr5,r5,255
> +	cmpldi	cr6,r4,0
> +	crand	27,26,21
> +	bt	27,L(dcbz)
> +
> +	mtctr	r0
> +
> +	.balign	32
> +L(loop):
> +	stxv	v0+32,0(r6)
> +	stxv	v0+32,16(r6)
> +	stxv	v0+32,32(r6)
> +	stxv	v0+32,48(r6)
> +	stxv	v0+32,64(r6)
> +	stxv	v0+32,80(r6)
> +	stxv	v0+32,96(r6)
> +	stxv	v0+32,112(r6)
> +	addi	r6,r6,128
> +	bdnz	L(loop)
> +
> +	.balign	16
> +L(tail):
> +	/* 127B or less left, finish the tail or return.  */
> +	andi.	r5,r5,127
> +	beqlr
> +
> +	cmpldi	r5,64
> +	blt	L(tail_64)
> +
> +	.balign	16
> +L(tail_128):
> +	stxv	v0+32,0(r6)
> +	stxv	v0+32,16(r6)
> +	stxv	v0+32,32(r6)
> +	stxv	v0+32,48(r6)
> +	addi	r6,r6,64
> +	andi.	r5,r5,63
> +	beqlr
> +
> +	.balign	16
> +L(tail_64):
> +	sldi	r5,r5,56
> +
> +	addi	r10,r6,16
> +
> +	sub.	r11,r5,r8
> +	isellt	r11,0,r11
> +
> +	stxvl	v0+32,r6,r5
> +	stxvl	v0+32,r10,r11
> +
> +	sub.	r11,r11,r8
> +	blelr
> +
> +	addi	r9,r6,32
> +	addi	r10,r6,48
> +
> +	isellt	r11,0,r11
> +
> +	sub.	r5,r11,r8
> +	isellt	r5,0,r5
> +
> +	stxvl	v0+32,r9,r11
> +	stxvl	v0+32,r10,r5
> +
> +	blr
> +
> +	.balign	16
> +L(dcbz):
> +	/* Special case when value is 0 and we have a long length to deal
> +	   with.  Use dcbz to zero out a full cacheline of 128 bytes at a time.
> +	   Before using dcbz though, we need to get the destination 128-byte
> +	   aligned.  */
> +	neg	r0,r6
> +	clrldi.	r0,r0,(64-7)
> +	beq	L(dcbz_aligned)
> +
> +	sub	r5,r5,r0
> +	mtocrf	0x2,r0	/* These are the bits 57..59, the ones for sizes 64,
> +			   32 and 16 which are those that need to be check.  */
> +
> +	/* Write 16~128 bytes until DST is aligned to 128 bytes.  */
> +64:	bf	25,32f
> +	stxv	v0+32,0(r6)
> +	stxv	v0+32,16(r6)
> +	stxv	v0+32,32(r6)
> +	stxv	v0+32,48(r6)
> +	addi	r6,r6,64
> +
> +32:	bf	26,16f
> +	stxv	v0+32,0(r6)
> +	stxv	v0+32,16(r6)
> +	addi	r6,r6,32
> +
> +16:	bf	27,L(dcbz_aligned)
> +	stxv	v0+32,0(r6)
> +	addi	r6,r6,16
> +
> +	.balign	16
> +L(dcbz_aligned):
> +	/* Setup dcbz unroll offsets and count numbers.  */
> +	srdi.	r0,r5,9
> +	li	r9,128
> +	beq	L(bcdz_tail)
> +	li	r10,256
> +	li	r11,384
> +	mtctr	r0
> +
> +	.balign	16
> +L(dcbz_loop):
> +	/* Sets 512 bytes to zero in each iteration, the loop unrolling shows
> +	   a throughput boost for large sizes (2048 bytes or higher).  */
> +	dcbz	0,r6
> +	dcbz	r9,r6
> +	dcbz	r10,r6
> +	dcbz	r11,r6
> +	addi	r6,r6,512
> +	bdnz	L(dcbz_loop)
> +
> +	andi.	r5,r5,511
> +	beqlr
> +
> +	.balign	16
> +L(bcdz_tail):
> +	/* We have 1~511 bytes remaining.  */
> +	srdi.	r0,r5,7
> +	beq	L(tail)
> +
> +	mtocrf	0x1,r0
> +
> +256:	bf	30,128f
> +	dcbz	0,r6
> +	dcbz	r9,r6
> +	addi	r6,r6,256
> +
> +128:	bf	31,L(tail)
> +	dcbz	0,r6
> +	addi	r6,r6,128
> +
> +	b	L(tail)
> +
> +END_GEN_TB (MEMSET,TB_TOCLESS)
> +libc_hidden_builtin_def (memset)
> +
> +/* Copied from bzero.S to prevent the linker from inserting a stub
> +   between bzero and memset.  */
> +ENTRY_TOCLESS (__bzero)
> +	CALL_MCOUNT 3
> +	mr	r5,r4
> +	li	r4,0
> +	b	L(_memset)
> +END (__bzero)
> +#ifndef __bzero
> +weak_alias (__bzero, bzero)
> +#endif
> diff --git a/sysdeps/powerpc/powerpc64/multiarch/Makefile b/sysdeps/powerpc/powerpc64/multiarch/Makefile
> index 8aa46a370270..147ed42b218e 100644
> --- a/sysdeps/powerpc/powerpc64/multiarch/Makefile
> +++ b/sysdeps/powerpc/powerpc64/multiarch/Makefile
> @@ -32,7 +32,8 @@ sysdep_routines += memcpy-power8-cached memcpy-power7 memcpy-a2 memcpy-power6 \
>  		   strncase-power8
>  
>  ifneq (,$(filter %le,$(config-machine)))
> -sysdep_routines += strcmp-power9 strncmp-power9 strcpy-power9 stpcpy-power9 \
> +sysdep_routines += memset-power10 \
> +		   strcmp-power9 strncmp-power9 strcpy-power9 stpcpy-power9 \
>  		   rawmemchr-power9 strlen-power9 strncpy-power9 stpncpy-power9 \
>  		   strlen-power10
>  endif
> diff --git a/sysdeps/powerpc/powerpc64/multiarch/bzero.c b/sysdeps/powerpc/powerpc64/multiarch/bzero.c
> index c3f819ff48d6..50a5320c6650 100644
> --- a/sysdeps/powerpc/powerpc64/multiarch/bzero.c
> +++ b/sysdeps/powerpc/powerpc64/multiarch/bzero.c
> @@ -27,8 +27,16 @@ extern __typeof (bzero) __bzero_power4 attribute_hidden;
>  extern __typeof (bzero) __bzero_power6 attribute_hidden;
>  extern __typeof (bzero) __bzero_power7 attribute_hidden;
>  extern __typeof (bzero) __bzero_power8 attribute_hidden;
> +# ifdef __LITTLE_ENDIAN__
> +extern __typeof (bzero) __bzero_power10 attribute_hidden;
> +# endif
>  
>  libc_ifunc (__bzero,
> +# ifdef __LITTLE_ENDIAN__
> +	    (hwcap2 & (PPC_FEATURE2_ARCH_3_1 | PPC_FEATURE2_HAS_ISEL)
> +	     && hwcap & PPC_FEATURE_HAS_VSX)
> +	    ? __bzero_power10 :
> +# endif
>  	    (hwcap2 & PPC_FEATURE2_ARCH_2_07)
>  	    ? __bzero_power8 :
>  	    (hwcap & PPC_FEATURE_HAS_VSX)
> diff --git a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
> index 1a6993616f2a..cd0d95ed9a94 100644
> --- a/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
> +++ b/sysdeps/powerpc/powerpc64/multiarch/ifunc-impl-list.c
> @@ -73,6 +73,13 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
>  
>    /* Support sysdeps/powerpc/powerpc64/multiarch/memset.c.  */
>    IFUNC_IMPL (i, name, memset,
> +#ifdef __LITTLE_ENDIAN__
> +	      IFUNC_IMPL_ADD (array, i, memset,
> +			      hwcap2 & (PPC_FEATURE2_ARCH_3_1 |
> +					PPC_FEATURE2_HAS_ISEL)
> +			      && hwcap & PPC_FEATURE_HAS_VSX,
> +			      __memset_power10)
> +#endif
>  	      IFUNC_IMPL_ADD (array, i, memset, hwcap2 & PPC_FEATURE2_ARCH_2_07,
>  			      __memset_power8)
>  	      IFUNC_IMPL_ADD (array, i, memset, hwcap & PPC_FEATURE_HAS_VSX,
> @@ -174,6 +181,13 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
>  
>    /* Support sysdeps/powerpc/powerpc64/multiarch/bzero.c.  */
>    IFUNC_IMPL (i, name, bzero,
> +#ifdef __LITTLE_ENDIAN__
> +	      IFUNC_IMPL_ADD (array, i, bzero,
> +			      hwcap2 & (PPC_FEATURE2_ARCH_3_1 |
> +					PPC_FEATURE2_HAS_ISEL)
> +			      && hwcap & PPC_FEATURE_HAS_VSX,
> +			      __bzero_power10)
> +#endif
>  	      IFUNC_IMPL_ADD (array, i, bzero, hwcap2 & PPC_FEATURE2_ARCH_2_07,
>  			      __bzero_power8)
>  	      IFUNC_IMPL_ADD (array, i, bzero, hwcap & PPC_FEATURE_HAS_VSX,
> diff --git a/sysdeps/powerpc/powerpc64/multiarch/memset-power10.S b/sysdeps/powerpc/powerpc64/multiarch/memset-power10.S
> new file mode 100644
> index 000000000000..53a9535a2401
> --- /dev/null
> +++ b/sysdeps/powerpc/powerpc64/multiarch/memset-power10.S
> @@ -0,0 +1,27 @@
> +/* Optimized memset implementation for PowerPC64/POWER10.
> +   Copyright (C) 2021 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +#define MEMSET __memset_power10
> +
> +#undef libc_hidden_builtin_def
> +#define libc_hidden_builtin_def(name)
> +
> +#undef __bzero
> +#define __bzero __bzero_power10
> +
> +#include <sysdeps/powerpc/powerpc64/le/power10/memset.S>
> diff --git a/sysdeps/powerpc/powerpc64/multiarch/memset.c b/sysdeps/powerpc/powerpc64/multiarch/memset.c
> index d483f66f2744..6562646dffcf 100644
> --- a/sysdeps/powerpc/powerpc64/multiarch/memset.c
> +++ b/sysdeps/powerpc/powerpc64/multiarch/memset.c
> @@ -33,10 +33,18 @@ extern __typeof (__redirect_memset) __memset_power4 attribute_hidden;
>  extern __typeof (__redirect_memset) __memset_power6 attribute_hidden;
>  extern __typeof (__redirect_memset) __memset_power7 attribute_hidden;
>  extern __typeof (__redirect_memset) __memset_power8 attribute_hidden;
> +# ifdef __LITTLE_ENDIAN__
> +extern __typeof (__redirect_memset) __memset_power10 attribute_hidden;
> +# endif
>  
>  /* Avoid DWARF definition DIE on ifunc symbol so that GDB can handle
>     ifunc symbol properly.  */
>  libc_ifunc (__libc_memset,
> +# ifdef __LITTLE_ENDIAN__
> +	    (hwcap2 & (PPC_FEATURE2_ARCH_3_1 | PPC_FEATURE2_HAS_ISEL)
> +	     && hwcap & PPC_FEATURE_HAS_VSX)
> +	    ? __memset_power10 :
> +# endif
>  	    (hwcap2 & PPC_FEATURE2_ARCH_2_07)
>  	    ? __memset_power8 :
>  	    (hwcap & PPC_FEATURE_HAS_VSX)
>

Thanks,
-- 
Raphael Moreira Zinsly
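For readers following the patch's "short lengths" path: it issues four stxvl stores whose lengths are produced by saturating subtraction (the sub./isellt pairs), so no branch is needed per chunk. That arithmetic can be cross-checked with a small C model. This is a sketch only: the real stxvl takes its length in the top byte of the register (hence the sldi ...,56 shifts, which the model omits), and `store_with_length`, `memset_short64` and `model_matches` are invented names, not part of the patch.

```c
#include <stddef.h>
#include <string.h>

/* Model of one stxvl: stores min(len, 16) bytes.  */
static void store_with_length (unsigned char *dst, int c, size_t len)
{
  memset (dst, c, len > 16 ? 16 : len);
}

/* Model of the branchless short path (n <= 64): each successive
   16-byte chunk receives the remaining length, saturated at zero,
   mirroring the assembly's sub./isellt sequences.  */
static void memset_short64 (unsigned char *s, int c, size_t n)
{
  size_t l0 = n;
  size_t l1 = l0 > 16 ? l0 - 16 : 0;
  size_t l2 = l1 > 16 ? l1 - 16 : 0;
  size_t l3 = l2 > 16 ? l2 - 16 : 0;
  store_with_length (s, c, l0);
  store_with_length (s + 16, c, l1);
  store_with_length (s + 32, c, l2);
  store_with_length (s + 48, c, l3);
}

/* Returns 1 if the model agrees with plain memset for length n.  */
static int model_matches (size_t n)
{
  unsigned char a[80], b[80];
  memset (a, 0xAA, sizeof a);
  memset (b, 0xAA, sizeof b);
  memset_short64 (a, 0x5B, n);
  memset (b, 0x5B, n);
  return memcmp (a, b, sizeof a) == 0;
}
```

Note that for n = 0 every computed length is zero and nothing is stored, consistent with the commit message's remark that zero-length calls simply fall through this path.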
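The dcbz path's alignment step (neg r0,r6; clrldi. r0,r0,64-7) computes (-addr) mod 128, the distance to the next 128-byte cache-line boundary; since the destination is already 16-byte aligned there, that distance splits exactly into at most one 64-, one 32- and one 16-byte store, selected by the three bits tested via mtocrf 0x2. A hedged model of that arithmetic (function names invented for illustration):

```c
#include <stdint.h>

/* (-addr) mod 128: bytes needed to reach the next 128-byte boundary,
   0 if already aligned.  Mirrors "neg r0,r6; clrldi. r0,r0,(64-7)".  */
static uint64_t bytes_to_128B_boundary (uint64_t addr)
{
  return (0 - addr) & 127;
}

/* With addr 16-byte aligned the remainder is a multiple of 16, so the
   64/32/16 bits of the remainder account for all of it -- which is why
   the assembly only needs the three conditional store groups.  */
static uint64_t decompose (uint64_t rem)
{
  return (rem & 64) + (rem & 32) + (rem & 16);
}
```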