From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net X-Spam-Level: X-Spam-ASN: AS3215 2.6.0.0/16 X-Spam-Status: No, score=-4.3 required=3.0 tests=AWL,BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI,SPF_HELO_PASS,SPF_PASS shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2 Received: from sourceware.org (server2.sourceware.org [IPv6:2620:52:3:1:0:246e:9693:128c]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by dcvr.yhbt.net (Postfix) with ESMTPS id C2C361F5AE for ; Thu, 29 Apr 2021 21:40:47 +0000 (UTC) Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 80CEE3857036; Thu, 29 Apr 2021 21:40:45 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 80CEE3857036 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org; s=default; t=1619732445; bh=RAjCnjUNuoVmEvYEeBw/aWGkO5Bmf6sSSch+yDCsA4k=; h=Date:To:Subject:References:In-Reply-To:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To:Cc: From; b=lUv9Axl7ca6FC6zvh9WO1+NJMN0abiCyiDdcC7KtBVAx7Wy6CBcUAPp6sb+IleQhF h0Geds8tDyb5u0cAzbCETBF9Vs842yqLm1Osg3iBR5alz9cmOGlyQPPMLu1F0PSZvk mhMV1jIO7RxpVQEKA3HNGOtB9I1bG4gvbe8/S1cg= Received: from mx0a-001b2d01.pphosted.com (mx0b-001b2d01.pphosted.com [148.163.158.5]) by sourceware.org (Postfix) with ESMTPS id 024F83857036 for ; Thu, 29 Apr 2021 21:40:42 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org 024F83857036 Received: from pps.filterd (m0098414.ppops.net [127.0.0.1]) by mx0b-001b2d01.pphosted.com (8.16.0.43/8.16.0.43) with SMTP id 13TLYCBY111635; Thu, 29 Apr 2021 17:40:38 -0400 Received: from ppma03wdc.us.ibm.com (ba.79.3fa9.ip4.static.sl-reverse.com [169.63.121.186]) by mx0b-001b2d01.pphosted.com with ESMTP id 3882k5bjgs-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 29 Apr 2021 17:40:38 -0400 Received: from pps.filterd (ppma03wdc.us.ibm.com [127.0.0.1]) by ppma03wdc.us.ibm.com (8.16.0.43/8.16.0.43) with SMTP id 13TLRAWW029481; Thu, 29 Apr 2021 21:40:38 GMT Received: from b01cxnp22033.gho.pok.ibm.com (b01cxnp22033.gho.pok.ibm.com [9.57.198.23]) by ppma03wdc.us.ibm.com with ESMTP id 3882p7gr52-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Thu, 29 Apr 2021 21:40:38 +0000 Received: from b01ledav004.gho.pok.ibm.com (b01ledav004.gho.pok.ibm.com [9.57.199.109]) by b01cxnp22033.gho.pok.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id 13TLeaHu30277912 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Thu, 29 Apr 2021 21:40:36 GMT Received: from b01ledav004.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id C222F112061; Thu, 29 Apr 2021 21:40:36 +0000 (GMT) Received: from b01ledav004.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id C2A08112062; Thu, 29 Apr 2021 21:40:35 +0000 (GMT) Received: from work-tp (unknown [9.65.216.65]) by b01ledav004.gho.pok.ibm.com (Postfix) with ESMTPS; Thu, 29 Apr 2021 21:40:35 +0000 (GMT) Date: Thu, 29 Apr 2021 18:40:33 -0300 To: Matheus Castanho Subject: Re: [PATCH] powerpc64le: Optimize memset for POWER10 Message-ID: <20210429214033.j7ghoc3ghxj3bfqv@work-tp> Mail-Followup-To: Matheus Castanho , tuliom@linux.ibm.com, anton@ozlabs.org, libc-alpha@sourceware.org References: <20210428144048.gyeulahanuzjiotq@work-tp> <87czue16dt.fsf@linux.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <87czue16dt.fsf@linux.ibm.com> X-TM-AS-GCONF: 00 X-Proofpoint-ORIG-GUID: aF59y84WOoTkxvRHWAi9BH17Sfd7EfWQ X-Proofpoint-GUID: aF59y84WOoTkxvRHWAi9BH17Sfd7EfWQ X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.391, 18.0.761 definitions=2021-04-29_11:2021-04-28, 2021-04-29 signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 mlxlogscore=999 phishscore=0 priorityscore=1501 impostorscore=0 mlxscore=0 lowpriorityscore=0 suspectscore=0 malwarescore=0 spamscore=0 adultscore=0 clxscore=1015 bulkscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2104060000 definitions=main-2104290137 X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , From: Raoni Fassina Firmino via Libc-alpha Reply-To: Raoni Fassina Firmino Cc: tuliom@linux.ibm.com, libc-alpha@sourceware.org, anton@ozlabs.org Errors-To: libc-alpha-bounces@sourceware.org Sender: "Libc-alpha" Thanks for the review Matheus, let me know if I missed something. > > +L(_memset): > > + /* Assume memset of zero length is uncommon, and just let it go > > + through the small path below. */ > > + cmpldi r5,64 > > + > > + /* Replicate byte to quad word. */ > > + mtvsrws v0+32,r4 > > + vspltb v0,v0,15 > > Why not simply use mtvsrd here? The byte splat part will have to be done > separately anyways: > > mtvsrd v0+32,r4 > vspltb v0,v0,7 Done. Maybe this one is more intuitive. Don't have to wonder if the previous word splat. > > + sub. r11,r5,r8 > > + isellt r11,0,r11 /* Saturate the subtraction to zero. */ > > + > > + stxvl v0+32,r3,r5 > > + stxvl v0+32,r10,r11 > > + > > + addi r9,r3,32 > > + addi r10,r3,48 > > + > > + sub. r11,r11,r8 > > + isellt r11,0,r11 > > + > > + sub. r5,r11,r8 > > + isellt r5,0,r5 > > Minor detail: I see this construct appears many times in the code. You > could create a macro for it like: > > #define SUBS(rt,ra,rb) \ > sub. rt,ra,rb; \ > isellt rt,0,rt; I had this before, but I got to the conclusion it was hindering seen all the code. Since it is only two instructions I though the macro benefits were marginal and added some noise to have macro and mnemonics mixed up in the code. There was a practical reason though, in one case I have a blelr in the tail, so the macro could not be use in this case anyway. But is there is a strong opinion here I can change it. > > + > > + stxvl v0+32,r9,r11 > > + stxvl v0+32,r10,r5 > > + > > + blr > > Ok. Takes advantage of the fact that if the length passed to stxvl is > zero, nothing is done. > > Maybe you could expand the comment at the beginning of this block to > make that trick clear. Done. > > + > > + .balign 16 > > +L(large): > > + mr r6,r3 /* Don't modify r3 since we need to return it. */ > > + > > + /* Get dest 16B aligned. */ > > + neg r0,r3 > > + clrldi. r7,r0,(64-4) > > + beq L(aligned) > > + rldic r9,r0,56,4 /* (~X & 0xf)<<56 "clrlsldi r9,r0,64-4,56". */ > > Why not just use clrlsldi as noted in the comment? It makes clearer what > the instruction is doing. Because of this: clrldi. r7,r0,(64-4) beq L(aligned) clrlsldi r9,r0,64-4,56 It is too big to stay in the same tab column ad the rest of the instructions and the options to fix this weren't great. It was just a formatting decision, hence the comment to try to help out a little. Also, in this particular case rldic reading of the values is not that terrible, it is like: Keep the 4 least significant bits, shift then 56 bits left. I can change if you or anyone else feel strong about it. > > + > > + /* After alignment, if there is 127B or less left > > + go directly to the tail. */ > > 127B -> 63B Done. > > + cmpldi r5,64 > > + blt L(tail_64) > > + > > + .balign 16 > > +L(aligned): > > + srdi. r0,r5,7 > > + beq L(tail_128) > > + > > + cmpldi cr5,r5,255 > > + cmpldi cr6,r4,0 > > + crand 27,26,21 > > + bt 27,L(dcbz) > > This last block deserves a comment, as it is really hard to follow. > > IIUC, this is what this code is checking: > > if r5 > 255 && r4 == 0 > goto L(dcbz) > > So if we have at least 256B left and we are setting zeroes, then use the > dcbz strategy. Ok. Done. I did the comment in one line, but mostly the same, let me know if it explain it well. I decided to use the arguments names instead of registers in hope to be more meaningful, but lets see it is not more confusing. > > + .balign 16 > > +L(dcbz): > > + /* Special case when value is 0 and we have a long length to deal > > + with. Use dcbz to zero out a full cacheline of 128 bytes at a time. > > + Before using dcbz though, we need to get the destination 128-byte > > + aligned. */ > > + neg r0,r6 > > + clrldi. r0,r0,(64-7) > > + beq L(dcbz_aligned) > > + > > + sub r5,r5,r0 > > + mtocrf 0x2,r0 /* These are the bits 57..59, the ones for sizes 64, > > + 32 and 16 which are those that need to be check. */ > > need to be check -> need to be checked > > Please add to the comment that these bits are being set to cr6. mtocrf > is not one of the most straightfoward instructions to read =/ Done. > > + /* Write 16~128 bytes until DST is aligned to 128 bytes. */ > > 16~128 -> 16-128 Done. > > + .balign 16 > > +L(bcdz_tail): > > + /* We have 1~511 bytes remaining. */ > > 1~511 -> 1-511 Done. > > +END_GEN_TB (MEMSET,TB_TOCLESS) > > +libc_hidden_builtin_def (memset) > > + > > +/* Copied from bzero.S to prevent the linker from inserting a stub > > + between bzero and memset. */ > > +ENTRY_TOCLESS (__bzero) > > + CALL_MCOUNT 3 > > Should this CALL_MCOUNT 2 since bzero receives just 2 args? Done. You are right here, good catch. o/ Raoni