Date: Wed, 3 Jun 2020 15:44:39 -0500
To: Paul E Murphy
Subject: Re: [PATCH] powerpc64le: add optimized strlen for P9
Message-ID: <20200603204439.GA13031@oc3272150783.ibm.com>
References: <20200521191048.1566568-1-murphyp@linux.vnet.ibm.com>
 <20200527164554.GA13085@oc3272150783.ibm.com>
From: "Paul A. Clarke via Libc-alpha"
Reply-To: "Paul A. Clarke"
Cc: anton@ozlabs.org, libc-alpha@sourceware.org

On Fri, May 29, 2020 at 11:26:14AM -0500, Paul E Murphy wrote:
>
> V3 is attached with changes to formatting and a couple of
> simplifications as noted below.

[snip]

This version LGTM, with a few nits below (and you were going to check
the binutils support for the POWER9 instruction).

> From 86decdb4a1bea39cc34bb3320fc9e3ea934042f5 Mon Sep 17 00:00:00 2001
> From: "Paul E. Murphy"
> Date: Mon, 18 May 2020 11:16:06 -0500
> Subject: [PATCH] powerpc64le: add optimized strlen for P9
>
> This started as a trivial change to Anton's rawmemchr.  I got
> carried away.  This is a hybrid between P8's asymptotically
> faster 64B checks and extremely efficient small-string checks,
> e.g. <64B (and sometimes a little bit more, depending on alignment).
>
> The second trick is to align to 64B by running a 48B checking loop
> 16B at a time until we naturally align to 64B (i.e. checking
> 48/96/144 bytes per iteration based on the alignment after the first
> 5 comparisons).  This alleviates the need to check page boundaries.
>
> Finally, explicitly use the P7 strlen with the runtime loader when
> building for P9.  We need to be cautious about vector/vsx extensions
> here on P9-only builds.
> ---
>  .../powerpc/powerpc64/le/power9/rtld-strlen.S |   1 +
>  sysdeps/powerpc/powerpc64/le/power9/strlen.S  | 213 ++++++++++++++++++
>  sysdeps/powerpc/powerpc64/multiarch/Makefile  |   2 +-
>  .../powerpc64/multiarch/ifunc-impl-list.c     |   4 +
>  .../powerpc64/multiarch/strlen-power9.S       |   2 +
>  sysdeps/powerpc/powerpc64/multiarch/strlen.c  |   5 +
>  6 files changed, 226 insertions(+), 1 deletion(-)
>  create mode 100644 sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S
>  create mode 100644 sysdeps/powerpc/powerpc64/le/power9/strlen.S
>  create mode 100644 sysdeps/powerpc/powerpc64/multiarch/strlen-power9.S
>
> diff --git a/sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S b/sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S
> new file mode 100644
> index 0000000000..e9d83323ac
> --- /dev/null
> +++ b/sysdeps/powerpc/powerpc64/le/power9/rtld-strlen.S
> @@ -0,0 +1 @@
> +#include
> diff --git a/sysdeps/powerpc/powerpc64/le/power9/strlen.S b/sysdeps/powerpc/powerpc64/le/power9/strlen.S
> new file mode 100644
> index 0000000000..0b358ff128
> --- /dev/null
> +++ b/sysdeps/powerpc/powerpc64/le/power9/strlen.S
> @@ -0,0 +1,213 @@
> +/* Optimized strlen implementation for PowerPC64/POWER9.
> +   Copyright (C) 2020 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +#include
> +
> +#ifndef STRLEN
> +# define STRLEN __strlen
> +# define DEFINE_STRLEN_HIDDEN_DEF 1
> +#endif
> +
> +/* Implements the function
> +
> +   int [r3] strlen (const void *s [r3])
> +
> +   The implementation can load bytes past a matching byte, but only
> +   up to the next 64B boundary, so it never crosses a page.  */
> +
> +.machine power9
> +ENTRY_TOCLESS (STRLEN, 4)
> +	CALL_MCOUNT 2
> +
> +	vspltisb  v18,0
> +	vspltisb  v19,-1
> +
> +	neg	  r5,r3
> +	rldicl	  r9,r5,0,60	/* How many bytes to get source 16B aligned?  */
> +

Extra blank line here.  (Sorry, didn't see this the first time.)

> +	/* Align data and fill bytes not loaded with non matching char.  */
> +	lvx	  v0,0,r3
> +	lvsr	  v1,0,r3
> +	vperm	  v0,v19,v0,v1
> +
> +	vcmpequb. v6,v0,v18
> +	beq	  cr6,L(aligned)
> +

Consider, for before the next two instructions:

  /* String ends within first cache line.  Compute and return length.  */

> +	vctzlsbb  r3,v6
> +	blr
> +
> +	/* Test 64B 16B at a time.  The 64B vector loop is optimized for
> +	   longer strings.  Likewise, we check a multiple of 64B to avoid
> +	   breaking the alignment calculation below.  */
> +L(aligned):
> +	add	  r4,r3,r9
> +	rldicl.	  r5,r4,60,62	/* Determine the number of 48B loops needed
> +				   for alignment to 64B.  And test for
> +				   zero.  */

Would it be bad to move the "rldicl." down...

> +
> +	lxv	  v0+32,0(r4)
> +	vcmpequb. v6,v0,v18
> +	bne	  cr6,L(tail1)
> +
> +	lxv	  v0+32,16(r4)
> +	vcmpequb. v6,v0,v18
> +	bne	  cr6,L(tail2)
> +
> +	lxv	  v0+32,32(r4)
> +	vcmpequb. v6,v0,v18
> +	bne	  cr6,L(tail3)
> +
> +	lxv	  v0+32,48(r4)
> +	vcmpequb. v6,v0,v18
> +	bne	  cr6,L(tail4)

...to here, to avoid needlessly penalizing the cases above?

> +	addi	  r4,r4,64
> +
> +	/* Prep for weird constant generation of reduction.  */
> +	li	  r0,0

Still need a better comment here.  Consider:

  /* Load a dummy aligned address (0) so that 'lvsl' produces a shift
     vector of 0..15.  */

And this "li" instruction can be moved WAY down...

> +
> +	/* Skip the alignment if already 64B aligned.
> +	 */
> +	beq	  L(loop_64b)
> +	mtctr	  r5
> +
> +	/* Test 48B per iteration until 64B aligned.  */
> +	.p2align 5
> +L(loop):
> +	lxv	  v0+32,0(r4)
> +	vcmpequb. v6,v0,v18
> +	bne	  cr6,L(tail1)
> +
> +	lxv	  v0+32,16(r4)
> +	vcmpequb. v6,v0,v18
> +	bne	  cr6,L(tail2)
> +
> +	lxv	  v0+32,32(r4)
> +	vcmpequb. v6,v0,v18
> +	bne	  cr6,L(tail3)
> +
> +	addi	  r4,r4,48
> +	bdnz	  L(loop)
> +
> +	.p2align 5
> +L(loop_64b):
> +	lxv	  v1+32,0(r4)	/* Load 4 quadwords.  */
> +	lxv	  v2+32,16(r4)
> +	lxv	  v3+32,32(r4)
> +	lxv	  v4+32,48(r4)
> +	vminub	  v5,v1,v2	/* Compare and merge into one VR for speed.  */
> +	vminub	  v6,v3,v4
> +	vminub	  v7,v5,v6
> +	vcmpequb. v7,v7,v18	/* Check for NULLs.  */
> +	addi	  r4,r4,64	/* Adjust address for the next iteration.  */
> +	bne	  cr6,L(vmx_zero)
> +
> +	lxv	  v1+32,0(r4)	/* Load 4 quadwords.  */
> +	lxv	  v2+32,16(r4)
> +	lxv	  v3+32,32(r4)
> +	lxv	  v4+32,48(r4)
> +	vminub	  v5,v1,v2	/* Compare and merge into one VR for speed.  */
> +	vminub	  v6,v3,v4
> +	vminub	  v7,v5,v6
> +	vcmpequb. v7,v7,v18	/* Check for NULLs.  */
> +	addi	  r4,r4,64	/* Adjust address for the next iteration.  */
> +	bne	  cr6,L(vmx_zero)
> +
> +	lxv	  v1+32,0(r4)	/* Load 4 quadwords.  */
> +	lxv	  v2+32,16(r4)
> +	lxv	  v3+32,32(r4)
> +	lxv	  v4+32,48(r4)
> +	vminub	  v5,v1,v2	/* Compare and merge into one VR for speed.  */
> +	vminub	  v6,v3,v4
> +	vminub	  v7,v5,v6
> +	vcmpequb. v7,v7,v18	/* Check for NULLs.  */
> +	addi	  r4,r4,64	/* Adjust address for the next iteration.  */
> +	beq	  cr6,L(loop_64b)
> +
> +L(vmx_zero):

...to here, perhaps, to avoid penalizing shorter strings.  (And be
closer to its use.)

> +	/* OK, we found a null byte.  Let's look for it in the current
> +	   64-byte block and mark it in its corresponding VR.  */
> +	vcmpequb  v1,v1,v18
> +	vcmpequb  v2,v2,v18
> +	vcmpequb  v3,v3,v18
> +	vcmpequb  v4,v4,v18
> +
> +	/* We will now 'compress' the result into a single doubleword, so it
> +	   can be moved to a GPR for the final calculation.
> +	   First, we generate an appropriate mask for vbpermq, so we can
> +	   permute bits into the first halfword.  */
> +	vspltisb  v10,3
> +	lvsl	  v11,0,r0
> +	vslb	  v10,v11,v10
> +
> +	/* Permute the first bit of each byte into bits 48-63.  */
> +	vbpermq	  v1,v1,v10
> +	vbpermq	  v2,v2,v10
> +	vbpermq	  v3,v3,v10
> +	vbpermq	  v4,v4,v10
> +
> +	/* Shift each component into its correct position for merging.  */
> +	vsldoi	  v2,v2,v2,2
> +	vsldoi	  v3,v3,v3,4
> +	vsldoi	  v4,v4,v4,6
> +
> +	/* Merge the results and move to a GPR.  */
> +	vor	  v1,v2,v1
> +	vor	  v2,v3,v4
> +	vor	  v4,v1,v2
> +	mfvrd	  r10,v4
> +
> +	/* Adjust address to the beginning of the current 64-byte block.  */
> +	addi	  r4,r4,-64
> +
> +	cnttzd	  r0,r10	/* Count trailing zeros before the match.  */
> +	subf	  r5,r3,r4
> +	add	  r3,r5,r0	/* Compute final length.  */
> +	blr
> +
> +L(tail1):
> +	vctzlsbb  r0,v6
> +	add	  r4,r4,r0
> +	subf	  r3,r3,r4
> +	blr
> +
> +L(tail2):
> +	vctzlsbb  r0,v6
> +	add	  r4,r4,r0
> +	addi	  r4,r4,16
> +	subf	  r3,r3,r4
> +	blr
> +
> +L(tail3):
> +	vctzlsbb  r0,v6
> +	add	  r4,r4,r0
> +	addi	  r4,r4,32
> +	subf	  r3,r3,r4
> +	blr
> +
> +L(tail4):
> +	vctzlsbb  r0,v6
> +	add	  r4,r4,r0
> +	addi	  r4,r4,48
> +	subf	  r3,r3,r4
> +	blr
> +
> +END (STRLEN)
> +
> +#ifdef DEFINE_STRLEN_HIDDEN_DEF
> +weak_alias (__strlen, strlen)
> +libc_hidden_builtin_def (strlen)
> +#endif

[snip]

PC