unofficial mirror of libc-alpha@sourceware.org
* [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
@ 2021-03-17  2:28 Naohiro Tamura
  2021-03-17  2:33 ` [PATCH 1/5] config: Added HAVE_SVE_ASM_SUPPORT for aarch64 Naohiro Tamura
                   ` (7 more replies)
  0 siblings, 8 replies; 72+ messages in thread
From: Naohiro Tamura @ 2021-03-17  2:28 UTC (permalink / raw)
  To: libc-alpha

Fujitsu is in the process of signing the copyright assignment paper.
We'd like to receive some feedback in advance.

This series of patches optimizes the performance of
memcpy/memmove/memset for A64FX [1], which implements ARMv8-A SVE and
has a 64KB L1 cache per core and an 8MB L2 cache per NUMA node.

The first patch updates autoconf to check whether the assembler is
capable of generating ARMv8-A SVE code, and defines the
HAVE_SVE_ASM_SUPPORT macro if it is.

The second patch is a memcpy/memmove performance optimization which
makes use of Scalable Vector Registers with several techniques such as
loop unrolling, memory access alignment, cache zero fill, prefetch,
and software pipelining.

The third patch is a memset performance optimization which makes
use of Scalable Vector Registers with several techniques such as
loop unrolling, memory access alignment, cache zero fill, and
prefetch.

The fourth patch is a test helper script that changes the Vector
Length for a child process. This script can be used as a test-wrapper
for 'make check'.
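
The actual helper added by this patch is scripts/vltest.py, which is
not reproduced in this cover letter. Purely as an illustration of the
mechanism it relies on, and assuming the Linux PR_SVE_SET_VL prctl
interface, a minimal C wrapper that pins the child's vector length
before exec'ing the wrapped command could look like this (a rough
sketch, not the actual script):

  /* Sketch only, not scripts/vltest.py: set the SVE vector length in
     bytes, then exec the wrapped command.  Assumes the Linux prctl
     constants from <linux/prctl.h>.  */
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <sys/prctl.h>

  #ifndef PR_SVE_SET_VL
  # define PR_SVE_SET_VL 50
  # define PR_SVE_VL_INHERIT (1 << 17)
  #endif

  int
  main (int argc, char **argv)
  {
    if (argc < 3)
      {
        fprintf (stderr, "usage: %s <vl-bytes> <command> [args...]\n",
                 argv[0]);
        return 2;
      }
    int vl = atoi (argv[1]);
    /* PR_SVE_VL_INHERIT keeps the chosen VL across execve, so the
       wrapped program starts with it.  */
    if (prctl (PR_SVE_SET_VL, vl | PR_SVE_VL_INHERIT, 0, 0, 0) < 0)
      {
        perror ("prctl (PR_SVE_SET_VL)");
        return 1;
      }
    execvp (argv[2], &argv[2]);
    perror ("execvp");
    return 1;
  }

For example, running the compiled sketch (hypothetically named
'vlwrap') as 'vlwrap 32 <test-program>' would run the test with a
256-bit vector length on hardware that supports it.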

The fifth patch adds generic_memcpy and generic_memmove to
bench-memcpy-large.c and bench-memmove-large.c respectively, so that
we can consistently compare the performance of the 512-bit scalable
vector registers against the scalar 64-bit registers across the
memcpy/memmove/memset default and large benchtests.


The SVE assembler code for memcpy/memmove/memset is implemented as
Vector Length Agnostic code, so in theory it can run on any SoC which
supports the ARMv8-A SVE standard.
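
For readers less familiar with VLA coding, here is a rough C
illustration (using ACLE SVE intrinsics; not taken from the patch) of
the whilelt-predicated ld1b/st1b idiom that the assembly's tail
handling, e.g. L(last), relies on. It is vector length agnostic
because svcntb() and the predicate adapt to whatever VL the hardware
provides:

  /* Illustration only; build with -march=armv8.2-a+sve.  The same
     binary works for VL = 16, 32 or 64 bytes because the loop step
     (svcntb) and the tail predicate are computed at run time.  */
  #include <arm_sve.h>
  #include <stddef.h>
  #include <stdint.h>

  void *
  vla_memcpy (void *dest, const void *src, size_t n)
  {
    const uint8_t *s = src;
    uint8_t *d = dest;
    for (size_t i = 0; i < n; i += svcntb ())
      {
        /* All-true for full vectors, masks off the final partial one.  */
        svbool_t pg = svwhilelt_b8 ((uint64_t) i, (uint64_t) n);
        svst1_u8 (pg, d + i, svld1_u8 (pg, s + i));
      }
    return dest;
  }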

We confirmed that all test cases pass when running 'make check' and
'make xcheck', not only on A64FX but also on ThunderX2.

We also confirmed, by running 'make bench', that the SVE 512-bit
vector register performance is roughly 4 times better than the
Advanced SIMD 128-bit registers and 8 times better than the scalar
64-bit registers.

[1] https://github.com/fujitsu/A64FX


Naohiro Tamura (5):
  config: Added HAVE_SVE_ASM_SUPPORT for aarch64
  aarch64: Added optimized memcpy and memmove for A64FX
  aarch64: Added optimized memset for A64FX
  scripts: Added Vector Length Set test helper script
  benchtests: Added generic_memcpy and generic_memmove to large
    benchtests

 benchtests/bench-memcpy-large.c               |   9 +
 benchtests/bench-memmove-large.c              |   9 +
 config.h.in                                   |   3 +
 manual/tunables.texi                          |   3 +-
 scripts/vltest.py                             |  82 ++
 sysdeps/aarch64/configure                     |  28 +
 sysdeps/aarch64/configure.ac                  |  15 +
 sysdeps/aarch64/multiarch/Makefile            |   3 +-
 sysdeps/aarch64/multiarch/ifunc-impl-list.c   |  17 +-
 sysdeps/aarch64/multiarch/init-arch.h         |   4 +-
 sysdeps/aarch64/multiarch/memcpy.c            |  12 +-
 sysdeps/aarch64/multiarch/memcpy_a64fx.S      | 979 ++++++++++++++++++
 sysdeps/aarch64/multiarch/memmove.c           |  12 +-
 sysdeps/aarch64/multiarch/memset.c            |  11 +-
 sysdeps/aarch64/multiarch/memset_a64fx.S      | 574 ++++++++++
 .../unix/sysv/linux/aarch64/cpu-features.c    |   4 +
 .../unix/sysv/linux/aarch64/cpu-features.h    |   4 +
 17 files changed, 1759 insertions(+), 10 deletions(-)
 create mode 100755 scripts/vltest.py
 create mode 100644 sysdeps/aarch64/multiarch/memcpy_a64fx.S
 create mode 100644 sysdeps/aarch64/multiarch/memset_a64fx.S

-- 
2.17.1



* [PATCH 1/5] config: Added HAVE_SVE_ASM_SUPPORT for aarch64
  2021-03-17  2:28 [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX Naohiro Tamura
@ 2021-03-17  2:33 ` Naohiro Tamura
  2021-03-29 12:11   ` Szabolcs Nagy via Libc-alpha
  2021-03-17  2:34 ` [PATCH 2/5] aarch64: Added optimized memcpy and memmove for A64FX Naohiro Tamura
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 72+ messages in thread
From: Naohiro Tamura @ 2021-03-17  2:33 UTC (permalink / raw)
  To: libc-alpha; +Cc: Naohiro Tamura

From: Naohiro Tamura <naohirot@jp.fujitsu.com>

This patch checks whether the assembler supports '-march=armv8.2-a+sve'
to generate SVE code, and defines the HAVE_SVE_ASM_SUPPORT macro if it
does.
---
 config.h.in                  |  3 +++
 sysdeps/aarch64/configure    | 28 ++++++++++++++++++++++++++++
 sysdeps/aarch64/configure.ac | 15 +++++++++++++++
 3 files changed, 46 insertions(+)

diff --git a/config.h.in b/config.h.in
index f21bf04e47..2073816af8 100644
--- a/config.h.in
+++ b/config.h.in
@@ -118,6 +118,9 @@
 /* AArch64 PAC-RET code generation is enabled.  */
 #define HAVE_AARCH64_PAC_RET 0
 
+/* Assembler support ARMv8.2-A SVE */
+#define HAVE_SVE_ASM_SUPPORT 0
+
 /* ARC big endian ABI */
 #undef HAVE_ARC_BE
 
diff --git a/sysdeps/aarch64/configure b/sysdeps/aarch64/configure
index 83c3a23e44..ac16250f8a 100644
--- a/sysdeps/aarch64/configure
+++ b/sysdeps/aarch64/configure
@@ -304,3 +304,31 @@ fi
 $as_echo "$libc_cv_aarch64_variant_pcs" >&6; }
 config_vars="$config_vars
 aarch64-variant-pcs = $libc_cv_aarch64_variant_pcs"
+
+# Check if asm support armv8.2-a+sve
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for SVE support in assembler" >&5
+$as_echo_n "checking for SVE support in assembler... " >&6; }
+if ${libc_cv_asm_sve+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat > conftest.s <<\EOF
+        ptrue p0.b
+EOF
+if { ac_try='${CC-cc} -c -march=armv8.2-a+sve conftest.s 1>&5'
+  { { eval echo "\"\$as_me\":${as_lineno-$LINENO}: \"$ac_try\""; } >&5
+  (eval $ac_try) 2>&5
+  ac_status=$?
+  $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+  test $ac_status = 0; }; }; then
+  libc_cv_asm_sve=yes
+else
+  libc_cv_asm_sve=no
+fi
+rm -f conftest*
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $libc_cv_asm_sve" >&5
+$as_echo "$libc_cv_asm_sve" >&6; }
+if test $libc_cv_asm_sve = yes; then
+  $as_echo "#define HAVE_SVE_ASM_SUPPORT 1" >>confdefs.h
+
+fi
diff --git a/sysdeps/aarch64/configure.ac b/sysdeps/aarch64/configure.ac
index 66f755078a..389a0b4e8d 100644
--- a/sysdeps/aarch64/configure.ac
+++ b/sysdeps/aarch64/configure.ac
@@ -90,3 +90,18 @@ EOF
   fi
   rm -rf conftest.*])
 LIBC_CONFIG_VAR([aarch64-variant-pcs], [$libc_cv_aarch64_variant_pcs])
+
+# Check if asm support armv8.2-a+sve
+AC_CACHE_CHECK(for SVE support in assembler, libc_cv_asm_sve, [dnl
+cat > conftest.s <<\EOF
+        ptrue p0.b
+EOF
+if AC_TRY_COMMAND(${CC-cc} -c -march=armv8.2-a+sve conftest.s 1>&AS_MESSAGE_LOG_FD); then
+  libc_cv_asm_sve=yes
+else
+  libc_cv_asm_sve=no
+fi
+rm -f conftest*])
+if test $libc_cv_asm_sve = yes; then
+  AC_DEFINE(HAVE_SVE_ASM_SUPPORT)
+fi
-- 
2.17.1



* [PATCH 2/5] aarch64: Added optimized memcpy and memmove for A64FX
  2021-03-17  2:28 [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX Naohiro Tamura
  2021-03-17  2:33 ` [PATCH 1/5] config: Added HAVE_SVE_ASM_SUPPORT for aarch64 Naohiro Tamura
@ 2021-03-17  2:34 ` Naohiro Tamura
  2021-03-29 12:44   ` Szabolcs Nagy via Libc-alpha
  2021-03-17  2:34 ` [PATCH 3/5] aarch64: Added optimized memset " Naohiro Tamura
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 72+ messages in thread
From: Naohiro Tamura @ 2021-03-17  2:34 UTC (permalink / raw)
  To: libc-alpha; +Cc: Naohiro Tamura

From: Naohiro Tamura <naohirot@jp.fujitsu.com>

This patch optimizes the performance of memcpy/memmove for A64FX [1],
which implements ARMv8-A SVE and has a 64KB L1 cache per core and an
8MB L2 cache per NUMA node.

The performance optimization makes use of Scalable Vector Registers
with several techniques such as loop unrolling, memory access
alignment, cache zero fill, prefetch, and software pipelining.

The SVE assembler code for memcpy/memmove is implemented as Vector
Length Agnostic code, so in theory it can run on any SoC which
supports the ARMv8-A SVE standard.

We confirmed that all test cases pass when running 'make check' and
'make xcheck', not only on A64FX but also on ThunderX2.

We also confirmed, by running 'make bench', that the SVE 512-bit
vector register performance is roughly 4 times better than the
Advanced SIMD 128-bit registers and 8 times better than the scalar
64-bit registers.

[1] https://github.com/fujitsu/A64FX
---
 manual/tunables.texi                          |   3 +-
 sysdeps/aarch64/multiarch/Makefile            |   2 +-
 sysdeps/aarch64/multiarch/ifunc-impl-list.c   |  12 +-
 sysdeps/aarch64/multiarch/init-arch.h         |   4 +-
 sysdeps/aarch64/multiarch/memcpy.c            |  12 +-
 sysdeps/aarch64/multiarch/memcpy_a64fx.S      | 979 ++++++++++++++++++
 sysdeps/aarch64/multiarch/memmove.c           |  12 +-
 .../unix/sysv/linux/aarch64/cpu-features.c    |   4 +
 .../unix/sysv/linux/aarch64/cpu-features.h    |   4 +
 9 files changed, 1024 insertions(+), 8 deletions(-)
 create mode 100644 sysdeps/aarch64/multiarch/memcpy_a64fx.S

diff --git a/manual/tunables.texi b/manual/tunables.texi
index 1b746c0fa1..81ed5366fc 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -453,7 +453,8 @@ This tunable is specific to powerpc, powerpc64 and powerpc64le.
 The @code{glibc.cpu.name=xxx} tunable allows the user to tell @theglibc{} to
 assume that the CPU is @code{xxx} where xxx may have one of these values:
 @code{generic}, @code{falkor}, @code{thunderxt88}, @code{thunderx2t99},
-@code{thunderx2t99p1}, @code{ares}, @code{emag}, @code{kunpeng}.
+@code{thunderx2t99p1}, @code{ares}, @code{emag}, @code{kunpeng},
+@code{a64fx}.
 
 This tunable is specific to aarch64.
 @end deftp
diff --git a/sysdeps/aarch64/multiarch/Makefile b/sysdeps/aarch64/multiarch/Makefile
index dc3efffb36..04c3f17121 100644
--- a/sysdeps/aarch64/multiarch/Makefile
+++ b/sysdeps/aarch64/multiarch/Makefile
@@ -1,6 +1,6 @@
 ifeq ($(subdir),string)
 sysdep_routines += memcpy_generic memcpy_advsimd memcpy_thunderx memcpy_thunderx2 \
-		   memcpy_falkor \
+		   memcpy_falkor memcpy_a64fx \
 		   memset_generic memset_falkor memset_emag memset_kunpeng \
 		   memchr_generic memchr_nosimd \
 		   strlen_mte strlen_asimd
diff --git a/sysdeps/aarch64/multiarch/ifunc-impl-list.c b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
index 99a8c68aac..cb78da9692 100644
--- a/sysdeps/aarch64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
@@ -25,7 +25,11 @@
 #include <stdio.h>
 
 /* Maximum number of IFUNC implementations.  */
-#define MAX_IFUNC	4
+#if HAVE_SVE_ASM_SUPPORT
+# define MAX_IFUNC	7
+#else
+# define MAX_IFUNC	6
+#endif
 
 size_t
 __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
@@ -43,12 +47,18 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 	      IFUNC_IMPL_ADD (array, i, memcpy, !bti, __memcpy_thunderx2)
 	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_falkor)
 	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_simd)
+#if HAVE_SVE_ASM_SUPPORT
+	      IFUNC_IMPL_ADD (array, i, memcpy, sve, __memcpy_a64fx)
+#endif
 	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_generic))
   IFUNC_IMPL (i, name, memmove,
 	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_thunderx)
 	      IFUNC_IMPL_ADD (array, i, memmove, !bti, __memmove_thunderx2)
 	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_falkor)
 	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_simd)
+#if HAVE_SVE_ASM_SUPPORT
+	      IFUNC_IMPL_ADD (array, i, memmove, sve, __memmove_a64fx)
+#endif
 	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_generic))
   IFUNC_IMPL (i, name, memset,
 	      /* Enable this on non-falkor processors too so that other cores
diff --git a/sysdeps/aarch64/multiarch/init-arch.h b/sysdeps/aarch64/multiarch/init-arch.h
index a167699e74..d20e7e1b8e 100644
--- a/sysdeps/aarch64/multiarch/init-arch.h
+++ b/sysdeps/aarch64/multiarch/init-arch.h
@@ -33,4 +33,6 @@
   bool __attribute__((unused)) bti =					      \
     HAVE_AARCH64_BTI && GLRO(dl_aarch64_cpu_features).bti;		      \
   bool __attribute__((unused)) mte =					      \
-    MTE_ENABLED ();
+    MTE_ENABLED ();							      \
+  unsigned __attribute__((unused)) sve =				      \
+    GLRO(dl_aarch64_cpu_features).sve;
diff --git a/sysdeps/aarch64/multiarch/memcpy.c b/sysdeps/aarch64/multiarch/memcpy.c
index 0e0a5cbcfb..0006f38eb0 100644
--- a/sysdeps/aarch64/multiarch/memcpy.c
+++ b/sysdeps/aarch64/multiarch/memcpy.c
@@ -33,6 +33,9 @@ extern __typeof (__redirect_memcpy) __memcpy_simd attribute_hidden;
 extern __typeof (__redirect_memcpy) __memcpy_thunderx attribute_hidden;
 extern __typeof (__redirect_memcpy) __memcpy_thunderx2 attribute_hidden;
 extern __typeof (__redirect_memcpy) __memcpy_falkor attribute_hidden;
+#if HAVE_SVE_ASM_SUPPORT
+extern __typeof (__redirect_memcpy) __memcpy_a64fx attribute_hidden;
+#endif
 
 libc_ifunc (__libc_memcpy,
             (IS_THUNDERX (midr)
@@ -44,8 +47,13 @@ libc_ifunc (__libc_memcpy,
 		  : (IS_NEOVERSE_N1 (midr) || IS_NEOVERSE_N2 (midr)
 		     || IS_NEOVERSE_V1 (midr)
 		     ? __memcpy_simd
-		     : __memcpy_generic)))));
-
+#if HAVE_SVE_ASM_SUPPORT
+                     : (IS_A64FX (midr)
+                        ? __memcpy_a64fx
+                        : __memcpy_generic))))));
+#else
+                     : __memcpy_generic)))));
+#endif
 # undef memcpy
 strong_alias (__libc_memcpy, memcpy);
 #endif
diff --git a/sysdeps/aarch64/multiarch/memcpy_a64fx.S b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
new file mode 100644
index 0000000000..23438e4e3d
--- /dev/null
+++ b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
@@ -0,0 +1,979 @@
+/* Optimized memcpy for Fujitsu A64FX processor.
+   Copyright (C) 2012-2021 Free Software Foundation, Inc.
+
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+
+#if HAVE_SVE_ASM_SUPPORT
+#if IS_IN (libc)
+# define MEMCPY __memcpy_a64fx
+# define MEMMOVE __memmove_a64fx
+
+/* Assumptions:
+ *
+ * ARMv8.2-a, AArch64, unaligned accesses, sve
+ *
+ */
+
+#define L1_SIZE (64*1024)/2     // L1 64KB
+#define L2_SIZE (7*1024*1024)/2 // L2 8MB - 1MB
+#define CACHE_LINE_SIZE 256
+#define PF_DIST_L1 (CACHE_LINE_SIZE * 16)
+#define PF_DIST_L2 (CACHE_LINE_SIZE * 64)
+#define dest            x0
+#define src             x1
+#define n               x2      // size
+#define tmp1            x3
+#define tmp2            x4
+#define rest            x5
+#define dest_ptr        x6
+#define src_ptr         x7
+#define vector_length   x8
+#define vl_remainder    x9      // vector_length remainder
+#define cl_remainder    x10     // CACHE_LINE_SIZE remainder
+
+    .arch armv8.2-a+sve
+
+ENTRY_ALIGN (MEMCPY, 6)
+
+    PTR_ARG (0)
+    SIZE_ARG (2)
+
+L(fwd_start):
+    cmp         n, 0
+    ccmp        dest, src, 4, ne
+    b.ne        L(init)
+    ret
+
+L(init):
+    mov         rest, n
+    mov         dest_ptr, dest
+    mov         src_ptr, src
+    cntb        vector_length
+    ptrue       p0.b
+
+L(L2):
+    // get block_size
+    mrs         tmp1, dczid_el0
+    cmp         tmp1, 6         // CACHE_LINE_SIZE 256
+    b.ne        L(vl_agnostic)
+
+    // if rest >= L2_SIZE
+    cmp         rest, L2_SIZE
+    b.cc        L(L1_prefetch)
+    // align dest address at vector_length byte boundary
+    sub         tmp1, vector_length, 1
+    and         tmp2, dest_ptr, tmp1
+    // if vl_remainder == 0
+    cmp         tmp2, 0
+    b.eq        1f
+    sub         vl_remainder, vector_length, tmp2
+    // process remainder until the first vector_length boundary
+    whilelt     p0.b, xzr, vl_remainder
+    ld1b        z0.b, p0/z, [src_ptr]
+    st1b        z0.b, p0, [dest_ptr]
+    add         dest_ptr, dest_ptr, vl_remainder
+    add         src_ptr, src_ptr, vl_remainder
+    sub         rest, rest, vl_remainder
+    // align dest address at CACHE_LINE_SIZE byte boundary
+1:  mov         tmp1, CACHE_LINE_SIZE
+    and         tmp2, dest_ptr, CACHE_LINE_SIZE - 1
+    // if cl_remainder == 0
+    cmp         tmp2, 0
+    b.eq        L(L2_dc_zva)
+    sub         cl_remainder, tmp1, tmp2
+    // process remainder until the first CACHE_LINE_SIZE boundary
+    mov         tmp1, xzr       // index
+2:  whilelt     p0.b, tmp1, cl_remainder
+    ld1b        z0.b, p0/z, [src_ptr, tmp1]
+    st1b        z0.b, p0, [dest_ptr, tmp1]
+    incb        tmp1
+    cmp         tmp1, cl_remainder
+    b.lo        2b
+    add         dest_ptr, dest_ptr, cl_remainder
+    add         src_ptr, src_ptr, cl_remainder
+    sub         rest, rest, cl_remainder
+
+L(L2_dc_zva): // unroll zero fill
+    and         tmp1, dest, 0xffffffffffffff
+    and         tmp2, src, 0xffffffffffffff
+    sub         tmp1, tmp2, tmp1        // diff
+    mov         tmp2, CACHE_LINE_SIZE * 20
+    cmp         tmp1, tmp2
+    b.lo        L(L1_prefetch)
+    mov         tmp1, dest_ptr
+    dc          zva, tmp1               // 1
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    dc          zva, tmp1               // 2
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    dc          zva, tmp1               // 3
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    dc          zva, tmp1               // 4
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    dc          zva, tmp1               // 5
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    dc          zva, tmp1               // 6
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    dc          zva, tmp1               // 7
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    dc          zva, tmp1               // 8
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    dc          zva, tmp1               // 9
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    dc          zva, tmp1               // 10
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    dc          zva, tmp1               // 11
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    dc          zva, tmp1               // 12
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    dc          zva, tmp1               // 13
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    dc          zva, tmp1               // 14
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    dc          zva, tmp1               // 15
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    dc          zva, tmp1               // 16
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    dc          zva, tmp1               // 17
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    dc          zva, tmp1               // 18
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    dc          zva, tmp1               // 19
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    dc          zva, tmp1               // 20
+
+L(L2_vl_64): // VL64 unroll8
+    cmp         vector_length, 64
+    b.ne        L(L2_vl_32)
+    ptrue       p0.b
+    .p2align 3
+    ld1b        z0.b, p0/z, [src_ptr,  #0, mul vl]
+    ld1b        z1.b, p0/z, [src_ptr,  #1, mul vl]
+    ld1b        z2.b, p0/z, [src_ptr,  #2, mul vl]
+    ld1b        z3.b, p0/z, [src_ptr,  #3, mul vl]
+    ld1b        z4.b, p0/z, [src_ptr,  #4, mul vl]
+    ld1b        z5.b, p0/z, [src_ptr,  #5, mul vl]
+    ld1b        z6.b, p0/z, [src_ptr,  #6, mul vl]
+    ld1b        z7.b, p0/z, [src_ptr,  #7, mul vl]
+    add         src_ptr, src_ptr, CACHE_LINE_SIZE * 2
+    sub         rest, rest, CACHE_LINE_SIZE * 2
+1:  st1b        z0.b, p0,   [dest_ptr, #0, mul vl]
+    st1b        z1.b, p0,   [dest_ptr, #1, mul vl]
+    ld1b        z0.b, p0/z, [src_ptr,  #0, mul vl]
+    ld1b        z1.b, p0/z, [src_ptr,  #1, mul vl]
+    st1b        z2.b, p0,   [dest_ptr, #2, mul vl]
+    st1b        z3.b, p0,   [dest_ptr, #3, mul vl]
+    ld1b        z2.b, p0/z, [src_ptr,  #2, mul vl]
+    ld1b        z3.b, p0/z, [src_ptr,  #3, mul vl]
+    mov         tmp1, PF_DIST_L1
+    prfm        pstl1keep, [dest_ptr, tmp1]
+    mov         tmp1, PF_DIST_L2
+    prfm        pstl2keep, [dest_ptr, tmp1]
+    mov         tmp2, CACHE_LINE_SIZE * 19
+    add         tmp2, dest_ptr, tmp2
+    dc          zva, tmp2       // distance CACHE_LINE_SIZE * 19
+    st1b        z4.b, p0,   [dest_ptr, #4, mul vl]
+    st1b        z5.b, p0,   [dest_ptr, #5, mul vl]
+    ld1b        z4.b, p0/z, [src_ptr,  #4, mul vl]
+    ld1b        z5.b, p0/z, [src_ptr,  #5, mul vl]
+    st1b        z6.b, p0,   [dest_ptr, #6, mul vl]
+    st1b        z7.b, p0,   [dest_ptr, #7, mul vl]
+    ld1b        z6.b, p0/z, [src_ptr,  #6, mul vl]
+    ld1b        z7.b, p0/z, [src_ptr,  #7, mul vl]
+    mov         tmp1, PF_DIST_L1 + CACHE_LINE_SIZE
+    prfm        pstl1keep, [dest_ptr, tmp1]
+    mov         tmp1, PF_DIST_L2 + CACHE_LINE_SIZE
+    prfm        pstl2keep, [dest_ptr, tmp1]
+    add         tmp2, tmp2, CACHE_LINE_SIZE
+    dc          zva, tmp2       // distance CACHE_LINE_SIZE * 20
+    add         dest_ptr, dest_ptr, CACHE_LINE_SIZE * 2
+    add         src_ptr, src_ptr, CACHE_LINE_SIZE * 2
+    sub         rest, rest, CACHE_LINE_SIZE * 2
+    cmp         rest, L2_SIZE
+    b.ge        1b
+    st1b        z0.b, p0,   [dest_ptr, #0, mul vl]
+    st1b        z1.b, p0,   [dest_ptr, #1, mul vl]
+    st1b        z2.b, p0,   [dest_ptr, #2, mul vl]
+    st1b        z3.b, p0,   [dest_ptr, #3, mul vl]
+    st1b        z4.b, p0,   [dest_ptr, #4, mul vl]
+    st1b        z5.b, p0,   [dest_ptr, #5, mul vl]
+    st1b        z6.b, p0,   [dest_ptr, #6, mul vl]
+    st1b        z7.b, p0,   [dest_ptr, #7, mul vl]
+    add         dest_ptr, dest_ptr, CACHE_LINE_SIZE * 2
+
+L(L2_vl_32): // VL32 unroll6
+    cmp         vector_length, 32
+    b.ne        L(L2_vl_16)
+    ptrue       p0.b
+    .p2align 3
+    ld1b        z0.b, p0/z, [src_ptr,  #0, mul vl]
+    ld1b        z1.b, p0/z, [src_ptr,  #1, mul vl]
+    ld1b        z2.b, p0/z, [src_ptr,  #2, mul vl]
+    ld1b        z3.b, p0/z, [src_ptr,  #3, mul vl]
+    ld1b        z4.b, p0/z, [src_ptr,  #4, mul vl]
+    ld1b        z5.b, p0/z, [src_ptr,  #5, mul vl]
+    ld1b        z6.b, p0/z, [src_ptr,  #6, mul vl]
+    ld1b        z7.b, p0/z, [src_ptr,  #7, mul vl]
+    add         src_ptr, src_ptr, CACHE_LINE_SIZE
+    sub         rest, rest, CACHE_LINE_SIZE
+1:  st1b        z0.b, p0,   [dest_ptr, #0, mul vl]
+    st1b        z1.b, p0,   [dest_ptr, #1, mul vl]
+    ld1b        z0.b, p0/z, [src_ptr,  #0, mul vl]
+    ld1b        z1.b, p0/z, [src_ptr,  #1, mul vl]
+    st1b        z2.b, p0,   [dest_ptr, #2, mul vl]
+    st1b        z3.b, p0,   [dest_ptr, #3, mul vl]
+    ld1b        z2.b, p0/z, [src_ptr,  #2, mul vl]
+    ld1b        z3.b, p0/z, [src_ptr,  #3, mul vl]
+    st1b        z4.b, p0,   [dest_ptr, #4, mul vl]
+    st1b        z5.b, p0,   [dest_ptr, #5, mul vl]
+    ld1b        z4.b, p0/z, [src_ptr,  #4, mul vl]
+    ld1b        z5.b, p0/z, [src_ptr,  #5, mul vl]
+    st1b        z6.b, p0,   [dest_ptr, #6, mul vl]
+    st1b        z7.b, p0,   [dest_ptr, #7, mul vl]
+    ld1b        z6.b, p0/z, [src_ptr,  #6, mul vl]
+    ld1b        z7.b, p0/z, [src_ptr,  #7, mul vl]
+    mov         tmp1, PF_DIST_L1
+    prfm        pstl1keep, [dest_ptr, tmp1]
+    mov         tmp1, PF_DIST_L2
+    prfm        pstl2keep, [dest_ptr, tmp1]
+    mov         tmp2, CACHE_LINE_SIZE * 19
+    add         tmp2, dest_ptr, tmp2
+    dc          zva, tmp2       // distance CACHE_LINE_SIZE * 19
+    add         dest_ptr, dest_ptr, CACHE_LINE_SIZE
+    add         src_ptr, src_ptr, CACHE_LINE_SIZE
+    st1b        z0.b, p0,   [dest_ptr, #0, mul vl]
+    st1b        z1.b, p0,   [dest_ptr, #1, mul vl]
+    ld1b        z0.b, p0/z, [src_ptr,  #0, mul vl]
+    ld1b        z1.b, p0/z, [src_ptr,  #1, mul vl]
+    st1b        z2.b, p0,   [dest_ptr, #2, mul vl]
+    st1b        z3.b, p0,   [dest_ptr, #3, mul vl]
+    ld1b        z2.b, p0/z, [src_ptr,  #2, mul vl]
+    ld1b        z3.b, p0/z, [src_ptr,  #3, mul vl]
+    st1b        z4.b, p0,   [dest_ptr, #4, mul vl]
+    st1b        z5.b, p0,   [dest_ptr, #5, mul vl]
+    ld1b        z4.b, p0/z, [src_ptr,  #4, mul vl]
+    ld1b        z5.b, p0/z, [src_ptr,  #5, mul vl]
+    st1b        z6.b, p0,   [dest_ptr, #6, mul vl]
+    st1b        z7.b, p0,   [dest_ptr, #7, mul vl]
+    ld1b        z6.b, p0/z, [src_ptr,  #6, mul vl]
+    ld1b        z7.b, p0/z, [src_ptr,  #7, mul vl]
+    mov         tmp1, PF_DIST_L1 + CACHE_LINE_SIZE
+    prfm        pstl1keep, [dest_ptr, tmp1]
+    mov         tmp1, PF_DIST_L2 + CACHE_LINE_SIZE
+    prfm        pstl2keep, [dest_ptr, tmp1]
+    add         tmp2, tmp2, CACHE_LINE_SIZE
+    dc          zva, tmp2       // distance CACHE_LINE_SIZE * 20
+    add         dest_ptr, dest_ptr, CACHE_LINE_SIZE
+    add         src_ptr, src_ptr, CACHE_LINE_SIZE
+    sub         rest, rest, CACHE_LINE_SIZE * 2
+    cmp         rest, L2_SIZE
+    b.ge        1b
+    st1b        z0.b, p0,   [dest_ptr, #0, mul vl]
+    st1b        z1.b, p0,   [dest_ptr, #1, mul vl]
+    st1b        z2.b, p0,   [dest_ptr, #2, mul vl]
+    st1b        z3.b, p0,   [dest_ptr, #3, mul vl]
+    st1b        z4.b, p0,   [dest_ptr, #4, mul vl]
+    st1b        z5.b, p0,   [dest_ptr, #5, mul vl]
+    st1b        z6.b, p0,   [dest_ptr, #6, mul vl]
+    st1b        z7.b, p0,   [dest_ptr, #7, mul vl]
+    add         dest_ptr, dest_ptr, CACHE_LINE_SIZE
+
+L(L2_vl_16): // VL16 unroll32
+    cmp         vector_length, 16
+    b.ne        L(L1_prefetch)
+    ptrue       p0.b
+    .p2align 3
+    add         src_ptr, src_ptr, CACHE_LINE_SIZE / 2
+    ld1b        z16.b,  p0/z, [src_ptr, #-8, mul vl]
+    ld1b        z17.b,  p0/z, [src_ptr, #-7, mul vl]
+    ld1b        z18.b, p0/z, [src_ptr,  #-6, mul vl]
+    ld1b        z19.b, p0/z, [src_ptr,  #-5, mul vl]
+    ld1b        z20.b, p0/z, [src_ptr,  #-4, mul vl]
+    ld1b        z21.b, p0/z, [src_ptr,  #-3, mul vl]
+    ld1b        z22.b, p0/z, [src_ptr,  #-2, mul vl]
+    ld1b        z23.b, p0/z, [src_ptr,  #-1, mul vl]
+    ld1b        z0.b,  p0/z, [src_ptr,  #0, mul vl]
+    ld1b        z1.b,  p0/z, [src_ptr,  #1, mul vl]
+    ld1b        z2.b,  p0/z, [src_ptr,  #2, mul vl]
+    ld1b        z3.b,  p0/z, [src_ptr,  #3, mul vl]
+    ld1b        z4.b,  p0/z, [src_ptr,  #4, mul vl]
+    ld1b        z5.b,  p0/z, [src_ptr,  #5, mul vl]
+    ld1b        z6.b,  p0/z, [src_ptr,  #6, mul vl]
+    ld1b        z7.b,  p0/z, [src_ptr,  #7, mul vl]
+    add         src_ptr, src_ptr, CACHE_LINE_SIZE / 2
+    sub         rest, rest, CACHE_LINE_SIZE
+1:  add         dest_ptr, dest_ptr, CACHE_LINE_SIZE / 2
+    add         src_ptr, src_ptr, CACHE_LINE_SIZE / 2
+    st1b        z16.b, p0,   [dest_ptr, #-8, mul vl]
+    st1b        z17.b, p0,   [dest_ptr, #-7, mul vl]
+    ld1b        z16.b, p0/z, [src_ptr,  #-8, mul vl]
+    ld1b        z17.b, p0/z, [src_ptr,  #-7, mul vl]
+    st1b        z18.b, p0,   [dest_ptr, #-6, mul vl]
+    st1b        z19.b, p0,   [dest_ptr, #-5, mul vl]
+    ld1b        z18.b, p0/z, [src_ptr,  #-6, mul vl]
+    ld1b        z19.b, p0/z, [src_ptr,  #-5, mul vl]
+    st1b        z20.b, p0,   [dest_ptr, #-4, mul vl]
+    st1b        z21.b, p0,   [dest_ptr, #-3, mul vl]
+    ld1b        z20.b, p0/z, [src_ptr,  #-4, mul vl]
+    ld1b        z21.b, p0/z, [src_ptr,  #-3, mul vl]
+    st1b        z22.b, p0,   [dest_ptr, #-2, mul vl]
+    st1b        z23.b, p0,   [dest_ptr, #-1, mul vl]
+    ld1b        z22.b, p0/z, [src_ptr,  #-2, mul vl]
+    ld1b        z23.b, p0/z, [src_ptr,  #-1, mul vl]
+    st1b        z0.b, p0,   [dest_ptr, #0, mul vl]
+    st1b        z1.b, p0,   [dest_ptr, #1, mul vl]
+    ld1b        z0.b, p0/z, [src_ptr,  #0, mul vl]
+    ld1b        z1.b, p0/z, [src_ptr,  #1, mul vl]
+    st1b        z2.b, p0,   [dest_ptr, #2, mul vl]
+    st1b        z3.b, p0,   [dest_ptr, #3, mul vl]
+    ld1b        z2.b, p0/z, [src_ptr,  #2, mul vl]
+    ld1b        z3.b, p0/z, [src_ptr,  #3, mul vl]
+    st1b        z4.b, p0,   [dest_ptr, #4, mul vl]
+    st1b        z5.b, p0,   [dest_ptr, #5, mul vl]
+    ld1b        z4.b, p0/z, [src_ptr,  #4, mul vl]
+    ld1b        z5.b, p0/z, [src_ptr,  #5, mul vl]
+    st1b        z6.b, p0,   [dest_ptr, #6, mul vl]
+    st1b        z7.b, p0,   [dest_ptr, #7, mul vl]
+    ld1b        z6.b, p0/z, [src_ptr,  #6, mul vl]
+    ld1b        z7.b, p0/z, [src_ptr,  #7, mul vl]
+    mov         tmp1, PF_DIST_L1
+    prfm        pstl1keep, [dest_ptr, tmp1]
+    mov         tmp1, PF_DIST_L2
+    prfm        pstl2keep, [dest_ptr, tmp1]
+    mov         tmp2, CACHE_LINE_SIZE * 19
+    add         tmp2, dest_ptr, tmp2
+    dc          zva, tmp2       // distance CACHE_LINE_SIZE * 19
+    add         dest_ptr, dest_ptr, CACHE_LINE_SIZE
+    add         src_ptr, src_ptr, CACHE_LINE_SIZE
+    st1b        z16.b, p0,   [dest_ptr, #-8, mul vl]
+    st1b        z17.b, p0,   [dest_ptr, #-7, mul vl]
+    ld1b        z16.b, p0/z, [src_ptr,  #-8, mul vl]
+    ld1b        z17.b, p0/z, [src_ptr,  #-7, mul vl]
+    st1b        z18.b, p0,   [dest_ptr, #-6, mul vl]
+    st1b        z19.b, p0,   [dest_ptr, #-5, mul vl]
+    ld1b        z18.b, p0/z, [src_ptr,  #-6, mul vl]
+    ld1b        z19.b, p0/z, [src_ptr,  #-5, mul vl]
+    st1b        z20.b, p0,   [dest_ptr, #-4, mul vl]
+    st1b        z21.b, p0,   [dest_ptr, #-3, mul vl]
+    ld1b        z20.b, p0/z, [src_ptr,  #-4, mul vl]
+    ld1b        z21.b, p0/z, [src_ptr,  #-3, mul vl]
+    st1b        z22.b, p0,   [dest_ptr, #-2, mul vl]
+    st1b        z23.b, p0,   [dest_ptr, #-1, mul vl]
+    ld1b        z22.b, p0/z, [src_ptr,  #-2, mul vl]
+    ld1b        z23.b, p0/z, [src_ptr,  #-1, mul vl]
+    st1b        z0.b, p0,   [dest_ptr, #0, mul vl]
+    st1b        z1.b, p0,   [dest_ptr, #1, mul vl]
+    ld1b        z0.b, p0/z, [src_ptr,  #0, mul vl]
+    ld1b        z1.b, p0/z, [src_ptr,  #1, mul vl]
+    st1b        z2.b, p0,   [dest_ptr, #2, mul vl]
+    st1b        z3.b, p0,   [dest_ptr, #3, mul vl]
+    ld1b        z2.b, p0/z, [src_ptr,  #2, mul vl]
+    ld1b        z3.b, p0/z, [src_ptr,  #3, mul vl]
+    st1b        z4.b, p0,   [dest_ptr, #4, mul vl]
+    st1b        z5.b, p0,   [dest_ptr, #5, mul vl]
+    ld1b        z4.b, p0/z, [src_ptr,  #4, mul vl]
+    ld1b        z5.b, p0/z, [src_ptr,  #5, mul vl]
+    st1b        z6.b, p0,   [dest_ptr, #6, mul vl]
+    st1b        z7.b, p0,   [dest_ptr, #7, mul vl]
+    ld1b        z6.b, p0/z, [src_ptr,  #6, mul vl]
+    ld1b        z7.b, p0/z, [src_ptr,  #7, mul vl]
+    mov         tmp1, PF_DIST_L1 + CACHE_LINE_SIZE
+    prfm        pstl1keep, [dest_ptr, tmp1]
+    mov         tmp1, PF_DIST_L2 + CACHE_LINE_SIZE
+    prfm        pstl2keep, [dest_ptr, tmp1]
+    add         tmp2, tmp2, CACHE_LINE_SIZE
+    dc          zva, tmp2       // distance CACHE_LINE_SIZE * 20
+    add         dest_ptr, dest_ptr, CACHE_LINE_SIZE / 2
+    add         src_ptr, src_ptr, CACHE_LINE_SIZE / 2
+    sub         rest, rest, CACHE_LINE_SIZE * 2
+    cmp         rest, L2_SIZE
+    b.ge        1b
+    add         dest_ptr, dest_ptr, CACHE_LINE_SIZE / 2
+    st1b        z16.b, p0, [dest_ptr, #-8, mul vl]
+    st1b        z17.b, p0, [dest_ptr, #-7, mul vl]
+    st1b        z18.b, p0, [dest_ptr, #-6, mul vl]
+    st1b        z19.b, p0, [dest_ptr, #-5, mul vl]
+    st1b        z20.b, p0, [dest_ptr, #-4, mul vl]
+    st1b        z21.b, p0, [dest_ptr, #-3, mul vl]
+    st1b        z22.b, p0, [dest_ptr, #-2, mul vl]
+    st1b        z23.b, p0, [dest_ptr, #-1, mul vl]
+    st1b        z0.b, p0,  [dest_ptr, #0, mul vl]
+    st1b        z1.b, p0,  [dest_ptr, #1, mul vl]
+    st1b        z2.b, p0,  [dest_ptr, #2, mul vl]
+    st1b        z3.b, p0,  [dest_ptr, #3, mul vl]
+    st1b        z4.b, p0,  [dest_ptr, #4, mul vl]
+    st1b        z5.b, p0,  [dest_ptr, #5, mul vl]
+    st1b        z6.b, p0,  [dest_ptr, #6, mul vl]
+    st1b        z7.b, p0,  [dest_ptr, #7, mul vl]
+    add         dest_ptr, dest_ptr, CACHE_LINE_SIZE / 2
+
+L(L1_prefetch): // if rest >= L1_SIZE
+    cmp         rest, L1_SIZE
+    b.cc        L(vl_agnostic)
+L(L1_vl_64):
+    cmp         vector_length, 64
+    b.ne        L(L1_vl_32)
+    ptrue       p0.b
+    .p2align 3
+    ld1b        z0.b, p0/z, [src_ptr,  #0, mul vl]
+    ld1b        z1.b, p0/z, [src_ptr,  #1, mul vl]
+    ld1b        z2.b, p0/z, [src_ptr,  #2, mul vl]
+    ld1b        z3.b, p0/z, [src_ptr,  #3, mul vl]
+    ld1b        z4.b, p0/z, [src_ptr,  #4, mul vl]
+    ld1b        z5.b, p0/z, [src_ptr,  #5, mul vl]
+    ld1b        z6.b, p0/z, [src_ptr,  #6, mul vl]
+    ld1b        z7.b, p0/z, [src_ptr,  #7, mul vl]
+    add         src_ptr, src_ptr, CACHE_LINE_SIZE * 2
+    sub         rest, rest, CACHE_LINE_SIZE * 2
+1:  st1b        z0.b, p0,   [dest_ptr, #0, mul vl]
+    st1b        z1.b, p0,   [dest_ptr, #1, mul vl]
+    ld1b        z0.b, p0/z, [src_ptr,  #0, mul vl]
+    ld1b        z1.b, p0/z, [src_ptr,  #1, mul vl]
+    st1b        z2.b, p0,   [dest_ptr, #2, mul vl]
+    st1b        z3.b, p0,   [dest_ptr, #3, mul vl]
+    ld1b        z2.b, p0/z, [src_ptr,  #2, mul vl]
+    ld1b        z3.b, p0/z, [src_ptr,  #3, mul vl]
+    mov         tmp1, PF_DIST_L1
+    prfm        pstl1keep, [dest_ptr, tmp1]
+    mov         tmp1, PF_DIST_L2
+    prfm        pstl2keep, [dest_ptr, tmp1]
+    st1b        z4.b, p0,   [dest_ptr, #4, mul vl]
+    st1b        z5.b, p0,   [dest_ptr, #5, mul vl]
+    ld1b        z4.b, p0/z, [src_ptr,  #4, mul vl]
+    ld1b        z5.b, p0/z, [src_ptr,  #5, mul vl]
+    st1b        z6.b, p0,   [dest_ptr, #6, mul vl]
+    st1b        z7.b, p0,   [dest_ptr, #7, mul vl]
+    ld1b        z6.b, p0/z, [src_ptr,  #6, mul vl]
+    ld1b        z7.b, p0/z, [src_ptr,  #7, mul vl]
+    mov         tmp1, PF_DIST_L1 + CACHE_LINE_SIZE
+    prfm        pstl1keep, [dest_ptr, tmp1]
+    mov         tmp1, PF_DIST_L2 + CACHE_LINE_SIZE
+    prfm        pstl2keep, [dest_ptr, tmp1]
+    add         dest_ptr, dest_ptr, CACHE_LINE_SIZE * 2
+    add         src_ptr, src_ptr, CACHE_LINE_SIZE * 2
+    sub         rest, rest, CACHE_LINE_SIZE * 2
+    cmp         rest, L1_SIZE
+    b.ge        1b
+    st1b        z0.b, p0,   [dest_ptr, #0, mul vl]
+    st1b        z1.b, p0,   [dest_ptr, #1, mul vl]
+    st1b        z2.b, p0,   [dest_ptr, #2, mul vl]
+    st1b        z3.b, p0,   [dest_ptr, #3, mul vl]
+    st1b        z4.b, p0,   [dest_ptr, #4, mul vl]
+    st1b        z5.b, p0,   [dest_ptr, #5, mul vl]
+    st1b        z6.b, p0,   [dest_ptr, #6, mul vl]
+    st1b        z7.b, p0,   [dest_ptr, #7, mul vl]
+    add         dest_ptr, dest_ptr, CACHE_LINE_SIZE * 2
+
+L(L1_vl_32):
+    cmp         vector_length, 32
+    b.ne        L(L1_vl_16)
+    ptrue       p0.b
+    .p2align 3
+    ld1b        z0.b, p0/z, [src_ptr,  #0, mul vl]
+    ld1b        z1.b, p0/z, [src_ptr,  #1, mul vl]
+    ld1b        z2.b, p0/z, [src_ptr,  #2, mul vl]
+    ld1b        z3.b, p0/z, [src_ptr,  #3, mul vl]
+    ld1b        z4.b, p0/z, [src_ptr,  #4, mul vl]
+    ld1b        z5.b, p0/z, [src_ptr,  #5, mul vl]
+    ld1b        z6.b, p0/z, [src_ptr,  #6, mul vl]
+    ld1b        z7.b, p0/z, [src_ptr,  #7, mul vl]
+    add         src_ptr, src_ptr, CACHE_LINE_SIZE
+    sub         rest, rest, CACHE_LINE_SIZE
+1:  st1b        z0.b, p0,   [dest_ptr, #0, mul vl]
+    st1b        z1.b, p0,   [dest_ptr, #1, mul vl]
+    ld1b        z0.b, p0/z, [src_ptr,  #0, mul vl]
+    ld1b        z1.b, p0/z, [src_ptr,  #1, mul vl]
+    st1b        z2.b, p0,   [dest_ptr, #2, mul vl]
+    st1b        z3.b, p0,   [dest_ptr, #3, mul vl]
+    ld1b        z2.b, p0/z, [src_ptr,  #2, mul vl]
+    ld1b        z3.b, p0/z, [src_ptr,  #3, mul vl]
+    st1b        z4.b, p0,   [dest_ptr, #4, mul vl]
+    st1b        z5.b, p0,   [dest_ptr, #5, mul vl]
+    ld1b        z4.b, p0/z, [src_ptr,  #4, mul vl]
+    ld1b        z5.b, p0/z, [src_ptr,  #5, mul vl]
+    st1b        z6.b, p0,   [dest_ptr, #6, mul vl]
+    st1b        z7.b, p0,   [dest_ptr, #7, mul vl]
+    ld1b        z6.b, p0/z, [src_ptr,  #6, mul vl]
+    ld1b        z7.b, p0/z, [src_ptr,  #7, mul vl]
+    mov         tmp1, PF_DIST_L1
+    prfm        pstl1keep, [dest_ptr, tmp1]
+    mov         tmp1, PF_DIST_L2
+    prfm        pstl2keep, [dest_ptr, tmp1]
+    add         dest_ptr, dest_ptr, CACHE_LINE_SIZE
+    add         src_ptr, src_ptr, CACHE_LINE_SIZE
+    st1b        z0.b, p0,   [dest_ptr, #0, mul vl]
+    st1b        z1.b, p0,   [dest_ptr, #1, mul vl]
+    ld1b        z0.b, p0/z, [src_ptr,  #0, mul vl]
+    ld1b        z1.b, p0/z, [src_ptr,  #1, mul vl]
+    st1b        z2.b, p0,   [dest_ptr, #2, mul vl]
+    st1b        z3.b, p0,   [dest_ptr, #3, mul vl]
+    ld1b        z2.b, p0/z, [src_ptr,  #2, mul vl]
+    ld1b        z3.b, p0/z, [src_ptr,  #3, mul vl]
+    st1b        z4.b, p0,   [dest_ptr, #4, mul vl]
+    st1b        z5.b, p0,   [dest_ptr, #5, mul vl]
+    ld1b        z4.b, p0/z, [src_ptr,  #4, mul vl]
+    ld1b        z5.b, p0/z, [src_ptr,  #5, mul vl]
+    st1b        z6.b, p0,   [dest_ptr, #6, mul vl]
+    st1b        z7.b, p0,   [dest_ptr, #7, mul vl]
+    ld1b        z6.b, p0/z, [src_ptr,  #6, mul vl]
+    ld1b        z7.b, p0/z, [src_ptr,  #7, mul vl]
+    mov         tmp1, PF_DIST_L1 + CACHE_LINE_SIZE
+    prfm        pstl1keep, [dest_ptr, tmp1]
+    mov         tmp1, PF_DIST_L2 + CACHE_LINE_SIZE
+    prfm        pstl2keep, [dest_ptr, tmp1]
+    add         dest_ptr, dest_ptr, CACHE_LINE_SIZE
+    add         src_ptr, src_ptr, CACHE_LINE_SIZE
+    sub         rest, rest, CACHE_LINE_SIZE * 2
+    cmp         rest, L1_SIZE
+    b.ge        1b
+    st1b        z0.b, p0,   [dest_ptr, #0, mul vl]
+    st1b        z1.b, p0,   [dest_ptr, #1, mul vl]
+    st1b        z2.b, p0,   [dest_ptr, #2, mul vl]
+    st1b        z3.b, p0,   [dest_ptr, #3, mul vl]
+    st1b        z4.b, p0,   [dest_ptr, #4, mul vl]
+    st1b        z5.b, p0,   [dest_ptr, #5, mul vl]
+    st1b        z6.b, p0,   [dest_ptr, #6, mul vl]
+    st1b        z7.b, p0,   [dest_ptr, #7, mul vl]
+    add         dest_ptr, dest_ptr, CACHE_LINE_SIZE
+
+L(L1_vl_16):
+    cmp         vector_length, 16
+    b.ne        L(vl_agnostic)
+    ptrue       p0.b
+    .p2align 3
+    add         src_ptr, src_ptr, CACHE_LINE_SIZE / 2
+    ld1b        z16.b,  p0/z, [src_ptr, #-8, mul vl]
+    ld1b        z17.b,  p0/z, [src_ptr, #-7, mul vl]
+    ld1b        z18.b, p0/z, [src_ptr,  #-6, mul vl]
+    ld1b        z19.b, p0/z, [src_ptr,  #-5, mul vl]
+    ld1b        z20.b, p0/z, [src_ptr,  #-4, mul vl]
+    ld1b        z21.b, p0/z, [src_ptr,  #-3, mul vl]
+    ld1b        z22.b, p0/z, [src_ptr,  #-2, mul vl]
+    ld1b        z23.b, p0/z, [src_ptr,  #-1, mul vl]
+    ld1b        z0.b,  p0/z, [src_ptr,  #0, mul vl]
+    ld1b        z1.b,  p0/z, [src_ptr,  #1, mul vl]
+    ld1b        z2.b,  p0/z, [src_ptr,  #2, mul vl]
+    ld1b        z3.b,  p0/z, [src_ptr,  #3, mul vl]
+    ld1b        z4.b,  p0/z, [src_ptr,  #4, mul vl]
+    ld1b        z5.b,  p0/z, [src_ptr,  #5, mul vl]
+    ld1b        z6.b,  p0/z, [src_ptr,  #6, mul vl]
+    ld1b        z7.b,  p0/z, [src_ptr,  #7, mul vl]
+    add         src_ptr, src_ptr, CACHE_LINE_SIZE / 2
+    sub         rest, rest, CACHE_LINE_SIZE
+1:  add         dest_ptr, dest_ptr, CACHE_LINE_SIZE / 2
+    add         src_ptr, src_ptr, CACHE_LINE_SIZE / 2
+    st1b        z16.b, p0,   [dest_ptr, #-8, mul vl]
+    st1b        z17.b, p0,   [dest_ptr, #-7, mul vl]
+    ld1b        z16.b, p0/z, [src_ptr,  #-8, mul vl]
+    ld1b        z17.b, p0/z, [src_ptr,  #-7, mul vl]
+    st1b        z18.b, p0,   [dest_ptr, #-6, mul vl]
+    st1b        z19.b, p0,   [dest_ptr, #-5, mul vl]
+    ld1b        z18.b, p0/z, [src_ptr,  #-6, mul vl]
+    ld1b        z19.b, p0/z, [src_ptr,  #-5, mul vl]
+    st1b        z20.b, p0,   [dest_ptr, #-4, mul vl]
+    st1b        z21.b, p0,   [dest_ptr, #-3, mul vl]
+    ld1b        z20.b, p0/z, [src_ptr,  #-4, mul vl]
+    ld1b        z21.b, p0/z, [src_ptr,  #-3, mul vl]
+    st1b        z22.b, p0,   [dest_ptr, #-2, mul vl]
+    st1b        z23.b, p0,   [dest_ptr, #-1, mul vl]
+    ld1b        z22.b, p0/z, [src_ptr,  #-2, mul vl]
+    ld1b        z23.b, p0/z, [src_ptr,  #-1, mul vl]
+    st1b        z0.b, p0,   [dest_ptr, #0, mul vl]
+    st1b        z1.b, p0,   [dest_ptr, #1, mul vl]
+    ld1b        z0.b, p0/z, [src_ptr,  #0, mul vl]
+    ld1b        z1.b, p0/z, [src_ptr,  #1, mul vl]
+    st1b        z2.b, p0,   [dest_ptr, #2, mul vl]
+    st1b        z3.b, p0,   [dest_ptr, #3, mul vl]
+    ld1b        z2.b, p0/z, [src_ptr,  #2, mul vl]
+    ld1b        z3.b, p0/z, [src_ptr,  #3, mul vl]
+    st1b        z4.b, p0,   [dest_ptr, #4, mul vl]
+    st1b        z5.b, p0,   [dest_ptr, #5, mul vl]
+    ld1b        z4.b, p0/z, [src_ptr,  #4, mul vl]
+    ld1b        z5.b, p0/z, [src_ptr,  #5, mul vl]
+    st1b        z6.b, p0,   [dest_ptr, #6, mul vl]
+    st1b        z7.b, p0,   [dest_ptr, #7, mul vl]
+    ld1b        z6.b, p0/z, [src_ptr,  #6, mul vl]
+    ld1b        z7.b, p0/z, [src_ptr,  #7, mul vl]
+    mov         tmp1, PF_DIST_L1
+    prfm        pstl1keep, [dest_ptr, tmp1]
+    mov         tmp1, PF_DIST_L2
+    prfm        pstl2keep, [dest_ptr, tmp1]
+    add         dest_ptr, dest_ptr, CACHE_LINE_SIZE
+    add         src_ptr, src_ptr, CACHE_LINE_SIZE
+    st1b        z16.b, p0,   [dest_ptr, #-8, mul vl]
+    st1b        z17.b, p0,   [dest_ptr, #-7, mul vl]
+    ld1b        z16.b, p0/z, [src_ptr,  #-8, mul vl]
+    ld1b        z17.b, p0/z, [src_ptr,  #-7, mul vl]
+    st1b        z18.b, p0,   [dest_ptr, #-6, mul vl]
+    st1b        z19.b, p0,   [dest_ptr, #-5, mul vl]
+    ld1b        z18.b, p0/z, [src_ptr,  #-6, mul vl]
+    ld1b        z19.b, p0/z, [src_ptr,  #-5, mul vl]
+    st1b        z20.b, p0,   [dest_ptr, #-4, mul vl]
+    st1b        z21.b, p0,   [dest_ptr, #-3, mul vl]
+    ld1b        z20.b, p0/z, [src_ptr,  #-4, mul vl]
+    ld1b        z21.b, p0/z, [src_ptr,  #-3, mul vl]
+    st1b        z22.b, p0,   [dest_ptr, #-2, mul vl]
+    st1b        z23.b, p0,   [dest_ptr, #-1, mul vl]
+    ld1b        z22.b, p0/z, [src_ptr,  #-2, mul vl]
+    ld1b        z23.b, p0/z, [src_ptr,  #-1, mul vl]
+    st1b        z0.b, p0,   [dest_ptr, #0, mul vl]
+    st1b        z1.b, p0,   [dest_ptr, #1, mul vl]
+    ld1b        z0.b, p0/z, [src_ptr,  #0, mul vl]
+    ld1b        z1.b, p0/z, [src_ptr,  #1, mul vl]
+    st1b        z2.b, p0,   [dest_ptr, #2, mul vl]
+    st1b        z3.b, p0,   [dest_ptr, #3, mul vl]
+    ld1b        z2.b, p0/z, [src_ptr,  #2, mul vl]
+    ld1b        z3.b, p0/z, [src_ptr,  #3, mul vl]
+    st1b        z4.b, p0,   [dest_ptr, #4, mul vl]
+    st1b        z5.b, p0,   [dest_ptr, #5, mul vl]
+    ld1b        z4.b, p0/z, [src_ptr,  #4, mul vl]
+    ld1b        z5.b, p0/z, [src_ptr,  #5, mul vl]
+    st1b        z6.b, p0,   [dest_ptr, #6, mul vl]
+    st1b        z7.b, p0,   [dest_ptr, #7, mul vl]
+    ld1b        z6.b, p0/z, [src_ptr,  #6, mul vl]
+    ld1b        z7.b, p0/z, [src_ptr,  #7, mul vl]
+    mov         tmp1, PF_DIST_L1 + CACHE_LINE_SIZE
+    prfm        pstl1keep, [dest_ptr, tmp1]
+    mov         tmp1, PF_DIST_L2 + CACHE_LINE_SIZE
+    prfm        pstl2keep, [dest_ptr, tmp1]
+    add         dest_ptr, dest_ptr, CACHE_LINE_SIZE / 2
+    add         src_ptr, src_ptr, CACHE_LINE_SIZE / 2
+    sub         rest, rest, CACHE_LINE_SIZE * 2
+    cmp         rest, L1_SIZE
+    b.ge        1b
+    add         dest_ptr, dest_ptr, CACHE_LINE_SIZE / 2
+    st1b        z16.b, p0, [dest_ptr, #-8, mul vl]
+    st1b        z17.b, p0, [dest_ptr, #-7, mul vl]
+    st1b        z18.b, p0, [dest_ptr, #-6, mul vl]
+    st1b        z19.b, p0, [dest_ptr, #-5, mul vl]
+    st1b        z20.b, p0, [dest_ptr, #-4, mul vl]
+    st1b        z21.b, p0, [dest_ptr, #-3, mul vl]
+    st1b        z22.b, p0, [dest_ptr, #-2, mul vl]
+    st1b        z23.b, p0, [dest_ptr, #-1, mul vl]
+    st1b        z0.b, p0,  [dest_ptr, #0, mul vl]
+    st1b        z1.b, p0,  [dest_ptr, #1, mul vl]
+    st1b        z2.b, p0,  [dest_ptr, #2, mul vl]
+    st1b        z3.b, p0,  [dest_ptr, #3, mul vl]
+    st1b        z4.b, p0,  [dest_ptr, #4, mul vl]
+    st1b        z5.b, p0,  [dest_ptr, #5, mul vl]
+    st1b        z6.b, p0,  [dest_ptr, #6, mul vl]
+    st1b        z7.b, p0,  [dest_ptr, #7, mul vl]
+    add         dest_ptr, dest_ptr, CACHE_LINE_SIZE / 2
+
+L(vl_agnostic): // VL Agnostic
+
+L(unroll32): // unrolling and software pipeline
+    lsl         tmp1, vector_length, 3  // vector_length * 8
+    lsl         tmp2, vector_length, 5  // vector_length * 32
+    ptrue       p0.b
+    .p2align 3
+1:  cmp         rest, tmp2
+    b.cc        L(unroll8)
+    ld1b        z0.b, p0/z, [src_ptr,  #0, mul vl]
+    ld1b        z1.b, p0/z, [src_ptr,  #1, mul vl]
+    st1b        z0.b, p0,   [dest_ptr, #0, mul vl]
+    st1b        z1.b, p0,   [dest_ptr, #1, mul vl]
+    ld1b        z2.b, p0/z, [src_ptr,  #2, mul vl]
+    ld1b        z3.b, p0/z, [src_ptr,  #3, mul vl]
+    st1b        z2.b, p0,   [dest_ptr, #2, mul vl]
+    st1b        z3.b, p0,   [dest_ptr, #3, mul vl]
+    ld1b        z4.b, p0/z, [src_ptr,  #4, mul vl]
+    ld1b        z5.b, p0/z, [src_ptr,  #5, mul vl]
+    st1b        z4.b, p0,   [dest_ptr, #4, mul vl]
+    st1b        z5.b, p0,   [dest_ptr, #5, mul vl]
+    ld1b        z6.b, p0/z, [src_ptr,  #6, mul vl]
+    ld1b        z7.b, p0/z, [src_ptr,  #7, mul vl]
+    st1b        z6.b, p0,   [dest_ptr, #6, mul vl]
+    st1b        z7.b, p0,   [dest_ptr, #7, mul vl]
+    add         dest_ptr, dest_ptr, tmp1
+    add         src_ptr, src_ptr, tmp1
+    ld1b        z0.b, p0/z, [src_ptr,  #0, mul vl]
+    ld1b        z1.b, p0/z, [src_ptr,  #1, mul vl]
+    st1b        z0.b, p0,   [dest_ptr, #0, mul vl]
+    st1b        z1.b, p0,   [dest_ptr, #1, mul vl]
+    ld1b        z2.b, p0/z, [src_ptr,  #2, mul vl]
+    ld1b        z3.b, p0/z, [src_ptr,  #3, mul vl]
+    st1b        z2.b, p0,   [dest_ptr, #2, mul vl]
+    st1b        z3.b, p0,   [dest_ptr, #3, mul vl]
+    ld1b        z4.b, p0/z, [src_ptr,  #4, mul vl]
+    ld1b        z5.b, p0/z, [src_ptr,  #5, mul vl]
+    st1b        z4.b, p0,   [dest_ptr, #4, mul vl]
+    st1b        z5.b, p0,   [dest_ptr, #5, mul vl]
+    ld1b        z6.b, p0/z, [src_ptr,  #6, mul vl]
+    ld1b        z7.b, p0/z, [src_ptr,  #7, mul vl]
+    st1b        z6.b, p0,   [dest_ptr, #6, mul vl]
+    st1b        z7.b, p0,   [dest_ptr, #7, mul vl]
+    add         dest_ptr, dest_ptr, tmp1
+    add         src_ptr, src_ptr, tmp1
+    ld1b        z0.b, p0/z, [src_ptr,  #0, mul vl]
+    ld1b        z1.b, p0/z, [src_ptr,  #1, mul vl]
+    st1b        z0.b, p0,   [dest_ptr, #0, mul vl]
+    st1b        z1.b, p0,   [dest_ptr, #1, mul vl]
+    ld1b        z2.b, p0/z, [src_ptr,  #2, mul vl]
+    ld1b        z3.b, p0/z, [src_ptr,  #3, mul vl]
+    st1b        z2.b, p0,   [dest_ptr, #2, mul vl]
+    st1b        z3.b, p0,   [dest_ptr, #3, mul vl]
+    ld1b        z4.b, p0/z, [src_ptr,  #4, mul vl]
+    ld1b        z5.b, p0/z, [src_ptr,  #5, mul vl]
+    st1b        z4.b, p0,   [dest_ptr, #4, mul vl]
+    st1b        z5.b, p0,   [dest_ptr, #5, mul vl]
+    ld1b        z6.b, p0/z, [src_ptr,  #6, mul vl]
+    ld1b        z7.b, p0/z, [src_ptr,  #7, mul vl]
+    st1b        z6.b, p0,   [dest_ptr, #6, mul vl]
+    st1b        z7.b, p0,   [dest_ptr, #7, mul vl]
+    add         dest_ptr, dest_ptr, tmp1
+    add         src_ptr, src_ptr, tmp1
+    ld1b        z0.b, p0/z, [src_ptr,  #0, mul vl]
+    ld1b        z1.b, p0/z, [src_ptr,  #1, mul vl]
+    st1b        z0.b, p0,   [dest_ptr, #0, mul vl]
+    st1b        z1.b, p0,   [dest_ptr, #1, mul vl]
+    ld1b        z2.b, p0/z, [src_ptr,  #2, mul vl]
+    ld1b        z3.b, p0/z, [src_ptr,  #3, mul vl]
+    st1b        z2.b, p0,   [dest_ptr, #2, mul vl]
+    st1b        z3.b, p0,   [dest_ptr, #3, mul vl]
+    ld1b        z4.b, p0/z, [src_ptr,  #4, mul vl]
+    ld1b        z5.b, p0/z, [src_ptr,  #5, mul vl]
+    st1b        z4.b, p0,   [dest_ptr, #4, mul vl]
+    st1b        z5.b, p0,   [dest_ptr, #5, mul vl]
+    ld1b        z6.b, p0/z, [src_ptr,  #6, mul vl]
+    ld1b        z7.b, p0/z, [src_ptr,  #7, mul vl]
+    st1b        z6.b, p0,   [dest_ptr, #6, mul vl]
+    st1b        z7.b, p0,   [dest_ptr, #7, mul vl]
+    add         dest_ptr, dest_ptr, tmp1
+    add         src_ptr, src_ptr, tmp1
+    sub         rest, rest, tmp2
+    b           1b
+
+L(unroll8): // unrolling and software pipeline
+    lsl         tmp1, vector_length, 3  // vector_length * 8
+    ptrue       p0.b
+    .p2align 3
+1:  cmp         rest, tmp1
+    b.cc        L(unroll1)
+    ld1b        z0.b, p0/z, [src_ptr,  #0, mul vl]
+    ld1b        z1.b, p0/z, [src_ptr,  #1, mul vl]
+    st1b        z0.b, p0,   [dest_ptr, #0, mul vl]
+    st1b        z1.b, p0,   [dest_ptr, #1, mul vl]
+    ld1b        z2.b, p0/z, [src_ptr,  #2, mul vl]
+    ld1b        z3.b, p0/z, [src_ptr,  #3, mul vl]
+    st1b        z2.b, p0,   [dest_ptr, #2, mul vl]
+    st1b        z3.b, p0,   [dest_ptr, #3, mul vl]
+    ld1b        z4.b, p0/z, [src_ptr,  #4, mul vl]
+    ld1b        z5.b, p0/z, [src_ptr,  #5, mul vl]
+    st1b        z4.b, p0,   [dest_ptr, #4, mul vl]
+    st1b        z5.b, p0,   [dest_ptr, #5, mul vl]
+    ld1b        z6.b, p0/z, [src_ptr,  #6, mul vl]
+    ld1b        z7.b, p0/z, [src_ptr,  #7, mul vl]
+    st1b        z6.b, p0,   [dest_ptr, #6, mul vl]
+    st1b        z7.b, p0,   [dest_ptr, #7, mul vl]
+    add         dest_ptr, dest_ptr, tmp1
+    add         src_ptr, src_ptr, tmp1
+    sub         rest, rest, tmp1
+    b           1b
+
+ L(unroll1):
+    ptrue       p0.b
+    .p2align 3
+1:  cmp         rest, vector_length
+    b.cc        L(last)
+    ld1b        z0.b, p0/z, [src_ptr]
+    st1b        z0.b, p0,   [dest_ptr]
+    add         dest_ptr, dest_ptr, vector_length
+    add         src_ptr, src_ptr, vector_length
+    sub         rest, rest, vector_length
+    b           1b
+
+L(last):
+    whilelt     p0.b, xzr, rest
+    ld1b        z0.b, p0/z, [src_ptr]
+    st1b        z0.b, p0, [dest_ptr]
+    ret
+
+END (MEMCPY)
+libc_hidden_builtin_def (MEMCPY)
+
+
+    .p2align 4
+ENTRY_ALIGN (MEMMOVE, 6)
+
+    // remove tag address
+    and         tmp1, dest, 0xffffffffffffff
+    and         tmp2, src, 0xffffffffffffff
+    sub         tmp1, tmp1, tmp2         // diff
+    // if diff <= 0 || diff >= n then memcpy
+    cmp         tmp1, 0
+    ccmp        tmp1, n, 2, gt
+    b.cs        L(fwd_start)
+
+L(bwd_start):
+    mov         rest, n
+    add         dest_ptr, dest, n       // dest_end
+    add         src_ptr, src, n         // src_end
+    cntb        vector_length
+    ptrue       p0.b
+    udiv        tmp1, n, vector_length          // quotient
+    mul         tmp1, tmp1, vector_length       // product
+    sub         vl_remainder, n, tmp1
+    // if bwd_remainder == 0 then skip vl_remainder bwd copy
+    cmp         vl_remainder, 0
+    b.eq        L(bwd_main)
+    // vl_remainder bwd copy
+    whilelt     p0.b, xzr, vl_remainder
+    sub         src_ptr, src_ptr, vl_remainder
+    sub         dest_ptr, dest_ptr, vl_remainder
+    ld1b        z0.b, p0/z, [src_ptr]
+    st1b        z0.b, p0, [dest_ptr]
+    sub         rest, rest, vl_remainder
+
+L(bwd_main):
+
+    // VL Agnostic
+L(bwd_unroll32): // unrolling and software pipeline
+    lsl         tmp1, vector_length, 3  // vector_length * 8
+    lsl         tmp2, vector_length, 5  // vector_length * 32
+    ptrue       p0.b
+    .p2align 3
+1:  cmp         rest, tmp2
+    b.cc        L(bwd_unroll8)
+    sub         src_ptr, src_ptr, tmp1
+    sub         dest_ptr, dest_ptr, tmp1
+    ld1b        z0.b, p0/z, [src_ptr,  #7, mul vl]
+    ld1b        z1.b, p0/z, [src_ptr,  #6, mul vl]
+    st1b        z0.b, p0,   [dest_ptr, #7, mul vl]
+    st1b        z1.b, p0,   [dest_ptr, #6, mul vl]
+    ld1b        z2.b, p0/z, [src_ptr,  #5, mul vl]
+    ld1b        z3.b, p0/z, [src_ptr,  #4, mul vl]
+    st1b        z2.b, p0,   [dest_ptr, #5, mul vl]
+    st1b        z3.b, p0,   [dest_ptr, #4, mul vl]
+    ld1b        z4.b, p0/z, [src_ptr,  #3, mul vl]
+    ld1b        z5.b, p0/z, [src_ptr,  #2, mul vl]
+    st1b        z4.b, p0,   [dest_ptr, #3, mul vl]
+    st1b        z5.b, p0,   [dest_ptr, #2, mul vl]
+    ld1b        z6.b, p0/z, [src_ptr,  #1, mul vl]
+    ld1b        z7.b, p0/z, [src_ptr,  #0, mul vl]
+    st1b        z6.b, p0,   [dest_ptr, #1, mul vl]
+    st1b        z7.b, p0,   [dest_ptr, #0, mul vl]
+    sub         src_ptr, src_ptr, tmp1
+    sub         dest_ptr, dest_ptr, tmp1
+    ld1b        z0.b, p0/z, [src_ptr,  #7, mul vl]
+    ld1b        z1.b, p0/z, [src_ptr,  #6, mul vl]
+    st1b        z0.b, p0,   [dest_ptr, #7, mul vl]
+    st1b        z1.b, p0,   [dest_ptr, #6, mul vl]
+    ld1b        z2.b, p0/z, [src_ptr,  #5, mul vl]
+    ld1b        z3.b, p0/z, [src_ptr,  #4, mul vl]
+    st1b        z2.b, p0,   [dest_ptr, #5, mul vl]
+    st1b        z3.b, p0,   [dest_ptr, #4, mul vl]
+    ld1b        z4.b, p0/z, [src_ptr,  #3, mul vl]
+    ld1b        z5.b, p0/z, [src_ptr,  #2, mul vl]
+    st1b        z4.b, p0,   [dest_ptr, #3, mul vl]
+    st1b        z5.b, p0,   [dest_ptr, #2, mul vl]
+    ld1b        z6.b, p0/z, [src_ptr,  #1, mul vl]
+    ld1b        z7.b, p0/z, [src_ptr,  #0, mul vl]
+    st1b        z6.b, p0,   [dest_ptr, #1, mul vl]
+    st1b        z7.b, p0,   [dest_ptr, #0, mul vl]
+    sub         src_ptr, src_ptr, tmp1
+    sub         dest_ptr, dest_ptr, tmp1
+    ld1b        z0.b, p0/z, [src_ptr,  #7, mul vl]
+    ld1b        z1.b, p0/z, [src_ptr,  #6, mul vl]
+    st1b        z0.b, p0,   [dest_ptr, #7, mul vl]
+    st1b        z1.b, p0,   [dest_ptr, #6, mul vl]
+    ld1b        z2.b, p0/z, [src_ptr,  #5, mul vl]
+    ld1b        z3.b, p0/z, [src_ptr,  #4, mul vl]
+    st1b        z2.b, p0,   [dest_ptr, #5, mul vl]
+    st1b        z3.b, p0,   [dest_ptr, #4, mul vl]
+    ld1b        z4.b, p0/z, [src_ptr,  #3, mul vl]
+    ld1b        z5.b, p0/z, [src_ptr,  #2, mul vl]
+    st1b        z4.b, p0,   [dest_ptr, #3, mul vl]
+    st1b        z5.b, p0,   [dest_ptr, #2, mul vl]
+    ld1b        z6.b, p0/z, [src_ptr,  #1, mul vl]
+    ld1b        z7.b, p0/z, [src_ptr,  #0, mul vl]
+    st1b        z6.b, p0,   [dest_ptr, #1, mul vl]
+    st1b        z7.b, p0,   [dest_ptr, #0, mul vl]
+    sub         src_ptr, src_ptr, tmp1
+    sub         dest_ptr, dest_ptr, tmp1
+    ld1b        z0.b, p0/z, [src_ptr,  #7, mul vl]
+    ld1b        z1.b, p0/z, [src_ptr,  #6, mul vl]
+    st1b        z0.b, p0,   [dest_ptr, #7, mul vl]
+    st1b        z1.b, p0,   [dest_ptr, #6, mul vl]
+    ld1b        z2.b, p0/z, [src_ptr,  #5, mul vl]
+    ld1b        z3.b, p0/z, [src_ptr,  #4, mul vl]
+    st1b        z2.b, p0,   [dest_ptr, #5, mul vl]
+    st1b        z3.b, p0,   [dest_ptr, #4, mul vl]
+    ld1b        z4.b, p0/z, [src_ptr,  #3, mul vl]
+    ld1b        z5.b, p0/z, [src_ptr,  #2, mul vl]
+    st1b        z4.b, p0,   [dest_ptr, #3, mul vl]
+    st1b        z5.b, p0,   [dest_ptr, #2, mul vl]
+    ld1b        z6.b, p0/z, [src_ptr,  #1, mul vl]
+    ld1b        z7.b, p0/z, [src_ptr,  #0, mul vl]
+    st1b        z6.b, p0,   [dest_ptr, #1, mul vl]
+    st1b        z7.b, p0,   [dest_ptr, #0, mul vl]
+    sub         rest, rest, tmp2
+    b           1b
+
+L(bwd_unroll8): // unrolling and software pipeline
+    lsl         tmp1, vector_length, 3  // vector_length * 8
+    ptrue       p0.b
+    .p2align 3
+1:  cmp         rest, tmp1
+    b.cc        L(bwd_unroll1)
+    sub         src_ptr, src_ptr, tmp1
+    sub         dest_ptr, dest_ptr, tmp1
+    ld1b        z0.b, p0/z, [src_ptr,  #7, mul vl]
+    ld1b        z1.b, p0/z, [src_ptr,  #6, mul vl]
+    st1b        z0.b, p0,   [dest_ptr, #7, mul vl]
+    st1b        z1.b, p0,   [dest_ptr, #6, mul vl]
+    ld1b        z2.b, p0/z, [src_ptr,  #5, mul vl]
+    ld1b        z3.b, p0/z, [src_ptr,  #4, mul vl]
+    st1b        z2.b, p0,   [dest_ptr, #5, mul vl]
+    st1b        z3.b, p0,   [dest_ptr, #4, mul vl]
+    ld1b        z4.b, p0/z, [src_ptr,  #3, mul vl]
+    ld1b        z5.b, p0/z, [src_ptr,  #2, mul vl]
+    st1b        z4.b, p0,   [dest_ptr, #3, mul vl]
+    st1b        z5.b, p0,   [dest_ptr, #2, mul vl]
+    ld1b        z6.b, p0/z, [src_ptr,  #1, mul vl]
+    ld1b        z7.b, p0/z, [src_ptr,  #0, mul vl]
+    st1b        z6.b, p0,   [dest_ptr, #1, mul vl]
+    st1b        z7.b, p0,   [dest_ptr, #0, mul vl]
+    sub         rest, rest, tmp1
+    b           1b
+
+    .p2align 3
+L(bwd_unroll1):
+    ptrue       p0.b
+1:  cmp         rest, vector_length
+    b.cc        L(bwd_last)
+    sub         src_ptr, src_ptr, vector_length
+    sub         dest_ptr, dest_ptr, vector_length
+    ld1b        z0.b, p0/z, [src_ptr]
+    st1b        z0.b, p0, [dest_ptr]
+    sub         rest, rest, vector_length
+    b           1b
+
+L(bwd_last):
+    whilelt     p0.b, xzr, rest
+    sub         src_ptr, src_ptr, rest
+    sub         dest_ptr, dest_ptr, rest
+    ld1b        z0.b, p0/z, [src_ptr]
+    st1b        z0.b, p0, [dest_ptr]
+    ret
+
+END (MEMMOVE)
+libc_hidden_builtin_def (MEMMOVE)
+#endif /* IS_IN (libc) */
+#endif /* HAVE_SVE_ASM_SUPPORT */
+
diff --git a/sysdeps/aarch64/multiarch/memmove.c b/sysdeps/aarch64/multiarch/memmove.c
index 12d77818a9..1e5ee1c934 100644
--- a/sysdeps/aarch64/multiarch/memmove.c
+++ b/sysdeps/aarch64/multiarch/memmove.c
@@ -33,6 +33,9 @@ extern __typeof (__redirect_memmove) __memmove_simd attribute_hidden;
 extern __typeof (__redirect_memmove) __memmove_thunderx attribute_hidden;
 extern __typeof (__redirect_memmove) __memmove_thunderx2 attribute_hidden;
 extern __typeof (__redirect_memmove) __memmove_falkor attribute_hidden;
+#if HAVE_SVE_ASM_SUPPORT
+extern __typeof (__redirect_memmove) __memmove_a64fx attribute_hidden;
+#endif
 
 libc_ifunc (__libc_memmove,
             (IS_THUNDERX (midr)
@@ -44,8 +47,13 @@ libc_ifunc (__libc_memmove,
 		  : (IS_NEOVERSE_N1 (midr) || IS_NEOVERSE_N2 (midr)
 		     || IS_NEOVERSE_V1 (midr)
 		     ? __memmove_simd
-		     : __memmove_generic)))));
-
+#if HAVE_SVE_ASM_SUPPORT
+                     : (IS_A64FX (midr)
+                        ? __memmove_a64fx
+                        : __memmove_generic))))));
+#else
+                        : __memmove_generic)))));
+#endif
 # undef memmove
 strong_alias (__libc_memmove, memmove);
 #endif
diff --git a/sysdeps/unix/sysv/linux/aarch64/cpu-features.c b/sysdeps/unix/sysv/linux/aarch64/cpu-features.c
index db6aa3516c..6206a2f618 100644
--- a/sysdeps/unix/sysv/linux/aarch64/cpu-features.c
+++ b/sysdeps/unix/sysv/linux/aarch64/cpu-features.c
@@ -46,6 +46,7 @@ static struct cpu_list cpu_list[] = {
       {"ares",		 0x411FD0C0},
       {"emag",		 0x503F0001},
       {"kunpeng920", 	 0x481FD010},
+      {"a64fx",		 0x460F0010},
       {"generic", 	 0x0}
 };
 
@@ -116,4 +117,7 @@ init_cpu_features (struct cpu_features *cpu_features)
 	     (PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_ASYNC | MTE_ALLOWED_TAGS),
 	     0, 0, 0);
 #endif
+
+  /* Check if SVE is supported.  */
+  cpu_features->sve = GLRO (dl_hwcap) & HWCAP_SVE;
 }
diff --git a/sysdeps/unix/sysv/linux/aarch64/cpu-features.h b/sysdeps/unix/sysv/linux/aarch64/cpu-features.h
index 3b9bfed134..2b322e5414 100644
--- a/sysdeps/unix/sysv/linux/aarch64/cpu-features.h
+++ b/sysdeps/unix/sysv/linux/aarch64/cpu-features.h
@@ -65,6 +65,9 @@
 #define IS_KUNPENG920(midr) (MIDR_IMPLEMENTOR(midr) == 'H'			   \
                         && MIDR_PARTNUM(midr) == 0xd01)
 
+#define IS_A64FX(midr) (MIDR_IMPLEMENTOR(midr) == 'F'			      \
+			&& MIDR_PARTNUM(midr) == 0x001)
+
 struct cpu_features
 {
   uint64_t midr_el1;
@@ -72,6 +75,7 @@ struct cpu_features
   bool bti;
   /* Currently, the GLIBC memory tagging tunable only defines 8 bits.  */
   uint8_t mte_state;
+  bool sve;
 };
 
 #endif /* _CPU_FEATURES_AARCH64_H  */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 3/5] aarch64: Added optimized memset for A64FX
  2021-03-17  2:28 [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX Naohiro Tamura
  2021-03-17  2:33 ` [PATCH 1/5] config: Added HAVE_SVE_ASM_SUPPORT for aarch64 Naohiro Tamura
  2021-03-17  2:34 ` [PATCH 2/5] aarch64: Added optimized memcpy and memmove for A64FX Naohiro Tamura
@ 2021-03-17  2:34 ` Naohiro Tamura
  2021-03-17  2:35 ` [PATCH 4/5] scripts: Added Vector Length Set test helper script Naohiro Tamura
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 72+ messages in thread
From: Naohiro Tamura @ 2021-03-17  2:34 UTC (permalink / raw)
  To: libc-alpha; +Cc: Naohiro Tamura

From: Naohiro Tamura <naohirot@jp.fujitsu.com>

This patch optimizes the performance of memset for A64FX [1], which
implements ARMv8-A SVE and has a 64KB L1 cache per core and an 8MB L2
cache per NUMA node.

The optimization makes use of the Scalable Vector Registers together
with several techniques such as loop unrolling, memory access
alignment, cache zero fill, and prefetch.
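
As a rough illustration of the "cache zero fill" technique (a hedged C
sketch only, not the code in this patch; the helper names below are
made up): DC ZVA zeroes one block whose size is reported by DCZID_EL0,
so a zero-valued fill can clear whole 256-byte blocks on A64FX without
issuing ordinary stores.

  #include <stddef.h>
  #include <stdint.h>

  /* Block size in bytes cleared by one DC ZVA, from DCZID_EL0.  */
  static inline size_t
  dc_zva_block_size (void)
  {
    uint64_t dczid;
    __asm__ ("mrs %0, dczid_el0" : "=r" (dczid));
    return (size_t) 4 << (dczid & 0xf);
  }

  /* Zero 'bytes' bytes at a block-aligned 'dst' using DC ZVA.  Only
     valid when the fill value is zero and DC ZVA is not prohibited
     (DCZID_EL0.DZP == 0).  */
  static inline void
  dc_zva_fill (char *dst, size_t bytes)
  {
    size_t bs = dc_zva_block_size ();
    for (size_t i = 0; i + bs <= bytes; i += bs)
      __asm__ volatile ("dc zva, %0" : : "r" (dst + i) : "memory");
  }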

The SVE assembler code for memset is implemented as Vector Length
Agnostic (VLA) code, so in principle it can run on any SoC that
supports the ARMv8-A SVE standard.

We confirmed that all test cases pass with 'make check' and 'make
xcheck', not only on A64FX but also on ThunderX2.

We also confirmed with 'make bench' that the SVE 512-bit vector
register implementation is roughly 4 times faster than the Advanced
SIMD 128-bit implementation and 8 times faster than the scalar 64-bit
implementation.

[1] https://github.com/fujitsu/A64FX
---
 sysdeps/aarch64/multiarch/Makefile          |   1 +
 sysdeps/aarch64/multiarch/ifunc-impl-list.c |   5 +-
 sysdeps/aarch64/multiarch/memset.c          |  11 +-
 sysdeps/aarch64/multiarch/memset_a64fx.S    | 574 ++++++++++++++++++++
 4 files changed, 589 insertions(+), 2 deletions(-)
 create mode 100644 sysdeps/aarch64/multiarch/memset_a64fx.S

diff --git a/sysdeps/aarch64/multiarch/Makefile b/sysdeps/aarch64/multiarch/Makefile
index 04c3f17121..7500cf1e93 100644
--- a/sysdeps/aarch64/multiarch/Makefile
+++ b/sysdeps/aarch64/multiarch/Makefile
@@ -2,6 +2,7 @@ ifeq ($(subdir),string)
 sysdep_routines += memcpy_generic memcpy_advsimd memcpy_thunderx memcpy_thunderx2 \
 		   memcpy_falkor memcpy_a64fx \
 		   memset_generic memset_falkor memset_emag memset_kunpeng \
+		   memset_a64fx \
 		   memchr_generic memchr_nosimd \
 		   strlen_mte strlen_asimd
 endif
diff --git a/sysdeps/aarch64/multiarch/ifunc-impl-list.c b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
index cb78da9692..e252a10d88 100644
--- a/sysdeps/aarch64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
@@ -41,7 +41,7 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 
   INIT_ARCH ();
 
-  /* Support sysdeps/aarch64/multiarch/memcpy.c and memmove.c.  */
+  /* Support sysdeps/aarch64/multiarch/memcpy.c, memmove.c and memset.c.  */
   IFUNC_IMPL (i, name, memcpy,
 	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_thunderx)
 	      IFUNC_IMPL_ADD (array, i, memcpy, !bti, __memcpy_thunderx2)
@@ -66,6 +66,9 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 	      IFUNC_IMPL_ADD (array, i, memset, (zva_size == 64), __memset_falkor)
 	      IFUNC_IMPL_ADD (array, i, memset, (zva_size == 64), __memset_emag)
 	      IFUNC_IMPL_ADD (array, i, memset, 1, __memset_kunpeng)
+#if HAVE_SVE_ASM_SUPPORT
+	      IFUNC_IMPL_ADD (array, i, memset, sve, __memset_a64fx)
+#endif
 	      IFUNC_IMPL_ADD (array, i, memset, 1, __memset_generic))
   IFUNC_IMPL (i, name, memchr,
 	      IFUNC_IMPL_ADD (array, i, memchr, !mte, __memchr_nosimd)
diff --git a/sysdeps/aarch64/multiarch/memset.c b/sysdeps/aarch64/multiarch/memset.c
index 28d3926bc2..df075edddb 100644
--- a/sysdeps/aarch64/multiarch/memset.c
+++ b/sysdeps/aarch64/multiarch/memset.c
@@ -31,6 +31,9 @@ extern __typeof (__redirect_memset) __libc_memset;
 extern __typeof (__redirect_memset) __memset_falkor attribute_hidden;
 extern __typeof (__redirect_memset) __memset_emag attribute_hidden;
 extern __typeof (__redirect_memset) __memset_kunpeng attribute_hidden;
+#if HAVE_SVE_ASM_SUPPORT
+extern __typeof (__redirect_memset) __memset_a64fx attribute_hidden;
+#endif
 extern __typeof (__redirect_memset) __memset_generic attribute_hidden;
 
 libc_ifunc (__libc_memset,
@@ -40,7 +43,13 @@ libc_ifunc (__libc_memset,
 	     ? __memset_falkor
 	     : (IS_EMAG (midr) && zva_size == 64
 	       ? __memset_emag
-	       : __memset_generic)));
+#if HAVE_SVE_ASM_SUPPORT
+	       : (IS_A64FX (midr)
+		  ? __memset_a64fx
+	          : __memset_generic))));
+#else
+	          : __memset_generic)));
+#endif
 
 # undef memset
 strong_alias (__libc_memset, memset);
diff --git a/sysdeps/aarch64/multiarch/memset_a64fx.S b/sysdeps/aarch64/multiarch/memset_a64fx.S
new file mode 100644
index 0000000000..02ae7caab0
--- /dev/null
+++ b/sysdeps/aarch64/multiarch/memset_a64fx.S
@@ -0,0 +1,574 @@
+/* Optimized memset for Fujitsu A64FX processor.
+   Copyright (C) 2012-2021 Free Software Foundation, Inc.
+
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include <sysdeps/aarch64/memset-reg.h>
+
+#if HAVE_SVE_ASM_SUPPORT
+#if IS_IN (libc)
+# define MEMSET __memset_a64fx
+
+/* Assumptions:
+ *
+ * ARMv8.2-a, AArch64, unaligned accesses, sve
+ *
+ */
+
+#define L1_SIZE         (64*1024)       // L1 64KB
+#define L2_SIZE         (8*1024*1024)   // L2 8MB - 1MB
+#define CACHE_LINE_SIZE 256
+#define PF_DIST_L1 (CACHE_LINE_SIZE * 16)
+#define PF_DIST_L2 (CACHE_LINE_SIZE * 128)
+#define rest            x8
+#define vector_length   x9
+#define vl_remainder    x10     // vector_length remainder
+#define cl_remainder    x11     // CACHE_LINE_SIZE remainder
+
+    .arch armv8.2-a+sve
+
+ENTRY_ALIGN (MEMSET, 6)
+
+    PTR_ARG (0)
+    SIZE_ARG (2)
+
+    cmp         count, 0
+    b.ne        L(init)
+    ret
+L(init):
+    mov         rest, count
+    mov         dst, dstin
+    add         dstend, dstin, count
+    cntb        vector_length
+    ptrue       p0.b
+    dup         z0.b, valw
+
+    cmp         count, 96
+    b.hi	L(set_long)
+    cmp         count, 16
+    b.hs	L(set_medium)
+    mov         val, v0.D[0]
+
+    /* Set 0..15 bytes.  */
+    tbz         count, 3, 1f
+    str         val, [dstin]
+    str         val, [dstend, -8]
+    ret
+    nop
+1:  tbz         count, 2, 2f
+    str         valw, [dstin]
+    str         valw, [dstend, -4]
+    ret
+2:  cbz         count, 3f
+    strb        valw, [dstin]
+    tbz         count, 1, 3f
+    strh        valw, [dstend, -2]
+3:  ret
+
+    /* Set 17..96 bytes.  */
+L(set_medium):
+    str         q0, [dstin]
+    tbnz        count, 6, L(set96)
+    str         q0, [dstend, -16]
+    tbz         count, 5, 1f
+    str         q0, [dstin, 16]
+    str         q0, [dstend, -32]
+1:  ret
+
+    .p2align 4
+    /* Set 64..96 bytes.  Write 64 bytes from the start and
+       32 bytes from the end.  */
+L(set96):
+    str         q0, [dstin, 16]
+    stp         q0, q0, [dstin, 32]
+    stp         q0, q0, [dstend, -32]
+    ret
+
+L(set_long):
+    // if count > 1280 && vector_length != 16 then L(L2)
+    cmp         count, 1280
+    ccmp        vector_length, 16, 4, gt
+    b.ne        L(L2)
+    bic         dst, dstin, 15
+    str         q0, [dstin]
+    sub         count, dstend, dst      /* Count is 16 too large.  */
+    sub         dst, dst, 16            /* Dst is biased by -32.  */
+    sub         count, count, 64 + 16   /* Adjust count and bias for loop.  */
+1:  stp         q0, q0, [dst, 32]
+    stp         q0, q0, [dst, 64]!
+    subs        count, count, 64
+    b.lo        2f
+    stp         q0, q0, [dst, 32]
+    stp         q0, q0, [dst, 64]!
+    subs        count, count, 64
+    b.lo	2f
+    stp         q0, q0, [dst, 32]
+    stp         q0, q0, [dst, 64]!
+    subs        count, count, 64
+    b.lo        2f
+    stp         q0, q0, [dst, 32]
+    stp         q0, q0, [dst, 64]!
+    subs        count, count, 64
+    b.hi        1b
+2:  stp         q0, q0, [dstend, -64]
+    stp         q0, q0, [dstend, -32]
+    ret
+
+L(L2):
+    // get block_size
+    mrs         tmp1, dczid_el0
+    cmp         tmp1, 6         // CACHE_LINE_SIZE 256
+    b.ne        L(vl_agnostic)
+
+    // if rest >= L2_SIZE
+    cmp         rest, L2_SIZE
+    b.cc        L(L1_prefetch)
+    // align dst address at vector_length byte boundary
+    sub         tmp1, vector_length, 1
+    and         tmp2, dst, tmp1
+    // if vl_remainder == 0
+    cmp         tmp2, 0
+    b.eq        1f
+    sub         vl_remainder, vector_length, tmp2
+    // process remainder until the first vector_length boundary
+    whilelt     p0.b, xzr, vl_remainder
+    st1b        z0.b, p0, [dst]
+    add         dst, dst, vl_remainder
+    sub         rest, rest, vl_remainder
+    // align dstin address at CACHE_LINE_SIZE byte boundary
+1:  mov         tmp1, CACHE_LINE_SIZE
+    and         tmp2, dst, CACHE_LINE_SIZE - 1
+    // if cl_remainder == 0
+    cmp         tmp2, 0
+    b.eq        L(L2_dc_zva)
+    sub         cl_remainder, tmp1, tmp2
+    // process remainder until the first CACHE_LINE_SIZE boundary
+    mov         tmp1, xzr       // index
+2:  whilelt     p0.b, tmp1, cl_remainder
+    st1b        z0.b, p0, [dst, tmp1]
+    incb        tmp1
+    cmp         tmp1, cl_remainder
+    b.lo        2b
+    add         dst, dst, cl_remainder
+    sub         rest, rest, cl_remainder
+
+L(L2_dc_zva): // unroll zero fill
+    mov         tmp1, dst
+    dc          zva, tmp1               // 1
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    dc          zva, tmp1               // 2
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    dc          zva, tmp1               // 3
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    dc          zva, tmp1               // 4
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    dc          zva, tmp1               // 5
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    dc          zva, tmp1               // 6
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    dc          zva, tmp1               // 7
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    dc          zva, tmp1               // 8
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    dc          zva, tmp1               // 9
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    dc          zva, tmp1               // 10
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    dc          zva, tmp1               // 11
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    dc          zva, tmp1               // 12
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    dc          zva, tmp1               // 13
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    dc          zva, tmp1               // 14
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    dc          zva, tmp1               // 15
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    dc          zva, tmp1               // 16
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    dc          zva, tmp1               // 17
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    dc          zva, tmp1               // 18
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    dc          zva, tmp1               // 19
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    dc          zva, tmp1               // 20
+
+L(L2_vl_64): // VL64 unroll8
+    cmp         vector_length, 64
+    b.ne        L(L2_vl_32)
+    ptrue       p0.b
+    .p2align 4
+1:  st1b        {z0.b}, p0, [dst]
+    st1b        {z0.b}, p0, [dst, #1, mul vl]
+    st1b        {z0.b}, p0, [dst, #2, mul vl]
+    st1b        {z0.b}, p0, [dst, #3, mul vl]
+    mov         tmp2, CACHE_LINE_SIZE * 20
+    add         tmp2, dst, tmp2
+    dc          zva, tmp2       // distance CACHE_LINE_SIZE * 20
+    st1b        {z0.b}, p0, [dst, #4, mul vl]
+    st1b        {z0.b}, p0, [dst, #5, mul vl]
+    st1b        {z0.b}, p0, [dst, #6, mul vl]
+    st1b        {z0.b}, p0, [dst, #7, mul vl]
+    add         tmp2, tmp2, CACHE_LINE_SIZE
+    dc          zva, tmp2       // distance CACHE_LINE_SIZE * 21
+    add         dst, dst, 512
+    sub         rest, rest, 512
+    cmp         rest, L2_SIZE
+    b.ge        1b
+
+L(L2_vl_32): // VL32 unroll6
+    cmp         vector_length, 32
+    b.ne        L(L2_vl_16)
+    ptrue       p0.b
+    .p2align 4
+1:  st1b        {z0.b}, p0, [dst]
+    st1b        {z0.b}, p0, [dst, #1, mul vl]
+    st1b        {z0.b}, p0, [dst, #2, mul vl]
+    st1b        {z0.b}, p0, [dst, #3, mul vl]
+    st1b        {z0.b}, p0, [dst, #4, mul vl]
+    st1b        {z0.b}, p0, [dst, #5, mul vl]
+    st1b        {z0.b}, p0, [dst, #6, mul vl]
+    st1b        {z0.b}, p0, [dst, #7, mul vl]
+    mov         tmp2, CACHE_LINE_SIZE * 21
+    add         tmp2, dst, tmp2
+    dc          zva, tmp2       // distance CACHE_LINE_SIZE * 21
+    add         dst, dst, CACHE_LINE_SIZE
+    st1b        {z0.b}, p0, [dst]
+    st1b        {z0.b}, p0, [dst, #1, mul vl]
+    st1b        {z0.b}, p0, [dst, #2, mul vl]
+    st1b        {z0.b}, p0, [dst, #3, mul vl]
+    st1b        {z0.b}, p0, [dst, #4, mul vl]
+    st1b        {z0.b}, p0, [dst, #5, mul vl]
+    st1b        {z0.b}, p0, [dst, #6, mul vl]
+    st1b        {z0.b}, p0, [dst, #7, mul vl]
+    add         tmp2, tmp2, CACHE_LINE_SIZE
+    dc          zva, tmp2       // distance CACHE_LINE_SIZE * 22
+    add         dst, dst, CACHE_LINE_SIZE
+    sub         rest, rest, 512
+    cmp         rest, L2_SIZE
+    b.ge        1b
+
+L(L2_vl_16):  // VL16 unroll32
+    cmp         vector_length, 16
+    b.ne        L(L1_prefetch)
+    ptrue       p0.b
+    .p2align 4
+1:  add         dst, dst, 128
+    st1b        {z0.b}, p0, [dst, #-8, mul vl]
+    st1b        {z0.b}, p0, [dst, #-7, mul vl]
+    st1b        {z0.b}, p0, [dst, #-6, mul vl]
+    st1b        {z0.b}, p0, [dst, #-5, mul vl]
+    st1b        {z0.b}, p0, [dst, #-4, mul vl]
+    st1b        {z0.b}, p0, [dst, #-3, mul vl]
+    st1b        {z0.b}, p0, [dst, #-2, mul vl]
+    st1b        {z0.b}, p0, [dst, #-1, mul vl]
+    st1b        {z0.b}, p0, [dst]
+    st1b        {z0.b}, p0, [dst, #1, mul vl]
+    st1b        {z0.b}, p0, [dst, #2, mul vl]
+    st1b        {z0.b}, p0, [dst, #3, mul vl]
+    st1b        {z0.b}, p0, [dst, #4, mul vl]
+    st1b        {z0.b}, p0, [dst, #5, mul vl]
+    st1b        {z0.b}, p0, [dst, #6, mul vl]
+    st1b        {z0.b}, p0, [dst, #7, mul vl]
+    mov         tmp2, CACHE_LINE_SIZE * 20
+    add         tmp2, dst, tmp2
+    dc          zva, tmp2       // distance CACHE_LINE_SIZE * 20
+    add         dst, dst, CACHE_LINE_SIZE
+    st1b        {z0.b}, p0, [dst, #-8, mul vl]
+    st1b        {z0.b}, p0, [dst, #-7, mul vl]
+    st1b        {z0.b}, p0, [dst, #-6, mul vl]
+    st1b        {z0.b}, p0, [dst, #-5, mul vl]
+    st1b        {z0.b}, p0, [dst, #-4, mul vl]
+    st1b        {z0.b}, p0, [dst, #-3, mul vl]
+    st1b        {z0.b}, p0, [dst, #-2, mul vl]
+    st1b        {z0.b}, p0, [dst, #-1, mul vl]
+    st1b        {z0.b}, p0, [dst]
+    st1b        {z0.b}, p0, [dst, #1, mul vl]
+    st1b        {z0.b}, p0, [dst, #2, mul vl]
+    st1b        {z0.b}, p0, [dst, #3, mul vl]
+    st1b        {z0.b}, p0, [dst, #4, mul vl]
+    st1b        {z0.b}, p0, [dst, #5, mul vl]
+    st1b        {z0.b}, p0, [dst, #6, mul vl]
+    st1b        {z0.b}, p0, [dst, #7, mul vl]
+    add         tmp2, tmp2, CACHE_LINE_SIZE
+    dc          zva, tmp2       // distance CACHE_LINE_SIZE * 21
+    add         dst, dst, 128
+    sub         rest, rest, 512
+    cmp         rest, L2_SIZE
+    b.ge        1b
+
+L(L1_prefetch): // if rest >= L1_SIZE
+    cmp         rest, L1_SIZE
+    b.cc        L(vl_agnostic)
+L(L1_vl_64):
+    cmp         vector_length, 64
+    b.ne        L(L1_vl_32)
+    ptrue       p0.b
+    .p2align 4
+1:  st1b        {z0.b}, p0, [dst]
+    st1b        {z0.b}, p0, [dst, #1, mul vl]
+    st1b        {z0.b}, p0, [dst, #2, mul vl]
+    st1b        {z0.b}, p0, [dst, #3, mul vl]
+    mov         tmp1, PF_DIST_L1
+    prfm        pstl1keep, [dst, tmp1]
+    mov         tmp1, PF_DIST_L2
+    prfm        pstl2keep, [dst, tmp1]
+    st1b        {z0.b}, p0, [dst, #4, mul vl]
+    st1b        {z0.b}, p0, [dst, #5, mul vl]
+    st1b        {z0.b}, p0, [dst, #6, mul vl]
+    st1b        {z0.b}, p0, [dst, #7, mul vl]
+    mov         tmp1, PF_DIST_L1 + CACHE_LINE_SIZE
+    prfm        pstl1keep, [dst, tmp1]
+    mov         tmp1, PF_DIST_L2 + CACHE_LINE_SIZE
+    prfm        pstl2keep, [dst, tmp1]
+    add         dst, dst, 512
+    sub         rest, rest, 512
+    cmp         rest, L1_SIZE
+    b.ge        1b
+
+L(L1_vl_32):
+    cmp         vector_length, 32
+    b.ne        L(L1_vl_16)
+    ptrue       p0.b
+    .p2align 4
+1:  st1b        {z0.b}, p0, [dst]
+    st1b        {z0.b}, p0, [dst, #1, mul vl]
+    st1b        {z0.b}, p0, [dst, #2, mul vl]
+    st1b        {z0.b}, p0, [dst, #3, mul vl]
+    st1b        {z0.b}, p0, [dst, #4, mul vl]
+    st1b        {z0.b}, p0, [dst, #5, mul vl]
+    st1b        {z0.b}, p0, [dst, #6, mul vl]
+    st1b        {z0.b}, p0, [dst, #7, mul vl]
+    mov         tmp1, PF_DIST_L1
+    prfm        pstl1keep, [dst, tmp1]
+    mov         tmp1, PF_DIST_L2
+    prfm        pstl2keep, [dst, tmp1]
+    add         dst, dst, CACHE_LINE_SIZE
+    st1b        {z0.b}, p0, [dst]
+    st1b        {z0.b}, p0, [dst, #1, mul vl]
+    st1b        {z0.b}, p0, [dst, #2, mul vl]
+    st1b        {z0.b}, p0, [dst, #3, mul vl]
+    st1b        {z0.b}, p0, [dst, #4, mul vl]
+    st1b        {z0.b}, p0, [dst, #5, mul vl]
+    st1b        {z0.b}, p0, [dst, #6, mul vl]
+    st1b        {z0.b}, p0, [dst, #7, mul vl]
+    mov         tmp1, PF_DIST_L1 + CACHE_LINE_SIZE
+    prfm        pstl1keep, [dst, tmp1]
+    mov         tmp1, PF_DIST_L2 + CACHE_LINE_SIZE
+    prfm        pstl2keep, [dst, tmp1]
+    add         dst, dst, CACHE_LINE_SIZE
+    sub         rest, rest, 512
+    cmp         rest, L1_SIZE
+    b.ge        1b
+
+L(L1_vl_16):  // VL16 unroll32
+    cmp         vector_length, 16
+    b.ne        L(vl_agnostic)
+    ptrue       p0.b
+    .p2align 4
+1:  mov         tmp1, dst
+    add         dst, dst, 128
+    st1b        {z0.b}, p0, [dst, #-8, mul vl]
+    st1b        {z0.b}, p0, [dst, #-7, mul vl]
+    st1b        {z0.b}, p0, [dst, #-6, mul vl]
+    st1b        {z0.b}, p0, [dst, #-5, mul vl]
+    st1b        {z0.b}, p0, [dst, #-4, mul vl]
+    st1b        {z0.b}, p0, [dst, #-3, mul vl]
+    st1b        {z0.b}, p0, [dst, #-2, mul vl]
+    st1b        {z0.b}, p0, [dst, #-1, mul vl]
+    st1b        {z0.b}, p0, [dst]
+    st1b        {z0.b}, p0, [dst, #1, mul vl]
+    st1b        {z0.b}, p0, [dst, #2, mul vl]
+    st1b        {z0.b}, p0, [dst, #3, mul vl]
+    st1b        {z0.b}, p0, [dst, #4, mul vl]
+    st1b        {z0.b}, p0, [dst, #5, mul vl]
+    st1b        {z0.b}, p0, [dst, #6, mul vl]
+    st1b        {z0.b}, p0, [dst, #7, mul vl]
+    mov         tmp1, PF_DIST_L1
+    prfm        pstl1keep, [dst, tmp1]
+    mov         tmp1, PF_DIST_L2
+    prfm        pstl2keep, [dst, tmp1]
+    add         dst, dst, CACHE_LINE_SIZE
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    st1b        {z0.b}, p0, [dst, #-8, mul vl]
+    st1b        {z0.b}, p0, [dst, #-7, mul vl]
+    st1b        {z0.b}, p0, [dst, #-6, mul vl]
+    st1b        {z0.b}, p0, [dst, #-5, mul vl]
+    st1b        {z0.b}, p0, [dst, #-4, mul vl]
+    st1b        {z0.b}, p0, [dst, #-3, mul vl]
+    st1b        {z0.b}, p0, [dst, #-2, mul vl]
+    st1b        {z0.b}, p0, [dst, #-1, mul vl]
+    st1b        {z0.b}, p0, [dst]
+    st1b        {z0.b}, p0, [dst, #1, mul vl]
+    st1b        {z0.b}, p0, [dst, #2, mul vl]
+    st1b        {z0.b}, p0, [dst, #3, mul vl]
+    st1b        {z0.b}, p0, [dst, #4, mul vl]
+    st1b        {z0.b}, p0, [dst, #5, mul vl]
+    st1b        {z0.b}, p0, [dst, #6, mul vl]
+    st1b        {z0.b}, p0, [dst, #7, mul vl]
+    mov         tmp1, PF_DIST_L1 + CACHE_LINE_SIZE
+    prfm        pstl1keep, [dst, tmp1]
+    mov         tmp1, PF_DIST_L2 + CACHE_LINE_SIZE
+    prfm        pstl2keep, [dst, tmp1]
+    add         dst, dst, 128
+    sub         rest, rest, 512
+    cmp         rest, L1_SIZE
+    b.ge        1b
+
+    // VL Agnostic
+L(vl_agnostic):
+L(unroll32):
+    ptrue       p0.b
+    lsl         tmp1, vector_length, 3  // vector_length * 8
+    lsl         tmp2, vector_length, 5  // vector_length * 32
+    .p2align 4
+1:  cmp         rest, tmp2
+    b.cc        L(unroll16)
+    st1b        {z0.b}, p0, [dst]
+    st1b        {z0.b}, p0, [dst, #1, mul vl]
+    st1b        {z0.b}, p0, [dst, #2, mul vl]
+    st1b        {z0.b}, p0, [dst, #3, mul vl]
+    st1b        {z0.b}, p0, [dst, #4, mul vl]
+    st1b        {z0.b}, p0, [dst, #5, mul vl]
+    st1b        {z0.b}, p0, [dst, #6, mul vl]
+    st1b        {z0.b}, p0, [dst, #7, mul vl]
+    add         dst, dst, tmp1
+    st1b        {z0.b}, p0, [dst]
+    st1b        {z0.b}, p0, [dst, #1, mul vl]
+    st1b        {z0.b}, p0, [dst, #2, mul vl]
+    st1b        {z0.b}, p0, [dst, #3, mul vl]
+    st1b        {z0.b}, p0, [dst, #4, mul vl]
+    st1b        {z0.b}, p0, [dst, #5, mul vl]
+    st1b        {z0.b}, p0, [dst, #6, mul vl]
+    st1b        {z0.b}, p0, [dst, #7, mul vl]
+    add         dst, dst, tmp1
+    st1b        {z0.b}, p0, [dst]
+    st1b        {z0.b}, p0, [dst, #1, mul vl]
+    st1b        {z0.b}, p0, [dst, #2, mul vl]
+    st1b        {z0.b}, p0, [dst, #3, mul vl]
+    st1b        {z0.b}, p0, [dst, #4, mul vl]
+    st1b        {z0.b}, p0, [dst, #5, mul vl]
+    st1b        {z0.b}, p0, [dst, #6, mul vl]
+    st1b        {z0.b}, p0, [dst, #7, mul vl]
+    add         dst, dst, tmp1
+    st1b        {z0.b}, p0, [dst]
+    st1b        {z0.b}, p0, [dst, #1, mul vl]
+    st1b        {z0.b}, p0, [dst, #2, mul vl]
+    st1b        {z0.b}, p0, [dst, #3, mul vl]
+    st1b        {z0.b}, p0, [dst, #4, mul vl]
+    st1b        {z0.b}, p0, [dst, #5, mul vl]
+    st1b        {z0.b}, p0, [dst, #6, mul vl]
+    st1b        {z0.b}, p0, [dst, #7, mul vl]
+    add         dst, dst, tmp1
+    sub         rest, rest, tmp2
+    b           1b
+
+L(unroll16):
+    ptrue       p0.b
+    lsl         tmp1, vector_length, 3  // vector_length * 8
+    lsl         tmp2, vector_length, 4  // vector_length * 16
+    .p2align 4
+1:  cmp         rest, tmp2
+    b.cc        L(unroll8)
+    st1b        {z0.b}, p0, [dst]
+    st1b        {z0.b}, p0, [dst, #1, mul vl]
+    st1b        {z0.b}, p0, [dst, #2, mul vl]
+    st1b        {z0.b}, p0, [dst, #3, mul vl]
+    st1b        {z0.b}, p0, [dst, #4, mul vl]
+    st1b        {z0.b}, p0, [dst, #5, mul vl]
+    st1b        {z0.b}, p0, [dst, #6, mul vl]
+    st1b        {z0.b}, p0, [dst, #7, mul vl]
+    add         dst, dst, tmp1
+    st1b        {z0.b}, p0, [dst]
+    st1b        {z0.b}, p0, [dst, #1, mul vl]
+    st1b        {z0.b}, p0, [dst, #2, mul vl]
+    st1b        {z0.b}, p0, [dst, #3, mul vl]
+    st1b        {z0.b}, p0, [dst, #4, mul vl]
+    st1b        {z0.b}, p0, [dst, #5, mul vl]
+    st1b        {z0.b}, p0, [dst, #6, mul vl]
+    st1b        {z0.b}, p0, [dst, #7, mul vl]
+    add         dst, dst, tmp1
+    sub         rest, rest, tmp2
+    b           1b
+
+L(unroll8):
+    lsl         tmp1, vector_length, 3
+    ptrue       p0.b
+    .p2align 4
+1:  cmp         rest, tmp1
+    b.cc        L(unroll4)
+    st1b        {z0.b}, p0, [dst]
+    st1b        {z0.b}, p0, [dst, #1, mul vl]
+    st1b        {z0.b}, p0, [dst, #2, mul vl]
+    st1b        {z0.b}, p0, [dst, #3, mul vl]
+    st1b        {z0.b}, p0, [dst, #4, mul vl]
+    st1b        {z0.b}, p0, [dst, #5, mul vl]
+    st1b        {z0.b}, p0, [dst, #6, mul vl]
+    st1b        {z0.b}, p0, [dst, #7, mul vl]
+    add         dst, dst, tmp1
+    sub         rest, rest, tmp1
+    b           1b
+
+L(unroll4):
+    lsl         tmp1, vector_length, 2
+    ptrue       p0.b
+    .p2align 4
+1:  cmp         rest, tmp1
+    b.cc        L(unroll2)
+    st1b        {z0.b}, p0, [dst]
+    st1b        {z0.b}, p0, [dst, #1, mul vl]
+    st1b        {z0.b}, p0, [dst, #2, mul vl]
+    st1b        {z0.b}, p0, [dst, #3, mul vl]
+    add         dst, dst, tmp1
+    sub         rest, rest, tmp1
+    b           1b
+
+L(unroll2):
+    lsl         tmp1, vector_length, 1
+    ptrue       p0.b
+    .p2align 4
+1:  cmp         rest, tmp1
+    b.cc        L(unroll1)
+    st1b        {z0.b}, p0, [dst]
+    st1b        {z0.b}, p0, [dst, #1, mul vl]
+    add         dst, dst, tmp1
+    sub         rest, rest, tmp1
+    b           1b
+
+L(unroll1):
+    ptrue       p0.b
+    .p2align 4
+1:  cmp         rest, vector_length
+    b.cc        L(last)
+    st1b        {z0.b}, p0, [dst]
+    sub         rest, rest, vector_length
+    add         dst, dst, vector_length
+    b           1b
+
+    .p2align 4
+L(last):
+    whilelt     p0.b, xzr, rest
+    st1b        z0.b, p0, [dst]
+    ret
+
+END (MEMSET)
+libc_hidden_builtin_def (MEMSET)
+
+#endif /* IS_IN (libc) */
+#endif /* HAVE_SVE_ASM_SUPPORT */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 4/5] scripts: Added Vector Length Set test helper script
  2021-03-17  2:28 [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX Naohiro Tamura
                   ` (2 preceding siblings ...)
  2021-03-17  2:34 ` [PATCH 3/5] aarch64: Added optimized memset " Naohiro Tamura
@ 2021-03-17  2:35 ` Naohiro Tamura
  2021-03-29 13:20   ` Szabolcs Nagy via Libc-alpha
  2021-03-17  2:35 ` [PATCH 5/5] benchtests: Added generic_memcpy and generic_memmove to large benchtests Naohiro Tamura
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 72+ messages in thread
From: Naohiro Tamura @ 2021-03-17  2:35 UTC (permalink / raw)
  To: libc-alpha; +Cc: Naohiro Tamura

From: Naohiro Tamura <naohirot@jp.fujitsu.com>

This patch adds a test helper script that changes the Vector Length
for a child process. The script can be used as test-wrapper for 'make
check'.

Usage examples:

ubuntu@bionic:~/build$ make check subdirs=string \
test-wrapper='~/glibc/scripts/vltest.py 16'

ubuntu@bionic:~/build$ ~/glibc/scripts/vltest.py 16 make test \
t=string/test-memcpy

ubuntu@bionic:~/build$ ~/glibc/scripts/vltest.py 32 ./debugglibc.sh \
string/test-memmove

ubuntu@bionic:~/build$ ~/glibc/scripts/vltest.py 64 ./testrun.sh
string/test-memset
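
For reference, the underlying mechanism is the PR_SVE_SET_VL prctl
with the ONEXEC and INHERIT flags, issued right before exec'ing the
child.  A hedged C equivalent of what the script does (illustrative
only; run_with_vl is a made-up name, the constants are those from
linux/prctl.h):

  #include <sys/prctl.h>
  #include <unistd.h>

  #ifndef PR_SVE_SET_VL
  # define PR_SVE_SET_VL        50
  # define PR_SVE_SET_VL_ONEXEC (1 << 18)
  # define PR_SVE_VL_INHERIT    (1 << 17)
  #endif

  static int
  run_with_vl (int vl, char *const argv[])
  {
    /* Request the new VL for the exec'd process and let its children
       inherit it.  */
    if (prctl (PR_SVE_SET_VL,
               vl | PR_SVE_SET_VL_ONEXEC | PR_SVE_VL_INHERIT, 0, 0, 0) < 0)
      return -1;
    execvp (argv[0], argv);
    return -1;  /* exec failed.  */
  }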
---
 scripts/vltest.py | 82 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 82 insertions(+)
 create mode 100755 scripts/vltest.py

diff --git a/scripts/vltest.py b/scripts/vltest.py
new file mode 100755
index 0000000000..264dfa449f
--- /dev/null
+++ b/scripts/vltest.py
@@ -0,0 +1,82 @@
+#!/usr/bin/python3
+# Set Scalable Vector Length test helper
+# Copyright (C) 2019-2021 Free Software Foundation, Inc.
+# This file is part of the GNU C Library.
+#
+# The GNU C Library is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# The GNU C Library is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with the GNU C Library; if not, see
+# <https://www.gnu.org/licenses/>.
+"""Set Scalable Vector Length test helper.
+
+Set Scalable Vector Length for child process.
+
+examples:
+
+ubuntu@bionic:~/build$ make check subdirs=string \
+test-wrapper='~/glibc/scripts/vltest.py 16'
+
+ubuntu@bionic:~/build$ ~/glibc/scripts/vltest.py 16 make test \
+t=string/test-memcpy
+
+ubuntu@bionic:~/build$ ~/glibc/scripts/vltest.py 32 ./debugglibc.sh \
+string/test-memmove
+
+ubuntu@bionic:~/build$ ~/glibc/scripts/vltest.py 64 ./testrun.sh \
+string/test-memset
+"""
+import argparse
+from ctypes import cdll, CDLL
+import os
+import sys
+
+EXIT_SUCCESS = 0
+EXIT_FAILURE = 1
+EXIT_UNSUPPORTED = 77
+
+AT_HWCAP = 16
+HWCAP_SVE = (1 << 22)
+
+PR_SVE_GET_VL = 51
+PR_SVE_SET_VL = 50
+PR_SVE_SET_VL_ONEXEC = (1 << 18)
+PR_SVE_VL_INHERIT = (1 << 17)
+PR_SVE_VL_LEN_MASK = 0xffff
+
+def main(args):
+    libc = CDLL("libc.so.6")
+    if not libc.getauxval(AT_HWCAP) & HWCAP_SVE:
+        print("CPU doesn't support SVE")
+        sys.exit(EXIT_UNSUPPORTED)
+
+    libc.prctl(PR_SVE_SET_VL,
+               args.vl[0] | PR_SVE_SET_VL_ONEXEC | PR_SVE_VL_INHERIT)
+    os.execvp(args.args[0], args.args)
+    print("exec system call failure")
+    sys.exit(EXIT_FAILURE)
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser(description=
+            "Set Scalable Vector Length test helper",
+            formatter_class=argparse.ArgumentDefaultsHelpFormatter)
+
+    # positional argument
+    parser.add_argument("vl", nargs=1, type=int,
+                        choices=range(16, 257, 16),
+                        help=('vector length '\
+                              'which is multiples of 16 from 16 to 256'))
+    # remainder arguments
+    parser.add_argument('args', nargs=argparse.REMAINDER,
+                        help=('args '\
+                              'which is passed to child process'))
+    args = parser.parse_args()
+    main(args)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH 5/5] benchtests: Added generic_memcpy and generic_memmove to large benchtests
  2021-03-17  2:28 [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX Naohiro Tamura
                   ` (3 preceding siblings ...)
  2021-03-17  2:35 ` [PATCH 4/5] scripts: Added Vector Length Set test helper script Naohiro Tamura
@ 2021-03-17  2:35 ` Naohiro Tamura
  2021-03-29 12:03 ` [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX Szabolcs Nagy via Libc-alpha
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 72+ messages in thread
From: Naohiro Tamura @ 2021-03-17  2:35 UTC (permalink / raw)
  To: libc-alpha

This patch adds generic_memcpy and generic_memmove to
bench-memcpy-large.c and bench-memmove-large.c respectively, so that
we can consistently compare the performance of the 512-bit scalable
vector register implementation with the scalar 64-bit register
implementation across the memcpy/memmove/memset default and large
benchtests.
---
 benchtests/bench-memcpy-large.c  | 9 +++++++++
 benchtests/bench-memmove-large.c | 9 +++++++++
 2 files changed, 18 insertions(+)

diff --git a/benchtests/bench-memcpy-large.c b/benchtests/bench-memcpy-large.c
index 3df1575514..4a87987202 100644
--- a/benchtests/bench-memcpy-large.c
+++ b/benchtests/bench-memcpy-large.c
@@ -25,7 +25,10 @@
 # define TIMEOUT (20 * 60)
 # include "bench-string.h"
 
+void *generic_memcpy (void *, const void *, size_t);
+
 IMPL (memcpy, 1)
+IMPL (generic_memcpy, 0)
 #endif
 
 #include "json-lib.h"
@@ -124,3 +127,9 @@ test_main (void)
 }
 
 #include <support/test-driver.c>
+
+#define libc_hidden_builtin_def(X)
+#undef MEMCPY
+#define MEMCPY generic_memcpy
+#include <string/memcpy.c>
+#include <string/wordcopy.c>
diff --git a/benchtests/bench-memmove-large.c b/benchtests/bench-memmove-large.c
index 9e2fcd50ab..151dd5a276 100644
--- a/benchtests/bench-memmove-large.c
+++ b/benchtests/bench-memmove-large.c
@@ -25,7 +25,10 @@
 #include "bench-string.h"
 #include "json-lib.h"
 
+void *generic_memmove (void *, const void *, size_t);
+
 IMPL (memmove, 1)
+IMPL (generic_memmove, 0)
 
 typedef char *(*proto_t) (char *, const char *, size_t);
 
@@ -123,3 +126,9 @@ test_main (void)
 }
 
 #include <support/test-driver.c>
+
+#define libc_hidden_builtin_def(X)
+#undef MEMMOVE
+#define MEMMOVE generic_memmove
+#include <string/memmove.c>
+#include <string/wordcopy.c>
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-03-17  2:28 [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX Naohiro Tamura
                   ` (4 preceding siblings ...)
  2021-03-17  2:35 ` [PATCH 5/5] benchtests: Added generic_memcpy and generic_memmove to large benchtests Naohiro Tamura
@ 2021-03-29 12:03 ` Szabolcs Nagy via Libc-alpha
  2021-05-10  1:45 ` naohirot
  2021-05-12  9:23 ` [PATCH v2 0/6] aarch64: " Naohiro Tamura
  7 siblings, 0 replies; 72+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-03-29 12:03 UTC (permalink / raw)
  To: Naohiro Tamura; +Cc: libc-alpha

The 03/17/2021 02:28, Naohiro Tamura wrote:
> Fujitsu is in the process of signing the copyright assignment paper.
> We'd like to have some feedback in advance.

thanks for these patches, please let me know when the
copyright is sorted out. i will do some review now.


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 1/5] config: Added HAVE_SVE_ASM_SUPPORT for aarch64
  2021-03-17  2:33 ` [PATCH 1/5] config: Added HAVE_SVE_ASM_SUPPORT for aarch64 Naohiro Tamura
@ 2021-03-29 12:11   ` Szabolcs Nagy via Libc-alpha
  2021-03-30  6:19     ` naohirot
  0 siblings, 1 reply; 72+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-03-29 12:11 UTC (permalink / raw)
  To: Naohiro Tamura; +Cc: Naohiro Tamura, libc-alpha

The 03/17/2021 02:33, Naohiro Tamura wrote:
> From: Naohiro Tamura <naohirot@jp.fujitsu.com>
> 
> This patch checks if assembler supports '-march=armv8.2-a+sve' to
> generate SVE code or not, and then define HAVE_SVE_ASM_SUPPORT macro.
> ---
>  config.h.in                  |  3 +++
>  sysdeps/aarch64/configure    | 28 ++++++++++++++++++++++++++++
>  sysdeps/aarch64/configure.ac | 15 +++++++++++++++
>  3 files changed, 46 insertions(+)
> 
> diff --git a/config.h.in b/config.h.in
> index f21bf04e47..2073816af8 100644
> --- a/config.h.in
> +++ b/config.h.in
> @@ -118,6 +118,9 @@
>  /* AArch64 PAC-RET code generation is enabled.  */
>  #define HAVE_AARCH64_PAC_RET 0
>  
> +/* Assembler support ARMv8.2-A SVE */
> +#define HAVE_SVE_ASM_SUPPORT 0
> +

i prefer to use HAVE_AARCH64_ prefix for aarch64 specific
macros in the global config.h, e.g. HAVE_AARCH64_SVE_ASM

and i'd like to have a comment here or in configure.ac with the
binutils version where this becomes obsolete (binutils 2.28 i
think). right now the minimum required version is 2.25, but
glibc may increase that soon to above 2.28.

> diff --git a/sysdeps/aarch64/configure.ac b/sysdeps/aarch64/configure.ac
> index 66f755078a..389a0b4e8d 100644
> --- a/sysdeps/aarch64/configure.ac
> +++ b/sysdeps/aarch64/configure.ac
> @@ -90,3 +90,18 @@ EOF
>    fi
>    rm -rf conftest.*])
>  LIBC_CONFIG_VAR([aarch64-variant-pcs], [$libc_cv_aarch64_variant_pcs])
> +
> +# Check if asm support armv8.2-a+sve
> +AC_CACHE_CHECK(for SVE support in assembler, libc_cv_asm_sve, [dnl
> +cat > conftest.s <<\EOF
> +        ptrue p0.b
> +EOF
> +if AC_TRY_COMMAND(${CC-cc} -c -march=armv8.2-a+sve conftest.s 1>&AS_MESSAGE_LOG_FD); then
> +  libc_cv_asm_sve=yes
> +else
> +  libc_cv_asm_sve=no
> +fi
> +rm -f conftest*])
> +if test $libc_cv_asm_sve = yes; then
> +  AC_DEFINE(HAVE_SVE_ASM_SUPPORT)
> +fi

i would use libc_cv_aarch64_sve_asm to make it obvious
that it's aarch64 specific setting.

otherwise OK.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 2/5] aarch64: Added optimized memcpy and memmove for A64FX
  2021-03-17  2:34 ` [PATCH 2/5] aarch64: Added optimized memcpy and memmove for A64FX Naohiro Tamura
@ 2021-03-29 12:44   ` Szabolcs Nagy via Libc-alpha
  2021-03-30  7:17     ` naohirot
  0 siblings, 1 reply; 72+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-03-29 12:44 UTC (permalink / raw)
  To: Naohiro Tamura; +Cc: Naohiro Tamura, libc-alpha

The 03/17/2021 02:34, Naohiro Tamura wrote:
> And also we confirmed that the SVE 512 bit vector register performance
> is roughly 4 times better than Advanced SIMD 128 bit register and 8
> times better than scalar 64 bit register by running 'make bench'.

nice speed up. i won't comment on the memcpy asm now.

> diff --git a/manual/tunables.texi b/manual/tunables.texi
> index 1b746c0fa1..81ed5366fc 100644
> --- a/manual/tunables.texi
> +++ b/manual/tunables.texi
> @@ -453,7 +453,8 @@ This tunable is specific to powerpc, powerpc64 and powerpc64le.
>  The @code{glibc.cpu.name=xxx} tunable allows the user to tell @theglibc{} to
>  assume that the CPU is @code{xxx} where xxx may have one of these values:
>  @code{generic}, @code{falkor}, @code{thunderxt88}, @code{thunderx2t99},
> -@code{thunderx2t99p1}, @code{ares}, @code{emag}, @code{kunpeng}.
> +@code{thunderx2t99p1}, @code{ares}, @code{emag}, @code{kunpeng},
> +@code{a64fx}.

OK.

> --- a/sysdeps/aarch64/multiarch/Makefile
> +++ b/sysdeps/aarch64/multiarch/Makefile
> @@ -1,6 +1,6 @@
>  ifeq ($(subdir),string)
>  sysdep_routines += memcpy_generic memcpy_advsimd memcpy_thunderx memcpy_thunderx2 \
> -		   memcpy_falkor \
> +		   memcpy_falkor memcpy_a64fx \
>  		   memset_generic memset_falkor memset_emag memset_kunpeng \

OK.

> --- a/sysdeps/aarch64/multiarch/ifunc-impl-list.c
> +++ b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
> @@ -25,7 +25,11 @@
>  #include <stdio.h>
>  
>  /* Maximum number of IFUNC implementations.  */
> -#define MAX_IFUNC	4
> +#if HAVE_SVE_ASM_SUPPORT
> +# define MAX_IFUNC	7
> +#else
> +# define MAX_IFUNC	6
> +#endif

hm this MAX_IFUNC looks a bit problematic: currently its only
use is to detect if a target requires more ifuncs than the
array passed to __libc_ifunc_impl_list, but for that ideally
it would be automatic, not manually maintained.

i would just define it to 7 unconditionally (the maximum over
valid configurations).

>  size_t
>  __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
> @@ -43,12 +47,18 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
>  	      IFUNC_IMPL_ADD (array, i, memcpy, !bti, __memcpy_thunderx2)
>  	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_falkor)
>  	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_simd)
> +#if HAVE_SVE_ASM_SUPPORT
> +	      IFUNC_IMPL_ADD (array, i, memcpy, sve, __memcpy_a64fx)
> +#endif

OK.

>  	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_generic))
>    IFUNC_IMPL (i, name, memmove,
>  	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_thunderx)
>  	      IFUNC_IMPL_ADD (array, i, memmove, !bti, __memmove_thunderx2)
>  	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_falkor)
>  	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_simd)
> +#if HAVE_SVE_ASM_SUPPORT
> +	      IFUNC_IMPL_ADD (array, i, memmove, sve, __memmove_a64fx)
> +#endif

OK.

>  	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_generic))
>    IFUNC_IMPL (i, name, memset,
>  	      /* Enable this on non-falkor processors too so that other cores
> diff --git a/sysdeps/aarch64/multiarch/init-arch.h b/sysdeps/aarch64/multiarch/init-arch.h
> index a167699e74..d20e7e1b8e 100644
> --- a/sysdeps/aarch64/multiarch/init-arch.h
> +++ b/sysdeps/aarch64/multiarch/init-arch.h
> @@ -33,4 +33,6 @@
>    bool __attribute__((unused)) bti =					      \
>      HAVE_AARCH64_BTI && GLRO(dl_aarch64_cpu_features).bti;		      \
>    bool __attribute__((unused)) mte =					      \
> -    MTE_ENABLED ();
> +    MTE_ENABLED ();							      \
> +  unsigned __attribute__((unused)) sve =				      \
> +    GLRO(dl_aarch64_cpu_features).sve;

i would use bool here.

> diff --git a/sysdeps/aarch64/multiarch/memcpy.c b/sysdeps/aarch64/multiarch/memcpy.c
> index 0e0a5cbcfb..0006f38eb0 100644
> --- a/sysdeps/aarch64/multiarch/memcpy.c
> +++ b/sysdeps/aarch64/multiarch/memcpy.c
> @@ -33,6 +33,9 @@ extern __typeof (__redirect_memcpy) __memcpy_simd attribute_hidden;
>  extern __typeof (__redirect_memcpy) __memcpy_thunderx attribute_hidden;
>  extern __typeof (__redirect_memcpy) __memcpy_thunderx2 attribute_hidden;
>  extern __typeof (__redirect_memcpy) __memcpy_falkor attribute_hidden;
> +#if HAVE_SVE_ASM_SUPPORT
> +extern __typeof (__redirect_memcpy) __memcpy_a64fx attribute_hidden;
> +#endif

OK.

>  libc_ifunc (__libc_memcpy,
>              (IS_THUNDERX (midr)
> @@ -44,8 +47,13 @@ libc_ifunc (__libc_memcpy,
>  		  : (IS_NEOVERSE_N1 (midr) || IS_NEOVERSE_N2 (midr)
>  		     || IS_NEOVERSE_V1 (midr)
>  		     ? __memcpy_simd
> -		     : __memcpy_generic)))));
> -
> +#if HAVE_SVE_ASM_SUPPORT
> +                     : (IS_A64FX (midr)
> +                        ? __memcpy_a64fx
> +                        : __memcpy_generic))))));
> +#else
> +                     : __memcpy_generic)))));
> +#endif

OK.

> new file mode 100644
> index 0000000000..23438e4e3d
> --- /dev/null
> +++ b/sysdeps/aarch64/multiarch/memcpy_a64fx.S

skipping this.

> diff --git a/sysdeps/aarch64/multiarch/memmove.c b/sysdeps/aarch64/multiarch/memmove.c
> index 12d77818a9..1e5ee1c934 100644
> --- a/sysdeps/aarch64/multiarch/memmove.c
> +++ b/sysdeps/aarch64/multiarch/memmove.c
> @@ -33,6 +33,9 @@ extern __typeof (__redirect_memmove) __memmove_simd attribute_hidden;
>  extern __typeof (__redirect_memmove) __memmove_thunderx attribute_hidden;
>  extern __typeof (__redirect_memmove) __memmove_thunderx2 attribute_hidden;
>  extern __typeof (__redirect_memmove) __memmove_falkor attribute_hidden;
> +#if HAVE_SVE_ASM_SUPPORT
> +extern __typeof (__redirect_memmove) __memmove_a64fx attribute_hidden;
> +#endif

OK.

>  
>  libc_ifunc (__libc_memmove,
>              (IS_THUNDERX (midr)
> @@ -44,8 +47,13 @@ libc_ifunc (__libc_memmove,
>  		  : (IS_NEOVERSE_N1 (midr) || IS_NEOVERSE_N2 (midr)
>  		     || IS_NEOVERSE_V1 (midr)
>  		     ? __memmove_simd
> -		     : __memmove_generic)))));
> -
> +#if HAVE_SVE_ASM_SUPPORT
> +                     : (IS_A64FX (midr)
> +                        ? __memmove_a64fx
> +                        : __memmove_generic))))));
> +#else
> +                        : __memmove_generic)))));
> +#endif

OK.

> diff --git a/sysdeps/unix/sysv/linux/aarch64/cpu-features.c b/sysdeps/unix/sysv/linux/aarch64/cpu-features.c
> index db6aa3516c..6206a2f618 100644
> --- a/sysdeps/unix/sysv/linux/aarch64/cpu-features.c
> +++ b/sysdeps/unix/sysv/linux/aarch64/cpu-features.c
> @@ -46,6 +46,7 @@ static struct cpu_list cpu_list[] = {
>        {"ares",		 0x411FD0C0},
>        {"emag",		 0x503F0001},
>        {"kunpeng920", 	 0x481FD010},
> +      {"a64fx",		 0x460F0010},
>        {"generic", 	 0x0}

OK.

> +
> +  /* Check if SVE is supported.  */
> +  cpu_features->sve = GLRO (dl_hwcap) & HWCAP_SVE;

OK.

>  }
> diff --git a/sysdeps/unix/sysv/linux/aarch64/cpu-features.h b/sysdeps/unix/sysv/linux/aarch64/cpu-features.h
> index 3b9bfed134..2b322e5414 100644
> --- a/sysdeps/unix/sysv/linux/aarch64/cpu-features.h
> +++ b/sysdeps/unix/sysv/linux/aarch64/cpu-features.h
> @@ -65,6 +65,9 @@
>  #define IS_KUNPENG920(midr) (MIDR_IMPLEMENTOR(midr) == 'H'			   \
>                          && MIDR_PARTNUM(midr) == 0xd01)
>  
> +#define IS_A64FX(midr) (MIDR_IMPLEMENTOR(midr) == 'F'			      \
> +			&& MIDR_PARTNUM(midr) == 0x001)
> +

OK.

>  struct cpu_features
>  {
>    uint64_t midr_el1;
> @@ -72,6 +75,7 @@ struct cpu_features
>    bool bti;
>    /* Currently, the GLIBC memory tagging tunable only defines 8 bits.  */
>    uint8_t mte_state;
> +  bool sve;
>  };

OK.


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 4/5] scripts: Added Vector Length Set test helper script
  2021-03-17  2:35 ` [PATCH 4/5] scripts: Added Vector Length Set test helper script Naohiro Tamura
@ 2021-03-29 13:20   ` Szabolcs Nagy via Libc-alpha
  2021-03-30  7:25     ` naohirot
  0 siblings, 1 reply; 72+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-03-29 13:20 UTC (permalink / raw)
  To: Naohiro Tamura; +Cc: Naohiro Tamura, libc-alpha

The 03/17/2021 02:35, Naohiro Tamura wrote:
> +"""Set Scalable Vector Length test helper.
> +
> +Set Scalable Vector Length for child process.
> +
> +examples:
> +
> +ubuntu@bionic:~/build$ make check subdirs=string \
> +test-wrapper='~/glibc/scripts/vltest.py 16'
> +
> +ubuntu@bionic:~/build$ ~/glibc/scripts/vltest.py 16 make test \
> +t=string/test-memcpy
> +
> +ubuntu@bionic:~/build$ ~/glibc/scripts/vltest.py 32 ./debugglibc.sh \
> +string/test-memmove
> +
> +ubuntu@bionic:~/build$ ~/glibc/scripts/vltest.py 64 ./testrun.sh \
> +string/test-memset
> +"""
> +import argparse
> +from ctypes import cdll, CDLL
> +import os
> +import sys
> +
> +EXIT_SUCCESS = 0
> +EXIT_FAILURE = 1
> +EXIT_UNSUPPORTED = 77
> +
> +AT_HWCAP = 16
> +HWCAP_SVE = (1 << 22)
> +
> +PR_SVE_GET_VL = 51
> +PR_SVE_SET_VL = 50
> +PR_SVE_SET_VL_ONEXEC = (1 << 18)
> +PR_SVE_VL_INHERIT = (1 << 17)
> +PR_SVE_VL_LEN_MASK = 0xffff
> +
> +def main(args):
> +    libc = CDLL("libc.so.6")
> +    if not libc.getauxval(AT_HWCAP) & HWCAP_SVE:
> +        print("CPU doesn't support SVE")
> +        sys.exit(EXIT_UNSUPPORTED)
> +
> +    libc.prctl(PR_SVE_SET_VL,
> +               args.vl[0] | PR_SVE_SET_VL_ONEXEC | PR_SVE_VL_INHERIT)
> +    os.execvp(args.args[0], args.args)
> +    print("exec system call failure")
> +    sys.exit(EXIT_FAILURE)


this only works on a (new enough) glibc based system and python's
CDLL path lookup can fail too (it does not follow the host system
configuration).

but i think there is no simple solution without compiling c code and
this seems useful, so i'm happy to have this script.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [PATCH 1/5] config: Added HAVE_SVE_ASM_SUPPORT for aarch64
  2021-03-29 12:11   ` Szabolcs Nagy via Libc-alpha
@ 2021-03-30  6:19     ` naohirot
  0 siblings, 0 replies; 72+ messages in thread
From: naohirot @ 2021-03-30  6:19 UTC (permalink / raw)
  To: 'Szabolcs Nagy'; +Cc: libc-alpha@sourceware.org

Szabolcs-san,

Thank you for your review.

> > +/* Assembler support ARMv8.2-A SVE */
> > +#define HAVE_SVE_ASM_SUPPORT 0
> > +
> 
> i prefer to use HAVE_AARCH64_ prefix for aarch64 specific macros in the global
> config.h, e.g. HAVE_AARCH64_SVE_ASM

OK, I'll change it to HAVE_AARCH64_SVE_ASM.

> and i'd like to have a comment here or in configue.ac with the binutils version
> where this becomes obsolete (binutils 2.28 i think). right now the minimum
> required version is 2.25, but glibc may increase that soon to above 2.28.

I'll add the comment in config.h.in like this:

+/* Assembler support ARMv8.2-A SVE.
+   This macro becomes obsolete when glibc increases the minimum
+   required version of GNU 'binutils' to 2.28 or later. */
+#define HAVE_AARCH64_SVE_ASM 0

> > diff --git a/sysdeps/aarch64/configure.ac b/sysdeps/aarch64/configure.ac
> > index 66f755078a..389a0b4e8d 100644
> > --- a/sysdeps/aarch64/configure.ac
> > +++ b/sysdeps/aarch64/configure.ac
...
> > +if AC_TRY_COMMAND(${CC-cc} -c -march=armv8.2-a+sve conftest.s
> > +1>&AS_MESSAGE_LOG_FD); then
> > +  libc_cv_asm_sve=yes
> > +else
> > +  libc_cv_asm_sve=no
> > +fi
> > +rm -f conftest*])
> > +if test $libc_cv_asm_sve = yes; then
> > +  AC_DEFINE(HAVE_SVE_ASM_SUPPORT)
> > +fi
> 
> i would use libc_cv_aarch64_sve_asm to make it obvious that it's aarch64 specific
> setting.

OK, I'll change it to libc_cv_aarch64_sve_asm.

Thanks.
Naohiro


^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [PATCH 2/5] aarch64: Added optimized memcpy and memmove for A64FX
  2021-03-29 12:44   ` Szabolcs Nagy via Libc-alpha
@ 2021-03-30  7:17     ` naohirot
  0 siblings, 0 replies; 72+ messages in thread
From: naohirot @ 2021-03-30  7:17 UTC (permalink / raw)
  To: 'Szabolcs Nagy'; +Cc: libc-alpha@sourceware.org

Szabolcs-san,

Thank you for your review.

> >  /* Maximum number of IFUNC implementations.  */
> > -#define MAX_IFUNC	4
> > +#if HAVE_SVE_ASM_SUPPORT
> > +# define MAX_IFUNC	7
> > +#else
> > +# define MAX_IFUNC	6
> > +#endif
> 
> hm this MAX_IFUNC looks a bit problematic: currently its only use is to detect if a
> target requires more ifuncs than the array passed to __libc_ifunc_impl_list, but for
> that ideally it would be automatic, not manually maintained.
> 
> i would just define it to 7 unconditionally (the maximum over valid configurations).

OK, I'll fix it to 7 unconditionally.

> > diff --git a/sysdeps/aarch64/multiarch/init-arch.h b/sysdeps/aarch64/multiarch/init-arch.h
> > index a167699e74..d20e7e1b8e 100644
> > --- a/sysdeps/aarch64/multiarch/init-arch.h
> > +++ b/sysdeps/aarch64/multiarch/init-arch.h
> > @@ -33,4 +33,6 @@
> >    bool __attribute__((unused)) bti =                                      \
> >      HAVE_AARCH64_BTI && GLRO(dl_aarch64_cpu_features).bti;                \
> >    bool __attribute__((unused)) mte =                                      \
> > -    MTE_ENABLED ();
> > +    MTE_ENABLED ();                                                       \
> > +  unsigned __attribute__((unused)) sve =                                  \
> > +    GLRO(dl_aarch64_cpu_features).sve;
> 
> i would use bool here.

I'll fix it to the bool.

> > --- /dev/null
> > +++ b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
> 
> skipping this.

I wait for your review.

Thanks.
Naohiro


^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [PATCH 4/5] scripts: Added Vector Length Set test helper script
  2021-03-29 13:20   ` Szabolcs Nagy via Libc-alpha
@ 2021-03-30  7:25     ` naohirot
  0 siblings, 0 replies; 72+ messages in thread
From: naohirot @ 2021-03-30  7:25 UTC (permalink / raw)
  To: 'Szabolcs Nagy'; +Cc: libc-alpha@sourceware.org

Szabolcs-san,

Thank you for your review.

> > +def main(args):
> > +    libc = CDLL("libc.so.6")
> > +    if not libc.getauxval(AT_HWCAP) & HWCAP_SVE:
> > +        print("CPU doesn't support SVE")
> > +        sys.exit(EXIT_UNSUPPORTED)
> > +
> > +    libc.prctl(PR_SVE_SET_VL,
> > +               args.vl[0] | PR_SVE_SET_VL_ONEXEC | PR_SVE_VL_INHERIT)
> > +    os.execvp(args.args[0], args.args)
> > +    print("exec system call failure")
> > +    sys.exit(EXIT_FAILURE)
> 
> 
> this only works on a (new enough) glibc based system and python's CDLL path
> lookup can fail too (it does not follow the host system configuration).

I see, I didn't notice that.

> but i think there is no simple solution without compiling c code and this seems
> useful, so i'm happy to have this script.

OK, thanks!
Naohiro


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
@ 2021-04-12 12:52 Wilco Dijkstra via Libc-alpha
  2021-04-12 18:53 ` Florian Weimer
  2021-04-13 12:07 ` naohirot
  0 siblings, 2 replies; 72+ messages in thread
From: Wilco Dijkstra via Libc-alpha @ 2021-04-12 12:52 UTC (permalink / raw)
  To: naohirot@fujitsu.com; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi,

I have a few comments about memcpy design (the principles apply equally to memset):

1. Overall the code is too large due to enormous unroll factors

Our current memcpy is about 300 bytes (that includes memmove), this memcpy is ~12 times larger!
This hurts performance due to the code not fitting in the I-cache for common copies.
On a modern OoO core you need very little unrolling since ALU operations and branches
become essentially free while the CPU executes loads and stores. So rather than unrolling
by 32-64 times, try 4 times - you just need enough to hide the taken branch latency.

2. I don't see any special handling for small copies

Even if you want to hyper-optimize gigabyte-sized copies, small copies are still extremely common,
so you always want to handle those as quickly (and with as little code) as possible. Special-casing
small copies does not slow down the huge copies - the reverse is more likely, since you no longer
need to handle small cases.
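
(A hedged sketch of the usual small-copy technique, for illustration
only - not a proposal for the actual code: sizes of 8..16 bytes can be
handled with two possibly overlapping 8-byte accesses and no per-byte
branching.  copy_8_to_16 is a made-up name.)

  #include <stdint.h>
  #include <string.h>

  /* Copy n bytes where 8 <= n <= 16, using overlapping accesses.  */
  static inline void
  copy_8_to_16 (char *dst, const char *src, size_t n)
  {
    uint64_t a, b;
    memcpy (&a, src, 8);           /* first 8 bytes */
    memcpy (&b, src + n - 8, 8);   /* last 8 bytes (may overlap) */
    memcpy (dst, &a, 8);
    memcpy (dst + n - 8, &b, 8);
  }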

3. Check whether using SVE helps small/medium copies

Run memcpy-random benchmark to see whether it is faster to use SVE for small cases or just the SIMD
copy on your uarch.

4. Avoid making the code too general or too specialistic

I see both appearing in the code - trying to deal with different cacheline sizes and different vector lengths,
and also splitting these out into separate cases. If you depend on a particular cacheline size, specialize
the code for that and check the size in the ifunc selector (as various memsets do already). If you want to
handle multiple vector sizes, just use a register for the increment rather than repeating the same code
several times for each vector length.
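
(Again a hedged sketch only, written with SVE ACLE intrinsics rather
than assembly, to show what "use a register for the increment" means:
the loop advances by the runtime vector length from svcntb() instead
of being duplicated once per vector size.  vla_set is a made-up name.)

  #include <arm_sve.h>
  #include <stddef.h>
  #include <stdint.h>

  /* Vector-length-agnostic set loop: one loop body for any SVE VL.  */
  static void
  vla_set (uint8_t *dst, uint8_t c, size_t n)
  {
    size_t vl = svcntb ();              /* runtime vector length in bytes */
    svuint8_t v = svdup_n_u8 (c);
    size_t i = 0;
    for (; i + vl <= n; i += vl)        /* increment held in a register */
      svst1_u8 (svptrue_b8 (), dst + i, v);
    svbool_t p = svwhilelt_b8_u64 (i, n);  /* predicated tail */
    svst1_u8 (p, dst + i, v);
  }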

5. Odd prefetches

I have a hard time believing first prefetching the data to be written, then clearing it using DC ZVA (???),
then prefetching the same data a 2nd time, before finally writing the loaded data, is helping performance...
Generally hardware prefetchers are able to do exactly the right thing since memcpy is trivial to prefetch.
So what is the performance gain of each prefetch/clear step? What is the difference between memcpy
and memmove performance (given memmove doesn't do any of this)?

Cheers,
Wilco


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-12 12:52 [PATCH 0/5] Added optimized memcpy/memmove/memset " Wilco Dijkstra via Libc-alpha
@ 2021-04-12 18:53 ` Florian Weimer
  2021-04-13 12:07 ` naohirot
  1 sibling, 0 replies; 72+ messages in thread
From: Florian Weimer @ 2021-04-12 18:53 UTC (permalink / raw)
  To: Wilco Dijkstra via Libc-alpha; +Cc: Szabolcs Nagy, Wilco Dijkstra

* Wilco Dijkstra via Libc-alpha:

> 5. Odd prefetches
>
> I have a hard time believing first prefetching the data to be
> written, then clearing it using DC ZVA (???), then prefetching the
> same data a 2nd time, before finally write the loaded data is
> helping performance...  Generally hardware prefetchers are able to
> do exactly the right thing since memcpy is trivial to prefetch.  So
> what is the performance gain of each prefetch/clear step? What is
> the difference between memcpy and memmove performance (given memmove
> doesn't do any of this)?

Another downside is exposure of latent concurrency bugs:

  G1: Phantom zeros in cardtable
  <https://bugs.openjdk.java.net/browse/JDK-8039042>

I guess the CPU's heritage is shining through here. 8-)

^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-12 12:52 [PATCH 0/5] Added optimized memcpy/memmove/memset " Wilco Dijkstra via Libc-alpha
  2021-04-12 18:53 ` Florian Weimer
@ 2021-04-13 12:07 ` naohirot
  2021-04-14 16:02   ` Wilco Dijkstra via Libc-alpha
  1 sibling, 1 reply; 72+ messages in thread
From: naohirot @ 2021-04-13 12:07 UTC (permalink / raw)
  To: 'Wilco Dijkstra'; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Wilco-san,

Thanks for the comments.

I've been continuously updating the first patch since I posted it on Mar. 17 2021,
and have fixed some bugs.
Here is my local repository's commit history:
https://github.com/NaohiroTamura/glibc/commits/patch-20210317

I'll answer your comments referring to the latest source code above and
the benchtests graphs uploaded to Google Drive.

> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
> 
> 1. Overall the code is too large due to enormous unroll factors
> 
> Our current memcpy is about 300 bytes (that includes memmove), this memcpy is
> ~12 times larger!
> This hurts performance due to the code not fitting in the I-cache for common
> copies.

OK, I'll try to remove unnecessary code which doesn't contribute to the performance gain,
based on the benchtests performance data.

> On a modern OoO core you need very little unrolling since ALU operations and
> branches become essentially free while the CPU executes loads and stores. So
> rather than unrolling by 32-64 times, try 4 times - you just need enough to hide the
> taken branch latency.
> 

In terms of loop unrolling, I tested several cases in my local environment.
Here is the result.
The source code is based on the latest commit of the branch patch-20210317 in my GitHub repository.
[1] https://github.com/NaohiroTamura/glibc/blob/ec0b55a855529f75bd6f280e59dc2b1c25640490/sysdeps/aarch64/multiarch/memcpy_a64fx.S
[2] https://github.com/NaohiroTamura/glibc/blob/ec0b55a855529f75bd6f280e59dc2b1c25640490/sysdeps/aarch64/multiarch/memset_a64fx.S

Memcpy/memmove use 8, 4, 2 unrolls, and memset uses 32, 8, 4, 2 unrolls.
This unroll configuration recorded the highest performance.
Memcpy   35 Gbps [3]
Memmove  70 Gbps [4]
Memset   70 Gbps [5]

[3] https://drive.google.com/file/d/1Xz04kV-S1E4tKOKLJRl8KgO8ZdCQqv1O/view
[4] https://drive.google.com/file/d/1QDmt7LMscXIJSpaq2sPOiCKl3nxcLxwk/view
[5] https://drive.google.com/file/d/1rpy7rkIskRs6czTARNIh4yCeh8d-L-cP/view

In the case that memcpy/memmove use 4 unrolls and memset uses 4 unrolls,
the performance degraded by 5 to 15 Gbps at the peak.
Memcpy   30 Gbps [6]
Memmove  65 Gbps [7]
Memset   45 Gbps [8]

[6] https://drive.google.com/file/d/1P-QJGeuHPlfj3ax8GlxRShV0_HVMJWGc/view
[7] https://drive.google.com/file/d/1R2IK5eWr8NEduNnvqkdPZyoNE0oImRcp/view
[8] https://drive.google.com/file/d/1WMZFjzF5WgmfpXSOnAd9YMjLqv1mcsEm/view

> 2. I don't see any special handling for small copies
> 
> Even if you want to hyper optimize gigabyte sized copies, small copies are still
> extremely common, so you always want to handle those as quickly (and with as
> little code) as possible. Special casing small copies does not slow down the huge
> copies - the reverse is more likely since you no longer need to handle small cases.
>

Yes, I implemented it for the case of 1 byte to 512 bytes [9][10].
The SVE code seems faster than ASIMD in the small/medium range too [11][12][13].

[9] https://github.com/NaohiroTamura/glibc/blob/ec0b55a855529f75bd6f280e59dc2b1c25640490/sysdeps/aarch64/multiarch/memcpy_a64fx.S#L176-L267
[10] https://github.com/NaohiroTamura/glibc/blob/ec0b55a855529f75bd6f280e59dc2b1c25640490/sysdeps/aarch64/multiarch/memset_a64fx.S#L68-L78
[11] https://drive.google.com/file/d/1VgkFTrWgjFMQ35btWjqHJbEGMgb3ZE-h/view
[12] https://drive.google.com/file/d/1SJ-WMUEEX73SioT9F7tVEIc4iRa8SfjU/view
[13] https://drive.google.com/file/d/1DPPgh2r6t16Ppe0Cpo5XzkVqWA_AVRUc/view
 
> 3. Check whether using SVE helps small/medium copies
> 
> Run memcpy-random benchmark to see whether it is faster to use SVE for small
> cases or just the SIMD copy on your uarch.
> 

Thanks for the memcpy-random benchmark info.
For small/medium copies, I needed to remove the BTI macro from the asm ENTRY macro in order
to see a distinct performance difference between ASIMD and SVE.
I'll post the patch [14] with the A64FX second patch.
 
Also, somehow on the A64FX as well as on the ThunderX2 machine, memcpy-random
didn't start due to an mprotect error.
I needed to fix memcpy-random [15].
If this fix is not wrong, I'll post the patch [15] with the A64FX second patch.

[14] https://github.com/NaohiroTamura/glibc/commit/07ea389846c7c63622b6c0b3aaead3f93e21f356
[15] https://github.com/NaohiroTamura/glibc/commit/ec0b55a855529f75bd6f280e59dc2b1c25640490

> 4. Avoid making the code too general or too specialistic
> 
> I see both appearing in the code - trying to deal with different cacheline sizes and
> different vector lengths, and also splitting these out into separate cases. If you
> depend on a particular cacheline size, specialize the code for that and check the
> size in the ifunc selector (as various memsets do already). If you want to handle
> multiple vector sizes, just use a register for the increment rather than repeating
> the same code several times for each vector length.
> 

In terms of the cache line size, the A64FX is not configurable; it is fixed at 256 bytes.
I've already removed the code that reads it [16][17].

[16] https://github.com/NaohiroTamura/glibc/commit/4bcc6d83c970f7a7283abfec753ecf6b697cf6f7
[17] https://github.com/NaohiroTamura/glibc/commit/f2b2c1ca03b50d414e03411ed65e4b131615e865

In terms of vector length, I'll remove the dedicated code for the 256-bit and 128-bit VL cases,
because the vector-length-agnostic code can cover both of them.

> 5. Odd prefetches
> 
> I have a hard time believing first prefetching the data to be written, then clearing it
> using DC ZVA (???), then prefetching the same data a 2nd time, before finally
> write the loaded data is helping performance...
> Generally hardware prefetchers are able to do exactly the right thing since
> memcpy is trivial to prefetch.
> So what is the performance gain of each prefetch/clear step? What is the
> difference between memcpy and memmove performance (given memmove
> doesn't do any of this)?

Sorry, the memcpy prefetch code was not right; I noticed this bug and fixed it
soon after posting the first patch [18].
Basically "prfm pstl1keep, [dest_ptr, tmp1]" should be "prfm pldl2keep, [src_ptr, tmp1]".

[18] https://github.com/NaohiroTamura/glibc/commit/f5bf15708830f91fb886b15928158db2e875ac88

Without DC_ZVA and L2 prefetch, memcpy and memset performance degraded over 4MB.
Please compare [19] with [22], and [21] with [24] for memset.
Without DC_ZVA and L2 prefetch, memmove didn't degrade over 4MB.
Please compare [20] with [23].
The reason why I didn't implement DC_ZVA and L2 prefetch for memmove is that memmove calls memcpy in
most cases, and the memmove code only handles backward copies.
Maybe most of the memmove-large benchtest cases are backward copies; I need to check.
DC_ZVA and L2 prefetch have to be paired; only DC_ZVA or only L2 prefetch doesn't get any improvement.

With DC_ZVA and L2 prefetch:
[19] https://drive.google.com/file/d/1mmYaLwzEoytBJZ913jaWmucL0j564Ta7/view
[20] https://drive.google.com/file/d/1Bc_DVGBcDRpvDjxCB_2yOk3MOy5BEiOs/view
[21] https://drive.google.com/file/d/19cHvU2lxF28DW9_Z5_5O6gOOdUmVz_ps/view

Without DC_ZVA and L2 prefetch:
[22] https://drive.google.com/file/d/1My6idNuQsrsPVODl0VrqiRbMR9yKGsGS/view
[23] https://drive.google.com/file/d/1q8KhvIqDf27fJ8HGWgjX0nBhgPgGBg_T/view
[24] https://drive.google.com/file/d/1l6pDhuPWDLy5egQ6BhRIYRvshvDeIrGl/view

Thanks.
Naohiro


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-13 12:07 ` naohirot
@ 2021-04-14 16:02   ` Wilco Dijkstra via Libc-alpha
  2021-04-15 12:20     ` naohirot
                       ` (5 more replies)
  0 siblings, 6 replies; 72+ messages in thread
From: Wilco Dijkstra via Libc-alpha @ 2021-04-14 16:02 UTC (permalink / raw)
  To: naohirot@fujitsu.com; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Naohiro,

Thanks for the comprehensive reply, especially the graphs are quite useful!
(I'd avoid adding generic_memcpy/memmove though since those are unoptimized C
implementations).

> OK, I'll try to remove unnecessary code which doesn't contribute performance gain
> based on benchtests performance data. 

Yes that is a good idea - you could also check whether the software pipelining actually
helps on an OoO core (it shouldn't) since that contributes a lot to the complexity and the
amount of code and unrolling required.

It is also possible to remove a lot of unnecessary code - eg. rather than using 2 instructions
per prefetch, merge the constant offset into the prefetch instruction itself (since they allow
up to 32KB offset). There are also lots of branches that skip a few instructions if a value is
zero; this is often counterproductive due to adding branch mispredictions.
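
For example (sketch only, with placeholder registers and offsets), instead of:

    add     tmp1, src_ptr, 1024
    prfm    pldl1keep, [tmp1]

the offset can be folded into the prefetch itself:

    prfm    pldl1keep, [src_ptr, 1024]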

> Memcpy/memmove uses 8, 4, 2 unrolls, and memset uses 32, 8, 4, 2 unrolls.
> This unroll configuration recorded the highest performance.

> In case that Memcpy/memmove uses 4 unrolls, and memset uses 4 unrolls,
> The performance degraded minus 5 to 15 Gbps/sec at the peak.

So this is the L(L1_vl_64) loop right? I guess the problem is the large number of
prefetches and all the extra code that is not strictly required (you can remove 5
redundant mov/cmp instructions from the loop). Also assuming prefetching helps
here (the good memmove results suggest it's not needed), prefetching directly
into L1 should be better than first into L2 and then into L1. So I don't see a good
reason why 4x unrolling would have to be any slower.

> Yes, I implemented for the case of 1 byte to 512 byte [9][10].
> SVE code seems faster than ASIMD in small/medium range too [11][12][13].

That adds quite a lot of code and uses a slow linear chain of comparisons. A small
loop like the one used in the memset should work fine to handle copies smaller than
256 or 512 bytes (you can handle the zero-bytes case for free in this code rather
than special-casing it).
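
For illustration, such a loop could look roughly like this (untested sketch, placeholder
names; the zero-length case simply falls out of the first whilelo):

    mov     tmp1, 0
0:  whilelo p0.b, tmp1, n       // predicate is empty once tmp1 >= n (n == 0 included)
    b.none  1f
    ld1b    z0.b, p0/z, [src, tmp1]
    st1b    z0.b, p0, [dest, tmp1]
    incb    tmp1                // advance by one vector length
    b       0b
1:  ret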

> For small/medium copies, I needed to remove BTI macro from ASM ENTRY in order
> to see the distinct performance difference between ASIMD and SVE.
> I'll post the patch [14] with the A64FX second patch.

I'm not sure I understand - the BTI macro just emits a NOP hint so it is harmless. We always emit
it so that it works seamlessly when BTI is enabled.

> And also somehow on A64FX as well as on ThunderX2 machine, memcpy-random
> didn't start due to mprotect error.

Yes it looks like the size isn't rounded up to a pagesize. It really needs the extra space, so
changing +4096 into getpagesize () will work.

> Without DC_VZA and L2 prefetch, memcpy and memset performance degraded over 4MB.

> DC_VZA and L2 prefetch have to be pair, only DC_VZA or only L2 prefetch doesn't get any improvement.

That seems odd. Was that using the L1 prefetch with the L2 distance? It seems to me one of the L1 or L2
prefetches is unnecessary. Also why would the DC_ZVA need to be done so early? It seems to me that
cleaning the cacheline just before you write it works best since that avoids accidentally replacing it.

> Without DC_VZA and L2 prefetch, memmove didn't degraded over 4MB.
>
> The reason why I didn't implement DC_VZA and L2 prefetch is that memmove calls memcpy in
> most cases, and memmove code only handles backward copy.
> Maybe most of memmove-large benchtest cases are backward copy, I need to check.

Most of the memmove tests do indeed overlap (so DC_ZVA does not work). However it also shows
that it performs well across the L2 cache size range without any prefetch or DC_ZVA.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-14 16:02   ` Wilco Dijkstra via Libc-alpha
@ 2021-04-15 12:20     ` naohirot
  2021-04-20 16:00       ` Wilco Dijkstra via Libc-alpha
  2021-04-19  2:51     ` naohirot
                       ` (4 subsequent siblings)
  5 siblings, 1 reply; 72+ messages in thread
From: naohirot @ 2021-04-15 12:20 UTC (permalink / raw)
  To: 'Wilco Dijkstra'; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Wilco-san,

Thanks for the detailed technical review!!
Now we have several topics to discuss.
So let me focus on BTI in this mail. I'll answer the other topics in later mails.

> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
> 
> Thanks for the comprehensive reply, especially the graphs are quite useful!
> (I'd avoid adding generic_memcpy/memmove though since those are unoptimized
> C implementations).

OK, I'll withdraw the patch from the A64FX patch V2.

> > For small/medium copies, I needed to remove BTI macro from ASM ENTRY
> > in order to see the distinct performance difference between ASIMD and SVE.
> > I'll post the patch [14] with the A64FX second patch.
> 
> I'm not sure I understand - the BTI macro just emits a NOP hint so it is harmless.
> We always emit it so that it works seamlessly when BTI is enabled.

Yes, I observed that just "hint #0x22" is inserted.
The benchtest results show that for sizes less than 100B the A64FX performance with
BTI is slower than ASIMD, but without BTI it is faster than ASIMD.
And at 512B the A64FX performance with BTI is 4 Gbps slower than without BTI.

With BTI, source code [4] 
[1] https://drive.google.com/file/d/1LlyQOq7qT4d0-54uVzUtYMMMDgIiddEj/view
[2] https://drive.google.com/file/d/1C2pl-Iz_-18mkpuQTk1PhEHKsd5x0wWo/view
[3] https://drive.google.com/file/d/1eg_p1_b619KN7XLmOpxqcoI3c9o4WXd-/view
[4] https://github.com/NaohiroTamura/glibc/commit/0f45fff654d7a31b58e5d6f4dbfa31d6586f8cc2

Without BTI, source code [8]
[5] https://drive.google.com/file/d/1Mf7wxwgGb5yYBJo1eUxqvjrkp9O4EVVJ/view
[6] https://drive.google.com/file/d/1rgfFmWsM4Q3oDK8aYa_GjEQWttS0pOBF/view
[7] https://drive.google.com/file/d/1hF7oevP-MERrQ04yajtEUY8CSWe8V2EX/view
[8] https://github.com/NaohiroTamura/glibc/commit/c204a74971b3d34680964bc52ac59264b14527e3

I executed the same test on ThunderX2; the result had very little difference
between with BTI and without BTI, as you mentioned.
So if distinct degradation happens only on A64FX, I'd like to add another
ENTRY macro in sysdeps/aarch64/sysdep.h such as:

#define ENTRY_ALIGN_NO_BTI(name, align)				\
  .globl C_SYMBOL_NAME(name);					\
  .type C_SYMBOL_NAME(name),%function;				\
  .p2align align;						\
  C_LABEL(name)							\
  cfi_startproc;						\
  CALL_MCOUNT

Or I'd like to change memcpy_a64fx.S and memset_a64fx.S to not use the ENTRY macro, such as:
  .globl __memcpy_a64fx
  .type __memcpy_a64fx, %function
  .p2align 6
  __memcpy_a64fx:
  cfi_startproc
  CALL_MCOUNT

What do you think?

> > And also somehow on A64FX as well as on ThunderX2 machine,
> > memcpy-random didn't start due to mprotect error.
> 
> Yes it looks like the size isn't rounded up to a pagesize. It really needs the extra
> space, so changing +4096 into getpagesize () will work.

OK, I've already applied it [8].

Thanks!
Naohiro


^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-14 16:02   ` Wilco Dijkstra via Libc-alpha
  2021-04-15 12:20     ` naohirot
@ 2021-04-19  2:51     ` naohirot
  2021-04-19 14:57       ` Wilco Dijkstra via Libc-alpha
  2021-04-19 12:43     ` naohirot
                       ` (3 subsequent siblings)
  5 siblings, 1 reply; 72+ messages in thread
From: naohirot @ 2021-04-19  2:51 UTC (permalink / raw)
  To: 'Wilco Dijkstra'; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Wilco-san,

Let me focus on the macro "shortcut_for_small_size" for small/medium copies, less than
512 bytes, in this mail.

> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
> > Yes, I implemented for the case of 1 byte to 512 byte [9][10].
> > SVE code seems faster than ASIMD in small/medium range too [11][12][13].
> 
> That adds quite a lot of code and uses a slow linear chain of comparisons. A small
> loop like used in the memset should work fine to handle copies smaller than
> 256 or 512 bytes (you can handle the zero bytes case for free in this code rather
> than special casing it).
> 

I compared performance of the size less than 512 byte for the following five
implementation cases.

CASE 1: linear chain
As mentioned in the reply [0], I removed BTI_J [1], but the macro "shortcut_for_small_size"
stays a linear chain [2].
A64FX performance is 4-14 Gbps [3].
The other arch implementations call BTI_J, so their performance is degraded.
[0] https://sourceware.org/pipermail/libc-alpha/2021-April/125079.html
[1] https://github.com/NaohiroTamura/glibc/commit/7d7217b518e59c78582ac4e89cae725cf620877e
[2] https://github.com/NaohiroTamura/glibc/blob/7d7217b518e59c78582ac4e89cae725cf620877e/sysdeps/aarch64/multiarch/memcpy_a64fx.S#L176-L267
[3] https://drive.google.com/file/d/16qo7N05W526H9j7_9qjm-_Q7gZmOXwpY/view

CASE 2: whilelt loop as in memset
I tested a "whilelt loop" implementation instead of the macro "shortcut_for_small_size".
After testing it, I commented out the "whilelt loop" implementation [4].
Compared with CASE 1, the A64FX performance degraded from 4-14 Gbps to 3-10 Gbps [5].
Please note that a "whilelt loop" implementation cannot be used for memmove,
because it doesn't work for backward copies.
On the other hand, the macro "shortcut_for_small_size" works for backward copies, because
it loads up to all 512 bytes of data into the z0 to z7 SVE registers at once, and then stores all the data.

[4] https://github.com/NaohiroTamura/glibc/commit/77d1da301f8161c74875b0314cae34be8cb33477#diff-03552f8369653866548b20e7867272a645fa2129c700b78fdfafe5a0ff6a259eR308-R318
[5] https://drive.google.com/file/d/1xdw7mr0c90VupVkQwelFafQHNkXslCwv/view

CASE 3: binary tree chain
I updated the macro "shortcut_for_small_size" to use a binary tree chain [6][7].
Compared with CASE 1, sizes less than 96 bytes degraded from 4.0-6.0 Gbps
to 2.5-5.0 Gbps, but the 512-byte size improved from 14.0 Gbps to 17.5 Gbps.

[6] https://github.com/NaohiroTamura/glibc/commit/5c17af8c57561ede5ed2c2af96c9efde4092f02f
[7] https://github.com/NaohiroTamura/glibc/blob/5c17af8c57561ede5ed2c2af96c9efde4092f02f/sysdeps/aarch64/multiarch/memcpy_a64fx.S#L177-L204
[8] https://drive.google.com/file/d/13w8yKdeLpVbp-uJmCttKBKtScya1tXqP/view

CASE 4: binary tree chain, except sizes up to 64 bytes
I handled sizes up to 64 bytes so as to return quickly [9].
Compared with CASE 3, sizes less than 64 bytes improved from 2.5 Gbps to
4.0 Gbps, but the 512-byte size degraded from 17.5 Gbps to 16.5 Gbps [10].

[9] https://github.com/NaohiroTamura/glibc/commit/77d1da301f8161c74875b0314cae34be8cb33477#diff-03552f8369653866548b20e7867272a645fa2129c700b78fdfafe5a0ff6a259eR177-R184
[10] https://drive.google.com/file/d/1lFsjns9g_7fySAsvx_RVS9o6HSrk6ir9/view

CASE 5: binary tree chain, except sizes up to 128 bytes
I handled sizes up to 128 bytes so as to return quickly [11].
Compared with CASE 4, sizes less than 128 bytes improved from 4.0-6.0 Gbps
to 4.0-7.0 Gbps, but the 512-byte size degraded from 16.5 Gbps to 16.0 Gbps [12].

[11] https://github.com/NaohiroTamura/glibc/commit/fefc59f01ecfd6a207fe261de5ab133f4409d687#diff-03552f8369653866548b20e7867272a645fa2129c700b78fdfafe5a0ff6a259eR184-R195
[12] https://drive.google.com/file/d/1HS277_qQUuEeZqLUo0H2XRlFhOhIdI_o/view

In conclusion, I'd like to adopt the CASE 5 implementation, considering the
performance balance between the small size (less than 128 byte) and medium size
(close to 512 byte).

Thanks.
Naohiro


^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-14 16:02   ` Wilco Dijkstra via Libc-alpha
  2021-04-15 12:20     ` naohirot
  2021-04-19  2:51     ` naohirot
@ 2021-04-19 12:43     ` naohirot
  2021-04-20  3:31     ` naohirot
                       ` (2 subsequent siblings)
  5 siblings, 0 replies; 72+ messages in thread
From: naohirot @ 2021-04-19 12:43 UTC (permalink / raw)
  To: 'Wilco Dijkstra'; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Wilco-san,

Let me focus on L1_prefetch in this mail.

> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
 
> > Memcpy/memmove uses 8, 4, 2 unrolls, and memset uses 32, 8, 4, 2 unrolls.
> > This unroll configuration recorded the highest performance.

When I tested "4 unrolls", I modified the source code [1][2] from the mail [0]
as follows:
In the case of memcpy,
   I commented out L(unroll8) and L(unroll2), and left L(unroll4), L(unroll1) and L(last);
In the case of memmove,
   I commented out L(bwd_unroll8) and L(bwd_unroll2), and left L(bwd_unroll4), L(bwd_unroll1) and L(bwd_last);
In the case of memset,
   I commented out L(unroll32), L(unroll8) and L(unroll2), and left L(unroll4), L(unroll1) and L(last).

[0] https://sourceware.org/pipermail/libc-alpha/2021-April/125002.html
[1] https://github.com/NaohiroTamura/glibc/blob/ec0b55a855529f75bd6f280e59dc2b1c25640490/sysdeps/aarch64/multiarch/memcpy_a64fx.S
[2] https://github.com/NaohiroTamura/glibc/blob/ec0b55a855529f75bd6f280e59dc2b1c25640490/sysdeps/aarch64/multiarch/memset_a64fx.S

> > In case that Memcpy/memmove uses 4 unrolls, and memset uses 4 unrolls,
> > The performance degraded minus 5 to 15 Gbps/sec at the peak.
> 
> So this is the L(L1_vl_64) loop right? I guess the problem is the large number of

So this is NOT the L(L1_vl_64) loop, but L(vl_agnostic).

> prefetches and all the extra code that is not strictly required (you can remove 5
> redundant mov/cmp instructions from the loop). Also assuming prefetching helps
> here (the good memmove results suggest it's not needed), prefetching directly
> into L1 should be better than first into L2 and then into L1. So I don't see a good
> reason why 4x unrolling would have to be any slower.

I tried to remove L(L1_prefetch) from both memcpy and memset, and I also
tried to remove only the L2 prefetch instructions (prfm pstl2keep and pldl2keep) in
L(L1_prefetch) from both memcpy and memset.

In the case of memcpy, both removing L(L1_prefetch) [3] and removing the L2 prefetch
instruction from L(L1_prefetch) increased the performance in the size range 64KB-4MB
from 18-20 GB/sec [4] to 20-22 GB/sec [5].

[3] https://github.com/NaohiroTamura/glibc/commit/22612299247e64dbffd62aa186513bde7328d104
[4] https://drive.google.com/file/d/1hGWz4eAYWc1ktdw74rzDPxtQQ48P0-Hv/view
[5] https://drive.google.com/file/d/11Pt1mWSCN2LBPHxXUE-rs7Q6JhtBfpyQ/view

In the case of memset, removing L(L1_prefetch) [6] decreased the performance in the size range
128KB-4MB from 22-24 GB/sec [7] to 20-22 GB/sec [8].
But removing only the L2 prefetch instruction (prfm pstl2keep) in L(L1_prefetch) [9] kept the
performance in the size range 128KB-4MB at 22-24 GB/sec [10].

[6] https://github.com/NaohiroTamura/glibc/blob/22612299247e64dbffd62aa186513bde7328d104/sysdeps/aarch64/multiarch/memset_a64fx.S#L146-L163
   I commented out L146-L163, but didn't commit it because it decreased the performance.
[7] https://drive.google.com/file/d/1MT1d2aBxSoYrzQuRZtv4U9NCXV4ZwHsJ/view
[8] https://drive.google.com/file/d/1qUzYklLvgXTZbP1wm9n4VryF3bgUOplo/view
[9] https://github.com/NaohiroTamura/glibc/commit/cc478c96bac051c9b98b9d9a1ae6f38326f77645
[10] https://drive.google.com/file/d/1bPKHFWyhzNWXX7A_S6_UpZ2BwP2QAJK4/view

In conclusion, I've decided to remove L(L1_prefetch) from memcpy [3] and to remove the L2 prefetch
instruction (prfm pstl2keep) from L(L1_prefetch) in memset [9].

Thanks.
Naohiro


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-19  2:51     ` naohirot
@ 2021-04-19 14:57       ` Wilco Dijkstra via Libc-alpha
  2021-04-21 10:10         ` naohirot
  0 siblings, 1 reply; 72+ messages in thread
From: Wilco Dijkstra via Libc-alpha @ 2021-04-19 14:57 UTC (permalink / raw)
  To: naohirot@fujitsu.com; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Naohiro,

> Let me focus on the macro " shortcut_for_small_size" for small/medium, less than
> 512 byte in this mail. 

Yes, one subject at a time is a good idea.

> Comparing with the CASE 1, A64FX performance degraded from 4-14 Gbps to 3-10 Gbps [5]. 
> Please notice that "whilelt loop" implementation cannot be used for memmove,
> because it doesn't work for backward copy.

Indeed, the memmove code would need a similar loop but backwards. However it sounds like
small loops are not efficient (possibly a high taken branch penalty), so it's not a good option.

> In conclusion, I'd like to adopt the CASE 5 implementation, considering the
> performance balance between the small size (less than 128 byte) and medium size
> (close to 512 byte).

Yes something like this would work. I would strip out any unnecessary instructions and merge
multiple cases to avoid branches as much as possible. For example start memcpy like this:

memcpy:
   cntb        vector_length
   whilelo     p0.b, xzr, n    // gives a free ptrue for N >= VL
   whilelo     p1.b, vector_length, n
   b.last       1f
   ld1b        z0.b, p0/z, [src]
   ld1b        z1.b, p1/z, [src, #1, mul vl]
   st1b        z0.b, p0, [dest]
   st1b        z1.b, p1, [dest, #1, mul vl]
   ret

The proposed case 5 uses 13 instructions up to 64 bytes and 19 up to 128, the above 
does 0-127 bytes in 9 instructions. You can see the code is perfectly balanced, with
4 load/store instructions, 3 ALU instructions and 2 branches.

Rather than doing a complex binary search, we can use the same trick to merge the code
for 128-256 and 256-512. So overall we only need 2 comparisons which we can write like:

cmp n, vector_length, lsl 3
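
For illustration, the merged 2*VL-4*VL tier could continue from the 1f label above in the
same style (untested sketch of mine, not the actual patch):

1:  // reached when n >= 2*VL
    lsl     tmp1, vector_length, 1
    whilelo p2.b, tmp1, n
    incb    tmp1
    whilelo p3.b, tmp1, n
    b.last  2f                  // n >= 4*VL: fall through to the next tier
    ld1b    z0.b, p0/z, [src]
    ld1b    z1.b, p1/z, [src, #1, mul vl]
    ld1b    z2.b, p2/z, [src, #2, mul vl]
    ld1b    z3.b, p3/z, [src, #3, mul vl]
    st1b    z0.b, p0, [dest]
    st1b    z1.b, p1, [dest, #1, mul vl]
    st1b    z2.b, p2, [dest, #2, mul vl]
    st1b    z3.b, p3, [dest, #3, mul vl]
    ret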

Like I mentioned before, it is a really good idea to run bench-memcpy-random since it
will clearly show issues with branch prediction on small copies. For memcpy and related
functions you want to minimize branches and only use branches that are heavily biased.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-14 16:02   ` Wilco Dijkstra via Libc-alpha
                       ` (2 preceding siblings ...)
  2021-04-19 12:43     ` naohirot
@ 2021-04-20  3:31     ` naohirot
  2021-04-20 14:44       ` Wilco Dijkstra via Libc-alpha
  2021-04-20  5:49     ` naohirot
  2021-04-23 13:22     ` naohirot
  5 siblings, 1 reply; 72+ messages in thread
From: naohirot @ 2021-04-20  3:31 UTC (permalink / raw)
  To: 'Wilco Dijkstra'; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Wilco-san,

Let me focus on DC_ZVA and L1/L2 prefetch in this mail.

> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

> > Without DC_VZA and L2 prefetch, memcpy and memset performance degraded
> over 4MB.
> 
> > DC_VZA and L2 prefetch have to be pair, only DC_VZA or only L2 prefetch
> doesn't get any improvement.
> 
> That seems odd. Was that using the L1 prefetch with the L2 distance? It seems to
> me one of the L1 or L2 prefetches is unnecessary. 

I tested the following 4 cases.
The result was that Case 4 is the best.
Cases 2 and 3 were almost the same as Case 1.
Case 4 [1] improved the performance in the size range above 4MB from Case 1's
7.5-10 GB/sec [2] to 10-10.5 GB/sec [3].

Case 1: DC_ZVA + L1 prefetch + L2 prefetch [2]
Case 2: DC_ZVA + L1 prefetch
Case 3: DC_ZVA + L2 prefetch
Case 4: DC_ZVA only [3]

[1] https://github.com/NaohiroTamura/glibc/commit/d57bed764a45383dfea8265d6a384646f4f07eed
[2] https://drive.google.com/file/d/1ws3lTLzMFK3lLrrwxVFvriERrs-IKdP9/view
[3] https://drive.google.com/file/d/1g7nuFOtkFw3b5INcAfuuv2lVODmASm-G/view


>                                                Also why would the DC_ZVA
> need to be done so early? It seems to me that cleaning the cacheline just before
> you write it works best since that avoids accidentally replacing it.
> 

Yes, I moved it closer, please look at the change [1].

> > Without DC_VZA and L2 prefetch, memmove didn't degraded over 4MB.
> >
> > The reason why I didn't implement DC_VZA and L2 prefetch is that
> > memmove calls memcpy in most cases, and memmove code only handles
> backward copy.
> > Maybe most of memmove-large benchtest cases are backward copy, I need to
> check.
> 
> Most of the memmove tests do indeed overlap (so DC_ZVA does not work).
> However it also shows that it performs well across the L2 cache size range
> without any prefetch or DC_ZVA.

That's right, I confirmed that only DC_ZVA was necessary [1].

Next, I'll remove redundant instructions.

Thanks.
Naohiro



^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-14 16:02   ` Wilco Dijkstra via Libc-alpha
                       ` (3 preceding siblings ...)
  2021-04-20  3:31     ` naohirot
@ 2021-04-20  5:49     ` naohirot
  2021-04-20 11:39       ` Wilco Dijkstra via Libc-alpha
  2021-04-23 13:22     ` naohirot
  5 siblings, 1 reply; 72+ messages in thread
From: naohirot @ 2021-04-20  5:49 UTC (permalink / raw)
  To: 'Wilco Dijkstra'; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Wilco-san,

Let me focus on removing redundant instructions in this mail.

> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

> It is also possible to remove a lot of unnecessary code - eg. rather than use 2
> instructions per prefetch, merge the constant offset in the prefetch instruction
> itself (since they allow up to 32KB offset). There are also lots of branches that
> skip a few instructions if a value is zero, this is often counterproductive due to
> adding branch mispredictions.

I removed redundant instructions using cbz and prfm offset address [1][2].

[1] https://github.com/NaohiroTamura/glibc/commit/94363b4ab2e5b4b29843a47a6970b9645a8e4eeb
[2] https://github.com/NaohiroTamura/glibc/commit/4648eb559e46d978ded65d40c6bf8c38dd2519d7

Thanks.
Naohiro


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-20  5:49     ` naohirot
@ 2021-04-20 11:39       ` Wilco Dijkstra via Libc-alpha
  2021-04-27 11:03         ` naohirot
  0 siblings, 1 reply; 72+ messages in thread
From: Wilco Dijkstra via Libc-alpha @ 2021-04-20 11:39 UTC (permalink / raw)
  To: naohirot@fujitsu.com; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Naohiro,

> I removed redundant instructions using cbz and prfm offset address [1][2].
>
> [1] https://github.com/NaohiroTamura/glibc/commit/94363b4ab2e5b4b29843a47a6970b9645a8e4eeb
> [2] https://github.com/NaohiroTamura/glibc/commit/4648eb559e46d978ded65d40c6bf8c38dd2519d7

For the first 2 CBZ cases in both [1] and [2] the fastest option is to use ANDS+BEQ. ANDS only
requires 1 ALU operation while AND+CBZ uses 2 ALU operations on A64FX.
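
For illustration (placeholder registers), instead of:

    and     tmp1, rest, 63
    cbz     tmp1, 1f            // AND + CBZ: 2 ALU operations on A64FX

this is cheaper:

    ands    tmp1, rest, 63      // ANDS sets the condition flags itself
    b.eq    1f                  // so only 1 ALU operation is needed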

Wilco

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-20  3:31     ` naohirot
@ 2021-04-20 14:44       ` Wilco Dijkstra via Libc-alpha
  2021-04-27  9:01         ` naohirot
  0 siblings, 1 reply; 72+ messages in thread
From: Wilco Dijkstra via Libc-alpha @ 2021-04-20 14:44 UTC (permalink / raw)
  To: naohirot@fujitsu.com; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Naohiro,

> Case 4 [1] improved the performance in the size range more than 4MB from Case 1
> 7.5-10 GB/sec [2] to 10-10.5 GB/sec [3].
>
> Case 1: DC_ZVA + L1 prefetch + L2 + prefetch [2]
> Case 2: DC_ZVA + L1 prefetch
> Case 3: DC_ZVA + L2 prefetch
> Case 4: DC_ZVA only [3]

That is great news - it simplifies the loop a lot, and it is faster too!

>>                                                Also why would the DC_ZVA
>> need to be done so early? It seems to me that cleaning the cacheline just before
>> you write it works best since that avoids accidentally replacing it.
>> 
>
> Yes, I moved it closer, please look at the change [1].

What I meant is, why is ZF_DIST so huge? I don't see how that helps. Is there any penalty
if we did it like this (or possibly with 1-2 cachelines offset)?

    dc           zva, dest_ptr
    st1b        z0.b, p0,   [dest_ptr, #0, mul vl]
    st1b        z1.b, p0,   [dest_ptr, #1, mul vl]
    st1b        z2.b, p0,   [dest_ptr, #2, mul vl]
    st1b        z3.b, p0,   [dest_ptr, #3, mul vl]

This would remove almost all initialization code from the start of L(L2_dc_zva).

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-15 12:20     ` naohirot
@ 2021-04-20 16:00       ` Wilco Dijkstra via Libc-alpha
  2021-04-27 11:58         ` naohirot
  0 siblings, 1 reply; 72+ messages in thread
From: Wilco Dijkstra via Libc-alpha @ 2021-04-20 16:00 UTC (permalink / raw)
  To: naohirot@fujitsu.com; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Naohiro,

> Yes, I observed that just " hint #0x22" is inserted.
> The benchtest results show that the A64FX performance of size less than 100B with
> BTI is slower than ASIMD, but without BTI is faster than ASIMD.
> And the A64FX performance of 512B with BTI 4Gbps/sec slower than without BTI.

That's unfortunate - it seems like the hint is very slow, maybe even serializing...
We can work around it for now in GLIBC, but at some point distros will start to insert
BTI instructions by default, and then the performance hit will be bad.

> So if distinct degradation happens only on A64FX, I'd like to add another
> ENTRY macro in sysdeps/aarch64/sysdep.h such as:

I think the best option for now is to change BTI_C into NOP if AARCH64_HAVE_BTI
is not set. This avoids creating alignment issues in existing code (which is written
to assume the hint is present) and works for all string functions.
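
Roughly like this in sysdeps/aarch64/sysdep.h (sketch only; I'm using the macro names as
they appear in this discussion):

#if AARCH64_HAVE_BTI
# define BTI_C  hint 34     /* bti c */
#else
# define BTI_C  nop         /* same 4-byte size, so code layout is unchanged */
#endif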

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-19 14:57       ` Wilco Dijkstra via Libc-alpha
@ 2021-04-21 10:10         ` naohirot
  2021-04-21 15:02           ` Wilco Dijkstra via Libc-alpha
  0 siblings, 1 reply; 72+ messages in thread
From: naohirot @ 2021-04-21 10:10 UTC (permalink / raw)
  To: 'Wilco Dijkstra'; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Wilco-san,

This mail is a continuation of the macro " shortcut_for_small_size" for small/medium,
less than 512 byte.

> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

> Yes something like this would work. ​I would strip out any unnecessary instructions
> and merge multiple cases to avoid branches as much as possible. For example
> start memcpy like this:
> 
> memcpy:
>    ​cntb        vector_length
>    ​whilelo     p0.b, xzr, n    // gives a free ptrue for N >= VL
>    ​whilelo     p1.b, vector_length, n
>    ​b.last       1f
>    ​ld1b        z0.b, p0/z, [src]
>    ​ld1b        z1.b, p1/z, [src, #1, mul vl]
>    ​st1b        z0.b, p0, [dest]
>    ​st1b        z1.b, p1, [dest, #1, mul vl]
>    ​ret
> 
> The proposed case 5 uses 13 instructions up to 64 bytes and 19 up to 128, the
> above does 0-127 bytes in 9 instructions. You can see the code is perfectly
> balanced, with
> 4 load/store instructions, 3 ALU instructions and 2 branches.
> 
> Rather than doing a complex binary search, we can use the same trick to merge
> the code for 128-256 and 256-512. So overall we only need 2 comparisons which
> we can write like:
> 
> cmp n, vector_length, lsl 3

It's a really smart way, isn't it? 😊
I re-implemented the macro "shortcut_for_small_size" using whilelo;
please check [1][2] whether I understood it correctly.
The performance of the "whilelo dispatch" [3] is almost the same as the "binary tree dispatch" [4],
but I notice that there are gaps at 128 bytes and at 256 bytes [3].

[1] https://github.com/NaohiroTamura/glibc/commit/7491bcb36e5c497e509d35b1378fcc663595c2d0
[2] https://github.com/NaohiroTamura/glibc/blob/7491bcb36e5c497e509d35b1378fcc663595c2d0/sysdeps/aarch64/multiarch/memcpy_a64fx.S#L129-L174
[3] https://drive.google.com/file/d/10S6doDFiVtveqRZs-366E_yDzefe-zBS/view
[4] https://drive.google.com/file/d/1p5qPt0KLT4i3Iv_Uy9UT5zo0NetXK-RZ/view

> Like I mentioned before, it is a really good idea to run bench-memcpy-random
> since it will clearly show issues with branch prediction on small copies. For
> memcpy and related functions you want to minimize branches and only use
> branches that are heavily biased.

I checked bench-memcpy-random [5], but it measures the performance for sizes
from 4KB to 512KB.
How do we know about the branch issue for sizes less than 512 bytes?

[5] https://drive.google.com/file/d/1cRwaN9vu9q2Zm8xW6l6hp0GxVB1ZY-Tm/view

Thanks.
Naohiro

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-21 10:10         ` naohirot
@ 2021-04-21 15:02           ` Wilco Dijkstra via Libc-alpha
  2021-04-22 13:17             ` naohirot
  0 siblings, 1 reply; 72+ messages in thread
From: Wilco Dijkstra via Libc-alpha @ 2021-04-21 15:02 UTC (permalink / raw)
  To: naohirot@fujitsu.com; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Naohiro,

> It's really smart way, isn't it? 😊

Well that's the point of SVE!

> I re-implemented the macro " shortcut_for_small_size" using the whilelo, and
> please check it [1][2] if understood correctly.

Yes it works fine. You should still remove the check for zero at entry (which is really slow
and unnecessary) and the argument moves. L2 doesn't need the ptrue, all it needs
is MOV dest_ptr, dst.

> The performance of "whilelo dispatch" [3] is almost same as "binary tree dispatch" [4]
> but I notice that there are gaps at 128 byte and at 256 byte [3].

From what I can see, the new version is faster across the full range. It would be useful to show
both new and old in the same graph rather than separately. You can do that by copying the file
and using a different name for the functions. I do this all the time as it allows direct comparison
of several variants in one benchmark run.

That said, the dip at 256+64 looks fairly substantial. It could be throughput of WHILELO - to test
that you could try commenting out the long WHILELO sequence for the 256-512 byte case and
see whether it improves. If it is WHILELO, it is possible to remove 3x WHILELO from the earlier
cases by moving them after a branch (so that the 256-512 case only needs to execute 5x WHILELO
rather than 8 in total). Also it is worth checking if the 256-512 case beats jumping directly
to L(unroll4) - however note that code isn't optimized yet (eg. there is no need for complex
software pipelined loops since we can only iterate once!). If all that doesn't help, it may be
best to split into 256-384 and 384-512 so you only need 2x WHILELO.

> I checked bench-memcpy-random [5], but it measures the performance from the size
> 4K byte to 512K byte.
> How do we know the branch issue for less than 512 byte?

The size is the size of the memory region tested, not the size of the copies. The actual copies
are very small (90% are smaller than 128 bytes). The key is that it doesn't repeat the same copy
over and over so it's hard on the branch predictor just like in a real application.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-21 15:02           ` Wilco Dijkstra via Libc-alpha
@ 2021-04-22 13:17             ` naohirot
  2021-04-23  0:58               ` naohirot
  0 siblings, 1 reply; 72+ messages in thread
From: naohirot @ 2021-04-22 13:17 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Wilco-san,

Thanks for your review and advice!

> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

> Yes it works fine. You should still remove the check for zero at entry (which is
> really slow and unnecessary) and the argument moves. L2 doesn't need the ptrue,
> all it needs is MOV dest_ptr, dst.

Yes, I cleaned them up [1].
[1] https://github.com/NaohiroTamura/glibc/commit/fbee8284f6cea9671554249816f3ab2a14abeade

> > The performance of "whilelo dispatch" [3] is almost same as "binary
> > tree dispatch" [4] but I notice that there are gaps at 128 byte and at 256 byte [3].
> 
> From what I can see, the new version is faster across the full range. It would be
> useful to show both new and old in the same graph rather than separately. You can
> do that by copying the file and use a different name for the functions. I do this all
> the time as it allows direct comparison of several variants in one benchmark run.

Yes, I confirmed that the "whilelo dispatch" is better than the "binary tree dispatch".
I converted the JSON data from bench-memcpy.out into CSV using jq, and created Graph 1
in a Google Sheet [2].

$ cat bench-memcpy.out | jq -r '.functions.memcpy.results| sort_by(.length) | .[]|[.length, .align1, .align2, .timings[5], .length/.timings[5]] | @csv' > bench-memcpy.csv

[2] https://docs.google.com/spreadsheets/d/19XYE63defjFEHZVqciZdmcDrJLWkRfGmSagXlIV2F-c/edit?usp=sharing

> That said, the dip at 256+64 looks fairly substantial. It could be throughput of
> WHILELO - to test that you could try commenting out the long WHILELO sequence
> for the 256-512 byte case and see whether it improves. 

I commented out the WHILELO sequence in the 256-512 byte case, and confirmed that it made the dip smaller [3].

[3] https://drive.google.com/file/d/13Q3OSUN3qXFiTNNkRVGnsNioUMEId1ge/view

>                                                      If it is WHILELO, it is
> possible to remove 3x WHILELO from the earlier cases by moving them after a
> branch (so that the 256-512 case only needs to execute 5x WHILELO rather than 8
> into total). 

As shown in Graph 2 in the Google Sheet [2], this approach didn't make the dip smaller;
I assume that is because, although we can remove two WHILELOs, we needed to add two PTRUEs.
I changed the code [1] as in the following diff.

$ git diff
diff --git a/sysdeps/aarch64/multiarch/memcpy_a64fx.S b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
index 6d0ae1cd1f..2ae1f4e3b9 100644
--- a/sysdeps/aarch64/multiarch/memcpy_a64fx.S
+++ b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
@@ -139,12 +139,13 @@
 1:  // if rest > vector_length * 8
     cmp         n, vector_length, lsl 3 // vector_length * 8
     b.hi        \exit
+    cmp         n, vector_length, lsl 2 // vector_length * 4
+    b.hi        1f
     // if rest <= vector_length * 4
     lsl         tmp1, vector_length, 1  // vector_length * 2
     whilelo     p2.b, tmp1, n
     incb        tmp1
     whilelo     p3.b, tmp1, n
-    b.last      1f
     ld1b        z0.b, p0/z, [src, #0, mul vl]
     ld1b        z1.b, p1/z, [src, #1, mul vl]
     ld1b        z2.b, p2/z, [src, #2, mul vl]
@@ -155,6 +156,8 @@
     st1b        z3.b, p3, [dest, #3, mul vl]
     ret
 1:  // if rest <= vector_length * 8
+    ptrue       p2.b
+    ptrue       p3.b
     lsl         tmp1, vector_length, 2  // vector_length * 4
     whilelo     p4.b, tmp1, n
     incb        tmp1

>           Also it is worth checking if the 256-512 case beats jumping directly to
> L(unroll4) - however note that code isn't optimized yet (eg. there is no need for
> complex software pipelined loops since we can only iterate once!). 

I tried, but it didn't work for memmove, because L(unroll4) doesn't support
backward copy.

>                                                                                                             If all that
> doesn't help, it may be best to split into 256-384 and 384-512 so you only need 2x
> WHILELO.

This approach [4] made the dip smaller, as shown in Graph 3 in the Google Sheet [2].
So it seems that this is the approach we should take.
 
[4] https://github.com/NaohiroTamura/glibc/commit/cbcb80e69325c16c6697c42627a6ca12c3245a86

> > I checked bench-memcpy-random [5], but it measures the performance
> > from the size 4K byte to 512K byte.
> > How do we know the branch issue for less than 512 byte?
> 
> The size is the size of the memory region tested, not the size of the copies. The
> actual copies are very small (90% are smaller than 128 bytes). The key is that it
> doesn't repeat the same copy over and over so it's hard on the branch predictor
> just like in a real application.

I see, I'll take a look at the source code more thoroughly.

Thanks.
Naohiro

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-22 13:17             ` naohirot
@ 2021-04-23  0:58               ` naohirot
  0 siblings, 0 replies; 72+ messages in thread
From: naohirot @ 2021-04-23  0:58 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Wilco-san,

Let me make one correction: I forgot about the free ptrue for p0.b.

> From: Tamura, Naohiro/田村 直広 <naohirot@fujitsu.com>

> >                                                      If it is WHILELO,
> > it is possible to remove 3x WHILELO from the earlier cases by moving
> > them after a branch (so that the 256-512 case only needs to execute 5x
> > WHILELO rather than 8 into total).
> 
> As shown in Graph 2 in Google Sheet [2], this approach didn't make the dip small,
> because I assume that we can reduce two WHILELO, but we needed to add two
> PTRUE.

I didn't have to add the two PTRUEs because of the free p0.b.
As shown in Graph 4 in the Google Sheet [2], this approach without the two added PTRUEs made
the dip a little smaller, but the improvement is smaller than that of the last approach [4] shown in Graph 3.
So the conclusion seems not to change.

[2] https://docs.google.com/spreadsheets/d/19XYE63defjFEHZVqciZdmcDrJLWkRfGmSagXlIV2F-c/edit?usp=sharing

The code without adding two PTRUE is like the following diff.

$ git diff
diff --git a/sysdeps/aarch64/multiarch/memcpy_a64fx.S b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
index 6d0ae1cd1f..c3779d0147 100644
--- a/sysdeps/aarch64/multiarch/memcpy_a64fx.S
+++ b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
@@ -139,12 +139,13 @@
 1:  // if rest > vector_length * 8
     cmp         n, vector_length, lsl 3 // vector_length * 8
     b.hi        \exit
+    cmp         n, vector_length, lsl 2 // vector_length * 4
+    b.hi        1f
     // if rest <= vector_length * 4
     lsl         tmp1, vector_length, 1  // vector_length * 2
     whilelo     p2.b, tmp1, n
     incb        tmp1
     whilelo     p3.b, tmp1, n
-    b.last      1f
     ld1b        z0.b, p0/z, [src, #0, mul vl]
     ld1b        z1.b, p1/z, [src, #1, mul vl]
     ld1b        z2.b, p2/z, [src, #2, mul vl]
@@ -165,16 +166,16 @@
     whilelo     p7.b, tmp1, n
     ld1b        z0.b, p0/z, [src, #0, mul vl]
     ld1b        z1.b, p1/z, [src, #1, mul vl]
-    ld1b        z2.b, p2/z, [src, #2, mul vl]
-    ld1b        z3.b, p3/z, [src, #3, mul vl]
+    ld1b        z2.b, p0/z, [src, #2, mul vl]
+    ld1b        z3.b, p0/z, [src, #3, mul vl]
     ld1b        z4.b, p4/z, [src, #4, mul vl]
     ld1b        z5.b, p5/z, [src, #5, mul vl]
     ld1b        z6.b, p6/z, [src, #6, mul vl]
     ld1b        z7.b, p7/z, [src, #7, mul vl]
     st1b        z0.b, p0, [dest, #0, mul vl]
     st1b        z1.b, p1, [dest, #1, mul vl]
-    st1b        z2.b, p2, [dest, #2, mul vl]
-    st1b        z3.b, p3, [dest, #3, mul vl]
+    st1b        z2.b, p0, [dest, #2, mul vl]
+    st1b        z3.b, p0, [dest, #3, mul vl]
     st1b        z4.b, p4, [dest, #4, mul vl]
     st1b        z5.b, p5, [dest, #5, mul vl]
     st1b        z6.b, p6, [dest, #6, mul vl]

> I changed the code [1] like the following diff.
> 
> $ git diff
> diff --git a/sysdeps/aarch64/multiarch/memcpy_a64fx.S
> b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
> index 6d0ae1cd1f..2ae1f4e3b9 100644
> --- a/sysdeps/aarch64/multiarch/memcpy_a64fx.S
> +++ b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
> @@ -139,12 +139,13 @@
>  1:  // if rest > vector_length * 8
>      cmp         n, vector_length, lsl 3 // vector_length * 8
>      b.hi        \exit
> +    cmp         n, vector_length, lsl 2 // vector_length * 4
> +    b.hi        1f
>      // if rest <= vector_length * 4
>      lsl         tmp1, vector_length, 1  // vector_length * 2
>      whilelo     p2.b, tmp1, n
>      incb        tmp1
>      whilelo     p3.b, tmp1, n
> -    b.last      1f
>      ld1b        z0.b, p0/z, [src, #0, mul vl]
>      ld1b        z1.b, p1/z, [src, #1, mul vl]
>      ld1b        z2.b, p2/z, [src, #2, mul vl]
> @@ -155,6 +156,8 @@
>      st1b        z3.b, p3, [dest, #3, mul vl]
>      ret
>  1:  // if rest <= vector_length * 8
> +    ptrue       p2.b
> +    ptrue       p3.b
>      lsl         tmp1, vector_length, 2  // vector_length * 4
>      whilelo     p4.b, tmp1, n
>      incb        tmp1

> > If all that doesn't help, it may be best to split into 256-384 and
> > 384-512 so you only need 2x WHILELO.
> 
> This way [4] made the dip small as shown in Graph3 in Google Sheet [2].
> So it seems that this is the way we should take.
> 
> [4]
> https://github.com/NaohiroTamura/glibc/commit/cbcb80e69325c16c6697c4262
> 7a6ca12c3245a86

Thanks.
Naohiro


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-14 16:02   ` Wilco Dijkstra via Libc-alpha
                       ` (4 preceding siblings ...)
  2021-04-20  5:49     ` naohirot
@ 2021-04-23 13:22     ` naohirot
  5 siblings, 0 replies; 72+ messages in thread
From: naohirot @ 2021-04-23 13:22 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Wilco-san,

Let me re-evaluate the loop unrolling/software pipelining of L(vl_agnostic) for the size range
512B-4MB in this mail, using the latest source code [2], with all graphs [3].
The earlier evaluation was reported in the mail [1], but not all graphs were provided.

[1] https://sourceware.org/pipermail/libc-alpha/2021-April/125002.html
[2] https://github.com/NaohiroTamura/glibc/commit/cbcb80e69325c16c6697c42627a6ca12c3245a86
[3] https://docs.google.com/spreadsheets/d/1leFhCAirelDezb0OFC7cr7v4uMUMveaN1iAxL410D2c/edit?usp=sharing

> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

> Yes that is a good idea - you could also check whether the software pipelining
> actually helps on an OoO core (it shouldn't) since that contributes a lot to the
> complexity and the amount of code and unrolling required.

I compared each unroll factor by commenting out the labels above the target label.
For example, if the target label is L(unroll4) of memset, L(unroll32) and L(unroll8)
are commented out, and L(unroll4), L(unroll2), and L(unroll1) are executed.
For memcpy/memmove, the comparison is among L(unroll8), L(unroll4), L(unroll2), and L(unroll1);
for memset, among L(unroll32), L(unroll8), L(unroll4), L(unroll2), and L(unroll1).

The result was that 8x unrolling/pipelining for memcpy/memmove and 32x
unrolling/pipelining for memset are still effective in the size range 512B-64KB,
as shown in the graphs in the Google Sheet [3].
In conclusion, it seems the loop unrolling/software pipelining technique still works
in the case of A64FX. I believe it may be a peculiar characteristic of A64FX.

Thanks.
Naohiro

^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-20 14:44       ` Wilco Dijkstra via Libc-alpha
@ 2021-04-27  9:01         ` naohirot
  0 siblings, 0 replies; 72+ messages in thread
From: naohirot @ 2021-04-27  9:01 UTC (permalink / raw)
  To: 'Wilco Dijkstra'; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Wilco-san,

I focus on the zero fill distance in this mail.

> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

> >>                                                Also why
> would the
> >>DC_ZVA  need to be done so early? It seems to me that cleaning the
> >>cacheline just before  you write it works best since that avoids accidentally
> replacing it.
> >>
> >
> > Yes, I moved it closer, please look at the change [1].
> 
> What I meant is, why is ZF_DIST so huge? I don't see how that helps. Is there any
> penalty if we did it like this (or possibly with 1-2 cachelines offset)?
> 
>     dc           zva, dest_ptr
>     st1b        z0.b, p0,   [dest_ptr, #0, mul vl]
>     st1b        z1.b, p0,   [dest_ptr, #1, mul vl]
>     st1b        z2.b, p0,   [dest_ptr, #2, mul vl]
>     st1b        z3.b, p0,   [dest_ptr, #3, mul vl]

I tested several zero fill distances for memcpy and memset, including 1-2 cache-line offsets.
As shown in Graph 1 and Graph 2 of the Google Sheet [1], the most suitable zero fill
distance for both memcpy and memset was a 21 cache-line offset.
In Graph 1 and Graph 2, ZF21 means a zero fill distance of 21 * the cache line size.
So I updated both the memcpy and memset source code [2][3].

[1] https://docs.google.com/spreadsheets/d/1qXWHc-OXl2E9Q9vWUl4R4eM00k02eij6eMAhXYUFVoI/edit
[2] https://github.com/NaohiroTamura/glibc/commit/5e7f737a270334ec0f86c0228f90000bf9a2cf00
[3] https://github.com/NaohiroTamura/glibc/commit/42334cb84419603003977eb77783bf407cb75072
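
To illustrate what ZF21 means in the store loop (rough sketch, not the exact committed code):

    mov     tmp2, 21*256                // ZF_DIST: 21 cache lines of 256B
    add     tmp1, dest_ptr, tmp2
    dc      zva, tmp1                   // zero-fill the line 21 lines ahead
    st1b    z0.b, p0, [dest_ptr, #0, mul vl]
    st1b    z1.b, p0, [dest_ptr, #1, mul vl]
    st1b    z2.b, p0, [dest_ptr, #2, mul vl]
    st1b    z3.b, p0, [dest_ptr, #3, mul vl]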

Thanks.
Naohiro


^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-20 11:39       ` Wilco Dijkstra via Libc-alpha
@ 2021-04-27 11:03         ` naohirot
  0 siblings, 0 replies; 72+ messages in thread
From: naohirot @ 2021-04-27 11:03 UTC (permalink / raw)
  To: 'Wilco Dijkstra'; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Wilco-san,

This mail is a continuation of removing redundant instructions.

> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
 
> For the first 2 CBZ cases in both [1] and [2] the fastest option is to use
> ANDS+BEQ. ANDS only requires 1 ALU operation while AND+CBZ uses 2 ALU
> operations on A64FX.

I see, I haven't used ANDS before. Thanks for the advice.
I updated memcpy[1] and memset[2].

[1] https://github.com/NaohiroTamura/glibc/commit/fca2c1cf1fd80ec7ecb93f7cd08be9aab9ca9412
[2] https://github.com/NaohiroTamura/glibc/commit/5004e34c35a20faf3e12e6ce915845a75b778cbf

Thanks.
Naohiro


^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-20 16:00       ` Wilco Dijkstra via Libc-alpha
@ 2021-04-27 11:58         ` naohirot
  2021-04-29 15:13           ` Wilco Dijkstra via Libc-alpha
  0 siblings, 1 reply; 72+ messages in thread
From: naohirot @ 2021-04-27 11:58 UTC (permalink / raw)
  To: 'Wilco Dijkstra'; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Wilco-san,

This mail is a continuation of the BTI macro discussion.

I believe that I've answered all of your comments so far.
Please let me know if I missed something.
If there are no further comments on the first version of this patch,
I'd like to proceed with the preparation of the second version after
the consecutive National holidays, Apr. 29th - May. 5th, in Japan.

> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

> > So if distinct degradation happens only on A64FX, I'd like to add
> > another ENTRY macro in sysdeps/aarch64/sysdep.h such as:
> 
> I think the best option for now is to change BTI_C into NOP if AARCH64_HAVE_BTI
> is not set. This avoids creating alignment issues in existing code (which is written
> to assume the hint is present) and works for all string functions.

I updated sysdeps/aarch64/sysdep.h following your advice [1].
Then I reverted the entries of memcpy/memmove [2] and memset [3].

[1] https://github.com/NaohiroTamura/glibc/commit/c582917071e76cfed84fafb0c82cb70339294386
[2] https://github.com/NaohiroTamura/glibc/commit/f4627d5a0faa8d2bd9102964a3e31936248fa9ca
[3] https://github.com/NaohiroTamura/glibc/commit/da48f62bab67d875cb712a886ba074073857d5c3

Thanks.
Naohiro


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-27 11:58         ` naohirot
@ 2021-04-29 15:13           ` Wilco Dijkstra via Libc-alpha
  2021-04-30 15:01             ` Szabolcs Nagy via Libc-alpha
  2021-05-06 10:01             ` naohirot
  0 siblings, 2 replies; 72+ messages in thread
From: Wilco Dijkstra via Libc-alpha @ 2021-04-29 15:13 UTC (permalink / raw)
  To: naohirot@fujitsu.com; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Naohiro,

> I believe that I've answered all of your comments so far.
> Please let me know if I missed something.
> If there are no further comments on the first version of this patch,
> I'd like to proceed with preparing the second version after the
> consecutive national holidays in Japan, Apr. 29th - May 5th.

I've only looked at memcpy so far. My comments on memcpy:

(1) Improve the tail code in unroll4/2/1/last to do the reverse of
    shortcut_for_small_size - basically there is no need for loops or lots of branches.

(2) Rather than start with L2, check for n > L2_SIZE && vector_length == 64 and
    start with the vl_agnostic case. Copies > L2_SIZE will be very rare so it's best to
    handle the common case first.

(3) The alignment code can be significantly simplified. Why not just process
    4 vectors unconditionally and then align the pointers? That avoids all the
    complex code and is much faster.

(4) Is there a benefit of aligning src or dst to vector size in the vl_agnostic case?
    If so, it would be easy to align to a vector first and then if n > L2_SIZE do the
    remaining 3 vectors to align to a full cacheline.

(5) I'm not sure I understand the reason for src_notag/dest_notag. However if
    you want to ignore tags, just change the mov src_ptr, src into AND that
    clears the tag. There is no reason to both clear the tag and also keep the
    original pointer and tag.

For memmove I would suggest merging it with memcpy to save ~100 instructions.
I don't understand the complexity of the L(dispatch) code - you just need a simple
3-instruction overlap check that branches to bwd_unroll8.
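
Something along these lines should be all that is needed (operand names are
illustrative, not taken from your patch):

    // a backward copy is only needed when 0 < dest - src < n, i.e. when a
    // forward copy would overwrite source bytes that have not been read yet
    sub     tmp1, dest, src
    cmp     tmp1, n
    b.hs    L(forward)      // unsigned: taken whenever a forward copy is safe
    // otherwise fall through to the backward (bwd_unroll8) path

That check plus the existing bwd_unroll8 loop should be all memmove needs on
top of memcpy.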

I haven't looked at memset, but pretty much all the improvements apply there too.

>> I think the best option for now is to change BTI_C into NOP if AARCH64_HAVE_BTI
>> is not set. This avoids creating alignment issues in existing code (which is written
>> to assume the hint is present) and works for all string functions.
>
> I updated sysdeps/aarch64/sysdep.h following your advice [1].
> 
> [1] https://github.com/NaohiroTamura/glibc/commit/c582917071e76cfed84fafb0c82cb70339294386

I meant using an actual NOP in the #else case so that existing string functions
won't change. Also note the #defines in the #if and #else need to be indented.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-29 15:13           ` Wilco Dijkstra via Libc-alpha
@ 2021-04-30 15:01             ` Szabolcs Nagy via Libc-alpha
  2021-04-30 15:23               ` Wilco Dijkstra via Libc-alpha
  2021-05-06 10:01             ` naohirot
  1 sibling, 1 reply; 72+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-04-30 15:01 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: 'GNU C Library'

The 04/29/2021 16:13, Wilco Dijkstra wrote:
> > I updated sysdeps/aarch64/sysdep.h following your advice [1].
> >
> > [1] https://github.com/NaohiroTamura/glibc/commit/c582917071e76cfed84fafb0c82cb70339294386
> 
> I meant using an actual NOP in the #else case so that existing string functions
> won't change. Also note the #defines in the #if and #else need to be indented.

is that really useful?
'bti c' is already a nop if it's unsupported.

maybe it works if a64fx_memcpy.S has

  #undef BTI_C
  #define BTI_C
  ENTRY(a64fx_memcpy)
    ...

to save one nop.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-30 15:01             ` Szabolcs Nagy via Libc-alpha
@ 2021-04-30 15:23               ` Wilco Dijkstra via Libc-alpha
  2021-04-30 15:30                 ` Florian Weimer via Libc-alpha
  0 siblings, 1 reply; 72+ messages in thread
From: Wilco Dijkstra via Libc-alpha @ 2021-04-30 15:23 UTC (permalink / raw)
  To: Szabolcs Nagy; +Cc: 'GNU C Library'

Hi Szabolcs,

>> I meant using an actual NOP in the #else case so that existing string functions
>> won't change. Also note the #defines in the #if and #else need to be indented.
>
> is that really useful?
> 'bti c' is already a nop if it's unsupported.

Well it doesn't seem to behave like a NOP. So to avoid slowing down all string
functions, bti c must be removed completely, not just from A64FX memcpy.
Using a real NOP is fine in all cases as long as HAVE_AARCH64_BTI is not defined.
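
In other words, something along these lines in sysdeps/aarch64/sysdep.h
(a sketch only, modulo the exact macro name):

    /* Branch Target Identification support.  */
    #if HAVE_AARCH64_BTI
    # define BTI_C    hint  34
    # define BTI_J    hint  36
    #else
    # define BTI_C    nop
    # define BTI_J    nop
    #endif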

> maybe it works if a64fx_memcpy.S has
>
>  #undef BTI_C
>  #define BTI_C
>  ENTRY(a64fx_memcpy)

That works for memcpy, but what about everything else?

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-30 15:23               ` Wilco Dijkstra via Libc-alpha
@ 2021-04-30 15:30                 ` Florian Weimer via Libc-alpha
  2021-04-30 15:40                   ` Wilco Dijkstra via Libc-alpha
  0 siblings, 1 reply; 72+ messages in thread
From: Florian Weimer via Libc-alpha @ 2021-04-30 15:30 UTC (permalink / raw)
  To: Wilco Dijkstra via Libc-alpha; +Cc: Szabolcs Nagy, Wilco Dijkstra

* Wilco Dijkstra via Libc-alpha:

> Hi Szabolcs,
>
>>> I meant using an actual NOP in the #else case so that existing string functions
>>> won't change. Also note the #defines in the #if and #else need to be indented.
>>
>> is that really useful?
>> 'bti c' is already a nop if it's unsupported.
>
> Well it doesn't seem to behave like a NOP. So to avoid slowing down
> all string functions, bti c must be removed completely, not just from
> A64FX memcpy.  Using a real NOP is fine in all cases as long as
> HAVE_AARCH64_BTI is not defined.

I'm probably confused, but: If BTI is active, many more glibc functions
will have BTI markers.  What makes the string functions special?

Thanks,
Florian


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-30 15:30                 ` Florian Weimer via Libc-alpha
@ 2021-04-30 15:40                   ` Wilco Dijkstra via Libc-alpha
  2021-05-04  7:56                     ` Szabolcs Nagy via Libc-alpha
  0 siblings, 1 reply; 72+ messages in thread
From: Wilco Dijkstra via Libc-alpha @ 2021-04-30 15:40 UTC (permalink / raw)
  To: Florian Weimer, Wilco Dijkstra via Libc-alpha; +Cc: Szabolcs Nagy

Hi Florian,

>> Well it doesn't seem to behave like a NOP. So to avoid slowing down
>> all string functions, bti c must be removed completely, not just from
>> A64FX memcpy.  Using a real NOP is fine in all cases as long as
>> HAVE_AARCH64_BTI is not defined.
>
> I'm probably confused, but: If BTI is active, many more glibc functions
> will have BTI markers.  What makes the string functions special?

Exactly. And at that point trying to remove it from memcpy is just pointless.

The case we are discussing is where BTI is not turned on in GLIBC but we still 
emit a BTI at the start of assembler functions for simplicity. By using a NOP
instead, A64FX will not execute BTI anywhere in GLIBC.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-30 15:40                   ` Wilco Dijkstra via Libc-alpha
@ 2021-05-04  7:56                     ` Szabolcs Nagy via Libc-alpha
  2021-05-04 10:17                       ` Florian Weimer via Libc-alpha
  0 siblings, 1 reply; 72+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-05-04  7:56 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: Florian Weimer, Wilco Dijkstra via Libc-alpha

The 04/30/2021 16:40, Wilco Dijkstra wrote:
> >> Well it doesn't seem to behave like a NOP. So to avoid slowing down
> >> all string functions, bti c must be removed completely, not just from
> >> A64FX memcpy.  Using a real NOP is fine in all cases as long as
> >> HAVE_AARCH64_BTI is not defined.
> >
> > I'm probably confused, but: If BTI is active, many more glibc functions
> > will have BTI markers.  What makes the string functions special?
> 
> Exactly. And at that point trying to remove it from memcpy is just pointless.
> 
> The case we are discussing is where BTI is not turned on in GLIBC but we still
> emit a BTI at the start of assembler functions for simplicity. By using a NOP
> instead, A64FX will not execute BTI anywhere in GLIBC.

the asm ENTRY was written with the assumption that bti c
behaves like a nop when bti is disabled, so we don't have
to make the asm conditional based on cflags.

if that's not the case i agree with the patch, however we
will have to review some other code (e.g. libgcc outline
atomics asm) where we made the same assumption.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-05-04  7:56                     ` Szabolcs Nagy via Libc-alpha
@ 2021-05-04 10:17                       ` Florian Weimer via Libc-alpha
  2021-05-04 10:38                         ` Wilco Dijkstra via Libc-alpha
  2021-05-04 10:42                         ` Szabolcs Nagy via Libc-alpha
  0 siblings, 2 replies; 72+ messages in thread
From: Florian Weimer via Libc-alpha @ 2021-05-04 10:17 UTC (permalink / raw)
  To: Szabolcs Nagy; +Cc: Wilco Dijkstra via Libc-alpha, Wilco Dijkstra

* Szabolcs Nagy:

> The 04/30/2021 16:40, Wilco Dijkstra wrote:
>> >> Well it doesn't seem to behave like a NOP. So to avoid slowing down
>> >> all string functions, bti c must be removed completely, not just from
>> >> A64FX memcpy.  Using a real NOP is fine in all cases as long as
>> >> HAVE_AARCH64_BTI is not defined.
>> >
>> > I'm probably confused, but: If BTI is active, many more glibc functions
>> > will have BTI markers.  What makes the string functions special?
>> 
>> Exactly. And at that point trying to remove it from memcpy is just pointless.
>> 
>> The case we are discussing is where BTI is not turned on in GLIBC but we still
>> emit a BTI at the start of assembler functions for simplicity. By using a NOP
>> instead, A64FX will not execute BTI anywhere in GLIBC.
>
> the asm ENTRY was written with the assumption that bti c
> behaves like a nop when bti is disabled, so we don't have
> to make the asm conditional based on cflags.
>
> if that's not the case i agree with the patch, however we
> will have to review some other code (e.g. libgcc outline
> atomics asm) where we made the same assumption.

I find this discussion extremely worrisome.  If bti c does not behave
like a nop, then we need a new AArch64 ABI variant to enable BTI.

That being said, a distribution with lots of bti c instructions in
binaries seems to run on A64FX CPUs, so I'm not sure what is going on.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-05-04 10:17                       ` Florian Weimer via Libc-alpha
@ 2021-05-04 10:38                         ` Wilco Dijkstra via Libc-alpha
  2021-05-04 10:42                         ` Szabolcs Nagy via Libc-alpha
  1 sibling, 0 replies; 72+ messages in thread
From: Wilco Dijkstra via Libc-alpha @ 2021-05-04 10:38 UTC (permalink / raw)
  To: Florian Weimer, Szabolcs Nagy; +Cc: Wilco Dijkstra via Libc-alpha

Hi Florian,

> I find this discussion extremely worrisome.  If bti c does not behave
> like a nop, then we need a new AArch64 ABI variant to enable BTI.
>
> That being said, a distribution with lots of bti c instructions in
> binaries seems to run on A64FX CPUs, so I'm not sure what is going on.

NOP-space instructions should take no time or execution resources.
From Naohiro's graphs I estimate A64FX takes around 30 cycles per BTI
instruction - that's clearly "not behaving like a NOP". That would cause a
significant performance degradation if BTI is enabled in a distro.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-05-04 10:17                       ` Florian Weimer via Libc-alpha
  2021-05-04 10:38                         ` Wilco Dijkstra via Libc-alpha
@ 2021-05-04 10:42                         ` Szabolcs Nagy via Libc-alpha
  2021-05-04 11:07                           ` Florian Weimer via Libc-alpha
  1 sibling, 1 reply; 72+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-05-04 10:42 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Wilco Dijkstra via Libc-alpha, Wilco Dijkstra

The 05/04/2021 12:17, Florian Weimer wrote:
> * Szabolcs Nagy:
> 
> > The 04/30/2021 16:40, Wilco Dijkstra wrote:
> >> >> Well it doesn't seem to behave like a NOP. So to avoid slowing down
> >> >> all string functions, bti c must be removed completely, not just from
> >> >> A64FX memcpy.  Using a real NOP is fine in all cases as long as
> >> >> HAVE_AARCH64_BTI is not defined.
> >> >
> >> > I'm probably confused, but: If BTI is active, many more glibc functions
> >> > will have BTI markers.  What makes the string functions special?
> >> 
> >> Exactly. And at that point trying to remove it from memcpy is just pointless.
> >> 
> >> The case we are discussing is where BTI is not turned on in GLIBC but we still
> >> emit a BTI at the start of assembler functions for simplicity. By using a NOP
> >> instead, A64FX will not execute BTI anywhere in GLIBC.
> >
> > the asm ENTRY was written with the assumption that bti c
> > behaves like a nop when bti is disabled, so we don't have
> > to make the asm conditional based on cflags.
> >
> > if that's not the case i agree with the patch, however we
> > will have to review some other code (e.g. libgcc outline
> > atomics asm) where we made the same assumption.
> 
> I find this discussion extremely worrisome.  If bti c does not behave
> like a nop, then we need a new AArch64 ABI variant to enable BTI.
> 
> That being said, a distribution with lots of bti c instructions in
> binaries seems to run on A64FX CPUs, so I'm not sure what is going on.

this does not have a correctness impact, only a performance impact.

hint space instructions seem to be slower than expected on a64fx.

which means unconditionally adding bti c to asm entry code is not
ideal if somebody tries to build a system without branch-protection.
distros that build all binaries with branch protection will just
take a performance hit on a64fx; we can't fix that easily.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-05-04 10:42                         ` Szabolcs Nagy via Libc-alpha
@ 2021-05-04 11:07                           ` Florian Weimer via Libc-alpha
  0 siblings, 0 replies; 72+ messages in thread
From: Florian Weimer via Libc-alpha @ 2021-05-04 11:07 UTC (permalink / raw)
  To: Szabolcs Nagy; +Cc: Wilco Dijkstra via Libc-alpha, Wilco Dijkstra

* Szabolcs Nagy:

> The 05/04/2021 12:17, Florian Weimer wrote:
>> * Szabolcs Nagy:
>> 
>> > The 04/30/2021 16:40, Wilco Dijkstra wrote:
>> >> >> Well it doesn't seem to behave like a NOP. So to avoid slowing down
>> >> >> all string functions, bti c must be removed completely, not just from
>> >> >> A64FX memcpy.  Using a real NOP is fine in all cases as long as
>> >> >> HAVE_AARCH64_BTI is not defined.
>> >> >
>> >> > I'm probably confused, but: If BTI is active, many more glibc functions
>> >> > will have BTI markers.  What makes the string functions special?
>> >> 
>> >> Exactly. And at that point trying to remove it from memcpy is just pointless.
>> >> 
>> >> The case we are discussing is where BTI is not turned on in GLIBC but we still
>> >> emit a BTI at the start of assembler functions for simplicity. By using a NOP
>> >> instead, A64FX will not execute BTI anywhere in GLIBC.
>> >
>> > the asm ENTRY was written with the assumption that bti c
>> > behaves like a nop when bti is disabled, so we don't have
>> > to make the asm conditional based on cflags.
>> >
>> > if that's not the case i agree with the patch, however we
>> > will have to review some other code (e.g. libgcc outline
>> > atomics asm) where we made the same assumption.
>> 
>> I find this discussion extremely worrisome.  If bti c does not behave
>> like a nop, then we need a new AArch64 ABI variant to enable BTI.
>> 
>> That being said, a distribution with lots of bti c instructions in
>> binaries seems to run on A64FX CPUs, so I'm not sure what is going on.
>
> this does not have a correctness impact, only a performance impact.
>
> hint space instructions seem to be slower than expected on a64fx.
>
> which means unconditionally adding bti c to asm entry code is not
> ideal if somebody tries to build a system without branch-protection.
> distros that build all binaries with branch protection will just
> take a performance hit on a64fx; we can't fix that easily.

I think I see it now.  It's not critically slow, but there appears to be
observable impact.  I'm still worried.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-29 15:13           ` Wilco Dijkstra via Libc-alpha
  2021-04-30 15:01             ` Szabolcs Nagy via Libc-alpha
@ 2021-05-06 10:01             ` naohirot
  2021-05-06 14:26               ` Szabolcs Nagy via Libc-alpha
  2021-05-06 17:31               ` Wilco Dijkstra via Libc-alpha
  1 sibling, 2 replies; 72+ messages in thread
From: naohirot @ 2021-05-06 10:01 UTC (permalink / raw)
  To: 'Wilco Dijkstra'; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Wilco,

Thanks for the comments. I applied all of them to both
memcpy/memmove and memset, except for (3), the alignment code, in the case of memset.
The latest code is memcpy/memmove [1] and memset [2] in the
patch-20210317 branch [3], chosen by evaluating the performance data shown
below.

[1] https://github.com/NaohiroTamura/glibc/blob/d2ea23703fc45cbfe4a8f27c759b0b23722e17a4/sysdeps/aarch64/multiarch/memcpy_a64fx.S
[2] https://github.com/NaohiroTamura/glibc/blob/d2ea23703fc45cbfe4a8f27c759b0b23722e17a4/sysdeps/aarch64/multiarch/memset_a64fx.S
[3] https://github.com/NaohiroTamura/glibc/commits/patch-20210317

> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

> I've only looked at memcpy so far. My comments on memcpy:
> 
> (1) Improve the tail code in unroll4/2/1/last to do the reverse of
>     shortcut_for_small_size - basically there is no need for loops or lots of
> branches.
>

I updated the tail code of both memcpy/memmove [4] and memset [5], and
replaced the small-size code of memset [5].
The performance is shown as "whilelo" in the Google Sheet graphs for
memcpy/memmove [6] and memset [7].

[4] https://github.com/NaohiroTamura/glibc/commit/f7d9d7b22814affdd89cf291905b9c6601e2031d
[5] https://github.com/NaohiroTamura/glibc/commit/b79d6731f800a56be66c895c035b791ca5176bbb
[6] https://docs.google.com/spreadsheets/d/1Rh-bwF6dpWqoOCbL2epogUPn4I2Emd0NiFgoEOPaujM/edit
[7] https://docs.google.com/spreadsheets/d/1TS0qFhyR_06OyqaRHYAdCKxwvRz7f1T8jI7Pu6x2GIk/edit
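
The new tail is branch-light: whilelo predicates mask off the bytes beyond
"rest", so no scalar loop is needed. A simplified excerpt of the L(last)
code:

    L(last):
        whilelo p0.b, xzr, rest             // lanes [0, rest), capped at VL
        whilelo p1.b, vector_length, rest   // lanes [VL, rest)
        b.last  1f          // taken while rest spans 2 full vectors or more
        ld1b    z0.b, p0/z, [src_ptr, #0, mul vl]
        ld1b    z1.b, p1/z, [src_ptr, #1, mul vl]
        st1b    z0.b, p0, [dest_ptr, #0, mul vl]
        st1b    z1.b, p1, [dest_ptr, #1, mul vl]
        ret
    1:  // the same pattern continues with 4 and then 8 vectors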

> (2) Rather than start with L2, check for n > L2_SIZE && vector_length == 64 and
>     start with the vl_agnostic case. Copies > L2_SIZE will be very rare so it's best
> to
>     handle the common case first.
> 

I changed the order in both memcpy/memmove [8] and memset [9].
The performance is shown as "agnostic1st" in the Google Sheet graphs for
memcpy/memmove [6] and memset [7].

[8] https://github.com/NaohiroTamura/glibc/commit/c0d7e39aa4aefe3d7b7d2a8a7c220150a0eb78fe
[9] https://github.com/NaohiroTamura/glibc/commit/d2ea23703fc45cbfe4a8f27c759b0b23722e17a4
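
So the huge-copy case is now handled by a single conditional check on the
way into the VL-agnostic path, roughly:

    // only divert to the L2 / zero fill path for copies of at least
    // L2_SIZE on a 512-bit (64-byte vector length) implementation
    mov     tmp1, 64
    cmp     rest, L2_SIZE
    ccmp    vector_length, tmp1, 0, cs  // second compare only if rest >= L2_SIZE,
                                        // otherwise the flags are forced to "ne"
    b.eq    L(L2)
    // fall through to L(unroll8) for the common case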

> (3) The alignment code can be significantly simplified. Why not just process
>     4 vectors unconditionally and then align the pointers? That avoids all the
>     complex code and is much faster.
> 

For memcpy/memmove, I tried 4 patterns, "simplifiedL2algin" [10],
"simplifiedL2algin2" [11], "agnosticVLalign" [12], and "noalign" [13], as shown in
the Google Sheet graph [14].

"simplifiedL2algin" [10] simplifies the alignment code to 4 whilelo, "simplifiedL2algin2" [11]
simplifies it to 2 or 4 whilelo, "agnosticVLalign" [12] adds alignment code to L(vl_agnostic),
and "noalign" [13] removes all alignment.

"agnosticVLalign" [12] and "noalign" [13] didn't improve the performance, so these
commits are kept only in the patch-20210317-memcpy-alignment branch [15].

[10] https://github.com/NaohiroTamura/glibc/commit/dd4ede78ec4d74e61a4dc3166fc8586168c4e410
[11] https://github.com/NaohiroTamura/glibc/commit/dd246ff01d59e4e91d10261cd070baae07c0093e
[12] https://github.com/NaohiroTamura/glibc/commit/35b8057d91024bf41595d38d94b2c3c76bdfd6b0
[13] https://github.com/NaohiroTamura/glibc/commit/b1f16f3e738152a5c0f3441201058b48901b4910
[14] https://docs.google.com/spreadsheets/d/1REBslxd56kMDMiXHAtRkBn4IaUO7AVmgvGldJl5qc58/edit
[15] https://github.com/NaohiroTamura/glibc/commits/patch-20210317-memcpy-alignment

For memset, I also tried 4 patterns, "VL/CL-align" [16], "CL-align" [17],
"CL-align2" [18] and "noalign" [19], as shown in the Google Sheet graph [20].

"VL/CL-align" [16] simplifies the alignment code to 1 whilelo for VL and 3 whilelo for CL,
"CL-align" [17] simplifies it to 4 whilelo, "CL-align2" [18] simplifies it to 2 or
4 whilelo, and "noalign" [19] removes all alignment.

As shown in the Google Sheet graph [20], none of the 4 patterns improved the
performance, so these commits are kept only in the
patch-20210317-memset-alignment branch [21].

[16] https://github.com/NaohiroTamura/glibc/commit/2405b67a6bb8b380476967e150b35f10e0f25fe3
[17] https://github.com/NaohiroTamura/glibc/commit/a01a8ef08f3b53a691502538dabce3d5941790ff
[18] https://github.com/NaohiroTamura/glibc/commit/c8eb4467acbc97890a4f76f716a88d2dd901e083
[19] https://github.com/NaohiroTamura/glibc/commit/01ff56a9e558d650b09b0053adbc3215d269d65f
[20] https://docs.google.com/spreadsheets/d/1qT0ZkbrrL3fpEyfdjr23cbtanNyPFKN8xDo6E9Mb_YQ/edit
[21] https://github.com/NaohiroTamura/glibc/commits/patch-20210317-memset-alginment

> (4) Is there a benefit of aligning src or dst to vector size in the vl_agnostic case?
>     If so, it would be easy to align to a vector first and then if n > L2_SIZE do the
>     remaining 3 vectors to align to a full cacheline.
> 

As noted under (3), "agnosticVLalign" [12] didn't improve the performance.

> (5) I'm not sure I understand the reason for src_notag/dest_notag. However if
>     you want to ignore tags, just change the mov src_ptr, src into AND that
>     clears the tag. There is no reason to both clear the tag and also keep the
>     original pointer and tag.
> 

A64FX has a Fujitsu proprietary enhancement regarding tagged addresses.
I removed the dest_notag/src_notag macros and simplified L(dispatch) [22].
The "src" address has to be kept in order to jump to L(last) [23].

[22] https://github.com/NaohiroTamura/glibc/commit/519244f5058d0aa98634bb544bae3358f0b7b07c
[23] https://github.com/NaohiroTamura/glibc/blob/519244f5058d0aa98634bb544bae3358f0b7b07c/sysdeps/aarch64/multiarch/memcpy_a64fx.S#L399
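
The notag values are now computed once, up front, with a plain AND into
scratch registers, e.g.:

    // strip a possible tag in the top byte; bits [55:0] hold the address
    and     tmp2, dest, 0xffffffffffffff    // dest_notag
    and     tmp3, src,  0xffffffffffffff    // src_notag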

> For memmove I would suggest to merge it with memcpy to save ~100 instructions.
> I don't understand the complexity of the L(dispatch) code - you just need a simple
> 3-instruction overlap check that branches to bwd_unroll8.
> 

I simplified the L(dispatch) code to 3 instructions [24] in the commit [23].

[24] https://github.com/NaohiroTamura/glibc/blob/519244f5058d0aa98634bb544bae3358f0b7b07c/sysdeps/aarch64/multiarch/memcpy_a64fx.S#L368-L370

> I haven't looked at memset, but pretty much all the improvements apply there too.

So please review the latest memset [2].

> >> I think the best option for now is to change BTI_C into NOP if
> >> AARCH64_HAVE_BTI is not set. This avoids creating alignment issues in
> >> existing code (which is written to assume the hint is present) and works for all
> string functions.
> >
> > I updated sysdeps/aarch64/sysdep.h following your advice [1].
> >
> > [1]
> > https://github.com/NaohiroTamura/glibc/commit/c582917071e76cfed84fafb0
> > c82cb70339294386
> 
> I meant using an actual NOP in the #else case so that existing string functions
> won't change. Also note the #defines in the #if and #else need to be indented.
> 

I've read the mail thread regarding BTI, but I don't think I fully understand the
problem. BTI seems to be available from ARMv8.5, and A64FX is ARMv8.2.
Even if a distro distributes BTI-enabled binaries, BTI doesn't work on A64FX.
So at least the BTI_J macro can be removed from the A64FX IFUNC code, because the
A64FX IFUNC code is executed only on A64FX.
Are we discussing the BTI_C code which is not in the IFUNC code?

Thanks.
Naohiro


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-05-06 10:01             ` naohirot
@ 2021-05-06 14:26               ` Szabolcs Nagy via Libc-alpha
  2021-05-06 15:09                 ` Florian Weimer via Libc-alpha
  2021-05-06 17:31               ` Wilco Dijkstra via Libc-alpha
  1 sibling, 1 reply; 72+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-05-06 14:26 UTC (permalink / raw)
  To: naohirot@fujitsu.com; +Cc: 'GNU C Library', 'Wilco Dijkstra'

The 05/06/2021 10:01, naohirot@fujitsu.com wrote:
> > From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
> > > [1]
> > > https://github.com/NaohiroTamura/glibc/commit/c582917071e76cfed84fafb0
> > > c82cb70339294386
> > 
> > I meant using an actual NOP in the #else case so that existing string functions
> > won't change. Also note the #defines in the #if and #else need to be indented.
> > 
> 
> I've read the mail thread regarding BTI, but I don't think I fully understand the
> problem. BTI seems to be available from ARMv8.5, and A64FX is ARMv8.2.
> Even if a distro distributes BTI-enabled binaries, BTI doesn't work on A64FX.
> So at least the BTI_J macro can be removed from the A64FX IFUNC code, because the
> A64FX IFUNC code is executed only on A64FX.
> Are we discussing the BTI_C code which is not in the IFUNC code?

BTI_C at function entry.

the slowdown you showed with bti c at function entry
should not be present with a plain nop.

this means a64fx implements hint space instructions
(such as bti c) more slowly than plain nops, which is not
expected and will cause slowdowns for distros that
try to distribute binaries with bti c; this problem
goes beyond string functions.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-05-06 14:26               ` Szabolcs Nagy via Libc-alpha
@ 2021-05-06 15:09                 ` Florian Weimer via Libc-alpha
  0 siblings, 0 replies; 72+ messages in thread
From: Florian Weimer via Libc-alpha @ 2021-05-06 15:09 UTC (permalink / raw)
  To: Szabolcs Nagy via Libc-alpha; +Cc: Szabolcs Nagy, 'Wilco Dijkstra'

* Szabolcs Nagy via Libc-alpha:

> this means a64fx implements hint space instructions
> (such as bti c) more slowly than plain nops, which is not
> expected and will cause slowdowns for distros that
> try to distribute binaries with bti c; this problem
> goes beyond string functions.

And we are using -mbranch-protection=standard on AArch64 going forward,
for example:

| optflags: aarch64 %{__global_compiler_flags} -mbranch-protection=standard -fasynchronous-unwind-tables %[ "%{toolchain}" == "gcc" ? "-fstack-clash-protection" : "" ]

<https://gitlab.com/redhat/centos-stream/rpms/redhat-rpm-config/-/blob/c9s/rpmrc#L77>

(Fedora is similar.)

This is why I find this issue so worrying.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-05-06 10:01             ` naohirot
  2021-05-06 14:26               ` Szabolcs Nagy via Libc-alpha
@ 2021-05-06 17:31               ` Wilco Dijkstra via Libc-alpha
  2021-05-07 12:31                 ` naohirot
  1 sibling, 1 reply; 72+ messages in thread
From: Wilco Dijkstra via Libc-alpha @ 2021-05-06 17:31 UTC (permalink / raw)
  To: naohirot@fujitsu.com; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Naohiro,

> I've read the mail thread regarding BTI, but I don't think I fully understand the
> problem. BTI seems to be available from ARMv8.5, and A64FX is ARMv8.2.

BTI instructions are NOP hints, so it is possible to enable BTI even on ARMv8.0.
Using BTI instructions is harmless on CPUs that don't support it if NOP hints are as
cheap as a NOP (which generally doesn't need any execution resources).

> Even if a distro distributes BTI-enabled binaries, BTI doesn't work on A64FX.

It works (i.e. it is binary compatible with A64FX) and should have no effect. However
it seems to cause an unexpected slowdown.

> So at least the BTI_J macro can be removed from the A64FX IFUNC code, because the
> A64FX IFUNC code is executed only on A64FX.

How is removing it just from memcpy going to help? The worry is not about memcpy
but the slowdown from all the BTI instructions that will be added to most functions.

Note it is still worthwhile to change BTI_C to NOP as suggested - that is the case when
BTI is not enabled, and there you want to avoid inserting BTI when it is not needed.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-05-06 17:31               ` Wilco Dijkstra via Libc-alpha
@ 2021-05-07 12:31                 ` naohirot
  0 siblings, 0 replies; 72+ messages in thread
From: naohirot @ 2021-05-07 12:31 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: Szabolcs Nagy, Florian Weimer, 'GNU C Library'

Hi Wilco, Szabolcs, Florian,

Thanks for the explanation!

> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
 
> How is removing it just from memcpy going to help? The worry is not about memcpy
> but the slowdown from all the BTI instructions that will be added to most functions.

OK, I understood.
I am now asking the CPU design team how A64FX implements "hint 34" and how it
behaves.
 
> Note it is still worthwhile to change BTI_C to NOP as suggested - that is the case when
> BTI is not enabled, and there you want to avoid inserting BTI when it is not needed.

I changed the BTI_C and BTI_J definitions to nop [1].

[1] https://github.com/NaohiroTamura/glibc/commit/0804fe9d288d489ec8af98c687552decd2723f5d

Thanks.
Naohiro

^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-03-17  2:28 [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX Naohiro Tamura
                   ` (5 preceding siblings ...)
  2021-03-29 12:03 ` [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX Szabolcs Nagy via Libc-alpha
@ 2021-05-10  1:45 ` naohirot
  2021-05-14 13:35   ` Szabolcs Nagy via Libc-alpha
  2021-05-12  9:23 ` [PATCH v2 0/6] aarch64: " Naohiro Tamura
  7 siblings, 1 reply; 72+ messages in thread
From: naohirot @ 2021-05-10  1:45 UTC (permalink / raw)
  To: Szabolcs Nagy, Wilco Dijkstra, Florian Weimer; +Cc: libc-alpha@sourceware.org

Hi Szabolcs, Wilco, Florian,

> From: Naohiro Tamura <naohirot@fujitsu.com>
> Sent: Wednesday, March 17, 2021 11:29 AM
 
> Fujitsu is in the process of signing the copyright assignment paper.
> We'd like to have some feedback in advance.

FYI: Fujitsu has finally submitted the signed copyright assignment.

Thanks.
Naohiro


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH v2 0/6] aarch64: Added optimized memcpy/memmove/memset for A64FX
  2021-03-17  2:28 [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX Naohiro Tamura
                   ` (6 preceding siblings ...)
  2021-05-10  1:45 ` naohirot
@ 2021-05-12  9:23 ` Naohiro Tamura
  2021-05-12  9:26   ` [PATCH v2 1/6] config: Added HAVE_AARCH64_SVE_ASM for aarch64 Naohiro Tamura
                     ` (8 more replies)
  7 siblings, 9 replies; 72+ messages in thread
From: Naohiro Tamura @ 2021-05-12  9:23 UTC (permalink / raw)
  To: libc-alpha

Hi Szabolcs, Wilco, Florian,

Thank you for reviewing Patch V1.

Patch V2 reflects all of the V1 comments, which were mainly
related to redundant assembler code.
Consequently the assembler code has been minimized, and each line of the V2
assembler code has been justified by string bench performance
data.
In terms of assembler LOC (lines of code), memcpy/memmove was reduced by 60%,
from 1,000 to 400 lines, and memset by 55%, from 600 to 270 lines.

So please kindly review V2.

Thanks.
Naohiro

Naohiro Tamura (6):
  config: Added HAVE_AARCH64_SVE_ASM for aarch64
  aarch64: define BTI_C and BTI_J macros as NOP unless HAVE_AARCH64_BTI
  aarch64: Added optimized memcpy and memmove for A64FX
  aarch64: Added optimized memset for A64FX
  scripts: Added Vector Length Set test helper script
  benchtests: Fixed bench-memcpy-random: buf1: mprotect failed

 benchtests/bench-memcpy-random.c              |   4 +-
 config.h.in                                   |   5 +
 manual/tunables.texi                          |   3 +-
 scripts/vltest.py                             |  82 ++++
 sysdeps/aarch64/configure                     |  28 ++
 sysdeps/aarch64/configure.ac                  |  15 +
 sysdeps/aarch64/multiarch/Makefile            |   3 +-
 sysdeps/aarch64/multiarch/ifunc-impl-list.c   |  13 +-
 sysdeps/aarch64/multiarch/init-arch.h         |   4 +-
 sysdeps/aarch64/multiarch/memcpy.c            |  12 +-
 sysdeps/aarch64/multiarch/memcpy_a64fx.S      | 405 ++++++++++++++++++
 sysdeps/aarch64/multiarch/memmove.c           |  12 +-
 sysdeps/aarch64/multiarch/memset.c            |  11 +-
 sysdeps/aarch64/multiarch/memset_a64fx.S      | 268 ++++++++++++
 sysdeps/aarch64/sysdep.h                      |   9 +-
 .../unix/sysv/linux/aarch64/cpu-features.c    |   4 +
 .../unix/sysv/linux/aarch64/cpu-features.h    |   4 +
 17 files changed, 868 insertions(+), 14 deletions(-)
 create mode 100755 scripts/vltest.py
 create mode 100644 sysdeps/aarch64/multiarch/memcpy_a64fx.S
 create mode 100644 sysdeps/aarch64/multiarch/memset_a64fx.S

-- 
2.17.1


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH v2 1/6] config: Added HAVE_AARCH64_SVE_ASM for aarch64
  2021-05-12  9:23 ` [PATCH v2 0/6] aarch64: " Naohiro Tamura
@ 2021-05-12  9:26   ` Naohiro Tamura
  2021-05-26 10:05     ` Szabolcs Nagy via Libc-alpha
  2021-05-12  9:27   ` [PATCH v2 2/6] aarch64: define BTI_C and BTI_J macros as NOP unless HAVE_AARCH64_BTI Naohiro Tamura
                     ` (7 subsequent siblings)
  8 siblings, 1 reply; 72+ messages in thread
From: Naohiro Tamura @ 2021-05-12  9:26 UTC (permalink / raw)
  To: libc-alpha; +Cc: Naohiro Tamura

From: Naohiro Tamura <naohirot@jp.fujitsu.com>

This patch checks whether the assembler supports '-march=armv8.2-a+sve' to
generate SVE code, and defines the HAVE_AARCH64_SVE_ASM macro accordingly.
---
 config.h.in                  |  5 +++++
 sysdeps/aarch64/configure    | 28 ++++++++++++++++++++++++++++
 sysdeps/aarch64/configure.ac | 15 +++++++++++++++
 3 files changed, 48 insertions(+)

diff --git a/config.h.in b/config.h.in
index 99036b887f..13fba9bb8d 100644
--- a/config.h.in
+++ b/config.h.in
@@ -121,6 +121,11 @@
 /* AArch64 PAC-RET code generation is enabled.  */
 #define HAVE_AARCH64_PAC_RET 0
 
+/* Assembler support ARMv8.2-A SVE.
+   This macro becomes obsolete when glibc increased the minimum
+   required version of GNU 'binutils' to 2.28 or later. */
+#define HAVE_AARCH64_SVE_ASM 0
+
 /* ARC big endian ABI */
 #undef HAVE_ARC_BE
 
diff --git a/sysdeps/aarch64/configure b/sysdeps/aarch64/configure
index 83c3a23e44..4c1fac49f3 100644
--- a/sysdeps/aarch64/configure
+++ b/sysdeps/aarch64/configure
@@ -304,3 +304,31 @@ fi
 $as_echo "$libc_cv_aarch64_variant_pcs" >&6; }
 config_vars="$config_vars
 aarch64-variant-pcs = $libc_cv_aarch64_variant_pcs"
+
+# Check if asm support armv8.2-a+sve
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for SVE support in assembler" >&5
+$as_echo_n "checking for SVE support in assembler... " >&6; }
+if ${libc_cv_asm_sve+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat > conftest.s <<\EOF
+        ptrue p0.b
+EOF
+if { ac_try='${CC-cc} -c -march=armv8.2-a+sve conftest.s 1>&5'
+  { { eval echo "\"\$as_me\":${as_lineno-$LINENO}: \"$ac_try\""; } >&5
+  (eval $ac_try) 2>&5
+  ac_status=$?
+  $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+  test $ac_status = 0; }; }; then
+  libc_cv_aarch64_sve_asm=yes
+else
+  libc_cv_aarch64_sve_asm=no
+fi
+rm -f conftest*
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $libc_cv_asm_sve" >&5
+$as_echo "$libc_cv_asm_sve" >&6; }
+if test $libc_cv_aarch64_sve_asm = yes; then
+  $as_echo "#define HAVE_AARCH64_SVE_ASM 1" >>confdefs.h
+
+fi
diff --git a/sysdeps/aarch64/configure.ac b/sysdeps/aarch64/configure.ac
index 66f755078a..3347c13fa1 100644
--- a/sysdeps/aarch64/configure.ac
+++ b/sysdeps/aarch64/configure.ac
@@ -90,3 +90,18 @@ EOF
   fi
   rm -rf conftest.*])
 LIBC_CONFIG_VAR([aarch64-variant-pcs], [$libc_cv_aarch64_variant_pcs])
+
+# Check if asm support armv8.2-a+sve
+AC_CACHE_CHECK(for SVE support in assembler, libc_cv_asm_sve, [dnl
+cat > conftest.s <<\EOF
+        ptrue p0.b
+EOF
+if AC_TRY_COMMAND(${CC-cc} -c -march=armv8.2-a+sve conftest.s 1>&AS_MESSAGE_LOG_FD); then
+  libc_cv_aarch64_sve_asm=yes
+else
+  libc_cv_aarch64_sve_asm=no
+fi
+rm -f conftest*])
+if test $libc_cv_aarch64_sve_asm = yes; then
+  AC_DEFINE(HAVE_AARCH64_SVE_ASM)
+fi
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v2 2/6] aarch64: define BTI_C and BTI_J macros as NOP unless HAVE_AARCH64_BTI
  2021-05-12  9:23 ` [PATCH v2 0/6] aarch64: " Naohiro Tamura
  2021-05-12  9:26   ` [PATCH v2 1/6] config: Added HAVE_AARCH64_SVE_ASM for aarch64 Naohiro Tamura
@ 2021-05-12  9:27   ` Naohiro Tamura
  2021-05-26 10:06     ` Szabolcs Nagy via Libc-alpha
  2021-05-12  9:28   ` [PATCH v2 3/6] aarch64: Added optimized memcpy and memmove for A64FX Naohiro Tamura
                     ` (6 subsequent siblings)
  8 siblings, 1 reply; 72+ messages in thread
From: Naohiro Tamura @ 2021-05-12  9:27 UTC (permalink / raw)
  To: libc-alpha; +Cc: Naohiro Tamura

From: Naohiro Tamura <naohirot@jp.fujitsu.com>

This patch defines the BTI_C and BTI_J macros conditionally, for
performance.
If HAVE_AARCH64_BTI is true, BTI_C and BTI_J are defined as the HINT
instructions for ARMv8.5 BTI (Branch Target Identification).
If HAVE_AARCH64_BTI is false, both BTI_C and BTI_J are defined as
NOP.
---
 sysdeps/aarch64/sysdep.h | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/sysdeps/aarch64/sysdep.h b/sysdeps/aarch64/sysdep.h
index 90acca4e42..b936e29cbd 100644
--- a/sysdeps/aarch64/sysdep.h
+++ b/sysdeps/aarch64/sysdep.h
@@ -62,8 +62,13 @@ strip_pac (void *p)
 #define ASM_SIZE_DIRECTIVE(name) .size name,.-name
 
 /* Branch Target Identitication support.  */
-#define BTI_C		hint	34
-#define BTI_J		hint	36
+#if HAVE_AARCH64_BTI
+# define BTI_C		hint	34
+# define BTI_J		hint	36
+#else
+# define BTI_C		nop
+# define BTI_J		nop
+#endif
 
 /* Return address signing support (pac-ret).  */
 #define PACIASP		hint	25
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v2 3/6] aarch64: Added optimized memcpy and memmove for A64FX
  2021-05-12  9:23 ` [PATCH v2 0/6] aarch64: " Naohiro Tamura
  2021-05-12  9:26   ` [PATCH v2 1/6] config: Added HAVE_AARCH64_SVE_ASM for aarch64 Naohiro Tamura
  2021-05-12  9:27   ` [PATCH v2 2/6] aarch64: define BTI_C and BTI_J macros as NOP unless HAVE_AARCH64_BTI Naohiro Tamura
@ 2021-05-12  9:28   ` Naohiro Tamura
  2021-05-26 10:19     ` Szabolcs Nagy via Libc-alpha
  2021-05-12  9:28   ` [PATCH v2 4/6] aarch64: Added optimized memset " Naohiro Tamura
                     ` (5 subsequent siblings)
  8 siblings, 1 reply; 72+ messages in thread
From: Naohiro Tamura @ 2021-05-12  9:28 UTC (permalink / raw)
  To: libc-alpha; +Cc: Naohiro Tamura

From: Naohiro Tamura <naohirot@jp.fujitsu.com>

This patch optimizes the performance of memcpy/memmove for A64FX [1]
which implements ARMv8-A SVE and has L1 64KB cache per core and L2 8MB
cache per NUMA node.

The performance optimization makes use of Scalable Vector Register
with several techniques such as loop unrolling, memory access
alignment, cache zero fill, and software pipelining.

SVE assembler code for memcpy/memmove is implemented as Vector Length
Agnostic code so theoretically it can be run on any SOC which supports
ARMv8-A SVE standard.

We confirmed that all testcases have been passed by running 'make
check' and 'make xcheck' not only on A64FX but also on ThunderX2.

And also we confirmed that the SVE 512 bit vector register performance
is roughly 4 times better than Advanced SIMD 128 bit register and 8
times better than scalar 64 bit register by running 'make bench'.

[1] https://github.com/fujitsu/A64FX
---
 manual/tunables.texi                          |   3 +-
 sysdeps/aarch64/multiarch/Makefile            |   2 +-
 sysdeps/aarch64/multiarch/ifunc-impl-list.c   |   8 +-
 sysdeps/aarch64/multiarch/init-arch.h         |   4 +-
 sysdeps/aarch64/multiarch/memcpy.c            |  12 +-
 sysdeps/aarch64/multiarch/memcpy_a64fx.S      | 405 ++++++++++++++++++
 sysdeps/aarch64/multiarch/memmove.c           |  12 +-
 .../unix/sysv/linux/aarch64/cpu-features.c    |   4 +
 .../unix/sysv/linux/aarch64/cpu-features.h    |   4 +
 9 files changed, 446 insertions(+), 8 deletions(-)
 create mode 100644 sysdeps/aarch64/multiarch/memcpy_a64fx.S

diff --git a/manual/tunables.texi b/manual/tunables.texi
index 6de647b426..fe7c1313cc 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -454,7 +454,8 @@ This tunable is specific to powerpc, powerpc64 and powerpc64le.
 The @code{glibc.cpu.name=xxx} tunable allows the user to tell @theglibc{} to
 assume that the CPU is @code{xxx} where xxx may have one of these values:
 @code{generic}, @code{falkor}, @code{thunderxt88}, @code{thunderx2t99},
-@code{thunderx2t99p1}, @code{ares}, @code{emag}, @code{kunpeng}.
+@code{thunderx2t99p1}, @code{ares}, @code{emag}, @code{kunpeng},
+@code{a64fx}.
 
 This tunable is specific to aarch64.
 @end deftp
diff --git a/sysdeps/aarch64/multiarch/Makefile b/sysdeps/aarch64/multiarch/Makefile
index dc3efffb36..04c3f17121 100644
--- a/sysdeps/aarch64/multiarch/Makefile
+++ b/sysdeps/aarch64/multiarch/Makefile
@@ -1,6 +1,6 @@
 ifeq ($(subdir),string)
 sysdep_routines += memcpy_generic memcpy_advsimd memcpy_thunderx memcpy_thunderx2 \
-		   memcpy_falkor \
+		   memcpy_falkor memcpy_a64fx \
 		   memset_generic memset_falkor memset_emag memset_kunpeng \
 		   memchr_generic memchr_nosimd \
 		   strlen_mte strlen_asimd
diff --git a/sysdeps/aarch64/multiarch/ifunc-impl-list.c b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
index 99a8c68aac..911393565c 100644
--- a/sysdeps/aarch64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
@@ -25,7 +25,7 @@
 #include <stdio.h>
 
 /* Maximum number of IFUNC implementations.  */
-#define MAX_IFUNC	4
+#define MAX_IFUNC	7
 
 size_t
 __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
@@ -43,12 +43,18 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 	      IFUNC_IMPL_ADD (array, i, memcpy, !bti, __memcpy_thunderx2)
 	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_falkor)
 	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_simd)
+#if HAVE_AARCH64_SVE_ASM
+	      IFUNC_IMPL_ADD (array, i, memcpy, sve, __memcpy_a64fx)
+#endif
 	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_generic))
   IFUNC_IMPL (i, name, memmove,
 	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_thunderx)
 	      IFUNC_IMPL_ADD (array, i, memmove, !bti, __memmove_thunderx2)
 	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_falkor)
 	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_simd)
+#if HAVE_AARCH64_SVE_ASM
+	      IFUNC_IMPL_ADD (array, i, memmove, sve, __memmove_a64fx)
+#endif
 	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_generic))
   IFUNC_IMPL (i, name, memset,
 	      /* Enable this on non-falkor processors too so that other cores
diff --git a/sysdeps/aarch64/multiarch/init-arch.h b/sysdeps/aarch64/multiarch/init-arch.h
index a167699e74..6d92c1bcff 100644
--- a/sysdeps/aarch64/multiarch/init-arch.h
+++ b/sysdeps/aarch64/multiarch/init-arch.h
@@ -33,4 +33,6 @@
   bool __attribute__((unused)) bti =					      \
     HAVE_AARCH64_BTI && GLRO(dl_aarch64_cpu_features).bti;		      \
   bool __attribute__((unused)) mte =					      \
-    MTE_ENABLED ();
+    MTE_ENABLED ();							      \
+  bool __attribute__((unused)) sve =					      \
+    GLRO(dl_aarch64_cpu_features).sve;
diff --git a/sysdeps/aarch64/multiarch/memcpy.c b/sysdeps/aarch64/multiarch/memcpy.c
index 0e0a5cbcfb..d90ee51ffc 100644
--- a/sysdeps/aarch64/multiarch/memcpy.c
+++ b/sysdeps/aarch64/multiarch/memcpy.c
@@ -33,6 +33,9 @@ extern __typeof (__redirect_memcpy) __memcpy_simd attribute_hidden;
 extern __typeof (__redirect_memcpy) __memcpy_thunderx attribute_hidden;
 extern __typeof (__redirect_memcpy) __memcpy_thunderx2 attribute_hidden;
 extern __typeof (__redirect_memcpy) __memcpy_falkor attribute_hidden;
+#if HAVE_AARCH64_SVE_ASM
+extern __typeof (__redirect_memcpy) __memcpy_a64fx attribute_hidden;
+#endif
 
 libc_ifunc (__libc_memcpy,
             (IS_THUNDERX (midr)
@@ -44,8 +47,13 @@ libc_ifunc (__libc_memcpy,
 		  : (IS_NEOVERSE_N1 (midr) || IS_NEOVERSE_N2 (midr)
 		     || IS_NEOVERSE_V1 (midr)
 		     ? __memcpy_simd
-		     : __memcpy_generic)))));
-
+#if HAVE_AARCH64_SVE_ASM
+                     : (IS_A64FX (midr)
+                        ? __memcpy_a64fx
+                        : __memcpy_generic))))));
+#else
+                     : __memcpy_generic)))));
+#endif
 # undef memcpy
 strong_alias (__libc_memcpy, memcpy);
 #endif
diff --git a/sysdeps/aarch64/multiarch/memcpy_a64fx.S b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
new file mode 100644
index 0000000000..e28afd708f
--- /dev/null
+++ b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
@@ -0,0 +1,405 @@
+/* Optimized memcpy for Fujitsu A64FX processor.
+   Copyright (C) 2012-2021 Free Software Foundation, Inc.
+
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+
+#if HAVE_AARCH64_SVE_ASM
+#if IS_IN (libc)
+# define MEMCPY __memcpy_a64fx
+# define MEMMOVE __memmove_a64fx
+
+/* Assumptions:
+ *
+ * ARMv8.2-a, AArch64, unaligned accesses, sve
+ *
+ */
+
+#define L2_SIZE         (8*1024*1024)/2 // L2 8MB/2
+#define CACHE_LINE_SIZE 256
+#define ZF_DIST         (CACHE_LINE_SIZE * 21)  // Zerofill distance
+#define dest            x0
+#define src             x1
+#define n               x2      // size
+#define tmp1            x3
+#define tmp2            x4
+#define tmp3            x5
+#define rest            x6
+#define dest_ptr        x7
+#define src_ptr         x8
+#define vector_length   x9
+#define cl_remainder    x10     // CACHE_LINE_SIZE remainder
+
+    .arch armv8.2-a+sve
+
+    .macro dc_zva times
+    dc          zva, tmp1
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    .if \times-1
+    dc_zva "(\times-1)"
+    .endif
+    .endm
+
+    .macro ld1b_unroll8
+    ld1b        z0.b, p0/z, [src_ptr, #0, mul vl]
+    ld1b        z1.b, p0/z, [src_ptr, #1, mul vl]
+    ld1b        z2.b, p0/z, [src_ptr, #2, mul vl]
+    ld1b        z3.b, p0/z, [src_ptr, #3, mul vl]
+    ld1b        z4.b, p0/z, [src_ptr, #4, mul vl]
+    ld1b        z5.b, p0/z, [src_ptr, #5, mul vl]
+    ld1b        z6.b, p0/z, [src_ptr, #6, mul vl]
+    ld1b        z7.b, p0/z, [src_ptr, #7, mul vl]
+    .endm
+
+    .macro stld1b_unroll4a
+    st1b        z0.b, p0,   [dest_ptr, #0, mul vl]
+    st1b        z1.b, p0,   [dest_ptr, #1, mul vl]
+    ld1b        z0.b, p0/z, [src_ptr,  #0, mul vl]
+    ld1b        z1.b, p0/z, [src_ptr,  #1, mul vl]
+    st1b        z2.b, p0,   [dest_ptr, #2, mul vl]
+    st1b        z3.b, p0,   [dest_ptr, #3, mul vl]
+    ld1b        z2.b, p0/z, [src_ptr,  #2, mul vl]
+    ld1b        z3.b, p0/z, [src_ptr,  #3, mul vl]
+    .endm
+
+    .macro stld1b_unroll4b
+    st1b        z4.b, p0,   [dest_ptr, #4, mul vl]
+    st1b        z5.b, p0,   [dest_ptr, #5, mul vl]
+    ld1b        z4.b, p0/z, [src_ptr,  #4, mul vl]
+    ld1b        z5.b, p0/z, [src_ptr,  #5, mul vl]
+    st1b        z6.b, p0,   [dest_ptr, #6, mul vl]
+    st1b        z7.b, p0,   [dest_ptr, #7, mul vl]
+    ld1b        z6.b, p0/z, [src_ptr,  #6, mul vl]
+    ld1b        z7.b, p0/z, [src_ptr,  #7, mul vl]
+    .endm
+
+    .macro stld1b_unroll8
+    stld1b_unroll4a
+    stld1b_unroll4b
+    .endm
+
+    .macro st1b_unroll8
+    st1b        z0.b, p0,   [dest_ptr, #0, mul vl]
+    st1b        z1.b, p0,   [dest_ptr, #1, mul vl]
+    st1b        z2.b, p0,   [dest_ptr, #2, mul vl]
+    st1b        z3.b, p0,   [dest_ptr, #3, mul vl]
+    st1b        z4.b, p0,   [dest_ptr, #4, mul vl]
+    st1b        z5.b, p0,   [dest_ptr, #5, mul vl]
+    st1b        z6.b, p0,   [dest_ptr, #6, mul vl]
+    st1b        z7.b, p0,   [dest_ptr, #7, mul vl]
+    .endm
+
+    .macro shortcut_for_small_size exit
+    // if rest <= vector_length * 2
+    whilelo     p0.b, xzr, n
+    whilelo     p1.b, vector_length, n
+    b.last      1f
+    ld1b        z0.b, p0/z, [src, #0, mul vl]
+    ld1b        z1.b, p1/z, [src, #1, mul vl]
+    st1b        z0.b, p0, [dest, #0, mul vl]
+    st1b        z1.b, p1, [dest, #1, mul vl]
+    ret
+1:  // if rest > vector_length * 8
+    cmp         n, vector_length, lsl 3 // vector_length * 8
+    b.hi        \exit
+    // if rest <= vector_length * 4
+    lsl         tmp1, vector_length, 1  // vector_length * 2
+    whilelo     p2.b, tmp1, n
+    incb        tmp1
+    whilelo     p3.b, tmp1, n
+    b.last      1f
+    ld1b        z0.b, p0/z, [src, #0, mul vl]
+    ld1b        z1.b, p1/z, [src, #1, mul vl]
+    ld1b        z2.b, p2/z, [src, #2, mul vl]
+    ld1b        z3.b, p3/z, [src, #3, mul vl]
+    st1b        z0.b, p0, [dest, #0, mul vl]
+    st1b        z1.b, p1, [dest, #1, mul vl]
+    st1b        z2.b, p2, [dest, #2, mul vl]
+    st1b        z3.b, p3, [dest, #3, mul vl]
+    ret
+1:  // if rest <= vector_length * 8
+    lsl         tmp1, vector_length, 2  // vector_length * 4
+    whilelo     p4.b, tmp1, n
+    incb        tmp1
+    whilelo     p5.b, tmp1, n
+    b.last      1f
+    ld1b        z0.b, p0/z, [src, #0, mul vl]
+    ld1b        z1.b, p1/z, [src, #1, mul vl]
+    ld1b        z2.b, p2/z, [src, #2, mul vl]
+    ld1b        z3.b, p3/z, [src, #3, mul vl]
+    ld1b        z4.b, p4/z, [src, #4, mul vl]
+    ld1b        z5.b, p5/z, [src, #5, mul vl]
+    st1b        z0.b, p0, [dest, #0, mul vl]
+    st1b        z1.b, p1, [dest, #1, mul vl]
+    st1b        z2.b, p2, [dest, #2, mul vl]
+    st1b        z3.b, p3, [dest, #3, mul vl]
+    st1b        z4.b, p4, [dest, #4, mul vl]
+    st1b        z5.b, p5, [dest, #5, mul vl]
+    ret
+1:  lsl         tmp1, vector_length, 2  // vector_length * 4
+    incb        tmp1                    // vector_length * 5
+    incb        tmp1                    // vector_length * 6
+    whilelo     p6.b, tmp1, n
+    incb        tmp1
+    whilelo     p7.b, tmp1, n
+    ld1b        z0.b, p0/z, [src, #0, mul vl]
+    ld1b        z1.b, p1/z, [src, #1, mul vl]
+    ld1b        z2.b, p2/z, [src, #2, mul vl]
+    ld1b        z3.b, p3/z, [src, #3, mul vl]
+    ld1b        z4.b, p4/z, [src, #4, mul vl]
+    ld1b        z5.b, p5/z, [src, #5, mul vl]
+    ld1b        z6.b, p6/z, [src, #6, mul vl]
+    ld1b        z7.b, p7/z, [src, #7, mul vl]
+    st1b        z0.b, p0, [dest, #0, mul vl]
+    st1b        z1.b, p1, [dest, #1, mul vl]
+    st1b        z2.b, p2, [dest, #2, mul vl]
+    st1b        z3.b, p3, [dest, #3, mul vl]
+    st1b        z4.b, p4, [dest, #4, mul vl]
+    st1b        z5.b, p5, [dest, #5, mul vl]
+    st1b        z6.b, p6, [dest, #6, mul vl]
+    st1b        z7.b, p7, [dest, #7, mul vl]
+    ret
+    .endm
+
+ENTRY (MEMCPY)
+
+    PTR_ARG (0)
+    PTR_ARG (1)
+    SIZE_ARG (2)
+
+L(memcpy):
+    cntb        vector_length
+    // shortcut for less than vector_length * 8
+    // gives a free ptrue to p0.b for n >= vector_length
+    shortcut_for_small_size L(vl_agnostic)
+    // end of shortcut
+
+L(vl_agnostic): // VL Agnostic
+    mov         rest, n
+    mov         dest_ptr, dest
+    mov         src_ptr, src
+    // if rest >= L2_SIZE && vector_length == 64 then L(L2)
+    mov         tmp1, 64
+    cmp         rest, L2_SIZE
+    ccmp        vector_length, tmp1, 0, cs
+    b.eq        L(L2)
+
+L(unroll8): // unrolling and software pipeline
+    lsl         tmp1, vector_length, 3  // vector_length * 8
+    .p2align 3
+    cmp         rest, tmp1
+    b.cc        L(last)
+    ld1b_unroll8
+    add         src_ptr, src_ptr, tmp1
+    sub         rest, rest, tmp1
+    cmp         rest, tmp1
+    b.cc        2f
+    .p2align 3
+1:  stld1b_unroll8
+    add         dest_ptr, dest_ptr, tmp1
+    add         src_ptr, src_ptr, tmp1
+    sub         rest, rest, tmp1
+    cmp         rest, tmp1
+    b.ge        1b
+2:  st1b_unroll8
+    add         dest_ptr, dest_ptr, tmp1
+
+    .p2align 3
+L(last):
+    whilelo     p0.b, xzr, rest
+    whilelo     p1.b, vector_length, rest
+    b.last      1f
+    ld1b        z0.b, p0/z, [src_ptr, #0, mul vl]
+    ld1b        z1.b, p1/z, [src_ptr, #1, mul vl]
+    st1b        z0.b, p0, [dest_ptr, #0, mul vl]
+    st1b        z1.b, p1, [dest_ptr, #1, mul vl]
+    ret
+1:  lsl         tmp1, vector_length, 1  // vector_length * 2
+    whilelo     p2.b, tmp1, rest
+    incb        tmp1
+    whilelo     p3.b, tmp1, rest
+    b.last      1f
+    ld1b        z0.b, p0/z, [src_ptr, #0, mul vl]
+    ld1b        z1.b, p1/z, [src_ptr, #1, mul vl]
+    ld1b        z2.b, p2/z, [src_ptr, #2, mul vl]
+    ld1b        z3.b, p3/z, [src_ptr, #3, mul vl]
+    st1b        z0.b, p0, [dest_ptr, #0, mul vl]
+    st1b        z1.b, p1, [dest_ptr, #1, mul vl]
+    st1b        z2.b, p2, [dest_ptr, #2, mul vl]
+    st1b        z3.b, p3, [dest_ptr, #3, mul vl]
+    ret
+1:  lsl         tmp1, vector_length, 2  // vector_length * 4
+    whilelo     p4.b, tmp1, rest
+    incb        tmp1
+    whilelo     p5.b, tmp1, rest
+    incb        tmp1
+    whilelo     p6.b, tmp1, rest
+    incb        tmp1
+    whilelo     p7.b, tmp1, rest
+    ld1b        z0.b, p0/z, [src_ptr, #0, mul vl]
+    ld1b        z1.b, p1/z, [src_ptr, #1, mul vl]
+    ld1b        z2.b, p2/z, [src_ptr, #2, mul vl]
+    ld1b        z3.b, p3/z, [src_ptr, #3, mul vl]
+    ld1b        z4.b, p4/z, [src_ptr, #4, mul vl]
+    ld1b        z5.b, p5/z, [src_ptr, #5, mul vl]
+    ld1b        z6.b, p6/z, [src_ptr, #6, mul vl]
+    ld1b        z7.b, p7/z, [src_ptr, #7, mul vl]
+    st1b        z0.b, p0, [dest_ptr, #0, mul vl]
+    st1b        z1.b, p1, [dest_ptr, #1, mul vl]
+    st1b        z2.b, p2, [dest_ptr, #2, mul vl]
+    st1b        z3.b, p3, [dest_ptr, #3, mul vl]
+    st1b        z4.b, p4, [dest_ptr, #4, mul vl]
+    st1b        z5.b, p5, [dest_ptr, #5, mul vl]
+    st1b        z6.b, p6, [dest_ptr, #6, mul vl]
+    st1b        z7.b, p7, [dest_ptr, #7, mul vl]
+    ret
+
+L(L2):
+    // align dest address at CACHE_LINE_SIZE byte boundary
+    mov         tmp1, CACHE_LINE_SIZE
+    ands        tmp2, dest_ptr, CACHE_LINE_SIZE - 1
+    // if cl_remainder == 0
+    b.eq        L(L2_dc_zva)
+    sub         cl_remainder, tmp1, tmp2
+    // process remainder until the first CACHE_LINE_SIZE boundary
+    whilelo     p1.b, xzr, cl_remainder        // keep p0.b all true
+    whilelo     p2.b, vector_length, cl_remainder
+    b.last      1f
+    ld1b        z1.b, p1/z, [src_ptr, #0, mul vl]
+    ld1b        z2.b, p2/z, [src_ptr, #1, mul vl]
+    st1b        z1.b, p1, [dest_ptr, #0, mul vl]
+    st1b        z2.b, p2, [dest_ptr, #1, mul vl]
+    b           2f
+1:  lsl         tmp1, vector_length, 1  // vector_length * 2
+    whilelo     p3.b, tmp1, cl_remainder
+    incb        tmp1
+    whilelo     p4.b, tmp1, cl_remainder
+    ld1b        z1.b, p1/z, [src_ptr, #0, mul vl]
+    ld1b        z2.b, p2/z, [src_ptr, #1, mul vl]
+    ld1b        z3.b, p3/z, [src_ptr, #2, mul vl]
+    ld1b        z4.b, p4/z, [src_ptr, #3, mul vl]
+    st1b        z1.b, p1, [dest_ptr, #0, mul vl]
+    st1b        z2.b, p2, [dest_ptr, #1, mul vl]
+    st1b        z3.b, p3, [dest_ptr, #2, mul vl]
+    st1b        z4.b, p4, [dest_ptr, #3, mul vl]
+2:  add         dest_ptr, dest_ptr, cl_remainder
+    add         src_ptr, src_ptr, cl_remainder
+    sub         rest, rest, cl_remainder
+
+L(L2_dc_zva):
+    // zero fill
+    and         tmp1, dest, 0xffffffffffffff
+    and         tmp2, src, 0xffffffffffffff
+    subs        tmp1, tmp1, tmp2     // diff
+    b.ge        1f
+    neg         tmp1, tmp1
+1:  mov         tmp3, ZF_DIST + CACHE_LINE_SIZE * 2
+    cmp         tmp1, tmp3
+    b.lo        L(unroll8)
+    mov         tmp1, dest_ptr
+    dc_zva      (ZF_DIST / CACHE_LINE_SIZE) - 1
+    // unroll
+    ld1b_unroll8        // this line has to be after "b.lo L(unroll8)"
+    add         src_ptr, src_ptr, CACHE_LINE_SIZE * 2
+    sub         rest, rest, CACHE_LINE_SIZE * 2
+    mov         tmp1, ZF_DIST
+    .p2align 3
+1:  stld1b_unroll4a
+    add         tmp2, dest_ptr, tmp1    // dest_ptr + ZF_DIST
+    dc          zva, tmp2
+    stld1b_unroll4b
+    add         tmp2, tmp2, CACHE_LINE_SIZE
+    dc          zva, tmp2
+    add         dest_ptr, dest_ptr, CACHE_LINE_SIZE * 2
+    add         src_ptr, src_ptr, CACHE_LINE_SIZE * 2
+    sub         rest, rest, CACHE_LINE_SIZE * 2
+    cmp         rest, tmp3      // ZF_DIST + CACHE_LINE_SIZE * 2
+    b.ge        1b
+    st1b_unroll8
+    add         dest_ptr, dest_ptr, CACHE_LINE_SIZE * 2
+    b           L(unroll8)
+
+END (MEMCPY)
+libc_hidden_builtin_def (MEMCPY)
+
+
+ENTRY (MEMMOVE)
+
+    PTR_ARG (0)
+    PTR_ARG (1)
+    SIZE_ARG (2)
+
+    // remove tag address
+    // dest has to be immutable because it is the return value
+    // src has to be immutable because it is used in L(bwd_last)
+    and         tmp2, dest, 0xffffffffffffff    // save dest_notag into tmp2
+    and         tmp3, src, 0xffffffffffffff     // save src_notag into tmp3
+    cmp         n, 0
+    ccmp        tmp2, tmp3, 4, ne
+    b.ne        1f
+    ret
+1:  cntb        vector_length
+    // shortcut for less than vector_length * 8
+    // gives a free ptrue to p0.b for n >= vector_length
+    // tmp2 and tmp3 should not be used in this macro to keep notag addresses
+    shortcut_for_small_size L(dispatch)
+    // end of shortcut
+
+L(dispatch):
+    // tmp2 = dest_notag, tmp3 = src_notag
+    // diff = dest_notag - src_notag
+    sub         tmp1, tmp2, tmp3
+    // if diff <= 0 || diff >= n then memcpy
+    cmp         tmp1, 0
+    ccmp        tmp1, n, 2, gt
+    b.cs        L(vl_agnostic)
+
+L(bwd_start):
+    mov         rest, n
+    add         dest_ptr, dest, n       // dest_end
+    add         src_ptr, src, n         // src_end
+
+L(bwd_unroll8): // unrolling and software pipeline
+    lsl         tmp1, vector_length, 3  // vector_length * 8
+    .p2align 3
+    cmp         rest, tmp1
+    b.cc        L(bwd_last)
+    sub         src_ptr, src_ptr, tmp1
+    ld1b_unroll8
+    sub         rest, rest, tmp1
+    cmp         rest, tmp1
+    b.cc        2f
+    .p2align 3
+1:  sub         src_ptr, src_ptr, tmp1
+    sub         dest_ptr, dest_ptr, tmp1
+    stld1b_unroll8
+    sub         rest, rest, tmp1
+    cmp         rest, tmp1
+    b.ge        1b
+2:  sub         dest_ptr, dest_ptr, tmp1
+    st1b_unroll8
+
+L(bwd_last):
+    mov         dest_ptr, dest
+    mov         src_ptr, src
+    b           L(last)
+
+END (MEMMOVE)
+libc_hidden_builtin_def (MEMMOVE)
+#endif /* IS_IN (libc) */
+#endif /* HAVE_AARCH64_SVE_ASM */
diff --git a/sysdeps/aarch64/multiarch/memmove.c b/sysdeps/aarch64/multiarch/memmove.c
index 12d77818a9..be2d35a251 100644
--- a/sysdeps/aarch64/multiarch/memmove.c
+++ b/sysdeps/aarch64/multiarch/memmove.c
@@ -33,6 +33,9 @@ extern __typeof (__redirect_memmove) __memmove_simd attribute_hidden;
 extern __typeof (__redirect_memmove) __memmove_thunderx attribute_hidden;
 extern __typeof (__redirect_memmove) __memmove_thunderx2 attribute_hidden;
 extern __typeof (__redirect_memmove) __memmove_falkor attribute_hidden;
+#if HAVE_AARCH64_SVE_ASM
+extern __typeof (__redirect_memmove) __memmove_a64fx attribute_hidden;
+#endif
 
 libc_ifunc (__libc_memmove,
             (IS_THUNDERX (midr)
@@ -44,8 +47,13 @@ libc_ifunc (__libc_memmove,
 		  : (IS_NEOVERSE_N1 (midr) || IS_NEOVERSE_N2 (midr)
 		     || IS_NEOVERSE_V1 (midr)
 		     ? __memmove_simd
-		     : __memmove_generic)))));
-
+#if HAVE_AARCH64_SVE_ASM
+                     : (IS_A64FX (midr)
+                        ? __memmove_a64fx
+                        : __memmove_generic))))));
+#else
+                        : __memmove_generic)))));
+#endif
 # undef memmove
 strong_alias (__libc_memmove, memmove);
 #endif
diff --git a/sysdeps/unix/sysv/linux/aarch64/cpu-features.c b/sysdeps/unix/sysv/linux/aarch64/cpu-features.c
index db6aa3516c..6206a2f618 100644
--- a/sysdeps/unix/sysv/linux/aarch64/cpu-features.c
+++ b/sysdeps/unix/sysv/linux/aarch64/cpu-features.c
@@ -46,6 +46,7 @@ static struct cpu_list cpu_list[] = {
       {"ares",		 0x411FD0C0},
       {"emag",		 0x503F0001},
       {"kunpeng920", 	 0x481FD010},
+      {"a64fx",		 0x460F0010},
       {"generic", 	 0x0}
 };
 
@@ -116,4 +117,7 @@ init_cpu_features (struct cpu_features *cpu_features)
 	     (PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_ASYNC | MTE_ALLOWED_TAGS),
 	     0, 0, 0);
 #endif
+
+  /* Check if SVE is supported.  */
+  cpu_features->sve = GLRO (dl_hwcap) & HWCAP_SVE;
 }
diff --git a/sysdeps/unix/sysv/linux/aarch64/cpu-features.h b/sysdeps/unix/sysv/linux/aarch64/cpu-features.h
index 3b9bfed134..2b322e5414 100644
--- a/sysdeps/unix/sysv/linux/aarch64/cpu-features.h
+++ b/sysdeps/unix/sysv/linux/aarch64/cpu-features.h
@@ -65,6 +65,9 @@
 #define IS_KUNPENG920(midr) (MIDR_IMPLEMENTOR(midr) == 'H'			   \
                         && MIDR_PARTNUM(midr) == 0xd01)
 
+#define IS_A64FX(midr) (MIDR_IMPLEMENTOR(midr) == 'F'			      \
+			&& MIDR_PARTNUM(midr) == 0x001)
+
 struct cpu_features
 {
   uint64_t midr_el1;
@@ -72,6 +75,7 @@ struct cpu_features
   bool bti;
   /* Currently, the GLIBC memory tagging tunable only defines 8 bits.  */
   uint8_t mte_state;
+  bool sve;
 };
 
 #endif /* _CPU_FEATURES_AARCH64_H  */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 72+ messages in thread
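
A note on the L(dispatch) logic in the memmove code above: the signed
compare against 0 plus the ccmp against n encode the usual overlap test
that decides between a forward (memcpy) and a backward copy.  A rough C
equivalent, given only as a sketch (the helper name is invented, not
taken from the patch):

#include <stddef.h>
#include <stdint.h>

static int
must_copy_backward (const void *dest, const void *src, size_t n)
{
  /* diff = dest - src as an unsigned value.  A forward copy is safe
     when diff == 0, when dest is below src (the subtraction wraps to a
     huge value >= n), or when dest starts at or beyond src + n.  */
  uintptr_t diff = (uintptr_t) dest - (uintptr_t) src;
  return diff != 0 && diff < n;
}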

* [PATCH v2 4/6] aarch64: Added optimized memset for A64FX
  2021-05-12  9:23 ` [PATCH v2 0/6] aarch64: " Naohiro Tamura
                     ` (2 preceding siblings ...)
  2021-05-12  9:28   ` [PATCH v2 3/6] aarch64: Added optimized memcpy and memmove for A64FX Naohiro Tamura
@ 2021-05-12  9:28   ` Naohiro Tamura
  2021-05-26 10:22     ` Szabolcs Nagy via Libc-alpha
  2021-05-12  9:29   ` [PATCH v2 5/6] scripts: Added Vector Length Set test helper script Naohiro Tamura
                     ` (4 subsequent siblings)
  8 siblings, 1 reply; 72+ messages in thread
From: Naohiro Tamura @ 2021-05-12  9:28 UTC (permalink / raw)
  To: libc-alpha; +Cc: Naohiro Tamura

From: Naohiro Tamura <naohirot@jp.fujitsu.com>

This patch optimizes the performance of memset for A64FX [1], which
implements ARMv8-A SVE and has a 64KB L1 cache per core and an 8MB L2
cache per NUMA node.

The performance optimization makes use of Scalable Vector Register
with several techniques such as loop unrolling, memory access
alignment, cache zero fill and prefetch.

The SVE assembler code for memset is implemented as Vector Length
Agnostic code, so in principle it can run on any SoC that supports the
ARMv8-A SVE standard.

We confirmed that all test cases pass when running 'make check' and
'make xcheck', not only on A64FX but also on ThunderX2.

We also confirmed with 'make bench' that the SVE 512-bit vector
register implementation is roughly 4 times faster than the Advanced
SIMD 128-bit one and 8 times faster than the scalar 64-bit one.

[1] https://github.com/fujitsu/A64FX
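
To illustrate in C terms what the Vector Length Agnostic pattern does,
here is a rough sketch using the ACLE intrinsics from <arm_sve.h>.  It
is only an illustration (the function name is made up; the actual
implementation below is hand-written assembly with unrolling, alignment
handling, zero fill and prefetch on top of this basic shape):

/* Build with e.g. -march=armv8.2-a+sve.  */
#include <arm_sve.h>
#include <stddef.h>
#include <stdint.h>

void
sve_memset_sketch (uint8_t *dst, int c, size_t n)
{
  svuint8_t v = svdup_n_u8 ((uint8_t) c);
  size_t vl = svcntb ();              /* bytes per vector, like "cntb" */
  size_t i = 0;
  /* Whole vectors, corresponding to the unrolled st1b loops.  */
  for (; i + vl <= n; i += vl)
    svst1_u8 (svptrue_b8 (), dst + i, v);
  /* Remainder handled by a partial predicate, like whilelo + st1b in
     L(last); no scalar byte loop is needed.  */
  svst1_u8 (svwhilelt_b8_u64 (i, n), dst + i, v);
}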
---
 sysdeps/aarch64/multiarch/Makefile          |   1 +
 sysdeps/aarch64/multiarch/ifunc-impl-list.c |   5 +-
 sysdeps/aarch64/multiarch/memset.c          |  11 +-
 sysdeps/aarch64/multiarch/memset_a64fx.S    | 268 ++++++++++++++++++++
 4 files changed, 283 insertions(+), 2 deletions(-)
 create mode 100644 sysdeps/aarch64/multiarch/memset_a64fx.S

diff --git a/sysdeps/aarch64/multiarch/Makefile b/sysdeps/aarch64/multiarch/Makefile
index 04c3f17121..7500cf1e93 100644
--- a/sysdeps/aarch64/multiarch/Makefile
+++ b/sysdeps/aarch64/multiarch/Makefile
@@ -2,6 +2,7 @@ ifeq ($(subdir),string)
 sysdep_routines += memcpy_generic memcpy_advsimd memcpy_thunderx memcpy_thunderx2 \
 		   memcpy_falkor memcpy_a64fx \
 		   memset_generic memset_falkor memset_emag memset_kunpeng \
+		   memset_a64fx \
 		   memchr_generic memchr_nosimd \
 		   strlen_mte strlen_asimd
 endif
diff --git a/sysdeps/aarch64/multiarch/ifunc-impl-list.c b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
index 911393565c..4e1a641d9f 100644
--- a/sysdeps/aarch64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
@@ -37,7 +37,7 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 
   INIT_ARCH ();
 
-  /* Support sysdeps/aarch64/multiarch/memcpy.c and memmove.c.  */
+  /* Support sysdeps/aarch64/multiarch/memcpy.c, memmove.c and memset.c.  */
   IFUNC_IMPL (i, name, memcpy,
 	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_thunderx)
 	      IFUNC_IMPL_ADD (array, i, memcpy, !bti, __memcpy_thunderx2)
@@ -62,6 +62,9 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 	      IFUNC_IMPL_ADD (array, i, memset, (zva_size == 64), __memset_falkor)
 	      IFUNC_IMPL_ADD (array, i, memset, (zva_size == 64), __memset_emag)
 	      IFUNC_IMPL_ADD (array, i, memset, 1, __memset_kunpeng)
+#if HAVE_AARCH64_SVE_ASM
+	      IFUNC_IMPL_ADD (array, i, memset, sve, __memset_a64fx)
+#endif
 	      IFUNC_IMPL_ADD (array, i, memset, 1, __memset_generic))
   IFUNC_IMPL (i, name, memchr,
 	      IFUNC_IMPL_ADD (array, i, memchr, !mte, __memchr_nosimd)
diff --git a/sysdeps/aarch64/multiarch/memset.c b/sysdeps/aarch64/multiarch/memset.c
index 28d3926bc2..48a59574dd 100644
--- a/sysdeps/aarch64/multiarch/memset.c
+++ b/sysdeps/aarch64/multiarch/memset.c
@@ -31,6 +31,9 @@ extern __typeof (__redirect_memset) __libc_memset;
 extern __typeof (__redirect_memset) __memset_falkor attribute_hidden;
 extern __typeof (__redirect_memset) __memset_emag attribute_hidden;
 extern __typeof (__redirect_memset) __memset_kunpeng attribute_hidden;
+#if HAVE_AARCH64_SVE_ASM
+extern __typeof (__redirect_memset) __memset_a64fx attribute_hidden;
+#endif
 extern __typeof (__redirect_memset) __memset_generic attribute_hidden;
 
 libc_ifunc (__libc_memset,
@@ -40,7 +43,13 @@ libc_ifunc (__libc_memset,
 	     ? __memset_falkor
 	     : (IS_EMAG (midr) && zva_size == 64
 	       ? __memset_emag
-	       : __memset_generic)));
+#if HAVE_AARCH64_SVE_ASM
+	       : (IS_A64FX (midr)
+		  ? __memset_a64fx
+	          : __memset_generic))));
+#else
+	          : __memset_generic)));
+#endif
 
 # undef memset
 strong_alias (__libc_memset, memset);
diff --git a/sysdeps/aarch64/multiarch/memset_a64fx.S b/sysdeps/aarch64/multiarch/memset_a64fx.S
new file mode 100644
index 0000000000..9bd58cab6d
--- /dev/null
+++ b/sysdeps/aarch64/multiarch/memset_a64fx.S
@@ -0,0 +1,268 @@
+/* Optimized memset for Fujitsu A64FX processor.
+   Copyright (C) 2012-2021 Free Software Foundation, Inc.
+
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include <sysdeps/aarch64/memset-reg.h>
+
+#if HAVE_AARCH64_SVE_ASM
+#if IS_IN (libc)
+# define MEMSET __memset_a64fx
+
+/* Assumptions:
+ *
+ * ARMv8.2-a, AArch64, unaligned accesses, sve
+ *
+ */
+
+#define L1_SIZE         (64*1024)       // L1 64KB
+#define L2_SIZE         (8*1024*1024)   // L2 8MB - 1MB
+#define CACHE_LINE_SIZE 256
+#define PF_DIST_L1      (CACHE_LINE_SIZE * 16)  // Prefetch distance L1
+#define ZF_DIST         (CACHE_LINE_SIZE * 21)  // Zerofill distance
+#define rest            x8
+#define vector_length   x9
+#define vl_remainder    x10     // vector_length remainder
+#define cl_remainder    x11     // CACHE_LINE_SIZE remainder
+
+    .arch armv8.2-a+sve
+
+    .macro dc_zva times
+    dc          zva, tmp1
+    add         tmp1, tmp1, CACHE_LINE_SIZE
+    .if \times-1
+    dc_zva "(\times-1)"
+    .endif
+    .endm
+
+    .macro st1b_unroll first=0, last=7
+    st1b        z0.b, p0, [dst, #\first, mul vl]
+    .if \last-\first
+    st1b_unroll "(\first+1)", \last
+    .endif
+    .endm
+
+    .macro shortcut_for_small_size exit
+    // if rest <= vector_length * 2
+    whilelo     p0.b, xzr, count
+    whilelo     p1.b, vector_length, count
+    b.last      1f
+    st1b        z0.b, p0, [dstin, #0, mul vl]
+    st1b        z0.b, p1, [dstin, #1, mul vl]
+    ret
+1:  // if rest > vector_length * 8
+    cmp         count, vector_length, lsl 3     // vector_length * 8
+    b.hi        \exit
+    // if rest <= vector_length * 4
+    lsl         tmp1, vector_length, 1  // vector_length * 2
+    whilelo     p2.b, tmp1, count
+    incb        tmp1
+    whilelo     p3.b, tmp1, count
+    b.last      1f
+    st1b        z0.b, p0, [dstin, #0, mul vl]
+    st1b        z0.b, p1, [dstin, #1, mul vl]
+    st1b        z0.b, p2, [dstin, #2, mul vl]
+    st1b        z0.b, p3, [dstin, #3, mul vl]
+    ret
+1:  // if rest <= vector_length * 8
+    lsl         tmp1, vector_length, 2  // vector_length * 4
+    whilelo     p4.b, tmp1, count
+    incb        tmp1
+    whilelo     p5.b, tmp1, count
+    b.last      1f
+    st1b        z0.b, p0, [dstin, #0, mul vl]
+    st1b        z0.b, p1, [dstin, #1, mul vl]
+    st1b        z0.b, p2, [dstin, #2, mul vl]
+    st1b        z0.b, p3, [dstin, #3, mul vl]
+    st1b        z0.b, p4, [dstin, #4, mul vl]
+    st1b        z0.b, p5, [dstin, #5, mul vl]
+    ret
+1:  lsl         tmp1, vector_length, 2  // vector_length * 4
+    incb        tmp1                    // vector_length * 5
+    incb        tmp1                    // vector_length * 6
+    whilelo     p6.b, tmp1, count
+    incb        tmp1
+    whilelo     p7.b, tmp1, count
+    st1b        z0.b, p0, [dstin, #0, mul vl]
+    st1b        z0.b, p1, [dstin, #1, mul vl]
+    st1b        z0.b, p2, [dstin, #2, mul vl]
+    st1b        z0.b, p3, [dstin, #3, mul vl]
+    st1b        z0.b, p4, [dstin, #4, mul vl]
+    st1b        z0.b, p5, [dstin, #5, mul vl]
+    st1b        z0.b, p6, [dstin, #6, mul vl]
+    st1b        z0.b, p7, [dstin, #7, mul vl]
+    ret
+    .endm
+
+ENTRY (MEMSET)
+
+    PTR_ARG (0)
+    SIZE_ARG (2)
+
+    cbnz        count, 1f
+    ret
+1:  dup         z0.b, valw
+    cntb        vector_length
+    // shortcut for less than vector_length * 8
+    // gives a free ptrue to p0.b for n >= vector_length
+    shortcut_for_small_size L(vl_agnostic)
+    // end of shortcut
+
+L(vl_agnostic): // VL Agnostic
+    mov         rest, count
+    mov         dst, dstin
+    add         dstend, dstin, count
+    // if rest >= L2_SIZE && vector_length == 64 then L(L2)
+    mov         tmp1, 64
+    cmp         rest, L2_SIZE
+    ccmp        vector_length, tmp1, 0, cs
+    b.eq        L(L2)
+    // if rest >= L1_SIZE && vector_length == 64 then L(L1_prefetch)
+    cmp         rest, L1_SIZE
+    ccmp        vector_length, tmp1, 0, cs
+    b.eq        L(L1_prefetch)
+
+L(unroll32):
+    lsl         tmp1, vector_length, 3  // vector_length * 8
+    lsl         tmp2, vector_length, 5  // vector_length * 32
+    .p2align 3
+1:  cmp         rest, tmp2
+    b.cc        L(unroll8)
+    st1b_unroll
+    add         dst, dst, tmp1
+    st1b_unroll
+    add         dst, dst, tmp1
+    st1b_unroll
+    add         dst, dst, tmp1
+    st1b_unroll
+    add         dst, dst, tmp1
+    sub         rest, rest, tmp2
+    b           1b
+
+L(unroll8):
+    lsl         tmp1, vector_length, 3
+    .p2align 3
+1:  cmp         rest, tmp1
+    b.cc        L(last)
+    st1b_unroll
+    add         dst, dst, tmp1
+    sub         rest, rest, tmp1
+    b           1b
+
+L(last):
+    whilelo     p0.b, xzr, rest
+    whilelo     p1.b, vector_length, rest
+    b.last      1f
+    st1b        z0.b, p0, [dst, #0, mul vl]
+    st1b        z0.b, p1, [dst, #1, mul vl]
+    ret
+1:  lsl         tmp1, vector_length, 1  // vector_length * 2
+    whilelo     p2.b, tmp1, rest
+    incb        tmp1
+    whilelo     p3.b, tmp1, rest
+    b.last      1f
+    st1b        z0.b, p0, [dst, #0, mul vl]
+    st1b        z0.b, p1, [dst, #1, mul vl]
+    st1b        z0.b, p2, [dst, #2, mul vl]
+    st1b        z0.b, p3, [dst, #3, mul vl]
+    ret
+1:  lsl         tmp1, vector_length, 2  // vector_length * 4
+    whilelo     p4.b, tmp1, rest
+    incb        tmp1
+    whilelo     p5.b, tmp1, rest
+    incb        tmp1
+    whilelo     p6.b, tmp1, rest
+    incb        tmp1
+    whilelo     p7.b, tmp1, rest
+    st1b        z0.b, p0, [dst, #0, mul vl]
+    st1b        z0.b, p1, [dst, #1, mul vl]
+    st1b        z0.b, p2, [dst, #2, mul vl]
+    st1b        z0.b, p3, [dst, #3, mul vl]
+    st1b        z0.b, p4, [dst, #4, mul vl]
+    st1b        z0.b, p5, [dst, #5, mul vl]
+    st1b        z0.b, p6, [dst, #6, mul vl]
+    st1b        z0.b, p7, [dst, #7, mul vl]
+    ret
+
+L(L1_prefetch): // if rest >= L1_SIZE
+    .p2align 3
+1:  st1b_unroll 0, 3
+    prfm        pstl1keep, [dst, PF_DIST_L1]
+    st1b_unroll 4, 7
+    prfm        pstl1keep, [dst, PF_DIST_L1 + CACHE_LINE_SIZE]
+    add         dst, dst, CACHE_LINE_SIZE * 2
+    sub         rest, rest, CACHE_LINE_SIZE * 2
+    cmp         rest, L1_SIZE
+    b.ge        1b
+    cbnz        rest, L(unroll32)
+    ret
+
+L(L2):
+    // align dst address at vector_length byte boundary
+    sub         tmp1, vector_length, 1
+    ands        tmp2, dst, tmp1
+    // if vl_remainder == 0
+    b.eq        1f
+    sub         vl_remainder, vector_length, tmp2
+    // process remainder until the first vector_length boundary
+    whilelt     p2.b, xzr, vl_remainder
+    st1b        z0.b, p2, [dst]
+    add         dst, dst, vl_remainder
+    sub         rest, rest, vl_remainder
+    // align dstin address at CACHE_LINE_SIZE byte boundary
+1:  mov         tmp1, CACHE_LINE_SIZE
+    ands        tmp2, dst, CACHE_LINE_SIZE - 1
+    // if cl_remainder == 0
+    b.eq        L(L2_dc_zva)
+    sub         cl_remainder, tmp1, tmp2
+    // process remainder until the first CACHE_LINE_SIZE boundary
+    mov         tmp1, xzr       // index
+2:  whilelt     p2.b, tmp1, cl_remainder
+    st1b        z0.b, p2, [dst, tmp1]
+    incb        tmp1
+    cmp         tmp1, cl_remainder
+    b.lo        2b
+    add         dst, dst, cl_remainder
+    sub         rest, rest, cl_remainder
+
+L(L2_dc_zva):
+    // zero fill
+    mov         tmp1, dst
+    dc_zva      (ZF_DIST / CACHE_LINE_SIZE) - 1
+    mov         zva_len, ZF_DIST
+    add         tmp1, zva_len, CACHE_LINE_SIZE * 2
+    // unroll
+    .p2align 3
+1:  st1b_unroll 0, 3
+    add         tmp2, dst, zva_len
+    dc          zva, tmp2
+    st1b_unroll 4, 7
+    add         tmp2, tmp2, CACHE_LINE_SIZE
+    dc          zva, tmp2
+    add         dst, dst, CACHE_LINE_SIZE * 2
+    sub         rest, rest, CACHE_LINE_SIZE * 2
+    cmp         rest, tmp1      // ZF_DIST + CACHE_LINE_SIZE * 2
+    b.ge        1b
+    cbnz        rest, L(unroll8)
+    ret
+
+END (MEMSET)
+libc_hidden_builtin_def (MEMSET)
+
+#endif /* IS_IN (libc) */
+#endif /* HAVE_AARCH64_SVE_ASM */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 72+ messages in thread
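
One remark on the L(L2_dc_zva) loop above, for readers unfamiliar with
DC ZVA: the instruction zeroes a whole block of memory (its size is
reported by DCZID_EL0, assumed here to match CACHE_LINE_SIZE) without
first fetching the line, so issuing it ZF_DIST bytes ahead of the store
pointer lets the following st1b stores hit cache lines that are already
allocated.  A minimal wrapper for the instruction, as a sketch only and
not code from the patch, could look like:

#include <stdint.h>

/* Zero the DC-ZVA block containing P.  */
static inline void
dc_zva_block (void *p)
{
  __asm__ volatile ("dc zva, %0" : : "r" (p) : "memory");
}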

* [PATCH v2 5/6] scripts: Added Vector Length Set test helper script
  2021-05-12  9:23 ` [PATCH v2 0/6] aarch64: " Naohiro Tamura
                     ` (3 preceding siblings ...)
  2021-05-12  9:28   ` [PATCH v2 4/6] aarch64: Added optimized memset " Naohiro Tamura
@ 2021-05-12  9:29   ` Naohiro Tamura
  2021-05-12 16:58     ` Joseph Myers
  2021-05-20  7:34     ` Naohiro Tamura
  2021-05-12  9:29   ` [PATCH v2 6/6] benchtests: Fixed bench-memcpy-random: buf1: mprotect failed Naohiro Tamura
                     ` (3 subsequent siblings)
  8 siblings, 2 replies; 72+ messages in thread
From: Naohiro Tamura @ 2021-05-12  9:29 UTC (permalink / raw)
  To: libc-alpha; +Cc: Naohiro Tamura

From: Naohiro Tamura <naohirot@jp.fujitsu.com>

This patch adds a test helper script that changes the Vector Length
for a child process. The script can be used as a test-wrapper for
'make check'.

Usage examples:

ubuntu@bionic:~/build$ make check subdirs=string \
test-wrapper='~/glibc/scripts/vltest.py 16'

ubuntu@bionic:~/build$ ~/glibc/scripts/vltest.py 16 make test \
t=string/test-memcpy

ubuntu@bionic:~/build$ ~/glibc/scripts/vltest.py 32 ./debugglibc.sh \
string/test-memmove

ubuntu@bionic:~/build$ ~/glibc/scripts/vltest.py 64 ./testrun.sh
string/test-memset
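
The script essentially boils down to a getauxval() check followed by a
prctl(PR_SVE_SET_VL) call before exec'ing the child.  A rough C
equivalent, sketched here with the same constants the script uses (the
helper name is invented):

#include <sys/auxv.h>
#include <sys/prctl.h>
#include <unistd.h>

#ifndef PR_SVE_SET_VL                   /* from <linux/prctl.h> */
# define PR_SVE_SET_VL        50
# define PR_SVE_SET_VL_ONEXEC (1 << 18)
# define PR_SVE_VL_INHERIT    (1 << 17)
#endif
#ifndef HWCAP_SVE
# define HWCAP_SVE (1 << 22)
#endif

static int
exec_with_vl (unsigned int vl, char **argv)
{
  if (!(getauxval (AT_HWCAP) & HWCAP_SVE))
    return 77;                          /* EXIT_UNSUPPORTED */
  prctl (PR_SVE_SET_VL, vl | PR_SVE_SET_VL_ONEXEC | PR_SVE_VL_INHERIT);
  execvp (argv[0], argv);
  return 1;                             /* exec failed */
}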
---
 scripts/vltest.py | 82 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 82 insertions(+)
 create mode 100755 scripts/vltest.py

diff --git a/scripts/vltest.py b/scripts/vltest.py
new file mode 100755
index 0000000000..264dfa449f
--- /dev/null
+++ b/scripts/vltest.py
@@ -0,0 +1,82 @@
+#!/usr/bin/python3
+# Set Scalable Vector Length test helper
+# Copyright (C) 2019-2021 Free Software Foundation, Inc.
+# This file is part of the GNU C Library.
+#
+# The GNU C Library is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# The GNU C Library is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with the GNU C Library; if not, see
+# <https://www.gnu.org/licenses/>.
+"""Set Scalable Vector Length test helper.
+
+Set Scalable Vector Length for child process.
+
+examples:
+
+ubuntu@bionic:~/build$ make check subdirs=string \
+test-wrapper='~/glibc/scripts/vltest.py 16'
+
+ubuntu@bionic:~/build$ ~/glibc/scripts/vltest.py 16 make test \
+t=string/test-memcpy
+
+ubuntu@bionic:~/build$ ~/glibc/scripts/vltest.py 32 ./debugglibc.sh \
+string/test-memmove
+
+ubuntu@bionic:~/build$ ~/glibc/scripts/vltest.py 64 ./testrun.sh \
+string/test-memset
+"""
+import argparse
+from ctypes import cdll, CDLL
+import os
+import sys
+
+EXIT_SUCCESS = 0
+EXIT_FAILURE = 1
+EXIT_UNSUPPORTED = 77
+
+AT_HWCAP = 16
+HWCAP_SVE = (1 << 22)
+
+PR_SVE_GET_VL = 51
+PR_SVE_SET_VL = 50
+PR_SVE_SET_VL_ONEXEC = (1 << 18)
+PR_SVE_VL_INHERIT = (1 << 17)
+PR_SVE_VL_LEN_MASK = 0xffff
+
+def main(args):
+    libc = CDLL("libc.so.6")
+    if not libc.getauxval(AT_HWCAP) & HWCAP_SVE:
+        print("CPU doesn't support SVE")
+        sys.exit(EXIT_UNSUPPORTED)
+
+    libc.prctl(PR_SVE_SET_VL,
+               args.vl[0] | PR_SVE_SET_VL_ONEXEC | PR_SVE_VL_INHERIT)
+    os.execvp(args.args[0], args.args)
+    print("exec system call failure")
+    sys.exit(EXIT_FAILURE)
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser(description=
+            "Set Scalable Vector Length test helper",
+            formatter_class=argparse.ArgumentDefaultsHelpFormatter)
+
+    # positional argument
+    parser.add_argument("vl", nargs=1, type=int,
+                        choices=range(16, 257, 16),
+                        help=('vector length '\
+                              'which is multiples of 16 from 16 to 256'))
+    # remainder arguments
+    parser.add_argument('args', nargs=argparse.REMAINDER,
+                        help=('args '\
+                              'which is passed to child process'))
+    args = parser.parse_args()
+    main(args)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v2 6/6] benchtests: Fixed bench-memcpy-random: buf1: mprotect failed
  2021-05-12  9:23 ` [PATCH v2 0/6] aarch64: " Naohiro Tamura
                     ` (4 preceding siblings ...)
  2021-05-12  9:29   ` [PATCH v2 5/6] scripts: Added Vector Length Set test helper script Naohiro Tamura
@ 2021-05-12  9:29   ` Naohiro Tamura
  2021-05-26 10:25     ` Szabolcs Nagy via Libc-alpha
  2021-05-27  0:22   ` [PATCH v2 0/6] aarch64: Added optimized memcpy/memmove/memset for A64FX naohirot
                     ` (2 subsequent siblings)
  8 siblings, 1 reply; 72+ messages in thread
From: Naohiro Tamura @ 2021-05-12  9:29 UTC (permalink / raw)
  To: libc-alpha; +Cc: Naohiro Tamura

From: Naohiro Tamura <naohirot@jp.fujitsu.com>

This patch fixed an mprotect system call failure on AArch64.
The failure happened not only on A64FX but also on ThunderX2.

This patch also updated the JSON key from "max-size" to "length" so that
'plot_strings.py' can process 'bench-memcpy-random.out'.
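
The point of the first hunk is that the slack added on top of the
512 KiB working set has to come from the runtime page size (64 KiB
pages are common on AArch64 kernels) rather than a hardcoded 4096, so
the benchmark buffers stay page-aligned.  A sketch of the idea, with an
invented helper name:

#include <stddef.h>
#include <unistd.h>

/* Minimum buffer size used by the benchmark, kept a multiple of the
   page size whether the kernel uses 4K, 16K or 64K pages.  */
static size_t
bench_min_buf_size (void)
{
  return 512 * 1024 + (size_t) getpagesize ();
}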
---
 benchtests/bench-memcpy-random.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/benchtests/bench-memcpy-random.c b/benchtests/bench-memcpy-random.c
index 9b62033379..c490b73ed0 100644
--- a/benchtests/bench-memcpy-random.c
+++ b/benchtests/bench-memcpy-random.c
@@ -16,7 +16,7 @@
    License along with the GNU C Library; if not, see
    <https://www.gnu.org/licenses/>.  */
 
-#define MIN_PAGE_SIZE (512*1024+4096)
+#define MIN_PAGE_SIZE (512*1024+getpagesize())
 #define TEST_MAIN
 #define TEST_NAME "memcpy"
 #include "bench-string.h"
@@ -160,7 +160,7 @@ do_test (json_ctx_t *json_ctx, size_t max_size)
     }
 
   json_element_object_begin (json_ctx);
-  json_attr_uint (json_ctx, "max-size", (double) max_size);
+  json_attr_uint (json_ctx, "length", (double) max_size);
   json_array_begin (json_ctx, "timings");
 
   FOR_EACH_IMPL (impl, 0)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 5/6] scripts: Added Vector Length Set test helper script
  2021-05-12  9:29   ` [PATCH v2 5/6] scripts: Added Vector Length Set test helper script Naohiro Tamura
@ 2021-05-12 16:58     ` Joseph Myers
  2021-05-13  9:53       ` naohirot
  2021-05-20  7:34     ` Naohiro Tamura
  1 sibling, 1 reply; 72+ messages in thread
From: Joseph Myers @ 2021-05-12 16:58 UTC (permalink / raw)
  To: Naohiro Tamura; +Cc: Naohiro Tamura, libc-alpha

On Wed, 12 May 2021, Naohiro Tamura wrote:

> From: Naohiro Tamura <naohirot@jp.fujitsu.com>
> 
> This patch is a test helper script to change Vector Length for child
> process. This script can be used as test-wrapper for 'make check'.

This is specific to AArch64, so I think it would better go under 
sysdeps/unix/sysv/linux/aarch64/ rather than under scripts/.

There is also the question of how to make this discoverable to people 
developing glibc.  Maybe this script should be mentioned in install.texi 
(with INSTALL regenerated accordingly), with the documentation there 
clearly explaining that it's specific to AArch64 GNU/Linux.

-- 
Joseph S. Myers
joseph@codesourcery.com

^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [PATCH v2 5/6] scripts: Added Vector Length Set test helper script
  2021-05-12 16:58     ` Joseph Myers
@ 2021-05-13  9:53       ` naohirot
  0 siblings, 0 replies; 72+ messages in thread
From: naohirot @ 2021-05-13  9:53 UTC (permalink / raw)
  To: 'Joseph Myers'; +Cc: libc-alpha@sourceware.org

Hi Joseph,

Thank you for the review.

> From: Joseph Myers <joseph@codesourcery.com>

> > This patch is a test helper script to change Vector Length for child
> > process. This script can be used as test-wrapper for 'make check'.
> 
> This is specific to AArch64, so I think it would better go under
> sysdeps/unix/sysv/linux/aarch64/ rather than under scripts/.

OK, I moved it to sysdeps/unix/sysv/linux/aarch64/.

> There is also the question of how to make this discoverable to people developing
> glibc.  Maybe this script should be mentioned in install.texi (with INSTALL
> regenerated accordingly), with the documentation there clearly explaining that it's
> specific to AArch64 GNU/Linux.

OK, I updated install.texi, INSTALL, the vltest.py doc comments, and the
commit message as shown below; see also my github [1].

[1] https://github.com/NaohiroTamura/glibc/commit/37a5832fea109ab939ffdf58a2a19d5707849cc5

[commit message] aarch64: Added Vector Length Set test helper script

This patch adds a test helper script that changes the Vector Length
for a child process. The script can be used as a test-wrapper for
'make check'.

Usage examples:

~/build$ make check subdirs=string \
test-wrapper='~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 16'

~/build$ ~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 16 \
make test t=string/test-memcpy

~/build$ ~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 32 \
./debugglibc.sh string/test-memmove

~/build$ ~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 64 \
./testrun.sh string/test-memset
---
 INSTALL                                   |  4 ++
 manual/install.texi                       |  3 +
 sysdeps/unix/sysv/linux/aarch64/vltest.py | 82 +++++++++++++++++++++++
 3 files changed, 89 insertions(+)
 create mode 100755 sysdeps/unix/sysv/linux/aarch64/vltest.py

diff --git a/INSTALL b/INSTALL
index 065a568585..bc761ab98b 100644
--- a/INSTALL
+++ b/INSTALL
@@ -380,6 +380,10 @@ the same syntax as 'test-wrapper-env', the only difference in its
 semantics being starting with an empty set of environment variables
 rather than the ambient set.

+   For AArch64 with SVE, when testing the GNU C Library, 'test-wrapper'
+may be set to "SRCDIR/sysdeps/unix/sysv/linux/aarch64/vltest.py
+VECTOR-LENGTH" to change Vector Length.
+
 Installing the C Library
 ========================

diff --git a/manual/install.texi b/manual/install.texi
index eb41fbd0b5..f1d858fb78 100644
--- a/manual/install.texi
+++ b/manual/install.texi
@@ -418,6 +418,9 @@ use has the same syntax as @samp{test-wrapper-env}, the only
 difference in its semantics being starting with an empty set of
 environment variables rather than the ambient set.

+For AArch64 with SVE, when testing @theglibc{}, @samp{test-wrapper}
+may be set to "@var{srcdir}/sysdeps/unix/sysv/linux/aarch64/vltest.py
+@var{vector-length}" to change Vector Length.

 @node Running make install
 @appendixsec Installing the C Library
diff --git a/sysdeps/unix/sysv/linux/aarch64/vltest.py b/sysdeps/unix/sysv/linux/aarch64/vltest.py
new file mode 100755
index 0000000000..bed62ad151
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/aarch64/vltest.py
@@ -0,0 +1,82 @@
+#!/usr/bin/python3
+# Set Scalable Vector Length test helper
+# Copyright (C) 2021 Free Software Foundation, Inc.
+# This file is part of the GNU C Library.
+#
+# The GNU C Library is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# The GNU C Library is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with the GNU C Library; if not, see
+# <https://www.gnu.org/licenses/>.
+"""Set Scalable Vector Length test helper.
+
+Set Scalable Vector Length for child process.
+
+examples:
+
+~/build$ make check subdirs=string \
+test-wrapper='~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 16'
+
+~/build$ ~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 16 \
+make test t=string/test-memcpy
+
+~/build$ ~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 32 \
+./debugglibc.sh string/test-memmove
+
+~/build$ ~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 64 \
+./testrun.sh string/test-memset
+"""

Thanks.
Naohiro


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-05-10  1:45 ` naohirot
@ 2021-05-14 13:35   ` Szabolcs Nagy via Libc-alpha
  2021-05-19  0:11     ` naohirot
  0 siblings, 1 reply; 72+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-05-14 13:35 UTC (permalink / raw)
  To: naohirot@fujitsu.com, Carlos O'Donell
  Cc: Florian Weimer, libc-alpha@sourceware.org, Wilco Dijkstra

The 05/10/2021 01:45, naohirot@fujitsu.com wrote:
> FYI: Fujitsu has submitted the signed assignment finally.

Carlos, can we commit patches from fujitsu now?
(i dont know if we are still waiting for something)

^ permalink raw reply	[flat|nested] 72+ messages in thread

* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-05-14 13:35   ` Szabolcs Nagy via Libc-alpha
@ 2021-05-19  0:11     ` naohirot
  0 siblings, 0 replies; 72+ messages in thread
From: naohirot @ 2021-05-19  0:11 UTC (permalink / raw)
  To: 'Szabolcs Nagy', Carlos O'Donell
  Cc: Florian Weimer, libc-alpha@sourceware.org, Wilco Dijkstra

Hi Szabolcs, Carlos,

> From: Szabolcs Nagy <Szabolcs.Nagy@arm.com>
> Sent: Friday, May 14, 2021 10:36 PM
> 
> The 05/10/2021 01:45, naohirot@fujitsu.com wrote:
> > FYI: Fujitsu has submitted the signed assignment finally.
> 
> Carlos, can we commit patches from fujitsu now?
> (i dont know if we are still waiting for something)

Fujitsu has received the FSF-signed assignment,
so the contract process is now complete.

Thanks.
Naohiro


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH v2 5/6] scripts: Added Vector Length Set test helper script
  2021-05-12  9:29   ` [PATCH v2 5/6] scripts: Added Vector Length Set test helper script Naohiro Tamura
  2021-05-12 16:58     ` Joseph Myers
@ 2021-05-20  7:34     ` Naohiro Tamura
  2021-05-26 10:24       ` Szabolcs Nagy via Libc-alpha
  1 sibling, 1 reply; 72+ messages in thread
From: Naohiro Tamura @ 2021-05-20  7:34 UTC (permalink / raw)
  To: libc-alpha; +Cc: Naohiro Tamura

From: Naohiro Tamura <naohirot@jp.fujitsu.com>

Let me send the whole updated patch.
Thanks.
Naohiro

-- >8 --
Subject: [PATCH v2 5/6] aarch64: Added Vector Length Set test helper script

This patch adds a test helper script that changes the Vector Length
for a child process. The script can be used as a test-wrapper for
'make check'.

Usage examples:

~/build$ make check subdirs=string \
test-wrapper='~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 16'

~/build$ ~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 16 \
make test t=string/test-memcpy

~/build$ ~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 32 \
./debugglibc.sh string/test-memmove

~/build$ ~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 64 \
./testrun.sh string/test-memset
---
 INSTALL                                   |  4 ++
 manual/install.texi                       |  3 +
 sysdeps/unix/sysv/linux/aarch64/vltest.py | 82 +++++++++++++++++++++++
 3 files changed, 89 insertions(+)
 create mode 100755 sysdeps/unix/sysv/linux/aarch64/vltest.py

diff --git a/INSTALL b/INSTALL
index 065a568585e6..bc761ab98bbf 100644
--- a/INSTALL
+++ b/INSTALL
@@ -380,6 +380,10 @@ the same syntax as 'test-wrapper-env', the only difference in its
 semantics being starting with an empty set of environment variables
 rather than the ambient set.
 
+   For AArch64 with SVE, when testing the GNU C Library, 'test-wrapper'
+may be set to "SRCDIR/sysdeps/unix/sysv/linux/aarch64/vltest.py
+VECTOR-LENGTH" to change Vector Length.
+
 Installing the C Library
 ========================
 
diff --git a/manual/install.texi b/manual/install.texi
index eb41fbd0b5ab..f1d858fb789c 100644
--- a/manual/install.texi
+++ b/manual/install.texi
@@ -418,6 +418,9 @@ use has the same syntax as @samp{test-wrapper-env}, the only
 difference in its semantics being starting with an empty set of
 environment variables rather than the ambient set.
 
+For AArch64 with SVE, when testing @theglibc{}, @samp{test-wrapper}
+may be set to "@var{srcdir}/sysdeps/unix/sysv/linux/aarch64/vltest.py
+@var{vector-length}" to change Vector Length.
 
 @node Running make install
 @appendixsec Installing the C Library
diff --git a/sysdeps/unix/sysv/linux/aarch64/vltest.py b/sysdeps/unix/sysv/linux/aarch64/vltest.py
new file mode 100755
index 000000000000..bed62ad151e0
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/aarch64/vltest.py
@@ -0,0 +1,82 @@
+#!/usr/bin/python3
+# Set Scalable Vector Length test helper
+# Copyright (C) 2021 Free Software Foundation, Inc.
+# This file is part of the GNU C Library.
+#
+# The GNU C Library is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# The GNU C Library is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with the GNU C Library; if not, see
+# <https://www.gnu.org/licenses/>.
+"""Set Scalable Vector Length test helper.
+
+Set Scalable Vector Length for child process.
+
+examples:
+
+~/build$ make check subdirs=string \
+test-wrapper='~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 16'
+
+~/build$ ~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 16 \
+make test t=string/test-memcpy
+
+~/build$ ~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 32 \
+./debugglibc.sh string/test-memmove
+
+~/build$ ~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 64 \
+./testrun.sh string/test-memset
+"""
+import argparse
+from ctypes import cdll, CDLL
+import os
+import sys
+
+EXIT_SUCCESS = 0
+EXIT_FAILURE = 1
+EXIT_UNSUPPORTED = 77
+
+AT_HWCAP = 16
+HWCAP_SVE = (1 << 22)
+
+PR_SVE_GET_VL = 51
+PR_SVE_SET_VL = 50
+PR_SVE_SET_VL_ONEXEC = (1 << 18)
+PR_SVE_VL_INHERIT = (1 << 17)
+PR_SVE_VL_LEN_MASK = 0xffff
+
+def main(args):
+    libc = CDLL("libc.so.6")
+    if not libc.getauxval(AT_HWCAP) & HWCAP_SVE:
+        print("CPU doesn't support SVE")
+        sys.exit(EXIT_UNSUPPORTED)
+
+    libc.prctl(PR_SVE_SET_VL,
+               args.vl[0] | PR_SVE_SET_VL_ONEXEC | PR_SVE_VL_INHERIT)
+    os.execvp(args.args[0], args.args)
+    print("exec system call failure")
+    sys.exit(EXIT_FAILURE)
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser(description=
+            "Set Scalable Vector Length test helper",
+            formatter_class=argparse.ArgumentDefaultsHelpFormatter)
+
+    # positional argument
+    parser.add_argument("vl", nargs=1, type=int,
+                        choices=range(16, 257, 16),
+                        help=('vector length '\
+                              'which is multiples of 16 from 16 to 256'))
+    # remainder arguments
+    parser.add_argument('args', nargs=argparse.REMAINDER,
+                        help=('args '\
+                              'which is passed to child process'))
+    args = parser.parse_args()
+    main(args)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 1/6] config: Added HAVE_AARCH64_SVE_ASM for aarch64
  2021-05-12  9:26   ` [PATCH v2 1/6] config: Added HAVE_AARCH64_SVE_ASM for aarch64 Naohiro Tamura
@ 2021-05-26 10:05     ` Szabolcs Nagy via Libc-alpha
  0 siblings, 0 replies; 72+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-05-26 10:05 UTC (permalink / raw)
  To: Naohiro Tamura; +Cc: Naohiro Tamura, libc-alpha

The 05/12/2021 09:26, Naohiro Tamura wrote:
> From: Naohiro Tamura <naohirot@jp.fujitsu.com>
> 
> This patch checks if assembler supports '-march=armv8.2-a+sve' to
> generate SVE code or not, and then define HAVE_AARCH64_SVE_ASM macro.

this is ok for master.

i will commit it for you.

> ---
>  config.h.in                  |  5 +++++
>  sysdeps/aarch64/configure    | 28 ++++++++++++++++++++++++++++
>  sysdeps/aarch64/configure.ac | 15 +++++++++++++++
>  3 files changed, 48 insertions(+)
> 
> diff --git a/config.h.in b/config.h.in
> index 99036b887f..13fba9bb8d 100644
> --- a/config.h.in
> +++ b/config.h.in
> @@ -121,6 +121,11 @@
>  /* AArch64 PAC-RET code generation is enabled.  */
>  #define HAVE_AARCH64_PAC_RET 0
>  
> +/* Assembler support ARMv8.2-A SVE.
> +   This macro becomes obsolete when glibc increased the minimum
> +   required version of GNU 'binutils' to 2.28 or later. */
> +#define HAVE_AARCH64_SVE_ASM 0
> +
>  /* ARC big endian ABI */
>  #undef HAVE_ARC_BE
>  
> diff --git a/sysdeps/aarch64/configure b/sysdeps/aarch64/configure
> index 83c3a23e44..4c1fac49f3 100644
> --- a/sysdeps/aarch64/configure
> +++ b/sysdeps/aarch64/configure
> @@ -304,3 +304,31 @@ fi
>  $as_echo "$libc_cv_aarch64_variant_pcs" >&6; }
>  config_vars="$config_vars
>  aarch64-variant-pcs = $libc_cv_aarch64_variant_pcs"
> +
> +# Check if asm support armv8.2-a+sve
> +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for SVE support in assembler" >&5
> +$as_echo_n "checking for SVE support in assembler... " >&6; }
> +if ${libc_cv_asm_sve+:} false; then :
> +  $as_echo_n "(cached) " >&6
> +else
> +  cat > conftest.s <<\EOF
> +        ptrue p0.b
> +EOF
> +if { ac_try='${CC-cc} -c -march=armv8.2-a+sve conftest.s 1>&5'
> +  { { eval echo "\"\$as_me\":${as_lineno-$LINENO}: \"$ac_try\""; } >&5
> +  (eval $ac_try) 2>&5
> +  ac_status=$?
> +  $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
> +  test $ac_status = 0; }; }; then
> +  libc_cv_aarch64_sve_asm=yes
> +else
> +  libc_cv_aarch64_sve_asm=no
> +fi
> +rm -f conftest*
> +fi
> +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $libc_cv_asm_sve" >&5
> +$as_echo "$libc_cv_asm_sve" >&6; }
> +if test $libc_cv_aarch64_sve_asm = yes; then
> +  $as_echo "#define HAVE_AARCH64_SVE_ASM 1" >>confdefs.h
> +
> +fi
> diff --git a/sysdeps/aarch64/configure.ac b/sysdeps/aarch64/configure.ac
> index 66f755078a..3347c13fa1 100644
> --- a/sysdeps/aarch64/configure.ac
> +++ b/sysdeps/aarch64/configure.ac
> @@ -90,3 +90,18 @@ EOF
>    fi
>    rm -rf conftest.*])
>  LIBC_CONFIG_VAR([aarch64-variant-pcs], [$libc_cv_aarch64_variant_pcs])
> +
> +# Check if asm support armv8.2-a+sve
> +AC_CACHE_CHECK(for SVE support in assembler, libc_cv_asm_sve, [dnl
> +cat > conftest.s <<\EOF
> +        ptrue p0.b
> +EOF
> +if AC_TRY_COMMAND(${CC-cc} -c -march=armv8.2-a+sve conftest.s 1>&AS_MESSAGE_LOG_FD); then
> +  libc_cv_aarch64_sve_asm=yes
> +else
> +  libc_cv_aarch64_sve_asm=no
> +fi
> +rm -f conftest*])
> +if test $libc_cv_aarch64_sve_asm = yes; then
> +  AC_DEFINE(HAVE_AARCH64_SVE_ASM)
> +fi
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 2/6] aarch64: define BTI_C and BTI_J macros as NOP unless HAVE_AARCH64_BTI
  2021-05-12  9:27   ` [PATCH v2 2/6] aarch64: define BTI_C and BTI_J macros as NOP unless HAVE_AARCH64_BTI Naohiro Tamura
@ 2021-05-26 10:06     ` Szabolcs Nagy via Libc-alpha
  0 siblings, 0 replies; 72+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-05-26 10:06 UTC (permalink / raw)
  To: Naohiro Tamura; +Cc: Naohiro Tamura, libc-alpha

The 05/12/2021 09:27, Naohiro Tamura wrote:
> From: Naohiro Tamura <naohirot@jp.fujitsu.com>
> 
> This patch defines BTI_C and BTI_J macros conditionally for
> performance.
> If HAVE_AARCH64_BTI is true, BTI_C and BTI_J are defined as HINT
> instruction for ARMv8.5 BTI (Branch Target Identification).
> If HAVE_AARCH64_BTI is false, both BTI_C and BTI_J are defined as
> NOP.

thanks. this is ok for master.

i will commit it.

> ---
>  sysdeps/aarch64/sysdep.h | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/sysdeps/aarch64/sysdep.h b/sysdeps/aarch64/sysdep.h
> index 90acca4e42..b936e29cbd 100644
> --- a/sysdeps/aarch64/sysdep.h
> +++ b/sysdeps/aarch64/sysdep.h
> @@ -62,8 +62,13 @@ strip_pac (void *p)
>  #define ASM_SIZE_DIRECTIVE(name) .size name,.-name
>  
>  /* Branch Target Identitication support.  */
> -#define BTI_C		hint	34
> -#define BTI_J		hint	36
> +#if HAVE_AARCH64_BTI
> +# define BTI_C		hint	34
> +# define BTI_J		hint	36
> +#else
> +# define BTI_C		nop
> +# define BTI_J		nop
> +#endif
>  
>  /* Return address signing support (pac-ret).  */
>  #define PACIASP		hint	25
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 3/6] aarch64: Added optimized memcpy and memmove for A64FX
  2021-05-12  9:28   ` [PATCH v2 3/6] aarch64: Added optimized memcpy and memmove for A64FX Naohiro Tamura
@ 2021-05-26 10:19     ` Szabolcs Nagy via Libc-alpha
  0 siblings, 0 replies; 72+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-05-26 10:19 UTC (permalink / raw)
  To: Naohiro Tamura; +Cc: Naohiro Tamura, libc-alpha

The 05/12/2021 09:28, Naohiro Tamura wrote:
> From: Naohiro Tamura <naohirot@jp.fujitsu.com>
> 
> This patch optimizes the performance of memcpy/memmove for A64FX [1]
> which implements ARMv8-A SVE and has L1 64KB cache per core and L2 8MB
> cache per NUMA node.
> 
> The performance optimization makes use of Scalable Vector Register
> with several techniques such as loop unrolling, memory access
> alignment, cache zero fill, and software pipelining.
> 
> SVE assembler code for memcpy/memmove is implemented as Vector Length
> Agnostic code so theoretically it can be run on any SOC which supports
> ARMv8-A SVE standard.
> 
> We confirmed that all testcases have been passed by running 'make
> check' and 'make xcheck' not only on A64FX but also on ThunderX2.
> 
> And also we confirmed that the SVE 512 bit vector register performance
> is roughly 4 times better than Advanced SIMD 128 bit register and 8
> times better than scalar 64 bit register by running 'make bench'.
> 
> [1] https://github.com/fujitsu/A64FX

thanks. this looks ok, except for whitespace usage.

can you please send a version with fixed whitespaces?

> --- a/sysdeps/aarch64/multiarch/memcpy.c
> +++ b/sysdeps/aarch64/multiarch/memcpy.c
> @@ -33,6 +33,9 @@ extern __typeof (__redirect_memcpy) __memcpy_simd attribute_hidden;
>  extern __typeof (__redirect_memcpy) __memcpy_thunderx attribute_hidden;
>  extern __typeof (__redirect_memcpy) __memcpy_thunderx2 attribute_hidden;
>  extern __typeof (__redirect_memcpy) __memcpy_falkor attribute_hidden;
> +#if HAVE_AARCH64_SVE_ASM
> +extern __typeof (__redirect_memcpy) __memcpy_a64fx attribute_hidden;
> +#endif
>  
>  libc_ifunc (__libc_memcpy,
>              (IS_THUNDERX (midr)
> @@ -44,8 +47,13 @@ libc_ifunc (__libc_memcpy,
>  		  : (IS_NEOVERSE_N1 (midr) || IS_NEOVERSE_N2 (midr)
>  		     || IS_NEOVERSE_V1 (midr)
>  		     ? __memcpy_simd
> -		     : __memcpy_generic)))));
> -
> +#if HAVE_AARCH64_SVE_ASM
> +                     : (IS_A64FX (midr)
> +                        ? __memcpy_a64fx
> +                        : __memcpy_generic))))));
> +#else
> +                     : __memcpy_generic)))));
> +#endif

glibc uses a mix of tabs and spaces, but you used spaces only.

> --- /dev/null
> +++ b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
> @@ -0,0 +1,405 @@
> +/* Optimized memcpy for Fujitsu A64FX processor.
> +   Copyright (C) 2012-2021 Free Software Foundation, Inc.
> +
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library.  If not, see
> +   <https://www.gnu.org/licenses/>.  */
> +
> +#include <sysdep.h>
> +
> +#if HAVE_AARCH64_SVE_ASM
> +#if IS_IN (libc)
> +# define MEMCPY __memcpy_a64fx
> +# define MEMMOVE __memmove_a64fx
> +
> +/* Assumptions:
> + *
> + * ARMv8.2-a, AArch64, unaligned accesses, sve
> + *
> + */
> +
> +#define L2_SIZE         (8*1024*1024)/2 // L2 8MB/2
> +#define CACHE_LINE_SIZE 256
> +#define ZF_DIST         (CACHE_LINE_SIZE * 21)  // Zerofill distance
> +#define dest            x0
> +#define src             x1
> +#define n               x2      // size
> +#define tmp1            x3
> +#define tmp2            x4
> +#define tmp3            x5
> +#define rest            x6
> +#define dest_ptr        x7
> +#define src_ptr         x8
> +#define vector_length   x9
> +#define cl_remainder    x10     // CACHE_LINE_SIZE remainder
> +
> +    .arch armv8.2-a+sve
> +
> +    .macro dc_zva times
> +    dc          zva, tmp1
> +    add         tmp1, tmp1, CACHE_LINE_SIZE
> +    .if \times-1
> +    dc_zva "(\times-1)"
> +    .endif
> +    .endm
> +
> +    .macro ld1b_unroll8
> +    ld1b        z0.b, p0/z, [src_ptr, #0, mul vl]
> +    ld1b        z1.b, p0/z, [src_ptr, #1, mul vl]
> +    ld1b        z2.b, p0/z, [src_ptr, #2, mul vl]
> +    ld1b        z3.b, p0/z, [src_ptr, #3, mul vl]
> +    ld1b        z4.b, p0/z, [src_ptr, #4, mul vl]
> +    ld1b        z5.b, p0/z, [src_ptr, #5, mul vl]
> +    ld1b        z6.b, p0/z, [src_ptr, #6, mul vl]
> +    ld1b        z7.b, p0/z, [src_ptr, #7, mul vl]
> +    .endm
...

please indent all asm code with one tab, see other asm files.

> --- a/sysdeps/aarch64/multiarch/memmove.c
> +++ b/sysdeps/aarch64/multiarch/memmove.c
> @@ -33,6 +33,9 @@ extern __typeof (__redirect_memmove) __memmove_simd attribute_hidden;
>  extern __typeof (__redirect_memmove) __memmove_thunderx attribute_hidden;
>  extern __typeof (__redirect_memmove) __memmove_thunderx2 attribute_hidden;
>  extern __typeof (__redirect_memmove) __memmove_falkor attribute_hidden;
> +#if HAVE_AARCH64_SVE_ASM
> +extern __typeof (__redirect_memmove) __memmove_a64fx attribute_hidden;
> +#endif
>  
>  libc_ifunc (__libc_memmove,
>              (IS_THUNDERX (midr)
> @@ -44,8 +47,13 @@ libc_ifunc (__libc_memmove,
>  		  : (IS_NEOVERSE_N1 (midr) || IS_NEOVERSE_N2 (midr)
>  		     || IS_NEOVERSE_V1 (midr)
>  		     ? __memmove_simd
> -		     : __memmove_generic)))));
> -
> +#if HAVE_AARCH64_SVE_ASM
> +                     : (IS_A64FX (midr)
> +                        ? __memmove_a64fx
> +                        : __memmove_generic))))));
> +#else
> +                        : __memmove_generic)))));
> +#endif

same as above.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 4/6] aarch64: Added optimized memset for A64FX
  2021-05-12  9:28   ` [PATCH v2 4/6] aarch64: Added optimized memset " Naohiro Tamura
@ 2021-05-26 10:22     ` Szabolcs Nagy via Libc-alpha
  0 siblings, 0 replies; 72+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-05-26 10:22 UTC (permalink / raw)
  To: Naohiro Tamura; +Cc: Naohiro Tamura, libc-alpha

The 05/12/2021 09:28, Naohiro Tamura wrote:
> From: Naohiro Tamura <naohirot@jp.fujitsu.com>
> 
> This patch optimizes the performance of memset for A64FX [1] which
> implements ARMv8-A SVE and has L1 64KB cache per core and L2 8MB cache
> per NUMA node.
> 
> The performance optimization makes use of Scalable Vector Register
> with several techniques such as loop unrolling, memory access
> alignment, cache zero fill and prefetch.
> 
> SVE assembler code for memset is implemented as Vector Length Agnostic
> code so theoretically it can be run on any SOC which supports ARMv8-A
> SVE standard.
> 
> We confirmed that all testcases have been passed by running 'make
> check' and 'make xcheck' not only on A64FX but also on ThunderX2.
> 
> And also we confirmed that the SVE 512 bit vector register performance
> is roughly 4 times better than Advanced SIMD 128 bit register and 8
> times better than scalar 64 bit register by running 'make bench'.
> 
> [1] https://github.com/fujitsu/A64FX

thanks, this looks good, except for whitespace.

can you please send a version with fixed whitespaces?

> --- a/sysdeps/aarch64/multiarch/memset.c
> +++ b/sysdeps/aarch64/multiarch/memset.c
...
> -	       : __memset_generic)));
> +#if HAVE_AARCH64_SVE_ASM
> +	       : (IS_A64FX (midr)
> +		  ? __memset_a64fx
> +	          : __memset_generic))));
> +#else
> +	          : __memset_generic)));
> +#endif

replace 8 spaces with 1 tab.

> --- /dev/null
> +++ b/sysdeps/aarch64/multiarch/memset_a64fx.S
...
> +    .arch armv8.2-a+sve
> +
> +    .macro dc_zva times
> +    dc          zva, tmp1
> +    add         tmp1, tmp1, CACHE_LINE_SIZE
> +    .if \times-1
> +    dc_zva "(\times-1)"
> +    .endif
> +    .endm

use 1 tab indentation throughout.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v2 5/6] scripts: Added Vector Length Set test helper script
  2021-05-20  7:34     ` Naohiro Tamura
@ 2021-05-26 10:24       ` Szabolcs Nagy via Libc-alpha
  0 siblings, 0 replies; 72+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-05-26 10:24 UTC (permalink / raw)
  To: Naohiro Tamura; +Cc: Naohiro Tamura, libc-alpha

The 05/20/2021 07:34, Naohiro Tamura wrote:
> From: Naohiro Tamura <naohirot@jp.fujitsu.com>
> 
> Let me send the whole updated patch.
> Thanks.
> Naohiro
> 
> -- >8 --
> Subject: [PATCH v2 5/6] aarch64: Added Vector Length Set test helper script
> 
> This patch is a test helper script to change Vector Length for child
> process. This script can be used as test-wrapper for 'make check'.
> 
> Usage examples:
> 
> ~/build$ make check subdirs=string \
> test-wrapper='~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 16'
> 
> ~/build$ ~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 16 \
> make test t=string/test-memcpy
> 
> ~/build$ ~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 32 \
> ./debugglibc.sh string/test-memmove
> 
> ~/build$ ~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 64 \
> ./testrun.sh string/test-memset

thanks, this is ok for master.
i will commit it.


> ---
>  INSTALL                                   |  4 ++
>  manual/install.texi                       |  3 +
>  sysdeps/unix/sysv/linux/aarch64/vltest.py | 82 +++++++++++++++++++++++
>  3 files changed, 89 insertions(+)
>  create mode 100755 sysdeps/unix/sysv/linux/aarch64/vltest.py
> 
> diff --git a/INSTALL b/INSTALL
> index 065a568585e6..bc761ab98bbf 100644
> --- a/INSTALL
> +++ b/INSTALL
> @@ -380,6 +380,10 @@ the same syntax as 'test-wrapper-env', the only difference in its
>  semantics being starting with an empty set of environment variables
>  rather than the ambient set.
>  
> +   For AArch64 with SVE, when testing the GNU C Library, 'test-wrapper'
> +may be set to "SRCDIR/sysdeps/unix/sysv/linux/aarch64/vltest.py
> +VECTOR-LENGTH" to change Vector Length.
> +
>  Installing the C Library
>  ========================
>  
> diff --git a/manual/install.texi b/manual/install.texi
> index eb41fbd0b5ab..f1d858fb789c 100644
> --- a/manual/install.texi
> +++ b/manual/install.texi
> @@ -418,6 +418,9 @@ use has the same syntax as @samp{test-wrapper-env}, the only
>  difference in its semantics being starting with an empty set of
>  environment variables rather than the ambient set.
>  
> +For AArch64 with SVE, when testing @theglibc{}, @samp{test-wrapper}
> +may be set to "@var{srcdir}/sysdeps/unix/sysv/linux/aarch64/vltest.py
> +@var{vector-length}" to change Vector Length.
>  
>  @node Running make install
>  @appendixsec Installing the C Library
> diff --git a/sysdeps/unix/sysv/linux/aarch64/vltest.py b/sysdeps/unix/sysv/linux/aarch64/vltest.py
> new file mode 100755
> index 000000000000..bed62ad151e0
> --- /dev/null
> +++ b/sysdeps/unix/sysv/linux/aarch64/vltest.py
> @@ -0,0 +1,82 @@
> +#!/usr/bin/python3
> +# Set Scalable Vector Length test helper
> +# Copyright (C) 2021 Free Software Foundation, Inc.
> +# This file is part of the GNU C Library.
> +#
> +# The GNU C Library is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU Lesser General Public
> +# License as published by the Free Software Foundation; either
> +# version 2.1 of the License, or (at your option) any later version.
> +#
> +# The GNU C Library is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +# Lesser General Public License for more details.
> +#
> +# You should have received a copy of the GNU Lesser General Public
> +# License along with the GNU C Library; if not, see
> +# <https://www.gnu.org/licenses/>.
> +"""Set Scalable Vector Length test helper.
> +
> +Set Scalable Vector Length for child process.
> +
> +examples:
> +
> +~/build$ make check subdirs=string \
> +test-wrapper='~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 16'
> +
> +~/build$ ~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 16 \
> +make test t=string/test-memcpy
> +
> +~/build$ ~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 32 \
> +./debugglibc.sh string/test-memmove
> +
> +~/build$ ~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 64 \
> +./testrun.sh string/test-memset
> +"""
> +import argparse
> +from ctypes import cdll, CDLL
> +import os
> +import sys
> +
> +EXIT_SUCCESS = 0
> +EXIT_FAILURE = 1
> +EXIT_UNSUPPORTED = 77
> +
> +AT_HWCAP = 16
> +HWCAP_SVE = (1 << 22)
> +
> +PR_SVE_GET_VL = 51
> +PR_SVE_SET_VL = 50
> +PR_SVE_SET_VL_ONEXEC = (1 << 18)
> +PR_SVE_VL_INHERIT = (1 << 17)
> +PR_SVE_VL_LEN_MASK = 0xffff
> +
> +def main(args):
> +    libc = CDLL("libc.so.6")
> +    if not libc.getauxval(AT_HWCAP) & HWCAP_SVE:
> +        print("CPU doesn't support SVE")
> +        sys.exit(EXIT_UNSUPPORTED)
> +
> +    libc.prctl(PR_SVE_SET_VL,
> +               args.vl[0] | PR_SVE_SET_VL_ONEXEC | PR_SVE_VL_INHERIT)
> +    os.execvp(args.args[0], args.args)
> +    print("exec system call failure")
> +    sys.exit(EXIT_FAILURE)
> +
> +if __name__ == '__main__':
> +    parser = argparse.ArgumentParser(description=
> +            "Set Scalable Vector Length test helper",
> +            formatter_class=argparse.ArgumentDefaultsHelpFormatter)
> +
> +    # positional argument
> +    parser.add_argument("vl", nargs=1, type=int,
> +                        choices=range(16, 257, 16),
> +                        help=('vector length, '\
> +                              'a multiple of 16 from 16 to 256'))
> +    # remainder arguments
> +    parser.add_argument('args', nargs=argparse.REMAINDER,
> +                        help=('args '\
> +                              'that are passed to the child process'))
> +    args = parser.parse_args()
> +    main(args)
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread
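
For reference, the mechanism the script relies on is the PR_SVE_SET_VL
prctl with the ONEXEC and INHERIT flags, so the new vector length takes
effect in the exec'ed child and is inherited by its children.  The
following is only an illustrative C sketch of that mechanism, built from
the constants quoted in the patch above; it is not part of the submitted
changes and its error handling is minimal.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/auxv.h>
#include <sys/prctl.h>

#ifndef HWCAP_SVE
# define HWCAP_SVE (1 << 22)
#endif
#ifndef PR_SVE_SET_VL
# define PR_SVE_SET_VL		50
# define PR_SVE_SET_VL_ONEXEC	(1 << 18)
# define PR_SVE_VL_INHERIT	(1 << 17)
#endif

int
main (int argc, char **argv)
{
  if (argc < 3)
    return 1;
  /* Same check as vltest.py: skip (exit 77) if the CPU has no SVE.  */
  if (!(getauxval (AT_HWCAP) & HWCAP_SVE))
    {
      fputs ("CPU doesn't support SVE\n", stderr);
      return 77;
    }
  unsigned long vl = strtoul (argv[1], NULL, 0);	/* e.g. 16, 32, 64 */
  /* Set the vector length for the child started below; further children
     inherit it, mirroring the prctl call in the script.  */
  if (prctl (PR_SVE_SET_VL, vl | PR_SVE_SET_VL_ONEXEC | PR_SVE_VL_INHERIT) < 0)
    return 1;
  execvp (argv[2], argv + 2);
  perror ("execvp");
  return 1;
}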

* Re: [PATCH v2 6/6] benchtests: Fixed bench-memcpy-random: buf1: mprotect failed
  2021-05-12  9:29   ` [PATCH v2 6/6] benchtests: Fixed bench-memcpy-random: buf1: mprotect failed Naohiro Tamura
@ 2021-05-26 10:25     ` Szabolcs Nagy via Libc-alpha
  0 siblings, 0 replies; 72+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-05-26 10:25 UTC (permalink / raw)
  To: Naohiro Tamura; +Cc: Naohiro Tamura, libc-alpha

The 05/12/2021 09:29, Naohiro Tamura wrote:
> From: Naohiro Tamura <naohirot@jp.fujitsu.com>
> 
> This patch fixed an mprotect system call failure on AArch64.
> The failure happened not only on A64FX but also on ThunderX2.
> 
> This patch also updated the JSON key from "max-size" to "length" so
> that 'plot_strings.py' can process 'bench-memcpy-random.out'.

thanks, this is ok for master.
i will commit it.

> ---
>  benchtests/bench-memcpy-random.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/benchtests/bench-memcpy-random.c b/benchtests/bench-memcpy-random.c
> index 9b62033379..c490b73ed0 100644
> --- a/benchtests/bench-memcpy-random.c
> +++ b/benchtests/bench-memcpy-random.c
> @@ -16,7 +16,7 @@
>     License along with the GNU C Library; if not, see
>     <https://www.gnu.org/licenses/>.  */
>  
> -#define MIN_PAGE_SIZE (512*1024+4096)
> +#define MIN_PAGE_SIZE (512*1024+getpagesize())
>  #define TEST_MAIN
>  #define TEST_NAME "memcpy"
>  #include "bench-string.h"
> @@ -160,7 +160,7 @@ do_test (json_ctx_t *json_ctx, size_t max_size)
>      }
>  
>    json_element_object_begin (json_ctx);
> -  json_attr_uint (json_ctx, "max-size", (double) max_size);
> +  json_attr_uint (json_ctx, "length", (double) max_size);
>    json_array_begin (json_ctx, "timings");
>  
>    FOR_EACH_IMPL (impl, 0)
> -- 
> 2.17.1
> 

-- 

^ permalink raw reply	[flat|nested] 72+ messages in thread
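
The hard-coded 4096 in MIN_PAGE_SIZE presumably stood for one page of
slack; on kernels configured with 64 KiB pages (a common AArch64 server
configuration) that is less than one page, which would explain the
mprotect failure.  Below is a one-line C sketch of the sizing rule the
fix encodes (this is my reading of the change, not part of the patch):

#include <stddef.h>
#include <unistd.h>

/* 512 KiB of payload plus one page of slack, whatever the runtime page
   size is (4 KiB or 64 KiB), so there is at least one full page of
   slack at the end of the buffer.  */
static inline size_t
bench_min_page_size (void)
{
  return 512 * 1024 + (size_t) getpagesize ();
}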

* RE: [PATCH v2 0/6] aarch64: Added optimized memcpy/memmove/memset for A64FX
  2021-05-12  9:23 ` [PATCH v2 0/6] aarch64: " Naohiro Tamura
                     ` (5 preceding siblings ...)
  2021-05-12  9:29   ` [PATCH v2 6/6] benchtests: Fixed bench-memcpy-random: buf1: mprotect failed Naohiro Tamura
@ 2021-05-27  0:22   ` naohirot
  2021-05-27 23:50     ` naohirot
  2021-05-27  7:42   ` [PATCH v3 1/2] aarch64: Added optimized memcpy and memmove " Naohiro Tamura
  2021-05-27  7:44   ` [PATCH v3 2/2] aarch64: Added optimized memset " Naohiro Tamura
  8 siblings, 1 reply; 72+ messages in thread
From: naohirot @ 2021-05-27  0:22 UTC (permalink / raw)
  To: 'Szabolcs Nagy', libc-alpha@sourceware.org

Hi Szabolcs,

>   config: Added HAVE_AARCH64_SVE_ASM for aarch64
>   aarch64: define BTI_C and BTI_J macros as NOP unless HAVE_AARCH64_BTI
>   scripts: Added Vector Length Set test helper script
>   benchtests: Fixed bench-memcpy-random: buf1: mprotect failed

Thank you for the merges!

>   aarch64: Added optimized memcpy and memmove for A64FX
>   aarch64: Added optimized memset for A64FX

I'll fix the whitespaces.

Thanks
Naohiro



^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH v3 1/2] aarch64: Added optimized memcpy and memmove for A64FX
  2021-05-12  9:23 ` [PATCH v2 0/6] aarch64: " Naohiro Tamura
                     ` (6 preceding siblings ...)
  2021-05-27  0:22   ` [PATCH v2 0/6] aarch64: Added optimized memcpy/memmove/memset for A64FX naohirot
@ 2021-05-27  7:42   ` Naohiro Tamura
  2021-05-27  7:44   ` [PATCH v3 2/2] aarch64: Added optimized memset " Naohiro Tamura
  8 siblings, 0 replies; 72+ messages in thread
From: Naohiro Tamura @ 2021-05-27  7:42 UTC (permalink / raw)
  To: libc-alpha; +Cc: Naohiro Tamura

From: Naohiro Tamura <naohirot@jp.fujitsu.com>

This patch optimizes the performance of memcpy/memmove for A64FX [1]
which implements ARMv8-A SVE and has L1 64KB cache per core and L2 8MB
cache per NUMA node.

The performance optimization makes use of Scalable Vector Register
with several techniques such as loop unrolling, memory access
alignment, cache zero fill, and software pipelining.

SVE assembler code for memcpy/memmove is implemented as Vector Length
Agnostic code so theoretically it can be run on any SOC which supports
ARMv8-A SVE standard.

We confirmed that all testcases have been passed by running 'make
check' and 'make xcheck' not only on A64FX but also on ThunderX2.

And also we confirmed that the SVE 512 bit vector register performance
is roughly 4 times better than Advanced SIMD 128 bit register and 8
times better than scalar 64 bit register by running 'make bench'.

[1] https://github.com/fujitsu/A64FX

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Reviewed-by: Szabolcs Nagy <Szabolcs.Nagy@arm.com>
---
 manual/tunables.texi                          |   3 +-
 sysdeps/aarch64/multiarch/Makefile            |   2 +-
 sysdeps/aarch64/multiarch/ifunc-impl-list.c   |   8 +-
 sysdeps/aarch64/multiarch/init-arch.h         |   4 +-
 sysdeps/aarch64/multiarch/memcpy.c            |  18 +-
 sysdeps/aarch64/multiarch/memcpy_a64fx.S      | 406 ++++++++++++++++++
 sysdeps/aarch64/multiarch/memmove.c           |  18 +-
 .../unix/sysv/linux/aarch64/cpu-features.c    |   4 +
 .../unix/sysv/linux/aarch64/cpu-features.h    |   4 +
 9 files changed, 453 insertions(+), 14 deletions(-)
 create mode 100644 sysdeps/aarch64/multiarch/memcpy_a64fx.S

diff --git a/manual/tunables.texi b/manual/tunables.texi
index 6de647b4262c..fe7c1313ccc4 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -454,7 +454,8 @@ This tunable is specific to powerpc, powerpc64 and powerpc64le.
 The @code{glibc.cpu.name=xxx} tunable allows the user to tell @theglibc{} to
 assume that the CPU is @code{xxx} where xxx may have one of these values:
 @code{generic}, @code{falkor}, @code{thunderxt88}, @code{thunderx2t99},
-@code{thunderx2t99p1}, @code{ares}, @code{emag}, @code{kunpeng}.
+@code{thunderx2t99p1}, @code{ares}, @code{emag}, @code{kunpeng},
+@code{a64fx}.
 
 This tunable is specific to aarch64.
 @end deftp
diff --git a/sysdeps/aarch64/multiarch/Makefile b/sysdeps/aarch64/multiarch/Makefile
index dc3efffb36b6..04c3f171215e 100644
--- a/sysdeps/aarch64/multiarch/Makefile
+++ b/sysdeps/aarch64/multiarch/Makefile
@@ -1,6 +1,6 @@
 ifeq ($(subdir),string)
 sysdep_routines += memcpy_generic memcpy_advsimd memcpy_thunderx memcpy_thunderx2 \
-		   memcpy_falkor \
+		   memcpy_falkor memcpy_a64fx \
 		   memset_generic memset_falkor memset_emag memset_kunpeng \
 		   memchr_generic memchr_nosimd \
 		   strlen_mte strlen_asimd
diff --git a/sysdeps/aarch64/multiarch/ifunc-impl-list.c b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
index 99a8c68aaca0..911393565c21 100644
--- a/sysdeps/aarch64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
@@ -25,7 +25,7 @@
 #include <stdio.h>
 
 /* Maximum number of IFUNC implementations.  */
-#define MAX_IFUNC	4
+#define MAX_IFUNC	7
 
 size_t
 __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
@@ -43,12 +43,18 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 	      IFUNC_IMPL_ADD (array, i, memcpy, !bti, __memcpy_thunderx2)
 	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_falkor)
 	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_simd)
+#if HAVE_AARCH64_SVE_ASM
+	      IFUNC_IMPL_ADD (array, i, memcpy, sve, __memcpy_a64fx)
+#endif
 	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_generic))
   IFUNC_IMPL (i, name, memmove,
 	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_thunderx)
 	      IFUNC_IMPL_ADD (array, i, memmove, !bti, __memmove_thunderx2)
 	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_falkor)
 	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_simd)
+#if HAVE_AARCH64_SVE_ASM
+	      IFUNC_IMPL_ADD (array, i, memmove, sve, __memmove_a64fx)
+#endif
 	      IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_generic))
   IFUNC_IMPL (i, name, memset,
 	      /* Enable this on non-falkor processors too so that other cores
diff --git a/sysdeps/aarch64/multiarch/init-arch.h b/sysdeps/aarch64/multiarch/init-arch.h
index a167699e74f4..6d92c1bcff6a 100644
--- a/sysdeps/aarch64/multiarch/init-arch.h
+++ b/sysdeps/aarch64/multiarch/init-arch.h
@@ -33,4 +33,6 @@
   bool __attribute__((unused)) bti =					      \
     HAVE_AARCH64_BTI && GLRO(dl_aarch64_cpu_features).bti;		      \
   bool __attribute__((unused)) mte =					      \
-    MTE_ENABLED ();
+    MTE_ENABLED ();							      \
+  bool __attribute__((unused)) sve =					      \
+    GLRO(dl_aarch64_cpu_features).sve;
diff --git a/sysdeps/aarch64/multiarch/memcpy.c b/sysdeps/aarch64/multiarch/memcpy.c
index 0e0a5cbcfb1b..25e0081eeb51 100644
--- a/sysdeps/aarch64/multiarch/memcpy.c
+++ b/sysdeps/aarch64/multiarch/memcpy.c
@@ -33,6 +33,9 @@ extern __typeof (__redirect_memcpy) __memcpy_simd attribute_hidden;
 extern __typeof (__redirect_memcpy) __memcpy_thunderx attribute_hidden;
 extern __typeof (__redirect_memcpy) __memcpy_thunderx2 attribute_hidden;
 extern __typeof (__redirect_memcpy) __memcpy_falkor attribute_hidden;
+# if HAVE_AARCH64_SVE_ASM
+extern __typeof (__redirect_memcpy) __memcpy_a64fx attribute_hidden;
+# endif
 
 libc_ifunc (__libc_memcpy,
             (IS_THUNDERX (midr)
@@ -40,12 +43,17 @@ libc_ifunc (__libc_memcpy,
 	     : (IS_FALKOR (midr) || IS_PHECDA (midr)
 		? __memcpy_falkor
 		: (IS_THUNDERX2 (midr) || IS_THUNDERX2PA (midr)
-		  ? __memcpy_thunderx2
-		  : (IS_NEOVERSE_N1 (midr) || IS_NEOVERSE_N2 (midr)
-		     || IS_NEOVERSE_V1 (midr)
-		     ? __memcpy_simd
+		   ? __memcpy_thunderx2
+		   : (IS_NEOVERSE_N1 (midr) || IS_NEOVERSE_N2 (midr)
+		      || IS_NEOVERSE_V1 (midr)
+		      ? __memcpy_simd
+# if HAVE_AARCH64_SVE_ASM
+		     : (IS_A64FX (midr)
+			? __memcpy_a64fx
+			: __memcpy_generic))))));
+# else
 		     : __memcpy_generic)))));
-
+# endif
 # undef memcpy
 strong_alias (__libc_memcpy, memcpy);
 #endif
diff --git a/sysdeps/aarch64/multiarch/memcpy_a64fx.S b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
new file mode 100644
index 000000000000..65528405bb12
--- /dev/null
+++ b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
@@ -0,0 +1,406 @@
+/* Optimized memcpy for Fujitsu A64FX processor.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+
+/* Assumptions:
+ *
+ * ARMv8.2-a, AArch64, unaligned accesses, sve
+ *
+ */
+
+#define L2_SIZE		(8*1024*1024)/2	// L2 8MB/2
+#define CACHE_LINE_SIZE	256
+#define ZF_DIST		(CACHE_LINE_SIZE * 21)	// Zerofill distance
+#define dest		x0
+#define src		x1
+#define n		x2	// size
+#define tmp1		x3
+#define tmp2		x4
+#define tmp3		x5
+#define rest		x6
+#define dest_ptr	x7
+#define src_ptr		x8
+#define vector_length	x9
+#define cl_remainder	x10	// CACHE_LINE_SIZE remainder
+
+#if HAVE_AARCH64_SVE_ASM
+# if IS_IN (libc)
+#  define MEMCPY __memcpy_a64fx
+#  define MEMMOVE __memmove_a64fx
+
+	.arch armv8.2-a+sve
+
+	.macro dc_zva times
+	dc	zva, tmp1
+	add	tmp1, tmp1, CACHE_LINE_SIZE
+	.if \times-1
+	dc_zva "(\times-1)"
+	.endif
+	.endm
+
+	.macro ld1b_unroll8
+	ld1b	z0.b, p0/z, [src_ptr, #0, mul vl]
+	ld1b	z1.b, p0/z, [src_ptr, #1, mul vl]
+	ld1b	z2.b, p0/z, [src_ptr, #2, mul vl]
+	ld1b	z3.b, p0/z, [src_ptr, #3, mul vl]
+	ld1b	z4.b, p0/z, [src_ptr, #4, mul vl]
+	ld1b	z5.b, p0/z, [src_ptr, #5, mul vl]
+	ld1b	z6.b, p0/z, [src_ptr, #6, mul vl]
+	ld1b	z7.b, p0/z, [src_ptr, #7, mul vl]
+	.endm
+
+	.macro stld1b_unroll4a
+	st1b	z0.b, p0,   [dest_ptr, #0, mul vl]
+	st1b	z1.b, p0,   [dest_ptr, #1, mul vl]
+	ld1b	z0.b, p0/z, [src_ptr,  #0, mul vl]
+	ld1b	z1.b, p0/z, [src_ptr,  #1, mul vl]
+	st1b	z2.b, p0,   [dest_ptr, #2, mul vl]
+	st1b	z3.b, p0,   [dest_ptr, #3, mul vl]
+	ld1b	z2.b, p0/z, [src_ptr,  #2, mul vl]
+	ld1b	z3.b, p0/z, [src_ptr,  #3, mul vl]
+	.endm
+
+	.macro stld1b_unroll4b
+	st1b	z4.b, p0,   [dest_ptr, #4, mul vl]
+	st1b	z5.b, p0,   [dest_ptr, #5, mul vl]
+	ld1b	z4.b, p0/z, [src_ptr,  #4, mul vl]
+	ld1b	z5.b, p0/z, [src_ptr,  #5, mul vl]
+	st1b	z6.b, p0,   [dest_ptr, #6, mul vl]
+	st1b	z7.b, p0,   [dest_ptr, #7, mul vl]
+	ld1b	z6.b, p0/z, [src_ptr,  #6, mul vl]
+	ld1b	z7.b, p0/z, [src_ptr,  #7, mul vl]
+	.endm
+
+	.macro stld1b_unroll8
+	stld1b_unroll4a
+	stld1b_unroll4b
+	.endm
+
+	.macro st1b_unroll8
+	st1b	z0.b, p0, [dest_ptr, #0, mul vl]
+	st1b	z1.b, p0, [dest_ptr, #1, mul vl]
+	st1b	z2.b, p0, [dest_ptr, #2, mul vl]
+	st1b	z3.b, p0, [dest_ptr, #3, mul vl]
+	st1b	z4.b, p0, [dest_ptr, #4, mul vl]
+	st1b	z5.b, p0, [dest_ptr, #5, mul vl]
+	st1b	z6.b, p0, [dest_ptr, #6, mul vl]
+	st1b	z7.b, p0, [dest_ptr, #7, mul vl]
+	.endm
+
+	.macro shortcut_for_small_size exit
+	// if rest <= vector_length * 2
+	whilelo	p0.b, xzr, n
+	whilelo	p1.b, vector_length, n
+	b.last	1f
+	ld1b	z0.b, p0/z, [src, #0, mul vl]
+	ld1b	z1.b, p1/z, [src, #1, mul vl]
+	st1b	z0.b, p0, [dest, #0, mul vl]
+	st1b	z1.b, p1, [dest, #1, mul vl]
+	ret
+1:	// if rest > vector_length * 8
+	cmp	n, vector_length, lsl 3 // vector_length * 8
+	b.hi	\exit
+	// if rest <= vector_length * 4
+	lsl	tmp1, vector_length, 1  // vector_length * 2
+	whilelo	p2.b, tmp1, n
+	incb	tmp1
+	whilelo	p3.b, tmp1, n
+	b.last	1f
+	ld1b	z0.b, p0/z, [src, #0, mul vl]
+	ld1b	z1.b, p1/z, [src, #1, mul vl]
+	ld1b	z2.b, p2/z, [src, #2, mul vl]
+	ld1b	z3.b, p3/z, [src, #3, mul vl]
+	st1b	z0.b, p0, [dest, #0, mul vl]
+	st1b	z1.b, p1, [dest, #1, mul vl]
+	st1b	z2.b, p2, [dest, #2, mul vl]
+	st1b	z3.b, p3, [dest, #3, mul vl]
+	ret
+1:	// if rest <= vector_length * 8
+	lsl	tmp1, vector_length, 2  // vector_length * 4
+	whilelo	p4.b, tmp1, n
+	incb	tmp1
+	whilelo	p5.b, tmp1, n
+	b.last	1f
+	ld1b	z0.b, p0/z, [src, #0, mul vl]
+	ld1b	z1.b, p1/z, [src, #1, mul vl]
+	ld1b	z2.b, p2/z, [src, #2, mul vl]
+	ld1b	z3.b, p3/z, [src, #3, mul vl]
+	ld1b	z4.b, p4/z, [src, #4, mul vl]
+	ld1b	z5.b, p5/z, [src, #5, mul vl]
+	st1b	z0.b, p0, [dest, #0, mul vl]
+	st1b	z1.b, p1, [dest, #1, mul vl]
+	st1b	z2.b, p2, [dest, #2, mul vl]
+	st1b	z3.b, p3, [dest, #3, mul vl]
+	st1b	z4.b, p4, [dest, #4, mul vl]
+	st1b	z5.b, p5, [dest, #5, mul vl]
+	ret
+1:	lsl	tmp1, vector_length, 2	// vector_length * 4
+	incb	tmp1			// vector_length * 5
+	incb	tmp1			// vector_length * 6
+	whilelo	p6.b, tmp1, n
+	incb	tmp1
+	whilelo	p7.b, tmp1, n
+	ld1b	z0.b, p0/z, [src, #0, mul vl]
+	ld1b	z1.b, p1/z, [src, #1, mul vl]
+	ld1b	z2.b, p2/z, [src, #2, mul vl]
+	ld1b	z3.b, p3/z, [src, #3, mul vl]
+	ld1b	z4.b, p4/z, [src, #4, mul vl]
+	ld1b	z5.b, p5/z, [src, #5, mul vl]
+	ld1b	z6.b, p6/z, [src, #6, mul vl]
+	ld1b	z7.b, p7/z, [src, #7, mul vl]
+	st1b	z0.b, p0, [dest, #0, mul vl]
+	st1b	z1.b, p1, [dest, #1, mul vl]
+	st1b	z2.b, p2, [dest, #2, mul vl]
+	st1b	z3.b, p3, [dest, #3, mul vl]
+	st1b	z4.b, p4, [dest, #4, mul vl]
+	st1b	z5.b, p5, [dest, #5, mul vl]
+	st1b	z6.b, p6, [dest, #6, mul vl]
+	st1b	z7.b, p7, [dest, #7, mul vl]
+	ret
+	.endm
+
+ENTRY (MEMCPY)
+
+	PTR_ARG (0)
+	PTR_ARG (1)
+	SIZE_ARG (2)
+
+L(memcpy):
+	cntb	vector_length
+	// shortcut for less than vector_length * 8
+	// gives a free ptrue to p0.b for n >= vector_length
+	shortcut_for_small_size L(vl_agnostic)
+	// end of shortcut
+
+L(vl_agnostic): // VL Agnostic
+	mov	rest, n
+	mov	dest_ptr, dest
+	mov	src_ptr, src
+	// if rest >= L2_SIZE && vector_length == 64 then L(L2)
+	mov	tmp1, 64
+	cmp	rest, L2_SIZE
+	ccmp	vector_length, tmp1, 0, cs
+	b.eq	L(L2)
+
+L(unroll8): // unrolling and software pipeline
+	lsl	tmp1, vector_length, 3	// vector_length * 8
+	.p2align 3
+	cmp	 rest, tmp1
+	b.cc	L(last)
+	ld1b_unroll8
+	add	src_ptr, src_ptr, tmp1
+	sub	rest, rest, tmp1
+	cmp	rest, tmp1
+	b.cc	2f
+	.p2align 3
+1:	stld1b_unroll8
+	add	dest_ptr, dest_ptr, tmp1
+	add	src_ptr, src_ptr, tmp1
+	sub	rest, rest, tmp1
+	cmp	rest, tmp1
+	b.ge	1b
+2:	st1b_unroll8
+	add	dest_ptr, dest_ptr, tmp1
+
+	.p2align 3
+L(last):
+	whilelo	p0.b, xzr, rest
+	whilelo	p1.b, vector_length, rest
+	b.last	1f
+	ld1b	z0.b, p0/z, [src_ptr, #0, mul vl]
+	ld1b	z1.b, p1/z, [src_ptr, #1, mul vl]
+	st1b	z0.b, p0, [dest_ptr, #0, mul vl]
+	st1b	z1.b, p1, [dest_ptr, #1, mul vl]
+	ret
+1:	lsl	tmp1, vector_length, 1	// vector_length * 2
+	whilelo	p2.b, tmp1, rest
+	incb	tmp1
+	whilelo	p3.b, tmp1, rest
+	b.last	1f
+	ld1b	z0.b, p0/z, [src_ptr, #0, mul vl]
+	ld1b	z1.b, p1/z, [src_ptr, #1, mul vl]
+	ld1b	z2.b, p2/z, [src_ptr, #2, mul vl]
+	ld1b	z3.b, p3/z, [src_ptr, #3, mul vl]
+	st1b	z0.b, p0, [dest_ptr, #0, mul vl]
+	st1b	z1.b, p1, [dest_ptr, #1, mul vl]
+	st1b	z2.b, p2, [dest_ptr, #2, mul vl]
+	st1b	z3.b, p3, [dest_ptr, #3, mul vl]
+	ret
+1:	lsl	tmp1, vector_length, 2	// vector_length * 4
+	whilelo	p4.b, tmp1, rest
+	incb	tmp1
+	whilelo	p5.b, tmp1, rest
+	incb	tmp1
+	whilelo	p6.b, tmp1, rest
+	incb	tmp1
+	whilelo	p7.b, tmp1, rest
+	ld1b	z0.b, p0/z, [src_ptr, #0, mul vl]
+	ld1b	z1.b, p1/z, [src_ptr, #1, mul vl]
+	ld1b	z2.b, p2/z, [src_ptr, #2, mul vl]
+	ld1b	z3.b, p3/z, [src_ptr, #3, mul vl]
+	ld1b	z4.b, p4/z, [src_ptr, #4, mul vl]
+	ld1b	z5.b, p5/z, [src_ptr, #5, mul vl]
+	ld1b	z6.b, p6/z, [src_ptr, #6, mul vl]
+	ld1b	z7.b, p7/z, [src_ptr, #7, mul vl]
+	st1b	z0.b, p0, [dest_ptr, #0, mul vl]
+	st1b	z1.b, p1, [dest_ptr, #1, mul vl]
+	st1b	z2.b, p2, [dest_ptr, #2, mul vl]
+	st1b	z3.b, p3, [dest_ptr, #3, mul vl]
+	st1b	z4.b, p4, [dest_ptr, #4, mul vl]
+	st1b	z5.b, p5, [dest_ptr, #5, mul vl]
+	st1b	z6.b, p6, [dest_ptr, #6, mul vl]
+	st1b	z7.b, p7, [dest_ptr, #7, mul vl]
+	ret
+
+L(L2):
+	// align dest address at CACHE_LINE_SIZE byte boundary
+	mov	tmp1, CACHE_LINE_SIZE
+	ands	tmp2, dest_ptr, CACHE_LINE_SIZE - 1
+	// if cl_remainder == 0
+	b.eq	L(L2_dc_zva)
+	sub	cl_remainder, tmp1, tmp2
+	// process remainder until the first CACHE_LINE_SIZE boundary
+	whilelo	p1.b, xzr, cl_remainder	// keep p0.b all true
+	whilelo	p2.b, vector_length, cl_remainder
+	b.last	1f
+	ld1b	z1.b, p1/z, [src_ptr, #0, mul vl]
+	ld1b	z2.b, p2/z, [src_ptr, #1, mul vl]
+	st1b	z1.b, p1, [dest_ptr, #0, mul vl]
+	st1b	z2.b, p2, [dest_ptr, #1, mul vl]
+	b	2f
+1:	lsl	tmp1, vector_length, 1	// vector_length * 2
+	whilelo	p3.b, tmp1, cl_remainder
+	incb	tmp1
+	whilelo	p4.b, tmp1, cl_remainder
+	ld1b	z1.b, p1/z, [src_ptr, #0, mul vl]
+	ld1b	z2.b, p2/z, [src_ptr, #1, mul vl]
+	ld1b	z3.b, p3/z, [src_ptr, #2, mul vl]
+	ld1b	z4.b, p4/z, [src_ptr, #3, mul vl]
+	st1b	z1.b, p1, [dest_ptr, #0, mul vl]
+	st1b	z2.b, p2, [dest_ptr, #1, mul vl]
+	st1b	z3.b, p3, [dest_ptr, #2, mul vl]
+	st1b	z4.b, p4, [dest_ptr, #3, mul vl]
+2:	add	dest_ptr, dest_ptr, cl_remainder
+	add	src_ptr, src_ptr, cl_remainder
+	sub	rest, rest, cl_remainder
+
+L(L2_dc_zva):
+	// zero fill
+	and	tmp1, dest, 0xffffffffffffff
+	and	tmp2, src, 0xffffffffffffff
+	subs	tmp1, tmp1, tmp2	// diff
+	b.ge	1f
+	neg	tmp1, tmp1
+1:	mov	tmp3, ZF_DIST + CACHE_LINE_SIZE * 2
+	cmp	tmp1, tmp3
+	b.lo	L(unroll8)
+	mov	tmp1, dest_ptr
+	dc_zva	(ZF_DIST / CACHE_LINE_SIZE) - 1
+	// unroll
+	ld1b_unroll8	// this line has to be after "b.lo L(unroll8)"
+	add	 src_ptr, src_ptr, CACHE_LINE_SIZE * 2
+	sub	 rest, rest, CACHE_LINE_SIZE * 2
+	mov	 tmp1, ZF_DIST
+	.p2align 3
+1:	stld1b_unroll4a
+	add	tmp2, dest_ptr, tmp1	// dest_ptr + ZF_DIST
+	dc	zva, tmp2
+	stld1b_unroll4b
+	add	tmp2, tmp2, CACHE_LINE_SIZE
+	dc	zva, tmp2
+	add	dest_ptr, dest_ptr, CACHE_LINE_SIZE * 2
+	add	src_ptr, src_ptr, CACHE_LINE_SIZE * 2
+	sub	rest, rest, CACHE_LINE_SIZE * 2
+	cmp	rest, tmp3	// ZF_DIST + CACHE_LINE_SIZE * 2
+	b.ge	1b
+	st1b_unroll8
+	add	dest_ptr, dest_ptr, CACHE_LINE_SIZE * 2
+	b	L(unroll8)
+
+END (MEMCPY)
+libc_hidden_builtin_def (MEMCPY)
+
+
+ENTRY (MEMMOVE)
+
+	PTR_ARG (0)
+	PTR_ARG (1)
+	SIZE_ARG (2)
+
+	// remove tag address
+	// dest has to be immutable because it is the return value
+	// src has to be immutable because it is used in L(bwd_last)
+	and	tmp2, dest, 0xffffffffffffff	// save dest_notag into tmp2
+	and	tmp3, src, 0xffffffffffffff	// save src_notag intp tmp3
+	cmp	n, 0
+	ccmp	tmp2, tmp3, 4, ne
+	b.ne	1f
+	ret
+1:	cntb	vector_length
+	// shortcut for less than vector_length * 8
+	// gives a free ptrue to p0.b for n >= vector_length
+	// tmp2 and tmp3 should not be used in this macro to keep
+	// notag addresses
+	shortcut_for_small_size L(dispatch)
+	// end of shortcut
+
+L(dispatch):
+	// tmp2 = dest_notag, tmp3 = src_notag
+	// diff = dest_notag - src_notag
+	sub	tmp1, tmp2, tmp3
+	// if diff <= 0 || diff >= n then memcpy
+	cmp	tmp1, 0
+	ccmp	tmp1, n, 2, gt
+	b.cs	L(vl_agnostic)
+
+L(bwd_start):
+	mov	rest, n
+	add	dest_ptr, dest, n	// dest_end
+	add	src_ptr, src, n		// src_end
+
+L(bwd_unroll8): // unrolling and software pipeline
+	lsl	tmp1, vector_length, 3	// vector_length * 8
+	.p2align 3
+	cmp	rest, tmp1
+	b.cc	L(bwd_last)
+	sub	src_ptr, src_ptr, tmp1
+	ld1b_unroll8
+	sub	rest, rest, tmp1
+	cmp	rest, tmp1
+	b.cc	2f
+	.p2align 3
+1:	sub	src_ptr, src_ptr, tmp1
+	sub	dest_ptr, dest_ptr, tmp1
+	stld1b_unroll8
+	sub	rest, rest, tmp1
+	cmp	rest, tmp1
+	b.ge	1b
+2:	sub	dest_ptr, dest_ptr, tmp1
+	st1b_unroll8
+
+L(bwd_last):
+	mov	dest_ptr, dest
+	mov	src_ptr, src
+	b	L(last)
+
+END (MEMMOVE)
+libc_hidden_builtin_def (MEMMOVE)
+# endif /* IS_IN (libc) */
+#endif /* HAVE_AARCH64_SVE_ASM */
diff --git a/sysdeps/aarch64/multiarch/memmove.c b/sysdeps/aarch64/multiarch/memmove.c
index 12d77818a999..d0adefc547f6 100644
--- a/sysdeps/aarch64/multiarch/memmove.c
+++ b/sysdeps/aarch64/multiarch/memmove.c
@@ -33,6 +33,9 @@ extern __typeof (__redirect_memmove) __memmove_simd attribute_hidden;
 extern __typeof (__redirect_memmove) __memmove_thunderx attribute_hidden;
 extern __typeof (__redirect_memmove) __memmove_thunderx2 attribute_hidden;
 extern __typeof (__redirect_memmove) __memmove_falkor attribute_hidden;
+# if HAVE_AARCH64_SVE_ASM
+extern __typeof (__redirect_memmove) __memmove_a64fx attribute_hidden;
+# endif
 
 libc_ifunc (__libc_memmove,
             (IS_THUNDERX (midr)
@@ -40,12 +43,17 @@ libc_ifunc (__libc_memmove,
 	     : (IS_FALKOR (midr) || IS_PHECDA (midr)
 		? __memmove_falkor
 		: (IS_THUNDERX2 (midr) || IS_THUNDERX2PA (midr)
-		  ? __memmove_thunderx2
-		  : (IS_NEOVERSE_N1 (midr) || IS_NEOVERSE_N2 (midr)
-		     || IS_NEOVERSE_V1 (midr)
-		     ? __memmove_simd
+		   ? __memmove_thunderx2
+		   : (IS_NEOVERSE_N1 (midr) || IS_NEOVERSE_N2 (midr)
+		      || IS_NEOVERSE_V1 (midr)
+		      ? __memmove_simd
+# if HAVE_AARCH64_SVE_ASM
+		     : (IS_A64FX (midr)
+			? __memmove_a64fx
+			: __memmove_generic))))));
+# else
 		     : __memmove_generic)))));
-
+# endif
 # undef memmove
 strong_alias (__libc_memmove, memmove);
 #endif
diff --git a/sysdeps/unix/sysv/linux/aarch64/cpu-features.c b/sysdeps/unix/sysv/linux/aarch64/cpu-features.c
index db6aa3516c1b..6206a2f618b0 100644
--- a/sysdeps/unix/sysv/linux/aarch64/cpu-features.c
+++ b/sysdeps/unix/sysv/linux/aarch64/cpu-features.c
@@ -46,6 +46,7 @@ static struct cpu_list cpu_list[] = {
       {"ares",		 0x411FD0C0},
       {"emag",		 0x503F0001},
       {"kunpeng920", 	 0x481FD010},
+      {"a64fx",		 0x460F0010},
       {"generic", 	 0x0}
 };
 
@@ -116,4 +117,7 @@ init_cpu_features (struct cpu_features *cpu_features)
 	     (PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_ASYNC | MTE_ALLOWED_TAGS),
 	     0, 0, 0);
 #endif
+
+  /* Check if SVE is supported.  */
+  cpu_features->sve = GLRO (dl_hwcap) & HWCAP_SVE;
 }
diff --git a/sysdeps/unix/sysv/linux/aarch64/cpu-features.h b/sysdeps/unix/sysv/linux/aarch64/cpu-features.h
index 3b9bfed1349c..2b322e5414be 100644
--- a/sysdeps/unix/sysv/linux/aarch64/cpu-features.h
+++ b/sysdeps/unix/sysv/linux/aarch64/cpu-features.h
@@ -65,6 +65,9 @@
 #define IS_KUNPENG920(midr) (MIDR_IMPLEMENTOR(midr) == 'H'			   \
                         && MIDR_PARTNUM(midr) == 0xd01)
 
+#define IS_A64FX(midr) (MIDR_IMPLEMENTOR(midr) == 'F'			      \
+			&& MIDR_PARTNUM(midr) == 0x001)
+
 struct cpu_features
 {
   uint64_t midr_el1;
@@ -72,6 +75,7 @@ struct cpu_features
   bool bti;
   /* Currently, the GLIBC memory tagging tunable only defines 8 bits.  */
   uint8_t mte_state;
+  bool sve;
 };
 
 #endif /* _CPU_FEATURES_AARCH64_H  */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 72+ messages in thread
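
For readers unfamiliar with SVE predication, the core idea behind the
CNTB/WHILELO/LD1B/ST1B sequences above can be written with ACLE
intrinsics.  The following is only an unoptimized, vector-length-agnostic
sketch of that idea (no unrolling, alignment handling, DC ZVA zero fill,
or software pipelining, which is what the assembly adds on top); it is
not part of the patch and assumes a compiler invoked with
-march=armv8.2-a+sve.

#include <arm_sve.h>
#include <stddef.h>
#include <stdint.h>

/* Copy n bytes using whatever vector length the hardware provides.
   The predicate from svwhilelt masks off the tail lanes, so no scalar
   cleanup loop is needed; WHILELO plays the same role in the assembly.  */
void
sve_copy_sketch (uint8_t *dest, const uint8_t *src, size_t n)
{
  for (size_t i = 0; i < n; i += svcntb ())
    {
      svbool_t pg = svwhilelt_b8_u64 (i, n);
      svst1_u8 (pg, dest + i, svld1_u8 (pg, src + i));
    }
}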

* [PATCH v3 2/2] aarch64: Added optimized memset for A64FX
  2021-05-12  9:23 ` [PATCH v2 0/6] aarch64: " Naohiro Tamura
                     ` (7 preceding siblings ...)
  2021-05-27  7:42   ` [PATCH v3 1/2] aarch64: Added optimized memcpy and memmove " Naohiro Tamura
@ 2021-05-27  7:44   ` Naohiro Tamura
  8 siblings, 0 replies; 72+ messages in thread
From: Naohiro Tamura @ 2021-05-27  7:44 UTC (permalink / raw)
  To: libc-alpha; +Cc: Naohiro Tamura

From: Naohiro Tamura <naohirot@jp.fujitsu.com>

This patch optimizes the performance of memset for A64FX [1] which
implements ARMv8-A SVE and has L1 64KB cache per core and L2 8MB cache
per NUMA node.

The performance optimization makes use of Scalable Vector Register
with several techniques such as loop unrolling, memory access
alignment, cache zero fill and prefetch.

SVE assembler code for memset is implemented as Vector Length Agnostic
code so theoretically it can be run on any SOC which supports ARMv8-A
SVE standard.

We confirmed that all testcases have been passed by running 'make
check' and 'make xcheck' not only on A64FX but also on ThunderX2.

And also we confirmed that the SVE 512 bit vector register performance
is roughly 4 times better than Advanced SIMD 128 bit register and 8
times better than scalar 64 bit register by running 'make bench'.

[1] https://github.com/fujitsu/A64FX

Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Reviewed-by: Szabolcs Nagy <Szabolcs.Nagy@arm.com>
---
 sysdeps/aarch64/multiarch/Makefile          |   1 +
 sysdeps/aarch64/multiarch/ifunc-impl-list.c |   5 +-
 sysdeps/aarch64/multiarch/memset.c          |  17 +-
 sysdeps/aarch64/multiarch/memset_a64fx.S    | 268 ++++++++++++++++++++
 4 files changed, 286 insertions(+), 5 deletions(-)
 create mode 100644 sysdeps/aarch64/multiarch/memset_a64fx.S

diff --git a/sysdeps/aarch64/multiarch/Makefile b/sysdeps/aarch64/multiarch/Makefile
index 04c3f171215e..7500cf1e9369 100644
--- a/sysdeps/aarch64/multiarch/Makefile
+++ b/sysdeps/aarch64/multiarch/Makefile
@@ -2,6 +2,7 @@ ifeq ($(subdir),string)
 sysdep_routines += memcpy_generic memcpy_advsimd memcpy_thunderx memcpy_thunderx2 \
 		   memcpy_falkor memcpy_a64fx \
 		   memset_generic memset_falkor memset_emag memset_kunpeng \
+		   memset_a64fx \
 		   memchr_generic memchr_nosimd \
 		   strlen_mte strlen_asimd
 endif
diff --git a/sysdeps/aarch64/multiarch/ifunc-impl-list.c b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
index 911393565c21..4e1a641d9fe9 100644
--- a/sysdeps/aarch64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
@@ -37,7 +37,7 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 
   INIT_ARCH ();
 
-  /* Support sysdeps/aarch64/multiarch/memcpy.c and memmove.c.  */
+  /* Support sysdeps/aarch64/multiarch/memcpy.c, memmove.c and memset.c.  */
   IFUNC_IMPL (i, name, memcpy,
 	      IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_thunderx)
 	      IFUNC_IMPL_ADD (array, i, memcpy, !bti, __memcpy_thunderx2)
@@ -62,6 +62,9 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 	      IFUNC_IMPL_ADD (array, i, memset, (zva_size == 64), __memset_falkor)
 	      IFUNC_IMPL_ADD (array, i, memset, (zva_size == 64), __memset_emag)
 	      IFUNC_IMPL_ADD (array, i, memset, 1, __memset_kunpeng)
+#if HAVE_AARCH64_SVE_ASM
+	      IFUNC_IMPL_ADD (array, i, memset, sve, __memset_a64fx)
+#endif
 	      IFUNC_IMPL_ADD (array, i, memset, 1, __memset_generic))
   IFUNC_IMPL (i, name, memchr,
 	      IFUNC_IMPL_ADD (array, i, memchr, !mte, __memchr_nosimd)
diff --git a/sysdeps/aarch64/multiarch/memset.c b/sysdeps/aarch64/multiarch/memset.c
index 28d3926bc2e6..d7d9bbbda095 100644
--- a/sysdeps/aarch64/multiarch/memset.c
+++ b/sysdeps/aarch64/multiarch/memset.c
@@ -31,16 +31,25 @@ extern __typeof (__redirect_memset) __libc_memset;
 extern __typeof (__redirect_memset) __memset_falkor attribute_hidden;
 extern __typeof (__redirect_memset) __memset_emag attribute_hidden;
 extern __typeof (__redirect_memset) __memset_kunpeng attribute_hidden;
+# if HAVE_AARCH64_SVE_ASM
+extern __typeof (__redirect_memset) __memset_a64fx attribute_hidden;
+# endif
 extern __typeof (__redirect_memset) __memset_generic attribute_hidden;
 
 libc_ifunc (__libc_memset,
 	    IS_KUNPENG920 (midr)
 	    ?__memset_kunpeng
 	    : ((IS_FALKOR (midr) || IS_PHECDA (midr)) && zva_size == 64
-	     ? __memset_falkor
-	     : (IS_EMAG (midr) && zva_size == 64
-	       ? __memset_emag
-	       : __memset_generic)));
+	      ? __memset_falkor
+	      : (IS_EMAG (midr) && zva_size == 64
+		? __memset_emag
+# if HAVE_AARCH64_SVE_ASM
+		: (IS_A64FX (midr)
+		  ? __memset_a64fx
+		  : __memset_generic))));
+# else
+		  : __memset_generic)));
+# endif
 
 # undef memset
 strong_alias (__libc_memset, memset);
diff --git a/sysdeps/aarch64/multiarch/memset_a64fx.S b/sysdeps/aarch64/multiarch/memset_a64fx.S
new file mode 100644
index 000000000000..ce54e5418b08
--- /dev/null
+++ b/sysdeps/aarch64/multiarch/memset_a64fx.S
@@ -0,0 +1,268 @@
+/* Optimized memset for Fujitsu A64FX processor.
+   Copyright (C) 2021 Free Software Foundation, Inc.
+
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library.  If not, see
+   <https://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include <sysdeps/aarch64/memset-reg.h>
+
+/* Assumptions:
+ *
+ * ARMv8.2-a, AArch64, unaligned accesses, sve
+ *
+ */
+
+#define L1_SIZE		(64*1024)	// L1 64KB
+#define L2_SIZE         (8*1024*1024)	// L2 8MB - 1MB
+#define CACHE_LINE_SIZE	256
+#define PF_DIST_L1	(CACHE_LINE_SIZE * 16)	// Prefetch distance L1
+#define ZF_DIST		(CACHE_LINE_SIZE * 21)	// Zerofill distance
+#define rest		x8
+#define vector_length	x9
+#define vl_remainder	x10	// vector_length remainder
+#define cl_remainder	x11	// CACHE_LINE_SIZE remainder
+
+#if HAVE_AARCH64_SVE_ASM
+# if IS_IN (libc)
+#  define MEMSET __memset_a64fx
+
+	.arch armv8.2-a+sve
+
+	.macro dc_zva times
+	dc	zva, tmp1
+	add	tmp1, tmp1, CACHE_LINE_SIZE
+	.if \times-1
+	dc_zva "(\times-1)"
+	.endif
+	.endm
+
+	.macro st1b_unroll first=0, last=7
+	st1b	z0.b, p0, [dst, #\first, mul vl]
+	.if \last-\first
+	st1b_unroll "(\first+1)", \last
+	.endif
+	.endm
+
+	.macro shortcut_for_small_size exit
+	// if rest <= vector_length * 2
+	whilelo	p0.b, xzr, count
+	whilelo	p1.b, vector_length, count
+	b.last	1f
+	st1b	z0.b, p0, [dstin, #0, mul vl]
+	st1b	z0.b, p1, [dstin, #1, mul vl]
+	ret
+1:	// if rest > vector_length * 8
+	cmp	count, vector_length, lsl 3	// vector_length * 8
+	b.hi	\exit
+	// if rest <= vector_length * 4
+	lsl	tmp1, vector_length, 1	// vector_length * 2
+	whilelo	p2.b, tmp1, count
+	incb	tmp1
+	whilelo	p3.b, tmp1, count
+	b.last	1f
+	st1b	z0.b, p0, [dstin, #0, mul vl]
+	st1b	z0.b, p1, [dstin, #1, mul vl]
+	st1b	z0.b, p2, [dstin, #2, mul vl]
+	st1b	z0.b, p3, [dstin, #3, mul vl]
+	ret
+1:	// if rest <= vector_length * 8
+	lsl	tmp1, vector_length, 2	// vector_length * 4
+	whilelo	p4.b, tmp1, count
+	incb	tmp1
+	whilelo	p5.b, tmp1, count
+	b.last	1f
+	st1b	z0.b, p0, [dstin, #0, mul vl]
+	st1b	z0.b, p1, [dstin, #1, mul vl]
+	st1b	z0.b, p2, [dstin, #2, mul vl]
+	st1b	z0.b, p3, [dstin, #3, mul vl]
+	st1b	z0.b, p4, [dstin, #4, mul vl]
+	st1b	z0.b, p5, [dstin, #5, mul vl]
+	ret
+1:	lsl	tmp1, vector_length, 2	// vector_length * 4
+	incb	tmp1			// vector_length * 5
+	incb	tmp1			// vector_length * 6
+	whilelo	p6.b, tmp1, count
+	incb	tmp1
+	whilelo	p7.b, tmp1, count
+	st1b	z0.b, p0, [dstin, #0, mul vl]
+	st1b	z0.b, p1, [dstin, #1, mul vl]
+	st1b	z0.b, p2, [dstin, #2, mul vl]
+	st1b	z0.b, p3, [dstin, #3, mul vl]
+	st1b	z0.b, p4, [dstin, #4, mul vl]
+	st1b	z0.b, p5, [dstin, #5, mul vl]
+	st1b	z0.b, p6, [dstin, #6, mul vl]
+	st1b	z0.b, p7, [dstin, #7, mul vl]
+	ret
+	.endm
+
+ENTRY (MEMSET)
+
+	PTR_ARG (0)
+	SIZE_ARG (2)
+
+	cbnz	count, 1f
+	ret
+1:	dup	z0.b, valw
+	cntb	vector_length
+	// shortcut for less than vector_length * 8
+	// gives a free ptrue to p0.b for n >= vector_length
+	shortcut_for_small_size L(vl_agnostic)
+	// end of shortcut
+
+L(vl_agnostic): // VL Agnostic
+	mov	rest, count
+	mov	dst, dstin
+	add	dstend, dstin, count
+	// if rest >= L2_SIZE && vector_length == 64 then L(L2)
+	mov	tmp1, 64
+	cmp	rest, L2_SIZE
+	ccmp	vector_length, tmp1, 0, cs
+	b.eq	L(L2)
+	// if rest >= L1_SIZE && vector_length == 64 then L(L1_prefetch)
+	cmp	rest, L1_SIZE
+	ccmp	vector_length, tmp1, 0, cs
+	b.eq	L(L1_prefetch)
+
+L(unroll32):
+	lsl	tmp1, vector_length, 3	// vector_length * 8
+	lsl	tmp2, vector_length, 5	// vector_length * 32
+	.p2align 3
+1:	cmp	rest, tmp2
+	b.cc	L(unroll8)
+	st1b_unroll
+	add	dst, dst, tmp1
+	st1b_unroll
+	add	dst, dst, tmp1
+	st1b_unroll
+	add	dst, dst, tmp1
+	st1b_unroll
+	add	dst, dst, tmp1
+	sub	rest, rest, tmp2
+	b	1b
+
+L(unroll8):
+	lsl	tmp1, vector_length, 3
+	.p2align 3
+1:	cmp	rest, tmp1
+	b.cc	L(last)
+	st1b_unroll
+	add	dst, dst, tmp1
+	sub	rest, rest, tmp1
+	b	1b
+
+L(last):
+	whilelo	p0.b, xzr, rest
+	whilelo	p1.b, vector_length, rest
+	b.last	1f
+	st1b	z0.b, p0, [dst, #0, mul vl]
+	st1b	z0.b, p1, [dst, #1, mul vl]
+	ret
+1:	lsl	tmp1, vector_length, 1	// vector_length * 2
+	whilelo	p2.b, tmp1, rest
+	incb	tmp1
+	whilelo	p3.b, tmp1, rest
+	b.last	1f
+	st1b	z0.b, p0, [dst, #0, mul vl]
+	st1b	z0.b, p1, [dst, #1, mul vl]
+	st1b	z0.b, p2, [dst, #2, mul vl]
+	st1b	z0.b, p3, [dst, #3, mul vl]
+	ret
+1:	lsl	tmp1, vector_length, 2	// vector_length * 4
+	whilelo	p4.b, tmp1, rest
+	incb	tmp1
+	whilelo	p5.b, tmp1, rest
+	incb	tmp1
+	whilelo	p6.b, tmp1, rest
+	incb	tmp1
+	whilelo	p7.b, tmp1, rest
+	st1b	z0.b, p0, [dst, #0, mul vl]
+	st1b	z0.b, p1, [dst, #1, mul vl]
+	st1b	z0.b, p2, [dst, #2, mul vl]
+	st1b	z0.b, p3, [dst, #3, mul vl]
+	st1b	z0.b, p4, [dst, #4, mul vl]
+	st1b	z0.b, p5, [dst, #5, mul vl]
+	st1b	z0.b, p6, [dst, #6, mul vl]
+	st1b	z0.b, p7, [dst, #7, mul vl]
+	ret
+
+L(L1_prefetch): // if rest >= L1_SIZE
+	.p2align 3
+1:	st1b_unroll 0, 3
+	prfm	pstl1keep, [dst, PF_DIST_L1]
+	st1b_unroll 4, 7
+	prfm	pstl1keep, [dst, PF_DIST_L1 + CACHE_LINE_SIZE]
+	add	dst, dst, CACHE_LINE_SIZE * 2
+	sub	rest, rest, CACHE_LINE_SIZE * 2
+	cmp	rest, L1_SIZE
+	b.ge	1b
+	cbnz	rest, L(unroll32)
+	ret
+
+L(L2):
+	// align dst address at vector_length byte boundary
+	sub	tmp1, vector_length, 1
+	ands	tmp2, dst, tmp1
+	// if vl_remainder == 0
+	b.eq	1f
+	sub	vl_remainder, vector_length, tmp2
+	// process remainder until the first vector_length boundary
+	whilelt	p2.b, xzr, vl_remainder
+	st1b	z0.b, p2, [dst]
+	add	dst, dst, vl_remainder
+	sub	rest, rest, vl_remainder
+	// align dstin address at CACHE_LINE_SIZE byte boundary
+1:	mov	tmp1, CACHE_LINE_SIZE
+	ands	tmp2, dst, CACHE_LINE_SIZE - 1
+	// if cl_remainder == 0
+	b.eq	L(L2_dc_zva)
+	sub	cl_remainder, tmp1, tmp2
+	// process remainder until the first CACHE_LINE_SIZE boundary
+	mov	tmp1, xzr       // index
+2:	whilelt	p2.b, tmp1, cl_remainder
+	st1b	z0.b, p2, [dst, tmp1]
+	incb	tmp1
+	cmp	tmp1, cl_remainder
+	b.lo	2b
+	add	dst, dst, cl_remainder
+	sub	rest, rest, cl_remainder
+
+L(L2_dc_zva):
+	// zero fill
+	mov	tmp1, dst
+	dc_zva	(ZF_DIST / CACHE_LINE_SIZE) - 1
+	mov	zva_len, ZF_DIST
+	add	tmp1, zva_len, CACHE_LINE_SIZE * 2
+	// unroll
+	.p2align 3
+1:	st1b_unroll 0, 3
+	add	tmp2, dst, zva_len
+	dc	 zva, tmp2
+	st1b_unroll 4, 7
+	add	tmp2, tmp2, CACHE_LINE_SIZE
+	dc	zva, tmp2
+	add	dst, dst, CACHE_LINE_SIZE * 2
+	sub	rest, rest, CACHE_LINE_SIZE * 2
+	cmp	rest, tmp1	// ZF_DIST + CACHE_LINE_SIZE * 2
+	b.ge	1b
+	cbnz	rest, L(unroll8)
+	ret
+
+END (MEMSET)
+libc_hidden_builtin_def (MEMSET)
+
+#endif /* IS_IN (libc) */
+#endif /* HAVE_AARCH64_SVE_ASM */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 72+ messages in thread
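
As with the memcpy patch, the DUP/WHILELO/ST1B pattern at the heart of
the code can be written with ACLE intrinsics.  The following is only a
minimal vector-length-agnostic sketch (the real code adds unrolling,
L1/L2 size thresholds, prefetch, and DC ZVA for large fills); it is not
part of the patch and assumes -march=armv8.2-a+sve.

#include <arm_sve.h>
#include <stddef.h>
#include <stdint.h>

/* Broadcast the fill byte once, then store it under a tail-masking
   predicate: the intrinsic-level counterpart of DUP + WHILELO + ST1B.  */
void
sve_fill_sketch (uint8_t *dest, uint8_t value, size_t n)
{
  svuint8_t v = svdup_n_u8 (value);
  for (size_t i = 0; i < n; i += svcntb ())
    svst1_u8 (svwhilelt_b8_u64 (i, n), dest + i, v);
}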

* RE: [PATCH v2 0/6] aarch64: Added optimized memcpy/memmove/memset for A64FX
  2021-05-27  0:22   ` [PATCH v2 0/6] aarch64: Added optimized memcpy/memmove/memset for A64FX naohirot
@ 2021-05-27 23:50     ` naohirot
  0 siblings, 0 replies; 72+ messages in thread
From: naohirot @ 2021-05-27 23:50 UTC (permalink / raw)
  To: 'Szabolcs Nagy', libc-alpha@sourceware.org

Hi Szabolcs,

> >   aarch64: Added optimized memcpy and memmove for A64FX
> >   aarch64: Added optimized memset for A64FX
> 
> I'll fix the whitespaces.

Great, thank you for the merges!
Naohiro


^ permalink raw reply	[flat|nested] 72+ messages in thread

end of thread, other threads:[~2021-05-27 23:50 UTC | newest]

Thread overview: 72+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-17  2:28 [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX Naohiro Tamura
2021-03-17  2:33 ` [PATCH 1/5] config: Added HAVE_SVE_ASM_SUPPORT for aarch64 Naohiro Tamura
2021-03-29 12:11   ` Szabolcs Nagy via Libc-alpha
2021-03-30  6:19     ` naohirot
2021-03-17  2:34 ` [PATCH 2/5] aarch64: Added optimized memcpy and memmove for A64FX Naohiro Tamura
2021-03-29 12:44   ` Szabolcs Nagy via Libc-alpha
2021-03-30  7:17     ` naohirot
2021-03-17  2:34 ` [PATCH 3/5] aarch64: Added optimized memset " Naohiro Tamura
2021-03-17  2:35 ` [PATCH 4/5] scripts: Added Vector Length Set test helper script Naohiro Tamura
2021-03-29 13:20   ` Szabolcs Nagy via Libc-alpha
2021-03-30  7:25     ` naohirot
2021-03-17  2:35 ` [PATCH 5/5] benchtests: Added generic_memcpy and generic_memmove to large benchtests Naohiro Tamura
2021-03-29 12:03 ` [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX Szabolcs Nagy via Libc-alpha
2021-05-10  1:45 ` naohirot
2021-05-14 13:35   ` Szabolcs Nagy via Libc-alpha
2021-05-19  0:11     ` naohirot
2021-05-12  9:23 ` [PATCH v2 0/6] aarch64: " Naohiro Tamura
2021-05-12  9:26   ` [PATCH v2 1/6] config: Added HAVE_AARCH64_SVE_ASM for aarch64 Naohiro Tamura
2021-05-26 10:05     ` Szabolcs Nagy via Libc-alpha
2021-05-12  9:27   ` [PATCH v2 2/6] aarch64: define BTI_C and BTI_J macros as NOP unless HAVE_AARCH64_BTI Naohiro Tamura
2021-05-26 10:06     ` Szabolcs Nagy via Libc-alpha
2021-05-12  9:28   ` [PATCH v2 3/6] aarch64: Added optimized memcpy and memmove for A64FX Naohiro Tamura
2021-05-26 10:19     ` Szabolcs Nagy via Libc-alpha
2021-05-12  9:28   ` [PATCH v2 4/6] aarch64: Added optimized memset " Naohiro Tamura
2021-05-26 10:22     ` Szabolcs Nagy via Libc-alpha
2021-05-12  9:29   ` [PATCH v2 5/6] scripts: Added Vector Length Set test helper script Naohiro Tamura
2021-05-12 16:58     ` Joseph Myers
2021-05-13  9:53       ` naohirot
2021-05-20  7:34     ` Naohiro Tamura
2021-05-26 10:24       ` Szabolcs Nagy via Libc-alpha
2021-05-12  9:29   ` [PATCH v2 6/6] benchtests: Fixed bench-memcpy-random: buf1: mprotect failed Naohiro Tamura
2021-05-26 10:25     ` Szabolcs Nagy via Libc-alpha
2021-05-27  0:22   ` [PATCH v2 0/6] aarch64: Added optimized memcpy/memmove/memset for A64FX naohirot
2021-05-27 23:50     ` naohirot
2021-05-27  7:42   ` [PATCH v3 1/2] aarch64: Added optimized memcpy and memmove " Naohiro Tamura
2021-05-27  7:44   ` [PATCH v3 2/2] aarch64: Added optimized memset " Naohiro Tamura
  -- strict thread matches above, loose matches on Subject: below --
2021-04-12 12:52 [PATCH 0/5] Added optimized memcpy/memmove/memset " Wilco Dijkstra via Libc-alpha
2021-04-12 18:53 ` Florian Weimer
2021-04-13 12:07 ` naohirot
2021-04-14 16:02   ` Wilco Dijkstra via Libc-alpha
2021-04-15 12:20     ` naohirot
2021-04-20 16:00       ` Wilco Dijkstra via Libc-alpha
2021-04-27 11:58         ` naohirot
2021-04-29 15:13           ` Wilco Dijkstra via Libc-alpha
2021-04-30 15:01             ` Szabolcs Nagy via Libc-alpha
2021-04-30 15:23               ` Wilco Dijkstra via Libc-alpha
2021-04-30 15:30                 ` Florian Weimer via Libc-alpha
2021-04-30 15:40                   ` Wilco Dijkstra via Libc-alpha
2021-05-04  7:56                     ` Szabolcs Nagy via Libc-alpha
2021-05-04 10:17                       ` Florian Weimer via Libc-alpha
2021-05-04 10:38                         ` Wilco Dijkstra via Libc-alpha
2021-05-04 10:42                         ` Szabolcs Nagy via Libc-alpha
2021-05-04 11:07                           ` Florian Weimer via Libc-alpha
2021-05-06 10:01             ` naohirot
2021-05-06 14:26               ` Szabolcs Nagy via Libc-alpha
2021-05-06 15:09                 ` Florian Weimer via Libc-alpha
2021-05-06 17:31               ` Wilco Dijkstra via Libc-alpha
2021-05-07 12:31                 ` naohirot
2021-04-19  2:51     ` naohirot
2021-04-19 14:57       ` Wilco Dijkstra via Libc-alpha
2021-04-21 10:10         ` naohirot
2021-04-21 15:02           ` Wilco Dijkstra via Libc-alpha
2021-04-22 13:17             ` naohirot
2021-04-23  0:58               ` naohirot
2021-04-19 12:43     ` naohirot
2021-04-20  3:31     ` naohirot
2021-04-20 14:44       ` Wilco Dijkstra via Libc-alpha
2021-04-27  9:01         ` naohirot
2021-04-20  5:49     ` naohirot
2021-04-20 11:39       ` Wilco Dijkstra via Libc-alpha
2021-04-27 11:03         ` naohirot
2021-04-23 13:22     ` naohirot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).