* [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
From: Naohiro Tamura @ 2021-03-17 2:28 UTC (permalink / raw)
To: libc-alpha
Fujitsu is in the process of signing the copyright assignment paper.
We'd like to have some feedback in advance.
This series of patches optimizes the performance of
memcpy/memmove/memset for A64FX [1], which implements ARMv8-A SVE and
has a 64KB L1 cache per core and an 8MB L2 cache per NUMA node.
The first patch updates autoconf to check whether the assembler is
capable of generating ARMv8-A SVE code, and defines the
HAVE_SVE_ASM_SUPPORT macro accordingly.
The second patch optimizes memcpy/memmove performance. It makes use
of the scalable vector registers together with several techniques:
loop unrolling, memory access alignment, cache zero fill, prefetch,
and software pipelining.
The third patch optimizes memset performance. It makes use of the
scalable vector registers together with loop unrolling, memory access
alignment, cache zero fill, and prefetch.
The fourth patch adds a test helper script that changes the vector
length for a child process. The script can be used as the test
wrapper for 'make check'.
The fifth patch adds generic_memcpy and generic_memmove to
bench-memcpy-large.c and bench-memmove-large.c respectively, so that
we can consistently compare the performance of the 512-bit scalable
vector registers against the scalar 64-bit registers across the
default and large memcpy/memmove/memset benchtests.
The SVE assembler code for memcpy/memmove/memset is implemented as
vector-length-agnostic code, so in principle it can run on any SoC
that supports the ARMv8-A SVE standard.
We confirmed that all test cases pass under 'make check' and 'make
xcheck', not only on A64FX but also on ThunderX2.
We also confirmed with 'make bench' that the SVE 512-bit vector
register performance is roughly 4 times better than Advanced SIMD
128-bit registers and 8 times better than scalar 64-bit registers.
[1] https://github.com/fujitsu/A64FX
Naohiro Tamura (5):
config: Added HAVE_SVE_ASM_SUPPORT for aarch64
aarch64: Added optimized memcpy and memmove for A64FX
aarch64: Added optimized memset for A64FX
scripts: Added Vector Length Set test helper script
benchtests: Added generic_memcpy and generic_memmove to large
benchtests
benchtests/bench-memcpy-large.c | 9 +
benchtests/bench-memmove-large.c | 9 +
config.h.in | 3 +
manual/tunables.texi | 3 +-
scripts/vltest.py | 82 ++
sysdeps/aarch64/configure | 28 +
sysdeps/aarch64/configure.ac | 15 +
sysdeps/aarch64/multiarch/Makefile | 3 +-
sysdeps/aarch64/multiarch/ifunc-impl-list.c | 17 +-
sysdeps/aarch64/multiarch/init-arch.h | 4 +-
sysdeps/aarch64/multiarch/memcpy.c | 12 +-
sysdeps/aarch64/multiarch/memcpy_a64fx.S | 979 ++++++++++++++++++
sysdeps/aarch64/multiarch/memmove.c | 12 +-
sysdeps/aarch64/multiarch/memset.c | 11 +-
sysdeps/aarch64/multiarch/memset_a64fx.S | 574 ++++++++++
.../unix/sysv/linux/aarch64/cpu-features.c | 4 +
.../unix/sysv/linux/aarch64/cpu-features.h | 4 +
17 files changed, 1759 insertions(+), 10 deletions(-)
create mode 100755 scripts/vltest.py
create mode 100644 sysdeps/aarch64/multiarch/memcpy_a64fx.S
create mode 100644 sysdeps/aarch64/multiarch/memset_a64fx.S
--
2.17.1
* [PATCH 1/5] config: Added HAVE_SVE_ASM_SUPPORT for aarch64
From: Naohiro Tamura @ 2021-03-17 2:33 UTC (permalink / raw)
To: libc-alpha; +Cc: Naohiro Tamura
From: Naohiro Tamura <naohirot@jp.fujitsu.com>
This patch checks whether the assembler supports '-march=armv8.2-a+sve'
for generating SVE code, and defines the HAVE_SVE_ASM_SUPPORT macro
accordingly.
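The check boils down to assembling a one-instruction SVE program. Outside of autoconf, the equivalent probe can be run by hand (a sketch assuming 'cc' is a native or cross aarch64 compiler; on other targets it simply reports "no"):

```shell
# Probe whether the assembler accepts ARMv8.2-A SVE code, the same
# way the configure fragment below does.
printf 'ptrue p0.b\n' > conftest.s
if ${CC:-cc} -c -march=armv8.2-a+sve conftest.s -o conftest.o 2>/dev/null; then
  echo yes   # HAVE_SVE_ASM_SUPPORT would be defined to 1
else
  echo no    # HAVE_SVE_ASM_SUPPORT stays 0
fi
rm -f conftest.s conftest.o
```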
---
config.h.in | 3 +++
sysdeps/aarch64/configure | 28 ++++++++++++++++++++++++++++
sysdeps/aarch64/configure.ac | 15 +++++++++++++++
3 files changed, 46 insertions(+)
diff --git a/config.h.in b/config.h.in
index f21bf04e47..2073816af8 100644
--- a/config.h.in
+++ b/config.h.in
@@ -118,6 +118,9 @@
/* AArch64 PAC-RET code generation is enabled. */
#define HAVE_AARCH64_PAC_RET 0
+/* Assembler supports ARMv8.2-A SVE.  */
+#define HAVE_SVE_ASM_SUPPORT 0
+
/* ARC big endian ABI */
#undef HAVE_ARC_BE
diff --git a/sysdeps/aarch64/configure b/sysdeps/aarch64/configure
index 83c3a23e44..ac16250f8a 100644
--- a/sysdeps/aarch64/configure
+++ b/sysdeps/aarch64/configure
@@ -304,3 +304,31 @@ fi
$as_echo "$libc_cv_aarch64_variant_pcs" >&6; }
config_vars="$config_vars
aarch64-variant-pcs = $libc_cv_aarch64_variant_pcs"
+
+# Check if the assembler supports armv8.2-a+sve
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for SVE support in assembler" >&5
+$as_echo_n "checking for SVE support in assembler... " >&6; }
+if ${libc_cv_asm_sve+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat > conftest.s <<\EOF
+ ptrue p0.b
+EOF
+if { ac_try='${CC-cc} -c -march=armv8.2-a+sve conftest.s 1>&5'
+ { { eval echo "\"\$as_me\":${as_lineno-$LINENO}: \"$ac_try\""; } >&5
+ (eval $ac_try) 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; }; then
+ libc_cv_asm_sve=yes
+else
+ libc_cv_asm_sve=no
+fi
+rm -f conftest*
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $libc_cv_asm_sve" >&5
+$as_echo "$libc_cv_asm_sve" >&6; }
+if test $libc_cv_asm_sve = yes; then
+ $as_echo "#define HAVE_SVE_ASM_SUPPORT 1" >>confdefs.h
+
+fi
diff --git a/sysdeps/aarch64/configure.ac b/sysdeps/aarch64/configure.ac
index 66f755078a..389a0b4e8d 100644
--- a/sysdeps/aarch64/configure.ac
+++ b/sysdeps/aarch64/configure.ac
@@ -90,3 +90,18 @@ EOF
fi
rm -rf conftest.*])
LIBC_CONFIG_VAR([aarch64-variant-pcs], [$libc_cv_aarch64_variant_pcs])
+
+# Check if the assembler supports armv8.2-a+sve
+AC_CACHE_CHECK(for SVE support in assembler, libc_cv_asm_sve, [dnl
+cat > conftest.s <<\EOF
+ ptrue p0.b
+EOF
+if AC_TRY_COMMAND(${CC-cc} -c -march=armv8.2-a+sve conftest.s 1>&AS_MESSAGE_LOG_FD); then
+ libc_cv_asm_sve=yes
+else
+ libc_cv_asm_sve=no
+fi
+rm -f conftest*])
+if test $libc_cv_asm_sve = yes; then
+ AC_DEFINE(HAVE_SVE_ASM_SUPPORT)
+fi
--
2.17.1
* [PATCH 2/5] aarch64: Added optimized memcpy and memmove for A64FX
From: Naohiro Tamura @ 2021-03-17 2:34 UTC (permalink / raw)
To: libc-alpha; +Cc: Naohiro Tamura
From: Naohiro Tamura <naohirot@jp.fujitsu.com>
This patch optimizes the performance of memcpy/memmove for A64FX [1],
which implements ARMv8-A SVE and has a 64KB L1 cache per core and an
8MB L2 cache per NUMA node.
The optimization makes use of the scalable vector registers together
with several techniques: loop unrolling, memory access alignment,
cache zero fill, prefetch, and software pipelining.
The SVE assembler code for memcpy/memmove is implemented as
vector-length-agnostic code, so in principle it can run on any SoC
that supports the ARMv8-A SVE standard.
We confirmed that all test cases pass under 'make check' and 'make
xcheck', not only on A64FX but also on ThunderX2.
We also confirmed with 'make bench' that the SVE 512-bit vector
register performance is roughly 4 times better than Advanced SIMD
128-bit registers and 8 times better than scalar 64-bit registers.
[1] https://github.com/fujitsu/A64FX
---
manual/tunables.texi | 3 +-
sysdeps/aarch64/multiarch/Makefile | 2 +-
sysdeps/aarch64/multiarch/ifunc-impl-list.c | 12 +-
sysdeps/aarch64/multiarch/init-arch.h | 4 +-
sysdeps/aarch64/multiarch/memcpy.c | 12 +-
sysdeps/aarch64/multiarch/memcpy_a64fx.S | 979 ++++++++++++++++++
sysdeps/aarch64/multiarch/memmove.c | 12 +-
.../unix/sysv/linux/aarch64/cpu-features.c | 4 +
.../unix/sysv/linux/aarch64/cpu-features.h | 4 +
9 files changed, 1024 insertions(+), 8 deletions(-)
create mode 100644 sysdeps/aarch64/multiarch/memcpy_a64fx.S
diff --git a/manual/tunables.texi b/manual/tunables.texi
index 1b746c0fa1..81ed5366fc 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -453,7 +453,8 @@ This tunable is specific to powerpc, powerpc64 and powerpc64le.
The @code{glibc.cpu.name=xxx} tunable allows the user to tell @theglibc{} to
assume that the CPU is @code{xxx} where xxx may have one of these values:
@code{generic}, @code{falkor}, @code{thunderxt88}, @code{thunderx2t99},
-@code{thunderx2t99p1}, @code{ares}, @code{emag}, @code{kunpeng}.
+@code{thunderx2t99p1}, @code{ares}, @code{emag}, @code{kunpeng},
+@code{a64fx}.
This tunable is specific to aarch64.
@end deftp
diff --git a/sysdeps/aarch64/multiarch/Makefile b/sysdeps/aarch64/multiarch/Makefile
index dc3efffb36..04c3f17121 100644
--- a/sysdeps/aarch64/multiarch/Makefile
+++ b/sysdeps/aarch64/multiarch/Makefile
@@ -1,6 +1,6 @@
ifeq ($(subdir),string)
sysdep_routines += memcpy_generic memcpy_advsimd memcpy_thunderx memcpy_thunderx2 \
- memcpy_falkor \
+ memcpy_falkor memcpy_a64fx \
memset_generic memset_falkor memset_emag memset_kunpeng \
memchr_generic memchr_nosimd \
strlen_mte strlen_asimd
diff --git a/sysdeps/aarch64/multiarch/ifunc-impl-list.c b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
index 99a8c68aac..cb78da9692 100644
--- a/sysdeps/aarch64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
@@ -25,7 +25,11 @@
#include <stdio.h>
/* Maximum number of IFUNC implementations. */
-#define MAX_IFUNC 4
+#if HAVE_SVE_ASM_SUPPORT
+# define MAX_IFUNC 7
+#else
+# define MAX_IFUNC 6
+#endif
size_t
__libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
@@ -43,12 +47,18 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
IFUNC_IMPL_ADD (array, i, memcpy, !bti, __memcpy_thunderx2)
IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_falkor)
IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_simd)
+#if HAVE_SVE_ASM_SUPPORT
+ IFUNC_IMPL_ADD (array, i, memcpy, sve, __memcpy_a64fx)
+#endif
IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_generic))
IFUNC_IMPL (i, name, memmove,
IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_thunderx)
IFUNC_IMPL_ADD (array, i, memmove, !bti, __memmove_thunderx2)
IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_falkor)
IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_simd)
+#if HAVE_SVE_ASM_SUPPORT
+ IFUNC_IMPL_ADD (array, i, memmove, sve, __memmove_a64fx)
+#endif
IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_generic))
IFUNC_IMPL (i, name, memset,
/* Enable this on non-falkor processors too so that other cores
diff --git a/sysdeps/aarch64/multiarch/init-arch.h b/sysdeps/aarch64/multiarch/init-arch.h
index a167699e74..d20e7e1b8e 100644
--- a/sysdeps/aarch64/multiarch/init-arch.h
+++ b/sysdeps/aarch64/multiarch/init-arch.h
@@ -33,4 +33,6 @@
bool __attribute__((unused)) bti = \
HAVE_AARCH64_BTI && GLRO(dl_aarch64_cpu_features).bti; \
bool __attribute__((unused)) mte = \
- MTE_ENABLED ();
+ MTE_ENABLED (); \
+ unsigned __attribute__((unused)) sve = \
+ GLRO(dl_aarch64_cpu_features).sve;
diff --git a/sysdeps/aarch64/multiarch/memcpy.c b/sysdeps/aarch64/multiarch/memcpy.c
index 0e0a5cbcfb..0006f38eb0 100644
--- a/sysdeps/aarch64/multiarch/memcpy.c
+++ b/sysdeps/aarch64/multiarch/memcpy.c
@@ -33,6 +33,9 @@ extern __typeof (__redirect_memcpy) __memcpy_simd attribute_hidden;
extern __typeof (__redirect_memcpy) __memcpy_thunderx attribute_hidden;
extern __typeof (__redirect_memcpy) __memcpy_thunderx2 attribute_hidden;
extern __typeof (__redirect_memcpy) __memcpy_falkor attribute_hidden;
+#if HAVE_SVE_ASM_SUPPORT
+extern __typeof (__redirect_memcpy) __memcpy_a64fx attribute_hidden;
+#endif
libc_ifunc (__libc_memcpy,
(IS_THUNDERX (midr)
@@ -44,8 +47,13 @@ libc_ifunc (__libc_memcpy,
: (IS_NEOVERSE_N1 (midr) || IS_NEOVERSE_N2 (midr)
|| IS_NEOVERSE_V1 (midr)
? __memcpy_simd
- : __memcpy_generic)))));
-
+#if HAVE_SVE_ASM_SUPPORT
+ : (IS_A64FX (midr)
+ ? __memcpy_a64fx
+ : __memcpy_generic))))));
+#else
+ : __memcpy_generic)))));
+#endif
# undef memcpy
strong_alias (__libc_memcpy, memcpy);
#endif
diff --git a/sysdeps/aarch64/multiarch/memcpy_a64fx.S b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
new file mode 100644
index 0000000000..23438e4e3d
--- /dev/null
+++ b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
@@ -0,0 +1,979 @@
+/* Optimized memcpy for Fujitsu A64FX processor.
+ Copyright (C) 2012-2021 Free Software Foundation, Inc.
+
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library. If not, see
+ <https://www.gnu.org/licenses/>. */
+
+#include <sysdep.h>
+
+#if HAVE_SVE_ASM_SUPPORT
+#if IS_IN (libc)
+# define MEMCPY __memcpy_a64fx
+# define MEMMOVE __memmove_a64fx
+
+/* Assumptions:
+ *
+ * ARMv8.2-a, AArch64, unaligned accesses, sve
+ *
+ */
+
+#define L1_SIZE (64*1024)/2 // L1 64KB
+#define L2_SIZE (7*1024*1024)/2 // L2 8MB - 1MB
+#define CACHE_LINE_SIZE 256
+#define PF_DIST_L1 (CACHE_LINE_SIZE * 16)
+#define PF_DIST_L2 (CACHE_LINE_SIZE * 64)
+#define dest x0
+#define src x1
+#define n x2 // size
+#define tmp1 x3
+#define tmp2 x4
+#define rest x5
+#define dest_ptr x6
+#define src_ptr x7
+#define vector_length x8
+#define vl_remainder x9 // vector_length remainder
+#define cl_remainder x10 // CACHE_LINE_SIZE remainder
+
+ .arch armv8.2-a+sve
+
+ENTRY_ALIGN (MEMCPY, 6)
+
+ PTR_ARG (0)
+ SIZE_ARG (2)
+
+L(fwd_start):
+ cmp n, 0
+ ccmp dest, src, 4, ne
+ b.ne L(init)
+ ret
+
+L(init):
+ mov rest, n
+ mov dest_ptr, dest
+ mov src_ptr, src
+ cntb vector_length
+ ptrue p0.b
+
+L(L2):
+ // get block_size
+ mrs tmp1, dczid_el0
+ cmp tmp1, 6 // CACHE_LINE_SIZE 256
+ b.ne L(vl_agnostic)
+
+ // if rest >= L2_SIZE
+ cmp rest, L2_SIZE
+ b.cc L(L1_prefetch)
+ // align dest address at vector_length byte boundary
+ sub tmp1, vector_length, 1
+ and tmp2, dest_ptr, tmp1
+ // if vl_remainder == 0
+ cmp tmp2, 0
+ b.eq 1f
+ sub vl_remainder, vector_length, tmp2
+ // process remainder until the first vector_length boundary
+ whilelt p0.b, xzr, vl_remainder
+ ld1b z0.b, p0/z, [src_ptr]
+ st1b z0.b, p0, [dest_ptr]
+ add dest_ptr, dest_ptr, vl_remainder
+ add src_ptr, src_ptr, vl_remainder
+ sub rest, rest, vl_remainder
+ // align dest address at CACHE_LINE_SIZE byte boundary
+1: mov tmp1, CACHE_LINE_SIZE
+ and tmp2, dest_ptr, CACHE_LINE_SIZE - 1
+ // if cl_remainder == 0
+ cmp tmp2, 0
+ b.eq L(L2_dc_zva)
+ sub cl_remainder, tmp1, tmp2
+ // process remainder until the first CACHE_LINE_SIZE boundary
+ mov tmp1, xzr // index
+2: whilelt p0.b, tmp1, cl_remainder
+ ld1b z0.b, p0/z, [src_ptr, tmp1]
+ st1b z0.b, p0, [dest_ptr, tmp1]
+ incb tmp1
+ cmp tmp1, cl_remainder
+ b.lo 2b
+ add dest_ptr, dest_ptr, cl_remainder
+ add src_ptr, src_ptr, cl_remainder
+ sub rest, rest, cl_remainder
+
+L(L2_dc_zva): // unroll zero fill
+ and tmp1, dest, 0xffffffffffffff
+ and tmp2, src, 0xffffffffffffff
+ sub tmp1, tmp2, tmp1 // diff
+ mov tmp2, CACHE_LINE_SIZE * 20
+ cmp tmp1, tmp2
+ b.lo L(L1_prefetch)
+ mov tmp1, dest_ptr
+ dc zva, tmp1 // 1
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ dc zva, tmp1 // 2
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ dc zva, tmp1 // 3
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ dc zva, tmp1 // 4
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ dc zva, tmp1 // 5
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ dc zva, tmp1 // 6
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ dc zva, tmp1 // 7
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ dc zva, tmp1 // 8
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ dc zva, tmp1 // 9
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ dc zva, tmp1 // 10
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ dc zva, tmp1 // 11
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ dc zva, tmp1 // 12
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ dc zva, tmp1 // 13
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ dc zva, tmp1 // 14
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ dc zva, tmp1 // 15
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ dc zva, tmp1 // 16
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ dc zva, tmp1 // 17
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ dc zva, tmp1 // 18
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ dc zva, tmp1 // 19
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ dc zva, tmp1 // 20
+
+L(L2_vl_64): // VL64 unroll8
+ cmp vector_length, 64
+ b.ne L(L2_vl_32)
+ ptrue p0.b
+ .p2align 3
+ ld1b z0.b, p0/z, [src_ptr, #0, mul vl]
+ ld1b z1.b, p0/z, [src_ptr, #1, mul vl]
+ ld1b z2.b, p0/z, [src_ptr, #2, mul vl]
+ ld1b z3.b, p0/z, [src_ptr, #3, mul vl]
+ ld1b z4.b, p0/z, [src_ptr, #4, mul vl]
+ ld1b z5.b, p0/z, [src_ptr, #5, mul vl]
+ ld1b z6.b, p0/z, [src_ptr, #6, mul vl]
+ ld1b z7.b, p0/z, [src_ptr, #7, mul vl]
+ add src_ptr, src_ptr, CACHE_LINE_SIZE * 2
+ sub rest, rest, CACHE_LINE_SIZE * 2
+1: st1b z0.b, p0, [dest_ptr, #0, mul vl]
+ st1b z1.b, p0, [dest_ptr, #1, mul vl]
+ ld1b z0.b, p0/z, [src_ptr, #0, mul vl]
+ ld1b z1.b, p0/z, [src_ptr, #1, mul vl]
+ st1b z2.b, p0, [dest_ptr, #2, mul vl]
+ st1b z3.b, p0, [dest_ptr, #3, mul vl]
+ ld1b z2.b, p0/z, [src_ptr, #2, mul vl]
+ ld1b z3.b, p0/z, [src_ptr, #3, mul vl]
+ mov tmp1, PF_DIST_L1
+ prfm pstl1keep, [dest_ptr, tmp1]
+ mov tmp1, PF_DIST_L2
+ prfm pstl2keep, [dest_ptr, tmp1]
+ mov tmp2, CACHE_LINE_SIZE * 19
+ add tmp2, dest_ptr, tmp2
+ dc zva, tmp2 // distance CACHE_LINE_SIZE * 19
+ st1b z4.b, p0, [dest_ptr, #4, mul vl]
+ st1b z5.b, p0, [dest_ptr, #5, mul vl]
+ ld1b z4.b, p0/z, [src_ptr, #4, mul vl]
+ ld1b z5.b, p0/z, [src_ptr, #5, mul vl]
+ st1b z6.b, p0, [dest_ptr, #6, mul vl]
+ st1b z7.b, p0, [dest_ptr, #7, mul vl]
+ ld1b z6.b, p0/z, [src_ptr, #6, mul vl]
+ ld1b z7.b, p0/z, [src_ptr, #7, mul vl]
+ mov tmp1, PF_DIST_L1 + CACHE_LINE_SIZE
+ prfm pstl1keep, [dest_ptr, tmp1]
+ mov tmp1, PF_DIST_L2 + CACHE_LINE_SIZE
+ prfm pstl2keep, [dest_ptr, tmp1]
+ add tmp2, tmp2, CACHE_LINE_SIZE
+ dc zva, tmp2 // distance CACHE_LINE_SIZE * 20
+ add dest_ptr, dest_ptr, CACHE_LINE_SIZE * 2
+ add src_ptr, src_ptr, CACHE_LINE_SIZE * 2
+ sub rest, rest, CACHE_LINE_SIZE * 2
+ cmp rest, L2_SIZE
+ b.ge 1b
+ st1b z0.b, p0, [dest_ptr, #0, mul vl]
+ st1b z1.b, p0, [dest_ptr, #1, mul vl]
+ st1b z2.b, p0, [dest_ptr, #2, mul vl]
+ st1b z3.b, p0, [dest_ptr, #3, mul vl]
+ st1b z4.b, p0, [dest_ptr, #4, mul vl]
+ st1b z5.b, p0, [dest_ptr, #5, mul vl]
+ st1b z6.b, p0, [dest_ptr, #6, mul vl]
+ st1b z7.b, p0, [dest_ptr, #7, mul vl]
+ add dest_ptr, dest_ptr, CACHE_LINE_SIZE * 2
+
+L(L2_vl_32): // VL32 unroll6
+ cmp vector_length, 32
+ b.ne L(L2_vl_16)
+ ptrue p0.b
+ .p2align 3
+ ld1b z0.b, p0/z, [src_ptr, #0, mul vl]
+ ld1b z1.b, p0/z, [src_ptr, #1, mul vl]
+ ld1b z2.b, p0/z, [src_ptr, #2, mul vl]
+ ld1b z3.b, p0/z, [src_ptr, #3, mul vl]
+ ld1b z4.b, p0/z, [src_ptr, #4, mul vl]
+ ld1b z5.b, p0/z, [src_ptr, #5, mul vl]
+ ld1b z6.b, p0/z, [src_ptr, #6, mul vl]
+ ld1b z7.b, p0/z, [src_ptr, #7, mul vl]
+ add src_ptr, src_ptr, CACHE_LINE_SIZE
+ sub rest, rest, CACHE_LINE_SIZE
+1: st1b z0.b, p0, [dest_ptr, #0, mul vl]
+ st1b z1.b, p0, [dest_ptr, #1, mul vl]
+ ld1b z0.b, p0/z, [src_ptr, #0, mul vl]
+ ld1b z1.b, p0/z, [src_ptr, #1, mul vl]
+ st1b z2.b, p0, [dest_ptr, #2, mul vl]
+ st1b z3.b, p0, [dest_ptr, #3, mul vl]
+ ld1b z2.b, p0/z, [src_ptr, #2, mul vl]
+ ld1b z3.b, p0/z, [src_ptr, #3, mul vl]
+ st1b z4.b, p0, [dest_ptr, #4, mul vl]
+ st1b z5.b, p0, [dest_ptr, #5, mul vl]
+ ld1b z4.b, p0/z, [src_ptr, #4, mul vl]
+ ld1b z5.b, p0/z, [src_ptr, #5, mul vl]
+ st1b z6.b, p0, [dest_ptr, #6, mul vl]
+ st1b z7.b, p0, [dest_ptr, #7, mul vl]
+ ld1b z6.b, p0/z, [src_ptr, #6, mul vl]
+ ld1b z7.b, p0/z, [src_ptr, #7, mul vl]
+ mov tmp1, PF_DIST_L1
+ prfm pstl1keep, [dest_ptr, tmp1]
+ mov tmp1, PF_DIST_L2
+ prfm pstl2keep, [dest_ptr, tmp1]
+ mov tmp2, CACHE_LINE_SIZE * 19
+ add tmp2, dest_ptr, tmp2
+ dc zva, tmp2 // distance CACHE_LINE_SIZE * 19
+ add dest_ptr, dest_ptr, CACHE_LINE_SIZE
+ add src_ptr, src_ptr, CACHE_LINE_SIZE
+ st1b z0.b, p0, [dest_ptr, #0, mul vl]
+ st1b z1.b, p0, [dest_ptr, #1, mul vl]
+ ld1b z0.b, p0/z, [src_ptr, #0, mul vl]
+ ld1b z1.b, p0/z, [src_ptr, #1, mul vl]
+ st1b z2.b, p0, [dest_ptr, #2, mul vl]
+ st1b z3.b, p0, [dest_ptr, #3, mul vl]
+ ld1b z2.b, p0/z, [src_ptr, #2, mul vl]
+ ld1b z3.b, p0/z, [src_ptr, #3, mul vl]
+ st1b z4.b, p0, [dest_ptr, #4, mul vl]
+ st1b z5.b, p0, [dest_ptr, #5, mul vl]
+ ld1b z4.b, p0/z, [src_ptr, #4, mul vl]
+ ld1b z5.b, p0/z, [src_ptr, #5, mul vl]
+ st1b z6.b, p0, [dest_ptr, #6, mul vl]
+ st1b z7.b, p0, [dest_ptr, #7, mul vl]
+ ld1b z6.b, p0/z, [src_ptr, #6, mul vl]
+ ld1b z7.b, p0/z, [src_ptr, #7, mul vl]
+ mov tmp1, PF_DIST_L1 + CACHE_LINE_SIZE
+ prfm pstl1keep, [dest_ptr, tmp1]
+ mov tmp1, PF_DIST_L2 + CACHE_LINE_SIZE
+ prfm pstl2keep, [dest_ptr, tmp1]
+ add tmp2, tmp2, CACHE_LINE_SIZE
+ dc zva, tmp2 // distance CACHE_LINE_SIZE * 20
+ add dest_ptr, dest_ptr, CACHE_LINE_SIZE
+ add src_ptr, src_ptr, CACHE_LINE_SIZE
+ sub rest, rest, CACHE_LINE_SIZE * 2
+ cmp rest, L2_SIZE
+ b.ge 1b
+ st1b z0.b, p0, [dest_ptr, #0, mul vl]
+ st1b z1.b, p0, [dest_ptr, #1, mul vl]
+ st1b z2.b, p0, [dest_ptr, #2, mul vl]
+ st1b z3.b, p0, [dest_ptr, #3, mul vl]
+ st1b z4.b, p0, [dest_ptr, #4, mul vl]
+ st1b z5.b, p0, [dest_ptr, #5, mul vl]
+ st1b z6.b, p0, [dest_ptr, #6, mul vl]
+ st1b z7.b, p0, [dest_ptr, #7, mul vl]
+ add dest_ptr, dest_ptr, CACHE_LINE_SIZE
+
+L(L2_vl_16): // VL16 unroll32
+ cmp vector_length, 16
+ b.ne L(L1_prefetch)
+ ptrue p0.b
+ .p2align 3
+ add src_ptr, src_ptr, CACHE_LINE_SIZE / 2
+ ld1b z16.b, p0/z, [src_ptr, #-8, mul vl]
+ ld1b z17.b, p0/z, [src_ptr, #-7, mul vl]
+ ld1b z18.b, p0/z, [src_ptr, #-6, mul vl]
+ ld1b z19.b, p0/z, [src_ptr, #-5, mul vl]
+ ld1b z20.b, p0/z, [src_ptr, #-4, mul vl]
+ ld1b z21.b, p0/z, [src_ptr, #-3, mul vl]
+ ld1b z22.b, p0/z, [src_ptr, #-2, mul vl]
+ ld1b z23.b, p0/z, [src_ptr, #-1, mul vl]
+ ld1b z0.b, p0/z, [src_ptr, #0, mul vl]
+ ld1b z1.b, p0/z, [src_ptr, #1, mul vl]
+ ld1b z2.b, p0/z, [src_ptr, #2, mul vl]
+ ld1b z3.b, p0/z, [src_ptr, #3, mul vl]
+ ld1b z4.b, p0/z, [src_ptr, #4, mul vl]
+ ld1b z5.b, p0/z, [src_ptr, #5, mul vl]
+ ld1b z6.b, p0/z, [src_ptr, #6, mul vl]
+ ld1b z7.b, p0/z, [src_ptr, #7, mul vl]
+ add src_ptr, src_ptr, CACHE_LINE_SIZE / 2
+ sub rest, rest, CACHE_LINE_SIZE
+1: add dest_ptr, dest_ptr, CACHE_LINE_SIZE / 2
+ add src_ptr, src_ptr, CACHE_LINE_SIZE / 2
+ st1b z16.b, p0, [dest_ptr, #-8, mul vl]
+ st1b z17.b, p0, [dest_ptr, #-7, mul vl]
+ ld1b z16.b, p0/z, [src_ptr, #-8, mul vl]
+ ld1b z17.b, p0/z, [src_ptr, #-7, mul vl]
+ st1b z18.b, p0, [dest_ptr, #-6, mul vl]
+ st1b z19.b, p0, [dest_ptr, #-5, mul vl]
+ ld1b z18.b, p0/z, [src_ptr, #-6, mul vl]
+ ld1b z19.b, p0/z, [src_ptr, #-5, mul vl]
+ st1b z20.b, p0, [dest_ptr, #-4, mul vl]
+ st1b z21.b, p0, [dest_ptr, #-3, mul vl]
+ ld1b z20.b, p0/z, [src_ptr, #-4, mul vl]
+ ld1b z21.b, p0/z, [src_ptr, #-3, mul vl]
+ st1b z22.b, p0, [dest_ptr, #-2, mul vl]
+ st1b z23.b, p0, [dest_ptr, #-1, mul vl]
+ ld1b z22.b, p0/z, [src_ptr, #-2, mul vl]
+ ld1b z23.b, p0/z, [src_ptr, #-1, mul vl]
+ st1b z0.b, p0, [dest_ptr, #0, mul vl]
+ st1b z1.b, p0, [dest_ptr, #1, mul vl]
+ ld1b z0.b, p0/z, [src_ptr, #0, mul vl]
+ ld1b z1.b, p0/z, [src_ptr, #1, mul vl]
+ st1b z2.b, p0, [dest_ptr, #2, mul vl]
+ st1b z3.b, p0, [dest_ptr, #3, mul vl]
+ ld1b z2.b, p0/z, [src_ptr, #2, mul vl]
+ ld1b z3.b, p0/z, [src_ptr, #3, mul vl]
+ st1b z4.b, p0, [dest_ptr, #4, mul vl]
+ st1b z5.b, p0, [dest_ptr, #5, mul vl]
+ ld1b z4.b, p0/z, [src_ptr, #4, mul vl]
+ ld1b z5.b, p0/z, [src_ptr, #5, mul vl]
+ st1b z6.b, p0, [dest_ptr, #6, mul vl]
+ st1b z7.b, p0, [dest_ptr, #7, mul vl]
+ ld1b z6.b, p0/z, [src_ptr, #6, mul vl]
+ ld1b z7.b, p0/z, [src_ptr, #7, mul vl]
+ mov tmp1, PF_DIST_L1
+ prfm pstl1keep, [dest_ptr, tmp1]
+ mov tmp1, PF_DIST_L2
+ prfm pstl2keep, [dest_ptr, tmp1]
+ mov tmp2, CACHE_LINE_SIZE * 19
+ add tmp2, dest_ptr, tmp2
+ dc zva, tmp2 // distance CACHE_LINE_SIZE * 19
+ add dest_ptr, dest_ptr, CACHE_LINE_SIZE
+ add src_ptr, src_ptr, CACHE_LINE_SIZE
+ st1b z16.b, p0, [dest_ptr, #-8, mul vl]
+ st1b z17.b, p0, [dest_ptr, #-7, mul vl]
+ ld1b z16.b, p0/z, [src_ptr, #-8, mul vl]
+ ld1b z17.b, p0/z, [src_ptr, #-7, mul vl]
+ st1b z18.b, p0, [dest_ptr, #-6, mul vl]
+ st1b z19.b, p0, [dest_ptr, #-5, mul vl]
+ ld1b z18.b, p0/z, [src_ptr, #-6, mul vl]
+ ld1b z19.b, p0/z, [src_ptr, #-5, mul vl]
+ st1b z20.b, p0, [dest_ptr, #-4, mul vl]
+ st1b z21.b, p0, [dest_ptr, #-3, mul vl]
+ ld1b z20.b, p0/z, [src_ptr, #-4, mul vl]
+ ld1b z21.b, p0/z, [src_ptr, #-3, mul vl]
+ st1b z22.b, p0, [dest_ptr, #-2, mul vl]
+ st1b z23.b, p0, [dest_ptr, #-1, mul vl]
+ ld1b z22.b, p0/z, [src_ptr, #-2, mul vl]
+ ld1b z23.b, p0/z, [src_ptr, #-1, mul vl]
+ st1b z0.b, p0, [dest_ptr, #0, mul vl]
+ st1b z1.b, p0, [dest_ptr, #1, mul vl]
+ ld1b z0.b, p0/z, [src_ptr, #0, mul vl]
+ ld1b z1.b, p0/z, [src_ptr, #1, mul vl]
+ st1b z2.b, p0, [dest_ptr, #2, mul vl]
+ st1b z3.b, p0, [dest_ptr, #3, mul vl]
+ ld1b z2.b, p0/z, [src_ptr, #2, mul vl]
+ ld1b z3.b, p0/z, [src_ptr, #3, mul vl]
+ st1b z4.b, p0, [dest_ptr, #4, mul vl]
+ st1b z5.b, p0, [dest_ptr, #5, mul vl]
+ ld1b z4.b, p0/z, [src_ptr, #4, mul vl]
+ ld1b z5.b, p0/z, [src_ptr, #5, mul vl]
+ st1b z6.b, p0, [dest_ptr, #6, mul vl]
+ st1b z7.b, p0, [dest_ptr, #7, mul vl]
+ ld1b z6.b, p0/z, [src_ptr, #6, mul vl]
+ ld1b z7.b, p0/z, [src_ptr, #7, mul vl]
+ mov tmp1, PF_DIST_L1 + CACHE_LINE_SIZE
+ prfm pstl1keep, [dest_ptr, tmp1]
+ mov tmp1, PF_DIST_L2 + CACHE_LINE_SIZE
+ prfm pstl2keep, [dest_ptr, tmp1]
+ add tmp2, tmp2, CACHE_LINE_SIZE
+ dc zva, tmp2 // distance CACHE_LINE_SIZE * 20
+ add dest_ptr, dest_ptr, CACHE_LINE_SIZE / 2
+ add src_ptr, src_ptr, CACHE_LINE_SIZE / 2
+ sub rest, rest, CACHE_LINE_SIZE * 2
+ cmp rest, L2_SIZE
+ b.ge 1b
+ add dest_ptr, dest_ptr, CACHE_LINE_SIZE / 2
+ st1b z16.b, p0, [dest_ptr, #-8, mul vl]
+ st1b z17.b, p0, [dest_ptr, #-7, mul vl]
+ st1b z18.b, p0, [dest_ptr, #-6, mul vl]
+ st1b z19.b, p0, [dest_ptr, #-5, mul vl]
+ st1b z20.b, p0, [dest_ptr, #-4, mul vl]
+ st1b z21.b, p0, [dest_ptr, #-3, mul vl]
+ st1b z22.b, p0, [dest_ptr, #-2, mul vl]
+ st1b z23.b, p0, [dest_ptr, #-1, mul vl]
+ st1b z0.b, p0, [dest_ptr, #0, mul vl]
+ st1b z1.b, p0, [dest_ptr, #1, mul vl]
+ st1b z2.b, p0, [dest_ptr, #2, mul vl]
+ st1b z3.b, p0, [dest_ptr, #3, mul vl]
+ st1b z4.b, p0, [dest_ptr, #4, mul vl]
+ st1b z5.b, p0, [dest_ptr, #5, mul vl]
+ st1b z6.b, p0, [dest_ptr, #6, mul vl]
+ st1b z7.b, p0, [dest_ptr, #7, mul vl]
+ add dest_ptr, dest_ptr, CACHE_LINE_SIZE / 2
+
+L(L1_prefetch): // if rest >= L1_SIZE
+ cmp rest, L1_SIZE
+ b.cc L(vl_agnostic)
+L(L1_vl_64):
+ cmp vector_length, 64
+ b.ne L(L1_vl_32)
+ ptrue p0.b
+ .p2align 3
+ ld1b z0.b, p0/z, [src_ptr, #0, mul vl]
+ ld1b z1.b, p0/z, [src_ptr, #1, mul vl]
+ ld1b z2.b, p0/z, [src_ptr, #2, mul vl]
+ ld1b z3.b, p0/z, [src_ptr, #3, mul vl]
+ ld1b z4.b, p0/z, [src_ptr, #4, mul vl]
+ ld1b z5.b, p0/z, [src_ptr, #5, mul vl]
+ ld1b z6.b, p0/z, [src_ptr, #6, mul vl]
+ ld1b z7.b, p0/z, [src_ptr, #7, mul vl]
+ add src_ptr, src_ptr, CACHE_LINE_SIZE * 2
+ sub rest, rest, CACHE_LINE_SIZE * 2
+1: st1b z0.b, p0, [dest_ptr, #0, mul vl]
+ st1b z1.b, p0, [dest_ptr, #1, mul vl]
+ ld1b z0.b, p0/z, [src_ptr, #0, mul vl]
+ ld1b z1.b, p0/z, [src_ptr, #1, mul vl]
+ st1b z2.b, p0, [dest_ptr, #2, mul vl]
+ st1b z3.b, p0, [dest_ptr, #3, mul vl]
+ ld1b z2.b, p0/z, [src_ptr, #2, mul vl]
+ ld1b z3.b, p0/z, [src_ptr, #3, mul vl]
+ mov tmp1, PF_DIST_L1
+ prfm pstl1keep, [dest_ptr, tmp1]
+ mov tmp1, PF_DIST_L2
+ prfm pstl2keep, [dest_ptr, tmp1]
+ st1b z4.b, p0, [dest_ptr, #4, mul vl]
+ st1b z5.b, p0, [dest_ptr, #5, mul vl]
+ ld1b z4.b, p0/z, [src_ptr, #4, mul vl]
+ ld1b z5.b, p0/z, [src_ptr, #5, mul vl]
+ st1b z6.b, p0, [dest_ptr, #6, mul vl]
+ st1b z7.b, p0, [dest_ptr, #7, mul vl]
+ ld1b z6.b, p0/z, [src_ptr, #6, mul vl]
+ ld1b z7.b, p0/z, [src_ptr, #7, mul vl]
+ mov tmp1, PF_DIST_L1 + CACHE_LINE_SIZE
+ prfm pstl1keep, [dest_ptr, tmp1]
+ mov tmp1, PF_DIST_L2 + CACHE_LINE_SIZE
+ prfm pstl2keep, [dest_ptr, tmp1]
+ add dest_ptr, dest_ptr, CACHE_LINE_SIZE * 2
+ add src_ptr, src_ptr, CACHE_LINE_SIZE * 2
+ sub rest, rest, CACHE_LINE_SIZE * 2
+ cmp rest, L1_SIZE
+ b.ge 1b
+ st1b z0.b, p0, [dest_ptr, #0, mul vl]
+ st1b z1.b, p0, [dest_ptr, #1, mul vl]
+ st1b z2.b, p0, [dest_ptr, #2, mul vl]
+ st1b z3.b, p0, [dest_ptr, #3, mul vl]
+ st1b z4.b, p0, [dest_ptr, #4, mul vl]
+ st1b z5.b, p0, [dest_ptr, #5, mul vl]
+ st1b z6.b, p0, [dest_ptr, #6, mul vl]
+ st1b z7.b, p0, [dest_ptr, #7, mul vl]
+ add dest_ptr, dest_ptr, CACHE_LINE_SIZE * 2
+
+L(L1_vl_32):
+ cmp vector_length, 32
+ b.ne L(L1_vl_16)
+ ptrue p0.b
+ .p2align 3
+ ld1b z0.b, p0/z, [src_ptr, #0, mul vl]
+ ld1b z1.b, p0/z, [src_ptr, #1, mul vl]
+ ld1b z2.b, p0/z, [src_ptr, #2, mul vl]
+ ld1b z3.b, p0/z, [src_ptr, #3, mul vl]
+ ld1b z4.b, p0/z, [src_ptr, #4, mul vl]
+ ld1b z5.b, p0/z, [src_ptr, #5, mul vl]
+ ld1b z6.b, p0/z, [src_ptr, #6, mul vl]
+ ld1b z7.b, p0/z, [src_ptr, #7, mul vl]
+ add src_ptr, src_ptr, CACHE_LINE_SIZE
+ sub rest, rest, CACHE_LINE_SIZE
+1: st1b z0.b, p0, [dest_ptr, #0, mul vl]
+ st1b z1.b, p0, [dest_ptr, #1, mul vl]
+ ld1b z0.b, p0/z, [src_ptr, #0, mul vl]
+ ld1b z1.b, p0/z, [src_ptr, #1, mul vl]
+ st1b z2.b, p0, [dest_ptr, #2, mul vl]
+ st1b z3.b, p0, [dest_ptr, #3, mul vl]
+ ld1b z2.b, p0/z, [src_ptr, #2, mul vl]
+ ld1b z3.b, p0/z, [src_ptr, #3, mul vl]
+ st1b z4.b, p0, [dest_ptr, #4, mul vl]
+ st1b z5.b, p0, [dest_ptr, #5, mul vl]
+ ld1b z4.b, p0/z, [src_ptr, #4, mul vl]
+ ld1b z5.b, p0/z, [src_ptr, #5, mul vl]
+ st1b z6.b, p0, [dest_ptr, #6, mul vl]
+ st1b z7.b, p0, [dest_ptr, #7, mul vl]
+ ld1b z6.b, p0/z, [src_ptr, #6, mul vl]
+ ld1b z7.b, p0/z, [src_ptr, #7, mul vl]
+ mov tmp1, PF_DIST_L1
+ prfm pstl1keep, [dest_ptr, tmp1]
+ mov tmp1, PF_DIST_L2
+ prfm pstl2keep, [dest_ptr, tmp1]
+ add dest_ptr, dest_ptr, CACHE_LINE_SIZE
+ add src_ptr, src_ptr, CACHE_LINE_SIZE
+ st1b z0.b, p0, [dest_ptr, #0, mul vl]
+ st1b z1.b, p0, [dest_ptr, #1, mul vl]
+ ld1b z0.b, p0/z, [src_ptr, #0, mul vl]
+ ld1b z1.b, p0/z, [src_ptr, #1, mul vl]
+ st1b z2.b, p0, [dest_ptr, #2, mul vl]
+ st1b z3.b, p0, [dest_ptr, #3, mul vl]
+ ld1b z2.b, p0/z, [src_ptr, #2, mul vl]
+ ld1b z3.b, p0/z, [src_ptr, #3, mul vl]
+ st1b z4.b, p0, [dest_ptr, #4, mul vl]
+ st1b z5.b, p0, [dest_ptr, #5, mul vl]
+ ld1b z4.b, p0/z, [src_ptr, #4, mul vl]
+ ld1b z5.b, p0/z, [src_ptr, #5, mul vl]
+ st1b z6.b, p0, [dest_ptr, #6, mul vl]
+ st1b z7.b, p0, [dest_ptr, #7, mul vl]
+ ld1b z6.b, p0/z, [src_ptr, #6, mul vl]
+ ld1b z7.b, p0/z, [src_ptr, #7, mul vl]
+ mov tmp1, PF_DIST_L1 + CACHE_LINE_SIZE
+ prfm pstl1keep, [dest_ptr, tmp1]
+ mov tmp1, PF_DIST_L2 + CACHE_LINE_SIZE
+ prfm pstl2keep, [dest_ptr, tmp1]
+ add dest_ptr, dest_ptr, CACHE_LINE_SIZE
+ add src_ptr, src_ptr, CACHE_LINE_SIZE
+ sub rest, rest, CACHE_LINE_SIZE * 2
+ cmp rest, L1_SIZE
+ b.ge 1b
+ st1b z0.b, p0, [dest_ptr, #0, mul vl]
+ st1b z1.b, p0, [dest_ptr, #1, mul vl]
+ st1b z2.b, p0, [dest_ptr, #2, mul vl]
+ st1b z3.b, p0, [dest_ptr, #3, mul vl]
+ st1b z4.b, p0, [dest_ptr, #4, mul vl]
+ st1b z5.b, p0, [dest_ptr, #5, mul vl]
+ st1b z6.b, p0, [dest_ptr, #6, mul vl]
+ st1b z7.b, p0, [dest_ptr, #7, mul vl]
+ add dest_ptr, dest_ptr, CACHE_LINE_SIZE
+
+L(L1_vl_16):
+ cmp vector_length, 16
+ b.ne L(vl_agnostic)
+ ptrue p0.b
+ .p2align 3
+ add src_ptr, src_ptr, CACHE_LINE_SIZE / 2
+ ld1b z16.b, p0/z, [src_ptr, #-8, mul vl]
+ ld1b z17.b, p0/z, [src_ptr, #-7, mul vl]
+ ld1b z18.b, p0/z, [src_ptr, #-6, mul vl]
+ ld1b z19.b, p0/z, [src_ptr, #-5, mul vl]
+ ld1b z20.b, p0/z, [src_ptr, #-4, mul vl]
+ ld1b z21.b, p0/z, [src_ptr, #-3, mul vl]
+ ld1b z22.b, p0/z, [src_ptr, #-2, mul vl]
+ ld1b z23.b, p0/z, [src_ptr, #-1, mul vl]
+ ld1b z0.b, p0/z, [src_ptr, #0, mul vl]
+ ld1b z1.b, p0/z, [src_ptr, #1, mul vl]
+ ld1b z2.b, p0/z, [src_ptr, #2, mul vl]
+ ld1b z3.b, p0/z, [src_ptr, #3, mul vl]
+ ld1b z4.b, p0/z, [src_ptr, #4, mul vl]
+ ld1b z5.b, p0/z, [src_ptr, #5, mul vl]
+ ld1b z6.b, p0/z, [src_ptr, #6, mul vl]
+ ld1b z7.b, p0/z, [src_ptr, #7, mul vl]
+ add src_ptr, src_ptr, CACHE_LINE_SIZE / 2
+ sub rest, rest, CACHE_LINE_SIZE
+1: add dest_ptr, dest_ptr, CACHE_LINE_SIZE / 2
+ add src_ptr, src_ptr, CACHE_LINE_SIZE / 2
+ st1b z16.b, p0, [dest_ptr, #-8, mul vl]
+ st1b z17.b, p0, [dest_ptr, #-7, mul vl]
+ ld1b z16.b, p0/z, [src_ptr, #-8, mul vl]
+ ld1b z17.b, p0/z, [src_ptr, #-7, mul vl]
+ st1b z18.b, p0, [dest_ptr, #-6, mul vl]
+ st1b z19.b, p0, [dest_ptr, #-5, mul vl]
+ ld1b z18.b, p0/z, [src_ptr, #-6, mul vl]
+ ld1b z19.b, p0/z, [src_ptr, #-5, mul vl]
+ st1b z20.b, p0, [dest_ptr, #-4, mul vl]
+ st1b z21.b, p0, [dest_ptr, #-3, mul vl]
+ ld1b z20.b, p0/z, [src_ptr, #-4, mul vl]
+ ld1b z21.b, p0/z, [src_ptr, #-3, mul vl]
+ st1b z22.b, p0, [dest_ptr, #-2, mul vl]
+ st1b z23.b, p0, [dest_ptr, #-1, mul vl]
+ ld1b z22.b, p0/z, [src_ptr, #-2, mul vl]
+ ld1b z23.b, p0/z, [src_ptr, #-1, mul vl]
+ st1b z0.b, p0, [dest_ptr, #0, mul vl]
+ st1b z1.b, p0, [dest_ptr, #1, mul vl]
+ ld1b z0.b, p0/z, [src_ptr, #0, mul vl]
+ ld1b z1.b, p0/z, [src_ptr, #1, mul vl]
+ st1b z2.b, p0, [dest_ptr, #2, mul vl]
+ st1b z3.b, p0, [dest_ptr, #3, mul vl]
+ ld1b z2.b, p0/z, [src_ptr, #2, mul vl]
+ ld1b z3.b, p0/z, [src_ptr, #3, mul vl]
+ st1b z4.b, p0, [dest_ptr, #4, mul vl]
+ st1b z5.b, p0, [dest_ptr, #5, mul vl]
+ ld1b z4.b, p0/z, [src_ptr, #4, mul vl]
+ ld1b z5.b, p0/z, [src_ptr, #5, mul vl]
+ st1b z6.b, p0, [dest_ptr, #6, mul vl]
+ st1b z7.b, p0, [dest_ptr, #7, mul vl]
+ ld1b z6.b, p0/z, [src_ptr, #6, mul vl]
+ ld1b z7.b, p0/z, [src_ptr, #7, mul vl]
+ mov tmp1, PF_DIST_L1
+ prfm pstl1keep, [dest_ptr, tmp1]
+ mov tmp1, PF_DIST_L2
+ prfm pstl2keep, [dest_ptr, tmp1]
+ add dest_ptr, dest_ptr, CACHE_LINE_SIZE
+ add src_ptr, src_ptr, CACHE_LINE_SIZE
+ st1b z16.b, p0, [dest_ptr, #-8, mul vl]
+ st1b z17.b, p0, [dest_ptr, #-7, mul vl]
+ ld1b z16.b, p0/z, [src_ptr, #-8, mul vl]
+ ld1b z17.b, p0/z, [src_ptr, #-7, mul vl]
+ st1b z18.b, p0, [dest_ptr, #-6, mul vl]
+ st1b z19.b, p0, [dest_ptr, #-5, mul vl]
+ ld1b z18.b, p0/z, [src_ptr, #-6, mul vl]
+ ld1b z19.b, p0/z, [src_ptr, #-5, mul vl]
+ st1b z20.b, p0, [dest_ptr, #-4, mul vl]
+ st1b z21.b, p0, [dest_ptr, #-3, mul vl]
+ ld1b z20.b, p0/z, [src_ptr, #-4, mul vl]
+ ld1b z21.b, p0/z, [src_ptr, #-3, mul vl]
+ st1b z22.b, p0, [dest_ptr, #-2, mul vl]
+ st1b z23.b, p0, [dest_ptr, #-1, mul vl]
+ ld1b z22.b, p0/z, [src_ptr, #-2, mul vl]
+ ld1b z23.b, p0/z, [src_ptr, #-1, mul vl]
+ st1b z0.b, p0, [dest_ptr, #0, mul vl]
+ st1b z1.b, p0, [dest_ptr, #1, mul vl]
+ ld1b z0.b, p0/z, [src_ptr, #0, mul vl]
+ ld1b z1.b, p0/z, [src_ptr, #1, mul vl]
+ st1b z2.b, p0, [dest_ptr, #2, mul vl]
+ st1b z3.b, p0, [dest_ptr, #3, mul vl]
+ ld1b z2.b, p0/z, [src_ptr, #2, mul vl]
+ ld1b z3.b, p0/z, [src_ptr, #3, mul vl]
+ st1b z4.b, p0, [dest_ptr, #4, mul vl]
+ st1b z5.b, p0, [dest_ptr, #5, mul vl]
+ ld1b z4.b, p0/z, [src_ptr, #4, mul vl]
+ ld1b z5.b, p0/z, [src_ptr, #5, mul vl]
+ st1b z6.b, p0, [dest_ptr, #6, mul vl]
+ st1b z7.b, p0, [dest_ptr, #7, mul vl]
+ ld1b z6.b, p0/z, [src_ptr, #6, mul vl]
+ ld1b z7.b, p0/z, [src_ptr, #7, mul vl]
+ mov tmp1, PF_DIST_L1 + CACHE_LINE_SIZE
+ prfm pstl1keep, [dest_ptr, tmp1]
+ mov tmp1, PF_DIST_L2 + CACHE_LINE_SIZE
+ prfm pstl2keep, [dest_ptr, tmp1]
+ add dest_ptr, dest_ptr, CACHE_LINE_SIZE / 2
+ add src_ptr, src_ptr, CACHE_LINE_SIZE / 2
+ sub rest, rest, CACHE_LINE_SIZE * 2
+ cmp rest, L1_SIZE
+ b.ge 1b
+ add dest_ptr, dest_ptr, CACHE_LINE_SIZE / 2
+ st1b z16.b, p0, [dest_ptr, #-8, mul vl]
+ st1b z17.b, p0, [dest_ptr, #-7, mul vl]
+ st1b z18.b, p0, [dest_ptr, #-6, mul vl]
+ st1b z19.b, p0, [dest_ptr, #-5, mul vl]
+ st1b z20.b, p0, [dest_ptr, #-4, mul vl]
+ st1b z21.b, p0, [dest_ptr, #-3, mul vl]
+ st1b z22.b, p0, [dest_ptr, #-2, mul vl]
+ st1b z23.b, p0, [dest_ptr, #-1, mul vl]
+ st1b z0.b, p0, [dest_ptr, #0, mul vl]
+ st1b z1.b, p0, [dest_ptr, #1, mul vl]
+ st1b z2.b, p0, [dest_ptr, #2, mul vl]
+ st1b z3.b, p0, [dest_ptr, #3, mul vl]
+ st1b z4.b, p0, [dest_ptr, #4, mul vl]
+ st1b z5.b, p0, [dest_ptr, #5, mul vl]
+ st1b z6.b, p0, [dest_ptr, #6, mul vl]
+ st1b z7.b, p0, [dest_ptr, #7, mul vl]
+ add dest_ptr, dest_ptr, CACHE_LINE_SIZE / 2
+
+L(vl_agnostic): // VL Agnostic
+
+L(unroll32): // unrolling and software pipeline
+ lsl tmp1, vector_length, 3 // vector_length * 8
+ lsl tmp2, vector_length, 5 // vector_length * 32
+ ptrue p0.b
+ .p2align 3
+1: cmp rest, tmp2
+ b.cc L(unroll8)
+ ld1b z0.b, p0/z, [src_ptr, #0, mul vl]
+ ld1b z1.b, p0/z, [src_ptr, #1, mul vl]
+ st1b z0.b, p0, [dest_ptr, #0, mul vl]
+ st1b z1.b, p0, [dest_ptr, #1, mul vl]
+ ld1b z2.b, p0/z, [src_ptr, #2, mul vl]
+ ld1b z3.b, p0/z, [src_ptr, #3, mul vl]
+ st1b z2.b, p0, [dest_ptr, #2, mul vl]
+ st1b z3.b, p0, [dest_ptr, #3, mul vl]
+ ld1b z4.b, p0/z, [src_ptr, #4, mul vl]
+ ld1b z5.b, p0/z, [src_ptr, #5, mul vl]
+ st1b z4.b, p0, [dest_ptr, #4, mul vl]
+ st1b z5.b, p0, [dest_ptr, #5, mul vl]
+ ld1b z6.b, p0/z, [src_ptr, #6, mul vl]
+ ld1b z7.b, p0/z, [src_ptr, #7, mul vl]
+ st1b z6.b, p0, [dest_ptr, #6, mul vl]
+ st1b z7.b, p0, [dest_ptr, #7, mul vl]
+ add dest_ptr, dest_ptr, tmp1
+ add src_ptr, src_ptr, tmp1
+ ld1b z0.b, p0/z, [src_ptr, #0, mul vl]
+ ld1b z1.b, p0/z, [src_ptr, #1, mul vl]
+ st1b z0.b, p0, [dest_ptr, #0, mul vl]
+ st1b z1.b, p0, [dest_ptr, #1, mul vl]
+ ld1b z2.b, p0/z, [src_ptr, #2, mul vl]
+ ld1b z3.b, p0/z, [src_ptr, #3, mul vl]
+ st1b z2.b, p0, [dest_ptr, #2, mul vl]
+ st1b z3.b, p0, [dest_ptr, #3, mul vl]
+ ld1b z4.b, p0/z, [src_ptr, #4, mul vl]
+ ld1b z5.b, p0/z, [src_ptr, #5, mul vl]
+ st1b z4.b, p0, [dest_ptr, #4, mul vl]
+ st1b z5.b, p0, [dest_ptr, #5, mul vl]
+ ld1b z6.b, p0/z, [src_ptr, #6, mul vl]
+ ld1b z7.b, p0/z, [src_ptr, #7, mul vl]
+ st1b z6.b, p0, [dest_ptr, #6, mul vl]
+ st1b z7.b, p0, [dest_ptr, #7, mul vl]
+ add dest_ptr, dest_ptr, tmp1
+ add src_ptr, src_ptr, tmp1
+ ld1b z0.b, p0/z, [src_ptr, #0, mul vl]
+ ld1b z1.b, p0/z, [src_ptr, #1, mul vl]
+ st1b z0.b, p0, [dest_ptr, #0, mul vl]
+ st1b z1.b, p0, [dest_ptr, #1, mul vl]
+ ld1b z2.b, p0/z, [src_ptr, #2, mul vl]
+ ld1b z3.b, p0/z, [src_ptr, #3, mul vl]
+ st1b z2.b, p0, [dest_ptr, #2, mul vl]
+ st1b z3.b, p0, [dest_ptr, #3, mul vl]
+ ld1b z4.b, p0/z, [src_ptr, #4, mul vl]
+ ld1b z5.b, p0/z, [src_ptr, #5, mul vl]
+ st1b z4.b, p0, [dest_ptr, #4, mul vl]
+ st1b z5.b, p0, [dest_ptr, #5, mul vl]
+ ld1b z6.b, p0/z, [src_ptr, #6, mul vl]
+ ld1b z7.b, p0/z, [src_ptr, #7, mul vl]
+ st1b z6.b, p0, [dest_ptr, #6, mul vl]
+ st1b z7.b, p0, [dest_ptr, #7, mul vl]
+ add dest_ptr, dest_ptr, tmp1
+ add src_ptr, src_ptr, tmp1
+ ld1b z0.b, p0/z, [src_ptr, #0, mul vl]
+ ld1b z1.b, p0/z, [src_ptr, #1, mul vl]
+ st1b z0.b, p0, [dest_ptr, #0, mul vl]
+ st1b z1.b, p0, [dest_ptr, #1, mul vl]
+ ld1b z2.b, p0/z, [src_ptr, #2, mul vl]
+ ld1b z3.b, p0/z, [src_ptr, #3, mul vl]
+ st1b z2.b, p0, [dest_ptr, #2, mul vl]
+ st1b z3.b, p0, [dest_ptr, #3, mul vl]
+ ld1b z4.b, p0/z, [src_ptr, #4, mul vl]
+ ld1b z5.b, p0/z, [src_ptr, #5, mul vl]
+ st1b z4.b, p0, [dest_ptr, #4, mul vl]
+ st1b z5.b, p0, [dest_ptr, #5, mul vl]
+ ld1b z6.b, p0/z, [src_ptr, #6, mul vl]
+ ld1b z7.b, p0/z, [src_ptr, #7, mul vl]
+ st1b z6.b, p0, [dest_ptr, #6, mul vl]
+ st1b z7.b, p0, [dest_ptr, #7, mul vl]
+ add dest_ptr, dest_ptr, tmp1
+ add src_ptr, src_ptr, tmp1
+ sub rest, rest, tmp2
+ b 1b
+
+L(unroll8): // unrolling and software pipeline
+ lsl tmp1, vector_length, 3 // vector_length * 8
+ ptrue p0.b
+ .p2align 3
+1: cmp rest, tmp1
+ b.cc L(unroll1)
+ ld1b z0.b, p0/z, [src_ptr, #0, mul vl]
+ ld1b z1.b, p0/z, [src_ptr, #1, mul vl]
+ st1b z0.b, p0, [dest_ptr, #0, mul vl]
+ st1b z1.b, p0, [dest_ptr, #1, mul vl]
+ ld1b z2.b, p0/z, [src_ptr, #2, mul vl]
+ ld1b z3.b, p0/z, [src_ptr, #3, mul vl]
+ st1b z2.b, p0, [dest_ptr, #2, mul vl]
+ st1b z3.b, p0, [dest_ptr, #3, mul vl]
+ ld1b z4.b, p0/z, [src_ptr, #4, mul vl]
+ ld1b z5.b, p0/z, [src_ptr, #5, mul vl]
+ st1b z4.b, p0, [dest_ptr, #4, mul vl]
+ st1b z5.b, p0, [dest_ptr, #5, mul vl]
+ ld1b z6.b, p0/z, [src_ptr, #6, mul vl]
+ ld1b z7.b, p0/z, [src_ptr, #7, mul vl]
+ st1b z6.b, p0, [dest_ptr, #6, mul vl]
+ st1b z7.b, p0, [dest_ptr, #7, mul vl]
+ add dest_ptr, dest_ptr, tmp1
+ add src_ptr, src_ptr, tmp1
+ sub rest, rest, tmp1
+ b 1b
+
+L(unroll1):
+ ptrue p0.b
+ .p2align 3
+1: cmp rest, vector_length
+ b.cc L(last)
+ ld1b z0.b, p0/z, [src_ptr]
+ st1b z0.b, p0, [dest_ptr]
+ add dest_ptr, dest_ptr, vector_length
+ add src_ptr, src_ptr, vector_length
+ sub rest, rest, vector_length
+ b 1b
+
+L(last):
+ whilelt p0.b, xzr, rest
+ ld1b z0.b, p0/z, [src_ptr]
+ st1b z0.b, p0, [dest_ptr]
+ ret
+
+END (MEMCPY)
+libc_hidden_builtin_def (MEMCPY)
+
+
+ .p2align 4
+ENTRY_ALIGN (MEMMOVE, 6)
+
+ // remove the MTE tag from both addresses
+ and tmp1, dest, 0xffffffffffffff
+ and tmp2, src, 0xffffffffffffff
+ sub tmp1, tmp1, tmp2 // diff
+ // if diff <= 0 || diff >= n then memcpy
+ cmp tmp1, 0
+ ccmp tmp1, n, 2, gt
+ b.cs L(fwd_start)
+
+L(bwd_start):
+ mov rest, n
+ add dest_ptr, dest, n // dest_end
+ add src_ptr, src, n // src_end
+ cntb vector_length
+ ptrue p0.b
+ udiv tmp1, n, vector_length // quotient
+ mul tmp1, tmp1, vector_length // product
+ sub vl_remainder, n, tmp1
+ // if vl_remainder == 0 then skip the partial-vector bwd copy
+ cmp vl_remainder, 0
+ b.eq L(bwd_main)
+ // vl_remainder bwd copy
+ whilelt p0.b, xzr, vl_remainder
+ sub src_ptr, src_ptr, vl_remainder
+ sub dest_ptr, dest_ptr, vl_remainder
+ ld1b z0.b, p0/z, [src_ptr]
+ st1b z0.b, p0, [dest_ptr]
+ sub rest, rest, vl_remainder
+
+L(bwd_main):
+
+ // VL Agnostic
+L(bwd_unroll32): // unrolling and software pipeline
+ lsl tmp1, vector_length, 3 // vector_length * 8
+ lsl tmp2, vector_length, 5 // vector_length * 32
+ ptrue p0.b
+ .p2align 3
+1: cmp rest, tmp2
+ b.cc L(bwd_unroll8)
+ sub src_ptr, src_ptr, tmp1
+ sub dest_ptr, dest_ptr, tmp1
+ ld1b z0.b, p0/z, [src_ptr, #7, mul vl]
+ ld1b z1.b, p0/z, [src_ptr, #6, mul vl]
+ st1b z0.b, p0, [dest_ptr, #7, mul vl]
+ st1b z1.b, p0, [dest_ptr, #6, mul vl]
+ ld1b z2.b, p0/z, [src_ptr, #5, mul vl]
+ ld1b z3.b, p0/z, [src_ptr, #4, mul vl]
+ st1b z2.b, p0, [dest_ptr, #5, mul vl]
+ st1b z3.b, p0, [dest_ptr, #4, mul vl]
+ ld1b z4.b, p0/z, [src_ptr, #3, mul vl]
+ ld1b z5.b, p0/z, [src_ptr, #2, mul vl]
+ st1b z4.b, p0, [dest_ptr, #3, mul vl]
+ st1b z5.b, p0, [dest_ptr, #2, mul vl]
+ ld1b z6.b, p0/z, [src_ptr, #1, mul vl]
+ ld1b z7.b, p0/z, [src_ptr, #0, mul vl]
+ st1b z6.b, p0, [dest_ptr, #1, mul vl]
+ st1b z7.b, p0, [dest_ptr, #0, mul vl]
+ sub src_ptr, src_ptr, tmp1
+ sub dest_ptr, dest_ptr, tmp1
+ ld1b z0.b, p0/z, [src_ptr, #7, mul vl]
+ ld1b z1.b, p0/z, [src_ptr, #6, mul vl]
+ st1b z0.b, p0, [dest_ptr, #7, mul vl]
+ st1b z1.b, p0, [dest_ptr, #6, mul vl]
+ ld1b z2.b, p0/z, [src_ptr, #5, mul vl]
+ ld1b z3.b, p0/z, [src_ptr, #4, mul vl]
+ st1b z2.b, p0, [dest_ptr, #5, mul vl]
+ st1b z3.b, p0, [dest_ptr, #4, mul vl]
+ ld1b z4.b, p0/z, [src_ptr, #3, mul vl]
+ ld1b z5.b, p0/z, [src_ptr, #2, mul vl]
+ st1b z4.b, p0, [dest_ptr, #3, mul vl]
+ st1b z5.b, p0, [dest_ptr, #2, mul vl]
+ ld1b z6.b, p0/z, [src_ptr, #1, mul vl]
+ ld1b z7.b, p0/z, [src_ptr, #0, mul vl]
+ st1b z6.b, p0, [dest_ptr, #1, mul vl]
+ st1b z7.b, p0, [dest_ptr, #0, mul vl]
+ sub src_ptr, src_ptr, tmp1
+ sub dest_ptr, dest_ptr, tmp1
+ ld1b z0.b, p0/z, [src_ptr, #7, mul vl]
+ ld1b z1.b, p0/z, [src_ptr, #6, mul vl]
+ st1b z0.b, p0, [dest_ptr, #7, mul vl]
+ st1b z1.b, p0, [dest_ptr, #6, mul vl]
+ ld1b z2.b, p0/z, [src_ptr, #5, mul vl]
+ ld1b z3.b, p0/z, [src_ptr, #4, mul vl]
+ st1b z2.b, p0, [dest_ptr, #5, mul vl]
+ st1b z3.b, p0, [dest_ptr, #4, mul vl]
+ ld1b z4.b, p0/z, [src_ptr, #3, mul vl]
+ ld1b z5.b, p0/z, [src_ptr, #2, mul vl]
+ st1b z4.b, p0, [dest_ptr, #3, mul vl]
+ st1b z5.b, p0, [dest_ptr, #2, mul vl]
+ ld1b z6.b, p0/z, [src_ptr, #1, mul vl]
+ ld1b z7.b, p0/z, [src_ptr, #0, mul vl]
+ st1b z6.b, p0, [dest_ptr, #1, mul vl]
+ st1b z7.b, p0, [dest_ptr, #0, mul vl]
+ sub src_ptr, src_ptr, tmp1
+ sub dest_ptr, dest_ptr, tmp1
+ ld1b z0.b, p0/z, [src_ptr, #7, mul vl]
+ ld1b z1.b, p0/z, [src_ptr, #6, mul vl]
+ st1b z0.b, p0, [dest_ptr, #7, mul vl]
+ st1b z1.b, p0, [dest_ptr, #6, mul vl]
+ ld1b z2.b, p0/z, [src_ptr, #5, mul vl]
+ ld1b z3.b, p0/z, [src_ptr, #4, mul vl]
+ st1b z2.b, p0, [dest_ptr, #5, mul vl]
+ st1b z3.b, p0, [dest_ptr, #4, mul vl]
+ ld1b z4.b, p0/z, [src_ptr, #3, mul vl]
+ ld1b z5.b, p0/z, [src_ptr, #2, mul vl]
+ st1b z4.b, p0, [dest_ptr, #3, mul vl]
+ st1b z5.b, p0, [dest_ptr, #2, mul vl]
+ ld1b z6.b, p0/z, [src_ptr, #1, mul vl]
+ ld1b z7.b, p0/z, [src_ptr, #0, mul vl]
+ st1b z6.b, p0, [dest_ptr, #1, mul vl]
+ st1b z7.b, p0, [dest_ptr, #0, mul vl]
+ sub rest, rest, tmp2
+ b 1b
+
+L(bwd_unroll8): // unrolling and software pipeline
+ lsl tmp1, vector_length, 3 // vector_length * 8
+ ptrue p0.b
+ .p2align 3
+1: cmp rest, tmp1
+ b.cc L(bwd_unroll1)
+ sub src_ptr, src_ptr, tmp1
+ sub dest_ptr, dest_ptr, tmp1
+ ld1b z0.b, p0/z, [src_ptr, #7, mul vl]
+ ld1b z1.b, p0/z, [src_ptr, #6, mul vl]
+ st1b z0.b, p0, [dest_ptr, #7, mul vl]
+ st1b z1.b, p0, [dest_ptr, #6, mul vl]
+ ld1b z2.b, p0/z, [src_ptr, #5, mul vl]
+ ld1b z3.b, p0/z, [src_ptr, #4, mul vl]
+ st1b z2.b, p0, [dest_ptr, #5, mul vl]
+ st1b z3.b, p0, [dest_ptr, #4, mul vl]
+ ld1b z4.b, p0/z, [src_ptr, #3, mul vl]
+ ld1b z5.b, p0/z, [src_ptr, #2, mul vl]
+ st1b z4.b, p0, [dest_ptr, #3, mul vl]
+ st1b z5.b, p0, [dest_ptr, #2, mul vl]
+ ld1b z6.b, p0/z, [src_ptr, #1, mul vl]
+ ld1b z7.b, p0/z, [src_ptr, #0, mul vl]
+ st1b z6.b, p0, [dest_ptr, #1, mul vl]
+ st1b z7.b, p0, [dest_ptr, #0, mul vl]
+ sub rest, rest, tmp1
+ b 1b
+
+ .p2align 3
+L(bwd_unroll1):
+ ptrue p0.b
+1: cmp rest, vector_length
+ b.cc L(bwd_last)
+ sub src_ptr, src_ptr, vector_length
+ sub dest_ptr, dest_ptr, vector_length
+ ld1b z0.b, p0/z, [src_ptr]
+ st1b z0.b, p0, [dest_ptr]
+ sub rest, rest, vector_length
+ b 1b
+
+L(bwd_last):
+ whilelt p0.b, xzr, rest
+ sub src_ptr, src_ptr, rest
+ sub dest_ptr, dest_ptr, rest
+ ld1b z0.b, p0/z, [src_ptr]
+ st1b z0.b, p0, [dest_ptr]
+ ret
+
+END (MEMMOVE)
+libc_hidden_builtin_def (MEMMOVE)
+#endif /* IS_IN (libc) */
+#endif /* HAVE_SVE_ASM_SUPPORT */
+
diff --git a/sysdeps/aarch64/multiarch/memmove.c b/sysdeps/aarch64/multiarch/memmove.c
index 12d77818a9..1e5ee1c934 100644
--- a/sysdeps/aarch64/multiarch/memmove.c
+++ b/sysdeps/aarch64/multiarch/memmove.c
@@ -33,6 +33,9 @@ extern __typeof (__redirect_memmove) __memmove_simd attribute_hidden;
extern __typeof (__redirect_memmove) __memmove_thunderx attribute_hidden;
extern __typeof (__redirect_memmove) __memmove_thunderx2 attribute_hidden;
extern __typeof (__redirect_memmove) __memmove_falkor attribute_hidden;
+#if HAVE_SVE_ASM_SUPPORT
+extern __typeof (__redirect_memmove) __memmove_a64fx attribute_hidden;
+#endif
libc_ifunc (__libc_memmove,
(IS_THUNDERX (midr)
@@ -44,8 +47,13 @@ libc_ifunc (__libc_memmove,
: (IS_NEOVERSE_N1 (midr) || IS_NEOVERSE_N2 (midr)
|| IS_NEOVERSE_V1 (midr)
? __memmove_simd
- : __memmove_generic)))));
-
+#if HAVE_SVE_ASM_SUPPORT
+ : (IS_A64FX (midr)
+ ? __memmove_a64fx
+ : __memmove_generic))))));
+#else
+ : __memmove_generic)))));
+#endif
# undef memmove
strong_alias (__libc_memmove, memmove);
#endif
diff --git a/sysdeps/unix/sysv/linux/aarch64/cpu-features.c b/sysdeps/unix/sysv/linux/aarch64/cpu-features.c
index db6aa3516c..6206a2f618 100644
--- a/sysdeps/unix/sysv/linux/aarch64/cpu-features.c
+++ b/sysdeps/unix/sysv/linux/aarch64/cpu-features.c
@@ -46,6 +46,7 @@ static struct cpu_list cpu_list[] = {
{"ares", 0x411FD0C0},
{"emag", 0x503F0001},
{"kunpeng920", 0x481FD010},
+ {"a64fx", 0x460F0010},
{"generic", 0x0}
};
@@ -116,4 +117,7 @@ init_cpu_features (struct cpu_features *cpu_features)
(PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_ASYNC | MTE_ALLOWED_TAGS),
0, 0, 0);
#endif
+
+ /* Check if SVE is supported. */
+ cpu_features->sve = GLRO (dl_hwcap) & HWCAP_SVE;
}
diff --git a/sysdeps/unix/sysv/linux/aarch64/cpu-features.h b/sysdeps/unix/sysv/linux/aarch64/cpu-features.h
index 3b9bfed134..2b322e5414 100644
--- a/sysdeps/unix/sysv/linux/aarch64/cpu-features.h
+++ b/sysdeps/unix/sysv/linux/aarch64/cpu-features.h
@@ -65,6 +65,9 @@
#define IS_KUNPENG920(midr) (MIDR_IMPLEMENTOR(midr) == 'H' \
&& MIDR_PARTNUM(midr) == 0xd01)
+#define IS_A64FX(midr) (MIDR_IMPLEMENTOR(midr) == 'F' \
+ && MIDR_PARTNUM(midr) == 0x001)
+
struct cpu_features
{
uint64_t midr_el1;
@@ -72,6 +75,7 @@ struct cpu_features
bool bti;
/* Currently, the GLIBC memory tagging tunable only defines 8 bits. */
uint8_t mte_state;
+ bool sve;
};
#endif /* _CPU_FEATURES_AARCH64_H */
--
2.17.1
* [PATCH 3/5] aarch64: Added optimized memset for A64FX
From: Naohiro Tamura @ 2021-03-17 2:34 UTC (permalink / raw)
To: libc-alpha; +Cc: Naohiro Tamura
From: Naohiro Tamura <naohirot@jp.fujitsu.com>
This patch optimizes the performance of memset for A64FX [1], which
implements ARMv8-A SVE and has a 64KB L1 cache per core and an 8MB L2
cache per NUMA node.
The optimization makes use of the Scalable Vector Registers with
several techniques such as loop unrolling, memory access alignment,
cache zero fill, and prefetch.
The SVE assembler code for memset is implemented as Vector Length
Agnostic code, so in principle it can run on any SoC that implements
the ARMv8-A SVE standard.
We confirmed that all test cases pass when running 'make check' and
'make xcheck', not only on A64FX but also on ThunderX2.
We also confirmed with 'make bench' that the SVE 512-bit vector
register implementation is roughly 4 times faster than the Advanced
SIMD 128-bit implementation and 8 times faster than the scalar 64-bit
implementation.
[1] https://github.com/fujitsu/A64FX
---
sysdeps/aarch64/multiarch/Makefile | 1 +
sysdeps/aarch64/multiarch/ifunc-impl-list.c | 5 +-
sysdeps/aarch64/multiarch/memset.c | 11 +-
sysdeps/aarch64/multiarch/memset_a64fx.S | 574 ++++++++++++++++++++
4 files changed, 589 insertions(+), 2 deletions(-)
create mode 100644 sysdeps/aarch64/multiarch/memset_a64fx.S
diff --git a/sysdeps/aarch64/multiarch/Makefile b/sysdeps/aarch64/multiarch/Makefile
index 04c3f17121..7500cf1e93 100644
--- a/sysdeps/aarch64/multiarch/Makefile
+++ b/sysdeps/aarch64/multiarch/Makefile
@@ -2,6 +2,7 @@ ifeq ($(subdir),string)
sysdep_routines += memcpy_generic memcpy_advsimd memcpy_thunderx memcpy_thunderx2 \
memcpy_falkor memcpy_a64fx \
memset_generic memset_falkor memset_emag memset_kunpeng \
+ memset_a64fx \
memchr_generic memchr_nosimd \
strlen_mte strlen_asimd
endif
diff --git a/sysdeps/aarch64/multiarch/ifunc-impl-list.c b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
index cb78da9692..e252a10d88 100644
--- a/sysdeps/aarch64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
@@ -41,7 +41,7 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
INIT_ARCH ();
- /* Support sysdeps/aarch64/multiarch/memcpy.c and memmove.c. */
+ /* Support sysdeps/aarch64/multiarch/memcpy.c, memmove.c and memset.c. */
IFUNC_IMPL (i, name, memcpy,
IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_thunderx)
IFUNC_IMPL_ADD (array, i, memcpy, !bti, __memcpy_thunderx2)
@@ -66,6 +66,9 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
IFUNC_IMPL_ADD (array, i, memset, (zva_size == 64), __memset_falkor)
IFUNC_IMPL_ADD (array, i, memset, (zva_size == 64), __memset_emag)
IFUNC_IMPL_ADD (array, i, memset, 1, __memset_kunpeng)
+#if HAVE_SVE_ASM_SUPPORT
+ IFUNC_IMPL_ADD (array, i, memset, sve, __memset_a64fx)
+#endif
IFUNC_IMPL_ADD (array, i, memset, 1, __memset_generic))
IFUNC_IMPL (i, name, memchr,
IFUNC_IMPL_ADD (array, i, memchr, !mte, __memchr_nosimd)
diff --git a/sysdeps/aarch64/multiarch/memset.c b/sysdeps/aarch64/multiarch/memset.c
index 28d3926bc2..df075edddb 100644
--- a/sysdeps/aarch64/multiarch/memset.c
+++ b/sysdeps/aarch64/multiarch/memset.c
@@ -31,6 +31,9 @@ extern __typeof (__redirect_memset) __libc_memset;
extern __typeof (__redirect_memset) __memset_falkor attribute_hidden;
extern __typeof (__redirect_memset) __memset_emag attribute_hidden;
extern __typeof (__redirect_memset) __memset_kunpeng attribute_hidden;
+#if HAVE_SVE_ASM_SUPPORT
+extern __typeof (__redirect_memset) __memset_a64fx attribute_hidden;
+#endif
extern __typeof (__redirect_memset) __memset_generic attribute_hidden;
libc_ifunc (__libc_memset,
@@ -40,7 +43,13 @@ libc_ifunc (__libc_memset,
? __memset_falkor
: (IS_EMAG (midr) && zva_size == 64
? __memset_emag
- : __memset_generic)));
+#if HAVE_SVE_ASM_SUPPORT
+ : (IS_A64FX (midr)
+ ? __memset_a64fx
+ : __memset_generic))));
+#else
+ : __memset_generic)));
+#endif
# undef memset
strong_alias (__libc_memset, memset);
diff --git a/sysdeps/aarch64/multiarch/memset_a64fx.S b/sysdeps/aarch64/multiarch/memset_a64fx.S
new file mode 100644
index 0000000000..02ae7caab0
--- /dev/null
+++ b/sysdeps/aarch64/multiarch/memset_a64fx.S
@@ -0,0 +1,574 @@
+/* Optimized memset for Fujitsu A64FX processor.
+ Copyright (C) 2021 Free Software Foundation, Inc.
+
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library. If not, see
+ <https://www.gnu.org/licenses/>. */
+
+#include <sysdep.h>
+#include <sysdeps/aarch64/memset-reg.h>
+
+#if HAVE_SVE_ASM_SUPPORT
+#if IS_IN (libc)
+# define MEMSET __memset_a64fx
+
+/* Assumptions:
+ *
+ * ARMv8.2-a, AArch64, unaligned accesses, sve
+ *
+ */
+
+#define L1_SIZE (64*1024) // L1 64KB
+#define L2_SIZE (8*1024*1024) // L2 8MB per NUMA node
+#define CACHE_LINE_SIZE 256
+#define PF_DIST_L1 (CACHE_LINE_SIZE * 16)
+#define PF_DIST_L2 (CACHE_LINE_SIZE * 128)
+#define rest x8
+#define vector_length x9
+#define vl_remainder x10 // vector_length remainder
+#define cl_remainder x11 // CACHE_LINE_SIZE remainder
+
+ .arch armv8.2-a+sve
+
+ENTRY_ALIGN (MEMSET, 6)
+
+ PTR_ARG (0)
+ SIZE_ARG (2)
+
+ cmp count, 0
+ b.ne L(init)
+ ret
+L(init):
+ mov rest, count
+ mov dst, dstin
+ add dstend, dstin, count
+ cntb vector_length
+ ptrue p0.b
+ dup z0.b, valw
+
+ cmp count, 96
+ b.hi L(set_long)
+ cmp count, 16
+ b.hs L(set_medium)
+ mov val, v0.D[0]
+
+ /* Set 0..15 bytes. */
+ tbz count, 3, 1f
+ str val, [dstin]
+ str val, [dstend, -8]
+ ret
+ nop
+1: tbz count, 2, 2f
+ str valw, [dstin]
+ str valw, [dstend, -4]
+ ret
+2: cbz count, 3f
+ strb valw, [dstin]
+ tbz count, 1, 3f
+ strh valw, [dstend, -2]
+3: ret
+
+ /* Set 16..96 bytes. */
+L(set_medium):
+ str q0, [dstin]
+ tbnz count, 6, L(set96)
+ str q0, [dstend, -16]
+ tbz count, 5, 1f
+ str q0, [dstin, 16]
+ str q0, [dstend, -32]
+1: ret
+
+ .p2align 4
+ /* Set 64..96 bytes. Write 64 bytes from the start and
+ 32 bytes from the end. */
+L(set96):
+ str q0, [dstin, 16]
+ stp q0, q0, [dstin, 32]
+ stp q0, q0, [dstend, -32]
+ ret
+
+L(set_long):
+ // if count > 1280 && vector_length != 16 then L(L2)
+ cmp count, 1280
+ ccmp vector_length, 16, 4, gt
+ b.ne L(L2)
+ bic dst, dstin, 15
+ str q0, [dstin]
+ sub count, dstend, dst /* Count is 16 too large. */
+ sub dst, dst, 16 /* Dst is biased by -32. */
+ sub count, count, 64 + 16 /* Adjust count and bias for loop. */
+1: stp q0, q0, [dst, 32]
+ stp q0, q0, [dst, 64]!
+ subs count, count, 64
+ b.lo 2f
+ stp q0, q0, [dst, 32]
+ stp q0, q0, [dst, 64]!
+ subs count, count, 64
+ b.lo 2f
+ stp q0, q0, [dst, 32]
+ stp q0, q0, [dst, 64]!
+ subs count, count, 64
+ b.lo 2f
+ stp q0, q0, [dst, 32]
+ stp q0, q0, [dst, 64]!
+ subs count, count, 64
+ b.hi 1b
+2: stp q0, q0, [dstend, -64]
+ stp q0, q0, [dstend, -32]
+ ret
+
+L(L2):
+ // check the DC ZVA block size (dczid_el0)
+ mrs tmp1, dczid_el0
+ cmp tmp1, 6 // BS field 6 means 4 << 6 = 256 bytes
+ b.ne L(vl_agnostic)
+
+ // if rest >= L2_SIZE
+ cmp rest, L2_SIZE
+ b.cc L(L1_prefetch)
+ // align dst address at vector_length byte boundary
+ sub tmp1, vector_length, 1
+ and tmp2, dst, tmp1
+ // if vl_remainder == 0
+ cmp tmp2, 0
+ b.eq 1f
+ sub vl_remainder, vector_length, tmp2
+ // process remainder until the first vector_length boundary
+ whilelt p0.b, xzr, vl_remainder
+ st1b z0.b, p0, [dst]
+ add dst, dst, vl_remainder
+ sub rest, rest, vl_remainder
+ // align dst address at CACHE_LINE_SIZE byte boundary
+1: mov tmp1, CACHE_LINE_SIZE
+ and tmp2, dst, CACHE_LINE_SIZE - 1
+ // if cl_remainder == 0
+ cmp tmp2, 0
+ b.eq L(L2_dc_zva)
+ sub cl_remainder, tmp1, tmp2
+ // process remainder until the first CACHE_LINE_SIZE boundary
+ mov tmp1, xzr // index
+2: whilelt p0.b, tmp1, cl_remainder
+ st1b z0.b, p0, [dst, tmp1]
+ incb tmp1
+ cmp tmp1, cl_remainder
+ b.lo 2b
+ add dst, dst, cl_remainder
+ sub rest, rest, cl_remainder
+
+L(L2_dc_zva): // unroll zero fill
+ mov tmp1, dst
+ dc zva, tmp1 // 1
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ dc zva, tmp1 // 2
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ dc zva, tmp1 // 3
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ dc zva, tmp1 // 4
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ dc zva, tmp1 // 5
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ dc zva, tmp1 // 6
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ dc zva, tmp1 // 7
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ dc zva, tmp1 // 8
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ dc zva, tmp1 // 9
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ dc zva, tmp1 // 10
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ dc zva, tmp1 // 11
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ dc zva, tmp1 // 12
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ dc zva, tmp1 // 13
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ dc zva, tmp1 // 14
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ dc zva, tmp1 // 15
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ dc zva, tmp1 // 16
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ dc zva, tmp1 // 17
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ dc zva, tmp1 // 18
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ dc zva, tmp1 // 19
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ dc zva, tmp1 // 20
+
+L(L2_vl_64): // VL64 unroll8
+ cmp vector_length, 64
+ b.ne L(L2_vl_32)
+ ptrue p0.b
+ .p2align 4
+1: st1b {z0.b}, p0, [dst]
+ st1b {z0.b}, p0, [dst, #1, mul vl]
+ st1b {z0.b}, p0, [dst, #2, mul vl]
+ st1b {z0.b}, p0, [dst, #3, mul vl]
+ mov tmp2, CACHE_LINE_SIZE * 20
+ add tmp2, dst, tmp2
+ dc zva, tmp2 // distance CACHE_LINE_SIZE * 20
+ st1b {z0.b}, p0, [dst, #4, mul vl]
+ st1b {z0.b}, p0, [dst, #5, mul vl]
+ st1b {z0.b}, p0, [dst, #6, mul vl]
+ st1b {z0.b}, p0, [dst, #7, mul vl]
+ add tmp2, tmp2, CACHE_LINE_SIZE
+ dc zva, tmp2 // distance CACHE_LINE_SIZE * 21
+ add dst, dst, 512
+ sub rest, rest, 512
+ cmp rest, L2_SIZE
+ b.ge 1b
+
+L(L2_vl_32): // VL32 unroll6
+ cmp vector_length, 32
+ b.ne L(L2_vl_16)
+ ptrue p0.b
+ .p2align 4
+1: st1b {z0.b}, p0, [dst]
+ st1b {z0.b}, p0, [dst, #1, mul vl]
+ st1b {z0.b}, p0, [dst, #2, mul vl]
+ st1b {z0.b}, p0, [dst, #3, mul vl]
+ st1b {z0.b}, p0, [dst, #4, mul vl]
+ st1b {z0.b}, p0, [dst, #5, mul vl]
+ st1b {z0.b}, p0, [dst, #6, mul vl]
+ st1b {z0.b}, p0, [dst, #7, mul vl]
+ mov tmp2, CACHE_LINE_SIZE * 21
+ add tmp2, dst, tmp2
+ dc zva, tmp2 // distance CACHE_LINE_SIZE * 21
+ add dst, dst, CACHE_LINE_SIZE
+ st1b {z0.b}, p0, [dst]
+ st1b {z0.b}, p0, [dst, #1, mul vl]
+ st1b {z0.b}, p0, [dst, #2, mul vl]
+ st1b {z0.b}, p0, [dst, #3, mul vl]
+ st1b {z0.b}, p0, [dst, #4, mul vl]
+ st1b {z0.b}, p0, [dst, #5, mul vl]
+ st1b {z0.b}, p0, [dst, #6, mul vl]
+ st1b {z0.b}, p0, [dst, #7, mul vl]
+ add tmp2, tmp2, CACHE_LINE_SIZE
+ dc zva, tmp2 // distance CACHE_LINE_SIZE * 22
+ add dst, dst, CACHE_LINE_SIZE
+ sub rest, rest, 512
+ cmp rest, L2_SIZE
+ b.ge 1b
+
+L(L2_vl_16): // VL16 unroll32
+ cmp vector_length, 16
+ b.ne L(L1_prefetch)
+ ptrue p0.b
+ .p2align 4
+1: add dst, dst, 128
+ st1b {z0.b}, p0, [dst, #-8, mul vl]
+ st1b {z0.b}, p0, [dst, #-7, mul vl]
+ st1b {z0.b}, p0, [dst, #-6, mul vl]
+ st1b {z0.b}, p0, [dst, #-5, mul vl]
+ st1b {z0.b}, p0, [dst, #-4, mul vl]
+ st1b {z0.b}, p0, [dst, #-3, mul vl]
+ st1b {z0.b}, p0, [dst, #-2, mul vl]
+ st1b {z0.b}, p0, [dst, #-1, mul vl]
+ st1b {z0.b}, p0, [dst]
+ st1b {z0.b}, p0, [dst, #1, mul vl]
+ st1b {z0.b}, p0, [dst, #2, mul vl]
+ st1b {z0.b}, p0, [dst, #3, mul vl]
+ st1b {z0.b}, p0, [dst, #4, mul vl]
+ st1b {z0.b}, p0, [dst, #5, mul vl]
+ st1b {z0.b}, p0, [dst, #6, mul vl]
+ st1b {z0.b}, p0, [dst, #7, mul vl]
+ mov tmp2, CACHE_LINE_SIZE * 20
+ add tmp2, dst, tmp2
+ dc zva, tmp2 // distance CACHE_LINE_SIZE * 20
+ add dst, dst, CACHE_LINE_SIZE
+ st1b {z0.b}, p0, [dst, #-8, mul vl]
+ st1b {z0.b}, p0, [dst, #-7, mul vl]
+ st1b {z0.b}, p0, [dst, #-6, mul vl]
+ st1b {z0.b}, p0, [dst, #-5, mul vl]
+ st1b {z0.b}, p0, [dst, #-4, mul vl]
+ st1b {z0.b}, p0, [dst, #-3, mul vl]
+ st1b {z0.b}, p0, [dst, #-2, mul vl]
+ st1b {z0.b}, p0, [dst, #-1, mul vl]
+ st1b {z0.b}, p0, [dst]
+ st1b {z0.b}, p0, [dst, #1, mul vl]
+ st1b {z0.b}, p0, [dst, #2, mul vl]
+ st1b {z0.b}, p0, [dst, #3, mul vl]
+ st1b {z0.b}, p0, [dst, #4, mul vl]
+ st1b {z0.b}, p0, [dst, #5, mul vl]
+ st1b {z0.b}, p0, [dst, #6, mul vl]
+ st1b {z0.b}, p0, [dst, #7, mul vl]
+ add tmp2, tmp2, CACHE_LINE_SIZE
+ dc zva, tmp2 // distance CACHE_LINE_SIZE * 21
+ add dst, dst, 128
+ sub rest, rest, 512
+ cmp rest, L2_SIZE
+ b.ge 1b
+
+L(L1_prefetch): // if rest >= L1_SIZE
+ cmp rest, L1_SIZE
+ b.cc L(vl_agnostic)
+L(L1_vl_64):
+ cmp vector_length, 64
+ b.ne L(L1_vl_32)
+ ptrue p0.b
+ .p2align 4
+1: st1b {z0.b}, p0, [dst]
+ st1b {z0.b}, p0, [dst, #1, mul vl]
+ st1b {z0.b}, p0, [dst, #2, mul vl]
+ st1b {z0.b}, p0, [dst, #3, mul vl]
+ mov tmp1, PF_DIST_L1
+ prfm pstl1keep, [dst, tmp1]
+ mov tmp1, PF_DIST_L2
+ prfm pstl2keep, [dst, tmp1]
+ st1b {z0.b}, p0, [dst, #4, mul vl]
+ st1b {z0.b}, p0, [dst, #5, mul vl]
+ st1b {z0.b}, p0, [dst, #6, mul vl]
+ st1b {z0.b}, p0, [dst, #7, mul vl]
+ mov tmp1, PF_DIST_L1 + CACHE_LINE_SIZE
+ prfm pstl1keep, [dst, tmp1]
+ mov tmp1, PF_DIST_L2 + CACHE_LINE_SIZE
+ prfm pstl2keep, [dst, tmp1]
+ add dst, dst, 512
+ sub rest, rest, 512
+ cmp rest, L1_SIZE
+ b.ge 1b
+
+L(L1_vl_32):
+ cmp vector_length, 32
+ b.ne L(L1_vl_16)
+ ptrue p0.b
+ .p2align 4
+1: st1b {z0.b}, p0, [dst]
+ st1b {z0.b}, p0, [dst, #1, mul vl]
+ st1b {z0.b}, p0, [dst, #2, mul vl]
+ st1b {z0.b}, p0, [dst, #3, mul vl]
+ st1b {z0.b}, p0, [dst, #4, mul vl]
+ st1b {z0.b}, p0, [dst, #5, mul vl]
+ st1b {z0.b}, p0, [dst, #6, mul vl]
+ st1b {z0.b}, p0, [dst, #7, mul vl]
+ mov tmp1, PF_DIST_L1
+ prfm pstl1keep, [dst, tmp1]
+ mov tmp1, PF_DIST_L2
+ prfm pstl2keep, [dst, tmp1]
+ add dst, dst, CACHE_LINE_SIZE
+ st1b {z0.b}, p0, [dst]
+ st1b {z0.b}, p0, [dst, #1, mul vl]
+ st1b {z0.b}, p0, [dst, #2, mul vl]
+ st1b {z0.b}, p0, [dst, #3, mul vl]
+ st1b {z0.b}, p0, [dst, #4, mul vl]
+ st1b {z0.b}, p0, [dst, #5, mul vl]
+ st1b {z0.b}, p0, [dst, #6, mul vl]
+ st1b {z0.b}, p0, [dst, #7, mul vl]
+ mov tmp1, PF_DIST_L1 + CACHE_LINE_SIZE
+ prfm pstl1keep, [dst, tmp1]
+ mov tmp1, PF_DIST_L2 + CACHE_LINE_SIZE
+ prfm pstl2keep, [dst, tmp1]
+ add dst, dst, CACHE_LINE_SIZE
+ sub rest, rest, 512
+ cmp rest, L1_SIZE
+ b.ge 1b
+
+L(L1_vl_16): // VL16 unroll32
+ cmp vector_length, 16
+ b.ne L(vl_agnostic)
+ ptrue p0.b
+ .p2align 4
+1: mov tmp1, dst
+ add dst, dst, 128
+ st1b {z0.b}, p0, [dst, #-8, mul vl]
+ st1b {z0.b}, p0, [dst, #-7, mul vl]
+ st1b {z0.b}, p0, [dst, #-6, mul vl]
+ st1b {z0.b}, p0, [dst, #-5, mul vl]
+ st1b {z0.b}, p0, [dst, #-4, mul vl]
+ st1b {z0.b}, p0, [dst, #-3, mul vl]
+ st1b {z0.b}, p0, [dst, #-2, mul vl]
+ st1b {z0.b}, p0, [dst, #-1, mul vl]
+ st1b {z0.b}, p0, [dst]
+ st1b {z0.b}, p0, [dst, #1, mul vl]
+ st1b {z0.b}, p0, [dst, #2, mul vl]
+ st1b {z0.b}, p0, [dst, #3, mul vl]
+ st1b {z0.b}, p0, [dst, #4, mul vl]
+ st1b {z0.b}, p0, [dst, #5, mul vl]
+ st1b {z0.b}, p0, [dst, #6, mul vl]
+ st1b {z0.b}, p0, [dst, #7, mul vl]
+ mov tmp1, PF_DIST_L1
+ prfm pstl1keep, [dst, tmp1]
+ mov tmp1, PF_DIST_L2
+ prfm pstl2keep, [dst, tmp1]
+ add dst, dst, CACHE_LINE_SIZE
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ st1b {z0.b}, p0, [dst, #-8, mul vl]
+ st1b {z0.b}, p0, [dst, #-7, mul vl]
+ st1b {z0.b}, p0, [dst, #-6, mul vl]
+ st1b {z0.b}, p0, [dst, #-5, mul vl]
+ st1b {z0.b}, p0, [dst, #-4, mul vl]
+ st1b {z0.b}, p0, [dst, #-3, mul vl]
+ st1b {z0.b}, p0, [dst, #-2, mul vl]
+ st1b {z0.b}, p0, [dst, #-1, mul vl]
+ st1b {z0.b}, p0, [dst]
+ st1b {z0.b}, p0, [dst, #1, mul vl]
+ st1b {z0.b}, p0, [dst, #2, mul vl]
+ st1b {z0.b}, p0, [dst, #3, mul vl]
+ st1b {z0.b}, p0, [dst, #4, mul vl]
+ st1b {z0.b}, p0, [dst, #5, mul vl]
+ st1b {z0.b}, p0, [dst, #6, mul vl]
+ st1b {z0.b}, p0, [dst, #7, mul vl]
+ mov tmp1, PF_DIST_L1 + CACHE_LINE_SIZE
+ prfm pstl1keep, [dst, tmp1]
+ mov tmp1, PF_DIST_L2 + CACHE_LINE_SIZE
+ prfm pstl2keep, [dst, tmp1]
+ add dst, dst, 128
+ sub rest, rest, 512
+ cmp rest, L1_SIZE
+ b.ge 1b
+
+ // VL Agnostic
+L(vl_agnostic):
+L(unroll32):
+ ptrue p0.b
+ lsl tmp1, vector_length, 3 // vector_length * 8
+ lsl tmp2, vector_length, 5 // vector_length * 32
+ .p2align 4
+1: cmp rest, tmp2
+ b.cc L(unroll16)
+ st1b {z0.b}, p0, [dst]
+ st1b {z0.b}, p0, [dst, #1, mul vl]
+ st1b {z0.b}, p0, [dst, #2, mul vl]
+ st1b {z0.b}, p0, [dst, #3, mul vl]
+ st1b {z0.b}, p0, [dst, #4, mul vl]
+ st1b {z0.b}, p0, [dst, #5, mul vl]
+ st1b {z0.b}, p0, [dst, #6, mul vl]
+ st1b {z0.b}, p0, [dst, #7, mul vl]
+ add dst, dst, tmp1
+ st1b {z0.b}, p0, [dst]
+ st1b {z0.b}, p0, [dst, #1, mul vl]
+ st1b {z0.b}, p0, [dst, #2, mul vl]
+ st1b {z0.b}, p0, [dst, #3, mul vl]
+ st1b {z0.b}, p0, [dst, #4, mul vl]
+ st1b {z0.b}, p0, [dst, #5, mul vl]
+ st1b {z0.b}, p0, [dst, #6, mul vl]
+ st1b {z0.b}, p0, [dst, #7, mul vl]
+ add dst, dst, tmp1
+ st1b {z0.b}, p0, [dst]
+ st1b {z0.b}, p0, [dst, #1, mul vl]
+ st1b {z0.b}, p0, [dst, #2, mul vl]
+ st1b {z0.b}, p0, [dst, #3, mul vl]
+ st1b {z0.b}, p0, [dst, #4, mul vl]
+ st1b {z0.b}, p0, [dst, #5, mul vl]
+ st1b {z0.b}, p0, [dst, #6, mul vl]
+ st1b {z0.b}, p0, [dst, #7, mul vl]
+ add dst, dst, tmp1
+ st1b {z0.b}, p0, [dst]
+ st1b {z0.b}, p0, [dst, #1, mul vl]
+ st1b {z0.b}, p0, [dst, #2, mul vl]
+ st1b {z0.b}, p0, [dst, #3, mul vl]
+ st1b {z0.b}, p0, [dst, #4, mul vl]
+ st1b {z0.b}, p0, [dst, #5, mul vl]
+ st1b {z0.b}, p0, [dst, #6, mul vl]
+ st1b {z0.b}, p0, [dst, #7, mul vl]
+ add dst, dst, tmp1
+ sub rest, rest, tmp2
+ b 1b
+
+L(unroll16):
+ ptrue p0.b
+ lsl tmp1, vector_length, 3 // vector_length * 8
+ lsl tmp2, vector_length, 4 // vector_length * 16
+ .p2align 4
+1: cmp rest, tmp2
+ b.cc L(unroll8)
+ st1b {z0.b}, p0, [dst]
+ st1b {z0.b}, p0, [dst, #1, mul vl]
+ st1b {z0.b}, p0, [dst, #2, mul vl]
+ st1b {z0.b}, p0, [dst, #3, mul vl]
+ st1b {z0.b}, p0, [dst, #4, mul vl]
+ st1b {z0.b}, p0, [dst, #5, mul vl]
+ st1b {z0.b}, p0, [dst, #6, mul vl]
+ st1b {z0.b}, p0, [dst, #7, mul vl]
+ add dst, dst, tmp1
+ st1b {z0.b}, p0, [dst]
+ st1b {z0.b}, p0, [dst, #1, mul vl]
+ st1b {z0.b}, p0, [dst, #2, mul vl]
+ st1b {z0.b}, p0, [dst, #3, mul vl]
+ st1b {z0.b}, p0, [dst, #4, mul vl]
+ st1b {z0.b}, p0, [dst, #5, mul vl]
+ st1b {z0.b}, p0, [dst, #6, mul vl]
+ st1b {z0.b}, p0, [dst, #7, mul vl]
+ add dst, dst, tmp1
+ sub rest, rest, tmp2
+ b 1b
+
+L(unroll8):
+ lsl tmp1, vector_length, 3
+ ptrue p0.b
+ .p2align 4
+1: cmp rest, tmp1
+ b.cc L(unroll4)
+ st1b {z0.b}, p0, [dst]
+ st1b {z0.b}, p0, [dst, #1, mul vl]
+ st1b {z0.b}, p0, [dst, #2, mul vl]
+ st1b {z0.b}, p0, [dst, #3, mul vl]
+ st1b {z0.b}, p0, [dst, #4, mul vl]
+ st1b {z0.b}, p0, [dst, #5, mul vl]
+ st1b {z0.b}, p0, [dst, #6, mul vl]
+ st1b {z0.b}, p0, [dst, #7, mul vl]
+ add dst, dst, tmp1
+ sub rest, rest, tmp1
+ b 1b
+
+L(unroll4):
+ lsl tmp1, vector_length, 2
+ ptrue p0.b
+ .p2align 4
+1: cmp rest, tmp1
+ b.cc L(unroll2)
+ st1b {z0.b}, p0, [dst]
+ st1b {z0.b}, p0, [dst, #1, mul vl]
+ st1b {z0.b}, p0, [dst, #2, mul vl]
+ st1b {z0.b}, p0, [dst, #3, mul vl]
+ add dst, dst, tmp1
+ sub rest, rest, tmp1
+ b 1b
+
+L(unroll2):
+ lsl tmp1, vector_length, 1
+ ptrue p0.b
+ .p2align 4
+1: cmp rest, tmp1
+ b.cc L(unroll1)
+ st1b {z0.b}, p0, [dst]
+ st1b {z0.b}, p0, [dst, #1, mul vl]
+ add dst, dst, tmp1
+ sub rest, rest, tmp1
+ b 1b
+
+L(unroll1):
+ ptrue p0.b
+ .p2align 4
+1: cmp rest, vector_length
+ b.cc L(last)
+ st1b {z0.b}, p0, [dst]
+ sub rest, rest, vector_length
+ add dst, dst, vector_length
+ b 1b
+
+ .p2align 4
+L(last):
+ whilelt p0.b, xzr, rest
+ st1b z0.b, p0, [dst]
+ ret
+
+END (MEMSET)
+libc_hidden_builtin_def (MEMSET)
+
+#endif /* IS_IN (libc) */
+#endif /* HAVE_SVE_ASM_SUPPORT */
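
The VL-agnostic unroll ladder above (L(unroll32) down to L(unroll1), followed by the WHILELT-predicated L(last) store) can be modeled in a few lines. The following is an editor's sketch, not part of the patch; the function name and the list-of-store-sizes return value are invented for illustration.

```python
# Model of the VL-agnostic tail of the A64FX memset above: each unroll
# level stores vector_length * N bytes while at least that many bytes
# remain, and the final WHILELT-predicated st1b covers the remainder.
def vl_agnostic_store_sizes(rest, vector_length):
    stores = []
    for unroll in (32, 16, 8, 4, 2, 1):   # L(unroll32) .. L(unroll1)
        chunk = vector_length * unroll
        while rest >= chunk:
            stores.append(chunk)
            rest -= chunk
    if rest > 0:                          # L(last): whilelt p0.b, xzr, rest
        stores.append(rest)
    return stores
```

With a 16-byte vector length, for example, a 100-byte request decomposes into stores of 64, 32, and 4 bytes; the final 4-byte store corresponds to the partial predicate produced by `whilelt p0.b, xzr, rest`.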
--
2.17.1
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH 4/5] scripts: Added Vector Length Set test helper script
2021-03-17 2:28 [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX Naohiro Tamura
` (2 preceding siblings ...)
2021-03-17 2:34 ` [PATCH 3/5] aarch64: Added optimized memset " Naohiro Tamura
@ 2021-03-17 2:35 ` Naohiro Tamura
2021-03-29 13:20 ` Szabolcs Nagy via Libc-alpha
2021-03-17 2:35 ` [PATCH 5/5] benchtests: Added generic_memcpy and generic_memmove to large benchtests Naohiro Tamura
` (3 subsequent siblings)
7 siblings, 1 reply; 72+ messages in thread
From: Naohiro Tamura @ 2021-03-17 2:35 UTC (permalink / raw)
To: libc-alpha; +Cc: Naohiro Tamura
From: Naohiro Tamura <naohirot@jp.fujitsu.com>
This patch adds a test helper script that changes the Vector Length for a
child process. The script can be used as the test-wrapper for 'make check'.
Usage examples:
ubuntu@bionic:~/build$ make check subdirs=string \
test-wrapper='~/glibc/scripts/vltest.py 16'
ubuntu@bionic:~/build$ ~/glibc/scripts/vltest.py 16 make test \
t=string/test-memcpy
ubuntu@bionic:~/build$ ~/glibc/scripts/vltest.py 32 ./debugglibc.sh \
string/test-memmove
ubuntu@bionic:~/build$ ~/glibc/scripts/vltest.py 64 ./testrun.sh \
string/test-memset
---
scripts/vltest.py | 82 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 82 insertions(+)
create mode 100755 scripts/vltest.py
diff --git a/scripts/vltest.py b/scripts/vltest.py
new file mode 100755
index 0000000000..264dfa449f
--- /dev/null
+++ b/scripts/vltest.py
@@ -0,0 +1,82 @@
+#!/usr/bin/python3
+# Set Scalable Vector Length test helper
+# Copyright (C) 2019-2021 Free Software Foundation, Inc.
+# This file is part of the GNU C Library.
+#
+# The GNU C Library is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# The GNU C Library is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with the GNU C Library; if not, see
+# <https://www.gnu.org/licenses/>.
+"""Set Scalable Vector Length test helper.
+
+Set Scalable Vector Length for child process.
+
+examples:
+
+ubuntu@bionic:~/build$ make check subdirs=string \
+test-wrapper='~/glibc/scripts/vltest.py 16'
+
+ubuntu@bionic:~/build$ ~/glibc/scripts/vltest.py 16 make test \
+t=string/test-memcpy
+
+ubuntu@bionic:~/build$ ~/glibc/scripts/vltest.py 32 ./debugglibc.sh \
+string/test-memmove
+
+ubuntu@bionic:~/build$ ~/glibc/scripts/vltest.py 64 ./testrun.sh \
+string/test-memset
+"""
+import argparse
+from ctypes import cdll, CDLL
+import os
+import sys
+
+EXIT_SUCCESS = 0
+EXIT_FAILURE = 1
+EXIT_UNSUPPORTED = 77
+
+AT_HWCAP = 16
+HWCAP_SVE = (1 << 22)
+
+PR_SVE_GET_VL = 51
+PR_SVE_SET_VL = 50
+PR_SVE_SET_VL_ONEXEC = (1 << 18)
+PR_SVE_VL_INHERIT = (1 << 17)
+PR_SVE_VL_LEN_MASK = 0xffff
+
+def main(args):
+ libc = CDLL("libc.so.6")
+ if not libc.getauxval(AT_HWCAP) & HWCAP_SVE:
+ print("CPU doesn't support SVE")
+ sys.exit(EXIT_UNSUPPORTED)
+
+ libc.prctl(PR_SVE_SET_VL,
+ args.vl[0] | PR_SVE_SET_VL_ONEXEC | PR_SVE_VL_INHERIT)
+ os.execvp(args.args[0], args.args)
+ print("exec system call failure")
+ sys.exit(EXIT_FAILURE)
+
+if __name__ == '__main__':
+ parser = argparse.ArgumentParser(description=
+ "Set Scalable Vector Length test helper",
+ formatter_class=argparse.ArgumentDefaultsHelpFormatter)
+
+ # positional argument
+ parser.add_argument("vl", nargs=1, type=int,
+ choices=range(16, 257, 16),
+ help=('vector length: '\
+ 'a multiple of 16, from 16 to 256'))
+ # remainder arguments
+ parser.add_argument('args', nargs=argparse.REMAINDER,
+ help=('args '\
+ 'passed to the child process'))
+ args = parser.parse_args()
+ main(args)
--
2.17.1
* [PATCH 5/5] benchtests: Added generic_memcpy and generic_memmove to large benchtests
2021-03-17 2:28 [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX Naohiro Tamura
` (3 preceding siblings ...)
2021-03-17 2:35 ` [PATCH 4/5] scripts: Added Vector Length Set test helper script Naohiro Tamura
@ 2021-03-17 2:35 ` Naohiro Tamura
2021-03-29 12:03 ` [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX Szabolcs Nagy via Libc-alpha
` (2 subsequent siblings)
7 siblings, 0 replies; 72+ messages in thread
From: Naohiro Tamura @ 2021-03-17 2:35 UTC (permalink / raw)
To: libc-alpha
This patch adds generic_memcpy and generic_memmove to
bench-memcpy-large.c and bench-memmove-large.c respectively, so that we
can consistently compare the performance of the 512-bit scalable vector
registers against the scalar 64-bit registers across the
memcpy/memmove/memset default and large benchtests.
---
benchtests/bench-memcpy-large.c | 9 +++++++++
benchtests/bench-memmove-large.c | 9 +++++++++
2 files changed, 18 insertions(+)
diff --git a/benchtests/bench-memcpy-large.c b/benchtests/bench-memcpy-large.c
index 3df1575514..4a87987202 100644
--- a/benchtests/bench-memcpy-large.c
+++ b/benchtests/bench-memcpy-large.c
@@ -25,7 +25,10 @@
# define TIMEOUT (20 * 60)
# include "bench-string.h"
+void *generic_memcpy (void *, const void *, size_t);
+
IMPL (memcpy, 1)
+IMPL (generic_memcpy, 0)
#endif
#include "json-lib.h"
@@ -124,3 +127,9 @@ test_main (void)
}
#include <support/test-driver.c>
+
+#define libc_hidden_builtin_def(X)
+#undef MEMCPY
+#define MEMCPY generic_memcpy
+#include <string/memcpy.c>
+#include <string/wordcopy.c>
diff --git a/benchtests/bench-memmove-large.c b/benchtests/bench-memmove-large.c
index 9e2fcd50ab..151dd5a276 100644
--- a/benchtests/bench-memmove-large.c
+++ b/benchtests/bench-memmove-large.c
@@ -25,7 +25,10 @@
#include "bench-string.h"
#include "json-lib.h"
+void *generic_memmove (void *, const void *, size_t);
+
IMPL (memmove, 1)
+IMPL (generic_memmove, 0)
typedef char *(*proto_t) (char *, const char *, size_t);
@@ -123,3 +126,9 @@ test_main (void)
}
#include <support/test-driver.c>
+
+#define libc_hidden_builtin_def(X)
+#undef MEMMOVE
+#define MEMMOVE generic_memmove
+#include <string/memmove.c>
+#include <string/wordcopy.c>
--
2.17.1
* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
2021-03-17 2:28 [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX Naohiro Tamura
` (4 preceding siblings ...)
2021-03-17 2:35 ` [PATCH 5/5] benchtests: Added generic_memcpy and generic_memmove to large benchtests Naohiro Tamura
@ 2021-03-29 12:03 ` Szabolcs Nagy via Libc-alpha
2021-05-10 1:45 ` naohirot
2021-05-12 9:23 ` [PATCH v2 0/6] aarch64: " Naohiro Tamura
7 siblings, 0 replies; 72+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-03-29 12:03 UTC (permalink / raw)
To: Naohiro Tamura; +Cc: libc-alpha
The 03/17/2021 02:28, Naohiro Tamura wrote:
> Fujitsu is in the process of signing the copyright assignment paper.
> We'd like to have some feedback in advance.
thanks for these patches, please let me know when the
copyright is sorted out. i will do some review now.
* Re: [PATCH 1/5] config: Added HAVE_SVE_ASM_SUPPORT for aarch64
2021-03-17 2:33 ` [PATCH 1/5] config: Added HAVE_SVE_ASM_SUPPORT for aarch64 Naohiro Tamura
@ 2021-03-29 12:11 ` Szabolcs Nagy via Libc-alpha
2021-03-30 6:19 ` naohirot
0 siblings, 1 reply; 72+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-03-29 12:11 UTC (permalink / raw)
To: Naohiro Tamura; +Cc: Naohiro Tamura, libc-alpha
The 03/17/2021 02:33, Naohiro Tamura wrote:
> From: Naohiro Tamura <naohirot@jp.fujitsu.com>
>
> This patch checks if assembler supports '-march=armv8.2-a+sve' to
> generate SVE code or not, and then define HAVE_SVE_ASM_SUPPORT macro.
> ---
> config.h.in | 3 +++
> sysdeps/aarch64/configure | 28 ++++++++++++++++++++++++++++
> sysdeps/aarch64/configure.ac | 15 +++++++++++++++
> 3 files changed, 46 insertions(+)
>
> diff --git a/config.h.in b/config.h.in
> index f21bf04e47..2073816af8 100644
> --- a/config.h.in
> +++ b/config.h.in
> @@ -118,6 +118,9 @@
> /* AArch64 PAC-RET code generation is enabled. */
> #define HAVE_AARCH64_PAC_RET 0
>
> +/* Assembler support ARMv8.2-A SVE */
> +#define HAVE_SVE_ASM_SUPPORT 0
> +
i prefer to use HAVE_AARCH64_ prefix for aarch64 specific
macros in the global config.h, e.g. HAVE_AARCH64_SVE_ASM
and i'd like to have a comment here or in configue.ac with the
binutils version where this becomes obsolete (binutils 2.28 i
think). right now the minimum required version is 2.25, but
glibc may increase that soon to above 2.28.
> diff --git a/sysdeps/aarch64/configure.ac b/sysdeps/aarch64/configure.ac
> index 66f755078a..389a0b4e8d 100644
> --- a/sysdeps/aarch64/configure.ac
> +++ b/sysdeps/aarch64/configure.ac
> @@ -90,3 +90,18 @@ EOF
> fi
> rm -rf conftest.*])
> LIBC_CONFIG_VAR([aarch64-variant-pcs], [$libc_cv_aarch64_variant_pcs])
> +
> +# Check if asm support armv8.2-a+sve
> +AC_CACHE_CHECK(for SVE support in assembler, libc_cv_asm_sve, [dnl
> +cat > conftest.s <<\EOF
> + ptrue p0.b
> +EOF
> +if AC_TRY_COMMAND(${CC-cc} -c -march=armv8.2-a+sve conftest.s 1>&AS_MESSAGE_LOG_FD); then
> + libc_cv_asm_sve=yes
> +else
> + libc_cv_asm_sve=no
> +fi
> +rm -f conftest*])
> +if test $libc_cv_asm_sve = yes; then
> + AC_DEFINE(HAVE_SVE_ASM_SUPPORT)
> +fi
i would use libc_cv_aarch64_sve_asm to make it obvious
that it's aarch64 specific setting.
otherwise OK.
* Re: [PATCH 2/5] aarch64: Added optimized memcpy and memmove for A64FX
2021-03-17 2:34 ` [PATCH 2/5] aarch64: Added optimized memcpy and memmove for A64FX Naohiro Tamura
@ 2021-03-29 12:44 ` Szabolcs Nagy via Libc-alpha
2021-03-30 7:17 ` naohirot
0 siblings, 1 reply; 72+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-03-29 12:44 UTC (permalink / raw)
To: Naohiro Tamura; +Cc: Naohiro Tamura, libc-alpha
The 03/17/2021 02:34, Naohiro Tamura wrote:
> And also we confirmed that the SVE 512 bit vector register performance
> is roughly 4 times better than Advanced SIMD 128 bit register and 8
> times better than scalar 64 bit register by running 'make bench'.
nice speed up. i won't comment on the memcpy asm now.
> diff --git a/manual/tunables.texi b/manual/tunables.texi
> index 1b746c0fa1..81ed5366fc 100644
> --- a/manual/tunables.texi
> +++ b/manual/tunables.texi
> @@ -453,7 +453,8 @@ This tunable is specific to powerpc, powerpc64 and powerpc64le.
> The @code{glibc.cpu.name=xxx} tunable allows the user to tell @theglibc{} to
> assume that the CPU is @code{xxx} where xxx may have one of these values:
> @code{generic}, @code{falkor}, @code{thunderxt88}, @code{thunderx2t99},
> -@code{thunderx2t99p1}, @code{ares}, @code{emag}, @code{kunpeng}.
> +@code{thunderx2t99p1}, @code{ares}, @code{emag}, @code{kunpeng},
> +@code{a64fx}.
OK.
> --- a/sysdeps/aarch64/multiarch/Makefile
> +++ b/sysdeps/aarch64/multiarch/Makefile
> @@ -1,6 +1,6 @@
> ifeq ($(subdir),string)
> sysdep_routines += memcpy_generic memcpy_advsimd memcpy_thunderx memcpy_thunderx2 \
> - memcpy_falkor \
> + memcpy_falkor memcpy_a64fx \
> memset_generic memset_falkor memset_emag memset_kunpeng \
OK.
> --- a/sysdeps/aarch64/multiarch/ifunc-impl-list.c
> +++ b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
> @@ -25,7 +25,11 @@
> #include <stdio.h>
>
> /* Maximum number of IFUNC implementations. */
> -#define MAX_IFUNC 4
> +#if HAVE_SVE_ASM_SUPPORT
> +# define MAX_IFUNC 7
> +#else
> +# define MAX_IFUNC 6
> +#endif
hm this MAX_IFUNC looks a bit problematic: currently its only
use is to detect if a target requires more ifuncs than the
array passed to __libc_ifunc_impl_list, but for that ideally
it would be automatic, not manually maintained.
i would just define it to 7 unconditionally (the maximum over
valid configurations).
> size_t
> __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
> @@ -43,12 +47,18 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
> IFUNC_IMPL_ADD (array, i, memcpy, !bti, __memcpy_thunderx2)
> IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_falkor)
> IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_simd)
> +#if HAVE_SVE_ASM_SUPPORT
> + IFUNC_IMPL_ADD (array, i, memcpy, sve, __memcpy_a64fx)
> +#endif
OK.
> IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_generic))
> IFUNC_IMPL (i, name, memmove,
> IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_thunderx)
> IFUNC_IMPL_ADD (array, i, memmove, !bti, __memmove_thunderx2)
> IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_falkor)
> IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_simd)
> +#if HAVE_SVE_ASM_SUPPORT
> + IFUNC_IMPL_ADD (array, i, memmove, sve, __memmove_a64fx)
> +#endif
OK.
> IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_generic))
> IFUNC_IMPL (i, name, memset,
> /* Enable this on non-falkor processors too so that other cores
> diff --git a/sysdeps/aarch64/multiarch/init-arch.h b/sysdeps/aarch64/multiarch/init-arch.h
> index a167699e74..d20e7e1b8e 100644
> --- a/sysdeps/aarch64/multiarch/init-arch.h
> +++ b/sysdeps/aarch64/multiarch/init-arch.h
> @@ -33,4 +33,6 @@
> bool __attribute__((unused)) bti = \
> HAVE_AARCH64_BTI && GLRO(dl_aarch64_cpu_features).bti; \
> bool __attribute__((unused)) mte = \
> - MTE_ENABLED ();
> + MTE_ENABLED (); \
> + unsigned __attribute__((unused)) sve = \
> + GLRO(dl_aarch64_cpu_features).sve;
i would use bool here.
> diff --git a/sysdeps/aarch64/multiarch/memcpy.c b/sysdeps/aarch64/multiarch/memcpy.c
> index 0e0a5cbcfb..0006f38eb0 100644
> --- a/sysdeps/aarch64/multiarch/memcpy.c
> +++ b/sysdeps/aarch64/multiarch/memcpy.c
> @@ -33,6 +33,9 @@ extern __typeof (__redirect_memcpy) __memcpy_simd attribute_hidden;
> extern __typeof (__redirect_memcpy) __memcpy_thunderx attribute_hidden;
> extern __typeof (__redirect_memcpy) __memcpy_thunderx2 attribute_hidden;
> extern __typeof (__redirect_memcpy) __memcpy_falkor attribute_hidden;
> +#if HAVE_SVE_ASM_SUPPORT
> +extern __typeof (__redirect_memcpy) __memcpy_a64fx attribute_hidden;
> +#endif
OK.
> libc_ifunc (__libc_memcpy,
> (IS_THUNDERX (midr)
> @@ -44,8 +47,13 @@ libc_ifunc (__libc_memcpy,
> : (IS_NEOVERSE_N1 (midr) || IS_NEOVERSE_N2 (midr)
> || IS_NEOVERSE_V1 (midr)
> ? __memcpy_simd
> - : __memcpy_generic)))));
> -
> +#if HAVE_SVE_ASM_SUPPORT
> + : (IS_A64FX (midr)
> + ? __memcpy_a64fx
> + : __memcpy_generic))))));
> +#else
> + : __memcpy_generic)))));
> +#endif
OK.
> new file mode 100644
> index 0000000000..23438e4e3d
> --- /dev/null
> +++ b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
skipping this.
> diff --git a/sysdeps/aarch64/multiarch/memmove.c b/sysdeps/aarch64/multiarch/memmove.c
> index 12d77818a9..1e5ee1c934 100644
> --- a/sysdeps/aarch64/multiarch/memmove.c
> +++ b/sysdeps/aarch64/multiarch/memmove.c
> @@ -33,6 +33,9 @@ extern __typeof (__redirect_memmove) __memmove_simd attribute_hidden;
> extern __typeof (__redirect_memmove) __memmove_thunderx attribute_hidden;
> extern __typeof (__redirect_memmove) __memmove_thunderx2 attribute_hidden;
> extern __typeof (__redirect_memmove) __memmove_falkor attribute_hidden;
> +#if HAVE_SVE_ASM_SUPPORT
> +extern __typeof (__redirect_memmove) __memmove_a64fx attribute_hidden;
> +#endif
OK.
>
> libc_ifunc (__libc_memmove,
> (IS_THUNDERX (midr)
> @@ -44,8 +47,13 @@ libc_ifunc (__libc_memmove,
> : (IS_NEOVERSE_N1 (midr) || IS_NEOVERSE_N2 (midr)
> || IS_NEOVERSE_V1 (midr)
> ? __memmove_simd
> - : __memmove_generic)))));
> -
> +#if HAVE_SVE_ASM_SUPPORT
> + : (IS_A64FX (midr)
> + ? __memmove_a64fx
> + : __memmove_generic))))));
> +#else
> + : __memmove_generic)))));
> +#endif
OK.
> diff --git a/sysdeps/unix/sysv/linux/aarch64/cpu-features.c b/sysdeps/unix/sysv/linux/aarch64/cpu-features.c
> index db6aa3516c..6206a2f618 100644
> --- a/sysdeps/unix/sysv/linux/aarch64/cpu-features.c
> +++ b/sysdeps/unix/sysv/linux/aarch64/cpu-features.c
> @@ -46,6 +46,7 @@ static struct cpu_list cpu_list[] = {
> {"ares", 0x411FD0C0},
> {"emag", 0x503F0001},
> {"kunpeng920", 0x481FD010},
> + {"a64fx", 0x460F0010},
> {"generic", 0x0}
OK.
> +
> + /* Check if SVE is supported. */
> + cpu_features->sve = GLRO (dl_hwcap) & HWCAP_SVE;
OK.
> }
> diff --git a/sysdeps/unix/sysv/linux/aarch64/cpu-features.h b/sysdeps/unix/sysv/linux/aarch64/cpu-features.h
> index 3b9bfed134..2b322e5414 100644
> --- a/sysdeps/unix/sysv/linux/aarch64/cpu-features.h
> +++ b/sysdeps/unix/sysv/linux/aarch64/cpu-features.h
> @@ -65,6 +65,9 @@
> #define IS_KUNPENG920(midr) (MIDR_IMPLEMENTOR(midr) == 'H' \
> && MIDR_PARTNUM(midr) == 0xd01)
>
> +#define IS_A64FX(midr) (MIDR_IMPLEMENTOR(midr) == 'F' \
> + && MIDR_PARTNUM(midr) == 0x001)
> +
OK.
> struct cpu_features
> {
> uint64_t midr_el1;
> @@ -72,6 +75,7 @@ struct cpu_features
> bool bti;
> /* Currently, the GLIBC memory tagging tunable only defines 8 bits. */
> uint8_t mte_state;
> + bool sve;
> };
OK.
* Re: [PATCH 4/5] scripts: Added Vector Length Set test helper script
2021-03-17 2:35 ` [PATCH 4/5] scripts: Added Vector Length Set test helper script Naohiro Tamura
@ 2021-03-29 13:20 ` Szabolcs Nagy via Libc-alpha
2021-03-30 7:25 ` naohirot
0 siblings, 1 reply; 72+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-03-29 13:20 UTC (permalink / raw)
To: Naohiro Tamura; +Cc: Naohiro Tamura, libc-alpha
The 03/17/2021 02:35, Naohiro Tamura wrote:
> +"""Set Scalable Vector Length test helper.
> +
> +Set Scalable Vector Length for child process.
> +
> +examples:
> +
> +ubuntu@bionic:~/build$ make check subdirs=string \
> +test-wrapper='~/glibc/scripts/vltest.py 16'
> +
> +ubuntu@bionic:~/build$ ~/glibc/scripts/vltest.py 16 make test \
> +t=string/test-memcpy
> +
> +ubuntu@bionic:~/build$ ~/glibc/scripts/vltest.py 32 ./debugglibc.sh \
> +string/test-memmove
> +
> +ubuntu@bionic:~/build$ ~/glibc/scripts/vltest.py 64 ./testrun.sh \
> +string/test-memset
> +"""
> +import argparse
> +from ctypes import cdll, CDLL
> +import os
> +import sys
> +
> +EXIT_SUCCESS = 0
> +EXIT_FAILURE = 1
> +EXIT_UNSUPPORTED = 77
> +
> +AT_HWCAP = 16
> +HWCAP_SVE = (1 << 22)
> +
> +PR_SVE_GET_VL = 51
> +PR_SVE_SET_VL = 50
> +PR_SVE_SET_VL_ONEXEC = (1 << 18)
> +PR_SVE_VL_INHERIT = (1 << 17)
> +PR_SVE_VL_LEN_MASK = 0xffff
> +
> +def main(args):
> + libc = CDLL("libc.so.6")
> + if not libc.getauxval(AT_HWCAP) & HWCAP_SVE:
> + print("CPU doesn't support SVE")
> + sys.exit(EXIT_UNSUPPORTED)
> +
> + libc.prctl(PR_SVE_SET_VL,
> + args.vl[0] | PR_SVE_SET_VL_ONEXEC | PR_SVE_VL_INHERIT)
> + os.execvp(args.args[0], args.args)
> + print("exec system call failure")
> + sys.exit(EXIT_FAILURE)
this only works on a (new enough) glibc based system and python's
CDLL path lookup can fail too (it does not follow the host system
configuration).
but i think there is no simple solution without compiling c code and
this seems useful, so i'm happy to have this script.
* RE: [PATCH 1/5] config: Added HAVE_SVE_ASM_SUPPORT for aarch64
2021-03-29 12:11 ` Szabolcs Nagy via Libc-alpha
@ 2021-03-30 6:19 ` naohirot
0 siblings, 0 replies; 72+ messages in thread
From: naohirot @ 2021-03-30 6:19 UTC (permalink / raw)
To: 'Szabolcs Nagy'; +Cc: libc-alpha@sourceware.org
Szabolcs-san,
Thank you for your review.
> > +/* Assembler support ARMv8.2-A SVE */ #define
> HAVE_SVE_ASM_SUPPORT 0
> > +
>
> i prefer to use HAVE_AARCH64_ prefix for aarch64 specific macros in the global
> config.h, e.g. HAVE_AARCH64_SVE_ASM
OK, I'll change it to HAVE_AARCH64_SVE_ASM.
> and i'd like to have a comment here or in configue.ac with the binutils version
> where this becomes obsolete (binutils 2.28 i think). right now the minimum
> required version is 2.25, but glibc may increase that soon to above 2.28.
I'll add the comment in config.h.in like this:
+/* Assembler supports ARMv8.2-A SVE.
+ This macro becomes obsolete when glibc increases the minimum
+ required version of GNU 'binutils' to 2.28 or later. */
+#define HAVE_AARCH64_SVE_ASM 0
> > diff --git a/sysdeps/aarch64/configure.ac
> > b/sysdeps/aarch64/configure.ac index 66f755078a..389a0b4e8d 100644
> > --- a/sysdeps/aarch64/configure.ac
> > +++ b/sysdeps/aarch64/configure.ac
...
> > +if AC_TRY_COMMAND(${CC-cc} -c -march=armv8.2-a+sve conftest.s
> > +1>&AS_MESSAGE_LOG_FD); then
> > + libc_cv_asm_sve=yes
> > +else
> > + libc_cv_asm_sve=no
> > +fi
> > +rm -f conftest*])
> > +if test $libc_cv_asm_sve = yes; then
> > + AC_DEFINE(HAVE_SVE_ASM_SUPPORT)
> > +fi
>
> i would use libc_cv_aarch64_sve_asm to make it obvious that it's aarch64 specific
> setting.
OK, I'll change it to libc_cv_aarch64_sve_asm.
Thanks.
Naohiro
* RE: [PATCH 2/5] aarch64: Added optimized memcpy and memmove for A64FX
2021-03-29 12:44 ` Szabolcs Nagy via Libc-alpha
@ 2021-03-30 7:17 ` naohirot
0 siblings, 0 replies; 72+ messages in thread
From: naohirot @ 2021-03-30 7:17 UTC (permalink / raw)
To: 'Szabolcs Nagy'; +Cc: libc-alpha@sourceware.org
Szabolcs-san,
Thank you for your review.
> > /* Maximum number of IFUNC implementations. */
> > -#define MAX_IFUNC 4
> > +#if HAVE_SVE_ASM_SUPPORT
> > +# define MAX_IFUNC 7
> > +#else
> > +# define MAX_IFUNC 6
> > +#endif
>
> hm this MAX_IFUNC looks a bit problematic: currently its only use is to detect if a
> target requires more ifuncs than the array passed to __libc_ifunc_impl_list, but for
> that ideally it would be automatic, not manually maintained.
>
> i would just define it to 7 unconditionally (the maximum over valid configurations).
OK, I'll fix it to 7 unconditionally.
> > cores diff --git a/sysdeps/aarch64/multiarch/init-arch.h
> > b/sysdeps/aarch64/multiarch/init-arch.h
> > index a167699e74..d20e7e1b8e 100644
> > --- a/sysdeps/aarch64/multiarch/init-arch.h
> > +++ b/sysdeps/aarch64/multiarch/init-arch.h
> > @@ -33,4 +33,6 @@
> > bool __attribute__((unused)) bti =
> \
> > HAVE_AARCH64_BTI && GLRO(dl_aarch64_cpu_features).bti;
> \
> > bool __attribute__((unused)) mte =
> \
> > - MTE_ENABLED ();
> > + MTE_ENABLED ();
> \
> > + unsigned __attribute__((unused)) sve =
> \
> > + GLRO(dl_aarch64_cpu_features).sve;
>
> i would use bool here.
I'll fix it to the bool.
> > --- /dev/null
> > +++ b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
>
> skipping this.
I wait for your review.
Thanks.
Naohiro
* RE: [PATCH 4/5] scripts: Added Vector Length Set test helper script
2021-03-29 13:20 ` Szabolcs Nagy via Libc-alpha
@ 2021-03-30 7:25 ` naohirot
0 siblings, 0 replies; 72+ messages in thread
From: naohirot @ 2021-03-30 7:25 UTC (permalink / raw)
To: 'Szabolcs Nagy'; +Cc: libc-alpha@sourceware.org
Szabolcs-san,
Thank you for your review.
> > +def main(args):
> > + libc = CDLL("libc.so.6")
> > + if not libc.getauxval(AT_HWCAP) & HWCAP_SVE:
> > + print("CPU doesn't support SVE")
> > + sys.exit(EXIT_UNSUPPORTED)
> > +
> > + libc.prctl(PR_SVE_SET_VL,
> > + args.vl[0] | PR_SVE_SET_VL_ONEXEC |
> PR_SVE_VL_INHERIT)
> > + os.execvp(args.args[0], args.args)
> > + print("exec system call failure")
> > + sys.exit(EXIT_FAILURE)
>
>
> this only works on a (new enough) glibc based system and python's CDLL path
> lookup can fail too (it does not follow the host system configuration).
I see, I didn't notice that.
> but i think there is no simple solution without compiling c code and this seems
> useful, so i'm happy to have this script.
OK, thanks!
Naohiro
* [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
@ 2021-04-12 12:52 Wilco Dijkstra via Libc-alpha
2021-04-12 18:53 ` Florian Weimer
2021-04-13 12:07 ` naohirot
0 siblings, 2 replies; 72+ messages in thread
From: Wilco Dijkstra via Libc-alpha @ 2021-04-12 12:52 UTC (permalink / raw)
To: naohirot@fujitsu.com; +Cc: Szabolcs Nagy, 'GNU C Library'
Hi,
I have a few comments about memcpy design (the principles apply equally to memset):
1. Overall the code is too large due to enormous unroll factors
Our current memcpy is about 300 bytes (that includes memmove); this memcpy is ~12 times larger!
This hurts performance due to the code not fitting in the I-cache for common copies.
On a modern OoO core you need very little unrolling since ALU operations and branches
become essentially free while the CPU executes loads and stores. So rather than unrolling
by 32-64 times, try 4 times - you just need enough to hide the taken branch latency.
2. I don't see any special handling for small copies
Even if you want to hyper-optimize gigabyte-sized copies, small copies are still extremely common,
so you always want to handle those as quickly (and with as little code) as possible. Special casing
small copies does not slow down the huge copies - the reverse is more likely since you no longer
need to handle small cases.
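
The dispatch shape described here can be sketched schematically. This is an editor's illustration of the structure Wilco describes, with made-up size thresholds rather than values from any real memcpy:

```python
# Small copies take a short early path; the only cost the fast path
# adds to huge copies is a single size comparison at entry.
def classify_copy(n):
    if n <= 32:
        return "small"    # e.g. two overlapping SIMD loads/stores
    if n <= 128:
        return "medium"   # a handful of unrolled vector copies
    return "large"        # aligned, prefetching bulk loop
```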
3. Check whether using SVE helps small/medium copies
Run memcpy-random benchmark to see whether it is faster to use SVE for small cases or just the SIMD
copy on your uarch.
4. Avoid making the code too general or too specialistic
I see both appearing in the code - trying to deal with different cacheline sizes and different vector lengths,
and also splitting these out into separate cases. If you depend on a particular cacheline size, specialize
the code for that and check the size in the ifunc selector (as various memsets do already). If you want to
handle multiple vector sizes, just use a register for the increment rather than repeating the same code
several times for each vector length.
5. Odd prefetches
I have a hard time believing first prefetching the data to be written, then clearing it using DC ZVA (???),
then prefetching the same data a 2nd time, before finally writing the loaded data, is helping performance...
Generally hardware prefetchers are able to do exactly the right thing since memcpy is trivial to prefetch.
So what is the performance gain of each prefetch/clear step? What is the difference between memcpy
and memmove performance (given memmove doesn't do any of this)?
Cheers,
Wilco
* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
2021-04-12 12:52 [PATCH 0/5] Added optimized memcpy/memmove/memset " Wilco Dijkstra via Libc-alpha
@ 2021-04-12 18:53 ` Florian Weimer
2021-04-13 12:07 ` naohirot
1 sibling, 0 replies; 72+ messages in thread
From: Florian Weimer @ 2021-04-12 18:53 UTC (permalink / raw)
To: Wilco Dijkstra via Libc-alpha; +Cc: Szabolcs Nagy, Wilco Dijkstra
* Wilco Dijkstra via Libc-alpha:
> 5. Odd prefetches
>
> I have a hard time believing that first prefetching the data to be
> written, then clearing it using DC ZVA (???), then prefetching the
> same data a 2nd time, before finally writing the loaded data, is
> helping performance... Generally hardware prefetchers are able to
> do exactly the right thing since memcpy is trivial to prefetch. So
> what is the performance gain of each prefetch/clear step? What is
> the difference between memcpy and memmove performance (given memmove
> doesn't do any of this)?
Another downside is exposure of latent concurrency bugs:
G1: Phantom zeros in cardtable
<https://bugs.openjdk.java.net/browse/JDK-8039042>
I guess the CPU's heritage is shining through here. 8-)
* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
2021-04-12 12:52 [PATCH 0/5] Added optimized memcpy/memmove/memset " Wilco Dijkstra via Libc-alpha
2021-04-12 18:53 ` Florian Weimer
@ 2021-04-13 12:07 ` naohirot
2021-04-14 16:02 ` Wilco Dijkstra via Libc-alpha
1 sibling, 1 reply; 72+ messages in thread
From: naohirot @ 2021-04-13 12:07 UTC (permalink / raw)
To: 'Wilco Dijkstra'; +Cc: Szabolcs Nagy, 'GNU C Library'
Hi Wilco-san,
Thanks for the comments.
I've been continuously updating the first patch since I posted it on Mar. 17, 2021,
and have fixed some bugs.
Here is my local repository's commit history:
https://github.com/NaohiroTamura/glibc/commits/patch-20210317
I'll answer your comments below, referring to the latest source code above and
the benchtest graphs uploaded to Google Drive.
> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
>
> 1. Overall the code is too large due to enormous unroll factors
>
> Our current memcpy is about 300 bytes (that includes memmove), this memcpy is
> ~12 times larger!
> This hurts performance due to the code not fitting in the I-cache for common
> copies.
OK, I'll try to remove unnecessary code which doesn't contribute performance gain
based on benchtests performance data.
> On a modern OoO core you need very little unrolling since ALU operations and
> branches become essentially free while the CPU executes loads and stores. So
> rather than unrolling by 32-64 times, try 4 times - you just need enough to hide the
> taken branch latency.
>
In terms of loop unrolling, I tested several cases in my local environment.
Here are the results.
The source code is based on the latest commit of the branch patch-20210317 in my GitHub repository.
[1] https://github.com/NaohiroTamura/glibc/blob/ec0b55a855529f75bd6f280e59dc2b1c25640490/sysdeps/aarch64/multiarch/memcpy_a64fx.S
[2] https://github.com/NaohiroTamura/glibc/blob/ec0b55a855529f75bd6f280e59dc2b1c25640490/sysdeps/aarch64/multiarch/memset_a64fx.S
Memcpy/memmove uses 8, 4, 2 unrolls, and memset uses 32, 8, 4, 2 unrolls.
This unroll configuration recorded the highest performance:
Memcpy 35 Gbps [3]
Memmove 70 Gbps [4]
Memset 70 Gbps [5]
[3] https://drive.google.com/file/d/1Xz04kV-S1E4tKOKLJRl8KgO8ZdCQqv1O/view
[4] https://drive.google.com/file/d/1QDmt7LMscXIJSpaq2sPOiCKl3nxcLxwk/view
[5] https://drive.google.com/file/d/1rpy7rkIskRs6czTARNIh4yCeh8d-L-cP/view
In the case where memcpy/memmove and memset each use only 4 unrolls,
performance degraded by 5 to 15 Gbps at the peak:
Memcpy 30 Gbps [6]
Memmove 65 Gbps [7]
Memset 45 Gbps [8]
[6] https://drive.google.com/file/d/1P-QJGeuHPlfj3ax8GlxRShV0_HVMJWGc/view
[7] https://drive.google.com/file/d/1R2IK5eWr8NEduNnvqkdPZyoNE0oImRcp/view
[8] https://drive.google.com/file/d/1WMZFjzF5WgmfpXSOnAd9YMjLqv1mcsEm/view
> 2. I don't see any special handling for small copies
>
> Even if you want to hyper optimize gigabyte sized copies, small copies are still
> extremely common, so you always want to handle those as quickly (and with as
> little code) as possible. Special casing small copies does not slow down the huge
> copies - the reverse is more likely since you no longer need to handle small cases.
>
Yes, I implemented special handling for sizes from 1 byte to 512 bytes [9][10].
SVE code seems faster than ASIMD in the small/medium range too [11][12][13].
[9] https://github.com/NaohiroTamura/glibc/blob/ec0b55a855529f75bd6f280e59dc2b1c25640490/sysdeps/aarch64/multiarch/memcpy_a64fx.S#L176-L267
[10] https://github.com/NaohiroTamura/glibc/blob/ec0b55a855529f75bd6f280e59dc2b1c25640490/sysdeps/aarch64/multiarch/memset_a64fx.S#L68-L78
[11] https://drive.google.com/file/d/1VgkFTrWgjFMQ35btWjqHJbEGMgb3ZE-h/view
[12] https://drive.google.com/file/d/1SJ-WMUEEX73SioT9F7tVEIc4iRa8SfjU/view
[13] https://drive.google.com/file/d/1DPPgh2r6t16Ppe0Cpo5XzkVqWA_AVRUc/view
> 3. Check whether using SVE helps small/medium copies
>
> Run memcpy-random benchmark to see whether it is faster to use SVE for small
> cases or just the SIMD copy on your uarch.
>
Thanks for the memcpy-random benchmark info.
For small/medium copies, I needed to remove the BTI macro from the ASM ENTRY in order
to see a distinct performance difference between ASIMD and SVE.
I'll post the patch [14] with the A64FX second patch.
Also, on the A64FX as well as on a ThunderX2 machine, memcpy-random
didn't start due to an mprotect error, so I needed to fix memcpy-random [15].
If this fix is right, I'll post the patch [15] with the A64FX second patch.
[14] https://github.com/NaohiroTamura/glibc/commit/07ea389846c7c63622b6c0b3aaead3f93e21f356
[15] https://github.com/NaohiroTamura/glibc/commit/ec0b55a855529f75bd6f280e59dc2b1c25640490
> 4. Avoid making the code too generic or too specialized
>
> I see both appearing in the code - trying to deal with different cacheline sizes and
> different vector lengths, and also splitting these out into separate cases. If you
> depend on a particular cacheline size, specialize the code for that and check the
> size in the ifunc selector (as various memsets do already). If you want to handle
> multiple vector sizes, just use a register for the increment rather than repeating
> the same code several times for each vector length.
>
In terms of the cache line size, A64FX is not configurable; it is fixed at 256 bytes.
I've already removed the code that queries it [16][17].
[16] https://github.com/NaohiroTamura/glibc/commit/4bcc6d83c970f7a7283abfec753ecf6b697cf6f7
[17] https://github.com/NaohiroTamura/glibc/commit/f2b2c1ca03b50d414e03411ed65e4b131615e865
In terms of Vector Length, I'll remove the code for the VL 256-bit and VL 128-bit cases,
because the Vector Length agnostic code covers both.
> 5. Odd prefetches
>
> I have a hard time believing first prefetching the data to be written, then clearing it
> using DC ZVA (???), then prefetching the same data a 2nd time, before finally
> write the loaded data is helping performance...
> Generally hardware prefetchers are able to do exactly the right thing since
> memcpy is trivial to prefetch.
> So what is the performance gain of each prefetch/clear step? What is the
> difference between memcpy and memmove performance (given memmove
> doesn't do any of this)?
Sorry, the memcpy prefetch code was not right; I noticed this bug and fixed it
soon after posting the first patch [18].
Basically, "prfm pstl1keep, [dest_ptr, tmp1]" should be "prfm pldl2keep, [src_ptr, tmp1]".
[18] https://github.com/NaohiroTamura/glibc/commit/f5bf15708830f91fb886b15928158db2e875ac88
Without DC_ZVA and L2 prefetch, memcpy and memset performance degraded above 4MB.
Please compare [19] with [22] for memcpy, and [21] with [24] for memset.
Without DC_ZVA and L2 prefetch, memmove didn't degrade above 4MB.
Please compare [20] with [23].
The reason why I didn't implement DC_ZVA and L2 prefetch in memmove is that memmove calls memcpy in
most cases, and the memmove code only handles backward copy.
Maybe most of the memmove-large benchtest cases are backward copies; I need to check.
DC_ZVA and L2 prefetch have to be used as a pair; DC_ZVA alone or L2 prefetch alone doesn't get any improvement.
With DC_VZA and L2 prefetch:
[19] https://drive.google.com/file/d/1mmYaLwzEoytBJZ913jaWmucL0j564Ta7/view
[20] https://drive.google.com/file/d/1Bc_DVGBcDRpvDjxCB_2yOk3MOy5BEiOs/view
[21] https://drive.google.com/file/d/19cHvU2lxF28DW9_Z5_5O6gOOdUmVz_ps/view
Without DC_VZA and L2 prefetch:
[22] https://drive.google.com/file/d/1My6idNuQsrsPVODl0VrqiRbMR9yKGsGS/view
[23] https://drive.google.com/file/d/1q8KhvIqDf27fJ8HGWgjX0nBhgPgGBg_T/view
[24] https://drive.google.com/file/d/1l6pDhuPWDLy5egQ6BhRIYRvshvDeIrGl/view
Thanks.
Naohiro
* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
2021-04-13 12:07 ` naohirot
@ 2021-04-14 16:02 ` Wilco Dijkstra via Libc-alpha
2021-04-15 12:20 ` naohirot
` (5 more replies)
0 siblings, 6 replies; 72+ messages in thread
From: Wilco Dijkstra via Libc-alpha @ 2021-04-14 16:02 UTC (permalink / raw)
To: naohirot@fujitsu.com; +Cc: Szabolcs Nagy, 'GNU C Library'
Hi Naohiro,
Thanks for the comprehensive reply, especially the graphs are quite useful!
(I'd avoid adding generic_memcpy/memmove though since those are unoptimized C
implementations).
> OK, I'll try to remove unnecessary code which doesn't contribute performance gain
> based on benchtests performance data.
Yes that is a good idea - you could also check whether the software pipelining actually
helps on an OoO core (it shouldn't) since that contributes a lot to the complexity and the
amount of code and unrolling required.
It is also possible to remove a lot of unnecessary code - eg. rather than use 2 instructions
per prefetch, merge the constant offset in the prefetch instruction itself (since they allow
up to 32KB offset). There are also lots of branches that skip a few instructions if a value is
zero, this is often counterproductive due to adding branch mispredictions.
> Memcpy/memmove uses 8, 4, 2 unrolls, and memset uses 32, 8, 4, 2 unrolls.
> This unroll configuration recorded the highest performance.
> In case that Memcpy/memmove uses 4 unrolls, and memset uses 4 unrolls,
> The performance degraded minus 5 to 15 Gbps/sec at the peak.
So this is the L(L1_vl_64) loop right? I guess the problem is the large number of
prefetches and all the extra code that is not strictly required (you can remove 5
redundant mov/cmp instructions from the loop). Also assuming prefetching helps
here (the good memmove results suggest it's not needed), prefetching directly
into L1 should be better than first into L2 and then into L1. So I don't see a good
reason why 4x unrolling would have to be any slower.
> Yes, I implemented for the case of 1 byte to 512 byte [9][10].
> SVE code seems faster than ASIMD in small/medium range too [11][12][13].
That adds quite a lot of code and uses a slow linear chain of comparisons. A small
loop like used in the memset should work fine to handle copies smaller than
256 or 512 bytes (you can handle the zero bytes case for free in this code rather
than special casing it).
> For small/medium copies, I needed to remove BTI macro from ASM ENTRY in order
> to see the distinct performance difference between ASIMD and SVE.
> I'll post the patch [14] with the A64FX second patch.
I'm not sure I understand - the BTI macro just emits a NOP hint so it is harmless. We always emit
it so that it works seamlessly when BTI is enabled.
> And also somehow on A64FX as well as on ThunderX2 machine, memcpy-random
> didn't start due to mprotect error.
Yes it looks like the size isn't rounded up to a pagesize. It really needs the extra space, so
changing +4096 into getpagesize () will work.
> Without DC_ZVA and L2 prefetch, memcpy and memset performance degraded above 4MB.
> DC_ZVA and L2 prefetch have to be used as a pair; DC_ZVA alone or L2 prefetch alone doesn't get any improvement.
That seems odd. Was that using the L1 prefetch with the L2 distance? It seems to me one of the L1 or L2
prefetches is unnecessary. Also why would the DC_ZVA need to be done so early? It seems to me that
cleaning the cacheline just before you write it works best since that avoids accidentally replacing it.
> Without DC_ZVA and L2 prefetch, memmove didn't degrade above 4MB.
>
> The reason why I didn't implement DC_ZVA and L2 prefetch is that memmove calls memcpy in
> most cases, and the memmove code only handles backward copy.
> Maybe most of the memmove-large benchtest cases are backward copies; I need to check.
Most of the memmove tests do indeed overlap (so DC_ZVA does not work). However it also shows
that it performs well across the L2 cache size range without any prefetch or DC_ZVA.
Cheers,
Wilco
* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
2021-04-14 16:02 ` Wilco Dijkstra via Libc-alpha
@ 2021-04-15 12:20 ` naohirot
2021-04-20 16:00 ` Wilco Dijkstra via Libc-alpha
2021-04-19 2:51 ` naohirot
` (4 subsequent siblings)
5 siblings, 1 reply; 72+ messages in thread
From: naohirot @ 2021-04-15 12:20 UTC (permalink / raw)
To: 'Wilco Dijkstra'; +Cc: Szabolcs Nagy, 'GNU C Library'
Hi Wilco-san,
Thanks for the detailed technical review!!
Now we have several topics to discuss.
So let me focus on the BTI in this mail. I'll answer other topics in later mail.
> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
>
> Thanks for the comprehensive reply, especially the graphs are quite useful!
> (I'd avoid adding generic_memcpy/memmove though since those are unoptimized
> C implementations).
OK, I'll withdraw the patch from the A64FX patch V2.
> > For small/medium copies, I needed to remove BTI macro from ASM ENTRY
> > in order to see the distinct performance difference between ASIMD and SVE.
> > I'll post the patch [14] with the A64FX second patch.
>
> I'm not sure I understand - the BTI macro just emits a NOP hint so it is harmless.
> We always emit it so that it works seamlessly when BTI is enabled.
Yes, I observed that just "hint #0x22" is inserted.
The benchtest results show that A64FX performance for sizes less than 100B is
slower than ASIMD with BTI, but faster than ASIMD without BTI.
Also, A64FX performance at 512B is 4 Gbps slower with BTI than without BTI.
With BTI, source code [4]
[1] https://drive.google.com/file/d/1LlyQOq7qT4d0-54uVzUtYMMMDgIiddEj/view
[2] https://drive.google.com/file/d/1C2pl-Iz_-18mkpuQTk1PhEHKsd5x0wWo/view
[3] https://drive.google.com/file/d/1eg_p1_b619KN7XLmOpxqcoI3c9o4WXd-/view
[4] https://github.com/NaohiroTamura/glibc/commit/0f45fff654d7a31b58e5d6f4dbfa31d6586f8cc2
Without BTI, source code [8]
[5] https://drive.google.com/file/d/1Mf7wxwgGb5yYBJo1eUxqvjrkp9O4EVVJ/view
[6] https://drive.google.com/file/d/1rgfFmWsM4Q3oDK8aYa_GjEQWttS0pOBF/view
[7] https://drive.google.com/file/d/1hF7oevP-MERrQ04yajtEUY8CSWe8V2EX/view
[8] https://github.com/NaohiroTamura/glibc/commit/c204a74971b3d34680964bc52ac59264b14527e3
I executed the same test on ThunderX2, and the results showed very little difference
between with BTI and without BTI, as you mentioned.
So if this distinct degradation happens only on A64FX, I'd like to add another
ENTRY macro in sysdeps/aarch64/sysdep.h such as:
#define ENTRY_ALIGN_NO_BTI(name, align) \
.globl C_SYMBOL_NAME(name); \
.type C_SYMBOL_NAME(name),%function; \
.p2align align; \
C_LABEL(name) \
cfi_startproc; \
CALL_MCOUNT
Or I'd like to change memcpy_a64fx.S and memset_a64fx.S to not use the ENTRY macro, such as:
.globl __memcpy_a64fx
.type __memcpy_a64fx, %function
.p2align 6
__memcpy_a64fx:
cfi_startproc
CALL_MCOUNT
What do you think?
> > And also somehow on A64FX as well as on ThunderX2 machine,
> > memcpy-random didn't start due to mprotect error.
>
> Yes it looks like the size isn't rounded up to a pagesize. It really needs the extra
> space, so changing +4096 into getpagesize () will work.
OK, I've already applied it [8].
Thanks!
Naohiro
* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
2021-04-14 16:02 ` Wilco Dijkstra via Libc-alpha
2021-04-15 12:20 ` naohirot
@ 2021-04-19 2:51 ` naohirot
2021-04-19 14:57 ` Wilco Dijkstra via Libc-alpha
2021-04-19 12:43 ` naohirot
` (3 subsequent siblings)
5 siblings, 1 reply; 72+ messages in thread
From: naohirot @ 2021-04-19 2:51 UTC (permalink / raw)
To: 'Wilco Dijkstra'; +Cc: Szabolcs Nagy, 'GNU C Library'
Hi Wilco-san,
Let me focus on the macro "shortcut_for_small_size" for small/medium copies (less than
512 bytes) in this mail.
> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
> > Yes, I implemented for the case of 1 byte to 512 byte [9][10].
> > SVE code seems faster than ASIMD in small/medium range too [11][12][13].
>
> That adds quite a lot of code and uses a slow linear chain of comparisons. A small
> loop like used in the memset should work fine to handle copies smaller than
> 256 or 512 bytes (you can handle the zero bytes case for free in this code rather
> than special casing it).
>
I compared performance of the size less than 512 byte for the following five
implementation cases.
CASE 1: linear chain
As mentioned in the reply [0], I removed BTI_J [1], but the macro "shortcut_for_small_size"
stays a linear chain [2].
A64FX performance is 4-14 Gbps [3].
The other arch implementations call BTI_J, so their performance is degraded.
[0] https://sourceware.org/pipermail/libc-alpha/2021-April/125079.html
[1] https://github.com/NaohiroTamura/glibc/commit/7d7217b518e59c78582ac4e89cae725cf620877e
[2] https://github.com/NaohiroTamura/glibc/blob/7d7217b518e59c78582ac4e89cae725cf620877e/sysdeps/aarch64/multiarch/memcpy_a64fx.S#L176-L267
[3] https://drive.google.com/file/d/16qo7N05W526H9j7_9qjm-_Q7gZmOXwpY/view
CASE 2: whilelt loop such as memset
I tested a "whilelt loop" implementation instead of the macro "shortcut_for_small_size",
and after having tested it, I commented the "whilelt loop" implementation out [4].
Comparing with CASE 1, A64FX performance degraded from 4-14 Gbps to 3-10 Gbps [5].
Please note that the "whilelt loop" implementation cannot be used for memmove,
because it doesn't work for backward copy.
On the other hand, the macro "shortcut_for_small_size" works for backward copy, because
it loads up to all 512 bytes of data into the z0 to z7 SVE registers at once, and then stores all the data.
[4] https://github.com/NaohiroTamura/glibc/commit/77d1da301f8161c74875b0314cae34be8cb33477#diff-03552f8369653866548b20e7867272a645fa2129c700b78fdfafe5a0ff6a259eR308-R318
[5] https://drive.google.com/file/d/1xdw7mr0c90VupVkQwelFafQHNkXslCwv/view
CASE 3: binary tree chain
I updated the macro "shortcut_for_small_size" to use a binary tree chain [6][7].
Comparing with CASE 1, sizes less than 96 bytes degraded from 4.0-6.0 Gbps
to 2.5-5.0 Gbps, but size 512 bytes improved from 14.0 Gbps to 17.5 Gbps [8].
[6] https://github.com/NaohiroTamura/glibc/commit/5c17af8c57561ede5ed2c2af96c9efde4092f02f
[7] https://github.com/NaohiroTamura/glibc/blob/5c17af8c57561ede5ed2c2af96c9efde4092f02f/sysdeps/aarch64/multiarch/memcpy_a64fx.S#L177-L204
[8] https://drive.google.com/file/d/13w8yKdeLpVbp-uJmCttKBKtScya1tXqP/view
CASE 4: binary tree chain except up to 64 bytes
I handled sizes up to 64 bytes separately so that they return quickly [9].
Comparing with CASE 3, sizes less than 64 bytes improved from 2.5 Gbps to
4.0 Gbps, but size 512 bytes degraded from 17.5 Gbps to 16.5 Gbps [10].
[9] https://github.com/NaohiroTamura/glibc/commit/77d1da301f8161c74875b0314cae34be8cb33477#diff-03552f8369653866548b20e7867272a645fa2129c700b78fdfafe5a0ff6a259eR177-R184
[10] https://drive.google.com/file/d/1lFsjns9g_7fySAsvx_RVS9o6HSrk6ir9/view
CASE 5: binary tree chain except up to 128 bytes
I handled sizes up to 128 bytes separately so that they return quickly [11].
Comparing with CASE 4, sizes less than 128 bytes improved from 4.0-6.0 Gbps
to 4.0-7.0 Gbps, but size 512 bytes degraded from 16.5 Gbps to 16.0 Gbps [12].
[11] https://github.com/NaohiroTamura/glibc/commit/fefc59f01ecfd6a207fe261de5ab133f4409d687#diff-03552f8369653866548b20e7867272a645fa2129c700b78fdfafe5a0ff6a259eR184-R195
[12] https://drive.google.com/file/d/1HS277_qQUuEeZqLUo0H2XRlFhOhIdI_o/view
In conclusion, I'd like to adopt the CASE 5 implementation, considering the
performance balance between small sizes (less than 128 bytes) and medium sizes
(close to 512 bytes).
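The linear-chain versus binary-tree trade-off in CASE 1 and CASE 3 can be modeled with a toy dispatcher (hypothetical 64-byte size classes, not the actual shortcut_for_small_size macro): both pick the same bucket, but the tree bounds the number of size comparisons.

```c
#include <assert.h>

/* Toy model of the two dispatch structures.  `cmps' counts how many size
   comparisons each scheme performs before reaching its bucket.  */
static int cmps;

/* CASE 1 style: linear chain of compares, up to 8 for a 512-byte copy. */
static int linear_class(unsigned n)
{
    cmps = 0;
    for (unsigned lim = 64; lim <= 512; lim += 64) {
        cmps++;
        if (n <= lim)
            return (int) (lim / 64) - 1;
    }
    return -1;                  /* n > 512: would fall through to large path */
}

/* CASE 3 style: binary tree, at most 3 compares for the same buckets. */
static int tree_class(unsigned n)
{
    cmps = 1;
    if (n <= 256) {
        cmps++;
        if (n <= 128) { cmps++; return n <= 64 ? 0 : 1; }
        cmps++;
        return n <= 192 ? 2 : 3;
    }
    cmps++;
    if (n <= 384) { cmps++; return n <= 320 ? 4 : 5; }
    cmps++;
    return n <= 448 ? 6 : 7;
}
```

The tree helps sizes near 512 bytes (3 compares instead of 8) at the cost of one extra compare for the smallest sizes, which is exactly the balance CASE 4/5 tune by peeling off the small buckets first.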
Thanks.
Naohiro
* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
2021-04-14 16:02 ` Wilco Dijkstra via Libc-alpha
2021-04-15 12:20 ` naohirot
2021-04-19 2:51 ` naohirot
@ 2021-04-19 12:43 ` naohirot
2021-04-20 3:31 ` naohirot
` (2 subsequent siblings)
5 siblings, 0 replies; 72+ messages in thread
From: naohirot @ 2021-04-19 12:43 UTC (permalink / raw)
To: 'Wilco Dijkstra'; +Cc: Szabolcs Nagy, 'GNU C Library'
Hi Wilco-san,
Let me focus on L1_prefetch in this mail.
> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
> > Memcpy/memmove uses 8, 4, 2 unrolls, and memset uses 32, 8, 4, 2 unrolls.
> > This unroll configuration recorded the highest performance.
When I tested "4 unrolls", I modified the source code [1][2] from the mail [0]
as follows:
In the case of memcpy,
I commented out L(unroll8) and L(unroll2), and left L(unroll4), L(unroll1) and L(last).
In the case of memmove,
I commented out L(bwd_unroll8) and L(bwd_unroll2), and left L(bwd_unroll4), L(bwd_unroll1) and L(bwd_last).
In the case of memset,
I commented out L(unroll32), L(unroll8) and L(unroll2), and left L(unroll4), L(unroll1) and L(last).
[0] https://sourceware.org/pipermail/libc-alpha/2021-April/125002.html
[1] https://github.com/NaohiroTamura/glibc/blob/ec0b55a855529f75bd6f280e59dc2b1c25640490/sysdeps/aarch64/multiarch/memcpy_a64fx.S
[2] https://github.com/NaohiroTamura/glibc/blob/ec0b55a855529f75bd6f280e59dc2b1c25640490/sysdeps/aarch64/multiarch/memset_a64fx.S
> > In case that Memcpy/memmove uses 4 unrolls, and memset uses 4 unrolls,
> > The performance degraded minus 5 to 15 Gbps/sec at the peak.
>
> So this is the L(L1_vl_64) loop right? I guess the problem is the large number of
So this is NOT the L(L1_vl_64) loop, but L(vl_agnostic).
> prefetches and all the extra code that is not strictly required (you can remove 5
> redundant mov/cmp instructions from the loop). Also assuming prefetching helps
> here (the good memmove results suggest it's not needed), prefetching directly
> into L1 should be better than first into L2 and then into L1. So I don't see a good
> reason why 4x unrolling would have to be any slower.
I tried removing L(L1_prefetch) from both memcpy and memset, and also
tried removing only the L2 prefetch instructions (prfm pstl2keep and pldl2keep) in
L(L1_prefetch) from both memcpy and memset.
In the case of memcpy, both removing L(L1_prefetch) [3] and removing the L2 prefetch
instruction from L(L1_prefetch) increased the performance in the 64KB-4MB size range
from 18-20 GB/sec [4] to 20-22 GB/sec [5].
[3] https://github.com/NaohiroTamura/glibc/commit/22612299247e64dbffd62aa186513bde7328d104
[4] https://drive.google.com/file/d/1hGWz4eAYWc1ktdw74rzDPxtQQ48P0-Hv/view
[5] https://drive.google.com/file/d/11Pt1mWSCN2LBPHxXUE-rs7Q6JhtBfpyQ/view
In the case of memset, removing L(L1_prefetch) [6] decreased the performance in the
128KB-4MB size range from 22-24 GB/sec [7] to 20-22 GB/sec [8].
But removing only the L2 prefetch instruction (prfm pstl2keep) in L(L1_prefetch) [9] kept
the performance in the 128KB-4MB size range at 22-24 GB/sec [10].
[6] https://github.com/NaohiroTamura/glibc/blob/22612299247e64dbffd62aa186513bde7328d104/sysdeps/aarch64/multiarch/memset_a64fx.S#L146-L163
I commented out L146-L163 but didn't commit this change, because it decreased the performance.
[7] https://drive.google.com/file/d/1MT1d2aBxSoYrzQuRZtv4U9NCXV4ZwHsJ/view
[8] https://drive.google.com/file/d/1qUzYklLvgXTZbP1wm9n4VryF3bgUOplo/view
[9] https://github.com/NaohiroTamura/glibc/commit/cc478c96bac051c9b98b9d9a1ae6f38326f77645
[10] https://drive.google.com/file/d/1bPKHFWyhzNWXX7A_S6_UpZ2BwP2QAJK4/view
In conclusion, I'll remove L(L1_prefetch) from memcpy [3] and remove the L2 prefetch
instruction (prfm pstl2keep) from L(L1_prefetch) in memset [9].
Thanks.
Naohiro
* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
2021-04-19 2:51 ` naohirot
@ 2021-04-19 14:57 ` Wilco Dijkstra via Libc-alpha
2021-04-21 10:10 ` naohirot
0 siblings, 1 reply; 72+ messages in thread
From: Wilco Dijkstra via Libc-alpha @ 2021-04-19 14:57 UTC (permalink / raw)
To: naohirot@fujitsu.com; +Cc: Szabolcs Nagy, 'GNU C Library'
Hi Naohiro,
> Let me focus on the macro " shortcut_for_small_size" for small/medium, less than
> 512 byte in this mail.
Yes, one subject at a time is a good idea.
> Comparing with the CASE 1, A64FX performance degraded from 4-14 Gbps to 3-10 Gbps [5].
> Please notice that "whilelt loop" implementation cannot be used for memmove,
> because it doesn't work for backward copy.
Indeed, the memmove code would need a similar loop but backwards. However it sounds like
small loops are not efficient (possibly a high taken branch penalty), so it's not a good option.
> In conclusion, I'd like to adopt the CASE 5 implementation, considering the
> performance balance between the small size (less than 128 byte) and medium size
> (close to 512 byte).
Yes something like this would work. I would strip out any unnecessary instructions and merge
multiple cases to avoid branches as much as possible. For example start memcpy like this:
memcpy:
cntb vector_length
whilelo p0.b, xzr, n // gives a free ptrue for N >= VL
whilelo p1.b, vector_length, n
b.last 1f
ld1b z0.b, p0/z, [src]
ld1b z1.b, p1/z, [src, #1, mul vl]
st1b z0.b, p0, [dest]
st1b z1.b, p1, [dest, #1, mul vl]
ret
The proposed case 5 uses 13 instructions up to 64 bytes and 19 up to 128, the above
does 0-127 bytes in 9 instructions. You can see the code is perfectly balanced, with
4 load/store instructions, 3 ALU instructions and 2 branches.
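The whilelo trick can be modeled in scalar C (assuming VL = 64 bytes as on A64FX; `pred_copy` models one whilelo predicate plus a predicated ld1b/st1b pair, and the early return models the b.last branch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum { VL = 64 };   /* modeled SVE vector length in bytes */

/* Predicated copy: byte lane i is active iff base + i < n, which is what a
   "whilelo" predicate plus predicated ld1b/st1b give in hardware.  */
static void pred_copy(unsigned char *d, const unsigned char *s,
                      size_t base, size_t n)
{
    for (size_t i = 0; i < VL; i++)
        if (base + i < n)
            d[base + i] = s[base + i];
}

/* Fast path for 0..2*VL-1 bytes; returns false when the last lane of the
   second predicate is active (the b.last case), i.e. n >= 2*VL.  */
static bool copy_upto_2vl(unsigned char *d, const unsigned char *s, size_t n)
{
    if (VL + (VL - 1) < n)      /* last element of p1 active: n >= 2*VL */
        return false;           /* fall through to the larger-copy code */
    pred_copy(d, s, 0, n);      /* z0 with p0 */
    pred_copy(d, s, VL, n);     /* z1 with p1, [src, #1, mul vl] */
    return true;
}
```

Note how n = 0 and n >= VL need no special casing: the predicates simply have no (or all) lanes active.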
Rather than doing a complex binary search, we can use the same trick to merge the code
for 128-256 and 256-512. So overall we only need 2 comparisons which we can write like:
cmp n, vector_length, lsl 3
Like I mentioned before, it is a really good idea to run bench-memcpy-random since it
will clearly show issues with branch prediction on small copies. For memcpy and related
functions you want to minimize branches and only use branches that are heavily biased.
Cheers,
Wilco
* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
2021-04-14 16:02 ` Wilco Dijkstra via Libc-alpha
` (2 preceding siblings ...)
2021-04-19 12:43 ` naohirot
@ 2021-04-20 3:31 ` naohirot
2021-04-20 14:44 ` Wilco Dijkstra via Libc-alpha
2021-04-20 5:49 ` naohirot
2021-04-23 13:22 ` naohirot
5 siblings, 1 reply; 72+ messages in thread
From: naohirot @ 2021-04-20 3:31 UTC (permalink / raw)
To: 'Wilco Dijkstra'; +Cc: Szabolcs Nagy, 'GNU C Library'
Hi Wilco-san,
Let me focus on DC_ZVA and L1/L2 prefetch in this mail.
> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
> > Without DC_ZVA and L2 prefetch, memcpy and memset performance degraded
> > over 4MB.
>
> > DC_ZVA and L2 prefetch have to be used as a pair; DC_ZVA alone or L2 prefetch
> > alone doesn't get any improvement.
>
> That seems odd. Was that using the L1 prefetch with the L2 distance? It seems to
> me one of the L1 or L2 prefetches is unnecessary.
I tested the following 4 cases.
The result was that Case 4 is the best;
Case 2 and Case 3 were almost the same as Case 1.
Case 4 [1] improved the performance for sizes above 4MB from Case 1's
7.5-10 GB/sec [2] to 10-10.5 GB/sec [3].
Case 1: DC_ZVA + L1 prefetch + L2 prefetch [2]
Case 2: DC_ZVA + L1 prefetch
Case 3: DC_ZVA + L2 prefetch
Case 4: DC_ZVA only [3]
[1] https://github.com/NaohiroTamura/glibc/commit/d57bed764a45383dfea8265d6a384646f4f07eed
[2] https://drive.google.com/file/d/1ws3lTLzMFK3lLrrwxVFvriERrs-IKdP9/view
[3] https://drive.google.com/file/d/1g7nuFOtkFw3b5INcAfuuv2lVODmASm-G/view
> Also why would the DC_ZVA
> need to be done so early? It seems to me that cleaning the cacheline just before
> you write it works best since that avoids accidentally replacing it.
>
Yes, I moved it closer, please look at the change [1].
> > Without DC_ZVA and L2 prefetch, memmove didn't degrade over 4MB.
> >
> > The reason why I didn't implement DC_ZVA and L2 prefetch is that
> > memmove calls memcpy in most cases, and memmove code only handles
> > backward copy.
> > Maybe most of memmove-large benchtest cases are backward copy, I need to
> > check.
>
> Most of the memmove tests do indeed overlap (so DC_ZVA does not work).
> However it also shows that it performs well across the L2 cache size range
> without any prefetch or DC_ZVA.
That's right, I confirmed that only DC_ZVA was necessary [1].
Next, I'll remove redundant instructions.
Thanks.
Naohiro
* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
2021-04-14 16:02 ` Wilco Dijkstra via Libc-alpha
` (3 preceding siblings ...)
2021-04-20 3:31 ` naohirot
@ 2021-04-20 5:49 ` naohirot
2021-04-20 11:39 ` Wilco Dijkstra via Libc-alpha
2021-04-23 13:22 ` naohirot
5 siblings, 1 reply; 72+ messages in thread
From: naohirot @ 2021-04-20 5:49 UTC (permalink / raw)
To: 'Wilco Dijkstra'; +Cc: Szabolcs Nagy, 'GNU C Library'
Hi Wilco-san,
Let me focus on removing redundant instructions in this mail.
> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
> It is also possible to remove a lot of unnecessary code - eg. rather than use 2
> instructions per prefetch, merge the constant offset in the prefetch instruction
> itself (since they allow up to 32KB offset). There are also lots of branches that
> skip a few instructions if a value is zero, this is often counterproductive due to
> adding branch mispredictions.
I removed redundant instructions using cbz and prfm offset address [1][2].
[1] https://github.com/NaohiroTamura/glibc/commit/94363b4ab2e5b4b29843a47a6970b9645a8e4eeb
[2] https://github.com/NaohiroTamura/glibc/commit/4648eb559e46d978ded65d40c6bf8c38dd2519d7
Thanks.
Naohiro
* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
2021-04-20 5:49 ` naohirot
@ 2021-04-20 11:39 ` Wilco Dijkstra via Libc-alpha
2021-04-27 11:03 ` naohirot
0 siblings, 1 reply; 72+ messages in thread
From: Wilco Dijkstra via Libc-alpha @ 2021-04-20 11:39 UTC (permalink / raw)
To: naohirot@fujitsu.com; +Cc: Szabolcs Nagy, 'GNU C Library'
Hi Naohiro,
> I removed redundant instructions using cbz and prfm offset address [1][2].
>
> [1] https://github.com/NaohiroTamura/glibc/commit/94363b4ab2e5b4b29843a47a6970b9645a8e4eeb
> [2] https://github.com/NaohiroTamura/glibc/commit/4648eb559e46d978ded65d40c6bf8c38dd2519d7
For the first 2 CBZ cases in both [1] and [2], the fastest option is to use ANDS+BEQ. ANDS
requires only 1 ALU operation, while AND+CBZ uses 2 ALU operations on A64FX.
Wilco
* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
2021-04-20 3:31 ` naohirot
@ 2021-04-20 14:44 ` Wilco Dijkstra via Libc-alpha
2021-04-27 9:01 ` naohirot
0 siblings, 1 reply; 72+ messages in thread
From: Wilco Dijkstra via Libc-alpha @ 2021-04-20 14:44 UTC (permalink / raw)
To: naohirot@fujitsu.com; +Cc: Szabolcs Nagy, 'GNU C Library'
Hi Naohiro,
> Case 4 [1] improved the performance in the size range more than 4MB from Case 1
> 7.5-10 GB/sec [2] to 10-10.5 GB/sec [3].
>
> Case 1: DC_ZVA + L1 prefetch + L2 + prefetch [2]
> Case 2: DC_ZVA + L1 prefetch
> Case 3: DC_ZVA + L2 prefetch
> Case 4: DC_ZVA only [3]
That is great news - it simplifies the loop a lot, and it is faster too!
>> Also why would the DC_ZVA
>> need to be done so early? It seems to me that cleaning the cacheline just before
>> you write it works best since that avoids accidentally replacing it.
>>
>
> Yes, I moved it closer, please look at the change [1].
What I meant is, why is ZF_DIST so huge? I don't see how that helps. Is there any penalty
if we did it like this (or possibly with 1-2 cachelines offset)?
dc zva, dest_ptr
st1b z0.b, p0, [dest_ptr, #0, mul vl]
st1b z1.b, p0, [dest_ptr, #1, mul vl]
st1b z2.b, p0, [dest_ptr, #2, mul vl]
st1b z3.b, p0, [dest_ptr, #3, mul vl]
This would remove almost all initialization code from the start of L(L2_dc_zva).
Cheers,
Wilco
* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
2021-04-15 12:20 ` naohirot
@ 2021-04-20 16:00 ` Wilco Dijkstra via Libc-alpha
2021-04-27 11:58 ` naohirot
0 siblings, 1 reply; 72+ messages in thread
From: Wilco Dijkstra via Libc-alpha @ 2021-04-20 16:00 UTC (permalink / raw)
To: naohirot@fujitsu.com; +Cc: Szabolcs Nagy, 'GNU C Library'
Hi Naohiro,
> Yes, I observed that just "hint #0x22" is inserted.
> The benchtest results show that for sizes of less than 100B the A64FX performance
> with BTI is slower than ASIMD, but without BTI it is faster than ASIMD.
> And at 512B the A64FX performance with BTI is about 4 GB/sec slower than without BTI.
That's unfortunate - it seems like the hint is very slow, maybe even serializing...
We can work around it for now in GLIBC, but at some point distros will start to insert
BTI instructions by default, and then the performance hit will be bad.
> So if distinct degradation happens only on A64FX, I'd like to add another
> ENTRY macro in sysdeps/aarch64/sysdep.h such as:
I think the best option for now is to change BTI_C into NOP if AARCH64_HAVE_BTI
is not set. This avoids creating alignment issues in existing code (which is written
to assume the hint is present) and works for all string functions.
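A minimal sketch of the suggested change (the macro and configure-symbol names are assumptions; the actual patch may differ):

```c
/* Sketch for sysdeps/aarch64/sysdep.h: emit the BTI landing pad only
   when glibc itself is built with BTI; otherwise keep a same-size NOP
   so the code alignment of existing string functions is unchanged.  */
#if HAVE_AARCH64_BTI
# define BTI_C	hint	34	/* bti c */
#else
# define BTI_C	nop
#endif
```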
Cheers,
Wilco
* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
2021-04-19 14:57 ` Wilco Dijkstra via Libc-alpha
@ 2021-04-21 10:10 ` naohirot
2021-04-21 15:02 ` Wilco Dijkstra via Libc-alpha
0 siblings, 1 reply; 72+ messages in thread
From: naohirot @ 2021-04-21 10:10 UTC (permalink / raw)
To: 'Wilco Dijkstra'; +Cc: Szabolcs Nagy, 'GNU C Library'
Hi Wilco-san,
This mail is a continuation of the "shortcut_for_small_size" macro discussion for
small/medium copies of less than 512 bytes.
> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
> Yes something like this would work. I would strip out any unnecessary instructions
> and merge multiple cases to avoid branches as much as possible. For example
> start memcpy like this:
>
> memcpy:
> cntb vector_length
> whilelo p0.b, xzr, n // gives a free ptrue for N >= VL
> whilelo p1.b, vector_length, n
> b.last 1f
> ld1b z0.b, p0/z, [src]
> ld1b z1.b, p1/z, [src, #1, mul vl]
> st1b z0.b, p0, [dest]
> st1b z1.b, p1, [dest, #1, mul vl]
> ret
>
> The proposed case 5 uses 13 instructions up to 64 bytes and 19 up to 128, the
> above does 0-127 bytes in 9 instructions. You can see the code is perfectly
> balanced, with
> 4 load/store instructions, 3 ALU instructions and 2 branches.
>
> Rather than doing a complex binary search, we can use the same trick to merge
> the code for 128-256 and 256-512. So overall we only need 2 comparisons which
> we can write like:
>
> cmp n, vector_length, lsl 3
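As a side note, the WHILELO behavior the dispatch above relies on can be modeled in C. This is a sketch under the assumption of a 512-bit vector length (VL = 64 bytes, as on A64FX); lane i of the predicate is active iff base + i < n.

```c
#include <assert.h>
#include <stdbool.h>

#define VL 64  /* assumed vector length in bytes (A64FX SVE: 512 bits) */

/* Model of WHILELO pX.b, base, n: lane i is active iff base + i < n. */
static void whilelo(bool p[VL], unsigned long base, unsigned long n)
{
    for (int i = 0; i < VL; i++)
        p[i] = (base + (unsigned long)i) < n;
}

/* Count active lanes, i.e. how many bytes a predicated LD1B/ST1B
   with this predicate would actually touch.  */
static int active(const bool p[VL])
{
    int c = 0;
    for (int i = 0; i < VL; i++)
        c += p[i];
    return c;
}
```

This shows why `whilelo p0.b, xzr, n` gives a "free ptrue" whenever n >= VL: with base 0, all 64 lanes satisfy base + i < n, so p0 is all-true and the tail predicates trim only the final partial vector.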
It's a really smart way, isn't it? 😊
I re-implemented the "shortcut_for_small_size" macro using WHILELO; please
check [1][2] whether I understood correctly.
The performance of the "whilelo dispatch" [3] is almost the same as that of the
"binary tree dispatch" [4], but I notice that there are dips at 128 bytes and at 256 bytes [3].
[1] https://github.com/NaohiroTamura/glibc/commit/7491bcb36e5c497e509d35b1378fcc663595c2d0
[2] https://github.com/NaohiroTamura/glibc/blob/7491bcb36e5c497e509d35b1378fcc663595c2d0/sysdeps/aarch64/multiarch/memcpy_a64fx.S#L129-L174
[3] https://drive.google.com/file/d/10S6doDFiVtveqRZs-366E_yDzefe-zBS/view
[4] https://drive.google.com/file/d/1p5qPt0KLT4i3Iv_Uy9UT5zo0NetXK-RZ/view
> Like I mentioned before, it is a really good idea to run bench-memcpy-random
> since it will clearly show issues with branch prediction on small copies. For
> memcpy and related functions you want to minimize branches and only use
> branches that are heavily biased.
I checked bench-memcpy-random [5], but it measures performance for sizes from
4KB to 512KB.
How do we know about branch issues for sizes of less than 512 bytes?
[5] https://drive.google.com/file/d/1cRwaN9vu9q2Zm8xW6l6hp0GxVB1ZY-Tm/view
Thanks.
Naohiro
* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
2021-04-21 10:10 ` naohirot
@ 2021-04-21 15:02 ` Wilco Dijkstra via Libc-alpha
2021-04-22 13:17 ` naohirot
0 siblings, 1 reply; 72+ messages in thread
From: Wilco Dijkstra via Libc-alpha @ 2021-04-21 15:02 UTC (permalink / raw)
To: naohirot@fujitsu.com; +Cc: Szabolcs Nagy, 'GNU C Library'
Hi Naohiro,
> It's really smart way, isn't it? 😊
Well that's the point of SVE!
> I re-implemented the macro " shortcut_for_small_size" using the whilelo, and
> please check it [1][2] if understood correctly.
Yes it works fine. You should still remove the check for zero at entry (which is really slow
and unnecessary) and the argument moves. L2 doesn't need the ptrue, all it needs
is MOV dest_ptr, dst.
> The performance of "whilelo dispatch" [3] is almost same as "binary tree dispatch" [4]
> but I notice that there are gaps at 128 byte and at 256 byte [3].
From what I can see, the new version is faster across the full range. It would be useful to show
both new and old in the same graph rather than separately. You can do that by copying the file
and use a different name for the functions. I do this all the time as it allows direct comparison
of several variants in one benchmark run.
That said, the dip at 256+64 looks fairly substantial. It could be throughput of WHILELO - to test
that you could try commenting out the long WHILELO sequence for the 256-512 byte case and
see whether it improves. If it is WHILELO, it is possible to remove 3x WHILELO from the earlier
cases by moving them after a branch (so that the 256-512 case only needs to execute 5x WHILELO
rather than 8 in total). Also it is worth checking if the 256-512 case beats jumping directly
to L(unroll4) - however note that code isn't optimized yet (eg. there is no need for complex
software pipelined loops since we can only iterate once!). If all that doesn't help, it may be
best to split into 256-384 and 384-512 so you only need 2x WHILELO.
> I checked bench-memcpy-random [5], but it measures the performance from the size
> 4K byte to 512K byte.
> How do we know the branch issue for less than 512 byte?
The size is the size of the memory region tested, not the size of the copies. The actual copies
are very small (90% are smaller than 128 bytes). The key is that it doesn't repeat the same copy
over and over so it's hard on the branch predictor just like in a real application.
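A hedged C sketch of that benchmark structure (not the actual benchtests code; all names here are made up): the region size only bounds the random offsets, while each individual copy stays small, so the branch predictor sees a varied stream of sizes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Issue many small copies at random offsets inside a region of
   region_size bytes.  Growing region_size stresses the caches and
   TLB, but the copy lengths themselves remain small.  */
static size_t random_copies(char *dst, const char *src,
                            size_t region_size, int iters)
{
    size_t total = 0;
    for (int i = 0; i < iters; i++) {
        size_t len = 1 + (size_t)rand() % 128;        /* mostly tiny copies */
        size_t off = (size_t)rand() % (region_size - len);
        memcpy(dst + off, src + off, len);
        total += len;
    }
    return total;
}
```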
Cheers,
Wilco
* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
2021-04-21 15:02 ` Wilco Dijkstra via Libc-alpha
@ 2021-04-22 13:17 ` naohirot
2021-04-23 0:58 ` naohirot
0 siblings, 1 reply; 72+ messages in thread
From: naohirot @ 2021-04-22 13:17 UTC (permalink / raw)
To: Wilco Dijkstra; +Cc: Szabolcs Nagy, 'GNU C Library'
Hi Wilco-san,
Thanks for your review and advice!
> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
> Yes it works fine. You should still remove the check for zero at entry (which is
> really slow and unnecessary) and the argument moves. L2 doesn't need the ptrue,
> all it needs is MOV dest_ptr, dst.
Yes, I cleaned them up [1].
[1] https://github.com/NaohiroTamura/glibc/commit/fbee8284f6cea9671554249816f3ab2a14abeade
> > The performance of "whilelo dispatch" [3] is almost same as "binary
> > tree dispatch" [4] but I notice that there are gaps at 128 byte and at 256 byte [3].
>
> From what I can see, the new version is faster across the full range. It would be
> useful to show both new and old in the same graph rather than separately. You can
> do that by copying the file and use a different name for the functions. I do this all
> the time as it allows direct comparison of several variants in one benchmark run.
Yes, I confirmed that "whilelo dispatch" is better than "binary tree dispatch".
I converted the JSON data from bench-memcpy.out into CSV using jq, and created Graph 1
in the Google Sheet [2].
$ cat bench-memcpy.out | jq -r '.functions.memcpy.results| sort_by(.length) | .[]|[.length, .align1, .align2, .timings[5], .length/.timings[5]] | @csv' > bench-memcpy.csv
[2] https://docs.google.com/spreadsheets/d/19XYE63defjFEHZVqciZdmcDrJLWkRfGmSagXlIV2F-c/edit?usp=sharing
> That said, the dip at 256+64 looks fairly substantial. It could be throughput of
> WHILELO - to test that you could try commenting out the long WHILELO sequence
> for the 256-512 byte case and see whether it improves.
I commented out the WHILELOs in the 256-512 byte case, and confirmed that it made the dip smaller [3].
[3] https://drive.google.com/file/d/13Q3OSUN3qXFiTNNkRVGnsNioUMEId1ge/view
> If it is WHILELO, it is
> possible to remove 3x WHILELO from the earlier cases by moving them after a
> branch (so that the 256-512 case only needs to execute 5x WHILELO rather than 8
> into total).
As shown in Graph 2 in the Google Sheet [2], this approach didn't shrink the dip;
I assume that is because, although we could remove two WHILELOs, we needed to add two PTRUEs.
I changed the code [1] like the following diff.
$ git diff
diff --git a/sysdeps/aarch64/multiarch/memcpy_a64fx.S b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
index 6d0ae1cd1f..2ae1f4e3b9 100644
--- a/sysdeps/aarch64/multiarch/memcpy_a64fx.S
+++ b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
@@ -139,12 +139,13 @@
1: // if rest > vector_length * 8
cmp n, vector_length, lsl 3 // vector_length * 8
b.hi \exit
+ cmp n, vector_length, lsl 2 // vector_length * 4
+ b.hi 1f
// if rest <= vector_length * 4
lsl tmp1, vector_length, 1 // vector_length * 2
whilelo p2.b, tmp1, n
incb tmp1
whilelo p3.b, tmp1, n
- b.last 1f
ld1b z0.b, p0/z, [src, #0, mul vl]
ld1b z1.b, p1/z, [src, #1, mul vl]
ld1b z2.b, p2/z, [src, #2, mul vl]
@@ -155,6 +156,8 @@
st1b z3.b, p3, [dest, #3, mul vl]
ret
1: // if rest <= vector_length * 8
+ ptrue p2.b
+ ptrue p3.b
lsl tmp1, vector_length, 2 // vector_length * 4
whilelo p4.b, tmp1, n
incb tmp1
> Also it is worth checking if the 256-512 case beats jumping directly to
> L(unroll4) - however note that code isn't optimized yet (eg. there is no need for
> complex software pipelined loops since we can only iterate once!).
I tried, but it didn't work for memmove, because L(unroll4) doesn't support
backward copy.
> If all that
> doesn't help, it may be best to split into 256-384 and 384-512 so you only need 2x
> WHILELO.
This approach [4] made the dip smaller, as shown in Graph 3 in the Google Sheet [2].
So it seems that this is the approach we should take.
[4] https://github.com/NaohiroTamura/glibc/commit/cbcb80e69325c16c6697c42627a6ca12c3245a86
> > I checked bench-memcpy-random [5], but it measures the performance
> > from the size 4K byte to 512K byte.
> > How do we know the branch issue for less than 512 byte?
>
> The size is the size of the memory region tested, not the size of the copies. The
> actual copies are very small (90% are smaller than 128 bytes). The key is that it
> doesn't repeat the same copy over and over so it's hard on the branch predictor
> just like in a real application.
I see, I'll take a look at the source code more thoroughly.
Thanks.
Naohiro
* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
2021-04-22 13:17 ` naohirot
@ 2021-04-23 0:58 ` naohirot
0 siblings, 0 replies; 72+ messages in thread
From: naohirot @ 2021-04-23 0:58 UTC (permalink / raw)
To: Wilco Dijkstra; +Cc: Szabolcs Nagy, 'GNU C Library'
Hi Wilco-san,
Let me make one correction: I forgot about the free PTRUE in p0.b.
> From: Tamura, Naohiro/田村 直広 <naohirot@fujitsu.com>
> > If it is WHILELO,
> > it is possible to remove 3x WHILELO from the earlier cases by moving
> > them after a branch (so that the 256-512 case only needs to execute 5x
> > WHILELO rather than 8 into total).
>
> As shown in Graph 2 in Google Sheet [2], this approach didn't make the dip small,
> because I assume that we can reduce two WHILELO, but we needed to add two
> PTRUE.
I didn't have to add the two PTRUEs because of the free p0.b.
As shown in Graph 4 in the Google Sheet [2], this approach without the two extra
PTRUEs made the dip slightly smaller, but the improvement is smaller than that of
the previous approach [4] shown in Graph 3.
So the conclusion doesn't seem to change.
[2] https://docs.google.com/spreadsheets/d/19XYE63defjFEHZVqciZdmcDrJLWkRfGmSagXlIV2F-c/edit?usp=sharing
The code without adding two PTRUE is like the following diff.
$ git diff
diff --git a/sysdeps/aarch64/multiarch/memcpy_a64fx.S b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
index 6d0ae1cd1f..c3779d0147 100644
--- a/sysdeps/aarch64/multiarch/memcpy_a64fx.S
+++ b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
@@ -139,12 +139,13 @@
1: // if rest > vector_length * 8
cmp n, vector_length, lsl 3 // vector_length * 8
b.hi \exit
+ cmp n, vector_length, lsl 2 // vector_length * 4
+ b.hi 1f
// if rest <= vector_length * 4
lsl tmp1, vector_length, 1 // vector_length * 2
whilelo p2.b, tmp1, n
incb tmp1
whilelo p3.b, tmp1, n
- b.last 1f
ld1b z0.b, p0/z, [src, #0, mul vl]
ld1b z1.b, p1/z, [src, #1, mul vl]
ld1b z2.b, p2/z, [src, #2, mul vl]
@@ -165,16 +166,16 @@
whilelo p7.b, tmp1, n
ld1b z0.b, p0/z, [src, #0, mul vl]
ld1b z1.b, p1/z, [src, #1, mul vl]
- ld1b z2.b, p2/z, [src, #2, mul vl]
- ld1b z3.b, p3/z, [src, #3, mul vl]
+ ld1b z2.b, p0/z, [src, #2, mul vl]
+ ld1b z3.b, p0/z, [src, #3, mul vl]
ld1b z4.b, p4/z, [src, #4, mul vl]
ld1b z5.b, p5/z, [src, #5, mul vl]
ld1b z6.b, p6/z, [src, #6, mul vl]
ld1b z7.b, p7/z, [src, #7, mul vl]
st1b z0.b, p0, [dest, #0, mul vl]
st1b z1.b, p1, [dest, #1, mul vl]
- st1b z2.b, p2, [dest, #2, mul vl]
- st1b z3.b, p3, [dest, #3, mul vl]
+ st1b z2.b, p0, [dest, #2, mul vl]
+ st1b z3.b, p0, [dest, #3, mul vl]
st1b z4.b, p4, [dest, #4, mul vl]
st1b z5.b, p5, [dest, #5, mul vl]
st1b z6.b, p6, [dest, #6, mul vl]
> I changed the code [1] like the following diff.
>
> $ git diff
> diff --git a/sysdeps/aarch64/multiarch/memcpy_a64fx.S
> b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
> index 6d0ae1cd1f..2ae1f4e3b9 100644
> --- a/sysdeps/aarch64/multiarch/memcpy_a64fx.S
> +++ b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
> @@ -139,12 +139,13 @@
> 1: // if rest > vector_length * 8
> cmp n, vector_length, lsl 3 // vector_length * 8
> b.hi \exit
> + cmp n, vector_length, lsl 2 // vector_length * 4
> + b.hi 1f
> // if rest <= vector_length * 4
> lsl tmp1, vector_length, 1 // vector_length * 2
> whilelo p2.b, tmp1, n
> incb tmp1
> whilelo p3.b, tmp1, n
> - b.last 1f
> ld1b z0.b, p0/z, [src, #0, mul vl]
> ld1b z1.b, p1/z, [src, #1, mul vl]
> ld1b z2.b, p2/z, [src, #2, mul vl]
> @@ -155,6 +156,8 @@
> st1b z3.b, p3, [dest, #3, mul vl]
> ret
> 1: // if rest <= vector_length * 8
> + ptrue p2.b
> + ptrue p3.b
> lsl tmp1, vector_length, 2 // vector_length * 4
> whilelo p4.b, tmp1, n
> incb tmp1
> > If all that doesn't help, it may be best to split into 256-384 and
> > 384-512 so you only need 2x WHILELO.
>
> This way [4] made the dip small as shown in Graph3 in Google Sheet [2].
> So it seems that this is the way we should take.
>
> [4]
> https://github.com/NaohiroTamura/glibc/commit/cbcb80e69325c16c6697c4262
> 7a6ca12c3245a86
Thanks.
Naohiro
* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
2021-04-14 16:02 ` Wilco Dijkstra via Libc-alpha
` (4 preceding siblings ...)
2021-04-20 5:49 ` naohirot
@ 2021-04-23 13:22 ` naohirot
5 siblings, 0 replies; 72+ messages in thread
From: naohirot @ 2021-04-23 13:22 UTC (permalink / raw)
To: Wilco Dijkstra; +Cc: Szabolcs Nagy, 'GNU C Library'
Hi Wilco-san,
Let me re-evaluate the loop unrolling/software pipelining of L(vl_agnostic) for sizes
512B-4MB using the latest source code [2], with all graphs [3] included in this mail.
The earlier evaluation was reported in [1], but not all of the graphs were provided.
[1] https://sourceware.org/pipermail/libc-alpha/2021-April/125002.html
[2] https://github.com/NaohiroTamura/glibc/commit/cbcb80e69325c16c6697c42627a6ca12c3245a86
[3] https://docs.google.com/spreadsheets/d/1leFhCAirelDezb0OFC7cr7v4uMUMveaN1iAxL410D2c/edit?usp=sharing
> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
> Yes that is a good idea - you could also check whether the software pipelining
> actually helps on an OoO core (it shouldn't) since that contributes a lot to the
> complexity and the amount of code and unrolling required.
I compared the unroll factors by commenting out the labels above the target label.
For example, if the target label is memset's L(unroll4), then L(unroll32) and L(unroll8)
are commented out, and L(unroll4), L(unroll2), and L(unroll1) are executed.
For memcpy/memmove the comparison was among L(unroll8), L(unroll4), L(unroll2), and L(unroll1);
for memset, among L(unroll32), L(unroll8), L(unroll4), L(unroll2), and L(unroll1).
The result was that 8-way unrolling/pipelining for memcpy/memmove and 32-way
unrolling/pipelining for memset are still effective for sizes between 512B and 64KB,
as shown in the graphs in the Google Sheet [3].
In conclusion, the loop unrolling/software pipelining technique seems to still work
on A64FX; I believe this may be a peculiar characteristic of A64FX.
Thanks.
Naohiro
* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
2021-04-20 14:44 ` Wilco Dijkstra via Libc-alpha
@ 2021-04-27 9:01 ` naohirot
0 siblings, 0 replies; 72+ messages in thread
From: naohirot @ 2021-04-27 9:01 UTC (permalink / raw)
To: 'Wilco Dijkstra'; +Cc: Szabolcs Nagy, 'GNU C Library'
Hi Wilco-san,
I focus on the zero fill distance in this mail.
> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
> >> Also why would the DC_ZVA need to be done so early? It seems to me that
> >> cleaning the cacheline just before you write it works best since that avoids
> >> accidentally replacing it.
> >
> > Yes, I moved it closer, please look at the change [1].
>
> What I meant is, why is ZF_DIST so huge? I don't see how that helps. Is there any
> penalty if we did it like this (or possibly with 1-2 cachelines offset)?
>
> dc zva, dest_ptr
> st1b z0.b, p0, [dest_ptr, #0, mul vl]
> st1b z1.b, p0, [dest_ptr, #1, mul vl]
> st1b z2.b, p0, [dest_ptr, #2, mul vl]
> st1b z3.b, p0, [dest_ptr, #3, mul vl]
I tested several zero-fill distances for memcpy and memset, including 1-2 cacheline offsets.
As shown in Graph 1 and Graph 2 of the Google Sheet [1], the most suitable zero-fill
distance for both memcpy and memset was a 21-cacheline offset.
ZF21 in Graph 1 and Graph 2 means a zero-fill distance of 21 cachelines.
So I updated both the memcpy and memset source code [2][3].
[1] https://docs.google.com/spreadsheets/d/1qXWHc-OXl2E9Q9vWUl4R4eM00k02eij6eMAhXYUFVoI/edit
[2] https://github.com/NaohiroTamura/glibc/commit/5e7f737a270334ec0f86c0228f90000bf9a2cf00
[3] https://github.com/NaohiroTamura/glibc/commit/42334cb84419603003977eb77783bf407cb75072
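A hedged sketch of the resulting store-loop body (register names and the ZF_DIST constant here are assumptions based on the description above, not the literal patch):

```asm
	// Clean the cacheline ZF_DIST bytes ahead of the current stores,
	// where ZF_DIST = 21 * 256 (21 cachelines; A64FX uses 256-byte
	// lines, and 4 x 64-byte SVE vectors fill exactly one line).
	add	tmp1, dest_ptr, ZF_DIST
	dc	zva, tmp1
	st1b	z0.b, p0, [dest_ptr, #0, mul vl]
	st1b	z1.b, p0, [dest_ptr, #1, mul vl]
	st1b	z2.b, p0, [dest_ptr, #2, mul vl]
	st1b	z3.b, p0, [dest_ptr, #3, mul vl]
	add	dest_ptr, dest_ptr, 256	// advance by one cacheline
```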
Thanks.
Naohiro
* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
2021-04-20 11:39 ` Wilco Dijkstra via Libc-alpha
@ 2021-04-27 11:03 ` naohirot
0 siblings, 0 replies; 72+ messages in thread
From: naohirot @ 2021-04-27 11:03 UTC (permalink / raw)
To: 'Wilco Dijkstra'; +Cc: Szabolcs Nagy, 'GNU C Library'
Hi Wilco-san,
This mail is a continuation of the discussion on removing redundant instructions.
> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
> For the first 2 CBZ cases in both [1] and [2] the fastest option is to use
> ANDS+BEQ. ANDS only requires 1 ALU operation while AND+CBZ uses 2 ALU
> operations on A64FX.
I see, I haven't used ANDS before. Thanks for the advice.
I updated memcpy[1] and memset[2].
[1] https://github.com/NaohiroTamura/glibc/commit/fca2c1cf1fd80ec7ecb93f7cd08be9aab9ca9412
[2] https://github.com/NaohiroTamura/glibc/commit/5004e34c35a20faf3e12e6ce915845a75b778cbf
Thanks.
Naohiro
* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
2021-04-20 16:00 ` Wilco Dijkstra via Libc-alpha
@ 2021-04-27 11:58 ` naohirot
2021-04-29 15:13 ` Wilco Dijkstra via Libc-alpha
0 siblings, 1 reply; 72+ messages in thread
From: naohirot @ 2021-04-27 11:58 UTC (permalink / raw)
To: 'Wilco Dijkstra'; +Cc: Szabolcs Nagy, 'GNU C Library'
Hi Wilco-san,
This mail is a continuation of the BTI macro discussion.
I believe that I've answered all of your comments so far.
Please let me know if I missed something.
If there are no further comments on the first version of this patch,
I'd like to proceed with preparing the second version after the
consecutive national holidays, Apr. 29th - May 5th, in Japan.
> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
> > So if distinct degradation happens only on A64FX, I'd like to add
> > another ENTRY macro in sysdeps/aarch64/sysdep.h such as:
>
> I think the best option for now is to change BTI_C into NOP if AARCH64_HAVE_BTI
> is not set. This avoids creating alignment issues in existing code (which is written
> to assume the hint is present) and works for all string functions.
I updated sysdeps/aarch64/sysdep.h following your advice [1].
Then I reverted the entries of memcpy/memmove [2] and memset [3].
[1] https://github.com/NaohiroTamura/glibc/commit/c582917071e76cfed84fafb0c82cb70339294386
[2] https://github.com/NaohiroTamura/glibc/commit/f4627d5a0faa8d2bd9102964a3e31936248fa9ca
[3] https://github.com/NaohiroTamura/glibc/commit/da48f62bab67d875cb712a886ba074073857d5c3
Thanks.
Naohiro
* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
2021-04-27 11:58 ` naohirot
@ 2021-04-29 15:13 ` Wilco Dijkstra via Libc-alpha
2021-04-30 15:01 ` Szabolcs Nagy via Libc-alpha
2021-05-06 10:01 ` naohirot
0 siblings, 2 replies; 72+ messages in thread
From: Wilco Dijkstra via Libc-alpha @ 2021-04-29 15:13 UTC (permalink / raw)
To: naohirot@fujitsu.com; +Cc: Szabolcs Nagy, 'GNU C Library'
Hi Naohiro,
> I believe that I've answered all of your comments so far.
> Please let me know if I missed something.
> If there is no further comments to the first version of this patch,
> I'd like to proceed with the preparation of the second version after
> the consecutive National holidays, Apr. 29th - May. 5th, in Japan.
I've only looked at memcpy so far. My comments on memcpy:
(1) Improve the tail code in unroll4/2/1/last to do the reverse of
shortcut_for_small_size - basically there is no need for loops or lots of branches.
(2) Rather than start with L2, check for n > L2_SIZE && vector_length == 64 and
start with the vl_agnostic case. Copies > L2_SIZE will be very rare so it's best to
handle the common case first.
(3) The alignment code can be significantly simplified. Why not just process
4 vectors unconditionally and then align the pointers? That avoids all the
complex code and is much faster.
(4) Is there a benefit of aligning src or dst to vector size in the vl_agnostic case?
If so, it would be easy to align to a vector first and then if n > L2_SIZE do the
remaining 3 vectors to align to a full cacheline.
(5) I'm not sure I understand the reason for src_notag/dest_notag. However if
you want to ignore tags, just change the mov src_ptr, src into AND that
clears the tag. There is no reason to both clear the tag and also keep the
original pointer and tag.
For memmove I would suggest merging it with memcpy to save ~100 instructions.
I don't understand the complexity of the L(dispatch) code - you just need a simple
3-instruction overlap check that branches to bwd_unroll8.
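For reference, a hedged sketch of such an overlap check (label and register names assumed): a forward copy is unsafe only when dest lands inside [src, src + n), which a single unsigned compare detects.

```asm
	// If (dest - src) < n as an unsigned value, dest overlaps the
	// source range and a forward copy would clobber bytes not yet
	// read, so branch to the backward-copy path.
	sub	tmp1, dest, src
	cmp	tmp1, n
	b.lo	L(bwd_unroll8)
```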
I haven't looked at memset, but pretty much all the improvements apply there too.
>> I think the best option for now is to change BTI_C into NOP if AARCH64_HAVE_BTI
>> is not set. This avoids creating alignment issues in existing code (which is written
>> to assume the hint is present) and works for all string functions.
>
> I updated sysdeps/aarch64/sysdep.h following your advice [1].
>
> [1] https://github.com/NaohiroTamura/glibc/commit/c582917071e76cfed84fafb0c82cb70339294386
I meant using an actual NOP in the #else case so that existing string functions
won't change. Also note the #defines in the #if and #else need to be indented.
Cheers,
Wilco
* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
2021-04-29 15:13 ` Wilco Dijkstra via Libc-alpha
@ 2021-04-30 15:01 ` Szabolcs Nagy via Libc-alpha
2021-04-30 15:23 ` Wilco Dijkstra via Libc-alpha
2021-05-06 10:01 ` naohirot
1 sibling, 1 reply; 72+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-04-30 15:01 UTC (permalink / raw)
To: Wilco Dijkstra; +Cc: 'GNU C Library'
The 04/29/2021 16:13, Wilco Dijkstra wrote:
> > I updated sysdeps/aarch64/sysdep.h following your advice [1].
> >
> > [1] https://github.com/NaohiroTamura/glibc/commit/c582917071e76cfed84fafb0c82cb70339294386
>
> I meant using an actual NOP in the #else case so that existing string functions
> won't change. Also note the #defines in the #if and #else need to be indented.
is that really useful?
'bti c' is already a nop if it's unsupported.
maybe it works if a64fx_memcpy.S has
#undef BTI_C
#define BTI_C
ENTRY(a64fx_memcpy)
...
to save one nop.
* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
2021-04-30 15:01 ` Szabolcs Nagy via Libc-alpha
@ 2021-04-30 15:23 ` Wilco Dijkstra via Libc-alpha
2021-04-30 15:30 ` Florian Weimer via Libc-alpha
0 siblings, 1 reply; 72+ messages in thread
From: Wilco Dijkstra via Libc-alpha @ 2021-04-30 15:23 UTC (permalink / raw)
To: Szabolcs Nagy; +Cc: 'GNU C Library'
Hi Szabolcs,
>> I meant using an actual NOP in the #else case so that existing string functions
>> won't change. Also note the #defines in the #if and #else need to be indented.
>
> is that really useful?
> 'bti c' is already a nop if it's unsupported.
Well it doesn't seem to behave like a NOP. So to avoid slowing down all string
functions, bti c must be removed completely, not just from A64FX memcpy.
Using a real NOP is fine in all cases as long as HAVE_AARCH64_BTI is not defined.
> maybe it works if a64fx_memcpy.S has
>
> #undef BTI_C
> #define BTI_C
> ENTRY(a64fx_memcpy)
That works for memcpy, but what about everything else?
Cheers,
Wilco
* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
2021-04-30 15:23 ` Wilco Dijkstra via Libc-alpha
@ 2021-04-30 15:30 ` Florian Weimer via Libc-alpha
2021-04-30 15:40 ` Wilco Dijkstra via Libc-alpha
0 siblings, 1 reply; 72+ messages in thread
From: Florian Weimer via Libc-alpha @ 2021-04-30 15:30 UTC (permalink / raw)
To: Wilco Dijkstra via Libc-alpha; +Cc: Szabolcs Nagy, Wilco Dijkstra
* Wilco Dijkstra via Libc-alpha:
> Hi Szabolcs,
>
>>> I meant using an actual NOP in the #else case so that existing string functions
>>> won't change. Also note the #defines in the #if and #else need to be indented.
>>
>> is that really useful?
>> 'bti c' is already a nop if it's unsupported.
>
> Well it doesn't seem to behave like a NOP. So to avoid slowing down
> all string functions, bti c must be removed completely, not just from
> A64FX memcpy. Using a real NOP is fine in all cases as long as
> HAVE_AARCH64_BTI is not defined.
I'm probably confused, but: If BTI is active, many more glibc functions
will have BTI markers. What makes the string functions special?
Thanks,
Florian
* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
2021-04-30 15:30 ` Florian Weimer via Libc-alpha
@ 2021-04-30 15:40 ` Wilco Dijkstra via Libc-alpha
2021-05-04 7:56 ` Szabolcs Nagy via Libc-alpha
0 siblings, 1 reply; 72+ messages in thread
From: Wilco Dijkstra via Libc-alpha @ 2021-04-30 15:40 UTC (permalink / raw)
To: Florian Weimer, Wilco Dijkstra via Libc-alpha; +Cc: Szabolcs Nagy
Hi Florian,
>> Well it doesn't seem to behave like a NOP. So to avoid slowing down
>> all string functions, bti c must be removed completely, not just from
>> A64FX memcpy. Using a real NOP is fine in all cases as long as
>> HAVE_AARCH64_BTI is not defined.
>
> I'm probably confused, but: If BTI is active, many more glibc functions
> will have BTI markers. What makes the string functions special?
Exactly. And at that point trying to remove it from memcpy is just pointless.
The case we are discussing is where BTI is not turned on in GLIBC but we still
emit a BTI at the start of assembler functions for simplicity. By using a NOP
instead, A64FX will not execute BTI anywhere in GLIBC.
Cheers,
Wilco
* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
2021-04-30 15:40 ` Wilco Dijkstra via Libc-alpha
@ 2021-05-04 7:56 ` Szabolcs Nagy via Libc-alpha
2021-05-04 10:17 ` Florian Weimer via Libc-alpha
0 siblings, 1 reply; 72+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-05-04 7:56 UTC (permalink / raw)
To: Wilco Dijkstra; +Cc: Florian Weimer, Wilco Dijkstra via Libc-alpha
The 04/30/2021 16:40, Wilco Dijkstra wrote:
> >> Well it doesn't seem to behave like a NOP. So to avoid slowing down
> >> all string functions, bti c must be removed completely, not just from
> >> A64FX memcpy. Using a real NOP is fine in all cases as long as
> >> HAVE_AARCH64_BTI is not defined.
> >
> > I'm probably confused, but: If BTI is active, many more glibc functions
> > will have BTI markers. What makes the string functions special?
>
> Exactly. And at that point trying to remove it from memcpy is just pointless.
>
> The case we are discussing is where BTI is not turned on in GLIBC but we still
> emit a BTI at the start of assembler functions for simplicity. By using a NOP
> instead, A64FX will not execute BTI anywhere in GLIBC.
the asm ENTRY was written with the assumption that bti c
behaves like a nop when bti is disabled, so we don't have
to make the asm conditional based on cflags.
if that's not the case i agree with the patch, however we
will have to review some other code (e.g. libgcc outline
atomics asm) where we made the same assumption.
* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
2021-05-04 7:56 ` Szabolcs Nagy via Libc-alpha
@ 2021-05-04 10:17 ` Florian Weimer via Libc-alpha
2021-05-04 10:38 ` Wilco Dijkstra via Libc-alpha
2021-05-04 10:42 ` Szabolcs Nagy via Libc-alpha
0 siblings, 2 replies; 72+ messages in thread
From: Florian Weimer via Libc-alpha @ 2021-05-04 10:17 UTC (permalink / raw)
To: Szabolcs Nagy; +Cc: Wilco Dijkstra via Libc-alpha, Wilco Dijkstra
* Szabolcs Nagy:
> The 04/30/2021 16:40, Wilco Dijkstra wrote:
>> >> Well it doesn't seem to behave like a NOP. So to avoid slowing down
>> >> all string functions, bti c must be removed completely, not just from
>> >> A64FX memcpy. Using a real NOP is fine in all cases as long as
>> >> HAVE_AARCH64_BTI is not defined.
>> >
>> > I'm probably confused, but: If BTI is active, many more glibc functions
>> > will have BTI markers. What makes the string functions special?
>>
>> Exactly. And at that point trying to remove it from memcpy is just pointless.
>>
>> The case we are discussing is where BTI is not turned on in GLIBC but we still
>> emit a BTI at the start of assembler functions for simplicity. By using a NOP
>> instead, A64FX will not execute BTI anywhere in GLIBC.
>
> the asm ENTRY was written with the assumption that bti c
> behaves like a nop when bti is disabled, so we don't have
> to make the asm conditional based on cflags.
>
> if that's not the case i agree with the patch, however we
> will have to review some other code (e.g. libgcc outline
> atomics asm) where we made the same assumption.
I find this discussion extremely worrisome. If bti c does not behave
like a nop, then we need a new AArch64 ABI variant to enable BTI.
That being said, a distribution with lots of bti c instructions in
binaries seems to run on A64FX CPUs, so I'm not sure what is going on.
Thanks,
Florian
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
2021-05-04 10:17 ` Florian Weimer via Libc-alpha
@ 2021-05-04 10:38 ` Wilco Dijkstra via Libc-alpha
2021-05-04 10:42 ` Szabolcs Nagy via Libc-alpha
1 sibling, 0 replies; 72+ messages in thread
From: Wilco Dijkstra via Libc-alpha @ 2021-05-04 10:38 UTC (permalink / raw)
To: Florian Weimer, Szabolcs Nagy; +Cc: Wilco Dijkstra via Libc-alpha
Hi Florian,
> I find this discussion extremely worrisome. If bti c does not behave
> like a nop, then we need a new AArch64 ABI variant to enable BTI.
>
> That being said, a distribution with lots of bti c instructions in
> binaries seems to run on A64FX CPUs, so I'm not sure what is going on.
NOP-space instructions should take no time or execution resources.
From Naohiro's graphs I estimate A64FX takes around 30 cycles per BTI
instruction - that's clearly "not behaving like a NOP". That would cause a
significant performance degradation if BTI is enabled in a distro.
Cheers,
Wilco
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
2021-05-04 10:17 ` Florian Weimer via Libc-alpha
2021-05-04 10:38 ` Wilco Dijkstra via Libc-alpha
@ 2021-05-04 10:42 ` Szabolcs Nagy via Libc-alpha
2021-05-04 11:07 ` Florian Weimer via Libc-alpha
1 sibling, 1 reply; 72+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-05-04 10:42 UTC (permalink / raw)
To: Florian Weimer; +Cc: Wilco Dijkstra via Libc-alpha, Wilco Dijkstra
The 05/04/2021 12:17, Florian Weimer wrote:
> * Szabolcs Nagy:
>
> > The 04/30/2021 16:40, Wilco Dijkstra wrote:
> >> >> Well it doesn't seem to behave like a NOP. So to avoid slowing down
> >> >> all string functions, bti c must be removed completely, not just from
> >> >> A64FX memcpy. Using a real NOP is fine in all cases as long as
> >> >> HAVE_AARCH64_BTI is not defined.
> >> >
> >> > I'm probably confused, but: If BTI is active, many more glibc functions
> >> > will have BTI markers. What makes the string functions special?
> >>
> >> Exactly. And at that point trying to remove it from memcpy is just pointless.
> >>
> >> The case we are discussing is where BTI is not turned on in GLIBC but we still
> >> emit a BTI at the start of assembler functions for simplicity. By using a NOP
> >> instead, A64FX will not execute BTI anywhere in GLIBC.
> >
> > the asm ENTRY was written with the assumption that bti c
> > behaves like a nop when bti is disabled, so we don't have
> > to make the asm conditional based on cflags.
> >
> > if that's not the case i agree with the patch, however we
> > will have to review some other code (e.g. libgcc outline
> > atomics asm) where we made the same assumption.
>
> I find this discussion extremely worrisome. If bti c does not behave
> like a nop, then we need a new AArch64 ABI variant to enable BTI.
>
> That being said, a distribution with lots of bti c instructions in
> binaries seems to run on A64FX CPUs, so I'm not sure what is going on.
this does not have correctness impact, only performance impact.
hint space instructions seem to be slower than expected on a64fx.
which means unconditionally adding bti c to asm entry code is not
ideal if somebody tries to build a system without branch-protection.
distros that build all binaries with branch protection will just
take a performance hit on a64fx, we can't fix that easily.
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
2021-05-04 10:42 ` Szabolcs Nagy via Libc-alpha
@ 2021-05-04 11:07 ` Florian Weimer via Libc-alpha
0 siblings, 0 replies; 72+ messages in thread
From: Florian Weimer via Libc-alpha @ 2021-05-04 11:07 UTC (permalink / raw)
To: Szabolcs Nagy; +Cc: Wilco Dijkstra via Libc-alpha, Wilco Dijkstra
* Szabolcs Nagy:
> The 05/04/2021 12:17, Florian Weimer wrote:
>> * Szabolcs Nagy:
>>
>> > The 04/30/2021 16:40, Wilco Dijkstra wrote:
>> >> >> Well it doesn't seem to behave like a NOP. So to avoid slowing down
>> >> >> all string functions, bti c must be removed completely, not just from
>> >> >> A64FX memcpy. Using a real NOP is fine in all cases as long as
>> >> >> HAVE_AARCH64_BTI is not defined.
>> >> >
>> >> > I'm probably confused, but: If BTI is active, many more glibc functions
>> >> > will have BTI markers. What makes the string functions special?
>> >>
>> >> Exactly. And at that point trying to remove it from memcpy is just pointless.
>> >>
>> >> The case we are discussing is where BTI is not turned on in GLIBC but we still
>> >> emit a BTI at the start of assembler functions for simplicity. By using a NOP
>> >> instead, A64FX will not execute BTI anywhere in GLIBC.
>> >
>> > the asm ENTRY was written with the assumption that bti c
>> > behaves like a nop when bti is disabled, so we don't have
>> > to make the asm conditional based on cflags.
>> >
>> > if that's not the case i agree with the patch, however we
>> > will have to review some other code (e.g. libgcc outline
>> > atomics asm) where we made the same assumption.
>>
>> I find this discussion extremely worrisome. If bti c does not behave
>> like a nop, then we need a new AArch64 ABI variant to enable BTI.
>>
>> That being said, a distribution with lots of bti c instructions in
>> binaries seems to run on A64FX CPUs, so I'm not sure what is going on.
>
> this does not have correctness impact, only performance impact.
>
> hint space instructions seem to be slower than expected on a64fx.
>
> which means unconditionally adding bti c to asm entry code is not
> ideal if somebody tries to build a system without branch-protection.
> distros that build all binaries with branch protection will just
> take a performance hit on a64fx, we can't fix that easily.
I think I see it now. It's not critically slow, but there appears to be
an observable impact. I'm still worried.
Thanks,
Florian
^ permalink raw reply [flat|nested] 72+ messages in thread
* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
2021-04-29 15:13 ` Wilco Dijkstra via Libc-alpha
2021-04-30 15:01 ` Szabolcs Nagy via Libc-alpha
@ 2021-05-06 10:01 ` naohirot
2021-05-06 14:26 ` Szabolcs Nagy via Libc-alpha
2021-05-06 17:31 ` Wilco Dijkstra via Libc-alpha
1 sibling, 2 replies; 72+ messages in thread
From: naohirot @ 2021-05-06 10:01 UTC (permalink / raw)
To: 'Wilco Dijkstra'; +Cc: Szabolcs Nagy, 'GNU C Library'
Hi Wilco,
Thanks for the comments. I applied all of them to both memcpy/memmove
and memset, except (3), the alignment code for memset.
The latest code is memcpy/memmove [1] and memset [2] on the
patch-20210317 [3] branch, arrived at by evaluating the performance
data shown below.
[1] https://github.com/NaohiroTamura/glibc/blob/d2ea23703fc45cbfe4a8f27c759b0b23722e17a4/sysdeps/aarch64/multiarch/memcpy_a64fx.S
[2] https://github.com/NaohiroTamura/glibc/blob/d2ea23703fc45cbfe4a8f27c759b0b23722e17a4/sysdeps/aarch64/multiarch/memset_a64fx.S
[3] https://github.com/NaohiroTamura/glibc/commits/patch-20210317
> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
> I've only looked at memcpy so far. My comments on memcpy:
>
> (1) Improve the tail code in unroll4/2/1/last to do the reverse of
> shortcut_for_small_size - basically there is no need for loops or lots of
> branches.
>
I updated the tail code of both memcpy/memmove [4] and memset [5], and
replaced the small-size code of memset [5].
The performance is shown as "whilelo" in Google Sheet Graph for
memcpy/memmove [6] and memset [7].
[4] https://github.com/NaohiroTamura/glibc/commit/f7d9d7b22814affdd89cf291905b9c6601e2031d
[5] https://github.com/NaohiroTamura/glibc/commit/b79d6731f800a56be66c895c035b791ca5176bbb
[6] https://docs.google.com/spreadsheets/d/1Rh-bwF6dpWqoOCbL2epogUPn4I2Emd0NiFgoEOPaujM/edit
[7] https://docs.google.com/spreadsheets/d/1TS0qFhyR_06OyqaRHYAdCKxwvRz7f1T8jI7Pu6x2GIk/edit
> (2) Rather than start with L2, check for n > L2_SIZE && vector_length == 64 and
> start with the vl_agnostic case. Copies > L2_SIZE will be very rare so it's best
> to
> handle the common case first.
>
I changed the order for both memcpy/memmove [8] and memset [9].
The performance is shown as "agnostic1st" in Google Sheet Graph for
memcpy/memmove [6] and memset [7].
[8] https://github.com/NaohiroTamura/glibc/commit/c0d7e39aa4aefe3d7b7d2a8a7c220150a0eb78fe
[9] https://github.com/NaohiroTamura/glibc/commit/d2ea23703fc45cbfe4a8f27c759b0b23722e17a4
> (3) The alignment code can be significantly simplified. Why not just process
> 4 vectors unconditionally and then align the pointers? That avoids all the
> complex code and is much faster.
>
In terms of memcpy/memmove, I tried 4 patterns, "simplifiedL2algin"[10],
"simplifiedL2algin2"[11], "agnosticVLalign"[12], and "noalign"[13], as shown
in the Google Sheet graph [14].
"simplifiedL2algin"[10] simplifies the alignment to 4 whilelo,
"simplifiedL2algin2"[11] simplifies it to 2 or 4 whilelo, "agnosticVLalign"[12]
adds alignment code to L(vl_agnostic), and "noalign"[13] removes all alignment.
"agnosticVLalign"[12] and "noalign"[13] didn't improve the performance, so these
commits are kept in the patch-20210317-memcpy-alignment branch [15].
[10] https://github.com/NaohiroTamura/glibc/commit/dd4ede78ec4d74e61a4dc3166fc8586168c4e410
[11] https://github.com/NaohiroTamura/glibc/commit/dd246ff01d59e4e91d10261cd070baae07c0093e
[12] https://github.com/NaohiroTamura/glibc/commit/35b8057d91024bf41595d38d94b2c3c76bdfd6b0
[13] https://github.com/NaohiroTamura/glibc/commit/b1f16f3e738152a5c0f3441201058b48901b4910
[14] https://docs.google.com/spreadsheets/d/1REBslxd56kMDMiXHAtRkBn4IaUO7AVmgvGldJl5qc58/edit
[15] https://github.com/NaohiroTamura/glibc/commits/patch-20210317-memcpy-alignment
In terms of memset, I tried 4 patterns too, "VL/CL-align"[16], "CL-align"[17],
"CL-align2"[18], and "noalign"[19], as shown in the Google Sheet graph [20].
"VL/CL-align"[16] simplifies the alignment to 1 whilelo for VL plus 3 whilelo
for CL, "CL-align"[17] simplifies it to 4 whilelo, "CL-align2"[18] simplifies
it to 2 or 4 whilelo, and "noalign"[19] removes all alignment.
As shown in the Google Sheet graph [20], none of the 4 patterns improved the
performance, so these commits are kept in the
patch-20210317-memset-alignment branch [21].
[16] https://github.com/NaohiroTamura/glibc/commit/2405b67a6bb8b380476967e150b35f10e0f25fe3
[17] https://github.com/NaohiroTamura/glibc/commit/a01a8ef08f3b53a691502538dabce3d5941790ff
[18] https://github.com/NaohiroTamura/glibc/commit/c8eb4467acbc97890a4f76f716a88d2dd901e083
[19] https://github.com/NaohiroTamura/glibc/commit/01ff56a9e558d650b09b0053adbc3215d269d65f
[20] https://docs.google.com/spreadsheets/d/1qT0ZkbrrL3fpEyfdjr23cbtanNyPFKN8xDo6E9Mb_YQ/edit
[21] https://github.com/NaohiroTamura/glibc/commits/patch-20210317-memset-alginment
> (4) Is there a benefit of aligning src or dst to vector size in the vl_agnostic case?
> If so, it would be easy to align to a vector first and then if n > L2_SIZE do the
> remaining 3 vectors to align to a full cacheline.
>
As tried in (3), "agnosticVLalign"[12] didn't improve the performance.
> (5) I'm not sure I understand the reason for src_notag/dest_notag. However if
> you want to ignore tags, just change the mov src_ptr, src into AND that
> clears the tag. There is no reason to both clear the tag and also keep the
> original pointer and tag.
>
A64FX has Fujitsu's proprietary enhancement regarding tagged addresses.
I removed the dest_notag/src_notag macros and simplified L(dispatch) [22].
The "src" address has to be kept in order to jump to L(last) [23].
[22] https://github.com/NaohiroTamura/glibc/commit/519244f5058d0aa98634bb544bae3358f0b7b07c
[23] https://github.com/NaohiroTamura/glibc/blob/519244f5058d0aa98634bb544bae3358f0b7b07c/sysdeps/aarch64/multiarch/memcpy_a64fx.S#L399
> For memmove I would suggest to merge it with memcpy to save ~100 instructions.
> I don't understand the complexity of the L(dispatch) code - you just need a simple
> 3-instruction overlap check that branches to bwd_unroll8.
>
I simplified the L(dispatch) code to 3 instructions [24] in the commit [23].
[24] https://github.com/NaohiroTamura/glibc/blob/519244f5058d0aa98634bb544bae3358f0b7b07c/sysdeps/aarch64/multiarch/memcpy_a64fx.S#L368-L370
> I haven't looked at memset, but pretty much all the improvements apply there too.
So please review the latest memset [2].
> >> I think the best option for now is to change BTI_C into NOP if
> >> AARCH64_HAVE_BTI is not set. This avoids creating alignment issues in
> >> existing code (which is written to assume the hint is present) and works for all
> string functions.
> >
> > I updated sysdeps/aarch64/sysdep.h following your advice [1].
> >
> > [1]
> > https://github.com/NaohiroTamura/glibc/commit/c582917071e76cfed84fafb0
> > c82cb70339294386
>
> I meant using an actual NOP in the #else case so that existing string functions
> won't change. Also note the #defines in the #if and #else need to be indented.
>
I've read the mail thread regarding BTI, but I don't think I fully understand
the problem. BTI seems to be available from ARMv8.5 on, and A64FX is ARMv8.2.
Even if a distro distributes BTI-enabled binaries, BTI doesn't take effect on A64FX.
So the BTI_J macro can be removed from the A64FX IFUNC code at least, because the
A64FX IFUNC code is executed only on A64FX.
Are we discussing the BTI_C code which is not in IFUNC code?
Thanks.
Naohiro
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
2021-05-06 10:01 ` naohirot
@ 2021-05-06 14:26 ` Szabolcs Nagy via Libc-alpha
2021-05-06 15:09 ` Florian Weimer via Libc-alpha
2021-05-06 17:31 ` Wilco Dijkstra via Libc-alpha
1 sibling, 1 reply; 72+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-05-06 14:26 UTC (permalink / raw)
To: naohirot@fujitsu.com; +Cc: 'GNU C Library', 'Wilco Dijkstra'
The 05/06/2021 10:01, naohirot@fujitsu.com wrote:
> > From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
> > > [1]
> > > https://github.com/NaohiroTamura/glibc/commit/c582917071e76cfed84fafb0
> > > c82cb70339294386
> >
> > I meant using an actual NOP in the #else case so that existing string functions
> > won't change. Also note the #defines in the #if and #else need to be indented.
> >
>
> I've read the mail thread regarding BTI, but I think I couldn't fully understand the
> problem. BTI seems available from ARMv8.5, and A64FX is ARMv8.2.
> Even though distro distributed BTI enabled binary, BTI doesn't work on A64FX.
> So BTI_J macro can be removed from A64FX IFUNC code at least, because A64FX
> IFUNC code is executed only on A64FX.
> Are we discussing the BTI_C code which is not in IFUNC code?
BTI_C at function entry.
the slowdown you showed with bti c at function entry
should not be present with a plain nop.
this means a64fx implemented hint space instructions
(such as bti c) slower than plain nops, which is not
expected and will cause slowdowns with distros that
try to distribute binaries with bti c; this problem
goes beyond string functions.
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
2021-05-06 14:26 ` Szabolcs Nagy via Libc-alpha
@ 2021-05-06 15:09 ` Florian Weimer via Libc-alpha
0 siblings, 0 replies; 72+ messages in thread
From: Florian Weimer via Libc-alpha @ 2021-05-06 15:09 UTC (permalink / raw)
To: Szabolcs Nagy via Libc-alpha; +Cc: Szabolcs Nagy, 'Wilco Dijkstra'
* Szabolcs Nagy via Libc-alpha:
> this means a64fx implemented hint space instructions
> (such as bti c) slower than plain nops, which is not
> expected and will cause slowdowns with distros that
> try to distribute binaries with bti c, this problem
> goes beyond string functions.
And we are using -mbranch-protection=standard on AArch64 going forward,
for example:
| optflags: aarch64 %{__global_compiler_flags} -mbranch-protection=standard -fasynchronous-unwind-tables %[ "%{toolchain}" == "gcc" ? "-fstack-clash-protection" : "" ]
<https://gitlab.com/redhat/centos-stream/rpms/redhat-rpm-config/-/blob/c9s/rpmrc#L77>
(Fedora is similar.)
This is why I find this issue so worrying.
Thanks,
Florian
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
2021-05-06 10:01 ` naohirot
2021-05-06 14:26 ` Szabolcs Nagy via Libc-alpha
@ 2021-05-06 17:31 ` Wilco Dijkstra via Libc-alpha
2021-05-07 12:31 ` naohirot
1 sibling, 1 reply; 72+ messages in thread
From: Wilco Dijkstra via Libc-alpha @ 2021-05-06 17:31 UTC (permalink / raw)
To: naohirot@fujitsu.com; +Cc: Szabolcs Nagy, 'GNU C Library'
Hi Naohiro,
> I've read the mail thread regarding BTI, but I think I couldn't fully understand the
> problem. BTI seems available from ARMv8.5, and A64FX is ARMv8.2.
BTI instructions are NOP hints, so it is possible to enable BTI even on ARMv8.0.
Using BTI instructions is harmless on CPUs that don't support it if NOP hints are as
cheap as a NOP (which generally doesn't need any execution resources).
> Even though distro distributed BTI enabled binary, BTI doesn't work on A64FX.
It works (i.e. it is binary-compatible with A64FX) and should have no effect.
However, it seems to cause an unexpected slowdown.
> So BTI_J macro can be removed from A64FX IFUNC code at least, because A64FX
> IFUNC code is executed only on A64FX.
How is removing it just from memcpy going to help? The worry is not about memcpy
but the slowdown from all the BTI instructions that will be added to most functions.
Note it is still worthwhile to change BTI_C to NOP as suggested - that is the case when
BTI is not enabled, and there you want to avoid inserting BTI when it is not needed.
Cheers,
Wilco
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
2021-05-06 17:31 ` Wilco Dijkstra via Libc-alpha
@ 2021-05-07 12:31 ` naohirot
0 siblings, 0 replies; 72+ messages in thread
From: naohirot @ 2021-05-07 12:31 UTC (permalink / raw)
To: Wilco Dijkstra; +Cc: Szabolcs Nagy, Florian Weimer, 'GNU C Library'
Hi Wilco, Szabolcs, Florian,
Thanks for the explanation!
> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
> How is removing it just from memcpy going to help? The worry is not about memcpy
> but the slowdown from all the BTI instructions that will be added to most functions.
OK, I understood.
I'm now asking the CPU design team how A64FX implements "hint 34" and how it
behaves.
> Note it is still worthwhile to change BTI_C to NOP as suggested - that is the case when
> BTI is not enabled, and there you want to avoid inserting BTI when it is not needed.
I changed BTI_C and BTI_J definitions to nop [1].
[1] https://github.com/NaohiroTamura/glibc/commit/0804fe9d288d489ec8af98c687552decd2723f5d
Thanks.
Naohiro
^ permalink raw reply [flat|nested] 72+ messages in thread
* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
2021-03-17 2:28 [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX Naohiro Tamura
` (5 preceding siblings ...)
2021-03-29 12:03 ` [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX Szabolcs Nagy via Libc-alpha
@ 2021-05-10 1:45 ` naohirot
2021-05-14 13:35 ` Szabolcs Nagy via Libc-alpha
2021-05-12 9:23 ` [PATCH v2 0/6] aarch64: " Naohiro Tamura
7 siblings, 1 reply; 72+ messages in thread
From: naohirot @ 2021-05-10 1:45 UTC (permalink / raw)
To: Szabolcs Nagy, Wilco Dijkstra, Florian Weimer; +Cc: libc-alpha@sourceware.org
Hi Szabolcs, Wilco, Florian,
> From: Naohiro Tamura <naohirot@fujitsu.com>
> Sent: Wednesday, March 17, 2021 11:29 AM
> Fujitsu is in the process of signing the copyright assignment paper.
> We'd like to have some feedback in advance.
FYI: Fujitsu has finally submitted the signed assignment.
Thanks.
Naohiro
^ permalink raw reply [flat|nested] 72+ messages in thread
* [PATCH v2 0/6] aarch64: Added optimized memcpy/memmove/memset for A64FX
2021-03-17 2:28 [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX Naohiro Tamura
` (6 preceding siblings ...)
2021-05-10 1:45 ` naohirot
@ 2021-05-12 9:23 ` Naohiro Tamura
2021-05-12 9:26 ` [PATCH v2 1/6] config: Added HAVE_AARCH64_SVE_ASM for aarch64 Naohiro Tamura
` (8 more replies)
7 siblings, 9 replies; 72+ messages in thread
From: Naohiro Tamura @ 2021-05-12 9:23 UTC (permalink / raw)
To: libc-alpha
Hi Szabolcs, Wilco, Florian,
Thank you for reviewing Patch V1.
Patch V2 reflects all of the V1 comments, which were mainly
related to redundant assembler code.
Consequently, the assembler code has been minimized, and each line of the
V2 assembler code is justified by string bench performance data.
In terms of assembler LOC (lines of code), memcpy/memmove was reduced by 60%
from 1,000 to 400 lines, and memset by 55% from 600 to 270 lines.
So please kindly review V2.
Thanks.
Naohiro
Naohiro Tamura (6):
config: Added HAVE_AARCH64_SVE_ASM for aarch64
aarch64: define BTI_C and BTI_J macros as NOP unless HAVE_AARCH64_BTI
aarch64: Added optimized memcpy and memmove for A64FX
aarch64: Added optimized memset for A64FX
scripts: Added Vector Length Set test helper script
benchtests: Fixed bench-memcpy-random: buf1: mprotect failed
benchtests/bench-memcpy-random.c | 4 +-
config.h.in | 5 +
manual/tunables.texi | 3 +-
scripts/vltest.py | 82 ++++
sysdeps/aarch64/configure | 28 ++
sysdeps/aarch64/configure.ac | 15 +
sysdeps/aarch64/multiarch/Makefile | 3 +-
sysdeps/aarch64/multiarch/ifunc-impl-list.c | 13 +-
sysdeps/aarch64/multiarch/init-arch.h | 4 +-
sysdeps/aarch64/multiarch/memcpy.c | 12 +-
sysdeps/aarch64/multiarch/memcpy_a64fx.S | 405 ++++++++++++++++++
sysdeps/aarch64/multiarch/memmove.c | 12 +-
sysdeps/aarch64/multiarch/memset.c | 11 +-
sysdeps/aarch64/multiarch/memset_a64fx.S | 268 ++++++++++++
sysdeps/aarch64/sysdep.h | 9 +-
.../unix/sysv/linux/aarch64/cpu-features.c | 4 +
.../unix/sysv/linux/aarch64/cpu-features.h | 4 +
17 files changed, 868 insertions(+), 14 deletions(-)
create mode 100755 scripts/vltest.py
create mode 100644 sysdeps/aarch64/multiarch/memcpy_a64fx.S
create mode 100644 sysdeps/aarch64/multiarch/memset_a64fx.S
--
2.17.1
^ permalink raw reply [flat|nested] 72+ messages in thread
* [PATCH v2 1/6] config: Added HAVE_AARCH64_SVE_ASM for aarch64
2021-05-12 9:23 ` [PATCH v2 0/6] aarch64: " Naohiro Tamura
@ 2021-05-12 9:26 ` Naohiro Tamura
2021-05-26 10:05 ` Szabolcs Nagy via Libc-alpha
2021-05-12 9:27 ` [PATCH v2 2/6] aarch64: define BTI_C and BTI_J macros as NOP unless HAVE_AARCH64_BTI Naohiro Tamura
` (7 subsequent siblings)
8 siblings, 1 reply; 72+ messages in thread
From: Naohiro Tamura @ 2021-05-12 9:26 UTC (permalink / raw)
To: libc-alpha; +Cc: Naohiro Tamura
From: Naohiro Tamura <naohirot@jp.fujitsu.com>
This patch checks whether the assembler supports '-march=armv8.2-a+sve'
for SVE code generation, and defines the HAVE_AARCH64_SVE_ASM macro
accordingly.
---
config.h.in | 5 +++++
sysdeps/aarch64/configure | 28 ++++++++++++++++++++++++++++
sysdeps/aarch64/configure.ac | 15 +++++++++++++++
3 files changed, 48 insertions(+)
diff --git a/config.h.in b/config.h.in
index 99036b887f..13fba9bb8d 100644
--- a/config.h.in
+++ b/config.h.in
@@ -121,6 +121,11 @@
/* AArch64 PAC-RET code generation is enabled. */
#define HAVE_AARCH64_PAC_RET 0
+/* Assembler support ARMv8.2-A SVE.
+ This macro becomes obsolete when glibc increased the minimum
+ required version of GNU 'binutils' to 2.28 or later. */
+#define HAVE_AARCH64_SVE_ASM 0
+
/* ARC big endian ABI */
#undef HAVE_ARC_BE
diff --git a/sysdeps/aarch64/configure b/sysdeps/aarch64/configure
index 83c3a23e44..4c1fac49f3 100644
--- a/sysdeps/aarch64/configure
+++ b/sysdeps/aarch64/configure
@@ -304,3 +304,31 @@ fi
$as_echo "$libc_cv_aarch64_variant_pcs" >&6; }
config_vars="$config_vars
aarch64-variant-pcs = $libc_cv_aarch64_variant_pcs"
+
+# Check if the assembler supports armv8.2-a+sve
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for SVE support in assembler" >&5
+$as_echo_n "checking for SVE support in assembler... " >&6; }
+if ${libc_cv_aarch64_sve_asm+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ cat > conftest.s <<\EOF
+ ptrue p0.b
+EOF
+if { ac_try='${CC-cc} -c -march=armv8.2-a+sve conftest.s 1>&5'
+ { { eval echo "\"\$as_me\":${as_lineno-$LINENO}: \"$ac_try\""; } >&5
+ (eval $ac_try) 2>&5
+ ac_status=$?
+ $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+ test $ac_status = 0; }; }; then
+ libc_cv_aarch64_sve_asm=yes
+else
+ libc_cv_aarch64_sve_asm=no
+fi
+rm -f conftest*
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $libc_cv_aarch64_sve_asm" >&5
+$as_echo "$libc_cv_aarch64_sve_asm" >&6; }
+if test $libc_cv_aarch64_sve_asm = yes; then
+ $as_echo "#define HAVE_AARCH64_SVE_ASM 1" >>confdefs.h
+
+fi
diff --git a/sysdeps/aarch64/configure.ac b/sysdeps/aarch64/configure.ac
index 66f755078a..3347c13fa1 100644
--- a/sysdeps/aarch64/configure.ac
+++ b/sysdeps/aarch64/configure.ac
@@ -90,3 +90,18 @@ EOF
fi
rm -rf conftest.*])
LIBC_CONFIG_VAR([aarch64-variant-pcs], [$libc_cv_aarch64_variant_pcs])
+
+# Check if the assembler supports armv8.2-a+sve
+AC_CACHE_CHECK(for SVE support in assembler, libc_cv_aarch64_sve_asm, [dnl
+cat > conftest.s <<\EOF
+ ptrue p0.b
+EOF
+if AC_TRY_COMMAND(${CC-cc} -c -march=armv8.2-a+sve conftest.s 1>&AS_MESSAGE_LOG_FD); then
+ libc_cv_aarch64_sve_asm=yes
+else
+ libc_cv_aarch64_sve_asm=no
+fi
+rm -f conftest*])
+if test $libc_cv_aarch64_sve_asm = yes; then
+ AC_DEFINE(HAVE_AARCH64_SVE_ASM)
+fi
--
2.17.1
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 2/6] aarch64: define BTI_C and BTI_J macros as NOP unless HAVE_AARCH64_BTI
2021-05-12 9:23 ` [PATCH v2 0/6] aarch64: " Naohiro Tamura
2021-05-12 9:26 ` [PATCH v2 1/6] config: Added HAVE_AARCH64_SVE_ASM for aarch64 Naohiro Tamura
@ 2021-05-12 9:27 ` Naohiro Tamura
2021-05-26 10:06 ` Szabolcs Nagy via Libc-alpha
2021-05-12 9:28 ` [PATCH v2 3/6] aarch64: Added optimized memcpy and memmove for A64FX Naohiro Tamura
` (6 subsequent siblings)
8 siblings, 1 reply; 72+ messages in thread
From: Naohiro Tamura @ 2021-05-12 9:27 UTC (permalink / raw)
To: libc-alpha; +Cc: Naohiro Tamura
From: Naohiro Tamura <naohirot@jp.fujitsu.com>
This patch defines BTI_C and BTI_J macros conditionally for
performance.
If HAVE_AARCH64_BTI is true, BTI_C and BTI_J are defined as HINT
instruction for ARMv8.5 BTI (Branch Target Identification).
If HAVE_AARCH64_BTI is false, both BTI_C and BTI_J are defined as
NOP.
---
sysdeps/aarch64/sysdep.h | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/sysdeps/aarch64/sysdep.h b/sysdeps/aarch64/sysdep.h
index 90acca4e42..b936e29cbd 100644
--- a/sysdeps/aarch64/sysdep.h
+++ b/sysdeps/aarch64/sysdep.h
@@ -62,8 +62,13 @@ strip_pac (void *p)
#define ASM_SIZE_DIRECTIVE(name) .size name,.-name
/* Branch Target Identitication support. */
-#define BTI_C hint 34
-#define BTI_J hint 36
+#if HAVE_AARCH64_BTI
+# define BTI_C hint 34
+# define BTI_J hint 36
+#else
+# define BTI_C nop
+# define BTI_J nop
+#endif
/* Return address signing support (pac-ret). */
#define PACIASP hint 25
--
2.17.1
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 3/6] aarch64: Added optimized memcpy and memmove for A64FX
2021-05-12 9:23 ` [PATCH v2 0/6] aarch64: " Naohiro Tamura
2021-05-12 9:26 ` [PATCH v2 1/6] config: Added HAVE_AARCH64_SVE_ASM for aarch64 Naohiro Tamura
2021-05-12 9:27 ` [PATCH v2 2/6] aarch64: define BTI_C and BTI_J macros as NOP unless HAVE_AARCH64_BTI Naohiro Tamura
@ 2021-05-12 9:28 ` Naohiro Tamura
2021-05-26 10:19 ` Szabolcs Nagy via Libc-alpha
2021-05-12 9:28 ` [PATCH v2 4/6] aarch64: Added optimized memset " Naohiro Tamura
` (5 subsequent siblings)
8 siblings, 1 reply; 72+ messages in thread
From: Naohiro Tamura @ 2021-05-12 9:28 UTC (permalink / raw)
To: libc-alpha; +Cc: Naohiro Tamura
From: Naohiro Tamura <naohirot@jp.fujitsu.com>
This patch optimizes the performance of memcpy/memmove for A64FX [1]
which implements ARMv8-A SVE and has L1 64KB cache per core and L2 8MB
cache per NUMA node.
The performance optimization makes use of Scalable Vector Register
with several techniques such as loop unrolling, memory access
alignment, cache zero fill, and software pipelining.
SVE assembler code for memcpy/memmove is implemented as Vector Length
Agnostic code so theoretically it can be run on any SOC which supports
ARMv8-A SVE standard.
We confirmed that all testcases have been passed by running 'make
check' and 'make xcheck' not only on A64FX but also on ThunderX2.
And also we confirmed that the SVE 512 bit vector register performance
is roughly 4 times better than Advanced SIMD 128 bit register and 8
times better than scalar 64 bit register by running 'make bench'.
[1] https://github.com/fujitsu/A64FX
---
manual/tunables.texi | 3 +-
sysdeps/aarch64/multiarch/Makefile | 2 +-
sysdeps/aarch64/multiarch/ifunc-impl-list.c | 8 +-
sysdeps/aarch64/multiarch/init-arch.h | 4 +-
sysdeps/aarch64/multiarch/memcpy.c | 12 +-
sysdeps/aarch64/multiarch/memcpy_a64fx.S | 405 ++++++++++++++++++
sysdeps/aarch64/multiarch/memmove.c | 12 +-
.../unix/sysv/linux/aarch64/cpu-features.c | 4 +
.../unix/sysv/linux/aarch64/cpu-features.h | 4 +
9 files changed, 446 insertions(+), 8 deletions(-)
create mode 100644 sysdeps/aarch64/multiarch/memcpy_a64fx.S
diff --git a/manual/tunables.texi b/manual/tunables.texi
index 6de647b426..fe7c1313cc 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -454,7 +454,8 @@ This tunable is specific to powerpc, powerpc64 and powerpc64le.
The @code{glibc.cpu.name=xxx} tunable allows the user to tell @theglibc{} to
assume that the CPU is @code{xxx} where xxx may have one of these values:
@code{generic}, @code{falkor}, @code{thunderxt88}, @code{thunderx2t99},
-@code{thunderx2t99p1}, @code{ares}, @code{emag}, @code{kunpeng}.
+@code{thunderx2t99p1}, @code{ares}, @code{emag}, @code{kunpeng},
+@code{a64fx}.
This tunable is specific to aarch64.
@end deftp
diff --git a/sysdeps/aarch64/multiarch/Makefile b/sysdeps/aarch64/multiarch/Makefile
index dc3efffb36..04c3f17121 100644
--- a/sysdeps/aarch64/multiarch/Makefile
+++ b/sysdeps/aarch64/multiarch/Makefile
@@ -1,6 +1,6 @@
ifeq ($(subdir),string)
sysdep_routines += memcpy_generic memcpy_advsimd memcpy_thunderx memcpy_thunderx2 \
- memcpy_falkor \
+ memcpy_falkor memcpy_a64fx \
memset_generic memset_falkor memset_emag memset_kunpeng \
memchr_generic memchr_nosimd \
strlen_mte strlen_asimd
diff --git a/sysdeps/aarch64/multiarch/ifunc-impl-list.c b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
index 99a8c68aac..911393565c 100644
--- a/sysdeps/aarch64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
@@ -25,7 +25,7 @@
#include <stdio.h>
/* Maximum number of IFUNC implementations. */
-#define MAX_IFUNC 4
+#define MAX_IFUNC 7
size_t
__libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
@@ -43,12 +43,18 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
IFUNC_IMPL_ADD (array, i, memcpy, !bti, __memcpy_thunderx2)
IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_falkor)
IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_simd)
+#if HAVE_AARCH64_SVE_ASM
+ IFUNC_IMPL_ADD (array, i, memcpy, sve, __memcpy_a64fx)
+#endif
IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_generic))
IFUNC_IMPL (i, name, memmove,
IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_thunderx)
IFUNC_IMPL_ADD (array, i, memmove, !bti, __memmove_thunderx2)
IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_falkor)
IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_simd)
+#if HAVE_AARCH64_SVE_ASM
+ IFUNC_IMPL_ADD (array, i, memmove, sve, __memmove_a64fx)
+#endif
IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_generic))
IFUNC_IMPL (i, name, memset,
/* Enable this on non-falkor processors too so that other cores
diff --git a/sysdeps/aarch64/multiarch/init-arch.h b/sysdeps/aarch64/multiarch/init-arch.h
index a167699e74..6d92c1bcff 100644
--- a/sysdeps/aarch64/multiarch/init-arch.h
+++ b/sysdeps/aarch64/multiarch/init-arch.h
@@ -33,4 +33,6 @@
bool __attribute__((unused)) bti = \
HAVE_AARCH64_BTI && GLRO(dl_aarch64_cpu_features).bti; \
bool __attribute__((unused)) mte = \
- MTE_ENABLED ();
+ MTE_ENABLED (); \
+ bool __attribute__((unused)) sve = \
+ GLRO(dl_aarch64_cpu_features).sve;
diff --git a/sysdeps/aarch64/multiarch/memcpy.c b/sysdeps/aarch64/multiarch/memcpy.c
index 0e0a5cbcfb..d90ee51ffc 100644
--- a/sysdeps/aarch64/multiarch/memcpy.c
+++ b/sysdeps/aarch64/multiarch/memcpy.c
@@ -33,6 +33,9 @@ extern __typeof (__redirect_memcpy) __memcpy_simd attribute_hidden;
extern __typeof (__redirect_memcpy) __memcpy_thunderx attribute_hidden;
extern __typeof (__redirect_memcpy) __memcpy_thunderx2 attribute_hidden;
extern __typeof (__redirect_memcpy) __memcpy_falkor attribute_hidden;
+#if HAVE_AARCH64_SVE_ASM
+extern __typeof (__redirect_memcpy) __memcpy_a64fx attribute_hidden;
+#endif
libc_ifunc (__libc_memcpy,
(IS_THUNDERX (midr)
@@ -44,8 +47,13 @@ libc_ifunc (__libc_memcpy,
: (IS_NEOVERSE_N1 (midr) || IS_NEOVERSE_N2 (midr)
|| IS_NEOVERSE_V1 (midr)
? __memcpy_simd
- : __memcpy_generic)))));
-
+#if HAVE_AARCH64_SVE_ASM
+ : (IS_A64FX (midr)
+ ? __memcpy_a64fx
+ : __memcpy_generic))))));
+#else
+ : __memcpy_generic)))));
+#endif
# undef memcpy
strong_alias (__libc_memcpy, memcpy);
#endif
diff --git a/sysdeps/aarch64/multiarch/memcpy_a64fx.S b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
new file mode 100644
index 0000000000..e28afd708f
--- /dev/null
+++ b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
@@ -0,0 +1,405 @@
+/* Optimized memcpy for Fujitsu A64FX processor.
+ Copyright (C) 2012-2021 Free Software Foundation, Inc.
+
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library. If not, see
+ <https://www.gnu.org/licenses/>. */
+
+#include <sysdep.h>
+
+#if HAVE_AARCH64_SVE_ASM
+#if IS_IN (libc)
+# define MEMCPY __memcpy_a64fx
+# define MEMMOVE __memmove_a64fx
+
+/* Assumptions:
+ *
+ * ARMv8.2-a, AArch64, unaligned accesses, sve
+ *
+ */
+
+#define L2_SIZE (8*1024*1024)/2 // L2 8MB/2
+#define CACHE_LINE_SIZE 256
+#define ZF_DIST (CACHE_LINE_SIZE * 21) // Zerofill distance
+#define dest x0
+#define src x1
+#define n x2 // size
+#define tmp1 x3
+#define tmp2 x4
+#define tmp3 x5
+#define rest x6
+#define dest_ptr x7
+#define src_ptr x8
+#define vector_length x9
+#define cl_remainder x10 // CACHE_LINE_SIZE remainder
+
+ .arch armv8.2-a+sve
+
+ .macro dc_zva times
+ dc zva, tmp1
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ .if \times-1
+ dc_zva "(\times-1)"
+ .endif
+ .endm
+
+ .macro ld1b_unroll8
+ ld1b z0.b, p0/z, [src_ptr, #0, mul vl]
+ ld1b z1.b, p0/z, [src_ptr, #1, mul vl]
+ ld1b z2.b, p0/z, [src_ptr, #2, mul vl]
+ ld1b z3.b, p0/z, [src_ptr, #3, mul vl]
+ ld1b z4.b, p0/z, [src_ptr, #4, mul vl]
+ ld1b z5.b, p0/z, [src_ptr, #5, mul vl]
+ ld1b z6.b, p0/z, [src_ptr, #6, mul vl]
+ ld1b z7.b, p0/z, [src_ptr, #7, mul vl]
+ .endm
+
+ .macro stld1b_unroll4a
+ st1b z0.b, p0, [dest_ptr, #0, mul vl]
+ st1b z1.b, p0, [dest_ptr, #1, mul vl]
+ ld1b z0.b, p0/z, [src_ptr, #0, mul vl]
+ ld1b z1.b, p0/z, [src_ptr, #1, mul vl]
+ st1b z2.b, p0, [dest_ptr, #2, mul vl]
+ st1b z3.b, p0, [dest_ptr, #3, mul vl]
+ ld1b z2.b, p0/z, [src_ptr, #2, mul vl]
+ ld1b z3.b, p0/z, [src_ptr, #3, mul vl]
+ .endm
+
+ .macro stld1b_unroll4b
+ st1b z4.b, p0, [dest_ptr, #4, mul vl]
+ st1b z5.b, p0, [dest_ptr, #5, mul vl]
+ ld1b z4.b, p0/z, [src_ptr, #4, mul vl]
+ ld1b z5.b, p0/z, [src_ptr, #5, mul vl]
+ st1b z6.b, p0, [dest_ptr, #6, mul vl]
+ st1b z7.b, p0, [dest_ptr, #7, mul vl]
+ ld1b z6.b, p0/z, [src_ptr, #6, mul vl]
+ ld1b z7.b, p0/z, [src_ptr, #7, mul vl]
+ .endm
+
+ .macro stld1b_unroll8
+ stld1b_unroll4a
+ stld1b_unroll4b
+ .endm
+
+ .macro st1b_unroll8
+ st1b z0.b, p0, [dest_ptr, #0, mul vl]
+ st1b z1.b, p0, [dest_ptr, #1, mul vl]
+ st1b z2.b, p0, [dest_ptr, #2, mul vl]
+ st1b z3.b, p0, [dest_ptr, #3, mul vl]
+ st1b z4.b, p0, [dest_ptr, #4, mul vl]
+ st1b z5.b, p0, [dest_ptr, #5, mul vl]
+ st1b z6.b, p0, [dest_ptr, #6, mul vl]
+ st1b z7.b, p0, [dest_ptr, #7, mul vl]
+ .endm
+
+ .macro shortcut_for_small_size exit
+ // if rest <= vector_length * 2
+ whilelo p0.b, xzr, n
+ whilelo p1.b, vector_length, n
+ b.last 1f
+ ld1b z0.b, p0/z, [src, #0, mul vl]
+ ld1b z1.b, p1/z, [src, #1, mul vl]
+ st1b z0.b, p0, [dest, #0, mul vl]
+ st1b z1.b, p1, [dest, #1, mul vl]
+ ret
+1: // if rest > vector_length * 8
+ cmp n, vector_length, lsl 3 // vector_length * 8
+ b.hi \exit
+ // if rest <= vector_length * 4
+ lsl tmp1, vector_length, 1 // vector_length * 2
+ whilelo p2.b, tmp1, n
+ incb tmp1
+ whilelo p3.b, tmp1, n
+ b.last 1f
+ ld1b z0.b, p0/z, [src, #0, mul vl]
+ ld1b z1.b, p1/z, [src, #1, mul vl]
+ ld1b z2.b, p2/z, [src, #2, mul vl]
+ ld1b z3.b, p3/z, [src, #3, mul vl]
+ st1b z0.b, p0, [dest, #0, mul vl]
+ st1b z1.b, p1, [dest, #1, mul vl]
+ st1b z2.b, p2, [dest, #2, mul vl]
+ st1b z3.b, p3, [dest, #3, mul vl]
+ ret
+1: // if rest <= vector_length * 8
+ lsl tmp1, vector_length, 2 // vector_length * 4
+ whilelo p4.b, tmp1, n
+ incb tmp1
+ whilelo p5.b, tmp1, n
+ b.last 1f
+ ld1b z0.b, p0/z, [src, #0, mul vl]
+ ld1b z1.b, p1/z, [src, #1, mul vl]
+ ld1b z2.b, p2/z, [src, #2, mul vl]
+ ld1b z3.b, p3/z, [src, #3, mul vl]
+ ld1b z4.b, p4/z, [src, #4, mul vl]
+ ld1b z5.b, p5/z, [src, #5, mul vl]
+ st1b z0.b, p0, [dest, #0, mul vl]
+ st1b z1.b, p1, [dest, #1, mul vl]
+ st1b z2.b, p2, [dest, #2, mul vl]
+ st1b z3.b, p3, [dest, #3, mul vl]
+ st1b z4.b, p4, [dest, #4, mul vl]
+ st1b z5.b, p5, [dest, #5, mul vl]
+ ret
+1: lsl tmp1, vector_length, 2 // vector_length * 4
+ incb tmp1 // vector_length * 5
+ incb tmp1 // vector_length * 6
+ whilelo p6.b, tmp1, n
+ incb tmp1
+ whilelo p7.b, tmp1, n
+ ld1b z0.b, p0/z, [src, #0, mul vl]
+ ld1b z1.b, p1/z, [src, #1, mul vl]
+ ld1b z2.b, p2/z, [src, #2, mul vl]
+ ld1b z3.b, p3/z, [src, #3, mul vl]
+ ld1b z4.b, p4/z, [src, #4, mul vl]
+ ld1b z5.b, p5/z, [src, #5, mul vl]
+ ld1b z6.b, p6/z, [src, #6, mul vl]
+ ld1b z7.b, p7/z, [src, #7, mul vl]
+ st1b z0.b, p0, [dest, #0, mul vl]
+ st1b z1.b, p1, [dest, #1, mul vl]
+ st1b z2.b, p2, [dest, #2, mul vl]
+ st1b z3.b, p3, [dest, #3, mul vl]
+ st1b z4.b, p4, [dest, #4, mul vl]
+ st1b z5.b, p5, [dest, #5, mul vl]
+ st1b z6.b, p6, [dest, #6, mul vl]
+ st1b z7.b, p7, [dest, #7, mul vl]
+ ret
+ .endm
+
+ENTRY (MEMCPY)
+
+ PTR_ARG (0)
+ PTR_ARG (1)
+ SIZE_ARG (2)
+
+L(memcpy):
+ cntb vector_length
+ // shortcut for less than vector_length * 8
+ // gives a free ptrue to p0.b for n >= vector_length
+ shortcut_for_small_size L(vl_agnostic)
+ // end of shortcut
+
+L(vl_agnostic): // VL Agnostic
+ mov rest, n
+ mov dest_ptr, dest
+ mov src_ptr, src
+ // if rest >= L2_SIZE && vector_length == 64 then L(L2)
+ mov tmp1, 64
+ cmp rest, L2_SIZE
+ ccmp vector_length, tmp1, 0, cs
+ b.eq L(L2)
+
+L(unroll8): // unrolling and software pipeline
+ lsl tmp1, vector_length, 3 // vector_length * 8
+ .p2align 3
+ cmp rest, tmp1
+ b.cc L(last)
+ ld1b_unroll8
+ add src_ptr, src_ptr, tmp1
+ sub rest, rest, tmp1
+ cmp rest, tmp1
+ b.cc 2f
+ .p2align 3
+1: stld1b_unroll8
+ add dest_ptr, dest_ptr, tmp1
+ add src_ptr, src_ptr, tmp1
+ sub rest, rest, tmp1
+ cmp rest, tmp1
+ b.ge 1b
+2: st1b_unroll8
+ add dest_ptr, dest_ptr, tmp1
+
+ .p2align 3
+L(last):
+ whilelo p0.b, xzr, rest
+ whilelo p1.b, vector_length, rest
+ b.last 1f
+ ld1b z0.b, p0/z, [src_ptr, #0, mul vl]
+ ld1b z1.b, p1/z, [src_ptr, #1, mul vl]
+ st1b z0.b, p0, [dest_ptr, #0, mul vl]
+ st1b z1.b, p1, [dest_ptr, #1, mul vl]
+ ret
+1: lsl tmp1, vector_length, 1 // vector_length * 2
+ whilelo p2.b, tmp1, rest
+ incb tmp1
+ whilelo p3.b, tmp1, rest
+ b.last 1f
+ ld1b z0.b, p0/z, [src_ptr, #0, mul vl]
+ ld1b z1.b, p1/z, [src_ptr, #1, mul vl]
+ ld1b z2.b, p2/z, [src_ptr, #2, mul vl]
+ ld1b z3.b, p3/z, [src_ptr, #3, mul vl]
+ st1b z0.b, p0, [dest_ptr, #0, mul vl]
+ st1b z1.b, p1, [dest_ptr, #1, mul vl]
+ st1b z2.b, p2, [dest_ptr, #2, mul vl]
+ st1b z3.b, p3, [dest_ptr, #3, mul vl]
+ ret
+1: lsl tmp1, vector_length, 2 // vector_length * 4
+ whilelo p4.b, tmp1, rest
+ incb tmp1
+ whilelo p5.b, tmp1, rest
+ incb tmp1
+ whilelo p6.b, tmp1, rest
+ incb tmp1
+ whilelo p7.b, tmp1, rest
+ ld1b z0.b, p0/z, [src_ptr, #0, mul vl]
+ ld1b z1.b, p1/z, [src_ptr, #1, mul vl]
+ ld1b z2.b, p2/z, [src_ptr, #2, mul vl]
+ ld1b z3.b, p3/z, [src_ptr, #3, mul vl]
+ ld1b z4.b, p4/z, [src_ptr, #4, mul vl]
+ ld1b z5.b, p5/z, [src_ptr, #5, mul vl]
+ ld1b z6.b, p6/z, [src_ptr, #6, mul vl]
+ ld1b z7.b, p7/z, [src_ptr, #7, mul vl]
+ st1b z0.b, p0, [dest_ptr, #0, mul vl]
+ st1b z1.b, p1, [dest_ptr, #1, mul vl]
+ st1b z2.b, p2, [dest_ptr, #2, mul vl]
+ st1b z3.b, p3, [dest_ptr, #3, mul vl]
+ st1b z4.b, p4, [dest_ptr, #4, mul vl]
+ st1b z5.b, p5, [dest_ptr, #5, mul vl]
+ st1b z6.b, p6, [dest_ptr, #6, mul vl]
+ st1b z7.b, p7, [dest_ptr, #7, mul vl]
+ ret
+
+L(L2):
+ // align dest address at CACHE_LINE_SIZE byte boundary
+ mov tmp1, CACHE_LINE_SIZE
+ ands tmp2, dest_ptr, CACHE_LINE_SIZE - 1
+ // if cl_remainder == 0
+ b.eq L(L2_dc_zva)
+ sub cl_remainder, tmp1, tmp2
+ // process remainder until the first CACHE_LINE_SIZE boundary
+ whilelo p1.b, xzr, cl_remainder // keep p0.b all true
+ whilelo p2.b, vector_length, cl_remainder
+ b.last 1f
+ ld1b z1.b, p1/z, [src_ptr, #0, mul vl]
+ ld1b z2.b, p2/z, [src_ptr, #1, mul vl]
+ st1b z1.b, p1, [dest_ptr, #0, mul vl]
+ st1b z2.b, p2, [dest_ptr, #1, mul vl]
+ b 2f
+1: lsl tmp1, vector_length, 1 // vector_length * 2
+ whilelo p3.b, tmp1, cl_remainder
+ incb tmp1
+ whilelo p4.b, tmp1, cl_remainder
+ ld1b z1.b, p1/z, [src_ptr, #0, mul vl]
+ ld1b z2.b, p2/z, [src_ptr, #1, mul vl]
+ ld1b z3.b, p3/z, [src_ptr, #2, mul vl]
+ ld1b z4.b, p4/z, [src_ptr, #3, mul vl]
+ st1b z1.b, p1, [dest_ptr, #0, mul vl]
+ st1b z2.b, p2, [dest_ptr, #1, mul vl]
+ st1b z3.b, p3, [dest_ptr, #2, mul vl]
+ st1b z4.b, p4, [dest_ptr, #3, mul vl]
+2: add dest_ptr, dest_ptr, cl_remainder
+ add src_ptr, src_ptr, cl_remainder
+ sub rest, rest, cl_remainder
+
+L(L2_dc_zva):
+ // zero fill
+ and tmp1, dest, 0xffffffffffffff
+ and tmp2, src, 0xffffffffffffff
+ subs tmp1, tmp1, tmp2 // diff
+ b.ge 1f
+ neg tmp1, tmp1
+1: mov tmp3, ZF_DIST + CACHE_LINE_SIZE * 2
+ cmp tmp1, tmp3
+ b.lo L(unroll8)
+ mov tmp1, dest_ptr
+ dc_zva (ZF_DIST / CACHE_LINE_SIZE) - 1
+ // unroll
+ ld1b_unroll8 // this line has to be after "b.lo L(unroll8)"
+ add src_ptr, src_ptr, CACHE_LINE_SIZE * 2
+ sub rest, rest, CACHE_LINE_SIZE * 2
+ mov tmp1, ZF_DIST
+ .p2align 3
+1: stld1b_unroll4a
+ add tmp2, dest_ptr, tmp1 // dest_ptr + ZF_DIST
+ dc zva, tmp2
+ stld1b_unroll4b
+ add tmp2, tmp2, CACHE_LINE_SIZE
+ dc zva, tmp2
+ add dest_ptr, dest_ptr, CACHE_LINE_SIZE * 2
+ add src_ptr, src_ptr, CACHE_LINE_SIZE * 2
+ sub rest, rest, CACHE_LINE_SIZE * 2
+ cmp rest, tmp3 // ZF_DIST + CACHE_LINE_SIZE * 2
+ b.ge 1b
+ st1b_unroll8
+ add dest_ptr, dest_ptr, CACHE_LINE_SIZE * 2
+ b L(unroll8)
+
+END (MEMCPY)
+libc_hidden_builtin_def (MEMCPY)
+
+
+ENTRY (MEMMOVE)
+
+ PTR_ARG (0)
+ PTR_ARG (1)
+ SIZE_ARG (2)
+
+ // remove tag address
+ // dest has to be immutable because it is the return value
+ // src has to be immutable because it is used in L(bwd_last)
+ and tmp2, dest, 0xffffffffffffff // save dest_notag into tmp2
+ and tmp3, src, 0xffffffffffffff // save src_notag into tmp3
+ cmp n, 0
+ ccmp tmp2, tmp3, 4, ne
+ b.ne 1f
+ ret
+1: cntb vector_length
+ // shortcut for less than vector_length * 8
+ // gives a free ptrue to p0.b for n >= vector_length
+ // tmp2 and tmp3 should not be used in this macro to keep notag addresses
+ shortcut_for_small_size L(dispatch)
+ // end of shortcut
+
+L(dispatch):
+ // tmp2 = dest_notag, tmp3 = src_notag
+ // diff = dest_notag - src_notag
+ sub tmp1, tmp2, tmp3
+ // if diff <= 0 || diff >= n then memcpy
+ cmp tmp1, 0
+ ccmp tmp1, n, 2, gt
+ b.cs L(vl_agnostic)
+
+L(bwd_start):
+ mov rest, n
+ add dest_ptr, dest, n // dest_end
+ add src_ptr, src, n // src_end
+
+L(bwd_unroll8): // unrolling and software pipeline
+ lsl tmp1, vector_length, 3 // vector_length * 8
+ .p2align 3
+ cmp rest, tmp1
+ b.cc L(bwd_last)
+ sub src_ptr, src_ptr, tmp1
+ ld1b_unroll8
+ sub rest, rest, tmp1
+ cmp rest, tmp1
+ b.cc 2f
+ .p2align 3
+1: sub src_ptr, src_ptr, tmp1
+ sub dest_ptr, dest_ptr, tmp1
+ stld1b_unroll8
+ sub rest, rest, tmp1
+ cmp rest, tmp1
+ b.ge 1b
+2: sub dest_ptr, dest_ptr, tmp1
+ st1b_unroll8
+
+L(bwd_last):
+ mov dest_ptr, dest
+ mov src_ptr, src
+ b L(last)
+
+END (MEMMOVE)
+libc_hidden_builtin_def (MEMMOVE)
+#endif /* IS_IN (libc) */
+#endif /* HAVE_AARCH64_SVE_ASM */
diff --git a/sysdeps/aarch64/multiarch/memmove.c b/sysdeps/aarch64/multiarch/memmove.c
index 12d77818a9..be2d35a251 100644
--- a/sysdeps/aarch64/multiarch/memmove.c
+++ b/sysdeps/aarch64/multiarch/memmove.c
@@ -33,6 +33,9 @@ extern __typeof (__redirect_memmove) __memmove_simd attribute_hidden;
extern __typeof (__redirect_memmove) __memmove_thunderx attribute_hidden;
extern __typeof (__redirect_memmove) __memmove_thunderx2 attribute_hidden;
extern __typeof (__redirect_memmove) __memmove_falkor attribute_hidden;
+#if HAVE_AARCH64_SVE_ASM
+extern __typeof (__redirect_memmove) __memmove_a64fx attribute_hidden;
+#endif
libc_ifunc (__libc_memmove,
(IS_THUNDERX (midr)
@@ -44,8 +47,13 @@ libc_ifunc (__libc_memmove,
: (IS_NEOVERSE_N1 (midr) || IS_NEOVERSE_N2 (midr)
|| IS_NEOVERSE_V1 (midr)
? __memmove_simd
- : __memmove_generic)))));
-
+#if HAVE_AARCH64_SVE_ASM
+ : (IS_A64FX (midr)
+ ? __memmove_a64fx
+ : __memmove_generic))))));
+#else
+ : __memmove_generic)))));
+#endif
# undef memmove
strong_alias (__libc_memmove, memmove);
#endif
diff --git a/sysdeps/unix/sysv/linux/aarch64/cpu-features.c b/sysdeps/unix/sysv/linux/aarch64/cpu-features.c
index db6aa3516c..6206a2f618 100644
--- a/sysdeps/unix/sysv/linux/aarch64/cpu-features.c
+++ b/sysdeps/unix/sysv/linux/aarch64/cpu-features.c
@@ -46,6 +46,7 @@ static struct cpu_list cpu_list[] = {
{"ares", 0x411FD0C0},
{"emag", 0x503F0001},
{"kunpeng920", 0x481FD010},
+ {"a64fx", 0x460F0010},
{"generic", 0x0}
};
@@ -116,4 +117,7 @@ init_cpu_features (struct cpu_features *cpu_features)
(PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_ASYNC | MTE_ALLOWED_TAGS),
0, 0, 0);
#endif
+
+ /* Check if SVE is supported. */
+ cpu_features->sve = GLRO (dl_hwcap) & HWCAP_SVE;
}
diff --git a/sysdeps/unix/sysv/linux/aarch64/cpu-features.h b/sysdeps/unix/sysv/linux/aarch64/cpu-features.h
index 3b9bfed134..2b322e5414 100644
--- a/sysdeps/unix/sysv/linux/aarch64/cpu-features.h
+++ b/sysdeps/unix/sysv/linux/aarch64/cpu-features.h
@@ -65,6 +65,9 @@
#define IS_KUNPENG920(midr) (MIDR_IMPLEMENTOR(midr) == 'H' \
&& MIDR_PARTNUM(midr) == 0xd01)
+#define IS_A64FX(midr) (MIDR_IMPLEMENTOR(midr) == 'F' \
+ && MIDR_PARTNUM(midr) == 0x001)
+
struct cpu_features
{
uint64_t midr_el1;
@@ -72,6 +75,7 @@ struct cpu_features
bool bti;
/* Currently, the GLIBC memory tagging tunable only defines 8 bits. */
uint8_t mte_state;
+ bool sve;
};
#endif /* _CPU_FEATURES_AARCH64_H */
--
2.17.1
* [PATCH v2 4/6] aarch64: Added optimized memset for A64FX
2021-05-12 9:23 ` [PATCH v2 0/6] aarch64: " Naohiro Tamura
` (2 preceding siblings ...)
2021-05-12 9:28 ` [PATCH v2 3/6] aarch64: Added optimized memcpy and memmove for A64FX Naohiro Tamura
@ 2021-05-12 9:28 ` Naohiro Tamura
2021-05-26 10:22 ` Szabolcs Nagy via Libc-alpha
2021-05-12 9:29 ` [PATCH v2 5/6] scripts: Added Vector Length Set test helper script Naohiro Tamura
` (4 subsequent siblings)
8 siblings, 1 reply; 72+ messages in thread
From: Naohiro Tamura @ 2021-05-12 9:28 UTC (permalink / raw)
To: libc-alpha; +Cc: Naohiro Tamura
From: Naohiro Tamura <naohirot@jp.fujitsu.com>
This patch optimizes the performance of memset for A64FX [1], which
implements ARMv8-A SVE and has a 64KB L1 cache per core and an 8MB L2
cache per NUMA node.
The optimization makes use of the Scalable Vector Registers with
several techniques: loop unrolling, memory access alignment, cache
zero fill, and prefetch.
The SVE assembly code for memset is implemented as Vector Length
Agnostic code, so in principle it can run on any SoC that supports the
ARMv8-A SVE standard.
We confirmed that all test cases pass by running 'make check' and
'make xcheck', not only on A64FX but also on ThunderX2.
We also confirmed, by running 'make bench', that the performance of
the 512-bit SVE vector registers is roughly 4 times that of the
128-bit Advanced SIMD registers and 8 times that of the scalar 64-bit
registers.
[1] https://github.com/fujitsu/A64FX
---
sysdeps/aarch64/multiarch/Makefile | 1 +
sysdeps/aarch64/multiarch/ifunc-impl-list.c | 5 +-
sysdeps/aarch64/multiarch/memset.c | 11 +-
sysdeps/aarch64/multiarch/memset_a64fx.S | 268 ++++++++++++++++++++
4 files changed, 283 insertions(+), 2 deletions(-)
create mode 100644 sysdeps/aarch64/multiarch/memset_a64fx.S
diff --git a/sysdeps/aarch64/multiarch/Makefile b/sysdeps/aarch64/multiarch/Makefile
index 04c3f17121..7500cf1e93 100644
--- a/sysdeps/aarch64/multiarch/Makefile
+++ b/sysdeps/aarch64/multiarch/Makefile
@@ -2,6 +2,7 @@ ifeq ($(subdir),string)
sysdep_routines += memcpy_generic memcpy_advsimd memcpy_thunderx memcpy_thunderx2 \
memcpy_falkor memcpy_a64fx \
memset_generic memset_falkor memset_emag memset_kunpeng \
+ memset_a64fx \
memchr_generic memchr_nosimd \
strlen_mte strlen_asimd
endif
diff --git a/sysdeps/aarch64/multiarch/ifunc-impl-list.c b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
index 911393565c..4e1a641d9f 100644
--- a/sysdeps/aarch64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
@@ -37,7 +37,7 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
INIT_ARCH ();
- /* Support sysdeps/aarch64/multiarch/memcpy.c and memmove.c. */
+ /* Support sysdeps/aarch64/multiarch/memcpy.c, memmove.c and memset.c. */
IFUNC_IMPL (i, name, memcpy,
IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_thunderx)
IFUNC_IMPL_ADD (array, i, memcpy, !bti, __memcpy_thunderx2)
@@ -62,6 +62,9 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
IFUNC_IMPL_ADD (array, i, memset, (zva_size == 64), __memset_falkor)
IFUNC_IMPL_ADD (array, i, memset, (zva_size == 64), __memset_emag)
IFUNC_IMPL_ADD (array, i, memset, 1, __memset_kunpeng)
+#if HAVE_AARCH64_SVE_ASM
+ IFUNC_IMPL_ADD (array, i, memset, sve, __memset_a64fx)
+#endif
IFUNC_IMPL_ADD (array, i, memset, 1, __memset_generic))
IFUNC_IMPL (i, name, memchr,
IFUNC_IMPL_ADD (array, i, memchr, !mte, __memchr_nosimd)
diff --git a/sysdeps/aarch64/multiarch/memset.c b/sysdeps/aarch64/multiarch/memset.c
index 28d3926bc2..48a59574dd 100644
--- a/sysdeps/aarch64/multiarch/memset.c
+++ b/sysdeps/aarch64/multiarch/memset.c
@@ -31,6 +31,9 @@ extern __typeof (__redirect_memset) __libc_memset;
extern __typeof (__redirect_memset) __memset_falkor attribute_hidden;
extern __typeof (__redirect_memset) __memset_emag attribute_hidden;
extern __typeof (__redirect_memset) __memset_kunpeng attribute_hidden;
+#if HAVE_AARCH64_SVE_ASM
+extern __typeof (__redirect_memset) __memset_a64fx attribute_hidden;
+#endif
extern __typeof (__redirect_memset) __memset_generic attribute_hidden;
libc_ifunc (__libc_memset,
@@ -40,7 +43,13 @@ libc_ifunc (__libc_memset,
? __memset_falkor
: (IS_EMAG (midr) && zva_size == 64
? __memset_emag
- : __memset_generic)));
+#if HAVE_AARCH64_SVE_ASM
+ : (IS_A64FX (midr)
+ ? __memset_a64fx
+ : __memset_generic))));
+#else
+ : __memset_generic)));
+#endif
# undef memset
strong_alias (__libc_memset, memset);
diff --git a/sysdeps/aarch64/multiarch/memset_a64fx.S b/sysdeps/aarch64/multiarch/memset_a64fx.S
new file mode 100644
index 0000000000..9bd58cab6d
--- /dev/null
+++ b/sysdeps/aarch64/multiarch/memset_a64fx.S
@@ -0,0 +1,268 @@
+/* Optimized memset for Fujitsu A64FX processor.
+ Copyright (C) 2012-2021 Free Software Foundation, Inc.
+
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library. If not, see
+ <https://www.gnu.org/licenses/>. */
+
+#include <sysdep.h>
+#include <sysdeps/aarch64/memset-reg.h>
+
+#if HAVE_AARCH64_SVE_ASM
+#if IS_IN (libc)
+# define MEMSET __memset_a64fx
+
+/* Assumptions:
+ *
+ * ARMv8.2-a, AArch64, unaligned accesses, sve
+ *
+ */
+
+#define L1_SIZE (64*1024) // L1 64KB
+#define L2_SIZE (8*1024*1024) // L2 8MB
+#define CACHE_LINE_SIZE 256
+#define PF_DIST_L1 (CACHE_LINE_SIZE * 16) // Prefetch distance L1
+#define ZF_DIST (CACHE_LINE_SIZE * 21) // Zerofill distance
+#define rest x8
+#define vector_length x9
+#define vl_remainder x10 // vector_length remainder
+#define cl_remainder x11 // CACHE_LINE_SIZE remainder
+
+ .arch armv8.2-a+sve
+
+ .macro dc_zva times
+ dc zva, tmp1
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ .if \times-1
+ dc_zva "(\times-1)"
+ .endif
+ .endm
+
+ .macro st1b_unroll first=0, last=7
+ st1b z0.b, p0, [dst, #\first, mul vl]
+ .if \last-\first
+ st1b_unroll "(\first+1)", \last
+ .endif
+ .endm
+
+ .macro shortcut_for_small_size exit
+ // if rest <= vector_length * 2
+ whilelo p0.b, xzr, count
+ whilelo p1.b, vector_length, count
+ b.last 1f
+ st1b z0.b, p0, [dstin, #0, mul vl]
+ st1b z0.b, p1, [dstin, #1, mul vl]
+ ret
+1: // if rest > vector_length * 8
+ cmp count, vector_length, lsl 3 // vector_length * 8
+ b.hi \exit
+ // if rest <= vector_length * 4
+ lsl tmp1, vector_length, 1 // vector_length * 2
+ whilelo p2.b, tmp1, count
+ incb tmp1
+ whilelo p3.b, tmp1, count
+ b.last 1f
+ st1b z0.b, p0, [dstin, #0, mul vl]
+ st1b z0.b, p1, [dstin, #1, mul vl]
+ st1b z0.b, p2, [dstin, #2, mul vl]
+ st1b z0.b, p3, [dstin, #3, mul vl]
+ ret
+1: // if rest <= vector_length * 8
+ lsl tmp1, vector_length, 2 // vector_length * 4
+ whilelo p4.b, tmp1, count
+ incb tmp1
+ whilelo p5.b, tmp1, count
+ b.last 1f
+ st1b z0.b, p0, [dstin, #0, mul vl]
+ st1b z0.b, p1, [dstin, #1, mul vl]
+ st1b z0.b, p2, [dstin, #2, mul vl]
+ st1b z0.b, p3, [dstin, #3, mul vl]
+ st1b z0.b, p4, [dstin, #4, mul vl]
+ st1b z0.b, p5, [dstin, #5, mul vl]
+ ret
+1: lsl tmp1, vector_length, 2 // vector_length * 4
+ incb tmp1 // vector_length * 5
+ incb tmp1 // vector_length * 6
+ whilelo p6.b, tmp1, count
+ incb tmp1
+ whilelo p7.b, tmp1, count
+ st1b z0.b, p0, [dstin, #0, mul vl]
+ st1b z0.b, p1, [dstin, #1, mul vl]
+ st1b z0.b, p2, [dstin, #2, mul vl]
+ st1b z0.b, p3, [dstin, #3, mul vl]
+ st1b z0.b, p4, [dstin, #4, mul vl]
+ st1b z0.b, p5, [dstin, #5, mul vl]
+ st1b z0.b, p6, [dstin, #6, mul vl]
+ st1b z0.b, p7, [dstin, #7, mul vl]
+ ret
+ .endm
+
+ENTRY (MEMSET)
+
+ PTR_ARG (0)
+ SIZE_ARG (2)
+
+ cbnz count, 1f
+ ret
+1: dup z0.b, valw
+ cntb vector_length
+ // shortcut for less than vector_length * 8
+ // gives a free ptrue to p0.b for n >= vector_length
+ shortcut_for_small_size L(vl_agnostic)
+ // end of shortcut
+
+L(vl_agnostic): // VL Agnostic
+ mov rest, count
+ mov dst, dstin
+ add dstend, dstin, count
+ // if rest >= L2_SIZE && vector_length == 64 then L(L2)
+ mov tmp1, 64
+ cmp rest, L2_SIZE
+ ccmp vector_length, tmp1, 0, cs
+ b.eq L(L2)
+ // if rest >= L1_SIZE && vector_length == 64 then L(L1_prefetch)
+ cmp rest, L1_SIZE
+ ccmp vector_length, tmp1, 0, cs
+ b.eq L(L1_prefetch)
+
+L(unroll32):
+ lsl tmp1, vector_length, 3 // vector_length * 8
+ lsl tmp2, vector_length, 5 // vector_length * 32
+ .p2align 3
+1: cmp rest, tmp2
+ b.cc L(unroll8)
+ st1b_unroll
+ add dst, dst, tmp1
+ st1b_unroll
+ add dst, dst, tmp1
+ st1b_unroll
+ add dst, dst, tmp1
+ st1b_unroll
+ add dst, dst, tmp1
+ sub rest, rest, tmp2
+ b 1b
+
+L(unroll8):
+ lsl tmp1, vector_length, 3
+ .p2align 3
+1: cmp rest, tmp1
+ b.cc L(last)
+ st1b_unroll
+ add dst, dst, tmp1
+ sub rest, rest, tmp1
+ b 1b
+
+L(last):
+ whilelo p0.b, xzr, rest
+ whilelo p1.b, vector_length, rest
+ b.last 1f
+ st1b z0.b, p0, [dst, #0, mul vl]
+ st1b z0.b, p1, [dst, #1, mul vl]
+ ret
+1: lsl tmp1, vector_length, 1 // vector_length * 2
+ whilelo p2.b, tmp1, rest
+ incb tmp1
+ whilelo p3.b, tmp1, rest
+ b.last 1f
+ st1b z0.b, p0, [dst, #0, mul vl]
+ st1b z0.b, p1, [dst, #1, mul vl]
+ st1b z0.b, p2, [dst, #2, mul vl]
+ st1b z0.b, p3, [dst, #3, mul vl]
+ ret
+1: lsl tmp1, vector_length, 2 // vector_length * 4
+ whilelo p4.b, tmp1, rest
+ incb tmp1
+ whilelo p5.b, tmp1, rest
+ incb tmp1
+ whilelo p6.b, tmp1, rest
+ incb tmp1
+ whilelo p7.b, tmp1, rest
+ st1b z0.b, p0, [dst, #0, mul vl]
+ st1b z0.b, p1, [dst, #1, mul vl]
+ st1b z0.b, p2, [dst, #2, mul vl]
+ st1b z0.b, p3, [dst, #3, mul vl]
+ st1b z0.b, p4, [dst, #4, mul vl]
+ st1b z0.b, p5, [dst, #5, mul vl]
+ st1b z0.b, p6, [dst, #6, mul vl]
+ st1b z0.b, p7, [dst, #7, mul vl]
+ ret
+
+L(L1_prefetch): // if rest >= L1_SIZE
+ .p2align 3
+1: st1b_unroll 0, 3
+ prfm pstl1keep, [dst, PF_DIST_L1]
+ st1b_unroll 4, 7
+ prfm pstl1keep, [dst, PF_DIST_L1 + CACHE_LINE_SIZE]
+ add dst, dst, CACHE_LINE_SIZE * 2
+ sub rest, rest, CACHE_LINE_SIZE * 2
+ cmp rest, L1_SIZE
+ b.ge 1b
+ cbnz rest, L(unroll32)
+ ret
+
+L(L2):
+ // align dst address at vector_length byte boundary
+ sub tmp1, vector_length, 1
+ ands tmp2, dst, tmp1
+ // if vl_remainder == 0
+ b.eq 1f
+ sub vl_remainder, vector_length, tmp2
+ // process remainder until the first vector_length boundary
+ whilelt p2.b, xzr, vl_remainder
+ st1b z0.b, p2, [dst]
+ add dst, dst, vl_remainder
+ sub rest, rest, vl_remainder
+ // align dstin address at CACHE_LINE_SIZE byte boundary
+1: mov tmp1, CACHE_LINE_SIZE
+ ands tmp2, dst, CACHE_LINE_SIZE - 1
+ // if cl_remainder == 0
+ b.eq L(L2_dc_zva)
+ sub cl_remainder, tmp1, tmp2
+ // process remainder until the first CACHE_LINE_SIZE boundary
+ mov tmp1, xzr // index
+2: whilelt p2.b, tmp1, cl_remainder
+ st1b z0.b, p2, [dst, tmp1]
+ incb tmp1
+ cmp tmp1, cl_remainder
+ b.lo 2b
+ add dst, dst, cl_remainder
+ sub rest, rest, cl_remainder
+
+L(L2_dc_zva):
+ // zero fill
+ mov tmp1, dst
+ dc_zva (ZF_DIST / CACHE_LINE_SIZE) - 1
+ mov zva_len, ZF_DIST
+ add tmp1, zva_len, CACHE_LINE_SIZE * 2
+ // unroll
+ .p2align 3
+1: st1b_unroll 0, 3
+ add tmp2, dst, zva_len
+ dc zva, tmp2
+ st1b_unroll 4, 7
+ add tmp2, tmp2, CACHE_LINE_SIZE
+ dc zva, tmp2
+ add dst, dst, CACHE_LINE_SIZE * 2
+ sub rest, rest, CACHE_LINE_SIZE * 2
+ cmp rest, tmp1 // ZF_DIST + CACHE_LINE_SIZE * 2
+ b.ge 1b
+ cbnz rest, L(unroll8)
+ ret
+
+END (MEMSET)
+libc_hidden_builtin_def (MEMSET)
+
+#endif /* IS_IN (libc) */
+#endif /* HAVE_AARCH64_SVE_ASM */
--
2.17.1
* [PATCH v2 5/6] scripts: Added Vector Length Set test helper script
2021-05-12 9:23 ` [PATCH v2 0/6] aarch64: " Naohiro Tamura
` (3 preceding siblings ...)
2021-05-12 9:28 ` [PATCH v2 4/6] aarch64: Added optimized memset " Naohiro Tamura
@ 2021-05-12 9:29 ` Naohiro Tamura
2021-05-12 16:58 ` Joseph Myers
2021-05-20 7:34 ` Naohiro Tamura
2021-05-12 9:29 ` [PATCH v2 6/6] benchtests: Fixed bench-memcpy-random: buf1: mprotect failed Naohiro Tamura
` (3 subsequent siblings)
8 siblings, 2 replies; 72+ messages in thread
From: Naohiro Tamura @ 2021-05-12 9:29 UTC (permalink / raw)
To: libc-alpha; +Cc: Naohiro Tamura
From: Naohiro Tamura <naohirot@jp.fujitsu.com>
This patch adds a test helper script that changes the Vector Length
for a child process. The script can be used as test-wrapper for 'make
check'.
Usage examples:
ubuntu@bionic:~/build$ make check subdirs=string \
test-wrapper='~/glibc/scripts/vltest.py 16'
ubuntu@bionic:~/build$ ~/glibc/scripts/vltest.py 16 make test \
t=string/test-memcpy
ubuntu@bionic:~/build$ ~/glibc/scripts/vltest.py 32 ./debugglibc.sh \
string/test-memmove
ubuntu@bionic:~/build$ ~/glibc/scripts/vltest.py 64 ./testrun.sh
string/test-memset
---
scripts/vltest.py | 82 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 82 insertions(+)
create mode 100755 scripts/vltest.py
diff --git a/scripts/vltest.py b/scripts/vltest.py
new file mode 100755
index 0000000000..264dfa449f
--- /dev/null
+++ b/scripts/vltest.py
@@ -0,0 +1,82 @@
+#!/usr/bin/python3
+# Set Scalable Vector Length test helper
+# Copyright (C) 2019-2021 Free Software Foundation, Inc.
+# This file is part of the GNU C Library.
+#
+# The GNU C Library is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# The GNU C Library is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with the GNU C Library; if not, see
+# <https://www.gnu.org/licenses/>.
+"""Set Scalable Vector Length test helper.
+
+Set Scalable Vector Length for child process.
+
+examples:
+
+ubuntu@bionic:~/build$ make check subdirs=string \
+test-wrapper='~/glibc/scripts/vltest.py 16'
+
+ubuntu@bionic:~/build$ ~/glibc/scripts/vltest.py 16 make test \
+t=string/test-memcpy
+
+ubuntu@bionic:~/build$ ~/glibc/scripts/vltest.py 32 ./debugglibc.sh \
+string/test-memmove
+
+ubuntu@bionic:~/build$ ~/glibc/scripts/vltest.py 64 ./testrun.sh \
+string/test-memset
+"""
+import argparse
+from ctypes import cdll, CDLL
+import os
+import sys
+
+EXIT_SUCCESS = 0
+EXIT_FAILURE = 1
+EXIT_UNSUPPORTED = 77
+
+AT_HWCAP = 16
+HWCAP_SVE = (1 << 22)
+
+PR_SVE_GET_VL = 51
+PR_SVE_SET_VL = 50
+PR_SVE_SET_VL_ONEXEC = (1 << 18)
+PR_SVE_VL_INHERIT = (1 << 17)
+PR_SVE_VL_LEN_MASK = 0xffff
+
+def main(args):
+ libc = CDLL("libc.so.6")
+ if not libc.getauxval(AT_HWCAP) & HWCAP_SVE:
+ print("CPU doesn't support SVE")
+ sys.exit(EXIT_UNSUPPORTED)
+
+ libc.prctl(PR_SVE_SET_VL,
+ args.vl[0] | PR_SVE_SET_VL_ONEXEC | PR_SVE_VL_INHERIT)
+ os.execvp(args.args[0], args.args)
+ print("exec system call failure")
+ sys.exit(EXIT_FAILURE)
+
+if __name__ == '__main__':
+ parser = argparse.ArgumentParser(description=
+ "Set Scalable Vector Length test helper",
+ formatter_class=argparse.ArgumentDefaultsHelpFormatter)
+
+ # positional argument
+ parser.add_argument("vl", nargs=1, type=int,
+ choices=range(16, 257, 16),
+ help=('vector length '\
+ 'which is multiples of 16 from 16 to 256'))
+ # remainder arguments
+ parser.add_argument('args', nargs=argparse.REMAINDER,
+ help=('args '\
+ 'which is passed to child process'))
+ args = parser.parse_args()
+ main(args)
--
2.17.1
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v2 6/6] benchtests: Fixed bench-memcpy-random: buf1: mprotect failed
2021-05-12 9:23 ` [PATCH v2 0/6] aarch64: " Naohiro Tamura
` (4 preceding siblings ...)
2021-05-12 9:29 ` [PATCH v2 5/6] scripts: Added Vector Length Set test helper script Naohiro Tamura
@ 2021-05-12 9:29 ` Naohiro Tamura
2021-05-26 10:25 ` Szabolcs Nagy via Libc-alpha
2021-05-27 0:22 ` [PATCH v2 0/6] aarch64: Added optimized memcpy/memmove/memset for A64FX naohirot
` (2 subsequent siblings)
8 siblings, 1 reply; 72+ messages in thread
From: Naohiro Tamura @ 2021-05-12 9:29 UTC (permalink / raw)
To: libc-alpha; +Cc: Naohiro Tamura
From: Naohiro Tamura <naohirot@jp.fujitsu.com>
This patch fixes an mprotect system call failure on AArch64. The
failure occurred not only on A64FX but also on ThunderX2.
Also, this patch updates the JSON key from "max-size" to "length" so
that 'plot_strings.py' can process 'bench-memcpy-random.out'.
---
benchtests/bench-memcpy-random.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/benchtests/bench-memcpy-random.c b/benchtests/bench-memcpy-random.c
index 9b62033379..c490b73ed0 100644
--- a/benchtests/bench-memcpy-random.c
+++ b/benchtests/bench-memcpy-random.c
@@ -16,7 +16,7 @@
License along with the GNU C Library; if not, see
<https://www.gnu.org/licenses/>. */
-#define MIN_PAGE_SIZE (512*1024+4096)
+#define MIN_PAGE_SIZE (512*1024+getpagesize())
#define TEST_MAIN
#define TEST_NAME "memcpy"
#include "bench-string.h"
@@ -160,7 +160,7 @@ do_test (json_ctx_t *json_ctx, size_t max_size)
}
json_element_object_begin (json_ctx);
- json_attr_uint (json_ctx, "max-size", (double) max_size);
+ json_attr_uint (json_ctx, "length", (double) max_size);
json_array_begin (json_ctx, "timings");
FOR_EACH_IMPL (impl, 0)
--
2.17.1
^ permalink raw reply related [flat|nested] 72+ messages in thread
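The fix above replaces the hardcoded 4096 with getpagesize() because AArch64 kernels are often configured with 64 KiB pages (an assumption for illustration; the patch itself only states that mprotect failed). A minimal sketch of the arithmetic, with `min_page_size()` as a hypothetical Python analogue of the MIN_PAGE_SIZE macro:

```python
def min_page_size(pagesize):
    """Python analogue of: #define MIN_PAGE_SIZE (512*1024+getpagesize())"""
    return 512 * 1024 + pagesize

old = 512 * 1024 + 4096          # the previous, hardcoded definition
new = min_page_size(64 * 1024)   # on a 64 KiB-page AArch64 kernel

# 512 KiB is a whole number of 64 KiB pages, but 512 KiB + 4096 is not,
# so a buffer sized with the old constant does not end on a page
# boundary on such kernels, and mprotect on it can fail.
print(old % (64 * 1024) == 0)   # False
print(new % (64 * 1024) == 0)   # True
```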
* Re: [PATCH v2 5/6] scripts: Added Vector Length Set test helper script
2021-05-12 9:29 ` [PATCH v2 5/6] scripts: Added Vector Length Set test helper script Naohiro Tamura
@ 2021-05-12 16:58 ` Joseph Myers
2021-05-13 9:53 ` naohirot
2021-05-20 7:34 ` Naohiro Tamura
1 sibling, 1 reply; 72+ messages in thread
From: Joseph Myers @ 2021-05-12 16:58 UTC (permalink / raw)
To: Naohiro Tamura; +Cc: Naohiro Tamura, libc-alpha
On Wed, 12 May 2021, Naohiro Tamura wrote:
> From: Naohiro Tamura <naohirot@jp.fujitsu.com>
>
> This patch adds a test helper script that changes the Vector Length for
> a child process. The script can be used as the test-wrapper for 'make check'.
This is specific to AArch64, so I think it would better go under
sysdeps/unix/sysv/linux/aarch64/ rather than under scripts/.
There is also the question of how to make this discoverable to people
developing glibc. Maybe this script should be mentioned in install.texi
(with INSTALL regenerated accordingly), with the documentation there
clearly explaining that it's specific to AArch64 GNU/Linux.
--
Joseph S. Myers
joseph@codesourcery.com
^ permalink raw reply [flat|nested] 72+ messages in thread
* RE: [PATCH v2 5/6] scripts: Added Vector Length Set test helper script
2021-05-12 16:58 ` Joseph Myers
@ 2021-05-13 9:53 ` naohirot
0 siblings, 0 replies; 72+ messages in thread
From: naohirot @ 2021-05-13 9:53 UTC (permalink / raw)
To: 'Joseph Myers'; +Cc: libc-alpha@sourceware.org
Hi Joseph,
Thank you for the review.
> From: Joseph Myers <joseph@codesourcery.com>
> > This patch adds a test helper script that changes the Vector Length for
> > a child process. The script can be used as the test-wrapper for 'make check'.
>
> This is specific to AArch64, so I think it would better go under
> sysdeps/unix/sysv/linux/aarch64/ rather than under scripts/.
OK, I moved it to sysdeps/unix/sysv/linux/aarch64/.
> There is also the question of how to make this discoverable to people developing
> glibc. Maybe this script should be mentioned in install.texi (with INSTALL
> regenerated accordingly), with the documentation there clearly explaining that it's
> specific to AArch64 GNU/Linux.
OK, I updated install.texi, INSTALL, and the vltest.py doc comment, as
well as the commit message, as shown below and in my GitHub repository [1].
[1] https://github.com/NaohiroTamura/glibc/commit/37a5832fea109ab939ffdf58a2a19d5707849cc5
[commit message] aarch64: Added Vector Length Set test helper script
This patch adds a test helper script that changes the Vector Length for
a child process. The script can be used as the test-wrapper for 'make check'.
Usage examples:
~/build$ make check subdirs=string \
test-wrapper='~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 16'
~/build$ ~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 16 \
make test t=string/test-memcpy
~/build$ ~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 32 \
./debugglibc.sh string/test-memmove
~/build$ ~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 64 \
./testrun.sh string/test-memset
---
INSTALL | 4 ++
manual/install.texi | 3 +
sysdeps/unix/sysv/linux/aarch64/vltest.py | 82 +++++++++++++++++++++++
3 files changed, 89 insertions(+)
create mode 100755 sysdeps/unix/sysv/linux/aarch64/vltest.py
diff --git a/INSTALL b/INSTALL
index 065a568585..bc761ab98b 100644
--- a/INSTALL
+++ b/INSTALL
@@ -380,6 +380,10 @@ the same syntax as 'test-wrapper-env', the only difference in its
semantics being starting with an empty set of environment variables
rather than the ambient set.
+ For AArch64 with SVE, when testing the GNU C Library, 'test-wrapper'
+may be set to "SRCDIR/sysdeps/unix/sysv/linux/aarch64/vltest.py
+VECTOR-LENGTH" to change Vector Length.
+
Installing the C Library
========================
diff --git a/manual/install.texi b/manual/install.texi
index eb41fbd0b5..f1d858fb78 100644
--- a/manual/install.texi
+++ b/manual/install.texi
@@ -418,6 +418,9 @@ use has the same syntax as @samp{test-wrapper-env}, the only
difference in its semantics being starting with an empty set of
environment variables rather than the ambient set.
+For AArch64 with SVE, when testing @theglibc{}, @samp{test-wrapper}
+may be set to "@var{srcdir}/sysdeps/unix/sysv/linux/aarch64/vltest.py
+@var{vector-length}" to change Vector Length.
@node Running make install
@appendixsec Installing the C Library
diff --git a/sysdeps/unix/sysv/linux/aarch64/vltest.py b/sysdeps/unix/sysv/linux/aarch64/vltest.py
new file mode 100755
index 0000000000..bed62ad151
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/aarch64/vltest.py
@@ -0,0 +1,82 @@
+#!/usr/bin/python3
+# Set Scalable Vector Length test helper
+# Copyright (C) 2021 Free Software Foundation, Inc.
+# This file is part of the GNU C Library.
+#
+# The GNU C Library is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# The GNU C Library is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with the GNU C Library; if not, see
+# <https://www.gnu.org/licenses/>.
+"""Set Scalable Vector Length test helper.
+
+Set Scalable Vector Length for child process.
+
+examples:
+
+~/build$ make check subdirs=string \
+test-wrapper='~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 16'
+
+~/build$ ~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 16 \
+make test t=string/test-memcpy
+
+~/build$ ~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 32 \
+./debugglibc.sh string/test-memmove
+
+~/build$ ~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 64 \
+./testrun.sh string/test-memset
+"""
Thanks.
Naohiro
^ permalink raw reply related [flat|nested] 72+ messages in thread
* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
2021-05-10 1:45 ` naohirot
@ 2021-05-14 13:35 ` Szabolcs Nagy via Libc-alpha
2021-05-19 0:11 ` naohirot
0 siblings, 1 reply; 72+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-05-14 13:35 UTC (permalink / raw)
To: naohirot@fujitsu.com, Carlos O'Donell
Cc: Florian Weimer, libc-alpha@sourceware.org, Wilco Dijkstra
The 05/10/2021 01:45, naohirot@fujitsu.com wrote:
> FYI: Fujitsu has submitted the signed assignment finally.
Carlos, can we commit patches from fujitsu now?
(i dont know if we are still waiting for something)
^ permalink raw reply [flat|nested] 72+ messages in thread
* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
2021-05-14 13:35 ` Szabolcs Nagy via Libc-alpha
@ 2021-05-19 0:11 ` naohirot
0 siblings, 0 replies; 72+ messages in thread
From: naohirot @ 2021-05-19 0:11 UTC (permalink / raw)
To: 'Szabolcs Nagy', Carlos O'Donell
Cc: Florian Weimer, libc-alpha@sourceware.org, Wilco Dijkstra
Hi Szabolcs, Carlos,
> From: Szabolcs Nagy <Szabolcs.Nagy@arm.com>
> Sent: Friday, May 14, 2021 10:36 PM
>
> The 05/10/2021 01:45, naohirot@fujitsu.com wrote:
> > FYI: Fujitsu has submitted the signed assignment finally.
>
> Carlos, can we commit patches from fujitsu now?
> (i dont know if we are still waiting for something)
Fujitsu has received the FSF-signed assignment, so the contract process
is complete.
Thanks.
Naohiro
^ permalink raw reply [flat|nested] 72+ messages in thread
* [PATCH v2 5/6] scripts: Added Vector Length Set test helper script
2021-05-12 9:29 ` [PATCH v2 5/6] scripts: Added Vector Length Set test helper script Naohiro Tamura
2021-05-12 16:58 ` Joseph Myers
@ 2021-05-20 7:34 ` Naohiro Tamura
2021-05-26 10:24 ` Szabolcs Nagy via Libc-alpha
1 sibling, 1 reply; 72+ messages in thread
From: Naohiro Tamura @ 2021-05-20 7:34 UTC (permalink / raw)
To: libc-alpha; +Cc: Naohiro Tamura
From: Naohiro Tamura <naohirot@jp.fujitsu.com>
Let me send the whole updated patch.
Thanks.
Naohiro
-- >8 --
Subject: [PATCH v2 5/6] aarch64: Added Vector Length Set test helper script
This patch adds a test helper script that changes the Vector Length for
a child process. The script can be used as the test-wrapper for 'make check'.
Usage examples:
~/build$ make check subdirs=string \
test-wrapper='~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 16'
~/build$ ~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 16 \
make test t=string/test-memcpy
~/build$ ~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 32 \
./debugglibc.sh string/test-memmove
~/build$ ~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 64 \
./testrun.sh string/test-memset
---
INSTALL | 4 ++
manual/install.texi | 3 +
sysdeps/unix/sysv/linux/aarch64/vltest.py | 82 +++++++++++++++++++++++
3 files changed, 89 insertions(+)
create mode 100755 sysdeps/unix/sysv/linux/aarch64/vltest.py
diff --git a/INSTALL b/INSTALL
index 065a568585e6..bc761ab98bbf 100644
--- a/INSTALL
+++ b/INSTALL
@@ -380,6 +380,10 @@ the same syntax as 'test-wrapper-env', the only difference in its
semantics being starting with an empty set of environment variables
rather than the ambient set.
+ For AArch64 with SVE, when testing the GNU C Library, 'test-wrapper'
+may be set to "SRCDIR/sysdeps/unix/sysv/linux/aarch64/vltest.py
+VECTOR-LENGTH" to change Vector Length.
+
Installing the C Library
========================
diff --git a/manual/install.texi b/manual/install.texi
index eb41fbd0b5ab..f1d858fb789c 100644
--- a/manual/install.texi
+++ b/manual/install.texi
@@ -418,6 +418,9 @@ use has the same syntax as @samp{test-wrapper-env}, the only
difference in its semantics being starting with an empty set of
environment variables rather than the ambient set.
+For AArch64 with SVE, when testing @theglibc{}, @samp{test-wrapper}
+may be set to "@var{srcdir}/sysdeps/unix/sysv/linux/aarch64/vltest.py
+@var{vector-length}" to change Vector Length.
@node Running make install
@appendixsec Installing the C Library
diff --git a/sysdeps/unix/sysv/linux/aarch64/vltest.py b/sysdeps/unix/sysv/linux/aarch64/vltest.py
new file mode 100755
index 000000000000..bed62ad151e0
--- /dev/null
+++ b/sysdeps/unix/sysv/linux/aarch64/vltest.py
@@ -0,0 +1,82 @@
+#!/usr/bin/python3
+# Set Scalable Vector Length test helper
+# Copyright (C) 2021 Free Software Foundation, Inc.
+# This file is part of the GNU C Library.
+#
+# The GNU C Library is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# The GNU C Library is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with the GNU C Library; if not, see
+# <https://www.gnu.org/licenses/>.
+"""Set Scalable Vector Length test helper.
+
+Set Scalable Vector Length for child process.
+
+examples:
+
+~/build$ make check subdirs=string \
+test-wrapper='~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 16'
+
+~/build$ ~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 16 \
+make test t=string/test-memcpy
+
+~/build$ ~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 32 \
+./debugglibc.sh string/test-memmove
+
+~/build$ ~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 64 \
+./testrun.sh string/test-memset
+"""
+import argparse
+from ctypes import cdll, CDLL
+import os
+import sys
+
+EXIT_SUCCESS = 0
+EXIT_FAILURE = 1
+EXIT_UNSUPPORTED = 77
+
+AT_HWCAP = 16
+HWCAP_SVE = (1 << 22)
+
+PR_SVE_GET_VL = 51
+PR_SVE_SET_VL = 50
+PR_SVE_SET_VL_ONEXEC = (1 << 18)
+PR_SVE_VL_INHERIT = (1 << 17)
+PR_SVE_VL_LEN_MASK = 0xffff
+
+def main(args):
+ libc = CDLL("libc.so.6")
+ if not libc.getauxval(AT_HWCAP) & HWCAP_SVE:
+ print("CPU doesn't support SVE")
+ sys.exit(EXIT_UNSUPPORTED)
+
+ libc.prctl(PR_SVE_SET_VL,
+ args.vl[0] | PR_SVE_SET_VL_ONEXEC | PR_SVE_VL_INHERIT)
+ os.execvp(args.args[0], args.args)
+ print("exec system call failure")
+ sys.exit(EXIT_FAILURE)
+
+if __name__ == '__main__':
+ parser = argparse.ArgumentParser(description=
+ "Set Scalable Vector Length test helper",
+ formatter_class=argparse.ArgumentDefaultsHelpFormatter)
+
+ # positional argument
+ parser.add_argument("vl", nargs=1, type=int,
+ choices=range(16, 257, 16),
+ help=('vector length '\
+ 'which is a multiple of 16, from 16 to 256'))
+ # remainder arguments
+ parser.add_argument('args', nargs=argparse.REMAINDER,
+ help=('args '\
+ 'which are passed to the child process'))
+ args = parser.parse_args()
+ main(args)
--
2.17.1
^ permalink raw reply related [flat|nested] 72+ messages in thread
* Re: [PATCH v2 1/6] config: Added HAVE_AARCH64_SVE_ASM for aarch64
2021-05-12 9:26 ` [PATCH v2 1/6] config: Added HAVE_AARCH64_SVE_ASM for aarch64 Naohiro Tamura
@ 2021-05-26 10:05 ` Szabolcs Nagy via Libc-alpha
0 siblings, 0 replies; 72+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-05-26 10:05 UTC (permalink / raw)
To: Naohiro Tamura; +Cc: Naohiro Tamura, libc-alpha
The 05/12/2021 09:26, Naohiro Tamura wrote:
> From: Naohiro Tamura <naohirot@jp.fujitsu.com>
>
> This patch checks whether the assembler supports '-march=armv8.2-a+sve'
> to generate SVE code, and if so defines the HAVE_AARCH64_SVE_ASM macro.
this is ok for master.
i will commit it for you.
> ---
> config.h.in | 5 +++++
> sysdeps/aarch64/configure | 28 ++++++++++++++++++++++++++++
> sysdeps/aarch64/configure.ac | 15 +++++++++++++++
> 3 files changed, 48 insertions(+)
>
> diff --git a/config.h.in b/config.h.in
> index 99036b887f..13fba9bb8d 100644
> --- a/config.h.in
> +++ b/config.h.in
> @@ -121,6 +121,11 @@
> /* AArch64 PAC-RET code generation is enabled. */
> #define HAVE_AARCH64_PAC_RET 0
>
> +/* The assembler supports ARMv8.2-A SVE.
> +   This macro becomes obsolete once glibc increases the minimum
> +   required version of GNU 'binutils' to 2.28 or later. */
> +#define HAVE_AARCH64_SVE_ASM 0
> +
> /* ARC big endian ABI */
> #undef HAVE_ARC_BE
>
> diff --git a/sysdeps/aarch64/configure b/sysdeps/aarch64/configure
> index 83c3a23e44..4c1fac49f3 100644
> --- a/sysdeps/aarch64/configure
> +++ b/sysdeps/aarch64/configure
> @@ -304,3 +304,31 @@ fi
> $as_echo "$libc_cv_aarch64_variant_pcs" >&6; }
> config_vars="$config_vars
> aarch64-variant-pcs = $libc_cv_aarch64_variant_pcs"
> +
> +# Check if asm support armv8.2-a+sve
> +{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for SVE support in assembler" >&5
> +$as_echo_n "checking for SVE support in assembler... " >&6; }
> +if ${libc_cv_aarch64_sve_asm+:} false; then :
> + $as_echo_n "(cached) " >&6
> +else
> + cat > conftest.s <<\EOF
> + ptrue p0.b
> +EOF
> +if { ac_try='${CC-cc} -c -march=armv8.2-a+sve conftest.s 1>&5'
> + { { eval echo "\"\$as_me\":${as_lineno-$LINENO}: \"$ac_try\""; } >&5
> + (eval $ac_try) 2>&5
> + ac_status=$?
> + $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
> + test $ac_status = 0; }; }; then
> + libc_cv_aarch64_sve_asm=yes
> +else
> + libc_cv_aarch64_sve_asm=no
> +fi
> +rm -f conftest*
> +fi
> +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $libc_cv_aarch64_sve_asm" >&5
> +$as_echo "$libc_cv_aarch64_sve_asm" >&6; }
> +if test $libc_cv_aarch64_sve_asm = yes; then
> + $as_echo "#define HAVE_AARCH64_SVE_ASM 1" >>confdefs.h
> +
> +fi
> diff --git a/sysdeps/aarch64/configure.ac b/sysdeps/aarch64/configure.ac
> index 66f755078a..3347c13fa1 100644
> --- a/sysdeps/aarch64/configure.ac
> +++ b/sysdeps/aarch64/configure.ac
> @@ -90,3 +90,18 @@ EOF
> fi
> rm -rf conftest.*])
> LIBC_CONFIG_VAR([aarch64-variant-pcs], [$libc_cv_aarch64_variant_pcs])
> +
> +# Check if asm support armv8.2-a+sve
> +AC_CACHE_CHECK(for SVE support in assembler, libc_cv_aarch64_sve_asm, [dnl
> +cat > conftest.s <<\EOF
> + ptrue p0.b
> +EOF
> +if AC_TRY_COMMAND(${CC-cc} -c -march=armv8.2-a+sve conftest.s 1>&AS_MESSAGE_LOG_FD); then
> + libc_cv_aarch64_sve_asm=yes
> +else
> + libc_cv_aarch64_sve_asm=no
> +fi
> +rm -f conftest*])
> +if test $libc_cv_aarch64_sve_asm = yes; then
> + AC_DEFINE(HAVE_AARCH64_SVE_ASM)
> +fi
> --
> 2.17.1
>
^ permalink raw reply [flat|nested] 72+ messages in thread
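The configure fragment above probes the toolchain by assembling a single SVE instruction (`ptrue p0.b`) with '-march=armv8.2-a+sve' and defining HAVE_AARCH64_SVE_ASM only on success. As a hedged sketch of that probe logic outside autoconf (the function name, compiler path, and flags are assumptions for illustration; glibc's real check is the shell fragment in configure.ac):

```python
import os
import subprocess
import tempfile

def assembler_supports(cc, flags, snippet):
    """Return True if `cc <flags> -c` accepts the given assembly snippet,
    mimicking the conftest.s probe in the configure fragment."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "conftest.s")
        obj = os.path.join(tmp, "conftest.o")
        with open(src, "w") as f:
            f.write(snippet)
        # A failed probe is an expected outcome, so only the exit
        # status matters; diagnostics are discarded.
        result = subprocess.run([cc, *flags, "-c", src, "-o", obj],
                                capture_output=True)
        return result.returncode == 0

# e.g. assembler_supports("cc", ["-march=armv8.2-a+sve"], "\tptrue\tp0.b\n")
```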
* Re: [PATCH v2 2/6] aarch64: define BTI_C and BTI_J macros as NOP unless HAVE_AARCH64_BTI
2021-05-12 9:27 ` [PATCH v2 2/6] aarch64: define BTI_C and BTI_J macros as NOP unless HAVE_AARCH64_BTI Naohiro Tamura
@ 2021-05-26 10:06 ` Szabolcs Nagy via Libc-alpha
0 siblings, 0 replies; 72+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-05-26 10:06 UTC (permalink / raw)
To: Naohiro Tamura; +Cc: Naohiro Tamura, libc-alpha
The 05/12/2021 09:27, Naohiro Tamura wrote:
> From: Naohiro Tamura <naohirot@jp.fujitsu.com>
>
> This patch defines BTI_C and BTI_J macros conditionally for
> performance.
> If HAVE_AARCH64_BTI is true, BTI_C and BTI_J are defined as HINT
> instruction for ARMv8.5 BTI (Branch Target Identification).
> If HAVE_AARCH64_BTI is false, both BTI_C and BTI_J are defined as
> NOP.
thanks. this is ok for master.
i will commit it.
> ---
> sysdeps/aarch64/sysdep.h | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/sysdeps/aarch64/sysdep.h b/sysdeps/aarch64/sysdep.h
> index 90acca4e42..b936e29cbd 100644
> --- a/sysdeps/aarch64/sysdep.h
> +++ b/sysdeps/aarch64/sysdep.h
> @@ -62,8 +62,13 @@ strip_pac (void *p)
> #define ASM_SIZE_DIRECTIVE(name) .size name,.-name
>
> /* Branch Target Identitication support. */
> -#define BTI_C hint 34
> -#define BTI_J hint 36
> +#if HAVE_AARCH64_BTI
> +# define BTI_C hint 34
> +# define BTI_J hint 36
> +#else
> +# define BTI_C nop
> +# define BTI_J nop
> +#endif
>
> /* Return address signing support (pac-ret). */
> #define PACIASP hint 25
> --
> 2.17.1
>
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH v2 3/6] aarch64: Added optimized memcpy and memmove for A64FX
2021-05-12 9:28 ` [PATCH v2 3/6] aarch64: Added optimized memcpy and memmove for A64FX Naohiro Tamura
@ 2021-05-26 10:19 ` Szabolcs Nagy via Libc-alpha
0 siblings, 0 replies; 72+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-05-26 10:19 UTC (permalink / raw)
To: Naohiro Tamura; +Cc: Naohiro Tamura, libc-alpha
The 05/12/2021 09:28, Naohiro Tamura wrote:
> From: Naohiro Tamura <naohirot@jp.fujitsu.com>
>
> This patch optimizes the performance of memcpy/memmove for A64FX [1]
> which implements ARMv8-A SVE and has L1 64KB cache per core and L2 8MB
> cache per NUMA node.
>
> The performance optimization makes use of Scalable Vector Register
> with several techniques such as loop unrolling, memory access
> alignment, cache zero fill, and software pipelining.
>
> SVE assembler code for memcpy/memmove is implemented as Vector Length
> Agnostic code so theoretically it can be run on any SOC which supports
> ARMv8-A SVE standard.
>
> We confirmed that all testcases have been passed by running 'make
> check' and 'make xcheck' not only on A64FX but also on ThunderX2.
>
> And also we confirmed that the SVE 512 bit vector register performance
> is roughly 4 times better than Advanced SIMD 128 bit register and 8
> times better than scalar 64 bit register by running 'make bench'.
>
> [1] https://github.com/fujitsu/A64FX
thanks. this looks ok, except for whitespace usage.
can you please send a version with fixed whitespaces?
> --- a/sysdeps/aarch64/multiarch/memcpy.c
> +++ b/sysdeps/aarch64/multiarch/memcpy.c
> @@ -33,6 +33,9 @@ extern __typeof (__redirect_memcpy) __memcpy_simd attribute_hidden;
> extern __typeof (__redirect_memcpy) __memcpy_thunderx attribute_hidden;
> extern __typeof (__redirect_memcpy) __memcpy_thunderx2 attribute_hidden;
> extern __typeof (__redirect_memcpy) __memcpy_falkor attribute_hidden;
> +#if HAVE_AARCH64_SVE_ASM
> +extern __typeof (__redirect_memcpy) __memcpy_a64fx attribute_hidden;
> +#endif
>
> libc_ifunc (__libc_memcpy,
> (IS_THUNDERX (midr)
> @@ -44,8 +47,13 @@ libc_ifunc (__libc_memcpy,
> : (IS_NEOVERSE_N1 (midr) || IS_NEOVERSE_N2 (midr)
> || IS_NEOVERSE_V1 (midr)
> ? __memcpy_simd
> - : __memcpy_generic)))));
> -
> +#if HAVE_AARCH64_SVE_ASM
> + : (IS_A64FX (midr)
> + ? __memcpy_a64fx
> + : __memcpy_generic))))));
> +#else
> + : __memcpy_generic)))));
> +#endif
glibc uses a mix of tabs and spaces; you used spaces only.
> --- /dev/null
> +++ b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
> @@ -0,0 +1,405 @@
> +/* Optimized memcpy for Fujitsu A64FX processor.
> + Copyright (C) 2012-2021 Free Software Foundation, Inc.
> +
> + This file is part of the GNU C Library.
> +
> + The GNU C Library is free software; you can redistribute it and/or
> + modify it under the terms of the GNU Lesser General Public
> + License as published by the Free Software Foundation; either
> + version 2.1 of the License, or (at your option) any later version.
> +
> + The GNU C Library is distributed in the hope that it will be useful,
> + but WITHOUT ANY WARRANTY; without even the implied warranty of
> + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + Lesser General Public License for more details.
> +
> + You should have received a copy of the GNU Lesser General Public
> + License along with the GNU C Library. If not, see
> + <https://www.gnu.org/licenses/>. */
> +
> +#include <sysdep.h>
> +
> +#if HAVE_AARCH64_SVE_ASM
> +#if IS_IN (libc)
> +# define MEMCPY __memcpy_a64fx
> +# define MEMMOVE __memmove_a64fx
> +
> +/* Assumptions:
> + *
> + * ARMv8.2-a, AArch64, unaligned accesses, sve
> + *
> + */
> +
> +#define L2_SIZE (8*1024*1024)/2 // L2 8MB/2
> +#define CACHE_LINE_SIZE 256
> +#define ZF_DIST (CACHE_LINE_SIZE * 21) // Zerofill distance
> +#define dest x0
> +#define src x1
> +#define n x2 // size
> +#define tmp1 x3
> +#define tmp2 x4
> +#define tmp3 x5
> +#define rest x6
> +#define dest_ptr x7
> +#define src_ptr x8
> +#define vector_length x9
> +#define cl_remainder x10 // CACHE_LINE_SIZE remainder
> +
> + .arch armv8.2-a+sve
> +
> + .macro dc_zva times
> + dc zva, tmp1
> + add tmp1, tmp1, CACHE_LINE_SIZE
> + .if \times-1
> + dc_zva "(\times-1)"
> + .endif
> + .endm
> +
> + .macro ld1b_unroll8
> + ld1b z0.b, p0/z, [src_ptr, #0, mul vl]
> + ld1b z1.b, p0/z, [src_ptr, #1, mul vl]
> + ld1b z2.b, p0/z, [src_ptr, #2, mul vl]
> + ld1b z3.b, p0/z, [src_ptr, #3, mul vl]
> + ld1b z4.b, p0/z, [src_ptr, #4, mul vl]
> + ld1b z5.b, p0/z, [src_ptr, #5, mul vl]
> + ld1b z6.b, p0/z, [src_ptr, #6, mul vl]
> + ld1b z7.b, p0/z, [src_ptr, #7, mul vl]
> + .endm
...
please indent all asm code with one tab, see other asm files.
> --- a/sysdeps/aarch64/multiarch/memmove.c
> +++ b/sysdeps/aarch64/multiarch/memmove.c
> @@ -33,6 +33,9 @@ extern __typeof (__redirect_memmove) __memmove_simd attribute_hidden;
> extern __typeof (__redirect_memmove) __memmove_thunderx attribute_hidden;
> extern __typeof (__redirect_memmove) __memmove_thunderx2 attribute_hidden;
> extern __typeof (__redirect_memmove) __memmove_falkor attribute_hidden;
> +#if HAVE_AARCH64_SVE_ASM
> +extern __typeof (__redirect_memmove) __memmove_a64fx attribute_hidden;
> +#endif
>
> libc_ifunc (__libc_memmove,
> (IS_THUNDERX (midr)
> @@ -44,8 +47,13 @@ libc_ifunc (__libc_memmove,
> : (IS_NEOVERSE_N1 (midr) || IS_NEOVERSE_N2 (midr)
> || IS_NEOVERSE_V1 (midr)
> ? __memmove_simd
> - : __memmove_generic)))));
> -
> +#if HAVE_AARCH64_SVE_ASM
> + : (IS_A64FX (midr)
> + ? __memmove_a64fx
> + : __memmove_generic))))));
> +#else
> + : __memmove_generic)))));
> +#endif
same as above.
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH v2 4/6] aarch64: Added optimized memset for A64FX
2021-05-12 9:28 ` [PATCH v2 4/6] aarch64: Added optimized memset " Naohiro Tamura
@ 2021-05-26 10:22 ` Szabolcs Nagy via Libc-alpha
0 siblings, 0 replies; 72+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-05-26 10:22 UTC (permalink / raw)
To: Naohiro Tamura; +Cc: Naohiro Tamura, libc-alpha
The 05/12/2021 09:28, Naohiro Tamura wrote:
> From: Naohiro Tamura <naohirot@jp.fujitsu.com>
>
> This patch optimizes the performance of memset for A64FX [1] which
> implements ARMv8-A SVE and has L1 64KB cache per core and L2 8MB cache
> per NUMA node.
>
> The performance optimization makes use of Scalable Vector Register
> with several techniques such as loop unrolling, memory access
> alignment, cache zero fill and prefetch.
>
> SVE assembler code for memset is implemented as Vector Length Agnostic
> code so theoretically it can be run on any SOC which supports ARMv8-A
> SVE standard.
>
> We confirmed that all testcases have been passed by running 'make
> check' and 'make xcheck' not only on A64FX but also on ThunderX2.
>
> And also we confirmed that the SVE 512 bit vector register performance
> is roughly 4 times better than Advanced SIMD 128 bit register and 8
> times better than scalar 64 bit register by running 'make bench'.
>
> [1] https://github.com/fujitsu/A64FX
thanks, this looks good, except for whitespace.
can you please send a version with fixed whitespaces?
> --- a/sysdeps/aarch64/multiarch/memset.c
> +++ b/sysdeps/aarch64/multiarch/memset.c
...
> - : __memset_generic)));
> +#if HAVE_AARCH64_SVE_ASM
> + : (IS_A64FX (midr)
> + ? __memset_a64fx
> + : __memset_generic))));
> +#else
> + : __memset_generic)));
> +#endif
replace 8 spaces with 1 tab.
> --- /dev/null
> +++ b/sysdeps/aarch64/multiarch/memset_a64fx.S
...
> + .arch armv8.2-a+sve
> +
> + .macro dc_zva times
> + dc zva, tmp1
> + add tmp1, tmp1, CACHE_LINE_SIZE
> + .if \times-1
> + dc_zva "(\times-1)"
> + .endif
> + .endm
use 1 tab indentation throughout.
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH v2 5/6] scripts: Added Vector Length Set test helper script
2021-05-20 7:34 ` Naohiro Tamura
@ 2021-05-26 10:24 ` Szabolcs Nagy via Libc-alpha
0 siblings, 0 replies; 72+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-05-26 10:24 UTC (permalink / raw)
To: Naohiro Tamura; +Cc: Naohiro Tamura, libc-alpha
The 05/20/2021 07:34, Naohiro Tamura wrote:
> From: Naohiro Tamura <naohirot@jp.fujitsu.com>
>
> Let me send the whole updated patch.
> Thanks.
> Naohiro
>
> -- >8 --
> Subject: [PATCH v2 5/6] aarch64: Added Vector Length Set test helper script
>
> This patch adds a test helper script that changes the Vector Length for
> a child process. The script can be used as the test-wrapper for 'make check'.
>
> Usage examples:
>
> ~/build$ make check subdirs=string \
> test-wrapper='~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 16'
>
> ~/build$ ~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 16 \
> make test t=string/test-memcpy
>
> ~/build$ ~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 32 \
> ./debugglibc.sh string/test-memmove
>
> ~/build$ ~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 64 \
> ./testrun.sh string/test-memset
thanks, this is ok for master.
i will commit it.
> ---
> INSTALL | 4 ++
> manual/install.texi | 3 +
> sysdeps/unix/sysv/linux/aarch64/vltest.py | 82 +++++++++++++++++++++++
> 3 files changed, 89 insertions(+)
> create mode 100755 sysdeps/unix/sysv/linux/aarch64/vltest.py
>
> diff --git a/INSTALL b/INSTALL
> index 065a568585e6..bc761ab98bbf 100644
> --- a/INSTALL
> +++ b/INSTALL
> @@ -380,6 +380,10 @@ the same syntax as 'test-wrapper-env', the only difference in its
> semantics being starting with an empty set of environment variables
> rather than the ambient set.
>
> + For AArch64 with SVE, when testing the GNU C Library, 'test-wrapper'
> +may be set to "SRCDIR/sysdeps/unix/sysv/linux/aarch64/vltest.py
> +VECTOR-LENGTH" to change Vector Length.
> +
> Installing the C Library
> ========================
>
> diff --git a/manual/install.texi b/manual/install.texi
> index eb41fbd0b5ab..f1d858fb789c 100644
> --- a/manual/install.texi
> +++ b/manual/install.texi
> @@ -418,6 +418,9 @@ use has the same syntax as @samp{test-wrapper-env}, the only
> difference in its semantics being starting with an empty set of
> environment variables rather than the ambient set.
>
> +For AArch64 with SVE, when testing @theglibc{}, @samp{test-wrapper}
> +may be set to "@var{srcdir}/sysdeps/unix/sysv/linux/aarch64/vltest.py
> +@var{vector-length}" to change Vector Length.
>
> @node Running make install
> @appendixsec Installing the C Library
> diff --git a/sysdeps/unix/sysv/linux/aarch64/vltest.py b/sysdeps/unix/sysv/linux/aarch64/vltest.py
> new file mode 100755
> index 000000000000..bed62ad151e0
> --- /dev/null
> +++ b/sysdeps/unix/sysv/linux/aarch64/vltest.py
> @@ -0,0 +1,82 @@
> +#!/usr/bin/python3
> +# Set Scalable Vector Length test helper
> +# Copyright (C) 2021 Free Software Foundation, Inc.
> +# This file is part of the GNU C Library.
> +#
> +# The GNU C Library is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU Lesser General Public
> +# License as published by the Free Software Foundation; either
> +# version 2.1 of the License, or (at your option) any later version.
> +#
> +# The GNU C Library is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> +# Lesser General Public License for more details.
> +#
> +# You should have received a copy of the GNU Lesser General Public
> +# License along with the GNU C Library; if not, see
> +# <https://www.gnu.org/licenses/>.
> +"""Set Scalable Vector Length test helper.
> +
> +Set Scalable Vector Length for child process.
> +
> +examples:
> +
> +~/build$ make check subdirs=string \
> +test-wrapper='~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 16'
> +
> +~/build$ ~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 16 \
> +make test t=string/test-memcpy
> +
> +~/build$ ~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 32 \
> +./debugglibc.sh string/test-memmove
> +
> +~/build$ ~/glibc/sysdeps/unix/sysv/linux/aarch64/vltest.py 64 \
> +./testrun.sh string/test-memset
> +"""
> +import argparse
> +from ctypes import cdll, CDLL
> +import os
> +import sys
> +
> +EXIT_SUCCESS = 0
> +EXIT_FAILURE = 1
> +EXIT_UNSUPPORTED = 77
> +
> +AT_HWCAP = 16
> +HWCAP_SVE = (1 << 22)
> +
> +PR_SVE_GET_VL = 51
> +PR_SVE_SET_VL = 50
> +PR_SVE_SET_VL_ONEXEC = (1 << 18)
> +PR_SVE_VL_INHERIT = (1 << 17)
> +PR_SVE_VL_LEN_MASK = 0xffff
> +
> +def main(args):
> + libc = CDLL("libc.so.6")
> + if not libc.getauxval(AT_HWCAP) & HWCAP_SVE:
> + print("CPU doesn't support SVE")
> + sys.exit(EXIT_UNSUPPORTED)
> +
> + libc.prctl(PR_SVE_SET_VL,
> + args.vl[0] | PR_SVE_SET_VL_ONEXEC | PR_SVE_VL_INHERIT)
> + os.execvp(args.args[0], args.args)
> + print("exec system call failure")
> + sys.exit(EXIT_FAILURE)
> +
> +if __name__ == '__main__':
> + parser = argparse.ArgumentParser(description=
> + "Set Scalable Vector Length test helper",
> + formatter_class=argparse.ArgumentDefaultsHelpFormatter)
> +
> + # positional argument
> + parser.add_argument("vl", nargs=1, type=int,
> + choices=range(16, 257, 16),
> + help=('vector length '\
> + 'which is a multiple of 16, from 16 to 256'))
> + # remainder arguments
> + parser.add_argument('args', nargs=argparse.REMAINDER,
> + help=('args '\
> + 'which are passed to the child process'))
> + args = parser.parse_args()
> + main(args)
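For reference, the prctl call that the Python helper performs before exec can be sketched in plain C. This is an illustrative sketch, not part of the patch; the PR_SVE_* constants are defined locally in case the installed headers predate them, and on hardware without SVE the call simply fails with -1.

```c
#include <sys/prctl.h>

/* Values from the Linux UAPI; defined here in case <sys/prctl.h>
   is too old to provide them.  */
#ifndef PR_SVE_SET_VL
# define PR_SVE_SET_VL 50
#endif
#ifndef PR_SVE_SET_VL_ONEXEC
# define PR_SVE_SET_VL_ONEXEC (1 << 18)
#endif
#ifndef PR_SVE_VL_INHERIT
# define PR_SVE_VL_INHERIT (1 << 17)
#endif

/* Ask the kernel to set the SVE vector length (in bytes) at the next
   exec and have children inherit it, mirroring what vltest.py does
   before os.execvp.  Returns the kernel's reply, or -1 with errno set
   on machines without SVE.  */
static long
try_set_sve_vl (int vl_bytes)
{
  return prctl (PR_SVE_SET_VL,
                vl_bytes | PR_SVE_SET_VL_ONEXEC | PR_SVE_VL_INHERIT);
}
```

On a non-SVE machine the prctl returns -1, which is why the script checks HWCAP_SVE first and exits with the "unsupported" status 77 instead.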
> --
> 2.17.1
>
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH v2 6/6] benchtests: Fixed bench-memcpy-random: buf1: mprotect failed
2021-05-12 9:29 ` [PATCH v2 6/6] benchtests: Fixed bench-memcpy-random: buf1: mprotect failed Naohiro Tamura
@ 2021-05-26 10:25 ` Szabolcs Nagy via Libc-alpha
0 siblings, 0 replies; 72+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-05-26 10:25 UTC (permalink / raw)
To: Naohiro Tamura; +Cc: Naohiro Tamura, libc-alpha
The 05/12/2021 09:29, Naohiro Tamura wrote:
> From: Naohiro Tamura <naohirot@jp.fujitsu.com>
>
> This patch fixed an mprotect system call failure on AArch64.
> The failure happened not only on A64FX but also on ThunderX2.
>
> Also this patch updated a JSON key from "max-size" to "length" so that
> 'plot_strings.py' can process 'bench-memcpy-random.out'.
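The underlying issue is that the benchmark hard-coded a 4KB page, while AArch64 kernels frequently run with 64KB pages, so the mapped region was not a whole number of pages and mprotect rejected it. The page-rounding logic needed before any mprotect call can be sketched as follows (illustrative only, not the benchtest code):

```c
#include <unistd.h>
#include <stddef.h>

/* Round SIZE up to a multiple of the runtime page size, as required
   before handing a region boundary to mprotect.  A hard-coded 4096
   under-allocates on 64K-page kernels, which is the failure the patch
   fixes by switching to getpagesize().  */
static size_t
round_up_to_page (size_t size)
{
  size_t page = (size_t) getpagesize ();
  return (size + page - 1) / page * page;
}
```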
thanks, this is ok for master.
i will commit it.
> ---
> benchtests/bench-memcpy-random.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/benchtests/bench-memcpy-random.c b/benchtests/bench-memcpy-random.c
> index 9b62033379..c490b73ed0 100644
> --- a/benchtests/bench-memcpy-random.c
> +++ b/benchtests/bench-memcpy-random.c
> @@ -16,7 +16,7 @@
> License along with the GNU C Library; if not, see
> <https://www.gnu.org/licenses/>. */
>
> -#define MIN_PAGE_SIZE (512*1024+4096)
> +#define MIN_PAGE_SIZE (512*1024+getpagesize())
> #define TEST_MAIN
> #define TEST_NAME "memcpy"
> #include "bench-string.h"
> @@ -160,7 +160,7 @@ do_test (json_ctx_t *json_ctx, size_t max_size)
> }
>
> json_element_object_begin (json_ctx);
> - json_attr_uint (json_ctx, "max-size", (double) max_size);
> + json_attr_uint (json_ctx, "length", (double) max_size);
> json_array_begin (json_ctx, "timings");
>
> FOR_EACH_IMPL (impl, 0)
> --
> 2.17.1
>
--
^ permalink raw reply [flat|nested] 72+ messages in thread
* RE: [PATCH v2 0/6] aarch64: Added optimized memcpy/memmove/memset for A64FX
2021-05-12 9:23 ` [PATCH v2 0/6] aarch64: " Naohiro Tamura
` (5 preceding siblings ...)
2021-05-12 9:29 ` [PATCH v2 6/6] benchtests: Fixed bench-memcpy-random: buf1: mprotect failed Naohiro Tamura
@ 2021-05-27 0:22 ` naohirot
2021-05-27 23:50 ` naohirot
2021-05-27 7:42 ` [PATCH v3 1/2] aarch64: Added optimized memcpy and memmove " Naohiro Tamura
2021-05-27 7:44 ` [PATCH v3 2/2] aarch64: Added optimized memset " Naohiro Tamura
8 siblings, 1 reply; 72+ messages in thread
From: naohirot @ 2021-05-27 0:22 UTC (permalink / raw)
To: 'Szabolcs Nagy', libc-alpha@sourceware.org
Hi Szabolcs,
> config: Added HAVE_AARCH64_SVE_ASM for aarch64
> aarch64: define BTI_C and BTI_J macros as NOP unless HAVE_AARCH64_BTI
> scripts: Added Vector Length Set test helper script
> benchtests: Fixed bench-memcpy-random: buf1: mprotect failed
Thank you for the merges!
> aarch64: Added optimized memcpy and memmove for A64FX
> aarch64: Added optimized memset for A64FX
I'll fix the whitespaces.
Thanks
Naohiro
^ permalink raw reply [flat|nested] 72+ messages in thread
* [PATCH v3 1/2] aarch64: Added optimized memcpy and memmove for A64FX
2021-05-12 9:23 ` [PATCH v2 0/6] aarch64: " Naohiro Tamura
` (6 preceding siblings ...)
2021-05-27 0:22 ` [PATCH v2 0/6] aarch64: Added optimized memcpy/memmove/memset for A64FX naohirot
@ 2021-05-27 7:42 ` Naohiro Tamura
2021-05-27 7:44 ` [PATCH v3 2/2] aarch64: Added optimized memset " Naohiro Tamura
8 siblings, 0 replies; 72+ messages in thread
From: Naohiro Tamura @ 2021-05-27 7:42 UTC (permalink / raw)
To: libc-alpha; +Cc: Naohiro Tamura
From: Naohiro Tamura <naohirot@jp.fujitsu.com>
This patch optimizes the performance of memcpy/memmove for A64FX [1]
which implements ARMv8-A SVE and has L1 64KB cache per core and L2 8MB
cache per NUMA node.
The performance optimization makes use of Scalable Vector Register
with several techniques such as loop unrolling, memory access
alignment, cache zero fill, and software pipelining.
SVE assembler code for memcpy/memmove is implemented as Vector Length
Agnostic code so theoretically it can be run on any SOC which supports
ARMv8-A SVE standard.
We confirmed that all testcases have been passed by running 'make
check' and 'make xcheck' not only on A64FX but also on ThunderX2.
And also we confirmed that the SVE 512 bit vector register performance
is roughly 4 times better than Advanced SIMD 128 bit register and 8
times better than scalar 64 bit register by running 'make bench'.
[1] https://github.com/fujitsu/A64FX
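The vector-length-agnostic structure described above can be pictured in plain C. This is an illustrative sketch, not the assembly: `vl` stands in for the vector length read with `cntb`, and the partial final copy plays the role of a `whilelo` predicate that enables only the remaining lanes.

```c
#include <string.h>
#include <stddef.h>

/* Copy N bytes in VL-sized chunks plus one predicated tail -- the
   shape the SVE memcpy shares at any vector length.  */
static void
vla_copy (unsigned char *dst, const unsigned char *src,
          size_t n, size_t vl)
{
  size_t i = 0;
  for (; i + vl <= n; i += vl)     /* full-vector iterations */
    memcpy (dst + i, src + i, vl);
  if (i < n)                       /* partial tail, n - i < vl lanes */
    memcpy (dst + i, src + i, n - i);
}
```

Because nothing in the loop depends on a fixed `vl`, the same code works for a 16-byte VL on one SoC and a 64-byte VL on A64FX, which is what "Vector Length Agnostic" means here.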
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Reviewed-by: Szabolcs Nagy <Szabolcs.Nagy@arm.com>
---
manual/tunables.texi | 3 +-
sysdeps/aarch64/multiarch/Makefile | 2 +-
sysdeps/aarch64/multiarch/ifunc-impl-list.c | 8 +-
sysdeps/aarch64/multiarch/init-arch.h | 4 +-
sysdeps/aarch64/multiarch/memcpy.c | 18 +-
sysdeps/aarch64/multiarch/memcpy_a64fx.S | 406 ++++++++++++++++++
sysdeps/aarch64/multiarch/memmove.c | 18 +-
.../unix/sysv/linux/aarch64/cpu-features.c | 4 +
.../unix/sysv/linux/aarch64/cpu-features.h | 4 +
9 files changed, 453 insertions(+), 14 deletions(-)
create mode 100644 sysdeps/aarch64/multiarch/memcpy_a64fx.S
diff --git a/manual/tunables.texi b/manual/tunables.texi
index 6de647b4262c..fe7c1313ccc4 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -454,7 +454,8 @@ This tunable is specific to powerpc, powerpc64 and powerpc64le.
The @code{glibc.cpu.name=xxx} tunable allows the user to tell @theglibc{} to
assume that the CPU is @code{xxx} where xxx may have one of these values:
@code{generic}, @code{falkor}, @code{thunderxt88}, @code{thunderx2t99},
-@code{thunderx2t99p1}, @code{ares}, @code{emag}, @code{kunpeng}.
+@code{thunderx2t99p1}, @code{ares}, @code{emag}, @code{kunpeng},
+@code{a64fx}.
This tunable is specific to aarch64.
@end deftp
diff --git a/sysdeps/aarch64/multiarch/Makefile b/sysdeps/aarch64/multiarch/Makefile
index dc3efffb36b6..04c3f171215e 100644
--- a/sysdeps/aarch64/multiarch/Makefile
+++ b/sysdeps/aarch64/multiarch/Makefile
@@ -1,6 +1,6 @@
ifeq ($(subdir),string)
sysdep_routines += memcpy_generic memcpy_advsimd memcpy_thunderx memcpy_thunderx2 \
- memcpy_falkor \
+ memcpy_falkor memcpy_a64fx \
memset_generic memset_falkor memset_emag memset_kunpeng \
memchr_generic memchr_nosimd \
strlen_mte strlen_asimd
diff --git a/sysdeps/aarch64/multiarch/ifunc-impl-list.c b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
index 99a8c68aaca0..911393565c21 100644
--- a/sysdeps/aarch64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
@@ -25,7 +25,7 @@
#include <stdio.h>
/* Maximum number of IFUNC implementations. */
-#define MAX_IFUNC 4
+#define MAX_IFUNC 7
size_t
__libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
@@ -43,12 +43,18 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
IFUNC_IMPL_ADD (array, i, memcpy, !bti, __memcpy_thunderx2)
IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_falkor)
IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_simd)
+#if HAVE_AARCH64_SVE_ASM
+ IFUNC_IMPL_ADD (array, i, memcpy, sve, __memcpy_a64fx)
+#endif
IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_generic))
IFUNC_IMPL (i, name, memmove,
IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_thunderx)
IFUNC_IMPL_ADD (array, i, memmove, !bti, __memmove_thunderx2)
IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_falkor)
IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_simd)
+#if HAVE_AARCH64_SVE_ASM
+ IFUNC_IMPL_ADD (array, i, memmove, sve, __memmove_a64fx)
+#endif
IFUNC_IMPL_ADD (array, i, memmove, 1, __memmove_generic))
IFUNC_IMPL (i, name, memset,
/* Enable this on non-falkor processors too so that other cores
diff --git a/sysdeps/aarch64/multiarch/init-arch.h b/sysdeps/aarch64/multiarch/init-arch.h
index a167699e74f4..6d92c1bcff6a 100644
--- a/sysdeps/aarch64/multiarch/init-arch.h
+++ b/sysdeps/aarch64/multiarch/init-arch.h
@@ -33,4 +33,6 @@
bool __attribute__((unused)) bti = \
HAVE_AARCH64_BTI && GLRO(dl_aarch64_cpu_features).bti; \
bool __attribute__((unused)) mte = \
- MTE_ENABLED ();
+ MTE_ENABLED (); \
+ bool __attribute__((unused)) sve = \
+ GLRO(dl_aarch64_cpu_features).sve;
diff --git a/sysdeps/aarch64/multiarch/memcpy.c b/sysdeps/aarch64/multiarch/memcpy.c
index 0e0a5cbcfb1b..25e0081eeb51 100644
--- a/sysdeps/aarch64/multiarch/memcpy.c
+++ b/sysdeps/aarch64/multiarch/memcpy.c
@@ -33,6 +33,9 @@ extern __typeof (__redirect_memcpy) __memcpy_simd attribute_hidden;
extern __typeof (__redirect_memcpy) __memcpy_thunderx attribute_hidden;
extern __typeof (__redirect_memcpy) __memcpy_thunderx2 attribute_hidden;
extern __typeof (__redirect_memcpy) __memcpy_falkor attribute_hidden;
+# if HAVE_AARCH64_SVE_ASM
+extern __typeof (__redirect_memcpy) __memcpy_a64fx attribute_hidden;
+# endif
libc_ifunc (__libc_memcpy,
(IS_THUNDERX (midr)
@@ -40,12 +43,17 @@ libc_ifunc (__libc_memcpy,
: (IS_FALKOR (midr) || IS_PHECDA (midr)
? __memcpy_falkor
: (IS_THUNDERX2 (midr) || IS_THUNDERX2PA (midr)
- ? __memcpy_thunderx2
- : (IS_NEOVERSE_N1 (midr) || IS_NEOVERSE_N2 (midr)
- || IS_NEOVERSE_V1 (midr)
- ? __memcpy_simd
+ ? __memcpy_thunderx2
+ : (IS_NEOVERSE_N1 (midr) || IS_NEOVERSE_N2 (midr)
+ || IS_NEOVERSE_V1 (midr)
+ ? __memcpy_simd
+# if HAVE_AARCH64_SVE_ASM
+ : (IS_A64FX (midr)
+ ? __memcpy_a64fx
+ : __memcpy_generic))))));
+# else
: __memcpy_generic)))));
-
+# endif
# undef memcpy
strong_alias (__libc_memcpy, memcpy);
#endif
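The nested ternaries in memcpy.c above form an ifunc resolver: one MIDR comparison chain, run once at relocation time, that returns a function pointer. The dispatch shape can be sketched with plain function pointers (the implementation bodies here are hypothetical stand-ins; the MIDR field layout and the A64FX value 0x460F0010 follow the cpu-features.c hunk below):

```c
#include <string.h>

typedef void *(*memcpy_fn) (void *, const void *, unsigned long);

/* Hypothetical stand-ins for the real implementations.  */
static void *
memcpy_generic_impl (void *d, const void *s, unsigned long n)
{ return memcpy (d, s, n); }

static void *
memcpy_a64fx_impl (void *d, const void *s, unsigned long n)
{ return memcpy (d, s, n); }

/* MIDR_EL1 encodes the implementer in bits [31:24] and the part
   number in bits [15:4]; A64FX is implementer 'F' (0x46), part 0x001.  */
#define MIDR_IMPLEMENTOR(midr) (((midr) >> 24) & 0xff)
#define MIDR_PARTNUM(midr)     (((midr) >> 4) & 0xfff)

static memcpy_fn
resolve_memcpy (unsigned midr, int have_sve)
{
  if (have_sve
      && MIDR_IMPLEMENTOR (midr) == 'F'
      && MIDR_PARTNUM (midr) == 0x001)
    return memcpy_a64fx_impl;
  return memcpy_generic_impl;
}
```

The `sve` flag plays the same role as the HAVE_AARCH64_SVE_ASM-guarded hwcap check: even on an A64FX MIDR, the SVE variant is only selected when the kernel reports SVE support.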
diff --git a/sysdeps/aarch64/multiarch/memcpy_a64fx.S b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
new file mode 100644
index 000000000000..65528405bb12
--- /dev/null
+++ b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
@@ -0,0 +1,406 @@
+/* Optimized memcpy for Fujitsu A64FX processor.
+ Copyright (C) 2021 Free Software Foundation, Inc.
+
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library. If not, see
+ <https://www.gnu.org/licenses/>. */
+
+#include <sysdep.h>
+
+/* Assumptions:
+ *
+ * ARMv8.2-a, AArch64, unaligned accesses, sve
+ *
+ */
+
+#define L2_SIZE (8*1024*1024)/2 // L2 8MB/2
+#define CACHE_LINE_SIZE 256
+#define ZF_DIST (CACHE_LINE_SIZE * 21) // Zerofill distance
+#define dest x0
+#define src x1
+#define n x2 // size
+#define tmp1 x3
+#define tmp2 x4
+#define tmp3 x5
+#define rest x6
+#define dest_ptr x7
+#define src_ptr x8
+#define vector_length x9
+#define cl_remainder x10 // CACHE_LINE_SIZE remainder
+
+#if HAVE_AARCH64_SVE_ASM
+# if IS_IN (libc)
+# define MEMCPY __memcpy_a64fx
+# define MEMMOVE __memmove_a64fx
+
+ .arch armv8.2-a+sve
+
+ .macro dc_zva times
+ dc zva, tmp1
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ .if \times-1
+ dc_zva "(\times-1)"
+ .endif
+ .endm
+
+ .macro ld1b_unroll8
+ ld1b z0.b, p0/z, [src_ptr, #0, mul vl]
+ ld1b z1.b, p0/z, [src_ptr, #1, mul vl]
+ ld1b z2.b, p0/z, [src_ptr, #2, mul vl]
+ ld1b z3.b, p0/z, [src_ptr, #3, mul vl]
+ ld1b z4.b, p0/z, [src_ptr, #4, mul vl]
+ ld1b z5.b, p0/z, [src_ptr, #5, mul vl]
+ ld1b z6.b, p0/z, [src_ptr, #6, mul vl]
+ ld1b z7.b, p0/z, [src_ptr, #7, mul vl]
+ .endm
+
+ .macro stld1b_unroll4a
+ st1b z0.b, p0, [dest_ptr, #0, mul vl]
+ st1b z1.b, p0, [dest_ptr, #1, mul vl]
+ ld1b z0.b, p0/z, [src_ptr, #0, mul vl]
+ ld1b z1.b, p0/z, [src_ptr, #1, mul vl]
+ st1b z2.b, p0, [dest_ptr, #2, mul vl]
+ st1b z3.b, p0, [dest_ptr, #3, mul vl]
+ ld1b z2.b, p0/z, [src_ptr, #2, mul vl]
+ ld1b z3.b, p0/z, [src_ptr, #3, mul vl]
+ .endm
+
+ .macro stld1b_unroll4b
+ st1b z4.b, p0, [dest_ptr, #4, mul vl]
+ st1b z5.b, p0, [dest_ptr, #5, mul vl]
+ ld1b z4.b, p0/z, [src_ptr, #4, mul vl]
+ ld1b z5.b, p0/z, [src_ptr, #5, mul vl]
+ st1b z6.b, p0, [dest_ptr, #6, mul vl]
+ st1b z7.b, p0, [dest_ptr, #7, mul vl]
+ ld1b z6.b, p0/z, [src_ptr, #6, mul vl]
+ ld1b z7.b, p0/z, [src_ptr, #7, mul vl]
+ .endm
+
+ .macro stld1b_unroll8
+ stld1b_unroll4a
+ stld1b_unroll4b
+ .endm
+
+ .macro st1b_unroll8
+ st1b z0.b, p0, [dest_ptr, #0, mul vl]
+ st1b z1.b, p0, [dest_ptr, #1, mul vl]
+ st1b z2.b, p0, [dest_ptr, #2, mul vl]
+ st1b z3.b, p0, [dest_ptr, #3, mul vl]
+ st1b z4.b, p0, [dest_ptr, #4, mul vl]
+ st1b z5.b, p0, [dest_ptr, #5, mul vl]
+ st1b z6.b, p0, [dest_ptr, #6, mul vl]
+ st1b z7.b, p0, [dest_ptr, #7, mul vl]
+ .endm
+
+ .macro shortcut_for_small_size exit
+ // if rest <= vector_length * 2
+ whilelo p0.b, xzr, n
+ whilelo p1.b, vector_length, n
+ b.last 1f
+ ld1b z0.b, p0/z, [src, #0, mul vl]
+ ld1b z1.b, p1/z, [src, #1, mul vl]
+ st1b z0.b, p0, [dest, #0, mul vl]
+ st1b z1.b, p1, [dest, #1, mul vl]
+ ret
+1: // if rest > vector_length * 8
+ cmp n, vector_length, lsl 3 // vector_length * 8
+ b.hi \exit
+ // if rest <= vector_length * 4
+ lsl tmp1, vector_length, 1 // vector_length * 2
+ whilelo p2.b, tmp1, n
+ incb tmp1
+ whilelo p3.b, tmp1, n
+ b.last 1f
+ ld1b z0.b, p0/z, [src, #0, mul vl]
+ ld1b z1.b, p1/z, [src, #1, mul vl]
+ ld1b z2.b, p2/z, [src, #2, mul vl]
+ ld1b z3.b, p3/z, [src, #3, mul vl]
+ st1b z0.b, p0, [dest, #0, mul vl]
+ st1b z1.b, p1, [dest, #1, mul vl]
+ st1b z2.b, p2, [dest, #2, mul vl]
+ st1b z3.b, p3, [dest, #3, mul vl]
+ ret
+1: // if rest <= vector_length * 8
+ lsl tmp1, vector_length, 2 // vector_length * 4
+ whilelo p4.b, tmp1, n
+ incb tmp1
+ whilelo p5.b, tmp1, n
+ b.last 1f
+ ld1b z0.b, p0/z, [src, #0, mul vl]
+ ld1b z1.b, p1/z, [src, #1, mul vl]
+ ld1b z2.b, p2/z, [src, #2, mul vl]
+ ld1b z3.b, p3/z, [src, #3, mul vl]
+ ld1b z4.b, p4/z, [src, #4, mul vl]
+ ld1b z5.b, p5/z, [src, #5, mul vl]
+ st1b z0.b, p0, [dest, #0, mul vl]
+ st1b z1.b, p1, [dest, #1, mul vl]
+ st1b z2.b, p2, [dest, #2, mul vl]
+ st1b z3.b, p3, [dest, #3, mul vl]
+ st1b z4.b, p4, [dest, #4, mul vl]
+ st1b z5.b, p5, [dest, #5, mul vl]
+ ret
+1: lsl tmp1, vector_length, 2 // vector_length * 4
+ incb tmp1 // vector_length * 5
+ incb tmp1 // vector_length * 6
+ whilelo p6.b, tmp1, n
+ incb tmp1
+ whilelo p7.b, tmp1, n
+ ld1b z0.b, p0/z, [src, #0, mul vl]
+ ld1b z1.b, p1/z, [src, #1, mul vl]
+ ld1b z2.b, p2/z, [src, #2, mul vl]
+ ld1b z3.b, p3/z, [src, #3, mul vl]
+ ld1b z4.b, p4/z, [src, #4, mul vl]
+ ld1b z5.b, p5/z, [src, #5, mul vl]
+ ld1b z6.b, p6/z, [src, #6, mul vl]
+ ld1b z7.b, p7/z, [src, #7, mul vl]
+ st1b z0.b, p0, [dest, #0, mul vl]
+ st1b z1.b, p1, [dest, #1, mul vl]
+ st1b z2.b, p2, [dest, #2, mul vl]
+ st1b z3.b, p3, [dest, #3, mul vl]
+ st1b z4.b, p4, [dest, #4, mul vl]
+ st1b z5.b, p5, [dest, #5, mul vl]
+ st1b z6.b, p6, [dest, #6, mul vl]
+ st1b z7.b, p7, [dest, #7, mul vl]
+ ret
+ .endm
+
+ENTRY (MEMCPY)
+
+ PTR_ARG (0)
+ PTR_ARG (1)
+ SIZE_ARG (2)
+
+L(memcpy):
+ cntb vector_length
+ // shortcut for less than vector_length * 8
+ // gives a free ptrue to p0.b for n >= vector_length
+ shortcut_for_small_size L(vl_agnostic)
+ // end of shortcut
+
+L(vl_agnostic): // VL Agnostic
+ mov rest, n
+ mov dest_ptr, dest
+ mov src_ptr, src
+ // if rest >= L2_SIZE && vector_length == 64 then L(L2)
+ mov tmp1, 64
+ cmp rest, L2_SIZE
+ ccmp vector_length, tmp1, 0, cs
+ b.eq L(L2)
+
+L(unroll8): // unrolling and software pipeline
+ lsl tmp1, vector_length, 3 // vector_length * 8
+ .p2align 3
+ cmp rest, tmp1
+ b.cc L(last)
+ ld1b_unroll8
+ add src_ptr, src_ptr, tmp1
+ sub rest, rest, tmp1
+ cmp rest, tmp1
+ b.cc 2f
+ .p2align 3
+1: stld1b_unroll8
+ add dest_ptr, dest_ptr, tmp1
+ add src_ptr, src_ptr, tmp1
+ sub rest, rest, tmp1
+ cmp rest, tmp1
+ b.ge 1b
+2: st1b_unroll8
+ add dest_ptr, dest_ptr, tmp1
+
+ .p2align 3
+L(last):
+ whilelo p0.b, xzr, rest
+ whilelo p1.b, vector_length, rest
+ b.last 1f
+ ld1b z0.b, p0/z, [src_ptr, #0, mul vl]
+ ld1b z1.b, p1/z, [src_ptr, #1, mul vl]
+ st1b z0.b, p0, [dest_ptr, #0, mul vl]
+ st1b z1.b, p1, [dest_ptr, #1, mul vl]
+ ret
+1: lsl tmp1, vector_length, 1 // vector_length * 2
+ whilelo p2.b, tmp1, rest
+ incb tmp1
+ whilelo p3.b, tmp1, rest
+ b.last 1f
+ ld1b z0.b, p0/z, [src_ptr, #0, mul vl]
+ ld1b z1.b, p1/z, [src_ptr, #1, mul vl]
+ ld1b z2.b, p2/z, [src_ptr, #2, mul vl]
+ ld1b z3.b, p3/z, [src_ptr, #3, mul vl]
+ st1b z0.b, p0, [dest_ptr, #0, mul vl]
+ st1b z1.b, p1, [dest_ptr, #1, mul vl]
+ st1b z2.b, p2, [dest_ptr, #2, mul vl]
+ st1b z3.b, p3, [dest_ptr, #3, mul vl]
+ ret
+1: lsl tmp1, vector_length, 2 // vector_length * 4
+ whilelo p4.b, tmp1, rest
+ incb tmp1
+ whilelo p5.b, tmp1, rest
+ incb tmp1
+ whilelo p6.b, tmp1, rest
+ incb tmp1
+ whilelo p7.b, tmp1, rest
+ ld1b z0.b, p0/z, [src_ptr, #0, mul vl]
+ ld1b z1.b, p1/z, [src_ptr, #1, mul vl]
+ ld1b z2.b, p2/z, [src_ptr, #2, mul vl]
+ ld1b z3.b, p3/z, [src_ptr, #3, mul vl]
+ ld1b z4.b, p4/z, [src_ptr, #4, mul vl]
+ ld1b z5.b, p5/z, [src_ptr, #5, mul vl]
+ ld1b z6.b, p6/z, [src_ptr, #6, mul vl]
+ ld1b z7.b, p7/z, [src_ptr, #7, mul vl]
+ st1b z0.b, p0, [dest_ptr, #0, mul vl]
+ st1b z1.b, p1, [dest_ptr, #1, mul vl]
+ st1b z2.b, p2, [dest_ptr, #2, mul vl]
+ st1b z3.b, p3, [dest_ptr, #3, mul vl]
+ st1b z4.b, p4, [dest_ptr, #4, mul vl]
+ st1b z5.b, p5, [dest_ptr, #5, mul vl]
+ st1b z6.b, p6, [dest_ptr, #6, mul vl]
+ st1b z7.b, p7, [dest_ptr, #7, mul vl]
+ ret
+
+L(L2):
+ // align dest address at CACHE_LINE_SIZE byte boundary
+ mov tmp1, CACHE_LINE_SIZE
+ ands tmp2, dest_ptr, CACHE_LINE_SIZE - 1
+ // if cl_remainder == 0
+ b.eq L(L2_dc_zva)
+ sub cl_remainder, tmp1, tmp2
+ // process remainder until the first CACHE_LINE_SIZE boundary
+ whilelo p1.b, xzr, cl_remainder // keep p0.b all true
+ whilelo p2.b, vector_length, cl_remainder
+ b.last 1f
+ ld1b z1.b, p1/z, [src_ptr, #0, mul vl]
+ ld1b z2.b, p2/z, [src_ptr, #1, mul vl]
+ st1b z1.b, p1, [dest_ptr, #0, mul vl]
+ st1b z2.b, p2, [dest_ptr, #1, mul vl]
+ b 2f
+1: lsl tmp1, vector_length, 1 // vector_length * 2
+ whilelo p3.b, tmp1, cl_remainder
+ incb tmp1
+ whilelo p4.b, tmp1, cl_remainder
+ ld1b z1.b, p1/z, [src_ptr, #0, mul vl]
+ ld1b z2.b, p2/z, [src_ptr, #1, mul vl]
+ ld1b z3.b, p3/z, [src_ptr, #2, mul vl]
+ ld1b z4.b, p4/z, [src_ptr, #3, mul vl]
+ st1b z1.b, p1, [dest_ptr, #0, mul vl]
+ st1b z2.b, p2, [dest_ptr, #1, mul vl]
+ st1b z3.b, p3, [dest_ptr, #2, mul vl]
+ st1b z4.b, p4, [dest_ptr, #3, mul vl]
+2: add dest_ptr, dest_ptr, cl_remainder
+ add src_ptr, src_ptr, cl_remainder
+ sub rest, rest, cl_remainder
+
+L(L2_dc_zva):
+ // zero fill
+ and tmp1, dest, 0xffffffffffffff
+ and tmp2, src, 0xffffffffffffff
+ subs tmp1, tmp1, tmp2 // diff
+ b.ge 1f
+ neg tmp1, tmp1
+1: mov tmp3, ZF_DIST + CACHE_LINE_SIZE * 2
+ cmp tmp1, tmp3
+ b.lo L(unroll8)
+ mov tmp1, dest_ptr
+ dc_zva (ZF_DIST / CACHE_LINE_SIZE) - 1
+ // unroll
+ ld1b_unroll8 // this line has to be after "b.lo L(unroll8)"
+ add src_ptr, src_ptr, CACHE_LINE_SIZE * 2
+ sub rest, rest, CACHE_LINE_SIZE * 2
+ mov tmp1, ZF_DIST
+ .p2align 3
+1: stld1b_unroll4a
+ add tmp2, dest_ptr, tmp1 // dest_ptr + ZF_DIST
+ dc zva, tmp2
+ stld1b_unroll4b
+ add tmp2, tmp2, CACHE_LINE_SIZE
+ dc zva, tmp2
+ add dest_ptr, dest_ptr, CACHE_LINE_SIZE * 2
+ add src_ptr, src_ptr, CACHE_LINE_SIZE * 2
+ sub rest, rest, CACHE_LINE_SIZE * 2
+ cmp rest, tmp3 // ZF_DIST + CACHE_LINE_SIZE * 2
+ b.ge 1b
+ st1b_unroll8
+ add dest_ptr, dest_ptr, CACHE_LINE_SIZE * 2
+ b L(unroll8)
+
+END (MEMCPY)
+libc_hidden_builtin_def (MEMCPY)
+
+
+ENTRY (MEMMOVE)
+
+ PTR_ARG (0)
+ PTR_ARG (1)
+ SIZE_ARG (2)
+
+ // remove tag address
+ // dest has to be immutable because it is the return value
+ // src has to be immutable because it is used in L(bwd_last)
+ and tmp2, dest, 0xffffffffffffff // save dest_notag into tmp2
+ and tmp3, src, 0xffffffffffffff // save src_notag into tmp3
+ cmp n, 0
+ ccmp tmp2, tmp3, 4, ne
+ b.ne 1f
+ ret
+1: cntb vector_length
+ // shortcut for less than vector_length * 8
+ // gives a free ptrue to p0.b for n >= vector_length
+ // tmp2 and tmp3 should not be used in this macro to keep
+ // notag addresses
+ shortcut_for_small_size L(dispatch)
+ // end of shortcut
+
+L(dispatch):
+ // tmp2 = dest_notag, tmp3 = src_notag
+ // diff = dest_notag - src_notag
+ sub tmp1, tmp2, tmp3
+ // if diff <= 0 || diff >= n then memcpy
+ cmp tmp1, 0
+ ccmp tmp1, n, 2, gt
+ b.cs L(vl_agnostic)
+
+L(bwd_start):
+ mov rest, n
+ add dest_ptr, dest, n // dest_end
+ add src_ptr, src, n // src_end
+
+L(bwd_unroll8): // unrolling and software pipeline
+ lsl tmp1, vector_length, 3 // vector_length * 8
+ .p2align 3
+ cmp rest, tmp1
+ b.cc L(bwd_last)
+ sub src_ptr, src_ptr, tmp1
+ ld1b_unroll8
+ sub rest, rest, tmp1
+ cmp rest, tmp1
+ b.cc 2f
+ .p2align 3
+1: sub src_ptr, src_ptr, tmp1
+ sub dest_ptr, dest_ptr, tmp1
+ stld1b_unroll8
+ sub rest, rest, tmp1
+ cmp rest, tmp1
+ b.ge 1b
+2: sub dest_ptr, dest_ptr, tmp1
+ st1b_unroll8
+
+L(bwd_last):
+ mov dest_ptr, dest
+ mov src_ptr, src
+ b L(last)
+
+END (MEMMOVE)
+libc_hidden_builtin_def (MEMMOVE)
+# endif /* IS_IN (libc) */
+#endif /* HAVE_AARCH64_SVE_ASM */
diff --git a/sysdeps/aarch64/multiarch/memmove.c b/sysdeps/aarch64/multiarch/memmove.c
index 12d77818a999..d0adefc547f6 100644
--- a/sysdeps/aarch64/multiarch/memmove.c
+++ b/sysdeps/aarch64/multiarch/memmove.c
@@ -33,6 +33,9 @@ extern __typeof (__redirect_memmove) __memmove_simd attribute_hidden;
extern __typeof (__redirect_memmove) __memmove_thunderx attribute_hidden;
extern __typeof (__redirect_memmove) __memmove_thunderx2 attribute_hidden;
extern __typeof (__redirect_memmove) __memmove_falkor attribute_hidden;
+# if HAVE_AARCH64_SVE_ASM
+extern __typeof (__redirect_memmove) __memmove_a64fx attribute_hidden;
+# endif
libc_ifunc (__libc_memmove,
(IS_THUNDERX (midr)
@@ -40,12 +43,17 @@ libc_ifunc (__libc_memmove,
: (IS_FALKOR (midr) || IS_PHECDA (midr)
? __memmove_falkor
: (IS_THUNDERX2 (midr) || IS_THUNDERX2PA (midr)
- ? __memmove_thunderx2
- : (IS_NEOVERSE_N1 (midr) || IS_NEOVERSE_N2 (midr)
- || IS_NEOVERSE_V1 (midr)
- ? __memmove_simd
+ ? __memmove_thunderx2
+ : (IS_NEOVERSE_N1 (midr) || IS_NEOVERSE_N2 (midr)
+ || IS_NEOVERSE_V1 (midr)
+ ? __memmove_simd
+# if HAVE_AARCH64_SVE_ASM
+ : (IS_A64FX (midr)
+ ? __memmove_a64fx
+ : __memmove_generic))))));
+# else
: __memmove_generic)))));
-
+# endif
# undef memmove
strong_alias (__libc_memmove, memmove);
#endif
diff --git a/sysdeps/unix/sysv/linux/aarch64/cpu-features.c b/sysdeps/unix/sysv/linux/aarch64/cpu-features.c
index db6aa3516c1b..6206a2f618b0 100644
--- a/sysdeps/unix/sysv/linux/aarch64/cpu-features.c
+++ b/sysdeps/unix/sysv/linux/aarch64/cpu-features.c
@@ -46,6 +46,7 @@ static struct cpu_list cpu_list[] = {
{"ares", 0x411FD0C0},
{"emag", 0x503F0001},
{"kunpeng920", 0x481FD010},
+ {"a64fx", 0x460F0010},
{"generic", 0x0}
};
@@ -116,4 +117,7 @@ init_cpu_features (struct cpu_features *cpu_features)
(PR_TAGGED_ADDR_ENABLE | PR_MTE_TCF_ASYNC | MTE_ALLOWED_TAGS),
0, 0, 0);
#endif
+
+ /* Check if SVE is supported. */
+ cpu_features->sve = GLRO (dl_hwcap) & HWCAP_SVE;
}
diff --git a/sysdeps/unix/sysv/linux/aarch64/cpu-features.h b/sysdeps/unix/sysv/linux/aarch64/cpu-features.h
index 3b9bfed1349c..2b322e5414be 100644
--- a/sysdeps/unix/sysv/linux/aarch64/cpu-features.h
+++ b/sysdeps/unix/sysv/linux/aarch64/cpu-features.h
@@ -65,6 +65,9 @@
#define IS_KUNPENG920(midr) (MIDR_IMPLEMENTOR(midr) == 'H' \
&& MIDR_PARTNUM(midr) == 0xd01)
+#define IS_A64FX(midr) (MIDR_IMPLEMENTOR(midr) == 'F' \
+ && MIDR_PARTNUM(midr) == 0x001)
+
struct cpu_features
{
uint64_t midr_el1;
@@ -72,6 +75,7 @@ struct cpu_features
bool bti;
/* Currently, the GLIBC memory tagging tunable only defines 8 bits. */
uint8_t mte_state;
+ bool sve;
};
#endif /* _CPU_FEATURES_AARCH64_H */
--
2.17.1
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH v3 2/2] aarch64: Added optimized memset for A64FX
2021-05-12 9:23 ` [PATCH v2 0/6] aarch64: " Naohiro Tamura
` (7 preceding siblings ...)
2021-05-27 7:42 ` [PATCH v3 1/2] aarch64: Added optimized memcpy and memmove " Naohiro Tamura
@ 2021-05-27 7:44 ` Naohiro Tamura
8 siblings, 0 replies; 72+ messages in thread
From: Naohiro Tamura @ 2021-05-27 7:44 UTC (permalink / raw)
To: libc-alpha; +Cc: Naohiro Tamura
From: Naohiro Tamura <naohirot@jp.fujitsu.com>
This patch optimizes the performance of memset for A64FX [1] which
implements ARMv8-A SVE and has L1 64KB cache per core and L2 8MB cache
per NUMA node.
The performance optimization makes use of Scalable Vector Register
with several techniques such as loop unrolling, memory access
alignment, cache zero fill and prefetch.
SVE assembler code for memset is implemented as Vector Length Agnostic
code so theoretically it can be run on any SOC which supports ARMv8-A
SVE standard.
We confirmed that all testcases have been passed by running 'make
check' and 'make xcheck' not only on A64FX but also on ThunderX2.
And also we confirmed that the SVE 512 bit vector register performance
is roughly 4 times better than Advanced SIMD 128 bit register and 8
times better than scalar 64 bit register by running 'make bench'.
[1] https://github.com/fujitsu/A64FX
Reviewed-by: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Reviewed-by: Szabolcs Nagy <Szabolcs.Nagy@arm.com>
---
sysdeps/aarch64/multiarch/Makefile | 1 +
sysdeps/aarch64/multiarch/ifunc-impl-list.c | 5 +-
sysdeps/aarch64/multiarch/memset.c | 17 +-
sysdeps/aarch64/multiarch/memset_a64fx.S | 268 ++++++++++++++++++++
4 files changed, 286 insertions(+), 5 deletions(-)
create mode 100644 sysdeps/aarch64/multiarch/memset_a64fx.S
diff --git a/sysdeps/aarch64/multiarch/Makefile b/sysdeps/aarch64/multiarch/Makefile
index 04c3f171215e..7500cf1e9369 100644
--- a/sysdeps/aarch64/multiarch/Makefile
+++ b/sysdeps/aarch64/multiarch/Makefile
@@ -2,6 +2,7 @@ ifeq ($(subdir),string)
sysdep_routines += memcpy_generic memcpy_advsimd memcpy_thunderx memcpy_thunderx2 \
memcpy_falkor memcpy_a64fx \
memset_generic memset_falkor memset_emag memset_kunpeng \
+ memset_a64fx \
memchr_generic memchr_nosimd \
strlen_mte strlen_asimd
endif
diff --git a/sysdeps/aarch64/multiarch/ifunc-impl-list.c b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
index 911393565c21..4e1a641d9fe9 100644
--- a/sysdeps/aarch64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/aarch64/multiarch/ifunc-impl-list.c
@@ -37,7 +37,7 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
INIT_ARCH ();
- /* Support sysdeps/aarch64/multiarch/memcpy.c and memmove.c. */
+ /* Support sysdeps/aarch64/multiarch/memcpy.c, memmove.c and memset.c. */
IFUNC_IMPL (i, name, memcpy,
IFUNC_IMPL_ADD (array, i, memcpy, 1, __memcpy_thunderx)
IFUNC_IMPL_ADD (array, i, memcpy, !bti, __memcpy_thunderx2)
@@ -62,6 +62,9 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
IFUNC_IMPL_ADD (array, i, memset, (zva_size == 64), __memset_falkor)
IFUNC_IMPL_ADD (array, i, memset, (zva_size == 64), __memset_emag)
IFUNC_IMPL_ADD (array, i, memset, 1, __memset_kunpeng)
+#if HAVE_AARCH64_SVE_ASM
+ IFUNC_IMPL_ADD (array, i, memset, sve, __memset_a64fx)
+#endif
IFUNC_IMPL_ADD (array, i, memset, 1, __memset_generic))
IFUNC_IMPL (i, name, memchr,
IFUNC_IMPL_ADD (array, i, memchr, !mte, __memchr_nosimd)
diff --git a/sysdeps/aarch64/multiarch/memset.c b/sysdeps/aarch64/multiarch/memset.c
index 28d3926bc2e6..d7d9bbbda095 100644
--- a/sysdeps/aarch64/multiarch/memset.c
+++ b/sysdeps/aarch64/multiarch/memset.c
@@ -31,16 +31,25 @@ extern __typeof (__redirect_memset) __libc_memset;
extern __typeof (__redirect_memset) __memset_falkor attribute_hidden;
extern __typeof (__redirect_memset) __memset_emag attribute_hidden;
extern __typeof (__redirect_memset) __memset_kunpeng attribute_hidden;
+# if HAVE_AARCH64_SVE_ASM
+extern __typeof (__redirect_memset) __memset_a64fx attribute_hidden;
+# endif
extern __typeof (__redirect_memset) __memset_generic attribute_hidden;
libc_ifunc (__libc_memset,
IS_KUNPENG920 (midr)
?__memset_kunpeng
: ((IS_FALKOR (midr) || IS_PHECDA (midr)) && zva_size == 64
- ? __memset_falkor
- : (IS_EMAG (midr) && zva_size == 64
- ? __memset_emag
- : __memset_generic)));
+ ? __memset_falkor
+ : (IS_EMAG (midr) && zva_size == 64
+ ? __memset_emag
+# if HAVE_AARCH64_SVE_ASM
+ : (IS_A64FX (midr)
+ ? __memset_a64fx
+ : __memset_generic))));
+# else
+ : __memset_generic)));
+# endif
# undef memset
strong_alias (__libc_memset, memset);
diff --git a/sysdeps/aarch64/multiarch/memset_a64fx.S b/sysdeps/aarch64/multiarch/memset_a64fx.S
new file mode 100644
index 000000000000..ce54e5418b08
--- /dev/null
+++ b/sysdeps/aarch64/multiarch/memset_a64fx.S
@@ -0,0 +1,268 @@
+/* Optimized memset for Fujitsu A64FX processor.
+ Copyright (C) 2021 Free Software Foundation, Inc.
+
+ This file is part of the GNU C Library.
+
+ The GNU C Library is free software; you can redistribute it and/or
+ modify it under the terms of the GNU Lesser General Public
+ License as published by the Free Software Foundation; either
+ version 2.1 of the License, or (at your option) any later version.
+
+ The GNU C Library is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ Lesser General Public License for more details.
+
+ You should have received a copy of the GNU Lesser General Public
+ License along with the GNU C Library. If not, see
+ <https://www.gnu.org/licenses/>. */
+
+#include <sysdep.h>
+#include <sysdeps/aarch64/memset-reg.h>
+
+/* Assumptions:
+ *
+ * ARMv8.2-a, AArch64, unaligned accesses, sve
+ *
+ */
+
+#define L1_SIZE (64*1024) // L1 64KB
+#define L2_SIZE (8*1024*1024) // L2 8MB
+#define CACHE_LINE_SIZE 256
+#define PF_DIST_L1 (CACHE_LINE_SIZE * 16) // Prefetch distance L1
+#define ZF_DIST (CACHE_LINE_SIZE * 21) // Zerofill distance
+#define rest x8
+#define vector_length x9
+#define vl_remainder x10 // vector_length remainder
+#define cl_remainder x11 // CACHE_LINE_SIZE remainder
+
+#if HAVE_AARCH64_SVE_ASM
+# if IS_IN (libc)
+# define MEMSET __memset_a64fx
+
+ .arch armv8.2-a+sve
+
+ .macro dc_zva times
+ dc zva, tmp1
+ add tmp1, tmp1, CACHE_LINE_SIZE
+ .if \times-1
+ dc_zva "(\times-1)"
+ .endif
+ .endm
+
+ .macro st1b_unroll first=0, last=7
+ st1b z0.b, p0, [dst, #\first, mul vl]
+ .if \last-\first
+ st1b_unroll "(\first+1)", \last
+ .endif
+ .endm
+
+ .macro shortcut_for_small_size exit
+ // if rest <= vector_length * 2
+ whilelo p0.b, xzr, count
+ whilelo p1.b, vector_length, count
+ b.last 1f
+ st1b z0.b, p0, [dstin, #0, mul vl]
+ st1b z0.b, p1, [dstin, #1, mul vl]
+ ret
+1: // if rest > vector_length * 8
+ cmp count, vector_length, lsl 3 // vector_length * 8
+ b.hi \exit
+ // if rest <= vector_length * 4
+ lsl tmp1, vector_length, 1 // vector_length * 2
+ whilelo p2.b, tmp1, count
+ incb tmp1
+ whilelo p3.b, tmp1, count
+ b.last 1f
+ st1b z0.b, p0, [dstin, #0, mul vl]
+ st1b z0.b, p1, [dstin, #1, mul vl]
+ st1b z0.b, p2, [dstin, #2, mul vl]
+ st1b z0.b, p3, [dstin, #3, mul vl]
+ ret
+1: // if rest <= vector_length * 8
+ lsl tmp1, vector_length, 2 // vector_length * 4
+ whilelo p4.b, tmp1, count
+ incb tmp1
+ whilelo p5.b, tmp1, count
+ b.last 1f
+ st1b z0.b, p0, [dstin, #0, mul vl]
+ st1b z0.b, p1, [dstin, #1, mul vl]
+ st1b z0.b, p2, [dstin, #2, mul vl]
+ st1b z0.b, p3, [dstin, #3, mul vl]
+ st1b z0.b, p4, [dstin, #4, mul vl]
+ st1b z0.b, p5, [dstin, #5, mul vl]
+ ret
+1: lsl tmp1, vector_length, 2 // vector_length * 4
+ incb tmp1 // vector_length * 5
+ incb tmp1 // vector_length * 6
+ whilelo p6.b, tmp1, count
+ incb tmp1
+ whilelo p7.b, tmp1, count
+ st1b z0.b, p0, [dstin, #0, mul vl]
+ st1b z0.b, p1, [dstin, #1, mul vl]
+ st1b z0.b, p2, [dstin, #2, mul vl]
+ st1b z0.b, p3, [dstin, #3, mul vl]
+ st1b z0.b, p4, [dstin, #4, mul vl]
+ st1b z0.b, p5, [dstin, #5, mul vl]
+ st1b z0.b, p6, [dstin, #6, mul vl]
+ st1b z0.b, p7, [dstin, #7, mul vl]
+ ret
+ .endm
+
+ENTRY (MEMSET)
+
+ PTR_ARG (0)
+ SIZE_ARG (2)
+
+ cbnz count, 1f
+ ret
+1: dup z0.b, valw
+ cntb vector_length
+ // shortcut for less than vector_length * 8
+ // gives a free ptrue to p0.b for n >= vector_length
+ shortcut_for_small_size L(vl_agnostic)
+ // end of shortcut
+
+L(vl_agnostic): // VL Agnostic
+ mov rest, count
+ mov dst, dstin
+ add dstend, dstin, count
+ // if rest >= L2_SIZE && vector_length == 64 then L(L2)
+ mov tmp1, 64
+ cmp rest, L2_SIZE
+ ccmp vector_length, tmp1, 0, cs
+ b.eq L(L2)
+ // if rest >= L1_SIZE && vector_length == 64 then L(L1_prefetch)
+ cmp rest, L1_SIZE
+ ccmp vector_length, tmp1, 0, cs
+ b.eq L(L1_prefetch)
+
+L(unroll32):
+ lsl tmp1, vector_length, 3 // vector_length * 8
+ lsl tmp2, vector_length, 5 // vector_length * 32
+ .p2align 3
+1: cmp rest, tmp2
+ b.cc L(unroll8)
+ st1b_unroll
+ add dst, dst, tmp1
+ st1b_unroll
+ add dst, dst, tmp1
+ st1b_unroll
+ add dst, dst, tmp1
+ st1b_unroll
+ add dst, dst, tmp1
+ sub rest, rest, tmp2
+ b 1b
+
+L(unroll8):
+ lsl tmp1, vector_length, 3
+ .p2align 3
+1: cmp rest, tmp1
+ b.cc L(last)
+ st1b_unroll
+ add dst, dst, tmp1
+ sub rest, rest, tmp1
+ b 1b
+
+L(last):
+ whilelo p0.b, xzr, rest
+ whilelo p1.b, vector_length, rest
+ b.last 1f
+ st1b z0.b, p0, [dst, #0, mul vl]
+ st1b z0.b, p1, [dst, #1, mul vl]
+ ret
+1: lsl tmp1, vector_length, 1 // vector_length * 2
+ whilelo p2.b, tmp1, rest
+ incb tmp1
+ whilelo p3.b, tmp1, rest
+ b.last 1f
+ st1b z0.b, p0, [dst, #0, mul vl]
+ st1b z0.b, p1, [dst, #1, mul vl]
+ st1b z0.b, p2, [dst, #2, mul vl]
+ st1b z0.b, p3, [dst, #3, mul vl]
+ ret
+1: lsl tmp1, vector_length, 2 // vector_length * 4
+ whilelo p4.b, tmp1, rest
+ incb tmp1
+ whilelo p5.b, tmp1, rest
+ incb tmp1
+ whilelo p6.b, tmp1, rest
+ incb tmp1
+ whilelo p7.b, tmp1, rest
+ st1b z0.b, p0, [dst, #0, mul vl]
+ st1b z0.b, p1, [dst, #1, mul vl]
+ st1b z0.b, p2, [dst, #2, mul vl]
+ st1b z0.b, p3, [dst, #3, mul vl]
+ st1b z0.b, p4, [dst, #4, mul vl]
+ st1b z0.b, p5, [dst, #5, mul vl]
+ st1b z0.b, p6, [dst, #6, mul vl]
+ st1b z0.b, p7, [dst, #7, mul vl]
+ ret
+
+L(L1_prefetch): // if rest >= L1_SIZE
+ .p2align 3
+1: st1b_unroll 0, 3
+ prfm pstl1keep, [dst, PF_DIST_L1]
+ st1b_unroll 4, 7
+ prfm pstl1keep, [dst, PF_DIST_L1 + CACHE_LINE_SIZE]
+ add dst, dst, CACHE_LINE_SIZE * 2
+ sub rest, rest, CACHE_LINE_SIZE * 2
+ cmp rest, L1_SIZE
+ b.ge 1b
+ cbnz rest, L(unroll32)
+ ret
+
+L(L2):
+ // align dst address at vector_length byte boundary
+ sub tmp1, vector_length, 1
+ ands tmp2, dst, tmp1
+ // if vl_remainder == 0
+ b.eq 1f
+ sub vl_remainder, vector_length, tmp2
+ // process remainder until the first vector_length boundary
+ whilelt p2.b, xzr, vl_remainder
+ st1b z0.b, p2, [dst]
+ add dst, dst, vl_remainder
+ sub rest, rest, vl_remainder
+	// align dst address at CACHE_LINE_SIZE byte boundary
+1: mov tmp1, CACHE_LINE_SIZE
+ ands tmp2, dst, CACHE_LINE_SIZE - 1
+ // if cl_remainder == 0
+ b.eq L(L2_dc_zva)
+ sub cl_remainder, tmp1, tmp2
+ // process remainder until the first CACHE_LINE_SIZE boundary
+ mov tmp1, xzr // index
+2: whilelt p2.b, tmp1, cl_remainder
+ st1b z0.b, p2, [dst, tmp1]
+ incb tmp1
+ cmp tmp1, cl_remainder
+ b.lo 2b
+ add dst, dst, cl_remainder
+ sub rest, rest, cl_remainder
+
+L(L2_dc_zva):
+ // zero fill
+ mov tmp1, dst
+ dc_zva (ZF_DIST / CACHE_LINE_SIZE) - 1
+ mov zva_len, ZF_DIST
+ add tmp1, zva_len, CACHE_LINE_SIZE * 2
+ // unroll
+ .p2align 3
+1: st1b_unroll 0, 3
+ add tmp2, dst, zva_len
+ dc zva, tmp2
+ st1b_unroll 4, 7
+ add tmp2, tmp2, CACHE_LINE_SIZE
+ dc zva, tmp2
+ add dst, dst, CACHE_LINE_SIZE * 2
+ sub rest, rest, CACHE_LINE_SIZE * 2
+ cmp rest, tmp1 // ZF_DIST + CACHE_LINE_SIZE * 2
+ b.ge 1b
+ cbnz rest, L(unroll8)
+ ret
+
+END (MEMSET)
+libc_hidden_builtin_def (MEMSET)
+
+#endif /* IS_IN (libc) */
+#endif /* HAVE_AARCH64_SVE_ASM */
--
2.17.1
* RE: [PATCH v2 0/6] aarch64: Added optimized memcpy/memmove/memset for A64FX
2021-05-27 0:22 ` [PATCH v2 0/6] aarch64: Added optimized memcpy/memmove/memset for A64FX naohirot
@ 2021-05-27 23:50 ` naohirot
0 siblings, 0 replies; 72+ messages in thread
From: naohirot @ 2021-05-27 23:50 UTC (permalink / raw)
To: 'Szabolcs Nagy', libc-alpha@sourceware.org
Hi Szabolcs,
> > aarch64: Added optimized memcpy and memmove for A64FX
> > aarch64: Added optimized memset for A64FX
>
> I'll fix the whitespaces.
Great, thank you for the merges!
Naohiro
Thread overview: 72+ messages
2021-03-17 2:28 [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX Naohiro Tamura
2021-03-17 2:33 ` [PATCH 1/5] config: Added HAVE_SVE_ASM_SUPPORT for aarch64 Naohiro Tamura
2021-03-29 12:11 ` Szabolcs Nagy via Libc-alpha
2021-03-30 6:19 ` naohirot
2021-03-17 2:34 ` [PATCH 2/5] aarch64: Added optimized memcpy and memmove for A64FX Naohiro Tamura
2021-03-29 12:44 ` Szabolcs Nagy via Libc-alpha
2021-03-30 7:17 ` naohirot
2021-03-17 2:34 ` [PATCH 3/5] aarch64: Added optimized memset " Naohiro Tamura
2021-03-17 2:35 ` [PATCH 4/5] scripts: Added Vector Length Set test helper script Naohiro Tamura
2021-03-29 13:20 ` Szabolcs Nagy via Libc-alpha
2021-03-30 7:25 ` naohirot
2021-03-17 2:35 ` [PATCH 5/5] benchtests: Added generic_memcpy and generic_memmove to large benchtests Naohiro Tamura
2021-03-29 12:03 ` [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX Szabolcs Nagy via Libc-alpha
2021-05-10 1:45 ` naohirot
2021-05-14 13:35 ` Szabolcs Nagy via Libc-alpha
2021-05-19 0:11 ` naohirot
2021-05-12 9:23 ` [PATCH v2 0/6] aarch64: " Naohiro Tamura
2021-05-12 9:26 ` [PATCH v2 1/6] config: Added HAVE_AARCH64_SVE_ASM for aarch64 Naohiro Tamura
2021-05-26 10:05 ` Szabolcs Nagy via Libc-alpha
2021-05-12 9:27 ` [PATCH v2 2/6] aarch64: define BTI_C and BTI_J macros as NOP unless HAVE_AARCH64_BTI Naohiro Tamura
2021-05-26 10:06 ` Szabolcs Nagy via Libc-alpha
2021-05-12 9:28 ` [PATCH v2 3/6] aarch64: Added optimized memcpy and memmove for A64FX Naohiro Tamura
2021-05-26 10:19 ` Szabolcs Nagy via Libc-alpha
2021-05-12 9:28 ` [PATCH v2 4/6] aarch64: Added optimized memset " Naohiro Tamura
2021-05-26 10:22 ` Szabolcs Nagy via Libc-alpha
2021-05-12 9:29 ` [PATCH v2 5/6] scripts: Added Vector Length Set test helper script Naohiro Tamura
2021-05-12 16:58 ` Joseph Myers
2021-05-13 9:53 ` naohirot
2021-05-20 7:34 ` Naohiro Tamura
2021-05-26 10:24 ` Szabolcs Nagy via Libc-alpha
2021-05-12 9:29 ` [PATCH v2 6/6] benchtests: Fixed bench-memcpy-random: buf1: mprotect failed Naohiro Tamura
2021-05-26 10:25 ` Szabolcs Nagy via Libc-alpha
2021-05-27 0:22 ` [PATCH v2 0/6] aarch64: Added optimized memcpy/memmove/memset for A64FX naohirot
2021-05-27 23:50 ` naohirot
2021-05-27 7:42 ` [PATCH v3 1/2] aarch64: Added optimized memcpy and memmove " Naohiro Tamura
2021-05-27 7:44 ` [PATCH v3 2/2] aarch64: Added optimized memset " Naohiro Tamura
-- strict thread matches above, loose matches on Subject: below --
2021-04-12 12:52 [PATCH 0/5] Added optimized memcpy/memmove/memset " Wilco Dijkstra via Libc-alpha
2021-04-12 18:53 ` Florian Weimer
2021-04-13 12:07 ` naohirot
2021-04-14 16:02 ` Wilco Dijkstra via Libc-alpha
2021-04-15 12:20 ` naohirot
2021-04-20 16:00 ` Wilco Dijkstra via Libc-alpha
2021-04-27 11:58 ` naohirot
2021-04-29 15:13 ` Wilco Dijkstra via Libc-alpha
2021-04-30 15:01 ` Szabolcs Nagy via Libc-alpha
2021-04-30 15:23 ` Wilco Dijkstra via Libc-alpha
2021-04-30 15:30 ` Florian Weimer via Libc-alpha
2021-04-30 15:40 ` Wilco Dijkstra via Libc-alpha
2021-05-04 7:56 ` Szabolcs Nagy via Libc-alpha
2021-05-04 10:17 ` Florian Weimer via Libc-alpha
2021-05-04 10:38 ` Wilco Dijkstra via Libc-alpha
2021-05-04 10:42 ` Szabolcs Nagy via Libc-alpha
2021-05-04 11:07 ` Florian Weimer via Libc-alpha
2021-05-06 10:01 ` naohirot
2021-05-06 14:26 ` Szabolcs Nagy via Libc-alpha
2021-05-06 15:09 ` Florian Weimer via Libc-alpha
2021-05-06 17:31 ` Wilco Dijkstra via Libc-alpha
2021-05-07 12:31 ` naohirot
2021-04-19 2:51 ` naohirot
2021-04-19 14:57 ` Wilco Dijkstra via Libc-alpha
2021-04-21 10:10 ` naohirot
2021-04-21 15:02 ` Wilco Dijkstra via Libc-alpha
2021-04-22 13:17 ` naohirot
2021-04-23 0:58 ` naohirot
2021-04-19 12:43 ` naohirot
2021-04-20 3:31 ` naohirot
2021-04-20 14:44 ` Wilco Dijkstra via Libc-alpha
2021-04-27 9:01 ` naohirot
2021-04-20 5:49 ` naohirot
2021-04-20 11:39 ` Wilco Dijkstra via Libc-alpha
2021-04-27 11:03 ` naohirot
2021-04-23 13:22 ` naohirot