From: "H.J. Lu via Libc-alpha" <libc-alpha@sourceware.org>
To: liqingqing <liqingqing3@huawei.com>
Cc: Hushiyuan <hushiyuan@huawei.com>,
"libc-alpha@sourceware.org" <libc-alpha@sourceware.org>
Subject: [PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables
Date: Fri, 22 May 2020 21:37:39 -0700
Message-ID: <CAMe9rOoYXMdOfedtZLx=GT-nFXThzoo7Q__H4vg=2vyOufGY6A@mail.gmail.com>
In-Reply-To: <e6de570b-48bf-88cf-2cec-5f5a5e7821bf@huawei.com>
[-- Attachment #1: Type: text/plain, Size: 3008 bytes --]
On Fri, May 22, 2020 at 9:10 PM liqingqing <liqingqing3@huawei.com> wrote:
>
> Commit 830566307f038387ca0af3fd327706a8d1a2f595 optimizes the implementation of memset
> and sets the default value of the macro REP_STOSB_THRESHOLD to 2KB. When the input length is less than 2KB, the data flow is unchanged; when the input length is larger than 2KB,
> memset uses STOSB instead of MOVQ.
>
> But when I tested this API on an x86_64 platform,
> I found that this default value is not appropriate for some input lengths. Here are the environment and the results:
>
> test suite: libMicro-0.4.0
> ./memset -E -C 200 -L -S -W -N "memset_4k" -s 4k -I 250
> ./memset -E -C 200 -L -S -W -N "memset_4k_uc" -s 4k -u -I 400
> ./memset -E -C 200 -L -S -W -N "memset_1m" -s 1m -I 200000
> ./memset -E -C 200 -L -S -W -N "memset_10m" -s 10m -I 2000000
>
> hardware platform:
> Intel(R) Xeon(R) Gold 6266C CPU @ 3.00GHz
> L1d cache: 32KB
> L1i cache: 32KB
> L2 cache: 1MB
> L3 cache: 60MB
>
> The result is that when the input length is between the processor's L1 data cache size and its L2 cache size, REP_STOSB_THRESHOLD=2KB reduces performance.
>
>               before this commit   after this commit
>               cycle                cycle
> memset_4k     249                  96
> memset_10k    657                  185
> memset_36k    2773                 3767
> memset_100k   7594                 10002
> memset_500k   37678                52149
> memset_1m     86780                108044
> memset_10m    1307238              1148994
>
>               before this commit      after this commit
>               MLC cache miss(10sec)   MLC cache miss(10sec)
> memset_4k     1,09,33,823             1,01,79,270
> memset_10k    1,23,78,958             1,05,41,087
> memset_36k    3,61,64,244             4,07,22,429
> memset_100k   8,25,33,052             9,31,81,253
> memset_500k   37,32,55,449            43,56,70,395
> memset_1m     75,16,28,239            88,29,90,237
> memset_10m    9,36,61,67,397          8,96,69,49,522
>
>
> Although REP_STOSB_THRESHOLD can be modified at build time with -DREP_STOSB_THRESHOLD=xxx,
> I think the default value may not be the best one, because most processors' L2 caches are larger than 2KB, so I am submitting the patch below:
>
>
>
> From 44314a556239a7524b5a6451025737c1bdbb1cd0 Mon Sep 17 00:00:00 2001
> From: liqingqing <liqingqing3@huawei.com>
> Date: Thu, 21 May 2020 11:23:06 +0800
> Subject: [PATCH] update REP_STOSB_THRESHOLD's default value from 2k to 1M
> The current value of the macro REP_STOSB_THRESHOLD reduces memset performance when the input length is between the processor's L1 data cache size and its L2 cache size,
> so update the default value to eliminate the regression.
>
There is no single threshold value which is good for all workloads.
I don't think we should change REP_STOSB_THRESHOLD to 1MB.
On the other hand, the fixed threshold isn't flexible. Please try this
patch to see if you can set the threshold for your specific workload.
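
For anyone trying the patch, a minimal timing sketch in C follows. It
is not part of the patch, and time_memset_ns is a hypothetical helper
name. It assumes a glibc built with this patch, so that the threshold
can be changed per run through the environment, for example
GLIBC_TUNABLES=glibc.cpu.x86_rep_stosb_threshold=1048576. It reports
wall-clock time rather than the cycle counts libMicro prints, so treat
the numbers as a rough comparison only.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper, not part of the patch: average wall-clock
   nanoseconds per memset call on a buffer of SIZE bytes.  Returns a
   negative value if ITERATIONS is not positive or if the buffer
   cannot be allocated.  */
double
time_memset_ns (size_t size, int iterations)
{
  /* Call through a volatile pointer so the compiler cannot inline or
     drop the call; the libc implementation is what gets measured.  */
  void *(*volatile memset_fn) (void *, int, size_t) = memset;

  if (iterations <= 0)
    return -1.0;
  unsigned char *buf = malloc (size);
  if (buf == NULL)
    return -1.0;

  /* Warm up: touch every page before timing.  */
  memset_fn (buf, 0, size);

  struct timespec start, end;
  clock_gettime (CLOCK_MONOTONIC, &start);
  for (int i = 0; i < iterations; i++)
    memset_fn (buf, i & 0xff, size);
  clock_gettime (CLOCK_MONOTONIC, &end);

  free (buf);
  double ns = (end.tv_sec - start.tv_sec) * 1e9
	      + (end.tv_nsec - start.tv_nsec);
  return ns / iterations;
}
```

Calling time_memset_ns (36 * 1024, 100000) from a small main, once
with the default threshold and once with the tunable set, should show
whether a size between the L1 data cache and L2 cache sizes behaves
differently on a given machine.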
--
H.J.
[-- Attachment #2: 0001-x86-Add-thresholds-for-rep-movsb-stosb-to-tunables.patch --]
[-- Type: text/x-patch, Size: 8962 bytes --]
From 7d2e0c0b843d509716d92960b9b139b32eacea54 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" <hjl.tools@gmail.com>
Date: Sat, 9 May 2020 11:13:57 -0700
Subject: [PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables
Add x86_rep_movsb_threshold and x86_rep_stosb_threshold to tunables
to update thresholds for "rep movsb" and "rep stosb" at run-time.
Note that the user specified threshold for "rep movsb" smaller than
the minimum threshold will be ignored.
---
manual/tunables.texi | 16 +++++++
sysdeps/x86/cacheinfo.c | 46 +++++++++++++++++++
sysdeps/x86/cpu-features.c | 4 ++
sysdeps/x86/cpu-features.h | 4 ++
sysdeps/x86/dl-tunables.list | 6 +++
.../multiarch/memmove-vec-unaligned-erms.S | 16 +------
.../multiarch/memset-vec-unaligned-erms.S | 12 +----
7 files changed, 78 insertions(+), 26 deletions(-)
diff --git a/manual/tunables.texi b/manual/tunables.texi
index ec18b10834..8054f79be0 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -396,6 +396,22 @@ to set threshold in bytes for non temporal store.
This tunable is specific to i386 and x86-64.
@end deftp
+@deftp Tunable glibc.cpu.x86_rep_movsb_threshold
+The @code{glibc.cpu.x86_rep_movsb_threshold} tunable allows the user
+to set threshold in bytes to start using "rep movsb". Note that the
+user specified threshold smaller than the minimum threshold will be
+ignored.
+
+This tunable is specific to i386 and x86-64.
+@end deftp
+
+@deftp Tunable glibc.cpu.x86_rep_stosb_threshold
+The @code{glibc.cpu.x86_rep_stosb_threshold} tunable allows the user
+to set threshold in bytes to start using "rep stosb".
+
+This tunable is specific to i386 and x86-64.
+@end deftp
+
@deftp Tunable glibc.cpu.x86_ibt
The @code{glibc.cpu.x86_ibt} tunable allows the user to control how
indirect branch tracking (IBT) should be enabled. Accepted values are
diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
index 311502dee3..4322328a1b 100644
--- a/sysdeps/x86/cacheinfo.c
+++ b/sysdeps/x86/cacheinfo.c
@@ -530,6 +530,23 @@ long int __x86_raw_shared_cache_size attribute_hidden = 1024 * 1024;
/* Threshold to use non temporal store. */
long int __x86_shared_non_temporal_threshold attribute_hidden;
+/* Threshold to use Enhanced REP MOVSB. Since there is overhead to set
+ up REP MOVSB operation, REP MOVSB isn't faster on short data. The
+ memcpy micro benchmark in glibc shows that 2KB is the approximate
+ value above which REP MOVSB becomes faster than SSE2 optimization
+ on processors with Enhanced REP MOVSB. Since larger register size
+ can move more data with a single load and store, the threshold is
+ higher with larger register size. */
+long int __x86_rep_movsb_threshold attribute_hidden = 2048;
+
+/* Threshold to use Enhanced REP STOSB. Since there is overhead to set
+ up REP STOSB operation, REP STOSB isn't faster on short data. The
+ memset micro benchmark in glibc shows that 2KB is the approximate
+ value above which REP STOSB becomes faster on processors with
+ Enhanced REP STOSB. Since the stored value is fixed, larger register
+ size has minimal impact on threshold. */
+long int __x86_rep_stosb_threshold attribute_hidden = 2048;
+
#ifndef DISABLE_PREFETCHW
/* PREFETCHW support flag for use in memory and string routines. */
int __x86_prefetchw attribute_hidden;
@@ -872,6 +889,35 @@ init_cacheinfo (void)
= (cpu_features->non_temporal_threshold != 0
? cpu_features->non_temporal_threshold
: __x86_shared_cache_size * threads * 3 / 4);
+
+ /* NB: The REP MOVSB threshold must be greater than VEC_SIZE * 8. */
+ unsigned int minimum_rep_movsb_threshold;
+ /* NB: The default REP MOVSB threshold is 2048 * (VEC_SIZE / 16). */
+ unsigned int rep_movsb_threshold;
+ if (CPU_FEATURES_ARCH_P (cpu_features, AVX512F_Usable)
+ && !CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_AVX512))
+ {
+ rep_movsb_threshold = 2048 * (64 / 16);
+ minimum_rep_movsb_threshold = 64 * 8;
+ }
+ else if (CPU_FEATURES_ARCH_P (cpu_features,
+ AVX_Fast_Unaligned_Load))
+ {
+ rep_movsb_threshold = 2048 * (32 / 16);
+ minimum_rep_movsb_threshold = 32 * 8;
+ }
+ else
+ {
+ rep_movsb_threshold = 2048 * (16 / 16);
+ minimum_rep_movsb_threshold = 16 * 8;
+ }
+ if (cpu_features->rep_movsb_threshold > minimum_rep_movsb_threshold)
+ __x86_rep_movsb_threshold = cpu_features->rep_movsb_threshold;
+ else
+ __x86_rep_movsb_threshold = rep_movsb_threshold;
+
+ if (cpu_features->rep_stosb_threshold)
+ __x86_rep_stosb_threshold = cpu_features->rep_stosb_threshold;
}
#endif
diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c
index 916bbf5242..14f847320f 100644
--- a/sysdeps/x86/cpu-features.c
+++ b/sysdeps/x86/cpu-features.c
@@ -564,6 +564,10 @@ no_cpuid:
TUNABLE_GET (hwcaps, tunable_val_t *, TUNABLE_CALLBACK (set_hwcaps));
cpu_features->non_temporal_threshold
= TUNABLE_GET (x86_non_temporal_threshold, long int, NULL);
+ cpu_features->rep_movsb_threshold
+ = TUNABLE_GET (x86_rep_movsb_threshold, long int, NULL);
+ cpu_features->rep_stosb_threshold
+ = TUNABLE_GET (x86_rep_stosb_threshold, long int, NULL);
cpu_features->data_cache_size
= TUNABLE_GET (x86_data_cache_size, long int, NULL);
cpu_features->shared_cache_size
diff --git a/sysdeps/x86/cpu-features.h b/sysdeps/x86/cpu-features.h
index f05d5ce158..7410324e83 100644
--- a/sysdeps/x86/cpu-features.h
+++ b/sysdeps/x86/cpu-features.h
@@ -91,6 +91,10 @@ struct cpu_features
unsigned long int shared_cache_size;
/* Threshold to use non temporal store. */
unsigned long int non_temporal_threshold;
+ /* Threshold to use "rep movsb". */
+ unsigned long int rep_movsb_threshold;
+ /* Threshold to use "rep stosb". */
+ unsigned long int rep_stosb_threshold;
};
/* Used from outside of glibc to get access to the CPU features
diff --git a/sysdeps/x86/dl-tunables.list b/sysdeps/x86/dl-tunables.list
index 251b926ce4..43bf6c2389 100644
--- a/sysdeps/x86/dl-tunables.list
+++ b/sysdeps/x86/dl-tunables.list
@@ -30,6 +30,12 @@ glibc {
x86_non_temporal_threshold {
type: SIZE_T
}
+ x86_rep_movsb_threshold {
+ type: SIZE_T
+ }
+ x86_rep_stosb_threshold {
+ type: SIZE_T
+ }
x86_data_cache_size {
type: SIZE_T
}
diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
index 74953245aa..bd5dc1a3f3 100644
--- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
@@ -56,17 +56,6 @@
# endif
#endif
-/* Threshold to use Enhanced REP MOVSB. Since there is overhead to set
- up REP MOVSB operation, REP MOVSB isn't faster on short data. The
- memcpy micro benchmark in glibc shows that 2KB is the approximate
- value above which REP MOVSB becomes faster than SSE2 optimization
- on processors with Enhanced REP MOVSB. Since larger register size
- can move more data with a single load and store, the threshold is
- higher with larger register size. */
-#ifndef REP_MOVSB_THRESHOLD
-# define REP_MOVSB_THRESHOLD (2048 * (VEC_SIZE / 16))
-#endif
-
#ifndef PREFETCH
# define PREFETCH(addr) prefetcht0 addr
#endif
@@ -253,9 +242,6 @@ L(movsb):
leaq (%rsi,%rdx), %r9
cmpq %r9, %rdi
/* Avoid slow backward REP MOVSB. */
-# if REP_MOVSB_THRESHOLD <= (VEC_SIZE * 8)
-# error Unsupported REP_MOVSB_THRESHOLD and VEC_SIZE!
-# endif
jb L(more_8x_vec_backward)
1:
mov %RDX_LP, %RCX_LP
@@ -331,7 +317,7 @@ L(between_2_3):
#if defined USE_MULTIARCH && IS_IN (libc)
L(movsb_more_2x_vec):
- cmpq $REP_MOVSB_THRESHOLD, %rdx
+ cmp __x86_rep_movsb_threshold(%rip), %RDX_LP
ja L(movsb)
#endif
L(more_2x_vec):
diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
index af2299709c..2bfc95de05 100644
--- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
@@ -58,16 +58,6 @@
# endif
#endif
-/* Threshold to use Enhanced REP STOSB. Since there is overhead to set
- up REP STOSB operation, REP STOSB isn't faster on short data. The
- memset micro benchmark in glibc shows that 2KB is the approximate
- value above which REP STOSB becomes faster on processors with
- Enhanced REP STOSB. Since the stored value is fixed, larger register
- size has minimal impact on threshold. */
-#ifndef REP_STOSB_THRESHOLD
-# define REP_STOSB_THRESHOLD 2048
-#endif
-
#ifndef SECTION
# error SECTION is not defined!
#endif
@@ -181,7 +171,7 @@ ENTRY (MEMSET_SYMBOL (__memset, unaligned_erms))
ret
L(stosb_more_2x_vec):
- cmpq $REP_STOSB_THRESHOLD, %rdx
+ cmp __x86_rep_stosb_threshold(%rip), %RDX_LP
ja L(stosb)
#endif
L(more_2x_vec):
--
2.26.2
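
The threshold selection added to init_cacheinfo can be modeled outside
glibc as the sketch below. The function names and the vec_size
parameter are hypothetical; in the patch, the vector size is derived
from CPU feature bits (64 for usable AVX512F without Prefer_No_AVX512,
32 for AVX_Fast_Unaligned_Load, 16 otherwise) rather than passed in.

```c
#include <stddef.h>

/* Hypothetical standalone model of the "rep movsb" selection in the
   patch.  USER_THRESHOLD is the tunable value (0 when unset) and
   VEC_SIZE is 16, 32 or 64.  */
unsigned long
select_rep_movsb_threshold (unsigned long user_threshold,
			    unsigned int vec_size)
{
  /* Default is 2048 * (VEC_SIZE / 16), as in the patch.  */
  unsigned long default_threshold = 2048UL * (vec_size / 16);
  /* The threshold must be greater than VEC_SIZE * 8.  */
  unsigned long minimum_threshold = vec_size * 8UL;

  /* A user value that is not strictly above the minimum is ignored
     and the default is used instead.  */
  return (user_threshold > minimum_threshold
	  ? user_threshold : default_threshold);
}

/* Hypothetical model of the "rep stosb" selection: any nonzero user
   value replaces the 2048-byte default.  */
unsigned long
select_rep_stosb_threshold (unsigned long user_threshold)
{
  return user_threshold != 0 ? user_threshold : 2048UL;
}
```

Under this model, a "rep movsb" request of 256 bytes with 32-byte
vectors falls back to the 4096-byte default because it is not greater
than 32 * 8, while a request of 257 bytes is accepted. The "rep stosb"
tunable has no such minimum in this version of the patch.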
Thread overview: 32+ messages
2020-03-16 7:30 pthread_cond performence Discussion liqingqing
2020-03-18 12:12 ` Carlos O'Donell via Libc-alpha
2020-03-18 12:53 ` Torvald Riegel via Libc-alpha
2020-03-18 14:42 ` Carlos O'Donell via Libc-alpha
2020-05-23 4:04 ` liqingqing
2020-05-23 4:10 ` [PATCH]x86: update REP_STOSB_THRESHOLD's default value from 2k to 1M liqingqing
2020-05-23 4:37 ` H.J. Lu via Libc-alpha [this message]
2020-05-28 11:56 ` [PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables H.J. Lu via Libc-alpha
2020-05-28 13:47 ` liqingqing
2020-05-29 13:13 ` Carlos O'Donell via Libc-alpha
2020-05-29 13:21 ` H.J. Lu via Libc-alpha
2020-05-29 16:18 ` Carlos O'Donell via Libc-alpha
2020-06-01 19:32 ` H.J. Lu via Libc-alpha
2020-06-01 19:38 ` Carlos O'Donell via Libc-alpha
2020-06-01 20:15 ` H.J. Lu via Libc-alpha
2020-06-01 20:19 ` H.J. Lu via Libc-alpha
2020-06-01 20:48 ` Florian Weimer
2020-06-01 20:56 ` Carlos O'Donell via Libc-alpha
2020-06-01 21:13 ` H.J. Lu via Libc-alpha
2020-06-01 22:43 ` H.J. Lu via Libc-alpha
2020-06-02 2:08 ` Carlos O'Donell via Libc-alpha
2020-06-04 21:00 ` [PATCH] libc.so: Add --list-tunables H.J. Lu via Libc-alpha
2020-06-05 22:45 ` V2 " H.J. Lu via Libc-alpha
2020-06-06 21:51 ` V3 [PATCH] libc.so: Add --list-tunables support to __libc_main H.J. Lu via Libc-alpha
2020-07-02 18:00 ` Carlos O'Donell via Libc-alpha
2020-07-02 19:08 ` [PATCH] Update tunable min/max values H.J. Lu via Libc-alpha
2020-07-03 16:14 ` Carlos O'Donell via Libc-alpha
2020-07-03 16:54 ` [PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables H.J. Lu via Libc-alpha
2020-07-03 17:43 ` Carlos O'Donell via Libc-alpha
2020-07-03 17:53 ` H.J. Lu via Libc-alpha
2020-12-21 4:38 ` [PATCH]x86: update REP_STOSB_THRESHOLD's default value from 2k to 1M Siddhesh Poyarekar
2020-12-22 1:02 ` Qingqing Li