unofficial mirror of libc-alpha@sourceware.org
 help / color / mirror / Atom feed
From: "H.J. Lu via Libc-alpha" <libc-alpha@sourceware.org>
To: liqingqing <liqingqing3@huawei.com>
Cc: Hushiyuan <hushiyuan@huawei.com>,
	"libc-alpha@sourceware.org" <libc-alpha@sourceware.org>
Subject: [PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables
Date: Fri, 22 May 2020 21:37:39 -0700	[thread overview]
Message-ID: <CAMe9rOoYXMdOfedtZLx=GT-nFXThzoo7Q__H4vg=2vyOufGY6A@mail.gmail.com> (raw)
In-Reply-To: <e6de570b-48bf-88cf-2cec-5f5a5e7821bf@huawei.com>

[-- Attachment #1: Type: text/plain, Size: 3008 bytes --]

On Fri, May 22, 2020 at 9:10 PM liqingqing <liqingqing3@huawei.com> wrote:
>
> this commitid 830566307f038387ca0af3fd327706a8d1a2f595 optimize implementation of function memset,
> and set macro REP_STOSB_THRESHOLD's default value to 2KB, when the input value is less than 2KB, the data flow is the same, and when the input value is large than 2KB,
> this api will use STOB to instead of  MOVQ
>
> but when I test this API on x86_64 platform
> and found that this default value is not appropriate for some input length. here it's the enviornment and result
>
> test suite: libMicro-0.4.0
>         ./memset -E -C 200 -L -S -W -N "memset_4k"    -s 4k    -I 250
>         ./memset -E -C 200 -L -S -W -N "memset_4k_uc" -s 4k    -u -I 400
>         ./memset -E -C 200 -L -S -W -N "memset_1m"    -s 1m   -I 200000
>         ./memset -E -C 200 -L -S -W -N "memset_10m"   -s 10m -I 2000000
>
> hardware platform:
>         Intel(R) Xeon(R) Gold 6266C CPU @ 3.00GHz
>         L1d cache:32KB
>         L1i cache: 32KB
>         L2 cache: 1MB
>         L3 cache: 60MB
>
> the result is that when input length is between the processor's L1 data cache and L2 cache size, the REP_STOSB_THRESHOLD=2KB will reduce performance.
>
>         before this commit     after this commit
>                 cycle      cycle
> memset_4k       249         96
> memset_10k      657         185
> memset_36k      2773        3767
> memset_100k     7594        10002
> memset_500k     37678       52149
> memset_1m       86780       108044
> memset_10m      1307238     1148994
>
>         before this commit          after this commit
>            MLC cache miss(10sec)         MLC cache miss(10sec)
> memset_4k       1,09,33,823          1,01,79,270
> memset_10k      1,23,78,958          1,05,41,087
> memset_36k      3,61,64,244          4,07,22,429
> memset_100k     8,25,33,052          9,31,81,253
> memset_500k     37,32,55,449         43,56,70,395
> memset_1m       75,16,28,239         88,29,90,237
> memset_10m      9,36,61,67,397       8,96,69,49,522
>
>
> though REP_STOSB_THRESHOLD can be modified at the building time by use -DREP_STOSB_THRESHOLD=xxx,
> but I think the default value may be is not a better one, cause I think most of the processor's L2 cache is large than 2KB, so i submit a patch as below:
>
>
>
> From 44314a556239a7524b5a6451025737c1bdbb1cd0 Mon Sep 17 00:00:00 2001
> From: liqingqing <liqingqing3@huawei.com>
> Date: Thu, 21 May 2020 11:23:06 +0800
> Subject: [PATCH] update REP_STOSB_THRESHOLD's default value from 2k to 1M
> macro REP_STOSB_THRESHOLD's value will reduce memset performace when input length is between processor's L1 data cache and L2 cache.
> so update the defaule value to eliminate the decrement .
>

There is no single threshold value which is good for all workloads.
I don't think we should change REP_STOSB_THRESHOLD to 1MB.
On the other hand, the fixed threshold isn't flexible.  Please try this
patch to see if you can set the threshold for your specific workload.

-- 
H.J.

[-- Attachment #2: 0001-x86-Add-thresholds-for-rep-movsb-stosb-to-tunables.patch --]
[-- Type: text/x-patch, Size: 8962 bytes --]

From 7d2e0c0b843d509716d92960b9b139b32eacea54 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" <hjl.tools@gmail.com>
Date: Sat, 9 May 2020 11:13:57 -0700
Subject: [PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables

Add x86_rep_movsb_threshold and x86_rep_stosb_threshold to tunables
to update thresholds for "rep movsb" and "rep stosb" at run-time.

Note that the user specified threshold for "rep movsb" smaller than
the minimum threshold will be ignored.
---
 manual/tunables.texi                          | 16 +++++++
 sysdeps/x86/cacheinfo.c                       | 46 +++++++++++++++++++
 sysdeps/x86/cpu-features.c                    |  4 ++
 sysdeps/x86/cpu-features.h                    |  4 ++
 sysdeps/x86/dl-tunables.list                  |  6 +++
 .../multiarch/memmove-vec-unaligned-erms.S    | 16 +------
 .../multiarch/memset-vec-unaligned-erms.S     | 12 +----
 7 files changed, 78 insertions(+), 26 deletions(-)

diff --git a/manual/tunables.texi b/manual/tunables.texi
index ec18b10834..8054f79be0 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -396,6 +396,22 @@ to set threshold in bytes for non temporal store.
 This tunable is specific to i386 and x86-64.
 @end deftp
 
+@deftp Tunable glibc.cpu.x86_rep_movsb_threshold
+The @code{glibc.cpu.x86_rep_movsb_threshold} tunable allows the user
+to set threshold in bytes to start using "rep movsb".  Note that the
+user specified threshold smaller than the minimum threshold will be
+ignored.
+
+This tunable is specific to i386 and x86-64.
+@end deftp
+
+@deftp Tunable glibc.cpu.x86_rep_stosb_threshold
+The @code{glibc.cpu.x86_rep_stosb_threshold} tunable allows the user
+to set threshold in bytes to start using "rep stosb".
+
+This tunable is specific to i386 and x86-64.
+@end deftp
+
 @deftp Tunable glibc.cpu.x86_ibt
 The @code{glibc.cpu.x86_ibt} tunable allows the user to control how
 indirect branch tracking (IBT) should be enabled.  Accepted values are
diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
index 311502dee3..4322328a1b 100644
--- a/sysdeps/x86/cacheinfo.c
+++ b/sysdeps/x86/cacheinfo.c
@@ -530,6 +530,23 @@ long int __x86_raw_shared_cache_size attribute_hidden = 1024 * 1024;
 /* Threshold to use non temporal store.  */
 long int __x86_shared_non_temporal_threshold attribute_hidden;
 
+/* Threshold to use Enhanced REP MOVSB.  Since there is overhead to set
+   up REP MOVSB operation, REP MOVSB isn't faster on short data.  The
+   memcpy micro benchmark in glibc shows that 2KB is the approximate
+   value above which REP MOVSB becomes faster than SSE2 optimization
+   on processors with Enhanced REP MOVSB.  Since larger register size
+   can move more data with a single load and store, the threshold is
+   higher with larger register size.  */
+long int __x86_rep_movsb_threshold attribute_hidden = 2048;
+
+/* Threshold to use Enhanced REP STOSB.  Since there is overhead to set
+   up REP STOSB operation, REP STOSB isn't faster on short data.  The
+   memset micro benchmark in glibc shows that 2KB is the approximate
+   value above which REP STOSB becomes faster on processors with
+   Enhanced REP STOSB.  Since the stored value is fixed, larger register
+   size has minimal impact on threshold.  */
+long int __x86_rep_stosb_threshold attribute_hidden = 2048;
+
 #ifndef DISABLE_PREFETCHW
 /* PREFETCHW support flag for use in memory and string routines.  */
 int __x86_prefetchw attribute_hidden;
@@ -872,6 +889,35 @@ init_cacheinfo (void)
     = (cpu_features->non_temporal_threshold != 0
        ? cpu_features->non_temporal_threshold
        : __x86_shared_cache_size * threads * 3 / 4);
+
+  /* NB: The REP MOVSB threshold must be greater than VEC_SIZE * 8.  */
+  unsigned int minimum_rep_movsb_threshold;
+  /* NB: The default REP MOVSB threshold is 2048 * (VEC_SIZE / 16).  */
+  unsigned int rep_movsb_threshold;
+  if (CPU_FEATURES_ARCH_P (cpu_features, AVX512F_Usable)
+      && !CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_AVX512))
+    {
+      rep_movsb_threshold = 2048 * (64 / 16);
+      minimum_rep_movsb_threshold = 64 * 8;
+    }
+  else if (CPU_FEATURES_ARCH_P (cpu_features,
+				AVX_Fast_Unaligned_Load))
+    {
+      rep_movsb_threshold = 2048 * (32 / 16);
+      minimum_rep_movsb_threshold = 32 * 8;
+    }
+  else
+    {
+      rep_movsb_threshold = 2048 * (16 / 16);
+      minimum_rep_movsb_threshold = 16 * 8;
+    }
+  if (cpu_features->rep_movsb_threshold > minimum_rep_movsb_threshold)
+    __x86_rep_movsb_threshold = cpu_features->rep_movsb_threshold;
+  else
+    __x86_rep_movsb_threshold = rep_movsb_threshold;
+
+  if (cpu_features->rep_stosb_threshold)
+    __x86_rep_stosb_threshold = cpu_features->rep_stosb_threshold;
 }
 
 #endif
diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c
index 916bbf5242..14f847320f 100644
--- a/sysdeps/x86/cpu-features.c
+++ b/sysdeps/x86/cpu-features.c
@@ -564,6 +564,10 @@ no_cpuid:
   TUNABLE_GET (hwcaps, tunable_val_t *, TUNABLE_CALLBACK (set_hwcaps));
   cpu_features->non_temporal_threshold
     = TUNABLE_GET (x86_non_temporal_threshold, long int, NULL);
+  cpu_features->rep_movsb_threshold
+    = TUNABLE_GET (x86_rep_movsb_threshold, long int, NULL);
+  cpu_features->rep_stosb_threshold
+    = TUNABLE_GET (x86_rep_stosb_threshold, long int, NULL);
   cpu_features->data_cache_size
     = TUNABLE_GET (x86_data_cache_size, long int, NULL);
   cpu_features->shared_cache_size
diff --git a/sysdeps/x86/cpu-features.h b/sysdeps/x86/cpu-features.h
index f05d5ce158..7410324e83 100644
--- a/sysdeps/x86/cpu-features.h
+++ b/sysdeps/x86/cpu-features.h
@@ -91,6 +91,10 @@ struct cpu_features
   unsigned long int shared_cache_size;
   /* Threshold to use non temporal store.  */
   unsigned long int non_temporal_threshold;
+  /* Threshold to use "rep movsb".  */
+  unsigned long int rep_movsb_threshold;
+  /* Threshold to use "rep stosb".  */
+  unsigned long int rep_stosb_threshold;
 };
 
 /* Used from outside of glibc to get access to the CPU features
diff --git a/sysdeps/x86/dl-tunables.list b/sysdeps/x86/dl-tunables.list
index 251b926ce4..43bf6c2389 100644
--- a/sysdeps/x86/dl-tunables.list
+++ b/sysdeps/x86/dl-tunables.list
@@ -30,6 +30,12 @@ glibc {
     x86_non_temporal_threshold {
       type: SIZE_T
     }
+    x86_rep_movsb_threshold {
+      type: SIZE_T
+    }
+    x86_rep_stosb_threshold {
+      type: SIZE_T
+    }
     x86_data_cache_size {
       type: SIZE_T
     }
diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
index 74953245aa..bd5dc1a3f3 100644
--- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S
@@ -56,17 +56,6 @@
 # endif
 #endif
 
-/* Threshold to use Enhanced REP MOVSB.  Since there is overhead to set
-   up REP MOVSB operation, REP MOVSB isn't faster on short data.  The
-   memcpy micro benchmark in glibc shows that 2KB is the approximate
-   value above which REP MOVSB becomes faster than SSE2 optimization
-   on processors with Enhanced REP MOVSB.  Since larger register size
-   can move more data with a single load and store, the threshold is
-   higher with larger register size.  */
-#ifndef REP_MOVSB_THRESHOLD
-# define REP_MOVSB_THRESHOLD	(2048 * (VEC_SIZE / 16))
-#endif
-
 #ifndef PREFETCH
 # define PREFETCH(addr) prefetcht0 addr
 #endif
@@ -253,9 +242,6 @@ L(movsb):
 	leaq	(%rsi,%rdx), %r9
 	cmpq	%r9, %rdi
 	/* Avoid slow backward REP MOVSB.  */
-# if REP_MOVSB_THRESHOLD <= (VEC_SIZE * 8)
-#  error Unsupported REP_MOVSB_THRESHOLD and VEC_SIZE!
-# endif
 	jb	L(more_8x_vec_backward)
 1:
 	mov	%RDX_LP, %RCX_LP
@@ -331,7 +317,7 @@ L(between_2_3):
 
 #if defined USE_MULTIARCH && IS_IN (libc)
 L(movsb_more_2x_vec):
-	cmpq	$REP_MOVSB_THRESHOLD, %rdx
+	cmp	__x86_rep_movsb_threshold(%rip), %RDX_LP
 	ja	L(movsb)
 #endif
 L(more_2x_vec):
diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
index af2299709c..2bfc95de05 100644
--- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
@@ -58,16 +58,6 @@
 # endif
 #endif
 
-/* Threshold to use Enhanced REP STOSB.  Since there is overhead to set
-   up REP STOSB operation, REP STOSB isn't faster on short data.  The
-   memset micro benchmark in glibc shows that 2KB is the approximate
-   value above which REP STOSB becomes faster on processors with
-   Enhanced REP STOSB.  Since the stored value is fixed, larger register
-   size has minimal impact on threshold.  */
-#ifndef REP_STOSB_THRESHOLD
-# define REP_STOSB_THRESHOLD		2048
-#endif
-
 #ifndef SECTION
 # error SECTION is not defined!
 #endif
@@ -181,7 +171,7 @@ ENTRY (MEMSET_SYMBOL (__memset, unaligned_erms))
 	ret
 
 L(stosb_more_2x_vec):
-	cmpq	$REP_STOSB_THRESHOLD, %rdx
+	cmp	__x86_rep_stosb_threshold(%rip), %RDX_LP
 	ja	L(stosb)
 #endif
 L(more_2x_vec):
-- 
2.26.2


  reply	other threads:[~2020-05-23  4:38 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-03-16  7:30 pthread_cond performence Discussion liqingqing
2020-03-18 12:12 ` Carlos O'Donell via Libc-alpha
2020-03-18 12:53   ` Torvald Riegel via Libc-alpha
2020-03-18 14:42     ` Carlos O'Donell via Libc-alpha
2020-05-23  4:04 ` liqingqing
2020-05-23  4:10   ` [PATCH]x86: update REP_STOSB_THRESHOLD's default value from 2k to 1M liqingqing
2020-05-23  4:37     ` H.J. Lu via Libc-alpha [this message]
2020-05-28 11:56       ` [PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables H.J. Lu via Libc-alpha
2020-05-28 13:47         ` liqingqing
2020-05-29 13:13       ` Carlos O'Donell via Libc-alpha
2020-05-29 13:21         ` H.J. Lu via Libc-alpha
2020-05-29 16:18           ` Carlos O'Donell via Libc-alpha
2020-06-01 19:32             ` H.J. Lu via Libc-alpha
2020-06-01 19:38               ` Carlos O'Donell via Libc-alpha
2020-06-01 20:15                 ` H.J. Lu via Libc-alpha
2020-06-01 20:19                   ` H.J. Lu via Libc-alpha
2020-06-01 20:48                     ` Florian Weimer
2020-06-01 20:56                       ` Carlos O'Donell via Libc-alpha
2020-06-01 21:13                         ` H.J. Lu via Libc-alpha
2020-06-01 22:43                           ` H.J. Lu via Libc-alpha
2020-06-02  2:08                             ` Carlos O'Donell via Libc-alpha
2020-06-04 21:00                               ` [PATCH] libc.so: Add --list-tunables H.J. Lu via Libc-alpha
2020-06-05 22:45                                 ` V2 " H.J. Lu via Libc-alpha
2020-06-06 21:51                                   ` V3 [PATCH] libc.so: Add --list-tunables support to __libc_main H.J. Lu via Libc-alpha
2020-07-02 18:00                                     ` Carlos O'Donell via Libc-alpha
2020-07-02 19:08                                       ` [PATCH] Update tunable min/max values H.J. Lu via Libc-alpha
2020-07-03 16:14                                         ` Carlos O'Donell via Libc-alpha
2020-07-03 16:54                                           ` [PATCH] x86: Add thresholds for "rep movsb/stosb" to tunables H.J. Lu via Libc-alpha
2020-07-03 17:43                                             ` Carlos O'Donell via Libc-alpha
2020-07-03 17:53                                               ` H.J. Lu via Libc-alpha
2020-12-21  4:38     ` [PATCH]x86: update REP_STOSB_THRESHOLD's default value from 2k to 1M Siddhesh Poyarekar
2020-12-22  1:02       ` Qingqing Li

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.gnu.org/software/libc/involved.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAMe9rOoYXMdOfedtZLx=GT-nFXThzoo7Q__H4vg=2vyOufGY6A@mail.gmail.com' \
    --to=libc-alpha@sourceware.org \
    --cc=hjl.tools@gmail.com \
    --cc=hushiyuan@huawei.com \
    --cc=liqingqing3@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).