[PATCH] aarch64: revert memcpy optimze for kunpeng to avoid performance degradation

unofficial mirror of libc-alpha@sourceware.org
 help / color / mirror / Atom feed

* [PATCH] aarch64: revert memcpy optimze for kunpeng to avoid performance degradation
@ 2021-01-20  7:20 Shuo Wang
  2021-01-20 13:09 ` Adhemerval Zanella via Libc-alpha
  0 siblings, 1 reply; 9+ messages in thread
From: Shuo Wang @ 2021-01-20  7:20 UTC (permalink / raw)
  To: adhemerval.zanella, zhangxuelei4, libc-alpha; +Cc: hushiyuan

In commit 863d775c481704baaa41855fc93e5a1ca2dc6bf6, kunpeng920 is added to default memcpy version,
however, there is performance degradation when the copy size is some large bytes, eg: 100k. 
This is the result, tested in glibc-2.28:
             before backport  after backport	 Performance improvement
memcpy_1k      0.005              0.005                 0.00%
memcpy_10k     0.032              0.029                 10.34%
memcpy_100k    0.356              0.429                 -17.02%
memcpy_1m      7.470              11.153                -33.02%

This is the demo
#include "stdio.h"
#include "string.h"
#include "stdlib.h"

char a[1024*1024] = {12};
char b[1024*1024] = {13};
int main(int argc, char *argv[])
{
    int i = atoi(argv[1]);
    int j;
    int size = atoi(argv[2]);
    
    for (j = 0; j < i; j++)
        memcpy(b, a, size*1024);
    return 0;
}

# gcc -g -O0 memcpy.c -o memcpy
# time taskset -c 10 ./memcpy 100000 1024

Co-authored-by: liqingqing <liqingqing3@huawei.com>

---
 sysdeps/aarch64/multiarch/memcpy.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sysdeps/aarch64/multiarch/memcpy.c b/sysdeps/aarch64/multiarch/memcpy.c
index 27259d3386..0e0a5cbcfb 100644
--- a/sysdeps/aarch64/multiarch/memcpy.c
+++ b/sysdeps/aarch64/multiarch/memcpy.c
@@ -37,7 +37,7 @@ extern __typeof (__redirect_memcpy) __memcpy_falkor attribute_hidden;
 libc_ifunc (__libc_memcpy,
             (IS_THUNDERX (midr)
 	     ? __memcpy_thunderx
-	     : (IS_FALKOR (midr) || IS_PHECDA (midr) || IS_KUNPENG920 (midr)
+	     : (IS_FALKOR (midr) || IS_PHECDA (midr)
 		? __memcpy_falkor
 		: (IS_THUNDERX2 (midr) || IS_THUNDERX2PA (midr)
 		  ? __memcpy_thunderx2
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH] aarch64: revert memcpy optimze for kunpeng to avoid performance degradation
@ 2021-01-20  9:34 Shuo Wang
  0 siblings, 0 replies; 9+ messages in thread
From: Shuo Wang @ 2021-01-20  9:34 UTC (permalink / raw)
  To: adhemerval.zanella, zhangxuelei4, libc-alpha; +Cc: hushiyuan

In commit 863d775c481704baaa41855fc93e5a1ca2dc6bf6, kunpeng920 is added to default memcpy version,
however, there is performance degradation when the copy size is some large bytes, eg: 100k. 
This is the result, tested in glibc-2.28:
             before backport  after backport	 Performance improvement
memcpy_1k         2                 2                     0.00%
memcpy_10k        26                26                    0.00%
memcpy_100k       343               423                   -18.91%
memcpy_1m         9563              11058                 -13.52%

This is the demo
#include "stdio.h"
#include "string.h"
#include "stdlib.h"

char a[1024*1024] = {12};
char b[1024*1024] = {13};
int main(int argc, char *argv[])
{
    int i = atoi(argv[1]);
    int j;
    int size = atoi(argv[2]);
    long long begin, end;

    asm volatile("mrs %0, cntvct_el0" : "=r" (begin));
    for (j = 0; j < i; j++)
        memcpy(b, a, size);
    asm volatile("mrs %0, cntvct_el0" : "=r" (end));
    printf("%llu\n", (end - begin) / i); 
    return 0;
}

# gcc -g -O0 memcpy.c -o memcpy
# taskset -c 10 ./memcpy 100000 1024

Co-authored-by: liqingqing <liqingqing3@huawei.com>

---
 sysdeps/aarch64/multiarch/memcpy.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/sysdeps/aarch64/multiarch/memcpy.c b/sysdeps/aarch64/multiarch/memcpy.c
index 27259d3386..0e0a5cbcfb 100644
--- a/sysdeps/aarch64/multiarch/memcpy.c
+++ b/sysdeps/aarch64/multiarch/memcpy.c
@@ -37,7 +37,7 @@ extern __typeof (__redirect_memcpy) __memcpy_falkor attribute_hidden;
 libc_ifunc (__libc_memcpy,
             (IS_THUNDERX (midr)
 	     ? __memcpy_thunderx
-	     : (IS_FALKOR (midr) || IS_PHECDA (midr) || IS_KUNPENG920 (midr)
+	     : (IS_FALKOR (midr) || IS_PHECDA (midr)
 		? __memcpy_falkor
 		: (IS_THUNDERX2 (midr) || IS_THUNDERX2PA (midr)
 		  ? __memcpy_thunderx2
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH] aarch64: revert memcpy optimze for kunpeng to avoid performance degradation
@ 2021-01-20  9:35 Shuo Wang
  0 siblings, 0 replies; 9+ messages in thread
From: Shuo Wang @ 2021-01-20  9:35 UTC (permalink / raw)
  To: adhemerval.zanella, zhangxuelei4, libc-alpha; +Cc: hushiyuan

Update test results and the following mail is the latest test results:
https://sourceware.org/pipermail/libc-alpha/2021-January/121824.html

In fact, commit 863d775c48 did not improve performance in memcpy_10k.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] aarch64: revert memcpy optimze for kunpeng to avoid performance degradation
  2021-01-20  7:20 Shuo Wang
@ 2021-01-20 13:09 ` Adhemerval Zanella via Libc-alpha
  0 siblings, 0 replies; 9+ messages in thread
From: Adhemerval Zanella via Libc-alpha @ 2021-01-20 13:09 UTC (permalink / raw)
  To: Shuo Wang, zhangxuelei4, libc-alpha; +Cc: hushiyuan

Hi,

Since I don't have access to this specific hardware, it would be good
if the original author, Xuelei Zhang, of the change could certify this
reversion is ok.

It should be ok during the freeze since it just a selection of an
already tested implementation for an specific chip implementation.

On 20/01/2021 04:20, Shuo Wang wrote:
> In commit 863d775c481704baaa41855fc93e5a1ca2dc6bf6, kunpeng920 is added to default memcpy version,
> however, there is performance degradation when the copy size is some large bytes, eg: 100k. 
> This is the result, tested in glibc-2.28:
>              before backport  after backport	 Performance improvement
> memcpy_1k      0.005              0.005                 0.00%
> memcpy_10k     0.032              0.029                 10.34%
> memcpy_100k    0.356              0.429                 -17.02%
> memcpy_1m      7.470              11.153                -33.02%
> 
> This is the demo
> #include "stdio.h"
> #include "string.h"
> #include "stdlib.h"
> 
> char a[1024*1024] = {12};
> char b[1024*1024] = {13};
> int main(int argc, char *argv[])
> {
>     int i = atoi(argv[1]);
>     int j;
>     int size = atoi(argv[2]);
>     
>     for (j = 0; j < i; j++)
>         memcpy(b, a, size*1024);
>     return 0;
> }
> 
> # gcc -g -O0 memcpy.c -o memcpy
> # time taskset -c 10 ./memcpy 100000 1024
> 
> Co-authored-by: liqingqing <liqingqing3@huawei.com>
> 
> ---
>  sysdeps/aarch64/multiarch/memcpy.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/sysdeps/aarch64/multiarch/memcpy.c b/sysdeps/aarch64/multiarch/memcpy.c
> index 27259d3386..0e0a5cbcfb 100644
> --- a/sysdeps/aarch64/multiarch/memcpy.c
> +++ b/sysdeps/aarch64/multiarch/memcpy.c
> @@ -37,7 +37,7 @@ extern __typeof (__redirect_memcpy) __memcpy_falkor attribute_hidden;
>  libc_ifunc (__libc_memcpy,
>              (IS_THUNDERX (midr)
>  	     ? __memcpy_thunderx
> -	     : (IS_FALKOR (midr) || IS_PHECDA (midr) || IS_KUNPENG920 (midr)
> +	     : (IS_FALKOR (midr) || IS_PHECDA (midr)
>  		? __memcpy_falkor
>  		: (IS_THUNDERX2 (midr) || IS_THUNDERX2PA (midr)
>  		  ? __memcpy_thunderx2
> 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH] aarch64: revert memcpy optimze for kunpeng to avoid performance degradation
@ 2021-01-20 14:49 Wilco Dijkstra via Libc-alpha
  0 siblings, 0 replies; 9+ messages in thread
From: Wilco Dijkstra via Libc-alpha @ 2021-01-20 14:49 UTC (permalink / raw)
  To: 'GNU C Library'; +Cc: wangshuo (AF)

Hi Shuo,

Have you tried running the GLIBC memcpy benchmarks? These should give a
decent comparison between all the memcpy variants. It might be that the new
__memcy_simd is better than the generic variant (these are backported to 2.28).

Note that 99% of memcpies are less than 1000 bytes, so it's best to focus on small
copy performance.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] aarch64: revert memcpy optimze for kunpeng to avoid performance degradation
@ 2021-01-21  1:55 Zhangxuelei (Derek)
  2021-01-21 16:41 ` Adhemerval Zanella via Libc-alpha
  0 siblings, 1 reply; 9+ messages in thread
From: Zhangxuelei (Derek) @ 2021-01-21  1:55 UTC (permalink / raw)
  To: Adhemerval Zanella, wangshuo (AF), libc-alpha@sourceware.org; +Cc: Hushiyuan

Hi,

They are my colleagues and we have certified this results together. It would be better to revert the original selection according to the negative performance of a specific product. And we will still study for a better or more balanced version of memcpy on Kunpeng.

Thank you~

-----邮件原件-----
发件人: Adhemerval Zanella [mailto:adhemerval.zanella@linaro.org] 
发送时间: 2021年1月20日 21:09
收件人: wangshuo (AF) <wangshuo47@huawei.com>; Zhangxuelei (Derek) <zhangxuelei4@huawei.com>; libc-alpha@sourceware.org
抄送: Hushiyuan <hushiyuan@huawei.com>; liqingqing (C) <liqingqing3@huawei.com>
主题: Re: [PATCH] aarch64: revert memcpy optimze for kunpeng to avoid performance degradation

Hi,

Since I don't have access to this specific hardware, it would be good if the original author, Xuelei Zhang, of the change could certify this reversion is ok.

It should be ok during the freeze since it just a selection of an already tested implementation for an specific chip implementation.

On 20/01/2021 04:20, Shuo Wang wrote:
> In commit 863d775c481704baaa41855fc93e5a1ca2dc6bf6, kunpeng920 is 
> added to default memcpy version, however, there is performance degradation when the copy size is some large bytes, eg: 100k.
> This is the result, tested in glibc-2.28:
>              before backport  after backport	 Performance improvement
> memcpy_1k      0.005              0.005                 0.00%
> memcpy_10k     0.032              0.029                 10.34%
> memcpy_100k    0.356              0.429                 -17.02%
> memcpy_1m      7.470              11.153                -33.02%
> 
> This is the demo
> #include "stdio.h"
> #include "string.h"
> #include "stdlib.h"
> 
> char a[1024*1024] = {12};
> char b[1024*1024] = {13};
> int main(int argc, char *argv[])
> {
>     int i = atoi(argv[1]);
>     int j;
>     int size = atoi(argv[2]);
>     
>     for (j = 0; j < i; j++)
>         memcpy(b, a, size*1024);
>     return 0;
> }
> 
> # gcc -g -O0 memcpy.c -o memcpy
> # time taskset -c 10 ./memcpy 100000 1024
> 
> Co-authored-by: liqingqing <liqingqing3@huawei.com>
> 
> ---
>  sysdeps/aarch64/multiarch/memcpy.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/sysdeps/aarch64/multiarch/memcpy.c 
> b/sysdeps/aarch64/multiarch/memcpy.c
> index 27259d3386..0e0a5cbcfb 100644
> --- a/sysdeps/aarch64/multiarch/memcpy.c
> +++ b/sysdeps/aarch64/multiarch/memcpy.c
> @@ -37,7 +37,7 @@ extern __typeof (__redirect_memcpy) __memcpy_falkor 
> attribute_hidden;  libc_ifunc (__libc_memcpy,
>              (IS_THUNDERX (midr)
>  	     ? __memcpy_thunderx
> -	     : (IS_FALKOR (midr) || IS_PHECDA (midr) || IS_KUNPENG920 (midr)
> +	     : (IS_FALKOR (midr) || IS_PHECDA (midr)
>  		? __memcpy_falkor
>  		: (IS_THUNDERX2 (midr) || IS_THUNDERX2PA (midr)
>  		  ? __memcpy_thunderx2
> 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] aarch64: revert memcpy optimze for kunpeng to avoid performance degradation
  2021-01-21  1:55 Zhangxuelei (Derek)
@ 2021-01-21 16:41 ` Adhemerval Zanella via Libc-alpha
  2021-01-21 16:45   ` Szabolcs Nagy via Libc-alpha
  0 siblings, 1 reply; 9+ messages in thread
From: Adhemerval Zanella via Libc-alpha @ 2021-01-21 16:41 UTC (permalink / raw)
  To: Zhangxuelei (Derek), wangshuo (AF), libc-alpha@sourceware.org; +Cc: Hushiyuan

On 20/01/2021 22:55, Zhangxuelei (Derek) wrote:
> Hi,
> 
> They are my colleagues and we have certified this results together. It would be better to revert the original selection according to the negative performance of a specific product. And we will still study for a better or more balanced version of memcpy on Kunpeng.
> 
> Thank you~

This is ok for 2.33, please commit.

> 
> -----邮件原件-----
> 发件人: Adhemerval Zanella [mailto:adhemerval.zanella@linaro.org] 
> 发送时间: 2021年1月20日 21:09
> 收件人: wangshuo (AF) <wangshuo47@huawei.com>; Zhangxuelei (Derek) <zhangxuelei4@huawei.com>; libc-alpha@sourceware.org
> 抄送: Hushiyuan <hushiyuan@huawei.com>; liqingqing (C) <liqingqing3@huawei.com>
> 主题: Re: [PATCH] aarch64: revert memcpy optimze for kunpeng to avoid performance degradation
> 
> Hi,
> 
> Since I don't have access to this specific hardware, it would be good if the original author, Xuelei Zhang, of the change could certify this reversion is ok.
> 
> It should be ok during the freeze since it just a selection of an already tested implementation for an specific chip implementation.
> 
> On 20/01/2021 04:20, Shuo Wang wrote:
>> In commit 863d775c481704baaa41855fc93e5a1ca2dc6bf6, kunpeng920 is 
>> added to default memcpy version, however, there is performance degradation when the copy size is some large bytes, eg: 100k.
>> This is the result, tested in glibc-2.28:
>>              before backport  after backport	 Performance improvement
>> memcpy_1k      0.005              0.005                 0.00%
>> memcpy_10k     0.032              0.029                 10.34%
>> memcpy_100k    0.356              0.429                 -17.02%
>> memcpy_1m      7.470              11.153                -33.02%
>>
>> This is the demo
>> #include "stdio.h"
>> #include "string.h"
>> #include "stdlib.h"
>>
>> char a[1024*1024] = {12};
>> char b[1024*1024] = {13};
>> int main(int argc, char *argv[])
>> {
>>     int i = atoi(argv[1]);
>>     int j;
>>     int size = atoi(argv[2]);
>>     
>>     for (j = 0; j < i; j++)
>>         memcpy(b, a, size*1024);
>>     return 0;
>> }
>>
>> # gcc -g -O0 memcpy.c -o memcpy
>> # time taskset -c 10 ./memcpy 100000 1024
>>
>> Co-authored-by: liqingqing <liqingqing3@huawei.com>
>>
>> ---
>>  sysdeps/aarch64/multiarch/memcpy.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/sysdeps/aarch64/multiarch/memcpy.c 
>> b/sysdeps/aarch64/multiarch/memcpy.c
>> index 27259d3386..0e0a5cbcfb 100644
>> --- a/sysdeps/aarch64/multiarch/memcpy.c
>> +++ b/sysdeps/aarch64/multiarch/memcpy.c
>> @@ -37,7 +37,7 @@ extern __typeof (__redirect_memcpy) __memcpy_falkor 
>> attribute_hidden;  libc_ifunc (__libc_memcpy,
>>              (IS_THUNDERX (midr)
>>  	     ? __memcpy_thunderx
>> -	     : (IS_FALKOR (midr) || IS_PHECDA (midr) || IS_KUNPENG920 (midr)
>> +	     : (IS_FALKOR (midr) || IS_PHECDA (midr)
>>  		? __memcpy_falkor
>>  		: (IS_THUNDERX2 (midr) || IS_THUNDERX2PA (midr)
>>  		  ? __memcpy_thunderx2
>>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] aarch64: revert memcpy optimze for kunpeng to avoid performance degradation
  2021-01-21 16:41 ` Adhemerval Zanella via Libc-alpha
@ 2021-01-21 16:45   ` Szabolcs Nagy via Libc-alpha
  0 siblings, 0 replies; 9+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-01-21 16:45 UTC (permalink / raw)
  To: Adhemerval Zanella; +Cc: Hushiyuan, libc-alpha@sourceware.org, wangshuo (AF)

The 01/21/2021 13:41, Adhemerval Zanella via Libc-alpha wrote:
> On 20/01/2021 22:55, Zhangxuelei (Derek) wrote:
> > Hi,
> > 
> > They are my colleagues and we have certified this results together. It would be better to revert the original selection according to the negative performance of a specific product. And we will still study for a better or more balanced version of memcpy on Kunpeng.
> > 
> > Thank you~
> 
> This is ok for 2.33, please commit.

thanks, i committed the patch.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH] aarch64: revert memcpy optimze for kunpeng to avoid performance degradation
@ 2021-01-22  1:26 Shuo Wang
  0 siblings, 0 replies; 9+ messages in thread
From: Shuo Wang @ 2021-01-22  1:26 UTC (permalink / raw)
  To: Wilco.Dijkstra, libc-alpha; +Cc: hushiyuan


	Dear Wilco,

Thanks for your reply. I will test the performance of __memcy_simd and generic on the 
latest glibc in the near future.

Best Wishes,
Shuo

>Hi Shuo,
>
>Have you tried running the GLIBC memcpy benchmarks? These should give a
>decent comparison between all the memcpy variants. It might be that the new
>__memcy_simd is better than the generic variant (these are backported to 2.28).
>
>Note that 99% of memcpies are less than 1000 bytes, so it's best to focus on small
>copy performance.
>
>Cheers,
>Wilco

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2021-01-22  1:26 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-22  1:26 [PATCH] aarch64: revert memcpy optimze for kunpeng to avoid performance degradation Shuo Wang
  -- strict thread matches above, loose matches on Subject: below --
2021-01-21  1:55 Zhangxuelei (Derek)
2021-01-21 16:41 ` Adhemerval Zanella via Libc-alpha
2021-01-21 16:45   ` Szabolcs Nagy via Libc-alpha
2021-01-20 14:49 Wilco Dijkstra via Libc-alpha
2021-01-20  9:35 Shuo Wang
2021-01-20  9:34 Shuo Wang
2021-01-20  7:20 Shuo Wang
2021-01-20 13:09 ` Adhemerval Zanella via Libc-alpha

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).