x86-64: memcpy performance reduced when running in a virtual machine

unofficial mirror of libc-alpha@sourceware.org

* x86-64: memcpy performance reduced when running in a virtual machine
From: Shuo Wang @ 2021-01-11  8:38 UTC
  To: hjl.tools, libc-alpha; +Cc: hushiyuan

memcpy is much slower when running in a virtual machine than on the host.
These are the test results:
-----------------------
|       | host |  vm  |
|cycle: |  78  | 1503 |
-----------------------

According to perf, the host and the VM take the same code path:
[host]
  78.61%  libc-2.28.so     [.] __memmove_sse2_unaligned_erms
  12.85%  [kernel]         [k] nmi
   6.38%  hot_host_memcpy  [.] main
   
[virtual machine]
  98.64%  libc-2.28.so   [.] __memmove_sse2_unaligned_erms
   0.17%  hot_vm_memcpy  [.] main
   
This is our demo:
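#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

/* Read the x86 time-stamp counter. */
static __inline__ unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)lo) | (((unsigned long long)hi) << 32);
}

/* Usage: <program> <bytes per copy> <number of copies> */
int main(int argc, char **argv)
{
    int i, defs, lm_optb;

    if (argc == 3) {
        defs = atoi(argv[1]);
        lm_optb = atoi(argv[2]);
    } else {
        printf("error input!\n");
        return 1;
    }
    /* Reject values that would break the allocation or the division below. */
    if (defs <= 0 || lm_optb <= 0) {
        printf("error input!\n");
        return 1;
    }

    /* Page-aligned source and destination buffers. */
    char *src = (char *)valloc(defs);
    char *dest = (char *)valloc(defs);
    int opts = defs;

    if (src == NULL || dest == NULL) {
        printf("allocation failed!\n");
        return 1;
    }

    /* Touch both buffers before the timed loop starts. */
    memset(src, 1, defs);
    memset(dest, 1, defs);

    unsigned long long begin, end;
    begin = rdtsc();

    for (i = 0; i < lm_optb; i++) {
        (void) memcpy(dest, src, opts);
    }

    end = rdtsc();
    printf("all cycle = %llu, percall = %llu\n", end - begin, (end - begin) / lm_optb);

    return 0;
}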

This is the test log:
# taskset -c 2 ./host_memcpy 1024 1024000
all cycle = 80149652, percall = 78
# taskset -c 2 ./host_memcpy 1024 1024000
all cycle = 93075200, percall = 90

# taskset -c 2 ./vm_memcpy 1024 1024000
all cycle = 1539990968, percall = 1503
# taskset -c 2 ./vm_memcpy 1024 1024000
all cycle = 1541243316, percall = 1505

We build it with:
# gcc -g -O0 memcpy.c -o host_memcpy
# gcc -g -O0 memcpy.c -o vm_memcpy


The environment information is as follows:
[host]
- kernel version: 4.18.0
- glibc version: 2.28
- gcc version: 8.3.1
- qemu version: 2.12.0
- libvirtd version: 4.5.0

# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              60
On-line CPU(s) list: 0-59
Thread(s) per core:  2
Core(s) per socket:  15
Socket(s):           8
NUMA node(s):        8
Vendor ID:           GenuineIntel
CPU family:          6
Model:               62
Model name:          Intel(R) Xeon(R) CPU E7-8870 v2 @ 2.30GHz
Stepping:            7
CPU MHz:             2294.529
CPU max MHz:         2300.0000
CPU min MHz:         1200.0000
BogoMIPS:            4589.07
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            30720K
NUMA node0 CPU(s):   0-14,30-44
NUMA node1 CPU(s):   15-29,45-59
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm arat pln pts md_clear flush_l1d

[virtual machine]
- kernel version: 4.18.0
- glibc version: 2.28
- gcc version: 8.3.1
- qemu version: 2.12.0
- libvirtd version: 4.5.0

# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  1
Socket(s):           4
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               62
Model name:          Intel(R) Xeon(R) CPU E7-8870 v2 @ 2.30GHz
Stepping:            7
CPU MHz:             2294.468
BogoMIPS:            4588.93
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            4096K
L3 cache:            16384K
NUMA node0 CPU(s):   0-3
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cpuid_fault pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust smep erms xsaveopt arat umip md_clear arch_capabilities




* x86-64: memcpy performance reduced when running in a virtual machine
From: Shuo Wang @ 2021-01-11  8:41 UTC
  To: hjl.tools, libc-alpha; +Cc: hushiyuan

Performance is also worse in the VM than on the host when memcpy takes the
__memmove_avx_unaligned_erms path.



* Re: x86-64: memcpy performance reduced when running in a virtual machine
From: Florian Weimer via Libc-alpha @ 2021-01-11  9:06 UTC
  To: Shuo Wang; +Cc: hushiyuan, libc-alpha

* Shuo Wang:

> The environment information is as follows:
> [host]
> - kernel version: 4.18.0
> - glibc version: 2.28
> - gcc version: 8.3.1
> - qemu version: 2.12.0
> - libvirtd version: 4.5.0

Does your glibc have a backport of this patch?

commit d3c57027470b78dba79c6d931e4e409b1fecfc80
Author: Patrick McGehearty <patrick.mcgehearty@oracle.com>
Date:   Mon Sep 28 20:11:28 2020 +0000

    Reversing calculation of __x86_shared_non_temporal_threshold

Some hypervisors set the number of reported threads in CPUID to zero.

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



* x86-64: memcpy performance reduced when running in a virtual machine
From: Shuo Wang @ 2021-01-11 14:32 UTC
  To: hjl.tools, fweimer, libc-alpha; +Cc: hushiyuan

The performance of the 1024-byte memcpy has recovered. However, performance
on the host is now worse for large copies. These are the test results (cycles):

                          memcpy_10   memcpy_1k   memcpy_10k   memcpy_1m   memcpy_10m
before backport                   8          34          187      130848      2325409
after backport                    8          34          182      515156      5282603
performance improvement       0.00%       0.00%        2.67%    -293.71%     -127.17%




* Re: x86-64: memcpy performance reduced when running in a virtual machine
From: Florian Weimer via Libc-alpha @ 2021-01-11 15:09 UTC
  To: Shuo Wang; +Cc: hushiyuan, libc-alpha

* Shuo Wang:

> The performance of the 1024-byte memcpy has recovered. However, performance
> on the host is now worse for large copies. These are the test results (cycles):
>
>                           memcpy_10   memcpy_1k   memcpy_10k   memcpy_1m   memcpy_10m
> before backport                   8          34          187      130848      2325409
> after backport                    8          34          182      515156      5282603
> performance improvement       0.00%       0.00%        2.67%    -293.71%     -127.17%

I think this is expected because the large copies no longer stay within
the cache.  This is required to avoid blowing away the entire cache
contents for such large copies, negatively impacting whole system
performance.  This will of course not show up in a micro-benchmark.

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill

