[PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX

unofficial mirror of libc-alpha@sourceware.org
 help / color / mirror / Atom feed

* [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
@ 2021-04-12 12:52 Wilco Dijkstra via Libc-alpha
  2021-04-12 18:53 ` Florian Weimer
  2021-04-13 12:07 ` naohirot
  0 siblings, 2 replies; 41+ messages in thread
From: Wilco Dijkstra via Libc-alpha @ 2021-04-12 12:52 UTC (permalink / raw)
  To: naohirot@fujitsu.com; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi,

I have a few comments about memcpy design (the principles apply equally to memset):

1. Overall the code is too large due to enormous unroll factors

Our current memcpy is about 300 bytes (that includes memmove), this memcpy is ~12 times larger!
This hurts performance due to the code not fitting in the I-cache for common copies.
On a modern OoO core you need very little unrolling since ALU operations and branches
become essentially free while the CPU executes loads and stores. So rather than unrolling
by 32-64 times, try 4 times - you just need enough to hide the taken branch latency.

2. I don't see any special handling for small copies

Even if you want to hyper optimize gigabyte sized copies, small copies are still extremely common,
so you always want to handle those as quickly (and with as little code) as possible. Special casing
small copies does not slow down the huge copies - the reverse is more likely since you no longer
need to handle small cases.

3. Check whether using SVE helps small/medium copies

Run memcpy-random benchmark to see whether it is faster to use SVE for small cases or just the SIMD
copy on your uarch.

4. Avoid making the code too general or too specialistic

I see both appearing in the code - trying to deal with different cacheline sizes and different vector lengths,
and also splitting these out into separate cases. If you depend on a particular cacheline size, specialize
the code for that and check the size in the ifunc selector (as various memsets do already). If you want to
handle multiple vector sizes, just use a register for the increment rather than repeating the same code
several times for each vector length.

5. Odd prefetches

I have a hard time believing first prefetching the data to be written, then clearing it using DC ZVA (???),
then prefetching the same data a 2nd time, before finally write the loaded data is helping performance...
Generally hardware prefetchers are able to do exactly the right thing since memcpy is trivial to prefetch.
So what is the performance gain of each prefetch/clear step? What is the difference between memcpy
and memmove performance (given memmove doesn't do any of this)?

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-12 12:52 [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX Wilco Dijkstra via Libc-alpha
@ 2021-04-12 18:53 ` Florian Weimer
  2021-04-13 12:07 ` naohirot
  1 sibling, 0 replies; 41+ messages in thread
From: Florian Weimer @ 2021-04-12 18:53 UTC (permalink / raw)
  To: Wilco Dijkstra via Libc-alpha; +Cc: Szabolcs Nagy, Wilco Dijkstra

* Wilco Dijkstra via Libc-alpha:

> 5. Odd prefetches
>
> I have a hard time believing first prefetching the data to be
> written, then clearing it using DC ZVA (???), then prefetching the
> same data a 2nd time, before finally write the loaded data is
> helping performance...  Generally hardware prefetchers are able to
> do exactly the right thing since memcpy is trivial to prefetch.  So
> what is the performance gain of each prefetch/clear step? What is
> the difference between memcpy and memmove performance (given memmove
> doesn't do any of this)?

Another downside is exposure of latent concurrency bugs:

  G1: Phantom zeros in cardtable
  <https://bugs.openjdk.java.net/browse/JDK-8039042>

I guess the CPU's heritage is shining through here. 8-)

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-12 12:52 [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX Wilco Dijkstra via Libc-alpha
  2021-04-12 18:53 ` Florian Weimer
@ 2021-04-13 12:07 ` naohirot
  2021-04-14 16:02   ` Wilco Dijkstra via Libc-alpha
  1 sibling, 1 reply; 41+ messages in thread
From: naohirot @ 2021-04-13 12:07 UTC (permalink / raw)
  To: 'Wilco Dijkstra'; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Wilco-san,

Thanks for the comments.

I've been continuously updated the first patch since I posted on Mar. 17 2021,
and fixed some bugs.
Here is my local repository's commit history:
https://github.com/NaohiroTamura/glibc/commits/patch-20210317

I answer your comments referring to the latest source code above and
benchtests graphs uploaded to Google drive.

> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
> 
> 1. Overall the code is too large due to enormous unroll factors
> 
> Our current memcpy is about 300 bytes (that includes memmove), this memcpy is
> ~12 times larger!
> This hurts performance due to the code not fitting in the I-cache for common
> copies.

OK, I'll try to remove unnecessary code which doesn't contribute performance gain
based on benchtests performance data. 

> On a modern OoO core you need very little unrolling since ALU operations and
> branches become essentially free while the CPU executes loads and stores. So
> rather than unrolling by 32-64 times, try 4 times - you just need enough to hide the
> taken branch latency.
> 

In terms of loop unrolling, I tested several cases in my local environment.
Here is the result.
The source code is based on the latest commit of the branch patch-20210317 in my GitHub repository.
[1] https://github.com/NaohiroTamura/glibc/blob/ec0b55a855529f75bd6f280e59dc2b1c25640490/sysdeps/aarch64/multiarch/memcpy_a64fx.S
[2] https://github.com/NaohiroTamura/glibc/blob/ec0b55a855529f75bd6f280e59dc2b1c25640490/sysdeps/aarch64/multiarch/memset_a64fx.S

Memcpy/memmove uses 8, 4, 2 unrolls, and memset uses 32, 8, 4, 2 unrolls.
This unroll configuration recorded the highest performance.
Memcpy    35 Gbps/sec [3]
Memmove  70 Gbps/sec [4]
Mmemset  70 Gbps/sec [5]

[3] https://drive.google.com/file/d/1Xz04kV-S1E4tKOKLJRl8KgO8ZdCQqv1O/view
[4] https://drive.google.com/file/d/1QDmt7LMscXIJSpaq2sPOiCKl3nxcLxwk/view
[5] https://drive.google.com/file/d/1rpy7rkIskRs6czTARNIh4yCeh8d-L-cP/view

In case that Memcpy/memmove uses 4 unrolls, and memset uses 4 unrolls,
The performance degraded minus 5 to 15 Gbps/sec at the peak.
Memcpy    30 Gbps/sec [6]
Memmove  65 Gbps/sec [7]
Mmemset  45 Gbps/sec [8]

[6] https://drive.google.com/file/d/1P-QJGeuHPlfj3ax8GlxRShV0_HVMJWGc/view
[7] https://drive.google.com/file/d/1R2IK5eWr8NEduNnvqkdPZyoNE0oImRcp/view
[8] https://drive.google.com/file/d/1WMZFjzF5WgmfpXSOnAd9YMjLqv1mcsEm/view

> 2. I don't see any special handling for small copies
> 
> Even if you want to hyper optimize gigabyte sized copies, small copies are still
> extremely common, so you always want to handle those as quickly (and with as
> little code) as possible. Special casing small copies does not slow down the huge
> copies - the reverse is more likely since you no longer need to handle small cases.
>

Yes, I implemented for the case of 1 byte to 512 byte [9][10].
SVE code seems faster than ASIMD in small/medium range too [11][12][13].

[9] https://github.com/NaohiroTamura/glibc/blob/ec0b55a855529f75bd6f280e59dc2b1c25640490/sysdeps/aarch64/multiarch/memcpy_a64fx.S#L176-L267
[10] https://github.com/NaohiroTamura/glibc/blob/ec0b55a855529f75bd6f280e59dc2b1c25640490/sysdeps/aarch64/multiarch/memset_a64fx.S#L68-L78
[11] https://drive.google.com/file/d/1VgkFTrWgjFMQ35btWjqHJbEGMgb3ZE-h/view
[12] https://drive.google.com/file/d/1SJ-WMUEEX73SioT9F7tVEIc4iRa8SfjU/view
[13] https://drive.google.com/file/d/1DPPgh2r6t16Ppe0Cpo5XzkVqWA_AVRUc/view

> 3. Check whether using SVE helps small/medium copies
> 
> Run memcpy-random benchmark to see whether it is faster to use SVE for small
> cases or just the SIMD copy on your uarch.
> 

Thanks for the memcpy-random benchmark info.
For small/medium copies, I needed to remove BTI macro from ASM ENTRY in order
to see the distinct performance difference between ASIMD and SVE.
I'll post the patch [14] with the A64FX second patch.

And also somehow on A64FX as well as on ThunderX2 machine, memcpy-random
didn't start due to mprotect error.
I needed to fix memcpy-random [15].
If this is not wrong, I'll post the patch [15] with the a64fx second patch.

[14] https://github.com/NaohiroTamura/glibc/commit/07ea389846c7c63622b6c0b3aaead3f93e21f356
[15] https://github.com/NaohiroTamura/glibc/commit/ec0b55a855529f75bd6f280e59dc2b1c25640490

> 4. Avoid making the code too general or too specialistic
> 
> I see both appearing in the code - trying to deal with different cacheline sizes and
> different vector lengths, and also splitting these out into separate cases. If you
> depend on a particular cacheline size, specialize the code for that and check the
> size in the ifunc selector (as various memsets do already). If you want to handle
> multiple vector sizes, just use a register for the increment rather than repeating
> the same code several times for each vector length.
> 

In terms of the cache line size, A64FX is not configurable, it is fixed to 256 byte.
I've already removed the code to get it [16][17]

[16] https://github.com/NaohiroTamura/glibc/commit/4bcc6d83c970f7a7283abfec753ecf6b697cf6f7
[17] https://github.com/NaohiroTamura/glibc/commit/f2b2c1ca03b50d414e03411ed65e4b131615e865

In terms of Vector Length, I'll remove the code for VL256 bit and 128 bit.
Because Vector Length agnostic code can cover the both cases.

> 5. Odd prefetches
> 
> I have a hard time believing first prefetching the data to be written, then clearing it
> using DC ZVA (???), then prefetching the same data a 2nd time, before finally
> write the loaded data is helping performance...
> Generally hardware prefetchers are able to do exactly the right thing since
> memcpy is trivial to prefetch.
> So what is the performance gain of each prefetch/clear step? What is the
> difference between memcpy and memmove performance (given memmove
> doesn't do any of this)?

Sorry, memcpy prefetch code was not right, I noticed this bug and fixed it
soon after posting the first patch [18].
Basically " prfm pstl1keep, [dest_ptr, tmp1]" should be " prfm pldl2keep, [src_ptr, tmp1]".

[18] https://github.com/NaohiroTamura/glibc/commit/f5bf15708830f91fb886b15928158db2e875ac88

Without DC_VZA and L2 prefetch, memcpy and memset performance degraded over 4MB.
Please compare [19] with [22], and [21] with [24] for memset.
Without DC_VZA and L2 prefetch, memmove didn't degraded over 4MB.
Please compare [20] with [23].
The reason why I didn't implement DC_VZA and L2 prefetch is that memmove calls memcpy in
most cases, and memmove code only handles backward copy.
Maybe most of memmove-large benchtest cases are backward copy, I need to check.
DC_VZA and L2 prefetch have to be pair, only DC_VZA or only L2 prefetch doesn't get any improvement.

With DC_VZA and L2 prefetch:
[19] https://drive.google.com/file/d/1mmYaLwzEoytBJZ913jaWmucL0j564Ta7/view
[20] https://drive.google.com/file/d/1Bc_DVGBcDRpvDjxCB_2yOk3MOy5BEiOs/view
[21] https://drive.google.com/file/d/19cHvU2lxF28DW9_Z5_5O6gOOdUmVz_ps/view

Without DC_VZA and L2 prefetch:
[22] https://drive.google.com/file/d/1My6idNuQsrsPVODl0VrqiRbMR9yKGsGS/view
[23] https://drive.google.com/file/d/1q8KhvIqDf27fJ8HGWgjX0nBhgPgGBg_T/view
[24] https://drive.google.com/file/d/1l6pDhuPWDLy5egQ6BhRIYRvshvDeIrGl/view

Thanks.
Naohiro

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-13 12:07 ` naohirot
@ 2021-04-14 16:02   ` Wilco Dijkstra via Libc-alpha
  2021-04-15 12:20     ` naohirot
                       ` (5 more replies)
  0 siblings, 6 replies; 41+ messages in thread
From: Wilco Dijkstra via Libc-alpha @ 2021-04-14 16:02 UTC (permalink / raw)
  To: naohirot@fujitsu.com; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Naohiro,

Thanks for the comprehensive reply, especially the graphs are quite useful!
(I'd avoid adding generic_memcpy/memmove though since those are unoptimized C
implementations).

> OK, I'll try to remove unnecessary code which doesn't contribute performance gain
> based on benchtests performance data. 

Yes that is a good idea - you could also check whether the software pipelining actually
helps on an OoO core (it shouldn't) since that contributes a lot to the complexity and the
amount of code and unrolling required.

It is also possible to remove a lot of unnecessary code - eg. rather than use 2 instructions
per prefetch, merge the constant offset in the prefetch instruction itself (since they allow
up to 32KB offset). There are also lots of branches that skip a few instructions if a value is
zero, this is often counterproductive due to adding branch mispredictions.

> Memcpy/memmove uses 8, 4, 2 unrolls, and memset uses 32, 8, 4, 2 unrolls.
> This unroll configuration recorded the highest performance.

> In case that Memcpy/memmove uses 4 unrolls, and memset uses 4 unrolls,
> The performance degraded minus 5 to 15 Gbps/sec at the peak.

So this is the L(L1_vl_64) loop right? I guess the problem is the large number of
prefetches and all the extra code that is not strictly required (you can remove 5
redundant mov/cmp instructions from the loop). Also assuming prefetching helps
here (the good memmove results suggest it's not needed), prefetching directly
into L1 should be better than first into L2 and then into L1. So I don't see a good
reason why 4x unrolling would have to be any slower.

> Yes, I implemented for the case of 1 byte to 512 byte [9][10].
> SVE code seems faster than ASIMD in small/medium range too [11][12][13].

That adds quite a lot of code and uses a slow linear chain of comparisons. A small
loop like used in the memset should work fine to handle copies smaller than 
256 or 512 bytes (you can handle the zero bytes case for free in this code rather
than special casing it).

> For small/medium copies, I needed to remove BTI macro from ASM ENTRY in order
> to see the distinct performance difference between ASIMD and SVE.
> I'll post the patch [14] with the A64FX second patch.

I'm not sure I understand - the BTI macro just emits a NOP hint so it is harmless. We always emit
it so that it works seamlessly when BTI is enabled.

> And also somehow on A64FX as well as on ThunderX2 machine, memcpy-random
> didn't start due to mprotect error.

Yes it looks like the size isn't rounded up to a pagesize. It really needs the extra space, so
changing +4096 into getpagesize () will work.

> Without DC_VZA and L2 prefetch, memcpy and memset performance degraded over 4MB.

> DC_VZA and L2 prefetch have to be pair, only DC_VZA or only L2 prefetch doesn't get any improvement.

That seems odd. Was that using the L1 prefetch with the L2 distance? It seems to me one of the L1 or L2
prefetches is unnecessary. Also why would the DC_ZVA need to be done so early? It seems to me that
cleaning the cacheline just before you write it works best since that avoids accidentally replacing it.

> Without DC_VZA and L2 prefetch, memmove didn't degraded over 4MB.
>
> The reason why I didn't implement DC_VZA and L2 prefetch is that memmove calls memcpy in
> most cases, and memmove code only handles backward copy.
> Maybe most of memmove-large benchtest cases are backward copy, I need to check.

Most of the memmove tests do indeed overlap (so DC_ZVA does not work). However it also shows
that it performs well across the L2 cache size range without any prefetch or DC_ZVA.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-14 16:02   ` Wilco Dijkstra via Libc-alpha
@ 2021-04-15 12:20     ` naohirot
  2021-04-20 16:00       ` Wilco Dijkstra via Libc-alpha
  2021-04-19  2:51     ` naohirot
                       ` (4 subsequent siblings)
  5 siblings, 1 reply; 41+ messages in thread
From: naohirot @ 2021-04-15 12:20 UTC (permalink / raw)
  To: 'Wilco Dijkstra'; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Wilco-san,

Thanks for reviewing in detail technically!!
Now we have several topics to discuss.
So let me focus on the BTI in this mail. I'll answer other topics in later mail.

> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
> 
> Thanks for the comprehensive reply, especially the graphs are quite useful!
> (I'd avoid adding generic_memcpy/memmove though since those are unoptimized
> C implementations).

OK, I'll withdraw the patch from the A64FX patch V2.

> > For small/medium copies, I needed to remove BTI macro from ASM ENTRY
> > in order to see the distinct performance difference between ASIMD and SVE.
> > I'll post the patch [14] with the A64FX second patch.
> 
> I'm not sure I understand - the BTI macro just emits a NOP hint so it is harmless.
> We always emit it so that it works seamlessly when BTI is enabled.

Yes, I observed that just " hint #0x22" is inserted.
The benchtest results show that the A64FX performance of size less than 100B with
BTI is slower than ASIMD, but without BTI is faster than ASIMD.
And the A64FX performance of 512B with BTI 4Gbps/sec slower than without BTI.

With BTI, source code [4] 
[1] https://drive.google.com/file/d/1LlyQOq7qT4d0-54uVzUtYMMMDgIiddEj/view
[2] https://drive.google.com/file/d/1C2pl-Iz_-18mkpuQTk1PhEHKsd5x0wWo/view
[3] https://drive.google.com/file/d/1eg_p1_b619KN7XLmOpxqcoI3c9o4WXd-/view
[4] https://github.com/NaohiroTamura/glibc/commit/0f45fff654d7a31b58e5d6f4dbfa31d6586f8cc2

Without BTI, source code [8]
[5] https://drive.google.com/file/d/1Mf7wxwgGb5yYBJo1eUxqvjrkp9O4EVVJ/view
[6] https://drive.google.com/file/d/1rgfFmWsM4Q3oDK8aYa_GjEQWttS0pOBF/view
[7] https://drive.google.com/file/d/1hF7oevP-MERrQ04yajtEUY8CSWe8V2EX/view
[8] https://github.com/NaohiroTamura/glibc/commit/c204a74971b3d34680964bc52ac59264b14527e3

I executed the same test on ThanderX2, the result had very little difference
between with BTI and without BTI as you mentioned.
So if distinct degradation happens only on A64FX, I'd like to add another
ENTRY macro in sysdeps/aarch64/sysdep.h such as:

#define ENTRY_ALIGN_NO_BTI(name, align)				\
  .globl C_SYMBOL_NAME(name);					\
  .type C_SYMBOL_NAME(name),%function;				\
  .p2align align;						\
  C_LABEL(name)							\
  cfi_startproc;						\
  CALL_MCOUNT

Or I'd like to change memcpy_a64fx.S and memset_a64fx.S without ENTRY macro such as:
  .globl __memcpy_a64fx
  .type __memcpy_a64fx, %function
  .p2align 6
  __memcpy_a64fx:
  cfi_startproc
  CALL_MCOUNT

What do you think?

> > And also somehow on A64FX as well as on ThunderX2 machine,
> > memcpy-random didn't start due to mprotect error.
> 
> Yes it looks like the size isn't rounded up to a pagesize. It really needs the extra
> space, so changing +4096 into getpagesize () will work.

OK, I've already applied it [8].

Thanks!
Naohiro


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-15 12:20     ` naohirot
@ 2021-04-20 16:00       ` Wilco Dijkstra via Libc-alpha
  2021-04-27 11:58         ` naohirot
  0 siblings, 1 reply; 41+ messages in thread
From: Wilco Dijkstra via Libc-alpha @ 2021-04-20 16:00 UTC (permalink / raw)
  To: naohirot@fujitsu.com; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Naohiro,

> Yes, I observed that just " hint #0x22" is inserted.
> The benchtest results show that the A64FX performance of size less than 100B with
> BTI is slower than ASIMD, but without BTI is faster than ASIMD.
> And the A64FX performance of 512B with BTI 4Gbps/sec slower than without BTI.

That's unfortunate - it seems like the hint is very slow, maybe even serializing...
We can work around if for now in GLIBC, but at some point distros will start to insert
BTI instructions by default, and then the performance hit will be bad.

> So if distinct degradation happens only on A64FX, I'd like to add another
> ENTRY macro in sysdeps/aarch64/sysdep.h such as:

I think the best option for now is to change BTI_C into NOP if AARCH64_HAVE_BTI
is not set. This avoids creating alignment issues in existing code (which is written
to assume the hint is present) and works for all string functions.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-20 16:00       ` Wilco Dijkstra via Libc-alpha
@ 2021-04-27 11:58         ` naohirot
  2021-04-29 15:13           ` Wilco Dijkstra via Libc-alpha
  0 siblings, 1 reply; 41+ messages in thread
From: naohirot @ 2021-04-27 11:58 UTC (permalink / raw)
  To: 'Wilco Dijkstra'; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Wilco-san,

This mail is a continuation of BTI macro.

I believe that I've answered all of your comments so far.
Please let me know if I missed something.
If there is no further comments to the first version of this patch,
I'd like to proceed with the preparation of the second version after
the consecutive National holidays, Apr. 29th - May. 5th, in Japan.

> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

> > So if distinct degradation happens only on A64FX, I'd like to add
> > another ENTRY macro in sysdeps/aarch64/sysdep.h such as:
> 
> I think the best option for now is to change BTI_C into NOP if AARCH64_HAVE_BTI
> is not set. This avoids creating alignment issues in existing code (which is written
> to assume the hint is present) and works for all string functions.

I updated sysdeps/aarch64/sysdep.h following your advice [1].
Then I reverted the entries of memcpy/memmove [2] and memset [3].

[1] https://github.com/NaohiroTamura/glibc/commit/c582917071e76cfed84fafb0c82cb70339294386
[2] https://github.com/NaohiroTamura/glibc/commit/f4627d5a0faa8d2bd9102964a3e31936248fa9ca
[3] https://github.com/NaohiroTamura/glibc/commit/da48f62bab67d875cb712a886ba074073857d5c3

Thanks.
Naohiro

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-27 11:58         ` naohirot
@ 2021-04-29 15:13           ` Wilco Dijkstra via Libc-alpha
  2021-04-30 15:01             ` Szabolcs Nagy via Libc-alpha
  2021-05-06 10:01             ` naohirot
  0 siblings, 2 replies; 41+ messages in thread
From: Wilco Dijkstra via Libc-alpha @ 2021-04-29 15:13 UTC (permalink / raw)
  To: naohirot@fujitsu.com; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Naohiro,

> I believe that I've answered all of your comments so far.
> Please let me know if I missed something.
> If there is no further comments to the first version of this patch,
> I'd like to proceed with the preparation of the second version after
> the consecutive National holidays, Apr. 29th - May. 5th, in Japan.

I've only looked at memcpy so far. My comments on memcpy:

(1) Improve the tail code in unroll4/2/1/last to do the reverse of
    shortcut_for_small_size - basically there is no need for loops or lots of branches.

(2) Rather than start with L2, check for n > L2_SIZE && vector_length == 64 and
    start with the vl_agnostic case. Copies > L2_SIZE will be very rare so it's best to
    handle the common case first.

(3) The alignment code can be significantly simplified. Why not just process
    4 vectors unconditionally and then align the pointers? That avoids all the
    complex code and is much faster.

(4) Is there a benefit of aligning src or dst to vector size in the vl_agnostic case?
    If so, it would be easy to align to a vector first and then if n > L2_SIZE do the
    remaining 3 vectors to align to a full cacheline.

(5) I'm not sure I understand the reason for src_notag/dest_notag. However if
    you want to ignore tags, just change the mov src_ptr, src into AND that
    clears the tag. There is no reason to both clear the tag and also keep the
    original pointer and tag.

For memmove I would suggest to merge it with memcpy to save ~100 instructions.
I don't understand the complexity of the L(dispatch) code - you just need a simple
3-instruction overlap check that branches to bwd_unroll8.

I haven't looked at memset, but pretty much all the improvements apply there too.

>> I think the best option for now is to change BTI_C into NOP if AARCH64_HAVE_BTI
>> is not set. This avoids creating alignment issues in existing code (which is written
>> to assume the hint is present) and works for all string functions.
>
> I updated sysdeps/aarch64/sysdep.h following your advice [1].
> 
> [1] https://github.com/NaohiroTamura/glibc/commit/c582917071e76cfed84fafb0c82cb70339294386

I meant using an actual NOP in the #else case so that existing string functions
won't change. Also note the #defines in the #if and #else need to be indented.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-29 15:13           ` Wilco Dijkstra via Libc-alpha
@ 2021-04-30 15:01             ` Szabolcs Nagy via Libc-alpha
  2021-04-30 15:23               ` Wilco Dijkstra via Libc-alpha
  2021-05-06 10:01             ` naohirot
  1 sibling, 1 reply; 41+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-04-30 15:01 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: 'GNU C Library'

The 04/29/2021 16:13, Wilco Dijkstra wrote:
> > I updated sysdeps/aarch64/sysdep.h following your advice [1].
> >
> > [1] https://github.com/NaohiroTamura/glibc/commit/c582917071e76cfed84fafb0c82cb70339294386
> 
> I meant using an actual NOP in the #else case so that existing string functions
> won't change. Also note the #defines in the #if and #else need to be indented.

is that really useful?
'bti c' is already a nop if it's unsupported.

maybe it works if a64fx_memcpy.S has

  #undef BTI_C
  #define BTI_C
  ENTRY(a64fx_memcpy)
    ...

to save one nop.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-30 15:01             ` Szabolcs Nagy via Libc-alpha
@ 2021-04-30 15:23               ` Wilco Dijkstra via Libc-alpha
  2021-04-30 15:30                 ` Florian Weimer via Libc-alpha
  0 siblings, 1 reply; 41+ messages in thread
From: Wilco Dijkstra via Libc-alpha @ 2021-04-30 15:23 UTC (permalink / raw)
  To: Szabolcs Nagy; +Cc: 'GNU C Library'

Hi Szabolcs,

>> I meant using an actual NOP in the #else case so that existing string functions
>> won't change. Also note the #defines in the #if and #else need to be indented.
>
> is that really useful?
> 'bti c' is already a nop if it's unsupported.

Well it doesn't seem to behave like a NOP. So to avoid slowing down all string
functions, bti c must be removed completely, not just from A64FX memcpy.
Using a real NOP is fine in all cases as long as HAVE_AARCH64_BTI is not defined.

> maybe it works if a64fx_memcpy.S has
>
>  #undef BTI_C
>  #define BTI_C
>  ENTRY(a64fx_memcpy)

That works for memcpy, but what about everything else?

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-30 15:23               ` Wilco Dijkstra via Libc-alpha
@ 2021-04-30 15:30                 ` Florian Weimer via Libc-alpha
  2021-04-30 15:40                   ` Wilco Dijkstra via Libc-alpha
  0 siblings, 1 reply; 41+ messages in thread
From: Florian Weimer via Libc-alpha @ 2021-04-30 15:30 UTC (permalink / raw)
  To: Wilco Dijkstra via Libc-alpha; +Cc: Szabolcs Nagy, Wilco Dijkstra

* Wilco Dijkstra via Libc-alpha:

> Hi Szabolcs,
>
>>> I meant using an actual NOP in the #else case so that existing string functions
>>> won't change. Also note the #defines in the #if and #else need to be indented.
>>
>> is that really useful?
>> 'bti c' is already a nop if it's unsupported.
>
> Well it doesn't seem to behave like a NOP. So to avoid slowing down
> all string functions, bti c must be removed completely, not just from
> A64FX memcpy.  Using a real NOP is fine in all cases as long as
> HAVE_AARCH64_BTI is not defined.

I'm probably confused, but: If BTI is active, many more glibc functions
will have BTI markers.  What makes the string functions special?

Thanks,
Florian


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-30 15:30                 ` Florian Weimer via Libc-alpha
@ 2021-04-30 15:40                   ` Wilco Dijkstra via Libc-alpha
  2021-05-04  7:56                     ` Szabolcs Nagy via Libc-alpha
  0 siblings, 1 reply; 41+ messages in thread
From: Wilco Dijkstra via Libc-alpha @ 2021-04-30 15:40 UTC (permalink / raw)
  To: Florian Weimer, Wilco Dijkstra via Libc-alpha; +Cc: Szabolcs Nagy

Hi Florian,

>> Well it doesn't seem to behave like a NOP. So to avoid slowing down
>> all string functions, bti c must be removed completely, not just from
>> A64FX memcpy.  Using a real NOP is fine in all cases as long as
>> HAVE_AARCH64_BTI is not defined.
>
> I'm probably confused, but: If BTI is active, many more glibc functions
> will have BTI markers.  What makes the string functions special?

Exactly. And at that point trying to remove it from memcpy is just pointless.

The case we are discussing is where BTI is not turned on in GLIBC but we still 
emit a BTI at the start of assembler functions for simplicity. By using a NOP
instead, A64FX will not execute BTI anywhere in GLIBC.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-30 15:40                   ` Wilco Dijkstra via Libc-alpha
@ 2021-05-04  7:56                     ` Szabolcs Nagy via Libc-alpha
  2021-05-04 10:17                       ` Florian Weimer via Libc-alpha
  0 siblings, 1 reply; 41+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-05-04  7:56 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: Florian Weimer, Wilco Dijkstra via Libc-alpha

The 04/30/2021 16:40, Wilco Dijkstra wrote:
> >> Well it doesn't seem to behave like a NOP. So to avoid slowing down
> >> all string functions, bti c must be removed completely, not just from
> >> A64FX memcpy.  Using a real NOP is fine in all cases as long as
> >> HAVE_AARCH64_BTI is not defined.
> >
> > I'm probably confused, but: If BTI is active, many more glibc functions
> > will have BTI markers.  What makes the string functions special?
> 
> Exactly. And at that point trying to remove it from memcpy is just pointless.
> 
> The case we are discussing is where BTI is not turned on in GLIBC but we still
> emit a BTI at the start of assembler functions for simplicity. By using a NOP
> instead, A64FX will not execute BTI anywhere in GLIBC.

the asm ENTRY was written with the assumption that bti c
behaves like a nop when bti is disabled, so we don't have
to make the asm conditional based on cflags.

if that's not the case i agree with the patch, however we
will have to review some other code (e.g. libgcc outline
atomics asm) where we made the same assumption.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-05-04  7:56                     ` Szabolcs Nagy via Libc-alpha
@ 2021-05-04 10:17                       ` Florian Weimer via Libc-alpha
  2021-05-04 10:38                         ` Wilco Dijkstra via Libc-alpha
  2021-05-04 10:42                         ` Szabolcs Nagy via Libc-alpha
  0 siblings, 2 replies; 41+ messages in thread
From: Florian Weimer via Libc-alpha @ 2021-05-04 10:17 UTC (permalink / raw)
  To: Szabolcs Nagy; +Cc: Wilco Dijkstra via Libc-alpha, Wilco Dijkstra

* Szabolcs Nagy:

> The 04/30/2021 16:40, Wilco Dijkstra wrote:
>> >> Well it doesn't seem to behave like a NOP. So to avoid slowing down
>> >> all string functions, bti c must be removed completely, not just from
>> >> A64FX memcpy.  Using a real NOP is fine in all cases as long as
>> >> HAVE_AARCH64_BTI is not defined.
>> >
>> > I'm probably confused, but: If BTI is active, many more glibc functions
>> > will have BTI markers.  What makes the string functions special?
>> 
>> Exactly. And at that point trying to remove it from memcpy is just pointless.
>> 
>> The case we are discussing is where BTI is not turned on in GLIBC but we still
>> emit a BTI at the start of assembler functions for simplicity. By using a NOP
>> instead, A64FX will not execute BTI anywhere in GLIBC.
>
> the asm ENTRY was written with the assumption that bti c
> behaves like a nop when bti is disabled, so we don't have
> to make the asm conditional based on cflags.
>
> if that's not the case i agree with the patch, however we
> will have to review some other code (e.g. libgcc outline
> atomics asm) where we made the same assumption.

I find this discussion extremely worrisome.  If bti c does not behave
like a nop, then we need a new AArch64 ABI variant to enable BTI.

That being said, a distribution with lots of bti c instructions in
binaries seems to run on A64FX CPUs, so I'm not sure what is going on.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-05-04 10:17                       ` Florian Weimer via Libc-alpha
@ 2021-05-04 10:38                         ` Wilco Dijkstra via Libc-alpha
  2021-05-04 10:42                         ` Szabolcs Nagy via Libc-alpha
  1 sibling, 0 replies; 41+ messages in thread
From: Wilco Dijkstra via Libc-alpha @ 2021-05-04 10:38 UTC (permalink / raw)
  To: Florian Weimer, Szabolcs Nagy; +Cc: Wilco Dijkstra via Libc-alpha

Hi Florian,

> I find this discussion extremely worrisome.  If bti c does not behave
> like a nop, then we need a new AArch64 ABI variant to enable BTI.
>
> That being said, a distribution with lots of bti c instructions in
> binaries seems to run on A64FX CPUs, so I'm not sure what is going on.

NOP-space instructions should take no time or execution resources.
From Naohiro's graphs I estimate A64FX takes around 30 cycles per BTI
instruction - that's clearly "not behaving like a NOP". That would cause a
significant performance degradation if BTI is enabled in a distro.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-05-04 10:17                       ` Florian Weimer via Libc-alpha
  2021-05-04 10:38                         ` Wilco Dijkstra via Libc-alpha
@ 2021-05-04 10:42                         ` Szabolcs Nagy via Libc-alpha
  2021-05-04 11:07                           ` Florian Weimer via Libc-alpha
  1 sibling, 1 reply; 41+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-05-04 10:42 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Wilco Dijkstra via Libc-alpha, Wilco Dijkstra

The 05/04/2021 12:17, Florian Weimer wrote:
> * Szabolcs Nagy:
> 
> > The 04/30/2021 16:40, Wilco Dijkstra wrote:
> >> >> Well it doesn't seem to behave like a NOP. So to avoid slowing down
> >> >> all string functions, bti c must be removed completely, not just from
> >> >> A64FX memcpy.  Using a real NOP is fine in all cases as long as
> >> >> HAVE_AARCH64_BTI is not defined.
> >> >
> >> > I'm probably confused, but: If BTI is active, many more glibc functions
> >> > will have BTI markers.  What makes the string functions special?
> >> 
> >> Exactly. And at that point trying to remove it from memcpy is just pointless.
> >> 
> >> The case we are discussing is where BTI is not turned on in GLIBC but we still
> >> emit a BTI at the start of assembler functions for simplicity. By using a NOP
> >> instead, A64FX will not execute BTI anywhere in GLIBC.
> >
> > the asm ENTRY was written with the assumption that bti c
> > behaves like a nop when bti is disabled, so we don't have
> > to make the asm conditional based on cflags.
> >
> > if that's not the case i agree with the patch, however we
> > will have to review some other code (e.g. libgcc outline
> > atomics asm) where we made the same assumption.
> 
> I find this discussion extremely worrisome.  If bti c does not behave
> like a nop, then we need a new AArch64 ABI variant to enable BTI.
> 
> That being said, a distribution with lots of bti c instructions in
> binaries seems to run on A64FX CPUs, so I'm not sure what is going on.

this does not have correctness impact, only performance impact.

hint space instructions are seem slower than expected on a64fx.

which means unconditionally adding bti c to asm entry code is not
ideal if somebody tries to build a system without branch-protection.
distros that build all binaries with branch protection will just
take a performance hit on a64fx, we cant fix that easily.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-05-04 10:42                         ` Szabolcs Nagy via Libc-alpha
@ 2021-05-04 11:07                           ` Florian Weimer via Libc-alpha
  0 siblings, 0 replies; 41+ messages in thread
From: Florian Weimer via Libc-alpha @ 2021-05-04 11:07 UTC (permalink / raw)
  To: Szabolcs Nagy; +Cc: Wilco Dijkstra via Libc-alpha, Wilco Dijkstra

* Szabolcs Nagy:

> The 05/04/2021 12:17, Florian Weimer wrote:
>> * Szabolcs Nagy:
>> 
>> > The 04/30/2021 16:40, Wilco Dijkstra wrote:
>> >> >> Well it doesn't seem to behave like a NOP. So to avoid slowing down
>> >> >> all string functions, bti c must be removed completely, not just from
>> >> >> A64FX memcpy.  Using a real NOP is fine in all cases as long as
>> >> >> HAVE_AARCH64_BTI is not defined.
>> >> >
>> >> > I'm probably confused, but: If BTI is active, many more glibc functions
>> >> > will have BTI markers.  What makes the string functions special?
>> >> 
>> >> Exactly. And at that point trying to remove it from memcpy is just pointless.
>> >> 
>> >> The case we are discussing is where BTI is not turned on in GLIBC but we still
>> >> emit a BTI at the start of assembler functions for simplicity. By using a NOP
>> >> instead, A64FX will not execute BTI anywhere in GLIBC.
>> >
>> > the asm ENTRY was written with the assumption that bti c
>> > behaves like a nop when bti is disabled, so we don't have
>> > to make the asm conditional based on cflags.
>> >
>> > if that's not the case i agree with the patch, however we
>> > will have to review some other code (e.g. libgcc outline
>> > atomics asm) where we made the same assumption.
>> 
>> I find this discussion extremely worrisome.  If bti c does not behave
>> like a nop, then we need a new AArch64 ABI variant to enable BTI.
>> 
>> That being said, a distribution with lots of bti c instructions in
>> binaries seems to run on A64FX CPUs, so I'm not sure what is going on.
>
> this does not have correctness impact, only performance impact.
>
> hint space instructions are seem slower than expected on a64fx.
>
> which means unconditionally adding bti c to asm entry code is not
> ideal if somebody tries to build a system without branch-protection.
> distros that build all binaries with branch protection will just
> take a performance hit on a64fx, we cant fix that easily.

I think I see it now.  It's not critically slow, but there appears to be
observable impact.  I'm still worried.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-29 15:13           ` Wilco Dijkstra via Libc-alpha
  2021-04-30 15:01             ` Szabolcs Nagy via Libc-alpha
@ 2021-05-06 10:01             ` naohirot
  2021-05-06 14:26               ` Szabolcs Nagy via Libc-alpha
  2021-05-06 17:31               ` Wilco Dijkstra via Libc-alpha
  1 sibling, 2 replies; 41+ messages in thread
From: naohirot @ 2021-05-06 10:01 UTC (permalink / raw)
  To: 'Wilco Dijkstra'; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Wilco,

Thanks for the comments, I applied all of your comments to both
memcpy/memmove and memset except (3) alignment code for memset.
The latest code became memcpy/memove [1] and memset [2] in the
patch-20210317 [3] branch by evaluating the performance data as shown
below.

[1] https://github.com/NaohiroTamura/glibc/blob/d2ea23703fc45cbfe4a8f27c759b0b23722e17a4/sysdeps/aarch64/multiarch/memcpy_a64fx.S
[2] https://github.com/NaohiroTamura/glibc/blob/d2ea23703fc45cbfe4a8f27c759b0b23722e17a4/sysdeps/aarch64/multiarch/memset_a64fx.S
[3] https://github.com/NaohiroTamura/glibc/commits/patch-20210317

> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

> I've only looked at memcpy so far. My comments on memcpy:
> 
> (1) Improve the tail code in unroll4/2/1/last to do the reverse of
>     shortcut_for_small_size - basically there is no need for loops or lots of
> branches.
>

I updated the tail code both memcpy/memmove [4] and memset [5], and
replaced small size code of memset [5].
The performance is shown as "whilelo" in Google Sheet Graph for
memcpy/memmove [6] and memset [7].

[4] https://github.com/NaohiroTamura/glibc/commit/f7d9d7b22814affdd89cf291905b9c6601e2031d
[5] https://github.com/NaohiroTamura/glibc/commit/b79d6731f800a56be66c895c035b791ca5176bbb
[6] https://docs.google.com/spreadsheets/d/1Rh-bwF6dpWqoOCbL2epogUPn4I2Emd0NiFgoEOPaujM/edit
[7] https://docs.google.com/spreadsheets/d/1TS0qFhyR_06OyqaRHYAdCKxwvRz7f1T8jI7Pu6x2GIk/edit

> (2) Rather than start with L2, check for n > L2_SIZE && vector_length == 64 and
>     start with the vl_agnostic case. Copies > L2_SIZE will be very rare so it's best
> to
>     handle the common case first.
> 

I changed the order both both memcpy/memmove [8] and memset [9].
The performance is shown as "agnostic1st" in Google Sheet Graph for
memcpy/memmove [6] and memset [7].

[8] https://github.com/NaohiroTamura/glibc/commit/c0d7e39aa4aefe3d7b7d2a8a7c220150a0eb78fe
[9] https://github.com/NaohiroTamura/glibc/commit/d2ea23703fc45cbfe4a8f27c759b0b23722e17a4

> (3) The alignment code can be significantly simplified. Why not just process
>     4 vectors unconditionally and then align the pointers? That avoids all the
>     complex code and is much faster.
> 

In terms of memcpy/memmove, I tried 4 patterns, "simplifiedL2algin"[10], 
" simplifiedL2algin2"[11], "agnosticVLalign"[12], and "noalign"[13] as shown in
Google Sheet Graph [14].

"simplifiedL2algin"[10] simplified to 4 whilelo, " simplifiedL2algin2"[11] simplified
to 2 whilelo or 4 whilelo, "agnosticVLalign"[12] added alignment code to L(vl_agnostic),
and "noalign"[13] removed all alignments.

"agnosticVLalign"[12] and "noalign"[13] didn't improve the performance, so these
commits are kept in the patch-20210317-memcpy-alignment branch [15]

[10] https://github.com/NaohiroTamura/glibc/commit/dd4ede78ec4d74e61a4dc3166fc8586168c4e410
[11] https://github.com/NaohiroTamura/glibc/commit/dd246ff01d59e4e91d10261cd070baae07c0093e
[12] https://github.com/NaohiroTamura/glibc/commit/35b8057d91024bf41595d38d94b2c3c76bdfd6b0
[13] https://github.com/NaohiroTamura/glibc/commit/b1f16f3e738152a5c0f3441201058b48901b4910
[14] https://docs.google.com/spreadsheets/d/1REBslxd56kMDMiXHAtRkBn4IaUO7AVmgvGldJl5qc58/edit
[15] https://github.com/NaohiroTamura/glibc/commits/patch-20210317-memcpy-alignment

In terms of memset, I tried 4 patterns too, " VL/CL-align "[16], "CL-align"[17],
"CL-align2"[18] and "noalign"[19] as shown in Google Sheet Graph [20].

" VL/CL-align "[16] simplified to 1 whilelo for VL and 3 whilelo for CL,
"CL-align"[17] simplified to 4 whilelo, "CL-align2"[18] simplified to 2 whilelo or
4 whilelo, and "noalign"[19] removed all alignments.

As shown in Google Sheet Graph [20] all of 4 patters didn't improve the
performance, so these commits are kept in the
patch-20210317-memset-alignment branch [21]

[16] https://github.com/NaohiroTamura/glibc/commit/2405b67a6bb8b380476967e150b35f10e0f25fe3
[17 https://github.com/NaohiroTamura/glibc/commit/a01a8ef08f3b53a691502538dabce3d5941790ff
[18] https://github.com/NaohiroTamura/glibc/commit/c8eb4467acbc97890a4f76f716a88d2dd901e083
[19] https://github.com/NaohiroTamura/glibc/commit/01ff56a9e558d650b09b0053adbc3215d269d65f
[20] https://docs.google.com/spreadsheets/d/1qT0ZkbrrL3fpEyfdjr23cbtanNyPFKN8xDo6E9Mb_YQ/edit
[21] https://github.com/NaohiroTamura/glibc/commits/patch-20210317-memset-alginment

> (4) Is there a benefit of aligning src or dst to vector size in the vl_agnostic case?
>     If so, it would be easy to align to a vector first and then if n > L2_SIZE do the
>     remaining 3 vectors to align to a full cacheline.
> 

As tried in (3), "agnosticVLalign"[12] didn't improve the performance.

> (5) I'm not sure I understand the reason for src_notag/dest_notag. However if
>     you want to ignore tags, just change the mov src_ptr, src into AND that
>     clears the tag. There is no reason to both clear the tag and also keep the
>     original pointer and tag.
> 

A64FX has Fujitsu's proprietary enhancement regarding tag address.
I removed dest_notag/src_notag macro and simplified L(dispatch) [22]
"src" address has to be kept to jump to L(last)[23].

[22] https://github.com/NaohiroTamura/glibc/commit/519244f5058d0aa98634bb544bae3358f0b7b07c
[23] https://github.com/NaohiroTamura/glibc/blob/519244f5058d0aa98634bb544bae3358f0b7b07c/sysdeps/aarch64/multiarch/memcpy_a64fx.S#L399

> For memmove I would suggest to merge it with memcpy to save ~100 instructions.
> I don't understand the complexity of the L(dispatch) code - you just need a simple
> 3-instruction overlap check that branches to bwd_unroll8.
> 

I simplified the he L(dispatch) code to 3 instructions[24] in the commit[23]. 

[24] https://github.com/NaohiroTamura/glibc/blob/519244f5058d0aa98634bb544bae3358f0b7b07c/sysdeps/aarch64/multiarch/memcpy_a64fx.S#L368-L370

> I haven't looked at memset, but pretty much all the improvements apply there too.

So please review the latest memset [2].

> >> I think the best option for now is to change BTI_C into NOP if
> >> AARCH64_HAVE_BTI is not set. This avoids creating alignment issues in
> >> existing code (which is written to assume the hint is present) and works for all
> string functions.
> >
> > I updated sysdeps/aarch64/sysdep.h following your advice [1].
> >
> > [1]
> > https://github.com/NaohiroTamura/glibc/commit/c582917071e76cfed84fafb0
> > c82cb70339294386
> 
> I meant using an actual NOP in the #else case so that existing string functions
> won't change. Also note the #defines in the #if and #else need to be indented.
> 

I've read the mail thread regarding BTI, but I think I couldn't fully understand the
problem. BTI seems available from ARMv8.5, and A64FX is ARMv8.2.
Even though distro distributed BTI enabled binary, BTI doesn't work on A64FX.
So BTI_J macro can be removed from A64FX IFUNC code at least, because A64FX
IFUNC code is executed only on A64FX.
Are we discussing the BTI_C code which is not in IFUNC code?

Thanks.
Naohiro


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-05-06 10:01             ` naohirot
@ 2021-05-06 14:26               ` Szabolcs Nagy via Libc-alpha
  2021-05-06 15:09                 ` Florian Weimer via Libc-alpha
  2021-05-06 17:31               ` Wilco Dijkstra via Libc-alpha
  1 sibling, 1 reply; 41+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-05-06 14:26 UTC (permalink / raw)
  To: naohirot@fujitsu.com; +Cc: 'GNU C Library', 'Wilco Dijkstra'

The 05/06/2021 10:01, naohirot@fujitsu.com wrote:
> > From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
> > > [1]
> > > https://github.com/NaohiroTamura/glibc/commit/c582917071e76cfed84fafb0
> > > c82cb70339294386
> > 
> > I meant using an actual NOP in the #else case so that existing string functions
> > won't change. Also note the #defines in the #if and #else need to be indented.
> > 
> 
> I've read the mail thread regarding BTI, but I think I couldn't fully understand the
> problem. BTI seems available from ARMv8.5, and A64FX is ARMv8.2.
> Even though distro distributed BTI enabled binary, BTI doesn't work on A64FX.
> So BTI_J macro can be removed from A64FX IFUNC code at least, because A64FX
> IFUNC code is executed only on A64FX.
> Are we discussing the BTI_C code which is not in IFUNC code?

BTI_C at function entry.

the slowdown you showed with bti c at function entry
should not be present with a plain nop.

this means a64fx implemented hint space instructions
(such as bti c) slower than plain nops, which is not
expected and will cause slowdowns with distros that
try to distribute binaries with bti c, this problem
goes beyond string functions.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-05-06 14:26               ` Szabolcs Nagy via Libc-alpha
@ 2021-05-06 15:09                 ` Florian Weimer via Libc-alpha
  0 siblings, 0 replies; 41+ messages in thread
From: Florian Weimer via Libc-alpha @ 2021-05-06 15:09 UTC (permalink / raw)
  To: Szabolcs Nagy via Libc-alpha; +Cc: Szabolcs Nagy, 'Wilco Dijkstra'

* Szabolcs Nagy via Libc-alpha:

> this means a64fx implemented hint space instructions
> (such as bti c) slower than plain nops, which is not
> expected and will cause slowdowns with distros that
> try to distribute binaries with bti c, this problem
> goes beyond string functions.

And we are using -mbranch-protection=standard on AArch64 going forward,
for example:

| optflags: aarch64 %{__global_compiler_flags} -mbranch-protection=standard -fasynchronous-unwind-tables %[ "%{toolchain}" == "gcc" ? "-fstack-clash-protection" : "" ]

<https://gitlab.com/redhat/centos-stream/rpms/redhat-rpm-config/-/blob/c9s/rpmrc#L77>

(Fedora is similar.

This is why I find this issue so worrying.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-05-06 10:01             ` naohirot
  2021-05-06 14:26               ` Szabolcs Nagy via Libc-alpha
@ 2021-05-06 17:31               ` Wilco Dijkstra via Libc-alpha
  2021-05-07 12:31                 ` naohirot
  1 sibling, 1 reply; 41+ messages in thread
From: Wilco Dijkstra via Libc-alpha @ 2021-05-06 17:31 UTC (permalink / raw)
  To: naohirot@fujitsu.com; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Naohiro,

> I've read the mail thread regarding BTI, but I think I couldn't fully understand the
> problem. BTI seems available from ARMv8.5, and A64FX is ARMv8.2.

BTI instructions are NOP hints, so it is possible to enable BTI even on ARMv8.0.
Using BTI instructions is harmless on CPUs that don't support it if NOP hints are as
cheap as a NOP (which generally doesn't need any execution resources).

> Even though distro distributed BTI enabled binary, BTI doesn't work on A64FX.

It works (ie. it is binary compatible with A64FX) and should have no effect. However
it seems to cause an unexpected slowdown.

> So BTI_J macro can be removed from A64FX IFUNC code at least, because A64FX
> IFUNC code is executed only on A64FX.

How is removing it just from memcpy going to help? The worry is not about memcpy
but the slowdown from all the BTI instructions that will be added to most functions.

Note it is still worthwhile to change BTI_C to NOP as suggested - that is the case when
BTI is not enabled, and there you want to avoid inserting BTI when it is not needed.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-05-06 17:31               ` Wilco Dijkstra via Libc-alpha
@ 2021-05-07 12:31                 ` naohirot
  0 siblings, 0 replies; 41+ messages in thread
From: naohirot @ 2021-05-07 12:31 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: Szabolcs Nagy, Florian Weimer, 'GNU C Library'

Hi Wilco, Szabolcs, Florian,

Thanks for the explanation!

> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
 
> How is removing it just from memcpy going to help? The worry is not about memcpy
> but the slowdown from all the BTI instructions that will be added to most functions.

OK, I understood.
Now I'm asking a question to CPU design team how A64FX "hint 34" is implemented and
behaves. 
 
> Note it is still worthwhile to change BTI_C to NOP as suggested - that is the case when
> BTI is not enabled, and there you want to avoid inserting BTI when it is not needed.

I changed BTI_C and BTI_J definitions to nop [1]. 

[1] https://github.com/NaohiroTamura/glibc/commit/0804fe9d288d489ec8af98c687552decd2723f5d

Thanks.
Naohiro

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-14 16:02   ` Wilco Dijkstra via Libc-alpha
  2021-04-15 12:20     ` naohirot
@ 2021-04-19  2:51     ` naohirot
  2021-04-19 14:57       ` Wilco Dijkstra via Libc-alpha
  2021-04-19 12:43     ` naohirot
                       ` (3 subsequent siblings)
  5 siblings, 1 reply; 41+ messages in thread
From: naohirot @ 2021-04-19  2:51 UTC (permalink / raw)
  To: 'Wilco Dijkstra'; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Wilco-san,

Let me focus on the macro " shortcut_for_small_size" for small/medium, less than
512 byte in this mail. 

> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
> > Yes, I implemented for the case of 1 byte to 512 byte [9][10].
> > SVE code seems faster than ASIMD in small/medium range too [11][12][13].
> 
> That adds quite a lot of code and uses a slow linear chain of comparisons. A small
> loop like used in the memset should work fine to handle copies smaller than
> 256 or 512 bytes (you can handle the zero bytes case for free in this code rather
> than special casing it).
> 

I compared performance of the size less than 512 byte for the following five
implementation cases.

CASE 1: liner chain
As mentioned in the reply [0] I removed BTI_J [1], but the macro " shortcut_for_small_size"
stays linear chain [2]
A64FX performance is 4-14 Gbps [3].
The other arch implementations call BTI_J, so performance is degraded.
.
[0] https://sourceware.org/pipermail/libc-alpha/2021-April/125079.html
[1] https://github.com/NaohiroTamura/glibc/commit/7d7217b518e59c78582ac4e89cae725cf620877e
[2] https://github.com/NaohiroTamura/glibc/blob/7d7217b518e59c78582ac4e89cae725cf620877e/sysdeps/aarch64/multiarch/memcpy_a64fx.S#L176-L267
[3] https://drive.google.com/file/d/16qo7N05W526H9j7_9qjm-_Q7gZmOXwpY/view

CASE 2: whilelt loop such as memset
I tested "whilelt loop" implementation instead of the macro " shortcut_for_small_size".
And after having tested, I commented out "whilelt loop" implementation [4]
Comparing with the CASE 1, A64FX performance degraded from 4-14 Gbps to 3-10 Gbps [5]. 
Please notice that "whilelt loop" implementation cannot be used for memmove,
because it doesn't work for backward copy.
On the other hand, the macro " shortcut_for_small_size" works for backward copy, because
it loads up to all 512 byte of data into z0 to z7 SVE registers at once, and then store all data.

[4] https://github.com/NaohiroTamura/glibc/commit/77d1da301f8161c74875b0314cae34be8cb33477#diff-03552f8369653866548b20e7867272a645fa2129c700b78fdfafe5a0ff6a259eR308-R318
[5] https://drive.google.com/file/d/1xdw7mr0c90VupVkQwelFafQHNkXslCwv/view

CASE 3: binary tree chain
I updated the macro " shortcut_for_small_size" to use binary tree chain [6][7].
Comparing with the CASE 1, the size less than 96 byte degraded from 4.0-6.0 Gbps
to 2.5-5.0 Gbps, but the size 512 byte improved from 14.0 Gbps to 17.5 Gbps.

[6] https://github.com/NaohiroTamura/glibc/commit/5c17af8c57561ede5ed2c2af96c9efde4092f02f
[7] https://github.com/NaohiroTamura/glibc/blob/5c17af8c57561ede5ed2c2af96c9efde4092f02f/sysdeps/aarch64/multiarch/memcpy_a64fx.S#L177-L204
[8] https://drive.google.com/file/d/13w8yKdeLpVbp-uJmCttKBKtScya1tXqP/view

CASE 4: binary tree chain except up to 64 byte
I handled up to 64 byte so as to return quickly [9].
Comparing with the CASE 3, the size less than 64 byte improved from 2.5 Gbps to
4.0 Gbps, but the size 512 byte degraded from 17.5 Gbps to 16.5 Gbps [10].

[9] https://github.com/NaohiroTamura/glibc/commit/77d1da301f8161c74875b0314cae34be8cb33477#diff-03552f8369653866548b20e7867272a645fa2129c700b78fdfafe5a0ff6a259eR177-R184
[10] https://drive.google.com/file/d/1lFsjns9g_7fySAsvx_RVS9o6HSrk6ir9/view

CASE 5: binary tree chain except up to 128 byte
I handled up to 128 byte so as to return quickly [11].
Comparing with the CASE 4, the size less than 128 byte improved from 4.0-6.0 Gbps
to 4.0-7.0 Gbps, but the size 512 byte degraded from 16.5 Gbps to 16.0 Gbps [12].

[11] https://github.com/NaohiroTamura/glibc/commit/fefc59f01ecfd6a207fe261de5ab133f4409d687#diff-03552f8369653866548b20e7867272a645fa2129c700b78fdfafe5a0ff6a259eR184-R195
[12] https://drive.google.com/file/d/1HS277_qQUuEeZqLUo0H2XRlFhOhIdI_o/view

In conclusion, I'd like to adopt the CASE 5 implementation, considering the
performance balance between the small size (less than 128 byte) and medium size
(close to 512 byte).

Thanks.
Naohiro

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-19  2:51     ` naohirot
@ 2021-04-19 14:57       ` Wilco Dijkstra via Libc-alpha
  2021-04-21 10:10         ` naohirot
  0 siblings, 1 reply; 41+ messages in thread
From: Wilco Dijkstra via Libc-alpha @ 2021-04-19 14:57 UTC (permalink / raw)
  To: naohirot@fujitsu.com; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Naohiro,

> Let me focus on the macro " shortcut_for_small_size" for small/medium, less than
> 512 byte in this mail. 

Yes, one subject at a time is a good idea.

> Comparing with the CASE 1, A64FX performance degraded from 4-14 Gbps to 3-10 Gbps [5]. 
> Please notice that "whilelt loop" implementation cannot be used for memmove,
> because it doesn't work for backward copy.

Indeed, the memmove code would need a similar loop but backwards. However it sounds like
small loops are not efficient (possibly a high taken branch penalty), so it's not a good option.

> In conclusion, I'd like to adopt the CASE 5 implementation, considering the
> performance balance between the small size (less than 128 byte) and medium size
> (close to 512 byte).

Yes something like this would work. I would strip out any unnecessary instructions and merge
multiple cases to avoid branches as much as possible. For example start memcpy like this:

memcpy:
   cntb        vector_length
   whilelo     p0.b, xzr, n    // gives a free ptrue for N >= VL
   whilelo     p1.b, vector_length, n
   b.last       1f
   ld1b        z0.b, p0/z, [src]
   ld1b        z1.b, p1/z, [src, #1, mul vl]
   st1b        z0.b, p0, [dest]
   st1b        z1.b, p1, [dest, #1, mul vl]
   ret

The proposed case 5 uses 13 instructions up to 64 bytes and 19 up to 128, the above 
does 0-127 bytes in 9 instructions. You can see the code is perfectly balanced, with
4 load/store instructions, 3 ALU instructions and 2 branches.

Rather than doing a complex binary search, we can use the same trick to merge the code
for 128-256 and 256-512. So overall we only need 2 comparisons which we can write like:

cmp n, vector_length, lsl 3

Like I mentioned before, it is a really good idea to run bench-memcpy-random since it
will clearly show issues with branch prediction on small copies. For memcpy and related
functions you want to minimize branches and only use branches that are heavily biased.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-19 14:57       ` Wilco Dijkstra via Libc-alpha
@ 2021-04-21 10:10         ` naohirot
  2021-04-21 15:02           ` Wilco Dijkstra via Libc-alpha
  0 siblings, 1 reply; 41+ messages in thread
From: naohirot @ 2021-04-21 10:10 UTC (permalink / raw)
  To: 'Wilco Dijkstra'; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Wilco-san,

This mail is a continuation of the macro " shortcut_for_small_size" for small/medium,
less than 512 byte.

> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

> Yes something like this would work. I would strip out any unnecessary instructions
> and merge multiple cases to avoid branches as much as possible. For example
> start memcpy like this:
> 
> memcpy:
>    cntb        vector_length
>    whilelo     p0.b, xzr, n    // gives a free ptrue for N >= VL
>    whilelo     p1.b, vector_length, n
>    b.last       1f
>    ld1b        z0.b, p0/z, [src]
>    ld1b        z1.b, p1/z, [src, #1, mul vl]
>    st1b        z0.b, p0, [dest]
>    st1b        z1.b, p1, [dest, #1, mul vl]
>    ret
> 
> The proposed case 5 uses 13 instructions up to 64 bytes and 19 up to 128, the
> above does 0-127 bytes in 9 instructions. You can see the code is perfectly
> balanced, with
> 4 load/store instructions, 3 ALU instructions and 2 branches.
> 
> Rather than doing a complex binary search, we can use the same trick to merge
> the code for 128-256 and 256-512. So overall we only need 2 comparisons which
> we can write like:
> 
> cmp n, vector_length, lsl 3

It's really smart way, isn't it? 😊
I re-implemented the macro " shortcut_for_small_size" using the whilelo, and
please check it [1][2] if understood correctly.
The performance of "whilelo dispatch" [3] is almost same as "binary tree dispatch" [4]
but I notice that there are gaps at 128 byte and at 256 byte [3].

[1] https://github.com/NaohiroTamura/glibc/commit/7491bcb36e5c497e509d35b1378fcc663595c2d0
[2] https://github.com/NaohiroTamura/glibc/blob/7491bcb36e5c497e509d35b1378fcc663595c2d0/sysdeps/aarch64/multiarch/memcpy_a64fx.S#L129-L174
[3] https://drive.google.com/file/d/10S6doDFiVtveqRZs-366E_yDzefe-zBS/view
[4] https://drive.google.com/file/d/1p5qPt0KLT4i3Iv_Uy9UT5zo0NetXK-RZ/view

> Like I mentioned before, it is a really good idea to run bench-memcpy-random
> since it will clearly show issues with branch prediction on small copies. For
> memcpy and related functions you want to minimize branches and only use
> branches that are heavily biased.

I checked bench-memcpy-random [5], but it measures the performance from the size
4K byte to 512K byte.
How do we know the branch issue for less than 512 byte?

[5] https://drive.google.com/file/d/1cRwaN9vu9q2Zm8xW6l6hp0GxVB1ZY-Tm/view

Thanks.
Naohiro

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-21 10:10         ` naohirot
@ 2021-04-21 15:02           ` Wilco Dijkstra via Libc-alpha
  2021-04-22 13:17             ` naohirot
  0 siblings, 1 reply; 41+ messages in thread
From: Wilco Dijkstra via Libc-alpha @ 2021-04-21 15:02 UTC (permalink / raw)
  To: naohirot@fujitsu.com; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Naohiro,

> It's really smart way, isn't it? 😊

Well that's the point of SVE!

> I re-implemented the macro " shortcut_for_small_size" using the whilelo, and
> please check it [1][2] if understood correctly.

Yes it works fine. You should still remove the check for zero at entry (which is really slow
and unnecessary) and the argument moves. L2 doesn't need the ptrue, all it needs
is MOV dest_ptr, dst.

> The performance of "whilelo dispatch" [3] is almost same as "binary tree dispatch" [4]
> but I notice that there are gaps at 128 byte and at 256 byte [3].

From what I can see, the new version is faster across the full range. It would be useful to show
both new and old in the same graph rather than separately. You can do that by copying the file
and use a different name for the functions. I do this all the time as it allows direct comparison
of several variants in one benchmark run.

That said, the dip at 256+64 looks fairly substantial. It could be throughput of WHILELO - to test
that you could try commenting out the long WHILELO sequence for the 256-512 byte case and
see whether it improves. If it is WHILELO, it is possible to remove 3x WHILELO from the earlier
cases by moving them after a branch (so that the 256-512 case only needs to execute 5x WHILELO
rather than 8 into total). Also it is worth checking if the 256-512 case beats jumping directly
to L(unroll4) - however note that code isn't optimized yet (eg. there is no need for complex
software pipelined loops since we can only iterate once!). If all that doesn't help, it may be
best to split into 256-384 and 384-512 so you only need 2x WHILELO.

> I checked bench-memcpy-random [5], but it measures the performance from the size
> 4K byte to 512K byte.
> How do we know the branch issue for less than 512 byte?

The size is the size of the memory region tested, not the size of the copies. The actual copies
are very small (90% are smaller than 128 bytes). The key is that it doesn't repeat the same copy
over and over so it's hard on the branch predictor just like in a real application.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-21 15:02           ` Wilco Dijkstra via Libc-alpha
@ 2021-04-22 13:17             ` naohirot
  2021-04-23  0:58               ` naohirot
  0 siblings, 1 reply; 41+ messages in thread
From: naohirot @ 2021-04-22 13:17 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Wilco-san,

Thanks for your review and advice!

> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

> Yes it works fine. You should still remove the check for zero at entry (which is
> really slow and unnecessary) and the argument moves. L2 doesn't need the ptrue,
> all it needs is MOV dest_ptr, dst.

Yes, I cleaned them up [1].
[1] https://github.com/NaohiroTamura/glibc/commit/fbee8284f6cea9671554249816f3ab2a14abeade

> > The performance of "whilelo dispatch" [3] is almost same as "binary
> > tree dispatch" [4] but I notice that there are gaps at 128 byte and at 256 byte [3].
> 
> From what I can see, the new version is faster across the full range. It would be
> useful to show both new and old in the same graph rather than separately. You can
> do that by copying the file and use a different name for the functions. I do this all
> the time as it allows direct comparison of several variants in one benchmark run.

Yes, I confirmed that "whilelo dispatch" is better than "binary tree dispatch".
I converted json data from bench-memcpy.out into csv using jq, and created Graph 1
in Google Sheet [2].

$ cat bench-memcpy.out | jq -r '.functions.memcpy.results| sort_by(.length) | .[]|[.length, .align1, .align2, .timings[5], .length/.timings[5]] | @csv' > bench-memcpy.csv

[2] https://docs.google.com/spreadsheets/d/19XYE63defjFEHZVqciZdmcDrJLWkRfGmSagXlIV2F-c/edit?usp=sharing

> That said, the dip at 256+64 looks fairly substantial. It could be throughput of
> WHILELO - to test that you could try commenting out the long WHILELO sequence
> for the 256-512 byte case and see whether it improves. 

I commented out WHILELO in 256-512 byte , and confirmed that it made the dip small [3].

[3] https://drive.google.com/file/d/13Q3OSUN3qXFiTNNkRVGnsNioUMEId1ge/view

>                                                      If it is WHILELO, it is
> possible to remove 3x WHILELO from the earlier cases by moving them after a
> branch (so that the 256-512 case only needs to execute 5x WHILELO rather than 8
> into total). 

As shown in Graph 2 in Google Sheet [2], this approach didn't make the dip small,
because I assume that we can reduce two WHILELO, but we needed to add two PTRUE.
I changed the code [1] like the following diff.

$ git diff
diff --git a/sysdeps/aarch64/multiarch/memcpy_a64fx.S b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
index 6d0ae1cd1f..2ae1f4e3b9 100644
--- a/sysdeps/aarch64/multiarch/memcpy_a64fx.S
+++ b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
@@ -139,12 +139,13 @@
 1:  // if rest > vector_length * 8
     cmp         n, vector_length, lsl 3 // vector_length * 8
     b.hi        \exit
+    cmp         n, vector_length, lsl 2 // vector_length * 4
+    b.hi        1f
     // if rest <= vector_length * 4
     lsl         tmp1, vector_length, 1  // vector_length * 2
     whilelo     p2.b, tmp1, n
     incb        tmp1
     whilelo     p3.b, tmp1, n
-    b.last      1f
     ld1b        z0.b, p0/z, [src, #0, mul vl]
     ld1b        z1.b, p1/z, [src, #1, mul vl]
     ld1b        z2.b, p2/z, [src, #2, mul vl]
@@ -155,6 +156,8 @@
     st1b        z3.b, p3, [dest, #3, mul vl]
     ret
 1:  // if rest <= vector_length * 8
+    ptrue       p2.b
+    ptrue       p3.b
     lsl         tmp1, vector_length, 2  // vector_length * 4
     whilelo     p4.b, tmp1, n
     incb        tmp1

>           Also it is worth checking if the 256-512 case beats jumping directly to
> L(unroll4) - however note that code isn't optimized yet (eg. there is no need for
> complex software pipelined loops since we can only iterate once!). 

I tried, but it didn't work for memmove, because L(unroll4) doesn't support
backward copy.

>                                                                                                             If all that
> doesn't help, it may be best to split into 256-384 and 384-512 so you only need 2x
> WHILELO.

This way [4] made the dip small as shown in Graph3 in Google Sheet [2].
So it seems that this is the way we should take.
 
[4] https://github.com/NaohiroTamura/glibc/commit/cbcb80e69325c16c6697c42627a6ca12c3245a86

> > I checked bench-memcpy-random [5], but it measures the performance
> > from the size 4K byte to 512K byte.
> > How do we know the branch issue for less than 512 byte?
> 
> The size is the size of the memory region tested, not the size of the copies. The
> actual copies are very small (90% are smaller than 128 bytes). The key is that it
> doesn't repeat the same copy over and over so it's hard on the branch predictor
> just like in a real application.

I see, I'll take a look at the source code more thoroughly.

Thanks.
Naohiro

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-22 13:17             ` naohirot
@ 2021-04-23  0:58               ` naohirot
  0 siblings, 0 replies; 41+ messages in thread
From: naohirot @ 2021-04-23  0:58 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Wilco-san,

Let me make one correction, I forgot about free ptrue to p0.b.

> From: Tamura, Naohiro/田村 直広 <naohirot@fujitsu.com>

> >                                                      If it is WHILELO,
> > it is possible to remove 3x WHILELO from the earlier cases by moving
> > them after a branch (so that the 256-512 case only needs to execute 5x
> > WHILELO rather than 8 into total).
> 
> As shown in Graph 2 in Google Sheet [2], this approach didn't make the dip small,
> because I assume that we can reduce two WHILELO, but we needed to add two
> PTRUE.

I didn't have to add two PTREU because of the free p0.b.
As shown in Graph 4 in Google Sheet [2], this approach without adding two PTRUE made
the dip small a little bit, but improvement is smaller than the last way [4] shown in Graph 3.
So the conclusion seems not to change.

[2] https://docs.google.com/spreadsheets/d/19XYE63defjFEHZVqciZdmcDrJLWkRfGmSagXlIV2F-c/edit?usp=sharing

The code without adding two PTRUE is like the following diff.

$ git diff
diff --git a/sysdeps/aarch64/multiarch/memcpy_a64fx.S b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
index 6d0ae1cd1f..c3779d0147 100644
--- a/sysdeps/aarch64/multiarch/memcpy_a64fx.S
+++ b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
@@ -139,12 +139,13 @@
 1:  // if rest > vector_length * 8
     cmp         n, vector_length, lsl 3 // vector_length * 8
     b.hi        \exit
+    cmp         n, vector_length, lsl 2 // vector_length * 4
+    b.hi        1f
     // if rest <= vector_length * 4
     lsl         tmp1, vector_length, 1  // vector_length * 2
     whilelo     p2.b, tmp1, n
     incb        tmp1
     whilelo     p3.b, tmp1, n
-    b.last      1f
     ld1b        z0.b, p0/z, [src, #0, mul vl]
     ld1b        z1.b, p1/z, [src, #1, mul vl]
     ld1b        z2.b, p2/z, [src, #2, mul vl]
@@ -165,16 +166,16 @@
     whilelo     p7.b, tmp1, n
     ld1b        z0.b, p0/z, [src, #0, mul vl]
     ld1b        z1.b, p1/z, [src, #1, mul vl]
-    ld1b        z2.b, p2/z, [src, #2, mul vl]
-    ld1b        z3.b, p3/z, [src, #3, mul vl]
+    ld1b        z2.b, p0/z, [src, #2, mul vl]
+    ld1b        z3.b, p0/z, [src, #3, mul vl]
     ld1b        z4.b, p4/z, [src, #4, mul vl]
     ld1b        z5.b, p5/z, [src, #5, mul vl]
     ld1b        z6.b, p6/z, [src, #6, mul vl]
     ld1b        z7.b, p7/z, [src, #7, mul vl]
     st1b        z0.b, p0, [dest, #0, mul vl]
     st1b        z1.b, p1, [dest, #1, mul vl]
-    st1b        z2.b, p2, [dest, #2, mul vl]
-    st1b        z3.b, p3, [dest, #3, mul vl]
+    st1b        z2.b, p0, [dest, #2, mul vl]
+    st1b        z3.b, p0, [dest, #3, mul vl]
     st1b        z4.b, p4, [dest, #4, mul vl]
     st1b        z5.b, p5, [dest, #5, mul vl]
     st1b        z6.b, p6, [dest, #6, mul vl]

> I changed the code [1] like the following diff.
> 
> $ git diff
> diff --git a/sysdeps/aarch64/multiarch/memcpy_a64fx.S
> b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
> index 6d0ae1cd1f..2ae1f4e3b9 100644
> --- a/sysdeps/aarch64/multiarch/memcpy_a64fx.S
> +++ b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
> @@ -139,12 +139,13 @@
>  1:  // if rest > vector_length * 8
>      cmp         n, vector_length, lsl 3 // vector_length * 8
>      b.hi        \exit
> +    cmp         n, vector_length, lsl 2 // vector_length * 4
> +    b.hi        1f
>      // if rest <= vector_length * 4
>      lsl         tmp1, vector_length, 1  // vector_length * 2
>      whilelo     p2.b, tmp1, n
>      incb        tmp1
>      whilelo     p3.b, tmp1, n
> -    b.last      1f
>      ld1b        z0.b, p0/z, [src, #0, mul vl]
>      ld1b        z1.b, p1/z, [src, #1, mul vl]
>      ld1b        z2.b, p2/z, [src, #2, mul vl]
> @@ -155,6 +156,8 @@
>      st1b        z3.b, p3, [dest, #3, mul vl]
>      ret
>  1:  // if rest <= vector_length * 8
> +    ptrue       p2.b
> +    ptrue       p3.b
>      lsl         tmp1, vector_length, 2  // vector_length * 4
>      whilelo     p4.b, tmp1, n
>      incb        tmp1

> > If all that doesn't help, it may be best to split into 256-384 and
> > 384-512 so you only need 2x WHILELO.
> 
> This way [4] made the dip small as shown in Graph3 in Google Sheet [2].
> So it seems that this is the way we should take.
> 
> [4]
> https://github.com/NaohiroTamura/glibc/commit/cbcb80e69325c16c6697c4262
> 7a6ca12c3245a86

Thanks.
Naohiro


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-14 16:02   ` Wilco Dijkstra via Libc-alpha
  2021-04-15 12:20     ` naohirot
  2021-04-19  2:51     ` naohirot
@ 2021-04-19 12:43     ` naohirot
  2021-04-20  3:31     ` naohirot
                       ` (2 subsequent siblings)
  5 siblings, 0 replies; 41+ messages in thread
From: naohirot @ 2021-04-19 12:43 UTC (permalink / raw)
  To: 'Wilco Dijkstra'; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Wilco-san,

Let me focus on L1_prefetch in this mail.

> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

> > Memcpy/memmove uses 8, 4, 2 unrolls, and memset uses 32, 8, 4, 2 unrolls.
> > This unroll configuration recorded the highest performance.

When I tested "4 unrolls", I modified the source code [1][2] in the mail [0]
such as followings:
in case of memcpy, 
   I commented out L(unroll8), L(unroll2), and left L(unroll4), L(unroll1) and L(last),
In case of memmove,
   I commented out L(bwd_unroll8), L(bwd_unroll2), and left L(bwd_unroll4), L(bwd_unroll1) and L(bwd_last),
In case of memset, 
   I commented out L(unroll32), L(unroll8), L(unroll2), and left L(unroll4), L(unroll1) and L(last).

[0] https://sourceware.org/pipermail/libc-alpha/2021-April/125002.html
[1] https://github.com/NaohiroTamura/glibc/blob/ec0b55a855529f75bd6f280e59dc2b1c25640490/sysdeps/aarch64/multiarch/memcpy_a64fx.S
[2] https://github.com/NaohiroTamura/glibc/blob/ec0b55a855529f75bd6f280e59dc2b1c25640490/sysdeps/aarch64/multiarch/memset_a64fx.S

> > In case that Memcpy/memmove uses 4 unrolls, and memset uses 4 unrolls,
> > The performance degraded minus 5 to 15 Gbps/sec at the peak.
> 
> So this is the L(L1_vl_64) loop right? I guess the problem is the large number of

So this is NOT the L(L1_vl_64) loop, but L(vl_agnostic).

> prefetches and all the extra code that is not strictly required (you can remove 5
> redundant mov/cmp instructions from the loop). Also assuming prefetching helps
> here (the good memmove results suggest it's not needed), prefetching directly
> into L1 should be better than first into L2 and then into L1. So I don't see a good
> reason why 4x unrolling would have to be any slower.

I tried to remove L(L1_prefetch) from both memcpy and memset, and also
I tried to remove L2 prefetch instructions (prfm pstl2keep and pldl2keep) in
L(L1_prefetch) from both memcpy and memset.

In case of memcpy, both removing L(L1_prefetch)[3] and removing L2 prefetch
instruction from L(L1_prefetch) increased the performance of the size range 64KB-4MB
from 18-20 GB/sec [4] to 20-22 GB/sec [5].

[3] https://github.com/NaohiroTamura/glibc/commit/22612299247e64dbffd62aa186513bde7328d104
[4] https://drive.google.com/file/d/1hGWz4eAYWc1ktdw74rzDPxtQQ48P0-Hv/view
[5] https://drive.google.com/file/d/11Pt1mWSCN2LBPHxXUE-rs7Q6JhtBfpyQ/view

In case of memset, removing L(L1_prefetch)[6] decreased the performance of the size range
128KB-4MB from 22-24 GB/sec [7] to 20-22 GB/sec[8].
But removing L2 prefetch instruction (prfm pstl2keep) in L(L1_prefetch) [9] kept the same
performance of the size range 128KB-4MB as 22-24 GB/sec [10].

[6] https://github.com/NaohiroTamura/glibc/blob/22612299247e64dbffd62aa186513bde7328d104/sysdeps/aarch64/multiarch/memset_a64fx.S#L146-L163
   Commented out L146-L163, I didn't commit because of decreasing the performance.
[7] https://drive.google.com/file/d/1MT1d2aBxSoYrzQuRZtv4U9NCXV4ZwHsJ/view
[8] https://drive.google.com/file/d/1qUzYklLvgXTZbP1wm9n4VryF3bgUOplo/view
[9] https://github.com/NaohiroTamura/glibc/commit/cc478c96bac051c9b98b9d9a1ae6f38326f77645
[10] https://drive.google.com/file/d/1bPKHFWyhzNWXX7A_S6_UpZ2BwP2QAJK4/view

In conclusion, I adopt to remove L(L1_prefetch) from memcpy [3] and to remove L2 prefetch
instruction (prfm pstl2keep) from L(L1_prefetch) [9].

Thanks.
Naohiro

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-14 16:02   ` Wilco Dijkstra via Libc-alpha
                       ` (2 preceding siblings ...)
  2021-04-19 12:43     ` naohirot
@ 2021-04-20  3:31     ` naohirot
  2021-04-20 14:44       ` Wilco Dijkstra via Libc-alpha
  2021-04-20  5:49     ` naohirot
  2021-04-23 13:22     ` naohirot
  5 siblings, 1 reply; 41+ messages in thread
From: naohirot @ 2021-04-20  3:31 UTC (permalink / raw)
  To: 'Wilco Dijkstra'; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Wilco-san,

Let me focus on DC_ZVA and L1/L2 prefetch in this mail.

> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

> > Without DC_VZA and L2 prefetch, memcpy and memset performance degraded
> over 4MB.
> 
> > DC_VZA and L2 prefetch have to be pair, only DC_VZA or only L2 prefetch
> doesn't get any improvement.
> 
> That seems odd. Was that using the L1 prefetch with the L2 distance? It seems to
> me one of the L1 or L2 prefetches is unnecessary. 

I tested the following 4 cases.
The result was that Case 4 is the best.
Case 2 and 3 were almost same as Case 1.
Case 4 [1] improved the performance in the size range more than 4MB from Case 1
7.5-10 GB/sec [2] to 10-10.5 GB/sec [3].

Case 1: DC_ZVA + L1 prefetch + L2 + prefetch [2]
Case 2: DC_ZVA + L1 prefetch
Case 3: DC_ZVA + L2 prefetch
Case 4: DC_ZVA only [3]

[1] https://github.com/NaohiroTamura/glibc/commit/d57bed764a45383dfea8265d6a384646f4f07eed
[2] https://drive.google.com/file/d/1ws3lTLzMFK3lLrrwxVFvriERrs-IKdP9/view
[3] https://drive.google.com/file/d/1g7nuFOtkFw3b5INcAfuuv2lVODmASm-G/view


>                                                Also why would the DC_ZVA
> need to be done so early? It seems to me that cleaning the cacheline just before
> you write it works best since that avoids accidentally replacing it.
> 

Yes, I moved it closer, please look at the change [1].

> > Without DC_VZA and L2 prefetch, memmove didn't degraded over 4MB.
> >
> > The reason why I didn't implement DC_VZA and L2 prefetch is that
> > memmove calls memcpy in most cases, and memmove code only handles
> backward copy.
> > Maybe most of memmove-large benchtest cases are backward copy, I need to
> check.
> 
> Most of the memmove tests do indeed overlap (so DC_ZVA does not work).
> However it also shows that it performs well across the L2 cache size range
> without any prefetch or DC_ZVA.

That's right, I confirmed that only DC_ZVA was necessary [1].

Next, I'll remove redundant instructions.

Thanks.
Naohiro



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-20  3:31     ` naohirot
@ 2021-04-20 14:44       ` Wilco Dijkstra via Libc-alpha
  2021-04-27  9:01         ` naohirot
  0 siblings, 1 reply; 41+ messages in thread
From: Wilco Dijkstra via Libc-alpha @ 2021-04-20 14:44 UTC (permalink / raw)
  To: naohirot@fujitsu.com; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Naohiro,

> Case 4 [1] improved the performance in the size range more than 4MB from Case 1
> 7.5-10 GB/sec [2] to 10-10.5 GB/sec [3].
>
> Case 1: DC_ZVA + L1 prefetch + L2 + prefetch [2]
> Case 2: DC_ZVA + L1 prefetch
> Case 3: DC_ZVA + L2 prefetch
> Case 4: DC_ZVA only [3]

That is great news - it simplifies the loop a lot, and it is faster too!

>>                                                Also why would the DC_ZVA
>> need to be done so early? It seems to me that cleaning the cacheline just before
>> you write it works best since that avoids accidentally replacing it.
>> 
>
> Yes, I moved it closer, please look at the change [1].

What I meant is, why is ZF_DIST so huge? I don't see how that helps. Is there any penalty
if we did it like this (or possibly with 1-2 cachelines offset)?

    dc           zva, dest_ptr
    st1b        z0.b, p0,   [dest_ptr, #0, mul vl]
    st1b        z1.b, p0,   [dest_ptr, #1, mul vl]
    st1b        z2.b, p0,   [dest_ptr, #2, mul vl]
    st1b        z3.b, p0,   [dest_ptr, #3, mul vl]

This would remove almost all initialization code from the start of L(L2_dc_zva).

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-20 14:44       ` Wilco Dijkstra via Libc-alpha
@ 2021-04-27  9:01         ` naohirot
  0 siblings, 0 replies; 41+ messages in thread
From: naohirot @ 2021-04-27  9:01 UTC (permalink / raw)
  To: 'Wilco Dijkstra'; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Wilco-san,

I focus on the zero fill distance in this mail.

> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

> >>                                                Also why
> would the
> >>DC_ZVA  need to be done so early? It seems to me that cleaning the
> >>cacheline just before  you write it works best since that avoids accidentally
> replacing it.
> >>
> >
> > Yes, I moved it closer, please look at the change [1].
> 
> What I meant is, why is ZF_DIST so huge? I don't see how that helps. Is there any
> penalty if we did it like this (or possibly with 1-2 cachelines offset)?
> 
>     dc           zva, dest_ptr
>     st1b        z0.b, p0,   [dest_ptr, #0, mul vl]
>     st1b        z1.b, p0,   [dest_ptr, #1, mul vl]
>     st1b        z2.b, p0,   [dest_ptr, #2, mul vl]
>     st1b        z3.b, p0,   [dest_ptr, #3, mul vl]

I tested several zero fill distance for memcpy and memset including 1-2 cachelines offset.
As shown in Graph1 and Graph2 of Google Sheet [1], the most suitable zero fill
distance of both memcpy and memset was 21 cachelinees offset.
ZF21 means Zero Fill distance is  21 * cachelines offset in Graph1 and Graph2.
So I updated both memcpy and memset source code [2][3].

[1] https://docs.google.com/spreadsheets/d/1qXWHc-OXl2E9Q9vWUl4R4eM00k02eij6eMAhXYUFVoI/edit
[2] https://github.com/NaohiroTamura/glibc/commit/5e7f737a270334ec0f86c0228f90000bf9a2cf00
[3] https://github.com/NaohiroTamura/glibc/commit/42334cb84419603003977eb77783bf407cb75072

Thanks.
Naohiro


^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-14 16:02   ` Wilco Dijkstra via Libc-alpha
                       ` (3 preceding siblings ...)
  2021-04-20  3:31     ` naohirot
@ 2021-04-20  5:49     ` naohirot
  2021-04-20 11:39       ` Wilco Dijkstra via Libc-alpha
  2021-04-23 13:22     ` naohirot
  5 siblings, 1 reply; 41+ messages in thread
From: naohirot @ 2021-04-20  5:49 UTC (permalink / raw)
  To: 'Wilco Dijkstra'; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Wilco-san,

Let me focus on removing redundant instructions in this mail.

> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

> It is also possible to remove a lot of unnecessary code - eg. rather than use 2
> instructions per prefetch, merge the constant offset in the prefetch instruction
> itself (since they allow up to 32KB offset). There are also lots of branches that
> skip a few instructions if a value is zero, this is often counterproductive due to
> adding branch mispredictions.

I removed redundant instructions using cbz and prfm offset address [1][2].

[1] https://github.com/NaohiroTamura/glibc/commit/94363b4ab2e5b4b29843a47a6970b9645a8e4eeb
[2] https://github.com/NaohiroTamura/glibc/commit/4648eb559e46d978ded65d40c6bf8c38dd2519d7

Thanks.
Naohiro


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-20  5:49     ` naohirot
@ 2021-04-20 11:39       ` Wilco Dijkstra via Libc-alpha
  2021-04-27 11:03         ` naohirot
  0 siblings, 1 reply; 41+ messages in thread
From: Wilco Dijkstra via Libc-alpha @ 2021-04-20 11:39 UTC (permalink / raw)
  To: naohirot@fujitsu.com; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Haohiro,

> I removed redundant instructions using cbz and prfm offset address [1][2].
>
> [1] https://github.com/NaohiroTamura/glibc/commit/94363b4ab2e5b4b29843a47a6970b9645a8e4eeb
> [2] https://github.com/NaohiroTamura/glibc/commit/4648eb559e46d978ded65d40c6bf8c38dd2519d7

For the first 2 CBZ cases in both [1] and [2] the fastest option is to use ANDS+BEQ. ANDS only
requires 1 ALU operation while AND+CBZ uses 2 ALU operations on A64FX.

Wilco

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-20 11:39       ` Wilco Dijkstra via Libc-alpha
@ 2021-04-27 11:03         ` naohirot
  0 siblings, 0 replies; 41+ messages in thread
From: naohirot @ 2021-04-27 11:03 UTC (permalink / raw)
  To: 'Wilco Dijkstra'; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Wilco-san,

This mail is a continuation of removing redundant instructions.

> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
 
> For the first 2 CBZ cases in both [1] and [2] the fastest option is to use
> ANDS+BEQ. ANDS only requires 1 ALU operation while AND+CBZ uses 2 ALU
> operations on A64FX.

I see, I haven't used ANDS before. Thanks for the advice.
I updated memcpy[1] and memset[2].

[1] https://github.com/NaohiroTamura/glibc/commit/fca2c1cf1fd80ec7ecb93f7cd08be9aab9ca9412
[2] https://github.com/NaohiroTamura/glibc/commit/5004e34c35a20faf3e12e6ce915845a75b778cbf

Thanks.
Naohiro


^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-04-14 16:02   ` Wilco Dijkstra via Libc-alpha
                       ` (4 preceding siblings ...)
  2021-04-20  5:49     ` naohirot
@ 2021-04-23 13:22     ` naohirot
  5 siblings, 0 replies; 41+ messages in thread
From: naohirot @ 2021-04-23 13:22 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: Szabolcs Nagy, 'GNU C Library'

Hi Wilco-san,

Let me re-evaluate the loop unrolling/software pipelining of L(vl_agnostic) for the size
512B-4MB using the latest source code [2] with all graphs [3] in this mail.
The early evaluation was reported in the mail [1] but all graphs were not provided.

[1] https://sourceware.org/pipermail/libc-alpha/2021-April/125002.html
[2] https://github.com/NaohiroTamura/glibc/commit/cbcb80e69325c16c6697c42627a6ca12c3245a86
[3] https://docs.google.com/spreadsheets/d/1leFhCAirelDezb0OFC7cr7v4uMUMveaN1iAxL410D2c/edit?usp=sharing

> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

> Yes that is a good idea - you could also check whether the software pipelining
> actually helps on an OoO core (it shouldn't) since that contributes a lot to the
> complexity and the amount of code and unrolling required.

I compared each unrolls by commenting out upper labels of the target label.
For example, if the target labels is L(unroll4) of memset, L(unroll32) and L(unroll8)
are commented out, and L(unroll4), L(unroll2), and L(unroll1) are executed.
Regarding memcpy/memmove, among L(unroll8), L(unroll4), L(unroll2), and L(unroll1).
Regarding memset, among L(unroll32), L(unroll8), L(unroll4), L(unroll2), and L(unroll1) .

The result was that 8 unrolling/pipelining for memcpy/memmove and 32
unrolling/pipelining for memset are still effective between the size 512B-64KB
as shown in the graphs in Google Sheet [3]
In conclusion, it seems the loop unrolling/software pipelining technique still works
in case of A64FX. It may be a peculiar characteristic of A64FX, I believe.

Thanks.
Naohiro

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
@ 2021-03-17  2:28 Naohiro Tamura
  2021-03-29 12:03 ` Szabolcs Nagy via Libc-alpha
  2021-05-10  1:45 ` naohirot
  0 siblings, 2 replies; 41+ messages in thread
From: Naohiro Tamura @ 2021-03-17  2:28 UTC (permalink / raw)
  To: libc-alpha

Fujitsu is in the process of signing the copyright assignment paper.
We'd like to have some feedback in advance.

This series of patches optimize the performance of
memcpy/memmove/memset for A64FX [1] which implements ARMv8-A SVE and
has L1 64KB cache per core and L2 8MB cache per NUMA node.

The first patch is an update of autoconf to check if assembler is
capable for ARMv8-A SVE code generation or not, and then define
HAVE_SVE_ASM_SUPPORT macro.

The second patch is memcpy/memmove performance optimization which makes
use of Scalable Vector Register with several techniques such as
loop unrolling, memory access alignment, cache zero fill, prefetch,
and software pipelining.

The third patch is memset performance optimization which makes
use of Scalable Vector Register with several techniques such as
loop unrolling, memory access alignment, cache zero fill, and
prefetch.

The forth patch is a test helper script to change Vector Length for
child process. This script can be used as test-wrapper for 'make
check'

The fifth patch is to add generic_memcpy and generic_memmove to
bench-memcpy-large.c and bench-memmove-large.c respectively so that we
can compare performance between 512 bit scalable vector register with
scalar 64 bit register consistently among memcpy/memmove/memset
default and large benchtests.

SVE assembler code for memcpy/memmove/memset is implemented as Vector
Length Agnostic code so theoretically it can be run on any SOC which
supports ARMv8-A SVE standard.

We confirmed that all testcases have been passed by running 'make
check' and 'make xcheck' not only on A64FX but also on ThunderX2.

And also we confirmed that the SVE 512 bit vector register performance
is roughly 4 times better than Advanced SIMD 128 bit register and 8
times better than scalar 64 bit register by running 'make bench'.

[1] https://github.com/fujitsu/A64FX

Naohiro Tamura (5):
  config: Added HAVE_SVE_ASM_SUPPORT for aarch64
  aarch64: Added optimized memcpy and memmove for A64FX
  aarch64: Added optimized memset for A64FX
  scripts: Added Vector Length Set test helper script
  benchtests: Added generic_memcpy and generic_memmove to large
    benchtests

 benchtests/bench-memcpy-large.c               |   9 +
 benchtests/bench-memmove-large.c              |   9 +
 config.h.in                                   |   3 +
 manual/tunables.texi                          |   3 +-
 scripts/vltest.py                             |  82 ++
 sysdeps/aarch64/configure                     |  28 +
 sysdeps/aarch64/configure.ac                  |  15 +
 sysdeps/aarch64/multiarch/Makefile            |   3 +-
 sysdeps/aarch64/multiarch/ifunc-impl-list.c   |  17 +-
 sysdeps/aarch64/multiarch/init-arch.h         |   4 +-
 sysdeps/aarch64/multiarch/memcpy.c            |  12 +-
 sysdeps/aarch64/multiarch/memcpy_a64fx.S      | 979 ++++++++++++++++++
 sysdeps/aarch64/multiarch/memmove.c           |  12 +-
 sysdeps/aarch64/multiarch/memset.c            |  11 +-
 sysdeps/aarch64/multiarch/memset_a64fx.S      | 574 ++++++++++
 .../unix/sysv/linux/aarch64/cpu-features.c    |   4 +
 .../unix/sysv/linux/aarch64/cpu-features.h    |   4 +
 17 files changed, 1759 insertions(+), 10 deletions(-)
 create mode 100755 scripts/vltest.py
 create mode 100644 sysdeps/aarch64/multiarch/memcpy_a64fx.S
 create mode 100644 sysdeps/aarch64/multiarch/memset_a64fx.S

-- 
2.17.1

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-03-17  2:28 Naohiro Tamura
@ 2021-03-29 12:03 ` Szabolcs Nagy via Libc-alpha
  2021-05-10  1:45 ` naohirot
  1 sibling, 0 replies; 41+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-03-29 12:03 UTC (permalink / raw)
  To: Naohiro Tamura; +Cc: libc-alpha

The 03/17/2021 02:28, Naohiro Tamura wrote:
> Fujitsu is in the process of signing the copyright assignment paper.
> We'd like to have some feedback in advance.

thanks for these patches, please let me know when the
copyright is sorted out. i will do some review now.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-03-17  2:28 Naohiro Tamura
  2021-03-29 12:03 ` Szabolcs Nagy via Libc-alpha
@ 2021-05-10  1:45 ` naohirot
  2021-05-14 13:35   ` Szabolcs Nagy via Libc-alpha
  1 sibling, 1 reply; 41+ messages in thread
From: naohirot @ 2021-05-10  1:45 UTC (permalink / raw)
  To: Szabolcs Nagy, Wilco Dijkstra, Florian Weimer; +Cc: libc-alpha@sourceware.org

Hi Szabolcs, Wilco, Florian,

> From: Naohiro Tamura <naohirot@fujitsu.com>
> Sent: Wednesday, March 17, 2021 11:29 AM
 
> Fujitsu is in the process of signing the copyright assignment paper.
> We'd like to have some feedback in advance.

FYI: Fujitsu has submitted the signed assignment finally.

Thanks.
Naohiro


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-05-10  1:45 ` naohirot
@ 2021-05-14 13:35   ` Szabolcs Nagy via Libc-alpha
  2021-05-19  0:11     ` naohirot
  0 siblings, 1 reply; 41+ messages in thread
From: Szabolcs Nagy via Libc-alpha @ 2021-05-14 13:35 UTC (permalink / raw)
  To: naohirot@fujitsu.com, Carlos O'Donell
  Cc: Florian Weimer, libc-alpha@sourceware.org, Wilco Dijkstra

The 05/10/2021 01:45, naohirot@fujitsu.com wrote:
> FYI: Fujitsu has submitted the signed assignment finally.

Carlos, can we commit patches from fujitsu now?
(i dont know if we are still waiting for something)

^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
  2021-05-14 13:35   ` Szabolcs Nagy via Libc-alpha
@ 2021-05-19  0:11     ` naohirot
  0 siblings, 0 replies; 41+ messages in thread
From: naohirot @ 2021-05-19  0:11 UTC (permalink / raw)
  To: 'Szabolcs Nagy', Carlos O'Donell
  Cc: Florian Weimer, libc-alpha@sourceware.org, Wilco Dijkstra

Hi Szabolcs, Carlos,

> From: Szabolcs Nagy <Szabolcs.Nagy@arm.com>
> Sent: Friday, May 14, 2021 10:36 PM
> 
> The 05/10/2021 01:45, naohirot@fujitsu.com wrote:
> > FYI: Fujitsu has submitted the signed assignment finally.
> 
> Carlos, can we commit patches from fujitsu now?
> (i dont know if we are still waiting for something)

Fujitsu has received FSF signed assignment.
So the contract process has completed.

Thanks.
Naohiro


^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2021-05-19  0:12 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-12 12:52 [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX Wilco Dijkstra via Libc-alpha
2021-04-12 18:53 ` Florian Weimer
2021-04-13 12:07 ` naohirot
2021-04-14 16:02   ` Wilco Dijkstra via Libc-alpha
2021-04-15 12:20     ` naohirot
2021-04-20 16:00       ` Wilco Dijkstra via Libc-alpha
2021-04-27 11:58         ` naohirot
2021-04-29 15:13           ` Wilco Dijkstra via Libc-alpha
2021-04-30 15:01             ` Szabolcs Nagy via Libc-alpha
2021-04-30 15:23               ` Wilco Dijkstra via Libc-alpha
2021-04-30 15:30                 ` Florian Weimer via Libc-alpha
2021-04-30 15:40                   ` Wilco Dijkstra via Libc-alpha
2021-05-04  7:56                     ` Szabolcs Nagy via Libc-alpha
2021-05-04 10:17                       ` Florian Weimer via Libc-alpha
2021-05-04 10:38                         ` Wilco Dijkstra via Libc-alpha
2021-05-04 10:42                         ` Szabolcs Nagy via Libc-alpha
2021-05-04 11:07                           ` Florian Weimer via Libc-alpha
2021-05-06 10:01             ` naohirot
2021-05-06 14:26               ` Szabolcs Nagy via Libc-alpha
2021-05-06 15:09                 ` Florian Weimer via Libc-alpha
2021-05-06 17:31               ` Wilco Dijkstra via Libc-alpha
2021-05-07 12:31                 ` naohirot
2021-04-19  2:51     ` naohirot
2021-04-19 14:57       ` Wilco Dijkstra via Libc-alpha
2021-04-21 10:10         ` naohirot
2021-04-21 15:02           ` Wilco Dijkstra via Libc-alpha
2021-04-22 13:17             ` naohirot
2021-04-23  0:58               ` naohirot
2021-04-19 12:43     ` naohirot
2021-04-20  3:31     ` naohirot
2021-04-20 14:44       ` Wilco Dijkstra via Libc-alpha
2021-04-27  9:01         ` naohirot
2021-04-20  5:49     ` naohirot
2021-04-20 11:39       ` Wilco Dijkstra via Libc-alpha
2021-04-27 11:03         ` naohirot
2021-04-23 13:22     ` naohirot
  -- strict thread matches above, loose matches on Subject: below --
2021-03-17  2:28 Naohiro Tamura
2021-03-29 12:03 ` Szabolcs Nagy via Libc-alpha
2021-05-10  1:45 ` naohirot
2021-05-14 13:35   ` Szabolcs Nagy via Libc-alpha
2021-05-19  0:11     ` naohirot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).