From: "naohirot@fujitsu.com" <naohirot@fujitsu.com>
To: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Cc: Szabolcs Nagy <Szabolcs.Nagy@arm.com>,
'GNU C Library' <libc-alpha@sourceware.org>
Subject: RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
Date: Thu, 22 Apr 2021 13:17:46 +0000
Message-ID: <TYAPR01MB60256E24084F4F6FBFDA4EE6DF469@TYAPR01MB6025.jpnprd01.prod.outlook.com>
In-Reply-To: <VE1PR08MB5599992190ACACB4CBFC673A83479@VE1PR08MB5599.eurprd08.prod.outlook.com>
Hi Wilco-san,
Thanks for your review and advice!
> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
> Yes it works fine. You should still remove the check for zero at entry (which is
> really slow and unnecessary) and the argument moves. L2 doesn't need the ptrue,
> all it needs is MOV dest_ptr, dst.
Yes, I cleaned them up [1].
[1] https://github.com/NaohiroTamura/glibc/commit/fbee8284f6cea9671554249816f3ab2a14abeade
> > The performance of "whilelo dispatch" [3] is almost same as "binary
> > tree dispatch" [4] but I notice that there are gaps at 128 byte and at 256 byte [3].
>
> From what I can see, the new version is faster across the full range. It would be
> useful to show both new and old in the same graph rather than separately. You can
> do that by copying the file and use a different name for the functions. I do this all
> the time as it allows direct comparison of several variants in one benchmark run.
Yes, I confirmed that "whilelo dispatch" is better than "binary tree dispatch".
I converted the JSON data in bench-memcpy.out into CSV using jq, and created Graph 1
in the Google Sheet [2].
$ cat bench-memcpy.out | jq -r '.functions.memcpy.results| sort_by(.length) | .[]|[.length, .align1, .align2, .timings[5], .length/.timings[5]] | @csv' > bench-memcpy.csv
[2] https://docs.google.com/spreadsheets/d/19XYE63defjFEHZVqciZdmcDrJLWkRfGmSagXlIV2F-c/edit?usp=sharing
> That said, the dip at 256+64 looks fairly substantial. It could be throughput of
> WHILELO - to test that you could try commenting out the long WHILELO sequence
> for the 256-512 byte case and see whether it improves.
I commented out the WHILELO sequence in the 256-512 byte case, and confirmed that it made the dip smaller [3].
[3] https://drive.google.com/file/d/13Q3OSUN3qXFiTNNkRVGnsNioUMEId1ge/view
> If it is WHILELO, it is
> possible to remove 3x WHILELO from the earlier cases by moving them after a
> branch (so that the 256-512 case only needs to execute 5x WHILELO rather than 8
> in total).
As shown in Graph 2 in the Google Sheet [2], this approach didn't make the dip smaller;
I assume that is because, although we could remove two WHILELO instructions, we had to
add two PTRUE instructions in exchange.
I changed the code [1] as in the following diff.
$ git diff
diff --git a/sysdeps/aarch64/multiarch/memcpy_a64fx.S b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
index 6d0ae1cd1f..2ae1f4e3b9 100644
--- a/sysdeps/aarch64/multiarch/memcpy_a64fx.S
+++ b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
@@ -139,12 +139,13 @@
1: // if rest > vector_length * 8
cmp n, vector_length, lsl 3 // vector_length * 8
b.hi \exit
+ cmp n, vector_length, lsl 2 // vector_length * 4
+ b.hi 1f
// if rest <= vector_length * 4
lsl tmp1, vector_length, 1 // vector_length * 2
whilelo p2.b, tmp1, n
incb tmp1
whilelo p3.b, tmp1, n
- b.last 1f
ld1b z0.b, p0/z, [src, #0, mul vl]
ld1b z1.b, p1/z, [src, #1, mul vl]
ld1b z2.b, p2/z, [src, #2, mul vl]
@@ -155,6 +156,8 @@
st1b z3.b, p3, [dest, #3, mul vl]
ret
1: // if rest <= vector_length * 8
+ ptrue p2.b
+ ptrue p3.b
lsl tmp1, vector_length, 2 // vector_length * 4
whilelo p4.b, tmp1, n
incb tmp1
> Also it is worth checking if the 256-512 case beats jumping directly to
> L(unroll4) - however note that code isn't optimized yet (eg. there is no need for
> complex software pipelined loops since we can only iterate once!).
I tried, but it didn't work for memmove, because L(unroll4) doesn't support
backward copy.
> If all that
> doesn't help, it may be best to split into 256-384 and 384-512 so you only need 2x
> WHILELO.
This approach [4] made the dip smaller, as shown in Graph 3 in the Google Sheet [2],
so it seems to be the way we should take.
[4] https://github.com/NaohiroTamura/glibc/commit/cbcb80e69325c16c6697c42627a6ca12c3245a86
> > I checked bench-memcpy-random [5], but it measures the performance
> > from the size 4K byte to 512K byte.
> > How do we know the branch issue for less than 512 byte?
>
> The size is the size of the memory region tested, not the size of the copies. The
> actual copies are very small (90% are smaller than 128 bytes). The key is that it
> doesn't repeat the same copy over and over so it's hard on the branch predictor
> just like in a real application.
I see, I'll take a look at the source code more thoroughly.
Thanks.
Naohiro