From: "naohirot@fujitsu.com" <naohirot@fujitsu.com>
To: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Cc: Szabolcs Nagy <Szabolcs.Nagy@arm.com>,
'GNU C Library' <libc-alpha@sourceware.org>
Subject: RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
Date: Thu, 22 Apr 2021 13:17:46 +0000
Message-ID: <TYAPR01MB60256E24084F4F6FBFDA4EE6DF469@TYAPR01MB6025.jpnprd01.prod.outlook.com>
In-Reply-To: <VE1PR08MB5599992190ACACB4CBFC673A83479@VE1PR08MB5599.eurprd08.prod.outlook.com>
Hi Wilco-san,
Thanks for your review and advice!
> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
> Yes it works fine. You should still remove the check for zero at entry (which is
> really slow and unnecessary) and the argument moves. L2 doesn't need the ptrue,
> all it needs is MOV dest_ptr, dst.
Yes, I cleaned them up [1].
[1] https://github.com/NaohiroTamura/glibc/commit/fbee8284f6cea9671554249816f3ab2a14abeade
> > The performance of "whilelo dispatch" [3] is almost same as "binary
> > tree dispatch" [4] but I notice that there are gaps at 128 byte and at 256 byte [3].
>
> From what I can see, the new version is faster across the full range. It would be
> useful to show both new and old in the same graph rather than separately. You can
> do that by copying the file and use a different name for the functions. I do this all
> the time as it allows direct comparison of several variants in one benchmark run.
Yes, I confirmed that "whilelo dispatch" is better than "binary tree dispatch".
I converted the JSON data in bench-memcpy.out into CSV using jq, and created Graph 1
in the Google Sheet [2].
$ cat bench-memcpy.out | jq -r '.functions.memcpy.results| sort_by(.length) | .[]|[.length, .align1, .align2, .timings[5], .length/.timings[5]] | @csv' > bench-memcpy.csv
[2] https://docs.google.com/spreadsheets/d/19XYE63defjFEHZVqciZdmcDrJLWkRfGmSagXlIV2F-c/edit?usp=sharing
> That said, the dip at 256+64 looks fairly substantial. It could be throughput of
> WHILELO - to test that you could try commenting out the long WHILELO sequence
> for the 256-512 byte case and see whether it improves.
I commented out the WHILELO sequence in the 256-512 byte case, and confirmed that it made the dip smaller [3].
[3] https://drive.google.com/file/d/13Q3OSUN3qXFiTNNkRVGnsNioUMEId1ge/view
> If it is WHILELO, it is
> possible to remove 3x WHILELO from the earlier cases by moving them after a
> branch (so that the 256-512 case only needs to execute 5x WHILELO rather than 8
> in total).
As shown in Graph 2 in the Google Sheet [2], this approach didn't make the dip smaller;
I assume that is because, although we could remove two WHILELO instructions, we had to
add two PTRUE instructions in exchange.
I changed the code [1] as in the following diff.
$ git diff
diff --git a/sysdeps/aarch64/multiarch/memcpy_a64fx.S b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
index 6d0ae1cd1f..2ae1f4e3b9 100644
--- a/sysdeps/aarch64/multiarch/memcpy_a64fx.S
+++ b/sysdeps/aarch64/multiarch/memcpy_a64fx.S
@@ -139,12 +139,13 @@
1: // if rest > vector_length * 8
cmp n, vector_length, lsl 3 // vector_length * 8
b.hi \exit
+ cmp n, vector_length, lsl 2 // vector_length * 4
+ b.hi 1f
// if rest <= vector_length * 4
lsl tmp1, vector_length, 1 // vector_length * 2
whilelo p2.b, tmp1, n
incb tmp1
whilelo p3.b, tmp1, n
- b.last 1f
ld1b z0.b, p0/z, [src, #0, mul vl]
ld1b z1.b, p1/z, [src, #1, mul vl]
ld1b z2.b, p2/z, [src, #2, mul vl]
@@ -155,6 +156,8 @@
st1b z3.b, p3, [dest, #3, mul vl]
ret
1: // if rest <= vector_length * 8
+ ptrue p2.b
+ ptrue p3.b
lsl tmp1, vector_length, 2 // vector_length * 4
whilelo p4.b, tmp1, n
incb tmp1
> Also it is worth checking if the 256-512 case beats jumping directly to
> L(unroll4) - however note that code isn't optimized yet (eg. there is no need for
> complex software pipelined loops since we can only iterate once!).
I tried, but it didn't work for memmove, because L(unroll4) doesn't support
backward copy.
> If all that
> doesn't help, it may be best to split into 256-384 and 384-512 so you only need 2x
> WHILELO.
This approach [4] made the dip smaller, as shown in Graph 3 in the Google Sheet [2],
so it seems to be the way we should take.
[4] https://github.com/NaohiroTamura/glibc/commit/cbcb80e69325c16c6697c42627a6ca12c3245a86
> > I checked bench-memcpy-random [5], but it measures the performance
> > from the size 4K byte to 512K byte.
> > How do we know the branch issue for less than 512 byte?
>
> The size is the size of the memory region tested, not the size of the copies. The
> actual copies are very small (90% are smaller than 128 bytes). The key is that it
> doesn't repeat the same copy over and over so it's hard on the branch predictor
> just like in a real application.
I see, I'll take a look at the source code more thoroughly.
Thanks.
Naohiro