RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX

From: "naohirot@fujitsu.com" <naohirot@fujitsu.com>
To: 'Wilco Dijkstra' <Wilco.Dijkstra@arm.com>
Cc: Szabolcs Nagy <Szabolcs.Nagy@arm.com>,
	'GNU C Library' <libc-alpha@sourceware.org>
Subject: RE: [PATCH 0/5] Added optimized memcpy/memmove/memset for A64FX
Date: Mon, 19 Apr 2021 12:43:50 +0000
Message-ID: <TYAPR01MB602575963D5AA8B60049F590DF499@TYAPR01MB6025.jpnprd01.prod.outlook.com>
In-Reply-To: <VE1PR08MB5599AFAEFDA55471AF1C648C834E9@VE1PR08MB5599.eurprd08.prod.outlook.com>

Hi Wilco-san,

Let me focus on L1_prefetch in this mail.

> From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>

> > Memcpy/memmove uses 8, 4, 2 unrolls, and memset uses 32, 8, 4, 2 unrolls.
> > This unroll configuration recorded the highest performance.

When I tested "4 unrolls", I modified the source code [1][2] from the mail [0]
as follows:
In the case of memcpy,
   I commented out L(unroll8) and L(unroll2), and left L(unroll4), L(unroll1) and L(last).
In the case of memmove,
   I commented out L(bwd_unroll8) and L(bwd_unroll2), and left L(bwd_unroll4), L(bwd_unroll1) and L(bwd_last).
In the case of memset,
   I commented out L(unroll32), L(unroll8) and L(unroll2), and left L(unroll4), L(unroll1) and L(last).
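
To illustrate what the "4 unrolls" configuration means, here is a rough C model
of the unroll cascade. It is only a sketch, not the actual SVE assembly in [1]:
CHUNK, copy_chunk and copy_unrolled are hypothetical names, each stage is
modelled as a plain loop, and one fixed-size block stands in for one vector
load/store pair.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 64  /* assumed block size; stands in for one SVE vector */

/* Copy one CHUNK-byte block and advance both pointers. */
static void copy_chunk(unsigned char **dst, const unsigned char **src)
{
    memcpy(*dst, *src, CHUNK);
    *dst += CHUNK;
    *src += CHUNK;
}

/* Hypothetical model of the unroll cascade.
   full != 0: stages 8, 4, 2, 1 are all enabled.
   full == 0: the 8x and 2x stages are skipped, as in the "4 unrolls" test.
   Returns the number of loop iterations executed before the tail copy. */
size_t copy_unrolled(void *dstv, const void *srcv, size_t n, int full)
{
    unsigned char *dst = dstv;
    const unsigned char *src = srcv;
    size_t iters = 0;

    assert(n == 0 || (dst != NULL && src != NULL));

    if (full) {
        /* Corresponds loosely to L(unroll8). */
        for (; n >= 8 * CHUNK; n -= 8 * CHUNK, iters++)
            for (int i = 0; i < 8; i++)
                copy_chunk(&dst, &src);
    }
    /* Corresponds loosely to L(unroll4); always enabled. */
    for (; n >= 4 * CHUNK; n -= 4 * CHUNK, iters++)
        for (int i = 0; i < 4; i++)
            copy_chunk(&dst, &src);
    if (full) {
        /* Corresponds loosely to L(unroll2). */
        for (; n >= 2 * CHUNK; n -= 2 * CHUNK, iters++)
            for (int i = 0; i < 2; i++)
                copy_chunk(&dst, &src);
    }
    /* Corresponds loosely to L(unroll1). */
    for (; n >= CHUNK; n -= CHUNK, iters++)
        copy_chunk(&dst, &src);

    /* Tail shorter than one block, loosely L(last). */
    memcpy(dst, src, n);
    return iters;
}
```

Both configurations copy the same bytes; the 4-only variant just takes more
loop iterations (and therefore more branch and counter overhead) for the same
size.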

[0] https://sourceware.org/pipermail/libc-alpha/2021-April/125002.html
[1] https://github.com/NaohiroTamura/glibc/blob/ec0b55a855529f75bd6f280e59dc2b1c25640490/sysdeps/aarch64/multiarch/memcpy_a64fx.S
[2] https://github.com/NaohiroTamura/glibc/blob/ec0b55a855529f75bd6f280e59dc2b1c25640490/sysdeps/aarch64/multiarch/memset_a64fx.S

> > In case that Memcpy/memmove uses 4 unrolls, and memset uses 4 unrolls,
> > The performance degraded minus 5 to 15 Gbps/sec at the peak.
> 
> So this is the L(L1_vl_64) loop right? I guess the problem is the large number of

No, this is not the L(L1_vl_64) loop; it is L(vl_agnostic).

> prefetches and all the extra code that is not strictly required (you can remove 5
> redundant mov/cmp instructions from the loop). Also assuming prefetching helps
> here (the good memmove results suggest it's not needed), prefetching directly
> into L1 should be better than first into L2 and then into L1. So I don't see a good
> reason why 4x unrolling would have to be any slower.

I tried removing L(L1_prefetch) entirely from both memcpy and memset, and I also
tried removing only the L2 prefetch instructions (prfm pstl2keep and pldl2keep)
in L(L1_prefetch) from both memcpy and memset.
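
The three variants compared here can be sketched in C as follows. This is only
an illustration, not the assembly in [1][2]: the function name, the enum and
the prefetch distances PF_DIST_L1 and PF_DIST_L2 are hypothetical, and it
relies on the GCC/Clang builtin __builtin_prefetch(addr, rw, locality), which
on AArCH64 is typically lowered to a prfm hint (locality 3 towards L1,
locality 2 towards L2).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE       256          /* A64FX cache line size in bytes */
#define PF_DIST_L1 (4 * LINE)   /* hypothetical L1 prefetch distance */
#define PF_DIST_L2 (32 * LINE)  /* hypothetical L2 prefetch distance */

enum prefetch_mode {
    PF_NONE,       /* L(L1_prefetch) removed entirely */
    PF_L1_ONLY,    /* only the L2 prefetch hints removed */
    PF_L1_AND_L2   /* original: both L1 and L2 hints */
};

/* Compute a prefetch address via uintptr_t so that no out-of-bounds
   pointer arithmetic is performed; the hint itself never faults. */
static const void *ahead(const void *p, size_t dist)
{
    return (const void *)((uintptr_t)p + dist);
}

/* Copy n bytes one cache line at a time, issuing the hints selected by
   mode. Returns the number of prefetch hints issued. */
size_t copy_with_prefetch(void *dstv, const void *srcv, size_t n,
                          enum prefetch_mode mode)
{
    unsigned char *dst = dstv;
    const unsigned char *src = srcv;
    size_t hints = 0;

    assert(n == 0 || (dst != NULL && src != NULL));

    while (n >= LINE) {
        if (mode != PF_NONE) {
            __builtin_prefetch(ahead(src, PF_DIST_L1), 0, 3); /* like pldl1keep */
            __builtin_prefetch(ahead(dst, PF_DIST_L1), 1, 3); /* like pstl1keep */
            hints += 2;
        }
        if (mode == PF_L1_AND_L2) {
            __builtin_prefetch(ahead(src, PF_DIST_L2), 0, 2); /* like pldl2keep */
            __builtin_prefetch(ahead(dst, PF_DIST_L2), 1, 2); /* like pstl2keep */
            hints += 2;
        }
        memcpy(dst, src, LINE);
        dst += LINE;
        src += LINE;
        n -= LINE;
    }
    memcpy(dst, src, n);  /* tail shorter than one cache line */
    return hints;
}
```

All three modes produce the same copy; they differ only in how many hint
instructions sit in the loop, which is the cost being measured in this mail.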

In the case of memcpy, both removing L(L1_prefetch) [3] and removing the L2 prefetch
instructions from L(L1_prefetch) increased the performance in the 64KB-4MB size range
from 18-20 GB/sec [4] to 20-22 GB/sec [5].

[3] https://github.com/NaohiroTamura/glibc/commit/22612299247e64dbffd62aa186513bde7328d104
[4] https://drive.google.com/file/d/1hGWz4eAYWc1ktdw74rzDPxtQQ48P0-Hv/view
[5] https://drive.google.com/file/d/11Pt1mWSCN2LBPHxXUE-rs7Q6JhtBfpyQ/view

In the case of memset, removing L(L1_prefetch) [6] decreased the performance in the
128KB-4MB size range from 22-24 GB/sec [7] to 20-22 GB/sec [8].
But removing only the L2 prefetch instruction (prfm pstl2keep) in L(L1_prefetch) [9]
kept the performance in the 128KB-4MB size range unchanged at 22-24 GB/sec [10].

[6] https://github.com/NaohiroTamura/glibc/blob/22612299247e64dbffd62aa186513bde7328d104/sysdeps/aarch64/multiarch/memset_a64fx.S#L146-L163
   I commented out L146-L163, but did not commit the change because it decreased the performance.
[7] https://drive.google.com/file/d/1MT1d2aBxSoYrzQuRZtv4U9NCXV4ZwHsJ/view
[8] https://drive.google.com/file/d/1qUzYklLvgXTZbP1wm9n4VryF3bgUOplo/view
[9] https://github.com/NaohiroTamura/glibc/commit/cc478c96bac051c9b98b9d9a1ae6f38326f77645
[10] https://drive.google.com/file/d/1bPKHFWyhzNWXX7A_S6_UpZ2BwP2QAJK4/view

In conclusion, I will remove L(L1_prefetch) from memcpy [3], and remove the L2 prefetch
instruction (prfm pstl2keep) from L(L1_prefetch) in memset [9].

Thanks.
Naohiro
