From: naohirot--- via Libc-alpha <libc-alpha@sourceware.org>
To: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Cc: "'libc-alpha@sourceware.org'" <libc-alpha@sourceware.org>
Subject: RE: [PATCH v3 5/5] AArch64: Improve A64FX memset
Date: Tue, 24 Aug 2021 08:07:14 +0000
Message-ID: <TYAPR01MB6025AE090CB8F7D43BBBF728DFC59@TYAPR01MB6025.jpnprd01.prod.outlook.com>
In-Reply-To: <TYAPR01MB6025020BA0CBC9681D40F05DDFC59@TYAPR01MB6025.jpnprd01.prod.outlook.com>
Fixed a typo inline
> -----Original Message-----
> From: Libc-alpha <libc-alpha-bounces+naohirot=fujitsu.com@sourceware.org> On Behalf Of naohirot--- via Libc-alpha
> Sent: Tuesday, August 24, 2021 4:56 PM
> To: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
> Cc: 'GNU C Library' <libc-alpha@sourceware.org>
> Subject: RE: [PATCH v3 5/5] AArch64: Improve A64FX memset
>
> Hi Wilco,
>
> > > In my environment, reverting unroll8 causes no performance degradation,
> > > and it improves 16KB performance, as shown in the graphs.
> >
> > I still see a major regression at 1KB in the graph (it is larger relatively than the gain at 16KB),
> > plus many smaller regressions between 2KB-8KB.
>
> Are you talking about the regression between V4 and V4 fixed?
> If so, I also observe that in my environment, as shown in graph [2].
> But V4 fixed is not slower than master, as shown in graph [1].
>
> I think we are getting almost the same results, but not exactly the same, right?
>
> > > The first graph [1] compares master with V4 fixed.
> > > The second graph [2] compares V4 with V4 fixed.
> > >
> > > [1] https://drive.google.com/file/d/19og4ZhU9itzFAVXX8TIzlpgiiukiXQbp/view?usp=sharing
> > > [2] https://drive.google.com/file/d/1wQgPU6GyRQ_Z8ibsGja-NfdKhN5bz7I9/view?usp=sharing
>
> > > In your environment, do you see any performance degradation from reverting unroll8?
> > > If there is no disadvantage to reverting unroll8, why don't we revert it?
> >
> > For me bench-memset shows a 50% regression with the unroll8 loop reverted plus
> > many smaller regressions. So I don't think reverting is a good idea.
>
> If the 50% regression in your environment is at 1KB, the regression at 1KB happens
> in my environment too as shown in the graph [4], but the rate seems less than 50%.
>
"the graph [4]" should be "the graph [2]".
Thanks.
Naohiro
> Both your result and my result are real.
> I don't think it's rational to make a decision by looking at the result from only one environment.
>
> As I explained at the bottom of this mail, the V4 code is tuned for Apollo 80 and FX700.
> So we need to take FX1000 into account too.
>
> > I tried "perf stat" and oddly enough this loop causes a lot of branch mispredictions.
> > However if you add a branch at the top of the loop that is never taken (eg. blt and
> > ensuring the sub above it sets the flags), it becomes faster than the best results so far.
> > If you can reproduce that, it is probably the best workaround.
>
> Does "it becomes faster than the best results so far" mean faster than master?
> I think we should take master's performance as the baseline.
> If the workaround is not at least as fast as master at 16KB, where performance
> peaks, reverting unroll8 is preferable.
>
> I'm not sure I understood what the workaround code looks like. Is it like this?
>
> L(unroll8):
> 	sub	count, count, tmp1
> 	.p2align 4
> 1:	subs	tmp2, xzr, xzr
> 	b.lt	1b
> 	st1b_unroll 0, 7
> 	add	dst, dst, tmp1
> 	subs	count, count, tmp1
> 	b.hi	1b
> 	add	count, count, tmp1
>
> > > Is it HPE Apollo 80 System?
> > > Or does Arm have an account on a Fujitsu FX1000 or FX700?
> >
> > It has 48 cores, that's all I know...
>
> I think your environment must be Apollo 80 or FX700, which have 48 cores and 4 NUMA nodes.
> An FX1000 master node has 52 cores and an FX1000 compute node has 50 cores.
> The OS sees FX1000 as having 8 NUMA nodes.
>
> Thanks.
> Naohiro
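
A note on the loop sketched in the quoted message: the reason `b.lt 1b` placed after `subs tmp2, xzr, xzr` would never be taken can be checked with a small model of the AArch64 condition flags. The C below is only an illustrative sketch, not glibc code; the names `subs_flags`, `cond_lt` and `cond_hi` are hypothetical helpers introduced here. It assumes the standard architectural definitions: `lt` means N != V, and `hi` means C set and Z clear.

```c
#include <stdbool.h>
#include <stdint.h>

/* AArch64 condition flags as left by a flag-setting instruction. */
struct nzcv { bool n, z, c, v; };

/* Model of the flags produced by a 64-bit `subs xd, a, b` (hypothetical helper). */
static struct nzcv subs_flags(uint64_t a, uint64_t b)
{
    uint64_t r = a - b;                        /* wraps modulo 2^64, like the hardware */
    struct nzcv f;
    f.n = (r >> 63) & 1;                       /* N: result is negative as signed */
    f.z = (r == 0);                            /* Z: result is zero */
    f.c = (a >= b);                            /* C: set when there is no borrow */
    f.v = (((a ^ b) & (a ^ r)) >> 63) & 1;     /* V: signed overflow */
    return f;
}

/* b.lt is taken when N != V. */
static bool cond_lt(struct nzcv f) { return f.n != f.v; }

/* b.hi is taken when C is set and Z is clear (unsigned higher). */
static bool cond_hi(struct nzcv f) { return f.c && !f.z; }
```

With these definitions, `subs_flags(0, 0)` yields N=0, Z=1, C=1, V=0, so `cond_lt` is false and the extra branch falls through on every iteration. The loop-closing `b.hi` depends only on the later `subs count, count, tmp1`, so in this model it stays taken while `count` is still larger than `tmp1`.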