RE: [PATCH v3 5/5] AArch64: Improve A64FX memset

From: naohirot--- via Libc-alpha <libc-alpha@sourceware.org>
To: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Cc: "'libc-alpha@sourceware.org'" <libc-alpha@sourceware.org>
Subject: RE: [PATCH v3 5/5] AArch64: Improve A64FX memset
Date: Tue, 24 Aug 2021 08:07:14 +0000
Message-ID: <TYAPR01MB6025AE090CB8F7D43BBBF728DFC59@TYAPR01MB6025.jpnprd01.prod.outlook.com>
In-Reply-To: <TYAPR01MB6025020BA0CBC9681D40F05DDFC59@TYAPR01MB6025.jpnprd01.prod.outlook.com>

Fixed a typo inline

> -----Original Message-----
> From: Libc-alpha <libc-alpha-bounces+naohirot=fujitsu.com@sourceware.org> On Behalf Of naohirot--- via Libc-alpha
> Sent: Tuesday, August 24, 2021 4:56 PM
> To: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
> Cc: 'GNU C Library' <libc-alpha@sourceware.org>
> Subject: RE: [PATCH v3 5/5] AArch64: Improve A64FX memset
> 
> Hi Wilco,
> 
> > > In my environment, reverting unroll8 causes no performance degradation,
> > > and it improves performance at 16KB, as shown in the graphs.
> >
> > I still see a major regression at 1KB in the graph (it is relatively larger than the gain at 16KB),
> > plus many smaller regressions between 2KB and 8KB.
> 
> Are you talking about the regression between V4 and V4 fixed?
> If so, I also observe it in my environment, as shown in the graph [2].
> But V4 fixed is not degraded compared with the master, as shown in the graph [1].
> 
> I think we are getting almost the same results, but not exactly the same, right?
> 
> > > The first graph [1] compares the master with V4 fixed.
> > > The second graph [2] compares V4 with V4 fixed.
> > >
> > > [1] https://drive.google.com/file/d/19og4ZhU9itzFAVXX8TIzlpgiiukiXQbp/view?usp=sharing
> > > [2] https://drive.google.com/file/d/1wQgPU6GyRQ_Z8ibsGja-NfdKhN5bz7I9/view?usp=sharing
> 
> > > In your environment, do you see any performance degradation from reverting unroll8?
> > > If reverting unroll8 has no disadvantage, why don't we revert it?
> >
> > For me bench-memset shows a 50% regression with the unroll8 loop reverted plus
> > many smaller regressions. So I don't think reverting is a good idea.
> 
> If the 50% regression in your environment is at 1KB, the regression at 1KB happens
> in my environment too as shown in the graph [4], but the rate seems less than 50%.
> 
"the graph [4]" should be "the graph [2]".

Thanks.
Naohiro

> Both your results and mine are real.
> I don't think it's rational to make a decision based on results from only one environment.
> 
> As I explained at the bottom of this mail, the V4 code is tuned for Apollo 80 and FX700.
> So we need to take FX1000 into account too.
> 
> > I tried "perf stat" and oddly enough this loop causes a lot of branch mispredictions.
> > However if you add a branch at the top of the loop that is never taken (eg. blt and
> > ensuring the sub above it sets the flags), it becomes faster than the best results so far.
> > If you can reproduce that, it is probably the best workaround.
> 
> Does "it becomes faster than the best results so far" mean faster than the master?
> I think the baseline, or bottom line, should be the master's performance.
> If the workaround is not at least as fast as the master at 16KB, where performance
> peaks, reverting unroll8 is preferable.
> 
> I'm not sure I understood what the workaround code looks like. Is it like this?
> 
> L(unroll8):
>         sub     count, count, tmp1
>         .p2align 4
> 1:      subs    tmp2, xzr, xzr
>         b.lt    1b
>         st1b_unroll 0, 7
>         add     dst, dst, tmp1
>         subs    count, count, tmp1
>         b.hi    1b
>         add     count, count, tmp1
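
For readers checking the branch conditions in the sketch above, here is a small C model of the flag arithmetic. It is not glibc code, and the names `nzcv`, `subs64`, `cond_lt` and `cond_hi` are hypothetical. It only illustrates that `subs tmp2, xzr, xzr` computes 0 - 0, which leaves N and V clear, so the following `b.lt` is never taken, while `b.hi` after `subs count, count, tmp1` loops back only while count was strictly greater than tmp1.

```c
#include <stdbool.h>
#include <stdint.h>

/* NZCV condition flags as set by a 64-bit AArch64 SUBS (illustrative model). */
struct nzcv {
    bool n; /* result is negative when read as signed */
    bool z; /* result is zero */
    bool c; /* carry: set when the subtraction did NOT borrow */
    bool v; /* signed overflow */
};

/* Model of "subs rd, a, b": computes a - b and derives the flags. */
static struct nzcv subs64(uint64_t a, uint64_t b)
{
    uint64_t r = a - b;
    struct nzcv f;
    f.n = (r >> 63) != 0;
    f.z = (r == 0);
    f.c = (a >= b);
    /* Overflow: operands differ in sign and the result's sign differs from a. */
    f.v = (((a ^ b) & (a ^ r)) >> 63) != 0;
    return f;
}

/* b.lt is taken when N != V (signed less than). */
static bool cond_lt(struct nzcv f) { return f.n != f.v; }

/* b.hi is taken when C is set and Z is clear (unsigned higher). */
static bool cond_hi(struct nzcv f) { return f.c && !f.z; }
```

For example, `cond_lt(subs64(0, 0))` is false, which matches the "never taken" branch, and `cond_hi(subs64(count, tmp1))` is true only when `count > tmp1` as unsigned values.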
> 
> > > Is it an HPE Apollo 80 system?
> > > Or does Arm have an account on a Fujitsu FX1000 or FX700?
> >
> > It has 48 cores, that's all I know...
> 
> I think your environment must be an Apollo 80 or FX700, which have 48 cores and 4 NUMA nodes.
> An FX1000 master node has 52 cores and an FX1000 compute node has 50 cores.
> The OS sees FX1000 as if it has 8 NUMA nodes.
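
One rough way to tell these systems apart is to compare the CPU and NUMA-node counts the OS reports against the figures above. The C sketch below does that under stated assumptions: the enum and function names are hypothetical, the classification uses only the counts quoted in this mail, the NUMA count is read from the Linux sysfs directory `/sys/devices/system/node`, and it is an assumption that the OS-visible CPU count matches the core counts quoted here.

```c
#include <ctype.h>
#include <dirent.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical labels for the systems discussed in this thread. */
enum a64fx_system {
    SYS_UNKNOWN,
    SYS_APOLLO80_OR_FX700, /* 48 cores, 4 NUMA nodes */
    SYS_FX1000_MASTER,     /* 52 cores, 8 NUMA nodes */
    SYS_FX1000_COMPUTE     /* 50 cores, 8 NUMA nodes */
};

/* Heuristic based only on the core and NUMA-node counts quoted above. */
static enum a64fx_system guess_a64fx_system(long cores, int numa_nodes)
{
    if (cores == 48 && numa_nodes == 4)
        return SYS_APOLLO80_OR_FX700;
    if (cores == 52 && numa_nodes == 8)
        return SYS_FX1000_MASTER;
    if (cores == 50 && numa_nodes == 8)
        return SYS_FX1000_COMPUTE;
    return SYS_UNKNOWN;
}

/* Count "nodeN" entries in sysfs; returns -1 if the directory is unavailable. */
static int count_numa_nodes(void)
{
    DIR *d = opendir("/sys/devices/system/node");
    if (d == NULL)
        return -1;
    int n = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (strncmp(e->d_name, "node", 4) == 0
            && isdigit((unsigned char)e->d_name[4]))
            n++;
    }
    closedir(d);
    return n;
}

/* Number of CPUs configured, as reported by the OS. */
static long count_cores(void)
{
    return sysconf(_SC_NPROCESSORS_CONF);
}
```

Calling `guess_a64fx_system(count_cores(), count_numa_nodes())` on the benchmark machine gives a first guess. Any other combination of counts is reported as unknown rather than guessed.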
> 
> Thanks.
> Naohiro

