From: Wilco Dijkstra via Libc-alpha <libc-alpha@sourceware.org>
To: "naohirot@fujitsu.com" <naohirot@fujitsu.com>
Cc: 'GNU C Library' <libc-alpha@sourceware.org>
Subject: Re: [PATCH v3 5/5] AArch64: Improve A64FX memset
Date: Tue, 24 Aug 2021 15:46:33 +0000	[thread overview]
Message-ID: <VE1PR08MB559936A5540C5D4E7758F54D83C59@VE1PR08MB5599.eurprd08.prod.outlook.com> (raw)
In-Reply-To: <TYAPR01MB6025020BA0CBC9681D40F05DDFC59@TYAPR01MB6025.jpnprd01.prod.outlook.com>

Hi Naohiro,

> Are you talking about the regression between V4 and V4 fixed?
> If so, that is also observed in my environment as shown in the graph [2].

I was talking about your graph [2] - my results are below.

> But V4 fixed is not degraded than the master as shown in the graph [1].

That may be true, but it gives up much of the performance gain of V4 just to improve
the 16KB data point in one benchmark. I don't believe that is a good tradeoff.

> I think we are getting almost same result each other, but not exactly same, right?
>
> If the 50% regression in your environment is at 1KB, the regression at 1KB happens
> in my environment too as shown in the graph [4], but the rate seems less than 50%.

Yes, the differences are at similar sizes but with different magnitudes.

> Both your result and my result are true and real.
> I don't think it's rational to make decision by looking at only one environment result.

It's odd the behaviour with the same CPU isn't identical. If there is a way to make them
behave more similarly, I would love to hear it! In any case it would be good to know
how the blt workaround works on your system.

> Does "it becomes faster than the best results so far" mean faster than the master?

By "best result" I mean the faster of V4 and V4 with unroll8.

These are the results I get for bench-memset compared to V4 (higher = faster):

       v4+blt  v4+unroll8
0-512  0.01%   0.00%
1K-4K -0.15%  -3.07%
4K-8K  0.11%  -0.04%
16K    3.56%   1.98%
32K    0.74%  -0.71%
64K    1.91%   0.53%
128K   0.23%   0.10%

So the blt workaround improves performance of larger sizes far more than unroll8,
and most importantly, it doesn't regress smaller sizes like unroll8.

> I think we should put the baseline or bottom line to the master performance.
> If the workaround is not faster than or equal to the master at 16KB which has the peak
> performance, reverting unroll8 is preferable. 

A new implementation does not need to beat a previous version on every single size.
It would be impossibly hard to achieve that - an endless game of whack-a-mole...
So I always look for better performance overall and for commonly used size ranges
(see above table, V4+blt is 0.6% faster overall than V4+unroll8).

We should avoid major regressions of course, so the question is whether we can tweak V4
a little so that it does better around 16KB without losing any of its performance gains.
My results show that is possible with the blt workaround, but not with the unroll8 loop.

> I'm not sure if I understood what the workaround code looks like, is it like this?

It just injects a single blt at the top of the loop and changes the sub before the
loop to subs, so you get something like this:

        subs    count, count, tmp1
        .p2align 4
1:      b.lt    last

I can propose a patch for this workaround if it isn't clear.
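To make the shape of the change concrete, here is a minimal sketch of how the loop could look with the workaround in place. The register names, the `last` label, and the loop body are illustrative placeholders, not the actual A64FX memset source:

```
        /* Before the loop: use subs instead of sub so the flags
           reflect whether count has already gone negative.  */
        subs    count, count, tmp1      /* count -= tmp1, sets NZCV */
        .p2align 4
1:      b.lt    last                    /* exit early if count < 0 */
        /* ... store instructions of the loop body ... */
        subs    count, count, tmp1      /* decrement and set flags again */
        b.ge    1b                      /* stay in the loop while count >= 0 */
last:
        /* ... handle the remaining tail bytes ... */
```

The key point is that the single b.lt at the loop top costs essentially nothing on sizes that stay in the loop, while letting sizes near the loop threshold skip it, which is why it avoids the small-size regressions that unroll8 introduces.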

> I think your environment must be Applo 80 or FX700 which has 48 cores and 4 NUMA nodes.
> FX1000 master node has 52 cores and FX1000 compute node has 50 cores.
> OS sees FX1000 as if it has 8 NUMA nodes.

I do see 4 NUMA nodes indeed, but performance isn't affected at all by which node you select
(at least on bench-memset since it runs from L1/L2).

Cheers,
Wilco
