From: Wilco Dijkstra via Libc-alpha <libc-alpha@sourceware.org>
To: "naohirot@fujitsu.com" <naohirot@fujitsu.com>
Cc: 'GNU C Library' <libc-alpha@sourceware.org>
Subject: Re: [PATCH v3 5/5] AArch64: Improve A64FX memset
Date: Tue, 24 Aug 2021 15:46:33 +0000	[thread overview]
Message-ID: <VE1PR08MB559936A5540C5D4E7758F54D83C59@VE1PR08MB5599.eurprd08.prod.outlook.com> (raw)
In-Reply-To: <TYAPR01MB6025020BA0CBC9681D40F05DDFC59@TYAPR01MB6025.jpnprd01.prod.outlook.com>

Hi Naohiro,

> Are you talking about the regression between V4 and V4 fixed?
> If so, that is also observed in my environment as shown in the graph [2].

I was talking about your graph [2] - my results are below.

> But V4 fixed is not degraded than the master as shown in the graph [1].

That may be true, but it gives up much of the performance gain of V4 just to improve
the 16KB data point in one benchmark. I don't believe that is a good tradeoff.

> I think we are getting almost same result each other, but not exactly same, right?
>
> If the 50% regression in your environment is at 1KB, the regression at 1KB happens
> in my environment too as shown in the graph [4], but the rate seems less than 50%.

Yes, the differences are at similar sizes but with different magnitudes.

> Both your result and my result are true and real.
> I don't think it's rational to make decision by looking at only one environment result.

It's odd the behaviour with the same CPU isn't identical. If there is a way to make them
behave more similarly, I would love to hear it! In any case it would be good to know
how the blt workaround works on your system.

> Does "it becomes faster than the best results so far" mean faster than the master?

By "best result" I mean the faster of V4 and V4 with unroll8.

These are the results I get for bench-memset compared to V4 (higher = faster):

       v4+blt  v4+unroll8
0-512  0.01%   0.00%
1K-4K -0.15%  -3.07%
4K-8K  0.11%  -0.04%
16K    3.56%   1.98%
32K    0.74%  -0.71%
64K    1.91%   0.53%
128K   0.23%   0.10%

So the blt workaround improves performance of larger sizes far more than unroll8,
and most importantly, it doesn't regress smaller sizes like unroll8.

> I think we should put the baseline or bottom line to the master performance.
> If the workaround is not faster than or equal to the master at 16KB which has the peak
> performance, reverting unroll8 is preferable. 

A new implementation does not need to beat a previous version on every single size.
It would be impossibly hard to achieve that - an endless game of whack-a-mole...
So I always look for better performance overall and for commonly used size ranges
(see above table, V4+blt is 0.6% faster overall than V4+unroll8).

We should avoid major regressions of course, so the question is whether we can tweak V4
a little so that it does better around 16KB without losing any of its performance gains.
My results show that is possible with the blt workaround, but not with the unroll8 loop.

> I'm not sure if I understood what the workaround code looks like, is it like this?

It just injects a single blt at the top of the loop and changes the sub before the
loop to subs, so you get something like this:

        subs    count, count, tmp1
        .p2align 4
1:      b.lt    last

I can propose a patch for this workaround if it isn't clear.
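To make the shape of the change concrete, here is a minimal sketch of how the loop could look with the workaround in place. The register names, the `last` label, and the loop body are illustrative placeholders, not the actual A64FX memset source:

```
        /* Before the loop: use subs instead of sub so the flags
           reflect whether count has already gone negative.  */
        subs    count, count, tmp1      /* count -= tmp1, sets NZCV */
        .p2align 4
1:      b.lt    last                    /* exit early if count < 0 */
        /* ... store instructions of the loop body ... */
        subs    count, count, tmp1      /* decrement and set flags again */
        b.ge    1b                      /* stay in the loop while count >= 0 */
last:
        /* ... handle the remaining tail bytes ... */
```

The key point is that the single b.lt at the loop top costs essentially nothing on sizes that stay in the loop, while letting sizes near the loop threshold skip it, which is why it avoids the small-size regressions that unroll8 introduces.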

> I think your environment must be Applo 80 or FX700 which has 48 cores and 4 NUMA nodes.
> FX1000 master node has 52 cores and FX1000 compute node has 50 cores.
> OS sees FX1000 as if it has 8 NUMA nodes.

I do see 4 NUMA nodes indeed, but performance isn't affected at all by which node you select
(at least on bench-memset since it runs from L1/L2).

Cheers,
Wilco
