Re: [PATCH v2][AArch64] Improve integer memcpy - Adhemerval Zanella via Libc-alpha


From: Adhemerval Zanella via Libc-alpha <libc-alpha@sourceware.org>
To: Wilco Dijkstra <Wilco.Dijkstra@arm.com>,
	"libc-alpha@sourceware.org" <libc-alpha@sourceware.org>
Subject: Re: [PATCH v2][AArch64] Improve integer memcpy
Date: Fri, 13 Mar 2020 13:43:21 -0300
Message-ID: <7afa8c21-56ae-c533-18f4-6a25a46c8b1f@linaro.org>
In-Reply-To: <AM5PR0801MB2035DC6C63E1909315AFA0E083FC0@AM5PR0801MB2035.eurprd08.prod.outlook.com>



On 11/03/2020 13:32, Wilco Dijkstra wrote:
> Hi Adhemerval,
> 
>> I wonder if the optimization for sizes up to 128 yields same gain
>> for the other chip memcpy implementation (thunderx, thunderx2, and
>> falkor).  
> 
> Most definitely - the new memcpy is 15-20% faster than __memcpy_thunderx2
> on TX2.

OK, what I would like to avoid is continuing to maintain subpar
architecture-specific implementations once the generic implementation
improves.

So, for ThunderX, the only optimization its implementation uses is
prefetch for sizes larger than 32KB.  Is it really paying off?
Could it switch to the generic implementation as well?

For ThunderX2, it uses Q registers and 128-byte loops for aligned
copies and the jump table for unaligned ones.  Is the jump table still
a gain for ThunderX2?  Also, it might be an option to have a generic
memcpy that uses Q registers with a larger window (so ThunderX and
newer cores might prefer it instead of the generic one).

> 
>> The main difference seems to be how each chip handles
>> large copies, with thunderx and falkor doing 64 bytes per loop,
>> while thunderx2 does either 128 bytes (when source and dest are
>> aligned) or 64 for unaligned inputs (it also does not issue
>> unaligned access, doing aligned load plus merge using a jump table).
> 
> Yes that jump table is insane at 1KB of code... It may seem great in
> microbenchmarks but it falls apart in the real world.
> 
>> So it seems that I don't see a straightforward way to unify the
>> implementations, maybe adding a common shared code for sizes
>> less than 128 bytes.
> 
> Yes we could share the code for small cases across implementations.
> I was thinking about having an ifunc for large copies so we could
> statically link a common routine to handle small copies and avoid
> PLT overheads in 99% of cases.
> 
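
To make the idea above concrete, here is a rough C sketch of a shared
small-copy path with an indirect call only for large sizes.  Everything
in it is an assumption for illustration: the names, the 128-byte
threshold, the byte loop, and the plain function pointer standing in
for an ifunc-resolved symbol.  It is not the glibc code.

```c
#include <stddef.h>
#include <string.h>

typedef void *(*large_copy_fn)(void *, const void *, size_t);

/* Counts how often the indirect (large-copy) path is taken. */
static unsigned long large_calls;

/* Hypothetical per-chip routine; here it just defers to memcpy. */
static void *large_copy_generic(void *dst, const void *src, size_t n)
{
    large_calls++;
    return memcpy(dst, src, n);
}

/* Stand-in for a symbol that an ifunc resolver would pick at load time. */
static large_copy_fn large_copy_impl = large_copy_generic;

/* Small sizes are handled by common code with no indirect call;
   only larger sizes go through the selected implementation. */
static inline void *copy_dispatch_sketch(void *dst, const void *src, size_t n)
{
    if (n <= 128) {
        unsigned char *d = dst;
        const unsigned char *s = src;
        /* A real routine would use a few overlapping wide loads/stores. */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
        return dst;
    }
    return large_copy_impl(dst, src, n);
}
```

With this shape, a 100-byte copy never touches large_copy_impl, while a
200-byte copy makes exactly one indirect call.
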
>> One question is if doing operation for large sizes using
>> ldp/stp might yield some gains (as thunderx2 does, at least
>> for aligned case), or if the cost of checking and using some
>> specific cases does not pay off.
> 
> You mean LDP/STP of SIMD registers? There is some gain for those on
> modern cores.
> 
>> +   Large copies use a software pipelined loop processing 64 bytes per iteration.
>> +   The destination pointer is 16-byte aligned to minimize unaligned accesses.
>> +   The loop tail is handled by always copying 64 bytes from the end.
>> +*/
> 
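
For reference, the strategy in the quoted comment can be sketched in C
roughly as follows.  This is only an illustration under assumptions:
the function name is made up, buffers must not overlap, n must be at
least 64, and each memcpy call stands in for a group of 16-byte
load/store pairs.  It also makes no attempt at software pipelining,
which the real assembly does by issuing the loads for the next
iteration before the current stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy n >= 64 bytes between non-overlapping buffers. */
static void copy_large_sketch(unsigned char *dst, const unsigned char *src,
                              size_t n)
{
    assert(n >= 64);

    unsigned char *dst_end = dst + n;
    const unsigned char *src_end = src + n;

    /* Copy the first 16 bytes unaligned, then advance both pointers so
       that dst is 16-byte aligned (skew is in the range 1..16). */
    memcpy(dst, src, 16);
    size_t skew = 16 - ((uintptr_t)dst & 15);
    dst += skew;
    src += skew;

    /* Main loop: 64 bytes per iteration while more than 64 bytes remain. */
    while ((size_t)(dst_end - dst) > 64) {
        memcpy(dst, src, 64);
        dst += 64;
        src += 64;
    }

    /* Tail: always copy the last 64 bytes from the end.  This may rewrite
       bytes already copied, which is harmless and avoids a size check. */
    memcpy(dst_end - 64, src_end - 64, 64);
}
```

The point of the unconditional tail copy is that the loop never needs a
separate remainder path: whatever is left after the loop is at most 64
bytes and is always covered by the final store group.
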
>> Ok, so it now uses a similar strategy to the ThunderX/Falkor memcpy (Falkor
>> limits the copy to one register due to a hardware prefetcher limitation).
> 
> Well this is what it always did. It's faster on in-order cores and supports
> overlapping copies (unlike the Falkor memcpy).
> 
> I'll fix up the long lines before commit.
> 
> Cheers,
> Wilco
>

Thread overview: 4+ messages
     [not found] <AM5PR0801MB2035AA956FB1D577A54A72EB83EA0@AM5PR0801MB2035.eurprd08.prod.outlook.com>
2020-02-26 16:18 ` [PATCH v2][AArch64] Improve integer memcpy Wilco Dijkstra
2020-03-10 18:46   ` Adhemerval Zanella via Libc-alpha
2020-03-11 16:32     ` Wilco Dijkstra
2020-03-13 16:43       ` Adhemerval Zanella via Libc-alpha [this message]
