From mboxrd@z Thu Jan 1 00:00:00 1970
From: naohirot--- via Libc-alpha
Reply-To: "naohirot@fujitsu.com"
To: Wilco Dijkstra
Cc: 'GNU C Library'
Subject: RE: [PATCH v3 5/5] AArch64: Improve A64FX memset
Date: Tue, 3 Aug 2021 11:22:56 +0000
List-Id: Libc-alpha mailing list
Sender: "Libc-alpha"

Hi Wilco,

Thank you for the patch.

I confirmed that V3 Part 5 performs better than master,
except for a dip at 16KB [1].
See the comparison graphs between master and V3 Part 5 [1][2][3].

The 16KB dip can be fixed; see below.

Reviewed-by: Naohiro Tamura
Tested-by: Naohiro Tamura

[1] https://drive.google.com/file/d/10ujn5LNOqgI2VpynUc1Adt9U777_Ixxz/view?usp=sharing
[2] https://drive.google.com/file/d/14vCq_ng0tFDjo1BRqaMm9m9o3Kntjr0v/view?usp=sharing
[3] https://drive.google.com/file/d/1GBFk8czzJV5hB9sT93qB7Rw7pHQzEfRt/view?usp=sharing

> -----Original Message-----
> From: Wilco Dijkstra
> Sent: Friday, July 23, 2021 1:05 AM
> To: Tamura, Naohiro/田村 直広
> Cc: 'GNU C Library'
> Subject: [PATCH v3 5/5] AArch64: Improve A64FX memset

How about this instead?
"AArch64: Improve A64FX memset by removing rest variable"

>
> Simplify the code for memsets smaller than L1.
> Improve the unroll8 and L1_prefetch loops.
>
> ---
> diff --git a/sysdeps/aarch64/multiarch/memset_a64fx.S b/sysdeps/aarch64/multiarch/memset_a64fx.S
> index 8665c272431b46dadea53c63ab74829c3aa99312..36628e101db33a9a8ff5234b98dd5a3a5c9ed73c 100644
> --- a/sysdeps/aarch64/multiarch/memset_a64fx.S
> +++ b/sysdeps/aarch64/multiarch/memset_a64fx.S
> @@ -30,7 +30,6 @@
>  #define L2_SIZE         (8*1024*1024)   // L2 8MB - 1MB
>  #define CACHE_LINE_SIZE 256
>  #define PF_DIST_L1      (CACHE_LINE_SIZE * 16)  // Prefetch distance L1
> -#define rest            x2
>  #define vector_length   x9
>
>  #if HAVE_AARCH64_SVE_ASM
> @@ -89,29 +88,19 @@ ENTRY (MEMSET)
>
>      .p2align 4
>  L(vl_agnostic): // VL Agnostic
> -    mov     rest, count
>      mov     dst, dstin
> -    add     dstend, dstin, count
> -    // if rest >= L2_SIZE && vector_length == 64 then L(L2)
> -    mov     tmp1, 64
> -    cmp     rest, L2_SIZE
> -    ccmp    vector_length, tmp1, 0, cs
> -    b.eq    L(L2)
> -    // if rest >= L1_SIZE && vector_length == 64 then L(L1_prefetch)
> -    cmp     rest, L1_SIZE
> -    ccmp    vector_length, tmp1, 0, cs
> -    b.eq    L(L1_prefetch)
> -
> +    cmp     count, L1_SIZE
> +    b.hi    L(L1_prefetch)
>
> +    // count >= 8 * vector_length
>  L(unroll8):
> -    lsl     tmp1, vector_length, 3
> -    .p2align 3
> -1:  cmp     rest, tmp1
> -    b.cc    L(last)
> -    st1b_unroll
> +    sub     count, count, tmp1
> +    .p2align 4
> +1:  subs    count, count, tmp1
> +    st1b_unroll 0, 7
>      add     dst, dst, tmp1
> -    sub     rest, rest, tmp1
> -    b       1b
> +    b.hi    1b
> +    add     count, count, tmp1
>

Reverting the unroll8 logic to the V3 Part 4 version fixed the 16KB dip [4].
See the comparison graphs between master and the fixed V3 Part 5 [4][5][6].

L(unroll8):
    lsl     tmp1, vector_length, 3
    .p2align 3
1:  cmp     count, tmp1
    b.cc    L(last)
    st1b_unroll
    add     dst, dst, tmp1
    sub     count, count, tmp1
    b       1b

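For readers comparing the two L(unroll8) variants above, the following Python sketch models only their loop control, not the SVE stores or their timing. It assumes that tmp1 already holds 8 * vector_length on entry to the Part 5 loop (the quoted hunk does not show where it is set), and the function names are hypothetical. Under that assumption both variants store whole 8-vector blocks and leave at most one block for L(last); they differ in where the compare sits and in how much is left over.

```python
MASK = (1 << 64) - 1  # model 64-bit register wraparound


def unroll8_cmp_first(count, step):
    """Compare-first loop (the Part 4 style shown in the reply).

    Returns (bytes stored by the loop, bytes left for L(last)).
    """
    stored = 0
    while count >= step:      # cmp count, tmp1 ; b.cc L(last)
        stored += step        # st1b_unroll ; add dst, dst, tmp1
        count -= step         # sub count, count, tmp1
    return stored, count


def unroll8_subs_first(count, step):
    """Subtract-first loop (the Part 5 hunk); requires count >= step.

    Returns (bytes stored by the loop, bytes left for L(last)).
    """
    if count < step:
        raise ValueError("Part 5 loop is entered only with count >= step")
    stored = 0
    count = (count - step) & MASK          # sub count, count, tmp1
    while True:
        before = count
        count = (count - step) & MASK      # subs count, count, tmp1
        stored += step                     # st1b_unroll 0, 7 ; add dst, dst, tmp1
        if not before > step:              # b.hi 1b (unsigned higher)
            break
    count = (count + step) & MASK          # add count, count, tmp1
    return stored, count
```

With step = 512 (8 vectors of 64 bytes) and count = 16384, the compare-first model returns (16384, 0) and the subtract-first model returns (15872, 512). This shows only the difference in control flow; it does not by itself explain the measured dip.
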
[4] https://drive.google.com/file/d/1NfaEF24ud8JOpCktlzoeQ5VvyJc593lD/view?usp=sharing
[5] https://drive.google.com/file/d/1DfwPenANTwgLm2kqu_w9QugmYNqysOMF/view?usp=sharing
[6] https://drive.google.com/file/d/1OL6_gbdevwJmfEeRbEANJ4pVdqaSRZvV/view?usp=sharing

Thanks.
Naohiro

>  L(last):
>      cmp     count, vector_length, lsl 1
> @@ -129,18 +118,22 @@ L(last):
>      st1b    z0.b, p0, [dstend, -1, mul vl]
>      ret
>
> -L(L1_prefetch): // if rest >= L1_SIZE
> +    // count >= L1_SIZE
>      .p2align 3
> +L(L1_prefetch):
> +    cmp     count, L2_SIZE
> +    b.hs    L(L2)
> +    cmp     vector_length, 64
> +    b.ne    L(unroll8)
>  1:  st1b_unroll 0, 3
>      prfm    pstl1keep, [dst, PF_DIST_L1]
>      st1b_unroll 4, 7
>      prfm    pstl1keep, [dst, PF_DIST_L1 + CACHE_LINE_SIZE]
>      add     dst, dst, CACHE_LINE_SIZE * 2
> -    sub     rest, rest, CACHE_LINE_SIZE * 2
> -    cmp     rest, L1_SIZE
> -    b.ge    1b
> -    cbnz    rest, L(unroll8)
> -    ret
> +    sub     count, count, CACHE_LINE_SIZE * 2
> +    cmp     count, PF_DIST_L1
> +    b.hs    1b
> +    b       L(unroll8)
>
>  // count >= L2_SIZE
>  L(L2):