From: naohirot--- via Libc-alpha
To: Wilco Dijkstra
Subject: RE: [PATCH v3 5/5] AArch64: Improve A64FX memset
Date: Tue, 24 Aug 2021 07:56:15 +0000
List-Id: Libc-alpha mailing list
Reply-To: "naohirot@fujitsu.com"
Cc: 'GNU C Library'

Hi Wilco,

> > In my environment, I don't have any performance degradation by reverting unroll8,
> > but 16KB performance improvement as shown in the graphs.
>
> I still see a major regression at 1KB in the graph (it is larger relatively than the gain at 16KB),
> plus many smaller regressions between 2KB-8KB.

Are you talking about the regression between V4 and V4 fixed?
If so, that is also observed in my environment, as shown in graph [2].
But V4 fixed is not degraded compared to master, as shown in graph [1].

I think we are getting almost the same results, though not exactly the same, right?

> > The first graph [1] shows comparison the master with V4 fixed.
> > The second graph [2] shows comparison V4 with V4 fixed.
> >
> > [1] https://drive.google.com/file/d/19og4ZhU9itzFAVXX8TIzlpgiiukiXQbp/view?usp=sharing
> > [2] https://drive.google.com/file/d/1wQgPU6GyRQ_Z8ibsGja-NfdKhN5bz7I9/view?usp=sharing

> > In your environment, do you have any performance degradation by reverting unroll8?
> > If there is no disadvantage by reverting unroll8, why don't we revert it?
>
> For me bench-memset shows a 50% regression with the unroll8 loop reverted plus
> many smaller regressions. So I don't think reverting is a good idea.

If the 50% regression in your environment is at 1KB, a regression at 1KB also happens
in my environment, as shown in graph [4], but the rate seems to be less than 50%.

Both your results and mine are real.
I don't think it's rational to make a decision by looking at the results from only one environment.

As I explained at the bottom of this mail, the V4 code is tuned for Apollo 80 and FX700.
So we need to take FX1000 into account too.

> I tried "perf stat" and oddly enough this loop causes a lot of branch mispredictions.
> However if you add a branch at the top of the loop that is never taken (eg. blt and
> ensuring the sub above it sets the flags), it becomes faster than the best results so far.
> If you can reproduce that, it is probably the best workaround.

Does "it becomes faster than the best results so far" mean faster than master?
I think we should take the performance of master as the baseline.
If the workaround is not at least as fast as master at 16KB, where performance peaks,
reverting unroll8 is preferable.

I'm not sure I understand what the workaround code looks like. Is it like this?

L(unroll8):
	sub	count, count, tmp1
	.p2align 4
1:	subs	tmp2, xzr, xzr
	b.lt	1b
	st1b_unroll 0, 7
	add	dst, dst, tmp1
	subs	count, count, tmp1
	b.hi	1b
	add	count, count, tmp1

> > Is it HPE Apollo 80 System?
> > Or does ARM Company have an account to Fujitsu FX1000 or FX700?
>
> It has 48 cores, that's all I know...

I think your environment must be an Apollo 80 or FX700, which have 48 cores and 4 NUMA nodes.
An FX1000 master node has 52 cores and an FX1000 compute node has 50 cores.
The OS sees an FX1000 as if it had 8 NUMA nodes.

Thanks.
Naohiro