From mboxrd@z Thu Jan 1 00:00:00 1970
From: Wilco Dijkstra via Libc-alpha
Reply-To: Wilco Dijkstra
To: "naohirot@fujitsu.com"
Cc: 'GNU C Library'
Subject: [PATCH] AArch64: Improve A64FX memset
Date: Wed, 30 Jun 2021 15:49:52 +0000
List-Id: Libc-alpha mailing list

Hi Naohiro,

And here is the memset version. The code is smaller and easier to follow plus a bit
faster. One thing I noticed is that it does not optimize for the common case of memset
of zero (the generic memset is significantly faster for large sizes). It is possible to just
use DC ZVA for zeroing memsets and not do any vector stores.


Reduce the codesize of the A64FX memset by simplifying the small memset code,
better handling of alignment and last 8 vectors as well as removing redundant
instructions and branches.
The size for memset goes down from 1032 to 604 bytes.
Performance is noticeably better for small memsets.

Passes GLIBC regress, OK for commit?

---

diff --git a/sysdeps/aarch64/multiarch/memset_a64fx.S b/sysdeps/aarch64/multiarch/memset_a64fx.S
index ce54e5418b08c8bc0ecc7affff68a59272ba6397..da8930c2b0e5ab552943331e9a1aa355e917e775 100644
--- a/sysdeps/aarch64/multiarch/memset_a64fx.S
+++ b/sysdeps/aarch64/multiarch/memset_a64fx.S
@@ -57,149 +57,78 @@
 	.endif
 	.endm
 
-	.macro shortcut_for_small_size exit
-	// if rest <= vector_length * 2
+
+#undef BTI_C
+#define BTI_C
+
+ENTRY (MEMSET)
+
+	PTR_ARG (0)
+	SIZE_ARG (2)
+
+	dup	z0.b, valw
 	whilelo	p0.b, xzr, count
+	cntb	vector_length
 	whilelo	p1.b, vector_length, count
-	b.last	1f
 	st1b	z0.b, p0, [dstin, #0, mul vl]
 	st1b	z0.b, p1, [dstin, #1, mul vl]
-	ret
-1:	// if rest > vector_length * 8
-	cmp	count, vector_length, lsl 3	// vector_length * 8
-	b.hi	\exit
-	// if rest <= vector_length * 4
-	lsl	tmp1, vector_length, 1	// vector_length * 2
-	whilelo	p2.b, tmp1, count
-	incb	tmp1
-	whilelo	p3.b, tmp1, count
 	b.last	1f
-	st1b	z0.b, p0, [dstin, #0, mul vl]
-	st1b	z0.b, p1, [dstin, #1, mul vl]
-	st1b	z0.b, p2, [dstin, #2, mul vl]
-	st1b	z0.b, p3, [dstin, #3, mul vl]
 	ret
-1:	// if rest <= vector_length * 8
-	lsl	tmp1, vector_length, 2	// vector_length * 4
-	whilelo	p4.b, tmp1, count
-	incb	tmp1
-	whilelo	p5.b, tmp1, count
-	b.last	1f
-	st1b	z0.b, p0, [dstin, #0, mul vl]
-	st1b	z0.b, p1, [dstin, #1, mul vl]
-	st1b	z0.b, p2, [dstin, #2, mul vl]
-	st1b	z0.b, p3, [dstin, #3, mul vl]
-	st1b	z0.b, p4, [dstin, #4, mul vl]
-	st1b	z0.b, p5, [dstin, #5, mul vl]
-	ret
-1:	lsl	tmp1, vector_length, 2	// vector_length * 4
-	incb	tmp1			// vector_length * 5
-	incb	tmp1			// vector_length * 6
-	whilelo	p6.b, tmp1, count
-	incb	tmp1
-	whilelo	p7.b, tmp1, count
-	st1b	z0.b, p0, [dstin, #0, mul vl]
-	st1b	z0.b, p1, [dstin, #1, mul vl]
-	st1b	z0.b, p2, [dstin, #2, mul vl]
-	st1b	z0.b, p3, [dstin, #3, mul vl]
-	st1b	z0.b, p4, [dstin, #4, mul vl]
-	st1b	z0.b, p5, [dstin, #5, mul vl]
-	st1b	z0.b, p6, [dstin, #6, mul vl]
-	st1b	z0.b, p7, [dstin, #7, mul vl]
-	ret
-	.endm
 
-ENTRY (MEMSET)
-
-	PTR_ARG (0)
-	SIZE_ARG (2)
-
-	cbnz	count, 1f
+	.p2align 4
+1:
+	add	dst, dstin, count
+	cmp	count, vector_length, lsl 2
+	b.hi	1f
+	st1b	z0.b, p0, [dst, #-2, mul vl]
+	st1b	z0.b, p0, [dst, #-1, mul vl]
+	ret
+1:
+	cmp	count, vector_length, lsl 3	// vector_length * 8
+	b.hi	L(vl_agnostic)
+
+	st1b	z0.b, p0, [dstin, #2, mul vl]
+	st1b	z0.b, p0, [dstin, #3, mul vl]
+	st1b	z0.b, p0, [dst, #-4, mul vl]
+	st1b	z0.b, p0, [dst, #-3, mul vl]
+	st1b	z0.b, p0, [dst, #-2, mul vl]
+	st1b	z0.b, p0, [dst, #-1, mul vl]
 	ret
-1:	dup	z0.b, valw
-	cntb	vector_length
-	// shortcut for less than vector_length * 8
-	// gives a free ptrue to p0.b for n >= vector_length
-	shortcut_for_small_size L(vl_agnostic)
-	// end of shortcut
 
 L(vl_agnostic): // VL Agnostic
 	mov	rest, count
 	mov	dst, dstin
-	add	dstend, dstin, count
-	// if rest >= L2_SIZE && vector_length == 64 then L(L2)
 	mov	tmp1, 64
-	cmp	rest, L2_SIZE
-	ccmp	vector_length, tmp1, 0, cs
-	b.eq	L(L2)
 	// if rest >= L1_SIZE && vector_length == 64 then L(L1_prefetch)
 	cmp	rest, L1_SIZE
 	ccmp	vector_length, tmp1, 0, cs
 	b.eq	L(L1_prefetch)
 
-L(unroll32):
-	lsl	tmp1, vector_length, 3	// vector_length * 8
-	lsl	tmp2, vector_length, 5	// vector_length * 32
-	.p2align 3
-1:	cmp	rest, tmp2
-	b.cc	L(unroll8)
-	st1b_unroll
-	add	dst, dst, tmp1
-	st1b_unroll
-	add	dst, dst, tmp1
-	st1b_unroll
-	add	dst, dst, tmp1
-	st1b_unroll
-	add	dst, dst, tmp1
-	sub	rest, rest, tmp2
-	b	1b
-
 L(unroll8):
 	lsl	tmp1, vector_length, 3
-	.p2align 3
+	.p2align 4
 1:	cmp	rest, tmp1
-	b.cc	L(last)
+	b.ls	L(last)
 	st1b_unroll
 	add	dst, dst, tmp1
 	sub	rest, rest, tmp1
 	b	1b
 
-L(last):
-	whilelo	p0.b, xzr, rest
-	whilelo	p1.b, vector_length, rest
-	b.last	1f
-	st1b	z0.b, p0, [dst, #0, mul vl]
-	st1b	z0.b, p1, [dst, #1, mul vl]
-	ret
-1:	lsl	tmp1, vector_length, 1	// vector_length * 2
-	whilelo	p2.b, tmp1, rest
-	incb	tmp1
-	whilelo	p3.b, tmp1, rest
-	b.last	1f
-	st1b	z0.b, p0, [dst, #0, mul vl]
-	st1b	z0.b, p1, [dst, #1, mul vl]
-	st1b	z0.b, p2, [dst, #2, mul vl]
-	st1b	z0.b, p3, [dst, #3, mul vl]
-	ret
-1:	lsl	tmp1, vector_length, 2	// vector_length * 4
-	whilelo	p4.b, tmp1, rest
-	incb	tmp1
-	whilelo	p5.b, tmp1, rest
-	incb	tmp1
-	whilelo	p6.b, tmp1, rest
-	incb	tmp1
-	whilelo	p7.b, tmp1, rest
-	st1b	z0.b, p0, [dst, #0, mul vl]
-	st1b	z0.b, p1, [dst, #1, mul vl]
-	st1b	z0.b, p2, [dst, #2, mul vl]
-	st1b	z0.b, p3, [dst, #3, mul vl]
-	st1b	z0.b, p4, [dst, #4, mul vl]
-	st1b	z0.b, p5, [dst, #5, mul vl]
-	st1b	z0.b, p6, [dst, #6, mul vl]
-	st1b	z0.b, p7, [dst, #7, mul vl]
+L(last): // store 8 vectors from the end
+	add	dst, dst, rest
+	st1b	z0.b, p0, [dst, #-8, mul vl]
+	st1b	z0.b, p0, [dst, #-7, mul vl]
+	st1b	z0.b, p0, [dst, #-6, mul vl]
+	st1b	z0.b, p0, [dst, #-5, mul vl]
+	st1b	z0.b, p0, [dst, #-4, mul vl]
+	st1b	z0.b, p0, [dst, #-3, mul vl]
+	st1b	z0.b, p0, [dst, #-2, mul vl]
+	st1b	z0.b, p0, [dst, #-1, mul vl]
 	ret
 
 L(L1_prefetch): // if rest >= L1_SIZE
+	cmp	rest, L2_SIZE
+	b.hs	L(L2)
 	.p2align 3
 1:	st1b_unroll 0, 3
 	prfm	pstl1keep, [dst, PF_DIST_L1]
@@ -208,37 +137,19 @@ L(L1_prefetch): // if rest >= L1_SIZE
 	add	dst, dst, CACHE_LINE_SIZE * 2
 	sub	rest, rest, CACHE_LINE_SIZE * 2
 	cmp	rest, L1_SIZE
-	b.ge	1b
-	cbnz	rest, L(unroll32)
-	ret
+	b.hs	1b
+	b	L(unroll8)
 
 L(L2):
-	// align dst address at vector_length byte boundary
-	sub	tmp1, vector_length, 1
-	ands	tmp2, dst, tmp1
-	// if vl_remainder == 0
-	b.eq	1f
-	sub	vl_remainder, vector_length, tmp2
-	// process remainder until the first vector_length boundary
-	whilelt	p2.b, xzr, vl_remainder
-	st1b	z0.b, p2, [dst]
-	add	dst, dst, vl_remainder
-	sub	rest, rest, vl_remainder
 	// align dstin address at CACHE_LINE_SIZE byte boundary
-1:	mov	tmp1, CACHE_LINE_SIZE
-	ands	tmp2, dst, CACHE_LINE_SIZE - 1
-	// if cl_remainder == 0
-	b.eq	L(L2_dc_zva)
-	sub	cl_remainder, tmp1, tmp2
-	// process remainder until the first CACHE_LINE_SIZE boundary
-	mov	tmp1, xzr	// index
-2:	whilelt	p2.b, tmp1, cl_remainder
-	st1b	z0.b, p2, [dst, tmp1]
-	incb	tmp1
-	cmp	tmp1, cl_remainder
-	b.lo	2b
-	add	dst, dst, cl_remainder
-	sub	rest, rest, cl_remainder
+	and	tmp1, dst, CACHE_LINE_SIZE - 1
+	sub	tmp1, tmp1, CACHE_LINE_SIZE
+	st1b	z0.b, p0, [dst, #0, mul vl]
+	st1b	z0.b, p0, [dst, #1, mul vl]
+	st1b	z0.b, p0, [dst, #2, mul vl]
+	st1b	z0.b, p0, [dst, #3, mul vl]
+	sub	dst, dst, tmp1
+	add	rest, rest, tmp1
 
 L(L2_dc_zva):
 	// zero fill
@@ -250,16 +161,15 @@ L(L2_dc_zva):
 	.p2align 3
 1:	st1b_unroll 0, 3
 	add	tmp2, dst, zva_len
-	dc	zva, tmp2
+	dc	zva, tmp2
 	st1b_unroll 4, 7
 	add	tmp2, tmp2, CACHE_LINE_SIZE
 	dc	zva, tmp2
 	add	dst, dst, CACHE_LINE_SIZE * 2
 	sub	rest, rest, CACHE_LINE_SIZE * 2
 	cmp	rest, tmp1	// ZF_DIST + CACHE_LINE_SIZE * 2
-	b.ge	1b
-	cbnz	rest, L(unroll8)
-	ret
+	b.hs	1b
+	b	L(unroll8)
 
 END (MEMSET)
 libc_hidden_builtin_def (MEMSET)
 
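Note on the L(last) tail handling in the patch: once the 8-vector loop exits with between 1 byte and 8 vectors' worth of bytes left, the new code stores 8 full vectors measured back from the end of the buffer instead of building predicates for a partial tail. Some of those stores may overlap bytes the loop has already written, which is harmless for memset. The following is a minimal C sketch of that idea only, not the patch's code: it assumes a fixed 64-byte vector, uses plain memset as a stand-in for an unpredicated st1b, and the names VL, UNROLL, store_vec and sketch_memset_large are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed for illustration: 64-byte vectors, 8 vectors per loop iteration. */
enum { VL = 64, UNROLL = 8 };

/* Store one full "vector" of VL bytes; stands in for an unpredicated st1b. */
static void store_vec(unsigned char *p, int value)
{
    memset(p, value, VL);
}

/* Hypothetical sketch of the large-size path.
   Precondition: count > UNROLL * VL, so the backward stores in the tail
   can never start before dstin. */
void sketch_memset_large(unsigned char *dstin, int value, size_t count)
{
    assert(count > (size_t)UNROLL * VL);

    unsigned char *dst = dstin;
    size_t rest = count;

    /* Main loop: store groups of 8 vectors while more than 8 remain. */
    while (rest > (size_t)UNROLL * VL) {
        for (int i = 0; i < UNROLL; i++)
            store_vec(dst + (size_t)i * VL, value);
        dst += UNROLL * VL;
        rest -= UNROLL * VL;
    }

    /* Tail: 1..UNROLL*VL bytes remain. Store the last 8 vectors counted
       back from the end; they may overlap bytes the loop already wrote. */
    unsigned char *end = dst + rest;
    for (int i = UNROLL; i >= 1; i--)
        store_vec(end - (size_t)i * VL, value);
}
```

The precondition mirrors the `b.hi L(vl_agnostic)` check on `count > vector_length * 8` in the patch: because at least one full loop iteration has run (or the buffer is otherwise known to be longer than 8 vectors), the backward stores stay inside the destination.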