From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on starla X-Spam-Level: X-Spam-Status: No, score=-0.9 required=3.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,RCVD_IN_DNSWL_MED, SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.6 Received: from server2.sourceware.org (server2.sourceware.org [IPv6:2620:52:3:1:0:246e:9693:128c]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by dcvr.yhbt.net (Postfix) with ESMTPS id C8A6F1F44D for ; Thu, 21 Mar 2024 23:17:57 +0000 (UTC) Authentication-Results: dcvr.yhbt.net; dkim=pass (2048-bit key; unprotected) header.d=gmail.com header.i=@gmail.com header.a=rsa-sha256 header.s=20230601 header.b=FEKde3iS; dkim-atps=neutral Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 2CF833858C98 for ; Thu, 21 Mar 2024 23:17:56 +0000 (GMT) Received: from mail-oa1-x33.google.com (mail-oa1-x33.google.com [IPv6:2001:4860:4864:20::33]) by sourceware.org (Postfix) with ESMTPS id 1B6C23858D28 for ; Thu, 21 Mar 2024 23:17:31 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 1B6C23858D28 Authentication-Results: sourceware.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 1B6C23858D28 Authentication-Results: server2.sourceware.org; arc=none smtp.remote-ip=2001:4860:4864:20::33 ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1711063054; cv=none; b=CNsyfvucj+4aB4d6/Mrai0xdcZWp7Gfia/fIJABMSn3JMazZ784RB8uX0QgjWc0DJJQB5TlhlxGA0honWisZtBQyTGA5k7kDMY+rDn5bBHq/FrqOpGQ4bdiHUVl6LxiLUAlzuADcb0JwR+79ZwgdB8a+Q7YrYC5f0guOMZMEOWQ= ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1711063054; c=relaxed/simple; bh=QlMfLoeQneXeIeyoByLppMajBpFsBU3GHkNhf6c82/c=; h=DKIM-Signature:MIME-Version:From:Date:Message-ID:Subject:To; b=uLDiNQr2f13mceLRZbM091NAAL+kFTGYUsk8Eh9nsxM3TP31ObFZWyY680wF5fv7zDG7j0+xUaxJkoSfmokSdpdH4kC+4kyJwNob0SH2rPyXnnE+AWd/rN35VNvodroLy8E/yN22sGekr36UWJ1FG+aCMsH2VlvBPA66kjz2ozw= ARC-Authentication-Results: i=1; server2.sourceware.org Received: by mail-oa1-x33.google.com with SMTP id 586e51a60fabf-22200c78d4fso820445fac.1 for ; Thu, 21 Mar 2024 16:17:31 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1711063049; x=1711667849; darn=sourceware.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=gyPkXSuDM3CgbZVuq71u8u5UBKHLWANt+2/A+rjayhQ=; b=FEKde3iSiPcZ1/xr10LhZGBlnYaAla6YnS9cFy7Ja8x1hyoekB5QXn/+pxJiFmKZGR 7QaY1+PL1IzXPAGAC6XQ6E0e2pE4ujjAu/QmsFGh37eOq0ugpi2MK1J0KKHUp4ni+mnJ bydeG9EGBCLWYptzfyrQ6EtzUL3HNOW67MrmzH6J7vCpLK8bnbFI6RfZ0oNUxZoaLHzM 0Kuw3p4CohudhSdUXXJZ1BB8xAMjhoJAP8VB1iXGm8437xMTQKN5jQGLv2NqTY/rcmfn pqjygwj3vlPCC4vb/xLDOwIz8bzabmwdBzQc4LIxed4vtvBGjGjwDVOzdXRfOMMVfbkl eCGg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1711063049; x=1711667849; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=gyPkXSuDM3CgbZVuq71u8u5UBKHLWANt+2/A+rjayhQ=; b=OyMGS83B8d5cJFqDQi607KC1s6snSYMeFTA7YusxciJOcdt7Iv+yLZQvxouI+7u48b LyhYMh1kpq86SMKNMDkTINv5PlBTvsWCtqJzCboljpO+pe1nVVHTz/MFlyG0rADGFtUK J4qhzwAp5jv1NxMJrcjtCqS/8sPnGePadTA98tbmKz2OSLyDt+rS2NG/18duPNapOMaM +1f2PA3u8u6NmLFpNb7J80cFKMELuCKB4wy4D3Wq7cczIEKW1i99BlqP8eaJ5PvIJuNy 6SGzfJqrt7yuoLF7yfqKBoJ9wtKi0rbt/2McrEhyQclUP7SwqDRGDhoE/Is+KNP0TaHR CU1Q== X-Gm-Message-State: AOJu0YzgJilL6GnYzTqntoBialG4WLJB823FKopOWgoYzeDYC9HovvBC FPAwexlE+Fa+E+5odo8Nsc7xHOiqfFwiW5tDU7lBeG/BqJEThQy1mTpvOMO+XOd9YG4KkBEevmO roEa2BA8NFlVEQFJhQnAfxjRZ0Tk= X-Google-Smtp-Source: AGHT+IH38K6LeNTOSJzXU4pfDVi1mslQVzg0GNgYre6ZirkXz6ageVoj7LFv0jAFhOgQUnepjT/z72uZoFZyBWhEQo4= X-Received: by 2002:a05:6870:d919:b0:229:cf31:636e with SMTP id gq25-20020a056870d91900b00229cf31636emr766552oab.43.1711063049104; Thu, 21 Mar 2024 16:17:29 -0700 (PDT) MIME-Version: 1.0 References: <20240321171200.1177053-1-adhemerval.zanella@linaro.org> In-Reply-To: <20240321171200.1177053-1-adhemerval.zanella@linaro.org> From: Noah Goldstein Date: Thu, 21 Mar 2024 18:17:17 -0500 Message-ID: Subject: Re: [PATCH] x86_64: Remove avx512 strstr implementation To: Adhemerval Zanella Cc: libc-alpha@sourceware.org, Wilco Dijkstra , "H . J . Lu" Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.30 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: libc-alpha-bounces+e=80x24.org@sourceware.org On Thu, Mar 21, 2024 at 12:12=E2=80=AFPM Adhemerval Zanella wrote: > > As indicated in a recent thread, this it is a simple brute-force > algorithm that checks the whole needle at a matching character pair > (and does so 1 byte at a time after the first 64 bytes of a needle). > Also it never skips ahead and thus can match at every haystack > position after trying to match all of the needle, which generic > implementation avoids. > > As indicated by Wilco, a 4x larger needle and 16x larger haystack gives > a clear 65x slowdown both basic_strstr and __strstr_avx512: > > "ifuncs": ["basic_strstr", "twoway_strstr", "__strstr_avx512", > "__strstr_sse2_unaligned", "__strstr_generic"], > > { > "len_haystack": 65536, > "len_needle": 1024, > "align_haystack": 0, > "align_needle": 0, > "fail": 1, > "desc": "Difficult bruteforce needle", > "timings": [4.0948e+07, 15094.5, 3.20818e+07, 108558, 10839.2] > }, > { > "len_haystack": 1048576, > "len_needle": 4096, > "align_haystack": 0, > "align_needle": 0, > "fail": 1, > "desc": "Difficult bruteforce needle", > "timings": [2.69767e+09, 100797, 2.08535e+09, 495706, 82666.9] > } > > PS: I don't have an AVX512 capable machine to verify this issues, but > skimming through the code it does seems to follow what Wilco has > described. > > --- > sysdeps/x86_64/multiarch/Makefile | 3 - > sysdeps/x86_64/multiarch/ifunc-impl-list.c | 6 - > sysdeps/x86_64/multiarch/strstr-avx512.c | 218 --------------------- > sysdeps/x86_64/multiarch/strstr.c | 25 +-- > 4 files changed, 4 insertions(+), 248 deletions(-) > delete mode 100644 sysdeps/x86_64/multiarch/strstr-avx512.c > > diff --git a/sysdeps/x86_64/multiarch/Makefile b/sysdeps/x86_64/multiarch= /Makefile > index d3d2270394..696cb66991 100644 > --- a/sysdeps/x86_64/multiarch/Makefile > +++ b/sysdeps/x86_64/multiarch/Makefile > @@ -117,7 +117,6 @@ sysdep_routines +=3D \ > strrchr-evex512 \ > strrchr-sse2 \ > strspn-sse4 \ > - strstr-avx512 \ > strstr-sse2-unaligned \ > varshift \ > # sysdep_routines > @@ -125,8 +124,6 @@ sysdep_routines +=3D \ > CFLAGS-strcspn-sse4.c +=3D -msse4 > CFLAGS-strpbrk-sse4.c +=3D -msse4 > CFLAGS-strspn-sse4.c +=3D -msse4 > - > -CFLAGS-strstr-avx512.c +=3D -mavx512f -mavx512vl -mavx512dq -mavx512bw -= mbmi -mbmi2 -O3 > endif > > ifeq ($(subdir),wcsmbs) > diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/= multiarch/ifunc-impl-list.c > index c4a21d4b7c..0bbb71bbbf 100644 > --- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c > +++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c > @@ -790,12 +790,6 @@ __libc_ifunc_impl_list (const char *name, struct lib= c_ifunc_impl *array, > > /* Support sysdeps/x86_64/multiarch/strstr.c. */ > IFUNC_IMPL (i, name, strstr, > - IFUNC_IMPL_ADD (array, i, strstr, > - (CPU_FEATURE_USABLE (AVX512VL) > - && CPU_FEATURE_USABLE (AVX512BW) > - && CPU_FEATURE_USABLE (AVX512DQ) > - && CPU_FEATURE_USABLE (BMI2)), > - __strstr_avx512) > IFUNC_IMPL_ADD (array, i, strstr, 1, __strstr_sse2_unaligne= d) > IFUNC_IMPL_ADD (array, i, strstr, 1, __strstr_generic)) > > diff --git a/sysdeps/x86_64/multiarch/strstr-avx512.c b/sysdeps/x86_64/mu= ltiarch/strstr-avx512.c > deleted file mode 100644 > index 3ac53accbd..0000000000 > --- a/sysdeps/x86_64/multiarch/strstr-avx512.c > +++ /dev/null > @@ -1,218 +0,0 @@ > -/* strstr optimized with 512-bit AVX-512 instructions > - Copyright (C) 2022-2024 Free Software Foundation, Inc. > - This file is part of the GNU C Library. > - > - The GNU C Library is free software; you can redistribute it and/or > - modify it under the terms of the GNU Lesser General Public > - License as published by the Free Software Foundation; either > - version 2.1 of the License, or (at your option) any later version. > - > - The GNU C Library is distributed in the hope that it will be useful, > - but WITHOUT ANY WARRANTY; without even the implied warranty of > - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > - Lesser General Public License for more details. > - > - You should have received a copy of the GNU Lesser General Public > - License along with the GNU C Library; if not, see > - . */ > - > -#include > -#include > -#include > -#include > - > -#define FULL_MMASK64 0xffffffffffffffff > -#define ONE_64BIT 0x1ull > -#define ZMM_SIZE_IN_BYTES 64 > -#define PAGESIZE 4096 > - > -#define cvtmask64_u64(...) (uint64_t) (__VA_ARGS__) > -#define kshiftri_mask64(x, y) ((x) >> (y)) > -#define kand_mask64(x, y) ((x) & (y)) > - > -/* > - Returns the index of the first edge within the needle, returns 0 if no = edge > - is found. Example: 'ab' is the first edge in 'aaaaaaaaaabaarddg' > - */ > -static inline size_t > -find_edge_in_needle (const char *ned) > -{ > - size_t ind =3D 0; > - while (ned[ind + 1] !=3D '\0') > - { > - if (ned[ind] !=3D ned[ind + 1]) > - return ind; > - else > - ind =3D ind + 1; > - } > - return 0; > -} > - > -/* > - Compare needle with haystack byte by byte at specified location > - */ > -static inline bool > -verify_string_match (const char *hay, const size_t hay_index, const char= *ned, > - size_t ind) > -{ > - while (ned[ind] !=3D '\0') > - { > - if (ned[ind] !=3D hay[hay_index + ind]) > - return false; > - ind =3D ind + 1; > - } > - return true; > -} > - > -/* > - Compare needle with haystack at specified location. The first 64 bytes = are > - compared using a ZMM register. > - */ > -static inline bool > -verify_string_match_avx512 (const char *hay, const size_t hay_index, > - const char *ned, const __mmask64 ned_mask, > - const __m512i ned_zmm) > -{ > - /* check first 64 bytes using zmm and then scalar */ > - __m512i hay_zmm =3D _mm512_loadu_si512 (hay + hay_index); // safe to d= o so > - __mmask64 match =3D _mm512_mask_cmpneq_epi8_mask (ned_mask, hay_zmm, n= ed_zmm); > - if (match !=3D 0x0) // failed the first few chars > - return false; > - else if (ned_mask =3D=3D FULL_MMASK64) > - return verify_string_match (hay, hay_index, ned, ZMM_SIZE_IN_BYTES); > - return true; > -} > - > -char * > -__strstr_avx512 (const char *haystack, const char *ned) > -{ > - char first =3D ned[0]; > - if (first =3D=3D '\0') > - return (char *)haystack; > - if (ned[1] =3D=3D '\0') > - return (char *)strchr (haystack, ned[0]); > - > - size_t edge =3D find_edge_in_needle (ned); > - > - /* ensure haystack is as long as the pos of edge in needle */ > - for (int ii =3D 0; ii < edge; ++ii) > - { > - if (haystack[ii] =3D=3D '\0') > - return NULL; > - } > - > - /* > - Load 64 bytes of the needle and save it to a zmm register > - Read one cache line at a time to avoid loading across a page boundary > - */ > - __mmask64 ned_load_mask =3D _bzhi_u64 ( > - FULL_MMASK64, 64 - ((uintptr_t) (ned) & 63)); > - __m512i ned_zmm =3D _mm512_maskz_loadu_epi8 (ned_load_mask, ned); > - __mmask64 ned_nullmask > - =3D _mm512_mask_testn_epi8_mask (ned_load_mask, ned_zmm, ned_zmm); > - > - if (__glibc_unlikely (ned_nullmask =3D=3D 0x0)) > - { > - ned_zmm =3D _mm512_loadu_si512 (ned); > - ned_nullmask =3D _mm512_testn_epi8_mask (ned_zmm, ned_zmm); > - ned_load_mask =3D ned_nullmask ^ (ned_nullmask - ONE_64BIT); > - if (ned_nullmask !=3D 0x0) > - ned_load_mask =3D ned_load_mask >> 1; > - } > - else > - { > - ned_load_mask =3D ned_nullmask ^ (ned_nullmask - ONE_64BIT); > - ned_load_mask =3D ned_load_mask >> 1; > - } > - const __m512i ned0 =3D _mm512_set1_epi8 (ned[edge]); > - const __m512i ned1 =3D _mm512_set1_epi8 (ned[edge + 1]); > - > - /* > - Read the bytes of haystack in the current cache line > - */ > - size_t hay_index =3D edge; > - __mmask64 loadmask =3D _bzhi_u64 ( > - FULL_MMASK64, 64 - ((uintptr_t) (haystack + hay_index) & 63)); > - /* First load is a partial cache line */ > - __m512i hay0 =3D _mm512_maskz_loadu_epi8 (loadmask, haystack + hay_ind= ex); > - /* Search for NULL and compare only till null char */ > - uint64_t nullmask > - =3D cvtmask64_u64 (_mm512_mask_testn_epi8_mask (loadmask, hay0, ha= y0)); > - uint64_t cmpmask =3D nullmask ^ (nullmask - ONE_64BIT); > - cmpmask =3D cmpmask & cvtmask64_u64 (loadmask); > - /* Search for the 2 characters of needle */ > - __mmask64 k0 =3D _mm512_cmpeq_epi8_mask (hay0, ned0); > - __mmask64 k1 =3D _mm512_cmpeq_epi8_mask (hay0, ned1); > - k1 =3D kshiftri_mask64 (k1, 1); > - /* k2 masks tell us if both chars from needle match */ > - uint64_t k2 =3D cvtmask64_u64 (kand_mask64 (k0, k1)) & cmpmask; > - /* For every match, search for the entire needle for a full match */ > - while (k2) > - { > - uint64_t bitcount =3D _tzcnt_u64 (k2); > - k2 =3D _blsr_u64 (k2); > - size_t match_pos =3D hay_index + bitcount - edge; > - if (((uintptr_t) (haystack + match_pos) & (PAGESIZE - 1)) > - < PAGESIZE - 1 - ZMM_SIZE_IN_BYTES) > - { > - /* > - * Use vector compare as long as you are not crossing a page > - */ > - if (verify_string_match_avx512 (haystack, match_pos, ned, > - ned_load_mask, ned_zmm)) > - return (char *)haystack + match_pos; > - } > - else > - { > - if (verify_string_match (haystack, match_pos, ned, 0)) > - return (char *)haystack + match_pos; > - } > - } > - /* We haven't checked for potential match at the last char yet */ > - haystack =3D (const char *)(((uintptr_t) (haystack + hay_index) | 63))= ; > - hay_index =3D 0; > - > - /* > - Loop over one cache line at a time to prevent reading over page > - boundary > - */ > - __m512i hay1; > - while (nullmask =3D=3D 0) > - { > - hay0 =3D _mm512_loadu_si512 (haystack + hay_index); > - hay1 =3D _mm512_load_si512 (haystack + hay_index > - + 1); // Always 64 byte aligned > - nullmask =3D cvtmask64_u64 (_mm512_testn_epi8_mask (hay1, hay1)); > - /* Compare only till null char */ > - cmpmask =3D nullmask ^ (nullmask - ONE_64BIT); > - k0 =3D _mm512_cmpeq_epi8_mask (hay0, ned0); > - k1 =3D _mm512_cmpeq_epi8_mask (hay1, ned1); > - /* k2 masks tell us if both chars from needle match */ > - k2 =3D cvtmask64_u64 (kand_mask64 (k0, k1)) & cmpmask; > - /* For every match, compare full strings for potential match */ > - while (k2) > - { > - uint64_t bitcount =3D _tzcnt_u64 (k2); > - k2 =3D _blsr_u64 (k2); > - size_t match_pos =3D hay_index + bitcount - edge; > - if (((uintptr_t) (haystack + match_pos) & (PAGESIZE - 1)) > - < PAGESIZE - 1 - ZMM_SIZE_IN_BYTES) > - { > - /* > - * Use vector compare as long as you are not crossing a pa= ge > - */ > - if (verify_string_match_avx512 (haystack, match_pos, ned, > - ned_load_mask, ned_zmm)) > - return (char *)haystack + match_pos; > - } > - else > - { > - /* Compare byte by byte */ > - if (verify_string_match (haystack, match_pos, ned, 0)) > - return (char *)haystack + match_pos; > - } > - } > - hay_index +=3D ZMM_SIZE_IN_BYTES; > - } > - return NULL; > -} > diff --git a/sysdeps/x86_64/multiarch/strstr.c b/sysdeps/x86_64/multiarch= /strstr.c > index a513bac5c3..828308668b 100644 > --- a/sysdeps/x86_64/multiarch/strstr.c > +++ b/sysdeps/x86_64/multiarch/strstr.c > @@ -35,32 +35,15 @@ > > extern __typeof (__redirect_strstr) __strstr_sse2_unaligned attribute_hi= dden; > extern __typeof (__redirect_strstr) __strstr_generic attribute_hidden; > -extern __typeof (__redirect_strstr) __strstr_avx512 attribute_hidden; > > #include "init-arch.h" > > /* Avoid DWARF definition DIE on ifunc symbol so that GDB can handle > ifunc symbol properly. */ > extern __typeof (__redirect_strstr) __libc_strstr; > - > -static inline void * > -IFUNC_SELECTOR (void) > -{ > - const struct cpu_features *cpu_features =3D __get_cpu_features (); > - > - if (!CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_AVX512) > - && CPU_FEATURE_USABLE_P (cpu_features, AVX512VL) > - && CPU_FEATURE_USABLE_P (cpu_features, AVX512BW) > - && CPU_FEATURE_USABLE_P (cpu_features, AVX512DQ) > - && CPU_FEATURE_USABLE_P (cpu_features, BMI2)) > - return __strstr_avx512; > - > - if (CPU_FEATURES_ARCH_P (cpu_features, Fast_Unaligned_Load)) > - return __strstr_sse2_unaligned; > - > - return __strstr_generic; > -} > - > -libc_ifunc_redirected (__redirect_strstr, __libc_strstr, IFUNC_SELECTOR = ()); > +libc_ifunc (__libc_strstr, > + HAS_ARCH_FEATURE (Fast_Unaligned_Load) > + ? __strstr_sse2_unaligned > + : __strstr_generic) > #undef strstr > strong_alias (__libc_strstr, strstr) > -- > 2.34.1 > LGTM. Reviewed-by: Noah Goldstein