From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net X-Spam-Level: X-Spam-ASN: AS31976 209.132.180.0/23 X-Spam-Status: No, score=-3.3 required=3.0 tests=AWL,BAYES_00,BODY_8BITS, DKIMWL_WL_MED,DKIM_SIGNED,DKIM_VALID,DKIM_VALID_EF, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,RCVD_IN_DNSWL_MED, SPF_HELO_PASS,SPF_PASS shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2 Received: from sourceware.org (server1.sourceware.org [209.132.180.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dcvr.yhbt.net (Postfix) with ESMTPS id 6341F20248 for ; Wed, 6 Mar 2019 16:55:53 +0000 (UTC) DomainKey-Signature: a=rsa-sha1; c=nofws; d=sourceware.org; h=list-id :list-unsubscribe:list-subscribe:list-archive:list-post :list-help:sender:from:to:cc:subject:date:message-id:references :in-reply-to:content-type:content-transfer-encoding :mime-version; q=dns; s=default; b=F2BBfpzNNaf5Qs/oEshOjPqWNKn+W YA7N7pGCrqOqvE1rMrIGZWy16s1qVyoHyUqluIkH+XJk6UPmtPc7Z/xlQviYkXBO G93z8ig/BYKNSAA7cTuojX59hfcitXCaE1s2kk8fgHr8OwoCTCw1nGJaqSqRMB38 nHxoy/U1VV9gss= DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=sourceware.org; h=list-id :list-unsubscribe:list-subscribe:list-archive:list-post :list-help:sender:from:to:cc:subject:date:message-id:references :in-reply-to:content-type:content-transfer-encoding :mime-version; s=default; bh=f3JNXm74mKRJTLfirtR7KT9Eqfg=; b=ayg ehGsrCLpzGBhH/zjRYbMz1R/+Ja3zP4AFUCiACsEJK+W6BdyNysuNuGnayK9ysEt sQwmbXV/KEoz8ulm6R+2nfgNIVSdqeTHMAZ0UExvV3De9zqHOFkk4nEPa7JHWVJh BGEmc/90wgoLHnZgDR0hmGK49bZmywXcvxQBl9JU= Received: (qmail 127050 invoked by alias); 6 Mar 2019 16:55:51 -0000 Mailing-List: contact libc-alpha-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Subscribe: List-Archive: List-Post: List-Help: , Sender: libc-alpha-owner@sourceware.org Received: (qmail 127038 invoked by uid 89); 6 Mar 2019 16:55:50 -0000 Authentication-Results: sourceware.org; auth=none X-HELO: EUR01-VE1-obe.outbound.protection.outlook.com DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=armh.onmicrosoft.com; s=selector1-arm-com; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=54xRGaZkAxK0o4j/qNFxICURNup1bvzet5uloym0opg=; b=BfGLOtqcecs5JlnY/nqD4sGMD5qlA1iugRWDeuqZY6hQXJ2yXwXQYwdou5Grvg/v3XWaj8qjCsMoGEP3Mf/7m4KZZkRHGoNTolPY0thIP1fqwSOriFSt1zusWFJCLY9Jq+rcbLol3v3Mls5f4l4ISPhA8Ktz661Elh79+MIqBek= From: Wilco Dijkstra To: 'GNU C Library' CC: nd Subject: Re: [PATCH] Improve performance of memmem Date: Wed, 6 Mar 2019 16:55:39 +0000 Message-ID: References: , In-Reply-To: authentication-results: spf=none (sender IP is ) smtp.mailfrom=Wilco.Dijkstra@arm.com; x-ms-exchange-purlcount: 1 received-spf: None (protection.outlook.com: arm.com does not designate permitted sender hosts) x-ms-exchange-senderadcheck: 1 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-MS-Exchange-CrossTenant-mailboxtype: HOSTED ping =A0=20 This patch significantly improves performance of memmem using a novel modified Horspool algorithm.=A0 Needles up to size 256 use a bad-character table indexed by hashed pairs of characters to quickly skip past mismatches= . Long needles use a self-adapting filtering step to avoid comparing the whol= e needle repeatedly. By limiting the needle length to 256, the shift table only requires 8 bits per entry, lowering preprocessing overhead and minimizing cache effects. This limit also implies worst-case performance is linear. Small needles up to size 2 use a dedicated linear search.=A0 Very long need= les use the Two-Way algorithm (to avoid increasing stack size or slowing down the common case, inlining is disabled). The performance gain is 4.5 times on English text on AArch64 using random needles with average size 8 (this is even faster than the recently improved= strstr algorithm, so I'll update that in the near future). Tested against GLIBC testsuite and randomized tests. OK for commit? ChangeLog: 2018-12-21=A0 Wilco Dijkstra=A0 =A0=A0=A0=A0=A0=A0=A0 * string/str-two-way.h (two_way_long_needle): Block i= nlining. =A0=A0=A0=A0=A0=A0=A0 * string/memmem.c (__memmem): Rewrite to improve perf= ormance. -- diff --git a/string/memmem.c b/string/memmem.c index d72b8249e62a744c2d031c9ccb0157f141df641f..150777800456075d981bd589e5f= 00388e9aa54f0 100644 --- a/string/memmem.c +++ b/string/memmem.c @@ -15,17 +15,13 @@ =A0=A0=A0 License along with the GNU C Library; if not, see =A0=A0=A0 .=A0 */ =A0 -/* This particular implementation was written by Eric Blake, 2008.=A0 */ - =A0#ifndef _LIBC =A0# include =A0#endif =A0 -/* Specification of memmem.=A0 */ =A0#include =A0 =A0#ifndef _LIBC -# define __builtin_expect(expr, val)=A0=A0 (expr) =A0# define __memmem=A0=A0=A0=A0=A0=A0 memmem =A0#endif =A0 @@ -36,51 +32,90 @@ =A0 =A0#undef memmem =A0 -/* Return the first occurrence of NEEDLE in HAYSTACK.=A0 Return HAYSTACK -=A0=A0 if NEEDLE_LEN is 0, otherwise NULL if NEEDLE is not found in -=A0=A0 HAYSTACK.=A0 */ +#define hash2(p) (((size_t)(p)[0] - ((size_t)(p)[-1] << 3)) % sizeof (shif= t)) + +/* Fast memmem algorithm with guaranteed linear-time performance. +=A0=A0 Small needles up to size 2 use a dedicated linear search.=A0 Longer= needles +=A0=A0 up to size 256 use a novel modified Horspool algorithm.=A0 It hashe= s pairs +=A0=A0 of characters to quickly skip past mismatches.=A0 The main search l= oop only +=A0=A0 exits if the last 2 characters match, avoiding unnecessary calls to= memcmp +=A0=A0 and allowing for a larger skip if there is no match.=A0 A self-adap= ting +=A0=A0 filtering check is used to quickly detect mismatches in long needle= s. +=A0=A0 By limiting the needle length to 256, the shift table can be reduce= d to 8 +=A0=A0 bits per entry, lowering preprocessing overhead and minimizing cach= e effects. +=A0=A0 The limit also implies worst-case performance is linear. +=A0=A0 Needles larger than 256 characters use the linear-time Two-Way algo= rithm.=A0 */ =A0void * -__memmem (const void *haystack_start, size_t haystack_len, -=A0=A0=A0=A0=A0=A0=A0=A0 const void *needle_start, size_t needle_len) +__memmem (const void *haystack, size_t hs_len, +=A0=A0=A0=A0=A0=A0=A0=A0 const void *needle, size_t ne_len) =A0{ -=A0 /* Abstract memory is considered to be an array of 'unsigned char' val= ues, -=A0=A0=A0=A0 not an array of 'char' values.=A0 See ISO C 99 section 6.2.6.= 1.=A0 */ -=A0 const unsigned char *haystack =3D (const unsigned char *) haystack_sta= rt; -=A0 const unsigned char *needle =3D (const unsigned char *) needle_start; - -=A0 if (needle_len =3D=3D 0) -=A0=A0=A0 /* The first occurrence of the empty string is deemed to occur a= t -=A0=A0=A0=A0=A0=A0 the beginning of the string.=A0 */ -=A0=A0=A0 return (void *) haystack; - -=A0 /* Sanity check, otherwise the loop might search through the whole -=A0=A0=A0=A0 memory.=A0 */ -=A0 if (__glibc_unlikely (haystack_len < needle_len)) +=A0 const unsigned char *hs =3D (const unsigned char *) haystack; +=A0 const unsigned char *ne =3D (const unsigned char *) needle; + +=A0 if (ne_len =3D=3D 0) +=A0=A0=A0 return (void *) hs; +=A0 if (ne_len =3D=3D 1) +=A0=A0=A0 return (void *) memchr (hs, ne[0], hs_len); + +=A0 /* Ensure haystack length is >=3D needle length.=A0 */ +=A0 if (hs_len < ne_len) =A0=A0=A0=A0 return NULL; =A0 -=A0 /* Use optimizations in memchr when possible, to reduce the search -=A0=A0=A0=A0 size of haystack using a linear algorithm with a smaller -=A0=A0=A0=A0 coefficient.=A0 However, avoid memchr for long needles, since= we -=A0=A0=A0=A0 can often achieve sublinear performance.=A0 */ -=A0 if (needle_len < LONG_NEEDLE_THRESHOLD) +=A0 const unsigned char *end =3D hs + hs_len - ne_len; + +=A0 if (ne_len =3D=3D 2) +=A0=A0=A0 { +=A0=A0=A0=A0=A0 uint32_t nw =3D ne[0] << 16 | ne[1], hw =3D hs[0] << 16 | = hs[1]; +=A0=A0=A0=A0=A0 for (hs++; hs <=3D end && hw !=3D nw; ) +=A0=A0=A0=A0=A0=A0 hw =3D hw << 16 | *++hs; +=A0=A0=A0=A0=A0 return hw =3D=3D nw ? (void *)hs - 1 : NULL; +=A0=A0=A0 } + +=A0 /* Use Two-Way algorithm for very long needles.=A0 */ +=A0 if (__builtin_expect (ne_len > 256, 0)) +=A0=A0=A0 return two_way_long_needle (hs, hs_len, ne, ne_len); + +=A0 uint8_t shift[256]; +=A0 size_t tmp, shift1; +=A0 size_t m1 =3D ne_len - 1; +=A0 size_t offset =3D 0; + +=A0 memset (shift, 0, sizeof (shift)); +=A0 for (int i =3D 1; i < m1; i++) +=A0=A0=A0 shift[hash2 (ne + i)] =3D i; +=A0 shift1 =3D m1 - shift[hash2 (ne + m1)]; +=A0 shift[hash2 (ne + m1)] =3D m1; + +=A0 for ( ; hs <=3D end; ) =A0=A0=A0=A0 { -=A0=A0=A0=A0=A0 haystack =3D memchr (haystack, *needle, haystack_len); -=A0=A0=A0=A0=A0 if (!haystack || __builtin_expect (needle_len =3D=3D 1, 0)= ) -=A0=A0=A0=A0=A0=A0 return (void *) haystack; -=A0=A0=A0=A0=A0 haystack_len -=3D haystack - (const unsigned char *) hayst= ack_start; -=A0=A0=A0=A0=A0 if (haystack_len < needle_len) -=A0=A0=A0=A0=A0=A0 return NULL; -=A0=A0=A0=A0=A0 /* Check whether we have a match.=A0 This improves perform= ance since we -=A0=A0=A0=A0=A0=A0=A0 avoid the initialization overhead of the two-way alg= orithm.=A0 */ -=A0=A0=A0=A0=A0 if (memcmp (haystack, needle, needle_len) =3D=3D 0) -=A0=A0=A0=A0=A0=A0 return (void *) haystack; -=A0=A0=A0=A0=A0 return two_way_short_needle (haystack, haystack_len, needl= e, needle_len); +=A0=A0=A0=A0=A0 /* Skip past character pairs not in the needle.=A0 */ +=A0=A0=A0=A0=A0 do +=A0=A0=A0=A0=A0=A0 { +=A0=A0=A0=A0=A0=A0=A0=A0 hs +=3D m1; +=A0=A0=A0=A0=A0=A0=A0=A0 tmp =3D shift[hash2 (hs)]; +=A0=A0=A0=A0=A0=A0 } +=A0=A0=A0=A0=A0 while (tmp =3D=3D 0 && hs <=3D end); + +=A0=A0=A0=A0=A0 /* If the match is not at the end of the needle, shift to = the end +=A0=A0=A0=A0=A0=A0=A0 and continue until we match the last 2 characters.= =A0 */ +=A0=A0=A0=A0=A0 hs -=3D tmp; +=A0=A0=A0=A0=A0 if (tmp < m1) +=A0=A0=A0=A0=A0=A0 continue; + +=A0=A0=A0=A0=A0 if (m1 <=3D 15 || memcmp (hs + offset, ne + offset, sizeof= (long)) =3D=3D 0) +=A0=A0=A0=A0=A0=A0 { +=A0=A0=A0=A0=A0=A0=A0=A0 if (memcmp (hs, ne, m1) =3D=3D 0) +=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 return (void *) hs; + +=A0=A0=A0=A0=A0=A0=A0=A0 /* Adjust filter offset when it doesn't find the = mismatch.=A0 */ +=A0=A0=A0=A0=A0=A0=A0=A0 offset =3D (offset >=3D sizeof (long) ? offset : = m1) - sizeof (long); +=A0=A0=A0=A0=A0=A0 } + +=A0=A0=A0=A0=A0 /* Skip based on matching the last 2 characters.=A0 */ +=A0=A0=A0=A0=A0 hs +=3D shift1; =A0=A0=A0=A0 } -=A0 else -=A0=A0=A0 return two_way_long_needle (haystack, haystack_len, needle, need= le_len); +=A0 return NULL; =A0} =A0libc_hidden_def (__memmem) =A0weak_alias (__memmem, memmem) =A0libc_hidden_weak (memmem) - -#undef LONG_NEEDLE_THRESHOLD diff --git a/string/str-two-way.h b/string/str-two-way.h index 31c3f18fb057cdd999c3ac9e9d894a8b62a98a70..5a800e0eaf1c7505a9340a7aabd= 149326958df4a 100644 --- a/string/str-two-way.h +++ b/string/str-two-way.h @@ -383,7 +383,7 @@ two_way_short_needle (const unsigned char *haystack, si= ze_t haystack_len, =A0=A0=A0 If AVAILABLE modifies HAYSTACK_LEN (as in strstr), then at most 3= * =A0=A0=A0 HAYSTACK_LEN - NEEDLE_LEN comparisons occur in searching, and =A0=A0=A0 sublinear performance is not possible.=A0 */ -static RETURN_TYPE +__attribute__((noinline)) static RETURN_TYPE =A0two_way_long_needle (const unsigned char *haystack, size_t haystack_len, =A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0=A0 const unsig= ned char *needle, size_t needle_len) =A0{ =A0=A0=A0 =