From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 16 Apr 2019 16:56:43 -0400
From: Rich Felker
To: Szabolcs Nagy
Cc: Zack Weinberg, Wilco Dijkstra, nd, GNU C Library
Subject: Re: [PATCH v2] Improve performance of strstr
Message-ID: <20190416205643.GG23599@brightrain.aerifal.cx>
References: <2b27b4ca-0370-442d-9f39-210265f00444@arm.com>
 <20190412171613.GB23599@brightrain.aerifal.cx>
 <20190415144051.GE23599@brightrain.aerifal.cx>
 <734450c0-1959-6f47-3d8f-35a746f331c9@arm.com>
 <953ad13a-cd71-32c4-ccc3-d1b8cac57620@arm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
User-Agent: Mutt/1.5.21 (2010-09-15)

On Tue, Apr 16, 2019 at 06:07:37PM +0000, Szabolcs Nagy wrote:
> On 16/04/2019 18:01, Szabolcs Nagy wrote:
> > On 16/04/2019 17:40, Szabolcs Nagy wrote:
> >> On 15/04/2019 21:15, Zack Weinberg wrote:
> >>> On Mon, Apr 15, 2019 at 2:02 PM Wilco Dijkstra wrote:
> >>>> Hi Rich,
> >>> ...
> >>>>>> Yes, without a reproducible example I can't see what your issue
> >>>>>> is. You can't make it go quadratic because it simply isn't.
> >>>>>
> >>>>> Obviously it's not unbounded because you have a (very large) bound
> >>>>> on the size, 256. I can make it do a 256-byte strcmp for nearly
> >>>>> every byte of the input haystack. Maybe because of vectorization
> >>>>> on some targets that's only 16x slower than the current code
> >>>>> rather than 256x slower, but it's still a lot slower.
> >>>>
> >>>> No you can't. It's impossible to make it do a full 256 byte memcmp
> >>>> every character. And bad cases are not bad because of the time
> >>>> spent comparing strings - they are bad because of mispredicted
> >>>> branches. So it's not possible to compare bad cases without
> >>>> benchmarking actual examples on modern CPUs.
> >>>
> >>> This discussion has been going in circles for quite some time now.
> >>>
> >>> Wilco, Rich, I think it would help a lot if you could BOTH write
> >>> down some example needle and haystack pairs that you believe will
> >>> demonstrate significantly improved performance with your preferred
> >>> algorithm, and/or pathologically slow performance with your
> >>> dispreferred algorithm. Even without specific numbers, that will
> >>> give everyone something concrete to argue over, at least.
> >>
> >> since the new algorithm tries to look at the last two chars first
> >> i'd say a possible bad case for it is
> >>
> >> needle = 250*"a" + "abbbaa"
> >> haystack = (250*"a" + "bbbbaa") * 1000
> >>
> >> (256 is the longest needle for the new algo, checking in a 256k
> >> haystack should be large enough to see matching performance
> >> instead of setup time)
> >>
> >> i think this should be close to worst case, but i haven't done
> >> a detailed analysis, the regression with the new algorithm is
> >>
> >> 16.0x slower on Cortex-A72
> >> 17.8x slower on Cortex-A57
> >
> > scratch that, with
> >
> > needle = 248*"a" + "abbbbaaa"
> > haystack = (248*"a" + "bbbbbaaa") * 1000
> >
> > i get
> >
> > 37.8x slower on Cortex-A72
> > 40.0x slower on Cortex-A57
>
> this is a good case for twoway, so we need a twoway worst case too
> for a real comparison: comparing the worst for new vs worst for
> twoway i've seen so far, new is
>
> 4.5x slower on Cortex-A72
> 2.7x slower on Cortex-A57
>
> but there is no guarantee that my inputs are near the real worst
> cases, it's hard to tell (and clearly uarch matters too).
>
> (i wanted to avoid diving into the algorithm details to find
> worst cases)

There's been a series of mitigations analogous to someone 'fixing' an
SQLi vuln by filtering for SQL keywords or single-quote characters in
the input rather than actually fixing the bug, burying the issue and
making it harder to find the worst cases that still exist. That's why
I'm frustrated with this whole topic/proposal.

Given a bound on the needle size it's used for, this is fundamentally
a finite problem (albeit a pretty damn large one), so there likely are
"heuristic" solutions sufficiently complete to cover all cases.
However, even if one can be found, who wants to look at or validate
the argument/proof that it covers them all? Or run the fuzzers looking
for mistakes? Etc.
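To make the shape of these bad cases concrete, here is a small counting
sketch (Python, purely illustrative -- it is a simplified model of a
"check the last two needle bytes first, then compare the rest" scan, not
the proposed glibc code). It builds Szabolcs's first needle/haystack pair
and counts byte comparisons, showing how the input forces a long partial
match at almost every alignment:

```python
# Counting model only -- NOT the proposed glibc algorithm. Build the
# first adversarial pair quoted above and count byte comparisons for a
# naive scan that checks the last two needle bytes first, then compares
# the remaining prefix on a hit.

def count_comparisons(haystack: bytes, needle: bytes) -> int:
    n, m = len(haystack), len(needle)
    comparisons = 0
    for i in range(n - m + 1):
        comparisons += 1                      # last needle byte
        if haystack[i + m - 1] != needle[m - 1]:
            continue
        comparisons += 1                      # second-to-last byte
        if haystack[i + m - 2] != needle[m - 2]:
            continue
        j = 0                                 # compare the remaining prefix
        while j < m - 2 and haystack[i + j] == needle[j]:
            comparisons += 1
            j += 1
        if j < m - 2:
            comparisons += 1                  # the mismatching byte
        else:
            break                             # full match found
    return comparisons

needle = 250 * b"a" + b"abbbaa"
# the mail uses a 256k haystack (x1000); x200 keeps this sketch quick,
# and the per-byte comparison ratio is the same
haystack = (250 * b"a" + b"bbbbaa") * 200
total = count_comparisons(haystack, needle)
print(total, "comparisons,", round(total / len(haystack), 1), "per haystack byte")
```

Under this model the two-byte check passes at most alignments and the
prefix compare then runs deep into the 250-byte run of a's before
mismatching, so the comparison count per haystack byte is two orders of
magnitude above 1.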
The right way forward here, if there's actually any performance issue
worth improving on, is to study why the current two-way implementation
is slower than expected and make changes that improve performance in
all cases. The kinds of primitives Wilco Dijkstra is using (e.g.
comparing a particular subneedle first) are very similar to what
two-way (with a bad-character shift table) does, but invoked
heuristically rather than rigorously. Optimized versions can be done
rigorously instead.

For example, when the right factor is bounded by word/vector size
(which is very common and can be special-cased), two-way reduces to
something close to the proposed loop, but with *much better* advance
on mismatches. And of course, if the 2-char-hash shift table is better
than a single-byte one, that can be used with two-way too.

I suggested before that, for the general case in two-way, it would
help to have a fast vectorized memcmp that returns the length of the
matching prefix rather than the difference, but this case (a long
right half) may actually be sufficiently rare that we shouldn't care
about making it faster than it already is.

Rich
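[For illustration, the "memcmp that returns the length of the matching
prefix" primitive mentioned above could look like the following sketch
(Python for brevity; these names and the word-at-a-time scheme are
hypothetical, not glibc code -- a real version would be vectorized C):]

```python
def match_len(a: bytes, b: bytes) -> int:
    """Scalar reference: length of the common prefix of a and b."""
    n = min(len(a), len(b))
    for i in range(n):
        if a[i] != b[i]:
            return i
    return n

def match_len_wordwise(a: bytes, b: bytes) -> int:
    """Same result, comparing 8 bytes at a time: XOR the two words and
    locate the lowest differing byte (little-endian), mimicking what a
    vectorized memcmp variant would do with ctz on a mismatch mask."""
    n = min(len(a), len(b))
    i = 0
    while i + 8 <= n:
        d = int.from_bytes(a[i:i+8], "little") ^ int.from_bytes(b[i:i+8], "little")
        if d:
            # bit index of the lowest set bit, divided by 8, is the
            # first differing byte within this word
            return i + ((d & -d).bit_length() - 1) // 8
        i += 8
    while i < n and a[i] == b[i]:      # scalar tail
        i += 1
    return i
```

The point of returning the length rather than a sign is that, on a
mismatch in the right factor, the caller learns how far the match got
instead of just "not equal", which is the quantity two-way's shift
logic actually needs.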