From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net X-Spam-Level: X-Spam-Status: No, score=-3.9 required=3.0 tests=AWL,BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,RCVD_IN_DNSWL_MED, RCVD_IN_MSPIKE_H4,RCVD_IN_MSPIKE_WL,SPF_HELO_NONE,SPF_PASS shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2 Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dcvr.yhbt.net (Postfix) with ESMTPS id 1E32B1F4B4 for ; Fri, 11 Sep 2020 12:48:23 +0000 (UTC) Received: from localhost ([::1]:44978 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1kGiTV-0007sP-UA for normalperson@yhbt.net; Fri, 11 Sep 2020 08:48:21 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:38152) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1kGiTJ-0007oO-MT; Fri, 11 Sep 2020 08:48:13 -0400 Received: from mail-wm1-f66.google.com ([209.85.128.66]:54981) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1kGiTH-0003Me-UB; Fri, 11 Sep 2020 08:48:09 -0400 Received: by mail-wm1-f66.google.com with SMTP id s13so4322863wmh.4; Fri, 11 Sep 2020 05:48:03 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=Yvvs2VemeqLDPTIfwoP/7sSoOqHTKpjtJ18UZiLHrek=; b=r8umjj6H0NIQmahz+SdIxiwvOQjk/qxr2jnXAaPclMK4KUSc3fZluDGN9KXGld2/eC Pxl4Rc2FpoiBPErYm4+L3weGmR0D2C7jdhfams+ENfMp7JVXHkSgZ0BX4IqnmERRRLKB ZXAC2B31IyTvcztoaU4KsBTga8GyKtzZP0T33K5solsskWdOLlwBaS8VJ1G/D7lKMOTb PIfXgsC9H0OW0fSStC+IIkOb0foYGfwoQU+PemSRjRgqFjRE12iNh7I0hvYRrGMjlLnk 6zphdsrNbPb6QOVAByOyDJKntoTpGQ214Oo0Cjo2qBTaa1LaFktuzMtdIjcExwrSb5kE v4OA== X-Gm-Message-State: AOAM531PKoF8IprhBf3EDkSidP0HzNC/hAzsuTi3dr6NxjvBo/s7jll/ zR26drPQqEA6Je6pEfK6ddUlu09dJqGhOWC7iPo= X-Google-Smtp-Source: ABdhPJx78j/HYahUlBKVU8dksVQfFd/3c57n1RBm3/Fkg5FaGtgOHu6f2aHfxWuY/1nVG+0nXaNfQYjT15er4fcnmB0= X-Received: by 2002:a05:600c:ce:: with SMTP id u14mr2168304wmm.137.1599828481900; Fri, 11 Sep 2020 05:48:01 -0700 (PDT) MIME-Version: 1.0 References: <20200418002153.8771.27F6AC2D@kcn.ne.jp> <20200419074109.431A.27F6AC2D@kcn.ne.jp> <20200419111025.4326.27F6AC2D@kcn.ne.jp> In-Reply-To: <20200419111025.4326.27F6AC2D@kcn.ne.jp> From: Jim Meyering Date: Fri, 11 Sep 2020 14:47:49 +0200 Message-ID: Subject: Re: bug#40634: Massive pattern list handling with -E format seems very slow since 2.28. To: Norihiro Tanaka , GNU grep developers Content-Type: text/plain; charset="UTF-8" Received-SPF: pass client-ip=209.85.128.66; envelope-from=meyering@gmail.com; helo=mail-wm1-f66.google.com X-detected-operating-system: by eggs.gnu.org: First seen = 2020/09/11 08:48:02 X-ACL-Warn: Detected OS = Linux 2.2.x-3.x [generic] [fuzzy] X-Spam_score_int: -13 X-Spam_score: -1.4 X-Spam_bar: - X-Spam_report: (-1.4 / 5.0 requ) BAYES_00=-1.9, FREEMAIL_FORGED_FROMDOMAIN=0.25, FREEMAIL_FROM=0.001, HEADER_FROM_DIFFERENT_DOMAINS=0.25, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H2=-0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=no autolearn_force=no X-Spam_action: no action X-BeenThere: bug-gnulib@gnu.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Gnulib discussion list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: fryasu@yahoo.co.jp, "bug-gnulib@gnu.org List" , Paul Eggert , 40634@debbugs.gnu.org Errors-To: bug-gnulib-bounces+normalperson=yhbt.net@gnu.org Sender: "bug-gnulib" On Sun, Apr 19, 2020 at 4:10 AM Norihiro Tanaka wrote: > On Sun, 19 Apr 2020 07:41:49 +0900 > Norihiro Tanaka wrote: > > On Sat, 18 Apr 2020 00:22:26 +0900 > > Norihiro Tanaka wrote: > > > > > > > > On Fri, 17 Apr 2020 10:24:42 +0900 > > > Norihiro Tanaka wrote: > > > > > > > > > > > On Fri, 17 Apr 2020 09:35:36 +0900 > > > > Norihiro Tanaka wrote: > > > > > > > > > > > > > > On Thu, 16 Apr 2020 16:00:29 -0700 > > > > > Paul Eggert wrote: > > > > > > > > > > > On 4/16/20 3:53 PM, Norihiro Tanaka wrote: > > > > > > > > > > > > > I have had no idea to solve the problem yet. If we revert it, bug#33357 > > > > > > > will come back. > > > > > > > > > > > > Yes, I'd rather not revert if we can help it. > > > > > > > > > > > > My own thought was to not analyze the regular expression if we discover that the input is empty. :-) > > > > > > > > > > Now, I have a idea, it is that we build indexes of epsilon nodes > > > > > including in follows before remove epsilon nodes. > > > > > > > > > > > > I wrote fix for the bug, but it will be slower then at grep 2.27 yet. > > > > > > It was improved previous patch. > > > > Sorry, correct patch is here. > > I made the previous patch even simpler. > > before: > > $ env LC_ALL=C time -p src/grep -E -v -m1 -f grep-patterns.txt /dev/null > real 7.24 > user 7.14 > sys 0.09 > > after: > > $ env LC_ALL=C time -p src/grep -E -v -m1 -f grep-patterns.txt /dev/null > real 0.62 > user 0.52 > sys 0.10 Thank you for this patch. I have rebased and made minor syntactic changes. I'll push it to gnulib soon, if not today, then by Monday. I am considering creating a test case in grep, but it feels too tight to be feasible: I would use a relative perf test, requiring that a passing test incur a perf cost of less than say 100x. Here's the beginnings of my attempt (note: this is just an outline -- obviously would not rely on having "time" in path or as a shell builtin): gen() { local n=$1 local i=1 while : ; do local pat=$(printf $i | sha1sum | cut -d' ' -f1) printf '%s\n' "$pat$pat(\$|$pat)" i=$(expr $i + 1) test $i = $n && break done } gen 4000 > pats-4000 head -400 pats-4000 > pats-400 # With fixed code, that a 10x input size increase (n=400 to 4000) # induces a 40x runtime increase: .05 -> 2.0s # Just prior to this change, it's 150x: 0.2 -> 30s env LC_ALL=C time -p src/grep -E -v -m1 -f pats-400 /dev/null env LC_ALL=C time -p src/grep -E -v -m1 -f pats-4000 /dev/null