From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net X-Spam-Level: X-Spam-Status: No, score=-3.9 required=3.0 tests=AWL,BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,RCVD_IN_DNSWL_MED, RCVD_IN_MSPIKE_H4,RCVD_IN_MSPIKE_WL,SPF_HELO_NONE,SPF_PASS shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2 Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dcvr.yhbt.net (Postfix) with ESMTPS id 199F31F4B4 for ; Fri, 11 Sep 2020 13:17:56 +0000 (UTC) Received: from localhost ([::1]:52684 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1kGiw6-0007me-V5 for normalperson@yhbt.net; Fri, 11 Sep 2020 09:17:54 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:47650) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1kGivy-0007k2-RZ; Fri, 11 Sep 2020 09:17:46 -0400 Received: from mail-wm1-f66.google.com ([209.85.128.66]:38876) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1kGivv-0007Xg-Hy; Fri, 11 Sep 2020 09:17:45 -0400 Received: by mail-wm1-f66.google.com with SMTP id l9so4775803wme.3; Fri, 11 Sep 2020 06:17:42 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=64250QNkjfBfsR2yWWs/87yHdcoFTgvba7f1zeoE3Lw=; b=Z8r8Y6KBWYGD4V6w1hgaBD9vVupw9zf/00CUq591v9OuwdN9++kkzCv8amXuWst3Wo SK3iLQMbSIQLHZXQ/W6IOT1kahjKJ9ZTJa/rvX/p7Eu3/Zo4pxo/AE4w20aGIDJYwGnf eDyeC5DeU4RM8/iepIofVJqIwIzgeFT2Jis0APot9MI/+QMWa/wPi+vrXfEpYZs6feoB 7iK7OecgGj83RGVMbmUlHeOPakfo9qKyd/pyycVAfieXc/or+BEylFmKNFN2wL3a0bpz JOvvosLMec1NOPyxi/GlAD6Y3S4fVzAKZmBdevqX6rDjOe1MTYTVcawxhULvtz8rvCzv t9/w== X-Gm-Message-State: AOAM531GjdiUUD87oKvsGyV+aKvUNvcGiyUxiUcHeBCC3UsO5afJ/JrY 48pwd+fDovar4Zlj5NbhDGhLI3TwFdHWzchLbnU= X-Google-Smtp-Source: ABdhPJyuXc3eiS0+q0HW1NerUytXB7TuNE2+KzyUTdbAj63k0o6azKPqUmo3NqW1A9lUUiw7IELC3kBP0IDbKpjWChQ= X-Received: by 2002:a7b:c847:: with SMTP id c7mr2240511wml.149.1599830261662; Fri, 11 Sep 2020 06:17:41 -0700 (PDT) MIME-Version: 1.0 References: <20200418002153.8771.27F6AC2D@kcn.ne.jp> <20200419074109.431A.27F6AC2D@kcn.ne.jp> <20200419111025.4326.27F6AC2D@kcn.ne.jp> In-Reply-To: From: Jim Meyering Date: Fri, 11 Sep 2020 15:17:29 +0200 Message-ID: Subject: Re: bug#40634: Massive pattern list handling with -E format seems very slow since 2.28. To: Norihiro Tanaka , GNU grep developers Content-Type: multipart/mixed; boundary="00000000000072a11c05af0982ad" Received-SPF: pass client-ip=209.85.128.66; envelope-from=meyering@gmail.com; helo=mail-wm1-f66.google.com X-detected-operating-system: by eggs.gnu.org: First seen = 2020/09/11 08:48:02 X-ACL-Warn: Detected OS = Linux 2.2.x-3.x [generic] [fuzzy] X-Spam_score_int: -13 X-Spam_score: -1.4 X-Spam_bar: - X-Spam_report: (-1.4 / 5.0 requ) BAYES_00=-1.9, FREEMAIL_FORGED_FROMDOMAIN=0.25, FREEMAIL_FROM=0.001, HEADER_FROM_DIFFERENT_DOMAINS=0.25, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H2=-0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=no autolearn_force=no X-Spam_action: no action X-BeenThere: bug-gnulib@gnu.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Gnulib discussion list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: fryasu@yahoo.co.jp, "bug-gnulib@gnu.org List" , Paul Eggert , 40634@debbugs.gnu.org Errors-To: bug-gnulib-bounces+normalperson=yhbt.net@gnu.org Sender: "bug-gnulib" --00000000000072a11c05af0982ad Content-Type: text/plain; charset="UTF-8" On Fri, Sep 11, 2020 at 2:47 PM Jim Meyering wrote: > On Sun, Apr 19, 2020 at 4:10 AM Norihiro Tanaka wrote: > > On Sun, 19 Apr 2020 07:41:49 +0900 > > Norihiro Tanaka wrote: > > > On Sat, 18 Apr 2020 00:22:26 +0900 > > > Norihiro Tanaka wrote: > > > > > > > > > > > On Fri, 17 Apr 2020 10:24:42 +0900 > > > > Norihiro Tanaka wrote: > > > > > > > > > > > > > > On Fri, 17 Apr 2020 09:35:36 +0900 > > > > > Norihiro Tanaka wrote: > > > > > > > > > > > > > > > > > On Thu, 16 Apr 2020 16:00:29 -0700 > > > > > > Paul Eggert wrote: > > > > > > > > > > > > > On 4/16/20 3:53 PM, Norihiro Tanaka wrote: > > > > > > > > > > > > > > > I have had no idea to solve the problem yet. If we revert it, bug#33357 > > > > > > > > will come back. > > > > > > > > > > > > > > Yes, I'd rather not revert if we can help it. > > > > > > > > > > > > > > My own thought was to not analyze the regular expression if we discover that the input is empty. :-) > > > > > > > > > > > > Now, I have a idea, it is that we build indexes of epsilon nodes > > > > > > including in follows before remove epsilon nodes. > > > > > > > > > > > > > > > I wrote fix for the bug, but it will be slower then at grep 2.27 yet. > > > > > > > > It was improved previous patch. > > > > > > Sorry, correct patch is here. > > > > I made the previous patch even simpler. > > > > before: > > > > $ env LC_ALL=C time -p src/grep -E -v -m1 -f grep-patterns.txt /dev/null > > real 7.24 > > user 7.14 > > sys 0.09 > > > > after: > > > > $ env LC_ALL=C time -p src/grep -E -v -m1 -f grep-patterns.txt /dev/null > > real 0.62 > > user 0.52 > > sys 0.10 > > Thank you for this patch. I have rebased and made minor syntactic changes. > I'll push it to gnulib soon, if not today, then by Monday. > > I am considering creating a test case in grep, but it feels too tight > to be feasible: I would use a relative perf test, requiring that a > passing test incur a perf cost of less than say 100x. Here's the > beginnings of my attempt (note: this is just an outline -- obviously > would not rely on having "time" in path or as a shell builtin): > > gen() > { > local n=$1 > local i=1 > while : ; do > local pat=$(printf $i | sha1sum | cut -d' ' -f1) > printf '%s\n' "$pat$pat(\$|$pat)" > i=$(expr $i + 1) > test $i = $n && break > done > } > > gen 4000 > pats-4000 > head -400 pats-4000 > pats-400 > > # With fixed code, that a 10x input size increase (n=400 to 4000) > # induces a 40x runtime increase: .05 -> 2.0s > # Just prior to this change, it's 150x: 0.2 -> 30s > > env LC_ALL=C time -p src/grep -E -v -m1 -f pats-400 /dev/null > env LC_ALL=C time -p src/grep -E -v -m1 -f pats-4000 /dev/null And here is the adjusted patch: --00000000000072a11c05af0982ad Content-Type: application/octet-stream; name="dfa.c-epsilon-node-removal-speedup.patch" Content-Disposition: attachment; filename="dfa.c-epsilon-node-removal-speedup.patch" Content-Transfer-Encoding: base64 Content-ID: X-Attachment-Id: f_key9mchp0 RnJvbSBiY2Q3OTMwZGJhMmYwNzhkYTk5MjY2MGY2YTE0Y2RlOWU0MmI5NGM2IE1vbiBTZXAgMTcg MDA6MDA6MDAgMjAwMQpGcm9tOiBKaW0gTWV5ZXJpbmcgPG1leWVyaW5nQGZiLmNvbT4KRGF0ZTog RnJpLCAxMSBTZXAgMjAyMCAwMToyNToxMCAtMDcwMApTdWJqZWN0OiBbUEFUQ0hdIGRmYTogc3Bl ZWQgdXAgZXBzaWxvbi1ub2RlIHJlbW92YWwKCkJ1aWxkIGF1eGlsaWFyeSBpbmRpY2VzIGJlZm9y ZSByZW1vdmluZyBlcHNpbG9uIGNsb3N1cmUuCkJlZm9yZSwgd2hlbiByZW1vdmluZyBhbiBlcHNp bG9uIGNsb3N1cmUsIHdlIHdvdWxkIHNlYXJjaCBmb3Igbm9kZXMKaW5jbHVkaW5nIHRoYXQgZXBz aWxvbiBjbG9zdXJlIGluIGFsbCBub2RlcyBzZXF1ZW50aWFsbHksIGJ1dCB0aGF0CmNvdWxkIGJl IHZlcnkgc2xvdy4gIE5vdywgYnVpbGQgYXV4aWxpYXJ5IGluZGljZXMgYmVmb3JlIHNlYXJjaGlu Zy4KUmVwb3J0ZWQgaW46IGh0dHBzOi8vYnVncy5nbnUub3JnLzQwNjM0CgoqIGxpYi9kZmEuYyAo b3ZlcndyYXApOiBOZXcgZnVuY3Rpb24uCihlcHNjbG9zdXJlKTogQnVpbGQgYXV4aWxpYXJ5IGlu ZGljZXMgYmVmb3JlIHJlbW92aW5nIGFueQplcHNpbG9uIGNsb3N1cmU7IHVzZSB0aGVtIHRvIHNw ZWVkIHVwIHRoYXQgcHJvY2Vzcy4KLS0tCiBsaWIvZGZhLmMgfCAxMDcgKysrKysrKysrKysrKysr KysrKysrKysrKysrKysrKysrKysrKy0tLS0tLS0tLS0tLS0tLS0tCiAxIGZpbGUgY2hhbmdlZCwg NzMgaW5zZXJ0aW9ucygrKSwgMzQgZGVsZXRpb25zKC0pCgpkaWZmIC0tZ2l0IGEvbGliL2RmYS5j IGIvbGliL2RmYS5jCmluZGV4IDFmMDU4N2E3YS4uYzQzMDBkZmI1IDEwMDY0NAotLS0gYS9saWIv ZGZhLmMKKysrIGIvbGliL2RmYS5jCkBAIC0yMjAzLDYgKzIyMDMsMjIgQEAgcmVwbGFjZSAocG9z aXRpb25fc2V0ICpkc3QsIGlkeF90IGRlbCwgcG9zaXRpb25fc2V0ICphZGQsCiAgICAgfQogfQoK K3N0YXRpYyBib29sIF9HTF9BVFRSSUJVVEVfUFVSRQorb3ZlcndyYXAgKHBvc2l0aW9uX3NldCBj b25zdCAqcywgaWR4X3QgY29uc3QgKmVsZW1zLCBpZHhfdCBuZWxlbSkKK3sKKyAgaWR4X3QgaSA9 IDAsIGogPSAwOworCisgIHdoaWxlIChpIDwgcy0+bmVsZW0gJiYgaiA8IG5lbGVtKQorICAgIGlm IChzLT5lbGVtc1tpXS5pbmRleCA8IGVsZW1zW2pdKQorICAgICAgaSsrOworICAgIGVsc2UgaWYg KHMtPmVsZW1zW2ldLmluZGV4ID4gZWxlbXNbal0pCisgICAgICBqKys7CisgICAgZWxzZQorICAg ICAgcmV0dXJuIHRydWU7CisKKyAgcmV0dXJuIGZhbHNlOworfQorCiAvKiBGaW5kIHRoZSBpbmRl eCBvZiB0aGUgc3RhdGUgY29ycmVzcG9uZGluZyB0byB0aGUgZ2l2ZW4gcG9zaXRpb24gc2V0IHdp dGgKICAgIHRoZSBnaXZlbiBwcmVjZWRpbmcgY29udGV4dCwgb3IgY3JlYXRlIGEgbmV3IHN0YXRl IGlmIHRoZXJlIGlzIG5vIHN1Y2gKICAgIHN0YXRlLiAgQ29udGV4dCB0ZWxscyB3aGV0aGVyIHdl IGdvdCBoZXJlIG9uIGEgbmV3bGluZSBvciBsZXR0ZXIuICAqLwpAQCAtMjMwMCw0NSArMjMxNiw2 OCBAQCBzdGF0aWMgdm9pZAogZXBzY2xvc3VyZSAoc3RydWN0IGRmYSBjb25zdCAqZCkKIHsKICAg cG9zaXRpb25fc2V0IHRtcDsKKyAgaWR4X3QgKmN1cnJzLCAqbmV4dHM7CisgIGlkeF90IG5jdXJy ID0gMDsKKyAgaWR4X3Qgbm5leHQgPSAwOworCiAgIGFsbG9jX3Bvc2l0aW9uX3NldCAoJnRtcCwg ZC0+bmxlYXZlcyk7CisgIGN1cnJzID0geG5tYWxsb2MgKGQtPnRpbmRleCwgc2l6ZW9mICpjdXJy cyk7CisgIG5leHRzID0geG5tYWxsb2MgKGQtPnRpbmRleCwgc2l6ZW9mICpuZXh0cyk7CisKICAg Zm9yIChpZHhfdCBpID0gMDsgaSA8IGQtPnRpbmRleDsgaSsrKQotICAgIGlmIChkLT5mb2xsb3dz W2ldLm5lbGVtID4gMCAmJiBkLT50b2tlbnNbaV0gPj0gTk9UQ0hBUgotICAgICAgICAmJiBkLT50 b2tlbnNbaV0gIT0gQkFDS1JFRiAmJiBkLT50b2tlbnNbaV0gIT0gQU5ZQ0hBUgotICAgICAgICAm JiBkLT50b2tlbnNbaV0gIT0gTUJDU0VUICYmIGQtPnRva2Vuc1tpXSA8IENTRVQpCi0gICAgICB7 Ci0gICAgICAgIHVuc2lnbmVkIGludCBjb25zdHJhaW50OwotICAgICAgICBzd2l0Y2ggKGQtPnRv a2Vuc1tpXSkKLSAgICAgICAgICB7Ci0gICAgICAgICAgY2FzZSBCRUdMSU5FOgotICAgICAgICAg ICAgY29uc3RyYWludCA9IEJFR0xJTkVfQ09OU1RSQUlOVDsKLSAgICAgICAgICAgIGJyZWFrOwot ICAgICAgICAgIGNhc2UgRU5ETElORToKLSAgICAgICAgICAgIGNvbnN0cmFpbnQgPSBFTkRMSU5F X0NPTlNUUkFJTlQ7Ci0gICAgICAgICAgICBicmVhazsKLSAgICAgICAgICBjYXNlIEJFR1dPUkQ6 Ci0gICAgICAgICAgICBjb25zdHJhaW50ID0gQkVHV09SRF9DT05TVFJBSU5UOwotICAgICAgICAg ICAgYnJlYWs7Ci0gICAgICAgICAgY2FzZSBFTkRXT1JEOgotICAgICAgICAgICAgY29uc3RyYWlu dCA9IEVORFdPUkRfQ09OU1RSQUlOVDsKLSAgICAgICAgICAgIGJyZWFrOwotICAgICAgICAgIGNh c2UgTElNV09SRDoKLSAgICAgICAgICAgIGNvbnN0cmFpbnQgPSBMSU1XT1JEX0NPTlNUUkFJTlQ7 Ci0gICAgICAgICAgICBicmVhazsKLSAgICAgICAgICBjYXNlIE5PVExJTVdPUkQ6Ci0gICAgICAg ICAgICBjb25zdHJhaW50ID0gTk9UTElNV09SRF9DT05TVFJBSU5UOwotICAgICAgICAgICAgYnJl YWs7Ci0gICAgICAgICAgZGVmYXVsdDoKLSAgICAgICAgICAgIGNvbnN0cmFpbnQgPSBOT19DT05T VFJBSU5UOwotICAgICAgICAgICAgYnJlYWs7Ci0gICAgICAgICAgfQorICAgIHsKKyAgICAgIGlm IChkLT5mb2xsb3dzW2ldLm5lbGVtID4gMCAmJiBkLT50b2tlbnNbaV0gPj0gTk9UQ0hBUgorICAg ICAgICAgICYmIGQtPnRva2Vuc1tpXSAhPSBCQUNLUkVGICYmIGQtPnRva2Vuc1tpXSAhPSBBTllD SEFSCisgICAgICAgICAgJiYgZC0+dG9rZW5zW2ldICE9IE1CQ1NFVCAmJiBkLT50b2tlbnNbaV0g PCBDU0VUKQorICAgICAgICBjdXJyc1tuY3VycisrXSA9IGk7CisgICAgfQoKLSAgICAgICAgZGVs ZXRlIChpLCAmZC0+Zm9sbG93c1tpXSk7CisgIGZvciAoaWR4X3QgaSA9IDAsIGogPSAwOyBpIDwg ZC0+dGluZGV4OyBpKyspCisgICAgeworICAgICAgd2hpbGUgKGogPCBuY3VyciAmJiBjdXJyc1tq XSA8IGkpCisgICAgICAgIGorKzsKKyAgICAgIGlmIChvdmVyd3JhcCAoJmQtPmZvbGxvd3NbaV0s IGN1cnJzLCBuY3VycikpCisgICAgICAgIG5leHRzW25uZXh0KytdID0gaTsKKyAgICB9CisKKyAg Zm9yIChpZHhfdCBpID0gMDsgaSA8IG5jdXJyOyBpKyspCisgICAgeworICAgICAgdW5zaWduZWQg aW50IGNvbnN0cmFpbnQ7CisgICAgICBzd2l0Y2ggKGQtPnRva2Vuc1tjdXJyc1tpXV0pCisgICAg ICAgIHsKKyAgICAgICAgY2FzZSBCRUdMSU5FOgorICAgICAgICAgIGNvbnN0cmFpbnQgPSBCRUdM SU5FX0NPTlNUUkFJTlQ7CisgICAgICAgICAgYnJlYWs7CisgICAgICAgIGNhc2UgRU5ETElORToK KyAgICAgICAgICBjb25zdHJhaW50ID0gRU5ETElORV9DT05TVFJBSU5UOworICAgICAgICAgIGJy ZWFrOworICAgICAgICBjYXNlIEJFR1dPUkQ6CisgICAgICAgICAgY29uc3RyYWludCA9IEJFR1dP UkRfQ09OU1RSQUlOVDsKKyAgICAgICAgICBicmVhazsKKyAgICAgICAgY2FzZSBFTkRXT1JEOgor ICAgICAgICAgIGNvbnN0cmFpbnQgPSBFTkRXT1JEX0NPTlNUUkFJTlQ7CisgICAgICAgICAgYnJl YWs7CisgICAgICAgIGNhc2UgTElNV09SRDoKKyAgICAgICAgICBjb25zdHJhaW50ID0gTElNV09S RF9DT05TVFJBSU5UOworICAgICAgICAgIGJyZWFrOworICAgICAgICBjYXNlIE5PVExJTVdPUkQ6 CisgICAgICAgICAgY29uc3RyYWludCA9IE5PVExJTVdPUkRfQ09OU1RSQUlOVDsKKyAgICAgICAg ICBicmVhazsKKyAgICAgICAgZGVmYXVsdDoKKyAgICAgICAgICBjb25zdHJhaW50ID0gTk9fQ09O U1RSQUlOVDsKKyAgICAgICAgICBicmVhazsKKyAgICAgICAgfQorCisgICAgICBkZWxldGUgKGks ICZkLT5mb2xsb3dzW2N1cnJzW2ldXSk7CisKKyAgICAgIGZvciAoaWR4X3QgaiA9IDA7IGogPCBu bmV4dDsgaisrKQorICAgICAgICByZXBsYWNlICgmZC0+Zm9sbG93c1tuZXh0c1tqXV0sIGN1cnJz W2ldLCAmZC0+Zm9sbG93c1tjdXJyc1tpXV0sCisgICAgICAgICAgICAgICAgIGNvbnN0cmFpbnQs ICZ0bXApOworICAgIH0KCi0gICAgICAgIGZvciAoaWR4X3QgaiA9IDA7IGogPCBkLT50aW5kZXg7 IGorKykKLSAgICAgICAgICBpZiAoaSAhPSBqICYmIGQtPmZvbGxvd3Nbal0ubmVsZW0gPiAwKQot ICAgICAgICAgICAgcmVwbGFjZSAoJmQtPmZvbGxvd3Nbal0sIGksICZkLT5mb2xsb3dzW2ldLCBj b25zdHJhaW50LCAmdG1wKTsKLSAgICAgIH0KICAgZnJlZSAodG1wLmVsZW1zKTsKKyAgZnJlZSAo Y3VycnMpOworICBmcmVlIChuZXh0cyk7CiB9CgogLyogUmV0dXJucyB0aGUgc2V0IG9mIGNvbnRl eHRzIGZvciB3aGljaCB0aGVyZSBpcyBhdCBsZWFzdCBvbmUKLS0gCjIuMjguMC5yYzAKCg== --00000000000072a11c05af0982ad--