bug-gnulib@gnu.org mirror (unofficial)
 help / color / mirror / Atom feed
From: Jim Meyering <jim@meyering.net>
To: Norihiro Tanaka <noritnk@kcn.ne.jp>,
	GNU grep developers <grep-devel@gnu.org>
Cc: fryasu@yahoo.co.jp,
	"bug-gnulib@gnu.org List" <bug-gnulib@gnu.org>,
	Paul Eggert <eggert@cs.ucla.edu>,
	40634@debbugs.gnu.org
Subject: Re: bug#40634: Massive pattern list handling with -E format seems very slow since 2.28.
Date: Fri, 11 Sep 2020 14:47:49 +0200	[thread overview]
Message-ID: <CA+8g5KGQr=L5YdzqGNiR+j_KvaiOja1_pmAtz4__C_7si_CCaQ@mail.gmail.com> (raw)
In-Reply-To: <20200419111025.4326.27F6AC2D@kcn.ne.jp>

On Sun, Apr 19, 2020 at 4:10 AM Norihiro Tanaka <noritnk@kcn.ne.jp> wrote:
> On Sun, 19 Apr 2020 07:41:49 +0900
> Norihiro Tanaka <noritnk@kcn.ne.jp> wrote:
> > On Sat, 18 Apr 2020 00:22:26 +0900
> > Norihiro Tanaka <noritnk@kcn.ne.jp> wrote:
> >
> > >
> > > On Fri, 17 Apr 2020 10:24:42 +0900
> > > Norihiro Tanaka <noritnk@kcn.ne.jp> wrote:
> > >
> > > >
> > > > On Fri, 17 Apr 2020 09:35:36 +0900
> > > > Norihiro Tanaka <noritnk@kcn.ne.jp> wrote:
> > > >
> > > > >
> > > > > On Thu, 16 Apr 2020 16:00:29 -0700
> > > > > Paul Eggert <eggert@cs.ucla.edu> wrote:
> > > > >
> > > > > > On 4/16/20 3:53 PM, Norihiro Tanaka wrote:
> > > > > >
> > > > > > > I have had no idea to solve the problem yet.  If we revert it, bug#33357
> > > > > > > will come back.
> > > > > >
> > > > > > Yes, I'd rather not revert if we can help it.
> > > > > >
> > > > > > My own thought was to not analyze the regular expression if we discover that the input is empty. :-)
> > > > >
> > > > > Now, I have a idea, it is that we build indexes of epsilon nodes
> > > > > including in follows before remove epsilon nodes.
> > > >
> > > >
> > > > I wrote fix for the bug, but it will be slower then at grep 2.27 yet.
> > >
> > > It was improved previous patch.
> >
> > Sorry, correct patch is here.
>
> I made the previous patch even simpler.
>
> before:
>
> $ env LC_ALL=C time -p src/grep -E -v -m1 -f grep-patterns.txt /dev/null
> real 7.24
> user 7.14
> sys 0.09
>
> after:
>
> $ env LC_ALL=C time -p src/grep -E -v -m1 -f grep-patterns.txt /dev/null
> real 0.62
> user 0.52
> sys 0.10

Thank you for this patch. I have rebased and made minor syntactic changes.
I'll push it to gnulib soon, if not today, then by Monday.

I am considering creating a test case in grep, but it feels too tight
to be feasible: I would use a relative perf test, requiring that a
passing test incur a perf cost of less than say 100x. Here's the
beginnings of my attempt (note: this is just an outline -- obviously
would not rely on having "time" in path or as a shell builtin):

gen()
{
  local n=$1
  local i=1
  while : ; do
    local pat=$(printf $i | sha1sum | cut -d' ' -f1)
    printf '%s\n' "$pat$pat(\$|$pat)"
    i=$(expr $i + 1)
    test $i = $n && break
  done
}

gen 4000 > pats-4000
head -400 pats-4000 > pats-400

# With fixed code, that a 10x input size increase (n=400 to 4000)
# induces a 40x runtime increase: .05 -> 2.0s
# Just prior to this change, it's 150x: 0.2 -> 30s

env LC_ALL=C time -p src/grep -E -v -m1 -f pats-400 /dev/null
env LC_ALL=C time -p src/grep -E -v -m1 -f pats-4000 /dev/null


  reply	other threads:[~2020-09-11 12:48 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <6503eb8e-e6fd-b4dd-aab7-76bb6955d87b@cs.ucla.edu>
     [not found] ` <20200417093536.875E.27F6AC2D@kcn.ne.jp>
2020-04-17  1:24   ` bug#40634: Massive pattern list handling with -E format seems very slow since 2.28 Norihiro Tanaka
2020-04-17 15:22     ` Norihiro Tanaka
2020-04-18 22:41       ` Norihiro Tanaka
2020-04-19  2:10         ` Norihiro Tanaka
2020-09-11 12:47           ` Jim Meyering [this message]
2020-09-11 13:17             ` Jim Meyering
2020-09-11 23:01               ` Paul Eggert
2020-09-12  6:41                 ` Jim Meyering
2020-09-14  2:03                   ` Paul Eggert
2020-09-14 14:14                     ` Jim Meyering
2020-09-21 19:22                     ` Paul Eggert

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://lists.gnu.org/mailman/listinfo/bug-gnulib

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CA+8g5KGQr=L5YdzqGNiR+j_KvaiOja1_pmAtz4__C_7si_CCaQ@mail.gmail.com' \
    --to=jim@meyering.net \
    --cc=40634@debbugs.gnu.org \
    --cc=bug-gnulib@gnu.org \
    --cc=eggert@cs.ucla.edu \
    --cc=fryasu@yahoo.co.jp \
    --cc=grep-devel@gnu.org \
    --cc=noritnk@kcn.ne.jp \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).