Re: bug#40634: Massive pattern list handling with -E format seems very slow since 2.28.

From: Jim Meyering <jim@meyering.net>
To: Norihiro Tanaka <noritnk@kcn.ne.jp>,
	GNU grep developers <grep-devel@gnu.org>
Cc: fryasu@yahoo.co.jp,
	"bug-gnulib@gnu.org List" <bug-gnulib@gnu.org>,
	Paul Eggert <eggert@cs.ucla.edu>,
	40634@debbugs.gnu.org
Subject: Re: bug#40634: Massive pattern list handling with -E format seems very slow since 2.28.
Date: Fri, 11 Sep 2020 15:17:29 +0200	[thread overview]
Message-ID: <CA+8g5KHwqFLbY=FZ2j1b-_j=dwYa-q_DR88wnQx+SNiBY7gh+A@mail.gmail.com> (raw)
In-Reply-To: <CA+8g5KGQr=L5YdzqGNiR+j_KvaiOja1_pmAtz4__C_7si_CCaQ@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 2839 bytes --]

On Fri, Sep 11, 2020 at 2:47 PM Jim Meyering <jim@meyering.net> wrote:
> On Sun, Apr 19, 2020 at 4:10 AM Norihiro Tanaka <noritnk@kcn.ne.jp> wrote:
> > On Sun, 19 Apr 2020 07:41:49 +0900
> > Norihiro Tanaka <noritnk@kcn.ne.jp> wrote:
> > > On Sat, 18 Apr 2020 00:22:26 +0900
> > > Norihiro Tanaka <noritnk@kcn.ne.jp> wrote:
> > >
> > > >
> > > > On Fri, 17 Apr 2020 10:24:42 +0900
> > > > Norihiro Tanaka <noritnk@kcn.ne.jp> wrote:
> > > >
> > > > >
> > > > > On Fri, 17 Apr 2020 09:35:36 +0900
> > > > > Norihiro Tanaka <noritnk@kcn.ne.jp> wrote:
> > > > >
> > > > > >
> > > > > > On Thu, 16 Apr 2020 16:00:29 -0700
> > > > > > Paul Eggert <eggert@cs.ucla.edu> wrote:
> > > > > >
> > > > > > > On 4/16/20 3:53 PM, Norihiro Tanaka wrote:
> > > > > > >
> > > > > > > > I have had no idea to solve the problem yet.  If we revert it, bug#33357
> > > > > > > > will come back.
> > > > > > >
> > > > > > > Yes, I'd rather not revert if we can help it.
> > > > > > >
> > > > > > > My own thought was to not analyze the regular expression if we discover that the input is empty. :-)
> > > > > >
> > > > > > Now, I have a idea, it is that we build indexes of epsilon nodes
> > > > > > including in follows before remove epsilon nodes.
> > > > >
> > > > >
> > > > > I wrote fix for the bug, but it will be slower then at grep 2.27 yet.
> > > >
> > > > It was improved previous patch.
> > >
> > > Sorry, correct patch is here.
> >
> > I made the previous patch even simpler.
> >
> > before:
> >
> > $ env LC_ALL=C time -p src/grep -E -v -m1 -f grep-patterns.txt /dev/null
> > real 7.24
> > user 7.14
> > sys 0.09
> >
> > after:
> >
> > $ env LC_ALL=C time -p src/grep -E -v -m1 -f grep-patterns.txt /dev/null
> > real 0.62
> > user 0.52
> > sys 0.10
>
> Thank you for this patch. I have rebased and made minor syntactic changes.
> I'll push it to gnulib soon, if not today, then by Monday.
>
> I am considering creating a test case in grep, but it feels too tight
> to be feasible: I would use a relative perf test, requiring that a
> passing test incur a perf cost of less than say 100x. Here's the
> beginnings of my attempt (note: this is just an outline -- obviously
> would not rely on having "time" in path or as a shell builtin):
>
> gen()
> {
>   local n=$1
>   local i=1
>   while : ; do
>     local pat=$(printf $i | sha1sum | cut -d' ' -f1)
>     printf '%s\n' "$pat$pat(\$|$pat)"
>     i=$(expr $i + 1)
>     test $i = $n && break
>   done
> }
>
> gen 4000 > pats-4000
> head -400 pats-4000 > pats-400
>
> # With fixed code, that a 10x input size increase (n=400 to 4000)
> # induces a 40x runtime increase: .05 -> 2.0s
> # Just prior to this change, it's 150x: 0.2 -> 30s
>
> env LC_ALL=C time -p src/grep -E -v -m1 -f pats-400 /dev/null
> env LC_ALL=C time -p src/grep -E -v -m1 -f pats-4000 /dev/null

And here is the adjusted patch:

[-- Attachment #2: dfa.c-epsilon-node-removal-speedup.patch --]
[-- Type: application/octet-stream, Size: 4498 bytes --]

From bcd7930dba2f078da992660f6a14cde9e42b94c6 Mon Sep 17 00:00:00 2001
From: Jim Meyering <meyering@fb.com>
Date: Fri, 11 Sep 2020 01:25:10 -0700
Subject: [PATCH] dfa: speed up epsilon-node removal

Build auxiliary indices before removing epsilon closure.
Before, when removing an epsilon closure, we would search for nodes
including that epsilon closure in all nodes sequentially, but that
could be very slow.  Now, build auxiliary indices before searching.
Reported in: https://bugs.gnu.org/40634

* lib/dfa.c (overwrap): New function.
(epsclosure): Build auxiliary indices before removing any
epsilon closure; use them to speed up that process.
---
 lib/dfa.c | 107 +++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 73 insertions(+), 34 deletions(-)

diff --git a/lib/dfa.c b/lib/dfa.c
index 1f0587a7a..c4300dfb5 100644
--- a/lib/dfa.c
+++ b/lib/dfa.c
@@ -2203,6 +2203,22 @@ replace (position_set *dst, idx_t del, position_set *add,
     }
 }

+static bool _GL_ATTRIBUTE_PURE
+overwrap (position_set const *s, idx_t const *elems, idx_t nelem)
+{
+  idx_t i = 0, j = 0;
+
+  while (i < s->nelem && j < nelem)
+    if (s->elems[i].index < elems[j])
+      i++;
+    else if (s->elems[i].index > elems[j])
+      j++;
+    else
+      return true;
+
+  return false;
+}
+
 /* Find the index of the state corresponding to the given position set with
    the given preceding context, or create a new state if there is no such
    state.  Context tells whether we got here on a newline or letter.  */
@@ -2300,45 +2316,68 @@ static void
 epsclosure (struct dfa const *d)
 {
   position_set tmp;
+  idx_t *currs, *nexts;
+  idx_t ncurr = 0;
+  idx_t nnext = 0;
+
   alloc_position_set (&tmp, d->nleaves);
+  currs = xnmalloc (d->tindex, sizeof *currs);
+  nexts = xnmalloc (d->tindex, sizeof *nexts);
+
   for (idx_t i = 0; i < d->tindex; i++)
-    if (d->follows[i].nelem > 0 && d->tokens[i] >= NOTCHAR
-        && d->tokens[i] != BACKREF && d->tokens[i] != ANYCHAR
-        && d->tokens[i] != MBCSET && d->tokens[i] < CSET)
-      {
-        unsigned int constraint;
-        switch (d->tokens[i])
-          {
-          case BEGLINE:
-            constraint = BEGLINE_CONSTRAINT;
-            break;
-          case ENDLINE:
-            constraint = ENDLINE_CONSTRAINT;
-            break;
-          case BEGWORD:
-            constraint = BEGWORD_CONSTRAINT;
-            break;
-          case ENDWORD:
-            constraint = ENDWORD_CONSTRAINT;
-            break;
-          case LIMWORD:
-            constraint = LIMWORD_CONSTRAINT;
-            break;
-          case NOTLIMWORD:
-            constraint = NOTLIMWORD_CONSTRAINT;
-            break;
-          default:
-            constraint = NO_CONSTRAINT;
-            break;
-          }
+    {
+      if (d->follows[i].nelem > 0 && d->tokens[i] >= NOTCHAR
+          && d->tokens[i] != BACKREF && d->tokens[i] != ANYCHAR
+          && d->tokens[i] != MBCSET && d->tokens[i] < CSET)
+        currs[ncurr++] = i;
+    }

-        delete (i, &d->follows[i]);
+  for (idx_t i = 0, j = 0; i < d->tindex; i++)
+    {
+      while (j < ncurr && currs[j] < i)
+        j++;
+      if (overwrap (&d->follows[i], currs, ncurr))
+        nexts[nnext++] = i;
+    }
+
+  for (idx_t i = 0; i < ncurr; i++)
+    {
+      unsigned int constraint;
+      switch (d->tokens[currs[i]])
+        {
+        case BEGLINE:
+          constraint = BEGLINE_CONSTRAINT;
+          break;
+        case ENDLINE:
+          constraint = ENDLINE_CONSTRAINT;
+          break;
+        case BEGWORD:
+          constraint = BEGWORD_CONSTRAINT;
+          break;
+        case ENDWORD:
+          constraint = ENDWORD_CONSTRAINT;
+          break;
+        case LIMWORD:
+          constraint = LIMWORD_CONSTRAINT;
+          break;
+        case NOTLIMWORD:
+          constraint = NOTLIMWORD_CONSTRAINT;
+          break;
+        default:
+          constraint = NO_CONSTRAINT;
+          break;
+        }
+
+      delete (i, &d->follows[currs[i]]);
+
+      for (idx_t j = 0; j < nnext; j++)
+        replace (&d->follows[nexts[j]], currs[i], &d->follows[currs[i]],
+                 constraint, &tmp);
+    }

-        for (idx_t j = 0; j < d->tindex; j++)
-          if (i != j && d->follows[j].nelem > 0)
-            replace (&d->follows[j], i, &d->follows[i], constraint, &tmp);
-      }
   free (tmp.elems);
+  free (currs);
+  free (nexts);
 }

 /* Returns the set of contexts for which there is at least one
-- 
2.28.0.rc0