bug-gnulib@gnu.org mirror (unofficial)
 help / color / mirror / Atom feed
From: Norihiro Tanaka <noritnk@kcn.ne.jp>
To: <40634@debbugs.gnu.org>
Cc: fryasu@yahoo.co.jp, Paul Eggert <eggert@cs.ucla.edu>, bug-gnulib@gnu.org
Subject: Re: bug#40634: Massive pattern list handling with -E format seems very slow since 2.28.
Date: Fri, 17 Apr 2020 10:24:42 +0900	[thread overview]
Message-ID: <20200417102441.8766.27F6AC2D@kcn.ne.jp> (raw)
In-Reply-To: <20200417093536.875E.27F6AC2D@kcn.ne.jp>

[-- Attachment #1: Type: text/plain, Size: 677 bytes --]


On Fri, 17 Apr 2020 09:35:36 +0900
Norihiro Tanaka <noritnk@kcn.ne.jp> wrote:

> 
> On Thu, 16 Apr 2020 16:00:29 -0700
> Paul Eggert <eggert@cs.ucla.edu> wrote:
> 
> > On 4/16/20 3:53 PM, Norihiro Tanaka wrote:
> > 
> > > I have had no idea to solve the problem yet.  If we revert it, bug#33357
> > > will come back.
> > 
> > Yes, I'd rather not revert if we can help it.
> > 
> > My own thought was to not analyze the regular expression if we discover that the input is empty. :-)
> 
> Now, I have a idea, it is that we build indexes of epsilon nodes
> including in follows before remove epsilon nodes.


I wrote fix for the bug, but it will be slower then at grep 2.27 yet.

[-- Attachment #2: 0001-dfa-build-auxiliary-indexes-before-remove-epsilon-cl.patch --]
[-- Type: text/plain, Size: 3184 bytes --]

From e5680a4c43b185df6b8c1b80423d663de6f6e37e Mon Sep 17 00:00:00 2001
From: Norihiro Tanaka <noritnk@kcn.ne.jp>
Date: Fri, 17 Apr 2020 10:12:01 +0900
Subject: [PATCH] dfa: build auxiliary indexes before remove epsilon closure

When remove epsilon closure, so far searched nodes including epsilon closure
in all nodes sequentially, but it is slow for some cases.  Now build
auxiliary indexes before search.

Problem reported in: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=40634

* lib/dfa.c (overwrap): New function.
(epsclosure): build auxiliary indexes before remove epsilon closure.
---
 lib/dfa.c |   52 +++++++++++++++++++++++++++++++++++++++++++++-------
 1 files changed, 45 insertions(+), 7 deletions(-)

diff --git a/lib/dfa.c b/lib/dfa.c
index 9939d22..958069e 100644
--- a/lib/dfa.c
+++ b/lib/dfa.c
@@ -2201,6 +2201,22 @@ replace (position_set *dst, idx_t del, position_set *add,
     }
 }
 
+static bool
+overwrap (position_set const *s, idx_t const *elems, idx_t nelem)
+{
+  idx_t i = 0, j = 0;
+
+  while (i < s->nelem && j < nelem)
+    if (s->elems[i].index < elems[j])
+      i++;
+    else if (s->elems[i].index > elems[j])
+      j++;
+    else
+      return true;
+
+  return false;
+}
+
 /* Find the index of the state corresponding to the given position set with
    the given preceding context, or create a new state if there is no such
    state.  Context tells whether we got here on a newline or letter.  */
@@ -2298,14 +2314,33 @@ static void
 epsclosure (struct dfa const *d)
 {
   position_set tmp;
+  idx_t *currs, *nexts;
+  idx_t ncurr = 0;
+  idx_t nnext = 0;
+
   alloc_position_set (&tmp, d->nleaves);
-  for (idx_t i = 0; i < d->tindex; i++)
-    if (d->follows[i].nelem > 0 && d->tokens[i] >= NOTCHAR
+  currs = xnmalloc (d->tindex, sizeof *currs);
+  nexts = xnmalloc (d->tindex, sizeof *nexts);
+
+  for (idx_t i = 0; i < d->tindex; i++) {
+     if (d->follows[i].nelem > 0 && d->tokens[i] >= NOTCHAR
         && d->tokens[i] != BACKREF && d->tokens[i] != ANYCHAR
         && d->tokens[i] != MBCSET && d->tokens[i] < CSET)
+      currs[ncurr++] = i;
+  }
+
+  for (idx_t i = 0, j = 0; i < d->tindex; i++)
+    {
+      while (j < ncurr && currs[j] < i)
+        j++;
+      if (overwrap (&d->follows[i], currs, ncurr))
+        nexts[nnext++] = i;
+    }
+
+  for (idx_t i = 0; i < ncurr; i++)
       {
         unsigned int constraint;
-        switch (d->tokens[i])
+        switch (d->tokens[currs[i]])
           {
           case BEGLINE:
             constraint = BEGLINE_CONSTRAINT;
@@ -2330,13 +2365,16 @@ epsclosure (struct dfa const *d)
             break;
           }
 
-        delete (i, &d->follows[i]);
+        delete (i, &d->follows[currs[i]]);
 
-        for (idx_t j = 0; j < d->tindex; j++)
-          if (i != j && d->follows[j].nelem > 0)
-            replace (&d->follows[j], i, &d->follows[i], constraint, &tmp);
+        for (idx_t j = 0; j < nnext; j++)
+          replace (&d->follows[nexts[j]], currs[i], &d->follows[currs[i]],
+                  constraint, &tmp);
       }
+
   free (tmp.elems);
+  free (currs);
+  free (nexts);
 }
 
 /* Returns the set of contexts for which there is at least one
-- 
1.7.1


       reply	other threads:[~2020-04-17  1:25 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <6503eb8e-e6fd-b4dd-aab7-76bb6955d87b@cs.ucla.edu>
     [not found] ` <20200417093536.875E.27F6AC2D@kcn.ne.jp>
2020-04-17  1:24   ` Norihiro Tanaka [this message]
2020-04-17 15:22     ` bug#40634: Massive pattern list handling with -E format seems very slow since 2.28 Norihiro Tanaka
2020-04-18 22:41       ` Norihiro Tanaka
2020-04-19  2:10         ` Norihiro Tanaka
2020-09-11 12:47           ` Jim Meyering
2020-09-11 13:17             ` Jim Meyering
2020-09-11 23:01               ` Paul Eggert
2020-09-12  6:41                 ` Jim Meyering
2020-09-14  2:03                   ` Paul Eggert
2020-09-14 14:14                     ` Jim Meyering
2020-09-21 19:22                     ` Paul Eggert

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://lists.gnu.org/mailman/listinfo/bug-gnulib

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200417102441.8766.27F6AC2D@kcn.ne.jp \
    --to=noritnk@kcn.ne.jp \
    --cc=40634@debbugs.gnu.org \
    --cc=bug-gnulib@gnu.org \
    --cc=eggert@cs.ucla.edu \
    --cc=fryasu@yahoo.co.jp \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).