From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net X-Spam-Level: X-Spam-Status: No, score=-4.2 required=3.0 tests=AWL,BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,RCVD_IN_DNSWL_MED, RCVD_IN_MSPIKE_H4,RCVD_IN_MSPIKE_WL,SPF_HELO_NONE,SPF_PASS shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2 Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dcvr.yhbt.net (Postfix) with ESMTPS id 006131F4B4 for ; Sun, 13 Sep 2020 01:55:16 +0000 (UTC) Received: from localhost ([::1]:56942 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1kHHEY-0006Lf-Si for normalperson@yhbt.net; Sat, 12 Sep 2020 21:55:14 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:43766) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1kHHEU-0006LK-Ph for bug-gnulib@gnu.org; Sat, 12 Sep 2020 21:55:10 -0400 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:53950) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1kHHES-0007if-6N for bug-gnulib@gnu.org; Sat, 12 Sep 2020 21:55:10 -0400 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id D01C31600B4; Sat, 12 Sep 2020 18:55:03 -0700 (PDT) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id 5O1WbumRefGx; Sat, 12 Sep 2020 18:55:02 -0700 (PDT) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 7D870160113; Sat, 12 Sep 2020 18:55:02 -0700 (PDT) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id 3yRyBXjgAwiI; Sat, 12 Sep 2020 18:55:02 -0700 (PDT) Received: from day.example.com (cpe-75-82-69-226.socal.res.rr.com [75.82.69.226]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id 4E7C31600B4; Sat, 12 Sep 2020 18:55:02 -0700 (PDT) From: Paul Eggert To: bug-gnulib@gnu.org Subject: [PATCH 1/2] dfa: use backward set in removal of epsilon closure Date: Sat, 12 Sep 2020 18:54:56 -0700 Message-Id: <20200913015457.21825-1-eggert@cs.ucla.edu> X-Mailer: git-send-email 2.17.1 Received-SPF: pass client-ip=131.179.128.68; envelope-from=eggert@cs.ucla.edu; helo=zimbra.cs.ucla.edu X-detected-operating-system: by eggs.gnu.org: First seen = 2020/09/12 21:55:04 X-ACL-Warn: Detected OS = Linux 3.1-3.10 [fuzzy] X-Spam_score_int: -41 X-Spam_score: -4.2 X-Spam_bar: ---- X-Spam_report: (-4.2 / 5.0 requ) BAYES_00=-1.9, RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: bug-gnulib@gnu.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Gnulib discussion list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Norihiro Tanaka Errors-To: bug-gnulib-bounces+normalperson=yhbt.net@gnu.org Sender: "bug-gnulib" From: Norihiro Tanaka When removing in epsilon closure, the code searched all nodes sequentially, and this was slow for some cases. Build a backward set before search, and only check previous position with the set. Problem reported in . * lib/dfa.c (struct dfa): New member 'epsilon'. (addtok_mb): Check whether a pattern has epsilon node or not. (epsclosure): New arg BACKWORD; caller changed. When removing epsilon node and reconnecting, check only previous positions. Treat BEG as if it were character. (dfaanalyze): Build backward set. --- lib/dfa.c | 65 +++++++++++++++++++++++++++++++++++++++++++------------ 1 file changed, 51 insertions(+), 14 deletions(-) diff --git a/lib/dfa.c b/lib/dfa.c index 1f0587a7a..7c8eb6685 100644 --- a/lib/dfa.c +++ b/lib/dfa.c @@ -488,6 +488,7 @@ struct dfa bool fast; /* The DFA is fast. */ token utf8_anychar_classes[9]; /* To lower ANYCHAR in UTF-8 locales. */ mbstate_t mbs; /* Multibyte conversion state. */ + bool epsilon; /* The following are valid only if MB_CUR_MAX > 1. */ @@ -1615,13 +1616,21 @@ addtok_mb (struct dfa *dfa, token t, char mbprop) dfa->parse.depth--; break; - case BACKREF: - dfa->fast = false; + case BEGLINE: + case ENDLINE: + case BEGWORD: + case ENDWORD: + case LIMWORD: + case NOTLIMWORD: + case EMPTY: + dfa->epsilon = true; FALLTHROUGH; + default: - dfa->nleaves++; - FALLTHROUGH; - case EMPTY: + if (t == BACKREF) + dfa->fast = false; + if (t != EMPTY) + dfa->nleaves++; dfa->parse.depth++; break; } @@ -2297,14 +2306,15 @@ state_index (struct dfa *d, position_set const *s, int context) constraint. Repeat exhaustively until no funny positions are left. S->elems must be large enough to hold the result. */ static void -epsclosure (struct dfa const *d) +epsclosure (struct dfa const *d, position_set *backword) { position_set tmp; alloc_position_set (&tmp, d->nleaves); for (idx_t i = 0; i < d->tindex; i++) if (d->follows[i].nelem > 0 && d->tokens[i] >= NOTCHAR && d->tokens[i] != BACKREF && d->tokens[i] != ANYCHAR - && d->tokens[i] != MBCSET && d->tokens[i] < CSET) + && d->tokens[i] != MBCSET && d->tokens[i] < CSET + && d->tokens[i] != BEG) { unsigned int constraint; switch (d->tokens[i]) @@ -2327,16 +2337,21 @@ epsclosure (struct dfa const *d) case NOTLIMWORD: constraint = NOTLIMWORD_CONSTRAINT; break; - default: + case EMPTY: constraint = NO_CONSTRAINT; break; + default: + abort (); } delete (i, &d->follows[i]); - for (idx_t j = 0; j < d->tindex; j++) - if (i != j && d->follows[j].nelem > 0) - replace (&d->follows[j], i, &d->follows[i], constraint, &tmp); + for (idx_t j = 0; j < backword[i].nelem; j++) + replace (&d->follows[backword[i].elems[j].index], i, &d->follows[i], + constraint, &tmp); + for (idx_t j = 0; j < d->follows[i].nelem; j++) + replace (&backword[d->follows[i].elems[j].index], i, &backword[i], + NO_CONSTRAINT, &tmp); } free (tmp.elems); } @@ -2643,6 +2658,7 @@ dfaanalyze (struct dfa *d, bool searchflag) /* Firstpos and lastpos elements. */ position *firstpos = posalloc; position *lastpos = firstpos + d->nleaves; + position_set *backword = NULL; position pos; position_set tmp; @@ -2675,6 +2691,9 @@ dfaanalyze (struct dfa *d, bool searchflag) alloc_position_set (&merged, d->nleaves); d->follows = xcalloc (d->tindex, sizeof *d->follows); + if (d->epsilon) + backword = xcalloc (d->tindex, sizeof *backword); + for (idx_t i = 0; i < d->tindex; i++) { switch (d->tokens[i]) @@ -2712,6 +2731,17 @@ dfaanalyze (struct dfa *d, bool searchflag) case CAT: /* Every element in the firstpos of the second argument is in the follow of every element in the lastpos of the first argument. */ + if (d->epsilon) + { + tmp.nelem = stk[-2].nlastpos; + tmp.elems = lastpos - stk[-1].nlastpos - stk[-2].nlastpos; + position *p = firstpos - stk[-1].nfirstpos; + for (idx_t j = 0; j < stk[-1].nfirstpos; j++) + { + merge (&tmp, &backword[p[j].index], &merged); + copy (&merged, &backword[p[j].index]); + } + } { tmp.nelem = stk[-1].nfirstpos; tmp.elems = firstpos - stk[-1].nfirstpos; @@ -2801,9 +2831,15 @@ dfaanalyze (struct dfa *d, bool searchflag) #endif } - /* For each follow set that is the follow set of a real position, replace - it with its epsilon closure. */ - epsclosure (d); + if (d->epsilon) + { + /* For each follow set that is the follow set of a real position, + replace it with its epsilon closure. */ + epsclosure (d, backword); + + for (idx_t i = 0; i < d->tindex; i++) + free (backword[i].elems); + } dfaoptimize (d); @@ -2865,6 +2901,7 @@ dfaanalyze (struct dfa *d, bool searchflag) d->min_trcount++; d->trcount = 0; + free (backword); free (posalloc); free (stkalloc); free (merged.elems); -- 2.17.1