* [ruby-core:22927] [Bug #1301] Poor RegExp Matching Performance
@ 2009-03-18 14:53 Andreas Grau
2009-03-18 15:33 ` [ruby-core:22928] " Eero Saynatkari
2009-03-19 1:59 ` [ruby-core:22936] " Martin Duerst
0 siblings, 2 replies; 7+ messages in thread
From: Andreas Grau @ 2009-03-18 14:53 UTC (permalink / raw
To: ruby-core
Bug #1301: Poor RegExp Matching Performance
http://redmine.ruby-lang.org/issues/show/1301
Author: Andreas Grau
Status: Open, Priority: Normal
Category: core
ruby -v: 1.9.0 (2008-06-20 revision 17482) [i486-linux]
I noticed a very poor performance on matching regular expressions.
Running following code using ruby 1.9.0 on a Core2Duo (2x2Ghz) takes more than 20s to complete.
regexp = /[b]+.+.+.+.+.+.+.+.+[a]/
str="bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
regexp.match(str)
$ time ruby1.9 test.rb
real 0m23.029s
user 0m22.421s
sys 0m0.012s
----------------------------------------
http://redmine.ruby-lang.org
^ permalink raw reply [flat|nested] 7+ messages in thread
* [ruby-core:22928] Re: [Bug #1301] Poor RegExp Matching Performance
2009-03-18 14:53 [ruby-core:22927] [Bug #1301] Poor RegExp Matching Performance Andreas Grau
@ 2009-03-18 15:33 ` Eero Saynatkari
2009-03-18 15:43 ` [ruby-core:22929] " Andreas Grau
2009-03-19 1:59 ` [ruby-core:22936] " Martin Duerst
1 sibling, 1 reply; 7+ messages in thread
From: Eero Saynatkari @ 2009-03-18 15:33 UTC (permalink / raw
To: ruby-core@ruby-lang.org
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=US-ASCII, Size: 962 bytes --]
Excerpts from Marius Mårnes Mathiesen's message of Wed Mar 18 16:53:08 +0200 2009:
> Bug #1301: Poor RegExp Matching Performance
> http://redmine.ruby-lang.org/issues/show/1301
>
> Author: Andreas Grau
> Status: Open, Priority: Normal
> Category: core
> ruby -v: 1.9.0 (2008-06-20 revision 17482) [i486-linux]
>
> I noticed a very poor performance on matching regular expressions.
You mean non-matching? Aside from which, I would be hard-pressed
to call the code below a "regular expression."
> Running following code using ruby 1.9.0 on a Core2Duo (2x2Ghz) takes more than
> 20s to complete.
>
> regexp = /[b]+.+.+.+.+.+.+.+.+[a]/
> str="bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
> regexp.match(str)
Are you able to produce a reasonable example of where this
is a problem and not just a pathologically horrible regexp?
The above, written as:
/[b]+.+[a]/
Or:
/[b].+.{1,}[a]/
Completes instantly.
--
Magic is insufficiently advanced technology.
^ permalink raw reply [flat|nested] 7+ messages in thread
* [ruby-core:22929] [Bug #1301] Poor RegExp Matching Performance
2009-03-18 15:33 ` [ruby-core:22928] " Eero Saynatkari
@ 2009-03-18 15:43 ` Andreas Grau
2009-03-18 16:34 ` [ruby-core:22930] " Eero Saynatkari
0 siblings, 1 reply; 7+ messages in thread
From: Andreas Grau @ 2009-03-18 15:43 UTC (permalink / raw
To: ruby-core
Issue #1301 has been updated by Andreas Grau.
Using this regexp
regexp = /^(\d*\s+){8,8}+\d+$/
This matches on any line starting with up to 8 number separated by by blanks and a final number
if the string is now in the wrong format, e.g.
str=" #"
the matching takes a very long time
regexp.match(str)
----------------------------------------
http://redmine.ruby-lang.org/issues/show/1301
----------------------------------------
http://redmine.ruby-lang.org
^ permalink raw reply [flat|nested] 7+ messages in thread
* [ruby-core:22930] Re: [Bug #1301] Poor RegExp Matching Performance
2009-03-18 15:43 ` [ruby-core:22929] " Andreas Grau
@ 2009-03-18 16:34 ` Eero Saynatkari
0 siblings, 0 replies; 7+ messages in thread
From: Eero Saynatkari @ 2009-03-18 16:34 UTC (permalink / raw
To: ruby-core@ruby-lang.org
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=US-ASCII, Size: 884 bytes --]
Excerpts from Marius Mårnes Mathiesen's message of Wed Mar 18 17:43:04 +0200 2009:
> Issue #1301 has been updated by Andreas Grau.
>
>
> Using this regexp
> regexp = /^(\d*\s+){8,8}+\d+$/
>
> This matches on any line starting with up to 8 number separated by by blanks
> and a final number
>
> if the string is now in the wrong format, e.g.
> str=" #"
>
> the matching takes a very long time
> regexp.match(str)
OK, much more reasonable case, enough to warrant looking into
maybe eliminating some pathological cases. Your particular case
could, of course, be better (and more naturally) written as:
regexp = /^(\d+\s+){0,8}\s*\d+$/
Or something similar to fit the bill, but it eliminates the
poor performance of the example provided until and if such
fixes in the state machine are done.
--
Magic is insufficiently advanced technology.
^ permalink raw reply [flat|nested] 7+ messages in thread
* [ruby-core:22936] Re: [Bug #1301] Poor RegExp Matching Performance
2009-03-18 14:53 [ruby-core:22927] [Bug #1301] Poor RegExp Matching Performance Andreas Grau
2009-03-18 15:33 ` [ruby-core:22928] " Eero Saynatkari
@ 2009-03-19 1:59 ` Martin Duerst
2009-03-19 11:17 ` [ruby-core:22953] " Andreas Grau
1 sibling, 1 reply; 7+ messages in thread
From: Martin Duerst @ 2009-03-19 1:59 UTC (permalink / raw
To: ruby-core
At 23:53 09/03/18, Andreas Grau wrote:
>Bug #1301: Poor RegExp Matching Performance
>http://redmine.ruby-lang.org/issues/show/1301
>
>Author: Andreas Grau
>Status: Open, Priority: Normal
>Category: core
>ruby -v: 1.9.0 (2008-06-20 revision 17482) [i486-linux]
>
>I noticed a very poor performance on matching regular expressions.
>
>Running following code using ruby 1.9.0 on a Core2Duo (2x2Ghz) takes more
>than 20s to complete.
>
>regexp = /[b]+.+.+.+.+.+.+.+.+[a]/
Eero gave a much better way to write this regexp.
But why is it so bad, in particular for the string
below? The string has 37 'b's, and you are asking
it to try all different ways it can be split into
nine non-empty substrings (the first of which has
to consist of 'b's only), followed by an [a]. It's
surprising that this only takes 23 seconds!
Theory about regular expressions (formal language theory)
says there shouldn't be any difference, but Ruby regular
expressions (same for Perl, Python, and so on) are not
really regular expressions in the formal sense anymore,
because they allow capturing and many more neat practical
things. So the implementation uses backtracking, and
your regexp above creates a lot of backtracking.
For details on how to write good regular expressions,
please see Jeffrey Friedl's book
(e.g. at
http://www.amazon.com/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124/)
BTW, the [] in the above expression are not necessary,
it would usually be written
regexp = /b+.+.+.+.+.+.+.+.+a/
Regards, Martin.
>str="bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
>regexp.match(str)
>
>$ time ruby1.9 test.rb
>
>real 0m23.029s
>user 0m22.421s
>sys 0m0.012s
>
>
>----------------------------------------
>http://redmine.ruby-lang.org
#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
^ permalink raw reply [flat|nested] 7+ messages in thread
* [ruby-core:22953] [Bug #1301] Poor RegExp Matching Performance
2009-03-19 1:59 ` [ruby-core:22936] " Martin Duerst
@ 2009-03-19 11:17 ` Andreas Grau
2009-03-19 13:29 ` [ruby-core:22955] " Wolfgang Nádasi-Donner
0 siblings, 1 reply; 7+ messages in thread
From: Andreas Grau @ 2009-03-19 11:17 UTC (permalink / raw
To: ruby-core
Issue #1301 has been updated by Andreas Grau.
> Theory about regular expressions (formal language theory)
> says there shouldn't be any difference, but Ruby regular
> expressions (same for Perl, Python, and so on) are not
> really regular expressions in the formal sense anymore,
> because they allow capturing and many more neat practical
> things. So the implementation uses backtracking, and
> your regexp above creates a lot of backtracking.
I agree that python is also ineffizient, however, it is in
my tests 10 times faster when testing this regexp.
What I'm not sure if regexp parsing really requires backtracking.
The Unix command sed for matching and group capturing requires
below 10ms. (ruby ~10s, python ~1s).
regexp=/(\d*) +(\d*) +(\d*) +(\d*) +(\d*) +(\d*) +(\d*) +(\d*) +(\d+)$/
str=" #"
BTW, a trivial optimization would be to test matching of the regexp using
fast DFA/NFA automat and in case of a matching, use backtracking...
----------------------------------------
http://redmine.ruby-lang.org/issues/show/1301
----------------------------------------
http://redmine.ruby-lang.org
^ permalink raw reply [flat|nested] 7+ messages in thread
* [ruby-core:22955] Re: [Bug #1301] Poor RegExp Matching Performance
2009-03-19 11:17 ` [ruby-core:22953] " Andreas Grau
@ 2009-03-19 13:29 ` Wolfgang Nádasi-Donner
0 siblings, 0 replies; 7+ messages in thread
From: Wolfgang Nádasi-Donner @ 2009-03-19 13:29 UTC (permalink / raw
To: ruby-core
Andreas Grau schrieb:
> BTW, a trivial optimization would be to test matching of the regexp using
> fast DFA/NFA automat and in case of a matching, use backtracking...
The new Ruby regex interpreter Oniguruma works on "very extended regular
expressions" which are able to parse some context sensitive grammars.
Because a less powerful real regular expression machine may be helpful
for several simpler expressions, it may be worth to think about an
additional adaption of such a machine for Ruby.
Wolfgang Nádasi-Donner
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2009-03-19 13:39 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-03-18 14:53 [ruby-core:22927] [Bug #1301] Poor RegExp Matching Performance Andreas Grau
2009-03-18 15:33 ` [ruby-core:22928] " Eero Saynatkari
2009-03-18 15:43 ` [ruby-core:22929] " Andreas Grau
2009-03-18 16:34 ` [ruby-core:22930] " Eero Saynatkari
2009-03-19 1:59 ` [ruby-core:22936] " Martin Duerst
2009-03-19 11:17 ` [ruby-core:22953] " Andreas Grau
2009-03-19 13:29 ` [ruby-core:22955] " Wolfgang Nádasi-Donner
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).