[ruby-core:22927] [Bug #1301] Poor RegExp Matching Performance

ruby-core@ruby-lang.org archive (unofficial mirror)
 help / color / mirror / Atom feed

* [ruby-core:22927] [Bug #1301] Poor RegExp Matching Performance
@ 2009-03-18 14:53 Andreas Grau
  2009-03-18 15:33 ` [ruby-core:22928] " Eero Saynatkari
  2009-03-19  1:59 ` [ruby-core:22936] " Martin Duerst
  0 siblings, 2 replies; 7+ messages in thread
From: Andreas Grau @ 2009-03-18 14:53 UTC (permalink / raw
  To: ruby-core

Bug #1301: Poor RegExp Matching Performance
http://redmine.ruby-lang.org/issues/show/1301

Author: Andreas Grau
Status: Open, Priority: Normal
Category: core
ruby -v: 1.9.0 (2008-06-20 revision 17482) [i486-linux]

I noticed a very poor performance on matching regular expressions.

Running following code using ruby 1.9.0 on a Core2Duo (2x2Ghz) takes more than 20s to complete.

regexp = /[b]+.+.+.+.+.+.+.+.+[a]/
str="bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
regexp.match(str)

$ time ruby1.9  test.rb 

real    0m23.029s
user    0m22.421s
sys     0m0.012s


----------------------------------------
http://redmine.ruby-lang.org

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [ruby-core:22928] Re: [Bug #1301] Poor RegExp Matching Performance
  2009-03-18 14:53 [ruby-core:22927] [Bug #1301] Poor RegExp Matching Performance Andreas Grau
@ 2009-03-18 15:33 ` Eero Saynatkari
  2009-03-18 15:43   ` [ruby-core:22929] " Andreas Grau
  2009-03-19  1:59 ` [ruby-core:22936] " Martin Duerst
  1 sibling, 1 reply; 7+ messages in thread
From: Eero Saynatkari @ 2009-03-18 15:33 UTC (permalink / raw
  To: ruby-core@ruby-lang.org

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=US-ASCII, Size: 962 bytes --]

Excerpts from Marius Mårnes Mathiesen's message of Wed Mar 18 16:53:08 +0200 2009:
> Bug #1301: Poor RegExp Matching Performance
> http://redmine.ruby-lang.org/issues/show/1301
> 
> Author: Andreas Grau
> Status: Open, Priority: Normal
> Category: core
> ruby -v: 1.9.0 (2008-06-20 revision 17482) [i486-linux]
> 
> I noticed a very poor performance on matching regular expressions.

You mean non-matching? Aside from which, I would be hard-pressed
to call the code below a "regular expression."

> Running following code using ruby 1.9.0 on a Core2Duo (2x2Ghz) takes more than
> 20s to complete.
> 
> regexp = /[b]+.+.+.+.+.+.+.+.+[a]/
> str="bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
> regexp.match(str)

Are you able to produce a reasonable example of where this
is a problem and not just a pathologically horrible regexp?
The above, written as:

    /[b]+.+[a]/

Or:

    /[b].+.{1,}[a]/

Completes instantly.


--
Magic is insufficiently advanced technology.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [ruby-core:22929] [Bug #1301] Poor RegExp Matching Performance
  2009-03-18 15:33 ` [ruby-core:22928] " Eero Saynatkari
@ 2009-03-18 15:43   ` Andreas Grau
  2009-03-18 16:34     ` [ruby-core:22930] " Eero Saynatkari
  0 siblings, 1 reply; 7+ messages in thread
From: Andreas Grau @ 2009-03-18 15:43 UTC (permalink / raw
  To: ruby-core

Issue #1301 has been updated by Andreas Grau.


Using this regexp
regexp = /^(\d*\s+){8,8}+\d+$/

This matches on any line starting with up to 8 number separated by by blanks and a final number

if the string is now in the wrong format, e.g. 
str="                                  #"

the matching takes a very long time
regexp.match(str)



----------------------------------------
http://redmine.ruby-lang.org/issues/show/1301

----------------------------------------
http://redmine.ruby-lang.org

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [ruby-core:22930] Re: [Bug #1301] Poor RegExp Matching Performance
  2009-03-18 15:43   ` [ruby-core:22929] " Andreas Grau
@ 2009-03-18 16:34     ` Eero Saynatkari
  0 siblings, 0 replies; 7+ messages in thread
From: Eero Saynatkari @ 2009-03-18 16:34 UTC (permalink / raw
  To: ruby-core@ruby-lang.org

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=US-ASCII, Size: 884 bytes --]

Excerpts from Marius Mårnes Mathiesen's message of Wed Mar 18 17:43:04 +0200 2009:
> Issue #1301 has been updated by Andreas Grau.
> 
> 
> Using this regexp
> regexp = /^(\d*\s+){8,8}+\d+$/
> 
> This matches on any line starting with up to 8 number separated by by blanks
> and a final number
> 
> if the string is now in the wrong format, e.g. 
> str="                                  #"
> 
> the matching takes a very long time
> regexp.match(str)

OK, much more reasonable case, enough to warrant looking into
maybe eliminating some pathological cases. Your particular case
could, of course, be better (and more naturally) written as:

   regexp = /^(\d+\s+){0,8}\s*\d+$/

Or something similar to fit the bill, but it eliminates the
poor performance of the example provided until and if such
fixes in the state machine are done.


--
Magic is insufficiently advanced technology.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [ruby-core:22936] Re: [Bug #1301] Poor RegExp Matching Performance
  2009-03-18 14:53 [ruby-core:22927] [Bug #1301] Poor RegExp Matching Performance Andreas Grau
  2009-03-18 15:33 ` [ruby-core:22928] " Eero Saynatkari
@ 2009-03-19  1:59 ` Martin Duerst
  2009-03-19 11:17   ` [ruby-core:22953] " Andreas Grau
  1 sibling, 1 reply; 7+ messages in thread
From: Martin Duerst @ 2009-03-19  1:59 UTC (permalink / raw
  To: ruby-core

At 23:53 09/03/18, Andreas Grau wrote:
>Bug #1301: Poor RegExp Matching Performance
>http://redmine.ruby-lang.org/issues/show/1301
>
>Author: Andreas Grau
>Status: Open, Priority: Normal
>Category: core
>ruby -v: 1.9.0 (2008-06-20 revision 17482) [i486-linux]
>
>I noticed a very poor performance on matching regular expressions.
>
>Running following code using ruby 1.9.0 on a Core2Duo (2x2Ghz) takes more 
>than 20s to complete.
>
>regexp = /[b]+.+.+.+.+.+.+.+.+[a]/

Eero gave a much better way to write this regexp.
But why is it so bad, in particular for the string
below? The string has 37 'b's, and you are asking
it to try all different ways it can be split into
nine non-empty substrings (the first of which has
to consist of 'b's only), followed by an [a]. It's
surprising that this only takes 23 seconds! 

Theory about regular expressions (formal language theory)
says there shouldn't be any difference, but Ruby regular
expressions (same for Perl, Python, and so on) are not
really regular expressions in the formal sense anymore,
because they allow capturing and many more neat practical
things. So the implementation uses backtracking, and
your regexp above creates a lot of backtracking.
For details on how to write good regular expressions,
please see Jeffrey Friedl's book
(e.g. at
http://www.amazon.com/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124/)

BTW, the [] in the above expression are not necessary,
it would usually be written

regexp = /b+.+.+.+.+.+.+.+.+a/

Regards,    Martin.

>str="bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"
>regexp.match(str)
>
>$ time ruby1.9  test.rb 
>
>real    0m23.029s
>user    0m22.421s
>sys     0m0.012s
>
>
>----------------------------------------
>http://redmine.ruby-lang.org

#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp     

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [ruby-core:22953] [Bug #1301] Poor RegExp Matching Performance
  2009-03-19  1:59 ` [ruby-core:22936] " Martin Duerst
@ 2009-03-19 11:17   ` Andreas Grau
  2009-03-19 13:29     ` [ruby-core:22955] " Wolfgang Nádasi-Donner
  0 siblings, 1 reply; 7+ messages in thread
From: Andreas Grau @ 2009-03-19 11:17 UTC (permalink / raw
  To: ruby-core

Issue #1301 has been updated by Andreas Grau.


> Theory about regular expressions (formal language theory)
> says there shouldn't be any difference, but Ruby regular
> expressions (same for Perl, Python, and so on) are not
> really regular expressions in the formal sense anymore,
> because they allow capturing and many more neat practical
> things. So the implementation uses backtracking, and
> your regexp above creates a lot of backtracking.

I agree that python is also ineffizient, however, it is in
my tests 10 times faster when testing this regexp.

What I'm not sure if regexp parsing really requires backtracking.
The Unix command sed for matching and group capturing requires
below 10ms. (ruby ~10s, python ~1s).

regexp=/(\d*) +(\d*) +(\d*) +(\d*) +(\d*) +(\d*) +(\d*) +(\d*) +(\d+)$/
str="                              #"

BTW, a trivial optimization would be to test matching of the regexp using
fast DFA/NFA automat and in case of a matching, use backtracking...


----------------------------------------
http://redmine.ruby-lang.org/issues/show/1301

----------------------------------------
http://redmine.ruby-lang.org

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [ruby-core:22955] Re: [Bug #1301] Poor RegExp Matching Performance
  2009-03-19 11:17   ` [ruby-core:22953] " Andreas Grau
@ 2009-03-19 13:29     ` Wolfgang Nádasi-Donner
  0 siblings, 0 replies; 7+ messages in thread
From: Wolfgang Nádasi-Donner @ 2009-03-19 13:29 UTC (permalink / raw
  To: ruby-core

Andreas Grau schrieb:
> BTW, a trivial optimization would be to test matching of the regexp using
> fast DFA/NFA automat and in case of a matching, use backtracking...
The new Ruby regex interpreter Oniguruma works on "very extended regular 
expressions" which are able to parse some context sensitive grammars.

Because a less powerful real regular expression machine may be helpful 
for several simpler expressions, it may be worth to think about an 
additional adaption of such a machine for Ruby.

Wolfgang Nádasi-Donner

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2009-03-19 13:39 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-03-18 14:53 [ruby-core:22927] [Bug #1301] Poor RegExp Matching Performance Andreas Grau
2009-03-18 15:33 ` [ruby-core:22928] " Eero Saynatkari
2009-03-18 15:43   ` [ruby-core:22929] " Andreas Grau
2009-03-18 16:34     ` [ruby-core:22930] " Eero Saynatkari
2009-03-19  1:59 ` [ruby-core:22936] " Martin Duerst
2009-03-19 11:17   ` [ruby-core:22953] " Andreas Grau
2009-03-19 13:29     ` [ruby-core:22955] " Wolfgang Nádasi-Donner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).