ruby-core@ruby-lang.org archive (unofficial mirror)
* [ruby-core:100239] [Ruby master Feature#17206] Introduce new Regexp option to avoid MatchData allocation
@ 2020-09-30 15:42 fatkodima123
  2020-09-30 15:58 ` [ruby-core:100240] " zn
                   ` (6 more replies)
  0 siblings, 7 replies; 8+ messages in thread
From: fatkodima123 @ 2020-09-30 15:42 UTC (permalink / raw)
  To: ruby-core

Issue #17206 has been reported by fatkodima (Dima Fatko).

----------------------------------------
Feature #17206: Introduce new Regexp option to avoid MatchData allocation
https://bugs.ruby-lang.org/issues/17206

* Author: fatkodima (Dima Fatko)
* Status: Open
* Priority: Normal
----------------------------------------
Originates from https://bugs.ruby-lang.org/issues/17030

When this option is specified, Ruby will not create global `MatchData` objects when not explicitly needed by the method.

If the new option is named `f`, we can write `/o/f`, and `grep(/o/f)` is faster than `grep(/o/)`.

This speeds up not only `grep`, but also `all?`, `any?`, `case` and so on.

Many people have written code like this:
```ruby
IO.foreach("foo.txt") do |line|
  case line
  when /^#/
    # do nothing 
  when /^(\d+)/
    # using $1
  when /xxx/
    # using $&
  when /yyy/
    # not using $&
  else
    # ...
  end
end
```

This is slow because of the problem mentioned above: every successful match creates a `MatchData` object, even in branches that never read it.
Replacing `/^#/` with `/^#/f` and `/yyy/` with `/yyy/f` will make it faster.
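
For comparison, a rough sketch of how the same effect can be approximated in current Ruby without any new flag. It relies only on the existing `Regexp#match?` and on `Proc#===`; the names `comment`, `plain` and `classify` are made up for this illustration and are not part of the proposal or the PR.

```ruby
# Lambdas respond to ===, so they can sit in a `when` clause.
# Regexp#match? tests the pattern without creating a MatchData.
comment = ->(line) { line.match?(/^#/) }
plain   = ->(line) { line.match?(/yyy/) }

classify = lambda do |line|
  case line
  when comment  then :comment        # no back reference needed
  when /^(\d+)/ then [:number, $1]   # Regexp#=== sets $~, so $1 works
  when /xxx/    then [:xxx, $&]      # $& also comes from $~
  when plain    then :yyy            # no back reference needed
  else               :other
  end
end

p classify.call("# a comment")
p classify.call("12 apples")
p classify.call("ayyyb")
```

This is more verbose than a literal flag would be, which is part of the motivation given above.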

Some benchmarks (https://bugs.ruby-lang.org/issues/17030#note-9) show a `2.5x` to `5x` speedup.

PR: https://github.com/ruby/ruby/pull/3455



-- 
https://bugs.ruby-lang.org/


* [ruby-core:100240] [Ruby master Feature#17206] Introduce new Regexp option to avoid MatchData allocation
  2020-09-30 15:42 [ruby-core:100239] [Ruby master Feature#17206] Introduce new Regexp option to avoid MatchData allocation fatkodima123
@ 2020-09-30 15:58 ` zn
  2020-09-30 16:22 ` [ruby-core:100241] " fatkodima123
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: zn @ 2020-09-30 15:58 UTC (permalink / raw)
  To: ruby-core

Issue #17206 has been updated by znz (Kazuhiro NISHIYAMA).


What does `regexp_without_matchdata.match(string)` return when matched?

----------------------------------------
Feature #17206: Introduce new Regexp option to avoid MatchData allocation
https://bugs.ruby-lang.org/issues/17206#change-87826


* [ruby-core:100241] [Ruby master Feature#17206] Introduce new Regexp option to avoid MatchData allocation
  2020-09-30 15:42 [ruby-core:100239] [Ruby master Feature#17206] Introduce new Regexp option to avoid MatchData allocation fatkodima123
  2020-09-30 15:58 ` [ruby-core:100240] " zn
@ 2020-09-30 16:22 ` fatkodima123
  2020-09-30 18:18 ` [ruby-core:100242] [Ruby master Feature#17206] Introduce new Regexp option to avoid global MatchData allocations eregontp
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: fatkodima123 @ 2020-09-30 16:22 UTC (permalink / raw)
  To: ruby-core

Issue #17206 has been updated by fatkodima (Dima Fatko).


znz (Kazuhiro NISHIYAMA) wrote in #note-1:
> What does `regexp_without_matchdata.match(string)` return when matched?

That's what the `when not explicitly needed by the method` part was about: it returns `MatchData` in this case, as requested.
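
As a point of reference, this is how the two existing methods differ in what they return today. The sketch uses only current `Regexp#match` and `Regexp#match?`; according to the reply above, the proposed flag would leave the result of `match` unchanged, but that is not something this snippet can demonstrate.

```ruby
re  = /(\d+)-(\d+)/
str = "range 10-20"

m       = re.match(str)        # a MatchData on success
missing = re.match("no digits") # nil when nothing matches
found   = re.match?(str)       # only true or false, never a MatchData

puts m.class
puts m[1], m[2]
puts missing.inspect
puts found
```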

----------------------------------------
Feature #17206: Introduce new Regexp option to avoid MatchData allocation
https://bugs.ruby-lang.org/issues/17206#change-87827


* [ruby-core:100242] [Ruby master Feature#17206] Introduce new Regexp option to avoid global MatchData allocations
  2020-09-30 15:42 [ruby-core:100239] [Ruby master Feature#17206] Introduce new Regexp option to avoid MatchData allocation fatkodima123
  2020-09-30 15:58 ` [ruby-core:100240] " zn
  2020-09-30 16:22 ` [ruby-core:100241] " fatkodima123
@ 2020-09-30 18:18 ` eregontp
  2020-10-24  1:34 ` [ruby-core:100519] " scivola20
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: eregontp @ 2020-09-30 18:18 UTC (permalink / raw)
  To: ruby-core

Issue #17206 has been updated by Eregon (Benoit Daloze).


IMHO hardcoding such knowledge in the pattern feels wrong (vs in the matching method like `Regexp#match?` which is fine).
It seems to me that it could cause confusing bugs, e.g. when using `/f` in the `case` above if a `when` clause starts to use one of the `$~`-derived variables.
Then it would unexpectedly always be `nil`, causing a potentially very subtle bug.
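
A small sketch of the failure mode described here. The `/f` flag does not exist in Ruby, so `match?` stands in for it as an assumption: it tests the pattern and leaves `$~` alone. Both method names are invented for the example.

```ruby
# $~ is local to each method call, so both methods start with $~ == nil.

def leading_number_with_backref(line)
  # =~ sets $~, so $1 holds the first capture group.
  line =~ /^(\d+)/ ? $1 : nil
end

def leading_number_without_backref(line)
  # match? does not set $~, so $1 is still nil even though the pattern matched.
  line.match?(/^(\d+)/) ? $1 : nil
end

p leading_number_with_backref("42 apples")
p leading_number_without_backref("42 apples")
```

The second method takes the "matched" branch and still yields no capture, which is the kind of silent change in behavior the concern is about.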

I have a hard time believing that allocating the MatchData is so expensive.
If that's the case, then there must be a lot of optimization potential for faster allocation of MatchData in CRuby.
I think it is more likely due to having to set `$~` in the caller, and maybe to computing group offsets.

I think it would be worth investigating in more detail where the performance overhead from `$~` and friends comes from in CRuby.
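
One cheap first step for such an investigation is to count allocations rather than time. The sketch below assumes CRuby, where `GC.stat(:total_allocated_objects)` is available; the helper name `allocated_objects` is made up. It says nothing about where the time goes, only how many objects each call style creates.

```ruby
# Number of objects allocated while the block runs (CRuby only).
def allocated_objects
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

re   = /o/
word = "foo"
n    = 10_000

with_backref    = allocated_objects { n.times { re =~ word } }
without_backref = allocated_objects { n.times { re.match?(word) } }

puts "=~     allocated #{with_backref} objects"
puts "match? allocated #{without_backref} objects"
```

Exact counts depend on the Ruby version, so treat the numbers as a starting point for profiling rather than as a result.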

----------------------------------------
Feature #17206: Introduce new Regexp option to avoid global MatchData allocations
https://bugs.ruby-lang.org/issues/17206#change-87829


* [ruby-core:100519] [Ruby master Feature#17206] Introduce new Regexp option to avoid global MatchData allocations
  2020-09-30 15:42 [ruby-core:100239] [Ruby master Feature#17206] Introduce new Regexp option to avoid MatchData allocation fatkodima123
                   ` (2 preceding siblings ...)
  2020-09-30 18:18 ` [ruby-core:100242] [Ruby master Feature#17206] Introduce new Regexp option to avoid global MatchData allocations eregontp
@ 2020-10-24  1:34 ` scivola20
  2020-10-24 14:30 ` [ruby-core:100523] " eregontp
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 8+ messages in thread
From: scivola20 @ 2020-10-24  1:34 UTC (permalink / raw)
  To: ruby-core

Issue #17206 has been updated by scivola20 (sciv ola).


I believe that people who can use `match?` and `match` methods properly, can use this new Regexp option properly.

By the way, the total size of ``$` ``, `$&`, `$'` equals the size of the target string. Therefore a huge amount of String garbage will be generated if the text is very large.



----------------------------------------
Feature #17206: Introduce new Regexp option to avoid global MatchData allocations
https://bugs.ruby-lang.org/issues/17206#change-88142


* [ruby-core:100523] [Ruby master Feature#17206] Introduce new Regexp option to avoid global MatchData allocations
  2020-09-30 15:42 [ruby-core:100239] [Ruby master Feature#17206] Introduce new Regexp option to avoid MatchData allocation fatkodima123
                   ` (3 preceding siblings ...)
  2020-10-24  1:34 ` [ruby-core:100519] " scivola20
@ 2020-10-24 14:30 ` eregontp
  2020-10-24 14:51 ` [ruby-core:100524] " eregontp
  2020-10-28 22:43 ` [ruby-core:100626] " scivola20
  6 siblings, 0 replies; 8+ messages in thread
From: eregontp @ 2020-10-24 14:30 UTC (permalink / raw)
  To: ruby-core

Issue #17206 has been updated by Eregon (Benoit Daloze).


scivola20 (sciv ola) wrote in #note-5:
> I believe that people who can use `match?` and `match` methods properly, can use this new Regexp option properly.

I disagree: `match?` is clear, but I think `=~` suddenly not setting `$~` would be a frequent source of bugs.

> By the way, the total size of ``$` ``, `$&`, `$'` equals the size of the target string. Therefore a huge amount of String garbage will be generated if the text is very large.

They are all based on `$~`, aren't they?
I think they only need a copy-on-write copy of the source string (to avoid later mutations affecting them) plus the matched offsets.
At least that's what happens in TruffleRuby.
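
The relationship can be checked from plain Ruby. This sketch uses only documented `MatchData` methods (`pre_match`, `post_match`, `begin`, `end`); `match_parts` is a name invented here. It shows what the variables contain, not how any implementation stores them internally.

```ruby
def match_parts(re, str)
  return nil unless re =~ str
  m = $~
  {
    pre: $`, match: $&, post: $',
    # The special variables agree with the MatchData accessors.
    same_as_matchdata: [$`, $&, $'] == [m.pre_match, m[0], m.post_match],
    # The three pieces together cover the whole target string.
    total_size: $`.size + $&.size + $'.size,
    # Offsets of the whole match within the target string.
    offsets: [m.begin(0), m.end(0)],
  }
end

p match_parts(/o+/, "foo bar")
```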


----------------------------------------
Feature #17206: Introduce new Regexp option to avoid global MatchData allocations
https://bugs.ruby-lang.org/issues/17206#change-88145


* [ruby-core:100524] [Ruby master Feature#17206] Introduce new Regexp option to avoid global MatchData allocations
  2020-09-30 15:42 [ruby-core:100239] [Ruby master Feature#17206] Introduce new Regexp option to avoid MatchData allocation fatkodima123
                   ` (4 preceding siblings ...)
  2020-10-24 14:30 ` [ruby-core:100523] " eregontp
@ 2020-10-24 14:51 ` eregontp
  2020-10-28 22:43 ` [ruby-core:100626] " scivola20
  6 siblings, 0 replies; 8+ messages in thread
From: eregontp @ 2020-10-24 14:51 UTC (permalink / raw)
  To: ruby-core

Issue #17206 has been updated by Eregon (Benoit Daloze).


I took a quick look; the logic to set `$~` is here:
https://github.com/ruby/ruby/blob/148961adcd0704d964fce920330a6301b9704c25/re.c#L1608-L1623

It does not seem so expensive, but the region is allocated with `xmalloc()`, which is probably not so cheap (there is also an `rb_gc()` call in there; hopefully it's not hit in practice).
`rb_backref_set()` goes through a few indirections (it typically needs to reach the caller frame), but it does not seem too expensive either.
I think it would be valuable to investigate further what is actually expensive about setting `$~` and how that can be optimized.

A hacky Regexp flag to manually optimize `match`/`=~`/`===` calls doesn't seem like a good approach to me.
The calling code knows whether it needs `$~` and friends, not the Regexp literal.

----------------------------------------
Feature #17206: Introduce new Regexp option to avoid global MatchData allocations
https://bugs.ruby-lang.org/issues/17206#change-88146


* [ruby-core:100626] [Ruby master Feature#17206] Introduce new Regexp option to avoid global MatchData allocations
  2020-09-30 15:42 [ruby-core:100239] [Ruby master Feature#17206] Introduce new Regexp option to avoid MatchData allocation fatkodima123
                   ` (5 preceding siblings ...)
  2020-10-24 14:51 ` [ruby-core:100524] " eregontp
@ 2020-10-28 22:43 ` scivola20
  6 siblings, 0 replies; 8+ messages in thread
From: scivola20 @ 2020-10-28 22:43 UTC (permalink / raw)
  To: ruby-core

Issue #17206 has been updated by scivola20 (sciv ola).


Sorry, “a huge amount of String garbage” was my misunderstanding.

But I don’t know in what situation this option may cause a bug.


----------------------------------------
Feature #17206: Introduce new Regexp option to avoid global MatchData allocations
https://bugs.ruby-lang.org/issues/17206#change-88261


end of thread, other threads:[~2020-10-28 22:43 UTC | newest]
