* [ruby-core:61706] [ruby-trunk - Bug #9680] [Open] String#sub and siblings should not use regex when String pattern is passed
[not found] <redmine.issue-9680.20140327000948@ruby-lang.org>
@ 2014-03-27 0:09 ` sam.rawlins
2014-03-27 1:23 ` [ruby-core:61707] " Eric Wong
2014-03-27 1:28 ` [ruby-core:61708] [ruby-trunk - Bug #9680] " normalperson
` (6 subsequent siblings)
7 siblings, 1 reply; 9+ messages in thread
From: sam.rawlins @ 2014-03-27 0:09 UTC (permalink / raw
To: ruby-core
Issue #9680 has been reported by Sam Rawlins.
----------------------------------------
Bug #9680: String#sub and siblings should not use regex when String pattern is passed
https://bugs.ruby-lang.org/issues/9680
* Author: Sam Rawlins
* Status: Open
* Priority: Normal
* Assignee:
* Category:
* Target version:
* ruby -v: trunk
* Backport: 2.0.0: UNKNOWN, 2.1: UNKNOWN
----------------------------------------
Currently `String#sub`, `#sub!`, `#gsub, and `#gsub!` all accept a String pattern, but immediately create a Regexp from it, and use the regex engine to search for the pattern. This is not performant. For example, `"123:456".gsub(":", "_")` creates the following objects, most of which are immediately up for GC:
* dup of the original String
* result String
* 2x `":"<US-ASCII>`
* 2x `":"<ASCII-8BIT>`
* Regexp from pattern: `/:/`
* `#<MatchData ":">`
* `#<MatchData nil>`
I have a solution which is not too complicated, at https://github.com/ruby/ruby/pull/579 and attached. Calls to `rb_reg_search()` are replaced with calls to a new function, `rb_pat_search()`, which conditionally calls `rb_reg_search()` or `rb_str_index()`, depending on whether the pattern is a String. Calculating the substring that needs to be replaced is also different when the pattern is a String.
Runtime of each method is dramatically reduced:
require 'benchmark'
n = 4_000_000
Benchmark.bm(7) do |bm|
str1 = "123:456"; str2 = "123_456";
colon = ":"; underscore = "_"
# each benchmark runs the substring method twice so that the bang methods can
# perform the same number of substitutions to str1 each go around.
bm.report("sub") { n.times { str1.sub(colon, underscore); str2.sub(underscore, colon) } }
bm.report("sub!") { n.times { str1.sub!(colon, underscore); str1.sub!(underscore, colon) } }
bm.report("gsub") { n.times { str1.gsub(colon, underscore); str2.gsub(underscore, colon) } }
bm.report("gsub!") { n.times { str1.gsub!(colon, underscore); str1.gsub!(underscore, colon) } }
end
# trunk
user system total real
sub 40.450000 0.580000 41.030000 ( 41.209658)
sub! 39.780000 0.580000 40.360000 ( 40.656789)
gsub 58.500000 0.820000 59.320000 ( 59.603923)
gsub! 59.400000 0.770000 60.170000 ( 60.435687)
# this patch
user system total real
sub 3.060000 0.010000 3.070000 ( 3.091920)
sub! 2.380000 0.010000 2.390000 ( 2.390769)
gsub 7.130000 0.130000 7.260000 ( 7.299139)
gsub! 7.660000 0.150000 7.810000 ( 7.846190)
When using a String pattern, runtime is reduced by 87% to 94%.
There is only one incompatibility that I am aware of: `$&` will not be set after using a sub method with a String pattern. (Subgroups (`$1`, ...) will not be available either, but weren't before, since String patterns are escaped before being used.)
In the future, only 3 more methods use the function, `get_pat()`, that creates a Regexp from the String pattern: `#split`, `#scan`, and `#match`. I think this fix could be applied to these as well.
---Files--------------------------------
ruby-579.diff (5.12 KB)
--
https://bugs.ruby-lang.org/
^ permalink raw reply [flat|nested] 9+ messages in thread
* [ruby-core:61707] Re: [ruby-trunk - Bug #9680] [Open] String#sub and siblings should not use regex when String pattern is passed
2014-03-27 0:09 ` [ruby-core:61706] [ruby-trunk - Bug #9680] [Open] String#sub and siblings should not use regex when String pattern is passed sam.rawlins
@ 2014-03-27 1:23 ` Eric Wong
0 siblings, 0 replies; 9+ messages in thread
From: Eric Wong @ 2014-03-27 1:23 UTC (permalink / raw
To: Ruby developers
I had an idea to replace the current reg_cache with memoization for
converting string literals, but never got around to it. That would
also reduce garbage while preserving $& compatibility.
^ permalink raw reply [flat|nested] 9+ messages in thread
* [ruby-core:61708] [ruby-trunk - Bug #9680] String#sub and siblings should not use regex when String pattern is passed
[not found] <redmine.issue-9680.20140327000948@ruby-lang.org>
2014-03-27 0:09 ` [ruby-core:61706] [ruby-trunk - Bug #9680] [Open] String#sub and siblings should not use regex when String pattern is passed sam.rawlins
@ 2014-03-27 1:28 ` normalperson
2014-03-27 2:02 ` [ruby-core:61710] " sam.rawlins
` (5 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: normalperson @ 2014-03-27 1:28 UTC (permalink / raw
To: ruby-core
Issue #9680 has been updated by Eric Wong.
I had an idea to replace the current reg_cache with memoization for
converting string literals, but never got around to it. That would
also reduce garbage while preserving $& compatibility.
----------------------------------------
Bug #9680: String#sub and siblings should not use regex when String pattern is passed
https://bugs.ruby-lang.org/issues/9680#change-45953
* Author: Sam Rawlins
* Status: Open
* Priority: Normal
* Assignee:
* Category:
* Target version:
* ruby -v: trunk
* Backport: 2.0.0: UNKNOWN, 2.1: UNKNOWN
----------------------------------------
Currently `String#sub`, `#sub!`, `#gsub, and `#gsub!` all accept a String pattern, but immediately create a Regexp from it, and use the regex engine to search for the pattern. This is not performant. For example, `"123:456".gsub(":", "_")` creates the following objects, most of which are immediately up for GC:
* dup of the original String
* result String
* 2x `":"<US-ASCII>`
* 2x `":"<ASCII-8BIT>`
* Regexp from pattern: `/:/`
* `#<MatchData ":">`
* `#<MatchData nil>`
I have a solution which is not too complicated, at https://github.com/ruby/ruby/pull/579 and attached. Calls to `rb_reg_search()` are replaced with calls to a new function, `rb_pat_search()`, which conditionally calls `rb_reg_search()` or `rb_str_index()`, depending on whether the pattern is a String. Calculating the substring that needs to be replaced is also different when the pattern is a String.
Runtime of each method is dramatically reduced:
require 'benchmark'
n = 4_000_000
Benchmark.bm(7) do |bm|
str1 = "123:456"; str2 = "123_456";
colon = ":"; underscore = "_"
# each benchmark runs the substring method twice so that the bang methods can
# perform the same number of substitutions to str1 each go around.
bm.report("sub") { n.times { str1.sub(colon, underscore); str2.sub(underscore, colon) } }
bm.report("sub!") { n.times { str1.sub!(colon, underscore); str1.sub!(underscore, colon) } }
bm.report("gsub") { n.times { str1.gsub(colon, underscore); str2.gsub(underscore, colon) } }
bm.report("gsub!") { n.times { str1.gsub!(colon, underscore); str1.gsub!(underscore, colon) } }
end
# trunk
user system total real
sub 40.450000 0.580000 41.030000 ( 41.209658)
sub! 39.780000 0.580000 40.360000 ( 40.656789)
gsub 58.500000 0.820000 59.320000 ( 59.603923)
gsub! 59.400000 0.770000 60.170000 ( 60.435687)
# this patch
user system total real
sub 3.060000 0.010000 3.070000 ( 3.091920)
sub! 2.380000 0.010000 2.390000 ( 2.390769)
gsub 7.130000 0.130000 7.260000 ( 7.299139)
gsub! 7.660000 0.150000 7.810000 ( 7.846190)
When using a String pattern, runtime is reduced by 87% to 94%.
There is only one incompatibility that I am aware of: `$&` will not be set after using a sub method with a String pattern. (Subgroups (`$1`, ...) will not be available either, but weren't before, since String patterns are escaped before being used.)
In the future, only 3 more methods use the function, `get_pat()`, that creates a Regexp from the String pattern: `#split`, `#scan`, and `#match`. I think this fix could be applied to these as well.
---Files--------------------------------
ruby-579.diff (5.12 KB)
--
https://bugs.ruby-lang.org/
^ permalink raw reply [flat|nested] 9+ messages in thread
* [ruby-core:61710] [ruby-trunk - Bug #9680] String#sub and siblings should not use regex when String pattern is passed
[not found] <redmine.issue-9680.20140327000948@ruby-lang.org>
2014-03-27 0:09 ` [ruby-core:61706] [ruby-trunk - Bug #9680] [Open] String#sub and siblings should not use regex when String pattern is passed sam.rawlins
2014-03-27 1:28 ` [ruby-core:61708] [ruby-trunk - Bug #9680] " normalperson
@ 2014-03-27 2:02 ` sam.rawlins
2014-03-27 16:17 ` [ruby-core:61725] " eregontp
` (4 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: sam.rawlins @ 2014-03-27 2:02 UTC (permalink / raw
To: ruby-core
Issue #9680 has been updated by Sam Rawlins.
I think the speedup in this patch comes almost entirely from skipping the regex engine, not from the GC savings.
Preserving `$&` (and `$~` and friends) while still not firing up the regex engine might be possible (constructing the basic backref, with no subgroups, by hand), but very very ugly code (an `RMatch` has an `RRegexp` and an `rmatch` which has a `re_registers`, etc). This might only a ~20 line function, but feels so dirty...
I think an improvement (or replacement) to `reg_cache` would also be welcome.
----------------------------------------
Bug #9680: String#sub and siblings should not use regex when String pattern is passed
https://bugs.ruby-lang.org/issues/9680#change-45955
* Author: Sam Rawlins
* Status: Open
* Priority: Normal
* Assignee:
* Category:
* Target version:
* ruby -v: trunk
* Backport: 2.0.0: UNKNOWN, 2.1: UNKNOWN
----------------------------------------
Currently `String#sub`, `#sub!`, `#gsub, and `#gsub!` all accept a String pattern, but immediately create a Regexp from it, and use the regex engine to search for the pattern. This is not performant. For example, `"123:456".gsub(":", "_")` creates the following objects, most of which are immediately up for GC:
* dup of the original String
* result String
* 2x `":"<US-ASCII>`
* 2x `":"<ASCII-8BIT>`
* Regexp from pattern: `/:/`
* `#<MatchData ":">`
* `#<MatchData nil>`
I have a solution which is not too complicated, at https://github.com/ruby/ruby/pull/579 and attached. Calls to `rb_reg_search()` are replaced with calls to a new function, `rb_pat_search()`, which conditionally calls `rb_reg_search()` or `rb_str_index()`, depending on whether the pattern is a String. Calculating the substring that needs to be replaced is also different when the pattern is a String.
Runtime of each method is dramatically reduced:
require 'benchmark'
n = 4_000_000
Benchmark.bm(7) do |bm|
str1 = "123:456"; str2 = "123_456";
colon = ":"; underscore = "_"
# each benchmark runs the substring method twice so that the bang methods can
# perform the same number of substitutions to str1 each go around.
bm.report("sub") { n.times { str1.sub(colon, underscore); str2.sub(underscore, colon) } }
bm.report("sub!") { n.times { str1.sub!(colon, underscore); str1.sub!(underscore, colon) } }
bm.report("gsub") { n.times { str1.gsub(colon, underscore); str2.gsub(underscore, colon) } }
bm.report("gsub!") { n.times { str1.gsub!(colon, underscore); str1.gsub!(underscore, colon) } }
end
# trunk
user system total real
sub 40.450000 0.580000 41.030000 ( 41.209658)
sub! 39.780000 0.580000 40.360000 ( 40.656789)
gsub 58.500000 0.820000 59.320000 ( 59.603923)
gsub! 59.400000 0.770000 60.170000 ( 60.435687)
# this patch
user system total real
sub 3.060000 0.010000 3.070000 ( 3.091920)
sub! 2.380000 0.010000 2.390000 ( 2.390769)
gsub 7.130000 0.130000 7.260000 ( 7.299139)
gsub! 7.660000 0.150000 7.810000 ( 7.846190)
When using a String pattern, runtime is reduced by 87% to 94%.
There is only one incompatibility that I am aware of: `$&` will not be set after using a sub method with a String pattern. (Subgroups (`$1`, ...) will not be available either, but weren't before, since String patterns are escaped before being used.)
In the future, only 3 more methods use the function, `get_pat()`, that creates a Regexp from the String pattern: `#split`, `#scan`, and `#match`. I think this fix could be applied to these as well.
---Files--------------------------------
ruby-579.diff (5.12 KB)
--
https://bugs.ruby-lang.org/
^ permalink raw reply [flat|nested] 9+ messages in thread
* [ruby-core:61725] [ruby-trunk - Bug #9680] String#sub and siblings should not use regex when String pattern is passed
[not found] <redmine.issue-9680.20140327000948@ruby-lang.org>
` (2 preceding siblings ...)
2014-03-27 2:02 ` [ruby-core:61710] " sam.rawlins
@ 2014-03-27 16:17 ` eregontp
2014-03-27 19:03 ` [ruby-core:61728] " sam.rawlins
` (3 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: eregontp @ 2014-03-27 16:17 UTC (permalink / raw
To: ruby-core
Issue #9680 has been updated by Benoit Daloze.
It would be interesting to run the benchmark on a more realistic example. One should use String#tr or String#tr! if it is only to replace a single character.
----------------------------------------
Bug #9680: String#sub and siblings should not use regex when String pattern is passed
https://bugs.ruby-lang.org/issues/9680#change-45965
* Author: Sam Rawlins
* Status: Open
* Priority: Normal
* Assignee:
* Category:
* Target version:
* ruby -v: trunk
* Backport: 2.0.0: UNKNOWN, 2.1: UNKNOWN
----------------------------------------
Currently `String#sub`, `#sub!`, `#gsub, and `#gsub!` all accept a String pattern, but immediately create a Regexp from it, and use the regex engine to search for the pattern. This is not performant. For example, `"123:456".gsub(":", "_")` creates the following objects, most of which are immediately up for GC:
* dup of the original String
* result String
* 2x `":"<US-ASCII>`
* 2x `":"<ASCII-8BIT>`
* Regexp from pattern: `/:/`
* `#<MatchData ":">`
* `#<MatchData nil>`
I have a solution which is not too complicated, at https://github.com/ruby/ruby/pull/579 and attached. Calls to `rb_reg_search()` are replaced with calls to a new function, `rb_pat_search()`, which conditionally calls `rb_reg_search()` or `rb_str_index()`, depending on whether the pattern is a String. Calculating the substring that needs to be replaced is also different when the pattern is a String.
Runtime of each method is dramatically reduced:
require 'benchmark'
n = 4_000_000
Benchmark.bm(7) do |bm|
str1 = "123:456"; str2 = "123_456";
colon = ":"; underscore = "_"
# each benchmark runs the substring method twice so that the bang methods can
# perform the same number of substitutions to str1 each go around.
bm.report("sub") { n.times { str1.sub(colon, underscore); str2.sub(underscore, colon) } }
bm.report("sub!") { n.times { str1.sub!(colon, underscore); str1.sub!(underscore, colon) } }
bm.report("gsub") { n.times { str1.gsub(colon, underscore); str2.gsub(underscore, colon) } }
bm.report("gsub!") { n.times { str1.gsub!(colon, underscore); str1.gsub!(underscore, colon) } }
end
# trunk
user system total real
sub 40.450000 0.580000 41.030000 ( 41.209658)
sub! 39.780000 0.580000 40.360000 ( 40.656789)
gsub 58.500000 0.820000 59.320000 ( 59.603923)
gsub! 59.400000 0.770000 60.170000 ( 60.435687)
# this patch
user system total real
sub 3.060000 0.010000 3.070000 ( 3.091920)
sub! 2.380000 0.010000 2.390000 ( 2.390769)
gsub 7.130000 0.130000 7.260000 ( 7.299139)
gsub! 7.660000 0.150000 7.810000 ( 7.846190)
When using a String pattern, runtime is reduced by 87% to 94%.
There is only one incompatibility that I am aware of: `$&` will not be set after using a sub method with a String pattern. (Subgroups (`$1`, ...) will not be available either, but weren't before, since String patterns are escaped before being used.)
In the future, only 3 more methods use the function, `get_pat()`, that creates a Regexp from the String pattern: `#split`, `#scan`, and `#match`. I think this fix could be applied to these as well.
---Files--------------------------------
ruby-579.diff (5.12 KB)
--
https://bugs.ruby-lang.org/
^ permalink raw reply [flat|nested] 9+ messages in thread
* [ruby-core:61728] [ruby-trunk - Bug #9680] String#sub and siblings should not use regex when String pattern is passed
[not found] <redmine.issue-9680.20140327000948@ruby-lang.org>
` (3 preceding siblings ...)
2014-03-27 16:17 ` [ruby-core:61725] " eregontp
@ 2014-03-27 19:03 ` sam.rawlins
2014-03-27 22:42 ` [ruby-core:61733] [ruby-trunk - Bug #9680] [Closed] " nobu
` (2 subsequent siblings)
7 siblings, 0 replies; 9+ messages in thread
From: sam.rawlins @ 2014-03-27 19:03 UTC (permalink / raw
To: ruby-core
Issue #9680 has been updated by Sam Rawlins.
Good point, Benoit! This is actually why I started looking into #gsub in the first place. I benchmarked ActiveSupport::Inflector [1], which does operations like `gsub!('/', '::')` and `gsub('::', '/')`. Here are the benchmarks, before and after Nobu's patch, based on my patch:
BEFORE
user system total real
#underscore 24.000000 0.160000 24.160000 ( 24.254921)
#camelize 3.040000 0.010000 3.050000 ( 3.060907)
AFTER
user system total real
#underscore 23.690000 0.160000 23.850000 ( 24.012497)
#camelize 2.680000 0.010000 2.690000 ( 2.706418)
So #underscore is 1% faster; #camelize is 11% faster.
I also benchmarked Psych with `YAML.dump(["one string", "two string"])`:
user system total real
YAML.dump BEFORE 11.380000 0.250000 11.630000 ( 11.680545)
YAML.dump AFTER 11.030000 0.240000 11.270000 ( 11.313147)
The patch seems to shave 3% off here. Take all of these benchmarks with a grain of salt :)
[1] https://github.com/rails/rails/blob/master/activesupport/lib/active_support/inflector/methods.rb
----------------------------------------
Bug #9680: String#sub and siblings should not use regex when String pattern is passed
https://bugs.ruby-lang.org/issues/9680#change-45968
* Author: Sam Rawlins
* Status: Open
* Priority: Normal
* Assignee:
* Category:
* Target version:
* ruby -v: trunk
* Backport: 2.0.0: UNKNOWN, 2.1: UNKNOWN
----------------------------------------
Currently `String#sub`, `#sub!`, `#gsub, and `#gsub!` all accept a String pattern, but immediately create a Regexp from it, and use the regex engine to search for the pattern. This is not performant. For example, `"123:456".gsub(":", "_")` creates the following objects, most of which are immediately up for GC:
* dup of the original String
* result String
* 2x `":"<US-ASCII>`
* 2x `":"<ASCII-8BIT>`
* Regexp from pattern: `/:/`
* `#<MatchData ":">`
* `#<MatchData nil>`
I have a solution which is not too complicated, at https://github.com/ruby/ruby/pull/579 and attached. Calls to `rb_reg_search()` are replaced with calls to a new function, `rb_pat_search()`, which conditionally calls `rb_reg_search()` or `rb_str_index()`, depending on whether the pattern is a String. Calculating the substring that needs to be replaced is also different when the pattern is a String.
Runtime of each method is dramatically reduced:
require 'benchmark'
n = 4_000_000
Benchmark.bm(7) do |bm|
str1 = "123:456"; str2 = "123_456";
colon = ":"; underscore = "_"
# each benchmark runs the substring method twice so that the bang methods can
# perform the same number of substitutions to str1 each go around.
bm.report("sub") { n.times { str1.sub(colon, underscore); str2.sub(underscore, colon) } }
bm.report("sub!") { n.times { str1.sub!(colon, underscore); str1.sub!(underscore, colon) } }
bm.report("gsub") { n.times { str1.gsub(colon, underscore); str2.gsub(underscore, colon) } }
bm.report("gsub!") { n.times { str1.gsub!(colon, underscore); str1.gsub!(underscore, colon) } }
end
# trunk
user system total real
sub 40.450000 0.580000 41.030000 ( 41.209658)
sub! 39.780000 0.580000 40.360000 ( 40.656789)
gsub 58.500000 0.820000 59.320000 ( 59.603923)
gsub! 59.400000 0.770000 60.170000 ( 60.435687)
# this patch
user system total real
sub 3.060000 0.010000 3.070000 ( 3.091920)
sub! 2.380000 0.010000 2.390000 ( 2.390769)
gsub 7.130000 0.130000 7.260000 ( 7.299139)
gsub! 7.660000 0.150000 7.810000 ( 7.846190)
When using a String pattern, runtime is reduced by 87% to 94%.
There is only one incompatibility that I am aware of: `$&` will not be set after using a sub method with a String pattern. (Subgroups (`$1`, ...) will not be available either, but weren't before, since String patterns are escaped before being used.)
In the future, only 3 more methods use the function, `get_pat()`, that creates a Regexp from the String pattern: `#split`, `#scan`, and `#match`. I think this fix could be applied to these as well.
---Files--------------------------------
ruby-579.diff (5.12 KB)
--
https://bugs.ruby-lang.org/
^ permalink raw reply [flat|nested] 9+ messages in thread
* [ruby-core:61733] [ruby-trunk - Bug #9680] [Closed] String#sub and siblings should not use regex when String pattern is passed
[not found] <redmine.issue-9680.20140327000948@ruby-lang.org>
` (4 preceding siblings ...)
2014-03-27 19:03 ` [ruby-core:61728] " sam.rawlins
@ 2014-03-27 22:42 ` nobu
2014-07-02 6:58 ` [ruby-core:63491] [ruby-trunk - Bug #9680] " usa
2014-08-08 3:25 ` [ruby-core:64260] " nagachika00
7 siblings, 0 replies; 9+ messages in thread
From: nobu @ 2014-03-27 22:42 UTC (permalink / raw
To: ruby-core
Issue #9680 has been updated by Nobuyoshi Nakada.
Status changed from Open to Closed
I missed to include this ticket reference in the commit log.
Inflector seems using other replacements with RE, so this improvement might not be significant much.
----------------------------------------
Bug #9680: String#sub and siblings should not use regex when String pattern is passed
https://bugs.ruby-lang.org/issues/9680#change-45973
* Author: Sam Rawlins
* Status: Closed
* Priority: Normal
* Assignee:
* Category:
* Target version:
* ruby -v: trunk
* Backport: 2.0.0: UNKNOWN, 2.1: UNKNOWN
----------------------------------------
Currently `String#sub`, `#sub!`, `#gsub, and `#gsub!` all accept a String pattern, but immediately create a Regexp from it, and use the regex engine to search for the pattern. This is not performant. For example, `"123:456".gsub(":", "_")` creates the following objects, most of which are immediately up for GC:
* dup of the original String
* result String
* 2x `":"<US-ASCII>`
* 2x `":"<ASCII-8BIT>`
* Regexp from pattern: `/:/`
* `#<MatchData ":">`
* `#<MatchData nil>`
I have a solution which is not too complicated, at https://github.com/ruby/ruby/pull/579 and attached. Calls to `rb_reg_search()` are replaced with calls to a new function, `rb_pat_search()`, which conditionally calls `rb_reg_search()` or `rb_str_index()`, depending on whether the pattern is a String. Calculating the substring that needs to be replaced is also different when the pattern is a String.
Runtime of each method is dramatically reduced:
require 'benchmark'
n = 4_000_000
Benchmark.bm(7) do |bm|
str1 = "123:456"; str2 = "123_456";
colon = ":"; underscore = "_"
# each benchmark runs the substring method twice so that the bang methods can
# perform the same number of substitutions to str1 each go around.
bm.report("sub") { n.times { str1.sub(colon, underscore); str2.sub(underscore, colon) } }
bm.report("sub!") { n.times { str1.sub!(colon, underscore); str1.sub!(underscore, colon) } }
bm.report("gsub") { n.times { str1.gsub(colon, underscore); str2.gsub(underscore, colon) } }
bm.report("gsub!") { n.times { str1.gsub!(colon, underscore); str1.gsub!(underscore, colon) } }
end
# trunk
user system total real
sub 40.450000 0.580000 41.030000 ( 41.209658)
sub! 39.780000 0.580000 40.360000 ( 40.656789)
gsub 58.500000 0.820000 59.320000 ( 59.603923)
gsub! 59.400000 0.770000 60.170000 ( 60.435687)
# this patch
user system total real
sub 3.060000 0.010000 3.070000 ( 3.091920)
sub! 2.380000 0.010000 2.390000 ( 2.390769)
gsub 7.130000 0.130000 7.260000 ( 7.299139)
gsub! 7.660000 0.150000 7.810000 ( 7.846190)
When using a String pattern, runtime is reduced by 87% to 94%.
There is only one incompatibility that I am aware of: `$&` will not be set after using a sub method with a String pattern. (Subgroups (`$1`, ...) will not be available either, but weren't before, since String patterns are escaped before being used.)
In the future, only 3 more methods use the function, `get_pat()`, that creates a Regexp from the String pattern: `#split`, `#scan`, and `#match`. I think this fix could be applied to these as well.
---Files--------------------------------
ruby-579.diff (5.12 KB)
--
https://bugs.ruby-lang.org/
^ permalink raw reply [flat|nested] 9+ messages in thread
* [ruby-core:63491] [ruby-trunk - Bug #9680] String#sub and siblings should not use regex when String pattern is passed
[not found] <redmine.issue-9680.20140327000948@ruby-lang.org>
` (5 preceding siblings ...)
2014-03-27 22:42 ` [ruby-core:61733] [ruby-trunk - Bug #9680] [Closed] " nobu
@ 2014-07-02 6:58 ` usa
2014-08-08 3:25 ` [ruby-core:64260] " nagachika00
7 siblings, 0 replies; 9+ messages in thread
From: usa @ 2014-07-02 6:58 UTC (permalink / raw
To: ruby-core
Issue #9680 has been updated by Usaku NAKAMURA.
Backport changed from 2.0.0: UNKNOWN, 2.1: UNKNOWN to 2.0.0: WONTFIX, 2.1: UNKNOWN
----------------------------------------
Bug #9680: String#sub and siblings should not use regex when String pattern is passed
https://bugs.ruby-lang.org/issues/9680#change-47534
* Author: Sam Rawlins
* Status: Closed
* Priority: Normal
* Assignee:
* Category:
* Target version:
* ruby -v: trunk
* Backport: 2.0.0: WONTFIX, 2.1: UNKNOWN
----------------------------------------
Currently `String#sub`, `#sub!`, `#gsub, and `#gsub!` all accept a String pattern, but immediately create a Regexp from it, and use the regex engine to search for the pattern. This is not performant. For example, `"123:456".gsub(":", "_")` creates the following objects, most of which are immediately up for GC:
* dup of the original String
* result String
* 2x `":"<US-ASCII>`
* 2x `":"<ASCII-8BIT>`
* Regexp from pattern: `/:/`
* `#<MatchData ":">`
* `#<MatchData nil>`
I have a solution which is not too complicated, at https://github.com/ruby/ruby/pull/579 and attached. Calls to `rb_reg_search()` are replaced with calls to a new function, `rb_pat_search()`, which conditionally calls `rb_reg_search()` or `rb_str_index()`, depending on whether the pattern is a String. Calculating the substring that needs to be replaced is also different when the pattern is a String.
Runtime of each method is dramatically reduced:
require 'benchmark'
n = 4_000_000
Benchmark.bm(7) do |bm|
str1 = "123:456"; str2 = "123_456";
colon = ":"; underscore = "_"
# each benchmark runs the substring method twice so that the bang methods can
# perform the same number of substitutions to str1 each go around.
bm.report("sub") { n.times { str1.sub(colon, underscore); str2.sub(underscore, colon) } }
bm.report("sub!") { n.times { str1.sub!(colon, underscore); str1.sub!(underscore, colon) } }
bm.report("gsub") { n.times { str1.gsub(colon, underscore); str2.gsub(underscore, colon) } }
bm.report("gsub!") { n.times { str1.gsub!(colon, underscore); str1.gsub!(underscore, colon) } }
end
# trunk
user system total real
sub 40.450000 0.580000 41.030000 ( 41.209658)
sub! 39.780000 0.580000 40.360000 ( 40.656789)
gsub 58.500000 0.820000 59.320000 ( 59.603923)
gsub! 59.400000 0.770000 60.170000 ( 60.435687)
# this patch
user system total real
sub 3.060000 0.010000 3.070000 ( 3.091920)
sub! 2.380000 0.010000 2.390000 ( 2.390769)
gsub 7.130000 0.130000 7.260000 ( 7.299139)
gsub! 7.660000 0.150000 7.810000 ( 7.846190)
When using a String pattern, runtime is reduced by 87% to 94%.
There is only one incompatibility that I am aware of: `$&` will not be set after using a sub method with a String pattern. (Subgroups (`$1`, ...) will not be available either, but weren't before, since String patterns are escaped before being used.)
In the future, only 3 more methods use the function, `get_pat()`, that creates a Regexp from the String pattern: `#split`, `#scan`, and `#match`. I think this fix could be applied to these as well.
---Files--------------------------------
ruby-579.diff (5.12 KB)
--
https://bugs.ruby-lang.org/
^ permalink raw reply [flat|nested] 9+ messages in thread
* [ruby-core:64260] [ruby-trunk - Bug #9680] String#sub and siblings should not use regex when String pattern is passed
[not found] <redmine.issue-9680.20140327000948@ruby-lang.org>
` (6 preceding siblings ...)
2014-07-02 6:58 ` [ruby-core:63491] [ruby-trunk - Bug #9680] " usa
@ 2014-08-08 3:25 ` nagachika00
7 siblings, 0 replies; 9+ messages in thread
From: nagachika00 @ 2014-08-08 3:25 UTC (permalink / raw
To: ruby-core
Issue #9680 has been updated by Tomoyuki Chikanaga.
Backport changed from 2.0.0: WONTFIX, 2.1: UNKNOWN to 2.0.0: WONTFIX, 2.1: WONTFIX
----------------------------------------
Bug #9680: String#sub and siblings should not use regex when String pattern is passed
https://bugs.ruby-lang.org/issues/9680#change-48248
* Author: Sam Rawlins
* Status: Closed
* Priority: Normal
* Assignee:
* Category:
* Target version:
* ruby -v: trunk
* Backport: 2.0.0: WONTFIX, 2.1: WONTFIX
----------------------------------------
Currently `String#sub`, `#sub!`, `#gsub, and `#gsub!` all accept a String pattern, but immediately create a Regexp from it, and use the regex engine to search for the pattern. This is not performant. For example, `"123:456".gsub(":", "_")` creates the following objects, most of which are immediately up for GC:
* dup of the original String
* result String
* 2x `":"<US-ASCII>`
* 2x `":"<ASCII-8BIT>`
* Regexp from pattern: `/:/`
* `#<MatchData ":">`
* `#<MatchData nil>`
I have a solution which is not too complicated, at https://github.com/ruby/ruby/pull/579 and attached. Calls to `rb_reg_search()` are replaced with calls to a new function, `rb_pat_search()`, which conditionally calls `rb_reg_search()` or `rb_str_index()`, depending on whether the pattern is a String. Calculating the substring that needs to be replaced is also different when the pattern is a String.
Runtime of each method is dramatically reduced:
require 'benchmark'
n = 4_000_000
Benchmark.bm(7) do |bm|
str1 = "123:456"; str2 = "123_456";
colon = ":"; underscore = "_"
# each benchmark runs the substring method twice so that the bang methods can
# perform the same number of substitutions to str1 each go around.
bm.report("sub") { n.times { str1.sub(colon, underscore); str2.sub(underscore, colon) } }
bm.report("sub!") { n.times { str1.sub!(colon, underscore); str1.sub!(underscore, colon) } }
bm.report("gsub") { n.times { str1.gsub(colon, underscore); str2.gsub(underscore, colon) } }
bm.report("gsub!") { n.times { str1.gsub!(colon, underscore); str1.gsub!(underscore, colon) } }
end
# trunk
user system total real
sub 40.450000 0.580000 41.030000 ( 41.209658)
sub! 39.780000 0.580000 40.360000 ( 40.656789)
gsub 58.500000 0.820000 59.320000 ( 59.603923)
gsub! 59.400000 0.770000 60.170000 ( 60.435687)
# this patch
user system total real
sub 3.060000 0.010000 3.070000 ( 3.091920)
sub! 2.380000 0.010000 2.390000 ( 2.390769)
gsub 7.130000 0.130000 7.260000 ( 7.299139)
gsub! 7.660000 0.150000 7.810000 ( 7.846190)
When using a String pattern, runtime is reduced by 87% to 94%.
There is only one incompatibility that I am aware of: `$&` will not be set after using a sub method with a String pattern. (Subgroups (`$1`, ...) will not be available either, but weren't before, since String patterns are escaped before being used.)
In the future, only 3 more methods use the function, `get_pat()`, that creates a Regexp from the String pattern: `#split`, `#scan`, and `#match`. I think this fix could be applied to these as well.
---Files--------------------------------
ruby-579.diff (5.12 KB)
--
https://bugs.ruby-lang.org/
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2014-08-08 3:48 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <redmine.issue-9680.20140327000948@ruby-lang.org>
2014-03-27 0:09 ` [ruby-core:61706] [ruby-trunk - Bug #9680] [Open] String#sub and siblings should not use regex when String pattern is passed sam.rawlins
2014-03-27 1:23 ` [ruby-core:61707] " Eric Wong
2014-03-27 1:28 ` [ruby-core:61708] [ruby-trunk - Bug #9680] " normalperson
2014-03-27 2:02 ` [ruby-core:61710] " sam.rawlins
2014-03-27 16:17 ` [ruby-core:61725] " eregontp
2014-03-27 19:03 ` [ruby-core:61728] " sam.rawlins
2014-03-27 22:42 ` [ruby-core:61733] [ruby-trunk - Bug #9680] [Closed] " nobu
2014-07-02 6:58 ` [ruby-core:63491] [ruby-trunk - Bug #9680] " usa
2014-08-08 3:25 ` [ruby-core:64260] " nagachika00
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).