ruby-core@ruby-lang.org archive (unofficial mirror)
 help / color / mirror / Atom feed
* [ruby-core:106883] [Ruby master Bug#18447] Potential performance regression with String#lines in large strings
@ 2021-12-28 23:26 ttilberg (Tim Tilberg)
  2021-12-29  5:36 ` [ruby-core:106896] " mame (Yusuke Endoh)
                   ` (3 more replies)
  0 siblings, 4 replies; 5+ messages in thread
From: ttilberg (Tim Tilberg) @ 2021-12-28 23:26 UTC (permalink / raw
  To: ruby-core

Issue #18447 has been reported by ttilberg (Tim Tilberg).

----------------------------------------
Bug #18447: Potential performance regression with String#lines in large strings
https://bugs.ruby-lang.org/issues/18447

* Author: ttilberg (Tim Tilberg)
* Status: Open
* Priority: Normal
* ruby -v: ruby 3.1.0p0 (2021-12-25 revision fb4df44d16) [x86_64-darwin20]
* Backport: 2.6: UNKNOWN, 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN
----------------------------------------
I believe there may be a potential performance regression regarding `String#lines` worth noting between 3.0.3 and 3.1.0. This came about in [this discussion](https://www.reddit.com/r/ruby/comments/rpje0g/fast_to_way_to_parse_csv/hqadqsd/) regarding large file parsing performance. We were benchmarking various ways to parse a 10 million row CSV file. Slurping the file took significantly longer than streaming in version 3.1.0, even though the data was able to fit in memory. After further research, we started to feel that there may be something to speak up about here, and I think it's pinned down to `String#lines`.

I'm running Mac OS Big Sur 11.6.1 on a 13" 2020 MBP with 32 GB ram. Specific Ruby versions are included in the comparison examples below.

The simplest reproduction seems to be:

Ruby 3.0.3: ~1.5 seconds

```
▶ time ruby -ve '("\n" * 10_000_000).lines'
ruby 3.0.3p157 (2021-11-24 revision 3fb7d2cadc) [x86_64-darwin20]
ruby -ve '("\n" * 10_000_000).lines'  1.38s user 0.39s system 100% cpu 1.756 total
```

Ruby 3.1.0: ~11.5 seconds

```
▶ time ruby -ve '("\n" * 10_000_000).lines'
ruby 3.1.0p0 (2021-12-25 revision fb4df44d16) [x86_64-darwin20]
ruby -ve '("\n" * 10_000_000).lines'  1.52s user 10.01s system 99% cpu 11.579 total
```

Some other observations:

An earlier script that I ran ruby-prof against looked like:

```
puts File.read("sample-data.csv").lines.sum { 1 }
```

- It appeared that the time increase stemmed from `String#lines`, as all other methods had similar time taken between the versions:

```
 Ruby 3.1.0
  %self      total      self      wait     child     calls  name
 89.93     12.728    12.728     0.000     0.000        1   String#lines
  9.01      1.275     1.275     0.000     0.000        1   Array#sum
  1.05      0.149     0.149     0.000     0.000        1   <Class::IO>#read

  Ruby 3.0.3
   %self      total      self      wait     child     calls  name
 74.91      3.773     3.773     0.000     0.000        1   String#lines
 22.15      1.116     1.116     0.000     0.000        1   Array#sum
  2.93      0.148     0.148     0.000     0.000        1   <Class::IO>#read
```

- A similar enumerator without `String#lines` does not appear to cause this:

```
▶ time ruby -ve '10_000_000.times.map { nil }'
ruby 3.0.3p157 (2021-11-24 revision 3fb7d2cadc) [x86_64-darwin20]
ruby -ve '10_000_000.times.map { nil }'  0.57s user 0.16s system 102% cpu 0.710 total
```

```
▶ time ruby -ve '10_000_000.times.map { nil }'
ruby 3.1.0p0 (2021-12-25 revision fb4df44d16) [x86_64-darwin20]
ruby -ve '10_000_000.times.map { nil }'  0.61s user 0.16s system 102% cpu 0.753 total
```

- It doesn't seem related to string generation:

```
▶ time ruby -ve '("\n" * 10_000_000)'
ruby 3.0.3p157 (2021-11-24 revision 3fb7d2cadc) [x86_64-darwin20]
-e:1: warning: possibly useless use of * in void context
ruby -ve '("\n" * 10_000_000)'  0.13s user 0.14s system 107% cpu 0.246 total
```

```
▶ time ruby -ve '("\n" * 10_000_000)'
ruby 3.1.0p0 (2021-12-25 revision fb4df44d16) [x86_64-darwin20]
-e:1: warning: possibly useless use of * in void context
ruby -ve '("\n" * 10_000_000)'  0.13s user 0.14s system 107% cpu 0.245 total
```

(Thanks to simpl1g for the discussion on Reddit, and help detecting this potential issue)



-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [ruby-core:106896] [Ruby master Bug#18447] Potential performance regression with String#lines in large strings
  2021-12-28 23:26 [ruby-core:106883] [Ruby master Bug#18447] Potential performance regression with String#lines in large strings ttilberg (Tim Tilberg)
@ 2021-12-29  5:36 ` mame (Yusuke Endoh)
  2021-12-29  8:05 ` [ruby-core:106897] " nobu (Nobuyoshi Nakada)
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 5+ messages in thread
From: mame (Yusuke Endoh) @ 2021-12-29  5:36 UTC (permalink / raw
  To: ruby-core

Issue #18447 has been updated by mame (Yusuke Endoh).

Assignee set to peterzhu2118 (Peter Zhu)

Thank you for your great report. I could reproduce the issue on M1 macOS. It does not repro on Ubuntu.

According to `git bisect`, this regression is caused by 785f5eb8f0f994c755ac5f1e1479f6e2ec6d0ef3.

```
% git checkout 09f284c052854392fad1e63fd541a6c8704d7b90 && make -j miniruby
% time ./miniruby -ve '("\n" * 10_000_000).lines'
ruby 3.1.0dev (2021-02-24T11:01:44Z v3_1_0_preview1~4291 09f284c052) [arm64-darwin20]
./miniruby -ve '("\n" * 10_000_000).lines'  0.91s user 0.10s system 81% cpu 1.255 total
```

```
% git checkout 785f5eb8f0f994c755ac5f1e1479f6e2ec6d0ef3 && make -j miniruby
% time ./miniruby -ve '("\n" * 10_000_000).lines'
ruby 3.1.0dev (2021-02-24T20:25:30Z v3_1_0_preview1~4290 785f5eb8f0) [arm64-darwin20]
./miniruby -ve '("\n" * 10_000_000).lines'  0.85s user 4.77s system 96% cpu 5.834 total

```

It mainly takes "system" time, not "user" time. So, it seems to make sense that the slowdown is due to the use of mmap.

@peterzhu2118 Could you take a look at this?

----------------------------------------
Bug #18447: Potential performance regression with String#lines in large strings
https://bugs.ruby-lang.org/issues/18447#change-95713

* Author: ttilberg (Tim Tilberg)
* Status: Open
* Priority: Normal
* Assignee: peterzhu2118 (Peter Zhu)
* ruby -v: ruby 3.1.0p0 (2021-12-25 revision fb4df44d16) [x86_64-darwin20]
* Backport: 2.6: UNKNOWN, 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN
----------------------------------------
I believe there may be a potential performance regression regarding `String#lines` worth noting between 3.0.3 and 3.1.0. This came about in [this discussion](https://www.reddit.com/r/ruby/comments/rpje0g/fast_to_way_to_parse_csv/hqadqsd/) regarding large file parsing performance. We were benchmarking various ways to parse a 10 million row CSV file. Slurping the file took significantly longer than streaming in version 3.1.0, even though the data was able to fit in memory. After further research, we started to feel that there may be something to speak up about here, and I think it's pinned down to `String#lines`.

I'm running Mac OS Big Sur 11.6.1 on a 13" 2020 MBP with 32 GB ram. Specific Ruby versions are included in the comparison examples below.

The simplest reproduction seems to be:

Ruby 3.0.3: ~1.5 seconds

```
▶ time ruby -ve '("\n" * 10_000_000).lines'
ruby 3.0.3p157 (2021-11-24 revision 3fb7d2cadc) [x86_64-darwin20]
ruby -ve '("\n" * 10_000_000).lines'  1.38s user 0.39s system 100% cpu 1.756 total
```

Ruby 3.1.0: ~11.5 seconds

```
▶ time ruby -ve '("\n" * 10_000_000).lines'
ruby 3.1.0p0 (2021-12-25 revision fb4df44d16) [x86_64-darwin20]
ruby -ve '("\n" * 10_000_000).lines'  1.52s user 10.01s system 99% cpu 11.579 total
```

Some other observations:

An earlier script that I ran ruby-prof against looked like:

```
puts File.read("sample-data.csv").lines.sum { 1 }
```

- It appeared that the time increase stemmed from `String#lines`, as all other methods had similar time taken between the versions:

```
 Ruby 3.1.0
  %self      total      self      wait     child     calls  name
 89.93     12.728    12.728     0.000     0.000        1   String#lines
  9.01      1.275     1.275     0.000     0.000        1   Array#sum
  1.05      0.149     0.149     0.000     0.000        1   <Class::IO>#read

  Ruby 3.0.3
   %self      total      self      wait     child     calls  name
 74.91      3.773     3.773     0.000     0.000        1   String#lines
 22.15      1.116     1.116     0.000     0.000        1   Array#sum
  2.93      0.148     0.148     0.000     0.000        1   <Class::IO>#read
```

- A similar enumerator without `String#lines` does not appear to cause this:

```
▶ time ruby -ve '10_000_000.times.map { nil }'
ruby 3.0.3p157 (2021-11-24 revision 3fb7d2cadc) [x86_64-darwin20]
ruby -ve '10_000_000.times.map { nil }'  0.57s user 0.16s system 102% cpu 0.710 total
```

```
▶ time ruby -ve '10_000_000.times.map { nil }'
ruby 3.1.0p0 (2021-12-25 revision fb4df44d16) [x86_64-darwin20]
ruby -ve '10_000_000.times.map { nil }'  0.61s user 0.16s system 102% cpu 0.753 total
```

- It doesn't seem related to string generation:

```
▶ time ruby -ve '("\n" * 10_000_000)'
ruby 3.0.3p157 (2021-11-24 revision 3fb7d2cadc) [x86_64-darwin20]
-e:1: warning: possibly useless use of * in void context
ruby -ve '("\n" * 10_000_000)'  0.13s user 0.14s system 107% cpu 0.246 total
```

```
▶ time ruby -ve '("\n" * 10_000_000)'
ruby 3.1.0p0 (2021-12-25 revision fb4df44d16) [x86_64-darwin20]
-e:1: warning: possibly useless use of * in void context
ruby -ve '("\n" * 10_000_000)'  0.13s user 0.14s system 107% cpu 0.245 total
```

(Thanks to simpl1g for the discussion on Reddit, and help detecting this potential issue)



-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [ruby-core:106897] [Ruby master Bug#18447] Potential performance regression with String#lines in large strings
  2021-12-28 23:26 [ruby-core:106883] [Ruby master Bug#18447] Potential performance regression with String#lines in large strings ttilberg (Tim Tilberg)
  2021-12-29  5:36 ` [ruby-core:106896] " mame (Yusuke Endoh)
@ 2021-12-29  8:05 ` nobu (Nobuyoshi Nakada)
  2021-12-29 16:12 ` [ruby-core:106906] " peterzhu2118 (Peter Zhu)
  2021-12-29 21:54 ` [ruby-core:106910] " ttilberg (Tim Tilberg)
  3 siblings, 0 replies; 5+ messages in thread
From: nobu (Nobuyoshi Nakada) @ 2021-12-29  8:05 UTC (permalink / raw
  To: ruby-core

Issue #18447 has been updated by nobu (Nobuyoshi Nakada).


This seems to make it faster nearly to 3.0.
https://github.com/nobu/ruby/tree/mac-heap_page
```diff
diff --git a/gc.c b/gc.c
index a5369c28f13d..aaae009b2ee7 100644
--- a/gc.c
+++ b/gc.c
@@ -847,8 +847,15 @@ typedef struct rb_objspace {
 } rb_objspace_t;
 
 
+#if defined(__APPLE__) && defined(__LP64__)
+/* slow mmap */
+#define HEAP_PAGE_ALIGN_LOG 16
+#endif
+
+#ifndef HEAP_PAGE_ALIGN_LOG
 /* default tiny heap size: 16KB */
 #define HEAP_PAGE_ALIGN_LOG 14
+#endif
 #define CEILDIV(i, mod) (((i) + (mod) - 1)/(mod))
 enum {
     HEAP_PAGE_ALIGN = (1UL << HEAP_PAGE_ALIGN_LOG),
```

----------------------------------------
Bug #18447: Potential performance regression with String#lines in large strings
https://bugs.ruby-lang.org/issues/18447#change-95714

* Author: ttilberg (Tim Tilberg)
* Status: Open
* Priority: Normal
* Assignee: peterzhu2118 (Peter Zhu)
* ruby -v: ruby 3.1.0p0 (2021-12-25 revision fb4df44d16) [x86_64-darwin20]
* Backport: 2.6: UNKNOWN, 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN
----------------------------------------
I believe there may be a potential performance regression regarding `String#lines` worth noting between 3.0.3 and 3.1.0. This came about in [this discussion](https://www.reddit.com/r/ruby/comments/rpje0g/fast_to_way_to_parse_csv/hqadqsd/) regarding large file parsing performance. We were benchmarking various ways to parse a 10 million row CSV file. Slurping the file took significantly longer than streaming in version 3.1.0, even though the data was able to fit in memory. After further research, we started to feel that there may be something to speak up about here, and I think it's pinned down to `String#lines`.

I'm running Mac OS Big Sur 11.6.1 on a 13" 2020 MBP with 32 GB ram. Specific Ruby versions are included in the comparison examples below.

The simplest reproduction seems to be:

Ruby 3.0.3: ~1.5 seconds

```
▶ time ruby -ve '("\n" * 10_000_000).lines'
ruby 3.0.3p157 (2021-11-24 revision 3fb7d2cadc) [x86_64-darwin20]
ruby -ve '("\n" * 10_000_000).lines'  1.38s user 0.39s system 100% cpu 1.756 total
```

Ruby 3.1.0: ~11.5 seconds

```
▶ time ruby -ve '("\n" * 10_000_000).lines'
ruby 3.1.0p0 (2021-12-25 revision fb4df44d16) [x86_64-darwin20]
ruby -ve '("\n" * 10_000_000).lines'  1.52s user 10.01s system 99% cpu 11.579 total
```

Some other observations:

An earlier script that I ran ruby-prof against looked like:

```
puts File.read("sample-data.csv").lines.sum { 1 }
```

- It appeared that the time increase stemmed from `String#lines`, as all other methods had similar time taken between the versions:

```
 Ruby 3.1.0
  %self      total      self      wait     child     calls  name
 89.93     12.728    12.728     0.000     0.000        1   String#lines
  9.01      1.275     1.275     0.000     0.000        1   Array#sum
  1.05      0.149     0.149     0.000     0.000        1   <Class::IO>#read

  Ruby 3.0.3
   %self      total      self      wait     child     calls  name
 74.91      3.773     3.773     0.000     0.000        1   String#lines
 22.15      1.116     1.116     0.000     0.000        1   Array#sum
  2.93      0.148     0.148     0.000     0.000        1   <Class::IO>#read
```

- A similar enumerator without `String#lines` does not appear to cause this:

```
▶ time ruby -ve '10_000_000.times.map { nil }'
ruby 3.0.3p157 (2021-11-24 revision 3fb7d2cadc) [x86_64-darwin20]
ruby -ve '10_000_000.times.map { nil }'  0.57s user 0.16s system 102% cpu 0.710 total
```

```
▶ time ruby -ve '10_000_000.times.map { nil }'
ruby 3.1.0p0 (2021-12-25 revision fb4df44d16) [x86_64-darwin20]
ruby -ve '10_000_000.times.map { nil }'  0.61s user 0.16s system 102% cpu 0.753 total
```

- It doesn't seem related to string generation:

```
▶ time ruby -ve '("\n" * 10_000_000)'
ruby 3.0.3p157 (2021-11-24 revision 3fb7d2cadc) [x86_64-darwin20]
-e:1: warning: possibly useless use of * in void context
ruby -ve '("\n" * 10_000_000)'  0.13s user 0.14s system 107% cpu 0.246 total
```

```
▶ time ruby -ve '("\n" * 10_000_000)'
ruby 3.1.0p0 (2021-12-25 revision fb4df44d16) [x86_64-darwin20]
-e:1: warning: possibly useless use of * in void context
ruby -ve '("\n" * 10_000_000)'  0.13s user 0.14s system 107% cpu 0.245 total
```

(Thanks to simpl1g for the discussion on Reddit, and help detecting this potential issue)



-- 
https://bugs.ruby-lang.org/

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [ruby-core:106906] [Ruby master Bug#18447] Potential performance regression with String#lines in large strings
  2021-12-28 23:26 [ruby-core:106883] [Ruby master Bug#18447] Potential performance regression with String#lines in large strings ttilberg (Tim Tilberg)
  2021-12-29  5:36 ` [ruby-core:106896] " mame (Yusuke Endoh)
  2021-12-29  8:05 ` [ruby-core:106897] " nobu (Nobuyoshi Nakada)
@ 2021-12-29 16:12 ` peterzhu2118 (Peter Zhu)
  2021-12-29 21:54 ` [ruby-core:106910] " ttilberg (Tim Tilberg)
  3 siblings, 0 replies; 5+ messages in thread
From: peterzhu2118 (Peter Zhu) @ 2021-12-29 16:12 UTC (permalink / raw
  To: ruby-core

Issue #18447 has been updated by peterzhu2118 (Peter Zhu).


I think we should consider making the page sizes 64KB by default. Many systems (e.g. PowerPC) have 64 KB page sizes, so we can't use mmap on those systems meaning that we can't use compaction. See issue #18133.

I opened a PR for this feature: https://github.com/ruby/ruby/pull/5373


----------------------------------------
Bug #18447: Potential performance regression with String#lines in large strings
https://bugs.ruby-lang.org/issues/18447#change-95724

* Author: ttilberg (Tim Tilberg)
* Status: Open
* Priority: Normal
* Assignee: peterzhu2118 (Peter Zhu)
* ruby -v: ruby 3.1.0p0 (2021-12-25 revision fb4df44d16) [x86_64-darwin20]
* Backport: 2.6: UNKNOWN, 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN
----------------------------------------
I believe there may be a potential performance regression regarding `String#lines` worth noting between 3.0.3 and 3.1.0. This came about in [this discussion](https://www.reddit.com/r/ruby/comments/rpje0g/fast_to_way_to_parse_csv/hqadqsd/) regarding large file parsing performance. We were benchmarking various ways to parse a 10 million row CSV file. Slurping the file took significantly longer than streaming in version 3.1.0, even though the data was able to fit in memory. After further research, we started to feel that there may be something to speak up about here, and I think it's pinned down to `String#lines`.

I'm running Mac OS Big Sur 11.6.1 on a 13" 2020 MBP with 32 GB ram. Specific Ruby versions are included in the comparison examples below.

The simplest reproduction seems to be:

Ruby 3.0.3: ~1.5 seconds

```
▶ time ruby -ve '("\n" * 10_000_000).lines'
ruby 3.0.3p157 (2021-11-24 revision 3fb7d2cadc) [x86_64-darwin20]
ruby -ve '("\n" * 10_000_000).lines'  1.38s user 0.39s system 100% cpu 1.756 total
```

Ruby 3.1.0: ~11.5 seconds

```
▶ time ruby -ve '("\n" * 10_000_000).lines'
ruby 3.1.0p0 (2021-12-25 revision fb4df44d16) [x86_64-darwin20]
ruby -ve '("\n" * 10_000_000).lines'  1.52s user 10.01s system 99% cpu 11.579 total
```

Some other observations:

An earlier script that I ran ruby-prof against looked like:

```
puts File.read("sample-data.csv").lines.sum { 1 }
```

- It appeared that the time increase stemmed from `String#lines`, as all other methods had similar time taken between the versions:

```
 Ruby 3.1.0
  %self      total      self      wait     child     calls  name
 89.93     12.728    12.728     0.000     0.000        1   String#lines
  9.01      1.275     1.275     0.000     0.000        1   Array#sum
  1.05      0.149     0.149     0.000     0.000        1   <Class::IO>#read

  Ruby 3.0.3
   %self      total      self      wait     child     calls  name
 74.91      3.773     3.773     0.000     0.000        1   String#lines
 22.15      1.116     1.116     0.000     0.000        1   Array#sum
  2.93      0.148     0.148     0.000     0.000        1   <Class::IO>#read
```

- A similar enumerator without `String#lines` does not appear to cause this:

```
▶ time ruby -ve '10_000_000.times.map { nil }'
ruby 3.0.3p157 (2021-11-24 revision 3fb7d2cadc) [x86_64-darwin20]
ruby -ve '10_000_000.times.map { nil }'  0.57s user 0.16s system 102% cpu 0.710 total
```

```
▶ time ruby -ve '10_000_000.times.map { nil }'
ruby 3.1.0p0 (2021-12-25 revision fb4df44d16) [x86_64-darwin20]
ruby -ve '10_000_000.times.map { nil }'  0.61s user 0.16s system 102% cpu 0.753 total
```

- It doesn't seem related to string generation:

```
▶ time ruby -ve '("\n" * 10_000_000)'
ruby 3.0.3p157 (2021-11-24 revision 3fb7d2cadc) [x86_64-darwin20]
-e:1: warning: possibly useless use of * in void context
ruby -ve '("\n" * 10_000_000)'  0.13s user 0.14s system 107% cpu 0.246 total
```

```
▶ time ruby -ve '("\n" * 10_000_000)'
ruby 3.1.0p0 (2021-12-25 revision fb4df44d16) [x86_64-darwin20]
-e:1: warning: possibly useless use of * in void context
ruby -ve '("\n" * 10_000_000)'  0.13s user 0.14s system 107% cpu 0.245 total
```

(Thanks to simpl1g for the discussion on Reddit, and help detecting this potential issue)



-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [ruby-core:106910] [Ruby master Bug#18447] Potential performance regression with String#lines in large strings
  2021-12-28 23:26 [ruby-core:106883] [Ruby master Bug#18447] Potential performance regression with String#lines in large strings ttilberg (Tim Tilberg)
                   ` (2 preceding siblings ...)
  2021-12-29 16:12 ` [ruby-core:106906] " peterzhu2118 (Peter Zhu)
@ 2021-12-29 21:54 ` ttilberg (Tim Tilberg)
  3 siblings, 0 replies; 5+ messages in thread
From: ttilberg (Tim Tilberg) @ 2021-12-29 21:54 UTC (permalink / raw
  To: ruby-core

Issue #18447 has been updated by ttilberg (Tim Tilberg).


Thanks everyone! I wondered if it wasn't due to memory allocations, but the "build a massive string" operation was successful and quite fast. I looked through `rb_str_enumerate_lines` in the C source, and got lost rather quickly, aside from seeing that if the string wasn't frozen, we would create a new one that _is_ frozen. I'm curious if someone would bother to take a moment to share where this was manifested from my example (`String#lines`)? I'm working my way up towards understanding Ruby at a deeper level.

Regardless, thank you everyone for triaging and creating a patch so quickly!

----------------------------------------
Bug #18447: Potential performance regression with String#lines in large strings
https://bugs.ruby-lang.org/issues/18447#change-95730

* Author: ttilberg (Tim Tilberg)
* Status: Closed
* Priority: Normal
* Assignee: peterzhu2118 (Peter Zhu)
* ruby -v: ruby 3.1.0p0 (2021-12-25 revision fb4df44d16) [x86_64-darwin20]
* Backport: 2.6: UNKNOWN, 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN
----------------------------------------
I believe there may be a potential performance regression regarding `String#lines` worth noting between 3.0.3 and 3.1.0. This came about in [this discussion](https://www.reddit.com/r/ruby/comments/rpje0g/fast_to_way_to_parse_csv/hqadqsd/) regarding large file parsing performance. We were benchmarking various ways to parse a 10 million row CSV file. Slurping the file took significantly longer than streaming in version 3.1.0, even though the data was able to fit in memory. After further research, we started to feel that there may be something to speak up about here, and I think it's pinned down to `String#lines`.

I'm running Mac OS Big Sur 11.6.1 on a 13" 2020 MBP with 32 GB ram. Specific Ruby versions are included in the comparison examples below.

The simplest reproduction seems to be:

Ruby 3.0.3: ~1.5 seconds

```
▶ time ruby -ve '("\n" * 10_000_000).lines'
ruby 3.0.3p157 (2021-11-24 revision 3fb7d2cadc) [x86_64-darwin20]
ruby -ve '("\n" * 10_000_000).lines'  1.38s user 0.39s system 100% cpu 1.756 total
```

Ruby 3.1.0: ~11.5 seconds

```
▶ time ruby -ve '("\n" * 10_000_000).lines'
ruby 3.1.0p0 (2021-12-25 revision fb4df44d16) [x86_64-darwin20]
ruby -ve '("\n" * 10_000_000).lines'  1.52s user 10.01s system 99% cpu 11.579 total
```

Some other observations:

An earlier script that I ran ruby-prof against looked like:

```
puts File.read("sample-data.csv").lines.sum { 1 }
```

- It appeared that the time increase stemmed from `String#lines`, as all other methods had similar time taken between the versions:

```
 Ruby 3.1.0
  %self      total      self      wait     child     calls  name
 89.93     12.728    12.728     0.000     0.000        1   String#lines
  9.01      1.275     1.275     0.000     0.000        1   Array#sum
  1.05      0.149     0.149     0.000     0.000        1   <Class::IO>#read

  Ruby 3.0.3
   %self      total      self      wait     child     calls  name
 74.91      3.773     3.773     0.000     0.000        1   String#lines
 22.15      1.116     1.116     0.000     0.000        1   Array#sum
  2.93      0.148     0.148     0.000     0.000        1   <Class::IO>#read
```

- A similar enumerator without `String#lines` does not appear to cause this:

```
▶ time ruby -ve '10_000_000.times.map { nil }'
ruby 3.0.3p157 (2021-11-24 revision 3fb7d2cadc) [x86_64-darwin20]
ruby -ve '10_000_000.times.map { nil }'  0.57s user 0.16s system 102% cpu 0.710 total
```

```
▶ time ruby -ve '10_000_000.times.map { nil }'
ruby 3.1.0p0 (2021-12-25 revision fb4df44d16) [x86_64-darwin20]
ruby -ve '10_000_000.times.map { nil }'  0.61s user 0.16s system 102% cpu 0.753 total
```

- It doesn't seem related to string generation:

```
▶ time ruby -ve '("\n" * 10_000_000)'
ruby 3.0.3p157 (2021-11-24 revision 3fb7d2cadc) [x86_64-darwin20]
-e:1: warning: possibly useless use of * in void context
ruby -ve '("\n" * 10_000_000)'  0.13s user 0.14s system 107% cpu 0.246 total
```

```
▶ time ruby -ve '("\n" * 10_000_000)'
ruby 3.1.0p0 (2021-12-25 revision fb4df44d16) [x86_64-darwin20]
-e:1: warning: possibly useless use of * in void context
ruby -ve '("\n" * 10_000_000)'  0.13s user 0.14s system 107% cpu 0.245 total
```

(Thanks to simpl1g for the discussion on Reddit, and help detecting this potential issue)



-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2021-12-29 21:55 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2021-12-28 23:26 [ruby-core:106883] [Ruby master Bug#18447] Potential performance regression with String#lines in large strings ttilberg (Tim Tilberg)
2021-12-29  5:36 ` [ruby-core:106896] " mame (Yusuke Endoh)
2021-12-29  8:05 ` [ruby-core:106897] " nobu (Nobuyoshi Nakada)
2021-12-29 16:12 ` [ruby-core:106906] " peterzhu2118 (Peter Zhu)
2021-12-29 21:54 ` [ruby-core:106910] " ttilberg (Tim Tilberg)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).