* [ruby-core:110085] [Ruby master Feature#14900] Extra allocation in String#byteslice
[not found] <redmine.issue-14900.20180707094727.7841@ruby-lang.org>
@ 2022-09-26 11:26 ` ioquatix (Samuel Williams)
2022-09-26 12:45 ` [ruby-core:110087] " ioquatix (Samuel Williams)
` (2 subsequent siblings)
3 siblings, 0 replies; 4+ messages in thread
From: ioquatix (Samuel Williams) @ 2022-09-26 11:26 UTC (permalink / raw)
To: ruby-core
Issue #14900 has been updated by ioquatix (Samuel Williams).
I updated my PR and started investigating the changes between the current implementation and the proposed implementation, and after discussing it with @byroot we found a bug: https://github.com/ruby/ruby/pull/6443
This actually is far worse in terms of over-allocation, so fixing this is a good first step.
We are also considering introducing `byteslice!` which should be a better interface for when you "don't care about the source buffer" and thus can "internally freeze it to avoid memory allocations". https://bugs.ruby-lang.org/issues/13626
That being said, I'll update the PR here (impacting `byteslice`) and check whether it still makes sense.
----------------------------------------
Feature #14900: Extra allocation in String#byteslice
https://bugs.ruby-lang.org/issues/14900#change-99340
* Author: janko (Janko Marohnić)
* Status: Open
* Priority: Normal
----------------------------------------
When executing `String#byteslice` with a range, I noticed that sometimes the original string is allocated again. When I run the following script:
~~~ ruby
require "objspace"
string = "a" * 100_000
GC.start
GC.disable
generation = GC.count
ObjectSpace.trace_object_allocations do
string.byteslice(50_000..-1)
ObjectSpace.each_object(String) do |string|
p string.bytesize if ObjectSpace.allocation_generation(string) == generation
end
end
~~~
it outputs
~~~
50000
100000
6
5
~~~
The one with 50000 bytes is the result of `String#byteslice`, but the one with 100000 bytes is the duplicated original string. I expected only the result of `String#byteslice` to be amongst new allocations.
If instead of the last 50000 bytes I slice the *first* 50000 bytes, the extra duplication doesn't occur.
~~~ ruby
# ...
string.byteslice(0, 50_000)
# ...
~~~
~~~
50000
5
~~~
It's definitely ok if the implementation of `String#bytesize` allocates extra strings as part of the implementation, but it would be nice if they were deallocated before returning the result.
EDIT: It seems that `String#slice` has the same issue.
--
https://bugs.ruby-lang.org/
^ permalink raw reply [flat|nested] 4+ messages in thread
* [ruby-core:110087] [Ruby master Feature#14900] Extra allocation in String#byteslice
[not found] <redmine.issue-14900.20180707094727.7841@ruby-lang.org>
2022-09-26 11:26 ` [ruby-core:110085] [Ruby master Feature#14900] Extra allocation in String#byteslice ioquatix (Samuel Williams)
@ 2022-09-26 12:45 ` ioquatix (Samuel Williams)
2022-09-26 12:50 ` [ruby-core:110088] " ioquatix (Samuel Williams)
2022-09-26 14:53 ` [ruby-core:110089] [Ruby master Bug#14900] " byroot (Jean Boussier)
3 siblings, 0 replies; 4+ messages in thread
From: ioquatix (Samuel Williams) @ 2022-09-26 12:45 UTC (permalink / raw)
To: ruby-core
Issue #14900 has been updated by ioquatix (Samuel Williams).
Okay, so @byroot and I discussed this issue at length.
The simplest way to get the behaviour you want is to call `freeze` on the source string.
```ruby
require "objspace"
string = "a" * 100_000
GC.start
GC.disable
generation = GC.count
string.freeze # ADD THIS LINE
ObjectSpace.trace_object_allocations do
string.byteslice(50_000..-1)
ObjectSpace.each_object(String) do |string|
p string.bytesize if ObjectSpace.allocation_generation(string) == generation
end
end
```
This has the desired result that the source string is not copied, but is internally used as a shared buffer for the `string.byteslice`. However, it prevents you from modifying the source string.
The reason why it works this way, is because as @funny_falcon pointed out, many people are slicing the front off a buffer several times. If the buffer is huge, this would result in many large `memcpy`.
By freezing the string you avoid this memcpy. Alternatively, `byteslice` does this for you internally. It's a bit more nuanced than that because smaller strings are always copied, so this only kicks in for larger strings.
So I wanted to think a bit about how to do this efficiently - I think something like this can be pretty good:
``` ruby
buffer = String.new # allocation
while true
# Efficiently read into the buffer:
if buffer.empty?
io.read(1024, buffer)
else
buffer << io.read(1024)
end
# Freeze the buffer so it will be shared during processing:
buffer.freeze
# Consume the buffer in chunks:
while size = consume(buffer)
buffer = buffer.byteslice(size..-1) # shared root string - no memcpy or allocation
end
# Unfreeze the buffer if needed.
buffer = +buffer
end
```
The proposed PR basically skips the internal sharing mechanism unless you call `buffer.freeze`. In current Ruby, it's optional, and if you don't freeze it, Ruby is forced to create an internal dup, which is what you are seeing.
We should investigate the performance of the typical IO usage, to see which way is better.
----------------------------------------
Feature #14900: Extra allocation in String#byteslice
https://bugs.ruby-lang.org/issues/14900#change-99342
* Author: janko (Janko Marohnić)
* Status: Open
* Priority: Normal
----------------------------------------
When executing `String#byteslice` with a range, I noticed that sometimes the original string is allocated again. When I run the following script:
~~~ ruby
require "objspace"
string = "a" * 100_000
GC.start
GC.disable
generation = GC.count
ObjectSpace.trace_object_allocations do
string.byteslice(50_000..-1)
ObjectSpace.each_object(String) do |string|
p string.bytesize if ObjectSpace.allocation_generation(string) == generation
end
end
~~~
it outputs
~~~
50000
100000
6
5
~~~
The one with 50000 bytes is the result of `String#byteslice`, but the one with 100000 bytes is the duplicated original string. I expected only the result of `String#byteslice` to be amongst new allocations.
If instead of the last 50000 bytes I slice the *first* 50000 bytes, the extra duplication doesn't occur.
~~~ ruby
# ...
string.byteslice(0, 50_000)
# ...
~~~
~~~
50000
5
~~~
It's definitely ok if the implementation of `String#bytesize` allocates extra strings as part of the implementation, but it would be nice if they were deallocated before returning the result.
EDIT: It seems that `String#slice` has the same issue.
--
https://bugs.ruby-lang.org/
^ permalink raw reply [flat|nested] 4+ messages in thread
* [ruby-core:110088] [Ruby master Feature#14900] Extra allocation in String#byteslice
[not found] <redmine.issue-14900.20180707094727.7841@ruby-lang.org>
2022-09-26 11:26 ` [ruby-core:110085] [Ruby master Feature#14900] Extra allocation in String#byteslice ioquatix (Samuel Williams)
2022-09-26 12:45 ` [ruby-core:110087] " ioquatix (Samuel Williams)
@ 2022-09-26 12:50 ` ioquatix (Samuel Williams)
2022-09-26 14:53 ` [ruby-core:110089] [Ruby master Bug#14900] " byroot (Jean Boussier)
3 siblings, 0 replies; 4+ messages in thread
From: ioquatix (Samuel Williams) @ 2022-09-26 12:50 UTC (permalink / raw)
To: ruby-core
Issue #14900 has been updated by ioquatix (Samuel Williams).
By the way, ideally, I think you can implement this:
```ruby
buffer = String.new # allocation
while true
# Efficiently read into the buffer:
if buffer.empty?
io.read(1024, buffer)
else
buffer << io.read(1024)
end
# Consume the buffer in chunks:
while size = consume(buffer)
buffer.byteslice!(size..-1) # shared root string - no memcpy or allocation
end
end
```
----------------------------------------
Feature #14900: Extra allocation in String#byteslice
https://bugs.ruby-lang.org/issues/14900#change-99343
* Author: janko (Janko Marohnić)
* Status: Open
* Priority: Normal
----------------------------------------
When executing `String#byteslice` with a range, I noticed that sometimes the original string is allocated again. When I run the following script:
~~~ ruby
require "objspace"
string = "a" * 100_000
GC.start
GC.disable
generation = GC.count
ObjectSpace.trace_object_allocations do
string.byteslice(50_000..-1)
ObjectSpace.each_object(String) do |string|
p string.bytesize if ObjectSpace.allocation_generation(string) == generation
end
end
~~~
it outputs
~~~
50000
100000
6
5
~~~
The one with 50000 bytes is the result of `String#byteslice`, but the one with 100000 bytes is the duplicated original string. I expected only the result of `String#byteslice` to be amongst new allocations.
If instead of the last 50000 bytes I slice the *first* 50000 bytes, the extra duplication doesn't occur.
~~~ ruby
# ...
string.byteslice(0, 50_000)
# ...
~~~
~~~
50000
5
~~~
It's definitely ok if the implementation of `String#bytesize` allocates extra strings as part of the implementation, but it would be nice if they were deallocated before returning the result.
EDIT: It seems that `String#slice` has the same issue.
--
https://bugs.ruby-lang.org/
^ permalink raw reply [flat|nested] 4+ messages in thread
* [ruby-core:110089] [Ruby master Bug#14900] Extra allocation in String#byteslice
[not found] <redmine.issue-14900.20180707094727.7841@ruby-lang.org>
` (2 preceding siblings ...)
2022-09-26 12:50 ` [ruby-core:110088] " ioquatix (Samuel Williams)
@ 2022-09-26 14:53 ` byroot (Jean Boussier)
3 siblings, 0 replies; 4+ messages in thread
From: byroot (Jean Boussier) @ 2022-09-26 14:53 UTC (permalink / raw)
To: ruby-core
Issue #14900 has been updated by byroot (Jean Boussier).
Status changed from Open to Closed
Backport set to 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN
Tracker changed from Feature to Bug
I think if we stick solely to the ticket description, I think https://github.com/ruby/ruby/pull/6443 fixes the issue (the extra useless allocation).
Also note that this extra allocation was a shared string, so it wasn't copying the string content.
So we can probably close this as fixed now.
If we wish to improve typically IO buffering code, we can open another issue.
----------------------------------------
Bug #14900: Extra allocation in String#byteslice
https://bugs.ruby-lang.org/issues/14900#change-99344
* Author: janko (Janko Marohnić)
* Status: Closed
* Priority: Normal
* Backport: 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN
----------------------------------------
When executing `String#byteslice` with a range, I noticed that sometimes the original string is allocated again. When I run the following script:
~~~ ruby
require "objspace"
string = "a" * 100_000
GC.start
GC.disable
generation = GC.count
ObjectSpace.trace_object_allocations do
string.byteslice(50_000..-1)
ObjectSpace.each_object(String) do |string|
p string.bytesize if ObjectSpace.allocation_generation(string) == generation
end
end
~~~
it outputs
~~~
50000
100000
6
5
~~~
The one with 50000 bytes is the result of `String#byteslice`, but the one with 100000 bytes is the duplicated original string. I expected only the result of `String#byteslice` to be amongst new allocations.
If instead of the last 50000 bytes I slice the *first* 50000 bytes, the extra duplication doesn't occur.
~~~ ruby
# ...
string.byteslice(0, 50_000)
# ...
~~~
~~~
50000
5
~~~
It's definitely ok if the implementation of `String#bytesize` allocates extra strings as part of the implementation, but it would be nice if they were deallocated before returning the result.
EDIT: It seems that `String#slice` has the same issue.
--
https://bugs.ruby-lang.org/
^ permalink raw reply [flat|nested] 4+ messages in thread