ruby-core@ruby-lang.org archive (unofficial mirror)
 help / color / mirror / Atom feed
* [ruby-core:110085] [Ruby master Feature#14900] Extra allocation in String#byteslice
       [not found] <redmine.issue-14900.20180707094727.7841@ruby-lang.org>
@ 2022-09-26 11:26 ` ioquatix (Samuel Williams)
  2022-09-26 12:45 ` [ruby-core:110087] " ioquatix (Samuel Williams)
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 4+ messages in thread
From: ioquatix (Samuel Williams) @ 2022-09-26 11:26 UTC (permalink / raw)
  To: ruby-core

Issue #14900 has been updated by ioquatix (Samuel Williams).


I updated my PR and started investigating the changes between the current implementation and the proposed implementation, and after discussing it with @byroot we found a bug: https://github.com/ruby/ruby/pull/6443

This actually is far worse in terms of over-allocation, so fixing this is a good first step.

We are also considering introducing `byteslice!` which should be a better interface for when you "don't care about the source buffer" and thus can "internally freeze it to avoid memory allocations". https://bugs.ruby-lang.org/issues/13626

That being said, I'll update the PR here (impacting `byteslice`) and check whether it still makes sense.

----------------------------------------
Feature #14900: Extra allocation in String#byteslice
https://bugs.ruby-lang.org/issues/14900#change-99340

* Author: janko (Janko Marohnić)
* Status: Open
* Priority: Normal
----------------------------------------
When executing `String#byteslice` with a range, I noticed that sometimes the original string is allocated again. When I run the following script:

~~~ ruby
require "objspace"

string = "a" * 100_000

GC.start
GC.disable
generation = GC.count

ObjectSpace.trace_object_allocations do
  string.byteslice(50_000..-1)

  ObjectSpace.each_object(String) do |string|
    p string.bytesize if ObjectSpace.allocation_generation(string) == generation
  end
end
~~~

it outputs

~~~
50000
100000
6
5
~~~

The one with 50000 bytes is the result of `String#byteslice`, but the one with 100000 bytes is the duplicated original string. I expected only the result of `String#byteslice` to be amongst new allocations.

If instead of the last 50000 bytes I slice the *first* 50000 bytes, the extra duplication doesn't occur.

~~~ ruby
# ...
  string.byteslice(0, 50_000)
# ...
~~~

~~~
50000
5
~~~

It's definitely ok if the implementation of `String#bytesize` allocates extra strings as part of the implementation, but it would be nice if they were deallocated before returning the result.

EDIT: It seems that `String#slice` has the same issue.



-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [ruby-core:110087] [Ruby master Feature#14900] Extra allocation in String#byteslice
       [not found] <redmine.issue-14900.20180707094727.7841@ruby-lang.org>
  2022-09-26 11:26 ` [ruby-core:110085] [Ruby master Feature#14900] Extra allocation in String#byteslice ioquatix (Samuel Williams)
@ 2022-09-26 12:45 ` ioquatix (Samuel Williams)
  2022-09-26 12:50 ` [ruby-core:110088] " ioquatix (Samuel Williams)
  2022-09-26 14:53 ` [ruby-core:110089] [Ruby master Bug#14900] " byroot (Jean Boussier)
  3 siblings, 0 replies; 4+ messages in thread
From: ioquatix (Samuel Williams) @ 2022-09-26 12:45 UTC (permalink / raw)
  To: ruby-core

Issue #14900 has been updated by ioquatix (Samuel Williams).


Okay, so @byroot and I discussed this issue at length.

The simplest way to get the behaviour you want is to call `freeze` on the source string.

```ruby
require "objspace"

string = "a" * 100_000

GC.start
GC.disable
generation = GC.count

string.freeze # ADD THIS LINE

ObjectSpace.trace_object_allocations do
  string.byteslice(50_000..-1)

  ObjectSpace.each_object(String) do |string|
    p string.bytesize if ObjectSpace.allocation_generation(string) == generation
  end
end
```

This has the desired result that the source string is not copied, but is internally used as a shared buffer for the `string.byteslice`. However, it prevents you from modifying the source string.

The reason why it works this way, is because as @funny_falcon pointed out, many people are slicing the front off a buffer several times. If the buffer is huge, this would result in many large `memcpy`.

By freezing the string you avoid this memcpy. Alternatively, `byteslice` does this for you internally. It's a bit more nuanced than that because smaller strings are always copied, so this only kicks in for larger strings.

So I wanted to think a bit about how to do this efficiently - I think something like this can be pretty good:

``` ruby
buffer = String.new # allocation
while true
	# Efficiently read into the buffer:
	if buffer.empty?
		io.read(1024, buffer)
	else
		buffer << io.read(1024)
	end

	# Freeze the buffer so it will be shared during processing:
	buffer.freeze

	# Consume the buffer in chunks:
	while size = consume(buffer)
		buffer = buffer.byteslice(size..-1) # shared root string - no memcpy or allocation
	end

	# Unfreeze the buffer if needed.
	buffer = +buffer
end
```

The proposed PR basically skips the internal sharing mechanism unless you call `buffer.freeze`. In current Ruby, it's optional, and if you don't freeze it, Ruby is forced to create an internal dup, which is what you are seeing.

We should investigate the performance of the typical IO usage, to see which way is better.


----------------------------------------
Feature #14900: Extra allocation in String#byteslice
https://bugs.ruby-lang.org/issues/14900#change-99342

* Author: janko (Janko Marohnić)
* Status: Open
* Priority: Normal
----------------------------------------
When executing `String#byteslice` with a range, I noticed that sometimes the original string is allocated again. When I run the following script:

~~~ ruby
require "objspace"

string = "a" * 100_000

GC.start
GC.disable
generation = GC.count

ObjectSpace.trace_object_allocations do
  string.byteslice(50_000..-1)

  ObjectSpace.each_object(String) do |string|
    p string.bytesize if ObjectSpace.allocation_generation(string) == generation
  end
end
~~~

it outputs

~~~
50000
100000
6
5
~~~

The one with 50000 bytes is the result of `String#byteslice`, but the one with 100000 bytes is the duplicated original string. I expected only the result of `String#byteslice` to be amongst new allocations.

If instead of the last 50000 bytes I slice the *first* 50000 bytes, the extra duplication doesn't occur.

~~~ ruby
# ...
  string.byteslice(0, 50_000)
# ...
~~~

~~~
50000
5
~~~

It's definitely ok if the implementation of `String#bytesize` allocates extra strings as part of the implementation, but it would be nice if they were deallocated before returning the result.

EDIT: It seems that `String#slice` has the same issue.



-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [ruby-core:110088] [Ruby master Feature#14900] Extra allocation in String#byteslice
       [not found] <redmine.issue-14900.20180707094727.7841@ruby-lang.org>
  2022-09-26 11:26 ` [ruby-core:110085] [Ruby master Feature#14900] Extra allocation in String#byteslice ioquatix (Samuel Williams)
  2022-09-26 12:45 ` [ruby-core:110087] " ioquatix (Samuel Williams)
@ 2022-09-26 12:50 ` ioquatix (Samuel Williams)
  2022-09-26 14:53 ` [ruby-core:110089] [Ruby master Bug#14900] " byroot (Jean Boussier)
  3 siblings, 0 replies; 4+ messages in thread
From: ioquatix (Samuel Williams) @ 2022-09-26 12:50 UTC (permalink / raw)
  To: ruby-core

Issue #14900 has been updated by ioquatix (Samuel Williams).


By the way, ideally, I think you can implement this:

```ruby
buffer = String.new # allocation
while true
	# Efficiently read into the buffer:
	if buffer.empty?
		io.read(1024, buffer)
	else
		buffer << io.read(1024)
	end

	# Consume the buffer in chunks:
	while size = consume(buffer)
		buffer.byteslice!(size..-1) # shared root string - no memcpy or allocation
	end
end
```

----------------------------------------
Feature #14900: Extra allocation in String#byteslice
https://bugs.ruby-lang.org/issues/14900#change-99343

* Author: janko (Janko Marohnić)
* Status: Open
* Priority: Normal
----------------------------------------
When executing `String#byteslice` with a range, I noticed that sometimes the original string is allocated again. When I run the following script:

~~~ ruby
require "objspace"

string = "a" * 100_000

GC.start
GC.disable
generation = GC.count

ObjectSpace.trace_object_allocations do
  string.byteslice(50_000..-1)

  ObjectSpace.each_object(String) do |string|
    p string.bytesize if ObjectSpace.allocation_generation(string) == generation
  end
end
~~~

it outputs

~~~
50000
100000
6
5
~~~

The one with 50000 bytes is the result of `String#byteslice`, but the one with 100000 bytes is the duplicated original string. I expected only the result of `String#byteslice` to be amongst new allocations.

If instead of the last 50000 bytes I slice the *first* 50000 bytes, the extra duplication doesn't occur.

~~~ ruby
# ...
  string.byteslice(0, 50_000)
# ...
~~~

~~~
50000
5
~~~

It's definitely ok if the implementation of `String#bytesize` allocates extra strings as part of the implementation, but it would be nice if they were deallocated before returning the result.

EDIT: It seems that `String#slice` has the same issue.



-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [ruby-core:110089] [Ruby master Bug#14900] Extra allocation in String#byteslice
       [not found] <redmine.issue-14900.20180707094727.7841@ruby-lang.org>
                   ` (2 preceding siblings ...)
  2022-09-26 12:50 ` [ruby-core:110088] " ioquatix (Samuel Williams)
@ 2022-09-26 14:53 ` byroot (Jean Boussier)
  3 siblings, 0 replies; 4+ messages in thread
From: byroot (Jean Boussier) @ 2022-09-26 14:53 UTC (permalink / raw)
  To: ruby-core

Issue #14900 has been updated by byroot (Jean Boussier).

Status changed from Open to Closed
Backport set to 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN
Tracker changed from Feature to Bug

I think if we stick solely to the ticket description, I think https://github.com/ruby/ruby/pull/6443 fixes the issue (the extra useless allocation).

Also note that this extra allocation was a shared string, so it wasn't copying the string content.

So we can probably close this as fixed now.

If we wish to improve typically IO buffering code, we can open another issue.

----------------------------------------
Bug #14900: Extra allocation in String#byteslice
https://bugs.ruby-lang.org/issues/14900#change-99344

* Author: janko (Janko Marohnić)
* Status: Closed
* Priority: Normal
* Backport: 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN
----------------------------------------
When executing `String#byteslice` with a range, I noticed that sometimes the original string is allocated again. When I run the following script:

~~~ ruby
require "objspace"

string = "a" * 100_000

GC.start
GC.disable
generation = GC.count

ObjectSpace.trace_object_allocations do
  string.byteslice(50_000..-1)

  ObjectSpace.each_object(String) do |string|
    p string.bytesize if ObjectSpace.allocation_generation(string) == generation
  end
end
~~~

it outputs

~~~
50000
100000
6
5
~~~

The one with 50000 bytes is the result of `String#byteslice`, but the one with 100000 bytes is the duplicated original string. I expected only the result of `String#byteslice` to be amongst new allocations.

If instead of the last 50000 bytes I slice the *first* 50000 bytes, the extra duplication doesn't occur.

~~~ ruby
# ...
  string.byteslice(0, 50_000)
# ...
~~~

~~~
50000
5
~~~

It's definitely ok if the implementation of `String#bytesize` allocates extra strings as part of the implementation, but it would be nice if they were deallocated before returning the result.

EDIT: It seems that `String#slice` has the same issue.



-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2022-09-26 14:53 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <redmine.issue-14900.20180707094727.7841@ruby-lang.org>
2022-09-26 11:26 ` [ruby-core:110085] [Ruby master Feature#14900] Extra allocation in String#byteslice ioquatix (Samuel Williams)
2022-09-26 12:45 ` [ruby-core:110087] " ioquatix (Samuel Williams)
2022-09-26 12:50 ` [ruby-core:110088] " ioquatix (Samuel Williams)
2022-09-26 14:53 ` [ruby-core:110089] [Ruby master Bug#14900] " byroot (Jean Boussier)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).