[ruby-core:86954] [Ruby trunk Bug#14745] High memory usage when using String#replace with IO.copy

ruby-core@ruby-lang.org archive (unofficial mirror)
 help / color / mirror / Atom feed

* [ruby-core:86954] [Ruby trunk Bug#14745] High memory usage when using String#replace with IO.copy_stream
       [not found] <redmine.issue-14745.20180509125009@ruby-lang.org>
@ 2018-05-09 12:50 ` janko.marohnic
  2018-05-11  3:36   ` [ruby-core:86983] " Eric Wong
                     ` (2 more replies)
  2018-05-16  9:33 ` [ruby-core:87075] " janko.marohnic
  2018-05-17 12:28 ` [ruby-core:87136] " janko.marohnic
  2 siblings, 3 replies; 8+ messages in thread
From: janko.marohnic @ 2018-05-09 12:50 UTC (permalink / raw)
  To: ruby-core

Issue #14745 has been reported by janko (Janko Marohnić).

----------------------------------------
Bug #14745: High memory usage when using String#replace with IO.copy_stream
https://bugs.ruby-lang.org/issues/14745

* Author: janko (Janko Marohnić)
* Status: Open
* Priority: Normal
* Assignee: 
* Target version: 
* ruby -v: ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-darwin17]
* Backport: 2.3: UNKNOWN, 2.4: UNKNOWN, 2.5: UNKNOWN
----------------------------------------
I'm using custom IO-like objects that implement #read as the first argument to IO.copy_stream, and I noticed odd memory behaviour when using String#replace on the output buffer versus String#clear. Here is an example of a "fake IO" object where #read uses String#clear on the output buffer:

~~~ ruby
GC.disable

require "stringio"

class FakeIO
  def initialize(content)
    @io = StringIO.new(content)
  end

  def read(length, outbuf)
    chunk = @io.read(length)

    if chunk
      outbuf.clear
      outbuf << chunk
      chunk.clear
    else
      outbuf.clear
    end

    outbuf unless outbuf.empty?
  end
end

io = FakeIO.new("a" * 50*1024*1024) # 50MB

IO.copy_stream(io, File::NULL)

system "top -pid #{Process.pid}"
~~~

This program outputs memory usage of 50MB at the end, as expected – 50MB was loaded into memory at the beginning and any new strings are deallocated. However, if I modify the #read implementation to use String#replace instead of String#clear:

~~~ ruby
  def read(length, outbuf)
    chunk = @io.read(length)

    if chunk
      outbuf.replace chunk
      chunk.clear
    else
      outbuf.clear
    end

    outbuf unless outbuf.empty?
  end
~~~

the memory usage has now doubled to 100MB at the end of the program, indicating that some string bytes weren't successfully deallocated. So, it seems that String#replace has different behaviour compared to String#clear + String#<<.

I was *only* able to reproduce this with `IO.copy_stream`, the following program shows 50MB memory usage, regardless of whether the String#clear or String#replace approach is used:

~~~ ruby
GC.disable

buffer = "a" * 50*1024*1024
chunk  = "b" * 50*1024*1024

if ARGV[0] == "clear"
  buffer.clear
  buffer << chunk
else
  buffer.replace chunk
end

chunk.clear

system "top -pid #{Process.pid}"
~~~

With this program I also noticed one interesting thing. If I remove `chunk.clear`, then the "clear" version uses 100MB as expected (because both buffer and chunk strings are 50MB large), but the "replace" version uses only 50MB, which makes it appear that the `buffer` string doesn't use any memory when in fact it should use 50MB just like the `chunk` string. I found that odd, and I think it might be a clue to the memory bug with String#replace I experienced when using `IO.copy_stream`.

-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [ruby-core:86983] Re: [Ruby trunk Bug#14745] High memory usage when using String#replace with IO.copy_stream
  2018-05-09 12:50 ` [ruby-core:86954] [Ruby trunk Bug#14745] High memory usage when using String#replace with IO.copy_stream janko.marohnic
@ 2018-05-11  3:36   ` Eric Wong
  2018-05-11  4:26   ` [ruby-core:86987] " Eric Wong
  2018-05-11  5:43   ` [ruby-core:86988] " Eric Wong
  2 siblings, 0 replies; 8+ messages in thread
From: Eric Wong @ 2018-05-11  3:36 UTC (permalink / raw)
  To: ruby-core

janko.marohnic@gmail.com wrote:
> the memory usage has now doubled to 100MB at the end of the
> program, indicating that some string bytes weren't
> successfully deallocated. So, it seems that String#replace has
> different behaviour compared to String#clear + String#<<.

Yes, this is an unfortunate side effect because of copy-on-write
semantics of String#replace.  If the arg (other_str) for
String#replace is non-frozen, a new frozen string is created with
using the existing malloc-ed pointer.  Both the receiver string
and other_str point to that new, shared string.

> I was *only* able to reproduce this with `IO.copy_stream`

I suspect part of this is because outbuf becomes a long-lived object
with IO.copy_stream (not sure), and the hidden frozen string
becomes long-lived, as well.

So yeah; a combination of well-intentioned optimizations hurt
when combined together.

The other part could be anything using IO#write could create
massive amounts of garbage before 2.5:
https://bugs.ruby-lang.org/issues/13085 And your copy_stream use
hits the IO#write case.  Unfortunately, the "fix" for [Bug
#13085] won't work effectively with shared strings, because we
can't free+recycle the a string which wasn't created internally
by the VM.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [ruby-core:86987] Re: [Ruby trunk Bug#14745] High memory usage when using String#replace with IO.copy_stream
  2018-05-09 12:50 ` [ruby-core:86954] [Ruby trunk Bug#14745] High memory usage when using String#replace with IO.copy_stream janko.marohnic
  2018-05-11  3:36   ` [ruby-core:86983] " Eric Wong
@ 2018-05-11  4:26   ` Eric Wong
  2018-05-11  5:43   ` [ruby-core:86988] " Eric Wong
  2 siblings, 0 replies; 8+ messages in thread
From: Eric Wong @ 2018-05-11  4:26 UTC (permalink / raw)
  To: ruby-core

janko.marohnic@gmail.com wrote:
> ~~~ ruby
>   def read(length, outbuf)
>     chunk = @io.read(length)
> 
>     if chunk
>       outbuf.replace chunk
>       chunk.clear

Elaborating on my previous comment,  chunk.clear is a no-op in
this case because the shared frozen string is still in play and
now used by outbuf.

>     else
>       outbuf.clear

Likewise, the final #clear is also no-op because the
hidden+frozen string is still floating around waiting to be
GC-ed.

>     end
> 
>     outbuf unless outbuf.empty?
>   end
> ~~~

I don't think you need IO.copy_stream or IO#write to trigger, even.

~~~ ruby
io = FakeIO.new("a" * 50*1024*1024) # 50MB
buf = ''.b
while z = io.read(16384, buf)
end
~~~

I tried playing with shortening object lifetimes, but I guess GC
is not aggressive enough by default to matter.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [ruby-core:86988] Re: [Ruby trunk Bug#14745] High memory usage when using String#replace with IO.copy_stream
  2018-05-09 12:50 ` [ruby-core:86954] [Ruby trunk Bug#14745] High memory usage when using String#replace with IO.copy_stream janko.marohnic
  2018-05-11  3:36   ` [ruby-core:86983] " Eric Wong
  2018-05-11  4:26   ` [ruby-core:86987] " Eric Wong
@ 2018-05-11  5:43   ` Eric Wong
  2 siblings, 0 replies; 8+ messages in thread
From: Eric Wong @ 2018-05-11  5:43 UTC (permalink / raw)
  To: ruby-core

janko.marohnic@gmail.com wrote:
>   def initialize(content)
>     @io = StringIO.new(content)
>   end
> 
>   def read(length, outbuf)
>     chunk = @io.read(length)

Finally, I always assumed your example is a contrived case and
you're dealing with an interface somewhere (not StringIO) which
doesn't accept a destination buffer for .read.

Fwiw, I encountered a similar problem with Net::HTTP and
Zlib::GzipReader earlier this year and made r61665 to ensure
outbuf got used all the way down the call chain.

Sometimes I guess modifying/fixing interfaces is the only way to go.
(unless we can do escape analysis, but I'm not sure how feasible
 it is with a dynamic language)

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [ruby-core:87075] [Ruby trunk Bug#14745] High memory usage when using String#replace with IO.copy_stream
       [not found] <redmine.issue-14745.20180509125009@ruby-lang.org>
  2018-05-09 12:50 ` [ruby-core:86954] [Ruby trunk Bug#14745] High memory usage when using String#replace with IO.copy_stream janko.marohnic
@ 2018-05-16  9:33 ` janko.marohnic
  2018-05-16 10:07   ` [ruby-core:87077] " Eric Wong
  2018-05-17 12:28 ` [ruby-core:87136] " janko.marohnic
  2 siblings, 1 reply; 8+ messages in thread
From: janko.marohnic @ 2018-05-16  9:33 UTC (permalink / raw)
  To: ruby-core

Issue #14745 has been updated by janko (Janko Marohnić).

> Yes, this is an unfortunate side effect because of copy-on-write
> semantics of String#replace. If the arg (other_str) for
> String#replace is non-frozen, a new frozen string is created with
> using the existing malloc-ed pointer. Both the receiver string
> and other_str point to that new, shared string.
>
> So yeah; a combination of well-intentioned optimizations hurt
> when combined together.

That makes sense, thanks for the explanation.

> The other part could be anything using IO#write could create
> massive amounts of garbage before 2.5

This is on Ruby 2.5.1, so I'm guessing it doesn't apply here.

> Finally, I always assumed your example is a contrived case and
> you're dealing with an interface somewhere (not StringIO) which
> doesn't accept a destination buffer for .read.

The example was simplified for reproducing purposes. The place where I discovered this was in https://github.com/rocketjob/symmetric-encryption/pull/98 (I eventually managed to figure out `String#replace` was causing the high memory usage, so I switched to `String#clear`).

In short, the `SymmetricEncryption::Reader` object wraps an IO object with encrypted content, and when calling `#read` it reads data from the underlying IO object, decrypts it and returns the decrypted data. So, it's not patching the lack of outbuf argument (because the underlying IO object *should* accept the outbuf argument), rather it provides an `IO#read` interface over incrementally decrypting IO object content.

If this is necessary behaviour when having the copy-on-write optimization, feel free to close this ticket then.

----------------------------------------
Bug #14745: High memory usage when using String#replace with IO.copy_stream
https://bugs.ruby-lang.org/issues/14745#change-72040

* Author: janko (Janko Marohnić)
* Status: Open
* Priority: Normal
* Assignee: 
* Target version: 
* ruby -v: ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-darwin17]
* Backport: 2.3: UNKNOWN, 2.4: UNKNOWN, 2.5: UNKNOWN
----------------------------------------
I'm using custom IO-like objects that implement #read as the first argument to IO.copy_stream, and I noticed odd memory behaviour when using String#replace on the output buffer versus String#clear. Here is an example of a "fake IO" object where #read uses String#clear on the output buffer:

~~~ ruby
GC.disable

require "stringio"

class FakeIO
  def initialize(content)
    @io = StringIO.new(content)
  end

  def read(length, outbuf)
    chunk = @io.read(length)

    if chunk
      outbuf.clear
      outbuf << chunk
      chunk.clear
    else
      outbuf.clear
    end

    outbuf unless outbuf.empty?
  end
end

io = FakeIO.new("a" * 50*1024*1024) # 50MB

IO.copy_stream(io, File::NULL)

system "top -pid #{Process.pid}"
~~~

This program outputs memory usage of 50MB at the end, as expected – 50MB was loaded into memory at the beginning and any new strings are deallocated. However, if I modify the #read implementation to use String#replace instead of String#clear:

~~~ ruby
  def read(length, outbuf)
    chunk = @io.read(length)

    if chunk
      outbuf.replace chunk
      chunk.clear
    else
      outbuf.clear
    end

    outbuf unless outbuf.empty?
  end
~~~

the memory usage has now doubled to 100MB at the end of the program, indicating that some string bytes weren't successfully deallocated. So, it seems that String#replace has different behaviour compared to String#clear + String#<<.

I was *only* able to reproduce this with `IO.copy_stream`, the following program shows 50MB memory usage, regardless of whether the String#clear or String#replace approach is used:

~~~ ruby
GC.disable

buffer = "a" * 50*1024*1024
chunk  = "b" * 50*1024*1024

if ARGV[0] == "clear"
  buffer.clear
  buffer << chunk
else
  buffer.replace chunk
end

chunk.clear

system "top -pid #{Process.pid}"
~~~

With this program I also noticed one interesting thing. If I remove `chunk.clear`, then the "clear" version uses 100MB as expected (because both buffer and chunk strings are 50MB large), but the "replace" version uses only 50MB, which makes it appear that the `buffer` string doesn't use any memory when in fact it should use 50MB just like the `chunk` string. I found that odd, and I think it might be a clue to the memory bug with String#replace I experienced when using `IO.copy_stream`.

-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [ruby-core:87077] Re: [Ruby trunk Bug#14745] High memory usage when using String#replace with IO.copy_stream
  2018-05-16  9:33 ` [ruby-core:87075] " janko.marohnic
@ 2018-05-16 10:07   ` Eric Wong
  0 siblings, 0 replies; 8+ messages in thread
From: Eric Wong @ 2018-05-16 10:07 UTC (permalink / raw)
  To: ruby-core

janko.marohnic@gmail.com wrote:
> I wrote:
> > Finally, I always assumed your example is a contrived case and
> > you're dealing with an interface somewhere (not StringIO) which
> > doesn't accept a destination buffer for .read.
 
> The example was simplified for reproducing purposes. The place
> where I discovered this was in
> https://github.com/rocketjob/symmetric-encryption/pull/98 (I
> eventually managed to figure out `String#replace` was causing
> the high memory usage, so I switched to `String#clear`).
> 
> In short, the `SymmetricEncryption::Reader` object wraps an IO
> object with encrypted content, and when calling `#read` it
> reads data from the underlying IO object, decrypts it and
> returns the decrypted data. So, it's not patching the lack of
> outbuf argument (because the underlying IO object *should*
> accept the outbuf argument), rather it provides an `IO#read`
> interface over incrementally decrypting IO object content.

I may be misreading(*), but it looks like @stream_cipher.update
can take a second destination arg (like IO#read and friends) and
maybe that helps...  (that appears to be OpenSSL::Cipher#update)

If that's not usable somehow, I've sometimes wondered if adding a
String#exchange! method might help:

	foo = "hello............................"
	bar = "world............................"
	foo.exchange!(bar)
	p foo # => "world............................"
	p bar # => "hello............................"

In your case, you could still use String#clear with this:

	dst.clear
	dst.exchange!(tmp)
	# tmp is now '' because dst was cleared before


For non-embed strings, it would swap the pointers and be zero-copy,
but won't suffer the new-frozen-string-behind-your-back problem
of String#replace.



(*) I'm looking at your commit 8d41efacff26f3357016d6b611bee174802fba66
    after git clone-ing (since I don't do JS-infested web UIs)

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [ruby-core:87136] [Ruby trunk Bug#14745] High memory usage when using String#replace with IO.copy_stream
       [not found] <redmine.issue-14745.20180509125009@ruby-lang.org>
  2018-05-09 12:50 ` [ruby-core:86954] [Ruby trunk Bug#14745] High memory usage when using String#replace with IO.copy_stream janko.marohnic
  2018-05-16  9:33 ` [ruby-core:87075] " janko.marohnic
@ 2018-05-17 12:28 ` janko.marohnic
  2018-05-18 18:52   ` [ruby-core:87176] " Eric Wong
  2 siblings, 1 reply; 8+ messages in thread
From: janko.marohnic @ 2018-05-17 12:28 UTC (permalink / raw)
  To: ruby-core

Issue #14745 has been updated by janko (Janko Marohnić).

> it looks like @stream_cipher.update can take a second destination arg (like IO#read and friends) and maybe that helps... (that appears to be OpenSSL::Cipher#update)

Wow, I didn't know that. Thanks, this allowed me to greatly simplify the `#read` implementation, so now it's both simpler and has lower memory usage than the original implementation.

> I've sometimes wondered if adding a String#exchange! method might help

That sounds very useful to have in general. In my particular scenario (the PR) I wouldn't have use of it after all, this `String#clear` vs `String#replace` thing was just something I encountered during the optimization, the current version doesn't need `String#replace`. But I can imagine `String#exchange!` being a useful tool for diligent string management.

I'll close this ticket, as this seems to be a necessary consequence of the copy-on-write optimization, and I'm not having problems with it anymore.

----------------------------------------
Bug #14745: High memory usage when using String#replace with IO.copy_stream
https://bugs.ruby-lang.org/issues/14745#change-72123

* Author: janko (Janko Marohnić)
* Status: Open
* Priority: Normal
* Assignee: 
* Target version: 
* ruby -v: ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-darwin17]
* Backport: 2.3: UNKNOWN, 2.4: UNKNOWN, 2.5: UNKNOWN
----------------------------------------
I'm using custom IO-like objects that implement #read as the first argument to IO.copy_stream, and I noticed odd memory behaviour when using String#replace on the output buffer versus String#clear. Here is an example of a "fake IO" object where #read uses String#clear on the output buffer:

~~~ ruby
GC.disable

require "stringio"

class FakeIO
  def initialize(content)
    @io = StringIO.new(content)
  end

  def read(length, outbuf)
    chunk = @io.read(length)

    if chunk
      outbuf.clear
      outbuf << chunk
      chunk.clear
    else
      outbuf.clear
    end

    outbuf unless outbuf.empty?
  end
end

io = FakeIO.new("a" * 50*1024*1024) # 50MB

IO.copy_stream(io, File::NULL)

system "top -pid #{Process.pid}"
~~~

This program outputs memory usage of 50MB at the end, as expected – 50MB was loaded into memory at the beginning and any new strings are deallocated. However, if I modify the #read implementation to use String#replace instead of String#clear:

~~~ ruby
  def read(length, outbuf)
    chunk = @io.read(length)

    if chunk
      outbuf.replace chunk
      chunk.clear
    else
      outbuf.clear
    end

    outbuf unless outbuf.empty?
  end
~~~

the memory usage has now doubled to 100MB at the end of the program, indicating that some string bytes weren't successfully deallocated. So, it seems that String#replace has different behaviour compared to String#clear + String#<<.

I was *only* able to reproduce this with `IO.copy_stream`, the following program shows 50MB memory usage, regardless of whether the String#clear or String#replace approach is used:

~~~ ruby
GC.disable

buffer = "a" * 50*1024*1024
chunk  = "b" * 50*1024*1024

if ARGV[0] == "clear"
  buffer.clear
  buffer << chunk
else
  buffer.replace chunk
end

chunk.clear

system "top -pid #{Process.pid}"
~~~

With this program I also noticed one interesting thing. If I remove `chunk.clear`, then the "clear" version uses 100MB as expected (because both buffer and chunk strings are 50MB large), but the "replace" version uses only 50MB, which makes it appear that the `buffer` string doesn't use any memory when in fact it should use 50MB just like the `chunk` string. I found that odd, and I think it might be a clue to the memory bug with String#replace I experienced when using `IO.copy_stream`.

-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [ruby-core:87176] Re: [Ruby trunk Bug#14745] High memory usage when using String#replace with IO.copy_stream
  2018-05-17 12:28 ` [ruby-core:87136] " janko.marohnic
@ 2018-05-18 18:52   ` Eric Wong
  0 siblings, 0 replies; 8+ messages in thread
From: Eric Wong @ 2018-05-18 18:52 UTC (permalink / raw)
  To: ruby-core

Good to know the destination buffer to #update helped :>

> But I can imagine `String#exchange!` being a useful tool for
> diligent string management.

Maybe... As it stands, destination buffers for #read and #update
are underutilized and I'm glad to see them being used more.
Thus I'm not hopeful of #exchange! being widely adopted...
When used, destination buffers would still offer better performance,
anyways, as there's no extra object slot used.

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2018-05-18 18:52 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <redmine.issue-14745.20180509125009@ruby-lang.org>
2018-05-09 12:50 ` [ruby-core:86954] [Ruby trunk Bug#14745] High memory usage when using String#replace with IO.copy_stream janko.marohnic
2018-05-11  3:36   ` [ruby-core:86983] " Eric Wong
2018-05-11  4:26   ` [ruby-core:86987] " Eric Wong
2018-05-11  5:43   ` [ruby-core:86988] " Eric Wong
2018-05-16  9:33 ` [ruby-core:87075] " janko.marohnic
2018-05-16 10:07   ` [ruby-core:87077] " Eric Wong
2018-05-17 12:28 ` [ruby-core:87136] " janko.marohnic
2018-05-18 18:52   ` [ruby-core:87176] " Eric Wong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).