ruby-core@ruby-lang.org archive (unofficial mirror)
* [ruby-core:109081] [Ruby master Feature#18885] Long lived fork advisory API (potential Copy on Write optimizations)
@ 2022-06-28 13:21 byroot (Jean Boussier)
  2022-06-30  9:27 ` [ruby-core:109098] " byroot (Jean Boussier)
                   ` (21 more replies)
  0 siblings, 22 replies; 23+ messages in thread
From: byroot (Jean Boussier) @ 2022-06-28 13:21 UTC (permalink / raw)
  To: ruby-core

Issue #18885 has been reported by byroot (Jean Boussier).

----------------------------------------
Feature #18885: Long lived fork advisory API (potential Copy on Write optimizations)
https://bugs.ruby-lang.org/issues/18885

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
----------------------------------------
### Context

It is rather common to deploy Ruby with forking servers. A process first loads the code and data of the application, and then forks a number of workers to handle the incoming workload.
The advantage is that each child has its own GVL and its own GC, so they don't impact each other's latency. The downside, however, is that it uses more memory than threads or fibers.
That increased memory usage is largely mitigated by Copy on Write, but it's far from perfect. Over time, various memory regions get written into and unshared.

The classic example is object generations: young objects must be promoted to the old generation before forking, otherwise the pages holding them will be invalidated on the next GC run. That's what https://github.com/ko1/nakayoshi_fork addresses.

But there are other sources of CoW invalidation that could be addressed by MRI if it had a clear notification of when to do so.

### Proposal

MRI could assume that any `fork` may be long lived and perform all the optimizations it can at that point, but it may be preferable to have a dedicated API for that, e.g.:

  - `Process.fork(long_lived: true)`
  - `Process.long_lived_fork`
  - `RubyVM.prepare_for_long_lived_fork`
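
For illustration, here is how a preforking server might use the last variant. Everything in this sketch is hypothetical: neither the `RubyVM` method nor the application names exist today.

```
require "my_app"                    # hypothetical application entry point

RubyVM.prepare_for_long_lived_fork  # hypothetical: major GC, promotion,
                                    # compaction, cache warm-up, etc.

8.times do
  fork do
    MyApp.run_worker                # hypothetical worker loop; children start
  end                               # with as many clean shared pages as possible
end

Process.waitall
```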

### Potential optimizations

`nakayoshi_fork` already does the following:

  - Do a major GC run to get rid of as many dangling objects as possible.
  - Promote all surviving objects to the highest generation.
  - Compact the heap.

But it would be much simpler to do this from inside the VM rather than doing cryptic things such as `4.times { GC.start }` from the Ruby side.
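
For reference, a minimal Ruby-side sketch of that warm-up. The helper method below is illustrative only; the gem itself decorates `fork` rather than exposing such a method.

```
# Simplified nakayoshi_fork-style warm-up, done from Ruby before forking.
def prewarm_heap_then_fork(&block)
  # Repeated major GCs promote surviving young objects to the old generation
  # (objects become "old" after surviving a few GC runs).
  4.times { GC.start(full_mark: true, immediate_sweep: true) }
  # Compaction packs surviving objects together so fewer pages get dirtied later.
  GC.compact if GC.respond_to?(:compact) # Ruby >= 2.7
  fork(&block)
end
```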

Also after discussing with @jhawthorn, @tenderlovemaking and @alanwu, we believe this would open the door to several other CoW optimizations:

#### Precompute inline caches

Even though we don't have hard data to prove it, we are convinced that a big source of CoW invalidation is inline caches. Most ISeqs are never invoked during initialization, so child processes are forked with mostly cold caches. As a result, the first time a method is executed in a child, many memory pages holding ISeqs are invalidated as caches get updated.

We think MRI could try to precompute these caches before forking children. Constant caches in particular should be resolvable statically (somewhat related: https://github.com/ruby/ruby/pull/6049).

Method caches are harder to resolve statically, but we can probably apply some heuristics to at least reduce the cache misses.
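
As a small illustration of the problem rather than of the proposed fix (`App::CONFIG` is a hypothetical constant):

```
# The constant cache embedded in this method's ISeq stays empty until the
# method runs for the first time. If that first run happens in a forked child,
# the page holding the ISeq is written to and becomes unshared in that child.
def app_config
  App::CONFIG # constant lookup backed by an inline cache, cold until first call
end

# Calling it once in the parent, before forking, fills the cache so children
# inherit an already-warm, unmodified page.
app_config if defined?(App::CONFIG)
```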

#### Copy on Write aware GC

We could also keep some metadata about which memory pages are shared, or even introduce a "permanent" generation. [The Instagram engineering team introduced something like that in Python](https://instagram-engineering.com/copy-on-write-friendly-python-garbage-collection-ad6ed5233ddf) ([ticket](https://bugs.python.org/issue31558), [PR](https://github.com/python/cpython/pull/3705)).

That makes the GC aware of which objects live on a shared page. With this information, the GC can decide not to free dangling objects living on these pages, not to compact these pages, etc.




-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [ruby-core:109098] [Ruby master Feature#18885] Long lived fork advisory API (potential Copy on Write optimizations)
  2022-06-28 13:21 [ruby-core:109081] [Ruby master Feature#18885] Long lived fork advisory API (potential Copy on Write optimizations) byroot (Jean Boussier)
@ 2022-06-30  9:27 ` byroot (Jean Boussier)
  2022-07-16  3:19 ` [ruby-core:109227] " ioquatix (Samuel Williams)
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: byroot (Jean Boussier) @ 2022-06-30  9:27 UTC (permalink / raw)
  To: ruby-core

Issue #18885 has been updated by byroot (Jean Boussier).

Description updated

Another possible optimization I just found:

Strings have a lazily computed `coderange` attribute stored in their flags. So if a string is allocated at boot but only used after fork, its coderange gets computed on first use in the child, mutating the string and invalidating the shared page.

Using https://github.com/ruby/ruby/pull/6076, I noticed that 58% of the strings retained at the end of the boot sequence had an `UNKNOWN` coderange.

So eagerly scanning the coderange of all strings could also improve Copy on Write performance.
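
As a rough illustration, something similar can already be approximated from Ruby by walking the heap, although a VM-level hook could do it much more cheaply:

```
# Force the lazily computed coderange of every retained String before forking.
# String#valid_encoding? computes and caches the coderange as a side effect,
# so children won't dirty those pages on first use.
ObjectSpace.each_object(String) do |str|
  str.valid_encoding?
end
```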


This also makes me think that this API isn't only useful for forking setups. Even if you only use threads or fibers, you may want to tell the VM that you are done loading and that it's now time to perform optimizations. So the API may deserve a more generic name.

-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [ruby-core:109227] [Ruby master Feature#18885] Long lived fork advisory API (potential Copy on Write optimizations)
  2022-06-28 13:21 [ruby-core:109081] [Ruby master Feature#18885] Long lived fork advisory API (potential Copy on Write optimizations) byroot (Jean Boussier)
  2022-06-30  9:27 ` [ruby-core:109098] " byroot (Jean Boussier)
@ 2022-07-16  3:19 ` ioquatix (Samuel Williams)
  2022-07-27 16:55 ` [ruby-core:109339] " byroot (Jean Boussier)
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: ioquatix (Samuel Williams) @ 2022-07-16  3:19 UTC (permalink / raw)
  To: ruby-core

Issue #18885 has been updated by ioquatix (Samuel Williams).


This is a really nice idea. My current implementation uses `GC.compact` before forking, and it shows a big advantage. I'm happy to test any proposals with real-world workloads.



-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [ruby-core:109339] [Ruby master Feature#18885] Long lived fork advisory API (potential Copy on Write optimizations)
  2022-06-28 13:21 [ruby-core:109081] [Ruby master Feature#18885] Long lived fork advisory API (potential Copy on Write optimizations) byroot (Jean Boussier)
  2022-06-30  9:27 ` [ruby-core:109098] " byroot (Jean Boussier)
  2022-07-16  3:19 ` [ruby-core:109227] " ioquatix (Samuel Williams)
@ 2022-07-27 16:55 ` byroot (Jean Boussier)
  2022-07-30  2:33 ` [ruby-core:109380] " Dan0042 (Daniel DeLorme)
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: byroot (Jean Boussier) @ 2022-07-27 16:55 UTC (permalink / raw)
  To: ruby-core

Issue #18885 has been updated by byroot (Jean Boussier).


Another optimization that could be invoked from this method is `malloc_trim`.

-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [ruby-core:109380] [Ruby master Feature#18885] Long lived fork advisory API (potential Copy on Write optimizations)
  2022-06-28 13:21 [ruby-core:109081] [Ruby master Feature#18885] Long lived fork advisory API (potential Copy on Write optimizations) byroot (Jean Boussier)
                   ` (2 preceding siblings ...)
  2022-07-27 16:55 ` [ruby-core:109339] " byroot (Jean Boussier)
@ 2022-07-30  2:33 ` Dan0042 (Daniel DeLorme)
  2022-07-30  6:19 ` [ruby-core:109381] " byroot (Jean Boussier)
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Dan0042 (Daniel DeLorme) @ 2022-07-30  2:33 UTC (permalink / raw)
  To: ruby-core

Issue #18885 has been updated by Dan0042 (Daniel DeLorme).


I think the state of Copy-on-Write is already pretty decent, but any improvement is of course very welcome. As to naming, since this is mainly for preforking servers, what about `Process.prefork`?

-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [ruby-core:109381] [Ruby master Feature#18885] Long lived fork advisory API (potential Copy on Write optimizations)
  2022-06-28 13:21 [ruby-core:109081] [Ruby master Feature#18885] Long lived fork advisory API (potential Copy on Write optimizations) byroot (Jean Boussier)
                   ` (3 preceding siblings ...)
  2022-07-30  2:33 ` [ruby-core:109380] " Dan0042 (Daniel DeLorme)
@ 2022-07-30  6:19 ` byroot (Jean Boussier)
  2022-08-02  8:56 ` [ruby-core:109409] " mame (Yusuke Endoh)
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: byroot (Jean Boussier) @ 2022-07-30  6:19 UTC (permalink / raw)
  To: ruby-core

Issue #18885 has been updated by byroot (Jean Boussier).


> the state of Copy-on-Write is already pretty decent,

It depends how you look at it. In the few apps where I optimized CoW as much as I could, only between 50% and 60% of the parent process memory is shared. That really isn't that good.
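
For reference, one way to measure this on Linux is to compare shared vs. private memory in a worker's `/proc/<pid>/smaps_rollup` (an illustrative helper, not necessarily how the numbers above were produced):

```
# Estimate how much of a process's memory is shared with other processes
# (e.g. the parent it was forked from). Linux only; smaps_rollup values are in kB.
def cow_share_ratio(pid = Process.pid)
  stats = Hash.new(0)
  File.foreach("/proc/#{pid}/smaps_rollup") do |line|
    key, value = line.split(":", 2)
    stats[key] += value.to_i if value
  end
  shared   = stats["Shared_Clean"] + stats["Shared_Dirty"]
  unshared = stats["Private_Clean"] + stats["Private_Dirty"]
  shared.fdiv(shared + unshared)
end
```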

-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [ruby-core:109409] [Ruby master Feature#18885] Long lived fork advisory API (potential Copy on Write optimizations)
  2022-06-28 13:21 [ruby-core:109081] [Ruby master Feature#18885] Long lived fork advisory API (potential Copy on Write optimizations) byroot (Jean Boussier)
                   ` (4 preceding siblings ...)
  2022-07-30  6:19 ` [ruby-core:109381] " byroot (Jean Boussier)
@ 2022-08-02  8:56 ` mame (Yusuke Endoh)
  2022-08-02  9:03 ` [ruby-core:109410] " byroot (Jean Boussier)
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: mame (Yusuke Endoh) @ 2022-08-02  8:56 UTC (permalink / raw)
  To: ruby-core

Issue #18885 has been updated by mame (Yusuke Endoh).


We discussed this issue at the dev meeting. We did not reach any conclusion, but I'd like to share some comments.

### What and how efficient is this proposal?

Some attendees wanted to see a quantitative evaluation of the benefits this proposal would bring.
@ko1 said that he created nakayoshi_fork as a joke gem. He didn't expect people to use it seriously, and he didn't have serious quantitative measurements.

(I've heard people say that memory usage has been reduced by nakayoshi_fork, but it would be nice to properly confirm this advantage before introduction.)

### How is it integrated with `Process._fork`?

`Process._fork` has been introduced as a zero-argument API. This API is supposed to be overridden, so we cannot easily add an argument to it.
If we keep `Process._fork` as is, we would need to do the GC work that nakayoshi_fork does *before* the `Process._fork` hook. Is that OK?

### Are "short-lived" forks needed?

How much are "short-lived" forks used nowadays? The major use case, where `Process.exec` is called shortly after `Process.fork`, is covered by `Process.spawn`.
If there are few use cases for "short-lived" forks, we may change the default behavior to "long-lived".
However, we sometimes use fork in tests, to spawn a temporary web server for example. Running GC on every fork call might be too heavy.

### Is GC called whenever `fork(long_lived: true)` is called?

Here is typical server code that uses fork:

```
loop do
  sock = servsock.accept
  if fork(long_lived: true)
    ...
  end
end
```

The parent process creates only a socket object on each iteration. It seems somewhat wasteful to run a full GC in the parent process every time `fork(long_lived: true)` is called. A more intelligent strategy may be preferable here.
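
For example, the heavy preparation could be an explicit call made once before the accept/fork loop rather than a `fork` option (hypothetical method name, reusing `servsock` from the example above; `handle_request` is a stand-in):

```
RubyVM.prepare_for_long_lived_fork # hypothetical: run the heavy work only once

loop do
  sock = servsock.accept
  if fork
    sock.close            # parent: the child owns the connection now
  else
    handle_request(sock)  # child: serve the request, then exit
    exit
  end
end
```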

-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [ruby-core:109410] [Ruby master Feature#18885] Long lived fork advisory API (potential Copy on Write optimizations)
  2022-06-28 13:21 [ruby-core:109081] [Ruby master Feature#18885] Long lived fork advisory API (potential Copy on Write optimizations) byroot (Jean Boussier)
                   ` (5 preceding siblings ...)
  2022-08-02  8:56 ` [ruby-core:109409] " mame (Yusuke Endoh)
@ 2022-08-02  9:03 ` byroot (Jean Boussier)
  2022-08-03  1:32 ` [ruby-core:109417] " mame (Yusuke Endoh)
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: byroot (Jean Boussier) @ 2022-08-02  9:03 UTC (permalink / raw)
  To: ruby-core

Issue #18885 has been updated by byroot (Jean Boussier).


@mame 

> it would be nice to properly confirm this advantage before introduction.

https://bugs.ruby-lang.org/issues/11164 is an example of how bad things can go without nakayoshi_fork (or similar). I can get production data from some of our apps if you wish, but the effect is going to be very app-dependent, so I'm not sure it's very relevant. You can craft demo apps for which memory usage totally blows up if you don't promote objects to the old generation before forking.

> How is it integrated with Process._fork?

Since I wrote this, I've become convinced that it shouldn't be a `fork` argument, but a distinct API on `RubyVM`: if you fork multiple workers, you don't want to run these things again, as that could invalidate CoW pages in the previously forked workers.

So IMO `RubyVM.prepare_for_long_lived_fork` is the proper API (aside from the name, which I doubt is desirable).

Also please note that several of the proposed optimizations can only be done from inside the VM, so decorating `fork` like nakayoshi_fork does is not an option. Hence the main reason for this proposal.

-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [ruby-core:109417] [Ruby master Feature#18885] Long lived fork advisory API (potential Copy on Write optimizations)
  2022-06-28 13:21 [ruby-core:109081] [Ruby master Feature#18885] Long lived fork advisory API (potential Copy on Write optimizations) byroot (Jean Boussier)
                   ` (6 preceding siblings ...)
  2022-08-02  9:03 ` [ruby-core:109410] " byroot (Jean Boussier)
@ 2022-08-03  1:32 ` mame (Yusuke Endoh)
  2022-08-03  6:42 ` [ruby-core:109420] " byroot (Jean Boussier)
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: mame (Yusuke Endoh) @ 2022-08-03  1:32 UTC (permalink / raw)
  To: ruby-core

Issue #18885 has been updated by mame (Yusuke Endoh).


> So IMO RubyVM.prepare_for_long_lived_fork is the proper API (aside from the name which I doubt is desirable).

After calling this, all `fork` calls are treated as "long-lived". Is my understanding right?

> Also please note that several of the proposed optimizations can only be done from inside the VM, so decorating `fork` like nakayoshi_fork does is not an option.

I know that, but it seems hard to me to convince the committers to change the API first for optimizations that have not been implemented yet and whose effectiveness is unknown. IMO, it is good to focus on the use case of nakayoshi_fork since it is already implemented and used by quite a few people. If there were a proper evaluation of the effect of nakayoshi_fork, it would be easier to persuade @matz.

-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [ruby-core:109420] [Ruby master Feature#18885] Long lived fork advisory API (potential Copy on Write optimizations)
  2022-06-28 13:21 [ruby-core:109081] [Ruby master Feature#18885] Long lived fork advisory API (potential Copy on Write optimizations) byroot (Jean Boussier)
                   ` (7 preceding siblings ...)
  2022-08-03  1:32 ` [ruby-core:109417] " mame (Yusuke Endoh)
@ 2022-08-03  6:42 ` byroot (Jean Boussier)
  2022-08-03  7:10 ` [ruby-core:109421] [Ruby master Feature#18885] End of boot advisory API for RubyVM byroot (Jean Boussier)
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: byroot (Jean Boussier) @ 2022-08-03  6:42 UTC (permalink / raw)
  To: ruby-core

Issue #18885 has been updated by byroot (Jean Boussier).


> After calling this, all fork calls are treated as "long-lived". Is my understanding right?

Well, this wouldn't change anything in the `Process.fork` implementation. I think I need to rewrite the ticket description because it is now confusing; I'll do it in a minute.

Also, as said before, I don't even think this is specific to forking servers anymore; I think `RubyVM.make_ready` or something like that would be just fine. Even if you don't fork, optimizations such as precomputing inline caches could improve the performance of the first request.

>  it is good to focus on the use case of nakayoshi_fork 

Ok, so here's a thread from when Puma added it as an option two years ago, https://github.com/puma/puma/issues/2258#issuecomment-630510423

> After fixing the config bug in nakayoshi_fork, Codetriage is now showing about a 10% reduction in memory usage 

Some other people report good numbers too, but generally they enabled other changes at the same time.

-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [ruby-core:109421] [Ruby master Feature#18885] End of boot advisory API for RubyVM
  2022-06-28 13:21 [ruby-core:109081] [Ruby master Feature#18885] Long lived fork advisory API (potential Copy on Write optimizations) byroot (Jean Boussier)
                   ` (8 preceding siblings ...)
  2022-08-03  6:42 ` [ruby-core:109420] " byroot (Jean Boussier)
@ 2022-08-03  7:10 ` byroot (Jean Boussier)
  2022-08-10 18:21 ` [ruby-core:109469] " Dan0042 (Daniel DeLorme)
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: byroot (Jean Boussier) @ 2022-08-03  7:10 UTC (permalink / raw)
  To: ruby-core

Issue #18885 has been updated by byroot (Jean Boussier).

Description updated
Subject changed from Long lived fork advisory API (potential Copy on Write optimizations) to End of boot advisory API for RubyVM

Ok, I updated the description. It's still very much focused on CoW, but hopefully it's now clearer that it's not the only benefit.

Also, it now only asks for a method on `RubyVM`, which could perfectly well be marked as experimental, so API change concerns should be minimal.

----------------------------------------
Feature #18885: End of boot advisory API for RubyVM
https://bugs.ruby-lang.org/issues/18885#change-98573

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
----------------------------------------
### Context

Many optimizations in the Ruby VM rely on lazily computed caches: string coderanges, constant caches, method caches, etc.
As such, even without a JIT, some operations need a bit of warm-up, and these caches may be flushed when new constants are defined, new code is loaded, or some objects are mutated.

Additionally, these lazily computed caches can cause increased memory usage for applications relying on Copy-on-Write memory.
Whenever one of these caches is updated post-fork, the entire memory page is invalidated. Precomputing these caches at the end of boot,
even if based on heuristics, could improve Copy-on-Write performance.

The classic example is object generations: young objects must be promoted to the old generation before forking, otherwise the pages holding them will be invalidated on the next GC run. That's what https://github.com/ko1/nakayoshi_fork addresses.

But there are other sources of CoW invalidation that could be addressed by MRI if it had a clear notification of when to do so.

### Proposal

If applications had an API to notify the virtual machine that they're done loading code and are about to start processing user input,
it would give the VM a good point in time to perform optimizations on the existing code and objects.

e.g. it could be something like `RubyVM.prepare` or `RubyVM.ready`.

It's somewhat similar to [Matz's static barrier idea from RubyConf 2020](https://youtu.be/JojpqfaPhjI?t=1908), except that it wouldn't disable any feature.

### Potential optimizations

`nakayoshi_fork` already does the following:

  - Do a major GC run to get rid of as many dangling objects as possible.
  - Promote all surviving objects to the highest generation.
  - Compact the heap.

But it would be much simpler to do this from inside the VM rather than doing cryptic things such as `4.times { GC.start }` from the Ruby side.

It's also not good to do this on every fork: once you have forked the first long-lived child, you shouldn't run it again. So decorating `fork` is not a good hook point.

Also after discussing with @jhawthorn, @tenderlovemaking and @alanwu, we believe this would open the door to several other CoW optimizations:

#### Precompute inline caches

Even though we don't have hard data to prove it, we are convinced that a big source of CoW invalidation is inline caches. Most ISeqs are never invoked during initialization, so child processes are forked with mostly cold caches. As a result, the first time a method is executed in a child, many memory pages holding ISeqs are invalidated as caches get updated.

We think MRI could try to precompute these caches before forking children. Constant caches in particular should be resolvable statically, see https://github.com/ruby/ruby/pull/6187.

Method caches are harder to resolve statically, but we can probably apply some heuristics to at least reduce the cache misses.

#### Copy on Write aware GC

We could also keep some metadata about which memory pages are shared, or even introduce a "permanent" generation. [The Instagram engineering team introduced something like that in Python](https://instagram-engineering.com/copy-on-write-friendly-python-garbage-collection-ad6ed5233ddf) ([ticket](https://bugs.python.org/issue31558), [PR](https://github.com/python/cpython/pull/3705)).

That makes the GC aware of which objects live on a shared page. With this information, the GC can decide not to free dangling objects living on these pages, not to compact these pages, etc.

#### Scan the coderange of all strings

Strings have a lazily computed `coderange` attribute stored in their flags. So if a string is allocated at boot but only used after fork, its coderange may need to be computed on first use, mutating the string.

Using https://github.com/ruby/ruby/pull/6076, I noticed that 58% of the strings retained at the end of the boot sequence had an `UNKNOWN` coderange.

So eagerly scanning the coderange of all strings could also improve Copy on Write performance.

#### malloc_trim

This hook will also be a good point to release unused pages to the system with `malloc_trim`.
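
`malloc_trim` is glibc-specific, so this really belongs inside the VM, but as a rough illustration of the effect, it can be reached from Ruby through Fiddle (this snippet only works on glibc; elsewhere the symbol lookup raises):

```ruby
# glibc-only: ask malloc to return free heap pages to the kernel.
require "fiddle"

malloc_trim = Fiddle::Function.new(
  Fiddle::Handle::DEFAULT["malloc_trim"], # resolve the glibc symbol
  [Fiddle::TYPE_SIZE_T],                  # pad: slack to keep; 0 = trim as much as possible
  Fiddle::TYPE_INT                        # returns 1 if memory was released, 0 otherwise
)
malloc_trim.call(0)
```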



-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [ruby-core:109469] [Ruby master Feature#18885] End of boot advisory API for RubyVM
  2022-06-28 13:21 [ruby-core:109081] [Ruby master Feature#18885] Long lived fork advisory API (potential Copy on Write optimizations) byroot (Jean Boussier)
                   ` (9 preceding siblings ...)
  2022-08-03  7:10 ` [ruby-core:109421] [Ruby master Feature#18885] End of boot advisory API for RubyVM byroot (Jean Boussier)
@ 2022-08-10 18:21 ` Dan0042 (Daniel DeLorme)
  2022-08-10 18:24 ` [ruby-core:109470] " byroot (Jean Boussier)
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Dan0042 (Daniel DeLorme) @ 2022-08-10 18:21 UTC (permalink / raw)
  To: ruby-core

Issue #18885 has been updated by Dan0042 (Daniel DeLorme).


I think the terminology used here might cause some confusion in the discussion.

"End of boot" makes it sound like this API would be useful for non-forking servers once they have finished their "boot" sequence. But from what I understand this is still very much a fork-specific API. Is there any point to precompute inline caches if there is no fork?

"Long lived" children processes are not really the point I think? Imagine a (ridiculous) architecture where the parent keeps spawning children and each child serves a single request before dying. Despite being short-lived, these processes would benefit from this API. So it's not about preparing *children* for being long-lived, it's about preparing the *parent* for having *any* children.

----------------------------------------
Feature #18885: End of boot advisory API for RubyVM
https://bugs.ruby-lang.org/issues/18885#change-98630

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
----------------------------------------



-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [ruby-core:109470] [Ruby master Feature#18885] End of boot advisory API for RubyVM
  2022-06-28 13:21 [ruby-core:109081] [Ruby master Feature#18885] Long lived fork advisory API (potential Copy on Write optimizations) byroot (Jean Boussier)
                   ` (10 preceding siblings ...)
  2022-08-10 18:21 ` [ruby-core:109469] " Dan0042 (Daniel DeLorme)
@ 2022-08-10 18:24 ` byroot (Jean Boussier)
  2022-08-18  6:51 ` [ruby-core:109528] " matz (Yukihiro Matsumoto)
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: byroot (Jean Boussier) @ 2022-08-10 18:24 UTC (permalink / raw)
  To: ruby-core

Issue #18885 has been updated by byroot (Jean Boussier).


> Is there any point to precompute inline caches if there is no fork?

Yes, the first "request" (or whatever your unit of work is) won't have to do it. So you are moving some work to boot time, instead of user input processing time.

> these processes would benefit from this API.

For the CoW parts, no, not much. If the child isn't going to live long, it's unlikely to invalidate that many pages.

----------------------------------------
Feature #18885: End of boot advisory API for RubyVM
https://bugs.ruby-lang.org/issues/18885#change-98631

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
----------------------------------------



-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [ruby-core:109528] [Ruby master Feature#18885] End of boot advisory API for RubyVM
  2022-06-28 13:21 [ruby-core:109081] [Ruby master Feature#18885] Long lived fork advisory API (potential Copy on Write optimizations) byroot (Jean Boussier)
                   ` (11 preceding siblings ...)
  2022-08-10 18:24 ` [ruby-core:109470] " byroot (Jean Boussier)
@ 2022-08-18  6:51 ` matz (Yukihiro Matsumoto)
  2022-08-18  6:55 ` [ruby-core:109529] " byroot (Jean Boussier)
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: matz (Yukihiro Matsumoto) @ 2022-08-18  6:51 UTC (permalink / raw)
  To: ruby-core

Issue #18885 has been updated by matz (Yukihiro Matsumoto).


I am OK with adding this feature, but I have some concerns with the place and the name.
`RubyVM` is not globally available (e.g., not for JRuby or TruffleRuby). And I don't think `prepare` or `ready` describes the whole functionality.

Matz.


----------------------------------------
Feature #18885: End of boot advisory API for RubyVM
https://bugs.ruby-lang.org/issues/18885#change-98698

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
----------------------------------------



-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [ruby-core:109529] [Ruby master Feature#18885] End of boot advisory API for RubyVM
  2022-06-28 13:21 [ruby-core:109081] [Ruby master Feature#18885] Long lived fork advisory API (potential Copy on Write optimizations) byroot (Jean Boussier)
                   ` (12 preceding siblings ...)
  2022-08-18  6:51 ` [ruby-core:109528] " matz (Yukihiro Matsumoto)
@ 2022-08-18  6:55 ` byroot (Jean Boussier)
  2022-08-18  7:14 ` [ruby-core:109531] " Eregon (Benoit Daloze)
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: byroot (Jean Boussier) @ 2022-08-18  6:55 UTC (permalink / raw)
  To: ruby-core

Issue #18885 has been updated by byroot (Jean Boussier).


Thank you Matz.

> RubyVM is not globally available (e.g., not for JRuby or TruffleRuby). 

Yes, that was on purpose, because the behavior would be very VM-specific; some VMs might not even have it. It's not meant to be a cross-implementation feature.

> And I don't think prepare or ready describes the whole functionality.

I'll try to come up with other names.

----------------------------------------
Feature #18885: End of boot advisory API for RubyVM
https://bugs.ruby-lang.org/issues/18885#change-98699

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
----------------------------------------



-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [ruby-core:109531] [Ruby master Feature#18885] End of boot advisory API for RubyVM
  2022-06-28 13:21 [ruby-core:109081] [Ruby master Feature#18885] Long lived fork advisory API (potential Copy on Write optimizations) byroot (Jean Boussier)
                   ` (13 preceding siblings ...)
  2022-08-18  6:55 ` [ruby-core:109529] " byroot (Jean Boussier)
@ 2022-08-18  7:14 ` Eregon (Benoit Daloze)
  2022-08-18  7:16 ` [ruby-core:109533] " byroot (Jean Boussier)
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Eregon (Benoit Daloze) @ 2022-08-18  7:14 UTC (permalink / raw)
  To: ruby-core

Issue #18885 has been updated by Eregon (Benoit Daloze).


An API to notify "end of boot" seems useful beyond just fork COW optimizations, as you say.
For instance a JIT might use that as a hint for what to compile/stop compiling/purge the queue during boot/reset compilation counters/etc.
So it shouldn't be under RubyVM, which would mean it's only available on CRuby (forever).

Maybe a Kernel class method?

`Kernel.booted`/`Kernel.application_booted`/`Kernel.code_loaded`/`Kernel.startup_done` maybe?

----------------------------------------
Feature #18885: End of boot advisory API for RubyVM
https://bugs.ruby-lang.org/issues/18885#change-98701

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
----------------------------------------



-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [ruby-core:109533] [Ruby master Feature#18885] End of boot advisory API for RubyVM
  2022-06-28 13:21 [ruby-core:109081] [Ruby master Feature#18885] Long lived fork advisory API (potential Copy on Write optimizations) byroot (Jean Boussier)
                   ` (14 preceding siblings ...)
  2022-08-18  7:14 ` [ruby-core:109531] " Eregon (Benoit Daloze)
@ 2022-08-18  7:16 ` byroot (Jean Boussier)
  2022-09-15 13:16 ` [ruby-core:109901] " byroot (Jean Boussier)
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: byroot (Jean Boussier) @ 2022-08-18  7:16 UTC (permalink / raw)
  To: ruby-core

Issue #18885 has been updated by byroot (Jean Boussier).


What about `ObjectSpace`?

----------------------------------------
Feature #18885: End of boot advisory API for RubyVM
https://bugs.ruby-lang.org/issues/18885#change-98703

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
----------------------------------------



-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [ruby-core:109901] [Ruby master Feature#18885] End of boot advisory API for RubyVM
  2022-06-28 13:21 [ruby-core:109081] [Ruby master Feature#18885] Long lived fork advisory API (potential Copy on Write optimizations) byroot (Jean Boussier)
                   ` (15 preceding siblings ...)
  2022-08-18  7:16 ` [ruby-core:109533] " byroot (Jean Boussier)
@ 2022-09-15 13:16 ` byroot (Jean Boussier)
  2022-09-22  5:52 ` [ruby-core:109989] " ioquatix (Samuel Williams)
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: byroot (Jean Boussier) @ 2022-09-15 13:16 UTC (permalink / raw)
  To: ruby-core

Issue #18885 has been updated by byroot (Jean Boussier).


So I wrote a reproduction script to showcase the effect of constant caches on Copy on Write performance:

```ruby
class MemInfo
  def initialize(pid = "self")
    @info = parse(File.read("/proc/#{pid}/smaps_rollup"))
  end

  def pss
    @info[:Pss]
  end

  def rss
    @info[:Rss]
  end

  def shared_memory
    @info[:Shared_Clean] + @info[:Shared_Dirty]
  end

  def cow_efficiency
    shared_memory.to_f / MemInfo.new(Process.ppid).rss * 100.0
  end

  private

  def parse(rollup)
    fields = {}
    rollup.each_line do |line|
      if (matchdata = line.match(/(?<field>\w+)\:\s+(?<size>\d+) kB$/))
        fields[matchdata[:field].to_sym] = matchdata[:size].to_i
      end
    end
    fields
  end
end

CONST_NUM = Integer(ENV.fetch("NUM", 100_000))

module App
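  # Generate CONST_NUM singleton methods, each referencing its own constant,
  # so every generated method carries one (initially cold) constant inline cache.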
  CONST_NUM.times do |i|
    class_eval(<<~RUBY, __FILE__, __LINE__ + 1)
      Const#{i} = Module.new

      def self.lookup_#{i}
        Const#{i}
      end
    RUBY
  end

  class_eval(<<~RUBY, __FILE__, __LINE__ + 1)
    def self.warmup
      #{CONST_NUM.times.map { |i| "lookup_#{i}"}.join("\n")}
    end
  RUBY
end

puts "=== fresh parent stats ==="
puts "RSS: #{MemInfo.new.rss} kB"
puts

def print_child_meminfo
  meminfo = MemInfo.new
  puts "PSS: #{meminfo.pss} kB"
  puts "Shared #{meminfo.shared_memory} kB"
  puts "CoW efficiency: #{meminfo.cow_efficiency.round(1)}%"
  puts
end

fork do
  puts "=== fresh fork stats ==="
  print_child_meminfo

  App.warmup

  print_child_meminfo
end

Process.wait

App.warmup

puts "=== warmed parent stats ==="
puts "RSS: #{MemInfo.new.rss} kB"
puts

fork do
  puts "=== warmed fork stats ==="
  print_child_meminfo

  App.warmup

  print_child_meminfo
end

Process.wait
```

Results:

```
$ docker run -v $PWD:/app -it ruby:3.1 ruby /app/app.rb
=== fresh parent stats ===
RSS: 236104 kB

=== fresh fork stats ===
PSS: 117198 kB
Shared 233828 kB
CoW efficiency: 99.0%

PSS: 199734 kB
Shared 72740 kB
CoW efficiency: 30.8%

=== warmed parent stats ===
RSS: 237128 kB

=== warmed fork stats ===
PSS: 117632 kB
Shared 234880 kB
CoW efficiency: 99.1%

PSS: 118318 kB
Shared 235444 kB
CoW efficiency: 99.3%
```

### What this shows

When we first fork the process, the memory cost is close to 0. The parent process has ~230MiB RSS, but 99% of that is shared with the first child, putting the actual cost of the fork at barely a couple MiB.

However, as soon as we start executing code in the child that wasn't warmed up in the parent, the inline caches get filled, which invalidates the shared pages. After that, only about a third of the parent's memory is shared, putting the cost of the child at about 163MiB.

The second part of the reproduction first warms up these caches in the parent before forking. As a result the child doesn't invalidate shared memory when it executes the code, and the cost of the child remains totally negligible.

### What it means for the real world

Of course this repro is specially crafted to show the impact of constant caches, and there are other sources of invalidation such as method caches, etc. But as mentioned, now that https://github.com/ruby/ruby/pull/6187 has been merged, it should be easy to prewarm the constant caches when the proposed API is called.

I guess all we need is a name. Maybe `ObjectSpace.optimize`?

----------------------------------------
Feature #18885: End of boot advisory API for RubyVM
https://bugs.ruby-lang.org/issues/18885#change-99144

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
----------------------------------------



-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [ruby-core:109989] [Ruby master Feature#18885] End of boot advisory API for RubyVM
  2022-06-28 13:21 [ruby-core:109081] [Ruby master Feature#18885] Long lived fork advisory API (potential Copy on Write optimizations) byroot (Jean Boussier)
                   ` (16 preceding siblings ...)
  2022-09-15 13:16 ` [ruby-core:109901] " byroot (Jean Boussier)
@ 2022-09-22  5:52 ` ioquatix (Samuel Williams)
  2022-09-23 12:57 ` [ruby-core:110045] " Dan0042 (Daniel DeLorme)
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: ioquatix (Samuel Williams) @ 2022-09-22  5:52 UTC (permalink / raw)
  To: ruby-core

Issue #18885 has been updated by ioquatix (Samuel Williams).


This is awesome. Nice work.

I also like `warmup` as a name.

----------------------------------------
Feature #18885: End of boot advisory API for RubyVM
https://bugs.ruby-lang.org/issues/18885#change-99239

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
----------------------------------------



-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [ruby-core:110045] [Ruby master Feature#18885] End of boot advisory API for RubyVM
  2022-06-28 13:21 [ruby-core:109081] [Ruby master Feature#18885] Long lived fork advisory API (potential Copy on Write optimizations) byroot (Jean Boussier)
                   ` (17 preceding siblings ...)
  2022-09-22  5:52 ` [ruby-core:109989] " ioquatix (Samuel Williams)
@ 2022-09-23 12:57 ` Dan0042 (Daniel DeLorme)
  2022-10-07 14:38 ` [ruby-core:110231] " matz (Yukihiro Matsumoto)
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 23+ messages in thread
From: Dan0042 (Daniel DeLorme) @ 2022-09-23 12:57 UTC (permalink / raw)
  To: ruby-core

Issue #18885 has been updated by Dan0042 (Daniel DeLorme).


+1 for Process.warmup

----------------------------------------
Feature #18885: End of boot advisory API for RubyVM
https://bugs.ruby-lang.org/issues/18885#change-99295

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
----------------------------------------



-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [ruby-core:110231] [Ruby master Feature#18885] End of boot advisory API for RubyVM
  2022-06-28 13:21 [ruby-core:109081] [Ruby master Feature#18885] Long lived fork advisory API (potential Copy on Write optimizations) byroot (Jean Boussier)
                   ` (18 preceding siblings ...)
  2022-09-23 12:57 ` [ruby-core:110045] " Dan0042 (Daniel DeLorme)
@ 2022-10-07 14:38 ` matz (Yukihiro Matsumoto)
  2022-10-07 15:05 ` [ruby-core:110232] " byroot (Jean Boussier)
  2023-04-13  7:21 ` [ruby-core:113213] " ioquatix (Samuel Williams) via ruby-core
  21 siblings, 0 replies; 23+ messages in thread
From: matz (Yukihiro Matsumoto) @ 2022-10-07 14:38 UTC (permalink / raw)
  To: ruby-core

Issue #18885 has been updated by matz (Yukihiro Matsumoto).


Process.warmup sounds better than the other candidates. My only concern is that the target of warming up might not be Process in the future (e.g. when Ractor-local GC is introduced).

Matz.


----------------------------------------
Feature #18885: End of boot advisory API for RubyVM
https://bugs.ruby-lang.org/issues/18885#change-99515

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
----------------------------------------
### Context

Many optimizations in the Ruby VM rely on lazily computed caches: Strings coderange, constant caches, method caches, etc etc.
As such even without JIT, some operations need a bit of a warm up, and might be flushed if new constants are defined, new code is loaded, or some objects are mutated.

Additionally these lazily computed caches can cause increased memory usage for applications relying on Copy-on-Write memory.
Whenever one of these caches is updated post fork, the entire memory page is invalidated. Precomputing these caches at the end of boot,
even if based on heuristic, could improve Copy-on-Write performance.

The classic example is the objects generation, young objects must be promoted to the old generation before forking, otherwise they'll get invalidated on the next GC run. That's what https://github.com/ko1/nakayoshi_fork addresses.

But there are other sources of CoW invalidation that could be addressed by MRI if it had a clear notification when it needs to be done.

### Proposal

If applications had an API to notify the virtual machine that they're done loading code and are about to start processing user input,
it would give the VM a good point in time to perform optimizations on the existing code and objects.

e.g. could be something like `RubyVM.prepare`, or `RubyVM.ready`.

It's somewhat similar to [Matz's static barrier idea from RubyConf 2020](https://youtu.be/JojpqfaPhjI?t=1908), except that it wouldn't disable any feature.

### Potential optimizations

`nakayoshi_fork` already does the following:

  - Do a major GC run to get rid of as many dangling objects as possible.
  - Promote all surviving objects to the highest generation
  - Compact the heap.

But it would be much simpler to do this from inside the VM rather than do cryptic things such as `4.times { GC.start }` from the Ruby side.

It's also not good to do this on every fork, once you fork the first long lived child, you shouldn't run it again. So decorating `fork` is not a good hook point. 

Also after discussing with @jhawthorn, @tenderlovemaking and @alanwu, we believe this would open the door to several other CoW optimizations:

#### Precompute inline caches

Even though we don't have hard data to prove it, we are convinced that a big source of CoW invalidation are inline caches. Most ISeq are never invoked during initialization, so child processed are forked with mostly cold caches. As a result the first time a method is executed in the child, many memory pages holding ISeq are invalidated as caches get updated.

We think MRI could try to precompute these caches before forking children. Constant cache particularly should be resolvable statically see https://github.com/ruby/ruby/pull/6187.

Method caches are harder to resolve statically, but we can probably apply some heuristics to at least reduce the cache misses.

#### Copy on Write aware GC

We could also keep some metadata about which memory pages are shared, or even introduce a "permanent" generation. [The Instagram engineering team introduced something like that in Python](https://instagram-engineering.com/copy-on-write-friendly-python-garbage-collection-ad6ed5233ddf) ([ticket](https://bugs.python.org/issue31558), [PR](https://github.com/python/cpython/pull/3705)).

That makes the GC aware of which objects live on a shared page. With this information the GC can decide to no free dangling objects leaving on these pages, not to compact these pages, etc.

#### Scan the coderange of all strings

Strings have a lazily computed `coderange` attribute in their flags. So if a string is allocated at boot, but only used after fork, on first use its coderange will mayneed to be computed and the string mutated.

Using https://github.com/ruby/ruby/pull/6076, I noticed that 58% of the strings retained at the end of the boot sequence had an `UNKNOWN` coderange.

So eagerly scanning the coderange of all strings could also improve Copy on Write performance.
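
A user-land approximation of the idea, assuming `String#valid_encoding?` is an acceptable way to force the coderange to be computed before forking (a real implementation would walk the heap inside the VM instead):

```ruby
scanned = 0
ObjectSpace.each_object(String) do |str|
  str.valid_encoding? # computes and caches the coderange in the string's flags
  scanned += 1
end
puts "scanned the coderange of #{scanned} strings before forking"
```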

#### malloc_trim

This hook will also be a good point to release unused pages to the system with `malloc_trim`.
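
From Ruby today this can only be approximated via FFI; a glibc-only sketch using Fiddle (allocators such as jemalloc don't provide `malloc_trim`):

```ruby
require "fiddle"

malloc_trim = Fiddle::Function.new(
  Fiddle::Handle::DEFAULT["malloc_trim"], # resolve the glibc symbol
  [Fiddle::TYPE_SIZE_T],                  # size_t pad argument
  Fiddle::TYPE_INT
)

malloc_trim.call(0) # return as much freed heap memory to the kernel as possible
```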



-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [ruby-core:110232] [Ruby master Feature#18885] End of boot advisory API for RubyVM
  2022-06-28 13:21 [ruby-core:109081] [Ruby master Feature#18885] Long lived fork advisory API (potential Copy on Write optimizations) byroot (Jean Boussier)
                   ` (19 preceding siblings ...)
  2022-10-07 14:38 ` [ruby-core:110231] " matz (Yukihiro Matsumoto)
@ 2022-10-07 15:05 ` byroot (Jean Boussier)
  2023-04-13  7:21 ` [ruby-core:113213] " ioquatix (Samuel Williams) via ruby-core
  21 siblings, 0 replies; 23+ messages in thread
From: byroot (Jean Boussier) @ 2022-10-07 15:05 UTC (permalink / raw)
  To: ruby-core

Issue #18885 has been updated by byroot (Jean Boussier).


Thank you Matz!

> My only concern is that the target of warming up might not be Process in the future

Given the type of optimizations we have in mind right now, I think they'll still be global even in a Ractor-heavy context. The main semantic of this signal is "I'm done loading my code", which doesn't change even with heavy Ractor use.



----------------------------------------
Feature #18885: End of boot advisory API for RubyVM
https://bugs.ruby-lang.org/issues/18885#change-99516

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
----------------------------------------
### Context

Many optimizations in the Ruby VM rely on lazily computed caches: string coderanges, constant caches, method caches, etc.
As such, even without a JIT, some operations need a bit of warm-up, and these caches might be flushed if new constants are defined, new code is loaded, or some objects are mutated.

Additionally, these lazily computed caches can cause increased memory usage for applications relying on Copy-on-Write memory.
Whenever one of these caches is updated post-fork, the entire memory page holding it is invalidated. Precomputing these caches at the end of boot,
even if based on heuristics, could improve Copy-on-Write performance.

The classic example is object generations: young objects must be promoted to the old generation before forking, otherwise the pages holding them will be invalidated on the next GC run. That's what https://github.com/ko1/nakayoshi_fork addresses.

But there are other sources of CoW invalidation that could be addressed by MRI if it had a clear notification of when to do so.

### Proposal

If applications had an API to notify the virtual machine that they're done loading code and are about to start processing user input,
it would give the VM a good point in time to perform optimizations on the existing code and objects.

e.g. it could be something like `RubyVM.prepare` or `RubyVM.ready`.

It's somewhat similar to [Matz's static barrier idea from RubyConf 2020](https://youtu.be/JojpqfaPhjI?t=1908), except that it wouldn't disable any feature.

### Potential optimizations

`nakayoshi_fork` already does the following:

  - Do a major GC run to get rid of as many dangling objects as possible.
  - Promote all surviving objects to the highest generation
  - Compact the heap.

But it would be much simpler to do this from inside the VM rather than do cryptic things such as `4.times { GC.start }` from the Ruby side.

It's also not ideal to do this on every fork: once you've forked the first long-lived child, you shouldn't run it again. So decorating `fork` is not a good hook point.

Also after discussing with @jhawthorn, @tenderlovemaking and @alanwu, we believe this would open the door to several other CoW optimizations:

#### Precompute inline caches

Even though we don't have hard data to prove it, we are convinced that a big source of CoW invalidation is inline caches. Most ISeqs are never invoked during initialization, so child processes are forked with mostly cold caches. As a result, the first time a method is executed in the child, many memory pages holding ISeqs are invalidated as caches get updated.

We think MRI could try to precompute these caches before forking children. Constant caches in particular should be resolvable statically, see https://github.com/ruby/ruby/pull/6187.

Method caches are harder to resolve statically, but we can probably apply some heuristics to at least reduce the cache misses.

#### Copy on Write aware GC

We could also keep some metadata about which memory pages are shared, or even introduce a "permanent" generation. [The Instagram engineering team introduced something like that in Python](https://instagram-engineering.com/copy-on-write-friendly-python-garbage-collection-ad6ed5233ddf) ([ticket](https://bugs.python.org/issue31558), [PR](https://github.com/python/cpython/pull/3705)).

That makes the GC aware of which objects live on a shared page. With this information the GC can decide not to free dangling objects living on these pages, not to compact these pages, etc.

#### Scan the coderange of all strings

Strings have a lazily computed `coderange` attribute in their flags. So if a string is allocated at boot but only used after fork, on first use its coderange may need to be computed and the string mutated.

Using https://github.com/ruby/ruby/pull/6076, I noticed that 58% of the strings retained at the end of the boot sequence had an `UNKNOWN` coderange.

So eagerly scanning the coderange of all strings could also improve Copy on Write performance.

#### malloc_trim

This hook will also be a good point to release unused pages to the system with `malloc_trim`.



-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [ruby-core:113213] [Ruby master Feature#18885] End of boot advisory API for RubyVM
  2022-06-28 13:21 [ruby-core:109081] [Ruby master Feature#18885] Long lived fork advisory API (potential Copy on Write optimizations) byroot (Jean Boussier)
                   ` (20 preceding siblings ...)
  2022-10-07 15:05 ` [ruby-core:110232] " byroot (Jean Boussier)
@ 2023-04-13  7:21 ` ioquatix (Samuel Williams) via ruby-core
  21 siblings, 0 replies; 23+ messages in thread
From: ioquatix (Samuel Williams) via ruby-core @ 2023-04-13  7:21 UTC (permalink / raw)
  To: ruby-core; +Cc: ioquatix (Samuel Williams)

Issue #18885 has been updated by ioquatix (Samuel Williams).


Looking forward to using this.

----------------------------------------
Feature #18885: End of boot advisory API for RubyVM
https://bugs.ruby-lang.org/issues/18885#change-102756

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
----------------------------------------
### Context

Many optimizations in the Ruby VM rely on lazily computed caches: string coderanges, constant caches, method caches, etc.
As such, even without a JIT, some operations need a bit of warm-up, and these caches might be flushed if new constants are defined, new code is loaded, or some objects are mutated.

Additionally, these lazily computed caches can cause increased memory usage for applications relying on Copy-on-Write memory.
Whenever one of these caches is updated post-fork, the entire memory page holding it is invalidated. Precomputing these caches at the end of boot,
even if based on heuristics, could improve Copy-on-Write performance.

The classic example is object generations: young objects must be promoted to the old generation before forking, otherwise the pages holding them will be invalidated on the next GC run. That's what https://github.com/ko1/nakayoshi_fork addresses.

But there are other sources of CoW invalidation that could be addressed by MRI if it had a clear notification of when to do so.

### Proposal

If applications had an API to notify the virtual machine that they're done loading code and are about to start processing user input,
it would give the VM a good point in time to perform optimizations on the existing code and objects.

e.g. it could be something like `RubyVM.prepare` or `RubyVM.ready`.

It's somewhat similar to [Matz's static barrier idea from RubyConf 2020](https://youtu.be/JojpqfaPhjI?t=1908), except that it wouldn't disable any feature.

### Potential optimizations

`nakayoshi_fork` already does the following:

  - Do a major GC run to get rid of as many dangling objects as possible.
  - Promote all surviving objects to the highest generation
  - Compact the heap.

But it would be much simpler to do this from inside the VM rather than do cryptic things such as `4.times { GC.start }` from the Ruby side.

It's also not ideal to do this on every fork: once you've forked the first long-lived child, you shouldn't run it again. So decorating `fork` is not a good hook point.

Also after discussing with @jhawthorn, @tenderlovemaking and @alanwu, we believe this would open the door to several other CoW optimizations:

#### Precompute inline caches

Even though we don't have hard data to prove it, we are convinced that a big source of CoW invalidation is inline caches. Most ISeqs are never invoked during initialization, so child processes are forked with mostly cold caches. As a result, the first time a method is executed in the child, many memory pages holding ISeqs are invalidated as caches get updated.

We think MRI could try to precompute these caches before forking children. Constant caches in particular should be resolvable statically, see https://github.com/ruby/ruby/pull/6187.

Method caches are harder to resolve statically, but we can probably apply some heuristics to at least reduce the cache misses.

#### Copy on Write aware GC

We could also keep some metadata about which memory pages are shared, or even introduce a "permanent" generation. [The Instagram engineering team introduced something like that in Python](https://instagram-engineering.com/copy-on-write-friendly-python-garbage-collection-ad6ed5233ddf) ([ticket](https://bugs.python.org/issue31558), [PR](https://github.com/python/cpython/pull/3705)).

That makes the GC aware of which objects live on a shared page. With this information the GC can decide not to free dangling objects living on these pages, not to compact these pages, etc.

#### Scan the coderange of all strings

Strings have a lazily computed `coderange` attribute in their flags. So if a string is allocated at boot but only used after fork, on first use its coderange may need to be computed and the string mutated.

Using https://github.com/ruby/ruby/pull/6076, I noticed that 58% of the strings retained at the end of the boot sequence had an `UNKNOWN` coderange.

So eagerly scanning the coderange of all strings could also improve Copy on Write performance.

#### malloc_trim

This hook will also be a good point to release unused pages to the system with `malloc_trim`.



-- 
https://bugs.ruby-lang.org/
 ______________________________________________
 ruby-core mailing list -- ruby-core@ml.ruby-lang.org
 To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org
 ruby-core info -- https://ml.ruby-lang.org/mailman3/postorius/lists/ruby-core.ml.ruby-lang.org/

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2023-04-13  7:21 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-28 13:21 [ruby-core:109081] [Ruby master Feature#18885] Long lived fork advisory API (potential Copy on Write optimizations) byroot (Jean Boussier)
2022-06-30  9:27 ` [ruby-core:109098] " byroot (Jean Boussier)
2022-07-16  3:19 ` [ruby-core:109227] " ioquatix (Samuel Williams)
2022-07-27 16:55 ` [ruby-core:109339] " byroot (Jean Boussier)
2022-07-30  2:33 ` [ruby-core:109380] " Dan0042 (Daniel DeLorme)
2022-07-30  6:19 ` [ruby-core:109381] " byroot (Jean Boussier)
2022-08-02  8:56 ` [ruby-core:109409] " mame (Yusuke Endoh)
2022-08-02  9:03 ` [ruby-core:109410] " byroot (Jean Boussier)
2022-08-03  1:32 ` [ruby-core:109417] " mame (Yusuke Endoh)
2022-08-03  6:42 ` [ruby-core:109420] " byroot (Jean Boussier)
2022-08-03  7:10 ` [ruby-core:109421] [Ruby master Feature#18885] End of boot advisory API for RubyVM byroot (Jean Boussier)
2022-08-10 18:21 ` [ruby-core:109469] " Dan0042 (Daniel DeLorme)
2022-08-10 18:24 ` [ruby-core:109470] " byroot (Jean Boussier)
2022-08-18  6:51 ` [ruby-core:109528] " matz (Yukihiro Matsumoto)
2022-08-18  6:55 ` [ruby-core:109529] " byroot (Jean Boussier)
2022-08-18  7:14 ` [ruby-core:109531] " Eregon (Benoit Daloze)
2022-08-18  7:16 ` [ruby-core:109533] " byroot (Jean Boussier)
2022-09-15 13:16 ` [ruby-core:109901] " byroot (Jean Boussier)
2022-09-22  5:52 ` [ruby-core:109989] " ioquatix (Samuel Williams)
2022-09-23 12:57 ` [ruby-core:110045] " Dan0042 (Daniel DeLorme)
2022-10-07 14:38 ` [ruby-core:110231] " matz (Yukihiro Matsumoto)
2022-10-07 15:05 ` [ruby-core:110232] " byroot (Jean Boussier)
2023-04-13  7:21 ` [ruby-core:113213] " ioquatix (Samuel Williams) via ruby-core

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).