[ruby-core:105785] [Ruby master Feature#17795] Around `Process.fork` callbacks API

ruby-core@ruby-lang.org archive (unofficial mirror)
 help / color / mirror / Atom feed

From: "ioquatix (Samuel Williams)" <noreply@ruby-lang.org>
To: ruby-core@ruby-lang.org
Subject: [ruby-core:105785] [Ruby master Feature#17795] Around `Process.fork` callbacks API
Date: Mon, 25 Oct 2021 10:03:41 +0000 (UTC)	[thread overview]
Message-ID: <redmine.journal-94301.20211025100341.7941@ruby-lang.org> (raw)
In-Reply-To: redmine.issue-17795.20210412080908.7941@ruby-lang.org

Issue #17795 has been updated by ioquatix (Samuel Williams).

I agree with Jeremy, but I'll go a step further. If we want Ruby to be safe, the only sane way to handle fork (in general) is to close all resources in the child process (as already happens with Thread instances, and similarly with close on exec style behaviour). Any other model requires cooperation between the parent and child process, as Jeremy outlines and simply has to involve a shared understanding of what happens before and after fork in a highly application specific way.

I feel like adding an `at_fork` or `after_fork` hook should be sufficient for 99% of use cases, however we should allow users to bypass it, as in:

```
after_fork do
  connection_pool.close_on_fork
end

connection = connection_pool.acquire
connection.close_on_fork = false

fork do
  connection.query(...) # connection is still valid
end
```

However this kind of usage really seems like the edge of the edge cases and it begs the question, do we want to encourage libraries and applications to do this?

The more valid example was found by digging into the Datadog profiler:

```
after_fork do
  Datadog.profiler.start if Datadog.profiler
end
```

This to me seems like a totally legitimate use case but I don't think this needs `_fork` and I think it makes more sense to be consistent with `at_exit`, e.g. `at_fork` could be a better name, only executed in the child process. Wrapping `Process#_fork` seems very specific to how the Process module is implemented and I don't think we should be exposing that or expecting users to `prepend` to standard library modules to get functionality. It sets a bad precedent IMHO.

----------------------------------------
Feature #17795: Around `Process.fork` callbacks API
https://bugs.ruby-lang.org/issues/17795#change-94301

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
----------------------------------------
Replaces: https://bugs.ruby-lang.org/issues/5446

### Context

Ruby code in production is very often running in a forking setup (puma, unicorn, etc), and it is common some types of libraries to need to know when the Ruby process was forked. For instance:

  - Most database clients, ORMs or other libraries keeping a connection pool might need to close connections before the fork happens.
  - Libraries relying on some kind of dispatcher thread might need to restart the thread in the forked children, and clear any internal buffer (e.g. statsd clients, newrelic_rpm).

**This need is only for forking the whole ruby process, extensions doing a `fork(2) + exec(2)` combo etc are not a concern, this aim at only catching `kernel.fork`, `Process.fork` and maybe `Process.daemon`.**.
The use case is for forks that end up executing Ruby code.

### Current solutions

Right now this use case is handled in several ways.

#### Rely on the integrating code to call a `before_fork` or `after_fork` callback.

Some libraries simply rely on documentation and require the user to use the hooks provided by their forking server.

Examples:

  - Sequel: http://sequel.jeremyevans.net/rdoc/files/doc/fork_safety_rdoc.html
  - Rails's Active Record: https://devcenter.heroku.com/articles/concurrency-and-database-connections#multi-process-servers
  - ScoutAPM (it tries to detect popular forking setup and register itself): https://github.com/scoutapp/scout_apm_ruby/blob/fa83793b9e8d2f9a32c920f59b57d7f198f466b8/lib/scout_apm/environment.rb#L142-L146
  - NewRelic RPM (similarly tries to register to popular forking setups): https://www.rubydoc.info/github/newrelic/rpm/NewRelic%2FAgent:after_fork

#### Continuously check `Process.pid`

Some libraries chose to instead keep the process PID in a variable, and to regularly compare it to `Process.pid` to detect forked children.
Unfortunately `Process.pid` is relatively slow on Linux, and these checks tend to be in tight loops, so it's not uncommon when using these libraries
to spend `1` or `2%` of runtime in `Process.pid`.

Examples:

  - Rails's Active Record used to check `Process.pid` https://github.com/Shopify/rails/blob/411ccbdab2608c62aabdb320d52cb02d446bb39c/activerecord/lib/active_record/connection_adapters/abstract/connection_pool.rb#L946, it still does but a bit less: https://github.com/rails/rails/pull/41850
  - the `Typhoeus` HTTP client: https://github.com/typhoeus/typhoeus/blob/a345545e5e4ac0522b883fe0cf19e5e2e807b4b0/lib/typhoeus/pool.rb#L34-L42
  - Redis client: https://github.com/redis/redis-rb/blob/6542934f01b9c390ee450bd372209a04bc3a239b/lib/redis/client.rb#L384
  - Some older versions of NewRelic RPM: https://github.com/opendna/scoreranking-api/blob/8fba96d23b4d3e6b64f625079c184f3a292bbc12/vendor/gems/ruby/1.9.1/gems/newrelic_rpm-3.7.3.204/lib/new_relic/agent/harvester.rb#L39-L41

#### Continuously check `Thread#alive?`

Similar to checking `Process.pid`, but for the background thread use case. `Thread#alive?` is regularly checked, and if the thread is dead, it is assumed that the process was forked.
It's much less costly than a `Process.pid`, but also a bit less reliable as the thread could have died for other reasons. It also delays re-creating the thread to the next check rather than immediately upon forking.

Examples:

  - `statsd-instrument`: https://github.com/Shopify/statsd-instrument/blob/0445cca46e29aa48e9f1efec7c72352aff7ec931/lib/statsd/instrument/batched_udp_sink.rb#L63

#### Decorate `Kernel.fork` and `Process.fork`

Another solution is to prepend a module in `Process` and `Kernel`, to decorate the fork method and implement your own callback. It works well, but is made difficult by `Kernel.fork`.

Examples:

  - Active Support: https://github.com/rails/rails/blob/9aed3dcdfea6b64c18035f8e2622c474ba499846/activesupport/lib/active_support/fork_tracker.rb
  - `dd-trace-rb`: https://github.com/DataDog/dd-trace-rb/blob/793946146b4709289cfd459f3b68e8227a9f5fa7/lib/ddtrace/profiling/ext/forking.rb
  - To some extent, `nakayoshi_fork` decorates the `fork` method: https://github.com/ko1/nakayoshi_fork/blob/19ef5efc51e0ae51d7f5f37a0b785309bf16e97f/lib/nakayoshi_fork.rb

### Proposals

I see two possible features to improve this situation:

####  Fork callbacks

One solution would be for Ruby to expose a callback API for these two events, similar to `Kernel.at_exit`.

Most implementations of this functionnality in other languages ([C's `pthread_atfork`](https://man7.org/linux/man-pages/man3/pthread_atfork.3.html), [Python's `os.register_at_fork`](https://docs.python.org/3/library/os.html#os.register_at_fork)) expose 3 callbacks:

  - `prepare` or `before` executed in the parent process before the `fork(2)`
  - `parent` or `after_in_parent` executed in the parent process after the `fork(2)`
  - `child` or `after_in_child` executed in the child process after the `fork(2)`

A direct translation of such API in Ruby could look like `Process.at_fork(prepare: Proc, parent: Proc, child: Proc)` if inspired by `pthread_atfork`.

Or alternatively each callback could be exposed idependently: `Process.before_fork {}`, `Process.after_fork_parent {}`, `Process.after_fork_child {}`.

Also note that similar APIs don't expose any way to unregister callbacks, and expect users to use weak references or to not hold onto objects that should be garbage collected.

Pseudo code:

```ruby
module Process
  @prepare = []
  @parent = []
  @child = []

  def self.at_fork(prepare: nil, parent: nil, child: nil)
    @prepare.unshift(prepare) if prepare # prepare callbacks are executed in reverse registration order
    @parent << parent if parent
    @child << child if child
  end

  def self.fork
    @prepare.each(&:call)
    if pid = Primitive.fork
      @parent.each(&:call) # We could consider passing the pid here.
    else
      @child.each(&:call)
    end
  end
end

```

#### Make `Kernel.fork` a delegator

A simpler change would be to just make `Kernel.fork` a delegator to `Process.fork`. This would make it much easier to prepend a module on `Process` for each library to implement its own callback.

Proposed patch: https://github.com/ruby/ruby/pull/4361

-- 
https://bugs.ruby-lang.org/

     prev parent reply	other threads:[~2021-10-25 10:03 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-12  8:09 [ruby-core:103400] [Ruby master Feature#17795] `before_fork` and `after_fork` callback API jean.boussier
2021-04-12 13:24 ` [ruby-core:103402] " daniel
2021-04-12 14:12 ` [ruby-core:103403] " jean.boussier
2021-04-12 14:59 ` [ruby-core:103404] " daniel
2021-04-13  1:32 ` [ruby-core:103410] " muraken
2021-04-13  2:32 ` [ruby-core:103411] " merch-redmine
2021-04-13 10:52 ` [ruby-core:103421] " jean.boussier
2021-04-13 11:16 ` [ruby-core:103423] " eregontp
2021-04-13 14:36 ` [ruby-core:103427] " mame
2021-04-13 14:46 ` [ruby-core:103428] " daniel
2021-04-13 15:17 ` [ruby-core:103431] " jean.boussier
2021-04-13 16:32 ` [ruby-core:103432] " mame
2021-04-13 22:22 ` [ruby-core:103443] " jean.boussier
2021-04-14  6:07 ` [ruby-core:103447] " ko1
2021-04-14  7:34 ` [ruby-core:103448] [Ruby master Feature#17795] Around `Process.fork` callbacks API jean.boussier
2021-04-16  8:03 ` [ruby-core:103476] " akr
2021-04-16  8:07 ` [ruby-core:103478] " jean.boussier
2021-04-16 13:28 ` [ruby-core:103482] " ivo.anjo
2021-04-17  7:56 ` [ruby-core:103495] " mame
2021-04-17 11:02 ` [ruby-core:103499] " jean.boussier
2021-04-17 17:43 ` [ruby-core:103501] " mame
2021-04-19  9:55 ` [ruby-core:103512] " jean.boussier
2021-04-21  7:30 ` [ruby-core:103537] " sam.saffron
2021-04-22 13:21 ` [ruby-core:103550] " jean.boussier
2021-04-22 14:34 ` [ruby-core:103552] " daniel
2021-04-22 15:23 ` [ruby-core:103553] " daniel
2021-04-22 18:58 ` [ruby-core:103558] " jean.boussier
2021-04-23 16:26 ` [ruby-core:103575] " eregontp
2021-05-03 11:34 ` [ruby-core:103697] " eregontp
2021-05-21 17:45 ` [ruby-core:103952] " mame
2021-05-21 21:23 ` [ruby-core:103959] " merch-redmine
2021-06-02  7:46 ` [ruby-core:104135] " ivo.anjo
2021-06-02  8:40 ` [ruby-core:104136] " jean.boussier
2021-07-15  7:18 ` [ruby-core:104614] " matz
2021-07-15  7:29 ` [ruby-core:104615] " mame
2021-07-15 20:15 ` [ruby-core:104624] " jean.boussier
2021-07-15 20:40 ` [ruby-core:104625] " jean.boussier
2021-07-16  6:07 ` [ruby-core:104628] " mame
2021-10-17 19:39 ` [ruby-core:105657] " Dan0042 (Daniel DeLorme)
2021-10-25  2:36 ` [ruby-core:105768] " matz (Yukihiro Matsumoto)
2021-10-25  4:18 ` [ruby-core:105771] " mame (Yusuke Endoh)
2021-10-25 10:03 ` ioquatix (Samuel Williams) [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-list from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.ruby-lang.org/en/community/mailing-lists/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=redmine.journal-94301.20211025100341.7941@ruby-lang.org \
    --to=ruby-core@ruby-lang.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).