From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Mon, 30 Apr 2018 10:25:26 +0000
From: Eric Wong
To: ruby-core@ruby-lang.org
Message-ID: <20180430102526.GA20199@dcvr>
Subject: [ruby-core:86774] Re: [Ruby trunk Feature#13618] [PATCH] auto fiber schedule for rb_wait_for_single_fd and rb_waitpid
List-Id: Ruby developers

samuel@oriontransfer.org wrote:
> > Using a background thread is your mistake.
>
> Don't assume I made this design.  It was made by other people.
> I merely tested it because I was interested in the performance
> overhead.  And yes, there is significant overhead.
> And let's be generous: people who invested their time and effort to
> make such a thing for Ruby deserve our appreciation.  Knowing that
> the path they chose to explore was not good is equally important.

The problem I have with existing reactor patterns is that threads are
an afterthought.  They should not be.

> > Multiple foreground threads safely use epoll_wait or kevent on the
> > SAME epoll or kqueue fd.  It's perfectly safe to do that.
>
> Sure, that's reasonable.  If you want to share those data structures
> across threads, you can dispatch your work in different threads too.
> I liked what you did with https://yhbt.net/yahns/yahns.txt and it's
> an interesting design.
>
> The biggest single benefit of this design is that blocking
> operations in an individual "task" or "worker" won't block any other
> "task" or "worker", up to the limit of the thread pool you allocate,
> at which point things WILL start causing blocking.  So you can't
> avoid blocking even with this design.

Of course everything blocks at some point when things get overloaded.
The difference is there's no head-of-line blocking in yahns, because
sockets can migrate to an idle thread.  Auto-fiber can't avoid
head-of-line blocking right now, because Ruby Fibers can't migrate
across threads (that's a separate problem).

> The major downside of such a design is that workers have to assume
> they could be running on different threads, so shared data
> structures need locking/will cause contention.  In addition, the
> current state of the Ruby GIL means that any such design will
> generally have poor performance.

No, you don't need locking for read/write ops if you use
EV_ONESHOT/EPOLLONESHOT.  libev and typical reactor-pattern designs
are not built with one-shot in mind, so they're stuck using
level-triggering and rely on locking.  Only FD
allocation/deallocation requires locking (the kernel needs locking
there, too).
> So, I think it's safe to say that, in an end-to-end test, the GIL is
> a MAJOR performance issue.  Feel free to correct me if you think I'm
> wrong.  I'm sure this story is more complicated than the above
> benchmarks, but I felt like it was a useful comparison.

The GVL is a major performance issue if your bottleneck is the CPU.
It is not a major problem when my bottleneck is network I/O or
high-latency disks (I have systems with dozens or hundreds of disks).

> Blocking operations that are causing performance issues should use a
> thread pool.  For things like launching an external process or
> syscall, and waiting for it to finish, threads are ideal.

Launching an external process and calling waitpid does not benefit
from native threads.  Again, native_thread_count >= disk_count is a
huge thing I've relied on with Ruby for years now, so using only one
native thread is totally wrong for my use case when I have
dozens/hundreds of slow disks.

> There is some elegance in the design you propose.  Your proposal
> requires some kind of "Task" or "Worker", which is a fiber that will
> yield when IO would block and resume when IO is ready.  Based on
> what you've said, do you mind explaining whether the "Task" or
> "Worker" is resumed on the same thread or a different one?  Do you
> maintain a thread pool?

The use of threads or a thread pool remains up to the Ruby user.
There are no extra fibers or native threads created behind users'
backs; that would be a waste of memory.  It uses the "idle time" of
any available threads (including the main thread) to do scheduling
work.  (Current Ruby has provisions for an internal thread cache for
Thread.new, but it's orthogonal to this issue and has been around for
a decade in a buggy, never-enabled state.)

> If it's always resumed on the same thread, how do you manage that?
> e.g. perhaps you can show me how the following would work:

Every thread has a FIFO run-queue (th->afrunq or th->runq, depending
on which version you look at)....
> If you follow this model, the thread must be calling into `epoll` or
> `kqueue` in order to resume work.  But based on what you've said, if
> you have several of the above threads running, and the thread itself
> is invoking `epoll_wait`, then it receives events for a different
> thread, how does that work?  Do you send the events to the
> different thread?  If you do that, what is the overhead?  If you
> don't do that, do you move workers between threads?

When a thread receives work for a fiber belonging to a different
thread, it inserts that fiber into the run-queue of the other thread.
Right now it's ccan/list, for branchless insert/delete (this relies
on the GVL).  If/when we get rid of the GVL, we will likely use
wfcqueue for wait-free insert and mass dequeue.  Wait-free is even
better than lock-free, but there would still be memory barriers, of
course.  Again, we can't move fibers across threads in Ruby atm.
One-shot notifications ensure we don't get unintended events.

> Then, why not consider a model similar to async, which uses
> per-thread reactors?  The workers do not move around threads, and
> the reactor does not need to send events to other threads.

I know all that sounds like unnecessary serialization and overhead,
but the same stuff is being serialized in the kernel and hardware
anyway.  For (typical) servers with a single active NIC, interrupts
tend to be handled by a single CPU, and inserting into the epoll
ready list has the same serialization overhead.  So partitioning
across multiple epoll/kqueue descriptors inside the kernel is a waste
of time unless you're getting enough traffic to max out a CPU with
interrupt handling.

There's nothing about the design which prevents the use of parallel
schedulers (they are not "reactors" to me).  So if I were getting
enough network traffic to saturate multiple NICs and peg a CPU from
network traffic alone, then yes, as a last resort I'd have extra
epoll/kqueue-based schedulers inside a process.  That's a last
resort.
I know we can eke more performance out of the epoll ready list inside
the Linux kernel first, but that's not even worth the effort atm.
Until then, I'd rather save unswappable kernel memory and FDs with a
single epoll/kqueue per process.