ruby-core@ruby-lang.org archive (unofficial mirror)
 help / color / mirror / Atom feed
From: "Eregon (Benoit Daloze)" <noreply@ruby-lang.org>
To: ruby-core@ruby-lang.org
Subject: [ruby-core:105463] [Ruby master Feature#18228] Add a `timeout` option to `IO.copy_stream`
Date: Tue, 28 Sep 2021 10:10:00 +0000 (UTC)	[thread overview]
Message-ID: <redmine.journal-93913.20210928100959.7941@ruby-lang.org> (raw)
In-Reply-To: redmine.issue-18228.20210927121609.7941@ruby-lang.org

Issue #18228 has been updated by Eregon (Benoit Daloze).


byroot (Jean Boussier) wrote in #note-4:
> I think it can? But from my initial research I was under the impression that using ALARM to interrupt system calls was a bit frowned upon and that `poll/select` was generally regarded as preferable.

It's what Ruby already uses to interrupt blocking system calls for Ruby interrupts like signals, Thread#raise, etc (as you might already know), so I think it's fine.
Making an IO non-blocking for IO.copy_stream might have side effects, but maybe they are fine.

----------------------------------------
Feature #18228: Add a `timeout` option to `IO.copy_stream`
https://bugs.ruby-lang.org/issues/18228#change-93913

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
----------------------------------------
### Context

In many situations dealing with large files, `IO.copy_stream` when usable bring major performance gains (often twice faster at the very least). And more importantly, when the copying is deferred to the kernel, the performance is much more consistent as it is less impacted by the CPU utilization on the machine.

However, it is often unsafe to use because it doesn't have a timeout, so you can only use it if both the source and destination IOs are trusted, otherwise it is trivial for an attacker to DOS the service by reading the response very slowly.

### Some examples

- It is [used by `webrick`](https://github.com/ruby/webrick/commit/54be684da9d993ad6c237e2e9853eb98bcbaae6e).
- `Net::HTTP` uses it to send request body if they are IOs, but [it is used with a "fake IO" to allow for timeouts](https://github.com/ruby/net-http/pull/27), so `sendfile(2)` &co are never used.
- [A proof of concept of integrating in puma shows a 2x speedup](https://github.com/puma/puma/pull/2703). 
- [Various other HTTP client could use it as well](https://github.com/nahi/httpclient/pull/383).
- I used it in private projects to download and upload large archives in and out of Google Cloud Storage with great effects.

### Possible implementation

The main difficulty is that the underlying sycalls don't have a timeout either.

The main syscall used in these scenarios is `sendfile(2)`. It doesn't have a timeout parameter, however if called on file descriptors with `O_NONBLOCK` it does return early and allow for a `select/poll` loop. I did a very quick and dirty experiment with this, and it does seem to work.

The other two accelerating syscalls are [`copy_file_range(2)`](https://man7.org/linux/man-pages/man2/copy_file_range.2.html) (linux) and [`fcopyfile(2)`](https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man3/fcopyfile.3.html) (macOS). Neither have a timeout, and neither manpage document an `EAGAIN / EWOULDBLOCK` error. However these syscalls are limited to real file copies, generally speaking timeouts for real files are less of a critical need, so it would be possible to simply not use these syscalls if a timeout is provided.

### Interface

`copy_stream(src, dst, copy_length, src_offset, timeout)`
or `copy_stream(src, dst, copy_length, src_offset, timeout: nil)`

As for the return value in case of a timeout, it is important to convey both that a timeout happened, and the number of bytes that were copied, otherwise it makes retries impossible.

- It could simply returns the number of byte, and let the caller compare it to the expected number of bytes copied, but that wouldn't work in cases where the size of `src` isn't known.
- It could return `-1 - bytes_copied`, not particularly elegant but would work.
- It could return multiple values or some kind of result object when a timeout is provided.
- It could raise an error, with `bytes_copied` as an attribute on the error.

Or alternatively `copy_stream` would be left without a timeout, and some kind of `copy_stream2` would be introduced so that `copy_stream` return value wouldn't be made inconsistent.






-- 
https://bugs.ruby-lang.org/

  parent reply	other threads:[~2021-09-28 10:10 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-09-27 12:16 [ruby-core:105450] [Ruby master Feature#18228] Add a `timeout` option to `IO.copy_stream` byroot (Jean Boussier)
2021-09-27 16:57 ` [ruby-core:105451] " Eregon (Benoit Daloze)
2021-10-01  4:48   ` [ruby-core:105517] " Eric Wong
2021-09-27 20:11 ` [ruby-core:105453] " ioquatix (Samuel Williams)
2021-10-01  5:01   ` [ruby-core:105518] " Eric Wong
2021-09-27 21:21 ` [ruby-core:105454] " byroot (Jean Boussier)
2021-09-28 10:10 ` Eregon (Benoit Daloze) [this message]
2021-09-28 10:17 ` [ruby-core:105464] " byroot (Jean Boussier)
2021-09-30 10:02 ` [ruby-core:105502] " ioquatix (Samuel Williams)
2021-09-30 10:09 ` [ruby-core:105503] " byroot (Jean Boussier)
2021-09-30 20:06 ` [ruby-core:105514] " Eregon (Benoit Daloze)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-list from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.ruby-lang.org/en/community/mailing-lists/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=redmine.journal-93913.20210928100959.7941@ruby-lang.org \
    --to=ruby-core@ruby-lang.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).