ruby-core@ruby-lang.org archive (unofficial mirror)
* [ruby-core:88500] [Ruby trunk Bug#14997] Socket connect timeout exceeds the timeout value for
From: maciej @ 2018-08-16  8:56 UTC
  To: ruby-core

Issue #14997 has been reported by maciej.mensfeld (Maciej Mensfeld).

----------------------------------------
Bug #14997: Socket connect timeout exceeds the timeout value for 
https://bugs.ruby-lang.org/issues/14997

* Author: maciej.mensfeld (Maciej Mensfeld)
* Status: Open
* Priority: Normal
* Assignee: 
* Target version: 
* ruby -v: 2.5.1
* Backport: 2.3: UNKNOWN, 2.4: UNKNOWN, 2.5: UNKNOWN
----------------------------------------
Consider a domain that resolves to multiple IPs (4 in the following example):

```
dig debug-xyz.elb.us-east-1.amazonaws.com a

; <<>> DiG 9.10.3-P4-Ubuntu <<>> debug-xyz.elb.us-east-1.amazonaws.com a
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 54375
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;debug-xyz.elb.us-east-1.amazonaws.com. IN A

;; ANSWER SECTION:
debug-xyz.elb.us-east-1.amazonaws.com. 60 IN A 172.31.86.79
debug-xyz.elb.us-east-1.amazonaws.com. 60 IN A 172.31.109.24
debug-xyz.elb.us-east-1.amazonaws.com. 60 IN A 172.31.119.55
debug-xyz.elb.us-east-1.amazonaws.com. 60 IN A 172.31.71.167

;; Query time: 4 msec
;; SERVER: 172.31.0.2#53(172.31.0.2)
;; WHEN: Tue Aug 14 13:46:18 UTC 2018
;; MSG SIZE  rcvd: 132
```

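When `connect_timeout` is set to some value N and the endpoints are unresponsive (they neither accept the connection nor fail immediately with an exception), the overall connect time can reach `N * 4`, because the timeout is applied to each resolved address separately.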

This can disrupt some time-sensitive systems.

We've experienced it with the following setup:

- TCP server (EventMachine) behind an AWS NLB
- The TCP server process behind the NLB goes down, but the NLB itself is still responsive
- Socket `connect_timeout` is set to 100 ms
- The AWS NLB keeps the connection in a waiting state, hoping that the service behind it will recover (it doesn't)
- Ruby times out after 100 ms
- Ruby tries to connect to the next IP from the pool (the AWS NLB again)
- Because the name resolves to 4 IPs, the overall timeout is 400 ms.
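
The 4 x 100 ms arithmetic can be reproduced without any network access. The Ruby sketch below is a simplified model of the retry behavior described above, not the actual `Socket.tcp` source: `connect_each_with_full_timeout` and `HangingEndpoint` are hypothetical names, and the fake endpoint simply waits out the whole timeout and then raises `Errno::ETIMEDOUT`, mimicking an address whose connect never completes (as with the NLB in this setup).

```ruby
# Simplified model (hypothetical names) of the behavior described in this
# report: each resolved address is tried in turn, and each attempt is given
# the full connect_timeout.
def connect_each_with_full_timeout(endpoints, connect_timeout)
  last_error = nil
  endpoints.each do |endpoint|
    begin
      return endpoint.connect(timeout: connect_timeout)
    rescue SystemCallError => e
      last_error = e # remember the failure and move on to the next address
    end
  end
  raise last_error || ArgumentError.new("no addresses to try")
end

# Hypothetical stand-in for an address whose connect never completes: it
# waits out the whole timeout, then fails the way a timed-out connect would.
class HangingEndpoint
  def connect(timeout:)
    sleep timeout
    raise Errno::ETIMEDOUT, "simulated connect timeout"
  end
end

# Four unresponsive addresses with a 100 ms per-attempt timeout.
endpoints = Array.new(4) { HangingEndpoint.new }
started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
begin
  connect_each_with_full_timeout(endpoints, 0.1)
rescue Errno::ETIMEDOUT
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  puts format("gave up after %.2f s", elapsed) # roughly 4 x 0.1 s
end
```

Because every address receives the full timeout, the worst case grows linearly with the number of addresses the name resolves to.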

I'm not sure whether this should be classified as a bug or a feature, but I believe it should definitely be documented, or there should be an option to enforce this limit as a hard cap.



-- 
https://bugs.ruby-lang.org/


* [ruby-core:88631] [Ruby trunk Bug#14997] Socket connect timeout exceeds the timeout value for
From: maciej @ 2018-08-24 14:47 UTC
  To: ruby-core

Issue #14997 has been updated by maciej.mensfeld (Maciej Mensfeld).


If anyone can confirm that this is indeed unwanted/unexpected behavior, I offer to fix it.

It could be fixed by tracking how much of the time "pool" has already been used and lowering the timeout for the next attempts accordingly. That would guarantee that we never exceed the timeout.

I think this is the most elegant solution.
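
The proposed fix can be sketched as a shared deadline. This is a minimal illustration under stated assumptions and is not taken from any actual patch: `connect_with_total_timeout` is a hypothetical helper, it relies on `Addrinfo#connect(timeout:)` raising `Errno::ETIMEDOUT` when an attempt runs out of time, and it measures elapsed time with the monotonic clock so wall-clock adjustments cannot stretch the budget.

```ruby
require "socket"

# Hypothetical helper sketching the proposed fix: a single time budget is
# shared by all resolved addresses instead of being granted to each one.
def connect_with_total_timeout(addrinfos, total_timeout)
  now = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }
  deadline = now.call + total_timeout
  last_error = nil

  addrinfos.each do |ai|
    remaining = deadline - now.call
    break if remaining <= 0 # budget used up: stop trying further addresses

    begin
      # Addrinfo#connect(timeout:) raises Errno::ETIMEDOUT if this attempt
      # does not finish within the time left in the budget.
      return ai.connect(timeout: remaining)
    rescue SystemCallError => e
      last_error = e # try the next address with whatever time remains
    end
  end

  # Nothing connected within the budget (or there was nothing to try).
  raise last_error || Errno::ETIMEDOUT.new("overall connect timeout exceeded")
end

# Usage against a local listener, so no DNS or network access is needed.
server = TCPServer.new("127.0.0.1", 0)
addrinfos = Addrinfo.getaddrinfo("127.0.0.1", server.addr[1], nil, :STREAM)
sock = connect_with_total_timeout(addrinfos, 0.1)
puts sock.remote_address.inspect_sockaddr
sock.close
server.close
```

Each attempt receives only what is left of the budget, so with four unresponsive addresses and a 100 ms budget, the first attempt consumes the whole 100 ms and the remaining addresses are never tried. One possible trade-off: a single slow address can starve the others; splitting the budget across addresses would be an alternative design.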

----------------------------------------
Bug #14997: Socket connect timeout exceeds the timeout value for 
https://bugs.ruby-lang.org/issues/14997#change-73691

Here's the code actually responsible for this behavior: https://github.com/ruby/ruby/blob/trunk/ext/socket/lib/socket.rb#L631-L664





* [ruby-core:93472] [Ruby master Bug#14997] Socket connect timeout exceeds the timeout value for
From: tenderlove @ 2019-07-02 10:58 UTC
  To: ruby-core

Issue #14997 has been updated by tenderlovemaking (Aaron Patterson).


This really sounds like a bug to me.  Please make a patch and I will apply it.

----------------------------------------
Bug #14997: Socket connect timeout exceeds the timeout value for 
https://bugs.ruby-lang.org/issues/14997#change-79022

* Author: maciej.mensfeld (Maciej Mensfeld)
* Status: Open
* Priority: Normal
* Assignee: 
* Target version: 
* ruby -v: 2.5.1
* Backport: 2.3: UNKNOWN, 2.4: UNKNOWN, 2.5: UNKNOWN
----------------------------------------
Given a case, where a domain is being resolved to multiple IPs (4 in the following example):

```
dig debug-xyz.elb.us-east-1.amazonaws.com a

; <<>> DiG 9.10.3-P4-Ubuntu <<>> debug-xyz.elb.us-east-1.amazonaws.com a
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 54375
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;debug-xyz.elb.us-east-1.amazonaws.com. IN A

;; ANSWER SECTION:
debug-xyz.elb.us-east-1.amazonaws.com. 60 IN A 172.31.86.79
debug-xyz.elb.us-east-1.amazonaws.com. 60 IN A 172.31.109.24
debug-xyz.elb.us-east-1.amazonaws.com. 60 IN A 172.31.119.55
debug-xyz.elb.us-east-1.amazonaws.com. 60 IN A 172.31.71.167

;; Query time: 4 msec
;; SERVER: 172.31.0.2#53(172.31.0.2)
;; WHEN: Tue Aug 14 13:46:18 UTC 2018
;; MSG SIZE  rcvd: 132
```

and when `connect_timeout` is set to a certain value (N), the overall timeout upon non-responsive endpoints that don't immediately throw an exception can reach `N * 4`.

This can disrupt some time-sensitive systems.

We've experienced it with the following setup:

- TCP server (event machine) behind an AWS NLB
- TCP server process goes down behind NLB but NLB is still responsive
- Socket connect_timeout is set to 100ms
- AWS NLB keeps the connection in the waiting state hoping that the service behind it will get back to normal (but it doesn't)
- Ruby timeouts after 100ms
- Ruby tries to connect to the next IP from the pool (AWS NLB again)
- Due to 4 hosts resolving, the overall timeout is 400ms.

Not sure whether this should be qualified as a bug or a feature, but I believe it should be definitely documented or there should be an option to "hard" block this limit.

Here's the code actually responsible for this behavior: https://github.com/ruby/ruby/blob/trunk/ext/socket/lib/socket.rb#L631-L664





* [ruby-core:95961] [Ruby master Bug#14997] Socket connect timeout exceeds the timeout value for
From: shatrov @ 2019-11-26 14:03 UTC
  To: ruby-core

Issue #14997 has been updated by kirs (Kir Shatrov).


tenderlovemaking (Aaron Patterson) wrote:
> This really sounds like a bug to me.  Please make a patch and I will apply it.

Would you mind taking a look at https://github.com/ruby/ruby/pull/1806? Based on my testing, it solves the problem.

Together with https://bugs.ruby-lang.org/issues/15553 (already merged), many of us at Shopify would really love to see this fixed in 2.7, as it would improve resiliency and keep Ruby processes from hanging for 10 s (the default resolv timeout) when DNS is experiencing issues.
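
For reference, a caller can bound both phases through `Socket.tcp`'s `resolv_timeout:` and `connect_timeout:` keywords. This is a minimal sketch, assuming a Ruby version (2.7 or later) whose `Socket.tcp` accepts `resolv_timeout:`; the local `TCPServer` stands in for a real endpoint so the example needs neither DNS nor network access. On versions affected by this issue, `connect_timeout` still applies per resolved address, so the worst case is roughly `resolv_timeout` plus `connect_timeout` times the number of addresses.

```ruby
require "socket"

# Local listener so the example runs without DNS or network access.
server = TCPServer.new("127.0.0.1", 0)
port = server.addr[1]

# resolv_timeout: limit for name resolution (assumes Ruby 2.7 or later).
# connect_timeout: limit for a connect attempt (applied per resolved address
# on the versions discussed in this issue).
sock = Socket.tcp("127.0.0.1", port, connect_timeout: 0.1, resolv_timeout: 1)
puts sock.remote_address.inspect_sockaddr
sock.close
server.close
```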

----------------------------------------
Bug #14997: Socket connect timeout exceeds the timeout value for 
https://bugs.ruby-lang.org/issues/14997#change-82795

* Author: maciej.mensfeld (Maciej Mensfeld)
* Status: Open
* Priority: Normal
* Assignee: 
* Target version: 
* ruby -v: 2.5.1
* Backport: 2.3: UNKNOWN, 2.4: UNKNOWN, 2.5: UNKNOWN
----------------------------------------
Given a case, where a domain is being resolved to multiple IPs (4 in the following example):

```
dig debug-xyz.elb.us-east-1.amazonaws.com a

; <<>> DiG 9.10.3-P4-Ubuntu <<>> debug-xyz.elb.us-east-1.amazonaws.com a
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 54375
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;debug-xyz.elb.us-east-1.amazonaws.com. IN A

;; ANSWER SECTION:
debug-xyz.elb.us-east-1.amazonaws.com. 60 IN A 172.31.86.79
debug-xyz.elb.us-east-1.amazonaws.com. 60 IN A 172.31.109.24
debug-xyz.elb.us-east-1.amazonaws.com. 60 IN A 172.31.119.55
debug-xyz.elb.us-east-1.amazonaws.com. 60 IN A 172.31.71.167

;; Query time: 4 msec
;; SERVER: 172.31.0.2#53(172.31.0.2)
;; WHEN: Tue Aug 14 13:46:18 UTC 2018
;; MSG SIZE  rcvd: 132
```

and when `connect_timeout` is set to a certain value (N), the overall timeout upon non-responsive endpoints that don't immediately throw an exception can reach `N * 4`.

This can disrupt some time-sensitive systems.

We've experienced it with the following setup:

- TCP server (event machine) behind an AWS NLB
- TCP server process goes down behind NLB but NLB is still responsive
- Socket connect_timeout is set to 100ms
- AWS NLB keeps the connection in the waiting state hoping that the service behind it will get back to normal (but it doesn't)
- Ruby timeouts after 100ms
- Ruby tries to connect to the next IP from the pool (AWS NLB again)
- Due to 4 hosts resolving, the overall timeout is 400ms.

Not sure whether this should be qualified as a bug or a feature, but I believe it should be definitely documented or there should be an option to "hard" block this limit.

Here's the code actually responsible for this behavior: https://github.com/ruby/ruby/blob/trunk/ext/socket/lib/socket.rb#L631-L664




