[ruby-core:90931] [Ruby trunk Bug#15517] Net::HTTP not recognizing valid UTF-8

ruby-core@ruby-lang.org archive (unofficial mirror)

* [ruby-core:90931] [Ruby trunk Bug#15517] Net::HTTP not recognizing valid UTF-8
       [not found] <redmine.issue-15517.20190108161351@ruby-lang.org>
@ 2019-01-08 16:13 ` cohencarlisle+bugs.ruby-lang
  2019-01-19  3:30 ` [ruby-core:91167] " cohencarlisle+bugs.ruby-lang
  2019-01-20  1:15 ` [ruby-core:91184] " duerst
  2 siblings, 0 replies; 3+ messages in thread
From: cohencarlisle+bugs.ruby-lang @ 2019-01-08 16:13 UTC (permalink / raw)
  To: ruby-core

Issue #15517 has been reported by cohen (Cohen Carlisle).

----------------------------------------
Bug #15517: Net::HTTP not recognizing valid UTF-8
https://bugs.ruby-lang.org/issues/15517

* Author: cohen (Cohen Carlisle)
* Status: Open
* Priority: Normal
* Assignee: 
* Target version: 
* ruby -v: 2.6.0
* Backport: 2.4: UNKNOWN, 2.5: UNKNOWN, 2.6: UNKNOWN
----------------------------------------
I created a test case at https://github.com/Cohen-Carlisle/utf8app that shows Net::HTTP labeling a response body as ASCII-8BIT when it contains a non-ASCII character (specifically, the double prime symbol: ″), while labeling ASCII-only bodies as UTF-8. The example is live on Heroku, but because it runs on a free dyno, it goes to sleep when idle and the first request after a while takes some time to start up.

As explained there, I would expect response body strings with the double prime symbol to still have an encoding of UTF-8 since they are valid UTF-8.

The README from the repo (which shows the behavior) is reproduced below:

The purpose of this app is to demonstrate unexpected behavior in Ruby's net/http library. Valid UTF-8 response bodies are tagged as ASCII-8BIT, which apparently means Ruby is treating them as pure binary data, even when the Content-Type header labels the body as UTF-8.

In the example below, I would expect the response body to have UTF-8 encoding, especially because when I copy and paste the body into a new string literal in my console, that string is UTF-8 encoded.

~~~
require 'net/http'
uri = URI('https://utf8app.herokuapp.com')
uri.path = '/utf8/example'
res = Net::HTTP.get_response(uri)
res['Content-Type']
# => "text/plain; charset=utf-8"
puts res.body
# The symbol for the inch unit of measurement is ″.
res.body.encoding
# => #<Encoding:ASCII-8BIT>
res.body.ascii_only?
# => false
'The symbol for the inch unit of measurement is ″.'.encoding
# => #<Encoding:UTF-8>
~~~

We can demonstrate that the encoding issue is due to the non-ASCII inches symbol by replacing it with an ASCII double quote instead.

~~~
uri.path = '/ascii/example'
res = Net::HTTP.get_response(uri)
res['Content-Type']
# => "text/plain; charset=utf-8"
puts res.body
# The symbol for the inch unit of measurement is ".
res.body.encoding
# => #<Encoding:UTF-8>
res.body.ascii_only?
# => true
~~~

Finally, as an extra WTF, JSON.parse recognizes the non-ASCII characters as valid UTF-8 in a JSON example, even though the body it is given is tagged ASCII-8BIT.

~~~
require 'json'
uri.path = '/utf8/example_json'
res = Net::HTTP.get_response(uri)
res['Content-Type']
# => "application/json; charset=utf-8"
puts res.body
# {"feet":"′","inches":"″"}
res.body.encoding
# => #<Encoding:ASCII-8BIT>
json = JSON.parse(res.body)
# => {"feet"=>"′", "inches"=>"″"}
json.values.map { |v| [v.encoding.to_s, v] }
# => [["UTF-8", "′"], ["UTF-8", "″"]]
~~~
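
The JSON behavior can be reproduced without the network. This is a minimal sketch, assuming the body arrives as UTF-8 bytes tagged ASCII-8BIT (simulated here with `String#b`); the variable names are made up for illustration.

```ruby
require 'json'

# Simulate a body as shown above: UTF-8 bytes carrying the ASCII-8BIT label.
raw_body = '{"feet":"′","inches":"″"}'.b

# The parser returns UTF-8 strings even though its input was tagged binary.
parsed = JSON.parse(raw_body)
value_encodings = parsed.values.map(&:encoding)
```

So the bytes themselves are fine; only the encoding label on `res.body` differs from what the header declares.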

-- 
https://bugs.ruby-lang.org/


* [ruby-core:91167] [Ruby trunk Bug#15517] Net::HTTP not recognizing valid UTF-8
       [not found] <redmine.issue-15517.20190108161351@ruby-lang.org>
  2019-01-08 16:13 ` [ruby-core:90931] [Ruby trunk Bug#15517] Net::HTTP not recognizing valid UTF-8 cohencarlisle+bugs.ruby-lang
@ 2019-01-19  3:30 ` cohencarlisle+bugs.ruby-lang
  2019-01-20  1:15 ` [ruby-core:91184] " duerst
  2 siblings, 0 replies; 3+ messages in thread
From: cohencarlisle+bugs.ruby-lang @ 2019-01-19  3:30 UTC (permalink / raw)
  To: ruby-core

Issue #15517 has been updated by cohen (Cohen Carlisle).

I'm not sure this is exactly the same as https://bugs.ruby-lang.org/issues/2567, as that one focuses on using the HTTP headers to guess the content type. Here I'm pointing out that ASCII-only strings are recognized as UTF-8, but valid multi-byte UTF-8 strings are not. I suppose the trouble is that checking whether a string is validly encoded UTF-8 is not trivial, but other core/stdlib functions, like File.read, seem to do this.
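
For comparison, here is a minimal sketch of how `File.read` handles the encoding label. As far as I can tell it tags the bytes with the requested (or default external) encoding rather than validating them; the temp-file prefix and the sample bytes below are assumptions made for illustration.

```ruby
require 'tempfile'

# Write a truncated multi-byte sequence (not valid UTF-8) and read it back.
text = Tempfile.create('utf8-check') do |file|
  file.binmode
  file.write("\xE2\x80".b)
  file.flush
  File.read(file.path, encoding: 'UTF-8')
end

label = text.encoding        # the label that File.read applied
valid = text.valid_encoding? # whether the bytes actually match that label
```

If that holds, the UTF-8 label on a string read from a file reflects the requested encoding, not a check of the content.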

----------------------------------------
Bug #15517: Net::HTTP not recognizing valid UTF-8
https://bugs.ruby-lang.org/issues/15517#change-76395


-- 
https://bugs.ruby-lang.org/


* [ruby-core:91184] [Ruby trunk Bug#15517] Net::HTTP not recognizing valid UTF-8
       [not found] <redmine.issue-15517.20190108161351@ruby-lang.org>
  2019-01-08 16:13 ` [ruby-core:90931] [Ruby trunk Bug#15517] Net::HTTP not recognizing valid UTF-8 cohencarlisle+bugs.ruby-lang
  2019-01-19  3:30 ` [ruby-core:91167] " cohencarlisle+bugs.ruby-lang
@ 2019-01-20  1:15 ` duerst
  2 siblings, 0 replies; 3+ messages in thread
From: duerst @ 2019-01-20  1:15 UTC (permalink / raw)
  To: ruby-core

Issue #15517 has been updated by duerst (Martin Dürst).

I think this issue is not a duplicate of issue #2567, but it is clearly related.

Checking whether the string is valid UTF-8 is rather easy:
```
string.force_encoding('UTF-8').valid_encoding?
```
should do.
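
Note that `force_encoding` changes the receiver in place. A minimal sketch of the same check that leaves the original string untouched, assuming a helper named `valid_utf8?` (a hypothetical name, not part of Ruby or Net::HTTP):

```ruby
# Hypothetical helper: true if the bytes of +string+ form valid UTF-8.
# Works on a copy, so the caller's string keeps its current encoding label.
def valid_utf8?(string)
  string.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

binary_ok  = "inch: \u2033".b # valid UTF-8 bytes, tagged ASCII-8BIT
binary_bad = "\xE2\x80".b     # truncated multi-byte sequence

ok_result  = valid_utf8?(binary_ok)
bad_result = valid_utf8?(binary_bad)
```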

There are some security issues with browsers accepting certain MIME types in certain encodings, and servers might mislabel content. But for `text/plain; charset=utf-8`, the text/plain part isn't a security issue, and text/plain has no way of indicating the encoding inside the document itself, so there should be no problem declaring the encoding of the resulting string as UTF-8.
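
Until something like this is done in Net::HTTP itself, a caller could apply the declared charset by hand. The following is only a sketch under stated assumptions: `tag_with_declared_charset` is a hypothetical helper, not part of Net::HTTP, and the regular expression is a simplification rather than a full Content-Type parser.

```ruby
# Hypothetical helper, not part of Net::HTTP: tag +body+ with the charset named
# in a Content-Type value, but only if the bytes are valid in that encoding.
# Returns the original string unchanged otherwise.
def tag_with_declared_charset(body, content_type)
  charset = content_type.to_s[/charset="?([^\s;"]+)/i, 1]
  return body unless charset

  encoding = begin
    Encoding.find(charset)
  rescue ArgumentError
    nil # unknown charset name: leave the body as it is
  end
  return body unless encoding

  candidate = body.dup.force_encoding(encoding)
  candidate.valid_encoding? ? candidate : body
end

tagged   = tag_with_declared_charset("inch: \u2033".b, 'text/plain; charset=utf-8')
untagged = tag_with_declared_charset("\xE2\x80".b, 'text/plain; charset=utf-8')
```

With a real response this would be called as `tag_with_declared_charset(res.body, res['Content-Type'])`. Falling back to the untouched body on invalid bytes or an unknown charset keeps mislabeled responses as binary data.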

----------------------------------------
Bug #15517: Net::HTTP not recognizing valid UTF-8
https://bugs.ruby-lang.org/issues/15517#change-76415


-- 
https://bugs.ruby-lang.org/
