From: "zverok (Victor Shepelev)" <noreply@ruby-lang.org>
To: ruby-core@ruby-lang.org
Subject: [ruby-core:105810] [Ruby master Bug#16497] StringIO#internal_encoding is broken (more severely in 2.7)
Date: Tue, 26 Oct 2021 16:31:58 +0000 (UTC)
Message-ID: <redmine.journal-94329.20211026163158.710@ruby-lang.org>
In-Reply-To: redmine.issue-16497.20200110111831.710@ruby-lang.org
Issue #16497 has been updated by zverok (Victor Shepelev).
@nobu @naruse @byroot A year later, this is still broken on the current master.
```ruby
require 'stringio'
RUBY_DESCRIPTION
# => "ruby 3.1.0dev (2021-10-26T11:17:00Z master afdca0e780) [x86_64-linux]"
str = 'Україна'.encode('KOI8-U')
# => "\xF5\xCB\xD2\xC1\xA7\xCE\xC1"
io = StringIO.new(str, 'r:KOI8-U:UTF-8')
io.internal_encoding
# => nil -- expected UTF-8
io.external_encoding
# => #<Encoding:UTF-8> -- expected KOI8-U
out = io.read
# => "\xF5\xCB\xD2\xC1\xA7\xCE\xC1" -- expected 'Україна' in UTF-8, but it seems to be still KOI8-U?
out.encoding
# => #<Encoding:UTF-8> -- but it can't even report it properly
```
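As a minimal sketch of a possible workaround while this stays broken: skip the `'r:KOI8-U:UTF-8'` mode string and transcode explicitly after reading. The helper name `read_transcoded` and its keyword arguments are made up for illustration; this is not an API of StringIO and not a fix suggested by the maintainers.

```ruby
require 'stringio'

# Hypothetical helper (not part of StringIO): read everything and transcode
# explicitly instead of relying on StringIO's internal encoding.
def read_transcoded(io, from:, to: 'UTF-8')
  raw = io.read
  # Restore the real encoding label in case StringIO mislabeled the result,
  # then transcode, which actually rewrites the bytes.
  raw.force_encoding(from).encode(to)
end

str = 'Україна'.encode('KOI8-U')
out = read_transcoded(StringIO.new(str), from: 'KOI8-U')
out          # the text 'Україна'
out.encoding # UTF-8
```

The `force_encoding` call is a no-op when the label is already right, so the helper should behave the same whether or not the StringIO reported the correct external encoding.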
----------------------------------------
Bug #16497: StringIO#internal_encoding is broken (more severely in 2.7)
https://bugs.ruby-lang.org/issues/16497#change-94329
* Author: zverok (Victor Shepelev)
* Status: Assigned
* Priority: Normal
* Assignee: nobu (Nobuyoshi Nakada)
* Backport: 2.5: DONTNEED, 2.6: DONTNEED, 2.7: REQUIRED
----------------------------------------
To the best of my understanding of the [Encoding](https://docs.ruby-lang.org/en/master/Encoding.html) docs, the following is true:
* external encoding (explicitly specified or taken from `Encoding.default_external`) specifies how the IO understands input and stores it internally
* internal encoding (explicitly specified or taken from `Encoding.default_internal`) specifies how the IO converts what it reads.
Demonstration with regular files:
```ruby
# prepare data
File.write('test.txt', 'Україна'.encode('KOI8-U'), encoding: 'KOI8-U') #=> 7
def test(io)
  str = io.read
  [io.external_encoding, io.internal_encoding, str, str.encoding]
end
# read it:
test(File.open('test.txt', 'r:KOI8-U'))
# => [#<Encoding:KOI8-U>, nil, "\xF5\xCB\xD2\xC1\xA7\xCE\xC1", #<Encoding:KOI8-U>]
# We can specify internal encoding when opening the file:
test(File.open('test.txt', 'r:KOI8-U:UTF-8'))
# => [#<Encoding:KOI8-U>, #<Encoding:UTF-8>, "Україна", #<Encoding:UTF-8>]
# ...or when it is already opened
test(File.open('test.txt').tap { |f| f.set_encoding('KOI8-U', 'UTF-8') })
# => [#<Encoding:KOI8-U>, #<Encoding:UTF-8>, "Україна", #<Encoding:UTF-8>]
# ...or with Encoding.default_internal
Encoding.default_internal = 'UTF-8'
test(File.open('test.txt', 'r:KOI8-U'))
# => [#<Encoding:KOI8-U>, #<Encoding:UTF-8>, "Україна", #<Encoding:UTF-8>]
```
But with StringIO, **internal encoding can't be set** in Ruby **2.6**:
```ruby
require 'stringio'
Encoding.default_internal = nil
str = 'Україна'.encode('KOI8-U')
# Simplest form:
test(StringIO.new(str))
# => [#<Encoding:KOI8-U>, nil, "\xF5\xCB\xD2\xC1\xA7\xCE\xC1", #<Encoding:KOI8-U>]
# Try to set via mode
test(StringIO.new(str, 'r:KOI8-U:UTF-8'))
# => [#<Encoding:KOI8-U>, nil, "\xF5\xCB\xD2\xC1\xA7\xCE\xC1", #<Encoding:KOI8-U>]
# Try to set via set_encoding:
test(StringIO.new(str, 'r:KOI8-U:UTF-8').tap { |f| f.set_encoding('KOI8-U', 'UTF-8') })
# => [#<Encoding:KOI8-U>, nil, "\xF5\xCB\xD2\xC1\xA7\xCE\xC1", #<Encoding:KOI8-U>]
# Try to set via Encoding.default_internal:
Encoding.default_internal = 'UTF-8'
test(StringIO.new(str))
# => [#<Encoding:KOI8-U>, nil, "\xF5\xCB\xD2\xC1\xA7\xCE\xC1", #<Encoding:KOI8-U>]
```
So, in 2.6, any attempt to set StringIO's internal encoding is **just ignored**.
In **2.7**, though, matters became much worse:
```ruby
require 'stringio'
Encoding.default_internal = nil
str = 'Україна'.encode('KOI8-U')
# Behaves same as 2.6
test(StringIO.new(str))
# => [#<Encoding:KOI8-U>, nil, "\xF5\xCB\xD2\xC1\xA7\xCE\xC1", #<Encoding:KOI8-U>]
# Try to set via mode: WEIRD behavior starts
test(StringIO.new(str, 'r:KOI8-U:UTF-8'))
# => [#<Encoding:UTF-8>, nil, "\xF5\xCB\xD2\xC1\xA7\xCE\xC1", #<Encoding:UTF-8>]
# Try to set via set_encoding: still just ignored
test(StringIO.new(str, 'r:KOI8-U:UTF-8').tap { |f| f.set_encoding('KOI8-U', 'UTF-8') })
# => [#<Encoding:KOI8-U>, nil, "\xF5\xCB\xD2\xC1\xA7\xCE\xC1", #<Encoding:KOI8-U>]
# Try to set via Encoding.default_internal: WEIRD behavior again
Encoding.default_internal = 'UTF-8'
test(StringIO.new(str))
# => [#<Encoding:UTF-8>, nil, "\xF5\xCB\xD2\xC1\xA7\xCE\xC1", #<Encoding:UTF-8>]
```
So **2.7** not only ignores attempts to set the **internal** encoding, but erroneously applies the requested value as the **external** one: strings are not transcoded, yet their encoding label is forcibly changed.
I believe this is a severe bug (more severe than 2.6's "just ignoring").
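To make the difference between "recoded" and "encoding forced to change" concrete, here is a small sketch using plain `String` methods only (no StringIO involved). It is an illustration of the two operations, not a description of how StringIO is implemented:

```ruby
koi8 = 'Україна'.encode('KOI8-U')

# Transcoding: the bytes are rewritten so that the text stays the same.
transcoded = koi8.encode('UTF-8')
transcoded.bytesize        # 14 (seven Cyrillic letters, two bytes each)
transcoded.valid_encoding? # true

# Relabeling: the bytes stay as they are, only the encoding label changes.
# This matches what the 2.7 output above looks like.
relabeled = koi8.dup.force_encoding('UTF-8')
relabeled.bytes == koi8.bytes # true
relabeled.valid_encoding?     # false, since 0xF5 cannot start a UTF-8 sequence
```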
[This Reddit thread](https://www.reddit.com/r/ruby/comments/emd6q4/is_this_a_stringio_bug_in_ruby_270/) shows how it breaks existing code:
* the author uses `StringIO` to work with `ASCII-8BIT` strings;
* the code runs in a Rails environment (which sets `Encoding.default_internal` to `UTF-8` by default);
* under **2.7**, `StringIO#read` returns `ASCII-8BIT` content in Strings saying their encoding is `UTF-8`.
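For code in that situation, a defensive sketch is to relabel whatever `StringIO#read` returns as binary before using it. The helper name `read_binary` and the sample bytes are hypothetical and not taken from the Reddit thread:

```ruby
require 'stringio'

# Hypothetical helper: always hand back binary data labeled ASCII-8BIT,
# whatever encoding label StringIO put on the string it returned.
def read_binary(io)
  io.read.force_encoding(Encoding::BINARY)
end

payload = "\xFF\xD8\xFF\xE0".b # arbitrary bytes that are not valid UTF-8
chunk = read_binary(StringIO.new(payload))
chunk.encoding # ASCII-8BIT
chunk.bytes    # same bytes as payload
```

Relabeling is safe here because binary data has no text to transcode; only the label needs to be right.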
--
https://bugs.ruby-lang.org/