[ruby-core:91662] [Ruby trunk Bug#15635] Inconsistent handling of dummy encodings and code range

ruby-core@ruby-lang.org archive (unofficial mirror)
 help / color / mirror / Atom feed

* [ruby-core:91662] [Ruby trunk Bug#15635] Inconsistent handling of dummy encodings and code range
       [not found] <redmine.issue-15635.20190304162633@ruby-lang.org>
@ 2019-03-04 16:26 ` ruby
  2019-03-04 16:28 ` [ruby-core:91663] " ruby
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 4+ messages in thread
From: ruby @ 2019-03-04 16:26 UTC (permalink / raw)
  To: ruby-core

Issue #15635 has been reported by nirvdrum (Kevin Menard).

----------------------------------------
Bug #15635: Inconsistent handling of dummy encodings and code range
https://bugs.ruby-lang.org/issues/15635

* Author: nirvdrum (Kevin Menard)
* Status: Open
* Priority: Normal
* Assignee: 
* Target version: 
* ruby -v: ruby 2.6.1p33 (2019-01-30 revision 66950) [x86_64-linux]
* Backport: 2.4: UNKNOWN, 2.5: UNKNOWN, 2.6: UNKNOWN
----------------------------------------
It's hard to write code that works properly with dummy encodings, so they should really be avoided altogether. However, I've come across a code path that I think yields inconsistent results when it comes to dummy encodings with a minimum character length > 1 (i.e., "UTF-16" and "UTF-32"). To illustrate the issue, run the following program:

``` ruby
s = "abc"
puts s.encoding
puts "Dummy: #{s.encoding.dummy?}"
puts "Valid: #{s.valid_encoding?}"
puts s.bytes.inspect
puts

s.encode!("UTF-32")
puts s.encoding
puts "Dummy: #{s.encoding.dummy?}"
puts "Valid: #{s.valid_encoding?}"
puts s.bytes.inspect
puts

s.force_encoding("UTF-32")
puts s.encoding
puts "Dummy: #{s.encoding.dummy?}"
puts "Valid: #{s.valid_encoding?}"
puts s.bytes.inspect
```

The output on Ruby 2.6.1p33 for me is:

```
UTF-8
Dummy: false
Valid: true
[97, 98, 99]

UTF-32
Dummy: true
Valid: true
[0, 0, 254, 255, 0, 0, 0, 97, 0, 0, 0, 98, 0, 0, 0, 99]

UTF-32
Dummy: true
Valid: false
[0, 0, 254, 255, 0, 0, 0, 97, 0, 0, 0, 98, 0, 0, 0, 99]
```

Basically, we start with a UTF-8 string and convert it to UTF-32. Without an explicit indication of endianness, the encoding is considered dummy, but internally big endian is used (i.e., UTF-32BE). The new byte pattern for the successfully encoded string is shown. After calling `force_encoding` on the string with the same encoding, suddenly the string is no longer considered valid. I think many people would expect `force_encoding` using the string's current encoding to be a no-op. Even setting that aside, we can see the byte sequence and the encoding for the string didn't change, but its validity has. I believe this is wrong.

The problematic lines are https://github.com/ruby/ruby/blob/e6d1c72bec5c6544e9ae82501a6cdd7460222096/string.c#L660-L662 from `rb_enc_str_coderange`:

``` c
if (rb_enc_mbminlen(enc) > 1 && rb_enc_dummy_p(enc)) {
	    cr = ENC_CODERANGE_BROKEN;
}
```

This unconditionally sets the code range to `CR_BROKEN`, but only for dummy encodings with a minimum length > 1. I think this is designed to specifically target UTF-16 and UTF-32, while leaving UTF-7 alone.

It may be the case that the correct behavior here is to always mark the string as invalid. After all, the dummy encodings could change endianness based on platform (although, I don't think they ever do). If that's the case, then the issue is with the code path from `String#encode`.

The inconsistency presents both correctness and performance issues. The former is because end user code may use a method like `String#valid_encoding?` for branching decisions. The latter because the runtime generally takes a slower path for operations on `CR_BROKEN` strings.

-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [ruby-core:91663] [Ruby trunk Bug#15635] Inconsistent handling of dummy encodings and code range
       [not found] <redmine.issue-15635.20190304162633@ruby-lang.org>
  2019-03-04 16:26 ` [ruby-core:91662] [Ruby trunk Bug#15635] Inconsistent handling of dummy encodings and code range ruby
@ 2019-03-04 16:28 ` ruby
  2019-03-06  8:53 ` [ruby-core:91697] " naruse
  2019-03-07 15:31 ` [ruby-core:91710] " nagachika00
  3 siblings, 0 replies; 4+ messages in thread
From: ruby @ 2019-03-04 16:28 UTC (permalink / raw)
  To: ruby-core

Issue #15635 has been updated by nirvdrum (Kevin Menard).

I also tested some older Ruby releases. The issue is also present in `ruby 2.4.4p296 (2018-03-28 revision 63013) [x86_64-linux]` and `ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-linux]`.

----------------------------------------
Bug #15635: Inconsistent handling of dummy encodings and code range
https://bugs.ruby-lang.org/issues/15635#change-76924

* Author: nirvdrum (Kevin Menard)
* Status: Open
* Priority: Normal
* Assignee: 
* Target version: 
* ruby -v: ruby 2.6.1p33 (2019-01-30 revision 66950) [x86_64-linux]
* Backport: 2.4: UNKNOWN, 2.5: UNKNOWN, 2.6: UNKNOWN
----------------------------------------
It's hard to write code that works properly with dummy encodings, so they should really be avoided altogether. However, I've come across a code path that I think yields inconsistent results when it comes to dummy encodings with a minimum character length > 1 (i.e., "UTF-16" and "UTF-32"). To illustrate the issue, run the following program:

``` ruby
s = "abc"
puts s.encoding
puts "Dummy: #{s.encoding.dummy?}"
puts "Valid: #{s.valid_encoding?}"
puts s.bytes.inspect
puts

s.encode!("UTF-32")
puts s.encoding
puts "Dummy: #{s.encoding.dummy?}"
puts "Valid: #{s.valid_encoding?}"
puts s.bytes.inspect
puts

s.force_encoding("UTF-32")
puts s.encoding
puts "Dummy: #{s.encoding.dummy?}"
puts "Valid: #{s.valid_encoding?}"
puts s.bytes.inspect
```

The output on Ruby 2.6.1p33 for me is:

```
UTF-8
Dummy: false
Valid: true
[97, 98, 99]

UTF-32
Dummy: true
Valid: true
[0, 0, 254, 255, 0, 0, 0, 97, 0, 0, 0, 98, 0, 0, 0, 99]

UTF-32
Dummy: true
Valid: false
[0, 0, 254, 255, 0, 0, 0, 97, 0, 0, 0, 98, 0, 0, 0, 99]
```

Basically, we start with a UTF-8 string and convert it to UTF-32. Without an explicit indication of endianness, the encoding is considered dummy, but internally big endian is used (i.e., UTF-32BE). The new byte pattern for the successfully encoded string is shown. After calling `force_encoding` on the string with the same encoding, suddenly the string is no longer considered valid. I think many people would expect `force_encoding` using the string's current encoding to be a no-op. Even setting that aside, we can see the byte sequence and the encoding for the string didn't change, but its validity has. I believe this is wrong.

The problematic lines are https://github.com/ruby/ruby/blob/e6d1c72bec5c6544e9ae82501a6cdd7460222096/string.c#L660-L662 from `rb_enc_str_coderange`:

``` c
if (rb_enc_mbminlen(enc) > 1 && rb_enc_dummy_p(enc)) {
	    cr = ENC_CODERANGE_BROKEN;
}
```

This unconditionally sets the code range to `CR_BROKEN`, but only for dummy encodings with a minimum length > 1. I think this is designed to specifically target UTF-16 and UTF-32, while leaving UTF-7 alone.

It may be the case that the correct behavior here is to always mark the string as invalid. After all, the dummy encodings could change endianness based on platform (although, I don't think they ever do). If that's the case, then the issue is with the code path from `String#encode`.

The inconsistency presents both correctness and performance issues. The former is because end user code may use a method like `String#valid_encoding?` for branching decisions. The latter because the runtime generally takes a slower path for operations on `CR_BROKEN` strings.

-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [ruby-core:91697] [Ruby trunk Bug#15635] Inconsistent handling of dummy encodings and code range
       [not found] <redmine.issue-15635.20190304162633@ruby-lang.org>
  2019-03-04 16:26 ` [ruby-core:91662] [Ruby trunk Bug#15635] Inconsistent handling of dummy encodings and code range ruby
  2019-03-04 16:28 ` [ruby-core:91663] " ruby
@ 2019-03-06  8:53 ` naruse
  2019-03-07 15:31 ` [ruby-core:91710] " nagachika00
  3 siblings, 0 replies; 4+ messages in thread
From: naruse @ 2019-03-06  8:53 UTC (permalink / raw)
  To: ruby-core

Issue #15635 has been updated by naruse (Yui NARUSE).

Backport changed from 2.4: REQUIRED, 2.5: REQUIRED, 2.6: REQUIRED to 2.4: REQUIRED, 2.5: REQUIRED, 2.6: DONE

ruby_2_6 r67181 merged revision(s) 67167.

----------------------------------------
Bug #15635: Inconsistent handling of dummy encodings and code range
https://bugs.ruby-lang.org/issues/15635#change-76962

* Author: nirvdrum (Kevin Menard)
* Status: Closed
* Priority: Normal
* Assignee: 
* Target version: 
* ruby -v: ruby 2.6.1p33 (2019-01-30 revision 66950) [x86_64-linux]
* Backport: 2.4: REQUIRED, 2.5: REQUIRED, 2.6: DONE
----------------------------------------
It's hard to write code that works properly with dummy encodings, so they should really be avoided altogether. However, I've come across a code path that I think yields inconsistent results when it comes to dummy encodings with a minimum character length > 1 (i.e., "UTF-16" and "UTF-32"). To illustrate the issue, run the following program:

``` ruby
s = "abc"
puts s.encoding
puts "Dummy: #{s.encoding.dummy?}"
puts "Valid: #{s.valid_encoding?}"
puts s.bytes.inspect
puts

s.encode!("UTF-32")
puts s.encoding
puts "Dummy: #{s.encoding.dummy?}"
puts "Valid: #{s.valid_encoding?}"
puts s.bytes.inspect
puts

s.force_encoding("UTF-32")
puts s.encoding
puts "Dummy: #{s.encoding.dummy?}"
puts "Valid: #{s.valid_encoding?}"
puts s.bytes.inspect
```

The output on Ruby 2.6.1p33 for me is:

```
UTF-8
Dummy: false
Valid: true
[97, 98, 99]

UTF-32
Dummy: true
Valid: true
[0, 0, 254, 255, 0, 0, 0, 97, 0, 0, 0, 98, 0, 0, 0, 99]

UTF-32
Dummy: true
Valid: false
[0, 0, 254, 255, 0, 0, 0, 97, 0, 0, 0, 98, 0, 0, 0, 99]
```

Basically, we start with a UTF-8 string and convert it to UTF-32. Without an explicit indication of endianness, the encoding is considered dummy, but internally big endian is used (i.e., UTF-32BE). The new byte pattern for the successfully encoded string is shown. After calling `force_encoding` on the string with the same encoding, suddenly the string is no longer considered valid. I think many people would expect `force_encoding` using the string's current encoding to be a no-op. Even setting that aside, we can see the byte sequence and the encoding for the string didn't change, but its validity has. I believe this is wrong.

The problematic lines are https://github.com/ruby/ruby/blob/e6d1c72bec5c6544e9ae82501a6cdd7460222096/string.c#L660-L662 from `rb_enc_str_coderange`:

``` c
if (rb_enc_mbminlen(enc) > 1 && rb_enc_dummy_p(enc)) {
	    cr = ENC_CODERANGE_BROKEN;
}
```

This unconditionally sets the code range to `CR_BROKEN`, but only for dummy encodings with a minimum length > 1. I think this is designed to specifically target UTF-16 and UTF-32, while leaving UTF-7 alone.

It may be the case that the correct behavior here is to always mark the string as invalid. After all, the dummy encodings could change endianness based on platform (although, I don't think they ever do). If that's the case, then the issue is with the code path from `String#encode`.

The inconsistency presents both correctness and performance issues. The former is because end user code may use a method like `String#valid_encoding?` for branching decisions. The latter because the runtime generally takes a slower path for operations on `CR_BROKEN` strings.

-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [ruby-core:91710] [Ruby trunk Bug#15635] Inconsistent handling of dummy encodings and code range
       [not found] <redmine.issue-15635.20190304162633@ruby-lang.org>
                   ` (2 preceding siblings ...)
  2019-03-06  8:53 ` [ruby-core:91697] " naruse
@ 2019-03-07 15:31 ` nagachika00
  3 siblings, 0 replies; 4+ messages in thread
From: nagachika00 @ 2019-03-07 15:31 UTC (permalink / raw)
  To: ruby-core

Issue #15635 has been updated by nagachika (Tomoyuki Chikanaga).

Backport changed from 2.4: REQUIRED, 2.5: REQUIRED, 2.6: DONE to 2.4: REQUIRED, 2.5: DONE, 2.6: DONE

ruby_2_5 r67192 merged revision(s) 67167.

----------------------------------------
Bug #15635: Inconsistent handling of dummy encodings and code range
https://bugs.ruby-lang.org/issues/15635#change-76980

* Author: nirvdrum (Kevin Menard)
* Status: Closed
* Priority: Normal
* Assignee: 
* Target version: 
* ruby -v: ruby 2.6.1p33 (2019-01-30 revision 66950) [x86_64-linux]
* Backport: 2.4: REQUIRED, 2.5: DONE, 2.6: DONE
----------------------------------------
It's hard to write code that works properly with dummy encodings, so they should really be avoided altogether. However, I've come across a code path that I think yields inconsistent results when it comes to dummy encodings with a minimum character length > 1 (i.e., "UTF-16" and "UTF-32"). To illustrate the issue, run the following program:

``` ruby
s = "abc"
puts s.encoding
puts "Dummy: #{s.encoding.dummy?}"
puts "Valid: #{s.valid_encoding?}"
puts s.bytes.inspect
puts

s.encode!("UTF-32")
puts s.encoding
puts "Dummy: #{s.encoding.dummy?}"
puts "Valid: #{s.valid_encoding?}"
puts s.bytes.inspect
puts

s.force_encoding("UTF-32")
puts s.encoding
puts "Dummy: #{s.encoding.dummy?}"
puts "Valid: #{s.valid_encoding?}"
puts s.bytes.inspect
```

The output on Ruby 2.6.1p33 for me is:

```
UTF-8
Dummy: false
Valid: true
[97, 98, 99]

UTF-32
Dummy: true
Valid: true
[0, 0, 254, 255, 0, 0, 0, 97, 0, 0, 0, 98, 0, 0, 0, 99]

UTF-32
Dummy: true
Valid: false
[0, 0, 254, 255, 0, 0, 0, 97, 0, 0, 0, 98, 0, 0, 0, 99]
```

Basically, we start with a UTF-8 string and convert it to UTF-32. Without an explicit indication of endianness, the encoding is considered dummy, but internally big endian is used (i.e., UTF-32BE). The new byte pattern for the successfully encoded string is shown. After calling `force_encoding` on the string with the same encoding, suddenly the string is no longer considered valid. I think many people would expect `force_encoding` using the string's current encoding to be a no-op. Even setting that aside, we can see the byte sequence and the encoding for the string didn't change, but its validity has. I believe this is wrong.

The problematic lines are https://github.com/ruby/ruby/blob/e6d1c72bec5c6544e9ae82501a6cdd7460222096/string.c#L660-L662 from `rb_enc_str_coderange`:

``` c
if (rb_enc_mbminlen(enc) > 1 && rb_enc_dummy_p(enc)) {
	    cr = ENC_CODERANGE_BROKEN;
}
```

This unconditionally sets the code range to `CR_BROKEN`, but only for dummy encodings with a minimum length > 1. I think this is designed to specifically target UTF-16 and UTF-32, while leaving UTF-7 alone.

It may be the case that the correct behavior here is to always mark the string as invalid. After all, the dummy encodings could change endianness based on platform (although, I don't think they ever do). If that's the case, then the issue is with the code path from `String#encode`.

The inconsistency presents both correctness and performance issues. The former is because end user code may use a method like `String#valid_encoding?` for branching decisions. The latter because the runtime generally takes a slower path for operations on `CR_BROKEN` strings.

-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2019-03-07 15:31 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <redmine.issue-15635.20190304162633@ruby-lang.org>
2019-03-04 16:26 ` [ruby-core:91662] [Ruby trunk Bug#15635] Inconsistent handling of dummy encodings and code range ruby
2019-03-04 16:28 ` [ruby-core:91663] " ruby
2019-03-06  8:53 ` [ruby-core:91697] " naruse
2019-03-07 15:31 ` [ruby-core:91710] " nagachika00

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).