From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <ruby-core-bounces@ruby-lang.org>
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on dcvr.yhbt.net
X-Spam-Level: 
X-Spam-ASN: AS4713 221.184.0.0/13
X-Spam-Status: No, score=-4.1 required=3.0 tests=BAYES_00,
	HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,RCVD_IN_DNSWL_MED,
	SPF_PASS shortcircuit=no autolearn=ham autolearn_force=no version=3.4.2
Received: from neon.ruby-lang.org (neon.ruby-lang.org [221.186.184.75])
	by dcvr.yhbt.net (Postfix) with ESMTP id 34A1D20248
	for <normalperson@yhbt.net>; Wed,  6 Mar 2019 08:54:01 +0000 (UTC)
Received: from neon.ruby-lang.org (localhost [IPv6:::1])
	by neon.ruby-lang.org (Postfix) with ESMTP id 183F1120BD3;
	Wed,  6 Mar 2019 17:53:58 +0900 (JST)
Received: from o1678948x4.outbound-mail.sendgrid.net
 (o1678948x4.outbound-mail.sendgrid.net [167.89.48.4])
 by neon.ruby-lang.org (Postfix) with ESMTPS id EF68F120937
 for <ruby-core@ruby-lang.org>; Wed,  6 Mar 2019 17:53:55 +0900 (JST)
Received: by filter0073p3iad2.sendgrid.net with SMTP id
 filter0073p3iad2-11788-5C7F8AA3-9
 2019-03-06 08:53:55.142009949 +0000 UTC m=+50383.624145857
Received: from herokuapp.com (unknown [34.204.10.139])
 by ismtpd0038p1mdw1.sendgrid.net (SG) with ESMTP id 2HW3ZfmnSq-_UpS0nuIcBQ
 for <ruby-core@ruby-lang.org>; Wed, 06 Mar 2019 08:53:55.010 +0000 (UTC)
Date: Wed, 06 Mar 2019 08:53:55 +0000 (UTC)
From: naruse@airemix.jp
Message-ID: <redmine.journal-76962.20190306085354.9a83e0b08e6805f0@ruby-lang.org>
References: <redmine.issue-15635.20190304162633@ruby-lang.org>
Mime-Version: 1.0
X-Redmine-MailingListIntegration-Message-Ids: 67147
X-Redmine-Project: ruby-trunk
X-Redmine-Issue-Id: 15635
X-Redmine-Issue-Author: nirvdrum
X-Redmine-Sender: naruse
X-Mailer: Redmine
X-Redmine-Host: bugs.ruby-lang.org
X-Redmine-Site: Ruby Issue Tracking System
X-Auto-Response-Suppress: All
Auto-Submitted: auto-generated
X-SG-EID: =?us-ascii?Q?UqoG4vcRhHM9V1I4f4J7DhjzUfTg+8muXbMD6UD+LVS2Ip809drqxcy=2FVmAEuy?=
 =?us-ascii?Q?IGgQDozpEu63AmPj3NJWHLRvR5CYVm6yIbntm8N?=
 =?us-ascii?Q?s8Y+79TuAvVrT28S1u8rC+KMF7m9of9T7dGPeTp?=
 =?us-ascii?Q?tARGTE4YOiH9G9kUVq34ntQcBvYr3SyYNjrS9qS?=
 =?us-ascii?Q?AzXkdtBwLq4=2FkxsgAcmlcV+Ehkd6X9ACYkw=3D=3D?=
To: ruby-core@ruby-lang.org
X-ML-Name: ruby-core
X-Mail-Count: 91697
Subject: [ruby-core:91697] [Ruby trunk Bug#15635] Inconsistent handling of
 dummy encodings and code range
X-BeenThere: ruby-core@ruby-lang.org
X-Mailman-Version: 2.1.15
Precedence: list
Reply-To: Ruby developers <ruby-core@ruby-lang.org>
List-Id: Ruby developers <ruby-core.ruby-lang.org>
List-Unsubscribe: <https://lists.ruby-lang.org/cgi-bin/mailman/options/ruby-core>, 
 <mailto:ruby-core-request@ruby-lang.org?subject=unsubscribe>
List-Post: <mailto:ruby-core@ruby-lang.org>
List-Help: <mailto:ruby-core-request@ruby-lang.org?subject=help>
List-Subscribe: <https://lists.ruby-lang.org/cgi-bin/mailman/listinfo/ruby-core>, 
 <mailto:ruby-core-request@ruby-lang.org?subject=subscribe>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Errors-To: ruby-core-bounces@ruby-lang.org
Sender: "ruby-core" <ruby-core-bounces@ruby-lang.org>

Issue #15635 has been updated by naruse (Yui NARUSE).

Backport changed from 2.4: REQUIRED, 2.5: REQUIRED, 2.6: REQUIRED to 2.4: REQUIRED, 2.5: REQUIRED, 2.6: DONE

ruby_2_6 r67181 merged revision(s) 67167.

----------------------------------------
Bug #15635: Inconsistent handling of dummy encodings and code range
https://bugs.ruby-lang.org/issues/15635#change-76962

* Author: nirvdrum (Kevin Menard)
* Status: Closed
* Priority: Normal
* Assignee: 
* Target version: 
* ruby -v: ruby 2.6.1p33 (2019-01-30 revision 66950) [x86_64-linux]
* Backport: 2.4: REQUIRED, 2.5: REQUIRED, 2.6: DONE
----------------------------------------
It's hard to write code that works properly with dummy encodings, so they should really be avoided altogether. However, I've come across a code path that I think yields inconsistent results when it comes to dummy encodings with a minimum character length > 1 (i.e., "UTF-16" and "UTF-32"). To illustrate the issue, run the following program:


``` ruby
s = "abc"
puts s.encoding
puts "Dummy: #{s.encoding.dummy?}"
puts "Valid: #{s.valid_encoding?}"
puts s.bytes.inspect
puts

s.encode!("UTF-32")
puts s.encoding
puts "Dummy: #{s.encoding.dummy?}"
puts "Valid: #{s.valid_encoding?}"
puts s.bytes.inspect
puts

s.force_encoding("UTF-32")
puts s.encoding
puts "Dummy: #{s.encoding.dummy?}"
puts "Valid: #{s.valid_encoding?}"
puts s.bytes.inspect
```

The output on Ruby 2.6.1p33 for me is:

```
UTF-8
Dummy: false
Valid: true
[97, 98, 99]

UTF-32
Dummy: true
Valid: true
[0, 0, 254, 255, 0, 0, 0, 97, 0, 0, 0, 98, 0, 0, 0, 99]

UTF-32
Dummy: true
Valid: false
[0, 0, 254, 255, 0, 0, 0, 97, 0, 0, 0, 98, 0, 0, 0, 99]
```

Basically, we start with a UTF-8 string and convert it to UTF-32. Without an explicit indication of endianness, the encoding is considered dummy, but internally big endian is used (i.e., UTF-32BE). The new byte pattern for the successfully encoded string is shown. After calling `force_encoding` on the string with the same encoding, suddenly the string is no longer considered valid. I think many people would expect `force_encoding` using the string's current encoding to be a no-op. Even setting that aside, we can see the byte sequence and the encoding for the string didn't change, but its validity has. I believe this is wrong.

The problematic lines are https://github.com/ruby/ruby/blob/e6d1c72bec5c6544e9ae82501a6cdd7460222096/string.c#L660-L662 from `rb_enc_str_coderange`:

``` c
if (rb_enc_mbminlen(enc) > 1 && rb_enc_dummy_p(enc)) {
	    cr = ENC_CODERANGE_BROKEN;
}
```

This unconditionally sets the code range to `CR_BROKEN`, but only for dummy encodings with a minimum length > 1. I think this is designed to specifically target UTF-16 and UTF-32, while leaving UTF-7 alone.

It may be the case that the correct behavior here is to always mark the string as invalid. After all, the dummy encodings could change endianness based on platform (although, I don't think they ever do). If that's the case, then the issue is with the code path from `String#encode`.

The inconsistency presents both correctness and performance issues. The former is because end user code may use a method like `String#valid_encoding?` for branching decisions. The latter because the runtime generally takes a slower path for operations on `CR_BROKEN` strings.


-- 
https://bugs.ruby-lang.org/