From: ruby@kevin.nirvdrum.com
Date: Mon, 04 Mar 2019 16:28:36 +0000 (UTC)
To: ruby-core@ruby-lang.org
Subject: [ruby-core:91663] [Ruby trunk Bug#15635] Inconsistent handling of dummy encodings and code range

Issue #15635 has been updated by nirvdrum (Kevin Menard).

I also tested some older Ruby releases. The issue is also present in `ruby 2.4.4p296 (2018-03-28 revision 63013) [x86_64-linux]` and `ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-linux]`.

----------------------------------------
Bug #15635: Inconsistent handling of dummy encodings and code range
https://bugs.ruby-lang.org/issues/15635#change-76924

* Author: nirvdrum (Kevin Menard)
* Status: Open
* Priority: Normal
* Assignee:
* Target version:
* ruby -v: ruby 2.6.1p33 (2019-01-30 revision 66950) [x86_64-linux]
* Backport: 2.4: UNKNOWN, 2.5: UNKNOWN, 2.6: UNKNOWN
----------------------------------------
It's hard to write code that works properly with dummy encodings, so they should really be avoided altogether. However, I've come across a code path that I think yields inconsistent results when it comes to dummy encodings with a minimum character length > 1 (i.e., "UTF-16" and "UTF-32").
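For readers less familiar with the term, one quick way to see which encodings Ruby flags as dummy is to filter `Encoding.list` with `Encoding#dummy?`. The sketch below relies only on those two core methods; the helper name `dummy_encoding_names` is made up for illustration, and the full list may vary between Ruby versions. The minimum character length is not exposed at the Ruby level, so this shows only the dummy flag.

``` ruby
# Hypothetical helper: names of every encoding Ruby flags as dummy.
def dummy_encoding_names
  Encoding.list.select(&:dummy?).map(&:name)
end

puts dummy_encoding_names.sort.join(", ")

# The endianness-agnostic Unicode encodings are dummy; the explicit ones are not.
puts Encoding::UTF_32.dummy?    # => true
puts Encoding::UTF_32BE.dummy?  # => false
```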

To illustrate the issue, run the following program:

``` ruby
s = "abc"
puts s.encoding
puts "Dummy: #{s.encoding.dummy?}"
puts "Valid: #{s.valid_encoding?}"
puts s.bytes.inspect
puts

s.encode!("UTF-32")
puts s.encoding
puts "Dummy: #{s.encoding.dummy?}"
puts "Valid: #{s.valid_encoding?}"
puts s.bytes.inspect
puts

s.force_encoding("UTF-32")
puts s.encoding
puts "Dummy: #{s.encoding.dummy?}"
puts "Valid: #{s.valid_encoding?}"
puts s.bytes.inspect
```

The output on Ruby 2.6.1p33 for me is:

```
UTF-8
Dummy: false
Valid: true
[97, 98, 99]

UTF-32
Dummy: true
Valid: true
[0, 0, 254, 255, 0, 0, 0, 97, 0, 0, 0, 98, 0, 0, 0, 99]

UTF-32
Dummy: true
Valid: false
[0, 0, 254, 255, 0, 0, 0, 97, 0, 0, 0, 98, 0, 0, 0, 99]
```

Basically, we start with a UTF-8 string and convert it to UTF-32. Without an explicit indication of endianness, the encoding is considered dummy, but internally big endian is used (i.e., UTF-32BE). The new byte pattern for the successfully encoded string is shown. After calling `force_encoding` on the string with the same encoding, suddenly the string is no longer considered valid. I think many people would expect `force_encoding` using the string's current encoding to be a no-op. Even setting that aside, we can see the byte sequence and the encoding for the string didn't change, but its validity has. I believe this is wrong.

The problematic lines are https://github.com/ruby/ruby/blob/e6d1c72bec5c6544e9ae82501a6cdd7460222096/string.c#L660-L662 from `rb_enc_str_coderange`:

``` c
if (rb_enc_mbminlen(enc) > 1 && rb_enc_dummy_p(enc)) {
    cr = ENC_CODERANGE_BROKEN;
}
```

This unconditionally sets the code range to `CR_BROKEN`, but only for dummy encodings with a minimum length > 1. I think this is designed to specifically target UTF-16 and UTF-32, while leaving UTF-7 alone.

It may be the case that the correct behavior here is to always mark the string as invalid.
After all, the dummy encodings could change endianness based on platform (although, I don't think they ever do). If that's the case, then the issue is with the code path from `String#encode`.

The inconsistency presents both correctness and performance issues. The former is because end user code may use a method like `String#valid_encoding?` for branching decisions. The latter because the runtime generally takes a slower path for operations on `CR_BROKEN` strings.

-- 
https://bugs.ruby-lang.org/
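
As a point of comparison, the same round trip can be tried with an explicit-endianness encoding, which is not dummy and so should not reach the branch quoted from `rb_enc_str_coderange`. This is a minimal sketch rather than a proposed fix; `validity_around_force_encoding` is a hypothetical helper name, and the result for the dummy "UTF-32" case may depend on the Ruby version in use.

``` ruby
# Hypothetical helper: report String#valid_encoding? before and after
# re-applying the string's own encoding with force_encoding.
def validity_around_force_encoding(str, encoding_name)
  s = str.encode(encoding_name)    # transcode into the target encoding
  before = s.valid_encoding?
  s.force_encoding(encoding_name)  # same encoding, bytes untouched
  [before, s.valid_encoding?]
end

# Explicit endianness: validity is stable across force_encoding.
p validity_around_force_encoding("abc", "UTF-32BE")

# Dummy encoding: the case described in this report.
p validity_around_force_encoding("abc", "UTF-32")
```

If code has to branch on `String#valid_encoding?`, sticking to "UTF-32BE" or "UTF-32LE" (and likewise for UTF-16) sidesteps the dummy-encoding path altogether.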