[ruby-core:109440] [Ruby master Feature#18949] Deprecate and remove replicate and dummy encodings

ruby-core@ruby-lang.org archive (unofficial mirror)

From: duerst <noreply@ruby-lang.org>
To: ruby-core@ruby-lang.org
Subject: [ruby-core:109440] [Ruby master Feature#18949] Deprecate and remove replicate and dummy encodings
Date: Sun, 07 Aug 2022 03:13:16 +0000 (UTC)
Message-ID: <redmine.journal-98593.20220807031315.772@ruby-lang.org>
In-Reply-To: redmine.issue-18949.20220729174026.772@ruby-lang.org

Issue #18949 has been updated by duerst (Martin Dürst).

Eregon (Benoit Daloze) wrote in #note-14:
> znz (Kazuhiro NISHIYAMA) wrote in #note-10:
> > I think ISO-2022-JP is used for mail archives, and IRCnet even now.
> 
> AFAIK Encoding::ISO_2022_JP (in Ruby) is only useful to say "this String is encoded in ISO-2022-JP", but Ruby can't do anything with that except String#encode.
> Is the use case then to String#encode to another encoding before any String operation?

Yes, exactly. I already said so in https://bugs.ruby-lang.org/issues/18949#note-8.

> Could it be replaced by the BINARY encoding in programs using it?

No, that would lose encoding information.

> Or just an instance variable on the String to remember its original encoding?

Theoretically yes, but that would make dealing with encodings more complex.

> Or eagerly converting to a non-dummy encoding as soon as receiving ISO-2022-JP bytes from an external source?

Theoretically yes, but that would again complicate code somewhere. We don't just want to move around complexity, because it stays roughly the same, and moving it around is work and may introduce additional errors.
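
The use case confirmed above (label the bytes as ISO-2022-JP, then `String#encode` before any character-level operation) can be sketched as follows. This is only an illustration: the variable names are made up, and the "external source" is simulated by encoding a UTF-8 literal.

```ruby
# "日本語" written with escapes so the example does not depend on source encoding.
text = "\u65E5\u672C\u8A9E"

# Simulate raw bytes received from an external source (e.g. a mail archive).
wire_bytes = text.encode("ISO-2022-JP").b

# Label the bytes; force_encoding changes only the label, never the bytes.
labeled = wire_bytes.dup.force_encoding("ISO-2022-JP")
p labeled.encoding.dummy?   # => true

# Transcode to a non-dummy encoding before doing any String operation.
utf8 = labeled.encode("UTF-8")
p utf8 == text              # => true
p utf8.length               # => 3
```

With `Encoding::BINARY` as the label instead, the last step would have to name the source encoding explicitly, which is the "lost encoding information" mentioned above.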

Eregon (Benoit Daloze) wrote in #note-13:
> Usages of Encoding::UTF_32:
(details omitted)
> Usages of Encoding::UTF_16:
(details omitted)
> More usages, but all these usages seem incorrect, and should use either UTF_16LE or UTF_16BE.
> 
> What we probably need here is a custom deprecation message for these two constants, telling users to use the LE/BE variants instead.

You are looking for Encoding::UTF_32 in code, but what's more relevant is UTF-32 and UTF-16 in data. These are officially registered 'charset' labels, see https://www.iana.org/assignments/charset-reg/charset-reg.xhtml.

Eregon (Benoit Daloze) wrote in #note-12:
> This is a disappointing answer, it sounds like there was not enough time at the dev meeting to consider this issue properly maybe? (that's fine, but then don't reject please)

I wasn't at the meeting, so I can't say.

> I reopen because this issue is still vastly unanswered.
> Also it feels not so respectful to me to answer with a single short unclear sentence to a carefully-crafted issue.

Naruse-san usually writes very short answers, but that doesn't mean he doesn't understand the matter at hand.

> I estimate I have a lot of experience in this area since I reimplemented dummy/replicate encodings in TruffleRuby recently.

(Naruse-san and others have a lot of experience because they implemented these encodings in CRuby.)

> The dev meeting log has no notes about a discussion: https://github.com/ruby/dev-meeting-log/blob/master/DevMeeting-2022-07-21.md#feature-18949-deprecate-and-remove-replicate-and-dummy-encodings-eregon
> 
> naruse (Yui NARUSE) wrote in #note-11:
> > String is a container and an encoding is a label of it. As long as there is data whose encoding is categorized as a dummy encoding in Ruby, we cannot avoid such encodings.
> 
> So, there are 3 things we can decide, which I pointed out in https://bugs.ruby-lang.org/issues/18949#note-6
> It seems you are against "3. Deprecate and then remove other dummy encodings".
> What about the other points, 1. and 2.?

For point 2, please see above. For 1, these functions are internal, and I guess the same thing could be done with some different functions, but as far as I understand, the functionality is needed for Ruby itself. And I don't see the point of a fixed number of encodings. That it is possible to support new encodings may no longer be as important as it was 10 or 20 years ago, but may come in handy anyway.

> Could you explain your point of view better?
> Who uses dummy encodings and what for? Since no string operation really works on them, they are indeed nothing more than a label + the transcoding table.

Yes. That's exactly what they are. And I have difficulties understanding why there would be a problem with this. Supporting a real encoding is real work. But supporting dummy encodings can't be that much work, because there isn't much functionality.

> I agree 3. is more difficult, and is less important than 1./2..
> 1. seems straightforward, there seems to be extremely few usages of it, and they can be replaced by Encoding::BINARY easily.
>    In fact `Encoding#replicate` is used in zero gems, so I think it's fair enough to deprecate and then remove it without further discussion.

`enc_replicate` is used internally as far as I understand. Maybe we can deprecate the Ruby method, but not the C function.

> 2. I think we need to evaluate usages of the dummy UTF-16 and UTF-32 encodings, e.g., with `gem-codesearch`.

See above.

> For better compatibility between Rubies we need to do 2. as well, as mentioned above the overhead and complexity for dummy UTF-16/UTF-32 is too much, and JRuby & TruffleRuby already do not deal with it for many String operations.
> It is also clearly a significant performance cost for CRuby.

----------------------------------------
Feature #18949: Deprecate and remove replicate and dummy encodings
https://bugs.ruby-lang.org/issues/18949#change-98593

* Author: Eregon (Benoit Daloze)
* Status: Open
* Priority: Normal
----------------------------------------
Ruby has a lot of accidental complexity.
Sometimes it becomes clear some features bring a lot of complexity and yet provide little value or are used very rarely.
Also most Ruby users do not even know about these features.
Replicate and dummy encodings clearly seem to fall into this category: almost nobody uses them, yet they add significant complexity and a significant performance overhead.
Notably, the existence of those means the number of encodings in a Ruby runtime is actually variable and not fixed.
That means extra synchronization, hashtable lookups, indirections, function calls, etc.

## Replicate Encodings

Replicate encodings are created using `Encoding#replicate(name)`.
It almost sounds like an alias but in fact it is more than that and creates a new Encoding object, which can be used by a String:
```ruby
e = Encoding::US_ASCII.replicate('MY-US-ASCII')
s = "abc".force_encoding(e)
p s.encoding # => e
p s.encoding.name # => 'MY-US-ASCII'
```

This seems completely useless.
There is an obvious first step here which is to change `Encoding#replicate` to return the receiver, and just install an alias for it.
That avoids creating more encoding instances needlessly.

I think we should also deprecate and remove this method, though: it is never a good idea to have a global mutable map like this.
If someone wants extra aliases for encodings, they can easily do so with their own Hash: `{ alias => encoding }.fetch(name) { Encoding.find(name) }`.
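
A minimal sketch of such an application-level alias table, with no global encoding state involved. The constant `MY_ENCODING_ALIASES` and the helper `find_encoding` are hypothetical names chosen for illustration:

```ruby
# Application-owned alias table (hypothetical names); nothing is registered globally.
MY_ENCODING_ALIASES = {
  "MY-US-ASCII" => Encoding::US_ASCII,
}.freeze

# Look up an alias first, then fall back to Ruby's own encoding names.
def find_encoding(name)
  MY_ENCODING_ALIASES.fetch(name) { Encoding.find(name) }
end

p find_encoding("MY-US-ASCII") == Encoding::US_ASCII # => true
p find_encoding("UTF-8") == Encoding::UTF_8          # => true
```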

## Dummy Encodings

Dummy encodings are not real encodings. They are artificial encodings designed to look like encodings, but don't function as encodings in Ruby.
From the docs:
```
enc.dummy? -> true or false
------------------------------------------------------------------------
Returns true for dummy encodings. A dummy encoding is an encoding for
which character handling is not properly implemented. It is used for
stateful encodings.
```

I wonder why we have those half-implemented encodings in core. It sounds to me like unfinished work which should not have been merged.

The "codepoints" of dummy encodings are just "bytes" and so they behave the same as `Encoding::BINARY`, with the exception of the UTF-16 and UTF-32 dummy encodings.

### UTF-16 and UTF-32 dummy encodings

These two are special dummy encodings.
What they do is they scan the first 2 or 4 bytes of the String, and if those bytes are a byte-order mark (BOM),
the true "actual" encoding is resolved to UTF-16BE/UTF-16LE or UTF-32BE/UTF-32LE.
Otherwise, `Encoding::BINARY` is returned.
This logic is done by `get_actual_encoding()`.

What is weird is this check is not done on String creation, no, it is done *every time* the encoding of that String is accessed (and the result is not stored on the String).
That is a needless overhead and really unreliable semantics.
Do we really want a String which automagically changes between UTF-16LE and UTF-16BE based on mutating its bytes? I think nobody wants that:
```ruby
s = "\xFF\xFEa\x00b\x00c\x00d\x00".force_encoding("UTF-16")
p s # => "\uFEFFabcd"
s.setbyte 0, 254
s.setbyte 1, 255
p s # => "\uFEFF\u6100\u6200\u6300\u6400"
```

I think the path is clear, we should deprecate and then remove Encoding::UTF_16 and Encoding::UTF_32 (dummy encodings).
And then we no longer need `get_actual_encoding()` and the overhead it adds to every String method.

We could also keep those constants and make them refer to the native-endian UTF-16/32.
But that could cause confusing errors as we would change the meaning of them.
We could add `Encoding::UTF_16NE` / `Encoding::UTF_16_NATIVE_ENDIAN` if that is useful.

Another possibility would be to resolve these encodings on String creation, like:
```
"\xFF\xFE".force_encoding("UTF-16").encoding # => UTF-16LE
String.new("\xFF\xFE", encoding: Encoding::UTF_16).encoding # => UTF-16LE
"ab".force_encoding("UTF-16").encoding # exception, not a BOM
String.new("ab", encoding: Encoding::UTF_16).encoding # exception, not a BOM
```
I think it is unnecessary to keep such complexity though.
A class method on String or Encoding like e.g. `Encoding.find_from_bom(string)` is much clearer and more efficient (no need to special-case those encodings in String.new, #force_encoding, etc).
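
For illustration, here is a rough sketch of what such a helper could look like in plain Ruby. `Encoding.find_from_bom` does not exist; the standalone method `find_from_bom`, its `family` parameter and the `BOMS` table are assumptions made for this example. The family is passed explicitly because the UTF-32LE BOM starts with the same two bytes as the UTF-16LE BOM.

```ruby
# Byte-order marks per encoding family (hypothetical table for this sketch).
BOMS = {
  utf16: { "\xFE\xFF".b => Encoding::UTF_16BE,
           "\xFF\xFE".b => Encoding::UTF_16LE },
  utf32: { "\x00\x00\xFE\xFF".b => Encoding::UTF_32BE,
           "\xFF\xFE\x00\x00".b => Encoding::UTF_32LE },
}.freeze

# Returns the concrete encoding indicated by the BOM, or nil if there is none.
def find_from_bom(string, family)
  bytes = string.b # compare raw bytes, independent of the current label
  BOMS.fetch(family).each do |bom, encoding|
    return encoding if bytes.start_with?(bom)
  end
  nil
end

data = "\xFF\xFEa\x00b\x00".b
enc = find_from_bom(data, :utf16)
p enc                          # => #<Encoding:UTF-16LE>
p find_from_bom("ab", :utf16)  # => nil

# Resolve once, drop the 2-byte BOM, and work with a real encoding from then on.
text = data.byteslice(2..-1).force_encoding(enc).encode("UTF-8")
p text                         # => "ab"
```

The encoding is decided once, at the boundary where the bytes enter the program, instead of on every access.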

FWIW JRuby seems to use `getActualEncoding()` only in 2 places (scanForCodeRange, inspect), which is an indication those dummy UTF encodings are barely used if ever. Similarly, TruffleRuby only has 4 usages of `GetActualEncodingNode`.

### Existing dummy encodings

```
> Encoding.list.select(&:dummy?) 
[#<Encoding:UTF-16 (dummy)>,  #<Encoding:UTF-32 (dummy)>,
 #<Encoding:IBM037 (dummy)>, #<Encoding:UTF-7 (dummy)>,
 #<Encoding:ISO-2022-JP (dummy)>, #<Encoding:ISO-2022-JP-2 (dummy)>, #<Encoding:ISO-2022-JP-KDDI (dummy)>,
 #<Encoding:CP50220 (dummy)>, #<Encoding:CP50221 (dummy)>]
```

So besides UTF-16/UTF-32 dummy, it's only 7 encodings.
Does anyone use one of these 7 dummy encodings?

What is interesting to note is that these encodings are exactly the ones that are also not ASCII-compatible, with the exception of UTF-16BE/UTF-16LE/UTF-32BE/UTF-32LE (non-dummy).
As a note, UTF-{16,32}{BE,LE} are ASCII-compatible in codepoints but not in bytes, and Ruby uses the bytes definition of ASCII-compatible.
There is potential to simplify encoding compatibility rules and encoding compatibility checks based on that.
So what this means is if we removed dummy encodings, all encodings except UTF-{16,32}{BE,LE} would be ASCII-compatible, which would lead to significant simplifications for many string operations which currently need to handle dummy encodings specially.
Unicode encodings like UTF-{16,32}{BE,LE} already have special behavior for some Ruby methods, so those are already handled specially in some places (they are the only encodings with minLength > 1).

```
> Encoding.list.reject(&:ascii_compatible?)
[#<Encoding:UTF-16BE>, #<Encoding:UTF-16LE>,
 #<Encoding:UTF-32BE>, #<Encoding:UTF-32LE>,
 #<Encoding:UTF-16 (dummy)>, #<Encoding:UTF-32 (dummy)>,
 #<Encoding:IBM037 (dummy)>, #<Encoding:UTF-7 (dummy)>,
 #<Encoding:ISO-2022-JP (dummy)>, #<Encoding:ISO-2022-JP-2 (dummy)>, #<Encoding:ISO-2022-JP-KDDI (dummy)>,
 #<Encoding:CP50220 (dummy)>, #<Encoding:CP50221 (dummy)>]
```
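
The difference between ASCII compatibility in codepoints and in bytes, noted above for UTF-{16,32}{BE,LE}, can be checked directly. A small sketch using only existing core methods:

```ruby
s = "abc".encode(Encoding::UTF_16LE)

# The codepoints are the ASCII ones...
p s.codepoints  # => [97, 98, 99]

# ...but the bytes are not: each character takes two bytes.
p s.bytes       # => [97, 0, 98, 0, 99, 0]

# Ruby's ascii_compatible? follows the byte-level definition.
p Encoding::UTF_16LE.ascii_compatible?  # => false
p Encoding::UTF_8.ascii_compatible?     # => true
```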

What can we do with such a dummy non-ASCII-compatible encoding?
Almost nothing useful:
```ruby
> s = "abc".encode("IBM037")
=> "\x81\x82\x83"
> s.bytes
=> [129, 130, 131]
> s.codepoints
=> [129, 130, 131]
> s == "abc"
=> false
> "été".encode("IBM037")
=> "\x51\xA3\x51"
```

So about the only thing that works with them is `String#encode`.

I think we could preserve that functionality, if actually used (does anyone use one of these 7 dummy encodings?), through:
```ruby
> "été".encode("IBM037")
=> "\x51\xA3\x51" (.encoding == BINARY)
> "\x51\xA3\x51".encode("UTF-8", "IBM037") # encode from IBM037 to UTF-8
=> "été" (.encoding == UTF-8)
```

That way there is no need for those to be Encoding instances, we would only need the conversion tables.
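
The decoding half of this already works today with the two-argument form of `String#encode`. The sketch below emulates the proposed encoding half by relabeling the result as BINARY with `String#b`; that relabeling is an assumption standing in for the proposed behavior, not what `encode` returns currently.

```ruby
# Today encode returns an IBM037-labeled String; .b emulates the proposed BINARY result.
ebcdic_bytes = "abc".encode("IBM037").b
p ebcdic_bytes.encoding == Encoding::BINARY  # => true
p ebcdic_bytes.bytes                         # => [129, 130, 131]

# Decode by naming the source encoding explicitly: encode(destination, source).
decoded = ebcdic_bytes.encode("UTF-8", "IBM037")
p decoded == "abc"                           # => true
p decoded.encoding == Encoding::UTF_8        # => true
```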

It is even better if we can remove them, so the notion of "dummy encodings" can disappear completely and nobody needs to understand or implement them.

### rb_define_dummy_encoding(name)

The C-API has `rb_define_dummy_encoding(const char *name)`.
This creates a new Encoding instance with `dummy?=true`, and it is also non-ASCII-compatible.
There seems to be no purpose to this besides storing the metadata of an encoding which does not exist in Ruby.
This seems a really expensive/complex way to handle that from the VM point of view (because it dynamically creates an Encoding and adds it to lists/maps/etc).
A simple replacement would be to mark the String as BINARY and save the encoding name as an instance variable of that String.
Ruby can't understand anything about that String anyway; it is just raw bytes in Ruby's eyes.
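
A minimal sketch of that replacement, assuming a hypothetical helper `label_raw_bytes`, a hypothetical instance variable `@original_encoding_name`, and a made-up encoding name:

```ruby
# Keep the bytes as BINARY and remember the foreign encoding name on the String itself.
def label_raw_bytes(bytes, encoding_name)
  str = bytes.b # fresh BINARY copy of the bytes
  str.instance_variable_set(:@original_encoding_name, encoding_name)
  str
end

s = label_raw_bytes("\x81\x82\x83", "X-VENDOR-CHARSET")
p s.encoding == Encoding::BINARY                          # => true
p s.instance_variable_get(:@original_encoding_name)       # => "X-VENDOR-CHARSET"

# Strings derived from s do not carry the label.
derived = s + "x"
p derived.instance_variable_get(:@original_encoding_name) # => nil
```

The last lines show the trade-off: the name stays with that one String object only, so code that builds new strings from it has to propagate the label itself.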

## Summary

I suggest we deprecate replicate and dummy encodings in Ruby 3.2.
And then we remove them in the next version.

This will significantly simplify string-related methods, and the behavior exposed to Ruby users.

It will also significantly speed up encoding lookup in CRuby (and other Ruby implementations).
With a fixed number of encodings we can ensure all encoding indices fit in 7 bits, and `ENCODING_GET` can be simply `RB_ENCODING_GET_INLINED`.
`get_actual_encoding()` will be gone and its overhead as well.
`rb_enc_from_index()` would be just `return global_enc_table->list[index].enc;`, instead of the expensive behavior currently with `GLOBAL_ENC_TABLE_EVAL` which takes a lock and more when there are multiple Ractors.
Many checks in these methods would be removed as well.
Yet another improvement would be to load all encodings eagerly. In my experience that is small and fast (what is slow and big is the conversion tables), and it would simplify `must_encindex()` further.
These changes would affect most String methods, which use
```
STR_ENC_GET->get_encoding which does:
  get_actual_encoding->rb_enc_from_index and possibly ->enc_from_index
  ENCODING_GET->RB_ENCODING_GET_INLINED and possibly ->rb_enc_get_index->enc_get_index_str->rb_attr_get
```
Some of these details are mentioned in https://github.com/ruby/ruby/pull/6095#discussion_r915149708.
The overhead is so large that it is worth handling some hardcoded encoding indices directly in String methods.
This feels wrong, getting the encoding from a String should be simple, straightforward and fast.

Further optimizations will be unlocked as the encoding list becomes fixed and immutable.
For example, the name-to-Encoding map is then immutable and could use perfect hashing.
Inline caching those lookups also becomes easier as the map cannot change.
Also that map would no longer need synchronization, etc.
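
To illustrate the idea at the Ruby level only (this is not how CRuby's C-level table works), a name-to-Encoding map built once from a fixed encoding list can be frozen and then read without any locking. `ENCODING_BY_NAME` and `lookup_encoding` are hypothetical names for this sketch:

```ruby
# Build the map once from every encoding and all of its names, then freeze it.
ENCODING_BY_NAME = Encoding.list.each_with_object({}) do |encoding, map|
  encoding.names.each { |name| map[name.downcase] = encoding }
end.freeze

# Case-insensitive lookup in the immutable map; returns nil for unknown names.
def lookup_encoding(name)
  ENCODING_BY_NAME[name.downcase]
end

p lookup_encoding("UTF-8") == Encoding::UTF_8    # => true
p lookup_encoding("binary") == Encoding::BINARY  # => true
p lookup_encoding("no-such-encoding")            # => nil
```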

## To Decide

Each item is independent. I think 1 & 2 are very important, 3 less but would be nice.

1. Deprecate and then remove `Encoding#replicate` and `rb_define_dummy_encoding()`. With that there is a fixed number of encodings, a lot of simplifications and many optimizations become available. They are used respectively in only 1 gem and 5 gems, see https://bugs.ruby-lang.org/issues/18949#note-4
2. Deprecate and then remove the dummy UTF-16 and UTF-32 encodings. This removes the need for `get_actual_encoding()` which is expensive. This functionality seems rarely used in practice, and it only works when such strings have a BOM, which is very rare.
3. Deprecate and then remove other dummy encodings, so there are no more dummy "half-implemented" encodings and all encodings are ASCII-compatible in terms of codepoints.

-- 
https://bugs.ruby-lang.org/
