[ruby-core:93250] [Ruby trunk Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals

ruby-core@ruby-lang.org archive (unofficial mirror)

* [ruby-core:93250] [Ruby trunk Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
       [not found] <redmine.issue-15940.20190619150012@ruby-lang.org>
@ 2019-06-19 15:00 ` jean.boussier
  2019-06-19 15:55 ` [ruby-core:93251] " jean.boussier
                   ` (26 subsequent siblings)
  27 siblings, 0 replies; 28+ messages in thread
From: jean.boussier @ 2019-06-19 15:00 UTC (permalink / raw
  To: ruby-core

Issue #15940 has been reported by byroot (Jean Boussier).

----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
* Assignee: 
* Target version: 
----------------------------------------
Patch: https://github.com/ruby/ruby/pull/2242

It's not uncommon for symbols to have literal string counterparts, e.g.

```ruby
class User
  attr_accessor :name

  def as_json
    { 'name' => name }
  end
end
```

Since the default source encoding is UTF-8 and symbols coerce their internal fstring to ASCII when possible, the above snippet actually keeps two instances of `"name"` in the fstring registry: one in ASCII, the other in UTF-8.

Considering that UTF-8 is a strict superset of ASCII, storing the symbols' fstrings as UTF-8 instead makes no significant difference, but in most cases allows reusing the equivalent string literals.
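
To make the superset argument concrete, here is a minimal sketch (the variable names are only illustrative, not part of the patch) showing that an ASCII-only string carries the same bytes under US-ASCII and UTF-8 and differs only in its encoding tag:

```ruby
# Build the same ASCII-only content under both encodings explicitly,
# so the result does not depend on the source file's encoding.
utf8_name  = String.new('name', encoding: Encoding::UTF_8)
ascii_name = String.new('name', encoding: Encoding::US_ASCII)

same_bytes    = utf8_name.bytes == ascii_name.bytes
same_content  = utf8_name == ascii_name
same_encoding = utf8_name.encoding == ascii_name.encoding

puts "same bytes:    #{same_bytes}"
puts "same content:  #{same_content}"
puts "same encoding: #{same_encoding}"
```

Only the encoding comparison is expected to differ, which, per the description above, is what keeps the two entries from being shared in the fstring registry.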

The only notable behavioral change is `Symbol#to_s`.

Previously `:name.to_s.encoding` would be `#<Encoding:US-ASCII>`.
After this patch it's `#<Encoding:UTF-8>`. I can't foresee any significant compatibility impact of this change on existing code.

However, there are several ruby specs asserting this behavior, but I don't know if they can be changed or not: https://github.com/ruby/spec/commit/a73a1c11f13590dccb975ba4348a04423c009453

If this specification cannot be changed, then we could consider changing the encoding of the String returned by `Symbol#to_s`, e.g. in Ruby pseudocode:

```ruby
def to_s
  str = fstr.dup
  str.force_encoding(Encoding::ASCII) if str.ascii_only?
  str
end
```
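
As a runnable approximation of the pseudocode above, the sketch below takes the fstring as an explicit argument, since the real internal fstring is not reachable from Ruby. The method name `to_s_with_ascii_coercion` is hypothetical and only stands in for the proposed `Symbol#to_s` behavior:

```ruby
# Hypothetical stand-in for the proposed fallback: `fstr` is passed in
# explicitly because a symbol's internal fstring is not exposed to Ruby code.
def to_s_with_ascii_coercion(fstr)
  str = fstr.dup
  # Re-tag the copy as US-ASCII only when every byte is 7-bit.
  str.force_encoding(Encoding::US_ASCII) if str.ascii_only?
  str
end

ascii_only = to_s_with_ascii_coercion(String.new('name', encoding: Encoding::UTF_8))
non_ascii  = to_s_with_ascii_coercion("caf\u00e9")

puts ascii_only.encoding
puts non_ascii.encoding
```

With this approach the stored fstring could stay UTF-8 while callers of `to_s` would still observe US-ASCII for ASCII-only names; names containing non-ASCII characters keep their original encoding.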

-- 
https://bugs.ruby-lang.org/


* [ruby-core:93251] [Ruby trunk Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
       [not found] <redmine.issue-15940.20190619150012@ruby-lang.org>
  2019-06-19 15:00 ` [ruby-core:93250] [Ruby trunk Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals jean.boussier
@ 2019-06-19 15:55 ` jean.boussier
  2019-06-20  4:48 ` [ruby-core:93269] " duerst
                   ` (25 subsequent siblings)
  27 siblings, 0 replies; 28+ messages in thread
From: jean.boussier @ 2019-06-19 15:55 UTC (permalink / raw
  To: ruby-core

Issue #15940 has been updated by byroot (Jean Boussier).

To provide some data, I counted the duplicates in a Redmine heap dump (`ObjectSpace.dump_all`).

Here is the counting code:

```ruby
#!/usr/bin/env ruby
# frozen_string_literal: true
require 'json'

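# Keep only the heap entries flagged as fstrings (one JSON object per line).
fstrings = []
STDIN.each do |line|
  object = JSON.parse(line)
  fstrings << object if object['fstring']
end

# Count how many fstring entries share the same value.
counts = Hash.new(0)
fstrings.each do |str|
  counts[str['value']] += 1
end
# Values that appear more than once in the fstring table.
duplicates = counts.select { |_value, count| count > 1 }.keys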

puts "total fstrings: #{fstrings.size}"
puts "dups: #{duplicates.size}"
puts "sample:"
puts duplicates.first(20)
```

And the results for Redmine:

```
total fstrings: 84678
dups: 3686
sample:
changes
absent
part
EVENTS
RANGE
OBJECT
Silent
EXCEPTION
Settings
DATE
Index
Graph
COMPLEX
Definition
fcntl
inline
lockfile
update
gemfile
oth
```

That's about 4% of the fstring table being duplicates.

I also ran the script against one much bigger private app, and the duplicate ratio was similar, but the table was an order of magnitude bigger.
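
For anyone wanting to repeat the measurement on another app, a heap dump in the expected one-JSON-object-per-line format could be produced along these lines. This is only a sketch: the helper name `dump_heap` and the file name `heap.json` are assumptions, not what was used for the numbers above.

```ruby
require 'objspace'

# Hypothetical helper: write one JSON document per live object to `path`.
def dump_heap(path)
  File.open(path, 'w') do |file|
    ObjectSpace.dump_all(output: file)
  end
  path
end

# Only dump when this file is run directly, e.g. from a console session
# after the application has finished booting.
dump_heap('heap.json') if $PROGRAM_NAME == __FILE__
```

The resulting file can then be fed to the counting script on standard input.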

----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-78701


* [ruby-core:93269] [Ruby trunk Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
       [not found] <redmine.issue-15940.20190619150012@ruby-lang.org>
  2019-06-19 15:00 ` [ruby-core:93250] [Ruby trunk Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals jean.boussier
  2019-06-19 15:55 ` [ruby-core:93251] " jean.boussier
@ 2019-06-20  4:48 ` duerst
  2019-06-25 13:53 ` [ruby-core:93358] " naruse
                   ` (24 subsequent siblings)
  27 siblings, 0 replies; 28+ messages in thread
From: duerst @ 2019-06-20  4:48 UTC (permalink / raw
  To: ruby-core

Issue #15940 has been updated by duerst (Martin Dürst).

I think this change makes a lot of sense, not only because of the memory savings, but also because of the streamlining for encodings in general.

I agree that the chance of backwards incompatibilities in actual programs is very low. I think this is a good change for Ruby 2.7.

----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-78727


* [ruby-core:93358] [Ruby trunk Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
       [not found] <redmine.issue-15940.20190619150012@ruby-lang.org>
                   ` (2 preceding siblings ...)
  2019-06-20  4:48 ` [ruby-core:93269] " duerst
@ 2019-06-25 13:53 ` naruse
  2019-06-25 22:06 ` [ruby-core:93365] " jean.boussier
                   ` (23 subsequent siblings)
  27 siblings, 0 replies; 28+ messages in thread
From: naruse @ 2019-06-25 13:53 UTC (permalink / raw
  To: ruby-core

Issue #15940 has been updated by naruse (Yui NARUSE).

Note that one incompatibility caused by the change of string encoding is `String#<<(integer)`.

Maybe `String#<<(n)` should be deprecated for n > 127 in favor of explicitly specifying the encoding argument.

----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-78871


* [ruby-core:93365] [Ruby trunk Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
       [not found] <redmine.issue-15940.20190619150012@ruby-lang.org>
                   ` (3 preceding siblings ...)
  2019-06-25 13:53 ` [ruby-core:93358] " naruse
@ 2019-06-25 22:06 ` jean.boussier
  2019-06-27  9:41 ` [ruby-core:93385] " duerst
                   ` (22 subsequent siblings)
  27 siblings, 0 replies; 28+ messages in thread
From: jean.boussier @ 2019-06-25 22:06 UTC (permalink / raw
  To: ruby-core

Issue #15940 has been updated by byroot (Jean Boussier).

@naruse Interesting, I actually had no idea you could `String#<<(integer)`.

In my humble opinion, appending an integer to a string returned by `Symbol#to_s` is quite specific, and is unlikely to be common in the wild.

Additionally, UTF-8 strings accept everything an ASCII string would, so to break existing code it would need to expect `String#<<(127+)` to blow up.

```ruby
>> 'a' << 234324
=> "a\u{39354}"
>> 'a'.force_encoding(Encoding::BINARY) << 234324
RangeError: 234324 out of char range
```

That being said, it's not my role to judge the backward compatibility impact, and I'm likely to overlook parts of it.

More generally, let me know if there's anything I can do to push this forward. I think I'll update the test suite in the pull request to acknowledge the encoding change and get a green test suite.

----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-78880


* [ruby-core:93385] [Ruby trunk Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
       [not found] <redmine.issue-15940.20190619150012@ruby-lang.org>
                   ` (4 preceding siblings ...)
  2019-06-25 22:06 ` [ruby-core:93365] " jean.boussier
@ 2019-06-27  9:41 ` duerst
  2019-06-28  9:26 ` [ruby-core:93402] " eregontp
                   ` (21 subsequent siblings)
  27 siblings, 0 replies; 28+ messages in thread
From: duerst @ 2019-06-27  9:41 UTC (permalink / raw
  To: ruby-core

Issue #15940 has been updated by duerst (Martin Dürst).

naruse (Yui NARUSE) wrote:
> Note that an incompatibility which is caused by the change of string encoding is `String#<<(integer)`.
>
> Maybe String#<<(n) should be deprecated if n > 127 and explicitly specify the encoding argument.

If I understand this correctly, the proposal is to change the encoding of Symbols from ASCII to UTF-8. So if such a symbol is converted to a String (which in itself may not be that frequent), and then an Integer is 'shifted' into that String with `<<`, then the only incompatibility that we get is that until now, it was an error to do that with a number > 127.

So the overall consequence is that something that produced an error up to now doesn't produce an error anymore. I guess that's an incompatibility that we should be able to tolerate. It's much more of a problem if something that worked until now stops working, or if something that worked one way suddenly works another way.

As for explicitly specifying an encoding argument for `String#<<`, I understand that it may be the conceptually correct thing to do (we are using the Integer as a character number, so we had better know what encoding this character number is expressed in). But the encoding is already available from the string, and in most cases will be the source encoding or so anyway, which will be UTF-8 in most cases. Also, because `<<` is a binary operator, it would be difficult to add additional parameters.

----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-78910


* [ruby-core:93402] [Ruby trunk Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
       [not found] <redmine.issue-15940.20190619150012@ruby-lang.org>
                   ` (5 preceding siblings ...)
  2019-06-27  9:41 ` [ruby-core:93385] " duerst
@ 2019-06-28  9:26 ` eregontp
  2019-06-28  9:29 ` [ruby-core:93403] " eregontp
                   ` (20 subsequent siblings)
  27 siblings, 0 replies; 28+ messages in thread
From: eregontp @ 2019-06-28  9:26 UTC (permalink / raw
  To: ruby-core

Issue #15940 has been updated by Eregon (Benoit Daloze).

duerst (Martin Dürst) wrote:
> If I understand this correctly, the proposal is to change the encoding of Symbols from ASCII to UTF-8. So if such a symbol is converted to a String (which in itself may not be that frequent), and then an Integer is 'shifted' into that String with `<<`, then the only incompatibility that we get is that until now, it was an error to do that with a number > 127.
> So the overall consequence is that something that produced an error up to now doesn't produce an error anymore. I guess that's an incompatibility that we should be able to tolerate. It's much more of a problem if something that worked until now stops to work, or if something that worked one way suddenly works another way.

It's not raising an error:

```
$ ruby -ve 's=:abc.to_s; s<<233; p s; p s.encoding'                     
ruby 2.6.2p47 (2019-03-13 revision 67232) [x86_64-linux]
"abc\xE9"
#<Encoding:ASCII-8BIT>

$ ruby -ve 's=:abc.to_s.force_encoding("UTF-8"); s<<233; p s; p s.encoding'
ruby 2.6.2p47 (2019-03-13 revision 67232) [x86_64-linux]
"abcé"
#<Encoding:UTF-8>
```

I'm a bit concerned about compatibility. I think we should evaluate this with a few gems, and check how much of test-all and the specs fail with this change.

I agree in general having a consistent encoding for Symbol literals seems simpler for semantics.

TruffleRuby reuses the underlying memory (byte[], aka char*) for interned Strings of different encodings, so only the metadata (encoding, coderange, etc) is duplicated, but not the actual bytes. Probably MRI could do the same, and that would be transparent and not need to change semantics.

----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-78944


* [ruby-core:93403] [Ruby trunk Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
       [not found] <redmine.issue-15940.20190619150012@ruby-lang.org>
                   ` (6 preceding siblings ...)
  2019-06-28  9:26 ` [ruby-core:93402] " eregontp
@ 2019-06-28  9:29 ` eregontp
  2019-06-28  9:31 ` [ruby-core:93404] " eregontp
                   ` (19 subsequent siblings)
  27 siblings, 0 replies; 28+ messages in thread
From: eregontp @ 2019-06-28  9:29 UTC (permalink / raw
  To: ruby-core

Issue #15940 has been updated by Eregon (Benoit Daloze).

byroot (Jean Boussier) wrote:
> However, there are several ruby specs asserting this behavior, but I don't know if they can be changed or not.

Specs can always be changed, along with `ruby_version_is` guards to specify which behavior on which version (https://github.com/ruby/ruby/blob/trunk/spec/README.md).
Needing to change them indicates a potential incompatibility, though.

----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-78945


* [ruby-core:93404] [Ruby trunk Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
       [not found] <redmine.issue-15940.20190619150012@ruby-lang.org>
                   ` (7 preceding siblings ...)
  2019-06-28  9:29 ` [ruby-core:93403] " eregontp
@ 2019-06-28  9:31 ` eregontp
  2019-06-28 16:33 ` [ruby-core:93413] " ruby
                   ` (18 subsequent siblings)
  27 siblings, 0 replies; 28+ messages in thread
From: eregontp @ 2019-06-28  9:31 UTC (permalink / raw
  To: ruby-core

Issue #15940 has been updated by Eregon (Benoit Daloze).

If we change this, the encoding of Symbol literals should be the same as String literals, i.e., use the file's magic encoding comment or UTF-8 if there isn't one.

----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-78946


* [ruby-core:93413] [Ruby trunk Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
       [not found] <redmine.issue-15940.20190619150012@ruby-lang.org>
                   ` (8 preceding siblings ...)
  2019-06-28  9:31 ` [ruby-core:93404] " eregontp
@ 2019-06-28 16:33 ` ruby
  2019-07-01 11:36 ` [ruby-core:93449] [Ruby master " jean.boussier
                   ` (17 subsequent siblings)
  27 siblings, 0 replies; 28+ messages in thread
From: ruby @ 2019-06-28 16:33 UTC (permalink / raw
  To: ruby-core

Issue #15940 has been updated by nirvdrum (Kevin Menard).

I generally like the idea, but really from a semantics perspective rather than a memory savings one. It's confusing to both implementers and end users alike that Symbols take on a different encoding from Strings if they happen to be ASCII-only. So the other nice benefit of the change is `String#{intern,to_sym}` can be made much cheaper. Having said all of that, I'm sure the current behavior was maintained when non-ASCII-only Symbols were introduced for a reason. I think it'd be good to look back and see what the rationale was.

If the solution then is to convert the String's encoding when calling `Symbol#to_s`, if the Symbol is ASCII-only, then I think you're going to need to investigate knock-on effects. E.g., `String#force_encoding` currently unconditionally clears the String's code range. That's metadata you really don't want to lose. But, by setting the encoding to ASCII-only, you may be okay most of the time because there are code paths that just check if the encoding uses single byte characters without doing a full code range scan. Likewise, if you do decide to skip the `US-ASCII` conversion, you could have the inverse problem. Now you have a UTF-8 string and if that doesn't have its code range set, you've turned some O(1) operations to O(n). Please note, I haven't really traced all the String and Symbol code. These were potential pitfalls that stood out to me when reviewing the PR and looking briefly at the CRuby source. My general point being that even if things come out correct, you could still alter the execution profile in such a way as to introduce a performance regression by changing from a fixed-width to a variable-width encoding or by not taking proper care of the code range value. None of that is insurmountable, of course.

----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-78955

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
* Assignee: 
* Target version: 
----------------------------------------
Patch: https://github.com/ruby/ruby/pull/2242

It's not uncommon for symbols to have literal string counterparts, e.g.

```ruby
class User
  attr_accessor :name

  def as_json
    { 'name' => name }
  end
end
```

Since the default source encoding is UTF-8, and that symbols coerce their internal fstring to ASCII when possible, the above snippet will actually keep two instances of `"name"` in the fstring registry. One in ASCII, the other in UTF-8.

Considering that UTF-8 is a strict superset of ASCII, storing the symbols fstrings as UTF-8 instead makes no significant difference, but allows in most cases to reuse the equivalent string literals.

The only notable behavioral change is `Symbol#to_s`.

Previously `:name.to_s.encoding` would be `#<Encoding:US-ASCII>`.
After this patch it's `#<Encoding:UTF-8>`. I can't foresee any significant compatibility impact of this change on existing code.

However, there are several ruby specs asserting this behavior, but I don't know if they can be changed or not: https://github.com/ruby/spec/commit/a73a1c11f13590dccb975ba4348a04423c009453

If this specification is impossible to change, then we could consider changing the encoding of the String returned by `Symbol#to_s`, e.g in ruby pseudo code:

```ruby
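def to_s
  # `fstr` stands for the symbol's internal frozen string (not a real method).
  str = fstr.dup
  # Relabel as US-ASCII only when every byte is 7-bit; nothing is transcoded.
  str.force_encoding(Encoding::ASCII) if str.ascii_only?
  str
end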
```

-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [ruby-core:93449] [Ruby master Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
       [not found] <redmine.issue-15940.20190619150012@ruby-lang.org>
                   ` (9 preceding siblings ...)
  2019-06-28 16:33 ` [ruby-core:93413] " ruby
@ 2019-07-01 11:36 ` jean.boussier
  2019-07-01 14:33 ` [ruby-core:93452] " eregontp
                   ` (16 subsequent siblings)
  27 siblings, 0 replies; 28+ messages in thread
From: jean.boussier @ 2019-07-01 11:36 UTC (permalink / raw
  To: ruby-core

Issue #15940 has been updated by byroot (Jean Boussier).

Sorry for the late reply, somehow I can't make email notifications work on Redmine...

> Specs can always be changed, along with ruby_version_is guards to specify which behavior on which version

Thanks for letting me know. I updated the PR. I expect it to pass CI, but will do further updates if it doesn't.

> If we change this, the encoding of Symbol literals should be the same as String literals, i.e., use the file's magic encoding comment or UTF-8 if there isn't one.

Yes and no.

First it's kinda already the case and stays that way. If the symbol name can't be expressed as pure `ASCII`, it will have the string's encoding, hence the file encoding.

However, one of the reasons the encoding is coerced is the following situation, where two files with different source encodings define the same symbol:

```ruby
# encoding: iso-8859-1
ISO_SYMBOL = :foo

# encoding: utf-8
UTF_SYMBOL = :foo
```

You do want both constants to reference the same symbol. From what I gathered it was the whole reason behind the ASCII coercion.
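
A small sketch of that identity property, runnable in a single file: the two source files are simulated by building strings with explicit encodings, which is an assumption made only for illustration.

```ruby
# ASCII-only names: the encoding of the originating string does not matter,
# both calls return the very same Symbol object.
iso_foo = "foo".encode(Encoding::ISO_8859_1).to_sym
utf_foo = "foo".encode(Encoding::UTF_8).to_sym
puts iso_foo.equal?(utf_foo)  # true

# Non-ASCII names keep the encoding of the originating string, so the same
# characters in two encodings produce two distinct symbols.
iso_ole = "ol\u00E9".encode(Encoding::ISO_8859_1).to_sym
utf_ole = "ol\u00E9".to_sym
puts iso_ole == utf_ole  # false
puts iso_ole.encoding    # ISO-8859-1
puts utf_ole.encoding    # UTF-8
```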

> I'm sure the current behavior was maintained when non-ASCII-only Symbols were introduced for a reason.

I believe it's the reason I described above.

> If the solution then is to convert the String's encoding when calling Symbol#to_s

Yeah, that was just a suggestion to retain `to_s` backward compatibility, but I really don't think it's a good idea.

----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-78997

-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [ruby-core:93452] [Ruby master Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
       [not found] <redmine.issue-15940.20190619150012@ruby-lang.org>
                   ` (10 preceding siblings ...)
  2019-07-01 11:36 ` [ruby-core:93449] [Ruby master " jean.boussier
@ 2019-07-01 14:33 ` eregontp
  2019-07-01 14:38 ` [ruby-core:93453] " eregontp
                   ` (15 subsequent siblings)
  27 siblings, 0 replies; 28+ messages in thread
From: eregontp @ 2019-07-01 14:33 UTC (permalink / raw
  To: ruby-core

Issue #15940 has been updated by Eregon (Benoit Daloze).

byroot (Jean Boussier) wrote:
> You do want both constants to reference the same symbol. From what I gathered it was the whole reason behind the ASCII coercion.

That makes sense, thanks for the explanation.

US-ASCII is the natural subset for 7-bit characters, so it makes perfect sense to me that it's used for 7-bit symbols.
UTF-8 is not, and is less precise than US-ASCII for that matter.
At least performance-wise it shouldn't matter too much since the coderange will be CR_7BIT.

I'm unsure, it seems a bit arbitrary to give "ascii" symbols a UTF-8 encoding.
And many core methods return US-ASCII Strings and I would say that it is expected when they only return 7-bit characters.

----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-78999

-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [ruby-core:93453] [Ruby master Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
       [not found] <redmine.issue-15940.20190619150012@ruby-lang.org>
                   ` (11 preceding siblings ...)
  2019-07-01 14:33 ` [ruby-core:93452] " eregontp
@ 2019-07-01 14:38 ` eregontp
  2019-07-01 14:48 ` [ruby-core:93454] " jean.boussier
                   ` (14 subsequent siblings)
  27 siblings, 0 replies; 28+ messages in thread
From: eregontp @ 2019-07-01 14:38 UTC (permalink / raw
  To: ruby-core

Issue #15940 has been updated by Eregon (Benoit Daloze).

Sharing char* is a more general optimization, and could apply to more cases (e.g., frozen Strings with identical bytes but different encodings).
So I'm thinking that would be better rather than changing semantics for the (rather obscure to end users) purpose of fitting better with the current fstring representation.

I'd like another reason than the internal optimization which can be done another way if we do this, but it's just my opinion.

----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-79000

-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [ruby-core:93454] [Ruby master Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
       [not found] <redmine.issue-15940.20190619150012@ruby-lang.org>
                   ` (12 preceding siblings ...)
  2019-07-01 14:38 ` [ruby-core:93453] " eregontp
@ 2019-07-01 14:48 ` jean.boussier
  2019-07-01 15:37 ` [ruby-core:93456] " hanmac
                   ` (13 subsequent siblings)
  27 siblings, 0 replies; 28+ messages in thread
From: jean.boussier @ 2019-07-01 14:48 UTC (permalink / raw
  To: ruby-core

Issue #15940 has been updated by byroot (Jean Boussier).

> US-ASCII is the natural subset for 7-bit characters, so it makes perfect sense to me that it's used for 7-bit symbols. UTF-8 is not, and is less precise than US-ASCII for that matter.

I don't disagree with this, but my point is that UTF-8 is a superset of US-ASCII, and much more likely to be the encoding of the various frozen string literals.

> At least performance-wise it shouldn't matter too much 

What do you mean by performance? String comparisons? If so, it doesn't really matter much for symbols AFAIK.

> I'm unsure, it seems a bit arbitrary to give "ascii" symbols a UTF-8 encoding.

IMO there are two arguments here:

  - Consistency / least surprise: UTF-8 is now the default source file encoding, so it would make sense for the symbols created from these files (not just `Symbol` instances, but also module names, method names, etc.) to be UTF-8 as well.
  - Memory usage: as explained in the original issue description, it saves some memory.

Honestly, what is surprising to me is this:

```ruby
'foo'.encoding # => UTF-8
:foo.to_s.encoding # => US-ASCII
module Foo; end
Foo.name.encoding # => US-ASCII
Foo.method(:name).name.encoding # => US-ASCII
:"olé".to_s.encoding # => UTF-8
```

> Sharing char* is a more general optimization, and could apply to more cases (e.g., frozen Strings with identical bytes but different encodings).

The problem is that the different encodings have to be kept somewhere. So you end up with the original string plus some form of shared string that points to the original one and holds the different encoding.

So unless that string is too big to be embedded (rarely the case for symbols), you haven't actually saved anything.
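
As a rough illustration of the embedding point, `ObjectSpace.memsize_of` from the `objspace` standard library reports the memory an object occupies. This is only a sketch: the exact numbers depend on the Ruby version and platform, so none are shown here.

```ruby
require "objspace"

short = "name"        # short enough to be stored inside the object slot itself
long  = "x" * 1_000   # large enough to need a separately allocated buffer

# For a short string there is no separate buffer that a second object could
# share, so an extra object per encoding costs a full slot either way.
puts ObjectSpace.memsize_of(short)
puts ObjectSpace.memsize_of(long)
```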

----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-79001

-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [ruby-core:93456] [Ruby master Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
       [not found] <redmine.issue-15940.20190619150012@ruby-lang.org>
                   ` (13 preceding siblings ...)
  2019-07-01 14:48 ` [ruby-core:93454] " jean.boussier
@ 2019-07-01 15:37 ` hanmac
  2019-07-01 17:25 ` [ruby-core:93458] " eregontp
                   ` (12 subsequent siblings)
  27 siblings, 0 replies; 28+ messages in thread
From: hanmac @ 2019-07-01 15:37 UTC (permalink / raw
  To: ruby-core

Issue #15940 has been updated by Hanmac (Hans Mackowiak).

I didn't check the code in depth, but would it help if `Symbol#to_s` or `Module#name` returned a shared string?

In that case, memory would only be allocated when the returned string itself gets changed?

Or why not return a frozen string?

----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-79003

-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [ruby-core:93458] [Ruby master Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
       [not found] <redmine.issue-15940.20190619150012@ruby-lang.org>
                   ` (14 preceding siblings ...)
  2019-07-01 15:37 ` [ruby-core:93456] " hanmac
@ 2019-07-01 17:25 ` eregontp
  2019-07-02  0:49 ` [ruby-core:93464] " jean.boussier
                   ` (11 subsequent siblings)
  27 siblings, 0 replies; 28+ messages in thread
From: eregontp @ 2019-07-01 17:25 UTC (permalink / raw
  To: ruby-core

Issue #15940 has been updated by Eregon (Benoit Daloze).

byroot (Jean Boussier) wrote:
> What do you mean by performance ? String comparisons ? If so it doesn't really matter much for symbols AFAIK.

I mean performance of String operations on a UTF-8 vs a US-ASCII String.
As @nirvdrum said above, some optimizations might only apply to US-ASCII, although in most cases the coderange should make it apply to UTF-8 too.

> IMO there's two arguments here:
>   - Consistency / Least surprise: UTF-8 is now the default source file encoding, it would make sense that the symbols created out of these files (not just `Symbol` instances, but module names, method names etc) would be UTF-8 as well.

Right, that argument makes sense to me.

Does this PR also address module and method names?
FWIW, TruffleRuby already uses UTF-8 for module and method names, and it seems not to be a compatibility problem.

It will be a bit weird if there is a magic encoding comment though, as then Symbols, module/method names will be UTF-8 if 7-bit but the specified magic encoding if not 7-bit.

----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-79005

-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [ruby-core:93464] [Ruby master Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
       [not found] <redmine.issue-15940.20190619150012@ruby-lang.org>
                   ` (15 preceding siblings ...)
  2019-07-01 17:25 ` [ruby-core:93458] " eregontp
@ 2019-07-02  0:49 ` jean.boussier
  2019-07-02  9:48 ` [ruby-core:93467] " eregontp
                   ` (10 subsequent siblings)
  27 siblings, 0 replies; 28+ messages in thread
From: jean.boussier @ 2019-07-02  0:49 UTC (permalink / raw
  To: ruby-core

Issue #15940 has been updated by byroot (Jean Boussier).

> would it help if Symbol.to_s or Module.name would return a shared string?

It's not really about the returned string, it's about the internal frozen string that is kept in the symbol table. 

> or why not return a frozen string?

I already proposed it, but it was rejected for backward compatibility concerns: https://bugs.ruby-lang.org/issues/15836

And again, it's kind of another topic entirely.

> I mean performance of String operations on a UTF-8 vs a US-ASCII String.

Right. What I was trying to say is that most of the time you compare the symbols directly, which doesn't involve string comparisons.

However it's true that performance might be impacted for operations done on the strings returned by `Symbol#to_s`.

I wonder whether the coderange could be eagerly set, as in this case we do know it's 7-bit. I suppose so, but I need to dig into that part of strings.

> Does this PR also addresses module and method names?

Yes, I think it does. 

> it seems not to be a compatibility problem.

That doesn't surprise me one bit. I bet the vast majority of the strings returned by `Symbol#to_s` and `Module#name` end up converted to UTF-8 because they are concatenated with string literals which are UTF-8.

----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-79014

-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [ruby-core:93467] [Ruby master Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
       [not found] <redmine.issue-15940.20190619150012@ruby-lang.org>
                   ` (16 preceding siblings ...)
  2019-07-02  0:49 ` [ruby-core:93464] " jean.boussier
@ 2019-07-02  9:48 ` eregontp
  2019-07-02 10:07 ` [ruby-core:93469] " jean.boussier
                   ` (9 subsequent siblings)
  27 siblings, 0 replies; 28+ messages in thread
From: eregontp @ 2019-07-02  9:48 UTC (permalink / raw
  To: ruby-core

Issue #15940 has been updated by Eregon (Benoit Daloze).

I'm fine with this proposal, it would be interesting to hear what others think.
@byroot Could you add this issue to #15930 for discussion?

----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-79017

-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [ruby-core:93469] [Ruby master Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
       [not found] <redmine.issue-15940.20190619150012@ruby-lang.org>
                   ` (17 preceding siblings ...)
  2019-07-02  9:48 ` [ruby-core:93467] " eregontp
@ 2019-07-02 10:07 ` jean.boussier
  2019-07-11  7:18 ` [ruby-core:93670] " matz
                   ` (8 subsequent siblings)
  27 siblings, 0 replies; 28+ messages in thread
From: jean.boussier @ 2019-07-02 10:07 UTC (permalink / raw
  To: ruby-core

Issue #15940 has been updated by byroot (Jean Boussier).

Done.

----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-79019

-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [ruby-core:93670] [Ruby master Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
       [not found] <redmine.issue-15940.20190619150012@ruby-lang.org>
                   ` (18 preceding siblings ...)
  2019-07-02 10:07 ` [ruby-core:93469] " jean.boussier
@ 2019-07-11  7:18 ` matz
  2019-07-11 14:48 ` [ruby-core:93687] " eregontp
                   ` (7 subsequent siblings)
  27 siblings, 0 replies; 28+ messages in thread
From: matz @ 2019-07-11  7:18 UTC (permalink / raw
  To: ruby-core

Issue #15940 has been updated by matz (Yukihiro Matsumoto).

First of all, this pull request itself breaks non-UTF-8 programs. It should use the source encoding instead of UTF-8 directly.
Second, 4% of the fstring table is only a fraction of total memory consumption. I am not sure how effective this would be.

If you update the pull-request to use the source encoding, we will merge it for the experiment.

Matz.

----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-79281

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
* Assignee: 
* Target version: 
----------------------------------------
Patch: https://github.com/ruby/ruby/pull/2242

It's not uncommon for symbols to have literal string counterparts, e.g.

```ruby
class User
  attr_accessor :name

  def as_json
    { 'name' => name }
  end
end
```

Since the default source encoding is UTF-8, and that symbols coerce their internal fstring to ASCII when possible, the above snippet will actually keep two instances of `"name"` in the fstring registry. One in ASCII, the other in UTF-8.

Considering that UTF-8 is a strict superset of ASCII, storing the symbols fstrings as UTF-8 instead makes no significant difference, but allows in most cases to reuse the equivalent string literals.

The only notable behavioral change is `Symbol#to_s`.

Previously `:name.to_s.encoding` would be `#<Encoding:US-ASCII>`.
After this patch it's `#<Encoding:UTF-8>`. I can't foresee any significant compatibility impact of this change on existing code.

However, there are several ruby specs asserting this behavior, but I don't know if they can be changed or not: https://github.com/ruby/spec/commit/a73a1c11f13590dccb975ba4348a04423c009453

If this specification is impossible to change, then we could consider changing the encoding of the String returned by `Symbol#to_s`, e.g in ruby pseudo code:

```ruby
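def to_s
  # fstr stands for the symbol's internal fstring (UTF-8 after this patch)
  str = fstr.dup
  # Restore the previous US-ASCII encoding for pure-ASCII symbols
  str.force_encoding(Encoding::ASCII) if str.ascii_only?
  str
end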
```

-- 
https://bugs.ruby-lang.org/

* [ruby-core:93687] [Ruby master Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
       [not found] <redmine.issue-15940.20190619150012@ruby-lang.org>
                   ` (19 preceding siblings ...)
  2019-07-11  7:18 ` [ruby-core:93670] " matz
@ 2019-07-11 14:48 ` eregontp
  2019-07-11 15:17 ` [ruby-core:93688] " matz
                   ` (6 subsequent siblings)
  27 siblings, 0 replies; 28+ messages in thread
From: eregontp @ 2019-07-11 14:48 UTC (permalink / raw
  To: ruby-core

Issue #15940 has been updated by Eregon (Benoit Daloze).

matz (Yukihiro Matsumoto) wrote:
> If you update the pull-request to use the source encoding, we will merge it for the experiment.

@matz I thought the same, but that makes `:foo` in files with different source encodings no longer the same object though:
https://bugs.ruby-lang.org/issues/15940#note-10
That sounds like a larger breaking change to me.

----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-79302

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
* Assignee: 
* Target version: 
----------------------------------------

* [ruby-core:93688] [Ruby master Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
       [not found] <redmine.issue-15940.20190619150012@ruby-lang.org>
                   ` (20 preceding siblings ...)
  2019-07-11 14:48 ` [ruby-core:93687] " eregontp
@ 2019-07-11 15:17 ` matz
  2019-07-12 13:27 ` [ruby-core:93721] " jean.boussier
                   ` (5 subsequent siblings)
  27 siblings, 0 replies; 28+ messages in thread
From: matz @ 2019-07-11 15:17 UTC (permalink / raw
  To: ruby-core

Issue #15940 has been updated by matz (Yukihiro Matsumoto).

That's why I said **experiment**. We need to measure in a real-world application to ensure the change reduces memory consumption before the release. If not, we should give up the idea. I worry that a 4% reduction in the fstring table is not big enough, especially considering the mix of source encodings. But we need a number anyway.

Matz.

----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-79303

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
* Assignee: 
* Target version: 
----------------------------------------

* [ruby-core:93721] [Ruby master Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
       [not found] <redmine.issue-15940.20190619150012@ruby-lang.org>
                   ` (21 preceding siblings ...)
  2019-07-11 15:17 ` [ruby-core:93688] " matz
@ 2019-07-12 13:27 ` jean.boussier
  2019-07-12 13:46 ` [ruby-core:93723] " matz
                   ` (4 subsequent siblings)
  27 siblings, 0 replies; 28+ messages in thread
From: jean.boussier @ 2019-07-12 13:27 UTC (permalink / raw
  To: ruby-core

Issue #15940 has been updated by byroot (Jean Boussier).

> First of all, this pull-request itself breaks non UTF-8 programs.

Could you elaborate on this? I don't understand what breaks in non UTF-8 programs. I ran some tests with `# encoding: EUC-JP` and can't find anything breaking.

However, there was indeed a bug that broke at build time, which I just fixed.

> It should be the source encoding instead of direct UTF-8.

Like @eregon I don't understand the rationale. Currently, regardless of the file encoding, ASCII only symbols are coerced to ASCII encoding.

> 4% of the fstring table is only a fraction of total memory consumption. I am not sure how effective this is.

Yes, I know it's not a big saving. I only submitted it because I couldn't see any real drawback to it, so a small gain for a small effort seemed worth it.

I can already tell approximately how much it saves based on the Redmine benchmark: 3 686 string instances saved, the vast majority of them embedded and therefore `40 B` each, which comes to `147_440 B` (`147kB`).

Compared to `ObjectSpace.memsize_of_all` in the same process, which gives `48_149_329 B`, that is a `0.3%` saving overall in relative terms (or even less, because `ObjectSpace.memsize_of_all` isn't perfectly accurate). That indeed isn't impressive at all.
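
For reference, the arithmetic behind those figures can be reproduced as follows. This is only a sketch using the numbers quoted in this message; the variable names are made up for illustration:

```ruby
saved_strings   = 3_686        # duplicate fstrings removed in the Redmine benchmark
bytes_per_slot  = 40           # size of one embedded string, as stated above
total_memsize   = 48_149_329   # ObjectSpace.memsize_of_all in the same process

saved_bytes = saved_strings * bytes_per_slot
saved_ratio = saved_bytes.to_f / total_memsize

puts saved_bytes                            # 147440
puts format("%.1f%%", saved_ratio * 100)    # 0.3%
```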

That being said, on our internal app, which has a ~10x bigger fstring table, the duplication ratio is similar, so the saving would be over 1MB, which, while still small in relative terms, is significant in absolute terms.

So if you are sure this change would cause issues, then I'd rather close it now because I know the savings it brings won't ever justify it.

----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-79354

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
* Assignee: 
* Target version: 
----------------------------------------

* [ruby-core:93723] [Ruby master Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
       [not found] <redmine.issue-15940.20190619150012@ruby-lang.org>
                   ` (22 preceding siblings ...)
  2019-07-12 13:27 ` [ruby-core:93721] " jean.boussier
@ 2019-07-12 13:46 ` matz
  2019-07-12 14:48 ` [ruby-core:93726] " jean.boussier
                   ` (3 subsequent siblings)
  27 siblings, 0 replies; 28+ messages in thread
From: matz @ 2019-07-12 13:46 UTC (permalink / raw
  To: ruby-core

Issue #15940 has been updated by matz (Yukihiro Matsumoto).

If I understand your patch correctly:

```ruby
# encoding: euc-jp
# old behavior
p "foo".encoding     # => #<Encoding:EUC-JP>
p :foo.to_s.encoding # => #<Encoding:US-ASCII>

# new behavior
p "foo".encoding     # => #<Encoding:EUC-JP>
p :foo.to_s.encoding # => #<Encoding:UTF-8>
```

I feel this is an inconsistent and confusing behavior change. Am I wrong?

Besides that, I am not sure if this change is worth it to save 147KB, or even 1.4MB, in apps that might consume a few hundred GB of memory.

Matz.

----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-79356

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
* Assignee: 
* Target version: 
----------------------------------------

* [ruby-core:93726] [Ruby master Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
       [not found] <redmine.issue-15940.20190619150012@ruby-lang.org>
                   ` (23 preceding siblings ...)
  2019-07-12 13:46 ` [ruby-core:93723] " matz
@ 2019-07-12 14:48 ` jean.boussier
  2019-07-12 14:50 ` [ruby-core:93727] " jean.boussier
                   ` (2 subsequent siblings)
  27 siblings, 0 replies; 28+ messages in thread
From: jean.boussier @ 2019-07-12 14:48 UTC (permalink / raw
  To: ruby-core

Issue #15940 has been updated by byroot (Jean Boussier).

> If I understand your patch correctly

Yes you do.

> I feel this is an inconsistent and confusing behavior change. Am I wrong?

I don't know if you are wrong, but at least we don't agree.

My reasoning is as follows:

  - Simple symbols (read: pure ASCII) have to be coerced into a common encoding so that `:foo` in a `# encoding: euc-jp` file is `==` to `:foo` in a `# encoding: iso-8859-1` file.
  - UTF-8 is a strict superset of ASCII. Any valid ASCII is valid UTF-8.
  - Simple symbols being UTF-8 encoded isn't any weirder to me than them being ASCII encoded.
  - UTF-8 being the default Ruby source encoding, it makes sense for it to be the default internal symbol encoding.
  - If, like most Ruby users, my source is UTF-8 encoded, then it removes one source of surprise.
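
The first point can be checked without separate source files. A small sketch, assuming `force_encoding` on a runtime string is an acceptable stand-in for a literal in a file with that source encoding (the variable names are illustrative):

```ruby
# "foo" tagged with two different ASCII-compatible encodings.
euc_foo    = "foo".dup.force_encoding(Encoding::EUC_JP)
latin1_foo = "foo".dup.force_encoding(Encoding::ISO_8859_1)

sym_from_euc    = euc_foo.to_sym
sym_from_latin1 = latin1_foo.to_sym

# Pure-ASCII symbols are coerced to one common encoding,
# so both strings intern to the very same Symbol object.
puts sym_from_euc.equal?(sym_from_latin1)
puts sym_from_euc.equal?(:foo)
puts sym_from_euc.encoding == sym_from_latin1.encoding
```

Which common encoding is used (US-ASCII today, UTF-8 with the patch) is exactly what this thread is debating; the sketch only shows that there has to be a single one.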

> Besides that, I am not sure if this change is worth it to save 147KB, or even 1.4MB, in apps that might consume a few hundred GB of memory.

That is entirely your call. I personally don't see any downside to this change, which is why the minor memory saving is welcome to me, but if you see some downside to it then I agree it's not a big enough saving to justify it.

Also, a small nitpick: the 1.4MB saving is for an app consuming hundreds of MB, not GB.

----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-79358

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
* Assignee: 
* Target version: 
----------------------------------------

* [ruby-core:93727] [Ruby master Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
       [not found] <redmine.issue-15940.20190619150012@ruby-lang.org>
                   ` (24 preceding siblings ...)
  2019-07-12 14:48 ` [ruby-core:93726] " jean.boussier
@ 2019-07-12 14:50 ` jean.boussier
  2019-07-12 15:06 ` [ruby-core:93730] " matz
  2019-07-30  4:17 ` [ruby-core:94020] " ko1
  27 siblings, 0 replies; 28+ messages in thread
From: jean.boussier @ 2019-07-12 14:50 UTC (permalink / raw
  To: ruby-core

Issue #15940 has been updated by byroot (Jean Boussier).

Just to extend your example:

```ruby
# encoding: euc-jp
# old behavior
p "foo".encoding          # => #<Encoding:EUC-JP>
p :foo.to_s.encoding      # => #<Encoding:US-ASCII>
p :"注意書".to_s.encoding # => #<Encoding:EUC-JP>

# new behavior
p "foo".encoding          # => #<Encoding:EUC-JP>
p :foo.to_s.encoding      # => #<Encoding:UTF-8>
p :"注意書".to_s.encoding # => #<Encoding:EUC-JP>
```

If the symbol is not pure ASCII, the current behavior doesn't change.
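
The non-ASCII case can also be reproduced from a UTF-8 script. A minimal sketch, assuming a transcoded runtime string behaves like a literal in an EUC-JP source file (the names are illustrative):

```ruby
utf8_str = "\u6CE8\u610F\u66F8"                 # "注意書" as a UTF-8 string
euc_str  = utf8_str.encode(Encoding::EUC_JP)    # same text, EUC-JP bytes

euc_sym  = euc_str.to_sym
utf8_sym = utf8_str.to_sym

# Non-ASCII symbols keep the encoding of the string they came from.
puts euc_sym.to_s.encoding
puts utf8_sym.to_s.encoding

# Different bytes and encodings, so these are two distinct symbols.
puts euc_sym == utf8_sym
```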

----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-79359

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
* Assignee: 
* Target version: 
----------------------------------------

* [ruby-core:93730] [Ruby master Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
       [not found] <redmine.issue-15940.20190619150012@ruby-lang.org>
                   ` (25 preceding siblings ...)
  2019-07-12 14:50 ` [ruby-core:93727] " jean.boussier
@ 2019-07-12 15:06 ` matz
  2019-07-30  4:17 ` [ruby-core:94020] " ko1
  27 siblings, 0 replies; 28+ messages in thread
From: matz @ 2019-07-12 15:06 UTC (permalink / raw
  To: ruby-core

Issue #15940 has been updated by matz (Yukihiro Matsumoto).

> UTF-8 is a strict superset of ASCII. Any valid ASCII is valid UTF-8.

So is EUC-JP (or Shift_JIS or many other encodings). Even though UTF-8 is the current default source encoding of Ruby, ALL ASCII symbols being in UTF-8 encoding makes me feel weird.

I understand the need and room to reduce fstring table size, but probably in a different way.

Matz.

----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-79362

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
* Assignee: 
* Target version: 
----------------------------------------

* [ruby-core:94020] [Ruby master Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
       [not found] <redmine.issue-15940.20190619150012@ruby-lang.org>
                   ` (26 preceding siblings ...)
  2019-07-12 15:06 ` [ruby-core:93730] " matz
@ 2019-07-30  4:17 ` ko1
  27 siblings, 0 replies; 28+ messages in thread
From: ko1 @ 2019-07-30  4:17 UTC (permalink / raw
  To: ruby-core

Issue #15940 has been updated by ko1 (Koichi Sasada).

Assignee set to naruse (Yui NARUSE)

----------------------------------------
Feature #15940: Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals
https://bugs.ruby-lang.org/issues/15940#change-80244

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal
* Assignee: naruse (Yui NARUSE)
* Target version: 
----------------------------------------

end of thread, other threads:[~2019-07-30  4:18 UTC | newest]

Thread overview: 28+ messages
     [not found] <redmine.issue-15940.20190619150012@ruby-lang.org>
2019-06-19 15:00 ` [ruby-core:93250] [Ruby trunk Feature#15940] Coerce symbols internal fstrings in UTF8 rather than ASCII to better share memory with string literals jean.boussier
2019-06-19 15:55 ` [ruby-core:93251] " jean.boussier
2019-06-20  4:48 ` [ruby-core:93269] " duerst
2019-06-25 13:53 ` [ruby-core:93358] " naruse
2019-06-25 22:06 ` [ruby-core:93365] " jean.boussier
2019-06-27  9:41 ` [ruby-core:93385] " duerst
2019-06-28  9:26 ` [ruby-core:93402] " eregontp
2019-06-28  9:29 ` [ruby-core:93403] " eregontp
2019-06-28  9:31 ` [ruby-core:93404] " eregontp
2019-06-28 16:33 ` [ruby-core:93413] " ruby
2019-07-01 11:36 ` [ruby-core:93449] [Ruby master " jean.boussier
2019-07-01 14:33 ` [ruby-core:93452] " eregontp
2019-07-01 14:38 ` [ruby-core:93453] " eregontp
2019-07-01 14:48 ` [ruby-core:93454] " jean.boussier
2019-07-01 15:37 ` [ruby-core:93456] " hanmac
2019-07-01 17:25 ` [ruby-core:93458] " eregontp
2019-07-02  0:49 ` [ruby-core:93464] " jean.boussier
2019-07-02  9:48 ` [ruby-core:93467] " eregontp
2019-07-02 10:07 ` [ruby-core:93469] " jean.boussier
2019-07-11  7:18 ` [ruby-core:93670] " matz
2019-07-11 14:48 ` [ruby-core:93687] " eregontp
2019-07-11 15:17 ` [ruby-core:93688] " matz
2019-07-12 13:27 ` [ruby-core:93721] " jean.boussier
2019-07-12 13:46 ` [ruby-core:93723] " matz
2019-07-12 14:48 ` [ruby-core:93726] " jean.boussier
2019-07-12 14:50 ` [ruby-core:93727] " jean.boussier
2019-07-12 15:06 ` [ruby-core:93730] " matz
2019-07-30  4:17 ` [ruby-core:94020] " ko1
