[ruby-core:117469] [Ruby master Feature#20415] Precompute literal String hash code during compilation

ruby-core@ruby-lang.org archive (unofficial mirror)
 help / color / mirror / Atom feed

From: "byroot (Jean Boussier) via ruby-core" <ruby-core@ml.ruby-lang.org>
To: ruby-core@ml.ruby-lang.org
Cc: "byroot (Jean Boussier)" <noreply@ruby-lang.org>
Subject: [ruby-core:117469] [Ruby master Feature#20415] Precompute literal String hash code during compilation
Date: Tue, 09 Apr 2024 07:43:43 +0000 (UTC)	[thread overview]
Message-ID: <redmine.issue-20415.20240409074343.7941@ruby-lang.org> (raw)
In-Reply-To: redmine.issue-20415.20240409074343.7941@ruby-lang.org

Issue #20415 has been reported by byroot (Jean Boussier).

----------------------------------------
Feature #20415: Precompute literal String hash code during compilation
https://bugs.ruby-lang.org/issues/20415

* Author: byroot (Jean Boussier)
* Status: Open
----------------------------------------
I worked on a proof of concept with @etienne which I think has some potential, but I'm looking for feedback on what would be the best implementation.


The proof of concept is here: https://github.com/Shopify/ruby/pull/553

### Idea

Most string literals are relatively short, hence embedded, and have some wasted bytes at the end of their slot. We could use that wasted space to store the string hash.

The goal being to make **looking up a literal String key in a hash, as fast as a Symbol key**. The goal isn't to memoize the hash code of all strings, but to **only selectively precompute the hash code of literal strings
in the compiler**. The compiler could even selectively do this when we literal string is used to lookup a hash (`opt_aref`).

Here's the benchmark we used:

```ruby
hash = 10.times.to_h do |i|
  [i, i]
end

dyn_sym = "dynamic_symbol".to_sym
hash[:some_symbol] = 1
hash[dyn_sym] = 1
hash["small"] = 2
hash["frozen_string_literal"] = 2

Benchmark.ips do |x|
  x.report("symbol") { hash[:some_symbol] }
  x.report("dyn_symbol") { hash[:some_symbol] }
  x.report("small_lit") { hash["small"] }
  x.report("frozen_lit") { hash["frozen_string_literal"] }
  x.compare!(order: :baseline)
end
```

On Ruby 3.3.0, looking up a String key is a bit slower based on the key size:

```
Calculating -------------------------------------
              symbol     24.175M (± 1.7%) i/s -    122.002M in   5.048306s
          dyn_symbol     24.345M (± 1.6%) i/s -    122.019M in   5.013400s
           small_lit     21.252M (± 2.1%) i/s -    107.744M in   5.072042s
          frozen_lit     20.095M (± 1.3%) i/s -    100.489M in   5.001681s

Comparison:
              symbol: 24174848.1 i/s
          dyn_symbol: 24345476.9 i/s - same-ish: difference falls within error
           small_lit: 21252403.2 i/s - 1.14x  slower
          frozen_lit: 20094766.0 i/s - 1.20x  slower
```

With the proof of concept performance is pretty much identical:

```
Calculating -------------------------------------
              symbol     23.528M (± 6.9%) i/s -    117.584M in   5.033231s
          dyn_symbol     23.777M (± 4.7%) i/s -    120.231M in   5.071734s
           small_lit     23.066M (± 2.9%) i/s -    115.376M in   5.006947s
          frozen_lit     22.729M (± 1.1%) i/s -    115.693M in   5.090700s

Comparison:
              symbol: 23527823.6 i/s
          dyn_symbol: 23776757.8 i/s - same-ish: difference falls within error
           small_lit: 23065535.3 i/s - same-ish: difference falls within error
          frozen_lit: 22729351.6 i/s - same-ish: difference falls within error
```

### Possible implementation

The reason I'm opening this issue early is to get feedback on which would be the best implementation.

#### Store hashcode after the string terminator

Right now the proof of concept simply stores the `st_index_t` after the string null terminator, and only when the string is embedded and as enough left over space.
Strings with a precomputed hash are marked with an user flag.

Pros:

  - Very simple implementation, no need to change a lot of code, and very easy to strip out if we want to.
  - Doesn't use any extra memory. If the string doesn't have enough left over bytes, the optimization simply isn't applied.
  - The worst case overhead is a single `FL_TEST_RAW` in `rb_str_hash`.

Cons:

  - The optimization won't apply to certain string sizes. e.g. strings between `17` and `23` bytes won't have a precomputed hash code.
  - Extracting the hash code requires some not so nice pointer arithmetic.


#### Create another RString union

Another possibility would be to add another entry in the `RString` struct union, such as we'd have:

```c
struct RString {
    struct RBasic basic;
    long len;
    union {
        // ... existing members
        struct {
            st_index_t hash;
            char ary[1];
        } embded_literal;
    } as;
};
```

Pros:

  - The optimization can now be applied to all string sizes.
  - The hashcode is always at the same offset and properly aligned.

Cons:

  - Some strings would be bumped by one slot size, so would use marginally more memory.
  - Complexify the code base more, need to modify a lot more string related code (e.g. `RSTRING_PTR` and many others)
  - When compiling such string, if an equal string already exists in the `fstring` table, we'd need to replace it, we can't just mutate it in place to add the hashcode.


### Prior art

[Feature #15331] is somewhat similar in its idea, but it does it lazily for all strings. Here it's much simpler because limited to string literals, which are the ones likely to be used as Hash keys, and the overhead is on compilation, not runtime (aside from a single flag check). So I think most of the caveats of that original implementation don't apply here.




-- 
https://bugs.ruby-lang.org/
 ______________________________________________
 ruby-core mailing list -- ruby-core@ml.ruby-lang.org
 To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org
 ruby-core info -- https://ml.ruby-lang.org/mailman3/postorius/lists/ruby-core.ml.ruby-lang.org/

next      parent reply	other threads:[~2024-04-09  7:43 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-04-09  7:43 byroot (Jean Boussier) via ruby-core [this message]
2024-04-09 12:53 ` [ruby-core:117472] [Ruby master Feature#20415] Precompute literal String hash code during compilation Eregon (Benoit Daloze) via ruby-core
2024-04-09 13:10 ` [ruby-core:117473] " byroot (Jean Boussier) via ruby-core
2024-04-09 14:15 ` [ruby-core:117475] " Eregon (Benoit Daloze) via ruby-core
2024-04-09 14:23 ` [ruby-core:117476] " byroot (Jean Boussier) via ruby-core
2024-04-22 10:31 ` [ruby-core:117642] " etienne via ruby-core
2024-05-09 12:48 ` [ruby-core:117819] " shyouhei (Shyouhei Urabe) via ruby-core
2024-05-09 12:54 ` [ruby-core:117820] " byroot (Jean Boussier) via ruby-core
2024-05-09 13:46 ` [ruby-core:117822] " byroot (Jean Boussier) via ruby-core

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-list from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.ruby-lang.org/en/community/mailing-lists/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=redmine.issue-20415.20240409074343.7941@ruby-lang.org \
    --to=ruby-core@ruby-lang.org \
    --cc=noreply@ruby-lang.org \
    --cc=ruby-core@ml.ruby-lang.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).