ruby-core@ruby-lang.org archive (unofficial mirror)
 help / color / mirror / Atom feed
* [ruby-core:107784] [Ruby master Feature#18611] Promote best practice for combining multiple values into a hash code
@ 2022-03-07 17:57 chrisseaton (Chris Seaton)
  2022-03-07 19:15 ` [ruby-core:107785] " Dan0042 (Daniel DeLorme)
                   ` (10 more replies)
  0 siblings, 11 replies; 12+ messages in thread
From: chrisseaton (Chris Seaton) @ 2022-03-07 17:57 UTC (permalink / raw
  To: ruby-core

Issue #18611 has been reported by chrisseaton (Chris Seaton).

----------------------------------------
Feature #18611: Promote best practice for combining multiple values into a hash code
https://bugs.ruby-lang.org/issues/18611

* Author: chrisseaton (Chris Seaton)
* Status: Open
* Priority: Normal
----------------------------------------
User-defined hash methods often work by combining the hash code of several values into one. This requires some logic to combine the values. In our experience, users are making a variety of choices for the algorithm for this, and in many cases are picking an option which may not be the best for security and performance in multiple ways. It's also a shame that users are having to think about how to do this basic operation themselves.

For example, this hash method creates a single hash code by combining the hash value of three values that make up the object. The user has combined the values using the xor operator and multiplication by prime numbers to distribute bits. This is an ok way to combine multiple values into a hash code.

```ruby
def hash
  x.hash ^ (y.hash * 3) ^ (z.hash * 5)
end
```

But users have to know to do this, and they sometimes get it wrong, writing things like this.

```ruby
def hash
  x.hash ^ y.hash ^ z.hash
end
```

This isn't distributing the bits very well. A bad hash code may harm performance if it cause more collisions in a hash table. Collisions may also cause excess cache eviction, which would further harm performance. If performance is reduced in this way there's a potential security risk due to denial-of-service. (We don't think this is an immediate practical security problem, which is why we're discussing in the open issue tracker, not the security mailing list.)

The `x.hash ^ (y.hash * 3) ^ (z.hash * 5)` pattern is still not ideal, as users have to manually write it, and it's a lot of logic to execute in the Ruby interpreter, when it could be possibly be done in native code instead. A better pattern we think is this:

```ruby
def hash
  [x, y, z].hash
end
```

This leaves the logic of creating a suitable and safe hash value to `[...].hash`, which does it correctly.

Why doesn't everyone already use this pattern? Because it's not documented as the right thing to do. We want to present a couple of options for what could be done to encourage people to use this pattern or an equivalent, to help people write more concise and clear code that is also more performant and secure.

# Document `[...].hash` as best practice and optimise it

If we want people to use `[...].hash`, we should say that in the documentation for `Kernel#hash` as the best practice. Wording along the lines of

> If you're implementing a hash code for a compound set of values, best practice is to combine them with `[...].hash`. For example....

This way people reading the documentation on `Kernel#hash` get pointed in the clear, concise, performant, secure direction.

We can combine this recommendation with an optimisation to `[...].hash` to remove the array allocation in implementation of Ruby without escape analysis and scalar replacement, similar to what is done for `Array#min` and `#max`. This way the best practice is even faster.

# Introduce a new similar method, but specifically for the purpose so it is discoverable

A second option is to introduce a new method, specifically for this task, `hash_objects(...)`. This is inspired By Java's `Objects.hash(...)`. The reason for the new method is that it should make it more discoverable - if you go looking for a tool to combine hash values you'd find one. We'd still link to it from `Kernel#hash`. This method would not require the array allocation removal optimisation, as it's just a simple call.

# Examples of hash methods

Even the MRI codebase has some suboptimal hash methods we don't need to look very far for examples. For example lib `lib/resolv.rb`, these two hash methods don't distribute the bits they combine

* https://github.com/ruby/ruby/blob/c445963575a8572f6b0baf7135093c128adab3b9/lib/resolv.rb#L1734
* https://github.com/ruby/ruby/blob/c445963575a8572f6b0baf7135093c128adab3b9/lib/resolv.rb#L1307

Both these examples could be replaced with either of our proposals.

A good example of someone already using best practice is this.

* https://github.com/ruby/ruby/blob/128972189284f4338722e8a910d0b4f6e7a02b31/lib/bundler/source/git.rb#L50

But this would still be faster with the optimisation we proposed, or using `hash_objects(...)`, as that'd remove the array allocation and the `hash` call.

# Other things we've already done

We've proposed a RuboCop cop to try to catch the pattern we think is suboptimal https://github.com/rubocop/rubocop/pull/10441.

Co-authored with @sambostock.




-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [ruby-core:107785] [Ruby master Feature#18611] Promote best practice for combining multiple values into a hash code
  2022-03-07 17:57 [ruby-core:107784] [Ruby master Feature#18611] Promote best practice for combining multiple values into a hash code chrisseaton (Chris Seaton)
@ 2022-03-07 19:15 ` Dan0042 (Daniel DeLorme)
  2022-03-07 19:32 ` [ruby-core:107786] " chrisseaton (Chris Seaton)
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Dan0042 (Daniel DeLorme) @ 2022-03-07 19:15 UTC (permalink / raw
  To: ruby-core

Issue #18611 has been updated by Dan0042 (Daniel DeLorme).


chrisseaton (Chris Seaton) wrote:
> A better pattern we think is this:
> 
> ```ruby
> def hash
>   [x, y, z].hash
> end
> ```

Not a bad idea, but wouldn't this be even better?

```ruby
def hash
  [self.class, x, y, z].hash
end
```

Which is roughly what Struct does, I believe

```ruby
X = Struct.new(:a, :b, :c)
Y = Struct.new(:a, :b, :c)
X.new(1,2,3).hash #=> 1941937414662583733
X.new(1,2,3).hash #=> 1941937414662583733
Y.new(1,2,3).hash #=> 1583914263362372853
```

And at the same time Struct also defines `eql?` to follow the same rules as `hash`. So maybe a better "best practice" would be to inherit from Struct when you want hash/eql based on a compound set of values. I think this makes the "Value Struct" proposed in #16122 more important, since you may not want that class to be Enumerable.

----------------------------------------
Feature #18611: Promote best practice for combining multiple values into a hash code
https://bugs.ruby-lang.org/issues/18611#change-96709

* Author: chrisseaton (Chris Seaton)
* Status: Open
* Priority: Normal
----------------------------------------
User-defined hash methods often work by combining the hash code of several values into one. This requires some logic to combine the values. In our experience, users are making a variety of choices for the algorithm for this, and in many cases are picking an option which may not be the best for security and performance in multiple ways. It's also a shame that users are having to think about how to do this basic operation themselves.

For example, this hash method creates a single hash code by combining the hash value of three values that make up the object. The user has combined the values using the xor operator and multiplication by prime numbers to distribute bits. This is an ok way to combine multiple values into a hash code.

```ruby
def hash
  x.hash ^ (y.hash * 3) ^ (z.hash * 5)
end
```

But users have to know to do this, and they sometimes get it wrong, writing things like this.

```ruby
def hash
  x.hash ^ y.hash ^ z.hash
end
```

This isn't distributing the bits very well. A bad hash code may harm performance if it cause more collisions in a hash table. Collisions may also cause excess cache eviction, which would further harm performance. If performance is reduced in this way there's a potential security risk due to denial-of-service. (We don't think this is an immediate practical security problem, which is why we're discussing in the open issue tracker, not the security mailing list.)

The `x.hash ^ (y.hash * 3) ^ (z.hash * 5)` pattern is still not ideal, as users have to manually write it, and it's a lot of logic to execute in the Ruby interpreter, when it could be possibly be done in native code instead. A better pattern we think is this:

```ruby
def hash
  [x, y, z].hash
end
```

This leaves the logic of creating a suitable and safe hash value to `[...].hash`, which does it correctly.

Why doesn't everyone already use this pattern? Because it's not documented as the right thing to do. We want to present a couple of options for what could be done to encourage people to use this pattern or an equivalent, to help people write more concise and clear code that is also more performant and secure.

# Document `[...].hash` as best practice and optimise it

If we want people to use `[...].hash`, we should say that in the documentation for `Kernel#hash` as the best practice. Wording along the lines of

> If you're implementing a hash code for a compound set of values, best practice is to combine them with `[...].hash`. For example....

This way people reading the documentation on `Kernel#hash` get pointed in the clear, concise, performant, secure direction.

We can combine this recommendation with an optimisation to `[...].hash` to remove the array allocation in implementation of Ruby without escape analysis and scalar replacement, similar to what is done for `Array#min` and `#max`. This way the best practice is even faster.

# Introduce a new similar method, but specifically for the purpose so it is discoverable

A second option is to introduce a new method, specifically for this task, `hash_objects(...)`. This is inspired By Java's `Objects.hash(...)`. The reason for the new method is that it should make it more discoverable - if you go looking for a tool to combine hash values you'd find one. We'd still link to it from `Kernel#hash`. This method would not require the array allocation removal optimisation, as it's just a simple call.

# Examples of hash methods

Even the MRI codebase has some suboptimal hash methods we don't need to look very far for examples. For example lib `lib/resolv.rb`, these two hash methods don't distribute the bits they combine

* https://github.com/ruby/ruby/blob/c445963575a8572f6b0baf7135093c128adab3b9/lib/resolv.rb#L1734
* https://github.com/ruby/ruby/blob/c445963575a8572f6b0baf7135093c128adab3b9/lib/resolv.rb#L1307

Both these examples could be replaced with either of our proposals.

A good example of someone already using best practice is this.

* https://github.com/ruby/ruby/blob/128972189284f4338722e8a910d0b4f6e7a02b31/lib/bundler/source/git.rb#L50

But this would still be faster with the optimisation we proposed, or using `hash_objects(...)`, as that'd remove the array allocation and the `hash` call.

# Other things we've already done

We've proposed a RuboCop cop to try to catch the pattern we think is suboptimal https://github.com/rubocop/rubocop/pull/10441.

Co-authored with @sambostock.




-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [ruby-core:107786] [Ruby master Feature#18611] Promote best practice for combining multiple values into a hash code
  2022-03-07 17:57 [ruby-core:107784] [Ruby master Feature#18611] Promote best practice for combining multiple values into a hash code chrisseaton (Chris Seaton)
  2022-03-07 19:15 ` [ruby-core:107785] " Dan0042 (Daniel DeLorme)
@ 2022-03-07 19:32 ` chrisseaton (Chris Seaton)
  2022-03-08 11:19 ` [ruby-core:107793] " Eregon (Benoit Daloze)
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: chrisseaton (Chris Seaton) @ 2022-03-07 19:32 UTC (permalink / raw
  To: ruby-core

Issue #18611 has been updated by chrisseaton (Chris Seaton).


Yes we could recommend adding `class` to the mixture. That's an argument for `hash_values` since if it has `self` already and can fix in the class itself.

I think refactoring lots of work to use `Struct` is a bit more radical. There are existing APIs that we may not want to modify like that.

----------------------------------------
Feature #18611: Promote best practice for combining multiple values into a hash code
https://bugs.ruby-lang.org/issues/18611#change-96710

* Author: chrisseaton (Chris Seaton)
* Status: Open
* Priority: Normal
----------------------------------------
User-defined hash methods often work by combining the hash code of several values into one. This requires some logic to combine the values. In our experience, users are making a variety of choices for the algorithm for this, and in many cases are picking an option which may not be the best for security and performance in multiple ways. It's also a shame that users are having to think about how to do this basic operation themselves.

For example, this hash method creates a single hash code by combining the hash value of three values that make up the object. The user has combined the values using the xor operator and multiplication by prime numbers to distribute bits. This is an ok way to combine multiple values into a hash code.

```ruby
def hash
  x.hash ^ (y.hash * 3) ^ (z.hash * 5)
end
```

But users have to know to do this, and they sometimes get it wrong, writing things like this.

```ruby
def hash
  x.hash ^ y.hash ^ z.hash
end
```

This isn't distributing the bits very well. A bad hash code may harm performance if it cause more collisions in a hash table. Collisions may also cause excess cache eviction, which would further harm performance. If performance is reduced in this way there's a potential security risk due to denial-of-service. (We don't think this is an immediate practical security problem, which is why we're discussing in the open issue tracker, not the security mailing list.)

The `x.hash ^ (y.hash * 3) ^ (z.hash * 5)` pattern is still not ideal, as users have to manually write it, and it's a lot of logic to execute in the Ruby interpreter, when it could be possibly be done in native code instead. A better pattern we think is this:

```ruby
def hash
  [x, y, z].hash
end
```

This leaves the logic of creating a suitable and safe hash value to `[...].hash`, which does it correctly.

Why doesn't everyone already use this pattern? Because it's not documented as the right thing to do. We want to present a couple of options for what could be done to encourage people to use this pattern or an equivalent, to help people write more concise and clear code that is also more performant and secure.

# Document `[...].hash` as best practice and optimise it

If we want people to use `[...].hash`, we should say that in the documentation for `Kernel#hash` as the best practice. Wording along the lines of

> If you're implementing a hash code for a compound set of values, best practice is to combine them with `[...].hash`. For example....

This way people reading the documentation on `Kernel#hash` get pointed in the clear, concise, performant, secure direction.

We can combine this recommendation with an optimisation to `[...].hash` to remove the array allocation in implementation of Ruby without escape analysis and scalar replacement, similar to what is done for `Array#min` and `#max`. This way the best practice is even faster.

# Introduce a new similar method, but specifically for the purpose so it is discoverable

A second option is to introduce a new method, specifically for this task, `hash_objects(...)`. This is inspired By Java's `Objects.hash(...)`. The reason for the new method is that it should make it more discoverable - if you go looking for a tool to combine hash values you'd find one. We'd still link to it from `Kernel#hash`. This method would not require the array allocation removal optimisation, as it's just a simple call.

# Examples of hash methods

Even the MRI codebase has some suboptimal hash methods we don't need to look very far for examples. For example lib `lib/resolv.rb`, these two hash methods don't distribute the bits they combine

* https://github.com/ruby/ruby/blob/c445963575a8572f6b0baf7135093c128adab3b9/lib/resolv.rb#L1734
* https://github.com/ruby/ruby/blob/c445963575a8572f6b0baf7135093c128adab3b9/lib/resolv.rb#L1307

Both these examples could be replaced with either of our proposals.

A good example of someone already using best practice is this.

* https://github.com/ruby/ruby/blob/128972189284f4338722e8a910d0b4f6e7a02b31/lib/bundler/source/git.rb#L50

But this would still be faster with the optimisation we proposed, or using `hash_objects(...)`, as that'd remove the array allocation and the `hash` call.

# Other things we've already done

We've proposed a RuboCop cop to try to catch the pattern we think is suboptimal https://github.com/rubocop/rubocop/pull/10441.

Co-authored with @sambostock.




-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [ruby-core:107793] [Ruby master Feature#18611] Promote best practice for combining multiple values into a hash code
  2022-03-07 17:57 [ruby-core:107784] [Ruby master Feature#18611] Promote best practice for combining multiple values into a hash code chrisseaton (Chris Seaton)
  2022-03-07 19:15 ` [ruby-core:107785] " Dan0042 (Daniel DeLorme)
  2022-03-07 19:32 ` [ruby-core:107786] " chrisseaton (Chris Seaton)
@ 2022-03-08 11:19 ` Eregon (Benoit Daloze)
  2022-04-01 11:37 ` [ruby-core:108154] " byroot (Jean Boussier)
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Eregon (Benoit Daloze) @ 2022-03-08 11:19 UTC (permalink / raw
  To: ruby-core

Issue #18611 has been updated by Eregon (Benoit Daloze).


Including the class is often (maybe even always?) unnecessary, and does impact performance.
Struct is special because it generates generic classes, so there it makes sense to include the class, but that's about the only case where it does IMHO.

I think `[a, b, c].hash` is the best by far, could you open a PR to document that?

Adding a new method would prevent adopting it now and just delay solving this problem, I don't think it's a productive way if an existing way already works fine (`[a, b, c].hash`).

(BTW, `hash_objects` would still likely allocate an array (internal or Ruby-level) for the var-args, so that's worse than a literal array actually.)

----------------------------------------
Feature #18611: Promote best practice for combining multiple values into a hash code
https://bugs.ruby-lang.org/issues/18611#change-96717

* Author: chrisseaton (Chris Seaton)
* Status: Open
* Priority: Normal
----------------------------------------
User-defined hash methods often work by combining the hash code of several values into one. This requires some logic to combine the values. In our experience, users are making a variety of choices for the algorithm for this, and in many cases are picking an option which may not be the best for security and performance in multiple ways. It's also a shame that users are having to think about how to do this basic operation themselves.

For example, this hash method creates a single hash code by combining the hash value of three values that make up the object. The user has combined the values using the xor operator and multiplication by prime numbers to distribute bits. This is an ok way to combine multiple values into a hash code.

```ruby
def hash
  x.hash ^ (y.hash * 3) ^ (z.hash * 5)
end
```

But users have to know to do this, and they sometimes get it wrong, writing things like this.

```ruby
def hash
  x.hash ^ y.hash ^ z.hash
end
```

This isn't distributing the bits very well. A bad hash code may harm performance if it cause more collisions in a hash table. Collisions may also cause excess cache eviction, which would further harm performance. If performance is reduced in this way there's a potential security risk due to denial-of-service. (We don't think this is an immediate practical security problem, which is why we're discussing in the open issue tracker, not the security mailing list.)

The `x.hash ^ (y.hash * 3) ^ (z.hash * 5)` pattern is still not ideal, as users have to manually write it, and it's a lot of logic to execute in the Ruby interpreter, when it could be possibly be done in native code instead. A better pattern we think is this:

```ruby
def hash
  [x, y, z].hash
end
```

This leaves the logic of creating a suitable and safe hash value to `[...].hash`, which does it correctly.

Why doesn't everyone already use this pattern? Because it's not documented as the right thing to do. We want to present a couple of options for what could be done to encourage people to use this pattern or an equivalent, to help people write more concise and clear code that is also more performant and secure.

# Document `[...].hash` as best practice and optimise it

If we want people to use `[...].hash`, we should say that in the documentation for `Kernel#hash` as the best practice. Wording along the lines of

> If you're implementing a hash code for a compound set of values, best practice is to combine them with `[...].hash`. For example....

This way people reading the documentation on `Kernel#hash` get pointed in the clear, concise, performant, secure direction.

We can combine this recommendation with an optimisation to `[...].hash` to remove the array allocation in implementation of Ruby without escape analysis and scalar replacement, similar to what is done for `Array#min` and `#max`. This way the best practice is even faster.

# Introduce a new similar method, but specifically for the purpose so it is discoverable

A second option is to introduce a new method, specifically for this task, `hash_objects(...)`. This is inspired By Java's `Objects.hash(...)`. The reason for the new method is that it should make it more discoverable - if you go looking for a tool to combine hash values you'd find one. We'd still link to it from `Kernel#hash`. This method would not require the array allocation removal optimisation, as it's just a simple call.

# Examples of hash methods

Even the MRI codebase has some suboptimal hash methods we don't need to look very far for examples. For example lib `lib/resolv.rb`, these two hash methods don't distribute the bits they combine

* https://github.com/ruby/ruby/blob/c445963575a8572f6b0baf7135093c128adab3b9/lib/resolv.rb#L1734
* https://github.com/ruby/ruby/blob/c445963575a8572f6b0baf7135093c128adab3b9/lib/resolv.rb#L1307

Both these examples could be replaced with either of our proposals.

A good example of someone already using best practice is this.

* https://github.com/ruby/ruby/blob/128972189284f4338722e8a910d0b4f6e7a02b31/lib/bundler/source/git.rb#L50

But this would still be faster with the optimisation we proposed, or using `hash_objects(...)`, as that'd remove the array allocation and the `hash` call.

# Other things we've already done

We've proposed a RuboCop cop to try to catch the pattern we think is suboptimal https://github.com/rubocop/rubocop/pull/10441.

Co-authored with @sambostock.




-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [ruby-core:108154] [Ruby master Feature#18611] Promote best practice for combining multiple values into a hash code
  2022-03-07 17:57 [ruby-core:107784] [Ruby master Feature#18611] Promote best practice for combining multiple values into a hash code chrisseaton (Chris Seaton)
                   ` (2 preceding siblings ...)
  2022-03-08 11:19 ` [ruby-core:107793] " Eregon (Benoit Daloze)
@ 2022-04-01 11:37 ` byroot (Jean Boussier)
  2022-04-16 15:42 ` [ruby-core:108268] " Eregon (Benoit Daloze)
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: byroot (Jean Boussier) @ 2022-04-01 11:37 UTC (permalink / raw
  To: ruby-core

Issue #18611 has been updated by byroot (Jean Boussier).


> Including the class is often (maybe even always?) unnecessary, and does impact performance.

Could you elaborate a bit more on this? In my mind `#hash` should include all the elements that may change the result of `#eql?`, and it's common (if not best practice) for `#eql? / #==` to start with `other.class == self.class`.

That particularly make sense when you have some kinds of value objects for which the class is an important metadata, e.g. trees of nodes etc.

----------------------------------------
Feature #18611: Promote best practice for combining multiple values into a hash code
https://bugs.ruby-lang.org/issues/18611#change-97129

* Author: chrisseaton (Chris Seaton)
* Status: Open
* Priority: Normal
----------------------------------------
User-defined hash methods often work by combining the hash code of several values into one. This requires some logic to combine the values. In our experience, users are making a variety of choices for the algorithm for this, and in many cases are picking an option which may not be the best for security and performance in multiple ways. It's also a shame that users are having to think about how to do this basic operation themselves.

For example, this hash method creates a single hash code by combining the hash value of three values that make up the object. The user has combined the values using the xor operator and multiplication by prime numbers to distribute bits. This is an ok way to combine multiple values into a hash code.

```ruby
def hash
  x.hash ^ (y.hash * 3) ^ (z.hash * 5)
end
```

But users have to know to do this, and they sometimes get it wrong, writing things like this.

```ruby
def hash
  x.hash ^ y.hash ^ z.hash
end
```

This isn't distributing the bits very well. A bad hash code may harm performance if it cause more collisions in a hash table. Collisions may also cause excess cache eviction, which would further harm performance. If performance is reduced in this way there's a potential security risk due to denial-of-service. (We don't think this is an immediate practical security problem, which is why we're discussing in the open issue tracker, not the security mailing list.)

The `x.hash ^ (y.hash * 3) ^ (z.hash * 5)` pattern is still not ideal, as users have to manually write it, and it's a lot of logic to execute in the Ruby interpreter, when it could be possibly be done in native code instead. A better pattern we think is this:

```ruby
def hash
  [x, y, z].hash
end
```

This leaves the logic of creating a suitable and safe hash value to `[...].hash`, which does it correctly.

Why doesn't everyone already use this pattern? Because it's not documented as the right thing to do. We want to present a couple of options for what could be done to encourage people to use this pattern or an equivalent, to help people write more concise and clear code that is also more performant and secure.

# Document `[...].hash` as best practice and optimise it

If we want people to use `[...].hash`, we should say that in the documentation for `Kernel#hash` as the best practice. Wording along the lines of

> If you're implementing a hash code for a compound set of values, best practice is to combine them with `[...].hash`. For example....

This way people reading the documentation on `Kernel#hash` get pointed in the clear, concise, performant, secure direction.

We can combine this recommendation with an optimisation to `[...].hash` to remove the array allocation in implementation of Ruby without escape analysis and scalar replacement, similar to what is done for `Array#min` and `#max`. This way the best practice is even faster.

# Introduce a new similar method, but specifically for the purpose so it is discoverable

A second option is to introduce a new method, specifically for this task, `hash_objects(...)`. This is inspired By Java's `Objects.hash(...)`. The reason for the new method is that it should make it more discoverable - if you go looking for a tool to combine hash values you'd find one. We'd still link to it from `Kernel#hash`. This method would not require the array allocation removal optimisation, as it's just a simple call.

# Examples of hash methods

Even the MRI codebase has some suboptimal hash methods we don't need to look very far for examples. For example lib `lib/resolv.rb`, these two hash methods don't distribute the bits they combine

* https://github.com/ruby/ruby/blob/c445963575a8572f6b0baf7135093c128adab3b9/lib/resolv.rb#L1734
* https://github.com/ruby/ruby/blob/c445963575a8572f6b0baf7135093c128adab3b9/lib/resolv.rb#L1307

Both these examples could be replaced with either of our proposals.

A good example of someone already using best practice is this.

* https://github.com/ruby/ruby/blob/128972189284f4338722e8a910d0b4f6e7a02b31/lib/bundler/source/git.rb#L50

But this would still be faster with the optimisation we proposed, or using `hash_objects(...)`, as that'd remove the array allocation and the `hash` call.

# Other things we've already done

We've proposed a RuboCop cop to try to catch the pattern we think is suboptimal https://github.com/rubocop/rubocop/pull/10441.

Co-authored with @sambostock.




-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [ruby-core:108268] [Ruby master Feature#18611] Promote best practice for combining multiple values into a hash code
  2022-03-07 17:57 [ruby-core:107784] [Ruby master Feature#18611] Promote best practice for combining multiple values into a hash code chrisseaton (Chris Seaton)
                   ` (3 preceding siblings ...)
  2022-04-01 11:37 ` [ruby-core:108154] " byroot (Jean Boussier)
@ 2022-04-16 15:42 ` Eregon (Benoit Daloze)
  2022-04-16 15:45 ` [ruby-core:108269] " chrisseaton (Chris Seaton)
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Eregon (Benoit Daloze) @ 2022-04-16 15:42 UTC (permalink / raw
  To: ruby-core

Issue #18611 has been updated by Eregon (Benoit Daloze).


I think an `eql?` implementation should either not check the class, and assume it will be used a key with objects of the same class, or do `return false unless other.is_a?(MyClass)`.
Another way to see this is why the `other.class == self.class` is ignoring the singleton class? Because it does not matter in most cases.
So if one defines `eql?` on a non-leaf class (or a class which has at least 1 subclass), I think they need to ask themselves do I want instances of different subclasses to be eql? or not.
I could imagine yes, for instance if a subclass is used to optimize some methods for a given well-known instance.

----------------------------------------
Feature #18611: Promote best practice for combining multiple values into a hash code
https://bugs.ruby-lang.org/issues/18611#change-97291

* Author: chrisseaton (Chris Seaton)
* Status: Open
* Priority: Normal
----------------------------------------
User-defined hash methods often work by combining the hash code of several values into one. This requires some logic to combine the values. In our experience, users are making a variety of choices for the algorithm for this, and in many cases are picking an option which may not be the best for security and performance in multiple ways. It's also a shame that users are having to think about how to do this basic operation themselves.

For example, this hash method creates a single hash code by combining the hash value of three values that make up the object. The user has combined the values using the xor operator and multiplication by prime numbers to distribute bits. This is an ok way to combine multiple values into a hash code.

```ruby
def hash
  x.hash ^ (y.hash * 3) ^ (z.hash * 5)
end
```

But users have to know to do this, and they sometimes get it wrong, writing things like this.

```ruby
def hash
  x.hash ^ y.hash ^ z.hash
end
```

This isn't distributing the bits very well. A bad hash code may harm performance if it cause more collisions in a hash table. Collisions may also cause excess cache eviction, which would further harm performance. If performance is reduced in this way there's a potential security risk due to denial-of-service. (We don't think this is an immediate practical security problem, which is why we're discussing in the open issue tracker, not the security mailing list.)

The `x.hash ^ (y.hash * 3) ^ (z.hash * 5)` pattern is still not ideal, as users have to manually write it, and it's a lot of logic to execute in the Ruby interpreter, when it could be possibly be done in native code instead. A better pattern we think is this:

```ruby
def hash
  [x, y, z].hash
end
```

This leaves the logic of creating a suitable and safe hash value to `[...].hash`, which does it correctly.

Why doesn't everyone already use this pattern? Because it's not documented as the right thing to do. We want to present a couple of options for what could be done to encourage people to use this pattern or an equivalent, to help people write more concise and clear code that is also more performant and secure.

# Document `[...].hash` as best practice and optimise it

If we want people to use `[...].hash`, we should say that in the documentation for `Kernel#hash` as the best practice. Wording along the lines of

> If you're implementing a hash code for a compound set of values, best practice is to combine them with `[...].hash`. For example....

This way people reading the documentation on `Kernel#hash` get pointed in the clear, concise, performant, secure direction.

We can combine this recommendation with an optimisation to `[...].hash` to remove the array allocation in implementation of Ruby without escape analysis and scalar replacement, similar to what is done for `Array#min` and `#max`. This way the best practice is even faster.

# Introduce a new similar method, but specifically for the purpose so it is discoverable

A second option is to introduce a new method, specifically for this task, `hash_objects(...)`. This is inspired By Java's `Objects.hash(...)`. The reason for the new method is that it should make it more discoverable - if you go looking for a tool to combine hash values you'd find one. We'd still link to it from `Kernel#hash`. This method would not require the array allocation removal optimisation, as it's just a simple call.

# Examples of hash methods

Even the MRI codebase has some suboptimal hash methods we don't need to look very far for examples. For example lib `lib/resolv.rb`, these two hash methods don't distribute the bits they combine

* https://github.com/ruby/ruby/blob/c445963575a8572f6b0baf7135093c128adab3b9/lib/resolv.rb#L1734
* https://github.com/ruby/ruby/blob/c445963575a8572f6b0baf7135093c128adab3b9/lib/resolv.rb#L1307

Both these examples could be replaced with either of our proposals.

A good example of someone already using best practice is this.

* https://github.com/ruby/ruby/blob/128972189284f4338722e8a910d0b4f6e7a02b31/lib/bundler/source/git.rb#L50

But this would still be faster with the optimisation we proposed, or using `hash_objects(...)`, as that'd remove the array allocation and the `hash` call.

# Other things we've already done

We've proposed a RuboCop cop to try to catch the pattern we think is suboptimal https://github.com/rubocop/rubocop/pull/10441.

Co-authored with @sambostock.




-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [ruby-core:108269] [Ruby master Feature#18611] Promote best practice for combining multiple values into a hash code
  2022-03-07 17:57 [ruby-core:107784] [Ruby master Feature#18611] Promote best practice for combining multiple values into a hash code chrisseaton (Chris Seaton)
                   ` (4 preceding siblings ...)
  2022-04-16 15:42 ` [ruby-core:108268] " Eregon (Benoit Daloze)
@ 2022-04-16 15:45 ` chrisseaton (Chris Seaton)
  2022-04-16 17:02 ` [ruby-core:108270] " Eregon (Benoit Daloze)
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: chrisseaton (Chris Seaton) @ 2022-04-16 15:45 UTC (permalink / raw
  To: ruby-core

Issue #18611 has been updated by chrisseaton (Chris Seaton).


Implemented in https://github.com/ruby/ruby/pull/5805

----------------------------------------
Feature #18611: Promote best practice for combining multiple values into a hash code
https://bugs.ruby-lang.org/issues/18611#change-97292

* Author: chrisseaton (Chris Seaton)
* Status: Open
* Priority: Normal
----------------------------------------
User-defined hash methods often work by combining the hash code of several values into one. This requires some logic to combine the values. In our experience, users are making a variety of choices for the algorithm for this, and in many cases are picking an option which may not be the best for security and performance in multiple ways. It's also a shame that users are having to think about how to do this basic operation themselves.

For example, this hash method creates a single hash code by combining the hash value of three values that make up the object. The user has combined the values using the xor operator and multiplication by prime numbers to distribute bits. This is an ok way to combine multiple values into a hash code.

```ruby
def hash
  x.hash ^ (y.hash * 3) ^ (z.hash * 5)
end
```

But users have to know to do this, and they sometimes get it wrong, writing things like this.

```ruby
def hash
  x.hash ^ y.hash ^ z.hash
end
```

This isn't distributing the bits very well. A bad hash code may harm performance if it cause more collisions in a hash table. Collisions may also cause excess cache eviction, which would further harm performance. If performance is reduced in this way there's a potential security risk due to denial-of-service. (We don't think this is an immediate practical security problem, which is why we're discussing in the open issue tracker, not the security mailing list.)

The `x.hash ^ (y.hash * 3) ^ (z.hash * 5)` pattern is still not ideal, as users have to manually write it, and it's a lot of logic to execute in the Ruby interpreter, when it could be possibly be done in native code instead. A better pattern we think is this:

```ruby
def hash
  [x, y, z].hash
end
```

This leaves the logic of creating a suitable and safe hash value to `[...].hash`, which does it correctly.

Why doesn't everyone already use this pattern? Because it's not documented as the right thing to do. We want to present a couple of options for what could be done to encourage people to use this pattern or an equivalent, to help people write more concise and clear code that is also more performant and secure.

# Document `[...].hash` as best practice and optimise it

If we want people to use `[...].hash`, we should say that in the documentation for `Kernel#hash` as the best practice. Wording along the lines of

> If you're implementing a hash code for a compound set of values, best practice is to combine them with `[...].hash`. For example....

This way people reading the documentation on `Kernel#hash` get pointed in the clear, concise, performant, secure direction.

We can combine this recommendation with an optimisation to `[...].hash` to remove the array allocation in implementation of Ruby without escape analysis and scalar replacement, similar to what is done for `Array#min` and `#max`. This way the best practice is even faster.

# Introduce a new similar method, but specifically for the purpose so it is discoverable

A second option is to introduce a new method, specifically for this task, `hash_objects(...)`. This is inspired By Java's `Objects.hash(...)`. The reason for the new method is that it should make it more discoverable - if you go looking for a tool to combine hash values you'd find one. We'd still link to it from `Kernel#hash`. This method would not require the array allocation removal optimisation, as it's just a simple call.

# Examples of hash methods

Even the MRI codebase has some suboptimal hash methods we don't need to look very far for examples. For example lib `lib/resolv.rb`, these two hash methods don't distribute the bits they combine

* https://github.com/ruby/ruby/blob/c445963575a8572f6b0baf7135093c128adab3b9/lib/resolv.rb#L1734
* https://github.com/ruby/ruby/blob/c445963575a8572f6b0baf7135093c128adab3b9/lib/resolv.rb#L1307

Both these examples could be replaced with either of our proposals.

A good example of someone already using best practice is this.

* https://github.com/ruby/ruby/blob/128972189284f4338722e8a910d0b4f6e7a02b31/lib/bundler/source/git.rb#L50

But this would still be faster with the optimisation we proposed, or using `hash_objects(...)`, as that'd remove the array allocation and the `hash` call.

# Other things we've already done

We've proposed a RuboCop cop to try to catch the pattern we think is suboptimal https://github.com/rubocop/rubocop/pull/10441.

Co-authored with @sambostock.




-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [ruby-core:108270] [Ruby master Feature#18611] Promote best practice for combining multiple values into a hash code
  2022-03-07 17:57 [ruby-core:107784] [Ruby master Feature#18611] Promote best practice for combining multiple values into a hash code chrisseaton (Chris Seaton)
                   ` (5 preceding siblings ...)
  2022-04-16 15:45 ` [ruby-core:108269] " chrisseaton (Chris Seaton)
@ 2022-04-16 17:02 ` Eregon (Benoit Daloze)
  2022-04-29 18:55 ` [ruby-core:108447] " tenderlovemaking (Aaron Patterson)
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Eregon (Benoit Daloze) @ 2022-04-16 17:02 UTC (permalink / raw
  To: ruby-core

Issue #18611 has been updated by Eregon (Benoit Daloze).


Regarding my last reply, I changed my mind, including the class is useful to avoid unintentional hash conflicts between unrelated classes for a Hash which contains keys of various types, ,ore details here: https://github.com/ruby/ruby/pull/5805#discussion_r851650129

----------------------------------------
Feature #18611: Promote best practice for combining multiple values into a hash code
https://bugs.ruby-lang.org/issues/18611#change-97293

* Author: chrisseaton (Chris Seaton)
* Status: Open
* Priority: Normal
----------------------------------------
User-defined hash methods often work by combining the hash code of several values into one. This requires some logic to combine the values. In our experience, users are making a variety of choices for the algorithm for this, and in many cases are picking an option which may not be the best for security and performance in multiple ways. It's also a shame that users are having to think about how to do this basic operation themselves.

For example, this hash method creates a single hash code by combining the hash value of three values that make up the object. The user has combined the values using the xor operator and multiplication by prime numbers to distribute bits. This is an ok way to combine multiple values into a hash code.

```ruby
def hash
  x.hash ^ (y.hash * 3) ^ (z.hash * 5)
end
```

But users have to know to do this, and they sometimes get it wrong, writing things like this.

```ruby
def hash
  x.hash ^ y.hash ^ z.hash
end
```

This isn't distributing the bits very well. A bad hash code may harm performance if it cause more collisions in a hash table. Collisions may also cause excess cache eviction, which would further harm performance. If performance is reduced in this way there's a potential security risk due to denial-of-service. (We don't think this is an immediate practical security problem, which is why we're discussing in the open issue tracker, not the security mailing list.)

The `x.hash ^ (y.hash * 3) ^ (z.hash * 5)` pattern is still not ideal, as users have to manually write it, and it's a lot of logic to execute in the Ruby interpreter, when it could be possibly be done in native code instead. A better pattern we think is this:

```ruby
def hash
  [x, y, z].hash
end
```

This leaves the logic of creating a suitable and safe hash value to `[...].hash`, which does it correctly.

Why doesn't everyone already use this pattern? Because it's not documented as the right thing to do. We want to present a couple of options for what could be done to encourage people to use this pattern or an equivalent, to help people write more concise and clear code that is also more performant and secure.

# Document `[...].hash` as best practice and optimise it

If we want people to use `[...].hash`, we should say that in the documentation for `Kernel#hash` as the best practice. Wording along the lines of

> If you're implementing a hash code for a compound set of values, best practice is to combine them with `[...].hash`. For example....

This way people reading the documentation on `Kernel#hash` get pointed in the clear, concise, performant, secure direction.

We can combine this recommendation with an optimisation to `[...].hash` to remove the array allocation in implementation of Ruby without escape analysis and scalar replacement, similar to what is done for `Array#min` and `#max`. This way the best practice is even faster.

# Introduce a new similar method, but specifically for the purpose so it is discoverable

A second option is to introduce a new method, specifically for this task, `hash_objects(...)`. This is inspired By Java's `Objects.hash(...)`. The reason for the new method is that it should make it more discoverable - if you go looking for a tool to combine hash values you'd find one. We'd still link to it from `Kernel#hash`. This method would not require the array allocation removal optimisation, as it's just a simple call.

# Examples of hash methods

Even the MRI codebase has some suboptimal hash methods we don't need to look very far for examples. For example lib `lib/resolv.rb`, these two hash methods don't distribute the bits they combine

* https://github.com/ruby/ruby/blob/c445963575a8572f6b0baf7135093c128adab3b9/lib/resolv.rb#L1734
* https://github.com/ruby/ruby/blob/c445963575a8572f6b0baf7135093c128adab3b9/lib/resolv.rb#L1307

Both these examples could be replaced with either of our proposals.

A good example of someone already using best practice is this.

* https://github.com/ruby/ruby/blob/128972189284f4338722e8a910d0b4f6e7a02b31/lib/bundler/source/git.rb#L50

But this would still be faster with the optimisation we proposed, or using `hash_objects(...)`, as that'd remove the array allocation and the `hash` call.

# Other things we've already done

We've proposed a RuboCop cop to try to catch the pattern we think is suboptimal https://github.com/rubocop/rubocop/pull/10441.

Co-authored with @sambostock.




-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [ruby-core:108447] [Ruby master Feature#18611] Promote best practice for combining multiple values into a hash code
  2022-03-07 17:57 [ruby-core:107784] [Ruby master Feature#18611] Promote best practice for combining multiple values into a hash code chrisseaton (Chris Seaton)
                   ` (6 preceding siblings ...)
  2022-04-16 17:02 ` [ruby-core:108270] " Eregon (Benoit Daloze)
@ 2022-04-29 18:55 ` tenderlovemaking (Aaron Patterson)
  2022-04-29 19:07 ` [ruby-core:108448] " sambostock (Sam Bostock)
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: tenderlovemaking (Aaron Patterson) @ 2022-04-29 18:55 UTC (permalink / raw
  To: ruby-core

Issue #18611 has been updated by tenderlovemaking (Aaron Patterson).


I think this is a good idea.  [Searching GitHub](https://cs.github.com/?q=%22def+hash%22+language%3ARuby) shows many example of the not-so-great XOR pattern, but also shows many people doing `[a, b, c].hash`.  It makes sense to me that we promote the later and optimize it as well.

----------------------------------------
Feature #18611: Promote best practice for combining multiple values into a hash code
https://bugs.ruby-lang.org/issues/18611#change-97485

* Author: chrisseaton (Chris Seaton)
* Status: Open
* Priority: Normal
----------------------------------------
User-defined hash methods often work by combining the hash code of several values into one. This requires some logic to combine the values. In our experience, users are making a variety of choices for the algorithm for this, and in many cases are picking an option which may not be the best for security and performance in multiple ways. It's also a shame that users are having to think about how to do this basic operation themselves.

For example, this hash method creates a single hash code by combining the hash value of three values that make up the object. The user has combined the values using the xor operator and multiplication by prime numbers to distribute bits. This is an ok way to combine multiple values into a hash code.

```ruby
def hash
  x.hash ^ (y.hash * 3) ^ (z.hash * 5)
end
```

But users have to know to do this, and they sometimes get it wrong, writing things like this.

```ruby
def hash
  x.hash ^ y.hash ^ z.hash
end
```

This isn't distributing the bits very well. A bad hash code may harm performance if it cause more collisions in a hash table. Collisions may also cause excess cache eviction, which would further harm performance. If performance is reduced in this way there's a potential security risk due to denial-of-service. (We don't think this is an immediate practical security problem, which is why we're discussing in the open issue tracker, not the security mailing list.)

The `x.hash ^ (y.hash * 3) ^ (z.hash * 5)` pattern is still not ideal, as users have to manually write it, and it's a lot of logic to execute in the Ruby interpreter, when it could be possibly be done in native code instead. A better pattern we think is this:

```ruby
def hash
  [x, y, z].hash
end
```

This leaves the logic of creating a suitable and safe hash value to `[...].hash`, which does it correctly.

Why doesn't everyone already use this pattern? Because it's not documented as the right thing to do. We want to present a couple of options for what could be done to encourage people to use this pattern or an equivalent, to help people write more concise and clear code that is also more performant and secure.

# Document `[...].hash` as best practice and optimise it

If we want people to use `[...].hash`, we should say that in the documentation for `Kernel#hash` as the best practice. Wording along the lines of

> If you're implementing a hash code for a compound set of values, best practice is to combine them with `[...].hash`. For example....

This way people reading the documentation on `Kernel#hash` get pointed in the clear, concise, performant, secure direction.

We can combine this recommendation with an optimisation to `[...].hash` to remove the array allocation in implementation of Ruby without escape analysis and scalar replacement, similar to what is done for `Array#min` and `#max`. This way the best practice is even faster.

# Introduce a new similar method, but specifically for the purpose so it is discoverable

A second option is to introduce a new method, specifically for this task, `hash_objects(...)`. This is inspired By Java's `Objects.hash(...)`. The reason for the new method is that it should make it more discoverable - if you go looking for a tool to combine hash values you'd find one. We'd still link to it from `Kernel#hash`. This method would not require the array allocation removal optimisation, as it's just a simple call.

# Examples of hash methods

Even the MRI codebase has some suboptimal hash methods we don't need to look very far for examples. For example lib `lib/resolv.rb`, these two hash methods don't distribute the bits they combine

* https://github.com/ruby/ruby/blob/c445963575a8572f6b0baf7135093c128adab3b9/lib/resolv.rb#L1734
* https://github.com/ruby/ruby/blob/c445963575a8572f6b0baf7135093c128adab3b9/lib/resolv.rb#L1307

Both these examples could be replaced with either of our proposals.

A good example of someone already using best practice is this.

* https://github.com/ruby/ruby/blob/128972189284f4338722e8a910d0b4f6e7a02b31/lib/bundler/source/git.rb#L50

But this would still be faster with the optimisation we proposed, or using `hash_objects(...)`, as that'd remove the array allocation and the `hash` call.

# Other things we've already done

We've proposed a RuboCop cop to try to catch the pattern we think is suboptimal https://github.com/rubocop/rubocop/pull/10441.

Co-authored with @sambostock.




-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [ruby-core:108448] [Ruby master Feature#18611] Promote best practice for combining multiple values into a hash code
  2022-03-07 17:57 [ruby-core:107784] [Ruby master Feature#18611] Promote best practice for combining multiple values into a hash code chrisseaton (Chris Seaton)
                   ` (7 preceding siblings ...)
  2022-04-29 18:55 ` [ruby-core:108447] " tenderlovemaking (Aaron Patterson)
@ 2022-04-29 19:07 ` sambostock (Sam Bostock)
  2022-04-30 10:58 ` [ruby-core:108452] " Eregon (Benoit Daloze)
  2022-05-10  6:24 ` [ruby-core:108498] " ioquatix (Samuel Williams)
  10 siblings, 0 replies; 12+ messages in thread
From: sambostock (Sam Bostock) @ 2022-04-29 19:07 UTC (permalink / raw
  To: ruby-core

Issue #18611 has been updated by sambostock (Sam Bostock).


I found some _interesting_ implementations while working on the Rubocop Cop, including some in Ruby itself. For example:

- [`rgeo/rgeo`](https://github.com/rubocop/rubocop/pull/10441/files#r835693586)
- [Ruby's `IPAddr#hash`](https://github.com/rubocop/rubocop/pull/10441/files#r835694614)

In related news, the [cop](https://docs.rubocop.org/rubocop/cops_security.html#securitycompoundhash) has been merged into Rubocop 1.28, although it is worth noting it is not enabled by default.

----------------------------------------
Feature #18611: Promote best practice for combining multiple values into a hash code
https://bugs.ruby-lang.org/issues/18611#change-97486

* Author: chrisseaton (Chris Seaton)
* Status: Open
* Priority: Normal
----------------------------------------
User-defined hash methods often work by combining the hash code of several values into one. This requires some logic to combine the values. In our experience, users are making a variety of choices for the algorithm for this, and in many cases are picking an option which may not be the best for security and performance in multiple ways. It's also a shame that users are having to think about how to do this basic operation themselves.

For example, this hash method creates a single hash code by combining the hash value of three values that make up the object. The user has combined the values using the xor operator and multiplication by prime numbers to distribute bits. This is an ok way to combine multiple values into a hash code.

```ruby
def hash
  x.hash ^ (y.hash * 3) ^ (z.hash * 5)
end
```

But users have to know to do this, and they sometimes get it wrong, writing things like this.

```ruby
def hash
  x.hash ^ y.hash ^ z.hash
end
```

This isn't distributing the bits very well. A bad hash code may harm performance if it cause more collisions in a hash table. Collisions may also cause excess cache eviction, which would further harm performance. If performance is reduced in this way there's a potential security risk due to denial-of-service. (We don't think this is an immediate practical security problem, which is why we're discussing in the open issue tracker, not the security mailing list.)

The `x.hash ^ (y.hash * 3) ^ (z.hash * 5)` pattern is still not ideal, as users have to manually write it, and it's a lot of logic to execute in the Ruby interpreter, when it could be possibly be done in native code instead. A better pattern we think is this:

```ruby
def hash
  [x, y, z].hash
end
```

This leaves the logic of creating a suitable and safe hash value to `[...].hash`, which does it correctly.

Why doesn't everyone already use this pattern? Because it's not documented as the right thing to do. We want to present a couple of options for what could be done to encourage people to use this pattern or an equivalent, to help people write more concise and clear code that is also more performant and secure.

# Document `[...].hash` as best practice and optimise it

If we want people to use `[...].hash`, we should say that in the documentation for `Kernel#hash` as the best practice. Wording along the lines of

> If you're implementing a hash code for a compound set of values, best practice is to combine them with `[...].hash`. For example....

This way people reading the documentation on `Kernel#hash` get pointed in the clear, concise, performant, secure direction.

We can combine this recommendation with an optimisation to `[...].hash` to remove the array allocation in implementation of Ruby without escape analysis and scalar replacement, similar to what is done for `Array#min` and `#max`. This way the best practice is even faster.

# Introduce a new similar method, but specifically for the purpose so it is discoverable

A second option is to introduce a new method, specifically for this task, `hash_objects(...)`. This is inspired By Java's `Objects.hash(...)`. The reason for the new method is that it should make it more discoverable - if you go looking for a tool to combine hash values you'd find one. We'd still link to it from `Kernel#hash`. This method would not require the array allocation removal optimisation, as it's just a simple call.

# Examples of hash methods

Even the MRI codebase has some suboptimal hash methods we don't need to look very far for examples. For example lib `lib/resolv.rb`, these two hash methods don't distribute the bits they combine

* https://github.com/ruby/ruby/blob/c445963575a8572f6b0baf7135093c128adab3b9/lib/resolv.rb#L1734
* https://github.com/ruby/ruby/blob/c445963575a8572f6b0baf7135093c128adab3b9/lib/resolv.rb#L1307

Both these examples could be replaced with either of our proposals.

A good example of someone already using best practice is this.

* https://github.com/ruby/ruby/blob/128972189284f4338722e8a910d0b4f6e7a02b31/lib/bundler/source/git.rb#L50

But this would still be faster with the optimisation we proposed, or using `hash_objects(...)`, as that'd remove the array allocation and the `hash` call.

# Other things we've already done

We've proposed a RuboCop cop to try to catch the pattern we think is suboptimal https://github.com/rubocop/rubocop/pull/10441.

Co-authored with @sambostock.




-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [ruby-core:108452] [Ruby master Feature#18611] Promote best practice for combining multiple values into a hash code
  2022-03-07 17:57 [ruby-core:107784] [Ruby master Feature#18611] Promote best practice for combining multiple values into a hash code chrisseaton (Chris Seaton)
                   ` (8 preceding siblings ...)
  2022-04-29 19:07 ` [ruby-core:108448] " sambostock (Sam Bostock)
@ 2022-04-30 10:58 ` Eregon (Benoit Daloze)
  2022-05-10  6:24 ` [ruby-core:108498] " ioquatix (Samuel Williams)
  10 siblings, 0 replies; 12+ messages in thread
From: Eregon (Benoit Daloze) @ 2022-04-30 10:58 UTC (permalink / raw
  To: ruby-core

Issue #18611 has been updated by Eregon (Benoit Daloze).

Status changed from Open to Closed

Merged.

----------------------------------------
Feature #18611: Promote best practice for combining multiple values into a hash code
https://bugs.ruby-lang.org/issues/18611#change-97490

* Author: chrisseaton (Chris Seaton)
* Status: Closed
* Priority: Normal
----------------------------------------
User-defined hash methods often work by combining the hash code of several values into one. This requires some logic to combine the values. In our experience, users are making a variety of choices for the algorithm for this, and in many cases are picking an option which may not be the best for security and performance in multiple ways. It's also a shame that users are having to think about how to do this basic operation themselves.

For example, this hash method creates a single hash code by combining the hash value of three values that make up the object. The user has combined the values using the xor operator and multiplication by prime numbers to distribute bits. This is an ok way to combine multiple values into a hash code.

```ruby
def hash
  x.hash ^ (y.hash * 3) ^ (z.hash * 5)
end
```

But users have to know to do this, and they sometimes get it wrong, writing things like this.

```ruby
def hash
  x.hash ^ y.hash ^ z.hash
end
```

This isn't distributing the bits very well. A bad hash code may harm performance if it cause more collisions in a hash table. Collisions may also cause excess cache eviction, which would further harm performance. If performance is reduced in this way there's a potential security risk due to denial-of-service. (We don't think this is an immediate practical security problem, which is why we're discussing in the open issue tracker, not the security mailing list.)

The `x.hash ^ (y.hash * 3) ^ (z.hash * 5)` pattern is still not ideal, as users have to manually write it, and it's a lot of logic to execute in the Ruby interpreter, when it could be possibly be done in native code instead. A better pattern we think is this:

```ruby
def hash
  [x, y, z].hash
end
```

This leaves the logic of creating a suitable and safe hash value to `[...].hash`, which does it correctly.

Why doesn't everyone already use this pattern? Because it's not documented as the right thing to do. We want to present a couple of options for what could be done to encourage people to use this pattern or an equivalent, to help people write more concise and clear code that is also more performant and secure.

# Document `[...].hash` as best practice and optimise it

If we want people to use `[...].hash`, we should say that in the documentation for `Kernel#hash` as the best practice. Wording along the lines of

> If you're implementing a hash code for a compound set of values, best practice is to combine them with `[...].hash`. For example....

This way people reading the documentation on `Kernel#hash` get pointed in the clear, concise, performant, secure direction.

We can combine this recommendation with an optimisation to `[...].hash` to remove the array allocation in implementation of Ruby without escape analysis and scalar replacement, similar to what is done for `Array#min` and `#max`. This way the best practice is even faster.

# Introduce a new similar method, but specifically for the purpose so it is discoverable

A second option is to introduce a new method, specifically for this task, `hash_objects(...)`. This is inspired By Java's `Objects.hash(...)`. The reason for the new method is that it should make it more discoverable - if you go looking for a tool to combine hash values you'd find one. We'd still link to it from `Kernel#hash`. This method would not require the array allocation removal optimisation, as it's just a simple call.

# Examples of hash methods

Even the MRI codebase has some suboptimal hash methods we don't need to look very far for examples. For example lib `lib/resolv.rb`, these two hash methods don't distribute the bits they combine

* https://github.com/ruby/ruby/blob/c445963575a8572f6b0baf7135093c128adab3b9/lib/resolv.rb#L1734
* https://github.com/ruby/ruby/blob/c445963575a8572f6b0baf7135093c128adab3b9/lib/resolv.rb#L1307

Both these examples could be replaced with either of our proposals.

A good example of someone already using best practice is this.

* https://github.com/ruby/ruby/blob/128972189284f4338722e8a910d0b4f6e7a02b31/lib/bundler/source/git.rb#L50

But this would still be faster with the optimisation we proposed, or using `hash_objects(...)`, as that'd remove the array allocation and the `hash` call.

# Other things we've already done

We've proposed a RuboCop cop to try to catch the pattern we think is suboptimal https://github.com/rubocop/rubocop/pull/10441.

Co-authored with @sambostock.




-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [ruby-core:108498] [Ruby master Feature#18611] Promote best practice for combining multiple values into a hash code
  2022-03-07 17:57 [ruby-core:107784] [Ruby master Feature#18611] Promote best practice for combining multiple values into a hash code chrisseaton (Chris Seaton)
                   ` (9 preceding siblings ...)
  2022-04-30 10:58 ` [ruby-core:108452] " Eregon (Benoit Daloze)
@ 2022-05-10  6:24 ` ioquatix (Samuel Williams)
  10 siblings, 0 replies; 12+ messages in thread
From: ioquatix (Samuel Williams) @ 2022-05-10  6:24 UTC (permalink / raw
  To: ruby-core

Issue #18611 has been updated by ioquatix (Samuel Williams).


This makes me happy, thanks everyone!

----------------------------------------
Feature #18611: Promote best practice for combining multiple values into a hash code
https://bugs.ruby-lang.org/issues/18611#change-97543

* Author: chrisseaton (Chris Seaton)
* Status: Closed
* Priority: Normal
----------------------------------------
User-defined hash methods often work by combining the hash code of several values into one. This requires some logic to combine the values. In our experience, users are making a variety of choices for the algorithm for this, and in many cases are picking an option which may not be the best for security and performance in multiple ways. It's also a shame that users are having to think about how to do this basic operation themselves.

For example, this hash method creates a single hash code by combining the hash value of three values that make up the object. The user has combined the values using the xor operator and multiplication by prime numbers to distribute bits. This is an ok way to combine multiple values into a hash code.

```ruby
def hash
  x.hash ^ (y.hash * 3) ^ (z.hash * 5)
end
```

But users have to know to do this, and they sometimes get it wrong, writing things like this.

```ruby
def hash
  x.hash ^ y.hash ^ z.hash
end
```

This isn't distributing the bits very well. A bad hash code may harm performance if it cause more collisions in a hash table. Collisions may also cause excess cache eviction, which would further harm performance. If performance is reduced in this way there's a potential security risk due to denial-of-service. (We don't think this is an immediate practical security problem, which is why we're discussing in the open issue tracker, not the security mailing list.)

The `x.hash ^ (y.hash * 3) ^ (z.hash * 5)` pattern is still not ideal, as users have to manually write it, and it's a lot of logic to execute in the Ruby interpreter, when it could be possibly be done in native code instead. A better pattern we think is this:

```ruby
def hash
  [x, y, z].hash
end
```

This leaves the logic of creating a suitable and safe hash value to `[...].hash`, which does it correctly.

Why doesn't everyone already use this pattern? Because it's not documented as the right thing to do. We want to present a couple of options for what could be done to encourage people to use this pattern or an equivalent, to help people write more concise and clear code that is also more performant and secure.

# Document `[...].hash` as best practice and optimise it

If we want people to use `[...].hash`, we should say that in the documentation for `Kernel#hash` as the best practice. Wording along the lines of

> If you're implementing a hash code for a compound set of values, best practice is to combine them with `[...].hash`. For example....

This way people reading the documentation on `Kernel#hash` get pointed in the clear, concise, performant, secure direction.

We can combine this recommendation with an optimisation to `[...].hash` to remove the array allocation in implementation of Ruby without escape analysis and scalar replacement, similar to what is done for `Array#min` and `#max`. This way the best practice is even faster.

# Introduce a new similar method, but specifically for the purpose so it is discoverable

A second option is to introduce a new method, specifically for this task, `hash_objects(...)`. This is inspired By Java's `Objects.hash(...)`. The reason for the new method is that it should make it more discoverable - if you go looking for a tool to combine hash values you'd find one. We'd still link to it from `Kernel#hash`. This method would not require the array allocation removal optimisation, as it's just a simple call.

# Examples of hash methods

Even the MRI codebase has some suboptimal hash methods we don't need to look very far for examples. For example lib `lib/resolv.rb`, these two hash methods don't distribute the bits they combine

* https://github.com/ruby/ruby/blob/c445963575a8572f6b0baf7135093c128adab3b9/lib/resolv.rb#L1734
* https://github.com/ruby/ruby/blob/c445963575a8572f6b0baf7135093c128adab3b9/lib/resolv.rb#L1307

Both these examples could be replaced with either of our proposals.

A good example of someone already using best practice is this.

* https://github.com/ruby/ruby/blob/128972189284f4338722e8a910d0b4f6e7a02b31/lib/bundler/source/git.rb#L50

But this would still be faster with the optimisation we proposed, or using `hash_objects(...)`, as that'd remove the array allocation and the `hash` call.

# Other things we've already done

We've proposed a RuboCop cop to try to catch the pattern we think is suboptimal https://github.com/rubocop/rubocop/pull/10441.

Co-authored with @sambostock.




-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2022-05-10  6:24 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-03-07 17:57 [ruby-core:107784] [Ruby master Feature#18611] Promote best practice for combining multiple values into a hash code chrisseaton (Chris Seaton)
2022-03-07 19:15 ` [ruby-core:107785] " Dan0042 (Daniel DeLorme)
2022-03-07 19:32 ` [ruby-core:107786] " chrisseaton (Chris Seaton)
2022-03-08 11:19 ` [ruby-core:107793] " Eregon (Benoit Daloze)
2022-04-01 11:37 ` [ruby-core:108154] " byroot (Jean Boussier)
2022-04-16 15:42 ` [ruby-core:108268] " Eregon (Benoit Daloze)
2022-04-16 15:45 ` [ruby-core:108269] " chrisseaton (Chris Seaton)
2022-04-16 17:02 ` [ruby-core:108270] " Eregon (Benoit Daloze)
2022-04-29 18:55 ` [ruby-core:108447] " tenderlovemaking (Aaron Patterson)
2022-04-29 19:07 ` [ruby-core:108448] " sambostock (Sam Bostock)
2022-04-30 10:58 ` [ruby-core:108452] " Eregon (Benoit Daloze)
2022-05-10  6:24 ` [ruby-core:108498] " ioquatix (Samuel Williams)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).