ruby-core@ruby-lang.org archive (unofficial mirror)
 help / color / mirror / Atom feed
* [ruby-core:98292] [Ruby master Bug#16851] Ruby hashing algorithm could be improved using Tabulation Hashing
@ 2020-05-12 17:55 anamma06
  2020-05-12 18:19 ` [ruby-core:98294] [Ruby master Feature#16851] " jean.boussier
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: anamma06 @ 2020-05-12 17:55 UTC (permalink / raw)
  To: ruby-core

Issue #16851 has been reported by ana06 (Ana Maria Martinez Gomez).

----------------------------------------
Bug #16851: Ruby hashing algorithm could be improved using Tabulation Hashing
https://bugs.ruby-lang.org/issues/16851

* Author: ana06 (Ana Maria Martinez Gomez)
* Status: Open
* Priority: Normal
* ruby -v: master
* Backport: 2.5: UNKNOWN, 2.6: UNKNOWN, 2.7: UNKNOWN
----------------------------------------
I have implemented Linear Probing and Simple tabulation in Ruby: https://github.com/Ana06/ruby/compare/master...Ana06:tabulation_project

I tested it using the following code:
https://github.com/Ana06/ruby-tabulation/blob/master/benchmark_tabulation.rb
It benchmarks the insertion of 600000 elements (by hashing their 64 bits ids) in an empty hash 100 times. Below are the times (in seconds) I got executing this code for different versions of the Ruby code:

- master (without Simple Tabulation): 29.68
- master with Linear Probing (without Simple Tabulation): 99.76
- master with Quadratic Probing (without Simple Tabulation): 29.97
- master with Simple Tabulation: 15.76
- master with Linear Probing and Simple Tabulation: 24.23
- **master with Quadratic Probing and Simple Tabulation: 13.73**

`master` refers to ruby 2.8.0dev:
(2020-05-07T16:22:38Z master 7ded8fd) [x86_64-linux].
I tried with 8 x Intel i5-8265U.

Although this is just a prove of concept and not something that could be already use in production, I think it proves the potential of Simple Tabulation to improve current Ruby implementation. It may be worthwhile to explore this idea further. What do you think?

I did this as part of a project for my [Advanced Algorithms course](http://www.cs.columbia.edu/~andoni/advancedS20/index.html). For more details check the project report: 
https://github.com/Ana06/ruby-tabulation/blob/master/latex/RubyTabulation_Project.pdf



-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [ruby-core:98294] [Ruby master Feature#16851] Ruby hashing algorithm could be improved using Tabulation Hashing
  2020-05-12 17:55 [ruby-core:98292] [Ruby master Bug#16851] Ruby hashing algorithm could be improved using Tabulation Hashing anamma06
@ 2020-05-12 18:19 ` jean.boussier
  2020-05-13  1:48 ` [ruby-core:98308] " matz
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: jean.boussier @ 2020-05-12 18:19 UTC (permalink / raw)
  To: ruby-core

Issue #16851 has been updated by byroot (Jean Boussier).


This seems interesting. I'd suggest running the hash related benchmarks included in ruby's repo: https://github.com/ruby/ruby/tree/master/benchmark

 

----------------------------------------
Feature #16851: Ruby hashing algorithm could be improved using Tabulation Hashing
https://bugs.ruby-lang.org/issues/16851#change-85533

* Author: ana06 (Ana Maria Martinez Gomez)
* Status: Open
* Priority: Normal
----------------------------------------
I have implemented Linear Probing and Simple tabulation in Ruby: https://github.com/Ana06/ruby/compare/master...Ana06:tabulation_project

I tested it using the following code:
https://github.com/Ana06/ruby-tabulation/blob/master/benchmark_tabulation.rb
It benchmarks the insertion of 600000 elements (by hashing their 64 bits ids) in an empty hash 100 times. Below are the times (in seconds) I got executing this code for different versions of the Ruby code:

- master (without Simple Tabulation): 29.68
- master with Linear Probing (without Simple Tabulation): 99.76
- master with Quadratic Probing (without Simple Tabulation): 29.97
- master with Simple Tabulation: 15.76
- master with Linear Probing and Simple Tabulation: 24.23
- **master with Quadratic Probing and Simple Tabulation: 13.73**

`master` refers to ruby 2.8.0dev:
(2020-05-07T16:22:38Z master 7ded8fd) [x86_64-linux].
I tried with 8 x Intel i5-8265U.

Although this is just a prove of concept and not something that could be already use in production, I think it proves the potential of Simple Tabulation to improve current Ruby implementation. It may be worthwhile to explore this idea further. What do you think?

I did this as part of a project for my [Advanced Algorithms course](http://www.cs.columbia.edu/~andoni/advancedS20/index.html). For more details check the project report: 
https://github.com/Ana06/ruby-tabulation/blob/master/latex/RubyTabulation_Project.pdf



-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [ruby-core:98308] [Ruby master Feature#16851] Ruby hashing algorithm could be improved using Tabulation Hashing
  2020-05-12 17:55 [ruby-core:98292] [Ruby master Bug#16851] Ruby hashing algorithm could be improved using Tabulation Hashing anamma06
  2020-05-12 18:19 ` [ruby-core:98294] [Ruby master Feature#16851] " jean.boussier
@ 2020-05-13  1:48 ` matz
  2020-05-14 21:00 ` [ruby-core:98364] " anamma06
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: matz @ 2020-05-13  1:48 UTC (permalink / raw)
  To: ruby-core

Issue #16851 has been updated by matz (Yukihiro Matsumoto).


Interesting!! @ana06 Do you need any help?

Matz.


----------------------------------------
Feature #16851: Ruby hashing algorithm could be improved using Tabulation Hashing
https://bugs.ruby-lang.org/issues/16851#change-85547

* Author: ana06 (Ana Maria Martinez Gomez)
* Status: Open
* Priority: Normal
----------------------------------------
I have implemented Linear Probing and Simple tabulation in Ruby: https://github.com/Ana06/ruby/compare/master...Ana06:tabulation_project

I tested it using the following code:
https://github.com/Ana06/ruby-tabulation/blob/master/benchmark_tabulation.rb
It benchmarks the insertion of 600000 elements (by hashing their 64 bits ids) in an empty hash 100 times. Below are the times (in seconds) I got executing this code for different versions of the Ruby code:

- master (without Simple Tabulation): 29.68
- master with Linear Probing (without Simple Tabulation): 99.76
- master with Quadratic Probing (without Simple Tabulation): 29.97
- master with Simple Tabulation: 15.76
- master with Linear Probing and Simple Tabulation: 24.23
- **master with Quadratic Probing and Simple Tabulation: 13.73**

`master` refers to ruby 2.8.0dev:
(2020-05-07T16:22:38Z master 7ded8fd) [x86_64-linux].
I tried with 8 x Intel i5-8265U.

Although this is just a prove of concept and not something that could be already use in production, I think it proves the potential of Simple Tabulation to improve current Ruby implementation. It may be worthwhile to explore this idea further. What do you think?

I did this as part of a project for my [Advanced Algorithms course](http://www.cs.columbia.edu/~andoni/advancedS20/index.html). For more details check the project report: 
https://github.com/Ana06/ruby-tabulation/blob/master/latex/RubyTabulation_Project.pdf



-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [ruby-core:98364] [Ruby master Feature#16851] Ruby hashing algorithm could be improved using Tabulation Hashing
  2020-05-12 17:55 [ruby-core:98292] [Ruby master Bug#16851] Ruby hashing algorithm could be improved using Tabulation Hashing anamma06
  2020-05-12 18:19 ` [ruby-core:98294] [Ruby master Feature#16851] " jean.boussier
  2020-05-13  1:48 ` [ruby-core:98308] " matz
@ 2020-05-14 21:00 ` anamma06
  2020-10-22  6:12 ` [ruby-core:100494] " mame
  2020-10-22  9:18 ` [ruby-core:100497] " mame
  4 siblings, 0 replies; 6+ messages in thread
From: anamma06 @ 2020-05-14 21:00 UTC (permalink / raw)
  To: ruby-core

Issue #16851 has been updated by ana06 (Ana Maria Martinez Gomez).


@byroot

> I'd suggest running the hash related benchmarks included in ruby's repo: https://github.com/ruby/ruby/tree/master/benchmark

For most of the benchmarks, this changes doesn't make much of a different because they are either too small hashes or use strings or symbols as keys and this is only implemented for the general keys (objects, integers, etc.). Another important detail is that in the way that it is implemented right now the filling of the random matrix may be being benchmaked (so I think that the result of a finished implementation may be even better).

It does make a difference in the `bighash` benchmark. I have increased the number of elements inserted in the hash a bit to make the difference clearer. I have also added an extra benchmark, equivalent to this one but with objects instead of numbers as keys: https://github.com/ruby/ruby/commit/99e0c33655dec45a9b36850b6fe2cb36ef5f4d42

Those are the results for those two benchmarks with Simple tabulation and Quadratic Probing [ https://github.com/Ana06/ruby/tree/tabulation ]: 

```
compare-ruby: ruby 2.8.0dev (2020-05-07T16:22:38Z master 7ded8fd29a) [x86_64-linux]
built-ruby: https://github.com/Ana06/ruby/tree/tabulation

Calculating -------------------------------------·
                     compare-ruby  built-ruby 
         bighash_obj        0.079       0.099 i/s -       1.000 times in 12.591126s 10.078022s
             bighash        0.394       0.454 i/s -       1.000 times in 2.541103s 2.204265s

Comparison:
                      bighash_obj
          built-ruby:         0.1 i/s 
        compare-ruby:         0.1 i/s - 1.25x  slower

                          bighash
          built-ruby:         0.5 i/s 
        compare-ruby:         0.4 i/s - 1.15x  slower
```

Those are the results for those two benchmarks with Simple tabulation and the current hash implementation (without Quadratic or Linear Probing) [ https://github.com/Ana06/ruby/tree/tabulation without [95f367a](https://github.com/ruby/ruby/commit/95f367a02bdc6b03eaf5ab9ac6c13821e802dbff) ]:

```
compare-ruby: ruby 2.8.0dev (2020-05-07T16:22:38Z master 7ded8fd29a) [x86_64-linux]
built-ruby: https://github.com/Ana06/ruby/tree/tabulation without 95f367a 

Calculating -------------------------------------
                     compare-ruby  built-ruby 
         bighash_obj        0.081       0.094 i/s -       1.000 times in 12.271814s 10.585687s
             bighash        0.369       0.409 i/s -       1.000 times in 2.711338s 2.445082s

Comparison:
                      bighash_obj
          built-ruby:         0.1 i/s 
        compare-ruby:         0.1 i/s - 1.16x  slower

                          bighash
          built-ruby:         0.4 i/s 
        compare-ruby:         0.4 i/s - 1.11x  slower
```

> @matz
ana06 (Ana Maria Martinez Gomez) Do you need any help?

To finish it so that it can be integrated in production I do need some help. I am not sure where the good place to put this code is. Concretely:
* The filling of the random matrix shouldn't be done the first time it is needed, but it should be part of the "initialization" of Ruby (or the Hash class). But I don't know which is the best way to do this. Actually, this doesn't necessarily need to be done as I have implemented it. Simple Tabulation needs 64KB of random data. Ideally it could even be a shared memory section with other programs (I don't think we have to do something like this, but just pointing out that there may be an smarter solution).
* I have removed some functionality out by placing the Simple Tabulation code in the `any_hash` function directly. With that it is not allowed to overwrite the hash method for the hashing implementation (which maybe is ok, it is done for other objects already, related: https://bugs.ruby-lang.org/issues/16850). Using arrays as keys is also broken (at least not consistent with the previous behavior because the array id is now used). I would need some direction how to address these things (probably there is a better place to integrate this code).

There would still be some work to do, such as for example implement it for 32 bites (currently only implemented for 64). But this concretely shouldn't be a big deal.

----------------------------------------
Feature #16851: Ruby hashing algorithm could be improved using Tabulation Hashing
https://bugs.ruby-lang.org/issues/16851#change-85631

* Author: ana06 (Ana Maria Martinez Gomez)
* Status: Open
* Priority: Normal
----------------------------------------
I have implemented Linear Probing and Simple tabulation in Ruby: https://github.com/Ana06/ruby/compare/master...Ana06:tabulation

I tested it using the following code:
https://github.com/Ana06/ruby-tabulation/blob/master/benchmark_tabulation.rb
It benchmarks the insertion of 600000 elements (by hashing their 64 bits ids) in an empty hash 100 times. Below are the times (in seconds) I got executing this code for different versions of the Ruby code:

- master (without Simple Tabulation): 29.68
- master with Linear Probing (without Simple Tabulation): 99.76
- master with Quadratic Probing (without Simple Tabulation): 29.97
- master with Simple Tabulation: 15.76
- master with Linear Probing and Simple Tabulation: 24.23
- **master with Quadratic Probing and Simple Tabulation: 13.73**

`master` refers to ruby 2.8.0dev:
(2020-05-07T16:22:38Z master 7ded8fd) [x86_64-linux].
I tried with 8 x Intel i5-8265U.

Although this is just a proof of concept and not something that could be already use in production, I think it proves the potential of Simple Tabulation to improve current Ruby implementation. It may be worthwhile to explore this idea further. What do you think?

I did this as part of a project for my [Advanced Algorithms course](http://www.cs.columbia.edu/~andoni/advancedS20/index.html). For more details check the project report: 
https://github.com/Ana06/ruby-tabulation/blob/master/latex/RubyTabulation_Project.pdf



-- 
https://bugs.ruby-lang.org/

Unsubscribe: <mailto:ruby-core-request@ruby-lang.org?subject=unsubscribe>
<http://lists.ruby-lang.org/cgi-bin/mailman/options/ruby-core>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [ruby-core:100494] [Ruby master Feature#16851] Ruby hashing algorithm could be improved using Tabulation Hashing
  2020-05-12 17:55 [ruby-core:98292] [Ruby master Bug#16851] Ruby hashing algorithm could be improved using Tabulation Hashing anamma06
                   ` (2 preceding siblings ...)
  2020-05-14 21:00 ` [ruby-core:98364] " anamma06
@ 2020-10-22  6:12 ` mame
  2020-10-22  9:18 ` [ruby-core:100497] " mame
  4 siblings, 0 replies; 6+ messages in thread
From: mame @ 2020-10-22  6:12 UTC (permalink / raw)
  To: ruby-core

Issue #16851 has been updated by mame (Yusuke Endoh).


> With that it is not allowed to overwrite the hash method for the hashing implementation (which maybe is ok

It is far from okay... Your patch does not call #hash method at all. `make test` does not pass.

```
class Foo
  def hash
    raise
  end
end

h = { Foo.new => 1 } # raises nothing
```

I'm not familiar with the tabulation technique, but I guess it should use the result of `other_func` instead of `rb_obj_id`.

I'm unsure if the performance measurement is affected by the omission of the call to #hash, but could you re-measure the benchmark after `make check` passed?

----------------------------------------
Feature #16851: Ruby hashing algorithm could be improved using Tabulation Hashing
https://bugs.ruby-lang.org/issues/16851#change-88113

* Author: ana06 (Ana Maria Martinez Gomez)
* Status: Open
* Priority: Normal
----------------------------------------
I have implemented Linear Probing and Simple tabulation in Ruby: https://github.com/Ana06/ruby/compare/master...Ana06:tabulation

I tested it using the following code:
https://github.com/Ana06/ruby-tabulation/blob/master/benchmark_tabulation.rb
It benchmarks the insertion of 600000 elements (by hashing their 64 bits ids) in an empty hash 100 times. Below are the times (in seconds) I got executing this code for different versions of the Ruby code:

- master (without Simple Tabulation): 29.68
- master with Linear Probing (without Simple Tabulation): 99.76
- master with Quadratic Probing (without Simple Tabulation): 29.97
- master with Simple Tabulation: 15.76
- master with Linear Probing and Simple Tabulation: 24.23
- **master with Quadratic Probing and Simple Tabulation: 13.73**

`master` refers to ruby 2.8.0dev:
(2020-05-07T16:22:38Z master 7ded8fd) [x86_64-linux].
I tried with 8 x Intel i5-8265U.

Although this is just a proof of concept and not something that could be already use in production, I think it proves the potential of Simple Tabulation to improve current Ruby implementation. It may be worthwhile to explore this idea further. What do you think?

I did this as part of a project for my [Advanced Algorithms course](http://www.cs.columbia.edu/~andoni/advancedS20/index.html). For more details check the project report: 
https://github.com/Ana06/ruby-tabulation/blob/master/latex/RubyTabulation_Project.pdf



-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [ruby-core:100497] [Ruby master Feature#16851] Ruby hashing algorithm could be improved using Tabulation Hashing
  2020-05-12 17:55 [ruby-core:98292] [Ruby master Bug#16851] Ruby hashing algorithm could be improved using Tabulation Hashing anamma06
                   ` (3 preceding siblings ...)
  2020-10-22  6:12 ` [ruby-core:100494] " mame
@ 2020-10-22  9:18 ` mame
  4 siblings, 0 replies; 6+ messages in thread
From: mame @ 2020-10-22  9:18 UTC (permalink / raw)
  To: ruby-core

Issue #16851 has been updated by mame (Yusuke Endoh).

Status changed from Open to Feedback

By replacing `rb_obj_id` with `other_func`, I cannot see any significant difference between the proposed patch and the current master. I believe that the large part of the performance improvement comes from the wrong omission of `#hash` call. So, I set the ticket status as "Feedback".

Some advices:

* The current hash function does not simply return object_id. Instead, it mixes a random salt to make it difficult for attackers to predict hash values for security reason. You may use [the initialization function](https://github.com/ruby/ruby/blob/d23d5c3130a0944e7e591370cbf8599009318a7c/random.c#L1578-L1585) to fill the random matrix.
* Not only it mixes the salt, but also it scrambles a hash value with some big primes. See the function [mult_and_mix](https://github.com/ruby/ruby/blob/d23d5c3130a0944e7e591370cbf8599009318a7c/hash.c#L256-L286). You may want to try replacing the function with tabulation hashing.
* Before any performance measurement, please make sure it pass the test suite.

----------------------------------------
Feature #16851: Ruby hashing algorithm could be improved using Tabulation Hashing
https://bugs.ruby-lang.org/issues/16851#change-88117

* Author: ana06 (Ana Maria Martinez Gomez)
* Status: Feedback
* Priority: Normal
----------------------------------------
I have implemented Linear Probing and Simple tabulation in Ruby: https://github.com/Ana06/ruby/compare/master...Ana06:tabulation

I tested it using the following code:
https://github.com/Ana06/ruby-tabulation/blob/master/benchmark_tabulation.rb
It benchmarks the insertion of 600000 elements (by hashing their 64 bits ids) in an empty hash 100 times. Below are the times (in seconds) I got executing this code for different versions of the Ruby code:

- master (without Simple Tabulation): 29.68
- master with Linear Probing (without Simple Tabulation): 99.76
- master with Quadratic Probing (without Simple Tabulation): 29.97
- master with Simple Tabulation: 15.76
- master with Linear Probing and Simple Tabulation: 24.23
- **master with Quadratic Probing and Simple Tabulation: 13.73**

`master` refers to ruby 2.8.0dev:
(2020-05-07T16:22:38Z master 7ded8fd) [x86_64-linux].
I tried with 8 x Intel i5-8265U.

Although this is just a proof of concept and not something that could be already use in production, I think it proves the potential of Simple Tabulation to improve current Ruby implementation. It may be worthwhile to explore this idea further. What do you think?

I did this as part of a project for my [Advanced Algorithms course](http://www.cs.columbia.edu/~andoni/advancedS20/index.html). For more details check the project report: 
https://github.com/Ana06/ruby-tabulation/blob/master/latex/RubyTabulation_Project.pdf



-- 
https://bugs.ruby-lang.org/

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2020-10-22  9:18 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-12 17:55 [ruby-core:98292] [Ruby master Bug#16851] Ruby hashing algorithm could be improved using Tabulation Hashing anamma06
2020-05-12 18:19 ` [ruby-core:98294] [Ruby master Feature#16851] " jean.boussier
2020-05-13  1:48 ` [ruby-core:98308] " matz
2020-05-14 21:00 ` [ruby-core:98364] " anamma06
2020-10-22  6:12 ` [ruby-core:100494] " mame
2020-10-22  9:18 ` [ruby-core:100497] " mame

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).