From: "Eregon (Benoit Daloze) via ruby-core" <ruby-core@ml.ruby-lang.org>
To: ruby-core@ml.ruby-lang.org
Cc: "Eregon (Benoit Daloze)" <noreply@ruby-lang.org>
Subject: [ruby-core:116955] [Ruby master Bug#20301] `Set#add?` does two hash look-ups
Date: Mon, 26 Feb 2024 15:42:10 +0000 (UTC) [thread overview]
Message-ID: <redmine.journal-106994.20240226154210.51722@ruby-lang.org> (raw)
In-Reply-To: redmine.issue-20301.20240226024235.51722@ruby-lang.org
Issue #20301 has been updated by Eregon (Benoit Daloze).
@Dan0042 That's related to #17342 then, and also known as `compute_if_absent` in [concurrent-ruby](https://ruby-concurrency.github.io/concurrent-ruby/1.1.5/Concurrent/Map.html#compute_if_absent-instance_method).
----------------------------------------
Bug #20301: `Set#add?` does two hash look-ups
https://bugs.ruby-lang.org/issues/20301#change-106994
* Author: AMomchilov (Alexander Momchilov)
* Status: Open
* Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN
----------------------------------------
A common usage of `Set`s is to keep track of seen objects, and do something different whenever an object is seen for the first time, e.g.:
```ruby
SEEN_VALUES = Set.new
def receive_value(value)
if SEEN_VALUES.add?(value)
puts "Saw #{value} for the first time."
else
puts "Already seen #{value}, ignoring."
end
end
receive_value(1) # Saw 1 for the first time.
receive_value(2) # Saw 2 for the first time.
receive_value(3) # Saw 3 for the first time.
receive_value(1) # Already seen 1, ignoring.
```
Readers might reasonably assume that `add?` is only looking up into the set a single time, but it's actually doing two separate look-ups! ([source](https://github.com/ruby/ruby/blob/c976cb5/lib/set.rb#L517-L525))
```rb
class Set
def add?(o
# 1. `include?(o)` looks up into `@hash`
# 2. if the value isn't there, `add(o)` does a second look-up into `@hash`
add(o) unless include?(o)
end
end
```
This gets especially expensive if the values are large hash/arrays/objects, whose `#hash` is expensive to compute.
We can optimize this if it was possible to set a value in hash, *and* retrieve the value that was already there, in a single go. I propose adding `Hash#update_value` to do exactly that. If that existed, we can re-implement `#add?` as:
```rb
class Set
def add?(o)
# Only requires a single look-up into `@hash`!
self unless @hash.update_value(o, true)
end
```
Here's a proof-of-concept implementation: https://github.com/ruby/ruby/pull/10093
# Theory
How much of a benefit this has depends on 2 factors:
1. How much `#hash` is called, which depends on how many **new** objects are added to the set.
* If every object is new, then `#hash` used to be called twice on every `#add?`.
* This is where this improvement makes the biggest (2x!) change.
* If every object has already been seen, then `#hash` was never being called twice before anyway, so there would be no improvement.
* It's important to not regress in this case, because many use cases of sets don't deal with many distinct objects, but just need to do quick checks against an existing set.
* Every other case lies somewhere in between those two, depending on the % of objects which are new.
2. How slow `#hash` is to compute for the key
* If the hash is slow to compute, this change will make a bigger improvement
* If the hash value is fast to compute, then it won't matter as much. Even if we called it half as much, it's a minority of the total time, so it won't have much net impact.
# Benchmark summary
| | All objects are new | All objects are preexisting |
|---------------------------|-------:|------:|
| objects with slow `#hash` | 100.0% | ~0.0% |
| objects with fast `#hash` | 24.5% | 4.6% |
As we see, this change makes a huge improvement the cases where it helps, and crucially, doesn't slow down the cases where it can't.
For the complete benchmark source code and results, see the PR: https://github.com/ruby/ruby/pull/10093
--
https://bugs.ruby-lang.org/
______________________________________________
ruby-core mailing list -- ruby-core@ml.ruby-lang.org
To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org
ruby-core info -- https://ml.ruby-lang.org/mailman3/postorius/lists/ruby-core.ml.ruby-lang.org/
next prev parent reply other threads:[~2024-02-26 15:42 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-02-26 2:42 [ruby-core:116941] [Ruby master Bug#20301] `Set#add?` does two hash look-ups AMomchilov (Alexander Momchilov) via ruby-core
2024-02-26 15:31 ` [ruby-core:116952] " Dan0042 (Daniel DeLorme) via ruby-core
2024-02-26 15:42 ` Eregon (Benoit Daloze) via ruby-core [this message]
2024-02-26 17:37 ` [ruby-core:116959] " AMomchilov (Alexander Momchilov) via ruby-core
2024-03-14 5:44 ` [ruby-core:117138] " shyouhei (Shyouhei Urabe) via ruby-core
2024-03-14 6:41 ` [ruby-core:117141] " nobu (Nobuyoshi Nakada) via ruby-core
2024-03-14 7:15 ` [ruby-core:117142] " shyouhei (Shyouhei Urabe) via ruby-core
2024-03-14 15:04 ` [ruby-core:117177] " Eregon (Benoit Daloze) via ruby-core
2024-03-15 0:10 ` [ruby-core:117190] " shyouhei (Shyouhei Urabe) via ruby-core
2024-03-15 1:34 ` [ruby-core:117192] " AMomchilov (Alexander Momchilov) via ruby-core
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-list from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: https://www.ruby-lang.org/en/community/mailing-lists/
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=redmine.journal-106994.20240226154210.51722@ruby-lang.org \
--to=ruby-core@ruby-lang.org \
--cc=noreply@ruby-lang.org \
--cc=ruby-core@ml.ruby-lang.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).