[ruby-core:105561] [Ruby master Feature#18239] Variable Width Allocation: Strings


From: "Eregon (Benoit Daloze)" <noreply@ruby-lang.org>
To: ruby-core@ruby-lang.org
Subject: [ruby-core:105561] [Ruby master Feature#18239] Variable Width Allocation: Strings
Date: Tue, 05 Oct 2021 15:10:43 +0000 (UTC)
Message-ID: <redmine.journal-94016.20211005151043.42491@ruby-lang.org>
In-Reply-To: redmine.issue-18239.20211004181344.42491@ruby-lang.org

Issue #18239 has been updated by Eregon (Benoit Daloze).

Regarding ABI compatibility, AFAIK CRuby doesn't care about it for development versions.
I.e., conceptually we can think of every commit potentially changing the ABI but the ABI version number (`RbConfig::CONFIG['ruby_version']`) is never changed for dev versions of CRuby.
So it's indeed currently up to users of CRuby dev versions to basically always wipe out their compiled gems whenever rebuilding a new dev version.

FWIW TruffleRuby actually tracks the ABI of dev versions through [this file](https://github.com/oracle/truffleruby/blob/master/lib/cext/ABI_version.txt) which means it is possible to sensibly cache compiled gems even for dev versions.
ruby/setup-ruby has no choice for CRuby dev but [to use the commit](https://github.com/ruby/setup-ruby/blob/a6f22865941e122a37e097fbded3dd0b54c39207/bundler.js#L188) as the ABI version.
This issue is made worse by Ruby switchers like RVM & chruby setting GEM_HOME (so the ABI is effectively ignored by RubyGems in those cases, and those directories need to be cleaned manually).
When GEM_HOME is not set, it would be enough to rebuild CRuby dev and remove the directory before installing (which includes both CRuby & gems), but Ruby installers don't do that yet.
Bundler always includes the ABI version when setting the bundler path (`bundle config --local path`), but if the ABI version is incorrect like for CRuby dev it's of no use.
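
As a rough illustration of the point above, the ABI version that RubyGems and Bundler rely on can be inspected from Ruby itself. The sketch below also builds a cache key that mixes in `RUBY_REVISION`, similar in spirit to using the commit as the ABI version. The `gem_cache_key` helper and its key format are made up for illustration; this is not what ruby/setup-ruby actually runs.

```ruby
require "rbconfig"
require "rubygems"

# ABI version as reported by this Ruby. For CRuby dev builds this value may
# stay the same even when the real ABI changes between commits.
abi_version = RbConfig::CONFIG["ruby_version"]

# Hypothetical helper: a cache key for compiled gems that also includes the
# revision the interpreter was built from.
def gem_cache_key(abi_version)
  [RUBY_ENGINE, abi_version, RUBY_REVISION].join("-")
end

puts "ABI version:     #{abi_version}"
puts "Default gem dir: #{Gem.default_dir}"
puts "Cache key:       #{gem_cache_key(abi_version)}"
```

With a key like this, compiled gems cached for one dev build would not be reused by another build from a different commit, at the cost of rebuilding more often than strictly necessary.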

Anyway, I don't want to distract more about ABI for this issue.
The ABI version not representing the ABI for CRuby dev versions is a long-standing issue, but orthogonal to this proposal and I believe it does not need to be solved for this feature.

----------------------------------------
Feature #18239: Variable Width Allocation: Strings
https://bugs.ruby-lang.org/issues/18239#change-94016

* Author: peterzhu2118 (Peter Zhu)
* Status: Open
* Priority: Normal
----------------------------------------
# GitHub PR: https://github.com/ruby/ruby/issues/4933

# Feature description

Since merging #18045, which introduced size pools inside the GC to allocate various sized slots, we've been working on expanding the usage of VWA into more types (before this patch, only classes were allocated through VWA). This patch changes strings to use VWA.

## Summary

- This patch allocates strings using VWA.
- Strings allocated through VWA are embedded. The patch changes strings to support dynamic capacity embedded strings.
- We do not handle resizing in VWA in this patch (embedded strings resized up are moved back to malloc), which may result in wasted space. However, in benchmarks, this does not appear to be an issue.
- We propose enabling VWA by default. We are confident about the stability and performance of this feature.

## String allocation

Strings whose sizes are known at allocation time and are small enough are allocated as embedded strings. Embedded string headers are now 18 bytes, so the maximum embedded string length is now 302 bytes (currently VWA is configured to allocate up to 320 byte slots, but can be easily configured for even larger slots). Embedded strings have their contents directly follow the object headers. This aims to improve cache performance since the contents are on the same cache line as the object headers. For strings with unknown sizes, or with contents that are too large, allocation falls back to a 40 byte slot and the contents are stored in the malloc heap.
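
The allocation rule described here can be sketched in plain Ruby. This is only a model of the decision, not the C implementation: the 18 byte header, the 40 byte fallback slot and the 320 byte maximum come from the text, while the intermediate slot sizes (80 and 160 bytes) and the `choose_slot` helper are assumptions made for the example. The model also ignores any terminator bytes.

```ruby
# Figures taken from the description above.
HEADER_BYTES  = 18 # embedded string header
FALLBACK_SLOT = 40 # slot used when the contents live in the malloc heap

# Assumption: four size pools, doubling from 40 up to 320 bytes.
SLOT_SIZES = [40, 80, 160, 320].freeze

# Returns [slot_size, storage] for a string of the given length.
# Pass nil when the length is unknown at allocation time.
def choose_slot(length)
  return [FALLBACK_SLOT, :malloc] if length.nil?

  # Smallest slot that can hold the header plus the contents.
  slot = SLOT_SIZES.find { |size| HEADER_BYTES + length <= size }
  slot ? [slot, :embedded] : [FALLBACK_SLOT, :malloc]
end

p choose_slot(71)  # => [160, :embedded]
p choose_slot(302) # => [320, :embedded]
p choose_slot(303) # => [40, :malloc]
p choose_slot(nil) # => [40, :malloc]
```

The 302 byte limit falls out of the arithmetic: 320 - 18 = 302.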

## String reallocation

If an embedded string is expanded and can no longer fit in the slot, it is moved into the malloc heap. This may mean that some space in the slot is wasted. For example, if the string was originally allocated in a 160 byte slot and moved to the malloc heap, 120 bytes of the slot are wasted (since we only need 40 bytes for a string with contents on the malloc heap). This patch does not aim to tackle this issue. Memory usage also does not appear to be an issue in any of the benchmarks.
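
The wasted space in that example is simple arithmetic, sketched below. The 40 byte figure comes from the text; the list of slot sizes and the `wasted_bytes` helper are assumptions for illustration.

```ruby
# Bytes a string needs in its slot once its contents live in the malloc heap.
MALLOC_STRING_SLOT = 40

# Bytes left unused in a slot after an embedded string outgrows it and its
# contents are moved to the malloc heap.
def wasted_bytes(slot_size)
  slot_size - MALLOC_STRING_SLOT
end

# Assumed slot sizes; only 40, 160 and 320 are mentioned in the text.
[40, 80, 160, 320].each do |slot_size|
  puts "#{slot_size} byte slot: #{wasted_bytes(slot_size)} bytes wasted"
end
```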

## Incompatibility of VWA and non-VWA built gems

Gems with native extensions built with VWA are not compatible with non-VWA built gems (and vice versa). When switching between VWA and non-VWA rubies, gems must be rebuilt. This is because the header file `rstring.h` changes depending on whether `USE_RVARGC` is set or not (the internal string implementation changes with VWA).

## Enabling VWA by default

In this patch, we propose enabling VWA by default. We believe that the performance data supports this (see the [benchmark results](#Benchmark-results) section). We're also confident about its stability. It passes all tests on CI for the Shopify monolith and we've run this version of Ruby on a small portion of production traffic in a Shopify service for about a week (where it served over 500 million requests).

Although VWA is enabled by default, we plan on supporting the `USE_RVARGC` flag as an escape hatch to disable VWA until at least Ruby 3.1 is released (i.e. we may remove it starting 2022).

# Benchmark setup

Benchmarking was done on a bare-metal Ubuntu machine on AWS. All benchmark results are using glibc by default, except when jemalloc is explicitly specified.

```
$ uname -a
Linux 5.8.0-1038-aws #40~20.04.1-Ubuntu SMP Thu Jun 17 13:25:28 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
```

glibc version:

```
$ ldd --version
ldd (Ubuntu GLIBC 2.31-0ubuntu9.2) 2.31
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.
```

jemalloc version:

```
$ apt list --installed | grep jemalloc

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

libjemalloc-dev/focal,now 5.2.1-1ubuntu1 amd64 [installed]
libjemalloc2/focal,now 5.2.1-1ubuntu1 amd64 [installed,automatic]
```

To measure memory usage over time, the [mstat tool](https://github.com/bpowers/mstat) was used.

Ruby master was benchmarked on commit [7adfb14f60](https://github.com/ruby/ruby/commit/7adfb14f60). The branch was rebased on top of the same commit.

Performance benchmarks of this branch without VWA turned on are included to make sure this patch does not introduce a performance regression compared to master.

# Benchmark results

## Summary

- On a live, production Shopify service, we see no significant differences in response times. However, VWA has lower memory usage.
- On railsbench, we see a minor performance improvement. However, we also see worse p100 response times. The reason is analyzed in the [railsbench](#railsbench) section below. VWA uses more memory when using glibc, but uses an equal amount of memory for jemalloc.
- On rdoc generation, VWA uses significantly less memory for a small reduction in performance.
- Microbenchmarks show significant performance improvement for strings that are embedded in VWA (e.g. `String#==` is significantly faster).

## Shopify production

We deployed this branch with VWA enabled vs. Ruby master commit 7adfb14f60 on a small portion of live, production traffic for a Shopify web application to test real-world performance. The data was collected over a period of 1 day, during which each version served approximately 84 million requests. Average, median, p90, p99 response times were within the margin of error of each other (<1% difference). However, VWA consistently had lower memory usage (Resident Set Size), at about 0.96x that of master.

## railsbench

For railsbench, we ran the [railsbench benchmark](https://github.com/k0kubun/railsbench/blob/master/bin/bench). For both the performance and memory benchmarks, 25 runs were conducted for each combination (branch + glibc, master + glibc, branch + jemalloc, master + jemalloc).

In this benchmark, VWA suffers from poor p100. This is because railsbench is relatively small (only one controller and model), so there are not many pages in each size pool. Incremental marking requires there to be many pooled pages, but because there are not a lot of pooled pages (due to the small heap size), it has to mark a very large number of objects at every step. We do not expect this to be a problem for real apps with a larger number of objects allocated at boot, and this is confirmed by metrics collected in the [Shopify production application](#Shopify-production).
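
For readers less familiar with the percentile rows used in this section, the sketch below shows why p100 is so sensitive: it is simply the slowest request, so a single long GC step is enough to move it. The nearest-rank method, the `percentile` helper and the sample latencies are all assumptions for illustration; they are not taken from railsbench.

```ruby
# Nearest-rank percentile: the smallest sample such that at least pct percent
# of all samples are less than or equal to it.
def percentile(samples, pct)
  sorted = samples.sort
  rank = (pct * sorted.size / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# Made-up response times in ms; one request hits a long pause.
latencies = [1.2, 1.3, 1.3, 1.4, 1.3, 1.5, 1.3, 1.4, 2.1, 20.0]

puts "p50:  #{percentile(latencies, 50)}"  # 1.3
puts "p90:  #{percentile(latencies, 90)}"  # 2.1
puts "p100: #{percentile(latencies, 100)}" # 20.0
```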

### glibc

Using glibc, VWA is about 1.018x faster than master in railsbench in throughput (RPS).

```
+-----------+-----------------+------------------+--------+
|           | Branch (VWA on) | Branch (VWA off) | Master |
+-----------+-----------------+------------------+--------+
| RPS       | 740.92          | 722.97           | 745.31 |
| p50 (ms)  | 1.31            | 1.37             | 1.33   |
| p90 (ms)  | 1.40            | 1.45             | 1.43   |
| p99 (ms)  | 2.36            | 1.80             | 1.78   |
| p100 (ms) | 21.38           | 17.28            | 15.15  |
+-----------+-----------------+------------------+--------+
```

![](https://i.imgur.com/9yJ3fO5.png)

Average max memory usage for VWA: 104.82 MB

Average max memory usage for master: 101.17 MB

VWA uses 1.04x more memory.

### jemalloc

Using jemalloc, VWA is about 1.06x faster than master in railsbench in RPS.

```
+-----------+-----------------+------------------+--------+
|           | Branch (VWA on) | Branch (VWA off) | Master |
+-----------+-----------------+------------------+--------+
| RPS       | 781.21          | 742.12           | 739.04 |
| p50 (ms)  | 1.25            | 1.31             | 1.30   |
| p90 (ms)  | 1.32            | 1.39             | 1.38   |
| p99 (ms)  | 2.15            | 2.53             | 4.41   |
| p100 (ms) | 21.11           | 16.60            | 15.75  |
+-----------+-----------------+------------------+--------+
```

![](https://i.imgur.com/Nltbqq0.png)

Average max memory usage for VWA: 102.68 MB

Average max memory usage for master: 103.14 MB

VWA uses about the same amount of memory as master (1.00x).

## rdoc generation

In rdoc generation, we see significant memory usage reduction at the cost of small performance reduction.

### glibc

```
+-----------+-----------------+------------------+--------+-------------+
|           | Branch (VWA on) | Branch (VWA off) | Master | VWA speedup |
+-----------+-----------------+------------------+--------+-------------+
| Time (s)  | 16.68           | 16.08            | 15.87  | 0.95x       |
+-----------+-----------------+------------------+--------+-------------+
```

![](https://i.imgur.com/HJCtowS.png)

Average max memory usage for VWA: 295.19 MB

Average max memory usage for master: 365.06 MB

VWA uses 0.81x the memory of master.

### jemalloc

```
+-----------+-----------------+------------------+--------+-------------+
|           | Branch (VWA on) | Branch (VWA off) | Master | VWA speedup |
+-----------+-----------------+------------------+--------+-------------+
| Time (s)  | 16.34           | 15.64            | 15.51  | 0.95x       |
+-----------+-----------------+------------------+--------+-------------+
```

![](https://i.imgur.com/lrdUxXw.png)

Average max memory usage for VWA: 281.16 MB

Average max memory usage for master: 316.49 MB

VWA uses 0.89x the memory of master.
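
The memory ratios quoted for railsbench and rdoc generation can be recomputed from the reported averages. The sketch below does this; the numbers are the "average max memory usage" figures from this post, while the hash layout and the `memory_ratio` helper are just for illustration.

```ruby
# Average max memory usage in MB, as reported in this post.
MEMORY_MB = {
  "railsbench + glibc"    => { vwa: 104.82, master: 101.17 },
  "railsbench + jemalloc" => { vwa: 102.68, master: 103.14 },
  "rdoc + glibc"          => { vwa: 295.19, master: 365.06 },
  "rdoc + jemalloc"       => { vwa: 281.16, master: 316.49 },
}.freeze

# Memory used by VWA relative to master; below 1.0 means VWA uses less.
def memory_ratio(vwa, master)
  (vwa / master).round(2)
end

MEMORY_MB.each do |name, mb|
  puts format("%-22s %.2fx", name, memory_ratio(mb[:vwa], mb[:master]))
end
```

A ratio above 1.0 means VWA used more memory than master in that run, and a ratio below 1.0 means it used less.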

## Liquid benchmarks

For the liquid benchmarks, we ran the [liquid benchmark](https://github.com/Shopify/liquid/blob/master/performance/benchmark.rb) averaged over 5 runs each. We see that VWA is faster across the board.

```
+----------------------+-----------------+------------------+--------+-------------+
|                      | Branch (VWA on) | Branch (VWA off) | Master | VWA speedup |
+----------------------+-----------------+------------------+--------+-------------+
| Parse (i/s)          | 40.40           | 38.44            | 39.47  | 1.02x       |
| Render (i/s)         | 126.47          | 121.97           | 121.20 | 1.04x       |
| Parse & Render (i/s) | 28.81           | 27.52            | 28.02  | 1.03x       |
+----------------------+-----------------+------------------+--------+-------------+
```

## Microbenchmarks

These microbenchmarks are very favourable for VWA since the strings created have a length of 71, so they are embedded in VWA and allocated on the malloc heap for master.

```
+--------------------+-----------------+------------------+---------+-------------+
|                    | Branch (VWA on) | Branch (VWA off) | Master  | VWA speedup |
+--------------------+-----------------+------------------+---------+-------------+
| String#== (i/s)    | 1.806k          | 1.334k           | 1.337k  | 1.35x       |
| String#times (i/s) | 6.010M          | 5.238M           | 5.247M  | 1.15x       |
| String#[]= (i/s)   | 1.031k          | 863.816          | 897.915 | 1.15x       |
+--------------------+-----------------+------------------+---------+-------------+
```
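
The string length of 71 can be checked directly, as in the short sketch below. It also prints `ObjectSpace.memsize_of` for the string. That value depends on the Ruby version and on whether VWA is enabled, so no particular output is claimed for it here.

```ruby
require "objspace"

# Same construction as in the benchmark: 36 one-character tokens
# joined by 35 spaces.
sample = [*"A".."Z", *"0".."9"].join(" ")

puts sample.bytesize # => 71

# Reported size of the object in bytes; varies between Ruby builds.
puts ObjectSpace.memsize_of(sample)
```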

{{collapse(Benchmark source code)

```ruby
require "bundler/inline"
gemfile do
  source "https://rubygems.org"
  gem "benchmark-ips"
end

COUNT = 10_000

strs1 = []
strs2 = []

nine = "9"

COUNT.times do
  strs1 << [*"A".."Z", *"0".."9"].join(" ")
  strs2 << [*"A".."Z", *"0".."9"].join(" ")
end

Benchmark.ips do |x|
  x.report("String#==") do |times|
    i = 0
    while i < times
      COUNT.times { |j| strs1[j] == strs2[j] }
      i += 1
    end
  end

  x.report("String#times") do |times|
    i = 0
    while i < times
      "a" * 100
      i += 1
    end
  end

  x.report("String#[]=") do |times|
    i = 0
    while i < times
      strs1.each { |str| str[-1] = nine }
      i += 1
    end
  end
end
```
}}

-- 
https://bugs.ruby-lang.org/
