ruby-core@ruby-lang.org archive (unofficial mirror)
 help / color / mirror / Atom feed
From: the.codefolio.guy@gmail.com
To: ruby-core@ruby-lang.org
Subject: [ruby-core:91948] [Ruby trunk Feature#15626] Manual Compaction for MRI's GC (`GC.compact`)
Date: Fri, 22 Mar 2019 20:51:48 +0000 (UTC)	[thread overview]
Message-ID: <redmine.journal-77280.20190322205147.659cdca848fdc0b7@ruby-lang.org> (raw)
In-Reply-To: redmine.issue-15626.20190227221310@ruby-lang.org

Issue #15626 has been updated by noahgibbs (Noah Gibbs).


One problem with automatically compacting on full GC: compaction appears to require two full GCs to do its thing. It might be possible to integrate this into Ruby's normal cycle. But I don't think this is designed for that as-is.

Aaron has commented on this in his talks - usually he calls GC.compact once after giving everything a chance to "settle" (running the app for awhile.)

You could call it periodically, but I think this is a heavier-weight operation than a normal full GC.

----------------------------------------
Feature #15626: Manual Compaction for MRI's GC (`GC.compact`)
https://bugs.ruby-lang.org/issues/15626#change-77280

* Author: tenderlovemaking (Aaron Patterson)
* Status: Open
* Priority: Normal
* Assignee: 
* Target version: 
----------------------------------------
Hi,

I've attached a patch that implements a compactor for MRI.  I'll try to describe the patch in detail below, but the biggest user facing changes are:

1. A new method `GC.compact` that can be called from Ruby
2. C extensions have a new "after compaction" callback they can register
3. Introduction of new marking functions that allow objects to be marked, but not "pinned" to the same location
4. A function to get the new location of an object (to be used from the "after compaction" callback)

# Compactor for RGenGC

This document describes the compactor implementation for RGenGC.  This patch
adds a new method to Ruby, `GC.compact` which will compact the objects in the
heap.  What follows is a description of the algorithm, along with achievements
and technical challenges.

## Steps for Compaction

These are the high level steps taken in `GC.compact` are as follows:

1. Full GC
2. Move objects
3. Update references
4. Full GC

### Step One, Full GC

Not all objects in Ruby's heap can move.  In particular, C extensions may hold
references to VALUE pointers.  If the VALUE pointers move, the C extension may
not be able to find them and will get a SEGV or possibly wrong data.

To prevent this, a new "pinning" table was introduced.  Any object marked via
`rb_gc_mark` is "pinned" and is not allowed to move.  C extensions must mark
their references in order to keep them alive.  Since C extensions use
`rb_gc_mark` to mark a reference, we know not to move that reference.  In order
to ensure all references get marked, we perform a full GC.

New functions, `rb_gc_mark_no_pin`, etc were introduced and used internally.
These functions keeps the object alive but without pinning the reference.

Summary: Perform a full GC so that objects marked with `rb_gc_mark` do not move.

### Step Two, Move Objects

This compactor uses a "two finger" algorithm which was introduced in "The
Programming Language LISP: Its Operation and Applications" (page 228)[^1].  Two
pointers point at each side of the heap, if one slot is empty, and the other is
moveable, it swaps places and leaves a `T_MOVED` object that contains a
forwarding address.

In Ruby, the algorithm looks something like this:

```ruby
def compact
  heap = [ ... ] # many slots
  left  = 0
  right = heap.length - 1

  while left < right
    left_slot = heap[left]
    right_slot = heap[right]

    if is_empty?(left_slot) && !is_empty?(right_slot) && can_move?(right_slot)
      swap(left, right)
      heap[right] = T_MOVED.new(left) # leave forwarding address
    end

    while !is_empty?(heap[left])
      left += 1
    end

    while is_empty?(heap[right]) || !can_move?(heap[right])
      right -= 1
    end
  end
end
```

If we have a heap with 6 slots, before compaction it might look something like
this:

```
                     ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─        
                        Before Compaction   │       
                     └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─        
               ┌─────┬─────┬─────┬─────┬─────┬─────┐
 Slot Index    │  1  │  2  │  3  │  4  │  5  │  6  │
               ├─────┼─────┼─────┼─────┼─────┼─────┤
Slot Contents  │  A  │  B  │Empty│Empty│  C  │  D  │
               │     │     │     │     │     │     │
               └─────┴─────┴─────┴─────┴─────┴─────┘
```

After compaction, but before reference update, it will look like this:

```
                     ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─        
                         After Compaction   │       
                     └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─        
               ┌─────┬─────┬─────┬─────┬─────┬─────┐
 Slot Index    │  1  │  2  │  3  │  4  │  5  │  6  │
               ├─────┼─────┼─────┼─────┼─────┼─────┤
Slot Contents  │  A  │  B  │  D  │  C  │MOVED│MOVED│
               │     │     │     │     │TO 4 │TO 3 │
               └─────┴─────┴─────┴─────┴─────┴─────┘
```

Slots 5 and 6 contain `T_MOVED` objects that point to the new locations of D and
C objects.

### Step Three, Update References

After objects have moved, references are updated.  This is a matter of updating
any references to use the forwarding address rather than the original location.

For example, if object A has a reference to C, and B a reference to D:

```
                     ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─        
                        Before Compaction   │       
                     └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─        
               ┌─────┬─────┬─────┬─────┬─────┬─────┐
 Slot Index    │  1  │  2  │  3  │  4  │  5  │  6  │
               ├─────┼─────┼─────┼─────┼─────┼─────┤
Slot Contents  │  A  │  B  │Empty│Empty│  C  │  D  │
               │     │     │     │     │     │     │
               └─────┴─────┴─────┴─────┴─────┴─────┘
                  │     │                 ▲     ▲   
                  │     │                 │     │   
                  └─────┼─────────────────┘     │   
                        └───────────────────────┘   
```

After compaction, but before updating references, the heap will look like this:

```
                     ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─        
                         After Compaction   │       
                     └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─        
               ┌─────┬─────┬─────┬─────┬─────┬─────┐
 Slot Index    │  1  │  2  │  3  │  4  │  5  │  6  │
               ├─────┼─────┼─────┼─────┼─────┼─────┤
Slot Contents  │  A  │  B  │  D  │  C  │MOVED│MOVED│
               │     │     │     │     │TO 4 │TO 3 │
               └─────┴─────┴─────┴─────┴─────┴─────┘
                  │     │                 ▲     ▲   
                  │     │                 │     │   
                  └─────┼─────────────────┘     │   
                        └───────────────────────┘   
```

The update references step looks at each object in the heap, checks to see if it
references any moved slots, then updates the reference to use the new location.
In Ruby, it looks like this:

```ruby
def update_references
  heap.each do |slot|
    next if is_empty?(slot) || is_moved?(slot)

    slot.references.each_with_index do |child, i|
      if is_moved?(child)
        slot.set_reference(i, child.new_location)
      end
    end
  end
end
```

After reference updates, the heap will look like this:

```
                     ┌ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─        
                        After Ref Updates   │       
                     └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─        
               ┌─────┬─────┬─────┬─────┬─────┬─────┐
 Slot Index    │  1  │  2  │  3  │  4  │  5  │  6  │
               ├─────┼─────┼─────┼─────┼─────┼─────┤
Slot Contents  │  A  │  B  │  D  │  C  │MOVED│MOVED│
               │     │     │     │     │TO 4 │TO 3 │
               └─────┴─────┴─────┴─────┴─────┴─────┘
                  │     │     ▲     ▲               
                  │     │     │     │               
                  └─────┼─────┼─────┘               
                        └─────┘                     
```

### Step Four, Full GC

After references have been updated, a full GC is performed.  This converts the
`T_MOVED` slots back in to `T_EMPTY` slots.  The GC automatically takes care of
this because no object should reference a `T_MOVED` slot.

## Non Moving Objects

Non moving objects are:

1. Objects on the stack
2. Objects marked with `rb_gc_mark`
3. Hash key objects

### Objects on the stack

The stack should not be mutated, so these are pinned.

### Objects marked with `rb_gc_mark`

As mentioned earlier, any object marked with `rb_gc_mark` may not move because a
C extension may hold a reference.  Here is an excerpt from Yajl, a JSON parser.
This is the mark function it uses:

```c
static void yajl_encoder_wrapper_mark(void * wrapper) {
    yajl_encoder_wrapper * w = wrapper;
    if (w) {
        rb_gc_mark(w->on_progress_callback);
        rb_gc_mark(w->terminator);
    }
}


obj = Data_Make_Struct(klass, yajl_encoder_wrapper, yajl_encoder_wrapper_mark, yajl_encoder_wrapper_free, wrapper);
```

The values in this mark function `w->on_progress_callback` and `w->terminator`
will not move since they are pinned by `rb_gc_mark`.

### Hash key objects

Certain hash keys will not move.  Most objects use their address as the hash
value.  If the object moves, the hash value could change, so hash keys are not
allowed to move.

There is an exception for string objects.  String objects calculate their hash
based on the bytes in the string.  Since the bytes in the string do not change
after move, they are allowed to move.  Most hash keys in large programs are
strings, so only allowing string keys to move seemed enough.

## Special Considerations

### Object ID

`Object#object_id` bases its value from the address of the object.  For example:

```ruby
x = Object.new
id = x.object_id
GC.compact        # `x` moves
id == x.object_id # `id` should be same as `x.object_id`
```

We expect the object id to remain the same regardless of compaction.  To deal
with `object_id`, two hashes were introduced.  One hash contains a map of the
object to the *seen* object id.  The second map contains all seen object ids.
The maps are updated any time `object_id` is called, an object moves, or an
object is freed.

object_id implementation in Ruby:

```ruby
$obj_to_id = {}
$id_to_obj = {}

class Object
  def object_id
    if $obj_to_id.key?(address)
      # Second call to object_id on this object
      return $obj_to_id[address]
    else
      # First call to object_id on this object

      addr = self.address

      loop do
        if $seen_ids.key?(addr)
          # Resolve conflict
          addr += sizeof(RVALUE)
        else
          $obj_to_id[self.address] = addr
          $id_to_obj[addr]         = self
          return addr
        end
      end
    end
  end

  private

  def address
    # returns address in memory of object
  end
end
```

During compaction:

```ruby
def compact
  heap = [ ... ] # many slots

  while left < right
    dest_slot = heap[left]
    source_slot = heap[right]

    if moving?(source_slot)
      if $obj_to_id.key?(source_slot.address)
        id = $obj_to_id.delete(source_slot.address)
        $obj_to_id[dest_slot.address] = id
        $id_to_obj[id]                = dest_slot.address
      end

      # etc
    end
  end
end
```

During Free:

```ruby
def free(obj)
  if $obj_to_id.key?(obj.address)
    $seen_ids.delete($obj_to_id.delete(obj.address))
  end
end
```

### ObjectSpace._id2ref

When generating an object id, in the case of a collision, the address is
incremented by 40.  This enables `_id2ref` to determine there is a VALUE pointer
and round trip the object correctly.

## Compaction Support for C extensions

Any object marked with `rb_gc_mark` will be "pinned" and cannot move.  C
extensions mark objects via `rb_gc_mark`, so this ensures that pointers in C
stay valid.

However, some C extensions may want to support reference movement.

### Reference Movement in C extensions

In order to maintain backwards compatibility, if a C extension holds a reference
to a VALUE, the VALUE should not move.  Going forward, C extensions can support
moving VALUEs.  To support moving VALUEs, a C extension should change
`rb_gc_mark` to `rb_gc_mark_no_pin`, then implement a new callback that is
called during compaction that gives the extension an opportunity to update its
own references.  A new function `rb_gc_new_location` will return the new
location for a particular VALUE.

Here is an example for autoload from `variable.c`.  A new compaction callback is
registered, `autoload_i_compact`:

```c
static const rb_data_type_t autoload_data_i_type = {
    "autoload_i",
    {autoload_i_mark, autoload_i_free, autoload_i_memsize, autoload_i_compact},
    0, 0, RUBY_TYPED_FREE_IMMEDIATELY
};
```

The mark function changes to *mark* references, but *not* pin them:

```c
static void
autoload_i_mark(void *ptr)
{
    struct autoload_data_i *p = ptr;

    rb_gc_mark_no_pin(p->feature);

    /* allow GC to free us if no modules refer to this via autoload_const.ad */
    if (list_empty(&p->constants)) {
        rb_hash_delete(autoload_featuremap, p->feature);
    }
}
```

After the heap is compacted, any C extensions that registered a "compaction"
callback will be called and have a chance to update internal references.  The
autoload compaction function is like this:

```c
static void
autoload_i_compact(void *ptr)
{
    struct autoload_data_i *p = ptr;
    p->feature = rb_gc_new_location(p->feature);
```

The compaction callback asks for the new location for the `p->feature`
reference.  `rb_gc_new_location` will either return the current value of
`p->feature` (in the case the VALUE did not move), or the new location.

## Reference Verification and Testing

After compaction, objects that have moved are changed to `T_MOVED` objects with
forwarding addresses.  Once all references are updated, no object should point
to a `T_MOVED` slot.  We can say that before and after compaction, no object
should ever point at a `T_MOVED` slot.  So if a `T_MOVED` object is ever pushed
on to the mark queue, there is a bug.  `push_mark_stack` has a check for
`T_MOVED` objects, and will crash if a `T_MOVED` object is ever pushed on to the
mark stack.

A method `GC.verify_compaction_references` was added that doubles the available
heap size, then compacts the heap.  Since the heap has doubled in size, any
object that can move will move.  Any references to `T_MOVED` objects will be
caught during the GC phase after compaction.

## Known Issue

Safety for C extensions depends entirely on C extensions marking their
references.  If a C extension does not mark a reference, the compactor assumes
that the object is safe to move.  This can cause an error in the following
situation, when there is an object graph like this:


```
 ┌───────────────┐       ┌───────────────┐ 
 │  Ruby Object  │       │  Ruby Object  │ 
 │  Implemented  │       │  Implemented  │ 
 │    in Ruby    │       │     in C      │ 
 │               │       │               │ 
 └───────────────┘       └───────────────┘ 
         │                       │         
         │                                 
         │                       │   NOT   
 Marked  │                          Marked 
         │                       │         
         │   ┌───────────────┐             
         │   │               │   │         
         │   │  Ruby Object  │             
         └──▶│               │◀─ ┘         
             │               │             
             └───────────────┘             
```

If the C extension contains a reference to an object, but expects the object
not to move because a Ruby object contains a reference, then the target Ruby
object *may* move and the reference in the C extension will be wrong.  I like to
call these "ghost references" because the GC cannot see them but they will come
back to haunt you.

The solution to this is either:

1. Change the C extension to `rb_gc_mark` the object
2. Remove the reference from the C extension

One example of this situation was fixed here:

  https://github.com/msgpack/msgpack-ruby/issues/133

## Summary

Backwards compatibility with C extensions is maintained by "pinning" any
references marked with `rb_gc_mark`.  A full mark *must* be done before
compacting to discover all pinned references.

A new callback for C extensions is introduced so that C extensions can support
compaction.  C extensions that wish to support compaction must use the new
callback, `rb_gc_mark_no_pin`, and `rb_gc_new_location`.

C extensions that maintain pointers but do not mark those pointers may SEGV.  We
can use `GC.verify_compaction_references` to discover these issues.

[^1]: See also "The Garbage Collection Handbook" page 32


---Files--------------------------------
before_compact-0.png (111 KB)
after_compact-0.png (80.6 KB)
before_compact-1.png (81.2 KB)
after_compact-1.png (79.4 KB)
0001-GC-Compaction-for-MRI.patch (77.4 KB)
ruby_2019-03-22-171839_PikachuEXEs-MBP-2018.crash (71.4 KB)
ruby_2019-03-22-171839-1_PikachuEXEs-MBP-2018.crash (71.4 KB)


-- 
https://bugs.ruby-lang.org/

  parent reply	other threads:[~2019-03-22 20:51 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <redmine.issue-15626.20190227221310@ruby-lang.org>
2019-02-27 22:13 ` [ruby-core:91634] [Ruby trunk Feature#15626] Manual Compaction for MRI's GC (`GC.compact`) tenderlove
2019-02-27 22:26 ` [ruby-core:91635] " tenderlove
2019-02-27 22:57 ` [ruby-core:91636] " tenderlove
2019-02-27 23:59 ` [ruby-core:91637] " tad.a.digger
2019-02-28  4:25 ` [ruby-core:91638] " duerst
2019-02-28  6:18 ` [ruby-core:91639] " ko1
2019-03-11  8:16 ` [ruby-core:91765] " matz
2019-03-11  9:02 ` [ruby-core:91767] " naruse
2019-03-18 20:51 ` [ruby-core:91876] " nate.berkopec
2019-03-20  9:41 ` [ruby-core:91893] " pikachuexe
2019-03-21  5:37 ` [ruby-core:91912] " pikachuexe
2019-03-22  9:21 ` [ruby-core:91927] " pikachuexe
2019-03-22 20:51 ` the.codefolio.guy [this message]
2019-03-23  0:57 ` [ruby-core:91952] " duerst
2019-03-23 21:50 ` [ruby-core:91961] " tenderlove
2019-03-25  1:54 ` [ruby-core:91970] " the.codefolio.guy
2019-03-25  2:52 ` [ruby-core:91971] " pikachuexe
2019-03-26  3:00 ` [ruby-core:91987] " the.codefolio.guy
2019-03-26  6:12 ` [ruby-core:91988] " the.codefolio.guy
2019-04-03  0:37 ` [ruby-core:92118] " tenderlove
2019-04-03  1:17 ` [ruby-core:92119] " tenderlove
2019-04-03  5:50 ` [ruby-core:92121] " pikachuexe
2019-04-09  0:56 ` [ruby-core:92216] " tenderlove
2019-04-15  4:27 ` [ruby-core:92286] " the.codefolio.guy

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-list from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://www.ruby-lang.org/en/community/mailing-lists/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=redmine.journal-77280.20190322205147.659cdca848fdc0b7@ruby-lang.org \
    --to=ruby-core@ruby-lang.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).