CREL dynamic relocations

unofficial mirror of libc-alpha@sourceware.org
 help / color / mirror / Atom feed

* CREL dynamic relocations
@ 2024-03-24  5:50 Fangrui Song
  2024-03-25 11:52 ` Florian Weimer
  0 siblings, 1 reply; 7+ messages in thread
From: Fangrui Song @ 2024-03-24  5:50 UTC (permalink / raw)
  To: libc-alpha

I have proposed a compact relocation format CREL at
https://groups.google.com/g/generic-abi/c/yb0rjw56ORw/m/eiBcYxSfAQAJ
(previously named RELLEB).

CREL primarily targets static relocations, achieving significant .o
file size reduction for lld builds: 18.0% for x86-64/aarch64 and 34.3%
for riscv64.
CREL holds promise for dynamic relocations as well, surpassing
Android's packed relocation format.

* R_*_GLOB_DAT and R_*_JUMP_SLOT relocations can typically be encoded
with repeated 0x09 0x01 (when shift==3, offset++, symidx++).
* Absolute relocation (__cxa_pure_virtual@CXXABI_1.3 + 0) relocations
typically require just one byte (offset+=n)

While CREL's benefit for non-relative relocations is smaller than
RELR's optimizing for relative relocations, it still shines for
programs where non-relative relocations make up more than 5% of the
file size
(quite a few GUI libraries, probably due to C++ virtual tables).
We could even consider phasing out REL/RELA for new architectures in
favor of CREL.

I'd love to hear your feedback on CREL and hope someone might consider
implementing rtld support (and binutils support if you like that as
well:) )
https://sourceware.org/bugzilla/show_bug.cgi?id=31541

I've notified binutils and GCC folks at
https://sourceware.org/pipermail/binutils/2024-March/133153.html
("CREL relocation format for ELF (was: RELLEB)").

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: CREL dynamic relocations
  2024-03-24  5:50 CREL dynamic relocations Fangrui Song
@ 2024-03-25 11:52 ` Florian Weimer
  2024-03-25 18:51   ` Fangrui Song
  0 siblings, 1 reply; 7+ messages in thread
From: Florian Weimer @ 2024-03-25 11:52 UTC (permalink / raw)
  To: Fangrui Song; +Cc: libc-alpha

* Fangrui Song:

> I have proposed a compact relocation format CREL at
> https://groups.google.com/g/generic-abi/c/yb0rjw56ORw/m/eiBcYxSfAQAJ
> (previously named RELLEB).
>
> CREL primarily targets static relocations, achieving significant .o
> file size reduction for lld builds: 18.0% for x86-64/aarch64 and 34.3%
> for riscv64.
> CREL holds promise for dynamic relocations as well, surpassing
> Android's packed relocation format.

As I said elsewhere, I'm concerned about the use of the ULEB128
encoding.  It's unnecessarily difficult to decode.

Thanks,
Florian


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: CREL dynamic relocations
  2024-03-25 11:52 ` Florian Weimer
@ 2024-03-25 18:51   ` Fangrui Song
  0 siblings, 0 replies; 7+ messages in thread
From: Fangrui Song @ 2024-03-25 18:51 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Fangrui Song, libc-alpha

On Mon, Mar 25, 2024 at 4:53 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> * Fangrui Song:
>
> > I have proposed a compact relocation format CREL at
> > https://groups.google.com/g/generic-abi/c/yb0rjw56ORw/m/eiBcYxSfAQAJ
> > (previously named RELLEB).
> >
> > CREL primarily targets static relocations, achieving significant .o
> > file size reduction for lld builds: 18.0% for x86-64/aarch64 and 34.3%
> > for riscv64.
> > CREL holds promise for dynamic relocations as well, surpassing
> > Android's packed relocation format.
>
> As I said elsewhere, I'm concerned about the use of the ULEB128
> encoding.  It's unnecessarily difficult to decode.
>
> Thanks,
> Florian

Thanks. I have seen your question at
https://groups.google.com/g/generic-abi/c/yb0rjw56ORw/m/osMXhg5XAgAJ
and replied there that
since one-byte encodings dominant for our use cases, LEB128 is
actually the best choice (in terms of both performance and
simplicity).

I've researched the dynamic relocation problem in the weekend and
incorporated the following text to my blog post

Traditionally, we have two dynamic relocation ranges for executables
and shared objects (except static position-dependent executables):

* `.rela.dyn` (`[DT_RELA, DT_RELA + DT_RELASZ)`) or `.rel.dyn`
(`[DT_REL, DT_REL + DT_RELSZ)`)
* `.rela.plt` (`[DT_JMPREL, DT_JMPREL + DT_PLTRELSZ)`): Stored
JUMP_SLOT relocations. `DT_PLTREL` specifies `DT_REL` or `DT_RELA`.

IRELATIVE relocations can be placed in either range, but preferrably
in `.rel[a].dyn`.

Some GNU ld ports (e.g. SPARC) treat `.rela.plt` as a subset of
`.rela.dyn`, introducing complexity for dynamic loaders.

**CREL adoption considerations**

* New dynamic tag (`DT_CREL`): To identify CREL relocations, separate
from existing `DT_REL`/`DT_RELA`.
* No `DT_CRELSZ`: Relocation count can be derived from the CREL header.
* Output section description `.rela.dyn : { *(.rela.dyn) *(.rela.plt)
}` is incompatible with CREL.

**Challenges with lazy binding**

glibc's lazy binding scheme relies on [random access to relocation
entries within the `DT_JMPREL`
table](https://maskray.me/blog/2021-09-19-all-about-procedure-linkage-table#:~:text=_dl_fixup).
CREL's sequential nature prevents this. However, eager binding doesn't
require random access.
Therefore, when `-z now` (eager binding) is enabled, we can:

* Set `DT_PLTREL` to `DT_CREL`.
* Replace `.rel[a].plt` with `.crel.plt`.

**Challenges with statically linked position-dependent executables**

glibc introduces additional complexity for IRELATIVE relocations in
statically linked position-dependent executables.
They should only contain IRELATIVE relocations and no other dynamic relocations.

glibc's `csu/libc-start.c` processes IRELATIVE relocations in the
range [`[__rela_iplt_start,
__rela_iplt_end)`](https://maskray.me/blog/2021-01-18-gnu-indirect-function#non-preemptible-ifunc#rela_iplt_start-and-__rela_iplt_end)
(or `[__rel_iplt_start, __rel_iplt_end)`, determined at build time
through `ELF_MACHINE_IREL`).
While CREL relocations cannot be decoded in the middle of the section,
we can still place IRELATIVE relocations in `.crel.dyn` because there
wouldn't be any other relocation types (position-dependent executables
don't have RELATIVE relocations).
When CREL is enabled, we can define `__crel_iplt_start` and
`__crel_iplt_end` for statically linked position-dependent
executables.

If glibc only intends to support `addend_bit==0`, the code can simply be:
```c
  extern const uint8_t __crel_iplt_start[] __attribute__ ((weak));
  extern const uint8_t __crel_iplt_end[] __attribute__ ((weak));
  if (&__crel_iplt_start != &__crel_iplt_end) {
    const uint8_t *p = __crel_iplt_start;
    size_t offset = 0, count = read_uleb128 (&p), shift = count & 3;
    for (count >>= 3; count; count--) {
      uint8_t rel_head = *p++;
      offset += rel_head >> 2;
      if (rel_head & 128)
        offset += (read_uleb128 (&p) << 5) - 32;
      if (rel_head & 2)
        read_sleb128 (&p);
      elf_crel_irel ((ElfW (Addr) *) (offset << shift));
    }
  }
```

**Considering implicit addends for CREL**

Many dynamic relocations have zero addends:

* COPY/GLOB_DAT/JUMP_SLOT relocations only use zero addends.
* Absolute relocations could use non-zero addends with `STT_SECTION`
symbol, but linkers convert them to relative relocations.

Usually only RELATIVE/IRELATIVE and potentially TPREL/TPOFF might
require non-zero addends.
Switching from `DT_RELA` to `DT_REL` offers a minor size advantage.

I considered defining two separate dynamic tags (`DT_CREL` and
`DT_CRELA`) to distinguish between implicit and explicit addends.
However, this would have introduced complexity:

* Should `llvm-readelf -r` dump the zero addends for `DT_CRELA`?
* Should dynamic loaders support both dynamic tags?

I placed the delta addend bit next to offset bits so that it can be
reused for offsets.
Thanks to Stefan O'Rear's for making me believe that my original
thought of reserving a single bit flag (`addend_bit`) within the CREL
header is elegant.
Dynamic loaders prioritizing simplicity can hardcode the desired
`addend_bit` value.

`ld.lld -z crel` defaults to implicit addends (`addend_bit==0`), but
the option of using in-relocation addends is available with `-z crel
-z rela`.

**DT_AARCH64_AUTH_RELR vs CREL**

The AArch64 PAuth ABI introduces `DT_AARCH64_AUTH_RELR` as a variant
of RELR for signed relocations.
However, its benefit seems limited.

In a release build of Clang 16, using `-z crel -z rela` resulted in a
`.crel.dyn` section size of only 1.0% of the file size.
Notably, enabling implicit addends with `-z crel -z rel` further
reduced the size to just 0.3%.
While `DT_AARCH64_AUTH_RELR` will achieve a noticeable smaller
relocation size if most relative relocations are encoded with it, the
advantage seems less significant considering CREL's already compact
size.

Furthermore, `DT_AARCH64_AUTH_RLEL` introduces additional complexity
to the linker due to its 32-bit addend limitation: the in-place 64
value encodes a 32-bit schema, giving just 32 bits to the implicit
addend.
If the addend does not fit into 32 bits, `DT_AARCH64_AUTH_RELR` cannot be used.
CREL with addends would avoid this complexity.

I have filed [Quantifying the benefits of
DT_AARCH64_AUTH_RELR](https://github.com/ARM-software/abi-aa/issues/252).

-- 
宋方睿

^ permalink raw reply	[flat|nested] 7+ messages in thread

* CREL dynamic relocations
@ 2024-04-09 15:32 Wilco Dijkstra
  2024-04-11  2:41 ` Fangrui Song
  0 siblings, 1 reply; 7+ messages in thread
From: Wilco Dijkstra @ 2024-04-09 15:32 UTC (permalink / raw)
  To: maskray@gcc.gnu.org; +Cc: 'GNU C Library', Florian Weimer

Hi,

I like the general idea of more compact relocations, however what I don't get is
what the overall goal is. If the goal is more compact object files, why don't we just
add a (de)compress pass using a fast compression algorithm? CPU time is cheap
today, and real compression easily gives 2-4x reduction of object file size, far more
than you could achieve by just compressing relocations.

Alternatively, if we wanted to access and process ELF files without any decompression,
we could define compact relocations as fixed-size entries. Using 64 bits for a compact
RELA relocation gives a straightforward 4x compression. Out of range values could
use the next entry to extend the ranges.

So my main issue with the proposal is that it tries too hard to compress relocations.
For example using offset compression for relocations, symbol indices and even addends
seems to have little value: the signed offset means you lose one bit, and if out of range
values are rare or not grouped together, offset encodings are actually less efficient.

I don't get the discussion about relocation numbers on AArch64 - 4 or 5 bits would
handle all frequently used relocations, so we'd just remap them to fit in the short
encoding. Hence I don't see a need at all for a signed offset encoding.

So my suggestion is to either go for compression of the whole object file or use a
simpler approach that isn't almost a compression algorithm.

Cheers,
Wilco

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: CREL dynamic relocations
  2024-04-09 15:32 Wilco Dijkstra
@ 2024-04-11  2:41 ` Fangrui Song
  2024-04-12 16:18   ` enh
  0 siblings, 1 reply; 7+ messages in thread
From: Fangrui Song @ 2024-04-11  2:41 UTC (permalink / raw)
  To: Wilco Dijkstra; +Cc: GNU C Library, Florian Weimer

Thank you for your interest in the CREL relocation format.

On Tue, Apr 9, 2024 at 8:33 AM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> I like the general idea of more compact relocations, however what I don't get is
> what the overall goal is. If the goal is more compact object files, why don't we just
> add a (de)compress pass using a fast compression algorithm? CPU time is cheap
> today, and real compression easily gives 2-4x reduction of object file size, far more
> than you could achieve by just compressing relocations.

My primary goal is to make relocatable files smaller (see
https://sourceware.org/pipermail/binutils/2024-March/133229.html for a
few use cases).
Smaller files benefit applications in several ways, including smaller
I/O amount and lower linker memory usage (for linkers like gold, lld,
and mold that map input files into memory).

Generic data compression formats (like zlib or zstd) applied at the
filesystem level won't achieve this goal because they don't decrease
memory usage.
In addition, filesystem compression does not appear too popular.

Interestingly, I measured a 5.9% size reduction in .o files even after
zstd compression when comparing two Clang builds with and without
CREL.

    % ruby -e 'require "zstd-ruby"; un=com=0;
Dir.glob("/tmp/out/s2-custom0/**/*.o").each{|f| x =
File.open(f,"rb"){|h|h.read}; un+=x.size; com+=Zstd.compress(x).size};
puts "uncompressed: #{un}\ncompressed: #{com}"'
    uncompressed: 136086784
    compressed: 37173381

    % ruby -e 'require "zstd-ruby"; un=com=0;
Dir.glob("/tmp/out/s2-custom1/**/*.o").each{|f| x =
File.open(f,"rb"){|h|h.read}; un+=x.size; com+=Zstd.compress(x).size};
puts "uncompressed: #{un}\ncompressed: #{com}"'
    uncompressed: 111655952
    compressed: 34964421

    1-111655952/136086784 ~= 18.0% (uncompressed)
    1-34964421/37173381 ~= 5.9%    (zstd)

Another objective is to minimize the size of dynamic relocations.
Android achieves this through ld.lld --pack-dyn-relocs=android+relr,
which compacts RELA relocations in their packed format.
While effective, CREL offers a simpler approach that delivers even
greater size reductions.

> Alternatively, if we wanted to access and process ELF files without any decompression,
> we could define compact relocations as fixed-size entries. Using 64 bits for a compact
> RELA relocation gives a straightforward 4x compression. Out of range values could
> use the next entry to extend the ranges.

64 bits are quite large. CREL typically needs just one to three bytes
for one relocation.
How do you design a format that is generic enough to work with all
relocation types and symbol indexes?

> So my main issue with the proposal is that it tries too hard to compress relocations.
> For example using offset compression for relocations, symbol indices and even addends
> seems to have little value: the signed offset means you lose one bit, and if out of range
> values are rare or not grouped together, offset encodings are actually less efficient.

I actually use unsigned delta offset to save one bit but signed delta
symidx/addend.
I have analyzed how many bits are demanded by typical relocations.
Quote https://maskray.me/blog/2024-03-09-a-compact-relocation-format-for-elf#crel-relocation-format
:

    Absolute symbol indexes allow one-byte encoding for symbols in the
range [0,128) and offer minor size advantage for static relocations
when the symbol table is sorted by usage frequency. Delta encoding, on
the other hand, might optimize for the scenario when the symbol table
presents locality: neighbor symbols are frequently mutually called.

    Delta symbol index enables one-byte encoding for GOT/PLT dynamic
relocations when .got/.got.plt entries are ordered by symbol index.
For example, R_*_GLOB_DAT and R_*_JUMP_SLOT relocations can typically
be encoded with repeated 0x05 0x01 (when addend_bit==0 && shift==3,
offset++, symidx++). Delta encoding has a disvantage. It can partial
claim the optimization by arranging symbols in a "cold0 hot cold1"
pattern. In addition, delta symbol index enables one-byte encoding for
GOT/PLT dynamic relocations when .got/.got.plt entries are ordered by
symbol index.

    In my experiments, absolute encoding with ULEB128 results in
slightly larger .o file sizes for both x86-64 and AArch64 builds.

For a decoder that only supports in-reloc addends (recommended for
relocatable files), the C++ implementation is as simple as:

  const auto hdr = decodeULEB128(p);
  const size_t count = hdr / 8, shift = hdr % 4;
  Elf_Addr offset = 0, addend = 0;
  uint32_t symidx = 0, type = 0;
  for (size_t i = 0; i != count; ++i) {
    const uint8_t b = *p++;
    offset += b >> 3;
    if (b >= 0x80) offset += (decodeULEB128(p) << 4) - 0x10;
    if (b & 1) symidx += decodeSLEB128(p);
    if (b & 2) type += decodeSLEB128(p);
    if (b & 4) addend += decodeSLEB128(p);
    rels[i] = {offset << shift, symidx, type, addend};
  }

+= for all of symidx/type/addend is for consistency, but the choice
turns out to be very good as well.

> I don't get the discussion about relocation numbers on AArch64 - 4 or 5 bits would
> handle all frequently used relocations, so we'd just remap them to fit in the short
> encoding. Hence I don't see a need at all for a signed offset encoding.

The common static relocation types are within [257,313] (before
R_AARCH64_PLT32).
Delta encoding allows ~all but the first relocation's type to be
encoded in a single byte.

How do you design a compression scheme without baked-in knowledge (dictionary)?
We don't want the generic encoding scheme to hard code relocation type
range for each architecture.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: CREL dynamic relocations
  2024-04-11  2:41 ` Fangrui Song
@ 2024-04-12 16:18   ` enh
  2024-04-18  4:31     ` Fangrui Song
  0 siblings, 1 reply; 7+ messages in thread
From: enh @ 2024-04-12 16:18 UTC (permalink / raw)
  To: Fangrui Song; +Cc: Wilco Dijkstra, GNU C Library, Florian Weimer

On Wed, Apr 10, 2024 at 7:41 PM Fangrui Song <maskray@gcc.gnu.org> wrote:
>
> Thank you for your interest in the CREL relocation format.
>
> On Tue, Apr 9, 2024 at 8:33 AM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> > I like the general idea of more compact relocations, however what I don't get is
> > what the overall goal is. If the goal is more compact object files, why don't we just
> > add a (de)compress pass using a fast compression algorithm? CPU time is cheap
> > today, and real compression easily gives 2-4x reduction of object file size, far more
> > than you could achieve by just compressing relocations.
>
> My primary goal is to make relocatable files smaller (see
> https://sourceware.org/pipermail/binutils/2024-March/133229.html for a
> few use cases).
> Smaller files benefit applications in several ways, including smaller
> I/O amount and lower linker memory usage (for linkers like gold, lld,
> and mold that map input files into memory).
>
> Generic data compression formats (like zlib or zstd) applied at the
> filesystem level won't achieve this goal because they don't decrease
> memory usage.
> In addition, filesystem compression does not appear too popular.
>
> Interestingly, I measured a 5.9% size reduction in .o files even after
> zstd compression when comparing two Clang builds with and without
> CREL.
>
>     % ruby -e 'require "zstd-ruby"; un=com=0;
> Dir.glob("/tmp/out/s2-custom0/**/*.o").each{|f| x =
> File.open(f,"rb"){|h|h.read}; un+=x.size; com+=Zstd.compress(x).size};
> puts "uncompressed: #{un}\ncompressed: #{com}"'
>     uncompressed: 136086784
>     compressed: 37173381
>
>     % ruby -e 'require "zstd-ruby"; un=com=0;
> Dir.glob("/tmp/out/s2-custom1/**/*.o").each{|f| x =
> File.open(f,"rb"){|h|h.read}; un+=x.size; com+=Zstd.compress(x).size};
> puts "uncompressed: #{un}\ncompressed: #{com}"'
>     uncompressed: 111655952
>     compressed: 34964421
>
>     1-111655952/136086784 ~= 18.0% (uncompressed)
>     1-34964421/37173381 ~= 5.9%    (zstd)
>
> Another objective is to minimize the size of dynamic relocations.
> Android achieves this through ld.lld --pack-dyn-relocs=android+relr,
> which compacts RELA relocations in their packed format.
> While effective, CREL offers a simpler approach that delivers even
> greater size reductions.

yeah, though android+relr has the advantage of having already shipped,
so it's usable on just about every device that app developers still
support (api 23 --
https://android.googlesource.com/platform/bionic/+/master/android-changes-for-ndk-developers.md#relative-relocations-relr
-- versus the effective minimum of api 21) :-)

something new today wouldn't be as widely usable on Android for about
a decade, so although we'd probably support it if others did, to be
really interesting for Android -- the point where we'd implement it
even if no-one else did -- it'd have to be 2x better (in space or
time, and not sacrificing the other, because app developers are very
conscious of both) rather than the little bit better that it actually
is. given our long lead time [for app developers to be able to rely on
ubiquity] and our "good enough" solution, i'm actually more interested
in the "what if we re-thought _everything_?" end of the spectrum than
small tweaks here.

what does mach-o do here? (that is: "why doesn't Apple care?".)

> > Alternatively, if we wanted to access and process ELF files without any decompression,
> > we could define compact relocations as fixed-size entries. Using 64 bits for a compact
> > RELA relocation gives a straightforward 4x compression. Out of range values could
> > use the next entry to extend the ranges.
>
> 64 bits are quite large. CREL typically needs just one to three bytes
> for one relocation.
> How do you design a format that is generic enough to work with all
> relocation types and symbol indexes?
>
> > So my main issue with the proposal is that it tries too hard to compress relocations.
> > For example using offset compression for relocations, symbol indices and even addends
> > seems to have little value: the signed offset means you lose one bit, and if out of range
> > values are rare or not grouped together, offset encodings are actually less efficient.
>
> I actually use unsigned delta offset to save one bit but signed delta
> symidx/addend.
> I have analyzed how many bits are demanded by typical relocations.
> Quote https://maskray.me/blog/2024-03-09-a-compact-relocation-format-for-elf#crel-relocation-format
> :
>
>     Absolute symbol indexes allow one-byte encoding for symbols in the
> range [0,128) and offer minor size advantage for static relocations
> when the symbol table is sorted by usage frequency. Delta encoding, on
> the other hand, might optimize for the scenario when the symbol table
> presents locality: neighbor symbols are frequently mutually called.
>
>     Delta symbol index enables one-byte encoding for GOT/PLT dynamic
> relocations when .got/.got.plt entries are ordered by symbol index.
> For example, R_*_GLOB_DAT and R_*_JUMP_SLOT relocations can typically
> be encoded with repeated 0x05 0x01 (when addend_bit==0 && shift==3,
> offset++, symidx++). Delta encoding has a disvantage. It can partial
> claim the optimization by arranging symbols in a "cold0 hot cold1"
> pattern. In addition, delta symbol index enables one-byte encoding for
> GOT/PLT dynamic relocations when .got/.got.plt entries are ordered by
> symbol index.
>
>     In my experiments, absolute encoding with ULEB128 results in
> slightly larger .o file sizes for both x86-64 and AArch64 builds.
>
> For a decoder that only supports in-reloc addends (recommended for
> relocatable files), the C++ implementation is as simple as:
>
>   const auto hdr = decodeULEB128(p);
>   const size_t count = hdr / 8, shift = hdr % 4;
>   Elf_Addr offset = 0, addend = 0;
>   uint32_t symidx = 0, type = 0;
>   for (size_t i = 0; i != count; ++i) {
>     const uint8_t b = *p++;
>     offset += b >> 3;
>     if (b >= 0x80) offset += (decodeULEB128(p) << 4) - 0x10;
>     if (b & 1) symidx += decodeSLEB128(p);
>     if (b & 2) type += decodeSLEB128(p);
>     if (b & 4) addend += decodeSLEB128(p);
>     rels[i] = {offset << shift, symidx, type, addend};
>   }
>
> += for all of symidx/type/addend is for consistency, but the choice
> turns out to be very good as well.
>
> > I don't get the discussion about relocation numbers on AArch64 - 4 or 5 bits would
> > handle all frequently used relocations, so we'd just remap them to fit in the short
> > encoding. Hence I don't see a need at all for a signed offset encoding.
>
> The common static relocation types are within [257,313] (before
> R_AARCH64_PLT32).
> Delta encoding allows ~all but the first relocation's type to be
> encoded in a single byte.
>
> How do you design a compression scheme without baked-in knowledge (dictionary)?
> We don't want the generic encoding scheme to hard code relocation type
> range for each architecture.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: CREL dynamic relocations
  2024-04-12 16:18   ` enh
@ 2024-04-18  4:31     ` Fangrui Song
  0 siblings, 0 replies; 7+ messages in thread
From: Fangrui Song @ 2024-04-18  4:31 UTC (permalink / raw)
  To: enh; +Cc: Fangrui Song, Wilco Dijkstra, GNU C Library, Florian Weimer

On Fri, Apr 12, 2024 at 9:19 AM enh <enh@google.com> wrote:
>
> On Wed, Apr 10, 2024 at 7:41 PM Fangrui Song <maskray@gcc.gnu.org> wrote:
> >
> > Thank you for your interest in the CREL relocation format.
> >
> > On Tue, Apr 9, 2024 at 8:33 AM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> > > I like the general idea of more compact relocations, however what I don't get is
> > > what the overall goal is. If the goal is more compact object files, why don't we just
> > > add a (de)compress pass using a fast compression algorithm? CPU time is cheap
> > > today, and real compression easily gives 2-4x reduction of object file size, far more
> > > than you could achieve by just compressing relocations.
> >
> > My primary goal is to make relocatable files smaller (see
> > https://sourceware.org/pipermail/binutils/2024-March/133229.html for a
> > few use cases).
> > Smaller files benefit applications in several ways, including smaller
> > I/O amount and lower linker memory usage (for linkers like gold, lld,
> > and mold that map input files into memory).
> >
> > Generic data compression formats (like zlib or zstd) applied at the
> > filesystem level won't achieve this goal because they don't decrease
> > memory usage.
> > In addition, filesystem compression does not appear too popular.
> >
> > Interestingly, I measured a 5.9% size reduction in .o files even after
> > zstd compression when comparing two Clang builds with and without
> > CREL.
> >
> >     % ruby -e 'require "zstd-ruby"; un=com=0;
> > Dir.glob("/tmp/out/s2-custom0/**/*.o").each{|f| x =
> > File.open(f,"rb"){|h|h.read}; un+=x.size; com+=Zstd.compress(x).size};
> > puts "uncompressed: #{un}\ncompressed: #{com}"'
> >     uncompressed: 136086784
> >     compressed: 37173381
> >
> >     % ruby -e 'require "zstd-ruby"; un=com=0;
> > Dir.glob("/tmp/out/s2-custom1/**/*.o").each{|f| x =
> > File.open(f,"rb"){|h|h.read}; un+=x.size; com+=Zstd.compress(x).size};
> > puts "uncompressed: #{un}\ncompressed: #{com}"'
> >     uncompressed: 111655952
> >     compressed: 34964421
> >
> >     1-111655952/136086784 ~= 18.0% (uncompressed)
> >     1-34964421/37173381 ~= 5.9%    (zstd)
> >
> > Another objective is to minimize the size of dynamic relocations.
> > Android achieves this through ld.lld --pack-dyn-relocs=android+relr,
> > which compacts RELA relocations in their packed format.
> > While effective, CREL offers a simpler approach that delivers even
> > greater size reductions.
>
> yeah, though android+relr has the advantage of having already shipped,
> so it's usable on just about every device that app developers still
> support (api 23 --
> https://android.googlesource.com/platform/bionic/+/master/android-changes-for-ndk-developers.md#relative-relocations-relr
> -- versus the effective minimum of api 21) :-)
>
> something new today wouldn't be as widely usable on Android for about
> a decade, so although we'd probably support it if others did, to be
> really interesting for Android -- the point where we'd implement it
> even if no-one else did -- it'd have to be 2x better (in space or
> time, and not sacrificing the other, because app developers are very
> conscious of both) rather than the little bit better that it actually
> is. given our long lead time [for app developers to be able to rely on
> ubiquity] and our "good enough" solution, i'm actually more interested
> in the "what if we re-thought _everything_?" end of the spectrum than
> small tweaks here.
>
> what does mach-o do here? (that is: "why doesn't Apple care?".)
>
> > > Alternatively, if we wanted to access and process ELF files without any decompression,
> > > we could define compact relocations as fixed-size entries. Using 64 bits for a compact
> > > RELA relocation gives a straightforward 4x compression. Out of range values could
> > > use the next entry to extend the ranges.
> >
> > 64 bits are quite large. CREL typically needs just one to three bytes
> > for one relocation.
> > How do you design a format that is generic enough to work with all
> > relocation types and symbol indexes?
> >
> > > So my main issue with the proposal is that it tries too hard to compress relocations.
> > > For example using offset compression for relocations, symbol indices and even addends
> > > seems to have little value: the signed offset means you lose one bit, and if out of range
> > > values are rare or not grouped together, offset encodings are actually less efficient.
> >
> > I actually use unsigned delta offset to save one bit but signed delta
> > symidx/addend.
> > I have analyzed how many bits are demanded by typical relocations.
> > Quote https://maskray.me/blog/2024-03-09-a-compact-relocation-format-for-elf#crel-relocation-format
> > :
> >
> >     Absolute symbol indexes allow one-byte encoding for symbols in the
> > range [0,128) and offer minor size advantage for static relocations
> > when the symbol table is sorted by usage frequency. Delta encoding, on
> > the other hand, might optimize for the scenario when the symbol table
> > presents locality: neighbor symbols are frequently mutually called.
> >
> >     Delta symbol index enables one-byte encoding for GOT/PLT dynamic
> > relocations when .got/.got.plt entries are ordered by symbol index.
> > For example, R_*_GLOB_DAT and R_*_JUMP_SLOT relocations can typically
> > be encoded with repeated 0x05 0x01 (when addend_bit==0 && shift==3,
> > offset++, symidx++). Delta encoding has a disvantage. It can partial
> > claim the optimization by arranging symbols in a "cold0 hot cold1"
> > pattern. In addition, delta symbol index enables one-byte encoding for
> > GOT/PLT dynamic relocations when .got/.got.plt entries are ordered by
> > symbol index.
> >
> >     In my experiments, absolute encoding with ULEB128 results in
> > slightly larger .o file sizes for both x86-64 and AArch64 builds.
> >
> > For a decoder that only supports in-reloc addends (recommended for
> > relocatable files), the C++ implementation is as simple as:
> >
> >   const auto hdr = decodeULEB128(p);
> >   const size_t count = hdr / 8, shift = hdr % 4;
> >   Elf_Addr offset = 0, addend = 0;
> >   uint32_t symidx = 0, type = 0;
> >   for (size_t i = 0; i != count; ++i) {
> >     const uint8_t b = *p++;
> >     offset += b >> 3;
> >     if (b >= 0x80) offset += (decodeULEB128(p) << 4) - 0x10;
> >     if (b & 1) symidx += decodeSLEB128(p);
> >     if (b & 2) type += decodeSLEB128(p);
> >     if (b & 4) addend += decodeSLEB128(p);
> >     rels[i] = {offset << shift, symidx, type, addend};
> >   }
> >
> > += for all of symidx/type/addend is for consistency, but the choice
> > turns out to be very good as well.
> >
> > > I don't get the discussion about relocation numbers on AArch64 - 4 or 5 bits would
> > > handle all frequently used relocations, so we'd just remap them to fit in the short
> > > encoding. Hence I don't see a need at all for a signed offset encoding.
> >
> > The common static relocation types are within [257,313] (before
> > R_AARCH64_PLT32).
> > Delta encoding allows ~all but the first relocation's type to be
> > encoded in a single byte.
> >
> > How do you design a compression scheme without baked-in knowledge (dictionary)?
> > We don't want the generic encoding scheme to hard code relocation type
> > range for each architecture.

PATH=/tmp/Rel/bin:$PATH
ld.lld @response.txt -z now --pack-dyn-relocs=relr -o clang.relr
ld.lld @response.txt -z now --pack-dyn-relocs=android+relr -o clang.relr+android
ld.lld @response.txt -z now --pack-dyn-relocs=relr -z crel -o clang.relr+crel
llvm-objcopy --dump-section .rela.dyn=reladyn clang.relr /dev/null
llvm-objcopy --dump-section .crel.dyn=creldyn clang.relr+crel /dev/null
llvm-objcopy --dump-section .rela.dyn=androiddyn clang.relr+android /dev/null
xz -fk reladyn androiddyn creldyn
zstd -fk reladyn androiddyn creldyn

% llvm-readelf -r clang.relr+crel | grep crel.dyn
Relocation section '.crel.dyn' at offset 0x5df318 contains 2907 entries:
% stat -c '%s %n' reladyn* androiddyn* creldyn*
69768 reladyn
4236 reladyn.xz
4449 reladyn.zst
4604 androiddyn
1484 androiddyn.xz
1481 androiddyn.zst
3980 creldyn
1344 creldyn.xz
1367 creldyn.zst
% xz --robot --list -vv androiddyn.xz | grep block
block   1       1       1       12      0       1448    4604    0.315
 CRC64   bb905bfac8a6ca11        12      --      1427    8454200
--lzma2=dict=8MiB

There are 2907 relocation entries that are not RELATIVE or JUMP_SLOT,
which can be compacted with either Android's packed relocation format
or CREL.
Notably, zstd/LZMA2 filtering on RELA outperforms Android's packed
relocation format, while CREL outperforms zstd/LZMA2:)
As a byte-oriented format, CREL relocations can be further compressed.
(If CREL were made more bit-oriented, it might become uncompressible.)

However, achieving double the compression ratio of Android's format is
unrealistic for byte-oriented schemes where each relocation requires
at least one byte.
A byte-oriented format needs at least a bit more than 2907 bytes,
exceeding half of the original 4604 bytes.

An interesting observation: ld.lld still uses RELA for JUMP_SLOT
relocations even with --pack-dyn-relocs=android+relr.
CREL handling JUMP_SLOT relocations potentially improves compression further.

llvm-objcopy --dump-section .crel.plt=crelplt clang.relr+crel /dev/null
llvm-objcopy --dump-section .rela.plt=relaplt clang.relr+android /dev/null
xz -fk relaplt crelplt
% stat -c '%s %n' *plt*
672 crelplt
116 crelplt.xz
7992 relaplt
628 relaplt.xz

In clang.relr+android, .rela.* sections consume 12596 bytes.
In clang.relr+crel, .crel.* sections consume 4652 bytes.

Since there is no dynamic loader implementation, I cannot comment on
the performance.
Intuitively CREL is simpler and should be faster.

> what does mach-o do here? (that is: "why doesn't Apple care?".)

Before chained fixups (WWDC2022), they use a state machine (think of
DWARF .debug_line) to set
segment_index/symbol_name/dylib_ordinal/offset/addend/type/etc
(https://github.com/apple-opensource/dyld/blob/e3f88907bebb8421f50f0943595f6874de70ebe0/src/ImageLoaderMachOCompressed.cpp#L1347).
The bind opcodes look like:

...
BIND_OPCODE_SET_SYMBOL_TRAILING_FLAGS_IMM
flags
symbol_name
(BIND_OPCODE_SET_TYPE_IMM|type)
BIND_OPCODE_SET_DYLIB_ORDINAL_IMM
dylib_ordinal
(BIND_OPCODE_SET_SEGMENT_AND_OFFSET_ULEB | segment_idx)
offset
BIND_OPCODE_DO_BIND_ULEB_TIMES_SKIPPING_ULEB
opcode
consecutive_count
data
BIND_OPCODE_SET_ADDEND_SLEB
addend
BIND_OPCODE_DO_BIND_ADD_ADDR_ULEB
delta_addend
...

While this model is powerful, its complexity might not be ideal for ELF systems.
They encode symbol names/flags in place, dylib_ordinal (two-level
namespace), and segment index, complexity that ELF systems don't have.
Having opcodes in the stream makes their encoding likely less effecient.

---

Chained fixups are a new system.
They encode just the first relocations along with the relocation type
information in each page.
Subsequent relocations are encoded using the locations to be modified.

* It necessitates kernel understanding of dynamic linking data and
communication with the dynamic loader.
* The encoding repurposes the ELF "implicit addend" for page offsets,
limiting its use for larger addends. Some addend bits are stolen to
encode a singly linked list (dyld_chained_ptr_64_bind::next
dyld_chained_ptr_64_rebase::next).
  + For the rebase operation (ELF relative relocation), 36 least
significant bits and 8 most significant bits are encoded. This is
insufficient if we need a larger addend.
  + The page offset is muliple of 4, further losing generality.

This approach may not be suitable for 32-bit systems or those
requiring support for numerous relocation types, like ELF.
Additionally, the current encoding of initial relocations and the
import table seems less space-efficient. A CREL-like design could
potentially improve space utilization.



-- 
宋方睿

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2024-04-18  4:32 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-24  5:50 CREL dynamic relocations Fangrui Song
2024-03-25 11:52 ` Florian Weimer
2024-03-25 18:51   ` Fangrui Song
  -- strict thread matches above, loose matches on Subject: below --
2024-04-09 15:32 Wilco Dijkstra
2024-04-11  2:41 ` Fangrui Song
2024-04-12 16:18   ` enh
2024-04-18  4:31     ` Fangrui Song

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).