From: Fangrui Song <maskray@google.com>
To: enh <enh@google.com>
Cc: Fangrui Song <maskray@gcc.gnu.org>,
	Wilco Dijkstra <Wilco.Dijkstra@arm.com>,
	GNU C Library <libc-alpha@sourceware.org>,
	Florian Weimer <fweimer@redhat.com>
Subject: Re: CREL dynamic relocations
Date: Wed, 17 Apr 2024 21:31:49 -0700	[thread overview]
Message-ID: <CAFP8O3LoUMJzf6ZfKKWKaBKsWaC8bkzDjFcbKS05tFQMT6iCaA@mail.gmail.com> (raw)
In-Reply-To: <CAJgzZoox6GqTUwdjD0mqXs=A=O8YSm_qC_hzhtf10_egmAmrDw@mail.gmail.com>

On Fri, Apr 12, 2024 at 9:19 AM enh <enh@google.com> wrote:
>
> On Wed, Apr 10, 2024 at 7:41 PM Fangrui Song <maskray@gcc.gnu.org> wrote:
> >
> > Thank you for your interest in the CREL relocation format.
> >
> > On Tue, Apr 9, 2024 at 8:33 AM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> > > I like the general idea of more compact relocations, however what I don't get is
> > > what the overall goal is. If the goal is more compact object files, why don't we just
> > > add a (de)compress pass using a fast compression algorithm? CPU time is cheap
> > > today, and real compression easily gives 2-4x reduction of object file size, far more
> > > than you could achieve by just compressing relocations.
> >
> > My primary goal is to make relocatable files smaller (see
> > https://sourceware.org/pipermail/binutils/2024-March/133229.html for a
> > few use cases).
> > Smaller files benefit applications in several ways, including less
> > I/O and lower linker memory usage (for linkers like gold, lld, and
> > mold that map input files into memory).
> >
> > Generic data compression formats (like zlib or zstd) applied at the
> > filesystem level won't achieve this goal because they don't decrease
> > memory usage.
> > In addition, filesystem compression does not appear to be widely used.
> >
> > Interestingly, I measured a 5.9% size reduction in .o files even after
> > zstd compression when comparing two Clang builds with and without
> > CREL.
> >
> >     % ruby -e 'require "zstd-ruby"; un=com=0;
> >         Dir.glob("/tmp/out/s2-custom0/**/*.o").each{|f|
> >           x = File.open(f,"rb"){|h|h.read}; un+=x.size; com+=Zstd.compress(x).size};
> >         puts "uncompressed: #{un}\ncompressed: #{com}"'
> >     uncompressed: 136086784
> >     compressed: 37173381
> >
> >     % ruby -e 'require "zstd-ruby"; un=com=0;
> >         Dir.glob("/tmp/out/s2-custom1/**/*.o").each{|f|
> >           x = File.open(f,"rb"){|h|h.read}; un+=x.size; com+=Zstd.compress(x).size};
> >         puts "uncompressed: #{un}\ncompressed: #{com}"'
> >     uncompressed: 111655952
> >     compressed: 34964421
> >
> >     1-111655952/136086784 ~= 18.0% (uncompressed)
> >     1-34964421/37173381 ~= 5.9%    (zstd)
> >
> > Another objective is to minimize the size of dynamic relocations.
> > Android achieves this through ld.lld --pack-dyn-relocs=android+relr,
> > which compacts RELA relocations into Android's packed format.
> > While effective, CREL offers a simpler approach that delivers even
> > greater size reductions.
>
> yeah, though android+relr has the advantage of having already shipped,
> so it's usable on just about every device that app developers still
> support (api 23 --
> https://android.googlesource.com/platform/bionic/+/master/android-changes-for-ndk-developers.md#relative-relocations-relr
> -- versus the effective minimum of api 21) :-)
>
> something new today wouldn't be as widely usable on Android for about
> a decade, so although we'd probably support it if others did, to be
> really interesting for Android -- the point where we'd implement it
> even if no-one else did -- it'd have to be 2x better (in space or
> time, and not sacrificing the other, because app developers are very
> conscious of both) rather than the little bit better that it actually
> is. given our long lead time [for app developers to be able to rely on
> ubiquity] and our "good enough" solution, i'm actually more interested
> in the "what if we re-thought _everything_?" end of the spectrum than
> small tweaks here.
>
> what does mach-o do here? (that is: "why doesn't Apple care?".)
>
> > > Alternatively, if we wanted to access and process ELF files without any decompression,
> > > we could define compact relocations as fixed-size entries. Using 64 bits for a compact
> > > RELA relocation gives a straightforward 4x compression. Out of range values could
> > > use the next entry to extend the ranges.
> >
> > 64 bits are quite large. CREL typically needs just one to three bytes
> > for one relocation.
> > How do you design a format that is generic enough to work with all
> > relocation types and symbol indexes?
> >
> > > So my main issue with the proposal is that it tries too hard to compress relocations.
> > > For example using offset compression for relocations, symbol indices and even addends
> > > seems to have little value: the signed offset means you lose one bit, and if out of range
> > > values are rare or not grouped together, offset encodings are actually less efficient.
> >
> > I actually use an unsigned delta for the offset to save one bit, but
> > signed deltas for symidx/addend.
> > I have analyzed how many bits are demanded by typical relocations.
> > Quote https://maskray.me/blog/2024-03-09-a-compact-relocation-format-for-elf#crel-relocation-format
> > :
> >
> >     Absolute symbol indexes allow one-byte encoding for symbols in the
> > range [0,128) and offer a minor size advantage for static relocations
> > when the symbol table is sorted by usage frequency. Delta encoding, on
> > the other hand, might optimize for the scenario where the symbol table
> > exhibits locality: neighboring symbols frequently call each other.
> >
> >     Delta symbol index enables one-byte encoding for GOT/PLT dynamic
> > relocations when .got/.got.plt entries are ordered by symbol index.
> > For example, R_*_GLOB_DAT and R_*_JUMP_SLOT relocations can typically
> > be encoded with repeated 0x05 0x01 (when addend_bit==0 && shift==3,
> > offset++, symidx++). Delta encoding has a disadvantage: it can only
> > partially claim the frequency-sorting optimization above, by arranging
> > symbols in a "cold0 hot cold1" pattern.
> >
> >     In my experiments, absolute encoding with ULEB128 results in
> > slightly larger .o file sizes for both x86-64 and AArch64 builds.
> >
> > For a decoder that only supports in-reloc addends (recommended for
> > relocatable files), the C++ implementation is as simple as:
> >
> >   const auto hdr = decodeULEB128(p);
> >   const size_t count = hdr / 8, shift = hdr % 4;  // header: relocation count and offset shift
> >   Elf_Addr offset = 0, addend = 0;
> >   uint32_t symidx = 0, type = 0;
> >   for (size_t i = 0; i != count; ++i) {
> >     const uint8_t b = *p++;  // per-relocation flag byte
> >     offset += b >> 3;        // bits 3-6: low bits of the offset delta; bit 7: continuation
> >     if (b >= 0x80) offset += (decodeULEB128(p) << 4) - 0x10;  // rest of the offset delta
> >     if (b & 1) symidx += decodeSLEB128(p);  // delta symbol index
> >     if (b & 2) type += decodeSLEB128(p);    // delta relocation type
> >     if (b & 4) addend += decodeSLEB128(p);  // delta addend
> >     rels[i] = {offset << shift, symidx, type, addend};
> >   }
> >
> > Using += for all of symidx/type/addend is for consistency, and the
> > choice turns out to work very well in practice.
> >
> > > I don't get the discussion about relocation numbers on AArch64 - 4 or 5 bits would
> > > handle all frequently used relocations, so we'd just remap them to fit in the short
> > > encoding. Hence I don't see a need at all for a signed offset encoding.
> >
> > The common static relocation types are within [257,313] (before
> > R_AARCH64_PLT32).
> > Delta encoding allows ~all but the first relocation's type to be
> > encoded in a single byte.
> >
> > How do you design a compression scheme without baked-in knowledge (a
> > dictionary)? We don't want a generic encoding scheme to hard-code a
> > relocation type range for each architecture.

Here is a size comparison of the dynamic relocations of a clang
executable linked with relr, android+relr, and relr plus CREL:

PATH=/tmp/Rel/bin:$PATH
ld.lld @response.txt -z now --pack-dyn-relocs=relr -o clang.relr
ld.lld @response.txt -z now --pack-dyn-relocs=android+relr -o clang.relr+android
ld.lld @response.txt -z now --pack-dyn-relocs=relr -z crel -o clang.relr+crel
llvm-objcopy --dump-section .rela.dyn=reladyn clang.relr /dev/null
llvm-objcopy --dump-section .crel.dyn=creldyn clang.relr+crel /dev/null
llvm-objcopy --dump-section .rela.dyn=androiddyn clang.relr+android /dev/null
xz -fk reladyn androiddyn creldyn
zstd -fk reladyn androiddyn creldyn

% llvm-readelf -r clang.relr+crel | grep crel.dyn
Relocation section '.crel.dyn' at offset 0x5df318 contains 2907 entries:
% stat -c '%s %n' reladyn* androiddyn* creldyn*
69768 reladyn
4236 reladyn.xz
4449 reladyn.zst
4604 androiddyn
1484 androiddyn.xz
1481 androiddyn.zst
3980 creldyn
1344 creldyn.xz
1367 creldyn.zst
% xz --robot --list -vv androiddyn.xz | grep block
block   1       1       1       12      0       1448    4604    0.315
 CRC64   bb905bfac8a6ca11        12      --      1427    8454200
--lzma2=dict=8MiB

There are 2907 relocation entries that are neither RELATIVE nor
JUMP_SLOT; these can be compacted with either Android's packed
relocation format or CREL.
Notably, merely compressing the RELA section with zstd or LZMA2
(4449/4236 bytes) already beats Android's packed format (4604 bytes),
while uncompressed CREL (3980 bytes) beats both :)
As a byte-oriented format, CREL can be compressed further.
(If CREL were made more bit-oriented, it might become nearly
incompressible.)

However, achieving double the compression ratio of Android's format is
unrealistic for a byte-oriented scheme in which each relocation takes
at least one byte: with 2907 entries, such a format needs more than
2907 bytes (CREL averages about 1.4 bytes per entry here), which
already exceeds half of the 4604 bytes of Android's format.

An interesting observation: ld.lld still emits RELA for JUMP_SLOT
relocations even with --pack-dyn-relocs=android+relr.
Having CREL handle JUMP_SLOT relocations improves the savings further:

llvm-objcopy --dump-section .crel.plt=crelplt clang.relr+crel /dev/null
llvm-objcopy --dump-section .rela.plt=relaplt clang.relr+android /dev/null
xz -fk relaplt crelplt
% stat -c '%s %n' *plt*
672 crelplt
116 crelplt.xz
7992 relaplt
628 relaplt.xz

In clang.relr+android, .rela.* sections consume 12596 bytes.
In clang.relr+crel, .crel.* sections consume 4652 bytes.

Since there is no dynamic loader implementation yet, I cannot comment
on performance. Intuitively, CREL is simpler and should be faster.
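
To illustrate the "simpler" claim, here is a rough sketch (my own
illustration, not an actual glibc or bionic patch) of how a loader
could decode and apply CREL in a single pass. It reuses the decoding
logic quoted above; decodeULEB128/decodeSLEB128 are assumed to advance
the cursor, and apply_one_reloc stands in for the per-architecture
relocation dispatch:

  #include <cstddef>
  #include <cstdint>

  // Assumed helpers, not real glibc/bionic APIs: the decoders advance the
  // cursor past the encoded value; apply_one_reloc resolves the symbol and
  // patches base + offset.
  uint64_t decodeULEB128(const uint8_t *&p);
  int64_t decodeSLEB128(const uint8_t *&p);
  void apply_one_reloc(uint8_t *base, uint64_t offset, uint32_t type,
                       uint32_t symidx, int64_t addend);

  // Sketch: apply CREL dynamic relocations in one pass, without
  // materializing an Elf64_Rela array.
  void apply_crel(const uint8_t *p, uint8_t *base) {
    const uint64_t hdr = decodeULEB128(p);
    const size_t count = hdr / 8, shift = hdr % 4;
    uint64_t offset = 0;
    int64_t addend = 0;
    uint32_t symidx = 0, type = 0;
    for (size_t i = 0; i != count; ++i) {
      const uint8_t b = *p++;
      offset += b >> 3;
      if (b >= 0x80) offset += (decodeULEB128(p) << 4) - 0x10;
      if (b & 1) symidx += decodeSLEB128(p);
      if (b & 2) type += decodeSLEB128(p);
      if (b & 4) addend += decodeSLEB128(p);
      apply_one_reloc(base, offset << shift, type, symidx, addend);
    }
  }

The loop touches every input byte once and keeps no side tables, which
is the basis for my intuition above.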

> what does mach-o do here? (that is: "why doesn't Apple care?".)

Before chained fixups (WWDC 2022), Mach-O used a state machine (think
of DWARF .debug_line) to set
segment_index/symbol_name/dylib_ordinal/offset/addend/type/etc
(https://github.com/apple-opensource/dyld/blob/e3f88907bebb8421f50f0943595f6874de70ebe0/src/ImageLoaderMachOCompressed.cpp#L1347).
The bind opcodes look like:

...
BIND_OPCODE_SET_SYMBOL_TRAILING_FLAGS_IMM
flags
symbol_name
(BIND_OPCODE_SET_TYPE_IMM|type)
BIND_OPCODE_SET_DYLIB_ORDINAL_IMM
dylib_ordinal
(BIND_OPCODE_SET_SEGMENT_AND_OFFSET_ULEB | segment_idx)
offset
BIND_OPCODE_DO_BIND_ULEB_TIMES_SKIPPING_ULEB
opcode
consecutive_count
data
BIND_OPCODE_SET_ADDEND_SLEB
addend
BIND_OPCODE_DO_BIND_ADD_ADDR_ULEB
delta_addend
...

While this model is powerful, its complexity might not be ideal for
ELF systems. The opcodes encode symbol names and flags in place, a
dylib ordinal (for the two-level namespace), and a segment index,
complexity that ELF does not have. Carrying opcodes in the stream also
likely makes the encoding less space-efficient.
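
To make the state-machine nature concrete, here is a rough sketch of
an interpreter for a subset of the bind opcodes, modeled on my reading
of the dyld source linked above. The opcode values are quoted from
memory of mach-o/loader.h, doBind is a stub, and several opcodes are
omitted, so treat it as an illustration rather than a reference
decoder:

  #include <cstddef>
  #include <cstdint>
  #include <cstdio>

  // LEB128 readers; the cursor is advanced past the encoded value.
  static uint64_t readULEB128(const uint8_t *&p) {
    uint64_t v = 0; unsigned shift = 0; uint8_t b;
    do { b = *p++; v |= uint64_t(b & 0x7f) << shift; shift += 7; } while (b & 0x80);
    return v;
  }
  static int64_t readSLEB128(const uint8_t *&p) {
    int64_t v = 0; unsigned shift = 0; uint8_t b;
    do { b = *p++; v |= int64_t(b & 0x7f) << shift; shift += 7; } while (b & 0x80);
    if (shift < 64 && (b & 0x40)) v |= -(int64_t(1) << shift);
    return v;
  }

  // Persistent state mutated by the opcodes; the DO_BIND* opcodes apply it.
  struct BindState {
    int seg = 0; uint64_t offset = 0; const char *sym = "";
    int ordinal = 0; uint8_t type = 0; int64_t addend = 0;
  };

  static void doBind(const BindState &s) {  // stub; real dyld resolves sym and writes a pointer
    std::printf("bind seg=%d off=%#llx sym=%s ord=%d type=%d addend=%lld\n",
                s.seg, (unsigned long long)s.offset, s.sym, s.ordinal,
                s.type, (long long)s.addend);
  }

  void interpretBindOpcodes(const uint8_t *p, const uint8_t *end,
                            size_t ptrSize = 8) {
    BindState s;
    while (p < end) {
      const uint8_t byte = *p++;
      const uint8_t opcode = byte & 0xf0, imm = byte & 0x0f;
      switch (opcode) {
      case 0x00: return;                           // BIND_OPCODE_DONE
      case 0x10: s.ordinal = imm; break;           // SET_DYLIB_ORDINAL_IMM
      case 0x40:                                   // SET_SYMBOL_TRAILING_FLAGS_IMM
        s.sym = reinterpret_cast<const char *>(p);
        while (*p++) {}                            // skip the NUL-terminated name
        break;
      case 0x50: s.type = imm; break;              // SET_TYPE_IMM
      case 0x60: s.addend = readSLEB128(p); break; // SET_ADDEND_SLEB
      case 0x70:                                   // SET_SEGMENT_AND_OFFSET_ULEB
        s.seg = imm; s.offset = readULEB128(p); break;
      case 0x90:                                   // DO_BIND
        doBind(s); s.offset += ptrSize; break;
      case 0xa0:                                   // DO_BIND_ADD_ADDR_ULEB
        doBind(s); s.offset += readULEB128(p) + ptrSize; break;
      case 0xc0: {                                 // DO_BIND_ULEB_TIMES_SKIPPING_ULEB
        const uint64_t count = readULEB128(p), skip = readULEB128(p);
        for (uint64_t i = 0; i < count; ++i) { doBind(s); s.offset += skip + ptrSize; }
        break;
      }
      default: break;                              // other opcodes omitted in this sketch
      }
    }
  }

A CREL byte stream, by contrast, carries no opcodes, names, ordinals,
or segment references; every byte contributes to one of the four
delta fields.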

---

Chained fixups are a newer scheme.
For each page, only the location of the first fixup and the pointer
format are recorded; subsequent fixups are found by following links
stored in the locations to be modified.

* It necessitates kernel understanding of dynamic linking data and
communication with the dynamic loader.
* The encoding repurposes what ELF would treat as the implicit addend,
limiting its use for larger addends. Some addend bits are stolen to
encode a singly linked list (dyld_chained_ptr_64_bind::next and
dyld_chained_ptr_64_rebase::next); see the struct sketch below.
  + For the rebase operation (the ELF relative relocation), only the 36
least significant bits and the 8 most significant bits of the target
are encoded. This is insufficient if we need a larger addend.
  + The page offset is a multiple of 4, further losing generality.

This approach may not be suitable for 32-bit systems, or for formats
like ELF that must support numerous relocation types.
Additionally, the current encoding of the initial relocations and the
import table seems less space-efficient; a CREL-like design could
potentially improve space utilization.
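
For reference, here is a sketch of the 64-bit in-place encodings
behind the points above, as I recall them from Apple's
mach-o/fixup-chains.h (field widths are for the DYLD_CHAINED_PTR_64
pointer format and differ for other formats):

  #include <cstdint>

  // Rebase entry (analogous to an ELF relative relocation): only the low 36
  // bits and the top 8 bits of the target survive; next links to the
  // following fixup within the page, in 4-byte strides.
  struct dyld_chained_ptr_64_rebase {
    uint64_t target   : 36,  // low 36 bits of the unslid target address
             high8    :  8,  // top 8 bits, reapplied after sliding
             reserved :  7,
             next     : 12,  // delta to the next fixup, in 4-byte units
             bind     :  1;  // 0 = rebase
  };

  // Bind entry: the addend is squeezed into 8 bits, so larger addends need
  // out-of-line handling.
  struct dyld_chained_ptr_64_bind {
    uint64_t ordinal  : 24,  // index into the import table
             addend   :  8,
             reserved : 19,
             next     : 12,  // delta to the next fixup, in 4-byte units
             bind     :  1;  // 1 = bind
  };

The next field is the stolen "singly linked list", and target:36 plus
high8:8 is the 36+8 split mentioned above for rebase entries.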



-- 
宋方睿
